Reduce LLM costs by up to 50%

Edgee’s token compression runs at the edge before every request reaches LLM providers, automatically reducing prompt size by up to 50% while preserving semantic meaning and output quality. This is particularly effective for:
  • RAG pipelines with large document contexts
  • Long conversation histories in multi-turn agents
  • Verbose system instructions and formatting
  • Document analysis and summarization tasks

How It Works

Token compression happens automatically on every request through a four-step process:
1. Semantic Analysis: Analyze the prompt structure to identify redundant context, verbose formatting, and compressible sections without losing critical information.

2. Context Optimization: Compress repeated context (common in RAG), condense verbose formatting, and remove unnecessary whitespace while maintaining semantic relationships.

3. Instruction Preservation: Preserve critical instructions, few-shot examples, and task-specific requirements. System prompts and user intent remain intact.

4. Quality Verification: Verify that the compressed prompt maintains semantic equivalence to the original. If quality checks fail, the original prompt is used.
Compression is most effective for prompts with repeated context (RAG), long system instructions, or verbose multi-turn histories. Simple queries may see minimal compression.
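The four-step pipeline above can be sketched roughly as follows. This is an illustrative toy, not Edgee's actual compressor: the function name and the whitespace-only optimization are assumptions for demonstration, but it shows the shape of the flow, including the quality-check fallback to the original prompt.

```typescript
// Toy sketch of the four-step flow (analyze, optimize, preserve, verify).
// Edgee's real compressor is semantic; this only condenses formatting.
function compressPrompt(prompt: string): string {
  // Steps 1-2: condense verbose formatting and redundant whitespace
  const optimized = prompt
    .replace(/[ \t]+/g, " ")     // collapse runs of spaces/tabs
    .replace(/\n{3,}/g, "\n\n")  // collapse runs of blank lines
    .trim();

  // Step 3: instruction preservation is implicit here, since only
  // whitespace is touched and the wording stays intact.

  // Step 4: quality verification; if the result is not actually
  // smaller, fall back to the original prompt.
  return optimized.length < prompt.length ? optimized : prompt;
}

const verbose = "System:   be concise.\n\n\n\nUser:    summarize   this.";
console.log(compressPrompt(verbose));
// "System: be concise.\n\nUser: summarize this."
```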

Enabling Token Compression

Token compression can be enabled in three ways, giving you flexibility to control compression at the request, API key, or organization level:

1. Per Request (SDK)

Enable compression for specific requests using the SDK:
const response = await edgee.send({
  model: 'gpt-4o',
  input: {
    messages: [
      { role: 'user', content: 'Your prompt here' }
    ]
  },
  enable_compression: true,
  compression_rate: 0.8 // Target 80% compression (optional)
});

2. Per API Key (Console)

Enable compression for specific API keys in your organization settings. This is useful when you want different compression settings for different applications or environments.
In the Tools section of your console:
  1. Toggle Enable token compression on
  2. Set your target Compression rate (0.7-0.9, default 0.75)
  3. Under Scope, select Apply to specific API keys
  4. Choose which API keys should use compression

3. Organization-Wide (Console)

Enable compression for all requests across your entire organization. This is the recommended setting for most users to maximize savings automatically.
In the Tools section of your console:
  1. Toggle Enable token compression on
  2. Set your target Compression rate (0.7-0.9, default 0.75)
  3. Under Scope, select Apply to all org requests
  4. All API keys will now use compression by default
Compression rate controls how aggressively Edgee compresses prompts. A higher rate (e.g., 0.9) attempts more compression but may be less effective, while a lower rate (e.g., 0.7) is more conservative. The default of 0.75 provides a good balance for most use cases.
SDK-level configuration takes precedence over console settings. If you enable compression in your code with enable_compression: true, it will override the console configuration for that specific request.
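The precedence rule can be illustrated with a small sketch. The types and function below are hypothetical, not part of the Edgee SDK; they only encode the order described above: an explicit request-level setting wins over the per-key console setting, which wins over the org-wide default.

```typescript
// Hypothetical types for illustration only.
interface CompressionConfig {
  enabled: boolean;
  rate: number;
}

function resolveCompression(
  sdkEnabled: boolean | undefined,      // enable_compression from the request, if set
  keyConfig: CompressionConfig | null,  // per-API-key console setting, if any
  orgConfig: CompressionConfig,         // organization-wide console setting
): boolean {
  if (sdkEnabled !== undefined) return sdkEnabled; // SDK overrides console
  if (keyConfig !== null) return keyConfig.enabled;
  return orgConfig.enabled;
}

const org = { enabled: true, rate: 0.75 };
console.log(resolveCompression(false, { enabled: true, rate: 0.8 }, org)); // false
console.log(resolveCompression(undefined, null, org)); // true
```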

When It Works Best

Token compression delivers the highest savings for these common use cases:

RAG Pipelines

40-50% reduction. Large document contexts with redundant information compress effectively. Ideal for Q&A systems, knowledge bases, and semantic search.

Long Contexts

30-45% reduction. Lengthy conversation histories, documentation, or background information. Common in chatbots and assistant applications.

Document Analysis

35-50% reduction. Summarization, extraction, and analysis of long documents. Verbose source material compresses well.

Multi-Turn Agents

25-40% reduction. Conversational agents with growing context windows. Savings increase with conversation length.
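For a rough planning number, the reduction ranges above can be turned into a simple estimator. The range values are the ones quoted in this section; the helper itself is an illustration, and actual compression varies per prompt.

```typescript
// Reduction ranges quoted in this section, as [min, max] fractions.
const reductionRanges: Record<string, [number, number]> = {
  rag: [0.40, 0.50],
  longContext: [0.30, 0.45],
  documentAnalysis: [0.35, 0.50],
  multiTurnAgent: [0.25, 0.40],
};

// Returns the [low, high] estimate of tokens saved for a prompt size.
function estimateSavedTokens(useCase: string, promptTokens: number): [number, number] {
  const [lo, hi] = reductionRanges[useCase];
  return [Math.round(promptTokens * lo), Math.round(promptTokens * hi)];
}

console.log(estimateSavedTokens("rag", 2000)); // [800, 1000]
```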

Code Example

Every response includes compression metrics so you can track your savings:
import Edgee from 'edgee';

const edgee = new Edgee("your-api-key");

// Example: RAG Q&A with large context
const documents = [
  "Long document content here...",
  "Another document with context...",
  "More relevant information..."
];

const response = await edgee.send({
  model: 'gpt-4o',
  input: `Answer the question based on these documents:\n\n${documents.join('\n\n')}\n\nQuestion: What is the main topic?`,
  enable_compression: true, // Enable compression for this request
  compression_rate: 0.8, // Target compression ratio (0-1, e.g., 0.8 = 80%)
});

console.log(response.text);

// Compression metrics
if (response.compression) {
  console.log(`Original tokens: ${response.compression.input_tokens}`);
  console.log(`Compressed tokens: ${response.usage.prompt_tokens}`);
  console.log(`Tokens saved: ${response.compression.saved_tokens}`);
  console.log(`Compression rate: ${(response.compression.rate * 100).toFixed(1)}%`);
}
Example output:
Original tokens: 2,450
Compressed tokens: 1,225
Tokens saved: 1,225
Compression rate: 50.0%

Real-World Savings

Here’s what token compression means for your monthly AI bill:
| Use Case | Monthly Requests | Without Edgee | With Edgee (50% compression) | Monthly Savings |
| --- | --- | --- | --- | --- |
| RAG Q&A (GPT-4o) | 100,000 @ 2,000 tokens | $3,000 | $1,500 | $1,500 |
| Document Analysis (Claude 3.5) | 50,000 @ 4,000 tokens | $1,800 | $900 | $900 |
| Chatbot (GPT-4o-mini) | 500,000 @ 500 tokens | $375 | $188 | $187 |
| Multi-turn Agent (GPT-4o) | 200,000 @ 1,000 tokens | $3,000 | $1,500 | $1,500 |
Savings calculations use list pricing for GPT-4o ($5/1M input tokens), Claude 3.5 Sonnet ($3/1M input tokens), and GPT-4o-mini ($0.15/1M input tokens). Actual compression ratios vary by use case.
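To estimate your own numbers, the savings arithmetic is simple: baseline input-token cost times the achieved compression rate. The helper below is a generic sketch using input-token list pricing only, so its results may differ from the table above.

```typescript
// Monthly savings from input-token compression alone.
function monthlySavings(
  requests: number,
  inputTokensPerRequest: number,
  pricePerMillionTokens: number, // provider list price, USD per 1M input tokens
  compressionRate: number,       // fraction of input tokens removed, e.g. 0.5
): number {
  const baselineCost =
    (requests * inputTokensPerRequest / 1_000_000) * pricePerMillionTokens;
  return baselineCost * compressionRate;
}

// 100,000 RAG requests/month at 2,000 input tokens on GPT-4o ($5/1M),
// with 50% compression:
console.log(monthlySavings(100_000, 2_000, 5, 0.5)); // 500
```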

Best Practices

Structure prompts for compression
  • Structure RAG contexts with clear sections
  • Use consistent formatting in document chunks
  • Avoid excessive whitespace in system prompts
  • Group similar information together

Track your savings
  • Monitor compression.saved_tokens across requests
  • Calculate cumulative savings weekly or monthly
  • Use observability tools to identify high-compression opportunities
  • Compare costs across different use cases

Enable by default
  • Enable compression by default for all requests
  • Once enabled, compression happens automatically on every request
  • Track compression.rate to understand effectiveness
  • Use response metrics to optimize prompt design

Combine with routing
  • Use automatic model selection for additional savings
  • Route to cheaper models when appropriate
  • Compression + routing can reduce costs by 60-70% total
  • Monitor both compression and routing savings
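The 60-70% figure follows from the fact that compression and routing savings stack multiplicatively: each applies to the cost left over after the other. A one-line sketch:

```typescript
// Combined cost reduction when two independent savings multiply:
// remaining cost = (1 - compression) * (1 - routing).
function combinedReduction(compression: number, routing: number): number {
  return 1 - (1 - compression) * (1 - routing);
}

// 50% compression plus ~30% routing savings lands in the 60-70% band:
console.log(combinedReduction(0.5, 0.3)); // ~0.65, i.e. 65% total reduction
```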

Response Fields

Every Edgee response includes detailed compression metrics:
// Usage information
response.usage.prompt_tokens          // Compressed token count (billed)
response.usage.completion_tokens      // Output tokens (unchanged)
response.usage.total_tokens           // Total for billing calculation

// Compression information (when applied)
response.compression.input_tokens     // Original token count (before compression)
response.compression.saved_tokens     // Tokens saved by compression
response.compression.rate             // Compression rate (0-1, e.g., 0.61 = 61%)
Use these fields to:
  • Track savings in real-time
  • Build cost dashboards and budgeting tools
  • Identify high-value compression opportunities
  • Optimize prompt design for maximum compression
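As a starting point for a cost dashboard, here is a minimal aggregator over those fields. The interfaces are hand-written stand-ins matching the field names on this page, and the response objects are fabricated examples, not real API output.

```typescript
// Stand-in types mirroring the response fields documented above.
interface EdgeeUsage { prompt_tokens: number; completion_tokens: number; total_tokens: number; }
interface EdgeeCompression { input_tokens: number; saved_tokens: number; rate: number; }
interface EdgeeResponse { usage: EdgeeUsage; compression?: EdgeeCompression; }

// Sum tokens saved across a batch of responses; responses where
// compression was not applied simply contribute zero.
function totalSavedTokens(responses: EdgeeResponse[]): number {
  return responses.reduce((sum, r) => sum + (r.compression?.saved_tokens ?? 0), 0);
}

const history: EdgeeResponse[] = [
  { usage: { prompt_tokens: 1225, completion_tokens: 200, total_tokens: 1425 },
    compression: { input_tokens: 2450, saved_tokens: 1225, rate: 0.5 } },
  { usage: { prompt_tokens: 300, completion_tokens: 50, total_tokens: 350 } }, // no compression
];

console.log(totalSavedTokens(history)); // 1225
```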

What’s Next