
- RAG pipelines with large document contexts
- Long conversation histories in multi-turn agents
- Verbose system instructions and formatting
- Document analysis and summarization tasks
Looking for lossless compression for Claude Code? See Claude Token Compression (Beta).
How It Works
Agentic token compression uses multiple strategies that work together on every request. The core semantic compression strategy follows a four-step process; other strategies (tool compression, smart crusher, cache aligner) work in parallel, some fully lossless.
Semantic Analysis
Analyze the prompt structure to identify redundant context, verbose formatting, and compressible sections without losing critical information.
Context Optimization
Compress repeated context (common in RAG), condense verbose formatting, and remove unnecessary elements while maintaining semantic relationships.
Instruction Preservation
Preserve critical instructions, few-shot examples, and task-specific requirements. System prompts and user intent remain intact.
Compression is most effective for prompts with repeated context (RAG), long system instructions, or verbose multi-turn histories. Simple queries may see minimal compression.
Understanding compression ratio
The compression ratio (sometimes called the compression rate in APIs) is compressed size ÷ original size: how large the compressed prompt is relative to the original.
- 0.9 (Light) = compressed prompt is 90% of the original length → ~10% fewer tokens
- 0.8 (Medium) = compressed prompt is 80% of the original → ~20% fewer tokens
- 0.7 (Strong) = compressed prompt is 70% of the original → ~30% fewer tokens (more aggressive)
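In other words, the token savings follow directly from the ratio:

```python
def tokens_after_compression(original_tokens: int, ratio: float) -> int:
    """compression ratio = compressed size / original size."""
    return round(original_tokens * ratio)

# Light (0.9): ~10% fewer tokens
assert tokens_after_compression(2000, 0.9) == 1800
# Strong (0.7): ~30% fewer tokens
assert tokens_after_compression(2000, 0.7) == 1400
```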
Semantic preservation and BERT score
To avoid changing the meaning of the prompt, we compare the compressed text to the original using BERT score (F1). It measures how semantically similar the two texts are on a scale of 0–1 (0%–100%).
- Semantic preservation threshold (0–100%) is the minimum similarity we require. If the BERT score is below this threshold, we do not use the compressed prompt—we send the original instead, so quality is preserved.
- In the console you choose Off (no check), Ultra Safe (0.95), Safe (0.85), or Edgy (0.75). Off = we always use the compressed prompt when compression runs; higher values = we only use the compressed prompt when it is very similar to the original; otherwise we fall back to the original.
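The fallback behavior amounts to a simple guard. In this sketch, `similarity` stands in for a BERT score F1 computed elsewhere (for instance with the `bert-score` package); the function and constant names are illustrative, not Edgee's internals:

```python
# Console presets mapped to minimum-similarity thresholds (illustrative names).
THRESHOLDS = {"off": 0.0, "ultra_safe": 0.95, "safe": 0.85, "edgy": 0.75}

def choose_prompt(original: str, compressed: str,
                  similarity: float, mode: str = "safe") -> str:
    """Use the compressed prompt only if it is semantically close enough
    to the original; otherwise fall back to the original.
    `similarity` is the BERT score F1 in [0, 1], computed elsewhere."""
    if similarity >= THRESHOLDS[mode]:
        return compressed
    return original

# A similarity of 0.9 passes Safe (0.85) but fails Ultra Safe (0.95):
assert choose_prompt("orig", "comp", 0.9, "safe") == "comp"
assert choose_prompt("orig", "comp", 0.9, "ultra_safe") == "orig"
```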
Enabling Agentic Token Compression
Agentic token compression can be enabled in three ways, giving you flexibility to control compression at the request, API key, or organization level.
1. Per Request (SDK or Headers)
Enable compression for specific requests using the SDK or headers:
- TypeScript
- Python
- Go
- Rust
- cURL
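At the request level, the switch is just extra metadata on the call. The header names below are placeholders invented for this sketch; check the Edgee SDK reference for the real ones:

```python
# Hypothetical header names -- the real ones may differ.
def compression_headers(ratio: float = 0.8, preservation: str = "safe") -> dict[str, str]:
    """Build per-request compression settings as HTTP headers."""
    return {
        "x-edgee-compression-ratio": str(ratio),        # 0.7 (Strong) to 0.9 (Light)
        "x-edgee-semantic-preservation": preservation,  # off | ultra_safe | safe | edgy
    }

# Merge with your normal auth headers on each request:
headers = {"Authorization": "Bearer <api-key>", **compression_headers(0.7, "ultra_safe")}
```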
2. Per API Key (Console)
Enable compression for specific API keys in your organization settings. This is useful when you want different compression settings for different applications or environments.
- Set Compression to Light (0.9), Medium (0.8), or Strong (0.7) — see Understanding compression ratio
- Set Semantic preservation threshold to Off, Ultra Safe (0.95), Safe (0.85), or Edgy (0.75) — see Semantic preservation and BERT score
- Under Scope, select Apply to specific API keys
- Choose which API keys should use compression
When It Works Best
Token compression delivers the highest savings for these common use cases:
RAG Pipelines
40-50% reduction
Large document contexts with redundant information compress effectively. Ideal for Q&A systems, knowledge bases, and semantic search.
Long Contexts
30-45% reduction
Lengthy conversation histories, documentation, or background information. Common in chatbots and assistant applications.
Document Analysis
35-50% reduction
Summarization, extraction, and analysis of long documents. Verbose source material compresses well.
Multi-Turn Agents
25-40% reduction
Conversational agents with growing context windows. Savings increase with conversation length.
Code Example
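A sketch of reading the compression metrics from a response body. The field names under `compression` are assumptions for illustration, not Edgee's documented schema:

```python
# Hypothetical response body; real field names may differ.
response = {
    "usage": {"prompt_tokens": 1200, "completion_tokens": 300},
    "compression": {
        "original_tokens": 2000,
        "compressed_tokens": 1200,
        "ratio": 0.6,
        "bert_score": 0.93,
    },
}

comp = response["compression"]
saved = comp["original_tokens"] - comp["compressed_tokens"]
print(f"Saved {saved} tokens ({1 - comp['ratio']:.0%})")  # Saved 800 tokens (40%)
```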
Every response includes compression metrics so you can track your savings.
Real-World Savings
Here’s what token compression means for your monthly AI bill with 50% compression:

| Use Case | Monthly Requests | Without Edgee | With Edgee |
|---|---|---|---|
| RAG Q&A (GPT-5.2) | 1,000,000 @ 2,000 input tokens | $3,500 | $1,750 |
| Document Analysis (Sonnet 4.6) | 50,000 @ 20,000 input tokens | $3,000 | $1,500 |
| Chatbot (Haiku) | 5,000,000 @ 500 input tokens | $2,500 | $1,250 |
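The arithmetic behind the table: billed input tokens scale linearly with the compression factor. The per-million-token price below is backed out from the RAG row of the table above, not quoted from any price list:

```python
def monthly_input_cost(requests_per_month: int, tokens_per_request: int,
                       price_per_million: float, compression: float = 1.0) -> float:
    """Monthly input-token cost; `compression` is billed/original token ratio."""
    tokens = requests_per_month * tokens_per_request * compression
    return tokens / 1_000_000 * price_per_million

# RAG Q&A row: 1M requests x 2,000 tokens at an implied $1.75 per million input tokens
assert monthly_input_cost(1_000_000, 2_000, 1.75) == 3_500
# With 50% compression, half the input tokens are billed:
assert monthly_input_cost(1_000_000, 2_000, 1.75, compression=0.5) == 1_750
```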
Response Fields
Every Edgee response includes the standard usage information and, if compression was applied, detailed compression metrics. Use them to:
- Track savings in real-time
- Build cost dashboards and budgeting tools
- Identify high-value compression opportunities
- Optimize prompt design for maximum compression
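For example, a minimal savings tracker that aggregates per-response metrics for a cost dashboard. As above, the `compression` field names are assumed for illustration:

```python
class SavingsTracker:
    """Accumulate compression metrics across responses for a cost dashboard."""

    def __init__(self, price_per_million: float):
        self.price_per_million = price_per_million
        self.tokens_saved = 0

    def record(self, response: dict) -> None:
        comp = response.get("compression")  # absent if compression didn't run
        if comp:
            self.tokens_saved += comp["original_tokens"] - comp["compressed_tokens"]

    @property
    def dollars_saved(self) -> float:
        return self.tokens_saved / 1_000_000 * self.price_per_million

tracker = SavingsTracker(price_per_million=1.75)
tracker.record({"compression": {"original_tokens": 2000, "compressed_tokens": 1200}})
tracker.record({"usage": {}})  # no compression applied on this request
```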
