Two layers
The two-layer taxonomy is the non-negotiable foundation. Every strategy in this page belongs to exactly one of them.- Input compression: ~99% of total token volume, ~90% of the cost. What enters the context window: system prompts, tool results, codebase context, conversation history, MCP tool definitions.
- Output compression: ~1% of total volume but 10% of the cost. What the model generates: filler, repetitive scaffolding, polite preambles, over-explanation, markdown overhead.
The three compression strategies
Edgee ships three named compression strategies, toggleable independently.| Compression | Layer | Cost reduction |
|---|---|---|
| Tool Result | Input | −19% |
| Tool Surface (alpha) | Input | ~−25% projected |
| Output | Output | −6.5% when enabled |
Tool Result Trimming
Filterstool_result messages before they reach the model. Strips:
- Boilerplate framing
- Pagination markers
- ANSI escape sequences
- Repeated headers
- Verbose JSON wrappers
- File contents — output from Read tool and file system operations.
- Grep and search outputs — code search, ripgrep, similar tools.
- Shell command output — stdout/stderr from Bash and terminal commands.
- API responses — large JSON or text payloads returned by tool calls.
- Database query results — rows and records returned from tool-executed queries.
tool_result payloads — the model receives the same technical content, with redundant framing removed.
Customer traffic. tool_result_trimming reduces token costs by 19% on average.
Initially based on rtk-ai/rtk, we built our tool result compression strategy directly into the Edgee Rust gateway, so users don’t need a separate binary in their pipeline.
Tool Surface Reduction
Coding agents connect multiple MCP servers, each exposing its own set of tools. The agent sends the full tool list to the model on every request, even when only one or two MCP servers are relevant. This bloats context and drives up cost. How it works: Edgee creates a virtual MCP server that the model sees. Instead of the full tool list, the model talks to the virtual MCP. The virtual MCP classifies the user’s task and searches for the correct real MCP server to use. It sends the result back to the client, which then executes the real MCP server. The result is a tool-aware gateway:- The IDE still exposes all MCP servers — nothing changes for the developer’s setup.
- The agent still discovers tools through the standard MCP protocol — nothing changes for the agent’s behavior.
- The model only ever sees the virtual MCP. The client receives the routing decision from it and executes the real MCP server.
Output Brevity
Reduces verbosity in model responses without losing technical content. Same answer, fewer tokens. Available levels:| Level | What it does | Trade-off |
|---|---|---|
light | Asks the model to skip pleasantries, articles, and filler, while keeping standard sentence structure. | Lowest output reduction, most readable. |
medium | Forces the model to drop articles, fragments, and conventional grammar in favor of dense technical content. | Dense, less natural prose. |
hard | An aggressive variant that pushes output brevity further with stricter instructions.. | Highest output reduction; least readable for humans, still parseable for downstream tools. |
output_brevity is opt-in and disabled by default. For chat-style or RAG workloads where the model produces long-form answers, output is the dominant cost and output_brevity becomes the lever.
Customer traffic. Where enabled, output_brevity reduces total token costs by 6.5% on average.
Academic note. Recent work supports the broader claim — Brevity Constraints Reverse Performance Hierarchies in Language Models (Hakim, arXiv:2604.00025, March 2026) found that constraining models to brief responses can improve accuracy on certain benchmarks. The study is on open-weight models, not Claude/GPT directly.
Reading the compression block
Every response that runs through any compression strategy carries a compression block on the response body. Use it to track savings per request.
| Field | Type | Meaning |
|---|---|---|
saved_tokens | integer | Input tokens removed (original count minus compressed count). |
cost_savings | integer | Estimated cost savings in micro-units. Divide by 1_000_000 for USD. |
reduction | number | Percentage reduction in input tokens. 48 → 48%. |
time_ms | integer | Wall-clock time spent on compression. |
usage.prompt_tokens field on the same response reflects the compressed count actually billed by the provider, not the original input.
Enabling and disabling
Three surfaces, in order of how most users will use them.CLI (default-on for coding agents)
When you launch a coding agent through the Edgee CLI,tool_result_trimming is enabled automatically — no console step required.
tool_surface_reduction is opt-in. output_brevity is opt-in for coding-agent sessions because output is a small share of their volume.
Console (per-key toggle)
In the Edgee Console, open Dashboard and manage your agent’s settings right from the UI. For team-managed keys, the same toggles are available per-member from Team management → agent settings. See Team management.Next
Claude Token Compression
tool_result_trimming applied to Claude API traffic.Codex Token Compression
tool_result_trimming applied to the OpenAI Responses wire format.