Token Compression - Edgee documentation

Token compression reduces the number of tokens sent to and received from the LLM, without losing information from the model’s perspective. It is the surgical removal of redundancy, not summarization.

Two layers

The two-layer taxonomy is the non-negotiable foundation. Every strategy on this page belongs to exactly one of the two layers.

Input compression: ~99% of total token volume, ~90% of the cost. What enters the context window: system prompts, tool results, codebase context, conversation history, MCP tool definitions.
Output compression: ~1% of total token volume but ~10% of the cost. What the model generates: filler, repetitive scaffolding, polite preambles, over-explanation, markdown overhead.
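
The volume and cost shares above imply a per-token price gap between the two layers. The following is a small worked calculation, not Edgee code; it assumes that cost within each layer is simply token volume times a flat per-token price, and the names `inputShare`, `outputShare`, and `relativeUnitCost` are illustrative.

```typescript
// Approximate shares taken from the two-layer breakdown above.
const inputShare = { volume: 0.99, cost: 0.9 };
const outputShare = { volume: 0.01, cost: 0.1 };

// Cost per unit of token volume, in arbitrary relative units.
function relativeUnitCost(share: { volume: number; cost: number }): number {
  return share.cost / share.volume;
}

// How many times more one output token costs than one input token,
// under the flat-price assumption stated in the lead-in.
const outputToInputPriceRatio =
  relativeUnitCost(outputShare) / relativeUnitCost(inputShare);

console.log(outputToInputPriceRatio.toFixed(1)); // "11.0"
```

Under these figures an output token costs roughly 11 times as much as an input token, which is why output carries about a tenth of the cost despite its small share of volume.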

Agentic workloads consume 5–30× more tokens per task than chatbot workloads, and approximately 40% of those tokens are redundant. Compression targets that redundant share.

The three compression strategies

Edgee ships three named compression strategies, toggleable independently.

Compression	Layer	Cost reduction
Tool Result	Input	−19%
Tool Surface (alpha)	Input	~−25% projected
Output	Output	−6.5% when enabled

Tool Result Trimming

Filters tool_result messages before they reach the model. Strips:

Boilerplate framing
Pagination markers
ANSI escape sequences
Repeated headers
Verbose JSON wrappers

What it targets in a typical coding-agent session:

File contents — output from Read tool and file system operations.
Grep and search outputs — code search, ripgrep, similar tools.
Shell command output — stdout/stderr from Bash and terminal commands.
API responses — large JSON or text payloads returned by tool calls.
Database query results — rows and records returned from tool-executed queries.

User messages and assistant turns are not modified.

Lossiness. Lossless on tool_result payloads: the model receives the same technical content, with redundant framing removed.

Customer traffic. tool_result_trimming reduces token costs by 19% on average.

Our tool result compression strategy was initially based on rtk-ai/rtk. We have since built it directly into the Edgee Rust gateway, so users don’t need a separate binary in their pipeline.
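
To make two of the stripped items concrete (ANSI escape sequences and repeated headers), here is a minimal sketch. It is not the filter that runs in the Edgee gateway: `trimToolResult` and its `headerPattern` parameter are hypothetical names, and the caller is assumed to know what a header line looks like in its own tool output.

```typescript
// Matches ANSI CSI escape sequences such as "\x1b[31m" (colors, cursor moves).
const ANSI_CSI = /\x1b\[[0-9;?]*[ -\/]*[@-~]/g;

// Hypothetical helper, for illustration only.
// headerPattern must not use the "g" flag, since RegExp.test is stateful with it.
function trimToolResult(text: string, headerPattern: RegExp): string {
  const seenHeaders = new Set<string>();
  const kept: string[] = [];
  for (const line of text.replace(ANSI_CSI, "").split("\n")) {
    if (headerPattern.test(line)) {
      if (seenHeaders.has(line)) continue; // repeated header: drop it
      seenHeaders.add(line); // first occurrence: keep it
    }
    kept.push(line);
  }
  return kept.join("\n");
}

// Example: a paginated status table with a bold header printed twice.
const raw = "\x1b[1mNAME  STATUS\x1b[0m\napi   up\nNAME  STATUS\ndb    up";
const trimmed = trimToolResult(raw, /^NAME\s+STATUS$/);
console.log(trimmed);
// NAME  STATUS
// api   up
// db    up
```

The data rows pass through unchanged; only the escape codes and the second copy of the header are removed, which matches the "same technical content, redundant framing removed" goal described above.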

Tool Surface Reduction

Coding agents connect to multiple MCP servers, each exposing its own set of tools. The agent sends the full tool list to the model on every request, even when only one or two MCP servers are relevant. This bloats context and drives up cost.

How it works: Edgee creates a virtual MCP server that the model sees. Instead of the full tool list, the model talks to the virtual MCP. The virtual MCP classifies the user’s task and searches for the correct real MCP server to use. It sends the result back to the client, which then executes the real MCP server.

The result is a tool-aware gateway:

The IDE still exposes all MCP servers — nothing changes for the developer’s setup.
The agent still discovers tools through the standard MCP protocol — nothing changes for the agent’s behavior.
The model only ever sees the virtual MCP. The client receives the routing decision from it and executes the real MCP server.
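
The classify-then-route step can be pictured with a toy example. This is a sketch under loud assumptions: the source does not say how Edgee classifies tasks, so a naive keyword-overlap score stands in for it, and `McpServerInfo`, `RoutingDecision`, and `routeTask` are hypothetical names rather than Edgee or MCP SDK types.

```typescript
// Hypothetical description of a registered MCP server.
interface McpServerInfo {
  name: string;
  keywords: string[]; // terms suggesting this server is relevant
}

// Hypothetical routing result handed back to the client.
interface RoutingDecision {
  server: string | null; // null when no server matches
  score: number; // number of matching keywords
}

// Naive keyword-overlap classifier, standing in for the real one.
function routeTask(task: string, servers: McpServerInfo[]): RoutingDecision {
  const words = new Set(task.toLowerCase().split(/\W+/).filter(Boolean));
  let best: RoutingDecision = { server: null, score: 0 };
  for (const s of servers) {
    const score = s.keywords.filter((k) => words.has(k.toLowerCase())).length;
    if (score > best.score) best = { server: s.name, score };
  }
  return best;
}

const servers: McpServerInfo[] = [
  { name: "github", keywords: ["pr", "issue", "commit", "repo"] },
  { name: "postgres", keywords: ["sql", "query", "table", "rows"] },
];

const decision = routeTask(
  "Run a SQL query to count rows in the users table",
  servers,
);
console.log(decision.server, decision.score); // postgres 4
```

In this picture the model would only see one routing tool, and the client would use the returned `server` name to call the real MCP server itself.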

Output Brevity

Reduces verbosity in model responses without losing technical content. Same answer, fewer tokens. Available levels:

Level	What it does	Trade-off
`light`	Asks the model to skip pleasantries, articles, and filler, while keeping standard sentence structure.	Lowest output reduction, most readable.
`medium`	Forces the model to drop articles, fragments, and conventional grammar in favor of dense technical content.	Dense, less natural prose.
`hard`	An aggressive variant that pushes output brevity further with stricter instructions.	Highest output reduction; least readable for humans, still parseable for downstream tools.

For coding-agent sessions, output is a small share of total token volume (~1%), so output_brevity is opt-in and disabled by default. For chat-style or RAG workloads where the model produces long-form answers, output is the dominant cost and output_brevity becomes the main lever.

Customer traffic. Where enabled, output_brevity reduces total token costs by 6.5% on average.

Academic note. Recent work supports the broader claim: Brevity Constraints Reverse Performance Hierarchies in Language Models (Hakim, arXiv:2604.00025, March 2026) found that constraining models to brief responses can improve accuracy on certain benchmarks. The study covers open-weight models, not Claude/GPT directly.
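
One way to picture the three levels is as progressively stricter instructions appended to the system prompt. The sketch below is illustrative only: the instruction wording is placeholder text paraphrased from the level table, not the text Edgee injects, and `BrevityLevel`, `BREVITY_INSTRUCTIONS`, and `withBrevity` are hypothetical names.

```typescript
type BrevityLevel = "light" | "medium" | "hard";

// Placeholder wording: the actual instructions Edgee uses are not documented here.
const BREVITY_INSTRUCTIONS: Record<BrevityLevel, string> = {
  light: "Skip pleasantries and filler. Keep standard sentence structure.",
  medium: "Drop articles and conventional grammar. Prefer dense technical content.",
  hard: "Use the fewest tokens possible. Output only what is technically required.",
};

// Opt-in by design: with no level given, the prompt is returned untouched.
function withBrevity(systemPrompt: string, level?: BrevityLevel): string {
  if (!level) return systemPrompt;
  return `${systemPrompt}\n\n${BREVITY_INSTRUCTIONS[level]}`;
}

console.log(withBrevity("You are a coding assistant.", "medium"));
```

Keeping the level optional mirrors the default described above, where output_brevity stays off unless it is explicitly enabled.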

Reading the `compression` block

Every response that passes through any compression strategy carries a compression block in the response body. Use it to track savings per request.

const response = await edgee.send({
  model: 'gpt-5.2',
  input: 'Long prompt with lots of context...',
});

if (response.compression) {
  console.log(response.compression.saved_tokens); // e.g. 450
  console.log(response.compression.cost_savings); // micro-units (1_000_000 = $1.00)
  console.log(response.compression.reduction);    // percentage, e.g. 48 → 48%
  console.log(response.compression.time_ms);      // ms spent on compression
}

Field reference:

Field	Type	Meaning
`saved_tokens`	integer	Input tokens removed (original count minus compressed count).
`cost_savings`	integer	Estimated cost savings in micro-units. Divide by `1_000_000` for USD.
`reduction`	number	Percentage reduction in input tokens. `48` → 48%.
`time_ms`	integer	Wall-clock time spent on compression.

The usage.prompt_tokens field on the same response reflects the compressed count actually billed by the provider, not the original input.
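
Since usage.prompt_tokens is the compressed count and saved_tokens is the original count minus the compressed count, the original input size can be recovered by adding the two. The helper below is a sketch of that arithmetic: the `CompressionInfo` interface simply mirrors the field table, `summarizeCompression` is a hypothetical name, and the sample numbers are made up.

```typescript
// Mirrors the field reference above; the interface itself is an assumption.
interface CompressionInfo {
  saved_tokens: number; // input tokens removed
  cost_savings: number; // micro-units (1_000_000 = $1.00)
  reduction: number; // percentage, e.g. 48 means 48%
  time_ms: number; // ms spent on compression
}

// Hypothetical helper that derives human-readable figures for one request.
function summarizeCompression(c: CompressionInfo, promptTokens: number) {
  // original = compressed + removed
  const originalTokens = promptTokens + c.saved_tokens;
  return {
    originalTokens,
    compressedTokens: promptTokens,
    savedUsd: c.cost_savings / 1_000_000,
    // Recomputed from token counts; may differ from c.reduction by rounding.
    computedReductionPct:
      originalTokens > 0 ? (c.saved_tokens * 100) / originalTokens : 0,
  };
}

// Made-up sample values, for illustration only.
const summary = summarizeCompression(
  { saved_tokens: 450, cost_savings: 27_000, reduction: 45, time_ms: 12 },
  550,
);
console.log(summary.originalTokens, summary.savedUsd, summary.computedReductionPct);
// 1000 0.027 45
```

Summing `savedUsd` across requests gives a running estimate of savings without a separate call to the console.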

Enabling and disabling

Two surfaces, in the order most users will use them.

CLI (default-on for coding agents)

When you launch a coding agent through the Edgee CLI, tool_result_trimming is enabled automatically — no console step required.

edgee launch claude
edgee launch codex
edgee launch opencode

tool_surface_reduction is opt-in. output_brevity is opt-in for coding-agent sessions because output is a small share of their volume.

Console (per-key toggle)

In the Edgee Console, open Dashboard and manage your agent’s settings right from the UI. For team-managed keys, the same toggles are available per-member from Team management → agent settings. See Team management.

Next

Claude Token Compression

tool_result_trimming applied to Claude API traffic.

Codex Token Compression

tool_result_trimming applied to the OpenAI Responses wire format.
