Measuring token compression for coding agents on SWE-bench Lite

Why compression matters

Coding agents have become the substrate for building products. They are long-running and context-heavy: a single SWE-bench coding task can take 30–100 API turns, and 1–10 million tokens per task is common. Multiplied across a fleet of agents, a developer team, or a SaaS product, these numbers make the bill escalate quickly.

The economic pressure is one motivation for compression, but not the only one. Other elements point in the same direction:

| Bottleneck | What it limits | Why compression helps |
| --- | --- | --- |
| Dollar cost | Per-task spend; sustainable unit economics for AI products | Fewer tokens (of the right kind) → smaller bill |
| Latency | Time-to-first-token scales with prefix size on most providers | Smaller prefix → faster perceived response |
| Context window | 200k or 1M token windows still fill up on long agent sessions | Compressing history extends the effective horizon |
| Throughput | Server-side batch sizes are inversely related to per-request token count | Smaller requests → more parallelism for the provider; lower queueing delays for the user |

A robust compression layer addresses all four, which makes it a critical part of agentic workflow performance. Compressor V2 is such a layer: a component of the Edgee AI gateway built with these requirements in mind.

From V1 to V2

Edgee's V1 compressor, shipped earlier this year, used a single strategy we called tool result trimming. This strategy is inspired by the famous RTK project and consists of cleaning up the verbose tool outputs that agents collect over a long session. It delivered roughly 10% cost savings on real coding workloads: safe and simple, but limited in scope.

V2 takes a different approach: three orthogonal strategies layered together, each one targeting a different layer of context and configurable independently per API key. The three strategies don't overlap because they attack different sources of token bloat:

  • Brevity attacks the most expensive token class, output tokens.
  • Tool surface reduction (TSR) attacks the most repetitive part of the prefix (MCP tool catalogs).
  • Tool result trimming attacks the long tail of verbose tool outputs that accumulate over the conversation.

Through the Gateway, customers compose the combination that matches their workload and reduce their token consumption. This post presents the empirical results that quantify these gains.

To start, let us recall some important notions.

Prefix caching

Anthropic's prefix cache is content-keyed: any byte that differs between two requests invalidates the cache from that point on. A long session that reuses the same system prompt and tool catalog pays the cache_create fee once and reads back the prefix at 1/10th the input price on every subsequent turn.

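This is also why strategies like brevity and tool result trimming are well suited to coding agents. Brevity touches only the output, and tool result trimming only the tool outputs accumulated in the conversation history; neither rewrites the system prompt or tool catalog at the head of the prefix. Cache amortization across long agent sessions therefore stays largely intact, and brevity's savings are concentrated in the most expensive token class.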
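
To make the amortization concrete, here is a minimal sketch in relative price units. The 0.1× cache-read multiplier is the 1/10th figure quoted above; the 1.25× cache-write multiplier, the prefix size, and the turn count are illustrative assumptions, not measurements from our runs.

```python
# Relative prices, in units of the base input-token price.
CACHE_READ = 0.10    # from the text: reads cost 1/10th of the input price
CACHE_WRITE = 1.25   # assumption: cache_create surcharge over base input


def prefix_cost(prefix_tokens: int, turns: int, cached: bool) -> float:
    """Cost of resending a stable prefix on every turn of a session."""
    if not cached:
        return prefix_tokens * turns
    # Pay the write fee once, then read the prefix back on later turns.
    return prefix_tokens * CACHE_WRITE + prefix_tokens * CACHE_READ * (turns - 1)


# Hypothetical session: a 20k-token prefix resent over 50 turns.
uncached = prefix_cost(20_000, turns=50, cached=False)
cached = prefix_cost(20_000, turns=50, cached=True)
print(f"cached prefix costs {cached / uncached:.1%} of the uncached one")
```

Any edit that changes bytes inside the cached prefix forces a new cache write, so under these assumptions a strategy that leaves the prefix alone keeps the cheaper branch on every turn.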

Statistical tools

For every metric we report, three things are computed:

  • Paired sign test. For each task, compare per-task means: did edgee use less? Count the wins. Under the null, the win count follows a Binomial(n, 0.5) distribution; p < 0.05 is significant at the conventional threshold.
  • Bootstrap 95% Confidence Interval. Resample task pairs 10,000 times, compute the statistic on each resample, take the 2.5th and 97.5th percentiles.
  • Within-task coefficient of variation. Average across (task, backend) cells. Above 20% means too much per-replicate noise; below 20% means the per-task means are stable enough to compare.

Statistical methodology in depth

This section walks through the tests we apply, why we chose them, and what they tell us.

The structure of the data

For every (task, backend) combination, we collect N replicate measurements and average across replicates to get a single per-task mean for each backend. The result is a paired dataset: n tasks, each with one vanilla mean and one edgee mean. Pairing matters — task difficulty varies by orders of magnitude (some SWE-bench tasks are 100k-token toy fixes, others are 12M-token deep refactors), so unpaired comparisons would be dominated by between-task variance.

The paired sign test

Our primary tool for direction-of-effect claims.

Setup. For each of n tasks, ask: did edgee cost less than vanilla? Count the yeses. Under the null hypothesis ("no real difference between backends"), the count of edgee wins follows a binomial distribution with parameters n and probability 0.5.

Two-sided p-value. Given k wins out of n:

p = min(1, 2 · Pr[X ≥ max(k, n − k)]), where X ~ Binomial(n, 0.5)

For n = 6 tasks and k = 6 wins: p = 2 · (1/2)⁶ ≈ 0.031. This is exactly where the brevity result lands.
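
As a minimal sketch, the two-sided p-value can be computed with the standard library alone. The function name is ours, and the sketch assumes no task ends in an exact tie (tied pairs are usually dropped before counting).

```python
from math import comb


def sign_test_p(wins: int, n: int) -> float:
    """Two-sided paired sign test p-value under Binomial(n, 0.5)."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    # Doubling the tail can exceed 1 when wins is near n/2, so cap it.
    return min(1.0, 2 * tail)


print(round(sign_test_p(6, 6), 3))  # 0.031
print(round(sign_test_p(8, 8), 3))  # 0.008
```

The same function shows why a partial win count at small n is inconclusive: `sign_test_p(5, 8)` and `sign_test_p(4, 6)` both come out far above 0.05.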

Why this test, specifically. We chose the sign test over plausible alternatives:

| Test | What it assumes | Why we passed (or chose it) |
| --- | --- | --- |
| Paired t-test | Paired differences are normally distributed | Token-cost differences have heavy tails, so normality is violated. |
| Wilcoxon signed-rank | Paired differences are symmetric around the median | Our differences are skewed. |
| Sign test | Only that pairs are independent | Holds for our design. Fewest assumptions. |

The cost of using the sign test is statistical power. It uses only the direction of each pair, not the magnitude. The benefit is that when it does detect significance, the result holds under distributional weirdness.

The hard floor set by n. With n tasks, the lowest achievable two-sided p-value is 2 · (1/2)ⁿ:

| n tasks | Min achievable two-sided p |
| --- | --- |
| 6 | 0.031 |
| 8 | 0.008 |
| 10 | 0.002 |
| 20 | ≈ 2 × 10⁻⁶ |

Adding replicates therefore cannot lower this floor: the best achievable p-value depends on task count, not on within-cell precision.

The bootstrap confidence interval

The sign test tells us direction; the bootstrap tells us magnitude.

Setup. Given n per-task means for each backend, for a statistic like the median ratio of edgee-to-vanilla cost:

  1. Resample with replacement n task pairs B times (we use B = 10,000).
  2. Compute the statistic on each resampled set.
  3. Take the 2.5th and 97.5th percentiles of the resulting distribution.

That's the 95% CI.
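
The three steps above can be sketched as follows. This is a minimal illustration rather than our production analysis code: the function name, the fixed seed, the percentile indexing, and the sample data are all assumptions.

```python
import random
from statistics import median


def bootstrap_ci(vanilla, edgee, b=10_000, seed=0):
    """Percentile 95% CI for the median edgee/vanilla ratio over paired tasks."""
    rng = random.Random(seed)
    ratios = [e / v for v, e in zip(vanilla, edgee)]
    n = len(ratios)
    # Resample task pairs with replacement; each pair's ratio travels as one unit.
    stats = sorted(median(rng.choices(ratios, k=n)) for _ in range(b))
    lo = stats[int(0.025 * b)]
    hi = stats[int(0.975 * b) - 1]
    return median(ratios), lo, hi


# Hypothetical per-task means (not the measured data).
vanilla = [1.2, 0.4, 3.1, 0.9, 2.2, 5.0]
edgee = [0.9, 0.3, 1.6, 0.7, 1.5, 2.4]
point, lo, hi = bootstrap_ci(vanilla, edgee)
print(f"median ratio {point:.2f}x, 95% CI [{lo:.2f}x, {hi:.2f}x]")
```

Because the statistic is a function of each pair's ratio, resampling the ratios is equivalent to resampling the (vanilla, edgee) pairs themselves.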

Why percentile method. The bias-corrected-and-accelerated (BCa) method adjusts the interval for bias and skew in the bootstrap distribution, but its correction terms are themselves estimated from the data and become unstable at small n. With n = 6 to n = 10, we deliberately tolerate skew rather than aggressively correct for it. Percentile is the simpler, more conservative choice.

Why B = 10,000. Bootstrap CI endpoints are themselves estimates, with Monte Carlo error shrinking as 1/√B. At B = 10,000 the simulation error is well under 1% of the CI width — small enough not to matter.

Interpreting a CI. A 95% CI excluding 1.0× means: under repeated sampling, 95% of the constructed CIs would contain the true population median, and the observed range does not include "no effect." In practice, treating "the CI excludes 1" as evidence that the effect is real is a defensible shortcut.

Within-task coefficient of variation

For each (task, backend) cell, compute σ/μ across the N replicates. Average across the 2n cells. A mean CV above ~20% means per-replicate noise is comparable to the effect sizes we're trying to measure. In our headline brevity run, mean CV was well under the 20% threshold.
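
A minimal sketch of the computation, assuming the sample standard deviation (n − 1 denominator) and a plain dictionary of replicates. The data layout and the numbers are hypothetical.

```python
from statistics import mean, stdev


def mean_within_task_cv(cells):
    """Average sigma/mu over (task, backend) cells.

    `cells` maps (task, backend) to the list of replicate measurements.
    """
    cvs = [stdev(reps) / mean(reps) for reps in cells.values()]
    return mean(cvs)


# Hypothetical replicate costs in dollars, N = 2 per cell.
cells = {
    ("task-1", "vanilla"): [1.10, 1.30],
    ("task-1", "edgee"): [0.80, 0.90],
    ("task-2", "vanilla"): [4.00, 4.60],
    ("task-2", "edgee"): [2.90, 3.10],
}
cv = mean_within_task_cv(cells)
print(f"mean CV = {cv:.1%}", "(noisy)" if cv > 0.20 else "(ok)")
```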

Aggregate vs mean vs median

A single per-metric "reduction" hides three legitimate aggregations:

| Aggregation | Formula | Answers |
| --- | --- | --- |
| Aggregate | 1 − (Σ eᵢ) / (Σ vᵢ) | "Of total spend, what fraction did edgee save?" (volume-weighted) |
| Mean per task | (1/n) × Σ (1 − eᵢ/vᵢ) | "What does the average task experience?" |
| Median per task | median(1 − eᵢ/vᵢ) | "What does the typical task experience?" (outlier-robust) |

We considered all three on every result.
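
The three formulas can diverge sharply on the same data. The sketch below uses made-up costs in which one large task with a large saving dominates the aggregate; the function name and numbers are illustrative only.

```python
from statistics import mean, median


def reductions(vanilla, edgee):
    """Return the aggregate, mean per-task, and median per-task reductions."""
    per_task = [1 - e / v for v, e in zip(vanilla, edgee)]
    return {
        "aggregate": 1 - sum(edgee) / sum(vanilla),
        "mean": mean(per_task),
        "median": median(per_task),
    }


# Hypothetical costs: three small tasks save 20%, one large task saves 60%.
vanilla = [1.0, 1.0, 1.0, 20.0]
edgee = [0.8, 0.8, 0.8, 8.0]
print(reductions(vanilla, edgee))
```

Here the aggregate lands near 55% while the median stays at 20%, which is the kind of gap that makes the median the more conservative headline.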

Experimental methodology

This section explains the design choices that guided this work.

Workload

SWE-bench Lite. The standard public benchmark of 300 GitHub issues from popular Python repositories, each paired with a passing-test commit. We use SWE-bench Lite in agent mode: Claude gets the issue and full repo access, and is asked to fix the bug autonomously. This is the canonical workload for coding-agent evaluation.

Replicates and shuffle

Each (task, backend) cell is replicated N times. Replicate ordering is randomized per task:

                                Shuffle within each task:
   Task 1:    v_1 v_2 e_1 e_2            ─►   e_2 v_1 e_1 v_2
   Task 2:    v_1 v_2 e_1 e_2            ─►   v_1 e_2 v_2 e_1
   

If we always ran "all vanilla, then all edgee," the second batch would inherit a cold-cache disadvantage. Random ordering removes this systematic bias.
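
A minimal sketch of such a schedule builder follows. The function name, the seed, and the task labels are hypothetical; only the idea of shuffling independently within each task comes from the design above.

```python
import random


def build_schedule(tasks, replicates, seed=0):
    """Return, per task, a randomly ordered list of (backend, replicate) runs."""
    rng = random.Random(seed)
    schedule = {}
    for task in tasks:
        runs = [(backend, r)
                for backend in ("vanilla", "edgee")
                for r in range(1, replicates + 1)]
        rng.shuffle(runs)  # order is randomized independently for each task
        schedule[task] = runs
    return schedule


for task, runs in build_schedule(["task-1", "task-2"], replicates=2).items():
    print(task, " ".join(f"{b[0]}_{r}" for b, r in runs))
```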

Per-replicate nonces

Even with shuffling, two replicates of the same (task, backend) cell would produce byte-identical request prefixes — and Anthropic's cache would serve replicate 2 almost entirely from cache, dramatically underreporting its true cost. We prepend a random nonce to the first user message:

[trial: 8q4r7n2vmp]

You are working autonomously on a bug fix in the django repository. …

That single differing line invalidates the cache for the rest of the prefix, so each replicate starts cold.
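
One way to generate such a tag is sketched below. The 10-character lowercase alphanumeric format mirrors the example above; the helper name and the use of the `secrets` module are our assumptions for illustration.

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits


def with_nonce(prompt: str, length: int = 10) -> str:
    """Prepend a random trial tag so no two replicates share a prefix."""
    nonce = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"[trial: {nonce}]\n\n{prompt}"


first = with_nonce("You are working autonomously on a bug fix.")
second = with_nonce("You are working autonomously on a bug fix.")
print(first.splitlines()[0])
```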

Token accounting

Token usage is read directly from Claude Code's per-session JSONL log files, which contain raw usage fields returned by Anthropic. Cost is computed locally from these counts and the published price table.
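
As a rough sketch of this accounting step: the four usage field names below are the ones the Anthropic API returns, but the location of the usage object inside each log record (`message.usage`) and the price table are assumptions made for illustration.

```python
import json

# Hypothetical price table in dollars per million tokens; substitute the
# published prices for the model actually under test.
PRICES = {
    "input_tokens": 3.00,
    "cache_creation_input_tokens": 3.75,
    "cache_read_input_tokens": 0.30,
    "output_tokens": 15.00,
}


def session_cost(jsonl_lines, prices=PRICES):
    """Sum the dollar cost of one session from its JSONL log lines."""
    total = 0.0
    for line in jsonl_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Assumption: usage sits under record["message"]["usage"].
        usage = (record.get("message") or {}).get("usage") or {}
        for field, price in prices.items():
            total += usage.get(field, 0) * price / 1_000_000
    return total


log = [
    json.dumps({"message": {"usage": {"input_tokens": 12,
                                      "cache_read_input_tokens": 40_000,
                                      "output_tokens": 500}}}),
    json.dumps({"type": "user", "message": {"content": "..."}}),
]
print(f"${session_cost(log):.4f}")
```

Records without a usage object, such as user turns, contribute nothing to the total.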

Results across the V2 stack

V2's three strategies were evaluated independently against workloads matched to their design target. Brevity is the headline result on autonomous coding. TSR is the headline result on tool-heavy MCP workloads. Tool result trimming provides incremental savings that compound with the other two over long sessions.

Brevity, ~30% per-task cost reduction

The headline coding-workload result.

Setup

  • Workload: 6 representative SWE-bench Lite tasks, agent mode (single autonomous prompt per session).
  • Replicates: 2 per (task, backend) cell.
  • Backends: vanilla Claude Code vs edgee with brevity enabled.
  • Total sessions: 24.
  • Shuffle: randomized per task.
  • Per-replicate nonces: active (each replicate starts cold).

Headline numbers

| Metric | Aggregate | Mean per task | Median per task | Sign test | Significance |
| --- | --- | --- | --- | --- | --- |
| Cost ($) | +51.1% | +34.2% | +27.5% | 6/6 favor edgee | ★ (p = 0.031) |
| Total tokens | +58.9% | +35.2% | +27.5% | 6/6 favor edgee | ★ (p = 0.031) |
| Output tokens | +53.9% | similar | similar | 6/6 favor edgee | ★ (p = 0.031) |

Three things to notice:

  1. All six tasks favor edgee on every metric. Sign-test wins of 6/6 hit the minimum achievable p-value at n=6, p = 0.031, which is statistically significant at the conventional α = 0.05 threshold.
  2. The three metrics agree directionally. Cost is down because total tokens are down because output tokens are down. The three signals are mutually reinforcing.
  3. We cite the median (~30%) as the headline rather than the aggregate (+51%) to have a more conservative and outlier-robust metric.

Bootstrap CI on the token ratio

The 95% bootstrap confidence interval for the median ratio of edgee/vanilla total tokens:

token ratio = 0.70× (95% CI: [0.41×, 0.84×])

The entire CI is below 1.0×: the constructed interval does not include "no effect," which supports the robustness of the result.

Why brevity works so well

The mechanism behind the headline number is structural. Brevity instructs the model to be terse: no preamble, no commentary, just the work. On agent-mode coding tasks, the model normally produces a lot of meta-text alongside its actual edits: "I'll first read the file, then I'll check the imports, then I'll make the change..." That commentary is real output tokens, billed at a high price.

When brevity is active, the model still does all the same tool calls and produces the same final patches — it just stops narrating its plan. The work doesn't change; the prose around the work shrinks dramatically.

This has three implications for the cost math:

  • Output drops by roughly half.
  • Cache_read stays roughly constant (the prefix isn't touched; cache amortization across the long agent loop continues working).
  • Cache_create stays roughly constant (no prefix invalidation; tools stay stable, system prompt stays stable).

The result: a strategy that concentrates its savings on the most expensive token class without inducing any cache-create churn. It combines a large token-count signal with a comparably large dollar signal, and both move in the same direction.

Where the cost lands — per-class decomposition

The clearest way to see brevity's mechanism is to break down both backends by token class. For all 12 vanilla sessions and all 12 edgee sessions combined:

| Token type | vanilla cost | edgee cost | Δ |
| --- | --- | --- | --- |
| Fresh input | small | small | ~0 |
| Cache read | meaningful | meaningful | small |
| Cache create | meaningful | meaningful | small |
| Output | dominant | roughly half | −54% |

The output class, the single most expensive token type on the Anthropic API, drops by more than half, and that drop accounts for nearly all of the +51% aggregate cost reduction. Brevity doesn't just reduce tokens; it reduces the right tokens.

TSR on MCP workloads, ~10% cost reduction

The headline result for tool-heavy workflows.

When an agent connects to MCP servers like Linear, Notion, or GitHub, the request prefix accumulates dozens of tool definitions — typically 30–40 per server, each one a paragraph of JSON describing the tool's name, parameters, and behavior. Tool surface reduction (TSR) rewrites this prefix on the fly, replacing the full tool catalog with a single virtual mcp__edgee_gateway__search tool. The model calls this tool with an {intent, args} payload; the gateway resolves the intent server-side, dispatches the appropriate real tool, and returns the result. The model never sees the full catalog.
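
The rewrite can be pictured with the following sketch. It is an illustration of the idea, not the gateway's implementation: the virtual tool's description and schema, the `mcp__` prefix used to recognize MCP tools, the decision to keep non-MCP tools, and the example tool names are all assumptions. Only the virtual tool's name and the {intent, args} payload shape come from the description above.

```python
VIRTUAL_TOOL = {
    "name": "mcp__edgee_gateway__search",
    "description": "Describe what you want to do; the gateway picks the tool.",
    "input_schema": {
        "type": "object",
        "properties": {
            "intent": {"type": "string"},
            "args": {"type": "object"},
        },
        "required": ["intent"],
    },
}


def reduce_tool_surface(request: dict) -> dict:
    """Return a copy of the request with MCP tools replaced by one virtual tool."""
    tools = request.get("tools", [])
    # Assumption: MCP tools are recognizable by the "mcp__" name prefix.
    kept = [t for t in tools if not t["name"].startswith("mcp__")]
    return {**request, "tools": kept + [VIRTUAL_TOOL]}


# Hypothetical request with two MCP tools and one built-in tool.
request = {
    "model": "example-model",
    "tools": [
        {"name": "mcp__linear__list_issues", "description": "...", "input_schema": {}},
        {"name": "mcp__notion__search", "description": "...", "input_schema": {}},
        {"name": "Read", "description": "...", "input_schema": {}},
    ],
}
print([t["name"] for t in reduce_tool_surface(request)["tools"]])
```

The sketch covers only the prefix rewrite; resolving an intent to a real tool and dispatching it happens server-side and is not shown.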

Setup

  • Workload: 8 synthetic read-only queries against Linear and Notion. Tasks range from single-server lookups ("list 5 recent Linear issues") to cross-server stress queries ("find Linear projects and Notion docs mentioning 'roadmap'").
  • Replicates: 3 per (task, backend) cell.
  • Backends: vanilla Claude Code vs edgee with TSR enabled.
  • Total sessions: 48.

Headline numbers

| Metric | Aggregate | Mean per task | Median per task | Sign test | Significance |
| --- | --- | --- | --- | --- | --- |
| Total tokens | +33.0% | +31.0% | +30.6% | 8/8 favor edgee | ★★ (p = 0.008) |
| Cost ($) | +11.2% | +10.8% | ~+10% | 5/8 favor edgee | not significant |

Three things to notice:

  1. All eight tasks favor edgee on token volume. 8/8 hits the minimum achievable two-sided p-value at n=8, p = 0.008 — substantially below the conventional α = 0.05 threshold. This is the strongest direction-of-effect signal in any of our measurements.
  2. Cost reduction is real but more modest than tokens. The median per-task cost reduction lands around 10%, with 5/8 tasks favoring edgee. The sign test on cost does not reach conventional significance at n=8 — but the point estimate is directionally favorable and consistent across our two TSR runs.
  3. The 95% bootstrap CI on the token ratio is [0.63×, 0.75×] — a tight interval entirely below 1.0×. There is no plausible value of the true population median that is consistent with "no effect" on the token side.

Why the cost effect is smaller than the token effect

A 33% reduction in tokens translates into only ~10% cost reduction because of the per-class pricing asymmetry. TSR's compression removes a lot of cache_read tokens from the prefix, and cache_read is the cheapest token class. Those savings are real but priced at a far lower rate than output tokens. The strong token signal therefore doesn't translate 1-to-1 into dollar savings.
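
A small worked example shows the asymmetry. The relative prices and the session mix below are illustrative assumptions chosen to resemble a cache-heavy session; they are not our measured data.

```python
# Relative prices in units of the base input price (illustrative assumptions).
PRICE = {"input": 1.0, "cache_create": 1.25, "cache_read": 0.1, "output": 5.0}


def totals(tokens):
    """Return (total tokens, total cost) for a per-class token count."""
    cost = sum(tokens[c] * PRICE[c] for c in tokens)
    return sum(tokens.values()), cost


# Hypothetical session mix, before and after removing cache_read tokens only.
before = {"input": 10_000, "cache_create": 50_000,
          "cache_read": 900_000, "output": 40_000}
after = {**before, "cache_read": 570_000}

(t0, c0), (t1, c1) = totals(before), totals(after)
print(f"tokens -{1 - t1 / t0:.0%}, cost -{1 - c1 / c0:.0%}")
```

With these made-up numbers, cutting a third of all tokens from the cheapest class trims the bill by roughly 9%, the same shape as the measured gap between the token and cost results.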

This is a feature of the measurement. We treat the token-volume reduction (33%, p = 0.008) and the cost reduction (~10%, trending) as two separate claims, each with its own statistical support. Customers whose binding constraint is throughput, latency, or context-window care primarily about the token claim; customers focused on the bill care primarily about the cost claim. V2 ships both.

Why TSR is the right strategy for tool-heavy workloads

Coding agents have stable tool catalogs that amortize cheaply across long sessions; brevity attacks their bottleneck (output). Tool-using assistants — chatbots integrated with Linear, Slack, Notion, GitHub, etc. — have unstable tool exposure as they connect more services, and their bottleneck is the prefix bloat from dozens or hundreds of MCP tool definitions.

TSR addresses exactly this regime. The math is structural:

  • A 40-tool MCP catalog adds roughly 40–80k tokens to every request's tools[] array.
  • Under TSR, that becomes ~1k tokens (a single virtual tool definition).
  • For products that connect 4–5 MCP servers, the prefix savings compound: 200k tokens of catalog become ~1k.

The 33% volume reduction we measured on an 8-task synthetic benchmark scales upward in regimes where the catalog is larger or where many independent short queries run against the same backend (defeating cache amortization). For products whose latency, context window, or throughput is the binding constraint, TSR delivers a strong win.

Tool result trimming, 10% cost savings, compounding over long sessions

The third V2 strategy, refined from V1.

Over long sessions, agents accumulate tool outputs — file listings, stack traces, repeated log frames — that bloat the conversation history without proportionally informing the model's next decision. Tool result trimming cleans these up: it preserves the semantically important slices of each tool output and compresses or summarizes the rest. The model retains enough context to reason; it doesn't see thirty pages of identical error frames.

Setup

  • Workload: 6 representative SWE-bench Lite tasks, agent mode.
  • Replicates: 2 per (task, backend) cell.
  • Total sessions: 24.

Result

| Metric | Median per task | Sign test |
| --- | --- | --- |
| Cost ($) | +10.4% | 4/6 favor edgee |

Our agent-mode SWE-bench benchmark of the refined V2 version reproduces V1's roughly 10% savings, at +10.4% median. The sign test (4/6) does not reach conventional significance at n=6. Tool result trimming is a directionally favorable but modest strategy on its own, with the cleanest gains visible on longer sessions where tool-output accumulation dominates the cache_read budget.

The reason we ship it as part of V2 is that it compounds with brevity and TSR. Because tool result trimming operates on conversation history (not on the prefix or the output), it composes cleanly with the other two strategies without overlap. A coding workflow running brevity + tool result trimming gets brevity's ~30% per-task cost reduction on output plus an incremental 5–10% from cleaning up the long tail of accumulated tool results.

The clean composability is the design goal of the V2 stack: three strategies, three different layers, additive rather than competing savings.

The V2 stack, composing the strategies

V2's three strategies were designed to be orthogonal so customers can compose them per-workload. Putting the results together:

| Workload shape | Recommended V2 combination | Expected effect |
| --- | --- | --- |
| Long autonomous coding sessions (SWE-bench shape) | Brevity + tool result trimming | ~30% per-task cost from brevity, +5–10% from tool result trimming compounding as session length grows |
| Tool-heavy queries against MCP servers | TSR + brevity | ~10% cost + 33% token-volume reduction from TSR; brevity adds further output-side savings on top |
| Mixed coding + tool workflows | All three | Each strategy targets a distinct layer; combinations are clean and additive |
| Latency- or context-bound workloads (regardless of dollar cost) | TSR (priority), brevity (secondary) | Maximum tokens-through-the-model reduction (33% on TSR's target workload) |

The strategies don't double-count because they attack different layers of the request:

  • Brevity → output tokens (assistant message content)
  • TSR → prefix tokens (tools[] array)
  • Tool result trimming → message history (cumulative tool outputs)

Each can be toggled independently on a per-API-key basis from the edgee console.

Conclusion

Compressor V2 layers three orthogonal compression strategies, and each delivers a measurable result on the workload it targets:

  • Brevity cuts cost by ~30% per task on autonomous coding workloads, with 6/6 tasks favoring edgee, paired sign test p = 0.031. The mechanism is structural: brevity attacks the most expensive token class (output) without disturbing the prefix or cache amortization.
  • TSR delivers ~10% per-task cost reduction on tool-heavy MCP workloads. The token-side signal is 8/8 tasks favoring edgee at p = 0.008.
  • Tool result trimming, refined from V1 and based on the RTK project, delivers 5–10% per-task cost reduction as a compounding layer that composes cleanly with brevity and TSR across long sessions.

To conclude, enable the combination that matches your workload shape, per API key. If you're running coding agents or tool-heavy assistants at scale and the bill is what's keeping you up at night, V2 is in production today and saves a combined 50% on the cost of your sessions.
