Token savings are only useful if the agent still solves the problem. We ran a pre-registered A/B test on SWE-bench Verified with the Hermes harness and deepseek/deepseek-v4-flash: 100 instances per arm, drawn as a seeded sample across 8 repos (sphinx, matplotlib, xarray, pytest, requests, pylint, seaborn, flask). Of these, 99 instances per arm were graded under the official Docker harness; 1 was excluded because its per-instance Docker image was missing upstream.
Pre-registered means that assignment, sampling, and statistical tests were fixed before any runs were executed, which avoids the p-hacking risk of post-hoc sample selection.

Resolved rate (docker-graded, n=99/arm, paired)

| Arm | Resolved | Rate | 95% Wilson CI |
| --- | --- | --- | --- |
| TELOS | 45 / 99 | 45.5% | [36.0%, 55.2%] |
| Vanilla | 42 / 99 | 42.4% | [33.2%, 52.3%] |
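
The intervals in the table can be re-derived from the resolved counts alone. The following is a minimal sketch of the standard Wilson score interval; the function name `wilson_interval` is ours for illustration and is not taken from the study's scripts.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Resolved counts from the table above (n = 99 graded instances per arm).
for arm, resolved in (("TELOS", 45), ("Vanilla", 42)):
    lo, hi = wilson_interval(resolved, 99)
    print(f"{arm}: {resolved}/99 = {resolved / 99:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

The two intervals overlap heavily, which is why the per-arm rates alone cannot separate the arms and the paired analysis is needed.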
Paired 2×2 on the same 99 instances: both resolved 33; TELOS-only 12; vanilla-only 9; neither 45.
| | Vanilla ✓ | Vanilla ✗ |
| --- | --- | --- |
| TELOS ✓ | 33 | 12 |
| TELOS ✗ | 9 | 45 |
Exact McNemar two-sided p = 0.66: the +3 pp absolute gap is not statistically significant. At this sample size there is no evidence that TELOS regresses the resolved rate.
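
For readers who want to check the p-value by hand, here is a small sketch of the exact (binomial) McNemar test. It uses only the two discordant cells of the paired table (TELOS-only = 12, vanilla-only = 9); the helper name `exact_mcnemar_p` is an illustrative choice, not the study's code.

```python
from math import comb

def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact McNemar test on the discordant pair counts b and c.

    Under the null hypothesis each discordant pair is equally likely to favor
    either arm, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_value = exact_mcnemar_p(12, 9)
print(f"exact McNemar two-sided p = {p_value:.2f}")
```

Concordant pairs (33 both resolved, 45 neither) do not enter the test, because they carry no information about which arm is better.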

Token efficiency (agent-side, n=99/arm, same instances)

| Per-task | TELOS | Vanilla | Δ |
| --- | --- | --- | --- |
| new_input (post-cache, billed) | 93,712 | 198,706 | −52.8% |
| prompt_tokens (raw + cache) | 352,400 | 515,953 | −31.7% |
| output_tokens | 24,975 | 25,218 | −1.0% |
| api_calls | 32.6 | 32.1 | +1.4% |
| cache_share | 73.4% | 61.5% | +11.9 pp |
| reported cost (USD) | $2.29 | $3.85 | −40.5% |
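
The Δ column can be recomputed from the raw counts. The sketch below also assumes that cache_share is defined as 1 − new_input / prompt_tokens; that definition is our inference rather than something stated in the study, although it does reproduce the 73.4% and 61.5% figures in the table.

```python
# Per-task averages copied from the table above.
telos = {"new_input": 93_712, "prompt_tokens": 352_400, "output_tokens": 24_975}
vanilla = {"new_input": 198_706, "prompt_tokens": 515_953, "output_tokens": 25_218}

def rel_change(treated: float, baseline: float) -> float:
    """Relative change of the treated arm against the baseline arm."""
    return (treated - baseline) / baseline

def cache_share(row: dict) -> float:
    """Assumed definition: fraction of prompt tokens served from cache."""
    return 1 - row["new_input"] / row["prompt_tokens"]

for key in ("new_input", "prompt_tokens", "output_tokens"):
    print(f"{key}: {rel_change(telos[key], vanilla[key]):+.1%}")

share_gap_pp = (cache_share(telos) - cache_share(vanilla)) * 100
print(f"cache_share: {cache_share(telos):.1%} vs {cache_share(vanilla):.1%} "
      f"({share_gap_pp:+.1f} pp)")
```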

The key observation: savings come from the protocol, not from doing less

The near-zero differences in output_tokens (−1.0%) and api_calls (+1.4%) are the core finding. A common alternative hypothesis is that TELOS buys its token savings by reasoning less (fewer tool calls, shorter output); if that were true, output_tokens and api_calls would drop substantially. The table is inconsistent with that hypothesis: the savings come from the prompt side, through byte stability and cache hits, consistent with the monotonic-append guarantee.

Read this honestly

The 99-instance subset gives a Wilson CI of roughly ±10 pp on each arm, and the paired difference Δ has a 95% CI of about [−6 pp, +12 pp]. This run can therefore rule out an absolute regression worse than ~6 pp at 95% confidence, but it cannot pin Δ to ±2 pp. The interval half-width shrinks as 1/√n, so at the observed discordance rate (21 of 99 pairs) n = 400 per arm would still leave roughly ±4.5 pp, and ±2 pp would take on the order of 2,000 paired instances. What the run does show with high confidence: the input-token bill is roughly halved and end-to-end cost drops ~40%, at the same correctness band.
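
The interval for the paired difference can be reproduced from the discordant counts. This sketch assumes a Wald (normal-approximation) interval for paired proportions; the study does not say which method it used, so treat the choice as ours.

```python
import math

def paired_diff_ci(b: int, c: int, n: int, z: float = 1.959964):
    """Wald interval for the paired difference in proportions.

    b: pairs resolved only by the treated arm, c: only by the baseline arm,
    n: total number of pairs.
    """
    diff = (b - c) / n
    se = math.sqrt(b + c - (b - c) ** 2 / n) / n
    return diff, diff - z * se, diff + z * se

diff, lo, hi = paired_diff_ci(b=12, c=9, n=99)
print(f"delta = {diff * 100:+.1f} pp, 95% CI [{lo * 100:+.1f} pp, {hi * 100:+.1f} pp]")
```

The lower bound is what supports the "no regression worse than ~6 pp" statement; the interval is wide because only 21 of the 99 pairs are discordant.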

Three orthogonal axes

The study supports a more general claim: agent economics can be attacked along three non-substitutable axes:

Content compression

LongLLMLingua, AutoCompressors — lossy reduction of total tokens.

Engine caching

vLLM PagedAttention, SGLang RadixAttention — reuse computed prefixes.

Protocol contract

TELOS — guarantee the bytes are stable so the engine cache can hit.
The axes are orthogonal. Once TELOS pushes the cache hit rate to saturation, engine work shifts to capacity and eviction. Where raw_input is still large, content compression can be applied selectively in the FOLD region, and PIN/DROP byte stability guarantees that the compression won’t break the prefix.
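
To make the byte-stability point concrete, here is a toy illustration of why an append-only prompt lets a prefix cache reuse earlier work while an edit near the top does not. It is not the TELOS implementation: the message strings and the helper `shared_prefix_len` are hypothetical, and real engines match on token blocks rather than raw bytes.

```python
def shared_prefix_len(prev: bytes, curr: bytes) -> int:
    """Number of leading bytes that two prompts have in common."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n

# Hypothetical prompt sent on the previous turn.
history = b"system: You are a coding agent.\n" b"tool: ls -> a.py b.py\n"
new_turn = b"tool: cat a.py -> (file body)\n"

# Append-only: the previous prompt is kept byte-for-byte and extended.
appended = history + new_turn

# Unstable: a small change near the top (here a step counter) shifts everything after it.
rewritten = (b"system: You are a coding agent. step=2\n"
             b"tool: ls -> a.py b.py\n" + new_turn)

for name, prompt in (("appended", appended), ("rewritten", rewritten)):
    reused = shared_prefix_len(history, prompt)
    print(f"{name}: {reused} of {len(prompt)} bytes reusable from the previous turn")
```

In the append-only case the whole previous prompt is a reusable prefix; in the rewritten case reuse stops at the first changed byte, so everything after it has to be processed again.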

Reproduce it

You don’t have to trust the percentages — the artifacts, sampling seed, and evaluation scripts are all in the repo.
# Agent stage (n = 100 / arm)
python scripts/run_swebench_batch.py -n 100 --seed 7 \
  --workers 4 --output /tmp/telos-ab-n100/with

python scripts/run_swebench_batch.py -n 100 --seed 7 \
  --workers 4 --output /tmp/telos-ab-n100/without --no-telos

# Docker evaluation (shown for the TELOS arm; repeat with the vanilla
# arm's predictions file and its own --run_id)
python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Verified \
  --predictions_path predictions-telos.jsonl \
  --max_workers 3 --run_id ab-n100-telos \
  --cache_level instance --timeout 1500

About & Citation

Cite the study and the protocol.