deepseek/deepseek-v4-flash — 100 instances per arm, a
seeded sample across 8 repos (sphinx, matplotlib, xarray, pytest, requests, pylint, seaborn, flask).
99 instances per arm were graded under the official Docker harness (1 excluded due to a missing
per-instance docker image upstream).
Pre-registered means that arm assignment, sampling, and statistical tests were fixed before any runs
were executed, which avoids the p-hacking risk of post-hoc sample selection.
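A seeded sample can be reproduced by anyone who has the candidate pool and the seed. The sketch below is only an illustration of that idea: the pool of instance IDs, the seed value `0`, and the helper name `seeded_sample` are all hypothetical, and the study's actual sampling script lives in the repo.

```python
import random

def seeded_sample(instance_ids, k, seed):
    """Draw k instances deterministically from a pool (hypothetical helper)."""
    rng = random.Random(seed)          # private RNG, so global state is untouched
    # Sort first so the result does not depend on the pool's input order.
    return rng.sample(sorted(instance_ids), k)

# Hypothetical pool: 8 placeholder repos with 50 placeholder instances each.
pool = [f"repo{r}-{i:03d}" for r in range(8) for i in range(50)]

picked = seeded_sample(pool, k=100, seed=0)        # seed value is an assumption
same_again = seeded_sample(reversed(pool), k=100, seed=0)
print(picked == same_again)  # identical selection on every run with the same seed
```

Fixing the seed before any run is what makes the selection auditable: the sample cannot be redrawn after seeing results without that being detectable.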
Resolved rate (docker-graded, n=99/arm, paired)
| Arm | Resolved | Rate | 95% Wilson CI |
|---|---|---|---|
| TELOS | 45 / 99 | 45.5% | [36.0%, 55.2%] |
| Vanilla | 42 / 99 | 42.4% | [33.2%, 52.3%] |
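The intervals in the table can be checked with the standard Wilson score formula. This is a minimal sketch assuming z = 1.96 for a 95% interval; the function name `wilson_interval` is illustrative and not taken from the study's scripts.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

for arm, resolved in [("TELOS", 45), ("Vanilla", 42)]:
    lo, hi = wilson_interval(resolved, 99)
    print(f"{arm}: {resolved / 99:.1%} [{lo:.1%}, {hi:.1%}]")
```

The two intervals overlap heavily, so the resolved-rate difference on its own should not be read as a demonstrated accuracy gain.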

| | Vanilla ✓ | Vanilla ✗ |
|---|---|---|
| TELOS ✓ | 33 | 12 |
| TELOS ✗ | 9 | 45 |
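Because both arms ran on the same 99 instances, only the discordant cells (12 and 9) carry information about a difference between arms. The source does not name the pre-registered test, so the sketch below assumes McNemar's exact test, a common choice for paired binary outcomes; the function name is illustrative.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    k = min(b, c)
    # Under the null, each discordant pair favors either arm with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# b = TELOS resolved, Vanilla failed; c = Vanilla resolved, TELOS failed.
p_value = mcnemar_exact(12, 9)
print(f"discordant pairs: 21, exact two-sided p = {p_value:.3f}")  # about 0.664
```

Under this assumed test the 45 vs 42 split is well within chance, which matches the overlapping confidence intervals: the data show no evidence of a difference in resolved rate in either direction.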
Token efficiency (agent-side, n=99/arm, same instances)
| Per-task | TELOS | Vanilla | Δ |
|---|---|---|---|
| new_input (post-cache, billed) | 93,712 | 198,706 | −52.8% |
| prompt_tokens (raw + cache) | 352,400 | 515,953 | −31.7% |
| output_tokens | 24,975 | 25,218 | −1.0% |
| api_calls | 32.6 | 32.1 | +1.4% |
| cache_share | 73.4% | 61.5% | +11.9 pp |
| reported cost (USD) | $2.29 | $3.85 | −40.5% |
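The rows of this table are internally related, which makes them easy to sanity-check. The sketch below assumes `cache_share` is the fraction of prompt tokens served from cache, that is `1 - new_input / prompt_tokens`; that definition is inferred from the numbers rather than stated in the source.

```python
# Per-task means copied from the table above.
telos = {"new_input": 93_712, "prompt_tokens": 352_400, "cost": 2.29}
vanilla = {"new_input": 198_706, "prompt_tokens": 515_953, "cost": 3.85}

def cache_share(row: dict) -> float:
    """Assumed definition: share of prompt tokens that were cache hits."""
    return 1 - row["new_input"] / row["prompt_tokens"]

def rel_delta(treated: float, baseline: float) -> float:
    """Relative change of the treated arm against the baseline arm."""
    return (treated - baseline) / baseline

print(f"cache_share TELOS   = {cache_share(telos):.1%}")
print(f"cache_share Vanilla = {cache_share(vanilla):.1%}")
for key in ("new_input", "prompt_tokens", "cost"):
    print(f"delta {key} = {rel_delta(telos[key], vanilla[key]):+.1%}")
```

Under that assumed definition the computed shares reproduce the 73.4% and 61.5% in the table, so the billed-input saving is consistent with both a smaller raw prompt and a higher hit rate.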
The key observation: savings come from the protocol, not from doing less
The near-zero differences in output_tokens (−1.0%) and api_calls (+1.4%) are the core finding. A
common alternative hypothesis is that TELOS buys its token savings by reasoning less (fewer tool
calls, shorter output); if that were true, output_tokens and api_calls would drop significantly. The
table does not support that: the savings come almost entirely from prompt-side byte stability and
cache hits, consistent with the monotonic-append guarantee.
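Why byte stability matters can be shown with a toy model of prefix caching. This is a simplified sketch, not the TELOS implementation or any engine's real cache: it assumes a cache hit covers exactly the bytes a prompt shares, from the start, with the previous prompt, and the `step=` header is a hypothetical example of a field that changes on every call.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def cached_fraction(prompts: list[bytes]) -> float:
    """Share of all prompt bytes that repeat the previous prompt's prefix."""
    total = sum(len(p) for p in prompts)
    hits = sum(common_prefix_len(prev, cur) for prev, cur in zip(prompts, prompts[1:]))
    return hits / total

turns = [b"system: fix the bug\n", b"tool: ls\n", b"tool: cat a.py\n"]

# Monotonic append: every prompt is the previous prompt plus new bytes.
append_only, buf = [], b""
for turn in turns:
    buf += turn
    append_only.append(buf)

# Same content, but a changing header at the front breaks the shared prefix.
rewritten = [b"step=%d\n" % i + p for i, p in enumerate(append_only)]

print(f"append-only: {cached_fraction(append_only):.0%} of bytes reusable")
print(f"rewritten:   {cached_fraction(rewritten):.0%} of bytes reusable")
```

Both variants send the same conversation and make the same number of calls; only the stability of the leading bytes differs, which is the pattern the table shows (similar api_calls and output_tokens, very different billed input).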
Read this honestly
Three orthogonal axes
The study supports a more general claim: agent economics can be attacked along three non-substitutable axes.
Content compression
LongLLMLingua, AutoCompressors — lossy reduction of total tokens.
Engine caching
vLLM PagedAttention, SGLang RadixAttention — reuse computed prefixes.
Protocol contract
TELOS — guarantee the bytes are stable so the engine cache can hit.
Reproduce it
You don’t have to trust the percentages: the artifacts, sampling seed, and evaluation scripts are all in the repo.
About & Citation
Cite the study and the protocol.