Replay & Comparison

How do you prove TELOS saves money without paying for two full agent runs? You record one real session, then replay the byte-identical turn sequence under different modes, measuring only the billing.

The corpus

By default the proxy records the raw request of every call to ~/.telos/corpus/<session>.jsonl. Only requests are stored, not responses: the Anthropic API is stateless, so the Nth-turn request already contains everything from the previous N−1 turns.

telos replay --list                # list recorded sessions

--no-record turns recording off; --corpus-dir changes the corpus directory.
The corresponding functions are record_call, load_session, and list_sessions.
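
As a rough illustration of the recording format, the sketch below appends one raw request per line to a JSONL file and reads it back. The function names match the ones listed above, but the signatures, the corpus_dir parameter, and the DEFAULT_CORPUS constant are assumptions made for this example, not the actual TELOS API.

```python
import json
from pathlib import Path

# Assumed default, mirroring ~/.telos/corpus from the text above.
DEFAULT_CORPUS = Path.home() / ".telos" / "corpus"


def record_call(session, request, corpus_dir=DEFAULT_CORPUS):
    """Append one raw request (a JSON-serializable dict) to <session>.jsonl."""
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)
    with open(corpus_dir / f"{session}.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(request, ensure_ascii=False) + "\n")


def load_session(session, corpus_dir=DEFAULT_CORPUS):
    """Return the recorded requests of one session, in call order."""
    path = Path(corpus_dir) / f"{session}.jsonl"
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def list_sessions(corpus_dir=DEFAULT_CORPUS):
    """Return the ids of all recorded sessions, sorted by name."""
    return sorted(p.stem for p in Path(corpus_dir).glob("*.jsonl"))
```

Because each line is a complete request, the last line of a session file is enough to reconstruct the full conversation up to that turn.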

Controlled replay

telos replay --session <id> --modes none telos rtk both

replay_session(turns, mode, ...) replays a recorded session under a given mode. The byte-identical turn sequence passes through RTK filtering (if mode.rtk), then through the TELOS pipeline (if mode.telos), and is sent upstream with max_tokens=1. Only the usage block of each response is kept.
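
The pipeline described above could be sketched roughly as follows. This is not the TELOS implementation: the Mode dataclass, the MODES mapping, and the injected send, rtk_filter, and telos_pipeline callables are all hypothetical stand-ins, chosen so the example runs without network access.

```python
import copy
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    name: str
    rtk: bool = False
    telos: bool = False


# Assumed mapping of the four --modes values to flag combinations.
MODES = {
    "none": Mode("none"),
    "rtk": Mode("rtk", rtk=True),
    "telos": Mode("telos", telos=True),
    "both": Mode("both", rtk=True, telos=True),
}


def replay_session(turns, mode, send,
                   rtk_filter=lambda r: r, telos_pipeline=lambda r: r):
    """Replay recorded requests under one mode and collect only the usage.

    send posts a request upstream and returns the response as a dict;
    rtk_filter and telos_pipeline transform a request dict. All three are
    placeholders supplied by the caller.
    """
    usages = []
    for recorded in turns:
        request = copy.deepcopy(recorded)  # never mutate the recorded corpus
        if mode.rtk:
            request = rtk_filter(request)
        if mode.telos:
            request = telos_pipeline(request)
        request["max_tokens"] = 1          # bill the prompt side only
        response = send(request)
        usages.append(response["usage"])   # everything else is discarded
    return usages
```

Passing send as a parameter keeps the sketch testable with a fake upstream; a real run would substitute a function that posts to the provider.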

Why max_tokens=1: only prefill and cache billing are measured. Output generation is deliberately capped at a single token, so the comparison isolates the prompt-side cost.
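
To make "prompt-side cost" concrete, here is a small sketch that turns one usage block into a single number, expressed in base-input-token equivalents. The field names follow the usage object returned by the Anthropic Messages API. The two multipliers are assumptions for illustration and should be checked against current pricing for the model and cache TTL in use.

```python
# Assumed price multipliers relative to one uncached input token.
CACHE_WRITE_MULTIPLIER = 1.25
CACHE_READ_MULTIPLIER = 0.10


def _count(usage, key):
    """Read a token count, treating a missing or null field as zero."""
    return usage.get(key) or 0


def prompt_side_cost(usage):
    """Prompt-side cost of one call, in base-input-token equivalents.

    output_tokens is ignored on purpose: with max_tokens=1 it is at most
    one token per call and carries no signal about the prompt.
    """
    return (
        _count(usage, "input_tokens")
        + CACHE_WRITE_MULTIPLIER * _count(usage, "cache_creation_input_tokens")
        + CACHE_READ_MULTIPLIER * _count(usage, "cache_read_input_tokens")
    )
```

Summing this value over all turns of a replayed session gives one comparable figure per mode.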

Cache isolation. By default, a unique prefix [telos-replay ns=<session>/<mode>] is injected at the very front of the system segment for each mode, so the Anthropic-side caches stay independent. This prevents a mode replayed earlier from warming the cache for a later one, which would otherwise collect cache hits it did not pay for. Each result is appended to usage_log with compare_group = <original session id> and replay: true.
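
A minimal sketch of the prefix injection, assuming requests are plain dicts in Anthropic Messages API shape, where system may be absent, a string, or a list of text blocks. The function name isolate_cache and its parameters are hypothetical; only the marker format is taken from the text above.

```python
import copy


def isolate_cache(request, session, mode_name):
    """Return a copy of request whose system segment starts with a per-mode marker."""
    marker = f"[telos-replay ns={session}/{mode_name}]"
    isolated = copy.deepcopy(request)  # leave the recorded request untouched
    system = isolated.get("system")
    if system is None:
        isolated["system"] = marker
    elif isinstance(system, str):
        isolated["system"] = f"{marker}\n{system}"
    else:
        # List-of-blocks form: the marker becomes a new first text block,
        # so existing blocks and their cache_control settings are preserved.
        isolated["system"] = [{"type": "text", "text": marker}, *system]
    return isolated
```

Since the marker differs per mode, the system segments of two modes never begin with the same text, even when everything after the marker is identical.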

Replay vs dual session

Approach	Cost	Controlled variables	Suitable claim
replay	1 real session + cheap prefill	good (turns pinned)	“for a given workload, the token bill drops by X”
dual session	N×K full sessions	poor (trajectory forks)	“using TELOS, the agent is cheaper overall”

Replay is the tool for an honest, reproducible savings number on a fixed workload. The SWE-bench A/B is the dual-session approach, used to show that correctness does not regress.
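
As an illustration of how a fixed-workload savings number might be derived from replay results, the sketch below groups log entries by mode and reports the reduction relative to a baseline. The compare_group and replay fields come from the description above. The mode and input_tokens fields, and the flat list-of-dicts shape of the log, are assumptions about the usage_log layout.

```python
from collections import defaultdict


def savings_by_mode(usage_log, compare_group, baseline="none"):
    """Fractional reduction in billed input tokens per mode, relative to baseline."""
    totals = defaultdict(int)
    for entry in usage_log:
        # Only replayed entries of the requested session are comparable.
        if entry.get("replay") and entry.get("compare_group") == compare_group:
            totals[entry["mode"]] += entry["input_tokens"]
    if not totals.get(baseline):
        raise ValueError(f"no '{baseline}' replay entries for {compare_group}")
    base = totals[baseline]
    return {m: 1 - t / base for m, t in totals.items() if m != baseline}
```

A result of 0.25 for a mode means its replay billed 25% fewer input tokens than the baseline on the same pinned turns. A cache-aware cost could be substituted for input_tokens without changing the structure.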
