Headroom Latency Benchmarks
Measured compression overhead across content types and sizes to answer one question: do the token savings outweigh the processing time?
Generated: 2026-02-24 01:11 UTC
Environment
- Platform: macOS-26.1-arm64-arm-64bit
- Processor: arm
- Python: 3.11.11
- Headroom: v0.3.7
Note: These benchmarks were captured on v0.3.7. Since then, v0.5.6 added parallel message compression, eliminated redundant token counting, and optimized hot-path hashing. Expect lower latency on current versions. Re-benchmarking is planned.
TL;DR
- Average compression: 93% token reduction
- Maximum compression overhead: 12213ms (p50), for JSON: Search Results (5K items)
- Net latency win: 11/12 scenarios against Claude Sonnet 4.5
Compression Overhead by Scenario
| Scenario | Tokens In | Tokens Out | Saved | Ratio | p50 (ms) | p95 (ms) | Mean (ms) |
|---|---|---|---|---|---|---|---|
| JSON: Search Results (100 items) | 10.2K | 1.5K | 8.7K | 86% | 189 | 231 | 196 |
| JSON: Search Results (500 items) | 50.2K | 1.5K | 48.7K | 97% | 943 | 955 | 943 |
| JSON: Search Results (1K items) | 100.5K | 1.5K | 99.0K | 99% | 2012 | 2198 | 2032 |
| JSON: Search Results (5K items) | 502.6K | 1.5K | 501.2K | 100% | 12213 | 12804 | 12223 |
| JSON: API Responses (500 items) | 38.9K | 1.1K | 37.8K | 97% | 743 | 776 | 744 |
| JSON: Database Rows (1K rows) | 43.7K | 605 | 43.1K | 99% | 961 | 1104 | 986 |
| JSON: String Array (100 strings) | 1.1K | 231 | 820 | 78% | 15.0 | 15.4 | 15.0 |
| JSON: String Array (500 strings) | 4.9K | 233 | 4.6K | 95% | 71.9 | 80.3 | 72.7 |
| JSON: String Array (1K strings) | 9.6K | 242 | 9.4K | 97% | 146 | 160 | 147 |
| JSON: Number Array (200 numbers) | 1.2K | 192 | 1.1K | 85% | 30.9 | 61.9 | 33.8 |
| JSON: Number Array (1K numbers) | 6.1K | 243 | 5.8K | 96% | 301 | 307 | 300 |
| JSON: Mixed Array (250 items) | 2.3K | 368 | 1.9K | 84% | 38.4 | 39.8 | 38.4 |
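The p50, p95, and mean columns above are summary statistics over repeated timing runs. As a rough illustration of how such columns can be produced, here is a minimal sketch using only the Python standard library. It is not taken from `benchmarks/bench_latency.py`; the helper names `time_ms` and `summarize_ms`, the run count, and the choice of the inclusive percentile method are all assumptions made for this example.

```python
import statistics
import time
from typing import Callable


def time_ms(fn: Callable[[], object], runs: int = 20) -> list[float]:
    """Call fn `runs` times and return each wall-clock duration in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize_ms(samples_ms: list[float]) -> dict[str, float]:
    """Return p50, p95, and mean for a list of latency samples."""
    # quantiles(n=100) yields 99 cut points: index 49 is the 50th
    # percentile and index 94 is the 95th.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "mean": statistics.fmean(samples_ms),
    }


if __name__ == "__main__":
    # Stand-in workload; a real run would time the compression call instead.
    stats = summarize_ms(time_ms(lambda: sum(range(10_000))))
    print({name: round(value, 3) for name, value in stats.items()})
```

A large gap between p50 and p95, as in the Number Array (200 numbers) row, usually points to a few slow outlier runs rather than a shift in typical latency.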
Per-Transform Latency Breakdown
| Scenario | Transform | p50 (ms) | % of Total |
|---|---|---|---|
| JSON: Search Results (100 items) | cache_aligner | 2.2 | 1% |
| JSON: Search Results (100 items) | content_router | 186 | 98% |
| JSON: Search Results (100 items) | rolling_window | <0.01 | 0% |
| JSON: Search Results (500 items) | cache_aligner | 10.7 | 1% |
| JSON: Search Results (500 items) | content_router | 927 | 98% |
| JSON: Search Results (500 items) | rolling_window | <0.01 | 0% |
| JSON: Search Results (1K items) | cache_aligner | 21.0 | 1% |
| JSON: Search Results (1K items) | content_router | 1980 | 98% |
| JSON: Search Results (1K items) | rolling_window | <0.01 | 0% |
| JSON: Search Results (5K items) | cache_aligner | 105 | 1% |
| JSON: Search Results (5K items) | content_router | 11985 | 98% |
| JSON: Search Results (5K items) | rolling_window | <0.01 | 0% |
| JSON: API Responses (500 items) | cache_aligner | 8.8 | 1% |
| JSON: API Responses (500 items) | content_router | 729 | 98% |
| JSON: API Responses (500 items) | rolling_window | <0.01 | 0% |
| JSON: Database Rows (1K rows) | cache_aligner | 9.3 | 1% |
| JSON: Database Rows (1K rows) | content_router | 946 | 99% |
| JSON: Database Rows (1K rows) | rolling_window | <0.01 | 0% |
| JSON: String Array (100 strings) | cache_aligner | 0.27 | 2% |
| JSON: String Array (100 strings) | content_router | 14.5 | 97% |
| JSON: String Array (100 strings) | rolling_window | <0.01 | 0% |
| JSON: String Array (500 strings) | cache_aligner | 0.95 | 1% |
| JSON: String Array (500 strings) | content_router | 70.2 | 98% |
| JSON: String Array (500 strings) | rolling_window | <0.01 | 0% |
| JSON: String Array (1K strings) | cache_aligner | 1.9 | 1% |
| JSON: String Array (1K strings) | content_router | 143 | 98% |
| JSON: String Array (1K strings) | rolling_window | <0.01 | 0% |
| JSON: Number Array (200 numbers) | cache_aligner | 0.66 | 2% |
| JSON: Number Array (200 numbers) | content_router | 29.6 | 96% |
| JSON: Number Array (200 numbers) | rolling_window | <0.01 | 0% |
| JSON: Number Array (1K numbers) | cache_aligner | 2.5 | 1% |
| JSON: Number Array (1K numbers) | content_router | 297 | 99% |
| JSON: Number Array (1K numbers) | rolling_window | <0.01 | 0% |
| JSON: Mixed Array (250 items) | cache_aligner | 0.58 | 1% |
| JSON: Mixed Array (250 items) | content_router | 37.4 | 97% |
| JSON: Mixed Array (250 items) | rolling_window | <0.01 | 0% |
Cost-Benefit Analysis
Net latency benefit = LLM time saved from fewer tokens - compression overhead.
| Scenario | Compress (ms) | LLM Saved (ms)* | Net Benefit | $/1K Requests** |
|---|---|---|---|---|
| JSON: Search Results (100 items) | 189 | 261 | +71.8ms | $26.13 |
| JSON: Search Results (500 items) | 943 | 1461 | +517.5ms | $146.06 |
| JSON: Search Results (1K items) | 2012 | 2969 | +956.9ms | $296.91 |
| JSON: Search Results (5K items) | 12213 | 15035 | +2822.2ms | $1503.53 |
| JSON: API Responses (500 items) | 743 | 1134 | +390.7ms | $113.38 |
| JSON: Database Rows (1K rows) | 961 | 1292 | +330.7ms | $129.16 |
| JSON: String Array (100 strings) | 15.0 | 24.6 | +9.6ms | $2.46 |
| JSON: String Array (500 strings) | 71.9 | 139 | +67.1ms | $13.90 |
| JSON: String Array (1K strings) | 146 | 282 | +135.9ms | $28.16 |
| JSON: Number Array (200 numbers) | 30.9 | 31.6 | +0.7ms | $3.16 |
| JSON: Number Array (1K numbers) | 301 | 175 | -126.3ms | $17.45 |
| JSON: Mixed Array (250 items) | 38.4 | 56.6 | +18.2ms | $5.66 |
* LLM time saved based on Claude Sonnet 4.5 prefill rate (0.03ms/token)
** Cost savings at $3.0/MTok input pricing
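The arithmetic behind the table can be reproduced from the net-benefit formula and the two footnoted constants. The sketch below is illustrative only: the function names are hypothetical, and the inputs are the rounded token counts from the overhead table, so results differ slightly from the table values, which were presumably computed from unrounded counts.

```python
PREFILL_MS_PER_TOKEN = 0.03  # Claude Sonnet 4.5 prefill rate (first footnote)
INPUT_USD_PER_MTOK = 3.0     # input pricing per million tokens (second footnote)


def net_benefit_ms(
    tokens_saved: float,
    compress_ms: float,
    prefill_ms_per_token: float = PREFILL_MS_PER_TOKEN,
) -> float:
    """LLM prefill time saved minus compression overhead, in milliseconds."""
    return tokens_saved * prefill_ms_per_token - compress_ms


def savings_usd_per_1k_requests(
    tokens_saved: float,
    usd_per_mtok: float = INPUT_USD_PER_MTOK,
) -> float:
    """Input-token cost avoided across 1,000 identical requests, in USD."""
    return tokens_saved / 1_000_000 * usd_per_mtok * 1_000


if __name__ == "__main__":
    # JSON: Search Results (100 items): about 8.7K tokens saved, 189 ms p50.
    print(round(net_benefit_ms(8_700, 189.0), 1))        # about +72 ms
    print(round(savings_usd_per_1k_requests(8_700), 2))  # about $26.10

    # JSON: Number Array (1K numbers): about 5.8K tokens saved, 301 ms p50.
    print(round(net_benefit_ms(5_800, 301.0), 1))        # negative: a net loss
```

A negative result, as in the Number Array (1K numbers) case, means compression costs more time than the prefill it avoids, even though it still reduces input-token cost.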
Break-Even Across Models
Compression overhead (p50) vs. LLM time saved for different model speed tiers:
| Scenario | Compress (ms) | GPT-4o Mini | GPT-4o | Claude Sonnet 4.5 | Claude Opus 4 |
|---|---|---|---|---|---|
| JSON: Search Results (100 items) | 189 | -102ms | +71.8ms | +71.8ms | +507ms |
| JSON: Search Results (500 items) | 943 | -456ms | +518ms | +518ms | +2952ms |
| JSON: Search Results (1K items) | 2012 | -1022ms | +957ms | +957ms | +5905ms |
| JSON: Search Results (5K items) | 12213 | -7201ms | +2822ms | +2822ms | +27881ms |
| JSON: API Responses (500 items) | 743 | -365ms | +391ms | +391ms | +2280ms |
| JSON: Database Rows (1K rows) | 961 | -530ms | +331ms | +331ms | +2483ms |
| JSON: String Array (100 strings) | 15.0 | -6.8ms | +9.6ms | +9.6ms | +50.6ms |
| JSON: String Array (500 strings) | 71.9 | -25.6ms | +67.1ms | +67.1ms | +299ms |
| JSON: String Array (1K strings) | 146 | -51.9ms | +136ms | +136ms | +605ms |
| JSON: Number Array (200 numbers) | 30.9 | -20.4ms | +0.68ms | +0.68ms | +53.3ms |
| JSON: Number Array (1K numbers) | 301 | -243ms | -126ms | -126ms | +165ms |
| JSON: Mixed Array (250 items) | 38.4 | -19.5ms | +18.2ms | +18.2ms | +113ms |
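Another way to read this table is to ask what prefill rate a model needs before compression breaks even: the compression overhead divided by the tokens saved. The sketch below works that out under stated assumptions. The function names are hypothetical. The 0.03ms/token (Claude Sonnet 4.5) and 0.08ms/token (Claude Opus) rates appear in this document; the GPT-4o Mini and GPT-4o rates are not stated anywhere and are back-derived here from the table values, so treat them as estimates.

```python
# Prefill rates in ms/token. Sonnet and Opus values come from this document;
# the two GPT rates are inferred from the table and are assumptions.
ASSUMED_PREFILL_MS_PER_TOKEN = {
    "GPT-4o Mini": 0.01,        # inferred, not stated in the source
    "GPT-4o": 0.03,             # inferred, not stated in the source
    "Claude Sonnet 4.5": 0.03,
    "Claude Opus 4": 0.08,
}


def break_even_ms_per_token(compress_ms: float, tokens_saved: float) -> float:
    """Prefill rate above which compression yields a net latency win."""
    if tokens_saved <= 0:
        raise ValueError("tokens_saved must be positive")
    return compress_ms / tokens_saved


def is_net_win(
    compress_ms: float, tokens_saved: float, prefill_ms_per_token: float
) -> bool:
    """True when prefill time saved exceeds the compression overhead."""
    return tokens_saved * prefill_ms_per_token > compress_ms


if __name__ == "__main__":
    # JSON: Number Array (1K numbers): 301 ms p50, about 5.8K tokens saved.
    rate = break_even_ms_per_token(301.0, 5_800)
    print(f"break-even prefill rate: {rate:.4f} ms/token")
    for model, model_rate in ASSUMED_PREFILL_MS_PER_TOKEN.items():
        verdict = "win" if is_net_win(301.0, 5_800, model_rate) else "loss"
        print(f"{model}: {verdict}")
```

For the Number Array (1K numbers) scenario the break-even rate comes out at roughly 0.05ms/token, which is consistent with the table showing a loss at the Sonnet-tier rate and a win at the Opus-tier rate.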
Key Takeaways
- Compression pays for itself in latency in 11 of the 12 scenarios (all JSON) at Claude Sonnet 4.5 prefill rates. In those 11, the LLM prefill time saved exceeds the compression overhead; the exception is JSON: Number Array (1K numbers), at -126ms.
- ContentRouter accounts for about 98% of pipeline time on average, because it does the actual compression work. CacheAligner and rolling-window context management together account for roughly 2% or less.
- Cost savings are substantial regardless of latency. The highest-compression scenario, JSON: Search Results (5K items), saves $1504/1K requests at Claude Sonnet 4.5 pricing.
- Slower/pricier models benefit most. Claude Opus 4, at 0.08ms/token prefill, shows a net latency win in 12/12 scenarios, versus 11/12 for Claude Sonnet 4.5 at 0.03ms/token.
Benchmarks run with `python benchmarks/bench_latency.py`. Results vary based on hardware, Python version, and content characteristics.