Benchmarks¶
Headroom's core promise: compress context without losing accuracy. This page shows accuracy benchmarks, compression performance, and real-world production telemetry from 250+ active proxy instances.
Key Results
98.2% recall on article extraction with 94.9% compression. 52ms median overhead in production. 1.4 billion tokens saved across 249 instances.
Compression Performance¶
Tested on Apple M-series (CPU), headroom v0.5.18. Each test runs compress() on realistic tool outputs.
| Content Type | Original | Compressed | Saved | Ratio | Latency |
|---|---|---|---|---|---|
| JSON array (100 items) | 3,163 | 297 | 2,866 | 90.6% | 1ms |
| JSON array (500 items) | 9,526 | 1,614 | 7,912 | 83.1% | 2ms |
| Shell output (200 lines) | 3,238 | 469 | 2,769 | 85.5% | 1ms |
| Build log (200 lines) | 2,412 | 148 | 2,264 | 93.9% | 1ms |
| grep results (150 hits) | 2,624 | 2,624 | 0 | 0.0% | <1ms |
| Python source (~480 lines) | 2,958 | 2,958 | 0 | 0.0% | <1ms |
| Total | 23,921 | 8,110 | 15,811 | 66.1% | 5ms |
Notes:
- grep results and Python source show 0% compression — these are already compact structured formats. SmartCrusher only compresses JSON arrays; code passes through to preserve correctness.
- Latency is for the
compress()SDK call, not the full proxy round-trip.
Production Telemetry¶
Real-world data from 50,000+ proxy sessions across 250+ unique instances (March 30 – April 2, 2026). Collected via anonymous telemetry beacon (opt-out: HEADROOM_TELEMETRY=off).
Proxy Overhead¶
| Percentile | Latency |
|---|---|
| Median (P50) | 52ms |
| P90 | 309ms |
| P99 | 4,172ms |
| Mean | 161ms |
The median 52ms overhead is negligible compared to LLM inference time (typically 2-10 seconds).
Compression Rate¶
| Percentile | Compression |
|---|---|
| P25 | 4.8% |
| Median | 4.8% |
| P75 | 6.9% |
| Mean | 11.3% |
Median compression is modest because many requests are short conversational turns. Heavy tool-use sessions (file reads, shell output) see 40-80% compression.
Pipeline Step Timing (Production Median)¶
| Step | Median | P90 | Description |
|---|---|---|---|
pipeline_total |
16.9ms | 289ms | Full compression pipeline |
content_router |
11.7ms | 259ms | Content detection + routing |
compressor:smart_crusher |
50.1ms | 50ms | JSON array compression |
compressor:text |
32.0ms | 576ms | Text compression (Kompress ONNX) |
compressor:mixed |
316ms | 428ms | Mixed content compression |
compressor:code_aware |
815ms | 886ms | Tree-sitter AST compression |
_initial_token_count |
2.9ms | 16ms | Token counting (tiktoken) |
_deep_copy |
0.1ms | 0.3ms | Message copy overhead |
Fleet Summary¶
| Metric | Value |
|---|---|
| Clean instances | 249 |
| Total tokens saved | 1.4 billion |
| Total $ saved | ~$4,000 |
| OS distribution | Linux 57%, macOS 38%, Windows 5% |
| Top version | 0.5.17 (77%) |
| Models used | Claude Opus 4.6, Sonnet 4.6, Haiku 4.5 |
Accuracy Benchmarks¶
HTML Extraction¶
Dataset: Scrapinghub Article Extraction Benchmark Samples: 181 HTML pages with ground truth article bodies Baseline: trafilatura (0.958 F1)
| Metric | Value | Description |
|---|---|---|
| F1 Score | 0.919 | Token-level overlap with ground truth |
| Precision | 0.879 | Proportion of extracted content that's relevant |
| Recall | 0.982 | Proportion of ground truth content captured |
| Compression | 94.9% | Average size reduction |
For LLM applications, recall is critical — 98.2% means nearly all article content is preserved. The slight precision drop (some extra content) doesn't hurt LLM accuracy.
# Run it yourself
pip install "headroom-ai[html]" datasets
pytest tests/test_evals/test_html_oss_benchmarks.py::TestExtractionBenchmark -v -s
JSON Compression (SmartCrusher)¶
Test: 100 production log entries with critical error at position 67 Task: Find the error, error code, resolution, and affected count
| Metric | Baseline | Headroom |
|---|---|---|
| Input tokens | 10,144 | 1,260 |
| Correct answers | 4/4 | 4/4 |
| Compression | — | 87.6% |
SmartCrusher preserves first N items (schema), last N items (recency), all anomalies (errors, warnings), and statistical distribution.
QA Accuracy Preservation¶
| Metric | Original HTML | Extracted | Delta |
|---|---|---|---|
| F1 Score | 0.85 | 0.87 | +0.02 |
| Exact Match | 60% | 62% | +2% |
Extraction Can Improve Accuracy
Removing HTML noise sometimes helps LLMs focus on relevant content.
Limitations¶
What Headroom Does NOT Compress¶
- Short messages (< 300 tokens) — overhead exceeds savings
- Source code — passes through unchanged to preserve correctness (unless tree-sitter AST compression is enabled)
- grep/search results — compact structured format, already minimal
- Images — counted at fixed token cost (~1,600 tokens), not compressed as text
- System prompts — preserved for prefix cache compatibility
Known Overhead Sources¶
- Token counting (P90: 16ms) — runs tiktoken twice (before + after compression)
- Tree-sitter AST parsing (P90: 886ms) — expensive for large code files
- Kompress ONNX (P90: 576ms) — ML inference on CPU for text compression
- Content detection (Magika) — ML classification of content type
When Headroom Adds the Most Value¶
- Long agent sessions with accumulated tool outputs (40-80% compression)
- JSON-heavy workflows (API responses, database queries) — 83-94% compression
- Build/test output — 85-94% compression
- Multi-tool agents — 60-76% compression across tool results
When Headroom Adds Little Value¶
- Short conversational exchanges — median 4.8% compression
- Code-only sessions (reading/writing files) — code passes through
- Single-turn requests — no accumulated context to compress
Methodology¶
Token-Level F1¶
Precision = |predicted ∩ ground_truth| / |predicted|
Recall = |predicted ∩ ground_truth| / |ground_truth|
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Compression Ratio¶
A 94.9% compression means the output is 5.1% of the original size.
Production Telemetry¶
- Collected via anonymous beacon (no prompts, no content, no PII)
- Image-inflated instances excluded (base64 counted as text tokens — fixed in v0.5.18)
- Multi-worker beacon spam excluded (per-instance MAX, not SUM)
- Opt-out:
HEADROOM_TELEMETRY=off
Reproducing Results¶
# Clone the repo
git clone https://github.com/chopratejas/headroom.git
cd headroom
# Install with eval dependencies
pip install -e ".[evals,html]"
# Run all benchmarks
pytest tests/test_evals/ -v -s
# Run compression benchmark
python -c "from headroom import compress; print(compress([{'role':'user','content':'test'}]))"