Benchmarks

Headroom's core promise: compress context without losing accuracy. This page shows accuracy benchmarks, compression performance, and real-world production telemetry from 250+ active proxy instances.

Key Results

98.2% recall on article extraction with 94.9% compression. 52ms median overhead in production. 1.4 billion tokens saved across 249 instances.


Compression Performance

Tested on Apple M-series (CPU), headroom v0.5.18. Each test runs compress() on realistic tool outputs.

Content Type Original (tokens) Compressed (tokens) Saved (tokens) Ratio Latency
JSON array (100 items) 3,163 297 2,866 90.6% 1ms
JSON array (500 items) 9,526 1,614 7,912 83.1% 2ms
Shell output (200 lines) 3,238 469 2,769 85.5% 1ms
Build log (200 lines) 2,412 148 2,264 93.9% 1ms
grep results (150 hits) 2,624 2,624 0 0.0% <1ms
Python source (~480 lines) 2,958 2,958 0 0.0% <1ms
Total 23,921 8,110 15,811 66.1% 5ms

Notes:

  • grep results and Python source show 0% compression — these are already compact structured formats. SmartCrusher only compresses JSON arrays; code passes through to preserve correctness.
  • Latency is for the compress() SDK call, not the full proxy round-trip.
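The Saved and Ratio columns follow directly from the token counts. A quick sanity check in Python, using figures from the table above (the helper name is illustrative, not part of the headroom API):

```python
def compression_stats(original: int, compressed: int):
    """Tokens saved and compression ratio for one row of the table."""
    return original - compressed, 1 - compressed / original

saved, ratio = compression_stats(3163, 297)    # JSON array (100 items)
print(saved, f"{ratio:.1%}")                   # 2866 90.6%

saved, ratio = compression_stats(23921, 8110)  # Total row
print(saved, f"{ratio:.1%}")                   # 15811 66.1%
```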

Production Telemetry

Real-world data from 50,000+ proxy sessions across 250+ unique instances (March 30 – April 2, 2026). Collected via anonymous telemetry beacon (opt-out: HEADROOM_TELEMETRY=off).

Proxy Overhead

Statistic Latency
Median (P50) 52ms
P90 309ms
P99 4,172ms
Mean 161ms

The median 52ms overhead is negligible compared to LLM inference time (typically 2-10 seconds).

Compression Rate

Statistic Compression
P25 4.8%
Median 4.8%
P75 6.9%
Mean 11.3%

Median compression is modest because many requests are short conversational turns. Heavy tool-use sessions (file reads, shell output) see 40-80% compression.

Pipeline Step Timing (Production Median)

Step Median P90 Description
pipeline_total 16.9ms 289ms Full compression pipeline
content_router 11.7ms 259ms Content detection + routing
compressor:smart_crusher 50.1ms 50ms JSON array compression
compressor:text 32.0ms 576ms Text compression (Kompress ONNX)
compressor:mixed 316ms 428ms Mixed content compression
compressor:code_aware 815ms 886ms Tree-sitter AST compression
_initial_token_count 2.9ms 16ms Token counting (tiktoken)
_deep_copy 0.1ms 0.3ms Message copy overhead

Fleet Summary

Metric Value
Clean instances 249
Total tokens saved 1.4 billion
Total $ saved ~$4,000
OS distribution Linux 57%, macOS 38%, Windows 5%
Top version 0.5.17 (77%)
Models used Claude Opus 4.6, Sonnet 4.6, Haiku 4.5

Accuracy Benchmarks

HTML Extraction

Dataset: Scrapinghub Article Extraction Benchmark
Samples: 181 HTML pages with ground truth article bodies
Baseline: trafilatura (0.958 F1)

Metric Value Description
F1 Score 0.919 Token-level overlap with ground truth
Precision 0.879 Proportion of extracted content that's relevant
Recall 0.982 Proportion of ground truth content captured
Compression 94.9% Average size reduction

For LLM applications, recall is critical — 98.2% means nearly all article content is preserved. The slight precision drop (some extra content) doesn't hurt LLM accuracy.

# Run it yourself
pip install "headroom-ai[html]" datasets
pytest tests/test_evals/test_html_oss_benchmarks.py::TestExtractionBenchmark -v -s

JSON Compression (SmartCrusher)

Test: 100 production log entries with a critical error at position 67
Task: Find the error, error code, resolution, and affected count

Metric Baseline Headroom
Input tokens 10,144 1,260
Correct answers 4/4 4/4
Compression 0% 87.6%

SmartCrusher preserves first N items (schema), last N items (recency), all anomalies (errors, warnings), and statistical distribution.
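The selection strategy can be sketched as a simple filter. This is an illustrative approximation, not Headroom's actual implementation: the parameter names and the `level`-based anomaly check are assumptions, and the real SmartCrusher also keeps a statistical summary of the dropped items.

```python
def smart_crush(items, keep_head=3, keep_tail=3):
    """Keep schema (head), recency (tail), and all anomalies; drop the rest.

    Illustrative only -- a stand-in for Headroom's SmartCrusher.
    """
    keep = set(range(keep_head)) | set(range(len(items) - keep_tail, len(items)))
    for i, item in enumerate(items):
        if item.get("level") in ("error", "warning"):  # anomaly heuristic
            keep.add(i)
    return [items[i] for i in sorted(keep)]

logs = [{"level": "info", "msg": f"ok {i}"} for i in range(100)]
logs[67] = {"level": "error", "msg": "disk full", "code": "E503"}
crushed = smart_crush(logs)
print(len(crushed))        # 7 of 100 items survive
print(crushed[3]["code"])  # E503 -- the error at position 67 is kept
```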

QA Accuracy Preservation

Metric Original HTML Extracted Delta
F1 Score 0.85 0.87 +0.02
Exact Match 60% 62% +2%

Extraction Can Improve Accuracy

Removing HTML noise sometimes helps LLMs focus on relevant content.


Limitations

What Headroom Does NOT Compress

  • Short messages (< 300 tokens) — overhead exceeds savings
  • Source code — passes through unchanged to preserve correctness (unless tree-sitter AST compression is enabled)
  • grep/search results — compact structured format, already minimal
  • Images — counted at fixed token cost (~1,600 tokens), not compressed as text
  • System prompts — preserved for prefix cache compatibility
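The short-message cutoff amounts to a gate before the pipeline runs. A minimal sketch, assuming a rough 4-characters-per-token estimate (Headroom itself counts with tiktoken; the function names here are illustrative):

```python
MIN_TOKENS = 300  # below this, compression overhead exceeds savings

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4

def should_compress(text: str) -> bool:
    return estimate_tokens(text) >= MIN_TOKENS

print(should_compress("short reply"))  # False -- passes through untouched
print(should_compress("x" * 2000))     # True (~500 estimated tokens)
```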

Known Overhead Sources

  • Token counting (P90: 16ms) — runs tiktoken twice (before + after compression)
  • Tree-sitter AST parsing (P90: 886ms) — expensive for large code files
  • Kompress ONNX (P90: 576ms) — ML inference on CPU for text compression
  • Content detection (Magika) — ML classification of content type

When Headroom Adds the Most Value

  • Long agent sessions with accumulated tool outputs (40-80% compression)
  • JSON-heavy workflows (API responses, database queries) — 83-94% compression
  • Build/test output — 85-94% compression
  • Multi-tool agents — 60-76% compression across tool results

When Headroom Adds Little Value

  • Short conversational exchanges — median 4.8% compression
  • Code-only sessions (reading/writing files) — code passes through
  • Single-turn requests — no accumulated context to compress

Methodology

Token-Level F1

Precision = |predicted ∩ ground_truth| / |predicted|
Recall = |predicted ∩ ground_truth| / |ground_truth|
F1 = 2 * (Precision * Recall) / (Precision + Recall)
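In code, with tokens compared as multisets (bag overlap), the three formulas above become:

```python
from collections import Counter

def token_f1(predicted: list[str], ground_truth: list[str]):
    """Token-level precision, recall, and F1 via multiset intersection."""
    overlap = sum((Counter(predicted) & Counter(ground_truth)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(ground_truth)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = token_f1("the cat sat".split(), "the cat sat down".split())
print(round(p, 2), round(r, 2), round(f1, 3))  # 1.0 0.75 0.857
```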

Compression Ratio

Compression = 1 - (compressed_size / original_size)

A 94.9% compression means the output is 5.1% of the original size.

Production Telemetry

  • Collected via anonymous beacon (no prompts, no content, no PII)
  • Image-inflated instances excluded (base64 counted as text tokens — fixed in v0.5.18)
  • Multi-worker beacon spam excluded (per-instance MAX, not SUM)
  • Opt-out: HEADROOM_TELEMETRY=off

Reproducing Results

# Clone the repo
git clone https://github.com/chopratejas/headroom.git
cd headroom

# Install with eval dependencies
pip install -e ".[evals,html]"

# Run all benchmarks
pytest tests/test_evals/ -v -s

# Smoke-test the compress() API
python -c "from headroom import compress; print(compress([{'role':'user','content':'test'}]))"