Headroom¶
The Context Optimization Layer for LLM Applications
Compress everything your AI agent reads. Same answers, fraction of the tokens.
What It Does¶
Every tool call, DB query, file read, and RAG retrieval your agent makes is 70-95% boilerplate. Headroom compresses it away before it hits the model. The LLM sees less noise, responds faster, and costs less.
Your Agent / App
│
│ tool outputs, logs, DB reads, RAG results, file reads, API responses
▼
Headroom ← proxy, Python library, or framework integration
│
▼
LLM Provider (OpenAI, Anthropic, Google, Bedrock, 100+ via LiteLLM)
Headroom works as a transparent proxy (zero code changes), a Python function (compress()), or a framework integration (LangChain, Agno, Strands, LiteLLM, MCP).
Quick Start¶
# Point any tool at the proxy
ANTHROPIC_BASE_URL=http://localhost:8787 claude
OPENAI_BASE_URL=http://localhost:8787/v1 your-app
That's it. Your existing code works unchanged, with 40-90% fewer tokens.
from anthropic import Anthropic
from headroom import compress

client = Anthropic()  # any Python LLM client works the same way

# messages: your normal chat message list
result = compress(messages, model="claude-sonnet-4-5-20250929")
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    messages=result.messages,
)
print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")
Works with any Python LLM client. Full SDK guide →
headroom wrap claude # Claude Code
headroom wrap codex # OpenAI Codex CLI
headroom wrap aider # Aider
headroom wrap cursor # Cursor
headroom wrap openclaw # OpenClaw plugin bootstrap
Each command starts the proxy, points your tool at it, and compresses everything automatically.
import { compress } from 'headroom-ai';
const result = await compress(messages, { model: 'claude-sonnet-4-5-20250929' });
// Use result.messages with any LLM client
console.log(`Saved ${result.tokensSaved} tokens`);
Works with Vercel AI SDK, OpenAI Node SDK, and Anthropic TS SDK. Full TS guide →
Framework Integrations¶
LangChain¶
Wrap any chat model. Supports memory, retrievers, tools, streaming, async.
Agno¶
Full agent framework integration with observability hooks.
Strands¶
Model wrapping + tool output hook provider for Strands Agents.
MCP Tools¶
Three tools for Claude Code, Cursor, or any MCP client: headroom_compress, headroom_retrieve, headroom_stats.
TypeScript SDK¶
compress(), Vercel AI SDK middleware, OpenAI and Anthropic client wrappers.
OpenClaw¶
ContextEngine plugin for OpenClaw agents. Auto-compresses context in assemble().
How It Works¶
Headroom runs a three-stage pipeline on every request:
graph LR
A[Your Prompt] --> B[CacheAligner]
B --> C[ContentRouter]
C --> D[IntelligentContext]
D --> E[LLM Provider]
C -->|JSON| F[SmartCrusher]
C -->|Code| G[CodeCompressor]
C -->|Text| H[Kompress]
C -->|Logs| I[LogCompressor]
F --> D
G --> D
H --> D
I --> D
Stage 1: CacheAligner — Stabilizes message prefixes so the provider's KV cache actually hits. Claude offers a 90% read discount on cached prefixes; CacheAligner makes that work.
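The prefix-stabilization idea can be sketched in a few lines. This is an illustrative toy, not Headroom's actual aligner (which also tracks frozen messages across requests); the function name and `frozen_count` parameter are made up for the example:

```python
import json

def align_for_cache(messages, frozen_count=2):
    """Split a conversation into a frozen prefix (kept byte-identical
    across requests so the provider's KV cache can hit) and a mutable
    tail that is safe to compress or rewrite."""
    frozen = messages[:frozen_count]    # e.g. system prompt + first turn
    mutable = messages[frozen_count:]
    # Rewriting happens only on the tail; if the serialized prefix
    # changes between requests, the cache read discount is lost.
    prefix_key = json.dumps(frozen, sort_keys=True)
    return frozen, mutable, prefix_key

msgs = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "Deploy the service."},
    {"role": "tool", "content": "...long tool output..."},
]
frozen, mutable, key = align_for_cache(msgs)
```

The key point is that `prefix_key` is deterministic: the same frozen prefix always serializes to the same bytes, which is what lets the provider recognize it as cached.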
Stage 2: ContentRouter — Auto-detects content type (JSON, code, logs, search results, diffs, HTML, plain text) and routes each to the optimal compressor:
| Content Type | Compressor | How It Works |
|---|---|---|
| JSON arrays | SmartCrusher | Statistical analysis: keeps errors, anomalies, boundaries. No hardcoded rules. |
| Source code | CodeCompressor | AST-aware (tree-sitter). Preserves function signatures, collapses bodies. |
| Plain text | Kompress | ModernBERT token classification. Removes redundant tokens while preserving meaning. |
| Build/test logs | LogCompressor | Keeps failures, errors, warnings. Drops passing noise. |
| Search results | SearchCompressor | Ranks by relevance to user query, keeps top matches. |
| Git diffs | DiffCompressor | Preserves change hunks, drops unchanged context. |
| HTML | HTMLExtractor | Strips markup, extracts readable content. |
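The SmartCrusher row above is the easiest to illustrate. Here is a minimal sketch of the statistical idea, not Headroom's implementation; the 0.8 modal-frequency threshold and the boundary count are arbitrary choices for the example:

```python
import json
from collections import Counter

def crush(items, keep_boundaries=2):
    """Keep the statistically interesting rows of a homogeneous JSON
    array: the boundaries, plus any row that deviates from the modal
    value of a low-variance field. Errors and anomalies stand out by
    variance, not by a hardcoded keyword list."""
    if len(items) <= 2 * keep_boundaries:
        return items
    kept = set(range(keep_boundaries)) | set(range(len(items) - keep_boundaries, len(items)))
    for field in items[0].keys():
        values = [json.dumps(it.get(field)) for it in items]  # hashable form
        mode, freq = Counter(values).most_common(1)[0]
        if freq / len(items) >= 0.8:          # low-variance field
            for i, v in enumerate(values):
                if v != mode:                 # deviation from the mode → keep
                    kept.add(i)
    return [items[i] for i in sorted(kept)]

logs = [{"id": i, "level": "INFO"} for i in range(100)]
logs[67]["level"] = "FATAL"
crushed = crush(logs)
```

Note that the `id` field is ignored automatically (every value is unique, so nothing is modal), while the single `FATAL` row survives because it deviates from a field that is otherwise 99% `INFO`.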
Stage 3: IntelligentContext — If the conversation still exceeds the model's context limit, scores each message by importance (recency, references, density) and drops the lowest-value ones.
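A toy version of that scoring-and-dropping loop, assuming a simple recency + density score (Headroom's real scorer also weighs cross-references; the 50-word density cap here is arbitrary):

```python
def score(msg, index, total):
    recency = index / max(total - 1, 1)                        # newer → higher
    density = min(len(set(msg["content"].split())) / 50, 1.0)  # crude info density
    pinned = 10.0 if msg["role"] == "system" else 0.0          # never drop system
    return pinned + recency + density

def fit_to_budget(messages, max_chars):
    """Drop the lowest-scoring messages until the conversation fits."""
    msgs = list(messages)
    while sum(len(m["content"]) for m in msgs) > max_chars and len(msgs) > 1:
        victim = min(range(len(msgs)), key=lambda i: score(msgs[i], i, len(msgs)))
        del msgs[victim]
    return msgs

msgs = [
    {"role": "system", "content": "Always answer concisely."},
    {"role": "user", "content": "noise " * 100},
    {"role": "user", "content": "What failed in the last deploy?"},
]
trimmed = fit_to_budget(msgs, max_chars=120)
```

The low-density filler message is dropped first; the system prompt and the most recent question survive.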
Nothing is lost. Compressed content goes into the CCR store (Compress-Cache-Retrieve). The LLM gets a headroom_retrieve tool and can fetch full originals when it needs more detail.
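The compress-cache-retrieve contract can be sketched like this (an in-memory stand-in; the real CCR store persists originals and exposes retrieval as the `headroom_retrieve` tool):

```python
import hashlib

class CCRStore:
    """Sketch of Compress-Cache-Retrieve: the model sees a short
    summary plus a retrieval key; the original is never thrown away."""
    def __init__(self):
        self._originals = {}

    def compress(self, text, summary):
        key = hashlib.sha256(text.encode()).hexdigest()[:12]
        self._originals[key] = text
        # The placeholder tells the model how to get the full content back.
        return f"{summary} [full content: headroom_retrieve('{key}')]"

    def retrieve(self, key):
        return self._originals[key]

original = "FAIL step 7: linker error\n" + "ok\n" * 500
store = CCRStore()
placeholder = store.compress(original, "build failed at step 7")
```

The placeholder costs a few dozen tokens; the 500-line log is only paid for if the model actually asks for it.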
Results¶
100 production log entries. One critical error buried at position 67.
| Metric | Baseline | Headroom |
|---|---|---|
| Input tokens | 10,144 | 1,260 |
| Correct answers | 4/4 | 4/4 |
87.6% fewer tokens. Same answer. The FATAL error was automatically preserved — not by keyword matching, but by statistical analysis of field variance.
Real Workloads¶
| Scenario | Before | After | Savings |
|---|---|---|---|
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| Codebase exploration | 78,502 | 41,254 | 47% |
| GitHub issue triage | 54,174 | 14,761 | 73% |
Accuracy Benchmarks¶
| Benchmark | Category | N | Accuracy | Compression |
|---|---|---|---|---|
| GSM8K | Math | 100 | 87.0% | ±0.0% delta |
| TruthfulQA | Factual | 100 | 56.0% | +3.0% delta |
| SQuAD v2 | QA | 100 | 97% | 19% reduction |
| BFCL | Tool/Function | 100 | 97% | 32% reduction |
| CCR Needle | Lossless | 50 | 100% | 77% reduction |
Full benchmark methodology → | Known limitations →
Key Features¶
Lossless Compression (CCR)¶
Compresses aggressively, stores originals, gives the LLM a tool to retrieve full details. Nothing is thrown away. Learn more →
Smart Content Detection¶
Auto-detects JSON, code, logs, text, diffs, HTML. Routes each to the best compressor. Zero configuration needed. Learn more →
Cache Optimization¶
Stabilizes prefixes so provider KV caches hit. Tracks frozen messages to preserve the 90% read discount. Learn more →
Image Compression¶
40-90% token reduction via trained ML router. Automatically selects resize/quality tradeoff per image. Learn more →
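The resize side of that tradeoff can be sketched with Anthropic's documented approximation of roughly `(width × height) / 750` tokens per image (other providers count differently, and the function name here is illustrative, not Headroom's API):

```python
def scale_for_budget(width, height, max_tokens):
    """Pick a uniform downscale factor so an image fits a token budget,
    using the ~(w*h)/750 approximation for Claude vision inputs."""
    tokens = (width * height) / 750
    if tokens <= max_tokens:
        return 1.0, int(tokens)          # already within budget
    # Token cost scales with pixel area, so the factor is a square root.
    factor = (max_tokens * 750 / (width * height)) ** 0.5
    new_tokens = int((width * factor) * (height * factor) / 750)
    return factor, new_tokens

factor, toks = scale_for_budget(2048, 1536, max_tokens=1000)
```

A 2048×1536 screenshot (~4,194 tokens) gets scaled to roughly half its linear size to land at the 1,000-token budget; Headroom's trained router additionally picks a JPEG quality level per image.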
Persistent Memory¶
Hierarchical memory (user/session/agent/turn) with SQLite + HNSW backends. Survives across conversations. Learn more →
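A minimal sketch of the scoped-memory idea on SQLite, showing only scoping and persistence (the real backend also keeps an HNSW vector index for semantic lookup; class and method names here are illustrative):

```python
import sqlite3

class Memory:
    """Scoped key-value memory: narrower scopes shadow broader ones."""
    SCOPES = ("user", "session", "agent", "turn")   # broad → narrow

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)             # a file path survives restarts
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS mem (scope TEXT, key TEXT, value TEXT, "
            "PRIMARY KEY (scope, key))")

    def put(self, scope, key, value):
        assert scope in self.SCOPES
        self.db.execute("INSERT OR REPLACE INTO mem VALUES (?, ?, ?)",
                        (scope, key, value))
        self.db.commit()

    def get(self, key):
        # Check narrowest scope first: turn overrides agent, session, user.
        for scope in reversed(self.SCOPES):
            row = self.db.execute(
                "SELECT value FROM mem WHERE scope=? AND key=?",
                (scope, key)).fetchone()
            if row:
                return row[0]
        return None

m = Memory()
m.put("user", "name", "Ada")
m.put("turn", "name", "override")
```

Per-turn facts shadow long-lived user facts, while anything written at user scope persists across sessions when backed by a file.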
Failure Learning¶
Reads past sessions, finds failed tool calls, correlates with what succeeded, writes learnings to CLAUDE.md. Learn more →
Metrics & Observability¶
Prometheus endpoint, per-request logging, cost tracking, budget limits, pipeline timing breakdowns. Learn more →
Cloud Providers¶
Works with any LLM provider out of the box:
headroom proxy # Direct Anthropic/OpenAI
headroom proxy --backend bedrock --region us-east-1 # AWS Bedrock
headroom proxy --backend vertex_ai --region us-central1 # Google Vertex AI
headroom proxy --backend azure # Azure OpenAI
headroom proxy --backend openrouter # OpenRouter (400+ models)
Or via LiteLLM for 100+ providers (Together, Groq, Fireworks, Ollama, vLLM, etc.).
Installation¶
pip install headroom-ai                # Core library (Python)
pip install "headroom-ai[all]"         # Everything (recommended)
pip install "headroom-ai[proxy]"       # Proxy server + MCP tools
pip install "headroom-ai[ml]"          # ML compression (Kompress, requires torch)
pip install "headroom-ai[langchain]"   # LangChain integration
pip install "headroom-ai[agno]"        # Agno integration
pip install "headroom-ai[evals]"       # Evaluation framework
npm install headroom-ai                # TypeScript / Node.js
Requires Python 3.10+.
Next Steps¶
- Quickstart — Running in 5 minutes
- Integration Guide — Every way to add Headroom to your stack
- Architecture — How the pipeline works under the hood
- Benchmarks — Accuracy and latency data
- Limitations — When compression helps and when it doesn't
Apache 2.0 — Free for commercial use. GitHub | PyPI | Discord