
Headroom

The Context Optimization Layer for LLM Applications

Compress everything your AI agent reads. Same answers, fraction of the tokens.


87% Avg Token Reduction
100% Answer Accuracy
6 Compression Algorithms
100+ LLM Providers

What It Does

Every tool call, DB query, file read, and RAG retrieval your agent makes is 70-95% boilerplate. Headroom compresses it away before it hits the model. The LLM sees less noise, responds faster, and costs less.

Your Agent / App
    │  tool outputs, logs, DB reads, RAG results, file reads, API responses
 Headroom  ← proxy, Python library, or framework integration
 LLM Provider  (OpenAI, Anthropic, Google, Bedrock, 100+ via LiteLLM)

Headroom works as a transparent proxy (zero code changes), a Python function (compress()), or a framework integration (LangChain, Agno, Strands, LiteLLM, MCP).


Quick Start

pip install "headroom-ai[all]"
headroom proxy
# Point any tool at the proxy
ANTHROPIC_BASE_URL=http://localhost:8787 claude
OPENAI_BASE_URL=http://localhost:8787/v1 your-app

That's it. Your existing code works unchanged, with 40-90% fewer tokens.

import anthropic

from headroom import compress

client = anthropic.Anthropic()  # any Python LLM client works here
messages = [...]                # your existing conversation history

result = compress(messages, model="claude-sonnet-4-5-20250929")
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    messages=result.messages,
)
print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")

Works with any Python LLM client. Full SDK guide →

headroom wrap claude       # Claude Code
headroom wrap codex        # OpenAI Codex CLI
headroom wrap aider        # Aider
headroom wrap cursor       # Cursor
headroom wrap openclaw     # OpenClaw plugin bootstrap

Starts the proxy, points your tool at it, compresses everything automatically.

import { compress } from 'headroom-ai';

const result = await compress(messages, { model: 'claude-sonnet-4-5-20250929' });
// Use result.messages with any LLM client
console.log(`Saved ${result.tokensSaved} tokens`);

Works with Vercel AI SDK, OpenAI Node SDK, and Anthropic TS SDK. Full TS guide →

import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

litellm.callbacks = [HeadroomCallback()]
# All 100+ providers now compressed automatically

Framework Integrations

LangChain

Wrap any chat model. Supports memory, retrievers, tools, streaming, async.

from headroom.integrations import HeadroomChatModel

llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

LangChain Guide →

Agno

Full agent framework integration with observability hooks.

from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(Claude(id="claude-sonnet-4-20250514"))
agent = Agent(model=model)

Agno Guide →

Strands

Model wrapping + tool output hook provider for Strands Agents.

from headroom.integrations.strands import HeadroomStrandsModel

model = HeadroomStrandsModel(wrapped_model=bedrock_model)
agent = Agent(model=model)

Strands Guide →

MCP Tools

Three tools for Claude Code, Cursor, or any MCP client: headroom_compress, headroom_retrieve, headroom_stats.

headroom mcp install && claude

MCP Guide →

TypeScript SDK

compress(), Vercel AI SDK middleware, OpenAI and Anthropic client wrappers.

npm install headroom-ai

TypeScript SDK Guide →

OpenClaw

ContextEngine plugin for OpenClaw agents. Auto-compresses context in assemble().

headroom wrap openclaw

OpenClaw Plugin →

All integration patterns →


How It Works

Headroom runs a three-stage pipeline on every request:

graph LR
    A[Your Prompt] --> B[CacheAligner]
    B --> C[ContentRouter]
    C --> D[IntelligentContext]
    D --> E[LLM Provider]

    C -->|JSON| F[SmartCrusher]
    C -->|Code| G[CodeCompressor]
    C -->|Text| H[Kompress]
    C -->|Logs| I[LogCompressor]

    F --> D
    G --> D
    H --> D
    I --> D

Stage 1: CacheAligner — Stabilizes message prefixes so the provider's KV cache actually hits. Anthropic's prompt caching, for example, discounts cached prefix reads by 90%; CacheAligner keeps the prefix byte-stable so that discount actually applies.
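The core idea is that a provider-side cache only hits when the serialized prefix is byte-identical between requests. A minimal sketch (the fingerprinting helper here is illustrative, not Headroom's API):

```python
import hashlib
import json

def prefix_fingerprint(messages, frozen_count):
    """Hash the frozen prefix of a conversation. If this changes between
    requests, the provider's KV cache misses and the read discount is lost."""
    blob = json.dumps(messages[:frozen_count], sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

turn1 = [{"role": "system", "content": "You are a helpful agent."},
         {"role": "user", "content": "List the files."}]
turn2 = turn1 + [{"role": "assistant", "content": "app.py, tests/"},
                 {"role": "user", "content": "Open app.py."}]

# The frozen prefix is identical across turns, so a cache-aware
# provider can reuse the already-computed KV entries for it.
assert prefix_fingerprint(turn1, 2) == prefix_fingerprint(turn2, 2)
```

CacheAligner's job is to guarantee that equality holds even as compression rewrites later messages.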

Stage 2: ContentRouter — Auto-detects content type (JSON, code, logs, search results, diffs, HTML, plain text) and routes each to the optimal compressor:

Content Type     Compressor        How It Works
JSON arrays      SmartCrusher      Statistical analysis: keeps errors, anomalies, boundaries. No hardcoded rules.
Source code      CodeCompressor    AST-aware (tree-sitter). Preserves function signatures, collapses bodies.
Plain text       Kompress          ModernBERT token classification. Removes redundant tokens while preserving meaning.
Build/test logs  LogCompressor     Keeps failures, errors, warnings. Drops passing noise.
Search results   SearchCompressor  Ranks by relevance to the user query, keeps top matches.
Git diffs        DiffCompressor    Preserves change hunks, drops unchanged context.
HTML             HTMLExtractor     Strips markup, extracts readable content.
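To make the routing step concrete, here is a deliberately crude content sniffer in the spirit of the table above (Headroom's real detector is more robust; the heuristics and function name here are mine):

```python
import json
import re

def detect_content_type(text: str) -> str:
    """Toy content-type detection: try cheap, high-precision checks first,
    fall back to plain text. Each branch maps to one compressor above."""
    stripped = text.strip()
    if stripped.startswith(("[", "{")):
        try:
            json.loads(stripped)
            return "json"          # -> SmartCrusher
        except ValueError:
            pass
    if re.search(r"^(def |class |function |import )", stripped, re.M):
        return "code"              # -> CodeCompressor
    if re.search(r"^\d{4}-\d{2}-\d{2}.*(ERROR|WARN|INFO)", stripped, re.M):
        return "logs"              # -> LogCompressor
    if stripped.startswith("diff --git") or re.search(r"^@@ ", stripped, re.M):
        return "diff"              # -> DiffCompressor
    if stripped.startswith("<") and "</" in stripped:
        return "html"              # -> HTMLExtractor
    return "text"                  # -> Kompress
```

For example, `detect_content_type('[{"id": 1}]')` routes to the JSON path, while a `diff --git` payload routes to the diff path.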

Stage 3: IntelligentContext — If the conversation still exceeds the model's context limit, scores each message by importance (recency, references, density) and drops the lowest-value ones.
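A minimal sketch of that drop-lowest-value strategy, assuming a rough 4-characters-per-token estimate (Headroom's scorer also weighs references and information density, which this toy version omits):

```python
def trim_to_budget(messages, max_tokens):
    """Drop the lowest-importance messages until the token estimate fits.
    Importance here = recency, plus a pin for the system prompt and the
    two newest messages. Illustrative only, not Headroom's scorer."""
    est = lambda m: max(len(m["content"]) // 4, 1)
    n = len(messages)
    scored = sorted(
        ((2.0 if (m["role"] == "system" or i >= n - 2) else 0.0)
         + i / max(n - 1, 1), i)                 # later = more important
        for i, m in enumerate(messages))
    total = sum(est(m) for m in messages)
    keep = set(range(n))
    for _, i in scored:                          # drop cheapest first
        if total <= max_tokens:
            break
        keep.discard(i)
        total -= est(messages[i])
    return [messages[i] for i in sorted(keep)]
```

With a long history and a tight budget, the middle of the conversation is sacrificed while the system prompt and the newest turns survive.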

Nothing is lost. Compressed content goes into the CCR store (Compress-Cache-Retrieve). The LLM gets a headroom_retrieve tool and can fetch full originals when it needs more detail.
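The shape of that compress-cache-retrieve loop can be sketched in a few lines (class and method names here are mine, not Headroom's API):

```python
import hashlib

class CCRStore:
    """Illustrative compress-cache-retrieve store: the original content
    is kept under a content hash, and the compressed stub carries the
    key so the model can pull the full text back on demand."""
    def __init__(self):
        self._originals = {}

    def compress(self, text: str, summary: str) -> str:
        key = hashlib.sha256(text.encode()).hexdigest()[:12]
        self._originals[key] = text
        return f"{summary} [ccr:{key}]"

    def retrieve(self, key: str) -> str:
        return self._originals[key]

store = CCRStore()
big = "line\n" * 1000
stub = store.compress(big, "1000 identical log lines")

# The LLM sees only the short stub; a retrieve tool call with the
# embedded key returns the untouched original.
key = stub.rsplit("[ccr:", 1)[1].rstrip("]")
assert store.retrieve(key) == big
```

In Headroom, the retrieval side of this loop is exposed to the model as the headroom_retrieve tool.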

Full architecture deep dive →


Results

100 production log entries. One critical error buried at position 67.

Metric           Baseline  Headroom
Input tokens     10,144    1,260
Correct answers  4/4       4/4

87.6% fewer tokens. Same answer. The FATAL error was automatically preserved — not by keyword matching, but by statistical analysis of field variance.
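A toy version of that idea, using value frequency rather than Headroom's variance analysis: rare field values are kept precisely because they are rare, with no keyword list for "ERROR" or "FATAL".

```python
from collections import Counter

def keep_rare_rows(records, field, threshold=0.05):
    """Keep rows whose value in `field` occurs in fewer than `threshold`
    of all rows. Purely statistical: no hardcoded severity keywords.
    (A simplified stand-in for the variance-based selection above.)"""
    counts = Counter(r[field] for r in records)
    n = len(records)
    return [r for r in records if counts[r[field]] / n < threshold]

logs = [{"level": "INFO", "msg": f"request {i} ok"} for i in range(99)]
logs.insert(67, {"level": "FATAL", "msg": "db connection pool exhausted"})

kept = keep_rare_rows(logs, "level")
# The single FATAL entry survives; the 99 routine INFO lines do not.
```

The same mechanism keeps anomalous status codes, latencies, or IDs in JSON tool output without anyone enumerating what "anomalous" means.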

Real Workloads

Scenario                   Before  After   Savings
Code search (100 results)  17,765  1,408   92%
SRE incident debugging     65,694  5,118   92%
Codebase exploration       78,502  41,254  47%
GitHub issue triage        54,174  14,761  73%

Accuracy Benchmarks

Benchmark   Category       N    Accuracy  Compression
GSM8K       Math           100  0.870     0.000 delta
TruthfulQA  Factual        100  0.560     +0.030 delta
SQuAD v2    QA             100  97%       19% reduction
BFCL        Tool/Function  100  97%       32% reduction
CCR Needle  Lossless       50   100%      77% reduction

Full benchmark methodology → | Known limitations →


Key Features

Lossless Compression (CCR)

Compresses aggressively, stores originals, gives the LLM a tool to retrieve full details. Nothing is thrown away. Learn more →

Smart Content Detection

Auto-detects JSON, code, logs, text, diffs, HTML. Routes each to the best compressor. Zero configuration needed. Learn more →

Cache Optimization

Stabilizes prefixes so provider KV caches hit. Tracks frozen messages to preserve the 90% read discount. Learn more →

Image Compression

40-90% token reduction via trained ML router. Automatically selects resize/quality tradeoff per image. Learn more →

Persistent Memory

Hierarchical memory (user/session/agent/turn) with SQLite + HNSW backends. Survives across conversations. Learn more →
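The scoping rule can be sketched as a fall-through lookup from the narrowest scope to the broadest (class and scope handling here are illustrative assumptions, not Headroom's memory API, and the real backends are SQLite + HNSW rather than dicts):

```python
class HierarchicalMemory:
    """Toy user/session/agent/turn memory: writes go to an explicit
    scope, reads fall through from the most specific scope outward."""
    SCOPES = ("turn", "agent", "session", "user")

    def __init__(self):
        self._store = {scope: {} for scope in self.SCOPES}

    def put(self, scope, key, value):
        self._store[scope][key] = value

    def get(self, key, default=None):
        for scope in self.SCOPES:          # narrowest scope wins
            if key in self._store[scope]:
                return self._store[scope][key]
        return default
```

A user-level preference persists across conversations, while a turn-level value can temporarily shadow it.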

Failure Learning

Reads past sessions, finds failed tool calls, correlates with what succeeded, writes learnings to CLAUDE.md. Learn more →

Multi-Agent Context

Compress what moves between agents. Any framework.

ctx = SharedContext()
ctx.put("research", big_output)
summary = ctx.get("research")  # ~80% smaller
Learn more →

Metrics & Observability

Prometheus endpoint, per-request logging, cost tracking, budget limits, pipeline timing breakdowns. Learn more →


Cloud Providers

Works with any LLM provider out of the box:

headroom proxy                                          # Direct Anthropic/OpenAI
headroom proxy --backend bedrock --region us-east-1     # AWS Bedrock
headroom proxy --backend vertex_ai --region us-central1 # Google Vertex AI
headroom proxy --backend azure                          # Azure OpenAI
headroom proxy --backend openrouter                     # OpenRouter (400+ models)

Or via LiteLLM for 100+ providers (Together, Groq, Fireworks, Ollama, vLLM, etc.).


Installation

pip install headroom-ai                # Core library (Python)
pip install "headroom-ai[all]"         # Everything (recommended)
npm install headroom-ai                # TypeScript / Node.js
pip install "headroom-ai[proxy]"       # Proxy server + MCP tools
pip install "headroom-ai[ml]"          # ML compression (Kompress, requires torch)
pip install "headroom-ai[langchain]"   # LangChain integration
pip install "headroom-ai[agno]"        # Agno integration
pip install "headroom-ai[evals]"       # Evaluation framework

Requires Python 3.10+.


Apache 2.0 — Free for commercial use. GitHub | PyPI | Discord