Integration Guide¶
You don't need to run the Headroom proxy. Headroom is a compression library that works with any LLM client, proxy, or framework.
Pick Your Path¶
| You have... | Use this | Setup |
|---|---|---|
| Any Python app | compress() | 2 lines |
| LiteLLM | LiteLLM callback | 1 line |
| A Python proxy (FastAPI, custom) | ASGI middleware | 1 line |
| Claude Code / Cursor | Headroom proxy | 1 env var |
| Agno agents | Agno integration | Wrap model |
| LangChain | LangChain integration | Wrap model |
| Non-Python app | Headroom proxy | HTTP |
| TypeScript SDK | compress() | npm install headroom-ai |
| Vercel AI SDK | headroomMiddleware() | Middleware adapter |
| OpenAI Node SDK | withHeadroom() | Client wrapper |
| Anthropic TS SDK | withHeadroom() | Client wrapper |
compress() Function¶
The simplest integration. Works with any LLM client.
from headroom import compress
# Before sending to your LLM:
result = compress(messages, model="claude-sonnet-4-5-20250929")
response = your_client.create(messages=result.messages) # Fewer tokens, same answer
print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")
With Anthropic SDK¶
from anthropic import Anthropic
from headroom import compress
client = Anthropic()
messages = [
    {"role": "user", "content": "What went wrong?"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Let me check."},
        {"type": "tool_use", "id": "toolu_1", "name": "fetch_logs", "input": {}},  # illustrative tool call
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_1", "content": huge_json},
    ]},
]
compressed = compress(messages, model="claude-sonnet-4-5-20250929")
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    messages=compressed.messages,
    max_tokens=1000,
)
With OpenAI SDK¶
from openai import OpenAI
from headroom import compress
client = OpenAI()
messages = [
    {"role": "user", "content": "Analyze these results"},
    {"role": "assistant", "tool_calls": [  # illustrative call matching the tool result below
        {"id": "call_1", "type": "function", "function": {"name": "run_query", "arguments": "{}"}},
    ]},
    {"role": "tool", "content": big_json_output, "tool_call_id": "call_1"},
]
compressed = compress(messages, model="gpt-4o")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=compressed.messages,
)
With LiteLLM (direct)¶
import litellm
from headroom import compress
messages = [...]
compressed = compress(messages, model="bedrock/claude-sonnet")
response = litellm.completion(model="bedrock/claude-sonnet", messages=compressed.messages)
With any HTTP client¶
import httpx
from headroom import compress
compressed = compress(messages, model="claude-sonnet-4-5-20250929")
httpx.post("https://api.anthropic.com/v1/messages", json={
    "model": "claude-sonnet-4-5-20250929",
    "messages": compressed.messages,
    "max_tokens": 1000,
}, headers={"X-Api-Key": api_key, "anthropic-version": "2023-06-01"})
What compress() returns¶
result = compress(messages, model="gpt-4o")
result.messages # list[dict] — compressed messages, same format as input
result.tokens_before # int — original token count
result.tokens_after # int — compressed token count
result.tokens_saved # int — tokens removed
result.compression_ratio # float — 0.0 (no savings) to 1.0 (100% removed)
result.transforms_applied # list[str] — what ran (e.g., ["router:smart_crusher:0.35"])
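These fields plug straight into logging or metrics. A minimal sketch, assuming the ratio is computed as tokens_saved / tokens_before (consistent with the 0.0 to 1.0 range described above) and reusing the result object from the snippet above:
ratio = result.tokens_saved / max(result.tokens_before, 1)  # guard against empty input
print(
    f"{result.tokens_before} -> {result.tokens_after} tokens, "
    f"{ratio:.0%} saved via {result.transforms_applied}"
)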
LiteLLM¶
If you're already using LiteLLM as your LLM gateway, add Headroom as a callback:
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback
litellm.callbacks = [HeadroomCallback()]
# All calls now compressed automatically
response = litellm.completion(model="gpt-4o", messages=[...])
response = litellm.completion(model="bedrock/claude-sonnet", messages=[...])
response = litellm.completion(model="azure/gpt-4o", messages=[...])
The callback compresses messages in LiteLLM's pre_call_hook before they're sent to the provider. Works with all 100+ LiteLLM-supported providers.
With LiteLLM Proxy¶
If you run LiteLLM as a proxy server, use the ASGI middleware instead:
# In your LiteLLM proxy startup
from litellm.proxy.proxy_server import app
from headroom.integrations.asgi import CompressionMiddleware
app.add_middleware(CompressionMiddleware)
Or use the callback in your LiteLLM config:
# litellm_config.yaml
litellm_settings:
  callbacks: ["headroom.integrations.litellm_callback.HeadroomCallback"]
ASGI Middleware¶
Drop-in middleware for any ASGI application (FastAPI, Starlette, LiteLLM proxy, custom proxies).
from headroom.integrations.asgi import CompressionMiddleware
# FastAPI
from fastapi import FastAPI
app = FastAPI()
app.add_middleware(CompressionMiddleware)
# Starlette
from starlette.applications import Starlette
app = Starlette(routes=[...])
app.add_middleware(CompressionMiddleware)
# LiteLLM proxy
from litellm.proxy.proxy_server import app
app.add_middleware(CompressionMiddleware)
The middleware intercepts POST requests to /v1/messages, /v1/chat/completions, /v1/responses, and /chat/completions. All other requests pass through untouched.
Response headers include:
- x-headroom-compressed: true — compression was applied
- x-headroom-tokens-saved: 1234 — tokens removed
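A quick way to confirm the middleware is active is to check these headers from the client. A minimal sketch, assuming the wrapped app is served locally on port 8000 and exposes an OpenAI-style /v1/chat/completions route:
import httpx
resp = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize this log"}]},
)
if resp.headers.get("x-headroom-compressed") == "true":
    print("Tokens saved:", resp.headers["x-headroom-tokens-saved"])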
Proxy¶
The Headroom proxy is a standalone HTTP server. Best for non-Python apps or tools that only support base URL configuration (Claude Code, Cursor).
# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude
# Cursor / Any OpenAI client
OPENAI_BASE_URL=http://localhost:8787/v1 cursor
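Any SDK that lets you override the base URL can be pointed at the proxy the same way. A minimal sketch with the OpenAI Python SDK, assuming the proxy is running locally on port 8787 as in the examples above:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8787/v1")  # route requests through the Headroom proxy
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What went wrong?"}],
)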
With Cloud Providers¶
# AWS Bedrock
headroom proxy --backend bedrock --region us-east-1
# Google Vertex AI
headroom proxy --backend vertex_ai --region us-central1
# Azure OpenAI
headroom proxy --backend azure
# OpenRouter (400+ models)
OPENROUTER_API_KEY=sk-or-... headroom proxy --backend openrouter
See Proxy Documentation for all options.
Agno¶
Full integration with the Agno agent framework.
from agno.agent import Agent
from agno.models.anthropic import Claude
from headroom.integrations.agno import HeadroomAgnoModel
model = HeadroomAgnoModel(Claude(id="claude-sonnet-4-20250514"))
agent = Agent(model=model, tools=[your_tools])
response = agent.run("Investigate the issue")
print(f"Tokens saved: {model.total_tokens_saved}")
See Agno Guide for hooks, multi-provider, and streaming.
LangChain¶
Full integration with LangChain — chat models, memory, retrievers, tool wrappers, and streaming.
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
response = llm.invoke("Hello!")
See LangChain Guide for details and known limitations.
TypeScript SDK¶
For Node.js, Next.js, and any TypeScript/JavaScript application.
See the TypeScript SDK Guide for full documentation including Vercel AI SDK middleware, OpenAI SDK wrapper, and Anthropic SDK wrapper.
OpenClaw¶
Context compression plugin for OpenClaw agents.
Configured as the agent's context engine, the plugin auto-detects a running Headroom proxy or starts one. Compression happens in assemble(), with zero changes to the agent's behavior.
See the OpenClaw plugin documentation for full setup.
Compression Hooks (Advanced)¶
Customize compression behavior without modifying Headroom's code:
from headroom import compress, CompressionHooks, CompressContext
class MyHooks(CompressionHooks):
    def pre_compress(self, messages, ctx):
        # Modify messages before compression (dedup, filter, inject)
        return messages
    def compute_biases(self, messages, ctx):
        # Per-message compression aggressiveness
        # >1.0 = keep more, <1.0 = compress more
        return {5: 1.5, 6: 0.5}  # Keep message 5, compress message 6
    def post_compress(self, event):
        # Observe results (logging, analytics, learning)
        print(f"Saved {event.tokens_saved} tokens")
result = compress(messages, model="gpt-4o", hooks=MyHooks())
See Architecture for how hooks integrate with the pipeline.
FAQ¶
Q: Does Headroom change the response format? No. Your LLM returns the same response format. Headroom only modifies the input messages.
Q: What if compression removes something the LLM needs? Headroom stores originals in CCR (Compress-Cache-Retrieve). The LLM can call headroom_retrieve to get the full uncompressed content, and compression summaries tell the LLM what's available.
Q: Does it work with streaming? Yes. Compression happens before the request is sent. Streaming responses are unaffected.
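For example, with the OpenAI SDK you compress the input and stream the response exactly as before (a sketch, given a messages list as in the examples above; only the request messages change):
from openai import OpenAI
from headroom import compress
client = OpenAI()
compressed = compress(messages, model="gpt-4o")
stream = client.chat.completions.create(model="gpt-4o", messages=compressed.messages, stream=True)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")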
Q: How much latency does it add? 15-200ms, depending on content size and type. Small JSON arrays take ~15ms; large tool outputs take 100-200ms. The token savings typically buy back far more time on the LLM side than compression adds: a 50% token reduction on a Sonnet call saves seconds of generation time. See Latency Benchmarks for real numbers.