Integration Guide¶
You don't need to run the Headroom proxy. Headroom is a compression library that works with any LLM client, proxy, or framework.
Pick Your Path¶
| You have... | Use this | Setup |
|---|---|---|
| Any Python app | compress() | 2 lines |
| LiteLLM | LiteLLM callback | 1 line |
| A Python proxy (FastAPI, custom) | ASGI middleware | 1 line |
| Claude Code / Cursor | Headroom proxy | 1 env var |
| Agno agents | Agno integration | Wrap model |
| LangChain | LangChain integration | Wrap model |
| Non-Python app | Headroom proxy | HTTP |
| TypeScript SDK | compress() | npm install headroom-ai |
| Vercel AI SDK | headroomMiddleware() | Middleware adapter |
| OpenAI Node SDK | withHeadroom() | Client wrapper |
| Anthropic TS SDK | withHeadroom() | Client wrapper |
compress() Function¶
The simplest integration. Works with any LLM client.
from headroom import compress
# Before sending to your LLM:
result = compress(messages, model="claude-sonnet-4-5-20250929")
response = your_client.create(messages=result.messages) # Fewer tokens, same answer
print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")
With Anthropic SDK¶
from anthropic import Anthropic
from headroom import compress
client = Anthropic()
messages = [
    {"role": "user", "content": "What went wrong?"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Let me check."},
        {"type": "tool_use", "id": "toolu_1", "name": "fetch_logs", "input": {}},  # illustrative tool call
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_1", "content": huge_json},
    ]},
]
compressed = compress(messages, model="claude-sonnet-4-5-20250929")
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    messages=compressed.messages,
    max_tokens=1000,
)
With OpenAI SDK¶
from openai import OpenAI
from headroom import compress
client = OpenAI()
messages = [
    {"role": "user", "content": "Analyze these results"},
    {"role": "assistant", "tool_calls": [  # illustrative call matching the tool result below
        {"id": "call_1", "type": "function", "function": {"name": "run_query", "arguments": "{}"}},
    ]},
    {"role": "tool", "content": big_json_output, "tool_call_id": "call_1"},
]
compressed = compress(messages, model="gpt-4o")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=compressed.messages,
)
With LiteLLM (direct)¶
import litellm
from headroom import compress
messages = [...]
compressed = compress(messages, model="bedrock/claude-sonnet")
response = litellm.completion(model="bedrock/claude-sonnet", messages=compressed.messages)
With any HTTP client¶
import httpx
from headroom import compress
compressed = compress(messages, model="claude-sonnet-4-5-20250929")
httpx.post("https://api.anthropic.com/v1/messages", json={
    "model": "claude-sonnet-4-5-20250929",
    "messages": compressed.messages,
    "max_tokens": 1000,
}, headers={"X-Api-Key": api_key, "anthropic-version": "2023-06-01"})
What compress() returns¶
result = compress(messages, model="gpt-4o")
result.messages # list[dict] — compressed messages, same format as input
result.tokens_before # int — original token count
result.tokens_after # int — compressed token count
result.tokens_saved # int — tokens removed
result.compression_ratio # float — 0.0 (no savings) to 1.0 (100% removed)
result.transforms_applied # list[str] — what ran (e.g., ["router:smart_crusher:0.35"])
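These fields plug straight into logging or metrics. A minimal sketch, assuming the ratio is computed as tokens_saved / tokens_before (consistent with the 0.0 to 1.0 range described above) and reusing the result object from the snippet above:
ratio = result.tokens_saved / max(result.tokens_before, 1)  # guard against empty input
print(
    f"{result.tokens_before} -> {result.tokens_after} tokens, "
    f"{ratio:.0%} saved via {result.transforms_applied}"
)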
LiteLLM¶
If you're already using LiteLLM as your LLM gateway, add Headroom as a callback:
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback
litellm.callbacks = [HeadroomCallback()]
# All calls now compressed automatically
response = litellm.completion(model="gpt-4o", messages=[...])
response = litellm.completion(model="bedrock/claude-sonnet", messages=[...])
response = litellm.completion(model="azure/gpt-4o", messages=[...])
The callback compresses messages in LiteLLM's pre_call_hook before they're sent to the provider. Works with all 100+ LiteLLM-supported providers.
With LiteLLM Proxy¶
If you run LiteLLM as a proxy server, use the ASGI middleware instead:
# In your LiteLLM proxy startup
from litellm.proxy.proxy_server import app
from headroom.integrations.asgi import CompressionMiddleware
app.add_middleware(CompressionMiddleware)
Or use the callback in your LiteLLM config:
# litellm_config.yaml
litellm_settings:
  callbacks: ["headroom.integrations.litellm_callback.HeadroomCallback"]
ASGI Middleware¶
Drop-in middleware for any ASGI application (FastAPI, Starlette, LiteLLM proxy, custom proxies).
from headroom.integrations.asgi import CompressionMiddleware
# FastAPI
from fastapi import FastAPI
app = FastAPI()
app.add_middleware(CompressionMiddleware)
# Starlette
from starlette.applications import Starlette
app = Starlette(routes=[...])
app.add_middleware(CompressionMiddleware)
# LiteLLM proxy
from litellm.proxy.proxy_server import app
app.add_middleware(CompressionMiddleware)
The middleware intercepts POST requests to /v1/messages, /v1/chat/completions, /v1/responses, and /chat/completions. All other requests pass through untouched.
Response headers include:
- x-headroom-compressed: true — compression was applied
- x-headroom-tokens-saved: 1234 — tokens removed
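A quick way to confirm the middleware is active is to check these headers from the client. A minimal sketch, assuming the wrapped app is served locally on port 8000 and exposes an OpenAI-style /v1/chat/completions route:
import httpx
resp = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize this log"}]},
)
if resp.headers.get("x-headroom-compressed") == "true":
    print("Tokens saved:", resp.headers["x-headroom-tokens-saved"])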
Proxy¶
The Headroom proxy is a standalone HTTP server. Best for non-Python apps or tools that only support base URL configuration (Claude Code, Cursor).
# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude
# Cursor / Any OpenAI client
OPENAI_BASE_URL=http://localhost:8787/v1 cursor
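Any SDK that lets you override the base URL can be pointed at the proxy the same way. A minimal sketch with the OpenAI Python SDK, assuming the proxy is running locally on port 8787 as in the examples above:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8787/v1")  # route requests through the Headroom proxy
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What went wrong?"}],
)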
With Cloud Providers¶
# AWS Bedrock
headroom proxy --backend bedrock --region us-east-1
# Google Vertex AI
headroom proxy --backend vertex_ai --region us-central1
# Azure OpenAI
headroom proxy --backend azure
# OpenRouter (400+ models)
OPENROUTER_API_KEY=sk-or-... headroom proxy --backend openrouter
See Proxy Documentation for all options.
Agno¶
Full integration with the Agno agent framework.
from agno.agent import Agent
from agno.models.anthropic import Claude
from headroom.integrations.agno import HeadroomAgnoModel
model = HeadroomAgnoModel(Claude(id="claude-sonnet-4-20250514"))
agent = Agent(model=model, tools=[your_tools])
response = agent.run("Investigate the issue")
print(f"Tokens saved: {model.total_tokens_saved}")
See Agno Guide for hooks, multi-provider, and streaming.
LangChain¶
Full integration with LangChain — chat models, memory, retrievers, tool wrappers, and streaming.
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
response = llm.invoke("Hello!")
See LangChain Guide for details and known limitations.
TypeScript SDK¶
For Node.js, Next.js, and any TypeScript/JavaScript application.
See the TypeScript SDK Guide for full documentation including Vercel AI SDK middleware, OpenAI SDK wrapper, and Anthropic SDK wrapper.
OpenClaw¶
Context compression plugin for OpenClaw agents.
Configured as the agent's context engine, the plugin auto-detects a running Headroom proxy or starts one. Compression happens in assemble(), with zero changes to the agent's behavior.
See the OpenClaw plugin documentation for full setup.
Compression Hooks (Advanced)¶
Customize compression behavior without modifying Headroom's code:
from headroom import compress, CompressionHooks, CompressContext
class MyHooks(CompressionHooks):
    def pre_compress(self, messages, ctx):
        # Modify messages before compression (dedup, filter, inject)
        return messages
    def compute_biases(self, messages, ctx):
        # Per-message compression aggressiveness
        # >1.0 = keep more, <1.0 = compress more
        return {5: 1.5, 6: 0.5}  # Keep message 5, compress message 6
    def post_compress(self, event):
        # Observe results (logging, analytics, learning)
        print(f"Saved {event.tokens_saved} tokens")
result = compress(messages, model="gpt-4o", hooks=MyHooks())
See Architecture for how hooks integrate with the pipeline.
FAQ¶
Q: Does Headroom change the response format? No. Your LLM returns the same response format. Headroom only modifies the input messages.
Q: What if compression removes something the LLM needs? Headroom stores originals in CCR (Compress-Cache-Retrieve). The LLM can call headroom_retrieve to get the full uncompressed content, and compression summaries tell the LLM what's available.
Q: Does it work with streaming? Yes. Compression happens before the request is sent. Streaming responses are unaffected.
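For example, with the OpenAI SDK you compress the input and stream the response exactly as before (a sketch, given a messages list as in the examples above; only the request messages change):
from openai import OpenAI
from headroom import compress
client = OpenAI()
compressed = compress(messages, model="gpt-4o")
stream = client.chat.completions.create(model="gpt-4o", messages=compressed.messages, stream=True)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")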
Q: How much latency does it add? 15-200ms, depending on content size and type. Small JSON arrays take ~15ms; large tool outputs take 100-200ms. The token savings typically buy back far more time on the LLM side than compression adds: a 50% token reduction on a Sonnet call saves seconds of generation time. See Latency Benchmarks for real numbers.