Integration Guide

You don't need to run the Headroom proxy. Headroom is a compression library that works with any LLM client, proxy, or framework.

Pick Your Path

| You have... | Use this | Setup |
|---|---|---|
| Any Python app | compress() | 2 lines |
| LiteLLM | LiteLLM callback | 1 line |
| A Python proxy (FastAPI, custom) | ASGI middleware | 1 line |
| Claude Code / Cursor | Headroom proxy | 1 env var |
| Agno agents | Agno integration | Wrap model |
| LangChain | LangChain integration | Wrap model |
| Non-Python app | Headroom proxy | HTTP |
| TypeScript | compress() (TypeScript SDK) | npm install headroom-ai |
| Vercel AI SDK | headroomMiddleware() | Middleware adapter |
| OpenAI Node SDK | withHeadroom() | Client wrapper |
| Anthropic TS SDK | withHeadroom() | Client wrapper |

compress() Function

The simplest integration. Works with any LLM client.

from headroom import compress

# Before sending to your LLM:
result = compress(messages, model="claude-sonnet-4-5-20250929")
response = your_client.create(messages=result.messages)  # Fewer tokens, same answer

print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")

With Anthropic SDK

from anthropic import Anthropic
from headroom import compress

client = Anthropic()
messages = [
    {"role": "user", "content": "What went wrong?"},
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "Let me check."},
            # tool_use is a content block, not a top-level message key;
            # the id and name here are placeholders
            {"type": "tool_use", "id": "toolu_1", "name": "get_logs", "input": {}},
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": "toolu_1", "content": huge_json},
        ],
    },
]

compressed = compress(messages, model="claude-sonnet-4-5-20250929")
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    messages=compressed.messages,
    max_tokens=1000,
)

With OpenAI SDK

from openai import OpenAI
from headroom import compress

client = OpenAI()
messages = [
    {"role": "user", "content": "Analyze these results"},
    {
        # A tool message must follow an assistant message that issued the call;
        # the function name and arguments here are placeholders
        "role": "assistant",
        "tool_calls": [{"id": "call_1", "type": "function",
                        "function": {"name": "run_query", "arguments": "{}"}}],
    },
    {"role": "tool", "content": big_json_output, "tool_call_id": "call_1"},
]

compressed = compress(messages, model="gpt-4o")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=compressed.messages,
)

With LiteLLM (direct)

import litellm
from headroom import compress

messages = [...]
compressed = compress(messages, model="bedrock/claude-sonnet")
response = litellm.completion(model="bedrock/claude-sonnet", messages=compressed.messages)

With any HTTP client

import httpx
from headroom import compress

compressed = compress(messages, model="claude-sonnet-4-5-20250929")
httpx.post("https://api.anthropic.com/v1/messages", json={
    "model": "claude-sonnet-4-5-20250929",
    "messages": compressed.messages,
}, headers={"X-Api-Key": api_key, "anthropic-version": "2023-06-01"})

What compress() returns

result = compress(messages, model="gpt-4o")
result.messages           # list[dict] — compressed messages, same format as input
result.tokens_before      # int — original token count
result.tokens_after       # int — compressed token count
result.tokens_saved       # int — tokens removed
result.compression_ratio  # float — 0.0 (no savings) to 1.0 (100% removed)
result.transforms_applied # list[str] — what ran (e.g., ["router:smart_crusher:0.35"])
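
The ratio is derived directly from the two token counts. A quick sketch of the relationship (a hypothetical helper for illustration, not part of Headroom's API):

```python
def compression_ratio(tokens_before: int, tokens_after: int) -> float:
    """Fraction of the original tokens removed: 0.0 means no savings."""
    if tokens_before == 0:
        return 0.0
    return (tokens_before - tokens_after) / tokens_before

# e.g. 1000 tokens in, 650 out -> 0.35
print(compression_ratio(1000, 650))
```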

LiteLLM

If you're already using LiteLLM as your LLM gateway, add Headroom as a callback:

import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

litellm.callbacks = [HeadroomCallback()]

# All calls now compressed automatically
response = litellm.completion(model="gpt-4o", messages=[...])
response = litellm.completion(model="bedrock/claude-sonnet", messages=[...])
response = litellm.completion(model="azure/gpt-4o", messages=[...])

The callback compresses messages in LiteLLM's pre_call_hook before they're sent to the provider. Works with all 100+ LiteLLM-supported providers.

With LiteLLM Proxy

If you run LiteLLM as a proxy server, use the ASGI middleware instead:

# In your LiteLLM proxy startup
from litellm.proxy.proxy_server import app
from headroom.integrations.asgi import CompressionMiddleware

app.add_middleware(CompressionMiddleware)

Or use the callback in your LiteLLM config:

# litellm_config.yaml
litellm_settings:
  callbacks: ["headroom.integrations.litellm_callback.HeadroomCallback"]

ASGI Middleware

Drop-in middleware for any ASGI application (FastAPI, Starlette, LiteLLM proxy, custom proxies).

from headroom.integrations.asgi import CompressionMiddleware

# FastAPI
app = FastAPI()
app.add_middleware(CompressionMiddleware)

# Starlette
app = Starlette(routes=[...])
app.add_middleware(CompressionMiddleware)

# LiteLLM proxy
from litellm.proxy.proxy_server import app
app.add_middleware(CompressionMiddleware)

The middleware intercepts POST requests to /v1/messages, /v1/chat/completions, /v1/responses, and /chat/completions. All other requests pass through untouched.

Response headers include:

- x-headroom-compressed: true — compression was applied
- x-headroom-tokens-saved: 1234 — tokens removed
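
A client behind the middleware can read these headers to track savings. A minimal sketch, assuming the response headers arrive as a plain dict with lowercase keys:

```python
def tokens_saved_from_headers(headers: dict) -> int:
    """Return tokens saved if Headroom compressed the request, else 0."""
    if headers.get("x-headroom-compressed") == "true":
        return int(headers.get("x-headroom-tokens-saved", "0"))
    return 0

print(tokens_saved_from_headers(
    {"x-headroom-compressed": "true", "x-headroom-tokens-saved": "1234"}
))
```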


Proxy

The Headroom proxy is a standalone HTTP server. Best for non-Python apps or tools that only support base URL configuration (Claude Code, Cursor).

pip install "headroom-ai[all]"
headroom proxy --port 8787

# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude

# Cursor / Any OpenAI client
OPENAI_BASE_URL=http://localhost:8787/v1 cursor

With Cloud Providers

# AWS Bedrock
headroom proxy --backend bedrock --region us-east-1

# Google Vertex AI
headroom proxy --backend vertex_ai --region us-central1

# Azure OpenAI
headroom proxy --backend azure

# OpenRouter (400+ models)
OPENROUTER_API_KEY=sk-or-... headroom proxy --backend openrouter

See Proxy Documentation for all options.


Agno

Full integration with the Agno agent framework.

from agno.agent import Agent
from agno.models.anthropic import Claude
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(Claude(id="claude-sonnet-4-20250514"))
agent = Agent(model=model, tools=[your_tools])
response = agent.run("Investigate the issue")

print(f"Tokens saved: {model.total_tokens_saved}")

See Agno Guide for hooks, multi-provider, and streaming.


LangChain

Full integration with LangChain — chat models, memory, retrievers, tool wrappers, and streaming.

from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
response = llm.invoke("Hello!")

See LangChain Guide for details and known limitations.


TypeScript SDK

For Node.js, Next.js, and any TypeScript/JavaScript application.

npm install headroom-ai

See the TypeScript SDK Guide for full documentation including Vercel AI SDK middleware, OpenAI SDK wrapper, and Anthropic SDK wrapper.


OpenClaw

Context compression plugin for OpenClaw agents.

pip install "headroom-ai[proxy]"
openclaw plugins install headroom-openclaw

Configure as context engine:

{ "plugins": { "slots": { "contextEngine": "headroom" } } }

The plugin auto-detects a running Headroom proxy or starts one. Compression happens in assemble() — zero changes to the agent's behavior.

See the OpenClaw plugin documentation for full setup.


Compression Hooks (Advanced)

Customize compression behavior without modifying Headroom's code:

from headroom import compress, CompressionHooks, CompressContext

class MyHooks(CompressionHooks):
    def pre_compress(self, messages, ctx):
        # Modify messages before compression (dedup, filter, inject)
        return messages

    def compute_biases(self, messages, ctx):
        # Per-message compression aggressiveness
        # >1.0 = keep more, <1.0 = compress more
        return {5: 1.5, 6: 0.5}  # Keep message 5, compress message 6

    def post_compress(self, event):
        # Observe results (logging, analytics, learning)
        print(f"Saved {event.tokens_saved} tokens")

result = compress(messages, model="gpt-4o", hooks=MyHooks())

See Architecture for how hooks integrate with the pipeline.


FAQ

Q: Does Headroom change the response format? No. Your LLM returns the same response format. Headroom only modifies the input messages.

Q: What if compression removes something the LLM needs? Headroom stores originals in CCR (Compress-Cache-Retrieve). The LLM can call headroom_retrieve to get full uncompressed content. Compression summaries tell the LLM what's available.

Q: Does it work with streaming? Yes. Compression happens before the request is sent. Streaming responses are unaffected.

Q: How much latency does it add? 15-200ms depending on content size and type. Small JSON arrays take ~15ms, large tool outputs take 100-200ms. The token savings typically save far more time on the LLM side than compression adds — a 50% token reduction on a Sonnet call saves seconds of generation time. See Latency Benchmarks for real numbers.
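
The trade-off above can be checked with back-of-envelope arithmetic. All numbers below are illustrative assumptions, not measured benchmarks:

```python
def net_time_saved_ms(tokens_saved: int, prefill_ms_per_1k: float,
                      overhead_ms: float) -> float:
    """Positive result => compression saves wall-clock time overall."""
    return tokens_saved / 1000 * prefill_ms_per_1k - overhead_ms

# Hypothetical: 20k tokens removed, 50 ms per 1k tokens of prefill,
# 200 ms compression overhead (the upper end of the range above)
print(net_time_saved_ms(20_000, 50, 200))
```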