SDK Guide¶
The Headroom SDK wraps your existing LLM client to add compression and optimization transparently.
Installation¶
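Install the SDK with pip. The package name is assumed here to match the import name; check the project's README if your distribution uses a different name:
pip install headroom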
Quick Start¶
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI
# Create wrapped client
client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)
# Use exactly like the original client
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)
Tool Output Compression¶
Real savings happen with tool outputs. Here's where Headroom shines:
import json
# Conversation with large tool output
messages = [
    {"role": "user", "content": "Search for Python tutorials"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {"name": "search", "arguments": '{"q": "python"}'},
        }],
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": json.dumps({
            "results": [
                {"title": f"Tutorial {i}", "score": 100 - i}
                for i in range(500)
            ]
        }),
    },
    {"role": "user", "content": "What are the top 3?"},
]
# Headroom compresses 500 results to ~15, keeping highest-scoring items
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)
# Check savings
stats = client.get_stats()
print(f"Tokens saved: {stats['session']['tokens_saved_total']}")
# Typical output: "Tokens saved: 3500"
Supported Providers¶
OpenAI¶
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI
client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
)
Anthropic¶
from headroom import HeadroomClient, AnthropicProvider
from anthropic import Anthropic
client = HeadroomClient(
    original_client=Anthropic(),
    provider=AnthropicProvider(),
)
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
Google¶
from headroom import HeadroomClient, GoogleProvider
import google.generativeai as genai
client = HeadroomClient(
    original_client=genai,
    provider=GoogleProvider(),
)
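Usage then follows the google.generativeai surface. A minimal sketch, assuming the wrapper proxies module attributes such as GenerativeModel (the exact call path may differ by Headroom version):
genai.configure(api_key="...")  # configure the underlying SDK as usual
model = client.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Hello!")
print(response.text)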
Check Stats¶
# Session stats (no database query)
stats = client.get_stats()
print(stats)
# {
# "session": {"requests_total": 10, "tokens_saved_total": 5000, ...},
# "config": {"mode": "optimize", "provider": "openai", ...},
# "transforms": {"smart_crusher_enabled": True, ...}
# }
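These fields make quick summaries easy. For example, to log the average savings per request from the session keys shown above (a sketch; the exact key set may vary by version):
session = stats["session"]
if session["requests_total"]:
    avg = session["tokens_saved_total"] / session["requests_total"]
    print(f"Average tokens saved per request: {avg:.0f}")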
Validate Setup¶
Modes¶
Optimize (Default)¶
Applies all safe transforms:
client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)
Audit¶
Observes and logs requests without modifying them:
client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="audit",
)
Simulate¶
Returns a plan without making the API call:
plan = client.chat.completions.simulate(
    model="gpt-4o",
    messages=large_conversation,
)
print(f"Would save {plan.tokens_saved} tokens")
print(f"Transforms: {plan.transforms}")
Per-Request Overrides¶
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    # Override mode for this request
    headroom_mode="audit",
    # Reserve more tokens for output
    headroom_output_buffer_tokens=8000,
    # Keep last N turns
    headroom_keep_turns=5,
)
Enable Logging¶
import logging
logging.basicConfig(level=logging.INFO)
# Now you'll see:
# INFO:headroom.transforms.pipeline:Pipeline complete: 45000 -> 4500 tokens
# INFO:headroom.transforms.smart_crusher:SmartCrusher: kept 15 of 1000 items
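If INFO logging for the whole application is too noisy, you can scope it to Headroom alone. Python loggers are hierarchical, so configuring the headroom namespace (inferred from the logger names above) covers all of its submodules:
import logging
logging.basicConfig(level=logging.WARNING)
logging.getLogger("headroom").setLevel(logging.INFO)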
Streaming¶
Streaming works transparently:
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Error Handling¶
from headroom import (
    HeadroomClient,
    HeadroomError,
    ConfigurationError,
    ProviderError,
)
try:
    response = client.chat.completions.create(...)
except ConfigurationError as e:
    print(f"Config issue: {e}")
except ProviderError as e:
    print(f"Provider issue: {e}")
except HeadroomError as e:
    print(f"Headroom error: {e}")
Historical Metrics¶
Query stored metrics:
from datetime import datetime, timedelta
metrics = client.get_metrics(
start_time=datetime.utcnow() - timedelta(hours=1),
limit=100,
)
for m in metrics:
print(f"{m.timestamp}: {m.tokens_input_before} -> {m.tokens_input_after}")
Advanced Configuration¶
See Configuration for full options:
client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
    enable_cache_optimizer=True,
    enable_semantic_cache=False,
    model_context_limits={
        "gpt-4o": 128000,
        "gpt-4o-mini": 128000,
    },
)
Comparison with Proxy¶
| Aspect | SDK | Proxy |
|---|---|---|
| Setup | Wrap client | Point URL |
| Control | Fine-grained | Global |
| Metrics | In-process | Centralized |
| Best for | Custom apps | Existing tools |
Use the SDK when you need fine-grained control. Use the proxy for existing tools like Claude Code, Cursor, etc.