# Configuration
Headroom can be configured via the SDK, proxy command line, or per-request overrides.
## SDK Configuration
```python
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    # Mode: "audit" (observe only) or "optimize" (apply transforms)
    default_mode="optimize",
    # Enable provider-specific cache optimization
    enable_cache_optimizer=True,
    # Enable query-level semantic caching
    enable_semantic_cache=False,
    # Override default context limits per model
    model_context_limits={
        "gpt-4o": 128000,
        "gpt-4o-mini": 128000,
    },
    # Database location (defaults to a temp directory)
    # store_url="sqlite:////absolute/path/to/headroom.db",
)
```
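The wrapped client is then used like the underlying OpenAI client; `client.chat.completions.create(...)` calls pass through Headroom's pipeline (see Per-Request Overrides below for the call-site parameters).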
## Proxy Configuration

### Command Line Options
```bash
# --port      Port to listen on
# --host      Host to bind to
# --budget    Daily budget limit in USD
# --log-file  Log file path
headroom proxy \
  --port 8787 \
  --host 0.0.0.0 \
  --budget 10.00 \
  --log-file headroom.jsonl
```
### Feature Flags
```bash
# Disable optimization (passthrough mode)
headroom proxy --no-optimize

# Disable semantic caching
headroom proxy --no-cache

# Disable CCR response handling
headroom proxy --no-ccr-responses

# Disable proactive expansion
headroom proxy --no-ccr-expansion

# Enable LLMLingua ML compression
headroom proxy --llmlingua
headroom proxy --llmlingua --llmlingua-device cuda --llmlingua-rate 0.4
```
### All Options
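Assuming the proxy follows the usual CLI convention, `headroom proxy --help` prints the complete and current list of options.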
## Per-Request Overrides
Override configuration for specific requests:
```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    # Override mode for this request
    headroom_mode="audit",
    # Reserve more tokens for output
    headroom_output_buffer_tokens=8000,
    # Keep last N turns (don't compress)
    headroom_keep_turns=5,
    # Skip compression for specific tools
    headroom_tool_profiles={
        "important_tool": {"skip_compression": True},
    },
)
```
## Modes
| Mode | Behavior | Use Case |
|---|---|---|
| `audit` | Observes and logs, no modifications | Production monitoring, baseline measurement |
| `optimize` | Applies safe, deterministic transforms | Production optimization |
| `simulate` | Returns plan without API call | Testing, cost estimation |
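For example, to measure baseline behavior before turning on transforms, construct the client in audit mode (same constructor as the SDK example above):

```python
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

# Observes and logs every request without modifying it.
audit_client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="audit",
)
```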
### Simulate Mode
Preview what would happen without making an API call:
```python
plan = client.chat.completions.simulate(
    model="gpt-4o",
    messages=large_conversation,
)

print(f"Would save {plan.tokens_saved} tokens")
print(f"Transforms: {plan.transforms}")
print(f"Estimated savings: {plan.estimated_savings}")
```
## SmartCrusher Configuration
Fine-tune JSON compression behavior:
```python
from headroom.transforms import SmartCrusherConfig

config = SmartCrusherConfig(
    # Maximum items to keep after compression
    max_items_after_crush=15,
    # Minimum tokens before applying compression
    min_tokens_to_crush=200,
    # Relevance scoring tier: "bm25" (fast) or "embedding" (accurate)
    relevance_tier="bm25",
    # Always keep items with these field values
    preserve_fields=["error", "warning", "failure"],
)
```
## Cache Aligner Configuration
Control prefix stabilization:
```python
from headroom.transforms import CacheAlignerConfig

config = CacheAlignerConfig(
    # Enable/disable cache alignment
    enabled=True,
    # Patterns to extract from system prompt
    dynamic_patterns=[
        r"Today is \w+ \d+, \d{4}",
        r"Current time: .*",
    ],
)
```
## Rolling Window Configuration
Control context window management:
```python
from headroom.transforms import RollingWindowConfig

config = RollingWindowConfig(
    # Minimum turns to always keep
    min_keep_turns=3,
    # Reserve tokens for output
    output_buffer_tokens=4000,
    # Drop oldest tool outputs first
    prefer_drop_tool_outputs=True,
)
```
## Intelligent Context Manager Configuration
For semantic-aware context management with importance scoring:
```python
from headroom.config import IntelligentContextConfig, ScoringWeights

# Customize scoring weights (automatically normalized to sum to 1.0)
weights = ScoringWeights(
    recency=0.20,              # Newer messages score higher
    semantic_similarity=0.20,  # Similarity to recent context
    toin_importance=0.25,      # TOIN-learned retrieval patterns
    error_indicator=0.15,      # TOIN-learned error field types
    forward_reference=0.15,    # Messages referenced by later messages
    token_density=0.05,        # Information density
)

config = IntelligentContextConfig(
    # Enable/disable the manager
    enabled=True,
    # Protection settings
    keep_system=True,    # Never drop system messages
    keep_last_turns=2,   # Protect last N user turns
    # Token budget
    output_buffer_tokens=4000,  # Reserve for model output
    # Scoring settings
    use_importance_scoring=True,  # Use semantic scoring (vs position-only)
    scoring_weights=weights,      # Custom weights
    toin_integration=True,        # Use TOIN patterns if available
    recency_decay_rate=0.1,       # Exponential decay lambda
    # Strategy thresholds
    compress_threshold=0.1,  # Try compression first if <10% over budget
)
```
### CCR Integration
When IntelligentContext drops messages, they're stored in CCR for potential retrieval:
```python
from headroom.telemetry import get_toin

# Pass TOIN for bidirectional integration
toin = get_toin()
manager = IntelligentContextManager(config=config, toin=toin)

# Dropped messages are:
# 1. Stored in CCR (so the LLM can retrieve them if needed)
# 2. Recorded to TOIN (so it learns which patterns matter)
# 3. Marked with a CCR reference in the inserted message
```
The marker inserted when messages are dropped includes the CCR reference:
```text
[Earlier context compressed: 14 message(s) dropped by importance scoring.
Full content available via ccr_retrieve tool with reference 'abc123def456'.]
```
### Scoring Weights
The `ScoringWeights` class controls how messages are scored:

| Weight | Default | Description |
|---|---|---|
| `recency` | 0.20 | Exponential decay from conversation end |
| `semantic_similarity` | 0.20 | Embedding cosine similarity to recent context |
| `toin_importance` | 0.25 | TOIN `retrieval_rate` (high retrieval = important) |
| `error_indicator` | 0.15 | TOIN `field_semantics` error detection |
| `forward_reference` | 0.15 | Count of later messages referencing this one |
| `token_density` | 0.05 | Unique tokens / total tokens |
Weights are automatically normalized to sum to 1.0:
```python
weights = ScoringWeights(recency=1.0, toin_importance=1.0)
normalized = weights.normalized()
# recency=0.5, toin_importance=0.5, all others 0.0
```
## Environment Variables
Some settings can be configured via environment variables:
| Variable | Description | Default |
|---|---|---|
| `HEADROOM_LOG_LEVEL` | Logging level | `INFO` |
| `HEADROOM_STORE_URL` | Database URL | temp directory |
| `HEADROOM_DEFAULT_MODE` | Default mode | `optimize` |
| `HEADROOM_MODEL_LIMITS` | Custom model config (JSON string or file path) | - |
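For example, to run in audit mode with quieter logs and a persistent database (the path here is illustrative):

```bash
export HEADROOM_DEFAULT_MODE=audit
export HEADROOM_LOG_LEVEL=WARNING
export HEADROOM_STORE_URL="sqlite:////var/lib/headroom/headroom.db"
```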
## Custom Model Configuration
Configure context limits and pricing for new or custom models. This is useful when:

- A new model is released before Headroom is updated
- You're using fine-tuned or custom models
- You want to override built-in limits
### Configuration Methods
Settings are resolved in this order (later overrides earlier):
1. Built-in defaults
2. `~/.headroom/models.json` config file
3. `HEADROOM_MODEL_LIMITS` environment variable
4. SDK constructor arguments
### Config File Format

Create `~/.headroom/models.json`:
```json
{
  "anthropic": {
    "context_limits": {
      "claude-4-opus-20250301": 200000,
      "claude-custom-finetune": 128000
    },
    "pricing": {
      "claude-4-opus-20250301": {
        "input": 15.00,
        "output": 75.00,
        "cached_input": 1.50
      }
    }
  },
  "openai": {
    "context_limits": {
      "gpt-5": 256000,
      "ft:gpt-4o:my-org": 128000
    },
    "pricing": {
      "gpt-5": [5.00, 15.00]
    }
  }
}
```
### Environment Variable

Set `HEADROOM_MODEL_LIMITS` as a JSON string or file path:
```bash
# JSON string
export HEADROOM_MODEL_LIMITS='{"anthropic":{"context_limits":{"claude-new":200000}}}'

# File path
export HEADROOM_MODEL_LIMITS=/path/to/models.json
```
### Pattern-Based Inference
Unknown models are automatically inferred from naming patterns:
| Pattern | Inferred Settings |
|---|---|
| `*opus*` | 200K context, Opus-tier pricing |
| `*sonnet*` | 200K context, Sonnet-tier pricing |
| `*haiku*` | 200K context, Haiku-tier pricing |
| `gpt-4o*` | 128K context, GPT-4o pricing |
| `o1*`, `o3*` | 200K context, reasoning model pricing |
This means new models like `claude-4-sonnet-20251201` will work automatically with Sonnet-tier defaults.
### SDK Override
Override in code for specific models:
```python
from anthropic import Anthropic
from headroom import HeadroomClient, AnthropicProvider

client = HeadroomClient(
    original_client=Anthropic(),
    provider=AnthropicProvider(
        context_limits={
            "claude-new-model": 300000,
        },
    ),
)
```
## Provider-Specific Settings

### OpenAI

```python
from headroom import OpenAIProvider

provider = OpenAIProvider(
    # Enable automatic prefix caching
    enable_prefix_caching=True,
)
```

### Anthropic

```python
from headroom import AnthropicProvider

provider = AnthropicProvider(
    # Enable cache_control blocks
    enable_cache_control=True,
)
```

### Google

```python
from headroom import GoogleProvider

provider = GoogleProvider(
    # Enable context caching
    enable_context_caching=True,
)
```
## Configuration Precedence

Settings are applied in this order (later overrides earlier), as the sketch after this list demonstrates:

- Default values
- Environment variables
- SDK constructor arguments
- Per-request overrides
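For example, a per-request override takes precedence over the constructor default; both parameters appear earlier on this page:

```python
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

# SDK constructor argument: optimize every request by default.
client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)

# Per-request override wins: this call is audited only.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    headroom_mode="audit",
)
```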
## Validation

Validate your configuration before relying on it in production.
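One low-cost smoke test, assuming only the `simulate` API shown above, is to run a request through the configured pipeline without making an API call:

```python
# Runs the transform pipeline and returns a plan without calling the API,
# so configuration problems surface before any paid traffic.
plan = client.chat.completions.simulate(
    model="gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
)
print(f"Would save {plan.tokens_saved} tokens via {plan.transforms}")
```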