Headroom Limitations & Known Behavior

Honest documentation of when Headroom helps, when it doesn't, and what to watch out for.

When Headroom Helps (and When It Doesn't)

| Content Type | Compression | Latency Impact | Best For |
|---|---|---|---|
| JSON: Arrays of dicts (search results, API responses, DB rows) | 86-100% | Net latency win on Sonnet/Opus | Primary use case; always use |
| JSON: Arrays of strings (file paths, log lines, tags) | 60-90% | Net latency win | New; works with all string arrays |
| JSON: Arrays of numbers (metrics, time series) | 70-85% | Net latency win | New; includes statistical summary |
| JSON: Mixed-type arrays | 50-70% | Net latency win | New; groups by type, compresses each |
| Structured logs (as JSON) | 82-95% | Net latency win | Log entries in tool outputs |
| Agentic conversations (25-50 turns) | 56-81% | Break-even to net win | Multi-tool agent sessions |
| Plain text (documentation, articles) | 43-46% | Adds latency (cost savings only) | Cost optimization, not speed |
| Code | Passthrough | Minimal overhead | See Code Compression |
| RAG document contexts | Passthrough | Minimal overhead | Not compressed (plain text in user messages) |

See LATENCY_BENCHMARKS.md for full data with per-scenario timing.

Code Compression

Headroom includes an AST-aware CodeCompressor (tree-sitter, 8 languages) but it's gated behind safety protections that prevent it from firing in most real-world scenarios. This is intentional.

Why code mostly passes through:

  1. Word count gate: Content under 50 words is silently skipped
  2. Recent code protection (protect_recent_code=4): Code in the last 4 messages is never compressed. In typical tool-call patterns, the tool result is always "recent"
  3. Analysis intent protection (protect_analysis_context=True): If the most recent user message contains keywords like "analyze", "review", "explain", "fix", "debug", "optimize", "error", "bug" — ALL code in the conversation is protected

Why this is the right default: Code is almost always fetched because the user wants to work with it. Compressing function bodies would remove exactly what they need. LLMs like Claude are excellent at navigating large code files without compression.

Where code savings come from: The IntelligentContextManager drops old code messages that are no longer relevant (scoring-based), which is a better strategy than stripping function bodies from active code.

Override: Set protect_analysis_context=False in ContentRouterConfig for aggressive code compression. Requires headroom-ai[code] for tree-sitter.
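
A minimal sketch of that override, assuming ContentRouterConfig is importable from the top-level headroom package (the import path is an assumption; the field names appear in the Configuration Tuning table below):

```python
# Hedged sketch: relax code protection so CodeCompressor can fire more often.
from headroom import ContentRouterConfig  # assumed import location

config = ContentRouterConfig(
    protect_analysis_context=False,  # compress code even when the user asks to analyze/debug it
    protect_recent_code=2,           # shrink the protected "recent messages" window from the default 4
)
```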

JSON Compression Constraints

What gets compressed

  • Arrays of dicts: Full statistical analysis with adaptive K (Kneedle algorithm)
  • Arrays of strings: Dedup + adaptive sampling + error preservation
  • Arrays of numbers: Statistical summary + outlier/change-point preservation
  • Mixed-type arrays: Grouped by type, each group compressed independently
  • Nested objects: Recursed into, arrays within are compressed (up to depth 5)

What passes through

  • Arrays with fewer than 5 items (min_items_to_analyze)
  • Content below 200 tokens (min_tokens_to_crush)
  • Bool-only arrays (not useful to compress)
  • JSON objects without array values
  • Malformed JSON (silently passes through, no error)
  • Non-JSON content (handled by other pipeline stages)
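
Taken together, these gates are a size-and-shape check that runs before any analysis. A rough illustrative sketch, not Headroom's actual implementation (threshold names mirror the Configuration Tuning table below):

```python
def should_crush(value, token_count, min_items_to_analyze=5, min_tokens_to_crush=200):
    """Illustrative passthrough gate: False means the content is left untouched."""
    if token_count < min_tokens_to_crush:
        return False                                  # too small to be worth crushing
    if isinstance(value, list):
        if len(value) < min_items_to_analyze:
            return False                              # tiny arrays pass through
        if value and all(isinstance(x, bool) for x in value):
            return False                              # bool-only arrays are not compressed
        return True
    if isinstance(value, dict):
        # objects only matter if they contain arrays somewhere inside (recursed to depth 5)
        return any(isinstance(v, (list, dict)) for v in value.values())
    return False                                      # non-JSON content is handled by other stages
```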

Edge cases

  • NaN/Infinity in numeric fields: Filtered out before statistics are computed
  • Nesting depth > 5: Inner arrays not examined for compression
  • Mixed-type arrays with small groups: Groups below min_items_to_analyze are kept as-is

Adaptive K: How Item Retention Works

SmartCrusher doesn't use fixed K values. It uses information-theoretic sizing:

  1. Kneedle algorithm on bigram coverage curves finds the point where adding more items stops providing new information
  2. SimHash fingerprinting detects near-duplicate items
  3. zlib validation ensures the subset captures the full set's diversity
  4. The resulting K is split: 30% from array start, 15% from end, 55% for importance-scored items

Safety guarantees (additive, never dropped):

  • Error items (containing "error", "exception", "failed", "critical", etc.) across ALL array types
  • Numeric anomalies (> 2σ from mean)
  • String length anomalies (> 2σ from mean length)
  • Change points (sudden shifts in running values)

These are kept even if they exceed the K budget.
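
In concrete terms, once the Kneedle/SimHash/zlib steps have settled on K, the budget split looks roughly like this (the fractions come from first_fraction and last_fraction in the Configuration Tuning table; the helper itself is illustrative):

```python
def split_budget(k: int, first_fraction: float = 0.30, last_fraction: float = 0.15):
    """Illustrative split of the adaptive K budget across start / end / scored items."""
    n_first = round(k * first_fraction)   # items kept from the array start
    n_last = round(k * last_fraction)     # items kept from the array end
    n_scored = k - n_first - n_last       # remaining ~55% goes to importance-scored items
    return n_first, n_last, n_scored

print(split_budget(20))  # -> (6, 3, 11)
```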

Text Compression (LLMLingua)

  • Requires: headroom-ai[llmlingua] — downloads ~2GB model, needs ~1GB RAM
  • First call: 10-30s model load latency (cached globally after)
  • Sequence length: Content chunked at 512 tokens (model limit)
  • Content < 100 tokens: Skipped
  • Latency: Adds overhead that doesn't break even on fast models (GPT-4o Mini, Sonnet). Use for cost savings, not speed
  • Thread safety: Single global model instance with lock — sequential access under concurrency
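
The thread-safety note means concurrent requests serialize on one shared model. A minimal sketch of that pattern (names and the stand-in loader are illustrative, not Headroom's internals):

```python
import threading
from typing import Optional

_model: Optional[object] = None
_model_lock = threading.Lock()

def _load_model() -> object:
    return object()  # stand-in for the ~2GB LLMLingua model load (10-30s on first call)

def compress_text(text: str) -> str:
    global _model
    with _model_lock:            # concurrent callers queue here, so access is sequential
        if _model is None:
            _model = _load_model()
        return text              # real compression would run against _model here
```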

Error Handling

All compressors follow the same principle: fail gracefully, return original content unchanged.

  • Invalid JSON → passthrough (no error raised)
  • AST parse failure in CodeCompressor → falls back to original or LLMLingua
  • Compression makes output larger → original returned
  • Missing optional dependencies (tree-sitter, LLMLingua) → passthrough with warning log
  • One exception: LLMLingua out-of-memory during model loading raises RuntimeError

Apart from that one exception, errors are logged at WARNING level and never propagated to callers.
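
The same fail-safe shape, sketched for the JSON path (illustrative only; the contract is the one described above, not Headroom's actual code):

```python
import json
import logging
from typing import Callable

logger = logging.getLogger("headroom.sketch")

def crush_safely(content: str, compress: Callable[[str], str]) -> str:
    """Return compressed content, or the original on any failure or size regression."""
    try:
        json.loads(content)                      # invalid JSON raises and falls through to the except
        compressed = compress(content)
        if len(compressed) >= len(content):      # compression made the output larger
            return content
        return compressed
    except Exception as exc:
        logger.warning("compression skipped: %s", exc)  # logged, never raised to the caller
        return content
```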

TOIN Cold Start

The Tool Output Intelligence Network (TOIN) learns compression patterns from usage. For new tool types:

  • No learned patterns exist → falls back to statistical heuristics
  • Confidence below toin_confidence_threshold (default 0.3) → TOIN hints ignored
  • Patterns build up over time as tools are used repeatedly
  • Cross-session learning requires persistence (TelemetryConfig.storage_path)
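
A hedged sketch of enabling that persistence so patterns survive across sessions. Only the field names (toin_confidence_threshold, TelemetryConfig.storage_path) come from this page; the import path and constructor signatures are assumptions:

```python
from headroom import ContentRouterConfig, TelemetryConfig  # assumed import location

router_config = ContentRouterConfig(
    toin_confidence_threshold=0.3,         # hints below this confidence are ignored
)
telemetry = TelemetryConfig(
    storage_path="~/.headroom/telemetry",  # hypothetical path; enables cross-session TOIN learning
)
```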

CacheAligner Behavior

  • Only processes system messages for dynamic content extraction
  • Dynamic content in user/assistant/tool messages is not extracted
  • May add small markers ([Dynamic Context] separator) that slightly increase token count
  • Whitespace normalization may affect content with significant indentation (code blocks, ASCII art)

Provider Interactions

  • CacheAligner is designed to maximize Anthropic/OpenAI prefix cache hit rates
  • Token counting uses model-specific tokenizers (tiktoken for OpenAI, calibrated estimation for Anthropic)
  • Compression works with all providers — no provider-specific limitations
  • Compressed content is valid JSON — downstream tools and parsers work unchanged
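
For OpenAI models, the token counting mentioned above can be reproduced directly with tiktoken (Anthropic counts use calibrated estimation instead):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
sample = '{"results": [{"id": 1, "score": 0.93}, {"id": 2, "score": 0.87}]}'
print(len(enc.encode(sample)))  # token count as an OpenAI model would see it
```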

Performance Characteristics

  • ContentRouter accounts for 91-98% of pipeline cost — it does the actual compression work
  • CacheAligner and RollingWindow are sub-millisecond
  • Scaling is roughly linear with input size
  • Full benchmark data: LATENCY_BENCHMARKS.md

Configuration Tuning

| Parameter | Default | Effect |
|---|---|---|
| min_items_to_analyze | 5 | Arrays below this size pass through |
| min_tokens_to_crush | 200 | Content below this token count passes through |
| max_items_after_crush | 15 | Upper bound on retained items |
| variance_threshold | 2.0 | Std devs for anomaly detection (lower = more preserved) |
| first_fraction | 0.3 | Fraction of K allocated to the array start |
| last_fraction | 0.15 | Fraction of K allocated to the array end |
| protect_analysis_context | True | Protect code when the user asks about it |
| protect_recent_code | 4 | Number of messages from the end in which code is protected |
| skip_user_messages | True | Never compress user messages |
| toin_confidence_threshold | 0.3 | Minimum TOIN confidence to apply hints |
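
A hedged tuning example combining a few of these knobs. Only protect_analysis_context is documented above as a ContentRouterConfig field; placing the crusher thresholds on the same object is an assumption made for illustration:

```python
from headroom import ContentRouterConfig  # assumed import location

config = ContentRouterConfig(
    min_items_to_analyze=10,        # leave small arrays alone more often
    min_tokens_to_crush=400,        # only crush larger tool outputs
    max_items_after_crush=20,       # allow more retained items per array
    variance_threshold=1.5,         # lower threshold preserves more anomalies
    protect_analysis_context=True,  # keep the default code protection
)
```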