# LLMLingua-2 Integration

For maximum compression, Headroom integrates with LLMLingua-2, Microsoft's BERT-based token classifier trained via GPT-4 distillation. It achieves up to 20x compression while preserving semantic meaning.
## When to Use LLMLingua-2
| Approach | Best For | Compression | Speed |
|---|---|---|---|
| SmartCrusher | JSON tool outputs | 70-90% | ~1ms |
| Text Utilities | Search/logs | 50-90% | ~1ms |
| LLMLingua-2 | Any text, max compression | 80-95% | ~50-200ms |
LLMLingua-2 is ideal when you need maximum compression and can tolerate higher latency (e.g., compressing large tool outputs before storage, or offline processing).
## Installation
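LLMLingua support ships as an optional extra. Judging from the install extra referenced under Proxy Integration below, a standalone install is presumably:

```bash
pip install headroom-ai[llmlingua]

# Or, together with the proxy server:
pip install headroom-ai[llmlingua,proxy]
```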
## Basic Usage

```python
from headroom.transforms import LLMLinguaCompressor

# Create compressor (model loaded lazily on first use)
compressor = LLMLinguaCompressor()

# Compress any text
long_output = "The function processUserData takes a user object and validates..."
result = compressor.compress(long_output)

print(f"Before: {result.original_tokens} tokens")
print(f"After: {result.compressed_tokens} tokens")
print(f"Saved: {result.savings_percentage:.1f}%")
print(result.compressed)
```
## Content-Aware Compression
LLMLingua-2 automatically adjusts compression based on content type:
```python
from headroom.transforms import LLMLinguaCompressor, LLMLinguaConfig

# Conservative for code (keep 40% of tokens)
config = LLMLinguaConfig(
    code_compression_rate=0.4,   # More conservative
    json_compression_rate=0.35,  # Moderate
    text_compression_rate=0.25,  # Aggressive
)
compressor = LLMLinguaCompressor(config)

# Auto-detects content type
code_result = compressor.compress("def calculate(x): return x * 2")
text_result = compressor.compress("This is a verbose explanation...")
```
## Memory Management
The model uses ~1GB RAM. Unload it when done:
```python
from headroom.transforms import (
    LLMLinguaCompressor,
    unload_llmlingua_model,
    is_llmlingua_model_loaded,
)

content = "A long tool output to compress..."  # any text

compressor = LLMLinguaCompressor()
result = compressor.compress(content)  # Model loaded here

# Check if loaded
print(is_llmlingua_model_loaded())  # True

# Free memory when done
unload_llmlingua_model()  # Frees ~1GB
print(is_llmlingua_model_loaded())  # False

# The next compression will reload the model automatically
```
## Device Configuration

```python
from headroom.transforms import LLMLinguaConfig, LLMLinguaCompressor

# Force CPU (slower, but works everywhere)
config = LLMLinguaConfig(device="cpu")

# Force GPU (faster, but requires CUDA)
config = LLMLinguaConfig(device="cuda")

# Auto-detect (default): prefers CUDA, then MPS, then CPU
config = LLMLinguaConfig(device="auto")

compressor = LLMLinguaCompressor(config)
```
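If you want to see which device `auto` will resolve to before loading the model, you can replicate the documented CUDA > MPS > CPU order yourself. A minimal sketch, assuming the standard PyTorch backend that LLMLingua-2 models run on:

```python
import torch

from headroom.transforms import LLMLinguaConfig, LLMLinguaCompressor

# Mirror the documented auto-detect order: CUDA > MPS > CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"'auto' would resolve to: {device}")
compressor = LLMLinguaCompressor(LLMLinguaConfig(device=device))
```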
## Use in Pipeline
```python
from headroom.transforms import TransformPipeline, LLMLinguaCompressor, SmartCrusher

# Combine with other transforms
pipeline = TransformPipeline([
    SmartCrusher(),         # First: compress JSON structurally
    LLMLinguaCompressor(),  # Then: ML compression on remaining text
])

# `messages` and `tokenizer` come from your application / model setup
result = pipeline.apply(messages, tokenizer)
```
## Proxy Integration
Enable LLMLingua in the proxy server for automatic ML compression:
```bash
# Enable LLMLingua in the proxy (requires: pip install headroom-ai[llmlingua,proxy])
headroom proxy --llmlingua

# With custom settings
headroom proxy --llmlingua --llmlingua-device cuda --llmlingua-rate 0.4

# The proxy shows LLMLingua status at startup:
#   LLMLingua: ENABLED (device=cuda, rate=0.4)
#
# If llmlingua is installed but not enabled, you'll see a helpful hint:
#   LLMLingua: available (enable with --llmlingua for ML compression)
```
## Configuration Reference

| Option | Default | Description |
|---|---|---|
| `device` | `"auto"` | Device to run the model on: `auto`, `cpu`, `cuda`, `mps` |
| `code_compression_rate` | `0.4` | Keep 40% of tokens for code |
| `json_compression_rate` | `0.35` | Keep 35% of tokens for JSON |
| `text_compression_rate` | `0.25` | Keep 25% of tokens for text |
| `force_tokens` | `[]` | Tokens to always preserve |
| `drop_consecutive` | `True` | Drop consecutive whitespace |
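Putting the options together, a full configuration might look like the sketch below. The specific values are illustrative; the `force_tokens` entries are hypothetical markers you never want dropped:

```python
from headroom.transforms import LLMLinguaCompressor, LLMLinguaConfig

config = LLMLinguaConfig(
    device="auto",               # CUDA > MPS > CPU
    code_compression_rate=0.4,   # keep 40% of tokens for code
    json_compression_rate=0.35,  # keep 35% of tokens for JSON
    text_compression_rate=0.25,  # keep 25% of tokens for text
    force_tokens=["ERROR", "WARNING"],  # always preserve these tokens
    drop_consecutive=True,       # drop consecutive whitespace
)
compressor = LLMLinguaCompressor(config)
```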
## Performance Characteristics
| Metric | Value |
|---|---|
| Model size | ~500MB |
| Memory usage | ~1GB RAM |
| Cold start | 10-30s (first load) |
| Inference | 50-200ms per request |
| Compression | 80-95% |
## Why Opt-In?
LLMLingua adds significant dependencies and overhead:
| Aspect | Default Proxy | With LLMLingua |
|---|---|---|
| Dependencies | ~50MB | ~2GB |
| Cold start | <1s | 10-30s |
| Per-request | ~1-5ms | ~50-200ms |
| Compression | 70-90% | 80-95% |
The default proxy is lightweight and fast. Enable LLMLingua when you need maximum compression and can accept the tradeoffs.
## Troubleshooting

### "Model not found"

Ensure the optional dependency is installed (`pip install headroom-ai[llmlingua]`). The ~500MB model is loaded on first use, so the first run can take 10-30s and typically needs network access to fetch it.

### "CUDA out of memory"

Fall back to CPU with `LLMLinguaConfig(device="cpu")`, or free GPU memory held by other processes before starting.

### Slow compression

- Use GPU if available: `device="cuda"`
- Reuse one compressor instance and batch multiple compressions (see the sketch below)
- Consider SmartCrusher for JSON (faster, similar results)
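When compressing many items, the dominant cost is the one-time model load (10-30s cold start), so reusing a single `LLMLinguaCompressor` amortizes it across the batch. A minimal sketch; the `compress_many` helper and `documents` list are illustrative, not part of the Headroom API:

```python
from headroom.transforms import LLMLinguaCompressor, unload_llmlingua_model

def compress_many(documents: list[str]) -> list[str]:
    """Compress many texts with one compressor, paying the model load once."""
    compressor = LLMLinguaCompressor()  # model loads lazily on the first call
    compressed = [compressor.compress(doc).compressed for doc in documents]
    unload_llmlingua_model()  # free ~1GB once the batch is done
    return compressed

documents = ["first long tool output...", "second long tool output..."]
print(compress_many(documents))
```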