# LLMLingua-2 Integration

For maximum compression, Headroom integrates with LLMLingua-2, Microsoft's BERT-based token classifier trained via GPT-4 distillation. It achieves up to 20x compression while preserving semantic meaning.
## When to Use LLMLingua-2
| Approach | Best For | Compression | Speed |
|---|---|---|---|
| SmartCrusher | JSON tool outputs | 70-90% | ~1ms |
| Text Utilities | Search/logs | 50-90% | ~1ms |
| LLMLingua-2 | Any text, max compression | 80-95% | ~50-200ms |
LLMLingua-2 is ideal when you need maximum compression and can tolerate higher latency (e.g., compressing large tool outputs before storage, or offline processing).
## Installation
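LLMLingua support ships as an optional extra. Judging from the install extra referenced under Proxy Integration below, a standalone install is presumably:

```bash
pip install headroom-ai[llmlingua]

# Or, together with the proxy server:
pip install headroom-ai[llmlingua,proxy]
```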
## Basic Usage

```python
from headroom.transforms import LLMLinguaCompressor

# Create compressor (model loaded lazily on first use)
compressor = LLMLinguaCompressor()

# Compress any text
long_output = "The function processUserData takes a user object and validates..."
result = compressor.compress(long_output)

print(f"Before: {result.original_tokens} tokens")
print(f"After: {result.compressed_tokens} tokens")
print(f"Saved: {result.savings_percentage:.1f}%")
print(result.compressed)
```
## Content-Aware Compression
LLMLingua-2 automatically adjusts compression based on content type:
```python
from headroom.transforms import LLMLinguaCompressor, LLMLinguaConfig

# Conservative for code (keep 40% of tokens)
config = LLMLinguaConfig(
    code_compression_rate=0.4,   # More conservative
    json_compression_rate=0.35,  # Moderate
    text_compression_rate=0.25,  # Aggressive
)
compressor = LLMLinguaCompressor(config)

# Auto-detects content type
code_result = compressor.compress("def calculate(x): return x * 2")
text_result = compressor.compress("This is a verbose explanation...")
```
## Memory Management
The model uses ~1GB RAM. Unload it when done:
```python
from headroom.transforms import (
    LLMLinguaCompressor,
    unload_llmlingua_model,
    is_llmlingua_model_loaded,
)

content = "A long tool output to compress..."  # any text

compressor = LLMLinguaCompressor()
result = compressor.compress(content)  # Model loaded here

# Check if loaded
print(is_llmlingua_model_loaded())  # True

# Free memory when done
unload_llmlingua_model()  # Frees ~1GB
print(is_llmlingua_model_loaded())  # False

# The next compression will reload the model automatically
```
## Device Configuration

```python
from headroom.transforms import LLMLinguaConfig, LLMLinguaCompressor

# Force CPU (slower, but works everywhere)
config = LLMLinguaConfig(device="cpu")

# Force GPU (faster, but requires CUDA)
config = LLMLinguaConfig(device="cuda")

# Auto-detect (default): prefers CUDA, then MPS, then CPU
config = LLMLinguaConfig(device="auto")

compressor = LLMLinguaCompressor(config)
```
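If you want to see which device `auto` will resolve to before loading the model, you can replicate the documented CUDA > MPS > CPU order yourself. A minimal sketch, assuming the standard PyTorch backend that LLMLingua-2 models run on:

```python
import torch

from headroom.transforms import LLMLinguaConfig, LLMLinguaCompressor

# Mirror the documented auto-detect order: CUDA > MPS > CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"'auto' would resolve to: {device}")
compressor = LLMLinguaCompressor(LLMLinguaConfig(device=device))
```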
## Use in Pipeline
```python
from headroom.transforms import TransformPipeline, LLMLinguaCompressor, SmartCrusher

# Combine with other transforms
pipeline = TransformPipeline([
    SmartCrusher(),         # First: compress JSON structurally
    LLMLinguaCompressor(),  # Then: ML compression on remaining text
])

# `messages` and `tokenizer` come from your application / model setup
result = pipeline.apply(messages, tokenizer)
```
## Proxy Integration
Enable LLMLingua in the proxy server for automatic ML compression:
```bash
# Enable LLMLingua in the proxy (requires: pip install headroom-ai[llmlingua,proxy])
headroom proxy --llmlingua

# With custom settings
headroom proxy --llmlingua --llmlingua-device cuda --llmlingua-rate 0.4

# The proxy shows LLMLingua status at startup:
#   LLMLingua: ENABLED (device=cuda, rate=0.4)
#
# If llmlingua is installed but not enabled, you'll see a helpful hint:
#   LLMLingua: available (enable with --llmlingua for ML compression)
```
## Configuration Reference

| Option | Default | Description |
|---|---|---|
| `device` | `"auto"` | Device to run the model on: `auto`, `cpu`, `cuda`, `mps` |
| `code_compression_rate` | `0.4` | Keep 40% of tokens for code |
| `json_compression_rate` | `0.35` | Keep 35% of tokens for JSON |
| `text_compression_rate` | `0.25` | Keep 25% of tokens for text |
| `force_tokens` | `[]` | Tokens to always preserve |
| `drop_consecutive` | `True` | Drop consecutive whitespace |
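Putting the options together, a full configuration might look like the sketch below. The specific values are illustrative; the `force_tokens` entries are hypothetical markers you never want dropped:

```python
from headroom.transforms import LLMLinguaCompressor, LLMLinguaConfig

config = LLMLinguaConfig(
    device="auto",               # CUDA > MPS > CPU
    code_compression_rate=0.4,   # keep 40% of tokens for code
    json_compression_rate=0.35,  # keep 35% of tokens for JSON
    text_compression_rate=0.25,  # keep 25% of tokens for text
    force_tokens=["ERROR", "WARNING"],  # always preserve these tokens
    drop_consecutive=True,       # drop consecutive whitespace
)
compressor = LLMLinguaCompressor(config)
```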
## Performance Characteristics
| Metric | Value |
|---|---|
| Model size | ~500MB |
| Memory usage | ~1GB RAM |
| Cold start | 10-30s (first load) |
| Inference | 50-200ms per request |
| Compression | 80-95% |
## Why Opt-In?
LLMLingua adds significant dependencies and overhead:
| Aspect | Default Proxy | With LLMLingua |
|---|---|---|
| Dependencies | ~50MB | ~2GB |
| Cold start | <1s | 10-30s |
| Per-request | ~1-5ms | ~50-200ms |
| Compression | 70-90% | 80-95% |
The default proxy is lightweight and fast. Enable LLMLingua when you need maximum compression and can accept the tradeoffs.
## Troubleshooting

### "Model not found"

Ensure the optional dependency is installed (`pip install headroom-ai[llmlingua]`). The ~500MB model is loaded on first use, so the first run can take 10-30s and typically needs network access to fetch it.

### "CUDA out of memory"

Fall back to CPU with `LLMLinguaConfig(device="cpu")`, or free GPU memory held by other processes before starting.

### Slow compression

- Use GPU if available: `device="cuda"`
- Reuse one compressor instance and batch multiple compressions (see the sketch below)
- Consider SmartCrusher for JSON (faster, similar results)
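When compressing many items, the dominant cost is the one-time model load (10-30s cold start), so reusing a single `LLMLinguaCompressor` amortizes it across the batch. A minimal sketch; the `compress_many` helper and `documents` list are illustrative, not part of the Headroom API:

```python
from headroom.transforms import LLMLinguaCompressor, unload_llmlingua_model

def compress_many(documents: list[str]) -> list[str]:
    """Compress many texts with one compressor, paying the model load once."""
    compressor = LLMLinguaCompressor()  # model loads lazily on the first call
    compressed = [compressor.compress(doc).compressed for doc in documents]
    unload_llmlingua_model()  # free ~1GB once the batch is done
    return compressed

documents = ["first long tool output...", "second long tool output..."]
print(compress_many(documents))
```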