# LLMLingua-2 Integration

For maximum compression, Headroom integrates with LLMLingua-2, Microsoft's BERT-based token classifier trained via GPT-4 distillation. It achieves up to 20x compression while preserving semantic meaning.

## When to Use LLMLingua-2

| Approach | Best For | Compression | Speed |
|---|---|---|---|
| SmartCrusher | JSON tool outputs | 70-90% | ~1ms |
| Text Utilities | Search/logs | 50-90% | ~1ms |
| LLMLingua-2 | Any text, max compression | 80-95% | ~50-200ms |

LLMLingua-2 is ideal when you need maximum compression and can tolerate slightly higher latency (e.g., compressing large tool outputs before storage, offline processing).
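
For example, a minimal sketch of the compress-before-storage pattern (the input text and file path are illustrative; only `LLMLinguaCompressor` and `result.compressed` are from the API shown under Basic Usage below):

```python
from headroom.transforms import LLMLinguaCompressor

compressor = LLMLinguaCompressor()

# Illustrative stand-in for a large tool output you want to archive compactly
raw_tool_output = "The search returned 500 results. Result 1: ... " * 200

result = compressor.compress(raw_tool_output)

# Persist the compressed text instead of the raw output (path is illustrative)
with open("tool_output.compressed.txt", "w") as f:
    f.write(result.compressed)
```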

## Installation

```bash
# Adds ~2GB of model weights
pip install "headroom-ai[llmlingua]"
```

## Basic Usage

```python
from headroom.transforms import LLMLinguaCompressor

# Create compressor (model loaded lazily on first use)
compressor = LLMLinguaCompressor()

# Compress any text
long_output = "The function processUserData takes a user object and validates..."
result = compressor.compress(long_output)

print(f"Before: {result.original_tokens} tokens")
print(f"After: {result.compressed_tokens} tokens")
print(f"Saved: {result.savings_percentage:.1f}%")
print(result.compressed)
```

## Content-Aware Compression

LLMLingua-2 automatically adjusts compression based on content type:

```python
from headroom.transforms import LLMLinguaCompressor, LLMLinguaConfig

# Per-content-type rates: fraction of tokens to keep
config = LLMLinguaConfig(
    code_compression_rate=0.4,    # More conservative
    json_compression_rate=0.35,   # Moderate
    text_compression_rate=0.25,   # Aggressive
)

compressor = LLMLinguaCompressor(config)

# Auto-detects content type
code_result = compressor.compress("def calculate(x): return x * 2")
text_result = compressor.compress("This is a verbose explanation...")
```

## Memory Management

The model uses ~1GB RAM. Unload it when done:

```python
from headroom.transforms import (
    LLMLinguaCompressor,
    unload_llmlingua_model,
    is_llmlingua_model_loaded,
)

compressor = LLMLinguaCompressor()
content = "A long tool output or document to compress..."  # any text
result = compressor.compress(content)  # Model loaded here

# Check if loaded
print(is_llmlingua_model_loaded())  # True

# Free memory when done
unload_llmlingua_model()  # Frees ~1GB
print(is_llmlingua_model_loaded())  # False

# Next compression will reload automatically
```
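
For batch jobs, one pattern (a sketch using only the functions above) is to guard the unload with `try/finally` so the ~1GB is released even if a compression raises:

```python
from headroom.transforms import LLMLinguaCompressor, unload_llmlingua_model

compressor = LLMLinguaCompressor()
documents = ["first long document ...", "second long document ..."]  # illustrative

try:
    results = [compressor.compress(doc) for doc in documents]
finally:
    # Always release the model memory, even if a compression fails
    unload_llmlingua_model()
```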

## Device Configuration

```python
from headroom.transforms import LLMLinguaConfig, LLMLinguaCompressor

# Force CPU (slower but works everywhere)
config = LLMLinguaConfig(device="cpu")

# Force GPU (faster but needs CUDA)
config = LLMLinguaConfig(device="cuda")

# Auto-detect (default): uses CUDA > MPS > CPU
config = LLMLinguaConfig(device="auto")

compressor = LLMLinguaCompressor(config)
```
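
If you prefer to make the fallback explicit in your own code, here is a sketch of the same CUDA > MPS > CPU order (assuming `torch` is importable, which the llmlingua extra pulls in):

```python
import torch
from headroom.transforms import LLMLinguaConfig, LLMLinguaCompressor

# Mirror the documented auto-detection order: CUDA, then Apple MPS, then CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

compressor = LLMLinguaCompressor(LLMLinguaConfig(device=device))
```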

## Use in Pipeline

```python
from headroom.transforms import TransformPipeline, LLMLinguaCompressor, SmartCrusher

# Combine with other transforms
pipeline = TransformPipeline([
    SmartCrusher(),        # First: compress JSON
    LLMLinguaCompressor(), # Then: ML compression on remaining text
])

# messages: the conversation messages to transform; tokenizer: the tokenizer used for counting
result = pipeline.apply(messages, tokenizer)
```

## Proxy Integration

Enable LLMLingua in the proxy server for automatic ML compression:

```bash
# Enable LLMLingua in the proxy (requires: pip install "headroom-ai[llmlingua,proxy]")
headroom proxy --llmlingua

# With custom settings
headroom proxy --llmlingua --llmlingua-device cuda --llmlingua-rate 0.4

# The proxy shows LLMLingua status at startup:
#   LLMLingua: ENABLED  (device=cuda, rate=0.4)
#
# If llmlingua is installed but not enabled, you'll see a helpful hint:
#   LLMLingua: available (enable with --llmlingua for ML compression)
```

## Configuration Reference

| Option | Default | Description |
|---|---|---|
| `device` | `"auto"` | Device to run the model on: `auto`, `cpu`, `cuda`, `mps` |
| `code_compression_rate` | `0.4` | Keep 40% of tokens for code |
| `json_compression_rate` | `0.35` | Keep 35% of tokens for JSON |
| `text_compression_rate` | `0.25` | Keep 25% of tokens for text |
| `force_tokens` | `[]` | Tokens to always preserve |
| `drop_consecutive` | `True` | Drop consecutive whitespace |
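
The two options not shown in the earlier examples, `force_tokens` and `drop_consecutive`, combine with the rates; a sketch with illustrative values:

```python
from headroom.transforms import LLMLinguaConfig, LLMLinguaCompressor

config = LLMLinguaConfig(
    text_compression_rate=0.25,
    force_tokens=["\n", "ERROR"],  # illustrative: always keep newlines and "ERROR"
    drop_consecutive=True,         # drop consecutive whitespace (the default)
)
compressor = LLMLinguaCompressor(config)
```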

## Performance Characteristics

| Metric | Value |
|---|---|
| Model size | ~500MB |
| Memory usage | ~1GB RAM |
| Cold start | 10-30s (first load) |
| Inference | 50-200ms per request |
| Compression | 80-95% |

## Why Opt-In?

LLMLingua adds significant dependencies and overhead:

| Aspect | Default Proxy | With LLMLingua |
|---|---|---|
| Dependencies | ~50MB | ~2GB |
| Cold start | <1s | 10-30s |
| Per-request | ~1-5ms | ~50-200ms |
| Compression | 70-90% | 80-95% |

The default proxy is lightweight and fast. Enable LLMLingua when you need maximum compression and can accept the tradeoffs.

## Troubleshooting

"Model not found"

```bash
# Ensure the llmlingua extra is installed
pip install "headroom-ai[llmlingua]"
```

"CUDA out of memory"

```python
# Force CPU mode
from headroom.transforms import LLMLinguaConfig

config = LLMLinguaConfig(device="cpu")
```

"Slow compression"

- Use GPU if available: `device="cuda"`
- Batch multiple compressions by reusing one compressor (see the sketch below)
- Consider using SmartCrusher for JSON (faster, with similar results)
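
For the batching point above, a minimal sketch: create the compressor once so the model loads a single time, then reuse it across items (inputs are illustrative):

```python
from headroom.transforms import LLMLinguaCompressor

compressor = LLMLinguaCompressor()  # model loads once, on the first compress()

items = ["long output one ...", "long output two ...", "long output three ..."]

# Reusing one compressor avoids paying the 10-30s cold start per item
compressed = [compressor.compress(item).compressed for item in items]
```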