Metrics & Monitoring¶
Headroom provides comprehensive metrics for monitoring compression performance, cost savings, and system health.
Proxy Metrics¶
Stats Endpoint¶
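The proxy's /stats endpoint returns a JSON summary of request counts, token usage, cost, and cache activity: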
{
"requests": {
"total": 42,
"cached": 5,
"rate_limited": 0,
"failed": 0
},
"tokens": {
"input": 50000,
"output": 8000,
"saved": 12500,
"savings_percent": 25.0
},
"cost": {
"total_cost_usd": 0.15,
"total_savings_usd": 0.04
},
"cache": {
"entries": 10,
"total_hits": 5
}
}
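For programmatic monitoring you can poll the endpoint with any HTTP client. A minimal sketch using requests; the listen address below is an assumption, so substitute wherever your proxy actually runs:
import requests
# Assumed proxy address; adjust to your deployment's host and port.
PROXY_URL = "http://localhost:8000"
stats = requests.get(f"{PROXY_URL}/stats", timeout=5).json()
tokens = stats["tokens"]
print(f"Tokens saved: {tokens['saved']} ({tokens['savings_percent']:.1f}%)")
print(f"Cost savings: ${stats['cost']['total_savings_usd']:.2f}")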
Prometheus Metrics¶
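The proxy also exposes metrics in Prometheus exposition format for scraping: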
# HELP headroom_requests_total Total requests processed
headroom_requests_total{mode="optimize"} 1234
# HELP headroom_tokens_saved_total Total tokens saved
headroom_tokens_saved_total 5678900
# HELP headroom_compression_ratio Compression ratio histogram
headroom_compression_ratio_bucket{le="0.5"} 890
headroom_compression_ratio_bucket{le="0.7"} 1100
headroom_compression_ratio_bucket{le="0.9"} 1200
# HELP headroom_latency_seconds Request latency histogram
headroom_latency_seconds_bucket{le="0.01"} 800
headroom_latency_seconds_bucket{le="0.1"} 1150
# HELP headroom_cache_hits_total Cache hit counter
headroom_cache_hits_total 456
# HELP headroom_cache_misses_total Cache miss counter
headroom_cache_misses_total 778
Health Check¶
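A liveness probe can be run against the proxy. The endpoint path below is an assumption (it is not defined on this page), so confirm the actual health route for your version:
import requests
# Hypothetical health endpoint; verify the real path for your proxy version.
resp = requests.get("http://localhost:8000/health", timeout=2)
print("healthy" if resp.ok else f"unhealthy: HTTP {resp.status_code}")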
SDK Metrics¶
Session Stats¶
Quick stats for the current session (no database query):
{
"session": {
"requests_total": 10,
"tokens_input_before": 50000,
"tokens_input_after": 35000,
"tokens_saved_total": 15000,
"tokens_output_total": 8000,
"cache_hits": 3,
"compression_ratio_avg": 0.70
},
"config": {
"mode": "optimize",
"provider": "openai",
"cache_optimizer_enabled": True,
"semantic_cache_enabled": False
},
"transforms": {
"smart_crusher_enabled": True,
"cache_aligner_enabled": True,
"rolling_window_enabled": True
}
}
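A sketch of reading these values from the SDK client. get_session_stats() is a hypothetical accessor name used here for illustration; call whichever session-stats method your SDK version provides:
# get_session_stats() is a placeholder name; use the session-stats accessor
# your Headroom SDK actually exposes.
stats = client.get_session_stats()
session = stats["session"]
print(f"Requests: {session['requests_total']}, tokens saved: {session['tokens_saved_total']}")
print(f"Average compression ratio: {session['compression_ratio_avg']:.2f}")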
Historical Metrics¶
Query stored metrics from the database:
from datetime import datetime, timedelta
# Get recent metrics
metrics = client.get_metrics(
    start_time=datetime.utcnow() - timedelta(hours=1),
    limit=100,
)
for m in metrics:
    print(f"{m.timestamp}: {m.tokens_input_before} -> {m.tokens_input_after}")
Summary Statistics¶
Aggregate statistics across all stored metrics:
summary = client.get_summary()
print(f"Total requests: {summary['total_requests']}")
print(f"Total tokens saved: {summary['total_tokens_saved']}")
print(f"Average compression: {summary['avg_compression_ratio']:.1%}")
print(f"Total cost savings: ${summary['total_cost_saved_usd']:.2f}")
Logging¶
Enable Logging¶
import logging
# INFO level shows compression summaries
logging.basicConfig(level=logging.INFO)
# DEBUG level shows detailed transform decisions
logging.basicConfig(level=logging.DEBUG)
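To get detailed output from Headroom without flooding logs from other libraries, target its logger namespace directly (the headroom.* logger names are visible in the examples below):
import logging
# Keep the root logger at INFO; only Headroom's own loggers emit DEBUG detail.
logging.basicConfig(level=logging.INFO)
logging.getLogger("headroom").setLevel(logging.DEBUG)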
Log Output Examples¶
INFO:headroom.transforms.pipeline:Pipeline complete: 45000 -> 4500 tokens (saved 40500, 90.0% reduction)
INFO:headroom.transforms.smart_crusher:SmartCrusher applied top_n strategy: kept 15 of 1000 items
INFO:headroom.cache.compression_store:CCR cache hit: hash=abc123, retrieved 1000 items
DEBUG:headroom.transforms.smart_crusher:Kept items: [0,1,2,42,77,97,98,99] (errors at 42, warnings at 77)
Proxy Logging¶
# Log to file
headroom proxy --log-file headroom.jsonl
# Increase verbosity
headroom proxy --log-level debug
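The example above writes to a .jsonl file, which indicates JSON Lines output that can be post-processed offline. Because the exact record fields depend on your version, the sketch below only reports which keys each entry contains:
import json
# Inspect the structure of the JSONL request log without assuming field names.
with open("headroom.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(sorted(record.keys()))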
Grafana Dashboard¶
Example Grafana dashboard configuration for Prometheus metrics:
{
"panels": [
{
"title": "Tokens Saved",
"type": "stat",
"targets": [{"expr": "headroom_tokens_saved_total"}]
},
{
"title": "Compression Ratio",
"type": "gauge",
"targets": [{"expr": "histogram_quantile(0.5, headroom_compression_ratio_bucket)"}]
},
{
"title": "Request Latency (p99)",
"type": "graph",
"targets": [{"expr": "histogram_quantile(0.99, headroom_latency_seconds_bucket)"}]
},
{
"title": "Cache Hit Rate",
"type": "gauge",
"targets": [{"expr": "headroom_cache_hits_total / (headroom_cache_hits_total + headroom_cache_misses_total)"}]
}
]
}
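These panels are a starting point rather than a complete dashboard export; paste the expressions into your own panels and adjust panel types to match your Grafana version.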
Cost Tracking¶
Per-Request Cost¶
Each request includes cost metadata in the response:
response = client.chat.completions.create(...)
# Access via response metadata (if available)
# Cost is calculated based on model pricing and token counts
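As the comments note, cost is derived from model pricing and token counts. A back-of-the-envelope version of that calculation, assuming an OpenAI-style response object with a usage field; the per-million-token prices are placeholders, not Headroom's pricing table:
# Placeholder per-million-token prices; substitute your model's actual rates.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00
usage = response.usage  # token counts reported by the provider
cost = (usage.prompt_tokens * INPUT_PRICE_PER_M
        + usage.completion_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
print(f"Estimated request cost: ${cost:.4f}")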
Budget Alerts¶
A budget limit can be configured on the proxy.
When the budget is exceeded:
- Requests return a budget exceeded error
- The /stats endpoint shows budget status
- Logs indicate budget state
Validation¶
Validate your setup is correct:
result = client.validate_setup()
if result["valid"]:
    print("Setup is correct!")
else:
    print("Issues found:")
    for issue in result["issues"]:
        print(f" - {issue}")
Key Metrics to Monitor¶
| Metric | What It Tells You | Target |
|---|---|---|
| `tokens_saved_total` | Total cost savings | Higher is better |
| `compression_ratio_avg` | Efficiency | 0.7-0.9 typical |
| `cache_hit_rate` | Cache effectiveness | >20% is good |
| `latency_p99` | Performance impact | <10ms |
| `failed_requests` | Reliability | 0 |