Skip to content

Metrics & Monitoring

Headroom provides comprehensive metrics for monitoring compression performance, cost savings, and system health.

Proxy Metrics

Stats Endpoint

curl http://localhost:8787/stats
{
  "requests": {
    "total": 42,
    "cached": 5,
    "rate_limited": 0,
    "failed": 0
  },
  "tokens": {
    "input": 50000,
    "output": 8000,
    "saved": 12500,
    "savings_percent": 25.0
  },
  "cost": {
    "total_cost_usd": 0.15,
    "total_savings_usd": 0.04
  },
  "cache": {
    "entries": 10,
    "total_hits": 5
  }
}

Prometheus Metrics

curl http://localhost:8787/metrics
# HELP headroom_requests_total Total requests processed
headroom_requests_total{mode="optimize"} 1234

# HELP headroom_tokens_saved_total Total tokens saved
headroom_tokens_saved_total 5678900

# HELP headroom_compression_ratio Compression ratio histogram
headroom_compression_ratio_bucket{le="0.5"} 890
headroom_compression_ratio_bucket{le="0.7"} 1100
headroom_compression_ratio_bucket{le="0.9"} 1200

# HELP headroom_latency_seconds Request latency histogram
headroom_latency_seconds_bucket{le="0.01"} 800
headroom_latency_seconds_bucket{le="0.1"} 1150

# HELP headroom_cache_hits_total Cache hit counter
headroom_cache_hits_total 456

# HELP headroom_cache_misses_total Cache miss counter
headroom_cache_misses_total 778

Health Check

curl http://localhost:8787/health
{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 3600,
  "llmlingua_enabled": false
}

SDK Metrics

Session Stats

Quick stats for the current session (no database query):

stats = client.get_stats()
print(stats)
{
    "session": {
        "requests_total": 10,
        "tokens_input_before": 50000,
        "tokens_input_after": 35000,
        "tokens_saved_total": 15000,
        "tokens_output_total": 8000,
        "cache_hits": 3,
        "compression_ratio_avg": 0.70
    },
    "config": {
        "mode": "optimize",
        "provider": "openai",
        "cache_optimizer_enabled": True,
        "semantic_cache_enabled": False
    },
    "transforms": {
        "smart_crusher_enabled": True,
        "cache_aligner_enabled": True,
        "rolling_window_enabled": True
    }
}

Historical Metrics

Query stored metrics from the database:

from datetime import datetime, timedelta

# Get recent metrics
metrics = client.get_metrics(
    start_time=datetime.utcnow() - timedelta(hours=1),
    limit=100,
)

for m in metrics:
    print(f"{m.timestamp}: {m.tokens_input_before} -> {m.tokens_input_after}")

Summary Statistics

Aggregate statistics across all stored metrics:

summary = client.get_summary()
print(f"Total requests: {summary['total_requests']}")
print(f"Total tokens saved: {summary['total_tokens_saved']}")
print(f"Average compression: {summary['avg_compression_ratio']:.1%}")
print(f"Total cost savings: ${summary['total_cost_saved_usd']:.2f}")

Logging

Enable Logging

import logging

# INFO level shows compression summaries
logging.basicConfig(level=logging.INFO)

# DEBUG level shows detailed transform decisions
logging.basicConfig(level=logging.DEBUG)

Log Output Examples

INFO:headroom.transforms.pipeline:Pipeline complete: 45000 -> 4500 tokens (saved 40500, 90.0% reduction)
INFO:headroom.transforms.smart_crusher:SmartCrusher applied top_n strategy: kept 15 of 1000 items
INFO:headroom.cache.compression_store:CCR cache hit: hash=abc123, retrieved 1000 items
DEBUG:headroom.transforms.smart_crusher:Kept items: [0,1,2,42,77,97,98,99] (errors at 42, warnings at 77)

Proxy Logging

# Log to file
headroom proxy --log-file headroom.jsonl

# Increase verbosity
headroom proxy --log-level debug

Grafana Dashboard

Example Grafana dashboard configuration for Prometheus metrics:

{
  "panels": [
    {
      "title": "Tokens Saved",
      "type": "stat",
      "targets": [{"expr": "headroom_tokens_saved_total"}]
    },
    {
      "title": "Compression Ratio",
      "type": "gauge",
      "targets": [{"expr": "histogram_quantile(0.5, headroom_compression_ratio_bucket)"}]
    },
    {
      "title": "Request Latency (p99)",
      "type": "graph",
      "targets": [{"expr": "histogram_quantile(0.99, headroom_latency_seconds_bucket)"}]
    },
    {
      "title": "Cache Hit Rate",
      "type": "gauge",
      "targets": [{"expr": "headroom_cache_hits_total / (headroom_cache_hits_total + headroom_cache_misses_total)"}]
    }
  ]
}

Cost Tracking

Per-Request Cost

Each request includes cost metadata in the response:

response = client.chat.completions.create(...)

# Access via response metadata (if available)
# Cost is calculated based on model pricing and token counts

Budget Alerts

Set a budget limit in the proxy:

headroom proxy --budget 10.00

When the budget is exceeded: - Requests return a budget exceeded error - The /stats endpoint shows budget status - Logs indicate budget state

Validation

Validate your setup is correct:

result = client.validate_setup()

if result["valid"]:
    print("Setup is correct!")
else:
    print("Issues found:")
    for issue in result["issues"]:
        print(f"  - {issue}")

Key Metrics to Monitor

Metric What It Tells You Target
tokens_saved_total Total cost savings Higher is better
compression_ratio_avg Efficiency 0.7-0.9 typical
cache_hit_rate Cache effectiveness >20% is good
latency_p99 Performance impact <10ms
failed_requests Reliability 0