Metrics & Monitoring¶

Headroom provides comprehensive metrics for monitoring compression performance, cost savings, and system health.

Proxy Metrics¶

Stats Endpoint¶

curl http://localhost:8787/stats

{
  "persistent_savings": {
    "lifetime": {
      "tokens_saved": 12500,
      "compression_savings_usd": 0.04
    },
    "recent_history": [
      {
        "timestamp": "2026-03-27T09:00:00Z",
        "total_tokens_saved": 12500,
        "compression_savings_usd": 0.04
      }
    ]
  },
  "requests": {
    "total": 42,
    "cached": 5,
    "rate_limited": 0,
    "failed": 0
  },
  "tokens": {
    "input": 50000,
    "output": 8000,
    "saved": 12500,
    "savings_percent": 25.0
  },
  "cost": {
    "total_cost_usd": 0.15,
    "total_savings_usd": 0.04
  },
  "cache": {
    "entries": 10,
    "total_hits": 5
  }
}

/stats keeps the existing live/session fields, including savings_history, for backward compatibility. The new persistent_savings block is durable local proxy compression history stored by default at ${HEADROOM_WORKSPACE_DIR}/proxy_savings.json (i.e. ~/.headroom/proxy_savings.json when HEADROOM_WORKSPACE_DIR is unset). Use HEADROOM_SAVINGS_PATH to override the file location directly, or set HEADROOM_WORKSPACE_DIR to relocate the entire state root. See the Filesystem Contract for details.

For Anthropic-style providers that return cache-write TTL buckets, /stats also surfaces observed cache TTL usage under prefix_cache:

{
  "prefix_cache": {
    "by_provider": {
      "anthropic": {
        "observed_ttl_buckets": {
          "5m": {"tokens": 20000, "requests": 8},
          "1h": {"tokens": 50000, "requests": 12}
        },
        "observed_ttl_mix": {
          "5m_pct": 28.6,
          "1h_pct": 71.4,
          "active_buckets": ["5m", "1h"]
        }
      }
    },
    "totals": {
      "observed_ttl_buckets": {
        "5m": {"tokens": 20000, "requests": 8},
        "1h": {"tokens": 50000, "requests": 12}
      }
    }
  }
}

These fields are observational only:

they reflect provider-reported cache write buckets
they do not configure TTL
they do not represent remaining expiration time

Historical Savings Endpoint¶

curl http://localhost:8787/stats-history

{
  "schema_version": 2,
  "generated_at": "2026-03-27T09:10:00Z",
  "lifetime": {
    "tokens_saved": 12500,
    "compression_savings_usd": 0.04
  },
  "history": [
    {
      "timestamp": "2026-03-27T09:00:00Z",
      "total_tokens_saved": 12000,
      "compression_savings_usd": 0.038
    }
  ],
  "series": {
    "hourly": [],
    "daily": [],
    "weekly": [],
    "monthly": []
  },
  "exports": {
    "default_format": "json",
    "available_formats": ["json", "csv"],
    "available_series": ["history", "hourly", "daily", "weekly", "monthly"]
  },
  "history_summary": {
    "mode": "compact",
    "stored_points": 2048,
    "returned_points": 500,
    "compacted": true
  }
}

/stats-history is the stable frontend-facing API for durable proxy compression history. It survives proxy restarts, tolerates missing or malformed state files, and powers the historical view in /dashboard. It now includes hourly, daily, weekly, and monthly chart-ready rollups.

By default, the history array is compacted for transport efficiency. Use history_mode=full when you explicitly need the full retained checkpoint list, or history_mode=none when you only need the aggregate rollups and lifetime totals.

For export-friendly downloads:

curl "http://localhost:8787/stats-history?format=csv&series=daily"
curl "http://localhost:8787/stats-history?format=csv&series=monthly"
curl "http://localhost:8787/stats-history?history_mode=full"

CSV exports are available for history, hourly, daily, weekly, and monthly. Plain JSON remains the default response format.

Prometheus Metrics¶

curl http://localhost:8787/metrics

# HELP headroom_requests_total Total number of requests
headroom_requests_total 1234

# HELP headroom_latency_ms_count Count of observed request latencies
headroom_latency_ms_count 1234

# HELP headroom_tokens_saved_total Tokens saved by optimization
headroom_tokens_saved_total 5678900

# HELP headroom_requests_by_provider Requests by provider
headroom_requests_by_provider{provider="anthropic"} 800
headroom_requests_by_provider{provider="openai"} 434

# HELP headroom_requests_by_stack Requests by Headroom integration stack
headroom_requests_by_stack{stack="wrap_claude"} 612
headroom_requests_by_stack{stack="adapter_ts_openai"} 48

# HELP headroom_transform_timing_ms_sum Sum of transform timing in milliseconds
headroom_transform_timing_ms_sum{transform="router"} 5123.7

# HELP headroom_cache_write_ttl_tokens_total Provider cache write tokens by observed TTL bucket
headroom_cache_write_ttl_tokens_total{provider="anthropic",ttl="5m"} 20000
headroom_cache_write_ttl_tokens_total{provider="anthropic",ttl="1h"} 50000

The built-in Prometheus endpoint exposes the proxy's in-memory operational state, including:

request counters
token totals and savings
latency / overhead / TTFB summaries
per-provider and per-model request counts
per-stage pipeline timing
waste signal token totals
provider cache read/write and TTL-bucket counters
cache bust counters

OTEL Metrics¶

Headroom now emits the same operational events through a shared OTEL metrics facade.

There are two integration modes:

Ambient OTEL app setup - if your application already configures a global OTEL meter provider, Headroom records into that provider automatically.
Headroom-managed export - if you want the proxy to configure its own OTEL metrics exporter, install:

pip install "headroom-ai[proxy,otel]"

Then set:

HEADROOM_OTEL_METRICS_ENABLED=1
HEADROOM_OTEL_METRICS_EXPORTER=otlp_http
HEADROOM_OTEL_METRICS_ENDPOINT=http://127.0.0.1:4318/v1/metrics
HEADROOM_OTEL_SERVICE_NAME=headroom-proxy
HEADROOM_OTEL_RESOURCE_ATTRIBUTES=deployment.environment=dev,service.namespace=headroom

For local validation without a collector:

HEADROOM_OTEL_METRICS_ENABLED=1
HEADROOM_OTEL_METRICS_EXPORTER=console
headroom proxy

The proxy's /stats response now includes an otel block that reports whether Headroom is managing an OTEL exporter for the current process.

Headroom's managed OTEL exporters are intentionally scoped to Headroom's own instrumentation. If you already manage global OTEL providers in your app, keep using those and let Headroom record into the ambient providers instead of enabling HEADROOM_OTEL_*.

OTEL Environment Variables¶

Variable	Default	Description
`HEADROOM_OTEL_METRICS_ENABLED`	`0`	Enables Headroom-managed OTEL metric export
`HEADROOM_OTEL_METRICS_EXPORTER`	`otlp_http`	Exporter type: `otlp_http` or `console`
`HEADROOM_OTEL_METRICS_ENDPOINT`	unset	OTLP HTTP metrics endpoint
`HEADROOM_OTEL_METRICS_HEADERS`	unset	Comma-separated `key=value` headers for OTLP export
`HEADROOM_OTEL_METRICS_EXPORT_INTERVAL_MS`	`10000`	Periodic export interval in milliseconds
`HEADROOM_OTEL_SERVICE_NAME`	`headroom-proxy` in proxy mode	OTEL `service.name`
`HEADROOM_OTEL_RESOURCE_ATTRIBUTES`	unset	Comma-separated resource attributes

Anonymous Telemetry vs OTEL¶

Headroom has two separate systems:

HEADROOM_TELEMETRY / --no-telemetry controls the privacy-preserving anonymous data-flywheel beacon and TOIN-related aggregate reporting.
HEADROOM_OTEL_* controls operational OTEL metric export.

They are independent by design so you can disable the anonymous beacon while keeping OTEL metrics enabled, or vice versa.

Beacon identity fields¶

When the anonymous beacon is enabled, each report includes two identity fields so usage can be segmented by integration surface and deployment shape:

headroom_stack — how Headroom is invoked in this process. Values: proxy, wrap_<agent> (e.g. wrap_claude, wrap_codex), adapter_<lang>_<provider> (e.g. adapter_ts_openai), mixed (multi-stack proxy with no dominant caller), or unknown. Overridable via HEADROOM_STACK; headroom wrap <tool> sets it automatically.
install_mode — how the proxy is deployed. Values: wrapped (spawned by headroom wrap), persistent (long-lived service on a fixed port), on_demand (short-lived direct invocation), or unknown.
requests_by_stack — for proxies serving multiple integrations (e.g. a persistent proxy hit by both wrap_claude and a TS adapter), a per-stack request count dict mirroring the headroom_requests_by_stack counter.

Clients tag requests with an X-Headroom-Stack header; the proxy's FastAPI middleware buckets these on /v1/*. Detection is best-effort — any failure falls back to "unknown" and never breaks the proxy.

Langfuse¶

Langfuse fits next to this implementation as a trace backend, not as a metrics backend.

Headroom metrics continue to go to /metrics and/or your OTEL metrics exporter.
Langfuse receives OTLP traces for Headroom's compression pipeline.
Headroom's /stats response includes a langfuse block when Headroom is managing Langfuse trace export for the process.

Enable it with:

HEADROOM_LANGFUSE_ENABLED=1
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_BASE_URL=https://cloud.langfuse.com

For self-hosted Langfuse, set LANGFUSE_BASE_URL to your instance URL.

Health Check¶

curl http://localhost:8787/health

{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 3600
}

SDK Metrics¶

Session Stats¶

Quick stats for the current session (no database query):

stats = client.get_stats()
print(stats)

{
    "session": {
        "requests_total": 10,
        "tokens_input_before": 50000,
        "tokens_input_after": 35000,
        "tokens_saved_total": 15000,
        "tokens_output_total": 8000,
        "cache_hits": 3,
        "compression_ratio_avg": 0.70
    },
    "config": {
        "mode": "optimize",
        "provider": "openai",
        "cache_optimizer_enabled": True,
        "semantic_cache_enabled": False
    },
    "transforms": {
        "smart_crusher_enabled": True,
        "cache_aligner_enabled": True,
        "rolling_window_enabled": True
    }
}

Historical Metrics¶

Query stored metrics from the database:

from datetime import datetime, timedelta

# Get recent metrics
metrics = client.get_metrics(
    start_time=datetime.utcnow() - timedelta(hours=1),
    limit=100,
)

for m in metrics:
    print(f"{m.timestamp}: {m.tokens_input_before} -> {m.tokens_input_after}")

Summary Statistics¶

Aggregate statistics across all stored metrics:

summary = client.get_summary()
print(f"Total requests: {summary['total_requests']}")
print(f"Total tokens saved: {summary['total_tokens_saved']}")
print(f"Average compression: {summary['avg_compression_ratio']:.1%}")
print(f"Total cost savings: ${summary['total_cost_saved_usd']:.2f}")

Logging¶

Enable Logging¶

import logging

# INFO level shows compression summaries
logging.basicConfig(level=logging.INFO)

# DEBUG level shows detailed transform decisions
logging.basicConfig(level=logging.DEBUG)

Log Output Examples¶

INFO:headroom.transforms.pipeline:Pipeline complete: 45000 -> 4500 tokens (saved 40500, 90.0% reduction)
INFO:headroom.transforms.smart_crusher:SmartCrusher applied top_n strategy: kept 15 of 1000 items
INFO:headroom.cache.compression_store:CCR cache hit: hash=abc123, retrieved 1000 items
DEBUG:headroom.transforms.smart_crusher:Kept items: [0,1,2,42,77,97,98,99] (errors at 42, warnings at 77)

Proxy Logging¶

# Log to file
headroom proxy --log-file headroom.jsonl

# Increase verbosity
headroom proxy --log-level debug

Grafana Dashboard¶

Example Grafana dashboard configuration for Prometheus metrics:

{
  "panels": [
    {
      "title": "Tokens Saved",
      "type": "stat",
      "targets": [{"expr": "headroom_tokens_saved_total"}]
    },
    {
      "title": "Average Request Latency (ms)",
      "type": "gauge",
      "targets": [{"expr": "headroom_latency_ms_sum / clamp_min(headroom_latency_ms_count, 1)"}]
    },
    {
      "title": "Max Request Latency (ms)",
      "type": "graph",
      "targets": [{"expr": "headroom_latency_ms_max"}]
    },
    {
      "title": "Provider Cache Hit Rate",
      "type": "gauge",
      "targets": [{"expr": "headroom_provider_cache_hit_requests_total / clamp_min(headroom_provider_cache_requests_total, 1)"}]
    }
  ]
}

Cost Tracking¶

Per-Request Cost¶

Each request includes cost metadata in the response:

response = client.chat.completions.create(...)

# Access via response metadata (if available)
# Cost is calculated based on model pricing and token counts

Budget Alerts¶

Set a budget limit in the proxy:

headroom proxy --budget 10.00

When the budget is exceeded: - Requests return a budget exceeded error - The /stats endpoint shows budget status - Logs indicate budget state

Validation¶

Validate your setup is correct:

result = client.validate_setup()

if result["valid"]:
    print("Setup is correct!")
else:
    print("Issues found:")
    for issue in result["issues"]:
        print(f"  - {issue}")

Key Metrics to Monitor¶

Metric	What It Tells You	Target
`tokens_saved_total`	Total cost savings	Higher is better
`compression_ratio_avg`	Efficiency	0.7-0.9 typical
`cache_hit_rate`	Cache effectiveness	>20% is good
`latency_p99`	Performance impact	<10ms
`failed_requests`	Reliability	0