Agno Integration¶
Headroom integrates with Agno (formerly Phidata) to provide automatic context optimization for AI agents. This guide covers model wrapping, observability hooks, and multi-provider support.
Installation¶
Install Headroom with Agno support; you'll also need Agno itself. The commands below assume Headroom exposes an `agno` extra (check the project README if your version differs):
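```bash
# Assumed extras name; check the project README if it differs
pip install "headroom[agno]"

# Agno itself
pip install agno
```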
Quick Start¶
```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

from headroom.integrations.agno import HeadroomAgnoModel

# Wrap your model
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))

# Create agent as usual
agent = Agent(model=model)

# Use exactly like before
response = agent.run("What's the capital of France?")

# Check savings
print(f"Tokens saved: {model.total_tokens_saved}")
print(model.get_savings_summary())
# {'total_requests': 1, 'total_tokens_saved': 245, 'average_savings_percent': 12.3}
```
Integration Patterns¶
1. Basic Model Wrapping¶
The simplest integration is to wrap any Agno model with `HeadroomAgnoModel`:
```python
from agno.models.openai import OpenAIChat
from agno.models.anthropic import Claude
from agno.models.google import Gemini

from headroom.integrations.agno import HeadroomAgnoModel

# Works with any Agno model
openai_model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
claude_model = HeadroomAgnoModel(Claude(id="claude-3-5-sonnet-20241022"))
gemini_model = HeadroomAgnoModel(Gemini(id="gemini-2.0-flash"))

# Each automatically uses the correct provider for accurate token counting
```
Why this matters: Headroom automatically detects the underlying provider and applies the correct tokenizer for accurate optimization metrics.
2. Agent with Observability Hooks¶
Use hooks for detailed tracking without modifying your model:
```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

from headroom.integrations.agno import (
    HeadroomAgnoModel,
    HeadroomPreHook,
    HeadroomPostHook,
)

# Model wrapper for optimization
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))

# Hooks for observability
pre_hook = HeadroomPreHook()
post_hook = HeadroomPostHook(token_alert_threshold=10000)

agent = Agent(
    model=model,
    pre_hooks=[pre_hook],
    post_hooks=[post_hook],
)

# Run agent
response = agent.run("Analyze this large dataset...")

# Check metrics from model
print(f"Tokens saved: {model.total_tokens_saved}")

# Check observability from hooks
print(f"Post-hook summary: {post_hook.get_summary()}")
print(f"Alerts triggered: {post_hook.alerts}")
```
Why this matters: Hooks provide observability into agent behavior and can alert when token usage exceeds thresholds.
3. Convenience Hook Factory¶
Use `create_headroom_hooks()` to create a matched pre/post hook pair:
```python
from headroom.integrations.agno import create_headroom_hooks

pre_hook, post_hook = create_headroom_hooks(
    token_alert_threshold=5000,
    log_level="DEBUG",
)

agent = Agent(
    model=model,
    pre_hooks=[pre_hook],
    post_hooks=[post_hook],
)
```
4. Custom Configuration¶
Pass a `HeadroomConfig` for fine-grained control:
```python
from agno.models.openai import OpenAIChat

from headroom import HeadroomConfig, HeadroomMode
from headroom.integrations.agno import HeadroomAgnoModel

config = HeadroomConfig(
    default_mode=HeadroomMode.OPTIMIZE,
    # Add other configuration options as needed
)

model = HeadroomAgnoModel(
    wrapped_model=OpenAIChat(id="gpt-4o"),
    config=config,
)
```
5. Standalone Message Optimization¶
Optimize messages without wrapping a model:
```python
from headroom.integrations.agno import optimize_messages

# `large_json` is whatever bulky payload you need to send
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Analyze this large JSON: " + large_json},
]

optimized_messages, metrics = optimize_messages(messages, model="gpt-4o")

print(f"Tokens saved: {metrics['tokens_saved']}")
print(f"Transforms applied: {metrics['transforms_applied']}")
```
6. Async Operations¶
Full async support for high-throughput applications:
```python
import asyncio

from agno.models.openai import OpenAIChat

from headroom.integrations.agno import HeadroomAgnoModel


async def process_async():
    model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))

    # Async response (`messages` is a chat message list prepared elsewhere)
    response = await model.aresponse(messages)

    # Async streaming
    async for chunk in model.aresponse_stream(messages):
        print(chunk, end="", flush=True)

    print(f"\nTokens saved: {model.total_tokens_saved}")


asyncio.run(process_async())
```
Real-World Examples¶
Example 1: Tool-Heavy Agent¶
```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools.duckduckgo import DuckDuckGoTools

from headroom.integrations.agno import HeadroomAgnoModel

# Wrap model for optimization
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))

# Agent with search tools
agent = Agent(
    model=model,
    tools=[DuckDuckGoTools()],
    show_tool_calls=True,
)

# Tool outputs get compressed automatically
response = agent.run("Research the latest AI developments and summarize")

# Impact: Tool outputs (often 10K+ tokens) compressed by 70-90%
print(f"Tokens saved: {model.total_tokens_saved}")
print(model.get_savings_summary())
```
Example 2: Multi-Model Routing¶
```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.models.anthropic import Claude

from headroom.integrations.agno import HeadroomAgnoModel

# Different models for different tasks
fast_model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o-mini"))
powerful_model = HeadroomAgnoModel(Claude(id="claude-3-5-sonnet-20241022"))

# Use the fast model for simple tasks
simple_agent = Agent(model=fast_model)

# Use the powerful model for complex reasoning
complex_agent = Agent(model=powerful_model)

# Each wrapper tracks its own metrics
print(f"Fast model saved: {fast_model.total_tokens_saved}")
print(f"Powerful model saved: {powerful_model.total_tokens_saved}")
```
Example 3: Production Monitoring¶
```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

from headroom.integrations.agno import (
    HeadroomAgnoModel,
    create_headroom_hooks,
)

model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))

pre_hook, post_hook = create_headroom_hooks(
    token_alert_threshold=50000,  # Alert on large requests
    log_level="WARNING",
)

agent = Agent(
    model=model,
    pre_hooks=[pre_hook],
    post_hooks=[post_hook],
)

# Run multiple requests (`user_queries` is your incoming request stream)
for query in user_queries:
    response = agent.run(query)

# Check for alerts
if post_hook.alerts:
    print(f"WARNING: {len(post_hook.alerts)} requests exceeded threshold")
    for alert in post_hook.alerts:
        print(f"  - {alert}")

# Summary stats
summary = post_hook.get_summary()
print(f"Total requests: {summary['total_requests']}")
print(f"Average tokens: {summary['average_tokens']}")
```
Example 4: Reset for New Sessions¶
```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
agent = Agent(model=model)

# Session 1
agent.run("First conversation...")
print(f"Session 1 savings: {model.get_savings_summary()}")

# Reset for new session
model.reset()

# Session 2 - metrics start fresh
agent.run("Second conversation...")
print(f"Session 2 savings: {model.get_savings_summary()}")
```
Supported Providers¶
`HeadroomAgnoModel` automatically detects the provider from the wrapped model:

| Provider | Agno Models | Auto-Detected |
|---|---|---|
| OpenAI | `OpenAIChat`, `OpenAILike` | Yes |
| Anthropic | `Claude`, `AwsBedrock` | Yes |
| Google | `Gemini`, `VertexAI` | Yes |
| Cohere | `Cohere`, `CohereChat` | Yes |
| Groq | `Groq` | Yes (OpenAI-compatible) |
| Mistral | `Mistral` | Yes (OpenAI-compatible) |
| Together | `Together` | Yes (OpenAI-compatible) |
| Ollama | `Ollama` | Yes (OpenAI-compatible) |
To disable auto-detection:
```python
model = HeadroomAgnoModel(
    wrapped_model=some_model,
    auto_detect_provider=False,  # Falls back to OpenAI tokenizer
)
```
Feature Coverage¶
What's Optimized¶
`HeadroomAgnoModel` optimizes messages at the LLM call boundary. This covers:
| Feature | Optimized | Notes |
|---|---|---|
| User/Assistant Messages | ✅ Yes | Full message history compressed |
| Tool Calls | ✅ Yes | Tool call arguments optimized |
| Tool Results | ✅ Yes | JSON responses compressed 70-90% via SmartCrusher |
| System Prompts | ✅ Yes | Included in message optimization |
| Streaming Responses | ✅ Yes | Both sync and async |
| Multi-turn Conversations | ✅ Yes | Full history available for optimization |
Known Limitations¶
The integration operates at the model layer, not the agent layer. Some Agno features operate outside this boundary:
| Agno Feature | Status | Explanation |
|---|---|---|
| Agent Memory | ⚠️ Partial | Memory content is optimized when it enters messages, but the persistent memory store itself is not compressed. If you're storing large amounts of data in agent memory, consider summarizing before storage. |
| Knowledge Bases | ⚠️ Partial | KB retrieval happens before messages reach the model. Retrieved context is optimized as part of the message, but we can't influence KB retrieval itself. |
| Agent Teams | ❌ Not supported | Each agent's model is wrapped independently. No cross-agent optimization or team-level coordination. |
| Tool Definitions | ⚠️ Not deduplicated | Tool schemas are sent with every request. Future versions may deduplicate repeated tool definitions. |
| Structured Outputs | ✅ Supported | `response_model` works normally; optimization doesn't affect output parsing (see the sketch after this table). |
| Reasoning Models | ✅ Supported | Extended thinking works; we don't compress reasoning traces. |
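For structured outputs specifically, nothing changes in how you call the agent. A minimal sketch, assuming Agno's standard `response_model` parameter; the `MovieSummary` schema and prompt are hypothetical, for illustration only:

```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from pydantic import BaseModel

from headroom.integrations.agno import HeadroomAgnoModel


class MovieSummary(BaseModel):
    """Hypothetical schema, used only for illustration."""

    title: str
    genre: str
    logline: str


# Headroom optimizes the input messages; parsing of the structured
# output happens downstream and is unaffected.
agent = Agent(
    model=HeadroomAgnoModel(OpenAIChat(id="gpt-4o")),
    response_model=MovieSummary,
)

response = agent.run("Summarize a fictional sci-fi heist movie")
print(response.content)  # Parsed MovieSummary instance
```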
Best Practices for Maximum Savings¶
- Tool-heavy agents see the biggest wins — Tool results (JSON, logs, search results) compress 70-90%
- Long conversations benefit from RollingWindow — Configure context limits to avoid hitting provider maximums
- Wrap at the model level, not agent level — This ensures all LLM calls go through optimization
- Use hooks for observability — Track token usage patterns to identify optimization opportunities
Future Improvements¶
We're tracking these potential enhancements:
- Memory optimization hooks — Compress data before it enters agent memory
- Knowledge base integration — Optimize retrieved context at the KB layer
- Tool schema deduplication — Cache and reference repeated tool definitions
- Team-level optimization — Shared context compression across agent teams
Contributions welcome! See CONTRIBUTING.md.
Configuration Reference¶
HeadroomAgnoModel¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `wrapped_model` | `Any` | Required | The Agno model to wrap |
| `config` | `HeadroomConfig` | `None` | Custom configuration |
| `auto_detect_provider` | `bool` | `True` | Auto-detect provider for token counting |
Properties:

- `wrapped_model` - Access the underlying Agno model
- `total_tokens_saved` - Running total of tokens saved
- `metrics_history` - List of the last 100 `OptimizationMetrics`

Methods:

- `response(messages, **kwargs)` - Sync response with optimization
- `response_stream(messages, **kwargs)` - Sync streaming response
- `aresponse(messages, **kwargs)` - Async response
- `aresponse_stream(messages, **kwargs)` - Async streaming
- `get_savings_summary()` - Returns dict with stats
- `reset()` - Clear all metrics
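For example, per-request metrics can be pulled straight off the wrapper. A minimal sketch using only the properties and methods listed above; the shape of each `OptimizationMetrics` entry may vary by version, so print one to inspect it:

```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
agent = Agent(model=model)

agent.run("First question...")
agent.run("Second question...")

# metrics_history retains the last 100 requests
for i, entry in enumerate(model.metrics_history):
    print(f"Request {i}: {entry}")  # Inspect the fields in your environment

print(model.get_savings_summary())
```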
HeadroomPreHook¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `config` | `HeadroomConfig` | `None` | Configuration (for future use) |
| `model` | `str` | `"gpt-4o"` | Model name for estimation |
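If your agent runs on a different model, point the pre-hook's estimator at it. A one-line sketch based on the parameters above:

```python
# Token estimation uses the gpt-4o-mini tokenizer instead of the default
pre_hook = HeadroomPreHook(model="gpt-4o-mini")
```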
HeadroomPostHook¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `log_level` | `str` | `"INFO"` | Logging level |
| `token_alert_threshold` | `int` | `None` | Alert if tokens exceed this |
Properties:

- `total_requests` - Number of requests tracked
- `alerts` - List of alert messages

Methods:

- `get_summary()` - Returns dict with request stats
- `reset()` - Clear history and alerts
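In long-running services you can poll and reset the post-hook between batches. A sketch using only the documented properties and methods; where you draw the batch boundary is up to you:

```python
# At the end of a batch, flush observability stats
print(f"Requests this window: {post_hook.total_requests}")
print(f"Alerts this window: {len(post_hook.alerts)}")
print(f"Summary: {post_hook.get_summary()}")

# Start a fresh tracking window
post_hook.reset()
```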
create_headroom_hooks()¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `config` | `HeadroomConfig` | `None` | Config for pre-hook |
| `model` | `str` | `"gpt-4o"` | Model for pre-hook |
| `log_level` | `str` | `"INFO"` | Log level for post-hook |
| `token_alert_threshold` | `int` | `None` | Alert threshold for post-hook |
Returns: `tuple[HeadroomPreHook, HeadroomPostHook]`
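For reference, a call that spells out every parameter with its default (values taken from the table above):

```python
pre_hook, post_hook = create_headroom_hooks(
    config=None,                 # HeadroomConfig for the pre-hook
    model="gpt-4o",              # model name for pre-hook token estimation
    log_level="INFO",            # logging level for the post-hook
    token_alert_threshold=None,  # post-hook alert threshold (None = no alerts)
)
```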
Import Reference¶
```python
# Main integration
from headroom.integrations.agno import HeadroomAgnoModel

# Hooks
from headroom.integrations.agno import HeadroomPreHook
from headroom.integrations.agno import HeadroomPostHook
from headroom.integrations.agno import create_headroom_hooks

# Utilities
from headroom.integrations.agno import optimize_messages
from headroom.integrations.agno import agno_available
from headroom.integrations.agno import get_headroom_provider
from headroom.integrations.agno import get_model_name_from_agno

# Or import everything from parent
from headroom.integrations import (
    HeadroomAgnoModel,
    HeadroomPreHook,
    HeadroomPostHook,
    create_headroom_hooks,
)
```
Troubleshooting¶
Check if Agno is Available¶
```python
from headroom.integrations.agno import agno_available

if agno_available():
    from headroom.integrations.agno import HeadroomAgnoModel
else:
    print("Install agno: pip install agno")
```
Provider Detection Issues¶
If auto-detection fails, check the detected provider:
```python
from agno.models.openai import OpenAIChat

from headroom.integrations.agno import get_headroom_provider, get_model_name_from_agno

model = OpenAIChat(id="gpt-4o")
provider = get_headroom_provider(model)
model_name = get_model_name_from_agno(model)

print(f"Detected provider: {type(provider).__name__}")
print(f"Model name: {model_name}")
```
Metrics Not Updating¶
Ensure you're checking the correct object:
```python
# Model metrics (optimization)
print(model.total_tokens_saved)  # Actual savings

# Hook metrics (observability)
print(post_hook.get_summary())  # Request tracking
```
Note: Hooks track request counts, not token savings. Use the model wrapper for optimization metrics.