Memory¶
Hierarchical, temporal memory for LLM applications. Enable your AI to remember across conversations with intelligent scoping and versioning.
Why Memory?¶
LLMs have two fundamental limitations:

1. Context windows overflow - Too much history, so it must be truncated
2. No persistence - Every conversation starts from zero
Memory solves both: extract key facts, persist them, inject when relevant.
This is temporal compression - instead of carrying 10,000 tokens of conversation history, carry 100 tokens of extracted memories.
What Makes Headroom Memory Different?¶
| Feature | Headroom | Letta (MemGPT) | Mem0 |
|---|---|---|---|
| Hierarchical Scoping | User → Session → Agent → Turn | Flat (per-agent) | Flat (per-user) |
| Temporal Versioning | Full supersession chains | No | No |
| Zero-Latency Extraction | Inline (Letta-style) | Inline | Separate call |
| One-Liner Integration | with_memory(client) | Requires agent setup | Requires separate client |
| Pluggable Backends | SQLite, HNSW, FTS5, any embedder | PostgreSQL | Qdrant/Chroma |
| Semantic + Full-Text Search | Both | Semantic only | Semantic only |
| Memory Bubbling | Auto-promote important memories | No | No |
| Protocol-Based Architecture | Yes (dependency injection) | No | No |
Quick Start¶
from openai import OpenAI
from headroom import with_memory
# One line - that's it
client = with_memory(OpenAI(), user_id="alice")
# Use exactly like normal
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "I prefer Python for backend work"}]
)
# Memory extracted INLINE - zero extra latency
# Later, in a new conversation...
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What language should I use?"}]
)
# → Response uses the Python preference from memory
How It Works¶
┌─────────────────────────────────────────────────────────────┐
│ with_memory() │
│ │
│ 1. INJECT: Semantic search → prepend to user message │
│ 2. INSTRUCT: Add memory extraction instruction │
│ 3. CALL: Forward to LLM │
│ 4. PARSE: Extract <memory> block from response │
│ 5. STORE: Save with embeddings + vector index + FTS │
│ 6. RETURN: Clean response (without memory block) │
│ │
└─────────────────────────────────────────────────────────────┘
Key insight: Memory extraction happens inline as part of the LLM response (Letta-style). No extra API calls, no extra latency.
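For intuition, here is a minimal sketch of those six steps against a plain OpenAI client. The helper name, the instruction wording, and the exact <memory> format below are illustrative assumptions, not Headroom's internal implementation; with_memory() handles all of this for you.
import re

# Illustrative instruction text - the real wrapper's prompt may differ.
MEMORY_INSTRUCTION = (
    "After answering, list any durable facts about the user inside a "
    "<memory>...</memory> block, one fact per line."
)

def chat_with_memory(client, memory, user_message: str) -> str:
    # 1. INJECT: semantic search, prepend relevant memories to the user message
    recalled = memory.search(user_message, top_k=5)
    context = "\n".join(f"- {m.content}" for m in recalled)
    prompt = f"Known about the user:\n{context}\n\n{user_message}" if context else user_message

    # 2. INSTRUCT + 3. CALL: ask the LLM to append a <memory> block
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": MEMORY_INSTRUCTION},
            {"role": "user", "content": prompt},
        ],
    )
    text = response.choices[0].message.content

    # 4. PARSE + 5. STORE: extract the <memory> block and persist each fact
    block = re.search(r"<memory>(.*?)</memory>", text, re.DOTALL)
    if block:
        for fact in block.group(1).strip().splitlines():
            memory.add(fact.strip())

    # 6. RETURN: clean response without the memory block
    return re.sub(r"<memory>.*?</memory>", "", text, flags=re.DOTALL).strip()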
Hierarchical Scoping¶
Memories exist at different scope levels, enabling fine-grained control:
Scope Levels¶
| Scope | Persists Across | Use Case |
|---|---|---|
| USER | All sessions, all time | Long-term preferences, identity |
| SESSION | Current session only | Current task context |
| AGENT | Current agent in session | Agent-specific context |
| TURN | Single turn only | Ephemeral working memory |
Example: Multi-Session Memory¶
from openai import OpenAI
from headroom import with_memory
# Session 1: Morning
client1 = with_memory(
OpenAI(),
user_id="bob",
session_id="morning-session",
)
response = client1.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "I prefer Go for performance-critical code"}]
)
# Memory stored at USER level (persists across sessions)
# Session 2: Afternoon (different session, same user)
client2 = with_memory(
OpenAI(),
user_id="bob", # Same user
session_id="afternoon-session", # Different session
)
response = client2.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What language for my new microservice?"}]
)
# → Recalls Go preference from morning session!
Temporal Versioning (Supersession)¶
Memories evolve over time. When facts change, Headroom creates a supersession chain preserving history:
from headroom.memory import HierarchicalMemory, MemoryCategory
from headroom.memory.ports import MemoryFilter
memory = await HierarchicalMemory.create()
# Original fact
orig = await memory.add(
content="User works at Google",
user_id="alice",
category=MemoryCategory.FACT,
)
# User changes jobs - supersede the old memory
new = await memory.supersede(
old_memory_id=orig.id,
new_content="User now works at Anthropic",
)
# Query current state (excludes superseded)
current = await memory.query(MemoryFilter(
user_id="alice",
include_superseded=False, # Default
))
# → Returns only "User now works at Anthropic"
# Query full history (includes superseded)
history = await memory.query(MemoryFilter(
user_id="alice",
include_superseded=True,
))
# → Returns both memories with validity timestamps
# Get the chain
chain = await memory.get_history(new.id)
# → [
# Memory(content="User works at Google", valid_until=..., is_current=False),
# Memory(content="User now works at Anthropic", valid_until=None, is_current=True),
# ]
Why Temporal Versioning Matters¶
- Audit trail - Know what was true at any point in time
- Debugging - Understand why the LLM made certain decisions
- Rollback - Restore previous state if needed
- Analytics - Track how user preferences evolve
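As an example of the audit-trail use case, a point-in-time view can be reconstructed from validity timestamps. This is a hedged sketch: it assumes each Memory carries created_at and valid_until fields (only valid_until appears in the history example above).
from datetime import datetime
from headroom.memory.ports import MemoryFilter

async def facts_as_of(memory, user_id: str, as_of: datetime):
    """Return the memories that were considered current at time `as_of` (sketch)."""
    versions = await memory.query(MemoryFilter(
        user_id=user_id,
        include_superseded=True,  # pull old versions too, not just current state
    ))
    return [
        m for m in versions
        if m.created_at <= as_of  # assumed field: when the memory became valid
        and (m.valid_until is None or m.valid_until > as_of)
    ]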
Memory Categories¶
Memories are categorized for better organization and retrieval:
| Category | Description | Examples |
|---|---|---|
| PREFERENCE | Likes, dislikes, preferred approaches | "Prefers Python", "Likes dark mode" |
| FACT | Identity, role, constraints | "Works at fintech startup", "Senior engineer" |
| CONTEXT | Current goals, ongoing tasks | "Migrating to microservices", "Working on auth" |
| ENTITY | Information about entities | "Project Apollo uses React", "Team lead is Sarah" |
| DECISION | Decisions made | "Chose PostgreSQL over MySQL", "Using REST not GraphQL" |
| INSIGHT | Derived insights | "User tends to prefer typed languages" |
Memory API¶
The with_memory() wrapper provides a .memory API for direct access:
client = with_memory(OpenAI(), user_id="alice")
# Search memories (semantic)
results = client.memory.search("python preferences", top_k=5)
for memory in results:
    print(f"{memory.content}")
# Add manual memory
client.memory.add(
"User is a senior engineer",
category="fact",
importance=0.9,
)
# Get all memories
all_memories = client.memory.get_all()
# Clear memories
client.memory.clear()
# Get stats
stats = client.memory.stats()
print(f"Total memories: {stats['total']}")
print(f"By category: {stats['categories']}")
Advanced Usage: Direct HierarchicalMemory API¶
For full control, use the HierarchicalMemory class directly:
import asyncio
from headroom.memory import (
HierarchicalMemory,
MemoryConfig,
MemoryCategory,
EmbedderBackend,
)
from headroom.memory.ports import MemoryFilter, VectorFilter
async def main():
    # Create with custom configuration
    config = MemoryConfig(
        db_path="my_memory.db",
        embedder_backend=EmbedderBackend.LOCAL,  # or OPENAI, OLLAMA
        vector_dimension=384,
        cache_max_size=2000,
    )
    memory = await HierarchicalMemory.create(config)

    # Add memory with full control
    mem = await memory.add(
        content="User prefers functional programming",
        user_id="alice",
        session_id="sess-123",
        agent_id="code-assistant",
        category=MemoryCategory.PREFERENCE,
        importance=0.9,
        entity_refs=["functional-programming", "coding-style"],
        metadata={"source": "conversation", "confidence": 0.95},
    )

    # Semantic search
    results = await memory.search(
        query="programming paradigm preferences",
        user_id="alice",
        top_k=5,
        min_similarity=0.5,
        categories=[MemoryCategory.PREFERENCE],
    )
    for r in results:
        print(f"[{r.similarity:.3f}] {r.memory.content}")

    # Full-text search
    text_results = await memory.text_search(
        query="functional",
        user_id="alice",
    )

    # Query with filters
    memories = await memory.query(MemoryFilter(
        user_id="alice",
        categories=[MemoryCategory.PREFERENCE, MemoryCategory.FACT],
        min_importance=0.7,
        limit=10,
    ))

    # Convenience methods
    await memory.remember("Likes coffee", user_id="alice", importance=0.6)
    relevant = await memory.recall("beverage preferences", user_id="alice")

asyncio.run(main())
Configuration¶
Embedder Backends¶
from headroom.memory import MemoryConfig, EmbedderBackend
# Local embeddings (recommended - fast, free, private)
config = MemoryConfig(
embedder_backend=EmbedderBackend.LOCAL,
embedder_model="all-MiniLM-L6-v2", # 384 dimensions, fast
)
# OpenAI embeddings (higher quality, costs money)
config = MemoryConfig(
embedder_backend=EmbedderBackend.OPENAI,
openai_api_key="sk-...",
embedder_model="text-embedding-3-small",
)
# Ollama embeddings (local server, many models)
config = MemoryConfig(
embedder_backend=EmbedderBackend.OLLAMA,
ollama_base_url="http://localhost:11434",
embedder_model="nomic-embed-text",
)
Storage Configuration¶
config = MemoryConfig(
db_path="memory.db", # SQLite database path
vector_dimension=384, # Must match embedder output
hnsw_ef_construction=200, # HNSW index quality (higher = better, slower)
hnsw_m=16, # HNSW connections per node
hnsw_ef_search=50, # HNSW search quality
cache_enabled=True, # Enable LRU cache
cache_max_size=1000, # Max cached memories
)
Wrapper Configuration¶
client = with_memory(
OpenAI(),
user_id="alice",
db_path="memory.db",
top_k=5, # Memories to inject per request
session_id="optional-session",
agent_id="optional-agent",
embedder_backend=EmbedderBackend.LOCAL,
)
Architecture¶
Protocol-Based Design¶
Headroom Memory uses Protocol interfaces (ports) for all components, enabling easy swapping:
┌─────────────────────────────────────────────────────────────┐
│ HierarchicalMemory │
│ (Orchestrator) │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ MemoryStore │ │ VectorIndex │ │ TextIndex │ │
│ │ Protocol │ │ Protocol │ │ Protocol │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐ │
│ │ SQLite │ │ HNSW │ │ FTS5 │ │
│ │ Adapter │ │ Adapter │ │ Adapter │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Embedder │ │ MemoryCache │ │
│ │ Protocol │ │ Protocol │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ┌──────▼──────┐ ┌──────▼──────┐ │
│ │Local/OpenAI/│ │ LRU Cache │ │
│ │ Ollama │ │ │ │
│ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Components¶
| Component | Protocol | Default Adapter | Purpose |
|---|---|---|---|
| MemoryStore | MemoryStore | SQLiteMemoryStore | CRUD + filtering + supersession |
| VectorIndex | VectorIndex | HNSWVectorIndex | Semantic similarity search |
| TextIndex | TextIndex | FTS5TextIndex | Full-text keyword search |
| Embedder | Embedder | LocalEmbedder | Text → vector conversion |
| Cache | MemoryCache | LRUMemoryCache | Hot memory caching |
Comparison with State of the Art¶
vs Letta (MemGPT)¶
Letta pioneered inline memory extraction. Headroom builds on this with:
| Aspect | Headroom | Letta |
|---|---|---|
| Scoping | 4-level hierarchy (user/session/agent/turn) | Flat per-agent |
| Temporal | Full supersession chains with history | No versioning |
| Integration | One-liner wrapper for any client | Requires Letta agent framework |
| Search | Semantic + full-text | Semantic only |
| Storage | SQLite + HNSW (embedded) | PostgreSQL (external) |
| Extensibility | Protocol-based adapters | Monolithic |
When to use Letta: You want a full agent framework with built-in memory. When to use Headroom: You want memory as a layer on your existing stack.
vs Mem0¶
Mem0 provides a managed memory service. Headroom differs:
| Aspect | Headroom | Mem0 |
|---|---|---|
| Deployment | Embedded (no server) | Managed service or self-hosted |
| Scoping | 4-level hierarchy | Flat per-user |
| Temporal | Supersession chains | No versioning |
| Extraction | Inline (zero latency) | Separate API call |
| Search | Semantic + full-text | Semantic only |
| Cost | Free (local embeddings) | API costs or infra costs |
| Privacy | All local | Data leaves your infra |
When to use Mem0: You want a managed service and don't mind external dependencies. When to use Headroom: You want embedded memory with no external services.
Feature Matrix¶
| Feature | Headroom | Letta | Mem0 |
|---|---|---|---|
| Hierarchical scoping | ✅ | ❌ | ❌ |
| Temporal versioning | ✅ | ❌ | ❌ |
| Zero-latency extraction | ✅ | ✅ | ❌ |
| Full-text search | ✅ | ❌ | ❌ |
| Embedded (no server) | ✅ | ❌ | ❌ |
| One-liner integration | ✅ | ❌ | ❌ |
| Protocol-based extensibility | ✅ | ❌ | ❌ |
| Memory bubbling | ✅ | ❌ | ❌ |
| Local embeddings | ✅ | ❌ | ✅ |
| Managed service option | ❌ | ❌ | ✅ |
Multi-User Isolation¶
Memories are isolated by user_id:
# Alice's memories
alice_client = with_memory(OpenAI(), user_id="alice")
# Bob's memories (completely separate)
bob_client = with_memory(OpenAI(), user_id="bob")
# Bob cannot see Alice's memories, even with the same database
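A quick check using the .memory API from earlier (in this sketch both wrappers share the default database file):
alice_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I prefer tabs over spaces"}],
)

print([m.content for m in alice_client.memory.get_all()])  # contains the tabs preference
print([m.content for m in bob_client.memory.get_all()])    # empty - isolated from Alice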
Performance¶
| Operation | Latency | Notes |
|---|---|---|
| Memory injection | <50ms | Local embeddings + HNSW search |
| Memory extraction | None (inline) | +50-100 output tokens added to the LLM response |
| Memory storage | <10ms | SQLite + HNSW + FTS5 indexing |
| Cache hit | <1ms | LRU cache lookup |
Overhead: ~100 extra output tokens per response for the <memory> block.
Providers¶
Memory works with any OpenAI-compatible client:
from openai import OpenAI
from headroom import with_memory
# OpenAI
client = with_memory(OpenAI(), user_id="alice")
# Azure OpenAI
client = with_memory(
OpenAI(base_url="https://your-resource.openai.azure.com/..."),
user_id="alice",
)
# Groq
from groq import Groq
client = with_memory(Groq(), user_id="alice")
# Any OpenAI-compatible client
client = with_memory(YourClient(), user_id="alice")
Example: Full Conversation Flow¶
from openai import OpenAI
from headroom import with_memory
client = with_memory(OpenAI(), user_id="developer_jane")
# Conversation 1: User shares context
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": "I'm a Python developer at a fintech startup. We use PostgreSQL and FastAPI."
}]
)
# Memories extracted:
# - [FACT] Python developer at fintech startup
# - [PREFERENCE] Uses PostgreSQL for databases
# - [PREFERENCE] Uses FastAPI for web APIs
# Conversation 2 (new session): User asks question
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": "What database should I use for my new project?"
}]
)
# Response references PostgreSQL preference from memory:
# → "Given your experience with PostgreSQL at your fintech company,
# I'd recommend sticking with it for consistency..."
# Check stored memories
print("Stored memories:")
for m in client.memory.get_all():
    print(f"  [{m.category.value}] {m.content}")
Troubleshooting¶
Memories not being extracted¶
- Check if the conversation has memory-worthy content (not just greetings)
- Verify the LLM is following the memory instruction
- Enable logging:
import logging; logging.basicConfig(level=logging.DEBUG)
Memories not being retrieved¶
- Verify user_id matches between sessions
- Check if memories exist: client.memory.get_all()
- Try a more specific search query
- Check the similarity threshold
High latency¶
- Use local embeddings: embedder_backend=EmbedderBackend.LOCAL
- Reduce top_k to retrieve fewer memories
- Enable caching (enabled by default)
Memory not persisting¶
- Check that db_path is the same across sessions
- Ensure the database file is writable
- Check for exceptions in logs
Best Practices¶
- Use a consistent user_id - Same ID across sessions for continuity
- Use session scoping - Set session_id for session-specific context
- Start with local embeddings - Faster, free, good enough for most cases
- Monitor memory growth - Use client.memory.stats() to track it
- Use importance scores - Higher importance = more likely to be retrieved
- Leverage categories - Helps with debugging and selective retrieval
- Consider supersession - Use supersede() when facts change, not add() (see the sketch below)
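Sketch for the supersession practice above - look up the existing memory first, then supersede rather than add. The lookup step and query text are illustrative; it assumes an existing HierarchicalMemory instance as in the direct API examples.
old = await memory.search(query="where the user works", user_id="alice", top_k=1)
if old:
    # Fact changed: chain a new version onto the old memory
    await memory.supersede(
        old_memory_id=old[0].memory.id,
        new_content="User now works at Acme Corp",
    )
else:
    # Nothing to supersede: store it as a fresh fact
    await memory.add(
        content="User works at Acme Corp",
        user_id="alice",
        category=MemoryCategory.FACT,
    )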