Memory¶
Hierarchical, temporal memory for LLM applications. Enable your AI to remember across conversations with intelligent scoping and versioning.
Why Memory?¶
LLMs have two fundamental limitations:

1. Context windows overflow - Too much history, so it must be truncated
2. No persistence - Every conversation starts from zero
Memory solves both: extract key facts, persist them, inject when relevant.
This is temporal compression - instead of carrying 10,000 tokens of conversation history, carry 100 tokens of extracted memories.
What Makes Headroom Memory Different?¶
| Feature | Headroom | Letta (MemGPT) | Mem0 |
|---|---|---|---|
| Hierarchical Scoping | User → Session → Agent → Turn | Flat (per-agent) | Flat (per-user) |
| Temporal Versioning | Full supersession chains | No | No |
| Zero-Latency Extraction | Inline (Letta-style) | Inline | Separate call |
| One-Liner Integration | with_memory(client) | Requires agent setup | Requires separate client |
| Pluggable Backends | SQLite, HNSW, FTS5, any embedder | PostgreSQL | Qdrant/Chroma |
| Semantic + Full-Text Search | Both | Semantic only | Semantic only |
| Memory Bubbling | Auto-promote important memories | No | No |
| Protocol-Based Architecture | Yes (dependency injection) | No | No |
Quick Start¶
from openai import OpenAI
from headroom import with_memory
# One line - that's it
client = with_memory(OpenAI(), user_id="alice")
# Use exactly like normal
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "I prefer Python for backend work"}]
)
# Memory extracted INLINE - zero extra latency
# Later, in a new conversation...
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What language should I use?"}]
)
# → Response uses the Python preference from memory
How It Works¶
┌─────────────────────────────────────────────────────────────┐
│ with_memory() │
│ │
│ 1. INJECT: Semantic search → prepend to user message │
│ 2. INSTRUCT: Add memory extraction instruction │
│ 3. CALL: Forward to LLM │
│ 4. PARSE: Extract <memory> block from response │
│ 5. STORE: Save with embeddings + vector index + FTS │
│ 6. RETURN: Clean response (without memory block) │
│ │
└─────────────────────────────────────────────────────────────┘
Key insight: Memory extraction happens inline as part of the LLM response (Letta-style). No extra API calls, no extra latency.
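For intuition, here is a minimal sketch of those six steps against a plain OpenAI client. The helper name, the instruction wording, and the exact <memory> format below are illustrative assumptions, not Headroom's internal implementation; with_memory() handles all of this for you.
import re

# Illustrative instruction text - the real wrapper's prompt may differ.
MEMORY_INSTRUCTION = (
    "After answering, list any durable facts about the user inside a "
    "<memory>...</memory> block, one fact per line."
)

def chat_with_memory(client, memory, user_message: str) -> str:
    # 1. INJECT: semantic search, prepend relevant memories to the user message
    recalled = memory.search(user_message, top_k=5)
    context = "\n".join(f"- {m.content}" for m in recalled)
    prompt = f"Known about the user:\n{context}\n\n{user_message}" if context else user_message

    # 2. INSTRUCT + 3. CALL: ask the LLM to append a <memory> block
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": MEMORY_INSTRUCTION},
            {"role": "user", "content": prompt},
        ],
    )
    text = response.choices[0].message.content

    # 4. PARSE + 5. STORE: extract the <memory> block and persist each fact
    block = re.search(r"<memory>(.*?)</memory>", text, re.DOTALL)
    if block:
        for fact in block.group(1).strip().splitlines():
            memory.add(fact.strip())

    # 6. RETURN: clean response without the memory block
    return re.sub(r"<memory>.*?</memory>", "", text, flags=re.DOTALL).strip()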
Hierarchical Scoping¶
Memories exist at different scope levels, enabling fine-grained control:
Scope Levels¶
| Scope | Persists Across | Use Case |
|---|---|---|
| USER | All sessions, all time | Long-term preferences, identity |
| SESSION | Current session only | Current task context |
| AGENT | Current agent in session | Agent-specific context |
| TURN | Single turn only | Ephemeral working memory |
Example: Multi-Session Memory¶
from openai import OpenAI
from headroom import with_memory
# Session 1: Morning
client1 = with_memory(
OpenAI(),
user_id="bob",
session_id="morning-session",
)
response = client1.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "I prefer Go for performance-critical code"}]
)
# Memory stored at USER level (persists across sessions)
# Session 2: Afternoon (different session, same user)
client2 = with_memory(
OpenAI(),
user_id="bob", # Same user
session_id="afternoon-session", # Different session
)
response = client2.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What language for my new microservice?"}]
)
# → Recalls Go preference from morning session!
Temporal Versioning (Supersession)¶
Memories evolve over time. When facts change, Headroom creates a supersession chain preserving history:
from headroom.memory import HierarchicalMemory, MemoryCategory
from headroom.memory.ports import MemoryFilter
memory = await HierarchicalMemory.create()
# Original fact
orig = await memory.add(
content="User works at Google",
user_id="alice",
category=MemoryCategory.FACT,
)
# User changes jobs - supersede the old memory
new = await memory.supersede(
old_memory_id=orig.id,
new_content="User now works at Anthropic",
)
# Query current state (excludes superseded)
current = await memory.query(MemoryFilter(
user_id="alice",
include_superseded=False, # Default
))
# → Returns only "User now works at Anthropic"
# Query full history (includes superseded)
history = await memory.query(MemoryFilter(
user_id="alice",
include_superseded=True,
))
# → Returns both memories with validity timestamps
# Get the chain
chain = await memory.get_history(new.id)
# → [
# Memory(content="User works at Google", valid_until=..., is_current=False),
# Memory(content="User now works at Anthropic", valid_until=None, is_current=True),
# ]
Why Temporal Versioning Matters¶
- Audit trail - Know what was true at any point in time
- Debugging - Understand why the LLM made certain decisions
- Rollback - Restore previous state if needed
- Analytics - Track how user preferences evolve
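As an example of the audit-trail use case, a point-in-time view can be reconstructed from validity timestamps. This is a hedged sketch: it assumes each Memory carries created_at and valid_until fields (only valid_until appears in the history example above).
from datetime import datetime
from headroom.memory.ports import MemoryFilter

async def facts_as_of(memory, user_id: str, as_of: datetime):
    """Return the memories that were considered current at time `as_of` (sketch)."""
    versions = await memory.query(MemoryFilter(
        user_id=user_id,
        include_superseded=True,  # pull old versions too, not just current state
    ))
    return [
        m for m in versions
        if m.created_at <= as_of  # assumed field: when the memory became valid
        and (m.valid_until is None or m.valid_until > as_of)
    ]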
Memory Categories¶
Memories are categorized for better organization and retrieval:
| Category | Description | Examples |
|---|---|---|
| PREFERENCE | Likes, dislikes, preferred approaches | "Prefers Python", "Likes dark mode" |
| FACT | Identity, role, constraints | "Works at fintech startup", "Senior engineer" |
| CONTEXT | Current goals, ongoing tasks | "Migrating to microservices", "Working on auth" |
| ENTITY | Information about entities | "Project Apollo uses React", "Team lead is Sarah" |
| DECISION | Decisions made | "Chose PostgreSQL over MySQL", "Using REST not GraphQL" |
| INSIGHT | Derived insights | "User tends to prefer typed languages" |
Memory API¶
The with_memory() wrapper provides a .memory API for direct access:
client = with_memory(OpenAI(), user_id="alice")
# Search memories (semantic)
results = client.memory.search("python preferences", top_k=5)
for memory in results:
    print(f"{memory.content}")
# Add manual memory
client.memory.add(
"User is a senior engineer",
category="fact",
importance=0.9,
)
# Get all memories
all_memories = client.memory.get_all()
# Clear memories
client.memory.clear()
# Get stats
stats = client.memory.stats()
print(f"Total memories: {stats['total']}")
print(f"By category: {stats['categories']}")
Advanced Usage: Direct HierarchicalMemory API¶
For full control, use the HierarchicalMemory class directly:
import asyncio
from headroom.memory import (
HierarchicalMemory,
MemoryConfig,
MemoryCategory,
EmbedderBackend,
)
from headroom.memory.ports import MemoryFilter, VectorFilter
async def main():
    # Create with custom configuration
    config = MemoryConfig(
        db_path="my_memory.db",
        embedder_backend=EmbedderBackend.LOCAL,  # or OPENAI, OLLAMA
        vector_dimension=384,
        cache_max_size=2000,
    )
    memory = await HierarchicalMemory.create(config)

    # Add memory with full control
    mem = await memory.add(
        content="User prefers functional programming",
        user_id="alice",
        session_id="sess-123",
        agent_id="code-assistant",
        category=MemoryCategory.PREFERENCE,
        importance=0.9,
        entity_refs=["functional-programming", "coding-style"],
        metadata={"source": "conversation", "confidence": 0.95},
    )

    # Semantic search
    results = await memory.search(
        query="programming paradigm preferences",
        user_id="alice",
        top_k=5,
        min_similarity=0.5,
        categories=[MemoryCategory.PREFERENCE],
    )
    for r in results:
        print(f"[{r.similarity:.3f}] {r.memory.content}")

    # Full-text search
    text_results = await memory.text_search(
        query="functional",
        user_id="alice",
    )

    # Query with filters
    memories = await memory.query(MemoryFilter(
        user_id="alice",
        categories=[MemoryCategory.PREFERENCE, MemoryCategory.FACT],
        min_importance=0.7,
        limit=10,
    ))

    # Convenience methods
    await memory.remember("Likes coffee", user_id="alice", importance=0.6)
    relevant = await memory.recall("beverage preferences", user_id="alice")

asyncio.run(main())
Configuration¶
Embedder Backends¶
from headroom.memory import MemoryConfig, EmbedderBackend
# Local embeddings (recommended - fast, free, private)
config = MemoryConfig(
embedder_backend=EmbedderBackend.LOCAL,
embedder_model="all-MiniLM-L6-v2", # 384 dimensions, fast
)
# OpenAI embeddings (higher quality, costs money)
config = MemoryConfig(
embedder_backend=EmbedderBackend.OPENAI,
openai_api_key="sk-...",
embedder_model="text-embedding-3-small",
)
# Ollama embeddings (local server, many models)
config = MemoryConfig(
embedder_backend=EmbedderBackend.OLLAMA,
ollama_base_url="http://localhost:11434",
embedder_model="nomic-embed-text",
)
Storage Configuration¶
config = MemoryConfig(
db_path="memory.db", # SQLite database path
vector_dimension=384, # Must match embedder output
hnsw_ef_construction=200, # HNSW index quality (higher = better, slower)
hnsw_m=16, # HNSW connections per node
hnsw_ef_search=50, # HNSW search quality
cache_enabled=True, # Enable LRU cache
cache_max_size=1000, # Max cached memories
)
Wrapper Configuration¶
client = with_memory(
OpenAI(),
user_id="alice",
db_path="memory.db",
top_k=5, # Memories to inject per request
session_id="optional-session",
agent_id="optional-agent",
embedder_backend=EmbedderBackend.LOCAL,
)
Architecture¶
Protocol-Based Design¶
Headroom Memory uses Protocol interfaces (ports) for all components, enabling easy swapping:
┌─────────────────────────────────────────────────────────────┐
│ HierarchicalMemory │
│ (Orchestrator) │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ MemoryStore │ │ VectorIndex │ │ TextIndex │ │
│ │ Protocol │ │ Protocol │ │ Protocol │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐ │
│ │ SQLite │ │ HNSW │ │ FTS5 │ │
│ │ Adapter │ │ Adapter │ │ Adapter │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Embedder │ │ MemoryCache │ │
│ │ Protocol │ │ Protocol │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ┌──────▼──────┐ ┌──────▼──────┐ │
│ │Local/OpenAI/│ │ LRU Cache │ │
│ │ Ollama │ │ │ │
│ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Components¶
| Component | Protocol | Default Adapter | Purpose |
|---|---|---|---|
| MemoryStore | MemoryStore | SQLiteMemoryStore | CRUD + filtering + supersession |
| VectorIndex | VectorIndex | HNSWVectorIndex | Semantic similarity search |
| TextIndex | TextIndex | FTS5TextIndex | Full-text keyword search |
| Embedder | Embedder | LocalEmbedder | Text → vector conversion |
| Cache | MemoryCache | LRUMemoryCache | Hot memory caching |
Comparison with State of the Art¶
vs Letta (MemGPT)¶
Letta pioneered inline memory extraction. Headroom builds on this with:
| Aspect | Headroom | Letta |
|---|---|---|
| Scoping | 4-level hierarchy (user/session/agent/turn) | Flat per-agent |
| Temporal | Full supersession chains with history | No versioning |
| Integration | One-liner wrapper for any client | Requires Letta agent framework |
| Search | Semantic + full-text | Semantic only |
| Storage | SQLite + HNSW (embedded) | PostgreSQL (external) |
| Extensibility | Protocol-based adapters | Monolithic |
When to use Letta: You want a full agent framework with built-in memory. When to use Headroom: You want memory as a layer on your existing stack.
vs Mem0¶
Mem0 provides a managed memory service. Headroom differs:
| Aspect | Headroom | Mem0 |
|---|---|---|
| Deployment | Embedded (no server) | Managed service or self-hosted |
| Scoping | 4-level hierarchy | Flat per-user |
| Temporal | Supersession chains | No versioning |
| Extraction | Inline (zero latency) | Separate API call |
| Search | Semantic + full-text | Semantic only |
| Cost | Free (local embeddings) | API costs or infra costs |
| Privacy | All local | Data leaves your infra |
When to use Mem0: You want a managed service and don't mind external dependencies. When to use Headroom: You want embedded memory with no external services.
Feature Matrix¶
| Feature | Headroom | Letta | Mem0 |
|---|---|---|---|
| Hierarchical scoping | ✅ | ❌ | ❌ |
| Temporal versioning | ✅ | ❌ | ❌ |
| Zero-latency extraction | ✅ | ✅ | ❌ |
| Full-text search | ✅ | ❌ | ❌ |
| Embedded (no server) | ✅ | ❌ | ❌ |
| One-liner integration | ✅ | ❌ | ❌ |
| Protocol-based extensibility | ✅ | ❌ | ❌ |
| Memory bubbling | ✅ | ❌ | ❌ |
| Local embeddings | ✅ | ❌ | ✅ |
| Managed service option | ❌ | ❌ | ✅ |
Multi-User Isolation¶
Memories are isolated by user_id:
# Alice's memories
alice_client = with_memory(OpenAI(), user_id="alice")
# Bob's memories (completely separate)
bob_client = with_memory(OpenAI(), user_id="bob")
# Bob cannot see Alice's memories, even with the same database
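A quick check using the .memory API from earlier (in this sketch both wrappers share the default database file):
alice_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I prefer tabs over spaces"}],
)

print([m.content for m in alice_client.memory.get_all()])  # contains the tabs preference
print([m.content for m in bob_client.memory.get_all()])    # empty - isolated from Alice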
Performance¶
| Operation | Latency | Notes |
|---|---|---|
| Memory injection | <50ms | Local embeddings + HNSW search |
| Memory extraction | None (inline) | +50-100 output tokens added to the LLM response |
| Memory storage | <10ms | SQLite + HNSW + FTS5 indexing |
| Cache hit | <1ms | LRU cache lookup |
Overhead: ~100 extra output tokens per response for the <memory> block.
Providers¶
Memory works with any OpenAI-compatible client:
from openai import OpenAI
from headroom import with_memory
# OpenAI
client = with_memory(OpenAI(), user_id="alice")
# Azure OpenAI
client = with_memory(
OpenAI(base_url="https://your-resource.openai.azure.com/..."),
user_id="alice",
)
# Groq
from groq import Groq
client = with_memory(Groq(), user_id="alice")
# Any OpenAI-compatible client
client = with_memory(YourClient(), user_id="alice")
Example: Full Conversation Flow¶
from openai import OpenAI
from headroom import with_memory
client = with_memory(OpenAI(), user_id="developer_jane")
# Conversation 1: User shares context
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": "I'm a Python developer at a fintech startup. We use PostgreSQL and FastAPI."
}]
)
# Memories extracted:
# - [FACT] Python developer at fintech startup
# - [PREFERENCE] Uses PostgreSQL for databases
# - [PREFERENCE] Uses FastAPI for web APIs
# Conversation 2 (new session): User asks question
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": "What database should I use for my new project?"
}]
)
# Response references PostgreSQL preference from memory:
# → "Given your experience with PostgreSQL at your fintech company,
# I'd recommend sticking with it for consistency..."
# Check stored memories
print("Stored memories:")
for m in client.memory.get_all():
    print(f"  [{m.category.value}] {m.content}")
Troubleshooting¶
Memories not being extracted¶
- Check if the conversation has memory-worthy content (not just greetings)
- Verify the LLM is following the memory instruction
- Enable logging:
import logging; logging.basicConfig(level=logging.DEBUG)
Memories not being retrieved¶
- Verify user_id matches between sessions
- Check if memories exist: client.memory.get_all()
- Try a more specific search query
- Check the similarity threshold
High latency¶
- Use local embeddings: embedder_backend=EmbedderBackend.LOCAL
- Reduce top_k to retrieve fewer memories
- Enable caching (enabled by default)
Memory not persisting¶
- Check that db_path is the same across sessions
- Ensure the database file is writable
- Check for exceptions in logs
Best Practices¶
- Use a consistent user_id - Same ID across sessions for continuity
- Use session scoping - Set session_id for session-specific context
- Start with local embeddings - Faster, free, good enough for most cases
- Monitor memory growth - Use client.memory.stats() to track it
- Use importance scores - Higher importance = more likely to be retrieved
- Leverage categories - Helps with debugging and selective retrieval
- Consider supersession - Use supersede() when facts change, not add() (see the sketch below)
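Sketch for the supersession practice above - look up the existing memory first, then supersede rather than add. The lookup step and query text are illustrative; it assumes an existing HierarchicalMemory instance as in the direct API examples.
old = await memory.search(query="where the user works", user_id="alice", top_k=1)
if old:
    # Fact changed: chain a new version onto the old memory
    await memory.supersede(
        old_memory_id=old[0].memory.id,
        new_content="User now works at Acme Corp",
    )
else:
    # Nothing to supersede: store it as a fresh fact
    await memory.add(
        content="User works at Acme Corp",
        user_id="alice",
        category=MemoryCategory.FACT,
    )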