Catalog · Infrastructure

Volume 06

The AI Agent Memory Catalog

Volume 06 of the Agentic AI Series

19 patterns draft-v0.1 2026-05 Infrastructure

A Catalog of State, Context, and Recall

Draft v0.1

May 2026


About This Catalog

This is the sixth volume in a catalog of the working vocabulary of agentic AI. The five prior volumes covered patterns (the timing of agent runs), skills (the model instructions packaged for reuse), tools (the function-calling primitives), events and triggers (what makes the agent run), and fabric (the substrate beneath orchestration). This sixth volume covers what the agent knows --- the state-and-memory layer that sits between the model and its history.

Memory is, surprisingly, the layer where the most engineering work happens in production agent systems. The model is purchased, the patterns are well-known, the tools are catalogued, the events are wired, the fabric is deployed. But the question of what to remember, how to retrieve it, when to update it, when to forget --- these are problems each team has to solve, and the answers vary by orders of magnitude in cost and quality. Mem0’s own State of AI Agent Memory 2026 report opens with the observation that memory is now “a production engineering discipline with real benchmarks, measurable trade-offs, and a growing body of operational knowledge.” The discipline is younger than the field assumes.

The taxonomy this catalog adopts borrows from cognitive psychology: working memory (what fits in the LLM’s context window right now), episodic memory (specific past events the agent participated in), semantic memory (stable facts about the user and the world), and procedural memory (skills and how-to). The three production memory layers --- Mem0, Letta (formerly MemGPT), and Zep --- each handle the last three categories with different mechanisms; working memory is handled by the LLM’s context window and the frameworks that manage it (LangGraph, CrewAI, AutoGen). Vector databases supply the substrate; hybrid Postgres-plus-pgvector deployments increasingly supply a unified backing store for both state and memory.

Scope

Coverage:

  • Short-term memory and graph state management: LangGraph checkpointers (cross-reference to Vol 4), Redis-backed conversation history.

  • Vector databases for long-term episodic and semantic memory: Chroma, Qdrant, Pinecone, Weaviate.

  • Dedicated memory layers: Mem0, Letta (MemGPT), Zep.

  • Shared state engines for multi-agent coordination: AutoGen GroupChat memory, CrewAI built-in memory abstractions.

  • Hybrid relational-plus-vector stores: Supabase with pgvector, pgvector as a primitive.

  • Embedding models that produce the vectors: OpenAI text-embedding-3, BGE, Voyage; the MTEB benchmark.

  • Knowledge graphs for memory: GraphRAG, Neo4j-plus-LLM patterns.

  • Evaluation: Mem0’s memory-benchmarks open-source eval suite, awesome-X discovery lists.

Out of scope:

  • General-purpose databases without an agent-memory framing (most of Postgres, all of MySQL, Mongo, etc.). The agent angle is required --- typically the addition of a vector index or a memory layer.

  • Closed enterprise data stores (Salesforce data cloud, etc.) when used without an explicit agent memory framing.

  • RAG over static document corpora when the corpus isn’t treated as memory the agent writes to. This volume focuses on the read-and-write case; the read-only case is conventional RAG and is well-documented elsewhere.

  • Cache and key-value stores in their conventional roles (web session caches, etc.) when not specifically wrapped for agent memory.

How to read this catalog

Part 1 (“The Narratives”) is conceptual orientation: the four kinds of memory, why the context window is a constraint rather than a solution, the retrieval-augmented generation loop, the multi-agent shared state topologies, and the six-stage memory lifecycle. Five diagrams sit in Part 1; everything in Part 2 is text and code.

Part 2 (“The Substrates”) is reference material organized by section. Each section opens with a short essay on what its entries have in common and how they relate to alternatives. Representative substrates appear in the Fowler-style template established by the prior five volumes. The entries are not meant to be read front-to-back; jump in via the table of contents.

Part 1 — The Narratives

Five short essays frame the design space for agent state and memory. The reference entries in Part 2 assume the vocabulary established here.

Chapter 1. The Four Kinds of Memory

Cognitive psychology distinguishes four kinds of memory that humans rely on, and the same taxonomy applies almost unchanged to AI agents. Working memory is what’s in your head right now --- the conversation you’re holding, the variables you’re tracking, the question you just heard. Episodic memory is your record of specific past events: what happened last Tuesday, who you met at the conference, what the customer asked yesterday. Semantic memory is your stable factual knowledge: that Paris is the capital of France, that your sister is allergic to peanuts, that the customer’s billing address is in Brooklyn. Procedural memory is your skills: how to ride a bicycle, how to issue a refund, how to format a Hebrew nikud-bearing document.

The four kinds of memory
Working, episodic, semantic, procedural. Production agent memory layers handle the latter three; the LLM's context window IS the working memory.

The distinction matters operationally because each kind has different storage requirements, different retrieval patterns, and different update rules. Working memory is volatile and lives in the LLM’s context window; it’s renewed every turn and discarded when the context window slides. Episodic memory benefits from chronological organization --- a timestamped log of events, retrievable by user, by time, by similarity. Semantic memory benefits from a key-value structure or a graph: facts have identifiers, get updated when reality changes, and link to other facts. Procedural memory is exactly what Volume 2 (the Skills Catalog) covered: model-instruction packages that load when relevant.
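The different storage shapes can be made concrete with a minimal sketch. The class and field names below (Episode, MemoryStore, billing_city) are illustrative assumptions rather than any library’s schema: episodes form an append-only timestamped log, while semantic facts sit in a key-value map that is overwritten when reality changes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Episode:
    """Episodic memory: a timestamped event, appended and never edited."""
    user_id: str
    text: str
    at: datetime


@dataclass
class MemoryStore:
    """Hypothetical store: episodes as a log, semantic facts as key-value."""
    episodes: list[Episode] = field(default_factory=list)
    facts: dict[tuple[str, str], str] = field(default_factory=dict)

    def record_episode(self, user_id: str, text: str, at: datetime) -> None:
        self.episodes.append(Episode(user_id, text, at))

    def set_fact(self, user_id: str, key: str, value: str) -> None:
        # Semantic facts have identifiers and are replaced when they change.
        self.facts[(user_id, key)] = value

    def episodes_since(self, user_id: str, since: datetime) -> list[Episode]:
        # Episodic recall: filter by user and time, in insertion order.
        return [e for e in self.episodes if e.user_id == user_id and e.at >= since]


store = MemoryStore()
t1 = datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc)
t2 = datetime(2026, 5, 2, 9, 0, tzinfo=timezone.utc)
store.record_episode("alice", "Asked about kitchen renovation costs", t1)
store.record_episode("alice", "Asked about refund policy", t2)
store.set_fact("alice", "billing_city", "Queens")
store.set_fact("alice", "billing_city", "Brooklyn")  # update replaces the old value

recent = store.episodes_since("alice", t2)
print([e.text for e in recent], store.facts[("alice", "billing_city")])
```

The log keeps both episodes; the fact map keeps only the latest billing city. That asymmetry is the operational difference the paragraph above describes.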

This catalog focuses on the non-working categories, since the LLM’s context window handles working memory by definition and the management of it is mostly an exercise in summarization and truncation. Of those three, the Skills catalog already covered procedural memory. The remaining two --- episodic and semantic --- are where the memory layer category lives. Mem0, Letta, and Zep all sell themselves as solving “memory” for agents, and what they mean is some combination of episodic and semantic storage with retrieval and update logic on top.

A useful test: when you say “the agent remembered that I like spicy food,” you’re describing semantic memory (a stable fact about you). When you say “the agent remembered our conversation last week about the kitchen renovation,” you’re describing episodic memory (a specific past event). The two need different retrieval strategies. A query for “what does Alice like to eat” wants semantic recall; a query for “what did Alice ask me yesterday” wants episodic recall. Production memory layers do both; the better ones know which is which.

Chapter 2. The Context Window as Constraint

The most common naive answer to memory in LLM applications is “we’ll just put it in the context window.” The naive answer is correct for short-lived conversations and small amounts of stable state; it stops being correct surprisingly early. The constraint isn’t just the maximum context window size (200K tokens for Claude as of mid-2026); it’s the budget allocation inside that window.

The context window as constraint
Why memory exists: a 200K context window is allocated across system prompt, skills, tools, conversation, retrieved context, and headroom — most of it isn't free for memory.

Every token in the context window has a job. The system prompt occupies a small fixed chunk --- a few thousand tokens for a serious application. The skills loaded into the run (Volume 2) can occupy anywhere from a few hundred tokens to ten thousand or more. Tool schemas --- the JSON definitions of every function the agent can call (Volume 3) --- can easily consume fifteen thousand tokens in a real production setup. Conversation history grows linearly with each turn; a thirty-turn conversation at fifteen hundred tokens per turn comes to forty-five thousand tokens on its own, a large share of what the window can spare for conversation. Retrieved memories (the whole point of the memory layer) want their own ten to twenty thousand tokens of headroom. And the model needs room to actually think, which is the part most teams forget: the response itself, the chain-of-thought scratchpad, the tool calls in flight --- all of these consume context.

The result is that even a 200K window typically has thirty to fifty thousand tokens available for conversation-plus-memory after the fixed overheads are paid. That’s a real constraint, not a notional one. Production agents that try to keep entire conversation histories in-context discover that, around turn forty in a busy session, the model still answers the user’s latest message in context but has lost track of the system prompt and the tool schemas.

Two responses to the constraint dominate the field. The first is summarization: as the conversation grows, earlier turns get rewritten as compressed summaries, freeing tokens at the cost of some loss of fidelity. The second is retrieval: rather than carrying everything in-context, the agent maintains an external memory store and queries it when something relevant might apply. Both are right at different times. The summarization path keeps the conversation thread coherent; the retrieval path lets the agent reach much further back, into prior sessions and stable user facts, but with the cost that retrieval can miss something or surface something out of context.
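The summarization response can be sketched in a few lines. This is a minimal illustration under stated assumptions: count_tokens is a crude word count standing in for a real tokenizer, and summarize is a placeholder for an LLM call; both names are hypothetical.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def summarize(turns: list[str]) -> str:
    # Placeholder for an LLM summarization call: keeps the first 5 words of each turn.
    return "SUMMARY: " + " | ".join(" ".join(t.split()[:5]) for t in turns)


def compact_history(turns: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """Fold the oldest turns into one summary when the history exceeds the budget."""
    total = sum(count_tokens(t) for t in turns)
    if total <= budget or len(turns) <= keep_recent:
        return turns
    split = len(turns) - keep_recent
    old, recent = turns[:split], turns[split:]
    return [summarize(old)] + recent


turns = [
    "User asked about kitchen renovation costs and typical timelines",
    "Agent gave a rough estimate and listed three contractors",
    "User asked whether permits are needed for moving plumbing",
    "Agent explained the permit process for the city",
]
compacted = compact_history(turns, budget=20)
for line in compacted:
    print(line)
```

The sketch makes a single pass and does not guarantee the result fits the budget; a production version would typically repeat the fold or summarize recursively, and would accept the fidelity loss the paragraph above mentions.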

The cost trade-off is also real and worth internalizing. Stuffing one hundred thousand tokens of conversation history into every turn costs roughly $0.30 per turn on Claude Sonnet pricing as of mid-2026. Retrieving the right two thousand tokens from a vector store costs about $0.001 per turn for the embedding and retrieval, plus negligible inference time. The retrieval step is about three hundred times cheaper than the stuffed history (the two thousand retrieved tokens still incur their own, much smaller, input cost), but the approach depends entirely on retrieval quality. Memory engineering is, to a first approximation, the discipline of managing this trade-off well.
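The arithmetic behind these figures is short enough to write out. The sketch below uses only the numbers quoted above; the per-token price is derived from them ($0.30 for 100,000 tokens implies $3 per million input tokens) and is an inference, not a quoted price.

```python
HISTORY_TOKENS = 100_000
HISTORY_COST = 0.30                                # dollars per turn, from the text
PRICE_PER_TOKEN = HISTORY_COST / HISTORY_TOKENS    # implied: $3 per million tokens

RETRIEVED_TOKENS = 2_000
LOOKUP_COST = 0.001                                # embedding + vector search, from the text

# The retrieved tokens still have to be sent to the model as input.
retrieved_input_cost = RETRIEVED_TOKENS * PRICE_PER_TOKEN
retrieval_total = LOOKUP_COST + retrieved_input_cost

lookup_ratio = HISTORY_COST / LOOKUP_COST          # lookup alone vs stuffed history
all_in_ratio = HISTORY_COST / retrieval_total      # lookup + retrieved input tokens

print(f"retrieved-token input cost: ${retrieved_input_cost:.4f}")
print(f"lookup-only ratio: about {lookup_ratio:.0f}x")
print(f"all-in ratio: about {all_in_ratio:.0f}x")
```

On these assumptions the lookup alone is about 300 times cheaper, and the all-in per-turn cost is still several tens of times cheaper once the retrieved tokens are counted as model input.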

Chapter 3. Retrieval-Augmented Generation: The Loop

Almost every production agent memory architecture is, at its core, a retrieval-augmented generation loop. The agent has a question or a turn to handle; the memory layer is queried for relevant prior context; the retrieved context is inserted into the LLM’s prompt; the LLM generates a response that’s informed by both the immediate input and the retrieved memories. The loop is so universal that it’s easy to miss how much of the system’s quality lives in its individual stages.

The retrieval-augmented generation loop
Five sequential stages plus an LLM call. The memory-write path runs alongside, extracting facts from observations and indexing them for future retrieval.

The five stages, in order: the query (a user message, an agent sub-task, or a system trigger), the embedding (the query is converted to a high-dimensional vector by an embedding model --- OpenAI’s text-embedding-3, Voyage AI, the open-source BGE family), the vector search (the top-k most similar vectors in the memory store are retrieved, typically by cosine similarity or a hybrid combining cosine with BM25 keyword scoring), the optional reranker (a cross-encoder model scores each of the k candidates against the query and reorders them by genuine relevance --- a significantly more expensive but more accurate scoring than pure vector similarity), and the context insertion (the top-N reranked candidates are assembled into a context block and prepended to the LLM’s prompt).
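The five stages can be traced end to end in a toy sketch. Everything here is a stand-in chosen so the example runs without external services: embed is a bag-of-words counter rather than an embedding model, and rerank_score is a term-overlap heuristic rather than a cross-encoder. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts. A real system calls an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rerank_score(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder: fraction of query terms present in the doc.
    q, d = set(embed(query)), set(embed(doc))
    return len(q & d) / len(q) if q else 0.0


def retrieve(query: str, memories: list[str], k: int = 3, n: int = 2) -> list[str]:
    q_vec = embed(query)                                    # stage 2: embedding
    by_sim = sorted(memories, key=lambda m: cosine(q_vec, embed(m)), reverse=True)
    candidates = by_sim[:k]                                 # stage 3: vector search, top-k
    reranked = sorted(candidates, key=lambda m: rerank_score(query, m), reverse=True)
    return reranked[:n]                                     # stage 4: rerank, keep top-N


def build_prompt(query: str, memories: list[str]) -> str:
    # Stage 5: assemble the context block and prepend it to the prompt.
    context = "\n".join(f"- {m}" for m in retrieve(query, memories))
    return f"Relevant memories:\n{context}\n\nUser: {query}"


MEMORIES = [
    "Alice is allergic to peanuts",
    "Alice prefers spicy food",
    "Bob asked about refund policy",
    "Alice asked about kitchen renovation costs",
]
QUERY = "What food does Alice like?"        # stage 1: the query
prompt = build_prompt(QUERY, MEMORIES)
print(prompt)
```

Swapping embed for a real embedding call and rerank_score for a cross-encoder changes the quality of each stage but not the shape of the loop.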

Two recurring choices dominate the design. First, the embedding model: closed models like text-embedding-3-large from OpenAI post strong scores on the MTEB benchmark but cost money per call; open models like BGE and Voyage AI’s open variants are close enough for most workloads and run on local hardware with no per-call fee. Second, the hybrid-vs-pure-vector question: empirically, hybrid search (vector similarity plus BM25 keyword scoring) outperforms pure vector search on agent memory benchmarks by a consistent margin, because user queries often contain specific terms (proper nouns, IDs, jargon) that pure semantic similarity handles poorly.
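A small sketch shows why the keyword half matters for queries that contain an ID. Two things here are assumptions of this example rather than anything the text prescribes: the two rankings are fused with reciprocal rank fusion (one common choice among several), and the dense similarity scores are hypothetical numbers typed in by hand, since no embedding model is called.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9\-]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each doc against the query."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(scores: list[float]) -> list[int]:
    # Indices of docs with a positive score, best first.
    hits = [i for i in range(len(scores)) if scores[i] > 0]
    return sorted(hits, key=lambda i: scores[i], reverse=True)


def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Reciprocal rank fusion: each ranking contributes 1 / (k + position)."""
    fused = Counter()
    for ranking in rankings:
        for pos, doc in enumerate(ranking, start=1):
            fused[doc] += 1.0 / (k + pos)
    return sorted(fused, key=lambda d: fused[d], reverse=True)


DOCS = [
    "Customer reported a billing problem last month",
    "Ticket INV-2291 was refunded in full",
    "Customer asked how invoices are generated",
]
QUERY = "what happened with INV-2291"
dense_scores = [0.62, 0.58, 0.60]        # hypothetical cosine similarities
sparse_scores = bm25_scores(QUERY, DOCS)
fused = rrf([rank(dense_scores), rank(sparse_scores)])
print("dense order:", rank(dense_scores), "fused order:", fused)
```

With these inputs the dense ranking alone puts the ticket record last, while the fused ranking puts it first because it is the only document that matches the exact identifier.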

The memory-write path runs in parallel to the read path. When the agent has an observation worth remembering --- the user stated a preference, a task completed, a fact was confirmed --- a fact-extractor LLM call distills the observation into a clean memory record, the record is checked against existing memories for duplication, and the deduplicated record is written to the vector store with appropriate metadata (user_id, timestamp, importance score). The fact-extractor and dedup logic is where the dedicated memory layers (Mem0, Letta, Zep) distinguish themselves from raw vector stores; the read path is largely the same across all of them.
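The write path described above (extract, deduplicate, write with metadata) can be sketched as follows. This is not how Mem0, Letta, or Zep implement it: extract_fact is a keyword rule standing in for the fact-extractor LLM call, and the duplicate check uses token overlap where a real layer would more likely compare embeddings. All names are hypothetical.

```python
import re
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryRecord:
    user_id: str
    text: str
    timestamp: float
    importance: float


def extract_fact(observation: str) -> Optional[str]:
    # Placeholder for the fact-extractor LLM call: a keyword rule stands in for it.
    markers = ("prefer", "allergic", "always", "never")
    if any(m in observation.lower() for m in markers):
        return observation.strip().rstrip(".!")
    return None  # small talk, or nothing memory-worthy


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_duplicate(candidate: str, existing: list[MemoryRecord], threshold: float = 0.8) -> bool:
    # Jaccard overlap stands in for an embedding-similarity check.
    cand = tokens(candidate)
    for rec in existing:
        other = tokens(rec.text)
        if cand and len(cand & other) / len(cand | other) >= threshold:
            return True
    return False


def write_memory(store: list[MemoryRecord], user_id: str, observation: str,
                 importance: float = 0.5) -> bool:
    """Extract, deduplicate, then append. Returns True if a record was written."""
    fact = extract_fact(observation)
    if fact is None:
        return False
    same_user = [r for r in store if r.user_id == user_id]
    if is_duplicate(fact, same_user):
        return False
    store.append(MemoryRecord(user_id, fact, time.time(), importance))
    return True


store: list[MemoryRecord] = []
results = [
    write_memory(store, "alice", "I prefer spicy food."),   # written
    write_memory(store, "alice", "Nice weather today!"),    # not memory-worthy
    write_memory(store, "alice", "I prefer spicy food!"),   # duplicate for alice
    write_memory(store, "bob", "I prefer spicy food."),     # different user, written
]
print(results, len(store))
```

Deduplication is scoped to the user, so the same sentence from two users produces two records.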

The single most consequential design decision in a memory loop is what counts as a memory-worthy observation. Writing every utterance produces a noisy memory store full of low-value entries that crowd out the high-value ones during retrieval. Writing nothing means the agent has no memory beyond the conversation thread. The dedicated memory layers all use small-LLM-driven heuristics for this: extract facts, identify preferences, note completed tasks, ignore small talk. The heuristics matter; they’re where memory quality is born or lost.

Chapter 4. Multi-Agent Shared State Topologies

When a single agent has its own memory layer, the design is mostly settled by the choices in Chapter 3. When multiple agents need to coordinate on a shared task, the question becomes how they share context with each other without flooding every agent’s context window with the others’ entire history. Three topologies dominate the answer.

Multi-agent shared state topologies
Group chat (tight coupling), Blackboard (medium coupling), External DB (loose coupling). Most production systems combine the latter two.

The group chat topology is the simplest and the worst-scaling. All agents participate in a single shared conversation; each message is delivered to all agents; coordination happens because every agent sees every other agent’s output. AutoGen’s GroupChat manager is the canonical implementation, and the pattern is right for small teams of agents (two to five) with conversation-shaped collaboration. The pattern scales poorly because every agent’s context window has to hold the entire conversation, multiplying the total token consumption by the agent count. A four-agent group chat with thirty turns per agent produces a 120-turn transcript, and each of the four agents carries all of it in context, which is well into runaway-cost territory.
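A back-of-envelope calculation shows the scaling. The sketch assumes the fifteen-hundred-token turn size used in Chapter 2 and a hypothetical 2,000-token shared state for comparison; it counts tokens held in context at the end of the conversation, not cumulative billed tokens.

```python
def group_chat_context_tokens(agents: int, turns_per_agent: int, tokens_per_turn: int) -> int:
    """Tokens held across all agents when every agent carries the whole transcript."""
    transcript = agents * turns_per_agent * tokens_per_turn
    return agents * transcript


def shared_state_context_tokens(agents: int, state_tokens: int) -> int:
    """Tokens held across all agents when each carries only the current state."""
    return agents * state_tokens


per_agent = 4 * 30 * 1_500                      # one agent's copy of the transcript
group_total = group_chat_context_tokens(4, 30, 1_500)
state_total = shared_state_context_tokens(4, 2_000)   # 2,000 is a hypothetical state size

print(f"transcript per agent: {per_agent:,} tokens")
print(f"group chat, all agents: {group_total:,} tokens")
print(f"shared state, all agents: {state_total:,} tokens")
```

Under these assumptions each agent ends up carrying a 180,000-token transcript, close to the 200K window on its own.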

The blackboard topology is the classical AI architecture from the 1980s, recast for the LLM era. A single structured store (a shared dictionary, a Redis hash, a Postgres row) holds the task’s state; agents read from the store, do their work, write back; coordination happens because every agent sees the current state but doesn’t need to see how the state got there. CrewAI’s built-in shared task memory implements this pattern. The blackboard scales better than group chat because each agent’s context only holds the current state, not the conversation history that produced it.

The external database topology decouples agents entirely. A persistent external store (Postgres, often with pgvector for the embedding-search half) holds both the task state and the memory layer; agents read and write via SQL or via the memory layer’s API; coordination is loose, asynchronous, and survives restarts. Supabase plus pgvector is the canonical stack. The trade-off is that the external store becomes a contention point if many agents write concurrently, and the protocol for state transitions needs explicit design (which agent owns which fields, how conflicts resolve, what versioning looks like).
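The explicit protocol for state transitions can be as small as an owner column and a version counter. The sketch below uses SQLite from the standard library as a stand-in for Postgres so that it runs anywhere; the table layout, column names, and agent names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE task_state (
        task_id TEXT PRIMARY KEY,
        status  TEXT NOT NULL,
        owner   TEXT NOT NULL,
        version INTEGER NOT NULL
    )
""")
conn.execute("INSERT INTO task_state VALUES ('t1', 'pending', 'planner', 1)")
conn.commit()


def read_state(task_id: str) -> tuple:
    return conn.execute(
        "SELECT status, owner, version FROM task_state WHERE task_id = ?", (task_id,)
    ).fetchone()


def update_status(task_id: str, agent: str, new_status: str, seen_version: int) -> bool:
    """Write only if the agent owns the row and nobody has written since it last read."""
    cur = conn.execute(
        "UPDATE task_state SET status = ?, version = version + 1 "
        "WHERE task_id = ? AND owner = ? AND version = ?",
        (new_status, task_id, agent, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1


_, _, v = read_state("t1")
first = update_status("t1", "planner", "in_progress", v)      # owner, fresh version
stale = update_status("t1", "planner", "done", v)             # version has moved on
intruder = update_status("t1", "researcher", "done", v + 1)   # not the owner
print(first, stale, intruder, read_state("t1"))
```

This is optimistic concurrency: a rejected write tells the agent to re-read and retry, which keeps coordination loose without letting concurrent writers silently overwrite each other.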

Most production multi-agent systems combine two of the three. A blackboard or external DB carries the live task state; an external DB carries the long-term memory layer; transient agent-to-agent messaging happens through group-chat-style channels for the small subset of interactions where it’s genuinely the right shape. The default is the external-DB pattern with a clean schema; group chat is reserved for the parts where it justifies itself.

Chapter 5. Memory Lifecycle: Write, Update, Compact, Forget

A memory store that only accepts writes and only serves reads is a memory store that, over months of production use, becomes a swamp. Useful memories get drowned by noisy ones; outdated facts contradict current ones; the storage cost grows without bound. The lifecycle of memory entries --- not just the read and write --- is where production memory engineering happens.

Memory lifecycle: six stages
Write, Index, Retrieve, Update, Compact, Forget. The last three are where the dedicated memory layers distinguish themselves.

The Write stage is where observations turn into memory records. A fact-extractor LLM reads the agent’s recent observation (a user message, a task outcome, a tool result) and produces a structured memory entry: the fact itself, the user it pertains to, metadata about when and how it was observed, an importance score. Dedup logic checks whether a similar fact is already in the store before writing. The Index stage embeds the fact, stores the vector alongside the text, and makes it searchable. The Retrieve stage is the read side covered in Chapter 3.

Update is where things get hard. The user said yesterday that they prefer Italian food; today they said they’re going vegan. A naive memory layer ends up with both facts in the store, both retrieved, the LLM trying to reconcile them. A thoughtful memory layer detects the contradiction at write time and chooses an update policy: overwrite the old (lose the history), supersede with a new version (keep both, mark which is current), or fork by context (the old fact applies to dinner, the new one to weeknight meals). Mem0’s Graph Memory feature, introduced in 2025, was an explicit response to this problem: model relationships between memories so updates propagate through the graph rather than producing parallel contradictory entries.
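The supersede policy is the easiest of the three to sketch. The example assumes contradictory facts share a key such as "diet"; deciding that two free-text memories contradict each other is the hard part and is left out here (a real layer would typically ask an LLM). Class and function names are hypothetical, and this is not Mem0’s Graph Memory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    user_id: str
    key: str            # e.g. "diet"; assumes contradictions share a key
    value: str
    version: int
    current: bool = True


def supersede(store: list[Fact], user_id: str, key: str, value: str) -> Fact:
    """Keep history: mark the old version non-current and append the new one."""
    latest: Optional[Fact] = None
    for f in store:
        if f.user_id == user_id and f.key == key and f.current:
            latest = f
    if latest is not None and latest.value == value:
        return latest                     # nothing changed, nothing to write
    if latest is not None:
        latest.current = False            # old fact stays in the store as history
    new = Fact(user_id, key, value, version=(latest.version + 1 if latest else 1))
    store.append(new)
    return new


def current_facts(store: list[Fact], user_id: str) -> dict[str, str]:
    # Retrieval only ever sees the current version of each fact.
    return {f.key: f.value for f in store if f.user_id == user_id and f.current}


store: list[Fact] = []
supersede(store, "alice", "diet", "prefers Italian food")
supersede(store, "alice", "diet", "vegan")
print(current_facts(store, "alice"), [(f.value, f.current) for f in store])
```

Both versions remain in the store, but only the vegan entry is returned at retrieval time, so the model never has to reconcile the two.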

Compact handles the volume problem. After months of accumulation, a user’s memory store has thousands of entries, most low-value individually but collectively occupying retrieval slots that the few high-value entries should win. The compact stage summarizes clusters of related low-value memories into one summary entry, deleting the originals. Done well, this preserves the meaningful information while shrinking the store; done badly, it loses specifics that turn out to matter later. Letta’s archival memory paging, inherited from the MemGPT research, is the canonical implementation --- a hierarchical memory structure modeled on operating-system virtual memory, with hot memories in fast storage and cold memories paged out to slower archives.
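A minimal sketch of the compact stage follows. It is not Letta’s archival paging: it only illustrates folding clusters of low-importance entries into one summary. The topic field is a hypothetical cluster label (a real system would cluster by embedding similarity), and summarize stands in for an LLM call.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    topic: str          # hypothetical cluster label
    importance: float


def summarize(texts: list[str]) -> str:
    # Placeholder for an LLM summary of the cluster.
    return f"{len(texts)} notes: " + "; ".join(texts)


def compact(store: list[Memory], threshold: float = 0.3, min_cluster: int = 2) -> list[Memory]:
    """Replace each cluster of low-importance memories with a single summary entry."""
    keep = [m for m in store if m.importance >= threshold]
    clusters: dict[str, list[Memory]] = {}
    for m in store:
        if m.importance < threshold:
            clusters.setdefault(m.topic, []).append(m)
    for topic, members in clusters.items():
        if len(members) < min_cluster:
            keep.extend(members)          # too small to be worth summarizing
            continue
        summary = summarize([m.text for m in members])
        keep.append(Memory(summary, topic, max(m.importance for m in members)))
    return keep


store = [
    Memory("Alice is allergic to peanuts", "health", 0.9),
    Memory("Said it was raining", "weather", 0.1),
    Memory("Mentioned a cold morning", "weather", 0.1),
    Memory("Complained about the heat", "weather", 0.2),
    Memory("Watched a football match", "sports", 0.1),
]
compacted = compact(store)
for m in compacted:
    print(m.topic, "|", m.text)
```

The high-importance allergy entry is untouched, the three weather notes collapse into one, and the lone sports note is left alone; the originals of the summarized cluster are gone, which is where the fidelity risk described above comes from.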

Forget is the stage most teams skip and most production memory stores need. Old memories that are no longer relevant should decay in importance and eventually leave the store; explicit revocation (the user asked to be forgotten, a fact was retracted) should remove specific entries immediately. Time-decay scoring is the standard implementation: memory importance is a function of recency, retrieval frequency, and recorded importance, and entries below a threshold get garbage-collected periodically. The harder version is memory staleness --- a high-relevance memory about a user’s employer remains highly-scored even after they change jobs, surfacing confidently-wrong information in retrieval. As Mem0’s 2026 State of Memory report observes, staleness in high-relevance memories is an open problem and an active area of research.
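Time-decay scoring can be sketched with a half-life. The specific formula below (recorded importance, times an exponential recency term, times a logarithmic usage boost) and the default threshold are assumptions for illustration; the text only states that the score depends on recency, retrieval frequency, and recorded importance.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    id: str
    importance: float      # recorded at write time, 0..1
    age_days: float
    retrievals: int


def decay_score(m: Memory, half_life_days: float = 30.0) -> float:
    recency = 0.5 ** (m.age_days / half_life_days)     # halves every half-life
    usage = 1.0 + math.log1p(m.retrievals)             # diminishing returns on reuse
    return m.importance * recency * usage


def garbage_collect(store: list[Memory], threshold: float = 0.1,
                    revoked: frozenset = frozenset()) -> list[Memory]:
    """Drop explicitly revoked entries at once, and decayed entries below the threshold."""
    return [m for m in store if m.id not in revoked and decay_score(m) >= threshold]


store = [
    Memory("a", importance=0.8, age_days=0, retrievals=0),     # fresh
    Memory("b", importance=0.8, age_days=120, retrievals=0),   # old and never used
    Memory("c", importance=0.8, age_days=120, retrievals=10),  # old but often retrieved
    Memory("d", importance=0.9, age_days=1, retrievals=3),     # user asked to forget it
]
survivors = garbage_collect(store, revoked=frozenset({"d"}))
print([m.id for m in survivors])
```

Note what the sketch does not solve: entry c survives precisely because it is retrieved often, which is the staleness problem in miniature if the fact behind it has since become wrong.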

Two unsolved problems sit on top of this lifecycle. Cross-session identity resolution: when two interactions arrive without confident user identification, are they the same person? Memory layers assume a stable user_id, but real systems have anonymous sessions, multi-device users, and mixed auth flows that break the assumption. And the staleness problem above: how do you know when a high-relevance memory has become wrong? Both problems are likely to define memory layer differentiation through 2027.

Part 2 — The Substrates

Eight sections follow. Each opens with a short essay on what its entries have in common and how they relate to alternatives. Representative substrates are presented in the same Fowler-style template used by the prior five catalogs.

Sections at a glance

  • Section A --- Short-term memory and graph state

  • Section B --- Vector databases

  • Section C --- Dedicated memory layers

  • Section D --- Shared state engines for multi-agent

  • Section E --- Hybrid relational + vector

  • Section F --- Embedding models and evaluation

  • Section G --- Knowledge graphs for memory

  • Section H --- Curation hubs

Section A — Short-term memory and graph state

How the live conversation thread and the in-run agent state get tracked

Working memory --- the LLM’s context window during a run --- is technically handled by the model itself. What the frameworks contribute is the management around it: typed state objects flowing through a graph of nodes, persistent checkpoints for resumability, durable chat history that survives restarts and supports time-travel debugging. Two substrates dominate: LangGraph’s checkpointer-and-state-graph pattern (covered in detail in Volume 4 and cross-referenced here) and Redis-backed chat history as the low-friction default for production conversation persistence.

The shared property is that both are about state, not memory in the long-term sense. Memory in the dedicated-memory-layer sense is in Section C; this section is for the live state that an in-progress agent run depends on.

LangGraph state and checkpointers (cross-reference)

Source: github.com/langchain-ai/langgraph (26.2k stars, MIT)

Classification Stateful graph framework with persistent checkpoint backends.

Intent

Model agent execution as a graph of nodes operating on a typed state object, with the framework managing state persistence, replay, and human-in-the-loop interrupts.

Motivating Problem

Covered in detail in Volume 4, Section A. The point relevant to memory: LangGraph’s state object is the canonical short-term working memory of an agent run, and its checkpointers (in-memory, SQLite, Postgres, LangGraph Cloud) are how that working memory survives across process restarts and supports time-travel debugging.

How It Works

The typed state object --- a TypedDict or Pydantic model --- flows through nodes; each node returns updates to the state; the framework merges updates and persists snapshots at every transition. Compiling the graph with a checkpointer and invoking it with a config of the form {"configurable": {"thread_id": "…"}} ties a run to a persistent thread, and the checkpointer stores every snapshot keyed by thread_id and step number.

For memory purposes, the most important property is that the entire state of a long-running agent is queryable and restorable at any past point. Time-travel debugging --- “start the agent over from this snapshot but with a different decision at step seven” --- is a first-class operation. Production debugging benefits enormously.
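The idea of snapshots keyed by thread and step, and of replaying from one, can be shown with a toy. This is deliberately not LangGraph’s API: ToyCheckpointer, run, and the three node functions are hypothetical names, and the sketch keeps snapshots in a plain dictionary.

```python
import copy


class ToyCheckpointer:
    """Illustrative only: stores a deep copy of the state per (thread_id, step)."""

    def __init__(self):
        self._snapshots = {}

    def save(self, thread_id, step, state):
        self._snapshots[(thread_id, step)] = copy.deepcopy(state)

    def load(self, thread_id, step):
        return copy.deepcopy(self._snapshots[(thread_id, step)])


def run(nodes, state, checkpointer, thread_id, start_step=0):
    """Apply nodes in order, merging each node's update and snapshotting after it."""
    for step, node in enumerate(nodes[start_step:], start=start_step + 1):
        state = {**state, **node(state)}
        checkpointer.save(thread_id, step, state)
    return state


def plan(state):
    return {"plan": f"research {state['topic']}"}


def decide(state):
    # An "override" key in the state changes the decision on replay.
    return {"route": state.get("override", "web_search")}


def act(state):
    return {"result": f"ran {state['route']}"}


NODES = [plan, decide, act]
cp = ToyCheckpointer()
final = run(NODES, {"topic": "pgvector"}, cp, thread_id="t1")

# Time travel: restore the snapshot taken after step 1, change one input,
# and replay the remaining steps on a new thread.
forked = cp.load("t1", 1)
forked["override"] = "docs_lookup"
replay = run(NODES, forked, cp, thread_id="t1-fork", start_step=1)

print(final["result"], "|", replay["result"])
```

The original thread’s snapshots are untouched by the replay, which is what makes this style of debugging safe to run against production history.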

When to Use It

Stateful agent runs where the in-progress state is non-trivial and needs to survive failures. Anywhere the team is already on LangGraph. Cases where time-travel debugging is operationally valuable. See Volume 4 for fuller treatment.

Alternatives --- CrewAI Flows, LlamaIndex Workflows, AutoGen state runtime. The state-management story differs in detail but the operational shape is similar across the four.

Sources

  • docs.langchain.com/oss/python/langgraph/durable-execution

Redis-backed chat history

Source: redis.io (BSD-3-Clause through Redis 7.2; later releases are RSALv2/SSPLv1, with AGPLv3 added as an option in Redis 8; via RedisChatMessageHistory in LangChain et al.)

Classification Sub-millisecond in-memory store wrapped as conversation persistence.

Intent

Persist conversation threads in Redis for sub-millisecond reads and writes, with TTL-based expiration and the option to back it with disk-persistent Redis variants for durability.

Motivating Problem

For agents that handle many concurrent conversations --- chatbots, customer support, multi-user assistants --- the conversation thread itself becomes a hot path. Loading the last fifty messages from a relational store on every turn produces visible latency; loading them from Redis takes a millisecond. Most agent stacks already have Redis running for caching; adding a chat-history layer on top is one of the smallest-friction memory investments available.

How It Works

Each conversation gets a Redis key (typically session:user_id or chat:thread_id) backed by a Redis list of messages. RPUSH appends new messages to the tail; LRANGE with negative indexes reads the last N. TTL on the key handles session expiration. Wrappers like LangChain’s RedisChatMessageHistory abstract the operations behind a clean interface; the OpenAI Assistants API’s thread persistence offers a comparable hosted interface.

For durability, Redis AOF (append-only file) and RDB snapshots provide acceptable durability for chat history that’s not the system of record. For higher-durability needs, the same pattern works against Postgres or DynamoDB at the cost of higher per-operation latency.

When to Use It

Any agent application with concurrent conversations and per-turn latency budgets. The default for short-term conversation persistence; almost always worth the modest operational cost.

Alternatives --- in-process dictionaries when the application is single-process and conversations don’t need to survive restarts (rare in production). Postgres-backed history when Redis isn’t already in the stack. The LangGraph or framework-provided persistence layer when the framework abstracts it.

Sources

  • redis.io

  • python.langchain.com/docs/integrations/memory/redis_chat_message_history/

Example artifacts

Code.

from langchain_redis import RedisChatMessageHistory

history = RedisChatMessageHistory(
    session_id="user_42",
    redis_url="redis://localhost:6379/0",  # langchain_redis names this parameter redis_url
    ttl=3600 * 24 * 7,  # one week, in seconds
)

history.add_user_message("What did we discuss last time?")
history.add_ai_message("You asked about kitchen renovation costs.")

# Replay the stored thread in order
for msg in history.messages:
    print(msg.type, msg.content)

Section B — Vector databases

The substrate for episodic and semantic memory: where vectors live and how they’re searched

Long-term agent memory sits on top of a vector database. Four products dominate at different ends of the operational spectrum: Chroma is the lightweight Python-embeddable option for prototypes; Qdrant is the high-performance production option with payload filtering and distributed deployment; Pinecone is the hosted-managed option that handles operations entirely; Weaviate is the open-source-with-multimodal-and-graph-features option. Most production agent memory layers (Mem0, Letta, Zep) abstract over a choice of vector store and let the deployer pick.

The dimensions of choice are operational footprint (Chroma’s embedded mode vs Qdrant’s distributed cluster vs Pinecone’s pure SaaS), query feature surface (payload filtering, hybrid search, multi-vector), and per-vector cost at scale. The agent angle adds two specific requirements that conventional vector-store evaluation skips: support for per-user namespace isolation (so one user’s memories don’t leak into another’s retrieval) and support for metadata-rich filtering (so an episodic query for “what did Alice ask yesterday” can filter to user_id=alice AND timestamp > now-24h before doing vector search).
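The two agent-specific requirements amount to filtering before scoring. The sketch below is store-agnostic and uses hand-written three-dimensional vectors, a fixed clock, and hypothetical point IDs so the result is repeatable; a real vector database applies the same filter inside its index rather than in a Python loop.

```python
import math

NOW = 1_716_000_000     # fixed "current time" in epoch seconds, for a repeatable example
DAY = 24 * 3600

POINTS = [
    {"id": "a1", "user_id": "alice", "timestamp": NOW - 2 * DAY, "vector": [0.9, 0.1, 0.0]},
    {"id": "a2", "user_id": "alice", "timestamp": NOW - 3600, "vector": [0.7, 0.3, 0.0]},
    {"id": "b1", "user_id": "bob", "timestamp": NOW - 3600, "vector": [1.0, 0.0, 0.0]},
]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(points, query_vec, user_id, since, k=3):
    # Filter first: only this user's memories inside the time window are scored.
    eligible = [p for p in points if p["user_id"] == user_id and p["timestamp"] > since]
    ranked = sorted(eligible, key=lambda p: cosine(query_vec, p["vector"]), reverse=True)
    return ranked[:k]


query = [1.0, 0.0, 0.0]
hits = search(POINTS, query, user_id="alice", since=NOW - DAY)
hit_ids = [p["id"] for p in hits]
print(hit_ids)
```

Without the filter, the nearest vector to the query is Bob’s, and Alice’s two-day-old memory would outrank her recent one; with it, neither can appear in the result.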

Chroma

Source: github.com/chroma-core/chroma (Apache-2; Python/TypeScript SDKs; embedded or client/server)

Classification Lightweight open-source embedding database.

Intent

Provide a vector database that runs in-process inside a Python application for fast prototyping, with a path to a client/server deployment when scale demands it.

Motivating Problem

For agent prototypes and small-scale memory deployments, the operational overhead of running a separate vector database is real and unnecessary. Chroma’s answer is to embed the database directly in the Python (or TypeScript) application: a few lines of import-and-call, persistent storage in a local directory, no separate process to manage. The same API works against a client/server deployment when the prototype graduates to production.

How It Works

Install chromadb, instantiate a PersistentClient pointed at a directory, create or get a collection, and add documents with their embeddings (computed by Chroma using a default embedding function or supplied by the caller). Query is a single .query() call returning the top-k nearest neighbors with their documents and metadata. The client/server mode runs the same API surface against a remote Chroma server.

Chroma’s positioning is explicitly developer-friendly. The documentation, the API surface, and the default behaviors prioritize “it just works in five lines” over operational sophistication. The trade-off is that Chroma at scale (millions of vectors, dozens of concurrent users) is operationally less comfortable than Qdrant or Pinecone; the prototype path is excellent.

When to Use It

Prototypes, demos, and small-to-medium production deployments where running the database in-process is operationally simpler. Agent memory layers (Mem0, Letta) that abstract over Chroma in their default configuration. Anywhere the team wants the simplest possible vector-store onboarding.

Alternatives --- Qdrant when production performance and feature surface matter. Pinecone when fully-managed hosting is preferred. pgvector when Postgres is already in the stack.

Sources

  • github.com/chroma-core/chroma

  • docs.trychroma.com

Example artifacts

Setup.

pip install chromadb

Code.

import chromadb

# Embedded mode: data persists in a local directory
client = chromadb.PersistentClient(path="./memory_db")
coll = client.get_or_create_collection(name="alice_memories")

# Documents are embedded by the collection's default embedding function
coll.add(
    documents=["Alice prefers spicy food", "Alice is allergic to peanuts"],
    metadatas=[{"category": "preference"}, {"category": "allergy"}],
    ids=["mem_001", "mem_002"],
)

results = coll.query(
    query_texts=["What food restrictions does Alice have?"],
    n_results=2,  # only two memories are stored
)
print(results["documents"])

Qdrant

Source: github.com/qdrant/qdrant (Apache-2; Rust; client SDKs in Python/TS/Rust/Go/Java)

Classification High-performance vector search engine for production agent memory.

Intent

Provide a production-grade vector database written in Rust, with distributed deployment, rich payload filtering, hybrid search, and the operational maturity to back agent memory at scale.

Motivating Problem

For agent memory at production scale --- millions of vectors, dozens of concurrent users, sub-100ms p99 retrieval latency --- the requirements outgrow Chroma’s in-process mode and the cost profile of Pinecone’s pure SaaS becomes painful. Qdrant occupies the production-grade-open-source position: written in Rust for performance, distributed by design, with the payload-filtering and hybrid-search features that production memory needs.

How It Works

Collections hold points; each point is a vector plus a JSON payload. The payload supports rich filtering (range queries, geo, full-text, nested fields), which is essential for agent memory because retrieval is rarely pure vector --- it’s usually “find similar memories for user_id=alice in the last 30 days with importance > 0.5,” where the filter dramatically reduces the search space before the vector comparison runs.

Hybrid search combines vector similarity with sparse vectors (BM25-style keyword scoring) for retrieval that handles both semantic similarity and exact-term matches. Qdrant’s sparse-vector support is first-class; the Mem0 documentation calls out Qdrant explicitly for the production hosting path.

Operational: Qdrant runs in single-node mode for development and in a distributed cluster for production. Qdrant Cloud is the hosted offering; the open-source binary is the self-hosted option. Performance benchmarks consistently place Qdrant at or near the top of the open-source vector database category.

When to Use It

Production agent memory deployments. Cases requiring rich payload filtering (per-user isolation, time-range queries, importance thresholds). Hybrid search workloads. Any scale beyond “small prototype.”

Alternatives --- Pinecone for fully-managed hosting; pgvector when Postgres is preferred; Chroma for prototypes; Weaviate for multimodal-or-graph-heavy workloads.

Sources

  • github.com/qdrant/qdrant

  • qdrant.tech/articles/agent-memory/

Example artifacts

Setup.

# Self-hosted:

docker run -p 6333:6333 -p 6334:6334 \

-v $(pwd)/qdrant_storage:/qdrant/storage \

qdrant/qdrant

# Python client:

pip install qdrant-client

Code.

from qdrant_client import QdrantClient

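from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams,
)

client = QdrantClient(url="http://localhost:6333")

# embedding_of() below is a placeholder for your embedding call; it must
# return a list of 1536 floats to match the collection's vector size.

client.create_collection(

collection_name="agent_memories",

vectors_config=VectorParams(size=1536, distance=Distance.COSINE),

)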

# Write with payload metadata for filtering

client.upsert(

collection_name="agent_memories",

points=[PointStruct(

id=1,

vector=embedding_of("Alice prefers spicy food"),

payload={"user_id": "alice", "category": "preference",
"timestamp": 1716000000},

)],

)

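# Per-user filtered query. query_points() is the current query method in
# qdrant-client; older releases exposed the same operation as client.search().

hits = client.query_points(

collection_name="agent_memories",

query=embedding_of("food preferences"),

query_filter=Filter(must=[

FieldCondition(key="user_id", match=MatchValue(value="alice")),

]),

limit=5,

).points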

Pinecone

Source: pinecone.io (SaaS; closed-source; Python/TypeScript clients)

Classification Managed hosted vector database.

Intent

Provide vector search as a pure SaaS, with no operational responsibility for the deployer: API key in, vectors in, queries out, with serverless scaling and global availability.

Motivating Problem

For teams that want production-grade vector search without operating any infrastructure, Pinecone has been the incumbent answer since 2021. Pure SaaS, no servers to run, pay-per-vector-stored-and-query. The trade-off is dependency on a single vendor and the cost profile, which can be significant at scale; the upside is zero operational overhead and the maturity of a several-years-old hosted product.

How It Works

Create an index via the dashboard or API; choose a dimension matching your embedding model; upsert vectors with metadata; query by vector or by metadata filter. The serverless model (introduced 2024) eliminates the older pod-based provisioning; usage-based pricing scales with storage and read units. Hybrid search and metadata filtering are first-class.

Pinecone’s positioning leans heavily into the managed-service angle: SOC 2, HIPAA, regional deployments, SLA-backed. For agent memory in regulated industries (healthcare, financial services), this is the path of least resistance for the security questionnaire alone.

When to Use It

Teams that explicitly don’t want to operate vector-store infrastructure. Regulated industries where the managed compliance posture is valuable. Pilots that may scale fast and want the operating burden to scale with them.

Alternatives --- Qdrant Cloud for managed open-source. Amazon OpenSearch Service with k-NN for AWS-native deployments. The self-hosted-Qdrant-or-Chroma path when cost-control matters and the team has the operational maturity.

Sources

  • pinecone.io

  • docs.pinecone.io

Weaviate

Source: github.com/weaviate/weaviate (BSD-3-Clause; Go; Python/JS/Go clients)

Classification Open-source vector database with multimodal and graph features.

Intent

Provide vector search with first-class support for multi-vector objects, multimodal data (text + image + audio embeddings on the same object), and graph-like relationships between objects.

Motivating Problem

For agent memory that goes beyond text --- a memory might include the screenshot the user shared, the audio clip from a voice note, the structured event data from a calendar --- single-modality vector stores create awkward modeling problems. Weaviate’s answer is multi-vector objects (one record carries multiple embeddings across modalities) and explicit cross-references between objects, making it natural to model both rich memories and the relationships between them.

How It Works

Objects are stored in classes (collections). Each class has a schema with properties and vector configurations; multiple vectors per object are supported (named-vector mode). Cross-references let one object reference another, producing a graph structure that’s queryable. Modules add specialized capabilities: text2vec-openai for embedding, multi2vec-clip for image+text, ref2vec-centroid for derived embeddings.
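The data shape described above (one object, several named vectors, references to other objects) can be modeled in a few lines of plain Python. This is an illustrative sketch, not Weaviate's client API; `MemoryObject`, `nearest`, and the sample ids are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryObject:
    id: str
    properties: dict
    vectors: dict                                    # name -> embedding
    references: dict = field(default_factory=dict)   # name -> list of object ids


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(objects, vector_name, query, limit=1):
    # Only objects that carry the requested named vector are candidates.
    candidates = [o for o in objects if vector_name in o.vectors]
    ranked = sorted(
        candidates, key=lambda o: cosine(o.vectors[vector_name], query), reverse=True
    )
    return ranked[:limit]


objects = [
    MemoryObject(
        "shot_1", {"caption": "error dialog"},
        {"text": [1.0, 0.0], "image": [0.0, 1.0]},
        {"about_event": ["evt_1"]},
    ),
    MemoryObject("note_1", {"caption": "meeting note"}, {"text": [0.9, 0.1]}),
    MemoryObject("evt_1", {"title": "Support call"}, {"text": [0.2, 0.8]}),
]
by_id = {o.id: o for o in objects}

# Search the image space, then follow the cross-reference to the related event.
best = nearest(objects, "image", [0.0, 1.0])[0]
linked = [by_id[ref].properties["title"] for ref in best.references["about_event"]]
print(best.id, linked)
```

The query runs against one vector space while the result still carries its other embeddings and its links, which is the modeling convenience the multi-vector design is after.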

Compared with Qdrant or Pinecone, Weaviate is genuinely different in shape: less a pure vector-search engine and more a vector-aware object database with graph semantics. The trade-off is feature breadth versus operational simplicity --- Weaviate is more capable but harder to operate well.

When to Use It

Multimodal agent memory (screenshots, audio, structured data alongside text). Workloads with rich object-to-object relationships that benefit from a graph model. Teams that prefer GraphQL-style query semantics.

Alternatives --- Qdrant for pure vector search at scale; Pinecone for managed; a dedicated graph database (Neo4j) when the graph is the primary structure; pgvector when Postgres is the existing stack.

Sources

  • github.com/weaviate/weaviate

Section C — Dedicated memory layers

Products that handle the read-write-update-forget lifecycle on top of vector substrates

Three products own the dedicated-memory-layer category: Mem0 (the universal-memory-layer positioning, 55K+ GitHub stars, hybrid vector-plus-key-value-plus-graph storage), Letta (formerly MemGPT, the OS-paging-inspired hierarchical memory architecture), and Zep (the conversation-summarization-focused long-term store). All three sit on top of a vector database (Section B) and add the fact-extraction, dedup, update, and decay logic that distinguishes a memory layer from a raw vector store.

The differentiation is in the mechanisms: Mem0 emphasizes integration breadth (21+ frameworks) and graph memory for relationship modeling; Letta emphasizes architectural sophistication (virtual context, archival vs recall memory tiers); Zep emphasizes automated conversation summarization with a knowledge-graph backend. All three can be self-hosted or used as managed services.

Mem0

Source: github.com/mem0ai/mem0 (55.9k stars, Apache-2; Python + TypeScript)

Classification Universal memory layer for AI agents.

Intent

Provide a memory layer that integrates with any agent framework, automatically extracts and stores user-relevant facts, handles dedup and update across a hybrid storage backend, and exposes a uniform add()/search() API.

Motivating Problem

Production agent applications that don’t want to build the memory layer themselves --- the fact-extraction logic, the dedup heuristics, the update rules, the multi-user isolation, the vector-store abstraction --- need a memory product to drop in. Mem0 positions itself as that product, with by far the broadest framework-integration coverage (21 frameworks as of early 2026 including LangChain, CrewAI, AutoGen, Mastra, the OpenAI Agents SDK) and a hybrid storage architecture that handles different memory shapes with different stores.

How It Works

Install mem0ai; instantiate a Memory() with optional configuration for the vector store, LLM, and embedder; call m.add(text, user_id=…) to store an observation; call m.search(query, user_id=…) to retrieve relevant memories. On add(), Mem0 calls a small LLM to extract facts from the observation, checks them against existing memories for duplication, and writes the deduplicated facts to its hybrid storage: a vector database for semantic search, a key-value database for fast lookups, and (in graph-memory mode) a graph database for relationship-aware retrieval.
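The add() path described above (extract facts, check for duplicates, write what is new) can be sketched with the standard library alone. Everything here is a stand-in: `ToyMemory` is hypothetical, the extractor splits sentences where the real product calls an LLM, and the duplicate check is exact-match on normalized text rather than a semantic comparison.

```python
import re
from collections import defaultdict


def extract_facts(observation):
    """Stand-in for the LLM extraction step: one 'fact' per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", observation.strip())
    return [p.strip() for p in parts if p.strip()]


def normalize(fact):
    return re.sub(r"[^a-z0-9 ]", "", fact.lower()).strip()


class ToyMemory:
    def __init__(self):
        self._facts = defaultdict(dict)  # user_id -> {normalized: original}

    def add(self, observation, user_id):
        added = []
        for fact in extract_facts(observation):
            key = normalize(fact)
            if key and key not in self._facts[user_id]:  # dedup per user
                self._facts[user_id][key] = fact
                added.append(fact)
        return added

    def search(self, query, user_id, limit=3):
        # Word overlap stands in for vector similarity.
        terms = set(normalize(query).split())
        scored = [
            (len(terms & set(key.split())), fact)
            for key, fact in self._facts[user_id].items()
        ]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: -s[0])
        return [fact for _, fact in scored[:limit]]


m = ToyMemory()
first = m.add("Alice prefers spicy food. Alice is allergic to peanuts.", user_id="alice")
second = m.add("Alice is allergic to peanuts!", user_id="alice")
print(first, second)
print(m.search("What is Alice allergic to?", user_id="alice"))
```

The second add() stores nothing because the fact is already present, and memories stay partitioned by user_id throughout.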

The hybrid storage is the architectural distinctiveness. Pure vector stores struggle with relationship queries (“what does Alice’s sister prefer” requires traversing a relationship); pure graph stores struggle with semantic similarity. Mem0’s default backends are Qdrant for vectors plus a key-value store; graph mode adds Neo4j or a similar graph store. The retrieval path scores results across all stores and surfaces the top-N by relevance, importance, and recency.
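Ranking by relevance, importance, and recency can be illustrated with a weighted sum. The weights, the 30-day half-life, and the `combined_score` function are assumptions chosen for the example; the text does not state the formula Mem0 actually uses.

```python
def combined_score(similarity, importance, age_days, *,
                   half_life_days=30.0, weights=(0.6, 0.25, 0.15)):
    """Blend three signals, each in [0, 1], into one ranking score."""
    recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
    w_sim, w_imp, w_rec = weights
    return w_sim * similarity + w_imp * importance + w_rec * recency


candidates = [
    {"text": "Alice is allergic to peanuts", "similarity": 0.80, "importance": 0.9, "age_days": 0},
    {"text": "Alice once ordered pad thai", "similarity": 0.85, "importance": 0.2, "age_days": 90},
]
ranked = sorted(
    candidates,
    key=lambda c: combined_score(c["similarity"], c["importance"], c["age_days"]),
    reverse=True,
)
print([c["text"] for c in ranked])
```

The allergy memory wins despite a slightly lower similarity, because it is both more important and more recent.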

Operational: self-hostable with docker compose up; managed via app.mem0.ai with free-tier API keys; agents can mint their own keys (mem0 init --agent), with the human owner claiming the account later. The CLI surface is unusual in being explicitly designed for AI coding agents to interact with directly. The memory-benchmarks repo (github.com/mem0ai/memory-benchmarks) is the open-source evaluation suite Mem0 maintains to demonstrate retrieval quality against alternatives.

When to Use It

Production agent applications that need a memory layer without building one. Multi-framework deployments where the memory layer should outlive any single framework choice. Cases where Graph Memory’s relationship modeling adds real value (multi-user systems, social-context tasks). Mem0 ships with the broadest integration surface and the most production reference deployments.

Alternatives --- Letta for the OS-paging-inspired architecture and more research-grounded design; Zep for conversation-summarization-heavy workloads. Roll-your-own on top of Qdrant or Chroma when the memory layer needs to be fully under team control. The decision is mostly about how much of the memory lifecycle you want to own versus delegate.

Sources

  • github.com/mem0ai/mem0

  • mem0.ai/blog/state-of-ai-agent-memory-2026

  • github.com/mem0ai/memory-benchmarks

Example

A customer-support agent that handles thousands of users across many sessions. Each user-message arrival triggers m.add() with user_id=alice; the fact-extractor distills any preferences, account details, or task progress; the next time Alice writes in (possibly weeks later), m.search() retrieves the relevant prior memories. Graph Memory tracks relationships (Alice’s account is associated with her company, whose support tier dictates SLA), making queries like “what’s Alice’s support tier” resolvable through the graph rather than requiring the SLA to be repeated in every conversation.

Example artifacts

Setup.

pip install mem0ai

# or for the CLI:

npm install -g @mem0/cli

# Self-hosted server:

cd server && make bootstrap # one-command setup, admin wizard included

Code.

import os

from mem0 import Memory

os.environ["OPENAI_API_KEY"] = "sk-..."

# Default config uses a local, embedded vector store; production should
# point the vector_store config at a Qdrant server.

m = Memory()

# Write

result = m.add(

"I am working on improving my tennis skills. Suggest some online courses.",

user_id="alice",

metadata={"category": "hobbies"},

)

# Read

memories = m.search(

query="What hobbies is Alice working on?",

user_id="alice",

)

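# Recent Mem0 releases return a dict with a "results" list; each entry
# stores the extracted fact under "memory".

for mem in memories["results"]:

    print(mem["memory"], mem["score"])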

Letta (formerly MemGPT)

Source: github.com/letta-ai/letta (Apache-2; Python)

Classification Hierarchical memory architecture modeled on operating system virtual memory.

Intent

Implement memory for LLM agents the way a computer’s OS implements memory --- with a small fast “main context” (RAM-like), large slower “archival memory” (disk-like), and paging logic that moves entries between tiers as needed.

Motivating Problem

The original MemGPT research paper (2023) argued that LLM context windows are like RAM --- fast but limited --- and that long-term agent memory should be structured like a complete OS memory hierarchy, with explicit paging between tiers and explicit functions the agent can call to manage its own memory. Letta is the productization of that research: an agent runtime where memory management is a first-class agent capability rather than a hidden retrieval step.

How It Works

An agent runs with three memory tiers visible to it. The main context is the LLM context window, holding the system prompt, the conversation, and a small persistent memory block the agent can rewrite at will. The recall memory holds the full conversation history, queryable by the agent through search functions. The archival memory holds arbitrary stored content, also queryable. The agent has explicit tool calls --- core_memory_replace, archival_memory_insert, archival_memory_search --- that it uses to manage its own memory.

The architectural distinctiveness is that memory management is exposed to the agent as a tool rather than hidden behind a black-box retrieval layer. The agent decides what to remember, what to forget, what to look up. The trade-off is that agent prompts are more complex (the agent has to learn the memory tools) but the resulting memory behavior is more interpretable and controllable.
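To make the tiering concrete, here is a toy model of the three tiers with the memory functions named above exposed as plain methods. It is a sketch of the idea, not Letta's implementation: the `ToyTieredMemory` class, the single "human" block, the character limit, and the keyword-based archival search are all assumptions made for illustration.

```python
class ToyTieredMemory:
    def __init__(self, core_limit_chars=200):
        self.core = {"human": "Name: unknown"}  # always inside the context window
        self.recall = []                         # full message history
        self.archival = []                       # arbitrary stored passages
        self.core_limit_chars = core_limit_chars

    def core_memory_replace(self, block, old, new):
        if old not in self.core[block]:
            raise ValueError(f"{old!r} not found in core block {block!r}")
        updated = self.core[block].replace(old, new, 1)
        if len(updated) > self.core_limit_chars:
            raise ValueError("core block would exceed its size limit")
        self.core[block] = updated

    def archival_memory_insert(self, content):
        self.archival.append(content)

    def archival_memory_search(self, query, limit=3):
        # Keyword match stands in for embedding search.
        terms = query.lower().split()
        hits = [p for p in self.archival if any(t in p.lower() for t in terms)]
        return hits[:limit]

    def build_context(self, system_prompt, recent=2):
        # Only the core blocks and the newest messages occupy the window.
        core = "\n".join(f"[{name}] {text}" for name, text in self.core.items())
        tail = "\n".join(self.recall[-recent:])
        return f"{system_prompt}\n{core}\n{tail}"


mem = ToyTieredMemory()
mem.recall.extend(["user: I'm Alice", "assistant: Hi Alice", "user: No peanuts please"])
mem.core_memory_replace("human", "unknown", "Alice")
mem.archival_memory_insert("Alice is allergic to peanuts")
print(mem.build_context("You are a helpful agent."))
print(mem.archival_memory_search("peanuts"))
```

In the real system the agent issues these calls itself as tool invocations; here they are called directly so the effect on each tier is visible.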

Letta the company (founded by the MemGPT authors) productizes the research as a hosted platform with agent persistence, multi-user support, and observability. The self-hosted open-source version (the letta-ai/letta repo) provides the same architecture without the hosted infrastructure.

When to Use It

Agents where memory management is part of the agent’s reasoning, not infrastructure underneath. Research applications where the OS-memory metaphor is a productive frame. Cases where you want the agent to make explicit decisions about what to remember and what to forget.

Alternatives --- Mem0 for the simpler add()/search() interface where memory management is opaque to the agent. Zep for conversation-summarization-heavy workloads. The choice is essentially architectural philosophy: should the agent be aware of its memory or not?

Sources

  • github.com/letta-ai/letta

  • research.memgpt.ai (the original 2023 paper)

Zep

Source: github.com/getzep/zep (Apache-2; Go; Python/TypeScript clients)

Classification Long-term memory store with automatic conversation summarization.

Intent

Serve as a fast, scalable long-term memory store for AI assistant apps, with automatic summarization of past conversations to keep context windows clean and a knowledge-graph backend (Graphiti) for relationship-aware retrieval.

Motivating Problem

For conversation-heavy AI assistants --- customer support bots, productivity copilots, AI tutors --- the conversation thread itself is the primary memory substrate, and the dominant operational concern is what to summarize and what to keep verbatim. Zep’s positioning is specifically this workload: a memory layer optimized for conversation summarization, with the summaries themselves becoming the long-term memory and the recent history retained as-is.

How It Works

An agent writes user messages and AI responses to Zep via the SDK. Zep stores them in a session-scoped log. A background process summarizes older messages into compressed summaries, freeing space in the retrievable history without losing the gist. The retrieval path returns a mix of recent verbatim messages and older summaries, sized to fit the application’s context budget.
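The summarize-older, keep-recent behavior can be sketched as a small session class. This is a simplified model rather than Zep's SDK: `ToySession` and `summarize` are hypothetical, the summarizer is a placeholder where a real system would call an LLM, and the compaction runs inline instead of in a background process.

```python
def summarize(previous_summary, messages):
    """Placeholder for an LLM call: joins truncated message contents."""
    gist = "; ".join(content[:40] for _, content in messages)
    return f"{previous_summary}; {gist}" if previous_summary else gist


class ToySession:
    def __init__(self, keep_recent=2):
        self.keep_recent = keep_recent
        self.summary = ""
        self.recent = []  # (role, content) tuples kept verbatim

    def add_message(self, role, content):
        self.recent.append((role, content))
        overflow = len(self.recent) - self.keep_recent
        if overflow > 0:
            # Fold the oldest messages into the running summary.
            old, self.recent = self.recent[:overflow], self.recent[overflow:]
            self.summary = summarize(self.summary, old)

    def get_context(self):
        lines = [f"{role}: {content}" for role, content in self.recent]
        if self.summary:
            lines.insert(0, f"summary: {self.summary}")
        return lines


session = ToySession(keep_recent=2)
session.add_message("user", "Hi, I'm Alice")
session.add_message("assistant", "Hello Alice")
session.add_message("user", "I need help with billing")
session.add_message("assistant", "Sure, what's the issue?")
context = session.get_context()
print(context)
```

The context returned to the application stays bounded: one summary line plus the last keep_recent messages, whatever the session length.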

Zep 2.0 introduced Graphiti, a temporal knowledge graph that captures relationships between entities mentioned in conversations. Facts about users, organizations, projects, and their interactions get extracted automatically and modeled as a graph; retrieval can traverse the graph to answer questions that pure vector search would miss. The combination of automatic summarization plus a knowledge graph is Zep’s architectural distinctiveness.
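What "temporal" adds to a knowledge graph is that facts carry validity intervals, so a superseded fact is closed rather than overwritten. The sketch below illustrates that idea only; `Fact`, `ToyTemporalGraph`, the integer timestamps, and the one-value-per-relation rule are assumptions and do not describe Graphiti's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    valid_from: int
    valid_to: Optional[int] = None  # None means "still true"


class ToyTemporalGraph:
    def __init__(self):
        self.facts = []

    def assert_fact(self, subject, relation, obj, at):
        # Assumes single-valued relations: a new value closes the open one.
        for f in self.facts:
            if f.subject == subject and f.relation == relation and f.valid_to is None:
                f.valid_to = at
        self.facts.append(Fact(subject, relation, obj, at))

    def query(self, subject, relation, at):
        return [
            f.obj for f in self.facts
            if f.subject == subject and f.relation == relation
            and f.valid_from <= at and (f.valid_to is None or at < f.valid_to)
        ]


g = ToyTemporalGraph()
g.assert_fact("alice", "works_at", "Acme", at=100)
g.assert_fact("alice", "works_at", "Globex", at=200)
print(g.query("alice", "works_at", at=150))
print(g.query("alice", "works_at", at=250))
```

Both answers remain retrievable, which is what lets an agent reason about what was true at the time of an earlier conversation.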

When to Use It

Conversation-heavy AI assistants where session continuity over days or weeks is the primary memory need. Workloads where automatic summarization-vs-verbatim trade-offs benefit from product-level handling. Cases where temporal knowledge graphs add value (entity resolution, relationship queries).

Alternatives --- Mem0 for the broader fact-extraction-plus-graph-memory positioning. Letta for the OS-paging architecture. Roll-your-own when the conversation patterns require custom summarization policies.

Sources

  • github.com/getzep/zep

  • help.getzep.com

Section D — Shared state engines for multi-agent

How AutoGen and CrewAI handle shared state across agent teams

When a single agent has its own memory layer, the choices are settled by Sections B and C. When multiple agents need to coordinate, the question becomes shared state --- the topic of Chapter 4 in Part 1. Both AutoGen and CrewAI ship with built-in shared-state mechanisms that handle the multi-agent case without requiring the team to build it from scratch. The mechanisms reflect the topologies in Chapter 4: AutoGen leans on the group-chat pattern (with database backends for persistence), CrewAI leans on the blackboard pattern (with explicit short-term and long-term memory abstractions).

AutoGen shared group-chat state

Source: github.com/microsoft/autogen (Microsoft; MIT)

Classification Group-chat-style multi-agent state with database backends.

Intent

Coordinate teams of agents through shared conversation transcripts, with structured database backends for persistence and replay.

Motivating Problem

When the natural unit of multi-agent collaboration is a conversation --- a planner argues with an executor while a critic reviews --- AutoGen’s GroupChat is the framework expression of that pattern. The conversation transcript IS the shared state; every agent sees every message; coordination is emergent from the conversation. The framework handles the message routing, the turn-taking, and the persistence.

How It Works

A GroupChatManager coordinates a list of AssistantAgents. On each turn, the manager picks the next speaker (round-robin, random, or LLM-driven), passes the conversation history to that agent, gets a response, and appends it to the transcript. Termination conditions (max turns, target output reached, manual stop) end the chat. The full transcript persists; AutoGen v0.4 added explicit database backends for the transcript and agent state, so a group chat can pause, restart, and resume.
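The manager loop described above reduces to a few lines once the LLM calls are stubbed out. This is a framework-free sketch of round-robin speaker selection with two termination conditions; `run_group_chat` and the three stub agents are hypothetical and are not AutoGen classes.

```python
from itertools import cycle


def run_group_chat(agents, task, max_turns=6, stop_word="APPROVED"):
    """agents: list of (name, respond) pairs; respond(transcript) -> str."""
    transcript = [("user", task)]  # the transcript is the shared state
    for (name, respond), _ in zip(cycle(agents), range(max_turns)):
        reply = respond(transcript)  # every speaker sees the full history
        transcript.append((name, reply))
        if stop_word in reply:       # termination: target output reached
            break
    return transcript


# Stub agents standing in for LLM-backed ones.
def planner(transcript):
    return "Plan: draft the report"


def executor(transcript):
    return "Done: report drafted"


def critic(transcript):
    done = any(text.startswith("Done") for _, text in transcript)
    return "APPROVED" if done else "Needs work"


transcript = run_group_chat(
    [("planner", planner), ("executor", executor), ("critic", critic)],
    task="Write the quarterly report",
)
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

Swapping the round-robin cycle for a function that picks the next speaker from the transcript gives the LLM-driven selection variant.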

The pattern’s strength is its expressiveness: complex collaborations emerge from simple agent role definitions and the conversation between them. The pattern’s weakness is the token-cost scaling problem covered in Chapter 4: each agent’s context has to hold the entire conversation, so total token consumption is roughly conversation_length × agent_count.
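The scaling estimate can be checked with simple arithmetic. The first function below implements the estimate in the text (every message is ingested once by every agent). The second models a stateless API where each speaker is sent the whole transcript again on its turn, which grows faster. Both functions and the token counts are illustrative assumptions.

```python
def tokens_each_message_read_by_all(message_tokens, agent_count):
    # conversation_length x agent_count, as in the estimate above.
    return sum(message_tokens) * agent_count


def tokens_full_history_resent(message_tokens):
    # Each new message is produced after re-reading everything before it.
    total, history = 0, 0
    for tokens in message_tokens:
        total += history
        history += tokens
    return total


messages = [100] * 10  # ten messages of 100 tokens each
print(tokens_each_message_read_by_all(messages, agent_count=4))
print(tokens_full_history_resent(messages))
```

Under the first model, cost grows linearly with agent count; under the second, it grows roughly with the square of the number of turns, so long conversations become expensive even with few agents.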

When to Use It

Multi-agent problems where the collaboration is genuinely conversational and the agent count is small (two to five). Tasks where the conversation transcript is the desired output (negotiations, debates, code reviews). Workloads where AutoGen is already in the stack.

Alternatives --- CrewAI shared task memory for larger agent teams or blackboard-shaped tasks. External-DB shared state (Section E) when the team is large or the state needs to survive beyond one task.

Sources

  • github.com/microsoft/autogen

  • microsoft.github.io/autogen

CrewAI memory abstractions

Source: github.com/crewAIInc/crewai (Python, MIT)

Classification Blackboard-style shared state with short-term/long-term/entity memory.

Intent

Provide built-in, opinionated memory abstractions for multi-agent crews: short-term task memory (current task state), long-term memory (persistent across tasks), entity memory (facts about people, organizations, things).

Motivating Problem

For multi-agent crews working on shared tasks, the AutoGen group-chat pattern’s token-cost scaling becomes prohibitive past three or four agents. CrewAI’s answer is to factor the shared state into three explicit memory types that agents access by API rather than by reading the conversation: short-term memory (the live task workspace), long-term memory (knowledge that persists across tasks), and entity memory (facts about specific named entities).

How It Works

Each Crew (a collection of Agents working on a list of Tasks) can turn memory on with a single flag (memory=True on the Crew); it is off by default. The short-term memory is a per-task scratchpad accessible to all agents in the crew --- a blackboard for the current task, backed by a Chroma vector store by default. Long-term memory is a persistent store holding what was learned across tasks (SQLite by default). Entity memory tracks named entities (people, organizations, products) and their attributes.

Agents access memory through the framework rather than through prompt context: the framework injects relevant memory into the agent’s prompt before each LLM call. This keeps individual agent contexts small while preserving access to shared state. The trade-off is that the framework’s injection logic has to choose what’s relevant --- done well, it works; done poorly, agents end up missing context they should have had.
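The injection step can be sketched as: score each stored memory against the task, keep the relevant ones, and prepend them to the agent's prompt. This is an illustration of the mechanism, not CrewAI's code; `select_relevant`, `build_prompt`, and the word-overlap scoring are hypothetical stand-ins for the framework's retrieval.

```python
def select_relevant(memories, task_description, limit=2):
    """Keep memories sharing at least one word with the task description."""
    task_terms = set(task_description.lower().split())
    scored = [(len(task_terms & set(m.lower().split())), m) for m in memories]
    ranked = sorted(scored, key=lambda s: -s[0])
    return [m for score, m in ranked if score > 0][:limit]


def build_prompt(role, task_description, short_term, long_term, entities):
    sections = {
        "Task notes": select_relevant(short_term, task_description),
        "Past learnings": select_relevant(long_term, task_description),
        "Known entities": select_relevant(entities, task_description),
    }
    lines = [f"You are the {role}.", f"Task: {task_description}"]
    for title, items in sections.items():
        if items:  # empty sections are left out to keep the context small
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


prompt = build_prompt(
    role="writer",
    task_description="Write the pricing section for Acme",
    short_term=["Researcher found Acme pricing starts at $49", "Weather in Paris is mild"],
    long_term=["Past pricing sections worked best as tables"],
    entities=["Acme is a B2B SaaS vendor"],
)
print(prompt)
```

The irrelevant note never reaches the prompt, which shows both the benefit (small contexts) and the risk named above: anything the selector scores as irrelevant is invisible to the agent.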

When to Use It

Multi-agent crews larger than the group-chat sweet spot (four or more agents). Tasks where the three-tier memory model fits naturally. Teams already using CrewAI for agent orchestration.

Alternatives --- AutoGen GroupChat for smaller, conversation-shaped collaborations. External-DB shared state (Section E) when memory needs to span beyond a single crew or survive crew restarts.

Sources

  • github.com/crewAIInc/crewai

  • docs.crewai.com/concepts/memory

Section E — Hybrid relational + vector

The Postgres-plus-pgvector pattern: unified state and memory in one store

For many agent applications, the most operationally sane answer to state and memory together is a single Postgres database with the pgvector extension. Relational tables hold the structured state (projects, tasks, agent logs, kanban statuses, audit trails); pgvector indexes hold the embedding-based memory; one connection string, one backup strategy, one query language. The trade-off relative to a dedicated vector store is performance at very large scales --- pgvector at tens of millions of vectors needs careful index tuning while Qdrant handles the same volumes more naturally --- but for the long tail of agent applications well under that threshold, the operational simplicity is decisive.

Two representative substrates: Supabase, which packages Postgres-plus-pgvector with auth, real-time subscriptions, and a managed SaaS deployment; and pgvector itself, the open-source Postgres extension that any Postgres deployment can adopt.

Supabase (Postgres + pgvector + auth + realtime)

Source: supabase.com (managed and self-hostable; Apache-2 components)

Classification Postgres-shaped backend-as-a-service with first-class pgvector support.

Intent

Provide Postgres plus pgvector plus auth plus real-time subscriptions as a coherent backend, suitable as the unified state-and-memory store for AI agent applications.

Motivating Problem

Most AI agent applications need a relational database for structured state (users, projects, tasks, audit logs), a vector store for memory, an auth system for users, and often a real-time push mechanism for UI updates. Building this stack from components is substantial work for what most teams correctly recognize as undifferentiated heavy lifting. Supabase packages all four as a single product, with Postgres at the center, and explicit AI/agent positioning around the pgvector integration.

How It Works

Supabase projects run a Postgres instance with pgvector enabled by default. The dashboard exposes the database, the auth users, the storage buckets, and the real-time channels through one UI. SDKs (JS, Python, Dart) provide typed access. Vector columns are first-class --- a table can have a normal id column, a normal text column, a vector(1536) column, and a normal jsonb metadata column, all in the same row, queryable through the same client.

For agent state plus memory, the natural pattern is: a sessions table for live agent state (status, current task, in-flight work), a memories table for long-term memory (id, user_id, content, embedding vector(1536), metadata jsonb, created_at, importance), and indexes on user_id and on the vector column. Queries combine SQL filtering (WHERE user_id = $2 AND created_at > now() - interval '30 days') with vector similarity (ORDER BY embedding <=> $1 LIMIT 10, using the cosine-distance operator so that a cosine index on the column can serve the query), all in one query.

Operational: Supabase is hosted (with a generous free tier) and also self-hostable via docker compose. Row-Level Security policies in Postgres handle per-user isolation, which matters acutely for agent memory --- one user’s memories should not surface in another’s retrieval. RLS lets the team enforce this at the database layer rather than at the application layer.

When to Use It

AI agent applications where the relational layer is real (projects, users, tasks, audit trails) and adding a memory layer should not mean adding a second database. Multi-tenant applications where Postgres RLS for per-tenant isolation is a feature. Teams that prefer SaaS over self-hosting for operational efficiency.

Alternatives --- raw Postgres plus pgvector (next entry) when the auth and real-time features aren’t needed. Dedicated vector stores (Section B) when vector volume exceeds what pgvector handles comfortably. Firebase or similar when the relational angle is light and the document model fits better.

Sources

  • supabase.com

  • github.com/supabase/supabase

Example artifacts

Schema / config.

-- Schema for unified state + memory

create extension if not exists vector;

create table sessions (

id uuid primary key default gen_random_uuid(),

user_id uuid references auth.users,

status text not null,

current_task text,

created_at timestamptz default now(),

updated_at timestamptz default now()

);

create table memories (

id uuid primary key default gen_random_uuid(),

user_id uuid references auth.users,

content text not null,

embedding vector(1536),

metadata jsonb,

importance real default 0.5,

created_at timestamptz default now()

);

create index memories_user_idx on memories(user_id);

create index memories_embedding_idx on memories

using ivfflat (embedding vector_cosine_ops) with (lists = 100);

-- Row-Level Security: each user sees only their own memories

alter table memories enable row level security;

create policy "users see their own memories" on memories

for all using (auth.uid() = user_id);

Code.

-- Combined SQL + vector query (Supabase / Postgres)

select id, content, metadata, embedding <=> $1 as distance

from memories

where user_id = $2

and created_at > now() - interval '30 days'

and importance > 0.4

-- <=> is cosine distance, matching the vector_cosine_ops index above

order by embedding <=> $1

limit 10;

pgvector (the Postgres extension itself)

Source: github.com/pgvector/pgvector (PostgreSQL License; C)

Classification Open-source Postgres extension for vector similarity search.

Intent

Add vector data types and similarity search operators to Postgres, so any Postgres deployment can serve as both a relational store and a vector memory store.

Motivating Problem

Many teams have Postgres in their stack and don’t want to add a separate vector database alongside it. Until pgvector matured, the choice was painful: keep adding to Postgres and use external services for embedding search, or replicate state into a vector-aware system and live with the synchronization burden. pgvector eliminates the choice: install one extension, get vector columns, vector indexes, and similarity-search operators in the same Postgres.

How It Works

CREATE EXTENSION vector; gives you a vector type with configurable dimensions. Tables hold vector columns alongside normal columns. Two index types are available: IVFFlat (faster build and lower memory use, but a weaker speed-recall trade-off at query time; good up to a few million rows) and HNSW (slower build and more memory, but faster and more accurate queries; the production default for serious workloads). The three core distance operators are <-> (Euclidean distance), <#> (negative inner product), and <=> (cosine distance); an index only accelerates the operator its operator class was built for (vector_l2_ops, vector_ip_ops, vector_cosine_ops).
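For readers who want to see what each operator computes, the three distances are reproduced below in plain Python. The function names are illustrative; the formulas are the standard definitions, with the inner product negated because Postgres index scans order results ascending.

```python
import math


def l2_distance(a, b):
    """What `a <-> b` returns: Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a, b):
    """What `a <#> b` returns: the inner product, negated."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """What `a <=> b` returns: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a, b = [3.0, 4.0], [4.0, 3.0]
print(l2_distance(a, b))
print(negative_inner_product(a, b))
print(cosine_distance(a, b))
```

For unit-length embeddings the three orderings agree, so the choice mostly matters when vectors are not normalized.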

The performance ceiling is real and worth understanding. pgvector with HNSW can serve tens of millions of vectors at sub-100ms query latency on appropriate hardware, given attention to index parameters and memory; at hundreds of millions of vectors it needs tuning effort that dedicated vector stores largely absorb on the operator's behalf. The agent memory workload --- typically a few million memories for even a very active deployment --- sits well within that range.

When to Use It

Any Postgres deployment that needs vector search. The default for AI agent applications until vector volume forces a dedicated vector store. Particularly compelling when the relational state and the vector memory share the same per-user isolation requirements and Postgres RLS becomes the enforcement layer.

Alternatives --- Supabase (above) for the managed-Postgres path. Dedicated vector stores (Section B) when vector volume exceeds what pgvector handles. The choice is operational: one database vs two.

Sources

  • github.com/pgvector/pgvector

Section F — Embedding models and evaluation

What turns text into vectors, and how to know if a choice was right

Every memory layer in Sections B through E depends on an embedding model: the function that turns a piece of text into a high-dimensional vector. The quality of retrieval is bounded above by the quality of the embedding model. Two product families dominate: closed proprietary models (OpenAI’s text-embedding-3 family is the incumbent leader; Voyage AI’s closed models compete strongly), and open-source models (the BGE family from BAAI is the strongest open option for most use cases). The MTEB benchmark (Massive Text Embedding Benchmark) is the canonical evaluation; the leaderboard at huggingface.co/spaces/mteb/leaderboard ranks models across a large suite of tasks, of which the retrieval tasks are the most relevant to memory workloads.

OpenAI text-embedding-3 and Voyage AI

Source: platform.openai.com/docs/models/embeddings ; voyageai.com

Classification Closed-source proprietary embedding model families.

Intent

Provide top-tier embedding quality through hosted APIs, with the operational advantages of a fully-managed service.

Motivating Problem

For teams that want the highest embedding quality and don’t mind paying per call, the closed proprietary models lead the MTEB benchmark consistently. OpenAI’s text-embedding-3-large is the most-used embedding model in production agent applications as of 2026. Voyage AI’s models (voyage-3, voyage-3-lite, voyage-multilingual-2) compete at or near the top and target specific use cases (code, multilingual, legal) with specialized variants.

How It Works

Both are pure HTTP APIs: send a list of strings, get back a list of vectors, pay per token of input. OpenAI’s text-embedding-3-small produces 1536-dim vectors at low cost and quality competitive with the older text-embedding-ada-002; text-embedding-3-large produces 3072-dim vectors with the strongest quality and higher cost. Both text-embedding-3 models support shortened outputs via the dimensions parameter for cases where the full vector length is overkill.
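Shortening an embedding amounts to keeping the leading components and rescaling to unit length. The sketch below shows that arithmetic on the client side; `shorten_embedding` is a hypothetical helper, and treating it as equivalent to what the API's dimensions parameter does server-side is an assumption to verify against the provider's documentation.

```python
import math


def shorten_embedding(vector, dimensions):
    """Keep the first `dimensions` components, then L2-normalize."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


full = [3.0, 4.0, 12.0]           # toy 3-dim "embedding"
short = shorten_embedding(full, 2)
print(short)
```

Renormalizing matters because cosine and inner-product comparisons assume the stored vectors share a common scale. Shortening only makes sense for models trained to front-load information into the leading dimensions.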

Voyage’s positioning is specialized: voyage-code-3 for code-related retrieval, voyage-multilingual-2 for non-English, voyage-law-2 for legal text, voyage-finance-2 for financial documents. The specialized models meaningfully outperform general models on their target domains; for agent memory in a specific vertical, a Voyage specialized model is often the right default.

When to Use It

Production agent memory where the marginal cost per embedding is acceptable and the quality lift is worth it. Specialized domains where Voyage’s vertical models add real value. Any case where MTEB-leaderboard-leading quality is required.

Alternatives --- BGE and other open-source models (next entry) when cost or data-residency considerations matter. The MTEB benchmark is the right place to compare options for a specific workload.

Sources

  • platform.openai.com/docs/models/embeddings

  • voyageai.com

BGE and the open-source embedding ecosystem

Source: github.com/FlagOpen/FlagEmbedding (BAAI; MIT)

Classification Open-source embedding model family.

Intent

Provide top-tier-quality embedding models that run on local hardware, with MTEB scores competitive with the closed proprietary leaders.

Motivating Problem

For agent memory deployments that need data residency (memory contents never leave the deployer’s infrastructure), cost predictability (no per-call API fees), or the operational option to run embedding inference on the same hardware as the rest of the application, open-source embedding models are the answer. The BGE family from BAAI (Beijing Academy of Artificial Intelligence) has held the strongest position on the MTEB leaderboard among open models for most of 2024–2026.

How It Works

BGE models come in size tiers: bge-small, bge-base, bge-large, plus the m3 multilingual variant. The recommended production default for English-only memory is bge-large-en-v1.5 (1024-dim vectors, MTEB scores within a few points of text-embedding-3-large). Multilingual workloads use bge-m3 (1024-dim, supports 100+ languages, plus a dense + sparse + multi-vector hybrid output).

Operationally: download the model from Hugging Face and serve it via sentence-transformers, text-embeddings-inference, or vLLM for high-throughput cases. Latency can be meaningfully lower than calling a remote API because there is no network round-trip, provided the local hardware keeps up with the request rate; the cost is the hardware running the inference. For batch embedding workloads (initial corpus indexing, periodic re-embedding), the open-source path is typically much cheaper.
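The cost argument reduces to simple arithmetic, sketched below. All three functions and their parameter names are hypothetical, and any prices or throughput figures fed into them are the deployer’s own measurements, not numbers from this catalog. The model also ignores idle hardware time and operations effort, both of which push the real break-even point higher.

```python
def api_cost(total_tokens, usd_per_million_tokens):
    """Cost of embedding a corpus through a pay-per-token API."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


def self_hosted_cost(total_tokens, tokens_per_second, usd_per_gpu_hour):
    """Cost of embedding the same corpus on rented or amortized hardware."""
    hours = total_tokens / tokens_per_second / 3600
    return hours * usd_per_gpu_hour


def breakeven_throughput(usd_per_million_tokens, usd_per_gpu_hour):
    """Tokens per second at which self-hosting costs the same as the API."""
    usd_per_token = usd_per_million_tokens / 1_000_000
    usd_per_second = usd_per_gpu_hour / 3600
    return usd_per_second / usd_per_token
```

If the measured throughput of the self-hosted model on the target hardware sits well above the break-even figure, and the hardware stays busy, the batch workload favors the open-source path; if the hardware mostly idles, the per-call API may still win.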

When to Use It

Data-residency requirements that prohibit sending memory contents to a third-party API. Cost-sensitive deployments at scale where per-call API fees become painful. Workloads where embedding-call latency matters and a network round-trip is unacceptable.

Alternatives --- OpenAI text-embedding-3 family for top quality with no operational burden. The MTEB leaderboard for the current best-of-class in any specific size or specialization tier.

Sources

  • github.com/FlagOpen/FlagEmbedding

  • huggingface.co/spaces/mteb/leaderboard

Section G — Knowledge graphs for memory

When relationships between facts are first-class --- GraphRAG and Neo4j patterns

Vector search is excellent at finding similar text and weak at traversing relationships. “What does Alice prefer” is answerable by vector retrieval against Alice’s memories; “what does Alice’s sister prefer” requires knowing that Alice has a sister, who that sister is, and then querying her memories. For agent memory that involves rich entity relationships --- organizational hierarchies, family trees, project dependencies, supply chains --- a knowledge graph as the memory substrate (or alongside it) is the right architecture. Two representative substrates: Microsoft’s GraphRAG (a research-grounded approach to building knowledge graphs from unstructured text) and Neo4j with LLM-augmented patterns (the production graph database, now widely used as a memory backend).

GraphRAG (Microsoft Research)

Source: github.com/microsoft/graphrag (MIT; Python)

Classification Research-grounded knowledge-graph construction for RAG.

Intent

Build a knowledge graph from unstructured text by extracting entities and relationships via LLM, then use the graph structure to support retrieval that pure vector search would miss.

Motivating Problem

Pure vector RAG retrieves text chunks by similarity; it cannot answer multi-hop questions (“who works for the company that competes with Alice’s employer”) or holistic questions (“what are the main themes in this corpus”). GraphRAG’s research-grounded approach is to extract entities and relationships from the corpus first --- producing a knowledge graph --- and then use that graph to support retrieval. Multi-hop queries traverse the graph; holistic queries cluster the graph and summarize the clusters.

How It Works

The indexing pipeline takes a text corpus, chunks it, uses an LLM to extract entities and relationships from each chunk, deduplicates entities across chunks, and produces a graph. The graph is clustered into communities; each community gets an LLM-generated summary. The retrieval-time path combines vector retrieval against the chunks with graph-aware queries against the entity-and-relationship structure.
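The shape of that indexing pipeline can be sketched in a few functions. This is a toy under stated assumptions, not GraphRAG’s implementation: extract_triples is a stand-in for the LLM extraction call and simply parses pipe-delimited lines, deduplication is a naive case-insensitive match, and connected components substitute for real community detection (GraphRAG’s documentation describes a hierarchical Leiden step). The per-community LLM summaries are omitted.

```python
from collections import defaultdict


def extract_triples(chunk):
    """Stand-in for the LLM extraction call (hypothetical).

    Parses lines shaped like 'subject|relation|object'.
    """
    triples = []
    for line in chunk.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples


def canonical(name):
    """Naive entity deduplication key: case- and whitespace-insensitive."""
    return " ".join(name.lower().split())


def build_graph(chunks):
    """Return an undirected adjacency map: entity -> {(relation, neighbor)}."""
    graph = defaultdict(set)
    for chunk in chunks:
        for subj, rel, obj in extract_triples(chunk):
            s, o = canonical(subj), canonical(obj)
            graph[s].add((rel, o))
            graph[o].add((rel, s))
    return graph


def communities(graph):
    """Connected components as a crude substitute for community detection."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(n for _, n in graph[node] if n not in group)
        seen |= group
        groups.append(sorted(group))
    return groups
```

In a real pipeline each group returned by communities would be handed to an LLM for summarization, and those summaries are what holistic queries read.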

For agent memory specifically, the same pattern applies: episodic and semantic memories get entity-extracted, the entities accumulate into a per-user graph, and retrieval can then ask “what events involved Alice’s manager in the last quarter” and traverse the manager-of relationship to surface relevant memories that pure vector search would miss.
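A minimal sketch of that traversal, using in-memory dictionaries in place of a real graph store. The EDGES and MEMORIES data, the entity names, and both function names are hypothetical and exist only to show the two steps: resolve the relationship first, then filter memories by the resolved entity and a date cutoff.

```python
from datetime import date

# Hypothetical per-user graph: (subject, relation) -> object.
EDGES = {("alice", "manager"): "dana", ("alice", "sister"): "beth"}

# Hypothetical episodic memories tagged with their extracted entities.
MEMORIES = [
    {"text": "Dana approved the Q3 budget", "entities": {"dana"}, "when": date(2026, 3, 14)},
    {"text": "Alice asked about parental leave", "entities": {"alice"}, "when": date(2026, 3, 20)},
    {"text": "Dana joined the company", "entities": {"dana"}, "when": date(2024, 6, 1)},
]


def resolve(start, path, edges=EDGES):
    """Follow a chain of relations from a start entity; None if a hop is missing."""
    node = start
    for relation in path:
        node = edges.get((node, relation))
        if node is None:
            return None
    return node


def memories_about(start, path, since, memories=MEMORIES):
    """Return memory texts that mention the resolved entity on or after `since`."""
    target = resolve(start, path)
    if target is None:
        return []
    return [m["text"] for m in memories if target in m["entities"] and m["when"] >= since]
```

A pure vector query for “Alice’s manager” has no way to know that the relevant memories mention Dana rather than Alice; the graph hop supplies that link before any text matching happens.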

When to Use It

Agent memory or RAG over corpora where entity relationships are first-class and queries are often multi-hop or holistic. Domains where the structured knowledge is genuinely there (legal contracts, organizational data, scientific literature) and worth extracting. Teams comfortable with the cost of LLM-driven graph construction (indexing GraphRAG over a meaningful corpus consumes significant LLM tokens).

Alternatives --- Mem0’s Graph Memory feature for a lighter-weight integrated approach. Neo4j with hand-curated schemas when the entities are known in advance. Pure vector RAG when the relationship-traversal queries don’t happen.

Sources

  • github.com/microsoft/graphrag

  • microsoft.github.io/graphrag/

Neo4j with LLM-augmented patterns

Source: neo4j.com (commercial graph database; community and enterprise editions)

Classification Production graph database used as agent-memory backend.

Intent

Use the leading graph database as the storage substrate for agent memory, with LLM-driven extraction filling the graph from observations and Cypher queries serving retrieval.

Motivating Problem

When the agent memory structure is genuinely a graph --- nodes are entities, edges are typed relationships, properties carry metadata --- a purpose-built graph database is a better substrate than fitting graph-shaped data into a vector store. Neo4j is the production-grade choice; the LLM-augmented patterns are the techniques for getting facts into and out of the graph using LLMs as the extraction and query-generation layer.

How It Works

Memories arrive as text; an LLM extracts entities and relationships in a structured format (typically a Cypher CREATE statement or a JSON representation that gets translated to Cypher). The graph accumulates over time. Retrieval can be pure Cypher (the LLM translates user queries to Cypher), pure vector (Neo4j has supported vector indexes since version 5.11), or hybrid (Cypher to filter the candidate set, vector similarity to rank within the candidates).
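The JSON-to-Cypher translation step might look like the sketch below. Several choices here are assumptions rather than anything the Neo4j tooling prescribes: the Entity label and name property are hypothetical, MERGE is used instead of CREATE so that a repeated extraction does not duplicate nodes, and the relationship type is validated against a strict pattern because this sketch assumes a Cypher version in which relationship types cannot be passed as query parameters.

```python
import re

# Relationship types are interpolated into the query text, so restrict them
# to a conservative identifier pattern before building the statement.
_REL_TYPE = re.compile(r"^[A-Z][A-Z0-9_]*$")


def triple_to_cypher(subject, relation, obj):
    """Return (query, params) for one extracted (subject, relation, object) triple."""
    if not _REL_TYPE.match(relation):
        raise ValueError(f"unsafe relationship type: {relation!r}")
    query = (
        "MERGE (a:Entity {name: $subject}) "
        "MERGE (b:Entity {name: $object}) "
        f"MERGE (a)-[:{relation}]->(b)"
    )
    # Entity names travel as parameters, never as query text.
    return query, {"subject": subject, "object": obj}
```

The returned pair can be handed to whatever driver session the deployment uses. Keeping LLM-extracted names in parameters matters because extraction output is untrusted text; only the validated relationship type ever reaches the query string.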

The patterns are well established in the Neo4j community as “Knowledge Graph + LLM,” with the company itself maintaining significant tooling: the Neo4j GraphRAG package, integrations with LangChain and LlamaIndex, and the Neo4j-Vector-Cypher retriever pattern. For agent memory at production scale with genuinely graph-shaped data, this is one of the better-trodden paths.

When to Use It

Agent memory with rich, well-defined entity-and-relationship structure. Enterprise deployments where Neo4j is already in use for other purposes. Workloads where the graph is part of the answer, not an implementation detail (knowledge management, compliance, supply chain).

Alternatives --- GraphRAG for the research-grounded text-to-graph extraction pattern. Mem0 Graph Memory for a simpler integrated approach. pgvector or dedicated vector stores when the graph is not the primary structure.

Sources

  • neo4j.com/docs/genai-ecosystem/

  • github.com/neo4j/neo4j-graphrag-python

Section H — Curation hubs

Discovery and evaluation resources for the agent-memory ecosystem

The agent memory category moves fast enough that any printed catalog is partially obsolete on arrival. Three resources help: the MTEB leaderboard for current embedding model rankings; Mem0’s memory-benchmarks repo for memory-layer evaluation; the awesome-X lists for discovery. None is a substitute for evaluation against the deployment’s specific workload; all are useful starting points.

MTEB benchmark and leaderboard

Source: huggingface.co/spaces/mteb/leaderboard ; github.com/embeddings-benchmark/mteb

Classification Standardized benchmark for embedding models.

Intent

Rank embedding models on a comprehensive set of retrieval, classification, clustering, and pair classification tasks, providing a single comparable score that approximates real-world embedding quality.

Motivating Problem

Choosing an embedding model is harder than it looks. Self-reported benchmarks favor the reporter; informal comparisons depend on the specific test cases. The Massive Text Embedding Benchmark (MTEB) is the field’s answer: a standardized suite of 56+ tasks across 8 task categories and 112+ languages, with an open leaderboard that lets anyone submit results.

How It Works

Run the mteb Python package against an embedding model; the harness evaluates the model on the standard task suite, and the results can then be submitted to the public leaderboard. The leaderboard ranks by overall score (average across categories) and by category (retrieval, reranking, classification, clustering, pair classification, STS, summarization, bitext mining). Filters by model size and language let teams compare options at their target operating point.

For agent memory, the relevant categories are retrieval (the largest weight) and reranking (relevant when a cross-encoder reranker is in the pipeline). The leaderboard’s top entries shift every few months; checking before committing to a model is worth the few minutes.
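Why the category view matters can be shown with a small ranking sketch. The scores below are placeholders for two hypothetical models, not real leaderboard numbers, and the 0.7/0.3 weighting of retrieval over reranking is an illustrative assumption rather than a recommended setting.

```python
from statistics import mean

# Placeholder category scores for hypothetical models; not leaderboard data.
SCORES = {
    "model-a": {"retrieval": 55.0, "reranking": 58.0, "classification": 80.0, "clustering": 50.0},
    "model-b": {"retrieval": 60.0, "reranking": 59.0, "classification": 70.0, "clustering": 45.0},
}

# Assumed weighting for an agent-memory workload: retrieval dominates.
MEMORY_WEIGHTS = {"retrieval": 0.7, "reranking": 0.3}


def overall(scores):
    """Unweighted average across categories, as a headline score would be."""
    return mean(scores.values())


def weighted(scores, weights):
    """Weighted average over only the categories the workload cares about."""
    return sum(scores[c] * w for c, w in weights.items()) / sum(weights.values())


def rank(score_table, score_fn):
    """Model names ordered best-first under the given scoring function."""
    return sorted(score_table, key=lambda name: score_fn(score_table[name]), reverse=True)
```

With these placeholder numbers, rank(SCORES, overall) puts model-a first, while ranking by the memory-weighted score puts model-b first: a model can lead on the headline average through categories that an agent-memory pipeline never exercises.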

When to Use It

Any agent memory implementation that has to choose an embedding model. Periodic re-evaluation as new models release (every few months a meaningfully better option appears).

Alternatives --- evaluate models on the team’s actual data when the MTEB tasks don’t match the deployment’s domain (specialized verticals often invert the MTEB rankings).

Sources

  • huggingface.co/spaces/mteb/leaderboard

  • github.com/embeddings-benchmark/mteb

Mem0 memory-benchmarks and awesome lists

Source: github.com/mem0ai/memory-benchmarks ; awesome-rag, awesome-llm-memory

Classification Memory-layer evaluation suite and discovery lists.

Intent

Provide an open-source evaluation suite for agent memory systems plus community-maintained discovery indexes for the broader ecosystem.

Motivating Problem

Memory-layer evaluation has been spotty until recently. Mem0’s memory-benchmarks repo (opened in 2025) standardizes a small set of evaluation scenarios: cross-session recall, fact updates, deduplication accuracy, retrieval relevance. Running the same suite against Mem0, Letta, Zep, and roll-your-own implementations gives directly comparable numbers. The discovery side is covered by community awesome-X lists: awesome-rag, awesome-llm-memory, and similar curated GitHub lists track the broader ecosystem.

How It Works

memory-benchmarks: install the package, configure the memory implementation under test, run the eval harness, get a structured report. The benchmark itself is open-source and accepts pull requests adding new scenarios. The first round of results is published in Mem0’s 2026 State of AI Agent Memory report.
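The idea of running one scenario suite against several memory implementations can be illustrated without the real package. Everything in this sketch is hypothetical: the add/search interface, the two toy memory classes, and the scenario functions are stand-ins chosen to mirror three of the scenario names mentioned above, and none of it reflects the actual memory-benchmarks API.

```python
class LastWriteWinsMemory:
    """Toy memory under test: one value per (user, key); newer writes replace older."""

    def __init__(self):
        self._facts = {}

    def add(self, user_id, key, value):
        self._facts[(user_id, key)] = value

    def search(self, user_id, key):
        return [self._facts[(user_id, key)]] if (user_id, key) in self._facts else []


class AppendOnlyMemory:
    """Toy memory under test: keeps every write, never updates or deduplicates."""

    def __init__(self):
        self._log = []

    def add(self, user_id, key, value):
        self._log.append((user_id, key, value))

    def search(self, user_id, key):
        return [v for u, k, v in self._log if (u, k) == (user_id, key)]


def cross_session_recall(memory):
    memory.add("u1", "diet", "vegetarian")  # written in one session
    return memory.search("u1", "diet") == ["vegetarian"]  # read in a later one


def fact_update(memory):
    memory.add("u2", "employer", "Acme")
    memory.add("u2", "employer", "Globex")  # the user changed jobs
    return memory.search("u2", "employer") == ["Globex"]


def deduplication(memory):
    memory.add("u3", "city", "Lisbon")
    memory.add("u3", "city", "Lisbon")  # same fact observed twice
    return len(memory.search("u3", "city")) == 1


def run_suite(memory_factory):
    """Run every scenario against a fresh memory instance; return pass/fail per scenario."""
    scenarios = [cross_session_recall, fact_update, deduplication]
    return {s.__name__: s(memory_factory()) for s in scenarios}
```

Running run_suite(LastWriteWinsMemory) and run_suite(AppendOnlyMemory) yields two reports with identical keys, which is the property that makes results comparable across implementations. A real suite would score graded relevance rather than booleans.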

Awesome lists: standard GitHub README-as-catalog format, maintained by PR. Categories cover memory frameworks, vector databases, embedding models, RAG patterns, and evaluation suites. Best treated as discovery aids rather than endorsements.

When to Use It

Choosing a memory layer at production scale. Periodic re-evaluation as new memory products release. Initial discovery when surveying the ecosystem.

Alternatives --- build domain-specific evaluations against the team’s actual workload, which is always the most accurate measure but more expensive.

Sources

  • github.com/mem0ai/memory-benchmarks

  • mem0.ai/blog/state-of-ai-agent-memory-2026

Appendix A --- Memory Kind Reference Table

Cross-reference between the four kinds of memory (Chapter 1) and their representative substrates.

Memory kind | What it holds | Representative substrates
Working | Live conversation, in-context | LangGraph state, Redis chat history, framework persistence
Episodic | Specific past events | Mem0, Letta, Zep, vector stores with timestamp metadata
Semantic | Stable facts about user/world | Mem0 (with Graph Memory), Letta core memory, Neo4j+LLM
Procedural | Skills and how-to | Skills Catalog (Vol 2); not in this catalog’s scope

Appendix B --- The Six-Volume Series Complete

This catalog joins the five prior volumes to form a six-level vocabulary for agentic AI. The volumes are independent; reading order is flexible; the cross-references make whichever path the reader chooses coherent.

  • Volume 1 --- Patterns of AI Agent Workflows --- the timing of agent runs.

  • Volume 2 --- The Claude Skills Catalog --- model instructions in packaged form.

  • Volume 3 --- The AI Agent Tools Catalog --- the function-calling primitives.

  • Volume 4 --- The AI Agent Events & Triggers Catalog --- the activation layer.

  • Volume 5 --- The AI Agent Fabric Catalog --- the infrastructure substrate.

  • Volume 6 --- The AI Agent Memory Catalog (this volume) --- the state and context layer.

Six layers, six catalogs. Read from the top (Patterns) for the agent designer’s sequence: how runs compose, what model instructions are loaded, which tools the agent calls, what events trigger the agent, what fabric it runs on, and what state and memory it operates against. Read from the bottom (Memory) for the operator’s sequence: where the state lives, what fabric it runs on, what wakes it up, what tools it can use, what skills it ships with, what patterns the runs follow.

The fact that the layering admits both reading orders is the property to keep. A working architect makes choices at all six layers; the choices interact; the interactions are what makes a production agent system actually work. The catalog is meant to support the choices, not dictate them.

Appendix C --- The Open Problems

Two problems are open at the memory layer in mid-2026, and worth keeping in view when designing for the next eighteen months:

  1. Cross-session identity resolution. The memory model assumes a stable user_id. Anonymous sessions, multi-device users, and mixed authentication flows break the assumption. When two interactions arrive without confident user identification, are they the same person, and how confidently? Mem0’s 2026 report calls this out as an unsolved identity problem at the memory layer. The fix likely lives at the fabric layer (Volume 5’s identity story) but the memory layer has to live with the consequences.

  2. Memory staleness. A high-relevance memory about a user’s employer is accurate until they change jobs, at which point it becomes confidently wrong. Decay handles low-relevance memories; staleness in high-relevance memories is harder. The current mitigations --- freshness scoring, explicit revocation, periodic re-verification against the user --- all have failure modes. Expect significant research and product activity here over 2026–2027.
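Freshness scoring, one of the mitigations named in the second problem, can be sketched as follows. The exponential half-life form, the function names, and the threshold defaults are all assumptions made for illustration; the catalog does not prescribe a formula, and the failure mode noted above remains, since a fact can go stale the day after it was verified.

```python
from datetime import datetime, timedelta


def freshness(last_verified, now, half_life_days):
    """Score in (0, 1]: halves every `half_life_days` since last verification."""
    age_days = max((now - last_verified).total_seconds() / 86400, 0.0)
    return 0.5 ** (age_days / half_life_days)


def needs_reverification(relevance, last_verified, now, half_life_days,
                         *, min_relevance=0.8, min_freshness=0.5):
    """Flag high-relevance memories whose freshness has decayed below a floor.

    Low-relevance memories are left to ordinary decay; the risky case is a
    memory the agent still trusts heavily but has not confirmed recently.
    """
    stale = freshness(last_verified, now, half_life_days) < min_freshness
    return relevance >= min_relevance and stale
```

A per-fact half-life is the natural extension: an employer or home address might warrant a half-life of months, while a date of birth effectively never needs re-verification.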

Three additional questions are less open but worth deciding deliberately for any production deployment: (a) what counts as a memory-worthy observation, (b) how aggressively to compact, and (c) the per-user privacy and retention story. These are policy decisions, not architecture decisions, and the catalog can’t answer them --- but a serious deployment must.

Appendix D --- Discovery and Evaluation

Hubs for the agent-memory ecosystem:

  • Mem0’s State of AI Agent Memory 2026 --- the field’s best annual snapshot, with framework integration counts, vector-store coverage, and benchmark results.

  • MTEB leaderboard (huggingface.co/spaces/mteb/leaderboard) --- the canonical embedding model ranking.

  • memory-benchmarks (github.com/mem0ai/memory-benchmarks) --- open-source memory-layer evaluation suite.

  • Anthropic’s memory tool documentation (docs.anthropic.com) --- the native Anthropic memory-tool primitive, which lets the model store and retrieve information across conversations through a client-side memory directory.

  • LangChain Memory documentation (python.langchain.com/docs/modules/memory) --- catalog of memory abstractions in the LangChain ecosystem.

  • “awesome-rag” and “awesome-llm-memory” community lists for ecosystem discovery.

Two pragmatic recommendations. First, run your own evaluation on your own data: published benchmarks (MTEB included) systematically favor the workloads they’re designed around, and your deployment’s domain may invert the rankings. Second, treat memory as a layer that will evolve faster than the rest of your stack; the products in this catalog will look different in twelve months, and the architecture you build should accommodate that.

Appendix E --- Omissions

This catalog covers about 18 substrates across 8 sections. The wider ecosystem is much larger; a non-exhaustive list of what isn’t here:

  • Conventional databases when used purely for relational state without any memory framing (most of Postgres, all of MySQL, MongoDB). The agent angle is required for inclusion.

  • RAG over static document corpora (most enterprise knowledge-base products). The read-only RAG case is conventional and well-documented; this volume focuses on the read-write memory case.

  • Caching layers (Memcached, conventional Redis) in their non-agent-memory roles.

  • Specialized memory products for narrow use cases (browser-history memory, voice-call memory, screenshot memory) that haven’t consolidated into general-purpose categories yet.

  • Vector databases optimized for non-text modalities (image search, audio search) when not used as agent memory backends.

  • Embedding models below the MTEB-top-20 line; the leaderboard tracks the broader population.

Appendix F --- A Note on the Moving Target

Mem0 surpassed 55K GitHub stars in early 2026, after consolidating EmbedChain and adding Graph Memory. Letta (formerly MemGPT) productized the original 2023 paper into a hosted platform and SDK. Zep introduced Graphiti for temporal knowledge graphs. Pinecone migrated to serverless pricing. pgvector matured into a production-default extension for Postgres deployments. The Anthropic memory tool became a native API primitive for file-based memory that persists across conversations. The category is moving fast; this catalog captures a moment.

The deepest structural fact: the four kinds of memory (working, episodic, semantic, procedural) and the six-stage lifecycle (write, index, retrieve, update, compact, forget) are stable. The products that implement them will keep changing. An architect who internalizes the conceptual vocabulary can map any new product onto the framework in minutes and choose well; an architect who learns only the products has to relearn the field every twelve months.

Six volumes complete. Patterns, Skills, Tools, Events, Fabric, Memory. A working architect now has the vocabulary for the entire stack.

--- End of The AI Agent Memory Catalog v0.1 ---

--- The Six-Volume Series Complete ---