Vector search is the right tool for document retrieval. Memory is not a document retrieval problem. The field conflated the two, and the numbers show what that costs.
In 2023, the dominant pattern for giving LLMs longer context was retrieval-augmented generation: store documents in a vector index, embed the user's query, pull the nearest neighbors, inject them into the prompt. It worked well enough. Vector search is genuinely good at finding documents that are semantically close to a query, especially across a heterogeneous corpus where you don't know in advance what you're looking for.
When the field turned its attention to cross-session memory, the same toolchain was already in place. Embed the facts the user has shared. Store them in the same vector DB. Retrieve on query. The infrastructure was identical, the API was familiar, and the pattern felt like a natural extension of RAG. Nobody asked whether the retrieval problem was actually the same.
It isn't. And the difference is the entire problem.
Document retrieval is a recall problem. You want the documents most likely to contain relevant information, approximate similarity is an acceptable proxy, and some noise is fine because the model can filter. Memory retrieval is a precision problem. You want exactly the beliefs that apply to this query, and none that don't. These require different tools.
A user's belief store is not a heterogeneous document corpus. It is a dense, typed collection of facts about a specific person, their projects, their preferences, and their domain. Every belief in a developer's store is about software engineering. Redis and MongoDB and TypeScript and Fastify all live in the same semantic region. Cosine similarity cannot tell them apart, because cosine similarity measures something real about them: they are genuinely related. It measures the wrong thing.
In any belief store where the user works in a single technical domain, all beliefs about that domain occupy a common semantic region. A query about Redis is semantically proximate to beliefs about MongoDB, TypeScript, Kubernetes, and GitHub Actions, with cosine scores between 0.65 and 0.83. Those scores are accurate. The problem is they're measuring genuine semantic relatedness, not retrieval relevance. The information a developer needs when asking about Redis is not the same as the information they need when asking about Kubernetes, even though the two topics are semantically close.
To measure this precisely, we built PrecisionMemBench: 89 cases covering alias resolution, scope disambiguation, supersession chain exclusion, fuzzy matching, cross-user isolation, and session-level noise isolation. Cases carry mustExclude assertions and shouldOnlyInclude constraints that make noise a hard failure rather than an invisible inference cost.
| System | Mean precision | Mean recall | Active retrieval passes | Total passes |
|---|---|---|---|---|
| Tenure | 1.0 | 1.0 | 48/48 | 89/89 |
| Vector baseline | 0.09 | 1.0 | 0/48 | 11/89 |
| Mem0 | 0.05 | 0.99 | 0/48 | 9/89 |
| Zep | 0.08 | 0.95 | 0/48 | 9/89 |
| Hindsight | 0.05 | 1.0 | 0/48 | 8/89 |
Active retrieval passes count the cases that require a retrievalPrecision assertion to be satisfied.
The 11 passes for the vector baseline are structural or trivially empty cases, not active retrieval.
All comparison systems were evaluated using Claude Sonnet 4.6 in the extraction path.
The striking detail is not just the precision numbers. It's the recall. Every vector-based system achieves recall near 1.0, because they return everything. A memory system that returns its entire belief store trivially achieves recall of 1.0. The benchmark's own framing for this: returning everything is not retrieval.
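The gap between recall and precision is easy to reproduce on paper. The sketch below is not the benchmark's code: the `RetrievalCase` shape and the `expected` and `must_exclude` field names are hypothetical, loosely modelled on the mustExclude idea described above, and the 12-belief store is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalCase:
    """Hypothetical case shape; field names are illustrative, not the benchmark's."""
    expected: set[str]                                      # beliefs that should come back
    must_exclude: set[str] = field(default_factory=set)     # beliefs that must never come back


def score_case(case: RetrievalCase, retrieved: list[str]) -> dict:
    returned = set(retrieved)
    hits = returned & case.expected
    # Convention assumed here: an empty result set has vacuous precision 1.0.
    precision = len(hits) / len(returned) if returned else 1.0
    recall = len(hits) / len(case.expected) if case.expected else 1.0
    violations = returned & case.must_exclude
    # Noise is a hard failure: any excluded belief or any extra belief fails the case.
    passed = not violations and precision == 1.0 and recall == 1.0
    return {"precision": precision, "recall": recall, "passed": passed}


# An invented 12-belief store in which exactly one belief is relevant.
store = [f"belief_{i}" for i in range(12)]
case = RetrievalCase(expected={"belief_0"}, must_exclude={"belief_7"})

return_everything = score_case(case, store)       # recall 1.0, precision 1/12
return_exact = score_case(case, ["belief_0"])     # recall 1.0, precision 1.0
```

Returning the whole store keeps recall at 1.0 while precision falls to 1/12, and the case fails only because noise is asserted against rather than tolerated.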
We also tested this across three embedding models spanning a 20x difference in parameter count: nomic-embed-text (768 dimensions), mxbai-embed-large (1024 dimensions), and qwen3-8b (4096 dimensions). Mean retrieval precision was 0.09 across all three. The 8-billion-parameter model averaged 1,130 ms per query and produced precision identical to the smallest model's. A more powerful embedding model distributes scores differently but cannot eliminate genuine semantic proximity within a domain-specific corpus. The fix is not a better ruler. It's a different measurement instrument.
Here is why this failure went undetected for so long: when you evaluate memory systems by routing retrieved context through a capable generative model and measuring answer quality, the model compensates for retrieval noise. A system that returns 12 beliefs when 1 was relevant still gets full marks if the model finds the right one.
LoCoMo, LongMemEval, and most of the evaluations built on top of them measure whether a model answered correctly given retrieved context. None measures whether the memory system retrieved correctly. These are not the same question, and treating them as interchangeable has allowed a systematic retrieval failure to go undetected across the field.
A capable model can locate the correct answer buried in 12 beliefs. It takes extra inference compute, but it works. All published evaluations run at this scale, tens to low hundreds of beliefs, where the problem is invisible.
A real persistent memory use case reaches thousands of beliefs across months of sessions. Full-corpus retrieval becomes architecturally impossible. The model can no longer compensate. The precision problem can no longer be offloaded to inference.
A structured pipeline, a classifier, a rules engine, or a fine-tuning dataset cannot compensate for a retrieval system that returns 12 beliefs when 1 was relevant. The model-compensates-for-noise safety net is a property of frontier models, not a property of retrieval.
The Hindsight case makes this concrete. In the Hindsight architecture, a 20B intermediary model sits between retrieval and the chat backbone. Its job is to reason over whatever retrieval returns. LoCoMo measures and rewards its ability to find the right answer in noisy context. PrecisionMemBench measures whether the right belief was retrieved at all, without routing through any generative model. The results: Hindsight achieves 0 active retrieval passes, mean precision 0.05, and a session drift score of 1.0 on noise-critical turns, meaning that after 8 off-topic drift turns it returns no relevant beliefs from the original topic on re-entry. The failure that LoCoMo cannot detect, PrecisionMemBench surfaces immediately.
This isn't a critique of Hindsight specifically. It is evidence that the evaluation gap is present in the most recently published work in the field. The failure mode is structural and shared across every system that uses cosine similarity as its primary retrieval signal.
A memory system whose correctness depends on a generative model's ability to reason under noise is architecturally coupled to that model's inference capability. That coupling is invisible in answer quality benchmarks precisely because those benchmarks are evaluated using capable generative models.
Session-level noise isolation reveals a second failure mode that single-turn metrics completely conceal. When a conversation drifts across topics (a developer session that starts with Redis, wanders through React, TypeScript, Kubernetes, and SQLAlchemy, then returns to Redis), beliefs from the drift turns accumulate in the vector index. Their semantic mass can no longer be distinguished from the re-entry topic.
We measured this with a 10-turn session case: one original topic at turn 0, 8 drift turns across unrelated domains, then an implicit return at turn 9 and explicit return at turn 10. The drift score is the fraction of retrieved non-pinned beliefs originating from drift turns. Zero is perfect isolation.
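The drift score definition translates directly into a few lines of code. This is a minimal sketch under stated assumptions: the `Retrieved` record, its `origin_turn` and `pinned` fields, and the belief names are hypothetical, and the result sets are invented to show the range of the metric rather than to reproduce any system's output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retrieved:
    belief_id: str
    origin_turn: int      # turn at which the belief entered the store
    pinned: bool = False  # pinned beliefs are excluded from the score


def drift_score(results: list[Retrieved], drift_turns: set[int]) -> float:
    """Fraction of retrieved non-pinned beliefs that originate from drift turns."""
    candidates = [r for r in results if not r.pinned]
    if not candidates:
        return 0.0  # assumption: an empty result set counts as no drift
    from_drift = sum(1 for r in candidates if r.origin_turn in drift_turns)
    return from_drift / len(candidates)


drift_turns = set(range(1, 9))  # turns 1 through 8 are the off-topic turns

# Perfect isolation: only the original-topic belief comes back.
isolated = [Retrieved("redis_session_store", origin_turn=0)]
# The right belief is present but surrounded by drift beliefs.
noisy = isolated + [Retrieved(f"drift_belief_{t}", origin_turn=t) for t in range(1, 8)]
# The right belief is missing and only drift beliefs come back.
only_drift = [Retrieved(f"drift_belief_{t}", origin_turn=t) for t in range(1, 8)]
```

Under this definition `isolated` scores 0.0, `noisy` scores 0.875, and `only_drift` scores 1.0. A score of 1.0 therefore implies that nothing from the original topic was returned at all.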
What the table doesn't fully capture: at turn 10, the vector baseline ranks the correct belief first by cosine score (0.853), while simultaneously retrieving all 7 noise beliefs from unrelated drift turns. Hindsight scores a drift of 1.0 at turn 10 with the correct belief absent entirely: not ranked low, but missing from the result set. The cross-encoder reranker bundled in Hindsight's full image is the architectural feature designed to address exactly this class of problem. It does not.
This is the core structural argument against vector search for memory. A purely semantic system degrades as the store grows: more beliefs means more semantic mass, broader cosine overlap, and lower precision on every query. The noise tax, once paid, cannot be recovered.
The failure mode is recognizable to anyone who has shipped memory in production. A GitHub issue on the Mem0 repo from May 2026 describes it precisely:
"I'm using Mem0 as the memory layer for task-based agents. It works well for storing and retrieving user context, but I've been running into subtle failures that are hard to debug: the agent should remember a preference but silently retrieves stale or conflicting information instead."
The issue proposes an eval toolkit to catch these failures. The requested metrics — forgetting, overwrite behavior, temporal relevance, conflict resolution — are exactly what PrecisionMemBench measures at the retrieval layer, before answer quality hides what happened underneath.
A discussion on the Hindsight repo from June 2026 surfaces the same tension from a different angle. A software engineer migrating from a prior memory system wants to quantify the win across output quality, token cost reduction, and multi-hop retrieval. A Hindsight maintainer confirms the system runs evaluations with the maximum recall token budget for competitor comparisons — a detail that matters because token cost and retrieval precision are inversely related in any vector-based system. When you increase the recall budget to improve accuracy, you also inject more noise per turn.
Another commenter in the same thread captures the right reframing: the meaningful measure is not raw token count but tokens-per-relevant-fact-recalled. A memory system that injects 2,000 tokens containing 8 relevant facts is more efficient than one that injects 500 tokens containing 1 relevant fact, even though the latter uses fewer tokens. This is exactly the distinction PrecisionMemBench makes: retrieval precision is what determines whether the tokens you're injecting are useful.
"The three metrics you want to measure (output quality, token cost reduction, retrieval precision) are the right evaluation axes. The meaningful comparison is not raw token count but tokens-per-relevant-fact-recalled."
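The tokens-per-relevant-fact comparison is simple arithmetic, shown here with the figures from the example above. The function name is hypothetical.

```python
def tokens_per_relevant_fact(injected_tokens: int, relevant_facts: int) -> float:
    """Cost in tokens of each relevant fact delivered to the model (lower is better)."""
    if relevant_facts == 0:
        return float("inf")  # tokens were spent and nothing relevant arrived
    return injected_tokens / relevant_facts


large_injection = tokens_per_relevant_fact(2000, 8)  # 250.0 tokens per relevant fact
small_injection = tokens_per_relevant_fact(500, 1)   # 500.0 tokens per relevant fact
```

The larger injection costs half as much per relevant fact, which is why raw token count alone can point in the wrong direction.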
The most important result in PrecisionMemBench is not the aggregate precision numbers. It is a specific relation-type case that isolates extraction quality from retrieval quality.
Consider a belief in the evaluation seed corpus: the auth service depends on Redis for session storage; if Redis goes down, auth fails open. After ingestion with full content, Mem0's extraction pipeline produced a high-quality stored memory — every operationally significant fact was preserved: the service name, the dependency target, the fail-open behavior, and the coupling assertion. This is a good extraction by any measure.
The benchmark then issues two queries referencing this belief. On the first, Mem0 returns the correct belief plus 16 additional beliefs including a linting preference, a React expertise note, and a writing-domain open question with no relationship to the auth service. Retrieval precision: 0.056. The structurally necessary participant belief (the Redis entity it references) is absent entirely. Recall: 0.5.
On the second query, both beliefs are present. Retrieval precision: 0.111. Sixteen additional beliefs returned alongside the two correct ones.
The two cases together isolate the failure precisely: the precision failure is not caused by poor extraction. It occurs on a concrete, operationally specific relation belief whose extraction was faithful and complete. The failure is structural, not query-dependent. Cosine similarity over a domain-specific corpus cannot discriminate relevant beliefs from semantically proximate ones. That is an invariance, not a tuning problem.
"User's auth service depends on Redis for session storage.
If Redis goes down, auth fails open by denying all requests.
Auth resilience discussions must address Redis availability —
the two are tightly coupled."

auth_service_depends_on_redis auth redis session backend auth

Tenure's indexed representation of the same belief is four tokens. The content and why it matters fields are not indexed. The participant expansion is executed as a deterministic one-hop join on the participants array rather than as a semantic search. Both benchmark cases pass with precision 1.0: the correct two beliefs are returned, the 16 noise beliefs are absent, and participant routing is enforced structurally rather than inferred from content similarity.
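A one-hop participant join can be sketched in a few lines. This is an illustration, not Tenure's implementation: the `Belief` shape, the dictionary store, and the `redis`, `react_expertise`, and `linting_preference` keys are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Belief:
    name: str
    participants: list[str] = field(default_factory=list)  # names of referenced beliefs


def expand_participants(matched: list[str], store: dict[str, Belief]) -> list[str]:
    """Return the matched beliefs plus the beliefs they name as participants.

    The join is exactly one hop: participants of participants are not followed.
    """
    result = list(matched)
    for name in matched:  # iterate the original matches only, never the expansion
        for participant in store[name].participants:
            if participant in store and participant not in result:
                result.append(participant)
    return result


store = {
    "auth_service_depends_on_redis": Belief(
        "auth_service_depends_on_redis", participants=["redis"]
    ),
    "redis": Belief("redis"),
    "react_expertise": Belief("react_expertise"),
    "linting_preference": Belief("linting_preference"),
}

expanded = expand_participants(["auth_service_depends_on_redis"], store)
```

Here `expanded` contains the relation belief and the `redis` entity, and nothing else. The unrelated beliefs are never considered, because the join follows explicit references rather than similarity.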
The fix is grounded in a documented property of how people refer to their own work. Corpus-based idiolect research establishes that individual speakers maintain stable, distinctive lexical choices across production contexts over periods of one to two years. In a single-user belief store, the query author and the belief author are the same person. When a developer refers to their Kubernetes setup, they use the same terms they used when they first described it. The query surface and the belief surface converge.
If a belief's canonical name is kubernetes and its alias set includes k8s and kube, then a query containing k8s should retrieve that belief with high precision regardless of semantic distance. There is no ambiguity to resolve: the authored terminology is the ground truth.
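The kubernetes example can be written as an exact lookup. The sketch below assumes a flat mapping from surface form to canonical name and a simple lowercase tokenizer; both are illustrative choices rather than a description of any particular system.

```python
import re

# Canonical name -> alias set, using the example from the text.
ALIAS_SETS = {"kubernetes": {"k8s", "kube"}}


def build_lookup(alias_sets: dict[str, set[str]]) -> dict[str, str]:
    """Map every surface form (canonical name or alias) to its canonical name."""
    lookup = {}
    for canonical, aliases in alias_sets.items():
        lookup[canonical] = canonical
        for alias in aliases:
            lookup[alias] = canonical
    return lookup


def resolve(query: str, lookup: dict[str, str]) -> list[str]:
    """Return the canonical names whose surface forms appear in the query."""
    found = []
    for token in re.findall(r"[a-z0-9]+", query.lower()):
        canonical = lookup.get(token)
        if canonical is not None and canonical not in found:
            found.append(canonical)
    return found


lookup = build_lookup(ALIAS_SETS)
hit = resolve("Why is my k8s ingress timing out?", lookup)   # ['kubernetes']
miss = resolve("Why is my Redis cache timing out?", lookup)  # []
```

The match is exact on authored vocabulary, so a query about Redis retrieves nothing from the Kubernetes belief no matter how close the two topics sit in embedding space.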
Standard BM25 applied naively to this problem fails because it assumes short queries against long documents. Belief retrieval inverts this: queries are potentially hundreds of tokens of natural language, while beliefs are short (a canonical name of one to three tokens and an alias list of three to five terms). The index engineering that makes BM25 work for belief retrieval is what gets skipped when people conclude "BM25 doesn't work for this."
The index side decomposes structured identifiers into discrete tokens. The query side uses conventional prose tokenization. Different analyzers at each boundary — required, not optional.
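A minimal sketch of the two analyzers follows. The splitting rules (separators and camelCase boundaries on the index side, lowercase word tokens on the query side) and the sample identifiers are assumptions chosen for illustration; a production analyzer chain would likely be more involved.

```python
import re


def index_analyzer(identifier: str) -> list[str]:
    """Index side: decompose a structured identifier into discrete tokens."""
    # Insert a space at camelCase boundaries, then split on any non-alphanumeric run.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", identifier)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def query_analyzer(text: str) -> list[str]:
    """Query side: conventional prose tokenization, lowercased."""
    return re.findall(r"[a-z0-9]+", text.lower())


index_tokens = index_analyzer("redis_session_store")   # ['redis', 'session', 'store']
camel_tokens = index_analyzer("sessionBackend")        # ['session', 'backend']
query_tokens = query_analyzer("Where does the session store live?")
```

Running the prose tokenizer over `redis_session_store` would leave it as one opaque token in many setups, and no word in a natural-language query would ever match it. Decomposing at index time is what lets `session` and `store` in the query reach the belief.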
A high boost weight on canonical name and alias phrase matches re-calibrates the score distribution so a single precise match on a short field outweighs accumulated term frequency noise from a long query.
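The effect of the boost can be shown with a deliberately simplified scorer. This is not BM25: it counts distinct matching terms per field and multiplies by a field weight. The weights, the `tags` field, and both beliefs are hypothetical, invented to show a long query outvoting a precise match when every field counts equally.

```python
import re


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def field_score(query: str, belief: dict, weights: dict[str, float]) -> float:
    """Sum over fields of (field weight x number of distinct query terms matched)."""
    q = tokenize(query)
    return sum(weights[f] * len(q & set(belief[f])) for f in weights)


FLAT = {"name": 1.0, "aliases": 1.0, "tags": 1.0}      # no boost
BOOSTED = {"name": 10.0, "aliases": 8.0, "tags": 1.0}  # hypothetical boost values

kubernetes = {"name": ["kubernetes"], "aliases": ["k8s", "kube"],
              "tags": ["infra", "deploy"]}
ci = {"name": ["github", "actions"], "aliases": ["gha"],
      "tags": ["deploy", "pipeline", "build", "infra"]}

query = "the deploy pipeline build keeps failing on infra after the k8s upgrade"

flat_scores = (field_score(query, kubernetes, FLAT), field_score(query, ci, FLAT))
boosted_scores = (field_score(query, kubernetes, BOOSTED), field_score(query, ci, BOOSTED))
```

With flat weights the CI belief wins 4.0 to 3.0 on incidental tag overlap. With the boost, the single alias match on `k8s` lifts the Kubernetes belief to 10.0 against 4.0.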
Scope, user identity, supersession status, and resolution status are applied as hard filters concurrent with scoring. A superseded or out-of-scope belief is never a candidate regardless of match quality.
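Hard filtering can be sketched as an eligibility check made in the same pass as scoring. The `Belief` fields, the sample beliefs, and the fixed scores below are assumptions for illustration. Resolution status would be one more predicate of the same kind and is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Belief:
    name: str
    user_id: str
    scope: str
    superseded: bool = False


def retrieve(beliefs: list[Belief], user_id: str, scope: str,
             score_fn: Callable[[Belief], float]) -> list[str]:
    """Score only eligible beliefs; ineligible ones are skipped whatever their match quality."""
    scored = []
    for belief in beliefs:
        if belief.user_id != user_id or belief.scope != scope or belief.superseded:
            continue  # hard filter: never a candidate
        score = score_fn(belief)
        if score > 0:
            scored.append((score, belief.name))
    return [name for _, name in sorted(scored, key=lambda pair: -pair[0])]


beliefs = [
    Belief("db_is_postgres_14", "alice", "project_a", superseded=True),
    Belief("db_is_postgres_16", "alice", "project_a"),
    Belief("db_is_mysql", "bob", "project_a"),
    Belief("db_is_sqlite", "alice", "project_b"),
]
# Invented match scores: the superseded belief matches best of all.
match_scores = {"db_is_postgres_14": 9.0, "db_is_postgres_16": 5.0,
                "db_is_mysql": 7.0, "db_is_sqlite": 6.0}

results = retrieve(beliefs, "alice", "project_a", lambda b: match_scores[b.name])
```

Only `db_is_postgres_16` is returned, even though three other beliefs score higher. The filter is a precondition for being ranked, not a penalty applied to the rank.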
The predictable objection to BM25 for belief retrieval is vocabulary coverage: if a user refers to a belief using a term not yet in the alias set, retrieval fails. This objection is correct as a static description and wrong as a practical one. The alias set is not static. Every session is an observation of how the user refers to their beliefs. New surface forms are captured and added to the alias set of the belief they describe. The system acquires vocabulary from use.
This produces a precision flywheel that runs in the opposite direction from the objection. A purely semantic system degrades as the store grows: more beliefs means more semantic mass, broader cosine overlap, lower precision. Alias-weighted BM25 improves as the store grows: more sessions means more observed surface forms, a richer alias set, higher precision on the vocabulary that's actually used.
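The vocabulary-acquisition loop can be sketched as an alias store that grows with observation. How a system decides which belief a new surface form describes is not specified here, so the sketch takes that pairing as an input. The `AliasStore` class and the `kubes` surface form are hypothetical.

```python
from typing import Optional


class AliasStore:
    """Canonical names with alias sets that grow as new surface forms are observed."""

    def __init__(self) -> None:
        self.forms: dict[str, set[str]] = {}

    def add_belief(self, canonical: str, aliases: tuple[str, ...] = ()) -> None:
        self.forms[canonical] = {canonical, *aliases}

    def resolve(self, term: str) -> Optional[str]:
        term = term.lower()
        for canonical, forms in self.forms.items():
            if term in forms:
                return canonical
        return None

    def observe(self, canonical: str, surface_form: str) -> None:
        """Record a surface form the user was seen using for this belief."""
        self.forms[canonical].add(surface_form.lower())


aliases = AliasStore()
aliases.add_belief("kubernetes", ("k8s", "kube"))

before = aliases.resolve("kubes")      # None: the form has not been seen yet
aliases.observe("kubernetes", "kubes")
after = aliases.resolve("kubes")       # 'kubernetes'
```

The first lookup misses, which is the static objection in miniature. After one observation the same term resolves exactly, and it keeps resolving in every later session.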
A common assumption is that precision comes at a latency cost: rerankers, cross-encoders, and LLM-based re-evaluation are all expensive ways to improve retrieval quality. The benchmark results do not support this for structured belief retrieval.
| System | Mean single-turn (ms) | Mean session-turn (ms) | p95 session (ms) | Ingestion total (s) |
|---|---|---|---|---|
| Tenure | 13.49 | 49.02 | 134.63 | 0.98 |
| Mem0 | 78.81 | 382.62 | 692.35 | 114.19 |
| Zep | 139.64 | 396.89 | 659.99 | 897.04 |
| Hindsight | 672.15 | 2,735.84 | 6,162.57 | 173.28 |
The pattern is consistent: systems that compensate for imprecise primary retrieval with additional infrastructure (cross-encoder rerankers, graph construction, LLM intermediaries) pay latency costs that compound under session load. Tenure's mean single-turn latency of 13.49ms is not a design compromise. It is a consequence of not over-retrieving in the first place. You do not need to rerank a result set of 1.
Zep's async ingestion creates a structural availability gap worth noting separately. Individual beliefs take up to 125 seconds during ingestion, with a mean of 25,630ms per belief against a conversational turn cadence of 10 to 30 seconds. A belief introduced at turn 1 may not be queryable until the session has largely concluded. The benchmark's pre-ingestion design does not penalize this — all beliefs are confirmed available before any retrieval case runs — so the 897-second ingestion total is a lower bound on the availability gap in practice.
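The availability gap can be expressed in conversational turns using only the figures quoted above. The helper name is hypothetical, and the calculation assumes a constant turn cadence, which real sessions will not have.

```python
def turns_elapsed(ingestion_seconds: float, turn_cadence_seconds: float) -> float:
    """Number of conversational turns that pass while one belief is being ingested."""
    return ingestion_seconds / turn_cadence_seconds


MEAN_INGEST_S = 25.630   # mean per-belief ingestion time quoted above
WORST_INGEST_S = 125.0   # slowest per-belief ingestion time quoted above

mean_gap_fast_turns = turns_elapsed(MEAN_INGEST_S, 10)    # about 2.6 turns
mean_gap_slow_turns = turns_elapsed(MEAN_INGEST_S, 30)    # about 0.85 turns
worst_gap_fast_turns = turns_elapsed(WORST_INGEST_S, 10)  # 12.5 turns
worst_gap_slow_turns = turns_elapsed(WORST_INGEST_S, 30)  # about 4.2 turns
```

At a 10-second cadence the slowest belief stays unqueryable for 12.5 turns, which is longer than a short session lasts.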
The field framed AI memory as a search problem. Search tools were applied. Search evaluation was used to measure success. Because search evaluation routes through a generative model, the model's ability to compensate for retrieval noise was mistaken for retrieval quality.
The benchmark results quantify what that cost: zero active retrieval passes across every comparison system on 48 cases with precision assertions. The systems that passed accumulate passes only on structurally trivial or empty cases. No vector-based memory system retrieves correctly on cases that actually test retrieval.
The correct framing is state management. A belief store is typed state, not a document index. Retrieval is structured lookup over a bounded vocabulary, not open-ended semantic search. The right signal is alias-weighted term matching over user-authored identifiers, not cosine similarity over embedding space. Scope isolation is a hard filter before scoring, not a probabilistic tendency after it.
None of this requires new techniques. BM25 is well understood. Alias matching is standard. Hard scope filters are architecture, not inference. What it requires is correctly identifying the problem. Memory is not a search problem.