Retrieval

Why precision is all that matters in AI memory

Name: PrecisionMemBench
Creator: Tenure
License: https://opensource.org/licenses/MIT

Retrieval precision is the difference between a memory system that works and one that merely appears to. Everything else is secondary.

Tenure research · ~10 min read

TL;DR

Current benchmarks measure answer quality, not retrieval precision, which hides catastrophic failure modes.
A memory system that returns its entire store achieves recall of 1.0 while drowning the model in noise.
Cosine similarity cannot discriminate between beliefs that share a domain but differ in relevance.
At mean precision of 0.05–0.09, comparison systems return 8–18 irrelevant beliefs per query.
No downstream model can compensate for this forever: context windows fill, contradictions accumulate, and non-generative consumers fail outright.

The measurement problem

The benchmark gap that hides failure

Every major benchmark for LLM memory systems measures whether a model answered correctly, not whether the memory system retrieved correctly. That distinction sounds academic until you realize that a system returning its entire belief store achieves recall of 1.0 and passes answer-quality evaluation. The generative model simply sorts through the noise.

This is the difference between a unit test and an integration test. Retrieval quality must be measured in isolation from the generative model it feeds into, yet no existing benchmark does this. A capable model can locate the correct answer in a noisy context window at small scale, which means systems with mean retrieval precision as low as 0.05 appear competitive with systems that retrieve exactly the right beliefs and nothing else.

A memory system that returns its entire store achieves recall of 1.0 trivially. The generative model was never a neutral downstream consumer; it was load-bearing infrastructure compensating for retrieval imprecision.
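The gap between recall and precision described above can be made concrete with a minimal sketch. The store size, belief identifiers, and the `precision_recall` helper are illustrative assumptions, not part of PrecisionMemBench.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query, judged on retrieval alone."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical store of 20 beliefs; exactly one is relevant to the query.
store = {f"belief-{i}" for i in range(20)}
relevant = {"belief-7"}

# Strategy A: return the whole store. Strategy B: return only what is needed.
dump_everything = precision_recall(store, relevant)
exact = precision_recall({"belief-7"}, relevant)

print(dump_everything)  # (0.05, 1.0)
print(exact)            # (1.0, 1.0)
```

Both strategies score a recall of 1.0, so only the precision figure separates them, and it can be computed without any generative model in the loop.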

The structural limit

Why similarity search fails at memory

In any belief store where participants share a technical domain, all beliefs about that domain occupy a common semantic region. Cosine similarity captures this domain proximity but cannot discriminate within it. A query about Redis is semantically close to a belief about Redis, as intended, but it is also close to beliefs about MongoDB, TypeScript, Fastify, and Kubernetes, with cosine scores between 0.65 and 0.83.

The scores reflect genuine semantic relatedness; they simply measure the wrong thing. A larger embedding model distributes scores differently but cannot eliminate genuine semantic proximity within a domain-specific corpus. This invariance holds across a 20x range in embedding model scale: from 768-dimension models to 8-billion-parameter models, mean retrieval precision remains identical at 0.09.
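A toy sketch of this effect, under loud assumptions: the three-dimensional vectors below are made up by hand, with the first axis standing in for a shared "backend infrastructure" domain component. They are not real embeddings, and the 0.65 threshold is borrowed from the low end of the score range quoted above purely for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made vectors (assumption): axis 0 is the shared domain component,
# axes 1 and 2 carry the topic-specific signal.
query = [1.0, 0.6, 0.0]  # a question about Redis
beliefs = {
    "redis": [1.0, 0.7, 0.0],
    "mongodb": [1.0, 0.0, 0.6],
    "kubernetes": [1.0, 0.2, 0.5],
}
relevant = {"redis"}

THRESHOLD = 0.65  # illustrative cut-off
scores = {name: cosine(query, vec) for name, vec in beliefs.items()}
retrieved = [name for name, s in scores.items() if s >= THRESHOLD]
precision = len(set(retrieved) & relevant) / len(retrieved)

for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:<12}{s:.3f}")
print("precision:", round(precision, 3))
```

Because the shared domain axis dominates every vector, all three beliefs clear the threshold. The relevant belief ranks first, but a threshold alone cannot exclude the other two.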

Resolving relevance requires a retrieval signal orthogonal to semantic similarity. The field has responded by adding re-ranking stages, temporal trees, and hierarchical graphs, all of which compensate for the wrong primary signal. The correct signal is simpler: in bounded-vocabulary contexts where the user coined the terminology, alias-weighted term matching retrieves what the user named rather than what is semantically nearby.
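One way alias-weighted term matching might look is sketched below. This is not Tenure's implementation: the `Belief` shape, the `ALIAS_WEIGHT` value, the tokenizer, and the plain term-overlap score (used here as a stand-in for a real lexical scorer such as BM25) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

ALIAS_WEIGHT = 5.0  # assumed boost for matching a user-coined name


@dataclass
class Belief:
    text: str
    aliases: set[str] = field(default_factory=set)  # names the user coined


def tokens(s: str) -> set[str]:
    """Lowercase word tokens; hyphenated names stay in one piece."""
    return set(re.findall(r"[a-z0-9][a-z0-9\-]*", s.lower()))


def score(query: str, belief: Belief) -> float:
    q = tokens(query)
    body_hits = len(q & tokens(belief.text))
    alias_hits = len(q & {a.lower() for a in belief.aliases})
    return body_hits + ALIAS_WEIGHT * alias_hits


def retrieve(query: str, beliefs: list[Belief]) -> list[Belief]:
    """Keep only beliefs with at least one alias match, best score first."""
    scored = [(score(query, b), b) for b in beliefs]
    scored.sort(key=lambda pair: -pair[0])
    return [b for s, b in scored if s >= ALIAS_WEIGHT]


beliefs = [
    Belief("Session cache runs on Redis with a 15 minute TTL", {"redis", "session-cache"}),
    Belief("Primary document store is MongoDB", {"mongodb"}),
    Belief("API server is built with Fastify", {"fastify"}),
]

results = retrieve("what is the TTL on the redis cache", beliefs)
print([b.text for b in results])
```

In this sketch the MongoDB and Fastify beliefs share a stop word with the query but match no alias, so they fall below the cut-off instead of being returned with a lower rank.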

The compounding cost

Noise accumulates faster than models can compensate

At small scale, a frontier model can act as a filter. When a memory store contains tens of beliefs, returning a dozen irrelevant ones is survivable. But persistent memory is not used at tens of beliefs. It is used at thousands. Full-corpus retrieval becomes architecturally impossible, and the precision problem can no longer be offloaded to inference.

The failure mode is not theoretical. On PrecisionMemBench, vector-based memory systems achieve mean precision of 0.05 to 0.09, returning 8 to 18 irrelevant beliefs per query alongside the one that mattered. Extraction quality does not predict retrieval precision: even when entity extraction is entirely faithful, the retrieval layer contaminates the result set with semantically proximate noise.

Vector similarity

Mean retrieval precision (PrecisionMemBench): 0.05–0.09

8–18 irrelevant beliefs returned per query. Model compensates via in-context reasoning until the context window fills or contradictions accumulate.

BM25 + typed beliefs

Mean retrieval precision (PrecisionMemBench): 1.0

Exactly the required beliefs returned. No noise. Precision holds as the store grows.

Worse, the noise is invisible to answer-quality metrics. LLM-as-a-Judge evaluations report success because the judge is itself a capable generative model, and capable generative models can tolerate moderate noise. The failure only surfaces when you route retrieved context to a classifier, a rules engine, a structured pipeline, or any consumer that is not a frontier LLM. At that point the system does not merely underperform; it fails outright.

Multi-turn failure

Session drift makes noise architectural

Single-turn retrieval metrics conceal a compounding failure that only appears in multi-turn sessions. After topic drift, comparison systems allow semantic mass to bleed across turns. A 10-turn session might establish a topic, drift across eight unrelated domains, then return to the original topic. At re-entry, vector-based systems return drift-turn beliefs alongside the correct one, producing drift scores of 0.92 to 1.0.

This means the memory system actively contaminates the context window with off-topic beliefs simply because they share a broad semantic neighborhood with the re-entry query. Single-turn latency figures are equally misleading: one comparison system reports sub-700ms single-turn latency but exceeds 2,700ms mean per session turn, with p95 above 6,000ms. The session load degrades retrieval paths that were already imprecise.

Single-turn metrics are unit tests for a system that only operates in integration. Memory is stateful across turns, and precision must be measured under drift conditions.
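A session-level check along these lines could be sketched as follows. The `drift_share` function is a hypothetical metric defined only for this example (the share of beliefs returned at re-entry that originated in drift turns); it is not claimed to be the drift score that PrecisionMemBench reports.

```python
def drift_share(returned_turns: list[int], drift_turns: set[int]) -> float:
    """Share of returned beliefs that were stored during drift turns."""
    if not returned_turns:
        return 0.0
    off_topic = sum(1 for t in returned_turns if t in drift_turns)
    return off_topic / len(returned_turns)


# Assumed 10-turn session: turn 1 establishes the topic, turns 2-9 drift
# across unrelated domains, turn 10 returns to the original topic.
drift_turns = set(range(2, 10))

# Each entry is the turn in which a returned belief was originally stored.
noisy_reentry = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # correct belief plus drift beliefs
clean_reentry = [1]                          # only the original-topic belief

print(round(drift_share(noisy_reentry, drift_turns), 2))
print(drift_share(clean_reentry, drift_turns))
```

Measuring at the re-entry turn, rather than on an isolated query, is what exposes beliefs that leak in from earlier turns.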

The coupling problem

Precision is what remains when the safety net disappears

The dominant assumption in memory system design is that the downstream model will clean up retrieval noise. This assumption is only valid for frontier generative models operating within a generous context window. It fails for every other consumer.

Strip the generative model out and route retrieved context directly to a fine-tuning dataset, an evaluation harness, a classifier, or a rules engine: current retrieval implementations fail immediately. Not because the right belief is absent, but because it is surrounded by enough noise that a non-generative consumer cannot act on it. The model was not a neutral downstream consumer; it was load-bearing infrastructure.

A memory system whose correctness depends on a generative model's ability to reason under noise is architecturally coupled to that model's inference capability. That coupling is invisible in answer-quality benchmarks precisely because those benchmarks are evaluated using capable generative models. It becomes very visible the moment you try to build anything else on top of the memory layer.
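The coupling can be illustrated with a deliberately simple deterministic consumer. Everything here is hypothetical: the `Belief` shape, the `AmbiguousContext` error, and the `resolve_cache_backend` rule stand in for any classifier, rules engine, or pipeline that cannot reason about which retrieved items to ignore.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    subject: str
    value: str


class AmbiguousContext(Exception):
    """Raised when a deterministic consumer cannot pick a single belief."""


def resolve_cache_backend(context: list[Belief]) -> str:
    """Rule-style consumer: acts only on exactly one retrieved belief."""
    if len(context) != 1:
        raise AmbiguousContext(f"expected exactly 1 belief, got {len(context)}")
    return context[0].value


precise = [Belief("cache backend", "Redis")]
noisy = precise + [
    Belief("document store", "MongoDB"),
    Belief("web framework", "Fastify"),
]

print(resolve_cache_backend(precise))  # Redis

try:
    resolve_cache_backend(noisy)
except AmbiguousContext as err:
    print("consumer failed:", err)
```

The right belief is present in both result sets. The noisy one fails only because this consumer, unlike a frontier model, has no way to filter.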

The correct design

Precision is a property of the data model, not the model

High precision is not achieved by fine-tuning embeddings or adding re-ranking stages. It is achieved by treating memory as typed state rather than a search index. Hard scope filters, typed beliefs, epistemic status, and supersession chains are structural guarantees, not probabilistic ones. A belief that is superseded is structurally absent from retrieval. A belief outside the active scope is not down-ranked; it is excluded before scoring begins.

This matters for correctness in any multi-project or multi-user environment. A session in project:client-a must not surface beliefs from project:client-b regardless of semantic proximity. Vector search can only make this a tendency. A typed belief store with hard scope isolation makes it a guarantee.

Vector search provides scope isolation as a tendency. A typed belief store provides it as a structural guarantee. For engineering contexts, the difference matters.
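A minimal sketch of exclusion before scoring, assuming a hypothetical `Belief` record with a `scope` string and a `superseded_by` pointer. The field names and the `candidates` function are illustrative rather than Tenure's schema, and epistemic status is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Belief:
    id: str
    scope: str                           # e.g. "project:client-a"
    text: str
    superseded_by: Optional[str] = None  # id of the belief that replaced this one


def candidates(store: list[Belief], active_scope: str) -> list[Belief]:
    """Hard filter applied before any scoring: out-of-scope and superseded
    beliefs never reach the ranking stage."""
    return [
        b for b in store
        if b.scope == active_scope and b.superseded_by is None
    ]


store = [
    Belief("b1", "project:client-a", "Cache backend is Memcached", superseded_by="b2"),
    Belief("b2", "project:client-a", "Cache backend is Redis"),
    Belief("b3", "project:client-b", "Cache backend is Redis Cluster"),
]

visible = candidates(store, "project:client-a")
print([b.id for b in visible])  # ['b2']
```

However similar `b3` is to a query issued in project:client-a, it cannot be returned, because it is removed from the candidate set before any score is computed.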

Conclusion

Retrieval precision is the only metric that matters

Latency, recall, and extraction fidelity are all necessary, but none are sufficient. A system with perfect extraction and perfect recall that returns 18 beliefs when one was relevant is a system that fails at the only thing memory is for: surfacing the right context at the right time.

The field has optimized for answer quality because answer quality is easy to measure and impressive to demo. Precision is harder to measure and invisible when generative models compensate for its absence. But the compensation has limits: context windows, latency budgets, and the growing need to connect memory to deterministic systems that cannot tolerate noise.

Memory is not a search problem. It is a state management problem. And state management requires exact retrieval, not approximate similarity. Precision is not a tuning target. It is the defining property of a memory system that actually works.
