Every existing benchmark measures whether the model answered correctly.
This one measures whether the memory system retrieved correctly.
Those are not the same question.
LoCoMo, LongMemEval, and every benchmark built on top of them measure answer quality through a generative judge. This design cannot distinguish a memory system that retrieves one correct belief from one that retrieves its entire store.
Existing benchmarks: Retrieve everything. Route through a model. Score on F1. A system with 0.05 retrieval precision looks "competitive" because the downstream model compensates for the noise.
This benchmark: Assert against retrieved beliefs directly. No generative model in the evaluation loop. mustExclude and shouldOnlyInclude make noise a hard failure.
Strip the generative model out, route retrieved context to a classifier, rules engine, or structured pipeline, and current retrieval implementations fail immediately. The generative model was never a neutral consumer; it was load-bearing infrastructure compensating for retrieval noise.
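To make the distinction concrete, here is a minimal sketch of set-based precision and recall over belief IDs. The function name `precision_recall`, the 20-belief store, and the IDs are illustrative assumptions, not part of the benchmark. Returning `None` for an empty result set is also an assumption, chosen to mirror the null precision reported for empty-by-design cases.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall over belief IDs (hypothetical helper)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision is undefined when nothing is retrieved; recall when nothing is relevant.
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant) if relevant else None
    return precision, recall


# Hypothetical store of 20 beliefs, exactly one of which answers the query.
store = [f"b{i}" for i in range(20)]
relevant = ["b7"]

targeted = precision_recall(["b7"], relevant)  # retrieves only the right belief
dump_all = precision_recall(store, relevant)   # retrieves the entire store

print(targeted)  # (1.0, 1.0)
print(dump_all)  # (0.05, 1.0)
```

Both systems have perfect recall, so a judge that only sees the final answer may score them alike. Only the precision figure separates them.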
89 cases. 11 systems. One metric that matters: active retrieval passes, the cases where a retrievalPrecision assertion was satisfied.
| System | Active passes | Total passes | Mean precision | Mean recall | Retrieval p50 | Ingestion total |
|---|---|---|---|---|---|---|
| tenure | 43/43 | 77/77 | 1.00 | 1.00 | 9.77ms | 1.00s |
| supermemory | 17/17 | 44/77 | 0.43 | 0.55 | 819.48ms | 0.00s |
| gbrain | 5/5 | 34/77 | 0.14 | 0.17 | 543.84ms | 28.60s |
| agentmemory | 0/0 | 7/77 | 0.17 | 0.97 | 82.28ms | 1.10s |
| yourmemory | 0/0 | 21/77 | 0.17 | 0.88 | 313.39ms | 16.40s |
| atomicmemory | 0/0 | 9/77 | 0.15 | 0.95 | 71.01ms | 658.90s |
| zep | 0/0 | 9/77 | 0.09 | 0.95 | 124.36ms | 897.00s |
| vector (mxbai-embed-large) | 0/0 | 11/77 | 0.09 | 1.00 | 71.87ms | — |
| hindsight | 0/0 | 9/77 | 0.06 | 1.00 | 589.86ms | 173.30s |
| mem0 | 0/0 | 9/77 | 0.06 | 0.99 | 64.94ms | 111.30s |
| a-mem | 0/0 | 9/77 | 0.06 | 0.99 | 13.80ms | 178.80s |
Active passes are cases with a retrievalPrecision assertion that was satisfied. They are the only pass type that demonstrates verified retrieval capability. A system cannot accumulate them by returning everything or nothing.
A second pass type covers cases that assert scope isolation, supersession exclusion, or type routing without a precision assertion, where the structural property holds.
The remaining passes are trivially empty: cases where the expected result set is empty by design (empty query, maxBeliefs: 0). Any system returning an empty set passes by construction. retrievalPrecision is null for these cases.
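The three pass types above can be sketched as a small classifier. This is an illustration only: the field names `expectedIds` and `hasPrecisionAssertion` are hypothetical, and the labels `"structural"` and `"trivial"` are shorthand introduced here rather than terms from the benchmark. Only `maxBeliefs` appears in the text.

```python
def classify_pass(case, passed):
    """Label an outcome as 'active', 'structural', 'trivial', or 'fail'."""
    if not passed:
        return "fail"
    # Expected result set is empty by design: passing proves nothing about retrieval.
    if not case.get("expectedIds") or case.get("maxBeliefs") == 0:
        return "trivial"
    # A satisfied precision assertion is the only verified-retrieval pass.
    if case.get("hasPrecisionAssertion"):
        return "active"
    # Otherwise the case only checked a structural property (scope, supersession, routing).
    return "structural"


# Hypothetical case records paired with their pass/fail outcome.
cases = [
    ({"expectedIds": ["b7"], "hasPrecisionAssertion": True}, True),
    ({"expectedIds": ["b2"], "hasPrecisionAssertion": False}, True),
    ({"expectedIds": [], "maxBeliefs": 0}, True),
    ({"expectedIds": ["b7"], "hasPrecisionAssertion": True}, False),
]
labels = [classify_pass(case, passed) for case, passed in cases]
print(labels)  # ['active', 'structural', 'trivial', 'fail']
```

Counting only the `"active"` label is what keeps a return-nothing or return-everything system from inflating its score.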
89 cases across 14 categories. Alias resolution and session-level noise isolation are where the precision gap is most stark.
| Category | Cases | Tenure | Vector | Mem0 | Zep | Hindsight |
|---|---|---|---|---|---|---|
| Alias resolution | 23 | 23/23 | 0/23 | 0/23 | 0/23 | 0/23 |
| Session-level noise isolation | 12 | 12/12 | 0/12 | 0/12 | 0/12 | 0/12 |
| Scope disambiguation | 12 | 10/12 | 1/12 | 0/12 | 0/12 | 0/12 |
| Fuzzy matching & prefix guards | 7 | 7/7 | 0/7 | 0/7 | 0/7 | 0/7 |
| Type routing / open questions | 6 | 6/6 | 0/6 | 0/6 | 0/6 | 0/6 |
| Design boundary cases | 6 | 6/6 | 2/6 | 2/6 | 2/6 | 2/6 |
| Budget eviction & capacity | 5 | 5/5 | 3/5 | 3/5 | 3/5 | 3/5 |
| Persona prelude content | 4 | 4/4 | 2/4 | 1/4 | 1/4 | 1/4 |
| Relation expansion | 4 | 4/4 | 0/4 | 0/4 | 0/4 | 0/4 |
| Supersession chain exclusion | 3 | 3/3 | 0/3 | 0/3 | 0/3 | 0/3 |
| Ranking stability | 3 | 3/3 | 1/3 | 1/3 | 1/3 | 1/3 |
| Counter-signal retrieval | 2 | 2/2 | 0/2 | 0/2 | 0/2 | 0/2 |
| Cross-user isolation | 1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 |
| Cold start behavior | 1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 |
| Total | 89 | 89/89 | 11/89 | 9/89 | 9/89 | 9/89 |
The natural objection: use a more capable embedding model. We tested across a 20× range in scale. Precision is identical. The fix is not a better ruler; it is a different measurement instrument.
| Model | Dimensions | Precision | Passes | Mean latency | p95 latency |
|---|---|---|---|---|---|
| nomic-embed-text | 768 | 0.09 | 11/77 | 43ms | 85ms |
| mxbai-embed-large | 1024 | 0.09 | 11/77 | 96ms | 257ms |
| qwen3-8b | 4096 | 0.09 | 11/77 | 1131ms | 2605ms |
At roughly 26× the mean latency of nomic-embed-text (1,131ms versus 43ms per query), the 8-billion-parameter qwen3 model produces identical precision. All 11 passes in every configuration are trivially empty or budget-forced cases. Active retrieval passes across all three: 0.
The session cases test one thing: whether beliefs introduced during off-topic drift turns contaminate retrieval on subsequent unrelated turns. A drift score of 0 is perfect isolation. All competitors score 0.92–1.0.
| Turn | Tenure | Vector | Mem0 | Zep | Hindsight |
|---|---|---|---|---|---|
| Turn 9 — implicit re-entry | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Turn 10 — explicit re-entry | 0.00 | 0.94 | 1.00 | 1.00 | 0.94 |
| Cross-session formative | 0.00 | 0.94 | 1.00 | 0.92 | 1.00 |
Hindsight at turn 10: drift score 1.0 with the correct belief absent entirely (recall 0). The belief is not ranked low; it is missing from the result set, and 4 explicit noise belief violations are flagged on top of a full-corpus return.
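The text does not spell out how the drift score is computed. One plausible reading, offered here purely as an assumption, is the fraction of beliefs introduced during drift turns that reappear in a later, unrelated retrieval. The function name `drift_score` and the belief IDs are hypothetical.

```python
def drift_score(retrieved_ids, drift_ids):
    """Fraction of drift-turn beliefs present in the retrieved set.

    Assumed definition, not taken from the benchmark: 0.0 means none of the
    off-topic beliefs leaked into retrieval, 1.0 means all of them did.
    """
    drift = set(drift_ids)
    if not drift:
        return 0.0  # nothing could leak
    return len(drift & set(retrieved_ids)) / len(drift)


# Hypothetical session: four beliefs were introduced during off-topic turns.
drift_ids = ["d1", "d2", "d3", "d4"]

isolated = drift_score(["b7"], drift_ids)                 # only the on-topic belief
partial = drift_score(["b7", "d1", "d2", "d3"], drift_ids)
full_dump = drift_score(["b7"] + drift_ids, drift_ids)    # whole store returned

print(isolated, partial, full_dump)  # 0.0 0.75 1.0
```

Under this reading, any system that returns close to its full store on every turn scores near 1.0 regardless of how it ranks the results.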
Single-turn latency is not session latency. Under session load, all four comparison systems slow down by roughly 3–5× on mean latency. Hindsight, 672ms single-turn, exceeds 2,700ms mean per session turn.
| System | Single-turn mean | Session mean | Session p50 | Session p95 |
|---|---|---|---|---|
| Tenure | 13ms | 49ms | 48ms | 135ms |
| Vector baseline | 96ms | 319ms | 257ms | 728ms |
| Mem0 | 79ms | 383ms | 378ms | 692ms |
| Zep | 140ms | 397ms | 418ms | 660ms |
| Hindsight | 672ms | 2,736ms | 1,881ms | 6,163ms |
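For readers reproducing a table like this one, here is a minimal sketch of summarizing per-turn latencies into mean, p50, and p95. The nearest-rank percentile method and the sample timings are assumptions; the text does not say which percentile method was used.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample covering pct% of the data."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples_ms):
    """Mean, p50, and p95 for a list of per-turn latencies in milliseconds."""
    return {
        "mean": statistics.mean(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
    }


# Hypothetical latencies for ten turns of one session, in milliseconds.
session_ms = [40, 42, 45, 48, 50, 52, 55, 60, 70, 135]
summary = summarize(session_ms)
print(summary["p50"], summary["p95"])  # 50 135
```

A single slow turn barely moves p50 but dominates p95, which is why the table reports both alongside the mean.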
35 beliefs spanning two domain scopes, a three-hop supersession chain, and a secondary-user fixture for isolation validation. All comparison systems are pre-ingested and confirmed available before any retrieval case runs.
Each case carries mustExclude and shouldOnlyInclude constraints evaluated directly against retrieved belief IDs. No generative model scores the output. A pass requires all asserted tiers to be simultaneously satisfied.
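A minimal sketch of how such an assertion could be checked against retrieved belief IDs follows. The semantics are assumptions: `mustExclude` is read as "none of these IDs may appear" and `shouldOnlyInclude` as "every retrieved ID must be in this set". The function name `evaluate_case` and the result keys are hypothetical.

```python
def evaluate_case(retrieved_ids, must_exclude=(), should_only_include=None):
    """Check retrieved belief IDs against both constraints; no model in the loop."""
    retrieved = set(retrieved_ids)
    # Any forbidden belief in the result set is a hard failure.
    violations = sorted(retrieved & set(must_exclude))
    # If an allow-list is asserted, anything outside it counts as noise.
    if should_only_include is None:
        unexpected = []
    else:
        unexpected = sorted(retrieved - set(should_only_include))
    return {
        "passed": not violations and not unexpected,
        "mustExcludeViolations": violations,
        "unexpected": unexpected,
    }


# Hypothetical case: b7 is the current belief, b3 is a superseded one.
clean = evaluate_case(["b7"], must_exclude=["b3"], should_only_include=["b7"])
noisy = evaluate_case(["b7", "b3", "b9"], must_exclude=["b3"], should_only_include=["b7"])

print(clean["passed"])  # True
print(noisy)
```

The noisy result contains the correct belief and still fails, because both asserted constraints must hold at once.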
Comparison systems receive pin-status filtering, open-question routing, and scope isolation applied from Tenure's metadata. They do not fail on structural technicalities; they fail because cosine similarity returns 10–18 irrelevant beliefs per query.
Mem0, Zep, and Hindsight were run from pinned image digests without modification. All embedding-dependent configurations use mxbai-embed-large, ensuring differences reflect architecture, not embedding model choice.
Every major benchmark for LLM memory systems measures whether a model answered correctly, not whether the memory system retrieved correctly. We demonstrate that memory baselines achieve mean retrieval precision of just 0.05 to 0.08 on cases referencing their own extractions. We present PrecisionMemBench, an 89-case benchmark measuring retrieval precision independently of generative models, and Tenure, a local-first structured belief store achieving 89/89 passes with mean precision 1.0 and sub-15ms retrieval latency.
@article{flynt2026tenure,
  title={Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval},
  author={Flynt, Jeffrey},
  journal={arXiv preprint arXiv:2605.11325},
  year={2026}
}

The benchmark is published as a reusable artifact. Run it against any memory implementation, including your own.