PrecisionMemBench

The benchmark that measures
what memory tools actually do.

Every existing benchmark measures whether the model answered correctly.
This one measures whether the memory system retrieved correctly.
Those are not the same question.

89 Total cases
5 Systems evaluated
0 Active retrieval passes
1.0 Tenure precision

The problem

Why existing benchmarks are measuring the wrong thing

LoCoMo, LongMemEval, and every benchmark built on top of them measure answer quality through a generative judge. This design cannot distinguish a memory system that retrieves one correct belief from one that retrieves its entire store.

Answer-quality benchmarks

Retrieve everything. Route through a model. Score on F1. A system with 0.05 retrieval precision looks "competitive" because the downstream model compensates for the noise.

-- retrieve everything
SELECT * FROM beliefs
-- hope the model sorts it out
-- 18 results injected
-- precision: 0.05

PrecisionMemBench

Assert against retrieved beliefs directly. No generative model in the evaluation loop. mustExclude and shouldOnlyInclude make noise a hard failure.

-- query at the source
SELECT * FROM beliefs
WHERE scope = 'project:api'
AND alias_match('redis')
-- 1 result. precision: 1.0

Strip the generative model out and route retrieved context to a classifier, rules engine, or structured pipeline, and current retrieval implementations fail immediately. The generative model was never a neutral consumer; it was load-bearing infrastructure compensating for retrieval noise.
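To make the precision gap concrete, here is a minimal TypeScript sketch of how retrieval precision and recall could be computed from belief IDs alone, with no model in the loop. The function name `precisionRecall` and the `belief-N` IDs are illustrative assumptions, not the benchmark's actual API.

```typescript
// Illustrative sketch; names and IDs are assumptions, not the benchmark's API.
interface RetrievalScore {
  precision: number | null; // null when nothing was retrieved
  recall: number | null;    // null when nothing was expected
}

function precisionRecall(retrieved: string[], relevant: string[]): RetrievalScore {
  const retrievedIds = Array.from(new Set(retrieved)); // de-duplicate IDs
  const relevantIds = new Set(relevant);
  const hits = retrievedIds.filter((id) => relevantIds.has(id)).length;
  return {
    precision: retrievedIds.length === 0 ? null : hits / retrievedIds.length,
    recall: relevantIds.size === 0 ? null : hits / relevantIds.size,
  };
}

// "Retrieve everything": 18 beliefs come back, only one is relevant.
const wholeStore = Array.from({ length: 18 }, (_, i) => `belief-${i + 1}`);
const dump = precisionRecall(wholeStore, ["belief-7"]);

// Targeted retrieval: exactly the one relevant belief comes back.
const targeted = precisionRecall(["belief-7"], ["belief-7"]);

console.log(dump);     // recall is 1, precision is 1/18
console.log(targeted); // precision and recall are both 1
```

Both calls reach recall 1.0, which is why a score based on the final answer cannot separate them. Only precision (1/18 versus 1.0) does.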

Results

Leaderboard

89 cases. 11 systems. One metric that matters: active retrieval passes, meaning cases where a retrievalPrecision assertion was satisfied.

System                      Active passes  Total passes  Mean precision  Mean recall  Retrieval p50  Ingestion total
tenure                      43/43          77/77         1.00            1.00         9.77ms         1.00s
supermemory                 17/17          44/77         0.43            0.55         819.48ms       0.00s
gbrain                      5/5            34/77         0.14            0.17         543.84ms       28.60s
agentmemory                 0/0            7/77          0.17            0.97         82.28ms        1.10s
yourmemory                  0/0            21/77         0.17            0.88         313.39ms       16.40s
atomicmemory                0/0            9/77          0.15            0.95         71.01ms        658.90s
zep                         0/0            9/77          0.09            0.95         124.36ms       897.00s
vector (mxbai-embed-large)  0/0            11/77         0.09            1.00         71.87ms        -
hindsight                   0/0            9/77          0.06            1.00         589.86ms       173.30s
mem0                        0/0            9/77          0.06            0.99         64.94ms        111.30s
a-mem                       0/0            9/77          0.06            0.99         13.80ms        178.80s

Active retrieval passes

Cases with a retrievalPrecision assertion that was satisfied. The only pass type that demonstrates verified retrieval capability. A system cannot accumulate these by returning everything or nothing.

Structural passes

Cases asserting scope isolation, supersession exclusion, or type routing without a precision assertion, and the structural property holds.

Trivially empty passes

Cases where the expected result set is empty by design (empty query, maxBeliefs: 0). Any system returning an empty set passes by construction. retrievalPrecision is null for these cases.
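The three pass types can be read as a simple classification over case outcomes. The sketch below is one hedged way to express it; the `CaseOutcome` fields and the `classifyOutcome` and `tally` helpers are hypothetical names chosen for illustration, not the benchmark's own types.

```typescript
// Illustrative sketch; the field and type names are assumptions.
type PassKind = "active" | "structural" | "trivially-empty" | "fail";

interface CaseOutcome {
  passed: boolean;                // every assertion on the case held
  expectedEmpty: boolean;         // expected result set is empty by design
  hasPrecisionAssertion: boolean; // case carries a retrievalPrecision assertion
}

function classifyOutcome(outcome: CaseOutcome): PassKind {
  if (!outcome.passed) return "fail";
  // Empty-by-design cases cannot demonstrate retrieval capability.
  if (outcome.expectedEmpty) return "trivially-empty";
  if (outcome.hasPrecisionAssertion) return "active";
  return "structural";
}

function tally(outcomes: CaseOutcome[]): Record<PassKind, number> {
  const counts: Record<PassKind, number> = {
    active: 0,
    structural: 0,
    "trivially-empty": 0,
    fail: 0,
  };
  for (const outcome of outcomes) counts[classifyOutcome(outcome)] += 1;
  return counts;
}

const sampleCounts = tally([
  { passed: true, expectedEmpty: false, hasPrecisionAssertion: true },
  { passed: true, expectedEmpty: false, hasPrecisionAssertion: false },
  { passed: true, expectedEmpty: true, hasPrecisionAssertion: false },
  { passed: false, expectedEmpty: false, hasPrecisionAssertion: true },
]);
console.log(sampleCounts);
```

Checking `expectedEmpty` before the precision flag keeps an empty-by-design case from ever being counted as an active pass, which matches the intent that active passes cannot be earned by returning nothing.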

All comparison systems received a schema-aware wrapper that grants them pin-status filtering, open-question routing, and scope isolation for free at evaluation time. Competitors therefore fail purely on retrieval precision; the structural routing logic was given to them. Results reflect pinned Docker image digests.

Results by category

Where the failures happen

89 cases across 14 categories. Alias resolution and session-level noise isolation are where the precision gap is most stark.

Category                        Cases  Tenure  Vector  Mem0   Zep    Hindsight
Alias resolution                23     23/23   0/23    0/23   0/23   0/23
Session-level noise isolation   12     12/12   0/12    0/12   0/12   0/12
Scope disambiguation            12     12/12   1/12    0/12   0/12   0/12
Fuzzy matching & prefix guards  7      7/7     0/7     0/7    0/7    0/7
Type routing / open questions   6      6/6     0/6     0/6    0/6    0/6
Design boundary cases           6      6/6     2/6     2/6    2/6    2/6
Budget eviction & capacity      5      5/5     3/5     3/5    3/5    3/5
Persona prelude content         4      4/4     2/4     1/4    1/4    1/4
Relation expansion              4      4/4     0/4     0/4    0/4    0/4
Supersession chain exclusion    3      3/3     0/3     0/3    0/3    0/3
Ranking stability               3      3/3     1/3     1/3    1/3    1/3
Counter-signal retrieval        2      2/2     0/2     0/2    0/2    0/2
Cross-user isolation            1      1/1     1/1     1/1    1/1    1/1
Cold start behavior             1      1/1     1/1     1/1    1/1    1/1
Total                           89     89/89   11/89   9/89   9/89   9/89

Embedding model invariance

A bigger model doesn't fix this

The natural objection: use a more capable embedding model. We tested across a 20× range in scale. Precision is identical. The fix is not a better ruler; it is a different measurement instrument.

Model               Dimensions  Precision  Passes  Mean latency  p95 latency
nomic-embed-text    768         0.09       11/77   43ms          85ms
mxbai-embed-large   1024        0.09       11/77   96ms          257ms
qwen3-8b            4096        0.09       11/77   1131ms        2605ms

At roughly 26× the mean latency of nomic-embed-text (1,131ms versus 43ms per query), the 8-billion-parameter qwen3 model produces identical precision. All 11 passes in every configuration are trivially empty or budget-forced cases. Active retrieval passes across all three: 0.

Multi-turn noise isolation

After 8 off-topic turns, can you get your context back?

The session cases test one thing: whether beliefs introduced during off-topic drift turns contaminate retrieval on subsequent unrelated turns. A drift score of 0 is perfect isolation. All competitors score 0.92–1.0.

Turn                          Tenure  Vector  Mem0  Zep   Hindsight
Turn 9: implicit re-entry     0.00    1.00    1.00  1.00  1.00
Turn 10: explicit re-entry    0.00    0.94    1.00  1.00  0.94
Cross-session formative       0.00    0.94    1.00  0.92  1.00

Hindsight at turn 10: drift score 1.0 with the correct belief absent entirely (recall 0). The belief is not merely ranked low; it is missing from the result set, with 4 explicit noise belief violations flagged on top of a full-corpus return.
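The page does not spell out the drift-score formula, so the following is only a sketch under a stated assumption: drift is taken to be the fraction of beliefs introduced during the off-topic turns that show up in a later, unrelated retrieval. Under that assumption, 0 means none leaked and 1 means all of them did. The `driftScore` helper and the belief IDs are hypothetical.

```typescript
// ASSUMED definition (not confirmed by the source):
// drift = leaked drift-turn beliefs / all drift-turn beliefs.
function driftScore(retrievedIds: string[], driftBeliefIds: string[]): number {
  if (driftBeliefIds.length === 0) return 0; // nothing could leak
  const retrieved = new Set(retrievedIds);
  const leaked = driftBeliefIds.filter((id) => retrieved.has(id)).length;
  return leaked / driftBeliefIds.length;
}

// Hypothetical beliefs introduced during the off-topic turns.
const driftBeliefs = ["drift-1", "drift-2", "drift-3", "drift-4"];

// Isolated retrieval: only the on-topic belief comes back.
const isolated = driftScore(["redis-cache"], driftBeliefs);

// Full-corpus return: every drift belief comes back with the target.
const fullCorpus = driftScore(["redis-cache", ...driftBeliefs], driftBeliefs);

// Worst case: drift beliefs come back and the target does not.
const targetMissing = driftScore(driftBeliefs, driftBeliefs);

console.log(isolated, fullCorpus, targetMissing);
```

Under this definition drift says nothing about whether the correct belief was found, which is why a drift score is read alongside recall: the third call leaks everything and also misses the target.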

Session-turn latency

Single-turn latency is not session latency. Under session load, every comparison system slows by roughly 3–5×. Hindsight, at 672ms single-turn, exceeds 2,700ms mean per session turn.

System           Single-turn mean  Session mean  Session p50  Session p95
Tenure           13ms              49ms          48ms         135ms
Vector baseline  96ms              319ms         257ms        728ms
Mem0             79ms              383ms         378ms        692ms
Zep              140ms             397ms         418ms        660ms
Hindsight        672ms             2,736ms       1,881ms      6,163ms

Methodology

How it works

01

Fixed seed corpus

35 beliefs spanning two domain scopes, a three-hop supersession chain, and a secondary-user fixture for isolation validation. The corpus is ingested into every comparison system and confirmed available before any retrieval case runs.
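A three-hop supersession chain means four beliefs linked in sequence, of which only the last is still active. As a rough illustration, the sketch below walks such a chain to its active terminal. The `Belief` shape, the `supersededBy` field, and the `cache-vN` IDs are assumptions made for this example, not the corpus's real schema.

```typescript
// Illustrative sketch; the Belief shape and IDs are assumptions.
interface Belief {
  id: string;
  supersededBy?: string; // ID of the belief that replaced this one, if any
}

// Follow supersededBy links until reaching a belief nothing has replaced.
function activeTerminal(startId: string, beliefs: Map<string, Belief>): string {
  const seen = new Set<string>();
  let currentId = startId;
  while (true) {
    if (seen.has(currentId)) throw new Error(`cycle detected at ${currentId}`);
    seen.add(currentId);
    const belief = beliefs.get(currentId);
    if (belief === undefined) throw new Error(`unknown belief: ${currentId}`);
    if (belief.supersededBy === undefined) return currentId;
    currentId = belief.supersededBy;
  }
}

// Three hops: v1 -> v2 -> v3 -> v4, where only v4 is active.
const chain = new Map<string, Belief>();
chain.set("cache-v1", { id: "cache-v1", supersededBy: "cache-v2" });
chain.set("cache-v2", { id: "cache-v2", supersededBy: "cache-v3" });
chain.set("cache-v3", { id: "cache-v3", supersededBy: "cache-v4" });
chain.set("cache-v4", { id: "cache-v4" });

console.log(activeTerminal("cache-v1", chain)); // prints "cache-v4"
```

A retrieval that returns `cache-v1`, `cache-v2`, or `cache-v3` alongside `cache-v4` would be surfacing superseded beliefs, which is the failure the supersession cases are designed to catch.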

02

Hard assertions, not judge scores

Each case carries mustExclude and shouldOnlyInclude constraints evaluated directly against retrieved belief IDs. No generative model scores the output. A pass requires all asserted tiers to be simultaneously satisfied.
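A minimal sketch of what evaluating these constraints against retrieved IDs could look like follows. Only the `mustExclude` and `shouldOnlyInclude` names come from the text above; the `CaseAssertions` and `AssertionResult` types, the `evaluateCase` function, and the reading of `shouldOnlyInclude` as a pure allow-list are assumptions.

```typescript
// Illustrative sketch; only the two field names come from the benchmark text.
interface CaseAssertions {
  mustExclude?: string[];       // IDs that must never appear
  shouldOnlyInclude?: string[]; // assumed allow-list: anything else is noise
}

interface AssertionResult {
  pass: boolean;
  violations: string[];
}

function evaluateCase(retrievedIds: string[], assertions: CaseAssertions): AssertionResult {
  const violations: string[] = [];

  if (assertions.mustExclude !== undefined) {
    const banned = new Set(assertions.mustExclude);
    for (const id of retrievedIds) {
      if (banned.has(id)) violations.push(`mustExclude violated by ${id}`);
    }
  }

  if (assertions.shouldOnlyInclude !== undefined) {
    const allowed = new Set(assertions.shouldOnlyInclude);
    for (const id of retrievedIds) {
      if (!allowed.has(id)) violations.push(`shouldOnlyInclude violated by ${id}`);
    }
  }

  // Every asserted constraint must hold at once for the case to pass.
  return { pass: violations.length === 0, violations };
}

const redisCase: CaseAssertions = {
  mustExclude: ["redis-v1"],
  shouldOnlyInclude: ["redis-v3"],
};

const precise = evaluateCase(["redis-v3"], redisCase);
const noisy = evaluateCase(["redis-v3", "redis-v1", "postgres-note"], redisCase);
console.log(precise.pass, noisy.pass, noisy.violations);
```

The noisy retrieval still contains the correct belief, so a downstream model might well answer correctly; it fails here because the extra IDs are counted directly rather than absorbed by a judge.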

03

Schema-aware wrapper for fairness

Comparison systems receive pin-status filtering, open-question routing, and scope isolation applied from Tenure's metadata. They do not fail due to structural technicalities — only because cosine similarity produces 10–18 irrelevant beliefs per query.
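To illustrate why the wrapper removes structural failures without rescuing precision, here is a hedged sketch of a post-filter applied to a comparison system's raw results. Everything in it is hypothetical: the `BeliefMetadata` and `WrapperQuery` shapes, the `applySchemaWrapper` function, and the interpretation of pin-status filtering as dropping pinned beliefs unless the case asks for them.

```typescript
// Illustrative sketch; all shapes and semantics below are assumptions.
interface BeliefMetadata {
  scope: string;
  pinned: boolean;
  type: "fact" | "open_question";
}

interface WrapperQuery {
  scope: string;
  includePinned: boolean;
  wantOpenQuestions: boolean;
}

function applySchemaWrapper(
  rawIds: string[],
  metadata: Map<string, BeliefMetadata>,
  query: WrapperQuery,
): string[] {
  return rawIds.filter((id) => {
    const meta = metadata.get(id);
    if (meta === undefined) return false;                  // unknown IDs are dropped
    if (meta.scope !== query.scope) return false;          // scope isolation
    if (meta.pinned && !query.includePinned) return false; // pin-status filtering
    // Type routing: keep open questions only when the query asks for them.
    return (meta.type === "open_question") === query.wantOpenQuestions;
  });
}

const beliefMeta = new Map<string, BeliefMetadata>();
beliefMeta.set("redis-cache", { scope: "project:api", pinned: false, type: "fact" });
beliefMeta.set("redis-ttl-note", { scope: "project:api", pinned: false, type: "fact" });
beliefMeta.set("billing-redis", { scope: "project:billing", pinned: false, type: "fact" });
beliefMeta.set("api-open-q", { scope: "project:api", pinned: false, type: "open_question" });
beliefMeta.set("persona-tone", { scope: "project:api", pinned: true, type: "fact" });

const rawResults = ["redis-cache", "redis-ttl-note", "billing-redis", "api-open-q", "persona-tone", "ghost"];
const wrapped = applySchemaWrapper(rawResults, beliefMeta, {
  scope: "project:api",
  includePinned: false,
  wantOpenQuestions: false,
});
console.log(wrapped); // in-scope noise such as "redis-ttl-note" survives the wrapper
```

The wrapper removes out-of-scope, pinned, and wrongly typed beliefs, but an in-scope belief that is simply irrelevant to the query passes through untouched. That remaining noise is what the precision assertions measure.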

04

Pinned Docker digests

Mem0, Zep, and Hindsight were run from pinned image digests without modification. All embedding-dependent configurations use mxbai-embed-large, ensuring differences reflect architecture, not embedding model choice.

Case taxonomy

  • Alias resolution (23) — variant surface forms, short-form, natural-language, multi-word
  • Scope disambiguation (12) — hard scope isolation between domain contexts
  • Session noise isolation (12) — 10-turn drift sessions with per-turn assertions
  • Fuzzy matching & prefix guards (7) — transpositions, near-miss, intentional prefix blocks
  • Type routing & open questions (6) — routing logic by belief type
  • Design boundary cases (6) — intentional empty returns and edge behavior
  • Budget eviction & capacity (5) — slot constraints, priority, flooding resistance
  • Relation expansion (4) — one-hop participant joins with scope filtering
  • Persona prelude content (4) — unconditional belief injection
  • Supersession chain exclusion (3) — three-hop chain, active terminal only
  • Ranking stability (3) — consistent ordering across equivalent queries
  • Counter-signal retrieval (2) — rejected terms surfacing active replacements
  • Cross-user isolation (1) — structural exclusion regardless of semantic proximity
  • Cold start behavior (1) — empty store, no error

Research

Published on arXiv

arXiv:2605.11325 [cs.IR] · May 2026

Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval

Every major benchmark for LLM memory systems measures whether a model answered correctly, not whether the memory system retrieved correctly. We demonstrate that memory baselines achieve mean retrieval precision of just 0.05 to 0.08 on cases referencing their own extractions. We present PrecisionMemBench, an 89-case benchmark measuring retrieval precision independently of generative models, and Tenure, a local-first structured belief store achieving 89/89 passes with mean precision 1.0 and sub-15ms retrieval latency.

BibTeX
@article{flynt2026tenure,
  title={Structured Belief State and the First
         Precision-Aware Benchmark for LLM
         Memory Retrieval},
  author={Flynt, Jeffrey},
  journal={arXiv preprint arXiv:2605.11325},
  year={2026}
}

Reproducible

Run it yourself

The benchmark is published as a reusable artifact. Run it against any memory implementation — including your own.

precisionmembench
$ git clone https://github.com/tenurehq/precisionMemBench
$ cd precisionMemBench
$ npm install
$ MEMORY_PROVIDER=PROVIDER npx ava retrieval.external.eval.test.ts
$ MEMORY_PROVIDER=PROVIDER npx ava session-retrieval.external.eval.test.ts
 
Seeding corpus (35 beliefs)...
Running 89 cases...
Results written to test-results/