Name: precisionMemBench
Creator: Tenure
License: https://opensource.org/licenses/MIT

The problem

Why existing benchmarks are measuring the wrong thing

LoCoMo, LongMemEval, and every benchmark built on top of them measure answer quality through a generative judge. This design cannot distinguish a memory system that retrieves one correct belief from one that retrieves its entire store.

✗

Answer-quality benchmarks

Retrieve everything. Route through a model. Score on F1. A system with 0.05 retrieval precision looks "competitive" because the downstream model compensates for the noise.

-- retrieve everything

SELECT * FROM beliefs

-- hope the model sorts it out

-- 18 results injected

-- precision: 0.05

✓

PrecisionMemBench

Assert against retrieved beliefs directly. No generative model in the evaluation loop. mustExclude and shouldOnlyInclude make noise a hard failure.

-- query at the source

SELECT * FROM beliefs

WHERE scope = 'project:api'

AND alias_match('redis')

-- 1 result. precision: 1.0

Strip the generative model out and route retrieved context to a classifier, rules engine, or structured pipeline, and current retrieval implementations fail immediately. The generative model was never a neutral consumer; it was load-bearing infrastructure compensating for retrieval noise.
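The arithmetic behind the two panels above can be sketched as follows. This is a minimal illustration, not the benchmark's scoring code: the helper `precisionRecall` and the belief IDs are hypothetical, and precision is assumed to be the usual set-based ratio of relevant retrieved beliefs to all retrieved beliefs.

```typescript
// Hypothetical helper: set-based precision and recall over belief IDs.
interface RetrievalScore {
  precision: number | null; // null when nothing was retrieved
  recall: number | null; // null when nothing was expected
}

function precisionRecall(retrieved: string[], relevant: string[]): RetrievalScore {
  const retrievedSet = new Set(retrieved);
  const relevantSet = new Set(relevant);
  let hits = 0;
  retrievedSet.forEach((id) => {
    if (relevantSet.has(id)) hits += 1;
  });
  return {
    precision: retrievedSet.size === 0 ? null : hits / retrievedSet.size,
    recall: relevantSet.size === 0 ? null : hits / relevantSet.size,
  };
}

// Retrieve-everything panel: 18 beliefs returned, one of them relevant.
const everything = Array.from({ length: 18 }, (_, i) => `belief-${i + 1}`);
const dumpScore = precisionRecall(everything, ["belief-7"]);

// Scoped alias query panel: only the relevant belief comes back.
const scopedScore = precisionRecall(["belief-7"], ["belief-7"]);
```

Under these assumptions `dumpScore` has recall 1 but precision 1/18 (about 0.056, close to the 0.05 shown in the panel), while `scopedScore` has precision and recall of 1. Recall alone cannot tell the two systems apart.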

Results

Leaderboard

89 cases. 11 systems. One metric that matters: active retrieval passes, the cases where a retrievalPrecision assertion was satisfied.

System	Active passes	Total passes	Mean precision	Mean recall	Retrieval p50	Ingestion total
tenure (this system)	43/43	77/77	1.00	1.00	9.77ms	1.00s
supermemory	17/17	44/77	0.43	0.55	819.48ms	0.00s
gbrain	5/5	34/77	0.14	0.17	543.84ms	28.60s
agentmemory	0/0	7/77	0.17	0.97	82.28ms	1.10s
yourmemory	0/0	21/77	0.17	0.88	313.39ms	16.40s
atomicmemory	0/0	9/77	0.15	0.95	71.01ms	658.90s
zep	0/0	9/77	0.09	0.95	124.36ms	897.00s
vector (mxbai-embed-large)	0/0	11/77	0.09	1.00	71.87ms	—
hindsight	0/0	9/77	0.06	1.00	589.86ms	173.30s
mem0	0/0	9/77	0.06	0.99	64.94ms	111.30s
a-mem	0/0	9/77	0.06	0.99	13.80ms	178.80s

Active retrieval passes

Cases with a retrievalPrecision assertion that was satisfied. The only pass type that demonstrates verified retrieval capability. A system cannot accumulate these by returning everything or nothing.

Structural passes

Cases that assert scope isolation, supersession exclusion, or type routing without a precision assertion, and in which the structural property holds.

Trivially empty passes

Cases where the expected result set is empty by design (empty query, maxBeliefs: 0). Any system returning an empty set passes by construction. retrievalPrecision is null for these cases.
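The three pass types can be pictured as a small classification step. This is a sketch only: `CaseOutcome`, its field names, and `classifyPass` are hypothetical and are not taken from the benchmark's code.

```typescript
// Hypothetical shapes; the benchmark's real case schema may differ.
type PassKind = "active" | "structural" | "trivially-empty" | "fail";

interface CaseOutcome {
  passed: boolean; // every asserted tier held
  expectsEmptyResult: boolean; // e.g. empty query or maxBeliefs: 0
  hasPrecisionAssertion: boolean; // case carries a retrievalPrecision assertion
}

function classifyPass(outcome: CaseOutcome): PassKind {
  if (!outcome.passed) return "fail";
  // Empty-by-design cases are checked first: precision is null for them.
  if (outcome.expectsEmptyResult) return "trivially-empty";
  return outcome.hasPrecisionAssertion ? "active" : "structural";
}
```

Only the "active" bucket says anything about verified retrieval, which is why the leaderboard reports it separately from total passes.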

All comparison systems received a schema-aware wrapper granting them pin-status filtering, open-question routing, and scope isolation for free at evaluation time. Competitors fail purely on retrieval precision; the structural routing logic was given to them. Results reflect pinned Docker image digests.

Results by category

Where the failures happen

89 cases across 14 categories. Alias resolution and session-level noise isolation are where the precision gap is most stark.

Category	Cases	Tenure	Vector	Mem0	Zep	Hindsight
Alias resolution	23	23/23	0/23	0/23	0/23	0/23
Session-level noise isolation	12	12/12	0/12	0/12	0/12	0/12
Scope disambiguation	12	12/12	1/12	0/12	0/12	0/12
Fuzzy matching & prefix guards	7	7/7	0/7	0/7	0/7	0/7
Type routing / open questions	6	6/6	0/6	0/6	0/6	0/6
Design boundary cases	6	6/6	2/6	2/6	2/6	2/6
Budget eviction & capacity	5	5/5	3/5	3/5	3/5	3/5
Persona prelude content	4	4/4	2/4	1/4	1/4	1/4
Relation expansion	4	4/4	0/4	0/4	0/4	0/4
Supersession chain exclusion	3	3/3	0/3	0/3	0/3	0/3
Ranking stability	3	3/3	1/3	1/3	1/3	1/3
Counter-signal retrieval	2	2/2	0/2	0/2	0/2	0/2
Cross-user isolation	1	1/1	1/1	1/1	1/1	1/1
Cold start behavior	1	1/1	1/1	1/1	1/1	1/1
Total	89	89/89	11/89	9/89	9/89	9/89

Embedding model invariance

A bigger model doesn't fix this

The natural objection: use a more capable embedding model. We tested across a 20× range in scale. Precision is identical. The fix is not a better ruler; it is a different measurement instrument.

Model	Dimensions	Precision	Passes	Mean latency	p95 latency
nomic-embed-text	768	0.09	11/77	43ms	85ms
mxbai-embed-large	1024	0.09	11/77	96ms	257ms
qwen3-8b	4096	0.09	11/77	1131ms	2605ms

At roughly 26× the mean latency of nomic-embed-text (1,131ms versus 43ms per query), the 8-billion-parameter qwen3 model produces identical precision. All 11 passes in every configuration are trivially empty or budget-forced cases. Active retrieval passes across all three: 0.
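A rough way to see why swapping the embedding model may not move precision: if retrieval still returns the relevant belief together with a pile of irrelevant ones, precision is fixed by the counts alone. The sketch below is an illustration under that assumption, not an analysis from the source, and `precisionWithNoise` is a hypothetical helper.

```typescript
// ASSUMPTION: the retriever returns some relevant beliefs plus some number of
// irrelevant ones. Precision then depends only on the two counts, not on
// which embedding model produced the ranking.
function precisionWithNoise(relevantReturned: number, irrelevantReturned: number): number | null {
  const total = relevantReturned + irrelevantReturned;
  return total === 0 ? null : relevantReturned / total;
}

// One relevant belief returned alongside increasing amounts of noise.
const noiseLevels = [0, 4, 10, 18];
const precisionByNoise = noiseLevels.map((noise) => ({
  noise,
  precision: precisionWithNoise(1, noise),
}));
```

With one relevant belief and ten irrelevant ones the value is 1/11, about 0.09. Whether that is how the 0.09 in the table arises is not stated in the source.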

Multi-turn noise isolation

After 8 off-topic turns, can you get your context back?

The session cases test one thing: whether beliefs introduced during off-topic drift turns contaminate retrieval on subsequent unrelated turns. A drift score of 0 is perfect isolation. All competitors score 0.92–1.0.

Turn	Vector	Mem0	Zep	Hindsight
Turn 9 — implicit re-entry	1.00	1.00	1.00	1.00
Turn 10 — explicit re-entry	0.94	1.00	1.00	0.94
Cross-session formative	0.94	1.00	0.92	1.00

Hindsight at turn 10: drift score 1.0 with the correct belief absent entirely (recall 0). The belief is not ranked low; it is missing from the result set, and 4 explicit noise-belief violations are flagged on top of a full-corpus return.
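The source does not give a formula for the drift score. One plausible reading, offered purely as an assumption, is the share of beliefs introduced during off-topic turns that leak into a later retrieval. The function `driftScore` and its parameters are hypothetical.

```typescript
// ASSUMPTION: drift = fraction of off-topic (drift-turn) beliefs that appear
// in a later retrieval. 0 means perfect isolation, 1 means every one leaked.
// The benchmark's actual definition may differ.
function driftScore(retrievedIds: string[], driftBeliefIds: string[]): number {
  const drift = Array.from(new Set(driftBeliefIds));
  if (drift.length === 0) return 0; // nothing could leak
  const retrieved = new Set(retrievedIds);
  const leaked = drift.filter((id) => retrieved.has(id)).length;
  return leaked / drift.length;
}

// Hypothetical session: two beliefs were introduced during off-topic turns.
const isolated = driftScore(["redis-current"], ["lunch-order", "weekend-plan"]);
const contaminated = driftScore(
  ["lunch-order", "weekend-plan", "postgres-pool"],
  ["lunch-order", "weekend-plan"],
);
```

Under this reading a score of 1 is compatible with recall 0, as in the second call: every off-topic belief is returned while the belief the turn actually needs is absent.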

Session-turn latency

Single-turn latency is not session latency. Under session load, every comparison system in the table slows by roughly 3–5× relative to its single-turn mean. Hindsight, at 672ms single-turn, exceeds 2,700ms mean per session turn.

System	Single-turn mean	Session mean	Session p50	Session p95
Tenure	13ms	49ms	48ms	135ms
Vector baseline	96ms	319ms	257ms	728ms
Mem0	79ms	383ms	378ms	692ms
Zep	140ms	397ms	418ms	660ms
Hindsight	672ms	2,736ms	1,881ms	6,163ms
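For readers reproducing the latency columns, the sketch below shows one common way to derive mean, p50, and p95 from per-turn timings. It assumes the nearest-rank percentile method; the source does not say which method the benchmark uses, and the sample timings are made up for illustration.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of samples are less than or equal to it. (Assumed method, not confirmed.)
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

function mean(samples: number[]): number {
  if (samples.length === 0) throw new RangeError("no samples");
  return samples.reduce((sum, value) => sum + value, 0) / samples.length;
}

// Made-up per-turn latencies in milliseconds, with two slow outliers.
const turnLatenciesMs = [12, 15, 11, 90, 14, 13, 16, 12, 15, 200];
const latencySummary = {
  mean: mean(turnLatenciesMs),
  p50: percentile(turnLatenciesMs, 50),
  p95: percentile(turnLatenciesMs, 95),
};
```

In this made-up sample the mean (39.8ms) sits well above the p50 (14ms) because of the two slow turns, which is why a table like the one above reports mean, p50, and p95 side by side.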

Methodology

How it works

Fixed seed corpus

35 beliefs spanning two domain scopes, a three-hop supersession chain, and a secondary-user fixture for isolation validation. All comparison systems are pre-ingested and confirmed available before any retrieval case runs.
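To make the supersession chain concrete, here is a minimal sketch of a belief record and of following a chain to its active terminal. The `SeedBelief` shape, the `supersededBy` field, and the IDs are all hypothetical; the real corpus schema is not shown in the source.

```typescript
// Hypothetical belief shape; field names are illustrative only.
interface SeedBelief {
  id: string;
  scope: string;
  supersededBy?: string; // ID of the belief that replaces this one
}

// Follow supersededBy links until reaching a belief nothing has replaced.
function resolveActive(beliefs: SeedBelief[], startId: string): SeedBelief | undefined {
  const byId = new Map<string, SeedBelief>();
  beliefs.forEach((belief) => byId.set(belief.id, belief));
  const seen = new Set<string>();
  let current = byId.get(startId);
  while (current !== undefined) {
    const nextId = current.supersededBy;
    if (nextId === undefined) return current; // active terminal
    if (seen.has(current.id)) return undefined; // cycle guard
    seen.add(current.id);
    current = byId.get(nextId);
  }
  return undefined; // unknown ID or dangling link
}

// A three-hop chain: v1 -> v2 -> v3 -> v4, where only v4 is active.
const cacheChain: SeedBelief[] = [
  { id: "cache-v1", scope: "project:api", supersededBy: "cache-v2" },
  { id: "cache-v2", scope: "project:api", supersededBy: "cache-v3" },
  { id: "cache-v3", scope: "project:api", supersededBy: "cache-v4" },
  { id: "cache-v4", scope: "project:api" },
];
const activeCache = resolveActive(cacheChain, "cache-v1");
```

A retrieval that returns `cache-v1`, `cache-v2`, or `cache-v3` alongside `cache-v4` would surface stale beliefs, which is the failure the supersession cases are meant to catch.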

Hard assertions, not judge scores

Each case carries mustExclude and shouldOnlyInclude constraints evaluated directly against retrieved belief IDs. No generative model scores the output. A pass requires all asserted tiers to be simultaneously satisfied.
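A minimal sketch of how such constraints might be checked against retrieved belief IDs follows. Only the names `mustExclude` and `shouldOnlyInclude` come from the text above; their exact semantics, the `CaseAssertions` shape, and `evaluateAssertions` are assumptions, and a real case may carry further tiers not covered here.

```typescript
// Assumed semantics:
//   mustExclude       - none of these IDs may appear in the result
//   shouldOnlyInclude - every retrieved ID must be on this allow-list
interface CaseAssertions {
  mustExclude?: string[];
  shouldOnlyInclude?: string[];
}

interface AssertionResult {
  pass: boolean;
  violations: string[];
}

function evaluateAssertions(retrievedIds: string[], assertions: CaseAssertions): AssertionResult {
  const violations: string[] = [];
  const retrieved = new Set(retrievedIds);

  (assertions.mustExclude ?? []).forEach((id) => {
    if (retrieved.has(id)) violations.push(`mustExclude violated by ${id}`);
  });

  if (assertions.shouldOnlyInclude !== undefined) {
    const allowed = new Set(assertions.shouldOnlyInclude);
    retrieved.forEach((id) => {
      if (!allowed.has(id)) violations.push(`shouldOnlyInclude violated by ${id}`);
    });
  }

  // No generative judge: the case passes only if no constraint was violated.
  return { pass: violations.length === 0, violations };
}

const redisCase: CaseAssertions = {
  mustExclude: ["redis-old"],
  shouldOnlyInclude: ["redis-current"],
};
const preciseResult = evaluateAssertions(["redis-current"], redisCase);
const noisyResult = evaluateAssertions(["redis-current", "redis-old", "postgres-pool"], redisCase);
```

Here `preciseResult` passes, while `noisyResult` fails with three violations: one for the excluded ID and two for IDs outside the allow-list. Extra results are a failure in their own right, not something a downstream model is left to ignore.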

Schema-aware wrapper for fairness

Comparison systems receive pin-status filtering, open-question routing, and scope isolation applied from Tenure's metadata. They do not fail due to structural technicalities — only because cosine similarity produces 10–18 irrelevant beliefs per query.
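The idea of the wrapper can be sketched as a filter layered over any retriever. Everything here is hypothetical (the `Retrieve` type, `withScopeIsolation`, the sample store); it shows only the scope-isolation part and is not the benchmark's actual wrapper.

```typescript
interface WrappedBelief {
  id: string;
  scope: string;
}

// Hypothetical retriever signature, kept synchronous for brevity.
type Retrieve = (query: string) => WrappedBelief[];

// Applies scope isolation on top of whatever the inner retriever returns.
function withScopeIsolation(inner: Retrieve, scope: string): Retrieve {
  return (query) => inner(query).filter((belief) => belief.scope === scope);
}

// Stand-in for a system that returns its whole store for any query.
const sampleStore: WrappedBelief[] = [
  { id: "redis-current", scope: "project:api" },
  { id: "postgres-pool", scope: "project:api" },
  { id: "rate-limit", scope: "project:api" },
  { id: "redis-web", scope: "project:web" },
];
const returnEverything: Retrieve = () => sampleStore;

const wrappedRetrieve = withScopeIsolation(returnEverything, "project:api");
const wrappedResult = wrappedRetrieve("redis");
```

In this sketch the wrapper removes the out-of-scope `redis-web` belief, so the system no longer fails on scope. It still returns two in-scope beliefs unrelated to the query, and that remaining noise is what the precision assertions measure.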

Pinned Docker digests

Mem0, Zep, and Hindsight were run from pinned image digests without modification. All embedding-dependent configurations use mxbai-embed-large, ensuring differences reflect architecture, not embedding model choice.

Case taxonomy

Alias resolution (23) — variant surface forms, short-form, natural-language, multi-word
Scope disambiguation (12) — hard scope isolation between domain contexts
Session noise isolation (12) — 10-turn drift sessions with per-turn assertions
Fuzzy matching & prefix guards (7) — transpositions, near-miss, intentional prefix blocks
Type routing & open questions (6) — routing logic by belief type
Design boundary cases (6) — intentional empty returns and edge behavior
Budget eviction & capacity (5) — slot constraints, priority, flooding resistance
Relation expansion (4) — one-hop participant joins with scope filtering
Persona prelude content (4) — unconditional belief injection
Supersession chain exclusion (3) — three-hop chain, active terminal only
Ranking stability (3) — consistent ordering across equivalent queries
Counter-signal retrieval (2) — rejected terms surfacing active replacements
Cross-user isolation (1) — structural exclusion regardless of semantic proximity
Cold start behavior (1) — empty store, no error
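As an illustration of what the alias-resolution, scope-disambiguation, and prefix-guard categories exercise, here is a small sketch of a scoped alias lookup in the spirit of the `WHERE scope = ... AND alias_match(...)` query shown earlier. The data shape and the matching rule (whole-token, case-insensitive, single-word aliases only) are assumptions and do not describe Tenure's implementation.

```typescript
interface AliasedBelief {
  id: string;
  scope: string;
  aliases: string[]; // single-word surface forms, for this sketch only
}

// Lower-case the query and split it into alphanumeric tokens.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0);
}

// A belief matches only if it is in scope AND one of its aliases equals a
// whole query token, so the prefix "red" does not match the alias "redis".
function aliasQuery(beliefs: AliasedBelief[], scope: string, query: string): AliasedBelief[] {
  const tokens = new Set(tokenize(query));
  return beliefs.filter(
    (belief) =>
      belief.scope === scope &&
      belief.aliases.some((alias) => tokens.has(alias.toLowerCase())),
  );
}

const aliasedBeliefs: AliasedBelief[] = [
  { id: "b-redis", scope: "project:api", aliases: ["redis", "cache"] },
  { id: "b-redis-web", scope: "project:web", aliases: ["redis"] },
  { id: "b-pg", scope: "project:api", aliases: ["postgres", "pg"] },
];

const apiHits = aliasQuery(aliasedBeliefs, "project:api", "Which Redis version are we on?");
const prefixHits = aliasQuery(aliasedBeliefs, "project:api", "red");
```

Under these assumptions `apiHits` contains only `b-redis` and `prefixHits` is empty. Multi-word aliases and fuzzy matches such as transpositions would need more than this token comparison.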

Research

Published on arXiv

arXiv:2605.11325 [cs.IR] · May 2026

Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval

Every major benchmark for LLM memory systems measures whether a model answered correctly, not whether the memory system retrieved correctly. We demonstrate that memory baselines achieve mean retrieval precision of just 0.05 to 0.08 on cases referencing their own extractions. We present PrecisionMemBench, an 89-case benchmark measuring retrieval precision independently of generative models, and Tenure, a local-first structured belief store achieving 89/89 passes with mean precision 1.0 and sub-15ms retrieval latency.

BibTeX

@article{flynt2026tenure,
  title={Structured Belief State and the First
         Precision-Aware Benchmark for LLM
         Memory Retrieval},
  author={Flynt, Jeffrey},
  journal={arXiv preprint arXiv:2605.11325},
  year={2026}
}

Reproducible

Run it yourself

The benchmark is published as a reusable artifact. Run it against any memory implementation — including your own.

precisionmembench

$ git clone https://github.com/tenurehq/precisionMemBench

$ cd precisionMemBench

$ npm install

$ MEMORY_PROVIDER=PROVIDER npx ava retrieval.external.eval.test.ts

$ MEMORY_PROVIDER=PROVIDER npx ava session-retrieval.external.eval.test.ts

✓ Seeding corpus (35 beliefs)...

✓ Running 89 cases...

✓ Results written to test-results/


The benchmark that measures what memory tools actually do.