Name: precisionMemBench
Creator: Tenure
License: https://opensource.org/licenses/MIT

TL;DR

Mem0's extraction is architecturally correct. Its retrieval reintroduces the noise extraction was designed to eliminate.
A mean precision of 0.06 means roughly 16 irrelevant beliefs are returned per query alongside the correct one.
Mem0 passes zero of the 43 active retrieval cases, i.e. the cases that carry a precision assertion.
Recall of 0.99 is not a feature. Returning everything is not retrieval.
Ingestion cost: 114 seconds for 35 beliefs (mean 3,263ms per belief).
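
The arithmetic behind these figures can be checked with a short sketch. The function names and the toy belief ids below are illustrative assumptions, not part of the benchmark harness; only the numbers 0.06, 114.2 and 35 come from this page.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0
    relevant = set(relevant)
    return sum(1 for r in retrieved if r in relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of relevant items that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved)) / len(relevant)


def implied_result_count(mean_precision, relevant_hits=1):
    """Total results per query implied by a precision value,
    assuming each query returns `relevant_hits` correct items."""
    return relevant_hits / mean_precision


# Toy query (hypothetical ids): one correct belief, three irrelevant ones.
toy_retrieved = ["b-correct", "b-noise-1", "b-noise-2", "b-noise-3"]
toy_relevant = {"b-correct"}
toy_precision = precision(toy_retrieved, toy_relevant)  # 1 of 4 is relevant
toy_recall = recall(toy_retrieved, toy_relevant)        # the 1 relevant item was found

# Figures quoted in the TL;DR.
total_returned = implied_result_count(0.06)  # about 16.7 results per query
irrelevant = total_returned - 1              # about 15.7, i.e. "~16"
per_belief_ms = 114.2 / 35 * 1000            # about 3,263 ms per belief
```

The toy query also shows why high recall alone says little: recall is perfect as soon as the correct belief is somewhere in the result list, however much noise surrounds it.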

Tenure precision: 1.00 (43/43 active passes)
Mem0 precision: 0.06 (0/43 active passes)
Tenure drift: 0.00 (perfect isolation)
Mem0 drift: 0.94 (near-total contamination)

The architectural problem

Mem0 commits to write-time extraction: facts are extracted from conversation turns and stored as natural language strings rather than raw transcripts. This is architecturally correct.

Mem0 then retrieves at read time using embedding similarity, which reintroduces the noise that structured extraction was designed to eliminate. A query about Redis returns the Redis belief alongside beliefs about MongoDB, TypeScript, Fastify, Kubernetes, and GitHub Actions, with cosine scores between 0.65 and 0.83. The scores reflect genuine semantic relatedness, but relatedness within a shared domain is not relevance to the query. They are measuring the wrong thing.
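
A minimal sketch of the effect, assuming a plain cosine-similarity cutoff. The three-dimensional vectors, the belief ids, and the 0.65 threshold are toy values chosen for illustration; this is not Mem0's API or its actual embedding space.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (assumed). The first three beliefs share an
# "engineering stack" direction; the last one is off-domain.
beliefs = {
    "b-redis": [0.9, 0.4, 0.1],
    "b-mongodb": [0.8, 0.5, 0.2],
    "b-typescript": [0.5, 0.8, 0.2],
    "b-coffee": [0.0, 0.1, 1.0],
}
query = [1.0, 0.3, 0.1]  # stands in for a question about Redis


def retrieve(query_vec, store, threshold=0.65):
    """Return (id, score) pairs scoring at or above the threshold, best first."""
    scored = [(bid, cosine(query_vec, vec)) for bid, vec in store.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


hits = retrieve(query, beliefs)
hit_ids = [bid for bid, _ in hits]
# The Redis belief ranks first, but the MongoDB and TypeScript beliefs
# also clear the threshold because they live in the same domain.
```

Only the off-domain belief is filtered out. Raising the threshold trades noise for missed results; it does not separate "related" from "relevant".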

The failure is structural, not parametric. A more capable embedding model cannot eliminate genuine semantic proximity within a domain-specific corpus: across a 20x range in embedding model scale, precision stays at 0.09. The fix is not a better ruler. It is a different measurement instrument.

Concrete failure: the relation-type case

A relation-type belief (b-auth-depends-redis) was ingested with full content. Mem0's extraction produced a faithful, high-quality stored memory preserving every operationally significant fact.

Query: "what are the auth service dependencies and failure modes?"

Mem0 returns b-auth-depends-redis correctly, then returns b-linting-v0, b-react-expertise, b-vitest-pref, b-comm-pushback, and b-sqlalchemy-superseded. The structurally necessary participant (b-redis-code) is absent entirely.

Retrieval precision: 0.056. This is not caused by poor extraction: the stored memory text is accurate and complete.

Session-level noise isolation

After 8 consecutive off-topic drift turns, Mem0 produces a drift score of 1.0 on both the implicit and the explicit re-entry turn. Every retrieved belief originates from drift-turn topics, and the correct belief about the original topic is not returned at all.

Session-turn latency: 377.93ms p50 (4.8x degradation over single-turn baseline of 78.81ms).
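
One way to read the drift score, sketched under an assumption: it is taken here to be the fraction of retrieved beliefs whose topic was introduced during the drift turns. That definition matches the description above (a score of 1.0 when every retrieved belief comes from drift-turn topics) but is not confirmed by this page; the topic labels are hypothetical.

```python
def drift_score(retrieved_topics, drift_topics):
    """Share of retrieved beliefs that belong to drift-turn topics.

    Assumed definition: 0.0 means perfect isolation,
    1.0 means every retrieved belief is contamination.
    """
    retrieved_topics = list(retrieved_topics)
    if not retrieved_topics:
        return 0.0
    drift = set(drift_topics)
    return sum(1 for t in retrieved_topics if t in drift) / len(retrieved_topics)


drift_topics = {"cooking", "travel"}  # hypothetical off-topic turns

# Re-entry turn where only drift-topic beliefs come back.
contaminated = drift_score(["cooking", "travel", "cooking"], drift_topics)

# Re-entry turn where only the original topic comes back.
isolated = drift_score(["auth-service"], drift_topics)

# Latency ratio from the figures quoted above.
slowdown = 377.93 / 78.81  # about 4.8x
```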

Full comparison

Property	Tenure	Mem0
Mean retrieval precision	1.00	0.06
Active retrieval passes	43/43	0/43
Total passes (89 cases)	89/89	9/89
Mean recall	1.00	0.99
Retrieval latency (p50)	9.77ms	64.94ms
Session latency (p50)	47.79ms	377.93ms
Drift score (re-entry)	0.00	0.94
Ingestion (35 beliefs)	1.0s	114.2s
Runs locally	Yes, always	Optional (self-host available)
Account required	No	Yes (API key)
Scope isolation	Hard filter	None
Supersession handling	Chain with audit	Overwrite
Per-turn injection audit	Yes	No
Works across every client	Proxy layer, any client	SDK integration per client
Memory in context every request	Always (proxy)	Only if code calls search()
License	MIT	Apache-2.0
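
The "Hard filter" and "Chain with audit" rows can be illustrated with a small sketch. It shows the general idea only: the class, field names, and belief ids are hypothetical, and nothing here is taken from Tenure's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Belief:
    id: str
    scope: str
    text: str
    superseded_by: Optional[str] = None  # id of the newer belief, if any


class BeliefStore:
    """Toy store: scope is an exact-match filter, and superseded beliefs
    are kept and linked to their replacement instead of being overwritten."""

    def __init__(self) -> None:
        self._beliefs: Dict[str, Belief] = {}

    def add(self, belief: Belief, supersedes: Optional[str] = None) -> None:
        self._beliefs[belief.id] = belief
        if supersedes is not None:
            # Keep the old belief and point it at its replacement.
            self._beliefs[supersedes].superseded_by = belief.id

    def active_in_scope(self, scope: str) -> List[Belief]:
        """Hard filter: only current beliefs whose scope matches exactly."""
        return [
            b for b in self._beliefs.values()
            if b.scope == scope and b.superseded_by is None
        ]

    def history(self, belief_id: str) -> List[str]:
        """Follow the supersession chain from an old belief to the current one."""
        chain: List[str] = []
        current = self._beliefs.get(belief_id)
        while current is not None:
            chain.append(current.id)
            nxt = current.superseded_by
            current = self._beliefs.get(nxt) if nxt else None
        return chain


store = BeliefStore()
store.add(Belief("b-cache-v0", "backend", "Cache sessions in process memory"))
store.add(Belief("b-cache-v1", "backend", "Cache sessions in Redis"),
          supersedes="b-cache-v0")
store.add(Belief("b-style-v0", "frontend", "Prefer functional components"))

active_backend = [b.id for b in store.active_in_scope("backend")]
audit_trail = store.history("b-cache-v0")
```

Under these assumptions a backend query can never surface the frontend belief, whatever its embedding score would have been, and the replaced belief stays inspectable through its chain instead of disappearing on overwrite.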

Mem0 evaluated against pinned Docker image digest memohq/mem0@sha256:276964b172d2. Extraction and retrieval used Claude Sonnet 4.6, a more capable model than Mem0 uses in its own published evaluations. Full methodology: arXiv:2605.11325. Dataset: HuggingFace.
