Every AI memory benchmark has an asterisk

Mem0 publishes 93.4% on LongMemEval as state-of-the-art. Someone runs their product through a clean harness and gets 73.8%. The CTO of Mem0 shows up in the thread and doesn't deny the gap. Instead, he says every number in the field comes with an asterisk. He's right, and that admission is worth understanding.

Tenure research · Jun 24, 2026 · ~7 min read

TL;DR

Mem0 announced 93.4% on LongMemEval. A clean third-party harness produced 73.8%, a 19.6-point gap on the same memory system and the same data.
The gap traces to hardcoded dataset-specific equivalence rules, a judge instructed to "lean toward yes," hidden chain-of-thought reasoning invisible to anyone sampling outputs, and a one-directional score-lift mechanism in their LoCoMo judge.
The CTO of Mem0 responded. He didn't deny the gap. He said every memory vendor tunes their own harness, and the only real fix is a shared harness everyone runs against, with cost and latency reported alongside accuracy.
He's right. The status quo is everyone reporting numbers with an asterisk. The question is whether the field moves toward shared, multi-dimensional evaluation frameworks or keeps letting each vendor tune their own judge.
Tenure's PrecisionMemBench is built for exactly this: multi-dimensional measurement (precision, noise isolation, latency, belief mutability) and deterministic evaluation that checks evidence paths instead of vibes-scoring final answers.

The gap

A 19.6-point difference on the same system, the same data

Mem0 published 93.4% on LongMemEval as their state-of-the-art overall score. When a third party ran their hosted product through a clean evaluation harness (gpt-5 answerer, binary judge with no lean-toward-yes instruction, 5-seed mean), the best they could get was 73.8%. Same memory system. Same benchmark data. A 19.6-point gap.
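As a rough illustration of what "binary judge, 5-seed mean" amounts to, the sketch below averages per-seed accuracy from yes/no verdicts and computes the gap between the two headline numbers. The function names and the verdict lists are hypothetical and synthetic; only the 93.4 and 73.8 figures come from the thread.

```python
from statistics import mean


def seed_accuracy(verdicts: list[bool]) -> float:
    """Accuracy in percent for one seed, given binary judge verdicts."""
    return 100.0 * sum(verdicts) / len(verdicts)


def multi_seed_mean(runs: list[list[bool]]) -> float:
    """Mean accuracy across seeds, so one lucky run cannot set the headline."""
    return mean(seed_accuracy(v) for v in runs)


def gap_points(published: float, reproduced: float) -> float:
    """Difference between two scores, in percentage points."""
    return round(published - reproduced, 1)


# Synthetic verdicts for five seeds (illustrative only, not real benchmark data).
synthetic_runs = [
    [True, True, True, False],
    [True, True, False, False],
    [True, True, True, True],
    [True, False, True, False],
    [True, True, True, False],
]

print(multi_seed_mean(synthetic_runs))  # 70.0
print(gap_points(93.4, 73.8))           # 19.6
```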

That kind of gap demands an explanation. The third party dug into Mem0's public benchmark harness, at the commit Mem0 shipped shortly before its April announcement, and found several things.

What was found in the public benchmark harness

Finding	What it means
Dataset-specific equivalence rules	Fourteen rules that map one-to-one to specific LongMemEval question IDs. One, for instance, hardcodes that "scratch grains" should count as "layer feed," which skips the reasoning step the benchmark was designed to test.
Hidden chain-of-thought	Dataset hints get applied inside <mem_thinking> tags invisible to anyone sampling outputs. The judge only sees the cleaned final answer.
Biased judge prompt	The judge is told: "You have a tendency to say 'no' too quickly. Before concluding 'no', you MUST verify the answer is truly wrong, not just differently worded. When in doubt, lean toward 'yes'." A 5-step gauntlet must be cleared before marking anything WRONG. No comparable gauntlet exists before marking something CORRECT.
One-directional score lift	In the LoCoMo judge, evidence can promote a WRONG prediction to CORRECT, but the same evidence cannot demote a CORRECT prediction to WRONG.
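To see why a one-directional lift matters, here is a minimal sketch that compares a promote-only rule with a rule where evidence can move a verdict either way. This is not Mem0's judge code. The `Judged` type, its fields, and both rules are hypothetical, and the four items are synthetic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Judged:
    initial_correct: bool    # the judge's first-pass verdict
    evidence_supports: bool  # hypothetical flag: does the evidence back the prediction?


def one_directional(item: Judged) -> bool:
    """Evidence may promote WRONG to CORRECT but never demotes CORRECT."""
    return item.initial_correct or item.evidence_supports


def symmetric(item: Judged) -> bool:
    """Evidence decides the verdict in both directions."""
    return item.evidence_supports


def accuracy(items: list[Judged], rule: Callable[[Judged], bool]) -> float:
    return sum(rule(i) for i in items) / len(items)


# One synthetic item for each combination of first verdict and evidence.
items = [
    Judged(True, True),
    Judged(True, False),
    Judged(False, True),
    Judged(False, False),
]

print(accuracy(items, lambda i: i.initial_correct))  # 0.5
print(accuracy(items, one_directional))              # 0.75
print(accuracy(items, symmetric))                    # 0.5
```

Under the promote-only rule the score can only stay flat or rise relative to the first-pass verdicts, whatever the evidence says.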

None of this is hidden. The commit message from April 3rd, eleven days before the SOTA announcement, reads: "Sync prompts from evals: CONTEXT CHECK, Rule 14 (contradictions), conflicting numbers, personalization scan, BIAS CHECK in judge, chain-of-thought <judge_thinking> tags, 5-step FINAL CHECK." Their engineer typed the words "BIAS CHECK in judge" and "5-step FINAL CHECK" into git.

The response

The CTO shows up and says the quiet part out loud

Deshraj, the CTO of Mem0, responded in the thread. He didn't deny the gap. He didn't claim the findings were wrong. Instead, he made a different argument: these choices were responses to flaws in the benchmarks themselves. The benchmarks contain hidden assumptions that make questions unsolvable even with perfect memory retrieval. The reasoning traces and equivalence rules were attempts to compensate for those flaws.

But then he said something more interesting:

"Yep. Most of these are a generic harness on one side and a tuned one on the other, different token budgets, latency nobody mentions, half of them agentic, all squashed into a single accuracy score that hides all of it. The only real fix is a shared harness everyone runs against, with cost and latency reported alongside accuracy. Until then every number out there, ours too, comes with an asterisk."

That is the most honest statement in the entire thread. It's not a defense of the gap. It's an admission that the entire field is playing a game where everyone tunes their own judge, and the resulting numbers are not comparable. The CTO of one of the most prominent memory companies is saying, in public, that his own company's benchmark numbers come with an asterisk.

He followed up: "Honestly this is the part we care about most. We're trying to move agentic memory forward as a field, and that means reporting everything we can, accuracy, cost, tokens, and being upfront about where existing benchmarks fall short instead of quietly gaming around them."

The status quo

Everyone's numbers have an asterisk

The CTO's description is accurate. In AI memory benchmarking right now, what you're comparing between vendors is almost never the same thing:

Different harnesses

Each vendor tunes their own evaluation prompts, judge instructions, and equivalence rules. The same memory system run through two different harnesses can produce scores nearly 20 points apart, as the 19.6-point Mem0 gap shows.

Different token budgets

One system might retrieve 500 tokens per query while another retrieves 5,000. More context makes a high accuracy score easier to reach, but nobody reports the token cost.

Hidden latency

Some systems run agentic loops that call multiple models across multiple steps. Others do a single retrieval pass. The accuracy numbers don't tell you which is which.

Squashed into one score

Precision, recall, noise isolation, contradiction detection, belief mutability, session-turn latency, all collapsed into a single accuracy percentage that hides every dimension that matters for real deployment.

The result is a market where every vendor can claim a high number, and every number is technically true under the vendor's own evaluation setup. But none of them answer the question a developer actually has: if I integrate this, how will my agent's memory actually behave, and what will it cost?
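A minimal sketch makes the comparison problem concrete: two hypothetical systems with identical accuracy but a tenfold difference in retrieved tokens. The `QueryResult` type, the `report` function, and every value below are illustrative assumptions, not measurements of any real product.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class QueryResult:
    correct: bool
    retrieved_tokens: int  # tokens pulled into context for this query
    latency_s: float       # end-to-end retrieval time in seconds


def report(results: list[QueryResult]) -> dict[str, float]:
    """Report accuracy next to the cost and latency it was bought with."""
    n = len(results)
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "mean_retrieved_tokens": sum(r.retrieved_tokens for r in results) / n,
        "median_latency_s": median(r.latency_s for r in results),
    }


verdicts = (True, True, True, False)
lean = [QueryResult(c, 500, 0.4) for c in verdicts]    # synthetic single-pass system
heavy = [QueryResult(c, 5000, 2.0) for c in verdicts]  # synthetic multi-step system

print(report(lean))   # accuracy 0.75 at 500 tokens per query
print(report(heavy))  # accuracy 0.75 at 5000 tokens per query
```

A leaderboard that shows only the first field would rank these two systems as equal.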

The fix

A shared harness that reports cost and latency alongside accuracy

The CTO's prescription is exactly right: the only real fix is a shared harness everyone runs against, with cost and latency reported alongside accuracy. Not a single accuracy score. A multi-dimensional report that lets developers see what they're actually trading off.

This is the problem Tenure's PrecisionMemBench was built to address. It doesn't collapse everything into one number. It reports across four dimensions:

PrecisionMemBench dimensions

Dimension	What it measures	Why it matters
Retrieval precision	How often the system retrieves the right belief for the right query	A system that retrieves 93% of relevant facts but also pulls in irrelevant ones on every query creates a different experience than one that retrieves 85% cleanly.
Noise isolation	How well the system keeps unrelated context out of retrieval results	As memory scales, noise becomes the dominant failure mode. A system that can't isolate signal from noise gets worse over time, not better.
Session-turn latency	How retrieval latency changes as memory grows across sessions	An agent that takes 2 seconds to retrieve at 100 memories but 15 seconds at 10,000 is not production-viable, regardless of accuracy.
Belief mutability	How well the system updates, supersedes, or contradicts beliefs when new evidence arrives	Memory isn't just storage. It's state that changes. A system that can't update its beliefs is storing facts, not managing context.

These dimensions are not academic. They correspond directly to what breaks in production: the agent that retrieves the right fact but also three irrelevant ones and confuses the model. The agent that gets slower every week as memory accumulates. The agent that can't update its belief about which API version the team is on. A single accuracy score hides every one of these failure modes.
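The first of those failure modes can be sketched with standard set-based precision and recall. This is not PrecisionMemBench's actual scoring code. The function name and the memory IDs are hypothetical, chosen to mirror the "right fact plus three irrelevant ones" case.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Set-based precision and recall for a single query."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical memory IDs: one relevant belief, three pieces of noise.
relevant = {"api_v2"}
clean = {"api_v2"}
noisy = {"api_v2", "api_v1", "lunch_order", "old_ticket"}

print(precision_recall(clean, relevant))  # (1.0, 1.0)
print(precision_recall(noisy, relevant))  # (0.25, 1.0)
```

Both retrievals contain the right fact, so an answer-level accuracy score may treat them the same. Only the precision figure shows that the second one hands the model three distractors.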

The upshot

The asterisk is the conversation

It would be easy to treat the Mem0 thread as a story about one vendor's benchmark practices, but that misses the point. The CTO's response is the real signal. The most prominent memory company in the space is saying, on the record, that their own numbers come with an asterisk, and that the only way forward is shared harnesses with cost and latency reported alongside accuracy.

That's not a scandal. It's a statement of where the field actually is. Everyone tunes their own harness because the benchmarks don't ship with enforced evaluation frameworks. Everyone reports a single accuracy score because the market rewards simple numbers. Everyone knows the numbers aren't comparable, but nobody wants to be the first to publish a multi-dimensional report that's harder to put on a landing page.

The asterisk is not the scandal. The asterisk is the category telling the truth about itself. AI memory is becoming infrastructure. Infrastructure cannot be evaluated with vendor-specific answer prompts, hidden assumptions, and judge leniency collapsed into one accuracy score. The field needs a shared harness. Not eventually. Now. PrecisionMemBench starts at the layer every memory provider has to answer for: what did the system retrieve, what did it miss, and what noise did it return?

PrecisionMemBench on GitHub → GroundEval on GitHub →
