Mem0 publishes 93.4% on LongMemEval as state-of-the-art. A third party runs Mem0's hosted product through a clean harness and gets 73.8%. The CTO of Mem0 shows up in the thread and doesn't deny the gap. Instead, he says every number in the field comes with an asterisk. He's right, and that admission is worth understanding.
Mem0 published 93.4% on LongMemEval as its state-of-the-art overall score. When a third party ran Mem0's hosted product through a clean evaluation harness (gpt-5 answerer, binary judge with no lean-toward-yes instruction, 5-seed mean), the best score it produced was 73.8%. Same memory system. Same benchmark data. A 19.6-point gap.
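For readers who want the arithmetic spelled out, here is a minimal sketch of how a multi-seed mean and the gap are computed. The per-seed verdicts are toy placeholders, not the third party's data, and the function names are invented for illustration. Only the 93.4 and 73.8 figures come from the thread.

```python
from statistics import mean, stdev

def seed_accuracy(verdicts: list[bool]) -> float:
    """Percent of questions a binary judge marked correct in one seeded run."""
    return 100.0 * sum(verdicts) / len(verdicts)

def summarize(runs: dict[int, list[bool]]) -> tuple[float, float]:
    """Mean and sample standard deviation of accuracy across seeds."""
    per_seed = [seed_accuracy(v) for v in runs.values()]
    return mean(per_seed), stdev(per_seed)

# Toy verdicts for 5 seeds x 4 questions (placeholder data, NOT real results).
toy_runs = {
    0: [True, True, False, True],
    1: [True, False, False, True],
    2: [True, True, True, True],
    3: [True, True, False, True],
    4: [False, True, False, True],
}
mean_acc, spread = summarize(toy_runs)

# The two headline numbers quoted in the thread.
published, reproduced = 93.4, 73.8
gap = round(published - reproduced, 1)
```

Reporting the spread next to the mean matters because a single lucky seed can move a small benchmark by several points.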
That kind of gap demands an explanation. The third party dug into Mem0's public benchmark harness at the commit Mem0 shipped right before its April announcement, and found four things.
| Finding | What it means |
|---|---|
| Dataset-specific equivalence rules | 14 rules mapping 1-to-1 to specific LongMemEval question IDs. One rule, for instance, hardcodes that "scratch grains" should count as "layer feed," skipping the reasoning step the benchmark was designed to test. |
| Hidden chain-of-thought | Dataset hints get applied inside `<mem_thinking>` tags that are invisible to anyone sampling outputs. The judge only sees the cleaned final answer. |
| Biased judge prompt | The judge is told: "You have a tendency to say 'no' too quickly. Before concluding 'no', you MUST verify the answer is truly wrong, not just differently worded. When in doubt, lean toward 'yes'." A 5-step gauntlet must be cleared before marking anything WRONG. No comparable gauntlet exists before marking something CORRECT. |
| One-directional score lift | In the LoCoMo judge, evidence can promote a WRONG prediction to CORRECT, but the same evidence cannot demote a CORRECT prediction to WRONG. |
None of this is concealed. It all sits in the public repository. The commit message from April 3rd, eleven days before the SOTA announcement, reads: "Sync prompts from evals: CONTEXT CHECK, Rule 14 (contradictions), conflicting numbers, personalization scan, BIAS CHECK in judge, chain-of-thought <judge_thinking> tags, 5-step FINAL CHECK." Their engineer typed the words "BIAS CHECK in judge" and "5-step FINAL CHECK" into git.
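To see why a one-directional lift matters, consider a toy model of a judge with a second-pass evidence check. This is not Mem0's code. The `Item` fields and both rules are assumptions made up to illustrate the property described in the findings: evidence can promote WRONG to CORRECT but never demote.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Item:
    first_pass_correct: bool               # judge's initial verdict
    evidence_says_correct: Optional[bool]  # None means no evidence check ran

def one_directional(item: Item) -> bool:
    """Evidence may promote WRONG to CORRECT, but never demote CORRECT."""
    if item.first_pass_correct:
        return True
    return item.evidence_says_correct is True

def symmetric(item: Item) -> bool:
    """Evidence, when present, overrides the first pass in either direction."""
    if item.evidence_says_correct is None:
        return item.first_pass_correct
    return item.evidence_says_correct

def accuracy(items: list[Item], rule: Callable[[Item], bool]) -> float:
    return 100.0 * sum(rule(i) for i in items) / len(items)

# Hypothetical verdicts, chosen only to exercise every branch.
items = [
    Item(True, None),
    Item(True, False),   # symmetric rule demotes this one
    Item(False, True),   # both rules promote this one
    Item(False, False),
    Item(True, True),
    Item(False, None),
]

first_pass = accuracy(items, lambda i: i.first_pass_correct)
lifted = accuracy(items, one_directional)
balanced = accuracy(items, symmetric)
```

Under the one-directional rule the score can only stay flat or rise relative to the first pass, whatever the evidence says. The symmetric rule can move in both directions, which is what you want from a check that is supposed to correct errors rather than collect points.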
Deshraj, the CTO of Mem0, responded in the thread. He didn't deny the gap. He didn't claim the findings were wrong. Instead, he made a different argument: these choices were responses to flaws in the benchmarks themselves. The benchmarks contain hidden assumptions that make questions unsolvable even with perfect memory retrieval. The reasoning traces and equivalence rules were attempts to compensate for those flaws.
But then he said something more interesting:
"Yep. Most of these are a generic harness on one side and a tuned one on the other, different token budgets, latency nobody mentions, half of them agentic, all squashed into a single accuracy score that hides all of it. The only real fix is a shared harness everyone runs against, with cost and latency reported alongside accuracy. Until then every number out there, ours too, comes with an asterisk."
That is the most honest statement in the entire thread. It's not a defense of the gap. It's an admission that the entire field is playing a game where everyone tunes their own judge, and the resulting numbers are not comparable. The CTO of one of the most prominent memory companies is saying, in public, that his own company's benchmark numbers come with an asterisk.
He followed up: "Honestly this is the part we care about most. We're trying to move agentic memory forward as a field, and that means reporting everything we can, accuracy, cost, tokens, and being upfront about where existing benchmarks fall short instead of quietly gaming around them."
The CTO's description is accurate. In AI memory benchmarking right now, what you're comparing between vendors is almost never the same thing:
- Each vendor tunes their own evaluation prompts, judge instructions, and equivalence rules. The same memory system run through two different harnesses can produce scores 20 points apart.
- One system might retrieve 500 tokens per query while another retrieves 5,000. More context makes high accuracy easier to reach, but almost nobody reports what it costs.
- Some systems run agentic loops that call multiple models across multiple steps. Others do a single retrieval pass. The accuracy numbers don't tell you which is which.
- Precision, recall, noise isolation, contradiction detection, belief mutability, and session-turn latency are all collapsed into a single accuracy percentage that hides every dimension that matters for real deployment.
The result is a market where every vendor can claim a high number, and every number is technically true under the vendor's own evaluation setup. But none of them answer the question a developer actually has: if I integrate this, what will my agent's memory actually behave like, and what will it cost?
The CTO's prescription is exactly right: the only real fix is a shared harness everyone runs against, with cost and latency reported alongside accuracy. Not a single accuracy score. A multi-dimensional report that lets developers see what they're actually trading off.
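As a rough sketch of what such a report could look like in code, the record below carries cost and latency next to accuracy and refuses to compare numbers produced by different harnesses. Every name here (`HarnessReport`, `harness_id`, the metric fields) is hypothetical, and the values are placeholders rather than measurements of any real system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HarnessReport:
    system: str
    harness_id: str                 # which evaluation setup produced the numbers
    accuracy_pct: float
    mean_context_tokens: float      # tokens retrieved per query
    p95_latency_s: float
    cost_usd_per_1k_queries: float

METRICS = (
    "accuracy_pct",
    "mean_context_tokens",
    "p95_latency_s",
    "cost_usd_per_1k_queries",
)

def deltas(a: HarnessReport, b: HarnessReport) -> dict[str, float]:
    """Per-metric difference (b minus a), only if both used the same harness."""
    if a.harness_id != b.harness_id:
        raise ValueError(
            f"not comparable: {a.harness_id!r} vs {b.harness_id!r}"
        )
    return {m: getattr(b, m) - getattr(a, m) for m in METRICS}

# Placeholder values for illustration only.
lean = HarnessReport("system-a", "shared-v1", 80.0, 500.0, 0.5, 2.0)
heavy = HarnessReport("system-b", "shared-v1", 85.0, 5000.0, 2.5, 20.0)
tuned = HarnessReport("system-b", "vendor-tuned", 93.0, 5000.0, 2.5, 20.0)

tradeoff = deltas(lean, heavy)
```

With these placeholder numbers, `tradeoff` shows five more accuracy points bought with 4,500 extra tokens per query and two more seconds of p95 latency. Calling `deltas(lean, tuned)` raises, because a shared-harness number and a vendor-tuned number are not measuring the same thing.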
This is the problem Tenure's PrecisionMemBench was built to address. It doesn't collapse everything into one number. It reports across four dimensions:
| Dimension | What it measures | Why it matters |
|---|---|---|
| Retrieval precision | How often the system retrieves the right belief for the right query | A system that retrieves 93% of relevant facts but also pulls in irrelevant ones on every query creates a different experience than one that retrieves 85% cleanly. |
| Noise isolation | How well the system keeps unrelated context out of retrieval results | As memory scales, noise becomes the dominant failure mode. A system that can't isolate signal from noise gets worse over time, not better. |
| Session-turn latency | How retrieval latency changes as memory grows across sessions | An agent that takes 2 seconds to retrieve at 100 memories but 15 seconds at 10,000 is not production-viable, regardless of accuracy. |
| Belief mutability | How well the system updates, supersedes, or contradicts beliefs when new evidence arrives | Memory isn't just storage. It's state that changes. A system that can't update its beliefs is storing facts, not managing context. |
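The first two rows can be made concrete with a few lines of set arithmetic. This is a minimal sketch under assumed definitions (precision, recall, and noise rate over retrieved memory IDs). It is not PrecisionMemBench's actual scoring code, and the function and field names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class RetrievalMetrics:
    precision: float    # share of retrieved items that were relevant
    recall: float       # share of relevant items that were retrieved
    noise_rate: float   # share of retrieved items that were irrelevant
    missed: list[str]   # relevant items the system failed to return

def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> RetrievalMetrics:
    got = set(retrieved)  # ignore duplicate hits
    hits = got & relevant
    precision = len(hits) / len(got) if got else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    noise_rate = (len(got) - len(hits)) / len(got) if got else 0.0
    return RetrievalMetrics(precision, recall, noise_rate, sorted(relevant - got))

# One right fact plus three irrelevant memories (hypothetical IDs).
m = retrieval_metrics(
    retrieved=["api_version", "lunch_order", "old_ticket", "pet_name"],
    relevant={"api_version"},
)
```

In this example recall is perfect, since the one relevant memory came back, yet three quarters of what reached the model was noise. An answer-level accuracy score would likely count the query as a success and record none of that.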
These dimensions are not academic. They correspond directly to what breaks in production: the agent that retrieves the right fact but also three irrelevant ones and confuses the model. The agent that gets slower every week as memory accumulates. The agent that can't update its belief about which API version the team is on. A single accuracy score hides every one of these failure modes.
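The API-version failure is easy to picture in code. Below is a toy belief store in which a newer assertion supersedes the older one instead of sitting beside it. The class and method names are hypothetical, and last-write-wins is a simplifying assumption. A real system would also have to weigh timestamps, sources, and confidence.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Belief:
    key: str
    value: str
    superseded: bool = False

class BeliefStore:
    """Append-only log where the latest belief per key is the live one."""

    def __init__(self) -> None:
        self._log: list[Belief] = []

    def assert_belief(self, key: str, value: str) -> None:
        # Mark any live belief for this key as superseded, then append.
        for belief in self._log:
            if belief.key == key and not belief.superseded:
                belief.superseded = True
        self._log.append(Belief(key, value))

    def current(self, key: str) -> Optional[str]:
        for belief in reversed(self._log):
            if belief.key == key and not belief.superseded:
                return belief.value
        return None

    def history(self, key: str) -> list[str]:
        return [b.value for b in self._log if b.key == key]

store = BeliefStore()
store.assert_belief("api_version", "v1")
store.assert_belief("api_version", "v2")
```

A store that only appends would hand the retriever both `v1` and `v2` and leave the model to guess. Here `store.current("api_version")` returns the latest value, while `store.history("api_version")` keeps the trail for auditing.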
It would be easy to treat the Mem0 thread as a story about one vendor's benchmark practices, but that misses the point. The CTO's response is the real signal. The most prominent memory company in the space is saying, on the record, that their own numbers come with an asterisk, and that the only way forward is shared harnesses with cost and latency reported alongside accuracy.
That's not a scandal. It's a statement of where the field actually is. Everyone tunes their own harness because the benchmarks don't ship with enforced evaluation frameworks. Everyone reports a single accuracy score because the market rewards simple numbers. Everyone knows the numbers aren't comparable, but nobody wants to be the first to publish a multi-dimensional report that's harder to put on a landing page.
The asterisk is not the scandal. The asterisk is the category telling the truth about itself. AI memory is becoming infrastructure. Infrastructure cannot be evaluated with vendor-specific answer prompts, hidden assumptions, and judge leniency collapsed into one accuracy score. The field needs a shared harness. Not eventually. Now. PrecisionMemBench starts at the layer every memory provider has to answer for: what did the system retrieve, what did it miss, and what noise did it return?