Reproducible benchmark results from PrecisionMemBench. 89 cases measuring retrieval precision independently of any generative model. Run it yourself on HuggingFace.
Cloud-first memory SDK. Good extraction, broken retrieval. Recall of 0.99 means everything comes back, including 16 irrelevant beliefs per query.
Temporal knowledge graph. 897-second ingestion across 35 beliefs. Multi-container deployment. Drift score 0.89 on re-entry turns.
MCP-based agent memory from Vectorize. Cross-encoder reranker adds 672ms mean latency without improving precision. Drift score: 0.93.
The best-performing competitor. Memory graph with dynamic dreaming. Still returns 2-3 irrelevant beliefs per correct one on average.
Zero-database memory runtime for coding agents. Triple-stream retrieval (BM25 + vector + graph). High recall, low precision. Drift score: 0.81.
Garry Tan's opinionated agent brain. Hybrid search + knowledge graph + synthesis layer. Precision limited by uncapped retrieval default.
All results from PrecisionMemBench (arXiv:2605.11325). Dataset on HuggingFace. Live leaderboard on HuggingFace Spaces. Run it yourself: npm run test:eval