Embeddings

The embedding invariance problem: why bigger models don't fix retrieval noise

Name: precisionMemBench
Creator: Tenure
License: https://opensource.org/licenses/MIT

A 20x increase in embedding model scale produces identical retrieval precision. The problem is not the ruler. It is what we are measuring.

Tenure research · ~8 min read

TL;DR

nomic-embed-text, mxbai-embed-large, and qwen3-8b all achieve 0.09 mean retrieval precision on the same benchmark [1].
At 8 billion parameters and 1,131ms mean latency, qwen3 produces the same precision as a 768-dimension model at 43ms [1].
Semantic proximity within a domain-specific corpus is structural; it cannot be embedded away.
Higher-capability embeddings redistribute scores but do not isolate distinct concepts that occupy the same semantic region.
The fix is not a better embedding model. It is a different retrieval signal entirely.

The scale assumption

The assumption that bigger fixes it

The standard response to imprecise retrieval in AI memory is to scale the embedding model. A larger model produces richer representations, finer distinctions, and more nuanced similarity. If cosine similarity over a 768-dimension index returns noise, surely a 1024-dimension model reduces it. And if that fails, an 8-billion parameter foundation model must solve it.

This intuition is wrong. On PrecisionMemBench, three embedding models spanning a 20x range in scale produce identical mean retrieval precision: 0.09 [1]. The smallest model runs at 43ms mean latency. The largest runs at 1,131ms. They fail in the same way, returning irrelevant beliefs on the same queries even though the scores themselves shift from model to model.

A more powerful embedder distributes scores differently but cannot eliminate genuine semantic proximity within a domain-specific corpus. The fix is not a better ruler; it is a different measurement instrument.

The invariance result

What the benchmark shows

We evaluated three configurations on PrecisionMemBench, all using cosine similarity with no additional pipeline infrastructure: nomic-embed-text at 768 dimensions, mxbai-embed-large at 1024 dimensions, and qwen3 at the 8-billion parameter scale with 4096 dimensions [1].

nomic-embed-text (768d)

0.09 Mean precision · 43ms mean latency

11 trivial passes out of 77 active cases. Zero active retrieval passes. Identical failure mode across all query categories.

qwen3-8b (4096d)

0.09 Mean precision · 1,131ms mean latency

Same 11 trivial passes. Same zero active retrieval passes. 20x larger, 26x slower, identical imprecision.

All 11 passes in every configuration are trivially empty or budget-forced cases: cases where the expected relevant belief set is empty by design, so any system returning nothing passes by construction [1]. On the 48 cases that carry an active retrieval assertion, all three models score zero. Not near-zero. Zero.
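To make the distinction between trivial and active cases concrete, here is a minimal sketch of per-case precision scoring. The function names, the belief IDs, and the convention that an empty result scores 1.0 against an empty expected set are assumptions for illustration; they are not taken from the PrecisionMemBench harness.

```python
def precision(returned: list[str], relevant: set[str]) -> float:
    """Fraction of returned beliefs that are relevant."""
    if not returned:
        # Assumed convention: returning nothing is a perfect score only
        # when nothing was expected, and a miss otherwise.
        return 1.0 if not relevant else 0.0
    hits = sum(1 for belief_id in returned if belief_id in relevant)
    return hits / len(returned)


def is_trivial(relevant: set[str]) -> bool:
    """A case is trivial when the expected relevant set is empty by design."""
    return not relevant


# Hypothetical cases, not benchmark data.
trivial_case = precision(returned=[], relevant=set())
noisy_case = precision(
    returned=["redis", "mongodb", "typescript", "fastify"],
    relevant={"redis"},
)
```

Under this convention a system that returns nothing passes the trivial case without retrieving anything, which is why such passes say little about retrieval quality. The noisy case is the one that matters: one relevant belief among four returned.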

The qwen3 result is the clearest demonstration that this is not a capability problem. At roughly 26 times the mean latency of nomic-embed-text (1,131ms against 43ms per query), the 8-billion parameter model produces identical precision to the smallest embedding tested [1]. The additional parameters buy no discrimination. They buy only delay.

Domain geometry

Why scale cannot unlink semantic neighbors

In any belief store where participants share a technical domain, all beliefs about that domain occupy a common semantic region. A query about Redis is, as intended, semantically close to a belief about Redis. It is also semantically close to beliefs about MongoDB, TypeScript, Fastify, Kubernetes, and GitHub Actions, with cosine scores between 0.65 and 0.83 [1].

Those scores reflect genuine semantic relatedness. They are not errors. An engineering codebase genuinely does contain related concepts across infrastructure, language, and framework choices. A larger embedding model encodes richer distinctions between those concepts, but it cannot eliminate the fact that they are semantically adjacent in developer discourse. Geometry is not a modeling artifact. It is a property of the corpus.

A more powerful model might separate Redis from MongoDB by a wider margin, but both still reside in the same neighborhood as the query, and that neighborhood contains dozens of beliefs that share the technical domain without sharing relevance. The model redistributes scores within the neighborhood. It does not exclude the neighborhood.
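The neighborhood effect can be illustrated with a toy example. The three-dimensional vectors below are hand-picked stand-ins, not outputs of any real embedding model, and the 0.65 cutoff is borrowed from the lower end of the score range quoted above purely as an illustrative threshold.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query: list[float], beliefs: dict[str, list[float]], threshold: float):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in beliefs.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors: three in-domain beliefs and one unrelated belief.
beliefs = {
    "redis": [0.9, 0.3, 0.1],
    "mongodb": [0.8, 0.5, 0.2],
    "kubernetes": [0.7, 0.3, 0.6],
    "sourdough": [0.0, 0.1, 1.0],
}
redis_query = [1.0, 0.2, 0.1]

results = retrieve(redis_query, beliefs, threshold=0.65)
names = [name for name, _ in results]
```

In this sketch the Redis belief ranks first, but the other in-domain beliefs clear the threshold too; only the unrelated belief is filtered out. Widening the gap between Redis and MongoDB would change the ordering margins, not the membership of the returned set.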

Decoupling

Extraction quality does not predict retrieval precision

It is tempting to blame retrieval noise on poor extraction. If the belief store contains garbled or incomplete facts, of course retrieval struggles. But the benchmark directly decouples these variables. Mem0's extraction pipeline produced a relation belief that preserved every operationally significant fact: the service name, dependency target, fail-open behavior, and coupling assertion [1]. By any standard, this was a high-quality extraction.

Retrieval precision on that exact stored memory was 0.056. The system returned the correctly extracted belief alongside 16 additional irrelevant beliefs including a linting preference, a React expertise entry, and a superseded SQLAlchemy note [1]. In a second case referencing the same belief, recall dropped to 0.5: the structurally necessary participant was missing entirely despite being directly referenced in the stored text.

The precision failure occurs in both high-recall and low-recall conditions, which means the floor is structural rather than query-dependent [1]. Write-time extraction has no bearing on the failure because the failure happens at read time, in the geometry of the embedding space, not in the content of the beliefs.

The cost

What higher latency actually buys you

When you upgrade from nomic-embed-text to qwen3-8b for memory retrieval, you trade 43ms mean latency for 1,131ms [1]. You trade 85ms p95 for 2,605ms. The additional compute purchases no retrieval benefit. It purchases only inference cost and a slower user experience.
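The latency multipliers follow directly from the figures above. This small calculation uses only the four numbers quoted in the text; the variable names are illustrative.

```python
# Latencies in milliseconds, as reported in the text.
mean_ms = {"nomic-embed-text": 43, "qwen3-8b": 1131}
p95_ms = {"nomic-embed-text": 85, "qwen3-8b": 2605}

mean_ratio = mean_ms["qwen3-8b"] / mean_ms["nomic-embed-text"]
p95_ratio = p95_ms["qwen3-8b"] / p95_ms["nomic-embed-text"]

print(f"mean latency ratio: {mean_ratio:.1f}x")  # prints "mean latency ratio: 26.3x"
print(f"p95 latency ratio: {p95_ratio:.1f}x")    # prints "p95 latency ratio: 30.6x"
```

The tail is hit slightly harder than the mean: p95 latency grows by about 30x while mean latency grows by about 26x.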

This is not a transient limitation awaiting the next generation of embedding models. The invariance is architectural: as long as retrieval relies on cosine similarity over domain-specific beliefs, the model can only redistribute scores within a dense semantic neighborhood. It cannot exclude the neighborhood. By the same argument, a 40-billion parameter model would face the same geometry, represented in finer detail, at higher latency and with the same imprecision.

Larger embedding models do not fix retrieval noise in AI memory because the noise is not a representational deficiency. It is a signal problem. Semantic similarity retrieves what is related. Memory requires retrieving what is named.

The orthogonal signal

Named entities, not semantic neighbors

Resolving relevance requires a retrieval signal orthogonal to semantic similarity [1]. In bounded-vocabulary contexts, where the user coined the terminology and authored aliases to match expected query surfaces, BM25 with alias-weighted term matching provides higher precision by construction [1].

Embedding similarity retrieves what is semantically related. Alias-weighted BM25 retrieves what the user named. In a single-user persistent memory context, the latter is more often correct [1]. If a user names their Kubernetes belief with canonical name kubernetes and aliases k8s and kube, then a query containing k8s should retrieve that belief with high precision regardless of semantic distance. There is no ambiguity to resolve: the authored terminology is the ground truth.
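One way to picture alias-weighted BM25 is to score the authored names and aliases as a separate field and boost that field over the belief body. The sketch below is an assumption about how such weighting could work, not the implementation behind the benchmark: the `ALIAS_WEIGHT` value, the field layout, the belief texts, and the whitespace tokenizer are all hypothetical, and `K1` and `B` are common BM25 defaults rather than values from the source.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75     # common BM25 defaults, not values from the source
ALIAS_WEIGHT = 5.0    # hypothetical boost for authored names and aliases


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def bm25_scores(query_tokens: list[str], docs: list[list[str]]) -> list[float]:
    """Plain BM25 over tokenized documents; one score per document."""
    if not docs:
        return []
    n = len(docs)
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = K1 * (1 - B + B * len(d) / avg_len)
            score += idf * tf[term] * (K1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def search(query: str, beliefs: list[dict]) -> list[tuple[str, float]]:
    """Rank beliefs by boosted alias score plus body score; drop zero scores."""
    q = tokenize(query)
    alias_docs = [tokenize(" ".join([b["name"], *b["aliases"]])) for b in beliefs]
    body_docs = [tokenize(b["text"]) for b in beliefs]
    alias_scores = bm25_scores(q, alias_docs)
    body_scores = bm25_scores(q, body_docs)
    ranked = [
        (b["name"], ALIAS_WEIGHT * a + s)
        for b, a, s in zip(beliefs, alias_scores, body_scores)
    ]
    return sorted([r for r in ranked if r[1] > 0], key=lambda r: r[1], reverse=True)


# Hypothetical belief store; only the kubernetes aliases come from the text.
beliefs = [
    {"name": "kubernetes", "aliases": ["k8s", "kube"], "text": "deploys run on a managed cluster"},
    {"name": "redis", "aliases": ["cache"], "text": "session cache fails open when unavailable"},
    {"name": "mongodb", "aliases": ["mongo"], "text": "primary document store for user profiles"},
]

results = search("how is k8s configured", beliefs)
```

Here the query term k8s matches only the Kubernetes belief, so that is the only belief returned. A query with no matching term returns an empty list rather than the nearest semantic neighbors.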

The predictable objection is vocabulary coverage: what if the user refers to a belief using a term not yet in the alias set? This misidentifies a correct behavior as a limitation. On first encounter the system returns silence rather than noise. The extraction worker captures the new term as an alias. Every subsequent query using that term resolves correctly [1]. The alias enrichment flywheel means the system improves organically through use, while vector systems degrade as semantic mass accumulates [1].
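The silence-then-enrichment loop described above can be sketched with a small alias index. The `AliasIndex` class and its methods are hypothetical names, and the manual `add_alias` call stands in for whatever the extraction worker does in a real system.

```python
class AliasIndex:
    """Maps authored names and aliases to canonical belief names."""

    def __init__(self) -> None:
        self._alias_to_belief: dict[str, str] = {}

    def add_belief(self, name: str, aliases: tuple[str, ...] = ()) -> None:
        for term in (name, *aliases):
            self._alias_to_belief[term.lower()] = name

    def add_alias(self, name: str, alias: str) -> None:
        # Simulates the extraction worker capturing a newly seen term.
        self._alias_to_belief[alias.lower()] = name

    def resolve(self, query: str) -> list[str]:
        """Return beliefs named by any query token, in order of first match."""
        found: list[str] = []
        for token in query.lower().split():
            belief = self._alias_to_belief.get(token)
            if belief and belief not in found:
                found.append(belief)
        return found


index = AliasIndex()
index.add_belief("kubernetes", ("k8s", "kube"))

# First encounter with an unknown term: silence, not noise.
first = index.resolve("restart the orchestrator")

# The new term is captured as an alias, so later queries resolve.
index.add_alias("kubernetes", "orchestrator")
second = index.resolve("restart the orchestrator")
```

The first lookup returns an empty list and the second returns the Kubernetes belief. Exact token matching is a deliberate simplification here; it keeps the example focused on the enrichment step rather than on scoring.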

Conclusion

It is a paradigm problem, not a capability problem

Retrieval noise in AI memory is not an embedding quality problem. It is a paradigm problem. Cosine similarity over domain-specific beliefs produces noise because the concepts genuinely are semantically adjacent, and no embedding scale can unlink them without misrepresenting the domain itself.

The benchmark evidence is unambiguous: three models spanning a 20x scale range produce identical mean precision of 0.09, identical pass counts, and identical failure modes [1]. The field has responded by adding re-ranking stages, temporal trees, and hierarchical graphs: infrastructure compensating for the wrong primary signal [1].

Until memory systems stop treating memory as semantic retrieval and start treating it as named entity resolution within a bounded vocabulary, larger embedding models will consume more compute for the same imprecision. The measurement instrument must change, not its resolution.
