Research

Why most AI evals would miss the Linear sales email failure

Name: precisionMemBench
Creator: Tenure
License: https://opensource.org/licenses/MIT

Linear's sales agent emailed an existing customer six times with the wrong company name. It is easy to call that bad AI outreach. But the email was only the visible part. The real failure happened earlier, when the system decided it was allowed to send without proving the facts that decision depended on.

Tenure research · Jun 22, 2026 · ~7 min read

Jean-Michel Lemieux @jmwind · 10:43 AM · 6/22/26

Hey @karrisaarinen, friendly heads-up from the field. I've received 6+ emails from someone on your sales team with comical AI-slop. Wrong company name & already a customer, etc... Always love a good laugh, but you may want to skip-level this one?

Re: Linear at Quantum Innovations

Stop the AI slop pls. You got the company wrong, you didn't look at my email domain, and we already use Linear. Thinking of canceling now.

Jean-Michel Lemieux

Developer · spellbook.com

Karri Saarinen @karrisaarinen · 2h

Thanks and apologies. Not ideal, will check with the team what caused this.

Agree that emailing existing customers and 6 times is the dumbest thing

TL;DR

Most people describe bad AI outreach as a generation problem. The message was awkward, repetitive, or poorly personalized.
But the larger failure usually happens one step earlier. Before anything gets written, the system has to know who the recipient is, which company they belong to, whether they are already a customer, whether the account allows outreach, and whether this person has already been contacted too many times.
If those checks are wrong or missing, a better model does not solve the problem. It just writes a cleaner version of the wrong action.
That is why agent evaluation has to look upstream. It should ask what the system checked before acting, not only whether the final output reads well.
GroundEval is built around that question: what did the agent search, fetch, cite, and have permission to use before it answered or acted?

The wrong lesson

The email was not the first failure

The easy reaction to the Linear email is to laugh at the output. The company name is wrong. The recipient is already a customer. The same sequence had already hit them multiple times. Then the CEO replies publicly, and the whole thing becomes another example of AI slop.

But that framing lets the system off too easily.

The embarrassing part is the email everyone saw. The more important part happened before the email existed. Somewhere upstream, the system had enough wrong or unchecked state to decide that this person should be contacted at all.

That is the part a better subject line would not fix. A warmer tone would not fix it. Even a model that writes beautiful outbound copy would still have sent the wrong message if it never checked the basic facts first.

In the Linear case, the pre-send checks were the whole story. Does the company name match the recipient's domain? Is this contact already a customer? Has this sequence already run too many times? If those answers are wrong or never checked, generation is already starting from a failed state.

The visible failure was a bad email. The earlier failure was simpler: the system did not prove that the email should be sent.

The dependency list

Before the message exists, the system has facts to prove

Outbound email looks simple from the outside. Pick a contact, write a message, send it. Inside a real company, the action depends on a stack of state that has to be true.

Recipient state

Is this person a prospect, an active customer, a former customer, a partner, an employee, or someone who should never receive this sequence?

Company mapping

Does the company name in the email match the account linked to the recipient, the email domain, and the current CRM record?

Account status

Does the account already use the product, have an open opportunity, have an assigned owner, or sit under a suppression rule?

Outreach history

How many times has this person been contacted, through which channel, by which team, and with what response?

Permission to act

Given all of that state, is this automation allowed to send, or should it suppress, route to a human, or do nothing?

If any one of those checks fails, the right behavior is not "write a better email." The right behavior is "do not send." That is why calling this a content quality problem misses the failure mode. The generated text is only the artifact left at the scene.
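The dependency list above can be read as a gate that runs before any text is generated. The following is a minimal sketch under stated assumptions, not a description of Linear's system or of any real CRM API: the `SendContext` fields, the `MAX_TOUCHES` cap, the rule ordering, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    SEND = "send"
    SUPPRESS = "suppress"
    ROUTE_TO_HUMAN = "route_to_human"


@dataclass
class SendContext:
    """Hypothetical snapshot of the state a send decision depends on."""
    email: str
    company_in_draft: str     # company name the draft would mention
    crm_account_name: str     # account linked to the recipient in the CRM
    crm_account_domain: str   # domain recorded on that account
    recipient_state: str      # e.g. "prospect", "active_customer"
    account_suppressed: bool  # account sits under a suppression rule
    prior_touches: int        # outreach attempts already made


MAX_TOUCHES = 3  # hypothetical cap, chosen only for illustration


def pre_send_decision(ctx: SendContext) -> tuple[Decision, str]:
    """Return a decision plus the reason, before any email is written."""
    if ctx.recipient_state != "prospect":
        return Decision.SUPPRESS, f"recipient state is {ctx.recipient_state}"
    domain = ctx.email.rsplit("@", 1)[-1].lower()
    if domain != ctx.crm_account_domain.lower():
        return Decision.ROUTE_TO_HUMAN, "email domain does not match CRM account"
    if ctx.company_in_draft.casefold() != ctx.crm_account_name.casefold():
        return Decision.ROUTE_TO_HUMAN, "company name does not match CRM account"
    if ctx.account_suppressed:
        return Decision.SUPPRESS, "account is under a suppression rule"
    if ctx.prior_touches >= MAX_TOUCHES:
        return Decision.SUPPRESS, "outreach cap reached"
    return Decision.SEND, "all pre-send checks passed"


# Made-up values: an existing customer addressed under the wrong company name.
ctx = SendContext(
    email="dev@customer.example",
    company_in_draft="Some Other Co",
    crm_account_name="Customer Inc",
    crm_account_domain="customer.example",
    recipient_state="active_customer",
    account_suppressed=False,
    prior_touches=5,
)
decision, reason = pre_send_decision(ctx)
print(decision.value, "-", reason)
```

The point of the sketch is the return type: the gate has three outcomes, and "write a better email" is not one of them.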

What most evals see

Evaluating the email is too late

A conventional evaluation can grade the final email. Is it polite? Is it personalized? Is it relevant? Does it mention the right product? Does it follow the brand voice? Does it avoid obvious hallucinations? Those are useful questions, but they begin after the action has already been approved.

In the failure case, the email can score well on all of those dimensions and still be wrong. It can be polished, concise, friendly, and on brand. It can even contain true statements about the product. None of that proves the system had enough verified state to send it to this person at this time.

Two ways to evaluate the same outbound action

Evaluation target	Question being asked	What it misses (or catches)
Final email	Does the generated message read well?	Misses whether the email should exist
Model output	Does the model produce plausible personalization?	Misses whether the personalization was grounded
Evidence path	Did the agent verify the facts required to act?	Catches invalid actions, which can be blocked before generation

The important question is not whether the copy sounds human. It is whether the system checked enough evidence to send at all.

This is the same distinction GroundEval makes for question-answering agents. A final answer can look plausible while the trace shows the agent never fetched the document, used evidence outside its permission boundary, or claimed absence without searching. The outbound version has the same shape: a final message can look plausible while the pre-send evidence path is invalid.

How GroundEval frames it

Did the agent earn the right to act?

GroundEval treats agent behavior as something that can be tested against a state contract. The contract says what evidence exists, when it existed, who or what was allowed to access it, and which checks are required before a claim or action is valid.
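One way to picture such a contract is as plain data: each piece of evidence carries the time it became available and who may read it, and the contract lists the checks an action requires. This is an illustrative sketch only. The class names, fields, record names, and dates are assumptions made for this example and are not GroundEval's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EvidenceRecord:
    record_id: str
    available_from: datetime    # when the evidence came into existence
    allowed_readers: frozenset  # agents or roles permitted to use it


@dataclass
class StateContract:
    records: dict           # record_id -> EvidenceRecord
    required_checks: tuple  # record ids that must be fetched before acting

    def usable(self, record_id: str, reader: str, at: datetime) -> bool:
        """Evidence counts only if it existed at that time and the reader may use it."""
        rec = self.records.get(record_id)
        return (
            rec is not None
            and rec.available_from <= at
            and reader in rec.allowed_readers
        )

    def unmet_checks(self, fetched: set, reader: str, at: datetime) -> list:
        """Required checks that were skipped or relied on unusable evidence."""
        return [
            check
            for check in self.required_checks
            if check not in fetched or not self.usable(check, reader, at)
        ]


# Hypothetical contract: the outbound agent may read customer status,
# but outreach history is readable only by a different role.
contract = StateContract(
    records={
        "customer_status": EvidenceRecord(
            "customer_status", datetime(2026, 1, 1), frozenset({"outbound_agent"})
        ),
        "outreach_history": EvidenceRecord(
            "outreach_history", datetime(2026, 1, 1), frozenset({"support_agent"})
        ),
    },
    required_checks=("customer_status", "outreach_history"),
)
unmet = contract.unmet_checks(
    fetched={"customer_status", "outreach_history"},
    reader="outbound_agent",
    at=datetime(2026, 6, 1),
)
print(unmet)
```

In this example the agent fetched both records, yet one check is still unmet because the evidence sat outside its permission boundary. Fetching alone does not satisfy the contract.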

For an outbound agent, the evaluation does not have to ask whether the email was good. It can ask a simpler and more important question: before sending, did the agent check the required systems and reach a valid send decision?

A GroundEval-style outbound test

Test component	Example
Question	Should this outbound agent send a prospecting email to this contact?
Ground truth	No. The contact belongs to an account that already uses the product.
Required trajectory	Check customer status, account mapping, email domain, outreach history, and suppression rules.
Failure condition	The agent sends or drafts outreach without fetching the records needed to justify the send decision.
Valid behavior	Suppress the send, cite the blocking record, and route to the account owner if review is needed.

That is not a judge prompt. It is not a vibes-based review of whether the email seems reasonable. It is a deterministic check against the evidence path: what was searched, what was fetched, what state was available at the time, and whether the action followed from it.
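A deterministic check of this kind can be small. The sketch below assumes a trace is an ordered list of `(kind, name)` events, where `kind` is `"fetch"` or `"action"`. That trace format, the function name, and the check names are hypothetical, chosen to mirror the test table above rather than taken from GroundEval itself.

```python
REQUIRED_FETCHES = (
    "customer_status",
    "account_mapping",
    "email_domain",
    "outreach_history",
    "suppression_rules",
)


def evaluate_trace(trace, required, expected_action):
    """Pass only if every required fetch precedes the first action
    and that action matches the ground truth."""
    fetched = set()
    for kind, name in trace:
        if kind == "fetch":
            fetched.add(name)
        elif kind == "action":
            missing = [r for r in required if r not in fetched]
            if missing:
                return {"passed": False, "reason": "acted without evidence", "missing": missing}
            if name != expected_action:
                return {"passed": False, "reason": f"expected {expected_action}, got {name}", "missing": []}
            return {"passed": True, "reason": "action follows from evidence", "missing": []}
    return {"passed": False, "reason": "no action recorded", "missing": []}


# Failure case from the table: a send with almost nothing fetched.
bad_trace = [("fetch", "email_domain"), ("action", "send")]

# Valid behavior: fetch everything required, then suppress.
good_trace = [("fetch", name) for name in REQUIRED_FETCHES] + [("action", "suppress")]

print(evaluate_trace(bad_trace, REQUIRED_FETCHES, "suppress"))
print(evaluate_trace(good_trace, REQUIRED_FETCHES, "suppress"))
```

Nothing in the check reads the email body. A draft counts as an action here too, so drafting outreach before the fetches would fail the same way.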

The operational lesson

Agents need preconditions, not just approvals

The usual answer to risky automation is to put a human in the loop. Let someone review it before the agent sends. That can help, but only if the human can see more than the generated email.

A polished draft does not show that the account record says active customer. It does not show that the domain points somewhere else. It does not show that the contact was already emailed five times. It does not show that the product being pitched is already installed.

Without the trace, approval can become a nicer-looking version of the same problem. The reviewer is judging the artifact, not the decision that produced it.

The system needs preconditions. Before an agent acts, it should be able to show which checks were required, which systems were queried, which records were found, and which rule allowed the action to continue. If the required checks are missing, the action should stop before generation.
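At runtime, that can look like a wrapper that refuses to call the generator until every required check has a recorded result. This is a rough sketch under assumptions: `CheckResult`, `PreconditionError`, `act_with_preconditions`, and the system and record names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


class PreconditionError(Exception):
    """Raised when an action is attempted without its required evidence."""


@dataclass
class CheckResult:
    check: str      # which required check this satisfies
    system: str     # which system was queried
    record_id: str  # which record was found
    passed: bool    # whether the record allows the action to continue


def act_with_preconditions(required, results, generate: Callable[[], str]):
    """Run generate() only if every required check ran and passed.
    Returns the draft plus an audit trail a reviewer can read."""
    by_check = {r.check: r for r in results}
    missing = [c for c in required if c not in by_check]
    if missing:
        raise PreconditionError(f"checks never run: {missing}")
    failed = [c for c in required if not by_check[c].passed]
    if failed:
        raise PreconditionError(f"checks failed: {failed}")
    audit = [f"{c}: {by_check[c].system} record {by_check[c].record_id}" for c in required]
    return generate(), audit


calls = []


def generate() -> str:
    calls.append("generate")  # lets us confirm generation never started
    return "draft email"


required = ["customer_status", "outreach_history"]
results = [CheckResult("customer_status", "crm", "acct-1", True)]

try:
    act_with_preconditions(required, results, generate)
except PreconditionError as exc:
    print("blocked:", exc)
print("generator calls:", len(calls))
```

The audit list is what a human reviewer would see alongside the draft: the systems queried and the records found, not only the finished text.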

"Looks good to send" is not the same as "the required evidence was checked, and the send decision follows from it."

The upshot

The future failure mode is not just saying the wrong thing

The common story about AI risk still focuses on the answer. Did the model hallucinate? Did it cite the wrong source? Did it say something unsafe? Those failures matter, but agents are moving into a different regime. They do not only answer. They send, file, update, route, approve, escalate, and suppress.

Once software can act, the eval target changes. The question is no longer only whether the output was correct. It becomes whether the action was valid under the state the agent was allowed to observe.

Bad AI outreach is a small, public example of that larger shift. The visible failure is the embarrassing email. The system failure is that no one could prove, before the email went out, that the agent had checked the facts required to send it.

That is the class of failure GroundEval is designed to catch. Not whether the prose sounds good. Whether the behavior was earned.
