Linear's sales agent emailed an existing customer six times with the wrong company name. It is easy to call that bad AI outreach. But the email was only the visible part. The real failure happened earlier, when the system decided it was allowed to send without proving the facts that decision depended on.
Hey @karrisaarinen, friendly heads-up from the field. I've received 6+ emails from someone on your sales team with comical AI-slop. Wrong company name & already a customer, etc... Always love a good laugh, but you may want to skip-level this one?
Stop the AI slop pls. You got the company wrong, you didn't look at my email domain, and we already use Linear. Thinking of canceling now.
Thanks and apologies. Not ideal, will check with the team what caused this.
Agree that emailing existing customers and 6 times is the dumbest thing
The easy reaction to the Linear email is to laugh at the output. The company name is wrong. The recipient is already a customer. The same sequence had already hit them multiple times. Then the CEO replies publicly, and the whole thing becomes another example of AI slop.
But that framing lets the system off too easily.
The embarrassing part is the email everyone saw. The more important part happened before the email existed. Somewhere upstream, the system had enough wrong or unchecked state to decide that this person should be contacted at all.
That is the part a better subject line would not fix. A warmer tone would not fix it. Even a model that writes beautiful outbound copy would still have sent the wrong message if it never checked the basic facts first.
In the Linear case, the pre-send checks were the whole story. Does the company name match the recipient's domain? Is this contact already a customer? Has this sequence already run too many times? If those answers are wrong or never checked, generation is already starting from a failed state.
The visible failure was a bad email. The earlier failure was simpler: the system did not prove that the email should be sent.
Outbound email looks simple from the outside. Pick a contact, write a message, send it. Inside a real company, the action depends on a stack of state that has to be true.
Is this person a prospect, an active customer, a former customer, a partner, an employee, or someone who should never receive this sequence?
Does the company name in the email match the account linked to the recipient, the email domain, and the current CRM record?
Does the account already use the product, have an open opportunity, have an assigned owner, or sit under a suppression rule?
How many times has this person been contacted, through which channel, by which team, and with what response?
Given all of that state, is this automation allowed to send, or should it suppress, route to a human, or do nothing?
If any one of those checks fails, the right behavior is not "write a better email." The right behavior is "do not send." That is why calling this a content quality problem misses the failure mode. The generated text is only the artifact left at the scene.
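The questions above can be read as a gate that runs before any text is generated. Below is a minimal sketch in Python, not a description of how Linear's system or any particular sales tool works. The `ContactState` fields, the `MAX_TOUCHES` cap, and the choice to route unchecked state to a human are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Decision(Enum):
    SEND = "send"
    SUPPRESS = "suppress"
    ROUTE_TO_HUMAN = "route_to_human"


@dataclass
class ContactState:
    # Hypothetical fields. None means the fact was never checked.
    relationship: Optional[str] = None            # e.g. "prospect", "customer"
    company_matches_domain: Optional[bool] = None
    suppressed: Optional[bool] = None
    prior_touches: Optional[int] = None


MAX_TOUCHES = 3  # assumed cap, not a figure from the incident


def decide(state: ContactState) -> Tuple[Decision, str]:
    """Return a send decision plus the reason that produced it."""
    # Unchecked state is never treated as permission to send.
    unchecked = [name for name, value in vars(state).items() if value is None]
    if unchecked:
        return Decision.ROUTE_TO_HUMAN, "unchecked: " + ", ".join(unchecked)
    if state.relationship != "prospect":
        return Decision.SUPPRESS, f"relationship is {state.relationship}"
    if not state.company_matches_domain:
        return Decision.SUPPRESS, "company name does not match email domain"
    if state.suppressed:
        return Decision.SUPPRESS, "suppression rule applies"
    if state.prior_touches >= MAX_TOUCHES:
        return Decision.SUPPRESS, "outreach cap reached"
    return Decision.SEND, "all preconditions verified"
```

The design choice that matters is the first branch: a missing fact is handled differently from a verified one, so generation never starts from state that was only assumed.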
A conventional evaluation can grade the final email. Is it polite? Is it personalized? Is it relevant? Does it mention the right product? Does it follow the brand voice? Does it avoid obvious hallucinations? Those are useful questions, but they begin after the action has already been approved.
In the failure case, the email can score well on all of those dimensions and still be wrong. It can be polished, concise, friendly, and on brand. It can even contain true statements about the product. None of that proves the system had enough verified state to send it to this person at this time.
| Evaluation target | Question being asked | What it misses |
|---|---|---|
| Final email | Does the generated message read well? | Whether the email should exist |
| Model output | Does the model produce plausible personalization? | Whether the personalization was grounded |
| Evidence path | Did the agent verify the facts required to act? | Only prose quality; the action can be blocked before generation |
This is the same distinction GroundEval makes for question-answering agents. A final answer can look plausible while the trace shows the agent never fetched the document, used evidence outside its permission boundary, or claimed absence without searching. The outbound version has the same shape: a final message can look plausible while the pre-send evidence path is invalid.
GroundEval treats agent behavior as something that can be tested against a state contract. The contract says what evidence exists, when it existed, who or what was allowed to access it, and which checks are required before a claim or action is valid.
For an outbound agent, the evaluation does not have to ask whether the email was good. It can ask a simpler and more important question: before sending, did the agent check the required systems and reach a valid send decision?
| Test component | Example |
|---|---|
| Question | Should this outbound agent send a prospecting email to this contact? |
| Ground truth | No. The contact belongs to an account that already uses the product. |
| Required trajectory | Check customer status, account mapping, email domain, outreach history, and suppression rules. |
| Failure condition | The agent sends or drafts outreach without fetching the records needed to justify the send decision. |
| Valid behavior | Suppress the send, cite the blocking record, and route to the account owner if review is needed. |
That is not a judge prompt. It is not a vibes-based review of whether the email seems reasonable. It is a deterministic check against the evidence path: what was searched, what was fetched, what state was available at the time, and whether the action followed from it.
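A deterministic check of this kind can be sketched in a few lines. The code below is an illustration only and is not GroundEval's API. The `Trace` shape, the `fetch:` step prefix, and the check names are hypothetical stand-ins for the required trajectory in the table.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical check names mirroring the required trajectory.
REQUIRED_CHECKS = frozenset({
    "customer_status",
    "account_mapping",
    "email_domain",
    "outreach_history",
    "suppression_rules",
})


@dataclass
class Trace:
    # Ordered steps the agent took, e.g. ["fetch:customer_status", "draft_email"].
    steps: List[str] = field(default_factory=list)
    final_action: str = "none"  # "send", "suppress", or "none"


def evaluate(trace: Trace, expected_action: str) -> dict:
    """Pass or fail based on the evidence path, never on the prose."""
    fetched = set()
    acted_before_evidence = False
    for step in trace.steps:
        if step.startswith("fetch:"):
            fetched.add(step[len("fetch:"):])
        elif step in ("draft_email", "send_email"):
            # Drafting or sending before every required fetch is a failure.
            if not REQUIRED_CHECKS <= fetched:
                acted_before_evidence = True
    missing = sorted(REQUIRED_CHECKS - fetched)
    passed = (
        not missing
        and not acted_before_evidence
        and trace.final_action == expected_action
    )
    return {
        "passed": passed,
        "missing": missing,
        "acted_before_evidence": acted_before_evidence,
    }
```

A trace that drafts first and fetches later fails even if it ends in the right action, which is the point: the result depends on what was fetched and when, not on how the email reads.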
The usual answer to risky automation is to put a human in the loop. Let someone review it before the agent sends. That can help, but only if the human can see more than the generated email.
A polished draft does not show that the account record says active customer. It does not show that the domain points somewhere else. It does not show that the contact was already emailed five times. It does not show that the product being pitched is already installed.
Without the trace, approval can become a nicer-looking version of the same problem. The reviewer is judging the artifact, not the decision that produced it.
The system needs preconditions. Before an agent acts, it should be able to show which checks were required, which systems were queried, which records were found, and which rule allowed the action to continue. If the required checks are missing, the action should stop before generation.
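One way to make that concrete is a wrapper that refuses to call the generator until the evidence is in hand. This is a sketch under assumptions: `CheckResult`, `PreconditionError`, and `act_if_cleared` are hypothetical names, and a real system would populate the results from its own CRM and outreach tools.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    name: str                  # which check was required
    system: str                # which system was queried
    record_id: Optional[str]   # which record was found, if any
    passed: bool               # whether the record allows the action


class PreconditionError(Exception):
    """Raised when the action must stop before generation."""


def act_if_cleared(
    required: List[str],
    results: List[CheckResult],
    rule: str,
    generate: Callable[[], str],
) -> dict:
    by_name = {result.name: result for result in results}
    missing = [name for name in required if name not in by_name]
    if missing:
        raise PreconditionError("missing checks: " + ", ".join(missing))
    failed = [name for name in required if not by_name[name].passed]
    if failed:
        raise PreconditionError("failed checks: " + ", ".join(failed))
    # Generation runs only after every required check is present and passing.
    draft = generate()
    # The evidence travels with the draft so a reviewer sees the decision too.
    return {
        "draft": draft,
        "rule": rule,
        "evidence": [by_name[name] for name in required],
    }
```

Returning the evidence alongside the draft gives a human reviewer the records behind the decision, not just the artifact.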
"Looks good to send" is not the same as "the required evidence was checked, and the send decision follows from it."
The common story about AI risk still focuses on the answer. Did the model hallucinate? Did it cite the wrong source? Did it say something unsafe? Those failures matter, but agents are moving into a different regime. They do not only answer. They send, file, update, route, approve, escalate, and suppress.
Once software can act, the eval target changes. The question is no longer only whether the output was correct. It becomes whether the action was valid under the state the agent was allowed to observe.
Bad AI outreach is a small, public example of that larger shift. The visible failure is the embarrassing email. The system failure is that no one could prove, before the email went out, that the agent had checked the facts required to send it.
That is the class of failure GroundEval is designed to catch. Not whether the prose sounds good. Whether the behavior was earned.