LLM agents are confident when they shouldn't be. They'll tell you a postmortem doesn't exist, a document was never written, or an incident had no follow-up, all without searching the places where those things would be found. Here's how to catch that deterministically.
Picture this: you've built an agent that can search your team's tools. Someone asks it, "Was a postmortem written for incident ENG-001?" The agent searches Jira, finds the incident ticket, sees no linked postmortem, and confidently answers "No, there was no postmortem."
Here's the problem: your team writes postmortems in Confluence, not Jira. The agent never searched Confluence. Even if "no" happens to be correct, the agent didn't earn it: had a postmortem been sitting in Confluence under a slightly different name, the agent would have given the same answer and been wrong.
Now scale this up. Your agent is answering questions about incident follow-ups, compliance checks, customer escalations, security investigations. In every case, "nothing found" requires the agent to have looked in the right places. How do you test that it did?
An LLM judge reading the agent's final answer cannot tell whether the agent searched Confluence or just stopped at Jira. The prose looks reasonable either way. You need to test the search, not the sentence.
GroundEval's Silence track starts from a simple premise: before an agent can claim something didn't happen, you should be able to point to the places where it would have happened, and verify the agent looked there.
You don't need to enumerate every possible path the agent might take. You just need to declare a search space: the subsystems and query patterns that constitute a thorough search. If the agent's recorded tool trace shows it covered those places and found nothing, the negative answer is valid. If it skipped them, the negative answer is invalid, even if it happens to be factually correct.
This is a state contract, not a prose evaluation. The ground truth comes from your event log and artifact corpus: you know whether the postmortem actually exists. The trajectory check comes from the recorded tool calls: you know whether the agent actually fetched the artifacts in the search space. No LLM judge required.
Let's walk through a minimal, concrete example. We've published it as the tiny example in the GroundEval repository: two actors, four events, one silence candidate. You can copy it and run it yourself in under ten minutes.
Alice opened an incident (ENG-001) on January 15. The incident was
escalated and eventually closed. But no postmortem was ever created. This is
exactly the kind of situation where an agent might confidently claim absence
without checking the right places.
Here's the event log:
    {"id": "evt-001", "type": "incident_opened", "timestamp": "2026-01-15T09:30:00", "actors": ["alice"], "artifact_ids": {"jira": "ENG-001"}, "facts": {"severity": "high"}}
    {"id": "evt-002", "type": "escalation_opened", "timestamp": "2026-01-15T10:00:00", "actors": ["alice"], "artifact_ids": {"jira": "ENG-001"}, "facts": {"reason": "latency"}}
    {"id": "evt-003", "type": "ticket_closed", "timestamp": "2026-01-16T14:00:00", "actors": ["alice"], "artifact_ids": {"jira": "ENG-001"}, "facts": {"resolution": "fixed"}}
    {"id": "evt-004", "type": "incident_opened", "timestamp": "2026-01-20T09:00:00", "actors": ["bob"], "artifact_ids": {"salesforce": "SF-001"}, "facts": {"severity": "medium"}}
Notice what's missing: there is no postmortem_created event for
ENG-001. The answer to "Was a postmortem written?" is genuinely "no."
But that doesn't mean any path to "no" is valid.
This is the heart of a silence test. You tell GroundEval: "Whenever an
incident_opened event occurs, I expect a postmortem_created
event within 7 days. If one isn't found, the agent must search Confluence (using
the Jira ticket ID as a query term) and the Jira ticket itself before it can claim
absence."
    silence_pairs:
      # When an incident is opened, we expect a postmortem within 7 days
      - trigger_event_type: incident_opened
        response_event_type: postmortem_created
        max_gap_days: 7
        search_space:
          # The agent must search Confluence for the postmortem
          - subsystem: confluence
            query_template: "postmortem artifact_ids.jira"
          # And must also check the Jira ticket itself
          - subsystem: jira
            id_template: "artifact_ids.jira"
The artifact_ids.jira template resolves to the actual ticket ID
(like ENG-001) at question time. This is how you connect the silence
rule to the specific artifacts the agent should search. The agent doesn't need to
guess where to look; you've declared it. The test just verifies that the agent
actually went there.
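To make the substitution concrete, here is a minimal Python sketch of how a template like the one above could be resolved against a trigger event. It is an illustration only: the function name resolve_template and the regex-based approach are assumptions, not GroundEval's actual implementation.

```python
import re

# Matches placeholders of the form artifact_ids.<subsystem>, as written in the config above.
_PLACEHOLDER = re.compile(r"artifact_ids\.(\w+)")


def resolve_template(template: str, event: dict) -> str:
    """Replace each artifact_ids.<key> placeholder with the event's artifact ID.

    Hypothetical helper for illustration; raises KeyError if the event
    carries no ID for the referenced subsystem.
    """
    def substitute(match: re.Match) -> str:
        return event["artifact_ids"][match.group(1)]

    return _PLACEHOLDER.sub(substitute, template)


event = {
    "id": "evt-001",
    "type": "incident_opened",
    "artifact_ids": {"jira": "ENG-001"},
}

print(resolve_template("postmortem artifact_ids.jira", event))  # postmortem ENG-001
print(resolve_template("artifact_ids.jira", event))             # ENG-001
```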
When you run groundeval generate, the framework scans your event log
for trigger events that lack a matching response event within the gap window. For
ENG-001, it finds an incident_opened with no
postmortem_created following within 7 days. It generates a Silence
question: "Was a postmortem written for incident ENG-001?" with ground truth
exists: false and an expected search space of two artifacts: a
Confluence search for "postmortem ENG-001" and a direct Jira lookup of
ENG-001.
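The scan described above can be sketched in a few lines of Python. This is not GroundEval's code: the function name find_silence_candidates, the required_ids parameter, and the rule that a response matches a trigger when the two events share an artifact ID are all assumptions made for illustration. The sketch also skips triggers that lack the artifact IDs the search space refers to, which is one plausible reason the second incident (SF-001) would not produce a question; the framework's actual rule may differ.

```python
import json
from datetime import datetime, timedelta

# The four events from the log above, one JSON object per line.
EVENT_LOG = """\
{"id": "evt-001", "type": "incident_opened", "timestamp": "2026-01-15T09:30:00", "actors": ["alice"], "artifact_ids": {"jira": "ENG-001"}, "facts": {"severity": "high"}}
{"id": "evt-002", "type": "escalation_opened", "timestamp": "2026-01-15T10:00:00", "actors": ["alice"], "artifact_ids": {"jira": "ENG-001"}, "facts": {"reason": "latency"}}
{"id": "evt-003", "type": "ticket_closed", "timestamp": "2026-01-16T14:00:00", "actors": ["alice"], "artifact_ids": {"jira": "ENG-001"}, "facts": {"resolution": "fixed"}}
{"id": "evt-004", "type": "incident_opened", "timestamp": "2026-01-20T09:00:00", "actors": ["bob"], "artifact_ids": {"salesforce": "SF-001"}, "facts": {"severity": "medium"}}
"""


def load_events(text):
    """Parse a JSON Lines string into a list of event dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def find_silence_candidates(events, trigger_type, response_type,
                            max_gap_days, required_ids=()):
    """Return IDs of trigger events with no matching response in the window.

    Assumption: a response "matches" a trigger when they share at least one
    (subsystem, artifact ID) pair. Triggers missing any of required_ids are
    skipped because the search-space templates could not be resolved for them.
    """
    candidates = []
    for trigger in events:
        if trigger["type"] != trigger_type:
            continue
        ids = trigger["artifact_ids"]
        if any(key not in ids for key in required_ids):
            continue
        start = datetime.fromisoformat(trigger["timestamp"])
        deadline = start + timedelta(days=max_gap_days)
        answered = any(
            e["type"] == response_type
            and start <= datetime.fromisoformat(e["timestamp"]) <= deadline
            and set(e["artifact_ids"].items()) & set(ids.items())
            for e in events
        )
        if not answered:
            candidates.append(trigger["id"])
    return candidates


events = load_events(EVENT_LOG)
print(find_silence_candidates(events, "incident_opened", "postmortem_created",
                              max_gap_days=7, required_ids=("jira",)))
# ['evt-001']
```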
Then, when your agent answers the question in tool mode, GroundEval records every
fetch_artifact and search_artifacts call. The scorer checks whether the trace covered each required search-space entry: here, the Jira fetch and the Confluence search. If a matching artifact does exist, the scorer can also check whether the agent fetched it. If the trace misses a required entry,
the trajectory score drops sharply, regardless of what answer the agent gave.
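As a rough illustration of the coverage check, the Python sketch below counts how many required search-space entries a recorded trace satisfies. The trace record shape, the function names, and the matching rule (a search counts if it targets the right subsystem and its query contains every term of the resolved query) are assumptions for this example; only the tool names fetch_artifact and search_artifacts come from the framework as described here.

```python
def entry_covered(entry, trace):
    """Return True if any recorded tool call satisfies one search-space entry.

    Hypothetical matching rule:
      - entries with an "id" need a fetch_artifact call for that exact ID;
      - entries with a "query" need a search_artifacts call in the same
        subsystem whose query contains every term (case-insensitive).
    """
    for call in trace:
        if call["subsystem"] != entry["subsystem"]:
            continue
        if "id" in entry:
            if call["tool"] == "fetch_artifact" and call.get("artifact_id") == entry["id"]:
                return True
        elif call["tool"] == "search_artifacts":
            terms = entry["query"].lower().split()
            query = call.get("query", "").lower()
            if all(term in query for term in terms):
                return True
    return False


def search_space_coverage(required, trace):
    """Fraction of required search-space entries covered by the trace."""
    covered = sum(entry_covered(entry, trace) for entry in required)
    return covered / len(required)


# Resolved search space for ENG-001.
required = [
    {"subsystem": "confluence", "query": "postmortem ENG-001"},
    {"subsystem": "jira", "id": "ENG-001"},
]

# An agent that stops at Jira.
jira_only = [
    {"tool": "fetch_artifact", "subsystem": "jira", "artifact_id": "ENG-001"},
]

# An agent that also searches Confluence.
thorough = jira_only + [
    {"tool": "search_artifacts", "subsystem": "confluence", "query": "ENG-001 postmortem"},
]

print(search_space_coverage(required, jira_only))  # 0.5
print(search_space_coverage(required, thorough))   # 1.0
```

Note that this sketch computes coverage only; how the trajectory score itself is derived from the trace is not modeled here.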
Here's the scenario GroundEval exposes: the agent searches Jira, finds
ENG-001, sees no linked postmortem, and answers "No, no postmortem was
created." It never touches Confluence. The answer is factually correct, but the
search was incomplete.
| Metric | Score | What it means |
| --- | --- | --- |
| answer_score | 1.000 | Answer was correct: no postmortem exists |
| trajectory_score | 0.250 | Agent searched Jira but never touched Confluence |
| search_space_coverage | 0.500 | 1 of 2 required artifacts fetched |
| combined (weighted 0.30/0.70) | 0.475 | Valid path matters more than correct answer |
The combined score is weighted 0.70 toward trajectory because, for silence tests, the whole point is proving the negative is justified. A correct answer reached through an incomplete search is a failure. The framework doesn't negotiate on this.
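The arithmetic behind the combined score can be checked by hand. The sketch below assumes a plain weighted sum, which is consistent with the numbers in the table above; the function name combined_score is hypothetical.

```python
def combined_score(answer_score, trajectory_score,
                   answer_weight=0.30, trajectory_weight=0.70):
    """Weighted sum of answer and trajectory scores (assumed formula)."""
    return answer_weight * answer_score + trajectory_weight * trajectory_score


# 0.30 * 1.000 + 0.70 * 0.250 = 0.300 + 0.175
print(round(combined_score(1.000, 0.250), 3))  # 0.475
```

With these weights, even a perfect answer contributes at most 0.30, so an agent that skips the declared search space cannot score well on a silence question.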
If you sent this agent's response to an LLM judge, it would likely say something like "The agent correctly identified that no postmortem exists and cited the relevant ticket. Score: 0.85." The judge can't see the search space the agent skipped. That's the blind spot.
The config is a state contract. It doesn't prescribe agent behavior; it declares what correctness means in your domain. Here's how to think about it for silence tests.
Ask yourself: in my domain, what events create an expectation that something else should happen? Some examples from the domain packs that ship with GroundEval:
For each pair, ask: if the response event didn't happen, where would an agent need to look to be sure? This is the search space. It's not a suggestion; it's a requirement. The agent's trajectory score depends on covering it.
    silence_pairs:
      # Incident → Postmortem: agent must check Confluence AND the ticket
      - trigger_event_type: incident_opened
        response_event_type: postmortem_created
        max_gap_days: 7
        search_space:
          - subsystem: confluence
            query_template: "postmortem artifact_ids.jira"
          - subsystem: jira
            id_template: "artifact_ids.jira"
      # Escalation → Customer follow-up: agent must check Zendesk AND email
      - trigger_event_type: escalation_opened
        response_event_type: customer_followup_sent
        max_gap_days: 3
        search_space:
          - subsystem: zendesk
            query_template: "follow-up artifact_ids.zendesk"
          - subsystem: email
            query_template: "follow-up re: artifact_ids.jira"
Generate questions from your config, run your agent against them, and read the per-question diagnostic trace. The diagnostic trace is not itself a scoring input (scoring uses the recorded tool calls); it's there so you can see exactly which searches the agent ran, which it skipped, and why the score came out the way it did. No reconstructing from interleaved logs. No asking a judge model to guess whether the agent's search was sufficient.
The config is reusable. Write it once for your domain, and it works across every model version, every prompt change, every agent framework you swap in. The silence contract doesn't care how your agent is implemented. It only cares whether the trace says the search happened.
If you're coming from a unit testing background, the mental model shift is this: you're not asserting that the agent's output matches an expected string. You're declaring what "thorough checking" means in your domain, and then verifying that the agent's recorded behavior satisfies that declaration.
A unit test says "assert the answer equals no." A GroundEval silence test says "assert the agent searched these specific artifacts before answering." The answer is secondary. The search is primary.
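The contrast fits in two assertions. The Python sketch below uses a hypothetical recorded run (the run dict and both helper names are made up for illustration) to show how the same run passes the string check and fails the search check.

```python
# Hypothetical recorded run: the final answer plus the
# (subsystem, target) pairs the agent actually touched.
run = {
    "answer": "no",
    "touched": {("jira", "ENG-001")},
}

# What "thorough checking" means for this question.
required = {
    ("confluence", "postmortem ENG-001"),
    ("jira", "ENG-001"),
}


def answer_matches(run, expected="no"):
    """Unit-test style check: does the output equal the expected string?"""
    return run["answer"] == expected


def search_satisfied(run, required):
    """Silence-test style check: did the run touch every required target?"""
    return required <= run["touched"]


print(answer_matches(run))              # True
print(search_satisfied(run, required))  # False
```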
This is why the trajectory weight for silence is 0.70. The framework is not ambivalent about whether searching matters. If your agent can get the right answer without searching, that's not a passed test; it's a test that exposed a gap in your evaluation.
The Silence track exists because absence claims are the easiest thing for an agent to fake. A confident tone, a plausible-sounding search description, a correctly formatted negative answer. None of it proves the agent actually looked.
GroundEval doesn't ask whether the answer sounds right. It asks whether the agent's tool trace contains the fetches and searches you declared as necessary. If the trace is missing them, the score reflects it, no matter how fluent the final answer.
Start with a single silence pair. Pick one trigger event type, one expected response, two search locations. Run it against your agent. The gap between what you declared and what the trace shows will tell you more than any LLM judge ever could.