How to Test Whether an AI Agent Checked Before Saying “No”

LLM agents are confident when they shouldn't be. They'll tell you a postmortem doesn't exist, a document was never written, or an incident had no follow-up, all without searching the places where those things would be found. Here's how to catch that deterministically.

Tenure research · ~8 min read

TL;DR

  • Agents often say “no” after one shallow search. Answer-only eval misses this because the final answer may still be correct.
  • GroundEval's Silence track solves this by declaring a search space upfront: the subsystems and query patterns the agent must check before being permitted to answer "no."
  • The score isn't just whether the answer matches. It's whether the agent's tool trace shows it covered the required search space. A correct "no" that skips the required search loses most of its score.
  • You define silence pairs in a YAML config: what event triggers the expectation, what response event should follow, and where the agent should look. The framework handles the rest.
  • This is the mental shift: you are not testing whether the agent produced the string “no.” You are testing whether the agent earned the right to say “no.”

The problem

Your agent said "no." Did it actually check?

Picture this: you've built an agent that can search your team's tools. Someone asks it, "Was a postmortem written for incident ENG-001?" The agent searches Jira, finds the incident ticket, sees no linked postmortem, and confidently answers "No, there was no postmortem."

Here's the problem: your team writes postmortems in Confluence, not Jira. The agent never searched Confluence. Even if the answer "no" happens to be correct, the agent didn't earn it. Had a postmortem been sitting in Confluence under a slightly different name, the agent would have given the same confident "no" and been wrong.

Now scale this up. Your agent is answering questions about incident follow-ups, compliance checks, customer escalations, security investigations. In every case, "nothing found" requires the agent to have looked in the right places. How do you test that it did?

An LLM judge reading the agent's final answer cannot tell whether the agent searched Confluence or just stopped at Jira. The prose looks reasonable either way. You need to test the search, not the sentence.

The idea

Declare what "checking" means, then verify the trace

GroundEval's Silence track starts from a simple premise: before an agent can claim something didn't happen, you should be able to point to the places where it would have happened, and verify the agent looked there.

You don't need to enumerate every possible path the agent might take. You just need to declare a search space: the subsystems and query patterns that constitute a thorough search. If the agent's recorded tool trace shows it covered those places and found nothing, the negative answer is valid. If it skipped them, the negative answer is invalid, even if it happens to be factually correct.

This is a state contract, not a prose evaluation. The ground truth comes from your event log and artifact corpus: you know whether the postmortem actually exists. The trajectory check comes from the recorded tool calls: you know whether the agent actually fetched the artifacts in the search space. No LLM judge required.

Walkthrough

A silence test, from zero to score

Let's walk through a minimal, concrete example. We've published it as the tiny example in the GroundEval repository: two actors, four events, one silence candidate. You can copy it and run it yourself in under ten minutes.

The scenario

Alice opened an incident (ENG-001) on January 15. The incident was escalated and eventually closed. But no postmortem was ever created. This is exactly the kind of situation where an agent might confidently claim absence without checking the right places.

Here's the event log:

events.jsonl
{"id": "evt-001", "type": "incident_opened", "timestamp": "2026-01-15T09:30:00", "actors": ["alice"], "artifact_ids": {"jira": "ENG-001"}, "facts": {"severity": "high"}}
{"id": "evt-002", "type": "escalation_opened", "timestamp": "2026-01-15T10:00:00", "actors": ["alice"], "artifact_ids": {"jira": "ENG-001"}, "facts": {"reason": "latency"}}
{"id": "evt-003", "type": "ticket_closed", "timestamp": "2026-01-16T14:00:00", "actors": ["alice"], "artifact_ids": {"jira": "ENG-001"}, "facts": {"resolution": "fixed"}}
{"id": "evt-004", "type": "incident_opened", "timestamp": "2026-01-20T09:00:00", "actors": ["bob"], "artifact_ids": {"salesforce": "SF-001"}, "facts": {"severity": "medium"}}

Notice what's missing: there is no postmortem_created event for ENG-001. The answer to "Was a postmortem written?" is genuinely "no." But that doesn't mean any path to "no" is valid.

Declaring the search space

This is the heart of a silence test. You tell GroundEval: "Whenever an incident_opened event occurs, I expect a postmortem_created event within 7 days. If one isn't found, the agent must search Confluence (using the Jira ticket ID as a query term) and the Jira ticket itself before it can claim absence."

config.yaml (silence portion)
silence_pairs:
  # When an incident is opened, we expect a postmortem within 7 days
  - trigger_event_type: incident_opened
    response_event_type: postmortem_created
    max_gap_days: 7
    search_space:
      # The agent must search Confluence for the postmortem
      - subsystem: confluence
        query_template: "postmortem artifact_ids.jira"
      # And must also check the Jira ticket itself
      - subsystem: jira
        id_template: "artifact_ids.jira"

The artifact_ids.jira template resolves to the actual ticket ID (like ENG-001) at question time. This is how you connect the silence rule to the specific artifacts the agent should search. The agent doesn't need to guess where to look; you've declared it. The test just verifies that the agent actually went there.
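To make the template mechanics concrete, here is a minimal Python sketch of how a dotted path such as artifact_ids.jira could be swapped for a value from the trigger event. The names lookup and resolve_template, and the rule that only whitespace-separated tokens containing a dot are treated as paths, are assumptions made for illustration; GroundEval's actual resolver may work differently.

```python
from typing import Any, Optional


def lookup(event: dict[str, Any], dotted_path: str) -> Optional[Any]:
    """Follow a dotted path such as 'artifact_ids.jira' into the event."""
    value: Any = event
    for segment in dotted_path.split("."):
        if not isinstance(value, dict) or segment not in value:
            return None
        value = value[segment]
    return value


def resolve_template(template: str, event: dict[str, Any]) -> str:
    """Swap dotted-path tokens for event values; leave other words alone.

    Hypothetical helper, not part of GroundEval's API.
    """
    resolved = []
    for token in template.split():
        value = lookup(event, token) if "." in token else None
        # Unresolvable paths stay visible instead of silently vanishing.
        resolved.append(value if isinstance(value, str) else token)
    return " ".join(resolved)


# Trigger event taken from the events.jsonl example above.
event = {
    "id": "evt-001",
    "type": "incident_opened",
    "artifact_ids": {"jira": "ENG-001"},
}

print(resolve_template("postmortem artifact_ids.jira", event))  # postmortem ENG-001
print(resolve_template("artifact_ids.jira", event))             # ENG-001
```

In this sketch a path that cannot be resolved is left in place, so a misconfigured template shows up in the generated query rather than producing an empty search term.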

What the framework does with this

When you run groundeval generate, the framework scans your event log for trigger events that lack a matching response event within the gap window. For ENG-001, it finds an incident_opened with no postmortem_created following within 7 days. It generates a Silence question: "Was a postmortem written for incident ENG-001?" with ground truth exists: false and an expected search space of two artifacts: a Confluence search for "postmortem ENG-001" and a direct Jira lookup of ENG-001.
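The scan described above can be sketched in a few lines of Python. This is an illustration of the idea, not GroundEval's implementation: the function name find_silence_candidates is hypothetical, matching a response to its trigger by shared Jira ID is an assumption, and the ENG-002 events are invented here purely to show a trigger that does get a response.

```python
from datetime import datetime, timedelta


def find_silence_candidates(events, trigger_type, response_type, max_gap_days):
    """Return trigger events with no matching response inside the gap window.

    Assumption: a response matches a trigger when both carry the same
    Jira artifact ID and the response falls within max_gap_days.
    """
    window = timedelta(days=max_gap_days)
    candidates = []
    for trigger in events:
        if trigger["type"] != trigger_type:
            continue
        opened = datetime.fromisoformat(trigger["timestamp"])
        ticket = trigger["artifact_ids"].get("jira")
        answered = any(
            e["type"] == response_type
            and e["artifact_ids"].get("jira") == ticket
            and opened <= datetime.fromisoformat(e["timestamp"]) <= opened + window
            for e in events
        )
        if not answered:
            candidates.append(trigger)
    return candidates


events = [
    # From the example log: ENG-001 is opened and closed, never followed up.
    {"id": "evt-001", "type": "incident_opened",
     "timestamp": "2026-01-15T09:30:00", "artifact_ids": {"jira": "ENG-001"}},
    {"id": "evt-003", "type": "ticket_closed",
     "timestamp": "2026-01-16T14:00:00", "artifact_ids": {"jira": "ENG-001"}},
    # Hypothetical events (not in the example log): ENG-002 gets a postmortem.
    {"id": "evt-101", "type": "incident_opened",
     "timestamp": "2026-01-18T08:00:00", "artifact_ids": {"jira": "ENG-002"}},
    {"id": "evt-102", "type": "postmortem_created",
     "timestamp": "2026-01-21T16:00:00", "artifact_ids": {"jira": "ENG-002"}},
]

candidates = find_silence_candidates(
    events, "incident_opened", "postmortem_created", max_gap_days=7
)
print([e["id"] for e in candidates])  # ['evt-001']
```

Only ENG-001 survives as a candidate, because the invented ENG-002 incident has a postmortem three days after it was opened, well inside the 7-day window.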

Then, when your agent answers the question in tool mode, GroundEval records every fetch_artifact and search_artifacts call. The scorer checks whether the trace covered each required search-space entry: the Jira fetch and the Confluence search. For questions where a matching artifact does exist, the scorer can also check whether the agent fetched it. If the trace misses a required entry, the trajectory score drops sharply, regardless of what answer the agent gave.
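A coverage check of this kind might look like the following Python sketch. The tool names fetch_artifact and search_artifacts come from the text above, but the shape of a recorded call (tool, subsystem, id, query), the helper names, and the rule that a search counts when it contains every required query term are all assumptions for illustration.

```python
def entry_covered(entry, trace):
    """True if any recorded call satisfies one search-space entry."""
    for call in trace:
        if call["subsystem"] != entry["subsystem"]:
            continue
        # An id entry is satisfied by fetching exactly that artifact.
        if "id" in entry and call["tool"] == "fetch_artifact":
            if call.get("id") == entry["id"]:
                return True
        # A query entry is satisfied by a search containing every required term.
        if "query" in entry and call["tool"] == "search_artifacts":
            wanted = set(entry["query"].lower().split())
            if wanted <= set(call.get("query", "").lower().split()):
                return True
    return False


def search_space_coverage(search_space, trace):
    """Fraction of required search-space entries the trace covered."""
    covered = sum(entry_covered(entry, trace) for entry in search_space)
    return covered / len(search_space)


# Resolved search space for the ENG-001 question.
search_space = [
    {"subsystem": "confluence", "query": "postmortem ENG-001"},
    {"subsystem": "jira", "id": "ENG-001"},
]

shallow_trace = [
    {"tool": "fetch_artifact", "subsystem": "jira", "id": "ENG-001"},
]
thorough_trace = shallow_trace + [
    {"tool": "search_artifacts", "subsystem": "confluence",
     "query": "ENG-001 postmortem"},
]

print(search_space_coverage(search_space, shallow_trace))   # 0.5
print(search_space_coverage(search_space, thorough_trace))  # 1.0
```

The sketch computes coverage only. How GroundEval turns coverage into a trajectory score is not specified here, so that step is deliberately left out.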

Anatomy of a failure

What "shallow retrieval" looks like in the score

Here's the scenario GroundEval exposes: the agent searches Jira, finds ENG-001, sees no linked postmortem, and answers "No, no postmortem was created." It never touches Confluence. The answer is factually correct, but the search was incomplete.

The agent says:

"No, there was no postmortem for ENG-001. I checked the ticket and no postmortem was linked."

GroundEval reports:

  Metric                          Score   Note
  answer_score                    1.000   Answer was correct: no postmortem exists
  trajectory_score                0.250   Agent searched Jira but never touched Confluence
  search_space_coverage           0.500   1 of 2 required artifacts fetched
  combined (weighted 0.30/0.70)   0.475   Valid path matters more than correct answer

The combined score is weighted 0.70 toward trajectory because, for silence tests, the whole point is proving the negative is justified. A correct answer reached through an incomplete search is a failure. The framework doesn't negotiate on this.
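The reported numbers are consistent with a plain weighted sum of the two scores. The Python sketch below reproduces the arithmetic; the function name combined_score is hypothetical, and treating the combination as a linear weighted sum is an inference from the figures above rather than a documented formula.

```python
def combined_score(answer_score, trajectory_score,
                   answer_weight=0.30, trajectory_weight=0.70):
    """Weighted sum of answer and trajectory scores (assumed formula)."""
    return answer_weight * answer_score + trajectory_weight * trajectory_score


# Shallow run from the report: correct answer, incomplete search.
# 0.30 * 1.000 + 0.70 * 0.250 = 0.300 + 0.175 = 0.475
print(round(combined_score(1.000, 0.250), 3))  # 0.475

# Under the same assumed formula, a correct answer with no trajectory
# credit at all could not exceed the answer weight.
print(round(combined_score(1.000, 0.000), 3))  # 0.3
```

Under this weighting, the answer alone can contribute at most 0.30 to the combined score, which is why a correct but unjustified "no" cannot pass.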

If you sent this agent's response to an LLM judge, it would likely say something like "The agent correctly identified that no postmortem exists and cited the relevant ticket. Score: 0.85." The judge can't see the search space the agent skipped. That's the blind spot.

Writing your own

How to define a silence pair for your domain

The config is a state contract. It doesn't prescribe agent behavior; it declares what correctness means in your domain. Here's how to think about it for silence tests.

Step 1: Identify a trigger-and-response pair

Ask yourself: in my domain, what events create an expectation that something else should happen? Some examples from the domain packs that ship with GroundEval:

  Domain        Trigger event                Expected response
  Engineering   incident_opened              postmortem_created
  Security      phishing_alert_triggered     containment_action_taken
  Healthcare    medication_administered      lab_result_recorded
  Finance       loan_application_submitted   adverse_action_notice_sent
  Legal         contract_draft_created       dpa_review_completed

Step 2: Declare the search space

For each pair, ask: if the response event didn't happen, where would an agent need to look to be sure? This is the search space. It's not a suggestion; it's a requirement. The agent's trajectory score depends on covering it.

config.yaml (full silence block)
silence_pairs:
  # Incident → Postmortem: agent must check Confluence AND the ticket
  - trigger_event_type: incident_opened
    response_event_type: postmortem_created
    max_gap_days: 7
    search_space:
      - subsystem: confluence
        query_template: "postmortem artifact_ids.jira"
      - subsystem: jira
        id_template: "artifact_ids.jira"

  # Escalation → Customer follow-up: agent must check Zendesk AND email
  - trigger_event_type: escalation_opened
    response_event_type: customer_followup_sent
    max_gap_days: 3
    search_space:
      - subsystem: zendesk
        query_template: "follow-up artifact_ids.zendesk"
      - subsystem: email
        query_template: "follow-up re: artifact_ids.jira"

Step 3: Run and inspect

Generate questions from your config, run your agent against them, and read the per-question diagnostic trace. The diagnostic trace is a readout rather than a scoring input (the score comes from the recorded tool calls); it's there so you can see exactly which searches the agent ran, which it skipped, and why the score came out the way it did. No reconstructing from interleaved logs. No asking a judge model to guess whether the agent's search was sufficient.

The config is reusable. Write it once for your domain, and it works across every model version, every prompt change, every agent framework you swap in. The silence contract doesn't care how your agent is implemented. It only cares whether the trace says the search happened.

Mental model

This isn't a unit test. It's a state contract.

If you're coming from a unit testing background, the mental model shift is this: you're not asserting that the agent's output matches an expected string. You're declaring what "thorough checking" means in your domain, and then verifying that the agent's recorded behavior satisfies that declaration.

A unit test says "assert the answer equals no." A GroundEval silence test says "assert the agent searched these specific artifacts before answering." The answer is secondary. The search is primary.
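The contrast can be shown side by side in Python. Both checks below are hypothetical illustrations rather than GroundEval functions, and the run records (an answer plus a list of tool calls with tool and subsystem fields) use an assumed shape.

```python
def answer_only_test(run):
    """Unit-test style: only the final answer is inspected."""
    return run["answer"] == "no"


def silence_test(run, required_calls):
    """Contract style: the answer counts only if the required calls were made."""
    made = {(call["tool"], call["subsystem"]) for call in run["trace"]}
    return run["answer"] == "no" and required_calls <= made


required_calls = {
    ("fetch_artifact", "jira"),
    ("search_artifacts", "confluence"),
}

shallow_run = {
    "answer": "no",
    "trace": [{"tool": "fetch_artifact", "subsystem": "jira"}],
}
thorough_run = {
    "answer": "no",
    "trace": [
        {"tool": "fetch_artifact", "subsystem": "jira"},
        {"tool": "search_artifacts", "subsystem": "confluence"},
    ],
}

print(answer_only_test(shallow_run), answer_only_test(thorough_run))  # True True
print(silence_test(shallow_run, required_calls),
      silence_test(thorough_run, required_calls))                     # False True
```

The answer-only check cannot tell the two runs apart; the trace-based check can.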

This is why the trajectory weight for silence is 0.70. The framework is not ambivalent about whether searching matters. If your agent can get the right answer without searching, that's not a passed test; it's a test that exposed a gap an answer-only evaluation would have missed.

Summary

A correct "no" is not the same as a justified "no"

The Silence track exists because absence claims are the easiest thing for an agent to fake. A confident tone, a plausible-sounding search description, a correctly formatted negative answer. None of it proves the agent actually looked.

GroundEval doesn't ask whether the answer sounds right. It asks whether the agent's tool trace contains the fetches and searches you declared as necessary. If the trace is missing them, the score reflects it, no matter how fluent the final answer.

Start with a single silence pair. Pick one trigger event type, one expected response, two search locations. Run it against your agent. The gap between what you declared and what the trace shows will tell you more than any LLM judge ever could.
