The canonical eval suites — MMLU, HellaSwag, ARC, HumanEval, the rest — cover what we wrote them to cover. They are necessary. They are not sufficient.
Here are three places where the canonical batteries go quiet, and what we have to write ourselves to fill the gap.
1. Composite tool-use under partial failure
The standard tool-use evals test single-tool invocations against ground truth. The agent calls one tool, the answer is either correct or it isn’t, we score it and move on.
Production agents don’t behave that way. They call three tools, fail the second, retry once, then pick a different path because the third tool depends on a value the second one was supposed to return. None of the canonical batteries score this — they can’t, because they don’t model partial failure as part of the protocol.
We wrote a suite that does. It runs 240 adversarial cases per agent. Our best models pass 228 of them. The 12 they fail are documented. The report is signed.
2. Drift between the agent and the system prompt
Six weeks after shipping, the model provider quietly improves the base model. Your prompts, untouched, now produce subtly different distributions: a few percent more eager to commit to a tool call, a few percent less likely to ask for clarification.
The canonical evals don’t catch this because they don’t replay your prompts against the new base. They were never designed to. You have to.
We run a weekly regression against a frozen test set, scored on a different SHA each time. When the score drifts past 1.5 standard deviations from the baseline, the on-call gets paged. Not the engineer who shipped — the on-call. There’s no negotiating with the threshold.
3. The third party between the model and the user
Most production agents have at least one third party in the path: a vector DB, a search API, a tool server, a feature store. The canonical evals score the model in isolation, with mock returns from those services.
The third party will go down. It will return malformed JSON. It will rate-limit your retries. It will return data from yesterday’s index. Every one of those is a failure mode of the agent, even if the model itself is fine.
We score every third-party path under every failure mode the third party can produce. Most of the time the agent degrades gracefully — falls back, asks for clarification, surfaces the failure. The cases where it doesn’t are the ones we document and ship in the audit pack.
The receipt
The canonical suites are the floor. Your suite is what you can sign your name to. The day someone asks you which suite you shipped against, you should be able to name it, hand them the hashes, and not need to explain.