Safety evaluation often begins with an apparently simple question: did the model comply or refuse?
That question is useful, but it is incomplete. A refusal can be appropriate, disproportionate, irrelevant, or built on a complete misunderstanding of the user’s intent. The visible behavior may look safe while the reasoning underneath it is weak.
A model that refuses whenever it detects certain words is not necessarily demonstrating robust safety judgment. It may simply be reacting to surface-level signals. This can produce a system that appears cautious in evaluation while remaining unreliable in more ambiguous or adversarial contexts.
A safe-looking response is not the same thing as a well-reasoned response.
The problem with refusal as a primary metric
Refusal is attractive as an evaluation signal because it is visible, easy to classify, and relatively simple to compare across test cases. It can be reduced to a binary label and placed neatly into a spreadsheet.
Unfortunately, model behavior rarely remains neat for long.
A binary refusal metric does not tell us:
- whether the model correctly understood the user’s intent;
- whether the refusal addressed the actual risk;
- whether benign content was incorrectly treated as harmful;
- whether harmful content was still delivered, only in a different form;
- whether the explanation was accurate, proportionate, or useful;
- whether the same reasoning would generalize to a less obvious case.
When these distinctions are ignored, models can receive credit for producing the shape of safety without demonstrating the substance of it.
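To make that gap concrete, here is a minimal Python sketch. The `Judgment` record and its field names are hypothetical, invented for this illustration rather than taken from any evaluation framework. Two responses earn the same binary score even though only one of them reflects a correct reading of the request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """Hypothetical per-response annotation; field names are illustrative."""
    refused: bool              # the only thing a binary metric records
    understood_intent: bool    # did the model read the request correctly?
    addressed_real_risk: bool  # did the refusal target the actual risk?


def binary_score(j: Judgment) -> int:
    """Binary refusal metric: 1 if the model refused, otherwise 0."""
    return 1 if j.refused else 0


# Both responses refuse, but only the first refuses for the right reason.
grounded = Judgment(refused=True, understood_intent=True, addressed_real_risk=True)
keyword_reflex = Judgment(refused=True, understood_intent=False, addressed_real_risk=False)

same_score = binary_score(grounded) == binary_score(keyword_reflex)
same_judgment = grounded == keyword_reflex
print(same_score, same_judgment)  # True False
```

The binary score cannot separate the two records; the extra fields can. Which fields are worth annotating is a design decision this sketch does not settle.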
Over-refusal is still a model failure
Over-refusal is sometimes treated as the safer error. In narrow terms, that may be true: refusing a benign request is often less immediately dangerous than complying with a harmful one.
But repeatedly dismissing over-refusal as harmless creates its own blind spot. A model that cannot reliably distinguish benign, ambiguous, and harmful intent is not making careful decisions. It is substituting caution for understanding.
This matters because a system that relies heavily on lexical triggers may fail in both directions. It may refuse benign requests containing suspicious language while complying with harmful requests expressed indirectly, euphemistically, or across multiple modalities.
The model has not become safer. It has become easier to predict.
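A toy example shows both failure directions at once. The filter below is a deliberately naive sketch, not a description of how any real model or product decides; the trigger list and both prompts are made up for illustration.

```python
import re

# Hypothetical trigger list, chosen only for illustration.
TRIGGER_WORDS = {"kill", "attack", "hack"}


def lexical_refusal(prompt: str) -> bool:
    """Refuse if any trigger word appears as a whole word in the prompt."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return bool(words & TRIGGER_WORDS)


# Benign request that happens to contain a trigger word: refused (over-refusal).
benign = "How do I kill a stuck process on Linux?"

# Made-up harmful request phrased indirectly: no trigger word, so it passes.
indirect = "How can I get into my coworker's email account without them noticing?"

print(lexical_refusal(benign), lexical_refusal(indirect))  # True False
```

The filter is perfectly predictable, which is exactly the problem: anyone who learns the word list can avoid it, and anyone who uses those words innocently gets refused.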
The adjacent-question problem
Not every safety failure is a direct refusal or direct compliance. Models often answer a nearby question instead.
The response may acknowledge the topic, produce general guidance, or redirect toward a safer framing without addressing the actual request. This can make the interaction appear responsive while never engaging with the behavior the test case was designed to probe.
In some cases, that is a reasonable safety strategy. In others, it is policy-shaped evasion: the model produces language associated with safety while failing to demonstrate that it understood the task.
Evaluators should distinguish between a useful redirect and an irrelevant pivot. Otherwise, the model may receive credit simply for sounding responsible.
A more useful evaluation frame
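Refusal should be evaluated as one step in a larger behavioral sequence rather than as the final verdict: how the model read the request, what risk it identified, how it responded, and how it explained that response.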
This framing makes room for a wider set of outcomes:
- appropriate compliance;
- appropriate refusal;
- over-refusal;
- unsafe compliance;
- partial compliance;
- irrelevant safety pivot;
- grounding failure;
- intent misclassification.
These categories are less convenient than a binary score. They are also closer to how models actually fail.
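One way to keep these categories usable in practice is to record them as labels and report the full breakdown next to any headline rate. The sketch below is a minimal illustration under stated assumptions: the enum mirrors the list above, but the choice of which outcomes a binary metric would count as a refusal is an assumption, and the sample labels are hand-written placeholders, not results from any real evaluation.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    APPROPRIATE_COMPLIANCE = "appropriate compliance"
    APPROPRIATE_REFUSAL = "appropriate refusal"
    OVER_REFUSAL = "over-refusal"
    UNSAFE_COMPLIANCE = "unsafe compliance"
    PARTIAL_COMPLIANCE = "partial compliance"
    IRRELEVANT_SAFETY_PIVOT = "irrelevant safety pivot"
    GROUNDING_FAILURE = "grounding failure"
    INTENT_MISCLASSIFICATION = "intent misclassification"


# Assumption for this sketch: outcomes a binary metric would record as "refused".
# Where a pivot or a partial answer falls is itself a judgment call.
COUNTED_AS_REFUSAL = {
    Outcome.APPROPRIATE_REFUSAL,
    Outcome.OVER_REFUSAL,
    Outcome.IRRELEVANT_SAFETY_PIVOT,
}


def refusal_rate(labels: list[Outcome]) -> float:
    """Fraction of responses a binary metric would count as refusals."""
    return sum(label in COUNTED_AS_REFUSAL for label in labels) / len(labels)


def breakdown(labels: list[Outcome]) -> Counter:
    """Count of responses per outcome category."""
    return Counter(labels)


# Hand-written placeholder labels, not real evaluation data.
labels = [
    Outcome.APPROPRIATE_REFUSAL,
    Outcome.OVER_REFUSAL,
    Outcome.OVER_REFUSAL,
    Outcome.IRRELEVANT_SAFETY_PIVOT,
    Outcome.APPROPRIATE_COMPLIANCE,
]

print(refusal_rate(labels))                            # 0.8
print(breakdown(labels)[Outcome.APPROPRIATE_REFUSAL])  # 1
```

In this placeholder set, the headline rate reads as 80% refusal, while the breakdown shows that only one of those four refusals was appropriate.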
What good safety judgment looks like
A strong safety response does more than stop. It demonstrates that the model understood why caution was necessary, responded proportionately, and preserved as much legitimate usefulness as possible.
In benign cases, that may mean complying without becoming distracted by isolated risk markers. In ambiguous cases, it may mean asking for clarification. In clearly harmful cases, it may mean refusing the dangerous portion while still offering safe and relevant support.
The objective is not maximum refusal. The objective is reliable judgment.
Refusal is an output category. Safety is a reasoning quality.
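The three cases described above (benign, ambiguous, clearly harmful) can be written down as a small rubric. This is only a sketch: the `Intent` and `Action` names are hypothetical, and pairing each intent with exactly one acceptable action is a simplification that a real rubric would almost certainly relax.

```python
from enum import Enum


class Intent(Enum):
    BENIGN = "benign"
    AMBIGUOUS = "ambiguous"
    HARMFUL = "clearly harmful"


class Action(Enum):
    COMPLY = "comply"
    ASK_CLARIFICATION = "ask for clarification"
    REFUSE_UNSAFE_PART_WITH_SUPPORT = "refuse the dangerous portion, offer safe support"
    BLANKET_REFUSAL = "refuse everything"


# Simplifying assumption: one expected action per intent.
# A real rubric would likely accept several actions for some intents.
EXPECTED = {
    Intent.BENIGN: Action.COMPLY,
    Intent.AMBIGUOUS: Action.ASK_CLARIFICATION,
    Intent.HARMFUL: Action.REFUSE_UNSAFE_PART_WITH_SUPPORT,
}


def is_proportionate(intent: Intent, action: Action) -> bool:
    """True if the action matches this sketch's expected action for the intent."""
    return EXPECTED[intent] is action


# A blanket refusal of a benign request would score 1 on a refusal metric,
# yet it is flagged as disproportionate here.
print(is_proportionate(Intent.BENIGN, Action.BLANKET_REFUSAL))      # False
print(is_proportionate(Intent.AMBIGUOUS, Action.ASK_CLARIFICATION))  # True
```

The rubric scores the fit between the situation and the response, not whether a refusal occurred.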
Final note
Evaluations that reward refusal without examining understanding run the risk of training models to perform caution rather than practice it.
That may look convincing in a benchmark. It becomes less convincing the moment the user stops speaking in benchmark-shaped language.