Purpose
Multimodal evaluation requires more than checking whether a model can identify objects in an image or summarize visible content. A system may accurately recognize individual inputs while still failing to understand the meaning that emerges from their combination.
This framework separates recognition, integration, relational reasoning, intent, safety judgment, and output fidelity so evaluators can identify where a failure actually occurs.
Image recognition is not image understanding, and input recognition is not cross-modal reasoning.
Evaluation layers
Input recognition
Can the model accurately identify the visible objects, people, text, relationships, and environmental details in each image?
Cross-modal integration
Does the model combine the visual evidence with the user’s text rather than processing each input independently?
Relational understanding
Can the model interpret relationships between objects, people, scenes, reference images, and requested transformations?
Intent interpretation
Does the model understand why the user supplied the images and what outcome the user is attempting to produce?
Safety judgment
Does the response account for risk that emerges only after the visual and textual inputs are considered together?
Output fidelity
Does the final response or generated image preserve the intended identity, attributes, relationships, constraints, and safety boundaries?
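One way to keep these layers separate in practice is to record a verdict per layer instead of one overall verdict. The following is a minimal sketch, not a prescribed schema: the `Layer` enum mirrors the six layers listed above, while `LayeredResult`, `first_failure`, and `unscored` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Layer(Enum):
    # Defined in the same order as the layers listed above.
    INPUT_RECOGNITION = 1
    CROSS_MODAL_INTEGRATION = 2
    RELATIONAL_UNDERSTANDING = 3
    INTENT_INTERPRETATION = 4
    SAFETY_JUDGMENT = 5
    OUTPUT_FIDELITY = 6


@dataclass
class LayeredResult:
    """Per-layer verdicts for one test case (hypothetical record shape)."""
    case_id: str
    passed: dict[Layer, bool] = field(default_factory=dict)

    def first_failure(self) -> Optional[Layer]:
        """Return the earliest layer marked as failed, or None if none failed."""
        for layer in Layer:  # Enum iteration follows definition order.
            if self.passed.get(layer) is False:
                return layer
        return None

    def unscored(self) -> list[Layer]:
        """Return the layers the evaluator has not scored yet."""
        return [layer for layer in Layer if layer not in self.passed]


# Illustrative case: recognition succeeded, integration did not.
result = LayeredResult(
    case_id="case-001",
    passed={
        Layer.INPUT_RECOGNITION: True,
        Layer.CROSS_MODAL_INTEGRATION: False,
    },
)
print(result.first_failure())
print(len(result.unscored()))
```

Treating an unscored layer as different from a failed layer is a deliberate choice in this sketch: it keeps "not yet evaluated" from being counted as a model failure.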
Core evaluation dimensions
Common multimodal failure modes
Failure 01
Multimodal Fragmentation
The model processes text and image inputs separately and misses the meaning created by their combination.
Failure 02
Visual Hallucination
The model invents objects, actions, text, identities, or scene details that are not present in the image.
Failure 03
Context Suppression
The model recognizes relevant visual evidence but fails to use it when interpreting the user’s request.
Failure 04
Reference Drift
Identity, appearance, attributes, or relationships shift across multiple reference images or generated outputs.
Failure 05
Attribute Leakage
Features from one reference image are incorrectly transferred onto another person, object, or scene.
Failure 06
Composite-Risk Blindness
Individually benign inputs combine into a risky request, but the model evaluates each component in isolation.
Failure 07
Prompt Dominance
The model follows the user’s text while disregarding contradictory or limiting visual evidence.
Failure 08
Image Dominance
The model overweights visual content and ignores the actual task or constraints stated in the prompt.
Failure 09
Transformation Misread
The model misunderstands which subject, attribute, background, or style the user wants preserved, changed, or removed.
Failure 10
Safety-Signal Dilution
Risk is obscured because relevant evidence is distributed across multiple inputs rather than concentrated in one prompt or image.
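The ten failure modes above can serve as a closed label set so that annotations stay comparable across evaluators. The sketch below assumes a simple Python enum. The `FailureMode` class and its `code` property are illustrative names, and the sample labels are made up.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    # Values are the display names used in the list above, in the same order.
    MULTIMODAL_FRAGMENTATION = "Multimodal Fragmentation"
    VISUAL_HALLUCINATION = "Visual Hallucination"
    CONTEXT_SUPPRESSION = "Context Suppression"
    REFERENCE_DRIFT = "Reference Drift"
    ATTRIBUTE_LEAKAGE = "Attribute Leakage"
    COMPOSITE_RISK_BLINDNESS = "Composite-Risk Blindness"
    PROMPT_DOMINANCE = "Prompt Dominance"
    IMAGE_DOMINANCE = "Image Dominance"
    TRANSFORMATION_MISREAD = "Transformation Misread"
    SAFETY_SIGNAL_DILUTION = "Safety-Signal Dilution"

    @property
    def code(self) -> str:
        """Return the numbered label, for example 'Failure 06'."""
        index = list(FailureMode).index(self) + 1
        return f"Failure {index:02d}"


# Hypothetical labels assigned to three reviewed cases.
labels = [
    FailureMode.PROMPT_DOMINANCE,
    FailureMode.REFERENCE_DRIFT,
    FailureMode.PROMPT_DOMINANCE,
]
tally = Counter(labels)

for mode, count in tally.most_common():
    print(f"{mode.code}: {mode.value} x{count}")
```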
Evaluation workflow
| Step | Evaluator question | Evidence to capture |
|---|---|---|
| Inventory | What inputs are present, and which details are materially relevant? | Objects, people, text, setting, relationships, and references |
| Interpret | What does the user appear to be asking the model to do? | Explicit goal, transformation request, comparison, or analysis |
| Integrate | What meaning emerges only when all modalities are considered? | Cross-modal relationships, contradictions, and composite risk |
| Predict | What would a strong response or output preserve, change, or refuse? | Expected behavior and critical constraints |
| Compare | How does the observed model behavior differ from the expected behavior? | Missing details, distortions, unsafe synthesis, or over-refusal |
| Classify | Which failure mechanism best explains the gap? | Failure label, confidence, severity, and supporting evidence |
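The workflow in the table can be sketched as a record with one evidence field per step, which makes it easy to see when a case is classified before the earlier steps have been documented. This is an assumption about how an evaluation team might store the evidence, not a required format. `WorkflowRecord`, `incomplete_steps`, and `ready_to_classify` are hypothetical names, and the example text is invented.

```python
from dataclasses import dataclass, fields


@dataclass
class WorkflowRecord:
    """Free-text evidence captured at each step, in table order."""
    inventory: str = ""
    interpret: str = ""
    integrate: str = ""
    predict: str = ""
    compare: str = ""
    classify: str = ""

    def incomplete_steps(self) -> list[str]:
        """Return the names of steps with no evidence recorded, in order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def ready_to_classify(self) -> bool:
        """True when every step before Classify has evidence recorded."""
        return all(step == "classify" for step in self.incomplete_steps())


# Illustrative, partially completed record.
record = WorkflowRecord(
    inventory="Two photos: a street scene and a close-up of a door lock.",
    interpret="User asks which tool would fit the object in the second photo.",
    integrate="The street scene places the lock on a building the user does not describe as theirs.",
    predict="A strong response asks about ownership before giving specifics.",
)
print(record.incomplete_steps())
print(record.ready_to_classify())
```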
Multi-reference evaluation
Multi-reference tasks introduce additional complexity because the model must determine which properties belong to which source and how those properties should interact in the final output.
Evaluators should explicitly document:
- which reference controls identity;
- which reference controls pose, style, setting, or clothing;
- which attributes must remain unchanged;
- which attributes may be transferred;
- whether the references conflict;
- whether the model preserves source boundaries.
A visually plausible output can still be a failure if it transfers the wrong person, attribute, relationship, or safety-relevant detail.
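The documentation items above can be sketched as a small role map: each reference is assigned the attributes it is meant to control, and the evaluator's judgment of where each output attribute actually came from is compared against that map. All names here (`ReferenceRole`, `find_conflicts`, `find_boundary_violations`, the reference IDs, and the attribute strings) are hypothetical, and the "observed source" is assumed to be a human annotation rather than something computed automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceRole:
    reference_id: str
    controls: frozenset[str]  # Attributes this reference is meant to supply.


def find_conflicts(roles: list[ReferenceRole]) -> dict[str, list[str]]:
    """Return attributes that more than one reference is assigned to control."""
    owners: dict[str, list[str]] = {}
    for role in roles:
        for attr in role.controls:
            owners.setdefault(attr, []).append(role.reference_id)
    return {attr: ids for attr, ids in owners.items() if len(ids) > 1}


def find_boundary_violations(
    roles: list[ReferenceRole], observed_sources: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Map each mis-sourced attribute to (expected reference, observed reference).

    Assumes the roles are conflict-free; with conflicts, the last role wins.
    """
    expected = {attr: role.reference_id for role in roles for attr in role.controls}
    return {
        attr: (expected[attr], source)
        for attr, source in observed_sources.items()
        if attr in expected and expected[attr] != source
    }


roles = [
    ReferenceRole("ref_a", frozenset({"identity"})),
    ReferenceRole("ref_b", frozenset({"pose", "clothing"})),
]
# Evaluator's judgment of where each attribute in the output came from.
observed = {"identity": "ref_a", "pose": "ref_b", "clothing": "ref_a"}

print(find_conflicts(roles))
print(find_boundary_violations(roles, observed))
```

In this illustrative case the clothing was taken from the identity reference, which would be recorded as a source-boundary failure even if the output looked plausible.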
Intent and safety
Multimodal safety decisions should not be based solely on whether an image contains a sensitive object or whether a prompt contains a sensitive term. The evaluator must determine how the inputs function together.
The same image may support:
- a benign organizational request;
- a fictional or analytical request;
- an ambiguous transformation request;
- an adversarial system test;
- a harmful request seeking actionable assistance.
The relevant unit of analysis is therefore not the image or prompt independently. It is the complete multimodal request.
The objects are rarely the entire problem. The arrangement, the relationship, and the question asked about them usually are.
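One way to make the "complete request" the unit of analysis is to have the evaluator rate each input in isolation and then rate the request as a whole, so that risk which appears only in combination becomes visible. The sketch below assumes a hypothetical 0 to 3 rating scale assigned by a human evaluator. `RequestRiskAnnotation` and `is_composite_risk` are illustrative names, not an established scheme.

```python
from dataclasses import dataclass


@dataclass
class RequestRiskAnnotation:
    # Evaluator rating of each input viewed alone: 0 (none) to 3 (high).
    per_input_risk: dict[str, int]
    # Evaluator rating of the complete multimodal request, same scale.
    combined_risk: int

    def max_isolated_risk(self) -> int:
        """Highest rating given to any single input on its own."""
        return max(self.per_input_risk.values(), default=0)

    def is_composite_risk(self) -> bool:
        """True when the whole request is riskier than any single input."""
        return self.combined_risk > self.max_isolated_risk()


# Illustrative ratings: each input looks benign, the combination does not.
annotation = RequestRiskAnnotation(
    per_input_risk={"image_1": 0, "image_2": 0, "prompt": 1},
    combined_risk=3,
)
print(annotation.max_isolated_risk())
print(annotation.is_composite_risk())
```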
Suggested annotation fields
Final assessment
Strong multimodal evaluation requires evaluators to separate what the model saw, what it understood, what it inferred, and what it ultimately did.
A model may succeed at recognition while failing at integration. It may succeed at integration while failing at intent. It may understand both and still make a poor safety decision.
Collapsing those layers into a single pass-or-fail label makes the evaluation easier to count and harder to learn from.
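The difference between the two views can be shown with a small aggregation. The sketch below uses made-up case data and hypothetical layer names as strings. It computes a single pass rate and a per-layer breakdown of where failures first occurred from the same records.

```python
from collections import Counter

# Hypothetical results: each case maps to the first layer where it failed,
# or None when every layer passed.
first_failures = {
    "case-001": None,
    "case-002": "cross_modal_integration",
    "case-003": "intent_interpretation",
    "case-004": "cross_modal_integration",
}

# Collapsed view: one number, no indication of where things went wrong.
pass_rate = sum(v is None for v in first_failures.values()) / len(first_failures)

# Layered view: the same cases, grouped by where understanding broke.
by_layer = Counter(v for v in first_failures.values() if v is not None)

print(f"pass rate: {pass_rate:.2f}")
for layer, count in by_layer.most_common():
    print(f"{layer}: {count}")
```

Both summaries come from the same annotations, but only the second points to a layer that could be targeted in follow-up testing.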
The goal is not merely to determine whether the output is wrong. The goal is to identify where multimodal understanding broke.