Multimodal Evaluation Framework

Purpose

Multimodal evaluation requires more than checking whether a model can identify objects in an image or summarize visible content. A system may accurately recognize individual inputs while still failing to understand the meaning that emerges from their combination.

This framework separates recognition, integration, relational understanding, intent, safety judgment, and output fidelity so evaluators can identify where a failure actually occurs.

Image recognition is not image understanding, and input recognition is not cross-modal reasoning.

Evaluation layers

Layer 01

Input recognition

Can the model accurately identify the visible objects, people, text, relationships, and environmental details in each image?

Layer 02

Cross-modal integration

Does the model combine the visual evidence with the user’s text rather than processing each input independently?

Layer 03

Relational understanding

Can the model interpret relationships between objects, people, scenes, reference images, and requested transformations?

Layer 04

Intent interpretation

Does the model understand why the user supplied the images and what outcome the user is attempting to produce?

Layer 05

Safety judgment

Does the response account for risk that emerges only after the visual and textual inputs are considered together?

Layer 06

Output fidelity

Does the final response or generated image preserve the intended identity, attributes, relationships, constraints, and safety boundaries?
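One way to record layer-by-layer results is sketched below in Python. The names Layer and first_failing_layer are hypothetical and not part of the framework. The sketch assumes that layers are checked in the order listed and that the earliest failed layer is the most useful one to report. The framework does not state this rule, so treat it as one possible convention.

```python
from enum import IntEnum
from typing import Mapping, Optional


class Layer(IntEnum):
    """The six evaluation layers, numbered as in the framework."""
    INPUT_RECOGNITION = 1
    CROSS_MODAL_INTEGRATION = 2
    RELATIONAL_UNDERSTANDING = 3
    INTENT_INTERPRETATION = 4
    SAFETY_JUDGMENT = 5
    OUTPUT_FIDELITY = 6


def first_failing_layer(results: Mapping[Layer, bool]) -> Optional[Layer]:
    """Return the lowest-numbered layer marked as failed, or None.

    Assumption: a layer missing from `results` was not assessed, so it
    is skipped rather than treated as a failure.
    """
    for layer in Layer:  # iterates in definition order, Layer 01 first
        if results.get(layer) is False:
            return layer
    return None


# Illustrative judgments for one test case (made-up values).
results = {
    Layer.INPUT_RECOGNITION: True,
    Layer.CROSS_MODAL_INTEGRATION: False,
    Layer.SAFETY_JUDGMENT: False,
}
print(first_failing_layer(results).name)  # CROSS_MODAL_INTEGRATION
```

Reporting the earliest failure keeps a downstream safety failure from hiding the integration failure that may have caused it.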

Core evaluation dimensions

Visual Grounding

Accuracy of object recognition, scene interpretation, spatial relationships, visible text, and relevant visual evidence.

Text-Image Alignment

Whether the response reflects the actual relationship between the prompt and the image rather than relying on one modality alone.

Cross-Image Consistency

Whether identities, attributes, visual details, and relationships remain consistent across multiple reference images.

Composite Meaning

Whether the model identifies meaning or risk that emerges from the combination of otherwise benign components.

User Intent

Whether the model correctly understands the purpose of the request, including transformation, comparison, editing, analysis, or generation.

Response Proportionality

Whether the model complies, clarifies, redirects, or refuses in a way that is proportionate to the actual risk.
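Response Proportionality can be recorded by comparing the response type the evaluator expected with the one observed. The Python sketch below is illustrative only. The Action and proportionality names are hypothetical, and ranking the four response types from least to most restrictive is an assumption made for this example, not something the framework defines.

```python
from enum import IntEnum


class Action(IntEnum):
    """Response types named in the framework.

    Assumption: the numeric order ranks them from least to most
    restrictive. The framework itself does not define an ordering.
    """
    COMPLY = 0
    CLARIFY = 1
    REDIRECT = 2
    REFUSE = 3


def proportionality(expected: Action, observed: Action) -> str:
    """Compare the observed response with the evaluator's expected one."""
    if observed > expected:
        return "over-restrictive"
    if observed < expected:
        return "under-restrictive"
    return "proportionate"


print(proportionality(Action.CLARIFY, Action.REFUSE))  # over-restrictive
print(proportionality(Action.REFUSE, Action.COMPLY))   # under-restrictive
print(proportionality(Action.COMPLY, Action.COMPLY))   # proportionate
```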

Common multimodal failure modes

Failure 01

Multimodal Fragmentation

The model processes text and image inputs separately and misses the meaning created by their combination.

Failure 02

Visual Hallucination

The model invents objects, actions, text, identities, or scene details that are not present in the image.

Failure 03

Context Suppression

The model recognizes relevant visual evidence but fails to use it when interpreting the user’s request.

Failure 04

Reference Drift

Identity, appearance, attributes, or relationships shift across multiple reference images or generated outputs.

Failure 05

Attribute Leakage

Features from one reference image are incorrectly transferred onto another person, object, or scene.

Failure 06

Composite-Risk Blindness

Individually benign inputs combine into a risky request, but the model evaluates each component in isolation.

Failure 07

Prompt Dominance

The model follows the user’s text while disregarding contradictory or limiting visual evidence.

Failure 08

Image Dominance

The model overweights visual content and ignores the actual task or constraints stated in the prompt.

Failure 09

Transformation Misread

The model misunderstands which subject, attribute, background, or style the user wants preserved, changed, or removed.

Failure 10

Safety-Signal Dilution

Risk is obscured because relevant evidence is distributed across multiple inputs rather than concentrated in one prompt or image.
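For tooling, the ten failure modes above can be kept as a fixed vocabulary so that annotations cannot drift into free-text labels. The Python sketch below is one possible encoding. The FailureMode class and by_number helper are hypothetical names, while the numbers and titles are copied from the list above.

```python
from enum import Enum


class FailureMode(Enum):
    """Failure modes, stored as (number, title) pairs from the list above."""
    MULTIMODAL_FRAGMENTATION = (1, "Multimodal Fragmentation")
    VISUAL_HALLUCINATION = (2, "Visual Hallucination")
    CONTEXT_SUPPRESSION = (3, "Context Suppression")
    REFERENCE_DRIFT = (4, "Reference Drift")
    ATTRIBUTE_LEAKAGE = (5, "Attribute Leakage")
    COMPOSITE_RISK_BLINDNESS = (6, "Composite-Risk Blindness")
    PROMPT_DOMINANCE = (7, "Prompt Dominance")
    IMAGE_DOMINANCE = (8, "Image Dominance")
    TRANSFORMATION_MISREAD = (9, "Transformation Misread")
    SAFETY_SIGNAL_DILUTION = (10, "Safety-Signal Dilution")

    def __init__(self, number: int, title: str) -> None:
        self.number = number
        self.title = title


def by_number(number: int) -> FailureMode:
    """Look up a failure mode by its number, e.g. 7 for 'Failure 07'."""
    for mode in FailureMode:
        if mode.number == number:
            return mode
    raise ValueError(f"no failure mode numbered {number}")


print(by_number(7).title)  # Prompt Dominance
```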

Evaluation workflow

Step	Evaluator question	Evidence to capture
Inventory	What inputs are present, and which details are materially relevant?	Objects, people, text, setting, relationships, and references
Interpret	What does the user appear to be asking the model to do?	Explicit goal, transformation request, comparison, or analysis
Integrate	What meaning emerges only when all modalities are considered?	Cross-modal relationships, contradictions, and composite risk
Predict	What would a strong response or output preserve, change, or refuse?	Expected behavior and critical constraints
Compare	How does the observed model behavior differ from the expected behavior?	Missing details, distortions, unsafe synthesis, or over-refusal
Classify	Which failure mechanism best explains the gap?	Failure label, confidence, severity, and supporting evidence
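The workflow can also serve as a completeness check on an evaluator's notes. The following Python sketch assumes notes are stored as a mapping from step name to free-text evidence. That storage format and the missing_steps helper are assumptions made for illustration, and the sample notes are made up.

```python
from typing import List, Mapping

# Step names as listed in the workflow table, in order.
WORKFLOW_STEPS = (
    "Inventory", "Interpret", "Integrate", "Predict", "Compare", "Classify",
)


def missing_steps(evidence: Mapping[str, str]) -> List[str]:
    """Return the steps with no captured evidence, in workflow order.

    A step counts as missing if it is absent or contains only whitespace.
    """
    return [
        step for step in WORKFLOW_STEPS
        if not evidence.get(step, "").strip()
    ]


# Made-up notes for a partially completed evaluation.
notes = {
    "Inventory": "Two reference images and one text prompt.",
    "Interpret": "User requests a transformation of the first image.",
    "Integrate": "   ",
}
print(missing_steps(notes))  # ['Integrate', 'Predict', 'Compare', 'Classify']
```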

Multi-reference evaluation

Multi-reference tasks introduce additional complexity because the model must determine which properties belong to which source and how those properties should interact in the final output.

Evaluators should explicitly document:

which reference controls identity;
which reference controls pose, style, setting, or clothing;
which attributes must remain unchanged;
which attributes may be transferred;
whether the references conflict;
whether the model preserves source boundaries.

A visually plausible output can still be a failure if it transfers the wrong person, attribute, relationship, or safety-relevant detail.
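The documentation items above can be captured in a small record and compared against what the evaluator observes in the output. The Python sketch below is illustrative. ReferencePlan, find_leakage, locked_violations, and the reference ids ref_a and ref_b are hypothetical, and deciding which reference a property was actually taken from remains a human judgment supplied as input.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ReferencePlan:
    """Evaluator's record of which reference controls which property."""
    # Property name (e.g. "identity", "pose") -> id of the controlling reference.
    controls: Dict[str, str]
    # Properties that must remain unchanged in the output.
    locked: Set[str] = field(default_factory=set)


def find_leakage(plan: ReferencePlan, observed: Dict[str, str]) -> List[str]:
    """List properties whose observed source differs from the planned one.

    `observed` maps each property to the reference the evaluator judges
    it was actually taken from. Properties not observed are skipped.
    """
    return sorted(
        prop
        for prop, planned in plan.controls.items()
        if prop in observed and observed[prop] != planned
    )


def locked_violations(plan: ReferencePlan, observed: Dict[str, str]) -> List[str]:
    """Subset of leaked properties that were required to stay unchanged."""
    return [p for p in find_leakage(plan, observed) if p in plan.locked]


plan = ReferencePlan(
    controls={"identity": "ref_a", "clothing": "ref_b", "setting": "ref_b"},
    locked={"identity"},
)
# The output looks plausible, but the identity came from the wrong reference.
observed = {"identity": "ref_b", "clothing": "ref_b", "setting": "ref_b"}
print(find_leakage(plan, observed))       # ['identity']
print(locked_violations(plan, observed))  # ['identity']
```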

Intent and safety

Multimodal safety decisions should not be based solely on whether an image contains a sensitive object or whether a prompt contains a sensitive term. The evaluator must determine how the inputs function together.

The same image may support:

a benign organizational request;
a fictional or analytical request;
an ambiguous transformation request;
an adversarial system test;
a harmful request seeking actionable assistance.

The relevant unit of analysis is therefore not the image or prompt independently. It is the complete multimodal request.

The objects are rarely the entire problem. The arrangement, the relationship, and the question asked about them usually are.
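The difference between judging components and judging the complete request can be made explicit in the evaluation record. The Python sketch below assumes the evaluator supplies every risk judgment by hand. RequestAssessment, emergent_risk, and composite_risk_blindness are hypothetical names, and the boolean treatment of risk is a simplification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestAssessment:
    """Evaluator judgments for one multimodal request."""
    component_risky: List[bool]  # one judgment per image or prompt, assessed alone
    combined_risky: bool         # judgment for the complete request


def emergent_risk(a: RequestAssessment) -> bool:
    """True when risk appears only once the inputs are considered together."""
    return a.combined_risky and not any(a.component_risky)


def composite_risk_blindness(a: RequestAssessment, model_treated_as_risky: bool) -> bool:
    """True when risk is emergent and the model responded as if there were none."""
    return emergent_risk(a) and not model_treated_as_risky


# Two images and a prompt, each benign alone, risky in combination.
case = RequestAssessment(component_risky=[False, False, False], combined_risky=True)
print(emergent_risk(case))                                           # True
print(composite_risk_blindness(case, model_treated_as_risky=False))  # True
```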

Suggested annotation fields

Input Summary

Concise description of each image, prompt, and relevant reference.

User Intent

Benign, ambiguous, adversarial, harmful, or indeterminate, with supporting evidence.

Expected Behavior

What a strong model should recognize, preserve, clarify, generate, or refuse.

Observed Behavior

What the model actually said, generated, omitted, or transformed.

Failure Mechanism

The smallest useful set of labels explaining the observed gap.

Evaluator Confidence

Low, moderate, or high, with unresolved ambiguity documented.
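The annotation fields above map naturally onto a structured record. The following Python sketch is one possible shape, not a prescribed schema. The Annotation class, its field names, and the problems method are hypothetical, while the allowed intent and confidence values are taken from the field descriptions above.

```python
from dataclasses import dataclass, field
from typing import List

INTENT_LABELS = {"benign", "ambiguous", "adversarial", "harmful", "indeterminate"}
CONFIDENCE_LEVELS = {"low", "moderate", "high"}


@dataclass
class Annotation:
    input_summary: str
    user_intent: str
    intent_evidence: str
    expected_behavior: str
    observed_behavior: str
    failure_mechanisms: List[str] = field(default_factory=list)
    evaluator_confidence: str = "moderate"
    unresolved_ambiguity: str = ""

    def problems(self) -> List[str]:
        """Return a list of validation issues; an empty list means none found."""
        issues = []
        if self.user_intent not in INTENT_LABELS:
            issues.append(f"unknown intent label: {self.user_intent!r}")
        if not self.intent_evidence.strip():
            issues.append("intent has no supporting evidence")
        if self.evaluator_confidence not in CONFIDENCE_LEVELS:
            issues.append(f"unknown confidence level: {self.evaluator_confidence!r}")
        return issues


# Made-up example record.
record = Annotation(
    input_summary="One image and one text prompt.",
    user_intent="ambiguous",
    intent_evidence="Prompt does not say which subject to change.",
    expected_behavior="Ask which subject the edit applies to.",
    observed_behavior="Edited both subjects without asking.",
    failure_mechanisms=["Transformation Misread"],
)
print(record.problems())  # []
```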

Final assessment

Strong multimodal evaluation requires evaluators to separate what the model saw, what it understood, what it inferred, and what it ultimately did.

A model may succeed at recognition while failing at integration. It may succeed at integration while failing at intent. It may integrate the inputs and read the intent correctly and still make a poor safety decision.

Collapsing those layers into a single pass-or-fail label makes the evaluation easier to count and harder to learn from.
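A short Python sketch shows what the single label discards. It assumes each test case has been reduced to the name of the layer where it first failed, or None if it passed. The summarize function and the sample values are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def summarize(first_failures: Iterable[Optional[str]]) -> Tuple[Dict[str, int], Counter]:
    """Return (pass/fail counts, per-layer failure counts) for a set of cases."""
    cases = list(first_failures)
    failed = [layer for layer in cases if layer is not None]
    pass_fail = {"pass": len(cases) - len(failed), "fail": len(failed)}
    return pass_fail, Counter(failed)


# Made-up results for five test cases.
cases = ["integration", None, "integration", "safety", None]
pass_fail, by_layer = summarize(cases)
print(pass_fail)       # {'pass': 2, 'fail': 3}
print(dict(by_layer))  # {'integration': 2, 'safety': 1}
```

The pass-or-fail view reports three failures. Only the per-layer view shows that two of them share a cause.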

The goal is not merely to determine whether the output is wrong. The goal is to identify where multimodal understanding broke.