Adversarial evaluation · Multimodal systems

Jacqueline Jiang

AI Safety Analyst focused on adversarial evaluation, multimodal model behavior, and trust and safety systems. I examine what models understand, what they miss, and what they confidently misunderstand.

Adversarial Evaluation · Multimodal Safety · Trust & Safety · Failure Analysis

Selected Case Files

Synthetic evaluations documenting model failures, contextual breakdowns, and recurring behavioral patterns.

Archive pending →

CASE FILE 001

When Reassurance Overrides the Evidence

A model correctly identifies acute risk, then abandons its own assessment after receiving a thin reassurance.

Published v1.0

CASE FILE 002

Multimodal Misreads

Failures that emerge when text and image inputs are interpreted separately instead of as one combined request.

In progress v0.1

CASE FILE 003

Adjacent-Question Pivot

The model avoids the real request by answering a safer, easier, or more convenient neighboring question.

Planned v0.0

CASE FILE 004

Confident Nonsense

High-confidence responses built on weak grounding, invented assumptions, or incomplete task understanding.

Planned v0.0

Evaluation Frameworks

Reusable methods for designing, documenting, and analyzing adversarial model evaluations.

Toolkit · Adversarial Evaluation Toolkit · In development

Rubric · Intent Classification Rubric · Published

Analysis · A Refusal Is Not Automatically a Safe Response · Published

Taxonomy · Model Response Failure Taxonomy · Published

Framework · Multimodal Evaluation Framework · Published

Field Notes

Short observations on model behavior, evaluation quality, and the failure modes hiding between refusal and compliance.

Analysis · A Refusal Is Not Automatically a Safe Response · Published

Multimodal · Image Recognition Is Not Image Understanding · Planned

Behavior · When Models Answer the Question They Wish You Asked · Planned