FineGRAIN Bench will appear as a Spotlight Paper at NeurIPS 2025
FineGRAIN is a comprehensive benchmark for evaluating text-to-image (T2I) models across 27 specific failure modes. While T2I models can generate visually impressive images, they often fail to follow prompts precisely, getting colors, object counts, spatial relationships, and other critical details wrong.
Our benchmark provides a structured evaluation framework that tests both T2I model capabilities and Vision Language Model (VLM) performance as judges. We evaluate how well VLMs identify specific failure modes in images generated by leading T2I models, including Flux and Stable Diffusion variants, on challenging prompts designed to elicit common failure patterns.
The benchmark reveals systematic weaknesses in current models and provides actionable insights for researchers and practitioners. Use FineGRAIN to compare model performance, identify areas for improvement, and track progress in text-to-image generation quality.
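To make the evaluation framework concrete, here is a minimal sketch of how a single (prompt, failure mode) pair might be scored with a VLM judge. The `t2i_model.generate` and `vlm_judge.ask` calls, the yes/no question format, and the failure-mode string are illustrative assumptions, not FineGRAIN's actual interface.

```python
# Minimal sketch of one VLM-as-judge check for a single prompt and failure mode.
# The model objects and their methods below are hypothetical placeholders,
# not FineGRAIN's actual API.

def judge_single_image(t2i_model, vlm_judge, prompt: str, failure_mode: str) -> bool:
    """Generate an image for `prompt`, then ask the VLM judge whether the
    targeted failure mode is present. Returns True if the judge flags a failure."""
    image = t2i_model.generate(prompt)  # hypothetical T2I generation call
    question = (
        f"The image was generated from the prompt: '{prompt}'. "
        f"Does it exhibit the failure mode '{failure_mode}'? Answer yes or no."
    )
    answer = vlm_judge.ask(image, question)  # hypothetical VLM query call
    return answer.strip().lower().startswith("yes")
```

Since each benchmark prompt is designed to stress a particular failure mode, a model's score for that mode can be read as how often its images pass the corresponding judge check.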
| Model | Company | Average | Cause-effect Relations | Action and Motion | Anatomical Accuracy | BG-FG Mismatch | Blending Styles | Color Binding | Counts or Multiple Objects | Abstract Concepts | Emotional Conveyance | FG-BG Relations | Human Action | Human Anatomy Moving | Long Text Specific | Negation | Opposite Relation | Perspective | Physics | Scaling | Shape Binding | Short Text Specific | Social Relations | Spatial Relations | Surreal | Tense and Aspect | Text Rendering Style | Text-Based | Texture Binding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Note: * Images resized before evaluation
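The leaderboard's per-mode columns and its Average column could be aggregated roughly as in the sketch below, assuming each cell is a pass rate over that mode's prompts and the Average is an unweighted mean over all 27 modes. Both are assumptions; the exact scoring is defined by the benchmark itself.

```python
# Rough sketch of leaderboard aggregation (assumed, not FineGRAIN's exact scoring):
# each per-mode score is treated as a pass rate over that mode's prompts, and
# "Average" as the unweighted mean over all failure modes.

def per_mode_score(judge_flags: list[bool]) -> float:
    """Pass rate: fraction of images the judge did NOT flag for this failure mode."""
    return 1.0 - sum(judge_flags) / max(len(judge_flags), 1)

def leaderboard_row(per_mode_scores: dict[str, float]) -> dict[str, float]:
    """Attach an unweighted average over all per-failure-mode scores."""
    row = dict(per_mode_scores)
    row["Average"] = sum(per_mode_scores.values()) / len(per_mode_scores)
    return row
```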
| Failure Mode | Molmo | InternVL3 | Pixtral | Best Performance |
|---|---|---|---|---|
Key VLM Insights:
| Failure Mode | Failure Rate | Description | Sample Prompt |
|---|---|---|---|
How to read this table:
@article{hayes2024finegrain,
  author  = {Hayes, Kevin David and Goldblum, Micah and Somepalli, Gowthami and Sehwag, Vikash and Panda, Ashwinee and Goldstein, Tom},
  title   = {FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges},
  journal = {arXiv preprint arXiv:2024.XXXXX},
  year    = {2024},
}