FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

FineGRAIN Bench will appear as a Spotlight Paper at NeurIPS 2025

University of Maryland, Columbia University, Sony AI

About FineGRAIN

FineGRAIN is a comprehensive benchmark for evaluating text-to-image (T2I) models across 27 specific failure modes. While T2I models can generate visually impressive images, they often struggle with precise prompt adherence: they get colors wrong, miscount objects, scramble spatial relationships, and miss other critical details.

Our benchmark provides a structured evaluation framework that tests both T2I model capabilities and the performance of Vision Language Models (VLMs) as judges. We evaluate how well VLMs can identify specific failure modes in images generated by leading T2I models, including Flux and Stable Diffusion variants, on challenging prompts designed to elicit common failure patterns.

The benchmark reveals systematic weaknesses in current models and provides actionable insights for researchers and practitioners. Use FineGRAIN to compare model performance, identify areas for improvement, and track progress in text-to-image generation quality.
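
To make the judging protocol concrete, here is a minimal sketch in Python of how a failure-mode check could be posed to a VLM as a binary question about a generated image. The query_vlm helper, the Sample record, and the question wording are illustrative assumptions, not the benchmark's actual API.

# A minimal sketch of a FineGRAIN-style judging loop, assuming each
# failure mode is posed to a VLM as a yes/no question. `query_vlm` is
# a hypothetical stand-in for a real VLM client (Molmo, InternVL3,
# Pixtral, ...); the question wording is illustrative.

from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str    # image generated by a T2I model
    prompt: str        # the prompt that produced it
    failure_mode: str  # e.g. "Color Binding", "Spatial Relations"

def query_vlm(image_path: str, question: str) -> str:
    """Hypothetical VLM call; replace with a real model client."""
    return "no"  # placeholder client always answers "no"

def judge(sample: Sample) -> bool:
    """Return True if the VLM says the image exhibits the failure mode."""
    question = (
        f"The image was generated from the prompt: '{sample.prompt}'. "
        f"Does it exhibit the failure mode '{sample.failure_mode}'? "
        "Answer yes or no."
    )
    answer = query_vlm(sample.image_path, question)
    return answer.strip().lower().startswith("yes")

print(judge(Sample("img.png", "three red apples on a blue plate",
                   "Counts or Multiple Objects")))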

T2I Model Performance Comparison (Version 1.1)

[Table columns: Model, Company, Average, and one score per failure mode: Cause-effect Relations, Action and Motion, Anatomical Accuracy, BG-FG Mismatch, Blending Styles, Color Binding, Counts or Multiple Objects, Abstract Concepts, Emotional Conveyance, FG-BG Relations, Human Action, Human Anatomy Moving, Long Text Specific, Negation, Opposite Relation, Perspective, Physics, Scaling, Shape Binding, Short Text Specific, Social Relations, Spatial Relations, Surreal, Tense and Aspect, Text Rendering Style, Text-Based, Texture Binding]

Note: * Images resized before evaluation

VLM Judge Performance Comparison

Overall VLM Accuracy

  • Molmo: 66.1% (best overall)
  • InternVL3: 65.1%
  • Pixtral: 63.8%

[Table columns: Failure Mode, Molmo, InternVL3, Pixtral, Best Performance]

Key VLM Insights:

  • Molmo leads in overall accuracy and is particularly strong on anatomical accuracy and action representation
  • All VLMs achieve near-perfect performance on counting objects (>97% accuracy)
  • Abstract concepts remain challenging for all VLMs (~30% accuracy)
  • Text-based evaluation shows significant variation between VLMs (66-77%)
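
As a hedged sketch of how such per-failure-mode accuracies can be tallied, the snippet below assumes each record pairs a VLM verdict with a ground-truth label (e.g., a human annotation); the record schema and example data are assumptions, not the benchmark's actual format.

# Sketch of per-failure-mode judge accuracy, assuming records of
# (failure_mode, vlm_verdict, true_label). The schema and example
# records below are illustrative, not the benchmark's actual data.

from collections import defaultdict

def accuracy_by_failure_mode(records):
    correct = defaultdict(int)
    total = defaultdict(int)
    for mode, verdict, label in records:
        total[mode] += 1
        correct[mode] += int(verdict == label)
    return {mode: correct[mode] / total[mode] for mode in total}

records = [
    ("Counts or Multiple Objects", True, True),  # judge agreed with label
    ("Abstract Concepts", False, True),          # judge missed the failure
]
print(accuracy_by_failure_mode(records))
# {'Counts or Multiple Objects': 1.0, 'Abstract Concepts': 0.0}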

FineGRAIN: 27 Failure Modes

[Table columns: Failure Mode, Failure Rate, Description, Sample Prompt]

How to read this table:

  • Failure Rate: Percentage of images that contain the failure mode (higher = more challenging); a computation sketch follows this list
  • Sample Prompts: Real prompts used in our evaluation to elicit specific failure modes
  • Color coding: Red (>70%) = Very challenging, Yellow (30-70%) = Moderate, Green (<30%) = Less challenging
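
A minimal sketch of both statistics in this legend, assuming per-image boolean flags; the thresholds come from the color-coding legend above, while the data format is an assumption.

# Sketch of the failure-rate statistic and the color buckets from the
# legend above. Thresholds are taken from the legend; the flag format
# is an illustrative assumption.

def failure_rate(flags):
    """flags: booleans, True if an image exhibits the failure mode."""
    return 100.0 * sum(flags) / len(flags)

def color_bucket(rate_pct):
    if rate_pct > 70:
        return "red"     # very challenging
    if rate_pct >= 30:
        return "yellow"  # moderate
    return "green"       # less challenging

flags = [True, True, True, False]  # 3 of 4 generated images failed
rate = failure_rate(flags)
print(f"{rate:.1f}% -> {color_bucket(rate)}")  # 75.0% -> red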

BibTeX

@article{hayes2024finegrain,
  author    = {Hayes, Kevin David and Goldblum, Micah and Somepalli, Gowthami and Sehwag, Vikash and Panda, Ashwinee and Goldstein, Tom},
  title     = {FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges},
  journal   = {arXiv preprint arXiv:2024.XXXXX},
  year      = {2024},
}