FineGRAIN Bench will appear as a Spotlight Paper at NeurIPS 2025
FineGRAIN is a comprehensive benchmark for evaluating text-to-image (T2I) models across 27 specific failure modes. While T2I models can generate visually impressive images, they often fail to follow prompts precisely, getting colors, object counts, spatial relationships, and other critical details wrong.
Our benchmark provides a structured evaluation framework that tests both T2I model capabilities and Vision Language Model (VLM) performance as judges. We evaluate how well VLMs identify specific failure modes in images generated by leading T2I models, including Flux and Stable Diffusion variants, on challenging prompts designed to elicit common failure patterns.
The benchmark reveals systematic weaknesses in current models and provides actionable insights for researchers and practitioners. Use FineGRAIN to compare model performance, identify areas for improvement, and track progress in text-to-image generation quality.
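To make the evaluation framework concrete, here is a minimal sketch of how a single (prompt, failure mode) pair might be scored with a VLM judge. The `t2i_model.generate` and `vlm_judge.ask` calls, the yes/no question format, and the failure-mode string are illustrative assumptions, not FineGRAIN's actual interface.

```python
# Minimal sketch of one VLM-as-judge check for a single prompt and failure mode.
# The model objects and their methods below are hypothetical placeholders,
# not FineGRAIN's actual API.

def judge_single_image(t2i_model, vlm_judge, prompt: str, failure_mode: str) -> bool:
    """Generate an image for `prompt`, then ask the VLM judge whether the
    targeted failure mode is present. Returns True if the judge flags a failure."""
    image = t2i_model.generate(prompt)  # hypothetical T2I generation call
    question = (
        f"The image was generated from the prompt: '{prompt}'. "
        f"Does it exhibit the failure mode '{failure_mode}'? Answer yes or no."
    )
    answer = vlm_judge.ask(image, question)  # hypothetical VLM query call
    return answer.strip().lower().startswith("yes")
```

Since each benchmark prompt is designed to stress a particular failure mode, a model's score for that mode can be read as how often its images pass the corresponding judge check.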
| Model | Company | Average | Cause-effect Relations | Action and Motion | Anatomical Accuracy | BG-FG Mismatch | Blending Styles | Color Binding | Counts or Multiple Objects | Abstract Concepts | Emotional Conveyance | FG-BG Relations | Human Action | Human Anatomy Moving | Long Text Specific | Negation | Opposite Relation | Perspective | Physics | Scaling | Shape Binding | Short Text Specific | Social Relations | Spatial Relations | Surreal | Tense and Aspect | Text Rendering Style | Text-Based | Texture Binding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Note: * Images resized before evaluation
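The leaderboard's per-mode columns and its Average column could be aggregated roughly as in the sketch below, assuming each cell is a pass rate over that mode's prompts and the Average is an unweighted mean over all 27 modes. Both are assumptions; the exact scoring is defined by the benchmark itself.

```python
# Rough sketch of leaderboard aggregation (assumed, not FineGRAIN's exact scoring):
# each per-mode score is treated as a pass rate over that mode's prompts, and
# "Average" as the unweighted mean over all failure modes.

def per_mode_score(judge_flags: list[bool]) -> float:
    """Pass rate: fraction of images the judge did NOT flag for this failure mode."""
    return 1.0 - sum(judge_flags) / max(len(judge_flags), 1)

def leaderboard_row(per_mode_scores: dict[str, float]) -> dict[str, float]:
    """Attach an unweighted average over all per-failure-mode scores."""
    row = dict(per_mode_scores)
    row["Average"] = sum(per_mode_scores.values()) / len(per_mode_scores)
    return row
```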
| Failure Mode | Molmo | InternVL3 | Pixtral | Best Performance |
|---|---|---|---|---|
Key VLM Insights:
| Failure Mode | Failure Rate | Description | Sample Prompt |
|---|---|---|---|
How to read this table:
@article{hayes2024finegrain,
  author  = {Hayes, Kevin David and Goldblum, Micah and Somepalli, Gowthami and Sehwag, Vikash and Panda, Ashwinee and Goldstein, Tom},
  title   = {FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges},
  journal = {arXiv preprint arXiv:2024.XXXXX},
  year    = {2024},
}