Science-T2I: Addressing Scientific Illusions in Image Synthesis

CVPR 2025

1New York University, 2University of Washington, 3University of Pennsylvania, 4University of California, San Diego
Fig 1: When presented with knowledge-implicit prompts, can LMMs and VLMs effectively distinguish between real and fake scientific images? Can generative models produce scientifically plausible images from such prompts? Does fine-tuning generative models on relevant data enhance their ability to generalize from knowledge? To explore these questions, we establish a benchmark for evaluating LMMs and VLMs, construct a dataset to train a reward model that can serve as a reliable tool for assessing generative models, and fine-tune generative models to investigate their generalization.

Science-T2I-S&C Leaderboard

🤩 Submit your LMM/VLM scores now and watch the leaderboard refresh with your achievements! Email us at .


Sub-task columns are grouped by discipline: GR–LI plus Avg for physics, RU–FR plus Avg for chemistry, and LR–RI plus Avg for biology; Acc is the overall accuracy.

| # | Model | Type | Acc | GR | SO | ME | AB | BU | DI | EL | EV | LI | Avg | RU | IM | FR | Avg | LR | WR | SC | RI | Avg |
|---|-------|------|-----|----|----|----|----|----|----|----|----|----|-----|----|----|----|-----|----|----|----|----|-----|
| 1 | CLIP-H | VLM | 54.7 | 78.3 | 40.5 | 25.0 | 57.1 | 63.9 | 71.4 | 47.6 | 26.7 | 77.8 | 55.1 | 16.7 | 54.2 | 77.8 | 52.4 | 62.2 | 31.1 | 81.5 | 34.6 | 55.9 |
| 2 | BLIPScore | VLM | 55.0 | 47.5 | 44.1 | 56.9 | 38.1 | 50.0 | 50.0 | 52.4 | 20.0 | 33.3 | 50.4 | 42.9 | 53.1 | 76.7 | 43.1 | 76.7 | 38.9 | 58.3 | 38.5 | 59.9 |
| 3 | SigLIP | VLM | 57.2 | 78.3 | 45.2 | 44.4 | 57.1 | 58.3 | 83.3 | 47.6 | 63.3 | 58.3 | 59.6 | 23.8 | 60.4 | 62.2 | 53.2 | 46.7 | 33.3 | 83.3 | 53.9 | 55.9 |
| 4 | — | — | 63.8 | 83.3 | 42.9 | 26.4 | 40.5 | 47.2 | 70.2 | 77.4 | 68.3 | 84.7 | 60.0 | 73.8 | 66.7 | 52.2 | 67.0 | 57.8 | 67.8 | 95.4 | 34.6 | 68.8 |
| 5 | — | — | 65.1 | 92.5 | 56.0 | 36.1 | 38.1 | 45.8 | 75.0 | 77.4 | 100 | 95.8 | 68.2 | 59.5 | 55.2 | 48.9 | 57.8 | 51.1 | 72.2 | 78.7 | 46.2 | 64.7 |
| 6 | — | — | 87.0 | 93.0 | 86.1 | 98.2 | 66.7 | 74.6 | 65.9 | 95.6 | 100 | 82.1 | 87.7 | 92.9 | 77.8 | 81.0 | 75.9 | 96.9 | 99.6 | 90.7 | 94.6 | 95.3 |
| 7 | — | — | 70.8 | 96.7 | 52.4 | 41.7 | 47.6 | 55.6 | 63.1 | 72.6 | 91.7 | 90.3 | 67.8 | 69.1 | 56.3 | 52.2 | 62.2 | 84.4 | 90.0 | 84.3 | 75.0 | 84.4 |
| 8 | — | — | 70.8 | 71.3 | 35.7 | 36.1 | 33.3 | 56.9 | 77.4 | 82.1 | 100 | 76.4 | 62.0 | 95.2 | 65.6 | 58.9 | 73.8 | 96.7 | 83.3 | 97.2 | 53.9 | 86.8 |
| 9 | SciScore | VLM | 93.1 | 98.3 | 90.5 | 100 | 71.4 | 66.7 | 97.6 | 100 | 100 | 100 | 94.9 | 100 | 68.8 | 97.8 | 81.0 | 100 | 100 | 100 | 100 | 100 |

Abstract

We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. First, we introduce Science-T2I, an expert-annotated adversarial dataset comprising 20k adversarial image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. Leveraging Science-T2I, we present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, achieved by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. Additionally, based on Science-T2I, we propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models. Through comprehensive experiments, we demonstrate the effectiveness of our framework in establishing new standards for evaluating the scientific realism of generated content. Specifically, SciScore attains performance comparable to human level, demonstrating a 5% improvement over assessments by experienced human evaluators. Furthermore, by applying our proposed fine-tuning method to FLUX, we achieve a performance enhancement exceeding 50% on SciScore.

Science-T2I: An Adversarial Dataset Spanning Scientific Disciplines


Task overview. Science-T2I consists of 16 tasks spanning physics, chemistry and biology that require the model to infer or visualize concepts not explicitly stated in the prompts but rooted in underlying scientific principles.

Fig 2: Data statistics (left) and word cloud (right) of Science-T2I.

Task classification. Beyond a classification based on scientific disciplines, the tasks can be categorized into two distinct groups:

  1. Subject-oriented tasks (ST) require scientific reasoning to discern how inherent differences between subjects lead to varying visual features under identical conditions.
  2. Condition-oriented tasks (CT) focus on how a single condition affects various subjects; here, scientific reasoning centers on the applied condition rather than the subject's individual properties.
Fig 3: Task classification of Science-T2I.

Prompt design. In Science-T2I, we categorize prompts into three types based on their use in scientific reasoning (a minimal data sketch follows the list):

  1. Implicit prompt (IP). Contains terms implying visual characteristics or phenomena requiring scientific interpretation (e.g., "an unripe apple" suggesting greenness).
  2. Explicit prompt (EP). Reformulates the IP into a clear, descriptive statement that results in a scientifically accurate depiction (e.g., "a green apple" explicitly conveying immaturity).
  3. Superficial prompt (SP). Provides explicit but scientifically inaccurate descriptions, focusing on surface-level interpretations (e.g., interpreting "an unripe apple" as "a red apple").
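
As a minimal sketch, one such prompt triple might look like the following as data; the field names are hypothetical and the released dataset's schema may differ.

```python
# One Science-T2I prompt triple (hypothetical field names), using the
# unripe-apple example from the text above.
triple = {
    "implicit":    "a photo of an unripe apple",  # IP: requires scientific reasoning
    "explicit":    "a photo of a green apple",    # EP: scientifically correct rendering
    "superficial": "a photo of a red apple",      # SP: literal but incorrect rendering
}
```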
Fig 4: Data curation pipeline of Science-T2I.

Data curation. We use GPT-4o to create templates and generate the corresponding prompts during data curation. These prompts then guide T2I models in image generation. Human annotators subsequently review and filter the data, drawing on an external web-based knowledge base to ensure the reliability and accuracy of the final dataset.
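As a rough illustration of the expansion step only, the sketch below turns one task template into per-subject prompt triples. All names are illustrative stand-ins: in the actual pipeline GPT-4o authors the templates and prompts, and human annotators filter the generated images.

```python
# Minimal sketch of template expansion during data curation (hypothetical
# template text and subjects; the real prompts come from GPT-4o).
TEMPLATE = {
    "implicit":    "a photo of an unripe {subject}",
    "explicit":    "a photo of a green {subject}",
    "superficial": "a photo of a red {subject}",
}
SUBJECTS = ["apple", "tomato"]

def expand(template: dict, subjects: list) -> list:
    """Expand one task template into per-subject (IP, EP, SP) triples."""
    return [
        {kind: text.format(subject=s) for kind, text in template.items()}
        for s in subjects
    ]

for t in expand(TEMPLATE, SUBJECTS):
    # Each triple is then sent to a T2I model to render candidate images,
    # which human annotators verify against reference sources.
    print(t["implicit"], "|", t["explicit"], "|", t["superficial"])
```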

Science-T2I Examples


SciScore: Evaluating Scientific Authenticity of Images


Leveraging Science-T2I, we present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, achieved by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. The qualitative performance of SciScore is demonstrated in Fig. 1.
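To make the usage concrete, here is a minimal sketch of pairwise scoring with a CLIP-style model via the Hugging Face API. The vanilla OpenAI CLIP checkpoint is a stand-in only; SciScore's fine-tuned weights and any architectural changes are not shown.

```python
# Sketch: pick the more prompt-aligned image of a pair with a CLIP-style
# scorer (stand-in checkpoint, not the actual SciScore weights).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def score_pair(implicit_prompt: str, image_a: Image.Image, image_b: Image.Image) -> int:
    """Return 0 if image_a aligns better with the prompt, else 1."""
    inputs = processor(text=[implicit_prompt], images=[image_a, image_b],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (2, 1): image-text similarity
    return int(logits.squeeze(1).argmax().item())
```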

Tab 1: (Left) Performance comparison of different models on Science-T2I-S&C across different subjects. (Right) Performance of SciScore on ST and CT.

Generalization of SciScore to complex scenes. SciScore generalizes well from the simple scenes of Science-T2I-S to the complex scenes of Science-T2I-C, showing that it can focus on the relevant regions and ignore distractions.

Generalization of SciScore across ST and CT. A significant performance gap emerges between ST and CT, with most failures occurring in ST. This is expected: CT relies on generalizable visual features, whereas ST depends on subject-specific details. Without exposure to novel subjects, SciScore struggles to identify the correct visual content.

Benchmarking T2I Generation with SciScore

Three-dimensional evaluation. We evaluate T2I models' scientific reasoning by assessing how well images generated from implicit prompts align with (1) the explicit prompts, (2) the superficial prompts, and (3) the implicit prompts themselves.

Tab 2: Performance of T2I models as measured by SciScore.

Analysis of reasoning capability. We introduce the Normalized Difference (ND) metric to assess T2I models' ability to interpret implicit prompts, comparing their generations against those from explicit prompts. Low ND scores (average ≈ 35, mostly below 50) indicate that current models largely fail to move beyond literal prompt interpretations, especially for implicit scientific concepts.
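One plausible formalization, consistent with the 50-point threshold but not necessarily the paper's exact definition: with SciScore S(p, x) measuring prompt-image alignment and x_IP an image generated from the implicit prompt,

```latex
% Hypothetical reconstruction of the ND metric; the paper's precise
% definition may differ.
\mathrm{ND} \;=\; 100 \times
  \frac{S\!\left(p_{\mathrm{EP}},\, x_{\mathrm{IP}}\right)}
       {S\!\left(p_{\mathrm{EP}},\, x_{\mathrm{IP}}\right) + S\!\left(p_{\mathrm{SP}},\, x_{\mathrm{IP}}\right)}
```

Under this reading, ND = 50 means the generated image is equally aligned with the explicit and superficial prompts, and ND > 50 means the scientifically correct depiction dominates; an average near 35 therefore says generations lean toward the superficial reading.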

Two-stage Fine-tuning Framework


We propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models utilizing Science-T2I.

Supervised fine-tuning. In this phase, we use FLUX as our base model and train it on Science-T2I, employing FLUX's native rectified-flow training objective without modification.
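For reference, a minimal sketch of a rectified-flow training step, the objective family FLUX is trained with. Tensor shapes and the model interface (a `model(x_t, t, cond)` call predicting a velocity) are assumptions for illustration, not the actual FLUX training code.

```python
# Sketch of one rectified-flow (linear flow-matching) training step.
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0, cond):
    """x0: clean latents (B, C, H, W); cond: text conditioning."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)  # uniform timesteps
    noise = torch.randn_like(x0)
    x_t = (1.0 - t) * x0 + t * noise       # linear interpolation path
    v_target = noise - x0                  # constant velocity along the path
    v_pred = model(x_t, t.flatten(), cond) # model predicts the velocity
    return F.mse_loss(v_pred, v_target)
```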

Online fine-tuning. In this phase, we implement a masked online fine-tuning approach, incorporating SciScore as a reward model to direct the learning process through a DPO training objective.
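A minimal sketch of the preference signal in this stage: sample two images per implicit prompt, rank them with SciScore, and apply a DPO-style loss. The paper's masking scheme and the exact diffusion-DPO formulation are omitted here; `policy_logp` and `ref_logp` are assumed hooks returning (approximate) log-likelihoods of an image under the trained and frozen reference models.

```python
# Sketch: SciScore-ranked pairs driving a DPO-style preference loss.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp, ref_logp, sciscore, prompt, img_a, img_b, beta=0.1):
    # SciScore decides which sample is the "winner" for this prompt.
    if sciscore(prompt, img_a) >= sciscore(prompt, img_b):
        win, lose = img_a, img_b
    else:
        win, lose = img_b, img_a
    # Standard DPO objective on the implied reward margin.
    margin = beta * ((policy_logp(win) - ref_logp(win))
                     - (policy_logp(lose) - ref_logp(lose)))
    return -F.logsigmoid(margin)
```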

Fig 5: Online fine-tuning pipeline.