Science-T2I

Addressing Scientific Illusions in Image Synthesis

Science-T2I: We develop an expert-annotated adversarial dataset covering a wide range of scientific knowledge categories. Furthermore, we introduce two benchmarks with distinct scene settings to evaluate how VLMs and LMMs handle tasks requiring vision-based scientific reasoning.
SciScore: We present an end-to-end reward model built by enhancing a pre-trained CLIP model to assess the scientific realism of generated images.
Reality Alignment: We propose a two-stage training framework, comprising supervised fine-tuning and masked online fine-tuning on Science-T2I, to integrate scientific knowledge into generative models.
Fig 1: When presented with knowledge-implicit prompts, can LMMs and VLMs effectively distinguish between real and fake scientific images? Can generative models produce scientifically plausible images from such prompts? Does fine-tuning generative models with relevant data enhance their ability to generalize based on knowledge? To explore these questions, we establish a benchmark for evaluating LMMs and VLMs, construct a dataset to train a reward model which can then serve as a reliable tool for assessing generative models, and fine-tune generative models to investigate the generalization.

We introduce a novel method for integrating scientific knowledge into generative models to enhance their realism and consistency in image synthesis. Central to our approach is Science-T2I, an expert-annotated adversarial dataset of 20k image pairs and 9k prompts spanning diverse scientific categories. Building on Science-T2I, we develop SciScore, an end-to-end reward model that refines the evaluation of generated images by enhancing the scientific and visual capabilities of a pre-trained CLIP model. Furthermore, we propose a two-stage training framework, combining supervised fine-tuning and masked online fine-tuning, to embed scientific knowledge into existing generative models. Extensive experiments validate the effectiveness of the framework and establish new benchmarks for assessing the scientific realism of generated content.



Science-T2I: An Adversarial Scientific Dataset

Task Overview. Science-T2I consists of 16 tasks spanning physics, chemistry, and biology that require the model to infer or visualize concepts not explicitly stated in the prompts but rooted in underlying scientific principles.

Fig 2: Data statistics (left) and word cloud (right) of Science-T2I.

Task Classification. Beyond a classification based on scientific disciplines, the tasks can be categorized into two distinct groups:

1. Subject-oriented Tasks (ST) require scientific reasoning to discern how inherent differences between subjects lead to different visual features under identical conditions.

2. Condition-oriented Tasks (CT) focus on how a single condition affects different subjects. Scientific reasoning in these tasks centers on the applied condition rather than on each subject's individual properties.

Fig 3: Task classification of Science-T2I.

Prompt Design. In Science-T2I, we categorize prompts into three types based on their role in scientific reasoning (a minimal code sketch of one such triplet follows the list):

1. Implicit Prompt (IP) refers to prompts that imply visual characteristics or phenomena requiring scientific interpretation. For example, "an unripe apple" suggests the apple's color is green, but this is not explicitly stated.

2. Explicit Prompt (EP) reformulates the IP into a clear, descriptive statement that results in a scientifically accurate depiction. For instance, "a green apple" explicitly conveys the apple's immaturity.

3. Superficial Prompt (SP) provides explicit but scientifically inaccurate descriptions, focusing on surface-level interpretations. For example, interpreting "an unripe apple" as "a red apple" is a superficial interpretation that lacks scientific accuracy.
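To make the three prompt types concrete, here is a minimal sketch of how one such triplet could be represented in code, using the unripe-apple example above; the class and field names are illustrative, not the dataset's actual schema.

```python
# Illustrative container for one Science-T2I prompt triplet.
# Names here are hypothetical, not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class PromptTriplet:
    implicit: str     # IP: implies visuals that need scientific interpretation
    explicit: str     # EP: scientifically accurate reformulation of the IP
    superficial: str  # SP: literal but scientifically inaccurate reading

example = PromptTriplet(
    implicit="a photo of an unripe apple",
    explicit="a photo of a green apple",
    superficial="a photo of a red apple",
)
```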

Fig 4: Data curation pipeline of Science-T2I.

Data Curation. We use GPT-4o to create templates and generate the corresponding prompts during data curation. These prompts then guide T2I models for image generation. Human annotators subsequently review and filter the data, cross-checking against an external web knowledge base to ensure the reliability and accuracy of the final dataset.
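As a rough illustration of the prompt-generation step, the sketch below asks GPT-4o (via the OpenAI Python SDK) to expand an implicit prompt into explicit and superficial counterparts. The system instruction is hypothetical; the authors' actual templates are not reproduced on this page.

```python
# Hedged sketch of template-driven prompt expansion with GPT-4o.
# The system instruction below is illustrative, not the authors' template.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_implicit_prompt(implicit_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the given prompt twice: (1) an explicit, "
                    "scientifically accurate description (EP), and (2) a "
                    "superficial, literal but scientifically inaccurate "
                    "description (SP)."
                ),
            },
            {"role": "user", "content": implicit_prompt},
        ],
    )
    return response.choices[0].message.content

print(expand_implicit_prompt("a photo of an unripe apple"))
```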

Science-T2I Examples


Benchmark Setup. To thoroughly evaluate generalization across different environments on scientific reasoning tasks, we construct two additional manually annotated test sets, Science-T2I-S and Science-T2I-C:

1. Science-T2I-S closely replicates the stylistic and structural attributes of the training data, keeping scenes simple and focused on the task-relevant regions. Its goal is to assess performance on data stylistically similar to the training set.

2. Science-T2I-C challenges the model in more complex scenarios, introducing contextual elements like explicit scene settings and diverse scenarios. Prompts may include phrases such as "in a bedroom" or "on the street," adding spatial and contextual variability. This complexity evaluates the model's ability to adapt to nuanced, less constrained environments.

Science-T2I-S&C Leaderboard

🤩 Submit your LMM/VLM scores now and watch the leaderboard refresh with your achievements! Email us at .


Acc and GR report overall results; SO–LI cover physics, RU–FR chemistry, and LR–RI biology, each group followed by its per-discipline Avg.

| # | Model | Type | Acc | GR | SO | ME | AB | BU | DI | EL | EV | LI | Phy. Avg | RU | IM | FR | Chem. Avg | LR | WR | SC | RI | Bio. Avg |
|---|-------|------|-----|----|----|----|----|----|----|----|----|----|----------|----|----|----|-----------|----|----|----|----|----------|
| 1 | CLIP-H | VLM | 54.7 | 78.3 | 40.5 | 25.0 | 57.1 | 63.9 | 71.4 | 47.6 | 26.7 | 77.8 | 55.1 | 16.7 | 54.2 | 77.8 | 52.4 | 62.2 | 31.1 | 81.5 | 34.6 | 55.9 |
| 2 | BLIPScore | VLM | 55.0 | 47.5 | 44.1 | 56.9 | 38.1 | 50.0 | 50.0 | 52.4 | 20.0 | 33.3 | 50.4 | 42.9 | 53.1 | 76.7 | 43.1 | 76.7 | 38.9 | 58.3 | 38.5 | 59.9 |
| 3 | SigLIP | VLM | 57.2 | 78.3 | 45.2 | 44.4 | 57.1 | 58.3 | 83.3 | 47.6 | 63.3 | 58.3 | 59.6 | 23.8 | 60.4 | 62.2 | 53.2 | 46.7 | 33.3 | 83.3 | 53.9 | 55.9 |
| 4 | | | 63.8 | 83.3 | 42.9 | 26.4 | 40.5 | 47.2 | 70.2 | 77.4 | 68.3 | 84.7 | 60.0 | 73.8 | 66.7 | 52.2 | 67.0 | 57.8 | 67.8 | 95.4 | 34.6 | 68.8 |
| 5 | | | 65.1 | 92.5 | 56.0 | 36.1 | 38.1 | 45.8 | 75.0 | 77.4 | 100 | 95.8 | 68.2 | 59.5 | 55.2 | 48.9 | 57.8 | 51.1 | 72.2 | 78.7 | 46.2 | 64.7 |
| 6 | | | 87.0 | 93.0 | 86.1 | 98.2 | 66.7 | 74.6 | 65.9 | 95.6 | 100 | 82.1 | 87.7 | 92.9 | 77.8 | 81.0 | 75.9 | 96.9 | 99.6 | 90.7 | 94.6 | 95.3 |
| 7 | | | 70.8 | 96.7 | 52.4 | 41.7 | 47.6 | 55.6 | 63.1 | 72.6 | 91.7 | 90.3 | 67.8 | 69.1 | 56.3 | 52.2 | 62.2 | 84.4 | 90.0 | 84.3 | 75.0 | 84.4 |
| 8 | | | 70.8 | 71.3 | 35.7 | 36.1 | 33.3 | 56.9 | 77.4 | 82.1 | 100 | 76.4 | 62.0 | 95.2 | 65.6 | 58.9 | 73.8 | 96.7 | 83.3 | 97.2 | 53.9 | 86.8 |
| 9 | SciScore | VLM | 93.1 | 98.3 | 90.5 | 100 | 71.4 | 66.7 | 97.6 | 100 | 100 | 100 | 94.9 | 100 | 68.8 | 97.8 | 81.0 | 100 | 100 | 100 | 100 | 100 |

SciScore: Evaluating Scientific Authenticity of Images

Leveraging Science-T2I, we present SciScore, an end-to-end reward model that assesses generated images against scientific knowledge. It is built by augmenting both the scientific comprehension and the visual capabilities of a pre-trained CLIP model. The qualitative performance of SciScore is demonstrated in Fig. 1.
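For intuition, the sketch below shows the kind of image-text scoring interface a CLIP-based reward model exposes, using an off-the-shelf checkpoint from Hugging Face as a stand-in; SciScore itself is fine-tuned on Science-T2I, so these scores would not reproduce its behavior.

```python
# CLIP-style scoring sketch: a stand-in for SciScore's interface,
# not the fine-tuned SciScore model itself.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def score(prompt: str, image: Image.Image) -> float:
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    # Cosine similarity between embeddings, scaled by the learned temperature.
    return model(**inputs).logits_per_image.item()

def prefer(prompt: str, image_a: Image.Image, image_b: Image.Image) -> str:
    """Pairwise preference: which image better matches the prompt?"""
    return "A" if score(prompt, image_a) >= score(prompt, image_b) else "B"
```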

Tab 1: (Left) Performance comparison of different models on Science-T2I-S&C across different subjects. (Right) Performance of SciScore in ST and CT.

Generalization of SciScore to Complex Scenes. SciScore generalizes well from simple scenes (Science-T2I-S) to complex ones (Science-T2I-C), showing that it can focus on the relevant regions and ignore distractions.

Generalization of SciScore across ST and CT. A significant performance gap emerges between ST and CT, with most failures occurring in ST. This is expected: CT relies on visual features that generalize across subjects, whereas ST depends on subject-specific details. Without exposure to novel subjects, SciScore struggles to identify the correct visual features.

Benchmarking T2I Generation with SciScore

Three-dimensional Evaluation. We evaluate the scientific reasoning of T2I models by assessing the alignment of images generated from implicit prompts against (1) explicit prompts, (2) superficial prompts, and (3) the implicit prompts themselves.

Tab 2: Performance of T2I Models on SciScore.

Analysis of Reasoning Capability. To evaluate the ability of T2I models to interpret implicit prompts, we introduce the Normalized Difference (ND) metric, which compares the SciScore achieved under implicit prompts against that achieved under explicit and superficial prompts: \[ \text{ND} = \frac{\text{SciScore}^{IP} - \text{SciScore}^{SP}}{\text{SciScore}^{EP} - \text{SciScore}^{SP}} \] As Tab 2 shows, the low ND scores (averaging around 35 and mostly below 50) reveal a significant limitation of current models: they struggle to move beyond literal interpretations of prompts, particularly when dealing with implicit scientific concepts.
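In code, the metric is a one-liner; the helper below assumes its inputs are the average SciScore values obtained under implicit, explicit, and superficial prompts.

```python
def normalized_difference(s_ip: float, s_ep: float, s_sp: float) -> float:
    """ND = (S_IP - S_SP) / (S_EP - S_SP).

    ND near 1 means implicit prompts score like explicit ones (the model
    resolved the implied science); ND near 0 means the model stayed at
    the superficial, literal reading.
    """
    return (s_ip - s_sp) / (s_ep - s_sp)
```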

Reality Alignment: Two-stage Fine-tuning Framework

We propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models utilizing Science-T2I.

Supervised Fine-tuning. For this phase, we use FLUX as our base model and train it on Science-T2I with FLUX's native rectified-flow training objective, without modification.
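For reference, here is a minimal sketch of a rectified-flow (velocity-prediction) loss in the style FLUX trains with; `model` is a placeholder conditional velocity predictor, and details such as timestep weighting are simplified.

```python
# Minimal rectified-flow training loss sketch (FLUX-style flow matching).
# `model` is a placeholder velocity predictor; timestep weighting omitted.
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0, cond):
    """x0: clean latents (B, C, H, W); cond: text conditioning."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)   # t ~ U(0, 1)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))
    x_t = (1.0 - t_) * x0 + t_ * noise              # straight-line path
    target = noise - x0                             # constant velocity along the path
    return F.mse_loss(model(x_t, t, cond), target)
```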

Online Fine-tuning. In this phase, we implement a masked online fine-tuning approach, incorporating SciScore as a reward model to direct the learning process through the DPO training objective.

Fig 5: Online fine-tuning pipeline. For each prompt, two images are generated to compute a SciScore-based preference. Simultaneously, GroundingDINO extracts segmentation masks from these images based on the prompts; the masks are then used to block gradient propagation in the corresponding regions.
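The masking step can be reduced to the sketch below: a per-pixel loss is multiplied by a binary mask so that masked-out regions contribute no gradient. This is an illustrative simplification under assumed mask semantics, not the paper's exact implementation.

```python
# Sketch of gradient masking: regions where mask == 0 contribute no gradient.
# The exact mask semantics in the paper may differ; this is an assumption.
import torch

def masked_pixel_loss(per_pixel_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """per_pixel_loss: (B, C, H, W); mask: (B, 1, H, W) with values in {0, 1}."""
    kept = per_pixel_loss * mask  # zeroed entries detach from the update
    return kept.sum() / mask.expand_as(kept).sum().clamp_min(1.0)

# Toy usage: only unmasked pixels receive gradient.
loss_map = torch.randn(2, 4, 64, 64, requires_grad=True)
mask = (torch.rand(2, 1, 64, 64) > 0.5).float()
masked_pixel_loss(loss_map, mask).backward()
```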

Experiment & Ablation Study

Relative Improvement Metric. To assess model performance beyond raw scores, we observe that explicit prompts consistently yield higher SciScore than implicit ones, providing an upper-bound performance estimate. To quantify the benefit of fine-tuning, we introduce the Relative Improvement (RI) metric. Let \(\text{SciScore}_B^{IP}\) and \(\text{SciScore}_B^{EP}\) denote the SciScore of the base model under implicit and explicit prompts, respectively, and let \(\text{SciScore}_F^{IP}\) denote the SciScore of the fine-tuned model under implicit prompts. RI is then defined as: \[ \text{RI} = \frac{\text{SciScore}_F^{IP}-\text{SciScore}_B^{IP}}{\text{SciScore}_B^{EP}-\text{SciScore}_B^{IP}} \]
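As with ND, RI reduces to a small helper; the inputs below are the three SciScore values defined above.

```python
def relative_improvement(s_f_ip: float, s_b_ip: float, s_b_ep: float) -> float:
    """RI = (SciScore_F^IP - SciScore_B^IP) / (SciScore_B^EP - SciScore_B^IP).

    RI = 1 means fine-tuning closed the entire gap between the base model's
    implicit-prompt score and its explicit-prompt upper bound.
    """
    return (s_f_ip - s_b_ip) / (s_b_ep - s_b_ip)
```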

Tab 3: Average SciScore on various methods.

Generalization to Complex Scenes. The fine-tuned FLUX generalizes well to the complex scenes of Science-T2I-C, suggesting that the model has learned the underlying scientific principles rather than merely memorizing training examples.

Fig 6: Ablation study of two-stage training.

Necessity of SFT. Initial SFT (blue) provides a better starting point for OFT, leading to stable performance gains. Without it (purple), OFT struggles, highlighting SFT's role in establishing a good base for effective training.

Masking Strategy As A Denoiser. Without masking (yellow), performance is erratic and eventually collapses. Lowering the learning rate (red) prevents the collapse but yields no improvement. Masking likely keeps the model from treating all image regions as equally important, reducing gradient noise and enabling stable improvement.

Fig 7: Case study. The upper images are generated using the base FLUX.1[dev], whereas the lower images are produced with our fine-tuning method. Each image pair utilizes an identical random seed to ensure consistency in comparison. Note that the displayed prompts are summaries of the original prompts.

Conclusion

In summary, by leveraging our expert-annotated dataset, Science-T2I, which comprises over 20k adversarial image pairs and 9k prompts, we have developed a comprehensive framework for evaluating and enhancing image realism. Specifically, we introduce SciScore, a novel reward model designed to infuse scientific knowledge into image synthesis. Our results demonstrate that SciScore reaches human-level accuracy in aligning with scientific knowledge. Additionally, we propose a two-stage training framework for T2I models, utilizing SciScore as the reward model. This framework, which combines supervised fine-tuning with online fine-tuning, leads to significant performance improvements in generation tasks that require scientific reasoning.

BibTeX

@misc{li2025sciencet2iaddressingscientificillusions,
  title={Science-T2I: Addressing Scientific Illusions in Image Synthesis},
  author={Jialuo Li and Wenhao Chai and Xingyu Fu and Haiyang Xu and Saining Xie},
  year={2025},
  eprint={2504.13129},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.13129},
}