We introduce a novel method for integrating scientific knowledge into generative models to enhance their realism and consistency in image synthesis. Central to our approach is Science-T2I, an expert-annotated adversarial dataset of 20k image pairs and 9k prompts spanning diverse scientific categories. Building on Science-T2I, we develop SciScore, an end-to-end reward model that refines the evaluation of generated images by enhancing the scientific and visual capabilities of a pre-trained CLIP model. Furthermore, we propose a two-stage training framework—combining supervised fine-tuning and masked online fine-tuning—to embed scientific knowledge into existing generative models. Extensive experiments demonstrate that the framework substantially improves the scientific realism of generated content and establish Science-T2I as a new benchmark for assessing it.
Task Overview. Science-T2I consists of 16 tasks spanning physics, chemistry and biology that require the model to infer or visualize concepts not explicitly stated in the prompts but rooted in underlying scientific principles.
Task Classification. Beyond a classification based on scientific disciplines, the tasks can be categorized into two distinct groups:
1. Subject-oriented Tasks (ST) require scientific reasoning to discern how inherent differences between subjects lead to varying visual features under identical conditions.
2. Condition-oriented Tasks (CT) focus on how a single condition affects various subjects. Scientific reasoning in these tasks centers on the applied condition rather than the subject's individual properties.
Prompt Design. In Science-T2I, we categorize prompts into three types based on their use in scientific reasoning (an illustrative example triple follows the list):
1. Implicit Prompt (IP) refers to prompts that imply visual characteristics or phenomena requiring scientific interpretation. For example, "an unripe apple" suggests the apple's color is green, but this is not explicitly stated.
2. Explicit Prompt (EP) reformulates the IP into a clear, descriptive statement that results in a scientifically accurate depiction. For instance, "a green apple" explicitly conveys the apple's immaturity.
3. Superficial Prompt (SP) provides explicit but scientifically inaccurate descriptions, focusing on surface-level interpretations. For example, interpreting "an unripe apple" as "a red apple" is a superficial interpretation that lacks scientific accuracy.
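To make the three prompt types concrete, each Science-T2I example can be viewed as an (IP, EP, SP) triple. The record below is purely illustrative; the field names are hypothetical and do not reflect the released file format.

```python
# Hypothetical record illustrating the IP/EP/SP prompt triple.
# Field names are illustrative only; they do not match the released dataset schema.
example = {
    "task": "ripeness",                    # one of the 16 Science-T2I tasks
    "implicit_prompt": "an unripe apple",  # IP: requires scientific reasoning
    "explicit_prompt": "a green apple",    # EP: scientifically accurate rewrite
    "superficial_prompt": "a red apple",   # SP: literal but inaccurate rewrite
}
```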
Data Curation. We utilize GPT-4o to create templates and generate corresponding prompts during the data curation process. These outputs are then used to guide T2I models for image generation. Following this, human annotators review and filter the data, cross-checking against an external web knowledge base to ensure the reliability and accuracy of the final dataset.
Benchmark Setup. To thoroughly evaluate the model's generalization ability across different environments on scientific reasoning tasks, we have created two additional manually annotated test sets, Science-T2I-S and Science-T2I-C:
1. Science-T2I-S closely replicates the stylistic and structural attributes of the training data. It emphasizes simplicity by focusing on specific regions. The goal of Science-T2I-S is to assess the model's performance on data stylistically similar to the training set.
2. Science-T2I-C challenges the model in more complex scenarios, introducing contextual elements like explicit scene settings and diverse scenarios. Prompts may include phrases such as "in a bedroom" or "on the street," adding spatial and contextual variability. This complexity evaluates the model's ability to adapt to nuanced, less constrained environments.
Leaderboard on Science-T2I (accuracy, %). Columns GR–LI are the physics tasks, RU–FR the chemistry tasks, and LR–RI the biology tasks; each group is followed by its discipline average.

| # | Model | Type | Overall | GR | SO | ME | AB | BU | DI | EL | EV | LI | Phys. Avg | RU | IM | FR | Chem. Avg | LR | WR | SC | RI | Bio. Avg |
|---|-------|------|---------|------|------|------|------|------|------|------|------|------|-----------|------|------|------|-----------|------|------|------|------|----------|
| 1 | CLIP-H | VLM | 54.7 | 78.3 | 40.5 | 25.0 | 57.1 | 63.9 | 71.4 | 47.6 | 26.7 | 77.8 | 55.1 | 16.7 | 54.2 | 77.8 | 52.4 | 62.2 | 31.1 | 81.5 | 34.6 | 55.9 |
| 2 | BLIPScore | VLM | 55.0 | 47.5 | 44.1 | 56.9 | 38.1 | 50.0 | 50.0 | 52.4 | 20.0 | 33.3 | 50.4 | 42.9 | 53.1 | 76.7 | 43.1 | 76.7 | 38.9 | 58.3 | 38.5 | 59.9 |
| 3 | SigLIP | VLM | 57.2 | 78.3 | 45.2 | 44.4 | 57.1 | 58.3 | 83.3 | 47.6 | 63.3 | 58.3 | 59.6 | 23.8 | 60.4 | 62.2 | 53.2 | 46.7 | 33.3 | 83.3 | 53.9 | 55.9 |
| 4 | Qwen2-VL-7B | LMM | 63.8 | 83.3 | 42.9 | 26.4 | 40.5 | 47.2 | 70.2 | 77.4 | 68.3 | 84.7 | 60.0 | 73.8 | 66.7 | 52.2 | 67.0 | 57.8 | 67.8 | 95.4 | 34.6 | 68.8 |
| 5 | LLaVA-OV-7B | LMM | 65.1 | 92.5 | 56.0 | 36.1 | 38.1 | 45.8 | 75.0 | 77.4 | 100 | 95.8 | 68.2 | 59.5 | 55.2 | 48.9 | 57.8 | 51.1 | 72.2 | 78.7 | 46.2 | 64.7 |
| 6 | — | — | 87.0 | 93.0 | 86.1 | 98.2 | 66.7 | 74.6 | 65.9 | 95.6 | 100 | 82.1 | 87.7 | 92.9 | 77.8 | 81.0 | 75.9 | 96.9 | 99.6 | 90.7 | 94.6 | 95.3 |
| 7 | InternVL2.5-8B | LMM | 70.8 | 96.7 | 52.4 | 41.7 | 47.6 | 55.6 | 63.1 | 72.6 | 91.7 | 90.3 | 67.8 | 69.1 | 56.3 | 52.2 | 62.2 | 84.4 | 90.0 | 84.3 | 75.0 | 84.4 |
| 8 | GPT-4o mini | LMM | 70.8 | 71.3 | 35.7 | 36.1 | 33.3 | 56.9 | 77.4 | 82.1 | 100 | 76.4 | 62.0 | 95.2 | 65.6 | 58.9 | 73.8 | 96.7 | 83.3 | 97.2 | 53.9 | 86.8 |
| 9 | SciScore | VLM | 93.1 | 98.3 | 90.5 | 100 | 71.4 | 66.7 | 97.6 | 100 | 100 | 100 | 94.9 | 100 | 68.8 | 97.8 | 81.0 | 100 | 100 | 100 | 100 | 100 |
Leveraging Science-T2I, we present SciScore, an end-to-end reward model that assesses generated images against scientific knowledge. This is achieved by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. The qualitative performance of SciScore is demonstrated in Fig. 1.
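For intuition, the sketch below scores an image-prompt pair the way a CLIP-style reward model does: it embeds both inputs and returns a scaled cosine similarity. Note it loads the stock OpenAI checkpoint, not SciScore's fine-tuned weights, so it illustrates the scoring interface rather than SciScore itself.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stock CLIP backbone; SciScore fine-tunes a CLIP-style model on Science-T2I,
# but those weights are not what is loaded here.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between image and text embeddings, scaled by CLIP's logit scale."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    return outputs.logits_per_image.item()
```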
Generalization of SciScore to Complex Scenes. SciScore generalizes well from the simple scenes of Science-T2I-S to the complex scenes of Science-T2I-C, showing it can focus on relevant regions and ignore distractions.
Generalization of SciScore across ST and CT. A significant performance gap emerged between ST and CT, with most failures occurring on ST. This is expected: CT relies on generalizable visual features, while ST depends on subject-specific details. Without exposure to novel subjects, SciScore struggles to identify the correct visual content.
Three-dimensional Evaluation. We evaluated T2I models' scientific reasoning by assessing how well images generated from implicit prompts align with (1) images generated from explicit prompts, (2) images generated from superficial prompts, and (3) the implicit prompts themselves.
Analysis of Reasoning Capability. To evaluate the ability of T2I models to interpret implicit prompts, we introduce the Normalized Difference (ND) metric, which compares images generated from implicit prompts to those produced from explicit prompts: \[ \text{ND} = \frac{\text{IP} - \text{SP}}{\text{EP} - \text{SP}} \] where IP, EP, and SP denote the scores obtained under the implicit, explicit, and superficial prompts, respectively. As shown in Tab. 2, low ND scores (averaging around 35 and mostly below 50) reveal a significant limitation in current models: they struggle to move beyond literal interpretations of prompts, particularly when dealing with implicit scientific concepts.
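Per prompt triple, ND is a direct ratio of score differences; a minimal sketch:

```python
def normalized_difference(ip: float, ep: float, sp: float) -> float:
    """ND = (IP - SP) / (EP - SP).

    1.0 means the implicit prompt behaves like the explicit (correct) one;
    0.0 means it behaves like the superficial (incorrect) one.
    """
    return (ip - sp) / (ep - sp)

# Example: implicit-prompt images score barely above the superficial baseline.
print(normalized_difference(ip=0.40, ep=0.70, sp=0.30))  # 0.25
```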
We propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models utilizing Science-T2I.
Supervised Fine-tuning. For this phase, we utilize FLUX as our base model and train it on Science-T2I, employing FLUX's native rectified flow training objective without any modifications (a generic sketch of this objective follows).
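For reference, the rectified flow objective trains the model to predict the constant velocity along a straight path between noise and data. The sketch below is our own minimal rendition of this standard flow-matching loss, with a hypothetical `model(x_t, t, cond)` signature rather than FLUX's internal API.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """One rectified-flow training step (generic sketch, not FLUX internals).

    x0:   clean latents, shape (B, ...)
    cond: conditioning (e.g., text embeddings)
    """
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)      # uniform timesteps in [0, 1]
    t_exp = t.view(-1, *([1] * (x0.dim() - 1)))        # broadcast t over latent dims
    x_t = (1.0 - t_exp) * x0 + t_exp * noise           # straight-line interpolation
    v_target = noise - x0                              # constant velocity along the path
    v_pred = model(x_t, t, cond)                       # hypothetical call signature
    return F.mse_loss(v_pred, v_target)
```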
Online Fine-tuning. In this phase, we implement a masked online fine-tuning approach, incorporating SciScore as a reward model to direct the learning process through the DPO training objective.
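At its core, the DPO objective pushes the policy to assign relatively higher likelihood to the image that SciScore prefers in each pair. The sketch below shows the standard pairwise DPO loss on per-sample log-probabilities; the masking strategy and the diffusion-specific likelihood estimation from our method are omitted here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss (generic sketch).

    logp_w / logp_l:         policy log-probs of the winner/loser samples
    ref_logp_w / ref_logp_l: frozen reference-model log-probs
    Winners and losers are chosen by comparing SciScore on each image pair.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```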
Relative Improvement Metric. To better assess model performance beyond raw evaluations, we observe that explicit prompts consistently yield higher scores than implicit ones, providing an upper-bound performance estimate. To quantify the benefits of finetuning, we introduce the Relative Improvement (RI) metric. Let \(\text{SciScore}_B^{IP}\) and \(\text{SciScore}_B^{EP}\) represent the SciScore of the base model under implicit and explicit prompts, respectively, and let \(\text{SciScore}_F^{IP}\) denote the SciScore of the finetuned model under implicit prompts. The RI is then defined as: \[ \text{RI} = \frac{\text{SciScore}_F^{IP}-\text{SciScore}_B^{IP}}{\text{SciScore}_B^{EP}-\text{SciScore}_B^{IP}} \]
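RI is straightforward to compute from the three measurements; a minimal sketch, where 1.0 means finetuning fully closes the gap to the explicit-prompt upper bound:

```python
def relative_improvement(base_ip: float, base_ep: float, finetuned_ip: float) -> float:
    """RI = (SciScore_F^IP - SciScore_B^IP) / (SciScore_B^EP - SciScore_B^IP).

    0.0: no improvement over the base model under implicit prompts;
    1.0: the finetuned model matches the explicit-prompt upper bound.
    """
    return (finetuned_ip - base_ip) / (base_ep - base_ip)
```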
Generalization to Complex Scenes. FLUX finetuned with our framework generalizes well to the complex scenes of Science-T2I-C. This suggests the model learned the underlying scientific principles rather than memorizing training examples.
Necessity of SFT. Initial SFT (blue) provides a better starting point for OFT, leading to stable performance gains. Without it (purple), OFT struggles, highlighting SFT's role in establishing a good base for effective training.
Masking Strategy as a Denoiser. Without masking (yellow), performance is erratic and eventually collapses. Lowering the learning rate (red) prevents collapse but does not improve the model. Masking likely prevents the model from treating all features as equally important, reducing noise and enabling stable improvement.
In summary, by leveraging our expert-annotated dataset, Science-T2I, which comprises over 20k adversarial image pairs and 9k prompts, we have developed a comprehensive framework for evaluating and enhancing image realism. Specifically, we introduce SciScore, a novel reward model designed to infuse scientific knowledge into image synthesis. Our results demonstrate that SciScore reaches human-level accuracy in aligning with scientific knowledge. Additionally, we propose a two-stage training framework for T2I models, utilizing SciScore as the reward model. This framework, which combines supervised fine-tuning with online fine-tuning, leads to significant performance improvements in generation tasks that require scientific reasoning.
@misc{li2025sciencet2iaddressingscientificillusions,
title={Science-T2I: Addressing Scientific Illusions in Image Synthesis},
author={Jialuo Li and Wenhao Chai and Xingyu Fu and Haiyang Xu and Saining Xie},
year={2025},
eprint={2504.13129},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.13129},
}