We introduce a novel method for integrating scientific knowledge into generative models to enhance their realism and consistency in image synthesis. Central to our approach is Science-T2I, an expert-annotated adversarial dataset of 20k image pairs and 9k prompts spanning diverse scientific categories. Building on Science-T2I, we develop SciScore, an end-to-end reward model that refines the evaluation of generated images by enhancing the scientific and visual capabilities of a pre-trained CLIP model. Furthermore, we propose a two-stage training framework—combining supervised fine-tuning and masked online fine-tuning—to embed scientific knowledge into existing generative models. Extensive experiments demonstrate that the framework substantially improves the scientific realism of generated content and establish Science-T2I as a new benchmark for assessing it.
Task Overview. Science-T2I consists of 16 tasks spanning physics, chemistry and biology that require the model to infer or visualize concepts not explicitly stated in the prompts but rooted in underlying scientific principles.
Task Classification. Beyond a classification based on scientific disciplines, the tasks can be categorized into two distinct groups:
1. Subject-oriented Tasks (ST) require scientific reasoning to discern how inherent differences between subjects lead to varying visual features under identical conditions.
2. Condition-oriented Tasks (CT) focus on how a single condition affects various subjects. Scientific reasoning in these tasks centers on the applied condition rather than the subject's individual properties.
Prompt Design. In Science-T2I, we categorize prompts into three types based on their use in scientific reasoning (an illustrative example triple follows the list):
1. Implicit Prompt (IP) refers to prompts that imply visual characteristics or phenomena requiring scientific interpretation. For example, "an unripe apple" suggests the apple's color is green, but this is not explicitly stated.
2. Explicit Prompt (EP) reformulates the IP into a clear, descriptive statement that results in a scientifically accurate depiction. For instance, "a green apple" explicitly conveys the apple's immaturity.
3. Superficial Prompt (SP) provides explicit but scientifically inaccurate descriptions, focusing on surface-level interpretations. For example, interpreting "an unripe apple" as "a red apple" is a superficial interpretation that lacks scientific accuracy.
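To make the three prompt types concrete, each Science-T2I example can be viewed as an (IP, EP, SP) triple. The record below is purely illustrative; the field names are hypothetical and do not reflect the released file format.

```python
# Hypothetical record illustrating the IP/EP/SP prompt triple.
# Field names are illustrative only; they do not match the released dataset schema.
example = {
    "task": "ripeness",                    # one of the 16 Science-T2I tasks
    "implicit_prompt": "an unripe apple",  # IP: requires scientific reasoning
    "explicit_prompt": "a green apple",    # EP: scientifically accurate rewrite
    "superficial_prompt": "a red apple",   # SP: literal but inaccurate rewrite
}
```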
Data Curation. We utilize GPT-4o to create templates and generate corresponding prompts during the data curation process. These outputs are then used to guide T2I models for image generation. Following this, human annotators review and filter the data, cross-checking against an external web knowledge base to ensure the reliability and accuracy of the final dataset.
Benchmark Setup. To thoroughly evaluate the model's generalization ability across different environments on scientific reasoning tasks, we have created two additional manually annotated test sets, Science-T2I-S and Science-T2I-C:
1. Science-T2I-S closely replicates the stylistic and structural attributes of the training data. It emphasizes simplicity by focusing on specific regions. The goal of Science-T2I-S is to assess the model's performance on data stylistically similar to the training set.
2. Science-T2I-C challenges the model in more complex scenarios, introducing contextual elements like explicit scene settings and diverse scenarios. Prompts may include phrases such as "in a bedroom" or "on the street," adding spatial and contextual variability. This complexity evaluates the model's ability to adapt to nuanced, less constrained environments.
Leaderboard on Science-T2I (accuracy, %). Columns GR–LI are the physics tasks, RU–FR the chemistry tasks, and LR–RI the biology tasks; each group is followed by its discipline average.

| # | Model | Type | Overall | GR | SO | ME | AB | BU | DI | EL | EV | LI | Phys. Avg | RU | IM | FR | Chem. Avg | LR | WR | SC | RI | Bio. Avg |
|---|-------|------|---------|------|------|------|------|------|------|------|------|------|-----------|------|------|------|-----------|------|------|------|------|----------|
| 1 | CLIP-H | VLM | 54.7 | 78.3 | 40.5 | 25.0 | 57.1 | 63.9 | 71.4 | 47.6 | 26.7 | 77.8 | 55.1 | 16.7 | 54.2 | 77.8 | 52.4 | 62.2 | 31.1 | 81.5 | 34.6 | 55.9 |
| 2 | BLIPScore | VLM | 55.0 | 47.5 | 44.1 | 56.9 | 38.1 | 50.0 | 50.0 | 52.4 | 20.0 | 33.3 | 50.4 | 42.9 | 53.1 | 76.7 | 43.1 | 76.7 | 38.9 | 58.3 | 38.5 | 59.9 |
| 3 | SigLIP | VLM | 57.2 | 78.3 | 45.2 | 44.4 | 57.1 | 58.3 | 83.3 | 47.6 | 63.3 | 58.3 | 59.6 | 23.8 | 60.4 | 62.2 | 53.2 | 46.7 | 33.3 | 83.3 | 53.9 | 55.9 |
| 4 | Qwen2-VL-7B | LMM | 63.8 | 83.3 | 42.9 | 26.4 | 40.5 | 47.2 | 70.2 | 77.4 | 68.3 | 84.7 | 60.0 | 73.8 | 66.7 | 52.2 | 67.0 | 57.8 | 67.8 | 95.4 | 34.6 | 68.8 |
| 5 | LLaVA-OV-7B | LMM | 65.1 | 92.5 | 56.0 | 36.1 | 38.1 | 45.8 | 75.0 | 77.4 | 100 | 95.8 | 68.2 | 59.5 | 55.2 | 48.9 | 57.8 | 51.1 | 72.2 | 78.7 | 46.2 | 64.7 |
| 6 | — | — | 87.0 | 93.0 | 86.1 | 98.2 | 66.7 | 74.6 | 65.9 | 95.6 | 100 | 82.1 | 87.7 | 92.9 | 77.8 | 81.0 | 75.9 | 96.9 | 99.6 | 90.7 | 94.6 | 95.3 |
| 7 | InternVL2.5-8B | LMM | 70.8 | 96.7 | 52.4 | 41.7 | 47.6 | 55.6 | 63.1 | 72.6 | 91.7 | 90.3 | 67.8 | 69.1 | 56.3 | 52.2 | 62.2 | 84.4 | 90.0 | 84.3 | 75.0 | 84.4 |
| 8 | GPT-4o mini | LMM | 70.8 | 71.3 | 35.7 | 36.1 | 33.3 | 56.9 | 77.4 | 82.1 | 100 | 76.4 | 62.0 | 95.2 | 65.6 | 58.9 | 73.8 | 96.7 | 83.3 | 97.2 | 53.9 | 86.8 |
| 9 | SciScore | VLM | 93.1 | 98.3 | 90.5 | 100 | 71.4 | 66.7 | 97.6 | 100 | 100 | 100 | 94.9 | 100 | 68.8 | 97.8 | 81.0 | 100 | 100 | 100 | 100 | 100 |
Leveraging Science-T2I, we present SciScore, an end-to-end reward model that assesses generated images against scientific knowledge. This is achieved by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. The qualitative performance of SciScore is demonstrated in Fig. 1.
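For intuition, the sketch below scores an image-prompt pair the way a CLIP-style reward model does: it embeds both inputs and returns a scaled cosine similarity. Note it loads the stock OpenAI checkpoint, not SciScore's fine-tuned weights, so it illustrates the scoring interface rather than SciScore itself.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stock CLIP backbone; SciScore fine-tunes a CLIP-style model on Science-T2I,
# but those weights are not what is loaded here.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between image and text embeddings, scaled by CLIP's logit scale."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    return outputs.logits_per_image.item()
```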
Generalization of SciScore to Complex Scenes. SciScore generalizes well from the simple scenes of Science-T2I-S to the complex scenes of Science-T2I-C, showing it can focus on relevant regions and ignore distractions.
Generalization of SciScore across ST and CT. A significant performance gap emerged between ST and CT, with most failures occurring on ST. This is expected: CT relies on generalizable visual features, while ST depends on subject-specific details. Without exposure to novel subjects, SciScore struggles to identify the correct visual content.
Three-dimensional Evaluation. We evaluated T2I models' scientific reasoning by assessing how well images generated from implicit prompts align with (1) images generated from explicit prompts, (2) images generated from superficial prompts, and (3) the implicit prompts themselves.
Analysis of Reasoning Capability. To evaluate the ability of T2I models to interpret implicit prompts, we introduce the Normalized Difference (ND) metric, which compares images generated from implicit prompts to those produced from explicit prompts: \[ \text{ND} = \frac{\text{IP} - \text{SP}}{\text{EP} - \text{SP}} \] where IP, EP, and SP denote the scores obtained under the implicit, explicit, and superficial prompts, respectively. As shown in Tab. 2, low ND scores (averaging around 35 and mostly below 50) reveal a significant limitation in current models: they struggle to move beyond literal interpretations of prompts, particularly when dealing with implicit scientific concepts.
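Per prompt triple, ND is a direct ratio of score differences; a minimal sketch:

```python
def normalized_difference(ip: float, ep: float, sp: float) -> float:
    """ND = (IP - SP) / (EP - SP).

    1.0 means the implicit prompt behaves like the explicit (correct) one;
    0.0 means it behaves like the superficial (incorrect) one.
    """
    return (ip - sp) / (ep - sp)

# Example: implicit-prompt images score barely above the superficial baseline.
print(normalized_difference(ip=0.40, ep=0.70, sp=0.30))  # 0.25
```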
We propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models utilizing Science-T2I.
Supervised Fine-tuning. For this phase, we utilize FLUX as our base model and train it on Science-T2I, employing FLUX's native rectified flow training objective without any modifications (a generic sketch of this objective follows).
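For reference, the rectified flow objective trains the model to predict the constant velocity along a straight path between noise and data. The sketch below is our own minimal rendition of this standard flow-matching loss, with a hypothetical `model(x_t, t, cond)` signature rather than FLUX's internal API.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """One rectified-flow training step (generic sketch, not FLUX internals).

    x0:   clean latents, shape (B, ...)
    cond: conditioning (e.g., text embeddings)
    """
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)      # uniform timesteps in [0, 1]
    t_exp = t.view(-1, *([1] * (x0.dim() - 1)))        # broadcast t over latent dims
    x_t = (1.0 - t_exp) * x0 + t_exp * noise           # straight-line interpolation
    v_target = noise - x0                              # constant velocity along the path
    v_pred = model(x_t, t, cond)                       # hypothetical call signature
    return F.mse_loss(v_pred, v_target)
```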
Online Fine-tuning. In this phase, we implement a masked online fine-tuning approach, incorporating SciScore as a reward model to direct the learning process through the DPO training objective.
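At its core, the DPO objective pushes the policy to assign relatively higher likelihood to the image that SciScore prefers in each pair. The sketch below shows the standard pairwise DPO loss on per-sample log-probabilities; the masking strategy and the diffusion-specific likelihood estimation from our method are omitted here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss (generic sketch).

    logp_w / logp_l:         policy log-probs of the winner/loser samples
    ref_logp_w / ref_logp_l: frozen reference-model log-probs
    Winners and losers are chosen by comparing SciScore on each image pair.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```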
Relative Improvement Metric. To better assess model performance beyond raw evaluations, we observe that explicit prompts consistently yield higher scores than implicit ones, providing an upper-bound performance estimate. To quantify the benefits of finetuning, we introduce the Relative Improvement (RI) metric. Let \(\text{SciScore}_B^{IP}\) and \(\text{SciScore}_B^{EP}\) represent the SciScore of the base model under implicit and explicit prompts, respectively, and let \(\text{SciScore}_F^{IP}\) denote the SciScore of the finetuned model under implicit prompts. The RI is then defined as: \[ \text{RI} = \frac{\text{SciScore}_F^{IP}-\text{SciScore}_B^{IP}}{\text{SciScore}_B^{EP}-\text{SciScore}_B^{IP}} \]
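RI is straightforward to compute from the three measurements; a minimal sketch, where 1.0 means finetuning fully closes the gap to the explicit-prompt upper bound:

```python
def relative_improvement(base_ip: float, base_ep: float, finetuned_ip: float) -> float:
    """RI = (SciScore_F^IP - SciScore_B^IP) / (SciScore_B^EP - SciScore_B^IP).

    0.0: no improvement over the base model under implicit prompts;
    1.0: the finetuned model matches the explicit-prompt upper bound.
    """
    return (finetuned_ip - base_ip) / (base_ep - base_ip)
```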
Generalization to Complex Scenes. FLUX finetuned with our framework generalizes well to the complex scenes of Science-T2I-C. This suggests the model learned the underlying scientific principles rather than memorizing training examples.
Necessity of SFT. Initial SFT (blue) provides a better starting point for OFT, leading to stable performance gains. Without it (purple), OFT struggles, highlighting SFT's role in establishing a good base for effective training.
Masking Strategy as a Denoiser. Without masking (yellow), performance is erratic and eventually collapses. Lowering the learning rate (red) prevents collapse but does not improve the model. Masking likely prevents the model from treating all features as equally important, reducing noise and enabling stable improvement.
In summary, by leveraging our expert-annotated dataset, Science-T2I, which comprises over 20k adversarial image pairs and 9k prompts, we have developed a comprehensive framework for evaluating and enhancing image realism. Specifically, we introduce SciScore, a novel reward model designed to infuse scientific knowledge into image synthesis. Our results demonstrate that SciScore reaches human-level accuracy in aligning with scientific knowledge. Additionally, we propose a two-stage training framework for T2I models, utilizing SciScore as the reward model. This framework, which combines supervised fine-tuning with online fine-tuning, leads to significant performance improvements in generation tasks that require scientific reasoning.
@misc{li2025sciencet2iaddressingscientificillusions,
title={Science-T2I: Addressing Scientific Illusions in Image Synthesis},
author={Jialuo Li and Wenhao Chai and Xingyu Fu and Haiyang Xu and Saining Xie},
year={2025},
eprint={2504.13129},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.13129},
}