CVPR 2025
🤩 Submit your LMM/VLM scores now and watch the leaderboard refresh with your achievements! Email us at .
# | Model | Type | Overall Acc | GR | SO | ME | AB | BU | DI | EL | EV | LI | Phys. Avg | RU | IM | FR | Chem. Avg | LR | WR | SC | RI | Bio. Avg
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | CLIP-H | VLM | 54.7 | 78.3 | 40.5 | 25.0 | 57.1 | 63.9 | 71.4 | 47.6 | 26.7 | 77.8 | 55.1 | 16.7 | 54.2 | 77.8 | 52.4 | 62.2 | 31.1 | 81.5 | 34.6 | 55.9
2 | BLIPScore | VLM | 55.0 | 47.5 | 44.1 | 56.9 | 38.1 | 50.0 | 50.0 | 52.4 | 20.0 | 33.3 | 50.4 | 42.9 | 53.1 | 76.7 | 43.1 | 76.7 | 38.9 | 58.3 | 38.5 | 59.9
3 | SigLIP | VLM | 57.2 | 78.3 | 45.2 | 44.4 | 57.1 | 58.3 | 83.3 | 47.6 | 63.3 | 58.3 | 59.6 | 23.8 | 60.4 | 62.2 | 53.2 | 46.7 | 33.3 | 83.3 | 53.9 | 55.9
4 | Qwen2-VL-7B | LMM | 63.8 | 83.3 | 42.9 | 26.4 | 40.5 | 47.2 | 70.2 | 77.4 | 68.3 | 84.7 | 60.0 | 73.8 | 66.7 | 52.2 | 67.0 | 57.8 | 67.8 | 95.4 | 34.6 | 68.8
5 | LLaVA-OV-7B | LMM | 65.1 | 92.5 | 56.0 | 36.1 | 38.1 | 45.8 | 75.0 | 77.4 | 100 | 95.8 | 68.2 | 59.5 | 55.2 | 48.9 | 57.8 | 51.1 | 72.2 | 78.7 | 46.2 | 64.7
6 | | | 87.0 | 93.0 | 86.1 | 98.2 | 66.7 | 74.6 | 65.9 | 95.6 | 100 | 82.1 | 87.7 | 92.9 | 77.8 | 81.0 | 75.9 | 96.9 | 99.6 | 90.7 | 94.6 | 95.3
7 | InternVL2.5-8B | LMM | 70.8 | 96.7 | 52.4 | 41.7 | 47.6 | 55.6 | 63.1 | 72.6 | 91.7 | 90.3 | 67.8 | 69.1 | 56.3 | 52.2 | 62.2 | 84.4 | 90.0 | 84.3 | 75.0 | 84.4
8 | GPT-4o mini | LMM | 70.8 | 71.3 | 35.7 | 36.1 | 33.3 | 56.9 | 77.4 | 82.1 | 100 | 76.4 | 62.0 | 95.2 | 65.6 | 58.9 | 73.8 | 96.7 | 83.3 | 97.2 | 53.9 | 86.8
9 | SciScore | VLM | 93.1 | 98.3 | 90.5 | 100 | 71.4 | 66.7 | 97.6 | 100 | 100 | 100 | 94.9 | 100 | 68.8 | 97.8 | 81.0 | 100 | 100 | 100 | 100 | 100
We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. First, we introduce Science-T2I, an expert-annotated adversarial dataset comprising 20k image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. Leveraging Science-T2I, we present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, achieved by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. Additionally, based on Science-T2I, we propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models. Through comprehensive experiments, we demonstrate the effectiveness of our framework in establishing new standards for evaluating the scientific realism of generated content. Specifically, SciScore attains performance comparable to human level, a roughly 5% improvement over evaluations conducted by experienced human evaluators. Furthermore, by applying our proposed fine-tuning method to FLUX, we achieve a performance enhancement exceeding 50% as measured by SciScore.
Task overview. Science-T2I consists of 16 tasks spanning physics, chemistry and biology that require the model to infer or visualize concepts not explicitly stated in the prompts but rooted in underlying scientific principles.
Task classification. Beyond a classification based on scientific disciplines, the tasks can be categorized into two distinct groups:
Prompt design. In Science-T2I, we categorize prompts into three types based on their use in scientific reasoning:
Data curation. We utilize GPT-4o to create templates and generate corresponding prompts during the data curation process. These outputs are then used to guide T2I models for image generation. Following this, human annotators review and filter the data, incorporating insights from an additional web-based knowledge base to ensure the reliability and accuracy of the final dataset.
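The template-to-prompt step of this pipeline can be sketched as follows. This is a minimal illustration, not the actual curation code: the template format with a `{subject}` placeholder and the function name are assumptions for the example.

```python
def expand_template(template, subjects):
    # Fill one GPT-4o-authored prompt template (hypothetical format with a
    # Python str.format placeholder) with each candidate subject, producing
    # the prompts that are later fed to the T2I models.
    return [template.format(subject=s) for s in subjects]

# An implicit prompt requires scientific reasoning ("unripe" implies color);
# its explicit counterpart states the visual outcome directly.
implicit = expand_template("a photo of an unripe {subject}", ["apple", "banana"])
explicit = expand_template("a photo of a green {subject}", ["apple", "banana"])
```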
Leveraging Science-T2I, we present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, which is achieved by augmenting both the scientific comprehension and visual capabilities of pre-trained CLIP model. The qualitative performance of SciScore is demonstrated in Fig. 1.
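Since SciScore builds on CLIP, its scoring interface can be sketched as a temperature-scaled cosine similarity between prompt and image embeddings, used to rank an adversarial image pair. This is only a sketch of the interface under that assumption; the function names and toy embeddings are hypothetical, and the real model's strength comes from its fine-tuned encoders, not this arithmetic.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

def clip_style_score(prompt_emb, image_emb, temperature=0.01):
    # CLIP-style reward: temperature-scaled cosine similarity between the
    # prompt embedding and the image embedding.
    return cosine(prompt_emb, image_emb) / temperature

def prefer(prompt_emb, img_a, img_b):
    # Given an adversarial image pair, return which image the reward model
    # judges more consistent with the (implicit) prompt.
    a = clip_style_score(prompt_emb, img_a)
    b = clip_style_score(prompt_emb, img_b)
    return "A" if a > b else "B"
```

The pairwise-accuracy numbers in the leaderboard above correspond to how often such a preference matches the expert annotation.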
Generalization of SciScore to complex scenes. SciScore generalizes well from the simple scenes of Science-T2I-S to the complex scenes of Science-T2I-C, showing that it can focus on relevant regions and ignore distractions.
Generalization of SciScore across ST and CT. A significant performance gap emerged between ST and CT, with most failures occurring in ST. This is expected: CT relies on generalizable visual features, while ST depends on subject-specific details. Lacking exposure to novel subjects, SciScore struggles to identify the correct visual content.
Three-dimensional evaluation. We evaluated T2I models' scientific reasoning by assessing the alignment between images generated from implicit prompts and those from (1) explicit prompts, (2) superficial prompts, and (3) implicit prompts themselves.
Analysis on reasoning capability. We introduce the 'Normalized Difference' (ND) metric to assess T2I models' ability to interpret implicit prompts by comparing generated images to those from explicit prompts. Low ND scores (average ~35, mostly < 50) indicate a significant failure of current models to move beyond literal prompt interpretations, especially for implicit scientific concepts.
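One plausible formalization of the ND metric, consistent with the 0–100 scale described above, is to rescale the implicit-prompt alignment score between a superficial-prompt baseline and an explicit-prompt upper bound. The exact definition used in the paper may differ; this is an assumed form for illustration only.

```python
def normalized_difference(s_implicit, s_superficial, s_explicit):
    # Hypothetical formalization of ND: map the implicit-prompt alignment
    # score so that the superficial-prompt baseline is 0 and the
    # explicit-prompt upper bound is 100. A model that renders the prompt
    # literally scores near 0; one that resolves the underlying scientific
    # concept scores near 100.
    return 100.0 * (s_implicit - s_superficial) / (s_explicit - s_superficial)
```

Under this reading, an average ND around 35 means generated images sit much closer to the superficial interpretation than to the scientifically correct one.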
We propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models utilizing Science-T2I.
Supervised fine-tuning. For this phase, we utilize FLUX as our base model and train it on Science-T2I, employing FLUX's native rectified-flow training objective without any modifications.
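For reference, the rectified-flow objective trains the network to predict the constant velocity of a straight path between noise and data. The one-sample, unconditional sketch below (plain lists instead of tensors, a hypothetical `model(x_t, t)` signature, and the t=0-is-noise convention) illustrates the idea; the actual FLUX objective operates on latents with text conditioning.

```python
import random

def rectified_flow_loss(model, x0, x1):
    # x0: noise sample, x1: data sample (flattened to plain lists here).
    # Interpolate to a random point x_t = (1 - t) * x0 + t * x1 on the
    # straight path, and regress the model's output onto the path's
    # constant velocity (x1 - x0).
    t = random.random()
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    pred = model(xt, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)
```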
Online fine-tuning. In this phase, we implement a masked online fine-tuning approach, incorporating SciScore as a reward model to direct the learning process through the DPO training objective.
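The shape of the DPO objective driving this phase can be sketched on a single preference pair, where SciScore supplies the preferred/rejected labels. The sketch below uses the standard log-likelihood form of DPO; diffusion variants typically substitute denoising-error terms for the log-probabilities, and the exact masked formulation used here is not reproduced.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # One-pair DPO objective. 'w' denotes the image SciScore prefers and
    # 'l' the one it rejects; ref_* are log-probabilities under the frozen
    # reference model. Minimizing the loss widens the policy's preferred-
    # vs-rejected margin relative to the reference, scaled by beta.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

At initialization (policy equals reference) the margin is zero and the loss is log 2; it decreases as the policy learns to favor the SciScore-preferred image.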