SimuScene
Text → Code → Video Evaluation for Physical Reasoning
Given a natural-language physical scenario, models generate Python code to render a simulation video. Correctness is measured by whether the observed dynamics match the description.
Simulation gallery
What SimuScene brings
SimuScene introduces an evaluation paradigm that assesses code generation models by executing their generated programs and judging the resulting physical outcomes, rather than relying solely on static code correctness.
The benchmark is large-scale, spanning multiple physical domains and fine-grained concepts, which enables systematic analysis of physical reasoning in code generation.
An automated evaluation pipeline executes the generated code to produce simulation videos and verifies physical correctness through vision-based judgments.
Frontier LLM performance on the SimuScene benchmark
| Model | Avg@8 (↑) | Pass@8 (↑) |
|---|---|---|
| Qwen3-32B | 11.1% | 30.5% |
| GPT-oss-20b | 10.5% | 32.0% |
| GPT-oss-120b | 14.0% | 37.4% |
| Gemini-2.5-pro | 12.7% | 37.4% |
| DeepSeek-V3.1 | 14.5% | 40.7% |
| Qwen3-235B-A22B | 15.1% | 41.0% |
| GPT-o4-mini | 17.2% | 42.5% |
| GPT-o3 | 15.9% | 45.2% |
| DeepSeek-R1-0528 | 21.5% | 52.7% |
| GPT-5-medium | 20.5% | 59.9% |
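The metrics are not defined on this page, but a natural reading is that each scenario is attempted with 8 independent generations: Avg@8 averages the per-generation success rate, while Pass@8 counts a scenario as solved if any of the 8 generations passes verification. A minimal sketch under that assumption:

```python
import numpy as np

def avg_at_k(results: np.ndarray) -> float:
    """Mean success rate over all generations and scenarios.

    `results` is a (num_scenarios, k) boolean array where results[i, j]
    says whether generation j for scenario i passed verification.
    """
    return float(results.mean())

def pass_at_k(results: np.ndarray) -> float:
    """Fraction of scenarios with at least one passing generation out of k."""
    return float(results.any(axis=1).mean())

# Hypothetical results for 4 scenarios with k = 8 generations each.
results = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
], dtype=bool)

print(f"Avg@8:  {avg_at_k(results):.1%}")   # 18.8%
print(f"Pass@8: {pass_at_k(results):.1%}")  # 75.0%
```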
How evaluation works
SimuScene evaluates whether the generated simulation matches the intended physical behavior.
Text → Code
The model writes executable Python simulation code from a natural-language scenario.
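As an illustration of this step, the sketch below builds a prompt for a scenario and strips a Markdown fence from the model's reply. The prompt wording, the `output.mp4` convention, and the helper names `build_prompt` / `extract_code` are assumptions made for illustration, not SimuScene's exact protocol.

```python
# A minimal sketch of the Text -> Code step; the prompt template and the
# requirement to write `output.mp4` are illustrative assumptions.
import re

PROMPT_TEMPLATE = """You are given a physical scenario described in natural language.
Write a self-contained Python script that simulates the scenario and renders
the resulting dynamics to a video file named output.mp4.

Scenario:
{scenario}

Return only the Python code."""

FENCE = "`" * 3  # Markdown code-fence delimiter


def build_prompt(scenario: str) -> str:
    """Fill the scenario into the code-generation prompt."""
    return PROMPT_TEMPLATE.format(scenario=scenario)


def extract_code(response: str) -> str:
    """Return the body of the first fenced code block in a model reply,
    falling back to the raw text if the model omitted the fence."""
    pattern = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(response)
    return match.group(1) if match else response


prompt = build_prompt(
    "A ball is dropped from a height of 2 m and bounces off the ground, "
    "losing 20% of its speed on each bounce."
)
```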
Code → Video
The program is executed to render a short video of the objects' trajectories and interactions.
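For a concrete sense of what such a program might look like, here is a toy script for a bouncing-ball scenario that integrates the motion with NumPy and renders it with matplotlib. The library choice is illustrative only, since the benchmark's required simulation stack is not specified here.

```python
# Toy example of generated simulation code for the scenario
# "a ball dropped from 2 m bounces, losing 20% of its speed on each bounce".
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, FFMpegWriter

G = 9.81           # gravitational acceleration (m/s^2)
DT = 1 / 60        # simulation time step (s), one frame per step
RESTITUTION = 0.8  # fraction of speed kept after each bounce

# Integrate 1-D free fall with bounces using explicit Euler steps.
y, v = 2.0, 0.0
trajectory = []
for _ in range(240):  # 4 seconds at 60 fps
    v -= G * DT
    y += v * DT
    if y <= 0.0:      # ground contact: reflect and damp the velocity
        y = 0.0
        v = -v * RESTITUTION
    trajectory.append(y)

fig, ax = plt.subplots()
ax.set_xlim(-1, 1)
ax.set_ylim(0, 2.2)
ball, = ax.plot([], [], "o", markersize=12)

def update(frame):
    ball.set_data([0.0], [trajectory[frame]])
    return ball,

anim = FuncAnimation(fig, update, frames=len(trajectory), interval=1000 * DT)
# Requires ffmpeg on the PATH; swap in PillowWriter to emit a GIF instead.
anim.save("output.mp4", writer=FFMpegWriter(fps=60))
```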
Video → Judgement
A VLM answers verification questions to decide whether the dynamics match the description.
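A rough sketch of how such a judge could be wired up is shown below: a few evenly spaced frames are sampled from the rendered video and each verification question is posed to a VLM, with the scenario counted as correct only if every answer is affirmative. The `ask_vlm` helper, the frame count, and the yes/no protocol are hypothetical placeholders, not the benchmark's actual interface.

```python
# A sketch of the Video -> Judgement step under the assumptions above.
import base64

import cv2  # opencv-python
import numpy as np


def sample_frames(video_path: str, num_frames: int = 8) -> list[bytes]:
    """Grab `num_frames` evenly spaced JPEG-encoded frames from the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), num_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            encoded, buf = cv2.imencode(".jpg", frame)
            if encoded:
                frames.append(buf.tobytes())
    cap.release()
    return frames


def ask_vlm(frames_b64: list[str], question: str) -> str:
    """Hypothetical stand-in for the real multimodal chat client: it should
    attach the frames, ask the question, and return the model's answer."""
    raise NotImplementedError


def judge(video_path: str, questions: list[str]) -> bool:
    """A scenario passes only if the VLM affirms every verification question."""
    frames_b64 = [base64.b64encode(f).decode() for f in sample_frames(video_path)]
    answers = [ask_vlm(frames_b64, q) for q in questions]
    return all(a.strip().lower().startswith("yes") for a in answers)
```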
Downloads & links
Direct download for the benchmark dataset.
BibTeX
If you use SimuScene in your research, please cite:
@misc{simuscene2026,
  title={SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios},
  author={Yanan Wang and Renxi Wang and Yongxin Wang and Xuezhi Liang and Fajri Koto and Timothy Baldwin and Xiaodan Liang and Haonan Li},
  year={2026},
  eprint={2602.10840},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.10840},
}