Benchmarking physical simulation via executable code

SimuScene
Text → Code → Video Evaluation for Physical Reasoning

Given a natural-language physical scenario, models generate Python code to render a simulation video. Correctness is measured by whether the observed dynamics match the description.
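
As an illustration only, the sketch below shows the kind of program a model is asked to produce. The scenario (a ball dropped from rest at 2 m onto a floor) is hypothetical and not taken from the benchmark, the function name `simulate_drop` and its parameters are assumptions, and the code records per-frame positions rather than rendering an actual video.

```python
from dataclasses import dataclass

G = 9.81  # gravitational acceleration in m/s^2


@dataclass
class Frame:
    t: float  # time in seconds
    y: float  # ball height above the floor in meters


def simulate_drop(height_m=2.0, duration_s=1.0, fps=30,
                  restitution=0.8, substeps=100):
    """Return one Frame per video frame for a ball dropped from rest."""
    dt = 1.0 / (fps * substeps)
    y, v = height_m, 0.0
    frames = []
    for i in range(int(duration_s * fps)):
        frames.append(Frame(t=i / fps, y=y))
        for _ in range(substeps):
            v -= G * dt      # semi-implicit Euler: update velocity first
            y += v * dt      # then position, using the new velocity
            if y < 0.0:      # floor contact: clamp and reverse velocity
                y = 0.0
                v = -v * restitution
    return frames


frames = simulate_drop()
```

A real submission would pass each frame to a drawing library to produce the video. The physical behavior a judge would look for here is a fall that accelerates and bounces that lose height.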

Videos

Simulation gallery

Mechanics

Within a 5-meter diameter, glass-walled circular room, a 2-meter diameter wooden carousel spins on a frictionless central axis at a constant rate of 0.5 revolutions per second. A small, 0.2 kg rubber ball is placed 1 meter from the axis on the smooth wooden surface. The problem is to determine the ball's trajectory and final position after 15 seconds, considering the Coriolis force and absence of friction.

Optics

In a 2m x 0.5m x 0.3m acrylic tank filled with water up to 0.25m, a collimated beam of white light enters the water at a 10-degree angle from a 75-watt incandescent bulb. At the water surface, a thin film of soap creates an interference pattern. The room is set to a consistent 21°C. The film causes additional dispersion of the light, forming a complex, iridescent rainbow on the tank's bottom. Over 20 seconds, as the soap film slowly thins due to evaporation, the rainbow patterns shift and change, displaying a rich interplay of colors that gradually fade.

Fluid Mechanics

Inside a rectangular aquarium with dimensions of 100 cm by 50 cm and filled to a height of 40 cm with a transparent gel, a small stainless steel disc with a radius of 7 cm is submerged and fixed horizontally at a depth of 20 cm. The disc is rotated at a constant speed of 22 revolutions per minute by an overhead motor. The rotation of the disc generates vortices that influence the movement of suspended fine glitter particles within the gel. Over the course of 14 seconds, the glitter particles are swept into helical paths by the vortices, gradually forming concentrated, spiral formations that trace the flow patterns within the gel, showcasing the interaction of rotational motion with a semi-solid medium.

Highlights

What SimuScene brings

Outcome-based evaluation of code generation

SimuScene assesses code generation models by executing the programs they generate and evaluating the resulting physical outcomes, rather than relying solely on static code correctness.

A large-scale physical simulation benchmark

SimuScene is a large-scale benchmark of physical simulation scenarios that covers multiple domains and fine-grained concepts, enabling systematic analysis of physical reasoning in code generation.

Executable code → video → visual judgment pipeline

An automated pipeline executes generated code to produce simulation videos, then verifies physical correctness through vision-based judgments.

Benchmark

Frontier LLM performance on the SimuPhy benchmark

Model              Avg@8 (↑)   Pass@8 (↑)
Qwen3-32B          11.1%       30.5%
GPT-oss-20b        10.5%       32.0%
GPT-oss-120b       14.0%       37.4%
Gemini-2.5-pro     12.7%       37.4%
DeepSeek-V3.1      14.5%       40.7%
Qwen3-235B-A22B    15.1%       41.0%
GPT-o4-mini        17.2%       42.5%
GPT-o3             15.9%       45.2%
DeepSeek-R1-0528   21.5%       52.7%
GPT-5-medium       20.5%       59.9%
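
This page does not define Avg@8 or Pass@8. Under one common reading, assumed here and not confirmed by the source, each scenario is sampled 8 times, Avg@8 is the mean per-scenario fraction of samples that pass, and Pass@8 is the fraction of scenarios with at least one passing sample. The function names and the toy data below are hypothetical.

```python
from typing import Sequence


def avg_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Mean over scenarios of the fraction of samples that passed."""
    per_scenario = [sum(r) / len(r) for r in results]
    return sum(per_scenario) / len(per_scenario)


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of scenarios where at least one sample passed."""
    return sum(any(r) for r in results) / len(results)


# Toy data: 4 scenarios, 8 pass/fail outcomes each (not benchmark results).
toy_results = [
    [True] + [False] * 7,   # 1 of 8 samples passed
    [False] * 8,            # none passed
    [True] * 8,             # all passed
    [False] * 8,            # none passed
]
avg8 = avg_at_k(toy_results)
pass8 = pass_at_k(toy_results)
```

Under this reading Pass@8 can never be lower than Avg@8, which is consistent with every row of the table above.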
Pipeline

How evaluation works

SimuScene evaluates whether the generated simulation matches the intended physical behavior.

1. Text → Code

The model writes executable Python simulation code from a natural-language scenario.

2. Code → Video

The program is run to render a short video of object trajectories and interactions.

3. Video → Judgment

A vision-language model (VLM) answers verification questions to decide whether the dynamics match the description.
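
The three steps above can be sketched as a single function. This is a minimal illustration, not the project's implementation: `Scenario`, `evaluate`, and the three callables (`generate_code`, `render_video`, `ask_vlm`) are hypothetical placeholders, and the rule that a scenario passes only when every verification question is answered positively is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Scenario:
    description: str            # natural-language physical scenario
    questions: Sequence[str]    # verification questions for the judge


def evaluate(
    scenario: Scenario,
    generate_code: Callable[[str], str],            # step 1: text -> code
    render_video: Callable[[str], Optional[str]],   # step 2: code -> video path
    ask_vlm: Callable[[str, str], bool],            # step 3: (video, question) -> yes/no
) -> bool:
    code = generate_code(scenario.description)
    video_path = render_video(code)
    if video_path is None:
        # The program crashed or produced no video, so it cannot pass.
        return False
    # Assumed aggregation: every verification question must be satisfied.
    return all(ask_vlm(video_path, q) for q in scenario.questions)


# Usage with stand-in stubs in place of a real model, sandbox, and VLM.
demo = Scenario("A ball falls and bounces.", ["Does the ball fall?", "Does it bounce?"])
passed = evaluate(demo, lambda d: "print('sim')", lambda c: "out.mp4", lambda v, q: True)
failed_render = evaluate(demo, lambda d: "bad code", lambda c: None, lambda v, q: True)
```

Passing the three stages in as callables keeps the sketch independent of any particular model API, execution sandbox, or judge.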

Resources

Downloads & links

Dataset

Direct download for the benchmark dataset.

Project

Paper, code, models, and demo materials.

Citation

BibTeX

If you use SimuScene in your research, please cite:

@misc{simuscene2026,
  title={SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios},
  author={Yanan Wang and Renxi Wang and Yongxin Wang and Xuezhi Liang and Fajri Koto and Timothy Baldwin and Xiaodan Liang and Haonan Li},
  year={2026},
  eprint={2602.10840},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.10840},
}