SimuScene
Text → Code → Video Evaluation for Physical Reasoning
Given a natural-language physical scenario, models generate Python code to render a simulation video. Correctness is measured by whether the observed dynamics match the description.
Simulation gallery
What SimuScene brings
SimuScene introduces an evaluation paradigm that assesses code generation models by executing their generated programs and judging the resulting physical outcomes, rather than relying solely on static code correctness.
The benchmark is large-scale, spanning multiple physical domains and fine-grained concepts, which enables systematic analysis of physical reasoning in code generation.
An automated evaluation pipeline executes the generated code to produce simulation videos and verifies physical correctness through vision-based judgments.
Frontier LLM performance on the SimuScene benchmark
| Model | Avg@8 (↑) | Pass@8 (↑) |
|---|---|---|
| Qwen3-32B | 11.1% | 30.5% |
| GPT-oss-20b | 10.5% | 32.0% |
| GPT-oss-120b | 14.0% | 37.4% |
| Gemini-2.5-pro | 12.7% | 37.4% |
| DeepSeek-V3.1 | 14.5% | 40.7% |
| Qwen3-235B-A22B | 15.1% | 41.0% |
| GPT-o4-mini | 17.2% | 42.5% |
| GPT-o3 | 15.9% | 45.2% |
| DeepSeek-R1-0528 | 21.5% | 52.7% |
| GPT-5-medium | 20.5% | 59.9% |
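The metrics are not defined on this page, but a natural reading is that each scenario is attempted with 8 independent generations: Avg@8 averages the per-generation success rate, while Pass@8 counts a scenario as solved if any of the 8 generations passes verification. A minimal sketch under that assumption:

```python
import numpy as np

def avg_at_k(results: np.ndarray) -> float:
    """Mean success rate over all generations and scenarios.

    `results` is a (num_scenarios, k) boolean array where results[i, j]
    says whether generation j for scenario i passed verification.
    """
    return float(results.mean())

def pass_at_k(results: np.ndarray) -> float:
    """Fraction of scenarios with at least one passing generation out of k."""
    return float(results.any(axis=1).mean())

# Hypothetical results for 4 scenarios with k = 8 generations each.
results = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
], dtype=bool)

print(f"Avg@8:  {avg_at_k(results):.1%}")   # 18.8%
print(f"Pass@8: {pass_at_k(results):.1%}")  # 75.0%
```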
How evaluation works
SimuScene evaluates whether the generated simulation matches the intended physical behavior.
Text → Code
The model writes executable Python simulation code from a natural-language scenario.
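As an illustration of this step, the sketch below builds a prompt for a scenario and strips a Markdown fence from the model's reply. The prompt wording, the `output.mp4` convention, and the helper names `build_prompt` / `extract_code` are assumptions made for illustration, not SimuScene's exact protocol.

```python
# A minimal sketch of the Text -> Code step; the prompt template and the
# requirement to write `output.mp4` are illustrative assumptions.
import re

PROMPT_TEMPLATE = """You are given a physical scenario described in natural language.
Write a self-contained Python script that simulates the scenario and renders
the resulting dynamics to a video file named output.mp4.

Scenario:
{scenario}

Return only the Python code."""

FENCE = "`" * 3  # Markdown code-fence delimiter


def build_prompt(scenario: str) -> str:
    """Fill the scenario into the code-generation prompt."""
    return PROMPT_TEMPLATE.format(scenario=scenario)


def extract_code(response: str) -> str:
    """Return the body of the first fenced code block in a model reply,
    falling back to the raw text if the model omitted the fence."""
    pattern = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(response)
    return match.group(1) if match else response


prompt = build_prompt(
    "A ball is dropped from a height of 2 m and bounces off the ground, "
    "losing 20% of its speed on each bounce."
)
```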
Code → Video
The program is executed to render a short video of the objects' trajectories and interactions.
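For a concrete sense of what such a program might look like, here is a toy script for a bouncing-ball scenario that integrates the motion with NumPy and renders it with matplotlib. The library choice is illustrative only, since the benchmark's required simulation stack is not specified here.

```python
# Toy example of generated simulation code for the scenario
# "a ball dropped from 2 m bounces, losing 20% of its speed on each bounce".
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, FFMpegWriter

G = 9.81           # gravitational acceleration (m/s^2)
DT = 1 / 60        # simulation time step (s), one frame per step
RESTITUTION = 0.8  # fraction of speed kept after each bounce

# Integrate 1-D free fall with bounces using explicit Euler steps.
y, v = 2.0, 0.0
trajectory = []
for _ in range(240):  # 4 seconds at 60 fps
    v -= G * DT
    y += v * DT
    if y <= 0.0:      # ground contact: reflect and damp the velocity
        y = 0.0
        v = -v * RESTITUTION
    trajectory.append(y)

fig, ax = plt.subplots()
ax.set_xlim(-1, 1)
ax.set_ylim(0, 2.2)
ball, = ax.plot([], [], "o", markersize=12)

def update(frame):
    ball.set_data([0.0], [trajectory[frame]])
    return ball,

anim = FuncAnimation(fig, update, frames=len(trajectory), interval=1000 * DT)
# Requires ffmpeg on the PATH; swap in PillowWriter to emit a GIF instead.
anim.save("output.mp4", writer=FFMpegWriter(fps=60))
```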
Video → Judgement
A VLM answers verification questions to decide whether the dynamics match the description.
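A rough sketch of how such a judge could be wired up is shown below: a few evenly spaced frames are sampled from the rendered video and each verification question is posed to a VLM, with the scenario counted as correct only if every answer is affirmative. The `ask_vlm` helper, the frame count, and the yes/no protocol are hypothetical placeholders, not the benchmark's actual interface.

```python
# A sketch of the Video -> Judgement step under the assumptions above.
import base64

import cv2  # opencv-python
import numpy as np


def sample_frames(video_path: str, num_frames: int = 8) -> list[bytes]:
    """Grab `num_frames` evenly spaced JPEG-encoded frames from the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), num_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            encoded, buf = cv2.imencode(".jpg", frame)
            if encoded:
                frames.append(buf.tobytes())
    cap.release()
    return frames


def ask_vlm(frames_b64: list[str], question: str) -> str:
    """Hypothetical stand-in for the real multimodal chat client: it should
    attach the frames, ask the question, and return the model's answer."""
    raise NotImplementedError


def judge(video_path: str, questions: list[str]) -> bool:
    """A scenario passes only if the VLM affirms every verification question."""
    frames_b64 = [base64.b64encode(f).decode() for f in sample_frames(video_path)]
    answers = [ask_vlm(frames_b64, q) for q in questions]
    return all(a.strip().lower().startswith("yes") for a in answers)
```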
Downloads & links
Direct download for the benchmark dataset.
BibTeX
If you use SimuScene in your research, please cite:
@misc{simuscene2026,
  title={SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios},
  author={Yanan Wang and Renxi Wang and Yongxin Wang and Xuezhi Liang and Fajri Koto and Timothy Baldwin and Xiaodan Liang and Haonan Li},
  year={2026},
  eprint={2602.10840},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.10840},
}