PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
A comprehensive physics-reasoning benchmark of 500 original problems contributed by 178 PKU students. Designed the evaluation criteria and failure-mode analysis for open-ended scientific reasoning.