Shaoyang Guo (郭绍阳)

Physics undergraduate at Peking University. Research intern at ByteDance Seed. Working on LLM/VLM post-training, RL/SFT, agents, benchmarking, and Physics of AI.

Peking University, School of Physics (Class of 2027) · GPA 3.74/4.00 · National Scholarship · CPhO Gold Medalist


LLM/VLM · RL/SFT/Agents · Benchmarking · Physics of AI

News

2025.07 Joined ByteDance Seed as a research intern, working on VLM/LLM post-training.

2025.04 PHYBench preprint released on arXiv. Submitted to NeurIPS 2025.

2025.03 Started contributing to a survey of VLA models (action tokenization perspective).

2024.12 Awarded National Scholarship (top 1% at PKU).

Publications

NeurIPS 2025 (submitted)

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Shi Qiu*, Shaoyang Guo*, et al.

A comprehensive physics perception and reasoning benchmark with 500 original problems contributed by 178 PKU students. Co-initiated the project and helped design evaluation and quality-control workflows.

arXiv · Project Page

arXiv Preprint

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Authors including Shaoyang Guo

A survey on VLA models focusing on action representation. Responsible for the Raw Action chapter, reviewing 30+ key papers on end-to-end VLA architectures.

arXiv · GitHub

Experience

Jul 2025 – Present

Research Intern, VLM/LLM Post-Training

ByteDance Seed

Working on post-training and research automation for VLM/LLM systems, with emphasis on RL, SFT, mid-training, rollouts, data pipelines, and agent workflows.

Contributed to HiPhO-oriented RL, SFT, and mid-training work for Seed 2.0 models, improving reported Lite performance from 72.5 to 83.8.
Participated in mid-training runs at large compute scale and supported rollout pipelines for model improvement.
Built data and prompt pipelines for QA pairs, CoT compression, summaries, and SFT-to-RL transfer experiments.
Explored auto-research agent loops, adversarial pair agents, and agent-based research settings.

Feb 2025 – Sep 2025

Co-initiator & Co-first Author, PHYBench

Peking University, Eureka Lab

Co-initiated and co-led PHYBench, a physics perception and reasoning benchmark for LLMs.

Identified gaps in existing LLM physics evaluation and led the project from concept validation to a full data pipeline.
Organized 178 PKU students to build 500 high-quality original physics problems in 2 weeks.
Designed evaluation criteria and quality-control workflows for LLM physics reasoning.
Co-authored the arXiv preprint submitted to NeurIPS 2025.

Mar 2025 – Aug 2025

Research Assistant, VLA Survey

PsiRobot Lab, Peking University

Co-authored a survey on Vision-Language-Action models from an action-tokenization perspective.

Responsible for the Raw Action chapter; reviewed 30+ key papers on end-to-end VLA architectures.
Organized taxonomies for VLA model design and contributed to the arXiv preprint.

Blogs

Ideas and working notes on AI, physics, and research taste.

ComfyResearch Reproduction

Rank Collapse on TinyShakespeare

Reproducing the spectral rank-collapse signature inside a standard ComfyResearch canvas: real text, editable nodes, and live RankMe / alphaReQ curves. We lock the setting at lr=5e-3, wd=1e-3, 20k steps and show how bottleneck width controls the collapse.

Physics of AI · Rank Collapse · ComfyResearch

Read →

Project Review

A Comprehensive Review of the ArchitectureIQ Project

A comprehensive review of ArchitectureIQ's question-generation pipeline, significance tests, evaluation protocol, meta-model results, conclusions, and open problems.

ArchitectureIQ · Evaluation · Meta-Model

Read →

Mechanism Guide

N-gram Gap Mechanism Guide

A visual guide to the N-gram Gap mechanism, including global N-gram frequency, validation loss, contribution analysis, and the training cliff.

N-gram · Language Models · Mechanistic Analysis

Read →

Regime Bridge

N-gram Gap Regime Bridge

An interactive roadmap connecting top-down and bottom-up evidence for the N-gram Gap: from reduced positives to order controls and observable curve ablations.

N-gram · Interactive · Experiments

Read →

Planned Essay

What makes a STEM benchmark actually useful?

Notes on building evaluations that reveal real reasoning capability rather than benchmark-specific pattern matching, with lessons from PHYBench.

Benchmarking · Physics · Evaluation

Draft coming soon

Writing Plan

Views on large model training

A continuing series for organizing personal views on post-training, data quality, RL/SFT dynamics, and the practical craft of making models better.

Post-Training · Data Quality · VLM

Draft coming soon

Personal Note

From physics olympiad to AI research

Reflections on how physics training shapes taste in AI research: problem selection, abstraction, experiments, and long-term curiosity.

Research · Physics · Personal

Draft coming soon

Education & Honors

Peking University, School of Physics

B.S. in Physics, expected Jun 2027. GPA 3.74/4.00, top 10% in the School of Physics; completed 141 of 149 credits by sophomore year, including 3 graduate courses.

National Scholarship (2024)

Ministry of Education, top 1% at Peking University.

Chinese Physics Olympiad Gold Medal

National rank #57 (2022). Admitted to PKU Physics via the PKU Excellence Program.

NOIP First Prize (2020)

National Olympiad in Informatics in Provinces.

Contact

Open to research collaboration, especially in VLM post-training, evaluation, and embodied intelligence.

GitHub: github.com/guoshaoyang-pku · Email: guoshaoyang@stu.pku.edu.cn · Download CV: guoshaoyang-pku-cv-v5.pdf