The Quest for Generalizable Motion Generation:
Data, Model, and Evaluation

Jing Lin1*, Ruisi Wang2*, Junzhe Lu3*, Ziqi Huang1, Guorui Song3, Ailing Zeng4, Xian Liu5, Chen Wei2, Wanqi Yin2, Qingping Sun2, Zhongang Cai2†, Lei Yang2, Ziwei Liu1
1 Nanyang Technological University    2 SenseTime Research    3 Tsinghua University    4 The Chinese University of Hong Kong    5 NVIDIA Research
* Equal Contribution    † Project Lead
ViMoGen enables high-quality and generalizable human motion generation by unifying video and motion priors.

Abstract

Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing text-to-motion models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: Data, Model, and Evaluation.

First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches.

Method: ViMoGen

The core of our approach is the ViMoGen model, which employs a dual-branch design to balance motion quality, learned from MoCap data, with semantic generalization, inherited from video generation (ViGen) models.

ViMoGen Architecture Overview (T2M & M2M Branches)
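
The released implementation is not reproduced here; the PyTorch-style sketch below (all class and function names are hypothetical) illustrates one way gated multimodal conditioning could work inside a flow-matching transformer block: noisy motion latents always cross-attend to text tokens, while motion-prior tokens derived from a ViGen model enter through a zero-initialized gate, so the prior branch can be blended in without disturbing the MoCap-trained backbone. A simple rectified-flow training objective is included for completeness.

# Minimal sketch, not the official ViMoGen code: a transformer block with
# gated multimodal conditioning, plus a simple flow-matching (rectified-flow)
# training objective. All names and design details here are assumptions.
import torch
import torch.nn as nn


class GatedConditioningBlock(nn.Module):
    """Noisy motion latents cross-attend to text tokens (T2M branch) and,
    when available, to motion-prior tokens (M2M branch) through a learned
    zero-initialized gate."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.prior_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.prior_gate = nn.Parameter(torch.zeros(1))  # gate starts as a no-op

    def forward(self, x, text_tokens, prior_tokens=None):
        # x: (B, T, dim) noisy motion latents; token tensors: (B, N, dim)
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.text_attn(self.norms[1](x), text_tokens, text_tokens)[0]
        if prior_tokens is not None:  # motion prior distilled from a ViGen model
            x = x + torch.tanh(self.prior_gate) * self.prior_attn(
                self.norms[2](x), prior_tokens, prior_tokens
            )[0]
        return x + self.mlp(self.norms[3](x))


def flow_matching_loss(model, x1, text_tokens, prior_tokens=None):
    """Regress the straight-line velocity (x1 - x0) at a random time t,
    where `model` is the full denoiser built from blocks like the above."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1
    v_pred = model(xt, t, text_tokens, prior_tokens)
    return ((v_pred - (x1 - x0)) ** 2).mean()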

Key Innovations

Data: ViMoGen-228K

To address data scarcity, we curated ViMoGen-228K, combining three distinct sources to ensure both fidelity and diversity.

Optical MoCap Data (High Fidelity): 171.5K samples
In-the-Wild Video (Real-world Diversity): 42K samples
Synthetic Video (Semantic Richness): 14.7K samples

(a) Optical MoCap Data

Unify 30 Optical MoCap datasets and augment them with text captions.

(b) In-the-Wild Video Data

Collect 60M in-the-wild videos, annotate them, and select the 0.1% with high-quality motions.

Skateboarding

A person is skateboarding, pushing off forward to start, then standing on the skateboard, and then crouching down.

Paddling

A person is paddling, taking two strokes on the left side, then switching to the right side.

(c) Synthetic Video Data

Construct semantically rich action prompts, generate videos, and annotate motion labels.

Tai Chi

A person is in a horse stance, practicing Tai Chi.

Digging

A person scoops soil with a shovel and digs into the ground.
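
As an illustration of how the three subsets above could be mixed during training, the following toy loader oversamples the smaller, more diverse sources; the file names, record format, and sampling weights are placeholders rather than the released dataset layout.

# Illustrative sketch only: mixing the three ViMoGen-228K subsets with
# per-source sampling weights. Paths, record format, and weights are
# hypothetical placeholders, not the released dataset layout.
import json
import random


def load_subset(path):
    # Each record is assumed to hold a text caption and a motion sequence.
    with open(path) as f:
        return json.load(f)


subsets = {
    "mocap": load_subset("vimogen228k_mocap.json"),       # 171.5K, high fidelity
    "in_the_wild": load_subset("vimogen228k_wild.json"),  # 42K, real-world diversity
    "synthetic": load_subset("vimogen228k_synth.json"),   # 14.7K, semantic richness
}

# Oversample the smaller, more diverse subsets so high-fidelity MoCap does not
# dominate every batch (weights are illustrative, not tuned values).
weights = {"mocap": 1.0, "in_the_wild": 2.0, "synthetic": 3.0}


def sample_batch(batch_size=32):
    names = list(subsets)
    probs = [weights[n] * len(subsets[n]) for n in names]
    batch = []
    for _ in range(batch_size):
        name = random.choices(names, weights=probs, k=1)[0]
        batch.append(random.choice(subsets[name]))
    return batch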

Benchmark: MBench

We propose MBench, a hierarchical benchmark with three key features: granular assessment across nine dimensions, strong alignment with human perception, and a balanced distribution of semantically diverse prompts.

MBench: Dimension Breakdown
MBench: Human Alignment Scores
MBench: Prompt Word Clouds

Generated Results

Bodysurfing

Marching

Putting on Shoes

Squatting

Somersaulting

Juggling Balls

Citation

@article{lin2025quest,
  title={The quest for generalizable motion generation: Data, model, and evaluation},
  author={Lin, Jing and Wang, Ruisi and Lu, Junzhe and Huang, Ziqi and Song, Guorui and Zeng, Ailing and Liu, Xian and Wei, Chen and Yin, Wanqi and Sun, Qingping and others},
  journal={arXiv preprint arXiv:2510.26794},
  year={2025}
}