Source: arXiv
Published: 17 June 2026

Authors: Nonghai Zhang, Siyu Zhai, Yanjun Li, Zeyu Zhang, Zhihan Yin, Yandong Guo, Boxin Shi, Hao Tang

MotionVLA: Vision-Language-Action Model for Humanoid Motion

§03

Synthesis

## The Problem: Motion Has Two Voices, But One Codebook

Humanoid motion generation from images and text descriptions requires capturing two fundamentally different aspects of movement. Low-frequency components encode *what* pose the body should take (semantic meaning), while high-frequency components encode *how* it moves dynamically (velocity, acceleration, physics). The authors' key insight is that existing methods use a single quantization codebook—a shared "vocabulary" for compressing motion into tokens—which forces these heterogeneous signals into the same space and misrepresents one or the other.

Their frequency analysis quantifies the mismatch. Using DCT (Discrete Cosine Transform) to decompose motion data, they find that just five low-frequency coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy. This explains why single-codebook approaches struggle: they optimize for position semantics at the expense of physical realism.
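The measurement described above can be illustrated with a minimal sketch. It assumes an orthonormal DCT-II, a made-up one-dimensional joint trajectory, and velocity taken as a finite difference; the function names and the toy signal are hypothetical and are not taken from the paper.

```python
import math

def dct_ortho(x):
    """Orthonormal DCT-II of a 1-D sequence (total energy is preserved)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def low_freq_energy_ratio(x, k=5):
    """Fraction of signal energy held by the first k DCT coefficients."""
    coeffs = dct_ortho(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:k]) / total if total > 0 else 0.0

# Toy trajectory (an assumption, not real motion data):
# a slow sway plus a small, fast jitter.
T = 64
position = [
    math.sin(2 * math.pi * t / T) + 0.05 * math.sin(2 * math.pi * 12 * t / T)
    for t in range(T)
]
# Finite-difference velocity; differencing amplifies the fast component.
velocity = [position[t + 1] - position[t] for t in range(T - 1)]

pos_ratio = low_freq_energy_ratio(position, k=5)
vel_ratio = low_freq_energy_ratio(velocity, k=5)
print(f"position energy in first 5 coefficients: {pos_ratio:.2%}")
print(f"velocity energy in first 5 coefficients: {vel_ratio:.2%}")
```

On this toy signal the low-frequency coefficients hold a larger share of the position energy than of the velocity energy, which is the same qualitative gap the authors report. The exact percentages depend on the signal and will not match the paper's figures.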

## The Solution: Separate the Streams, Predict in Order

MotionVLA addresses this with two innovations.

**DSFT (Dual-Stream Frequency Tokenizer)**: Instead of one codebook, the method splits motion into two independent streams. The *Base stream* captures low-frequency pose semantics via DCT truncation, keeping only the first few coefficients, which carry most of the pose energy. The *physical stream* represents the remaining high-frequency dynamics and is compressed separately using BPE (Byte-Pair Encoding), which repeatedly merges the most frequent adjacent token pairs into single tokens so that recurring patterns get short codes. This decoupling lets each stream use an encoding tailored to its statistics.
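The split described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' tokenizer: the residual is taken in the time domain, it is discretized with a uniform step of 0.05, and a single BPE merge is shown. The names `split_streams`, `quantize`, and `bpe_merge_once` are hypothetical.

```python
import math
from collections import Counter

def _scale(k, n):
    # Orthonormal DCT-II scaling factor.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(x):
    n = len(x)
    return [_scale(k, n) * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n)
                               for t in range(n)) for k in range(n)]

def idct(c):
    n = len(c)
    return [sum(_scale(k, n) * c[k] * math.cos(math.pi * (t + 0.5) * k / n)
                for k in range(n)) for t in range(n)]

def split_streams(x, k=5):
    """Return (first k DCT coefficients, time-domain residual)."""
    coeffs = dct(x)
    base_coeffs = coeffs[:k]
    base_signal = idct(base_coeffs + [0.0] * (len(x) - k))
    residual = [xi - bi for xi, bi in zip(x, base_signal)]
    return base_coeffs, residual

def quantize(values, step=0.05):
    """Uniform quantization to integer tokens (step size is an assumption)."""
    return [round(v / step) for v in values]

def bpe_merge_once(tokens):
    """Merge the most frequent adjacent pair into one token, if it repeats."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return list(tokens), None
    best, count = pairs.most_common(1)[0]
    if count < 2:
        return list(tokens), None
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
            merged.append(best)  # the pair becomes a single composite token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, best

# Toy trajectory (made up): slow sway plus fast jitter.
T = 32
trajectory = [math.sin(2 * math.pi * t / T) + 0.2 * math.sin(2 * math.pi * 8 * t / T)
              for t in range(T)]
base_coeffs, residual = split_streams(trajectory, k=5)
phys_tokens = quantize(residual)
merged_tokens, merged_pair = bpe_merge_once(phys_tokens)
print(len(base_coeffs), len(phys_tokens), len(merged_tokens), merged_pair)
```

A full BPE tokenizer would repeat the merge step until a target vocabulary size is reached and would store the learned merges for reuse at encoding time; only one step is shown here to keep the idea visible.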

**MotionVLA Architecture**: Built on a 2-billion-parameter Qwen language-model backbone, the system interleaves Base and physical tokens in a single sequence but predicts them sequentially: first the Base tokens (pose structure), then the physical tokens (motion dynamics). This two-stage autoregressive approach respects the causal dependency between the streams, since realistic dynamics depend on the chosen pose trajectory.
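The base-then-physical ordering can be pictured with a small decoding loop. This is only one plausible reading of the description, not the authors' implementation: the separator tokens, the `next_token` callable, and the scripted stand-in model are all hypothetical, and a real system would call the language model at each step.

```python
from typing import Callable, Dict, List

# Hypothetical special tokens marking the two stages.
BOS_BASE, EOS_BASE, EOS_PHYS = "<base>", "</base>", "</phys>"

def generate_two_stage(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_len: int = 32,
) -> Dict[str, List[str]]:
    """Decode Base tokens first, then physical tokens conditioned on them."""
    seq = list(prompt) + [BOS_BASE]
    base: List[str] = []
    phys: List[str] = []

    # Stage 1: pose structure. Stops at the end-of-base marker.
    while len(base) < max_len:
        tok = next_token(seq)
        seq.append(tok)
        if tok == EOS_BASE:
            break
        base.append(tok)

    # Stage 2: dynamics. The full Base prefix is already in the context.
    while len(phys) < max_len:
        tok = next_token(seq)
        seq.append(tok)
        if tok == EOS_PHYS:
            break
        phys.append(tok)

    return {"base": base, "physical": phys}

def scripted_model(script: List[str]) -> Callable[[List[str]], str]:
    """Stand-in for a language model: replays a fixed token script."""
    it = iter(script)
    return lambda seq: next(it)

result = generate_two_stage(
    scripted_model(["b0", "b1", "</base>", "p0", "p1", "p2", "</phys>"]),
    prompt=["walk", "forward"],
)
print(result)
```

Because every physical token is sampled with the complete Base prefix in context, the dynamics can condition on the whole pose trajectory rather than only on the poses generated so far.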

## Why It Matters

The results validate frequency-aware design. On HumanML3D (a standard benchmark for text-to-motion), MotionVLA cuts the diversity gap to real data by over 50%—meaning generated motions become closer in variety to authentic human movement. On MBench (a vision-conditioned benchmark), Motion-Condition Consistency improves by 3.8%, indicating better alignment between generated motion and scene context. Notably, this is achieved with a lightweight 2B backbone, suggesting the approach is parameter-efficient.

The work addresses a real bottleneck in embodied AI: generating motion that looks natural *and* respects physical constraints. By refusing to jam incompatible signals into shared representation space, the authors show that explicit frequency decoupling—backed by data analysis—outperforms the architectural convenience of single-codebook schemes. The code and website release enable reproduction and extension.
