Behavior Foundation Model
for Humanoid Robots

1Peking University, 2The Chinese University of Hong Kong, Shenzhen, 3Shanghai Jiao Tong University, 4Fudan University, 5Shanghai Artificial Intelligence Laboratory

Video

Abstract

Whole-body control (WBC) of humanoid robots has witnessed remarkable progress in skill versatility, enabling a wide range of applications such as locomotion, teleoperation, and motion tracking. Despite these achievements, existing WBC frameworks remain largely task-specific, relying heavily on labor-intensive reward engineering and demonstrating limited generalization across tasks and skills. These limitations hinder their ability to respond to arbitrary control modes and restrict their deployment in complex, real-world scenarios. To address these challenges, we revisit existing WBC systems and identify a shared objective across diverse tasks: the generation of appropriate behaviors that guide the robot toward desired goal states. Building on this insight, we propose the Behavior Foundation Model (BFM), a generative model pretrained on large-scale behavioral datasets to capture broad, reusable behavioral knowledge for humanoid robots. BFM integrates a masked online distillation framework with a Conditional Variational Autoencoder (CVAE) to model behavioral distributions, thereby enabling flexible operation across diverse control modes and efficient acquisition of novel behaviors without retraining from scratch. Extensive experiments both in simulation and on a physical humanoid platform demonstrate that BFM generalizes robustly across diverse WBC tasks while rapidly adapting to new behaviors. These results establish BFM as a promising step toward a foundation model for general-purpose humanoid control.

Behavior Gallery


All the behaviors below are generated by our Behavior Foundation Model.


Swimming

Walk and Sit on the Ground

Walk and Sit on the Chair

Get up from the Ground

Roundhouse Kick

Basketball Layup

Forward Roll

Butterfly Kick

Cartwheel

Side Salto

Simple Locomotion with Remote Controller


Squatting Locomotion with Remote Controller

Whole-body Teleoperation with HybrIK


BFM Implementation


A proxy agent is first trained in simulation via large-scale motion imitation to prepare the large-scale behavioral dataset used for BFM pretraining.

The BFM is then implemented as a Conditional Variational Autoencoder with a versatile control interface derived from our mathematical formulation, and is pretrained with the masked online distillation paradigm, as sketched below.
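The following is a minimal PyTorch-style sketch of this pretraining step, illustrating how a CVAE policy could be distilled online from the proxy agent under random goal masking. The class and function names (CVAEPolicy, distillation_step), the network sizes, the per-element masking scheme, and the loss weighting are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of masked online distillation for BFM pretraining.
# All names (CVAEPolicy, distillation_step, mask_prob) are illustrative
# assumptions, not the authors' actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAEPolicy(nn.Module):
    """Conditional VAE: encodes (proprio, goal) into a latent z, decodes an action."""
    def __init__(self, proprio_dim, goal_dim, action_dim, latent_dim=64, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(proprio_dim + goal_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 2 * latent_dim),          # outputs [mu, logvar]
        )
        self.decoder = nn.Sequential(
            nn.Linear(proprio_dim + latent_dim, hidden), nn.ELU(),
            nn.Linear(hidden, action_dim),
        )

    def encode(self, proprio, goal):
        mu, logvar = self.encoder(torch.cat([proprio, goal], dim=-1)).chunk(2, dim=-1)
        return mu, logvar

    def forward(self, proprio, goal):
        mu, logvar = self.encode(proprio, goal)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        action = self.decoder(torch.cat([proprio, z], dim=-1))
        return action, mu, logvar

def distillation_step(student, teacher, proprio, goal, mask_prob=0.5):
    """One masked online distillation step: the student imitates the proxy
    teacher's action while only observing a randomly masked goal."""
    with torch.no_grad():
        teacher_action = teacher(proprio, goal)           # privileged proxy agent
    # Randomly mask goal entries so the student learns to act under partial goals.
    mask = (torch.rand_like(goal) > mask_prob).float()
    student_action, mu, logvar = student(proprio, goal * mask)
    bc_loss = F.mse_loss(student_action, teacher_action)
    kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return bc_loss + 1e-3 * kl_loss
```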


Figure 1: Overview of BFM Implementation

BFM Applications

Behavior Composition


BFM supports behavior composition through linear interpolation of the latent variables produced by two distinct control modes, generating novel behaviors that integrate features of both, as sketched below.
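A minimal sketch of this composition step, reusing the hypothetical CVAEPolicy from the sketch above; the function name compose_behaviors and the blending weight alpha are illustrative assumptions.

```python
# Sketch of behavior composition by interpolating the latent means of two
# control modes (e.g., root-velocity mode and keypoint-tracking mode).
import torch

def compose_behaviors(policy, proprio, root_goal, keypoint_goal, alpha=0.5):
    """Blend two control modes by linearly interpolating their latent means."""
    mu_root, _ = policy.encode(proprio, root_goal)      # root mode only
    mu_key, _ = policy.encode(proprio, keypoint_goal)   # keypoint mode only
    z = (1.0 - alpha) * mu_root + alpha * mu_key        # interpolated latent
    return policy.decoder(torch.cat([proprio, z], dim=-1))
```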


Root Mode Only

After Composition

Keypoint Mode Only

Behavior Modulation


BFM allows behavior modulation through linear extrapolation in the latent space, thereby enabling behavior generation that better aligns with the desired control mode.

$$z = (1+\lambda)\,\mu^{\rho}\!\left(s_t^{p,\mathrm{real}}, s_t^{g,\mathrm{real}}\right) - \lambda\,\mu^{\rho}\!\left(s_t^{p,\mathrm{real}}, \emptyset\right), \quad \lambda > 0$$
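A minimal sketch of this modulation rule, reusing the hypothetical CVAEPolicy from the sketch above and assuming the empty goal $\emptyset$ is represented by a zero vector:

```python
# Sketch of behavior modulation by latent extrapolation, following the
# equation above; function and argument names are illustrative assumptions.
import torch

def modulate_behavior(policy, proprio, goal, lam=0.5):
    """z = (1 + lambda) * mu(s_p, s_g) - lambda * mu(s_p, empty), lambda > 0."""
    mu_goal, _ = policy.encode(proprio, goal)                     # goal-conditioned mean
    mu_empty, _ = policy.encode(proprio, torch.zeros_like(goal))  # empty-goal mean
    z = (1.0 + lam) * mu_goal - lam * mu_empty                    # extrapolated latent
    return policy.decoder(torch.cat([proprio, z], dim=-1))
```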


Before Modulation

After Modulation

Efficient Behavior Acquisition


We adopt residual learning on top of our pretrained BFM, leveraging the broad behavioral knowledge it encodes to acquire novel behaviors efficiently; see the sketch below.
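A minimal sketch of this residual scheme, assuming the pretrained BFM follows the CVAEPolicy interface from the sketch above; the residual head architecture, the action scale, and the freezing strategy are illustrative assumptions, and the residual head itself would be optimized with RL on the downstream task.

```python
# Sketch of residual learning on top of a frozen pretrained BFM.
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Adds a small learned correction to the frozen BFM action."""
    def __init__(self, bfm, obs_dim, action_dim, hidden=256, scale=0.1):
        super().__init__()
        self.bfm = bfm
        for p in self.bfm.parameters():        # freeze the pretrained BFM
            p.requires_grad_(False)
        self.residual = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, action_dim),
        )
        self.scale = scale

    def forward(self, proprio, goal):
        with torch.no_grad():
            base_action, _, _ = self.bfm(proprio, goal)   # behavior prior from BFM
        # Only the residual head receives gradients during downstream RL training.
        correction = self.residual(torch.cat([proprio, goal], dim=-1))
        return base_action + self.scale * correction
```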


Figure 2: Overview of Residual Learning on BFM

BFM ONLY

RL from Scratch

BFM + Residual Learning

Real Deployment