MoMask: Generative Masked Modeling of 3D Human Motions

CVPR 2024
University of Alberta, Canada

We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. MoMask employs a hierarchical quantization scheme to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting at the base layer, with a sequence of motion tokens obtained by vector quantization, residual tokens of increasing order are derived and stored at the subsequent layers of the hierarchy. Two distinct bidirectional transformers then model these tokens. For the base-layer motion tokens, a Masked Transformer is trained to predict randomly masked motion tokens conditioned on the text input. At generation (i.e., inference) time, starting from an empty sequence, the Masked Transformer iteratively fills in the missing tokens. Subsequently, a Residual Transformer learns to progressively predict the next-layer tokens based on the results from the current layer. Extensive experiments demonstrate that MoMask outperforms state-of-the-art methods on the text-to-motion generation task, with an FID of 0.045 (vs. 0.141 for T2M-GPT, for example) on the HumanML3D dataset and 0.228 (vs. 0.514) on KIT-ML. MoMask can also be applied seamlessly to related tasks, such as text-guided temporal inpainting, without further model fine-tuning.
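
The iterative filling performed at generation time can be illustrated with a toy sketch. This is not the released implementation: the cosine schedule, the confidence-based choice of which positions to commit at each step, the MASK sentinel, and toy_predictor (a random stand-in for the text-conditioned Masked Transformer) are all assumptions made for illustration.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a still-empty position


def toy_predictor(tokens, rng, vocab_size=8):
    """Stand-in for a text-conditioned masked transformer (assumption).

    Returns one (token, confidence) guess per position, chosen at random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_fill(length, steps, predictor, rng):
    """Start from a fully masked sequence and fill it in over several steps."""
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = predictor(tokens, rng)
        # Assumed cosine schedule: how many positions stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident guesses among the still-masked positions.
        best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:n_commit]
        for i in best:
            tokens[i] = guesses[i][0]
    return tokens


base_tokens = iterative_fill(length=12, steps=5, predictor=toy_predictor,
                             rng=random.Random(0))
print(base_tokens)
```

In this sketch the final step always commits every remaining position, so the returned sequence contains no masked entries.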


Approach Overview

Gallery of Generation

Application: Temporal Inpainting

We showcase MoMask's capability to inpaint specific regions within existing motion clips, conditioned on a textual description. Here, we present the inpainting results for the middle, suffix, and prefix regions of motion clips. The input motion clips are highlighted in purple, and the synthesized content is represented in cyan.
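
How the region to be synthesized is selected is not described here. As a minimal sketch, assuming inpainting works by masking the base-layer tokens of the target region and letting the masked model refill them, the three cases could be set up as follows. The MASK sentinel, the mask_region helper, and the fraction parameter are hypothetical.

```python
MASK = -1  # hypothetical sentinel for a token that should be re-synthesized


def mask_region(tokens, region, fraction=0.5):
    """Mask a contiguous prefix, middle, or suffix span of a token sequence.

    Unmasked tokens are kept from the input clip; masked ones are left for
    the generative model to fill in, conditioned on the text prompt.
    """
    n = len(tokens)
    span = max(1, int(n * fraction))
    if region == "prefix":
        start = 0
    elif region == "middle":
        start = (n - span) // 2
    elif region == "suffix":
        start = n - span
    else:
        raise ValueError(f"unknown region: {region!r}")
    return [MASK if start <= i < start + span else t
            for i, t in enumerate(tokens)]


clip_tokens = list(range(8))  # toy token ids standing in for an input clip
for region in ("middle", "suffix", "prefix"):
    print(region, mask_region(clip_tokens, region))
```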


(Purple=Input, Cyan=Synthesis)

+ "A person falls down and gets back up quickly."

+ "A person is pushed."



+ "A person gets up from the ground."

+ "A person is doing warm up"



+ "A person bows"

+ "A person squats"

Impact of Residual Quantization


We investigate how the number of residual quantization layers affects reconstruction quality. The visual comparison shows the ground-truth motion alongside motions recovered from RVQ-VAEs with 5 residual layers, 3 residual layers, and no residual layers (a traditional VQ-VAE). The results demonstrate that RVQ significantly reduces reconstruction error, leading to high-fidelity motion tokenization.
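
The effect of adding residual layers can be reproduced on a toy vector. The sketch below uses random, untrained codebooks and a single 4-dimensional feature, so it only illustrates the mechanism and says nothing about the reported results. All names and sizes are hypothetical, and each toy codebook includes the zero vector so that adding a layer can never increase the error (a convenience of this sketch, not a property claimed for the trained model).

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer quantizes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the selected code vectors; pass fewer tokens to use fewer layers."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_toy_codebooks(rng, num_layers=6, size=16, dim=4):
    """Random toy codebooks (not trained), each containing the zero vector."""
    books = []
    for layer in range(num_layers):
        scale = 0.5 ** layer  # finer codes at deeper layers (toy assumption)
        book = [[rng.uniform(-scale, scale) for _ in range(dim)]
                for _ in range(size - 1)]
        book.append([0.0] * dim)
        books.append(book)
    return books


rng = random.Random(0)
codebooks = make_toy_codebooks(rng)  # 1 base layer + 5 residual layers
motion_feature = [rng.uniform(-1, 1) for _ in range(4)]
tokens = rvq_encode(motion_feature, codebooks)

for num_residual in (0, 3, 5):
    recon = rvq_decode(tokens[: 1 + num_residual], codebooks)
    print(num_residual, "residual layers, squared error:",
          squared_error(motion_feature, recon))
```

Decoding with only the first token corresponds to the no-residual (plain VQ) case; each additional token adds one residual layer's correction.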


Using the pre-trained RVQ model, we visually compare motions generated from different combinations of tokens: the base-layer tokens alone, the base-layer tokens combined with the first 3 residual layers, and the base-layer tokens combined with the first 5 residual layers. Without residual tokens, subtle actions may not be performed accurately, as the stumble in this example illustrates.

A man walks forward, stumbles to the right, and then regains his balance and keeps walking forward.


We compare MoMask against three strong baselines, spanning diffusion models (MDM, MLD) and autoregressive models (T2M-GPT). In contrast to these existing works, MoMask excels at capturing nuanced language concepts, resulting in more realistic generated motions.

Related Motion Generation Works 🚀🚀

Text2Motion: Diverse text-driven motion generation using temporal variational autoencoder.
TM2T: Learning text2motion and motion2text reciprocally through discrete token and language model.
TM2D: Learning dance generation with textual instruction.
Action2Motion: Diverse action-conditioned motion generation.
MotionMix: Semi-supervised human motion generation from multi-modalities.


      title={MoMask: Generative Masked Modeling of 3D Human Motions}, 
      author={Chuan Guo and Yuxuan Mu and Muhammad Gohar Javed and Sen Wang and Li Cheng},