MoMask: Generative Masked Modeling of 3D Human Motions

CVPR 2024
University of Alberta, Canada

We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. In MoMask, a hierarchical quantization scheme represents human motion as multi-layer discrete motion tokens with high-fidelity details. Starting from the base layer, where a sequence of motion tokens is obtained by vector quantization, residual tokens of increasing order are derived and stored at the subsequent layers of the hierarchy. These tokens are then modeled by two distinct bidirectional transformers. For the base-layer motion tokens, a Masked Transformer is trained to predict randomly masked tokens conditioned on the text input. At generation (i.e., inference) time, starting from an empty sequence, the Masked Transformer iteratively fills in the missing tokens. Subsequently, a Residual Transformer progressively predicts the tokens of each next layer from the results of the current layer. Extensive experiments demonstrate that MoMask outperforms state-of-the-art methods on the text-to-motion generation task, with an FID of 0.045 (vs. 0.141 for T2M-GPT) on the HumanML3D dataset and 0.228 (vs. 0.514) on KIT-ML. MoMask can also be seamlessly applied to related tasks, such as text-guided temporal inpainting, without further model fine-tuning.
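
For readers who prefer code, below is a minimal sketch of the hierarchical (residual) quantization described above: the base layer quantizes the continuous motion latents, and each subsequent layer quantizes whatever residual the previous layers left behind. Module names, codebook sizes, and dimensions are illustrative assumptions, not the released implementation.

# Minimal sketch of residual vector quantization (RVQ) for motion latents.
# Names and hyperparameters are illustrative, not MoMask's released code.
import torch
import torch.nn as nn

class ResidualQuantizer(nn.Module):
    def __init__(self, num_layers=6, codebook_size=512, dim=512):
        super().__init__()
        # One codebook per layer: the base layer plus the residual layers.
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_layers)]
        )

    def forward(self, z):
        # z: (batch, seq_len, dim) continuous motion latents from the encoder.
        residual = z
        quantized = torch.zeros_like(z)
        token_ids = []
        for codebook in self.codebooks:
            # Pick the nearest codebook entry for the current residual.
            dists = (residual.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)  # (B, T, K)
            ids = dists.argmin(dim=-1)                                         # (B, T)
            q = codebook(ids)                                                  # (B, T, dim)
            quantized = quantized + q   # the running sum approximates z
            residual = residual - q     # the next layer quantizes what is left
            token_ids.append(ids)
        # token_ids[0] holds the base-layer tokens; token_ids[1:] the residual layers.
        return quantized, token_ids

In this picture, the base-layer ids are what the Masked Transformer models, and the ids of the deeper layers are the targets of the Residual Transformer.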



Approach Overview


Gallery of Generation

Application: Temporal Inpainting


We showcase MoMask's capability to inpaint specific regions within existing motion clips, conditioned on a textual description. Here, we present inpainting results for the middle (in-betweening), prefix, and suffix regions of motion clips. The input motion clips are highlighted in purple, and the synthesized content is shown in cyan. A minimal code sketch of the underlying masking procedure follows the examples below.


Inbetweening

(Purple=Input, Cyan=Synthesis)

+ "A person falls down and gets back up quickly."

+ "A person is pushed."

Prefix

(Purple=Input, Cyan=Synthesis)

+ "A person gets up from the ground."

+ "A person is doing warm up"

Suffix

(Purple=Input, Cyan=Synthesis)

+ "A person bows"

+ "A person squats"

Impact of Residual Quantization

Reconstruction

We investigate the impact of the number of residual quantization layers on reconstruction quality. In the visual comparison, we present the ground-truth motion alongside motions recovered by RVQ-VAEs with 5 residual layers, 3 residual layers, and no residual layers (i.e., a conventional VQ-VAE). The results demonstrate that residual quantization significantly reduces reconstruction error, yielding high-fidelity motion tokenization.
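
Reconstructing from a reduced number of layers simply means decoding the sum of the first k quantized layers; using only the base layer corresponds to the conventional VQ-VAE. The sketch below reuses the illustrative ResidualQuantizer from the overview, and decoder stands in for a motion decoder; neither is MoMask's actual API.

# Sketch: reconstruct motion from only the first k quantization layers.
import torch

@torch.no_grad()
def reconstruct_with_k_layers(quantizer, decoder, z, k):
    # z: (B, T, D) continuous latents from the motion encoder.
    _, token_ids = quantizer(z)                    # token ids for every layer
    approx = torch.zeros_like(z)
    for codebook, ids in zip(quantizer.codebooks[:k], token_ids[:k]):
        approx = approx + codebook(ids)            # sum the first k code vectors
    return decoder(approx)                         # decode back to motion

More layers leave a smaller residual error, which is why reconstructions with 5 residual layers stay noticeably closer to the ground truth than those of the plain VQ-VAE.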

Generation

Using the pre-trained RVQ model, we visually compare motions generated from different combinations of tokens: the base-layer tokens alone, the base layer combined with the first 3 residual layers, and the base layer combined with the first 5 residual layers. The comparison shows that omitting the residual tokens can cause subtle actions to be missed, as illustrated by the stumble in this example.

A man walks forward, stumbles to the right, and then regains his balance and keeps walking forward.

Comparisons


We compare MoMask against three strong baselines, spanning diffusion models (MDM, MLD) and an autoregressive model (T2M-GPT). In contrast to these existing works, MoMask excels at capturing nuanced language concepts, resulting in more realistic generated motions.

Related Motion Generation Works 🚀🚀


Text2Motion: Diverse text-driven motion generation using temporal variational autoencoder.
TM2T: Learning text2motion and motion2text reciprocally through discrete tokens and a language model.
TM2D: Learning dance generation with textual instruction.
Action2Motion: Diverse action-conditioned motion generation.
MotionMix: Semi-supervised human motion generation from multi-modalities.

BibTeX

@article{guo2023momask,
      title={MoMask: Generative Masked Modeling of 3D Human Motions}, 
      author={Chuan Guo and Yuxuan Mu and Muhammad Gohar Javed and Sen Wang and Li Cheng},
      year={2023},
      eprint={2312.00063},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}