+ "A person is pushed."
"A character is running on a treadmill."
"The person holds its left foot with its left hand, puts its right foot up and left hand up too."
"A person stands for a few seconds and picks up its arms and shakes them."
"This person kicks with their right leg then jabs several times."
"A person walks in a clockwise circle and stops where he began."
"A man bends down and picks something up with his right hand."
"The man walked forward, spun right on one foot and walked back to his original position."
"A person stands, crosses left leg in front of the right, lowering themselves until they are sitting, both hands on the floor before standing and uncrossing legs."
"A man is walking forward then steps over an object, then continues walking forward."
"A person repeatedly blocks their face with their right arm."
"This person takes 4 steps forward starting with their right foot."
"The person takes 4 steps backwards."
"The person did a kick spin to the left."
"A figure stretches its hands and arms above its head."
"The person does a salsa dance."
"A person jumps up and then lands."
"A person was pushed but did not fall."
We showcase MoMask's ability to inpaint specified regions of existing motion clips, conditioned on a textual description. Here we present inpainting results for the middle, suffix, and prefix regions of motion clips. The input motion clips are highlighted in purple, and the synthesized content is shown in cyan.
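For readers curious how this works at the token level, the sketch below illustrates the general masked-generative inpainting recipe under stated assumptions: tokens of the known region are kept fixed, the target region starts fully masked, and the transformer fills it in iteratively conditioned on the text, re-masking its least confident predictions each step. All names (e.g., `masked_transformer`, `MASK_ID`) are hypothetical, not MoMask's actual API, and only the base-layer tokens are sketched; residual-layer tokens would be predicted afterwards.

```python
import torch

MASK_ID = 512  # hypothetical id of the [MASK] token

def inpaint_tokens(masked_transformer, motion_tokens, keep_mask, text_emb, num_iters=10):
    """motion_tokens: (T,) base-layer token ids of the input clip.
    keep_mask:       (T,) bool, True where the original motion must be preserved
                     (e.g. the prefix/suffix); False marks the region to inpaint."""
    tokens = motion_tokens.clone()
    tokens[~keep_mask] = MASK_ID                            # blank out the region to synthesize
    for step in range(num_iters):
        logits = masked_transformer(tokens, text_emb)       # (T, vocab) text-conditioned predictions
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                      # per-position confidence and best token
        region = ~keep_mask
        tokens[region] = pred[region]                       # accept predictions inside the inpaint region
        # Cosine schedule: re-mask a shrinking number of least-confident positions
        # so they are revisited in later iterations.
        ratio = torch.cos(torch.tensor((step + 1) / num_iters * torch.pi / 2)).item()
        num_remask = int(region.sum().item() * ratio)
        if num_remask > 0:
            conf = conf.masked_fill(keep_mask, float("inf"))  # never re-mask the known region
            remask_idx = conf.topk(num_remask, largest=False).indices
            tokens[remask_idx] = MASK_ID
    tokens[keep_mask] = motion_tokens[keep_mask]            # known region stays untouched
    return tokens
```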
+ "A person falls down and gets back up quickly."
+ "A person is pushed."
+ "A person gets up from the ground."
+ "A person is doing warm up"
+ "A person bows"
+ "A person squats"
We investigate the impact of varying the number of residual quantization layers on reconstruction quality. In the visual comparison, we present the ground-truth motion alongside motions recovered by RVQ-VAEs with 5 residual layers, 3 residual layers, and no residual layers (a traditional VQ-VAE). The results demonstrate that RVQ significantly reduces reconstruction errors, leading to high-fidelity motion tokenization.
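As a rough illustration of why residual quantization helps, the following sketch shows the core RVQ loop: each layer quantizes the residual left behind by the previous layers, so the summed codes approximate the encoder latent increasingly well as layers are added. Codebook and latent sizes here are illustrative assumptions, not MoMask's actual configuration.

```python
import torch

def rvq_quantize(latent, codebooks):
    """latent:    (T, D) per-frame motion latents from the encoder.
    codebooks: list of (K, D) tensors; index 0 is the base layer, the rest are residual layers."""
    residual = latent
    quantized = torch.zeros_like(latent)
    token_ids = []
    for codebook in codebooks:
        dists = torch.cdist(residual, codebook)   # (T, K) nearest-neighbour lookup
        ids = dists.argmin(dim=-1)                # (T,) token ids for this layer
        selected = codebook[ids]                  # (T, D) chosen code vectors
        quantized = quantized + selected          # running sum over all layers so far
        residual = residual - selected            # what remains for the next layer to quantize
        token_ids.append(ids)
    return quantized, token_ids

# Toy comparison: 0 residual layers (plain VQ) vs. 3 vs. 5 residual layers.
latent = torch.randn(196, 512)
codebooks = [torch.randn(512, 512) for _ in range(6)]   # 1 base + 5 residual layers
for n_layers in (1, 4, 6):
    q, _ = rvq_quantize(latent, codebooks[:n_layers])
    print(n_layers - 1, "residual layers, reconstruction error:", (latent - q).norm().item())
```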
Utilizing the pre-trained RVQ model, we visually compare motions decoded from different combinations of tokens: the base-layer tokens alone, the base-layer tokens combined with the first 3 residual-layer tokens, and the base-layer tokens combined with all 5 residual-layer tokens. The comparison indicates that omitting residual tokens can cause subtle actions to be missed, as illustrated by the stumble in this example.
"A man walks forward, stumbles to the right, and then regains his balance and keeps walking forward."
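The sketch below (hypothetical names, not MoMask's actual API) shows how a motion could be decoded from only a subset of token layers: the decoder input is the sum of the code embeddings from the base layer and the first k residual layers, so dropping residual tokens removes fine detail from the reconstructed latent.

```python
import torch

def decode_from_tokens(decoder, codebooks, token_ids, num_residual_layers=5):
    """token_ids: list of (T,) id tensors, one per quantization layer (index 0 is the base layer).
    Only the base layer plus the first `num_residual_layers` residual layers
    contribute to the decoded motion."""
    layers_used = 1 + num_residual_layers
    latent = sum(codebooks[i][token_ids[i]] for i in range(layers_used))  # (T, D) summed code vectors
    return decoder(latent)  # map the summed latent back to joint positions / rotations
```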
We compare MoMask against three strong baselines spanning diffusion models (MDM, MLD) and autoregressive models (T2M-GPT). Compared with these existing works, MoMask better captures nuanced language concepts and generates more realistic motions.
@article{guo2023momask,
title={MoMask: Generative Masked Modeling of 3D Human Motions},
author={Chuan Guo and Yuxuan Mu and Muhammad Gohar Javed and Sen Wang and Li Cheng},
year={2023},
eprint={2312.00063},
archivePrefix={arXiv},
primaryClass={cs.CV}
}