BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics

ShanghaiTech University

Abstract

The recently emerging text-to-motion advances have spurred numerous attempts at convenient and interactive human motion generation. Yet, existing methods are largely limited to generating body motions only, without considering the rich two-hand motions, let alone handling various conditions like body dynamics or texts. To break the data bottleneck, we propose BOTH57M, a novel multi-modal dataset for two-hand motion generation. Our dataset includes accurate motion tracking for the human body and hands and provides pairwise finger-level hand annotations and body descriptions. We further provide a strong baseline method, BOTH2Hands, for the novel task of generating vivid two-hand motions from both implicit body dynamics and explicit text prompts. We first warm up two parallel body-to-hand and text-to-hand diffusion models and then utilize a cross-attention transformer for motion blending. Extensive experiments and cross-validations demonstrate the effectiveness of our approach and dataset for generating convincing two-hand motions from hybrid body-and-textual conditions. Our dataset and code will be disseminated to the community for future research.
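
The sketch below illustrates the blending stage described above: two parallel diffusion branches each produce a two-hand motion estimate, and a cross-attention transformer fuses them into the final hand motion. This is a minimal illustration, not the released implementation; the module names, dimensions, and the HAND_DIM constant are assumptions for the example.

```python
import torch
import torch.nn as nn

HAND_DIM = 99      # assumed per-frame two-hand pose dimension (illustrative)
LATENT_DIM = 256   # assumed transformer width (illustrative)

class CrossAttentionBlender(nn.Module):
    """Fuses body-conditioned and text-conditioned hand motions via cross-attention."""
    def __init__(self, dim=LATENT_DIM, heads=4, layers=2):
        super().__init__()
        self.embed_body = nn.Linear(HAND_DIM, dim)   # hands from the body-to-hand branch
        self.embed_text = nn.Linear(HAND_DIM, dim)   # hands from the text-to-hand branch
        self.blocks = nn.ModuleList([
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(layers)])
        self.out = nn.Linear(dim, HAND_DIM)

    def forward(self, hands_from_body, hands_from_text):
        # Both inputs: (batch, frames, HAND_DIM) denoised samples from each branch.
        q = self.embed_body(hands_from_body)
        kv = self.embed_text(hands_from_text)
        for attn, norm in zip(self.blocks, self.norms):
            fused, _ = attn(q, kv, kv)   # query one branch with the other
            q = norm(q + fused)          # residual connection
        return self.out(q)               # blended two-hand motion

# Usage: blend denoised outputs from the two warmed-up diffusion branches.
blender = CrossAttentionBlender()
body_hands = torch.randn(1, 120, HAND_DIM)   # placeholder branch outputs
text_hands = torch.randn(1, 120, HAND_DIM)
blended = blender(body_hands, text_hands)    # (1, 120, HAND_DIM)
```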

BOTH57M Dataset

BOTH57M is a unique body-hand motion dataset comprising 1,384 motion clips and 57.4M frames, with 23,477 manually annotated motions and a rich vocabulary of 4,140 words. It focuses on hand and body motions in various daily activities, referencing the book "Dictionary of Gestures" and supplementing it with custom-designed movements. To the best of our knowledge, this is currently the only dataset that provides hybrid and detailed annotations of both body and hands, opening up broad possibilities for future tasks. The rich vocabulary and hand diversity underscore our advantage in tackling the text-and-body-to-hand task. A hypothetical per-clip record is sketched below.
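
As a hypothetical illustration (not the released schema), one BOTH57M clip might pair per-frame body and hand poses with finger-level hand annotations and body descriptions. All field names and array shapes below are assumptions for the example.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class MotionClip:
    clip_id: str
    body_pose: np.ndarray          # (frames, body_joints, 3) tracked body motion (assumed layout)
    hand_pose: np.ndarray          # (frames, 2, hand_joints, 3) left/right hand motion (assumed layout)
    hand_annotations: List[str]    # finger-level text annotations for the hands
    body_descriptions: List[str]   # text descriptions of the body motion

clip = MotionClip(
    clip_id="example_0001",
    body_pose=np.zeros((120, 22, 3)),
    hand_pose=np.zeros((120, 2, 15, 3)),
    hand_annotations=["curl the right index finger toward the palm"],
    body_descriptions=["raise the right arm while stepping forward"],
)
```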

BibTeX

@inproceedings{zhang24both,
    title     = {BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics},
    author    = {Zhang, Wenqian and Huang, Molin and Zhou, Yuxuan and Zhang, Juze and Yu, Jingyi and Wang, Jingya and Xu, Lan},
    booktitle = {Conference on Computer Vision and Pattern Recognition ({CVPR})},
    year      = {2024},
}