SATO: Stable Text-to-Motion Framework

Author(s)
Chen, W
Xiao, H
Zhang, E
Hu, L
Wang, L
Liu, M
Chen, C
Date
2024
Location

Melbourne, Australia

Abstract

Is the text-to-motion model robust? Recent advancements in text-to-motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we analyze the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework to address this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, dedicated respectively to stable attention, stable prediction, and maintaining the trade-off between accuracy and robustness. We present a methodology for constructing a SATO that satisfies stability of both attention and prediction. To verify the stability of the model, we introduce a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonyms and other slight perturbations while maintaining high accuracy. Codes and models are released at https://github.com/sato-team/Stable-Text-to-Motion-Framework.

Conference Title

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

Citation

Chen, W; Xiao, H; Zhang, E; Hu, L; Wang, L; Liu, M; Chen, C, SATO: Stable Text-to-Motion Framework, MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6989-6997