Similarity Preserving Transformer Cross-Modal Hashing for Video-Text Retrieval

Author(s)
Huang, Qianxin
Peng, Siyao
Shen, Xiaobo
Yuan, Yun-Hao
Pan, Shirui
Date
2024
Location

Melbourne, Australia

Abstract

As social networks grow exponentially, there is increasing demand for retrieving videos with natural-language queries. Cross-modal hashing, which encodes multi-modal data as compact hash codes, has been widely used in large-scale image-text retrieval, primarily due to its computational and storage efficiency. When applied to video-text retrieval, however, existing unsupervised cross-modal hashing methods extract frame- or word-level features individually and thus ignore long-term dependencies. In addition, effectively exploiting the multi-modal structure remains a remarkable challenge owing to the complex nature of video and text. To address these issues, we propose Similarity Preserving Transformer Cross-Modal Hashing (SPTCH), a new unsupervised deep cross-modal hashing method for video-text retrieval. SPTCH encodes videos and texts with bidirectional Transformer encoders that exploit their long-term dependencies. SPTCH constructs a multi-modal collaborative graph to model correlations among multi-modal data and applies semantic aggregation by employing a Graph Convolutional Network (GCN) on this graph. SPTCH designs an unsupervised multi-modal contrastive loss and a neighborhood reconstruction loss to effectively leverage the inter- and intra-modal similarity structure among videos and texts. Empirical results on three video benchmark datasets show that SPTCH generally outperforms state-of-the-art methods in video-text retrieval.
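The abstract names three ingredients: a cross-modal contrastive objective between paired video and text embeddings, GCN-based aggregation over a multi-modal graph, and binarization into hash codes. The following is a minimal, generic sketch of those ideas in numpy; it is not the paper's implementation, and the function names, the symmetric InfoNCE form of the loss, and the sign-based binarization are illustrative assumptions rather than details taken from SPTCH.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Unit-normalize rows so that dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cross_modal_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE-style loss over paired video/text embeddings.

    Each video is pulled toward its paired text (the diagonal of the
    similarity matrix) and pushed away from the other texts in the batch,
    and vice versa. This is a common form of multi-modal contrastive loss,
    not necessarily SPTCH's exact objective.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature            # (B, B) similarity matrix
    labels = np.arange(len(v))                # positives on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)           # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # Average the video->text and text->video retrieval directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def gcn_layer(A, H, W):
    """One standard GCN propagation step: relu(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(len(A))                # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(deg ** -0.5)         # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

def binarize(emb):
    """Hash continuous embeddings into {-1, +1} codes by taking the sign."""
    return np.where(emb >= 0, 1, -1)
```

As a sanity check, the loss should be smaller when video/text pairs are correctly aligned than when the pairing is shuffled, since the aligned case puts the largest similarities on the diagonal.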

Conference Title

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

Citation

Huang, Q; Peng, S; Shen, X; Yuan, Y-H; Pan, S, Similarity Preserving Transformer Cross-Modal Hashing for Video-Text Retrieval, MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 5883-5891