Retrieval-based Knowledge Augmented Vision Language Pre-training

Author(s)
Rao, J
Shan, Z
Liu, L
Zhou, Y
Yang, Y
Date
2023
Location

Ottawa, Canada

Abstract

With the recent progress in large-scale vision and language representation learning, Vision Language Pre-training (VLP) models have achieved promising improvements on various multi-modal downstream tasks. Albeit powerful, these models have not fully leveraged world knowledge to their advantage. A key challenge of knowledge-augmented VLP is the lack of clear connections between knowledge and multi-modal data. Moreover, not all knowledge present in images/texts is useful, so prior approaches often struggle to effectively integrate knowledge, visual, and textual information. In this study, we propose REtrieval-based knowledge Augmented Vision Language (REAVL), a novel knowledge-augmented pre-training framework to address the above issues. For the first time, we introduce a knowledge-aware self-supervised learning scheme that efficiently establishes the correspondence between knowledge and multi-modal data and identifies informative knowledge to improve the modeling of alignment and interactions between visual and textual modalities. By adaptively integrating informative knowledge with visual and textual information, REAVL achieves new state-of-the-art performance uniformly on knowledge-based vision-language understanding and multi-modal entity linking tasks, as well as competitive results on general vision-language tasks while using only 0.2% of the pre-training data of the best models. Our model shows strong sample efficiency and effective knowledge utilization.

Conference Title

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Subject

Artificial intelligence

Nanotechnology

Nanobiotechnology

Citation

Rao, J; Shan, Z; Liu, L; Zhou, Y; Yang, Y, Retrieval-based Knowledge Augmented Vision Language Pre-training, MM '23: Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5399-5409