Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion
File version
Version of Record (VoR)
Author(s)
Wang, Lei
Zhu, Liyun
Gedeon, Tom
Koniusz, Piotr
Griffith University Author(s)
Location
Singapore, Singapore
Abstract
In computer vision tasks, features often come from diverse representations, domains (e.g., indoor and outdoor), and modalities (e.g., text, images, and videos). Effectively fusing these features is essential for robust performance, especially with the availability of powerful pre-trained models like vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships and deep feature interactions, and they suffer from inefficiency or misalignment of features across domains or modalities. In this paper, we shift from high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing relationship graphs that encode feature relationships at different levels (e.g., clip, frame, patch, and token). To capture deeper interactions, we expand graphs through iterative graph relationship updates and introduce a learnable graph fusion operator to integrate these expanded relationships for more effective fusion. Our approach is relationship-centric, operates in a homogeneous space, and is mathematically principled, resembling element-wise relationship score aggregation via multilinear polynomials. We demonstrate the effectiveness of our graph-based fusion method on video anomaly detection, showing strong performance across multi-representational, multi-modal, and multi-domain feature fusion tasks.
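The pipeline outlined in the abstract (relationship graphs, graph expansion, learnable fusion) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the cosine-similarity graph, the use of matrix powers as the "iterative graph relationship updates", the bilinear weight table, and the function names `cosine_graph`, `expand`, and `fuse` are all assumptions chosen for illustration. In practice the weights would presumably be learned by gradient descent rather than fixed by hand.

```python
import math

def cosine_graph(features):
    """Relationship graph: pairwise cosine similarity between feature vectors.
    (Assumption: the record does not say which similarity the paper uses.)"""
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in features]
    n = len(features)
    return [[sum(a * b for a, b in zip(features[i], features[j]))
             / (norms[i] * norms[j])
             for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def expand(graph, depth):
    """Expanded graphs [G, G^2, ..., G^depth].
    (Assumption: matrix powers stand in for the iterative updates.)"""
    powers = [graph]
    for _ in range(depth - 1):
        powers.append(matmul(powers[-1], graph))
    return powers

def fuse(expanded_a, expanded_b, weights):
    """Hypothetical fusion operator: for every entry (i, j), compute
    sum over p, q of weights[p][q] * A_p[i][j] * B_q[i][j],
    i.e. a multilinear polynomial in the element-wise relationship scores."""
    n = len(expanded_a[0])
    fused = [[0.0] * n for _ in range(n)]
    for p, ga in enumerate(expanded_a):
        for q, gb in enumerate(expanded_b):
            w = weights[p][q]
            for i in range(n):
                for j in range(n):
                    fused[i][j] += w * ga[i][j] * gb[i][j]
    return fused

# Toy usage: three frames described by two modalities of different dimension.
visual = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.4], [0.0, 1.0, 0.2]]
text = [[0.2, 0.8], [0.3, 0.7], [0.9, 0.1]]
weights = [[0.25, 0.25], [0.25, 0.25]]  # fixed here; learnable in practice
fused = fuse(expand(cosine_graph(visual), 2),
             expand(cosine_graph(text), 2),
             weights)
```

Both modalities end up as frame-by-frame graphs of the same size regardless of their feature dimensions, which is why the fusion step can work entry by entry in a shared space.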
Conference Title
ICLR 2025: The Thirteenth International Conference on Learning Representations
Rights Statement
This publication is distributed under the Creative Commons Attribution 4.0 International license (https://creativecommons.org/licenses/by/4.0/).
Note
Copyright permissions for this publication were identified from the publisher's website at https://openreview.net/forum?id=SMZqIOSdlN
Citation
Ding, D; Wang, L; Zhu, L; Gedeon, T; Koniusz, P, Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion, ICLR 2025: The Thirteenth International Conference on Learning Representations, 2025