Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion

File version

Version of Record (VoR)

Author(s)
Ding, Dexuan
Wang, Lei
Zhu, Liyun
Gedeon, Tom
Koniusz, Piotr
Date
2025
Location

Singapore, Singapore

Abstract

In computer vision tasks, features often come from diverse representations, domains (e.g., indoor and outdoor), and modalities (e.g., text, images, and videos). Effectively fusing these features is essential for robust performance, especially with the availability of powerful pre-trained models like vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships and deep feature interactions, and suffer from inefficiency or misalignment of features across domains or modalities. In this paper, we shift from high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing relationship graphs that encode feature relationships at different levels, e.g., clip, frame, patch, token, etc. To capture deeper interactions, we expand graphs through iterative graph relationship updates and introduce a learnable graph fusion operator to integrate these expanded relationships for more effective fusion. Our approach is relationship-centric, operates in a homogeneous space, and is mathematically principled, resembling element-wise relationship score aggregation via multilinear polynomials. We demonstrate the effectiveness of our graph-based fusion method on video anomaly detection, showing strong performance across multi-representational, multi-modal, and multi-domain feature fusion tasks.
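The abstract's core idea — building per-modality relationship graphs, expanding them through iterative updates, and fusing the expansions with learnable weights — can be illustrated with a rough sketch. This is not the paper's actual method or code; it is a hypothetical NumPy illustration in which "expansion" is modeled as matrix powers of a cosine-similarity graph and the "learnable fusion operator" as a coefficient array that would, in practice, be trained:

```python
import numpy as np

rng = np.random.default_rng(0)

def relationship_graph(features):
    # Cosine-similarity relationship graph over N feature vectors (N x D).
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / np.clip(norms, 1e-8, None)
    return normed @ normed.T  # N x N pairwise relationship scores

def expand_and_fuse(adjs, coeffs, max_power=3):
    # Weighted sum of matrix powers of each modality's graph: an
    # element-wise, multilinear-polynomial-style aggregation of
    # relationship scores. `coeffs` stands in for learnable weights.
    n = adjs[0].shape[0]
    fused = np.zeros((n, n))
    for m, A in enumerate(adjs):
        P = np.eye(n)
        for k in range(max_power):
            P = P @ A  # A^(k+1): deeper multi-hop interactions
            fused += coeffs[m, k] * P
    return fused

# Two "modalities" describing the same 5 items with different features.
feats_a = rng.standard_normal((5, 16))
feats_b = rng.standard_normal((5, 32))
adjs = [relationship_graph(feats_a), relationship_graph(feats_b)]
coeffs = rng.standard_normal((2, 3))  # would be learned in practice
fused = expand_and_fuse(adjs, coeffs)
print(fused.shape)  # (5, 5)
```

Because each input graph is symmetric, its powers and any weighted sum of them remain symmetric, so the fused output is still a valid (signed) relationship graph over the same items.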

Conference Title

ICLR 2025: The Thirteenth International Conference on Learning Representations

Rights Statement

This publication is distributed under the Creative Commons Attribution 4.0 International license (https://creativecommons.org/licenses/by/4.0/).

Note

Copyright permissions for this publication were identified from the publisher's website at https://openreview.net/forum?id=SMZqIOSdlN

Citation

Ding, D; Wang, L; Zhu, L; Gedeon, T; Koniusz, P, Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion, ICLR 2025: The Thirteenth International Conference on Learning Representations, 2025