CM-CIF: Cross-Modal for Unaligned Modality Fusion with Continuous Integrate-and-Fire
Author(s)
Jiang, Z
Xu, Y
Xu, Y
Ke, D
Su, K
Location
Wuhan, China
Abstract
The purpose of Audio-Visual Speech Recognition is to identify the content of a spoken sentence by extracting lip-movement features and acoustic features from an input video of a person speaking. Although current audio-visual fusion models solve, to a certain extent, the problem of inconsistent time lengths across modalities, fusing the modalities may introduce acoustic boundary ambiguity. To better address this problem, in this paper we propose a model named Cross-Modal Continuous Integrate-and-Fire (CM-CIF). The model integrates cross-modal information into the accumulated weight so that acoustic boundaries can be located more accurately. We use a Transformer seq2seq model as the baseline and evaluate CM-CIF on the public datasets LRS2 and LRS3. Experimental results show that CM-CIF achieves competitive performance.
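The Continuous Integrate-and-Fire mechanism named in the abstract accumulates a per-frame weight and "fires" a token boundary whenever the accumulator crosses a threshold, carrying the remainder forward. A minimal sketch of that firing rule follows (an illustration only, not the paper's cross-modal variant; the function name and the threshold of 1.0 are assumptions):

```python
def cif_fire(weights, threshold=1.0):
    """Accumulate per-frame weights; record a boundary each time the
    accumulator crosses the threshold, keeping the leftover weight."""
    boundaries = []
    acc = 0.0
    for t, w in enumerate(weights):
        acc += w
        if acc >= threshold:
            boundaries.append(t)  # acoustic boundary fired at frame t
            acc -= threshold      # carry the remainder to the next frame
    return boundaries

# Four frames of weight 0.5 fire at frames 1 and 3.
print(cif_fire([0.5, 0.5, 0.5, 0.5]))  # → [1, 3]
```

In CM-CIF, per the abstract, the per-frame weights are informed by cross-modal (audio-visual) information rather than by acoustic features alone.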
Conference Title
2022 7th International Conference on Computer and Communication Systems, ICCCS 2022
Subject
Artificial intelligence
Citation
Jiang, Z; Xu, Y; Xu, Y; Ke, D; Su, K, CM-CIF: Cross-Modal for Unaligned Modality Fusion with Continuous Integrate-and-Fire, 2022 7th International Conference on Computer and Communication Systems, ICCCS 2022, 2022, pp. 358-361