CM-CIF: Cross-Modal for Unaligned Modality Fusion with Continuous Integrate-and-Fire

Author(s)
Jiang, Z
Xu, Y
Xu, Y
Ke, D
Su, K
Date
2022
Location

Wuhan, China

Abstract

The purpose of Audio-Visual Speech Recognition is to identify the content of a spoken sentence by extracting lip-movement features and acoustic features from an input video of a person speaking. Although current audio-visual fusion models partly solve the problem of inconsistent time lengths across modalities, fusing the modalities may cause acoustic boundary ambiguity. To better address this problem, in this paper we propose a model named Cross-Modal Continuous Integrate-and-Fire (CM-CIF). The model integrates cross-modal information into the accumulated weight so that acoustic boundaries can be located more accurately. We use a Transformer-seq2seq model as the baseline and test CM-CIF on the public datasets LRS2 and LRS3. Experimental results show that CM-CIF achieves competitive performance.
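The underlying Continuous Integrate-and-Fire idea that the abstract builds on can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `cif`, the threshold of 1.0, and the weight handling are assumptions based on the standard CIF formulation, and CM-CIF additionally folds cross-modal information into the per-frame weights before accumulation.

```python
import numpy as np

def cif(frames, weights, threshold=1.0):
    """Continuous Integrate-and-Fire (illustrative sketch).

    Accumulates per-frame scalar weights; each time the accumulator
    crosses the threshold (an assumed acoustic boundary), it "fires":
    it emits a weighted sum of the frames seen so far and carries the
    leftover weight of the boundary frame into the next segment.
    """
    outputs = []
    acc = 0.0                          # accumulated weight so far
    state = np.zeros(frames.shape[1])  # weighted sum of frames so far
    for h, a in zip(frames, weights):
        if acc + a < threshold:
            # no boundary yet: keep integrating
            acc += a
            state = state + a * h
        else:
            # boundary frame: split its weight at the threshold
            carry = acc + a - threshold
            outputs.append(state + (threshold - acc) * h)
            acc = carry
            state = carry * h
    return np.array(outputs)
```

With uniform weights of 0.5 per frame, two frames integrate into each fired output, i.e. every second frame marks a boundary; in CM-CIF the weights would instead be predicted from fused audio-visual features so that firing aligns with true acoustic boundaries.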

Conference Title

2022 7th International Conference on Computer and Communication Systems, ICCCS 2022

Subject

Artificial intelligence

Citation

Jiang, Z; Xu, Y; Xu, Y; Ke, D; Su, K, CM-CIF: Cross-Modal for Unaligned Modality Fusion with Continuous Integrate-and-Fire, 2022 7th International Conference on Computer and Communication Systems, ICCCS 2022, 2022, pp. 358-361