Open Access

ARTICLE

RSG-Conformer: ReLU-Based Sparse and Grouped Conformer for Audio-Visual Speech Recognition

Yewei Xiao, Xin Du*, Wei Zeng
Institute of Automation and Electronic Information, Xiangtan University, Xiangtan, 411105, China
* Corresponding Author: Xin Du. Email: email

Computers, Materials & Continua https://doi.org/10.32604/cmc.2025.072145

Received 20 August 2025; Accepted 27 October 2025; Published online 01 December 2025

Abstract

Audio-visual speech recognition (AVSR), which integrates audio and visual modalities to improve recognition performance and robustness in noisy or adverse acoustic conditions, has attracted significant research interest. However, Conformer-based architectures remain computationally expensive because the time and memory complexity of their softmax-based attention mechanisms grows quadratically with sequence length. In addition, Conformer-based architectures may not provide sufficient flexibility for modeling local dependencies at different granularities. To mitigate these limitations, this study introduces a novel AVSR framework based on a ReLU-based Sparse and Grouped Conformer (RSG-Conformer) architecture. Specifically, we propose a Global-enhanced Sparse Attention (GSA) module incorporating an efficient context restoration block to recover lost contextual cues. Concurrently, a Grouped-scale Convolution (GSC) module replaces the standard Conformer convolution module, providing adaptive local modeling across varying temporal resolutions. Furthermore, we integrate a Refined Intermediate Contextual CTC (RIC-CTC) supervision strategy. This approach applies progressively increasing loss weights combined with convolution-based context aggregation, thereby further relaxing the conditional independence assumption inherent in standard CTC frameworks. Evaluations on the LRS2 and LRS3 benchmarks validate the efficacy of our approach, with word error rates (WERs) reduced to 1.8% and 1.5%, respectively. These results demonstrate its state-of-the-art performance on AVSR tasks.
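
As a general illustration of why replacing softmax with ReLU can yield sparse attention, the sketch below compares the two on a single row of attention scores. It is not the GSA module described above, whose details are not given in this abstract; the function names, the example scores, and the sum normalisation of the ReLU weights are all assumptions made for illustration.

```python
import math


def softmax_weights(scores):
    """Standard softmax: every weight is strictly positive (dense)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu_weights(scores):
    """ReLU in place of softmax: negative scores become exactly zero.

    Dividing by the sum is an assumption made here so the two outputs
    are comparable; it is not taken from the paper.
    """
    pos = [max(0.0, s) for s in scores]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else pos


# Hypothetical attention scores for one query against four keys.
scores = [2.0, -1.0, 0.5, -3.0]
dense = softmax_weights(scores)
sparse = relu_weights(scores)

print("softmax:", dense)
print("relu:   ", sparse)
```

With softmax, every key receives some nonzero weight, whereas the ReLU variant assigns exactly zero to keys with negative scores, so those positions can be skipped.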

Keywords

Audio-visual speech recognition; conformer; CTC; sparse attention