Open Access

ARTICLE

RSG-Conformer: ReLU-Based Sparse and Grouped Conformer for Audio-Visual Speech Recognition

Yewei Xiao, Xin Du*, Wei Zeng

Institute of Automation and Electronic Information, Xiangtan University, Xiangtan, 411105, China

* Corresponding Author: Xin Du. Email: email

Computers, Materials & Continua 2026, 86(3), 55 https://doi.org/10.32604/cmc.2025.072145

Abstract

Audio-visual speech recognition (AVSR), which integrates audio and visual modalities to improve recognition performance and robustness in noisy or adverse acoustic conditions, has attracted significant research interest. However, Conformer-based architectures remain computationally expensive because the time and memory complexity of their softmax-based attention mechanisms grows quadratically with sequence length. In addition, Conformer-based architectures may not provide sufficient flexibility for modeling local dependencies at different granularities. To mitigate these limitations, this study introduces a novel AVSR framework based on a ReLU-based Sparse and Grouped Conformer (RSG-Conformer) architecture. Specifically, we propose a Global-enhanced Sparse Attention (GSA) module incorporating an efficient context restoration block to recover lost contextual cues. Concurrently, a Grouped-scale Convolution (GSC) module replaces the standard Conformer convolution module, providing adaptive local modeling across varying temporal resolutions. Furthermore, we integrate a Refined Intermediate Contextual CTC (RIC-CTC) supervision strategy. This approach applies progressively increasing loss weights combined with convolution-based context aggregation, thereby further relaxing the conditional independence assumption inherent in standard CTC frameworks. Evaluations on the LRS2 and LRS3 benchmarks validate the efficacy of our approach, with word error rates (WERs) reduced to 1.8% and 1.5%, respectively. These results demonstrate its state-of-the-art performance in AVSR tasks.
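
To make the idea of ReLU-based sparse attention concrete, the following is a minimal sketch, not the paper's GSA module. It contrasts softmax attention weights with weights obtained by applying ReLU to the same scaled dot-product scores for a single query. The function names, the toy vectors, and the division by sequence length in `relu_weights` are illustrative assumptions. The sketch shows only the sparsity effect: it still computes every score, so it does not by itself reduce the quadratic cost.

```python
import math

def softmax_weights(scores):
    """Standard softmax: every position receives a strictly positive weight."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_weights(scores):
    """ReLU in place of softmax: negative scores become exactly zero.

    Dividing by the sequence length is an illustrative assumption to keep
    the output scale bounded; the paper may normalize differently.
    """
    n = len(scores)
    return [max(0.0, s) / n for s in scores]

def attend(query, keys, values, weight_fn):
    """Single-query scaled dot-product attention with a pluggable weight function."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = weight_fn(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Toy inputs (hypothetical): one query attending over four key/value pairs.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.5], [-0.5, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]

soft_w, soft_out = attend(query, keys, values, softmax_weights)
relu_w, relu_out = attend(query, keys, values, relu_weights)

print("softmax weights:", [round(w, 3) for w in soft_w])
print("ReLU weights:   ", [round(w, 3) for w in relu_w])
```

In this toy case the two keys that point away from the query receive exactly zero weight under ReLU, whereas softmax assigns them small positive weights. Context discarded in this way is the kind of information a restoration block would need to recover.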

Keywords

Audio-visual speech recognition; conformer; CTC; sparse attention

Cite This Article

APA Style
Xiao, Y., Du, X., Zeng, W. (2026). RSG-Conformer: ReLU-Based Sparse and Grouped Conformer for Audio-Visual Speech Recognition. Computers, Materials & Continua, 86(3), 55. https://doi.org/10.32604/cmc.2025.072145
Vancouver Style
Xiao Y, Du X, Zeng W. RSG-Conformer: ReLU-Based Sparse and Grouped Conformer for Audio-Visual Speech Recognition. Comput Mater Contin. 2026;86(3):55. https://doi.org/10.32604/cmc.2025.072145
IEEE Style
Y. Xiao, X. Du, and W. Zeng, “RSG-Conformer: ReLU-Based Sparse and Grouped Conformer for Audio-Visual Speech Recognition,” Comput. Mater. Contin., vol. 86, no. 3, pp. 55, 2026. https://doi.org/10.32604/cmc.2025.072145



Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.