RSG-Conformer: ReLU-Based Sparse and Grouped Conformer for Audio-Visual Speech Recognition

Yewei Xiao; Xin Du; Wei Zeng

doi:10.32604/cmc.2025.072145

Open Access

ARTICLE

Yewei Xiao, Xin Du*, Wei Zeng

Institute of Automation and Electronic Information, Xiangtan University, Xiangtan, 411105, China

* Corresponding Author: Xin Du. Email: email

Computers, Materials & Continua 2026, 86(3), 55 https://doi.org/10.32604/cmc.2025.072145

Received 20 August 2025; Accepted 27 October 2025; Issue published 12 January 2026

Abstract

Audio-visual speech recognition (AVSR), which integrates audio and visual modalities to improve recognition performance and robustness in noisy or adverse acoustic conditions, has attracted significant research interest. However, Conformer-based architectures remain computationally expensive because the time and memory complexity of their softmax-based attention mechanisms grows quadratically with sequence length. In addition, these architectures may not provide sufficient flexibility for modeling local dependencies at different granularities. To mitigate these limitations, this study introduces a novel AVSR framework based on a ReLU-based Sparse and Grouped Conformer (RSG-Conformer) architecture. Specifically, we propose a Global-enhanced Sparse Attention (GSA) module that incorporates an efficient context restoration block to recover lost contextual cues. Concurrently, a Grouped-scale Convolution (GSC) module replaces the standard Conformer convolution module, providing adaptive local modeling across varying temporal resolutions. Furthermore, we integrate a Refined Intermediate Contextual CTC (RIC-CTC) supervision strategy, which applies progressively increasing loss weights combined with convolution-based context aggregation, thereby further relaxing the conditional independence assumption inherent in standard connectionist temporal classification (CTC) frameworks. Evaluations on the LRS2 and LRS3 benchmarks validate the efficacy of our approach, with word error rates (WERs) reduced to 1.8% and 1.5%, respectively. These results demonstrate state-of-the-art performance on AVSR tasks.
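
As a rough illustration of why a ReLU-based attention activation yields sparse weights where softmax does not, the following minimal sketch compares the two on toy inputs. It is not the authors' GSA module: the function names, the toy queries and keys, the scaled dot-product scoring, and the absence of any normalization or context restoration step are all assumptions made for illustration.

```python
import math

def softmax(row):
    """Standard softmax: every output is strictly positive."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def relu(row):
    """ReLU: non-positive scores become exactly zero."""
    return [max(0.0, x) for x in row]

def attention_weights(queries, keys, activation):
    """Score every query against every key, then apply `activation` per row.

    Hypothetical helper for illustration; the full T x T score matrix is
    still computed here, which is the quadratic cost the abstract refers to.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(activation(scores))
    return weights

def sparsity(weights):
    """Fraction of attention weights that are exactly zero."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)

# Toy inputs (assumed, not from the paper): 3 queries, 4 keys, dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
keys = [[1.0, 0.0], [-1.0, 0.0], [0.0, -1.0], [0.0, 1.0]]

soft = attention_weights(queries, keys, softmax)
rel = attention_weights(queries, keys, relu)

print("softmax sparsity:", sparsity(soft))
print("ReLU sparsity:", sparsity(rel))
```

In this sketch the softmax weights are all non-zero, while the ReLU weights zero out every key whose score is not positive. Sparsity alone does not reduce the number of scores computed here; how the paper exploits it, and how the context restoration block compensates for discarded entries, is described only at a high level in the abstract.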

Keywords

Audio-visual speech recognition; conformer; CTC; sparse attention

Cite This Article

APA Style

Xiao, Y., Du, X., Zeng, W. (2026). RSG-Conformer: ReLU-Based Sparse and Grouped Conformer for Audio-Visual Speech Recognition. Computers, Materials & Continua, 86(3), 55. https://doi.org/10.32604/cmc.2025.072145

Vancouver Style

Xiao Y, Du X, Zeng W. RSG-Conformer: ReLU-Based Sparse and Grouped Conformer for Audio-Visual Speech Recognition. Comput Mater Contin. 2026;86(3):55. https://doi.org/10.32604/cmc.2025.072145

IEEE Style

Y. Xiao, X. Du, and W. Zeng, “RSG-Conformer: ReLU-Based Sparse and Grouped Conformer for Audio-Visual Speech Recognition,” Comput. Mater. Contin., vol. 86, no. 3, pp. 55, 2026. https://doi.org/10.32604/cmc.2025.072145

Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
