Open Access

ARTICLE

UniModal-LSR: A Unified Multimodal Framework for Joint Lip Reading and Sign Language Recognition in Video Sequences

Vinh Truong Hoang*, Nghia Dinh, Luu Quang Phuong, Kiet Tran-Trung, Ha Duong Thi Hong, Bay Nguyen Van, Hau Nguyen Trung, Thien Ho Huong

AI Lab, Faculty of Information Technology, Ho Chi Minh City Open University, 35–37 Ho Hao Hon Street, Co Giang Ward, District 1, Ho Chi Minh City, Vietnam

* Corresponding Author: Vinh Truong Hoang. Email: email

(This article belongs to the Special Issue: Attention Mechanism-based Complex System Pattern Intelligent Recognition and Accurate Prediction)

Computers, Materials & Continua 2026, 88(2), 35 https://doi.org/10.32604/cmc.2026.078743

Abstract

Recognizing visual communication is a central problem in computer vision, encompassing both lip reading (visual speech recognition) and sign language recognition. Although substantial progress has been made on each task independently, their complementary characteristics have rarely been explored jointly. In this work, we propose UniModal-LSR (Unified Multimodal Lip and Sign Recognition), a deep learning framework that addresses lip reading and sign language recognition within a single multimodal architecture. By exploiting properties shared by visual communication channels, namely temporal dynamics, spatial articulation structure, and contextual dependencies, the model enables bidirectional knowledge transfer between modalities. The framework has three main components. A Hierarchical Temporal-Spatial Encoder captures multi-scale temporal patterns by combining local convolutions with global self-attention. A Cross-Modal Attention Fusion module performs dynamic, context-aware information exchange through bidirectional cross-attention and adaptive gating. A Contrastive Semantic Alignment loss enforces semantic consistency across modality-specific representations. The architecture also integrates three-dimensional convolutional neural networks for spatiotemporal feature extraction with graph neural networks for explicit hand-pose modeling. Extensive experiments on several public benchmarks show that UniModal-LSR improves performance compared with recent methods. The model attains a Word Error Rate (WER) of 33.2% on LRS2-BBC, a 12.4% relative gain, and a WER of 18.3% on PHOENIX-2014, a 13.7% relative gain. Moreover, the unified model reduces the parameter count by 25.9% relative to two separate task-specific systems. These results indicate that unified multimodal modeling can improve visual speech recognition performance and may support future communication technologies.
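For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference, divided by the number of reference words. The following is a minimal Python sketch of that standard definition, not the authors' evaluation code. The helper `relative_reduction` assumes that the "relative gain" quoted above means a relative reduction in WER with respect to a baseline; that reading, the function names, and the example sentences and numbers are all illustrative assumptions rather than values from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Assumed meaning of 'relative gain': fraction of baseline WER removed."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one substitution and one deletion over six words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Hypothetical baseline and improved WER values, not taken from the paper.
example_gain = relative_reduction(0.40, 0.30)
```

Because lower WER is better, a relative gain under this reading is a positive fraction whenever the new system's WER is below the baseline's.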

Keywords

Multimodal learning; lip reading; visual speech recognition; deep learning; sign language recognition; cross-modal attention

Cite This Article

APA Style
Truong Hoang, V., Dinh, N., Quang Phuong, L., Tran-Trung, K., Duong Thi Hong, H. et al. (2026). UniModal-LSR: A Unified Multimodal Framework for Joint Lip Reading and Sign Language Recognition in Video Sequences. Computers, Materials & Continua, 88(2), 35. https://doi.org/10.32604/cmc.2026.078743
Vancouver Style
Truong Hoang V, Dinh N, Quang Phuong L, Tran-Trung K, Duong Thi Hong H, Nguyen Van B, et al. UniModal-LSR: A Unified Multimodal Framework for Joint Lip Reading and Sign Language Recognition in Video Sequences. Comput Mater Contin. 2026;88(2):35. https://doi.org/10.32604/cmc.2026.078743
IEEE Style
V. Truong Hoang et al., “UniModal-LSR: A Unified Multimodal Framework for Joint Lip Reading and Sign Language Recognition in Video Sequences,” Comput. Mater. Contin., vol. 88, no. 2, pp. 35, 2026. https://doi.org/10.32604/cmc.2026.078743



Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.