Open Access
ARTICLE
UniModal-LSR: A Unified Multimodal Framework for Joint Lip Reading and Sign Language Recognition in Video Sequences
AI Lab, Faculty of Information Technology, Ho Chi Minh City Open University, 35–37 Ho Hao Hon Street, Co Giang Ward, District 1, Ho Chi Minh City, Vietnam
* Corresponding Author: Vinh Truong Hoang. Email:
(This article belongs to the Special Issue: Attention Mechanism-based Complex System Pattern Intelligent Recognition and Accurate Prediction)
Computers, Materials & Continua 2026, 88(2), 35 https://doi.org/10.32604/cmc.2026.078743
Received 07 January 2026; Accepted 16 March 2026; Issue published 15 June 2026
Abstract
Recognizing visual communication is a central problem in computer vision, encompassing both lip reading (visual speech recognition) and sign language recognition. Although substantial progress has been achieved independently on each task, their complementary characteristics have rarely been explored jointly. In this work, we propose UniModal-LSR (Unified Multimodal Lip and Sign Recognition), a novel deep learning framework that jointly addresses lip reading and sign language recognition within a single multimodal architecture. By exploiting properties shared by visual communication channels, namely temporal dynamics, spatial articulation structure, and contextual dependencies, the proposed model enables bidirectional knowledge transfer between modalities. The framework incorporates a Hierarchical Temporal-Spatial Encoder that captures multi-scale temporal patterns by combining local convolutions with global self-attention. It also includes a Cross-Modal Attention Fusion module that performs dynamic, context-aware information exchange via bidirectional cross-attention and adaptive gating. Additionally, a Contrastive Semantic Alignment loss enforces semantic consistency across modality-specific representations. Overall, the architecture integrates three-dimensional convolutional neural networks for spatiotemporal feature extraction with graph neural networks for explicit hand-pose modeling. Extensive experiments on several public benchmarks show that UniModal-LSR outperforms recent methods. The model attains a Word Error Rate (WER) of 33.2% on LRS2-BBC, a 12.4% relative improvement. On PHOENIX-2014, it achieves 18.3% WER, a 13.7% relative improvement. Moreover, the unified model reduces the parameter count by 25.9% relative to two separate task-specific systems. These results indicate that unified multimodal modeling can improve visual speech recognition performance and may support future communication technologies.
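To make the reported metric concrete, the following minimal Python sketch computes WER as the word-level edit distance divided by the number of reference words, and back-calculates the baseline WER implied by a relative improvement. It is an illustration only, not the authors' evaluation code. The function names `word_error_rate` and `implied_baseline` are hypothetical, and the sketch assumes that "relative improvement" means a relative reduction in WER with respect to a single comparison method, which the abstract does not state explicitly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def implied_baseline(wer: float, relative_reduction: float) -> float:
    """Baseline WER implied by a reported WER and a relative reduction.

    Assumption (not confirmed by the abstract):
    relative_reduction = (baseline - wer) / baseline.
    """
    if not 0.0 <= relative_reduction < 1.0:
        raise ValueError("relative_reduction must be in [0, 1)")
    return wer / (1.0 - relative_reduction)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Baselines implied by the figures quoted in the abstract.
    print(implied_baseline(33.2, 0.124))  # LRS2-BBC
    print(implied_baseline(18.3, 0.137))  # PHOENIX-2014
```

Under that assumption, the quoted figures would correspond to comparison WERs of roughly 37.9% on LRS2-BBC and 21.2% on PHOENIX-2014. These are derived values, not numbers reported in the abstract.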
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

