UniModal-LSR: A Unified Multimodal Framework for Joint Lip Reading and Sign Language Recognition in Video Sequences

Vinh Hoang; Nghia Dinh; Luu Phuong; Kiet Tran-Trung; Ha Duong; Bay Van; Hau Trung; Thien Huong

doi:10.32604/cmc.2026.078743

Open Access

ARTICLE

Vinh Truong Hoang^*, Nghia Dinh, Luu Quang Phuong, Kiet Tran-Trung, Ha Duong Thi Hong, Bay Nguyen Van, Hau Nguyen Trung, Thien Ho Huong

AI Lab, Faculty of Information Technology, Ho Chi Minh City Open University, 35–37 Ho Hao Hon Street, Co Giang Ward, District 1, Ho Chi Minh City, Vietnam

* Corresponding Author: Vinh Truong Hoang.

Computers, Materials & Continua 2026, 88(2), 35 https://doi.org/10.32604/cmc.2026.078743

Received 07 January 2026; Accepted 16 March 2026; Issue published 15 June 2026

Abstract

Recognizing visual communication is a central problem in computer vision, encompassing both lip reading (visual speech recognition) and sign language recognition. Although substantial progress has been achieved on each task independently, their complementary characteristics have rarely been explored jointly. In this work, we propose UniModal-LSR (Unified Multimodal Lip and Sign Recognition), a deep learning framework that addresses lip reading and sign language recognition jointly within a single multimodal architecture. By exploiting properties shared by visual communication channels, namely temporal dynamics, spatial articulation structure, and contextual dependencies, the proposed model enables bidirectional knowledge transfer between modalities. The framework incorporates a Hierarchical Temporal-Spatial Encoder that captures multi-scale temporal patterns by combining local convolutions with global self-attention. It also includes a Cross-Modal Attention Fusion module that performs dynamic, context-aware information exchange via bidirectional cross-attention and adaptive gating. Additionally, a Contrastive Semantic Alignment loss enforces semantic consistency across modality-specific representations. Overall, the architecture integrates three-dimensional convolutional neural networks for spatiotemporal feature extraction with graph neural networks for explicit hand-pose modeling. Extensive experiments on several public benchmarks show that UniModal-LSR improves performance compared with recent methods. The model attains a Word Error Rate (WER) of 33.2% on LRS2-BBC, representing a 12.4% relative gain. On PHOENIX-2014, it achieves 18.3% WER, a 13.7% relative gain. Moreover, the unified model reduces parameter count by 25.9% relative to two separate task-specific systems. These results indicate that unified multimodal modeling can improve recognition performance on both tasks and may support future communication technologies.
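
As a reading aid for the reported numbers, the following minimal sketch shows how a word error rate is commonly computed (word-level edit distance divided by the number of reference words) and what baseline WER a relative gain would imply. The abstract does not state the baseline figures or define "relative gain"; the sketch assumes it means a relative reduction in WER against a baseline. The function names `word_error_rate` and `implied_baseline` are illustrative and do not come from the article.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def implied_baseline(wer: float, relative_gain: float) -> float:
    """Baseline WER implied by a reported WER and a relative reduction.

    Assumption (not stated in the abstract):
    relative_gain = (baseline - wer) / baseline.
    """
    return wer / (1.0 - relative_gain)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("the cat sat down", "the dog sat"))  # 0.5
    # Baselines implied by the reported figures under the assumption above.
    print(round(implied_baseline(33.2, 0.124), 1))  # 37.9 (LRS2-BBC)
    print(round(implied_baseline(18.3, 0.137), 1))  # 21.2 (PHOENIX-2014)
```

Under this reading, the reported results would correspond to baselines of roughly 37.9% WER on LRS2-BBC and 21.2% WER on PHOENIX-2014; these are derived values, not figures quoted from the article.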

Keywords

Multimodal learning; lip reading; visual speech recognition; deep learning; sign language recognition; cross-modal attention

Cite This Article

APA Style

Truong Hoang, V., Dinh, N., Quang Phuong, L., Tran-Trung, K., Duong Thi Hong, H. et al. (2026). UniModal-LSR: A Unified Multimodal Framework for Joint Lip Reading and Sign Language Recognition in Video Sequences. Computers, Materials & Continua, 88(2), 35. https://doi.org/10.32604/cmc.2026.078743

Vancouver Style

Truong Hoang V, Dinh N, Quang Phuong L, Tran-Trung K, Duong Thi Hong H, Nguyen Van B, et al. UniModal-LSR: A Unified Multimodal Framework for Joint Lip Reading and Sign Language Recognition in Video Sequences. Comput Mater Contin. 2026;88(2):35. https://doi.org/10.32604/cmc.2026.078743

IEEE Style

V. Truong Hoang et al., “UniModal-LSR: A Unified Multimodal Framework for Joint Lip Reading and Sign Language Recognition in Video Sequences,” Comput. Mater. Contin., vol. 88, no. 2, p. 35, 2026. https://doi.org/10.32604/cmc.2026.078743

Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
