UniModal-LSR: A Unified Multimodal Framework for Joint Lip Reading and Sign Language Recognition in Video Sequences

Vinh Truong Hoang^*, Nghia Dinh, Luu Quang Phuong, Kiet Tran-Trung, Ha Duong Thi Hong, Bay Nguyen Van, Hau Nguyen Trung, Thien Ho Huong
AI Lab, Faculty of Information Technology, Ho Chi Minh City Open University, 35–37 Ho Hao Hon Street, Co Giang Ward, District 1, Ho Chi Minh City, Vietnam
* Corresponding Author: Vinh Truong Hoang
(This article belongs to the Special Issue: Attention Mechanism-based Complex System Pattern Intelligent Recognition and Accurate Prediction)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.078743

Received 07 January 2026; Accepted 16 March 2026; Published online 09 May 2026

Abstract

Visual speech recognition, understood broadly, is a central problem in computer vision that encompasses both lip reading and sign language recognition. Although substantial progress has been made on each task independently, their complementary characteristics have rarely been explored jointly. In this work, we propose UniModal-LSR (Unified Multimodal Lip and Sign Recognition), a deep learning framework that addresses lip reading and sign language recognition within a single multimodal architecture. By exploiting properties shared by visual communication channels, namely temporal dynamics, spatial articulation structure, and contextual dependencies, the model enables bidirectional knowledge transfer between modalities. The framework has three main components. A Hierarchical Temporal-Spatial Encoder captures multi-scale temporal patterns by combining local convolutions with global self-attention. A Cross-Modal Attention Fusion module performs dynamic, context-aware information exchange through bidirectional cross-attention and adaptive gating. A Contrastive Semantic Alignment loss enforces semantic consistency across modality-specific representations. The architecture as a whole integrates three-dimensional convolutional neural networks for spatiotemporal feature extraction with graph neural networks for explicit hand-pose modeling. Extensive experiments on several public benchmarks show that UniModal-LSR improves on recent methods. The model attains a Word Error Rate (WER) of 33.2% on LRS2-BBC, a 12.4% relative improvement, and a WER of 18.3% on PHOENIX-2014, a 13.7% relative improvement. Moreover, the unified model reduces the parameter count by 25.9% relative to two separate task-specific systems. These results indicate that unified multimodal modeling can improve visual speech recognition performance and may support future communication technologies.
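The abstract reports results as Word Error Rate and as relative improvements, without describing the scoring procedure. As a reading aid, the sketch below shows the standard definition of WER (word-level edit distance divided by the number of reference words) and one common way to compute a relative reduction between two error rates. The function names `word_error_rate` and `relative_reduction` are hypothetical, the example sentences and rates are illustrative values rather than results from the paper, and the authors' evaluation tooling may differ in details such as text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Fraction by which new_rate is lower than baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Illustrative values only (assumptions, not taken from the paper).
toy_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
toy_gain = relative_reduction(0.40, 0.30)
```

In this toy case the hypothesis has one substituted word and one missing word against a six-word reference, so `toy_wer` is 2/6. Note that WER can exceed 1.0 when the hypothesis contains many inserted words, and that a relative reduction depends on which baseline it is measured against.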

Keywords

Multimodal learning; lip reading; visual speech recognition; deep learning; sign language recognition; cross-modal attention




