UniModal-LSR: A Unified Multimodal Framework for Joint Lip Reading and Sign Language Recognition in Video Sequences

Vinh Hoang; Nghia Dinh; Luu Phuong; Kiet Tran-Trung; Ha Duong; Bay Van; Hau Trung; Thien Huong

doi:10.32604/cmc.2026.078743

Open Access

ARTICLE

Vinh Truong Hoang^*, Nghia Dinh, Luu Quang Phuong, Kiet Tran-Trung, Ha Duong Thi Hong, Bay Nguyen Van, Hau Nguyen Trung, Thien Ho Huong

AI Lab, Faculty of Information Technology, Ho Chi Minh City Open University, 35–37 Ho Hao Hon Street, Co Giang Ward, District 1, Ho Chi Minh City, Vietnam

* Corresponding Author: Vinh Truong Hoang. Email: email

Computers, Materials & Continua 2026, 88(2), 35 https://doi.org/10.32604/cmc.2026.078743

Received 07 January 2026; Accepted 16 March 2026; Issue published 15 June 2026

Abstract

Visual speech recognition is a central problem in computer vision, encompassing both lip reading (visual speech recognition) and sign language recognition. Although substantial progress has been achieved independently on each task, their complementary characteristics have rarely been explored jointly. In this work we propose UniModal-LSR (Unified Multimodal Lip and Sign Recognition), a novel deep learning framework that jointly addresses lip reading and sign language recognition within a single multimodal architecture. By exploiting shared properties of visual communication channels, namely temporal dynamics, spatial articulation structure, and contextual dependencies, the proposed model enables bidirectional transfer of knowledge between modalities. The framework incorporates a Hierarchical Temporal-Spatial Encoder that captures multi-scale temporal patterns through the combination of local convolutions and global self-attention. It also includes a Cross-Modal Attention Fusion module that performs dynamic, context-aware information exchange via bidirectional cross-attention and adaptive gating. Additionally, a Contrastive Semantic Alignment loss enforces semantic consistency across modality-specific representations. Overall, the architecture integrates three-dimensional convolutional neural networks for spatiotemporal feature extraction with graph neural networks for explicit hand-pose modeling. Extensive experiments on several public benchmarks show that UniModal-LSR improves performance compared with recent methods. The model attains a Word Error Rate (WER) of 33.2% on LRS2-BBC, representing a 12.4% relative gain. On PHOENIX-2014, it achieves 18.3% WER, a 13.7% relative gain. Moreover, the unified model reduces parameter count by 25.9% relative to two separate task-specific systems. These results indicate that unified multimodal modeling can improve visual speech recognition performance and may support future communication technologies.

Keywords

Multimodal learning; lip reading; visual speech recognition; deep learning; sign language recognition; cross-modal attention

1 Introduction

Visual speech recognition is an important research area in human-computer interaction, encompassing the complementary tasks of lip reading and sign language recognition. Lip reading, often referred to as visual speech recognition (VSR), involves decoding spoken language from visible articulatory movements of the lips, teeth, and tongue. Sign language recognition (SLR), in contrast, interprets the semantically rich combination of manual gestures, non-manual markers, and spatial grammar that characterize natural sign languages [1]. Although these tasks appear methodologically distinct, they share core computational properties that motivate a unified treatment.

The alignment between lip reading and SLR arises from deep structural similarities in how visual language is conveyed. Both modalities operate at comparable frame rates, typically 25–30 frames per second, and require models capable of handling variable-length sequences that may span hundreds of frames [2]. Both depend on fine-grained spatial control of biological articulators, whether the orofacial musculature or the hands and arms. Furthermore, both exhibit strong contextual dependence, as isolated visual patterns are often ambiguous without their temporal and linguistic surroundings, leading to viseme-level confusion in lip reading and coarticulation effects in SLR [3]. These similarities suggest that representations learned from one modality could benefit the other.

In many real-world contexts, both modalities co-occur. Proficient signers often mouth spoken words while signing, providing complementary linguistic cues [4]. Existing systems that process these channels independently disregard this inherent multimodal redundancy, leaving major gains untapped.

Current visual speech recognition techniques face several limitations that hinder practical deployment. Most studies treat lip reading and sign language recognition as separate tasks, resulting in distinct architectures, training pipelines, and evaluation protocols that limit cross-task knowledge transfer and increase computational overhead [5]. Many approaches rely on recurrent models such as LSTMs [6] to capture temporal dependencies, but these models struggle with the long-range dependencies present in continuous sign language sequences and restrict parallelization during training [7]. In addition, conventional CNN-based spatial representations often fail to adequately capture the structural relationships of hand configurations, while multimodal fusion strategies typically rely on simple operations such as concatenation or averaging, which cannot effectively model the dynamic, context-dependent interactions between lip and sign cues.

This work addresses these limitations through the following contributions. We introduce UniModal-LSR, a unified architecture that simultaneously handles lip reading and sign language recognition, enabling bidirectional knowledge transfer and shared representations across modalities. The unified design also reduces the parameter count by 25.9% relative to two separate task-specific systems and improves generalization. To capture temporal patterns at multiple scales, we design a Hierarchical Temporal-Spatial Encoder (HTSE) that combines 3D convolutions for local spatiotemporal feature extraction with multi-head self-attention for modeling long-range dependencies; arranging these modules hierarchically facilitates the processing of long continuous visual speech sequences. We further propose a Cross-Modal Attention Fusion (CMAF) module with bidirectional cross-attention and adaptive gating to dynamically integrate lip and sign cues. A Contrastive Semantic Alignment (CSA) objective aligns modality-specific embeddings within a shared space to enhance semantic consistency and reduce modality-specific noise. Additionally, a Spatial-Temporal Graph Convolutional Network (ST-GCN) models hand skeletal structures to capture joint topology and better distinguish visually similar signs. Extensive experiments on multiple benchmarks demonstrate state-of-the-art performance, and comprehensive ablation studies quantify the contribution of each component; Table 1 further clarifies the novelty of the proposed framework.

The remainder of the paper is organized as follows. Section 2 situates our approach within the existing literature. Section 3 details the proposed framework. Section 4 presents an architectural and representational analysis. Section 5 describes the experimental setup. Section 6 reports empirical results. Section 7 discusses implications and limitations. Section 8 concludes the paper.

2 Related Work

2.1 Lip Reading and Visual Speech Recognition

2.1.1 Classical Approaches

Early lip reading methods relied on hand-crafted visual descriptors coupled with statistical sequence models. Potamianos et al. [11] employed discrete cosine transform (DCT) coefficients to encode mouth appearance. Matthews et al. [12] used active appearance models (AAMs) to jointly model shape and texture. These visual representations were typically paired with hidden Markov models (HMMs) for temporal modeling [13]. The recognition task was formally expressed as maximum a posteriori (MAP) inference:

$$\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)}, \tag{1}$$

where $O = (o_1, \ldots, o_T)$ denotes the visual frame sequence with each $o_t \in \mathbb{R}^{H \times W \times 3}$, and $W = (w_1, w_2, \ldots, w_N)$ is a hypothesized word sequence within vocabulary $\mathcal{V}_{\text{lip}}$. Although pioneering, these systems suffered from limited expressive power due to the restrictive nature of hand-crafted features.

2.1.2 Deep Learning Approaches

The advent of deep learning has dramatically advanced lip reading. LipNet [2] introduced an effective end-to-end architecture for sentence-level visual speech recognition, employing spatiotemporal convolutions together with connectionist temporal classification (CTC) to reach 93.4% accuracy on the GRID benchmark. The Watch, Listen, Attend and Spell (WLAS) model [14] incorporated attention mechanisms:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}, \qquad c_t = \sum_{i=1}^{T} \alpha_{t,i}\, h_i, \tag{2}$$

allowing the decoder to focus adaptively on relevant encoder states during generation.
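As a minimal sketch of Eq. (2), the following pure-Python function converts alignment scores into softmax weights and a context vector. The function name `attention_context` and the toy inputs are illustrative assumptions, not part of the WLAS implementation.

```python
import math

def attention_context(scores, encoder_states):
    """Return (weights, context) for one decoder step.

    scores: list of alignment scores e_{t,i}, one per encoder state.
    encoder_states: list of encoder vectors h_i (lists of floats).
    """
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # alpha_{t,i}
    dim = len(encoder_states[0])
    # c_t = sum_i alpha_{t,i} * h_i
    context = [
        sum(w * h[j] for w, h in zip(weights, encoder_states))
        for j in range(dim)
    ]
    return weights, context

# Hypothetical toy input: two encoder states with equal scores.
weights, context = attention_context([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With equal scores the weights are uniform, so the context vector is the plain average of the encoder states.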

Subsequent transformer-based approaches have set new performance benchmarks by exploiting self-attention to model long-range temporal relationships [15]. Ma et al. [16] demonstrated that purely visual models can rival audio-visual systems when trained on sufficiently large data. Self-supervised pre-training strategies [17] have emerged as a means of reducing labeled data requirements. Ma et al. [18] introduced Auto-AVSR, which leverages automatic labels for audio-visual speech recognition, demonstrating the effectiveness of weakly supervised learning approaches. Shukla et al. [19] investigated whether visual self-supervision improves speech representations for emotion recognition, providing insights into cross-task transfer learning. Recent work by Prajwal et al. [20] achieved significant improvements through sub-word modeling.

2.2 Sign Language Recognition

Isolated sign language recognition (ISLR) focuses on classifying pre-segmented signs. The Inflated 3D ConvNet (I3D) [8] extended successful 2D CNN designs into the temporal dimension via filter inflation:

$$W_{3D}(t, h, w) = \frac{1}{N} \sum_{n=1}^{N} W_{2D}(h, w) \cdot \delta_{t,n}, \tag{3}$$

preserving spatial semantics while averaging over the temporal axis.
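The inflation in Eq. (3) can be illustrated with a short sketch, assuming a kernel stored as nested Python lists. The helper name `inflate_2d_kernel` is hypothetical; each of the $N$ temporal slices receives the 2D kernel scaled by $1/N$, so the response to a temporally constant input matches the original 2D filter.

```python
def inflate_2d_kernel(kernel_2d, n_frames):
    """Repeat a 2D kernel along time and rescale each slice by 1/N."""
    return [
        [[w / n_frames for w in row] for row in kernel_2d]
        for _ in range(n_frames)
    ]

# Hypothetical 2x2 kernel inflated to 4 temporal slices.
kernel_2d = [[1.0, 2.0], [3.0, 4.0]]
kernel_3d = inflate_2d_kernel(kernel_2d, 4)

# Summing the slices over time recovers the original 2D kernel.
recovered = [
    [sum(kernel_3d[t][h][w] for t in range(4)) for w in range(2)]
    for h in range(2)
]
```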

Skeleton-based methods exploit hand-joint coordinates to model structural relationships directly. Attention-driven approaches such as SignBERT [21] illustrate the benefits of large-scale pre-training on sign language corpora.

Continuous sign language recognition (CSLR) jointly performs segmentation and classification in unsegmented video streams. The Connectionist Temporal Classification (CTC) loss [22] is widely employed:

$$\mathcal{L}_{\text{CTC}} = -\log P(Y \mid X) = -\log \sum_{\pi \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} P(\pi_t \mid X), \tag{4}$$

enabling alignment-free training.
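To make the sum over alignments in Eq. (4) concrete, the sketch below evaluates the CTC loss by brute-force enumeration of every path $\pi$. This is only feasible for toy inputs and is an illustration, not how practical systems compute it (they use a dynamic-programming forward pass). The blank index of 0 and the function names are assumptions.

```python
import itertools
import math

BLANK = 0  # assumed index of the CTC blank symbol

def collapse(path):
    """CTC mapping B: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)

def ctc_loss_bruteforce(probs, target):
    """probs[t][c] = P(pi_t = c | X); target is a sequence of label ids."""
    n_steps, n_classes = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(n_classes), repeat=n_steps):
        if collapse(path) == tuple(target):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return -math.log(total)

# Two frames, uniform over {blank, label 1}. Paths (1,1), (1,0), (0,1)
# all collapse to the target [1], so P(Y|X) = 3 * 0.25 = 0.75.
loss = ctc_loss_bruteforce([[0.5, 0.5], [0.5, 0.5]], [1])
```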

2.3 Multimodal Learning and Fusion

Multimodal fusion is typically categorized into early (feature-level), late (decision-level), and intermediate (shared representation) strategies. Early fusion preserves detailed cross-modal interactions but yields high-dimensional feature spaces. Late fusion maintains modality-specific pipelines but limits interaction depth. Intermediate fusion balances both aspects.

Attention-based fusion has become prevalent for modeling dynamic relationships. Hierarchical co-attention [23] and multimodal transformers [24] exemplify recent progress. Contrastive learning techniques such as CLIP [25] further illustrate the efficacy of aligning multimodal representations.

Especially relevant to this study, Ge et al. [26] proposed an audio-text multimodal framework for speech recognition in air traffic control communications, showing that unified multimodal modeling can improve recognition accuracy through coordinated processing of audio and text. Similarly, Li et al. [27] developed an end-to-end audio-visual system for multi-channel speech separation, dereverberation, and recognition, while Wang et al. [28] introduced DCIM-AVSR, an efficient audio-visual speech recognition model using dual conformer interaction modules. However, these studies focus on audio-visual fusion. In contrast, our work extends the unified architecture paradigm to visual-visual modality fusion, addressing the integration of lip and sign information. The proposed CMAF module enables bidirectional cross-attention with adaptive gating, making it well suited for the temporal synchronization required in visual speech modalities.

3 Proposed Methodology

This section describes the UniModal-LSR architecture in detail. An overview is provided in Fig. 1.

Figure 1: System architecture of UniModal-LSR. The framework processes video through parallel lip and sign encoding streams, applies hierarchical temporal-spatial encoding, performs cross-modal attention fusion, and generates task-specific outputs via a shared transformer decoder.

3.1 Problem Formulation

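Given an input video $V = \{I_1, I_2, \ldots, I_T\}$, where $T$ denotes the number of frames and each frame $I_t \in \mathbb{R}^{H \times W \times 3}$, the objective includes lip reading, sign language recognition, and unified representation learning. The overall training loss combines task-specific objectives with cross-modal regularization:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{lip}} \mathcal{L}_{\text{lip}} + \lambda_{\text{sign}} \mathcal{L}_{\text{sign}} + \lambda_{\text{CSA}} \mathcal{L}_{\text{CSA}} + \lambda_{\text{reg}} \mathcal{L}_{\text{reg}}, \tag{5}$$

where $\mathcal{L}_{\text{lip}}$ and $\mathcal{L}_{\text{sign}}$ denote task-specific losses (CTC and cross-entropy), $\mathcal{L}_{\text{CSA}}$ is the contrastive semantic alignment loss defined in Eq. (17), and $\mathcal{L}_{\text{reg}}$ is an $\ell_2$ weight-decay term. The non-negative scalars $\lambda_{*}$ balance each term’s contribution.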

The CSA loss requires aligned lip-sign pairs that share semantic content. We construct such pairs from two sources. The first source is the How2Sign dataset, which provides natural co-occurrence of lip and sign modalities for the same utterances, enabling direct pairing without additional processing. The second source involves synthetic alignment of LRS2/LRS3 lip sequences with PHOENIX-2014 sign sequences when they share identical or semantically equivalent transcriptions, as determined by text matching with a minimum overlap threshold of 80%. For batches containing samples from only one modality, such as lip-only data from LRS2, the CSA loss is computed only over the subset of aligned pairs present in that batch. When no aligned pairs exist in a batch, the CSA term is set to zero for that iteration, and the model optimizes only the task-specific losses.
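The pairing rule described above can be sketched as follows. The paper does not specify how the 80% overlap is measured, so this example assumes a token-level Jaccard overlap between transcriptions; the function names and sample sentences are likewise hypothetical.

```python
def token_overlap(text_a, text_b):
    """Jaccard overlap of word sets (an assumed overlap measure)."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def aligned_pairs(lip_texts, sign_texts, threshold=0.8):
    """Return (lip_index, sign_index) pairs whose overlap meets the threshold."""
    pairs = []
    for i, lip_text in enumerate(lip_texts):
        for j, sign_text in enumerate(sign_texts):
            if token_overlap(lip_text, sign_text) >= threshold:
                pairs.append((i, j))
    return pairs

def csa_term(pairs, csa_loss_fn):
    """Evaluate the CSA loss on aligned pairs, or 0 when the batch has none."""
    return csa_loss_fn(pairs) if pairs else 0.0

# Hypothetical batch: only the first transcriptions match.
lips = ["the weather is cold today", "good morning"]
signs = ["the weather is cold today", "rain tomorrow"]
pairs = aligned_pairs(lips, signs)
```

A batch that yields an empty `pairs` list contributes nothing to the CSA term, matching the fallback to task-specific losses described in the text.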

3.2 Preprocessing and Region Extraction

Accurate face detection is essential for effective lip reading. RetinaFace yields facial landmarks $L = \{l_1, \ldots, l_{68}\}$ under the standard 68-point annotation protocol. We obtain the lip patch $R_t^{\text{lip}} \in \mathbb{R}^{H_l \times W_l \times 3}$ through affine alignment:

$$R_t^{\text{lip}} = \mathcal{T}_{\text{affine}}(I_t, L_{48:67}), \tag{6}$$

where $L_{48:67}$ correspond to the outer and inner lip contour landmarks. The patch is normalized to $H_l = W_l = 88$ pixels.

For SLR, hand pose estimation provides structured motion cues. MediaPipe Hands [29] predicts 21 three-dimensional landmarks per hand. The skeleton is modeled as a spatio-temporal graph $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{v_1, \ldots, v_{42}\}$ comprises landmarks of both hands and $\mathcal{E}$ encodes anatomical connectivity. Each node $v_i$ at time $t$ has feature vector:

$$v_i^{(t)} = \left[x_i^{(t)}, y_i^{(t)}, z_i^{(t)}, c_i^{(t)}\right]^{\top} \in \mathbb{R}^{4}, \tag{7}$$

with $(x_i^{(t)}, y_i^{(t)}, z_i^{(t)})$ denoting normalized coordinates and $c_i^{(t)}$ the confidence score.
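A minimal sketch of the 42-node hand graph is given below. It assumes the usual 21-landmark layout per hand (index 0 for the wrist, then four joints per finger) and a simplified tree connectivity in which the wrist links to each finger base; the exact edge set used in the paper is not stated, so `hand_edges` is an illustrative assumption.

```python
def hand_edges(offset=0):
    """Edges for one 21-joint hand: wrist to each finger base, then a chain."""
    edges = []
    for base in (1, 5, 9, 13, 17):  # first joint of each finger
        edges.append((offset, offset + base))
        for j in range(base, base + 3):
            edges.append((offset + j, offset + j + 1))
    return edges

def two_hand_adjacency():
    """Symmetric 42x42 adjacency matrix covering both hands."""
    n = 42
    adj = [[0] * n for _ in range(n)]
    for u, v in hand_edges(0) + hand_edges(21):
        adj[u][v] = 1
        adj[v][u] = 1
    return adj

def node_feature(x, y, z, confidence):
    """Per-node feature of Eq. (7): normalized coordinates plus confidence."""
    return [x, y, z, confidence]

adj = two_hand_adjacency()
```

Under these assumptions each hand contributes 20 undirected edges, and no edge connects the two hands.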

3.3 Hierarchical Temporal-Spatial Encoder (HTSE)

The HTSE captures multi-scale temporal dynamics through cascaded processing stages. An initial 3D ResNet-18 backbone extracts low-level spatiotemporal features:

$$F^{(0)} = \mathrm{ResNet3D}(R) \in \mathbb{R}^{T' \times d_0}, \tag{8}$$

where $T'$ accounts for temporal subsampling and $d_0 = 512$ denotes feature dimensionality. Thereafter $L$ hierarchical levels successively apply local temporal convolutions, global self-attention, feed-forward networks, and temporal pooling. This yields progressively coarser temporal resolutions while preserving fine-grained information. A Feature Pyramid Network style aggregation fuses multi-scale outputs:

$$F_{\text{agg}} = \sum_{l=1}^{L} \alpha_l \cdot \mathrm{Upsample}\big(F^{(l)}, T'\big), \tag{9}$$

where learned coefficients $\alpha_l$ weight each scale adaptively.
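The multi-scale aggregation of Eq. (9) can be sketched in plain Python. The upsampling method is not specified in the text, so nearest-neighbor repetition is assumed here; the function names and the toy feature values are hypothetical.

```python
def upsample_nearest(seq, target_len):
    """Stretch a sequence of feature vectors to target_len by repetition."""
    n = len(seq)
    return [seq[min(i * n // target_len, n - 1)] for i in range(target_len)]

def aggregate_levels(levels, alphas, target_len):
    """Weighted sum of per-level features after upsampling to target_len."""
    dim = len(levels[0][0])
    out = [[0.0] * dim for _ in range(target_len)]
    for feats, alpha in zip(levels, alphas):
        upsampled = upsample_nearest(feats, target_len)
        for t in range(target_len):
            for j in range(dim):
                out[t][j] += alpha * upsampled[t][j]
    return out

# Two hypothetical levels: 4 steps and 2 steps of 1-dimensional features.
levels = [[[1.0], [2.0], [3.0], [4.0]], [[10.0], [20.0]]]
f_agg = aggregate_levels(levels, alphas=[1.0, 0.5], target_len=4)
```

In a trained model the coefficients `alphas` would be learned; here they are fixed constants chosen only to make the arithmetic easy to follow.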

3.4 Graph-Enhanced Sign Encoding

To complement appearance-based features, a spatial-temporal graph convolutional network (ST-GCN) models hand articulation. The spatial graph convolution aggregates information from anatomically connected joints:

$$f_{\text{out}}(v) = \sum_{u \in \mathcal{N}(v)} \frac{1}{Z_{vu}}\, f_{\text{in}}(u) \cdot W\big(\ell(v, u)\big), \tag{10}$$

where $\ell(v, u)$ distinguishes edge types (centripetal, centrifugal, self-connections) and $Z_{vu}$ is a normalization factor. Temporal graph convolutions then capture motion dynamics across frames. Resulting pose features $F_{\text{pose}}$ are concatenated with appearance features and projected to a shared dimensionality before entering the HTSE.
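The spatial aggregation in Eq. (10) can be sketched as below. For brevity the example uses only two edge labels ("self" and "nbr") instead of the three named in the text, and it assumes that $Z_{vu}$ is the number of neighbors of $v$ sharing the label of $u$; both choices are simplifying assumptions.

```python
def row_times_matrix(x, weight):
    """Row vector x (length d_in) times matrix weight (d_in x d_out)."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x)))
        for j in range(len(weight[0]))
    ]

def spatial_graph_conv(features, neighbors, label_fn, weights):
    """Label-partitioned neighbor aggregation for every node."""
    d_out = len(next(iter(weights.values()))[0])
    out = []
    for v in range(len(features)):
        acc = [0.0] * d_out
        for u in neighbors[v]:
            label = label_fn(v, u)
            # Z_vu: size of the neighbor subset that shares this label
            z = sum(1 for k in neighbors[v] if label_fn(v, k) == label)
            message = row_times_matrix(features[u], weights[label])
            for j in range(d_out):
                acc[j] += message[j] / z
        out.append(acc)
    return out

# Hypothetical 3-node chain 0-1-2; neighborhoods include the node itself.
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
label_fn = lambda v, u: "self" if u == v else "nbr"
weights = {"self": [[1.0]], "nbr": [[2.0]]}
f_out = spatial_graph_conv([[1.0], [2.0], [3.0]], neighbors, label_fn, weights)
```

For the middle node, the self term contributes 2 and the two neighbors contribute (2·1 + 2·3)/2 = 4, giving 6.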

3.5 Cross-Modal Attention Fusion (CMAF)

CMAF proceeds in three stages. First, intra-modal self-attention refines each modality independently:

$$F^{m}_{\text{self}} = \mathrm{LayerNorm}\big(F^{m} + \mathrm{MHSA}(F^{m}, F^{m}, F^{m})\big), \quad m \in \{\text{lip}, \text{sign}\}, \tag{11}$$

where MHSA denotes multi-head self-attention. Second, bidirectional cross-attention allows each modality to query information from the other:

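$$F^{\text{lip} \to \text{sign}} = \mathrm{MHCA}\big(F^{\text{lip}}_{\text{self}}, F^{\text{sign}}_{\text{self}}, F^{\text{sign}}_{\text{self}}\big), \tag{12}$$

$$F^{\text{sign} \to \text{lip}} = \mathrm{MHCA}\big(F^{\text{sign}}_{\text{self}}, F^{\text{lip}}_{\text{self}}, F^{\text{lip}}_{\text{self}}\big), \tag{13}$$

where MHCA denotes multi-head cross-attention with the first argument as query and the second/third as key/value. Third, adaptive gating computes element-wise weights $g \in (0, 1)^{d}$:

$$g = \sigma\big(W_g \big[F^{\text{lip}}_{\text{self}}; F^{\text{sign}}_{\text{self}}; F^{\text{lip} \to \text{sign}}; F^{\text{sign} \to \text{lip}}\big] + b_g\big), \tag{14}$$

where $W_g \in \mathbb{R}^{d \times 4d}$, $b_g \in \mathbb{R}^{d}$, and $\sigma$ is the sigmoid function. The fused representation is:

$$F_{\text{fused}} = g \odot \big(F^{\text{lip}}_{\text{self}} + F^{\text{sign} \to \text{lip}}\big) + (1 - g) \odot \big(F^{\text{sign}}_{\text{self}} + F^{\text{lip} \to \text{sign}}\big). \tag{15}$$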
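The gating and fusion steps of Eqs. (14) and (15) can be sketched for a single time step as follows. The attention outputs are passed in as plain vectors, so the sketch covers only the gate, not the attention layers themselves; the function name and the toy values are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(f_lip, f_sign, f_lip2sign, f_sign2lip, w_g, b_g):
    """Element-wise gated fusion of the two modality branches.

    w_g has shape d x 4d and b_g has length d, as in Eq. (14).
    """
    d = len(f_lip)
    concat = f_lip + f_sign + f_lip2sign + f_sign2lip  # length 4d
    gate = [
        sigmoid(sum(w_g[i][k] * concat[k] for k in range(4 * d)) + b_g[i])
        for i in range(d)
    ]
    return [
        gate[i] * (f_lip[i] + f_sign2lip[i])
        + (1.0 - gate[i]) * (f_sign[i] + f_lip2sign[i])
        for i in range(d)
    ]

# With zero weights and bias the gate is 0.5 everywhere, so the output is
# the average of the lip branch and the sign branch.
d = 2
fused = gated_fusion(
    [1.0, 2.0], [3.0, 4.0], [0.5, 0.5], [1.0, 1.0],
    w_g=[[0.0] * (4 * d) for _ in range(d)], b_g=[0.0] * d,
)
```

A large positive bias would push the gate toward 1 and make the output follow the lip branch, while a large negative bias would favor the sign branch.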

3.6 Contrastive Semantic Alignment (CSA) Loss

For a minibatch of B aligned lip-sign pairs, modality-specific embeddings are normalized:

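$$z_i^{\text{lip}} = \frac{W_{\text{proj}}^{\text{lip}} \bar{F}_i^{\text{lip}}}{\big\| W_{\text{proj}}^{\text{lip}} \bar{F}_i^{\text{lip}} \big\|_2}, \qquad z_i^{\text{sign}} = \frac{W_{\text{proj}}^{\text{sign}} \bar{F}_i^{\text{sign}}}{\big\| W_{\text{proj}}^{\text{sign}} \bar{F}_i^{\text{sign}} \big\|_2}, \tag{16}$$

where $\bar{F}_i^{m}$ denotes the temporally pooled representation for sample $i$ and modality $m$. A symmetric InfoNCE objective enforces alignment:

$$\mathcal{L}_{\text{CSA}} = -\frac{1}{2B} \sum_{i=1}^{B} \left[ \log \frac{\exp(z_i^{\text{lip}} \cdot z_i^{\text{sign}} / \tau)}{\sum_{j=1}^{B} \exp(z_i^{\text{lip}} \cdot z_j^{\text{sign}} / \tau)} + \log \frac{\exp(z_i^{\text{sign}} \cdot z_i^{\text{lip}} / \tau)}{\sum_{j=1}^{B} \exp(z_i^{\text{sign}} \cdot z_j^{\text{lip}} / \tau)} \right], \tag{17}$$

where $\tau$ is a temperature hyperparameter.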
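A small pure-Python sketch of the symmetric InfoNCE objective in Eq. (17) follows. It assumes the inputs are already projected embeddings (the projection matrices of Eq. (16) are omitted), and the default temperature of 0.07 is a placeholder because the text does not state the value used.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax_at(row, index):
    """log of softmax(row)[index], computed stably."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(s - m) for s in row))
    return row[index] - log_sum

def csa_loss(lip_embeddings, sign_embeddings, tau=0.07):
    """Symmetric InfoNCE over B aligned lip-sign pairs (tau is assumed)."""
    z_lip = [l2_normalize(v) for v in lip_embeddings]
    z_sign = [l2_normalize(v) for v in sign_embeddings]
    batch = len(z_lip)
    total = 0.0
    for i in range(batch):
        lip_to_sign = [dot(z_lip[i], z_sign[j]) / tau for j in range(batch)]
        sign_to_lip = [dot(z_sign[i], z_lip[j]) / tau for j in range(batch)]
        total += log_softmax_at(lip_to_sign, i) + log_softmax_at(sign_to_lip, i)
    return -total / (2 * batch)

# Hypothetical embeddings: matched pairs versus deliberately swapped pairs.
aligned = csa_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], tau=1.0)
swapped = csa_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]], tau=1.0)
```

The matched batch yields a lower loss than the swapped one, which is the behavior the alignment objective is meant to reward.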

3.7 Training Objective

The complete training loss integrates a CTC component with a cross-entropy term:

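$$\mathcal{L}^{\text{task}} = \alpha\, \mathcal{L}_{\text{CTC}} + (1 - \alpha)\, \mathcal{L}_{\text{CE}}. \tag{18}$$

The final optimization objective is:

$$\mathcal{L}_{\text{total}} = \mathcal{L}^{\text{task}}_{\text{lip}} + \mathcal{L}^{\text{task}}_{\text{sign}} + \lambda_{\text{CSA}} \mathcal{L}_{\text{CSA}} + \lambda_{\text{reg}} \|\theta\|_2^2. \tag{19}$$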

The overall training procedure is described in Algorithm 1.

4 Architectural and Representational Analysis

This section presents an analysis of the architectural properties and representational capacity of the proposed framework. Rather than restating general theoretical results, we focus on aspects directly relevant to the CMAF and HTSE designs that justify our specific architectural choices.

4.1 Representational Capacity of Cross-Modal Attention

The CMAF module’s effectiveness derives from its ability to model complex interactions between lip and sign modalities. We formalize this capacity in terms of the function classes that the architecture can represent.

Proposition 1 (Expressive Capacity of Cross-Modal Attention). Let $\mathcal{F}_{cs}$ denote the family of continuous functions mapping paired sequences $(x, y) \in \mathbb{R}^{T \times d_1} \times \mathbb{R}^{T \times d_2}$ to fused representations in $\mathbb{R}^{T \times d_o}$, where cross-modal dependencies are bounded by a Lipschitz constant $L_f$. For any $f \in \mathcal{F}_{cs}$ and $\epsilon > 0$, the CMAF architecture with $H$ attention heads, hidden dimension $d_h$, and $N$ layers can approximate $f$ within error $\epsilon$ when:

$$H \cdot d_h \geq C \cdot L_f \cdot \log(1/\epsilon), \tag{20}$$

where C is a constant depending on input dimensionality.

Proof Sketch. The bidirectional cross-attention mechanism in CMAF can be decomposed into four components: intra-modal self-attention capturing within-modality dependencies, lip-to-sign cross-attention, sign-to-lip cross-attention, and adaptive gating for combination. Each cross-attention operation computes:

$$\mathrm{CrossAttn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V. \tag{21}$$

By the results of Yun et al. [30], transformer architectures with softmax attention are universal approximators for sequence-to-sequence functions. The CMAF extends this by enabling queries from one modality to attend to keys and values from another, effectively doubling the functional space that can be represented. The adaptive gating mechanism $g = \sigma(W_g[\cdot])$ provides an additional multiplicative interaction that allows context-dependent weighting. Combined, these mechanisms can represent any continuous cross-modal function within the specified bounds. ◻

This result justifies our architectural choice of bidirectional cross-attention: the design ensures that both modalities can contribute information to the fused representation, while the adaptive gating allows the model to learn when each modality is most informative for a given context.

4.2 Effective Receptive Field of HTSE

The hierarchical structure of HTSE is designed to capture temporal dependencies at multiple scales efficiently. We analyze how the receptive field grows with network depth and how this relates to the temporal extent of visual speech phenomena.

Lemma 1 (Hierarchical Receptive Field Growth). An HTSE with $L$ hierarchical levels, local convolution kernel size $k$, and pooling factor $p = 2$ achieves an effective temporal receptive field of:

$$R_{\text{eff}} = k \cdot \frac{p^{L} - 1}{p - 1} = k \cdot (2^{L} - 1), \tag{22}$$

while the parameter count grows as $O(L \cdot d^{2})$, where $d$ is the hidden dimension.

Proof. At level $l$, the temporal resolution is reduced by factor $p^{l-1}$ relative to the input. A convolution with kernel $k$ at level $l$ therefore covers $k \cdot p^{l-1}$ frames of the original input. Summing over all levels:

$$R_{\text{eff}} = \sum_{l=1}^{L} k \cdot p^{l-1} = k \cdot \sum_{l=0}^{L-1} p^{l} = k \cdot \frac{p^{L} - 1}{p - 1}. \tag{23}$$

For $p = 2$, this simplifies to $k(2^{L} - 1)$. Each level contains convolution layers with $O(k \cdot d^{2})$ parameters and attention layers with $O(d^{2})$ parameters, yielding total parameter growth of $O(L \cdot d^{2})$. ◻

For our configuration with $L = 4$ levels and kernel size $k = 3$, the effective receptive field is $R_{\text{eff}} = 3 \times 15 = 45$ frames. Combined with global self-attention at each level, the model can capture both local articulation patterns spanning a few frames and long-range dependencies extending across entire utterances. This design is motivated by the observation that lip movements exhibit local coarticulation effects while sign language requires understanding of phrase-level context.
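The receptive-field arithmetic of Lemma 1 can be checked numerically. The sketch below evaluates both the level-wise sum of Eq. (23) and the closed form of Eq. (22); the function names are illustrative.

```python
def receptive_field_sum(levels, kernel, pool=2):
    """Sum of k * p^(l-1) over levels l = 1..L, as in Eq. (23)."""
    return sum(kernel * pool ** (l - 1) for l in range(1, levels + 1))

def receptive_field_closed_form(levels, kernel, pool=2):
    """Closed form k * (p^L - 1) / (p - 1), as in Eq. (22)."""
    return kernel * (pool ** levels - 1) // (pool - 1)

# Configuration stated in the text: L = 4 levels, kernel size k = 3.
r_eff = receptive_field_sum(levels=4, kernel=3)
```

For the stated configuration both expressions give 45 frames, and doubling the number of levels to 8 would extend the field to 765 frames under the same assumptions.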

4.3 Computational Complexity

Table 2 summarizes the asymptotic complexities of each component. All symbols are defined as follows: $T$ denotes input sequence length in frames, $T'$ is the encoded sequence length after 3D CNN downsampling, $H$ and $W$ are spatial dimensions of input frames, $C$ is the number of channels, $k$ is the convolution kernel size, $d$ is the hidden dimension, $N_v$ is the number of graph vertices representing hand joints, $N_d$ is the number of decoder layers, and $T_y$ is the output sequence length.

The dominant term is the quadratic self-attention complexity $O(T'^{2} d)$. For typical video lengths where $T' < 500$ frames and our hidden dimension of $d = 512$, this remains computationally tractable. The unified architecture achieves computational savings relative to separate models by sharing the decoder and fusing features before decoding, rather than maintaining separate decoders for each task.

5 Experimental Setup

Table 3 enumerates the benchmark datasets used in our evaluation. Figs. 2 and 3 show example frames from the lip reading and sign language datasets, respectively. LRS2-BBC [14], LRS3-TED [31], and GRID [32] provide sentence- and word-level lip reading benchmarks collected from BBC broadcasts, TED talks, and controlled laboratory environments. For sign language recognition and translation, PHOENIX-2014 [33] and its translation counterpart PHOENIX-2014T [1] serve as standard continuous German Sign Language benchmarks, while CSL [34] and WLASL [35] provide isolated sign datasets for Chinese and American Sign Language, respectively. Finally, How2Sign [36] enables multimodal and cross-modal experiments as it contains synchronized lip and sign annotations for the same utterances. Performance for lip reading and continuous sign language recognition is measured using Word Error Rate (WER).
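For reference, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal sketch is shown below; it assumes whitespace tokenization and a non-empty reference, and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, 1):
            cur[j] = min(
                prev[j] + 1,                      # deletion
                cur[j - 1] + 1,                   # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = cur
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.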

Figure 2: Example frames from lip reading datasets showing extracted lip ROIs.

Figure 3: Example frames from sign language datasets showing hand pose estimation.

Face detection uses RetinaFace (confidence threshold 0.9), with lip ROIs extracted from facial landmarks 48–67 and resized to 88×88 pixels via affine normalization. Hand pose estimation uses MediaPipe Hands (confidence threshold 0.7), and the failed detections (fewer than 2%) are filled by linear interpolation to maintain temporal continuity. Data augmentation is applied asymmetrically: horizontal flipping is used only for sign language due to handedness constraints, while both modalities use temporal jittering (±2 frames), color jittering (±0.2), and random erasing (p = 0.1). Decoding employs beam search (width 10, length normalization 0.6) without an external language model. Tokenization uses character-level encoding for lip reading (28 tokens) and BPE with 1000 merges for sign language.

For reproducibility, all experiments use random seed 42, and results averaged over three runs show a standard deviation below 0.3% WER. Official dataset splits are used for all benchmarks, and the code along with pretrained models will be released upon publication to support replication and further research.

6 Results and Analysis

Table 4 reports Word Error Rates on lip reading benchmarks, including recent baselines from 2022–2023 for comprehensive comparison. All results are reported without external language models to ensure fair comparison. The unified model achieves the lowest WER across all evaluated datasets, with relative improvements of 5.7% on LRS2 and 5.1% on LRS3 compared to the strongest prior work.

Table 5 shows results for continuous and isolated sign language recognition. The unified model achieves state-of-the-art performance on PHOENIX-2014 with 18.3% WER, representing a 13.7% relative improvement over the previous best result.

Table 6 presents BLEU scores on PHOENIX-2014T. Our model achieves a BLEU-4 score of 24.89, an improvement of 2.25 points over the previous state of the art.

Table 7 analyzes each component’s contribution through cumulative addition, starting from a single-modality baseline and progressively adding each proposed component.

Table 8 provides detailed analysis of CMAF design choices, examining the impact of attention directionality and gating mechanisms.

The results confirm several design principles. Bidirectional attention outperforms unidirectional variants because both modalities contain complementary information that benefits the other. Sign-to-lip transfer provides slightly larger gains than lip-to-sign transfer, likely because sign language’s richer spatial vocabulary provides additional discriminative cues for disambiguating visually similar lip movements. Adaptive element-wise gating achieves the best performance by allowing dimension-specific modality weighting, enabling the model to selectively combine different aspects of each modality’s representation.

Table 9 shows sensitivity to the CSA loss weight $\lambda_{\text{CSA}}$, and Fig. 4 visualizes the cross-modal alignment effect using t-SNE projections of the learned embeddings.

Figure 4: t-SNE visualization of lip and sign embeddings for semantically matched pairs. Without CSA (left), modalities cluster separately in distinct regions. With CSA (right), semantically matched pairs align in the shared embedding space.

The visualizations demonstrate that without CSA loss, lip and sign embeddings occupy distinct regions of the representation space, limiting cross-modal transfer. With the CSA loss at $\lambda_{\text{CSA}} = 0.1$, semantically matched pairs from both modalities cluster together, enabling more effective information sharing. The optimal weight balances alignment strength against task-specific discriminability; larger weights improve alignment scores but begin to degrade recognition performance as the representations become overly constrained.

Table 10 evaluates performance when one modality is degraded or missing during inference, addressing concerns about robustness to partial input availability.

The model exhibits graceful degradation when one modality is impaired. Performance drops range from 3% to 14% depending on the degradation type and target task, demonstrating that cross-modal training provides implicit robustness. As expected, lip reading performance depends more heavily on the lip stream, while sign recognition depends more on the sign stream. Partial degradations such as frame dropout or occlusion have smaller effects than complete modality removal, indicating that the model can leverage whatever partial information remains available. Table 11 compares different pre-training strategies, confirming that joint multimodal pre-training provides the strongest performance gains.

Table 12 evaluates generalization by training on one dataset and testing on another without fine-tuning.

The unified model shows improved cross-domain generalization, particularly for lip reading where the 8–9% WER reduction suggests that cross-modal learning provides regularization benefits that transfer across dataset boundaries. The improvement on sign language transfer is smaller but still positive, indicating that the learned representations capture generalizable visual speech features rather than dataset-specific patterns.

Table 13 demonstrates performance under reduced training data availability, showing relative improvements of up to 19% when training with only 10% of the data.

Table 14 compares computational requirements between separate single-task models and the unified architecture. Latency measurements are end-to-end, including all preprocessing steps, conducted on a single NVIDIA A100 GPU with batch size 1.

We address practical deployment scenarios beyond the full multimodal inference setting. When only one modality is available, such as lip-only or sign-only video, the model can operate with a single branch while still benefiting from joint training. Table 15 shows that single-branch variants extracted from the unified model outperform separately trained single-task models, demonstrating that the cross-modal training provides transferable improvements even when inference uses only one modality. Table 16 presents performance of reduced-capacity variants suitable for edge deployment with limited computational resources.

Table 17 evaluates performance under reduced temporal sampling, which is relevant for bandwidth-constrained applications where transmitting full frame-rate video is impractical.

Fig. 5 decomposes errors by category. The unified model reduces errors across all categories, with notable improvements in viseme confusion (28% reduction) and boundary detection (33% reduction), which are areas where cross-modal information provides the greatest benefit.

Figure 5: Error analysis by category showing reductions across all error types.

7 Discussion

Our work builds on the foundation established by multimodal architectures for speech-related tasks. The dual-tower framework proposed by Ge et al. [26] demonstrates the effectiveness of unified multimodal modeling for audio-text fusion in air traffic control communications, achieving improved recognition accuracy through coordinated processing of acoustic and textual information. While their approach improves multimodal speech recognition, it addresses a fundamentally different modality combination. Our CMAF module addresses the unique challenges of visual-visual fusion, where both modalities share temporal structure but differ in spatial semantics. The bidirectional cross-attention design enables more flexible information exchange than parallel tower processing, which is important when the usefulness of lip and sign cues varies across contexts. Furthermore, the adaptive gating mechanism provides context-dependent weighting that is especially suited to the dynamic relationship between visual speech modalities.

Despite encouraging results, several limitations remain. The benchmark coverage may not fully capture real-world variability in illumination, camera pose, and occlusion, and evaluation on more diverse conditions would strengthen practical applicability. The model is language-specific and must be retrained for each target spoken or signed language, highlighting the need for multilingual or language-agnostic extensions. Training is computationally expensive, requiring eight A100 GPUs for 72 h; although lightweight variants reduce inference cost, they do not address training efficiency. The framework also assumes both lip and sign modalities are available during inference, and while performance degrades gracefully when one modality is missing, explicit strategies such as modality dropout could improve robustness. Finally, the CSA loss relies on semantically aligned lip-sign pairs, and constructing such pairs across datasets may introduce noise, suggesting that future work could explore self-supervised alignment methods that enable training on unpaired data.
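The modality dropout strategy mentioned above can be sketched as follows. This is only an illustration of the general technique, not part of the reported training procedure: it assumes features are lists of per-frame vectors, that at most one modality is zeroed per training sample, and that the drop probability `p_drop` and the function name `modality_dropout` are hypothetical choices.

```python
import random


def modality_dropout(lip_feats, sign_feats, p_drop=0.2, rng=None):
    """Zero out at most one modality so the model learns to cope without it.

    With probability p_drop the lip features are zeroed, with probability
    p_drop the sign features are zeroed, and otherwise both are kept.
    """
    if not 0.0 <= p_drop <= 0.5:
        raise ValueError("p_drop must lie in [0, 0.5]")
    rng = rng if rng is not None else random
    r = rng.random()
    if r < p_drop:
        lip_feats = [[0.0] * len(row) for row in lip_feats]
    elif r < 2 * p_drop:
        sign_feats = [[0.0] * len(row) for row in sign_feats]
    return lip_feats, sign_feats


if __name__ == "__main__":
    lip = [[0.3, 0.7], [0.1, 0.9]]
    sign = [[1.0, 2.0, 3.0]]
    print(modality_dropout(lip, sign, p_drop=0.25, rng=random.Random(0)))
```

Such a function would be applied only during training; at inference time the available modalities are passed through unchanged.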

8 Conclusion

We introduced UniModal-LSR, a unified multimodal architecture for joint lip reading and sign language recognition. Through hierarchical temporal-spatial encoding, graph-enhanced hand modeling, cross-modal attention fusion with adaptive gating, and contrastive semantic alignment, the model achieves state-of-the-art performance while reducing parameters by 25.9% relative to separate task-specific models. Detailed ablations confirm the contribution of each component, and analysis demonstrates robustness to modality degradation and improved cross-domain generalization.

Future work will explore integration of additional modalities such as audio and full-body pose to capture a more complete picture of visual communication. Language-agnostic representation learning could enable a single model to serve multiple linguistic communities. Self-supervised pre-training methods may reduce labeled data requirements, making the technology more accessible for under-resourced languages. Integration with dialogue systems, such as the full-duplex speech dialogue schemes based on large language models, could enable more natural and responsive human-computer interaction.

Acknowledgement: Not applicable.

Funding Statement: This work is funded by Ho Chi Minh City Open University (HCMCOU) and the Ministry of Education and Training (Vietnam) under grant number B2025-MBS-01.

Author Contributions: The authors report that their specific contributions to this paper are as follows: the study was conceived and designed by Vinh Truong Hoang and Nghia Dinh; data were collected by Thien Ho Huong and Luu Quang Phuong; data analysis and interpretation of the results were undertaken by Kiet Tran-Trung and Ha Duong Thi Hong; and the initial manuscript draft was written by Bay Nguyen Van and Hau Nguyen Trung. All authors reviewed and approved the final version of the manuscript.

Availability of Data and Materials: The authors confirm that the data supporting the findings of this study are openly available. LRS2 and LRS3 datasets are available at https://www.robots.ox.ac.uk/vgg/data/lip_reading/. PHOENIX dataset is available at https://www-i6.informatik.rwth-aachen.de/koller/RWTH-PHOENIX/. WLASL dataset is available at https://www.kaggle.com/datasets/risangbaskoro/wlasl-processed. CSL dataset is available at https://ustc-slr.github.io/datasets/2015_csl. How2Sign dataset is available at https://how2sign.github.io/.

Ethics Approval: This research uses publicly available benchmark datasets collected with appropriate consent and ethical approval by the original creators.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Camgoz NC, Hadfield S, Koller O, Ney H, Bowden R. Neural sign language translation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018 Jun 18–23; Salt Lake City, UT, USA. p. 7784–93. doi:10.1109/CVPR.2018.00812.

2. Assael YM, Shillingford B, Whiteson S, De Freitas N. LipNet: end-to-end sentence-level lipreading. arXiv:1611.01599. 2016.

3. Bear HL, Harvey R. Phoneme-to-viseme mappings: the good, the bad, and the ugly. Speech Commun. 2017;95(3):40–67. doi:10.1016/j.specom.2017.07.001.

4. Heracleous P, Beautemps D, Aboutabit N. Cued speech automatic recognition in normal-hearing and deaf subjects. Speech Commun. 2010;52(6):504–12. doi:10.1016/j.specom.2010.03.001.

5. Petridis S, Stafylakis T, Ma P, Cai F, Tzimiropoulos G, Pantic M. End-to-end audiovisual speech recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2018 Apr 15–20; Calgary, AB, Canada. p. 6548–52. doi:10.1109/ICASSP.2018.8461326.

6. Subba Rao MV, Naga Amulya T, Aparna Y, Pranavi R, Madhumitha S, Priya SS. Speech reconstruction from silent lip movements using deep learning. In: 2025 2nd International Conference on Artificial Intelligence for Innovations in Healthcare Industries (ICAIIHI); 2025 Dec 4–5; Raipur, India. p. 1–6. doi:10.1109/icaiihi67124.2025.11403769.

7. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2017. p. 6000–10.

8. Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the Kinetics dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21–26; Honolulu, HI, USA. p. 4724–33. doi:10.1109/CVPR.2017.502.

9. Yan S, Xiong Y, Lin D. Spatial temporal graph convolutional networks for skeleton-based action recognition. Proc AAAI Conf Artif Intell. 2018;32(1):7444–52. doi:10.1609/aaai.v32i1.12328.

10. van den Oord A, Li Y, Vinyals O. Representation learning with contrastive predictive coding. arXiv:1807.03748. 2018.

11. Potamianos G, Neti C, Gravier G, Garg A, Senior AW. Recent advances in the automatic recognition of audiovisual speech. Proc IEEE. 2003;91(9):1306–26. doi:10.1109/JPROC.2003.817150.

12. Matthews I, Cootes TF, Bangham JA, Cox S, Harvey R. Extraction of visual features for lipreading. IEEE Trans Pattern Anal Mach Intell. 2002;24(2):198–213. doi:10.1109/34.982900.

13. Gales M, Young S. The application of hidden Markov models in speech recognition. Found Trends® Signal Process. 2008;1(3):195–304. doi:10.1561/2000000004.

14. Chung JS, Senior A, Vinyals O, Zisserman A. Lip reading sentences in the wild. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21–26; Honolulu, HI, USA. p. 3444–53. doi:10.1109/CVPR.2017.367.

15. Ma P, Petridis S, Pantic M. End-to-end audio-visual speech recognition with conformers. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2021 Jun 6–11; Toronto, ON, Canada. p. 7613–7. doi:10.1109/ICASSP39728.2021.9414567.

16. Ma P, Petridis S, Pantic M. Visual speech recognition for multiple languages in the wild. Nat Mach Intell. 2022;4(11):930–9. doi:10.1038/s42256-022-00550-z.

17. Shi B, Hsu WN, Lakhotia K, Mohamed A. Learning audio-visual speech representation by masked multimodal cluster prediction. arXiv:2201.02184. 2022.

18. Ma P, Haliassos A, Fernandez-Lopez A, Chen H, Petridis S, Pantic M. Auto-AVSR: audio-visual speech recognition with automatic labels. In: ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2023 Jun 4–10; Rhodes Island, Greece. p. 1–5. doi:10.1109/ICASSP49357.2023.10096889.

19. Shukla A, Petridis S, Pantic M. Does visual self-supervision improve learning of speech representations for emotion recognition? IEEE Trans Affect Comput. 2023;14(1):406–20. doi:10.1109/TAFFC.2021.3062406.

20. Prajwal KR, Afouras T, Zisserman A. Sub-word level lip reading with visual attention. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022 Jun 18–24; New Orleans, LA, USA. p. 5152–62.

21. Hu H, Zhao W, Zhou W, Wang Y, Li H. SignBERT: pre-training of hand-model-aware representation for sign language recognition. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021 Oct 10–17; Montreal, QC, Canada. p. 11067–76. doi:10.1109/ICCV48922.2021.01090.

22. Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning—ICML ’06; 2006 Jun 25–29; Pittsburgh, PA, USA. p. 369–76. doi:10.1145/1143844.1143891.

23. Lu J, Yang J, Batra D, Parikh D. Hierarchical question-image co-attention for visual question answering. arXiv:1606.00061. 2016.

24. Tsai YH, Bai S, Liang PP, Kolter JZ, Morency LP, Salakhutdinov R. Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: ACL; 2019. p. 6558–69. doi:10.18653/v1/p19-1656.

25. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. arXiv:2103.00020. 2021.

26. Ge S, Ren J, Shi Y, Zhang Y, Yang S, Yang J. Audio-text multimodal speech recognition via dual-tower architecture for Mandarin air traffic control communications. Comput Mater Contin. 2024;78(3):3215–45. doi:10.32604/cmc.2023.046746.

27. Li G, Deng J, Geng M, Jin Z, Wang T, Hu S, et al. Audio-visual end-to-end multi-channel speech separation, dereverberation and recognition. IEEE/ACM Trans Audio Speech Lang Process. 2023;31:2707–23. doi:10.1109/TASLP.2023.3294705.

28. Wang X, Jiang H, Huang H, Fang Y, Xu M, Wang Q. DCIM-AVSR: efficient audio-visual speech recognition via dual conformer interaction module. In: ICASSP 2025–2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2025 Apr 6–11; Hyderabad, India. p. 1–5. doi:10.1109/ICASSP49660.2025.10890272.

29. Zhang F, Bazarevsky V, Vakunov A, Tkachenka A, Sung G, Chang C-L, et al. MediaPipe hands: on-device real-time hand tracking. arXiv:2006.10214. 2020.

30. Yun C, Bhojanapalli S, Rawat AS, Reddi S, Kumar S. Are transformers universal approximators of sequence-to-sequence functions? arXiv:1912.10077. 2020.

31. Afouras T, Chung JS, Zisserman A. LRS3-TED: a large-scale dataset for visual speech recognition. arXiv:1809.00496. 2018.

32. Cooke M, Barker J, Cunningham S, Shao X. An audio-visual corpus for speech perception and automatic speech recognition. J Acoust Soc Am. 2006;120(5 Pt 1):2421–4. doi:10.1121/1.2229005.

33. Koller O, Forster J, Ney H. Continuous sign language recognition: towards large vocabulary statistical recognition systems handling multiple signers. Comput Vis Image Underst. 2015;141(5):108–25. doi:10.1016/j.cviu.2015.09.013.

34. Huang J, Zhou W, Zhang Q, Li H, Li W. Video-based sign language recognition without temporal segmentation. Proc AAAI Conf Artif Intell. 2018;32(1):2257–64. doi:10.1609/aaai.v32i1.11903.

35. Li D, Opazo CR, Yu X, Li H. Word-level deep sign language recognition from video: a new large-scale dataset and methods comparison. In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV); 2020 Mar 1–5; Snowmass Village, CO, USA. p. 1448–58. doi:10.1109/wacv45572.2020.9093512.

36. Duarte A, Palaskar S, Ventura L, Ghadiyaram D, DeHaan K, Metze F, et al. How2Sign: a large-scale multimodal dataset for continuous American Sign Language. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021 Jun 20–25; Nashville, TN, USA. p. 2734–43. doi:10.1109/cvpr46437.2021.00276.

37. Afouras T, Chung JS, Senior A, Vinyals O, Zisserman A. Deep audio-visual speech recognition. IEEE Trans Pattern Anal Mach Intell. 2022;44(12):8717–27. doi:10.1109/TPAMI.2018.2889052.

38. Martinez B, Ma P, Petridis S, Pantic M. Lipreading using temporal convolutional networks. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2020 May 4–8; Barcelona, Spain. p. 6319–23. doi:10.1109/icassp40776.2020.9053841.

39. Kim M, Yeo JH, Ro YM. Distinguishing homophenes using multi-head visual-audio memory for lip reading. Proc AAAI Conf Artif Intell. 2022;36(1):1174–82. doi:10.1609/aaai.v36i1.20003.

40. Koller O, Camgoz NC, Ney H, Bowden R. Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos. IEEE Trans Pattern Anal Mach Intell. 2020;42(9):2306–20. doi:10.1109/TPAMI.2019.2911077.

41. Zhou H, Zhou W, Zhou Y, Li H. Spatial-temporal multi-cue network for continuous sign language recognition. Proc AAAI Conf Artif Intell. 2020;34(7):13009–16. doi:10.1609/aaai.v34i07.7001.

42. Cheng KL, Yang Z, Chen Q, Tai YW. Fully convolutional networks for continuous sign language recognition. In: Computer Vision—ECCV 2020. Cham, Switzerland: Springer; 2020. p. 697–714. doi:10.1007/978-3-030-58586-0_41.

43. Min Y, Hao A, Chai X, Chen X. Visual alignment constraint for continuous sign language recognition. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021 Oct 10–17; Montreal, QC, Canada. p. 11522–31. doi:10.1109/ICCV48922.2021.01134.

44. Cihan Camgoz N, Koller O, Hadfield S, Bowden R. Sign language transformers: joint end-to-end sign language recognition and translation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020 Jun 13–19; Seattle, WA, USA. p. 10020–30. doi:10.1109/cvpr42600.2020.01004.

45. Li D, Xu C, Yu X, Zhang K, Swift B, Suominen H, et al. TSPNet: hierarchical feature learning via temporal semantic pyramid for sign language translation. In: NIPS’20: Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2020. p. 12034–45.


Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
