Open Access
ARTICLE
NestLipGNN: A Hierarchical Graph Neural Network Framework with Nested Multi-Granularity Learning for Robust Visual Speech Recognition
AI Lab, Faculty of Information Technology, Ho Chi Minh City Open University, 35-37 Ho Hao Hon Street, Co Giang Ward, District 1, Ho Chi Minh City, Vietnam
* Corresponding Author: Vinh Truong Hoang. Email:
(This article belongs to the Special Issue: Artificial Intelligence in Visual and Audio Signal Processing)
Computers, Materials & Continua 2026, 88(1), 54 https://doi.org/10.32604/cmc.2026.078089
Received 23 December 2025; Accepted 25 March 2026; Issue published 08 May 2026
Abstract
Visual speech recognition (VSR) aims to infer spoken content from visual observations of articulatory movements. Despite significant progress, it remains a challenging task in computer vision and speech processing. Its difficulty arises from pronounced speaker-to-speaker variability, the presence of homophenes (phonemes that are visually indistinguishable), changes in illumination, and the intrinsically high-dimensional nature of spatiotemporal lip dynamics. In this work, we propose NestLipGNN, a graph-based framework that integrates Graph Neural Networks (GNNs) with a nested multi-granularity learning strategy for visual speech recognition. We construct dynamic lip graphs from facial landmarks to model both spatial relationships between lip regions and their temporal motion during speech articulation. The proposed nested learning architecture supports hierarchical feature extraction across several levels of linguistic abstraction, spanning phoneme-level articulatory units, viseme-level visual speech categories, and word-level semantic representations. We further introduce a Temporal Graph Attention mechanism (T-GAT) that adaptively reweights the importance of distinct lip regions over time. We also introduce a graph-based contrastive learning objective to improve the discrimination of visually similar speech patterns, directly confronting the challenge of homophene resolution. Experiments on the LRW, LRS2, LRS3, and GRID datasets show that NestLipGNN improves recognition accuracy compared with existing methods, obtaining 92.3% word-level accuracy on LRW and delivering a 2.1% absolute performance gain over prior methods. Comprehensive ablation analyses confirm the contribution of each architectural component.
Visual speech recognition (VSR), commonly referred to as lip reading, involves computationally inferring spoken language content solely from visual observations of orofacial articulatory motion, without relying on acoustic information [1,2]. This line of research has attracted considerable interest within the scientific community due to its broad range of practical uses, such as silent speech interfaces for people who have undergone laryngectomy [3], assistive tools for individuals with hearing impairments [4], multimodal speech recognition systems designed for acoustically challenging conditions [5], and security and forensic surveillance applications [6].
The core difficulty in visual speech recognition arises from the inherent ambiguity of visual speech cues: many phonetically different utterances produce lip movements that are visually indistinguishable, a phenomenon known as homopheny in phonetic studies [4,7]. For example, the bilabial consonants /p/, /b/, and /m/ share nearly identical visible articulations, making it extremely challenging to differentiate them using only visual input. This intrinsic visual ambiguity, further exacerbated by inter-speaker morphological differences, variations in head pose, and changes in lighting conditions, makes VSR a significantly more demanding computational task than conventional acoustic speech recognition [6].
Recent progress in deep learning techniques has led to notable gains in the performance of lip reading systems. Convolutional Neural Networks (CNNs) are widely employed to extract spatial features from lip-region images [8,9], whereas Recurrent Neural Networks (RNNs) and their gated extensions are used to model temporal dependencies across sequences of frames [10]. More recently, three-dimensional CNNs [9] and Transformer-based models [11] have achieved competitive performance by jointly capturing spatiotemporal patterns. Recent advances, including global-local integrated frameworks [12], language-model-enhanced approaches, and viseme-guided generation methods [13], continue to push performance boundaries. However, these approaches largely treat lip images as conventional Euclidean grids and therefore overlook the intrinsic anatomical structure of different lip regions and the topological constraints that shape articulatory motion.
Graph Neural Networks (GNNs) offer a mathematical framework for learning from non-Euclidean structured data [14–16]. The human lip system can be naturally modeled as a graph, where nodes represent anatomical landmark locations and edges encode spatial relations that characterize lip shape and deformation patterns. This modeling strategy allows explicit incorporation of lip structural attributes during speech production, yielding robustness to certain geometric transformations while retaining critical articulatory information [17]. Graph-based representations of facial landmarks and temporal attention mechanisms have been explored in related domains such as skeleton-based action recognition and facial expression analysis. Recent work demonstrates the efficacy of redundancy-aware learning with symmetric view modeling [18] and global-local integrated frameworks [12] for capturing complex lip dynamics. Our contribution lies not in the individual components themselves, but in their principled integration and adaptation for the specific challenges of visual speech recognition, particularly homophene disambiguation.
The nested learning paradigm, rooted in hierarchical optimization and bilevel programming theory [19], also provides a systematic approach for learning representations across multiple levels of abstraction. Speech is intrinsically organized in a nested hierarchy: phonemes (basic acoustic components) combine to form visemes (visual speech units), which in turn compose words, phrases, and sentences [20]. Recent advances in viseme-guided generation and language model integration [13] demonstrate the importance of multi-level linguistic representation in visual speech recognition. This linguistic structure implies that effective VSR systems should learn representations at several granularities, where lower-level features support and shape higher-level abstractions via structured knowledge transfer.
In this paper, we present NestLipGNN (Fig. 1), a framework that combines graph neural networks with nested multi-granularity learning for visual speech recognition. The main contributions of this work are as follows. First, we introduce a principled approach for building dynamic lip graphs that encode both spatial lip geometry and temporal deformation dynamics using learnable adjacency matrices, which adapt to speaker-specific articulatory patterns. Second, we propose a hierarchical learning scheme that jointly optimizes representations at the phoneme, viseme, and word levels through coupled hierarchical loss functions, facilitating bidirectional information flow across different linguistic abstraction layers. Third, we report state-of-the-art performance on four standard benchmark datasets and complement these results with detailed ablation experiments analyzing the framework’s components.

Figure 1: Architectural overview of the proposed NestLipGNN framework.
The rest of the paper is structured as follows. Section 2 reviews related literature. Section 3 details the proposed NestLipGNN framework. Section 4 outlines the experimental setup and reports empirical results. Section 5 analyzes the implications of our findings, and Section 6 closes with concluding remarks and potential avenues for future work.
2.1 Visual Speech Recognition: Historical Evolution and Modern Methodologies
The development of visual speech recognition (VSR) systems mirrors broader progress in pattern recognition and machine learning. Early computational methods relied on manually designed feature representations, such as discrete cosine transform (DCT) coefficients [21], active appearance models (AAM) [22], and Hidden Markov Models (HMMs) for modeling temporal dynamics [23]. Although these approaches laid essential theoretical and practical foundations, their effectiveness was inherently limited by the expressive power of hand-crafted features and their restricted capacity to represent the complex nonlinear dependencies characteristic of visual speech signals.
The emergence of deep learning led to a major transformation in VSR research. Stafylakis and Tzimiropoulos investigated a variety of architectural designs, including three-dimensional CNNs for joint spatiotemporal feature learning [9], and conformer architectures that integrate convolutional modules with self-attention [24]. More recently, transformer-based approaches have set new performance benchmarks. Afouras et al. [5] adopted transformer encoders for temporal modeling, reporting significant performance improvements. Ma et al. [25] further showed that large-scale multilingual pretraining enables effective cross-lingual transfer. Self-supervised pretraining techniques [11,26], which exploit vast collections of unlabeled video data, have yielded further accuracy gains. The AV-HuBERT framework [26] demonstrated that masked prediction objectives can learn powerful audio-visual representations, while recent work on LP-Conformer [27] has shown the benefits of combining local and global context modeling. Nevertheless, most current methods still rely on grid-based representations, which restrict their ability to explicitly characterize the structural constraints underlying lip articulation.
2.2 Graph Neural Networks: Theoretical Foundations and Applications
Graph Neural Networks (GNNs) have become highly effective computational frameworks for learning from graph-structured data. Spatial methods [14,28] implement convolutions by aggregating information from local neighborhoods via message-passing schemes. Graph Attention Networks (GATs) [15] further extend this paradigm by introducing attention mechanisms to adaptively weight neighboring nodes, thereby enhancing both expressive power and interpretability.
Within human body analysis, GNNs have been widely adopted for skeleton-based action recognition [17], leveraging the inherent graph structure formed by anatomical joint connections. In facial analysis, graph-based techniques have been applied for expression recognition, face alignment, and 3D face reconstruction. Recent advances have also explored dynamic graph construction for temporal sequences and multi-scale graph representations. In contrast, the targeted and systematic use of GNNs for lip reading has received comparatively little attention, which motivates our graph-theoretic formulation of visual speech recognition.
2.3 Hierarchical and Nested Optimization
Hierarchical learning has long-standing theoretical foundations in machine learning, ranging from hierarchical clustering techniques to deep neural networks with multiple layers of abstraction. Nested optimization, often referred to as bilevel programming, has seen a resurgence of interest in areas such as meta-learning [29], hyperparameter optimization [19,30], and neural architecture search [31].
The key idea is that many learning problems naturally admit a hierarchical organization, where higher-level representations are defined in terms of lower-level features via nested optimization loops, each with its own convergence behavior. This framework has been effectively employed in few-shot learning, domain adaptation, and multi-task learning. In speech recognition, hierarchical formulations have captured the phoneme-word-sentence structure [32,33], although commonly using cascaded pipelines rather than fully nested optimization schemes.
We distinguish this work from related approaches as follows. Table 1 summarizes the key differences between NestLipGNN and related graph-based and hierarchical methods. Graph-based landmark modeling has been explored in skeleton action recognition and facial expression analysis [17], but these methods do not address the specific challenges of visual speech recognition, particularly the fine-grained temporal dynamics of articulation and homophene disambiguation. Temporal attention mechanisms [34] have been explored in video understanding, but our T-GAT mechanism specifically incorporates graph structure similarity into the attention computation. Hierarchical supervision has been used in acoustic speech recognition [33], but the specific phoneme-viseme-word hierarchy aligned with visual speech units is novel to this work. The key contribution of this work is the integration of graph modeling, temporal attention, and hierarchical supervision into a single VSR architecture, along with the graph-based contrastive learning objective for homophene resolution.

2.5 Contrastive Representation Learning
Contrastive learning has become a prominent self-supervised framework [35], in which informative representations are acquired by distinguishing positive pairs from negative examples within a learned embedding space. In speech processing, contrastive approaches have been utilized for acoustic representation learning [36], speaker verification [37], and audio-visual correspondence modeling [38]. Building on this paradigm, we introduce graph-structured contrastive objectives to tackle the homophene disambiguation challenge in visual speech recognition.
Fig. 2 presents the complete architectural specification of NestLipGNN. Our methodology comprises five principal components: (1) Dynamic Lip Graph Construction, (2) Spatial Graph Encoder, (3) Temporal Graph Attention, (4) Nested Multi-Granularity Decoder, and (5) Graph Contrastive Learning.

Figure 2: Comprehensive architectural diagram of NestLipGNN. The framework processes video frames to construct dynamic lip graphs, encodes spatial features through GCN and GAT layers, captures temporal dynamics via T-GAT, and generates predictions through a nested learning hierarchy with phoneme (
Let
Our key idea is to factorize this mapping into a sequence of graph-based transformations:
where
3.2 Dynamic Lip Graph Construction
3.2.1 Anatomical Landmark Extraction
We use a pretrained facial landmark detector based on the Face Alignment Network (FAN) [39] to obtain 68 anatomical facial landmarks for each video frame. From this full set, we retain the 20 landmarks that delineate the lip region: 12 on the outer contour and 8 on the inner contour (Fig. 3).

Figure 3: Lip graph topology comprising 20 anatomical landmarks. Blue nodes correspond to outer lip contour points (vermilion border, 12 vertices), and green nodes correspond to inner lip contour points (8 vertices). Solid edges indicate anatomically defined connections, while dashed edges illustrate learned cross-contour links.
To mitigate the impact of head pose changes and to allow consistent comparisons across frames, we normalize the landmarks via Generalized Procrustes Analysis:
where
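As a concrete illustration, the per-frame part of this normalization can be sketched as follows (a minimal NumPy sketch with a hypothetical function name; full Generalized Procrustes Analysis would additionally align rotation against an iteratively re-estimated mean shape):

```python
import numpy as np

def normalize_landmarks(pts):
    """Center and scale-normalize one frame of 2-D lip landmarks (N, 2).

    Subtracting the centroid removes translation; dividing by the RMS
    distance to the centroid removes scale, so frames captured at
    different head positions and distances become comparable.
    """
    centered = pts - pts.mean(axis=0, keepdims=True)
    scale = np.sqrt((centered ** 2).sum(axis=1).mean())
    return centered / (scale + 1e-8)
```

After this step, landmark configurations differ only by rotation and residual non-rigid deformation, which carries the articulatory signal of interest.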
3.2.2 Graph-Theoretic Formulation
At each time step
3.2.3 Multi-Modal Node Feature Construction
Each node feature vector aggregates positional, appearance-based, textural, and kinematic cues:
where
The normalized coordinates
where
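The positional and kinematic parts of this feature construction can be sketched as below (an illustrative NumPy fragment, not the paper's implementation; appearance and texture descriptors, omitted here, would be concatenated analogously):

```python
import numpy as np

def node_features(landmarks):
    """Build per-node features from a landmark sequence (T, N, 2).

    Concatenates normalized coordinates with finite-difference
    velocities as stand-ins for the positional and kinematic cues.
    """
    vel = np.zeros_like(landmarks)
    vel[1:] = landmarks[1:] - landmarks[:-1]          # frame-to-frame motion
    return np.concatenate([landmarks, vel], axis=-1)  # (T, N, 4)
```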
3.2.4 Adaptive Adjacency Matrix Formulation
We introduce a trainable adjacency matrix that jointly integrates anatomical priors with data-driven relations:
where
The motivation for combining these three adjacency components stems from complementary information sources. The structural adjacency
where
where
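The combination of the three adjacency components can be illustrated as follows (a minimal sketch under assumed blending weights; the paper's exact combination rule and parameterization may differ):

```python
import numpy as np

def combined_adjacency(A_struct, A_learn, X, alpha=0.5, beta=0.3):
    """Blend anatomical, learnable, and data-driven adjacency.

    A_struct: fixed 0/1 anatomical edges; A_learn: trainable logits;
    X: node features (N, d) driving a cosine-similarity dynamic term.
    alpha/beta are illustrative mixing weights.
    """
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    A_dyn = np.clip(Xn @ Xn.T, 0, None)   # data-driven similarity edges
    A_lrn = 1 / (1 + np.exp(-A_learn))    # squash logits to (0, 1)
    A = alpha * A_struct + beta * A_lrn + (1 - alpha - beta) * A_dyn
    return (A + A.T) / 2                  # keep the graph symmetric
```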
3.3 Spatial Graph Encoder
The spatial encoder processes the lip graph of each frame, extracting structural cues through a hierarchical sequence of graph convolution and attention layers.
3.3.1 Spectral Graph Convolutions
We employ spectral graph convolutions derived from the first-order Chebyshev polynomial approximation [14]:
where
To enlarge the receptive field and capture information from more distant neighbors, we use a multi-scale aggregation strategy [16]:
where
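A single first-order propagation step of this kind can be sketched as follows (a minimal NumPy version of the standard Kipf-Welling approximation, shown for illustration rather than as our exact implementation):

```python
import numpy as np

def gcn_layer(X, A, W):
    """One first-order GCN step: X' = ReLU(D^{-1/2} (A+I) D^{-1/2} X W).

    Self-loops (A + I) let each node retain its own features; symmetric
    degree normalization keeps feature magnitudes stable across nodes.
    """
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)
```

Stacking such layers, or mixing powers of the normalized adjacency, yields the multi-scale receptive fields described above.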
3.3.2 Graph Attention Mechanism
Following the Graph Attention Network paradigm [15], we incorporate an attention mechanism that assigns adaptive importance to neighboring nodes. The unnormalized attention scores are given by:
where
The normalized attention coefficients are obtained with a softmax over the neighborhood of node
where
The node features are updated by aggregating neighbor embeddings weighted by these attention coefficients:
To capture heterogeneous interaction patterns, we adopt multi-head attention with
where the vertical bar ∥ denotes concatenation across
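A single attention head of this mechanism can be sketched as below (an illustrative dense NumPy version; a practical implementation would operate on sparse edge lists and stack multiple heads by concatenation):

```python
import numpy as np

def gat_layer(X, A, W, a):
    """Single-head graph attention over node features X (N, d).

    Each edge is scored with a LeakyReLU of the concatenated projected
    endpoints, scores are softmax-normalized over each node's neighbors
    (self-loops included), and neighbor embeddings are aggregated with
    the resulting weights. W: (d, h) projection; a: (2h,) attention vector.
    """
    H = X @ W
    N = H.shape[0]
    # pair[i, j] = [H_i ; H_j] for every ordered node pair
    pair = np.concatenate(
        [np.repeat(H, N, 0), np.tile(H, (N, 1))], axis=1).reshape(N, N, -1)
    s = pair @ a
    e = np.where(s >= 0, s, 0.2 * s)           # LeakyReLU(0.2)
    mask = (A + np.eye(N)) > 0
    e = np.where(mask, e, -1e9)                # attend only to neighbors
    alpha = np.exp(e - e.max(1, keepdims=True))
    alpha = alpha / alpha.sum(1, keepdims=True)
    return alpha @ H
```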
After L layers of message passing, we compute a single embedding for the entire graph by combining mean and max pooling over the final node representations:
where
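The readout itself is straightforward; a minimal sketch:

```python
import numpy as np

def graph_readout(H):
    """Graph-level embedding from node features H (N, d).

    Concatenating mean pooling (overall lip configuration) with max
    pooling (most salient landmark responses) yields a (2d,) vector.
    """
    return np.concatenate([H.mean(axis=0), H.max(axis=0)])
```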
3.4 Temporal Graph Attention (T-GAT)
To capture how lip graphs evolve over time, we propose a Temporal Graph Attention mechanism (Fig. 4) that extends self-attention to graph-structured temporal sequences. This mechanism computes attention over all pairs of temporal frames, resulting in a cost that grows quadratically with the sequence length.

Figure 4: Temporal Graph Attention captures both short-range temporal consistency (solid red arrows) and long-range interactions (dashed orange arrows) over frame-level graph features.
Let
where the positional encoding
T-GAT then obtains query, key, and value tensors through learned linear transformations:
where
Temporal attention scores are determined via a similarity function that is aware of graph structure:
where
These similarity scores are normalized into attention weights using a softmax:
The temporally enriched representation at time t is computed as a weighted combination of the value tensors:
We stack M T-GAT layers and apply residual connections together with layer normalization to facilitate stable training:
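The core of one such layer can be sketched as follows (a minimal NumPy version of scaled dot-product temporal attention with the residual connection and layer normalization; the graph-structure similarity term that T-GAT adds to the attention scores is omitted from this sketch):

```python
import numpy as np

def temporal_attention(G, Wq, Wk, Wv):
    """One simplified temporal attention layer over frame embeddings G (T, d).

    Every frame attends to every other frame via scaled dot-product
    scores; the output is added back to the input (residual) and
    row-normalized (layer norm) for stable stacking.
    """
    Q, K, V = G @ Wq, G @ Wk, G @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(1, keepdims=True))
    w = w / w.sum(1, keepdims=True)            # softmax over frames
    h = G + w @ V                               # residual connection
    return (h - h.mean(1, keepdims=True)) / (h.std(1, keepdims=True) + 1e-8)
```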
3.5 Nested Multi-Granularity Learning Framework
The proposed nested framework enables hierarchical representation learning by simultaneously optimizing multiple layers of linguistic abstraction (Fig. 5).

Figure 5: Schematic of the nested learning hierarchy. The inner loop refines phoneme-level encodings, the middle loop estimates viseme-level representations conditioned on phonemes, and the outer loop optimizes word-level embeddings conditioned on both subordinate layers. Dashed arrows indicate gradient propagation across hierarchy levels.
3.5.1 Hierarchical Representation Levels
We define three representation layers aligned with the linguistic hierarchy.
The phoneme level (
where
The viseme level (
where
The word level (
where
3.5.2 Bilevel Optimization Formulation
Following bilevel optimization principles [19], we formulate the nested learning process as a multi-stage hierarchical optimization problem. The inner loop (phoneme optimization) updates the phoneme decoder parameters
where
The middle loop (viseme optimization) updates the viseme decoder
where
The outer loop (word optimization) updates the word decoder
where
3.5.3 Practical Implementation via Joint Optimization
Solving the hierarchical optimization problem exactly is computationally expensive, so we instead employ a tractable approximation based on a jointly optimized weighted loss:
where
We clarify the relationship between the theoretical bilevel formulation and our practical implementation. The bilevel optimization formulation (Eqs. (27)–(29)) provides a principled framework for understanding how hierarchical supervision should flow from lower to higher linguistic levels. However, true bilevel optimization requires expensive nested gradient computations and multiple inner-loop iterations per outer update, which is computationally prohibitive for deep networks. Our practical implementation approximates this structure through the weighted joint loss (Eq. (30)), which trains all levels simultaneously but preserves the hierarchical dependency structure through the consistency loss
Each level-specific objective is a cross-entropy classification loss:
where
We introduce a hierarchical consistency loss to promote alignment between successive levels:
where
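Putting the level-specific objectives together, the weighted joint loss can be sketched as below (an illustrative single-example version; the weights shown are placeholders, and the hierarchical consistency term is omitted for brevity):

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of one example from raw logits (numerically stable)."""
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

def total_hierarchical_loss(ph, vi, wo, labels, weights=(0.2, 0.3, 0.5)):
    """Weighted joint objective over phoneme/viseme/word logits.

    This is the tractable stand-in for the bilevel formulation: all three
    levels are trained simultaneously, with `weights` balancing their
    contributions (values here are illustrative).
    """
    lp, lv, lw = labels
    a, b, c = weights
    return (a * cross_entropy(ph, lp)
            + b * cross_entropy(vi, lv)
            + c * cross_entropy(wo, lw))
```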
3.6 Graph Contrastive Learning for Homophene Disambiguation
For each mini-batch, we construct positive and negative pairs based on their linguistic identity. Positive pairs are obtained from distinct video instances of the same word, encouraging invariance to speaker, lighting, and pose variations. Hard negatives are chosen from lexically distinct words that are visually similar in articulation, known as homophenes.
We employ the InfoNCE loss, augmented with a graph-based regularization component:
where
We refine the standard contrastive loss by incorporating graph structure through Gromov-Wasserstein regularization:
where
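The InfoNCE part of this objective can be sketched as follows (a minimal single-anchor NumPy version; the Gromov-Wasserstein graph regularizer is omitted, and the temperature value is illustrative):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor embedding.

    Pulls the anchor toward its positive (another video of the same word)
    and pushes it away from hard negatives (homophenes). All embeddings
    are L2-normalized before computing cosine similarities.
    """
    def norm(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    pos = np.exp(a @ p / tau)
    neg = np.exp(n @ a / tau).sum()
    return -np.log(pos / (pos + neg))
```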
The full training loss combines all constituent terms:
where
The regularization component accounts for both weight decay and the sparsity of the adjacency matrices:
where
Algorithm 1 presents the complete NestLipGNN training pipeline.

We evaluate our approach on four widely used visual speech recognition benchmarks. Table 2 provides a statistical overview of these datasets.

The LRW (Lip Reading in the Wild) dataset [41] is a large-scale word-level VSR dataset containing 500 vocabulary items, with up to 1000 training examples for each word. Each video sample consists of 29 frames captured at 25 fps, sourced from BBC television broadcasts. The data exhibits extensive variability in speakers, head pose, and lighting conditions.
The LRS2 (Lip Reading Sentences 2) dataset [5] is a sentence-level corpus comprising thousands of spoken sentences extracted from BBC programs, recorded under challenging real-world conditions such as multiple active speakers and significant background noise.
The LRS3 (Lip Reading Sentences 3) dataset [42] is the largest publicly accessible lip-reading corpus, totaling more than 400 h of video from TED and TEDx talks, encompassing a wide range of speakers, subject areas, and speaking styles.
The GRID dataset [43] is a carefully controlled laboratory dataset featuring 34 speakers, each producing 1000 command-like utterances. This dataset is primarily used for fine-grained analysis and for assessing cross-corpus generalization.
Since phoneme and viseme labels are not provided as standard annotations for these datasets, we describe our procedure for obtaining them. For word-level datasets (LRW), phoneme sequences are derived using the Montreal Forced Aligner (MFA) [44] with the CMU Pronouncing Dictionary as the lexicon. The aligner produces frame-level phoneme boundaries which are then mapped to video frames using the provided audio-video synchronization timestamps. For sentence-level datasets (LRS2, LRS3), we similarly apply MFA to the audio track with the corresponding transcripts to obtain phoneme alignments.
Viseme labels are derived from phoneme labels using a standard phoneme-to-viseme mapping based on articulatory features [4]. We use a 12-class viseme system that groups phonemes sharing similar lip configurations: bilabials (/p/, /b/, /m/), labiodentals (/f/, /v/), dentals (/th/, /dh/), alveolars (/t/, /d/, /n/, /s/, /z/, /l/), palatals (/sh/, /zh/, /ch/, /jh/), velars (/k/, /g/, /ng/), and vowel groups based on lip rounding and height.
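Such a mapping amounts to a simple lookup; a fragment is sketched below (class indices are arbitrary placeholders, not the paper's label IDs, and the vowel groups are omitted):

```python
# Illustrative fragment of the 12-class phoneme-to-viseme grouping.
PHONEME_TO_VISEME = {
    "p": 0, "b": 0, "m": 0,                          # bilabials
    "f": 1, "v": 1,                                  # labiodentals
    "th": 2, "dh": 2,                                # dentals
    "t": 3, "d": 3, "n": 3, "s": 3, "z": 3, "l": 3,  # alveolars
    "sh": 4, "zh": 4, "ch": 4, "jh": 4,              # palatals
    "k": 5, "g": 5, "ng": 5,                         # velars
}

def phonemes_to_visemes(seq):
    """Map an aligned phoneme sequence to viseme class labels."""
    return [PHONEME_TO_VISEME[p] for p in seq]
```

Note that the bilabial group collapses /p/, /b/, and /m/ into a single viseme class, which is precisely the ambiguity the contrastive objective later targets at the word level.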
The preprocessing pipeline first detects faces using RetinaFace [45] and aligns them into canonical frontal poses. The lip region is then cropped to a resolution of
The architecture includes a spatial encoder composed of three GCN layers with dimensions
We train the network with the AdamW optimizer [46], using a weight decay of
We apply data augmentation: random horizontal flips with probability 0.5, random cropping to
For the LRW dataset, which involves word-level classification, we report both top-1 and top-5 accuracy. For sentence-level datasets, including LRS2, LRS3, and GRID, performance is evaluated using the Word Error Rate (WER) and Character Error Rate (CER), given by
where S, D, and I denote the numbers of substitution, deletion, and insertion errors, respectively, and N is the total number of words in the reference transcription. CER is computed analogously at the character level.
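This metric reduces to a word-level Levenshtein distance; a minimal reference implementation:

```python
def word_error_rate(ref, hyp):
    """WER = (S + D + I) / N via Levenshtein distance over word lists."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```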
4.4 Comparison with State-of-the-Art Methods
The comparative results on LRW and sentence-level benchmarks are summarized in Tables 3 and 4. NestLipGNN achieves a top-1 accuracy of 92.3% on LRW, corresponding to an absolute gain of 2.1% over the previous state-of-the-art, while surpassing all methods in top-5 accuracy. On LRS2, LRS3, and GRID, it achieves WERs of 22.8%, 28.7%, and 0.8%, respectively, demonstrating consistent superiority across diverse data regimes. We report mean and standard deviation from five independent training runs with different random seeds, and all improvements over the strongest baseline are statistically significant (paired t-test,
We conduct component-wise ablation studies to quantify the contribution of each module. As shown in Table 5, the baseline ResNet+LSTM achieves 85.2% accuracy on LRW. Introducing the spatial graph structure improves performance by 2.2 percentage points to 87.4%, demonstrating the value of explicitly modeling anatomical connectivity. Incorporating graph attention (GAT) yields a further 0.7% gain, highlighting the benefit of adaptive neighborhood weighting. The addition of the Temporal Graph Attention (T-GAT) module improves performance by 2.4% to 89.8%, illustrating the importance of modeling long-range temporal dynamics. The nested learning scheme contributes an additional 1.7% to 91.5%, validating the effectiveness of multi-level supervision. Finally, adding the graph contrastive objective enhances accuracy by 0.8% to the final result of 92.3%, confirming its role in resolving homophene ambiguities.

Table 6 presents ablation results on sentence-level datasets, demonstrating that the component contributions are consistent across both word-level and sentence-level tasks.

We further analyze the impact of graph topology design, as shown in Fig. 6. The proposed dynamic adjacency matrix (“Full”) outperforms a static anatomical graph by 2.8 percentage points, demonstrating the benefit of learning task-specific connections. The use of both inner and outer lip landmarks (“Full”) achieves 92.3% accuracy, while removing inner landmarks (“No-Inner”) causes a 5.1% drop, confirming that inner lip motion encodes discriminative information critical for phoneme resolution. Randomly wired graphs (“Random”) and dense connectivity (“Dense”) underperform, suggesting that sparse, learned topology is essential.

Figure 6: Effect of different graph structure designs on LRW accuracy. “Full”: proposed learnable adjacency; “Static”: fixed anatomical graph; “Sparse/Dense”: different edge densities; “No-Inner”: only outer lip landmarks; “Random”: randomly wired graph.
The contribution of nested supervision is further evaluated in Table 7. Training only at the word level yields 89.1% accuracy. Adding phoneme supervision improves performance to 90.5%, and adding viseme supervision yields 90.8%. Joint supervision across all three levels increases accuracy to 91.5%, and the inclusion of the consistency loss between levels improves it further to 92.3%. This confirms that hierarchical alignment and multi-level supervision enhance representation learning.

The training curves in Fig. 7 show that NestLipGNN converges faster and reaches a higher terminal accuracy than ablations. The baseline reaches 85.2% at epoch 80, while removing nested learning reduces final accuracy to 89.8% and removing contrastive learning to 91.2%. NestLipGNN, with all components, achieves 92.3%, validating the synergy of the components.

Figure 7: Training curves for full NestLipGNN vs. ablations without nested learning, without contrastive loss, and the CNN+LSTM baseline. NestLipGNN achieves both faster convergence and higher final accuracy.
4.6 Robustness to Landmark Noise
Since our method depends on external landmark detection, we evaluate robustness to landmark localization errors by adding Gaussian noise to detected landmark positions. Table 8 shows that NestLipGNN maintains competitive performance even under significant landmark perturbation. With noise standard deviation

As shown in Table 9, NestLipGNN operates with only 28.6 million parameters and 15.4 billion FLOPs per forward pass, striking a favorable balance between performance and efficiency. Its inference speed of 118 FPS on a single GPU outperforms Transformers and Conformers while maintaining lower memory usage, making it suitable for real-time applications.

4.7 Attention Visualization and Interpretability
The attention maps in Fig. 8 reveal that Temporal Graph Attention (T-GAT) concentrates on salient articulatory transitions such as lip opening at vowel onset and closure at stops. Spatial attention highlights the lip corners and oral aperture, which articulatory phonetics identifies as the most discriminative regions for distinguishing consonants such as /b/, /p/, and /m/. This indicates that NestLipGNN learns biologically plausible attention patterns, lending the model a degree of interpretability.

Figure 8: Visualization of attention patterns. (a) Temporal attention matrix highlighting concentration on salient articulatory frames. (b) Spatial attention over lip landmarks; larger and darker circles denote higher attention, underscoring focus on lip corners and aperture regions that are crucial for discriminating speech sounds.
We evaluate cross-dataset generalization by training NestLipGNN on LRS3 and testing on LRS2, GRID, and LRW. As shown in Table 10, NestLipGNN achieves superior transfer performance compared to prior methods, with lower WER on LRS2 and GRID and higher accuracy on LRW, indicating that its graph-based representations capture articulatory dynamics that generalize across disparate datasets and recording conditions.

The strong empirical performance of our graph-based lip representation can be attributed to several central factors. First, the structural inductive bias introduced by explicitly encoding lip topology incorporates prior knowledge of facial anatomy, constraining the hypothesis space and improving sample efficiency [51], which is especially beneficial in the low-resource regime of VSR. Second, the geometric equivariance of graph-based representations offers robustness to transformations such as translations and small rotations while preserving essential structural information, thereby mitigating sensitivity to head pose variations. Third, computational efficiency arises from operating on a sparse 20-node graph instead of dense
The nested learning paradigm confers multiple benefits. Supervising the model simultaneously at phoneme, viseme, and word levels supplies learning signals across a range of linguistic granularities, enabling the network to capture both fine-scale articulatory dynamics and higher-level semantic regularities. Supervision at lower levels also functions as an implicit form of regularization, providing auxiliary tasks that mitigate overfitting to word-level labels and enhance generalization via shared internal representations.
Additionally, knowledge obtained at lower layers of the hierarchy, such as accurate phoneme recognition, naturally transfers to higher-level objectives like word recognition, mirroring the compositional structure of human speech perception. The nested optimization scheme also functions as a curriculum: it establishes robust phoneme-level features before tackling the more complex word recognition task, helping the optimizer traverse the loss landscape more effectively.
Despite strong empirical results, the approach has several limitations. The method relies on accurate landmark detection, which may degrade under extreme poses, occlusions, or poor imaging conditions. Our robustness analysis (Section 4.6) shows graceful degradation under moderate noise, but severe landmark failures would still affect performance. Studies on visual contribution in audio-visual systems [52] suggest that identifying the most informative visual cues could guide more robust landmark selection. Future work may explore joint training with differentiable landmark prediction modules, while viseme-guided generation methods [13] indicate promising directions for improving landmark stability. Additionally, although edge weights are learned, the node set is fixed; adaptive graph construction could further enhance representation.
Our evaluation mainly considers near-frontal views, so extending the framework to multi-view settings or 3D lip modeling could improve real-world applicability. Performance also varies across speakers with different articulatory styles, suggesting that few-shot speaker adaptation or personalization may improve robustness. Finally, although the graph-based temporal modeling could potentially generalize to other graph-structured tasks (e.g., gesture or action recognition), this study focuses solely on visual speech recognition. Validating broader applicability would require additional experiments and domain-specific adaptations.
To provide concrete evidence of the framework’s effectiveness in homophene disambiguation, we analyze performance on specific homophene groups. Table 11 shows confusion rates for challenging word pairs before and after applying graph contrastive learning. The contrastive objective reduces confusion between visually similar words such as “pat”/“bat”/“mat” (bilabial consonants) by 12.3% on average, and between “sip”/“zip” (alveolar fricatives) by 8.7%. Qualitative analysis of attention patterns reveals that for homophene pairs, the model learns to focus on subtle differences in lip aperture timing and inner lip visibility that distinguish these otherwise similar articulations.
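The graph contrastive objective used for homophene resolution can be sketched with a generic InfoNCE-style loss over pooled graph embeddings: utterances of the same word are pulled together while visually similar homophenes serve as hard negatives. This is a standard formulation under our assumptions, not the paper's exact loss.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss on L2-normalized embeddings.

    `anchor` and `positive` are embeddings of the same spoken word;
    `negatives` are embeddings of homophenes (hard negatives).
    Lower loss means the anchor is closer to its positive than to
    any negative in cosine similarity.
    """
    def norm(v):
        return v / np.linalg.norm(v)
    a = norm(anchor)
    # Positive similarity first, then negatives.
    sims = [a @ norm(positive)] + [a @ norm(n) for n in negatives]
    logits = np.array(sims) / temperature
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[0]

a = np.array([1.0, 0.0])
loss_easy = info_nce(a, np.array([0.9, 0.1]), [np.array([0.0, 1.0])])
loss_hard = info_nce(a, np.array([0.0, 1.0]), [np.array([0.9, 0.1])])
```

Minimizing this loss over “pat”/“bat”/“mat”-style groups directly rewards embeddings that separate words differing only in subtle cues such as lip aperture timing.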

This paper introduced NestLipGNN, a framework that integrates Graph Neural Networks with nested multi-granularity learning for visual speech recognition. Our approach builds dynamic lip graphs from anatomical facial landmarks, processes them using spatial and temporal graph attention networks, and produces predictions through a hierarchical nested learning scheme with phoneme-, viseme-, and word-level supervision. A graph contrastive learning objective further strengthens the model’s discriminative power, particularly for resolving homophenes.
Extensive experiments on the LRW, LRS2, LRS3, and GRID benchmarks demonstrate strong performance, achieving 92.3% word accuracy on LRW and delivering a 2.1% absolute improvement over previous approaches. Detailed ablation studies confirm the usefulness of each architectural module. The primary contribution of this work lies in the principled integration of graph-based spatial modeling, temporal attention, hierarchical linguistic supervision, and contrastive learning into a unified framework specifically designed for VSR, rather than in the novelty of individual components. In the future, we plan to extend NestLipGNN to continuous speech recognition, incorporate multi-view and 3D lip modeling, and integrate audio information for robust audio-visual speech recognition.
Acknowledgement: Not applicable.
Funding Statement: This work is funded by Ho Chi Minh City Open University (HCMCOU) and the Ministry of Education and Training (Vietnam) under grant number B2025-MBS-01.
Author Contributions: Vinh Truong Hoang: Conceptualization, Methodology, Writing—Original Draft. Nghia Dinh: Software, Validation. Luu Quang Phuong: Data Curation, Visualization. Kiet Tran-Trung: Formal Analysis. Ha Duong Thi Hong: Investigation. Bay Nguyen Van: Resources. Hau Nguyen Trung: Writing—Review & Editing. Thien Ho Huong: Supervision, Project Administration. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: The LRW, LRS2, LRS3, and GRID datasets used in this study are publicly available from their respective sources. Source code will be made available upon request.
Ethical Approval: This study uses publicly available benchmark datasets, which the original creators collected with appropriate ethical approvals. No additional ethics approval was required for this work.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Kania J, Usha B, Haleritti B. LIP NET reading using deep learning. In: 2025 9th International Conference on Computational System and Information Technology for Sustainable Solutions (CSITSS). Piscataway, NJ, USA: IEEE; 2025. p. 1–6. [Google Scholar]
2. Park YH, Park RH, Park HM. Swinlip: an efficient visual speech encoder for lip reading using swin transformer. Neurocomputing. 2025;639:130289. [Google Scholar]
3. Denby B, Schultz T, Honda K, Hueber T, Gilbert JM, Brumberg JS. Silent speech interfaces. Speech Commun. 2010;52(4):270–87. doi:10.1016/j.specom.2009.08.002. [Google Scholar] [CrossRef]
4. Bear HL, Harvey RW. Phoneme-to-viseme mappings: the good, the bad, and the ugly. Speech Commun. 2017;95:40–67. [Google Scholar]
5. Afouras T, Chung JS, Zisserman A. Deep lip reading: a comparison of models and an online application. In: Interspeech 2018; 2018 Sep 2–6; Hyderabad, India. p. 3514–8. [Google Scholar]
6. Deshpande S, Shirsath K, Pashte A, Loya P, Shingade S, Sambhe V. A comprehensive survey of advancement in lip reading models: techniques and future directions. IET Image Process. 2025;19(1):e70095. doi:10.1049/ipr2.70095. [Google Scholar] [CrossRef]
7. Fisher CG. Confusions among visually perceived consonants. J Speech Hear Res. 1968;11(4):796–804. doi:10.1044/jshr.1104.796. [Google Scholar] [PubMed] [CrossRef]
8. Petridis S, Stafylakis T, Ma P, Cai F, Tzimiropoulos G, Pantic M. End-to-end audiovisual speech recognition. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Piscataway, NJ, USA: IEEE; 2018. p. 6548–52. [Google Scholar]
9. Stafylakis T, Tzimiropoulos G. Combining residual networks with LSTMs for lipreading. In: Interspeech 2017; 2017 Aug 20–24; Stockholm, Sweden. p. 3652–6. doi:10.21437/interspeech.2017-85. [Google Scholar] [CrossRef]
10. Jyoshna B, Parthu V, Pranavi D, Hansitha K, Sharan PE, Charandheep T. Lip sync to speech conversion using deep learning. In: Computational techniques and smart manufacturing. Boca Raton, FL, USA: CRC Press; 2026. p. 604–13. doi:10.1201/9781003679622-70. [Google Scholar] [CrossRef]
11. Ma P, Haliassos A, Fernandez-Lopez A, Chen H, Petridis S, Pantic M. Auto-AVSR: audio-visual speech recognition with automatic labels. In: Proceedings of the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Piscataway, NJ, USA: IEEE; 2023. p. 1–5. [Google Scholar]
12. Wang T, Yang S, Shan S, Chen X. GLip: a global-local integrated progressive framework for robust visual speech recognition. arXiv:2509.16031. 2025. [Google Scholar]
13. Hao B, Zhou D, Li X, Zhang X, Xie L, Wu J, et al. LipGen: viseme-guided lip video generation for enhancing visual speech recognition. In: Proceedings of the 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Piscataway, NJ, USA: IEEE; 2025. p. 1–5. [Google Scholar]
14. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907. 2017. [Google Scholar]
15. Veličković P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y. Graph attention networks. arXiv:1710.10903. 2018. [Google Scholar]
16. Xu K, Hu W, Leskovec J, Jegelka S. How powerful are graph neural networks? arXiv:1810.00826. 2019. [Google Scholar]
17. Yan S, Xiong Y, Lin D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI’18/IAAI’18/EAAI’18: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence. Palo Alto, CA, USA: AAAI Press; 2018. p. 7444–52. [Google Scholar]
18. Rathipriya N, Maheswari N. A comprehensive review of recent advances in deep neural networks for lipreading with sign language recognition. IEEE Access. 2024;12(1):136846–79. doi:10.1109/access.2024.3463969. [Google Scholar] [CrossRef]
19. Franceschi L, Frasconi P, Salzo S, Grazzi R, Pontil M. Bilevel programming for hyperparameter optimization and meta-learning. In: Proceedings of the 35th International Conference on Machine Learning. London, UK: PMLR; 2018. p. 1568–77. [Google Scholar]
20. Bear HL, Harvey R. Decoding visemes: improving machine lip-reading. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Shanghai, China: IEEE; 2016. p. 2009–13. doi:10.1109/ICASSP.2016.7472029. [Google Scholar] [CrossRef]
21. Liu R, Yuan H, Gao G, Li H. Listening and seeing again: generative error correction for audio-visual speech recognition. Inf Fusion. 2025;120:103077. [Google Scholar]
22. Matthews I, Cootes TF, Bangham JA, Cox S, Harvey R. Extraction of visual features for lipreading. IEEE Trans Pattern Anal Mach Intell. 2002;24(2):198–213. doi:10.1109/34.982900. [Google Scholar] [CrossRef]
23. Neti C, Potamianos G, Luettin J, Matthews I, Glotin H, Vergyri D, et al. Audio visual speech recognition. In: Proceeding of the ISCA Tutorial and Research Workshop on Multi-Modal Dialogue in Mobile Environments; 2002 Jun 17–21; Kloster Irsee, Germany. [Google Scholar]
24. Kim M, Yeo JH, Ro YM. Distinguishing homophenes using multi-head visual-audio memory for lip reading. Proc AAAI Conf Artif Intell. 2022;36(1):1174–82. [Google Scholar]
25. Ma P, Petridis S, Pantic M. Visual speech recognition for multiple languages in the wild. Nat Mach Intell. 2022;4(11):930–9. doi:10.1038/s42256-022-00550-z. [Google Scholar] [CrossRef]
26. Shi B, Hsu WN, Lakhotia K, Mohamed A. AV-HuBERT: learning audio-visual speech representation by masked multimodal cluster prediction. In: Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2022. p. 14130–41. [Google Scholar]
27. Chang O, Liao H, Serdyuk D, Shah A, Siohan O. Conformer is all you need for visual speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ, USA: IEEE; 2024. p. 1–5. [Google Scholar]
28. Hamilton WL, Ying R, Leskovec J. Inductive representation learning on large graphs. In: NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2017. p. 1024–34. [Google Scholar]
29. Sareddy MR, Kumar RV, Thanjaivadivel M. Enhanced visual-NLP systems using knowledge graphs, meta-learning, and adaptive attention networks. In: Innovations in Computational Intelligence and Computer Vision (ICICV 2025). Cham, Switzerland: Springer; 2026. p. 17–24. doi:10.1007/978-3-032-09825-2_2. [Google Scholar] [CrossRef]
30. Lorraine J, Vicol P, Duvenaud D. Optimizing millions of hyperparameters by implicit differentiation. In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS). London, UK: PMLR; 2020. p. 1540–52. [Google Scholar]
31. Liu H, Simonyan K, Yang Y. DARTS: differentiable architecture search. arXiv:1806.09055. 2019. [Google Scholar]
32. Chan W, Jaitly N, Le Q, Vinyals O. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Piscataway, NJ, USA: IEEE; 2016. p. 4960–4. [Google Scholar]
33. Graves A. Sequence transduction with recurrent neural networks. arXiv:1211.3711. 2012. [Google Scholar]
34. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in Neural Information Processing Systems 30 (NeurIPS 2017). Red Hook, NY, USA: Curran Associates, Inc.; 2017. p. 5998–6008. [Google Scholar]
35. He K, Fan H, Wu Y, Xie S, Girshick R. Momentum contrast for unsupervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2020. p. 9729–38. [Google Scholar]
36. Baevski A, Zhou Y, Mohamed A, Auli M. wav2vec 2.0: a framework for self-supervised learning of speech representations. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS). Red Hook, NY, USA: Curran Associates, Inc.; 2020. p. 12449–60. [Google Scholar]
37. Xia W, Huang J, Garcia-Perera LP. Self-supervised text-independent speaker verification using prototypical momentum contrastive learning. In: Proceedings of the 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Piscataway, NJ, USA: IEEE; 2021. p. 6723–7. [Google Scholar]
38. Tsiamas I, Pascual S, Yeh C, Serrà J. Sequential contrastive audio-visual learning. arXiv:2407.05782. 2024. [Google Scholar]
39. Bulat A, Tzimiropoulos G. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In: Proceedings of the IEEE International Conference on Computer Vision. Piscataway, NJ, USA: IEEE; 2017. p. 1021–30. [Google Scholar]
40. Peyré G, Cuturi M, Solomon J. Gromov-wasserstein averaging of kernel and distance matrices. In: Proceedings of the 33rd International Conference on Machine Learning. Red Hook, NY, USA: Curran Associates, Inc.; 2016. p. 2664–72. [Google Scholar]
41. Chung JS, Zisserman A. Lip reading in the wild. In: Computer Vision—ACCV 2016 (ACCV 2016). Cham, Switzerland: Springer; 2016. p. 87–103. [Google Scholar]
42. Afouras T, Chung JS, Zisserman A. LRS3-TED: a large-scale dataset for visual speech recognition. arXiv:1809.00496. 2018. [Google Scholar]
43. Cooke M, Barker J, Cunningham S, Shao X. An audio-visual corpus for speech perception and automatic speech recognition. J Acoust Soc Am. 2006;120(5):2421–4. doi:10.1121/1.2229005. [Google Scholar] [PubMed] [CrossRef]
44. McAuliffe M, Socolof M, Mihuc S, Wagner M, Sonderegger M. Montreal forced aligner: trainable text-speech alignment using Kaldi. In: Interspeech 2017; 2017 Aug 20–24; Stockholm, Sweden; 2017. p. 498–502. doi:10.21437/interspeech.2017-1386. [Google Scholar] [CrossRef]
45. Ren Z, Liu X, Xu J, Zhang Y, Fang M. Littlefacenet: a small-sized face recognition method based on retinaface and adaface. J Imaging. 2025;11(1):24. [Google Scholar] [PubMed]
46. Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv:1711.05101. 2019. [Google Scholar]
47. Zhang H, Cisse M, Dauphin YN, Lopez-Paz D. Mixup: beyond empirical risk minimization. arXiv:1710.09412. 2018. [Google Scholar]
48. Park DS, Chan W, Zhang Y, Chiu CC, Zoph B, Cubuk ED, et al. SpecAugment: a simple data augmentation method for automatic speech recognition. In: Interspeech 2019; 2019 Sep 15–19; Graz, Austria. p. 2613–7. doi:10.21437/interspeech.2019-2680. [Google Scholar] [CrossRef]
49. Ma P, Martinez B, Petridis S, Pantic M. End-to-end audio-visual speech recognition with conformers. In: Proceedings of the 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway, NJ, USA: IEEE; 2021. p. 7613–7. [Google Scholar]
50. Haliassos A, Ma P, Mira R, Petridis S, Pantic M. Jointly learning visual and auditory speech representations from raw data. arXiv:2212.06246. 2022. [Google Scholar]
51. Battaglia PW, Hamrick JB, Bapst V, Sanchez-Gonzalez A, Zambaldi V, Malinowski M, et al. Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261. 2018. [Google Scholar]
52. Lin Z, Harte N. Uncovering the visual contribution in audio-visual speech recognition. In: Proceedings of the 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Piscataway, NJ, USA: IEEE; 2025. p. 1–5. [Google Scholar]
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.