TY  - EJOU
AU  - Hoang, Vinh Truong
AU  - Dinh, Nghia
AU  - Phuong, Luu Quang
AU  - Tran-Trung, Kiet
AU  - Hong, Ha Duong Thi
AU  - Van, Bay Nguyen
AU  - Trung, Hau Nguyen
AU  - Huong, Thien Ho

TI  - NestLipGNN: A Hierarchical Graph Neural Network Framework with Nested Multi-Granularity Learning for Robust Visual Speech Recognition
T2  - Computers, Materials & Continua

PY  - 
VL  - 
IS  - 
SN  - 1546-2226

AB  - Visual speech recognition (VSR) aims to infer spoken content from visual observations of articulatory movements. Despite significant progress, it remains a challenging task in computer vision and speech processing. Its difficulty arises from pronounced speaker-to-speaker variability, the presence of homophenes (phonemes that are visually indistinguishable), changes in illumination, and the intrinsically high-dimensional nature of spatiotemporal lip dynamics. In this work, we propose NestLipGNN, a graph-based framework that integrates Graph Neural Networks (GNNs) with a nested multi-granularity learning strategy for visual speech recognition. We construct dynamic lip graphs from facial landmarks to model both spatial relationships between lip regions and their temporal motion during speech articulation. The proposed nested learning architecture supports hierarchical feature extraction across several levels of linguistic abstraction, spanning phoneme-level articulatory units, viseme-level visual speech categories, and word-level semantic representations. We further introduce a Temporal Graph Attention mechanism (T-GAT) that adaptively reweights the importance of distinct lip regions over time. We also introduce a graph-based contrastive learning objective to improve the discrimination of visually similar speech patterns, directly confronting the challenge of homophene resolution. Experiments on the LRW, LRS2, LRS3, and GRID datasets show that NestLipGNN improves recognition accuracy compared with existing methods, obtaining 92.3% word-level accuracy on LRW and delivering a 2.1% absolute performance gain over prior methods. Comprehensive ablation analyses confirm the contribution of each architectural component.
KW  - Visual speech recognition; graph neural networks; nested optimization; hierarchical representation learning; spatiotemporal modeling; contrastive learning; lip reading

DO  - 10.32604/cmc.2026.078089