Open Access

ARTICLE

NestLipGNN: A Hierarchical Graph Neural Network Framework with Nested Multi-Granularity Learning for Robust Visual Speech Recognition

Vinh Truong Hoang*, Nghia Dinh, Luu Quang Phuong, Kiet Tran-Trung, Ha Duong Thi Hong, Bay Nguyen Van, Hau Nguyen Trung, Thien Ho Huong

AI Lab, Faculty of Information Technology, Ho Chi Minh City Open University, 35-37 Ho Hao Hon Street, Co Giang Ward, District 1, Ho Chi Minh City, Vietnam

* Corresponding Author: Vinh Truong Hoang.

(This article belongs to the Special Issue: Artificial Intelligence in Visual and Audio Signal Processing)

Computers, Materials & Continua 2026, 88(1), 54. https://doi.org/10.32604/cmc.2026.078089

Abstract

Visual speech recognition (VSR) aims to infer spoken content from visual observations of articulatory movements. Despite significant progress, it remains a challenging task in computer vision and speech processing. Its difficulty arises from pronounced speaker-to-speaker variability, the presence of homophenes (distinct phonemes or words that appear identical on the lips), changes in illumination, and the intrinsically high-dimensional nature of spatiotemporal lip dynamics. In this work, we propose NestLipGNN, a graph-based framework that integrates Graph Neural Networks (GNNs) with a nested multi-granularity learning strategy for visual speech recognition. We construct dynamic lip graphs from facial landmarks to model both the spatial relationships between lip regions and their temporal motion during speech articulation. The proposed nested learning architecture supports hierarchical feature extraction across several levels of linguistic abstraction, spanning phoneme-level articulatory units, viseme-level visual speech categories, and word-level semantic representations. We further introduce a Temporal Graph Attention mechanism (T-GAT) that adaptively reweights the importance of distinct lip regions over time, and we incorporate a graph-based contrastive learning objective that improves the discrimination of visually similar speech patterns, directly confronting the challenge of homophene resolution. Experiments on the LRW, LRS2, LRS3, and GRID datasets show that NestLipGNN improves recognition accuracy compared with existing methods, obtaining 92.3% word-level accuracy on LRW and delivering a 2.1% absolute gain over prior methods. Comprehensive ablation analyses confirm the contribution of each architectural component.
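To make the two core ingredients of the abstract concrete, the sketch below illustrates (in heavily simplified form) what a dynamic lip graph built from facial landmarks and a temporal reweighting of per-node features can look like. This is a minimal NumPy illustration, not the paper's implementation: the k-nearest-neighbour graph construction, the function names `build_lip_graph` and `temporal_attention`, and the norm-based attention scores are all assumptions made for exposition; the actual T-GAT is a learned attention mechanism.

```python
import numpy as np

def build_lip_graph(landmarks, k=3):
    """Build a symmetric k-nearest-neighbour adjacency over 2-D lip landmarks.

    landmarks: (N, 2) array of landmark coordinates for one frame.
    Returns an (N, N) 0/1 adjacency matrix (no self-loops).
    """
    n = len(landmarks)
    # Pairwise Euclidean distances between landmarks.
    d = np.linalg.norm(landmarks[:, None, :] - landmarks[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude self-distance
    adj = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d[i])[:k]:  # k closest landmarks to node i
            adj[i, j] = adj[j, i] = 1.0  # symmetrise
    return adj

def temporal_attention(node_feats):
    """Toy stand-in for temporal reweighting: a softmax over frame-wise
    feature norms pools each node's features across T frames.

    node_feats: (T, N, F) features for N graph nodes over T frames.
    Returns (N, F) temporally pooled node features.
    """
    scores = np.linalg.norm(node_feats, axis=-1)         # (T, N) per-frame salience
    scores = scores - scores.max(axis=0, keepdims=True)  # stabilise the softmax
    w = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return (w[..., None] * node_feats).sum(axis=0)       # attention-weighted sum
```

In the paper's setting the adjacency would evolve per frame (hence "dynamic" lip graphs) and the attention weights would be learned end-to-end rather than derived from feature norms.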

Keywords

Visual speech recognition; graph neural networks; nested optimization; hierarchical representation learning; spatiotemporal modeling; contrastive learning; lip reading

Cite This Article

APA Style
Hoang, V.T., Dinh, N., Phuong, L.Q., Tran-Trung, K., Hong, H.D.T. et al. (2026). NestLipGNN: A Hierarchical Graph Neural Network Framework with Nested Multi-Granularity Learning for Robust Visual Speech Recognition. Computers, Materials & Continua, 88(1), 54. https://doi.org/10.32604/cmc.2026.078089
Vancouver Style
Hoang VT, Dinh N, Phuong LQ, Tran-Trung K, Hong HDT, Van BN, et al. NestLipGNN: A Hierarchical Graph Neural Network Framework with Nested Multi-Granularity Learning for Robust Visual Speech Recognition. Comput Mater Contin. 2026;88(1):54. https://doi.org/10.32604/cmc.2026.078089
IEEE Style
V. T. Hoang et al., “NestLipGNN: A Hierarchical Graph Neural Network Framework with Nested Multi-Granularity Learning for Robust Visual Speech Recognition,” Comput. Mater. Contin., vol. 88, no. 1, Art. no. 54, 2026. https://doi.org/10.32604/cmc.2026.078089



Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.