
Open Access

ARTICLE

NestLipGNN: A Hierarchical Graph Neural Network Framework with Nested Multi-Granularity Learning for Robust Visual Speech Recognition

Vinh Truong Hoang*, Nghia Dinh, Luu Quang Phuong, Kiet Tran-Trung, Ha Duong Thi Hong, Bay Nguyen Van, Hau Nguyen Trung, Thien Ho Huong
AI Lab, Faculty of Information Technology, Ho Chi Minh City Open University, 35-37 Ho Hao Hon Street, Co Giang Ward, District 1, Ho Chi Minh City, Vietnam
* Corresponding Author: Vinh Truong Hoang
(This article belongs to the Special Issue: Artificial Intelligence in Visual and Audio Signal Processing)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.078089

Received 23 December 2025; Accepted 25 March 2026; Published online 27 April 2026

Abstract

Visual speech recognition (VSR) aims to infer spoken content from visual observations of articulatory movements. Despite significant progress, it remains a challenging task in computer vision and speech processing. Its difficulty arises from pronounced speaker-to-speaker variability, the presence of homophenes (speech sounds that are visually indistinguishable), changes in illumination, and the intrinsically high-dimensional nature of spatiotemporal lip dynamics. In this work, we propose NestLipGNN, a graph-based framework that integrates Graph Neural Networks (GNNs) with a nested multi-granularity learning strategy for visual speech recognition. We construct dynamic lip graphs from facial landmarks to model both the spatial relationships between lip regions and their temporal motion during speech articulation. The proposed nested learning architecture supports hierarchical feature extraction across several levels of linguistic abstraction, spanning phoneme-level articulatory units, viseme-level visual speech categories, and word-level semantic representations. We further introduce a Temporal Graph Attention mechanism (T-GAT) that adaptively reweights the importance of distinct lip regions over time, as well as a graph-based contrastive learning objective that improves the discrimination of visually similar speech patterns, directly confronting the challenge of homophene resolution. Experiments on the LRW, LRS2, LRS3, and GRID datasets show that NestLipGNN improves recognition accuracy over existing methods, achieving 92.3% word-level accuracy on LRW, a 2.1% absolute gain over prior methods. Comprehensive ablation analyses confirm the contribution of each architectural component.
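As a rough illustration of the two graph-centric ideas named above, the following minimal sketch (not the authors' code; all names and parameters such as `knn_adjacency`, `k`, and the toy dimensions are illustrative assumptions) builds a per-frame lip graph by connecting each landmark to its spatial nearest neighbours, then applies a softmax attention over frame-level features, loosely mirroring the dynamic lip-graph construction and the T-GAT-style temporal reweighting:

```python
# Hedged sketch: k-nearest-neighbour lip-graph adjacency from landmark
# coordinates, plus softmax temporal attention pooling over frame features.
import numpy as np

def knn_adjacency(landmarks: np.ndarray, k: int = 3) -> np.ndarray:
    """Connect each lip landmark to its k nearest spatial neighbours (symmetric)."""
    n = landmarks.shape[0]
    dists = np.linalg.norm(landmarks[:, None] - landmarks[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # no self-loops
    adj = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(dists[i])[:k]:   # k closest landmarks to i
            adj[i, j] = adj[j, i] = 1.0
    return adj

def temporal_attention(frame_feats: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Softmax-weighted pooling of per-frame graph features over time."""
    scores = frame_feats @ query             # one relevance score per frame, shape (T,)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ frame_feats             # attention-pooled feature, shape (D,)

# Toy example: 20 lip landmarks in 2-D, 5 frames with 8-D embeddings.
rng = np.random.default_rng(0)
adj = knn_adjacency(rng.normal(size=(20, 2)), k=3)
pooled = temporal_attention(rng.normal(size=(5, 8)), rng.normal(size=8))
print(adj.shape, pooled.shape)
```

In a full model the adjacency would feed a GNN message-passing layer per frame, and the attention query would be learned rather than fixed; this sketch only shows the data-shaping skeleton.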

Keywords

Visual speech recognition; graph neural networks; nested optimization; hierarchical representation learning; spatiotemporal modeling; contrastive learning; lip reading