
Open Access

ARTICLE

SQSNet: Hybrid CNN-Transformer Fusion with Spatial Quad-Similarity for Robust Facial Expression Recognition

Mohammed A. Ahmed1, Jian Dong2,*, Ronghua Shi2, Ammar Nassr3, Hani Almaqtari3, Ala A. Alsanabani3
1 School of Computer Science and Engineering, Central South University, Changsha, China
2 School of Electronic Information, Central South University, Changsha, China
3 School of Artificial Intelligence, Xidian University, Xi’an, China
* Corresponding Author: Jian Dong
(This article belongs to the Special Issue: Advancements in Pattern Recognition through Machine Learning: Bridging Innovation and Application)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.075616

Received 04 November 2025; Accepted 04 February 2026; Published online 11 March 2026

Abstract

Facial Expression Recognition (FER) is a core task in computer vision, with applications in human-computer interaction, emotion assessment, and mental health monitoring. Although Convolutional Neural Networks (CNNs) have proven effective for FER, they struggle to capture long-range dependencies and global context. To address these limitations, we propose the Spatial Quad-Similarity Network (SQSNet), a hybrid framework that integrates the local feature extraction capabilities of CNNs with the global contextual modeling of Swin Transformers via a cohesive fusion mechanism. SQSNet introduces the Spatial Quad-Similarity (SQS) module, a feature refinement approach that amplifies discriminative characteristics and mitigates redundancy. Unlike conventional metric learning approaches that operate on global feature representations, SQS computes fine-grained spatial-level similarity across multiple instances, enforcing H × W independent constraints that preserve spatial correspondence between expression-relevant facial regions. This spatial-level formulation is particularly effective for FER, where expressions manifest as localized muscle movements whose signal is lost in global pooling operations. Moreover, SQSNet employs regularization methods, including Mixup augmentation, label smoothing, and adaptive learning rate scheduling, to enhance generalization. Experiments on three benchmark datasets, RAF-DB, FERPlus, and AffectNet, show that SQSNet surpasses current FER methods, attaining state-of-the-art accuracies of 91.90%, 91.11%, and 67.15%, respectively. These findings underscore the efficacy of combining CNNs, Swin Transformers, and spatial similarity-driven feature refinement for facial expression recognition, facilitating the development of more dependable emotion recognition systems.
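To make the spatial-level formulation concrete, the sketch below illustrates the core idea of per-location similarity over an H × W grid, computed pairwise across a "quad" of four instances. This is not the paper's implementation; the function name, tensor shapes, and cosine-similarity choice are assumptions used only to show how H × W independent similarity constraints differ from a single globally pooled similarity score.

```python
import numpy as np

def spatial_similarity(feat_a, feat_b, eps=1e-8):
    """Per-location cosine similarity between two (C, H, W) feature maps.

    Returns an (H, W) map, i.e. H*W independent similarity values,
    one per spatial position, rather than one global scalar.
    """
    num = (feat_a * feat_b).sum(axis=0)                      # (H, W) dot products
    den = np.linalg.norm(feat_a, axis=0) * np.linalg.norm(feat_b, axis=0) + eps
    return num / den

# Hypothetical quad: four instances' feature maps (C=64, H=W=7).
rng = np.random.default_rng(0)
feats = [rng.standard_normal((64, 7, 7)) for _ in range(4)]

# All 6 pairwise spatial similarity maps across the quad.
sims = [spatial_similarity(feats[i], feats[j])
        for i in range(4) for j in range(i + 1, 4)]
print(len(sims), sims[0].shape)  # 6 (7, 7)
```

Contrast this with global pooling: averaging each map down to a (C,) vector before comparison would collapse the 49 per-location constraints into one, discarding exactly the localized muscle-movement cues the abstract argues are expression-relevant.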

Keywords

Facial expression recognition; convolutional neural networks; Swin Transformers; cross attention; adaptive learning