
Open Access

ARTICLE

SQSNet: Hybrid CNN-Transformer Fusion with Spatial Quad-Similarity for Robust Facial Expression Recognition

Mohammed A. Ahmed1, Jian Dong2,*, Ronghua Shi2, Ammar Nassr3, Hani Almaqtari3, Ala A. Alsanabani3
1 School of Computer Science and Engineering, Central South University, Changsha, China
2 School of Electronic Information, Central South University, Changsha, China
3 School of Artificial Intelligence, Xidian University, Xi’an, China
* Corresponding Author: Jian Dong
(This article belongs to the Special Issue: Advancements in Pattern Recognition through Machine Learning: Bridging Innovation and Application)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.075616

Received 04 November 2025; Accepted 04 February 2026; Published online 11 March 2026

Abstract

Facial Expression Recognition (FER) is a core task in computer vision, with applications in human-computer interaction, emotion assessment, and mental health monitoring. Although Convolutional Neural Networks (CNNs) have proven effective for FER, they struggle to capture long-range dependencies and global context. To address these limitations, we propose the Spatial Quad-Similarity Network (SQSNet), a hybrid framework that integrates the local feature extraction capabilities of CNNs with the global contextual modeling of Swin Transformers via a cohesive fusion mechanism. SQSNet introduces the Spatial Quad-Similarity (SQS) module, a feature refinement approach that amplifies discriminative characteristics and mitigates redundancy. Unlike conventional metric learning approaches that operate on global feature representations, SQS computes fine-grained spatial-level similarity across multiple instances, enforcing H × W independent constraints that preserve spatial correspondence between expression-relevant facial regions. This spatial-level formulation is particularly effective for FER, where expressions manifest as localized muscle movements whose signal is lost in global pooling operations. Moreover, SQSNet employs regularization methods, including Mixup augmentation, label smoothing, and adaptive learning rate scheduling, to enhance generalization. Experiments on three benchmark datasets, RAF-DB, FERPlus, and AffectNet, show that SQSNet surpasses current FER methods, attaining state-of-the-art accuracies of 91.90%, 91.11%, and 67.15%, respectively. These findings underscore the efficacy of combining CNNs, Swin Transformers, and spatial similarity-driven feature refinement for facial expression recognition, facilitating the development of more dependable emotion recognition systems.
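To make the spatial-level formulation concrete, the sketch below illustrates the core idea of per-location similarity over an H × W grid, computed pairwise across a "quad" of four instances. This is not the paper's implementation; the function name, tensor shapes, and cosine-similarity choice are assumptions used only to show how H × W independent similarity constraints differ from a single globally pooled similarity score.

```python
import numpy as np

def spatial_similarity(feat_a, feat_b, eps=1e-8):
    """Per-location cosine similarity between two (C, H, W) feature maps.

    Returns an (H, W) map, i.e. H*W independent similarity values,
    one per spatial position, rather than one global scalar.
    """
    num = (feat_a * feat_b).sum(axis=0)                      # (H, W) dot products
    den = np.linalg.norm(feat_a, axis=0) * np.linalg.norm(feat_b, axis=0) + eps
    return num / den

# Hypothetical quad: four instances' feature maps (C=64, H=W=7).
rng = np.random.default_rng(0)
feats = [rng.standard_normal((64, 7, 7)) for _ in range(4)]

# All 6 pairwise spatial similarity maps across the quad.
sims = [spatial_similarity(feats[i], feats[j])
        for i in range(4) for j in range(i + 1, 4)]
print(len(sims), sims[0].shape)  # 6 (7, 7)
```

Contrast this with global pooling: averaging each map down to a (C,) vector before comparison would collapse the 49 per-location constraints into one, discarding exactly the localized muscle-movement cues the abstract argues are expression-relevant.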

Keywords

Facial expression recognition; convolutional neural networks; Swin Transformers; cross attention; adaptive learning