Open Access

ARTICLE

SQSNet: Hybrid CNN-Transformer Fusion with Spatial Quad-Similarity for Robust Facial Expression Recognition

Mohammed A. Ahmed1, Jian Dong2,*, Ronghua Shi2, Ammar Nassr3, Hani Almaqtari3, Ala A. Alsanabani3

1 School of Computer Science and Engineering, Central South University, Changsha, China
2 School of Electronic Information, Central South University, Changsha, China
3 School of Artificial Intelligence, Xidian University, Xi’an, China

* Corresponding Author: Jian Dong. Email: email

(This article belongs to the Special Issue: Advancements in Pattern Recognition through Machine Learning: Bridging Innovation and Application)

Computers, Materials & Continua 2026, 87(3), 72 https://doi.org/10.32604/cmc.2026.075616

Abstract

Facial Expression Recognition (FER) is a fundamental task in computer vision, with applications in human-computer interaction, emotion assessment, and mental health monitoring. Although Convolutional Neural Networks (CNNs) have proven effective for FER, they struggle to capture long-range dependencies and global context. To address these limitations, we propose the Spatial Quad-Similarity Network (SQSNet), a hybrid framework that integrates the local feature extraction capabilities of CNNs with the global contextual modeling of Swin Transformers via a cohesive fusion technique. SQSNet introduces the Spatial Quad-Similarity (SQS) module, a feature refinement approach that amplifies discriminative characteristics and mitigates redundancy. Unlike conventional metric learning approaches that operate on global feature representations, SQS computes fine-grained spatial-level similarity across multiple instances, enforcing H × W independent constraints that preserve spatial correspondence between expression-relevant facial regions. This spatial-level formulation is particularly effective for FER, where expressions manifest as localized muscle movements whose signal is lost in global pooling operations. Moreover, SQSNet employs established regularization methods, including Mixup augmentation, label smoothing, and adaptive learning rate scheduling, to enhance generalization. Experimental results on three benchmark datasets, RAF-DB, FERPlus, and AffectNet, indicate that SQSNet surpasses current FER methodologies, attaining state-of-the-art accuracies of 91.90%, 91.11%, and 67.15%, respectively. These findings underscore the efficacy of combining CNNs, Swin Transformers, and spatial similarity-driven feature refinement for facial expression recognition, facilitating the development of more dependable emotion recognition systems.
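The abstract describes the SQS module as computing similarity at each of the H × W spatial locations across multiple instances, rather than a single global score. The paper's exact formulation is not reproduced here; the following is a minimal sketch under the assumption that "quad" refers to four instances and that per-location similarity is cosine similarity between channel vectors. All function names are illustrative, not the authors' API.

```python
import numpy as np

def spatial_similarity(f_a, f_b, eps=1e-8):
    """Cosine similarity between two feature maps, computed independently
    at each of the H*W spatial locations (channel vectors per location)."""
    # f_a, f_b: arrays of shape (C, H, W)
    num = (f_a * f_b).sum(axis=0)                                   # (H, W)
    den = np.linalg.norm(f_a, axis=0) * np.linalg.norm(f_b, axis=0) + eps
    return num / den                                                # (H, W) similarity map

def quad_similarity_maps(feats):
    """Spatial similarity maps for every pair among four instances.
    Each pair yields an (H, W) map, i.e. H*W independent constraints,
    instead of one pooled scalar similarity."""
    assert len(feats) == 4, "quad formulation assumes four instances"
    maps = {}
    for i in range(4):
        for j in range(i + 1, 4):
            maps[(i, j)] = spatial_similarity(feats[i], feats[j])
    return maps

# Toy example: four random feature maps of shape (C=8, H=7, W=7).
rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, 7, 7)) for _ in range(4)]
sims = quad_similarity_maps(feats)
print(len(sims), sims[(0, 1)].shape)  # 6 pairs, each a (7, 7) map
```

The point of the per-location maps is that expression-relevant regions (e.g. mouth corners, brow area) keep their spatial identity during refinement, whereas global average pooling would collapse them before any similarity is computed.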

Keywords

Facial expression recognition; convolutional neural networks; swin transformers; cross attention; adaptive learning

Cite This Article

APA Style
Ahmed, M.A., Dong, J., Shi, R., Nassr, A., Almaqtari, H. et al. (2026). SQSNet: Hybrid CNN-Transformer Fusion with Spatial Quad-Similarity for Robust Facial Expression Recognition. Computers, Materials & Continua, 87(3), 72. https://doi.org/10.32604/cmc.2026.075616
Vancouver Style
Ahmed MA, Dong J, Shi R, Nassr A, Almaqtari H, Alsanabani AA. SQSNet: Hybrid CNN-Transformer Fusion with Spatial Quad-Similarity for Robust Facial Expression Recognition. Comput Mater Contin. 2026;87(3):72. https://doi.org/10.32604/cmc.2026.075616
IEEE Style
M. A. Ahmed, J. Dong, R. Shi, A. Nassr, H. Almaqtari, and A. A. Alsanabani, “SQSNet: Hybrid CNN-Transformer Fusion with Spatial Quad-Similarity for Robust Facial Expression Recognition,” Comput. Mater. Contin., vol. 87, no. 3, pp. 72, 2026. https://doi.org/10.32604/cmc.2026.075616



cc Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.