Open Access

ARTICLE

HERL-ViT: A Hybrid Enhanced Vision Transformer Based on Regional-Local Attention for Malware Detection

Boyan Cui1,2, Huijuan Wang1,*, Yongjun Qi1,*, Hongce Chen1, Quanbo Yuan1,3, Dongran Liu1, Xuehua Zhou1

1 School of Computer, North China Institute of Aerospace Engineering, Langfang, 065000, China
2 School of Automation, Southeast University, Nanjing, 210096, China
3 College of Intelligence and Computing, Tianjin University, Tianjin, 300072, China

* Corresponding Authors: Huijuan Wang; Yongjun Qi

(This article belongs to the Special Issue: Advances in Efficient Vision Transformers: Architectures, Optimization, and Applications)

Computers, Materials & Continua 2025, 85(3), 5531-5553. https://doi.org/10.32604/cmc.2025.070101

Abstract

The proliferation of malware and the emergence of adversarial samples pose severe threats to global cybersecurity and demand robust detection mechanisms. Traditional malware detection methods have limited feature extraction capabilities, while existing Vision Transformer (ViT)-based approaches incur high computational complexity from global self-attention, which limits their efficiency on large-scale image data. To address these issues, this paper proposes HERL-ViT, a hybrid enhanced Vision Transformer architecture tailored for malware detection. The detection framework has five phases: malware image visualization, image segmentation with patch embedding, regional-local attention-based feature extraction, enhanced feature transformation, and classification. HERL-ViT integrates a multi-level pyramid structure to capture multi-scale features, a regional-to-local attention mechanism to reduce computational complexity, an Optimized Position Encoding Generator (PEG) for dynamic relative position encoding, and enhanced MLP and downsampling modules to balance performance and efficiency. The key contributions are: (1) a unified framework that integrates visualization, adversarial training, and hybrid attention for malware detection; (2) regional-local attention that achieves both global awareness and local detail capture at lower complexity; (3) an Optimized PEG that enhances spatial perception and reduces overfitting; and (4) a lightweight network design (5.8M parameters) that ensures high efficiency. Experimental results show that HERL-ViT achieves 99.2% accuracy (Loss = 0.066) on malware classification and 98.9% accuracy (Loss = 0.081) on adversarial samples, demonstrating superior performance and robustness compared with state-of-the-art methods.
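As an illustration of the first phase named in the abstract, malware image visualization, the following is a minimal sketch of one common way to turn a binary into a grayscale image: each byte becomes one pixel intensity, and the byte stream is wrapped into rows of a fixed width. The abstract does not specify the procedure the authors use, so the function name `bytes_to_grayscale`, the default width of 256, and the zero-padding of the last row are all assumptions made for this example.

```python
import math


def bytes_to_grayscale(data: bytes, width: int = 256) -> list[list[int]]:
    """Map each byte to one pixel (0-255) and wrap rows at `width`.

    Hypothetical helper: the width and the padding rule are assumptions,
    not details taken from the paper.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Zero-pad the tail so the last row is full (assumed padding choice).
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Tiny usage example: 10 bytes wrapped into rows of 4 pixels.
image = bytes_to_grayscale(bytes(range(10)), width=4)
```

In practice the resulting 2D array would be resized to the network's input resolution before patch embedding; that step is omitted here because the abstract does not state the input size.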

Keywords

Malware detection; deep learning; counter-attacks; attention mechanisms; applications of artificial intelligence

Cite This Article

APA Style
Cui, B., Wang, H., Qi, Y., Chen, H., Yuan, Q. et al. (2025). HERL-ViT: A Hybrid Enhanced Vision Transformer Based on Regional-Local Attention for Malware Detection. Computers, Materials & Continua, 85(3), 5531–5553. https://doi.org/10.32604/cmc.2025.070101
Vancouver Style
Cui B, Wang H, Qi Y, Chen H, Yuan Q, Liu D, et al. HERL-ViT: A Hybrid Enhanced Vision Transformer Based on Regional-Local Attention for Malware Detection. Comput Mater Contin. 2025;85(3):5531–5553. https://doi.org/10.32604/cmc.2025.070101
IEEE Style
B. Cui et al., “HERL-ViT: A Hybrid Enhanced Vision Transformer Based on Regional-Local Attention for Malware Detection,” Comput. Mater. Contin., vol. 85, no. 3, pp. 5531–5553, 2025. https://doi.org/10.32604/cmc.2025.070101



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.