Open Access
ARTICLE
HERL-ViT: A Hybrid Enhanced Vision Transformer Based on Regional-Local Attention for Malware Detection
1 School of Computer, North China Institute of Aerospace Engineering, Langfang, 065000, China
2 School of Automation, Southeast University, Nanjing, 210096, China
3 College of Intelligence and Computing, Tianjin University, Tianjin, 300072, China
* Corresponding Authors: Huijuan Wang. Email: ; Yongjun Qi. Email:
(This article belongs to the Special Issue: Advances in Efficient Vision Transformers: Architectures, Optimization, and Applications)
Computers, Materials & Continua 2025, 85(3), 5531-5553. https://doi.org/10.32604/cmc.2025.070101
Received 08 July 2025; Accepted 22 August 2025; Issue published 23 October 2025
Abstract
The proliferation of malware and the emergence of adversarial samples pose severe threats to global cybersecurity, demanding robust detection mechanisms. Traditional malware detection methods suffer from limited feature extraction capabilities, while existing Vision Transformer (ViT)-based approaches face high computational complexity due to global self-attention, hindering their efficiency in handling large-scale image data. To address these issues, this paper proposes a novel hybrid enhanced Vision Transformer architecture, HERL-ViT, tailored for malware detection. The detection framework involves five phases: malware image visualization, image segmentation with patch embedding, regional-local attention-based feature extraction, enhanced feature transformation, and classification. Methodologically, HERL-ViT integrates a multi-level pyramid structure to capture multi-scale features, a regional-to-local attention mechanism to reduce computational complexity, an Optimized Position Encoding Generator for dynamic relative position encoding, and enhanced MLP and downsampling modules to balance performance and efficiency. Key contributions include: (1) A unified framework integrating visualization, adversarial training, and hybrid attention for malware detection; (2) Regional-local attention to achieve both global awareness and local detail capture with lower complexity; (3) Optimized PEG to enhance spatial perception and reduce overfitting; (4) Lightweight network design (5.8M parameters) ensuring high efficiency. Experimental results show HERL-ViT achieves 99.2% accuracy (Loss = 0.066) on malware classification and 98.9% accuracy (Loss = 0.081) on adversarial samples, demonstrating superior performance and robustness compared to state-of-the-art methods.Keywords
Cite This Article
Copyright © 2025 The Author(s). Published by Tech Science Press.This work is licensed under a Creative Commons Attribution 4.0 International License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Submit a Paper
Propose a Special lssue
View Full Text
Download PDF
Downloads
Citation Tools