Open Access

ARTICLE

MSA-ViT: A Multi-Scale Vision Transformer for Robust Malware Image Classification

Bofan Yang, Bingbing Li, Chuanping Hu*

Key Laboratory of Cyberspace Security, School of Cyber Science and Engineering, Zhengzhou University, Ministry of Education, Zhengzhou, China

* Corresponding Author: Chuanping Hu. Email: email

(This article belongs to the Special Issue: Artificial Intelligence Methods and Techniques to Cybersecurity)

Computers, Materials & Continua 2026, 87(3), 51 https://doi.org/10.32604/cmc.2026.077697

Abstract

The rapid evolution of malware obfuscation and packing techniques significantly undermines the effectiveness of traditional static detection approaches. Transforming malware binaries into grayscale or RGB images enables learning-based classification, yet existing CNN- and ViT-based models depend heavily on fixed-resolution inputs and exhibit poor robustness under cross-resolution distortions. This study proposes a lightweight and sample-adaptive Multi-Scale Vision Transformer (MSA-ViT) for efficient and robust malware image classification. MSA-ViT leverages a fixed set of input scales and integrates them using a Scale-Attention Fusion (SAF) module, where the largest-scale CLS token serves as the query to dynamically aggregate cross-scale representations. To mitigate scale bias and improve generalization, SimCLR self-supervised pre-training and KL-divergence-based cross-scale consistency regularization are incorporated. Experiments on the Malimg and MaleVis datasets demonstrate that MSA-ViT achieves accuracies of 98.5% and 96.0%, respectively, outperforming existing baselines. Robustness evaluations further show that performance degradation remains below 1.8% under scaling, padding, and FGSM perturbations. Attention-based visualizations confirm the interpretability of the fusion mechanism. Overall, MSA-ViT provides an accurate, robust, and computationally efficient solution for image-based malware classification.
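The fusion and consistency ideas described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the SAF equations, so the sketch assumes plain scaled dot-product attention without learned projections, assumes the CLS tokens are ordered from smallest to largest scale, and assumes the KL term compares the largest-scale prediction against each other scale. The function names `scale_attention_fusion` and `kl_consistency` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scale_attention_fusion(cls_tokens):
    """Fuse per-scale CLS embeddings with the largest-scale token as query.

    cls_tokens: array of shape (S, D), one CLS embedding per input scale,
    assumed to be ordered from smallest to largest scale.
    Returns the fused embedding (D,) and the per-scale weights (S,).
    Assumption: no learned query/key/value projections are applied.
    """
    query = cls_tokens[-1]                      # largest-scale CLS token
    dim = cls_tokens.shape[-1]
    scores = cls_tokens @ query / np.sqrt(dim)  # one score per scale
    weights = softmax(scores)                   # sample-adaptive scale weights
    fused = weights @ cls_tokens                # weighted sum over scales
    return fused, weights


def kl_consistency(logits_per_scale, eps=1e-12):
    """Mean KL(p_largest || p_s) over all smaller scales s.

    logits_per_scale: array of shape (S, C) with class logits per scale.
    Assumption: the largest scale is the reference distribution.
    """
    probs = softmax(logits_per_scale, axis=-1)
    ref = probs[-1]
    kls = [
        np.sum(ref * (np.log(ref + eps) - np.log(p + eps)))
        for p in probs[:-1]
    ]
    return float(np.mean(kls))
```

In this sketch the attention weights depend on each sample's own CLS tokens, which is one way to read "sample-adaptive" fusion, and the consistency term is zero when every scale produces the same class distribution.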

Keywords

Malware classification; vision transformers; multi-scale fusion; robustness; self-supervised learning

Cite This Article

APA Style
Yang, B., Li, B., Hu, C. (2026). MSA-ViT: A Multi-Scale Vision Transformer for Robust Malware Image Classification. Computers, Materials & Continua, 87(3), 51. https://doi.org/10.32604/cmc.2026.077697
Vancouver Style
Yang B, Li B, Hu C. MSA-ViT: A Multi-Scale Vision Transformer for Robust Malware Image Classification. Comput Mater Contin. 2026;87(3):51. https://doi.org/10.32604/cmc.2026.077697
IEEE Style
B. Yang, B. Li, and C. Hu, “MSA-ViT: A Multi-Scale Vision Transformer for Robust Malware Image Classification,” Comput. Mater. Contin., vol. 87, no. 3, pp. 51, 2026. https://doi.org/10.32604/cmc.2026.077697



Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.