MSA-ViT: A Multi-Scale Vision Transformer for Robust Malware Image Classification

Bofan Yang; Bingbing Li; Chuanping Hu

doi:10.32604/cmc.2026.077697

Key Laboratory of Cyberspace Security, School of Cyber Science and Engineering, Zhengzhou University, Ministry of Education, Zhengzhou, China

* Corresponding Author: Chuanping Hu.

(This article belongs to the Special Issue: Artificial Intelligence Methods and Techniques to Cybersecurity)

Computers, Materials & Continua 2026, 87(3), 51 https://doi.org/10.32604/cmc.2026.077697

Received 15 December 2025; Accepted 02 February 2026; Issue published 09 April 2026

Abstract

The rapid evolution of malware obfuscation and packing techniques significantly undermines the effectiveness of traditional static detection approaches. Transforming malware binaries into grayscale or RGB images enables learning-based classification, yet existing CNN- and ViT-based models depend heavily on fixed-resolution inputs and exhibit poor robustness under cross-resolution distortions. This study proposes a lightweight and sample-adaptive Multi-Scale Vision Transformer (MSA-ViT) for efficient and robust malware image classification. MSA-ViT leverages a fixed set of input scales and integrates them using a Scale-Attention Fusion (SAF) module, where the largest-scale CLS token serves as the query to dynamically aggregate cross-scale representations. To mitigate scale bias and improve generalization, SimCLR self-supervised pre-training and KL-divergence-based cross-scale consistency regularization are incorporated. Experiments on the Malimg and MaleVis datasets demonstrate that MSA-ViT achieves accuracies of 98.5% and 96.0%, respectively, outperforming existing baselines. Robustness evaluations further show that performance degradation remains below 1.8% under scaling, padding, and FGSM perturbations. Attention-based visualizations confirm the interpretability of the fusion mechanism. Overall, MSA-ViT provides an accurate, robust, and computationally efficient solution for image-based malware classification.
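The fusion and consistency ideas described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes identity query/key/value projections where a real SAF module would presumably use learned ones, represents each scale by a single CLS vector, and uses a plain KL divergence between per-scale class distributions as the consistency term. The function names `scale_attention_fusion` and `kl_divergence` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scale_attention_fusion(cls_tokens):
    """Fuse per-scale CLS vectors (ordered smallest to largest scale).

    Assumption: the largest-scale CLS token is the query and every
    scale's CLS token acts as both key and value (no learned projections).
    Returns the fused vector and the per-scale attention weights.
    """
    query = cls_tokens[-1]
    d = len(query)
    # Scaled dot-product score between the query and each scale's token.
    scores = [
        sum(q * k for q, k in zip(query, tok)) / math.sqrt(d)
        for tok in cls_tokens
    ]
    weights = softmax(scores)
    # Weighted sum of the per-scale tokens, one dimension at a time.
    fused = [
        sum(w * tok[i] for w, tok in zip(weights, cls_tokens))
        for i in range(d)
    ]
    return fused, weights


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete class distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Toy usage: three scales with 2-dimensional CLS tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = scale_attention_fusion(tokens)

# Toy consistency term between a small-scale and a large-scale prediction.
consistency = kl_divergence([0.9, 0.1], [0.8, 0.2])
```

In this sketch the attention weights are computed per sample from that sample's own tokens, which is one way to read "sample-adaptive"; how the paper weights or schedules the consistency term is not stated in the abstract.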

Keywords

Malware classification; vision transformers; multi-scale fusion; robustness; self-supervised learning

Cite This Article

APA Style

Yang, B., Li, B., & Hu, C. (2026). MSA-ViT: A Multi-Scale Vision Transformer for Robust Malware Image Classification. Computers, Materials & Continua, 87(3), 51. https://doi.org/10.32604/cmc.2026.077697

Vancouver Style

Yang B, Li B, Hu C. MSA-ViT: A Multi-Scale Vision Transformer for Robust Malware Image Classification. Comput Mater Contin. 2026;87(3):51. https://doi.org/10.32604/cmc.2026.077697

IEEE Style

B. Yang, B. Li, and C. Hu, “MSA-ViT: A Multi-Scale Vision Transformer for Robust Malware Image Classification,” Comput. Mater. Contin., vol. 87, no. 3, pp. 51, 2026. https://doi.org/10.32604/cmc.2026.077697

Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
