
Open Access

ARTICLE

MSA-ViT: A Multi-Scale Vision Transformer for Robust Malware Image Classification

Bofan Yang, Bingbing Li, Chuanping Hu*
Key Laboratory of Cyberspace Security, School of Cyber Science and Engineering, Zhengzhou University, Ministry of Education, Zhengzhou, China
* Corresponding Author: Chuanping Hu.
(This article belongs to the Special Issue: Artificial Intelligence Methods and Techniques to Cybersecurity)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.077697

Received 15 December 2025; Accepted 02 February 2026; Published online 26 February 2026

Abstract

The rapid evolution of malware obfuscation and packing techniques significantly undermines the effectiveness of traditional static detection approaches. Transforming malware binaries into grayscale or RGB images enables learning-based classification, yet existing CNN- and ViT-based models depend heavily on fixed-resolution inputs and exhibit poor robustness under cross-resolution distortions. This study proposes a lightweight and sample-adaptive Multi-Scale Vision Transformer (MSA-ViT) for efficient and robust malware image classification. MSA-ViT leverages a fixed set of input scales and integrates them using a Scale-Attention Fusion (SAF) module, where the largest-scale CLS token serves as the query to dynamically aggregate cross-scale representations. To mitigate scale bias and improve generalization, SimCLR self-supervised pre-training and KL-divergence-based cross-scale consistency regularization are incorporated. Experiments on the Malimg and MaleVis datasets demonstrate that MSA-ViT achieves accuracies of 98.5% and 96.0%, respectively, outperforming existing baselines. Robustness evaluations further show that performance degradation remains below 1.8% under scaling, padding, and FGSM perturbations. Attention-based visualizations confirm the interpretability of the fusion mechanism. Overall, MSA-ViT provides an accurate, robust, and computationally efficient solution for image-based malware classification.
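The fusion and consistency ideas described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the SAF equations, so the sketch assumes plain scaled dot-product attention with no learned projections, a single attention head, and CLS tokens ordered from smallest to largest scale. The function names `scale_attention_fusion` and `cross_scale_kl`, and the choice of KL(p_largest || p_s) as the consistency term, are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def scale_attention_fusion(cls_tokens, query_index=-1):
    """Fuse per-scale CLS tokens with one attention step.

    cls_tokens: array of shape (S, d), one CLS embedding per input scale.
    query_index: row used as the query (assumed: the largest scale, last row).
    Returns the fused embedding (d,) and the attention weights (S,).
    """
    q = cls_tokens[query_index]              # query from the largest scale
    d = cls_tokens.shape[-1]
    scores = cls_tokens @ q / np.sqrt(d)     # similarity of each scale to the query
    weights = softmax(scores)                # sample-dependent scale weights
    fused = weights @ cls_tokens             # weighted sum over scales
    return fused, weights


def cross_scale_kl(logits_per_scale, ref_index=-1, eps=1e-12):
    """Mean KL(p_ref || p_s) over all non-reference scales.

    logits_per_scale: array of shape (S, C) with class logits per scale.
    The direction of the divergence is an assumption of this sketch.
    """
    probs = softmax(logits_per_scale, axis=-1)
    n_scales = probs.shape[0]
    ref = ref_index % n_scales
    p_ref = probs[ref]
    kls = [
        np.sum(p_ref * (np.log(p_ref + eps) - np.log(probs[s] + eps)))
        for s in range(n_scales)
        if s != ref
    ]
    return float(np.mean(kls))


# Toy usage: 3 scales, 8-dimensional CLS tokens, 4 classes.
rng = np.random.default_rng(0)
cls_tokens = rng.normal(size=(3, 8))
fused, weights = scale_attention_fusion(cls_tokens)
consistency = cross_scale_kl(rng.normal(size=(3, 4)))
```

Because the weights are recomputed from each sample's own CLS tokens, the contribution of each scale varies per input, which is the sense in which such a fusion is sample-adaptive. The consistency term is zero when every scale predicts the same class distribution and grows as the per-scale predictions diverge.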

Keywords

Malware classification; vision transformers; multi-scale fusion; robustness; self-supervised learning