Vol.27, No.3, 2021, pp.891-908, doi:10.32604/iasc.2021.016933
OPEN ACCESS
ARTICLE
A Learning-based Static Malware Detection System with Integrated Feature
Zhiguo Chen1,*, Xiaorui Zhang1,2, Sungryul Kim3
1 School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, 210044, China
2 Jiangsu Engineering Center of Network Monitoring, Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing, 210044, China
3 Department of Internet and Multimedia Engineering, Konkuk University, Seoul, 05029, Korea
* Corresponding Author: Zhiguo Chen. Email:
Received 01 January 2021; Accepted 08 February 2021; Issue published 01 March 2021
Abstract
The rapid growth of malware poses a significant threat to the security of computer systems. Analysts now need to examine thousands of malware samples daily, and determining whether a program is benign or malicious has become a challenging task. Making accurate decisions about a program is crucial for anti-malware products, so precise malware detection has become an active topic in computer security. Traditional malware detection relies on signature-based strategies, the most widespread method in commercial anti-malware software. This method works well against known malware but cannot detect new malware. To overcome this deficiency, we propose a static malware detection system that uses data mining techniques to identify known and unknown malware by comparing the profiles of malware and benign programs, with real-time response and a low false-positive rate. The proposed system includes a sample labeling module, a feature extraction module, a pre-processing module, and a decision module. The sample labeling module uses VirusTotal to label the collected samples correctly. The feature extraction module statically extracts header information, section entropy, APIs, and section opcode n-grams. The pre-processing module is primarily based on principal component analysis (PCA), which reduces the dimensionality of the features and thus the computational overhead. The decision module uses machine-learning algorithms such as K-Nearest Neighbors (KNN), Decision Tree (DT), Gradient Boosting Decision Tree (GBDT), and Extreme Gradient Boosting (XGBoost) to build a detection model that judges whether a program is benign or malware. The experimental results indicate that the proposed system achieves 99.56% detection accuracy and a 99.55% F1-score on the 79 extracted features using the XGBoost algorithm, and that it has the potential for real-time, large-scale malware detection tasks.
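Two of the static features named in the abstract, section entropy and opcode n-grams, can be illustrated with a short sketch. The abstract does not give the authors' implementation, so everything below is an assumption: the function names `shannon_entropy` and `ngram_counts`, the choice of n = 2, and the toy inputs are hypothetical, and a real system would read section bytes and disassembled opcodes from a parsed PE file.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over the byte values that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def ngram_counts(tokens, n=2):
    """Count contiguous n-grams in a token sequence (e.g. opcode mnemonics)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical inputs standing in for data extracted from a PE file.
uniform_section = bytes(range(256))   # every byte value exactly once
padding_section = b"\x00" * 512       # constant padding
opcodes = ["push", "mov", "call", "push", "mov", "ret"]

high = shannon_entropy(uniform_section)   # maximal entropy for byte data
low = shannon_entropy(padding_section)    # no variation at all
bigrams = ngram_counts(opcodes, n=2)
```

High-entropy sections often indicate packed or encrypted content, which is one reason entropy is commonly used as a static feature; the n-gram counts would typically be turned into a fixed-length vector before dimensionality reduction and classification.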
Keywords
Static analysis; malware detection; machine learning; computer security; principal component analysis
Cite This Article
Z. Chen, X. Zhang and S. Kim, "A learning-based static malware detection system with integrated feature," Intelligent Automation & Soft Computing, vol. 27, no. 3, pp. 891–908, 2021.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.