Comparative Analysis of Machine Learning Models for PDF Malware Detection: Evaluating Different Training and Testing Criteria

Bilal Khan; Muhammad Arshad; Sarwar Shah Khan

doi:10.32604/jcs.2023.042501

ARTICLE

Bilal Khan¹, Muhammad Arshad², Sarwar Shah Khan³,⁴,*

1 Department of Computer Science, City University of Science and Information Technology, Peshawar, Pakistan
2 Department of Computer Software Engineering, University of Engineering and Technology, Mardan, Pakistan
3 Department of Computer and Software Technology, University of Swat, Swat, Pakistan
4 Department of Computer Science, IQRA National University, Swat, Pakistan

* Corresponding Author: Sarwar Shah Khan. Email: email

Journal of Cyber Security 2023, 5, 1-11. https://doi.org/10.32604/jcs.2023.042501

Received 01 June 2023; Accepted 03 August 2023; Issue published 21 August 2023

Abstract

As file transfers increase, the proliferation of maliciously coded documents has led to a rise in sophisticated attacks. Portable Document Format (PDF) files have emerged as a major attack vector for malware due to their adaptability and wide usage. Detecting malware in PDF files is challenging because they can carry various harmful elements such as embedded scripts, exploits, and malicious URLs. This paper presents a comparative analysis of machine learning (ML) techniques, including Naive Bayes (NB), K-Nearest Neighbor (KNN), Averaged One-Dependence Estimator (A1DE), Random Forest (RF), and Support Vector Machine (SVM), for PDF malware detection. The study uses a dataset obtained from the Canadian Institute for Cybersecurity and employs different testing criteria, namely percentage splitting and 10-fold cross-validation. The performance of the techniques is evaluated using F1-score, precision, recall, and accuracy. The results indicate that KNN outperforms the other models, achieving an accuracy of 99.8599% using 10-fold cross-validation. The findings highlight the effectiveness of ML models in accurately detecting PDF malware and provide insights for developing robust systems to protect against malicious activities.
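The two testing criteria and the four measures named in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' pipeline: the function names, the label convention (1 = malicious, 0 = benign), and the `train_fraction` parameter are assumptions made for illustration, and the split percentages used in the study are not stated here. No shuffling or stratification is performed.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1, assuming label 1 = malicious."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def percentage_split(n: int, train_fraction: float) -> Tuple[List[int], List[int]]:
    """Percentage splitting: the first part trains, the remainder tests.

    train_fraction is a hypothetical parameter (e.g. 0.7 for a 70/30 split).
    """
    cut = int(n * train_fraction)
    indices = list(range(n))
    return indices[:cut], indices[cut:]


def k_fold_indices(n: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """k-fold cross-validation: each sample is held out for testing exactly once."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test
```

Under 10-fold cross-validation, a model would be trained and scored once per `(train, test)` pair and the per-fold metrics averaged, whereas percentage splitting yields a single train/test evaluation.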

Keywords

Cyber-security; PDF malware; model training; testing

Cite This Article

APA Style

Khan, B., Arshad, M., Khan, S.S. (2023). Comparative analysis of machine learning models for PDF malware detection: evaluating different training and testing criteria. Journal of Cyber Security, 5(1), 1-11. https://doi.org/10.32604/jcs.2023.042501

Vancouver Style

Khan B, Arshad M, Khan SS. Comparative analysis of machine learning models for PDF malware detection: evaluating different training and testing criteria. J Cyber Secur. 2023;5(1):1-11. https://doi.org/10.32604/jcs.2023.042501

IEEE Style

B. Khan, M. Arshad, and S. S. Khan, "Comparative Analysis of Machine Learning Models for PDF Malware Detection: Evaluating Different Training and Testing Criteria," J. Cyber Secur., vol. 5, no. 1, pp. 1-11, 2023. https://doi.org/10.32604/jcs.2023.042501

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
