Enhancing Phishing Email Detection Using DeepSeek-Generated Synthetic Data and DistilBERT Classification Models
Amani Al-Ajlan, Lama Almelaifi, Remas Alharbi, Shahad Al-Hussain, Fay Alfarraj, Najwa Altwaijry, Isra Al-Turaiki*
Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
* Corresponding Author: Isra Al-Turaiki. Email: ialturaiki@ksu.edu.sa
(This article belongs to the Special Issue: Utilizing and Securing Large Language Models for Cybersecurity and Beyond)
Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.076992
Received 30 November 2025; Accepted 27 February 2026; Published online 17 April 2026
Abstract
Phishing emails are an increasing threat to both individuals and organizations, demanding detection methods more sophisticated than traditional blacklisting and heuristic techniques. One of the main challenges in phishing detection is the limited availability of high-quality phishing datasets. To address this issue, we use generative AI to create synthetic emails that reduce class imbalance, improve model generalization, and overcome data scarcity. We employ a large language model, DeepSeek-7B-Chat, to generate realistic, context-aware phishing and non-phishing emails. Through prompt engineering and fine-tuning, the model produces diverse, modern phishing-style emails that strengthen phishing detection systems. The quality of the generated synthetic emails is evaluated using BERTScore, Self-Bilingual Evaluation Understudy (Self-BLEU), and perplexity. Our results show that the fine-tuned model achieves the highest BERTScore F1-scores, indicating strong semantic similarity, while lower Self-BLEU and appropriate perplexity values reflect better diversity and fluency. To evaluate the usefulness of synthetic data in phishing detection, we train DistilBERT models on the original dataset, the synthetic dataset, and a combined version. The model trained on the combined original and prompt-based dataset achieves 95% accuracy and a 94% F1-score on the original test set, and 93% accuracy and a 93% F1-score on the combined (original and prompt-based) test set; the model trained on the combined original and fine-tuned dataset achieves 95% accuracy and a 95% F1-score on the original test set, and 92% accuracy and a 92% F1-score on the combined (original and fine-tuned) test set. Overall, models trained on combined datasets consistently outperform the other approaches.
Thus, our findings indicate that integrating original and synthetic data is more effective than training on either dataset alone, and that high-quality synthetic data generated by a large language model significantly improves phishing email detection, addressing key limitations in real-world data availability.
Keywords
Phishing email detection; generative AI; large language model (LLM); DeepSeek; synthetic data generation; DistilBERT; cybersecurity