Enhancing Phishing Email Detection Using DeepSeek-Generated Synthetic Data and DistilBERT Classification Models
Amani Al-Ajlan, Lama Almelaifi, Remas Alharbi, Shahad Al-Hussain, Fay Alfarraj, Najwa Altwaijry, Isra Al-Turaiki*
Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
* Corresponding Author: Isra Al-Turaiki. Email: ialturaiki@ksu.edu.sa
(This article belongs to the Special Issue: Utilizing and Securing Large Language Models for Cybersecurity and Beyond)
Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.076992
Received 30 November 2025; Accepted 27 February 2026; Published online 17 April 2026
Abstract
Phishing emails are an increasing threat to both individuals and organizations, demanding detection methods more sophisticated than traditional blacklisting and heuristic techniques. One of the main challenges in phishing detection is the limited availability of high-quality phishing datasets. To address this issue, we use generative AI to create synthetic emails that reduce class imbalance, improve model generalization, and overcome data scarcity. We employ a large language model, DeepSeek-7B-Chat, to generate realistic, context-aware phishing and non-phishing emails. Through prompt engineering and fine-tuning, the model produces diverse, modern phishing-style emails that strengthen phishing detection systems. The quality of the generated synthetic emails is evaluated using BERTScore, Self-Bilingual Evaluation Understudy (Self-BLEU), and perplexity. Our results show that the fine-tuned model achieves the highest BERTScore F1-scores, indicating strong semantic similarity, while lower Self-BLEU and appropriate perplexity values reflect better diversity and fluency. To evaluate the usefulness of synthetic data in phishing detection, we train DistilBERT models on the original dataset, the synthetic dataset, and a combined version. The model trained on the combined original and prompt-based dataset achieves 95% accuracy and a 94% F1-score on the original test set, and 93% accuracy and a 93% F1-score on the combined (original and prompt-based) test set; the model trained on the combined original and fine-tuned dataset achieves 95% accuracy and a 95% F1-score on the original test set, and 92% accuracy and a 92% F1-score on the combined (original and fine-tuned) test set. Overall, models trained on combined datasets consistently outperform the other approaches.
Thus, our findings indicate that integrating original and synthetic data is more effective than training on either dataset alone, and that high-quality synthetic data generated by a large language model significantly improves phishing email detection, addressing key limitations in real-world data availability.
Keywords
Phishing email detection; generative AI; large language model (LLM); DeepSeek; synthetic data generation; DistilBERT; cybersecurity