Open Access

ARTICLE

Adversarial Prompt Detection in Large Language Models: A Classification-Driven Approach

Ahmet Emre Ergün, Aytuğ Onan*

Department of Computer Engineering, Faculty of Engineering and Architecture, İzmir Katip Çelebi University, İzmir, 35620, Turkey

* Corresponding Author: Aytuğ Onan.

Computers, Materials & Continua 2025, 83(3), 4855-4877. https://doi.org/10.32604/cmc.2025.063826

Abstract

Large Language Models (LLMs) have significantly advanced human-computer interaction by improving natural language understanding and generation. However, their vulnerability to adversarial prompts (carefully designed inputs that manipulate model outputs) presents substantial challenges. This paper introduces a classification-based approach that detects adversarial prompts using both prompt features and prompt response features. Eleven machine learning models were evaluated on accuracy, precision, recall, and F1-score. The results show that the Convolutional Neural Network–Long Short-Term Memory (CNN-LSTM) cascade model delivers the best performance, especially when using prompt features, achieving an accuracy of over 97% in all adversarial scenarios. The Support Vector Machine (SVM) model performed best with prompt response features, particularly in prompt type classification tasks. The classification results also show that certain types of adversarial attacks, such as “Word Level” and “Adversarial Prefix”, were particularly difficult to detect, as indicated by their low recall and F1-scores, which suggests that more subtle manipulations can evade detection mechanisms. In contrast, attacks such as “Sentence Level” and “Adversarial Insertion” were easier to identify, owing to the model’s effectiveness in recognizing inserted content. Natural Language Processing (NLP) techniques played a critical role by enabling the extraction of semantic and syntactic features from both prompts and their corresponding responses. These findings highlight the importance of combining traditional and deep learning approaches with advanced NLP techniques to build more reliable adversarial prompt detection systems for LLMs.
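To make the role of per-class recall and F1-score concrete, the following is a minimal Python sketch of how these metrics can be computed for each attack type. It is not the authors' evaluation code: the function name `per_class_metrics`, the label strings, and the toy predictions are all assumptions invented for illustration, and the numbers it produces are unrelated to the paper's results.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label received a wrong hit
            fn[truth] += 1  # true label was missed
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Toy data (hypothetical): three of four word-level attacks are mistaken for
# clean prompts, while every sentence-level attack is caught.
y_true = ["word_level"] * 4 + ["sentence_level"] * 4 + ["clean"] * 4
y_pred = (["clean", "clean", "clean", "word_level"]
          + ["sentence_level"] * 4
          + ["clean"] * 4)

metrics = per_class_metrics(y_true, y_pred)
for label, (precision, recall, f1) in metrics.items():
    print(f"{label:15s} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy run the hypothetical `word_level` class has perfect precision but a recall of only 0.25, so most of those attacks pass as clean prompts. This is the kind of pattern that low recall and F1-scores indicate for subtle manipulations.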

Keywords

LLM; classification; NLP; adversarial; prompt; machine learning; deep learning

Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.