Adversarial Prompt Detection in Large Language Models: A Classification-Driven Approach

Ahmet Ergün; Aytuğ Onan

doi:10.32604/cmc.2025.063826

Open Access

ARTICLE

Ahmet Emre Ergün, Aytuğ Onan^*

Department of Computer Engineering, Faculty of Engineering and Architecture, İzmir Katip Çelebi University, İzmir, 35620, Turkey

* Corresponding Author: Aytuğ Onan. Email: email

Computers, Materials & Continua 2025, 83(3), 4855-4877. https://doi.org/10.32604/cmc.2025.063826

Received 25 January 2025; Accepted 01 April 2025; Issue published 19 May 2025

Abstract

Large Language Models (LLMs) have significantly advanced human-computer interaction by improving natural language understanding and generation. However, their vulnerability to adversarial prompts (carefully designed inputs that manipulate model outputs) presents substantial challenges. This paper introduces a classification-based approach to detect adversarial prompts using both prompt features and prompt response features. Eleven machine learning models were evaluated on key metrics such as accuracy, precision, recall, and F1-score. The results show that the Convolutional Neural Network–Long Short-Term Memory (CNN-LSTM) cascade model delivers the best performance, especially when using prompt features, achieving an accuracy of over 97% in all adversarial scenarios. The Support Vector Machine (SVM) model performed best with prompt response features, particularly in prompt type classification tasks. Classification results revealed that certain types of adversarial attacks, such as “Word Level” and “Adversarial Prefix”, were particularly difficult to detect, as indicated by their low recall and F1-scores. These findings suggest that more subtle manipulations can evade detection mechanisms. In contrast, attacks like “Sentence Level” and “Adversarial Insertion” were easier to identify, owing to the model’s effectiveness in recognizing inserted content. Natural Language Processing (NLP) techniques played a critical role by enabling the extraction of semantic and syntactic features from both prompts and their corresponding responses. These insights highlight the importance of combining traditional and deep learning approaches, along with advanced NLP techniques, to build more reliable adversarial prompt detection systems for LLMs.
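To make the per-class reading of recall and F1-score concrete, the following is a minimal sketch of how those metrics can be computed for a multi-class prompt-type classifier. The function names and the toy labels are assumptions made for illustration; they are not the authors' code or data. The example shows how an attack type can have low recall while overall accuracy still looks high.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that appears."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        f_den = precision + recall
        f1 = 2 * precision * recall / f_den if f_den else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for illustration only; not results from the paper.
y_true = ["word_level", "word_level", "sentence_level",
          "sentence_level", "benign", "benign"]
y_pred = ["benign", "word_level", "sentence_level",
          "sentence_level", "benign", "benign"]

if __name__ == "__main__":
    print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
    for label, (p, r, f1) in per_class_metrics(y_true, y_pred).items():
        print(f"{label:15s} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy data, one "word_level" prompt is misclassified as benign, so that class has a recall of 0.5 even though five of the six predictions are correct. This is the kind of gap that a per-class report exposes and a single accuracy figure hides.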

Keywords

LLM; classification; NLP; adversarial; prompt; machine learning; deep learning

Cite This Article

APA Style

Ergün, A.E., Onan, A. (2025). Adversarial Prompt Detection in Large Language Models: A Classification-Driven Approach. Computers, Materials & Continua, 83(3), 4855–4877. https://doi.org/10.32604/cmc.2025.063826

Vancouver Style

Ergün AE, Onan A. Adversarial Prompt Detection in Large Language Models: A Classification-Driven Approach. Comput Mater Contin. 2025;83(3):4855–4877. https://doi.org/10.32604/cmc.2025.063826

IEEE Style

A. E. Ergün and A. Onan, “Adversarial Prompt Detection in Large Language Models: A Classification-Driven Approach,” Comput. Mater. Contin., vol. 83, no. 3, pp. 4855–4877, 2025. https://doi.org/10.32604/cmc.2025.063826

Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
