Open Access

ARTICLE

Large Language Model in Healthcare for the Prediction of Genetic Variants from Unstructured Text Medicine Data Using Natural Language Processing

Noor Ayesha1, Muhammad Mujahid2, Abeer Rashad Mirdad2, Faten S. Alamri3,*, Amjad R. Khan2

1 Center of Excellence in CyberSecurity (CYBEX), Prince Sultan University, Riyadh, 11586, Saudi Arabia
2 Artificial Intelligence & Data Analytics Lab, CCIS, Prince Sultan University, Riyadh, 11586, Saudi Arabia
3 Department of Mathematical Sciences, College of Science, Princess Nourah Bint Abdulrahman University, P.O. Box 84428, Riyadh, 11671, Saudi Arabia

* Corresponding Author: Faten S. Alamri.

(This article belongs to the Special Issue: Enhancing AI Applications through NLP and LLM Integration)

Computers, Materials & Continua 2025, 84(1), 1883-1899. https://doi.org/10.32604/cmc.2025.063560

Abstract

Large language models (LLMs) and natural language processing (NLP) hold significant promise for improving efficiency and refining healthcare decision-making and clinical outcomes. Numerous domains, including healthcare, are rapidly adopting LLMs to classify biomedical text in medical research, since LLMs can derive insights from intricate, extensive, unstructured training data. Genetic variants must be identified and classified accurately to advance genetic research, enable individualized treatment, and help physicians make better decisions. However, the sophisticated and often ambiguous language of medical reports frequently exceeds the capabilities of the tools currently in use, which can lead to incorrect diagnoses and, in turn, affect a patient's prognosis and course of therapy. This study evaluated the efficacy of the proposed model on publicly accessible clinical text data. We cleaned the clinical text using several preprocessing methods, including stemming, tokenization, and stop word removal, and extracted salient features with Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) feature engineering. The main objective of this study is to predict genetic variants from clinical evidence using a novel method with minimal error. According to the experimental results, the random forest model achieved 61% accuracy with 67% precision for class 9 using TF-IDF features, and 63% accuracy with a 73% F1 score for class 9 using BoW features. The proposed BERT (Bidirectional Encoder Representations from Transformers) model reached 70% accuracy with 5-fold cross-validation and 71% with 10-fold cross-validation. These results provide a comprehensive overview of current LLM methods in healthcare, benefiting both academics and practitioners in the discipline.
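The abstract outlines a two-track pipeline: classical models (a random forest over BoW/TF-IDF features after stemming, tokenization, and stop word removal) and a fine-tuned BERT classifier evaluated with k-fold cross-validation. The sketch below illustrates the classical track under stated assumptions; it is not the authors' code. The corpus, labels, vectorizer settings, and forest size are placeholders (the paper's data comprises nine variant classes).

# A minimal sketch of the classical track, assuming scikit-learn and NLTK.
# The corpus, labels, and hyperparameters are illustrative placeholders,
# not the paper's actual data or settings.
import re

from nltk.stem import PorterStemmer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    # Tokenize on alphabetic runs, drop stop words, and stem each token.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS)

# Placeholder corpus: the paper uses publicly available clinical evidence
# text labeled with nine genetic-variant classes (here encoded 0-8).
texts = [
    "pathogenic missense mutation reported in the tumor suppressor gene",
    "variant of uncertain significance with no functional evidence",
] * 10
labels = [0, 8] * 10

docs = [preprocess(t) for t in texts]

# TF-IDF features feeding a random forest; swap in CountVectorizer for BoW.
model = make_pipeline(
    TfidfVectorizer(max_features=5000),
    RandomForestClassifier(n_estimators=300, random_state=42),
)
scores = cross_val_score(model, docs, labels, cv=5, scoring="accuracy")
print(f"5-fold mean accuracy: {scores.mean():.3f}")

For the BERT track, a comparable 5-fold protocol can be sketched with the Hugging Face transformers library; the checkpoint, epoch count, and batch size below are assumptions rather than the paper's configuration. The sketch reuses the placeholder texts and labels defined above.

# Hedged sketch of k-fold evaluation for a BERT classifier; reuses the
# placeholder texts/labels from the previous sketch.
import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class ClinicalTextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=256)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracies = []
for train_idx, test_idx in folds.split(texts, labels):
    # Fresh model per fold so no fold sees weights trained on its test split.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=9)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert_cv", num_train_epochs=2,
                               per_device_train_batch_size=8, report_to="none"),
        train_dataset=ClinicalTextDataset(
            [texts[i] for i in train_idx], [labels[i] for i in train_idx], tokenizer),
    )
    trainer.train()
    out = trainer.predict(ClinicalTextDataset(
        [texts[i] for i in test_idx], [labels[i] for i in test_idx], tokenizer))
    y_true = np.array([labels[i] for i in test_idx])
    accuracies.append((out.predictions.argmax(-1) == y_true).mean())
print(f"5-fold mean accuracy: {np.mean(accuracies):.3f}")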

Keywords

LLM; unstructured data; genetics; prediction; healthcare; medicine

Cite This Article

APA Style
Ayesha, N., Mujahid, M., Mirdad, A.R., Alamri, F.S., Khan, A.R. (2025). Large Language Model in Healthcare for the Prediction of Genetic Variants from Unstructured Text Medicine Data Using Natural Language Processing. Computers, Materials & Continua, 84(1), 1883–1899. https://doi.org/10.32604/cmc.2025.063560
Vancouver Style
Ayesha N, Mujahid M, Mirdad AR, Alamri FS, Khan AR. Large Language Model in Healthcare for the Prediction of Genetic Variants from Unstructured Text Medicine Data Using Natural Language Processing. Comput Mater Contin. 2025;84(1):1883–1899. https://doi.org/10.32604/cmc.2025.063560
IEEE Style
N. Ayesha, M. Mujahid, A. R. Mirdad, F. S. Alamri, and A. R. Khan, “Large Language Model in Healthcare for the Prediction of Genetic Variants from Unstructured Text Medicine Data Using Natural Language Processing,” Comput. Mater. Contin., vol. 84, no. 1, pp. 1883–1899, 2025. https://doi.org/10.32604/cmc.2025.063560



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.