Open Access

ARTICLE

Large Language Model in Healthcare for the Prediction of Genetic Variants from Unstructured Text Medicine Data Using Natural Language Processing

Noor Ayesha1, Muhammad Mujahid2, Abeer Rashad Mirdad2, Faten S. Alamri3,*, Amjad R. Khan2

1 Center of Excellence in CyberSecurity (CYBEX), Prince Sultan University, Riyadh, 11586, Saudi Arabia
2 Artificial Intelligence & Data Analytics Lab, CCIS, Prince Sultan University, Riyadh, 11586, Saudi Arabia
3 Department of Mathematical Sciences, College of Science, Princess Nourah Bint Abdulrahman University, P.O. Box 84428, Riyadh, 11671, Saudi Arabia

* Corresponding Author: Faten S. Alamri.

(This article belongs to the Special Issue: Enhancing AI Applications through NLP and LLM Integration)

Computers, Materials & Continua 2025, 84(1), 1883-1899. https://doi.org/10.32604/cmc.2025.063560

Abstract

Large language models (LLMs) and natural language processing (NLP) hold significant promise for improving efficiency and refining healthcare decision-making and clinical outcomes. Numerous domains, including healthcare, are rapidly adopting LLMs to classify biomedical text in medical research, since LLMs can derive insights from intricate, extensive, unstructured training data. Genetic variants must be identified and classified accurately to advance genetic research, enable individualized treatment, and help physicians make better decisions. However, the sophisticated and often ambiguous language of medical reports frequently exceeds the capabilities of the tools currently in use, which can lead to incorrect diagnoses and, in turn, affect a patient's prognosis and course of therapy. This study evaluated the efficacy of the proposed model on publicly accessible clinical text data. We cleaned the clinical text using several preprocessing methods, including stemming, tokenization, and stop word removal, and extracted salient features with Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) feature engineering. The main objective of this study is to predict genetic variants from clinical evidence using a novel method with minimal error. According to the experimental results, the random forest model achieved 61% accuracy with 67% precision for class 9 using TF-IDF features, and 63% accuracy with a 73% F1 score for class 9 using BoW features. The proposed BERT (Bidirectional Encoder Representations from Transformers) model reached 70% accuracy with 5-fold cross-validation and 71% with 10-fold cross-validation. These results provide a comprehensive overview of current LLM methods in healthcare, benefiting both academics and practitioners in the discipline.
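The abstract outlines a two-track pipeline: classical models (a random forest over BoW/TF-IDF features after stemming, tokenization, and stop word removal) and a fine-tuned BERT classifier evaluated with k-fold cross-validation. The sketch below illustrates the classical track under stated assumptions; it is not the authors' code. The corpus, labels, vectorizer settings, and forest size are placeholders (the paper's data comprises nine variant classes).

# A minimal sketch of the classical track, assuming scikit-learn and NLTK.
# The corpus, labels, and hyperparameters are illustrative placeholders,
# not the paper's actual data or settings.
import re

from nltk.stem import PorterStemmer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    # Tokenize on alphabetic runs, drop stop words, and stem each token.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS)

# Placeholder corpus: the paper uses publicly available clinical evidence
# text labeled with nine genetic-variant classes (here encoded 0-8).
texts = [
    "pathogenic missense mutation reported in the tumor suppressor gene",
    "variant of uncertain significance with no functional evidence",
] * 10
labels = [0, 8] * 10

docs = [preprocess(t) for t in texts]

# TF-IDF features feeding a random forest; swap in CountVectorizer for BoW.
model = make_pipeline(
    TfidfVectorizer(max_features=5000),
    RandomForestClassifier(n_estimators=300, random_state=42),
)
scores = cross_val_score(model, docs, labels, cv=5, scoring="accuracy")
print(f"5-fold mean accuracy: {scores.mean():.3f}")

For the BERT track, a comparable 5-fold protocol can be sketched with the Hugging Face transformers library; the checkpoint, epoch count, and batch size below are assumptions rather than the paper's configuration. The sketch reuses the placeholder texts and labels defined above.

# Hedged sketch of k-fold evaluation for a BERT classifier; reuses the
# placeholder texts/labels from the previous sketch.
import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class ClinicalTextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=256)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracies = []
for train_idx, test_idx in folds.split(texts, labels):
    # Fresh model per fold so no fold sees weights trained on its test split.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=9)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert_cv", num_train_epochs=2,
                               per_device_train_batch_size=8, report_to="none"),
        train_dataset=ClinicalTextDataset(
            [texts[i] for i in train_idx], [labels[i] for i in train_idx], tokenizer),
    )
    trainer.train()
    out = trainer.predict(ClinicalTextDataset(
        [texts[i] for i in test_idx], [labels[i] for i in test_idx], tokenizer))
    y_true = np.array([labels[i] for i in test_idx])
    accuracies.append((out.predictions.argmax(-1) == y_true).mean())
print(f"5-fold mean accuracy: {np.mean(accuracies):.3f}")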

Keywords

LLM; unstructured data; genetics; prediction; healthcare; medicine

Cite This Article

APA Style
Ayesha, N., Mujahid, M., Mirdad, A.R., Alamri, F.S., Khan, A.R. (2025). Large Language Model in Healthcare for the Prediction of Genetic Variants from Unstructured Text Medicine Data Using Natural Language Processing. Computers, Materials & Continua, 84(1), 1883–1899. https://doi.org/10.32604/cmc.2025.063560
Vancouver Style
Ayesha N, Mujahid M, Mirdad AR, Alamri FS, Khan AR. Large Language Model in Healthcare for the Prediction of Genetic Variants from Unstructured Text Medicine Data Using Natural Language Processing. Comput Mater Contin. 2025;84(1):1883–1899. https://doi.org/10.32604/cmc.2025.063560
IEEE Style
N. Ayesha, M. Mujahid, A. R. Mirdad, F. S. Alamri, and A. R. Khan, “Large Language Model in Healthcare for the Prediction of Genetic Variants from Unstructured Text Medicine Data Using Natural Language Processing,” Comput. Mater. Contin., vol. 84, no. 1, pp. 1883–1899, 2025. https://doi.org/10.32604/cmc.2025.063560



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.