Open Access

ARTICLE

LLM-Based Enhanced Clustering for Low-Resource Language: An Empirical Study

Talha Farooq Khan1, Majid Hussain1, Muhammad Arslan2, Muhammad Saeed1, Lal Khan3,*, Hsien-Tsung Chang4,5,6,*

1 Department of Computer Science, The University of Faisalabad, Faisalabad, 38000, Pakistan
2 Department of Computer Science, The University of Southern Punjab, Multan, 60000, Pakistan
3 Department of AI and SW, Gachon University, Seongnam, 13120, Republic of Korea
4 Department of Artificial Intelligence, Chang Gung University, Linkou, Taoyuan, 333, Taiwan
5 Department of Computer Science and Information Engineering, Chang Gung University, Linkou, Taoyuan, 333, Taiwan
6 Center for Artificial Intelligence in Medicine, Chang Gung Memorial Hospital at Linkou, Linkou, Taoyuan, 333, Taiwan

* Corresponding Authors: Lal Khan; Hsien-Tsung Chang

(This article belongs to the Special Issue: Applied NLP with Large Language Models: AI Applications Across Domains)

Computer Modeling in Engineering & Sciences 2025, 145(3), 3883-3911. https://doi.org/10.32604/cmes.2025.073021

Abstract

Text clustering underpins many natural language processing (NLP) tasks. However, existing clustering research focuses mainly on English, with limited work on low-resource languages such as Urdu. Clustering text in low-resource languages is hampered by scarce annotated collections and strong linguistic diversity. This paper makes two contributions: (1) it introduces UNC-2025, a clustering dataset of 100k Urdu news documents, and (2) it provides a detailed empirical benchmark of Large Language Model (LLM)-enhanced clustering methods for Urdu text. We evaluate 11 multilingual and Urdu-specific embeddings with 3 clustering algorithms, using a set of internal and external validity measures. The best configuration, mBERT embeddings with the HDBSCAN algorithm, attains a new state-of-the-art external validity score of 0.95 and establishes a strong baseline for Urdu text clustering. Importantly, the results confirm that LLM-generated embeddings are robust and scalable and can capture the fine, subtle semantics needed to discover topics in low-resource settings, opening the door to novel NLP applications in underrepresented languages.
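To make the distinction between internal and external validity measures concrete, the following is a minimal sketch in plain Python. The abstract does not name the specific measures used, so purity (external, needs gold labels) and the mean silhouette coefficient (internal, needs only vectors and cluster assignments) are assumed here purely as representative examples. The toy 2-D vectors, the topic labels, and the function names are hypothetical and stand in for real document embeddings and clustering output.

```python
from collections import Counter
from math import dist


def purity(gold_labels, cluster_ids):
    """External measure: share of items in the majority gold class of their cluster."""
    by_cluster = {}
    for gold, cid in zip(gold_labels, cluster_ids):
        by_cluster.setdefault(cid, []).append(gold)
    majority = sum(Counter(m).most_common(1)[0][1] for m in by_cluster.values())
    return majority / len(gold_labels)


def mean_silhouette(vectors, cluster_ids):
    """Internal measure: mean silhouette (Euclidean). Assumes at least two clusters."""
    scores = []
    for i, (v, cid) in enumerate(zip(vectors, cluster_ids)):
        # Distances to the other members of the same cluster.
        same = [dist(v, w) for j, (w, c) in enumerate(zip(vectors, cluster_ids))
                if c == cid and j != i]
        if not same:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = sum(same) / len(same)
        # Smallest mean distance to any other cluster.
        b = min(
            sum(dist(v, w) for w, c in zip(vectors, cluster_ids) if c == other)
            / sum(1 for c in cluster_ids if c == other)
            for other in set(cluster_ids) if other != cid
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: four "documents" in a 2-D embedding space.
vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
gold = ["sports", "sports", "politics", "politics"]

good = [0, 0, 1, 1]  # clusters match the gold topics
bad = [0, 1, 0, 1]   # clusters cut across the gold topics

print(purity(gold, good), mean_silhouette(vectors, good))
print(purity(gold, bad), mean_silhouette(vectors, bad))
```

In this sketch the well-separated assignment scores high on both measures, while the assignment that cuts across topics scores low on both. In practice a density-based algorithm such as HDBSCAN may also mark some points as noise, and how those points are counted is a design choice this sketch does not address.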

Keywords

Large language models (LLMs); clustering; low resource language; natural language processing

Cite This Article

APA Style
Khan, T.F., Hussain, M., Arslan, M., Saeed, M., Khan, L. et al. (2025). LLM-Based Enhanced Clustering for Low-Resource Language: An Empirical Study. Computer Modeling in Engineering & Sciences, 145(3), 3883–3911. https://doi.org/10.32604/cmes.2025.073021
Vancouver Style
Khan TF, Hussain M, Arslan M, Saeed M, Khan L, Chang H. LLM-Based Enhanced Clustering for Low-Resource Language: An Empirical Study. Comput Model Eng Sci. 2025;145(3):3883–3911. https://doi.org/10.32604/cmes.2025.073021
IEEE Style
T. F. Khan, M. Hussain, M. Arslan, M. Saeed, L. Khan, and H. Chang, “LLM-Based Enhanced Clustering for Low-Resource Language: An Empirical Study,” Comput. Model. Eng. Sci., vol. 145, no. 3, pp. 3883–3911, 2025. https://doi.org/10.32604/cmes.2025.073021



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.