Open Access

ARTICLE

Leveraging Unlabeled Corpus for Arabic Dialect Identification

Mohammed Abdelmajeed1,*, Jiangbin Zheng1, Ahmed Murtadha1, Youcef Nafa1, Mohammed Abaker2, Muhammad Pervez Akhter3

1 School of Computer Science, Northwestern Polytechnical University, Xi’an, 710072, China
2 Department of Computer Science, Applied College, King Khalid University, Muhayil, 63311, Saudi Arabia
3 Computer Science Department, National University of Modern Languages, Faisalabad, 38000, Pakistan

* Corresponding Author: Mohammed Abdelmajeed.

Computers, Materials & Continua 2025, 83(2), 3471-3491. https://doi.org/10.32604/cmc.2025.059870

Abstract

Arabic Dialect Identification (DID) is a Natural Language Processing (NLP) task that determines the dialect of a given piece of Arabic text. State-of-the-art DID solutions are built on deep neural networks that learn sentence representations with respect to a given dialect. Although these solutions are effective, their performance depends heavily on the number of labeled examples, which are labor-intensive to obtain and may not be readily available in real-world scenarios. To alleviate the burden of labeling data, this paper introduces a novel solution that leverages unlabeled corpora to boost performance on the DID task. Specifically, we design an architecture that learns the information shared between labeled and unlabeled texts through a gradient reversal layer. The key idea is to penalize the model for learning features specific to the source dataset, so that it captures common knowledge regardless of the label. We evaluate the proposed solution on benchmark datasets for DID. Our extensive experiments show that it performs significantly better, especially when labeled data are sparse. Compared with existing Pre-trained Language Models (PLMs), our approach achieves a new state-of-the-art performance in the DID field. The code will be available on GitHub upon the paper’s acceptance.
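The gradient reversal idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: the class `GradientReversal`, the helper `domain_loss_and_grad`, the scaling factor `lam`, and the toy logistic "labeled vs. unlabeled" discriminator are all hypothetical names and assumptions chosen for illustration. The sketch only shows the general mechanism: the layer passes features through unchanged on the forward pass and flips (and scales) the gradient on the backward pass, so the shared encoder is pushed away from features that reveal which dataset a text came from.

```python
import math


class GradientReversal:
    """Illustrative gradient reversal layer (hypothetical, not the paper's code)."""

    def __init__(self, lam=1.0):
        # lam is an assumed scaling factor for the reversed gradient.
        self.lam = lam

    def forward(self, features):
        # Forward pass is the identity: features are passed through unchanged.
        return list(features)

    def backward(self, grad_output):
        # Backward pass negates and scales the incoming gradient.
        return [-self.lam * g for g in grad_output]


def domain_loss_and_grad(feature, v, domain_label):
    """Toy logistic discriminator: predicts whether a text is from the
    labeled (1) or unlabeled (0) corpus. Returns the binary cross-entropy
    loss and its gradient with respect to the input feature."""
    p = 1.0 / (1.0 + math.exp(-v * feature))
    loss = -(domain_label * math.log(p) + (1 - domain_label) * math.log(1.0 - p))
    grad_feature = (p - domain_label) * v
    return loss, grad_feature


grl = GradientReversal(lam=0.5)
shared_features = [0.8, -0.3]  # made-up encoder outputs

# Forward: the discriminator sees the features unchanged.
passed = grl.forward(shared_features)
loss, grad = domain_loss_and_grad(passed[0], v=1.5, domain_label=1)

# Backward: the encoder receives the reversed gradient, so a gradient
# descent step on the encoder increases the discriminator's loss.
reversed_grad = grl.backward([grad])
print(loss, grad, reversed_grad[0])
```

In this toy setting the discriminator's own parameters would still be updated with the ordinary gradient; only the signal flowing back into the shared encoder is reversed.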

Keywords

Arabic dialect identification; natural language processing; bidirectional encoder representations from transformers; pre-trained language models; gradient reversal layer

Cite This Article

APA Style
Abdelmajeed, M., Zheng, J., Murtadha, A., Nafa, Y., Abaker, M. et al. (2025). Leveraging Unlabeled Corpus for Arabic Dialect Identification. Computers, Materials & Continua, 83(2), 3471–3491. https://doi.org/10.32604/cmc.2025.059870
Vancouver Style
Abdelmajeed M, Zheng J, Murtadha A, Nafa Y, Abaker M, Akhter MP. Leveraging Unlabeled Corpus for Arabic Dialect Identification. Comput Mater Contin. 2025;83(2):3471–3491. https://doi.org/10.32604/cmc.2025.059870
IEEE Style
M. Abdelmajeed, J. Zheng, A. Murtadha, Y. Nafa, M. Abaker, and M. P. Akhter, “Leveraging Unlabeled Corpus for Arabic Dialect Identification,” Comput. Mater. Contin., vol. 83, no. 2, pp. 3471–3491, 2025. https://doi.org/10.32604/cmc.2025.059870



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.