Open Access

ARTICLE

Embedding Extraction for Arabic Text Using the AraBERT Model

Amira Hamed Abo-Elghit1,*, Taher Hamza1, Aya Al-Zoghby2

1 Faculty of Computers and Information, Department of Computer Sciences, Mansoura University, Mansoura, 35516, Egypt
2 Faculty of Computers and Artificial Intelligence, Department of Computer Sciences, Damietta University, Damietta, 34517, Egypt

* Corresponding Author: Amira Hamed Abo-Elghit.

Computers, Materials & Continua 2022, 72(1), 1967-1994. https://doi.org/10.32604/cmc.2022.025353

Abstract

Multi-task learning trains a machine-learning algorithm on several related tasks rather than on a single task. In this work, we propose an algorithm that estimates textual similarity scores, which can then be used in multiple tasks such as text ranking, essay grading, and question answering. We represent the Arabic texts in the SemEval2017-task3-subtask-D dataset with several vectorization schemes: lexical-based similarity features, frequency-based features, and pre-trained model-based features. We also use contextual embedding models such as Arabic Bidirectional Encoder Representations from Transformers (AraBERT), which we apply in two ways. First, AraBERT serves as a feature extractor alongside the features from the text vectorization schemes; these features are fed to various regression models that predict a relevancy score between Arabic text units. Second, AraBERT is adopted as a pre-trained model whose parameters are fine-tuned to estimate the relevancy scores between Arabic sentences. We conducted several experiments to compare these two uses of AraBERT. In terms of Mean Absolute Percentage Error (MAPE), the results show only a minor difference between AraBERT v0.2 as a feature extractor (21.7723) and the fine-tuned AraBERT v2 (21.8211). In terms of the coefficient of determination (R²), AraBERT v0.2-Large as a feature extractor outperforms the fine-tuned AraBERT v2 model on this dataset (0.014050 versus −0.032861).
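The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions of MAPE and the coefficient of determination, and it assumes MAPE is expressed in percent; the abstract does not state the exact formulas the authors used, and the function names `mape` and `r2_score` as well as the sample scores are hypothetical.

```python
from typing import Sequence


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, returned in percent (assumed convention)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical relevancy scores, for illustration only (not data from the paper).
gold = [1.0, 2.0, 4.0]
predicted = [1.5, 2.0, 3.0]
print(mape(gold, predicted))
print(r2_score(gold, predicted))
```

Under this definition, a regressor that always predicts the mean of the true scores has an R² of 0, so a negative R² means the predictions have a larger squared error than that constant baseline, and values near zero indicate that little of the variance in the scores is explained.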

Cite This Article

A. Hamed Abo-Elghit, T. Hamza and A. Al-Zoghby, "Embedding extraction for Arabic text using the AraBERT model," Computers, Materials & Continua, vol. 72, no. 1, pp. 1967–1994, 2022. https://doi.org/10.32604/cmc.2022.025353



This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.