Arabic Sentiment Analysis of Users’ Opinions of Governmental Mobile Applications

Different types of pandemics that have appeared from time to time have changed many aspects of daily life. Some governments encourage their citizens to use certain applications to help control the spread of disease and to deliver other services during lockdown. The Saudi government has launched several mobile apps to control the pandemic and has made these apps available through Google Play and the app store. A huge number of reviews are written daily by users to express their opinions, which include significant information to improve these applications. The manual processing and extracting of information from users’ reviews is an extremely difficult and time-consuming task. Therefore, the use of intelligent methods is necessary to analyse users’ reviews and extract issues that can help in improving these apps. This research aims to support the efforts made by the Saudi government for its citizens and residents by analysing the opinions of people in Saudi Arabia that can be found as reviews on Google Play and the app store using sentiment analysis and machine learning methods. To the best of our knowledge, this is the first study to explore users’ opinions about governmental apps in Saudi Arabia. The findings of this analysis will help government officers make the right decisions to improve the quality of the provided services and help application developers improve these applications by fixing potential issues that cannot be identified during application testing phases. A new dataset used for this research includes 8000 user reviews gathered from social media, Google Play and the app store. Different methods are applied to the dataset, and the results show that the k-nearest neighbour (KNN) method generates the highest accuracy compared to other implemented methods.


Introduction
The Arabic language is a morphologically complex language used by more than 422 million people, both native and non-native speakers. Currently, the huge volumes of Arabic reviews that reflect users' opinions about mobile applications (apps) are increasing dramatically. This increase makes it difficult to analyse and extract important information manually, especially for popular apps that have a large number of reviews [1]. To track users' opinions and behaviour towards certain apps, sentiment analysis and machine learning play important roles. Sentiment analysis is a branch of natural language processing that analyses users' attitudes about a specific topic, service, or product to abstract important information expressed in text. According to [2], "Sentiment analysis is a fundamental natural language processing task to automatically analyze raw text and infer from it semantic meaning that focuses on the author's attitude towards the written text". The development of machine learning techniques helps in addressing the issue of opinions, but mostly for reviews written in the English language [3]. However, the Arabic language represents an important area for researchers to investigate the ways that machine learning techniques and sentiment analysis can help to automatically obtain accurate information. Users depend heavily on reviews and ratings before downloading apps, especially when there are other similar options, which makes the analysis of users' opinions vital to application owners and developers [4].
Important information about bugs, improvements, and users' feelings and experiences is available in user reviews [5]. App owners, developers, companies, and governments use the information available in user reviews to better understand users' needs [5]. Manual analysis of user reviews seems to be an unachievable task due to the large number of available reviews; therefore, automated extraction is a good option. The nature of review text, different uses of words, the use of slang and idioms, the different structures of languages, fake reviews, and opinion spam are among the challenges for automated analysis [5].
As a consequence of COVID-19, the government of the Kingdom of Saudi Arabia (KSA) has developed many mobile applications and made them compulsory (Absher, Sehah, Tabaod, Tataman, Tawakalna, Es'efni). These apps are available in Google Play and the app store and on the official governmental website for users to download and review. Due to the huge number of users, thousands of positive, negative, and neutral reviews are reported regularly based on users' experiences. These reviews are important for improving the apps, but it is time-consuming to examine each review independently.
This research aims to help and support the great efforts made by the Saudi government for its citizens and residents by analysing the opinions of people in Saudi Arabia on social media and other platforms (Google Play and the app store) regarding these efforts using intelligent methods such as sentiment analysis and machine learning methods. This analysis will support government officers in decision-making to improve the quality of the provided services and will help application developers improve the developed applications and fix any potential issues.

Literature Review
Sentiment analysis in general and Arabic sentiment analysis (ASA) specifically are among the hottest research topics for the scientific community. Many general systematic survey papers have been published recently that surveyed ASA in general, such as [6][7][8][9], which covered different aspects related to ASA. Researchers in [10,11] discussed the challenges and issues facing ASA. In [12][13][14][15], the authors reviewed the tools and approaches of ASA available in the literature. Twitter has received attention from researchers, as in [16,17]. Some of the research on ASA is dedicated to analysing opinions in social media (Facebook, Twitter, YouTube), as in [17][18][19][20][21]. Many available datasets for testing proposed methods of ASA for standard Arabic can be found in [19,20,22]. Like other languages, the Arabic language has many local dialects that people use instead of standard Arabic to express their ideas and opinions. Therefore, many researchers have studied ASA for their local dialects, such as the Moroccan [23], Algerian [4], Saudi [24], Jordanian [25], Tunisian [26], Egyptian [27], Iraqi [28], and Yemeni [29] dialects. Opinions and reviews in Google Play and the app store have motivated many researchers to analyse users' opinions of the different mobile applications available for download on these platforms [3,5,30,31]. Some researchers study sentiment analysis at the word level [17,32], while others consider the sentence level, as in [33,34].
In [4], the researchers analysed Algerian reviews in application stores using ASA. Two approaches were utilized for the analysis: (i) the automatic approach based on machine learning and (ii) the lexicon-based approach. For evaluation, 50000 reviews were collected from popular Algerian applications in the Google Play store. The obtained results were promising as they achieved an accuracy of 80% using the lexicon-based approach and 72% for SVM on dialect reviews.
In [42], the authors discussed sentiment analysis applications that used efficient classification techniques in different domains, such as marketing, health care, and education. With regard to sentiment analysis applications, the issue has been noted that these applications can be removed using mobile phones. A deep learning model was implemented in [30] on a refined dataset to study the various intricate details of the underlying data. A few machine learning techniques have been implemented, such as naïve Bayes and XGBoost, in addition to deep learning classifiers and multilayer perceptron classifiers (MLPs). Furthermore, functional layers of Keras have been implemented to combine all the features of mobile apps, such as text reviews, numerical features, the total number of reviews, and categorical columns. The performance was evaluated based on the metrics of accuracy and area under the receiver operator characteristic (ROC) curve.
In [43], using textual reviews on Google Play, the authors proposed a system to determine the polarity of sentiments that can be performed on mobile devices. A client server-based system architecture was used where the server performed training and classification tasks, while the clients were mobile devices that performed sentiment analysis tasks that could be run on small-resource mobile devices. Naïve Bayes was used for the developed application, and a linear support vector machine was used for comparison. The accuracy of the naïve Bayes classifier was 83.87%, while the accuracy of SVM was 89.49%. It was reported that the use of semantic handling contributed to reducing the accuracy of the classifiers.
With the aim of identifying the most relevant topics in a document, researchers in [44] used a sentiment analysis approach that included a lexicon-based model for specifying the set of emotions and a statistical methodology that was the target of the sentiments. In addition, a heuristic learning method was used to improve the initial knowledge considering users' feedback. The proposed sentiment analysis approach was integrated into an Android-based mobile app. It automatically assigned sentiments to pictures, taking into account the descriptions provided by the users.
In [45], a system was proposed to model mobile users' feedback behaviours to analyse sentiment trends. The dataset was collected from a popular Chinese mobile application called Toutiao. Few analysis methods have been proposed for the sentiment of comments, and modelling algorithms have been proposed for feedback behaviours. A system called MoSa was built to identify several implicit behaviour models and hidden sentiment trends. This system and modelling methods provided empirical results to guide interaction design for the mobile internet, social networks, and blockchain-based crowdsourcing.
Naïve Bayes and support-vector machine supervised machine learning algorithms were utilized in [46] to predict sentiments and highly recommended brands based on Twitter tweets. Based on the experiment, the support-vector machine produced more accurate results than naïve Bayes.
In [47], the researchers adopted the keyword co-occurrence measure (SKCM) for Arabic enhanced sentiment analysis. They started with special pre-processing steps followed by SKCM to extract sentiment-based feature selection using an SVM classifier. The results were very promising for enhancing the accuracy of sentiment analysis. In [48], four machine learning techniques were implemented for three Arabic language corpora to increase the accuracy of opinions using rule-based feature selection approaches. The results showed that the proposed approaches yielded better results as different domains were used for the experiments to show the impact of the proposed model on several ML techniques.
In [49,50], the authors introduced a novel feature selection method with a voting classifier algorithm to classify CT images and determine whether COVID-19 is positive or negative. Their proposed voting classifier, called PSO-Guided-WOA, achieved the best results among the compared methods.

Research Methodology
The methodology followed in this research started with the data collection step, in which reviews were gathered from Google Play and the app store for selected governmental applications. Then, extensive pre-processing was implemented on the collected reviews. Subsequently, features were extracted and a corpus was built, followed by the exploration of several machine learning algorithms that were applied to obtain results and calculate accuracy. Fig. 1 shows the methodology followed in this research.

Dataset Collection
The dataset used in this research was collected from users' reviews of some Saudi governmental applications available on Google Play and the app store. These applications were Tawakkalna, Tetaman, Tabaud, Sehhaty, Mawid, and Sehhah. Only Arabic reviews were gathered. The total number of reviews was 8000 extracted from Google Play and the app store for the mentioned applications. After pre-processing the dataset, 7759 reviews were considered for analysis. Tab. 1 shows the statistics of the reviews based on their applications.

Dataset Pre-processing
The Arabic language involves many issues, such as morphological complexities and dialectal varieties. Thus, it requires progressive pre-processing and lexicon-building steps. The pre-processing steps are as follows.

Dataset Labelling
The process of identifying raw data is called data labelling, where informative labels are added to the dataset to help machine learning algorithms in the learning process. Accurate data labelling is very important for machine learning algorithms to obtain highly accurate results. Polarity classification [51] is used, in which labels are classified into three polarities: 1 for positive records, 0 for neutral records, and −1 for negative records. Tab. 2 shows the results of the labelling stage. Tab. 3 shows samples of three portions from each corpus. Fig. 2 shows dataset statistics.

Word Tokenization
The process of splitting a sentence into a list of words is called word tokenization. Each record in the corpus was segmented into small units, which were the words; these words were counted in every record. Counts were registered for each unique word against each record. Every unique word was then grouped depending on its root, i.e., the stem. For example, the tokenization of the Arabic sentence is as follows: and
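The tokenization and per-record counting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a token is any maximal run of Arabic letters, and the function names `tokenize` and `count_tokens` and the two sample reviews are hypothetical.

```python
import re
from collections import Counter

# Assumption: a word is a maximal run of Arabic letters (U+0621 to U+064A).
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")

def tokenize(review: str) -> list[str]:
    """Split one review into a list of Arabic word tokens."""
    return ARABIC_WORD.findall(review)

def count_tokens(reviews: list[str]) -> list[Counter]:
    """Return one word-count table per record, keyed by unique word."""
    return [Counter(tokenize(review)) for review in reviews]

# Hypothetical sample reviews, for illustration only.
reviews = ["التطبيق ممتاز جدا", "التطبيق بطيء"]
counts = count_tokens(reviews)
```

Each `Counter` holds the count of every unique word for one record; grouping those words by stem would be a separate step applied afterwards.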

Dataset Corpus and Noise Cleaning
Any real text scraped or collected from the web includes many undesired aspects of letters or words. Therefore, each record was checked for noise, such as non-letter characters, stop words, and non-Arabic words. By reading plain text sources and distributing a dataset to records, each record was ordered into three components: Record number, record text content, and record label. Class labels are polynomials of three values, as mentioned before. For the previous example and after the noise was removed, the remaining words were as follows: and Then, the root of these words was used in the word stemming step.
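A rough sketch of the noise-cleaning step, under stated assumptions: the stop-word list below is a tiny hypothetical placeholder (the stop words actually used are not listed in the text), and the helper name `clean_record` is ours. It removes diacritics and tatweel, drops non-letter characters and non-Arabic words, filters stop words, and returns the three record components named above.

```python
import re

# Hypothetical stop-word list for illustration; the real list is not given in the text.
STOP_WORDS = {"في", "من", "على"}

# Arabic diacritic marks (U+064B to U+0652) and the tatweel elongation character (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Assumption: a word is a maximal run of Arabic letters.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")

def clean_record(number: int, text: str, label: int) -> dict:
    """Return a record with number, cleaned text content, and label."""
    text = DIACRITICS.sub("", text)
    # Keeping only Arabic-letter runs discards digits, punctuation and non-Arabic words.
    words = [w for w in ARABIC_WORD.findall(text) if w not in STOP_WORDS]
    return {"number": number, "text": " ".join(words), "label": label}

record = clean_record(1, "التطبيق great 123 ممتاز في !!", 1)
```

Diacritics are stripped before matching so that a vocalised and an unvocalised spelling of the same word end up as the same token.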

Words Stemming
Word Stemming is the process used to find the root of words where selected words are reduced to their word stems. For example, the root of the words and is Stems are used in the feature selection methods. They are the roots of every similar word morphologically. In the Arabic language, roots or stems are mostly three or four letters. The Khoja stemmer algorithm [52] was used for stemming the words in the dataset.

Feature Selection and Extraction
Feature selection and extraction are a significant part of any sentiment analysis method to filter irrelevant or redundant features in the dataset. After pre-processing, the text must be converted into an understandable form. Hybrid and multiple feature selection are trends in the sentiment analysis literature, as in [53,54]. In this research, the following methods were explored for feature selection and extraction.
For each word in the review collection, we calculated a set of linguistic and statistical features using the aforementioned collections and then used machine learning algorithms for term classification. (I) The term frequency matrix (TFM) is one of the feature selection methods used in this research.
It is the simplest way to map the text into a numerical representation, recording the frequency of each word in the collected reviews. Stems are counted over all records: the resulting n × m matrix covers the n records and the m words, and each occurrence of a unique word related to a particular stem is added to that stem's count. Features VIII, IX, and X were experimentally selected to be further explored in the next phase (ML classification). These features produce a weight for each record based on a supervised automatic sentiment lexicon. The supervised lexicon is gathered from the corpus: each record's unique words are grouped according to the label of the record, giving three lists of words: a negative list, a positive list, and a neutral list. We eliminate neutral words and clear the remaining two lists of shared words, so any word that appears in both the negative list and the positive list is removed from both. The final two lists comprise the supervised lexicon. The lexicon is used to produce a weight for each record by adding the number of positive-list words the record contains and subtracting the number of negative-list words, which produces an extracted feature that we call a supervised feature. This feature is then combined with the traditional TDM and TFM selected features to increase the accuracy of classification.
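The term frequency matrix and the supervised lexicon feature described above can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: all function names are hypothetical, English placeholder stems stand in for Arabic stems, and "eliminate neutral words" is interpreted here as discarding the neutral list (the text could also be read as removing neutral-list words from the other two lists).

```python
def term_frequency_matrix(records):
    """records: list of stem lists. Returns (vocabulary, n x m count matrix)."""
    vocab = sorted({stem for rec in records for stem in rec})
    column = {stem: j for j, stem in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in records]
    for i, rec in enumerate(records):
        for stem in rec:
            matrix[i][column[stem]] += 1
    return vocab, matrix

def build_supervised_lexicon(records, labels):
    """Group unique words by record label (1 positive, -1 negative, 0 neutral),
    discard the neutral group, and remove words shared by both remaining lists."""
    positive, negative = set(), set()
    for rec, label in zip(records, labels):
        if label == 1:
            positive.update(rec)
        elif label == -1:
            negative.update(rec)
    shared = positive & negative
    return positive - shared, negative - shared

def supervised_feature(rec, positive, negative):
    """Weight = number of positive-list words minus number of negative-list words."""
    return sum(w in positive for w in rec) - sum(w in negative for w in rec)

# Placeholder stems for illustration only.
records = [["app", "great"], ["app", "slow", "crash"], ["app", "update"]]
labels = [1, -1, 0]
vocab, tfm = term_frequency_matrix(records)
pos, neg = build_supervised_lexicon(records, labels)
weights = [supervised_feature(rec, pos, neg) for rec in records]
```

In this toy corpus "app" occurs in both positive and negative records, so it is dropped from the lexicon, and the resulting weights can be appended as one extra column to each TFM row.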

Classification Algorithms
For classification, four of the most common machine learning (ML) classifiers are selected to perform the comparative analysis. These classifiers are (i) decision tree (DT), (ii) support vector machine (SVM), (iii) K-nearest neighbour (KNN), and (iv) naïve Bayes (NB).

Regular Naïve Bayes (NB)
The naïve Bayes classifier is a popular classifier that assumes independence between every pair of features. Eq. (1) for naïve Bayes is adopted from [55]:
$$y = \arg\max_{c_k} P(c_k) \prod_{i=1}^{d} P(x_i \mid c_k) \quad (1)$$
where y represents the final class of the text, which is used to match the labels of the data, c_k denotes the class labels, and x_i is the i-th component of the d-dimensional feature vector.

Decision Tree (DT) with the Gain Ratio Equation
Another popular ML classifier is a decision tree where each node used denotes a choice among several alternatives and each leaf node presents a classification (decision). DT starts with a root that branches off into several solutions, similar to a tree. DT has been used extensively in ASA.
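The section heading refers to the gain ratio, but its equation is not given in the text. The sketch below uses the standard C4.5 definition (information gain divided by split information) as an assumption; the function names are hypothetical and the feature is treated as categorical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Gain ratio of one categorical feature with respect to the class labels."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after splitting on the feature.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    # A feature with a single value cannot split the data.
    return info_gain / split_info if split_info > 0 else 0.0

perfect = gain_ratio([1, 1, 0, 0], [1, 1, -1, -1])
useless = gain_ratio([1, 0, 1, 0], [1, 1, -1, -1])
```

At each node a tree built this way would choose the feature with the largest gain ratio; dividing by the split information penalises features that fragment the data into many small groups.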

K-Nearest Neighbor with k = 3
KNN (K-nearest neighbours) is a supervised learning method used as a classifier in machine learning. It measures the similarity of a given vector to the other vectors available in the dataset. Two main parameters need to be set for KNN: (i) the k value and (ii) the distance metric. The Euclidean distance is used, with k = 3; this value was chosen because numerical analysis showed that the best results were obtained when k = 3. KNN compares each new vector with the k training examples that are its closest neighbours. KNN is among the popular methods used for ASA, as in [18,21].
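A minimal sketch of the classifier configuration described above (Euclidean distance, k = 3, majority vote). The function names and the toy two-dimensional data are hypothetical; real feature vectors would come from the feature extraction step.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Label the query by majority vote among its k nearest training vectors."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data for illustration only: two well-separated groups.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [-1, -1, -1, 1, 1, 1]
near_negative = knn_predict(train_X, train_y, [0.2, 0.1])
near_positive = knn_predict(train_X, train_y, [5.5, 5.2])
```

With three classes and k = 3 a three-way tie is possible; this sketch then returns the label of the closest neighbour among the tied ones, which is one reasonable tie-breaking choice among several.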

Support Vector Machine
SVM is a supervised machine learning algorithm whose main concept is to assign labels to objects based on the learning process through examples. SVM is used extensively for ASA in the literature, as in [18,40,56]. In this work, SVM with a sigmoid kernel was used according to the following equations [57]:
$$w^* = \sum_{i} a_i^* y_i \phi(x_i)$$
$$f(x) = w^* \cdot \phi(x) + b = \sum_{i} a_i^* y_i K(x_i, x) + b$$
where w^* denotes the weight vector used to specify the hyperplane with maximum margin, ϕ(x) is the predefined function of the input vector x, the optimal coefficients determined during the training process are denoted by a_i^*, K() is the selected kernel function, y specifies the class labels, and parameter b is the bias.
Among the different kernels used with SVM, we chose the sigmoid kernel because of its origin in neural networks. The sigmoid function gives values between −1 and 1; therefore, it can classify the predictions based on a particular threshold, as in Eq. (2):
$$k(x, y) = \tanh(a\, x \cdot y + b) \quad (2)$$
where a is alpha and b is the intercept constant. The kernel can be tuned through these two parameters.
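Eq. (2) can be evaluated directly, as in the sketch below. The default values of `a` and `b`, the function names, and the example coefficients are all hypothetical; `decision_value` uses the standard kernel-SVM form (a weighted sum of kernel values plus a bias) as an assumption, and the bias is named `bias` to keep it apart from the kernel intercept `b`.

```python
import math

def sigmoid_kernel(x, y, a=0.01, b=0.0):
    """Sigmoid kernel tanh(a * <x, y> + b); the result lies strictly between -1 and 1."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return math.tanh(a * dot + b)

def decision_value(x, support_vectors, coeffs, labels, bias, a=0.01, b=0.0):
    """Standard kernel-SVM score: sum_i coeff_i * label_i * K(sv_i, x) + bias."""
    return sum(
        c * y * sigmoid_kernel(sv, x, a, b)
        for sv, c, y in zip(support_vectors, coeffs, labels)
    ) + bias

k_value = sigmoid_kernel([1, 2], [3, 4], a=0.1, b=0.0)  # tanh(0.1 * 11)
score = decision_value([1, 0], [[1, 0]], [1.0], [1], bias=0.0, a=1.0)
```

The sign of the score gives the predicted side of the hyperplane for a two-class problem; a three-class task such as this one would need a one-vs-rest or one-vs-one scheme on top.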

Result
Generally, the performance of a classification algorithm is measured based on its accuracy, recall, precision and F-measure. Accuracy refers to the ratio of the number of correctly classified samples to the total number of classified samples:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where true positive (TP) refers to the number of reviews that are classified correctly and belong to the current class. True negative (TN) refers to the number of reviews that are classified correctly and do not belong to the current class.
False positive (FP) refers to the number of reviews that are mistakenly classified as belonging to the current class. False negative (FN) refers to the number of reviews that belong to the current class but are mistakenly classified as not belonging to it, as shown in Tab. 4. Tab. 5 illustrates the results produced by the classifiers in terms of accuracy, recall, precision, and F-measure. Tab. 6 presents a comparison of the classifiers based on execution time. KNN using feature IX produced the highest accuracy, 78.46%, compared with 59.92%, 55.38%, and 54.78% for DT, SVM, and NB, respectively. The NB model has the worst accuracy compared to the other methods.
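The four measures can be computed per class from the TP, TN, FP, and FN counts defined above. The sketch below is illustrative only: the function name and the toy label lists are hypothetical, and it scores one class against the rest, so how the reported figures were averaged across the three classes remains an open assumption.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Accuracy, precision, recall and F-measure for one class versus the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Toy labels for illustration only (1 positive, 0 neutral, -1 negative).
y_true = [1, 1, 0, -1, -1, 1]
y_pred = [1, 0, 0, -1, 1, 1]
positive_scores = per_class_metrics(y_true, y_pred, cls=1)
```

For the positive class in this toy example there are two true positives, two true negatives, one false positive, and one false negative, so all four measures equal 2/3.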
The recall is up to 78.1% for KNN, which is the highest, while the other methods obtained 59.08%, 57.49%, and 50.55% for DT, NB, and SVM, respectively. The best precision was 79.94% by KNN, while 75.82%, 74.48%, and 72.64% were obtained by SVM, DT, and NB, respectively. For the F-measure, KNN scored 78.69%, whereas NB, DT, and SVM scored 64.18%, 61.46%, and 60.66%, respectively. The last evaluation criterion was the execution time, where DT and NB were the fastest methods while SVM was the slowest. Fig. 3 shows the comparison of different classifiers based on accuracy.

Conclusion
This paper proposed a hybrid feature selection method for Arabic sentiment analysis to extract users' opinions of Saudi governmental applications for COVID-19. A new Arabic dataset was developed that includes 7759 reviews collected from Google Play and the app store. The collected dataset posed many challenges because it was in the Arabic language. Therefore, extensive pre-processing steps were utilized to prepare the data for use by ML classifiers. This was done by analysing the gathered reviews to label them as positive, negative, or neutral sentiments and performing the necessary data cleaning. Four well-known classifiers were implemented (DT, SVM, KNN, and NB). The results showed that KNN outperformed the other methods with an accuracy of 78.46% compared to 59.92%, 55.38%, and 54.78% for DT, SVM, and NB, respectively. The NB model had the worst accuracy compared to the other methods.