Nowadays, an unprecedented number of users interact through social media platforms and generate massive amounts of content due to the explosion of online communication. However, because user-generated content is unregulated, it may contain offensive material such as fake news, insults, and harassment. Identifying fake news and rumors and tracking their dissemination on social media has become a critical requirement, as they have adverse effects on users, businesses, enterprises, and even political regimes and governments. Most state-of-the-art work has tackled English-language news using feature-based algorithms. This paper proposes a model architecture to detect fake news in the Arabic language using only textual features. Both machine learning and deep learning algorithms were applied. The deep learning models are based on convolutional neural networks (CNN), long short-term memory (LSTM), bidirectional LSTM (BiLSTM), CNN+LSTM, and CNN+BiLSTM. Three datasets were used in the experiments, each containing the textual content of Arabic news articles; one of them is real-life data. The results indicate that the BiLSTM model outperforms the other models in accuracy under both the simple data-split and recursive training modes.
The rise of social networks has considerably changed the way users around the world communicate. Social networks and user-generated content (UGC) platforms allow users to generate, share, and exchange their thoughts and opinions via posts, tweets, and comments. Thus, social media platforms (e.g., Twitter, Facebook) are powerful tools through which news and information can be rapidly transmitted and propagated. These platforms have thereby become a primary source of information and news for individuals on the Web [
False information can be classified as intention-based or knowledge-based [
The bulk of research on fake news detection is based on machine learning techniques [
Fake news can have harmful consequences for social and political life. Detecting it is very challenging, especially in languages other than English. Arabic is one of the most widely spoken languages in the world, and there are many sources of Arabic news, including official news websites; these sources are the primary origin of Arabic datasets. Our goal is to detect rumors and measure the effect of fake news detection in the Middle East region. We have evaluated many algorithms to achieve the best results.
The main objective of this work is to explore and evaluate the performance of different deep learning models for improving fake news detection in Arabic, and to compare their performance with traditional machine learning techniques. Eight machine learning algorithms with cross-fold validation are evaluated, including probabilistic and vector-space algorithms. We have also tested five combinations of deep learning algorithms, including CNN and LSTM.
The paper is organized as follows. Section 2 tackles the literature review in some detail. The proposed model architecture is presented in Section 3. Section 4 presents the experiments and the results with discussion. The paper is concluded in Section 5.
There are many methods used for fake news detection and rumor detection. The methods include machine learning and deep learning algorithms, as illustrated in the following subsections.
Fake news detection has been investigated from different perspectives; each utilized different features for information classification. These features included linguistic, visual, user, post, and network-based features [
Many methods treat rumor or fake news detection as a classification problem. These methods aim to associate class labels, such as rumor/non-rumor, true/false, or fake/genuine, with a specific piece of text. Researchers have applied machine learning methods with promising results. Other researchers have used data mining techniques that rely on extrinsic resources, such as knowledge bases, either to predict the class of social media content or to examine its truthfulness. Many rumor detection methods have concentrated on content features for classification, while a few have relied on social context. More commonly, rumor detection and verification methods use a combination of content and context features, since exploiting the social context of rumors can significantly enhance detection performance [
Categories | Methods | Samples |
---|---|---|
User–related attributes | Machine learning | SVM, Random Forest, Decision Tree, Logistic Regression, Conditional Random Field (CRF), Hidden Markov Model (HMM). |
Content–related attributes | Deep learning | Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN) |
Other methods | Retweet Behavior, Diffusion Patterns, Anomaly Detection, Hawkes process, Crowdsourcing, Computational Fact-Checking | |
Most fake news detection works formulate the task as a binary classification problem. The literature may fall under the umbrella of three main classes [
Deep learning (DL) [
In [
Other attempts were made using text and one or more other features in [
Post-based, together with user-based features, were used for fake news predictions [
Study | Used features | Experimented models | Dataset |
---|---|---|---|
Umer | Only text | Logistic regression | size: 63,000 articles |
Girgis | Only text | RNN | LIAR Dataset |
Jing et al. [ | Only text | SVM-TS | Twitter, 498 rumors and 494 non-rumors |
Muhammad | Only text | K-nearest neighbors (KNN) | Pheme rumor dataset (5800 tweets) |
Sansiri | Only text | LSTM | 20,015 news (11,941 fake, 8,074 real news) |
Verma | Only text | GRU | 50,000 real news dataset |
Kaliyar | Only text | CNN | Kaggle dataset |
Kaliyar | Text | CNN | LIAR Dataset |
Sansiri | Content-based | DTC | Real-world dataset (nearly 615,000 tweets from 284,000 users) |
Pavithra | Textual | CNN | Pheme-RNR (1971 rumor + 3819 non-rumor) |
Jing et al. [ | Text | DTR | Twitter15 (1,381 propagation trees) |
Yichun | Post-based | DLSTM | Sina Weibo (2313 rumors, 2351 non-rumors) |
Lin et al. [ | Text | DTR | Twitter (992 events) |
The Arabic language has a complex structure that imposes challenges, in addition to the lack of datasets. Thus, research on rumor detection in Arabic social media is scarce and requires more attention and effort to achieve optimal results. The studies that focus on Arabic rumor detection are summarized in
Reference | Year | Objective | Method |
---|---|---|---|
[ | 2016 | Determining the attributes of Arabic rumor (fake information) patterns | Using natural language processing and machine learning |
[ | 2018 | Detecting truth in Arabic tweets; introducing the main procedures of the public model of Truth Detection for Arabic Tweets | |
[ | 2018 | Detecting the trustworthiness of information that is prevalent on social media | Using classical machine learning algorithms |
[ | 2018 | Credibility-checking model for Arabic news | Content parsing, feature extraction, content verification, users' comments, polarity evaluation, and credibility classification |
[ | 2019 | Detecting rumors in Arabic social media | Using features extracted from the user and the content and analyzing them to identify their importance |
The proposed methodology investigates the most famous state-of-the-art deep learning algorithms for Arabic text classification. Deep learning techniques have the advantage of being able to capture semantic features from textual data [
The first dataset consists of news items and tweets that were manually collected and annotated as rumor or non-rumor. This real-life dataset was collected from Arabic news portals such as Youm7, Akhbarelyom, and Ahram. The fake items were officially announced as false to make people aware that not all news is genuine; this effort is the responsibility of the Information and Decision Support Center of the Egyptian Cabinet.
The second dataset is a benchmark dataset published in [
Dataset | Number of tweets | Class label 0 (non-rumor) | Class label 1 (rumor) | No. of unique tokens | Average length of tokens | Max text length |
---|---|---|---|---|---|---|
Real dataset | 1980 | 801 | 1178 | 8881 | 14.1 | 46 |
Benchmark dataset | 2578 | 1218 | 1364 | 12919 | 18.4 | 88 |
Merged dataset | 4561 | 2019 | 2542 | 18483 | 16.3 | 88 |
Preprocessing the text before it is fed into the classifier is very important and impacts the overall performance of the classification model. In this step, the text is cleaned using filters to remove punctuation and all non-Unicode characters. Afterward, stop words are removed, sentences are tokenized, and tokens are stemmed. The resulting sentences are then encoded as numerical sequences, and the number of unique tokens and the maximum sentence length are calculated. This maximum length is used to pad all sentences to the same size. Labels are then encoded using one-hot encoding.
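The steps above can be sketched with the TensorFlow Keras text utilities; the two sample sentences below are illustrative placeholders, not the paper's data, and the stop-word removal and stemming steps are omitted here.

```python
# A minimal sketch of the preprocessing pipeline: tokenize, encode as
# integer sequences, pad to the maximum length, one-hot encode labels.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

sentences = ["breaking news about the economy", "officials deny the rumor"]
labels = [1, 0]  # 1 = rumor, 0 = non-rumor

# Tokenize: the default filters strip punctuation, and each word is
# mapped to an integer index.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Pad all sequences to the maximum sentence length in the corpus.
max_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding='post')

# One-hot encode the class labels.
y = to_categorical(labels, num_classes=2)
```

The padded matrix and one-hot labels are what the embedding layer and loss function consume in the models below.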
Recently, word embeddings have proved to outperform traditional text representation techniques. A word embedding represents each word as a real-valued vector in a dimensional space while preserving the semantic relationships between words, so that vectors of words with similar meanings are placed close to each other. Word embeddings can be learned from the text while fitting a deep neural model. In our work, the TensorFlow Keras embedding layer was used. It takes the numerically encoded text as input and is implemented as the first hidden layer of the deep neural network, where the word embeddings are learned while training the network. The embedding layer stores a lookup table that maps the words, represented by numeric indexes, to their dense vector representations.
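A small illustration of this layer follows: integer word indexes in, dense trainable vectors out. The vocabulary size and embedding dimension here are assumptions for illustration only.

```python
# The embedding layer stores a (vocab_size x embed_dim) lookup table that
# is updated by backpropagation while the network trains.
import numpy as np
import tensorflow as tf

vocab_size = 10000  # assumed number of unique tokens
embed_dim = 100     # assumed dimensionality of the word vectors
max_len = 46        # padded sentence length (max length of the real dataset)

embedding = tf.keras.layers.Embedding(input_dim=vocab_size,
                                      output_dim=embed_dim)

batch = np.random.randint(0, vocab_size, size=(32, max_len))
vectors = embedding(batch)  # one 100-dimensional vector per word index
```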
Our system explores the usage of three deep neural networks, namely CNN, LSTM, and BiLSTM, and two combinations, CNN+LSTM and CNN+BiLSTM, as illustrated in
The CNN model consists of one convolutional layer, which learns to extract features from sequences represented using a word embedding and derives meaningful sub-structures for the overall prediction task. It is implemented with 64 filters (parallel fields for processing words) with a rectified linear ('relu') activation function. The second layer is a pooling layer that reduces the output of the convolutional layer by half. The 2D output from the CNN part of the model is flattened into one long 1D vector representing the features extracted by the CNN. Finally, two dense layers are used to scale, rotate, and transform the vector through matrix-vector multiplication.
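A sketch of this CNN model in Keras is shown below; the kernel size, dense width, and vocabulary/length values are assumptions, not the paper's exact settings.

```python
# CNN model sketch: embedding, one convolutional layer (64 filters, ReLU),
# pooling that halves the sequence, flattening, and two dense layers.
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 10000, 100, 46  # illustrative sizes

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, embed_dim),
    layers.Conv1D(filters=64, kernel_size=3, activation='relu'),
    layers.MaxPooling1D(pool_size=2),       # halves the conv output length
    layers.Flatten(),                       # one long 1D feature vector
    layers.Dense(64, activation='relu'),
    layers.Dense(2, activation='softmax'),  # rumor / non-rumor probabilities
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```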
The output of the word embedding layer is fed into one LSTM layer with 128 memory units. The output of the LSTM is fed into a dense layer of size 64, which adds capacity on top of the LSTM's output. Since this is a binary classification problem, binary cross-entropy is used as the loss function.
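This LSTM model can be sketched as follows; the vocabulary size, embedding dimension, and sequence length are assumptions.

```python
# LSTM model sketch: embedding output into one LSTM layer with 128 memory
# units, then a dense layer of size 64 and a softmax output, trained with
# binary cross-entropy.
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 10000, 100, 46

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, embed_dim),
    layers.LSTM(128),                       # last hidden state only
    layers.Dense(64, activation='relu'),
    layers.Dense(2, activation='softmax'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```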
The third model combines CNN with LSTM, where two convolutional layers are added with max-pooling and dropout layers. The convolutional layers act as feature extractors for the LSTM on the input data. The first CNN layer uses the output of the word embedding layer, and the pooling layers reduce the features extracted by the CNN layers. A dropout layer is added to help prevent the network from overfitting. One LSTM layer with a state output of size 128 is then added; since the default return-sequences setting is False, only the output of the last state of the LSTM is used. The output of the LSTM is connected to a dense layer of size 64, and the softmax activation function generates the final classification by calculating the probability of each class.
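The CNN+LSTM stack described above can be sketched as follows; kernel sizes and filter counts are assumptions.

```python
# CNN+LSTM sketch: two convolutional blocks with max-pooling and dropout
# act as feature extractors for a 128-unit LSTM (return_sequences is False
# by default, so only the last state is kept), followed by a dense layer
# and a softmax output.
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 10000, 100, 46

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, embed_dim),
    layers.Conv1D(64, 3, activation='relu'),
    layers.MaxPooling1D(2),
    layers.Dropout(0.5),                    # helps prevent overfitting
    layers.Conv1D(64, 3, activation='relu'),
    layers.MaxPooling1D(2),
    layers.Dropout(0.5),
    layers.LSTM(128),
    layers.Dense(64, activation='relu'),
    layers.Dense(2, activation='softmax'),
])
```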
The BiLSTM model uses bidirectional recurrent cells for learning. The output from the word embedding layer is fed into a bidirectional LSTM. Afterward, dense layers are used to find the most suitable class based on probability.
The CNN+BiLSTM model architecture uses a combination of convolutional and recurrent neurons for learning. The output of the embedding layer is fed into two convolutional layers that learn features for the BiLSTM layer. The features extracted by the CNN layers are max-pooled and concatenated, and a fully connected dense layer predicts the probability of each class label.
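A sequential approximation of this architecture is sketched below. How the paper concatenates the pooled CNN features is not specified, so this sketch simply stacks the two convolutional blocks; all layer sizes are assumptions.

```python
# CNN+BiLSTM sketch: two convolutional blocks with max-pooling feed a
# bidirectional LSTM, and a fully connected layer outputs the class
# probabilities.
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 10000, 100, 46

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, embed_dim),
    layers.Conv1D(64, 3, activation='relu'),
    layers.MaxPooling1D(2),
    layers.Conv1D(64, 3, activation='relu'),
    layers.MaxPooling1D(2),
    layers.Bidirectional(layers.LSTM(128)),  # forward + backward states
    layers.Dense(2, activation='softmax'),
])
```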
For training the deep learning models, the Adam optimizer is used with a 0.01 learning rate, a weight decay of 0.0005, and a batch size of 128. A dropout value of 0.5 is used to avoid overfitting and speed up learning. The output layer uses a softmax activation function.
The experiments used the Python programming language with the TensorFlow and Keras libraries for the machine learning and deep learning models. A Windows 10-based machine with a Core i7 CPU and 16 GB of RAM was used.
Two experiments have been performed on three different datasets. The first experiment utilizes the proposed deep learning algorithms. The second experiment utilizes machine-learning algorithms using n-gram feature extraction and compares their results with deep learning algorithms.
The experiments included two phases. First, the most famous machine learning algorithms were applied for classification with different n-grams; they were evaluated using accuracy, F1-measure, and AUC (area under the curve). The second phase applied the deep learning models for classification. The deep learning algorithms were first trained using a simple data split with 80% training and 20% testing, and the same algorithms were then trained using 5-fold cross-validation [
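The two evaluation modes above can be sketched with scikit-learn; the synthetic data and SGD classifier here stand in for the real pipeline.

```python
# Evaluation sketch: a simple 80/20 split and 5-fold cross-validation.
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)
X = rng.rand(200, 20)
y = (X[:, 0] > 0.5).astype(int)  # toy labels

# Simple split: 80% training, 20% testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
clf = SGDClassifier(random_state=0).fit(X_tr, y_tr)
split_acc = accuracy_score(y_te, clf.predict(X_te))

# 5-fold cross-validation: train on 4 folds, test on the held-out fold,
# and average the per-fold accuracy.
fold_accs = []
for tr_idx, te_idx in KFold(n_splits=5, shuffle=True,
                            random_state=0).split(X):
    clf = SGDClassifier(random_state=0).fit(X[tr_idx], y[tr_idx])
    fold_accs.append(accuracy_score(y[te_idx], clf.predict(X[te_idx])))
cv_acc = float(np.mean(fold_accs))
```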
The experiments are conducted using many machine learning algorithms, including LinearSVC, SVC, MultinomialNB, BernoulliNB, stochastic gradient descent (SGD), decision tree, random forest, and k-neighbors. Each algorithm is evaluated using accuracy, F1-score, and area under the curve (AUC). The results of the first dataset experiment are shown in
Algorithm | Accuracy | F1_score | AUC |
---|---|---|---|
Linear SVC | 0.859 | 0.857 | 0.845 |
SVC | 0.596 | 0.445 | 0.5 |
MultinomialNB | 0.823 | 0.814 | 0.788 |
BernoulliNB | 0.821 | 0.811 | 0.786 |
SGD classifier | 0.861 | 0.861 | 0.853 |
Decision tree classifier | 0.641 | 0.56 | 0.563 |
Random forest classifier | 0.596 | 0.445 | 0.5 |
KNeighbors classifier | 0.806 | 0.796 | 0.77 |
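The machine-learning baseline used in these tables can be sketched as follows: word n-gram TF-IDF features fed to one of the listed classifiers (LinearSVC here) and scored with accuracy, F1, and AUC. The four toy texts are placeholders, not the paper's Arabic data.

```python
# Baseline sketch: unigram+bigram TF-IDF features, a linear SVM, and the
# three evaluation metrics used in the experiments.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

texts = ["official source confirms report", "unverified claim spreads fast",
         "ministry denies statement", "shocking secret they hide"] * 10
labels = [0, 1, 0, 1] * 10

vec = TfidfVectorizer(ngram_range=(1, 2))  # word unigrams + bigrams
X = vec.fit_transform(texts)

clf = LinearSVC().fit(X, labels)
pred = clf.predict(X)
scores = clf.decision_function(X)  # signed margins, used for AUC

acc = accuracy_score(labels, pred)
f1 = f1_score(labels, pred)
auc = roc_auc_score(labels, scores)
```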
The results of the second dataset experiment are shown in
Algorithm | Accuracy | F1_score | AUC |
---|---|---|---|
Linear SVC | 0.762 | 0.762 | 0.762 |
SVC | 0.511 | 0.345 | 0.5 |
MultinomialNB | 0.743 | 0.738 | 0.74 |
BernoulliNB | 0.735 | 0.725 | 0.731 |
SGD classifier | 0.737 | 0.737 | 0.736 |
Decision tree classifier | 0.584 | 0.573 | 0.588 |
Random forest classifier | 0.515 | 0.373 | 0.504 |
KNeighbors classifier | 0.725 | 0.725 | 0.724 |
The results of the third dataset experiment are shown in
Algorithm | Accuracy | F1_score | AUC |
---|---|---|---|
Linear SVC | 0.779 | 0.779 | 0.775 |
SVC | 0.561 | 0.403 | 0.5 |
MultinomialNB | 0.783 | 0.778 | 0.767 |
BernoulliNB | 0.779 | 0.772 | 0.761 |
SGD classifier | 0.769 | 0.769 | 0.767 |
Decision tree classifier | 0.593 | 0.489 | 0.54 |
Random forest classifier | 0.56 | 0.419 | 0.501 |
It can be concluded that SVC, decision tree, and random forest are not suitable for this problem. The following graphs depicted in
Experiments were conducted with several deep learning models: CNN, LSTM, CNN + LSTM, BiLSTM, and CNN + BiLSTM. The evaluation metrics of accuracy, loss, and AUC are used.
Deep learning model | Accuracy | Loss | AUC |
---|---|---|---|
CNN | 0.780303 | 0.582185 | 0.839837 |
LSTM | 0.838384 | 1.387679 | 0.887626 |
CNN + LSTM | 0.825758 | 0.837274 | 0.886626 |
BiLSTM | 0.848283 | 0.583095 | 0.902887 |
CNN + BiLSTM | 0.8283 | 0.909429 | 0.886772 |
Deep learning model | Accuracy | Loss | AUC |
---|---|---|---|
CNN | 0.597679 | 4.226126 | 0.61282 |
LSTM | 0.709865 | 1.902138 | 0.792778 |
CNN + LSTM | 0.727273 | 1.401511 | 0.792778 |
BiLSTM | 0.742747 | 1.236142 | 0.803262 |
CNN + BiLSTM | 0.7273 | 1.768434 | 0.791577 |
Deep learning model | Accuracy | Loss | AUC |
---|---|---|---|
CNN | 0.710843 | 1.118824 | 0.765603 |
LSTM | 0.765608 | 2.172314 | 0.829969 |
CNN + LSTM | 0.760131 | 1.359961 | 0.831611 |
BiLSTM | 0.773275 | 1.151952 | 0.837928 |
CNN + BiLSTM | 0.7558 | 1.741797 | 0.829785 |
The following graphs depicted in
To verify the experiments done with the deep learning algorithms, five-fold cross-validation was performed on the three datasets. The results for each dataset are shown in
Deep learning model | Accuracy | Loss |
---|---|---|
CNN | 58.27 | 4.54 |
LSTM | 82.06 | 1.13 |
CNN + LSTM | 81.86 | 1.01 |
BiLSTM | 83.92 | 0.86 |
CNN + BiLSTM | 83.88 | 1.08 |
Deep learning model | Accuracy | Loss |
---|---|---|
CNN | 58.43 | 4.25 |
LSTM | 70.02 | 2.51 |
CNN + LSTM | 70.19 | 2.09 |
BiLSTM | 70.83 | 2.06 |
CNN + BiLSTM | 71.72 | 2.01 |
Deep learning model | Accuracy | Loss |
---|---|---|
CNN | 57.42 | 5.94 |
LSTM | 73.34 | 2.59 |
CNN + LSTM | 73.03 | 2.13 |
BiLSTM | 74.98 | 1.97 |
CNN + BiLSTM | 73.87 | 2.02 |
This paper investigated machine learning and deep learning models for content-based Arabic fake news classification. A series of experiments was conducted to evaluate task-specific deep learning models, using three datasets to assess the most well-known models in the literature. Our findings indicate that machine learning and deep learning approaches can identify fake news using text-based linguistic features. Among the machine learning algorithms, no single model performed optimally across all datasets. On the other hand, our results show that the BiLSTM model achieves the highest accuracy among all models assessed across all datasets.
As part of our future work, we intend to thoroughly examine the existing architectures by combining various layers, and to examine the effect of various pre-trained word embeddings on the performance of the deep learning models.