Social Networks Fake Account and Fake News Identification with Reliable Deep Learning

Recent developments in the World Wide Web (WWW) and social networking (Twitter, Instagram, etc.) have paved the way for data sharing on a scale never before observed in human history. A major security issue in these networks is the creation of fake accounts. In addition, the automatic classification of a text article as true or fake is a crucial task. Because humans are ineffective at distinguishing true from false information, fake news poses a risk to credibility, democracy, logical truth, and journalism in government sectors. Besides, the automatic detection of fake news or rumors on social networking sites is a major research area in the field of social media analytics. With this motivation, this paper develops a new reliable deep learning (DL) based fake account and fake news detection (RDL-FAFND) model for social networking sites. The goal of the RDL-FAFND model is to resolve two major problems on social media platforms, namely fake account detection and fake news/rumor identification. The presented RDL-FAFND model detects fake social networking accounts using a deep stacked autoencoder (DSAE) whose parameters are tuned by the krill herd (KH) optimization algorithm. Besides, the RDL-FAFND model involves an ensemble of machine learning (ML) models with different linguistic features (EML-LF) for categorizing text as true or fake. An extensive set of experiments has been carried out to highlight the performance of the RDL-FAFND model. A detailed comparative results analysis indicates that the presented RDL-FAFND model is considerably better than the existing methods.


Introduction
The advancement of the World Wide Web (WWW) and the quick adoption of social networks such as Twitter and Instagram have established a basis for data distribution that has never been witnessed before in human history [1]. Moreover, news channels have benefited from the extensive use of social networks by offering up-to-date news to their users in real time. The news media have evolved from tabloids, magazines, and newspapers into digital forms such as social media feeds, blogs, online news platforms, and other digital media formats [2]. It has become simple for users to obtain the latest news at their fingertips. Facebook referrals have been found to account for 70% of the traffic to news websites. In their present form, these social networks are highly effective and beneficial for users to deliberate and share their ideas and to discuss problems related to education, health, and democracy. However, these networks are also used with negative intent by specific entities, generally for financial benefit and otherwise for manipulating mindsets, spreading satire or absurdity, and promoting biased opinions. This phenomenon is generally called fake news.
Fake news primarily denotes false and frequently sensational information disseminated under the guise of news reporting. Fake news is defined as news that is demonstrably and intentionally false, or any information presented as news that is factually incorrect and intended to deceive the news consumer into believing it is true [3]. The news content could be entirely fake and made to deceive the user, or it could be more complex content that uses misleading information to address a specific topic. It is also possible to identify content that imitates a legitimate source although the source is not genuine. The spread patterns of fake news on social media are frequently investigated to identify features that support discrimination between legitimate and fake news [4]. The fake news detection problem has been formulated in various ways. It has been treated as a binary classification among true/false, rumour/not, or hoax/not. An alternative formulation is a multi-class classification over classes such as true, nearly true, partly true, mostly false, and false, or unproven rumour, true rumour, false rumour, and non-rumour. The differences among these formulations arise mainly from the distinct annotation schemes and application contexts of the different datasets.
On the other hand, as the use of social media expands, adversaries seek to violate the privacy of other users and misuse their names and credentials by creating fake accounts [5,6]. Hence, social media providers are engaged in detecting such adversaries and fake accounts in order to remove them from their platforms. The use of fake accounts in social media can cause more harm than other cybercrimes. Eliminating fake accounts has gained growing interest among researchers; therefore, wide-ranging studies have been performed on the detection of fake accounts in social media [7]. Different methods have been used for finding fake accounts based on feature similarities, the similarity of friend networks, and profile analysis over a time interval along with IP addresses. The work in [8] provides an unsupervised two-layer meta-classification technique that can identify uncontrollable nodes in a complex network by using features extracted from the graph topology. It was also verified that the technique can be used to detect both fake and real users in the network. The authors of [9] offered a robust and scalable defense system named "Íntegro" that uses user rankings to place fake accounts at lower ranks. The authors of [10] presented a forward message tree with six efficient features for investigating the connections among accounts and identifying suspected accounts. This paper proposes a reliable deep learning (DL) based fake account and fake news detection (RDL-FAFND) model for social networking sites. The presented RDL-FAFND model detects fake accounts using a krill herd (KH) optimization based deep stacked autoencoder (DSAE). Exploiting the herding behavior of krill helps to properly adjust the hyperparameters of the DSAE model.
In addition, the presented RDL-FAFND model involves an ensemble of machine learning (ML) models with different linguistic features (EML-LF) for identifying text as true or fake. A series of experiments has been performed to validate the fake account and fake news detection performance of the RDL-FAFND model.

Related Works
The spread of fake news has resulted in serious problems, including significant effects on social activities. Therefore, fake news identification on social media has become a hot research topic, and various studies have tried to develop fake news classification methods using ML. Han et al. [11] developed a method for detecting various fake news categories using linguistic features. They evaluated the efficiency of baseline classifiers and DL methods for fake news recognition and compared them in terms of the trade-off between accuracy and model lightness. Agarwal et al. [12] utilized the LIAR dataset from Kaggle for fake news classification, containing 20,801 news records from the USA. They extracted reliability scores and other linguistic features from the text, and both these datasets were tokenized and normalized.
Wang et al. [13] established the WeFEND architecture for the automated annotation of news articles, which uses user information in WeChat as a form of weak supervision for fake news identification. Various methods examine the importance of textual and linguistic features for fake news identification. Nikiforos et al. [14] established a new dataset comprising 2366 tweets in English about the Hong Kong protests. Both linguistic and network account features were extracted from the tweets, and several features were identified as determining factors for fake news recognition. This method applied SMOTE oversampling to address the class imbalance in the binary classification. The SMOTE oversampling and the feature extraction were performed in RapidMiner Studio.
Jeronimo et al. [15] exploited a dataset comprising 207,914 news articles from two main conventional news outlets in Brazil, gathered from 2014 to 2017, and 95 news items from two fact-checking services in Brazil (the fake news class). Classification with XGBoost and RF (using TF-IDF and bag-of-words representations) attained high efficiency in cross-domain conditions. Mahyoob et al. [16] utilized twenty posts from PolitiFact as the real news and twenty posts from Facebook as the fake news, covering 3 classes in total. They carried out qualitative and quantitative data analyses with the QDA method, comparing the posts based on their linguistic features. Shu et al. [17] proposed a new fake news data repository, FakeNewsNet. It comprises 2 datasets with several features, involving spatiotemporal data, social context, and news content.
Kumar et al. [18] compared distinct ensembles for binary classification on 1356 news items from Twitter and 1056 real and fake news items from PolitiFact. They generated a dataset for every topic, which they then encoded and tokenized themselves. Alves et al. [19] produced a new binary-class dataset comprising 2996 articles written in Brazilian Portuguese. The investigation was carried out with bidirectional and standard LSTM and dense layers. Victor [20] utilized the LIAR and PHEME datasets and carried out the research with a deep two-path CNN and a bidirectional RNN for unsupervised and supervised learning. Miao et al. [21] developed a novel dataset of 4072 news articles from Webhose about fake news regarding COVID-19. They utilized linguistic features and performed the investigations with baseline classifiers such as the dense layer and the LSTM.

The Proposed RDL-FAFND Model
The overall system architecture of the presented model involves two major operations, namely the KH-DSAE based fake account detection model and the EML-LF based fake news detection model, as shown in Fig. 1. The detailed working of these two modules is discussed in the subsequent sections.

Automated Fake Account Detection Model
First, the fake accounts in social networking sites are detected using the KH-DSAE model. The KH-DSAE model receives the social networking data as input and performs the DSAE based detection process. To increase the detection efficiency of the DSAE model, the KH algorithm is applied to it.

Architecture of DSAE for Fake Account Detection
The ANN model consists of one input layer, several hidden layers, and an output layer. Commonly, the number of layers and neurons is not set at the beginning; rather, it is determined empirically based on the difficulty of the problem. If there are too many layers and neurons, learning the instances consumes excessive time; conversely, if there are too few, the fault tolerance and the instance recognition efficiency fall to a low level. Fig. 2 shows the structure of the DSAE model.
The number of neurons in the hidden layers is normally fixed to (2, 4, 2) in the case of 3 hidden layers with 2 input neurons (one per input parameter). In the forward propagation, the weighted input $z_j^l$ of neuron $j$ in layer $l$ is calculated from the activations $a_k^{(l-1)}$ of the previous layer, the weights $W_{jk}^l$ between the adjacent layers, and the bias $b_j^l$ of the present layer [22]:

$$z_j^l = \sum_k W_{jk}^l \, a_k^{(l-1)} + b_j^l \qquad (1)$$

A sigmoid activation function $f(z)$ is then applied:

$$f(z) = \frac{1}{1 + e^{-z}} \qquad (2)$$

$$a_j^l = f\left(z_j^l\right) \qquad (3)$$

where $l$ indexes the hidden layers ($l \in [1, 3]$), $j$ indexes the neurons in the present layer, and $k$ indexes the neurons in the previous layer. For $l = 0$, i.e., the input layer, the values $a_j^0$ are specified by the user. The activation of the output layer $a^L$ represents the output neuron value. When $L$ equals four, the hidden layer count is 3, as given in Eqs. (1)-(3).
The primary objective of backpropagation (BP) in the NN is to obtain expressions for the partial derivatives $\partial C/\partial W$ and $\partial C/\partial b$ of the cost function $C$ with respect to the weights ($W$) and the biases ($b$). In this procedure, the NN adapts the weight and bias values based on the error between the desired and the modelled output until the error falls below a fixed threshold. The quadratic cost function is given by:

$$C = \frac{1}{2N} \sum_{x} \left\lVert \hat{y} - y \right\rVert^2 \qquad (4)$$

where $N$ represents the total number of training samples, $\hat{y}$ indicates the desired output, and $y$ denotes the model output of the NN. In the output layer, the error components $\delta^L$ are represented as

$$\delta_j^L = \frac{\partial C}{\partial a_j^L} \, f'\left(z_j^L\right) \qquad (5)$$

The first term on the right, $\partial C/\partial a_j^L$, measures how fast the cost function changes with $a_j^L$, while the second term, $f'(z_j^L)$, measures how fast the activation function changes at $z_j^L$. In the hidden layers, the error $\delta^l$ is calculated from the error of the succeeding layer $\delta^{l+1}$:

$$\delta^l = \left( \left(W^{l+1}\right)^T \delta^{l+1} \right) \odot f'\left(z^l\right) \qquad (6)$$

where $\odot$ denotes the Hadamard product, that is, the component-wise product of 2 vectors, and $(W^{l+1})^T$ denotes the transpose of the weight matrix $W^{l+1}$. The partial derivatives of the cost function $C$ with respect to the weights and the biases are then given by:

$$\frac{\partial C}{\partial W_{jk}^l} = a_k^{(l-1)} \, \delta_j^l \qquad (7)$$

$$\frac{\partial C}{\partial b_j^l} = \delta_j^l \qquad (8)$$

After several forward and backward propagations, the error between the desired output and the modelled output falls below the fixed threshold. At that point, or when the output layer neurons reach saturation, the learning of the weights and biases stops, and the weights $W$ and biases $b$ of the model are established.
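The forward and backward passes described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the layer sizes `[2, 2, 4, 2, 1]` assume 2 inputs, the (2, 4, 2) hidden layers mentioned in the text, and a single output neuron (the output size is an assumption), and the function names are hypothetical. The cost is the quadratic cost for a single sample.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def init_network(sizes, seed=0):
    """Random weights W[l][j][k] and biases b[l][j] for the given layer sizes."""
    rng = random.Random(seed)
    weights = [[[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [[rng.uniform(-1, 1) for _ in range(n_out)] for n_out in sizes[1:]]
    return weights, biases

def forward(weights, biases, x):
    """Return the weighted inputs z and activations a of every layer."""
    activations, zs = [list(x)], []
    for W, b in zip(weights, biases):
        z = [sum(w * a for w, a in zip(row, activations[-1])) + bj
             for row, bj in zip(W, b)]
        zs.append(z)
        activations.append([sigmoid(v) for v in z])
    return zs, activations

def cost(weights, biases, x, target):
    """Quadratic cost 0.5 * ||target - output||^2 for one sample."""
    _, acts = forward(weights, biases, x)
    return 0.5 * sum((t - a) ** 2 for t, a in zip(target, acts[-1]))

def backprop(weights, biases, x, target):
    """Gradients of the single-sample quadratic cost w.r.t. all weights and biases."""
    zs, acts = forward(weights, biases, x)
    # Output-layer error: dC/da * f'(z)
    delta = [(a - t) * sigmoid_prime(z)
             for a, t, z in zip(acts[-1], target, zs[-1])]
    grad_W = [None] * len(weights)
    grad_b = [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = list(delta)
        grad_W[l] = [[d * a for a in acts[l]] for d in delta]
        if l > 0:
            # Hidden-layer error: (W^T delta), multiplied element-wise by f'(z)
            W = weights[l]
            delta = [sum(W[j][k] * delta[j] for j in range(len(delta)))
                     * sigmoid_prime(zs[l - 1][k])
                     for k in range(len(W[0]))]
    return grad_W, grad_b
```

A gradient-descent step would subtract a small multiple of `grad_W` and `grad_b` from the parameters; in the presented model the weights and biases are instead tuned by the KH algorithm.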

Parameter Optimization Using KH Algorithm
The KH algorithm is employed for tuning the weight and bias values of the DSAE model. The DSAE model is trained with the candidate weight and bias parameters, and a 10-fold cross-validation (CV) process is employed for evaluating the fitness function (FF). The FF is defined as 1 − CA_validation of the 10-fold CV technique on the training set, as given in Eqs. (9) and (10), so that the solution with the maximum CA_validation holds the smallest fitness value.

$$\text{Fitness} = 1 - CA_{validation} \qquad (9)$$

$$CA_{validation} = \frac{1}{10} \sum_{i=1}^{10} \frac{y_c}{y_c + y_f} \qquad (10)$$

where $y_c$ and $y_f$ refer to the numbers of correct and false classifications in the $i$th fold, respectively. KH [23] is a metaheuristic optimization approach commonly used for solving optimization problems. It is inspired by the herding of krill swarms in response to specific biological and environmental processes. The time-dependent position of an individual krill in a two-dimensional space is determined by the following three key actions:
(i) motion induced by other krill individuals, (ii) foraging action, and (iii) random physical diffusion. The KH technique uses a Lagrangian model in a $d$-dimensional decision space, as given in Eq. (11):

$$\frac{dX_i}{dt} = N_i + F_i + D_i \qquad (11)$$

where $N_i$, $F_i$, and $D_i$ represent the motion induced by other krill individuals, the foraging motion, and the physical diffusion of the $i$th krill individual, respectively. In the motion induced by other krill individuals, the movement direction, $\alpha_i$, is estimated from the target effect (target swarm density), the local effect (local swarm density), and the repulsive effect (repulsive swarm density). For a krill individual, this motion can be determined by the following equation:

$$N_i^{new} = N^{max} \alpha_i + \omega_n N_i^{old}$$

where

$$\alpha_i = \alpha_i^{local} + \alpha_i^{target}$$

and $N^{max}$ represents the maximum induced speed, $\omega_n$ denotes the inertia weight of the induced motion in [0, 1], and $N_i^{old}$ indicates the previous induced motion.
The foraging motion is calculated from 2 major elements, namely the food position and the previous experience about the food position. For the $i$th krill individual, this motion can be expressed as:

$$F_i = V_f \beta_i + \omega_f F_i^{old}$$

where

$$\beta_i = \beta_i^{food} + \beta_i^{best}$$

and $V_f$ represents the foraging speed, $\omega_f$ denotes the inertia weight of the foraging motion in [0, 1], and $F_i^{old}$ indicates the previous foraging motion. The physical diffusion of the krill individuals is considered to be essentially a random process. This motion is based on the maximum diffusion speed and a random directional vector. It is denoted by:

$$D_i = D^{max} \delta$$

where $D^{max}$ represents the maximum diffusion speed and $\delta$ indicates the random directional vector, whose elements are random values in $[-1, 1]$. According to the 3 aforementioned motions, the position vector of a krill individual over the interval $t$ to $t + \Delta t$ can be stated as follows:

$$X_i(t + \Delta t) = X_i(t) + \Delta t \, \frac{dX_i}{dt}$$

It should be noted that $\Delta t$ is an important parameter that should be fine-tuned according to the optimization problem at hand. Fig. 3 illustrates the flowchart of the KH technique.
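The three motions and the position update can be illustrated with a deliberately simplified sketch. Several choices below are assumptions made for brevity and are not taken from the paper: the induced direction uses only the target effect (the pull toward the best krill), the food position is approximated by a fitness-weighted centre of the swarm, positions are clamped to the search bounds, and all parameter defaults are illustrative. The `sphere` objective is a stand-in; in the presented model the objective would be the cross-validation fitness of the DSAE and the position vector would hold its weights and biases.

```python
import math
import random

def _unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)

def sphere(x):
    """Stand-in objective to minimise."""
    return sum(c * c for c in x)

def krill_herd(objective, dim, lower, upper, n_krill=15, iters=40,
               n_max=0.01, v_f=0.02, d_max=0.005, w_n=0.5, w_f=0.5,
               dt=1.0, seed=0):
    rng = random.Random(seed)
    X = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_krill)]
    N = [[0.0] * dim for _ in range(n_krill)]  # induced motion N_i
    F = [[0.0] * dim for _ in range(n_krill)]  # foraging motion F_i
    fit = [objective(x) for x in X]
    personal_best = [(f, list(x)) for f, x in zip(fit, X)]
    best_val, best_x = min(zip(fit, X), key=lambda p: p[0])
    best_x = list(best_x)
    history = [best_val]  # best value found so far, per iteration
    for _ in range(iters):
        worst = max(fit)
        span = (worst - min(fit)) or 1.0
        # Virtual food position: centre of the swarm weighted toward fitter krill.
        weights = [(worst - f) / span + 1e-9 for f in fit]
        total = sum(weights)
        food = [sum(w * x[d] for w, x in zip(weights, X)) / total
                for d in range(dim)]
        for i in range(n_krill):
            # alpha_i: target effect only (simplification)
            alpha = _unit([b - c for b, c in zip(best_x, X[i])])
            # beta_i: attraction to the food plus attraction to the own best position
            to_food = _unit([f - c for f, c in zip(food, X[i])])
            to_best = _unit([b - c for b, c in zip(personal_best[i][1], X[i])])
            beta = [p + q for p, q in zip(to_food, to_best)]
            delta = [rng.uniform(-1, 1) for _ in range(dim)]
            for d in range(dim):
                N[i][d] = n_max * alpha[d] + w_n * N[i][d]
                F[i][d] = v_f * beta[d] + w_f * F[i][d]
                step = dt * (N[i][d] + F[i][d] + d_max * delta[d])
                X[i][d] = min(upper, max(lower, X[i][d] + step))
            fit[i] = objective(X[i])
            if fit[i] < personal_best[i][0]:
                personal_best[i] = (fit[i], list(X[i]))
            if fit[i] < best_val:
                best_val, best_x = fit[i], list(X[i])
        history.append(best_val)
    return best_x, best_val, history
```

For example, `krill_herd(sphere, dim=2, lower=-5.0, upper=5.0, dt=5.0)` returns the best position found, its objective value, and the best-so-far history. The step scale `dt` has a strong effect on how far each krill moves per iteration, which is why the text stresses tuning it per problem.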

Automated Fake News Detection Model
At this stage, the automated fake news detection model is designed using the EML-LF model. The EML-LF model incorporates three major sub-processes, namely pre-processing, linguistic feature extraction, and EML based classification.

Preprocessing
The data gathered from social media undergo pre-processing prior to their use as input to the EML-LF model. Undesirable fields in the articles, such as the author names, the publication date, the URL, and the category, are discarded. Articles with no body text or with fewer than 20 words in the article body are deleted. Then, articles stored in multiple columns are converted into a single column to maintain consistency in format and structure. These steps are carried out on the dataset to attain uniformity.
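The pre-processing steps above can be sketched as follows. The record layout (dictionaries with `title`, `body`, `label`, and metadata keys) and the function name are assumptions for illustration; the actual dataset columns may differ.

```python
# Hypothetical names of the textual columns that are kept and merged.
TEXT_FIELDS = ("title", "body")

def preprocess(articles, min_words=20):
    """Drop metadata, remove short or empty articles, merge text into one column."""
    cleaned = []
    for article in articles:
        body = (article.get("body") or "").strip()
        if len(body.split()) < min_words:
            continue  # no body text, or fewer than min_words words in the body
        # Keep only the textual columns (author, date, URL, category are dropped)
        parts = [str(article[k]).strip() for k in TEXT_FIELDS if article.get(k)]
        cleaned.append({"text": " ".join(parts), "label": article.get("label")})
    return cleaned
```

Each surviving record has exactly two keys, `text` and `label`, so every dataset reaches the feature extraction stage in the same shape.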

Linguistic Feature Set
Once the data pre-processing is done, the subsequent stage is the extraction of the linguistic features [24].
Ngrams. Unigrams and bigrams are extracted from the bag-of-words representation of every news article. To account for occasional differences in content length, the features are encoded as tf-idf values.
Punctuation. A punctuation feature set comprising 11 kinds of punctuation is derived from the Linguistic Inquiry and Word Count software (LIWC, Version 1.3.1 2015). It includes punctuation marks such as the comma, dash, question mark, exclamation mark, and period.
Psycholinguistic features. The LIWC lexicon is used for extracting the proportion of words that fall into psycholinguistic categories. LIWC relies on large lexicons of word categories representing psycholinguistic processes (for instance, positive emotions and perceptual processes), summary categories, and part-of-speech categories (article, verb). The individual LIWC categories are clustered into feature sets as follows: summary categories (e.g., analytical thinking, emotional tone), linguistic processes (e.g., function words, pronouns), and psychological processes (e.g., affective processes, social processes).
Readability. Features denoting text understandability are also extracted. They comprise content features such as the number of characters, complex words, long words, syllables, word types, and paragraphs, among others. Readability measures such as Flesch-Kincaid, Flesch Reading Ease, Gunning Fog, and the Automated Readability Index (ARI) are also used.
Syntax. Finally, a collection of features derived from production rules based on context-free grammar (CFG) trees, obtained using the Stanford Parser, is extracted. The CFG-derived features comprise the lexicalized production rules combined with their parent and grandparent nodes. They have been found to be helpful for linguistic deception detection.
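Three of the feature families above (n-grams with tf-idf weighting, punctuation counts, and one readability score) can be sketched with the standard library alone. The LIWC-based and parser-based features depend on external tools and are not reproduced here. The tokenizer, the tf-idf variant (`idf = log(N / df) + 1`), and the five-mark punctuation list are illustrative assumptions, not the paper's exact configuration.

```python
import math
import re
from collections import Counter

# Illustrative subset of punctuation marks named in the text (not the LIWC list).
PUNCTUATION = [",", "-", "?", "!", "."]

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def ngram_counts(tokens, n_max=2):
    """Counts of unigrams and bigrams (n-grams up to n_max)."""
    grams = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams[" ".join(tokens[i:i + n])] += 1
    return grams

def tfidf(docs_tokens):
    """One tf-idf dictionary per document; idf = log(N / df) + 1 is one common variant."""
    counts = [ngram_counts(tokens) for tokens in docs_tokens]
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(counts)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vectors.append({gram: (k / total) * (math.log(n_docs / df[gram]) + 1.0)
                        for gram, k in c.items()})
    return vectors

def punctuation_counts(text):
    return {mark: text.count(mark) for mark in PUNCTUATION}

def automated_readability_index(text):
    """ARI = 4.71 * chars/words + 0.5 * words/sentences - 21.43."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43
```

Dividing term counts by the document's total n-gram count is what compensates for differences in article length before the idf weighting is applied.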

Ensemble of ML Models for Fake News Detection
At this stage, the EML model is applied to categorize the news as true or fake. Ensemble learning helps improve ML outcomes by combining several techniques, which permits building a better predictive model than any individual one. Here, a simple majority voting ensemble (voting classifier) is used for combining the predictions of multiple ML techniques (MLP, RF, and KNN) into an enhanced integrated outcome. Once the voting classifier is trained, it can be used for predicting the label of new samples based on the votes of the contributing models. To evaluate the efficiency of the individual and ensemble methods, the individual methods are first trained and tested on the fake news datasets using 10-fold CV. Afterwards, the presented ensemble classifier is trained and tested on the same datasets using 10-fold CV.
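Majority voting and the 10-fold split can be sketched as below. The three rule-based functions are hypothetical stand-ins for the trained MLP, RF, and KNN models, used only so that the example is self-contained; the fold generator uses contiguous folds without shuffling, which is an assumption rather than the paper's protocol.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label predicted by most of the contributing models."""
    return Counter(labels).most_common(1)[0][0]

def ensemble_predict(models, sample):
    return majority_vote([model(sample) for model in models])

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Hypothetical stand-ins for three trained classifiers.
def rule_exclaim(text):
    return "fake" if text.count("!") >= 2 else "true"

def rule_caps(text):
    return "fake" if any(w.isupper() and len(w) > 3 for w in text.split()) else "true"

def rule_length(text):
    return "fake" if len(text.split()) < 5 else "true"

MODELS = [rule_exclaim, rule_caps, rule_length]
```

With three voters and two classes a tie cannot occur, which is one practical reason to combine an odd number of models in a hard-voting ensemble.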
MLP, RF, and KNN are familiar techniques that are highly efficient for solving classification problems. RF is commonly used by researchers as a baseline for text classification problems. It is an ensemble learning technique for classification that functions by constructing several decision trees (DTs) at training time and outputs the class decided by the contributing DTs [25]. The KNN technique operates by computing the distance (provided in Eqs. (17)-(19)) between the query and every instance in the data and by selecting the specified number of instances (K) that are closest to the query. The KNN distance can be written as:

$$\text{Euclidean} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (17)$$

$$\text{Manhattan} = \sum_{i=1}^{n} |x_i - y_i| \qquad (18)$$

$$\text{Minkowski} = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \qquad (19)$$

where $x$ and $y$ are the $n$-dimensional feature vectors of the query and of an instance. In classification problems, different K values in the KNN technique result in different classification outcomes; the optimal value of K is determined by running several rounds of experiments with distinct values of K and selecting the one that provides the best classification outcome. The RF model, built from a number of DTs at training time, uses the Gini index and the entropy as split criteria, as provided in the 2 subsequent formulas:

$$\text{Gini} = 1 - \sum_{i=1}^{c} f_i^2$$

$$\text{Entropy} = \sum_{i=1}^{c} -f_i \log(f_i)$$

where $f_i$ denotes the relative frequency of class $i$ among the $c$ classes at a node. The MLP is colloquially referred to as a "vanilla" neural network, especially when it has a single hidden layer. As mentioned above, this research presents an ensemble learning method that combines efficient ML techniques, namely RF, KNN, and MLP, and employs linguistic feature sets for fake news detection.
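The distance measures, a basic KNN vote, and the two impurity criteria can be written compactly. This is a minimal sketch with hypothetical function names; the base-2 logarithm in `entropy` is an assumption, since the text does not state the base.

```python
import math
from collections import Counter

def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)  # Minkowski distance with p = 2

def manhattan(x, y):
    return minkowski(x, y, 1)  # Minkowski distance with p = 1

def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Label of the majority among the k training instances closest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    # Base-2 logarithm assumed; another base only rescales the value.
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```

Trying several values of `k` in `knn_predict` and keeping the one with the best cross-validated accuracy mirrors the selection procedure described in the text.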

Performance Validation
This section validates the performance of the proposed model on fake account and fake news detection datasets. A detailed set of experiments has been performed, and the results have been compared with the existing methods. Initially, the fake account detection performance is validated using a fake account dataset from the Kaggle repository [26,27]. Fig. 4 and Tab. 1 showcase the fake account detection performance of the proposed KH-DSAE model and the other methods in terms of the area under curve (AUC), accuracy, false positive rate (FPR), and true positive rate (TPR) [28].
In order to further validate the performance of the EML-LF model, another results analysis takes place on the Fake News Detection Liar benchmark dataset, as given in Tab. 3 and Fig. 7. The resultant values demonstrate that the NB model has showcased the least performance with an accuracy of 72.6%, recall of 74.6%, precision of 91%, and F-score of 82%. Besides, the SSO algorithm has obtained better performance than the NB model with an accuracy of 78%, recall of 70.5%, precision of 100%, and F-score of 82.7%. Along with that, the DT model has demonstrated a slightly enhanced outcome over the SSO algorithm with an accuracy of 79.8%, recall of 95.1%, precision of 83.2%, and F-score of 88.7%. Next to that, the GBT model has attained a moderate outcome with an accuracy of 79.8%, recall of 95.5%, precision of 82.9%, and F-score of 88.8%. Meanwhile, the Ridor model has obtained a somewhat better outcome with an accuracy of 82%, recall of 99.8%, precision of 82.2%, and F-score of 90.2%. Similarly, the J48, SMO, and SVM models have portrayed reasonable outcomes with close accuracies of 82.2%, 82.3%, and 83.6%, respectively. Though the GWO algorithm has demonstrated near-optimal results with an accuracy of 96.5%, recall of 100%, precision of 95.6%, and F-score of 97.7%, the presented EML-LF model has outperformed all the other methods with an accuracy of 98.6%, recall of 100%, precision of 100%, and F-score of 94.7%. The above results indicate that the presented model is an appropriate tool for fake news and fake account detection on social media.
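The measures used in this section follow directly from the confusion-matrix counts, and the F-score is the harmonic mean of precision and recall. The sketch below assumes these standard definitions; the function names are hypothetical, and the `REPORTED` entries simply restate the precision, recall, and F-score of the NB and SSO rows from the text as a worked check.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, TPR (recall), FPR, and precision from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }

def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-score) as stated in the text for two baselines.
REPORTED = {
    "NB": (0.91, 0.746, 0.82),
    "SSO": (1.00, 0.705, 0.827),
}

def check_reported(tolerance=0.001):
    """True for each model whose reported F-score matches its precision and recall."""
    return {name: abs(f_score(p, r) - f) < tolerance
            for name, (p, r, f) in REPORTED.items()}
```

Recomputing the F-score from precision and recall in this way is a quick consistency check that can be applied to every row of a results table.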

Conclusion
In this paper, a new RDL-FAFND model has been developed for the identification of fake accounts and fake news on social networks. The presented RDL-FAFND model involves two major operations, namely the KH-DSAE based fake account detection model and the EML-LF based fake news detection model. Exploiting the herding behavior of krill helps in adjusting the hyperparameters of the DSAE model. Similarly, the inclusion of the ensemble learning process helps in increasing the fake news detection rate. A series of experiments has been performed to validate the fake account and fake news detection performance of the RDL-FAFND model. The detailed comparative results analysis indicates the superiority of the presented RDL-FAFND model over the existing methods in terms of different measures. As part of the future scope, the enrichment of the feature set with knowledge from the social sciences (especially psychology) can be analyzed. Such features are expected to be effective for the identification of fake accounts on social networking sites.
Funding Statement: The authors received no specific funding for this study.