Stochastic Gradient Boosting Model for Twitter Spam Detection

In today's connected world there is more data than we could imagine. The number of network users is increasing day by day, and a large number of social networks keep users connected all the time. These social networks give users complete freedom to post content of political, commercial or entertainment value. Some of this content may be sensitive and can therefore have a considerable impact on society. The trustworthiness of data is important on public social networking sites such as Facebook and Twitter. Because of their large user base and openness, these networks are highly exposed to the spread of spam messages. Spam detection is a technique for identifying such messages and marking them as false data. Many machine learning approaches have been proposed to detect spam in social networks. The efficiency of a spam detection algorithm is determined by its cost and accuracy. Aiming to improve spam detection in social networks, this study proposes statistical features modelled through a supervised boosting approach, stochastic gradient boosting, and evaluates them on English-language Twitter data sets. The performance of the proposed model is evaluated through simulation.


Introduction
Over the past decade, online social networks have grown tremendously. Facebook and Twitter, for example, have become worldwide communication platforms. With around 330 million monthly active users and 145 million daily active users, Twitter is one of the most popular of these social networking services. Approximately 500 million tweets are sent every day. The chance of receiving spam messages increases as the network grows. The length of a tweet was originally limited to 140 characters and has since been increased to 280 characters. Traditional spam detection and reporting techniques are difficult to apply because the messages are so short, so reliable ways to detect and report Twitter spam messages are needed. Furthermore, semantic classification of Twitter messages is challenging, and therefore the usual methods outlined in [3] cannot be used. Traditional spam detection methods focus on recognizing and extracting user-based data from a Twitter account and then using machine learning algorithms to detect unauthorized users or spam campaigns [4,5]. As spamming techniques change, current solutions that rely on statistical features will be unable to detect spammers who use new techniques. Some solutions exploit social network information through ranking schemes, which can reduce spammers' influence on legitimate users [6,7]. However, because they rely solely on network information, these systems have difficulty distinguishing legitimate users from spammers. The optimization models that outperform previous approaches use supervised machine learning techniques that rely on a single feature, which can be either text based or URL based [8][9][10]. As described in [11], deep learning approaches such as convolutional neural networks (CNN) and long short-term memory networks (LSTM) have enabled various text representations with iterative training to obtain better results.
The study in [11], however, ignores the randomness of Twitter messages. To address the problems described above, we propose using stochastic gradient boosting, which incorporates a notion of randomness.
The major contributions of the proposed work are as follows:
1. English Twitter review datasets extracted from the honeypot dataset were used as the public dataset.
2. A detailed study was carried out to select the features for the boosting algorithm [12][13][14][15].
3. A stochastic gradient boosting technique was modelled by fitting a parameterized function for spam detection.
4. Classification accuracy was increased by injecting randomness into the training data selection process, whereas in traditional approaches the training data remains nearly constant.
5. The results were compared, through simulation studies, against selected literature that uses neural networks and gradient boosting for spam detection.
The remainder of this article is organized as follows. Section 2 presents a detailed literature study of traditional spam detection techniques. Section 3 describes the data collection, feature extraction and modelling of the boosting approach for spam detection. The results are presented in Section 4. Finally, Section 5 presents the conclusion and directions for future enhancement of the proposed work.

Related Work
Spam can be defined as follows: "Spam is an undesirable information that contain improper messages that may mislead the readers" [16]. Spam communications are normally hard to anticipate because spammers spoof authenticated users' information [17]. Several research studies have been conducted to aid the detection of spam in emails and on social networking sites. In this section, we go through some of the most prevalent approaches to spam detection that are relevant to our proposed framework.
Convolutional neural networks (CNNs) are a type of deep learning technique widely used in natural language processing. Researchers have extended the application of CNNs to false information detection. The study in [18] presents a CNN-based message classification approach for detecting fake news in Twitter feeds. The authors in [19] combined CNNs and ensemble neural networks to detect fake information on Twitter. Yang et al. [20] used CNNs with text and images to identify fake content, with features collected from both the image and the text. Their results indicate that this method is efficient at detecting false information.
Researchers frequently use hybrid techniques for spam detection, which are created by combining two similar deep learning architectures. In [21], a recurrent convolutional neural network (RCNN) is used to learn contextual information. The same RCNN model is used in [22] to extract data such as captions and keywords from the message, and the collected features are used to generate the training data set. All of these methods use a deterministic training data set that stays the same throughout the training cycle. Given the unpredictable nature of tweets, the absence of randomization in the data selection process may negatively impact performance. For this reason, the suggested method uses a stochastic model to classify messages. The recursive neural network (RxNN) is another efficient model for spam detection because of its hierarchical architecture and its use of compositional vectors for training.
In [23] the authors proposed a method for extracting discriminating information from tweets. In general, the features vary for different kinds of rumors. This method proved efficient at identifying random spam tweets. Many works have used a multi-layer graphical model with hidden units, the Deep Belief Network (DBN), to detect spam. The study in [24] employed a DBN-based method to identify malicious material in personal networks, which may be extended to public domains as well. DBNs are unsupervised in nature and have been reported to outperform restricted supervised techniques. A deep learning model for detecting spammers in the Twitter network was introduced in [25]. To increase the performance of spammer detection, the techniques were applied to tweets as well as to the metadata of Twitter users. The main drawbacks of using neural networks for spam detection are their high complexity and increased computational cost.
Himank introduced a method for identifying spam in the Twitter network in real time in [26]. The classification of spammers is based on user-based and text-based features. The performance evaluation was carried out using machine learning techniques such as support vector machine (SVM), neural network, random forest and gradient boosting. The neural network reached an accuracy of 91.65%. In our suggested model, we apply boosting algorithms, which achieve high accuracy in classification problems. There are numerous boosting methods in the literature; among them, gradient boosting is regarded as one of the most reliable and efficient. The suggested method employs stochastic gradient boosting [27], a variant of classical gradient boosting that trains on random subsamples of the training set drawn without replacement.

Proposed Model for Spam Detection
In this section we present a detailed model for spam detection based on a boosting algorithm. It is a well-validated observation that the majority of spam tweets contain a URL that redirects users [28,29]. To proceed with modelling, we extract several features from the honeypot dataset. Because of the random character of spam messages, the feature selection procedure is not easy. We make every attempt to accommodate the most popular features, which appear in the majority of tweets.

Feature Selection
Various methods have been described in the literature for extracting features from linguistic datasets. The efficiency of classification is determined by the precision and number of features. Because spam attacks are unpredictable, defining features for any given data collection is not an easy task. In our proposed approach we have identified 15 features based on the literature in [30]. Reducing the size of the feature set while improving accuracy lowers the computational complexity of a classification technique, making it viable to run on a large population of tweets.
Tab. 1 shows the features that were extracted. The extracted features are classified into two categories. The first category, referred to as account-based features, collects information about the user, such as account age and number of followers. The second category, referred to as content-based features, collects information about the tweet under investigation, such as hashtags, retweets and embedded URLs.
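As a rough illustration of the two feature categories, the sketch below computes a handful of account-based and content-based features for a single tweet. It is a minimal Python sketch and not the extraction pipeline used in this study: the field names (`created_at`, `followers`, `text`, `retweet_count`), the regular expressions and the function name `extract_features` are assumptions made for the example, and only features named in the text are included rather than the full set of 15.

```python
import re
from datetime import date

# Hypothetical patterns; real tweets may need more careful tokenization.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def extract_features(tweet, today):
    """Return account-based and content-based features for one tweet.

    `tweet` is a dict with assumed keys: created_at (date the account was
    created), followers, text and retweet_count.
    """
    text = tweet["text"]
    # Remove URLs first so that a '#fragment' inside a URL is not counted
    # as a hashtag.
    text_without_urls = URL_RE.sub(" ", text)
    return {
        # account-based features
        "account_age_days": (today - tweet["created_at"]).days,
        "followers": tweet["followers"],
        # content-based features
        "num_urls": len(URL_RE.findall(text)),
        "num_hashtags": len(HASHTAG_RE.findall(text_without_urls)),
        "num_retweets": tweet["retweet_count"],
    }


# Made-up example record, for illustration only.
example = {
    "created_at": date(2020, 1, 1),
    "followers": 12,
    "text": "Win a free phone http://spam.example/x #free #win",
    "retweet_count": 0,
}
features = extract_features(example, today=date(2020, 1, 31))
```

Each tweet then becomes one row of numeric values, which is the form of input the boosting model expects.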

Stochastic Gradient Boosting
Gradient boosting produces the final prediction by combining the predictions of multiple trees. Each subtree's nodes have their own set of characteristics, and they are not all the same. Boosting can be substantially improved by introducing randomness into the selection of training data, which is referred to as stochastic gradient boosting [31].
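For a given input data set X with N values and M features there is a non-deterministic response Y. Given a training sample $\{y_i, x_i\}_1^N$ of known (y, x) data values, the goal of the algorithm is to find a function $F^*(X)$ that maps the input data X to the output response Y (spam or non-spam). In the predicted result there is always some loss $\Psi(y, F(x))$.
The mapping function $F^*(X)$ can be calculated as follows:
$$F^*(x) = \arg\min_{F(x)} E_{y,x}\, \Psi\bigl(y, F(x)\bigr) \tag{1}$$
The mapping function in Eq. (1) can be approximated by an additive expansion:
$$F(x) = \sum_{m=0}^{M} \beta_m\, h(x; a_m) \tag{2}$$
where m ranges over the features associated with each data set, $a_m$ is the parameter value for feature m, $h(x; a_m)$ is computed from the matrix of feature values of a tweet $x \in X$, and $\beta_m$ is the expansion coefficient.
The algorithm starts with an initial guess $F_0(X)$. The expansion coefficients $\{\beta_m, a_m\}$ are then fit to the training data sample, so that for m = 1, 2, …, M
$$(\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^{N} \Psi\bigl(y_i, F_{m-1}(x_i) + \beta\, h(x_i; a)\bigr) \tag{3}$$
and
$$F_m(x) = F_{m-1}(x) + \beta_m\, h(x; a_m) \tag{4}$$
The gradient boosting approach solves Eq. (3) approximately by least squares:
$$a_m = \arg\min_{a, \rho} \sum_{i=1}^{N} \bigl[\tilde{y}_{im} - \rho\, h(x_i; a)\bigr]^2 \tag{5}$$
where $\rho$ is an arbitrary value and $\tilde{y}_{im}$ is the residual, which is obtained by differentiating the loss function:
$$\tilde{y}_{im} = -\left[\frac{\partial \Psi\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} \tag{6}$$
For the given $h(x_i; a_m)$, the optimal value of the expansion coefficient is
$$\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} \Psi\bigl(y_i, F_{m-1}(x_i) + \beta\, h(x_i; a_m)\bigr) \tag{7}$$
The value $h(x; a)$ is given by the terminal node of the decision tree. At each iteration the tree partitions the input data set X into L disjoint regions $\{R_{lm}\}_{l=1}^{L}$ and predicts a response for each iteration as follows:
$$h\bigl(x; \{R_{lm}\}_1^L\bigr) = \sum_{l=1}^{L} \bar{y}_{lm}\, 1(x \in R_{lm}) \tag{8}$$
where $\bar{y}_{lm} = \mathrm{mean}_{x_i \in R_{lm}}(\tilde{y}_{im})$ is the mean of the residuals in each region.
Because the regions are disjoint, each region $R_{lm}$ can be solved independently by the corresponding terminal node l constructed for the m-th feature. Based on the above formulations, the solution to Eq. (7) reduces to a simple location estimate in each region:
$$\gamma_{lm} = \arg\min_{\gamma} \sum_{x_i \in R_{lm}} \Psi\bigl(y_i, F_{m-1}(x_i) + \gamma\bigr) \tag{9}$$
The mapping function $F_{m-1}(X)$ is then updated separately in each region:
$$F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_{lm}\, 1(x \in R_{lm}) \tag{10}$$
where the shrinkage parameter $\nu$, $0 < \nu < 1$, controls the learning rate of the algorithm.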
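In gradient boosting, the residual $\tilde{y}_{im}$ is the negative gradient of the loss with respect to the current model output. As a worked illustration, the block below evaluates it for two common loss functions. These two losses are chosen here only as examples; the text does not state which loss function the study uses.

```latex
% Squared-error loss: the residual is the ordinary prediction error.
\[
\Psi(y, F) = \tfrac{1}{2}(y - F)^2
\;\Longrightarrow\;
\tilde{y}_{im}
  = -\left[\frac{\partial \Psi(y_i, F)}{\partial F}\right]_{F = F_{m-1}(x_i)}
  = y_i - F_{m-1}(x_i)
\]
% Logistic loss with y in {0, 1} and p(F) = 1 / (1 + e^{-F}).
\[
\Psi(y, F) = -y \log p(F) - (1 - y) \log\bigl(1 - p(F)\bigr)
\;\Longrightarrow\;
\tilde{y}_{im} = y_i - p\bigl(F_{m-1}(x_i)\bigr)
\]
```

For example, under squared-error loss a spam tweet with $y_i = 1$ and current prediction $F_{m-1}(x_i) = 0.5$ has residual 0.5. If the mean residual in its region is also 0.5 and $\nu = 0.1$, the updated prediction is $0.5 + 0.1 \times 0.5 = 0.55$.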
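In the gradient boosting procedure modelled above, we incorporate randomness as part of the model. The subsample of training data for each iteration is drawn at random, without replacement, from the entire available data set. Let $\{y_i, x_i\}_1^N$ be the training data set and $\{\pi(i)\}_1^N$ be a random permutation of the integers {1, 2, …, N}. A random subsample of size $\tilde{N} < N$ is then given by
$$\{y_{\pi(i)}, x_{\pi(i)}\}_1^{\tilde{N}} \tag{11}$$
The stochastic gradient boosting algorithm can now be written as follows:
1. $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} \Psi(y_i, \gamma)$
2. For m = 1 to M do:
3. $\{\pi(i)\}_1^N = \mathrm{rand\_perm}\,\{i\}_1^N$
4. $\tilde{y}_{\pi(i)m} = -\left[\dfrac{\partial \Psi\bigl(y_{\pi(i)}, F(x_{\pi(i)})\bigr)}{\partial F(x_{\pi(i)})}\right]_{F(x) = F_{m-1}(x)}, \quad i = 1, \ldots, \tilde{N}$
5. $\{R_{lm}\}_1^L$ = L-terminal-node tree fit to $\{\tilde{y}_{\pi(i)m}, x_{\pi(i)}\}_1^{\tilde{N}}$
6. $\gamma_{lm} = \arg\min_{\gamma} \sum_{x_{\pi(i)} \in R_{lm}} \Psi\bigl(y_{\pi(i)}, F_{m-1}(x_{\pi(i)}) + \gamma\bigr)$
7. $F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_{lm}\, 1(x \in R_{lm})$
8. End for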
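The procedure described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the MATLAB implementation used in this study: it assumes squared-error loss (so the residual is simply the label minus the current prediction and the optimal leaf value is the mean residual), uses two-leaf trees (L = 2), and thresholds the final score at 0.5 to label a tweet as spam. The function names `fit_stump`, `fit_sgb` and `predict_sgb` are hypothetical.

```python
import random


def fit_stump(xs, residuals):
    """Least-squares fit of a two-leaf regression tree (L = 2) to residuals."""
    best = None
    for j in range(len(xs[0])):
        values = sorted(set(x[j] for x in xs))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for x, r in zip(xs, residuals) if x[j] <= thr]
            right = [r for x, r in zip(xs, residuals) if x[j] > thr]
            mean_l = sum(left) / len(left)
            mean_r = sum(right) / len(right)
            sse = (sum((r - mean_l) ** 2 for r in left)
                   + sum((r - mean_r) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, mean_l, mean_r)
    if best is None:  # every sampled row is identical: fall back to one leaf
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    """Return the leaf value of the region that x falls into."""
    j, thr, left_val, right_val = stump
    return left_val if x[j] <= thr else right_val


def fit_sgb(xs, ys, n_rounds=50, shrinkage=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting with squared-error loss (assumed)."""
    rng = random.Random(seed)
    n = len(xs)
    n_sub = max(2, int(subsample * n))
    f0 = sum(ys) / n              # initial guess: constant minimizing the loss
    preds = [f0] * n
    stumps = []
    for _ in range(n_rounds):
        # Random subsample drawn without replacement at every iteration.
        idx = rng.sample(range(n), n_sub)
        sub_x = [xs[i] for i in idx]
        # Residuals = negative gradient of squared-error loss.
        sub_r = [ys[i] - preds[i] for i in idx]
        stump = fit_stump(sub_x, sub_r)
        stumps.append(stump)
        # Shrunken update of the model on all training points.
        for i in range(n):
            preds[i] += shrinkage * stump_predict(stump, xs[i])
    return f0, shrinkage, stumps


def predict_sgb(model, x):
    """Label x as spam (1) if the boosted score reaches 0.5, else 0."""
    f0, shrinkage, stumps = model
    score = f0 + shrinkage * sum(stump_predict(s, x) for s in stumps)
    return 1 if score >= 0.5 else 0
```

Setting `subsample=1.0` makes every iteration use the full training set, which recovers ordinary deterministic gradient boosting and gives a simple baseline for comparing the effect of the random subsampling.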

Simulation Studies
The proposed work aims to detect spam in Twitter messages using the stochastic gradient boosting method (SGBM). The proposed model was developed in the MATLAB simulation environment. We increased the number of training and testing samples from 100 to 10000 and evaluated the performance of the proposed model against existing works. Tab. 2 lists the training and testing data samples with different spam ratios.

Evaluation Metrics
Performance is measured using the following metrics: accuracy, true positive rate (TPR), false positive rate (FPR), precision and F-measure.

True Positive Rate (TPR)
The TPR, also called recall, is the ratio of correctly identified spams to the total number of actual spams.

False Positive Rate (FPR)
The FPR is the proportion of non-spams incorrectly classified as spams among the total number of actual non-spams.

Accuracy
The accuracy is the percentage of correctly identified tweets (both spams and non-spams) in the total number of examined tweets.

Precision
The precision is defined as the ratio of correctly classified spams to the total number of tweets that are classified as spams.

F-measure
The F-measure summarizes the accuracy of the model in a single value. It is defined as the weighted harmonic mean of precision and recall.
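The metric definitions above can be written in terms of the counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The following Python sketch computes them from lists of true and predicted labels. The function name `evaluate` is hypothetical, label 1 is assumed to mean spam, and because the text does not state the weight used in the F-measure, the sketch assumes `beta = 1` (equal weight on precision and recall) by default.

```python
def evaluate(y_true, y_pred, beta=1.0):
    """Compute spam-detection metrics; label 1 = spam, 0 = non-spam."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of failing when a class is absent.
        return num / den if den else 0.0

    tpr = ratio(tp, tp + fn)            # recall: TP / (TP + FN)
    fpr = ratio(fp, fp + tn)            # FP / (FP + TN)
    accuracy = ratio(tp + tn, len(pairs))
    precision = ratio(tp, tp + fp)      # TP / (TP + FP)
    b2 = beta * beta
    # Weighted harmonic mean of precision and recall.
    f_measure = ratio((1 + b2) * precision * tpr, b2 * precision + tpr)
    return {
        "tpr": tpr,
        "fpr": fpr,
        "accuracy": accuracy,
        "precision": precision,
        "f_measure": f_measure,
    }
```

For instance, with three spams caught, one spam missed, one non-spam wrongly flagged and three non-spams correctly passed, the TPR is 3/4 and the FPR is 1/4.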

Results and Discussions
The proposed work is compared with the gradient boosting method (GBM) and a convolutional neural network (CNN). The boosting algorithms perform well compared to the convolutional neural network. The results of the models are compared in terms of the evaluation metrics accuracy, FPR, TPR and F-measure. Three data sets were used, each with a spam to non-spam ratio of 1:1. The average values of the evaluation metrics for all three methods are listed in Tab. 3. Fig. 1 shows a comparison of detection accuracy for all three techniques. As can be seen, the classification accuracy of all three methods improves as the size of the training data sets grows from 1 K to 100 K. The stochastic gradient boosting approach achieves greater detection accuracy than the other two techniques, as shown in the graph.

Comparative Analysis
A comparative analysis is carried out between the proposed stochastic gradient boosting method and the approach for detecting spammers presented in [32]. Fig. 3 shows the performance comparison of the proposed method with the method in [32] in terms of accuracy. Fig. 3 reveals that the proposed method performs well in terms of accuracy.

Conclusion
In this work, we compared a convolutional neural network with two boosting methods in terms of their effectiveness for spam detection. To examine their performance in recognizing Twitter spam in terms of accuracy, TPR, FPR and F-measure, the algorithms were tested in various scenarios by increasing the volume of training data while keeping the spam to non-spam ratio constant. According to the findings, the stochastic gradient boosting approach performs best on all performance metrics. As a future development, we can investigate the performance of these algorithms with a dynamic spam to non-spam ratio and growing tweet volumes.
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.