Intrusion Detection Method of Internet of Things Based on Multi GBDT Feature Dimensionality Reduction and Hierarchical Traffic Detection

The rapid development of Internet of Things (IoT) technology has brought great convenience to people's lives. However, IoT systems have weak security protection and are vulnerable to attack, so stronger protection is needed. This paper proposes an intrusion detection method for IoT based on multi GBDT feature dimensionality reduction and a hierarchical traffic detection model. First, GBDT is used to filter the features of the traffic data sets BoT-IoT and UNSW-NB15 to reduce the traffic feature dimension. To improve the reliability of feature filtering, multiple GBDT models are constructed to filter the features of multiple sub data sets, and the filtered features are evaluated comprehensively to find the best candidate features. Then, two neural networks are trained on the two dimension-reduced data sets, and traffic is detected with the trained networks. To improve the efficiency of traffic detection, this paper proposes a hierarchical traffic detection model, which reduces the computational cost and time cost of the detection process. Experiments show that the multi GBDT dimensionality reduction method obtains better features than the traditional PCA dimensionality reduction method. In addition, the use of two data sets makes the IoT intrusion detection system more comprehensive, allowing it to detect more types of attacks, and the hierarchical traffic detection model improves the detection efficiency of the system.


Introduction
The rapid development of the IoT has brought great convenience to people's lives. However, as the number of IoT systems and devices grows, the security problem of the IoT is becoming more serious and urgent. Because IoT attacks are complex and diverse, traditional intrusion detection schemes based on rule detection can no longer meet current needs. Therefore, more and more researchers have turned their attention to machine learning.
Jan et al. [1] developed a lightweight attack detection strategy that uses a supervised machine learning-based support vector machine (SVM) to detect an adversary attempting to inject unnecessary data into the IoT network. Alazzam et al. [2] proposed a wrapper feature selection algorithm for IDS that uses the pigeon inspired optimizer to drive the selection process. Ravi et al. [3] proposed a new SDRK machine learning (ML) algorithm to detect intrusions. Lv et al. [4] proposed an accurate and effective misuse intrusion detection system that relies on specific attack signatures to distinguish between normal and malicious activities, detecting various attacks with an extreme learning machine with a hybrid kernel function (HKELM). Hassan et al. [5] proposed a hybrid deep learning model to efficiently detect network intrusions based on a convolutional neural network (CNN) and a weight-dropped, long short-term memory (WDLSTM) network. Zhang et al. [6] proposed an intrusion detection model based on an improved genetic algorithm (GA) and a deep belief network (DBN). Yang et al. [7] put forward the LM-BP neural network model, applied it to an intrusion detection system, and gave the intrusion detection flow under the LM-BP algorithm. Zarai et al. [8] proposed an intrusion detection system based on a deep neural network and a short-term memory artificial neural network. Li [9] proposed a malicious attack detection method for IoT based on clustering and classification. Shen [10] proposed an attack detection model based on DT-DNN and implemented a lightweight attack detection system working at the transport layer. Han [11] designed a lightweight IoT traffic detection model, Page-Net. The model lays out network parameters according to the distribution of traffic features and achieves high detection accuracy with a small number of parameters, which makes it more suitable for deployment in edge environments. Jin [12] proposed an abnormal traffic detection technique based on mixed temporal and spatial dimensions and a sliding window, which improves the efficiency and accuracy of abnormal traffic detection with lower computational overhead. Chen [13] proposed a collaborative anomaly detection framework based on the Internet of Things and studied an image-based anomaly detection algorithm.

Data Processing
In this paper, machine learning is used to detect IoT traffic. Training a machine learning model depends on an appropriate data set, so the first step is to find suitable training data. After comparison, we chose BoT-IoT and UNSW-NB15 as the data sets. The BoT-IoT data set simulates attack data collected in an IoT environment and includes 4 attack categories. UNSW-NB15 is not a data set specifically for IoT traffic, but it contains modern attack data that better matches the characteristics of real scenarios, and its rich set of attack types makes up for the limited attack types of BoT-IoT.

Data Balance and Encoding
After selecting the data sets, we need to process them. The first step is to solve the problem of uneven data distribution. We randomly undersample the classes that account for a large proportion of the data, and supplement the classes that account for a small proportion with the SMOTE algorithm. This gives a balanced data set. The data after balancing is shown in Table 1 and Table 2. The traffic data sets contain a large number of discrete features, such as the protocol and the state of the session. These discrete features must be converted into a form that a machine learning algorithm can use, so we encode them with one-hot coding. One-hot coding represents categorical variables as binary vectors. It first maps the categorical values to integer values. Each integer value is then represented as a binary vector that is zero everywhere except at the index of the integer, which is set to one.
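The balancing and encoding steps can be illustrated with a minimal standard-library sketch. It is not the authors' pipeline: the function names, the neighbour count `k`, and the toy protocol values are assumptions made for illustration, and the oversampling function only reproduces the core SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours.

```python
import random

def one_hot_encode(values):
    """Map categorical values to integer indices, then to binary vectors."""
    categories = sorted(set(values))                 # stable category order
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1                            # one at the category's index
        vectors.append(vec)
    return categories, vectors

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea).
    Needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x inside the minority class, excluding x
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()                           # position on the segment x -> nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

def undersample(majority, n_keep, seed=0):
    """Randomly keep n_keep samples of an over-represented class."""
    return random.Random(seed).sample(majority, n_keep)

# Toy protocol column (hypothetical values).
categories, encoded = one_hot_encode(["tcp", "udp", "tcp", "icmp"])
print(categories)  # ['icmp', 'tcp', 'udp']
print(encoded)     # [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

In practice a library implementation of SMOTE would normally replace the hand-written interpolation; the sketch only shows what the algorithm does to a minority class.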

Feature Dimensionality Reduction
The original data sets provide many features, and samples are classified according to these features. However, not all of the features are useful. Many features contribute nothing, or even contribute negatively, to classification, and too many features increase the complexity of the classifier, resulting in longer training and testing time. Therefore, it is necessary to reduce the feature dimension before formally training the classification model. At present, the most commonly used feature dimensionality reduction method is PCA. The basic idea of PCA is to find the principal axis directions of the data and form a new coordinate system from them. The dimension of this new coordinate system can be lower than the original dimension, and the data is projected from the original coordinate system onto the new one. This projection is the dimension reduction. PCA reduces dimensionality well in many scenarios, but it only considers the correlation between features and ignores the labels. Some information relevant to classification may therefore be lost, which may affect the training of the classification model. In addition, the features produced by PCA have no practical meaning, so a real feature collector cannot collect them directly. In the operation stage of the traffic detection system, the collected features must go through another PCA transformation before they are sent to the classification model, which increases the complexity and computation of the system. For these reasons, this paper uses GBDT for feature screening to realize feature dimensionality reduction. GBDT is a boosting ensemble learning method that can be used for both classification and regression.
It is composed of multiple decision trees. In each step of the GBDT algorithm, a decision tree is fitted to the residual of the current learner to obtain a new weak learner. Combining the decision trees of all steps gives a strong learner. Assuming that each sample has $n$ features and the GBDT model has $M$ decision trees, the importance of feature $j$ of the sample in the GBDT model is calculated as formula (1).

$$J_j = \frac{1}{M}\sum_{m=1}^{M} J_j(T_m) \tag{1}$$

In formula (1), $J_j$ is the importance of feature $j$ in the global GBDT, and $J_j(T_m)$ represents the importance of feature $j$ in the $m$-th decision tree $T_m$. $J_j(T_m)$ is determined by the change of impurity of the decision tree during node splitting.
There are two ways to express the impurity in a decision tree: the Gini coefficient and information entropy. Taking the Gini coefficient as an example, the Gini coefficient of a node $t$ in the decision tree is calculated as formula (2).

$$G(t) = 1 - \sum_{k=1}^{K} p_k^2 \tag{2}$$

In formula (2), $K$ is the number of categories and $p_k$ is the proportion of class $k$ samples in node $t$. When node splitting is based on feature $j$, the change of impurity is calculated as formula (3).

$$\Delta G_j(t) = G(t) - G(t_{left}) - G(t_{right}) \tag{3}$$

In formula (3), $t_{left}$ and $t_{right}$ represent the two new nodes after node $t$ is split. Seeking the change of impurity is a greedy process, so the feature $j$ selected during node splitting must maximize the change of impurity in this split. After $L$ rounds of splitting, the construction of the decision tree is complete. At this time, the importance of feature $j$ in the decision tree $T_m$ is evaluated as formula (4), where $S_j(T_m)$ is the set of nodes of $T_m$ that were split on feature $j$.

$$J_j(T_m) = \sum_{t \in S_j(T_m)} \Delta G_j(t) \tag{4}$$
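The importance calculation can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, each tree is represented simply as a list of `(feature_index, impurity_decrease)` pairs, and by default each child node is weighted by its share of the parent's samples, as common tree libraries do. That weighting is an assumption the text does not state; `weighted=False` gives the plain difference between parent and child impurities.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right, weighted=True):
    """Change of impurity when `parent` is split into `left` and `right`."""
    if weighted:
        w_left = len(left) / len(parent)     # assumption: weight by sample share
        w_right = len(right) / len(parent)
    else:
        w_left = w_right = 1.0               # plain difference of impurities
    return gini(parent) - w_left * gini(left) - w_right * gini(right)

def tree_importance(splits, n_features):
    """Sum the impurity decreases of all splits made on each feature."""
    importance = [0.0] * n_features
    for feature, decrease in splits:
        importance[feature] += decrease
    return importance

def ensemble_importance(trees, n_features):
    """Average the per-tree importances over all trees of the ensemble."""
    per_tree = [tree_importance(s, n_features) for s in trees]
    return [sum(t[j] for t in per_tree) / len(per_tree) for j in range(n_features)]

# Toy split: 5 normal + 5 attack records separated into two purer children.
parent = ["normal"] * 5 + ["attack"] * 5
left = ["normal"] * 4 + ["attack"]
right = ["normal"] + ["attack"] * 4
decrease = impurity_decrease(parent, left, right)
print(round(gini(parent), 3), round(decrease, 3))  # 0.5 0.18
```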
To further strengthen the reliability of feature reduction and reduce errors, this paper integrates the bagging idea of Random Forest when using GBDT for feature filtering: the samples are divided into multiple groups, multiple GBDT models are constructed, each GBDT model filters features on its own sub data set, and the final features are chosen by comprehensively considering the features filtered by all GBDT models. We call this GBDT dimension reduction method with the bagging idea the multiple GBDT dimensionality reduction method.
Let the feature dimension of the data set before one-hot coding be $n$ and the feature dimension after coding be $n_h$, with $n_h > n$. The feature dimension reduction process for IoT traffic in this paper is as follows:
(1) Encode the original data set with the one-hot coding method. After encoding, feature $x_i$ in the original data set becomes $\{x_{ij} \mid j = 1,2,3,\dots,k_i\}$. For convenience of presentation, write $x_{ij}$ as $x'_l$. Then the original feature space $\{x_i \mid i = 1,2,\dots,n\}$ is mapped to the new feature space $\{x'_l \mid l = 1,2,\dots,n_h\}$.
(2) Divide the data set encoded by the one-hot method into $g$ groups, which gives $g$ sub data sets $\{D_i \mid i = 1,2,\dots,g\}$.
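The grouping step and the final combination of per-model results can be sketched as follows. The text does not specify the rule used to "comprehensively" combine the features filtered by the individual GBDT models, so averaging the importance vectors and keeping the top-ranked features is an assumption made here for illustration. The function names and feature names are hypothetical, and the per-model importance vectors are supplied by hand; in practice they would come from the trained GBDT models (for example, the `feature_importances_` attribute of a scikit-learn gradient boosting model).

```python
import random

def split_into_groups(rows, n_groups, seed=0):
    """Shuffle the encoded data set and deal it into n_groups sub data sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

def select_features(importances_per_model, feature_names, top_k):
    """Average each feature's importance over all models and keep the top_k.

    Averaging is an assumed combination rule, not one stated in the text.
    """
    n_models = len(importances_per_model)
    mean_importance = {
        name: sum(imp[i] for imp in importances_per_model) / n_models
        for i, name in enumerate(feature_names)
    }
    ranked = sorted(feature_names, key=lambda f: mean_importance[f], reverse=True)
    return ranked[:top_k]

# Hypothetical one-hot-encoded feature names and importances from 3 models.
feature_names = ["proto_tcp", "pkts", "bytes", "dur"]
importances = [
    [0.5, 0.3, 0.1, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.6, 0.2, 0.1, 0.1],
]
print(select_features(importances, feature_names, top_k=2))  # ['proto_tcp', 'pkts']

groups = split_into_groups(list(range(10)), n_groups=3)
print([len(g) for g in groups])  # [4, 3, 3]
```

Other combination rules, such as voting on how many models rank a feature in their own top list, would fit the same interface.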

Model Construction
To improve the comprehensiveness of the IoT traffic detection model, this paper trains on two data sets and obtains two neural network models, FNN1 and FNN2, which are deployed in the intrusion detection system to detect traffic in real time. However, using two models means that the traffic must be detected twice, which increases the delay of traffic detection. For edge computing devices with limited resources, detecting twice also doubles the amount of computation, which puts great computing pressure on these devices. The dual network model therefore needs to be improved to raise detection efficiency, and a hierarchical detection model is constructed for this purpose. Fig. 1 shows the structure of the dual network parallel detection model and Fig. 2 shows the structure of the hierarchical detection model.
The hierarchical detection model consists of two binary classification decision trees and two fully connected neural networks. DT1 is a binary classification decision tree trained with the BoT-IoT data set, and DT2 is a binary classification decision tree trained with the UNSW-NB15 data set. FNN1 is a fully connected neural network trained with the BoT-IoT data set, and FNN2 is a fully connected neural network trained with the UNSW-NB15 data set. When detection starts, the traffic is first examined by DT1 to determine whether it is normal or abnormal. If it is abnormal, FNN1 is activated; FNN1 determines the specific attack type and raises an alarm. If DT1 judges the traffic to be normal, DT2 is activated and judges whether it is normal. If DT2 also judges it to be normal, the output is normal. If DT2 judges it to be abnormal, FNN2 is activated and determines the specific attack type. The classification process is shown in Fig. 3.
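The control flow of the hierarchical model can be sketched as a short cascade. The four models are passed in as plain callables, and the stand-in rules, feature names, and returned labels below are hypothetical placeholders for trained classifiers; only the ordering DT1, FNN1, DT2, FNN2 comes from the description above.

```python
def hierarchical_detect(record, dt1, fnn1, dt2, fnn2):
    """Return 'normal' or an attack label for one traffic record.

    dt1/dt2: binary classifiers returning True when the record looks abnormal.
    fnn1/fnn2: multi-class classifiers returning an attack label.
    """
    if dt1(record):            # stage 1: binary check trained on BoT-IoT
        return fnn1(record)    # only abnormal traffic reaches FNN1
    if dt2(record):            # stage 2: binary check trained on UNSW-NB15
        return fnn2(record)    # only abnormal traffic reaches FNN2
    return "normal"            # both trees judged the record normal

# Hypothetical stand-in models; real ones would be trained classifiers.
calls = {"fnn1": 0, "fnn2": 0}

def dt1(record):
    return record["pkts_per_s"] > 1000

def dt2(record):
    return record["failed_logins"] > 5

def fnn1(record):
    calls["fnn1"] += 1
    return "DDoS"

def fnn2(record):
    calls["fnn2"] += 1
    return "Exploits"

traffic = [
    {"pkts_per_s": 20, "failed_logins": 0},
    {"pkts_per_s": 5000, "failed_logins": 0},
    {"pkts_per_s": 30, "failed_logins": 9},
]
results = [hierarchical_detect(r, dt1, fnn1, dt2, fnn2) for r in traffic]
print(results)  # ['normal', 'DDoS', 'Exploits']
print(calls)    # {'fnn1': 1, 'fnn2': 1}
```

The call counters show the point of the design: the expensive networks run only for records that a cheap tree has already flagged.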

Performance Analysis of Hierarchical Detection Model
A decision tree performs worse than a deep neural network in complex classification scenarios, but it performs well on simple binary classification problems. Because a binary classification decision tree has a simple structure, it is better than a complex neural network in detection speed and computation. This paper tests the time cost of the decision tree and the neural network on the same classification task on an ARM platform and an x86 platform. The results show that the time cost of the neural network is 4~5 times that of the decision tree. The specific experimental data is shown in detail in Section 4.
To better evaluate the efficiency of the model, we define some variables, which are shown in Table 3. Since it is difficult to evaluate computational cost directly, we equate the ratio of computational costs with the ratio of time costs, as in formula (5): when the time cost $t_{FNN}$ of an FNN is $k$ times the time cost $t_{DT}$ of a DT, the computational cost $c_{FNN}$ of the FNN is considered to be $k$ times the computational cost $c_{DT}$ of the DT.

$$\frac{c_{FNN}}{c_{DT}} = \frac{t_{FNN}}{t_{DT}} = k \tag{5}$$

The average time of detecting a single record in the dual network parallel detection mode shown in Fig. 1 is given by formula (6), and its computational cost by formula (7). The average time of detecting a single record in the hierarchical detection model shown in Fig. 2 is given by formula (8), and its computational cost by formula (9). Experimental data show that $t_{FNN}$ is about 3.7~5.2 times $t_{DT}$. We take the mean, 4.5 times, that is, $t_{FNN1} = 4.5\,t_{DT1}$ and $t_{FNN2} = 4.5\,t_{DT2}$. Through formula (5), we also get $c_{FNN1} = 4.5\,c_{DT1}$ and $c_{FNN2} = 4.5\,c_{DT2}$. In an actual network environment, the proportion of normal traffic is much larger than that of abnormal traffic, so the probability that a decision tree judges traffic to be normal is very high. We set the proportion of abnormal traffic in the actual environment to 5%, so $p_1 = 0.05$ and $p_2 = 0.05$, and Table 4 can be obtained from formulas (6)-(9). Table 4 shows that the hierarchical traffic detection model is superior to dual network parallel detection in terms of both time cost and computational cost.
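A rough version of this comparison can be worked through numerically. The sketch below is a simplified cost model under stated assumptions, not a reproduction of the paper's formulas or of Table 4: the dual network mode is assumed to run both networks on every record, the hierarchical mode is assumed to run a network only when the preceding tree flags the record, the second probability is read as the chance that DT2 flags a record that reaches it, and all costs are expressed in units of one decision tree evaluation. The function and variable names are hypothetical.

```python
def parallel_cost(c_fnn1, c_fnn2):
    """Per-record cost when every record is evaluated by both networks."""
    return c_fnn1 + c_fnn2

def hierarchical_cost(c_dt1, c_fnn1, c_dt2, c_fnn2, p1, p2):
    """Expected per-record cost of the DT1 -> FNN1 / DT2 -> FNN2 cascade.

    p1: probability that DT1 flags a record as abnormal.
    p2: probability that DT2 flags a record that DT1 passed as normal.
    """
    stage1 = c_dt1 + p1 * c_fnn1                 # DT1 always runs, FNN1 rarely
    stage2 = (1 - p1) * (c_dt2 + p2 * c_fnn2)    # reached only if DT1 says normal
    return stage1 + stage2

c_dt = 1.0            # one decision tree evaluation as the unit of cost
c_fnn = 4.5 * c_dt    # FNN taken as 4.5 times the cost of a DT
p = 0.05              # assumed share of abnormal traffic

par = parallel_cost(c_fnn, c_fnn)
hier = hierarchical_cost(c_dt, c_fnn, c_dt, c_fnn, p, p)
print(f"parallel: {par:.2f}  hierarchical: {hier:.2f}  ratio: {hier / par:.2f}")
```

Under these assumptions the cascade costs roughly a quarter of the dual network mode per record, and the advantage shrinks as the share of abnormal traffic grows.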

Experimental Simulation
The original BoT-IoT data set has 43 features and UNSW-NB15 has 47 features. Both data sets are processed with the multiple GBDT dimensionality reduction method. Each data set retains 19 features, and the filtered features are shown in Table 5. To compare the dimensionality reduction effect of PCA and GBDT, the accuracy of PCA-DT, PCA-FNN, GBDT-DT and GBDT-FNN was tested on the two data sets. The PCA dimension reduction retains 95% of the information, and the parameters of the FNN are shown in Table 6. Fig. 4 shows the accuracy of the different models on the BoT-IoT data set and Fig. 5 shows the accuracy of the different models on the UNSW-NB15 data set.
Since UNSW-NB15 contains 10 types of attacks, displaying its recall and precision would be very complex, so this paper only shows recall and precision on the BoT-IoT data set. Fig. 6 shows the recall of the different models on the BoT-IoT data set and Fig. 7 shows their precision on the same data set.
To justify the hierarchical detection model, we tested and compared the DT binary classification accuracy, DT multi-class accuracy, FNN binary classification accuracy and FNN multi-class accuracy on the BoT-IoT data set. Fig. 8 shows that FNN performs better than DT on the multi-class classification of abnormal traffic, but the difference between DT and FNN on the binary classification of normal and abnormal traffic is very small. Therefore, using DT as the binary classifier is reasonable.
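For reference, the three metrics reported in this section can be computed as in the short sketch below. The labels and predictions are made-up toy values, not experimental data, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Share of records whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, label):
    """Per-class precision and recall for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)   # correctly flagged
    fp = sum(t != label and p == label for t, p in pairs)   # wrongly flagged
    fn = sum(t == label and p != label for t, p in pairs)   # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels only; not results from the paper.
y_true = ["DDoS", "DDoS", "Normal", "Reconnaissance", "DDoS"]
y_pred = ["DDoS", "Normal", "Normal", "Reconnaissance", "Reconnaissance"]

print(accuracy(y_true, y_pred))                            # 0.6
print(precision_recall(y_true, y_pred, "Reconnaissance"))  # (0.5, 1.0)
```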
To verify the detection efficiency of the hierarchical detection model, the time cost of detecting a single sample with the binary classification decision tree and with the FNN model was tested on the ARM platform and the x86 platform. The experimental results are shown in Fig. 9 and Fig. 10. As can be seen from Table 7, the time cost of FNN on the ARM platform is 3.71 times that of DT, and the time cost of FNN on the x86 platform is 5.15 times that of DT.
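A per-sample timing comparison of this kind can be set up as in the sketch below. It is only a measurement harness under assumptions: `tree_like` and `dense_like` are hypothetical stand-ins (a few threshold comparisons versus one small dense layer over 19 inputs), not the trained DT and FNN models, so the ratio it prints says nothing about the figures reported above.

```python
import time

def mean_time_per_sample(predict, samples, repeats=20):
    """Average wall-clock time of one predict() call, in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for sample in samples:
            predict(sample)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(samples))

def tree_like(x):
    """Stand-in for a shallow binary decision tree: two comparisons."""
    return x[0] > 0.5 if x[1] > 0.2 else x[2] > 0.7

# Stand-in for a small fully connected layer: 32 units over 19 inputs.
WEIGHTS = [[0.01 * (i + j) for j in range(19)] for i in range(32)]

def dense_like(x):
    """One ReLU layer followed by a threshold on the summed activations."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in WEIGHTS]
    return sum(hidden) > 1.0

# 50 synthetic records with 19 features each, values in [0, 0.9].
samples = [[((i * 7 + j) % 10) / 10 for j in range(19)] for i in range(50)]

t_dt = mean_time_per_sample(tree_like, samples)
t_fnn = mean_time_per_sample(dense_like, samples)
print(f"tree-like: {t_dt:.2e} s  dense-like: {t_fnn:.2e} s  ratio: {t_fnn / t_dt:.1f}")
```

Running the same harness on each target platform with the real models loaded would give the per-platform ratios directly.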

Summary
To deal with attacks carried by malicious IoT traffic, this paper proposes an IoT intrusion detection scheme based on multi GBDT feature dimensionality reduction and a hierarchical traffic detection model. The scheme first uses multiple GBDT models to reduce the dimension of two network traffic data sets, and then trains the dual network model on the two processed data sets. To improve the efficiency of traffic detection, a hierarchical detection model is proposed. The model is composed of two binary classification decision trees and two fully connected networks. The hierarchical detection model takes the characteristics of real network traffic into account and combines the advantages of decision trees and neural networks, so it improves detection efficiency while detecting as many attack categories as possible.