Anomaly classification based on network traffic features is an important task for monitoring and detecting network intrusion attacks. Network-based intrusion detection systems (NIDSs) that use machine learning (ML) methods are effective tools for protecting network infrastructures and services from unpredictable and unseen attacks. Among ML methods, random forest (RF) is a robust method that can be used in ML-based network intrusion detection solutions. However, the minimum number of instances for each split and the number of trees in the forest are two key parameters of RF that can affect classification accuracy. Therefore, optimal parameter selection is a real problem in RF-based anomaly classification for intrusion detection systems. In this paper, we propose using a genetic algorithm (GA) to select appropriate values for these two parameters, thereby optimizing the RF classifier and improving the classification accuracy for normal and abnormal network traffic. To validate the proposed GA-based RF model, a number of experiments are conducted on two public datasets and evaluated using a set of performance evaluation measures. In these experiments, the accuracy is compared with the accuracies of baseline ML classifiers from recent work. Experimental results reveal that the proposed model can avert the uncertainty in selecting the values of RF’s parameters, improving the accuracy of anomaly classification in NIDSs without incurring excessive time.
A network-based intrusion detection system (NIDS) is a network security tool that works together with popular data encryption algorithms and firewalls to protect network resources and services [
Recently, ML methods have been used for solving many problems in different applications [
In previous works, numerous studies have been proposed using ML methods for network intrusion detection. The work introduced by Solanki et al. [
Gao et al. [
An audit technique based on the frequencies observed in network data traffic was proposed by Ye et al. [
Some comparative studies on ML methods have been proposed for network defense [
In this study, a genetic-based RF model is proposed and compared with state-of-the-art baseline ML methods for network intrusion detection. The experiments are conducted on two publicly available datasets, namely, KDD99 [
The rest of the paper is structured as follows: Section 2 describes the research methods and the main steps of the proposed IDS model. The experiments and discussion on the used datasets are given in Section 3. Finally, Section 4 summarizes the conclusion of the work.
GA was presented initially by Holland [
All individuals are evaluated by a fitness function that expresses the quality of each individual as a solution. The best parent individuals are then selected, and the crossover and mutation operators are applied to produce new individuals (offspring) for the next generation. The crossover operator combines the features of two selected parents to create two offspring. The mutation operator changes one or more components of a selected individual in order to prevent stagnation during the search process. After a number of generations, when the stopping criterion is met, the individuals that survive in the population are considered the optimal solutions [
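The selection, crossover, and mutation steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the bit-string encoding, truncation selection, elitism, the mutation rate, and the toy "count the 1-bits" fitness are all assumptions chosen for brevity.

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    """Combine two parents at a random cut point and return two offspring."""
    cut = rng.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def mutate(individual, rate, rng):
    """Flip each bit with probability `rate` to keep the search from stagnating."""
    return [1 - gene if rng.random() < rate else gene for gene in individual]

def next_generation(population, fitness, rng, mutation_rate=0.05):
    """Produce one new generation: select parents, cross over, mutate."""
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[: len(ranked) // 2]   # truncation selection (assumption)
    offspring = [ranked[0]]                # elitism: keep the best unchanged (assumption)
    while len(offspring) < len(population):
        a, b = rng.sample(parents, 2)
        for child in one_point_crossover(a, b, rng):
            offspring.append(mutate(child, mutation_rate, rng))
    return offspring[: len(population)]

def run_ga(fitness, n_genes=12, pop_size=20, generations=30, seed=0):
    """Evolve a random population and return the fittest surviving individual."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):           # stopping criterion: fixed generation count
        population = next_generation(population, fitness, rng)
    return max(population, key=fitness)

best = run_ga(fitness=sum)                 # toy fitness: number of 1-bits
print(best, sum(best))
```

Because the best individual is carried over unchanged, the best fitness in the population never decreases from one generation to the next.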
The RF classifier is a powerful ML tool that can be used for solving classification and regression problems. RF is one of the ensemble learning methods that can build a number of decision trees [ Individually and randomly, each decision tree is constructed using different samples of the training dataset. During the construction of each tree, a part of
In other words, the RF can average the probability of decision trees obtained using different random samples of the original dataset [
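To make the averaging idea concrete, the following sketch builds a tiny forest of depth-1 trees on bootstrap samples and averages their class probabilities. It is an illustration under stated assumptions rather than a full RF: the stump learner, the two hypothetical features, and the toy data are invented for the example, and label 1 stands for abnormal traffic.

```python
import random

def fit_stump(rows, labels, rng):
    """Fit a depth-1 tree on one randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        p_left = sum(left) / len(left)
        p_right = sum(right) / len(right) if right else p_left
        # Training errors if each leaf predicts its majority class.
        errors = sum(min(sum(s), len(s) - sum(s)) for s in (left, right))
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, p_left, p_right)
    _, feature, threshold, p_left, p_right = best
    return lambda x: p_left if x[feature] <= threshold else p_right

def fit_forest(rows, labels, n_trees, seed=0):
    """Train each tree on its own bootstrap sample of the training data."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # sample with replacement
        trees.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return trees

def predict_proba(trees, x):
    """Average the per-tree probabilities of the abnormal class."""
    return sum(tree(x) for tree in trees) / len(trees)

# Hypothetical features: (connection duration, bytes sent); label 1 = abnormal.
rows = [(1, 10), (2, 12), (1, 11), (9, 90), (8, 95), (10, 88)]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels, n_trees=25)
p_normal = predict_proba(forest, (1, 10))
p_attack = predict_proba(forest, (9, 92))
print(p_normal, p_attack)
```

A single tree trained on an unlucky bootstrap sample can be badly wrong, but the average over many trees is much more stable.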
The RF classifier has been used in a wide range of applications, such as image classification [
In this research, we explore the application of a GA-based RF to detecting intrusion attacks from the features of network traffic data.
The idea behind the GA-based RF model is to optimize the RF classifier by selecting appropriate parameter values and to improve the detection rate of the NIDS by using the optimized RF. The GA generates random values for the specific parameters of RF and builds a new decision boundary that has the highest value of the GA fitness function. In detail, the datasets for training and testing the GA-based RF model are prepared from network traffic data. The decision boundary of the GA-based RF model is trained using the training set and the GA. After that, the trained GA-based RF model with the selected parameter values is tested to detect the normal and abnormal class labels of the samples in the testing set.
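The parameter search described above might look roughly like the sketch below, in which each individual is a pair (minimum instances per split, number of trees). The search ranges, population size, operators, and the `toy_accuracy` stand-in are assumptions for illustration only. In a real run, `evaluate` would train an RF with the given parameters and return its accuracy, for example with scikit-learn's `RandomForestClassifier(n_estimators=..., min_samples_split=...)`; the paper does not name the library it used.

```python
import random

# Search ranges are illustrative assumptions, not values from the paper.
MIN_SPLIT_RANGE = (2, 20)    # minimum number of instances for each split
N_TREES_RANGE = (1, 100)     # number of trees in the forest

def random_params(rng):
    """Draw a random (min_split, n_trees) individual."""
    return (rng.randint(*MIN_SPLIT_RANGE), rng.randint(*N_TREES_RANGE))

def crossover(a, b):
    """Swap the second gene between two parents, giving two offspring."""
    return (a[0], b[1]), (b[0], a[1])

def mutate(params, rng, rate=0.2):
    """Resample each gene with probability `rate`."""
    min_split, n_trees = params
    if rng.random() < rate:
        min_split = rng.randint(*MIN_SPLIT_RANGE)
    if rng.random() < rate:
        n_trees = rng.randint(*N_TREES_RANGE)
    return (min_split, n_trees)

def ga_search(evaluate, pop_size=10, generations=15, seed=0):
    """Return the (min_split, n_trees) pair with the highest fitness found."""
    rng = random.Random(seed)
    cache = {}

    def fitness(params):
        if params not in cache:                # avoid retraining for repeated pairs
            cache[params] = evaluate(*params)  # e.g. accuracy of an RF with these values
        return cache[params]

    population = [random_params(rng) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        population = ranked[:1]                # keep the current best
        while len(population) < pop_size:
            child_a, child_b = crossover(*rng.sample(parents, 2))
            population += [mutate(child_a, rng), mutate(child_b, rng)]
        population = population[:pop_size]
    return max(population, key=fitness)

def toy_accuracy(min_split, n_trees):
    """Hypothetical stand-in for 'train an RF and return its accuracy'."""
    return 1.0 - abs(min_split - 8) / 20 - abs(n_trees - 40) / 100

best = ga_search(toy_accuracy)
print(best, toy_accuracy(*best))
```

Caching the fitness matters in practice, since each evaluation would involve training a full forest.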
The experiments are conducted on a laptop with an Intel Core i7-4510U CPU at 2.0 GHz, 8 GB of RAM, and a 64-bit Windows 10 operating system. The Python programming language is used to implement the experiments. Two public datasets, namely, KDD99 and UNSW-NB15, are employed to evaluate and compare the proposed model.
As mentioned above, the datasets used in the experiments are KDD99 [
To evaluate the proposed GA-based RF and the other baseline classifiers, the training samples of the two datasets are first used to train these classifiers and build trained models; these trained models are then tested on the two testing sets.
The results of experiments are assessed based on three measures. These measures are accuracy, sensitivity, and precision, computed as follows:
Here, TP and TN are the numbers of true positives and true negatives, and FP and FN are the numbers of false positives and false negatives.
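In terms of these counts, the three measures take their standard forms, which are assumed here to be the definitions intended:

```latex
\begin{align}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP}
\end{align}
```

Sensitivity is the same quantity as the recall reported in the result tables.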
In this section, the results of the experiments are presented and compared with the results of recent related work. After building the GA-based RF model using the KDD99 training set, the best values of the minimum number of instances for each split and the number of trees in the RF are found to be 17 and 2, respectively. For the UNSW-NB15 training set, the minimum number of instances for each split is 4 and the number of trees in the forest is also 2. The other parameters of RF are fixed at their default values.
Actual / Predicted | Normal traffic | Abnormal traffic |
---|---|---|
Normal traffic | 47630 | 283 |
Abnormal traffic | 1808 | 23548 |
Actual / Predicted | Normal traffic | Abnormal traffic |
---|---|---|
Normal traffic | 28267 | 8733 |
Abnormal traffic | 2228 | 43104 |
From the results of confusion matrices, the performance evaluation measures are computed and shown in
Class name | Precision | Recall | F1-score |
---|---|---|---|
Normal traffic | 96.0% | 99.0% | 98.0% |
Abnormal traffic | 99.0% | 93.0% | 96.0% |
Accuracy | 97.2% | | |
Macro avg. | 98.0% | 96.0% | 97.0% |
Weighted avg. | 97.0% | 97.0% | 97.0% |
Class name | Precision | Recall | F1-score |
---|---|---|---|
Normal traffic | 93.0% | 76.0% | 84.0% |
Abnormal traffic | 83.0% | 95.0% | 89.0% |
Accuracy | 86.7% | | |
Macro avg. | 88.0% | 86.0% | 86.0% |
Weighted avg. | 87.0% | 87.0% | 86.0% |
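The per-class figures can be re-derived from the confusion matrices with a short script. The sketch below assumes that rows hold the actual class and columns the predicted class, which is the orientation consistent with the tabulated precision and recall; the function and variable names are ours.

```python
def class_report(matrix, labels):
    """Per-class precision, recall, and F1 from a square confusion matrix.

    Assumes rows are actual classes and columns are predicted classes.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {"accuracy": correct / total}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)   # column sum: TP + FP
        actual = sum(matrix[i])                     # row sum: TP + FN
        precision, recall = tp / predicted, tp / actual
        f1 = 2 * precision * recall / (precision + recall)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Confusion matrices as reported for the two testing sets.
kdd99 = [[47630, 283], [1808, 23548]]
unsw_nb15 = [[28267, 8733], [2228, 43104]]

for name, matrix in (("KDD99", kdd99), ("UNSW-NB15", unsw_nb15)):
    report = class_report(matrix, ["normal", "abnormal"])
    print(name, f"accuracy={report['accuracy']:.3f}")
    for label in ("normal", "abnormal"):
        scores = report[label]
        print(f"  {label}: precision={scores['precision']:.2f} "
              f"recall={scores['recall']:.2f} f1={scores['f1']:.2f}")
```

Under this orientation, the computed per-class precision and recall round to the percentages listed in the two tables.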
To compare the accuracy results of optimized RF classifier to classify anomalies with the traditional RF and other baseline classifiers in the recent work [
| Work/authors [ref.] | Classifier name | Accuracy | Weighted average of precision | Weighted average of recall |
|---|---|---|---|---|
| Khan et al. [ | NB | 94.68% | 95% | 95% |
| | KNN | 96.01% | 96% | 96% |
| | SVM-Poly | 94.04% | 94% | 94% |
| | NB-KE | 94.43% | 95% | 94% |
| | SMO | 95.11% | 95% | 95% |
| | SVM-RBF | 94.95% | 95% | 95% |
| | DS | 93.98% | 94% | 94% |
| | DT | 96.22% | 96% | 96% |
| | RF | 96.79% | 97% | 97% |
| | HT | 92.66% | 93% | 92% |
| This work | GA-based RF | 97.20% | 97% | 97% |
Note. SVM, support vector machine; RF, random forest; GA, genetic algorithm.
| Work/authors [ref.] | Classifier name | Accuracy | Weighted average of precision | Weighted average of recall |
|---|---|---|---|---|
| Khan et al. [ | NB | 76.39% | 78% | 76% |
| | KNN | 84.49% | 86% | 85% |
| | SVM-Poly | 68.34% | 69% | 68% |
| | NB-KE | 76.22% | 77% | 76% |
| | SMO | 85.34% | 86% | 85% |
| | SVM-RBF | 83.22% | 84% | 83% |
| | DS | 76.63% | 84% | 77% |
| | DT | 84.55% | 86% | 85% |
| | RF | 83.63% | 87% | 84% |
| | HT | 59.44% | 76% | 59% |
| This work | GA-based RF | 86.70% | 87% | 87% |
Note. SVM, support vector machine; RF, random forest; GA, genetic algorithm.
As shown in the
In this paper, a GA-based RF model is proposed to classify normal and abnormal network traffic for IDSs. The GA is used to select appropriate values for two parameters of RF, namely, the minimum number of instances for each split and the number of trees in the forest, thereby optimizing the RF classifier and improving the accuracy of anomaly classification and intrusion detection. A set of experiments was conducted on two public datasets and evaluated using a set of performance evaluation measures. The experimental results revealed that selecting suitable parameter values for the RF classifier improved the accuracy of network anomaly classification compared to RF with default values. Moreover, the proposed GA-based RF model outperforms the baseline ML models, with accuracies of 97.20% on the KDD99 test set and 86.70% on the UNSW-NB15 test set. In future work, the proposed model will be used with feature selection methods to detect the types of attacks in abnormal network traffic and enhance the network-based IDS.
The author would like to express his gratitude to King Khalid University, Saudi Arabia for providing administrative and technical support.