As the importance of email grows, the volume of malicious email is also increasing, and with it the need for malicious email filtering. Because it is more economical to combine commodity hardware (mid-range servers or PCs) with a virtual environment into a single pooled server resource and filter malicious email with machine learning techniques, we used the Hadoop MapReduce framework together with Naïve Bayes. Naïve Bayes was selected because it ranks among the top machine learning methods (Support Vector Machine (SVM), Naïve Bayes, K-Nearest Neighbor (KNN), and Decision Tree) in terms of execution time and accuracy. Malicious email was filtered in two ways: with a MapReduce program applying the supervised Naïve Bayes technique on a performance-optimized Hadoop framework, and with a Python program applying the same Naïve Bayes technique on a bare metal server without Hadoop. Comparing the accuracy and prediction error rates of the two methods, the Hadoop MapReduce Naïve Bayes method improved spam and ham identification accuracy by a factor of 1.11 and the prediction error rate by a factor of 14.13 over the non-Hadoop Python Naïve Bayes method.
Email plays a very important role in inter-company business and individuals' social lives, and since it has legal effect in business transactions, the importance of blocking malicious email is growing. The amount of malicious email has increased to more than 45% of all email [
In a brief comparison of Hadoop and Spark, their common feature is that both are big data frameworks for processing large datasets. They differ in their execution model: whereas Spark processes data primarily in memory, Hadoop is a disk-based framework that provides per-node computation, storage, and replication in a cluster of multiple nodes for large datasets. Hadoop scales out easily, so the service can be expanded with little effort, and its structure supports data distribution and redundant storage between nodes using the MapReduce programming model [
Methods to shorten Hadoop execution time include hardware improvements, network infrastructure improvements, Hadoop configuration tuning, job (or task) scheduling, and data locality algorithms [
Hadoop is a Java-based open source framework that can carry out distributed processing of massive data. It is a framework that carries out distributed storage of data in a Hadoop Distributed File System (HDFS) and processes data using MapReduce, a distributed processing system [
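In practice, a MapReduce job over HDFS data is written as a map function and a reduce function, with Hadoop's shuffle/sort stage between them. The word-count-style sketch below illustrates the model only (it is not the paper's filtering job; the function names and the chaining in the demo are ours — under Hadoop Streaming the two phases would be separate scripts reading stdin):

```python
from itertools import groupby

def map_line(line):
    """Map phase: emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]

def reduce_pairs(pairs):
    """Reduce phase: Hadoop delivers pairs grouped and sorted by key;
    here we sort explicitly, then sum the counts for each key."""
    return [(key, sum(count for _, count in group))
            for key, group in groupby(sorted(pairs), key=lambda kv: kv[0])]

if __name__ == "__main__":
    # Stand-in for HDFS input splits; in a real job each mapper reads a split.
    lines = ["spam ham spam", "ham"]
    pairs = [kv for line in lines for kv in map_line(line)]
    for word, count in reduce_pairs(pairs):
        print(f"{word}\t{count}")
```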
Among the Hadoop improvement methods, tuning Hadoop parameters is efficient, and parameter tuning methods can be classified into six types. Hadoop configuration parameters affect job performance in various ways. Hadoop currently has more than 200 parameters, and the main ones for reducing execution time include the block size, the number of map nodes, the number of reducer job nodes, map output data compression, temporary data processing, the memory buffer size, and the number of Map/Reduce tasks. The six types of parameter tuning methods are the rule-based, cost-modeling, simulation-based, experiment-based, machine learning, and adaptive tuning approaches. In this paper, the experiment-based approach was selected because, although the machine learning approach is in principle optimal, the parameter information it obtains is sometimes inappropriate for the real environment, and both it and the adaptive tuning approach are affected by Hadoop version changes, causing cost problems. The experiment-based approach, in contrast, can optimize and tune the major Hadoop parameters while reflecting the system environment as much as possible. That is, the adaptive tuning approach, which automatically extracts parameter information from the application, is affected by the Hadoop version, and the application appears to have difficulty automatically optimizing parameters with a proper understanding of the given system environment [
Among related papers that improve execution time by adjusting Hadoop parameters, the results of improving the execution time of Hadoop Mahout by adjusting the block size, the number of replications, and the memory buffer size were presented in [
Machine learning is divided into supervised learning, unsupervised learning, and reinforcement learning [
P(A): prior probability, the probability of cause A, determined before the outcome is observed.
P(B|A): likelihood, the probability of outcome B occurring given that cause A has occurred.
P(A|B): posterior probability, the probability of cause A given that outcome B has occurred.
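The three quantities above are related by Bayes' theorem, which the Naïve Bayes classifier applies directly:

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```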
Hadoop is a distributed computing-based framework that processes massive data and handles data merging and sorting through MapReduce: the framework moves distributed datasets across multiple data nodes, merges them, and sorts them based on key values [
To improve Hadoop performance, about five configuration parameters, including the block size, the number of reducers, and the replication factor, were adjusted based on the experiment-driven approach [
Kim et al. [
Spam filtering methods are largely divided into reputation-based and content-based filtering. Representative reputation-based methods are the blacklist and whitelist methods. Representative content-based methods are machine learning filters, of which Bayesian filtering, SVM filtering, and boosting algorithms are representative examples. The Bayesian filtering method, applied in this paper, generates a final probability estimate by combining the individual spam or ham probabilities of each word in the message [
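One common way to combine per-word spam probabilities into a final estimate, used in classic Bayesian spam filters, is the ratio of products below. This is an illustration only (the formula choice and the example probabilities are ours, not taken from the paper):

```python
def combine(word_spam_probs):
    """Combine individual word spam probabilities p_i into one estimate:
    P = prod(p_i) / (prod(p_i) + prod(1 - p_i))."""
    spam = 1.0
    ham = 1.0
    for p in word_spam_probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)

# Two words that each look 90% spammy reinforce each other: 0.81 / (0.81 + 0.01)
print(combine([0.9, 0.9]))
# A neutral word (0.5) scales both products equally and changes nothing.
print(combine([0.9, 0.9, 0.5]))
```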
(w: each message is represented by w = {
Machine learning-based spam classification algorithms include SVM, Naïve Bayes, KNN, and Decision Tree. Among them, SVM and Naïve Bayes are the most efficient, but they suffer from low scalability and limited accuracy. As a solution to these problems, accuracy, speed, and scalability were improved by running them in the MapReduce framework [
This paper analyzes email data and classifies it as spam (harmful traffic) or ham using the Naïve Bayes algorithm in a MapReduce program based on the Hadoop framework. To process massive data, a Hadoop environment was selected because scale-out is easy and existing commodity hardware can be combined with a virtual environment to easily manage distributed data replication. To improve the prediction accuracy of spam filtering, the Naïve Bayes algorithm, which applies Bayes' theorem, was used. Vyas et al. [
As for the tuning method, the Hadoop execution time was improved by selecting 19 Hadoop configuration parameters, which are elements that greatly affect performance, among the Hadoop configuration parameters based on an experiment-driven approach [
In order to shorten the execution time while increasing malicious email filtering accuracy with the improved Hadoop framework, this paper conducted experiments using the Naïve Bayes theorem based on MapReduce. According to the experimental results of Vyas et al. [
True Positive (TP): an instance that is actually true predicted as true (correct answer)
False Positive (FP): an instance that is actually false predicted as true (wrong answer)
False Negative (FN): an instance that is actually true predicted as false (wrong answer)
True Negative (TN): an instance that is actually false predicted as false (correct answer)
Technique | Accuracy | Execution time | TP | TN | FP | FN |
---|---|---|---|---|---|---|
Clustering | 53.915% | 1.33 sec | 1 | 0 | 1 | 0 |
J48 | 89.3617% | 1.52 sec | 0.84 | 0.955 | 0.045 | 0.16 |
Naïve Bayes | 91.4894% | 0.46 sec | 0.84 | 1 | 0 | 0.16 |
SMO | 93.617% | 1.92 sec | 0.88 | 1 | 0 | 0.12 |
ID3 | 93.617% | 0.81 sec | 0.88 | 1 | 0 | 0.12 |
Four commodity PCs were used in the experimental environment, the name node and secondary name node were configured with one PC each, and the data nodes were configured with three PCs. The details of hardware are shown in
Item | Name node system spec | Data node system spec |
---|---|---|
CPU | Core i7-7700 @ 3.6 GHz | Core i5-7500 @ 3.4 GHz |
Number of cores | Quad-Core | Quad-Core |
OS version | Ubuntu 16.04 | Ubuntu 16.04 |
Memory | DDR4 8 GB | DDR4 8 GB |
Network environment | 1 Gbps Wire LAN | 1 Gbps Wire LAN |
Hadoop version | 2.5.2 | 2.5.2 |
Python version | 2.7/3.6 | 3.5.2 |
As for the experimental method, Hadoop framework tuning was optimized by adjusting the Hadoop configuration parameters that affect performance, and eight test cases were selected accordingly. For test case 1, the default values set during Hadoop installation were used; for test cases 2 through 8, values appropriate for the configuration parameters that affect execution time were set. For test case 2, dfs.replication, which controls the number of dataset copies, was set to 2; mapreduce.map.memory.mb to 2048 MB; mapreduce.reduce.memory.mb to 4096 MB; mapreduce.map.java.opts.max.heap to 1638 MB; mapreduce.reduce.java.opts.max.heap to 3277 MB; mapreduce.task.io.sort.mb to 200 MB; mapred.reduce.parallel.copies to 20; yarn.scheduler.maximum-allocation-mb to 4096 MB; yarn.scheduler.maximum-allocation-vcores to 1; and mapreduce.output.fileoutputformat.compress.type to block. For test case 3, unlike test case 2, dfs.block_size was changed to 128 MB, dfs.namenode.handler.count was set to 10, dfs.replication to 3, mapred.tasktracker.map.tasks.maximum to 7, mapred.tasktracker.reduce.tasks.maximum to 7, and mapred.map.tasks.speculative.execution to false. From test case 4 to test case 8, Hadoop was tuned by adjusting only the block size; the other parameters were not adjusted because they were judged to have been optimized with the values set in test case 3. To test the execution time, dfs.block_size was set to 256 MB in test case 4, 512 MB in test case 5, 1024 MB in test case 6, 256 MB in test case 7, and 128 MB in test case 8.
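For illustration, a subset of the test case 2 settings would appear in the cluster configuration files roughly as below. This is a sketch only: the property names and values are those listed above, and the assignment of each property to hdfs-site.xml, mapred-site.xml, or yarn-site.xml follows standard Hadoop conventions rather than the paper's files.

```xml
<!-- hdfs-site.xml -->
<property><name>dfs.replication</name><value>2</value></property>

<!-- mapred-site.xml -->
<property><name>mapreduce.map.memory.mb</name><value>2048</value></property>
<property><name>mapreduce.reduce.memory.mb</name><value>4096</value></property>
<property><name>mapreduce.task.io.sort.mb</name><value>200</value></property>

<!-- yarn-site.xml -->
<property><name>yarn.scheduler.maximum-allocation-mb</name><value>4096</value></property>
<property><name>yarn.scheduler.maximum-allocation-vcores</name><value>1</value></property>
```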
As shown in
Test case type | MapReduce without Laplace smoothing (A) (sec) | MapReduce with Laplace smoothing (B) (sec) | Bare metal server, without Hadoop (C) (sec) | Execution time ratio (A/C) |
---|---|---|---|---|
Test case 1 | 5744 | 5771 | 1456 | 3.94 |
Test case 2 | 4980 | 5771 | 1456 | 3.42 |
Test case 3 | 5213 | 4887 | 1456 | 3.355 |
Test case 4 | 4897 | 5023 | 1456 | 3.362 |
Test case 5 | 5048 | 8352 | 1456 | 3.466 |
Test case 6 | 5032 | 5095 | 1456 | 3.5 |
Test case 7 | 5183 | 5147 | 1456 | 3.5 |
Test case 8 | 5029 | 5045 | 1456 | 3.5 |
Given that, the execution time in test case 1 as shown in
In
Item | Hadoop MapReduce Naïve Bayes malicious email prediction with Laplace smoothing | Bare metal Python Naïve Bayes malicious email prediction in non-Hadoop environment |
---|---|---|
Number of data files | 1,500,976 | 1,500,976 |
Number of successful file reads | 1,494,579 | 1,454,489 |
Number of failed file reads | 6,397 | 46,487 |
TP | 484,977 | 307,120 |
TN | 998,856 | 999,453 |
FP | 10,746 | 147,916 |
FN | 10,746 | 147,916 |
Number of correct answers | 1,483,833 | 1,306,573 |
Number of incorrect answers | 10,746 | 147,916 |
< Meaning of the performance evaluation indices in this experiment >
TP: actual spam email correctly predicted as spam (correct answer)
TN: actual ham email correctly predicted as ham (correct answer)
FP: actual ham email wrongly predicted as spam (wrong answer)
FN: actual spam email wrongly predicted as ham (wrong answer)
The performance evaluation indicators TP, TN, FP, and FN were compared graphically between the Hadoop MapReduce Naïve Bayes with Laplace smoothing and the non-Hadoop Python Naïve Bayes in
Item | Hadoop MapReduce Naïve Bayes malicious email prediction with Laplace smoothing | Bare metal Python Naïve Bayes malicious email prediction in non-Hadoop environment |
---|---|---|
Prediction error rate | 0.72% | 10.17% |
Accuracy | 99.28% | 89.83% |
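As a cross-check, the accuracy and prediction error rates above follow directly from the counts reported earlier: correct and incorrect answers divided by the number of successfully read files.

```python
def rates(correct, incorrect):
    """Return (accuracy, prediction error rate) as percentages of classified files."""
    total = correct + incorrect
    return round(100 * correct / total, 2), round(100 * incorrect / total, 2)

# Hadoop MapReduce Naïve Bayes with Laplace smoothing: 1,483,833 correct, 10,746 incorrect
print(rates(1_483_833, 10_746))    # (99.28, 0.72)
# non-Hadoop bare metal Python Naïve Bayes: 1,306,573 correct, 147,916 incorrect
print(rates(1_306_573, 147_916))   # (89.83, 10.17)
```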
The malicious email prediction performances of Hadoop MapReduce Naïve Bayes with Laplace smoothing and non-Hadoop Python Naïve Bayes are compared in the graph shown in
For the algorithm for Hadoop performance improvement and malicious email prediction filtering, the MapReduce model was constructed with a two-step MapReduce method, referring to Mseltz [
|V|: the number of terms in the vocabulary
Step 1) MapReduce Naïve Bayes program process with Laplace smoothing for training
To classify malicious email into spam and ham, the posterior probability of class c is expressed as
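In the standard multinomial Naïve Bayes formulation (stated here for completeness; the notation is ours, with |V| as defined above), the predicted class and the Laplace-smoothed word likelihood used in training are:

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
```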
The libraries used in the bare metal Python Naïve Bayes program are pandas, NumPy, CountVectorizer, and multinomial Naïve Bayes (scikit-learn's MultinomialNB), and the process of training, filtering, and classification is described in
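The same training-and-classification flow can be sketched without those libraries. The minimal classifier below is our illustration, not the paper's code: it implements multinomial Naïve Bayes with Laplace smoothing over word counts, which is essentially what CountVectorizer plus MultinomialNB compute.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label). Returns log-priors, Laplace-smoothed
    log-likelihoods per class, and the vocabulary."""
    labels = Counter(label for _, label in docs)
    counts = {c: Counter() for c in labels}            # word counts per class
    for text, label in docs:
        counts[label].update(text.split())
    vocab = set(w for c in counts for w in counts[c])
    v = len(vocab)                                     # |V|
    log_prior = {c: math.log(n / len(docs)) for c, n in labels.items()}
    log_like = {
        c: {w: math.log((counts[c][w] + 1) / (sum(counts[c].values()) + v))
            for w in vocab}
        for c in labels
    }
    return log_prior, log_like, vocab

def classify(text, log_prior, log_like, vocab):
    """Pick the class maximizing log P(c) + sum of log P(w|c); unseen words skipped."""
    def score(c):
        return log_prior[c] + sum(log_like[c][w] for w in text.split() if w in vocab)
    return max(log_prior, key=score)

# Toy corpus (ours) standing in for the labeled spam/ham email dataset.
docs = [("win money now", "spam"), ("cheap money win", "spam"),
        ("meeting agenda today", "ham"), ("project meeting notes", "ham")]
model = train(docs)
print(classify("win cheap money", *model))   # spam
print(classify("meeting notes", *model))     # ham
```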
Hadoop tuning was optimized through the optimization of Hadoop configuration parameters. Execution time in the Hadoop environment was longer than in the non-Hadoop Linux environment because resource sharing, data replication, and data node resource allocation take longer in a virtual environment than on bare metal. However, since a bare metal environment is operated by a single server, its disadvantage is that resources are difficult to share with other servers. That is, there is a limit to adopting a bare metal framework for a malicious email filtering system just because its execution time is short. A peculiarity observed in the Hadoop tuning process is that execution time does not necessarily improve when the block size is increased, nor when the number of dataset copies is reduced. For these reasons, the most reasonable way to tune Hadoop is to derive and optimize the most appropriate configuration parameter values for the given environment through experiments, based on an experiment-driven approach.
In this paper, in addition to finding a Hadoop performance optimization method, our study sought to increase the accuracy of malicious email classification by comparing the filtering accuracy of MapReduce Naïve Bayes with Laplace smoothing on the tuning-optimized Hadoop framework against that of a Python Naïve Bayes program in a non-Hadoop environment. After training and classification with Laplace smoothing applied, the Hadoop MapReduce Naïve Bayes method showed a prediction error rate 14.13 times smaller, and an accuracy 1.11 times higher, than the non-Hadoop Python Naïve Bayes method. Therefore, for spam and ham classification of large volumes of malicious email, the MapReduce Naïve Bayes method with Laplace smoothing in the Hadoop framework environment can be used to make more accurate predictions for malicious email filtering.
The Hadoop framework applied in this paper has the disadvantage of being slower than a bare metal server used alone in a non-Hadoop environment, because it uses several commodity servers in a virtual environment and replicates the dataset through distributed computing. To mitigate this shortcoming, eight test cases were configured by selecting 14 Hadoop configuration parameters that significantly affect performance, based on an experiment-driven approach, and through repeated tests an environment with optimized performance was derived, as in test case 3 described in
For spam filtering using machine learning techniques, Naïve Bayes was chosen because it offers the best balance of execution time and accuracy among the five compared techniques (clustering, J48, Naïve Bayes, SMO (sequential minimal optimization), and ID3). Also, comparing the prediction error rate of the Laplace smoothing algorithm applied to the MapReduce programming model in the Hadoop framework environment with the prediction error rate in the bare metal server environment, the Hadoop MapReduce Naïve Bayes algorithm was found to be much more accurate than the non-Hadoop Python Naïve Bayes algorithm. According to the analysis of the eight test cases, the Hadoop MapReduce Naïve Bayes method with Laplace smoothing had a malicious email filtering prediction error rate 14.13 times smaller, and an accuracy 1.11 times higher, than the non-Hadoop Python Naïve Bayes method. Consequently, the Hadoop MapReduce Naïve Bayes method with Laplace smoothing has higher prediction accuracy than the non-Hadoop Python Naïve Bayes method and is judged to be the most suitable solution for filtering malicious email.
As a future plan, we will extract sentiment information from unstructured social media text data using the Hadoop MapReduce function and machine learning methods (Naïve Bayes, SVM, and linear regression) and conduct a comparative analysis of the machine learning methods through sentiment analysis.
We would like to express our great appreciation to the Department of Radio and Information Communications Engineering, Chungnam National University, for providing the tools and environment necessary for this project.