An Enhanced Memetic Algorithm for Feature Selection in Big Data Analytics with MapReduce

Recently, various research fields have begun dealing with massive datasets for several purposes. The main aim of a feature selection (FS) model is to eliminate noisy, repetitive, and unnecessary features that reduce the efficiency of classification. Traditional FS models cannot manage massive datasets and filter out unnecessary features within a limited period. It has been found from the state-of-the-art literature that metaheuristic algorithms perform better than other wrapper-based FS techniques. Common techniques such as the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) algorithm, however, suffer from slow convergence and local optima problems. Later-generation algorithms such as the Firefly heuristic and the Fish Swarm heuristic attempt to overcome these issues. In this perspective, this paper introduces an Enhanced Memetic Optimization (EMO) algorithm for FS in large datasets by using conditional criteria. The proposed EMO algorithm divides the entire dataset into blocks of samples and conducts the learning task in the map phase. In the reduce phase, the partial results obtained are combined into a final vector of feature weights, which defines the appropriate collection of features. Finally, classification based on the support vector machine (SVM) takes place. The proposed EMO algorithm is implemented within the Spark framework, and the experimental results show that it is superior to other approaches: the EMO-FS model obtains maximum AUC values of 0.79 and 0.74 on the Epsilon and ECBDL14-ROS datasets, respectively.


Introduction
Developing prediction techniques from data requires compact Machine Learning (ML), pattern recognition, and statistical modeling approaches. These models are applied to Big Data and must handle a massive number of instances and numerous predictive quantities known as features [1]. Generally, the data dimensionality can be reduced, which yields a selected subset of the original features that is predictive of an outcome of interest T. Specifically, the main theme of FS is to explore a feature subset of limited size with optimal predictive performance. The architecture of Hadoop MapReduce is shown in Fig. 1.
FS assists the learning process by removing irrelevant and repetitive data related to weakly relevant models [2]. This produces prediction techniques that rely on a minimum number of features and are therefore easy to analyze, understand, and trust, as well as rapid in prediction. Hence, FS offers valid insight into the process that generated the data and acts as a basic tool in knowledge discovery; solutions to FS correlate deeply with simple techniques that explain the explored data [3]. FS can be referred to as the fundamental operation in detecting such a by-product. Developing an FS technique is very hard because the FS problem is combinatorial and is NP-hard even for linear regression problems [4]. Exhaustive search over every feature subset is infeasible compared with heuristically identifying a minimal feature subset. Heuristic search approaches, as well as approximations, are therefore essential for scaling FS, ranging from convex relaxations with parametric assumptions such as linearity, as in the Lasso technique [5], to causally-inspired, non-parametric modules, namely, secured data distribution for a causal method [6]. In particular, Big Data has high dimensionality and a high volume of instances, so processing becomes CPU-bound and data-intensive, which is difficult to manage on an individual system. One of the major problems in this study is to partition data horizontally (samples) and vertically (features), termed hybrid partitioning, so that computations are processed locally for every block and integrated with minimum communication load. Another problem is the type of heuristics that can be applied rapidly and safely to remove irrelevant and repetitive features. Hybrid partitioning of data samples and learned models [7] is said to be an open research issue in Big ML models, whereas protective FS heuristics have been presented for sparse Big Data [8], which can be applied only to such sparse data. To address the limitations of this model for big-volume data, the Parallel, Forward-Backward with Pruning (PFBP) technique has been proposed. PFBP is independent of data sparsity and hence can be used on both massive and sparse datasets, and it has been extended with optimizations particularly for sparse datasets. PFBP depends upon statistical tests of independence and is motivated by the statistical causal model that justifies the merging [9].
The application of Data Mining (DM) tools to solve big data problems requires redeveloping models suited to this environment. The MapReduce concept [10] and a distributed file system were established by Google as an efficient and effective framework for big data analysis. Hence, it can be applied in DM instead of parallelization approaches like MPI (Message Passing Interface), which has limitations in fault tolerance and simplicity. Several experiments have been carried out on the parallelization of ML tools by applying the MapReduce technique [11]. In recent times, new and reliable workflows, notably Apache Spark [12], have extended the reputed MapReduce model and are effectively employed in different DM and ML problems [13]. Data preprocessing approaches and robust data reduction techniques mainly focus on cleaning and simplifying the input data. Hence, they accelerate the DM process and enhance accuracy by removing noisy and repetitive data. Prior work defines two major kinds of data reduction modules; the first, instance selection and instance generation [14], concentrates on the instance level. Compared with earlier techniques, evolutionary models have been applied effectively in feature selection. However, improving along a single dimension alone might restrict usability and perform poorly in delivering a preprocessed dataset within a limited duration when solving a major problem. To date, few models have been presented for handling the feature space with evolutionary big data techniques.
From the state-of-the-art literature, it has been found that metaheuristic algorithms perform better than other wrapper-based techniques for FS. However, popular techniques like the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) algorithm suffer from slow convergence and local optima problems. Later-generation algorithms like the Firefly heuristic and the Fish Swarm heuristic attempt to solve these limitations. In this view, this paper presents an Enhanced Memetic Optimization (EMO) algorithm for FS in big datasets by including conditional criteria. The proposed EMO algorithm partitions the actual dataset into blocks of samples and performs a learning process in the map phase. In the reduce phase, the attained partial outcomes are re-merged into a final vector of feature weights, which can be used to identify the required set of features. The proposed EMO algorithm has been implemented within the Spark framework, and the experimental outcomes state that it is superior to other algorithms. The rest of the paper is organized as follows. Section 2 offers the proposed EMO-FS model and Section 3 validates the performance of the EMO-FS model. Finally, Section 4 concludes the paper.

The Proposed EMO-FS Model
The proposed MapReduce model for FS is a combined form of generic classification tasks. Specifically, the EMO-FS technique depends on the EMO technique to perform the FS process. Initially, EMO-FS is employed on the actual dataset to attain a vector of weights that exhibits the relevance of each attribute. This vector is then employed in a second MapReduce job to generate a reduced dataset, and the resultant dataset is used by the SVM. Fig. 2 shows the overall process of the EMO-FS model.

EMO Algorithm
Here, the proposed correlation-based memetic FS algorithm for classification is shown in Fig. 3.
In the initialization step, the GA population is initiated randomly, with each chromosome encoding a candidate feature subset. Then a local search (LS) is performed. The LS is processed on every part of the chromosome to attain a local optimum and enhance the feature subset. Genetic operators like crossover and mutation are carried out to produce the next population. This step is repeated until the termination criteria are reached, as sketched below.
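The following is a minimal sketch of this memetic loop, assuming the fitness, local-search, and operator callables defined in the subsections that follow; the function names (`evaluate`, `local_search`, `crossover`, `mutate`, `select`) are placeholders for illustration, not the authors' code.

```python
import random

def memetic_feature_selection(n_features, pop_size, generations, evaluate,
                              local_search, crossover, mutate, select):
    # Initialization: each chromosome is a binary mask over the features.
    population = [[random.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Local search (the memetic step): refine each chromosome toward
        # a local optimum before applying the genetic operators.
        population = [local_search(c) for c in population]
        ranked = sorted(population, key=evaluate, reverse=True)
        offspring = [ranked[0]]            # elitism: keep the best chromosome
        while len(offspring) < pop_size:
            p1, p2 = select(ranked), select(ranked)
            offspring.append(mutate(crossover(p1, p2)))
        population = offspring
    return max(population, key=evaluate)
```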

Population Initialization
In FS, the candidate feature subset is represented by an encoded chromosome. A chromosome is defined as a binary string whose length equals the overall number of features, where every bit encodes one feature. A bit is either 1 or 0, representing the corresponding feature's presence or absence. The chromosome length is denoted as n and the number of '1' bits as m. If prior knowledge regarding the optimal features is available, then m is limited to a pre-determined value; otherwise, m equals n. At the initial point of the search, a population of size p is initiated randomly, as illustrated below.
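A short sketch of this initialization under the encoding above; n is the chromosome length, p the population size, and m the optional bound on the number of '1' bits.

```python
import random

def init_population(n, p, m=None):
    m = n if m is None else m          # without prior knowledge, m equals n
    population = []
    for _ in range(p):
        k = random.randint(1, m)       # subset size, limited to m
        ones = set(random.sample(range(n), k))  # positions of selected features
        population.append([1 if i in ones else 0 for i in range(n)])
    return population
```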

Objective Function
The objective function can be described simply in terms of classification accuracy:

Fitness(c) = Accuracy(S_c),

where S_c is the candidate feature subset encoded in chromosome c, and the function Accuracy(S_c) estimates the classification accuracy obtained with the given feature subset S_c.
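A possible realization of this fitness function, assuming a scikit-learn classifier and cross-validated accuracy as the estimator (both illustrative choices, not mandated by the paper):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def fitness(chromosome, X, y):
    mask = np.asarray(chromosome, dtype=bool)   # bit j = 1 selects feature j
    if not mask.any():                          # empty subsets are invalid
        return 0.0
    # Accuracy(S_c): mean cross-validated accuracy on the selected columns.
    return cross_val_score(LinearSVC(), X[:, mask], y, cv=5).mean()
```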

Local Search Improvement Procedure (LS)
Symmetrical uncertainty (SU) in the correlation-based filter ranking model has been shown to be an effective means of eliminating repeated features and boosting classification accuracy. On this basis, the correlation-based filter ranking technique is applied with SU values serving as memes. When the correlation between a feature and the class is high enough to indicate class relevance, while its correlation with the other features does not reach the level at which it would be covered by other related features, it is considered a good feature for classification.
SU-based correlation values depend on the information-theoretic concept of entropy, which measures the uncertainty of a random variable. The entropy of a variable X is expressed as

H(X) = - Σ_i P(x_i) log2 P(x_i),

and the entropy of X after observing another variable Y is expressed as

H(X|Y) = - Σ_j P(y_j) Σ_i P(x_i|y_j) log2 P(x_i|y_j),

where P(x_i) denotes the prior probabilities of the values of X and P(x_i|y_j) denotes the posterior probabilities of X given the values of Y. Consequently, the amount by which the entropy of X decreases reflects the additional information about X provided by Y and is termed the information gain (IG), defined as

IG(X|Y) = H(X) - H(X|Y).

Based on this value, a feature Y is regarded as more highly correlated with feature X than with feature Z if IG(X|Y) > IG(Z|Y). Here, IG is symmetric for two arbitrary variables X and Y; symmetry is a desired property of a correlation measure between features. However, IG is biased in favor of features with more values, and the measures should be normalized to guarantee that comparable characteristics have a similar impact. Hence, Symmetrical Uncertainty is selected, defined as:

SU(X, Y) = 2 · IG(X|Y) / (H(X) + H(Y)).

SU compensates for IG's bias toward features with more values and normalizes the measure to the range [0, 1], where 1 indicates that knowledge of either variable completely predicts the other and 0 indicates that X and Y are independent. It also treats a pair of features symmetrically. The SU value serves two main functions: it allows removing features whose SU values fall below a threshold, and it supplies feature weights that help in the population initialization of the GA in the memetic technique. A sketch of these quantities follows.
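The following Python sketch computes entropy, conditional entropy, and SU for discrete-valued arrays, directly following the definitions above:

```python
import numpy as np

def entropy(x):
    # H(X) = - sum_i P(x_i) log2 P(x_i)
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    # H(X|Y) = sum_j P(y_j) * H(X | Y = y_j)
    h = 0.0
    for v in np.unique(y):
        mask = (y == v)
        h += mask.mean() * entropy(x[mask])
    return h

def symmetrical_uncertainty(x, y):
    # SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), with IG(X|Y) = H(X) - H(X|Y)
    ig = entropy(x) - conditional_entropy(x, y)
    denom = entropy(x) + entropy(y)
    return 2.0 * ig / denom if denom > 0 else 0.0
```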
A feature with a higher SU measure obtains a larger weight, whereas features with minimal SU values are eliminated. For a dataset with N features and a class C, the applied model explores the collection of predominant features for classification. It comprises two main parts. The first part estimates SU_{i,c} for every feature and places the features in descending order based on the SU_{i,c} measures, where SU_{i,c} states the relationship between feature F_i and class C. The second part processes the ordered list to remove repeated features, recording each predominant feature relative to the other correlated features. A reference feature f_p is used to examine the features ranked lower than f_p. Among these remaining features, when a feature f_q appears to be redundant with respect to f_p, f_q is eliminated; feature f_q is redundant to f_p when the correlation between f_p and f_q is higher than the correlation between f_q and class C. Once an iteration concerning f_p is completed, the model takes the feature immediately next to f_p as a new reference and repeats the same procedure. The process terminates when no further features remain and returns the feature subset, as sketched below.
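A hedged sketch of this predominant-feature procedure, reusing `symmetrical_uncertainty` from the sketch above; X is an (n_samples, n_features) array of discrete values, y the class labels, and the threshold parameter is illustrative:

```python
def select_predominant(X, y, threshold=0.0):
    # Part 1: rank features by SU_{i,c} with the class, descending.
    n_features = X.shape[1]
    su_c = [symmetrical_uncertainty(X[:, i], y) for i in range(n_features)]
    ranked = sorted((i for i in range(n_features) if su_c[i] > threshold),
                    key=lambda i: su_c[i], reverse=True)
    # Part 2: walk the ordered list, dropping f_q when its correlation with
    # the current reference f_p exceeds its correlation with the class C.
    selected = []
    while ranked:
        f_p = ranked.pop(0)
        selected.append(f_p)
        ranked = [f_q for f_q in ranked
                  if symmetrical_uncertainty(X[:, f_q], X[:, f_p]) <= su_c[f_q]]
    return selected
```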

Evolutionary Operators
In this process, benchmark GA tools are employed, including linear ranking selection, uniform crossover, and mutation operators, together with an elitist procedure. Nevertheless, when prior knowledge of the optimal subset size is given, the number of '1' bits must not exceed m during the evolution. The standard uniform crossover and mutation operators do not preserve this constraint, so the Subset Size-Oriented Common Feature Crossover is used, sketched below.
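A hedged sketch of a subset-size-oriented common feature crossover: features shared by both parents are inherited directly, and the non-shared features are distributed randomly so that each child keeps the subset size of one parent. The details may differ from the authors' exact operator.

```python
import random

def ssocf_crossover(p1, p2):
    n = len(p1)
    common = [i for i in range(n) if p1[i] == 1 and p2[i] == 1]
    non_shared = [i for i in range(n) if p1[i] != p2[i]]
    random.shuffle(non_shared)

    def make_child(target_size):
        child = [0] * n
        for i in common:                      # keep the shared features
            child[i] = 1
        need = max(target_size - len(common), 0)
        for i in non_shared[:need]:           # fill up to the parent's size
            child[i] = 1
        return child

    # Each child preserves the number of '1' bits of one parent.
    return make_child(sum(p1)), make_child(sum(p2))
```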

Contraction Criterion
To develop an efficient hybrid technique for global optimization, the exploration abilities of the EA and the exploitation abilities of the LS must be combined in a balanced manner. In MDE, a novel contraction procedure is presented that integrates two quantities: q_1 denotes the distance in the objective space, and q_2 denotes the distance in the decision space.

Here, q_1 quantifies the diversity of the population in the objective space.

EMO-FS Algorithm
The parallelization of the EMO model is achieved by applying the MapReduce technique to obtain a vector of weights. Assume T is a randomized training set stored in HDFS and m is the number of map tasks. The partitioning scheme of MapReduce splits T into m disjoint subsets of samples. Every subset T_i (i ∈ {1, 2, …, m}) is then processed by the corresponding Map_i task. The partitioning is performed evenly, so every subset contains approximately the same number of instances, and the randomization of T assures adequate class balance. The map stage runs an FS model on every T_i. Hence, the result of every map function is a binary vector f_i = {f_i1, …, f_iD}, where D refers to the number of features, representing the features selected by the EMO method. In the reduce phase, a vector x is obtained as described in Eq. (7),

x_j = (1/m) Σ_{i=1}^{m} f_ij,    (7)

where x_j signifies the average of the FS outputs that contain feature j. The outcome of the entire FS process is then used to build a reduced dataset that can be employed by subsequent ML techniques. The reduce phase is processed by a single task, which shortens the implementation by reducing MapReduce overhead. The complete procedure is carried out with limited iterations of the MapReduce workflow and without extra disk usage, as sketched below.
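A hedged PySpark sketch of this map/reduce scheme; `run_emo_fs` is a hypothetical stand-in for the local EMO feature selector (here stubbed with a random vector so the sketch runs), and the HDFS path, m, and D are placeholders:

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EMO-FS").getOrCreate()
sc = spark.sparkContext
m = 8       # number of map tasks / sample blocks T_i (placeholder)
D = 2000    # number of features (Epsilon-sized, for illustration)

def run_emo_fs(block):
    # Placeholder for the local EMO feature selector described earlier;
    # here it simply returns a random binary selection vector of length D.
    return np.random.randint(0, 2, size=D)

def map_phase(rows):
    block = list(rows)                  # the local sample block T_i
    yield np.asarray(run_emo_fs(block), dtype=float)  # binary vector f_i

partial = (sc.textFile("hdfs:///data/train.csv")   # placeholder path
             .repartition(m)
             .mapPartitions(map_phase))
x = partial.reduce(lambda a, b: a + b) / m  # Eq. (7): x_j = (1/m) sum_i f_ij
```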

Support Vector Machine (SVM)
The SVM is a famous classification technique used in supervised ML. It aims to determine the optimal separating hyperplane with the maximum margin on the training data, dividing the n-dimensional space into two distinct regions. The SVM method also has solid underpinnings in statistical learning theory and has been applied effectively in several linear and non-linear classification tasks. Several kernel functions exist for the SVM, namely the linear kernel, the normalized polynomial kernel, the polynomial kernel, the Radial Basis Function (RBF) kernel, and the hyperbolic tangent (sigmoid) kernel. In binary classification, the SVM outputs class labels, either positive or negative, for all instances, from which metrics such as the ROC curve can be computed; it also determines the distance of each instance to the hyperplane that divides the classes. SVM has several benefits: it achieves optimal outcomes when handling binary representations and is capable of managing a minimal number of features.
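An illustrative single-node stand-in for this classification stage using scikit-learn (the paper's pipeline runs on Spark); the synthetic data exists purely to make the sketch self-contained, and the RBF kernel is one of the kernel choices listed above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))            # stand-in for the reduced dataset
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the SVM on the reduced feature set and report AUC, the metric
# used in the experiments.
clf = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]    # class-1 probabilities for the ROC
print("AUC:", roc_auc_score(y_te, scores))
```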

Dataset Reduction with MapReduce
Here, the vector x has been computed, and the purpose is to extract a few promising features from the actual dataset. For scalability, a further MapReduce procedure is designed. Initially, vector x is binarized using a threshold h:

b_j = 1 if x_j > h, and b_j = 0 otherwise.

The vector b denotes the features chosen to reduce the dataset. The number of chosen features, D' = Σ_{j=1}^{D} b_j, is controlled by h: a higher threshold selects fewer features, whereas a lower threshold permits more features to be selected. A sketch of this step follows.
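A minimal sketch of this reduction step for a single in-memory block; with MapReduce the same column mask b would be applied to every block of rows in parallel:

```python
import numpy as np

def reduce_dataset(X, x_weights, h):
    b = (np.asarray(x_weights) > h).astype(int)   # b_j = 1 iff x_j > h
    d_prime = int(b.sum())                        # D' = sum_j b_j
    return X[:, b.astype(bool)], d_prime

# Hypothetical usage: X_reduced keeps only the D' columns selected by b.
# X_reduced, d_prime = reduce_dataset(X, x_weights, h=0.5)
```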

Dataset Description
For the result analysis, two datasets, namely Epsilon and ECBDL14-ROS, have been used. The details of the datasets are provided in Tab. 1. The Epsilon dataset includes a total of 400,000 instances for training and 100,000 instances for testing, with a set of 2,000 features. The ECBDL14-ROS dataset includes a total of 65,003,913 instances for training and 2,897,917 instances for testing, with a set of 631 features.

Implementation Setup
The experiments are conducted on a cluster of 20 computing nodes and a master node. Each computing node comprises: (i) Processors: 2 x Intel Xeon CPU E5-2620, (ii) Cores: 6 per processor (12 threads), (iii) Clock frequency: 2.00 GHz, (iv) Cache: 15 MB, (v) Network: QDR InfiniBand (40 Gbps), (vi) Hard drive: 2 TB, and (vii) RAM: 64 GB. The Hadoop master runs the NameNode and JobTracker, hosted on the master node. The former manages the HDFS by coordinating the slave nodes through their corresponding DataNode processes, whereas the latter is responsible for the TaskTrackers on every computing node that execute the MapReduce jobs. Spark uses the same configuration: the master process is placed on the master node and the worker tasks run on the slave machines. Both frameworks share the same HDFS file system. The experimental results show that the proposed model offers maximum performance on the Epsilon and ECBDL14-ROS datasets, with the EMO-FS model achieving maximum AUC values of 0.79 and 0.74, respectively. This achievement stems from the way the EMO algorithm partitions the actual dataset into blocks of samples, performs the learning process in the map phase, and re-merges the partial outcomes in the reduce phase into a final vector of feature weights used to identify the required set of features.

Conclusion
This paper has presented an EMO algorithm for FS in big datasets. The proposed EMO algorithm partitions the actual dataset into blocks of samples and performs a learning process in the map phase. In the reduce phase, the attained partial outcomes are re-merged into a final vector of feature weights, which can be employed to identify the required set of features. Finally, SVM-based classification takes place. The proposed EMO algorithm has been implemented within the Spark framework. The experimental results show that the proposed model offers maximum performance on the Epsilon and ECBDL14-ROS datasets, where the EMO-FS model has achieved maximum AUC values of 0.79 and 0.74, respectively. In the future, the proposed model can be further enhanced by the use of deep learning and parameter tuning models.