Hybridization of Fuzzy and Hard Semi-Supervised Clustering Algorithms Tuned with Ant Lion Optimizer Applied to Higgs Boson Search

This paper focuses on the unsupervised detection of the Higgs boson particle using the most informative features and variables which characterize the "Higgs machine learning challenge 2014" data set. This unsupervised detection proceeds through 4 steps: (1) selection of the most informative features from the considered data; (2) definition of the number of clusters based on the elbow criterion (the experimental results showed that the optimal number of clusters that group the considered data in an unsupervised manner is 2); (3) proposition of a new approach for the hybridization of hard and fuzzy clustering tuned with Ant Lion Optimization (ALO); (4) comparison with some existing metaheuristic optimizations such as the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). By employing a multi-angle analysis based on the cluster validation indices, the confusion matrix, the efficiency and purity rates, the average cost variation, the computational time and the Sammon mapping visualization, the results highlight the effectiveness of the improved Gustafson-Kessel algorithm optimized with ALO (ALOGK) and validate the proposed approach. Although the paper presents a complete clustering analysis, its novel contribution lies in the hybridization of hard and fuzzy clustering algorithms tuned with ALO.

A particularly interesting particle is the Higgs boson, postulated by physicist Peter Higgs to support the validity of the Standard Model [35]. This particle was the subject of the 2013 Nobel Prize in Physics awarded jointly to Peter Higgs and François Englert, after its discovery by CERN (the European Organization for Nuclear Research) at the Large Hadron Collider (LHC) [36]. The ATLAS [37] and CMS [38] experiments both claimed the discovery of the Higgs boson. The Higgs boson has several distinct decay modes. Beyond the original discovery, the study of all decay modes tests the validity of the theory and allows new elements to be distinguished.
Given the complexity of Higgs boson recognition, high-performance methods such as neural networks, as well as other multidimensional methods, have been used [39][40][41][42]. However, the use of clustering methods in high energy physics and in the Higgs boson search remains limited. For high energy physics data classification, some clustering methods [43] and similar techniques based on a space-filling curve have been applied [44]. A dynamic nearest neighbor graph implemented with computational geometry was first used in the study of jet clustering in particle physics event properties [33]. An approach based on fuzzy clustering algorithms has been proposed for the recognition of particles based on pulse shape evaluation [45].
Studying the H to tau-tau channel was the topic of the ATLAS challenge [46]. Using an ensemble of deep neural networks, the winning method [47] had a high computational cost but beat the runner-up by a statistically significant margin. There is also an advanced decision tree technique that offers a solid balance between efficiency and simplicity of implementation [48]. An attempt to show that Cartesian Genetic Programming (CGP) could be an interesting way to learn from the large Higgs data set was presented in [49]. The work in [50] shows a practical implementation for the study of non-resonant production of Higgs boson pairs in the context of extensions of the Standard Model with anomalous couplings of the Higgs bosons.
To identify the Higgs boson signals and achieve a better classification performance, it is important that the selected features contain the necessary discriminative information. Good classification results can be achieved if the input is well prepared. This favors simple algorithms whose resulting functions are easy to interpret, compared with complex schemes such as neural networks or support vector machines [41]. There is no ideal clustering algorithm suitable for all areas: successfully applying a specific technique in one domain does not necessarily imply that it can be deployed in another domain with the same degree of satisfaction. Several considerations affect the choice of the best clustering technique, such as the understanding of the clustering objectives, the comparison of algorithm results and the use of appropriate clustering performance measurements [29]. The interaction between the data to be clustered and the efficiency of distinct clustering algorithms is an interesting issue that has yet to be answered in clustering. In [31], an attempt to locate the optimum cluster number is presented, comparing hard and fuzzy clustering algorithms on a thyroid diseases data set. Our work is a continuation of the one presented in [31].
The main contribution of the approach presented in this paper is the hybridization of fuzzy and hard semi-supervised clustering algorithms tuned with ALO to distinguish a Higgs signal from the background in an unsupervised way. The clustering algorithms used in this hybridization are the K-means [51], Global K-means [52,53], K-medoids [54][55][56], Fuzzy C-means [57,58], Gustafson-Kessel [57,58] and Gath-Geva [57,58] algorithms. To assess the performance of this hybridization, a comparative study is carried out with metaheuristic methods such as GA [26,59] and PSO [60,61], using multiple validity indexes. In addition, an in-depth investigation of the proposed algorithms' parameters was carried out to achieve better results. This paper makes a further contribution in selecting the most informative features from the dataset using a feature engineering technique: the Higgs data quantities are ranked using the T-statistic method and the K-means algorithm optimized with PSO, highly correlated features are then excluded using SOM, and finally the feature dimensionality is reduced using the K-means algorithm optimized by PSO.
The Higgs Machine Learning dataset (HiggsML), based on ATLAS events and characterized by a list of selected attributes and variables, is used to confirm the proposed approach [62].
The elbow criterion [63] is applied to find the optimal number of clusters that group the considered data in an unsupervised way.
The rest of this paper is organized as follows. Section 2 explains the dataset setting and gives the physics background of this work. Section 3 introduces the clustering and metaheuristic algorithms used in this work. Section 4 is dedicated to our proposed approach. Section 5 reports the performance of the experimental analyses and comparisons of the proposed clustering algorithms. Section 6 concludes the analysis and discusses future research directions.

Data
As part of the mechanism that gives mass to other elementary particles, the Higgs Boson particle was theoretically anticipated to exist almost 50 years ago. Its importance lies in the fact that it is the last ingredient of the particle physics Standard Model for fundamental particles and forces.
In the beginning, an LHC event is initiated by a proton-proton collision. By using detectors surrounding the intersection zone, it can be observed that the collision energy forms new particles. Most events, however, do not generate interesting particles like the top quark or the Higgs boson. A successful interpretation of the data relies on the efficient distinction between events that produce interesting particles (signal) and those that produce other particles (background).
In the past, the region of interest was designed by human expertise (naive-Bayes-like "cut-based" methods). Event selection in high-energy physics is an important topic for machine learning, as multiple kinds of background can match the distinct signature of the signal.
From the machine learning point of view, the problem can be formally considered a binary classification problem, where the events generated in the collider are pre-processed and represented as feature vectors. The task is to classify events as signal (i.e., an event of interest, in this work a decay of H to tau-tau) or background (an event produced by already known processes). More specifically, the classifier is used as a selection technique that specifies a signal-rich region (not necessarily connected) in the feature space. While the objective is to find a new phenomenon, labeled actual signal events are not available in the real data. Instead, an extensive simulator is used to produce artificial events based on the Standard Model and a detector model, taking into consideration possible artifacts and noise. These simulations are used to evaluate the classifiers.
The classes are very imbalanced in the real data (around two signal events per thousand events after pre-selection). For this reason, the simulated data used in this work are enriched in signal events. To compensate for this bias, weights are applied to all events to reflect their true probability of occurrence.
CERN's ATLAS experiment provided simulated data used in many types of Higgs boson research. A portion of these data was publicly released.
The work presented in this paper is based on the data provided by the Higgs Boson Machine Learning challenge (HiggsML) organized in 2014 by a team of ATLAS physicists and computer scientists [62].
As introduced above, this work considers two classes of events: signals (denoted H class) and background (B class). Notice that each class consists of several sub-classes reflecting the various Higgs decay channels and various other typical reactions. These different sub-classes are not considered in this work.
The collected data amount to 200,000 events, with 100,000 learning events and 100,000 test events. In the learning batch, the class H contributes 34,170 events and the class B contributes 65,830 events. In the test batch, the class H contributes 35,450 events and the class B contributes 64,550 events. Each selected event is described using a list of features. The quantities that were selected by the physicists of ATLAS to select regions of interest are [62]:
• F_1: The estimated mass m_H of the Higgs boson candidate, obtained through probabilistic phase space integration.
• F_2: The transverse mass between the missing transverse energy and the lepton.
• F_3: The invariant mass of the hadronic tau and the lepton.
• F_4: The modulus of the vector sum of the transverse momentum of the hadronic tau, the lepton and the missing transverse energy vector.
• F_5: The absolute value of the pseudorapidity separation between the two jets.
• F_6: The invariant mass of the two jets.
• F_7: The product of the pseudorapidities of the two jets.
• F_8: The R separation between the hadronic tau and the lepton.
• F_9: The modulus of the vector sum of the missing transverse momenta and the transverse momenta of the hadronic tau, the lepton and the leading jet.
• F_10: The sum of the moduli of the transverse momenta of the hadronic tau, the lepton and the leading jet.
• F_11: The ratio of the transverse momenta of the lepton and the hadronic tau.
• F_12: The centrality of the azimuthal angle of the missing transverse energy vector with respect to the hadronic tau and the lepton.
• F_13: The centrality of the pseudorapidity of the lepton with respect to the two jets.
• F_14: The transverse momentum $\sqrt{p_x^2 + p_y^2}$ of the hadronic tau.
• F_15: The pseudorapidity of the hadronic tau.
• F_16: The azimuth angle of the hadronic tau.
• F_17: The transverse momentum $\sqrt{p_x^2 + p_y^2}$ of the lepton (electron or muon).

Each event is labeled as signal or background. Since the aim of this paper is an unsupervised classification, the event label is not used as an input feature to the clustering classifier, but as validation information. To avoid bias toward features with a wider range, all the considered features are normalized.

Methodology
The following subsections present an overview of the clustering algorithms and metaheuristic techniques as well as some validation indices and performance metrics used in this work.

Clustering Algorithms
Clustering algorithms can be categorized into partition-based, hierarchical-based, density-based, grid-based, fuzzy-based and model-based algorithms [64][65][66][67][68][69]. These algorithms are used to measure the level of resemblance within and between clusters.

K-Means Algorithm
The K-means (KM) clustering algorithm [51] is a well-known clustering method that divides samples into K clusters by updating the cluster centers iteratively until the convergence criterion is met. Elements with the highest resemblance are allocated to the same group, and those with reduced resemblance are allocated to distinct groups [31].

Algorithm 1: KM Algorithm
1: Initialize a set of K items as first centroids
2: Repeat
3: Structure K clusters by allocating all items to the adjacent centroid
4: Recompute every cluster's centroid
5: Until no cluster position changes
6: Return K clusters positions

The effectiveness of the K-means clustering algorithm is highly dependent on the randomly chosen initial cluster centroids. A few options have been suggested to fix this initialization issue; one of these techniques is Global K-means clustering.
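For illustration, a minimal Python sketch of Algorithm 1 follows; it is a didactic implementation on a synthetic data matrix, not the exact code used in this work, and all names are illustrative.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means following Algorithm 1; X is an (N, d) data matrix."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # step 1
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # step 3: allocate every item to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute every cluster's centroid (keep it if the cluster is empty)
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members):
                new_centroids[k] = members.mean(axis=0)
        # step 5: stop when no centroid moves any more
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# toy usage on two synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, labels = kmeans(X, K=2)
```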

Global K-Means Algorithm
The global K-means (GlobalKM) clustering is an efficient global clustering technique whose objective is to minimize the clustering error, using the K-means method as a local search strategy. It is an incremental clustering method that adds one cluster at a time by applying N (the data size) executions of the K-means algorithm as a global search process based on non-random initial positions [52,53].
Algorithm 2: GlobalKM Algorithm
1: Define the first center as the centroid of the whole data set
2: i = 2
3: repeat
4: For n = 1 to N
5: Set the data point x_n as a candidate i-th center
6: Run the KM algorithm (Algorithm 1) initialized with the i-1 previously found centers and x_n
7: Compute the clustering error (root square quantity) of the resulting partition
8: End for
9: Memorize the C_i cluster that optimizes the root square quantity as the i-th center
10: i = i + 1
11: until i >= K
12: Apply the KM algorithm (Algorithm 1) to the identified clusters
13: Return K clusters positions
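The incremental GlobalKM loop can be sketched as follows; `kmeans_fn` is a placeholder for any K-means routine (e.g., the sketch above, adapted to accept initial centers and return the clustering error).

```python
import numpy as np

def global_kmeans(X, K, kmeans_fn):
    """Incremental GlobalKM: grow the solution from 1 to K clusters, trying
    every data point as the initial position of the newly added center.
    kmeans_fn(X, init) must run K-means from the given initial centers and
    return (centers, clustering_error)."""
    centers = X.mean(axis=0, keepdims=True)        # the 1-cluster solution
    for _ in range(2, K + 1):
        best_err, best_centers = np.inf, None
        for x in X:                                # N candidate positions
            init = np.vstack([centers, x])
            cand, err = kmeans_fn(X, init)
            if err < best_err:
                best_err, best_centers = err, cand
        centers = best_centers
    return centers
```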

K-Medoids Algorithm
K-medoids [54][55][56] is a variant of K-means in which each cluster is represented by an actual data point (the medoid), namely the point closest to the cluster centroid. Compared with K-means, it is robust to noise and anomalies, because K-medoids minimizes the sum of dissimilarities between the cluster medoid and the data points belonging to it, instead of the sum of squared Euclidean distances.
Several approaches using the K-medoids method have been proposed, including those suggesting novel multi-centroid, multi-run sampling schemes [55] and new search strategies for efficient K-medoids-based algorithms [56].
The most prevalent technique of K-medoids clustering is the Partitioning Around Medoids (PAM) algorithm [31,70]. This is also the algorithm that was adopted in our work as suggested below (Algorithm 3).
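A simplified K-medoids sketch in the spirit of PAM follows (a medoid-update iteration on a precomputed dissimilarity matrix); it is an illustration only, not the full PAM procedure referenced as Algorithm 3.

```python
import numpy as np

def k_medoids(D, K, max_iter=50, seed=0):
    """Greedy K-medoids on a precomputed (N, N) dissimilarity matrix D;
    each medoid minimizes the sum of dissimilarities inside its cluster."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(D), size=K, replace=False))
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)       # assign to nearest medoid
        new_medoids = []
        for k in range(K):
            members = np.where(labels == k)[0]
            if members.size == 0:                   # keep the old medoid if empty
                new_medoids.append(medoids[k])
                continue
            within = D[np.ix_(members, members)].sum(axis=0)
            new_medoids.append(int(members[within.argmin()]))
        if set(new_medoids) == set(medoids):        # converged: medoids stable
            break
        medoids = new_medoids
    return medoids
```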

Fuzzy C-Means Algorithm
The core idea of Fuzzy c-means (FCM) is to define the membership of each element to each cluster through the optimization of the objective function. The membership summation of every element of the data needs to be equal to 1 [31,71-79].
The cluster centroids are revised after each iteration based on Eq. (2):

$$c_j = \frac{\sum_{i=1}^{N} \mu_{ij}^m \, x_i}{\sum_{i=1}^{N} \mu_{ij}^m} \tag{2}$$

where:
• N is the number of data points.
• c_j corresponds to the j-th cluster center, with j ∈ {1, . . . , K} and K the number of cluster centers.
• m is the fuzziness index, with m ∈ [1, ∞[.
• μ_ij is the membership of the i-th data point to the j-th cluster center, calculated using Eq. (3):

$$\mu_{ij} = \left[ \sum_{k=1}^{K} \left( \frac{d_{ij}}{d_{ik}} \right)^{\frac{2}{m-1}} \right]^{-1} \tag{3}$$

• d_ij is the Euclidean distance between the i-th data point and the j-th cluster center, given by Eq. (4):

$$d_{ij} = \lVert x_i - c_j \rVert \tag{4}$$

The FCM algorithm's primary goal is to minimize the objective function J detailed in Eq. (5):

$$J = \sum_{i=1}^{N} \sum_{j=1}^{K} \mu_{ij}^m \, d_{ij}^2 \tag{5}$$

The FCM algorithm is given below (Algorithm 4). It terminates when the objective function J can no longer be minimized or when the change in the cluster center positions becomes very small, where:
• i is the iteration step,
• β is the termination criterion, belonging to [0, 1],
• U = [μ_ij]_{N×K} is the fuzzy membership matrix.
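A compact Python sketch of this alternating loop follows, implementing Eqs. (2)-(5); the default parameter values are illustrative.

```python
import numpy as np

def fcm(X, K, m=2.0, beta=1e-5, max_iter=100, seed=0):
    """Fuzzy c-means: alternate the center update (Eq. (2)) and the
    membership update (Eq. (3)) until J (Eq. (5)) stops improving."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), K))
    U /= U.sum(axis=1, keepdims=True)       # memberships of each point sum to 1
    J_prev = np.inf
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]                     # Eq. (2)
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12  # Eq. (4)
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)  # Eq. (3)
        J = ((U ** m) * d ** 2).sum()                                      # Eq. (5)
        if abs(J_prev - J) < beta:          # termination criterion beta
            break
        J_prev = J
    return centers, U
```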

Gustafson-Kessel Algorithm
Gustafson and Kessel were the first to propose the Gustafson-Kessel (GK) fuzzy clustering algorithm [57,58]. It associates each cluster with both its centroid and its covariance matrix. Unlike the FCM algorithm, which assumes spherical clusters, the GK algorithm is not subject to this restriction and can identify ellipsoidal clusters [31,58].
The steps of the GK algorithm are given in Algorithm 5.
• U = [μ_ik]_{N×K} is the fuzzy membership matrix. The objective function to be minimized in this algorithm is given in Eq. (10):

$$J = \sum_{i=1}^{K} \sum_{k=1}^{N} \mu_{ik}^m \, (x_k - v_i)^T A_i \, (x_k - v_i) \tag{10}$$

where v_i is the i-th cluster center and A_i is the norm-inducing matrix of cluster i, computed from the fuzzy covariance matrix F_i as A_i = [ρ_i det(F_i)]^{1/n} F_i^{-1}, with n the data dimension and ρ_i the cluster volume (usually 1).
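A compact sketch of the GK iteration follows, with the fuzzy covariance F_i and the norm-inducing matrix A_i computed per cluster; the regularization details of production implementations are omitted and all defaults are illustrative.

```python
import numpy as np

def gk(X, K, m=2.0, beta=1e-5, max_iter=100, seed=0):
    """Gustafson-Kessel sketch: FCM-like updates, but every cluster carries a
    norm-inducing matrix A_i built from its fuzzy covariance F_i (Eq. (10)),
    so ellipsoidal clusters can be recovered."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    rho = np.ones(K)                                  # cluster volumes, usually 1
    U = rng.random((N, K))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]      # cluster centers
        D2 = np.empty((N, K))
        for i in range(K):
            diff = X - V[i]
            F = (Um[:, i, None, None] *
                 np.einsum('nd,ne->nde', diff, diff)).sum(axis=0) / Um[:, i].sum()
            A = (rho[i] * np.linalg.det(F)) ** (1.0 / n) * np.linalg.inv(F)
            D2[:, i] = np.einsum('nd,de,ne->n', diff, A, diff)  # Mahalanobis-like
        d = np.sqrt(np.maximum(D2, 1e-12))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
        if np.abs(U_new - U).max() < beta:
            U = U_new
            break
        U = U_new
    return V, U
```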

Gath-Geva Algorithm
Gath and Geva generalized fuzzy clustering using maximum likelihood estimation. The fuzzy maximum likelihood estimates (FMLE) clustering algorithm is based on the FMLE distance norm [31,58]. This distance norm decreases faster than the one used in the GK algorithm: the distance norm of the Gath-Geva (GG) algorithm includes an exponential term, which implies its faster decrease. As a result, the GG algorithm tends to converge to the nearest local optimum, which can be addressed with an efficient initialization [31,58].

Genetic Algorithm
GA Principle. The genetic algorithm (GA) is a search heuristic inspired by Charles Darwin's theory of natural evolution [26].
It has three main operators: selection, crossover and mutation, and starts iterating from an initial population.
A GA process is initiated with a random set of individuals, considered as candidate solutions to the problem to solve, called a population. Each individual, or chromosome, is identified by a set of joined parameters called genes and is evaluated and assigned a fitness value. In the selection procedure, a criterion is applied to select a certain number of strings, namely parents, from this population according to their fitness values. Strings with higher fitness values have more opportunities to be selected for reproduction in the next step. This paper uses a rank selection scheme that allows control of the selection pressure, denoted s_p.
The most significant phase of the genetic algorithm is crossover, where the fittest individuals are selected to create the offspring of the next generation, mirroring the process of natural selection [26,60]. This iterative process ends when a generation with the fittest individuals is found. The probability that a selected individual goes through a crossover process is denoted p_c.
Some of the genes of the formed offspring can be subjected to a mutation with a low probability p_m. Mutation prevents premature convergence and maintains diversity within the population [25,59,60].
The algorithm generally terminates when the population does not produce offspring significantly different from the previous generation.
GA operates on a population of n_pop potential solutions. The population at time t is represented by the time-dependent variable S(t), with the initial population of random estimates being S(0). Algorithm 6 shows the GA structure.

GA Based Clustering. In the GA based clustering applications (Sections 4.4 and 5), the fitness S(t) cited in Step 3 of the GA algorithm corresponds to the optimal distance of the considered clustering algorithm. The GA end criterion is satisfied in this work when the maximum number of iterations is reached.
By using the previous clustering algorithms (defined in Section 3.1), we obtain the following GA based algorithms: GAKM, GAGlobalKM, GAPAM, GAFCM, GAGK and GAGG. The results of GA and clustering combinations are listed and analyzed in Section 5.
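A hedged sketch of such a GA based clustering follows: a chromosome encodes the K cluster centers as one real-valued vector, the fitness is the clustering cost, and generic rank selection, one-point crossover and uniform mutation operators stand in for the exact operators of [26]; all names and defaults are illustrative.

```python
import numpy as np

def clustering_cost(X, centers):
    """Sum of distances between each point and its closest cluster center."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.min(axis=1).sum()

def ga_clustering(X, K, n_pop=30, n_iter=100, p_c=0.8, p_m=0.05, s_p=1.5, seed=0):
    """GA over flattened center vectors; lower clustering cost = fitter."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    lo, hi = np.tile(X.min(0), K), np.tile(X.max(0), K)
    pop = rng.uniform(lo, hi, size=(n_pop, K * d))          # initial population
    for _ in range(n_iter):
        costs = np.array([clustering_cost(X, ind.reshape(K, d)) for ind in pop])
        order = costs.argsort()                             # best individual first
        ranks = np.empty(n_pop); ranks[order] = np.arange(n_pop)
        # linear rank selection with selection pressure s_p (best rank -> highest prob)
        probs = 2.0 - s_p + 2.0 * (s_p - 1.0) * (n_pop - 1 - ranks) / (n_pop - 1)
        probs /= probs.sum()
        parents = pop[rng.choice(n_pop, size=n_pop, p=probs)]
        children = parents.copy()
        for i in range(0, n_pop - 1, 2):                    # one-point crossover
            if rng.random() < p_c:
                cut = rng.integers(1, K * d)
                children[i, cut:], children[i + 1, cut:] = \
                    parents[i + 1, cut:].copy(), parents[i, cut:].copy()
        mut = rng.random(children.shape) < p_m              # uniform mutation
        children[mut] = rng.uniform(lo, hi, size=children.shape)[mut]
        children[0] = pop[order[0]]                         # elitism
        pop = children
    costs = np.array([clustering_cost(X, ind.reshape(K, d)) for ind in pop])
    return pop[costs.argmin()].reshape(K, d)
```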

Particle Swarm Optimization
PSO Principle. Particle swarm optimization (PSO) is a population-based stochastic optimization process [61]. It has been applied effectively in many fields, such as system control, function optimization, artificial neural network training and other areas.

A PSO process is initialized with a random population of solutions. The prospective solutions, called particles, move through the problem space by following the current optimal solutions [24,60,61].
Every particle has a fitness value, evaluated by the objective function, and a velocity which directs its flight. At every iteration, the particle swarm optimizer updates each particle's velocity and position using the two best attributes found so far: the best solution the particle has reached, named P_best, and the global best value, named G_best, as presented in Eqs. (11) and (12):

$$V_p(t+1) = w \, V_p(t) + c_1 r_1 \left( P_{best} - C_p(t) \right) + c_2 r_2 \left( G_{best} - C_p(t) \right) \tag{11}$$

$$C_p(t+1) = C_p(t) + V_p(t+1) \tag{12}$$

where:
• C_p is the position of a particle and V_p its velocity.
• r_1 and r_2 are random numbers drawn uniformly from [0, 1].
• c_1 and c_2 are constants labeled acceleration learning variables, representing the weighting of the stochastic terms pulling each particle toward the P_best and G_best positions respectively.

The inertia weight w is updated at each iteration using Eq. (13):

$$w(t+1) = w(t) \cdot w_{damp} \tag{13}$$

where w_damp is the inertia weight damping ratio and t the current iteration.

PSO Based Clustering. In this work, the fitness cited in Step 3 of the PSO algorithm corresponds to the optimal distance of each clustering algorithm defined in Section 3.1. The PSO end criterion is satisfied when the maximum number of iterations is reached.
The results of PSO based clustering combinations giving the PSOKM, PSOGlobalKM, PSOPAM, PSOFCM, PSOGK and PSOGG algorithms are listed and analyzed in Section 5.
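A minimal PSO based clustering sketch following Eqs. (11)-(13) is given below; each particle encodes the K cluster centers, and the parameter values are illustrative defaults rather than the tuned values of Tab. 6.

```python
import numpy as np

def pso_clustering(X, K, n_particles=30, n_iter=100,
                   w=0.9, w_damp=0.99, c1=2.0, c2=2.0, seed=0):
    """Each particle encodes K cluster centers; the fitness is the sum of
    point-to-nearest-center distances, minimized by the swarm."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    cost = lambda c: np.linalg.norm(X[:, None] - c[None], axis=2).min(axis=1).sum()
    pos = rng.uniform(lo, hi, size=(n_particles, K, X.shape[1]))  # positions C_p
    vel = np.zeros_like(pos)                                      # velocities V_p
    pbest, pbest_cost = pos.copy(), np.array([cost(p) for p in pos])
    g = pbest_cost.argmin()
    gbest, gbest_cost = pbest[g].copy(), pbest_cost[g]
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)  # Eq. (11)
        pos = pos + vel                                                    # Eq. (12)
        w *= w_damp                                                        # Eq. (13)
        for i, p in enumerate(pos):
            c = cost(p)
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = p.copy(), c
                if c < gbest_cost:
                    gbest, gbest_cost = p.copy(), c
    return gbest, gbest_cost
```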

Ant Lion Optimization
The Ant lion optimization (ALO) algorithm [64] is based on the hunting mechanism of antlions. It combines random walk exploration with random agent selection, following five main hunting steps: random walk of ants, building traps, entrapment of ants in traps, catching prey and rebuilding traps. The roulette wheel operator and the random walks of ants help the ALO optimizer escape local optima.
The random walk of ants is given by Eq. (14):

$$X(t) = \left[ 0, \, \text{cumsum}(2r(t_1) - 1), \, \text{cumsum}(2r(t_2) - 1), \, \ldots, \, \text{cumsum}(2r(t_n) - 1) \right] \tag{14}$$

where cumsum calculates the cumulative sum, n is the maximum number of iterations, t is a step of the random walk, and r(t) is the stochastic function of Eq. (15):

$$r(t) = \begin{cases} 1 & \text{if } rand > 0.5 \\ 0 & \text{if } rand \le 0.5 \end{cases} \tag{15}$$

with rand a random number generated uniformly in [0, 1]. The random walk is kept within the search space through the min-max normalization of Eq. (16):

$$X_i^t = \frac{(X_i^t - a_i)(d_i^t - c_i^t)}{b_i - a_i} + c_i^t \tag{16}$$

where a_i and b_i are the minimum and maximum of the random walk of the i-th variable, and c_i^t and d_i^t are the minimum and maximum of the i-th variable at the t-th iteration. The mathematical equations for trapping ants in the antlions' pits are given by Eqs. (17) and (18):

$$c_i^t = Antlion_j^t + c^t \tag{17}$$

$$d_i^t = Antlion_j^t + d^t \tag{18}$$

where c^t and d^t are respectively the minimum and the maximum of all variables at the t-th iteration, and Antlion_j^t is the position of the selected j-th antlion. Antlions shoot sand outward to slide ants into their pits. This behavior is modeled by adaptively shrinking the radius of the ants' random walks, as in Eqs. (19) and (20):

$$c^t = \frac{c^t}{I}, \qquad d^t = \frac{d^t}{I} \tag{19, 20}$$

where $I = 10^w \, \frac{t}{T}$, t is the current iteration, T is the maximum number of iterations and w is a constant that increases along the iterations. In the final phase of the antlions' hunting behavior, an ant that reaches the bottom of a pit is caught, and the antlion updates its position to the position of the hunted ant, as in Eq. (21):

$$Antlion_j^t = Ant_i^t \quad \text{if} \quad f(Ant_i^t) > f(Antlion_j^t) \tag{21}$$

It is essential to keep the best solution obtained during the optimization (elitism). Every ant therefore walks around both an antlion selected by the roulette wheel and the elite, as in Eq. (22):

$$Ant_i^t = \frac{R_A^t + R_E^t}{2} \tag{22}$$

where R_A^t is the random walk around the antlion selected by the roulette wheel at the t-th iteration, R_E^t is the random walk around the elite at the t-th iteration, and Ant_i^t indicates the position of the i-th ant at the t-th iteration.
The detailed ALO algorithm is described in Algorithm 8; its main loop repeats the steps above until the end criterion is satisfied.
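A condensed Python sketch of the ALO loop follows; the shrinking ratio I and the roulette weighting are simplified relative to Algorithm 8, and all names and defaults are illustrative. In the hybrid setting of Section 4.3, `fitness` would evaluate the considered clustering objective on a candidate matrix of 2 cluster centers, so each antlion plays the role of a cluster prototype.

```python
import numpy as np

def random_walk(n_steps, lo, hi, rng):
    """Bounded random walk (Eqs. (14)-(16)): cumulative sum of +/-1 steps,
    min-max normalized into the interval [lo, hi]."""
    steps = np.where(rng.random(n_steps) > 0.5, 1.0, -1.0)
    walk = np.concatenate([[0.0], np.cumsum(steps)])
    a, b = walk.min(), walk.max()
    return lo + (walk - a) * (hi - lo) / (b - a + 1e-12)

def alo(fitness, dim, lo, hi, n_ants=20, n_iter=100, seed=0):
    """Simplified Ant Lion Optimizer minimizing `fitness` over [lo, hi]^dim."""
    rng = np.random.default_rng(seed)
    antlions = rng.uniform(lo, hi, (n_ants, dim))
    f_lions = np.array([fitness(a) for a in antlions])
    elite, f_elite = antlions[f_lions.argmin()].copy(), f_lions.min()
    for t in range(1, n_iter + 1):
        I = 1.0 + 10.0 * t / n_iter                 # shrinking ratio (simplified)
        c, d = lo / I, hi / I                       # Eqs. (19)-(20): shrink bounds
        # roulette wheel favours fitter (lower-cost) antlions
        w = 1.0 / (f_lions - f_lions.min() + 1e-9)
        p = w / w.sum()
        ants = np.empty((n_ants, dim))
        for i in range(n_ants):
            j = rng.choice(n_ants, p=p)
            for k in range(dim):
                # Eq. (22): average of walks around the chosen antlion and the elite
                RA = random_walk(n_iter, antlions[j, k] + c, antlions[j, k] + d, rng)[t]
                RE = random_walk(n_iter, elite[k] + c, elite[k] + d, rng)[t]
                ants[i, k] = 0.5 * (RA + RE)
        ants = np.clip(ants, lo, hi)
        f_ants = np.array([fitness(a) for a in ants])
        better = f_ants < f_lions                   # Eq. (21): catching prey
        antlions[better], f_lions[better] = ants[better], f_ants[better]
        if f_lions.min() < f_elite:                 # keep the elite
            elite, f_elite = antlions[f_lions.argmin()].copy(), f_lions.min()
    return elite, f_elite
```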

Cluster Validation Indices and Performance Metrics
Cluster validity indexes can be used to validate the efficiency of a clustering method and to evaluate the fitness of its data partitions. Usually these effectiveness indicators are independent of the clustering algorithms. Several cluster validity indexes have been suggested in the literature. The partition coefficient (PC) was the first suggested cluster validity index. Subsequently, partition entropy (PE) was proposed as a normalization of PC. The separation coefficient (SC) was the first validity index to take the geometrical properties of the data into consideration.

Partition Coefficient (PC)
The partition coefficient, in Eq. (23), is defined from the squared Frobenius norm of the membership matrix, divided by the data size [31]:

$$PC = \frac{1}{N} \sum_{i=1}^{K} \sum_{j=1}^{N} \mu_{ij}^2 \tag{23}$$

where μ_ij is the membership of data point j in cluster i. The optimum cluster number corresponds to PC's highest value, which reflects the clusters' compactness. This value lies between 0 and 1 [31].

Classification Entropy (CE)
Classification entropy, Eq. (24) [31], like the partition coefficient, measures the fuzziness of the cluster partition; both are computed only from the membership matrix components:

$$CE = -\frac{1}{N} \sum_{i=1}^{K} \sum_{j=1}^{N} \mu_{ij} \log(\mu_{ij}) \tag{24}$$

The clustering is considered efficient when the PC index tends to 1 and the CE index tends to 0.
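Both indices follow directly from the membership matrix, as the short sketch below shows; here U is assumed to be the N × K membership matrix with rows summing to 1.

```python
import numpy as np

def partition_coefficient(U):
    """PC, Eq. (23): mean squared membership; U is (N, K) with rows summing to 1."""
    return (U ** 2).sum() / len(U)

def classification_entropy(U, eps=1e-12):
    """CE, Eq. (24): average membership entropy; ~0 for a crisp partition."""
    return -(U * np.log(U + eps)).sum() / len(U)

# A crisp partition gives PC = 1 and CE ~ 0; maximal fuzziness gives PC = 1/K.
rng = np.random.default_rng(0)
U_crisp = np.eye(2)[rng.integers(0, 2, size=100)]
print(partition_coefficient(U_crisp), classification_entropy(U_crisp))
```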

Partition Index (PI)
The partition index, in Eq. (25), is the ratio of the sum of the clusters' compactness to their separation [31]:

$$PI = \sum_{i=1}^{K} \frac{\sum_{j=1}^{N} \mu_{ij}^m \, \lVert x_j - v_i \rVert^2}{N_i \sum_{k=1}^{K} \lVert v_k - v_i \rVert^2} \tag{25}$$

where v_i is the center of cluster i and N_i its fuzzy cardinality. A lower PI value indicates a better partition.

Separation Index (S)
In contrast to the partition index (PI), the separation index S, Eq. (26), uses a minimum-distance separation [31]:

$$S = \frac{\sum_{i=1}^{K} \sum_{j=1}^{N} \mu_{ij}^2 \, \lVert x_j - v_i \rVert^2}{N \min_{i \neq k} \lVert v_k - v_i \rVert^2} \tag{26}$$

The partition is considered optimal when the separation index S is small.

Xie-Beni Index (XB)
The Xie-Beni index is a fuzzy clustering validity measure that can also be applied to crisp clustering. It is defined as the ratio between the average quadratic error and the minimum squared distance between the cluster elements, as given in Eq. (27) [31]:

$$XB = \frac{\sum_{i=1}^{K} \sum_{j=1}^{N} \mu_{ij}^m \, \lVert x_j - v_i \rVert^2}{N \min_{i,j} \lVert x_j - v_i \rVert^2} \tag{27}$$

The optimum fuzzy partition is attained by minimizing XB with respect to c = 2, . . . , c_max.

Dunn's Index (DI)
Dunn proposed a crisp clustering efficiency index for the identification of compact and well-separated clusters. Dunn's index DI is defined by Eq. (28) [31]:

$$DI = \min_{i \in \{1,\ldots,K\}} \left\{ \min_{j \neq i} \left\{ \frac{\min_{x \in C_i, \, y \in C_j} d(x, y)}{\max_{k} \max_{x, y \in C_k} d(x, y)} \right\} \right\} \tag{28}$$

where d is a distance function and C_i the set of elements allocated to the i-th cluster.

Alternative Dunn's Index (ADI)
The alternative Dunn's index is a modification of the original Dunn's index, Eq. (30). The computation becomes simpler when the dissimilarity between two clusters, min_{x∈C_i, y∈C_j} d(x, y), is bounded from below using the triangle inequality:

$$\min_{x \in C_i, \, y \in C_j} d(x, y) \ \ge \ \min_{x \in C_i, \, y \in C_j} \lvert d(y, v_j) - d(x, v_j) \rvert$$

where v_j is the cluster center of the j-th cluster.

Efficiency and Purity
To measure the efficiency of the used algorithms, external criteria that evaluate how well the clustering matches the actual classes are computed.

These performance parameters are the efficiency γ_i and the purity β_i of the classifications. They are calculated from the confusion matrix N(N_ij), N_ij being the number of events of genuine class C_i classified as class C_j (Tab. 1). For each class C_i, we have Eqs. (31) and (32):

$$\gamma_i = \frac{N_{ii}}{\sum_j N_{ij}} \tag{31} \qquad \beta_i = \frac{N_{ii}}{\sum_j N_{ji}} \tag{32}$$

The global performance parameters are then calculated as the class-size-weighted averages given in Eqs. (33) and (34):

$$\gamma = \sum_i \frac{N_i}{N} \, \gamma_i \tag{33} \qquad \beta = \sum_i \frac{N_i}{N} \, \beta_i \tag{34}$$

where N_i = Σ_j N_ij is the number of events of genuine class C_i and N the total number of events.
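A short sketch of Eqs. (31)-(34) computed from a confusion matrix follows; the class-size weighting of the global averages is our assumption, and the example counts are illustrative (consistent with the training class sizes of Section 2, not actual results).

```python
import numpy as np

def efficiency_purity(conf):
    """conf[i, j] = number of events of genuine class C_i assigned to class C_j.
    Returns per-class efficiencies gamma_i, per-class purities beta_i and their
    class-size-weighted global averages (the weighting is an assumption)."""
    conf = np.asarray(conf, dtype=float)
    gamma = np.diag(conf) / conf.sum(axis=1)   # correct / size of genuine class
    beta = np.diag(conf) / conf.sum(axis=0)    # correct / size of assigned class
    w = conf.sum(axis=1) / conf.sum()
    return gamma, beta, (w * gamma).sum(), (w * beta).sum()

# illustrative counts only
conf = np.array([[30000,  4170],    # H events: correctly vs. wrongly clustered
                 [ 6000, 59830]])   # B events
print(efficiency_purity(conf))
```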

Proposed Approach
As previously introduced, the purpose of this work is to improve the classification of events into Higgs boson signal or background using the hybridization of fuzzy and hard semi-supervised clustering algorithms tuned with ALO. To this end, this work is based on 4 major processes:
• an improved feature engineering approach,
• the definition of the optimal number of clusters,
• a new hybridization of fuzzy and hard clustering algorithms tuned with ALO, and
• PSO and GA based clustering used for a comparative analysis.
The details of each process are presented in the next sections.

Features Ranking Using T-Statistic Method and PSOKM
This feature selection method utilizes the t-Statistic [80,81], where each element can be categorized either into the class of signals C_1 or the class of backgrounds C_2. For each feature F_i, PSOKM (the KM algorithm optimized using PSO) is applied on the data set (executed 50 times) to define the two clusters, and the t-Statistic is calculated as in Eq. (35):

$$t_i = \frac{\mu_{i1} - \mu_{i2}}{\sqrt{\dfrac{\sigma_{i1}^2}{n_1} + \dfrac{\sigma_{i2}^2}{n_2}}} \tag{35}$$

where μ_ij and σ_ij denote respectively the mean and the standard deviation of the i-th feature F_i for class C_j, j = {1, 2}, using the PSOKM algorithm, and n_1 and n_2 are the numbers of elements in C_1 and C_2.
The p-value is defined as the likelihood of obtaining a result at least as extreme as the observed one. When the p-value tends to 1, no distinction between the groups beyond chance is suggested. If the p-value is close to 0, the observed distinction is unlikely to be due to chance [81].
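A sketch of this ranking step follows, using Welch's two-sample t-statistic and its p-value from SciPy; the `labels` array stands in for the PSOKM cluster assignments, and the data matrix is a synthetic placeholder.

```python
import numpy as np
from scipy import stats

# X is an (N, F) feature matrix; `labels` stands in for the PSOKM cluster
# assignments (0 for C1, 1 for C2) -- placeholders, not HiggsML data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 17))
labels = rng.integers(0, 2, size=1000)

def t_statistic(x, labels):
    """Welch two-sample t-statistic of one feature between the two clusters."""
    x1, x2 = x[labels == 0], x[labels == 1]
    return (x1.mean() - x2.mean()) / np.sqrt(
        x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))

# rank the features by ascending p-value: most discriminant first
pvals = [stats.ttest_ind(X[labels == 0, f], X[labels == 1, f],
                         equal_var=False).pvalue for f in range(X.shape[1])]
ranking = np.argsort(pvals)
```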
The t-statistic values and p-values are reported in Tab. 2 in ascending order of the p-value. As shown in Fig. 1, the second step excludes the features that are highly correlated, which ensures diversified discriminant information about the events. In order to identify the correlation between these ranked features, the Self-Organizing Feature Map (SOM) technique [82][83][84][85][86][87][88][89][90] is used, by plotting the weight planes that show the values in each map unit for each variable. The SOM technique is a non-linear mapping algorithm that aims to achieve a low-dimensional representation (usually 2D or 3D) of a set of points dispersed in a high-dimensional pattern space. It keeps the Euclidean distances between the points on the map as close as possible to the Euclidean distances between the respective points in the high-dimensional pattern space [82]. Fig. 1 presents the differences regarding the input variables: lighter and darker colors represent respectively larger and smaller weights associated with each feature. If the patterns of two inputs are very close, they can be claimed to be strongly correlated [83].
The sets of features which are highly correlated, and therefore very similar, are then identified from these weight planes. The SOM parameter tuning that gave the most efficient results corresponds to a dimension of 20 × 20 and a number of epochs equal to 200.

Features Dimensionality Reduction Using PSOKM
The main idea of this section is to combine the softly correlated features, which reduces the dimensionality of the feature space. Keeping this dimensionality as low as possible is very important, since the volume of the feature space grows exponentially with each dimension, making the data set exponentially sparser. In addition, the computing time of the algorithms grows strongly with the number of dimensions.
To define the best number of features to use from the list of attributes in the previous section, the PSOKM algorithm is used. An experiment was conducted on the dataset to evaluate the efficiency γ_i and the purity β_i of the classification for each batch of features; the results are reported in Tab. 3. This work will then use the 10 features that were proven to be the most informative (F_1, F_18, F_23, F_15, F_24, F_30, F_17, F_14, F_10, F_8) as the system input matrix, to which an efficient comparative classification will be applied.

The Optimal Number of Clusters: Elbow Criterion
To find the optimal number of clusters, this work uses the elbow criterion [63]. It consists of applying clustering to the data for different numbers of clusters and validating the correctness of the obtained results. The best number of clusters corresponds to the last one that still adds sufficient information.
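The elbow procedure can be sketched as follows; `cluster_fn` is a placeholder for any of the clustering routines above, and the cost recorded here is the within-cluster sum of squares.

```python
import numpy as np

def elbow_curve(X, cluster_fn, k_range=range(2, 8)):
    """Run the clusterer for each k and record the within-cluster cost;
    the elbow is the last k after which the decrease flattens out."""
    costs = []
    for k in k_range:
        centers, labels = cluster_fn(X, k)
        d = np.linalg.norm(X - centers[labels], axis=1)
        costs.append(float((d ** 2).sum()))
    return list(k_range), costs

# usage with the K-means sketch of Section 3.1:
# ks, costs = elbow_curve(X, lambda X, k: kmeans(X, k))
```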
To this end, the most informative feature values for the Higgs boson signals, as defined in Section 4.1.3, are used as input for the clustering algorithms K-medoids (PAM) and Fuzzy c-means (FCM). To validate the efficiency of these clustering methods and evaluate the data partition fitness at each iteration, some common cluster validity indexes (PC, CE, PI, S, XB, DI and ADI, defined in Section 3.3) are used.
Several runs are carried out with the number of clusters ranging from 2 to 7.
Tab. 4 presents the validation index values calculated for each run using the hard K-medoids (PAM) algorithm, each run corresponding to a specific number of clusters (between 2 and 7). Tab. 5 shows the results found using the FCM algorithm [31]. In Tab. 4, the PI and S indexes using PAM suggest that the number of clusters could be set to 2 and 3 respectively. The XB index is infinity (Inf) for PAM, as reported in Tab. 4; this reflects an overflow leading to a value too large to be represented as a conventional floating-point number. All XB index values were equal to infinity, which may be caused by an initialization problem of the hard PAM clustering. The Dunn's and alternative Dunn's index values in Tab. 4 indicate that the adequate number of clusters should be equal to 4 and 2 respectively. According to this analysis, 2 clusters achieve the highest information partitioning of the data. According to the FCM results of Tab. 5, the PC and CE indexes reach the values 1 and 0 respectively when the number of clusters is set to 2, meaning that the FCM based clustering is efficient with 2 clusters. The FCM results in Tab. 5 give further information about the optimal number of clusters through the PI and S indexes, whose local minimum is reached when the number of clusters is 3. For the XB index using the FCM method, it is difficult to find the optimal number of clusters (Tab. 5): either 3 or 6 can be seen as an elbow. The Dunn's and alternative Dunn's indexes have an elbow when the number of clusters is equal to 2. The experimental results based on the hard PAM and the FCM algorithms thus show that the elbow corresponds to 2 clusters.

Proposed Hybrid Clustering Tuned with ALO
Since ALO has the advantages of fast convergence, local optima avoidance, high efficiency and improved exploration of the search space, and in order to enhance the ability of clustering to find the global minimum, a hybrid clustering approach is proposed. Fig. 2 presents the flowchart of the proposed algorithms.
In the proposed approach, the objective function to minimize corresponds to the clustering algorithms used: KM, GlobalKM, PAM, FCM, GK and GG. Each one of the considered clustering algorithms is hybridized and tuned with the Ant Lion Optimizer giving respectively the ALOKM, ALOGlobalKM, ALOPAM, ALOFCM, ALOGK and ALOGG Algorithms.
Based on Section 4.2, the number of clusters to find for the considered data set is 2 clusters. The aim of clustering in our case is to assign the Higgs data points to two clusters where each one groups the data points with similar characteristics. The 10 most informative features selected in Section 4.1.3 are used.
In our approach, each cluster is considered as an antlion, and the elite of the ALO algorithm is a matrix of 2 antlions that best optimizes the fuzzy and hard clustering algorithms.

In the first phase of the proposed approach, the best 2-antlion solution, which best optimizes the considered hard or fuzzy clustering algorithm, is defined. The positions of the initial population of ants are then updated accordingly. Thereafter, the fitness of each 2-ant element is calculated, and the fittest 2-ant element is compared to the elite in order to retain the best 2-cluster element. Many iterations are carried out in order to return the best results. The details of the proposed hybrid clustering tuned with ALO are presented in Algorithm 9.

Hybrid Clustering Combinations with GA and PSO
The performance of the proposed approach defined in Section 4.3 was compared to the hybridization of GA and PSO with the same considered clustering algorithms (KM, GlobalKM, PAM, FCM, GK and GG) giving respectively the GAKM, GAGlobalKM, GAPAM, GAFCM, GAGK, GAGG, PSOKM, PSOGlobalKM, PSOPAM, PSOFCM, PSOGK and PSOGG algorithms. These GA and PSO clustering combinations were used and applied to the field of Higgs boson for the first time in this work.
The 10 most informative features selected in Section 4.1.3 are used as an input to those algorithms in order to assign the Higgs data points to two groups with similar characteristics.
Tab. 6 summarizes the offline parameter tuning of GA and PSO that gives the best efficiency for the considered clustering algorithm combinations.

Based on the confusion matrix (Tab. 7), as a first layer of analysis, we can qualify a method as better than the others when N_11 (the number of signals found using the unsupervised clustering) tends to N_1 and N_22 (the number of backgrounds found using the unsupervised clustering) tends to N_2. In order to validate this first layer of analysis, and based on Eqs. (33) and (34), Tab. 8 lists the average efficiency and purity values and the standard deviations found over the 50 runs of each clustering algorithm combination. Based on this second level of analysis, it can be clearly seen that the results are globally in favor of the hybridization of fuzzy clustering (either Fuzzy c-means, Gath-Geva or Gustafson-Kessel) tuned with ALO, GA and PSO. No single fuzzy clustering algorithm worked best in all these combinations; the efficiencies of ALOGK, GA based FC-means and PSO based FC-means are the best. Moreover, based on Tab. 8, we can confirm the conclusion of the first layer analysis based on the confusion matrix (Tab. 7) that ALOGK is the best performing method used, since it corresponds to the highest efficiency, equal to 91.94% (with a standard deviation of ±2.66% over the 50 runs), and the highest purity, equal to 89.92% (with a standard deviation of ±5.67% over the 100 runs).

Tabs. 9.1 and 9.2 list the most common clustering validation indices, namely the Partition Coefficient (PC), Classification Entropy (CE), Partition Index (PI), Separation Index (S), Xie-Beni Index (XB), Dunn's Index (DI) and Alternative Dunn's Index (ADI), applied to the considered data and calculated for each algorithm combination used. The PC and CE indexes are useless for hard clustering: for all hard clustering algorithm combinations, the PC values are equal to 1 and the CE values are not a number. Applied to fuzzy clustering, the algorithm is efficient when the PC value tends to 1 and the CE value to 0. According to Tabs. 9.1 and 9.2, this rule is validated whether we use the ALOGK or the ALOFCM clustering. This means that these two methods are compact, close to a crisp classification and produce clusters with little overlap. The minimal values of PI and S correspond to ALOFCM, ALOGG or ALOGK, which means that they are the strongest algorithms with respect to the noise criterion and the separation of clusters. The basic GG clustering algorithm is the most compact and gives the best separated clusters according to the XB index. Both the DI and ADI indexes point toward the novel ALOFCM, ALOGK and ALOGG clustering algorithms. Using this third layer of analysis based on the clustering validation indices, we can confirm that the fuzzy based combinations give better results than the hard ones, especially the combinations based on the ALO method.

The execution times of the considered approaches are gathered in Tab. 8. The time spent using hard clustering, especially the KM algorithm, is lower than the time spent using fuzzy clustering. This is due to the fact that we are using a large data set and that this kind of clustering is the most suitable for this case.
As another angle of analysis, we measure the variation of the clustering cost over the iterations for each algorithm combination, where the cost corresponds to the sum of the distances between each data point and its closest cluster centroid. The results are shown in Figs. 3-6. According to these figures, the optimal cost is reached using GK, ALOGK, GAGK and PSOGK respectively. Again, comparing hard and fuzzy clustering, we can conclude that the combinations based on fuzzy clustering give better results, and that the method that optimized the clustering cost the most is the ALOGK algorithm. In terms of convergence speed, Figs. 3-6 show that the combinations with the KM and PAM algorithms were faster to converge to their optimal costs; however, these optimal costs were at least twice as large as the optimal cost of each combination with the GK algorithm.

As another level of analysis, Figs. 7-10 present the Higgs boson search clustering distribution, visualized with a Sammon mapping [72], into signals and backgrounds for the algorithm combinations that optimize the clustering cost the most according to the previous analysis (GK, the proposed ALOGK, GAGK and PSOGK). The Sammon mapping used is based on the two most discriminant features obtained in Section 4.1.3, namely the features F_1 and F_18. In Figs. 7-10, the distance measure of the cluster prototype is transformed into a Euclidean distance in the projected two-dimensional space, where each cluster centroid is represented by a single red star. When the cluster is properly selected, the projected data fall close to the projected cluster center in an approximately spherically distributed cluster (shown as blue stars). As can be seen in these figures, the data points lie much closer to the cluster centers when the ALOGK algorithm is used.

Conclusion

This paper has presented a hybridization of hard and fuzzy clustering tuned with the Ant Lion Optimizer for the detection of the Higgs boson particle, using the most informative features and variables which characterize the Higgs machine learning challenge 2014 data set.
The contribution of this work lies firstly in the approach used for the selection of these features and variables. The second contribution is the new hybridized clustering combinations and tuning improvements, applied for the first time in the field of the Higgs boson search, where a metaheuristic technique such as ALO optimizes various clustering methods such as KM, GlobalKM, PAM, FCM, GK and GG. The results of the hybrid clustering technique tuned by ALO are compared with some existing metaheuristic optimizations such as GA and PSO.
In order to choose the proper learning parameters for the experiments, an offline parameter tuning has been done for each of the algorithms used in this work. The aim of each stochastic technique was to minimize the objective function of the clustering algorithms.
Based on a multi-angle comparative analysis of the results found with each hybrid combination, the ALOGK clustering has proved its effectiveness when applied to the Higgs boson search.
To confirm this result, many scalar validity indexes and performance measures were used in the analysis: the partition coefficient, classification entropy, partition index, separation index, Xie-Beni index, Dunn's index, alternative Dunn's index, efficiency, purity, computational time, average cost variation and the Sammon mapping visualization.
As a perspective, we will improve and compare the 3 algorithm combinations that gave the best results in this work, as well as other novel ones, applied to an extended number of Higgs channels.

Funding Statement:
The author(s) received no specific funding for this study.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.