Using Link-Based Consensus Clustering for Mixed-Type Data Analysis

Abstract: Many modern data collections, such as banking data, sales histories and healthcare records, contain a mix of numerical and nominal attributes, where continuous attributes like age and nominal ones like blood type are jointly exploited to characterize account details, business transactions or individuals. However, only a few standard clustering techniques and consensus clustering methods have been provided to examine such data thus far. Given this insight, the paper introduces novel extensions of the link-based cluster ensemble, LCE WCT and LCE WTQ, that are accurate for analyzing mixed-type data. They promote diversity within an ensemble through different initializations of the k-prototypes algorithm as base clusterings, and then refine the summarized data using a link-based approach. Based on Normalized Mutual Information (NMI) averaged across different combinations of benchmark datasets and experimental settings, these new models reach 0.34, while the best model found in the literature obtains only around 0.24. Besides, the parameter analysis included herein helps to enhance their performance even further, given the relations between clustering quality and the algorithmic variables specific to the underlying link-based models. Moreover, the ensemble size is examined as another significant factor, so as to justify a tradeoff between complexity and accuracy.


Introduction
Cluster analysis has been widely used to explore the structure of a given dataset. This analytical tool is usually employed in the initial stage of data interpretation, especially for a new problem where prior knowledge is limited. The goal of acquiring knowledge from data sources has been a major driving force, which makes cluster analysis one of the most active research subjects. Over several decades, different clustering techniques have been devised and applied to a variety of problem domains, such as biological study [1], customer relationship management [2], information retrieval [3], image processing and machine vision [4], medicine and health care [5], pattern recognition [6], psychology [7] and recommender systems [8]. Nonetheless, selecting an appropriate clustering algorithm or parameter setting for a new dataset has proven difficult, since no single method performs best across all problems. A solution to this dilemma is to combine different clusterings into a single consensus clustering. This process, known as consensus clustering or cluster ensemble, has been reported to provide more robust and stable solutions across different problem domains and datasets [9,24]. Among state-of-the-art approaches, the link-based cluster ensemble or LCE [26,27] usually delivers accurate clustering results, with respect to both numerical and nominal domains. Given this insight, the paper introduces the extension of LCE to mixed-type data clustering, with contributions summarized as follows. Firstly, a new extension of LCE that makes use of k-prototypes as base clusterings is proposed. The resulting models have been assessed on benchmark datasets, and compared to both basic and ensemble clustering techniques. Experimental results point out that the proposed extension usually outperforms those included in this empirical study. Secondly, parameter analysis with respect to the algorithmic variables of LCE is conducted and emphasized as a guideline for further studies and applications. The rest of this paper is organized as follows.
To set the scene for this work, Section 2 presents existing methods to mixed-type data clustering. Following that, Section 3 introduces the proposed extension of LCE, including ensemble generation and estimation of link-based similarity. To perceive its performance, the empirical evaluation in Section 4 is conducted on benchmark data sets, with a rich collection of compared techniques. The paper is concluded in Section 5 with the direction of future research.

Mixed-Type Data Clustering Methods
Following the success in numerical and nominal domains, a line of research has emerged with the focus on clustering mixed-type data. One of the initial attempts is the model of k-prototypes, which extends the classical k-means to clustering mixed numeric and categorical data [21]. It makes use of a heterogeneous proximity function to assess the dissimilarity between data objects and cluster prototypes (i.e., cluster centroids). While the Euclidean distance is exploited for the numerical case, the nominal dissimilarity can be directly derived from the number of mismatches between nominal values. This distance function for mixed-type data requires different weights for the contribution of numerical vs. nominal attributes to avoid favoring either type of attribute. Let X = {x_1, . . . , x_N} be a set of N data objects, where each x_i ∈ X is described by D attributes, with D = D_n + D_c, i.e., the total number of numerical (D_n) and nominal (D_c) attributes. The distance between an object x_i ∈ X and a cluster prototype c_p is estimated by the following equation.

d(x_i, c_p) = Σ_{g=1..D_n} (x_ig − c_pg)^2 + γ Σ_{g=D_n+1..D} δ(x_ig, c_pg)
where δ(y, z) = 0 if y = z and 1 otherwise. In addition, γ is a weight for the nominal attributes: a large γ suggests that the clustering process favors the nominal attributes, while a small value of γ indicates that the numerical attributes are emphasized.
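To make the distance concrete, the following sketch implements the mixed dissimilarity just described. The function name and the split of each object into separate numerical and nominal parts are illustrative choices, not from the paper.

```python
import numpy as np

def mixed_distance(x_num, x_nom, c_num, c_nom, gamma=1.0):
    """k-prototypes style dissimilarity between an object and a prototype:
    squared Euclidean distance on the numerical part, plus gamma times the
    number of nominal mismatches (the delta function above)."""
    numeric_part = float(np.sum((np.asarray(x_num) - np.asarray(c_num)) ** 2))
    nominal_part = sum(1 for a, b in zip(x_nom, c_nom) if a != b)
    return numeric_part + gamma * nominal_part

# numeric part = (1-1)^2 + (2-0)^2 = 4.0; one nominal mismatch (B vs C)
d = mixed_distance([1.0, 2.0], ["A", "B"], [1.0, 0.0], ["A", "C"], gamma=0.5)
# d = 4.0 + 0.5 * 1 = 4.5
```

As the abstract notes, varying γ across ensemble members is one source of diversity for the base clusterings.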
Besides the aforementioned, k-centers [22] is an extension of the k-prototypes algorithm. It focuses on the effect of attribute values with different frequencies on clustering accuracy. Unlike k-prototypes, which selects the most frequently appearing nominal attribute value as a centroid component, k-centers also takes into account attribute values with low frequency when forming centroids. Based on this idea, a new dissimilarity measure is defined. Specifically, the Euclidean distance is used for numerical attributes, while the nominal dissimilarity is derived from the similarity between corresponding nominal attributes. Let x_i ∈ X be a data object described by D_n numerical attributes and D_c nominal attributes. The domain of nominal attribute A_g is denoted by {a_g(1), a_g(2), . . . , a_g(n_g)}, where n_g is the number of attribute values of A_g. The distance between data object x_i and centroid c_p is defined as follows.

d(x_i, c_p) = β Σ_{g=1..D_n} (x_ig − c_pg)^2 + γ Σ_{g=D_n+1..D} (1 − f(x_ig, c_pg))
where f(x_ig, c_pg) = c_pg(r) such that x_ig = a_g(r), i.e., the centroid component corresponding to the nominal value observed in x_i. The weight parameters β and γ are for numerical and nominal attributes, respectively. According to [22], β is set to 1, while a greater weight is given to γ if the nominal attributes are to be emphasized more, or a smaller value otherwise. A new definition of centroids is also introduced. For numerical attributes, a centroid is represented by the mean of attribute values. For a nominal attribute A_g, g ∈ D_c, the centroid component c_pg is an n_g-dimensional vector (c_pg(1), c_pg(2), . . . , c_pg(n_g)), where c_pg(r) is defined by the next equation.

c_pg(r) = n_pg(r) / n_p, with n_p being the number of data objects in the pth cluster
where n_pg(r) denotes the number of data objects in the pth cluster with attribute value a_g(r). Note that if attribute value a_g(r) does not exist in the pth cluster, c_pg(r) = 0.

The problem of selecting an appropriate clustering algorithm, or a parameter setting of any potential alternative, has proven difficult, especially with a new set of data. In such a case, where prior knowledge is generally minimal, the performance of any particular method is inherently uncertain. To obtain a more robust and accurate outcome, consensus clustering has been put forward and extensively investigated in the past decade. However, while a large number of cluster ensemble techniques for numerical data have been developed [24,26,28-35], very few studies extend such a methodology to mixed-type data clustering. Specific to this subject, the cluster ensemble framework of [36] uses the pairwise similarity concept [24], which was originally designed for continuous data. Though this research area has received little attention thus far, it is crucial to explore the true potential of cluster ensembles for such a problem. This motivates the present research, with the link-based framework being developed and evaluated herein.

Link-Based Consensus Clustering for Mixed-Type Data
This section presents the proposed framework of LCE for mixed-type data. It includes details of the conceptual model, the ensemble generation strategies, the link-based similarity measures, and the consensus function that is used to create the final clustering result.

Problem Definition
The LCE approach was initially introduced for gene expression data analysis [9]. Unlike other methods, it explicitly models base clustering results as a link network, from which the relations between and within these partitions can be obtained. In the current research, this consensus-clustering model is uniquely extended to the problem of clustering mixed-type data, which can be formulated as follows. Let Π = {π_1, . . . , π_M} be a cluster ensemble with M base clusterings, each of which returns a set of clusters π_g = {C^g_1, . . . , C^g_{k_g}}, where k_g is the number of clusters in the gth clustering. For each x_i ∈ X, C_g(x_i) denotes the cluster label in the gth base clustering to which data object x_i belongs, i.e., C_g(x_i) = C^g_j if x_i ∈ C^g_j. The problem is to find a new partition π* = {C*_1, . . . , C*_K} of the data set X, where K denotes the number of clusters in the final clustering result, that summarizes the information from the cluster ensemble Π.

LCE Framework for Mixed-Type Data Clustering
The extended LCE framework for the clustering of mixed-type data involves three steps: (i) creating a cluster ensemble Π, (ii) aggregating the base clustering results, π_g ∈ Π, g = 1 . . . M, into a meta-level data matrix RA_l (with l being the link-based similarity measure used to deliver the matrix), and (iii) generating the final data partition π* using the spectral graph partitioning (SPEC) algorithm. See Fig. 1 for an illustration of this framework.

Generating Cluster Ensemble
The proposed framework is generalized such that it can be coupled with several different ensemble generation methods. In the present study, the following four types of ensembles are investigated. Unlike the original work, in which the classical k-means is used to form base clusterings, the extended LCE obtains an ensemble by applying k-prototypes to mixed-type data (see Fig. 1 for details). Each base clustering is initialized with a random set of cluster prototypes, and the variable γ of k-prototypes is arbitrarily selected from the set {0.1, 0.2, 0.3, . . . , 5}.
Full-space + Fixed-k: Each π_g ∈ Π is formed using the data set X ∈ R^{N×D} with all D attributes. The number of clusters in each base clustering is fixed to k = √N; to keep partitions meaningful, k is capped at 50 whenever √N > 50.
Full-space + Random-k: Each π_g is obtained using the data set with all attributes, and the number of clusters is randomly selected from the set {2, . . . , √N}. Note that both 'Fixed-k' and 'Random-k' generation strategies were initially introduced in the primary work of [30].
Subspace + Fixed-k: Each π_g is created using the data set with a subset of the original attributes, and the number of clusters is fixed to k = √N. Following the studies of [37] and [38], a data subspace X' ∈ R^{N×D'} is selected from the original data X ∈ R^{N×D}, where D is the number of original attributes and D' < D. In particular, D' is randomly chosen by the following.

D' = ⌊D_min + α(D_max − D_min)⌋

where α ∈ [0, 1] is a uniform random variable. Besides, D_min and D_max are user-specified parameters, with default values of 0.75D and 0.85D, respectively.
Subspace + Random-k: Each π_g is generated using the data set with a subset of attributes, and the number of clusters is randomly selected from the set {2, . . . , √N}.
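The four generation strategies above can be sketched as a single parameter-drawing routine. `ensemble_params` and its argument names are hypothetical; the γ grid follows the set quoted earlier, and the subspace bounds default to 0.75D and 0.85D as in the text.

```python
import math
import random

def ensemble_params(N, D, M=10, fixed_k=True, subspace=False,
                    d_min_frac=0.75, d_max_frac=0.85):
    """Draw per-member settings for M base clusterings (illustrative sketch).

    'Fixed-k' uses k = sqrt(N) capped at 50; 'Random-k' draws k uniformly
    from {2, ..., sqrt(N)}. For subspace ensembles, the number of retained
    attributes D' is drawn between d_min_frac*D and d_max_frac*D.
    """
    root = int(math.sqrt(N))
    members = []
    for _ in range(M):
        k = min(root, 50) if fixed_k else random.randint(2, max(2, root))
        if subspace:
            alpha = random.random()  # uniform in [0, 1)
            d = int(d_min_frac * D + alpha * (d_max_frac - d_min_frac) * D)
            attrs = random.sample(range(D), d)
        else:
            attrs = list(range(D))
        # gamma for k-prototypes, drawn from {0.1, 0.2, ..., 5}
        gamma = random.choice([round(0.1 * i, 1) for i in range(1, 51)])
        members.append({"k": k, "attrs": attrs, "gamma": gamma})
    return members
```

Each member dictionary would then drive one run of k-prototypes, yielding the diversity that the link-based consensus step relies on.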

Summarizing Multiple Clustering Results
Having obtained the ensemble Π, the corresponding base clustering results are summarized into an information matrix RA_l ∈ [0, 1]^{N×P}, from which the final data partition π* can be created. Note that P denotes the total number of clusters in the ensemble under examination. For each clustering π_g ∈ Π with its corresponding clusters {C^g_1, . . . , C^g_{k_g}}, a matrix entry RA_l(x_i, cl) represents the association degree that data object x_i ∈ X has with each cluster cl ∈ {C^g_1, . . . , C^g_{k_g}}, which can be calculated by the next equation.

RA_l(x_i, cl) = 1 if cl = C^g_*(x_i), and RA_l(x_i, cl) = sim(C^g_*(x_i), cl) otherwise
where C^g_*(x_i) is the cluster to which sample x_i has been assigned in π_g. In addition, sim(C_x, C_y) ∈ [0, 1] denotes the similarity between any two clusters C_x, C_y ∈ π_g, which can be discovered using the link-based algorithm l presented next.
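Under the entry rule above, the refined association matrix can be assembled as follows. The cluster-similarity function `sim` is assumed to be supplied externally (e.g., by WCT or WTQ), and the data layout (one label vector per base clustering) is an illustrative choice.

```python
import numpy as np

def refined_association(base_labels, sim):
    """Build the refined cluster-association matrix RA (a sketch).

    base_labels[g][i] is the cluster index of object i in base clustering g;
    sim(g, a, b) returns a similarity in [0, 1] between clusters a and b of
    clustering g. Columns are the clusters of all base clusterings stacked
    side by side, so the result is N x P with P the total cluster count.
    """
    N = len(base_labels[0])
    blocks = []
    for g, labels in enumerate(base_labels):
        k = max(labels) + 1
        block = np.zeros((N, k))
        for i, assigned in enumerate(labels):
            for c in range(k):
                # 1 for the assigned cluster, link-based similarity otherwise
                block[i, c] = 1.0 if c == assigned else sim(g, assigned, c)
        blocks.append(block)
    return np.hstack(blocks)
```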
Weighted Connected-Triple (WCT) Algorithm: WCT has been developed to evaluate the similarity between any pair of clusters C_x, C_y ∈ Π. At the outset, the ensemble Π is represented as a weighted graph G = (V, W), where V is the set of vertices, each representing a cluster in Π, and W is a set of weighted edges between clusters. The weight |w_xy| ∈ [0, 1] assigned to the edge w_xy ∈ W between C_x, C_y ∈ V is estimated by the next equation.

|w_xy| = |L_x ∩ L_y| / |L_x ∪ L_y|
where L_z ⊂ X denotes the set of data objects belonging to cluster C_z ∈ Π. Note that G is an undirected graph, such that |w_xy| is equivalent to |w_yx|, ∀C_x, C_y ∈ V. The WCT algorithm is summarized in Fig. 2. Following that, the similarity between clusters C_x and C_y can be estimated by the next equation.

sim(C_x, C_y) = (WCT_xy / WCT_max) × DC
where WCT_max is the maximum WCT_xy value over any two clusters C_x, C_y ∈ V, and DC ∈ [0, 1] is a constant decay factor (i.e., the confidence level of accepting two non-identical clusters as being similar). With this link-based metric, sim(C_x, C_y) ∈ [0, 1]. It is also symmetric, such that sim(C_x, C_y) = sim(C_y, C_x).

Figure 2: The summarization of WCT algorithm
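As an illustration of the WCT idea (not a transcription of the Fig. 2 pseudocode), the sketch below builds the Jaccard edge weights between clusters and then counts weighted connected triples. Aggregating the two edge weights of a triple with min() is one common formulation and is assumed here.

```python
def jaccard_weight(members_x, members_y):
    """Edge weight between two clusters: shared members over union."""
    x, y = set(members_x), set(members_y)
    return len(x & y) / len(x | y) if x | y else 0.0

def wct_counts(W):
    """Weighted connected-triple counts for every cluster pair (sketch).

    W[x][y] is the edge weight between cluster vertices x and y (0 if no
    edge). Each common neighbour k of x and y forms a triple; its two edge
    weights are aggregated with min() here, one common choice.
    """
    n = len(W)
    wct = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            s = sum(min(W[x][k], W[y][k])
                    for k in range(n) if k not in (x, y))
            wct[x][y] = wct[y][x] = s
    return wct

def wct_similarity(wct, x, y, dc=0.9):
    """Normalize by the maximum WCT value and scale by the decay factor."""
    m = max(v for row in wct for v in row)
    return dc * wct[x][y] / m if m else 0.0
```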
Weighted Triple-Quality (WTQ) Algorithm: WTQ is inspired by the initial measure of [39], as it discriminates the quality of the triples shared by a pair of vertices in question. Specifically, the quality of each vertex is determined by the rarity of the links connecting it to other vertices in the network. Given the weighted graph G = (V, W), let N_z ⊂ V denote the set of vertices directly linked to a vertex v_z, such that ∀v_t ∈ N_z, w_zt ∈ W. A pseudocode of the WTQ measure is described in Fig. 3. Following that, the similarity between clusters C_x and C_y can be estimated by

sim(C_x, C_y) = (WTQ_xy / WTQ_max) × DC

where WTQ_max is the maximum WTQ_xy value of any two clusters and DC ∈ [0, 1] is a decay factor.
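In the same illustrative spirit, a WTQ sketch can credit each shared neighbour k by the reciprocal of its total incident weight W_k, following the rarity-of-links description above. This is a sketch of the idea, not the pseudocode of Fig. 3.

```python
def wtq_counts(W):
    """Weighted triple-quality counts for every cluster pair (sketch).

    W[x][y] is the edge weight between cluster vertices x and y. Each
    neighbour k shared by x and y contributes 1/W_k, where W_k is the total
    edge weight incident to k, so rarely linked neighbours count for more.
    """
    n = len(W)
    incident = [sum(W[k]) for k in range(n)]
    wtq = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            s = sum(1.0 / incident[k]
                    for k in range(n)
                    if k not in (x, y)
                    and W[x][k] > 0 and W[y][k] > 0 and incident[k] > 0)
            wtq[x][y] = wtq[y][x] = s
    return wtq

def wtq_similarity(wtq, x, y, dc=0.9):
    """Normalize by the maximum WTQ value and scale by the decay factor."""
    m = max(v for row in wtq for v in row)
    return dc * wtq[x][y] / m if m else 0.0
```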

Creating Final Data Partition
Having acquired RA_l, the spectral graph-partitioning (SPEC) algorithm [40] is used to create the final data partition. This technique was first introduced by [28] as part of the Hybrid Bipartite Graph Formation (HBGF) framework. In particular, SPEC is exploited to divide a bipartite graph, which is transformed from the matrix BA ∈ {0, 1}^{N×P} (a crisp variation of RA_l), into K clusters. Given this insight, HBGF can be considered as the baseline model of LCE.

The process of generating the final data partition π* from the RA_l matrix is summarized as follows. At first, a weighted bipartite graph G = (V, W) is constructed from the matrix RA_l, where V = V_X ∪ V_C is a set of vertices representing both data objects V_X and clusters V_C, and W denotes a set of weighted edges. The weight |w_ij| of the edge w_ij connecting vertices v_i, v_j ∈ V can be defined by

|w_ij| = RA_l(v_i, v_j) if v_i ∈ V_X and v_j ∈ V_C (or v_i ∈ V_C and v_j ∈ V_X), and |w_ij| = 0 otherwise.

In other words, W ∈ [0, 1]^{(N+P)×(N+P)} can also be specified as the block matrix

W = [ 0  RA_l ; (RA_l)^T  0 ]

After that, the K largest eigenvectors u_1, u_2, . . . , u_K of W are used to produce the matrix U = [u_1 u_2 . . . u_K], in which the eigenvectors are stacked in columns. Then, another matrix U* ∈ [0, 1]^{(N+P)×K} is formed by normalizing each row of U to unit length. By considering each row of U* as the K-dimensional embedding of a graph vertex, i.e., a sample in [0, 1]^K, k-means is finally used to generate the final partition π* = {C*_1, . . . , C*_K} of K clusters.
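The SPEC step just described can be sketched with NumPy as follows. To keep the example deterministic, the final clustering uses a tiny farthest-first-seeded k-means instead of the usual randomized k-means, and only the first N rows (the data objects) are labelled; these are simplifications, not part of the original method.

```python
import numpy as np

def spec_partition(RA, K, n_iter=50):
    """SPEC-style bipartite partitioning of the association matrix (sketch).

    Builds the (N+P)x(N+P) block adjacency [[0, RA], [RA.T, 0]], embeds every
    vertex with the K largest eigenvectors, row-normalizes to unit length,
    and clusters the embeddings with a small deterministic k-means.
    """
    N, P = RA.shape
    W = np.zeros((N + P, N + P))
    W[:N, N:] = RA
    W[N:, :N] = RA.T
    vals, vecs = np.linalg.eigh(W)              # symmetric -> real spectrum
    U = vecs[:, np.argsort(vals)[-K:]]          # K largest eigenvectors
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    U = U / np.where(norms == 0, 1.0, norms)    # unit-length rows
    centers = [U[0]]                            # farthest-first seeding
    for _ in range(1, K):
        d = np.min([((U - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(U[int(d.argmax())])
    centers = np.array(centers)
    for _ in range(n_iter):                     # plain Lloyd iterations
        d = ((U[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = U[labels == k].mean(0)
    return labels[:N]                           # labels of the data objects
```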

Performance Evaluation
To obtain a rigorous assessment of LCE for mixed-type data clustering, this section presents the framework that is systematically designed and employed for the performance evaluation.

Investigated Datasets
Five benchmark datasets obtained from the UCI repository [41] are included in this investigation, with Tab. 1 giving their details. Abalone consists of 4,177 instances, where eight physical measurements are used to divide these data into 28 age groups of abalone; there is only one categorical attribute, while the rest are continuous. Acute Inflammations was originally created by a medical expert to assess a decision support system performing the presumptive diagnosis of two diseases of the urinary system: acute inflammation of the urinary bladder and acute nephritis [42]. There are 120 instances, each representing a potential patient with six symptom attributes (1 numerical and 5 categorical). Heart Disease contains 303 records of patients collected from the Cleveland Clinic Foundation. Each data record is described by 13 attributes (5 numerical and 8 nominal) regarding heart disease diagnosis; this dataset is divided into two classes referring to the presence and absence of heart disease in the examined patients. Horse Colic has 368 data records of injured horses, each of which is described by 27 attributes

Experimental Design
This experiment aims to examine the quality of the LCE WCT and LCE WTQ extensions of LCE for clustering mixed numeric and nominal data. For these extended models, where k-prototypes is used for creating a cluster ensemble, the parameter γ of this base clustering algorithm is randomly selected from {0.1, 0.2, . . . , 5}. The results of the LCE models are compared against a large number of standard clustering techniques and advanced cluster ensemble approaches. At first, this includes four standard clustering algorithms: k-prototypes, k-centers, k-means (KM) and dSqueezer. In particular, the weight parameter γ is randomly selected from {0.1, 0.2, . . . , 5} for each run of k-prototypes and k-centers. In order to exploit k-means, a mixed-type dataset needs to be pre-processed such that each nominal attribute is transformed into new binary-valued features, one per nominal value. For the case of dSqueezer, each numerical data attribute has to be mapped to the corresponding categorical domain using the discretisation method explained by [19]. The set of compared methods also contains twelve different cluster ensemble techniques that have been reported in the literature for their effectiveness in combining clustering results: four graph-based methods of HBGF [28], CSPA [32], HGPA [32] and MCLA [32]; two pairwise-similarity based methods [24] of EAC-SL and EAC-AL; and six feature-based methods of IVC [43], MM [33], QMI [33], AGG F [29], AGG LSF [29] and AGG LSR [29]. The experiment settings employed in this evaluation are exhibited below. Note that the performance of the standard clustering algorithms is always assessed over the original data, without using any information from cluster ensembles.
• Cluster ensemble methods are investigated using four different ensemble types: Full-space + Fixed-k, Full-space + Random-k, Subspace + Fixed-k, and Subspace + Random-k.
• An ensemble size (M) of 10 base clusterings is experimented with.
• As in [24,28,29], each method divides the data points into a partition of K (the number of true classes for each dataset) clusters, which is then evaluated against the corresponding true partition. Note that the true classes are known for all datasets but are not explicitly used by the cluster ensemble process; they are only used to evaluate the quality of the clustering results.
• The quality of each cluster ensemble method with respect to a specific ensemble setting is generalized as the average over 50 runs. Based on the central limit theorem (CLT), the statistics observed in such a controlled experiment can be treated as approximately normally distributed [43].
• A constant decay factor (DC) of 0.9 is exploited with the WCT and WTQ algorithms.
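The one-hot pre-processing used above to apply k-means to mixed-type data can be illustrated as follows; the function and variable names are hypothetical.

```python
def one_hot_mixed(rows, nominal_idx):
    """Expand each nominal attribute into one binary feature per observed
    value, so that a numeric-only method like k-means can be applied.

    rows is a list of records; nominal_idx is the set of column indices
    holding nominal values. Numerical columns are passed through as floats.
    """
    # collect the observed value domain of every nominal column
    domains = {j: sorted({r[j] for r in rows}) for j in nominal_idx}
    out = []
    for r in rows:
        rec = []
        for j, v in enumerate(r):
            if j in nominal_idx:
                # one indicator feature per value in the column's domain
                rec.extend(1.0 if v == a else 0.0 for a in domains[j])
            else:
                rec.append(float(v))
        out.append(rec)
    return out
```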

Performance Measurements and Comparison
Provided that the external class labels are available for all experimented datasets, the final clustering results are evaluated using the validity index of Normalized Mutual Information (NMI) introduced by [32]. Other quality measures, such as Classification Accuracy (CA; [44]) and Adjusted Rand Index (AR; [45]), can be used similarly. However, unlike other criteria, NMI is not biased by a large number of clusters, thus providing a reliable conclusion. This also simplifies the magnitude of evaluation results and their comprehension. This quality index measures the average mutual information (i.e., the degree of agreement) between two data partitions: one is obtained from a clustering algorithm (π*), while the other is taken from a priori information, i.e., the known class labels. With NMI ∈ [0, 1], the maximum value indicates that the clustering result and the original classes completely match. Given two data partitions of K clusters and K classes, NMI is computed by the following equation.

NMI = ( Σ_i Σ_j n_ij log( (N × n_ij) / (n_i × m_j) ) ) / √( ( Σ_i n_i log(n_i / N) ) × ( Σ_j m_j log(m_j / N) ) )
where n_ij is the number of data objects agreed by cluster i and class j, n_i is the number of data objects in cluster i, m_j is the number of data objects in class j, and N is the total number of data objects.

To compare the performance of different cluster ensemble methods, the overall quality measure for a specific experiment setting (i.e., dataset and ensemble type) is obtained as the average of the NMI values across 50 trials. These method-specific means may be used for comparison purposes only to a certain extent. To achieve a more reliable assessment, the number of times (or frequency) that one technique is 'significantly better' and 'significantly worse' (at the 95% confidence level) than the others is considered here. This comparison method has been successfully exploited by [9] and [46] to draw trustworthy conclusions from the results generated by different cluster ensemble approaches. Based on these, it is useful to compare the frequencies of better (B) and worse (W) performance between methods; the overall measure (B − W) is also used as a summarization.

Fig. 4 shows the overall performance of the different clustering methods, as the average NMI measure across all investigated datasets and ensemble types. Based on this, LCE WCT and LCE WTQ are similarly more effective than their baseline model (i.e., HBGF), while significantly improving the quality of the data partitions acquired by the base clusterings, i.e., k-prototypes. Their performance levels are also better than those of the other cluster ensemble methods and standard clustering algorithms included in this evaluation. Note that CSPA and k-means are the most accurate amongst the aforementioned two groups of compared methods. In addition, feature-based approaches such as QMI and IVC are unfortunately incapable of enhancing the accuracy of the base clustering results.
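For reference, the NMI index defined above can be computed directly from the contingency counts; this sketch matches the variables n_ij, n_i, m_j and N of the equation.

```python
import math
from collections import Counter

def nmi(labels_pred, labels_true):
    """Normalized Mutual Information between two partitions.

    Uses the geometric-mean normalization: mutual information over the
    square root of the product of the two partition entropies.
    """
    N = len(labels_pred)
    n_ij = Counter(zip(labels_pred, labels_true))  # contingency counts
    n_i = Counter(labels_pred)                     # cluster sizes
    m_j = Counter(labels_true)                     # class sizes
    num = sum(c * math.log((N * c) / (n_i[i] * m_j[j]))
              for (i, j), c in n_ij.items())
    den = math.sqrt(sum(c * math.log(c / N) for c in n_i.values()) *
                    sum(c * math.log(c / N) for c in m_j.values()))
    return num / den if den else 0.0
```

A perfect clustering scores 1.0 even when the cluster labels are a permutation of the class labels, which is why NMI is preferred over raw label agreement.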
Dataset-specific results are given in Tabs. 2 and 3. To further evaluate the quality of the identified techniques, the number of times (or frequency) that one method is significantly better or worse (at the 95% confidence level) than the others is assessed across all experimented datasets and ensemble types. Tabs. 2 and 3 present for each method the frequencies of significantly better (B) and significantly worse (W) performance, respectively. According to the frequencies shown in Tab. 2, LCE WCT and LCE WTQ perform equally well on most of the examined datasets. EAC-AL is exceptionally effective on 'Abalone' data, while the three graph-based approaches of CSPA, HGPA and MCLA are of good quality on 'Heart Disease' and 'Horse Colic'. Note that k-means and k-prototypes are the best amongst the basic clustering techniques. It is also interesting to see that the better-performance statistics of the feature-based approaches are usually lower than those of the standard clusterings considered here. These findings can be similarly observed in Tab. 3, which illustrates the frequencies of worse performance (W). In this specific evaluation context, k-means is notably effective for most datasets and outperforms many graph-based and pairwise-similarity based cluster ensemble methods.

Experimental Results
The relations between the performance of the experimented cluster ensemble methods and the different ensemble types are also examined: Full-space + Fixed-k, Full-space + Random-k, Subspace + Fixed-k, and Subspace + Random-k. Specifically, Fig. 5 shows the average NMI measures of the different approaches across datasets. According to this illustration, LCE WCT and LCE WTQ are more effective than the other techniques across all ensemble types, with their best performance being obtained with 'Subspace + Fixed-k'. HBGF and the three graph-based approaches (CSPA, HGPA and MCLA) are also more effective on the Subspace ensemble types, as compared to the Full-space alternatives. While both the 'Fixed-k' and 'Random-k' strategies lead to equally good performance of the link-based techniques, the feature-based and pairwise-similarity based methods perform better using the latter.

The quality of LCE WCT and LCE WTQ with respect to the perturbation of the DC and M parameters is also studied for the clustering of mixed-type data. Fig. 6 presents the relation between different values of DC ∈ {0.1, . . . , 0.9} and the quality of the data partitions generated by both LCE methods, as the average NMI measure across all ensemble types, where M is fixed to 10 for comparison simplicity. In general, the performance of LCE WCT and LCE WTQ gradually improves as the value of DC increases. Another parameter to be assessed is the ensemble size (M).

Conclusion
This paper has presented a novel extension of link-based consensus clustering to mixed-type data analysis. The resulting models have been rigorously evaluated on benchmark datasets, using several ensemble types. The comparison against different standard clustering algorithms and a large set of well-known cluster ensemble methods shows that the link-based techniques usually provide solutions of higher quality than those obtained by the competitors. Furthermore, the investigation of their behavior with respect to the perturbation of algorithmic parameters also suggests robust performance. Such a characteristic makes link-based cluster ensembles highly useful for the exploration and analysis of a new set of mixed-type data, where prior knowledge is minimal. There are many possibilities for extending the current research. Firstly, other link-based similarity measures may be explored. As more information within a link network is exploited, link-based cluster ensembles are likely to be more accurate (see the relevant findings in the initial work [30,31], where the use of SimRank and its variants is examined). However, it is important to note that such a modification is more resource intensive, and less accurate in a noisy environment, than the present setting. Secondly, the performance of link-based cluster ensembles may be further improved using an adaptive decay factor (DC) that is determined from the dataset under examination.
The diversity of cluster ensembles has a positive effect on the performance of the link-based approach. It would be interesting to observe the behavior of the proposed models under new ensemble generation strategies, e.g., the random forest method for clustering [47], which may impose a higher diversity amongst base clusterings. Another non-trivial topic is the determination of the significance of ensemble components; such a discrimination or selection process usually leads to a better outcome, and the coupling of such a mechanism with link-based cluster ensembles is to be further studied. Despite its performance, the consensus function of spectral graph partitioning (SPEC) can be inefficient with a large RA matrix. This can be overcome through the approximation of the eigenvectors required by SPEC; as a result, the time complexity becomes linear in the matrix size, but with possible information loss. A better alternative has been introduced by [48] via the notion of Power Iteration Clustering (PIC), which does not actually find eigenvectors but discovers interesting combinations of them. As a result, it is very fast and has proven more effective than conventional SPEC. The application of PIC as a consensus function of link-based cluster ensembles is a crucial step towards making the proposed approach truly effective in terms of run-time and quality. Other possible future works include the use of the proposed method to support accurate clusterings for fuzzy reasoning [49], the handling of data with missing values [50] and data discretization [51].