A Fast and Effective Multiple Kernel Clustering Method on Incomplete Data

Multiple kernel clustering is an unsupervised data analysis method that has been used in various scenarios where data are easy to collect but hard to label. However, multiple kernel clustering for incomplete data is a critical yet challenging task. Although the existing absent multiple kernel clustering methods have achieved remarkable performance on this task, they may fail when data have a high value-missing rate, and they may easily fall into a local optimum. To address these problems, in this paper, we propose an absent multiple kernel clustering (AMKC) method on incomplete data. The AMKC method first clusters the initialized incomplete data. Then, it constructs a new multiple-kernel-based data space, referred to as K-space, from multiple sources to learn kernel combination coefficients. Finally, it seamlessly integrates an incomplete-kernel-imputation objective, a multiple-kernel-learning objective, and a kernel-clustering objective in order to achieve absent multiple kernel clustering. The three stages in this process are carried out iteratively until the convergence condition is met. Experiments on six datasets with various characteristics demonstrate that the kernel imputation and clustering performance of the proposed method is significantly better than that of state-of-the-art competitors. The proposed method also converges quickly.


Introduction
In many real-world scenarios, it is easy to collect a large amount of data from the normal condition [1][2][3], but it is often time-consuming and expensive to label them for supervised learning methods. On the other hand, it is very difficult to obtain data from the abnormal condition [4,5]. These difficulties require the analytics methods to operate in an unsupervised fashion. Moreover, the data are often collected from multiple sources [6]. Thus, it is necessary to employ the multiple-view clustering method [7], which is a critical technique for analyzing heterogeneous data from multiple sources [8]. Unlike single-view clustering [9,10], multiple-view clustering further integrates information from different sources that may have various data types and distributions. This integration poses significant challenges [11] to the multiple-view clustering approach. Most works dealing with multiple-view clustering focus on learning a unified clustering result that reflects consistent or complementary information contained in different sources [12][13][14][15]. However, they are unable to appropriately capture complex distributions (e.g., inseparable distributions) in each data source [16], which is problematic because these complex distributions reflect essential information in the data.
In order to capture complex distributions, recent multiple-view clustering methods have introduced multiple kernels into their learning procedure; these methods are known as multiple kernel clustering methods [17][18][19][20][21][22]. The multiple kernel clustering method adopts multiple kernels to learn complex distributions and reformulates the integration of multiple-source information as a convex optimization problem. Ordinarily, the multiple kernel clustering method first captures the complex distributions in each data source using various kernels, namely base kernels; these base kernels project heterogeneous data into a homogeneous representation space [23,24]. The method then integrates multiple-source information by means of a unified kernel learning procedure, i.e., it leverages a linear combination of several base kernels to generate a unified one. Subsequently, the method clusters objects according to the unified kernel, which reflects the complex object relationships among multiple sources. As a result, multiple kernel clustering can achieve remarkable clustering performance on multiple-source data with complex distributions.
Although multiple kernel clustering methods can effectively cluster multiple-source data with complex distributions, they may be heavily affected by the data incompleteness problem: some data values in one or multiple sources may be missing due to lacking observations, data corruption, or environmental noise. The data incompleteness problem exists in a variety of scenarios, including neuro-imaging [25], computational biology [26], text security analysis [27], and medical analysis [28]. Most current multiple kernel clustering methods cannot be applied directly to data affected by the incompleteness problem. The main reason is that a kernel matrix generated from a data source with missing values will itself be incomplete, and a multiple kernel clustering method cannot learn a unified kernel from incomplete base kernel matrices.
Although the above absent-kernel imputation methods can enable multiple kernel clustering in the presence of the data incompleteness problem, they are still affected by several issues. Firstly, most of the above absent-kernel imputation methods ignore the relations between kernels when imputing missing values. It is very important that these relations are considered during absent-kernel imputation, as they may reflect the redundant information contained in kernels [1]; this information can then be used as essential evidence for missing value imputation. Secondly, clustering methods that are integrated with the above advanced absent-kernel imputation methods are less effective and less efficient than other state-of-the-art clustering methods. For example, the clustering accuracy and time-efficiency of both the multiple kernel k-means clustering in [32] and the localized multiple k-means clustering method used in [33] are worse than those of the unsupervised multiple kernel extreme learning machine [21]. Consequently, the accuracy of these absent-kernel imputation methods may be unsatisfactory, and their time cost may also be very large.
To address the above problems, this paper proposes an absent multiple kernel clustering (AMKC) method, which learns on unlabeled incomplete data from multiple sources and achieves high effectiveness and a fast learning speed. The AMKC method first adopts multiple kernels that map data from multiple sources into multiple kernel spaces, where the missing values are randomly imputed. It then conducts a three-stage procedure to iteratively cluster data, integrate multiple-source information, and impute missing values. In the first stage, AMKC clusters data based on a learned unified kernel. This clustering process can converge in limited iterations with a fast speed and excellent clustering performance. In the second stage, AMKC constructs a new multiple-kernel-based data space from multiple sources that contain incomplete data; this is done in order to learn the kernel combination coefficients so as to construct the unified kernel, which will be further used in the first stage of the next iteration. This construction enables improved information integration on multiple-source data with complex distributions. In the third stage, AMKC imputes the missing values in each base-kernel matrix, jointly considering the clustering objective and the relations between the kernels. AMKC performs this three-stage procedure iteratively until its clustering performance converges. In summary, the main contributions of this paper can be outlined as follows:
(1) It provides an effective multiple kernel clustering method for incomplete multi-source data. As AMKC avoids a local optimal solution, it significantly improves clustering performance.
(2) It provides an efficient multiple kernel clustering method. AMKC can converge within a limited number of steps, which improves training speed and reduces the time cost of clustering.
(3) It provides a high-precision absent-multiple-kernel imputation method. AMKC considers not only the relations between different kernels, but also the ties between kernel relations and the clustering objective. Consequently, the proposed method generates reliable and precise complete kernels.
We carry out extensive experiments on six datasets in order to evaluate the clustering performance of AMKC. Moreover, we adopt averaged relative error to measure the degree of recovery of the absent-kernel matrices imputed by AMKC. The experimental results demonstrate that: (1) AMKC performs better than comparison methods on datasets with a high missing ratio; (2) AMKC's joint optimization and clustering process enable better clustering performance on the experimental datasets. This strong evidence supports the superior kernel imputation and clustering performance of AMKC.

The AMKC Workflow
The workflow of the AMKC method is illustrated in Fig. 1. In this architecture, AMKC adopts an iterative three-stage procedure to cluster data, learn kernel combination coefficients, and impute absent-kernel matrices. In the first stage, AMKC clusters data based on a learned unified kernel. This clustering process can converge in limited iterations, rapidly and with high clustering performance. In the second stage, AMKC constructs a new multiple-kernel-based data space to learn the kernel combination coefficients; this construction enables better information integration on multiple-source data with complex distributions. In the third stage, AMKC imputes the missing values in each base-kernel matrix. AMKC then performs this three-stage procedure iteratively until its clustering performance converges.

First Stage: Kernel K-Means Clustering
The first step in the first stage is to conduct kernel k-means clustering on a unified, imputed, and learned kernel. This unified kernel $K$, which is a weighted combination of the observed values in the absent-kernel matrices, is calculated by Eq. (1), where $m$ is the number of employed base kernel matrices and $K_p^{(cc)}$ is an optimal base kernel matrix, which is learned and imputed in the third stage in the next iteration. Initially, $K_p^{(cc)}$ inherits the observed values in the $p$-th absent-kernel matrix, with all other values set to 0. $\mu = [\mu_1, \mu_2, \ldots, \mu_m]$ is a set of combination coefficients that satisfies $\sum_{p=1}^{m} \mu_p = 1$ and $\mu_p \geq 0$. In the AMKC learning process, all coefficients in $\mu$ are initialized as $1/m$. Following initialization, these coefficients are learned in the second stage of AMKC's iterative three-stage procedure. With the unified learned kernel $K$, the objective function of kernel k-means clustering is formalized in Eq. (2), where the cluster assignment matrix $C = [c_{11}, \ldots, c_{1 n_c}; \ldots; c_{n1}, \ldots, c_{n n_c}] \in \{0, 1\}^{n \times n_c}$, $c_{ij}$ indicates whether the $i$-th object belongs to the $j$-th cluster, $n$ is the number of objects in a dataset, $n_{c_j}$ refers to the number of objects in the $j$-th cluster, and $1_n \in \{1\}^{n}$ is an $n$-dimensional vector in which all values are 1. It is worth noting that Eq. (2) is very difficult to solve directly, as the values in $C$ are discrete (i.e., either 0 or 1); one solution is to relax Eq. (2) by allowing $C$ to take real values. Accordingly, Eq. (2) can be reduced to Eq. (3), where $H = C L^{1/2}$ carries the clustering information. The clustering label of an object is determined by the elements in its corresponding row of $H$. Here, AMKC sets the clustering pseudo-label $y_i$ of the $i$-th object $x_i$ in the dataset as $\arg\max_j h_{ij}$, where $h_{ij}$ is the $ij$-th element of $H$. In order to obtain the optimal $H$ for improving the clustering results, we follow the approach in [37] to solve Eq. (3). As a result, the $n_c$ eigenvectors corresponding to the $n_c$ largest eigenvalues of $K$ are selected as the optimal $H$.
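The first stage can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes that Eq. (1) is the plain weighted sum $K = \sum_p \mu_p K_p^{(cc)}$ and that the relaxed problem in Eq. (3) is solved by taking the top $n_c$ eigenvectors of $K$, as the text states. The function names `unified_kernel` and `relaxed_kernel_kmeans` are hypothetical.

```python
import numpy as np

def unified_kernel(base_kernels, mu):
    """Weighted sum of base kernel matrices (assumed form of Eq. (1))."""
    return sum(w * K for w, K in zip(mu, base_kernels))

def relaxed_kernel_kmeans(K, n_clusters):
    """Return H (top eigenvectors of K) and row-wise argmax pseudo-labels."""
    K = (K + K.T) / 2.0                   # guard against tiny asymmetries
    eigvals, eigvecs = np.linalg.eigh(K)  # eigenvalues in ascending order
    H = eigvecs[:, -n_clusters:]          # eigenvectors of the largest ones
    labels = np.argmax(H, axis=1)         # pseudo-label y_i = argmax_j h_ij
    return H, labels
```

Because eigenvectors are only defined up to sign, the row-wise argmax is one of several possible discretizations of $H$; running k-means on the rows of $H$ is a common alternative.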

Second Stage: Kernel Combination Coefficients Learning
In the second stage, the set of combination coefficients $\mu$ in Eq. (1) is learned in a transformed space. Following [38], AMKC formulates the combination coefficient learning process as a binary classification problem. More specifically, AMKC first constructs a kernel feature space, referred to as K-space, based on the imputed kernel matrices and the clustering pseudo-labels learned in the first stage.
Initially, the $p$-th imputed kernel matrix is set as $K_p^{(cc)}$; this will be learned and imputed in the third stage following initialization. If the data in the K-space are denoted as $U$ and the given multiple-source dataset as $X$, then the transformation from the multiple kernel matrices of $X$ to $U$ is expressed by Eq. (4), where the $p$-th component is the result corresponding to objects $x_i$ and $x_j$ from the $p$-th kernel matrix, while $u_{(x_i, x_j)} \in \mathbb{R}^{1 \times m}$ is an object in $U$ transformed from objects $x_i$ and $x_j$ in $X$ by the $m$ base kernel matrices. Thus, all data in the K-space can be denoted as $U = [u_{(1,1)}, \ldots, u_{(n,n)}]$, while the label $s_{(x_i, x_j)}$ of $u_{(x_i, x_j)}$ in the K-space is defined from the pseudo-labels $y_i$ and $y_j$ of objects $x_i$ and $x_j$ by Eq. (5). Following the above K-space construction, AMKC learns the optimal combination coefficients $\mu$ for the unified kernel $K$ through one of two closed-form solutions [39], Eq. (6) or Eq. (7), where $S = [s_{(1,1)}, \ldots, s_{(n,n)}]$ and $C$ is a trade-off parameter. Eqs. (6) and (7) are suitable for datasets with different characteristics. AMKC uses Eq. (6) to quickly learn the optimal $\mu$ for a large-scale dataset. For data obtained from a large number of different sources, AMKC adopts Eq. (7) to calculate the optimal solution more efficiently. The learned optimal $\mu$ is employed to construct the unified and imputed kernel $K$ in the first stage of the next iteration.
272 CMC, 2021, vol.67, no.1
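The K-space construction and the closed-form update can be illustrated as follows. This is a sketch under explicit assumptions, since Eqs. (4) to (7) are not reproduced here: the pair label is assumed to be 1 when two objects share a pseudo-label and 0 otherwise, and the two closed forms are assumed to be the usual regularized least-squares pair $(I/C + U^\top U)^{-1} U^\top S$ and $U^\top (I/C + U U^\top)^{-1} S$. The names `build_k_space`, `learn_mu`, and the `primal` flag are hypothetical.

```python
import numpy as np

def build_k_space(base_kernels, pseudo_labels):
    """Stack kernel entries per object pair and derive pair labels."""
    n = base_kernels[0].shape[0]
    # Row i * n + j of U holds the m kernel values for the pair (x_i, x_j).
    U = np.stack([K.reshape(n * n) for K in base_kernels], axis=1)
    y = np.asarray(pseudo_labels)
    # Assumed label rule: 1 if both objects share a pseudo-label, else 0.
    S = (y[:, None] == y[None, :]).astype(float).reshape(n * n)
    return U, S

def learn_mu(U, S, C=1.0, primal=True):
    """Regularized least squares; both branches give the same solution."""
    N, m = U.shape
    if primal:
        # Solves an m x m system: cheap when there are few kernels.
        return np.linalg.solve(np.eye(m) / C + U.T @ U, U.T @ S)
    # Solves an N x N system: cheap when there are few K-space samples.
    return U.T @ np.linalg.solve(np.eye(N) / C + U @ U.T, S)
```

The sketch does not enforce $\sum_p \mu_p = 1$ or $\mu_p \geq 0$; how the method maps the closed-form solution onto these constraints is not stated in this section.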

Third Stage: Absent-Kernel Matrices Imputation
The AMKC method learns and imputes absent-kernel matrices in the third stage based on the clustering pseudo-labels and kernel combination coefficients learned in the first and second stages, respectively. The learning objective is formalized in Eq. (8). In the third stage, AMKC regards $\mu$ and $H$ as constants. AMKC utilizes the optimal $H$, which is generated on the $m$ base kernels in the first stage, and the optimal $m$ combination coefficients $\mu_1, \ldots, \mu_m$, which are optimized in the second stage. Thus, the optimization in Eq. (8) reduces to Eq. (9). When approached directly, the optimization problem in Eq. (9) appears to be intractable due to $K$. From the perspective of matrix decomposition, letting $Q = I_n - HH^\top$, Eq. (9) can be decomposed equivalently into $m$ independent sub-problems. The $p$-th sub-problem is formalized in Eq. (10), where $K_p(s_p, s_p)$ denotes the $p$-th kernel sub-matrix calculated from the objects $s_p$ whose $p$-th view is present. Similarly, the matrix $Q$ in Eq. (10) can be expressed in block form. As a result, by rewriting the optimization problem in Eq. (10), the closed-form expression in Eq. (11) is obtained for the optimal $K_p^{(cc)}$, which is used to impute and update the unified kernel $K$ in the next iteration.
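To make the block-form argument concrete, the sketch below imputes one base kernel given its observed block and $H$. It assumes the closed form used in MKKM with incomplete kernels [32]: writing the imputed kernel as $[K_{cc},\, K_{cc}W;\; W^\top K_{cc},\, W^\top K_{cc} W]$ and minimizing $\mathrm{Tr}(KQ)$ gives $W = -Q_{cm} Q_{mm}^{-1}$. Whether Eq. (11) has exactly this form is an assumption, and the name `impute_kernel` is hypothetical.

```python
import numpy as np

def impute_kernel(K_obs, observed, H):
    """Fill the missing rows/columns of one base kernel.

    K_obs    : kernel block over the observed objects (PSD, symmetric)
    observed : indices of objects whose view is present
    H        : n x n_c matrix from the clustering stage
    """
    n = H.shape[0]
    obs = np.asarray(observed)
    mis = np.setdiff1d(np.arange(n), obs)
    Q = np.eye(n) - H @ H.T
    Q_cm = Q[np.ix_(obs, mis)]
    Q_mm = Q[np.ix_(mis, mis)]
    # Stationary point of Tr(K Q) with respect to W (pinv for robustness).
    W = -Q_cm @ np.linalg.pinv(Q_mm)
    K = np.empty((n, n))
    K[np.ix_(obs, obs)] = K_obs
    K[np.ix_(obs, mis)] = K_obs @ W
    K[np.ix_(mis, obs)] = W.T @ K_obs
    K[np.ix_(mis, mis)] = W.T @ K_obs @ W
    return K
```

By construction the result equals $[I;\, W^\top]\, K_{cc}\, [I,\, W]$, so it stays positive semidefinite whenever the observed block is, and the observed entries are left unchanged.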

The AMKC Algorithm
The iterative three-stage procedure of AMKC is outlined in Algorithm 1.

Require: [...], and the convergence condition
Ensure: The combination coefficients $\mu$, a set of new kernels $K$ [...] with zero.
2: repeat
3: Utilize the optimal $\mu^{(t-1)}$ and $K^{(t-1)}$ to formulate the unified kernel $K^{(t)}$ by Eq. (1);
4: Conduct kernel k-means clustering on $K^{(t)}$ to obtain the optimal $H^{(t)}$ by solving Eq. (2);
5: Learn the optimal $\mu^{(t)}$ by Eq. (7) [...]

[...] is both incomplete and independent. Accordingly, in the iterative three-stage procedure, AMKC optimizes these three variables independently. At each stage, AMKC optimizes one of these variables and treats the others as constants. In this way, AMKC can obtain the local optimal values of these three variables after the iteration converges. AMKC determines convergence based on the change in the loss value $obj^{(t)}$ of the clustering objective function in Eq. (3), where $t$ refers to the $t$-th iteration. More specifically, AMKC defines a convergence index as $cov = (obj^{(t-1)} - obj^{(t)}) / obj^{(t)}$. If the value of $cov$ is smaller than a pre-defined threshold, AMKC stops the iterative procedure.
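The stopping rule can be written directly from the definition of the convergence index. In this small sketch the threshold name `eps` and its default value are hypothetical, since the symbol and value of the pre-defined threshold are not given in this section.

```python
def convergence_index(obj_prev, obj_curr):
    """cov = (obj^(t-1) - obj^(t)) / obj^(t), as defined in the text."""
    return (obj_prev - obj_curr) / obj_curr

def has_converged(obj_prev, obj_curr, eps=1e-4):
    """Stop when the relative decrease of the objective falls below eps."""
    return convergence_index(obj_prev, obj_curr) < eps
```

For example, an objective that drops from 10.0 to 8.0 gives an index of 0.25, so the iteration continues for any reasonably small threshold.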

Time Complexity
In order to demonstrate the fast learning speed of the proposed AMKC method, we theoretically analyze its time complexity in this section. The time complexity of AMKC is primarily determined by three components: kernel k-means clustering, kernel combination coefficient learning, and absent-kernel matrix imputation.
Suppose the number of objects is $n$ and the number of base kernels is $m$. For kernel k-means clustering in the first stage, the time complexity can be reduced from $O(n^3)$ to $O(n)$ via cluster shifting [40]. For kernel combination coefficient learning in the second stage, if the optimal $\mu$ is learned by solving Eq. [...] $\ldots + m^2 n + mn < O(n^3 + mn^2 + mn)$. Thus, in order to reduce the time complexity of each iteration and achieve a fast learning speed, AMKC should use Eq. (6) when $n$ is very large and Eq. (7) when $m$ is very large. Namely, the time complexity of AMKC in one iteration is $O(m^2 n)$ or $O(mn^2)$. Accordingly, the time complexity of AMKC over $n_t$ iterations is $O(m^2 n\, n_t)$ or $O(mn^2 n_t)$. In practical terms, $n_t$ also affects the efficiency of AMKC. Fortunately, AMKC is theoretically able to converge within finite iterations, as demonstrated in Theorem 1, and $n_t$ is empirically shown to be a very small number when the convergence condition is met, as shown in Fig. 7.

Convergence Analysis
In this section, we show theoretically that the AMKC algorithm converges within finite steps, which supports its fast learning speed. Assuming that the number of clusters $n_c$ is finite, for $n_t$ iterations, AMKC generates a series of imputed kernels $K$ (i.e., $K_1, K_2, \ldots, K_{n_t}$) and a series of cluster assignment matrices $C$ (i.e., $C_1, C_2, \ldots, C_{n_t}$). Given a cluster assignment matrix $C$ and an imputed unified kernel $K$, the clustering loss in AMKC is denoted as $f(C, K)$.
In the first stage, the kernel k-means clustering converges to a minimal solution. AMKC achieves $C = \arg\min_{C} f(C_{n_t-1}, K_{n_t-1})$ in the $n_t$-th iteration; moreover, since the unified kernel has been imputed and the clustering also converges to a minimal solution, $f(C, K)$ is strictly decreasing (i.e., $f(C_1, K_1) > f(C_2, K_2) > \cdots > f(C_{n_t}, K_{n_t})$). Assuming that $n_t \leq y + 1$, there are at least two identical assignment matrices in the series of cluster assignment matrices $C$ (i.e., $C_i = C_j$, $1 \leq i \neq j \leq n_t$). Because $C_i = C_j$, we can infer that $K_i = K_j$; therefore, the value of the clustering loss does not change (i.e., $f(C_i, K_i) = f(C_j, K_j)$). In this case, the convergence criterion of AMKC is satisfied and the AMKC algorithm stops. Since $n_t \leq y + 1$, AMKC (Algorithm 1) converges to a local optimum in finite iterations.

Original Datasets
In order to evaluate AMKC's performance, we conduct experiments on six datasets, namely Iris [41], Lib [41], Seed [41], Isolet [41], Cifar [42], and Caltech256 [43]. Of these datasets, Iris, Lib, Seed, and Isolet are collected from the UCI data repository [41] and are commonly used to evaluate multi-view learning methods [18,44] and multiple kernel learning methods [21,38]. Furthermore, these four datasets were all collected from different real-life scenarios with different characteristics, enabling a comprehensive evaluation of the proposed method from different perspectives. The remaining datasets, i.e., Cifar and Caltech256, have also been commonly used to evaluate machine learning methods in recent research. For Cifar and Caltech256, we chose 300 and 10 objects belonging to each cluster, respectively. The important statistics of these datasets, including the number of objects, base kernels, and classes, are listed in Tab. 1. The base kernels used in the experiments include three kinds of kernel: a linear kernel, polynomial kernels with degree in $\{2, 3, 4\}$, and Gaussian kernels with kernel width in $\{10^{-10}, 10^{-8}, 10^{-6}, 10^{-4}, 10^{-2}, 1, 10^{2}, 10^{4}, 10^{6}, 10^{8}, 10^{10}\}$.
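The base-kernel family described above can be generated as in the sketch below. The exact parameterizations are assumptions, since the text names only the kernel types and their parameters: the polynomial kernel is taken as $(x^\top z + 1)^d$ and the Gaussian kernel as $\exp(-\lVert x - z \rVert^2 / (2\sigma^2))$. All function names are hypothetical.

```python
import numpy as np

def linear_kernel(X):
    return X @ X.T

def polynomial_kernel(X, degree):
    # Assumed form: (x^T z + 1)^degree.
    return (X @ X.T + 1.0) ** degree

def gaussian_kernel(X, width):
    # Assumed form: exp(-||x - z||^2 / (2 * width^2)).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    np.fill_diagonal(d2, 0.0)  # remove rounding noise on the diagonal
    return np.exp(-d2 / (2.0 * width ** 2))

def build_base_kernels(X):
    """One linear, three polynomial, and eleven Gaussian kernels."""
    widths = [10.0 ** e for e in range(-10, 11, 2)]
    kernels = [linear_kernel(X)]
    kernels += [polynomial_kernel(X, d) for d in (2, 3, 4)]
    kernels += [gaussian_kernel(X, w) for w in widths]
    return kernels
```

With these settings each data source yields 15 base kernels; at the extreme widths the Gaussian kernel degenerates to nearly the identity matrix or nearly the all-ones matrix.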

Competitors
The clustering performance of AMKC is compared with that of other two-stage clustering methods for incomplete kernels. The comparison methods first complete the absent base kernels with special values learned by different imputation methods, and then conduct multiple kernel k-means clustering (MKKM) [45] on the imputed kernels. In our experiments, four representative imputation methods are employed: zero imputation (ZI), mean imputation (MI), k-nearest-neighbor imputation (KNN), and alignment maximization imputation (AF). For convenience, the methods combining MKKM with the different imputation methods are denoted as ZI + MKKM, MI + MKKM, KNN + MKKM, and AF + MKKM, respectively. Furthermore, the state-of-the-art method MKKM + IK [32], which iteratively performs clustering and kernel imputation, is also employed for comparison. We do not include the MVKC method [35] in our clustering performance comparison because of its high computational cost, even for small amounts of data. Instead, we compare MVKC with AMKC only in terms of imputation precision.
To reduce the impact caused by the tested datasets and randomness of the kernel k-means clustering, for the same parameters τ , θ p , and θ 0 , the absent-base-kernel matrices are randomly generated 10 times by randomly selecting different objects to be absent. Furthermore, for each series of generated absent-base-kernel matrices, we repeatedly carry out random initialization 20 times for extensive experiments.

Performance Measures
To accurately evaluate the clustering performance and effectiveness of the methods of interest, we measure the clustering results through three performance measures: clustering accuracy (ACC), normalized mutual information (NMI), and Purity. The ACC, NMI, and Purity results obtained under different parameter settings are each aggregated by averaging. Since the proposed method combines kernel imputation and clustering, we obtain new imputed kernel matrices and clustering results at the same time. In order to verify the degree of recovery of AMKC for absent kernel matrices, we measure the average relative error (ARE) [35,46] between the imputed values and the original values across all views.
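For reference, two of these measures can be computed from a cluster-by-class contingency table, as in the sketch below. It uses the standard definitions of Purity and ACC rather than any code from the paper; the ACC routine searches all cluster-to-class assignments by brute force, which is only practical for a small number of clusters. Function names are hypothetical.

```python
from itertools import permutations
import numpy as np

def contingency(y_true, y_pred):
    """Rows index predicted clusters, columns index true classes."""
    classes, t = np.unique(np.asarray(y_true).ravel(), return_inverse=True)
    clusters, p = np.unique(np.asarray(y_pred).ravel(), return_inverse=True)
    M = np.zeros((clusters.size, classes.size), dtype=int)
    np.add.at(M, (p, t), 1)
    return M

def purity(y_true, y_pred):
    """Fraction of objects that fall in the majority class of their cluster."""
    M = contingency(y_true, y_pred)
    return M.max(axis=1).sum() / M.sum()

def clustering_accuracy(y_true, y_pred):
    """Best one-to-one cluster-to-class matching (brute force, small k only)."""
    M = contingency(y_true, y_pred)
    k = max(M.shape)
    P = np.zeros((k, k), dtype=int)  # pad to a square table
    P[:M.shape[0], :M.shape[1]] = M
    best = max(sum(P[i, perm[i]] for i in range(k))
               for perm in permutations(range(k)))
    return best / M.sum()
```

For larger numbers of clusters, the matching step is usually solved with the Hungarian algorithm instead of enumerating permutations.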

Recovery of the Absent Kernels
To validate the degree of recovery achieved by the proposed AMKC when the base kernels are diverse, our results are compared with several state-of-the-art kernel matrix completion methods, namely Multi-view Kernel Completion (MVKC) [35] and MKKM with incomplete kernels (MKKM + IK), on the Iris dataset with various kernels. We generated three sets of kernel matrices with various kernels and different parameters. In more detail, as shown in Fig. 2, KH1, KH2, and KH3 denote the sets composed of three Gaussian kernels, of one linear kernel and two Gaussian kernels, and of three linear kernels, respectively. We then randomly generated the missing matrices based on KH1, KH2, and KH3 and applied the comparison methods to them. Finally, the average relative error (ARE) (Eq. (11)) [35,46] is used to measure the error between the predicted kernel matrices and the original matrices KH1, KH2, and KH3. The average relative error is computed over all missing data points for all views, where $n_p$ is the number of missing samples in the $p$-th view, $I^{(p)}$ is the set of indices of all missing data in the $p$-th view, and $K_p^{(cc)}$ and $K_p$ refer to the learned imputed kernel matrix and the original complete one, respectively. The ARE values for the comparison methods are presented in Fig. 2. We can see from the figure that the proposed method generally predicts missing values more accurately than MVKC and MKKM + IK on the Gaussian kernel. When the base kernels contain a non-linear kernel (Gaussian kernel), MVKC performs no better than MKKM + IK and AMKC (see Figs. 2a and 2b) due to its prior linearity assumption. Since the proposed method considers the connections among kernels as well as clustering guidance, it is able to recover the absent kernels more effectively regardless of the type of the composed kernels, as shown in Fig. 2.
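One way to compute such an error is sketched below. The exact ARE formula is not reproduced in this section, so the sketch assumes a row-wise definition: for each missing object in a view, the Euclidean norm of the difference between its imputed and original kernel rows is divided by the norm of the original row, the ratios are averaged over the $n_p$ missing objects of that view, and the per-view values are then averaged. The function name and argument layout are hypothetical.

```python
import numpy as np

def average_relative_error(K_imputed, K_true, missing_idx):
    """Assumed ARE: mean relative row error over missing objects and views.

    K_imputed, K_true : lists of n x n kernel matrices, one per view
    missing_idx       : list of index arrays, one per view
    """
    per_view = []
    for K_hat, K, idx in zip(K_imputed, K_true, missing_idx):
        idx = np.asarray(idx)
        num = np.linalg.norm(K_hat[idx] - K[idx], axis=1)
        den = np.linalg.norm(K[idx], axis=1)
        per_view.append(np.mean(num / den))
    return float(np.mean(per_view))
```

Under this definition a perfect imputation scores 0, and filling the missing rows with zeros scores exactly 1.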

Clustering Performance Analysis
The clustering results of the different clustering methods for incomplete data are shown in Figs. 3-5, which compare the ACC, NMI, and Purity, respectively, of six clustering methods on each dataset. It can be seen that the proposed method outperforms the other comparison methods on all six datasets. Moreover, the clustering performance of the proposed AMKC remains robust as the missing ratio grows, while the clustering performance of the other comparison methods, especially ZI + MKKM, MI + MKKM, and KNN + MKKM, declines quickly. As ACC measures, to some extent, the distance between the learned clustering pseudo-labels and the actual ones, the higher ACC at large missing ratios suggests that the proposed method is able to effectively restore datasets through its iterative process.
In order to investigate the effectiveness of the proposed method comprehensively, the aggregated ACC, NMI, and Purity along with their standard deviations (mean ± std) are listed in Tabs. 2-4, respectively. The proposed method obtains the best performance, shown in bold. Namely, the proposed method performs better than the other comparison methods on all datasets, which is consistent with the conclusion from Figs. 3-5. The superiority of the proposed method can be attributed to the joint optimization of clustering, combination coefficients, and kernel imputation. In the three-stage procedure of the proposed method, the clustering information is employed to guide the kernel imputation towards an optimum, while a well-imputed unified kernel in turn promotes the clustering result. Thus, the clustering performance of the proposed method can be greatly improved.

Comparison with Baseline Algorithm
Since the proposed method, as an extension of the unsupervised multiple kernel extreme learning machine (UMK-ELM) [21], can simultaneously achieve clustering and kernel imputation, we provide TUMK-ELM (the two-stage UMK-ELM) [37] as a baseline and compare the clustering performance of the proposed method with that of TUMK-ELM combined with the four aforementioned imputation methods: ZI, MI, KNN, and AF imputation (referred to as ZI + TUMKELM, MI + TUMKELM, KNN + TUMKELM, and AF + TUMKELM).
Here, we carry out the experiments only on the Iris, Lib, and Seed datasets. Taking ACC as an example, the ACC values of each method on each of these datasets are calculated for a variety of missing ratios. Our experimental results (see Fig. 6) demonstrate that the performance of the proposed method (black line) is the closest to that of TUMK-ELM (red line), for which the datasets are complete. This indicates that the proposed method can not only obtain good clustering results, but also achieve outstanding imputation performance.

Convergence Speed Analysis
In order to investigate the convergence speed of the proposed AMKC method, additional experiments are carried out on three main datasets (Iris, Lib, and Seed). The objective value (obj in Algorithm 1) in each iteration for the fixed missing ratio 0.9 is shown in Fig. 7. From Fig. 7, we can observe that the objective value tends to converge quickly. As the convergence speed of the objective value determines AMKC's convergence speed, the experimental results demonstrate that AMKC can converge within only a few iterations on these three datasets. Thus, the proposed method has a fast convergence speed.

Conclusion
As the multiple-kernel clustering method has promising and competitive performance, it can be widely employed in various applications. In order to cope with incomplete data or base kernels, we proposed a new multiple-kernel clustering method with absent kernels, which jointly clusters the data and imputes the incomplete kernels to achieve better clustering performance. Our method iteratively performs three stages, utilizing an optimization strategy to obtain optimal clustering information, combination coefficients, and imputed kernels, so that better clustering with absent kernels is achieved. Extensive experiments on six datasets have verified the improved performance of the proposed method.