A Tradeoff Between Accuracy and Speed for K-Means Seed Determination

With the sharp increase in information volume, analyzing and retrieving vast amounts of data has become more essential than ever. One of the main techniques beneficial in this regard is clustering. Clustering aims to group objects so that all objects within a cluster share similar features while objects in different clusters are as distinct as possible. One of the most widely used clustering algorithms, with well-proven performance in different applications, is the k-means algorithm. The main problem of the k-means algorithm is that its performance is directly affected by the selection of the initial seeds. Lack of attention to this crucial issue has consequences such as empty clusters and slower convergence. Moreover, the selection of appropriate initial seeds can reduce cluster inconsistency. In this paper, we present a new method for determining the initial seeds of the k-means algorithm that improves accuracy and decreases the number of iterations. For this purpose, the proposed method considers the average distance between objects when determining the initial seeds. Our method attempts to provide a proper tradeoff between the accuracy and speed of the clustering algorithm. The experimental results show that our proposed approach outperforms the Chithra method by 1.7% and 2.1% in clustering accuracy on the Wine and Abalone datasets, respectively. Furthermore, the results indicate that, compared with the Reverse Nearest Neighbor (RNN) search approach, the proposed method has a higher convergence speed.


Introduction
Machine Learning (ML) is a widely used field in data science that tries to increase the learning abilities of machines by employing learning algorithms. It is classified into two categories: supervised and unsupervised learning methods [1,2]. One of the unsupervised ML methods, deployed to create clusters of similar objects, is called clustering. Ideally, after clustering, all objects within a cluster have the highest similarity while objects of different clusters are well distinguishable [3,4]. Clustering can not only analyze data without labels but also label them [5]. Cluster-based analysis has numerous applications in marketing, economic sciences, pattern recognition, data analysis, image processing, text identification, ad-hoc and sensor networks, and security [3,6-8].
The k-means algorithm is one of the most popular algorithms in this field due to its simplicity [9]. The steps of the k-means algorithm are [10]:
1. The k points are selected randomly as the initial seeds (k is selected among the n available points).
2. Each object is assigned to the corresponding cluster according to its similarity to the initial seeds.
3. Cluster centers are computed. Each cluster center is the average of the objects in that cluster.
4. Using the new cluster centers, the second and third steps are repeated until no change is seen in the clusters (the algorithm then terminates).
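The basic procedure above can be sketched in a few lines of Python. This is a minimal illustration for clarity, not the authors' implementation (the paper's experiments use a separate C++ code base); the function name and parameters are chosen here for exposition.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means: random initial seeds, assign, update, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: k random initial seeds
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # step 2: assign each object to the nearest seed (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # step 3: recompute each center as the mean of its cluster
        new_centers = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # step 4: stop when no center changes
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

For example, on two well-separated pairs of 2D points with k = 2, the algorithm converges in a couple of iterations to one pair per cluster.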
Despite the widespread usage of the k-means algorithm, it suffers from a few drawbacks. These include the need to determine the number of clusters before running the algorithm [11], its restriction to dense clusters with spherical shape [12], and high sensitivity to outliers and noise because of the squared Euclidean distance metric [13]. Moreover, convergence to a local optimum [14,15] and the performance dependency on the initial seed selection [14,15] are further drawbacks. Nevertheless, the clustering sensitivity to the initial seeds is the most noticeable issue because it can lead to empty clusters and even slower run time [16]. Also, selecting appropriate initial seeds decreases inter-cluster inconsistencies. In other words, when the initial seeds are near their optimum values, the clustering quality is significantly improved, and vice versa. All of these problems, except determining the number of clusters, can be addressed by selecting proper initial seeds at an early stage [14].
Therefore, this research focuses on properly determining the initial seeds of the k-means algorithm. The proper selection of initial seeds increases the clustering accuracy and reduces the total number of iterations: the closer the initial seeds are to their optimum values, the higher the clustering accuracy and, consequently, the fewer iterations are required [16]. This is a considerable advantage for processing big datasets.
Thus far, many efforts have been made to select optimal seeds. In existing methods, the clustering accuracy and run time are usually inversely related: the clustering speed decreases as the clustering accuracy increases. In this paper, a seeding method for the k-means algorithm is presented that provides a proper tradeoff between accuracy and speed. We mainly focus on outlier data to reduce the sensitivity to noise and outliers; it is critical to avoid choosing outliers as initial seeds. Appropriately selected seeds yield a noticeable improvement in both accuracy and speed. In contrast to most of the available literature, we consider several important aspects such as data distribution and outlier removal. Finally, our proposed method is compared with several other approaches using two criteria: accuracy and the number of convergence iterations. The rest of this paper is organized as follows: the literature and background of this research are presented in the next section. In Section 3, our proposed method is discussed in detail. The experimental results are shown in Section 4. Finally, in Section 5, conclusions and suggestions for future work are presented.

Related Work
The quality of a clustering algorithm is considerably affected by the initial seeds. Thus, several approaches have been presented to address this problem [17]. In this section, a number of the most well-known seed initialization methods, their strengths and weaknesses, and their sensitivity to outliers are investigated.
The first method for seed initialization was presented in [18]. In this method, the initial k seeds are selected randomly from the dataset, and the remaining points are assigned to the nearest seed. This method relies on the fact that, with random selection, points near the centers of dense regions (points that are good candidates) are more likely to be chosen as initial seeds [3]. However, despite its simplicity and high speed, different runs might produce different results.
In [19], only k seeds are selected randomly, and each remaining point is assigned to the nearest of the k cluster centers; the k-means update is run before the next point is assigned. The disadvantage of this method is its high computational complexity; therefore, it is not suitable for large databases [20].
Faber [21] also chose the initial seeds randomly, but from the data points in the dataset itself rather than from the data space. This method can lead to more accurate clustering results in many cases. In several methods, the first k points are taken as the initial seeds; if the data points are randomly ordered, this is equivalent to the Faber method [9].
In various methods, the initial seeds are determined based on the data density [9,22]. The main problem of these methods is their long run time, since computing the densities and ordering them is costly [3]. In 2012, Kathiresan et al. [23], instead of using data density, ranked the data based on the Z-score, which indicates how far a given value lies above or below the mean [14].
In Tou et al. [24], the first point of the dataset is selected as the first seed. Then, the distance from this point to the next point in the dataset is computed; if it exceeds a threshold, that point is considered the second seed. Otherwise, the next point in the dataset is examined, and these steps are repeated until the second seed is found [16]. Once the second seed is selected, the distances from a candidate point to the two previous seeds are computed and compared with the threshold; if both exceed the threshold, the point is chosen as the third seed. These steps continue until the kth seed is selected [14]. In this method, the user can control the distance between cluster centers, but the results depend on the order of the data in the dataset. The main problem of this method is that the threshold value must be determined by the user [3].
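The threshold-based selection described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code; the function name and the `threshold` parameter are chosen here for exposition.

```python
def threshold_seeds(data, k, threshold):
    """Tou et al.-style seeding: scan the dataset in order, accepting a point
    as a seed only if it is farther than `threshold` from every chosen seed."""
    seeds = [data[0]]  # the first point of the dataset is the first seed
    for p in data[1:]:
        if len(seeds) == k:
            break
        # Euclidean distance from the candidate to each already-chosen seed
        if all(sum((a - b) ** 2 for a, b in zip(p, s)) ** 0.5 > threshold
               for s in seeds):
            seeds.append(p)
    return seeds
```

Note how the result depends on the dataset order: a candidate too close to an existing seed is simply skipped, so reordering the data changes which points are accepted.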
In the Boryczka [25] method, the point in the densest region of the data is considered the first seed. The next seed is selected based on the reduction of the Sum of Squared Errors (SSE): at each step, a point located at a greater distance from the previously selected seeds is chosen [14]. This process continues until the kth point is found. Because the distance between each pair of objects is computed in each iteration, the time complexity of this method is O(n^2). When using this method for big data, a small subset of the dataset is used to select the initial seeds. Another method for selecting these small subsets has been proposed in [26].
The KKZ algorithm (Katsavounidis et al. [27]) is similar to the previous method, but it takes an edge point of the data as the first seed. Then, the farthest point from the first seed is selected as the second seed. Next, for every point, the distance to its nearest seed is calculated, and the point with the largest such distance is selected as the third seed [3]. These steps continue until the kth seed is determined. Outliers and noise in the dataset can severely degrade this method [28].
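A KKZ-style selection can be sketched as below. This is a common reading of the algorithm (first seed = point of largest norm, next seeds = farthest-from-nearest-seed), offered as an illustration rather than a faithful reproduction of [27].

```python
def kkz_seeds(data, k):
    """KKZ-style seeding: start from an extreme point, then repeatedly pick
    the point whose distance to its nearest already-chosen seed is largest."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    # first seed: an "edge" point, here the point with the largest norm
    seeds = [max(data, key=lambda p: sum(a * a for a in p))]
    while len(seeds) < k:
        # next seed: the point farthest from its nearest seed
        seeds.append(max(data, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds
```

The farthest-point rule is exactly what makes the method outlier-sensitive: an isolated noise point maximizes the distance to its nearest seed and is therefore likely to be picked.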
In [29], a new approach called Bootstrap was presented. This approach initially selects S subsets of samples randomly from the primary data (S = 10 was recommended). Then, the k-means algorithm is run on each subset. The authors claimed that their algorithm is robust against outlier data with high efficiency [17]. The main disadvantage of this method is its massive computation requirement; therefore, it cannot be used in situations where time, space, and speed matter. In some cases, to improve the results, the k-means method is combined with hierarchical clustering [14,30]. Since outliers and noise cause inappropriate clustering results, several methods avoid selecting them as initial seeds. In the Random Partition Method [31], the dataset is divided into k clusters: each point is allocated to one of the clusters based on a uniform distribution, and then the center of each cluster is calculated as its initial seed. These steps are repeated several times to achieve the best clustering result [31]. This method is fast, with linear time complexity. Its main drawback is that it yields different clustering results on different runs. If an outlier is selected as a seed, it may lead to an empty cluster. Also, this approach has no mechanism to prevent close points from being chosen as initial seeds.
In [32], an advanced algorithm was developed to find the initial seeds. In the first step, the distance between each pair of points is calculated. The two closest points are added to A1 and removed from the dataset D. Then, points closest to the elements of A1 are inserted into A1 and removed from D. This rule is repeated until the number of elements in A1 reaches a threshold. In the second stage, A2 is created, and this procedure continues until the kth set, Ak, is filled. Finally, the average of all elements of Ai is considered the ith seed (i = 1, 2, …, k).
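The set-growing construction of [32] can be sketched as follows. The per-set threshold is exposed here as a hypothetical `size` parameter, since the excerpt does not state how it is chosen; the rest follows the description above.

```python
def pairwise_seed_sets(data, k, size):
    """Sketch of the seeding in [32]: grow k sets A1..Ak, each started from the
    closest remaining pair and extended with the nearest remaining points;
    the mean of each set becomes one seed."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    remaining = list(data)
    seeds = []
    for _ in range(k):
        # start the set Ai with the closest remaining pair of points
        p, q = min(((p, q) for i, p in enumerate(remaining)
                    for q in remaining[i + 1:]),
                   key=lambda pq: dist(*pq))
        group = [p, q]
        remaining.remove(p)
        remaining.remove(q)
        # extend Ai with the point closest to any of its current elements
        while len(group) < size and remaining:
            nxt = min(remaining, key=lambda r: min(dist(r, g) for g in group))
            group.append(nxt)
            remaining.remove(nxt)
        # the ith seed is the mean of the elements of Ai
        seeds.append(tuple(sum(xs) / len(group) for xs in zip(*group)))
    return seeds
```

The initial all-pairs search is what gives the method its quadratic cost in the number of points.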

Methodology
In this section, our novel and efficient methodology for determining the initial seeds of the k-means algorithm is presented. Before diving into our model, it is necessary to review two relevant studies introduced by Chithra et al. [8] and Karbasi [5]. Then, our proposed method is described in more detail.

Paniz Approach
This method is similar to k-means++, one of the most popular methods in the k-means family, which reduces the number of iterations and converges quickly with higher accuracy [33]. The details are as follows: first, a particular point among all the data, denoted Ext, is identified. This point should have the largest gap from the others. Fig. 1 shows a dataset with ten samples in 2D space (the Ext point is marked in red). The Euclidean distance from each remaining point to Ext is computed (1.22, 2.01, 3.04, 3.51, 8.01, 9.27, 10.10, 11.14, and 12.02). Then, all points are sorted by this distance (see Fig. 1). After sorting, the difference between each two consecutive distances is obtained.
After calculating the differences between consecutive distances, the mean of these differences is obtained (here, the mean value is 1.35). Then, all the distances are divided into m groups; the value of m is unknown before this step. According to Fig. 1, two successive distances whose difference is smaller than or equal to the mean value are placed in the same group. In this example, m = 2 is obtained.
Depending on the value of k, one of the following three cases may occur:
A) If k < m, groups are merged until exactly k groups exist, and the central element of each resulting group is selected as the initial seed of the corresponding cluster. First, q = ⌊m/k⌋ is calculated, and the remainder is stored in the variable r. If q > 1, then k − 1 supergroups with q groups each, along with a single supergroup containing q + r groups, are created (for example, for m = 7 and k = 3, we have q = 2 and r = 1, so there are two supergroups with two groups each and one supergroup with three groups (q + r = 3)). The median value of each supergroup is taken as an initial seed. On the other hand, if q equals 1, k groups are randomly selected from the m groups, and the median of the elements of each selected group is used as a seed.
B) If k = m, the median of the elements of each group is chosen as an initial seed.
C) If k > m, k points are selected from the different groups using a round-robin rule, starting from the furthest group and proceeding toward the first group; if k points have not been chosen upon reaching the first group, the procedure wraps around and continues until all k points are selected. In this case, more than one point must be chosen from some groups: q = ⌊k/m⌋ is computed, and the remainder is stored in the variable r. We then select q + 1 values from r of the groups and q values from the remaining m − r groups (for example, if m = 2 and k = 5, then q = 2 and r = 1, so three values are selected from one group and two values from the other).
One of the essential advantages of this method is its high convergence speed compared to many other methods. Placing members whose distances are less than or equal to the average distance in the same group leads to a more accurate result: the initial seeds are closer to the final values, and the algorithm converges in its very early steps. It is worth mentioning that in large databases, eliminating even one iteration of the k-means algorithm leads to a significant increase in speed [5].
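The grouping step at the heart of the Paniz approach can be sketched as follows; it reproduces the worked example above (mean gap 1.35, m = 2) when fed the nine distances from Fig. 1. The function name is chosen here for exposition.

```python
def group_by_mean_gap(dists):
    """Paniz-style grouping: sort the distances to the Ext point, then start a
    new group whenever the gap between two consecutive distances exceeds the
    mean of all consecutive gaps."""
    d = sorted(dists)
    gaps = [b - a for a, b in zip(d, d[1:])]
    mean_gap = sum(gaps) / len(gaps)  # e.g. 1.35 for the Fig. 1 distances
    groups = [[d[0]]]
    for gap, val in zip(gaps, d[1:]):
        if gap <= mean_gap:
            groups[-1].append(val)  # gap within the average: same group
        else:
            groups.append([val])    # larger gap: start a new group
    return groups
```

On the Fig. 1 distances, the single large gap between 3.51 and 8.01 (4.50 > 1.35) splits the data into exactly two groups, matching m = 2 in the text.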

Chithra Approach
In this method, which is based on numeric attributes, the range of attribute values is first calculated and stored in the variable R; the range is the difference between the maximum and minimum values. The step G is the result of dividing R by the number of clusters. The first seed is the sum of the minimum value and G, while each subsequent seed is obtained by adding G to the previous seed [8].
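The Chithra seeding on a single numeric attribute can be sketched in a few lines. This follows the description above (first seed = minimum + G, each next seed = previous + G); the function name is chosen here for exposition.

```python
def chithra_seeds(values, k):
    """Chithra-style seeding: split the value range R = max - min into k equal
    steps G and place the seeds at min + G, min + 2G, ..., min + kG."""
    lo, hi = min(values), max(values)
    g = (hi - lo) / k        # G = R / k
    seeds = [lo + g]         # first seed: minimum plus G
    for _ in range(k - 1):
        seeds.append(seeds[-1] + g)  # each next seed: previous seed plus G
    return seeds
```

Note that the seeds depend only on the minimum and maximum, which is exactly the weakness analyzed in the next subsection: the distribution of the values between those extremes plays no role.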

The First Proposed Algorithm
The main problem of the Chithra method is its lack of attention to the distribution of the data in the data space. For example, for both widely distributed data and high-density data, the initial seeds are computed in the same way. As noted above, their algorithm uses only the minimum and maximum values to calculate the initial seeds, without any knowledge of the data distribution: after obtaining the range of values, it is divided into k equal portions, so each seed is at the same distance from the previous one. For some data, however, the seed distances should not be equal. As a result, this reduces the clustering accuracy and increases the convergence time of the algorithm.
Lack of attention to the extent of the data distribution causes unwanted problems, such as empty clusters. To solve this problem, our algorithm involves the following steps:
1. Definition of G according to Eq. (1).
2. Initialization of C1 according to Eq. (2).
3. Application of the updating procedure for Ci according to Eq. (3).
S = {x | x ∈ Ds}, where Ds is the data space.
Figure 1: The grouped data points
Considering the mean distance, if the distances between the points are large, each seed is well separated from the previous one. However, Ci may exceed the maximum value of the data due to high data variance. To overcome this problem, we propose to compare Ci dynamically with the maximum value while determining the seeds and update it accordingly. If Ci is greater than the maximum value, α (the coefficient of G in the third step of our algorithm) is gradually decreased from its default value (α = 1) to lower values (α ∈ {1, 0.75, 0.5, 0.25}): it is changed to 0.75 at the first update; if the problem arises again, it is changed to 0.5 at the second update; and if the problem still persists, the coefficient of G is reduced to 0.25.
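Equations (1)-(3) are not reproduced in this excerpt, so the following sketch only illustrates the overall flow under stated assumptions: G is taken as the range divided by k (a placeholder for the distance-based Eq. (1)), C1 = min + G/2 (following the "minimum plus half of the mean" phrasing in the Conclusion), and Ci = C(i-1) + αG, with α stepping down through 1, 0.75, 0.5, 0.25 whenever a seed would exceed the maximum. None of these formulas should be read as the authors' exact equations.

```python
def first_method_seeds(values, k):
    """Sketch of the first proposed seeding on one numeric attribute.
    The step G, the C1 offset, and the alpha schedule are assumptions
    (Eqs. (1)-(3) are not available in this excerpt)."""
    lo, hi = min(values), max(values)
    g = (hi - lo) / k            # placeholder for Eq. (1); the paper bases G on average distances
    seeds = [lo + g / 2]         # assumed Eq. (2): C1 = min + G/2
    alphas = [1.0, 0.75, 0.5, 0.25]
    a = 0
    while len(seeds) < k:
        c = seeds[-1] + alphas[a] * g          # assumed Eq. (3): Ci = C(i-1) + alpha*G
        while c > hi and a < len(alphas) - 1:  # dynamic check against the maximum
            a += 1                             # shrink alpha: 1 -> 0.75 -> 0.5 -> 0.25
            c = seeds[-1] + alphas[a] * g
        seeds.append(c)
    return seeds
```

The point of the α schedule is the dynamic comparison with the maximum described above: instead of letting a seed overshoot the data, the step is shrunk until the seed falls back inside the range.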

The Second Proposed Algorithm
The main drawback of the Chithra method is that it ignores outliers. Although outlier detection is an extra step, it can affect the clustering results noticeably, particularly in applications where clustering has to be run dynamically and frequently. Consider the example shown in Fig. 2, with two clusters and two outlier points. To cluster these data using the k-means algorithm, two initial seeds must be determined in the first step (Fig. 2). The initial seeds are then selected and clustering is performed according to the Chithra algorithm (Fig. 3). In the next step, the average of each cluster is calculated as the new center of that cluster. After determining these centers, the data clustering does not change from the previous stage, and the algorithm terminates. The final clustering of the data is shown in Fig. 4.
As clearly shown, the outliers are allocated to one cluster while the rest of the data (the mainstream of the data) are clustered into the other. This happens because Chithra tries to divide the data range into k equal parts without any attention to the data distribution. For this reason, if outliers are present in the dataset (something that is unavoidable in the real world), the clustering quality can decrease strongly; in some cases, this may even lead to empty clusters. For example, consider the dataset presented in Fig. 5; Fig. 6 shows the initial seeds and the clustering results after the first iteration of the k-means algorithm. Next, the average of each cluster is determined as the new center (Fig. 7), and the clusters are reconfigured with respect to the assigned centers. As shown in Fig. 7, no data is assigned to the second cluster (the same as in the previous step). Because the cluster centers remain constant during this step, the algorithm terminates; the resulting clustering is shown in Fig. 7. Although the first proposed method considers the data sparsity, it might still lead to empty clusters: our evaluation shows that outliers are selected as initial seeds in some scenarios. To solve these issues, we first divide the data into m groups based on the Paniz algorithm and remove groups with three or fewer elements. Outlier objects are farther from the rest of the data than average; since our clustering method employs the average distance, outliers are separated from the others. In other words, because outlier objects usually lie at more than the average distance, they are placed in separate groups. Here, m is the number of groups remaining after this removal. For instance, if we obtain five groups at the beginning and one is removed, m is considered to be 4.
Recalling the three cases described in Section 3.1, the first proposed method is used to determine the initial seeds instead of taking the middle element of each group. The second proposed method is as follows:
1. Detect outliers based on the Paniz approach.
2. Apply the following rules to the clusters obtained in the previous step: definition of G according to Eq. (1); initialization of C1 according to Eq. (2); application of the updating procedure for Ci according to Eq. (4).
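The two-stage idea (Paniz grouping, then outlier removal, then seeding on the cleaned data) can be sketched as below. Since Eqs. (1)-(4) are not reproduced in this excerpt, a simple range split over the cleaned data stands in for the per-group seeding; the `max_outlier_size` parameter encodes the "three or fewer elements" rule and is named here for exposition.

```python
def second_method_seeds(values, k, max_outlier_size=3):
    """Sketch of the second proposed seeding: group values Paniz-style, drop
    groups with <= 3 elements as outliers, then seed the remaining data.
    The final range split is a stand-in for Eqs. (1)-(4)."""
    s = sorted(values)
    gaps = [b - a for a, b in zip(s, s[1:])]
    mean_gap = sum(gaps) / len(gaps)
    groups = [[s[0]]]
    for gap, v in zip(gaps, s[1:]):
        if gap <= mean_gap:
            groups[-1].append(v)   # within the mean gap: same group
        else:
            groups.append([v])     # larger gap: new group
    # drop small groups (three elements or fewer) as outliers
    kept = [g for g in groups if len(g) > max_outlier_size]
    cleaned = [v for g in kept for v in g]
    # stand-in for Eqs. (1)-(4): seed the cleaned range with k equal steps
    lo, hi = min(cleaned), max(cleaned)
    step = (hi - lo) / k
    return [lo + step / 2 + i * step for i in range(k)]
```

On data with two tight value groups plus one isolated value, the isolated value ends up in its own small group and is dropped, so the seeds are placed only over the mainstream of the data.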

Experimental Results
To evaluate the performance of our proposed method, we use the Iris, Wine, and Abalone datasets, which are popular datasets for clustering [22]. Tab. 1 shows the information about each dataset used in the evaluation.
The clustering pipeline consists of two parts, the initial seed selection and the k-means algorithm, both implemented in C++. The criteria used to evaluate the performance of the methods are the Accuracy (Acc) and the Number of Convergence Iterations (NCI). Accuracy is one of the essential performance metrics for clustering results: Acc shows what percentage of the dataset is appropriately clustered. The other criterion, NCI, counts the number of iterations required to achieve convergence; a lower NCI indicates better performance and occurs when the initial seeds are selected close to the true cluster centers. In this research, the Euclidean and Manhattan distance measures are employed to measure the difference between two elements in the clustering procedure. Better accuracy is usually obtained with the Euclidean distance: our first and second proposed methods achieve approximately 0.43% and 0.63% higher accuracy with the Euclidean distance than with the Manhattan distance, respectively. The accuracies obtained with both distances are shown in Tab. 2; in the following, we only use the Euclidean distance. In Tab. 3, we compare the clustering results achieved on the different datasets in terms of Acc and NCI. Next, we compare our results with the available algorithms under similar conditions. Tabs. 4-6 provide a wide comparison on the Iris, Wine, and Abalone datasets based on Acc, and Tabs. 7 and 8 compare the clustering results with [8,11,26,34,35] based on NCI. The NCI of our proposed method is 1 and 5 for the Iris and Wine datasets, respectively, while it is 151 and 252 for the Reverse Nearest Neighbor (RNN) approach. These noticeable achievements demonstrate that our algorithm can spot initial seeds very close to the actual cluster centers in its very early iterations.
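The paper does not spell out how predicted clusters are matched to true classes when computing Acc; a common convention, sketched here, maximizes the agreement over all permutations of the cluster labels (feasible for the small k used in these datasets). This is an assumed definition, not the authors' stated one.

```python
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels, k):
    """Acc as the fraction of objects placed in the right cluster, maximized
    over relabelings of the k predicted clusters (brute force over k!)."""
    n = len(true_labels)
    best = 0
    for perm in permutations(range(k)):
        # relabel predicted cluster p as perm[p] and count matches
        hits = sum(t == perm[p] for t, p in zip(true_labels, pred_labels))
        best = max(best, hits)
    return best / n
```

For example, a clustering whose labels are a pure swap of the true classes still scores Acc = 1.0, since cluster identities are arbitrary.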
It is worth noting that a lower NCI leads to a faster execution time. The increase in accuracy and the decrease in NCI result from considering the data distribution and removing outliers in our proposed algorithm.

Conclusion
Clustering is an unsupervised machine learning method that aims to create clusters of similar objects. Clustering has received considerable attention due to its deep and wide usage in different applications such as marketing, economic sciences, pattern recognition, computer vision, and information analysis. One of the most popular clustering algorithms is the k-means algorithm, owing to its simplicity. Nevertheless, it has some disadvantages, such as the requirement to determine the number of clusters in advance, its restriction to dense clusters with a spherical shape, high sensitivity to outliers, the local optimum convergence problem, the performance dependency on the selection of the initial seeds, and the dead-unit problem. Among these, the performance dependency on the initial seed selection is the most noticeable, because all the other problems (except the number of clusters) can be mitigated by selecting appropriate initial seeds. On the other hand, speed and accuracy are inversely related: increasing one causes the other to decrease, and vice versa.
Our proposed method tries to establish a proper tradeoff between these two factors. In our method, points whose distance is less than or equal to the average distance are allocated to the same group. Groups whose number of members is less than or equal to a threshold are then considered outliers and removed from the dataset. Next, the minimum plus half of the mean value of the group is calculated as the seed. Because our algorithm distributes the seeds across the whole data space and avoids outliers as initial seeds, both high Acc and low NCI are obtained together. This achievement is essential, particularly for big data applications. The proposed algorithm prevents the point-of-spot problem and is a good choice for large databases. Our proposed method was evaluated with the Iris, Wine, and Abalone datasets. In brief, the experimental results showed that our proposed approach outperforms the Chithra method by 1.7% and 2.1% in clustering accuracy on the Wine and Abalone datasets, respectively. Besides, we achieved an NCI of 1 and 5 for the Iris and Wine datasets, respectively, compared with 151 and 252 for the Reverse Nearest Neighbor (RNN) approach. These noticeable achievements show that our algorithm is strong enough to find initial seeds near their optimum values in its very early iterations.
Currently, most clustering methods are designed for static data, and in the current study, the performance of our proposed algorithm was likewise evaluated on static data. In future work, we will study the performance and effectiveness of the proposed method on dynamic data, based on more complex outlier removal algorithms built on machine learning and deep learning [36-43].
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.