Data clustering is crucial in data processing and analytics, and new clustering methods address the challenge of evaluating and extracting knowledge from big data. Data to be grouped may be numerical or categorical, yet existing clustering methods favor numerical data and largely ignore categorical data. Until recently, the only way to cluster categorical data was to convert it to a numeric representation and then apply existing numeric clustering methods; however, such algorithms cannot exploit the structure of categorical data. Extensions of traditional methods to categorical data were subsequently suggested, and in recent years several new clustering methods and extensions have been proposed. ROCK is an adaptable and straightforward algorithm that clusters data sets by calculating the similarity between them. This paper modifies the algorithm by creating a parameterized version that takes specific algorithm parameters as input and outputs satisfactory cluster structures. The modified algorithm is named the parameterized ROCK algorithm (P-ROCK). The proposed modification makes the original algorithm more flexible through user-defined parameters. Detailed hypotheses were developed and later validated with experimental results on real-world datasets using the proposed P-ROCK algorithm; a comparison with the original ROCK algorithm is also provided. Experimental results show that the proposed algorithm is on par with the original ROCK algorithm, with an accuracy of 97.9%. The proposed P-ROCK algorithm improves runtime and is more flexible and scalable.
Clustering groups comparable data points into the same cluster. Numerical data points are easier to cluster because several clustering methods have already been established for them; clustering categorical data, however, poses a substantial challenge. Categorical data have typically had to be translated into numerical data because of the minimal research in this field [
Recent research has concentrated on reviving traditional, simple clustering algorithms with modifications that improve their efficiency. This serves two purposes. First, the scalability and ease of implementation of traditional algorithms can be preserved. Second, the investment required to develop a new algorithm is significantly higher than that required to modify an existing one.
As massive data warehouses store ever-growing amounts of data, clustering has become increasingly important in today’s world. Various clustering methods have been developed over the years to cluster this massive amount of data. Clustering categorical data, however, is an entirely different challenge from clustering numerical data. Categorical data values exist on a nominal scale: each value represents a conceptually distinct notion, the values cannot be meaningfully ordered, and they cannot be handled or manipulated the way numbers can. Blood types A, B, AB, and O, for example, indicate a person’s blood group; rocks can be classified as igneous, metamorphic, or sedimentary. Computing the similarity between such data points therefore cannot rely on a distance metric. Over the years, many clustering approaches have been created, and some of them are detailed in the following sub-sections.
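To illustrate the point, consider a minimal sketch (not from the paper) of why an arbitrary integer encoding of categorical values yields meaningless distances, while an agreement-based similarity is well defined:

```python
# Arbitrary integer encoding of blood types: the resulting "distances"
# are artifacts of the encoding, not properties of the data.
codes = {"A": 1, "B": 2, "AB": 3, "O": 4}
# |codes["A"] - codes["O"]| = 3 and |codes["A"] - codes["B"]| = 1,
# yet no blood type is genuinely "closer" to another.

def overlap(a, b):
    # Fraction of attributes on which two categorical records agree;
    # no distance metric is needed.
    return sum(x == y for x, y in zip(a, b)) / len(a)

r1 = ("A", "igneous")
r2 = ("A", "sedimentary")
print(overlap(r1, r2))  # 0.5: the records agree on one of two attributes
```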
The several clustering approaches are listed in the sub-sections below.
One of the earliest attempts in this direction was the application of the widely used K-means algorithm [
The k-means algorithm was also extended in the form of a K-Histogram [
In fuzzy c-means clustering, data items can belong to multiple clusters depending on their degree of cluster membership [
The QROCK [
The QROCK [ computes clusters by determining the connected components of the graph, which drastically reduces the computation of the algorithm. It does not require prior knowledge of the desired number of clusters, detects outliers, defines a new weighted goodness measure, avoids explicit computation of links (thereby substantially reducing computational effort), and computes a correspondence between the values of the number of clusters,
The M-Rock [
This section first presents a detailed overview of the ROCK algorithm with its methodology and further shows our proposed modification. Our proposal includes fine-tuning parameters of the algorithm as our modification. Analysis of the same is also provided for better evaluation of the proposed modifications.
The ROCK algorithm in [
Two points,
A similarity of 0 means there are no links between the data points, and vice versa. The algorithm introduces a new criterion function on links and maximizes this criterion for the best results. The criterion function aims to maximize the links between two points
It then merges the clusters based on the links using a goodness measure, until no further merging is possible or no links remain. The goodness measure is calculated as:
The steps involved in the ROCK algorithm are:
Step 1: Consider all the data points as separate clusters.
Step 2: Repeat steps 3 to 5 until no more clusters can be merged or no links remain.
Step 3: Compute links between the clusters.
Step 4: Compute the goodness measure.
Step 5: Merge the best two clusters.
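The steps above can be sketched in Python (the paper's experiments used MATLAB; this is an illustrative toy implementation, assuming the standard f(θ) = (1 − θ)/(1 + θ), records treated as sets of attribute values, and the desired cluster count k as input):

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two records viewed as sets of attribute values."""
    return len(a & b) / len(a | b)

def rock(points, theta, k):
    """Toy ROCK: agglomerate clusters by a link-based goodness measure."""
    f = (1 - theta) / (1 + theta)        # standard ROCK f(theta)
    e = 1 + 2 * f                        # exponent used in the goodness measure
    n = len(points)
    # Two points are neighbors if their similarity is at least theta
    # (each point is a neighbor of itself, since sim(p, p) = 1).
    nbr = [{j for j in range(n) if jaccard(points[i], points[j]) >= theta}
           for i in range(n)]
    # link(p, q) = number of common neighbors of p and q
    link = {(i, j): len(nbr[i] & nbr[j]) for i, j in combinations(range(n), 2)}
    clusters = [{i} for i in range(n)]   # Step 1: every point is its own cluster

    def links(ci, cj):
        return sum(link[(min(p, q), max(p, q))] for p in ci for q in cj)

    def goodness(ci, cj):
        ni, nj = len(ci), len(cj)
        return links(ci, cj) / ((ni + nj) ** e - ni ** e - nj ** e)

    # Steps 2-5: merge the best pair until k clusters remain or no links are left.
    while len(clusters) > k:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: goodness(clusters[ij[0]], clusters[ij[1]]))
        if links(clusters[i], clusters[j]) == 0:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

On a toy dataset of four records, points 0-1 and 2-3 share attributes and end up in two clusters for θ = 0.4, k = 2.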
The parameters involved in the ROCK algorithm are discussed below.
The threshold for the neighborhood decision: two points are considered neighbors if there is considerable similarity between them. This similarity further depends on a given threshold whose value ranges from 0 to 1. Variations in the value of the threshold can change the result of the ROCK algorithm.
Value of
Value of
where
After observing the method’s parameters, the algorithm can be converted into a parameterized version in which all parameters are user-defined inputs. These user-defined parameters can then be chosen using domain knowledge and data properties. Given below are the steps of the proposed Parameterized-ROCK (P-ROCK) algorithm.
The proposed algorithm requires the definitions of the functions f(θ) and h(θ) as input. From the implementation point of view, this can be done through subroutine calls invoked in steps 4 and 5. The initial values of the links can be directly computed as link = neighbor * neighbor. At a time, only two clusters are merged, thus reducing the value of
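A hypothetical sketch of the parameterized pieces follows: the initial link matrix computed as link = neighbor * neighbor (a product of the neighbor matrix with itself), and a goodness measure whose functions f(θ) and h(θ) are supplied by the user. With f(θ) = (1 − θ)/(1 + θ) and h(θ) = 1 + 2f(θ), this reduces to the original ROCK measure.

```python
def initial_links(adjacency):
    """Initial link counts via link = neighbor * neighbor:
    link[p][q] = number of common neighbors of p and q."""
    n = len(adjacency)
    link = [[sum(adjacency[p][t] * adjacency[t][q] for t in range(n))
             for q in range(n)] for p in range(n)]
    for p in range(n):
        link[p][p] = 0          # a point's links to itself are not counted
    return link

def goodness(link_ij, ni, nj, theta, f, h):
    """Parameterized goodness measure: the user-defined h(theta) replaces
    the fixed exponent 1 + 2*f(theta) of the original ROCK algorithm."""
    e = h(theta)
    return link_ij / ((ni + nj) ** e - ni ** e - nj ** e)
```

For example, with the neighbor matrix of four points where 0-1 and 2-3 are mutual neighbors, `initial_links` gives link[0][1] = 2 (the pair plus each point itself) and link[0][2] = 0.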
In this section, a few hypotheses are formed regarding the effect of the parameters on the runtime and clustering quality of the ROCK algorithm. The implementation results of the proposed parameterized ROCK algorithm presented later justify the statements of these hypotheses.
The first hypothesis has been adopted directly from [
Besides this effect of θ, it can also be safely assumed that a θ deduced from one dataset cannot be applied to another. This implies that θ needs to be learned for each dataset or be provided by a domain expert.
The constraint mentioned in [
The extra parameter h(θ) introduced for P-ROCK needs a detailed analysis. Putting the limiting values of
Therefore, the goodness measure is
Since the goodness measure cannot take a negative value,
Which makes,
Resulting,
In the other extreme situation, both clusters to be merged might have
This makes,
Taking log both sides,
This makes,
Resulting,
A hypothesis can thus be formed as:
As mentioned in [
A hypothesis can therefore be framed as:
Experiments are designed and conducted to verify the hypotheses discussed in the last section. The later sections discuss the results of the experiments.
Experiments on the proposed P-ROCK algorithm have been performed in MATLAB, an easy and efficient platform for mathematical computations. Two real-life datasets have been used for the experiments: the small Soybean dataset and the Congressional Votes dataset. A brief description of the datasets is given below.
The working of the ROCK algorithm on the Soybean dataset is shown in
Value of θ | Total elapsed time (in s) | Number of current clusters | Number of major clusters | Cluster structure |
---|---|---|---|---|
0.5 | 0.024847 | 4 | 1 | 44 |
0.55 | 0.023962 | 4 | 1 | 44 |
0.6 | 0.022707 | 4 | 3 | 10, 9, 27 |
0.65 | 0.021525 | 4 | 3 | 10, 10, 26 |
0.7 | 0.021052 | 4 | 3 | 10, 9, 27 |
0.75 | 0.020861 | 4 | 4 | 10, 10, 10, 17 |
0.8 | 0.018242 | 17 | 8 | 2, 5, 2, 7, 3, 5, 8, 6 |
0.85 | 0.016214 | 34 | 6 | 2, 2, 3, 4, 5, 3 |
0.9 | 0.014884 | 46 | 1 | 2 |
Value of θ | Total elapsed time (in s) | Number of current clusters | Number of major clusters | Cluster structure |
---|---|---|---|---|
0.5 | 2.534873 | 3 | 2 | 431, 4 |
0.55 | 2.438263 | 10 | 2 | 424, 3 |
0.6 | 2.445057 | 10 | 2 | 424, 3 |
0.65 | 2.263818 | 29 | 3 | 3, 2, 404 |
0.7 | 1.891412 | 82 | 6 | 2, 2, 2, 3, 188, 162 |
0.75 | 1.838770 | 82 | 6 | 2, 2, 2, 3, 188, 162 |
0.8 | 1.394288 | 185 | 8 | 124, 122, 2, 2, 2, 2, 2, 2 |
0.85 | 1.393105 | 185 | 8 | 124, 122, 2, 2, 2, 2, 2, 2 |
0.9 | 0.857809 | 342 | 38 | 5, 3, 4, 8, 5, 3, 2, 8, 3, 2, 2, 4, 2, 2, 3, 3, 2, 5, 5, 2, 6, 2, 5, 2, 2, 2, 2, 5, 3, 4, 3, 2, 6, 2, 4, 2, 4, 2 |
We observed the values of θ for which the obtained cluster structure is most accurate, keeping all other parameters constant. Changing the definition of f(θ) can provide insight into its impact on the algorithm’s performance. The definitions used for f(θ), other than that of the standard ROCK algorithm [, in our experiments are as follows:
f(θ) = 1 − θ
f(θ) = 1 − log2(1 + θ)
f(θ) = 1 + log2(1 + θ)
f(θ) = 1 + θ
f(θ) = 1
f(θ) = 0
The first three definitions have been designed according to the constraint mentioned in [
Value of f(θ) | Total elapsed time (s) | Cluster structure | Number of current clusters | Number of major clusters |
---|---|---|---|---|
1 − θ | 0.020748 | 10, 10, 10, 17 | 4 | 4 |
1 − log2(1 + θ) | 0.021128 | 10, 10, 10, 17 | 4 | 4 |
1 + log2(1 + θ) | 0.020752 | 10, 10, 10, 17 | 4 | 4 |
1 + θ | 0.020771 | 10, 10, 10, 17 | 4 | 4 |
1 | 0.021448 | 10, 10, 10, 17 | 4 | 4 |
0 | - | - | - | - |
Value of f(θ) | Total elapsed time (s) | Cluster structure | Number of current clusters | Number of major clusters |
---|---|---|---|---|
1 − θ | 1.836260 | 188, 162, 2, 3, 2, 2 | 82 | 6 |
1 − log2(1 + θ) | 1.825012 | 188, 162, 2, 3, 2, 2 | 82 | 6 |
1 + log2(1 + θ) | 1.800062 | 188, 162, 2, 3, 2, 2 | 82 | 6 |
1 + θ | 1.896225 | 188, 162, 2, 3, 2, 2 | 82 | 6 |
1 | 1.828216 | 188, 162, 2, 3, 2, 2 | 82 | 6 |
0 | - | - | - | - |
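The constraint on f(θ) stated in the conclusions — f(θ) should be 1 at θ = 0 and 0 at θ = 1 — can be checked numerically for the candidate definitions (an illustrative check, not code from the paper):

```python
from math import log2

# Candidate f(theta) definitions used in the experiments.
candidates = {
    "1 - theta":           lambda t: 1 - t,
    "1 - log2(1 + theta)": lambda t: 1 - log2(1 + t),
    "1 + log2(1 + theta)": lambda t: 1 + log2(1 + t),
    "1 + theta":           lambda t: 1 + t,
    "constant 1":          lambda t: 1.0,
    "constant 0":          lambda t: 0.0,
}

# Check f(0) = 1 and f(1) = 0 for each definition.
satisfies = {name: (f(0) == 1 and f(1) == 0) for name, f in candidates.items()}
for name, ok in satisfies.items():
    print(f"f(theta) = {name}: {'satisfies' if ok else 'violates'} the constraint")
```

Only 1 − θ and 1 − log2(1 + θ) satisfy both endpoint conditions; notably, the tables above show that the other nonzero definitions still produced the same cluster structures, while f(θ) = 0 prevented merging entirely.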
The definition of h(θ) decides the extent of the impact of f(θ) on the goodness measure. This implies that h(θ) directly affects the quality of the clusters produced. The definitions used for h(θ), other than that of the standard ROCK algorithm, in our experiments are as follows.
h(θ) = c*f(θ)
h(θ) = 1 + f(θ)
h(θ) = 1 + 2f(θ)²
h(θ) = 1 + f(θ)²
The c in c*f(θ) is a constant that generates the optimum cluster structure in the Soybean dataset for values of c ≥ 7. For values less than 7, the merging of clusters does not occur. The dependence on the value of c exists because, per HYPOTHESIS 4, the resultant h(θ) should be greater than 1. The experimental results for different h(θ) are tabulated in
Value of h(θ) | Total elapsed time (s) | Cluster structure | Number of current clusters | Number of major clusters |
---|---|---|---|---|
c*f(θ), c < 7 | 0.012128 | 1, 1, 1, 1… | 47 | 4 |
c*f(θ), c ≥ 7 | 0.021340 | 10, 10, 10, 17 | 4 | 4 |
1 + f(θ) | 0.021291 | 10, 10, 10, 17 | 4 | 4 |
1 + 2f(θ)2 | 0.020862 | 10, 10, 10, 17 | 4 | 4 |
1 + f (θ)2 | 0.021033 | 10, 10, 10, 17 | 4 | 4 |
Value of h(θ) | Total elapsed time (s) | Cluster structure | Number of current clusters | Number of major clusters |
---|---|---|---|---|
c*f(θ), c < 8 | 0.616397 | 1, 1, 1… | 435 | 0 |
c*f(θ), c > 8 | 1.786754 | 188, 162, 2, 3, 2, 2 | 82 | 6 |
1 + f(θ) | 1.798534 | 188, 162, 2, 3, 2, 2 | 82 | 6 |
1 + 2f(θ)2 | 1.819414 | 188, 162, 2, 3, 2, 2 | 82 | 6 |
1 + f(θ)2 | 1.834672 | 188, 162, 2, 3, 2, 2 | 82 | 6 |
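The h(θ) > 1 requirement of HYPOTHESIS 4 can be seen directly in the denominator of the goodness measure, (n1 + n2)^h(θ) − n1^h(θ) − n2^h(θ): for exponents at or below 1 the denominator is non-positive (x^e is subadditive for e ≤ 1), so no merge is ever favored. A small numeric check (illustrative, not code from the paper):

```python
def denominator(n1, n2, e):
    # Denominator of the goodness measure: (n1 + n2)^e - n1^e - n2^e
    return (n1 + n2) ** e - n1 ** e - n2 ** e

# For e <= 1 the denominator is non-positive, so merging cannot occur;
# for e > 1 it is positive and the goodness measure is well behaved.
print(denominator(1, 1, 0.9))   # negative: no merging
print(denominator(1, 1, 1.5))   # positive: merging proceeds
```

This matches the observation that c*f(θ) with small c (keeping h(θ) ≤ 1) left every point in its own singleton cluster.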
Reference [
The impact of other factors on the clustering results of the real-life datasets can be observed through
Expression for goodness measure | Cluster structure (Soybean small dataset) | Cluster structure (Congressional Votes dataset) | Cluster structure (Mushroom dataset) |
---|---|---|---|
 | 10, 10, 10, 17 | 188, 162, 2, 3, 2, 2 | 96, 96, 704, 256, 768, 192, 1728, 8, 32, 48, 192, 48, 288, 1296, 8, 16, 104, 288, 36, 1728, 192 |
 | 10, 10, 10, 17 | 188, 162, 2, 3, 2, 2 | 96, 96, 704, 256, 768, 192, 1728, 8, 32, 48, 192, 48, 288, 1296, 8, 16, 104, 288, 36, 1728, 192 |
 | 10, 10, 10, 17 | 188, 162, 2, 3, 2, 2 | 96, 96, 704, 256, 768, 192, 1728, 8, 32, 48, 192, 48, 288, 1296, 8, 16, 104, 288, 36, 1728, 192 |
The results obtained are the best clustering results for each real-life dataset, proving HYPOTHESIS 5.
The proposed algorithm P-ROCK is a modification of ROCK [
Algorithms | Cluster No. | No. of republicans | No. of democrats | Accuracy |
---|---|---|---|---|
ROCK | 1 | 144 | 22 | 79.31 |
2 | 5 | 201 | ||
P-ROCK | 1 | 159 | 3 | 79.77 |
2 | 0 | 188 |
In a scientific report of the ROCK algorithm [
The significant modifications made to the original ROCK algorithm and the conclusions drawn from the experimental results are listed below.
- A new parameter
- All ROCK algorithm parameters are provided as user-defined inputs to the algorithm, except for the similarity and goodness measure formulas.
- The parameterized version of the ROCK algorithm, P-ROCK, is proposed.
- The hypotheses are verified through experimental results.
- Increasing the threshold value above the optimal decreases the algorithm’s runtime and produces many small clusters; decreasing it below the optimal increases runtime and produces a few large clusters.
- There can be no universal value of θ; it depends on the data characteristics. The optimal value of θ needs to be learned separately or be provided by an expert.
- The definition of f(θ) does not affect the performance of ROCK if the constraint ‘f(θ) should be 0 at θ = 1 and 1 at θ = 0’ is satisfied.
- If the constraint h(θ) > 1 is followed, the definition of h(θ) does not affect cluster quality.
- Except for the link and threshold parameters, no other parameter controls the quality of the algorithm’s output.
Most of the time in both the ROCK and the proposed P-ROCK algorithm is spent merging clusters. At every iteration, only two components are merged; hence, the number of iterations increases when more mergeable components occur, which means a higher runtime. Whether the components are mergeable depends on the average number of neighbors per data point, and the threshold θ decides the number of neighbors. With a low threshold, more components are merged, producing very few, very large clusters; thus, a constant decrease in runtime is observed as the threshold increases.
Very few research works address the handling of categorical data, and most proposals extend traditional algorithms to categorical data because of their simplicity. The limitations of the existing algorithms have already been documented, making it easier to focus on improving them. Our work extends the ROCK algorithm to provide a parameterized version, P-ROCK. The ROCK algorithm was chosen for its scalability and simplicity: all its steps, from the computation of neighbors to links to merging, are simple, and its runtime and accuracy make it scalable enough for efficient clustering of categorical data. The aim of this paper is to provide flexibility to the algorithm, which is achieved through our proposed modification of taking user-defined parameters as inputs, making it more flexible according to user needs. The parameters taken into account are the threshold θ, f(θ), and h(θ). By testing the algorithm for various values and definitions of these parameters, specific hypotheses outlining their impact on the algorithm were formulated, and experimental results on real-life datasets prove our hypotheses. Comparing the proposed P-ROCK algorithm with the original ROCK algorithm also provides better insight into the proposal. The proposed modifications do not compromise the accuracy or runtime of the original ROCK algorithm.
The authors acknowledge the Researchers Supporting Project number (RSP2022R498), King Saud University, Riyadh, Saudi Arabia.