Picture-Neutrosophic Trusted Safe Semi-Supervised Fuzzy Clustering for Noisy Data
1 Graduate University of Science and Technology, Vietnam Academy of Science and Technology, Hanoi, 100000, Vietnam
2 Institute of Information Technology, Vietnam Academy of Science and Technology, Hanoi, 100000, Vietnam
3 VNU Information Technology Institute, Vietnam National University, Hanoi, 100000, Vietnam
4 Department of Mathematics, University of New Mexico, Gallup, 87301, New Mexico, USA
5 University of Information and Communication Technology, Thai Nguyen University, Thai Nguyen, 250000, Vietnam
6 Faculty of Computer Science and Engineering, Thuyloi University, Hanoi, 100000, Vietnam
* Corresponding Author: Tran Thi Ngan.
Computer Systems Science and Engineering 2023, 46(2), 1981-1997. https://doi.org/10.32604/csse.2023.035692
Received 31 August 2022; Accepted 14 December 2022; Issue published 09 February 2023
Abstract: Clustering is a crucial method for deciphering data structure and producing new information. Because it helps reveal fundamental connections between the human brain and events, clustering is essential for cognitive research. Dealing with noisy data caused by inaccurate synthesis from several sources or by misleading data production processes is one of the most intriguing clustering difficulties. Noisy data can lead to incorrect object recognition and inference. This research proposes a novel clustering approach, named Picture-Neutrosophic Trusted Safe Semi-Supervised Fuzzy Clustering (PNTS3FCM), to solve the clustering problem with noisy data using the neutral and refusal degrees in the definitions of the Picture Fuzzy Set (PFS) and the Neutrosophic Set (NS). Our contribution is a new optimization model with four essential components: clustering, outlier removal, safe semi-supervised fuzzy clustering and partitioning with labeled and unlabeled data. The effectiveness and flexibility of the proposed technique are evaluated and compared with state-of-the-art methods, standard Picture fuzzy clustering (FC-PFS) and Confidence-weighted safe semi-supervised clustering (CS3FCM), on benchmark UCI datasets. The experimental results show that our method outperforms the compared methods on at least 10 of 15 datasets in terms of clustering quality and computational time.
The search for underlying connections between the human brain and events has made the development of sophisticated clustering algorithms popular in cognitive research [1,2]. Dealing with noisy data is one of the most intriguing clustering difficulties. Inaccurate, noisy data that degrade the quality of results appear in many applications, such as satellite images [3], medical image processing [4,5] and control systems [6].
Semi-supervised fuzzy clustering techniques were introduced with additional information provided by users [7–9] to enhance the range of applications and the quality of clusters. The differences in incorporating various forms of supplementary information were demonstrated in [10], which provided a summary of semi-supervised fuzzy clustering techniques. Accordingly, object segmentation using semi-supervised fuzzy clustering is effective as long as proper supplementary information, also known as “safe information”, and clean data are supplied. However, real-world data are frequently unreliable, noisy and inaccurate. These situations require more effective clustering methods.
The safe semi-supervised fuzzy clustering approach introduced in [11–13] is the typical method to deal with safe information in semi-supervised fuzzy clustering. Following its core concept, the strategy has two primary phases. In the first phase, the confidence weights for labeled data are calculated. In the second phase, the high confidence weights are used to generate and identify the cluster centers and fuzzy membership values under the labeled data. The Safe Semi-supervised Fuzzy C-Means clustering (S3FCM) approach was first presented in [11]. By balancing semi-supervised and unsupervised clustering, this technique investigated the incorrectly classified data. In [12], a local homogeneous graph was employed in the first phase. By utilizing this graph, the Local Homogeneous Consistent Safe SSFCM (LHC-S3FCM) method performed effectively on datasets with a large percentage of incorrectly categorized data. CS3FCM, an enhanced safe semi-supervised clustering model based on confidence weights, was proposed in [13]. This approach gives good results in minimizing the negative impact of incorrectly labeled samples on the clustering process, assuming each data sample has its own safe confidence weight.
To establish the safe level of each sample in the data set, Guo et al. [14] have recently suggested a safe semi-supervised clustering with a safe degree. Based on the safe degree value, the model provides the essential procedures to reduce the adverse effects of risk in both labeled and unlabeled samples. Despite performing better than other approaches when dealing with “safe information”, safe semi-supervised fuzzy clustering algorithms still cannot solve the challenge of clustering inaccurate, noisy data. Partitioning noisy data can lead to incorrect object detection and inference. Data points that are isolated or lie at the edge of some clusters are considered noisy. It is therefore necessary to improve safe semi-supervised fuzzy clustering algorithms so that they can deal with noisy data.
This research aims to develop a new clustering method that removes noise from data and increases clustering performance. The method integrates semi-supervised clustering and the picture fuzzy set [15]. There are four membership degrees in the PFS [15] combined with the Neutrosophic set [16]: the positive degree, the neutral degree, the negative degree and the refusal degree. Noisy data typically have a high refusal degree. Additionally, the neutral degree is used to determine the data points that belong to the boundary of clusters. PFS can therefore be used to identify noisy data in datasets.
Based on the original Fuzzy C-Means (FCM) model, a fuzzy clustering algorithm on picture fuzzy sets (FC-PFS) introduced in [17] outperforms other fuzzy clustering techniques in terms of average clustering indices such as mean accuracy and computational time. As an extension of collaborative distributed fuzzy clustering (CDFCM) [18] to PFS, a form of FC-PFS for distributed computing known as DPFCM was demonstrated in [19]. As stated in that paper, semi-supervised clustering using distributed and cloud computing is one strategy to reduce computational time and increase clustering quality. Wu and Chen presented an adaptive picture fuzzy clustering technique based on entropy weight [20]. This approach improved accuracy, addressed noisy data in image segmentation and overcame the time-consuming limitation of existing picture fuzzy clustering algorithms. Two practical, robust picture fuzzy clustering techniques for decreasing computational time were also introduced [21,22]. Nonetheless, those fuzzy clustering algorithms struggle to manage both “safe information” and “noisy data”, because noise in the labeled data seriously affects the clustering quality.
To enhance “safe information” and reduce the effect of “noisy data”, Picture-Neutrosophic Trusted Safe Semi-Supervised Fuzzy Clustering (PNTS3FCM) is introduced. This is a new technique to address the issue of data partitioning with noisy information. The PNTS3FCM approach incorporates picture fuzzy and neutrosophic set concepts into semi-supervised fuzzy clustering with a safe information procedure. The research proposes a new optimization model consisting of four essential components: a clustering component, an outlier-solving component and two safe semi-supervised fuzzy clustering components, one for labeled data and one for unlabeled data. The first two parts are taken from FC-PFS, and the last two are new parts that enhance safe information and reduce noisy data. An iterative technique derived from the formulation is also provided to construct the cluster centers and memberships. In fact, the survey has revealed a new field of study: safe semi-supervised clustering on the picture fuzzy set. To compare PNTS3FCM with other available methods on benchmark datasets, two similar algorithms, FC-PFS [17] and CS3FCM [13], are chosen.
The remainder of the paper is structured as follows: Section 2 offers the essential background underpinning our study. The proposed approach is introduced in Section 3 and the experimental results are presented in Section 4. Some conclusions are given in the last section.
In this section, some fundamental concepts and methods are presented, including safe semi-supervised clustering, the picture fuzzy set and picture fuzzy clustering.
For S3FCM, consider a dataset with a given number of data elements and a given number of clusters, each represented by a cluster center. A membership degree characterizes how strongly the kth element belongs to the ith cluster, and m is the fuzzifier parameter. A label indicator marks whether the kth element is labeled, and fuzzy degrees are given for the labeled samples. The objective function of S3FCM [11] is as below:
Here, the objective involves regulatory factors and the partition matrix obtained by running FCM on the unlabeled data, and dik is the distance between the kth element and the ith cluster. The final cluster labels are determined through the algorithm, and the membership value is specified as follows:
The center of each cluster is calculated by the function below:
On the other hand, LHC-S3FCM [12] is designed to deal with wrong labels in the additional information. Its objective function is defined as follows:
with the constraints:
Therefore, the cluster centers, the values for labeled data and the values for unlabeled data are given by the functions below:
Another FCM-based approach is Confidence-weighted Safe Semi-supervised Clustering (CS3FCM) [13], which uses confidence weights. The confidence weights reflect the different effects of samples on performance degradation. The objective function is as follows:
Here, the model includes regulatory factors, and its unknowns are determined by the following functions:
In a picture fuzzy set, each element is described by three values that correspond to the positive degree, the neutral degree and the negative degree. These degrees satisfy the following conditions:
Then, the refusal degree is computed by the following function:
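The conditions and the refusal degree described above can be sketched in a few lines. This is a minimal illustration that assumes the standard definition of a picture fuzzy set [15]: each degree lies in [0, 1], the three degrees sum to at most 1, and the refusal degree is the remainder. The function name `refusal_degree` is hypothetical.

```python
def refusal_degree(positive, neutral, negative):
    """Return the refusal degree 1 - (positive + neutral + negative).

    Assumes the standard picture fuzzy set conditions: each degree is in
    [0, 1] and the three degrees sum to at most 1.
    """
    degrees = (positive, neutral, negative)
    if any(d < 0.0 or d > 1.0 for d in degrees):
        raise ValueError("each degree must lie in [0, 1]")
    total = positive + neutral + negative
    if total > 1.0 + 1e-12:
        raise ValueError("the three degrees must sum to at most 1")
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - total)


if __name__ == "__main__":
    # An element that is mostly positive, with some neutrality and negativity.
    print(refusal_degree(0.5, 0.2, 0.1))
```

An element whose three degrees sum to far less than 1 receives a large refusal degree, which is the property later used to flag noisy points.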
FC-PFS [17] aims to group the data into clusters and reduce the outliers through the concept of entropy, with the following objective function:
with the constraints:
The three degree values in the objective correspond to the positive, neutral and negative degrees of PFS [15], and the cluster centers are collected in a vector.
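For reference, the FC-PFS objective and its constraints are usually written in the form below. This is a sketch based on the standard FC-PFS formulation [17], not a verbatim copy of the equations of this section. The symbol names are assumptions: $\mu_{kj}$, $\eta_{kj}$ and $\xi_{kj}$ denote the three degrees of element $X_k$ with respect to cluster $j$, $V_j$ is the center of cluster $j$, $N$ is the number of elements, $C$ the number of clusters and $m$ the fuzzifier.

```latex
\begin{align}
J &= \sum_{k=1}^{N}\sum_{j=1}^{C}
      \bigl(\mu_{kj}\,(2-\xi_{kj})\bigr)^{m}\,\lVert X_k - V_j \rVert^{2}
   + \sum_{k=1}^{N}\sum_{j=1}^{C}
      \eta_{kj}\,\bigl(\log \eta_{kj} + \xi_{kj}\bigr) \;\to\; \min \\
\text{subject to}\quad
  & \mu_{kj},\ \eta_{kj},\ \xi_{kj} \in [0,1], \qquad
    \mu_{kj} + \eta_{kj} + \xi_{kj} \le 1, \\
  & \sum_{j=1}^{C} \mu_{kj}\,(2-\xi_{kj}) = 1, \qquad
    \sum_{j=1}^{C} \Bigl(\eta_{kj} + \frac{\xi_{kj}}{C}\Bigr) = 1,
    \qquad k = 1,\dots,N .
\end{align}
```

The first sum is the distance-based clustering term and the second is the entropy term that penalizes outliers.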
For the above objective function, the cluster centers, the membership degrees and the non-membership degrees are computed using the following formulas:
In [17], the refusal degree is calculated using the Yager complement operator as follows:
where the operator's parameter is a regulatory factor that is often chosen in the range [0.6, 0.8].
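The Yager-based refusal degree can be illustrated with a short sketch. It assumes the form commonly given for FC-PFS [17], namely refusal = 1 - s - (1 - s^alpha)^(1/alpha), where s is the sum of the positive and neutral degrees and alpha in (0, 1] is the regulatory factor. The function name `yager_refusal` and the default `alpha=0.7` (picked from the range quoted above) are assumptions.

```python
def yager_refusal(positive, neutral, alpha=0.7):
    """Refusal degree via the Yager complement (assumed FC-PFS form).

    refusal = 1 - s - (1 - s**alpha) ** (1 / alpha), with s = positive + neutral.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    s = positive + neutral
    if s < 0.0 or s > 1.0:
        raise ValueError("positive + neutral must lie in [0, 1]")
    # The Yager complement of s plays the role of the remaining degree.
    complement = (1.0 - s ** alpha) ** (1.0 / alpha)
    return 1.0 - s - complement


if __name__ == "__main__":
    for alpha in (0.6, 0.7, 0.8, 1.0):
        print(alpha, yager_refusal(0.4, 0.2, alpha))
```

With `alpha = 1` the complement equals 1 - s and the refusal degree vanishes; smaller values of `alpha` leave more room for refusal.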
The detailed steps for the FC-PFS algorithm are shown below.
The idea behind the proposed method (PNTS3FCM) is to combine PFS and safe semi-supervised fuzzy clustering by introducing a novel objective function with four primary components. The first and second components are adopted from the original picture fuzzy clustering method [17]. The last two components are the semi-supervised part, which orients the clustering process using labeled and unlabeled data. The main idea is represented in Fig. 1 and the detailed steps are described in Fig. 2.
Fig. 1 illustrates the method and concept, in which the input data are provided to the PNTS3FCM block. Through the use of picture fuzzy degrees, the first step of PNTS3FCM is to reduce the distance between data elements and cluster centers. The second step of the picture fuzzy set model processes the “noisy data” by integrating the entropy quantity between the neutral and refusal degrees. The refusal degree plays an important role in reducing the effect of noisy data in the objective function, because noisy data tend to have higher refusal values.
To deal with “safe information”, the last two stages coordinate the safe semi-supervised fuzzy clustering using both labeled and unlabeled data. PNTS3FCM has two phases. In the first phase, FC-PFS is used to partition all data and obtain a clustering result with positive, neutral and refusal values. The second phase uses all data together with these values to partition the data again and achieve better clustering quality by enhancing safe data information and reducing noisy data.
The technique produces final clusters that are reliable and confident. We will discuss the formulation and algorithm for this concept in the next section.
As illustrated by the main idea above, this section will describe the details of the proposed model. The objective function is stated by the following formula:
with the constraints:
where the data set has n elements, of which a number are labeled; the elements are partitioned into a number of clusters; and the positive, neutral and refusal degrees describe how each element belongs to each cluster. Each part of the objective function has its own meaning. The first two parts of Eq. (17) are those of the original picture fuzzy clustering (FC-PFS) [17]. The last two parts cover the safe semi-supervised fuzzy clustering on the picture fuzzy set.
• The first part represents fuzzy clustering on the PFS.
• The second part represents entropy information which helps to reduce noisy data through the neutral and refusal degrees of a data point.
• The third part is the component for labeled data elements, where L is the number of labeled data elements. The numerator describes semi-supervised fuzzy clustering and contains a given constant that has a value of 1 or 0.
The denominator describes the safe semi-supervised clustering. The meaning of this component is as follows: After clustering, if any data point is assigned to the correct label, the weight will be increased; otherwise, the weight will be decreased.
• Finally, the fourth part is the component for the unlabeled data elements, where the numerator is the same as the first part and a refusal-dependent term is added to the denominator. After clustering is applied to all data points, the denominator is greater than 1 for unlabeled data elements with a high refusal value, so the weights of these data elements are reduced.
• The additional information for semi-supervised fuzzy clustering is the prior picture membership degrees. We use the original FC-PFS algorithm to cluster all data, including labeled and unlabeled data. From that, we calculate four values that guide the calculation for all data elements.
The positive degree u of the labeled data elements is
The positive degree u of the unlabeled data elements is
Other degrees are shown below:
Details of the PNTS3FCM algorithm are below.
Advantages of the PNTS3FCM algorithm:
a) PNTS3FCM has better clustering quality than related methods, such as the FC-PFS and CS3FCM algorithms, due to its capability to handle noisy data.
b) PNTS3FCM produces more information about the clusters, such as the cluster centers and the picture fuzzy degrees (positive, neutral, negative, refusal). It deals with both “safe information” and “noisy data”.
c) PNTS3FCM combines three major concepts: safe clustering, semi-supervised clustering and the Picture Fuzzy Set. This combination is the first attempt in the literature aimed at practical problems.
Disadvantages of the PNTS3FCM algorithm:
a) PNTS3FCM takes more computational time than the other algorithms due to the calculation of two additional parts in the objective function (24).
b) The model contains many parameters which need to be tuned in some real-world applications.
The experiments are performed on a Core i5-powered HP laptop using the C programming language. The selected benchmark UCI datasets [24] are described in Table 1. Outlier Detection DataSets (ODDS) [25] are given in Table 2.
Experiments are executed to compare the proposed PNTS3FCM approach with the state-of-the-art methods CS3FCM [13] and FC-PFS [17]. The evaluation criteria are the classification accuracy (CA), the computing time (CT) and the clustering quality indicators DB, PBM and ASWC [26]. The CT is the amount of time needed to complete the computation. The value of CT is computed as in (25).
Here, CT is the difference between the ending time and the starting time of the algorithm. The smaller the value of CT, the better the method performs. The CA is calculated by the equation below.
where a mapping function determines the equivalent label for each cluster using the Kuhn–Munkres algorithm, and an indicator function takes two values (1 if the mapped label equals the true label and 0 otherwise). A higher CA value indicates better performance.
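The CA computation described above can be sketched as follows. To stay dependency-free, this sketch replaces the Kuhn–Munkres algorithm with a brute-force search over all one-to-one mappings between clusters and classes; this gives the same optimal mapping but is only practical for a small number of clusters. The function name `classification_accuracy` is hypothetical.

```python
from itertools import permutations


def classification_accuracy(true_labels, cluster_ids):
    """Fraction of elements whose mapped cluster label equals the true label."""
    if len(true_labels) != len(cluster_ids) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(true_labels))
    # Pad with placeholders so that surplus clusters can stay unmatched.
    unmatched = object()
    candidates = classes + [unmatched] * max(0, len(clusters) - len(classes))
    best_hits = 0
    # Brute force over one-to-one mappings (a stand-in for Kuhn-Munkres).
    for assignment in permutations(candidates, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        hits = sum(1 for y, c in zip(true_labels, cluster_ids)
                   if mapping[c] == y)
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    found = [2, 2, 0, 0, 1, 0]  # cluster ids are arbitrary names
    print(classification_accuracy(truth, found))
```

For larger numbers of clusters, an implementation of the Kuhn–Munkres (Hungarian) algorithm would replace the permutation loop without changing the result.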
The value of ASWC is computed by Eq. (27).
where the first quantity is the average distance from an element to all other elements in its own cluster, and the second is the average distance from that element to all elements in the nearest other cluster. A tiny constant is added to make the denominator differ from zero (when the first quantity is zero). A higher ASWC value indicates better performance.
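As an illustration of the ASWC index, the sketch below assumes the alternative silhouette form described in [26]: each element contributes b / (a + eps), where a is its average distance to the other elements of its own cluster and b is its average distance to the nearest other cluster, and the index is the mean contribution. Euclidean distance, the name `aswc`, the default `eps` and the convention that singleton clusters contribute 0 are assumptions of this sketch.

```python
from math import dist  # Euclidean distance, Python 3.8+


def aswc(points, labels, eps=1e-6):
    """Alternative silhouette width criterion for a crisp partition."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # assumed convention: singletons contribute 0
        # dist(point, point) is 0, so dividing by len(own) - 1 averages
        # over the other members of the element's own cluster.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # Average distance to the nearest other cluster.
        b = min(sum(dist(point, q) for q in members) / len(members)
                for other, members in clusters.items() if other != label)
        total += b / (a + eps)
    return total / len(points)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(aswc(pts, [0, 0, 1, 1]))  # compact, well-separated clusters
    print(aswc(pts, [0, 1, 0, 1]))  # a poor partition scores lower
```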
The value of PBM is determined by:
where the average value of all elements in the jth cluster serves as the center of that cluster. The higher the value of the PBM index, the better the performance.
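A small sketch of the PBM index follows. It assumes the usual crisp form of the index, PBM = ((1/K) * (E_1/E_K) * D_K)^2, where E_1 is the sum of distances from all elements to the grand mean, E_K is the sum of distances from each element to its own cluster center, and D_K is the largest distance between two cluster centers. The names `pbm` and `centroid` are hypothetical, and Euclidean distance is assumed.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def pbm(points, labels):
    """PBM index of a crisp partition (assumed standard form)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2:
        raise ValueError("at least two clusters are required")
    centers = {label: centroid(members) for label, members in clusters.items()}
    grand_mean = centroid(points)
    e_1 = sum(dist(p, grand_mean) for p in points)
    e_k = sum(dist(p, centers[label]) for p, label in zip(points, labels))
    if e_k == 0.0:
        raise ValueError("within-cluster scatter is zero")
    keys = list(centers)
    d_k = max(dist(centers[a], centers[b])
              for i, a in enumerate(keys) for b in keys[i + 1:])
    return ((e_1 / e_k) * d_k / k) ** 2


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(pbm(pts, [0, 0, 1, 1]))
```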
The DB index is determined by Eq. (29).
where the size of the ith cluster enters the formula, and the remaining two quantities are computed by
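The DB index can be illustrated with the sketch below. It assumes the standard Davies–Bouldin form: for each cluster, take the worst ratio (S_i + S_j) / M_ij over the other clusters, where S_i is the average distance of the members of cluster i to its center and M_ij is the distance between the centers of clusters i and j, then average over clusters. The name `davies_bouldin` is hypothetical and Euclidean distance is assumed. Unlike the other indices, a lower DB value indicates a better partition.

```python
from math import dist  # Euclidean distance, Python 3.8+


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a crisp partition (assumed standard form)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    centers = {}
    scatter = {}
    for label, members in clusters.items():
        n = len(members)
        center = tuple(sum(coords) / n for coords in zip(*members))
        centers[label] = center
        # S_i: average distance of the members to their cluster center.
        scatter[label] = sum(dist(p, center) for p in members) / n
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / dist(centers[i], centers[j])
                     for j in clusters if j != i)
    return total / len(clusters)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(davies_bouldin(pts, [0, 0, 1, 1]))
```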
The average value and standard deviation value in experimental results are denoted as Ave and STD Dev, respectively.
Herein, the proposed method is assessed by classification accuracy in two situations: on all data and on labeled data. The experimental results are presented for each case in turn.
Evaluation by classification accuracy on all data
Using all the data elements of the 15 datasets, the classification accuracy of PNTS3FCM, FC-PFS and CS3FCM is calculated and presented as follows. Table 3 shows the classification accuracy on all data for the datasets without outliers.
As shown in Table 3, PNTS3FCM gets the best CA results on 7/9 datasets (all except Australian and WDBC). FC-PFS does not achieve the highest CA on any dataset. CS3FCM is the best model on 2/9 datasets (Australian, WDBC).
From the results in Table 4, PNTS3FCM gives the best classification results on 4 out of 6 datasets (Glass, Yeast, Vertebral, Ionosphere). FC-PFS is better only on the Wine dataset and CS3FCM is better only on the Ecoli dataset.
Summary: During the evaluation by classification accuracy on all data, including outlier and non-outlier (15 datasets), PNTS3FCM is the best on 11 datasets (Balance-scale, Dermatology, Heart, Iris, Spambase, Tae, Waveform, Glass, Yeast, Vertebral, Ionosphere). FC-PFS is the best model on the Wine dataset. CS3FCM is the best model on three datasets (Australian, WDBC, Ecoli).
Evaluation by classification accuracy on labeled data
Using the labeled data elements of the 15 datasets, the classification accuracy (CA) of PNTS3FCM, FC-PFS and CS3FCM is calculated and presented as follows. Table 5 shows the classification accuracy on labeled data for the datasets without outliers.
In Table 5, PNTS3FCM gets the best CA results on 7/9 datasets (all except Iris and WDBC). FC-PFS does not have the highest value on any dataset. CS3FCM is the best model on 2/9 datasets (Iris, WDBC). As shown in Table 6, PNTS3FCM shows the highest values on 4/6 datasets (Ecoli, Glass, Yeast, Vertebral). FC-PFS does not have the highest CA on any dataset. CS3FCM is the best model on 2/6 datasets (Wine, Ionosphere).
Summary: In the evaluation by classification accuracy on labeled data, including outlier and non-outlier data (15 datasets), PNTS3FCM has the best results on 11 datasets (Australian, Balance-scale, Dermatology, Heart, Spambase, Tae, Waveform, Ecoli, Glass, Yeast, Vertebral). FC-PFS does not have the highest CA on any dataset. CS3FCM is the best model on four datasets (Iris, WDBC, Wine, Ionosphere).
Summary: As shown in Table 7, in the evaluation of clustering quality by the DB index on all data, including outlier and non-outlier data (15 datasets), PNTS3FCM gets the best results on ten datasets (Australian, Dermatology, Heart, Tae, Waveform, WDBC, Ecoli, Glass, Yeast, Wine). FC-PFS is the best model on three datasets (Iris, Vertebral, Ionosphere). CS3FCM is the best model on two datasets (Balance-scale, Spambase). This indicates that the proposed method achieves better clustering quality than the others not only on outlier data but also on non-outlier data.
We compare PNTS3FCM and CS3FCM on 15 datasets in terms of computational time. Table 8 shows the computational time on the datasets without outliers.
Summary: In the evaluation of computational time on all data, including outlier and non-outlier data (15 datasets), PNTS3FCM has better results on nine datasets. CS3FCM is the best model on six datasets. The proposed method seems to perform better with a larger number of clusters. Two factors explain the better results. Firstly, the proposed method is based on the Picture fuzzy set, which carries more information to reduce the noise or hesitation in partitioning data. Secondly, PNTS3FCM has a safe semi-supervised part for labeled and unlabeled data that can cope with doubtful labeled data and reduce their effect on the clustering process.
This research proposed a novel technique called Picture-Neutrosophic Trusted Safe Semi-Supervised Fuzzy Clustering (PNTS3FCM) to address the issue of data clustering with high confidence and noisy information. PNTS3FCM is constructed by combining Picture Fuzzy Sets (PFS), Neutrosophic Sets and safe semi-supervised fuzzy clustering. The method consists of four critical parts: the clustering part, the outlier-solving part and two safe semi-supervised fuzzy clustering parts, one for labeled data and one for unlabeled data. Through the use of PFS, the first stage of PNTS3FCM aims to reduce the distance between data elements and cluster centers. The second stage processes the “noisy data” by integrating the entropy quantity between the neutral and refusal degrees. The third and fourth stages coordinate the safe semi-supervised fuzzy clustering using both labeled and unlabeled data to handle the safe information. We also provide an iterative technique derived from the formulation to construct the cluster centers and memberships. The method produces final clusters that are reliable and confident.
PNTS3FCM has demonstrated its effectiveness in a comparison with two related methods, the FC-PFS and CS3FCM algorithms. The experimental results show that PNTS3FCM is better than the others on most datasets in terms of computational time and clustering quality. Although the proposed PNTS3FCM mainly focuses on eliminating or reducing noisy data elements, the method still has some limitations. First, the two additional parts of the objective function increase its computational cost. Second, it requires a larger number of parameters. In the future, an effective optimization algorithm will be studied and introduced to overcome these limitations.
Acknowledgement: We are grateful for the support from the staff of the Institute of Information Technology, Vietnam Academy of Science and Technology.
Funding Statement: This research is funded by Graduate University of Science and Technology under grant number GUST.STS.ĐT2020-TT01.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
- X. Ji, S. Liu, P. Zhao, X. Li and Q. Liu, “Clustering ensemble based on sample’s certainty,” Cognitive Computation, vol. 13, no. 4, pp. 1034–1046, 202
- J. Zhang, H. Wang, S. Huang, T. Li, P. Jin et al., “Co-adjustment learning for co-clustering,” Cognitive Computation, vol. 13, no. 2, pp. 504–517, 2021.
- P. H. Thong, “Some novel hybrid forecast methods based on picture fuzzy clustering for weather nowcasting from satellite image sequences,” Applied Intelligence, vol. 46, no. 1, pp. 1–15, 2017.
- A. Khosravanian, M. Rahmanimanesh, P. Keshavarzi and S. Mozaffari, “Fuzzy local intensity clustering (FLIC) model for automatic medical image segmentation,” The Visual Computer, vol. 37, no. 5, pp. 1185–1206, 2021.
- S. A. Kumar, B. S. Harish and V. M. Aradhya, “A picture fuzzy clustering approach for brain tumor segmentation,” in 2016 Second Int. Conf. on Cognitive Computing and Information Processing (CCIP), Mysuru, India, pp. 1–6, 2016.
- G. Bode, T. Schreiber, M. Baranski and D. Müller, “A time series clustering approach for building automation and control systems,” Applied Energy, vol. 238, no. 11, pp. 1337–1345, 2019.
- J. C. Bezdek, Pattern recognition with fuzzy objective function algorithms. New York: Plenum Press. http://dx.doi.org/10.1007/978-1-4757-0450-1.
- N. Grira, M. Crucianu and N. Boujemaa, “Active semi-supervised fuzzy clustering,” Pattern Recognition, vol. 41, no. 5, pp. 1834–1844, 200
- L. A. Zadeh, “Fuzzy sets,” Information and Control, vol. 8, no. 3, pp. 338–353, 1965.
- P. H. Thong and L. H. Son, “An overview of semi-supervised fuzzy clustering algorithms,” International Journal of Engineering and Technology, vol. 8, no. 4, pp. 301–306, 2016.
- H. Gan, “Safe semi-supervised fuzzy C-means clustering,” IEEE Access, vol. 7, pp. 95659–95664, 2019.
- H. Gan, Y. Fan, Z. Luo and Q. Zhang, “Local homogeneous consistent safe semi-supervised clustering,” Expert Systems with Applications, vol. 97, pp. 384–393, 2018.
- H. Gan, Y. Fan, Z. Luo, R. Huang and Z. Yang, “Confidence-weighted safe semi-supervised clustering,” Engineering Applications of Artificial Intelligence, vol. 81, pp. 107–116, 2019.
- L. Guo, H. Gan, S. Xia, X. Xu and T. Zhou, “Joint exploring of risky labeled and unlabeled samples for safe semi-supervised clustering,” Expert Systems with Applications, vol. 176, pp. 114796–114803, 2021.
- B. C. Cuong and V. Kreinovich, “Picture fuzzy sets,” Journal of Computer Science and Cybernetics, vol. 30, no. 4, pp. 409–420, 2014.
- F. Smarandache, Neutrosophy: Neutrosophic probability, set and logic: analytic synthesis & synthetic analysis. Santa Fe, Rehoboth, MA, USA: American Research Press, 1998.
- P. H. Thong and L. H. Son, “Picture fuzzy clustering: A new computational intelligence method,” Soft Computing, vol. 20, no. 9, pp. 3549–3562, 2016.
- J. Zhou, C. P. Chen, L. Chen and H. X. Li, “A collaborative fuzzy clustering algorithm in distributed network environments,” IEEE Transactions on Fuzzy Systems, vol. 22, no. 6, pp. 1443–1456, 2013.
- L. H. Son, “DPFCM: A novel distributed picture fuzzy clustering method on picture fuzzy sets,” Expert Systems with Applications, vol. 42, no. 1, pp. 51–66, 2015.
- C. Wu and Y. Chen, “Adaptive entropy weighted picture fuzzy clustering algorithm with spatial information for image segmentation,” Applied Soft Computing, vol. 86, no. 4, pp. 105888–105927, 20
- C. Wu and Z. Kang, “Robust entropy-based symmetric regularized picture fuzzy clustering for image segmentation,” Digital Signal Processing, vol. 110, no. 1, pp. 102905–102933, 20
- C. Wu and N. Liu, “Suppressed robust picture fuzzy clustering for image segmentation,” Soft Computing, vol. 25, no. 5, pp. 3751–3774, 2021.
- K. Atanassov, “Intuitionistic fuzzy sets,” International Journal Bioautomation, vol. 20, pp. S1–S6, 2016.
- UCI Machine learning repository, “Data,” 2021. [Online]. Available: https://archive.ics.uci.edu/ml/index.php.
- Outlier detection datasets (ODDS), “Data,” 2016. [Online]. Available: http://odds.cs.stonybrook.edu/.
- L. Vendramin, R. J. Campello and E. R. Hruschka, “Relative clustering validity criteria: A comparative overview,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 3, no. 4, pp. 209–235, 2010.