Online Social Networks (OSN) sites allow end-users to share a great deal of information, which may also contain sensitive information, that may be subject to commercial or non-commercial privacy attacks. As a result, guaranteeing various levels of privacy is critical while publishing data by OSNs. The clustering-based solutions proved an effective mechanism to achieve the privacy notions in OSNs. But fixed clustering limits the performance and scalability. Data utility degrades with increased privacy, so balancing the privacy utility trade-off is an open research issue. The research has proposed a novel privacy preservation model using the enhanced clustering mechanism to overcome this issue. The proposed model includes phases like pre-processing, enhanced clustering, and ensuring privacy preservation. The enhanced clustering algorithm is the second phase where authors modified the existing fixed k-means clustering using the threshold approach. The threshold value is determined based on the supplied OSN data of edges, nodes, and user attributes. Clusters are k-anonymized with multiple graph properties by a novel one-pass algorithm. After achieving the k-anonymity of clusters, optimization was performed to achieve all privacy models, such as k-anonymity, t-closeness, and l-diversity. The proposed privacy framework achieves privacy of all three network components, i.e., link, node, and user attributes, with improved utility. The authors compare the proposed technique to underlying methods using OSN Yelp and Facebook datasets. The proposed approach outperformed the underlying state of art methods for Degree of Anonymization, computational efficiency, and information loss.
A platform for users to exchange information with their peers is provided by an Online Social Network (OSN) [
In contrast to the actual world, where data is transient, information on the internet stays indefinitely, posing a significant risk to online users’ privacy. When individuals share sensitive information online, they frequently ignore the associated consequences [
The protection of privacy and anonymization in OSNs has sparked considerable attention. Each OSN may be presented in graph form. OSN represents the end-user as a node and the edge between two such nodes as a link [
The authors proposed a novel privacy preservation framework for the OSNs consisting of privacy for all network elements. The motivation of the proposed model is to provide all the privacy notions of OSNs using the graph properties to create an enhanced clustering approach. The proposed model acquires OSN data and applies the pre-processing to remove the noisy contents and attributes normalization without any IL. After pre-processing the input OSN data, different graph properties has selected for the initial cluster formation. The authors designed the enhanced clustering algorithm to ensure the minimum information loss with higher privacy preservation of attributes, edges, and nodes in the network. The authors modified the existing static K-means algorithm to discover the optimal number of clusters for the input OSN data using the multiple graph properties. A threshold is computed using various graph properties, and then the initial clusters are formed using that threshold. After forming clusters, the next phase is optimizing the clusters to achieve k-anonymity and higher privacy models, namely t-closeness and l-diversity. Authors developed unique lightweight techniques that meet all privacy criteria for clustered OSN data privacy. Research work is divided into the following sections. Section 2 examines the related works in terms of motivation and contributions. In Section 3, the design and methods of the proposed model are described. In Section 4, results from the simulation and system analysis are discussed. Section 5 concludes with recommendations for further work.
Privacy preservation becomes the essential requirement to secure OSNs. This section aims to review the privacy preservation techniques in OSNs. As previously mentioned, the OSN is represented as a graph with vertices, links, and attributes. Therefore, securing these OSN components with minimum IL is vital. Further privacy preservation strategies are discussed, offered a research gap analysis, and then presented the contributions.
The previous solution for Pervasive Social Network (PSN) privacy protection had presented in [
According to [
The authors of [
The paper discusses the various research gaps in the underlying methods in the above section to justify the novelty of the proposed model. According to the above studies, achieving the complete privacy preservation of the OSN data is still a challenging research problem concerning the trade-off among the essential requirements such as system dynamics, minimum IL, computational efficiency, and higher DoA. The underlying methods have failed to achieve all these requirements while presenting the privacy preservation solution for OSNs. The research gaps that motivate the novel proposed model in this paper are summarized below.
Cryptography-based approaches achieved safe communications with a certain amount of anonymity, but not in all OSN components. In addition, they depended on trustworthy authorities to protect their privacy and secure communications.
Grouping/clustering-based approaches have shown acceptable results. However, they have not yet simplified or enhanced performance and privacy protection trade-off needs. It has been shown that swarm intelligence-based clustering algorithms can obtain the k-anonymity privacy principles in OSNs with high computational complexity. Hence, because of the low quality of anonymization, such clustering algorithms failed to accomplish high-level privacy protection in online social networks. The grouping was based on only one graph characteristic, limiting the privacy protection in OSNs.
Other OSN graph-based techniques failed to ensure privacy in all components, such as nodes, edges, and node characteristics. Attackers using such tactics can readily exploit the graph’s structural information by linking an anonymized OSN graph, which leads to IL.
Differential privacy and tailored privacy techniques for online social networks were discussed in the literature. On the other hand, different privacy approaches consider each data owner has identical privacy expectations and thus fail to accommodate various perspectives on privacy. On the other hand, a configurable privacy scheme initiates the creation process of differential privacy. It leads to an unanticipated correlation in new noises that compromise data privacy and leak sensitive information. All underlying methods discussed in the literature relied on the static computing environment to produce privacy protection in OSN.
The idea of the improved privacy preservation model for the OSN with limited scope had been introduced in the literature, but no provision to ensure the k-anonymization, t-closeness, and l-diversity.
A novel enhanced privacy preservation framework is proposed in this paper to overcome the limitations of underlying methods. The authors recently addressed some of the problems mentioned in [ A novel enhanced cluster-based anonymization for privacy preservation paradigm is developed, which employs numerous graph features for enhanced cluster formation to accomplish high-level privacy protection in OSNs while requiring reduced information loss and computational effort. Research work uses a lightweight pre-processing approach to decrease data noise and complexity. The threshold-based mechanism has been proposed for estimating the optimal number of clusters by enhancing the static k-means clustering algorithm. The proposed clustering achieves fully enhanced privacy preservation notions for OSN. To assure k-anonymization privacy, the authors design a unique enhanced cluster optimization approach that uses OSN network features, eccentricity, and distance. To improve the privacy of k-anonymized clusters, the authors design a unique one-pass technique that ensures enhanced clusters and has l-diversity and t-closeness to defend against similarity and attribute disclosure concerns. The results of a performance analysis of the proposed model using comparable approaches on real-world datasets, altering the number of OSN users and size, are presented in the research work.
This section discusses the proposed enhanced OSN privacy preservation framework’s design and methodology. The main objectives of the proposed model include (1) discovering the optimal number of clusters to divide the pre-processed OSN data into different groups and (2) ensuring the enhanced privacy preservation notions of k-anonymization, t-closeness, and l-diversity via a cluster’s optimization algorithm.
The outcome of the EAP phase is further given to the EDCP phase, where the one-pass algorithm optimizes clusters to guarantee the l-diversity and t-closeness privacy notions.
Consider the following: OSN is represented as a graph
The study focuses on offering enhanced privacy preservation of OSN graph Instead of static clustering, split the OSN data into the optimal number of clusters based on the input dimensions and attributes. To achieve all the privacy notions of k-anonymization, t-closeness, and l-diversity for the input OSN clusters. To reduce sensitive information leakage, i.e., minimal IL, while maintaining high levels of privacy
As described above, this phase consists of OSN data pre-processing, discovering optimal cluster numbers using threshold, and forming enhanced clusters. Data pre-processing performs data normalization before clustering since raw data may contain outliers or messy data. The data normalization step’s goal is to improve the quality of OSN data using statistical attribute modeling. As a result of the statistical normalization of users and their vital properties, outliers or messy data are eliminated without needing a specific function. Thus, the data noise is discarded using the data normalization mechanism.
Fixed K-means clustering forms the
The needed number of clusters for the incoming OSN data is computed using the threshold
After that, the mean function has applied to get the threshold value T as
The enhanced clustering of the OSN data failed to achieve the k-anonymization privacy notion as each cluster in the network does not satisfy the at-least k-user’s requirements. The k-anonymity requirement in a network requires at least k users in each cluster, i.e., every cluster
For each cluster
where
The minimum
Finally, the hybrid score for every vertex in the current cluster is computed using the weight management technique, as shown in
The enhanced l-diversity and t-closeness phase (EDCP) is designed to achieve the privacy models such as t-closeness and l-diversity for the enhanced clustered OSNs, as shown in Algorithm 2.
Algorithm 2 demonstrates the functioning of a one-pass algorithm that first assures the t-closeness notion with a predetermined threshold value
By employing the clustering approach described in this paper, the privacy of vertices and users and their properties are protected while sensitive information loss and computational overhead are minimized. The edges in OSNs are anonymized after the clusters are ensured to comply with the three principles of privacy preservation. We apply the approach in [
The results of experimental work for performance analysis are presented in this section. We used MATLAB to develop and analyze the suggested model using cutting-edge methodologies. We conducted the studies on a Windows 10 computer with an Intel I3 processor and 4GB of RAM. A total of 25 instances were run for each scenario, and the results were averaged. The proposed approach had compared with two underlying OSN anonymization methods. It should be noted that PSO-GA refers to a hybrid swarm intelligence-based OSN clustering approach, whereas LECC refers to an l-diversity Equi-Cardinal (LECC) clustering approach. The Proposed Fixed Clustering model (PFC) and the Proposed Enhanced Clustering (PEC) are two variants of the proposed privacy preservation framework and evaluated with LECC and PSO-GA methodologies.
The proposed and state-of-the-art techniques’ effectiveness was evaluated using two real-world datasets, Yelp [
Three parameters were taken into account to compare the proposed technique with state-of-the-art techniques: Degree of Anonymization (DoA), Execution Time (ET), and Information Loss (IL). During the OSN data anonymization, the ET represents the average execution time for each of the 25 scenarios considered. Counting the number of assigned users in a cluster has been used to calculate the DoA of every user; hence, user DoA is comparable to cluster DoA. The formulas for computing DoA, IL, and ET are referred to from [
This section presents the experimental outcomes using different methods on the Yelp dataset. To analyze the effect of the number of users on the performances, we have varied the user density by keeping the fixed cluster size for static privacy preservation methods such as LECC, PSO-GA, and PFC. The clusters for the PEC are created based on a threshold. Yelp is an extensive dataset with more than 20,000 users/nodes. The user’s density varied from 2,000 to 20,000 users. We kept the cluster size 100 for LECC, PSO-GA, and PFC. This analysis aims to explain the efficiency of having the optimal number of clusters compared to static approaches. This section demonstrates the comparative results with DoA, IL, AMICD, ASSE, and ET parameters using Yelp Dataset.
The IL is contradictory to the DoA outcomes; therefore, with the increase in user density, the IL performance decreased. The technique with higher DoA and lower IL is declared an efficient privacy preservation solution. As a result of comparing all four methods, the proposed PEC framework outperformed all three fixed clustering methods. With the increase in user density, the privacy preservation performances like DoA and IL become efficient. Because of the increased number of users, the created clusters become more relevant and anonymized, resulting in higher DoA and lower IL. The PFC is a previously proposed model with fixed clustering [
However, the PSO-GA and LECC suffered from the lack of complete privacy notions, leading to lower DoA and a higher possibility of IL.
The proposed enhanced privacy preservation framework, PEC, outperformed all four underlying techniques. PEC delivered a higher DoA than PSO-GA, LECC, and our previous proposed PFC model. PEC has calculated the ideal number of clusters based on enhanced clustering, which is the primary reason for improved performance. The current static models classified all OSNs into a predetermined 100 number of clusters. The fixed clustering leads to the problems of over-clustering or under-clustering, affecting performances like DoA and IL. After enhanced cluster formation, we achieved higher privacy notions using the proposed model that leads to reduced IL and higher DoA. Along with DoA and IL, AMICD and ASSE performance analysis of the privacy preservation notions becomes essential.
LECC approach based on K-means clustering has provided privacy protection with the lowest ET of any strategy. The proposed technique is based on the LECC approach and includes data normalization, enhanced cluster generation, EAP with multiple graph attributes, and EDCP with higher privacy notions. Compared to LECC and PFC, including all privacy concepts with enhanced clustering necessitates more processing. However, compared to static techniques, the proposed model provides an enhanced approach to privacy protection with substantial performance savings.
This section presents the analysis of the privacy preservation methods PSO-GA, LECC, PFC, and PEC using the Facebook dataset. The size of the Facebook dataset is relatively small compared to the Yelp dataset. As the dataset contains around 4000+ users, we varied the user densities from 1,000 to 4,000 and measured the results for each method.
As observed in all results of the Facebook dataset, the trend overlaps with the Yelp dataset. The proposed model significantly outperformed the underlying static privacy preservation models in terms of DoA, and IL parameters. The suggested model reduced IL by generating optimum clusters and exploiting several graph features for cluster optimization. Because of the scaled data points, the created clusters are more relevant, resulting in little IL. The static methods relied on a fixed number of clusters regardless of the user density, which significantly affects the overall performance. For this experiment, we set 70 clusters for each static method. The proposed enhanced approach internally discovered the optimal number of clusters according to input user density and then, with graph properties, performed clusters optimization to ensure the k-anonymization, t-closeness, and l-diversity. Therefore, it shows improved anonymization (DoA) with reduced IL.
In
This section presents the comparative study of the static and enhanced privacy preservation models to claim that the proposed approach achieved data privacy and utility trade-off using two datasets. The data privacy and data utility trade-off are essential requirements for any privacy preservation technique. The underlying methods failed to produce a trade-off among them. The method should provide a higher DoA (data privacy) with minimum data utility (IL, AMICD, and ASSE).
The proposed enhanced privacy preservation model PEC produced higher DoA data privacy with reduced IL, AMICD, and ASSE utilities. The privacy preservation method uses enhanced clustering to compute clusters and has improved data privacy with efficient data utility parameters for both Yelp and Facebook datasets. According to the results achieved for Yelp datasets in the above section,
PSO-GA | LECC | PFC | PEC | |
---|---|---|---|---|
DoA | 8129.6 | 8508.8 | 9308.8 | |
IL | 43.46 | 36.59 | 32.13 | |
ET | 375.74 | 324.41 | 334.93 | |
AMICD | 0.7686 | 0.7432 | 0.6601 | |
ASSE | 10.76 | 9.35 | 6.41 |
PSO-GA | LECC | PFC | PEC | |
---|---|---|---|---|
DoA | 2136 | 2339 | 2648 | |
IL | 73.48 | 69.25 | 62.88 | |
ET | 101.97 | 85.38 | 89.39 | |
AMICD | 0.9161 | 0.8892 | 0.8183 | |
ASSE | 14.25 | 13.31 | 10.43 |
The proposed enhanced privacy preservation model improved four out of five parameters compared to underlying methods. It shows that the ET parameter belongs to the computational time, which is a little higher for PEC, but it also significantly improves the DoA and reduces the IL, AMICD, and ASSE. However, all static methods suffered from poor data privacy and data utility performances. The existing techniques, like PSO-GA and LECC, failed to achieve the trade-off among the five metrics. The PFC method is significantly better compared to PSO-GA and LECC. But dynamics of PEC outperformed PFC further as the existing models like LECC utilized the optimization algorithms, which required more time for convergence. The proposed model focused on establishing more reliable and stable clusters using the multiple graph properties. The stable clusters with dynamically discovered cluster numbers reduce the frequent clustering requirements. Therefore, it reduces the IL.
For the Yelp dataset, PEC improves the DoA performance by 10.16% compared to PFC and reduces IL, AMICD, and ASSE by 8.34%, 12.14%, and 23.71% compared to PFC, respectively. For the Facebook dataset, PEC improves the DoA performance by 10.72% compared to PFC and reduces IL, AMICD, and ASSE by 6.18%, 10.2%, and 11.31% compared to PFC, respectively. The enhanced privacy preservation approach PEC improves data privacy and utility parameters by approximately 10+% compared to our static model PFC, compromising additional 3%–4% computational requirements.
Due to the growing use of ONS worldwide, several issues concerning the privacy preservation of sensitive information are arising. Privacy preservation becomes an essential need to secure OSNs from malicious users. The distributed nature of OSNs results in privacy preservation is a challenging task. The novel enhanced privacy preservation model proposed in this paper intends to address improving privacy preservation with minimum IL and computational complexity. To achieve improved privacy preservation, the authors have proposed an enhanced cluster mechanism where the number of clusters was computed. The optimal number of clusters for the OSNs leads to minimum IL with the highest level of anonymization compared to static clustering. Apart from this, the proposed work optimized the clusters to ensure the privacy preservation notions such as k-anonymization, t-closeness, and l-diversity. The experimental results proved the efficiency of the proposed model compared to underlying state-of-the-art privacy preservation methods. The DoA of the proposed model has improved by 25.35% compared to the underlying techniques. The IL proposed model has been reduced by 23.23%, respectively.
The system has the following limitations: 1) Although the proposed privacy preservation model delivered the best performances using different OSN datasets, applying optimization algorithms is still room for improvement. The proposed flexible clustering mechanism may lead to unreliable results in some instances due to the lack of an optimization mechanism. 2) The proposed model requires a semi-automatic process for privacy preservation. The proposed model has not explored the emergence of deep learning mechanisms for automatic privacy preservation requirements. 3) The proposed system performs well for static OSN datasets only, which is a system’s limitation; hence the system can be extended for dynamic OSN, where nodes and edges are added and deleted at runtime. Finding sensitive information and effectively implementing anonymization in dynamic real-world networks will be suggested in future work.
The exciting direction to extend the proposed work is to apply the various optimization algorithms to produce the optimized clusters without loss of information. The authors suggest introducing new privacy preservation notions to extend the proposed model.
We thank all those referees who contributed their time and suggestions to our work.
The authors received no specific funding for this study.
The authors declare they have no conflicts of interest to report regarding the present study.