Achieving sustainable privacy preservation is challenging in a resource-shared computing environment, and the exponential growth of big data makes the challenge more pressing. Despite the existence of specific privacy preservation policies at the organizational level, sustainable protection of a user’s data at its various stages (collection, utilization, reuse, and disclosure) has not been implemented in the intended spirit. For every item of personal data collected and used, organizations must ensure that they comply with their defined obligations. We propose a new clustered-purpose based access control for users’ sustainable data privacy protection in a big data environment. The clustered-purpose based access control helps ensure that personal data is handled for stated, unambiguous, and genuine purposes. The proposed algorithm picks specific records from the sample space. It ensures the sustainability and utilization of data for intended purposes by validating the existing privacy tags and assigning new privacy tags based on a clustered-purpose based approach. The proposed method addresses both the security and the sustainable privacy of existing as well as new personal data managed inside large database repositories. A comparative analysis of the results shows that the proposed algorithm outperforms existing conventional non-purpose based methods of sustainable privacy preservation. The proposed algorithm clusters the large datasets in a big data environment and allows access to authorized users only. The current study is limited to purpose-based access control based on privacy tags; future research can consider other types of privacy protection scenarios in a shared environment.
Data has grown exponentially in recent years. Information technology supports human-computer interaction in increasingly effective ways, but it struggles to ensure the privacy and security of users’ sensitive information. Although sophisticated systems now collect, store, and manage massive amounts of personal data, optimal preservation of privacy remains an open optimization problem at the organizational level. Ensuring security and privacy has therefore become a considerable challenge. Because this challenge has yet to be resolved, many people are still afraid to share their information, online or otherwise.
The central concept of security is protecting integrity, confidentiality, and availability. With the development of online data sharing and the advancement of information technology, data security has become an increasingly important issue. Data are vulnerable to exposure through several factors, such as cyber-attacks, data combination, or end-user tracking. Protection against these threats is possible using privacy-enhancing technologies (PETs) together with optimized security procedures. Such technologies could also cover other kinds of data and privacy protection, employing new tools for protecting personal data in offline and online transactions [
The amount of data that an organization holds helps its management make strategic decisions. Business intelligence emphasizes the power of data above all else. Data analytics applied to individuals can also yield unintended conclusions, for example, the case of a father finding out about his teenage daughter’s pregnancy through a personalized promotion directed to the daughter via Amazon data analytics. There are specific security threats involved in the utilization of big data that emerge from public repositories (migration of data to the cloud and its sharing with public users) [
Most organizations have issued guidelines or regulations on protecting consumers’ and users’ privacy to reassure people about sharing their information. The problem arises when the responsible party fails to comply with its own rules. This concern is amplified when the data used within an organization is also shared across the platform among its companies. Even worse, data is now available for purchase from certain vendors, for example, Amazon. According to Yang et al. [
Restricting access to sensitive data, or clustering specific data based on users’ tags, can be an effective way to control access to data centres. Access to sensitive data should meet the necessary security requirements while still allowing efficient and flexible management, insertion, and retrieval of data. The security and privacy requirements should be implemented through the organizational policies for granting access to and control of users’ sensitive data [
Similarly, Byun et al. [
Intended purpose: A policy that recognizes the deliberate type of access of data.
Access purpose: A policy that recognizes the purpose of data access bound with the intention.
Ethically and professionally, organizations that collect users’ sensitive data should inform the users in advance about the purpose and intention of seeking the information. Such organizations should also notify the users before exposing or forwarding this sensitive information to other entities for other purposes. The privacy of users can be ensured in this manner, but users are mostly unwilling to allow organizations to access and spread sensitive information for specific purposes. Under such a model, organizations may lose the chance to obtain data from users. Kabir et al. [
Allowed intended purpose: Access to data is permitted for a specific use defined by the data provider.
Prohibited intended purpose: Access to data is not permitted for a particular purpose specified by the data provider.
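The distinction between allowed and prohibited intended purposes can be illustrated with a minimal Python sketch. The class `IntendedPurpose`, the helper `is_compliant`, and the purpose names are hypothetical and are not taken from the cited works. The sketch assumes that a prohibition overrides an allowance when both apply, and it ignores any purpose hierarchy.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IntendedPurpose:
    """Purposes a data provider allows and prohibits (illustrative only)."""
    allowed: frozenset = field(default_factory=frozenset)
    prohibited: frozenset = field(default_factory=frozenset)


def is_compliant(access_purpose: str, intended: IntendedPurpose) -> bool:
    """Return True when the access purpose is allowed and not prohibited.

    Assumption: a prohibition overrides an allowance on conflict.
    """
    if access_purpose in intended.prohibited:
        return False
    return access_purpose in intended.allowed


# Hypothetical policy attached to one user's record.
policy = IntendedPurpose(
    allowed=frozenset({"billing", "shipping"}),
    prohibited=frozenset({"marketing"}),
)

for purpose in ("billing", "marketing", "analytics"):
    print(purpose, is_compliant(purpose, policy))
```

A purpose that is neither allowed nor prohibited is rejected here, which is a conservative default chosen for the sketch rather than a rule stated in the text.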
This section discusses the previous works in data clustering and purpose-based access control.
Technique | By | Method |
---|---|---|
CURE (Clustering Using Representatives) | [ | Each cluster is represented by the data points that are supposed to be encapsulated within the domain of the cluster. These points can be shrunk towards the center, which helps to reduce the distance between points within the cluster space. |
k-Mode algorithm | [ | A well-known partition clustering algorithm. Works by employing the mode of the data points under consideration. Tries to reduce the cost function, similar to other clustering algorithms. Robust to outliers and works fine for numerical attributes of data. |
ROCK (Robust Clustering using Links) | [ | Belongs to the domain of agglomerative clustering algorithms. Similar to other agglomerative approaches, it employs the links strategy for quantifying similarity. Scalability depends on the sample size. |
k-Histogram | [ | Suitable for categorical data and considered an extension of k-means. Dynamically updates the clustering process and uses the histogram concept in place of the mean concept. |
DBSCAN | [ | A well-known clustering algorithm based on the density of data points within the domain; it suppresses noise (outliers in the data). |
Fuzzy rule-based clustering algorithm | [ | Unsupervised clustering is achieved by employing supervised classification approaches. Fuzzy rules are exploited to identify the essential clusters in the data space. |
Squeezer | [ | Deals with categorical data rather than numerical data. Its implementation comprises two types of data structures. Produces high-quality clustering results with good scalability. |
Herd clustering | [ | Inspired by human mobility patterns and herd behavior in the real world. Clusters are formed by the moving particles, which are represented by the data instances. |
Several studies aim to protect and preserve users’ privacy by employing the concept of “purpose” when granting access under a particular access-control policy.
Technique | By | Method |
---|---|---|
Platform for privacy preferences (P3P) | [ | The website encodes the data in a specific format called P3P and ensures that users’ information remains accessible to legitimate people only. |
Hippocratic databases | [ | These databases contain specific policies and authorization patterns for accessing users’ sensitive information for particular purposes. |
Strawman | [ | It also proposes a purpose-based access control aligned with specific access policies. |
Hippocratic databases | [ | It also proposes a method of implementing a privacy policy in Hippocratic databases. It emphasizes that access to and exposure of data are granted only to legitimate entities, and it lists the purposes for accessing users’ sensitive data. The proposed method introduces models based on granular-level limited access to and disclosure of users’ data and implements the ideas using the query modification method. |
Granular level access control model | [ | It introduced a new notion of validity, conditional validity. |
Granular level access control model | [ | Proposes and implements access control mechanisms at the granular level by considering the transformation from RDBMS to privacy preservation levels. |
Purpose-based access control | [ | It employs VDM to ensure privacy preservation through sophisticated mechanisms. The model defines and implements the entities listed in the PBAC aligned with the corresponding privacy preservation specifications. |
Purpose-based access control | [ | Proposes a model that ensures the privacy protection of users. The model entities correspond to the policies highlighted for purpose-based access to data. Because the approach reflects the purpose of accessing and disclosing data, it is considered a contribution in this direction. |
Enterprise privacy authorization language (EPAL) | [ | Byun et al. [ IBM develops a language that aids in describing privacy policies at the enterprise level. The policies are listed in hierarchies reflecting the data categories associated with specific purposes of data access. The implementation supports actions and obligations as defined in the policy set. |
User authentication and data authorization | [ | Proposes a model that ensures user authentication and data authorization for safer access to users’ data. Implements the authorization policies for purpose-based access and disclosure of data. |
Attribute level access control aligned with the purpose-based privacy policy | [ | Proposes a model that considers attribute-level access control and ensures purpose-based access to sensitive data. |
Achieving sustainable privacy preservation is challenging in a resource-shared computing environment, and the exponential growth of big data makes the challenge more pressing. Despite the existence of specific privacy preservation policies at the organizational level, protection of a user’s data at its various stages (collection, utilization, reuse, and disclosure) has not been implemented in the intended spirit. For every item of personal data collected and used, organizations must ensure that they comply with their defined obligations. We propose a new clustered-purpose based access control for users’ sustainable data privacy protection in a big data environment. The clustered-purpose based access control helps ensure that personal data is handled for stated, unambiguous, and genuine purposes.
The general architecture of the proposed purpose-based access model is shown in
In contrast to conventional data access, archive, retention, and sharing policies, the proposed architecture incorporates the essential aspect of “the sustainable purpose of access,” treating purpose-based data access and disclosure as a core component. Purpose-based access confirms that data is used properly and appropriately for specifically defined purposes. It has been observed that users’ satisfaction, agreement, and trust regarding purpose-based access to data support the implementation of this architecture over existing conventional data access architectures.
Data clustering plays an essential role in data mining due to its ability to work on a large amount of data [
Step 1: Pick a record vector
Step 2: Define purpose-based tags
Step 3: Identify documents based on
Step 4: Compute the similarity between documents contained in the vector
Step 5: Establish a similarity matrix based on Step (4) by assigning each document to the cluster that has the closest similarity as defined in (2).
Step 6: Output the similarities of
The process starts by randomly selecting a vector of users’ records from a sample space repository (without duplication, so that records already fetched are not drawn again in the next fetch). We devise a filter that checks whether each record carries a purpose-based access tag. Records without a purpose-based access tag are then assigned the tags defined by the organization in its purpose-based access policy. Once tags are assigned, we select a seed to start building a cluster and then repeatedly add records to the cluster such that each added record incurs the least information loss within the cluster. The algorithm determines which clusters have a proximity relationship with neighboring clusters based on the similarity index score. The “purpose aware” semantic similarity is identified using the Manhattan similarity index.
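The tagging and cluster-building steps described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors’ implementation: the record layout (`features`, `tag`), the function names, and the fixed `cluster_size` are hypothetical, and “least information loss” is approximated here by the smallest Manhattan distance to the seed record.

```python
import random


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign_missing_tags(records, default_tag):
    """Keep existing purpose tags; give untagged records the policy tag."""
    return [{**r, "tag": r.get("tag") or default_tag} for r in records]


def build_clusters(records, cluster_size, seed=0):
    """Greedy clustering: pick a random seed record, then repeatedly add
    the unassigned record closest (in Manhattan distance) to that seed.

    Assumption: distance to the seed stands in for information loss.
    """
    rng = random.Random(seed)
    remaining = list(records)
    clusters = []
    while remaining:
        seed_record = remaining.pop(rng.randrange(len(remaining)))
        cluster = [seed_record]
        while remaining and len(cluster) < cluster_size:
            nearest = min(
                remaining,
                key=lambda r: manhattan(r["features"], seed_record["features"]),
            )
            remaining.remove(nearest)
            cluster.append(nearest)
        clusters.append(cluster)
    return clusters


# Hypothetical records; two of them arrive without a purpose tag.
records = [
    {"id": 1, "features": [0, 0], "tag": "billing"},
    {"id": 2, "features": [0, 1]},
    {"id": 3, "features": [10, 10], "tag": "billing"},
    {"id": 4, "features": [10, 11]},
]

tagged = assign_missing_tags(records, default_tag="general")
clusters = build_clusters(tagged, cluster_size=2)
print([[r["id"] for r in c] for c in clusters])
```

In this toy data the two nearby pairs always end up together regardless of which record is drawn as the seed; with real data the result depends on the seed choice and on the distance function in use.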
In general, searching for records that are not adjacent to the first cluster takes longer than searching for the closest record when building the second cluster, because the degree of similarity found must also satisfy the purpose-based policy criteria. The distance to the next record is therefore computed with a distance function that the system administrator can define and change whenever the purpose-based access policy is changed or updated.
Adding an outlier to a cluster increases the information loss ratio, since outliers occur in data samples regardless of similarity. The records are then stored in the database along with the tag of their cluster. The amount of data that a user can access depends on the user’s role or the purpose of the search. For example, an entity accessing the database would not need access to the entire database; instead, only the clusters with matching tags (in the sense of purpose-based access) are reachable. This enhances the privacy preservation of users’ sensitive data against illegitimate access.
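A minimal Python sketch of this tag-matched retrieval is given below. The store layout (a mapping from cluster tag to records), the function name `accessible_records`, and the purpose names are assumptions made for illustration only.

```python
def accessible_records(clustered_db, user_purposes):
    """Return only the records whose cluster tag matches a purpose the
    requesting entity is authorized for; all other clusters stay hidden."""
    return [
        record
        for cluster_tag, records in clustered_db.items()
        if cluster_tag in user_purposes
        for record in records
    ]


# Hypothetical store: cluster tag -> records kept in that cluster.
clustered_db = {
    "billing": [{"id": 1}, {"id": 2}],
    "marketing": [{"id": 3}],
    "research": [{"id": 4}],
}

# An entity authorized only for billing never touches the other clusters.
print(accessible_records(clustered_db, {"billing"}))
```

Because the lookup is keyed by cluster tag, a request scans only the matching clusters rather than the whole repository.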
The methodology restricts access to users’ sensitive data on the basis of users’ tags, which provides an effective way to control access to data centres. In addition, access to sensitive data is designed to meet the necessary security requirements while allowing efficient and flexible management, insertion, and retrieval of data. The security and privacy requirements are implemented through the organizational policies for granting access to and control of users’ sensitive data. Because privacy policies relate more closely to the purpose of data usage than to the actions performed in transactions, conventional access control models are not suitable for achieving privacy protection. Hence, the concept of purpose is an essential component of the proposed model for implementing access control to protect privacy.
There are different access control mechanisms in the cloud environment, e.g., discretionary access control, mandatory access control, and role-based access control mechanisms [
In contrast to the conventional access control mechanisms defined above, the proposed purpose-based access control pursues these essential objectives:
Clusters the purpose-based users’ data as per defined attributes.
Provides dynamic purpose-based access control as organizational policy changes.
Authorizes access to users according to the purpose-based clusters where the user’s authorization exists.
For instance, consider an organization with users in different departments whose data is managed and controlled in a cloud environment. The organizational policies change from time to time to reflect the regulatory plan and users’ requirements for data access. Assume that the users’ data is randomly scattered and that no semantic understanding has been assigned to access control over the information. At one point in time, the organization defines a policy on how to control access to the data. The proposed mechanism clusters the data according to the organizational attributes given to users under the access policies. If the regulatory policy changes at a later time, the purpose-based access control is customized accordingly.
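The scenario can be sketched in Python as a policy that maps an organizational attribute to a purpose tag, with the clusters rebuilt when the policy changes. The attribute (`department`), the policy dictionaries, and the function `cluster_by_policy` are hypothetical and only illustrate the idea of policy-driven re-clustering.

```python
from collections import defaultdict


def cluster_by_policy(users, policy):
    """Group user records under the purpose tag that the current policy
    assigns to their department; unmapped departments get 'untagged'."""
    clusters = defaultdict(list)
    for user in users:
        tag = policy.get(user["department"], "untagged")
        clusters[tag].append(user["name"])
    return dict(clusters)


users = [
    {"name": "alice", "department": "finance"},
    {"name": "bob", "department": "hr"},
    {"name": "carol", "department": "sales"},
]

# Policy at one point in time, and a revised policy at a later time.
policy_v1 = {"finance": "audit", "hr": "payroll"}
policy_v2 = {"finance": "audit", "hr": "audit", "sales": "marketing"}

print(cluster_by_policy(users, policy_v1))
print(cluster_by_policy(users, policy_v2))
```

Only the policy mapping changes between the two calls; the user records themselves are untouched, which is what makes the access control dynamic in this sketch.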
The performance of the proposed clustered purpose-based access algorithm was evaluated against a non-purpose based scenario using different sets of data. For simulation, we considered six datasets generated from Wisconsin Benchmark datasets [
Repository | Dataset | Numeric attributes | String attributes | Customized purpose-based attributes | Length (number of instances) |
---|---|---|---|---|---|
Wisconsin | 1 | 2 | 5 | 5 | 1200 |
Wisconsin | 2 | 3 | 7 | 4 | 1700 |
Wisconsin | 3 | 3 | 6 | 5 | 1450 |
Wisconsin | 4 | 2 | 4 | 4 | 1700 |
Wisconsin | 5 | 3 | 4 | 5 | 1300 |
Wisconsin | 6 | 4 | 4 | 4 | 1400 |
UCI–7 | 7 | 5 | 7 | 3 | 2900 |
UCI–8 | 8 | 3 | 5 | 4 | 500 |
UCI–9 | 9 | 3 | 4 | 5 | 1500 |
UCI–10 | 10 | 3 | 3 | 4 | 2000 |
The goal of these experiments was to investigate the performance of the algorithms. The datasets were analyzed using Python 3.0 with a Jupyter notebook. From the datasets, we created two scenarios, i.e., clustered purpose-based access to users’ records and non-purpose based access. We present here an example that describes the purpose tree and its implementation using a metadata structure for purpose-based access control to data.
For instance,
Process ID | Process Name | Parent of process | Purpose based access control ID | Purpose based access level |
---|---|---|---|---|
P_01 | CEO | None | PBAC_01 | L_01 = {PBAC_01 to PBAC_07} |
P_02 | DDF | CEO | PBAC_02 | L_02 = {PBAC_02, PBAC_04, PBAC_05} |
P_03 | DDA | CEO | PBAC_03 | L_02 = {PBAC_03, PBAC_06, PBAC_07} |
P_04 | SD | DDF | PBAC_04 | L_03 = {PBAC_04} |
P_05 | MD | DDF | PBAC_05 | L_03 = {PBAC_05} |
P_06 | HRD | DDA | PBAC_06 | L_03 = {PBAC_06} |
P_07 | FG | DDA | PBAC_07 | L_03 = {PBAC_07} |
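The access levels in the table follow a simple pattern: each process can reach its own PBAC ID plus the IDs of every process below it in the purpose tree. A minimal Python sketch of that derivation is shown below. The parent links and IDs are copied from the table, with the IDs spelled uniformly as `PBAC_nn`; the dictionaries and the function `access_level` are illustrative names, not part of the original implementation.

```python
# Parent links copied from the table; None marks the root process.
PARENT = {
    "CEO": None, "DDF": "CEO", "DDA": "CEO",
    "SD": "DDF", "MD": "DDF", "HRD": "DDA", "FG": "DDA",
}
PBAC_ID = {
    "CEO": "PBAC_01", "DDF": "PBAC_02", "DDA": "PBAC_03",
    "SD": "PBAC_04", "MD": "PBAC_05", "HRD": "PBAC_06", "FG": "PBAC_07",
}


def access_level(process):
    """Collect the PBAC IDs of a process and all of its descendants."""
    ids = {PBAC_ID[process]}
    for child, parent in PARENT.items():
        if parent == process:
            ids |= access_level(child)
    return ids


for name in ("CEO", "DDF", "FG"):
    print(name, sorted(access_level(name)))
```

Storing only the parent link per process keeps the metadata small, and the access set can be recomputed whenever the tree changes.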
The performance evaluation of purpose-based access control is measured in terms of the query control mechanism [
Similarly, we observed a notable performance gain for the proposed algorithm in terms of seek time for accessing a purpose-based number of records. We can verify that the proposed algorithm performs better in reducing the time complexity involved in fetching users’ records.
The proposed algorithm ensures sustainable privacy preservation of users’ sensitive data for stated, unambiguous, and genuine purposes. Sustainability is achieved by validating the existing privacy tags and assigning new sustainable privacy tags to data that is not yet privacy preserved, following the clustered-purpose based approach. In this way, the proposed method addresses both the security and the sustainable privacy of existing as well as new personal data managed inside large database repositories.
Sustainable privacy preservation (especially in a shared computer environment) is quite challenging and requires careful access to users’ sensitive data. This paper presented a new clustered-purpose based access control for users’ sustainable data privacy protection in a big data environment. The clustered-purpose based access control helps ensure that personal data is handled for stated, unambiguous, and genuine purposes. The proposed algorithm clusters users’ records and controls access to them by validating the existing privacy tags and assigning new privacy tags to data that is not yet privacy preserved, following the clustered-purpose based approach. In this way, the proposed method addresses both the security and the privacy of existing as well as new personal data managed inside large database repositories. The comparative analysis of results shows that our clustered-purpose based access algorithm outperforms conventional non-purpose based access algorithms in the sustainable privacy preservation of users’ sensitive records. The current study assumes that organizations have defined access policies that serve as inputs to the proposed model, which clusters the data based on purpose-based tagging and access. The study is also limited to purpose-based access control based on privacy tags. Future research can consider other types of privacy protection scenarios in a shared environment.
Conceptualization, Norjihan Abdul Ghani; Data curation, Zahra Mahmoud and Raja Majid Mehmood; Formal analysis, Norjihan Abdul Ghani and Zahra Mahmoud; Funding acquisition, Raja Majid Mehmood; Methodology, Muneer Ahmad; Project administration, Norjihan Abdul Ghani; Resources, Raja Majid Mehmood; Writing–original draft, Zahra Mahmoud; Writing – review & editing, Muneer Ahmad.