Privacy-Enhanced Data Deduplication Computational Intelligence Technique for Secure Healthcare Applications

Abstract: Deduplication technology is already deployed in many cloud storage environments. Due to the nature of the cloud environment, storage servers with very large capacity are required, and additional storage must be added as capacity grows. Deduplication addresses this cost problem at its source. However, deduplication is structurally prone to privacy problems. In this paper, we point out this privacy-infringement problem and propose a new deduplication technique to solve it. In the proposed technique, the user-file mapping structure is not stored on the server, so the list of users who uploaded a file cannot be obtained by analyzing the server's meta-information, and user privacy is preserved. In addition, a personal identification number (PIN) is used to resolve the file-ownership problem and provides further advantages such as safety against insider breaches and sniffing attacks. The proposed mechanism requires approximately 100 ms of additional time to compute the ID_Ref that distinguishes user-file pairs during ordinary deduplication; for small files this overhead is comparable to the base operation time, but it becomes relatively smaller as file size grows.


Introduction
The recent surge in the popularity of cloud services has led to the emergence of a variety of cloud service providers (CSPs). Many different types of services are available on cloud platforms, most of which are storage-based solutions, and the cloud storage market is expected to continue to expand [1,2]. In the coming era of the Fourth Industrial Revolution (4IR), data volume is expected to grow to astronomical scale, which will ultimately drive up storage costs and place a huge burden on operators. Deduplication is therefore one of the key technologies for cloud environments: it prevents multiple users from storing the same data redundantly, and it can reduce stored data to unprecedented levels [3,4].
Deduplication technology is already applied in many cloud storage environments. Due to the nature of cloud environments, large storage servers are required, and additional storage must be added as capacity grows; deduplication can solve this cost problem at its source. However, deduplication is structurally prone to privacy problems. To apply deduplication, the mapping structure between users and files must be stored as meta-information, and the list of users who uploaded a specific file can then be obtained through metadata analysis on the server. The problem becomes very serious when files are sensitive, touching on political leanings, beliefs, specific diseases, or sexual life. If the uploader list of a file is available on the cloud server, the risk that the cloud environment acts as a kind of surveillance system cannot be ruled out [5][6][7].
In this paper, we point out these problems and propose a new deduplication technique to address them. Since the proposed technique does not store the user-file mapping structure on the server, meta-information analysis on the server cannot yield the list of users who uploaded a file, protecting users' privacy. In addition, a personal identification number (PIN) is used to address file-ownership issues, with the added advantage of safety against insider attacks and sniffing attacks.

Data Deduplication Technology
Deduplication is a technology that stores identical data as a single file rather than duplicating it. Typically, deduplication is achieved by comparing files' hash values and is commonly implemented by maintaining a mapping of the users who have access to a given duplicated file. Fig. 1 shows an example of deduplication in a real-world application. If user A and user B hold the same file, the cloud server stores the content as a single file together with a mapping structure that lets both users access it simultaneously.
During this process, file removal for user B is completed simply by removing B's mapping information, as shown in Fig. 2. When user A later removes the file, all of user A's mapping information is removed together with the file itself.
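As an illustration (not part of the original paper), the mapping-based deduplication described above can be sketched in Python; the class and method names are hypothetical:

```python
import hashlib

class DedupStore:
    """Toy deduplication store: one copy per unique content,
    plus a user -> file mapping (the structure criticized later)."""

    def __init__(self):
        self.blobs = {}     # file hash -> content (stored once)
        self.owners = {}    # file hash -> set of user IDs (mapping info)

    def upload(self, user, content):
        h = hashlib.sha256(content).hexdigest()
        if h not in self.blobs:          # store only the first copy
            self.blobs[h] = content
        self.owners.setdefault(h, set()).add(user)
        return h

    def delete(self, user, h):
        self.owners[h].discard(user)     # drop only the mapping first
        if not self.owners[h]:           # last owner gone: drop the blob too
            del self.owners[h], self.blobs[h]

store = DedupStore()
h1 = store.upload("A", b"same report")
h2 = store.upload("B", b"same report")
assert h1 == h2 and len(store.blobs) == 1   # one stored copy for two users
store.delete("B", h1)                        # only B's mapping is removed
assert h1 in store.blobs
store.delete("A", h1)                        # last owner: file removed as well
assert h1 not in store.blobs
```

Note that the `owners` dictionary is exactly the user-file mapping that Chapter 3 identifies as a privacy risk.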

Convergent Encryption
In general, deduplication is applied in ways that introduce security vulnerabilities. When the server alone retains a file shared by multiple users, the information is structurally exposed to various security threats; hence the need for security measures appropriate to deduplication.
Douceur et al. proposed convergent encryption as a way to implement deduplication over encrypted data [1]. This method derives the encryption key from a hash of the original plaintext: if the plaintext is M, its key is obtained by applying a cryptographic hash to M, i.e., H(M). Consequently, identical files always produce identical keys and identical ciphertexts, so duplicates can still be detected, while different files yield different keys. Unless the original plaintext is known, it is difficult to recover the key, so files can be managed safely without separate key management. Despite this merit, convergent encryption is vulnerable to brute-force attacks when the data has low entropy [4].
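The core idea can be sketched as follows; this is an illustrative toy, using SHA-256 in counter mode as a stand-in stream cipher rather than a real encryption scheme:

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    # SHA-256 in counter mode as a stand-in stream cipher (illustration only)
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def convergent_encrypt(plaintext: bytes):
    key = hashlib.sha256(plaintext).digest()      # key = H(M)
    ct = bytes(a ^ b for a, b in zip(plaintext, keystream(key, len(plaintext))))
    return key, ct

# Identical plaintexts yield identical keys and identical ciphertexts,
# so the server can deduplicate without ever seeing the plaintext.
k1, c1 = convergent_encrypt(b"shared medical record")
k2, c2 = convergent_encrypt(b"shared medical record")
assert k1 == k2 and c1 == c2
```

Because the key is derived from the content itself, anyone holding the same file derives the same key, which is precisely what makes deduplication of ciphertexts possible, and also what enables the brute-force attack on low-entropy data noted above.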

DupLESS
Bellare et al. proposed DupLESS to thwart such brute-force attacks [5]. DupLESS introduces a separate key server to counteract the weakness of convergent encryption: the key server generates encryption keys by interacting with the client, enabling server-side deduplication. However, both convergent encryption and DupLESS are limited in that they focus only on generating encryption keys for the deduplicated files.
The security limitations of deduplication cannot be resolved simply by encrypting files under an appropriate key-management approach. Notably, user-privacy protection and deduplication are in partial conflict with each other. Chapter 3 discusses the threats of user-privacy breaches arising from deduplication, along with other security issues.

Problems with User-to-file Structure
Fundamentally, to apply deduplication to cloud storage, a mapping structure between users and files must be established, as described in Chapter 2. However, this user-file mapping structure raises a serious privacy problem. Looking at the mapping structure alone, it may not be obvious what kind of privacy issue can arise; yet if a file is highly sensitive to its owner, the user-file mapping itself can threaten the user's privacy.
In the case of Korea, Paragraph 1 of Article 23 (Restrictions on the Processing of Sensitive Information) of the Personal Information Protection Act defines sensitive information as information on ideology and belief, membership in or withdrawal from a trade union or political party, political opinions, health, sexual life, and other information likely to significantly infringe on a data subject's privacy, and restricts its processing by the personal information controller. Paragraph 2 of the same Article further stipulates that necessary measures must be taken to ensure the safety of sensitive information.
Strong security measures are required when an uploaded file contains sensitive information or when the upload history itself is sensitive. When users upload documents supporting a specific political party, ideological content, or files related to particular diseases or sexual life, the user-file mapping structure created during deduplication may itself constitute an invasion of privacy. Moreover, because deduplication maps multiple users to a single file, the server can recover the list of users who uploaded any given file, and the possibility that the cloud server plays a kind of surveillance role cannot be ruled out.
Whether or not a particular user has uploaded a file belongs to that user's private sphere. Even for a cloud server administrator, obtaining the full list of files uploaded by a specific person, or conversely the list of users who uploaded a specific file, can be considered an active violation of privacy.

Incompatibility of Deduplication and Securing Confidentiality
Typically, cloud systems provide data-confidentiality services through server-side encryption. When a plaintext file is received from the client, the server encrypts it before writing it to storage to guard against hacking; when the user requests a download, the server decrypts the file and returns the plaintext. This means the server holds the encryption key and can decrypt the data whenever necessary. Therefore, an administrator with high-level privileges on the cloud server may be able to restore files using that key [6][7][8].
As a stronger privacy measure, the client can encrypt files before uploading them to the server. In this case the files are safe on the server side, but the deduplication mechanism no longer works. As shown in Fig. 3, even though user A and user B uploaded the same file, the copies were encrypted with different keys and are therefore recognized by the server as different files, making deduplication impossible. In other words, achieving full confidentiality through client-side pre-encryption and capacity savings through deduplication at the same time is practically very challenging.
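The conflict can be demonstrated in a few lines; this sketch (not from the paper) uses a SHA-256-derived XOR keystream as a stand-in cipher and hypothetical key values:

```python
import hashlib

def encrypt(key: bytes, data: bytes) -> bytes:
    # XOR with a SHA-256-derived keystream: stand-in cipher for illustration
    ks, ctr = b"", 0
    while len(ks) < len(data):
        ks += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return bytes(a ^ b for a, b in zip(data, ks))

f = b"identical file contents"
ct_a = encrypt(b"user A's secret key", f)
ct_b = encrypt(b"user B's secret key", f)

# Same plaintext, different user keys: the server sees unrelated ciphertexts,
# so hash-based duplicate detection can no longer find the match.
assert ct_a != ct_b
assert hashlib.sha256(ct_a).digest() != hashlib.sha256(ct_b).digest()
```

This is exactly the situation of Fig. 3: the server compares ciphertext hashes, which differ even though the underlying file is the same.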

File Ownership Issue
File ownership and sharing issues concern security policy from the server's point of view. If user A and user B upload the same file in a deduplicating cloud environment, only one copy is actually stored. In this case, one must consider who owns the file; conversely, a user requesting a download must prove ownership of it. This can be solved by strictly managing the list of users who previously uploaded each file, but that method inherits the privacy problem of the user-file mapping structure described above: granting a user ownership of a file reveals that the user previously uploaded it, which makes the uploader of each file plain to the server.
This means that the list of users who uploaded sensitive material, such as promotional materials supporting a specific party, data related to a specific disease or sexual life, and data expressing particular ideas and beliefs, can be obtained directly through metadata analysis on the server side. The cloud environment under the existing deduplication method therefore has a structural privacy problem, and because of this ownership issue, deduplicating cloud services to date do not, strictly speaking, provide complete privacy protection [9][10][11].

Insider Attack Vulnerability
In the cloud environment, there is an inherent risk of personal-information exposure by insiders. Given the many past cases of insider data exposure, insider attacks are considered a very serious security threat in cloud environments.
Typical cloud environments provide minimal security through data encryption. However, it cannot be ruled out that an internal administrator with the highest level of privileges could obtain the file-decryption key, or, even without the key, could recover the file-to-user mapping relationships through metadata analysis. This means that the list of users who uploaded a particular file, or conversely the list of files uploaded by a particular user, can be obtained via metadata analysis [12].
Solving the deduplication and privacy problems simultaneously is practically very difficult. In this paper, we propose a way to solve both at once.

Proposed Deduplication Technique to Solve the Privacy Problem
In this chapter, we propose a new technique to solve the privacy problem in the existing deduplication method.

Overview of Proposed Deduplication Techniques
Privacy issues in the existing deduplication environment fundamentally arise from the structure that maps users to files. If the mapping structure between files and users is completely removed, it becomes impossible to obtain the list of users who uploaded a given file, or conversely the list of files uploaded by a given user. Therefore, in this paper we propose a method in which users cannot be inferred through metadata and file-structure analysis on the server, by fundamentally severing the mapping relationship between users and files [13,14].
The proposed method allows users to upload and download files based on three elements: PIN, ID_Ref, and F_has. The PIN is a value known only to the user; ID_Ref is held by the meta-server and is constructed from the PIN; and F_has is the hash value taken over the file's hash. ID_Ref cannot be guessed from F_has, and conversely neither F_has nor the PIN can be guessed from ID_Ref. This means that no user-file mapping information can be obtained from the ID_Ref and F_has values held by the meta-server and storage server alone: extracting F_has from ID_Ref necessarily requires the PIN value known only to the user. In the proposed method, each element is recorded and used independently in the user-file mapping process, so that no single element can expose the others, thereby solving the privacy problem. However, because each element requires its own processing, the method takes somewhat more time than the conventional approach. Fig. 4 shows the overall layout of the proposed scheme, and Fig. 5 shows the configuration of the client and the cloud server. Clients can connect to the cloud server from a variety of devices such as mobile devices, laptops, and PCs, while the cloud server comprises a meta-server and a storage server, which must be strictly partitioned.

System Architecture
The meta-server and the storage server must be strictly separated: they cannot be configured together on a single server, and it is recommended that they be physically split. The administrators responsible for the meta-server and the storage server must also be separate.

File Duplicate Verification Protocol
The file duplicate verification protocol determines whether a file already exists on the server; if the same file exists, the upload process can be skipped, which is efficient. Fig. 6 shows the flow of the protocol.
1. Upon the client's request, the meta-server generates the session ID ID_Se.
2. The meta-server encrypts ID_Se with the pre-shared secret key EK_Pre and sends it to the client.
3. The client obtains ID_Se by decrypting the ciphertext.
4. The client requests the server to upload the target file to a specific path.
5. The meta-server generates the file path requested by the client.
6. The meta-server informs the client of the result.
7. The client hashes the upload target file to extract the file's hash value.
8. The client applies another hash to the extracted hash value to generate F_has.
9. The client encrypts F_has and transmits it.
10. The meta-server obtains F_has by decrypting the ciphertext sent by the client.
11. The meta-server sends F_has to the storage server to ask whether the file already exists.
12. The storage server checks whether the file is registered.
13. The storage server returns one of E, N, and I, where E (Exist) means the file already exists, N (Not Exist) means it does not exist, and I (Incomplete) means the file has been only partially uploaded.
14. The meta-server responds to the client with the E/N/I value.
The possible results of the duplicate-file check are shown in Tab. 2. If E is returned, the same file already exists on the server, so no upload is required and the client only performs the mapping process. If N or I is returned, an upload is required, so the client must perform a file upload.
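The client-side double hash and the storage server's E/N/I decision can be sketched as follows (an illustrative toy, not the paper's implementation; the dictionary-based storage index is a hypothetical stand-in):

```python
import hashlib

def f_has(content: bytes) -> str:
    # F_has: a hash taken over the file's hash value (double hashing)
    inner = hashlib.sha256(content).digest()
    return hashlib.sha256(inner).hexdigest()

def check_duplicate(storage: dict, fh: str) -> str:
    """Storage-server lookup returning E (Exist), I (Incomplete), or N (Not Exist)."""
    entry = storage.get(fh)
    if entry is None:
        return "N"
    return "E" if entry["complete"] else "I"

# Hypothetical storage-server index keyed by F_has
storage = {f_has(b"old file"): {"complete": True},
           f_has(b"half-uploaded"): {"complete": False}}

assert check_duplicate(storage, f_has(b"old file")) == "E"        # skip upload
assert check_duplicate(storage, f_has(b"half-uploaded")) == "I"   # resume upload
assert check_duplicate(storage, f_has(b"brand new file")) == "N"  # full upload
```

Note that the index holds only F_has values, with no user identifiers, which is what keeps the storage server free of user-file mapping information.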
If the result is I, only part of the file has been uploaded, so the client uploads the remaining portion not yet present on the server. Fig. 7 shows the flow of the file-upload and duplicate-mapping protocol; its final steps are as follows (the earlier steps appear in Fig. 7).
9. The meta-server stores the ID_Ref received from the client.
10. The meta-server maps the ID_Ref and F_Pa values for file deduplication.
11. The meta-server notifies the client that the upload is complete.
Fig. 8 shows the flow of the deduplicated file download protocol.

Figure 8: File download protocol
1. Upon the client's request, the meta-server generates the session ID ID_Se.
2. The meta-server encrypts the generated ID_Se with the pre-shared secret key EK_Pre and sends it to the client.
3. The client receives the ciphertext from the meta-server and obtains ID_Se through decryption.
4. The client uses ID_Se to request a file download from the meta-server.
5. The meta-server first checks whether the file exists at the requested path; if it does not, the protocol terminates.
6. The meta-server informs the client that the file exists at that path.
7. The client receives a PIN from the user.
8. The client encrypts the hash value of the entered PIN with the ID_Se key and sends it to the meta-server.
9. The meta-server decrypts the ciphertext to obtain the PIN's hash value, H(PIN).
10. The meta-server extracts F_has from the following expression based on the ID_Ref and H(PIN) values.

F_has = D_{H(PIN)}((ID_Ref ⊕ ID_User) ⊕ H(PIN))    (2)

11. The meta-server sends the extracted F_has to the storage server.
12. The storage server determines whether a file mapped to the F_has exists.
13. The storage server responds to the meta-server that a file mapped to the F_has exists.
14. The meta-server informs the client that the file is ready for download.
15. The client requests the file from the storage server.
16. The client downloads the file from the storage server, completing the protocol.
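The construction and recovery of F_has can be sketched numerically. This is an assumption-laden illustration: it models E/D under key H(PIN) as XOR with a SHA-256-derived keystream, and the PIN, user ID, and file values are made up:

```python
import hashlib

def ks(key: bytes, n: int) -> bytes:
    # Keystream derived from H(PIN); XOR with it models E/D under key H(PIN)
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

h_pin = hashlib.sha256(b"1234").digest()      # H(PIN), known only to the user
id_user = hashlib.sha256(b"alice").digest()   # ID_User (hypothetical value)
file_hash = hashlib.sha256(hashlib.sha256(b"file").digest()).digest()  # F_has

# Upload side: ID_Ref = E_{H(PIN)}(F_has) XOR H(PIN) XOR ID_User
# (the inverse of Eq. (2))
id_ref = xor(xor(xor(file_hash, ks(h_pin, 32)), h_pin), id_user)

# Download side, Eq. (2): F_has = D_{H(PIN)}((ID_Ref XOR ID_User) XOR H(PIN))
recovered = xor(xor(xor(id_ref, id_user), h_pin), ks(h_pin, 32))
assert recovered == file_hash

# With a wrong PIN, F_has cannot be recovered, so ownership cannot be proven
wrong = hashlib.sha256(b"0000").digest()
assert xor(xor(xor(id_ref, id_user), wrong), ks(wrong, 32)) != file_hash
```

The sketch shows the key property claimed by the paper: the server-stored ID_Ref yields F_has only when combined with the H(PIN) value that the server never stores.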

Privacy Protection Issue
The existing deduplication mechanism ensures the security of files by implementing file-encryption techniques, but lays the user-to-file mapping structure bare. This vulnerability renders meta-information analysis feasible, potentially leading to serious user-privacy breaches.
The technique proposed in this paper separates user and file information completely: ID_Ref is stored on the meta-server and the F_has value on the storage server.
ID_Ref is the value obtained by applying a PIN-based XOR operation and encryption. The encrypted value cannot be decoded even by the administrator, since decoding requires the H(PIN) value, which is not stored on the server [15][16][17][18].
Therefore, F_has cannot be computed from ID_Ref, and conversely ID_Ref cannot be tracked from F_has. To download a file using all of the required information, the PIN known only to the user must be available. Notably, ID_Ref is encrypted using the H(PIN) value as the key, which makes analysis even more difficult. Without the PIN, the information on the server cannot be used to reconstruct the mapping between users and files, ensuring that users' privacy remains safe.

Resolution of Ownership Issue
Conventional deduplication typically handles the file-ownership issue through the user-file mapping structure, which, as discussed, can be configured in a privacy-invasive manner. In the deduplication system proposed here, the PIN serves as the evidence that substantiates a user's ownership of a file. When an unauthorized individual attempts a download, the F_has value cannot be computed from ID_Ref because that person does not know the PIN, so the file cannot be downloaded in the usual way. In contrast, an authorized user can compute F_has from ID_Ref using the PIN that was used at upload time and can then execute the download. Since computing F_has from ID_Ref is possible only for those who know the PIN, the user can thereby prove ownership of the file. In other words, the confidential information is the PIN, and because this confidential information is never stored on the server, the file-ownership issue is resolved while user privacy is preserved.

Insider Attacks
The existing deduplication technique keeps the file-to-user mapping structure as meta-information. In that mechanism, an administrator with viewing rights can collect the meta-information, which makes the system highly susceptible to insider attacks: the cloud server administrator can easily access the uploader lists of given files and conversely the file lists of given users.
In the proposed approach, even if an insider attack exposed the entire contents of the meta-server and storage server, the attacker still could not obtain the file lists of users or the uploader lists of files. The server stores only ID_Ref and F_has, and these two sets of values are unrelated to each other. The ID_Ref information, in particular, is encrypted under the H(PIN) key, which makes metadata analysis even more challenging. Even if the data were exposed through an insider-originated brute-force attack, the relationships between files and users could not be established without knowing the PIN, so the insider-attack problem is solved.

Sniffing Attacks
In general, sniffing attacks aim to intercept packets exchanged between a server and a client. The proposed technique encrypts the comparison parameters of the client and server before transmission during protocol execution, which guarantees security even if a hacker launches a sniffing attack. In particular, since decryption of the encrypted comparison parameters is performed at the receiving client or server, the possibility of eavesdropping is reduced. Notably, the technique uses ID_Se, a single-session ID randomly generated by the meta-server, as the key to encrypt the parameters transmitted during the protocol. Therefore, even if hackers replay packets captured by sniffing, they cannot execute the normal upload/download protocol without knowing the EK_Pre previously shared between the server and the client [19][20][21][22][23].

Integrity Considerations
The proposed deduplication technique employs an additional server-side verification process to ensure file integrity. Here, F_has is the rehashed value of the file's hash, so the server can verify the integrity of an uploaded file by recomputing the double hash of the file and checking that it matches the stored F_has value. In other words, F_has both substantiates file ownership during the download protocol and serves as an integrity verifier for the uploaded file. A further merit is that alterations to files held on the storage server can be monitored.

Efficiency Considerations
Deduplication techniques typically generate file-hash values based on Merkle trees. Since generating hash values takes up the largest share of time during deduplication, this study compared the time required to generate Merkle trees during the file duplicate check against the combined time spent computing ID_Ref in the proposed algorithm. Tab. 3 presents the performance-measurement environment. The proposed technique adds security functions to the conventional deduplication method, so its processing time includes some additional time for the security operations (e.g., the ID_Ref computation). Nevertheless, in practical terms the new technique shows no significant drop in performance compared with conventional Merkle tree-based deduplication, while offering greater privacy protection and security. Hence, the proposed mechanism can replace the current deduplication approach where stronger security is required.
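For reference, the Merkle-tree hashing used as the baseline can be sketched as follows; this is a simplified illustration (fixed chunk size, last node duplicated on odd levels), not the measured implementation:

```python
import hashlib

def merkle_root(chunks: list) -> bytes:
    """Merkle-tree root over fixed-size file chunks (simplified sketch)."""
    level = [hashlib.sha256(c).digest() for c in chunks]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

data = b"x" * 4096
chunks = [data[i:i + 1024] for i in range(0, len(data), 1024)]
root1 = merkle_root(chunks)
root2 = merkle_root(chunks)
assert root1 == root2                      # identical files give identical roots

chunks[0] = b"y" * 1024
assert merkle_root(chunks) != root1        # any chunk change alters the root
```

Since every chunk must be hashed to obtain the root, hashing dominates the duplicate-check time for large files, which is why the fixed-cost ID_Ref computation becomes relatively cheaper as file size grows.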

Conclusion
As network infrastructure evolves, the use of cloud platforms continues to grow. Because cloud services are used not only in limited fields but throughout society, the market for such services is becoming enormous. Cloud services will expand further in the 4IR era in convergence with a variety of technologies, since they offer great convenience with little burden on users. However, the use of cloud services by many users means that astronomical amounts of data are collected at the same time, and growing data drives up the cost of providing services, for which redundant-data removal technology is an effective solution [24][25][26]. In building cloud storage systems, deduplication is positioned as an essential element technology: by dramatically reducing storage capacity, it directly cuts costs and will continue to serve as a core technology for the cloud. However, deduplication structurally risks infringing on users' privacy, and existing deduplication security research has focused only on file encryption itself [27][28][29].
Existing deduplication deployments suffer privacy problems because of the user-file mapping structure; tracking users can be prevented by breaking that structure. The mechanism proposed in this paper uses three main elements: the user's secret PIN; ID_Ref, a PIN-based reference value stored on the meta-server; and F_has, the hash of the file's hash value. We pointed out the privacy problem inherent in the user-file mapping structure and proposed a new technique to remedy it. The proposed method stores only the ID_Ref and F_has values, on the meta-server and the storage server respectively, and because the two values have no connection to each other, privacy is protected and the meta-information exposure caused by insider attacks is also prevented. In addition, by utilizing the user's PIN value, the file-ownership and privacy issues were solved at the same time.
To describe the proposed method, Chapter 2 reviewed previous studies of conventional deduplication technology. Chapter 3 described the privacy breaches and other security problems that result from the current technology. Chapter 4 proposed the new deduplication approach, followed by a security and performance analysis in Chapter 5.
Deduplication is an essential technology in building cloud storage and is currently applied in many cloud services. To provide secure cloud services in the coming Fourth Industrial Revolution era, continued research on the privacy issues that can arise in deduplication technology will be required.