Federation Boosting Tree for Originator Rights Protection
1 Harbin University of Science and Technology, Harbin, 150040, China
2 School of Cyberspace Science, Harbin Institute of Technology, Harbin, 150001, China
3 The Ohio State University, Columbus, 43202, USA
4 Heilongjiang Province Cyberspace Research Center, Harbin, 150001, China
* Corresponding Author: Chao Ma. Email:
Computers, Materials & Continua 2023, 74(2), 4043-4058. https://doi.org/10.32604/cmc.2023.031684
Received 24 April 2022; Accepted 29 June 2022; Issue published 31 October 2022
Abstract: The data island problem hinders the use of big data in training artificial intelligence models, which has led researchers to propose the federated learning framework. It enables model training without centralizing all data in one storage location. In current horizontal federated learning schemes, every participant obtains the final jointly trained model. No solution has been proposed for scenarios in which participants only provide training data in exchange for benefits and do not care about the final jointly trained model. Therefore, this paper proposes a new boosted tree algorithm called RPBT (the originator Rights Protected federated Boosted Tree algorithm). In contrast to current horizontal federated learning algorithms, in which each participant obtains the final jointly trained model, RPBT guarantees that the participants’ local data are not leaked while the participants cannot obtain the final jointly trained model. From the participants’ perspective, the scheme borrows the idea of batching so that participants join the training in random batches, which makes it well suited to scenarios where a large number of participants model jointly. Furthermore, a small number of participants do not actually take part in the joint training process, which makes the proposed scheme more secure. Theoretical analysis and experimental evaluations show that RPBT is secure, accurate and efficient.
With the growing awareness of data security and user privacy protection among large companies, users are seriously concerned that their private information may be leaked or even abused by others for commercial or political purposes. Each data owner’s private dataset may contain sensitive information that cannot be made public, and direct exposure may violate privacy regulations such as the General Data Protection Regulation implemented by the EU. Enterprises and regulators have begun to consider how to ensure the legal compliance of data circulation and avoid risks such as illegal data transfer and resale. Data owners only allow data to be kept in their own hands, which makes data circulation difficult and thus forms data islands.
To tackle the data island issue under these data constraints, Google proposed the federated learning framework in 2016. Its basic idea is that each participant holding a data source trains a model locally and uploads the local model, instead of the original data, to a central server. The central server obtains the global model through model aggregation and then distributes it to each participant for local model updates, which completes the joint training. This framework is also known as the horizontal federated learning framework. At present, horizontal federated learning treats all participants as equals: each party provides its own data for joint modeling, and in the end everyone obtains the global model. However, some participants in the horizontal federation process only want to profit from the training data they provide and do not need the global model. That is, the scenario in which one party initiates joint modeling by purchasing data from other participants, and only that initiator ends up with access to the global model, has not been solved.
In response to the problems mentioned above, this paper proposes an efficient and privacy-preserving horizontal federated modeling algorithm RPBT, which enables the originator to independently obtain the final joint training model while protecting the data privacy of the participants. Under the premise of ensuring the data privacy of each participant, only the originator can obtain the final global model. The contribution of this paper can be summarized into the following points:
• We present a scheme that ensures the confidentiality and security of participants’ data and protects the rights and interests of the originator during the joint training process.
• We present RPBT, an efficient boosting tree method for joint training. It applies the idea of batch training from machine learning to the batch selection of participants, so that the participants cannot obtain a complete joint model and the originator alone acquires the rights to the joint model.
• We perform a detailed theoretical and experimental analysis of the RPBT model, which indicates that our method is secure, accurate and efficient.
The rest of the paper is organized as follows: In Section 2, we give the system model, threat model and problem definition. In Section 3, we briefly introduce necessary preliminaries. In Section 4, we describe in detail the federation boosting tree model of the originator’s rights protection. In Section 5, we give the theoretical analysis, including correctness and security. In Section 6, we conduct the actual experimental analysis. In Section 7, we review the relevant literature. Finally, we summarize the paper.
Our system model considers a federated boosted tree model with originator rights protection. All participants agree to share their datasets for training the global model, but do not want to reveal their private data to anyone. At the same time, only the originator can obtain the final joint model, which protects the rights of the originator. As shown in Fig. 1, the model framework consists of three parts:
• Originator. Obtains the final joint model by purchasing participant data for joint training.
• Participant. Provides a dataset and cooperates with the originator in model training, but does not learn the final joint model.
• Coordinator. Uses the secure aggregation protocol to aggregate the model information of the originator and participants while protecting data privacy, and sends the final model information to the originator.
We assume that the coordinator is fully trusted and that no external user can intercept or tamper with the communication between it and the participants. In addition, the coordinator itself will not leak data produced while constructing the boosting tree model to unknown users.
All participants are semi-honest: they execute the specified protocol honestly, but may launch large-scale inference attacks. Moreover, some participants may use untrue data in model training to degrade the model, and a group of participants may collude to infer the data of other participants.
We also assume that the coordinator will not collude with the participants. Such a non-collusion assumption is reasonable in practice, because the coordinator and the participants want to maintain their reputations and are unlikely to collude with others at the cost of damaging them.
Our concern is to build a boosting tree model that protects the rights of the originator while ensuring the data privacy of all participants. As shown in Fig. 1, assume that the originator and the participants are all factories that produce the same kind of product. If the originator wants to predict product quality from machine production parameters, a large amount of production data is needed to build a predictive model. We design a federated training platform on which the originator pays for the production data of the participating factories. After the participants receive the payment, they provide their data and cooperate with the originator in joint training, and the originator obtains the final joint model to predict product quality. Our design goals are as follows:
– Efficiency: RPBT supports joint training with large-scale participants and uses parallel computing to reduce the computing time of large-scale participant model training.
– Privacy Protection: To protect the data privacy of all participants, each participant computes its model information from its own data and adds noise to that information before sending it to the coordinator. After the coordinator receives the participants’ model information, it uses the secure aggregation protocol to aggregate all of it in a secure manner.
– Right Protection: The participants, who provide data for training, do not obtain the final boosted tree model. Only the originator obtains the final joint model, which guarantees the rights and interests of the originator.
Secure aggregation protocols use secure multi-party computation (MPC) to securely compute the sum of model parameter updates from individual user devices, advancing privacy-preserving machine learning. The protocol allows the server to compute the sum of data vectors held by multiple participants in a secure manner (i.e., without learning each user’s individual contribution). The core idea is to add an obfuscation term to each client’s gradient. Since the obfuscation term is specially constructed by negotiation with other clients and known only to the client itself, the obfuscation terms in the gradients passed to the server cancel during aggregation. The server therefore obtains the true aggregated gradient but cannot recover the true gradient value of any individual client. Since each client sends an obfuscated gradient and the obfuscation terms are known only to that client, the privacy of each client’s gradient is guaranteed as long as other clients do not collude with the server. For example, in a federated learning environment, the protocol can be used to aggregate user-provided model updates for a deep neural network. It has low runtime and communication overhead, even in scenarios with large datasets and many participating parties. The protocol is secure in the honest-but-curious and active adversary settings, and it remains secure even if an arbitrarily chosen subset of users drops out at any time.
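The cancellation of obfuscation terms described above can be illustrated with a minimal sketch. It assumes integer-encoded updates, arithmetic modulo a hypothetical ring size `MODULUS`, and a single seeded generator standing in for the pairwise negotiation between clients; the function names are illustrative and not taken from the protocol.

```python
import random

MODULUS = 2**32  # assumed ring size for masked integer arithmetic


def pairwise_masks(client_ids, dim, seed=0):
    """Return {client_id: mask vector}; the masks sum to 0 modulo MODULUS."""
    rng = random.Random(seed)
    masks = {cid: [0] * dim for cid in client_ids}
    ordered = sorted(client_ids)
    for i, u in enumerate(ordered):
        for v in ordered[i + 1:]:
            # In a real protocol u and v would derive this vector from a
            # secret they negotiated; here one RNG stands in for that step.
            shared = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                masks[u][k] = (masks[u][k] + shared[k]) % MODULUS
                masks[v][k] = (masks[v][k] - shared[k]) % MODULUS
    return masks


def mask_update(update, mask):
    """Add the client's mask to its update, element by element."""
    return [(x + m) % MODULUS for x, m in zip(update, mask)]


def aggregate(masked_updates):
    """Server-side sum; the pairwise masks cancel in this sum."""
    dim = len(masked_updates[0])
    return [sum(vec[k] for vec in masked_updates) % MODULUS for k in range(dim)]


updates = {"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]}
masks = pairwise_masks(list(updates), dim=3)
masked = [mask_update(updates[cid], masks[cid]) for cid in updates]
total = aggregate(masked)  # equals the element-wise sum of the raw updates
```

The server only ever sees the entries of `masked`, yet `total` matches the sum of the unmasked vectors. Dropout handling, which the full protocol supports, is not modelled in this sketch.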
XGBoost is an open source machine learning project developed by Tianqi Chen and others that efficiently implements the GBDT algorithm with many algorithmic and engineering improvements. It has been widely used and has achieved good results in Kaggle competitions and many other machine learning competitions. XGBoost is an optimized distributed gradient boosting library designed to be efficient and portable. It implements machine learning algorithms in the Gradient Boosting framework and provides parallel tree boosting (also known as GBDT, GBM) to solve many data science problems quickly and accurately. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of samples. XGBoost leverages out-of-core computing and enables data scientists to process hundreds of millions of samples on a single host. XGBoost uses CART decision trees as sub-models and Gradient Tree Boosting to integrate the learning of multiple CART trees into the final model. The base model supports not only decision trees but also linear models; here we use the objective function based on decision trees.
In the problem definition, we mentioned the need to devise a scheme that protects the rights of the originator and allows only the originator to obtain the final joint model. As shown in Fig. 2, RPBT proceeds as follows.
1) System initialization. Determine the group information of the participants, and set the secure aggregation parameters and the originator rights protection parameters.
2) Local histogram computation. Design a local histogram calculation process that ensures data privacy during joint training.
3) Secure histogram aggregation. The coordinator calculates the optimal splitting point from the participants’ histogram information through the secure aggregation method.
4) Joint model distribution. After the joint training, the originator alone obtains the joint model, which ensures the rights and interests of the originator.
Before the task is executed, the participants’ information is collected, and the participants are divided into groups according to their number and the size of their available datasets, which facilitates the selection of training batches during joint training. At the same time, the proportion of participants who do not take part in joint training is set according to the number of participants in each group. By giving a fake intermediate model to the participants who do not take part in joint training, no participant can be sure whether the intermediate model it receives is real, so the participants cannot make use of the intermediate model.
The calculation process of joint modeling combines the basic idea of horizontal federated learning with the XGBoost algorithm. Let us first review the calculation details of the XGBoost algorithm in the modeling process. Given a dataset with $n$ samples and $m$ features, suppose we have trained $K$ trees; the final predicted value for the $i$-th sample is then

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \tag{1}$$

where $x_i$ represents the sample features, $f_k(x_i)$ denotes the prediction result of the $k$-th tree for sample $x_i$, and these values are added together to give the final result $\hat{y}_i$.
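According to the additive model of formula (1), the $t$-th tree is learned on top of the first $t-1$ trees: the residuals are calculated from the predicted values of the first $t-1$ trees, and the $t$-th tree is then fitted to these residuals. The second-order Taylor expansion of the objective function is as follows:

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t), \tag{2}$$

where $g_i$ and $h_i$ are the first and second derivatives of the loss $l$ with respect to the prediction $\hat{y}_i^{(t-1)}$, and $\Omega(f_t)$ is the regularization term of the $t$-th tree.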
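As a concrete illustration of the first- and second-order terms $g_i$ and $h_i$ in the Taylor expansion, the sketch below evaluates them for the logistic loss. The choice of loss and the function names are assumptions made for illustration; the text above does not fix a particular loss function.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad_hess(y_true, raw_pred):
    """Per-sample first and second derivatives of the logistic loss.

    The derivatives are taken with respect to the raw score predicted by
    the first t-1 trees, which is what the t-th tree is fitted against.
    """
    grads, hessians = [], []
    for y, z in zip(y_true, raw_pred):
        p = sigmoid(z)
        grads.append(p - y)             # g_i = p_i - y_i
        hessians.append(p * (1.0 - p))  # h_i = p_i * (1 - p_i)
    return grads, hessians


# Two samples with raw score 0 (probability 0.5) and one confident positive.
g, h = logistic_grad_hess([1, 0, 1], [0.0, 0.0, 2.0])
```

For a squared-error loss of the form $\frac{1}{2}(y_i - \hat{y}_i)^2$, the same quantities reduce to $g_i = \hat{y}_i - y_i$ and $h_i = 1$.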
In constructing the $t$-th tree, we start from the root node. Each split turns one node into two, and the sample data in the original node are assigned to one of the two new nodes according to the split rule. After each new split, we need to check whether the split reduces the loss function. When splitting a node, we use the following formula to calculate the gain of the split point, where $I_L$ and $I_R$ represent the instance spaces of the left and right tree nodes after splitting, $I = I_L \cup I_R$, and $\lambda$ and $\gamma$ are regularization parameters:

$$Gain = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma. \tag{3}$$

After obtaining an optimal tree structure, the optimal weight of each leaf can be obtained by the following formula, where $I_j$ is the instance space of leaf $j$:

$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}. \tag{4}$$
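The split gain and the optimal leaf weight can be written as two small functions. This is a sketch of the standard XGBoost expressions, assuming an L2 regularization parameter `lam` and a per-split penalty `gamma`; the function names and default values are illustrative.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of a split, given gradient and hessian sums on each side."""
    def score(g_sum, h_sum):
        return g_sum * g_sum / (h_sum + lam)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def leaf_weight(g_sum, h_sum, lam=1.0):
    """Optimal weight of a leaf from its gradient and hessian sums."""
    return -g_sum / (h_sum + lam)


# A split that separates negative gradients from positive ones.
gain = split_gain(g_left=-2.0, h_left=1.0, g_right=2.0, h_right=1.0)
w_left = leaf_weight(-2.0, 1.0)
w_right = leaf_weight(2.0, 1.0)
```

Both functions take only sums of $g_i$ and $h_i$ as input, never individual samples, which is the property the federated design relies on.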
From the above review of XGBoost, we can observe that computing the split point depends only on $g_i$ and $h_i$. The calculation steps are designed as follows:
1) The originator and the participants calculate the quantile interval information according to the data feature set they own, and send the quantile interval information to the coordinator.
2) The coordinator integrates all quantile interval information, calculates the average quantile sketch, and distributes it to the originator and participants.
3) The originator and participants calculate their local histograms (including the $g_i$ and $h_i$ information), and then send the local histograms to the coordinator.
4) The coordinator merges the local histograms of the originator and the participants into a global histogram. The splitting point gain is calculated from the global histogram, and the splitting feature and splitting threshold corresponding to the optimal gain are sent to the originator and the participants.
5) The originator and participants update the local histogram after receiving the splitting feature and splitting threshold.
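Steps 3) and 4) above can be sketched for a single feature as follows. The sketch assumes the shared quantile sketch has already been reduced to a sorted list of interior cut points `edges`; the function names and the toy data are hypothetical, and masking of the histograms is left out.

```python
from bisect import bisect_right


def local_histogram(feature_values, grads, hessians, edges):
    """Accumulate [G, H] per bucket for one feature on one party.

    `edges` are the shared interior cut points; a value x falls into
    bucket b when edges[b-1] <= x < edges[b].
    """
    hist = [[0.0, 0.0] for _ in range(len(edges) + 1)]
    for x, g, h in zip(feature_values, grads, hessians):
        b = bisect_right(edges, x)
        hist[b][0] += g
        hist[b][1] += h
    return hist


def merge_histograms(histograms):
    """Sum the local histograms bucket by bucket into a global histogram."""
    merged = [[0.0, 0.0] for _ in range(len(histograms[0]))]
    for hist in histograms:
        for b, (g, h) in enumerate(hist):
            merged[b][0] += g
            merged[b][1] += h
    return merged


edges = [1.0, 2.0]  # shared cut points give three buckets
hist_a = local_histogram([0.5, 1.5, 2.5], [0.1, -0.2, 0.3], [1.0, 1.0, 1.0], edges)
hist_b = local_histogram([0.2, 2.2], [0.4, -0.1], [1.0, 1.0], edges)
global_hist = merge_histograms([hist_a, hist_b])
```

Because every party buckets its samples with the same cut points, the merged histogram is what a single party holding all the data would have computed.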
In our boosting tree algorithm, the originator and the participants send the $g_i$ and $h_i$ information to the coordinator, and the coordinator calculates the gain to determine the optimal split point and the corresponding split threshold. Therefore, during joint training, the originator and the participants do not expose their original data. However, if the coordinator directly uses $g_i$ and $h_i$ to calculate the gain, this is equivalent to exposing the $g_i$ and $h_i$ information contained in each participant’s local histogram. To further ensure the data privacy and security of the participants, we introduce a secure aggregation protocol. The local histogram information of the participants must be kept secret from the coordinator, so before a participant sends its local histogram to the coordinator, the histogram is encrypted by means of secure aggregation. The coordinator therefore does not learn the local histogram of any individual participant.
The details of the joint model computation combined with the secure aggregation algorithm are described as Algorithm 1 in Tab. 1. When a tree calculates the gain of splitting features, XGBoost uses a pre-sorted algorithm to handle node splitting. The split point calculated in this way is more accurate, but it also incurs a large time overhead. To solve this problem, LightGBM uses a histogram-based decision tree algorithm. Compared with the pre-sorted algorithm, the histogram algorithm has many advantages in memory consumption and computational cost. We also add the histogram algorithm to our calculation process to speed it up. The basic principle is to specify the number of buckets directly and then use the improved GK-summary algorithm to calculate the quantiles. From these quantiles we obtain the corresponding values in the data, which form intervals; all samples falling in the same interval are placed in one bucket. Since data splitting relies only on the values of G and H, we only need to accumulate the G and H values of the samples to meet the requirements of subsequent tasks.
Next, we describe how data encryption is performed when calculating the local histogram. We use the Diffie-Hellman algorithm to generate a private key for the originator and each participant, which is used to generate the noise for encryption. In addition, we assign a uuid to each participant as its identity, and the uuid determines how the noise is applied during encryption.
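A minimal sketch of this step is shown below. It assumes that each pair of parties derives a shared secret through a Diffie-Hellman exchange, expands it into a noise vector with a hash, and that the party with the smaller uuid adds the noise while the other subtracts it. The toy group parameters `P` and `G`, the hash-based expansion, and the sign rule are all assumptions for illustration and not a vetted cryptographic configuration.

```python
import hashlib
import secrets

P = 2**127 - 1   # toy prime modulus; NOT a standardized Diffie-Hellman group
G = 5            # toy base, chosen only for illustration
MODULUS = 2**32  # assumed ring size for the masked histogram entries


def keypair():
    """Generate a (private, public) Diffie-Hellman key pair."""
    private = secrets.randbelow(P - 2) + 1
    return private, pow(G, private, P)


def shared_secret(my_private, their_public):
    """Both sides of a pair compute the same value from the other's public key."""
    return pow(their_public, my_private, P)


def noise_from_secret(secret, dim):
    """Expand the shared secret into `dim` pseudo-random ring elements."""
    out = []
    for k in range(dim):
        digest = hashlib.sha256(f"{secret}:{k}".encode()).digest()
        out.append(int.from_bytes(digest[:4], "big") % MODULUS)
    return out


def signed_noise(my_uuid, peer_uuid, secret, dim):
    """The smaller uuid adds the noise, the larger uuid subtracts it."""
    sign = 1 if my_uuid < peer_uuid else -1
    return [(sign * r) % MODULUS for r in noise_from_secret(secret, dim)]


priv_a, pub_a = keypair()
priv_b, pub_b = keypair()
s_ab = shared_secret(priv_a, pub_b)
s_ba = shared_secret(priv_b, pub_a)
noise_a = signed_noise("uuid-a", "uuid-b", s_ab, dim=4)
noise_b = signed_noise("uuid-b", "uuid-a", s_ba, dim=4)
cancelled = [(x + y) % MODULUS for x, y in zip(noise_a, noise_b)]
```

Since both parties arrive at the same secret without transmitting it, the two noise vectors are exact opposites modulo `MODULUS` and vanish once the coordinator sums the masked histograms.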
Model aggregation is performed on the coordinator side. The specific details are described as Algorithm 2 in Tab. 2. The most important step for the coordinator is to find the best split point. Our histogram format is $(G, H)$, where $G$ is the sum of the first derivatives of all samples in a bucket and $H$ is the sum of their second derivatives. Following the segmentation scheme of XGBoost, we traverse each bucket of each feature to find the optimal split feature and bucket number. The result is returned to each participant, who splits its data locally according to the feature and bucket number, calculates a new histogram, and repeats the above process until training is complete. While the coordinator searches for the best split point, the histogram data of all participating parties are summed. With the encryption method described above, this addition eliminates the noise in the encrypted data, so the coordinator obtains complete, unencrypted global data without obtaining the original data of any participant.
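The coordinator's search over features and buckets can be sketched as a prefix-sum scan of the global histogram. The sketch assumes the histograms have already been summed and unmasked, uses the standard XGBoost gain with hypothetical regularization parameters `lam` and `gamma`, and represents the global histogram as a plain dictionary; none of these names come from Algorithm 2.

```python
def best_split(global_hists, lam=1.0, gamma=0.0):
    """Scan every bucket boundary of every feature.

    global_hists maps a feature name to a list of (G, H) bucket sums.
    Returns (gain, feature, bucket), where samples in buckets 0..bucket
    go to the left child; feature is None if no split has positive gain.
    """
    def score(g_sum, h_sum):
        return g_sum * g_sum / (h_sum + lam)

    best = (0.0, None, None)
    for feature, buckets in global_hists.items():
        g_total = sum(g for g, _ in buckets)
        h_total = sum(h for _, h in buckets)
        g_left = h_left = 0.0
        for b in range(len(buckets) - 1):
            g_left += buckets[b][0]
            h_left += buckets[b][1]
            g_right, h_right = g_total - g_left, h_total - h_left
            gain = 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                          - score(g_total, h_total)) - gamma
            if gain > best[0]:
                best = (gain, feature, b)
    return best


hists = {
    "f0": [(-2.0, 1.0), (2.0, 1.0)],  # gradients separate cleanly
    "f1": [(0.0, 1.0), (0.0, 1.0)],   # no signal on this feature
}
gain, feature, bucket = best_split(hists)
```

Only the winning feature and bucket index need to be sent back to the parties, which then partition their own samples locally.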
We need to consider how to protect the rights of the originator during the joint training process. The originator purchases the participants’ datasets for joint training according to pricing criteria. In existing horizontal federated schemes, both the originator and the participants obtain the final federated model. Having paid for the participants’ datasets, the originator wants to be the only one to obtain the joint model. Otherwise, the participants receive both the payment for their data and the joint model, which is unfair to the originator and does not effectively protect its rights and interests. Therefore, we design a horizontal federated learning method suited to this scenario that safeguards the rights and interests of the originator.
The main idea is that, before joint modeling, the originator determines the number of participants. During joint modeling, the originator randomly selects a batch of participants for joint training in each round, and an intermediate model is obtained at the end of the round. The originator then selects a new batch of participants and continues joint training on the basis of the intermediate model to obtain a new intermediate model. The joint training process stops when the originator and all participants have completed training. The originator finally obtains a joint model trained with all participants.
The specific details of protecting the rights and interests of the originator during training are described as Algorithm 3 in Tab. 3. The idea of having the originator randomly select a batch of participants in each round is borrowed from batch training in machine learning. Batch training is used when the amount of training data is large; it makes training faster and yields stable results. We regard the data of the participants selected in each round as one batch of data in an iteration of the machine learning process. Since stable results can be obtained with batch data, training the participants in batches is also effective, and the resulting model can be expected to be stable. In addition, during joint training we can designate some participants that do not take part in the real training. Instead, a fake intermediate model is returned directly to these participants. In this way, no participant can be sure whether the intermediate model it has obtained is real, so the participants cannot make use of the intermediate models, which protects the rights and interests of the originator from another angle.
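The batching idea can be sketched as a small scheduling routine. It is a minimal illustration under stated assumptions: participants are shuffled once and partitioned into batches, which draws random batches without replacement; a fixed fraction `decoy_ratio` is set aside to receive only a fake intermediate model; and `train_round` is a hypothetical callback standing in for one round of joint training. None of these names come from Algorithm 3.

```python
import random


def plan_rounds(participants, batch_size, decoy_ratio, seed=0):
    """Split participants into random training batches plus a decoy set."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    n_decoys = int(len(shuffled) * decoy_ratio)
    decoys = shuffled[:n_decoys]   # would receive a fake intermediate model
    real = shuffled[n_decoys:]     # take part in the actual training
    batches = [real[i:i + batch_size] for i in range(0, len(real), batch_size)]
    return batches, decoys


def run_training(batches, train_round, initial_model=None):
    """Continue training from the previous intermediate model, batch by batch."""
    model = initial_model
    for batch in batches:
        model = train_round(model, batch)
    return model


participants = [f"p{i}" for i in range(10)]
batches, decoys = plan_rounds(participants, batch_size=3, decoy_ratio=0.2)

# Stand-in for a training round: record which batch contributed.
final_model = run_training(
    batches, lambda model, batch: (model or []) + [sorted(batch)]
)
```

Only the caller of `run_training`, which plays the role of the originator, ends up holding `final_model`; each batch sees at most the intermediate model of its own round.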
In this section, we propose a new boosted tree algorithm called the RPBT. It mainly solves the following three problems. First, the data privacy and security of each participant is guaranteed during the joint modeling process. Secondly, when the coordinator performs the model calculation, it is ensured that the model information of the participants will not be leaked. Finally, it is ensured that only the originator can get the final boosting tree model, and the participants cannot know the accurate boosting tree model, which protects the rights and interests of the originator.
Theorem 1. RPBT is lossless, i.e., the RPBT model M and the XGBoost model M′ behave the same, provided that the initialization and hyperparameters of models M and M′ are the same.
Proof. According to formula (3), $g_i$ and $h_i$ are the only information needed to compute the optimal splitting point. In the modeling process, $g_i$ and $h_i$ are encrypted by the secure aggregation protocol. For example, the encryption of message $a$ of side A is $[[a]] = a + r$, and the encryption of message $b$ of side B is $[[b]] = b - r$, where $r$ is the generated noise. By the definition of secure aggregation, we have $[[a]] + [[b]] = a + b$. The proof is as follows:

$$[[a]] + [[b]] = (a + r) + (b - r) = a + b.$$
Thus, we have , . Using secure aggregation protocol encryption is, , then . As long as there is the same initialization, the optimal splitting point of the model