The present disclosure relates to the technical field of federated learning, in particular to an efficient, secure and low-communication vertical federated learning method.
Federated learning is a machine learning technology proposed by Google for jointly training models on distributed devices or servers while the data remains stored locally. Compared with traditional centralized learning, federated learning does not need to gather the data together, so the transmission cost among devices is reduced and the privacy of the data is protected to a great extent.
Federated learning has developed significantly since it was proposed. In particular, with the increasingly extensive application of distributed scenarios, federated learning applications have attracted more and more attention. According to the manner in which data are divided, federated learning is mainly divided into two types: horizontal federated learning and vertical federated learning. In horizontal federated learning, the data distributed across different devices have the same features but belong to different users. In vertical federated learning, the data distributed across different devices belong to the same users but have different features. The two federated learning paradigms have completely different training mechanisms, and are therefore discussed separately in most current studies. Horizontal federated learning has consequently made great progress, whereas vertical federated learning still has problems, such as low security and inefficiency, that need to be solved.
Nowadays, with the arrival of the big data era, companies can readily obtain enormous data sets, but it is difficult for them to obtain data with different features. Therefore, vertical federated learning has drawn more and more attention in industry. Given the advantages of horizontal federated learning, a more efficient and secure vertical federated learning mechanism can be developed more easily if horizontal federated learning is employed within the vertical federated learning process.
The present disclosure aims to provide an efficient, secure and low-communication vertical federated learning method. Models are first trained to complete the feature data of each participant in the case that the participants hold different feature data (including the case in which only one participant holds the labels). Then horizontal federated learning is used to jointly train a model with the data held by each participant, so as to address the problems of security, efficiency and traffic load in the vertical federated learning process. At the cost of a minimal loss of accuracy, the training can be completed more efficiently and quickly.
The purpose of the present disclosure is implemented through the following technical solution:
Further, when all the participants hold the label data, the data feature set held consists only of the feature data.
Further, in the step (1), the data feature set is personal privacy information. In the setting of vertical federated learning, sending index data does not lead to the disclosure of additional information.
Further, in the step (1), each participant uses the BlinkML method to determine an optimal sample number of each selected feature to be sent to each of the other participants, then adds noise satisfying differential privacy to the corresponding part of the samples of each selected feature according to the determined optimal sample number, and sends this part of the samples, together with the data indexes of the selected samples, to the corresponding other participants. In the method, only a few samples need to be sent in advance to determine the optimal (smallest) sample number to be sent.
Further, each participant uses the BlinkML method to determine the optimal sample number of each selected feature to be sent to each of the other participants, including the following steps:
and then obtaining $\theta_{i,j,\tilde{n},k}$ and $\theta_{i,j,N,k}$ by sampling from a normal distribution $\mathcal{N}(\bar{\theta}, \alpha^{2}LL^{T})$. The sampling is repeated K times to obtain K pairs $(\theta_{i,j,\tilde{n},k}, \theta_{i,j,N,k})$, where $k = 1, \ldots, K$ indexes the samplings.
where $\tilde{n}_{i,j}$ represents the candidate sample number of the ith feature sent to the participant j, and N is the total number of the samples for each participant.
where $M_{i,j}(x;\theta)$ represents that the participant j takes the feature data held for the sample x as the input, $\theta$ is a model parameter, the output of the model $M_{i,j}$ is the predicted feature data i, D is a sample set, $E(\cdot)$ is an expected value, and $\epsilon$ is a real number that represents a threshold.
If $p > 1-\delta$, letting the candidate $\tilde{n}_{i,j}$ become the upper bound of the search interval, and if $p < 1-\delta$, letting $\tilde{n}_{i,j}$ become the lower bound of the search interval, where $\delta$ represents a threshold, which is a real number. The process according to the step (e) and the step (f) is carried out multiple times until an optimal candidate sample number that should be selected for each feature is obtained through convergence.
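As a concrete illustration of the steps above, the following Python sketch runs the binary search of the step (e) and the step (f) to find the smallest sufficient sample number. It is a minimal sketch, not the method itself: the BlinkML parameter sampling is replaced by a plain Monte-Carlo retraining estimate, the feature-prediction model is assumed linear, and the function names and constants (`fit_linear`, `agreement_prob`, `eps`, `K`) are illustrative.

```python
import numpy as np

def fit_linear(X, y):
    # Least-squares fit of a linear model that predicts one feature
    # from the features a participant already holds (plus a bias term).
    A = np.c_[X, np.ones(len(X))]
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict(w, X):
    return np.c_[X, np.ones(len(X))] @ w

def agreement_prob(X, y, n_tilde, eps=0.1, K=20, rng=None):
    # Monte-Carlo stand-in for BlinkML's parameter sampling: estimate the
    # probability p that a model trained on n_tilde samples predicts within
    # eps (on average) of the model trained on all N samples, over K draws.
    rng = rng or np.random.default_rng(0)
    w_full = fit_linear(X, y)
    hits = 0
    for _ in range(K):
        idx = rng.choice(len(X), size=n_tilde, replace=False)
        w_small = fit_linear(X[idx], y[idx])
        if np.mean(np.abs(predict(w_small, X) - predict(w_full, X))) <= eps:
            hits += 1
    return hits / K

def minimal_sample_count(X, y, delta=0.05, eps=0.1):
    # Steps (e)/(f) as a binary search: a candidate reaching p > 1 - delta
    # becomes the upper bound, otherwise it becomes the lower bound,
    # converging to the smallest sufficient sample number.
    lo, hi = 10, len(X)
    while lo < hi:
        n_tilde = (lo + hi) // 2
        if agreement_prob(X, y, n_tilde, eps=eps) > 1 - delta:
            hi = n_tilde
        else:
            lo = n_tilde + 1
    return lo
```

A participant would call `minimal_sample_count(X_local, feature_column)` once per selected feature to decide how many noised samples are worth transmitting.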
Further, if a participant has a missing feature for which no data is received in the step (2), the model of the missing feature is obtained with the method of labeled-unlabeled multitask learning (A. Pentina and C. H. Lampert, "Multi-task learning with labeled and unlabeled tasks," in Proceedings of the 34th International Conference on Machine Learning, ser. ICML'17. JMLR.org, 2017, pp. 2807-2816), including the following steps:
where $L(\cdot)$ is a loss function of a model in which a sample of a data set $S_p$ is taken as an input, and $n_s$ is the number of samples in the data set $S_p$.
Further, all the participants jointly train a model by using horizontal federated learning, and the training is not limited to a specific horizontal method.
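The disclosure leaves the horizontal algorithm open; as one common choice, the sketch below shows FedAvg-style training over the feature-completed local datasets. The local linear model, learning rate and round count are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def local_update(w, X, y, lr=0.05, epochs=5):
    # One participant's local step: a few epochs of full-batch gradient
    # descent on the squared error of its feature-completed data.
    for _ in range(epochs):
        w = w - lr * X.T @ (X @ w - y) / len(X)
    return w

def fedavg(datasets, dim, rounds=50):
    # Each round, every participant refines the global model locally and
    # the results are averaged, weighted by local sample counts.
    w = np.zeros(dim)
    total = sum(len(X) for X, _ in datasets)
    for _ in range(rounds):
        updates = [local_update(w.copy(), X, y) for X, y in datasets]
        w = sum((len(X) / total) * u for (X, _), u in zip(datasets, updates))
    return w
```

A call such as `fedavg([(X1, y1), (X2, y2)], dim=X1.shape[1])` would then stand in for the joint training stage.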
Compared with the prior art, the present disclosure has the following advantages: the present disclosure combines vertical federated learning with horizontal federated learning, and provides a new idea for the development of vertical federated learning by transforming vertical federated learning into horizontal federated learning. By applying differential privacy in the method according to the present disclosure, data privacy is guaranteed, and data security is thereby theoretically guaranteed. Combined with the multitask learning method, the traffic load of the data is significantly reduced, and the training time is thereby reduced. The efficient, secure and low-communication vertical federated learning method according to the present disclosure has the advantages of simple use and high training efficiency, and can be implemented in industrial settings while protecting data privacy.
The arrival of the Internet era provides conditions for the collection of big data; however, with the gradual exposure of data security problems and the protection of data privacy by enterprises, the problem of data "islands" is becoming more and more serious. At the same time, although enterprises hold a large amount of data due to the development of Internet technology, the user features of the data differ because of business restrictions and other reasons. If these data were used jointly, a model with higher accuracy and stronger generalization ability could be trained. Therefore, sharing data among enterprises while protecting data privacy has become one of the methods for breaking the data "islands".
The present disclosure aims at the above scenario. That is, under the premise that the data is stored locally, a model is jointly trained with the data of multiple parties to protect the data privacy of all participants, and the training efficiency is improved while the loss of accuracy is controlled.
where $\theta_{i,j,\tilde{n},k}$ and $\theta_{i,j,N,k}$ represent the model parameters obtained from the kth sampling by training with $\tilde{n}_{i,j}$ or N samples, respectively, and $\tilde{n}^{*}_{i,j}$ represents the optimal candidate sample number of the ith feature sent to the participant j. The probability p that the two models produce outputs within the threshold $\epsilon$ of each other is calculated. If $p > 1-\delta$, letting the candidate $\tilde{n}_{i,j}$ become the upper bound of the search interval, and if $p < 1-\delta$, letting $\tilde{n}_{i,j}$ become the lower bound of the search interval, where $\delta$ represents a threshold, which is a real number and is generally 0.05. The process according to the step (e) and the step (f) is carried out multiple times until the optimal candidate sample number that should be selected for each feature is obtained through convergence.
For the features for which no data is received, the labeled-unlabeled multitask learning method is used to learn the model of the task. Taking the case of one participant as an example, the process includes the following steps:
where $L(\cdot)$ is a loss function of the model in which a sample of the data set $S_p$ is taken as the input, and $n_s$ is the number of samples in the data set $S_p$.
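The following is only a minimal sketch of the labeled-unlabeled idea, under a strong simplifying assumption: the missing feature's model is formed as a convex combination of the models already trained for the received features, with the weights `alphas` taken as given rather than optimized from the discrepancy-based bound of the cited work.

```python
import numpy as np

def model_for_unlabeled_task(labeled_models, alphas):
    # Convex combination of the labeled tasks' parameters; in the cited
    # method the weights would be chosen by minimizing an error bound that
    # uses the unlabeled task's input data, here they are simply given.
    alphas = np.asarray(alphas, dtype=float)
    alphas = alphas / alphas.sum()
    return sum(a * np.asarray(w) for a, w in zip(alphas, labeled_models))
```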
In order to make the purpose, the technical solution and the advantages of the present disclosure clearer, the technical solution of the present disclosure will be described clearly and completely below in combination with an embodiment. Obviously, the embodiment described is only one of the embodiments of the present disclosure, not all of them. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without any creative effort fall within the protection scope of the present disclosure.
A and B represent a bank and an e-commerce company, respectively, and both desire to jointly train a model to predict the economic level of users with the federated learning method according to the present disclosure. Due to the differences in business between the bank and the e-commerce company, they hold different features in the training data, so it is feasible for them to work together to train a model with higher accuracy and stronger generalization performance. A and B hold data (XA, YA) and (XB, YB), respectively.
where $X_A$ and $X_B$ are the training data, and $Y_A$ and $Y_B$ are the labels corresponding to the training data, where N represents the size of the data volume. The training data of A and B include the same user samples, but each sample has different features. The feature numbers of A and B are represented by $m_A$ and $m_B$, respectively, namely $X_A \in \mathbb{R}^{N \times m_A}$ and $X_B \in \mathbb{R}^{N \times m_B}$.
Due to user privacy issues and other reasons, A and B cannot share data with each other, so the data is stored locally. In order to solve this problem, the bank and the e-commerce company can jointly train a model by using vertical federated learning as follows.
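To make the embodiment concrete, the snippet below simulates the vertically partitioned setting: N aligned user samples whose $m_A$ features sit with the bank A and whose $m_B$ features sit with the e-commerce company B. All values are synthetic placeholders, not data from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(42)
N, mA, mB = 1000, 8, 6                      # sample count and feature counts
X_full = rng.normal(size=(N, mA + mB))      # the "complete" table nobody holds
y = (X_full.sum(axis=1) > 0).astype(float)  # label, e.g. users' economic level
XA, XB = X_full[:, :mA], X_full[:, mA:]     # A's and B's vertical partitions
indexes = np.arange(N)                      # shared sample indexes for alignment
```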
Step S101, the bank A and the e-commerce company B randomly selected part of the features of the data feature sets they held, and a small number of samples of the selected features.
In an embodiment, the bank A and the e-commerce company B randomly selected $r_A$ features and $r_B$ features from their $m_A$ features and $m_B$ features, respectively. For each selected feature, A and B randomly selected $n_i$ samples to be sent to the other participant.
Step S1011, for each feature, the bank A and the e-commerce company B used the BlinkML method to determine the sample number, which reduces the data transmission while ensuring the training accuracy of the feature model.
In an embodiment, take the case that A sent some samples of the feature $i_A$ to B as an example. A randomly selected $n_0$ samples and sent them to B, where $n_0$ is very small. B chose a candidate sample number $\tilde{n}$, used the feature $i_A$ of the $n_0$ samples received as labels to train a model $\theta_{i_A}$, and the probability p that the model trained with $\tilde{n}$ samples agrees, within the threshold $\epsilon$, with the model trained with all the samples was calculated. If $p > 1-\delta$, $\tilde{n}$ became the upper bound of the search interval, and if $p < 1-\delta$, $\tilde{n}$ became the lower bound. The previous process and this process were repeated. It should be noted that the process is actually a binary search, which is used to find the optimal $\tilde{n}$. Then, B sent the size of $\tilde{n}$ to A. Similarly, the same process can be used to determine the minimum count of the samples sent by B to A.
Step S1012, A and B added noise satisfying differential privacy to the selected data, respectively, and sent the data with the noise added, together with the data indexes, to each other. The data indexes ensure data alignment in the subsequent stages. In the setting of vertical federated learning, the indexes do not disclose additional information.
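Continuing the synthetic setup above, the noise addition of this step might look as follows. The Laplace mechanism with a range-based sensitivity bound is one standard way to satisfy epsilon-differential privacy; the disclosure does not fix a specific mechanism, and `n_tilde` and `epsilon` here are illustrative.

```python
def add_laplace_noise(values, epsilon, rng):
    # Laplace mechanism: noise with scale sensitivity / epsilon satisfies
    # epsilon-differential privacy; the observed value range serves as a
    # simple (illustrative) sensitivity bound for the shared feature.
    sensitivity = values.max() - values.min()
    return values + rng.laplace(scale=sensitivity / epsilon, size=values.shape)

n_tilde = 200                                    # e.g. the count from step S1011
sel = rng.choice(N, size=n_tilde, replace=False)
payload = {
    "indexes": indexes[sel],                              # lets B align samples
    "values": add_laplace_noise(XA[sel, 0], 1.0, rng),    # noised feature i_A
}
```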
Step S102, A and B took the prediction of each missing feature as a learning task, respectively, and took the received feature data as labels to train multiple models, respectively. At the same time, for the features without received data, A and B trained the models by the labeled-unlabeled multitask learning method.
In an embodiment, take the case that A sent part of the samples to B as an example.
where $L(\cdot)$ is a loss function of the model in which the sample of the data set $S_p$ is taken as the input, and $n_s$ is the number of samples in the data set $S_p$.
Step S103, A and B predicted the data of the other samples with the trained models, respectively, to complete the missing feature data.
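Continuing the sketches above, steps S102 and S103 from B's side reduce to a few lines: B treats the received noised values of the feature $i_A$ as labels, trains a predictor on its own features for the aligned samples (reusing `fit_linear` and `predict` from the earlier sketch), and then fills in the missing column for every local user.

```python
# Step S102: train a model for the missing feature i_A, taking B's own
# features as inputs and the received (noised) values as labels.
X_aligned = XB[payload["indexes"]]
w_iA = fit_linear(X_aligned, payload["values"])

# Step S103: predict the feature i_A for all local samples, so B's data
# becomes feature-complete before the horizontal training of step S104.
XB_completed = np.c_[XB, predict(w_iA, XB)]
```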
Step S104, A and B carried out the training together with the horizontal federated learning method to obtain a final trained model.
The efficient, secure and low-communication vertical federated learning method according to the present disclosure can, by combining with horizontal federated learning, use the data held by each participant to jointly train the model without exposing the local data of the participants. The privacy protection level of the method satisfies differential privacy, and the training result of the model is close to that of centralized learning.
The steps of the method or algorithm described combined with the embodiments of the present disclosure may be implemented in a hardware manner, or may be implemented in a manner in which a processor executes software instructions. The software instructions may consist of corresponding software modules, and the software modules can be stored in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), registers, hard disks, removable hard disks, CD-ROMs or any other forms of storage media well-known in the art. An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium. The storage medium can also be an integral part of the processor. The processor and storage medium may reside in an Application Specific Integrated Circuit (ASIC). Alternatively, the ASIC may be located in a node device, such as the processing node described above. In addition, the processor and storage medium may also exist in the node device as discrete components.
It should be noted that when the apparatus provided in the foregoing embodiment performs the method described above, division into the foregoing functional modules is used only as an example for description. In an actual application, the foregoing functions can be allocated to and implemented by different functional modules based on a requirement, that is, an inner structure of the apparatus is divided into different functional modules, to implement all or some of the functions described above. For details about a specific implementation process, refer to the method embodiment. Details are not described herein again.
All or some of the foregoing embodiments may be implemented by using software, hardware, firmware, or any combination thereof. When the software is used for implementation, all or some of the embodiments may be implemented in a form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a server or a terminal, all or some of the procedures or functions according to the embodiments of this application are generated. The computer instructions may be stored in a computer-readable storage medium or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a web site, computer, server, or data center to another web site, computer, server, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a server or a terminal, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disk (DVD)), or a semiconductor medium (for example, a solid-state drive).
The above are only preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Any amendment, equivalent replacement and improvement made within the spirit and principle of the present disclosure shall fall within the protection scope of the present disclosure.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202111356723.1 | Nov 2021 | CN | national |
The present application is a continuation of International Application No. PCT/CN2022/074421, filed on Jan. 27, 2022, which claims priority to Chinese Patent Application No. 202111356723.1, filed on Nov. 16, 2021, the contents of which are incorporated herein by reference in their entireties.
| | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/CN2022/074421 | Jan 2022 | US |
| Child | 18316256 | | US |