Some references, which may include patents, patent applications and various publications, are cited and discussed in the description of this disclosure. The citation and/or discussion of such references is provided merely to clarify the description of the present disclosure and is not an admission that any such reference is “prior art” to the disclosure described herein. All references cited and discussed in this specification are incorporated herein by reference in their entireties and to the same extent as if each reference were individually incorporated by reference.
The present disclosure relates generally to federated learning, and more specifically to large scale privacy-preservation federated learning on vertically partitioned data using kernel methods.
The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Federated learning is a machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging their data samples. However, it is a challenge to process large amounts of data with sufficient efficiency, scalability, and safety.
Therefore, an unaddressed need exists in the art to address the aforementioned deficiencies and inadequacies.
In certain aspects, the present disclosure relates to a system for prediction using a machine learning model. In certain embodiments, the system includes an active computing device and at least one passive computing device in communication with the active computing device. Each of the active and passive computing devices includes local data. The active computing device has a processor and a storage device storing computer executable code. The computer executable code, when executed at the processor, is configured to:
obtain parameters of the machine learning model;
retrieve an instance from the local data of the active computing device;
sample a random direction of the instance;
compute a dot product of the random direction and the instance, and calculate a random feature based on the dot product;
compute a predicted value of the instance in the active computing device, instruct the at least one passive computing device to compute a predicted value of the instance in the at least one passive computing device, and summarize the predicted values from the active and the at least one passive computing devices to obtain a final predicted value of the instance, where the predicted value of the instance in the at least one passive computing device is obtained based on the local data of the at least one passive computing device;
determine a model coefficient using the random feature and a difference between the final predicted value of the instance and a target value of the instance;
update the machine learning model using the model coefficient; and
predict a value for a new instance using the machine learning model.
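The training flow recited above can be sketched as a short, single-process Python illustration. The class name `ActiveWorker`, the squared-loss derivative, and the Gaussian feature scale are assumptions for illustration only, and the passive devices' contribution is reduced to a `passive_sum` argument rather than real communication:

```python
import numpy as np

class ActiveWorker:
    """Single-process sketch of the active device's training loop; the
    passive devices' summed local predictions arrive as `passive_sum`."""

    def __init__(self, X, y, gamma, lam, sigma=1.0):
        self.X, self.y = X, y                 # local data and targets
        self.gamma, self.lam, self.sigma = gamma, lam, sigma
        self.coeffs, self.seeds = [], []      # model coefficients and seeds

    def _direction(self, seed):
        # Sample the random direction from a Gaussian, seeded by the index.
        rng = np.random.default_rng(seed)
        w = rng.normal(scale=1.0 / self.sigma, size=self.X.shape[1])
        b = rng.uniform(0.0, 2.0 * np.pi)
        return w, b

    def predict_local(self, x):
        # f(x) = sum_i alpha_i * sqrt(2) * cos(w_i^T x + b_i)
        total = 0.0
        for a, s in zip(self.coeffs, self.seeds):
            w, b = self._direction(s)
            total += a * np.sqrt(2.0) * np.cos(w @ x + b)
        return total

    def train_step(self, i, passive_sum=0.0):
        x, target = self.X[i], self.y[i]
        w, b = self._direction(i)
        feature = np.sqrt(2.0) * np.cos(w @ x + b)   # random feature
        f_xi = self.predict_local(x) + passive_sum   # final predicted value
        # Shrink previous coefficients, then add the new one from the
        # difference between the final prediction and the target.
        self.coeffs = [(1.0 - self.gamma * self.lam) * a for a in self.coeffs]
        self.coeffs.append(-self.gamma * (f_xi - target) * feature)
        self.seeds.append(i)
        return f_xi
```

In a real deployment each step would also broadcast the index `i` to the passive devices and aggregate their local predictions before the coefficient update.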
In certain embodiments, the parameters of the machine learning model comprise a constant learning rate.
In certain embodiments, the instance is characterized by an index, the computer executable code is configured to provide the index to the at least one passive client computer, and each of the active and the at least one passive client computers is configured to sample the random direction based on the index. In certain embodiments, the random direction is sampled from a Gaussian distribution.
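The shared-index seeding described in this embodiment can be illustrated as follows. The helper `local_direction` and the fixed dimension split are hypothetical; the key point is that workers seeded with the same index recover consistent slices of one Gaussian random direction without exchanging it:

```python
import numpy as np

def local_direction(index, dims, worker):
    """Each worker derives the same full random direction from the shared
    instance index used as the seed, then keeps only the coordinates that
    correspond to its own vertical feature partition."""
    rng = np.random.default_rng(index)       # shared seed: the index i
    w_full = rng.standard_normal(sum(dims))  # full Gaussian direction
    start = sum(dims[:worker])
    return w_full[start:start + dims[worker]]

dims = [4, 3]  # feature counts on the active and one passive worker
w_active = local_direction(index=7, dims=dims, worker=0)
w_passive = local_direction(index=7, dims=dims, worker=1)
```

Because both calls use the same seed, concatenating the two slices reproduces one consistent direction over the whole feature space.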
In certain embodiments, the random feature is calculated using the equation ϕwi(xi)=√2 cos(wiTxi+b), wherein wi is the random direction, xi is the instance, and b is a random number in the range of [0, 2π].
In certain embodiments, the predicted value of the instance in the active computing device is calculated using a number of iterations, and the predicted value is updated in the iterations using the equation fl(x)=fl(x)+αiϕwi(x), wherein fl(x) is the predicted value in the active computing device, αi is the model coefficient at the i-th iteration, and ϕwi(x) is the random feature.
In certain embodiments, the number of iterations is equal to or greater than 2.
In certain embodiments, the computer executable code is configured to update the machine learning model by replacing each of the previous model coefficients using the equation αj=(1−γλ)αj, wherein αj is any one of the previous model coefficients, γ is a learning rate of the machine learning model, and λ is a regularization parameter of the machine learning model.
In certain embodiments, communication between the active and the at least one passive computing devices is performed using a tree structure via a coordinator computing device that is in communication with the active and the at least one passive computing devices.
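The tree-structured aggregation described above can be sketched as a pairwise (tree) reduction; this is a minimal in-process illustration in which a list of per-worker values stands in for real network communication:

```python
def tree_sum(values):
    """Aggregate per-worker partial results with pairwise (tree) reduction:
    O(log q) communication rounds instead of q sequential hops."""
    vals = list(values)
    while len(vals) > 1:
        nxt = []
        for i in range(0, len(vals) - 1, 2):
            nxt.append(vals[i] + vals[i + 1])  # each pair merges in parallel
        if len(vals) % 2:
            nxt.append(vals[-1])               # odd worker passes through
        vals = nxt
    return vals[0]
```

With q workers, each round halves the number of active senders, which is why a tree scheme outperforms star- or ring-structured accumulation as q grows.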
In certain aspects, the present disclosure relates to a method for prediction using a machine learning model. The method includes:
obtaining, by an active computing device, parameters of the machine learning model;
retrieving, by the active computing device, an instance from the local data of the active computing device;
sampling, by the active computing device, a random direction of the instance;
computing, by the active computing device, a dot product of the random direction and the instance, and calculating a random feature based on the dot product;
computing, by the active computing device, a predicted value of the instance, instructing at least one passive computing device to compute a predicted value of the instance therein, and summarizing the predicted values from the active and the at least one passive computing devices to obtain a final predicted value of the instance, wherein the predicted value of the instance in the at least one passive computing device is obtained based on the local data of the at least one passive computing device;
determining, by the active computing device, a model coefficient using the random feature and a difference between the final predicted value of the instance and a target value of the instance;
updating, by the active computing device, the machine learning model using the model coefficient; and
predicting, by the active computing device, a value for a new instance using the machine learning model.
In certain embodiments, the parameters of the machine learning model comprise a constant learning rate.
In certain embodiments, the instance is characterized by an index, the computer executable code is configured to provide the index to the at least one passive client computer, and each of the active and the at least one passive client computers is configured to sample the random direction based on the index. In certain embodiments, the random direction is sampled from a Gaussian distribution.
In certain embodiments, the random feature is calculated using the equation ϕwi(xi)=√2 cos(wiTxi+b), wherein wi is the random direction, xi is the instance, and b is a random number in the range of [0, 2π].
In certain embodiments, wiTxi+b is calculated by Σl=1q((wil)Txil+bl)−Σl≠l′bl, wherein q is a number of the active and the at least one passive computing devices, l is the l-th of the q computing devices, (wil)Txil is the dot product of the random direction and the instance in the l-th computing device, bl is a random number generated in the l-th computing device, and l′ is the active computing device.
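The masked summation above can be illustrated with a small in-process sketch. The worker count, feature split, and variable names are assumptions, and real communication is omitted; each worker reveals only its locally masked dot product, never its raw feature slice:

```python
import numpy as np

rng = np.random.default_rng(1)
q, d_parts = 3, [4, 3, 5]  # three workers, vertically partitioned features

# Each worker l privately holds its feature slice x_l and direction slice w_l.
x_parts = [rng.normal(size=d) for d in d_parts]
w_parts = [rng.normal(size=d) for d in d_parts]

# Worker l reveals only its masked local dot product (w_l . x_l) + b_l.
b = rng.uniform(0.0, 2.0 * np.pi, size=q)
masked = [w_parts[l] @ x_parts[l] + b[l] for l in range(q)]

active = 0  # index of the active worker l'
# Summing the masked values and removing every mask except the active
# worker's recovers w^T x + b_active without exposing any raw slice.
adjusted = sum(masked) - sum(b[l] for l in range(q) if l != active)

# Sanity reference computed with the (never actually shared) full vectors.
full = np.concatenate(w_parts) @ np.concatenate(x_parts) + b[active]
```

The surviving random term b_active then plays the role of the uniform phase b in the random feature ϕw(x)=√2 cos(wTx+b).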
In certain embodiments, the predicted value of the instance in the active computing device is calculated using a number of iterations, and the predicted value is updated in the iterations using the equation fl(x)=fl(x)+αiϕwi(x), wherein fl(x) is the predicted value in the active computing device, αi is the model coefficient at the i-th iteration, and ϕwi(x) is the random feature.
In certain embodiments, the computer executable code is configured to update the machine learning model by replacing each of the previous model coefficients using the equation αj=(1−γλ)αj, wherein αj is any one of the previous model coefficients, γ is a learning rate of the machine learning model, and λ is a regularization parameter of the machine learning model.
In certain embodiments, communication between the active and the at least one passive computing devices is performed using a tree structure via a coordinator computing device that is in communication with the active and the at least one passive computing devices.
In certain aspects, the present disclosure relates to a non-transitory computer readable medium storing computer executable code. The computer executable code, when executed at a processor of a computing device, is configured to perform the method described above.
These and other aspects of the present disclosure will become apparent from following description of the preferred embodiment taken in conjunction with the following drawings and their captions, although variations and modifications therein may be affected without departing from the spirit and scope of the novel concepts of the disclosure.
The accompanying drawings illustrate one or more embodiments of the disclosure and together with the written description, serve to explain the principles of the disclosure. Wherever possible, the same reference numbers are used throughout the drawings to refer to the same or like elements of an embodiment.
In certain embodiments, symbols and equations are defined as follows:
xi represents a part of an instance or a sample of data, which could be available in any of the workers or servers; the instance is indexed by i;
(xi)gl represents the portion of the instance xi corresponding to the feature group gl held by the l-th worker;
ℙ(w) is a distribution or measure, such as a Gaussian distribution;
wi is the random direction corresponding to the index i;
wiT is the transpose operation of wi;
(wi)l is the random direction corresponding to the index i in the local server l;
((wi)l)T is the transpose of (wi)l;
wiTxi is the dot product of wiT and xi;
((wi)l)T(xi)gl is the dot product of (wi)l and (xi)gl;
b is a random number in the range of [0, 2π], which could be generated by a random number generator;
((wi)l)T(xi)gl+b represents an adjusted dot product, where the dot product of (wi)l and (xi)gl is adjusted by the random value b;
wiTxi+b represents an adjusted dot product, where the dot product of wiT and xi is adjusted by the random value b;
f(xi) is the predicted value of the instance xi;
yi is the label of the instance xi;
ϕwi(xi) is the random feature corresponding to the random direction wi and the instance xi, for example ϕwi(xi)=√2 cos(wiTxi+b);
αi is the model coefficient of the instance xi;
{αi}i=1t represents the model coefficients at different iterations, where t is the number of iterations, not a transpose operator; for each iteration i, there is a corresponding model coefficient αi;
αl represents the portion of the model coefficients stored in the l-th worker; and
T0, T1, and T2 are tree structures for communication.
In many real-world machine learning applications, data are provided by multiple providers, each maintaining private records of different feature sets about common entities. It is challenging to train on such vertically partitioned data effectively and efficiently while keeping data privacy using traditional machine learning algorithms.
The present disclosure relates to large scale privacy-preservation federated learning on vertically partitioned data, focusing on nonlinear learning with kernels. In certain aspects, the present disclosure provides a federated doubly stochastic kernel learning (FDSKL) algorithm for vertically partitioned data. In certain embodiments, the present disclosure uses random features to approximate the kernel mapping function and uses doubly stochastic gradients to update the kernel model, which are all computed federatedly without revealing the whole data sample to each worker. Further, the disclosure uses a tree-structured communication scheme to distribute and aggregate computation, which has the lowest communication cost. The disclosure proves that FDSKL can converge to the optimal solution at a rate of O(1/t), where t is the number of iterations. The disclosure also provides an analysis of the data security under the semi-honest assumption. In conclusion, FDSKL is the first efficient and scalable privacy-preservation federated kernel method. Extensive experimental results on a variety of benchmark datasets show that FDSKL is significantly faster than state-of-the-art federated learning methods when dealing with kernels.
Certain embodiments of the present disclosure, among other things, have the following beneficial advantages: (1) The FDSKL algorithm can train on the vertically partitioned data efficiently, scalably and safely by kernel methods. (2) FDSKL is a distributed doubly stochastic gradient algorithm with a constant learning rate, which is much faster than the existing doubly stochastic gradient algorithms, all of which are built on a diminishing learning rate, and also much faster than existing privacy-preserving federated kernel learning algorithms. (3) A tree-structured communication scheme is used to distribute and aggregate computation, which is much more efficient than star-structured and ring-structured communication, and which makes FDSKL more efficient than existing federated learning algorithms. (4) Existing federated learning algorithms for vertically partitioned data use encryption technology to keep the algorithm safe, which is time consuming. In contrast, the method of the present disclosure uses random perturbations to keep the algorithm safe, which is cheaper than encryption technology and makes FDSKL more efficient than the existing federated learning algorithms. (5) Most existing federated learning algorithms on vertically partitioned data are limited to implicitly linearly separable models. The FDSKL of the present disclosure is the first efficient and scalable federated learning algorithm on vertically partitioned data that breaks the limitation of implicitly linear separability.
In certain aspects, the significant novel features of the present disclosure include: (1) FDSKL is a distributed doubly stochastic gradient algorithm for vertically partitioned data with a constant learning rate. (2) The disclosure proves the sublinear convergence rate of FDSKL. (3) The present disclosure computes ((wil)Txil+bl) locally to avoid directly transferring the local data (xi)gl to other workers for computing wiTxi+b, where bl is an added random number that keeps the algorithm safe. (4) The disclosure provides an analysis of the data security under the semi-honest assumption.
In certain aspects, the present disclosure relates to random feature approximation. Random features (Rahimi and Recht, 2008, 2009, which are incorporated herein by reference in their entirety) are a powerful technique for making kernel methods scalable. The technique uses the intriguing duality between positive definite kernels that are continuous and shift invariant (i.e., K(x, x′)=K(x−x′)) and stochastic processes, as shown in Theorem 0.
Theorem 0 (Rudin, 1962, which is incorporated herein by reference in its entirety). A continuous, real-valued, symmetric and shift-invariant function K(x, x′)=K(x−x′) on ℝd is a positive definite kernel if and only if there is a finite non-negative measure ℙ(w) on ℝd, such that:
K(x−x′)=∫ℝd eiwT(x−x′)dℙ(w)=∫ℝd×[0,2π]2 cos(wTx+b)cos(wTx′+b)d(ℙ(w)×ℙ(b)),
where ℙ(b) is a uniform distribution on [0, 2π], and ϕw(x)=√2 cos(wTx+b).
According to Theorem 0, the value of the kernel function can be approximated by explicitly computing the random feature map ϕw(x) as follows:
K(x, x′)≈(1/m)Σi=1mϕwi(x)ϕwi(x′),
where m is the number of random features and the wi are drawn from ℙ(w). Specifically, for the Gaussian RBF kernel K(x, x′)=exp(−∥x−x′∥2/2σ2), ℙ(w) is a Gaussian distribution with density proportional to exp(−σ2∥w∥2/2). For the Laplacian kernel (Yang et al., 2014, which is incorporated herein by reference in its entirety), this yields a Cauchy distribution. Note that the computation of a random feature map ϕ requires computing a linear combination of the raw input features, which can also be partitioned vertically. This property makes random feature approximation well-suited for the federated learning setting.
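The approximation described above can be checked numerically. This sketch assumes the Gaussian RBF kernel and the √2 cos(wTx+b) feature map discussed in this section; the function and variable names are illustrative:

```python
import numpy as np

def random_fourier_features(X, W, b):
    """Map raw inputs X (n x d) to random features (n x m) using
    directions W (m x d) and phase shifts b (m,) drawn once up front.
    The 1/sqrt(m) normalization is folded into the feature map."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
d, m, sigma = 5, 5000, 1.0

# For the Gaussian RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)),
# directions are sampled from a Gaussian and phases uniformly on [0, 2*pi].
W = rng.normal(scale=1.0 / sigma, size=(m, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)

x, x2 = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - x2) ** 2) / (2 * sigma ** 2))
approx = (random_fourier_features(x[None, :], W, b)
          @ random_fourier_features(x2[None, :], W, b).T)
```

The inner product of the two feature vectors concentrates around the exact kernel value as m grows, at the usual O(1/√m) Monte Carlo rate.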
In certain aspects, the present disclosure relates to doubly stochastic gradient. Because the functional gradient in RKHS H can be computed as ∇f(x)=K(x, ·), the stochastic gradient of ∇f(x) with regard to the random feature w can be denoted by:
ζ(·)=ϕw(x)ϕw(·).
Given a randomly sampled data instance (x, y) and a random feature w, the doubly stochastic gradient of the loss function L(f(x), y) on the RKHS with regard to the sampled instance (x, y) and the random feature w can be formulated as follows:
ζ(·)=L′(f(x), y)ϕw(x)ϕw(·).
Because ∇∥f∥H2=2f, the stochastic gradient of the regularized risk R(f) can be formulated as follows:
ζ̂(·)=ζ(·)+λf(·).
Note that we have E(x,y)Ew[ζ̂(·)]=∇R(f). According to the stochastic gradient ζ̂(·), we can update the solution with the step size γt. Then, letting f1(·)=0, we have:
ft+1(·)=ft(·)−γtζ̂t(·)=Σi=1tαitϕwi(·), where αit=−γiL′(f(xi), yi)ϕwi(xi)Πj=i+1t(1−γjλ).
From the above equation, the αit are the important coefficients that define the model f(·). Note that, similar to the usual kernel model f(x)=Σi=1NαiK(xi, x), the model f(x) in the above equation does not satisfy the assumption of implicitly linear separability.
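The update rule above can be sketched as a single-machine Python loop with the coefficient decay αj←(1−γλ)αj; `dsg_train`, the squared loss, and the parameter choices are assumptions for illustration:

```python
import numpy as np

def phi(w, b, x):
    # Random feature phi_w(x) = sqrt(2) * cos(w^T x + b)
    return np.sqrt(2.0) * np.cos(w @ x + b)

def dsg_train(X, y, T, gamma, lam, sigma=1.0, seed=0):
    """Doubly stochastic gradient sketch with squared loss L = (f - y)^2 / 2,
    so L'(f, y) = f - y; a constant learning rate gamma is used as in FDSKL."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alphas, dirs = [], []  # model coefficients and their (w, b) draws
    for _ in range(T):
        i = rng.integers(n)                        # random instance
        w = rng.normal(scale=1.0 / sigma, size=d)  # random direction
        b = rng.uniform(0.0, 2.0 * np.pi)
        # f(x_i) under the current coefficients
        f_xi = sum(a * phi(w_j, b_j, X[i])
                   for a, (w_j, b_j) in zip(alphas, dirs))
        # decay previous coefficients: alpha_j <- (1 - gamma * lam) * alpha_j
        alphas = [(1.0 - gamma * lam) * a for a in alphas]
        # new coefficient: alpha_t = -gamma * L'(f(x_i), y_i) * phi_w(x_i)
        alphas.append(-gamma * (f_xi - y[i]) * phi(w, b, X[i]))
        dirs.append((w, b))
    return alphas, dirs
```

Each iteration appends one coefficient, so the model after t iterations is exactly f(·)=Σi αi ϕwi(·) with the accumulated (1−γλ) decay factors.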
In certain aspects, the federated doubly stochastic kernel learning (FDSKL) algorithm is described as follows.
FDSKL System Structure:
Data Privacy: To keep the vertically partitioned data private, certain embodiments of the disclosure divide the computation of the value of ϕwi(xi) among the workers, such that each worker l only computes the local adjusted dot product ((wil)Txil+bl) based on its own feature partition, and the whole data sample xi is never revealed to any single worker.
Model Privacy: In addition to keeping the vertically partitioned data private, the disclosure also keeps the model private. The model coefficients αi are stored in different workers separately and privately. According to the location of the model coefficients αi, the disclosure partitions the model coefficients {αi}i=1t as α1, . . . , αq, where αl denotes the coefficients stored in the l-th worker.
Tree-structured communication: In order to obtain wiTxi and f(xi), the disclosure needs to accumulate the local results from different workers. The present disclosure uses an efficient tree-structured communication scheme to get the global sum, which is faster than the simple strategies of star-structured communication and ring-structured communication. The tree structure described by Zhang et al., 2018 is incorporated herein by reference in its entirety.
FDSKL Algorithm: To extend the doubly stochastic gradient (DSG) method to federated learning on vertically partitioned data while keeping data privacy, the disclosure needs to carefully design the procedures of computing wiTxi+b and f(xi) and of updating the solution. In certain embodiments, the solution is detailed in the following procedures 1-3 and exemplified by Algorithm 1 with reference to Algorithms 2 and 3. In certain embodiments, in contrast to the diminishing learning rate used in DSG, FDSKL uses a constant learning rate γ, which can be implemented more easily in the parallel computing environment.
Based on these key procedures, the disclosure summarizes the FDSKL algorithm in Algorithm 1. Note that, different from the diminishing learning rate used in DSG, the FDSKL of the disclosure uses a constant learning rate, which can be implemented more easily in the parallel computing environment. However, the convergence analysis for a constant learning rate is more difficult than that for a diminishing learning rate.
In certain embodiments, the output λ−
Theoretical Analysis:
In certain embodiments, the present disclosure proves that FDSKL converges to the optimal solution at a rate of O(1/t) as shown in Theorem 1.
Assumption 1: Suppose the following conditions hold:
Theorem 1: Set ϵ>0. For Algorithm 1 with a suitably chosen constant learning rate, the solution converges to the optimal solution f* within O(1/ϵ) iterations, where B=[√(G22+G1)+G2]2 and e1=E[∥h1−f*∥H2].
In certain embodiments, the present disclosure proves that FDSKL can prevent an inference attack (Definition 1 below) under the semi-honest assumption (Assumption 2 below):
Definition 1 (inference attack): an inference attack on the l-th worker is to infer some feature group of a sample xi that belongs to another worker, without directly accessing it.
Assumption 2 (semi-honest security): all workers will follow the protocol or algorithm to perform the correct computation. However, they may retain records of the intermediate computation results, which they may use later to infer the other workers' data.
The present disclosure is more particularly described in the following examples that are intended as illustrative only since numerous modifications and variations therein will be apparent to those skilled in the art. Various embodiments of the disclosure are now described in detail. Referring to the drawings, like numbers indicate like components throughout the views. As used in the description herein and throughout the claims that follow, the meaning of “a”, “an”, and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Moreover, titles or subtitles may be used in the specification for the convenience of a reader, which shall have no influence on the scope of the present disclosure. Additionally, some terms used in this specification are more specifically defined below.
The terms used in this specification generally have their ordinary meanings in the art, within the context of the disclosure, and in the specific context where each term is used. Certain terms that are used to describe the disclosure are discussed below, or elsewhere in the specification, to provide additional guidance to the practitioner regarding the description of the disclosure. It will be appreciated that same thing can be said in more than one way. Consequently, alternative language and synonyms may be used for any one or more of the terms discussed herein, nor is any special significance to be placed upon whether or not a term is elaborated or discussed herein. Synonyms for certain terms are provided. A recital of one or more synonyms does not exclude the use of other synonyms. The use of examples anywhere in this specification including examples of any terms discussed herein is illustrative only, and in no way limits the scope and meaning of the disclosure or of any exemplified term. Likewise, the disclosure is not limited to various embodiments given in this specification.
As used herein, the term “module” may refer to, be part of, or include an Application Specific Integrated Circuit (ASIC); an electronic circuit; a combinational logic circuit; a field programmable gate array (FPGA); a processor (shared, dedicated, or group) that executes code; other suitable hardware components that provide the described functionality; or a combination of some or all of the above, such as in a system-on-chip. The term module may include memory (shared, dedicated, or group) that stores code executed by the processor.
The term “code”, as used herein, may include software, firmware, and/or microcode, and may refer to programs, routines, functions, classes, and/or objects. The term shared, as used above, means that some or all code from multiple modules may be executed using a single (shared) processor. In addition, some or all code from multiple modules may be stored by a single (shared) memory. The term group, as used above, means that some or all code from a single module may be executed using a group of processors. In addition, some or all code from a single module may be stored using a group of memories.
The term “interface”, as used herein, generally refers to a communication tool or means at a point of interaction between components for performing data communication between the components. Generally, an interface may be applicable at the level of both hardware and software, and may be uni-directional or bi-directional interface. Examples of physical hardware interface may include electrical connectors, buses, ports, cables, terminals, and other I/O devices or components. The components in communication with the interface may be, for example, multiple components or peripheral devices of a computer system.
The present disclosure relates to computer systems. As depicted in the drawings, computer components may include physical hardware components, which are shown as solid line blocks, and virtual software components, which are shown as dashed line blocks. One of ordinary skill in the art would appreciate that, unless otherwise indicated, these computer components may be implemented in, but not limited to, the forms of software, firmware or hardware components, or a combination thereof.
The apparatuses, systems and methods described herein may be implemented by one or more computer programs executed by one or more processors. The computer programs include processor-executable instructions that are stored on a non-transitory tangible computer readable medium. The computer programs may also include stored data. Non-limiting examples of the non-transitory tangible computer readable medium are nonvolatile memory, magnetic storage, and optical storage.
The present disclosure will now be described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the present disclosure are shown. This disclosure may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the present disclosure to those skilled in the art.
The processor 352 may be a central processing unit (CPU) which is configured to control operation of the worker 350. The processor 352 can execute an operating system (OS) or other applications of the worker 350. In certain embodiments, the worker 350 may have more than one CPU as the processor, such as two CPUs, four CPUs, eight CPUs, or any suitable number of CPUs.
The memory 354 can be a volatile memory, such as random-access memory (RAM), for storing the data and information during the operation of the worker 350. In certain embodiments, the memory 354 may be a volatile memory array. In certain embodiments, the worker 350 may run on more than one memory 354. In certain embodiments, the worker 350 may further include a graphic card to assist the processor 352 and the memory 354 with image processing and display.
The storage device 356 is a non-volatile data storage media for storing the OS (not shown) and other applications of the worker 350. Examples of the storage device 356 may include non-volatile memory such as flash memory, memory cards, USB drives, hard drives, floppy disks, optical drives, or any other types of data storage devices. In certain embodiments, the worker 350 may have multiple storage devices 356, which may be identical storage devices or different types of storage devices, and the applications of the worker 350 may be stored in one or more of the storage devices 356 of the worker 350.
In these embodiments, the processor 352, the memory 354, and the storage device 356 are components of the worker 350, such as a server computing device. In other embodiments, the worker 350 may be a distributed computing device and the processor 352, the memory 354 and the storage device 356 are shared resources from multiple computers in a pre-defined area.
The storage device 356 includes, among other things, a FDSKL application 358 and private data 372. The FDSKL application 358 includes a listener 360, a parameter module 362, a sampling module 364, a random feature module 366, an output prediction module 368, and a model coefficient module 370. In certain embodiments, the storage device 356 may include other applications or modules necessary for the operation of the FDSKL application 358. It should be noted that the modules 360-370 are each implemented by computer executable codes or instructions, or data table or databases, which collectively forms one application. In certain embodiments, each of the modules may further include sub-modules. Alternatively, some of the modules may be combined as one stack. In other embodiments, certain modules may be implemented as a circuit instead of executable code. In certain embodiments, one module or a combination of the modules may also be named a model, the model may have multiple parameters that can be learned by training, and the model with well-trained parameters can be used for prediction.
The listener 360 is configured to, upon receiving an instruction for training, initialize a FDSKL training and send a notice to the parameter module 362. The instruction may be received from an administrator or a user of the worker device 350. In this situation, the worker 350 functions as an active worker. In certain embodiments, upon receiving a request from an active worker, the listener 360 may also instruct the parameter module 362, so that the parameter module 362 and the other related modules can compute and provide information to the active worker. In this situation, the worker 350 functions as a passive worker. The information the passive worker 350 provides may include a predicted output corresponding to a sample, and an adjusted dot product of the random direction and the sample. Kindly note that the modules of the FDSKL application 358 in the active worker and the passive worker are basically the same, and the operation of the application in the active worker can call certain functions of the application in the passive worker. Unless otherwise specified, the following modules are described in regard to an active worker.
The parameter module 362 is configured to, upon receiving the notice from the listener 360 that an FDSKL training should be performed, provide parameters of the FDSKL application to the sampling module 364. The parameters include, for example, a distribution ℙ(w), a regularization parameter λ, and a constant learning rate γ. In certain embodiments, the distribution or measure ℙ(w) is a Gaussian distribution. In certain embodiments, the regularization parameter λ, the constant learning rate γ, and the measure ℙ(w) are predefined.
The sampling module 364 is configured to, when the parameters of the FDSKL application are available, pick up an instance or a sample xi or (xi)gl from the private data 372 with the index i. For example, the private data 372 or the local data Dl in the worker 350 may include online consumption, loan and repayment information of customers. The index i is used to identify a customer, which could be a personal ID of the customer. The sampling module 364 is configured to pick up the instance randomly, and therefore the index i is also named a random seed. The sampled instance may include a set of attributes of the customer, and each attribute may correspond to a record of the customer. The record of the customer may be the online consumption amount of the customer per month, the number and amount of loans the customer has taken, the repayment history of the customer, etc.
The sampling module 364 is further configured to, when the index i is available, send the index i to the other related workers via the coordinator 310. The other related workers are the workers in the system 300 that are available or relevant to the active worker's model training, and those related workers are defined as passive workers.
The sampling module 364 is also configured to sample a random direction wi from the distribution ℙ(w) using the index i, and send the instance xi and the random direction wi to the random feature module 366 and the output prediction module 368. Because the instance xi is randomly picked up from the private data 372, the corresponding index i of the instance xi can also be regarded as a random value. Accordingly, the index i is used as a random seed for sampling the random direction wi from the distribution ℙ(w).
The random feature module 366 is configured to, upon receiving the instance xi and the random direction wi, compute the dot product of the random direction wiT and the instance xi, add a random number b to the dot product to obtain an adjusted dot product, save the adjusted dot product locally, calculate the random feature from the adjusted dot product, send the adjusted dot product to the output prediction module 368, and send the random feature to the model coefficient module 370. In certain embodiments, the adjusted dot product is obtained using the formula wiTxi+b, where b is a random number in a range of 0 to 2π. In certain embodiments, the random feature is calculated as ϕwi(xi)=√2 cos(wiTxi+b).
The output prediction module 368 is configured to calculate a predicted output value of the sample xi, instruct the output prediction modules 368 of the other related workers to calculate their respective predicted output values, compute the final predicted output by adding the predicted output value of the active worker and the predicted output values of the passive workers together, and send the final output value to the model coefficient module 370. In certain embodiments, each of the active worker and the passive workers is configured to call Algorithm 2 described above to calculate its respective predicted output value. In certain embodiments, the respective predicted output values are communicated based on a tree structure T0. In certain embodiments, the tree-structured communication T0 has the same or similar structure as that shown in the accompanying drawings.
The model coefficient module 370 is configured to, upon receiving the random feature ϕwi(xi) from the random feature module 366 and the final predicted output from the output prediction module 368, compute the model coefficient αi for the current iteration and update the previously stored model coefficients.
The private data 372 stores data specific for the worker 350. The private data 372 include a large number of instances or samples, and each instance can be indexed. The private data 372 stored in different workers are different, but they may be indexed or linked in the same way, for example by the identifications of customers. In an example, a first worker 350-1 may be a server computing device in a digital finance company, and its private data include online consumption, loan and repayment information. A second worker 350-2 may be a server computing device in an e-commerce company, and its private data include online shopping information. A third worker 350-3 may be a server computing device in a bank, and its private data include customer information such as average monthly deposit and account balance. If a person submits a loan application to the digital finance company, the digital finance company might evaluate the credit risk of this loan by comprehensively utilizing the information stored in the three workers. Therefore, to make the evaluation, the first worker 350-1 can initiate a process as the active worker, and the second worker 350-2 and the third worker 350-3 can operate as passive workers. The three workers do not share private data. However, since some of the customers for the digital finance company, the e-commerce company and the bank are the same, those customers can be indexed and linked, such that their private data in the three workers can be utilized by the FDSKL model or the FDSKL application 358. In certain embodiments, each of the three workers 350-1, 350-2 and 350-3 is installed with the FDSKL application 358, and each of the workers can initiate an FDSKL training as an active worker. Kindly note that for each worker, its private data 372 are accessible by its own FDSKL application 358.
In certain embodiments, the FDSKL application 358 may further include a user interface. The user interface is configured to provide a user interface or graphical user interface in the worker 350. In certain embodiments, the user is able to configure or revise parameters for training or using the FDSKL application 358.
As shown in
At procedure 404, the sampling module 364 randomly picks an instance or a sample xi with the index i from the private data 372, and sends the index i to the passive workers. The instance may include multiple attributes. In certain embodiments, the random seed i is saved in each of the active and passive workers for later use.
At procedure 406, the sampling module 364 samples a random direction wi from the distribution (w) based on the index i, and sends the instance xi and the sampled random direction wi to the random feature module 366 and the output prediction module 368.
At procedure 408, when the instance xi and the random direction wi are available, the random feature module 366 computes the dot product of the random direction wiT and the instance xi, adds a random number b to obtain the adjusted dot product, and saves the adjusted dot product locally. In certain embodiments, the adjusted dot product is obtained using the formula wiTxi+b, where b is a random number in a range of 0 to 2π.
At procedure 410, after obtaining the values of wiTxi+b, the random feature module 366 calculates the random feature ϕwi(xi)=√2 cos(wiTxi+b), and sends the random feature to the model coefficient module 370.
At procedure 412, when the sample xi, the weight values wi, the measure (w), and the model coefficient αi are available, the output prediction module 368 of each of the workers calculates a predicted value for the sample xi. Here the sample xi and the model coefficient αi are specific to each of the workers, and different workers generally have different xi and αi. But the distribution (w) and the index i are the same in the different workers, and thus the different workers have the same random direction wi, which is sampled from the distribution (w) based on the index i. In certain embodiments, before any training process, the workers may not have any model coefficient αi. After each iteration of training, a corresponding model coefficient α is created and added to the model. In certain embodiments, the output prediction module 368 of the active worker calculates its predicted value for the sample xi, and also coordinates with the passive workers to calculate their respective predicted values for their respective samples xi. In certain embodiments, the output prediction module 368 of the active worker coordinates with the passive workers via the coordinator 310.
At procedure 414, when the respective predicted output values are available in the workers, the coordinator 310 uses a tree-structured communication scheme to send the predicted values from the passive workers to the active worker, the output prediction module 368 uses the predicted value of the active worker and the predicted values of the passive workers to obtain the final predicted value f(xi) for the sample xi, and sends the final predicted value to the model coefficient module 370. In certain embodiments, the calculation is performed using f(xi)=Σl=1q fl(xi).
At procedure 416, upon receiving the random feature ϕwi(xi) and the final predicted value f(xi), the model coefficient module 370 of the active worker computes the model coefficient αi for the current iteration.
At procedure 418, after computing the model coefficient, the model coefficient module 370 of the active worker updates all the previous model coefficients. In certain embodiments, the passive workers would similarly perform the above steps in parallel. Specifically, each passive worker, after receiving the index i from the active worker via the coordinator 310, picks up the instance xi corresponding to the index i from its local private data 372, samples the random direction wi corresponding to the index i (or alternatively receives the random direction wi from the active worker), calculates the dot product wiTxi between the random direction and the instance, obtains the adjusted dot product wiTxi+b by adding a random value b in the range of 0 to 2π, and computes the random feature ϕwi(xi)=√2 cos(wiTxi+b).
Kindly note that the active worker and the passive workers share the same index i and the same random direction wi, but have their own instances xi. Further, the random values b for calculating the adjusted dot products are different for different workers. In addition, each worker has its own predicted output corresponding to its own instance xi, but the workers will use the same final predicted output f(xi), which is a summation of the predicted outputs from all the workers.
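The property that all workers reproduce the same random direction wi from the shared index i can be sketched as follows. This is an illustration only: the helper name sample_direction is hypothetical, and a standard Gaussian is used as a stand-in for the distribution (w), whose actual form depends on the chosen kernel.

```python
import numpy as np

DIM = 4   # dimensionality of each worker's feature block (illustrative)

def sample_direction(index, dim=DIM):
    """Sample a random direction w_i using the instance index i as the
    random seed, so every worker that knows i reproduces the same w_i
    without exchanging it. A standard Gaussian stands in for the
    distribution over directions."""
    rng = np.random.default_rng(index)   # index i acts as the shared seed
    return rng.standard_normal(dim)

# Two workers sampling with the same index obtain identical directions,
# while each keeps its own local instance x_i private.
w_active = sample_direction(42)
w_passive = sample_direction(42)
```

Because only the index travels between workers, the direction itself never needs to be communicated.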
In certain embodiments, after the model is well trained by the above process, a prediction can be performed similarly. The differences between prediction and training include, for example: an instance is provided for prediction, while multiple instances are picked randomly for training; the prediction is performed using the provided instance, while the training is performed iteratively using randomly picked instances; and the prediction can stop at procedure 414, since the prediction for the provided instance is then complete, while the training needs to update the model parameters at procedures 416 and 418.
As shown in
At procedure 502, the output prediction module 368 in the l-th active worker sets the predicted value fl(x) as 0.
At procedure 504, for each iteration of calculation using the sample x, the output prediction module 368 picks up a random direction wi from the distribution (w) corresponding to the random seed i. In certain embodiments, the random seed i is saved for example in the above procedure 404, and is retrieved at the procedure 504.
At procedure 506, the output prediction module 368 obtains wiTx+b if it is locally saved, and otherwise, instructs computation of wiTx+b using Algorithm 3.
At procedure 508, the output prediction module 368 computes the random feature ϕwi(x)=√2 cos(wiTx+b).
At procedure 510, the output prediction module 368 computes the predicted value on the local worker as fl(x)=fl(x)+αiϕwi(x).
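The loop of procedures 502-510 can be sketched as below. This is a sketch under assumptions, not the disclosed implementation: the helper name local_predicted_value is hypothetical, and seed-derived Gaussian directions stand in for sampling wi from the distribution (w).

```python
import numpy as np

def local_predicted_value(x, coefficients, seeds, b_values):
    """Sketch of procedures 502-510: start with f_l(x) = 0 and, for each
    stored model coefficient alpha_i, re-sample the direction w_i from its
    seed i, form the adjusted dot product w_i^T x + b, and accumulate
    alpha_i * sqrt(2) * cos(w_i^T x + b)."""
    f_l = 0.0                                                     # procedure 502
    for alpha, seed, b in zip(coefficients, seeds, b_values):
        w = np.random.default_rng(seed).standard_normal(len(x))   # procedure 504
        f_l += alpha * np.sqrt(2.0) * np.cos(np.dot(w, x) + b)    # 506-510
    return f_l

# Example call with illustrative coefficients, seeds and random numbers b
x = np.array([1.0, -0.5, 2.0])
f_val = local_predicted_value(x, coefficients=[0.3, -0.1],
                              seeds=[7, 8], b_values=[0.4, 1.1])
```

Each worker runs this loop over its own feature block; the local values fl(x) are then summed across workers to form f(x).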
In certain embodiments, after the model is well trained, the procedures in
As shown in
At procedure 604, the random feature module 366 generates a random number b in the range of [0, 2π] using the seed σ(i). In certain embodiments, the random seed σ(i) is generated locally, for example by any type of random number generator.
At procedure 606, the random feature module 366 adds the random number b to the dot product wiTxi to obtain the adjusted dot product wiTxi+b.
At procedure 608, the procedures 602-606 are repeated for each of the workers 1 to q. Specifically, at the same time of performing the procedures 602-606 by the l-th active worker, the random feature module 366 of the active worker also asks the passive workers to repeat the procedures 602, 604 and 606 locally, so that each of the workers calculates its own adjusted dot product. Assume there are q workers in total. For any worker l̂ among the q workers, the random seed is represented by σl̂(i), the generated random number in the range of [0, 2π] is represented by bl̂, the dot product is represented by wiTxil̂, and the adjusted dot product is represented by wiTxil̂+bl̂. Kindly note that the random directions wi in the q workers are the same, each of which is picked up from the distribution (w) using the same index i, but the instances xil̂ in the q workers are different because different workers store different data for the same customer i, and the random numbers bl̂ in the q workers are different because each of them is generated locally and randomly. By performing the procedures 602-608, each of the q workers has its own calculated adjusted dot product.
At procedure 610, the random feature module 366 summates the adjusted dot products from the workers 1 to q to obtain a summated dot product. In certain embodiments, the summation is performed using the equation ζ=Σl̂=1q(wiTxil̂+bl̂). Because the summation is performed using data from q workers, the summated dot product includes q random numbers bl̂. In certain embodiments, the summation is coordinated via the coordinator 310. In certain embodiments, the summation is performed using the tree structure T1.
At procedure 612, the random feature module 366 randomly selects a worker l′ which is not the l-th active worker, and uses a tree structure T2 to compute a summation of the random numbers b except that of the l-th worker. Since l′ is not the same as l, the l′-th worker is a passive worker that is randomly selected from all the passive workers. The summation of the random numbers b is calculated using the equation Σl̂≠lbl̂.
At procedure 614, the random feature module 366 generates the random feature by reducing the summated random numbers from the summated dot product. The random feature is calculated as ζ−Σl̂≠lbl̂, such that only the random number bl of the l-th worker remains in the result.
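The perturbation scheme of procedures 602-614 can be sketched as follows. This is an illustration under assumptions: the helper name secure_dot_product is hypothetical, each worker's feature block and matching slice of wi are passed in explicitly, and a single shared generator stands in for the workers' locally generated random numbers.

```python
import numpy as np

def secure_dot_product(w_blocks, x_blocks, active, rng=None):
    """Each worker l computes its adjusted dot product w^(l) . x^(l) + b_l
    with a locally generated b_l in [0, 2*pi). The perturbed parts are
    summed (in FDSKL, over a tree structure T1), and the random numbers of
    all workers except the active one are subtracted (over T2), leaving
    w^T x + b_active without exposing any worker's raw block."""
    rng = rng or np.random.default_rng(0)
    b = rng.uniform(0.0, 2.0 * np.pi, size=len(x_blocks))
    # Summated dot product zeta = sum_l (w^(l) . x^(l) + b_l)
    zeta = sum(np.dot(wb, xb) + bl
               for wb, xb, bl in zip(w_blocks, x_blocks, b))
    # Summation of random numbers b, excluding the active worker's b
    B = sum(bl for l, bl in enumerate(b) if l != active)
    return zeta - B            # equals w^T x + b_active

# Two workers holding complementary feature blocks (illustrative values)
w_blocks = [np.array([1.0, -1.0]), np.array([0.5])]
x_blocks = [np.array([2.0, 3.0]), np.array([4.0])]
val = secure_dot_product(w_blocks, x_blocks, active=0)
```

No single worker ever sees another worker's unperturbed dot product, yet the combined result equals the full dot product plus one surviving random number.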
By the above operations described in
In certain aspects, the present disclosure relates to a non-transitory computer readable medium storing computer executable code. In certain embodiments, the computer executable code may be the software stored in the storage device 356 as described above. The computer executable code, when being executed, may perform one of the methods described above.
In certain aspects, the present disclosure relates to a method of using the well-trained FDSKL model to predict a result for an instance. In certain embodiments, the disclosure uses the procedure described in
Example. Exemplary experiments have been conducted using the model according to certain embodiments of the present disclosure.
Design of Experiments: To demonstrate the superiority of FDSKL on federated kernel learning with vertically partitioned data, we compare with PP-SVMV (Yu, Vaidya, and Jiang, 2006). Moreover, to verify the predictive accuracy of FDSKL on vertically partitioned data, we compare with oracle learners that can access the whole data samples without the federated learning constraint. For the oracle learners, we use state-of-the-art kernel classification solvers, including LIBSVM (Chang and Lin, 2011) and DSG (Dai et al., 2014). Finally, we include FD-SVRG (Wan et al., 2007), which uses a linear model, to comparatively verify the accuracy of FDSKL. The algorithms for comparison are listed in
Implementation Details: Our experiments were performed on a 24-core two-socket Intel Xeon CPU E5-2650 v4 machine with 256 GB RAM. We implemented our FDSKL in python, where the parallel computation was handled via MPI4py (Dalcin et al., 2011). The code of LIBSVM is provided by Chang and Lin (2011). We used the implementation provided by Dai et al. (2014) for DSG. We modified the implementation of DSG such that it uses a constant learning rate. Our experiments use the following binary classification datasets as described below.
Datasets:
Results and Discussions: The results are shown in
As mentioned in previous section, FDSKL used a tree structured communication scheme to distribute and aggregate computation. To verify such a systematic design, we also compare the efficiency of three commonly used communication structures: cycle-based, tree-based and star-based. The goal of the comparison task is to compute the kernel matrix (linear kernel) of the training set of four datasets. Specifically, each node maintains a feature subset of the training set, and is asked to compute the kernel matrix using the feature subset only. The computed local kernel matrices on each node are then summed by using one of the three communication structures. Our experiment compares the efficiency (elapsed communication time) of obtaining the final kernel matrix, and the results are given in
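The advantage of the tree-based structure can be illustrated with a minimal sketch of pairwise tree aggregation (the helper name tree_reduce is hypothetical; in the experiments the reduction runs across MPI nodes rather than over a local list):

```python
def tree_reduce(values):
    """Sum the values held by q workers in pairwise rounds. A tree scheme
    needs about log2(q) communication rounds, whereas star- or cycle-based
    schemes aggregate through q-1 sequential steps at a single hub."""
    vals = list(values)
    while len(vals) > 1:
        paired = []
        for i in range(0, len(vals) - 1, 2):
            paired.append(vals[i] + vals[i + 1])   # one round: neighbors merge
        if len(vals) % 2:                          # odd worker waits a round
            paired.append(vals[-1])
        vals = paired
    return vals[0]
```

The same pattern applies to summing local kernel matrices, since matrix addition is associative in the same way as scalar addition.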
Conclusion: Privacy-preservation federated learning for vertically partitioned data is currently an urgent need in machine learning. In certain embodiments and examples of the disclosure, we propose a federated doubly stochastic kernel learning (i.e., FDSKL) algorithm for vertically partitioned data, which breaks the limitation of implicit linear separability assumed in the existing privacy-preservation federated learning algorithms. We proved that FDSKL has a sublinear convergence rate, and can guarantee data security under the semi-honest assumption. To the best of our knowledge, FDSKL is the first efficient and scalable privacy-preservation federated kernel method. Extensive experimental results show that FDSKL is more efficient than the existing state-of-the-art kernel methods for high dimensional data while retaining similar generalization performance.
Certain embodiments of the present disclosure, among other things, have the following beneficial advantages: (1) The FDSKL algorithm can train on vertically partitioned data efficiently, scalably and safely by kernel methods. (2) FDSKL is a distributed doubly stochastic gradient algorithm with a constant learning rate, which is much faster than the existing doubly stochastic gradient algorithms, all of which are built on a diminishing learning rate, and also much faster than the existing privacy-preserving federated kernel learning algorithms. (3) A tree-structured communication scheme is used to distribute and aggregate computation, which is much more efficient than star-structured and ring-structured communication, and makes FDSKL more efficient than existing federated learning algorithms. (4) Existing federated learning algorithms for vertically partitioned data use encryption technology to keep the algorithm safe, which is time consuming. In contrast, the method of the present disclosure uses random perturbations to keep the algorithm safe, which is cheaper than encryption and makes FDSKL more efficient than the existing federated learning algorithms. (5) Most existing federated learning algorithms on vertically partitioned data are limited to linearly separable models. The FDSKL of the present disclosure is the first efficient and scalable federated learning algorithm on vertically partitioned data that breaks the limitation of implicit linear separability.
The foregoing description of the exemplary embodiments of the disclosure has been presented only for the purposes of illustration and description and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching.
The embodiments were chosen and described in order to explain the principles of the disclosure and their practical application so as to enable others skilled in the art to utilize the disclosure and various embodiments and with various modifications as are suited to the particular use contemplated. Alternative embodiments will become apparent to those skilled in the art to which the present disclosure pertains without departing from its spirit and scope. Accordingly, the scope of the present disclosure is defined by the appended claims rather than the foregoing description and the exemplary embodiments described therein.