The present invention relates to an optimization apparatus, an optimization method, and a program.
As a method for protecting the privacy of the data used during training from the output results of machine learning, a standard called differential privacy is often used. As a method for realizing differentially private machine learning, differentially private stochastic gradient descent (hereinafter referred to as DP-SGD), which is applicable to deep learning as well, has been widely used (for example, NPL 1).
In the DP-SGD, a restriction is imposed on the L2 norm of the gradient used in the calculation of the SGD (stochastic gradient descent method), and noise following a Gaussian distribution is added to the gradient, thereby guaranteeing differential privacy. In this case, in NPL 1, a method of clipping the calculated gradient is used as the way of restricting the L2 norm of the gradient.
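For reference, the per-example clipping step is commonly written in the following form, where C is the clipping threshold; this rendering is supplied only as an illustration of the standard formulation and does not reproduce any equation of NPL 1 verbatim.

```latex
\bar{g}_t(x_i, y_i) \;=\; \frac{g_t(x_i, y_i)}{\max\!\left(1,\; \dfrac{\lVert g_t(x_i, y_i)\rVert_2}{C}\right)}
```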
However, in NPL 1, since it is necessary to calculate the norm of the gradient for each piece of data when clipping the gradient, calculation may take time.
An embodiment of the present invention was made in view of the above points, and an object of the present invention is to reduce the calculation time of DP-SGD.
In order to achieve the object, an optimization apparatus according to one embodiment is an optimization apparatus for optimizing a function having a parameter, the optimization apparatus comprising: a sub-sampling unit that randomly samples a predetermined number of pieces of data from a given data set, to create a data aggregate consisting of the predetermined number of pieces of data; a gradient calculation unit that calculates a gradient, with respect to the parameter, of an l-Lipschitz continuous loss function for each of the pieces of data included in the data aggregate; a noise addition unit that adds noise according to a Gaussian distribution to the gradient to calculate a gradient after noise addition; and a parameter update unit that updates the parameter by using the gradient after the noise addition.
The calculation time of DP-SGD can be reduced.
An embodiment of the present invention will be described below. In the present embodiment, a DP-SGD that does not require gradient clipping and can therefore be calculated at high speed (hereinafter referred to as the high-speed DP-SGD) is obtained, and a private optimization apparatus 10 for executing this high-speed DP-SGD will be described. Since the high-speed DP-SGD does not require gradient clipping, the gradients to which noise is added can be calculated in parallel, and the calculation time can be reduced compared with the conventional DP-SGD.
The input is defined as a d-dimensional real vector x∈Rd, and the output is defined as a real number y∈R. The goal of machine learning is to determine, from a data set D={(xi, yi)|i=1, . . . , N} consisting of N input/output pairs, a parameter θ of a function f: Rd→R that minimizes a loss function L(f(x; θ), y). In other words, the following is obtained.
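A plausible rendering of this objective (referred to below as equation 1), reconstructed here from the definitions above because the original expression is not reproduced in this text, is:

```latex
\theta^{*} \;=\; \operatorname*{arg\,min}_{\theta} \sum_{i=1}^{N} L\bigl(f(x_i;\theta),\, y_i\bigr) \qquad \text{(equation 1)}
```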
In the following description, the input/output pairs (xi, yi) constituting the data set D are also referred to as records. SGD (the stochastic gradient descent method) is a calculation method for the above equation 1, in which θ is determined by repeating the following steps for each t.
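As a point of reference, the basic per-step update of SGD can be sketched as follows, where α is the learning rate and (xi, yi) is a randomly chosen record; this sketch is supplied for readability and is not a reproduction of the original expression.

```latex
\theta_{t+1} \;\leftarrow\; \theta_t - \alpha\, \nabla_{\theta} L\bigl(f(x_i;\theta_t),\, y_i\bigr)
```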
It should be noted that there is an approach called the mini-batch gradient descent method, which uses a plurality of pieces of data, rather than a single piece of data, for calculating the loss function L. In the following, the mini-batch gradient descent method and the stochastic gradient descent method are not particularly distinguished.
In the conventional DP-SGD, the following steps are performed at each iteration number t (also referred to as timing).
Sub-sampling: At timing t, S (<N) records are extracted from the data set D by random sampling, and an aggregate of the extracted records is taken as St={(xi, yi)|i=1, . . . , S}.
Calculation of gradient: For each i=1, . . . , S, the gradient gt(xi, yi) at the timing t is obtained as follows.
Here, θt is a parameter θ at the timing t.
Clipping of gradient: Clipping of the gradient is performed for each i=1, . . . , S as follows.
Noise addition: Noise is added in the following manner, to calculate the gradient ∇Lt at the timing t.
In this case, C∈R+ is the clipping parameter, I is the identity matrix, and N(⋅, ⋅) denotes the Gaussian distribution. As described above, the gradient clipping must be performed individually for each i=1, . . . , S. That is, the gradient clipping is the bottleneck when parallelizing the calculation.
Parameter update: The parameter is updated by θt+1←θt−α∇Lt.
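For concreteness, one conventional DP-SGD iteration combining the steps above can be sketched as follows. This is a minimal NumPy sketch under common DP-SGD conventions; grad_fn (a per-example gradient oracle), the noise multiplier sigma, and the 1/S averaging are assumptions introduced for illustration, since the corresponding equations are not reproduced in this text.

```python
import numpy as np

def dp_sgd_step(theta, data, grad_fn, S, C, sigma, alpha, rng):
    """One iteration of conventional DP-SGD (illustrative sketch)."""
    # Sub-sampling: draw S records at random from the data set D.
    idx = rng.choice(len(data), size=S, replace=False)

    clipped_sum = np.zeros_like(theta)
    for i in idx:
        x_i, y_i = data[i]
        # Calculation of gradient for each record.
        g_i = grad_fn(theta, x_i, y_i)
        # Clipping of gradient: rescale so that the L2 norm is at most C.
        g_i = g_i / max(1.0, np.linalg.norm(g_i) / C)
        clipped_sum += g_i

    # Noise addition: Gaussian noise whose scale is proportional to C.
    noisy_sum = clipped_sum + rng.normal(0.0, sigma * C, size=theta.shape)
    grad = noisy_sum / S

    # Parameter update.
    return theta - alpha * grad
```

Note that the per-record clipping inside the loop is exactly the part that forces per-example processing and hinders parallelization.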
The high-speed DP-SGD proposed in the present embodiment will be described. In the high-speed DP-SGD, the L2 norm of the gradient is inherently suppressed to a constant value or less, thereby eliminating the need for gradient clipping. For this purpose, the loss function L is required to be l-Lipschitz continuous (l is the lowercase letter L).
When the loss function L is l-Lipschitz continuous, the loss function L has the following property.
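This property is the standard consequence of Lipschitz continuity and can be written as follows; it is stated here for reference because the original expression is not reproduced in this text.

```latex
|L(u) - L(v)| \le l\,\lVert u - v \rVert_2 \ \ \text{for all } u, v
\quad\Longrightarrow\quad
\lVert \nabla L \rVert_2 \le l .
```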
Thus, since the norm of the gradient is bounded by the Lipschitz constant l, it is not necessary in the high-speed DP-SGD to perform clipping with the constant C, and only the noise addition is performed.
The following describes a configuration example in which the function f is made 1-Lipschitz continuous (that is, l=1) by applying spectral normalization, to constitute the high-speed DP-SGD. Here, it is assumed that the function f is a deep neural network (DNN) applied to a task such as a classification problem. For the spectral normalization, see, for example, the reference literature: Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
The DNN is composed of a combination of layer weights {W(i)|i=1, . . . , K} and (non)linear functions {φ(i)|i=1, . . . , K}. Here, φ(i) is also referred to as an activation function.
Specifically, the function f representing the DNN is constructed as follows.
For the function f to be 1-Lipschitz continuous, it is sufficient that the L2 norm of the input to the i-th layer is suppressed to 1, that is, that the input x(i) of the i-th layer satisfies ∥x(i)∥2≤1. However, the (non)linear functions φ(i) (where i=1, . . . , K−1) are limited to 1-Lipschitz continuous functions such as ReLU and Leaky ReLU.
In order to suppress the L2 norm of x(i) to 1, x′(i)=x(i)/∥x(i)∥2 may be used.
Therefore, by using x′(i) instead of x(i) and configuring the function f representing the DNN as follows, a 1-Lipschitz continuous function is obtained.
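As an illustration only, and not as the claimed implementation, the construction described above can be sketched in NumPy as follows; the function name lipschitz_forward and the choice of ReLU and tanh as activations are assumptions made for this sketch.

```python
import numpy as np

def lipschitz_forward(x, weights):
    """Sketch of the normalized DNN described above: each weight matrix is
    divided by its spectral norm, each layer input is rescaled to unit L2 norm
    (x'(i) = x(i) / ||x(i)||_2), and 1-Lipschitz activations are used."""
    for k, W in enumerate(weights):
        x = x / max(np.linalg.norm(x), 1e-12)   # input normalization x'(i)
        x = (W / np.linalg.norm(W, 2)) @ x      # spectral normalization of W(i)
        if k < len(weights) - 1:
            x = np.maximum(x, 0.0)              # ReLU (1-Lipschitz)
        else:
            x = np.tanh(x)                      # 1-Lipschitz final activation
    return x
```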
When processing is performed in batches, x is a matrix, and it is therefore sufficient to suppress the spectral norm to 1. That is, in this case, with ∥⋅∥ denoting the spectral norm, x′(i)=x(i)/∥x(i)∥ may be used. The spectral norm can be calculated at high speed by using, for example, the power method (power iteration).
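A minimal sketch of the power iteration mentioned above is given below; the iteration count and the random initialization are assumptions made for illustration.

```python
import numpy as np

def spectral_norm_power_iteration(W, n_iter=30, rng=None):
    """Estimate the spectral norm (largest singular value) of W by power
    iteration, avoiding a full singular value decomposition."""
    rng = rng or np.random.default_rng(0)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ (W @ v))
```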
Regarding the Lipschitz continuity of the nonlinear function φ(K) of the final layer: unlike ReLU, Leaky ReLU, and the like, φ(K) returns the output of the DNN. In the classification problem, tanh, sigmoid, softmax, and the like are used as φ(K), and they are all 1-Lipschitz continuous functions.
It is assumed that the function f representing the DNN is 1-Lipschitz continuous. In this case, by using, for example, the L2 loss, the L1 loss, or the hinge loss as the loss function L, the loss function L becomes 1-Lipschitz continuous. Since the cross entropy is not 1-Lipschitz continuous, the cross entropy cannot be used as the loss function L.
Thus, since the loss function L becomes 1-Lipschitz continuous, the above bound on the norm of the gradient is guaranteed. Therefore, since gradient clipping is not required, the calculation time can be reduced compared with the conventional DP-SGD by calculating in parallel the gradients to which noise is to be added. In this configuration example, a DNN is assumed as the function f, but this is merely an example, and any function can be used as long as the loss function L is l-Lipschitz continuous.
The algorithm of the high-speed DP-SGD at the iteration number t (timing t) will be described below. In the high-speed DP-SGD, the following steps are executed for each iteration number t.
Sub-sampling: At timing t, S (<N) records are extracted from the data set D by random sampling, and an aggregate of the extracted records is taken as St={(xi, yi)|i=1, . . . , S}.
Calculation of gradient: For each i=1, . . . , S, the gradient gt(xi, yi) at the timing t is obtained as follows.
Noise addition: Noise is added in the following manner, to calculate the gradient ∇Lt at the timing t.
Parameter update: The parameter is updated by θt+1←θt−α∇Lt.
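By way of illustration under the same assumptions as the earlier sketch (a hypothetical per-example gradient oracle grad_fn and a noise scale sigma), one iteration of the high-speed DP-SGD can be sketched as follows. Because no per-example clipping is needed, the per-record gradients can be computed fully in parallel; the Python loop below merely stands in for such a batched computation.

```python
import numpy as np

def fast_dp_sgd_step(theta, data, grad_fn, S, sigma, alpha, rng):
    """One iteration of the high-speed DP-SGD (illustrative sketch)."""
    # Sub-sampling.
    idx = rng.choice(len(data), size=S, replace=False)
    # Calculation of gradients: no clipping, so this is fully parallelizable.
    grads = np.stack([grad_fn(theta, *data[i]) for i in idx])
    # Noise addition: the Lipschitz bound plays the role of the threshold C.
    noisy_sum = grads.sum(axis=0) + rng.normal(0.0, sigma, size=theta.shape)
    grad = noisy_sum / S
    # Parameter update.
    return theta - alpha * grad
```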
Next, a hardware configuration of the private optimization apparatus 10 according to the present embodiment will be described.
The private optimization apparatus 10 according to the present embodiment includes, as hardware, an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a processor 105, and a memory device 106.
The input device 101 is, for example, a keyboard, a mouse, a touch panel, or the like. The display device 102 is, for example, a display or the like. Note that the private optimization apparatus 10 does not have to include, for example, at least one of the input device 101 and the display device 102.
The external I/F 103 is an interface with an external device such as a recording medium 103a. The private optimization apparatus 10 can read from and write to the recording medium 103a via the external I/F 103. Examples of the recording medium 103a include a CD (compact disc), a DVD (digital versatile disk), an SD (secure digital) memory card, a USB (universal serial bus) memory card, and the like.
The communication I/F 104 is an interface for connecting the private optimization apparatus 10 to a communication network. The processor 105 is, for example, various arithmetic devices such as a CPU (central processing unit) and a GPU (graphics processing unit). The memory device 106 is, for example, various storage devices such as a HDD (hard disk drive), an SSD (solid state drive), a RAM (random access memory), a ROM (read only memory), and a flash memory.
By having the hardware configuration described above, the private optimization apparatus 10 according to the present embodiment can implement the private optimization processing described later.
Next, a functional configuration of the private optimization apparatus 10 according to the present embodiment will be described.
As its functional configuration, the private optimization apparatus 10 according to the present embodiment includes a sub-sampling unit 201, a gradient calculation unit 202, a noise addition unit 203, a parameter update unit 204, and a termination determination unit 205.
The private optimization apparatus 10 according to the present embodiment also includes a storage unit 206. The storage unit 206 is implemented by, for example, the memory device 106. Note that the storage unit 206 may be implemented by using, for example, a storage device connected to the private optimization apparatus 10 via a communication network or the like.
The sub-sampling unit 201 extracts S (<N) records from the data set D by random sampling at each timing t, to create St={(xi, yi)|i=1, . . . , S}.
The gradient calculation unit 202 calculates the gradient gt(xi, yi) at the timing t for each i=1, . . . , S.
The noise addition unit 203 calculates the gradient ∇Lt at the timing t by adding noise according to the Gaussian distribution to the gradient gt(xi, yi).
The parameter update unit 204 updates the parameter by θt+1←θt−α∇Lt.
The termination determination unit 205 determines whether or not to terminate the iteration with respect to t.
The storage unit 206 stores various pieces of data (for example, the data set D, various parameters, etc.) necessary for executing the high-speed DP-SGD.
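Purely for illustration, the functional units described above could be arranged along the following lines; the class name, method names, and signatures below are assumptions made for this sketch and are not taken from the present disclosure.

```python
import numpy as np

class PrivateOptimizer:
    """Illustrative arrangement of the functional units 201 to 206."""

    def __init__(self, data, grad_fn, S, sigma, alpha, T, seed=0):
        self.data, self.grad_fn = data, grad_fn            # storage unit 206
        self.S, self.sigma, self.alpha, self.T = S, sigma, alpha, T
        self.rng = np.random.default_rng(seed)

    def subsample(self):                                   # sub-sampling unit 201
        return self.rng.choice(len(self.data), size=self.S, replace=False)

    def gradients(self, theta, idx):                       # gradient calculation unit 202
        return np.stack([self.grad_fn(theta, *self.data[i]) for i in idx])

    def add_noise(self, grads):                            # noise addition unit 203
        noise = self.rng.normal(0.0, self.sigma, size=grads.shape[1:])
        return (grads.sum(axis=0) + noise) / self.S

    def update(self, theta, grad):                         # parameter update unit 204
        return theta - self.alpha * grad

    def done(self, t):                                     # termination determination unit 205
        return t >= self.T
```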
The flow of the private optimization processing for learning the parameter θ of the function f(x; θ) by the high-speed DP-SGD will be described below.
The flow of the private optimization processing for a certain t (where t is an integer equal to or greater than 0) will be described below. It is assumed that a parameter θ0 at t=0 is initialized by an arbitrary method.
First, the sub-sampling unit 201 extracts S (<N) records from the data set D by random sampling, to create St={(xi, yi)|i=1, . . . , S} (step S101).
Then, the gradient calculation unit 202 calculates, for each i=1, . . . , S, the gradient gt(xi, yi) by the above equation 7 (step S102). As described above, the loss function L is l-Lipschitz continuous (in particular, 1-Lipschitz continuous, that is, l=1).
Next, the noise addition unit 203 adds noise to the gradient gt(xi, yi) according to the above equation 8, to calculate the gradient ∇Lt (step S103).
Then, the parameter update unit 204 updates the parameter by θt+1←θt−α∇Lt (step S104).
The termination determination unit 205 determines whether or not to terminate the iteration with respect to t (step S105). Here, the termination determination unit 205 may determine to terminate the iteration with respect to t when a predetermined termination condition is satisfied, and may determine not to terminate the iteration with respect to t when the termination condition is not satisfied. Note that such termination condition is, for example, that the number of iterations with respect to t reaches a predetermined number T, that the parameter θ converges, and the like.
When it is determined in step S105 that the iteration with respect to t is not to be terminated, the termination determination unit 205 updates t such that t←t+1 (step S106), and the processing returns to step S101. In this manner, steps S101 to S104 described above are repeatedly executed for each t.
On the other hand, when it is determined in step S105 that the iteration with respect to t is finished, the private optimization apparatus 10 ends the private optimization processing. In this case, the finally obtained parameter θt is a learned parameter.
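For completeness, the flow of steps S101 to S106 could be driven by a loop of the following shape, reusing the hypothetical PrivateOptimizer sketched earlier; this is illustrative only.

```python
def run_private_optimization(opt, theta0):
    """Iterate steps S101 to S106 until the termination condition holds."""
    theta, t = theta0, 0
    while True:
        idx = opt.subsample()              # step S101: sub-sampling
        grads = opt.gradients(theta, idx)  # step S102: gradient calculation
        grad = opt.add_noise(grads)        # step S103: noise addition
        theta = opt.update(theta, grad)    # step S104: parameter update
        if opt.done(t):                    # step S105: termination determination
            break
        t += 1                             # step S106: advance the timing t
    return theta                           # learned parameter
```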
As described above, the private optimization apparatus 10 according to the present embodiment can learn the parameter θ of the function f by a DP-SGD that does not require gradient clipping. Thus, since the gradients gt(xi, yi) to which noise is to be added can be calculated in parallel (in other words, it is not necessary to perform gradient clipping for each (xi, yi) after the gradients gt(xi, yi) are calculated), the parameter θ of the function f can be learned at high speed. As is apparent from the above equation 8, the high-speed DP-SGD guarantees differential privacy in the same manner as the conventional DP-SGD.
The present invention is not limited to the specifically disclosed embodiments, and various modifications, changes, combinations with known techniques, and the like can be made without departing from the scope of the claims.
Filing Document: PCT/JP2021/016783; Filing Date: 4/27/2021; Country: WO.