The exemplary embodiment relates to optimization of integral functions and finds particular application in connection with a system and method which use an adaptive weighted stochastic gradient descent approach to update a probability distribution function used for sampling data during the optimization.
Many problems involve optimizing a function (e.g., minimizing a cost function or maximizing a future reward) with respect to some parameters (such as the investment level in each of several assets). The function to be optimized is often an integral (for example, an expectation, i.e., the average over multiple possible outcomes). When this integral is intractable, i.e., when it is hard to compute, the optimization problem is itself difficult. Many real-world problems involve such intractable integrals, such as averages over combinatorial spaces, non-Gaussian integrals, and the like, and their solutions can be expressed as an integral of a function of a sampled value and a gradient of the function. Stochastic Gradient Descent (SGD) is an optimization technique that can optimize an integral without having to compute it. However, SGD is known to be slow to converge.
Stochastic approximation is a class of methods for solving intractable equations by using a sequence of approximate (and random) evaluations. See, H. Robbins and S. Monro, “A stochastic approximation method,” Annals of Mathematical Statistics, 22(3):400-407, 1951. Stochastic Gradient Descent (SGD) is a special type of stochastic approximation method that optimizes an integral using approximate gradient steps (see, Léon Bottou, “Online algorithms and stochastic approximations,” in Online Learning and Neural Networks (David Saad, Ed., CUP, 1998)). It has been shown that this technique is very useful in large scale learning tasks because it can provide good generalization properties with a small number of passes through the data (see, Léon Bottou and Olivier Bousquet, “The tradeoffs of large scale learning,” in Optimization for Machine Learning, pp. 351-368 (Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, Eds., MIT Press, 2011)).
The convergence properties of SGD algorithms are directly linked to the variance of the gradient estimate. A tradeoff between the variance of the gradient and the convergence speed can be obtained using batching (see, for example, M. Friedlander and M. Schmidt, “Hybrid deterministic-stochastic methods for data fitting,” UBC-CS technical report, TR-2011-01). However, with batching, the time required for every step increases with the size of the batch.
There remains a need for an improvement to the standard SGD algorithms for solving optimization problems efficiently, without introducing errors.
In accordance with one aspect of the exemplary embodiment, an optimization method includes receiving data to be sampled in optimizing a first function, which is an integral of a second function weighted according to a base probability distribution, the second function having at least one first parameter to be determined in the optimizing of the integral first function. For a number of iterations, the method includes sampling the data according to a modified probability distribution function which depends on at least one second parameter, updating at least one of the at least one first parameter of the second function, and updating at least one of the at least one second parameter of the modified probability distribution function. In a subsequent one of the iterations, the modified probability distribution function used in sampling the data is based on the updated at least one second parameter of the distribution function. Information is output, based on the at least one first parameter of the second function as updated in a later one of the iterations.
In accordance with another aspect of the exemplary embodiment, a system includes memory which receives data to be sampled in optimizing an integral function, which is an integral of a second function weighted according to a base probability distribution, the second function having at least one first parameter to be determined in the optimizing of the integral function. A sampling component samples the data according to a modified probability distribution function which depends on at least one second parameter. A first update component updates the at least one first parameter of the second function. A second update component updates the at least one second parameter of the modified probability distribution function. In a subsequent one of a number of iterations, the modified probability distribution function used by the sampling component for sampling the data is based on the updated second parameter of the modified probability distribution function. An information output component outputs information based on the at least one updated first parameter of the integral function of a later one of the iterations. A processor implements the first and second update components.
In accordance with another aspect of the exemplary embodiment, a method for optimizing a first function using adaptive weighted stochastic gradient descent is provided. The method includes receiving data to be sampled in optimizing a first function which is an integral of a second function depending on at least one first parameter in which elements are weighted by a probability distribution function. For a number of iterations, the method includes sampling the data according to a modified distribution function which depends on at least one second parameter, updating the at least one first parameter of the second function based on a first step size and an estimation of the gradient of the second function, and updating at least one of the at least one second parameter of the distribution function based on a second step size and an estimation of the variance of the gradient of the second function. In a subsequent one of the iterations, the distribution function used in sampling the data is based on the updated at least one second parameter of the distribution function. Information is output, based on the at least one first parameter of the second function, as updated in a later one of the iterations.
The exemplary system and method provide an improvement over a standard Stochastic Gradient Descent (SGD) algorithm for optimizing an integral (first) function, which is an integral of a second function of a sampled value and a gradient of the second function, over a space of values. To reduce the variance of the gradient, an optimal sampling distribution is learned using a parallel SGD algorithm that learns how to minimize the variance of the gradient of the second function.
In particular, the method accelerates an SGD algorithm for learning the parameters of the second function by learning a weight function that minimizes the variance of the approximate gradient involved in each step of the algorithm, while optimizing the function. While there is an extra computational cost at each iteration to learn the weight function, this cost is constant across iterations and may be compensated for by the increased convergence speed.
With reference to
The various hardware components 14, 18, 20, 22, 24 of the computer 10 may be all connected by a bus 38. The processor 14 executes the instructions 16 for performing the exemplary method outlined in
The computer system 10 may be a PC, such as a desktop, a laptop, palmtop computer, portable digital assistant (PDA), server computer, cellular telephone, tablet computer, pager, combination thereof, or other computing device capable of executing instructions for performing the exemplary method.
The memory 18, 20 may represent any type of non-transitory computer readable medium such as random access memory (RAM), read only memory (ROM), magnetic disk or tape, optical disk, flash memory, or holographic memory. In one embodiment, the memory 18, 20 comprises a combination of random access memory and read only memory. In some embodiments, the processor 14 and memory 18 and/or 20 may be combined in a single chip.
The digital computer processor 14 can be variously embodied, such as by a single-core processor, a dual-core processor (or more generally by a multiple-core processor), a digital processor and cooperating math coprocessor, a digital controller, or the like.
The exemplary system includes one or more software components 40, 42, 44, 46 for implementing an algorithm for optimizing the integral function as described below. These components may include a sampling component 40, a function parameter update component (or first update component) 42, a probability distribution function update component (or second update component) 44, and an information output component 46, which are best understood with reference to the method, described below. These components use the input data 12 as well as input or computed step sizes 50 to optimize parameters 52 of the integral function 54 stored in memory that has a set of predefined fixed parameters 55 and, in the process, to optimize parameters of a weighted distribution function 56 used by the sampling component 40 for selecting samples of the data 12 to be used in a next iteration by the function parameter update component 42.
The term “software,” as used herein, is intended to encompass any collection or set of instructions executable by a computer or other digital system so as to configure the computer or other digital system to perform the task that is the intent of the software. The term “software” as used herein is intended to encompass such instructions stored in storage medium such as RAM, a hard disk, optical disk, or so forth, and is also intended to encompass so-called “firmware” that is software stored on a ROM or so forth. Such software may be organized in various ways, and may include software components organized as libraries, Internet-based programs stored on a remote server or so forth, source code, interpretive code, object code, directly executable code, and so forth. It is contemplated that the software may invoke system-level code or calls to other software residing on a server or other location to perform certain functions.
As will be appreciated,
At S102, a set of data points 12 to be sampled is received into memory (or simply a sample space from which the data points are to be sampled). The data points x can each be a scalar value or a vector comprising two or more values. If not already stored, a function γ(θ) (54) to be optimized on the data points x (12) is also received into memory. The integral function is defined in terms of a second function ƒ, which is a function of x, one or more variable parameters θ (52), and a set of constant parameters (55), where the weight of each value x in the function is determined according to a base probability distribution P.
For each of a number of iterations, steps S104, S106, and S108 are performed, e.g., in sequence. The number of iterations may be at least 2, or at least 10, or at least 100, such as at least 1000.
At S104, a sampling step is performed (by the sampling component 40), by drawing a data point x from the set X (12) of data points, where the sampling is based on a modified probability distribution (function) Q, with respect to which the base probability distribution P is absolutely continuous, and which differs from P in its parameters τ (56).
At S106, parameters θ (52) of the function ƒ (54) to be optimized are updated (by the function parameter update component 42), through a stochastic gradient descent step with a magnitude specified by a step size parameter ηt (50).
At S108, parameters τ (56) of the probability distribution function Q are updated (by the probability distribution function update component 44), also through a stochastic gradient descent step, of magnitude εt, in the direction of a smaller variance for the approximate gradient. In the next iteration, the data point sampled at S104 is drawn based on the updated probability distribution function Q, defined by parameters τ. Depending on the modification to Q, this can make the drawn samples more likely to yield gradient estimates close to the exact gradient than in the prior iteration, which expedites convergence in a manner that generally outweighs the cost of the additional computation time which S108 entails.
At S110, if convergence or other stopping criterion has not been reached, the method returns to S104 for the next iteration, otherwise to S112, where information is output (by the information output component 46). The information may include one or more of the value of the optimized function, its parameters, and information based thereon, derived from the most recent (or one of the most recent) iterations. The method ends at S114.
The method illustrated in
Alternatively, the method may be implemented in transitory media, such as a transmittable carrier wave in which the control program is embodied as a data signal using transmission media, such as acoustic or light waves, such as those generated during radio wave and infrared data communications, and the like.
The exemplary method may be implemented on one or more general purpose computers, special purpose computer(s), a programmed microprocessor or microcontroller and peripheral integrated circuit elements, an ASIC or other integrated circuit, a digital signal processor, a hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA, graphics processing unit (GPU), or PAL, or the like. In general, any device capable of implementing a finite state machine that is in turn capable of implementing the flowchart shown in
As will be appreciated, the steps of the method need not all proceed in the order illustrated and fewer, more, or different steps may be performed.
In the following, the terms “optimization,” “minimization,” and similar phraseology are to be broadly construed as one of ordinary skill in the art would understand these terms. For example, these terms are not to be construed as being limited to the absolute global optimum value, absolute global minimum, and so forth. For example, minimization of a function may employ an iterative minimization algorithm that terminates at a stopping criterion before an absolute minimum is reached. It is also contemplated for the optimum or minimum value to be a local optimum or local minimum value.
Stochastic Gradient Descent
First, a summary of the stochastic gradient descent method is provided by way of background.
In general, as in existing stochastic gradient methods, the exemplary method allows optimization, e.g., minimization, of integral functions of the general form:
γ(θ) = ∫_X ƒ(x;θ) dP(x) (1)
where γ(θ) represents a possible solution,
dP(x) represents the weight applied to a sample x, where P is a probability distribution on the measurable space X from which all data samples x are drawn, and
ƒ is a function from X×Θ into ℝ, where ℝ is the set of real numbers and θ is a parameter from a set Θ of one or more variable parameters (52) to be learned. The objective is to identify the values of the parameters θ which provide an optimal value of γ. A variety of different functions can be used as the function ƒ, examples of which are given below.
Stochastic Gradient Descent (SGD) is a stochastic approximation method which involves performing approximate gradient steps that are equal, on average, to the true gradient ∇_θ γ(θ) of the function γ(θ) given the distribution P, i.e., its derivative.
In some applications disclosed herein, such as logistic regression learning, the function ƒ is the log-likelihood and P is an empirical distribution:

P(x) = (1/n) Σ_{i=1}^{n} δ(x, x_i),

where {x_1, . . . , x_n} is a set of i.i.d. (independently, identically distributed) data points sampled from an unknown distribution and δ(x, x_i) represents the Dirac distribution at position x_i (see, Léon Bottou, “Online algorithms and stochastic approximations,” in Online Learning and Neural Networks (David Saad, Ed., CUP, 1998)).
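For illustration, a minimal Python sketch of what ƒ and its gradient could look like in the logistic regression case; the (x_i, y_i) pair encoding and the labels y_i ∈ {−1, +1} are assumptions of this sketch, not specified above:

    import numpy as np

    def f(point, theta):
        # negative log-likelihood of one labeled point (x_i, y_i)
        x, y = point
        return np.log1p(np.exp(-y * np.dot(theta, x)))

    def grad_f(point, theta):
        # gradient of the negative log-likelihood with respect to theta
        x, y = point
        return -y * x / (1.0 + np.exp(y * np.dot(theta, x)))

Sampling a point uniformly from the training set then corresponds to drawing from the empirical distribution P.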
The conventional SGD method involves a sequence of iterations (steps). At a given step t, the conventional SGD method can be viewed as a two-step procedure:

1. sample an element x_t in X according to the base distribution P;
2. do an approximate gradient step: θ_{t+1} = θ_t − η_t ∇_θ ƒ(x_t; θ_t),

where η_t represents the gradient step size (a learning rate); and ∇_θ ƒ(x_t; θ_t) represents the gradient of ƒ at the sampled point x_t.
One problem lies in choosing the gradient step size. If η_t is too big, the iterations do not tend to converge. If η_t is too small, convergence may be very slow. In practice, when the number of unique elements in the support of P is small, the elements are not sampled i.i.d., but are explored one by one by making multiple passes through the data. For simplicity, only the random version of SGD is considered here, bearing in mind that a more ordered sampling may be employed.
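As a concrete illustration, a minimal Python sketch of the two-step procedure, assuming the 1D objective γ(θ) = ∫ log(1+exp(θx)) N(x) dx used in the Examples below, with P the standard normal distribution and a fixed step size (both assumptions of this sketch):

    import numpy as np

    rng = np.random.default_rng(0)

    def grad_f(x, theta):
        # gradient of f(x; theta) = log(1 + exp(theta * x)) w.r.t. theta
        return x / (1.0 + np.exp(-theta * x))

    theta, eta = 0.5, 0.05
    for t in range(10000):
        x = rng.standard_normal()         # step 1: sample x_t according to P
        theta -= eta * grad_f(x, theta)   # step 2: approximate gradient step

With a suitably decreasing η_t, θ approaches the minimizer of γ; with a fixed η_t, it fluctuates around it.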
Weighted Stochastic Gradient Descent
Similarly to importance sampling, it can be shown that it is not necessary to sample the elements according to the base distribution P. Rather, any distribution Q ≫ P can be used. The notation Q ≫ P means that P is absolutely continuous with respect to Q: ∀x∈X, Q(x)=0 → P(x)=0. This means that wherever Q=0, P=0 as well.
Denoting q as the density dQ/dP of Q with respect to P, the function γ(θ) can be written as:

γ(θ) = ∫_X f̃(x;θ) dQ(x), where f̃(x;θ) = ƒ(x;θ)/q(x). (2)
The Weighted SGD algorithm has the following form. At each iteration:
1. sample an element x_t in X probabilistically, according to the distribution Q;
2. do an approximate gradient step: θ_{t+1} = θ_t − η_t ∇_θ f̃(x_t; θ_t);
i.e., in the weighted SGD, the update is based on the gradient of the weighted function f̃ rather than of the function ƒ itself.
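A minimal sketch of one weighted SGD iteration, assuming P = N(0,1), a Gaussian proposal Q = N(μ,σ²), and the same 1D objective as above (all assumptions of this illustration); here q is computed as the ratio of the two densities:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def grad_f(x, theta):
        return x / (1.0 + np.exp(-theta * x))

    theta, eta = 0.5, 0.05
    mu, sigma = 0.5, 1.2                     # parameters of the proposal Q
    x = rng.normal(mu, sigma)                # 1. sample x_t according to Q
    q = norm.pdf(x, mu, sigma) / norm.pdf(x, 0.0, 1.0)   # q = dQ/dP at x_t
    theta -= eta * grad_f(x, theta) / q      # 2. gradient step on f-tilde

The division by q(x_t) is what makes the step unbiased for the true gradient (see Eqn. (3) below).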
Variance of the Approximate Gradients
By construction, unbiased estimates of the gradient, ∇_θ f̃(x_t; θ_t), can be obtained for any choice of the weighting distribution Q, i.e.,

E_Q[∇_θ f̃(x_t; θ_t)] = ∇_θ γ(θ), (3)

where E_Q is the expectation under Q.
The efficiency of different sampling distributions Q can be compared based on the variance Σ_Q(θ) of their gradient estimates:

Σ_Q(θ) = E_Q[∇_θ f̃(x;θ) ∇_θ^T f̃(x;θ)] − ∇_θ γ(θ) ∇_θ^T γ(θ) (4)

= E_Q[(∇_θ ƒ(x;θ)/q(x)) (∇_θ ƒ(x;θ)/q(x))^T] − ∇_θ γ(θ) ∇_θ^T γ(θ), (5)

where ∇_θ^T denotes the transpose of the gradient ∇_θ.
Now, this formula can be used to choose the optimal sampling distribution Q* having the variance Σ_Q(θ) of smallest magnitude, e.g., the one minimizing the trace of the variance:

tr Σ_Q(θ) = E_Q[‖∇_θ ƒ(x;θ)/q(x)‖²] − ‖∇_θ γ(θ)‖², (6)

which gives:

Q*(θ) = argmin_{Q≫P} E_Q[‖∇_θ ƒ(x;θ)/q(x)‖²], (7)

i.e., the quantity minimized is the expectation under Q of the squared norm of the weighted gradient, which depends on the gradient ∇_θ ƒ(x;θ) and the density q(x).
Equation (7) is obtained by noting that the second term of Eqn. (6), ‖∇_θ γ(θ)‖², does not depend on Q. Ideally, it would be desirable to compute the best possible proposal Q* at each iteration, i.e., for each different value of θ_t: that is, to select the distribution Q, on the basis of which the next sample x_t is drawn, which minimizes the variance of the gradient estimate. However, this is not necessary, and in the exemplary method an approximation can be performed, as follows.
Adaptive Weighted Stochastic Gradient Descent
It can be demonstrated that instead of solving Equation (7) exactly, a gradient step can be performed at each iteration toward lower variance estimates.
Assume that the proposed probability distribution (from a family of distributions under consideration) is Q_τ, where τ∈T is a set of parameters for the proposed probability distribution function. For example, Q_τ could be a Gaussian distribution N(μ, Λ⁻¹) with parameters τ=(μ, Λ)∈ℝ^d×S₊, where μ is the mean and Λ⁻¹ is the covariance of the Gaussian distribution. S₊ represents the set of positive definite matrices, and the symbol × denotes the Cartesian product.
The corresponding probability distribution function (pdf) of the proposed distribution Q_τ at a point x is denoted q(x; τ). This function is assumed to be tractable, i.e., one can sample from Q_τ and compute q(x; τ) efficiently. The variance term also depends on τ:

Σ(θ, τ) = E_{Q_τ}[∇_θ f̃(x;θ) ∇_θ^T f̃(x;θ)] − ∇_θ γ(θ) ∇_θ^T γ(θ). (8)
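As one concrete parameterization (an assumption of this sketch, chosen so that τ is unconstrained), a 1D Gaussian proposal with τ = (μ, log σ) yields a tractable q(x;τ) and score ∇_τ log q(x;τ):

    import numpy as np

    def log_q(x, mu, log_sigma):
        # log-density of Q_tau = N(mu, exp(log_sigma)^2) at x
        s = np.exp(log_sigma)
        return -0.5 * ((x - mu) / s) ** 2 - log_sigma - 0.5 * np.log(2 * np.pi)

    def score(x, mu, log_sigma):
        # gradient of log q(x; tau) with respect to tau = (mu, log_sigma)
        s = np.exp(log_sigma)
        z = (x - mu) / s
        return np.array([z / s, z ** 2 - 1.0])

Parameterizing the scale by log σ keeps the updated τ in the feasible set without any projection step.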
An algorithm for performing stochastic gradient descent using the adaptive weighting method is shown in Algorithm 1.

Algorithm 1 (Adaptive Weighted SGD):
Initialize θ_0 and τ_0; set t=0.
Repeat until a stopping criterion is met:
(1) sample x_t ~ Q_{τ_t};
(2) θ_{t+1} = θ_t − η_t ∇_θ ƒ(x_t; θ_t)/q(x_t; τ_t);
(3) τ_{t+1} = τ_t + ε_t ‖∇_θ ƒ(x_t; θ_t)/q(x_t; τ_t)‖² ∇_τ log q(x_t; τ_t);
(4) t ← t+1.
Algorithm 1 can be summarized as follows. The algorithm is initialized with initial values θ_0 for the set of parameters Θ of the second function ƒ to be optimized. The initial values can be selected randomly, based on experimental data, or all set to some default value. The number t of iterations performed is initialized to 0. Initial values for the set of parameters τ of the modified probability distribution Q are also established, e.g., by selecting the mean and variance of any suitable Gaussian distribution with respect to which P is absolutely continuous.
In each iteration t, steps (1)-(4) of the algorithm are performed. Step (1) involves sampling a data point according to the current distribution Q_{τ_t}, whose parameters τ_t were established in the prior iteration (S104), i.e., drawing a random variable x_t ~ Q_{τ_t}, which is a function of the current value of τ_t.
Step (2) involves updating the parameters Θ of the function ƒ as a function of the current values θ_t of these parameters, a gradient step size η_t, and the gradient of f̃, i.e.,

θ_{t+1} = θ_t − η_t ∇_θ ƒ(x_t; θ_t)/q(x_t; τ_t) (S106).

The quantity ∇_θ f̃(x_t; θ_t) = ∇_θ ƒ(x_t; θ_t)/q(x_t; τ_t) is the weighted gradient of the second function, computed as a function of θ_t and x_t, where the weight is the inverse of q(x_t; τ_t), so that its expectation is equal to the gradient of the first, integral function. Here, q(x_t; τ_t) is the probability of drawing x_t according to the modified probability distribution function Q_{τ_t}. This step depends on τ_t in order to apply lower weights to samples with high probability under Q, which guarantees that the gradient step in (2) is, on average, equal to the gradient of the target function γ.
Step (3) involves updating the parameters τ of the distribution Q as a function of a step size ε_t (a learning rate for τ) and a stochastic estimate of the gradient of the variance in Eqn. (6), i.e., the squared norm of the ratio of the gradient of the function ƒ to q, multiplied by the gradient of the log of q:

τ_{t+1} = τ_t + ε_t ‖∇_θ ƒ(x_t; θ_t)/q(x_t; τ_t)‖² ∇_τ log q(x_t; τ_t).

This step aims at optimizing τ rather than θ. By changing the parameters τ, the sampling distribution Q is changed for the next iteration. As for step (2), this is also a stochastic gradient approximation step, rather than an exact computation (S108).
Step (4) involves a standard step of incrementing the iteration number to proceed to the next iteration.
Steps (1)-(4) are continued until a stopping criterion is met (S110). The iterations may be continued while γ(θ) continues to converge. However, since γ(θ) is not computed in the example algorithm, the iterations may instead be continued until a pre-defined, fixed number of iterations is reached. It can be shown mathematically that Algorithm 1 does in fact converge, rather than merely reaching a local optimum, for a wide variety of cases. In other embodiments, the iterations are continued while τ continues to change by more than a threshold amount, or the like.
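The following is a minimal, self-contained Python sketch of Algorithm 1; the 1D objective from the Examples below, the base distribution P = N(0,1), the Gaussian proposal with τ = (μ, log σ), and the fixed step sizes are all assumptions of this illustration, not prescriptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def grad_f(x, theta):
        # gradient of f(x; theta) = log(1 + exp(theta * x)) w.r.t. theta
        return x / (1.0 + np.exp(-theta * x))

    def log_normal_pdf(x, mu, sigma):
        return (-0.5 * ((x - mu) / sigma) ** 2
                - np.log(sigma) - 0.5 * np.log(2 * np.pi))

    theta = 0.5                      # theta_0
    mu, log_sigma = 0.0, 0.0         # tau_0: proposal Q = N(0, 1), i.e., Q = P
    eta, eps = 0.05, 0.001           # step sizes eta_t, epsilon_t (held fixed;
                                     # eps is kept small for stability)

    for t in range(20000):
        sigma = np.exp(log_sigma)
        x = rng.normal(mu, sigma)                    # (1) sample x_t ~ Q_tau
        # q(x; tau) = dQ_tau/dP, the density ratio against P = N(0, 1)
        q = np.exp(log_normal_pdf(x, mu, sigma) - log_normal_pdf(x, 0.0, 1.0))
        g = grad_f(x, theta) / q                     # weighted gradient
        theta -= eta * g                             # (2) update theta
        z = (x - mu) / sigma
        score = np.array([z / sigma, z * z - 1.0])   # grad_tau log q(x; tau)
        tau = np.array([mu, log_sigma]) + eps * (g * g) * score
        mu, log_sigma = tau                          # (3) update tau
        # (4) t is incremented by the loop

    print(theta, mu, np.exp(log_sigma))

Note the sign in step (3): moving τ along +‖g‖² ∇_τ log q decreases the variance, since the gradient of the variance is, on average, the negative of this quantity (see Eqn. (9) below).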
The step sizes η_t and ε_t can be fixed or variable (for example, decreasing as more iterations are performed). For example, η_t could be set to a fixed value of, for example, from 0.0001 to 0.2, or could be set as the minimum of a fixed value and a decreasing function of the number of iterations. In the latter case, η_t decreases as a function of the number of iterations, but only begins to decrease after a predefined number (e.g., 500) of iterations.
Similarly, ε_t could be set to a fixed value of, for example, from 0.0001 to 0.2, or could be set as the minimum of a fixed value and a decreasing function of the number of iterations, such that ε_t decreases, but only after a predefined number (e.g., 1000) of iterations.
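One schedule consistent with this description (the exact functional form is an assumption of this sketch) holds the step size constant for the first t0 iterations and then decays it as O(1/t):

    def step_size(t, base=0.1, t0=500):
        # constant for t <= t0, then decays as base * t0 / t
        return min(base, base * t0 / max(t, 1))

With t0 = 500 this reproduces the behavior described for η_t; t0 = 1000 gives the behavior described for ε_t.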
It can be shown that the gradient steps in the update of τ are noisy (but unbiased) estimates of the gradient of the variance:

∇_τ E_{Q_τ}[‖∇_θ ƒ(x;θ)/q(x;τ)‖²] = −E_{Q_τ}[‖∇_θ ƒ(x;θ)/q(x;τ)‖² ∇_τ log q(x;τ)]. (9)
The exemplary method avoids the need to compute Eqn. (9), by using step (3) of the algorithm.
To illustrate the sampling step (S104), four choices of a Gaussian sampling distribution Q, applied to the same base distribution P=N(1,1) and the same value θ=0.5, are plotted:

Plot 1: mean=1, variance=1;

Plot 2: mean=1, variance=1.3²;

Plot 3: mean=1.3, variance=1;

Plot 4: mean=1.3, variance=1.3².
The gradient g = ∇_θ ƒ(x,θ) p(x)/q(x) for each possible choice of x is also shown for each plot. The expected gradient ∇_θ γ(θ) appears as a horizontal line. Formally, it is equal to the integral of N(x|1.0,1.0)·(θ−x) over the real line. In the example, with θ=0.5, it is equal to −0.5, as shown in the plots. In the general case of stochastic gradient descent, this gradient is not known; instead, an approximate gradient is used (four different choices of Q are shown in the figure), and the estimated gradient g satisfies E_Q[g]=−0.5, where the expectation is taken with respect to Q. Since the best possible choice of Q is the one that gives the smallest variance of the gradient, the variance Var_Q[g] is computed and shown in the title of each plot. It can be seen that the choice q(x)=N(x|1.3,1.3²) leads to a variance equal to 0.77, which is smaller than for the other choices of Q. For this value of θ (θ=0.5), the best distribution Q is thus not equal to the distribution P, since P gives a higher variance (Var_P[g]=1).
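This comparison can be replicated with a short Monte Carlo computation; reading each plot's parameters as (mean, standard deviation) for Q is an assumption of this sketch, and the printed values are estimates rather than the exact figures above:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    theta = 0.5
    p = norm(1.0, 1.0)                           # base distribution P = N(1, 1)

    for m, s in [(1.0, 1.0), (1.0, 1.3), (1.3, 1.0), (1.3, 1.3)]:
        q = norm(m, s)                           # candidate proposal Q
        x = q.rvs(200000, random_state=rng)
        g = (theta - x) * p.pdf(x) / q.pdf(x)    # weighted gradient samples
        print(f"Q=N({m},{s}^2): E[g]={g.mean():+.3f}, Var[g]={g.var():.3f}")

Each E[g] estimate is close to −0.5 regardless of Q, while Var[g] varies with the choice of Q.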
The exemplary approach is similar to adaptive importance sampling using stochastic gradient. See, for example, W. A. Al-Qaq, M. Devetsikiotis, and J.-K. Townsend, “Stochastic gradient optimization of importance sampling for the efficient simulation of digital communication systems,” IEEE Transactions on Communications, 43(12):2975-2985, December 1995; W. A. Al-Qaq and J. K. Townsend, “A stochastic importance sampling methodology for the efficient simulation of adaptive systems in frequency nonselective Rayleigh fading channels,” IEEE Journal on Selected Areas in Communications, 15(4):614-625, May 1997. In these adaptive importance sampling methods, the optimal importance distribution is learned by minimizing the variance of the sampling distribution using a stochastic approximation procedure, but it is not used to minimize a function of the form of Eqn. (1).
Adaptations to the Method
1. Batching
It is to be appreciated that Algorithm 1 can be adapted to batching (sampling multiple data points at each iteration instead of a single one in step (1)) to further reduce the variance of the gradient.
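A batched variant of steps (1)-(2) might look as follows (a sketch; the helper callables are assumed to be supplied by the surrounding program):

    import numpy as np

    def batched_weighted_grad(theta, B, sample_q, grad_f, log_p, log_q):
        # sample B points from Q and average their importance-weighted gradients
        xs = sample_q(B)
        w = np.exp(log_p(xs) - log_q(xs))     # density ratios p(x)/q(x)
        return np.mean(w * grad_f(xs, theta))

Averaging B i.i.d. weighted gradients divides the variance of the step by B, at a per-iteration cost that grows with B, consistent with the batching tradeoff noted above.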
2. Scaling
To obtain good convergence properties, the gradient can be scaled, ideally in the direction of the inverse Hessian. However, this type of second-order method is slow in practice. One can instead estimate the Hessian greedily, as in quasi-Newton methods such as L-BFGS, and adapt it to the SGD algorithm, similarly to the method of Friedlander and Schmidt.
Applications
There are many applications of SGD in solving data analytics problems. As examples, the exemplary method can be used in the following problems:
1. Large scale learning. These include classical applications of SGD. See, for example, those exemplified in Léon Bottou, “Online algorithms and stochastic approximations,” in Online Learning and Neural Networks (David Saad, Ed., Cambridge University Press, Cambridge, UK, 1998); Léon Bottou and Olivier Bousquet, “The tradeoffs of large scale learning,” in Optimization for Machine Learning, pp. 351-368 (Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, Eds., MIT Press, 2011); and M. Friedlander and M. Schmidt, “Hybrid deterministic-stochastic methods for data fitting,” UBC-CS technical report, TR-2011-01.
2. Optimization under uncertainty (e.g., stock management).
3. Parameter estimation in latent variable models: any application where an expectation maximization/minimization (EM) algorithm is used. The likelihood can be expressed as ∫_Z p(z|θ)p(x|z,θ)dz. The quantity p(z|θ)p(x|z,θ) = p(x,z|θ), viewed as the function ƒ, is known as the complete likelihood.
4. Variational inference. SGD can be used to minimize the variational objective, which is sometimes intractable.
As one example, the method can be used to estimate an average future reward, e.g., which is expressed as an integral of all future rewards. Another example is to estimate the minimum cost of stocks.
Without intending to limit the scope of the exemplary embodiment, the following examples demonstrate the applicability of the method.
The exemplary algorithm can work on any integrable function, and should be particularly effective on high-dimensional problems, but as an illustration, the algorithm is tested on a 1D integral with one parameter θ:

γ(θ) = ∫_X log(1+exp(θx)) N(x) dx,

where N(x) is the standard normal distribution (density). The integral function γ is convex and therefore has a unique minimum.
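The reference optimum against which errors are measured (see the Table 1 discussion below) can be computed by numerical quadrature; the integration limits and the optimizer used here are assumptions of this sketch:

    import numpy as np
    from scipy.integrate import quad
    from scipy.optimize import minimize_scalar

    def gamma(theta):
        # gamma(theta) = integral of log(1 + exp(theta*x)) * N(x; 0, 1) dx
        integrand = lambda x: (np.log1p(np.exp(theta * x))
                               * np.exp(-0.5 * x * x) / np.sqrt(2 * np.pi))
        return quad(integrand, -10.0, 10.0)[0]

    opt = minimize_scalar(gamma, bounds=(-5.0, 5.0), method="bounded")
    print(opt.x, opt.fun)   # numerically computed minimizer and minimum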
A conventional SGD algorithm was run using the sampling distribution N(x) and a learning rate optimized for this algorithm. Then, Algorithm 1 (adaptive weighted SGD) was run using its own tuned learning rates η_t and ε_t.
Table 1 shows the results of numerical experiments averaged over 100 repetitions. The error is the distance from the optimal point, computed numerically (using quadrature). AW-SGD is the exemplary adaptive weighted SGD algorithm; SGD is the conventional stochastic gradient descent method outlined above (i.e., without step (3) of Algorithm 1). The values in parentheses are standard deviations. It can be seen that the error decreases by about 50%, at the cost of only a 15-30% increase in computation time for the same number of iterations.
The standard deviation is lower with the present AW-SGD method, indicating that the method provides less variability in results, which is advantageous.
It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may be combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.
References Cited

U.S. Patent Documents:

U.S. Pat. No. 6,381,586 B1, Glasserman et al., Apr. 2002.

Other Publications:

Zinkevich, “Online Convex Programming and Generalized Infinitesimal Gradient Ascent,” CMU-CS-03-110, 2003, 28 pages.

Al-Qaq, et al., “Stochastic gradient optimization of importance sampling for the efficient simulation of digital communication systems,” IEEE Transactions on Communications, 43(12):2975-2985, Dec. 1995.

Al-Qaq, et al., “A stochastic importance sampling methodology for the efficient simulation of adaptive systems in frequency nonselective Rayleigh fading channels,” IEEE Journal on Selected Areas in Communications, 15(4):614-625, May 1997.

Bottou, et al., “The tradeoffs of large scale learning,” Optimization for Machine Learning, MIT Press, 2011, pp. 351-368.

Bottou, “Online algorithms and stochastic approximations,” Online Learning and Neural Networks, 1998, pp. 1-34.

Friedlander, et al., “Hybrid deterministic-stochastic methods for data fitting,” UBC-CS technical report, TR-2011-01, Jan. 5, 2012, http://www.arxiv.org/abs/1104.2373v3.

Robbins, et al., “A stochastic approximation method,” Annals of Mathematical Statistics, 1951, 22(3):400-407.