Embodiments of the present disclosure relate generally to training machine learning models. More specifically, embodiments of the present disclosure relate to machine learning models that may be trained using parallel stochastic gradient descent algorithms with linear and non-linear activation functions.
In Stochastic Gradient Descent (“SGD”), a machine learning model training system operates iteratively to determine values of the model parameters by finding a minimum of an objective function of parameters of the model. SGD is an effective method for many machine learning problems. SGD is an algorithm with few hyper-parameters, and SGD's convergence rates are well understood both theoretically and empirically. However, SGD performance scalability may be severely limited by its inherently sequential computation. SGD iteratively processes an input dataset, in which each iteration's computation depends on model parameters learned from a previous iteration.
Current approaches for parallelizing SGD allow for learning local models per thread, and then combining these models in ways that do not honor this inter-step dependence. For example, current approaches to parallelize SGD, such as HOGWILD!, apply or combine the threads' work without accounting for this dependence.
While these algorithms in current approaches may be guaranteed to eventually converge, the current approaches need to carefully manage the communication-staleness trade-off. For example, HOGWILD! relies on threads frequently communicating updates to a shared model; communicating less frequently reduces this cost but increases the staleness of the models each thread works with.
Thus, there exists a need to provide a parallel SGD algorithm that allows threads to communicate less frequently, but also achieve a high-fidelity approximation to what the threads would have produced had they run sequentially.
According to certain embodiments, systems, methods, and computer-readable media are disclosed for parallel stochastic gradient descent with linear and non-linear activation functions.
According to certain embodiments, computer-implemented methods for parallel stochastic gradient descent using linear and non-linear activation functions are disclosed. One method includes: receiving a set of input examples, the set of examples including a plurality of vectors of feature values and a corresponding label to learn; receiving a global model, the global model used to compute a plurality of local models based on the set of input examples; and learning a new global model based on the global model and the set of input examples by iteratively performing the following steps: computing a plurality of local models having a plurality of model parameters based on the global model and at least a portion of the set of input examples; computing, for each local model, a corresponding model combiner based on the global model and at least a portion of the set of input examples; and combining the plurality of local models into the new global model based on the current global model and the plurality of corresponding model combiners.
According to certain embodiments, systems for parallel stochastic gradient descent using linear and non-linear activation functions are disclosed. One system including: a data storage device that stores instructions for parallel stochastic gradient descent using linear and non-linear activation functions; and a processor configured to execute the instructions to perform a method including: receiving a set of input examples, the set of examples including a plurality of vectors of feature values and a corresponding label to learn; receiving a global model, the global model used to compute a plurality of local models based on the set of input examples; and learning a new global model based on the global model and the set of input examples by iteratively performing the following steps: computing a plurality of local models having a plurality of model parameters based on the global model and at least a portion of the set of input examples; computing, for each local model, a corresponding model combiner based on the global model and at least a portion of the set of input examples; and combining the plurality of local models into the new global model based on the current global model and the plurality of corresponding model combiners.
According to certain embodiments, non-transitory computer-readable media storing instructions that, when executed by a computer, cause the computer to perform a method for parallel stochastic gradient descent using linear and non-linear activation functions are disclosed. One computer-readable medium storing instructions that, when executed by a computer, cause the computer to perform the method including: receiving a set of input examples, the set of examples including a plurality of vectors of feature values and a corresponding label to learn; receiving a global model, the global model used to compute a plurality of local models based on the set of input examples; and learning a new global model based on the global model and the set of input examples by iteratively performing the following steps: computing a plurality of local models having a plurality of model parameters based on the global model and at least a portion of the set of input examples; computing, for each local model, a corresponding model combiner based on the global model and at least a portion of the set of input examples; and combining the plurality of local models into the new global model based on the current global model and the plurality of corresponding model combiners.
Additional objects and advantages of the disclosed embodiments will be set forth in part in the description that follows, and in part will be apparent from the description, or may be learned by practice of the disclosed embodiments. The objects and advantages of the disclosed embodiments will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosed embodiments, as claimed.
In the course of the detailed description to follow, reference will be made to the attached drawings. The drawings show different aspects of the present disclosure and, where appropriate, reference numerals illustrating like structures, components, materials and/or elements in different figures are labeled similarly. It is understood that various combinations of the structures, components, and/or elements, other than those specifically shown, are contemplated and are within the scope of the present disclosure.
Moreover, there are many embodiments of the present disclosure described and illustrated herein. The present disclosure is neither limited to any single aspect nor embodiment thereof, nor to any combinations and/or permutations of such aspects and/or embodiments. Moreover, each of the aspects of the present disclosure, and/or embodiments thereof, may be employed alone or in combination with one or more of the other aspects of the present disclosure and/or embodiments thereof. For the sake of brevity, certain permutations and combinations are not discussed and/or illustrated separately herein.
Again, there are many embodiments described and illustrated herein. The present disclosure is neither limited to any single aspect nor embodiment thereof, nor to any combinations and/or permutations of such aspects and/or embodiments. Each of the aspects of the present disclosure, and/or embodiments thereof, may be employed alone or in combination with one or more of the other aspects of the present disclosure and/or embodiments thereof. For the sake of brevity, many of those combinations and permutations are not discussed separately herein.
One skilled in the art will recognize that various implementations and embodiments of the present disclosure may be practiced in accordance with the specification. All of these implementations and embodiments are intended to be included within the scope of the present disclosure.
As used herein, the terms “comprises,” “comprising,” “have,” “having,” “include,” “including,” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The term “exemplary” is used in the sense of “example,” rather than “ideal.” Additionally, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or.” That is, unless specified otherwise, or clear from the context, the phrase “X employs A or B” is intended to mean any of the natural inclusive permutations. For example, the phrase “X employs A or B” is satisfied by any of the following instances: X employs A; X employs B; or X employs both A and B. In addition, the articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or clear from the context to be directed to a singular form.
For the sake of brevity, conventional techniques related to systems and servers used to conduct methods and other functional aspects of the systems and servers (and the individual operating components of the systems) may not be described in detail herein. Furthermore, the connecting lines shown in the various figures contained herein are intended to represent exemplary functional relationships and/or physical couplings between the various elements. It should be noted that many alternative and/or additional functional relationships or physical connections may be present in an embodiment of the subject matter.
Reference will now be made in detail to the exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
The present disclosure generally relates to, among other things, a parallel Stochastic Gradient Descent (“SGD”) algorithm that allows threads to communicate less frequently and achieve a high-fidelity approximation to what the threads would have produced had the threads run sequentially. A key aspect of the parallel SGD algorithm of the present disclosure is for each thread to generate a sound model combiner that precisely captures the first-order effects of a SGD computation starting from an arbitrary model. Periodically, the threads may update a global model with their local model while using a model combiner to account for changes in the global model that occurred in the interim.
While the parallel SGD algorithm may be generalized to different machine learning problems and on different parallel settings, such as distributed clusters and GPUs, the present disclosure describes techniques in the context of evaluation on linear learners on multi-core machines, which may be primarily motivated by the fact that this setting forms the core of machine learning presently. For example, some developers have trained over one (1) million models per month in 2016 on single-node installations. Likewise, a 2015 survey has shown almost 50% of Spark installations are single-node. As machines with terabytes of memory become commonplace, machine learning tasks on large datasets may be done efficiently on single machines without paying the inherent cost of distribution. Thus, the present disclosure discusses the parallel SGD algorithm being evaluated on linear learners on multi-core machines. However, the present disclosure may not be limited to such systems, and may be applicable to different parallel settings, such as distributed clusters and GPUs, as described in more detail below. A person of ordinary skill in the art may apply the techniques described herein to other environments and systems.
The present disclosure relates to a parallel SGD (sometimes referred to as Symbolic SGD or SYMSGD) algorithm that allows threads to communicate less frequently while achieving a high-fidelity approximation to what the threads would have produced had the threads run sequentially.
In a parallel SGD algorithm of the present disclosure, a set of N input examples z_i=(x_i, y_i) may be given. x_i may be a vector of f feature values, and y_i may be a label to learn. A convex cost function to minimize may be represented by C(w) = Σ_{i=1}^{N} C_{z_i}(w).
Accordingly, an object of the present disclosure is to find w* = argmin_w C(w).
The cost function may optionally include a regularization term. Gradients may be defined as G_z(w) = ∇_w C_z(w), and the Hessian of the cost function may be defined as H_z(w) = ∇_w² C_z(w).
At each step t, the parallel SGD may pick z_r=(x_r, y_r) uniformly randomly from the input dataset, and may update the current model w_t along the gradient G_{z_r}:

w_{t+1} = w_t − α_t·G_{z_r}(w_t)   (3)

α_t may represent a learning rate that determines a magnitude of the update along the gradient. As equation (3) shows, w_{t+1} may be dependent on w_t, and this dependence may make parallelization of SGD across iterations difficult.
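As a concrete illustration of the update in equation (3), the following Python sketch runs sequential SGD for a least-squares learner. The code is illustrative only; the function and variable names are assumptions and do not come from the present disclosure, and examples are processed in order rather than sampled uniformly at random.

```python
import numpy as np

def sequential_sgd(examples, w0, alpha):
    """Sequential SGD for square loss: w_{t+1} = w_t - alpha * G_z(w_t)."""
    w = w0.copy()
    for x, y in examples:                # each z = (x, y); processed in order for simplicity
        grad = (x @ w - y) * x           # gradient of 0.5*(x.w - y)^2 with respect to w
        w = w - alpha * grad             # the update of equation (3)
    return w

# Toy usage with random data.
rng = np.random.default_rng(0)
f = 5
examples = [(rng.normal(size=f), rng.normal()) for _ in range(100)]
w = sequential_sgd(examples, np.zeros(f), alpha=0.05)
```

The loop makes the sequential dependence explicit: each update reads the w produced by the previous update, which is precisely what complicates parallelization.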
As mentioned in the introduction above, current parallelization techniques, such as HOGWILD!, do not honor this dependence between iterations.
According to embodiments of the present disclosure, a parallel SGD (symbolic SGD/SYMSGD) algorithm honors this dependence. As an example with two processors, an input dataset may be split into subsets D1 and D2. A sequential SGD would start from a global model wg, process D1 to produce a model w1, and then process D2 starting from w1 to produce the final model wh. When run in parallel, the second processor does not know w1 ahead of time and instead starts its computation on D2 from wg, and therefore its computation needs to be adjusted.
In order to adjust the computation of D2, a second processor may perform a computation from wg+Δw, where Δw may be an unknown symbolic vector. This may allow for the second processor to both compute a local model (resulting from the concrete part) and a model combiner (resulting from the symbolic part) that accounts for changes in the initial state. Once both processors are done calculating, the second processor may find wh by setting Δw to w1−wg, where w1 is computed by the first processor. This parallelization approach of SGD may be extended to multiple processors, where all processors produce a local model and a combiner (except for the first processor, which does not need a combiner), and the local models are combined sequentially using the combiners.
An SGD computation on a dataset D starting from w may be represented by S_D(w). Thus, for example, for local model w1, w1 = S_{D1}(wg), and a sequential run would compute wh = S_{D2}(w1). Expanding S_D around w gives:

S_D(w + Δw) = S_D(w) + S′_D(w)·Δw + O(∥Δw∥₂²)   (4)

where M_D(w) = S′_D(w) may be defined as the model combiner. In equation (4) above, the model combiner may capture a first-order effect of how a Δw change in wg may affect the SGD computation. For example, by using Δw = w1−wg in equation (4), the local models may be combined into the final model wh, as shown in the accompanying figures.
Further, when Δw is sufficiently small, the second order term may be neglected, and the model combiner may be used to combine local models with sufficient fidelity.
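The following sketch illustrates equation (4) for a least-squares learner, for which the update is linear in w and the expansion is exact. A worker computes its local model S_D(wg) together with the combiner M_D = S′_D(wg), and the combiner then transports that computation to a different starting point w1. The names, the dense f×f combiner, and the choice of square loss are assumptions made for illustration.

```python
import numpy as np

def sgd_with_combiner(examples, w0, alpha):
    """Return S_D(w0) and the model combiner M_D = dS_D/dw for square loss."""
    f = len(w0)
    w, M = w0.copy(), np.eye(f)
    for x, y in examples:
        w = w - alpha * (x @ w - y) * x                   # SGD step, equation (3)
        M = (np.eye(f) - alpha * np.outer(x, x)) @ M      # compose the per-step Jacobian
    return w, M

rng = np.random.default_rng(1)
f = 4
D = [(rng.normal(size=f), rng.normal()) for _ in range(50)]
wg = np.zeros(f)                                          # stale global model
w1 = rng.normal(size=f)                                   # model produced by another worker

local, M = sgd_with_combiner(D, wg, alpha=0.05)           # computed from wg
sequential = sgd_with_combiner(D, w1, alpha=0.05)[0]      # what a sequential run from w1 gives
combined = local + M @ (w1 - wg)                          # equation (4) with Delta_w = w1 - wg
assert np.allclose(combined, sequential)                  # exact because the update is linear in w
```

For non-linear learners the per-step Jacobian depends on the current model, so the final assertion would hold only approximately, with error O(∥Δw∥₂²).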
Model combiners may provide flexibility to design parallel SGD algorithms. As explained in more detail below, model combiners allow for a parallel SGD using map-reduce functionality and allow for a parallel SGD using asynchronous functionality. In model combiners that allow for a parallel SGD to use map-reduce functionality, for the map phase, each processor i ∈ [1, N] may start from the same global model wg, and each processor may compute a local model S_{Di}(wg) and a model combiner M_{Di}(wg) on its partition Di of the input examples. In the reduce phase, the local models may be combined sequentially into a new global model according to:

w_i = S_{Di}(wg) + M_{Di}(wg)·(w_{i−1} − wg)

with w_0 = wg.
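A minimal sketch of the reduce phase just described, assuming each of the processors has already produced a local model and a combiner from the same global model wg; the function name is a hypothetical choice.

```python
import numpy as np

def reduce_phase(wg, locals_and_combiners):
    """Fold local (model, combiner) pairs into a new global model:
    w_i = S_Di(wg) + M_Di(wg) @ (w_{i-1} - wg), starting from w_0 = wg."""
    w = wg
    for local_model, combiner in locals_and_combiners:
        w = local_model + combiner @ (w - wg)
    return w
```

Because the first processor starts from wg, its combiner multiplies a zero vector and the fold simply begins from its local model, matching the observation above that the first processor does not need a combiner.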
Machine learning algorithms, such as linear regression, linear regression with L2 regularization, and polynomial regression, may have an update that is linear in the model parameters (though not necessarily linear in the input example). In such cases, the higher order terms in equation (4) may vanish. Accordingly, for such machine learning algorithms, model combiners may generate exactly the same model as a sequential SGD.
Specifically, for a standard linear regression with square loss, the combiner matrix may be given by:

M_D = Π_{i=n}^{1} (I − α·x_i·x_iᵀ)
when computing on D=(x1, y1), . . . , (xn, yn).
Since the model combiner may be independent of w, the model combiner may be computed once and may be reused in subsequent phases provided that the learning rates do not change.
Further, for a logistic regression machine learning algorithm, which has an update rule:

w_i = w_{i−1} − α·(σ(x_i·w_{i−1}) − y_i)·x_i   (7)

where σ may be the sigmoid function, the model combiner may be given by:

M_D = Π_{i=n}^{1} (I − α·σ′(x_i·w_{i−1})·x_i·x_iᵀ)

where w_0 = w.
The model combiner for logistic regression may be a model combiner generated for linear regression but with α scaled by σ′(xi·wi−1).
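Under the same illustrative assumptions as the earlier least-squares sketch, a logistic-regression worker may be written as below; the only change is that each per-step Jacobian scales α by σ′(x_i·w_{i−1}), evaluated at the model before the step. This is a sketch, not the disclosure's implementation.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_sgd_with_combiner(examples, w0, alpha):
    """Local model and model combiner for logistic regression, per equation (7)."""
    f = len(w0)
    w, M = w0.copy(), np.eye(f)
    for x, y in examples:
        p = sigmoid(x @ w)                                # uses w_{i-1}
        scale = p * (1.0 - p)                             # sigma'(x . w_{i-1})
        M = (np.eye(f) - alpha * scale * np.outer(x, x)) @ M
        w = w - alpha * (p - y) * x                       # equation (7)
    return w, M
```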
Table 1 provides model combiners for linear learners. When the SGD update function is not differentiable, using the Taylor expansion in equation (4) may result in errors at points of discontinuity. However, assuming bounded gradients, these errors may not affect the convergence of the parallel SGD (SYMSGD) of the present disclosure.
For linear learners of Table 1, λ>0 may be an additional hyper-parameter and δϕ may be 1 when ϕ is true, else δϕ may be 0. In a Lasso linear learner, the model w may include positive w+ and negative w− features with s(i) denoting the sign of feature i. [vi]i may describe a vector with vi as the ith element and [mij]ij may represent a matrix with mij as the (i,j)th element.
A challenge in using model combiners, as described above, may be that the model combiners are large f×f matrices. Machine learning problems may involve learning over tens of thousands to billions of features. Thus, it may be impossible to represent a model combiner explicitly. Mechanisms described below may address this problem.
Accordingly, in embodiments of the present disclosure, a model combiner matrix may be projected into a smaller dimension while maintaining its fidelity. While the projection may generate an unbiased estimate of the model combiner, the model combiner's variance may affect convergence. As discussed above regarding convergence, it may be shown that with appropriate bounds on this variance, convergence may be guaranteed.
In embodiments of the present disclosure, a combiner matrix M_D in a parallel SGD (SYMSGD) of the present disclosure may not be formed explicitly; instead, the combiner may be maintained in a randomly projected, lower-dimensional form, based on the following observation.
[m_ij]_ij may represent a matrix with m_ij as the element in the ith row and jth column. A=[a_ij]_ij may be a random f×k matrix with a_ij = d_ij/√k, where d_ij may be independently sampled from a random distribution D with E[D]=0 and Var[D]=1. Then, E[A·Aᵀ] = I_{f×f}, which may be shown as follows. If B = [b_ij]_ij = A·Aᵀ, then b_ij = Σ_{l=1}^{k} a_il·a_jl. When i≠j, E[b_ij]=0, as a_il and a_jl may be independent random variables with mean 0; and E[b_ii]=1, as each d_il has variance 1 and b_ii = (1/k)·Σ_{l=1}^{k} d_il².
Thus, the model combination described above may become:

w_i ≈ S_{Di}(wg) + M_{Di}(wg)·A·Aᵀ·(w_{i−1} − wg)   (9)
Equation (9) may allow for an efficient algorithm that computes a projected version of the combiner matrix, while still producing the same answer as the sequential SGD algorithm in expectation. Accordingly, the projection may incur a space and time overhead of O(z×k), where z may be the number of non-zeros in an example x_i. The overhead may be acceptable for a small value of k. In exemplary embodiments of the present disclosure discussed in more detail below, k may be between 7 and 15 in various benchmarks and remain acceptable. Additionally, the overhead for such a small value of k may be hidden by utilizing Single Instruction, Multiple Data (“SIMD”) hardware within a processor. For example, as discussed in more detail below, the parallel SGD of the present disclosure (SYMSGD) may store the k projected vectors consecutively in memory and update them with SIMD instructions.
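The sketch below maintains only the f×k projected combiner M_D·A of equation (9) instead of the full f×f matrix: A is fixed up front and each step updates the k columns, so the extra work per example is proportional to k. Dense vectors, square loss, and the function names are assumptions made for brevity.

```python
import numpy as np

def projected_worker(examples, wg, A, alpha):
    """Return the local model S_D(wg) and the projected combiner M_D @ A (an f x k matrix)."""
    w, MA = wg.copy(), A.copy()                 # M starts as I, so M @ A starts as A
    for x, y in examples:
        MA = MA - alpha * np.outer(x, x @ MA)   # left-multiply MA by (I - alpha x x^T)
        w = w - alpha * (x @ w - y) * x         # ordinary SGD step on the local model
    return w, MA

def combine(local, MA, A, w_prev, wg):
    """Equation (9): w ~= S_D(wg) + M_D A A^T (w_prev - wg), using only the projected combiner."""
    return local + MA @ (A.T @ (w_prev - wg))
```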
After learning a local model and a projected model combiner in each processor, the parallel SGD of the present disclosure (SYMSGD) may combine the local models into a new global model according to equation (9).
An unbiased estimation, as discussed above, may only be useful when the variance of the approximation is acceptably small. The variance of the projection described above may be determined as follows. The trace of a matrix M, tr(M), may be the sum of the diagonal elements of the matrix. λ_i(M) may be the ith eigenvalue of M, and σ_i(M) = √(λ_i(Mᵀ·M)) the ith singular value of M. σ_max(M) may be the maximum singular value of M. With v = M·A·Aᵀ·Δw, the trace of the covariance matrix tr(Cov(v)) may be bounded by equation (10), as proven in more detail below.
Accordingly, the covariance may be small when k, which is the dimension of the projected space, is large. However, increasing k proportionally increases the overhead of the parallel SGD algorithm. Similarly, the covariance may be small when the projection happens on a small Δw. With respect to equation (9) above, w_{i−1} should be as close to wg as possible, which may imply that processors may need to communicate frequently enough to ensure that their respective models remain roughly in sync. Additionally, the singular values of M should be as small as possible. Discussed in more detail below is an optimization that addresses both requirements: frequent enough communication and small singular values.
Variance and Covariance of the Projection
For simplicity below, w may be used instead of Δw, and r may be used instead of k for the size of the projected space; k may be used for summation indices. v = M·w may be estimated by passing w through a random projection A, where A is an f×r matrix whose entries a_ij are random variables with the following properties: E(a_ij)=0, E(a_ij²)=1, and E(a_ij⁴)=ρ=3. m_sᵀ may be some row of M, and v_s may denote the corresponding element of the estimate.
From the above, E(v_s) = m_sᵀ·w. The notation ij=kl is used to mean i=k ∧ j=l, and the notation ij≠kl is used to mean its negation. m_s and m_t may be two rows of M, and the covariance of the resulting v_s and v_t is to be found.
In other words:
The covariance Cov(a,b)=E(a·b)−E(a)E(b). Using this:
Cov(v) may be the covariance matrix of v. That is, Cov(v)_ij = Cov(v_i, v_j). So:
Note that this computation may be used for matrix N=M−I as well since nothing was assumed about the matrix M from the beginning. Therefore,
since w is a constant in v′ and Cov(a+x) = Cov(x) for any constant vector a and any random vector x.
As mentioned above, the trace of the covariance matrix tr(Cov(v)) may be bounded as shown by equation (10). Cov(v) may be bounded by computing its trace, since tr(Cov(v)) = Σ_i Var(v_i), which is a summation of the variances of the elements of v.
where λ_i(M·Mᵀ) is the ith largest eigenvalue of M·Mᵀ, which is the square of the ith largest singular value of M, σ_i(M)².
Since ∥M·w∥₂² ≤ ∥w∥₂²·∥M∥₂² = ∥w∥₂²·σ_max(M)², tr(Cov(v)) may be bounded as follows:
Thus:
Combining the two inequalities yields:
Accordingly, the same bounds may be derived when N=M−I is used. Subtracting I from a model combiner results in a matrix with a small rank. Thus, most of its singular values may be zero. The model combiner may be assumed to be generated for a linear learner, and thus the model combiner may be of the form Πi(I−αxixiT), where any non-linear scalar terms from the Hessian may be factored into α.
For a matrix M_{a→b} = Π_{i=b}^{a}(I − α·x_i·x_iᵀ), rank(M_{a→b} − I) ≤ b − a. The proof may be by induction. A base case may be when a=b and M_{a→b}=I; thus M_{a→b} − I = 0, which has rank zero. For the induction step, assume that rank(M_{a→b−1} − I) ≤ b − a − 1. Accordingly:
The term α·x_b·(x_bᵀ·M_{a→b−1}) may be a rank-1 matrix, and the term (M_{a→b−1} − I) may have rank at most b−a−1 by the induction hypothesis. Since rank(A+B) ≤ rank(A) + rank(B) for any two matrices A and B, rank(M_{a→b} − I) ≤ rank(M_{a→b−1} − I) + rank(−α·x_b·x_bᵀ·M_{a→b−1}) ≤ b−a−1+1 = b−a.
Equation (43), as discussed below, suggests that when α_i is small, a model combiner M_D(w) may be dominated by the I term. Further, because a combiner matrix M_D(w) generated from n examples is such that M_D(w) − I has at most n non-zero singular values, and because each processor may operate on a subset of the data, it may be likely that n « f, the number of features. These observations may be used to lower the variance of the dimensionality reduction by projecting the matrix N_D = M_D − I instead of M_D. This optimization allows for the scalability of the parallel SGD of the present disclosure (SYMSGD), as discussed below.
w_i ≈ S_{Di}(wg) + (w_{i−1} − wg) + N_{Di}(wg)·A·Aᵀ·(w_{i−1} − wg)
From the proof discussed above, the approximation above may be guaranteed to be unbiased.
An important factor in controlling the singular values of N_{Di} may be how much work each processor performs between synchronizations: the fewer examples processed per phase and the smaller the learning rate, the smaller the singular values of N_{Di} may be.
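A sketch of the variance reduction just described, under the same illustrative assumptions as before: the worker still maintains M_D·A, and at combination time the identity part of the combiner is applied to Δw exactly, so only N_D = M_D − I passes through the random projection.

```python
import numpy as np

def combine_identity_split(local, MA, A, w_prev, wg):
    """w ~= S_D(wg) + dw + (M_D - I) A A^T dw, with dw = w_prev - wg.
    Only N_D = M_D - I is projected; N_D has at most n non-zero singular
    values when the combiner was built from n examples, which lowers the variance."""
    dw = w_prev - wg
    NA = MA - A                      # N_D @ A = M_D @ A - A
    return local + dw + NA @ (A.T @ dw)
```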
As discussed below, convergence may be guaranteed when neglecting the higher order terms under certain general assumptions on the cost function, provided ∥Δw∥₂ is bounded. For example, a sequence w_0, w_1, . . . , w_t may represent a sequence of weight vectors produced by a sequential SGD run. The parallel SGD of the present disclosure may instead compute:

w_{t+1} = S_D(w_t − Δw) + M_D·Δw   (21)

where the model combiner, after the projection and after taking the I off, may be given as:

M_D = I + (S′_D(w_t − Δw) − I)·A·Aᵀ   (22)
Accordingly, applying Taylor's theorem, for some 0≤μ≤1:
Comparing equation (23) with equation (21) may show that the parallel SGD of the present disclosure may introduce two error terms as compared to a sequential SGD.
w_{t+1} = S_D(w_t) + FR_D(w_t, Δw) + SR_D(w_t, Δw)   (24)
where a first-order error term FR may arise from the projection approximation:

FR_D(w_t, Δw) = (I − S′_D(w_t − Δw))·(I − A·Aᵀ)·Δw   (25)

and where a second-order error term SR may arise from neglecting the higher-order terms in the Taylor expansion.
To prove convergence of the parallel SGD of the present disclosure (SYMSGD), the following assumptions may be made.
Assuming convexity of the cost function:

(w − w*)·G(w) > 0   (27)

for w ≠ w*, and assuming bounded gradients, for any input z=(x, y),

∥G_z(w)∥₂ ≤ b_G·∥w − w*∥₂   (28)
for some bG≥0.
Bounds on the mean and second moment of FR are as follows:

E_A(FR_D(w_t, Δw)) = 0   (29)

E_A(∥FR_D(w_t, Δw)∥₂²) ≤ b_FR·∥w_t − w*∥₂²   (30)
for some bFR≥0.
Bounds on SR are as follows:
∥SR_D(w_t, Δw)∥₂ ≤ b_SR·∥w_t − w*∥₂   (31)

for some b_SR ≥ 0.
Convergence of the parallel SGD of the present disclosure follows if the following quantity converges to 0:

h_t = ∥w_t − w*∥₂²   (32)
A worst case in which error terms are added at every step of the parallel SGD may be assumed; this is the worst case because the error bounds of equations (29)-(31) hold for arbitrary steps. Under these assumptions, the sequence h_t converges to 0 almost surely, as discussed in detail below.
Pt denotes all the random choices made by the algorithm at time t. For terseness, the following notation for the conditional expectation with respect to Pt may be used:
CE(x)=E(x|Pt) (33)
A key technical challenge may be in showing that the infinite sum of the positive expected variations in h_t is bounded, which may be shown below. z=(x, y) may be an example processed at time t. The following shorthand is used:
In other words, for B = b_G + b_FR + b_SR + 2·b_G·b_SR, the following holds:

CE(h_{t+1} − (1 + γ_t²·B)·h_t) ≤ 0   (35)
Auxiliary sequences μ_t = Π_{i=1}^{t} 1/(1 + γ_i²·B) and h′_t = μ_t·h_t may be defined. Since Σ_t γ_t² < ∞, μ_t may be assumed to converge to a nonzero value. Because equation (35) above may imply CE(h′_{t+1} − h′_t) ≤ 0, it follows from the quasi-martingale convergence theorem that h′_t, and thus h_t, may converge almost surely. Under the additional assumption that Σ_t γ_t = ∞, this limit may be 0.
The proof above relies on equations (29)-(31), which are proven below under the following assumptions. The discussion may be restricted to linear learners, for which the model combiners are of the form M_z(w) = (I − α·H_z(w)·x·xᵀ) for a scalar Hessian H_z(w). Assuming the parallel SGD (SYMSGD) of the present disclosure synchronizes often enough that Δw remains bounded:

∥Δw∥₂ ≤ min(1, b_Δw·∥w_t − w*∥₂)   (36)
for some bΔw>0.
Assuming a bounded Hessian:

|H_z(w)| ≤ b_H   (37)
for some bH>0.
The model combiner MD(w)=Πi(I−αHz(w)xixiT) may have bounded eigenvalues, which follows from induction on i using the bounded Hessian of equation (37) above.
Bounds on the mean and second moment of FR in equations (29) and (30) may be proven as explained in detail below; the second moment bound follows from the result above, as applied to M and Mᵀ.
Bounds on SR, as shown in equation (31), may be proven as follows. For linear learners:
where H is a second derivative of a cost with respect to x·w, and ⊗ is a tensor outer product.
As to equation (40) above, if the input model is the result of a previous SGD phase, then:
For notational convenience, let M_{b→a} = Π_{i=b}^{a}(I − α·x_i·x_iᵀ).
By explicitly differentiating S′n(w):
Each element of SR_z may be obtained as Δwᵀ·P·Δw, where P is an outer product of a row from the first term above and S′_j(w)ᵀ·x. Using the above result twice, each of these vectors may be shown to be bounded.
To generate a model combiner, D=(z_1, z_2, . . . , z_n) may be a sequence of input examples and D_i may represent the subsequence (z_1, . . . , z_i). The model combiner may be given by:

M_D(w) = Π_{i=n}^{1} M_{z_i}(S_{D_{i−1}}(w))

with S_{D_0}(w) = w and

S_D(w) = S_{z_n}(S_{D_{n−1}}(w)),

which follows from equation (3) above and the chain rule.
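Reflecting the chain-rule decomposition above, the sketch below builds the combiner as a product of per-example combiners M_z evaluated at the intermediate models S_{D_{i−1}}(w). The callable arguments grad and hess (the per-example gradient scale and scalar Hessian with respect to x·w) and all names are illustrative assumptions; square-loss instances are given at the end.

```python
import numpy as np

def combiner_via_chain_rule(examples, w0, alpha, grad, hess):
    """M_D(w0) = M_{z_n}(S_{D_{n-1}}(w0)) @ ... @ M_{z_1}(S_{D_0}(w0)), with S_{D_0}(w0) = w0."""
    f = len(w0)
    w, M = w0.copy(), np.eye(f)
    for x, y in examples:
        # Per-example combiner evaluated at the current intermediate model S_{D_{i-1}}(w0).
        M = (np.eye(f) - alpha * hess(w, x, y) * np.outer(x, x)) @ M
        # Advance the intermediate model to S_{D_i}(w0).
        w = w - alpha * grad(w, x, y) * x
    return w, M

# Square-loss instances: gradient scale (x.w - y) and constant scalar Hessian 1.
grad = lambda w, x, y: (x @ w - y)
hess = lambda w, x, y: 1.0
```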
As mentioned above, the graphs in the accompanying figures illustrate these observations.
Discussed below are implementations of various parallel SGD algorithms according to embodiments of the present disclosure (SYMSGD), including map-reduce and asynchronous variants.
Because the map-reduce SYMSGD computes and projects a model combiner for the features it touches, its overhead may be unnecessary on sparse datasets, where asynchronous updates to a shared model rarely conflict.
However, even sparse datasets have a frequently used subset of features, which may be likely to show up in many input examples. This frequently used subset of features can cause scalability issues for HOGWILD!, because concurrent updates to the same frequent features generate contention and cache-coherence traffic across cores.
A parallel SGD algorithm according to embodiments of the present disclosure may blend asynchronous updates of infrequent model parameters with the map-reduce SYMSGD treatment of frequent model parameters; this variant may be referred to herein as Async-SYMSGD.
Because the frequently accessed subset of features may be much smaller than the infrequently accessed subset of features, Async-SYMSGD may compute local models and model combiners only over the frequent features, which may keep the cost of maintaining and combining the combiners small.
An average NFNZ ratio (shown in Table 2 below) may be the average number of frequent features in each input example divided by the number of non-zero features in that input example. Thus, a value of zero (0) may mean all features are infrequent, and a value of one (1) may mean all features are frequent. A frequent feature may be defined as a feature that shows up in at least ten percent (10%) of the input examples. At runtime, Async-SYMSGD may sample a portion of the input examples to identify the frequent features, may apply the map-reduce SYMSGD to the frequent features, and may update the infrequent features asynchronously, as shown in the sketch below.
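A single-threaded sketch of the split just described, with illustrative names and square loss assumed: a sampling pass marks frequent features, and each per-example update writes infrequent coordinates directly (asynchronously on a shared model in the full algorithm) while the frequent block also maintains a projected combiner so the threads' local frequent sub-models can be soundly combined later.

```python
import numpy as np

def find_frequent(examples, f, threshold=0.10):
    """Mark a feature as frequent if it is non-zero in at least `threshold` of the examples."""
    counts = np.zeros(f)
    for x, _ in examples:
        counts += (x != 0)
    return counts >= threshold * len(examples)

def per_example_step(x, y, w, frequent, MA_freq, alpha):
    """One thread's work on one example (shown sequentially here)."""
    g = (x @ w - y) * x                               # square-loss gradient
    w[~frequent] -= alpha * g[~frequent]              # direct, HOGWILD!-style update
    xf = x[frequent]
    MA_freq -= alpha * np.outer(xf, xf @ MA_freq)     # projected combiner on the frequent block
    w[frequent] -= alpha * g[frequent]                # local update of the frequent block
    return w, MA_freq
```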
Equation (4) shows that an error in SYMSGD's approximation grows with the magnitude of Δw. To keep Δw small, each processor may process its input examples in blocks, combining the local models into the global model after each block.
Block size may be set to a constant value per benchmark throughout the execution of the parallel SGD of the present disclosure (SYMSGD).
While computing a model combiner increases the computational complexity by a factor of O(k) (where k<15 in all experiments), a k× slowdown may not be seen in practice. Each processor may store the k vectors of A (and of the projected combiner) consecutively in memory, so the parallel SGD of the present disclosure (SYMSGD) may update all k entries of a feature with SIMD instructions, hiding much of this overhead, as illustrated below.
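In Python the SIMD mapping cannot be shown directly, but the sketch below mirrors the layout choice: the projected combiner is stored row-major with shape (f, k), so the k entries belonging to one feature are contiguous in memory and the inner update touches one contiguous stripe per non-zero feature, which vectorizes naturally. The sparse-example representation (index array plus value array) is an assumption.

```python
import numpy as np

def sparse_projected_update(MA, alpha, idx, vals):
    """Apply (I - alpha * x x^T) to MA for a sparse example x given by (idx, vals).
    MA has shape (f, k) in row-major order; idx is assumed to hold distinct feature indices."""
    proj = vals @ MA[idx]                      # x^T @ MA, a length-k row
    MA[idx] -= alpha * np.outer(vals, proj)    # update only the rows of the non-zero features
    return MA
```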
Further, SYMSGD may draw the entries of the projection matrix A from a sparse distribution that takes the values √3, 0, and −√3 with a probability of 1/6, 2/3, and 1/6, respectively.
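One way to sample such a projection matrix is sketched below; the three-valued distribution has mean 0, variance 1, and fourth moment 3, matching the moments assumed in the analysis above, and the 1/√k scaling follows the earlier definition of A. The function name is illustrative.

```python
import numpy as np

def sparse_projection(f, k, rng):
    """f x k projection with entries sqrt(3), 0, -sqrt(3) taken with probabilities 1/6, 2/3, 1/6."""
    vals = rng.choice(np.array([np.sqrt(3.0), 0.0, -np.sqrt(3.0)]),
                      size=(f, k), p=[1/6, 2/3, 1/6])
    return vals / np.sqrt(k)                   # scaling so that E[A @ A.T] = I
```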
According to embodiments of the present disclosure, experiments were performed on a machine including two Intel Xeon E5-2630 v3 processors clocked at 2.4 GHz with 256 GB of Random Access Memory (“RAM”). The machine includes two sockets with 8 cores for each processor, which allowed for a study of the scalability of the parallel SGD algorithms of the present disclosure across multiple sockets. Hyper-threading and turbo boosting of the processors was disabled. Threads were pinned to cores in a compact way, which means that thread i+1 was placed as close as possible to thread i. The above-described machine runs Windows 10, and implementations were compiled with Intel C/C++ compiler 16.0 and relied heavily on Open Multi-Processing (“OpenMP”) primitives for parallelization and the Math Kernel Library (“MKL”) for efficient linear algebra computations. Additionally, to measure runtime, an average of five independent runs was used on an otherwise idle machine.
As mentioned above, the accompanying figures present the experimental results discussed below.
Discussed below are results for logistic regression. To study the scalability of a parallel algorithm, it is important to compare the algorithms against an efficient baseline. Otherwise, it may not be empirically possible to differentiate between the scalability achieved from the parallelization of the inefficiencies and the scalability inherent in the algorithm. The datasets used in experiments herein are described in Table 2, which includes various statistics of each benchmark.
For each algorithm and benchmark, a parameter sweep was done over the learning rate α, and the α that gave the best area under the curve (often referred to simply as the “AUC”) after 10 passes over the data was picked. The same procedure was followed for Async-SYMSGD.
The last two columns of Table 2 summarize the speedup of Async-SYMSGD on these benchmarks.
Each benchmark indicates that HOGWILD!'s scalability may suffer as the fraction of frequent features grows, while Async-SYMSGD may continue to scale in those cases.
Moreover, the acts described herein may be computer-executable instructions that may be implemented by one or more processors and/or stored on a non-transitory computer-readable medium or media. The computer-executable instructions may include a routine, a sub-routine, programs, a thread of execution, and/or the like. Still further, results of acts of the methods may be stored in a non-transitory computer-readable medium, displayed on a display device, and/or the like.
Upon receiving the set of input examples and the global model, a new global model may be learned based on the global model and the set of input examples by iteratively performing the following steps at step 506. Each global model may be dependent on a previous model, which may make parallelization of stochastic gradient descent across iterations difficult. Using model combiners with local models allows for the models to be combined into a new global model.
At step 508, a plurality of local models having a plurality of model parameters may be computed based on the global model and at least a portion of the set of input examples. The plurality of model parameters may be linearly updated based on a machine learning algorithm including one or more of linear regression, linear regression with L2 regularization, and polynomial regression. Additionally, and/or alternatively, the plurality of model parameters may be non-linearly updated based on a machine learning algorithm including logistic regression. At step 510 for each local model, a corresponding model combiner may be computed based on the global model and at least a portion of the set of input examples. Additionally, and/or alternatively, for each local model, a corresponding projected model combiner may be computed based on the global model and at least a portion of the set of input examples. The model combiners may be large f×f matrices. Machine learning problems may involve learning over tens of thousands to billions of features, and thus, a model combiner may not be explicitly represented. Accordingly, as discussed above, a projected model combiner may be used to project the model combiner matrix into a smaller dimension while maintaining its fidelity.
Then at step 512, the plurality of local models may be combined into the new global model based on the current global model and the corresponding plurality of model combiners and/or the corresponding plurality of projected model combiners.
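The following end-to-end sketch ties steps 506-512 together for a least-squares learner with a projected combiner. The worker and the partitioning are simplified, single-process stand-ins for what the disclosure describes as parallel computation; all names are assumptions.

```python
import numpy as np

def worker(examples, wg, A, alpha):
    """Steps 508/510: local model S_D(wg) and projected combiner M_D @ A on one partition."""
    w, MA = wg.copy(), A.copy()
    for x, y in examples:
        MA = MA - alpha * np.outer(x, x @ MA)
        w = w - alpha * (x @ w - y) * x
    return w, MA

def learn(examples, wg, alpha, A, num_workers, passes):
    """Step 506: repeat; step 512: fold the local models into a new global model."""
    for _ in range(passes):
        chunks = np.array_split(np.arange(len(examples)), num_workers)
        results = [worker([examples[i] for i in c], wg, A, alpha) for c in chunks]
        w = wg
        for local, MA in results:
            w = local + (w - wg) + (MA - A) @ (A.T @ (w - wg))   # identity part kept exact
        wg = w                                                    # new global model
    return wg
```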
Method 600 may begin at step 602 where a set of input examples may be received. The set of examples may include a plurality of vectors of feature values and a corresponding label to learn. A stochastic gradient descent algorithm may iteratively process the input dataset in which computation at each iteration may depend on model parameters learned from the previous iteration. At step 604, the method may receive a global model, the global model used to compute a plurality of local models based on the set of input examples. Periodically, the global model may be updated with the local models while using the model combiner to account for changes in the global model that occurred in the interim.
Upon receiving the set of input examples and the global model, a new global model may be learned based on the global model and the set of input examples by iteratively performing the following steps at step 606. Each global model may be dependent on a previous model, which may make parallelization of stochastic gradient descent across iterations difficult. Using model combiners with local models allows for the models to be combined into a new global model.
At step 608, during a map phase, a plurality of local models having a plurality of model parameters may be computed based on the global model and at least a portion of the set of input examples, and, for each local model, a corresponding model combiner based on the global model and at least a portion of the set of input examples. The plurality of model parameters may be linearly updated based on a machine learning algorithm including one or more of linear regression, linear regression with L2 regularization, and polynomial regression. Additionally, and/or alternatively, the plurality of model parameters may be non-linearly updated based on a machine learning algorithm including logistic regression. Additionally, and/or alternatively, for each local model, a corresponding projected model combiner may be computed based on the global model and at least a portion of the set of input examples. The model combiners may be large f×f matrices. Machine learning problems may involve learning over tens of thousands to billions of features, and thus, a model combiner may not be explicitly represented. Accordingly, as discussed above, a projected model combiner may be used to project the model combiner matrix into a smaller dimension while maintaining its fidelity.
Then at step 610, during a reduce phase, the plurality of local models may be combined into a new global model based on the current global model and the plurality of corresponding model combiners and/or the corresponding plurality of projected model combiners.
Method 700 may begin at step 702 where a set of input examples may be received. The set of examples may include a plurality of vectors of feature values and a corresponding label to learn. A stochastic gradient descent algorithm may iteratively process the input dataset in which computation at each iteration may depend on model parameters learned from the previous iteration. At step 704, the method may receive a global model, the global model used to compute a plurality of local models based on the set of input examples. Periodically, the global model may be updated with the local models while using the model combiner to account for changes in the global model that occurred in the interim. At step 706, the method may sample at least a portion of the set of input examples to determine and/or find frequent feature values. For each frequent feature value found, the map reduce version of the parallel stochastic gradient descent algorithm may be used. For each infrequent feature value, the features may be asynchronously updated.
For frequent feature values, upon receiving the set of input examples and the global model, a new global model may be learned based on the global model and the set of input examples by iteratively performing the following steps at step 708. Each global model may be dependent on a previous model, which may make parallelization of stochastic gradient descent across iterations difficult. Using model combiners with local models allows for the models to be combined into a new global model.
At step 710, for each frequent feature value during a map phase, a plurality of local models having a plurality of model parameters may be computed based on the global model and at least a portion of the set of input examples, and, for each local model, a corresponding model combiner based on the global model and at least a portion of the set of input examples. The plurality of model parameters may be linearly updated based on a machine learning algorithm including one or more of linear regression, linear regression with L2 regularization, and polynomial regression. Additionally, and/or alternatively, the plurality of model parameters may be non-linearly updated based on a machine learning algorithm including logistic regression. Additionally, and/or alternatively, for each local model, a corresponding projected model combiner may be computed based on the global model and at least a portion of the set of input examples. The model combiners may be large f×f matrices. Machine learning problems may involve learning over tens of thousands to billions of features, and thus, a model combiner may not be explicitly represented. Accordingly, as discussed above, a projected model combiner may be used to project the model combiner matrix into a smaller dimension while maintaining its fidelity. Further, each infrequent feature may be updated asynchronously, as discussed above.
Then at step 712, for each frequent feature value during a reduce phase, the plurality of local models may be combined into a new global model based on the current global model and the plurality of corresponding model combiners and/or the corresponding plurality of projected model combiners. For each infrequent feature value, the plurality of model parameters may be asynchronously updated. A frequent feature value may be a particular feature that shows up in at least 10% of the set of input examples. Additionally, or alternatively, other frequency thresholds may be used to determine whether a feature value is a frequent feature value, such as 5%, 10%, 15%, 20%, 25%, etc.
The computing device 800 may additionally include a data store 808 that is accessible by the processor 802 by way of the system bus 806. The data store 808 may include executable instructions, data, examples, features, etc. The computing device 800 may also include an input interface 810 that allows external devices to communicate with the computing device 800. For instance, the input interface 810 may be used to receive instructions from an external computer device, from a user, etc. The computing device 800 also may include an output interface 812 that interfaces the computing device 800 with one or more external devices. For example, the computing device 800 may display text, images, etc. by way of the output interface 812.
It is contemplated that the external devices that communicate with the computing device 800 via the input interface 810 and the output interface 812 may be included in an environment that provides substantially any type of user interface with which a user can interact. Examples of user interface types include graphical user interfaces, natural user interfaces, and so forth. For example, a graphical user interface may accept input from a user employing input device(s) such as a keyboard, mouse, remote control, or the like and may provide output on an output device such as a display. Further, a natural user interface may enable a user to interact with the computing device 800 in a manner free from constraints imposed by input devices such as keyboards, mice, remote controls, and the like. Rather, a natural user interface may rely on speech recognition, touch and stylus recognition, gesture recognition both on screen and adjacent to the screen, air gestures, head and eye tracking, voice and speech, vision, touch, gestures, machine intelligence, and so forth.
Additionally, while illustrated as a single system, it is to be understood that the computing device 800 may be a distributed system. Thus, for example, several devices may be in communication by way of a network connection and may collectively perform tasks described as being performed by the computing device 800.
Turning to
The computing system 900 may include a plurality of server computing devices, such as a server computing device 902 and a server computing device 904 (collectively referred to as server computing devices 902-904). The server computing device 902 may include at least one processor and a memory; the at least one processor executes instructions that are stored in the memory. The instructions may be, for example, instructions for implementing functionality described as being carried out by one or more components discussed above or instructions for implementing one or more of the methods described above. Similar to the server computing device 902, at least a subset of the server computing devices 902-904 other than the server computing device 902 each may respectively include at least one processor and a memory. Moreover, at least a subset of the server computing devices 902-904 may include respective data stores.
Processor(s) of one or more of the server computing devices 902-904 may be or may include the processor, such as processor 802. Further, a memory (or memories) of one or more of the server computing devices 902-904 can be or include the memory, such as memory 804. Moreover, a data store (or data stores) of one or more of the server computing devices 902-904 may be or may include the data store, such as data store 808.
The computing system 900 may further include various network nodes 906 that transport data between the server computing devices 902-904. Moreover, the network nodes 906 may transport data from the server computing devices 902-904 to external nodes (e.g., external to the computing system 900) by way of a network 908. The network nodes 906 may also transport data to the server computing devices 902-904 from the external nodes by way of the network 908. The network 908, for example, may be the Internet, a cellular network, or the like. The network nodes 906 may include switches, routers, load balancers, and so forth.
A fabric controller 910 of the computing system 900 may manage hardware resources of the server computing devices 902-904 (e.g., processors, memories, data stores, etc. of the server computing devices 902-904). The fabric controller 910 may further manage the network nodes 906. Moreover, the fabric controller 910 may manage creation, provisioning, de-provisioning, and supervising of managed runtime environments instantiated upon the server computing devices 902-904.
As used herein, the terms “component” and “system” are intended to encompass computer-readable data storage that is configured with computer-executable instructions that cause certain functionality to be performed when executed by a processor. The computer-executable instructions may include a routine, a function, or the like. It is also to be understood that a component or system may be localized on a single device or distributed across several devices.
Various functions described herein may be implemented in hardware, software, or any combination thereof. If implemented in software, the functions may be stored on and/or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media may include computer-readable storage media. A computer-readable storage media may be any available storage media that may be accessed by a computer. By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, may include compact disc (“CD”), laser disc, optical disc, digital versatile disc (“DVD”), floppy disk, and Blu-ray disc (“BD”), where disks usually reproduce data magnetically and discs usually reproduce data optically with lasers. Further, a propagated signal is not included within the scope of computer-readable storage media. Computer-readable media may also include communication media including any medium that facilitates transfer of a computer program from one place to another. A connection, for instance, can be a communication medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (“DSL”), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio and microwave are included in the definition of communication medium. Combinations of the above may also be included within the scope of computer-readable media.
Alternatively, and/or additionally, the functionality described herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that may be used include Field-Programmable Gate Arrays (“FPGAs”), Application-Specific Integrated Circuits (“ASICs”), Application-Specific Standard Products (“ASSPs”), System-on-Chips (“SOCs”), Complex Programmable Logic Devices (“CPLDs”), etc.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable modification and alteration of the above devices or methodologies for purposes of describing the aforementioned aspects, but one of ordinary skill in the art can recognize that many further modifications and permutations of various aspects are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the scope of the appended claims.
The present application claims the benefit of priority to U.S. Provisional Application No. 62/505,292, entitled “SYSTEMS, METHODS, AND COMPUTER-READABLE MEDIA FOR PARALLEL STOCHASTIC GRADIENT DESCENT WITH LINEAR AND NON-LINEAR ACTIVATION FUNCTIONS,” and filed on May 12, 2017, which is incorporated herein by reference in its entirety.