The present disclosure relates to second-order optimization methods for avoiding saddle points during the training of neural networks. The technology described herein is particularly well-suited for, but not limited to, optimization problems encountered in deep learning applications.
Deep neural networks have been used to achieve state-of-the-art results on a wide variety of tasks such as image classification, object recognition, natural language processing, and speech recognition. In the past few decades, many different neural network architectures have been applied to real-world applications, such as Convolutional Neural Networks (CNNs) for processing data with a known grid-like structure, or Recurrent Neural Networks (RNNs) for addressing tasks involving a time dimension in the data. The development of pre-training, better forms of initialization, fruitful variants of training techniques, and improved hardware have made it possible to train very deep networks and achieve excellent performance.
A complex and highly non-convex optimization problem is at the core of training deep neural networks. For a multi-label classification problem, given n sample-label pairs (xi, yi), i=1, . . . , n, we construct a neural network model h with respect to parameters θ to obtain the predicted label ŷi=h(xi, θ) for each input sample xi. If we denote the loss function for the i-th sample by f(ŷi, yi), the overall training loss for the entire sample set is then defined by

F(θ) = (1/n) Σi=1n fi(θ),    (1)

where the loss function fi(θ)=f(ŷi, yi) may include the squared error ∥yi−ŷi∥2/2 and the cross-entropy error −Σj(yij log(ŷij)+(1−yij)log(1−ŷij)). Note that all of these loss functions are nonnegative. The ultimate goal is then to minimize the overall training loss (1) to obtain the best parameters θ* such that the least classification error on both the validation and testing datasets is achieved.
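As a concrete, hedged illustration of these definitions, the numpy sketch below computes the per-sample losses and the overall training loss of Equation (1); the function names are illustrative and not part of the disclosure.

```python
import numpy as np

def squared_error(y, y_hat):
    """Per-sample squared error ||y - y_hat||^2 / 2."""
    return 0.5 * np.sum((y - y_hat) ** 2)

def cross_entropy(y, y_hat):
    """Per-sample binary cross-entropy summed over the label vector."""
    return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def total_loss(loss_fn, Y, Y_hat):
    """Overall training loss F(theta): the average of the per-sample losses."""
    return float(np.mean([loss_fn(y, yh) for y, yh in zip(Y, Y_hat)]))
```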
Currently, the most popular methodologies for training networks fall into the category of first-order (or gradient-based) optimization frameworks, such as the mini-batch stochastic gradient method (MSGD), the mini-batch stochastic gradient method with momentum (ASGD), and other variants such as Adagrad, Adadelta, and Adam. There are also many practical techniques to enhance training performance, such as drop-out, batch normalization, and layer normalization, to name but a few.
In training neural networks, especially deep neural networks with a large number of data samples, one of the main challenges is the relatively slow training rate. Moreover, computational results suggest that better training/testing performance is more likely when the optimization algorithm converges to a local minimizer of the training loss function defined in Equation (1). However, since the models defined by deep neural networks are highly non-convex, the number of saddle points increases exponentially as the number of hidden layers and corresponding neurons increases. Within the neighborhood of a saddle point, first-order methods can hardly make progress due to the nearly zero gradient of the loss function. Therefore, first-order methods struggle to escape from saddle points and show a frustratingly slow convergence rate after initial progress. Recent work suggests adding noise to the stochastic gradients to prevent slowdown near a saddle point.
Second-order methods, as an alternative for training deep neural networks, have been widely discussed in recent years. Examples include Hessian-free optimization, L-BFGS optimization, and the saddle-free Newton (SFN) method. Extensions of the original work include improvements to the preconditioning matrix for the conjugate gradient (CG) solver, as well as parallel/distributed variants of second-order methods. Among these previous works, either fully connected feedforward neural networks (DNNs) or recurrent neural networks (RNNs) were considered.
Embodiments of the present invention address and overcome one or more of the above shortcomings and drawbacks, by providing methods, systems, and apparatuses related to second-order optimization methods for avoiding saddle points during the training of deep neural networks.
According to some embodiments, a computer-implemented method for training a deep neural network includes defining a loss function corresponding to the deep neural network, receiving a training dataset comprising training samples, and setting current parameter values to initial parameter values. An optimization method is performed which iteratively minimizes the loss function. During each iteration, a steepest direction of the loss function is calculated by determining the gradient of the loss function at the current parameter values. A batch of samples included in the training samples is selected. A matrix-free CG solver is applied to obtain an inexact solution to a linear system defined by the steepest direction of the loss function and a stochastic Hessian matrix with respect to the batch of samples. A descent direction is determined based on the inexact solution to the linear system and the steepest direction of the loss function, and the current parameter values are updated based on the descent direction. Following the optimization method, the current parameter values are stored in relationship to the deep neural network.
Various enhancements, refinements, or other modifications may be made to the aforementioned method in different embodiments. For example, in one embodiment, the current parameter values are updated based on the descent direction and a learning rate calculated using the steepest direction of the loss function and the descent direction. The learning rate may be calculated, for example, using an Armijo line search method or a Goldstein line search method. In one embodiment, the batch of samples comprises a random sampling of the plurality of training samples. This random sampling may be calculated a single time, or the training samples may be resampled during each iteration. In another embodiment, the optimization method is performed using a parallel computing platform and computing operations associated with the optimization method are performed in parallel across a plurality of processors included in the parallel computing platform.
According to other embodiments of the present invention, a computer-implemented method for training a deep neural network includes defining a loss function corresponding to the deep neural network, receiving a training dataset comprising a plurality of training samples, and setting current parameter values to initial parameter values. A computing platform is used to perform an optimization method which iteratively minimizes the loss function over a plurality of iterations. During each iteration of the optimization method, a gradient for the loss function is calculated at the current parameter values, and a batch of samples included in the plurality of training samples is selected. A trust region subproblem is constructed that approximates the loss function using the gradient and a stochastic Hessian matrix of the loss function with respect to the batch of samples. A descent direction is determined by applying a SteihaugCG solver to the trust region subproblem given a trust region radius. The current parameter values and the trust region radius are conditionally updated based on a comparison of (i) a true reduction value provided by the loss function given the current parameter values, and (ii) a predicted reduction value provided by the descent direction. Following the optimization method, the current parameter values are stored in relationship to the deep neural network.
In some embodiments of the aforementioned second method for training a deep neural network, the trust region radius defines a spherical region within which the trust region subproblem is solved. In other embodiments, the trust region subproblem is a bounded quadratic minimization problem.
In one embodiment of the aforementioned second method for training a deep neural network, the current parameter values are updated by selecting a learning rate for the descent direction and determining a first set of parameters based on the product of the descent direction and the learning rate. A momentum descent direction at the first set of parameters is also determined. A momentum rate is selected for the momentum descent direction, and the current parameter values are updated based on the first set of parameters and the product of the momentum descent direction and the momentum rate. In one embodiment, the learning rate is determined using a backtracking line search based on the loss function, the current parameter values, and the descent direction. In another embodiment, the momentum rate is determined using a backtracking line search based on the loss function, the first set of parameters, and the momentum descent direction.
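The two-stage update described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the choice of the negative gradient as the momentum descent direction and the backtracking constants are assumptions made for the example.

```python
import numpy as np

def backtrack(F, theta, d, g, c=0.5, c1=1e-4, tries=40):
    """Backtracking line search: largest rate in {1, c, c^2, ...} giving sufficient decrease."""
    rate = 1.0
    for _ in range(tries):
        if F(theta + rate * d) <= F(theta) + c1 * rate * float(g @ d):
            return rate
        rate *= c
    return rate

def momentum_step(F, grad, theta, descent_dir):
    """Descent step to a first set of parameters, then a momentum step from there."""
    eta = backtrack(F, theta, descent_dir, grad(theta))    # learning rate
    theta_mid = theta + eta * descent_dir                  # first set of parameters
    m_dir = -grad(theta_mid)                               # momentum descent direction
    mu = backtrack(F, theta_mid, m_dir, grad(theta_mid))   # momentum rate
    return theta_mid + mu * m_dir
```

Both rates are chosen by the same backtracking routine, mirroring the embodiment in which the learning rate and the momentum rate are each determined by a backtracking line search.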
Additional features and advantages of the invention will be made apparent from the following detailed description of illustrative embodiments that proceeds with reference to the accompanying drawings.
The foregoing and other aspects of the present invention are best understood from the following detailed description when read in connection with the accompanying drawings. For the purpose of illustrating the invention, there are shown in the drawings embodiments that are presently preferred, it being understood, however, that the invention is not limited to the specific instrumentalities disclosed. Included in the drawings are the following Figures:
Systems, methods, and apparatuses are described herein which relate generally to second-order optimization methods for avoiding saddle points during the training of deep neural networks. More specifically, the techniques described herein employ two stochastic Hessian-based methods: the Inexact Stochastic Newton-CG method (SINNC) and the Inexact Stochastic Trust Region method (SINTR). These two methods use stochastic Hessian information to detect negative curvature directions efficiently. An early-terminated CG solver is used to find an approximate solution to the possibly indefinite subproblem in SINNC, and the SteihaugCG solver is applied in SINTR. A number of illustrative examples demonstrate the superior performance of SINNC and SINTR compared to MSGD and its variants in terms of loss objective value reduction and training accuracy. By using the proposed second-order methods, one can converge to a flatter minimizer, which also provides better generalization of the trained model. Thus, SINNC and SINTR show promise in solving large DNNs and achieving better accuracy than MSGD-type methods.
In the descriptions provided below, the following terminology is used. Denote [n]:={1, . . . , n}. We use fi to denote the loss function corresponding to the i-th sample and label pair (xi, yi), where i∈[n]. X and Y represent the sample matrix (x1, . . . , xn) and the label vector (y1, . . . , yn). We use

HS(θ) := (1/|S|) Σi∈S ∇2fi(θ)

to denote the stochastic Hessian matrix with respect to a batch ∅≠S⊆[n].
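For a concrete instance of this definition, the sketch below computes HS for an illustrative least-squares model fi(θ)=(xiTθ−yi)2/2, whose per-sample Hessian is the outer product xi xiT; this toy model is an assumption made for the example, not the network of the disclosure.

```python
import numpy as np

def stochastic_hessian(X, batch):
    """Stochastic Hessian H_S = (1/|S|) * sum over i in S of the per-sample
    Hessians x_i x_i^T of the least-squares loss (independent of theta here)."""
    S = np.asarray(batch)
    XS = X[S]                      # rows x_i for i in the batch
    return XS.T @ XS / len(S)      # (1/|S|) * sum of outer products
```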
Starting at step 105, a loss function corresponding to the deep neural network is defined. As is generally understood in the art, a loss function is used to guide the training process of a deep neural network. Various loss functions known in the art (e.g., Cross-Entropy, Mean Squared Error, etc.) may be used with the techniques described herein, as well as custom loss functions designed for particular datasets or applications. In general, the loss function of the deep neural network will be known in advance based on the characteristics of the deep neural network. Thus, defining the loss function at step 105 may be simply a matter of specifying the details of the loss function.
At step 110, various inputs are received, for example, as parameters supplied by a user. These inputs comprise a training set comprising labeled pairs (xi, yi), i=1, . . . , n, an initial iterate Θ0, and an initial CG starting point d0. Additionally, configuration information is supplied indicating a CG iteration limit kmax, a constant c∈(0,1), and a sample size β∈[n]. Finally, at step 115, the inputs are set to initial values, as necessary.
Steps 120-135 illustrate an optimization method which iteratively minimizes the loss function over a plurality of iterations. At step 120, the steepest direction of the loss function is calculated by determining the gradient of the loss function at the current parameter values; that is, the full gradient gt=∇F(Θt) is evaluated. Next, a batch of training samples St⊆[n] is selected at step 125. This batch of samples may be created, for example, using a random sampling of the plurality of training samples such that |St|=β. Such a random sampling may be performed, for example, a single time during the first iteration of the optimization method. Alternatively, the batch can be resampled during each iteration.
A matrix-free CG solver is applied at step 130 to obtain an inexact solution dt to a linear system defined by the steepest direction of the loss function and a stochastic Hessian matrix with respect to the batch of samples (i.e., the possibly indefinite linear system HSt d=−gt). A descent direction pt is then determined based on the inexact solution dt and the steepest direction of the loss function.
The current parameter values are updated at step 135 based on the descent direction. In some embodiments, to ensure sufficient reduction of the loss function at each iteration, the current parameter values may also be updated using a learning rate calculated using the steepest direction of the loss function and the descent direction. This learning rate may be calculated, for example, using an Armijo line search method or a Goldstein line search method. Examples of generic implementations of these methods are described in Nocedal J. and Wright S. J., “Numerical Optimization,” Springer Series in Operations Research and Financial Engineering, 2nd Edition, 2006. Thus, the learning rate ηt may be selected as the largest element in the set {1, c, c2, . . . } such that F(Θt+ηtpt)≤F(Θt)+cηtgtTpt. Updating the parameters at step 135 is then a matter of setting Θt+1=Θt+ηtpt. The optimization method then repeats again starting at step 120 until convergence or a desired number of steps is performed.
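The learning-rate selection at step 135 can be sketched as follows; this is a minimal illustration of the stated rule, with the same constant c serving both as the backtracking ratio and the sufficient-decrease factor, as in the condition above.

```python
import numpy as np

def armijo_rate(F, theta, g, p, c=0.5, max_halvings=50):
    """Largest eta in {1, c, c^2, ...} with F(theta + eta*p) <= F(theta) + c*eta*g^T p."""
    eta = 1.0
    f0, gTp = F(theta), float(g @ p)
    for _ in range(max_halvings):
        if F(theta + eta * p) <= f0 + c * eta * gTp:
            return eta
        eta *= c
    raise RuntimeError("no step size satisfying the Armijo condition was found")
```

Given the selected rate, the update is simply `theta = theta + eta * p`.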
Following the optimization method, at step 140, the current parameter values are stored in relationship to the deep neural network. More specifically, the final parameter values are stored in a computer readable medium such that they can be used during deployment of the deep neural network on real-world data.
Note that, unlike truncated Newton-CG methods used in conventional systems, the method 100 considers negative curvature information indicated by the stochastic Hessian matrix. The method 100 also utilizes the stochastic Hessian-vector product, but there is no need to evaluate the full Hessian, which is required by saddle-free Newton (SFN) methods. Pseudocode for an example implementation of the SINNC algorithm is set forth in the accompanying drawings.
To train a neural network with second-order methods, the stochastic Hessian matrix and the stochastic generalized Gauss-Newton matrix are adopted as approximations of the Hessian matrix, and a stochastic quadratic approximation model is built upon them. Because training a deep neural network always involves a very large number of parameters, computing the exact minimizer of the quadratic approximation is prohibitive. Instead, we try to achieve a reasonable inexact solution in a computationally cost-effective manner. Because the conjugate gradient (CG) method achieves an increasingly accurate solution over several iterations, the techniques described herein apply CG to minimize the quadratic model.
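The matrix-free character of these methods rests on Hessian-vector products. The sketch below approximates Hv with a central finite difference of the gradient; in deep learning frameworks the same product is typically obtained exactly via automatic differentiation, so the finite-difference form here is only an illustrative stand-in.

```python
import numpy as np

def hessian_vector_product(grad, theta, v, eps=1e-5):
    """Approximate H v without forming H, via a central difference of the gradient."""
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2.0 * eps)
```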
A known deficiency of the CG method is that it becomes unstable when an indefinite Hessian matrix is encountered during the minimization of the quadratic model. The reason is that, with an indefinite Hessian matrix, a conjugate direction may not be found. Several strategies have been proposed to deal with this deficiency: modify the indefinite Hessian matrix so that it becomes positive definite and apply the CG solver afterward; apply a trust region approach, which can always find a descent direction; or use a truncated Newton method, which terminates the CG iteration whenever negative curvature is encountered. In embodiments of the present invention, an early-terminated CG solver is applied in order to find an inexact solution for the quadratic model. With a good initial point, one can build a sequence of conjugate directions, from which the residual of the system is guaranteed to be reduced until the termination condition is satisfied.
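The early-terminated CG strategy just described can be sketched as follows. The termination rules (residual tolerance, negative-curvature exit) are illustrative; the solver sees the Hessian only through the product hvp(v).

```python
import numpy as np

def early_terminated_cg(hvp, b, tol=1e-8, max_iter=100):
    """Matrix-free CG for H d = b given only hvp(v) = H v; terminates early when
    negative curvature is encountered, returning the iterate built so far."""
    d = np.zeros_like(b)
    r = b.copy()                   # residual b - H d (d starts at zero)
    p = r.copy()                   # conjugate direction
    rs = float(r @ r)
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:
            break
        Hp = hvp(p)
        curvature = float(p @ Hp)
        if curvature <= 0.0:       # negative curvature direction detected
            break                  # stop early; d keeps the residual reduction so far
        alpha = rs / curvature
        d += alpha * p
        r -= alpha * Hp
        rs_new = float(r @ r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```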
Starting at step 405, a loss function corresponding to the deep neural network is defined. In some embodiments, the loss function can be specified directly in the source code executing the method; in other embodiments, the loss function may be supplied as an input value to the source code. At step 410, input values are received by the computing system executing the method 400. These input values include, without limitation, a training set of labeled pairs (xi, yi), i=1, . . . , n, an initial iterate of parameter values Θ0, and an initial trust region radius r0∈(0, R). Additionally, constants η0, η1, γ1, γ2, and ϵ are supplied as inputs, where 0<η0<η1<1, 0<γ1<1<γ2, and ϵ>0.
An optimization method is performed at steps 420-440 to iteratively minimize the loss function over a plurality of iterations. Starting at step 420, the gradient for the loss function at the current parameter values is calculated (i.e., gt=∇F(Θt)). Next, at step 425, a batch of training samples St is selected. As with the method 200 discussed above, this batch of samples may be created, for example, using a random sampling of the plurality of training samples.
Then, at step 430, a trust region subproblem is constructed that approximates the loss function using the gradient and a stochastic Hessian matrix of the loss function with respect to the batch of samples. That is, an approximation of F(Θ) at Θt is built using a stochastic Hessian HSt:

mt(d) = F(Θt) + gtTd + (1/2)dTHStd,

which is minimized subject to the trust region constraint ∥d∥≤rt.
Next, at step 435, a descent direction is determined by applying a SteihaugCG solver to the trust region subproblem given the trust region radius. More specifically, an early-terminated SteihaugCG solver is applied to obtain an inexact minimizer of mt(d), denoted herein as dt. The current parameter values and the trust region radius are conditionally updated at step 440 based on a comparison of (i) a true reduction value provided by the loss function given the current parameter values, and (ii) a predicted reduction value provided by the descent direction. Continuing with the terminology used above, the candidate iterate Θt+dt is evaluated and the following ratio is calculated:

ρt = (F(Θt) − F(Θt+dt)) / (mt(0) − mt(dt)).
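The inner solve at step 435 can be sketched as follows: CG on the model mt, exiting to the trust region boundary on negative curvature or when an iterate leaves the region. This is an illustrative sketch of a Steihaug-type solver under those two exit rules, not the disclosed implementation.

```python
import numpy as np

def _to_boundary(d, p, radius):
    """Step from d along p to the trust region boundary ||d + tau*p|| = radius."""
    a = float(p @ p)
    b = 2.0 * float(d @ p)
    c = float(d @ d) - radius ** 2
    tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return d + tau * p

def steihaug_cg(hvp, g, radius, tol=1e-8, max_iter=100):
    """Approximately minimize g^T d + 0.5 d^T H d subject to ||d|| <= radius,
    given only the Hessian-vector product hvp(v) = H v."""
    d = np.zeros_like(g)
    r = -g.copy()                  # residual of H d = -g
    p = r.copy()
    rs = float(r @ r)
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:
            break
        Hp = hvp(p)
        curv = float(p @ Hp)
        if curv <= 0.0:            # negative curvature: follow p to the boundary
            return _to_boundary(d, p, radius)
        alpha = rs / curv
        d_next = d + alpha * p
        if float(d_next @ d_next) >= radius ** 2:   # step leaves the region
            return _to_boundary(d, p, radius)
        d = d_next
        r -= alpha * Hp
        rs_new = float(r @ r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```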
Then, based on a comparison of ρt with the constants η0 and η1, the values of Θt+1 and the trust region radius rt+1 are set for the next iteration.
After updating the values, the method 400 then repeats again starting at step 420 until convergence or a desired number of steps is performed. The methodology for setting the values of Θt+1 and rt+1 is set forth in the pseudocode presented in the accompanying drawings.
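The conditional update at step 440 can be sketched as follows. The constants satisfy the stated inequalities 0<η0<η1<1 and 0<γ1<1<γ2, but their particular values and the radius cap are assumptions made for the example.

```python
def trust_region_update(theta, d, rho, radius,
                        eta0=0.1, eta1=0.75, gamma1=0.5, gamma2=2.0, r_max=10.0):
    """Set (theta, radius) for the next iteration from the reduction ratio rho."""
    if rho < eta0:                                   # poor agreement: reject, shrink
        return theta, gamma1 * radius
    theta_new = [t + s for t, s in zip(theta, d)]    # accept the step
    if rho > eta1:                                   # very good agreement: expand
        return theta_new, min(gamma2 * radius, r_max)
    return theta_new, radius                         # moderate agreement: keep radius
```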
In some embodiments, a momentum parameter may be added to SINTR to improve the efficiency of escaping from saddle points. One example algorithm, referred to herein as SINTR+, is shown in the accompanying drawings.
Parallel portions of a deep learning application may be executed on the architecture 900 as “device kernels” or simply “kernels.” A kernel comprises parameterized code configured to perform a particular function. The parallel computing platform is configured to execute these kernels in an optimal manner across the architecture 900 based on parameters, settings, and other selections provided by the user. Additionally, in some embodiments, the parallel computing platform may include additional functionality to allow for automatic processing of kernels in an optimal manner with minimal input provided by the user.
The processing required for each kernel is performed by a grid of thread blocks (described in greater detail below). Using concurrent kernel execution, streams, and synchronization with lightweight events, the architecture 900 can process multiple kernels in parallel.
The device 910 includes one or more thread blocks 930 which represent the computation units of the device 910. The term thread block refers to a group of threads that can cooperate via shared memory and synchronize their execution to coordinate memory accesses.
Continuing with reference to
Each thread can have one or more levels of memory access. For example, each thread may have access to its own local registers, to shared memory visible to the other threads in its thread block, and to global memory accessible across the entire device 910.
The embodiments of the present disclosure may be implemented with any combination of hardware and software. For example, aside from the parallel processing architecture presented above, standard computing platforms (e.g., servers, desktop computers, etc.) may be specially configured to perform the techniques discussed herein.
While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
An executable application, as used herein, comprises code or machine readable instructions for conditioning the processor to implement predetermined functions, such as those of an operating system, a context data acquisition system or other information processing system, for example, in response to user command or input. An executable procedure is a segment of code or machine readable instruction, sub-routine, or other distinct section of code or portion of an executable application for performing one or more particular processes. These processes may include receiving input data and/or parameters, performing operations on received input data and/or performing functions in response to received input parameters, and providing resulting output data and/or parameters.
A graphical user interface (GUI), as used herein, comprises one or more display images, generated by a display processor and enabling user interaction with a processor or other device and associated data acquisition and processing functions. The GUI also includes an executable procedure or executable application. The executable procedure or executable application conditions the display processor to generate signals representing the GUI display images. These signals are supplied to a display device which displays the image for viewing by the user. The processor, under control of an executable procedure or executable application, manipulates the GUI display images in response to signals received from the input devices. In this way, the user may interact with the display image using the input devices, enabling user interaction with the processor or other device.
The functions and process steps herein may be performed automatically or wholly or partially in response to user command. An activity (including a step) performed automatically is performed in response to one or more executable instructions or device operation without user direct initiation of the activity.
The system and processes of the figures are not exclusive. Other systems, processes and menus may be derived in accordance with the principles of the invention to accomplish the same objectives. Although this invention has been described with reference to particular embodiments, it is to be understood that the embodiments and variations shown and described herein are for illustration purposes only. Modifications to the current design may be implemented by those skilled in the art, without departing from the scope of the invention. As described herein, the various systems, subsystems, agents, managers and processes can be implemented using hardware components, software components, and/or combinations thereof. No claim element herein is to be construed under the provisions of 35 U.S.C. 112(f) unless the element is expressly recited using the phrase “means for.”
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2018/027215 | 4/12/2018 | WO | 00 |