Deep neural networks can be used for a variety of machine learning tasks, such as classification (e.g., image classification), speech enhancement, and/or the like. One example of a deep neural network is a residual network (ResNet, see, e.g., He, Kaiming et al., “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA: IEEE. pp. 770-778). One issue with ResNets, as well as other deep neural networks, is training: as the number of layers in the deep neural network grows, the latency for training increases.
In some example embodiments, there may be provided a method that includes receiving, at a machine learning model, an input for a task of the machine learning model, wherein the machine learning model comprises a plurality of residual blocks augmented with a plurality of augmented weight blocks that sample a plurality of intermediate features from the plurality of residual blocks; applying the input to the machine learning model to perform the task, wherein the applying comprises applying the plurality of intermediate features, which are obtained from the plurality of residual blocks, to the plurality of augmented weight blocks to form a plurality of intermediate outputs; and generating an output of the machine learning model, wherein the output is generated based at least on a combination of the plurality of intermediate outputs.
In some variations of the methods, systems, and computer program products, one or more of the following features can optionally be included in any feasible combination. The task may comprise image classification, object localization, echo cancellation, and/or speech enhancement. The input may be applied to a first linear transformation block to increase a dimensionality of the input. A residual block may include a residual input that is fed forward and summed with the residual input applied to a first nonlinear block and a first linear block to form a residual output. A first intermediate feature may be obtained at a first output of the first nonlinear block. The plurality of augmented weight blocks may each comprise a fully connected neural network and/or a convolutional layer. The machine learning model may be trained using sparse stochastic gradient descent. In response to sparse stochastic gradient descent converging to a first solution during training, one or more weights (which are smaller than a first threshold value) may be set to zero. In response to sparse stochastic gradient descent converging to a second solution during training of remaining non-zero weights, one or more remaining non-zero weights (that are smaller than a second threshold value) may be set to zero.
The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims.
The accompanying drawings, which are incorporated in and constitute a part of this specification, show certain aspects of the subject matter disclosed herein and, together with the description, help explain some of the principles associated with the disclosed implementations. In the drawings,
The ML model 100 may provide a low-complexity nonlinear estimation framework for a wide range of ML tasks, such as supervised learning tasks including, for example, image classification, object localization, echo (or feedback) cancellation, speech enhancement, and/or the like.
Referring to
In the example of
Although some of the examples herein refer to an input x 102 in the form of an image, other types of data and other types of ML tasks may be implemented as well during training or inference of the ML model.
In some embodiments, the ML model 100 includes the augmented weight blocks 106A-L (H0-HL). Each of the augmented weight blocks 106A (H0), 106B (H1), 106C (H2), and so forth through 106L (HL) may be implemented as a linear computation block (e.g., a linear transformation block). For example, the linear computation block may comprise a fully connected layer of a neural network or a convolutional layer. In the case of an image for example, the “convolutional layer” computes a convolution of the input image using a kernel filter to extract features as output. The phrase “fully connected layer” refers to a neural network layer in which each neuron applies a linear transformation to the input vector through a weights matrix, so every input of the input vector influences every output of the output vector.
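By way of illustration only, the two kinds of linear computation block described above may be sketched as follows. The function names, the one-dimensional signal, and the numeric values are hypothetical and chosen for brevity; the sliding-window product is written in the cross-correlation form commonly used for convolutional layers, which is an assumption rather than a requirement of the ML model 100.

```python
def fully_connected(weights, x):
    """Linear block as a matrix-vector product: every input reaches every output."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in weights]


def conv1d_valid(kernel, x):
    """Linear block as a 1-D sliding-window product (cross-correlation form),
    computed without padding, so each output sees only a local window."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]


# Hypothetical intermediate feature and weights, for illustration only.
v = [1.0, 3.0, 6.0]
H_fc = [[1.0, 0.0, -1.0],
        [0.5, 0.5, 0.5]]

print(fully_connected(H_fc, v))      # each output is a weighted sum of all inputs
print(conv1d_valid([1.0, -1.0], v))  # each output depends on two adjacent inputs
```

Both operations are linear in the input, which is why either may serve as an augmented weight block.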
The augmented weight blocks 106A (H0), 106B (H1), 106C (H2), and so forth through 106L (HL) can be viewed as augmented layers to a more traditional ResNet, such as the residual blocks at 115B-115D.
Referring again to
Referring again to the intermediate feature 108B (v1), it is applied (e.g., as a matrix-vector multiplication, matrix multiplication, an operation by a fully connected layer, or an operation by a convolutional layer) to augmented weight block H1 106B, the output of which (along with the sum of the other augmented weight block outputs (referred to herein as “intermediate outputs” 150A-L)) contributes to the output (ŷ) 104. Similarly, the intermediate features 108C-L are obtained by sampling the output of the nonlinear computation blocks 112C-L (G2-GL). Thus, the output (ŷ) 104 of the ML model 100 is a function of the intermediate features 108A-L (labeled v0-vL) applied to the augmented weight blocks 106A-L (labeled H0-HL).
The nonlinear computation blocks G1-GL 112B-L may each be implemented using a nonlinear function. For example, nonlinear computation blocks G1-GL 112B-L may be implemented with a cascade (e.g., series) of a batch normalization (BN) layer, a Rectified Linear Unit (ReLU) layer, a convolutional (CONV) layer, a BN layer, and a ReLU layer (or a subset of these layers).
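By way of illustration only, the cascade of BN, ReLU, CONV, BN, and ReLU layers may be sketched on a small batch of one-dimensional signals. This sketch makes several simplifying assumptions that are not taken from the description: the batch normalization uses batch statistics with unit scale and zero shift, the convolution is a zero-padded one-dimensional cross-correlation with a single odd-length kernel, and all function names are hypothetical.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature position across the batch (scale 1, shift 0 assumed)."""
    n, dim = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dim)] for row in batch]


def relu(batch):
    """Element-wise rectified linear unit."""
    return [[max(0.0, value) for value in row] for row in batch]


def conv1d_same(batch, kernel):
    """Zero-padded 1-D cross-correlation that preserves the signal length
    (odd-length kernel assumed)."""
    pad = len(kernel) // 2
    out = []
    for row in batch:
        padded = [0.0] * pad + list(row) + [0.0] * pad
        out.append([sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
                    for i in range(len(row))])
    return out


def nonlinear_block(batch, kernel):
    """Cascade BN -> ReLU -> CONV -> BN -> ReLU, one possible form of a block G_i."""
    h = batch_norm(batch)
    h = relu(h)
    h = conv1d_same(h, kernel)
    h = batch_norm(h)
    return relu(h)


# Hypothetical batch of two signals and a hypothetical smoothing kernel.
features = nonlinear_block([[1.0, 2.0, 3.0], [3.0, 2.0, 5.0]], [0.5, 1.0, 0.5])
print(features)
```

Because the final layer is a ReLU, every sampled intermediate feature in this sketch is nonnegative.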
As noted, the output (ŷ) 104 of the ML model 100 is a function of the intermediate features 108A-L applied to the augmented weight blocks 106A-L, but unlike a traditional ResNet, the ML model 100 may require fewer parameters (e.g., parameters associated with the weights of H0-HL 106A-L, G1-GL 112B-L, and W0-WL−1 110A-110D). Fewer parameters may allow the ML model to be deployed with a smaller memory footprint at a device. Moreover, fewer parameters may mean less latency associated with the training and with inferences.
In some embodiments, the ML model 100 may be trained using sparse stochastic gradient descent (SSGD). Alternatively, the training may include finding the corresponding low-complexity approximate estimator via weight pruning and retraining.
In the example of
Each residual block may be expressed as

$$x_i = x_{i-1} + W_i G_i(x_{i-1}; \theta_i)$$

wherein i corresponds to the ith residual block, so residual block 115B has i equal to 1, and so forth through L. To illustrate further, the output x1 of the residual block 115B corresponds to xi, while the input x0 of the residual block 115B corresponds to xi-1. WiGi is composed of the nonlinear function Gi (which as noted may be implemented as a cascade of a batch normalization (BN) layer, a ReLU layer, a convolutional (CONV) layer, a BN layer, and a ReLU layer) and the linear function block Wi (e.g., a fully connected or a convolutional (CONV) layer). The Wi is a real-valued weight matrix that forms a linear transformation, and Gi (xi-1; θi) is a function implemented by a neural network or deep neural network (DNN) with parameters θi for all i ∈ {1, 2, . . . , L}. θi denotes the set of parameters used by Gi. When θi is changed, the function Gi may be a different function. For example, during training, the performance of Gi may be maximized, and this optimization (or maximization) procedure may be performed by changing θi. The multidimensional expansion x0=W0 x of the input x 102 to the ML model 100 uses, as noted, a linear transformation (e.g., a CONV layer) with a weight matrix W0. The output ŷ 104, with $\hat{y} \in \mathbb{R}^{N_o}$ (or ŷL-A-ResNEst to indicate L blocks), of the ML model 100 may be defined as follows:

$$\hat{y} = \sum_{i=0}^{L} H_i v_i \qquad (2)$$
Referring to equation 2 above, the output ŷ 104 is based on intermediate features 108A-L (v0-vL) applied as input to the augmented weight blocks 106A-L (H0-HL) of the augmented layers. Specifically, the output ŷ 104 represents the aggregate (or sum) of the intermediate outputs 150A-L, which are formed by the augmented weight blocks applied (e.g., as a matrix-vector multiplication, matrix multiplication, an operation by a fully connected layer, or an operation by a convolutional layer) to a corresponding intermediate feature 108A-L.
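By way of illustration only, the forward computation described above (each residual block adding Wi applied to Gi of its input, and the output being the sum of the intermediate outputs Hi vi) may be sketched as follows. This is a minimal sketch under stated assumptions: v0 is taken to be the input x, each Gi is passed in as a plain callable, WL is not used because no further residual block follows GL, and the function names and numeric values are hypothetical.

```python
def matvec(matrix, vector):
    """Matrix-vector product with the matrix given as a list of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def vec_add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [p + q for p, q in zip(a, b)]


def relu(vector):
    """Stand-in nonlinear block used only for this illustration."""
    return [max(0.0, value) for value in vector]


def a_resnest_forward(x, W, G, H):
    """Forward pass sketch.

    W: [W_0, ..., W_{L-1}] linear blocks (W_0 expands the input).
    G: [G_1, ..., G_L] nonlinear blocks given as callables.
    H: [H_0, ..., H_L] augmented weight blocks.
    """
    num_blocks = len(G)
    y_hat = matvec(H[0], x)          # intermediate output for v_0 (assumed v_0 = x)
    x_i = matvec(W[0], x)            # x_0 = W_0 x
    for i in range(1, num_blocks + 1):
        v_i = G[i - 1](x_i)          # intermediate feature v_i = G_i(x_{i-1})
        y_hat = vec_add(y_hat, matvec(H[i], v_i))    # accumulate H_i v_i
        if i < num_blocks:
            x_i = vec_add(x_i, matvec(W[i], v_i))    # x_i = x_{i-1} + W_i v_i
    return y_hat


# Hypothetical two-block example with a scalar output.
identity = [[1.0, 0.0], [0.0, 1.0]]
y_hat = a_resnest_forward(
    x=[1.0, -2.0],
    W=[identity, identity],
    G=[relu, relu],
    H=[[[1.0, 1.0]], [[2.0, 0.0]], [[1.0, 1.0]]],
)
print(y_hat)
```

In this sketch the prediction never depends solely on the final residual representation; every intermediate feature contributes through its own augmented weight block.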
In the notation used, M refers to an expansion factor, and No refers to the output dimension of the ML model. As noted above, the linear computational block W0 110A is used to expand the input x to a higher dimension, which in this example is M. Moreover, the number of blocks L is a nonnegative integer, which in the example of
The ML model 100 may be trained using supervised learning as noted above. To train the ML model 100, the objective is to solve an empirical risk minimization problem represented as follows:
$$\min_{W_0, \ldots, W_{L-1},\; \theta_1, \ldots, \theta_L,\; H_0, \ldots, H_L} \; \mathcal{A}(W_0, \ldots, W_{L-1}, \theta_1, \ldots, \theta_L, H_0, \ldots, H_L) = \frac{1}{N} \sum_{n=1}^{N} \ell\left(\hat{y}(x_n), y_n\right) \qquad (5)$$

wherein ℓ denotes the loss function and $\{(x_n, y_n)\}_{n=1}^{N}$ denotes the training data, and wherein W0-WL−1, θ1-θL, and H0-HL are the parameters of the ML model 100 (e.g., the A-ResNEst model) over which the empirical risk across the data set is minimized (where $\mathcal{A}$ represents the empirical risk function, or average loss function, across a given data set). During training for example, a goal is to find these parameters such that the empirical risk in Equation (5) is minimized or approximately minimized. Regularization terms may be added to the loss function to find parameters achieving a threshold level of performance.
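By way of illustration only, the empirical risk (the average of the loss over the training pairs) may be computed as in the following sketch. The squared-error loss, the toy one-parameter model, and the data values are assumptions made for the example and are not taken from the description.

```python
def squared_error(y_hat, y):
    """One possible loss function: squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(y_hat, y))


def empirical_risk(model, data, loss):
    """Average loss of `model` over the training pairs (x_n, y_n)."""
    return sum(loss(model(x), y) for x, y in data) / len(data)


# Hypothetical model and training data, for illustration only.
def toy_model(x):
    return [2.0 * x[0]]


training_data = [([1.0], [2.0]), ([2.0], [3.0])]
print(empirical_risk(toy_model, training_data, squared_error))
```

Training then amounts to searching for the model parameters that make this quantity as small as possible.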
Table 1 below shows an example of results indicating that the ML model 100 (column labeled “A-ResNEst”) in general exhibits competitive classification accuracy with fewer parameters when compared to “Standard ResNets”. In Table 1, the classification accuracy corresponds to an average of 7 trials with different initializations, the parameters of the networks are shown in millions (M), and the dataset used for classification is CIFAR-10. Each row of Table 1 represents a different model architecture, such as WRN-16-8, WRN-40-4, etc. In the first row of Table 1, the ML model 100 (“A-ResNEst”) had a classification accuracy of 95.29% while the Standard ResNet had an accuracy of 95.56%, but the ML model 100 (“A-ResNEst”) used 8.7M parameters, fewer than the 11M parameters of the Standard ResNet. The ML model 100 (A-ResNEsts) in most cases has fewer parameters than the standard ResNets because it does not have the layers WL and WL+1, and the number of prediction weights in H0, H1, . . . , HL is usually not larger than the number of weights in WL and WL+1.
At 302, the ML model may receive an input for a task of the machine learning model, wherein the machine learning model comprises a plurality of residual blocks augmented with a plurality of augmented weight blocks that sample a plurality of intermediate features from the plurality of residual blocks. For example, the ML model 100 may receive an input, such as data (e.g., one or more images or portions thereof) at x 102. During training for example, the inputs may include labels to train the ML model. During inference, the inputs will not include labels, so the ML model 100 can predict or estimate the output (ŷ) 104 (e.g., a classification or label for the input). The machine learning model may include a plurality of residual blocks, such as blocks 115B-D, that are augmented with a plurality of augmented weight blocks (H0-HL) 106A-L. The augmented weight blocks sample a plurality of intermediate features (v0-vL) 108A-L from the plurality of residual blocks.
At 304, the input is applied to the machine learning model to perform the task, wherein the applying comprises applying the plurality of intermediate features, which are obtained from the plurality of residual blocks, to the plurality of augmented weight blocks to form a plurality of intermediate outputs. For example, the input x 102 is applied to the ML model, so that the ML model can perform its task (whether during the training phase or the inference phase of the ML model). The applying may include applying the plurality of intermediate features (v0-vL) 108A-L (which are obtained or sampled from the plurality of residual blocks 115B-D as well as GL 112L and the input x as v0) to the plurality of augmented weight blocks (H0-HL) 106A-L to form a plurality of intermediate outputs 150A-L.
At 306, an output of the machine learning model may be generated, wherein the output is generated based at least on a combination of the plurality of intermediate outputs. For example, the ML model 100 may generate the output (ŷ) 104. This output (ŷ) 104 may be generated as a combination of the plurality of intermediate outputs 150A-L.
In some embodiments, the task being learned as part of training or the task being performed as an inference includes image classification, object localization, echo cancellation, and/or speech enhancement.
In some embodiments, the input x 102 is applied to a first linear transformation block W0 110A to increase the dimensionality of the input x by forming x0.
In some embodiments, a residual block, such as residual block 115B, may include a residual input, such as x0 (as depicted at
In some embodiments, a first intermediate feature, such as intermediate feature 108B, is obtained at a first output of the first nonlinear block, such as G1 112B.
In some embodiments, the plurality of augmented weight blocks, such as H0-HL 106A-L, each comprise a fully connected neural network and/or a convolutional layer.
In some embodiments, the machine learning model is trained using sparse stochastic gradient descent. When this is the case, in response to sparse stochastic gradient descent converging to a first solution during training, one or more weights (which are smaller than a first threshold value) may be set to zero (see, e.g.,
In some embodiments, the training of the ML model uses gradient descent, such as stochastic gradient descent or sparse stochastic gradient descent (SSGD). Gradient descent refers to an iterative optimization that finds a local minimum of an objective (e.g., loss) function in n-dimensional space, while stochastic gradient descent refers to an iterative optimization of the objective function that uses a stochastic approximation (e.g., an estimate) of the gradient descent optimization. Rather than use the actual gradient (as determined from the entire dataset as in a gradient descent), the stochastic gradient descent uses an estimate of the gradient (calculated from a randomly selected subset of the data).
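By way of illustration only, the difference between a gradient descent step and a stochastic gradient descent step may be sketched on a toy one-parameter least-squares problem. The model, the data, the learning rate, and the batch size are all hypothetical; the only point of the sketch is that the stochastic step computes the same gradient expression on a randomly selected subset of the data.

```python
import random


def gradient(w, samples):
    """Gradient with respect to w of the mean squared error of y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


def gradient_descent_step(w, data, lr):
    """Uses the actual gradient, computed from the entire data set."""
    return w - lr * gradient(w, data)


def sgd_step(w, data, lr, batch_size, rng):
    """Uses an estimate of the gradient from a randomly selected subset."""
    batch = rng.sample(data, batch_size)
    return w - lr * gradient(w, batch)


# Hypothetical noise-free data generated by y = 3 * x.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
rng = random.Random(0)

w = 0.0
for _ in range(200):
    w = sgd_step(w, data, lr=0.02, batch_size=2, rng=rng)
print(w)  # approaches the generating slope of 3
```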
In the case of sparse stochastic gradient descent, a sparse matrix is used as further described herein. Let J(θ) be a cost function of a neural network with parameters θ. Let θk denote the parameters in the k-th layer of the neural network. J is defined as a regularized empirical risk given by
wherein ℓ is the loss function, $\{(x_n, y_n)\}_{n=1}^{N}$ is the training data set, ŷ is the output of the neural network, and λ is the regularization constant. To iteratively solve the optimization problem in (6), the following iterative update rule may be used:
wherein $S_t^k$ is the proportionate (diagonal) matrix of the k-th layer given by
for i=1, 2, . . . , |θk|. This type of update rule may be referred to as a sparse stochastic gradient descent (SSGD). The hyperparameters of SSGD may include p and c such that 1.0≤p≤2.0 and c≥0. The SSGD is built on top of the regularized empirical risk minimization problem. In the case of SSGD and Equations 7 and 8, at each step in the optimization for example, the SSGD algorithm may compute a diagonal matrix whose diagonal elements are positive and proportionate to the magnitude of the weights at a current step, and a larger weight may be assigned a larger positive element in the diagonal matrix. Next, the SSGD algorithm may apply the diagonal matrix to the stochastic gradient, wherein the direction of the stochastic gradient may (or may not) change. The learning rate (or step size) may be applied to the multiplication of the diagonal matrix and the stochastic gradient to update the parameters of the neural network. In this example, the hyperparameters p and c may be tuned for the SSGD algorithm, or the hyperparameters can be fixed or time-varying during the optimization process. Table 2 below depicts some example results using SSGD. In some implementations, a nonzero λ can make a large difference with respect to performance, such as accuracy of classification or regression error. Table 2 uses the CIFAR-10 data set with the top-1 test accuracies (%) for different types of deep neural networks, such as VGG-19, ResNet-20, ResNet-56, and wide ResNet (WRN) 16-8. When the weight decay (λ) is set to 0, the accuracies drop substantially.
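By way of illustration only, one update step of the kind described above may be sketched as follows. The particular form of the diagonal matrix used here, (|θi| + c) raised to the power (2 − p) and normalized to have unit mean, is an assumption chosen only because it yields positive diagonal elements that grow with the weight magnitude for 1.0≤p<2.0 and c>0; it is not asserted to be the expression of the description. The weight decay term likewise assumes a squared-norm penalty, and all function names are hypothetical.

```python
def proportionate_diagonal(theta, p=1.5, c=1e-3):
    """Diagonal elements of an ASSUMED proportionate matrix for one layer.

    Each element grows with |theta_i|; the elements are normalized so that
    their mean equals 1. With p = 2 every element equals 1.
    """
    raw = [(abs(w) + c) ** (2.0 - p) for w in theta]
    mean_raw = sum(raw) / len(raw)
    if mean_raw == 0.0:              # all-zero layer with c = 0: fall back to identity
        return [1.0] * len(theta)
    return [r / mean_raw for r in raw]


def ssgd_step(theta, stochastic_grad, lr, p=1.5, c=1e-3, weight_decay=0.0):
    """One sketch step: scale the (regularized) stochastic gradient element-wise
    by the diagonal matrix, then apply the learning rate."""
    diag = proportionate_diagonal(theta, p, c)
    return [w - lr * s * (g + weight_decay * w)
            for w, g, s in zip(theta, stochastic_grad, diag)]


# Hypothetical layer weights and stochastic gradient, for illustration only.
theta = [0.1, 1.0]
print(proportionate_diagonal(theta, p=1.0, c=0.0))
print(ssgd_step(theta, [0.5, 0.5], lr=0.1, p=1.0, c=0.0))
```

Under this assumed form, larger weights take larger steps while small weights move slowly, and setting p to 2.0 reduces the step to an ordinary stochastic gradient step.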
Compared with residual networks (ResNets), the ML model 100 (A-ResNEsts) may avoid using (or depending on) the final residual representation for the predicted output ŷ. Instead, the ML model 100 (A-ResNEsts) may apply a linear prediction on top of each residual feature vi via the set of augmented weight matrices Hi.
Because deep neural networks are usually very computationally expensive, it is difficult to deploy them on resource-constrained devices. To address this issue during training of the ML model 100, the estimators (e.g., Wi (110A-L), Hi (106A-L), Gi (112A-L)) are implemented with low complexity by finding sparse (or approximate) estimators that have lower complexity.
At 312, the estimator may be initialized using, for example, samples obtained from a distribution (e.g., a random distribution). The estimator refers to, for example, a neural network, such as the ML model 100 (A-ResNEst). At the initialization stage, the parameters of the ML model may be randomly sampled from a probability distribution (e.g., a Gaussian distribution or a uniform distribution).
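By way of illustration only, the random initialization at 312 may be sketched as follows. The function name, the zero-mean choice, and the default scale are assumptions made for the example; the description only states that the parameters may be sampled from a probability distribution such as a Gaussian or uniform distribution.

```python
import random


def init_weights(rows, cols, rng, distribution="gaussian", scale=0.1):
    """Sample a rows x cols weight matrix from a zero-mean distribution."""
    if distribution == "gaussian":
        def draw():
            return rng.gauss(0.0, scale)
    elif distribution == "uniform":
        def draw():
            return rng.uniform(-scale, scale)
    else:
        raise ValueError("distribution must be 'gaussian' or 'uniform'")
    return [[draw() for _ in range(cols)] for _ in range(rows)]


# Hypothetical shapes: one expansion block and one augmented weight block.
rng = random.Random(0)
W0 = init_weights(4, 2, rng, "gaussian")
H0 = init_weights(1, 2, rng, "uniform")
print(len(W0), len(W0[0]), len(H0), len(H0[0]))
```

Passing an explicit seeded generator keeps repeated trials with different initializations reproducible.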
At 314, as part of the training, a cost function of a neural network is applied to all of the weights of Wi (110A-D), the weights of Hi (106A-L), and the weights of Gi (112B-112L). For example, a cost function, such as the cost function J(θ) of equation 6 above, is applied to the weights of Wi (110A-L), the weights of Hi (106A-L), and the weights of Gi (112B-112L).
When the SSGD converges to a solution at 316 as part of the training, the weights in each layer that are smaller than a given threshold value are set to zero. Here, the term layer refers to a computation block, such as Gi, Hi, or Wi. Moreover, the threshold in each layer may be different. Alternatively, or additionally, the threshold may be the same. The zero weights permanently remain zero after this step during the subsequent training (e.g., re-training).
At 318, the cost function is again applied to all of the remaining non-zero weights of Wi, Hi, and Gi as part of the re-training of the remaining non-zero weights. At this point, once the SSGD converges to a solution at 320, the solution is selected as a low-complexity approximation of the estimator (e.g., a low-complexity A-ResNEst). At 318, the retraining of the remaining non-zero weights fine-tunes the performance of the ML model 100.
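By way of illustration only, the sequence of initializing, training, thresholding, and retraining may be sketched on a toy problem. Several substitutions are made for brevity and are assumptions rather than the described method: a three-weight linear model stands in for the ML model 100, full-batch gradient descent stands in for SSGD, the comparison against the threshold uses the weight magnitude, and the data, threshold, and function names are hypothetical.

```python
import random


def predict(w, x):
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def mse_gradient(w, data):
    """Gradient of the mean squared error of a linear model."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = predict(w, x) - y
        for j, x_j in enumerate(x):
            grad[j] += 2.0 * err * x_j / len(data)
    return grad


def train(w, data, mask, lr=0.5, steps=200):
    """Gradient descent in which masked (zeroed) weights stay at zero."""
    for _ in range(steps):
        grad = mse_gradient(w, data)
        w = [(w_j - lr * g_j) * m_j for w_j, g_j, m_j in zip(w, grad, mask)]
    return w


def prune_and_retrain(data, dim, threshold, seed=0):
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]      # initialize (cf. 312)
    w = train(w, data, [1.0] * dim)                    # train to convergence (cf. 314, 316)
    mask = [0.0 if abs(w_j) < threshold else 1.0 for w_j in w]
    w = [w_j * m_j for w_j, m_j in zip(w, mask)]       # set small weights to zero
    w = train(w, data, mask)                           # retrain the rest (cf. 318, 320)
    return w, mask


# Hypothetical data generated by y = 2*x0 - 3*x2, so the middle weight is unneeded.
data = [((1.0, 0.0, 0.0), 2.0), ((0.0, 1.0, 0.0), 0.0),
        ((0.0, 0.0, 1.0), -3.0), ((1.0, 1.0, 1.0), -1.0)]
weights, mask = prune_and_retrain(data, dim=3, threshold=0.1)
print(weights, mask)
```

Multiplying each update by the mask is one simple way to keep the zeroed weights at zero throughout retraining.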
In some implementations, the current subject matter may be configured to be implemented in a system 400, as shown in
The processor 410 may be further configured to process instructions stored in the memory 420 or on the storage device 430, including receiving or sending information through the input/output device 440. The memory 420 may store information within the system 400. In some implementations, the memory 420 may be a computer-readable medium. In alternate implementations, the memory 420 may be a volatile memory unit. In yet some implementations, the memory 420 may be a non-volatile memory unit. The storage device 430 may be capable of providing mass storage for the system 400. In some implementations, the storage device 430 may be a computer-readable medium. In alternate implementations, the storage device 430 may be a floppy disk device, a hard disk device, an optical disk device, a tape device, non-volatile solid state memory, or any other type of storage device. The input/output device 440 may be configured to provide input/output operations for the system 400. In some implementations, the input/output device 440 may include a keyboard and/or pointing device. In alternate implementations, the input/output device 440 may include a display unit for displaying graphical user interfaces.
In view of the above-described implementations of subject matter this application discloses the following list of examples, wherein one feature of an example in isolation or more than one feature of said example taken in combination and, optionally, in combination with one or more features of one or more further examples are further examples also falling within the disclosure of this application:
Example 1. A computer-implemented method comprising:
Example 2. The method of Example 1, wherein the task comprises image classification, object localization, echo cancellation, and/or speech enhancement.
Example 3. The method of any of Examples 1-2 further comprising: applying the input to a first linear transformation block to increase a dimensionality of the input.
Example 4. The method of any of Examples 1-3, wherein a residual block includes a residual input that is fed forward and summed with the residual input applied to a first nonlinear block and a first linear block to form a residual output.
Example 5. The method of any of Examples 1-4, wherein a first intermediate feature is obtained at a first output of the first nonlinear block.
Example 6. The method of any of Examples 1-5, wherein the plurality of augmented weight blocks each comprise a fully connected neural network and/or a convolutional layer.
Example 7. The method of any of Examples 1-6 further comprising: training the machine learning model using sparse stochastic gradient descent.
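Example 8. The method of any of Examples 1-7 further comprising: in response to sparse stochastic gradient descent converging to a first solution during training, setting to zero one or more weights smaller than a first threshold value.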
Example 9. The method of any of Examples 1-8 further comprising: in response to sparse stochastic gradient descent converging to a second solution during training of remaining non-zero weights, setting to zero one or more remaining non-zero weights that are smaller than a second threshold value.
Example 10. A system comprising:
Example 11. The system of Example 10, wherein the task comprises image classification, object localization, echo cancellation, and/or speech enhancement.
Example 12. The system of any of Examples 10-11, further comprising: applying the input to a first linear transformation block to increase a dimensionality of the input.
Example 13. The system of any of Examples 10-12, wherein a residual block includes a residual input that is fed forward and summed with the residual input applied to a first nonlinear block and a first linear block to form a residual output.
Example 14. The system of any of Examples 10-13, wherein a first intermediate feature is obtained at a first output of the first nonlinear block.
Example 15. The system of any of Examples 10-14, wherein the plurality of augmented weight blocks each comprise a fully connected neural network and/or a convolutional layer.
Example 16. The system of any of Examples 10-15, further comprising: training the machine learning model using sparse stochastic gradient descent.
Example 17. The system of any of Examples 10-16, further comprising: in response to sparse stochastic gradient descent converging to a first solution during training, setting to zero one or more weights smaller than a first threshold value.
Example 18. The system of any of Examples 10-17, further comprising: in response to sparse stochastic gradient descent converging to a second solution during training of remaining non-zero weights, setting to zero one or more remaining non-zero weights that are smaller than a second threshold value.
Example 19. A non-transitory computer-readable medium including code which when executed by at least one processor causes operations comprising:
Example 20. The non-transitory computer-readable medium of Example 19, further comprising: training the machine learning model using sparse stochastic gradient descent.
The systems and methods disclosed herein can be embodied in various forms including, for example, a data processor, such as a computer that also includes a database, digital electronic circuitry, firmware, software, or in combinations of them. Moreover, the above-noted features and other aspects and principles of the present disclosed implementations can be implemented in various environments. Such environments and related applications can be specially constructed for performing the various processes and operations according to the disclosed implementations or they can include a general-purpose computer or computing platform selectively activated or reconfigured by code to provide the necessary functionality. The processes disclosed herein are not inherently related to any particular computer, network, architecture, environment, or other apparatus, and can be implemented by a suitable combination of hardware, software, and/or firmware. For example, various general-purpose machines can be used with programs written in accordance with teachings of the disclosed implementations, or it can be more convenient to construct a specialized apparatus or system to perform the required methods and techniques.
Although ordinal numbers such as first, second and the like can, in some situations, relate to an order, as used in this document ordinal numbers do not necessarily imply an order. For example, ordinal numbers can be used merely to distinguish one item from another, such as a first event from a second event, without implying any chronological ordering or a fixed reference system (such that a first event in one paragraph of the description can be different from a first event in another paragraph of the description).
The foregoing description is intended to illustrate but not to limit the scope of the invention, which is defined by the scope of the appended claims. Other implementations are within the scope of the following claims.
These computer programs, which can also be referred to as programs, software, software applications, applications, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and/or device, such as for example magnetic discs, optical disks, memory, and Programmable Logic Devices (PLDs), used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid state memory or a magnetic hard drive or any equivalent storage medium. The machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example as would a processor cache or other random access memory associated with one or more physical processor cores.
To provide for interaction with a user, the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT) or a liquid crystal display (LCD) monitor for displaying information to the user and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including, but not limited to, acoustic, speech, or tactile input.
The subject matter described herein can be implemented in a computing system that includes a back-end component, such as for example one or more data servers, or that includes a middleware component, such as for example one or more application servers, or that includes a front-end component, such as for example one or more client computers having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described herein, or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, such as for example a communication network. Examples of communication networks include, but are not limited to, a local area network (“LAN”), a wide area network (“WAN”), and the Internet.
The computing system can include clients and servers. A client and server are generally, but not exclusively, remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The implementations set forth in the foregoing description do not represent all implementations consistent with the subject matter described herein. Instead, they are merely some examples consistent with aspects related to the described subject matter. Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and/or variations can be provided in addition to those set forth herein. For example, the implementations described above can be directed to various combinations and sub-combinations of the disclosed features and/or combinations and sub-combinations of several further features disclosed above. In addition, the logic flows depicted in the accompanying figures and/or described herein do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Other implementations can be within the scope of the following claims.
This application claims priority to U.S. Provisional Application No. 63/286,463 entitled “SYSTEM AND METHODS FOR LOW-COMPLEXITY DEEP LEARNING NETWORKS WITH AUGMENTED RESIDUAL FEATURES” and filed on Dec. 6, 2021, which is incorporated herein by reference in its entirety.
This invention was made with government support under DC015436 and DC015046 awarded by the National Institutes of Health, and 1838897 and CCF2124929 awarded by the National Science Foundation. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US22/80954 | 12/5/2022 | WO |
Number | Date | Country | |
---|---|---|---|
63286463 | Dec 2021 | US |