Emerging vision, speech, and natural language applications have widely adopted deep learning models that achieve state-of-the-art accuracy. Recent industrial efforts have focused on deploying these models on mobile devices. However, real-time applications based on these deep models typically incur unacceptably large latencies and can quickly drain the battery of energy-limited devices. Prior research has therefore proposed model compression techniques, including pruning and quantization, to satisfy the stringent energy and speed requirements of such devices.
Prior work has extensively explored both algorithmic and hardware approaches to reduce the latency and energy consumption of Deep Neural Networks (DNNs). Because the latency and energy consumption of DNNs generally stem from computational cost and memory accesses, prior work in the algorithmic domain focuses mainly on reducing Floating Point Operations (FLOPs) and model size. Some work reduces the number of parameters through weight pruning, while other work introduces structural sparsity via filter pruning for Convolutional Neural Networks (CNNs) to enable speedup on general hardware platforms incorporating CPUs and GPUs. To reduce model size, previous work has also conducted neural architecture search under energy constraints. In addition to these algorithmic advances, prior art has proposed methodologies to achieve fast and energy-efficient DNNs: some works co-design the hardware platform and the architecture of the neural network running on it; some propose more lightweight DNN units for faster inference on general-purpose hardware; and others propose hardware-friendly DNN computation units to enable energy-efficient implementation on customized hardware.
By reducing the weight and activation precision, DNN quantization has proved to be an effective technique to improve the speed and energy efficiency of DNNs on customized hardware, due to its lower computational cost and fewer memory accesses. A DNN with 16-bit fixed-point representation can achieve competitive accuracy compared to the full-precision network.
Uniform quantization approaches enable fixed-point hardware implementation of DNNs. One prior art effort uses only 1 bit for the DNN parameters, turning multiplications into XNOR operations on customized hardware. However, such models must be over-parameterized to maintain high accuracy.
LightNN is a quantization approach that constrains the weights of DNNs to be a sum of k powers of 2, and therefore can use shift and add operations to replace the multiplications between activations and weights. In LightNN-1, every multiplication in the DNN is replaced by a single shift operation, while in LightNN-2, two shifts and an add replace each multiplication. Because shift operations are much more lightweight on customized hardware (e.g., Field-Programmable Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs)), these approaches can achieve faster speed and lower energy consumption, and generally maintain accuracy for over-parameterized models.
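By way of illustration only, the following toy Python sketch (not the patented hardware implementation) shows, assuming integer fixed-point activations and non-negative exponents, how a power-of-2 weight turns a multiplication into a shift, and how a LightNN-2-style weight uses two shifts and one add.

```python
# Toy sketch: power-of-two weights let shifts replace multiplications.
# Assumptions (not from the patent text): integer activations, exponent >= 0.

def shift_multiply(activation: int, sign: int, exponent: int) -> int:
    """Compute activation * (sign * 2**exponent) with a shift instead of a multiply."""
    return sign * (activation << exponent)

x = 7
# LightNN-1 style weight: w = +2**3 = 8  ->  one shift
print(shift_multiply(x, +1, 3))                              # 56
# LightNN-2 style weight: w = 2**3 + 2**1 = 10  ->  two shifts and one add
print(shift_multiply(x, +1, 3) + shift_multiply(x, +1, 1))   # 70
```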
Although the LightNN approaches provide better energy efficiency, they use a single k value (i.e., the number of shifts per multiplication) across the whole network, and therefore lack the flexibility to provide fine-grained trade-offs between energy and accuracy. The energy efficiency of these models also exhibits gaps, making the Pareto front of accuracy and energy discrete. However, a continuous accuracy versus energy/latency trade-off is an important feature for designers targeting different market segments (e.g., IoT devices, edge devices, and mobile devices).
To provide a more flexible Pareto front for the LightNN approaches, each convolutional filter in the present invention is given the freedom to use a different number of shift-and-add operations to approximate its multiplications. A set of free variables k = {k_1, . . . , k_F} is introduced, where each element represents the number of shift-and-add operations for the corresponding convolutional filter. As a result, a more continuous Pareto front can be achieved.
For example, if k is constrained such that k ∈ {1, 2}^F, then the throughput and energy consumption of the new model will lie between those of the first (k = {1}^F) and second (k = {2}^F) versions of the prior art quantization approaches. Formally, the problem being solved is min_{w,k} L(w, k), where L is the loss function and w is the weight vector. However, the commonly adopted stochastic gradient descent (SGD) algorithm does not apply in this case since L is non-differentiable with respect to k.
The present invention uses a differentiable training algorithm with flexible k values, which enables end-to-end optimization with standard SGD. Customizing the k value of each convolutional filter enables a more continuous Pareto front. The end-to-end differentiable training algorithm relies on approximate gradient computation for the non-differentiable operations and on regularization to encourage sparsity. Moreover, the differentiable training approach performs gradual quantization, which can achieve higher accuracy than LightNN-1 without increasing latency. The resulting algorithm therefore provides a continuous Pareto front from which hardware designers can search for a highly accurate model under hardware resource constraints, while the gradual quantization enabled by differentiable training further pushes forward the Pareto-optimal curve.
LightNN Overview
As a quantized DNN model, LightNN constrains the weights of a network to be the sum of k powers of 2, denoted as LightNN-k. Thus, the multiplication between a weight and an activation can be implemented with k shift operations and k-1 additions. Specifically, LightNN-1 constrains the weights to be powers of 2, and uses only a single shift per multiplication. The approximation function used by LightNN-k to quantize a full-precision weight w can be formulated recursively as Q_k(w) = Q_{k-1}(w) + Q_1(w - Q_{k-1}(w)) for k > 1, where Q_1(w) = sign(w) × 2^[log2(|w|)] rounds the weight w to the nearest power of 2.
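As an illustration of the recursive definition above, the following NumPy sketch (an illustration, not the patented code) implements Q_1 and Q_k for plain arrays.

```python
import numpy as np

def quantize_pow2(w):
    """Q_1: round each nonzero weight to the nearest power of two, keeping its sign."""
    w = np.asarray(w, dtype=np.float64)
    out = np.zeros_like(w)
    nz = w != 0
    out[nz] = np.sign(w[nz]) * 2.0 ** np.round(np.log2(np.abs(w[nz])))
    return out

def quantize_k(w, k):
    """Q_k(w) = Q_{k-1}(w) + Q_1(w - Q_{k-1}(w)): a sum of k powers of two."""
    w = np.asarray(w, dtype=np.float64)
    q = quantize_pow2(w)
    for _ in range(k - 1):
        q = q + quantize_pow2(w - q)       # quantize the remaining residual
    return q

w = np.array([0.30, -0.07, 1.60])
print(quantize_k(w, 1))   # [ 0.25   -0.0625  2.    ]
print(quantize_k(w, 2))   # [ 0.3125 -0.0703  1.5   ] (approximately)
```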
LightNN is trained with a modified backpropagation algorithm. In the forward phase of each training iteration, the parameters are first approximated using the Q_k function. Then, in the backward phase, the gradients of the loss with respect to the quantized weights are computed; in the weight update phase, these gradients are applied to the full-precision weights.
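The following simplified sketch (assuming a PyTorch-style setup; the toy objective and tensor shapes are arbitrary, not part of the description above) illustrates this modified backpropagation: the forward pass uses quantized weights, gradients are taken with respect to the quantized weights via a straight-through estimator, and the update is applied to the full-precision weights.

```python
import torch

def round_pow2(x):
    """Round each element to the nearest power of two (sign preserved)."""
    absx = x.abs().clamp_min(1e-12)                  # avoid log2(0)
    return torch.sign(x) * 2.0 ** torch.round(torch.log2(absx))

def quantize_k_ste(w_fp, k):
    # Q_k via k rounds of residual quantization, with a straight-through
    # estimator so gradients reach the full-precision weights unchanged.
    q = torch.zeros_like(w_fp)
    for _ in range(k):
        q = q + round_pow2(w_fp - q)
    return w_fp + (q - w_fp).detach()

# One training iteration on a toy objective.
w_fp = torch.randn(8, requires_grad=True)            # full-precision weights
optimizer = torch.optim.SGD([w_fp], lr=0.1)
x = torch.randn(8)

w_q = quantize_k_ste(w_fp, k=2)                      # forward: quantized weights
loss = ((w_q * x).sum() - 1.0) ** 2                  # toy loss
loss.backward()                                      # backward: d(loss)/d(w_q)
optimizer.step()                                     # update full-precision weights
```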
LightNN has proven to be accurate and energy-efficient on customized hardware. LightNN-2 can generally achieve accuracy close to full-precision DNNs, while LightNN-1 can achieve higher energy efficiency than LightNN-2. Due to the discrete nature of the k values, there exists a gap between LightNN-1 and LightNN-2 with respect to accuracy and energy.
The present invention customizes the k values for each convolutional filter, and thus, achieves a smoother energy-accuracy trade-off to provide hardware designers with more design options.
Differentiable Training
Herein, the quantization function is first defined, and then the end-to-end training algorithm for the present invention is introduced, equipped with a regularization loss to penalize large k values.
Quantization function: The ith filter of the network is denoted as w_i, and the quantization function for the filter w_i as Q_k(w_i|t), where k = max_i k_i is the maximum number of shifts used for the network, and the vector t is a latent variable that controls the approximation (e.g., a set of threshold values). The residual resulting from the approximation is denoted as r_{i,k} = w_i - Q_k(w_i|t). The quantization function is formally defined as:

Q_k(w_i|t) = Σ_{j=0}^{k-1} g(||r_{i,j}||_2, t_j) · R(r_{i,j}),  with r_{i,0} = w_i,

where R(x) = sign(x) × 2^[log2(|x|)] rounds the input variable to the nearest power of 2, [·] is a rounding-to-integer function, and g(x, t_j) = (x > t_j) is an indicator function that equals 1 when x > t_j and 0 otherwise. Under this quantization flow, the residual of filter i is quantized by an additional power-of-2 term only when ||r_{i,j}||_2 > t_j. Therefore, choosing k_i per filter is equivalent to finding optimal thresholds t.
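The following NumPy sketch is one possible interpretation of the definition above (an illustration, not the patented code): each filter keeps adding power-of-2 terms only while the residual norm exceeds the corresponding threshold, so the thresholds t determine k_i per filter.

```python
import numpy as np

def round_pow2(x):
    """R(x): round each nonzero element to the nearest power of two, keeping its sign."""
    out = np.zeros_like(x)
    nz = x != 0
    out[nz] = np.sign(x[nz]) * 2.0 ** np.round(np.log2(np.abs(x[nz])))
    return out

def quantize_filter(w_i, t):
    """Quantize one filter w_i with thresholds t = [t_0, ..., t_{k-1}]."""
    q = np.zeros_like(w_i)
    residual = w_i                              # r_{i,0} = w_i
    for t_j in t:
        if np.linalg.norm(residual) <= t_j:     # indicator (||r_{i,j}||_2 > t_j) is 0
            break                               # this filter uses fewer shifts
        q = q + round_pow2(residual)
        residual = w_i - q                      # next residual r_{i,j+1}
    return q

w_i = np.array([0.3, -0.6, 0.1])
print(quantize_filter(w_i, t=[0.05, 0.05]))     # up to 2 shifts per multiplication
```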
The quantization approach of the present invention targets efficient hardware implementation. Instead of assigning a customized k_i to each individual weight, the present invention customizes the k_i values per filter, and therefore preserves structural sparsity, as shown in the figures.
Differentiable Training: Instead of picking the thresholds t by hand, they are treated as trainable parameters. Therefore, the loss function L(w, t) is a function of both the weights and the thresholds. A straight-through estimator is used to compute the gradient of the loss with respect to the weights: by defining ∂w_i^q/∂w_i = 1, where w_i^q = Q_k(w_i|t) is the quantized w_i, we have ∂L/∂w_i = (∂L/∂w_i^q) · (∂w_i^q/∂w_i) = ∂L/∂w_i^q, which becomes a differentiable expression.
To compute the gradient for the thresholds, i.e., ∂L/∂t_j, the indicator function g(x, t_j) = (x > t_j) is relaxed to a sigmoid function, σ(·), when computing gradients, i.e., ĝ(x, t_j) = σ(x − t_j). In addition, the straight-through estimator is used to compute the gradient for R(x). Thus, the gradient ∂L/∂t_j can be computed by:

∂L/∂t_j = Σ_i (∂L/∂w_i^q) · (∂w_i^q/∂t_j) = Σ_i (∂L/∂w_i^q) · Σ_{l=0}^{k-1} ∂[ĝ(||r_{i,l}||_2, t_l) · R(r_{i,l})]/∂t_j,

where the summand derivatives are 0 for l < j; otherwise, they can be computed with the result of ∂r_{i,l}/∂t_j, which follows recursively from the earlier quantization steps.
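The following simplified PyTorch sketch illustrates the same idea in code (an illustration, not the exact gradient computation above): the hard indicator is relaxed to a sigmoid so the thresholds receive gradients, and the power-of-2 rounding uses a straight-through estimator. The temperature parameter and the use of the relaxation in the forward pass are added simplifications, not part of the description above.

```python
import torch

def round_pow2_ste(x):
    """R(x) with a straight-through estimator (identity gradient)."""
    absx = x.abs().clamp_min(1e-12)                  # avoid log2(0)
    r = torch.sign(x) * 2.0 ** torch.round(torch.log2(absx))
    return x + (r - x).detach()

def soft_quantize_filter(w_i, t, temperature=10.0):
    """Forward pass through which both the filter w_i and the thresholds t get gradients."""
    q = torch.zeros_like(w_i)
    residual = w_i
    for t_j in t:
        gate = torch.sigmoid(temperature * (residual.norm() - t_j))  # relaxed indicator
        q = q + gate * round_pow2_ste(residual)
        residual = w_i - q                           # later residuals depend on earlier t_j
    return q

w_i = torch.tensor([0.3, -0.6, 0.1], requires_grad=True)
t = torch.tensor([0.05, 0.05], requires_grad=True)
loss = (soft_quantize_filter(w_i, t) ** 2).sum()     # toy loss
loss.backward()
print(t.grad)                                        # thresholds receive gradients
```

Consistent with the recursion above, later residuals depend on earlier gates, so a threshold t_j also receives gradient contributions through the terms with l > j.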
Regularization: To encourage smaller k_i for the filters, a regularization loss L_reg,k(w) = Σ_{j=0}^{k-1} λ_j Σ_i ||r_{i,j}||_2 is added, where λ_j acts as a handle to balance accuracy and model sparsity. This regularization loss is a sum of group Lasso losses, which introduce structural sparsity. The first term, λ_0 Σ_i ||r_{i,0}||_2 = λ_0 Σ_i ||w_i||_2, is used to prune whole filters, while the remaining terms (j > 0) regularize the residuals.
The total loss is the sum of the cross-entropy loss and the regularization loss: L_total(w, t) = L_CE(w, t) + L_reg,k(w).
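The following sketch (an illustration, not the patented code; the λ values are arbitrary) computes the regularization term L_reg,k for a list of filters; during training it is simply added to the cross-entropy loss to form L_total, and both are differentiable with respect to the weights and the thresholds.

```python
import torch

def round_pow2(x):
    absx = x.abs().clamp_min(1e-12)
    return torch.sign(x) * 2.0 ** torch.round(torch.log2(absx))

def regularization_loss(filters, lambdas):
    """L_reg,k = sum_j lambda_j * sum_i ||r_{i,j}||_2 (group-Lasso style penalty)."""
    loss = torch.zeros(())
    for w_i in filters:
        q = torch.zeros_like(w_i)
        residual = w_i                              # r_{i,0} = w_i
        for lam in lambdas:                         # lambdas = [lambda_0, ..., lambda_{k-1}]
            loss = loss + lam * residual.norm(p=2)  # lambda_0 term pushes whole filters to zero
            q = q + round_pow2(residual)
            residual = w_i - q                      # next residual r_{i,j+1}
    return loss

filters = [torch.tensor([0.3, -0.6, 0.1]), torch.tensor([0.02, 0.01, -0.03])]
print(regularization_loss(filters, lambdas=[1e-4, 1e-3]))
```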
The new training algorithm is summarized in the figures.
The present invention, disclosed herein, customizes the number of shift operations for each filter of a LightNN. Equipped with the differentiable training algorithm, the present invention achieves a flexible trade-off between accuracy and speed/energy. It provides a more continuous Pareto front for LightNN models and outperforms fixed-point DNNs with respect to both accuracy and speed/energy. Moreover, due to the gradual quantization nature of the differentiable training, the present invention achieves higher accuracy than LightNN-1 without sacrificing speed or energy efficiency, and thus pushes the Pareto-optimal front forward.
This application claims the benefit of the U.S. Provisional Patent Application No. 62/921,121, filed May 31, 2019, the contents of which are incorporated herein in their entirety.
This invention was made with government support under contract No. 1815899 awarded by the Division of Computing and Communication Foundations of the National Science Foundation. The government has certain rights in this invention.
U.S. Patent Documents

Number | Name | Date | Kind
---|---|---|---
10540591 | Gao | Jan 2020 | B2
10558915 | Gao | Feb 2020 | B2
11095887 | Kim | Aug 2021 | B2
11188817 | Dikici | Nov 2021 | B2
11315016 | Sundaram | Apr 2022 | B2
20200380371 | Ding | Dec 2020 | A1
Other Publications

Ding, R., et al., "Quantized Deep Neural Networks for Energy Efficient Hardware-based Inference," 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), Jeju, South Korea, Jan. 22-25, 2018, 8 pages.

Ding, R., et al., "LightNN: Filling the Gap between Conventional Deep Neural Networks and Binarized Networks," GLSVLSI '17: Proceedings of the Great Lakes Symposium on VLSI 2017, Banff, Canada, May 2017, 6 pages.
Prior Publication Data

Number | Date | Country
---|---|---
20200380371 A1 | Dec 2020 | US
Provisional Applications

Number | Date | Country
---|---|---
62921121 | May 2019 | US