Emerging vision, speech, and natural language applications have widely adopted deep learning models that achieve state-of-the-art accuracy. Recent industrial efforts have focused on deploying these models on mobile devices. However, real-time applications based on these deep models typically incur unacceptably large latencies and can quickly drain the battery of energy-limited devices. Prior research has therefore proposed model compression techniques, including pruning and quantization, to satisfy the stringent energy and speed requirements of such devices.
Prior work has extensively explored both algorithmic and hardware approaches to reduce the latency and energy consumption of Deep Neural Networks (DNNs). Because the latency and energy consumption of DNNs generally stem from computational cost and memory accesses, prior work in the algorithmic domain focuses mainly on reducing Floating Point Operations (FLOPs) and model size. Some work reduces the number of parameters through weight pruning, while other work introduces structural sparsity via filter pruning for Convolutional Neural Networks (CNNs) to enable speedup on general hardware platforms incorporating CPUs and GPUs. To reduce model size, previous work has also conducted neural architecture search under energy constraints. In addition to these algorithmic advances, prior art has proposed methodologies to achieve fast and energy-efficient DNNs: some works co-design the hardware platform and the architecture of the neural network running on it; some propose more lightweight DNN units for faster inference on general-purpose hardware; and others propose hardware-friendly DNN computation units to enable energy-efficient implementation on customized hardware.
By reducing the weight and activation precision, DNN quantization has proved to be an effective technique to improve the speed and energy efficiency of DNNs on customized hardware, due to its lower computational cost and fewer memory accesses. A DNN with 16-bit fixed-point representation can achieve competitive accuracy compared to the full-precision network.
Uniform quantization approaches enable fixed-point hardware implementation of DNNs. One prior art effort uses only 1 bit for the DNN parameters, turning multiplications into XNOR operations on customized hardware. However, such models must be over-parameterized to maintain high accuracy.
LightNN is a quantization approach that constrains the weights of DNNs to be a sum of k powers of 2, and therefore can use shift and add operations to replace the multiplications between activations and weights. In LightNN-1, every multiplication in the DNN is replaced by a single shift operation, while in LightNN-2, two shifts and an add replace each multiplication. Because shift operations are much more lightweight on customized hardware (e.g., Field-Programmable Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs)), these approaches can achieve faster speed and lower energy consumption, and generally maintain accuracy for over-parameterized models.
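By way of illustration only, the following toy Python sketch (not the patented hardware implementation) shows, assuming integer fixed-point activations and non-negative exponents, how a power-of-2 weight turns a multiplication into a shift, and how a LightNN-2-style weight uses two shifts and one add.

```python
# Toy sketch: power-of-two weights let shifts replace multiplications.
# Assumptions (not from the patent text): integer activations, exponent >= 0.

def shift_multiply(activation: int, sign: int, exponent: int) -> int:
    """Compute activation * (sign * 2**exponent) with a shift instead of a multiply."""
    return sign * (activation << exponent)

x = 7
# LightNN-1 style weight: w = +2**3 = 8  ->  one shift
print(shift_multiply(x, +1, 3))                              # 56
# LightNN-2 style weight: w = 2**3 + 2**1 = 10  ->  two shifts and one add
print(shift_multiply(x, +1, 3) + shift_multiply(x, +1, 1))   # 70
```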
Although the LightNN approaches provide better energy efficiency, they use a single k value (i.e., the number of shifts per multiplication) across the whole network, and therefore lack the flexibility to provide fine-grained trade-offs between energy and accuracy. The energy efficiency of these models also exhibits gaps, making the Pareto front of accuracy and energy discrete. However, a continuous accuracy versus energy/latency trade-off is an important feature for designers targeting different market segments (e.g., IoT devices, edge devices, and mobile devices).
To provide a more flexible Pareto front for the LightNN approaches, each convolutional filter in the present invention is given the freedom to use a different number of shift-and-add operations to approximate its multiplications. A set of free variables k = {k_1, . . . , k_F} is introduced, where each element represents the number of shift-and-add operations for the corresponding convolutional filter. As a result, a more continuous Pareto front can be achieved.
For example, if k is constrained such that k ∈ {1, 2}^F, then the throughput and energy consumption of the new model will lie between those of the first (k = {1}^F) and second (k = {2}^F) versions of the prior art quantization approaches. Formally, the problem being solved is min_{w,k} L(w, k), where L is the loss function and w is the weight vector. However, the commonly adopted stochastic gradient descent (SGD) algorithm does not apply in this case since L is non-differentiable with respect to k.
The present invention uses a differentiable training algorithm with flexible k values, which enables end-to-end optimization with standard SGD. Customizing the k value of each convolutional filter enables a more continuous Pareto front. The end-to-end differentiable training algorithm relies on approximate gradient computation for the non-differentiable operations and on regularization to encourage sparsity. Moreover, the differentiable training approach performs gradual quantization, which can achieve higher accuracy than LightNN-1 without increasing latency. The resulting algorithm therefore provides a continuous Pareto front from which hardware designers can search for a highly accurate model under hardware resource constraints, while the gradual quantization enabled by differentiable training further pushes forward the Pareto-optimal curve.
LightNN Overview
As a quantized DNN model, LightNN constrains the weights of a network to be the sum of k powers of 2, denoted as LightNN-k. Thus, the multiplication between a weight and an activation can be implemented with k shift operations and k-1 additions. Specifically, LightNN-1 constrains the weights to be powers of 2, and uses only a single shift per multiplication. The approximation function used by LightNN-k to quantize a full-precision weight w can be formulated recursively as Q_k(w) = Q_{k-1}(w) + Q_1(w - Q_{k-1}(w)) for k > 1, where Q_1(w) = sign(w) × 2^[log2(|w|)] rounds the weight w to the nearest power of 2.
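As an illustration of the recursive definition above, the following NumPy sketch (an illustration, not the patented code) implements Q_1 and Q_k for plain arrays.

```python
import numpy as np

def quantize_pow2(w):
    """Q_1: round each nonzero weight to the nearest power of two, keeping its sign."""
    w = np.asarray(w, dtype=np.float64)
    out = np.zeros_like(w)
    nz = w != 0
    out[nz] = np.sign(w[nz]) * 2.0 ** np.round(np.log2(np.abs(w[nz])))
    return out

def quantize_k(w, k):
    """Q_k(w) = Q_{k-1}(w) + Q_1(w - Q_{k-1}(w)): a sum of k powers of two."""
    w = np.asarray(w, dtype=np.float64)
    q = quantize_pow2(w)
    for _ in range(k - 1):
        q = q + quantize_pow2(w - q)       # quantize the remaining residual
    return q

w = np.array([0.30, -0.07, 1.60])
print(quantize_k(w, 1))   # [ 0.25   -0.0625  2.    ]
print(quantize_k(w, 2))   # [ 0.3125 -0.0703  1.5   ] (approximately)
```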
LightNN is trained with a modified backpropagation algorithm. In the forward phase of each training iteration, the parameters are first approximated using the Q_k function. Then, in the backward phase, the gradients of the loss with respect to the quantized weights are computed; in the weight update phase, these gradients are applied to the full-precision weights.
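The following simplified sketch (assuming a PyTorch-style setup; the toy objective and tensor shapes are arbitrary, not part of the description above) illustrates this modified backpropagation: the forward pass uses quantized weights, gradients are taken with respect to the quantized weights via a straight-through estimator, and the update is applied to the full-precision weights.

```python
import torch

def round_pow2(x):
    """Round each element to the nearest power of two (sign preserved)."""
    absx = x.abs().clamp_min(1e-12)                  # avoid log2(0)
    return torch.sign(x) * 2.0 ** torch.round(torch.log2(absx))

def quantize_k_ste(w_fp, k):
    # Q_k via k rounds of residual quantization, with a straight-through
    # estimator so gradients reach the full-precision weights unchanged.
    q = torch.zeros_like(w_fp)
    for _ in range(k):
        q = q + round_pow2(w_fp - q)
    return w_fp + (q - w_fp).detach()

# One training iteration on a toy objective.
w_fp = torch.randn(8, requires_grad=True)            # full-precision weights
optimizer = torch.optim.SGD([w_fp], lr=0.1)
x = torch.randn(8)

w_q = quantize_k_ste(w_fp, k=2)                      # forward: quantized weights
loss = ((w_q * x).sum() - 1.0) ** 2                  # toy loss
loss.backward()                                      # backward: d(loss)/d(w_q)
optimizer.step()                                     # update full-precision weights
```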
LightNN has proven to be accurate and energy-efficient on customized hardware. LightNN-2 can generally achieve accuracy close to full-precision DNNs, while LightNN-1 can achieve higher energy efficiency than LightNN-2. Due to the discrete nature of the k values, there exists a gap between LightNN-1 and LightNN-2 with respect to accuracy and energy.
The present invention customizes the k values for each convolutional filter, and thus, achieves a smoother energy-accuracy trade-off to provide hardware designers with more design options.
Differentiable Training
Herein, the quantization function is first defined, and then the end-to-end training algorithm for the present invention is introduced, equipped with a regularization loss to penalize large k values.
Quantization function: The ith filter of the network is denoted as w_i, and the quantization function for the filter w_i as Q_k(w_i|t), where k = max_i k_i is the maximum number of shifts used for the network, and the vector t is a latent variable that controls the approximation (e.g., a set of threshold values). The residual resulting from the approximation is denoted as r_{i,k} = w_i - Q_k(w_i|t). The quantization function is formally defined as:

Q_k(w_i|t) = Σ_{j=0}^{k-1} g(||r_{i,j}||_2, t_j) · R(r_{i,j}),  with r_{i,0} = w_i,

where R(x) = sign(x) × 2^[log2(|x|)] rounds the input variable to the nearest power of 2, [·] is a rounding-to-integer function, and g(x, t_j) = (x > t_j) is an indicator function that equals 1 when x > t_j and 0 otherwise. Under this quantization flow, the residual of filter i is quantized by an additional power-of-2 term only when ||r_{i,j}||_2 > t_j. Therefore, choosing k_i per filter is equivalent to finding optimal thresholds t.
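The following NumPy sketch is one possible interpretation of the definition above (an illustration, not the patented code): each filter keeps adding power-of-2 terms only while the residual norm exceeds the corresponding threshold, so the thresholds t determine k_i per filter.

```python
import numpy as np

def round_pow2(x):
    """R(x): round each nonzero element to the nearest power of two, keeping its sign."""
    out = np.zeros_like(x)
    nz = x != 0
    out[nz] = np.sign(x[nz]) * 2.0 ** np.round(np.log2(np.abs(x[nz])))
    return out

def quantize_filter(w_i, t):
    """Quantize one filter w_i with thresholds t = [t_0, ..., t_{k-1}]."""
    q = np.zeros_like(w_i)
    residual = w_i                              # r_{i,0} = w_i
    for t_j in t:
        if np.linalg.norm(residual) <= t_j:     # indicator (||r_{i,j}||_2 > t_j) is 0
            break                               # this filter uses fewer shifts
        q = q + round_pow2(residual)
        residual = w_i - q                      # next residual r_{i,j+1}
    return q

w_i = np.array([0.3, -0.6, 0.1])
print(quantize_filter(w_i, t=[0.05, 0.05]))     # up to 2 shifts per multiplication
```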
The quantization approach of the present invention targets efficient hardware implementation. Instead of assigning a customized k_i to each individual weight, the present invention customizes the k_i values per filter, and therefore preserves structural sparsity, as shown in the figures.
Differentiable Training: Instead of picking the thresholds t by hand, they are treated as trainable parameters. Therefore, the loss function L(w, t) is a function of both the weights and the thresholds. A straight-through estimator is used to compute the gradient of the loss with respect to the weights: by defining ∂w_i^q/∂w_i = 1, where w_i^q = Q_k(w_i|t) is the quantized w_i, we have ∂L/∂w_i = (∂L/∂w_i^q) · (∂w_i^q/∂w_i) = ∂L/∂w_i^q, which becomes a differentiable expression.
To compute the gradient for the thresholds, i.e., ∂L/∂t_j, the indicator function g(x, t_j) = (x > t_j) is relaxed to a sigmoid function, σ(·), when computing gradients, i.e., ĝ(x, t_j) = σ(x − t_j). In addition, the straight-through estimator is used to compute the gradient for R(x). Thus, the gradient ∂L/∂t_j can be computed by:

∂L/∂t_j = Σ_i (∂L/∂w_i^q) · (∂w_i^q/∂t_j) = Σ_i (∂L/∂w_i^q) · Σ_{l=0}^{k-1} ∂[ĝ(||r_{i,l}||_2, t_l) · R(r_{i,l})]/∂t_j,

where the summand derivatives are 0 for l < j; otherwise, they can be computed with the result of ∂r_{i,l}/∂t_j, which follows recursively from the earlier quantization steps.
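The following simplified PyTorch sketch illustrates the same idea in code (an illustration, not the exact gradient computation above): the hard indicator is relaxed to a sigmoid so the thresholds receive gradients, and the power-of-2 rounding uses a straight-through estimator. The temperature parameter and the use of the relaxation in the forward pass are added simplifications, not part of the description above.

```python
import torch

def round_pow2_ste(x):
    """R(x) with a straight-through estimator (identity gradient)."""
    absx = x.abs().clamp_min(1e-12)                  # avoid log2(0)
    r = torch.sign(x) * 2.0 ** torch.round(torch.log2(absx))
    return x + (r - x).detach()

def soft_quantize_filter(w_i, t, temperature=10.0):
    """Forward pass through which both the filter w_i and the thresholds t get gradients."""
    q = torch.zeros_like(w_i)
    residual = w_i
    for t_j in t:
        gate = torch.sigmoid(temperature * (residual.norm() - t_j))  # relaxed indicator
        q = q + gate * round_pow2_ste(residual)
        residual = w_i - q                           # later residuals depend on earlier t_j
    return q

w_i = torch.tensor([0.3, -0.6, 0.1], requires_grad=True)
t = torch.tensor([0.05, 0.05], requires_grad=True)
loss = (soft_quantize_filter(w_i, t) ** 2).sum()     # toy loss
loss.backward()
print(t.grad)                                        # thresholds receive gradients
```

Consistent with the recursion above, later residuals depend on earlier gates, so a threshold t_j also receives gradient contributions through the terms with l > j.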
Regularization: To encourage smaller k_i for the filters, a regularization loss L_reg,k(w) = Σ_{j=0}^{k-1} λ_j Σ_i ||r_{i,j}||_2 is added, where λ_j acts as a handle to balance accuracy and model sparsity. This regularization loss is a sum of group Lasso losses, which introduce structural sparsity. The first term, λ_0 Σ_i ||r_{i,0}||_2 = λ_0 Σ_i ||w_i||_2, is used to prune whole filters, while the remaining terms (j > 0) regularize the residuals.
The total loss is the sum of the cross-entropy loss and the regularization loss: L_total(w, t) = L_CE(w, t) + L_reg,k(w).
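The following sketch (an illustration, not the patented code; the λ values are arbitrary) computes the regularization term L_reg,k for a list of filters; during training it is simply added to the cross-entropy loss to form L_total, and both are differentiable with respect to the weights and the thresholds.

```python
import torch

def round_pow2(x):
    absx = x.abs().clamp_min(1e-12)
    return torch.sign(x) * 2.0 ** torch.round(torch.log2(absx))

def regularization_loss(filters, lambdas):
    """L_reg,k = sum_j lambda_j * sum_i ||r_{i,j}||_2 (group-Lasso style penalty)."""
    loss = torch.zeros(())
    for w_i in filters:
        q = torch.zeros_like(w_i)
        residual = w_i                              # r_{i,0} = w_i
        for lam in lambdas:                         # lambdas = [lambda_0, ..., lambda_{k-1}]
            loss = loss + lam * residual.norm(p=2)  # lambda_0 term pushes whole filters to zero
            q = q + round_pow2(residual)
            residual = w_i - q                      # next residual r_{i,j+1}
    return loss

filters = [torch.tensor([0.3, -0.6, 0.1]), torch.tensor([0.02, 0.01, -0.03])]
print(regularization_loss(filters, lambdas=[1e-4, 1e-3]))
```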
The new training algorithm is summarized in the figures.
The present invention, disclosed herein, customizes the number of shift operations for each filter of a LightNN. Equipped with the differentiable training algorithm, the present invention achieves a flexible trade-off between accuracy and speed/energy. It provides a more continuous Pareto front for LightNN models and outperforms fixed-point DNNs with respect to both accuracy and speed/energy. Moreover, due to the gradual quantization nature of the differentiable training, the present invention achieves higher accuracy than LightNN-1 without sacrificing speed or energy efficiency, and thus pushes the Pareto-optimal front forward.
This application claims the benefit of the U.S. Provisional Patent Application No. 62/921,121, filed May 31, 2019, the contents of which are incorporated herein in their entirety.
This invention was made with government support under contract No. 1815899 awarded by the Division of Computing and Communication Foundations of the National Science Foundation. The government has certain rights in this invention.
U.S. Patent Documents

Number | Name | Date | Kind
---|---|---|---
10540591 | Gao | Jan 2020 | B2
10558915 | Gao | Feb 2020 | B2
11095887 | Kim | Aug 2021 | B2
11188817 | Dikici | Nov 2021 | B2
11315016 | Sundaram | Apr 2022 | B2
20200380371 | Ding | Dec 2020 | A1
Other Publications

Ding, R., et al., "Quantized Deep Neural Networks for Energy Efficient Hardware-based Inference," 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), Jeju, South Korea, Jan. 22-25, 2018, 8 pages.

Ding, R., et al., "LightNN: Filling the Gap between Conventional Deep Neural Networks and Binarized Networks," GLSVLSI '17: Proceedings of the Great Lakes Symposium on VLSI 2017, Banff, Canada, May 2017, 6 pages.
Prior Publication Data

Number | Date | Country
---|---|---
20200380371 A1 | Dec 2020 | US
Provisional Applications

Number | Date | Country
---|---|---
62921121 | May 2019 | US