The present invention relates generally to long short-term memory (LSTM) and, more particularly, to a hidden-layer LSTM (H-LSTM) that employs grow-and-prune training to adjust the hidden layers.
Recurrent neural networks (RNNs) have been ubiquitously employed for sequential data modeling due to their ability to carry information through recurrent cycles. However, one common problem in RNN training is the vanishing and exploding gradient problem, in which the gradient values diminish or explode exponentially as the time lag increases. Long short-term memory (LSTM) has been proposed as a special type of RNN that uses control gates and cell states to alleviate this problem. It delivers state-of-the-art performance for a wide variety of applications, such as language modeling, speech recognition, image captioning, and neural machine translation.
Going deeper is a common practice to improve the performance of deep neural networks. Researchers have kept stacking more LSTM cells and increasing the model depth and size to improve accuracy. For example, the DeepSpeech2 architecture, which has been used for speech recognition, contains three convolutional, seven bidirectional recurrent, one fully-connected, and one connectionist temporal classification (CTC) layers. This is more than 2× deeper and 10× larger than the initial DeepSpeech architecture. As another example, the initial LSTM-based neural machine translation model utilizes only four LSTM layers, while its successor, Google's neural machine translation (GNMT) system, possesses eight LSTM layers jointly with additional attention connections.
However, going deeper with LSTM can lead to three common problems that may impact its practicability and ease of usage:
(1) Excessive computation cost: Deployment of a large LSTM model consumes substantial storage, memory bandwidth, and computational resources. Such demands may be excessive for edge devices, such as mobile phones, smart watches, and Internet-of-Things (IoT) sensors.
(2) Regularization difficulty: Large LSTMs that can easily contain millions of parameters are prone to overfitting but hard to regularize. Employing standard regularization methods that are used for feedforward neural networks (NNs), such as dropout, in an LSTM cell is challenging.
(3) Increased latency: The increasingly stringent runtime latency constraints in real-time applications make large LSTMs, which incur high latency, inapplicable in these scenarios.
At least these problems pose a significant design challenge in obtaining compact, fast, and accurate LSTMs.
According to various embodiments, a hidden-layer long short-term memory (H-LSTM) system is disclosed. The system includes a memory cell and a plurality of deep neural network (DNN) control gates enhanced with hidden layers configured to perform a linear transformation followed by an activation function.
According to various embodiments, a method for generating an optimal hidden-layer long short-term memory (H-LSTM) architecture is disclosed. The H-LSTM architecture includes a memory cell and a plurality of deep neural network (DNN) control gates enhanced with hidden layers. The method includes providing an initial seed H-LSTM architecture, training the initial seed H-LSTM architecture by growing one or more connections based on gradient information and iteratively pruning one or more connections based on magnitude information, and terminating the iterative pruning when training cannot achieve a predefined accuracy threshold.
According to various embodiments, a non-transitory computer-readable medium having stored thereon a computer program for execution by a processor configured to perform a method for generating an optimal hidden-layer long short-term memory (H-LSTM) architecture is disclosed. The method includes providing an initial seed H-LSTM architecture, training the initial seed H-LSTM architecture by growing one or more connections based on gradient information and iteratively pruning one or more connections based on magnitude information, and terminating the iterative pruning when training cannot achieve a predefined accuracy threshold.
Various other features and advantages will be made apparent from the following detailed description and the drawings.
In order for the advantages of the invention to be readily understood, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments that are illustrated in the appended drawings. Understanding that these drawings depict only exemplary embodiments of the invention and are not, therefore, to be considered to be limiting its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings, in which:
Long short-term memory (LSTM) has been widely used for sequential data modeling. LSTM depth has typically been increased by stacking LSTM cells to improve performance. However, this incurs model redundancy, increases run-time delay, and makes the LSTMs more prone to overfitting.
To address these problems, generally disclosed herein is a hidden-layer LSTM (H-LSTM) that adds hidden layers to LSTM's one-level nonlinear control gates. H-LSTM increases accuracy while employing fewer external stacked layers, thus reducing the number of parameters and run-time latency significantly. Grow-and-prune (GP) training is employed to iteratively adjust the hidden layers through gradient-based growth and magnitude-based pruning of connections. This learns both the weights and the compact architecture of H-LSTM control gates. The GP training is also augmented with an activation function shift technique. GP-trained H-LSTMs for image captioning and speech recognition applications were created. For the NeuralTalk architecture on the MSCOCO dataset, the created models reduced the number of parameters by 38.7× (floating-point operations (FLOPs) by 45.5×), reduced the run-time latency by 4.5×, and improved the CIDEr-D score by 2.8%. For the DeepSpeech2 architecture on the AN4 dataset, the created models reduced the number of parameters by 19.4× (FLOPs by 23.5×), reduced the run-time latency by 37.4%, and reduced the word error rate from 12.9% to 8.7%. Thus, GP-trained H-LSTMs are more compact, faster, and more accurate than typical models.
LSTM Overview
LSTM is a recurrent neural network (RNN) variant that is well-suited for processing, modeling, and making predictions based on time series data.
The LSTM cell architecture 10 may be implemented in a variety of configurations including general computing devices such as but not limited to desktop computers, laptop computers, tablets, network appliances, and the like. The LSTM cell architecture 10 may also be implemented in mobile devices such as but not limited to a mobile phone, smart phone, smart watch, or tablet computer. The control gates may be implemented in one or more processors such as but not limited to a central processing unit (CPU), a graphics processing unit (GPU), or a field programmable gate array (FPGA).
Computation flow is depicted in Eqs. (1)-(3):
where f_t, i_t, and o_t refer to the forget gate 18, input gate 14, and output gate 16, respectively. Additionally, g_t refers to a cell update vector 20, x_t refers to an input vector 22, h_t refers to a hidden state vector 24, and c_t refers to a cell state vector 26. Subscript t refers to step t and subscript t−1 refers to step t−1. W and b refer to the weight matrix and bias, respectively. σ and tanh refer to the sigmoid and tanh activation functions; ⊗ and ⊕ refer to element-wise multiplication and element-wise addition, respectively.
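For reference, the standard LSTM formulation that matches these symbol definitions can be written as follows. This is a sketch assuming the conventional form in which each gate acts on the concatenation [x_t, h_{t−1}]; the per-gate subscripts on W and b are notational assumptions, and the numbering mirrors Eqs. (1)-(3).

```latex
\begin{align}
\begin{pmatrix} f_t \\ i_t \\ o_t \\ g_t \end{pmatrix}
&= \begin{pmatrix}
\sigma\left(W_f [x_t, h_{t-1}] + b_f\right) \\
\sigma\left(W_i [x_t, h_{t-1}] + b_i\right) \\
\sigma\left(W_o [x_t, h_{t-1}] + b_o\right) \\
\tanh\left(W_g [x_t, h_{t-1}] + b_g\right)
\end{pmatrix} \tag{1} \\
c_t &= \left(f_t \otimes c_{t-1}\right) \oplus \left(i_t \otimes g_t\right) \tag{2} \\
h_t &= o_t \otimes \tanh(c_t) \tag{3}
\end{align}
```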
A major advantage of LSTM relative to a traditional RNN is its capability to deal with the exploding and vanishing gradient problem during training. The error gradients remain in the LSTM cell when back-propagated from the output layer. This allows the gradient information to flow through time without vanishing, unless cut off by the control gates during training. As a result, LSTMs can learn tasks that require memories of events that happened thousands of discrete time steps earlier. This yields a significant accuracy gain relative to typical RNNs and hence supports a wide spectrum of real-world use scenarios.
Hidden-Layer LSTM Overview
Recent years have witnessed the impact of increasing NN depth on its performance. A deep architecture allows an NN to capture low/mid/high-level features through a multi-level information extraction or distillation. Such a hierarchical information distillation process typically leads to a higher inference accuracy. However, since a typical LSTM employs fixed single-layer nonlinearity for gate controls, the current standard approach for increasing model depth is through stacking several LSTM cells or adding deep feed-forward networks externally.
By contrast, embodiments of the present invention employ a different approach that increases depth within LSTM cells. Generally disclosed herein is an H-LSTM architecture whose control gates are enhanced by adding hidden layers. Specifically, a multi-layer transformation is introduced in the three control gates (f_t 18, i_t 14, and o_t 16) and the cell update vector (g_t 20). H-LSTM focuses on internally deeper control flows, where each control gate is made individually deeper without any network sharing. The introduction of a multi-layer information extraction or distillation in these control gates yields substantial improvements in both model compactness and performance.
The internal computation flow is governed by Eqs. (4)-(6):
where DNN and H refer to the DNN gates 30-36 and the hidden layers, respectively (each hidden layer performs a linear transformation followed by the activation function); * indicates zero or more H layers in the DNN gate.
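Under these definitions, the H-LSTM computation flow can be sketched as follows. This is a reconstruction that assumes each DNN gate applies its hidden layers H* to the concatenation [x_t, h_{t−1}] before the usual sigmoid or tanh output; the per-gate subscripts are notational assumptions, and the numbering mirrors Eqs. (4)-(6).

```latex
\begin{align}
\begin{pmatrix} f_t \\ i_t \\ o_t \\ g_t \end{pmatrix}
= \begin{pmatrix}
\mathrm{DNN}_f([x_t, h_{t-1}]) \\
\mathrm{DNN}_i([x_t, h_{t-1}]) \\
\mathrm{DNN}_o([x_t, h_{t-1}]) \\
\mathrm{DNN}_g([x_t, h_{t-1}])
\end{pmatrix}
&= \begin{pmatrix}
\sigma\left(W_f H^{*}([x_t, h_{t-1}]) + b_f\right) \\
\sigma\left(W_i H^{*}([x_t, h_{t-1}]) + b_i\right) \\
\sigma\left(W_o H^{*}([x_t, h_{t-1}]) + b_o\right) \\
\tanh\left(W_g H^{*}([x_t, h_{t-1}]) + b_g\right)
\end{pmatrix} \tag{4} \\
c_t &= \left(f_t \otimes c_{t-1}\right) \oplus \left(i_t \otimes g_t\right) \tag{5} \\
h_t &= o_t \otimes \tanh(c_t) \tag{6}
\end{align}
```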
Introduction of DNN gates provides three major benefits to an H-LSTM:
(1) Strengthened control: Hidden layers in DNN gates enhance gate control through multi-level information extraction or distillation. This makes an H-LSTM more capable and intelligent and alleviates its reliance on external stacking. Consequently, an H-LSTM can achieve comparable or even improved accuracy with fewer external stacked layers relative to a typical LSTM, leading to higher compactness.
(2) Easy regularization: The typical approach only uses dropout in the input/output layers and recurrent connections in the LSTMs. In the embodiments disclosed herein, it becomes possible to apply dropout even to all control gates within an LSTM cell. This reduces overfitting and leads to better generalization.
(3) Flexible gates: Unlike the fixed but specially-crafted gate control functions in LSTMs, DNN gates in an H-LSTM offer a wide range of choices for internal activation functions, such as a rectified linear unit (ReLU). This may provide additional benefits to the model. For example, networks typically learn faster with ReLUs. They can also take advantage of ReLU's zero outputs for FLOPs reduction.
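A single H-LSTM step with DNN gates can be sketched in plain Python as follows. This is a minimal illustration rather than the evaluated implementation: the function names, the list-based matrices, and the layout of the `gates` dictionary are all assumptions, and ReLU is used as the hidden-layer activation as one of the choices mentioned above.

```python
import math

def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dnn_gate(z, hidden_layers, out_layer, out_act):
    """Zero or more ReLU hidden layers (H*), then the gate's output nonlinearity."""
    for W, b in hidden_layers:
        z = [max(0.0, v) for v in linear(W, b, z)]
    W, b = out_layer
    return [out_act(v) for v in linear(W, b, z)]

def h_lstm_step(x_t, h_prev, c_prev, gates):
    """One time step. `gates` (hypothetical layout) maps 'f', 'i', 'o', 'g'
    to a pair (hidden_layers, out_layer); each gate has its own weights."""
    z = x_t + h_prev  # list concatenation [x_t, h_{t-1}]
    f = dnn_gate(z, *gates["f"], sigmoid)
    i = dnn_gate(z, *gates["i"], sigmoid)
    o = dnn_gate(z, *gates["o"], sigmoid)
    g = dnn_gate(z, *gates["g"], math.tanh)
    c_t = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h_t = [ot * math.tanh(ct) for ot, ct in zip(o, c_t)]
    return h_t, c_t
```

Passing an empty `hidden_layers` list for every gate reduces the step to a conventional LSTM cell, which corresponds to the "zero or more H layers" case.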
Grow-and-Prune (GP) Training Overview
Typical training based on back propagation on fully-connected NNs yields over-parameterized models. As such, pruning is implemented to drastically reduce the size of large deep convolutional neural networks (CNNs) and LSTMs. The pruning phase is complemented with a brain-inspired growth phase for large CNNs. The network growth phase allows a CNN to grow neurons, connections, and feature maps, as necessary, during training. Thus, it enables automated search in the architecture space. It has been shown that a sequential combination of growth and pruning can yield additional compression on CNNs relative to pruning-only methods (e.g., 1.7× for AlexNet and 2.3× for VGG-16 on top of the pruning-only methods). More detail on GP training can generally be found in PCT Application No. PCT/US18/57485, which is herein incorporated by reference in its entirety.
Here, GP training has been extended to LSTMs. The steps involved are depicted in
During training, GP training first grows connections based on the gradient information at step 40. After the application of an activation function shift technique at step 42, to be explained in more detail below, GP training prunes away redundant connections for compactness, based on their magnitudes, at step 44. Finally, GP training rests at an accurate, yet compact, inference model at step 46.
GP training adopts the following growth and pruning policies:
Growth policy: Activate a dormant ω in W iff |ω.grad| is larger than the (100α)th percentile of all elements in |W.grad|.
Pruning policy: Remove a ω iff |ω| is smaller than the (100β)th percentile of all elements in |W|.
Here, ω, W, .grad, α, and β refer to the weight of a single connection, weights of all connections within one layer, operation to extract the gradient, growth ratio, and pruning ratio, respectively.
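The two policies can be sketched on a single layer as follows. This is an illustration under stated assumptions: the binary mask representation, the function names, and the nearest-rank percentile convention are not specified above and are chosen here for simplicity; the sketch only updates the mask and does not decide how newly activated weights are initialized.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for a fraction q in [0, 1] (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def grow_connections(mask, grads, alpha):
    """Return a new mask in which a dormant connection becomes active when
    |grad| exceeds the (100*alpha)th percentile of all |grad| in the layer."""
    threshold = percentile([abs(g) for row in grads for g in row], alpha)
    return [[1 if m == 1 or abs(g) > threshold else 0
             for m, g in zip(mask_row, grad_row)]
            for mask_row, grad_row in zip(mask, grads)]

def prune_connections(weights, mask, beta):
    """Return (weights, mask) with every connection removed whose |weight|
    is below the (100*beta)th percentile of all |weight| in the layer."""
    threshold = percentile([abs(w) for row in weights for w in row], beta)
    new_mask = [[0 if abs(w) < threshold else m
                 for w, m in zip(w_row, m_row)]
                for w_row, m_row in zip(weights, mask)]
    new_weights = [[w * m for w, m in zip(w_row, m_row)]
                   for w_row, m_row in zip(weights, new_mask)]
    return new_weights, new_mask
```

The percentile here is taken over all entries of the layer, as the policies state; restricting it to currently active connections would be an alternative design choice.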
In the growth phase 40, the main objective is to locate the most effective dormant connections to reduce the value of the loss function L. ∂L/∂ω is first evaluated for each dormant connection ω based on its average gradient over the entire training set. Then each dormant connection whose gradient magnitude |ω.grad|=|∂L/∂ω| surpasses the (100α)th percentile of the gradient magnitudes of its corresponding weight matrix is activated. This rule favors the dormant connections that are most effective at reducing L. Growth 40 can also help the network avoid local minima, which improves accuracy.
The pruning phase 44 involving the pruning of insignificant weights is an iterative process. In each iteration, insignificant weights whose magnitudes are smaller than the (100β)th percentile within their respective layers are pruned away. A neuron is pruned if all its input (or output) connections are pruned away. The NN is then retrained after weight pruning to recover its performance before starting the next pruning iteration. The pruning phase 44 terminates when retraining cannot achieve a pre-defined accuracy threshold.
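The iterative prune-and-retrain loop with its termination condition can be sketched as follows. This is a hedged outline: `prune_step` and `retrain` are hypothetical callbacks, the dictionary-based model with an "accuracy" entry is an illustrative stand-in, and `max_iters` is a safety bound added for the sketch.

```python
def iterative_prune(model, prune_step, retrain, accuracy_threshold, max_iters=100):
    """Prune, retrain, and repeat until retraining can no longer reach the
    accuracy threshold; return the model from the last successful iteration."""
    best = model
    for _ in range(max_iters):
        candidate = retrain(prune_step(best))
        if candidate["accuracy"] < accuracy_threshold:
            break  # retraining failed to recover: discard this iteration
        best = candidate
    return best
```

Returning `best` rather than the failing candidate reflects the idea of keeping the result of the last complete iteration.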
GP training finalizes a model 46 based on the last complete iteration. In one embodiment, a mask Msk is utilized to disregard the ‘dormant’ or pruned connections. It is shown how the mask Msk and weight matrix W are updated in the gradient-based growth and magnitude-based pruning process in the methodology in
Activation Function Shift
An activation function shift 42 is also employed from a leaky rectified linear unit (ReLU) to a ReLU during training, as shown in
In the seed architecture 38 and growth phase 40, a leaky ReLU is adopted as the activation function for H* in Eq. (4). A reverse slope s of 0.01 is chosen in one embodiment. Then, for the activation function shift 42, all of the activation functions are changed from leaky ReLU to ReLU while keeping the weights unchanged. This may incur a minor accuracy drop. The network is retrained to recover performance and continue to the pruning phase 44 with ReLU as the activation function.
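The shift can be illustrated on one hidden layer: the same weights are evaluated first with a leaky ReLU and then with a ReLU. This is a minimal sketch; the function names and the list-based layer are assumptions, and only the slope s = 0.01 comes from the description above.

```python
def leaky_relu(x, s=0.01):
    """Leaky ReLU with reverse (negative-side) slope s."""
    return x if x > 0 else s * x

def relu(x):
    return x if x > 0 else 0.0

def hidden_layer(x, W, b, activation):
    """Linear transformation followed by the given activation function."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

# Same weights before and after the shift; only the activation changes.
W = [[1.0, -2.0], [0.5, 0.5]]
b = [0.0, 0.0]
x = [1.0, 1.0]
before_shift = hidden_layer(x, W, b, leaky_relu)  # growth phase
after_shift = hidden_layer(x, W, b, relu)         # pruning phase
```

Only units with negative pre-activations change their output, which is consistent with the shift causing a small accuracy drop that retraining can recover.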
This activation function shift method brings two major benefits:
(1) The leaky ReLU effectively alleviates the ‘dying ReLU’ phenomenon, in which a zero output of the ReLU neuron blocks it from any future gradient update. Alleviating this phenomenon via reducing the learning rate results in longer training time. Adopting the leaky ReLU in the growth phase allows use of larger learning rate and momentum values, hence enabling faster training.
(2) The ReLU's zero outputs can help reduce FLOPs. Whenever the output value is zero, the corresponding multiply-accumulate operation in the next layer can be bypassed. This may reduce FLOPs by around 15%-20% in some embodiments.
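The bypassing of multiply-accumulate (MAC) operations for zero activations can be sketched as follows. This is an illustrative software model with assumed names; an actual deployment would rely on hardware or library support for sparsity.

```python
def matvec_skip_zeros(W, activations):
    """Compute W @ activations while skipping columns whose activation is zero.
    Returns the output vector and the number of MACs actually performed."""
    out = [0.0] * len(W)
    macs = 0
    for j, a in enumerate(activations):
        if a == 0.0:
            continue  # ReLU produced zero: every MAC using it is bypassed
        for i, row in enumerate(W):
            out[i] += row[j] * a
            macs += 1
    return out, macs
```

For a layer with two outputs and four inputs, half of which are zero, the MAC count drops from 8 to 4 in this model.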
Evaluation of Embodiments of the Disclosed Invention
Results for image captioning and speech recognition benchmarks are presented below. The embodiments were implemented using PyTorch on Nvidia GTX 1060 with 1.708 GHz frequency and Tesla P100 GPUs with 1.329 GHz frequency. CUDA 8.0 and CUDNN 5.1 were also used. It is to be noted that none of the implementations or particular applications used for evaluation is intended to be limiting.
NeuralTalk for Image Captioning:
The effectiveness of embodiments of the disclosed invention is first shown on image captioning.
The NeuralTalk architecture uses the last hidden layer of a pretrained CNN image encoder as an input to a recurrent decoder for sentence generation. The recurrent decoder applies a beam search technique for sentence generation. A beam size of k indicates that at step t, the decoder considers the set of k best sentences obtained so far as candidates to generate sentences in step t+1, and keeps the best k results. In the evaluated embodiment, a VGG-16 is used as the CNN encoder. H-LSTM and LSTM cells are used with the same width of 512 for the recurrent decoder and their performance is compared. Beam=2 is used as the default beam size.
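The beam search described above can be sketched as follows. This is a toy illustration, not the NeuralTalk decoder: `step_fn` is a hypothetical callback returning log-probabilities for the next token, and the three-token vocabulary is invented for the example.

```python
import math

def beam_search(step_fn, start_token, k, steps):
    """Keep the k best partial sentences at each step, scored by summed log-probability."""
    beams = [([start_token], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, logp in step_fn(seq).items():
                candidates.append((seq + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams

def toy_step(seq):
    """Hypothetical next-token distribution that depends only on the last token."""
    table = {
        "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
        "a": {"x": math.log(0.5), "y": math.log(0.5)},
        "b": {"z": math.log(0.9), "w": math.log(0.1)},
    }
    return table[seq[-1]]

greedy = beam_search(toy_step, "<s>", k=1, steps=2)
wider = beam_search(toy_step, "<s>", k=2, steps=2)
```

In this toy case a beam size of one commits to "a" at the first step, while a beam size of two keeps "b" alive and finds the higher-probability sentence ending in "z", at the cost of expanding more candidates per step.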
Results are reported on the MSCOCO dataset, which contains 123287 images of size 256×256×3, along with five reference sentences per image. The split used has 113287, 5000, and 5000 images in the training, validation, and test sets, respectively.
W is initialized in the H-LSTM based on a Gaussian distribution with zero mean and a standard deviation of 1/√n, where n is the dimension of the input vector. In the evaluation, GP training was found to work better with Gaussian initialization than with uniform initialization. The same initialization is also adopted for DeepSpeech2, to be discussed further below. An Adam optimizer is used for this evaluation. A batch size of 64 is used for training. The learning rate is initialized to 3×10⁻⁴. In the first 90 epochs, the weights of the CNN are fixed and only the LSTM decoder is trained. The learning rate is decayed by a factor of 0.8 every six epochs in this phase. After 90 epochs, the CNN and LSTM are fine-tuned at a fixed learning rate of 1×10⁻⁶. A dropout ratio of 0.2 is used for the hidden layers in the H-LSTM. A dropout ratio of 0.5 is also used for the input and output layers of the LSTM. The CIDEr-D score, a variant of the CIDEr score, is used for evaluation (CIDEr-D is the default server evaluation metric for MSCOCO).
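The initialization and learning-rate schedule described above can be sketched as follows. The function names, the zero-based epoch indexing, and the choice to apply the decay at the start of every sixth epoch are assumptions made for this illustration.

```python
import math
import random

def init_weights(rows, n, seed=0):
    """Gaussian initialization with zero mean and standard deviation 1/sqrt(n),
    where n is the input dimension."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(n)
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(rows)]

def learning_rate(epoch):
    """Decay by a factor of 0.8 every six epochs for the first 90 epochs
    (decoder-only training), then hold a fixed fine-tuning rate."""
    if epoch < 90:
        return 3e-4 * 0.8 ** (epoch // 6)
    return 1e-6
```

For example, `learning_rate(0)` gives the initial rate and `learning_rate(6)` gives that rate multiplied by 0.8.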
The performance of a fully-connected H-LSTM is first compared with a fully-connected LSTM to show the benefits emanating from using the H-LSTM cell alone.
The NeuralTalk architecture with a single LSTM achieves a 0.910 CIDEr-D score. Stacked 2-layer and 3-layer LSTMs are also evaluated, which achieve 0.921 and 0.928 CIDEr-D scores, respectively. A single H-LSTM is trained next and the results are compared in the graph and table in
H-LSTM can also reduce run-time latency. Even with Beam=1, a single H-LSTM achieves a higher accuracy than the three LSTM baselines. Reducing the beam size leads to run-time latency reduction. H-LSTM is 4.5×, 3.6×, and 2.6× faster than the stacked 3-layer LSTM, stacked 2-layer LSTM, and single LSTM, respectively, while providing higher accuracy.
Next, both network pruning and GP training are implemented to synthesize compact inference models for an H-LSTM (Beam=2). The seed architecture for GP training has a sparsity of 50%. In the growth phase, a 0.8 growth ratio is used in the first five epochs. The results are summarized in the table in
The GP-trained H-LSTM models are listed in the table in
Note that a beam size of two leads to four evaluation branches per step, i.e., about three times more computation load than a beam size of one. Thus, the 4.5× speedup of the fast model is a compounded effect of smaller model size and reduced beam size, with 1.5× and 3.0× contributions, respectively.
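The compounding can be checked with a one-line calculation, treating the two contributions as multiplicative (an assumption consistent with the figures quoted above; the symbol names are illustrative):

```latex
\[
S_{\text{total}} = S_{\text{model}} \times S_{\text{beam}} = 1.5 \times 3.0 = 4.5
\]
```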
DeepSpeech2 for Speech Recognition:
Speech recognition is also considered as an application.
A bidirectional DeepSpeech2 architecture is implemented that employs stacked recurrent layers following convolutional layers for speech recognition. Mel-frequency cepstral coefficients are used as network inputs, extracted from raw speech data at a 16 kHz sampling rate and 20 ms feature extraction window. There are two CNN layers prior to the recurrent layers and one connectionist temporal classification layer for decoding after the recurrent layers. The width of the hidden and cell states is 800. The width of H-LSTM hidden layers is also set to 800.
The AN4 dataset is used to evaluate the performance of the DeepSpeech2 architecture. It contains 948 training utterances and 130 testing utterances.
A Nesterov SGD optimizer is used in the evaluation. The learning rate is initialized to 3×10⁻⁴ and decayed per epoch by a factor of 0.99. A batch size of 16 is used for training. A dropout ratio of 0.2 is used for the hidden layers in the H-LSTM. Batch normalization is applied between recurrent layers. L2 regularization is applied during training with a weight decay of 1×10⁻⁴. The word error rate (WER) is used as the evaluation criterion.
The performance of the fully-connected H-LSTM is first compared against the fully-connected LSTM and gated recurrent unit (GRU) to demonstrate the benefits provided by the H-LSTM cell alone. GRU uses reset and update gates for memory control and has fewer parameters than LSTM.
For the baseline, various DeepSpeech2 models containing a different number of stacked layers based on GRU and LSTM cells are trained. The stacked 4-layer and 5-layer GRUs achieve a WER of 14.35% and 11.64%, respectively. The stacked 4-layer and 5-layer LSTMs achieve a WER of 13.99% and 10.56%, respectively.
Next, an H-LSTM is trained for comparison. Since an H-LSTM is intrinsically deeper, the aim is to achieve similar accuracy with a smaller stack. WERs of 12.44% and 8.92% are reached with stacked 2-layer and 3-layer H-LSTMs, respectively.
The cell comparison results are summarized in the graph and table in
GP training is next implemented to show its additional benefits on top of just performing network pruning. The stacked 3-layer H-LSTM is selected for this evaluation because it has the highest accuracy. For GP training, the seed architecture is initialized with a connection sparsity of 50%. The networks are grown for three epochs using a 0.9 growth ratio.
For compactness, an accuracy threshold for both GP training and the pruning-only process is set to 10.52%. These two approaches are compared in the table in
Two GP-trained models are obtained by varying the WER constraint during the pruning phase: an accurate model aimed at a higher accuracy (9.00% WER constraint) and a compact model aimed at extreme compactness (10.52% WER constraint).
The results against other work are compared in the table in
The introduction of the ReLU activation function in DNN gates provides additional FLOPs reduction for the H-LSTM. This effect does not apply to LSTMs and GRUs that only use tanh and sigmoid gate control functions. At inference time, the average activation percentage of the ReLU outputs is 48.3% for forward-direction LSTMs, and 48.1% for backward-direction LSTMs. This further reduces the overall run-time FLOPs by 14.5%.
The details of the final inference models are summarized in the table in
Regularization is observed to be important to the final performance of the H-LSTM. The comparison between fully-connected models with and without dropout for both applications is summarized in the table in
Some real-time applications may emphasize stringent memory and delay constraints instead of accuracy. In this case, the deployment of stacked LSTMs may be infeasible due to their substantial computation cost. However, the extra parameters in H-LSTM's hidden layers can be easily compensated by a reduced hidden layer and cell state width. Several models for image captioning in the table in
As such, embodiments disclosed herein combine H-LSTM and GP training to learn compact, fast, and accurate LSTMs. An H-LSTM adds hidden layers to control gates as opposed to architectures that just employ a one-level nonlinearity. GP training combines gradient-based growth and magnitude-based pruning to ensure H-LSTM compactness. An activation function shift technique is also incorporated to improve the training behavior as well as to reduce FLOPs. H-LSTMs were GP-trained for image captioning and speech recognition applications. For the NeuralTalk architecture on the MSCOCO dataset, disclosed embodiments reduced the number of parameters by 38.7× (FLOPs by 45.5×) and run-time latency by 4.5×, and improved the CIDEr-D score by 2.8%. For the DeepSpeech2 architecture on the AN4 dataset, disclosed embodiments reduced the number of parameters by 19.4× (FLOPs by 23.5×), run-time latency by 37.4%, and WER from 12.9% to 8.7%.
It is understood that the above-described embodiments are only illustrative of the application of the principles of the present invention. The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope. Thus, while the present invention has been fully described above with particularity and detail in connection with what is presently deemed to be the most practical and preferred embodiment of the invention, it will be apparent to those of ordinary skill in the art that numerous modifications may be made without departing from the principles and concepts of the invention as set forth in the claims.
This application claims priority to provisional application 62/677,232, filed May 29, 2018, which is herein incorporated by reference in its entirety.
This invention was made with government support under Grant #CNS-1617640 awarded by the National Science Foundation. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2019/022246 | 3/14/2019 | WO | 00 |
Number | Date | Country |
---|---|---|
62677232 | May 2018 | US |