The drawings constitute a part of this specification and include exemplary embodiments of the ECONOMIC LONG SHORT-TERM MEMORY FOR RECURRENT NEURAL NETWORKS, which may take the form of multiple embodiments. It is to be understood that in some instances, various aspects of the invention may be shown exaggerated or enlarged to facilitate an understanding of the invention. Therefore, drawings may not be to scale.
The field of the invention is recurrent neural networks, specifically long short-term memory and hardware architecture for recurrent neural networks.
Machine learning is a form of artificial intelligence that can be used to automate decision making and predictions. It can be used for image classification, pattern recognition (e.g., character recognition, face recognition, etc.), object detection, time series prediction, natural language processing, and speech recognition. Three structures of machine learning known in the art are Convolutional Neural Networks (CNN), Feed-Forward Deep Networks (FFDN), and Recurrent Neural Networks (RNN). See A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, 1725-1732; X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, 249-256; D. P. Mandic and J. Chambers, Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures, and Stability, John Wiley & Sons, Inc., 2001.
RNNs have been applied to speech recognition, language translation, image captioning, and action recognition in videos. An RNN is a deep model when it is unrolled along the time axis. One main advantage of an RNN is that it can learn from previous data and information; the key questions are what to remember and how far back a model remembers. In a standard RNN, only recent past information is used for learning. The downside is that a standard RNN cannot learn long-term dependencies due to vanishing or exploding gradients. To overcome this deficiency, Long Short-Term Memory (LSTM) has been proposed in the art. LSTM is an RNN architecture in which memory controllers are added to decide what to forget, what to remember, and what to output. The addition of LSTM allows the training procedure to learn long-term dependencies.
Hao Xue et al. previously presented a hierarchical LSTM model that considers both scene layouts and the influence of the social neighborhood to predict pedestrians' future trajectories. H. Xue, D. Q. Huynh, and M. Reynolds, "SS-LSTM: a hierarchical LSTM model for pedestrian trajectory prediction," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE (2018), 1186-1194. In their approach, known as Social-Scene-LSTM (SS-LSTM), three different LSTMs are used to capture social-, personal-, and scene-scale information. A circular-shaped neighborhood is used instead of a rectangular one. The SS-LSTM approach was tested using three datasets, and the simulation results show that the prediction accuracy is improved by using a circular-shaped neighborhood.
On the hardware side, not all neural network hardware design in the art has addressed RNNs. For example, Chang et al. presented a hardware implementation of an RNN on an FPGA. A. X. M. Chang, B. Martini, and E. Culurciello, "Recurrent neural networks hardware implementation on FPGA," arXiv preprint arXiv:1511.05552, 2015. This LSTM implementation was done on the programmable logic of the Zynq 7020 FPGA from Xilinx. The implementation has two layers with 128 hidden units, and the method was tested using a character-level language model, allowing the performance per unit power of different embedded platforms to be studied.
A standard LSTM consists of three gates and two activation functions. The first step of the LSTM is to decide which information should be forgotten from the cell state; this is the role of the "forget gate." The second step is to decide what new information will be stored in the cell. This action is performed by the "input gate," which decides which values are to be updated and then creates new candidate values. Finally, the output gate decides what data will go to the output. Each part is calculated as follows:
$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$$
$$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$$
$$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
where, for the matrix multiplication, $W_f[h_{t-1}, x_t] = W_h h_{t-1} + W_x x_t$; $f_t$ is the result of the forget gate, $i_t$ is the input gate result, and $o_t$ is the output gate result. The new candidate memory is $\tilde{c}_t$, the final memory state is $c_t$, and the cell output is $h_t$. The weights of the forget gate, input gate, and output gate are $W_f$, $W_i$, and $W_o$, respectively. The biases are $b_f$, $b_i$, and $b_o$ for the forget, input, and output gates, respectively. The symbol $\odot$ represents elementwise (Hadamard) multiplication, $\sigma$ is the logistic sigmoid function, and $\tanh$ is the hyperbolic tangent function.
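For reference, these equations can be expressed directly in code. The following is a minimal NumPy sketch of one standard LSTM step; the function and variable names are illustrative only and are not part of the disclosed design, and each weight matrix is assumed to act on the concatenation $[h_{t-1}, x_t]$.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    """One step of a standard LSTM cell following the equations above.

    Each weight W_* has shape (hidden, hidden + input) so that it multiplies
    the concatenation [h_{t-1}, x_t]; each bias b_* has shape (hidden,).
    """
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde     # new memory state (elementwise)
    h_t = o_t * np.tanh(c_t)               # cell output
    return h_t, c_t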
LSTM has been proposed in the art with variations to make it simpler and improve its performance. Greff et al. presented a coupled-gate LSTM, in which the forget gate and input gate are coupled into one. K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber, "LSTM: a search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10 (2017), 2222-2232. The structure therefore has one gate fewer, which makes it simpler than LSTM. The consequence is that the coupled-gate LSTM leads to reduced computational complexity and slightly higher accuracy. A sketch of this coupling is given after this paragraph. Cho et al. presented another LSTM variation called the Gated Recurrent Unit (GRU) architecture. Instead of the three gates of LSTM, GRU includes two gates: an update gate and a reset gate. The update gate combines the forget gate and input gate, while the reset gate has the same functionality as the output layer. The GRU model simplifies LSTM by eliminating the memory unit and the output activation function. K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078 (2014). Zhou et al. simplified the LSTM by using only one gate, in a design named the Minimal Gated Unit (MGU). Like GRU, MGU does not include the memory cell. G. B. Zhou, J. Wu, C. L. Zhang, and Z. H. Zhou, "Minimal gated unit for recurrent neural networks," International Journal of Automation and Computing, vol. 13, no. 3 (2016), 226-234. The GRU model has faster training, higher accuracy, and fewer trainable parameters compared to LSTM. Elsayed et al. presented a reduced-gate convolutional LSTM (rgcLSTM) architecture, which is another one-gate method. N. Elsayed, A. S. Maida, and M. Bayoumi, "Reduced-gate convolutional lstm using predictive coding for spatiotemporal prediction," arXiv preprint arXiv:1810.07251 (2018). It uses a memory cell, and it has a peephole connection from the cell state to the network gate.
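To illustrate the gate coupling described by Greff et al., the input gate is tied to the complement of the forget gate, so only the memory update changes relative to the standard LSTM step sketched above. The following is a hedged sketch with illustrative names, not code from any of the cited works; it operates elementwise on NumPy arrays.

def coupled_gate_memory_update(f_t, c_prev, c_tilde):
    # Coupled-gate LSTM: the input gate is replaced by (1 - f_t),
    # eliminating one gate computation relative to the standard LSTM.
    return f_t * c_prev + (1.0 - f_t) * c_tilde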
A novel LSTM structure is disclosed, designed to reduce the number of training parameters and increase training speed while retaining or improving performance. The results of the various versions of LSTM show that models with fewer parameters may provide higher accuracy. Thus, the novel LSTM structure disclosed herein utilizes one gate and two activation functions. In contrast to known LSTM designs, the gate comprises both the forget (update) gate and the input (reset) gate. The disclosed design's performance is comparable to previous LSTM designs.
The disclosed Economic LSTM (ELSTM), shown in the accompanying figures, uses a single gate; its forget gate equation is:
$$f(t) = \sigma(W_f \cdot I_f + b_f)$$
$$f(t) = \sigma([W_{cf}, W_{xf}, U_{hf}] \cdot [x(t), c(t-1), h(t-1)] + b_f)$$
where $I_f$ is a general input, in this case, and the forget gate activation vector is $f(t) \in \mathbb{R}^{d \times h \times n}$, where $d$ is the width, $h$ is the height, and $n$ is the number of channels of $f(t)$. The input vector $x(t) \in \mathbb{R}^{d \times h \times r}$ is the input, which may be an image, audio, etc., and $r$ is the number of input channels. $h(t-1)$ is the output of the block or cell at time $(t-1)$, and the stack representing the internal state at time $(t-1)$ is $c(t-1)$. Like $f(t)$, both $h(t-1)$ and $c(t-1)$ are in $\mathbb{R}^{d \times h \times n}$. For the weights, $W_{xf}$, $W_{cf}$, and $U_{hf}$ are the convolutional weights, and all their kernels have dimension $(m \times m)$; $b_f$ is the bias, which has dimension $n \times 1$. The input update equation is obtained by:
$$u(t) = \tanh(W_u \cdot I_u + b_u)$$
$$u(t) = \tanh([W_{cu}, W_{xu}, U_{hu}] \cdot [x(t), c(t-1), h(t-1)] + b_u)$$
where $I_u$ is a general input, and the update activation vector $u(t) \in \mathbb{R}^{d \times h \times n}$ matches the dimension of $f(t)$. The bias $b_u \in \mathbb{R}^{n \times 1}$ likewise matches the dimension of $b_f \in \mathbb{R}^{n \times 1}$. The output $u(t)$ of the tanh is multiplied by $1 - f(t)$, and the multiplication result is added to the memory state to generate the updated memory state. The new state is then used to produce the desired output using tanh and the output gate $f(t)$. The memory state and the output are finalized by the following equations:
$$c(t) = f(t) \odot c(t-1) + (1 - f(t)) \odot u(t)$$
$$h(t) = f(t) \odot \tanh(c(t))$$
where $c(t)$ is the final memory state, $h(t)$ is the final output, and the $\odot$ symbol represents elementwise multiplication. The comparison of computation components for LSTM, coupled-gate LSTM, MGU, GRU, and ELSTM is shown in the accompanying figures.
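Putting the ELSTM equations together, the following is a minimal, non-convolutional NumPy sketch of one ELSTM step. It is a simplification under the assumption of flat vectors rather than the $d \times h \times n$ feature maps described above, with dense matrices standing in for the convolutional weights $W_{cf}$, $W_{xf}$, $U_{hf}$, $W_{cu}$, $W_{xu}$, and $U_{hu}$; all names are illustrative and this is not a definitive implementation of the disclosed hardware.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elstm_step(x_t, c_prev, h_prev, W_f, b_f, W_u, b_u):
    """One ELSTM step with a single gate f(t) and update activation u(t).

    Simplified dense sketch: W_f and W_u each act on the stacked input
    [x(t), c(t-1), h(t-1)]; in the disclosed design these are convolutional
    weights over d x h x n feature maps.
    """
    z = np.concatenate([x_t, c_prev, h_prev])   # [x(t), c(t-1), h(t-1)]
    f_t = sigmoid(W_f @ z + b_f)                # single gate f(t)
    u_t = np.tanh(W_u @ z + b_u)                # update activation u(t)
    c_t = f_t * c_prev + (1.0 - f_t) * u_t      # c(t) = f ⊙ c(t-1) + (1-f) ⊙ u(t)
    h_t = f_t * np.tanh(c_t)                    # h(t) = f ⊙ tanh(c(t))
    return h_t, c_t

In this sketch, the single gate $f(t)$ serves the forget role, the input role (through $1 - f(t)$), and the output role, which is what reduces the parameter count relative to the standard LSTM.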
The ELSTM was implemented and tested using three datasets: the MNIST, IMDB, and ImageNet datasets. The MNIST dataset contains images of handwritten digits (0-9). It includes 10,000 images in the test set and 60,000 images in the training set. These images are preprocessed so that the center of mass of each digit lies at the center of the 28×28 image. MNIST has been commonly used for deep neural network classification. The testing is done using each row (28 pixels) as a single input. 100 hidden units are used with a batch size of 100, a learning rate of $10^{-10}$, and a momentum of 0.99. The results are shown in the accompanying figures.
ELSTM was tested again on MNIST using another method that takes every pixel as one component of the input sequence. Therefore, each image unrolls into a sequence of length 784. The pixel scanning runs from left to right and top to bottom. The aim of this task is to test ELSTM performance on long sequence lengths. The simulation results show that ELSTM has an accuracy of 84.91% after 20,000 epochs, while the traditional LSTM has an accuracy of 65% after 900,000 epochs.
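As an illustration of the two input formats used in these MNIST tests, a 28×28 image can be presented either as 28 row-vector steps or as 784 single-pixel steps. The following is a minimal NumPy sketch, assuming the image is already loaded as a 28×28 array; the function name is illustrative only.

import numpy as np

def image_to_sequences(image_28x28):
    """Turn one 28x28 MNIST image into the two sequence formats described above."""
    img = np.asarray(image_28x28, dtype=np.float32)
    row_sequence = img.reshape(28, 28)     # 28 steps, each a 28-pixel row input
    pixel_sequence = img.reshape(784, 1)   # 784 steps, scanning left to right,
                                           # top to bottom, one pixel per step
    return row_sequence, pixel_sequence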
The second test of the ELSTM studied the classification of sentiment in IMDB.com movie reviews, separating reviews into positive and negative. The IMDB dataset contains 25,000 movie reviews for testing and another 25,000 for training. The sequences have a maximum length of 128. The ELSTM is implemented using 100 hidden units with a batch size of 16, a learning rate of $10^{-8}$, and a momentum of 0.99. The simulation results show that the ELSTM has an accuracy of 65.03% while LSTM has an accuracy of 61.43% after 20,000 epochs, as seen in the accompanying figures.
Third, ELSTM was tested using the ImageNet dataset. ImageNet includes 3.2 million cleanly labeled full-resolution images organized into 12 subtrees with 5,247 synonym sets, or synsets. The simulation results show that the ELSTM has 82.39% accuracy while LSTM has an accuracy of 75.12% after 20,000 epochs, as seen in the accompanying figures.
The simulation results of all tests using the different datasets show that the disclosed ELSTM has better performance than LSTM. For a deeper evaluation, the ELSTM was next compared to multiple LSTM structures, such as the coupled-gate LSTM, MGU, and GRU, using the three datasets (MNIST, IMDB, and ImageNet), as shown in the accompanying figures.
The error evaluation using Mean Squared Error (MSE) and Mean Absolute Error (MAE) is also studied. The MSE measures the average of the squares of the errors, that is, the average squared difference between the desired value and the estimated value. The MAE is a measure of the difference between two continuous variables. Each one is calculated using the following equations:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \tilde{y}_i)^2$$
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \tilde{y}_i\right|$$
where $y_i$ is the resulting value, $\tilde{y}_i$ is the estimated value, and $n$ is the number of results. The measurements of both MSE and MAE are shown in the accompanying figures.
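A minimal NumPy sketch of the two error measures, using the symbols defined above (function names illustrative):

import numpy as np

def mse(y, y_est):
    """Mean Squared Error: average of the squared differences."""
    y, y_est = np.asarray(y), np.asarray(y_est)
    return np.mean((y - y_est) ** 2)

def mae(y, y_est):
    """Mean Absolute Error: average of the absolute differences."""
    y, y_est = np.asarray(y), np.asarray(y_est)
    return np.mean(np.abs(y - y_est))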
For the hardware design, the hardware module of the gate is shown in the accompanying figures.
The subject matter of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to necessarily limit the scope of claims. Rather, the claimed subject matter might be embodied in other ways to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies.
Although the terms “step” and/or “block” or “module” etc. might be used herein to connote different components of methods or systems employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Reference throughout this specification to “one embodiment,” “an embodiment,” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases “in one embodiment,” “in an embodiment,” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment. Moreover, the terms “substantially” or “approximately” as used herein may be applied to modify any quantitative representation that could permissibly vary without resulting in a change to the basic function to which it is related.
This application claims priority to U.S. Provisional Application No. 62/987,487 titled ECONOMIC LONG SHORT-TERM MEMORY FOR RECURRENT NEURAL NETWORKS, filed on Mar. 10, 2020.