The invention relates generally to digital hardware systems, such as digital circuits, and more particularly to methods and digital circuit hardware systems for efficient processing of recurrent neural networks.
There is a wide variety of recurrent neural networks (RNNs) available, including those based on Long Short Term Memories (LSTMs), Legendre Memory Units (LMUs), and many variants of these. However, commercially viable RNNs have several constraints that are often ignored by research-focused efforts. In this more constrained setting, these RNNs must be:
(1) High performance: The most responsive, low-latency networks process data as soon as it is available, without any buffering. Many methods are tested on the assumption that large windows of data are available all at once and can be easily accessed from memory. However, at deployment, buffering large amounts of data into memory introduces undesirable latency and increases the overall size, and consequently the power consumption, of such an implementation. This requirement imposes strict bounds on end-to-end latency and throughput while processing such a recurrent neural network. For real-time processing of recurrent neural networks, end-to-end latency must be minimised and the throughput of the implementation maximised.
(2) Power efficient: For RNN applications sensitive to power dissipation, such as battery-powered automatic speech recognition, keyword spotting, etc., the amount of power consumed in processing such RNNs becomes important. While there is no sole determiner of power efficiency, quantization, the number and types of operations, the number and types of memory accesses, as well as the static power dissipation of the underlying hardware, are all important factors.
Thus, commercially viable recurrent neural networks mandate a custom hardware design that not only supports the wide variety of typical layers that constitute an RNN, such as LMU, projected LSTM and feed forward cells, but also allows for highly efficient and distributed mapping of said RNN layers onto the hardware by taking into account unique constraints related to power, performance, or a combination of both. For this, such hardware designs should also provide theoretical performance and power models to accurately calculate performance and power metrics for the mapping of a given RNN onto the hardware design.
A distinguishing feature of the Legendre Memory Unit (LMU) (see Voelker, A. R., Kajić, I. and Eliasmith, C., 2019. Legendre memory units: Continuous-time representation in recurrent neural networks), consisting of linear ‘memory layers’ and a nonlinear ‘output layer’, is that the linear memory layers are optimal for compressing an input time series over time. Because of this provable optimality, the LMU has fixed recurrent and input weights on the linear layer. The LMU outperforms all previous RNNs on the standard psMNIST benchmark task by achieving 97.15-98.49% test accuracy (see Voelker, A. R., Kajić, I. and Eliasmith, C., 2019. Legendre memory units: Continuous-time representation in recurrent neural networks), compared to the next best network (dilated RNN) at 96.1% and the LSTM at 89.86%, while using far fewer parameters (102,000 versus 165,000, a reduction of 38%). While a hardware based implementation of the A and B weights of the LMU is available (see Voelker, A. R., Kajić, I. and Eliasmith, C., 2019. Legendre memory units: Continuous-time representation in recurrent neural networks), there is no hardware design, complete with performance and power models, that can implement an LMU of varying size and characteristics in a highly efficient and distributed manner, all the while remaining able to interface with hardware implementations of projected LSTMs (see Sak, H., Senior, A. W. and Beaufays, F., 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling) or feed forward cells of the RNN.
As an alternative to the large size of the standard Long Short Term Memory (LSTM) cell, the Projected Long Short Term Memory (LSTM-P/Projected LSTM) cell architecture has been proposed (see Sak, H., Senior, A. W. and Beaufays, F., 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling), which projects the hidden states of the LSTM onto a smaller dimension, thus reducing the number of recurrent connections and hence their memory requirements. While there have been many hardware based implementations of the standard LSTM (see Chang, A. X. M., Martini, B. and Culurciello, E., 2015. Recurrent neural networks hardware implementation on FPGA. arXiv preprint arXiv:1511.05552) that allow for different efficient implementations (see Zhang, Yiwei, Chao Wang, Lei Gong, Yuntao Lu, Fan Sun, Chongchong Xu, Xi Li, and Xuehai Zhou. “Implementation and optimization of the accelerator based on fpga hardware for lstm network.” In 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC), pp. 614-621. IEEE, 2017) (see Yazdani, R., Ruwase, O., Zhang, M., He, Y., Arnau, J. M. and Gonzalez, A., 2019. Lstm-sharp: An adaptable, energy-efficient hardware accelerator for long short-term memory. arXiv preprint arXiv:1911.01258), there currently is no hardware design that implements a well composed projected LSTM, complete with performance and power models, that can support LSTM and projection cells of varying sizes and characteristics in a highly efficient and distributed manner, all the while remaining able to interface with hardware implementations of LMUs and feed forward cells of the RNN.
The feed forward cell, while itself not recurrent in nature, is generally a common part of RNNs, typically with multiple such cells of a plurality of sizes being used to extract low level features from the preceding recurrent states. While there have been many efficient hardware designs of feed forward cells (see Himavathi, S., Anitha, D. and Muthuramalingam, A., 2007. Feedforward neural network implementation in FPGA using layer multiplexing for effective resource utilization. IEEE Transactions on Neural Networks, 18(3), pp. 880-888) (see Canas, A., Ortigosa, E. M., Ros, E. and Ortigosa, P. M., 2006. FPGA implementation of a fully and partially connected MLP. In FPGA Implementations of Neural Networks (pp. 271-296). Springer, Boston, Mass.) that support processing of varying sizes and characteristics of these cells, there have been no designs that do so in the setting of recurrent neural networks; that is, remaining able to interface with hardware implementations of the LMU and LSTM-P cells of the RNN, while also providing power and performance models to enable highly efficient and distributed mapping of RNNs onto the digital hardware design.
There thus remains a need for improved methods and systems for efficient processing of recurrent neural networks for application domains including, but not limited to, automatic speech recognition (ASR), keyword spotting (KWS), biomedical signal processing, and other applications that involve processing time-series data.
A digital hardware system for processing time series data with a recurrent neural network is provided, wherein the digital hardware system comprises a plurality of computation blocks and computes parts of a given recurrent neural network by time multiplexing those parts over a spatially distributed set of computation blocks of the digital hardware system. The digital hardware system is configured to: read a sequence of time series data to be processed from the memory of the digital hardware system; temporally process the data by time multiplexing the recurrent neural network over the spatially distributed set of computation blocks of the digital hardware system; and write the final processed output or processed intermediate activations of the recurrent neural network to the memory of the digital hardware system.
The invention is illustrated in the figures of the accompanying drawings which are meant to be exemplary and not limiting, in which like references are intended to refer to like or corresponding parts, and in which:
Having summarized the invention above, certain exemplary and detailed embodiments will now be described below, with contrasts and benefits over the prior art being more explicitly described.
It will be understood that the specification is illustrative of the present invention and that other embodiments suggest themselves to those skilled in the art. All references cited herein are incorporated by reference.
In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that the embodiments may be combined, or that other embodiments may be utilized and that structural and logical changes may be made without departing from the spirit and scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims and their equivalents.
The main embodiment of the present invention provides digital hardware designs for processing time series based input data with a recurrent neural network, where the parts of the recurrent neural network are composed of at least one of an LMU cell and an LSTM-P cell, plus zero or more feed forward cells. The input is processed by time multiplexing the parts of the recurrent neural network (RNN) over a spatially distributed set of these computation blocks of the digital hardware design, where a computation block refers to a digital hardware implementation of an LMU cell, an LSTM-P cell, or a feed forward cell. Each computation block is built by connecting together a plurality of computation sub blocks. The spatial distribution of computation blocks and the time multiplexing of parts of the RNN over these computation blocks, referred to as “mapping” or “allocation”, is determined by an algorithm that takes as input a set of user constraints related to latency, throughput, transistor count, power consumption and the like. The mapping algorithm explores all combinations of partitioning the network while, for each unique partition combination, updating the spatial distribution and time multiplexing in an iterative fashion by identifying the slowest computation sub block until said user constraints are met.
Various terms as used herein are defined below. To the extent a term used in a claim is not defined below, it should be given the broadest definition persons in the pertinent art have given that term as reflected in printed publications and issued patents at the time of filing.
As used herein, the term “LMU” refers to the Legendre Memory Unit recurrent cell, and in particular any neural network embodiment of equations 1 and 2 from Voelker et al. (2019), referred also to as the “linear layer”, which is to be understood as a general component of a neural network, and is not limited in use within any one particular network architecture.
As used herein, the term “LSTM-P” refers to the Projected LSTM recurrent cell.
As used herein, the term “activation” refers to a function that maps an input to an output that includes, but is not limited to, functions such as rectified linear units, sigmoid, hyperbolic tangent, linear, etc. These functions might also be approximated for the purposes of hardware implementation, for example by way of thresholding, linear approximations, lookup-tables, etc.
As used herein, the term “vector” refers to a group of scalar values, where all scalar values of the group may be transformed by some arbitrary mathematical function in sequence or in parallel. The contents of a vector depend on context, and may include, but are not limited to, activities produced by activation functions, weights stored in SRAM, inputs stored in SRAM, outputs stored in SRAM, values along a datapath, etc.
As used herein, the term “computational sub block” refers to a digital hardware implementation responsible for carrying out a specific computation and composed of different operators including, but not limited to, multiply and accumulate blocks, adders, multipliers, registers, memory blocks, etc. Different types of computation sub blocks are responsible for carrying out different computations.
As used herein, the term “computational block” refers to a group of computation sub blocks responsible for carrying out the computations assigned to it, with the group composed of a plurality of different types of computation sub blocks, and with members from the same group of computation sub blocks working in parallel to process part or the whole of the computation assigned to their respective computation block.
As used herein, the term “time multiplexing” refers to the scheduling of a group of computations such that each computation in the group is carried out by a computation sub block in order, that is, one after the other in time.
As used herein, the term “quantization” refers to the process of mapping a set of values to a representation with a narrower bit width as compared to the original representation. Quantization encompasses various types of quantization scaling including uniform quantization, power-of-two quantization, etc. as well as different types of quantization schemes including symmetric, asymmetric, etc.
As used herein, the term “quantization block” refers to a digital hardware block that performs a quantization operation.
As used herein, the term “dequantization” refers to the process of mapping a set of quantized values to the original bit width. Dequantization encompasses various types of dequantization scaling including uniform dequantization, power-of-two dequantization, etc. as well as different types of dequantization schemes including symmetric, asymmetric, etc.
As used herein, the term “dequantization block” refers to a digital hardware block that performs a dequantization operation.
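As an illustration of the quantization and dequantization operations defined above, the following is a minimal sketch assuming uniform symmetric scaling to a signed 8 bit representation; the function names and the max-based scale are illustrative choices, not requirements of the hardware design.

```python
import numpy as np

def quantize(x, bits=8):
    """Map values to a narrower signed integer representation.
    Uniform symmetric scaling is assumed here; asymmetric or power-of-two
    scaling follows the same pattern with a different scale (and offset)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Map quantized values back to the original (wider) representation."""
    return q.astype(np.float32) * scale

# A quantization block followed by a dequantization block:
x = np.array([-1.3, 0.02, 0.7, 1.1], dtype=np.float32)
q, s = quantize(x, bits=8)
x_hat = dequantize(q, s)  # approximates x to within one quantization step
```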
A number of different variable notations are used to refer to the dynamic energy consumption, as follows:
${}^{32,8}E_{q}^{OP}$: The dynamic energy consumed by a quantization operation, with an input operand width of 32 and output width of 8.
A number of different variable notations are used to refer to the static leakage power, as follows:
A number of different variable notations are used to refer to parameters of a particular network architecture, as follows:
The digital hardware implementation of the LSTM-P, referred to as a LSTM-P computation block, is divided into 2 separate computation sub blocks, a plurality of which can be connected together to form a computation block for an LSTM-P cell. The 2 computation sub blocks implement the LSTM-P cell by dividing the cell into 2 parts and implementing each respectively: one for the LSTM cell (LB) and one for the projection cell (PB). Each computation sub block computes one output unit at a time and a corresponding cell with multiple outputs can be time multiplexed onto the same computation sub block.
Digital Hardware Implementation of LSTM cell (LB)
The digital hardware implementation of an LSTM computation sub block, computing one output unit at a time, is shown in
It is important to remember that, when each output unit of an LSTM cell is allocated an LB, each such LB block computes one output unit of that LSTM cell. Realistically, an LSTM cell with a certain number of output units will be allocated fewer LBs than the number of output units. In this case the cell's output computation will be temporally multiplexed in parallel over the allocated LBs.
As shown in
$$i_t = \sigma(W_{ix} \odot x_t)$$
$$f_t = \sigma(W_{fx} \odot x_t)$$
$$o_t = \sigma(W_{ox} \odot x_t)$$
$$c_t = (f_t * c_{t-1}) + (i_t * \sigma(W_{cx} \odot x_t))$$
$$y_t = o_t * \sigma(c_t)$$
where xt is the input data at timestep t, Wix are the weights of the input gate, Wfx are the weights of the forget gate, Wox are the weights of the output gate and Wcx are the weights of the cell. σ is an activation function. ⊙ refers to a multiplication and accumulation of 2 input vectors.
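For clarity, a minimal NumPy sketch of the per-output-unit computation that an LB carries out is given below. It follows the equations above directly (the gates here depend only on the input vector xt, as written); the variable names are illustrative and the sigmoid stands in for whatever activation approximation the hardware uses.

```python
import numpy as np

def sigma(z):
    # Activation; a sigmoid is used here for illustration. The hardware may
    # approximate it with thresholding, linear pieces, or lookup tables.
    return 1.0 / (1.0 + np.exp(-z))

def lstm_output_unit(x_t, c_prev, W_ix, W_fx, W_ox, W_cx):
    """One output unit of the LSTM cell, per the equations above.
    Each weight argument is the weight vector of one gate for this unit;
    each dot product corresponds to one MAC (the ⊙ operation)."""
    i_t = sigma(W_ix @ x_t)
    f_t = sigma(W_fx @ x_t)
    o_t = sigma(W_ox @ x_t)
    c_t = f_t * c_prev + i_t * sigma(W_cx @ x_t)
    y_t = o_t * sigma(c_t)
    return y_t, c_t

# Time multiplexing N output units onto one LB amounts to repeating this
# per-unit computation for the N weight rows allocated to that LB.
x = np.array([0.1, -0.2, 0.3, 0.05])
w = np.full(4, 0.5)
y, c = lstm_output_unit(x, c_prev=0.0, W_ix=w, W_fx=w, W_ox=w, W_cx=w)
```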
As shown in
As shown in
Power and energy consumption of a computation sub block can be generally divided into 2 categories: 1. static leakage power; and 2. dynamic switching energy. Each of these can further be divided into: 1. scale variant and 2. scale invariant categories.
Static leakage power and dynamic switching energy can be categorised as scale variant if the consumption changes with respect to the number of output units allocated to a computation sub block. For example, the size of a memory block storing weights of a network connection will increase as more output units are time multiplexed onto the same block. Hence, the static power of the memory blocks storing weights can be categorized as scale variant. If the overall consumption does not depend on the number of output units time multiplexed onto the computation sub block, it can be categorised as scale invariant. For example, a multiply and accumulate sub module dissipates a deterministic amount of static power, invariant to the number of output units time multiplexed onto the computation sub block containing that multiply and accumulate unit.
The scale variant dynamic energy can be modeled by summing up dynamic energy of each individual operation involved in computation of one output unit. The scale invariant portion can be calculated by summing up total energy of operations that do not depend on the degree of time multiplexing of outputs of a cell on a computation sub block. The scale variant static power consumption can be modeled by summing up static power consumption of each scale variant sub module for computation of one output unit. The scale invariant portion can be calculated by summing up static power consumption of sub modules that do not depend on the degree of time multiplexing of outputs of a cell on a computation sub block.
Inactive computation sub blocks can be disconnected from the voltage supply or put in a low power mode using Dynamic Voltage and Frequency Scaling (DVFS). A highly deterministic mapping of the recurrent neural network allows prediction of cycle level data flow behaviour. By turning off computation sub blocks when they are not needed, static power consumption can be decreased. Care should be exercised when applying the principles of DVFS to the memory blocks as they can be potentially volatile in nature and may lead to loss of data.
Dynamic energy consumed by an LSTM computation sub block with index i (LB) with one output unit allocated is defined as Edi,LB. As discussed above, dynamic energy consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to LBi): Ed-svi,LB as the scale variant portion of Edi,LB and Ed-sivi,LB as the scale invariant portion. For
Static power consumption of an LSTM computation sub block with index i (LBi) with one output unit allocated is defined as Psi,LB. As discussed above, static power consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to LBi): Ps-svi,LB as the scale variant portion of Psi,LB and Ps-sivi,LB as the scale invariant portion. For
Total energy consumed by an LSTM computation sub block with index i (LBi) can be written as below:
$$E_{N,total}^{i,LB} = N \cdot E_{d\text{-}sv}^{i,LB} + E_{d\text{-}siv}^{i,LB} + N \cdot P_{s\text{-}sv}^{i,LB} \cdot latency_{e2e} + P_{s\text{-}siv}^{i,LB} \cdot latency^{i,LB}$$
where latencye2e is the total latency of processing a given input of the recurrent neural network, latencyiLB is the latency of computation sub block LBi of processing all time multiplexed outputs allocated to it and N is the total number of output units time multiplexed onto LBi.
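As a worked illustration of this energy model, a small helper is shown below; the same expression is reused later for the PB, UB, MB, HB and FB sub blocks with their respective terms. The numerical values in the example are placeholders, not measured data.

```python
def total_energy(N, E_d_sv, E_d_siv, P_s_sv, P_s_siv,
                 latency_e2e, latency_block):
    """Total energy of a computation sub block with N output units
    time multiplexed onto it:
      E = N*E_d_sv + E_d_siv + N*P_s_sv*latency_e2e + P_s_siv*latency_block
    Energies in joules, powers in watts, latencies in seconds."""
    return (N * E_d_sv + E_d_siv
            + N * P_s_sv * latency_e2e
            + P_s_siv * latency_block)

# Example with illustrative (not measured) numbers:
E = total_energy(N=64, E_d_sv=1.2e-12, E_d_siv=0.3e-12,
                 P_s_sv=5e-9, P_s_siv=2e-6,
                 latency_e2e=20e-3, latency_block=0.4e-3)
```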
Latency of an LSTM computation sub block with index i (LBi) for processing one output can be calculated by adding up individual latencies of each operation involved in the computation of one output unit. For the LB computation sub block shown in
$$T^{i,LB} = (n_i + n_r) + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1$$
It should be noted that the extra cycle of reading data from memory can be easily shadowed by reading a cycle early, before the compute actually starts. Hence, it is not included in the computation above.
$$T_N^{i,LB} = (n_i + n_r) \cdot N + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1$$
Digital Hardware Implementation of the projection cell (PB)
The digital hardware implementation of a projection computation sub block, computing one cell's worth of output, is shown in
It is important to remember that, when each output unit of a projection cell is allocated a PB, each such PB block computes one output unit of that projection cell. Realistically, a projection cell with a certain number of output units will be allocated fewer PBs than the number of output units. In this case the cell's output computation will be temporally multiplexed in parallel over the allocated PBs.
As shown in
$$h^{p}_{t} = \sigma(W_{px} \odot x_t)$$
where xt is the input data at timestep t, equal to the output of the corresponding LSTM cell. Wpx are the weights of the projection connection. σ is an activation function. ⊙ refers to a multiplication and accumulation of 2 input vectors.
As shown in
As shown in
Similar to the LSTM computation sub block, LB, power and energy consumption of a computation sub block PB can also be divided into 2 categories: 1. static leakage power; and 2. dynamic switching energy and each of these can further be divided into: 1. scale variant; and 2. scale invariant categories introduced previously.
Similar to the LSTM computation sub block, LB, inactive computation sub blocks can be disconnected from the voltage supply or put in low power mode using Dynamic Voltage and Frequency Scaling, as introduced previously.
Dynamic energy consumed by a projection computation sub block with index i (PBi) with one output unit allocated is defined as Edi,PB. As discussed above, dynamic energy consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to PBi):Ed-svi,PB as the scale variant portion of Edi,PB and Ed-sivi,PB as the scale invariant portion. For
Static power consumption of a projection computation sub block with index i (PBi) with one output unit allocated is defined as Psi,PB. As discussed above, static power consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to PBi): Ps-svi,PB as the scale variant portion of Psi,PB and Ps-sivi,PB as the scale invariant portion. For
$$P_{s\text{-}sv}^{i,PB} = n_c \cdot {}^{1\times8}P_{memory}^{s} + {}^{1\times8}P_{memory}^{s}$$
$$P_{s\text{-}siv}^{i,PB} = {}^{8,8,29}P_{MAC}^{s} + {}^{29,32}P_{dq}^{s} + {}^{32,32}P_{act}^{s} + {}^{32,8}P_{q}^{s} + {}^{8,8,8}P_{2:1\,mux}^{s} + 2 \cdot {}^{32,32}P_{reg}^{s} + 2 \cdot {}^{8,8}P_{reg}^{s}$$
Total energy consumed by a projection computation sub block with index i (PBi) can be written as below:
$$E_{N,total}^{i,PB} = N \cdot E_{d\text{-}sv}^{i,PB} + E_{d\text{-}siv}^{i,PB} + N \cdot P_{s\text{-}sv}^{i,PB} \cdot latency_{e2e} + P_{s\text{-}siv}^{i,PB} \cdot latency^{i,PB}$$
where latencye2e is the total latency of processing a given input of the recurrent neural network, latencyiPB is the latency of computation sub block PBi of processing all time multiplexed outputs allocated to it and N is the total number of output units time multiplexed onto PBi.
Modeling Performance of Digital Hardware Implementation of projection cell (PB)
Latency of a projection computation sub block with index i (PBi) for processing N outputs, abbreviated as TNi, PB, can be calculated by adding up individual latencies of each operation involved in the computation of N output units. Similar to the LSTM computation sub block, a plurality of pipelining registers can be introduced to reduce the longest delay path and also increase throughput by filling up the data path with computation across multiple of these N units, thus shadowing critical stages of computation of chronologically newer output units behind critical stages of older output units. Hence, latency of a block PBi, with configuration as
$$T_N^{i,PB} = (n_r \cdot N) + 4$$
A plurality of LSTM computation sub blocks (LBs) can be connected to work in parallel with a plurality of projection computation sub blocks (PBs), with memory output readouts from one set of blocks connected to the input bus of the other set via a chain of multiplexer and registers acting as an interconnect.
All the 8 LBs will finish their MAC operations at the same time. All 8 LBs write their allocated LSTM cells' outputs to 8 different memories (titled “memory output” in
Due to symmetry of workload distribution, all PBs finish their workload of projected output calculation at the same time. Each of them then writes its computed projected units to its memory (titled “memory output” in
Total energy consumption of the LSTM-P computation block can be calculated by summing up individual energy consumption of its constituent set of LSTM computation sub blocks LB and its constituent projection computation sub blocks PB.
For calculating the latency and throughput of a LSTM-P computation block, the parallel mode of operation plays an important role. All constituent sub blocks in the set of constituent LSTM computation sub blocks work in parallel. Hence, the slowest LB sub block in finishing the computation of its allocated output units of the LSTM cell determines the latency of the entire set of constituent LSTM sub blocks. Similarly, the slowest PB sub block in finishing the computation of its allocated output units of the projection cell determines the latency of the entire set of constituent projection sub blocks. Hence, to calculate the overall latency of an LSTM-P computation block, the latency of the set of constituent LSTM sub blocks can be added with the latency of the set of constituent projection sub blocks. The throughput of the LSTM-P computation block is determined by the slowest computation sub block, regardless of type.
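A minimal sketch of this composition rule is given below, using the LB and PB latency expressions derived above. The allocation lists hold the number of output units time multiplexed onto each sub block; all values are illustrative.

```python
def t_lb(n_alloc, n_i, n_r):
    # Cycles for one LB to process its allocated outputs (pipelined form above).
    return (n_i + n_r) * n_alloc + 9

def t_pb(n_alloc, n_r):
    # Cycles for one PB to process its allocated outputs.
    return n_r * n_alloc + 4

def lstmp_block_latency(lb_alloc, pb_alloc, n_i, n_r):
    """Latency of the whole LSTM-P computation block: the slowest LB sets the
    LSTM stage, the slowest PB sets the projection stage, and the stages add."""
    return (max(t_lb(n, n_i, n_r) for n in lb_alloc)
            + max(t_pb(n, n_r) for n in pb_alloc))

def lstmp_block_interval(lb_alloc, pb_alloc, n_i, n_r):
    """Cycles between successive inputs (inverse throughput): limited by the
    slowest computation sub block of either type."""
    return max(max(t_lb(n, n_i, n_r) for n in lb_alloc),
               max(t_pb(n, n_r) for n in pb_alloc))

# Example: 8 LBs and 2 PBs with a symmetric allocation of output units.
cycles = lstmp_block_latency([128] * 8, [64] * 2, n_i=80, n_r=64)
```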
Digital Hardware implementation of LMU is divided into 3 separate computation sub blocks, a plurality of which can be connected together to form a computation block for an LMU cell. The 3 computation sub blocks implement the LMU cell by dividing the cell into 3 parts and implementing each respectively: one for the encoder cell (UB), one for the memory cell (MB) and one for the hidden cell (HB). Each computation sub block computes one output unit at a time and a corresponding cell with multiple outputs can be time multiplexed onto the same computation sub block.
Digital Hardware Implementation of encoder cell (UB)
The digital hardware implementation of an encoder computation sub block, computing one output unit at a time, is shown in
Similar to the LSTM-P's computation sub block, the encoder cell's output computation will be temporally multiplexed in parallel over the allocated UBs.
It is important to note that the input vector to the encoder cells, the hidden state and the memory can all potentially be of different sizes; the MAC operations associated with each of them can therefore finish at different times, with the longest one becoming the bottleneck in computation latency. A triangle inequality check, which tests whether the sum of two of these lengths is less than the third, can be deployed to determine whether two multiply and accumulate operations can be sequentialized to save static power dissipation and silicon area of the resulting hardware implementation.
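One way such a check could be implemented is sketched below (illustrative only): if the sum of two of the three MAC lengths is less than the third, those two MAC operations can be run back to back on a single MAC unit without extending the bottleneck latency.

```python
def can_sequentialize(n_input, n_hidden, n_memory):
    """Return the pair of MAC operations (if any) whose vector lengths sum to
    less than the third length; those two can share one MAC unit, saving the
    static power and silicon area of a separate multiply and accumulate unit."""
    lengths = {"input": n_input, "hidden": n_hidden, "memory": n_memory}
    for longest in lengths:
        others = [k for k in lengths if k != longest]
        if lengths[others[0]] + lengths[others[1]] < lengths[longest]:
            return others  # these two MACs can be sequentialized
    return None            # no pair fits behind the longest MAC

# Example: input of length 512, hidden of 128, memory of 256 ->
# the hidden and memory MACs (128 + 256 < 512) can share one MAC unit.
print(can_sequentialize(512, 128, 256))
```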
As shown in
$$u_t = \sigma(e_x \odot x_t + e_h \odot h_{t-1} + e_m \odot m_{t-1})$$
where xt is the input data at timestep t, ht−1 is the hidden state at timestep t−1, and mt−1 is the memory of one of the LMU tapes at timestep t−1, with ex, eh and em being the corresponding weights. σ is an activation function. ⊙ refers to a multiplication and accumulation of 2 input vectors.
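A minimal sketch of this per-unit encoder computation is shown below; the names are illustrative and the activation is a stand-in for whatever approximation the hardware implements.

```python
import numpy as np

def encoder_unit(x_t, h_prev, m_prev, e_x, e_h, e_m, act=np.tanh):
    """One output unit u_t of the encoder cell, per the equation above:
    u_t = act(e_x . x_t + e_h . h_{t-1} + e_m . m_{t-1}).
    The three dot products are the three MAC operations over the input,
    hidden and memory vectors, which may all have different lengths."""
    return act(e_x @ x_t + e_h @ h_prev + e_m @ m_prev)
```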
As shown in
As shown in
Similar to computation sub blocks of the LSTM-P, power and energy consumption of a computation sub block UB can also be divided into 2 categories: 1. static leakage power; and 2. dynamic switching energy and each of these can further be divided into: 1. scale variant; and 2. scale invariant categories introduced previously.
Similar to the computation sub blocks of the LSTM-P, inactive computation sub blocks can be disconnected from the voltage supply or put in low power mode using Dynamic Voltage and Frequency Scaling, as introduced previously.
Dynamic energy consumed by an encoder computation sub block with index i (UBi) with one output unit allocated is defined as Edi,UB. As discussed above, dynamic energy consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to UBi): Ed-svi,UB as the scale variant portion of Edi,UB and Ed-sivi,UB as the scale invariant portion. For
$$E_{d}^{i,UB} = 2 \cdot {}^{8,8,8}E_{2:1\,mux} + (n_c + n_i + n_r) \cdot {}^{8,8}E_{memory}^{r} + (n_c + n_i + n_r) \cdot {}^{8,8,29}E_{MAC}^{OP} + 2 \cdot {}^{29,32}E_{dq}^{OP} + {}^{32,32,32}E_{adder}^{OP} + {}^{32,32}E_{act}^{OP} + {}^{32,8}E_{q}^{OP} + (n_c / num\_MB_{current}) \cdot {}^{8,8}E_{memory}^{r} + 4 \cdot {}^{32,32}E_{reg}^{OP} + {}^{8,8}E_{reg}^{OP}$$
Static power consumption of an encoder computation sub block with index i (UBi) with one output unit allocated is defined as Psi,UB. As discussed above, static power consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to UBi): Ps-svi,UB as the scale variant portion of Psi,UB and Ps-sivi,UB as the scale invariant portion. For
$$P_{s}^{i,UB} = {}^{8,8,8}P_{2:1\,mux}^{s} + (n_i + n_c + n_r) \cdot {}^{1\times8}P_{memory}^{s} + 2 \cdot {}^{8,8,29}P_{MAC}^{s} + 2 \cdot {}^{29,32}P_{dq}^{s} + {}^{32,32,32}P_{adder}^{s} + {}^{32,32}P_{act}^{s} + {}^{32,8}P_{q}^{s} + {}^{1\times8}P_{memory}^{s} + {}^{8,8,8}P_{2:1\,mux}^{s} + 4 \cdot {}^{32,32}P_{reg}^{s} + 2 \cdot {}^{8,8}P_{reg}^{s}$$
Total energy consumed by an encoder computation sub block with index i (UBi) can be written as below:
$$E_{N,total}^{i,UB} = N \cdot E_{d\text{-}sv}^{i,UB} + E_{d\text{-}siv}^{i,UB} + N \cdot P_{s\text{-}sv}^{i,UB} \cdot latency_{e2e} + P_{s\text{-}siv}^{i,UB} \cdot latency^{i,UB}$$
where latencye2e is the total latency of processing a given input of the recurrent neural network, latencyiUB is the latency of computation sub block UBi of processing all time multiplexed outputs allocated to it and N is the total number of output units time multiplexed onto UBi.
Latency of an encoder computation sub block with index i (UBi) for processing N outputs, abbreviated as TNi, UB, can be calculated by adding up individual latencies of each operation involved in the computation of N output units. Similar to the LSTM-P computation sub blocks, a plurality of pipelining registers can be introduced to reduce the longest delay path and also increase throughput by filling up the data path with computation across multiple of these N units, thus shadowing critical stages of computation of chronologically newer output units behind critical stages of older output units. For the UB computation sub block shown in
$$T_N^{i,UB} = \begin{cases} (n_c \cdot N) + 5, & \text{if } n_c \ge n_i + n_r \\ (n_r \cdot N) + 5, & \text{if } n_r \ge n_i + n_c \\ (n_i \cdot N) + 5, & \text{if } n_i \ge n_c + n_r \end{cases}$$
Digital Hardware Implementation of the memory cell (MB)
The digital hardware implementation of a memory computation sub block, computing one unit of memory output, is shown in
Similar to computation sub blocks introduced before, the cell's output computation will be temporally multiplexed in parallel over the allocated MBs.
As shown in
$$A = [a_{ij}] \in \mathbb{R}^{q \times q}, \quad \text{where } a_{ij} = (2i+1) \cdot \begin{cases} -1, & \text{if } i < j \\ (-1)^{i-j+1}, & \text{if } i \ge j \end{cases}$$
where q is a property of the LMU cell and i and j are greater than or equal to zero, and GenB implementing the following equation with output line [604] in
$$B = [b_i] \in \mathbb{R}^{q \times 1}, \quad \text{where } b_i = (2i+1) \cdot (-1)^{i}$$
where q is a property of the LMU cell and i is greater than or equal to zero.
The “2*i+1” is implemented by [600] in
Optionally, the weights A and B can also be stored in memory instead. Each of these weight sets only corresponds to the appropriate output units assigned to it. These weights are then subsequently multiplied and accumulated with an input flit, typically the memory of a particular memory tape, from the input data buses [500] and [502] using the MAC blocks [528] like [530]. The 2 values are added up [510] and the values are appropriately transformed using a hardware implementation of activations [512] and finally written to an output memory block [516]. Finally, the output line [526] of the output memory block is connected to a multiplexer [518] and a registered line [522] to support interfacing with a multiplexer based chain interconnect to support movement of data from this computation sub block to other computation sub blocks [520]. The memory blocks in the MB can individually be, but are not limited to, SRAM memory, DRAM memory, ROM memory, flash memory, or solid state memory and can be either volatile or non volatile. The MB shown in
$$m_t = \sigma(A \odot m_{t-1} + B * u_t)$$
where mt−1 is the memory of a memory tape at timestep t−1 and ut is the encoded input of that memory tape, with A and B being the weights of the memory cell. σ is an activation function. ⊙ refers to a multiplication and accumulation of 2 input vectors.
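A minimal sketch of the GenA/GenB equations and the memory update above is shown below; the order q, the linear activation and the use of precomputed matrices are illustrative choices (the hardware generates the A and B entries on the fly rather than storing them).

```python
import numpy as np

def gen_A(q):
    """LMU A matrix: a_ij = (2i+1) * (-1 if i < j else (-1)^(i-j+1))."""
    A = np.empty((q, q))
    for i in range(q):
        for j in range(q):
            A[i, j] = (2 * i + 1) * (-1.0 if i < j else (-1.0) ** (i - j + 1))
    return A

def gen_B(q):
    """LMU B vector: b_i = (2i+1) * (-1)^i."""
    return np.array([(2 * i + 1) * (-1.0) ** i for i in range(q)])

def memory_update(m_prev, u_t, A, B, act=lambda z: z):
    """m_t = act(A . m_{t-1} + B * u_t), per the equation above;
    a linear activation is assumed here."""
    return act(A @ m_prev + B * u_t)

# Example: one memory tape of order q = 4 driven by a scalar encoded input.
q = 4
m = np.zeros(q)
m = memory_update(m, u_t=0.5, A=gen_A(q), B=gen_B(q))
```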
As shown in
As shown in
Similar to the previously introduced computation sub blocks, power and energy consumption of a computation sub block MB can also be divided into 2 categories: 1. static leakage power; and 2. dynamic switching energy and each of these can further be divided into: 1. scale variant; and 2. scale invariant categories introduced previously.
Similar to previously introduced computation sub blocks inactive computation sub blocks can be disconnected from the voltage supply or put in low power mode using Dynamic Voltage and Frequency Scaling, as introduced previously.
Dynamic energy consumed by a memory computation sub block with index i (MBi) with one output unit allocated is defined as Edi,MB. As discussed above, dynamic energy consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to MBi): Ed-svi,MB as the scale variant portion of Edi,MB and Ed-sivi,MB as the scale invariant portion. For
Static power consumption of a memory computation sub block with index i (MBi) with one output unit allocated is defined as Psi,MB. As discussed above, static power consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to MBi): Ps-svi,MB as the scale variant portion of Psi,MB and Ps-sivi,MB as the scale invariant portion. For
$$P_{s\text{-}sv}^{i,MB} = 2 \cdot {}^{1\times8}P_{memory}^{s}$$
$$P_{s\text{-}siv}^{i,MB} = {}^{8}P_{GEN\text{-}A}^{s} + {}^{8}P_{GEN\text{-}B}^{s} + 2 \cdot {}^{8,8,29}P_{MAC}^{s} + 2 \cdot {}^{29,32}P_{dq}^{s} + {}^{32,32,32}P_{adder}^{s} + {}^{32,32}P_{act}^{s} + {}^{32,8}P_{q}^{s} + {}^{8,8,8}P_{2:1\,mux}^{s} + 4 \cdot {}^{32,32}P_{reg}^{s} + 2 \cdot {}^{8,8}P_{reg}^{s}$$
Total energy consumed by a memory computation sub block with index i (MBi) can be written as below:
$$E_{N,total}^{i,MB} = N \cdot E_{d\text{-}sv}^{i,MB} + E_{d\text{-}siv}^{i,MB} + N \cdot P_{s\text{-}sv}^{i,MB} \cdot latency_{e2e} + P_{s\text{-}siv}^{i,MB} \cdot latency^{i,MB}$$
where latencye2e is the total latency of processing a given input of the recurrent neural network, latencyiMB is the latency of computation sub block MBi of processing all time multiplexed outputs allocated to it and N is the total number of output units time multiplexed onto MBi.
Modeling Performance of Digital Hardware Implementation of memory cell (MB)
Latency of a memory computation sub block with index i (MBi) for processing N outputs, abbreviated as TNi, MB, can be calculated by adding up individual latencies of each operation involved in the computation of N output units. Similar to computation sub blocks introduced previously, a plurality of pipelining registers can be introduced to reduce the longest delay path and also increase throughput by filling up the data path with computation across multiple of these N units, thus shadowing critical stages of computation of chronologically newer output units behind critical stages of older output units. For the MB computation sub block shown in
$$T_N^{i,MB} = (n_c \cdot N) + 5$$
The digital hardware implementation of a hidden computation sub block, computing one unit of hidden cell output, is shown in
Similar to computation sub blocks introduced before, the cell's output computation will be temporally multiplexed in parallel over the allocated HBs.
As shown in
$$h_t = \sigma(W_x \odot x_t + W_h \odot h_{t-1} + W_m \odot m_t)$$
where xt is the input at timestep t, ht−1 is the hidden state at timestep t−1, and mt is the entire flattened memory tape at timestep t, with Wx, Wh and Wm being the corresponding weights. σ is an activation function. ⊙ refers to a multiplication and accumulation of 2 input vectors.
As shown in
As shown in
Similar to the previously introduced computation sub blocks, power and energy consumption of a computation sub block HB can also be divided into 2 categories: 1. static leakage power; and 2. dynamic switching energy and each of these can further be divided into: 1. scale variant; and 2. scale invariant categories introduced previously.
Similar to previously introduced computation sub blocks inactive computation sub blocks can be disconnected from the voltage supply or put in low power mode using Dynamic Voltage and Frequency Scaling, as introduced previously.
Dynamic energy consumed by a hidden computation sub block with index i (HBi) with one output unit allocated is defined as Edi,HB. As discussed above, dynamic energy consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to HBi): Ed-svi,HB as the scale variant portion of Edi,HB and Ed-sivi,HB as the scale invariant portion. For
Static power consumption of a hidden computation sub block with index i (HBi) with one output unit allocated is defined as Psi,HB. As discussed above, static power consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to HBi): Ps-svi,HB as the scale variant portion of Psi,HB and Ps-sivi,HB as the scale invariant portion. For
$$P_{s\text{-}sv}^{i,HB} = (n_i + n_r + n_c) \cdot {}^{1\times8}P_{memory}^{s} + 2 \cdot {}^{1\times8}P_{memory}^{s}$$
$$P_{s\text{-}siv}^{i,HB} = {}^{8,8,8}P_{2:1\,mux}^{s} + 2 \cdot {}^{8,8,29}P_{MAC}^{s} + 2 \cdot {}^{29,32}P_{dq}^{s} + {}^{32,32,32}P_{adder}^{s} + {}^{32,32}P_{act}^{s} + {}^{32,8}P_{q}^{s} + {}^{8,8,8}P_{2:1\,mux}^{s} + 4 \cdot {}^{32,32}P_{reg}^{s} + 2 \cdot {}^{8,8}P_{reg}^{s}$$
Total energy consumed by a hidden computation sub block with index i (HBi) can be written as below:
$$E_{N,total}^{i,HB} = N \cdot E_{d\text{-}sv}^{i,HB} + E_{d\text{-}siv}^{i,HB} + N \cdot P_{s\text{-}sv}^{i,HB} \cdot latency_{e2e} + P_{s\text{-}siv}^{i,HB} \cdot latency^{i,HB}$$
where latencye2e is the total latency of processing a given input of the recurrent neural network, latencyiHB is the latency of computation sub block HBi of processing all time multiplexed outputs allocated to it and N is the total number of output units time multiplexed onto HBi.
Latency of a hidden computation sub block with index i (HBi) for processing N outputs, abbreviated as TNi,HB, can be calculated by adding up individual latencies of each operation involved in the computation of N output units. Similar to computation sub blocks introduced previously, a plurality of pipelining registers can be introduced to reduce the longest delay path and also increase throughput by filling up the data path with computation across multiple of these N units, thus shadowing critical stages of computation of chronologically newer output units behind critical stages of older output units. For the HB computation sub block shown in
$$T_N^{i,HB} = \begin{cases} (n_c \cdot N) + 5, & \text{if } n_c \ge n_r + n_i \\ (n_r \cdot N) + 5, & \text{if } n_r \ge n_i + n_c \\ (n_i \cdot N) + 5, & \text{if } n_i \ge n_r + n_c \end{cases}$$
A plurality of encoder computation sub blocks (UBs) can be connected to work in parallel with a plurality of memory computation sub blocks (MBs) which in turn can be connected to work in parallel with a plurality of hidden computation sub blocks (HBs), with memory output readouts from one set of blocks connected to the input bus of the other set via a chain of multiplexer and registers acting as an interconnect.
All the 4 UBs will finish their MAC operations at the same time. This is ensured due to the symmetric distribution of workload, ensured by design, over the 4 UBs. All 4 UBs write their allocated encoder cells' outputs to 4 different memory blocks (titled “memory output” in
Total energy consumption of the LMU computation block can be calculated by summing up individual energy consumption of its constituent set of encoder computation sub blocks (UB), constituent set of memory computation sub blocks (MB) and its constituent set of hidden computation sub blocks (HB).
For calculating the latency and throughput of an LMU computation block, the parallel mode of operation plays an important role. All constituent sub blocks in the set of constituent encoder computation sub blocks work in parallel. Same is true for the memory and hidden constituent set of computation sub blocks. Hence, the slowest UB sub block in finishing the computation of its allocated output units of the encoder cell determines the latency of the entire set of constituent encoder computation sub blocks. Similarly, the slowest MB sub block in finishing the computation of its allocated output units of the memory cell determines the latency of the entire set of constituent memory computation sub blocks. Finally, the slowest HB sub block in finishing the computation of its allocated output units of the hidden cell determines the latency of the entire set of constituent hidden computation sub blocks. Hence, to calculate the overall latency of an LMU computation block, the latency of the set of constituent encoder computation sub blocks can be added with the latency of the set of constituent memory computation sub blocks, which can then be added with the latency of the set of constituent hidden computation sub blocks. The throughput of the LMU computation block is determined by the slowest computation sub block, regardless of type.
The digital hardware implementation of a feed forward computation block comprises a plurality of one type of constituent computation sub block. A constituent sub block, computing one unit of feed forward cell output, is shown in
Similar to computation sub blocks introduced before, the cell's output computation will be temporally multiplexed in parallel over the allocated FBs.
As shown in
$$f_t = \sigma(W_x \odot x_t)$$
where xt is the input at timestep t, with Wx being the corresponding weights. σ is an activation function. ⊙ refers to a multiplication and accumulation of 2 input vectors.
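To make the datapath concrete, the following is a minimal sketch of one feed forward output unit on the quantized datapath, reflecting the operand widths that appear in the energy terms below (8 bit operands accumulated into a wider register, dequantized, activated, then quantized back to 8 bits). The scales and the ReLU activation are illustrative assumptions.

```python
import numpy as np

def fb_output_unit(x_q, w_q, x_scale, w_scale, out_scale):
    """One feed forward output unit: 8 bit MAC into a wide accumulator
    (standing in for the 29 bit register), dequantize to a 32 bit value,
    apply the activation, then quantize back to 8 bits."""
    acc = np.int64(0)                    # wide accumulator
    for xi, wi in zip(x_q, w_q):         # MAC over the inputs
        acc += np.int64(xi) * np.int64(wi)
    y = float(acc) * x_scale * w_scale   # dequantization block
    y = max(y, 0.0)                      # activation (ReLU assumed)
    q = int(round(y / out_scale))        # quantization block back to 8 bits
    return int(np.clip(q, -128, 127))

# Example with illustrative scales and int8 operands.
x_q = np.array([12, -7, 33], dtype=np.int8)
w_q = np.array([5, 9, -4], dtype=np.int8)
y_q = fb_output_unit(x_q, w_q, x_scale=0.02, w_scale=0.01, out_scale=0.05)
```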
As shown in
As shown in
Similar to the previously introduced computation sub blocks, power and energy consumption of a computation sub block FB can also be divided into 2 categories: 1. static leakage power; and 2. dynamic switching energy and each of these can further be divided into: 1. scale variant; and 2. scale invariant categories introduced previously.
Similar to previously introduced computation sub blocks, inactive computation sub blocks can be disconnected from the voltage supply or put in low power mode using Dynamic Voltage and Frequency Scaling, as introduced previously.
Dynamic energy consumed by a feed forward computation sub block with index i (FBi) with one output unit allocated is defined as Edi,FB. As discussed above, dynamic energy consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to FBi): Ed-svi,FB as the scale variant portion of Edi,FB and Ed-sivi,FB as the scale invariant portion. For
$$E_{d\text{-}sv}^{i,FB} = n_o \cdot {}^{8,8,29}E_{MAC}^{OP} + n_o \cdot {}^{8,8}E_{memory}^{r} + {}^{29,32}E_{dq}^{OP} + {}^{32,32}E_{act}^{OP} + {}^{32,8}E_{q}^{OP} + {}^{8,8}E_{memory}^{w} + (n_r / num\_FB_{next\_layer}) \cdot E_{memory}^{r} + 2 \cdot {}^{32,32}E_{reg}^{OP} + {}^{8,8}E_{reg}^{OP}$$
$$E_{d\text{-}siv}^{i,FB} = (n_r / num\_FB_{next\_layer}) \cdot \left( 2 \cdot {}^{8,8,8}E_{2:1\,mux}^{OP} + \sum_{k=1}^{i} unit_k^{FB} \cdot {}^{8,8}E_{reg}^{OP} \right)$$
Static power consumption of a feed forward computation sub block with index i (FBi) with one output unit allocated is defined as Psi,FB. As discussed above, static power consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to FBi): Ps-svi,FB as the scale variant portion of Psi,FB and Ps-sivi,FB as the scale invariant portion. For
$$P_{s\text{-}sv}^{i,FB} = n_o \cdot {}^{1\times8}P_{memory}^{s} + {}^{1\times8}P_{memory}^{s}$$
$$P_{s\text{-}siv}^{i,FB} = {}^{8,8,29}P_{MAC}^{s} + {}^{29,32}P_{dq}^{s} + {}^{32,32}P_{act}^{s} + {}^{32,8}P_{q}^{s} + {}^{8,8,8}P_{2:1\,mux}^{s} + 2 \cdot {}^{32,32}P_{reg}^{s} + 2 \cdot {}^{8,8}P_{reg}^{s}$$
Total energy consumed by a feed forward computation sub block with index i (FBi) can be written as below:
$$E_{N,total}^{i,FB} = N \cdot E_{d\text{-}sv}^{i,FB} + E_{d\text{-}siv}^{i,FB} + N \cdot P_{s\text{-}sv}^{i,FB} \cdot latency_{e2e} + P_{s\text{-}siv}^{i,FB} \cdot latency^{i,FB}$$
where latencye2e is the total latency of processing a given input of the recurrent neural network, latencyiFB is the latency of computation sub block FBi of processing all time multiplexed outputs allocated to it and N is the total number of output units time multiplexed onto FBi.
Latency of a feed forward computation sub block with index i (FBi) for processing N outputs, abbreviated as TNi,FB, can be calculated by adding up individual latencies of each operation involved in the computation of N output units. Similar to computation sub blocks introduced previously, a plurality of pipelining registers can be introduced to reduce the longest delay path and also increase throughput by filling up the data path with computation across multiple of these N units, thus shadowing critical stages of computation of chronologically newer output units behind critical stages of older output units. For the FB computation sub block shown in
$$T_N^{i,FB} = (n_o \cdot N) + 4$$
Software Architecture of Mapping Algorithm
A mapping or allocation is defined as the time multiplexing of the whole or part of the above discussed recurrent neural network, composed of at least one of an LMU cell and an LSTM-P cell, and zero or more feed forward cells, onto a spatial distribution of a plurality of corresponding types of computation blocks.
A partitioning of a neural network is defined as separating a sequence of layers of that network into a group such that all layers belonging to that group will be time multiplexed onto the same computation block. An entire neural network can be partitioned into multiple groups (also called partitions). Layers of different types cannot be partitioned into the same group. Hence, a new group must be formed when consecutive layers change types.
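A small sketch of enumerating valid partitionings under these rules is shown below; the layer types are illustrative strings. A group boundary is forced wherever consecutive layer types differ, and the optional boundaries within runs of the same type generate the remaining combinations.

```python
from itertools import product

def partitionings(layer_types):
    """Yield all valid partitionings of a layer sequence into groups.
    A group boundary is mandatory between layers of different types and
    optional between consecutive layers of the same type."""
    n = len(layer_types)
    forced = [layer_types[i] != layer_types[i + 1] for i in range(n - 1)]
    free = [i for i, f in enumerate(forced) if not f]
    for choice in product([False, True], repeat=len(free)):
        cuts = list(forced)
        for idx, c in zip(free, choice):
            cuts[idx] = c
        groups, start = [], 0
        for i, cut in enumerate(cuts):
            if cut:
                groups.append(layer_types[start:i + 1])
                start = i + 1
        groups.append(layer_types[start:])
        yield groups

# Example: an LMU layer followed by two feed forward layers.
for p in partitionings(["LMU", "FF", "FF"]):
    print(p)
# -> [['LMU'], ['FF', 'FF']] and [['LMU'], ['FF'], ['FF']]
```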
A mapping of a recurrent neural network into the computational blocks introduced above depends on the partitioning of the network as well as the number of each type of constituent computation sub blocks for that partition.
For discovering a mapping of a given recurrent neural network that comes closest to satisfying user specified constraints on a combination of latency, throughput, power and the like, first, a linear search over all possible partitionings [1000] of the recurrent neural network is performed. For a large recurrent neural network where a linear search might be too slow, the linear search can be replaced with intelligent search methods including but not limited to gradient based searches, evolutionary strategies guided searches, etc. For a given partitioning, each partition's computation block is allocated one each of its constituent computation sub blocks [1002]. The total latency and energy consumption of the network is computed [1004] by adding up the latency and energy consumption of each partition's computation block using the methods discussed previously. The throughput of the network is calculated by identifying the throughput of each partition's computation block, and the slowest computation block determines the overall throughput of the network. At this stage, either the slowest computation block, or the block whose latency would decrease the most if the count of its slowest constituent computation sub block were increased, is chosen. The count of the slowest constituent computation sub block of the selected computation block is incremented by 1 [1008]. This process, labelled “iterative resource allocation” and shown in
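A condensed sketch of the iterative resource allocation loop is given below. The latency model is a stand-in for the sub block latency expressions derived earlier, the constraint is latency-only, and the selection heuristic is the simpler 'slowest block' variant; all names and numbers are illustrative.

```python
def iterative_resource_allocation(partitions, sub_block_latency,
                                  constraints, max_steps=10000):
    """Start with one of each constituent computation sub block per partition,
    then repeatedly give one more sub block to the slowest partition until the
    user constraint is met (or max_steps is exhausted).

    partitions        : per-partition workloads as {sub_block_type: output_units}
    sub_block_latency : f(sub_block_type, units_per_sub_block) -> cycles
    constraints       : e.g. {"latency": 40000} (total cycles allowed)
    """
    alloc = [{t: 1 for t in p} for p in partitions]  # one sub block of each type

    def partition_latency(p, a):
        # Sub blocks of a type run in parallel (workload split evenly, rounded
        # up); the stages of the block are added, as in the models above.
        return sum(sub_block_latency(t, -(-p[t] // a[t])) for t in p)

    for _ in range(max_steps):
        latencies = [partition_latency(p, a) for p, a in zip(partitions, alloc)]
        if sum(latencies) <= constraints["latency"]:  # network latency is the
            break                                     # sum over partition blocks
        slow = max(range(len(partitions)), key=lambda i: latencies[i])
        p, a = partitions[slow], alloc[slow]
        slow_type = max(p, key=lambda t: sub_block_latency(t, -(-p[t] // a[t])))
        a[slow_type] += 1   # increment the count of the slowest sub block type
    return alloc

# Example: two partitions and a latency model linear in allocated output units.
lat = lambda t, n: {"LB": 144, "PB": 64, "FB": 32}[t] * n + 5
alloc = iterative_resource_allocation(
    [{"LB": 1024, "PB": 512}, {"FB": 256}], lat, {"latency": 40000})
```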
A custom accelerator for keyword spotting applications is implemented using the above mentioned computational blocks and mapping techniques. The custom accelerator is a mapping of an LMU based recurrent neural network that has been trained to achieve 95.9% test accuracy on the SpeechCommands dataset (see Warden, P., 2018. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209). The LMU based recurrent neural network is 361 kbits in size. The computation blocks are configured to use 4 bit quantized weights and 7 bit quantized activations and use SRAM as memory blocks. The recurrent neural network is then trained with these quantization settings to mimic hardware aware training. The design of the hardware accelerator was explored across different clock frequencies, while always ensuring that the timing constraints of the SpeechCommands models (40 ms windows updated every 20 ms) are satisfied in real time. Cycle-accurate power envelopes of each design were guided by the above discussed power and energy estimation models for each computation block that makes up the hardware accelerator design. Total power usage is determined with these envelopes using publicly available power data (see Frustaci, F., Khayatzadeh, M., Blaauw, D., Sylvester, D. and Alioto, M., 2015. SRAM for error-tolerant applications with dynamic energy-quality management in 28 nm CMOS. IEEE Journal of Solid-state circuits, 50(5), pp. 1310-1323) (see Hoppner, S. and Mayr, C., 2018. SpiNNaker2-towards extremely efficient digital neuromorphics and multi-scale brain emulation. Proc. NICE) (see Yabuuchi, M., Nii, K., Tanaka, S., Shinozaki, Y., Yamamoto, Y., Hasegawa, T., Shinkawata, H. and Kamohara, S., 2017, June. A 65 nm 1.0 V 1.84 ns Silicon-on-Thin-Box (SOTB) embedded SRAM with 13.72 nW/Mbit standby power for smart IoT. In 2017 Symposium on VLSI Circuits (pp. C220-C221). IEEE). Multiply-accumulate (MAC) and SRAM dynamic and static power are the dominant power consumers in the design. Dynamic power for multipliers, dividers, and other components was estimated as a function of the number of transistors in the component, and the power cost per transistor of the MAC. All estimates are for a 22 nm process. To estimate the number of transistors, and hence the area, of the design we generated RTL designs of each of the relevant components, and used the yosys open source tool (see Wolf, C., 2016. Yosys open synthesis suite) and libraries to estimate the number of transistors required for the total number of components included in our network.
A custom accelerator for implementing the RNN-T network for automatic speech recognition (see He, Y., Sainath, T. N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., Rybach, D., Kannan, A., Wu, Y., Pang, R. and Liang, Q., 2019, May. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6381-6385). IEEE) is implemented using the above mentioned computational blocks and mapping techniques. The computation blocks are configured to use 8 bit quantized weights and 8 bit quantized activations and use SRAM as memory blocks. The design of the hardware accelerator was fixed at 250 MHz, while always ensuring that the accelerator is able to process an entire input sample in less than 60 ms. Cycle-accurate power envelopes of the design were guided by the above discussed power and energy estimation models for each computation block that makes up the hardware accelerator design. Total power usage is determined with these envelopes using publicly available power data (see Frustaci, F., Khayatzadeh, M., Blaauw, D., Sylvester, D. and Alioto, M., 2015. SRAM for error-tolerant applications with dynamic energy-quality management in 28 nm CMOS. IEEE Journal of Solid-state circuits, 50(5), pp. 1310-1323) (see Hoppner, S. and Mayr, C., 2018. SpiNNaker2-towards extremely efficient digital neuromorphics and multi-scale brain emulation. Proc. NICE) (see Yabuuchi, M., Nii, K., Tanaka, S., Shinozaki, Y., Yamamoto, Y., Hasegawa, T., Shinkawata, H. and Kamohara, S., 2017, June. A 65 nm 1.0 V 1.84 ns Silicon-on-Thin-Box (SOTB) embedded SRAM with 13.72 nW/Mbit standby power for smart IoT. In 2017 Symposium on VLSI Circuits (pp. C220-C221). IEEE). All estimates are for a 22 nm process. We forego analysis of small components like multiplexers, registers, etc in our power model. Instead, we focus only on the widely agreed-upon energy sinks: 1. MAC operations and 2. SRAM based operations. We explored 3 different cases of mapping using the LSTM-P as a reference for the size of the LMU to be used in the RNN-T network:
For the RNN-T network, we generated a total of 1024 partition combinations corresponding to increasingly partitioning the network (1 partition to 12 partitions) and then generating all possible combinations for that set number of partitions. For each partition combination, we derived the spatial configuration of computation blocks and time multiplexing of these partitions onto said computation blocks using our mapping algorithm. We used the performance and power models of computation blocks described above to derive latency and power numbers at every step of iterative resource allocation.
Before sharing the results for the LMU, we first share the details of the LSTM-P's performance:
We now share the results of the optimal mapping for the 3 different cases of LMU sizes as discussed above. For the case of setting the LMU dimensions equal to the LSTM-P's dimensions, we use the LMU for the recurrent connections in place of the LSTM-P. For the optimal mapping, all layers are mapped into their separate partitions, except the prediction network, where the 2 layers making it up are mapped onto a single partition. We now discuss the throughput, latency and power numbers for this mapping.
For the second case, we calculated the ratio between the size of LSTM-P and LMUs used in achieving similar accuracy for the Librispeech dataset. We observed that the best LMU was roughly one-third the size of the best LSTM-P in this experiment. In the equal dimensions experiment, the LMU is roughly six times smaller than the LSTM-P. This equates to a scaling factor of 2. Hence we expect that were a similar scaling factor used in the RNN-T network to achieve accuracy parity, the LMU would consume roughly twice the power of the equal dimensions experiment. This results in power consumption of 4.78 mW.
For the third case, we calculated the ratio between the size of LSTM-P and LMUs used in achieving similar accuracy for the TIMIT dataset. We measured the dimension of the best performing LMU and the best performing LSTM-P. We then scaled the dimensions of the LSTM-P in the RNN-T by these ratios to arrive at the LMU dimensions. Hence we expect that were this LMU used in the RNN-T network to achieve accuracy parity, it would consume 4.4 mW of power.
We now present additional details about this experiment. We first focus on the increased power consumption of the mapping of the RNN-T network built with LSTM-P vs the RNN-T network built with LMUs. The RNN-T network built with LSTM-P consumes 145.4 MB of SRAM memory. This is approximately 6× larger than the RNN-T built with LMU's 26 MB. Hence, the static power consumption of the memory for an LSTM-P implementation pushes it into the domain of infeasibility. We also explore the relationship between partitions of the RNN-T and associated power consumption of optimal mapping of that partition combination. Any partition combination can achieve an end-to-end latency of less than 60 ms. The important thing is for it to do so while consuming as little power as possible.
We now focus on the relationship between throughput of hardware mappings and the number of partitions in
We now correlate latency trends with steps of iterative resource allocation. Recall that each step in iterative resource allocation finds the bottleneck and adds more resources to that computation sub block. For this analysis, we show the network's latency (y-axis) evolution through the steps of iterative resource allocation (x-axis) in
This application claims priority to provisional application No. 63/017,479, filed Apr. 29, 2020, the contents of which are herein incorporated by reference.