Methods And Systems For Efficient Processing Of Recurrent Neural Networks

Information

  • Patent Application Publication Number: 20210342668
  • Date Filed: April 29, 2021
  • Date Published: November 04, 2021
Abstract
Recurrent neural networks are efficiently mapped to hardware computation blocks specifically designed for Legendre Memory Unit (LMU) cells, Projected LSTM cells, and Feed Forward cells. Iterative resource allocation algorithms are used to partition recurrent neural networks and time multiplex them onto a spatial distribution of computation blocks, guided by multivariable optimizations for power, performance, and accuracy. Embodiments of the invention provide systems for low power, high performance deployment of recurrent neural networks for battery sensitive applications such as automatic speech recognition (ASR), keyword spotting (KWS), biomedical signal processing, and other applications that involve processing time-series data.
Description
TECHNICAL FIELD

The invention relates generally to digital hardware systems, such as digital circuits, and more particularly to methods and digital circuit hardware systems for efficient processing of recurrent neural networks.


BACKGROUND

There are a wide variety of recurrent neural networks (RNNs) available, including those based on Long Short Term Memories (LSTMs), Legendre Memory Units (LMUs), and many variants of these. However, commercially viable RNNs have several constraints often ignored by research focused efforts. In this more constrained setting, these RNNs must be:


(1) High performance: The most responsive, low-latency networks process data as soon as it is available, without any buffering. Many methods are tested on the assumption that large windows of data are available all at once and can be easily accessed from memory. At deployment, however, buffering large amounts of data into memory introduces undesirable latencies and increases the overall size, and consequently the power consumption, of such an implementation. This requirement imposes strict bounds on end to end latency and throughput while processing such a recurrent neural network. For real time processing of recurrent neural networks, end to end latency must be minimised and the throughput of the implementation maximised.


(2) Power efficient: For RNN applications sensitive to power dissipation, such as battery powered automatic speech recognition and keyword spotting, the amount of power consumed by processing such RNNs becomes important. While there is no sole determiner of power efficiency, quantization, the number and types of operations, the number and types of memory accesses, as well as the static power dissipation of the underlying hardware are all important factors.


Thus, commercially viable recurrent neural networks mandate a custom hardware design that not only supports the wide variety of typical layers that constitute an RNN, such as LMU, projected LSTM and Feed Forward cells, but also allows for highly efficient and distributed mapping of said RNN layers onto the hardware by taking into account unique constraints related to power, performance, or a combination of both. For this, such hardware designs should also provide theoretical performance and power models to accurately calculate the performance and power metrics of mapping a given RNN onto the hardware design.


A distinguishing feature of the Legendre Memory Unit (LMU) (see Voelker, A. R., Kajić, I. and Eliasmith, C., 2019. Legendre memory units: Continuous-time representation in recurrent neural networks), consisting of linear ‘memory layers’ and a nonlinear ‘output layer’, is that the linear memory layers are optimal for compressing an input time series over time. Because of this provable optimality, the LMU has fixed recurrent and input weights on the linear layer. The LMU outperforms all previous RNNs on the standard psMNIST benchmark task by achieving 97.15-98.49% test accuracy (see Voelker, A. R., Kajić, I. and Eliasmith, C., 2019. Legendre memory units: Continuous-time representation in recurrent neural networks), compared to the next best network (dilated RNN) at 96.1% and the LSTM at 89.86%, while using far fewer parameters (102,000 versus 165,000, a reduction of 38%). While a hardware based implementation of the A and B weights of the LMU is available (see Voelker, A. R., Kajić, I. and Eliasmith, C., 2019. Legendre memory units: Continuous-time representation in recurrent neural networks), there is no hardware design, complete with performance and power models, that can implement an LMU of varying size and characteristics in a highly efficient and distributed manner, all the while being compatible to interface with hardware implementations of projected LSTMs (see Sak, H., Senior, A. W. and Beaufays, F., 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling) or feed forward cells of the RNN.


As an alternative to the large size of the standard Long Short Term Memory (LSTM) cells, the Projected Long Short Term Memory (LSTM-P/Projected LSTM) cell architecture has been proposed (see Sak, H., Senior, A. W. and Beaufays, F., 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling), which projects the hidden states of the LSTM onto a smaller dimension, thus reducing the number, and thus the memory requirements, of the recurrent connections. While there have been many hardware based implementations of the standard LSTM (see Chang, A. X. M., Martini, B. and Culurciello, E., 2015. Recurrent neural networks hardware implementation on FPGA. arXiv preprint arXiv:1511.05552.) to allow for different efficient implementations (see Zhang, Yiwei, Chao Wang, Lei Gong, Yuntao Lu, Fan Sun, Chongchong Xu, Xi Li, and Xuehai Zhou. “Implementation and optimization of the accelerator based on fpga hardware for lstm network.” In 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC), pp. 614-621. IEEE, 2017), (see Yazdani, R., Ruwase, O., Zhang, M., He, Y., Arnau, J. M. and Gonzalez, A., 2019. Lstm-sharp: An adaptable, energy-efficient hardware accelerator for long short-term memory. arXiv preprint arXiv:1911.01258), there currently is no hardware design that implements a well composed projected LSTM, complete with performance and power models, that can support LSTM and projection cells of varying sizes and characteristics in a highly efficient and distributed manner, all the while being compatible to interface with hardware implementations of LMUs and feed forward cells of the RNN.


The feed forward cell, while itself not recurrent in nature, is generally a common part of RNNs, typically with multiple such cells of a plurality of sizes being used to extract low level features from the preceding recurrent states. While there have been many efficient hardware designs of feed forward cells (see Himavathi, S., Anitha, D. and Muthuramalingam, A., 2007. Feedforward neural network implementation in FPGA using layer multiplexing for effective resource utilization. IEEE Transactions on Neural Networks, 18(3), pp. 880-888.) (see Canas, A., Ortigosa, E. M., Ros, E. and Ortigosa, P. M., 2006. FPGA implementation of a fully and partially connected MLP. In FPGA Implementations of Neural Networks (pp. 271-296). Springer, Boston, Mass.) that support processing of varying sizes and characteristics of these cells, there have been no designs that do so in the setting of recurrent neural networks; that is, being compatible to interface with hardware implementations of LMU and LSTM-P cells of the RNN, while also providing power and performance models to enable highly efficient and distributed mapping of RNNs onto the digital hardware design.


There thus remains a need for improved methods and systems for efficient processing of recurrent neural networks for application domains including, but not limited to, automatic speech recognition (ASR), keyword spotting (KWS), biomedical signal processing, and other applications that involve processing time-series data.


SUMMARY OF THE INVENTION

A digital hardware system for processing time series data with a recurrent neural network, wherein the digital hardware system comprises a plurality of computation blocks and computes parts of a given recurrent neural network by time multiplexing the parts of the recurrent neural network over a spatially distributed set of computation blocks of the digital hardware system, and wherein the digital hardware system is configured to: read a sequence of time series data to be processed from the memory of the digital hardware system; temporally process the data by time multiplexing the recurrent neural network over the spatially distributed set of computation blocks of the digital hardware system; and write the final processed output or processed intermediate activations of the recurrent neural network to the memory of the digital hardware system.





BRIEF DESCRIPTION OF THE DRAWINGS

The invention is illustrated in the figures of the accompanying drawings which are meant to be exemplary and not limiting, in which like references are intended to refer to like or corresponding parts, and in which:



FIG. 1 illustrates a circuit embodiment of a LSTM cell's computation sub block (LB).



FIG. 2 illustrates a circuit embodiment of a projection cell's computation sub block (PB).



FIG. 3 illustrates a circuit embodiment of a LSTM-P cell's computation block, consisting of multiple LBs and PBs.



FIG. 4 illustrates a circuit embodiment of an encoder cell's computation sub block (UB).



FIG. 5 illustrates a circuit embodiment of a memory cell's computation sub block (MB).



FIG. 6 illustrates a circuit embodiment of GenA and GenB for generating weights A and B of the LMU.



FIG. 7 illustrates a circuit embodiment of a hidden cell's computation sub block (HB).



FIG. 8 illustrates a circuit embodiment of a LMU cell's computation block, consisting of multiple UBs, MBs and HBs.



FIG. 9 illustrates a circuit embodiment of a feed forward cell's computation sub block (FB).



FIG. 10 illustrates a method according to one embodiment of the invention.



FIG. 11 illustrates an exemplary neural network on which one embodiment of the methods of this invention has been applied such that layers of the neural networks have been partitioned, each with a computation block constituent of its computation sub blocks.



FIG. 12 illustrates the relationship between clock frequency, power consumption and transistor count of the resulting hardware design when embodiments of this invention are used to process an RNN.



FIG. 13 illustrates the relationship between throughput and end to end latency of the resulting hardware design when embodiments of this invention are used to process an RNN with LSTM-P or LMU as the recurrent cell choice.



FIG. 14 illustrates the relationship between throughput and number of partitions of the resulting hardware design when embodiments of this invention are used to process an RNN with LSTM-P or LMU as the recurrent cell choice.



FIG. 15 illustrates the relationship between end to end latency and the step of iterative resource allocation of the resulting hardware design when embodiments of this invention are used to process an RNN with LSTM-P or LMU as the recurrent cell choice.





DETAILED DESCRIPTION OF THE INVENTION

Having summarized the invention above, certain exemplary and detailed embodiments will now be described below, with contrasts and benefits over the prior art being more explicitly described.


It will be understood that the specification is illustrative of the present invention and that other embodiments suggest themselves to those skilled in the art. All references cited herein are incorporated by reference.


In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that the embodiments may be combined, or that other embodiments may be utilized and that structural and logical changes may be made without departing from the spirit and scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims and their equivalents.


The main embodiment of the present invention provides digital hardware designs for processing time series based input data with a recurrent neural network, where the parts of the recurrent neural network are composed of at least one of an LMU cell or an LSTM-P cell, plus zero or more Feed Forward cells. The input is processed by time multiplexing the parts of the recurrent neural network (RNN) over a spatially distributed set of these computation blocks of the digital hardware design, where a computation block refers to a digital hardware implementation of an LMU cell, an LSTM-P cell or a Feed Forward cell. Each computation block is built by connecting together a plurality of computation sub blocks. The spatial distribution of computation blocks and the time multiplexing of parts of the RNN over these computation blocks, referred to as “mapping” or “allocation”, is determined by an algorithm that takes as input a set of user constraints related to latency, throughput, transistor count, power consumption and the like. The mapping algorithm explores all combinations of partitioning the network while, for each unique partition combination, updating the spatial distribution and time multiplexing in an iterative fashion by identifying the slowest computation sub block, until said user constraints are met.
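The iterative part of this procedure can be illustrated with a short sketch. The following Python code is a minimal rendering of the idea of repeatedly granting one more sub block to the slowest partition until a user constraint is met or the resource budget is exhausted; the class and function names, the simple cycles-per-output cost model and the example network are illustrative assumptions only, and the actual mapping algorithm uses the detailed power and performance models developed below.

    from dataclasses import dataclass

    @dataclass
    class Partition:
        """One layer (or layer fragment) of the RNN mapped onto a computation block."""
        name: str
        output_units: int      # outputs to time multiplex over the allocated sub blocks
        cycles_per_unit: int   # assumed cost model: cycles to produce one output unit
        sub_blocks: int = 1    # spatial resources currently allocated to this partition

        def latency(self) -> int:
            # Outputs are split over the sub blocks; the most heavily loaded
            # sub block sets the latency of the whole partition.
            units_per_block = -(-self.output_units // self.sub_blocks)  # ceiling division
            return units_per_block * self.cycles_per_unit

    def allocate(partitions, latency_target, max_sub_blocks):
        """Iteratively give one more sub block to the slowest partition until the
        latency constraint is met or the sub block budget is exhausted."""
        used = sum(p.sub_blocks for p in partitions)
        while max(p.latency() for p in partitions) > latency_target and used < max_sub_blocks:
            slowest = max(partitions, key=lambda p: p.latency())
            slowest.sub_blocks += 1
            used += 1
        return partitions

    if __name__ == "__main__":
        # Hypothetical three-layer RNN: LMU -> LSTM-P -> feed forward.
        network = [Partition("lmu", 512, 4), Partition("lstm_p", 2049, 3), Partition("ff", 256, 2)]
        for p in allocate(network, latency_target=2048, max_sub_blocks=16):
            print(p.name, p.sub_blocks, p.latency())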


Definitions of Key Terms

Various terms as used herein are defined below. To the extent a term used in a claim is not defined below, it should be given the broadest definition persons in the pertinent art have given that term as reflected in printed publications and issued patents at the time of filing.


As used herein, the term “LMU” refers to the Legendre Memory Unit recurrent cell, and in particular any neural network embodiment of equations 1 and 2 from Voelker et al. (2019), also referred to as the “linear layer”, which is to be understood as a general component of a neural network and is not limited in use within any one particular network architecture.


As used herein, the term “LSTM-P” refers to the Projected LSTM recurrent cell.


As used herein, the term “activation” refers to a function that maps an input to an output that includes, but is not limited to, functions such as rectified linear units, sigmoid, hyperbolic tangent, linear, etc. These functions might also be approximated for the purposes of hardware implementation, for example by way of thresholding, linear approximations, lookup-tables, etc.


As used herein, the term “vector” refers to a group of scalar values, where all scalar values of the group may be transformed by some arbitrary mathematical function in sequence or in parallel. The contents of a vector depend on context, and may include, but are not limited to, activities produced by activation functions, weights stored in SRAM, inputs stored in SRAM, outputs stored in SRAM, values along a datapath, etc.


As used herein, the term “computational sub block” refers to a digital hardware implementation responsible for carrying out a specific computation and composed of different operators including, but not limited to, multiply and accumulate blocks, adders, multipliers, registers, memory blocks, etc. Different types of computation sub blocks are responsible for carrying out different computations.


As used herein, the term “computational block” refers to a group of computation sub blocks responsible for carrying out computations assigned to it, with the group composed of plurality of different types of computation sub blocks, with members from the same group of computation sub blocks working in parallel to process parts or whole of computation assigned to their respective computation block.


As used herein, the term “time multiplexing” refers to the scheduling of a group of computations such that each computation in the group is carried out by a computation sub block in order, that is, one after the other in time.


As used herein, the term “quantization” refers to the process of mapping a set of values to a representation with a narrower bit width as compared to the original representation. Quantization encompasses various types of quantization scaling including uniform quantization, power-of-two quantization, etc. as well as different types of quantization schemes including symmetric, asymmetric, etc.


As used herein, the term “quantization block” refers to a digital hardware block that performs a quantization operation.


As used herein, the term “dequantization” refers to the process of mapping a set of quantized values to the original bit width. Dequantization encompasses various types of dequantization scaling including uniform dequantization, power-of-two dequantization, etc. as well as different types of dequantization schemes including symmetric, asymmetric, etc.


As used herein, the term “dequantization block” refers to a digital hardware block that performs a dequantization operation.
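As a concrete illustration of the quantization and dequantization definitions above, the sketch below implements one possible uniform, symmetric 8-bit scheme in Python; the scale value and the function names are assumptions made for illustration, and a hardware quantization block could equally use power-of-two scaling or an asymmetric scheme.

    import numpy as np

    def quantize(x: np.ndarray, scale: float, bits: int = 8) -> np.ndarray:
        """Uniform symmetric quantization: map values to a narrower integer bit width."""
        qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        return np.clip(np.round(x / scale), qmin, qmax).astype(np.int32)

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        """Map the quantized integers back to the original (wider) representation."""
        return q.astype(np.float32) * scale

    x = np.array([0.7, -1.2, 0.05], dtype=np.float32)
    scale = 2.0 / 127                      # assumed scale covering roughly [-2, 2]
    print(dequantize(quantize(x, scale), scale))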


A number of different variable notations are used to refer to the dynamic energy consumption, as follows:

    • ${}^{8,8}E^{r}_{memory}$: The dynamic energy consumed by a memory read operation with a read port width of 8 and a write port width of 8.
    • ${}^{8,8}E^{w}_{memory}$: The dynamic energy consumed by a memory write operation with a read port width of 8 and a write port width of 8.
    • ${}^{8,8,29}E^{OP}_{MAC}$: The dynamic energy consumed by a MAC operation with 2 input operands, each of width 8, and an output of width 29.
    • ${}^{32,32}E^{OP}_{act}$: The dynamic energy consumed by an activation operation with an input width of 32 bits and an output width of 32 bits.
    • ${}^{8,32,32}E^{OP}_{mult}$: The dynamic energy consumed by a multiplication operation with 2 input operands, one of width 8 and the other of width 32, and an output of width 32.
    • ${}^{32,32,32}E^{OP}_{mult}$: The dynamic energy consumed by a multiplication operation with 2 input operands, each of width 32, and an output of width 32.
    • ${}^{32,32,32}E^{OP}_{adder}$: The dynamic energy consumed by an adder's operation with 2 input operands, each of width 32, and an output of width 32.
    • ${}^{8,8,8}E^{OP}_{2:1\,mux}$: The dynamic energy consumed by a 2:1 mux's line switch operation with 2 input operands, each of width 8, and an output of width 8.
    • ${}^{8,8}E^{OP}_{reg}$: The dynamic energy consumed by a register write+read operation with an input operand of width 8 and an output width of 8.
    • ${}^{32,32}E^{OP}_{reg}$: The dynamic energy consumed by a register write+read operation with an input operand of width 32 and an output width of 32.
    • ${}^{8}E^{OP}_{GEN\text{-}A}$: The dynamic energy consumed by state space matrix A's element generation operation with an output width of 8.
    • ${}^{8}E^{OP}_{GEN\text{-}B}$: The dynamic energy consumed by state space matrix B's element generation operation with an output width of 8.
    • ${}^{29,32}E^{OP}_{dq}$: The dynamic energy consumed by a dequantization operation with an input operand width of 29 and an output width of 32.
    • ${}^{32,32}E^{OP}_{dq}$: The dynamic energy consumed by a dequantization operation with an input operand width of 32 and an output width of 32.
    • ${}^{32,8}E^{OP}_{q}$: The dynamic energy consumed by a quantization operation with an input operand width of 32 and an output width of 8.


A number of different variable notations are used to refer to the static leakage power, as follows:

    • ${}^{1\times 8}P^{s}_{memory}$: The static leakage power of a memory macro of size 1×8 (depth×width) bits.
    • ${}^{8,8,29}P^{s}_{MAC}$: The static leakage power of a MAC module with 2 operands, each of width 8, and an output width of 29.
    • ${}^{32,32}P^{s}_{act}$: The static leakage power of an activation module with an input width of 32 and an output width of 32.
    • ${}^{32,32,32}P^{s}_{mult}$: The static leakage power of a multiplier module with 2 operands, each of width 32, and an output width of 32.
    • ${}^{8,32,32}P^{s}_{mult}$: The static leakage power of a multiplier module with 2 operands, one of width 8 and the other of width 32, and an output width of 32.
    • ${}^{32,32,32}P^{s}_{adder}$: The static leakage power of an adder with 2 operands, each of width 32, and an output width of 32.
    • ${}^{8,8,8}P^{s}_{2:1\,mux}$: The static leakage power of a 2:1 mux module with 2 input operands, each of width 8, and an output width of 8.
    • ${}^{8,8}P^{s}_{reg}$: The static leakage power of a register with an input width of 8 and an output width of 8.
    • ${}^{32,32}P^{s}_{reg}$: The static leakage power of a register with an input width of 32 and an output width of 32.
    • ${}^{8}P^{s}_{GEN\text{-}A}$: The static leakage power of state space matrix A's element generator with an output width of 8.
    • ${}^{8}P^{s}_{GEN\text{-}B}$: The static leakage power of state space matrix B's element generator with an output width of 8.
    • ${}^{32,8}P^{s}_{q}$: The static leakage power of a quantization block with an input width of 32 and an output width of 8.
    • ${}^{29,32}P^{s}_{dq}$: The static leakage power of a dequantization block with an input width of 29 and an output width of 32.
    • ${}^{32,32}P^{s}_{dq}$: The static leakage power of a dequantization block with an input width of 32 and an output width of 32.


A number of different variable notations are used to refer to parameters of a particular network architecture, as follows:

    • $n_i$: The size of the user input to the LSTM-P cell or the LMU cell.
    • $n_o$: The size of the user input to the feed forward cell.
    • $n_r$: The size of the hidden state of the LSTM-P cell or the LMU cell.
    • $n_c$: The size of the projection layer of the LSTM-P cell or the size of a memory tape for the LMU cell.
    • $n_{c,next\_layer}$: The size of the projection layer of the next LSTM-P cell or the size of a memory tape for the next LMU cell.
    • $num\_LB_{current}$: Number of LB sub computation blocks within the computation block responsible for processing the current layer.
    • $num\_LB_{next\_layer}$: Number of LB sub computation blocks within the computation block responsible for processing the next layer.
    • $unit^{LB}_{k}$: Number of output units assigned to an LB sub computation block with index $k$, with $0 \le k < num\_LB_{current}$.
    • $num\_PB_{current}$: Number of PB sub computation blocks within the computation block responsible for processing the current layer.
    • $unit^{PB}_{k}$: Number of output units assigned to a PB sub computation block with index $k$, with $0 \le k < num\_PB_{current}$.
    • $num\_UB_{current}$: Number of UB sub computation blocks within the computation block responsible for processing the current layer.
    • $unit^{UB}_{k}$: Number of output units assigned to a UB sub computation block with index $k$, with $0 \le k < num\_UB_{current}$.
    • $num\_MB_{current}$: Number of MB sub computation blocks within the computation block responsible for processing the current layer.
    • $unit^{MB}_{k}$: Number of output units assigned to an MB sub computation block with index $k$, with $0 \le k < num\_MB_{current}$.
    • $num\_HB_{current}$: Number of HB sub computation blocks within the computation block responsible for processing the current layer.
    • $unit^{HB}_{k}$: Number of output units assigned to an HB sub computation block with index $k$, with $0 \le k < num\_HB_{current}$.
    • $num\_FB_{current}$: Number of FB sub computation blocks within the computation block responsible for processing the current layer.
    • $num\_FB_{next\_layer}$: Number of FB sub computation blocks within the computation block responsible for processing the next layer.
    • $unit^{FB}_{k}$: Number of output units assigned to an FB sub computation block with index $k$, with $0 \le k < num\_FB_{current}$.


Digital Hardware Implementation of LSTM-P Cell

The digital hardware implementation of the LSTM-P, referred to as a LSTM-P computation block, is divided into 2 separate computation sub blocks, a plurality of which can be connected together to form a computation block for an LSTM-P cell. The 2 computation sub blocks implement the LSTM-P cell by dividing the cell into 2 parts and implementing each respectively: one for the LSTM cell (LB) and one for the projection cell (PB). Each computation sub block computes one output unit at a time and a corresponding cell with multiple outputs can be time multiplexed onto the same computation sub block.


Digital Hardware Implementation of LSTM cell (LB)


The digital hardware implementation of an LSTM computation sub block, computing one output unit at a time, is shown in FIG. 1. A block is denoted as LBi, where i is the index of the computation sub blocks in a chain or set of LB computation sub blocks.


It is important to remember that, when each output unit of an LSTM cell is allocated an LB, each such LB block computes one output unit of that LSTM cell. Realistically, an LSTM cell with a certain number of output units will be allocated fewer LBs than the number of output units. In this case the cell's output computation will be temporally multiplexed in parallel over the allocated LBs.


As shown in FIG. 1, for producing an output of the LSTM cell, the weights of each gate of the LSTM are stored in separate memory blocks [118]. For example, the forget gate has its weights stored in [100]. Each of these weight sets only correspond to the appropriate output units assigned to it. Each gate is given its own separate MAC unit [136]. For example, the output gate has [110] as its MAC unit. This allocation allows the MAC operations associated with the 4 gates to be computed in parallel, achieving a speed-up factor of 4×. Also, the input line, multiplexed [104] between a user input [108] and recurrent connection [106], is shared among the 4 MAC units. This input line is a data bus, carrying a flit of the input. Resulting accumulated values are appropriately transformed using hardware implementation of activations [114], after which they are further processed by a series of multiplication [120] and addition operations [122]. The candidate gate's output of the current timestep is stored in a separate memory block [102] and the LSTM cell's final output of the current timestep is stored in a separate memory block [126]. Finally, the output line of [134] memory block is connected to a multiplexer [130] and a registered line [132] to support interfacing with a multiplexer based chain interconnect to support movement of data from this computation sub block to other computation sub blocks [128]. The memory blocks in LB can individually be, but not limited to, SRAM memory, DRAM memory, ROM memory, flash memory, solid state memory and can be either volatile or non volatile. The LB implements the following equations:






$$i_t = \sigma(W_{ix} \odot x_t)$$

$$f_t = \sigma(W_{fx} \odot x_t)$$

$$o_t = \sigma(W_{ox} \odot x_t)$$

$$c_t = (f_t * c_{t-1}) + (i_t * \sigma(W_{cx} \odot x_t))$$

$$y_t = o_t * \sigma(c_t)$$


where $x_t$ is the input data at timestep t, $W_{ix}$ are the weights of the input gate, $W_{fx}$ are the weights of the forget gate, $W_{ox}$ are the weights of the output gate and $W_{cx}$ are the weights of the cell. σ is an activation function. ⊙ refers to a multiplication and accumulation of 2 input vectors.
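A minimal NumPy sketch of these equations for a single timestep is given below, only to make the data flow of FIG. 1 concrete; it omits pipelining, quantization and the interconnect, and the use of the sigmoid for σ is an assumption since the hardware activation block may implement any supported function.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lb_step(x_t, c_prev, W_i, W_f, W_o, W_c):
        """One LB timestep. In hardware the four gate MACs run in parallel on the
        shared input line; here they are simply four matrix-vector products."""
        i_t = sigmoid(W_i @ x_t)
        f_t = sigmoid(W_f @ x_t)
        o_t = sigmoid(W_o @ x_t)
        c_t = f_t * c_prev + i_t * sigmoid(W_c @ x_t)
        y_t = o_t * sigmoid(c_t)          # value stored in the output memory block
        return y_t, c_t

    rng = np.random.default_rng(0)
    n_in, n_out = 6, 4                    # x_t already carries both user input and recurrent flits
    W = [rng.standard_normal((n_out, n_in)) for _ in range(4)]
    y_t, c_t = lb_step(rng.standard_normal(n_in), np.zeros(n_out), *W)
    print(y_t, c_t)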


As shown in FIG. 1, pipeline registers [116] are also optionally introduced along the data path to increase throughput of the computation. A register can be optionally introduced after any of the sub-modules of adders, multiply and accumulate units, multipliers, hardware implementation of activation functions and memory blocks, etc.


As shown in FIG. 1, strategic quantization stages are also optionally introduced along the computation data path of LB. A plurality of quantization [124] and dequantization blocks [112] that descale a previously quantized input and requantize it to a plurality of bit widths can be optionally placed anywhere along the computation datapath. Optionally, the efficacy of a unique placement of these quantization blocks can be determined by training the neural network while replicating this quantization scheme in software and its impact on the performance of the neural network measured. The bit widths of other sub modules of adders, multiply and accumulate units, multipliers, hardware implementation of activation functions and memory blocks, etc. will change according to the placement and bit widths of plurality of quantization blocks introduced. For example, in FIG. 1, different data path connections have a bit width of 8 [8], 29 [29] and 32 [32].


Modeling Energy Consumption of Digital Hardware Implementation of LSTM Cell (LB)

Power and energy consumption of a computation sub block can be generally divided into 2 categories: 1. static leakage power; and 2. dynamic switching energy. Each of these can further be divided into: 1. scale variant and 2. scale invariant categories.


Static leakage power and dynamic switching energy can be categorised as scale variant if the consumption changes with respect to the number of output units allocated to a computation sub block. For example, the size of a memory block storing weights of a network connection will increase as more output units are time multiplexed onto the same block. Hence, static power of the memory blocks storing weights can be categorized as scale variant. If the overall consumption does not depend on the number of output units time multiplexed onto the computation sub block, it can be categorised as scale invariant. For example, a multiply and accumulate sub module dissipates a deterministic amount of static power, invariant to the number of output units time multiplexed onto the computation sub block containing that multiply and accumulate unit.


The scale variant dynamic energy can be modeled by summing up dynamic energy of each individual operation involved in computation of one output unit. The scale invariant portion can be calculated by summing up total energy of operations that do not depend on the degree of time multiplexing of outputs of a cell on a computation sub block. The scale variant static power consumption can be modeled by summing up static power consumption of each scale variant sub module for computation of one output unit. The scale invariant portion can be calculated by summing up static power consumption of sub modules that do not depend on the degree of time multiplexing of outputs of a cell on a computation sub block.


Inactive computation sub blocks can be disconnected from the voltage supply or put in a low power mode using Dynamic Voltage and Frequency Scaling (DVFS). A highly deterministic mapping of the recurrent neural network allows prediction of cycle level data flow behaviour. By turning off computation sub blocks when they are not needed, static power consumption can be decreased. Care should be exercised when applying the principles of DVFS to the memory blocks, as they can be volatile in nature and powering them down may lead to loss of data.


Dynamic energy consumed by an LSTM computation sub block with index i ($LB_i$) with one output unit allocated is defined as $E^{d}_{i,LB}$. As discussed above, dynamic energy consumption is further divided according to its dependency on allocation (number of output units time multiplexed onto $LB_i$): $E^{d\text{-}sv}_{i,LB}$ as the scale variant portion of $E^{d}_{i,LB}$ and $E^{d\text{-}siv}_{i,LB}$ as the scale invariant portion. For FIG. 1, they can be calculated as follows:

$$E^{d\text{-}sv}_{i,LB} = 2 \cdot {}^{8,8,8}E^{OP}_{2:1\,mux} + 4(n_i+n_r) \cdot {}^{8,8}E^{r}_{memory} + 4(n_i+n_r) \cdot {}^{8,8,29}E^{OP}_{MAC} + 4 \cdot {}^{29,32}E^{OP}_{dq} + 4 \cdot {}^{32,32}E^{OP}_{act} + {}^{8,8}E^{r}_{memory} + {}^{8,32,32}E^{OP}_{mult} + {}^{32,8}E^{OP}_{q} + {}^{8,8}E^{w}_{memory} + {}^{32,32,32}E^{OP}_{mult} + {}^{32,32}E^{OP}_{dq} + {}^{32,32,32}E^{OP}_{adder} + {}^{32,32}E^{OP}_{act} + {}^{32,32,32}E^{OP}_{mult} + {}^{32,8}E^{OP}_{q} + {}^{8,8}E^{w}_{memory} + 14 \cdot {}^{32,32}E^{OP}_{reg} + 2 \cdot {}^{8,8}E^{OP}_{reg} + (n_r / num\_PB_{current}) \cdot {}^{8,8}E^{r}_{memory}$$

$$E^{d\text{-}siv}_{i,LB} = 2 \cdot (n_r / num\_PB_{current}) \cdot {}^{8,8,8}E^{OP}_{2:1\,mux} + \sum_{k=1}^{i} unit^{PB}_{k} \cdot (n_r / num\_PB_{current}) \cdot {}^{8,8}E^{OP}_{reg}$$
Static power consumption of an LSTM computation sub block with index i ($LB_i$) with one output unit allocated is defined as $P^{s}_{i,LB}$. As discussed above, static power consumption is further divided according to its dependency on allocation (number of output units time multiplexed onto $LB_i$): $P^{s\text{-}sv}_{i,LB}$ as the scale variant portion of $P^{s}_{i,LB}$ and $P^{s\text{-}siv}_{i,LB}$ as the scale invariant portion. For FIG. 1, they can be calculated as follows:

$$P^{s\text{-}sv}_{i,LB} = 4 \cdot {}^{1\times 8}P^{s}_{memory} \cdot (n_i+n_r) + 2 \cdot {}^{1\times 8}P^{s}_{memory} + {}^{1\times 8}P^{s}_{memory}$$

$$P^{s\text{-}siv}_{i,LB} = {}^{8,8,8}P^{s}_{2:1\,mux} + 4 \cdot {}^{8,8,29}P^{s}_{MAC} + 4 \cdot {}^{29,32}P^{s}_{dq} + 4 \cdot {}^{32,32}P^{s}_{act} + {}^{8,32,32}P^{s}_{mult} + {}^{32,32}P^{s}_{dq} + {}^{32,32,32}P^{s}_{mult} + {}^{32,32,32}P^{s}_{adder} + {}^{32,8}P^{s}_{q} + {}^{32,32}P^{s}_{act} + {}^{32,32,32}P^{s}_{mult} + {}^{32,8}P^{s}_{q} + {}^{8,8,8}P^{s}_{2:1\,mux} + 14 \cdot {}^{32,32}P^{s}_{reg} + 3 \cdot {}^{8,8}P^{s}_{reg}$$
Total energy consumed by an LSTM computation sub block with index i ($LB_i$) can be written as below:

$$E^{N,total}_{i,LB} = N \cdot E^{d\text{-}sv}_{i,LB} + E^{d\text{-}siv}_{i,LB} + N \cdot P^{s\text{-}sv}_{i,LB} \cdot latency_{e2e} + P^{s\text{-}siv}_{i,LB} \cdot latency^{LB}_{i}$$

where $latency_{e2e}$ is the total latency of processing a given input of the recurrent neural network, $latency^{LB}_{i}$ is the latency of computation sub block $LB_i$ for processing all time multiplexed outputs allocated to it, and $N$ is the total number of output units time multiplexed onto $LB_i$.
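The total-energy expression can be read directly as a small cost model. The sketch below is a simplified Python rendering under assumed per-sub-block constants (all numbers are placeholders); in practice the two energy terms and two power terms are obtained from the per-operation sums given above for $E^{d\text{-}sv}_{i,LB}$, $E^{d\text{-}siv}_{i,LB}$, $P^{s\text{-}sv}_{i,LB}$ and $P^{s\text{-}siv}_{i,LB}$.

    def total_energy_lb(N, e_dsv, e_dsiv, p_ssv, p_ssiv, latency_e2e, latency_lb):
        """E^{N,total}_{i,LB} = N*E^{d-sv} + E^{d-siv} + N*P^{s-sv}*latency_e2e + P^{s-siv}*latency_LB.
        Energies in joules, powers in watts, latencies in seconds."""
        return N * e_dsv + e_dsiv + N * p_ssv * latency_e2e + p_ssiv * latency_lb

    # Assumed example values for a sub block with 256 output units time multiplexed onto it.
    print(total_energy_lb(N=256, e_dsv=1.2e-9, e_dsiv=0.3e-9,
                          p_ssv=5e-6, p_ssiv=40e-6,
                          latency_e2e=1e-3, latency_lb=2.5e-4))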


Modeling Performance of Digital Hardware Implementation of LSTM Cell (LB)

Latency of an LSTM computation sub block with index i (LBi) for processing one output can be calculated by adding up individual latencies of each operation involved in the computation of one output unit. For the LB computation sub block shown in FIG. 1, assuming latency of 1 for each operation, it can be written as:






$$T_{i,LB} = (n_i+n_r)+1+1+1+1+1+1+1+1+1$$


It should be noted that the extra cycle of reading data from memory can be easily shadowed by reading a cycle early, before the compute actually starts. Hence, it is not included in the computation above. FIG. 1 introduces pipelining at the outputs of some sub modules. The obvious advantage of introducing a register at such sub modules is that it reduces the longest delay path and helps improve the operational clock frequency. Another advantage is that it can reduce the overall latency of compute when $LB_i$ is allocated N output units. Pipelining allows filling up the data path with computation across multiple of these N units and allows shadowing of the critical (in terms of defining overall throughput) stages of computation of chronologically newer output units behind critical stages of older output units. Hence, the latency of a block $LB_i$, with a configuration as in FIG. 1, labelled as $T^{N}_{i,LB}$, to finish computation for N output units allocated to it can be written as below:






$$T^{N}_{i,LB} = (n_i+n_r) \cdot N + 1+1+1+1+1+1+1+1+1$$


Digital Hardware Implementation of the projection Cell (PB)


The digital hardware implementation of a projection computation sub block, computing one cell's worth of output, is shown in FIG. 2. Each block is denoted as PBi, where i is the index of the computation sub block in a chain of PB computation sub blocks.


It is important to remember that, when each output unit of a projection cell is allocated a PB, each such PB block computes one output unit of that projection cell. Realistically, a projection cell with a certain number of output units will be allocated fewer PBs than the number of output units. In this case the cell's output computation will be temporally multiplexed in parallel over the allocated PBs.


As shown in FIG. 2, for producing an output of the projected cell, the weights of projection connection are stored in a separate memory block [200]. Each of these weight sets only correspond to the appropriate output units assigned to it. These weights are then subsequently multiplied and accumulated with an input flit from the input data bus [202] using the MAC block [204]. Resulting accumulated values are appropriately transformed using hardware implementation of activations [208]. The output of the current timestep is stored in a separate memory block [210]. Finally, the output line [220] of the output memory block is connected to a multiplexer [218] and a registered line [212] to support interfacing with a multiplexer based chain interconnect to support movement of data from this computation sub block to other computation sub blocks [214]. The memory blocks in PB can individually be, but not limited to, SRAM memory, DRAM memory, ROM memory, flash memory, solid state memory and can be either volatile or non volatile. The PB shown in FIG. 2 implements the following equation:






$$h^{p}_{t} = \sigma(W_{px} \odot x_t)$$


where $x_t$ is the input data at timestep t, equal to the output of the corresponding LSTM cell, and $W_{px}$ are the weights of the projection connection. σ is an activation function. ⊙ refers to a multiplication and accumulation of 2 input vectors.


As shown in FIG. 2. pipeline registers [216] are also optionally introduced along the data path to increase throughput of the computation. A register can be optionally introduced after any of the sub-module of the multiply and accumulate unit, hardware implementation of activation functions and memory blocks, etc.


As shown in FIG. 2, strategic quantization stages are also optionally introduced along the computation data path of PB. A plurality of quantization [222] and dequantization [206] blocks that descale a previously quantized input and requantize it to a plurality of bit widths can be optionally placed anywhere along the computation datapath. Optionally, the efficacy of a unique placement of these quantization blocks can be determined by training the neural network while replicating this quantization scheme in software and its impact on the performance of the neural network measured. The bit widths of other sub modules of multiply and accumulate units, hardware implementation of activation functions and memory blocks, etc. will change according to the placement and bit widths of plurality of quantization blocks introduced. For example, in FIG. 2, different data path connections have a bit width of 8 [8], 29 [29] and 32 [32].


Modeling Energy Consumption of Digital Hardware Implementation of Projection Cell (PB)

Similar to the LSTM computation sub block, LB, power and energy consumption of a computation sub block PB can also be divided into 2 categories: 1. static leakage power; and 2. dynamic switching energy and each of these can further be divided into: 1. scale variant; and 2. scale invariant categories introduced previously.


Similar to the LSTM computation sub block, LB, inactive computation sub blocks can be disconnected from the voltage supply or put in low power mode using Dynamic Voltage and Frequency Scaling, as introduced previously.


Dynamic energy consumed by a projection computation sub block with index i ($PB_i$) with one output unit allocated is defined as $E^{d}_{i,PB}$. As discussed above, dynamic energy consumption is further divided according to its dependency on allocation (number of output units time multiplexed onto $PB_i$): $E^{d\text{-}sv}_{i,PB}$ as the scale variant portion of $E^{d}_{i,PB}$ and $E^{d\text{-}siv}_{i,PB}$ as the scale invariant portion. For FIG. 2, assuming a forward connection to an LSTM computation sub block, they can be calculated as follows:

$$E^{d\text{-}sv}_{i,PB} = n_c \cdot {}^{8,8}E^{r}_{memory} + n_c \cdot {}^{8,8,29}E^{OP}_{MAC} + {}^{29,32}E^{OP}_{dq} + {}^{32,32}E^{OP}_{act} + {}^{32,8}E^{OP}_{q} + {}^{8,8}E^{w}_{memory} + (n_c / num\_LB_{current}) \cdot {}^{8,8}E^{r}_{memory} + (n_{c,next\_layer} / num\_LB_{next\_layer}) \cdot {}^{8,8}E^{r}_{memory} + 2 \cdot {}^{32,32}E^{OP}_{reg} + {}^{8,8}E^{OP}_{reg}$$

$$E^{d\text{-}siv}_{i,PB} = 2 \cdot (n_c / num\_LB_{current}) \cdot {}^{8,8,8}E^{OP}_{2:1\,mux} + \sum_{k=1}^{i} unit^{LB}_{k} \cdot (n_c / num\_LB_{current}) \cdot {}^{8,8}E^{OP}_{reg} + 2 \cdot (n_{c,next\_layer} / num\_LB_{next\_layer}) \cdot {}^{8,8,8}E^{OP}_{2:1\,mux} + \sum_{k=1}^{i} unit_{k} \cdot (n_{c,next\_layer} / num\_LB_{next\_layer}) \cdot {}^{8,8}E^{OP}_{reg}$$
Static power consumption of a projection computation sub block with index i ($PB_i$) with one output unit allocated is defined as $P^{s}_{i,PB}$. As discussed above, static power consumption is further divided according to its dependency on allocation (number of output units time multiplexed onto $PB_i$): $P^{s\text{-}sv}_{i,PB}$ as the scale variant portion of $P^{s}_{i,PB}$ and $P^{s\text{-}siv}_{i,PB}$ as the scale invariant portion. For FIG. 2, they can be calculated as follows:

$$P^{s\text{-}sv}_{i,PB} = n_c \cdot {}^{1\times 8}P^{s}_{memory} + {}^{1\times 8}P^{s}_{memory}$$

$$P^{s\text{-}siv}_{i,PB} = {}^{8,8,29}P^{s}_{MAC} + {}^{29,32}P^{s}_{dq} + {}^{32,32}P^{s}_{act} + {}^{32,8}P^{s}_{q} + {}^{8,8,8}P^{s}_{2:1\,mux} + 2 \cdot {}^{32,32}P^{s}_{reg} + 2 \cdot {}^{8,8}P^{s}_{reg}$$


Total energy consumed by a projection computation sub block with index i ($PB_i$) can be written as below:

$$E^{N,total}_{i,PB} = N \cdot E^{d\text{-}sv}_{i,PB} + E^{d\text{-}siv}_{i,PB} + N \cdot P^{s\text{-}sv}_{i,PB} \cdot latency_{e2e} + P^{s\text{-}siv}_{i,PB} \cdot latency^{PB}_{i}$$

where $latency_{e2e}$ is the total latency of processing a given input of the recurrent neural network, $latency^{PB}_{i}$ is the latency of computation sub block $PB_i$ for processing all time multiplexed outputs allocated to it, and $N$ is the total number of output units time multiplexed onto $PB_i$.


Modeling Performance of Digital Hardware Implementation of projection cell (PB)


Latency of a projection computation sub block with index i ($PB_i$) for processing N outputs, abbreviated as $T^{N}_{i,PB}$, can be calculated by adding up the individual latencies of each operation involved in the computation of N output units. Similar to the LSTM computation sub block, a plurality of pipelining registers can be introduced to reduce the longest delay path and also increase throughput by filling up the data path with computation across multiple of these N units, thus shadowing critical stages of computation of chronologically newer output units behind critical stages of older output units. Hence, the latency of a block $PB_i$, with a configuration as in FIG. 2, labelled as $T^{N}_{i,PB}$, to finish computation for N output units allocated to it can be written as below:

$$T^{N}_{i,PB} = (n_r \cdot N) + 4$$


Complete Digital Hardware Implementation of LSTM-P Computation Block

A plurality of LSTM computation sub blocks (LBs) can be connected to work in parallel with a plurality of projection computation sub blocks (PBs), with memory output readouts from one set of blocks connected to the input bus of the other set via a chain of multiplexer and registers acting as an interconnect. FIG. 3 shows 8 LBs [304] and 4 PBs [308] connected together to form a LSTM-P computation block, with an LSTM cell with 2049 outputs time multiplexed onto the 8 LBs and a projection cell with 640 outputs time multiplexed onto the 4 PBs. LB0 [302] will be responsible for computation of LSTM cell's outputs 0-255 and PB0 [306] will be responsible for computation of projection cell's outputs 0-159. The input data bus [300] is fed to all 8 LBs at the same time. The streaming multiplexer chain interconnect after each set of LBs [316] and PBs [320] allow data dependencies to be well composed, despite any unbalanced/disjointed resource allocation between the 2 sets of computation sub blocks. For example, the data line [314] will pass LSTM cell's outputs 1792-2048 to the other LSTM computation sub blocks and data line [318] will pass projection cell's outputs 480-639 to other projection computation sub blocks.
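The 2049-over-8 and 640-over-4 splits above amount to assigning contiguous ranges of output units to sub blocks, with the last sub block absorbing the remainder, consistent with the example ranges given for LB0 (outputs 0-255), PB0 (outputs 0-159), [314] (outputs 1792-2048) and [318] (outputs 480-639). A small sketch of such an assignment rule, offered as an illustration rather than as the patent's prescribed policy, follows.

    def split_outputs(total_units: int, num_blocks: int):
        """Assign contiguous output-unit index ranges to sub blocks; the last block
        takes the remainder when the division is not exact."""
        base = total_units // num_blocks
        ranges = []
        for k in range(num_blocks):
            start = k * base
            end = start + base - 1 if k < num_blocks - 1 else total_units - 1
            ranges.append((start, end))
        return ranges

    print(split_outputs(2049, 8))   # LSTM cell outputs over 8 LBs: (0, 255), ..., (1792, 2048)
    print(split_outputs(640, 4))    # projection cell outputs over 4 PBs: (0, 159), ..., (480, 639)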


All 8 LBs will finish their MAC operations at the same time. All 8 LBs write their allocated LSTM cell outputs to 8 different memories (titled “memory output” in FIG. 1). Since each PB needs access to all of the LSTM outputs, each of these 8 different memories is then streamed serially onto a single bus via a chain of multiplexer based interconnect [316]. This bus then forms the input to the set of PB blocks.


Due to symmetry of workload distribution, all PBs finish their workload of projected output calculation at the same time. Each of them then writes its computed projected units to its memory (titled “memory output” in FIG. 2). These memories then become the input to the next LSTM cell [310] for the current timestep and the input to the same LSTM cell for the next timestep [312]. Hence, they are again streamed via a chain of multiplexer based bus [320] to form the final data bus to the next computation block (not shown) as well as to the set of LB computation sub blocks in FIG. 3.


Total energy consumption of the LSTM-P computation block can be calculated by summing up individual energy consumption of its constituent set of LSTM computation sub blocks LB and its constituent projection computation sub blocks PB.


For calculating the latency and throughput of a LSTM-P computation block, the parallel mode of operation plays an important role. All constituent sub blocks in the set of constituent LSTM computation sub blocks work in parallel. Hence, the slowest LB sub block in finishing the computation of its allocated output units of the LSTM cell determines the latency of the entire set of constituent LSTM sub blocks. Similarly, the slowest PB sub block in finishing the computation of its allocated output units of the projection cell determines the latency of the entire set of constituent projection sub blocks. Hence, to calculate the overall latency of an LSTM-P computation block, the latency of the set of constituent LSTM sub blocks can be added with the latency of the set of constituent projection sub blocks. The throughput of the LSTM-P computation block is determined by the slowest computation sub block, regardless of type.
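A sketch of this aggregation rule is given below, under the simplifying assumption that the per-sub-block latencies (for example from $T^{N}_{i,LB}$ and $T^{N}_{i,PB}$) and the clock frequency are already known; the example numbers are placeholders.

    def lstm_p_block_metrics(lb_latencies, pb_latencies, clock_hz):
        """Block latency: slowest LB set plus slowest PB set, since the two sets run
        back to back. Throughput: limited by the slowest sub block of either type."""
        latency_cycles = max(lb_latencies) + max(pb_latencies)
        bottleneck_cycles = max(lb_latencies + pb_latencies)
        return latency_cycles / clock_hz, clock_hz / bottleneck_cycles

    latency_s, timesteps_per_s = lstm_p_block_metrics([689_000] * 8, [524_000] * 4, clock_hz=200e6)
    print(f"latency {latency_s * 1e3:.2f} ms, throughput {timesteps_per_s:.1f} timesteps/s")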


Digital Hardware Implementation of LMU Cell

The digital hardware implementation of the LMU is divided into 3 separate computation sub blocks, a plurality of which can be connected together to form a computation block for an LMU cell. The 3 computation sub blocks implement the LMU cell by dividing the cell into 3 parts and implementing each respectively: one for the encoder cell (UB), one for the memory cell (MB) and one for the hidden cell (HB). Each computation sub block computes one output unit at a time, and a corresponding cell with multiple outputs can be time multiplexed onto the same computation sub block.


Digital Hardware Implementation of encoder cell (UB)


The digital hardware implementation of an encoder computation sub block, computing one output unit at a time, is shown in FIG. 4. A block is denoted as UBi, where i is the index of the computation sub blocks in a chain of UB computation sub blocks.


Similar to the LSTM-P's computation sub block, the encoder cell's output computation will be temporally multiplexed in parallel over the allocated UBs.


It is important to note that the input vector to the encoder cells, the hidden state and the memory can all potentially be of different sizes; the MAC operations associated with each of them can therefore finish at different times, with one of them taking the longest and becoming the bottleneck in computation latency. The triangular inequality, which checks if the sum of two numbers is less than the third number, can be deployed to check if two multiply and accumulate operations can be sequentialized to save static power dissipation and silicon area of the resulting hardware implementation.
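One illustrative reading of this check, as Python, is shown below: two MAC workloads can be folded onto one physical MAC unit when their combined length does not exceed the longest single workload, so that the folded pair still finishes within the latency of the dominant stream. The function name and the specific sizes are assumptions.

    def foldable_pair(n_i: int, n_r: int, n_c: int):
        """Return the pair of MAC workloads (by name) that can be sequentialized on one
        MAC unit without exceeding the latency of the longest workload, or None."""
        sizes = {"input": n_i, "hidden": n_r, "memory": n_c}
        longest = max(sizes, key=sizes.get)
        others = [name for name in sizes if name != longest]
        if sizes[others[0]] + sizes[others[1]] <= sizes[longest]:
            return (others[0], others[1])
        return None

    print(foldable_pair(n_i=40, n_r=212, n_c=256))   # ('input', 'hidden'): fold behind the memory MAC
    print(foldable_pair(n_i=300, n_r=300, n_c=256))  # None: no folding possible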


As shown in FIG. 4, for computing an output of the encoder cell, the weights are stored in memory blocks [402], with each connection composed of its own memory block like [400]. Each of these weights sets only correspond to the appropriate output units assigned to it. These weights are then subsequently multiplied and accumulated with an input flit from the input data bus [416], multiplexed [418] between inputs [412] and [414], and input data bus [422], using MAC blocks [406] like [404]. Triangular inequality law has been used here to fold the input and hidden states' related multiply and accumulate operations onto the same MAC unit, thus also storing the corresponding weights onto the same physical memory block. Resulting accumulated values are added [420] and optionally transformed using hardware implementation of activations [410] and finally written to the output memory block [434]. Finally, the output line [436] of the output memory block is connected to a multiplexer [430] and a registered line [426] to support interfacing with a multiplexer based chain interconnect to support movement of data from this computation sub block to other computation sub blocks [428]. The memory blocks in UB can individually be, but not limited to, SRAM memory, DRAM memory, ROM memory, flash memory, solid state memory and can be either volatile or non volatile. The UB shown in FIG. 4 implements the following equation:






$$u_t = \sigma(e_x \odot x_t + e_h \odot h_{t-1} + e_m \odot m_{t-1})$$


where $x_t$ is the input data at timestep t, $h_{t-1}$ is the hidden state at timestep t−1, and $m_{t-1}$ is the memory of one of the LMU tapes at timestep t−1, with $e_x$, $e_h$ and $e_m$ being the corresponding weights. σ is an activation function. ⊙ refers to a multiplication and accumulation of 2 input vectors.
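A NumPy sketch of the encoder equation for a single memory tape is given below; in the hardware of FIG. 4 the input and hidden terms share one folded MAC unit while the memory term uses another, but numerically the result is the same. The sigmoid choice for σ is an assumption.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def ub_step(x_t, h_prev, m_prev, e_x, e_h, e_m, activation=sigmoid):
        """u_t = activation(e_x . x_t + e_h . h_{t-1} + e_m . m_{t-1}) for one encoder output."""
        return activation(e_x @ x_t + e_h @ h_prev + e_m @ m_prev)

    rng = np.random.default_rng(1)
    n_i, n_r, n_c = 5, 8, 12
    u_t = ub_step(rng.standard_normal(n_i), rng.standard_normal(n_r), rng.standard_normal(n_c),
                  rng.standard_normal(n_i), rng.standard_normal(n_r), rng.standard_normal(n_c))
    print(u_t)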


As shown in FIG. 4, pipeline registers [424] are also optionally introduced along the data path to increase throughput of the computation. A register can be optionally introduced after any of the sub-modules of adders, multiply and accumulate units, hardware implementation of activation functions and memory blocks, etc.


As shown in FIG. 4, strategic quantization stages are also optionally introduced along the computation data path of UB. A plurality of quantization [408] and dequantization [432] blocks that descale a previously quantized input and requantize it to a plurality of bit widths can be optionally placed anywhere along the computation datapath. Optionally, the efficacy of a unique placement of these quantization blocks can be determined by training the neural network while replicating this quantization scheme in software and its impact on the performance of the neural network measured. The bit widths of other sub modules of adders, multiply and accumulate units, hardware implementation of activation functions and memory blocks, etc. will change according to the placement and bit widths of plurality of quantization blocks introduced. For example, in FIG. 4, different data path connections have a bit width of 8 [8], 29 [29] and 32 [32].


Modeling Energy Consumption of Digital Hardware Implementation of Encoder Cell (UB)

Similar to computation sub blocks of the LSTM-P, power and energy consumption of a computation sub block UB can also be divided into 2 categories: 1. static leakage power; and 2. dynamic switching energy and each of these can further be divided into: 1. scale variant; and 2. scale invariant categories introduced previously.


Similar to the computation sub blocks of the LSTM-P, inactive computation sub blocks can be disconnected from the voltage supply or put in low power mode using Dynamic Voltage and Frequency Scaling, as introduced previously.


Dynamic energy consumed by an encoder computation sub block with index i ($UB_i$) with one output unit allocated is defined as $E^{d}_{i,UB}$. As discussed above, dynamic energy consumption is further divided according to its dependency on allocation (number of output units time multiplexed onto $UB_i$): $E^{d\text{-}sv}_{i,UB}$ as the scale variant portion of $E^{d}_{i,UB}$ and $E^{d\text{-}siv}_{i,UB}$ as the scale invariant portion. For FIG. 4, assuming a single memory tape, they can be calculated as follows:

$$E^{d}_{i,UB} = 2 \cdot {}^{8,8,8}E^{OP}_{2:1\,mux} + (n_c+n_i+n_r) \cdot {}^{8,8}E^{r}_{memory} + (n_c+n_i+n_r) \cdot {}^{8,8,29}E^{OP}_{MAC} + 2 \cdot {}^{29,32}E^{OP}_{dq} + {}^{32,32,32}E^{OP}_{adder} + {}^{32,32}E^{OP}_{act} + {}^{32,8}E^{OP}_{q} + (n_c / num\_MB_{current}) \cdot {}^{8,8}E^{r}_{memory} + 4 \cdot {}^{32,32}E^{OP}_{reg} + {}^{8,8}E^{OP}_{reg}$$


Static power consumption of an encoder computation sub block with index i ($UB_i$) with one output unit allocated is defined as $P^{s}_{i,UB}$. As discussed above, static power consumption is further divided according to its dependency on allocation (number of output units time multiplexed onto $UB_i$): $P^{s\text{-}sv}_{i,UB}$ as the scale variant portion of $P^{s}_{i,UB}$ and $P^{s\text{-}siv}_{i,UB}$ as the scale invariant portion. For FIG. 4, assuming a single memory tape, they can be calculated as follows:

$$P^{s}_{i,UB} = {}^{8,8,8}P^{s}_{2:1\,mux} + (n_i+n_c+n_r) \cdot {}^{1\times 8}P^{s}_{memory} + 2 \cdot {}^{8,8,29}P^{s}_{MAC} + 2 \cdot {}^{29,32}P^{s}_{dq} + {}^{32,32,32}P^{s}_{adder} + {}^{32,32}P^{s}_{act} + {}^{32,8}P^{s}_{q} + {}^{1\times 8}P^{s}_{memory} + {}^{8,8,8}P^{s}_{2:1\,mux} + 4 \cdot {}^{32,32}P^{s}_{reg} + 2 \cdot {}^{8,8}P^{s}_{reg}$$


Total energy consumed by an encoder computation sub block with index i ($UB_i$) can be written as below:

$$E^{N,total}_{i,UB} = N \cdot E^{d\text{-}sv}_{i,UB} + E^{d\text{-}siv}_{i,UB} + N \cdot P^{s\text{-}sv}_{i,UB} \cdot latency_{e2e} + P^{s\text{-}siv}_{i,UB} \cdot latency^{UB}_{i}$$

where $latency_{e2e}$ is the total latency of processing a given input of the recurrent neural network, $latency^{UB}_{i}$ is the latency of computation sub block $UB_i$ for processing all time multiplexed outputs allocated to it, and $N$ is the total number of output units time multiplexed onto $UB_i$.


Modeling Performance of Digital Hardware Implementation of Encoder Cell (UB)

Latency of an encoder computation sub block with index i ($UB_i$) for processing N outputs, abbreviated as $T^{N}_{i,UB}$, can be calculated by adding up the individual latencies of each operation involved in the computation of N output units. Similar to the LSTM-P computation sub blocks, a plurality of pipelining registers can be introduced to reduce the longest delay path and also increase throughput by filling up the data path with computation across multiple of these N units, thus shadowing critical stages of computation of chronologically newer output units behind critical stages of older output units. For the UB computation sub block shown in FIG. 4, assuming a latency of 1 for each operation, it can be written as:

$$T^{N}_{i,UB} = \begin{cases} (n_c \cdot N) + 5, & \text{if } n_c \ge n_i + n_r \\ (n_r \cdot N) + 5, & \text{if } n_r \ge n_i + n_c \\ (n_i \cdot N) + 5, & \text{if } n_i \ge n_c + n_r \end{cases}$$
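This piecewise model says that the dominant MAC stream sets the per-output cycle count, with a fixed pipeline overhead of 5 cycles; a direct Python transcription is given below. Note that the published cases only cover the regime where one of the three sizes is at least as large as the sum of the other two, which is also the regime in which the MAC folding described earlier applies; the error raised otherwise is an assumption of this sketch.

    def ub_latency(N: int, n_i: int, n_r: int, n_c: int) -> int:
        """T^N_{i,UB}: length of the dominant stream times the number of time
        multiplexed outputs N, plus 5 pipeline cycles."""
        if n_c >= n_i + n_r:
            return n_c * N + 5
        if n_r >= n_i + n_c:
            return n_r * N + 5
        if n_i >= n_c + n_r:
            return n_i * N + 5
        raise ValueError("no single stream dominates; the published cases do not apply")

    print(ub_latency(N=64, n_i=40, n_r=212, n_c=256))   # memory stream dominates: 256*64 + 5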


Digital Hardware Implementation of the memory cell (MB)


The digital hardware implementation of a memory computation sub block, computing one unit of memory output, is shown in FIG. 5. Each block is denoted as MBi, where i is the index of the computation sub blocks in a chain of MB computation sub blocks.


Similar to computation sub blocks introduced before, the cell's output computation will be temporally multiplexed in parallel over the allocated MBs.


As shown in FIG. 5, for producing an output of the memory cell, the weights A and B of the memory cell can be generated in real time using GenA [504] and GenB [506] respectively. As shown in FIG. 6, GenA implements the following equation on output line [606]:






$$A = a_{ij} \in \mathbb{R}^{q \times q}, \text{ where } a_{ij} = (2i+1) \cdot \begin{cases} -1, & \text{if } i < j \\ (-1)^{i-j+1}, & \text{if } i \ge j \end{cases}$$

where q is a property of the LMU cell and i and j are greater than or equal to zero. GenB implements the following equation on output line [604] in FIG. 6:

$$B = b_{i} \in \mathbb{R}^{q \times 1}, \text{ where } b_{i} = (2i+1) \cdot (-1)^{i}$$

where q is a property of the LMU cell and i is greater than or equal to zero.


The “2*i+1” is implemented by [600] in FIG. 6, by the use of an incrementing adder [614], with 1 [1] as one of its operands, with its output being shifted to the left by one bit using a left shift operation [616] and consequently being incremented. The “j” part of the equation is implemented in [602]. The output of [600] is then multiplied [612] with −1 on line [620] before being processed by a multiplexer [608] using [602] as a select line and pipelined [618] to generate the outputs of GenA in [606]. The output of the multiplier [612] and block [600] are also used for GenB, which are multiplexed via [610] to generate the outputs of GenB in [604]. The bit widths of other sub modules of adders, bit shifters, multipliers, multiplexers etc. will change according to the placement and bit widths of plurality of quantization blocks introduced. For example, in FIG. 6, different data path connections have a bit width of 8 [8], 29 [29] and 32 [32].
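The element generators can be checked against a direct software rendering of the two equations above; the sketch below builds A and B for a small q so that the streamed outputs of GenA [606] and GenB [604] can be compared element by element. The function names are illustrative only.

    import numpy as np

    def gen_a(q: int) -> np.ndarray:
        """a_ij = (2i+1) * (-1 if i < j else (-1)^(i-j+1)), for i, j >= 0."""
        A = np.empty((q, q))
        for i in range(q):
            for j in range(q):
                A[i, j] = (2 * i + 1) * (-1 if i < j else (-1) ** (i - j + 1))
        return A

    def gen_b(q: int) -> np.ndarray:
        """b_i = (2i+1) * (-1)^i, for i >= 0."""
        return np.array([(2 * i + 1) * (-1) ** i for i in range(q)], dtype=float).reshape(q, 1)

    print(gen_a(4))
    print(gen_b(4))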


Optionally, the weights A and B can also be stored in memory instead. Each of these weight sets only corresponds to the appropriate output units assigned to it. These weights are then subsequently multiplied and accumulated with an input flit, typically the memory of a particular memory tape, from the input data buses [500] and [502] using the MAC blocks [528] like [530]. The 2 values are added up [510] and the values are appropriately transformed using hardware implementation of activations [512] and finally written to an output memory block [516]. Finally, the output line [526] of the output memory block is connected to a multiplexer [518] and a registered line [522] to support interfacing with a multiplexer based chain interconnect to support movement of data from this computation sub block to other computation sub blocks [520]. The memory blocks in MB can individually be, but are not limited to, SRAM memory, DRAM memory, ROM memory, flash memory, solid state memory and can be either volatile or non volatile. The MB shown in FIG. 5 implements the following equation:






$$m_t = \sigma(A \odot m_{t-1} + B \cdot u_t)$$


where mt−1 is the memory of a memory tape at timestep t−1 and ut is the encoded input of that memory tape, with A and B being the weights of the memory cell. σ is an activation function. ⊙ refers to a multiplication and accumulation of 2 input vectors.
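A minimal software view of the computation performed by one MB is sketched below; it computes a single memory output unit per call, mirroring the per-unit time multiplexing described above (the identity activation is a placeholder assumption):

```python
import numpy as np

def mb_output_unit(a_row, b_i, m_prev, u_t, activation=lambda x: x):
    """One memory output unit: sigma(a_row . m_prev + b_i * u_t), where a_row
    and b_i are the generated (or stored) A and B weights for this unit,
    m_prev is the memory of the tape at timestep t-1 and u_t is its encoded
    input."""
    return activation(float(np.dot(a_row, m_prev)) + b_i * u_t)
```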


As shown in FIG. 5, pipeline registers [524] are also optionally introduced along the data path to increase throughput of the computation. A register can be optionally introduced after any of the sub-module of the multiply and accumulate unit, hardware implementation of activation functions and memory blocks, etc.


As shown in FIG. 5, strategic quantization stages are also optionally introduced along the computation data path of MB. A plurality of quantization [514] and dequantization [508] blocks that descale a previously quantized input and requantize it to a plurality of bit widths can be optionally placed anywhere along the computation datapath. Optionally, the efficacy of a unique placement of these quantization blocks can be determined by training the neural network while replicating this quantization scheme in software and its impact on the performance of the neural network measured. The bit widths of other sub modules of multiply and accumulate unit, hardware implementation of activation functions and memory blocks, etc. will change according to the placement and bit widths of plurality of quantization blocks introduced. For example, in FIG. 5, different data path connections have a bit width of 8 [8], 29 [29] and 32 [32].


Modeling Energy Consumption of Digital Hardware Implementation of Memory Cell (MB)

Similar to the previously introduced computation sub blocks, power and energy consumption of a computation sub block MB can also be divided into 2 categories: 1. static leakage power; and 2. dynamic switching energy and each of these can further be divided into: 1. scale variant; and 2. scale invariant categories introduced previously.


Similar to previously introduced computation sub blocks, inactive computation sub blocks can be disconnected from the voltage supply or put in low power mode using Dynamic Voltage and Frequency Scaling, as introduced previously.


Dynamic energy consumed by a memory computation sub block with index i (MBi) with one output unit allocated is defined as Edi,MB. As discussed above, dynamic energy consumption is further divided according to its dependency on allocation (the number of output units time multiplexed onto MBi): Ed-svi,MB as the scale variant portion of Edi,MB and Ed-sivi,MB as the scale invariant portion. For FIG. 5, assuming a single memory tape, they can be calculated as follows:







$$E^{d\text{-}sv}_{i,MB} = n_c \cdot {}^{8}E^{OP}_{GEN\text{-}A} + {}^{8}E^{OP}_{GEN\text{-}B} + (n_c + 1) \cdot {}^{8,8,29}E^{OP}_{MAC} + 2 \cdot {}^{29,32}E^{OP}_{dq} + {}^{32,32,32}E^{OP}_{adder} + {}^{32,32}E^{OP}_{act} + {}^{32,8}E^{OP}_{q} + {}^{8,8}E^{w}_{memory} + (n_c / \mathrm{num\_MB}_{current}) \cdot {}^{8,8}E^{r}_{memory} + (n_r / \mathrm{num\_HB}_{current}) \cdot {}^{8,8}E^{r}_{memory} + {}^{8,8}E^{r}_{memory} + 4 \cdot {}^{32,32}E^{OP}_{reg} + {}^{8,8}E^{OP}_{reg}$$

$$E^{d\text{-}siv}_{i,MB} = 2 \cdot {}^{8,8,8}E^{OP}_{2:1\,mux} + \sum_{k=1}^{i} \mathrm{unit}^{MB}_{k} \cdot {}^{8,8}E^{OP}_{reg} + (n_c / \mathrm{num\_MB}_{current}) \cdot \left( 2 \cdot {}^{8,8,8}E^{OP}_{2:1\,mux} + \sum_{k=1}^{i} \mathrm{unit}^{MB}_{k} \cdot {}^{8,8}E^{OP}_{reg} \right) + (n_r / \mathrm{num\_HB}_{current}) \cdot \left( 2 \cdot {}^{8,8,8}E^{OP}_{2:1\,mux} + \sum_{k=1}^{i} \mathrm{unit}^{MB}_{k} \cdot {}^{8,8}E^{OP}_{reg} \right)$$







Static power consumption of a memory computation sub block with index i (MBi) with one output unit allocated is defined as Psi,MB. As discussed above, static power consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to MBi): Ps-svi,MB as the scale variant portion of Psi,MB and Ps-sivi,MB as the scale invariant portion. For FIG. 5, they can be calculated as follows:






$$P^{s\text{-}sv}_{i,MB} = 2 \cdot {}^{1\times 8}P^{s}_{memory}$$

$$P^{s\text{-}siv}_{i,MB} = {}^{8}P^{s}_{GEN\text{-}A} + {}^{8}P^{s}_{GEN\text{-}B} + 2 \cdot {}^{8,8,29}P^{s}_{MAC} + 2 \cdot {}^{29,32}P^{s}_{dq} + {}^{32,32,32}P^{s}_{adder} + {}^{32,32}P^{s}_{act} + {}^{32,8}P^{s}_{q} + {}^{8,8,8}P^{s}_{2:1\,mux} + 4 \cdot {}^{32,32}P^{s}_{reg} + 2 \cdot {}^{8,8}P^{s}_{reg}$$


Total energy consumed by a memory computation sub block with index i (MBi) can be written as below:






$$E^{N,total}_{i,MB} = N \cdot E^{d\text{-}sv}_{i,MB} + E^{d\text{-}siv}_{i,MB} + N \cdot P^{s\text{-}sv}_{i,MB} \cdot \mathrm{latency}_{e2e} + P^{s\text{-}siv}_{i,MB} \cdot \mathrm{latency}^{MB}_{i}$$


where latencye2e is the total latency of processing a given input of the recurrent neural network, latencyiMB is the latency of computation sub block MBi of processing all time multiplexed outputs allocated to it and N is the total number of output units time multiplexed onto MBi.
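Because the same total-energy composition is reused for every sub block type in this description, a small sketch may help; the per-term inputs are assumed to come from the scale variant and scale invariant expressions given above:

```python
def sub_block_total_energy(N, e_d_sv, e_d_siv, p_s_sv, p_s_siv,
                           latency_e2e, latency_block):
    """Total energy of a computation sub block with N output units time
    multiplexed onto it: scale variant dynamic energy grows with N, scale
    invariant dynamic energy is paid once, and static power is integrated
    over the end-to-end latency (scale variant portion) or the sub block's
    own latency (scale invariant portion), per the equation above."""
    return (N * e_d_sv) + e_d_siv + (N * p_s_sv * latency_e2e) + (p_s_siv * latency_block)
```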


Modeling Performance of Digital Hardware Implementation of Memory Cell (MB)


Latency of a memory computation sub block with index i (MBi) for processing N outputs, abbreviated as TNi, MB, can be calculated by adding up individual latencies of each operation involved in the computation of N output units. Similar to computation sub blocks introduced previously, a plurality of pipelining registers can be introduced to reduce the longest delay path and also increase throughput by filling up the data path with computation across multiple of these N units, thus shadowing critical stages of computation of chronologically newer output units behind critical stages of older output units. For the MB computation sub block shown in FIG. 5, assuming latency of 1 for each operation, it can be written as:






$$T^{i,MB}_{N} = (n_c \cdot N) + 5$$


Digital Hardware Implementation of the Hidden Cell (HB)

The digital hardware implementation of a hidden computation sub block, computing one unit of hidden cell output, is shown in FIG. 7. Each block is denoted as HBi, where i is the index of the computation sub blocks in a chain of HB computation sub blocks.


Similar to computation sub blocks introduced before, the cell's output computation will be temporally multiplexed in parallel over the allocated HBs.


As shown in FIG. 7, for producing an output of the hidden cell, the set of weights is stored in memory blocks [702], with each connection having its own memory block such as [700]. Each of these weight sets corresponds only to the appropriate output units assigned to it. These weights are then multiplied and accumulated with input flits from the input data bus [716], multiplexed [718] between [712] and [714], and the input bus [722], using the group of MAC blocks [706], with [704] being one of the MAC blocks in the group. The triangular inequality law is used here to fold the input and hidden states' multiply and accumulate operations onto the same MAC unit, thus also storing the corresponding weights in the same physical memory block. The resulting accumulated values are added up [720] before being optionally transformed using a hardware implementation of activation [710] and then written to the output memory block [734]. Finally, the output line [736] of the output memory block is connected to a multiplexer [730] and a registered line [726] to support interfacing with a multiplexer based chain interconnect, allowing movement of data from this computation sub block to other computation sub blocks [728]. The memory blocks in HB can individually be, but are not limited to, SRAM memory, DRAM memory, ROM memory, flash memory, or solid state memory and can be either volatile or non volatile. The HB shown in FIG. 7 implements the following equation:






$$h_t = \sigma(W_x \odot x_t + W_h \odot h_{t-1} + W_m \odot m_t)$$


where xt is input at timestep t, ht−1 is the hidden state at timestep t−1, mt is the entire flattened memory tape at timestep t with Wx, Wh and Wm being the weights respectively. σ is an activation function. ⊙ refers to a multiplication and accumulation of 2 input vectors.
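A behavioural sketch of one HB output unit is given below; in the hardware of FIG. 7 the three multiply and accumulate streams are folded onto two physical MAC units using the triangular inequality, whereas this software reference simply evaluates them directly (np.tanh is a placeholder activation):

```python
import numpy as np

def hb_output_unit(w_x, w_h, w_m, x_t, h_prev, m_t, activation=np.tanh):
    """One hidden output unit: sigma(w_x . x_t + w_h . h_prev + w_m . m_t),
    where the three weight rows correspond to the input, hidden state and
    flattened memory tape connections of this unit."""
    acc = np.dot(w_x, x_t) + np.dot(w_h, h_prev) + np.dot(w_m, m_t)
    return float(activation(acc))
```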


As shown in FIG. 7, pipeline registers [724] are also optionally introduced along the data path to increase throughput of the computation. A register can be optionally introduced after any of the sub-module of the multiply and accumulate unit, adder, hardware implementation of activation functions and memory blocks, etc.


As shown in FIG. 7, strategic quantization stages are also optionally introduced along the computation data path of HB. A plurality of quantization [708] and dequantization [732] blocks that descale a previously quantized input and requantize it to a plurality of bit widths can be optionally placed anywhere along the computation datapath. Optionally, the efficacy of a unique placement of these quantization blocks can be determined by training the neural network while replicating this quantization scheme in software and its impact on the performance of the neural network measured. The bit widths of other sub modules of multiply and accumulate unit, adder, hardware implementation of activation functions and memory blocks, etc. will change according to the placement and bit widths of plurality of quantization blocks introduced. For example, in FIG. 7, different data path connections have a bit width of 8 [8], 29 [29] and 32 [32].


Modeling Energy Consumption of Digital Hardware Implementation of Hidden Cell (HB)

Similar to the previously introduced computation sub blocks, power and energy consumption of a computation sub block HB can also be divided into 2 categories: 1. static leakage power; and 2. dynamic switching energy and each of these can further be divided into: 1. scale variant; and 2. scale invariant categories introduced previously.


Similar to previously introduced computation sub blocks, inactive computation sub blocks can be disconnected from the voltage supply or put in low power mode using Dynamic Voltage and Frequency Scaling, as introduced previously.


Dynamic energy consumed by a hidden computation sub block with index i (HBi) with one output unit allocated is defined as Edi,HB. As discussed above, dynamic energy consumption is further divided according to its dependency on allocation (the number of output units time multiplexed onto HBi): Ed-svi,HB as the scale variant portion of Edi,HB and Ed-sivi,HB as the scale invariant portion. For FIG. 7, assuming forward connection to an LMU layer with a single memory tape, they can be calculated as follows:







$$E^{d\text{-}sv}_{i,HB} = 2 \cdot {}^{8,8,8}E^{OP}_{2:1\,mux} + (n_i + n_r + n_c) \cdot {}^{8,8,29}E^{OP}_{MAC} + (n_i + n_r + n_c) \cdot {}^{8,8}E^{r}_{memory} + 2 \cdot {}^{29,32}E^{OP}_{dq} + {}^{32,32,32}E^{OP}_{adder} + {}^{32,32}E^{OP}_{act} + {}^{32,8}E^{OP}_{q} + {}^{8,8}E^{w}_{memory} + (n_r / \mathrm{num\_HB}_{current}) \cdot E^{r}_{memory} + E^{r}_{memory} + E^{r}_{memory}$$

$$E^{d\text{-}siv}_{i,HB} = 2 \cdot {}^{8,8,8}E^{OP}_{2:1\,mux} + \sum_{k=1}^{i} \mathrm{unit}^{HB}_{k} \cdot {}^{8,8}E^{OP}_{reg} + 2 \cdot {}^{8,8,8}E^{OP}_{2:1\,mux} + \sum_{k=1}^{i} \mathrm{unit}^{HB}_{k} \cdot {}^{8,8}E^{OP}_{reg} + (n_r / \mathrm{num\_HB}_{current}) \cdot \left( 2 \cdot {}^{8,8,8}E^{OP}_{2:1\,mux} + \sum_{k=1}^{i} \mathrm{unit}^{HB}_{k} \cdot {}^{8,8}E^{OP}_{reg} \right)$$







Static power consumption of a hidden computation sub block with index i (HBi) with one output unit allocated is defined as Psi,HB. As discussed above, static power consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to HBi): Ps-svi,HB as the scale variant portion of Psi,HB and Ps-sivi,HB as the scale invariant portion. For FIG. 7, they can be calculated as follows:






$$P^{s\text{-}sv}_{i,HB} = (n_i + n_r + n_c) \cdot {}^{1\times 8}P^{s}_{memory} + 2 \cdot {}^{1\times 8}P^{s}_{memory}$$

$$P^{s\text{-}siv}_{i,HB} = {}^{8,8,8}P^{s}_{2:1\,mux} + 2 \cdot {}^{8,8,29}P^{s}_{MAC} + 2 \cdot {}^{29,32}P^{s}_{dq} + {}^{32,32,32}P^{s}_{adder} + {}^{32,32}P^{s}_{act} + {}^{32,8}P^{s}_{q} + {}^{8,8,8}P^{s}_{2:1\,mux} + 4 \cdot {}^{32,32}P^{s}_{reg} + 2 \cdot {}^{8,8}P^{s}_{reg}$$


Total energy consumed by a hidden computation sub block with index i (HBi) can be written as below:






$$E^{N,total}_{i,HB} = N \cdot E^{d\text{-}sv}_{i,HB} + E^{d\text{-}siv}_{i,HB} + N \cdot P^{s\text{-}sv}_{i,HB} \cdot \mathrm{latency}_{e2e} + P^{s\text{-}siv}_{i,HB} \cdot \mathrm{latency}^{HB}_{i}$$


where latencye2e is the total latency of processing a given input of the recurrent neural network, latencyiHB is the latency of computation sub block HBi of processing all time multiplexed outputs allocated to it and N is the total number of output units time multiplexed onto HBi.


Modeling Performance of Digital Hardware Implementation of Hidden Cell (HB)

Latency of a hidden computation sub block with index i (HBi) for processing N outputs, abbreviated as TNi,HB, can be calculated by adding up individual latencies of each operation involved in the computation of N output units. Similar to computation sub blocks introduced previously, a plurality of pipelining registers can be introduced to reduce the longest delay path and also increase throughput by filling up the data path with computation across multiple of these N units, thus shadowing critical stages of computation of chronologically newer output units behind critical stages of older output units. For the HB computation sub block shown in FIG. 7, assuming latency of 1 for each operation, it can be written as:






$$T^{i,HB}_{N} = \begin{cases} (n_c \cdot N) + 5, & \text{if } n_c \geq n_r + n_i \\ (n_r \cdot N) + 5, & \text{if } n_r \geq n_i + n_c \\ (n_i \cdot N) + 5, & \text{if } n_i \geq n_r + n_c \end{cases}$$


Complete Digital Hardware Implementation of LMU Computation Block

A plurality of encoder computation sub blocks (UBs) can be connected to work in parallel with a plurality of memory computation sub blocks (MBs) which in turn can be connected to work in parallel with a plurality of hidden computation sub blocks (HBs), with memory output readouts from one set of blocks connected to the input bus of the other set via a chain of multiplexer and registers acting as an interconnect. FIG. 8 shows an input bus [806], 4 UBs [800], 8 MBs [802] and 4 HBs [804] connected together to form a computation block, with an encoder cell with 4 outputs time multiplexed onto the 4 UBs, a memory cell with 2048 outputs time multiplexed onto the 8 MBs and a hidden cell with 640 outputs time multiplexed onto the 4 HBs. The input data bus is fed to all 4 UBs and all 4 HBs. The streaming multiplexer chain interconnects after each set of UBs, MBs and HBs allow data dependencies to be well composed, despite any unbalanced/disjointed resource allocation between the 3 sets of computation sub blocks.


All 4 UBs will finish their MAC operations at the same time; this is ensured by design through the symmetric distribution of workload over the 4 UBs. All 4 UBs write their allocated encoder cell outputs to 4 different memory blocks (titled “memory output” in FIG. 4). Since each MB needs access to all of the encoder cell's outputs, each of these 4 memory blocks is then multiplexed and streamed serially onto a single bus [808]. This bus forms an input to the set of MBs shown in FIG. 8. All 8 MBs will likewise finish their MAC operations at the same time. All 8 MBs write their allocated memory cell outputs to 8 different memory blocks (titled “memory output” in FIG. 5). Since each MB, HB and UB needs access to all of the memory cell's outputs, each of these 8 memory blocks is then multiplexed and streamed serially onto a single bus [810], which is connected to all UBs, HBs and MBs. Finally, the 4 HBs will also finish their MAC operations at the same time. All 4 HBs write their allocated hidden cell outputs to 4 different memory blocks (titled “memory output” in FIG. 7). Since each UB and HB needs access to all of the hidden cell's outputs, each of these 4 memory blocks is then multiplexed and streamed serially onto a single bus [812], which is connected to all UBs and HBs.


Total energy consumption of the LMU computation block can be calculated by summing up individual energy consumption of its constituent set of encoder computation sub blocks (UB), constituent set of memory computation sub blocks (MB) and its constituent set of hidden computation sub blocks (HB).


For calculating the latency and throughput of an LMU computation block, the parallel mode of operation plays an important role. All constituent sub blocks in the set of constituent encoder computation sub blocks work in parallel. Same is true for the memory and hidden constituent set of computation sub blocks. Hence, the slowest UB sub block in finishing the computation of its allocated output units of the encoder cell determines the latency of the entire set of constituent encoder computation sub blocks. Similarly, the slowest MB sub block in finishing the computation of its allocated output units of the memory cell determines the latency of the entire set of constituent memory computation sub blocks. Finally, the slowest HB sub block in finishing the computation of its allocated output units of the hidden cell determines the latency of the entire set of constituent hidden computation sub blocks. Hence, to calculate the overall latency of an LMU computation block, the latency of the set of constituent encoder computation sub blocks can be added with the latency of the set of constituent memory computation sub blocks, which can then be added with the latency of the set of constituent hidden computation sub blocks. The throughput of the LMU computation block is determined by the slowest computation sub block, regardless of type.
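A minimal sketch of this latency and throughput composition is shown below, assuming the per-sub-block latencies have already been computed with the performance models above:

```python
def lmu_block_timing(ub_latencies, mb_latencies, hb_latencies):
    """Latency and throughput of an LMU computation block built from sets of
    UB, MB and HB sub blocks that each work in parallel within their set."""
    # Each set is as slow as its slowest member; the three sets are chained,
    # so their worst-case latencies add.
    latency = max(ub_latencies) + max(mb_latencies) + max(hb_latencies)
    # Throughput is limited by the slowest sub block overall, regardless of type.
    slowest = max(max(ub_latencies), max(mb_latencies), max(hb_latencies))
    return latency, 1.0 / slowest
```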


Digital Hardware Implementation of Feed Forward Cell (FB)

The digital hardware implementation of a feed forward computation block comprises a plurality of one type of constituent computation sub block. A constituent sub block, computing one unit of feed forward cell output, is shown in FIG. 9. Each constituent computation sub block is denoted as FBi, where i is the index of the computation sub blocks in a chain of FB computation sub blocks.


Similar to computation sub blocks introduced before, the cell's output computation will be temporally multiplexed in parallel over the allocated FBs.


As shown in FIG. 9, for producing an output of the feed forward cell, the set of weights is stored in a memory block [900]. Each of these weight sets corresponds only to the appropriate output units assigned to it. These weights are then multiplied and accumulated with input flits from the input data bus [902] using the MAC block [904]. The resulting accumulated values are optionally transformed using a hardware implementation of activations [908] and stored in the output memory block [910]. Finally, the output line [920] of the output memory block is connected to a multiplexer [918] and a registered line [912] to support interfacing with a multiplexer based chain interconnect, allowing movement of data from this computation sub block to other computation sub blocks [914]. The memory blocks in FB can individually be, but are not limited to, SRAM memory, DRAM memory, ROM memory, flash memory, or solid state memory and can be either volatile or non volatile. The FB shown in FIG. 9 implements the following equation:






$$f_t = \sigma(W_x \odot x_t)$$


where xt is the input at timestep t and Wx are the weights of the feed forward cell. σ is an activation function. ⊙ refers to a multiplication and accumulation of 2 input vectors.


As shown in FIG. 9, pipeline registers [916] are also optionally introduced along the data path to increase throughput of the computation. A register can be optionally introduced after any of the sub-module of the multiply and accumulate unit, hardware implementation of activation functions and memory blocks, etc.


As shown in FIG. 9, strategic quantization stages are also optionally introduced along the computation data path of FB. A plurality of quantization [922] and dequantization [906] blocks that descale a previously quantized input and requantize it to a plurality of bit widths can be optionally placed anywhere along the computation datapath. Optionally, the efficacy of a unique placement of these quantization blocks can be determined by training the neural network while replicating this quantization scheme in software and its impact on the performance of the neural network measured. The bit widths of other sub modules of multiply and accumulate unit, hardware implementation of activation functions and memory blocks, etc. will change according to the placement and bit widths of plurality of quantization blocks introduced. For example, in FIG. 9, different data path connections have a bit width of 8 [8], 29 [29] and 32 [32].


Modeling Energy Consumption of Digital Hardware Implementation of Feed Forward Cell (FB)

Similar to the previously introduced computation sub blocks, power and energy consumption of a computation sub block FB can also be divided into 2 categories: 1. static leakage power; and 2. dynamic switching energy and each of these can further be divided into: 1. scale variant; and 2. scale invariant categories introduced previously.


Similar to previously introduced computation sub blocks, inactive computation sub blocks can be disconnected from the voltage supply or put in low power mode using Dynamic Voltage and Frequency Scaling, as introduced previously.


Dynamic energy consumed by a feed forward computation sub block with index i (FBi) with one output unit allocated is defined as Edi,FB. As discussed above, dynamic energy consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to FBi): Ed-svi,FB as the scale variant portion of Edi,FB and Ed-sivi,FB as the scale invariant portion. For FIG. 9, assuming forward connection to a feed forward computation sub block, they can be calculated as follows:






$$E^{d\text{-}sv}_{i,FB} = n_o \cdot {}^{8,8,29}E^{OP}_{MAC} + n_o \cdot {}^{8,8}E^{r}_{memory} + {}^{29,32}E^{OP}_{dq} + {}^{32,32}E^{OP}_{act} + {}^{32,8}E^{OP}_{q} + {}^{8,8}E^{w}_{memory} + (n_r / \mathrm{num\_FB}_{next\_layer}) \cdot E^{r}_{memory} + 2 \cdot {}^{32,32}E^{OP}_{reg} + {}^{8,8}E^{OP}_{reg}$$

$$E^{d\text{-}siv}_{i,FB} = (n_r / \mathrm{num\_FB}_{next\_layer}) \cdot \left( 2 \cdot {}^{8,8,8}E^{OP}_{2:1\,mux} + \sum_{k=1}^{i} \mathrm{unit}^{FB}_{k} \cdot {}^{8,8}E^{OP}_{reg} \right)$$


Static power consumption of a feed forward computation sub block with index i (FBi) with one output unit allocated is defined as Psi,FB. As discussed above, static power consumption is further divided according to its dependency on allocation (number of output units time multiplexed on to FBi): Ps-svi,FB as the scale variant portion of Psi,FB and Ps-sivi,FB as the scale invariant portion. For FIG. 9, they can be calculated as follows:






$$P^{s\text{-}sv}_{i,FB} = n_o \cdot {}^{1\times 8}P^{s}_{memory} + {}^{1\times 8}P^{s}_{memory}$$

$$P^{s\text{-}siv}_{i,FB} = {}^{8,8,29}P^{s}_{MAC} + {}^{29,32}P^{s}_{dq} + {}^{32,32}P^{s}_{act} + {}^{32,8}P^{s}_{q} + {}^{8,8,8}P^{s}_{2:1\,mux} + 2 \cdot {}^{32,32}P^{s}_{reg} + 2 \cdot {}^{8,8}P^{s}_{reg}$$


Total energy consumed by a feed forward computation sub block with index i (FBi) can be written as below:






$$E^{N,total}_{i,FB} = N \cdot E^{d\text{-}sv}_{i,FB} + E^{d\text{-}siv}_{i,FB} + N \cdot P^{s\text{-}sv}_{i,FB} \cdot \mathrm{latency}_{e2e} + P^{s\text{-}siv}_{i,FB} \cdot \mathrm{latency}^{FB}_{i}$$


where latencye2e is the total latency of processing a given input of the recurrent neural network, latencyiFB is the latency of computation sub block FBi of processing all time multiplexed outputs allocated to it and N is the total number of output units time multiplexed onto FBi.


Modeling Performance of Digital Hardware Implementation of Feed Forward Cell (FB)

Latency of a feed forward computation sub block with index i (FBi) for processing N outputs, abbreviated as TNi,FB, can be calculated by adding up individual latencies of each operation involved in the computation of N output units. Similar to computation sub blocks introduced previously, a plurality of pipelining registers can be introduced to reduce the longest delay path and also increase throughput by filling up the data path with computation across multiple of these N units, thus shadowing critical stages of computation of chronologically newer output units behind critical stages of older output units. For the FB computation sub block shown in FIG. 9, assuming latency of 1 for each operation, it can be written as:






$$T^{i,FB}_{N} = (n_o \cdot N) + 4$$


Software Architecture of Mapping Algorithm


A mapping or allocation is defined as time multiplexing of whole or part of the above discussed recurrent neural network, composed of at least one of: an LMU cell, an LSTM-P cell, and zero or more feed forward cells, onto a spatial distribution of a plurality of corresponding types of computation blocks.


A partitioning of a neural network is defined as separating a sequence of layers of that network into a group such that all layers belonging to that group will be time multiplexed onto the same computation block. An entire neural network can be partitioned into multiple groups (also called partitions). Layers of different types cannot be partitioned into the same group. Hence, a new group must be formed when consecutive layers change types.
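As an illustration of the type rule only, the sketch below groups consecutive layers of the same type; the coarsest legal partitioning it produces can then be split further by the search described next:

```python
def coarsest_partitioning(layer_types):
    """Group consecutive layers of the same type (e.g. "LMU", "LSTM-P", "FF")
    into partitions, starting a new partition whenever the type changes."""
    partitions = []
    for layer_type in layer_types:
        if partitions and partitions[-1][0] == layer_type:
            partitions[-1].append(layer_type)
        else:
            partitions.append([layer_type])
    return partitions

# Example: 6 LMU layers followed by 2 feed forward layers, as in FIG. 11,
# yields two groups before any further splitting:
# coarsest_partitioning(["LMU"] * 6 + ["FF"] * 2)
# -> [['LMU', 'LMU', 'LMU', 'LMU', 'LMU', 'LMU'], ['FF', 'FF']]
```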


A mapping of a recurrent neural network into the computational blocks introduced above depends on the partitioning of the network as well as the number of each type of constituent computation sub blocks for that partition.


For discovering a mapping of a given recurrent neural network that comes closest to satisfying user specified constraints on a combination of latency, throughput, power and the like, first, a linear search over all possible partitionings [1000] of the recurrent neural network is performed. For a large recurrent neural network where a linear search might be too slow, the linear search can be replaced with intelligent search methods including, but not limited to, gradient based searches, evolutionary strategy guided searches, etc. For a given partitioning, each partition's computation block is allocated one each of its constituent computation sub blocks [1002]. The total latency and energy consumption of the network is computed [1004] by adding up the latency and energy consumption of each partition's computation block using the methods discussed previously. The throughput of the network is calculated by identifying the throughput of each partition's computation block, with the slowest computation block determining the overall throughput of the network. At this stage, either the slowest computation block, or the block which, if the count of its slowest constituent computation sub block were increased, would lead to the greatest decrease in latency, is chosen. The count of the slowest constituent computation sub block of the selected computation block is incremented by 1 [1008]. This process of incrementing the count of a computation sub block for a partitioning, labelled “iterative resource allocation” and shown in FIG. 10, is repeated until either the user specified constraints are met or any specified power or area constraints are exceeded [1006]. The linear search then moves on to the next partitioning and repeats this process of “iterative resource allocation”.
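A minimal sketch of this "iterative resource allocation" loop for one candidate partitioning is shown below. The callables passed in (latency_of, slowest_sub_block_of, meets_constraints, exceeds_budget) are assumptions standing in for the performance, power and constraint models discussed above, and `partitions` is assumed to map a partition name to the tuple of its constituent sub block types (e.g. ("UB", "MB", "HB") for an LMU partition, ("FB",) for a feed forward partition):

```python
def iterative_resource_allocation(partitions, latency_of, slowest_sub_block_of,
                                  meets_constraints, exceeds_budget):
    """Allocate constituent computation sub blocks to each partition's
    computation block, as in FIG. 10: start with one of each, then repeatedly
    increment the slowest sub block of the bottleneck block until the user
    constraints are met or a power/area budget is exceeded."""
    allocation = {name: {sb: 1 for sb in subs} for name, subs in partitions.items()}
    while not meets_constraints(allocation) and not exceeds_budget(allocation):
        # The bottleneck block (alternatively: the block whose increment would
        # yield the greatest latency decrease) is selected at each step.
        bottleneck = max(allocation, key=lambda name: latency_of(name, allocation[name]))
        allocation[bottleneck][slowest_sub_block_of(bottleneck, allocation[bottleneck])] += 1
    return allocation
```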



FIG. 11 shows the final mapping [1120] of a recurrent neural network [1118], composed of 6 LMU cell layers, with [1116] being the first LMU layer, and 2 feed forward layers, partitioned into 4 partitions [1100], [1104], [1108], [1112] and mapped to a computation block of corresponding types [1102], [1106], [1110], [1114], with each computation block made up of different number of corresponding computation sub blocks, where all layers belonging to a partition will be time multiplexed onto the partition's computation block for processing.


Example of Low Power, Real Time Keyword Spotting

A custom accelerator for keyword spotting applications is implemented using the above mentioned computational blocks and mapping techniques. The custom accelerator is a mapping of an LMU based recurrent neural network that has been trained to achieve 95.9% test accuracy on the SpeechCommands dataset (see Warden, P., 2018. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209). The LMU based recurrent neural network is 361 kbits in size. The computation blocks are configured to use 4 bit quantized weights and 7 bit quantized activations and use SRAM as memory blocks. The recurrent neural network is then trained with these quantization settings to mimic hardware aware training. The design of the hardware accelerator was explored across different clock frequencies, while always ensuring that the timing constraints of the SpeechCommands models (40 ms windows updated every 20 ms) are satisfied in real time. Cycle-accurate power envelopes of each design were guided by the above discussed power and energy estimation models for each computation block that makes up the hardware accelerator design. Total power usage is determined with these envelopes using publicly available power data (see Frustaci, F., Khayatzadeh, M., Blaauw, D., Sylvester, D. and Alioto, M., 2015. SRAM for error-tolerant applications with dynamic energy-quality management in 28 nm CMOS. IEEE Journal of Solid-state circuits, 50(5), pp. 1310-1323) (see Hoppner, S. and Mayr, C., 2018. SpiNNaker2-towards extremely efficient digital neuromorphics and multi-scale brain emulation. Proc. NICE) (see Yabuuchi, M., Nii, K., Tanaka, S., Shinozaki, Y., Yamamoto, Y., Hasegawa, T., Shinkawata, H. and Kamohara, S., 2017, June. A 65 nm 1.0 V 1.84 ns Silicon-on-Thin-Box (SOTB) embedded SRAM with 13.72 nW/Mbit standby power for smart IoT. In 2017 Symposium on VLSI Circuits (pp. C220-C221). IEEE). Multiply-accumulate (MAC) and SRAM dynamic and static power, are the dominant power consumers in the design. Dynamic power for multipliers, dividers, and other components was estimated as a function of the number of transistors in the component, and the power cost per transistor of the MAC. All estimates are for a 22 nm process. To estimate the number of transistors, and hence the area, of the design we generated RTL designs of each of the relevant components, and used the yosys open source tool (see Wolf, C., 2016. Yosys open synthesis suite) and libraries to estimate the number of transistors required for the total number of components included in our network. FIG. 12 shows the resulting power/area trade-off for our LMU based design. Note that all designs depicted are real-time capable. We observe increased power and area consumption when operating in the realm of low frequency, directly attributed to requiring additional resources to meet real time constraints along with dynamic switching of “glue” logic to support parallel operations of these resources. As we increase frequency, it becomes easier to maintain real time operations, thus progressively reducing resources required. We then reach the lowest power design, found at 8.79 μW (92 kHz clock) and 8,052,298 transistors. For this design, the throughput for one 20 ms frame is 13.38 ms and the latency for the 40 ms update is 39.59 ms. 
Beyond this optimal design point, any increase in frequency is met with a sharp rise in power, explained by the design becoming faster than real time, thus requiring no additional resources but disproportionately decreasing end-to-end latency.


Example of Low Power, Real Time Automatic Speech Recognition

A custom accelerator for implementing the RNN-T network for automatic speech recognition (see He, Y., Sainath, T. N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., Rybach, D., Kannan, A., Wu, Y., Pang, R. and Liang, Q., 2019, May. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6381-6385). IEEE) is implemented using the above mentioned computational blocks and mapping techniques. The computation blocks are configured to use 8 bit quantized weights and 8 bit quantized activations and use SRAM as memory blocks. The design of the hardware accelerator was fixed at 250 MHz, while always ensuring that the accelerator is able to process an entire input sample in less than 60 ms. Cycle-accurate power envelopes of the design were guided by the above discussed power and energy estimation models for each computation block that makes up the hardware accelerator design. Total power usage is determined with these envelopes using publicly available power data (see Frustaci, F., Khayatzadeh, M., Blaauw, D., Sylvester, D. and Alioto, M., 2015. SRAM for error-tolerant applications with dynamic energy-quality management in 28 nm CMOS. IEEE Journal of Solid-state circuits, 50(5), pp. 1310-1323) (see Hoppner, S. and Mayr, C., 2018. SpiNNaker2-towards extremely efficient digital neuromorphics and multi-scale brain emulation. Proc. NICE) (see Yabuuchi, M., Nii, K., Tanaka, S., Shinozaki, Y., Yamamoto, Y., Hasegawa, T., Shinkawata, H. and Kamohara, S., 2017, June. A 65 nm 1.0 V 1.84 ns Silicon-on-Thin-Box (SOTB) embedded SRAM with 13.72 nW/Mbit standby power for smart IoT. In 2017 Symposium on VLSI Circuits (pp. C220-C221). IEEE). All estimates are for a 22 nm process. We forego analysis of small components like multiplexers, registers, etc in our power model. Instead, we focus only on the widely agreed-upon energy sinks: 1. MAC operations and 2. SRAM based operations. We explored 3 different cases of mapping using the LSTM-P as a reference for the size of the LMU to be used in the RNN-T network:

    • 1. In the first case, we set the dimensions of the LMU to be the same as that of the LSTM-P used in the RNN-T architecture (see He, Y., Sainath, T. N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., Rybach, D., Kannan, A., Wu, Y., Pang, R. and Liang, Q., 2019, May. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6381-6385). IEEE). In this case, we set the dimension of the memory cell to be 2048 and the dimension of the hidden cell to be 640 in the LMU.
    • 2. In the second case, we trained both the LSTM-P version and the LMU version of the networks for Librispeech (see Panayotov, V., Chen, G., Povey, D. and Khudanpur, S., 2015, April. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5206-5210). IEEE) to accuracy parity. We then calculated the ratio of the LMU dimensions versus the LSTM-P dimensions used in this task. We then scaled the RNN-T LSTM-P's dimensions by the same ratio and set the resulting numbers to be the dimensions of the LMU.
    • 3. In the third case, we followed the same Librispeech strategy for scaling. However, this time we used the TIMIT corpus (see Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G. and Pallett, D. S., 1993. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n, 93, p. 27403) as our target dataset.


For the RNN-T network, we generated a total of 1024 partition combinations corresponding to increasingly partitioning the network (1 partition to 12 partitions) and then generating all possible combinations for that set number of partitions. For each partition combination, we derived the spatial configuration of computation blocks and time multiplexing of these partitions onto said computation blocks using our mapping algorithm. We used the performance and power models of computation blocks described above to derive latency and power numbers at every step of iterative resource allocation.


Before sharing the results for the LMU, we first share the details of the LSTM-P's performance:

    • Throughput: The optimal mapping will be able to accept a new sample of 60 ms every 9.83 ms. This translates to a throughput of 101.72 audio samples per second
    • Latency: This mapping will be able to process an input sample of 60 ms and generate the output after 59.99 ms.
    • Power Consumption: The network will consume 8.62 mW of power to process this sample
    • Mapping details: Count of constituent computation sub blocks for each computation block of partition is shown below for optimal mapping.
      • Partition containing Encoder Layer 1: [8 LB, 4 PB]
      • Partition containing Encoder Layer 2: [4 LB, 1 PB]
      • Partition containing Encoder Layer 3: [3 LB, 1 PB]
      • Partition containing Encoder Layer 4: [3 LB, 1 PB]
      • Partition containing Encoder Layer 5: [3 LB, 1 PB]
      • Partition containing Encoder Layer 6: [3 LB, 1 PB]
      • Partition containing Encoder Layer 7, Encoder Layer 8, Prediction Layer 1, Prediction Layer 2: [23 LB, 4 PB]
      • Partition containing Feed Forward Joint Network Layer: 2 FF
      • Partition containing Feed Forward Softmax Layer: 4 FF


We now share the results of the optimal mapping for the 3 different cases of LMU sizes discussed above. For the case of setting the LMU dimensions equal to the LSTM-P's dimensions, we use the LMU for the recurrent connections in place of the LSTM-P. For the optimal mapping, all layers are mapped into their own separate partitions, except for the prediction network, where the 2 layers making it up are mapped onto a single partition. We now discuss the throughput, latency and power numbers for this mapping.

    • Throughput: This mapping will be able to accept a new sample of 60 ms every 7.716 ms. This translates to a throughput of 129.6 audio samples per second
    • Latency: This mapping will be able to process an input sample of 60 ms and generate the output after 59.65 ms.
    • Power Consumption: The network will consume 2.39 mW of power to process this sample
    • Resource Allocation: The resource level distribution is shown below. Remember that an LMU RNN (section 2.2) is made up of 3 processing sub blocks per layer: UB, MB and HB. A Feed Forward Layer (section 2.4) is made up of a single processing block called FB.
      • Partition containing Encoder Layer 1: [1 UB, 9 MB, 3 HB]
      • Partition containing Encoder Layer 2: [1 UB, 8 MB, 3 HB]
      • Partition containing Encoder Layer 3: [1 UB, 4 MB, 2 HB]
      • Partition containing Encoder Layer 4: [1 UB, 4 MB, 2 HB]
      • Partition containing Encoder Layer 5: [1 UB, 4 MB, 2 HB]
      • Partition containing Encoder Layer 6: [1 UB, 4 MB, 2 HB]
      • Partition containing Encoder Layer 7: [1 UB, 4 MB, 2 HB]
      • Partition containing Encoder Layer 8: [1 UB, 4 MB, 2 HB]
      • Partition containing Prediction Layer 1, Prediction layer 2: [1 UB, 15 MB, 10 HB]
      • Partition containing Feed Forward Joint Network Layer: 2 FF
      • Partition containing Feed Forward Softmax Layer: 5 FF


For the second case, we calculated the ratio between the size of LSTM-P and LMUs used in achieving similar accuracy for the Librispeech dataset. We observed that the best LMU was roughly one-third the size of the best LSTM-P in this experiment. In the equal dimensions experiment, the LMU is roughly six times smaller than the LSTM-P. This equates to a scaling factor of 2. Hence we expect that were a similar scaling factor used in the RNN-T network to achieve accuracy parity, the LMU would consume roughly twice the power of the equal dimensions experiment. This results in power consumption of 4.78 mW.


For the third case, we calculated the ratio between the size of LSTM-P and LMUs used in achieving similar accuracy for the TIMIT dataset. We measured the dimension of the best performing LMU and the best performing LSTM-P. We then scaled the dimensions of the LSTM-P in the RNN-T by these ratios to arrive at the LMU dimensions. Hence we expect that were this LMU used in the RNN-T network to achieve accuracy parity, it would consume 4.4 mW of power.


We now present additional details about this experiment. We first focus on the increased power consumption of the mapping of the RNN-T network built with the LSTM-P versus the RNN-T network built with LMUs. The RNN-T network built with the LSTM-P consumes 145.4 MB of SRAM memory. This is approximately 6× larger than the 26 MB of the RNN-T built with LMUs. Hence, the static power consumption of the memory for an LSTM-P implementation pushes it into the domain of infeasibility. We also explore the relationship between partitions of the RNN-T and the associated power consumption of the optimal mapping of each partition combination. Any partition combination can achieve an end-to-end latency of less than 60 ms; the important thing is for it to do so while consuming as little power as possible. FIG. 13 presents the final latency on the y axis and the waiting period (1/throughput) on the x axis for the case of the equal dimension LMU and LSTM-P. The best sample would be the one lying bottom left, that is, having the lowest latency and input waiting period. Note that we do not show power here; all the LSTM-P points consume more than 5 mW. Another thing to remember is that we stop iterative resource allocation once we reach 60 ms latency. For this discussion, the best partition combination is given by an LMU implementation with an input waiting period of 7.716 ms and an end-to-end latency of 59.65 ms.


We now focus on the relationship between the throughput of hardware mappings and the number of partitions in FIG. 14. We analysed the best possible throughput achieved after iterative resource allocation by a partition combination with a certain number of partitions, for the case of the equal dimensions LMU and LSTM-P. Remember that there are 12CN possible combinations if we split the network into N partitions. For a particular N (x-axis), we found the combination out of these 12CN that achieves the best throughput and plotted that. The overall trend is that as we increase the number of partitions, the overall throughput improves, since each partition has to do less work as fewer and fewer layers are mapped onto it. However, note that as we increasingly partition, we get diminishing returns. This is because the dependency of the prediction network's input on the previous output of the entire network imposes a strict critical path that cannot be partitioned further. This starts to dominate the overall throughput and hence there are no benefits to partitioning beyond a point.


We now correlate latency trends with the steps of iterative resource allocation. Recall that each step in iterative resource allocation finds the bottleneck and adds more resources to that computation sub block. For this analysis, we show the network's latency (y axis) evolution through the steps of iterative resource allocation (x axis) in FIG. 15 for the case of the equal dimension LMU and LSTM-P. The network initially has 2 partitions, sliced at the interface of the RNN layers' output with the Feed Forward layers' input. Note that, as expected, the general trend is for latencies to decrease as more resources are allocated. However, note that the LSTM-P based implementation is generally faster than the LMU at any iterative resource allocation step. Furthermore, it is quicker to reach the 60 ms end-to-end latency (which is where iterative resource allocation stops). This is because there is less parallelism to take advantage of in the LMU when compared to an LSTM-P and a greater amount of data dependency within a timestep. Owing to the increased depth of computation and the dependency of one step on the previous, there is less scope for distributing the computations in parallel. Hence, we end up with a slightly higher end to end latency in the case of an LMU implementation. The LMU overall requires less computation, but the compute is nested deeper and cannot be distributed as well when compared to the LSTM-P.

Claims
  • 1. A digital hardware system for processing time series data with a recurrent neural network, wherein the digital hardware system comprises a plurality of computation blocks, the digital hardware system computes parts of a given recurrent neural network by time multiplexing the parts of the recurrent neural network over a spatially distributed set of computation blocks of the digital hardware system; wherein the digital hardware system is configured to: read a sequence of time series data to be processed from the memory of the digital hardware system;temporally process the data by time multiplexing the recurrent neural network over the spatially distributed set of computation blocks of the digital hardware system; and,write final processed output or processed intermediate activations of the recurrent neural network to the memory of the digital hardware system.
  • 2. The digital hardware system of claim 1, wherein the computation block is a digital hardware implementation of a projected LSTM cell, and the computation block contains at least one LB and at least one PB computation sub block, where a plurality of LB computation sub blocks work in parallel to time multiplex the total compute mapped onto the LB computation sub blocks and a plurality of PB computation sub blocks work in parallel to time multiplex the total compute mapped onto the PB computation sub blocks wherein the LB computation sub block implements computation of the LSTM sub cell of the projected LSTM cell, including: at least one input signal, at least one multiplexer, a plurality of network weights stored over four separate memory blocks, four multiply and accumulate units, a plurality of quantization blocks, a plurality of dequantization blocks, three multipliers, one adder, 5 activation blocks, a plurality of registers, a memory block to store intermediate values and a memory block to store output values with each component described above configured for a plurality of bit widths;four parallel multiply and accumulate operations, wherein said network weights are multiplied and accumulated with said input signal with the four multiply and accumulate operations mapped onto the four multiply and accumulate units;the four resulting values processed by a plurality of multipliers, adders, quantization blocks, dequantization blocks and activation blocks, with intermediate values along the data path being written to a memory block;the final processed output written to a memory block.wherein the PB computation sub block implements computation of the projection sub cell of the projected LSTM cell, including: an input signal, at least one multiplexer, a plurality of network weights stored over a memory block, one multiply and accumulate unit, a plurality of quantization blocks, a plurality of dequantization blocks, an activation block, a plurality of registers and a memory block to store output values with each component described above configured for a plurality of bit widths;wherein said network weights are multiplied and accumulated with said input signal using the multiply and accumulate block;the resulting value processed by a plurality of quantization blocks, dequantization blocks and activation blocks;the final processed output written to a memory block.
  • 3. The digital hardware system of claim 1, wherein the computation block is a digital hardware implementation of an LMU cell, and the block contains at least one each of an UB, an MB and an HB computation sub block, where a plurality of UB computation sub blocks work in parallel to time multiplex the total compute mapped onto the UB computation sub blocks, a plurality of MB computation sub blocks work in parallel to time multiplex the total compute mapped onto the MB computation sub blocks, and a plurality of HB computation sub blocks work in parallel to time multiplex the total compute mapped onto the HB computation sub blocks, wherein the UB computation sub block implements computation of the encoder sub cell of the LMU cell, including: at least two input signals, at least one multiplexer, a plurality of network weights stored over at least two separate memory blocks, at least two multiply and accumulate units, a plurality of quantization blocks, a plurality of dequantization blocks, at least one adder, an activation block, a plurality of registers, and a memory block to store output values with each component described above configured for a plurality of bit widths;at least three multiply and accumulate operations, wherein said network weights are multiplied and accumulated with said input signals, with two of the multiply and accumulate operations optionally mapped onto the same physical multiply and accumulate block;the resulting values of the multiply and accumulate operations processed by a plurality of adders, quantization blocks, dequantization blocks and activation blocks;the final processed output written to a memory block.wherein the MB computation sub block implements computation of the memory sub cell of the LMU cell containing a plurality of memory tapes, including: at least two input signals, at least one multiplexer, a plurality of network weights stored over at least two separate memory blocks, at least two multiply and accumulate units, a plurality of quantization blocks, a plurality of dequantization blocks, at least one adder, an activation block, a plurality of registers, and a memory block to store output values with each component described above configured for a plurality of bit widths;two parallel multiply and accumulate operations, wherein said network weights are multiplied and accumulated with said input signals.the resulting values of the multiply and accumulate operations processed by a plurality of adders, quantization blocks, dequantization blocks and activation blocks;the final processed output written to a memory block.wherein the HB computation sub block implements computation of the hidden sub cell of the LMU cell, comprising: at least two input signals, at least one multiplexer, a plurality of network weights stored over at least two separate memory blocks, at least two multiply and accumulate units, a plurality of quantization blocks, a plurality of dequantization blocks, at least one adder, an activation block, a plurality of registers, and a memory block to store output values with each component described above configured for a plurality of bit widths;at least three multiply and accumulate operations, wherein said network weights are multiplied and accumulated with said input signal, with two of the multiply and accumulate operations optionally mapped onto the same physical multiply and accumulate block;the resulting values of the multiply and accumulate operations processed by a plurality of adders, quantization blocks, dequantization blocks 
and activation blocks;the final processed output written to a memory block.
  • 4. The digital hardware system of claim 1, where the computation block is a digital hardware implementation of a feed forward cell, and the computational block contains at least one FB computation sub block and a plurality of FB computation sub blocks work in parallel to time multiplex the total compute mapped onto the FB computational sub blocks, wherein the FB computation sub block implements computation of the feed forward cell, comprising: an input signal, at least one multiplexer, a plurality of network weights stored over a memory block, one multiply and accumulate unit, a plurality of quantization blocks, a plurality of dequantization blocks, an activation block, a plurality of registers and a memory block to store output values with each component described above configured for a plurality of bit widths;wherein said network weights are multiplied and accumulated with said input signal using the multiply and accumulate block;the resulting value processed by a plurality of quantization blocks, dequantization blocks and activation blocks;the final processed output written to a memory block.
  • 5. The digital hardware system of claim 3, where the weights A are generated on the fly using GenA and the weights B are generated on the fly using GenB blocks: wherein the GenA block comprises: at least three adders, one right shift operator, one multiplier, two multiplexers, a plurality of quantization blocks, with each component described above configured for a plurality of bit widths;a plurality of internal signals, incremented by 1 at every time step, shifted to the right by 1 bit using a right shift operation, and the resulting signal connected to a datapath comprising a plurality of multipliers, multiplexers and quantization blocks;wherein the GenB block comprises: at least three adders, one right shift operator, one multiplier, two multiplexers, a plurality of quantization blocks, with each component described above configured for a plurality of bit widths;a plurality of internal signals, incremented by 1 at every time step, shifted to the right by 1 bit using a right shift operation, and the resulting signal connected to a datapath comprising a plurality of multipliers, multiplexers and quantization blocks.
  • 6. The digital hardware system of claim 1, where part or whole of at least one computation block is operated with reduced voltage or reduced frequency using Dynamic Voltage and Frequency scaling.
  • 7. The digital hardware system of claim 1, wherein the computation block's multiply and accumulate operations are used to shadow non-concurrent operations using a plurality of registers along the computation data path.
  • 8. The digital hardware system of claim 3, wherein the multiply and accumulate operations for three vectors are mapped onto two physical multiply and accumulate blocks if the three vectors satisfy the triangular inequality law, where the largest sized vector's multiply and accumulate operations among the three said vectors is mapped to one of the two multiply and accumulate blocks with the remaining vectors' multiply and accumulate operations mapped onto the other multiply and accumulate block.
  • 9. The digital hardware system of claim 1, wherein the computation blocks are further configured to store quantized values of weights as integers represented in binary with a configurable number of bits and store quantized values of outputs as represented in binary with a configurable number of bits in their memory, wherein the multiply and accumulate blocks perform multiply and accumulate operations on quantized values of inputs and weights.
  • 10. The digital hardware system of claim 1, wherein a plurality of the constituent computation sub blocks of computation blocks are connected to a network on chip interconnect.
  • 11. A method for producing a digital hardware system for processing a recurrent neural network to balance requirements of energy, latency, throughput and accuracy, that accepts as input: the recurrent neural network, the requirements of energy, latency, throughput and accuracy, and outputs: the digital hardware system for processing the recurrent neural network, wherein the digital hardware system comprises a plurality of computation blocks; wherein the digital hardware system is configured to: read a sequence of time series data to be processed from the memory of the digital hardware system;temporally process the data by time multiplexing the recurrent neural network over the spatially distributed set of computation blocks of the digital hardware system; and,write final processed output or processed intermediate activations of the recurrent neural network to the memory of the digital hardware system.wherein the method comprises: partitioning the layers of the recurrent neural network and assigning a computation block to each partition, where the computation block is responsible for processing compute of the partition and is composed of a plurality of computation sub blocks;estimating latency and throughput of each computation block;determining the number of computation sub blocks in each computation block;estimating energy consumption of every computation block.
  • 12. The method of claim 11, wherein the layers of whole or part of the recurrent neural network are grouped into partitions, with all layers inside a partition composed of same type of cells: LMU cells or Projected LSTM cells or Feed Forward cells, and all layers of a partition are time multiplexed onto the same computation block.
  • 13. The method of claim 12, wherein latency and throughput of computation blocks and its constituent computation sub blocks are estimated using a timing model comprising, calculating the latency of a computation sub block as the sum of individual latency of operations of said computation sub block for processing the computation of partitions time multiplexed onto it;calculating the throughput of a computation sub block as the inverse of latency of the said computation sub block;calculating the latency of a computation block as the sum of individual latencies of the slowest of each type of its constituent computation sub blocks;calculating the throughput of a computation block as the inverse of latency of the said computation block.
  • 14. The method of claim 13, wherein the number of constituent computation sub blocks for a plurality of computation blocks are determined using iterative resource allocation comprising, calculating the latency of each computation block and its constituent computation sub blocks;calculating total latency by adding up latencies of all computation blocks;either the computation block with highest latency or the computation block, which if the count of its slowest constituent computation sub block was increased would lead to greatest decrease in total latency, is chosen;the count of the slowest constituent computation sub block of the selected computation block is incremented by 1.
  • 15. The method of claim 13, wherein energy consumption of a computation block and its constituent computation sub blocks is estimated using an energy model that separates energy into static power dissipation over time and dynamic energy and further subdivides each into scale variant and scale invariant while taking in as input: the partition of layers time multiplexed onto said computation block and outputs the total energy consumed by the said computation block and its constituent computation sub blocks, wherein the scale variant dynamic energy of a computation sub block is calculated by adding up the dynamic energy of scale variant operations needed for processing one output unit of said computation sub block and multiplying it by number of output units of a partition time multiplexed onto said computation sub block;wherein the scale invariant dynamic energy of a computation sub block is calculated by adding up the dynamic energy of scale invariant operations needed for processing all output units of a partition time multiplexed onto said computation sub block;wherein the total dynamic energy of a computation sub block is calculated by adding up said computation sub block's scale variant dynamic energy and scale invariant dynamic energy;wherein the scale variant static power of a computation sub block is calculated by adding up the static power consumption of scale variant operators and multiplying it by number of output units of a partition time multiplexed onto said computation sub block;wherein the scale invariant static power of a computation sub block is calculated by adding up the static power consumption of scale invariant operators;wherein the total static power consumption of a computation sub block is calculated by adding up said computation sub block's scale variant static power consumption and scale invariant static power consumption;wherein the total energy of a computation sub block is calculated by adding together the: total dynamic energy of said computation sub block, latency of said computation sub block multiplied with total static power consumption of said computation sub block;wherein the total energy of a computation block is calculated by adding up total energy of its constituent computation sub blocks.
Parent Case Info

This application claims priority to provisional application No. 63/017,479, filed Apr. 29, 2020, the contents of which are herein incorporated by reference.

Provisional Applications (1)
Number Date Country
63017479 Apr 2020 US