The present disclosure relates to machine learning, and more particularly to accelerating neural networks.
The emergence of Internet-of-Things (IoT) devices, which require edge computing with severe area/energy constraints, has garnered substantial interest in energy-efficient application-specific integrated circuit (ASIC) accelerators for deep learning applications. Automatic speech recognition (ASR) is one of the most prevalent tasks that allow such edge devices to interact with humans, and it has been integrated into many commercial edge devices.
Recurrent neural networks (RNNs) are very powerful for speech recognition because they combine two properties: 1) a distributed hidden state that allows them to store a large amount of information about the past efficiently and 2) non-linear dynamics that allow them to update their hidden state in complicated ways. Long short-term memory (LSTM) is a type of RNN with internal gates that scale the inputs and outputs within the cell. LSTM gates avoid the vanishing/exploding gradient issue that plagues plain RNNs, but they require 8× the weights of a multi-layer perceptron (MLP) with the same number of hidden neurons per layer.
Due to the large size of the LSTM RNNs that enable accurate ASR, most of these speech recognition tasks are performed in cloud servers, which requires a constant internet connection, involves privacy concerns, and incurs latency for speech recognition tasks. A particular challenge of performing on-device ASR is that state-of-the-art LSTM-based models for ASR contain tens of millions of weights. Weights can be stored on-chip (e.g., SRAM cache of mobile processors), which has fast access time (nanoseconds range) but is limited to a few megabytes (MBs) due to cost. Alternatively, weights can be stored off-chip (e.g., DRAM) up to a few gigabytes (GBs), but access is slower (tens of nanoseconds range) and consumes ˜100× higher energy than on-chip counterparts.
To improve the energy efficiency of neural network hardware, off-chip memory access and communication need to be minimized. To that end, it becomes crucial to store most or all weights on-chip through sparsity/compression, weight quantization, and network size reduction. Recent works presented methods to reduce the complexity and memory requirements of RNNs for ASR.
Hierarchical coarse-grain sparsity for deep neural networks is provided. An algorithm-hardware co-optimized memory compression technique is proposed to compress deep neural networks in a hardware-efficient manner, which is referred to herein as hierarchical coarse-grain sparsity (HCGS). HCGS provides a new long short-term memory (LSTM) training technique which enforces hierarchical structured sparsity by randomly dropping static block-wise connections between layers. HCGS maintains the same hierarchical structured sparsity throughout training and inference; this reduces weight storage for both training and inference hardware systems.
Aided by HCGS-based block-wise recursive weight compression, LSTM recurrent neural networks are demonstrated with up to 16× fewer weights while achieving minimal error rate degradation. The prototype chip fabricated in 65 nanometer (nm) low-power (LP) complementary metal-oxide-semiconductor (CMOS) achieves up to 8.93 tera-operations per second per watt (TOPS/W) for real-time speech recognition using compressed LSTMs based on HCGS. HCGS-based LSTMs have demonstrated energy-efficient speech recognition with low error rates for TIMIT, TED-LIUM, and LibriSpeech data sets.
An exemplary embodiment provides a method for compressing a neural network. The method includes randomly selecting a hierarchical structure of block-wise weights in the neural network and training the neural network by selecting a same number of random blocks for every block row.
Another exemplary embodiment provides a neural network accelerator. The neural network accelerator includes an input buffer, an output buffer, and a hierarchical coarse-grain sparsity selector configured to randomly select block-wise weights from the input buffer for training a neural network.
Those skilled in the art will appreciate the scope of the present disclosure and realize additional aspects thereof after reading the following detailed description of the preferred embodiments in association with the accompanying drawing figures.
The accompanying drawing figures incorporated in and forming a part of this specification illustrate several aspects of the disclosure, and together with the description serve to explain the principles of the disclosure.
The embodiments set forth below represent the necessary information to enable those skilled in the art to practice the embodiments and illustrate the best mode of practicing the embodiments. Upon reading the following description in light of the accompanying drawing figures, those skilled in the art will understand the concepts of the disclosure and will recognize applications of these concepts not particularly addressed herein. It should be understood that these concepts and applications fall within the scope of the disclosure and the accompanying claims. It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
It will be understood that when an element such as a layer, region, or substrate is referred to as being “on” or extending “onto” another element, it can be directly on or extend directly onto the other element or intervening elements may also be present. In contrast, when an element is referred to as being “directly on” or extending “directly onto” another element, there are no intervening elements present. Likewise, it will be understood that when an element such as a layer, region, or substrate is referred to as being “over” or extending “over” another element, it can be directly over or extend directly over the other element or intervening elements may also be present. In contrast, when an element is referred to as being “directly over” or extending “directly over” another element, there are no intervening elements present. It will also be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected” or “directly coupled” to another element, there are no intervening elements present.
Relative terms such as “below” or “above” or “upper” or “lower” or “horizontal” or “vertical” may be used herein to describe a relationship of one element, layer, or region to another element, layer, or region as illustrated in the Figures. It will be understood that these terms and those discussed above are intended to encompass different orientations of the device in addition to the orientation depicted in the Figures.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including” when used herein specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
“About” as used herein when referring to a measurable value such as an amount, a temporal duration, and the like, is meant to encompass variations of ±20%, ±10%, ±5%, ±1%, and ±0.1% from the specified value, as such variations are appropriate.
Throughout this disclosure, various aspects of the invention can be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 2.7, 3, 4, 5, 5.3, 6 and any whole and partial increments therebetween. This applies regardless of the breadth of the range.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms used herein should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Hierarchical coarse-grain sparsity for deep neural networks is provided. An algorithm-hardware co-optimized memory compression technique is proposed to compress deep neural networks in a hardware-efficient manner, which is referred to herein as hierarchical coarse-grain sparsity (HCGS). HCGS provides a new long short-term memory (LSTM) training technique which enforces hierarchical structured sparsity by randomly dropping static block-wise connections between layers. HCGS maintains the same hierarchical structured sparsity throughout training and inference; this reduces weight storage for both training and inference hardware systems.
Aided by HCGS-based block-wise recursive weight compression, LSTM recurrent neural networks are demonstrated with up to 16× fewer weights while achieving minimal error rate degradation. The prototype chip fabricated in 65 nanometer (nm) low-power (LP) complementary metal-oxide-semiconductor (CMOS) achieves up to 8.93 tera-operations per second per watt (TOPS/W) for real-time speech recognition using compressed LSTMs based on HCGS. HCGS-based LSTMs have demonstrated energy-efficient speech recognition with low error rates for TIMIT, TED-LIUM, and LibriSpeech data sets.
I. Introduction
Long short-term memory (LSTM) is a type of recurrent neural network (RNN) that is widely used for time-series data and speech applications due to its high accuracy on such tasks. However, LSTMs pose difficulties for efficient hardware implementation because they require a large amount of weight storage and exhibit high computational complexity. Prior works have proposed compression techniques to alleviate the storage/computation requirements of LSTMs. Magnitude-based pruning has shown large compression, but the index storage can be a large burden, especially for the simple coordinate (COO) format that stores each non-zero weight's location. The compressed sparse row/column (CSR/CSC) format reduces the index cost, as only the distance between non-zero elements in a row/column is stored, but it still requires non-negligible index memory and causes irregular memory access.
A new HCGS scheme is presented herein that structurally compresses LSTM weights by 16× with minimal error rate degradation.
An HCGS-based LSTM accelerator is prototyped in 65-nm LP CMOS, which executes two-/three-layer LSTMs for real-time speech recognition. It consumes 1.85/3.43/3.42 mW of power and achieves 8.93/7.22/7.24 TOPS/W for the TIMIT/TED-LIUM/LibriSpeech data sets, respectively. Contributions of this disclosure include the HCGS training technique for structured LSTM compression, the co-optimized accelerator architecture, and prototype chip measurements demonstrating energy-efficient real-time speech recognition.
Section II presents the proposed HCGS algorithm for LSTMs. Section III describes the HCGS-based LSTM accelerator architecture and chip design optimization. In Section IV, the prototype chip measurement results and comparison are presented.
II. LSTM and Hierarchical Coarse-Grain Sparsity
A. LSTM-Based Speech Recognition
LSTM RNNs have shown state-of-the-art accuracy for speech recognition tasks.
With x_t denoting the input vector, h_t the hidden (output) state, and c_t the cell state at time step t, an LSTM is defined as:
i_t = σ(W_xi x_t + W_hi h_{t−1} + b_i)  Equation 1
f_t = σ(W_xf x_t + W_hf h_{t−1} + b_f)  Equation 2
o_t = σ(W_xo x_t + W_ho h_{t−1} + b_o)  Equation 3
c̃_t = tanh(W_xc x_t + W_hc h_{t−1} + b_c)  Equation 4
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t  Equation 5
h_t = o_t ⊙ tanh(c_t)  Equation 6
where σ(⋅) represents the sigmoid function and ⊙ is the element-wise product. From the abovementioned LSTM equations, the weight memory requirement of LSTMs is 8× that of MLPs with the same number of neurons per layer. LSTM-based speech recognition typically consists of a pipeline of a pre-processing or feature extraction module, followed by an LSTM RNN engine and then by a Viterbi decoder. A commonly used feature representation for pre-processing of speech data is feature-space maximum likelihood linear regression (fMLLR). fMLLR features are extracted from Mel-frequency cepstral coefficient (MFCC) features, obtained conventionally from 25-ms windows of audio samples with 10-ms overlap between adjacent windows. The features for the current window are combined with those of past and future windows to provide the context of the input speech data. In an exemplary implementation, five past windows, one current window, and five future windows are merged to generate an input frame with 11 windows, leading to a total of 440 fMLLR features per frame. These merged sets of features become inputs to the ensuing LSTM RNN. The output layer of the LSTM consists of probability estimates that are conveyed to the subsequent Viterbi decoder module to determine the best sequence of phonemes/words.
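As an illustration of the context-window merging just described, the following sketch splices past and future windows into a single input frame. The 40-features-per-window figure is an assumption inferred from 440 features over 11 windows; the edge-clamping convention and array layout are likewise illustrative, not a description of the exact Kaldi recipe.

```python
import numpy as np

# Context-window merging sketch: 5 past + 1 current + 5 future windows.
# Assumption: 440 features / 11 windows => 40 fMLLR features per window.
FEATS_PER_WINDOW = 40
PAST, FUTURE = 5, 5

def splice_frame(fmllr, t):
    """Concatenate windows t-5 .. t+5 into one 440-dimensional input frame.

    fmllr: array of shape (num_windows, 40), one row per 25-ms window.
    Indices are clamped at the utterance edges (a common convention).
    """
    idx = np.clip(np.arange(t - PAST, t + FUTURE + 1), 0, len(fmllr) - 1)
    return fmllr[idx].reshape(-1)            # shape: (11 * 40,) = (440,)

# Example: 100 windows of dummy features -> one spliced frame for window 50.
features = np.random.randn(100, FEATS_PER_WINDOW).astype(np.float32)
frame = splice_frame(features, 50)
assert frame.shape == (440,)
```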
B. Hierarchical Coarse-Grain Sparsity (HCGS)
The hierarchical structure of block-wise weights is randomly selected before the RNN training process starts, and this pre-defined structured sparsity is maintained throughout the training and inference phases. A constraint is applied such that HCGS always selects the same number of random blocks for every block row, so that the compressed weight matrices retain a regular structure.
In HCGS, the connections between feed-forward layers and recurrent layers are dropped in a hierarchical and recursive block-wise manner; an example of this two-level block-wise selection is illustrated in the accompanying drawing figures.
The indices needed for HCGS networks are limited to the block-selection information at each hierarchy level, which is far smaller than the per-weight index storage required by fine-grained pruning formats such as COO or CSR/CSC.
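The following sketch shows how a two-level HCGS connection mask of this kind could be generated in software before training. The block sizes (64×64 and 8×8) and the keep ratio of 1/4 at each level (16× overall compression) are illustrative choices, not the only configuration supported.

```python
import numpy as np

def hcgs_mask(rows, cols, b1=64, b2=8, keep1=0.25, keep2=0.25, rng=None):
    """Generate a two-level HCGS binary connection mask.

    Level 1: within every row of b1-x-b1 blocks, keep the same number of
    randomly chosen blocks (keep1 fraction).  Level 2: inside every kept
    level-1 block, repeat the selection with b2-x-b2 sub-blocks (keep2).
    With keep1 = keep2 = 1/4 this yields 16x compression overall.
    """
    rng = np.random.default_rng(rng)
    mask = np.zeros((rows, cols), dtype=np.float32)
    n1 = cols // b1
    k1 = max(1, int(n1 * keep1))
    for br in range(rows // b1):                            # level-1 block rows
        for bc in rng.choice(n1, size=k1, replace=False):   # same count per row
            r0, c0 = br * b1, bc * b1
            n2 = b1 // b2
            k2 = max(1, int(n2 * keep2))
            for sr in range(n2):                            # level-2 block rows
                for sc in rng.choice(n2, size=k2, replace=False):
                    mask[r0 + sr*b2 : r0 + (sr+1)*b2,
                         c0 + sc*b2 : c0 + (sc+1)*b2] = 1.0
    return mask

mask = hcgs_mask(512, 512, rng=0)
print("kept fraction:", mask.mean())   # ~= 1/16 for 16x compression
```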
Algorithm 1 shows the computational changes required to incorporate HCGS in LSTM training. The binary connection mask is initialized for every layer of the feed-forward network (CW) and the recurrent network (CU), which forces the deleted weight connections to zero during the forward propagation. During back-propagation, the HCGS mask ensures that the deleted weights do not get updated and remain zero throughout training.
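A minimal sketch of this masked computation is shown below in PyTorch, consistent with the PyTorch-Kaldi training flow referenced later. The names CW and CU follow the text; the shapes and the random element-wise stand-in masks are illustrative, whereas in practice the block-structured masks generated as in the sketch above would be used.

```python
import torch

# Sketch of the masked computation in Algorithm 1.  CW and CU are the binary
# connection masks for the feed-forward and recurrent weights; here they are
# random element-wise stand-ins rather than block-structured HCGS masks.
hidden, inputs = 512, 440
W = torch.randn(4 * hidden, inputs, requires_grad=True)   # feed-forward weights
U = torch.randn(4 * hidden, hidden, requires_grad=True)   # recurrent weights
CW = (torch.rand(4 * hidden, inputs) < 1 / 16).float()
CU = (torch.rand(4 * hidden, hidden) < 1 / 16).float()

def masked_preactivations(x_t, h_prev):
    # Deleted connections are forced to zero in the forward pass ...
    return (W * CW) @ x_t + (U * CU) @ h_prev

x_t, h_prev = torch.randn(inputs), torch.randn(hidden)
loss = masked_preactivations(x_t, h_prev).sum()
loss.backward()
# ... and, because the masks are constant, the deleted weights receive zero
# gradient and therefore remain zero throughout training.
assert torch.all(W.grad[CW == 0] == 0)
```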
To further increase compression efficiency, weights associated with the four gates in each LSTM layer share a common, randomly selected connection mask. Sharing the same random mask results in a 4× reduction of the index memory and reduces the computations for decompression by 4× as well.
Compared to cases where different random masks were used for the four gates, sharing the same random mask did not affect the phoneme error rate (PER) or word error rate (WER) by more than 0.2% across all LSTM evaluations.
Three well-known benchmarks for speech recognition applications, TIMIT, TED-LIUM, and LibriSpeech, are used to train the proposed HCGS-based LSTMs and evaluate the corresponding error rates. The baseline three-layer, 512-cell LSTM RNN that performs speech recognition for the TED-LIUM/LibriSpeech data sets requires 24 MB of weight memory in floating-point precision. Aided by the proposed HCGS that reduces the number of weights by 16× and a low-precision (6-bit) representation of weights, the compressed parameters of a three-layer, 512-cell LSTM RNN are reduced to only 288 kB (83× reduction in model size compared with 24 MB). The resultant LSTM network can be fully stored on-chip, which enables energy-efficient acceleration without costly DRAM access.
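The quoted memory figures can be reproduced with straightforward arithmetic, assuming a 440-dimensional input frame (as in Section II-A) and 32-bit floating-point weights for the uncompressed baseline:

```python
# Arithmetic behind the quoted memory figures (assumptions: 440-dimensional
# input frames and 32-bit floating-point weights for the uncompressed baseline).
hidden, inputs, layers = 512, 440, 3

weights = 0
for layer in range(layers):
    in_dim = inputs if layer == 0 else hidden
    weights += 4 * hidden * in_dim       # W_x* for the four gates
    weights += 4 * hidden * hidden       # W_h* for the four gates

print(weights)                                      # 6,144,000 weights
print(weights * 4 / 1e6, "MB in float32")           # ~24.6 MB  ("24 MB")
print(weights / 16 * 6 / 8 / 1e3, "kB compressed")  # 288.0 kB after 16x HCGS + 6-bit
```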
C. HCGS-Based Training
LSTM RNNs are trained by minimizing the cross-entropy error, as described in the following equation:
E = −Σ_{i=1}^{N} t_i × ln(y_i)  Equation 7
where N is the size of the output layer, y_i is the ith output node, and t_i is the ith target value or label. The mini-batch stochastic gradient method is used to train the network. The change in weight for each iteration is the differential of the cost function with respect to the weight value, as follows:
ΔW_ij = ∂E/∂W_ij  Equation 8
The weight Wij in the (k+1)th iteration is updated using the following equation:
(W_ij)_{k+1} = (W_ij)_k + {(ΔW_ij)_k + m × (ΔW_ij)_{k−1}} × lr × C_ij  Equation 9
where m is the momentum, lr is the learning rate, and C_ij is the binary connection coefficient between two subsequent neural network layers, introduced for the proposed HCGS; only the weights in the network corresponding to C_ij = 1 are updated.
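A NumPy sketch of the masked update of Equation 9 is shown below. The learning rate, momentum value, matrix size, and element-wise stand-in mask are illustrative; the sign convention follows Equation 9 as written.

```python
import numpy as np

# Sketch of the HCGS-masked momentum update of Equation 9.
def hcgs_update(W, dW, dW_prev, C, lr=0.01, m=0.9):
    """One update step: only weights with C == 1 are allowed to change."""
    return W + (dW + m * dW_prev) * lr * C

rng = np.random.default_rng(0)
C = (rng.random((512, 440)) < 1 / 16).astype(np.float32)    # binary connection mask
W = rng.standard_normal((512, 440)).astype(np.float32) * C  # deleted weights start at 0
dW = rng.standard_normal(W.shape).astype(np.float32)        # current gradient term
dW_prev = rng.standard_normal(W.shape).astype(np.float32)   # previous iteration's term

W = hcgs_update(W, dW, dW_prev, C)
assert np.all(W[C == 0] == 0)        # deleted weights never move
```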
D. Design Space Exploration
There are several important design parameters for HCGS-based LSTM hardware design, including activation/weight precision, HCGS compression ratio, the number of CGS levels, and width of LSTM RNN (i.e., the number of LSTM cells in each layer).
Compared with single-level CGS, the two-level HCGS scheme shows a favorable tradeoff between weight compression and PER of a two-layer LSTM RNN for the TIMIT data set.
Overall, 512-cell LSTMs show a good balance between error rate (compared with 256-cell LSTMs) and memory (compared with 1024-cell LSTMs) for various HCGS evaluations. Based on these results, the 512-cell LSTM and two-level HCGS with 16× compression are selected as the hardware design point.
E. Robustness Across Random Block Selection and Further Minimization of Index Memory
Evaluations show similar PER and WER values whether the same or different random block assignments are used for the four LSTM gates. Compared with cases of using different random block selections, sharing the same random block selection for the four gates did not affect PER or WER by more than 0.2% across all LSTM evaluations.
Based on this result, to further increase the compression efficiency, the same random block selection is employed for weights associated with the four gates in each LSTM layer. This sharing reduces both the index memory and the decompression computations by 4×.
F. Guided Coarse-Grain Sparsity (Guided-CGS)
To benchmark the proposed pre-determined random sparsity against variants of learned sparsity methods, a guided block-wise sparsity method, referred to as guided coarse-grain sparsity (Guided-CGS), is introduced. Unlike HCGS, where the blocks are chosen randomly, Guided-CGS implements a magnitude-based selection criterion to select the blocks that contain the largest absolute mean weights, and the unselected blocks are set to zero. The magnitude-based selection is executed after one epoch of training with group Lasso. This method ensures that the weight block selection is done through group-Lasso-based optimization instead of being randomly chosen.
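The following sketch illustrates the magnitude-based block selection step of Guided-CGS under simple assumptions (single-level 64×64 blocks, two kept blocks per block row); the preceding group-Lasso training epoch is omitted.

```python
import numpy as np

# Guided-CGS block selection sketch: keep, in every block row, the blocks with
# the largest mean absolute weight (single-level 64x64 blocks, 2 kept of 8).
def guided_cgs_mask(W, block=64, keep_per_row=2):
    rows, cols = W.shape
    mask = np.zeros_like(W)
    for br in range(rows // block):
        r0 = br * block
        scores = [np.abs(W[r0:r0 + block, bc * block:(bc + 1) * block]).mean()
                  for bc in range(cols // block)]          # per-block magnitude
        for bc in np.argsort(scores)[-keep_per_row:]:      # largest blocks win
            mask[r0:r0 + block, bc * block:(bc + 1) * block] = 1.0
    return mask

W = np.random.randn(512, 512)         # e.g., weights after one group-Lasso epoch
mask = guided_cgs_mask(W)
print("kept fraction:", mask.mean())  # 2 of 8 blocks per block row -> 0.25 (4x)
```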
G. Quantizing LSTM Networks
To achieve high accuracy with very low-precision quantization, weights of the DNN are quantized during training. The in-training quantization jointly optimizes block-wise sparsity and low-precision quantization. During the forward propagation part of the LSTM training, each weight is quantized to n bits, while the backward propagation part employs full-precision weights. This way, the network is optimized to minimize the cost function with n-bit precision weights. The n-bit quantized weights are represented in Equation 10 and steps to make quantized copies of the full-precision weights are shown in Algorithm 2.
W_q = Quantization(W, n)  Equation 10
The parameter update section in Algorithm 1 is adapted to include the process of updating the batch normalization parameters. Back-propagation through time (BPTT) is used to compute the gradients by minimizing the cost function using the quantized weights W_q.
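A common way to realize this forward-quantized/backward-full-precision scheme is a straight-through estimator, sketched below. The uniform symmetric quantizer is an assumption; the exact Quantization(W, n) procedure of Algorithm 2 may differ.

```python
import torch

# Straight-through-estimator sketch of in-training quantization: the forward
# pass uses the n-bit copy W_q, while gradients are applied to the
# full-precision weights.  The uniform symmetric quantizer is an assumption.
def quantize(w, n_bits=6):
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    wq = torch.round(w / scale).clamp(-qmax, qmax) * scale
    return w + (wq - w).detach()      # forward: wq, backward: identity

W = torch.randn(2048, 440, requires_grad=True)   # one layer's gate weights
x = torch.randn(440)
loss = (quantize(W, n_bits=6) @ x).sum()
loss.backward()    # W.grad flows as if the quantizer were the identity
```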
III. Architecture and Design Optimizations
A. Hardware Architecture
1. HCGS Selector
The HCGS selector uses the stored hierarchical block-selection indices to pass only the activations corresponding to the selected (non-zero) weight blocks on to the MAC unit.
The selection input for the HCGS selector is a 48-bit vector, where 16 bits correspond to the first level of selection, and the remaining 32 bits are used for the second level. The selector supports block sizes ranging from 128×128 to 32×32 for the first level and from 16×16 to 4×4 for the second level. This wide range of block sizes allows for flexibility to map arbitrary LSTM networks trained with HCGS onto the accelerator chip using different configurations.
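Behaviorally, the selector's job is to gather only the activations that the kept weight blocks need, so the MAC unit operates purely on compressed data. The sketch below models this function in software; it does not reflect the chip's bit-level 48-bit encoding, and for brevity it reuses the same level-2 selection for every kept level-1 block.

```python
import numpy as np

# Behavioral model of the HCGS selector: gather only the activations needed by
# the kept weight blocks so the MAC unit works on compressed data.
def select_activations(x, kept_l1, kept_l2, b1=64, b2=8):
    """x: full activation vector; kept_l1/kept_l2: kept column-block indices."""
    picked = []
    for c1 in kept_l1:                      # level-1 blocks (e.g., 64 wide)
        base = c1 * b1
        for c2 in kept_l2:                  # level-2 sub-blocks (e.g., 8 wide)
            picked.append(x[base + c2 * b2: base + (c2 + 1) * b2])
    return np.concatenate(picked)

x = np.arange(512, dtype=np.float32)
compressed = select_activations(x, kept_l1=[1, 5], kept_l2=[0, 3])
print(compressed.shape)    # (32,) -> 1/16 of the 512 activations
```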
2. Input and Output Buffers
An input frame consists of fMLLR features, as described in Section II-A. The input buffer is used to store the fMLLR features of an input frame, which stream in 13 bits each cycle over 512 cycles. The input buffer is essential for the continuous computation of the LSTM output, as it enables the subsequent input frame to be ready for use as soon as the current frame computation is complete. This buffer ensures that no stall is required to stream in consecutive frames of the real-time speech input. The serial-in/parallel-out input buffer takes in 13-bit inputs sequentially and outputs all 6656 bits in parallel. The output buffer consists of two identical buffers for double buffering, which enables continuous computation of the LSTM accelerator in conjunction with the input buffer while streaming out the final layer outputs. Each output buffer employs an HCGS selector and a 6656:416 multiplexer to feed back the current layer output to the next layer. The feedback path from the output buffer to the input of the multiply-accumulate (MAC) unit facilitates the reuse of the MAC unit. Each output buffer takes in a 13-bit LSTM cell output, and the correct buffer is chosen by the finite-state machine (FSM), which keeps track of whether each buffer is full or ready to stream data out of the chip. Finally, a multiplexer is used to decide whether the x_t input should come from the input buffer or the output buffer; this is done through the FSM, which uses the frame-complete flag to switch between the two buffers.
3. H-Buffer and C-Buffer
The H-buffer and C-buffer are rolling buffers that store the outputs of the previous frame (h_{t−1}) and the cell state (c_{t−1}) for each LSTM layer, respectively. Each buffer has three internal registers corresponding to the maximum number of layers supported by the hardware. The C-buffer registers behave as shift registers, while the H-buffer registers operate similarly to the input buffer, where inputs are streamed in serially and outputs are streamed out in parallel.
4. MAC Unit
The MAC unit consists of 64 parallel MACs (computing vector-matrix multiplications) and the LSTM gate computation module (computing intermediate LSTM gate and final output values), which can perform 129 (=64×2+1) compressed operations, equivalent to 2064 (=129×16) uncompressed operations, effectively in each cycle, aided by the proposed HCGS compression by 16×. The non-linear activation functions of sigmoid and hyperbolic tangent (tanh) are implemented with piecewise linear (PWL) modules using 20 linear segments that exhibit a maximum relative error [(PWL_output−ideal_output)/ideal_output] of 1.67×10−3 and an average relative error of 3.30×10−4.
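For illustration, a 20-segment PWL approximation of tanh can be constructed as below. Uniform segment placement over [−4, 4] is assumed here; the chip's segment boundaries are not specified and are evidently better optimized, since this uniform sketch yields a larger maximum relative error than the reported figures.

```python
import numpy as np

# 20-segment piecewise-linear tanh, with uniform segments over [-4, 4] and
# saturation outside.  The chip's segment placement is not given here, so the
# errors printed below are for this sketch only, not the reported 1.67e-3.
SEGMENTS = 20
edges = np.linspace(-4.0, 4.0, SEGMENTS + 1)
slopes = (np.tanh(edges[1:]) - np.tanh(edges[:-1])) / (edges[1:] - edges[:-1])
intercepts = np.tanh(edges[:-1]) - slopes * edges[:-1]

def pwl_tanh(x):
    x = np.clip(x, -4.0, 4.0)
    seg = np.minimum(np.searchsorted(edges, x, side="right") - 1, SEGMENTS - 1)
    return slopes[seg] * x + intercepts[seg]

x = np.linspace(-4.0, 4.0, 10000)            # grid that avoids x = 0 exactly
rel_err = np.abs((pwl_tanh(x) - np.tanh(x)) / np.tanh(x))
print("max / mean relative error:", rel_err.max(), rel_err.mean())
```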
5. Weight/Bias Memory
As described in Section III-B, weights are stored in an interleaved fashion, where each memory sub-bank (W1-W3) stores weights corresponding to a single layer. Since all weights of the two-/three-layer RNNs can be loaded on-chip initially, write operations are not needed for the LSTM accelerator during inference operations. The required read bandwidth of the LSTM accelerator is 192 bits/cycle from memory bank 0 and 192 bits/cycle from memory bank 1.
The memory sub-banks that store weights of layers not currently being computed are put into a “selective precharge” mode, which clamps the wordlines to a low value (0 V) and floats the bitlines for leakage power reduction. Getting into and out of this selective precharge mode each adds a small overhead of one extra cycle. Moreover, due to the nature of the LSTM, each weight in the memory sub-banks is used only once, which keeps the number of transitions between selective precharge mode and normal mode for each sub-bank minimal. Overall, adding the selective precharge mode resulted in a 19% energy-efficiency improvement at the system level for the LSTM accelerator.
B. Interleaved Memory Storage
Each LSTM cell computes its four gate values (Equations 1-4) from the same x_t and h_{t−1} inputs. To enable this computation efficiently, each row of the four matrices W_xi, W_xf, W_xo, and W_xc is stored in a staggered manner (same for W_h*) in on-chip SRAM arrays.
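A software sketch of this staggered layout is shown below: rows of the four gate matrices are packed round-robin so that streaming consecutive memory rows delivers the i/f/o/c weights of one LSTM cell back-to-back. Word widths and the mapping onto the two physical memory banks are illustrative only.

```python
import numpy as np

# Staggered (interleaved) packing of the four gate matrices: row r of each of
# W_xi, W_xf, W_xo, W_xc lands at rows 4r..4r+3, so reading four consecutive
# memory rows yields all gate weights for one LSTM cell.
hidden, inputs = 512, 440
W_xi, W_xf, W_xo, W_xc = (np.random.randn(hidden, inputs) for _ in range(4))

interleaved = np.empty((4 * hidden, inputs), dtype=W_xi.dtype)
interleaved[0::4] = W_xi
interleaved[1::4] = W_xf
interleaved[2::4] = W_xo
interleaved[3::4] = W_xc

r = 7                                  # any cell index
assert np.array_equal(interleaved[4 * r + 1], W_xf[r])
```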
C. End-to-End Operation and Latency
Since all weights of the target LSTM networks with HCGS compression are stored on-chip, there is no need for off-chip DRAM communication, and the chip performs the end-to-end operation of the entire LSTM in a pipelined fashion. An initial delay of 512 cycles is consumed to load the input buffer. Once the input buffer is filled, each LSTM state computation takes three cycles, one for MAC, one for addition, and one for activation, all of which are pipelined. The first neuron output takes a total of nine cycles, after which a new neuron output is obtained every cycle. The outputs of the current layer are stored in the output buffer. Once the output buffer is full, if the current layer is an intermediate layer, the output is conveyed directly to the input of the next layer; if the current layer is the last layer of the LSTM, the output data is streamed out of the chip over 512 cycles.
D. Computing Device
In some aspects of the present invention, software executing the instructions provided herein may be stored on a non-transitory computer-readable medium, wherein the software performs some or all of the steps of the present invention when executed on a processor.
Aspects of the invention relate to algorithms executed in computer software. Though certain embodiments may be described as written in particular programming languages, or executed on particular operating systems or computing platforms, it is understood that the system and method of the present invention is not limited to any particular computing language, platform, or combination thereof. Software executing the algorithms described herein may be written in any programming language known in the art, compiled or interpreted, including but not limited to C, C++, C#, Objective-C, Java, JavaScript, MATLAB, Python, PHP, Perl, Ruby, or Visual Basic. It is further understood that elements of the present invention may be executed on any acceptable computing platform, including but not limited to a server, a cloud instance, a workstation, a thin client, a mobile device, an embedded microcontroller, a television, or any other suitable computing device known in the art.
Parts of this invention are described as software running on a computing device. Though software described herein may be disclosed as operating on one particular computing device (e.g. a dedicated server or a workstation), it is understood in the art that software is intrinsically portable and that most software running on a dedicated server may also be run, for the purposes of the present invention, on any of a wide range of devices including desktop or mobile devices, laptops, tablets, smartphones, watches, wearable electronics or other wireless digital/cellular phones, televisions, cloud instances, embedded microcontrollers, thin client devices, or any other suitable computing device known in the art. Similarly, parts of this invention are described as communicating over a variety of wireless or wired computer networks. For the purposes of this invention, the words “network”, “networked”, and “networking” are understood to encompass wired Ethernet, fiber optic connections, wireless connections including any of the various 802.11 standards, cellular WAN infrastructures such as 3G, 4G/LTE, or 5G networks, Bluetooth®, Bluetooth® Low Energy (BLE) or Zigbee® communication links, or any other method by which one electronic device is capable of communicating with another. In some embodiments, elements of the networked portion of the invention may be implemented over a Virtual Private Network (VPN).
Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
By way of example, and not to be limiting, computer-readable media may comprise computer storage media. Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer.
According to various embodiments of the invention, the computer 2200 may operate in a networked environment using logical connections to remote computers through a network 2240, such as a TCP/IP network (e.g., the Internet or an intranet). The computer 2200 may connect to the network 2240 through a network interface unit 2245 connected to the bus 2235. It should be appreciated that the network interface unit 2245 may also be utilized to connect to other types of networks and remote computer systems.
The computer 2200 may also include an input/output controller 2255 for receiving and processing input from a number of input/output devices 2260, including a keyboard, a mouse, a touchscreen, a camera, a microphone, a controller, a joystick, or other type of input device. Similarly, the input/output controller 2255 may provide output to a display screen, a printer, a speaker, or other type of output device. The computer 2200 can connect to the input/output device 2260 via a wired connection including, but not limited to, fiber optic, Ethernet, or copper wire or wireless means including, but not limited to, Wi-Fi, Bluetooth, Near-Field Communication (NFC), infrared, or other suitable wired or wireless connections.
As mentioned briefly above, a number of program modules and data files may be stored in the storage device 2220 and/or RAM 2210 of the computer 2200, including an operating system 2225 suitable for controlling the operation of a networked computer. The storage device 2220 and RAM 2210 may also store one or more applications/programs 2230. In particular, the storage device 2220 and RAM 2210 may store an application/program 2230 for providing a variety of functionalities to a user. For instance, the application/program 2230 may comprise many types of programs such as a word processing application, a spreadsheet application, a desktop publishing application, a database application, a gaming application, internet browsing application, electronic mail application, messaging application, and the like. According to an embodiment of the present invention, the application/program 2230 comprises a multiple functionality software application for providing word processing functionality, slide presentation functionality, spreadsheet functionality, database functionality and the like.
The computer 2200 in some embodiments can include a variety of sensors 2265 for monitoring the environment surrounding and the environment internal to the computer 2200. These sensors 2265 can include a Global Positioning System (GPS) sensor, a photosensitive sensor, a gyroscope, a magnetometer, thermometer, a proximity sensor, an accelerometer, a microphone, biometric sensor, barometer, humidity sensor, radiation sensor, or any other suitable sensor.
IV. Evaluation Results
The invention is further described in detail by reference to the following experimental examples. These examples are provided for purposes of illustration only, and are not intended to be limiting unless otherwise specified. Thus, the invention should in no way be construed as being limited to the following examples, but rather, should be construed to encompass any and all variations which become evident as a result of the teaching provided herein.
Without further description, it is believed that one of ordinary skill in the art can, using the preceding description and the following illustrative examples, make and utilize the system and method of the present invention. The following working examples therefore, specifically point out the exemplary embodiments of the present invention, and are not to be construed as limiting in any way the remainder of the disclosure.
A. Pre-/Post-Processing Operations
The pre-processing steps are performed in the Kaldi framework using audio files from the TIMIT, TED-LIUM, and LibriSpeech data sets. The same extracted input features were used for LSTM training and for real-time inference based on HCGS compression. With LSTM outputs streamed out of the chip, post-processing is also performed using the Kaldi framework to obtain the final error rates for speech recognition. When the 512 outputs (13-bit each) per frame are received from the chip output, the hidden Markov model (HMM) states are calculated using a weighted finite-state transducer (WFST) that performs a Viterbi beam search, finally obtaining the phoneme error rate (for the TIMIT data set) or WER (for the TED-LIUM/LibriSpeech data sets).
B. Performance, Energy, and Error Rate Measurements
The measured MAC efficiency is high because each of the layers in the two-/three-layer target RNNs has a regular structure to start with, and because HCGS compression still maintains a regular structure, given that the same number of blocks are pruned/kept in each block row.
Measured accuracy results of 20.6% PER are achieved for TIMIT, 21.3% WER for TED-LIUM, and 11.4% WER for LibriSpeech data sets.
C. Comparison to Prior LSTM/RNN Works
Table I shows a detailed comparison with prior ASIC and FPGA hardware designs for RNNs. Compared with the RNN ASIC works of [12] (F. Conti, L. Cavigelli, G. Paulin, I. Susmelj, and L. Benini, “Chipmunk: A systolically scalable 0.9 mm2, 3.08 Gop/s/mW 1.2 mW accelerator for near-sensor recurrent neural network inference,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), April 2018, pp. 1-4) and [13] (S. Yin et al., “A 1.06-to-5.09 TOPS/W reconfigurable hybrid-neural-network processor for deep learning applications,” in Proc. Symp. VLSI Circuits, June 2017, pp. 26-27), this work shows 2.90× and 1.75× higher energy efficiency (TOPS/W), respectively. Reference [11] (J. Yue et al., “A 65 nm 0.39-to-140.3 TOPS/W 1-to-12b unified neural network processor using block-circulant-enabled transpose-domain acceleration with 8.1× higher TOPS/mm2 and 6T HBSTTRAM-based 2D data-reuse architecture,” in IEEE ISSCC Dig. Tech. Papers, February 2019, pp. 138-140) presented higher TOPS/W, but the end-to-end latency or FPS was not reported. Moreover, only a simpler TIMIT data set has been benchmarked (while embodiments described herein are also benchmarked against more complex TED-LIUM and LibriSpeech data sets), and the absolute TIMIT PER has not been shown.
D. Speech Recognition Evaluation Setup
For the speech recognition tasks, the input consists of 440 feature space maximum likelihood linear regression (fMLLR) features that are extracted using the s5 recipe of Kaldi. The fMLLR features were computed using a time window of 25 ms with an overlap of 10 ms. The PyTorch-Kaldi speech recognition toolkit is used to train the LSTM networks. The final LSTM layer generates the acoustic posterior probabilities, which are normalized by their prior and then conveyed to a hidden Markov model (HMM) based decoder. An n-gram language model derived from the language probabilities is merged with the acoustic scores by the decoder. A beam search algorithm is then used to retrieve the sequence of words uttered in the speech signal. The final error rates for TIMIT and TED-LIUM corpora are computed with the NIST SCTK scoring toolkit.
For TIMIT, the phoneme recognition task (aligned with the Kaldi s5 recipe) is considered, and two-layer unidirectional LSTMs are trained with 256, 512, and 1,024 cells per layer. For TED-LIUM, the word recognition task (aligned with the Kaldi s5 recipe) is targeted, and three-layer unidirectional LSTMs are trained with 256, 512, and 1,024 cells per layer. All possible combinations of power-of-2 block sizes are evaluated, and the PER for TIMIT and WER for TED-LIUM remain relatively constant, showing the robustness of HCGS across different block sizes.
E. Improvements Due to HCGS
The hierarchical sparsity leads to the improved accuracy of the networks. Sparse weights with fine granularity tend to form a uniform sparsity distribution even within smaller regions of the weight matrix. This property will lead to extremely sporadic and isolated connections when the target compression rate is high. However, the grouping of sparse weights within the hierarchical structure of HCGS allows densely connected regions to be formed even when the target compression rate is high. As two-tier HCGS outperforms single-tier CGS in terms of accuracy and three-tier HCGS leads to marginal/worse performance than two-tier HCGS, the reported results in Sections IV-F and IV-G focus on LSTM training with two-tier HCGS.
F. LSTM Results for TIMIT
G. LSTM Results for TED-LIUM
A prior work reported that wider CNNs can tolerate lower activation/weight precision for the same or even better accuracy. However, evaluations of embodiments described herein do not show such trends with LSTMs for TIMIT or TED-LIUM. Especially when combined with structured compression, LSTMs are more sensitive to low-precision quantization, so LSTMs with medium (e.g., 6-bit) precision show the best trade-off between PER/WER and weight memory compression.
H. Comparison with Learned Sparsity and Prior Works
Single-tier Guided-CGS shows better PER than HCGS for compression ratios up to 4×, but PER worsens substantially for larger compression ratios. This sharp increase in PER is observed for group Lasso, L1, and magnitude pruning (MP) schemes as well, and can be attributed to the congestion of selected groups in small regions of the weight matrices caused by the regularization function. The pre-determined random sparsity in HCGS ensures that such congestion is avoided when selecting blocks within weight matrices, resulting in a much more graceful PER degradation for large (>4×) compression ratios. The effectiveness of random pruning has also been demonstrated in prior work, where a pruned DNN recovered the accuracy loss by fine-tuning the remaining weights.
Those skilled in the art will recognize improvements and modifications to the preferred embodiments of the present disclosure. All such improvements and modifications are considered within the scope of the concepts disclosed herein and the claims that follow.
The disclosures of each and every patent, patent application, and publication cited herein are hereby incorporated herein by reference in their entirety. While this invention has been disclosed with reference to specific embodiments, it is apparent that other embodiments and variations of this invention may be devised by others skilled in the art without departing from the true spirit and scope of the invention. The appended claims are intended to be construed to include all such embodiments and equivalent variations.
This application claims priority to U.S. Provisional Application No. 63/257,011, filed on Oct. 18, 2021, incorporated herein by reference in its entirety.