This disclosure relates to automatic speech recognition (ASR) and, more particularly, to ASR systems with low power consumption.
Since the late 1980s, there have been sporadic efforts to exploit special-purpose hardware for speech applications. Application-specific integrated circuits (ASICs) for hidden Markov Model (HMM) based speech recognition were reported as early as 1991, followed by power-conscious implementations intended for portable devices. The “In Silico Vox” project created sophisticated hardware ports of Sphinx-3, a well-established software framework for ASR.
More specialized research efforts have examined changes in ASR algorithms and their applicability to hardware. One area of evolution has been the use of weighted finite-state transducers (WFSTs), which are a component of the modern ASR framework. WFSTs are now commonplace in software speech recognizers, including commercial systems. Perhaps due to long development cycles, most of the hardware implementations realized so far are based on non-WFST software decoders. Digital circuits using WFSTs were presented in 2008, contributing the important observation that pre-processing the WFST could eliminate the need to consider unlabeled arcs recursively during each frame.
Another focus area in ASR research has been acoustic modeling. Previous implementations of ASR hardware use Gaussian mixture models (GMMs) with varying levels of complexity (number of tied states and number of Gaussian components per distribution). To speed up computation, designers have performed frame skipping and Gaussian selection and used fixed-point parameters. Meanwhile, deep neural networks (DNNs) and other neural network architectures have become popular for ASR due to their improved accuracy. The circuits community has studied DNNs and developed efficient implementations, primarily for computer vision.
Other systems incorporate additional architectural features, such as choosing how to divide hardware resources between acoustic modeling and search, or two-pass decoding with the first pass performed by hardware. Code generation tools can help rapidly evaluate different accelerator kernels specified in a high-level language. Acoustic models generally see large efficiency gains from acceleration, and front-ends generally do not, but there is no consensus on the proper hardware/software partitioning for search; this seems to depend on the application.
The wide range of algorithms available for voice activity detection (VAD) also provides opportunities to apply special-purpose architectures. Power can be minimized through a variety of techniques, whether minimum-energy operation in deeply scaled CMOS, or mixed-signal design with adaptive feature granularity. These efforts have brought VAD power consumption down to a few micro-Watts.
Other efforts in ASR have continued expanding decoder capabilities for general-purpose transcription applications with a 60,000 word (or larger) vocabulary, for example, achieving 127× faster than real-time throughput using a large ASIC clocked at 556 MHz. One system bridged the gap between high-performance and low-power applications, applying a series of optimizations to a Japanese-language system called Julius in order to obtain just 54 mW of power consumption and 82 MB/s of memory bandwidth during real-time decoding. Also, techniques of WFST pruning and search have been developed that are relevant to low-power ASR applications.
Voice recognition has become ubiquitous in mobile computing, which requires low power consumption to extend battery life. The automatic speech recognition (ASR) system described below may be capable of achieving real-world power savings of 90 to 99 percent relative to existing technologies.
The ASR system may include a continuous speech recognizer based on hidden Markov models. It may transcribe an audio input of arbitrary length into a sentence. The transition model may comprise WFSTs. In one test case, the ASR system had a vocabulary of 145,000 words and required under 8 milli-Watts for operation. In another test case, the ASR system used a vocabulary of eleven words and required 172 micro-Watts for operation.
In an embodiment, a method of automatic speech recognition includes: evaluating a deep neural network (DNN) acoustic model across multiple frames in parallel during a single transfer of the DNN parameters; training the DNN with layers having a comparable number of nodes; compressing the weight matrix for each DNN layer with a scalar quantizer; using on-chip caching to reduce external memory accesses for search graph (WFST) parameters; constructing an on-chip lattice of states which represents word hypotheses; and performing an on-chip conversion to a word lattice structure which is much smaller.
In another embodiment, in an automatic speech recognition (ASR) system having an external memory, a method of automatic speech recognition includes evaluating acoustic model deep neural network (DNN) parameters across multiple frames in parallel during a single transfer of the DNN parameters; training the DNN with layers having a comparable number of nodes; compressing the weight matrix for each DNN layer with a scalar quantizer; using on-chip caching to reduce external memory accesses for weighted finite-state transducer (WFST) parameters by: storing the WFST parameters in an off-chip memory; populating an on-chip state list by fetching at least some of the WFST parameters from the off-chip memory; constructing an on-chip state lattice which represents word hypotheses having one or more word arcs by performing a Viterbi search on the state list; and generating an on-chip word lattice from the on-chip state lattice, the word lattice having at least one word arc representing a chain of one or more of the word arcs in the state lattice. The on-chip state lattice and the on-chip word lattice act as an on-chip cache to reduce reading and writing to the off-chip memory.
In another embodiment, a decoder comprises a feature extraction circuit for calculating one or more feature vectors; an acoustic model circuit coupled to receive one or more feature vectors from said feature extraction circuit and assign one or more likelihood values to the one or more feature vectors; an off-chip memory for storing states of transition of the decoder; an on-chip state lattice populated with at least one of the states of transition stored in the off-chip memory, which represents word hypotheses and has one or more word arcs formed by a search circuit performing a Viterbi search on a state list; and an on-chip word lattice generated by taking a snapshot of the on-chip state lattice, the word lattice having at least one word arc representing a chain of one or more of the word arcs in the state lattice. The on-chip state lattice and the on-chip word lattice act as an on-chip cache to reduce reading and writing to the off-chip memory.
The foregoing features may be more fully understood from the following description of the drawings. The drawings aid in explaining and understanding the disclosed technology. Since it is often impractical or impossible to illustrate and describe every possible embodiment, the provided figures depict one or more exemplary embodiments. Accordingly, the figures are not intended to limit the scope of the invention. Like numbers in the figures denote like elements.
ASR system 100 may include a modeling module 108 that provides model data for ASR system 100. Modeling module 108 may include acoustic models 116, lexical models 118, and/or language models 120. ASR system 100 may also include a compression module 130 to compress data, such as a DNN weight matrix for example, from the models 116, 118, and/or 120. In embodiments, the compression module may compress the data using a scalar quantizer, which may have a precision range of 6-10 bits. Those of ordinary skill in the art will appreciate how to determine the precision required to meet the needs of a particular application. The DNN may also be trained with sparse weights for further memory savings. Additionally, the DNN may be trained where the layers of the DNN each have a comparable number of nodes to maximize on-chip memory utilization.
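By way of illustration only, the following sketch shows one way a scalar quantizer of the kind described above could compress a DNN weight matrix into a small table of levels plus a low-precision index per weight; the function names, the uniform-level choice, and the 8-bit setting are assumptions for the example and are not details of any embodiment.

```python
import numpy as np

def scalar_quantize(weights, bits=8):
    """Compress a weight matrix to 2**bits scalar levels (illustrative sketch).

    Returns a small table of levels and one index per weight; the matrix is
    later approximated as levels[indices].  A trained (e.g., k-means)
    quantizer could replace the uniform levels without changing the interface.
    """
    num_levels = 2 ** bits
    lo, hi = float(weights.min()), float(weights.max())
    levels = np.linspace(lo, hi, num_levels, dtype=np.float32)
    indices = np.clip(np.round((weights - lo) / (hi - lo) * (num_levels - 1)),
                      0, num_levels - 1).astype(np.uint16)
    return levels, indices

def dequantize(levels, indices):
    """Reconstruct an approximate weight matrix by table lookup."""
    return levels[indices]

# Example: a 512x512 layer stored as 8-bit indices plus a 256-entry level table.
w = np.random.randn(512, 512).astype(np.float32)
levels, idx = scalar_quantize(w, bits=8)
print(float(np.abs(w - dequantize(levels, idx)).max()))  # small reconstruction error
```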
ASR system 100 may include an ASR module 110 (which may be an ASR decoding module), having a front-end module 124, acoustic model module 126 (which may be an acoustic model evaluation module), and search module 128 to process speech signal 122 from microphone 102 (or from another type of audio source).
One or more of the modules and components of ASR system 100 may be coupled to and decoupled from power through the operation of an automated power gate 114. In the example embodiment of
ASR system 100 may also include a voice activity detection circuit 112. Voice activity detection circuit may be coupled to receive speech signal 122. In response to receiving speech signal 122, voice activity detection circuit 112 may control power gate 114. For example, if voice activity detection circuit 112 does not detect speech or audio for a predetermined amount of time, it may cause power gate 114 to disconnect ASR module 110 from power to reduce power consumption. When voice activity detection circuit 112 detects speech or audio, it may cause power gate 114 to couple ASR module 110 to power to process the speech or audio.
ASR system 100 may also include memory 130. Memory 130 may be shared between model module 108 and ASR module 110, and may act as a storage area for data communicated between model module 108 and ASR module 110 and for data produced by ASR module 110 during operation.
In embodiments ASR system 100 may be implemented as, or as part of, a signal processing integrated circuit (IC). The IC may provide low power consumption. In one implementation, the IC may include ASR module 110 and/or voice activity detection module 112, while memory 130 and model module 108 are external to the IC.
Front-end module 124 may transform audio (e.g. 16-bit samples at 8 kHz or 16 kHz) into a lower-rate feature representation. For example, front-end module 124 may produce feature vectors 132, which may be based on mel-frequency cepstral coefficients (MFCCs). In one embodiment, the MFCC features may be based on 25 ms frames of audio with 10 ms intervals between frames.
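For illustration, the framing described above (25 ms windows advanced every 10 ms) can be sketched as follows; the 16 kHz rate and the helper name are assumptions for the example, and the remaining MFCC stages (pre-emphasis, FFT, mel filterbank, log, DCT) are omitted.

```python
import numpy as np

SAMPLE_RATE = 16000                      # assumed; 8 kHz is also mentioned above
FRAME_LEN = int(0.025 * SAMPLE_RATE)     # 25 ms -> 400 samples
FRAME_HOP = int(0.010 * SAMPLE_RATE)     # 10 ms -> 160 samples

def frame_audio(samples: np.ndarray) -> np.ndarray:
    """Slice a 1-D waveform of 16-bit samples into overlapping analysis frames."""
    n_frames = 1 + max(0, (len(samples) - FRAME_LEN) // FRAME_HOP)
    return np.stack([samples[i * FRAME_HOP : i * FRAME_HOP + FRAME_LEN]
                     for i in range(n_frames)]).astype(np.float32)

# One second of audio yields about 98 frames, each producing one feature vector 132.
audio = np.zeros(SAMPLE_RATE, dtype=np.int16)
print(frame_audio(audio).shape)          # (98, 400)
```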
Acoustic module 126 receives feature vectors 132 and produces likelihood scores 134 based on feature vectors 132 and data received from model module 108. Search module 128 receives likelihood scores 134 and uses them to recognize which word(s) were spoken in speech signal 122. Search module 128 produces a signal representing the recognized words 104.
In embodiments, acoustic model 126 and/or search module 128 may be implemented, in whole or in part, as hidden Markov model (HMM) inference, where hidden variables are states within a Markov process modeling speech production, and the observed variables are the acoustic features (e.g. feature vectors 132). Additionally, search module 128 may implement the Viterbi search algorithm. The Viterbi algorithm maintains a list of hypotheses, or active states, at each time step (e.g. each iteration of the algorithm). It also approximates the likelihood of all reachable states at the next time step.
In embodiments, an active state list may be stored in an on-chip memory (e.g., state list 302).
Acoustic Modeling
The acoustic model evaluates the likelihood of input features y_t with respect to a set of distributions p(y|i), where i is the index of a tied state. In an embodiment, acoustic model 126 implements a deep neural network (DNN) model. However, other models may be used including, but not limited to, Gaussian mixture models (GMMs), recurrent neural networks (RNNs), convolutional neural networks (CNNs), long short-term memory (LSTM) cells, subspace GMMs (SGMMs), or a combination thereof.
Referring to
Architecture 200 includes a sequencer 202 that receives a compressed parameter stream from memory 130. Decoder 204 decodes the compressed parameters and sends weight and bias coefficients to one or more execution units (EUs) 206. Each EU may include a local memory for storing feature vectors received from front-end module 124, intermediate results, and likelihood outputs for one frame. Thus, to reduce power consumption, only the network parameters are fetched from off-chip memory 130, and no data is written to memory 130.
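The idea of evaluating several frames against each parameter block while it is streamed in only once can be modeled in software as in the following sketch; the dense float layers, ReLU activations, and layer sizes are assumptions for illustration, whereas the hardware operates on the quantized parameter stream.

```python
import numpy as np

def stream_evaluate(layers, frames):
    """Evaluate a feed-forward DNN on many frames per pass over the parameters.

    `layers` is an iterable of (W, b) pairs standing in for the compressed
    parameter stream from memory 130; `frames` is a (num_frames, input_dim)
    batch standing in for the per-EU feature vectors.  Each (W, b) is
    "fetched" exactly once and applied to every frame, mirroring how one
    parameter transfer is amortized across multiple frames.
    """
    activations = frames
    for W, b in layers:                                   # single sequential pass
        activations = np.maximum(activations @ W + b, 0)  # affine layer + ReLU
    return activations                                    # per-frame scores (output layer omitted)

# Tiny example: three layers of comparable width, eight frames per transfer.
rng = np.random.default_rng(0)
dims = [40, 256, 256, 256]
layers = [(rng.standard_normal((dims[i], dims[i + 1])).astype(np.float32),
           np.zeros(dims[i + 1], dtype=np.float32)) for i in range(3)]
frames = rng.standard_normal((8, 40)).astype(np.float32)
print(stream_evaluate(layers, frames).shape)              # (8, 256)
```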
To provide scalability, architecture 200 may be reconfigurable. For example, EUs 206 may be organized into groups, such as group 208 of four EUs. Each group may arbitrate access to a group of SRAM memories 210. For small neural networks, each EU 206 may be active and accessing one SRAM. For larger networks that would overflow one SRAM 210, every other EU 206 may be disabled and the active EUs may access two adjacent SRAMs 210. Similarly, for neural networks that are large enough to overflow two SRAMs, three of every four EUs 206 may be disabled and the active EUs are configured to access four adjacent SRAMs.
In an embodiment, ASR system 100 may include thirty-two EUs organized into eight groups of four. In addition, each EU may have one single-port SRAM 210 that stores both input and output vectors of the current layer. This may result in a reduction of memory bandwidth and clock frequency. The grouping scheme can also be expanded to handle a wider range of neural networks if required.
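The grouping decision described above can be thought of as in the following sketch, where the widest layer determines how many adjacent SRAMs each active EU must claim and therefore how many of the four EUs in a group stay enabled; the SRAM capacity value and the function name are placeholders, not figures from this disclosure.

```python
def eus_active_per_group(max_layer_width: int, sram_words: int = 1024) -> int:
    """Return how many of the 4 EUs in a group remain active.

    If a layer's input and output vectors fit in one SRAM, all 4 EUs run;
    if two SRAMs are needed, every other EU is disabled (2 active);
    if four SRAMs are needed, only 1 of every 4 EUs stays active.
    (sram_words is a placeholder capacity, not a value from the disclosure.)
    """
    for srams_per_eu, active in ((1, 4), (2, 2), (4, 1)):
        if 2 * max_layer_width <= srams_per_eu * sram_words:
            return active
    raise ValueError("network too large for this grouping scheme")

print(eus_active_per_group(400))    # small network: 4 EUs active per group
print(eus_active_per_group(1000))   # needs two SRAMs per EU: 2 EUs active
print(eus_active_per_group(1800))   # needs four SRAMs per EU: 1 EU active
```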
Model data structures stored in memory 130 may be designed so the entire model may be accessed with a single, sequential read to avoid page access penalties and also reduce bandwidth and power consumption from reading memory 130.
In embodiments, acoustic model 116 may be designed so that the model, when compressed and stored in memory 130, is about 1 megabyte in size. This may reduce memory access to limit power consumption, while still providing a sufficiently accurate acoustic model.
Certain embodiments work in concert with model design to provide about two orders of magnitude in energy scalability across different ASR tasks. The time required to evaluate an NN model is dominated by the matrix/vector multiplications in affine layers, which grows with the square of the layer width (number of nodes). Furthermore, the maximum layer width dictates the amount of parallelism that is possible. Thus, the overall computation time (and, to a first approximation, energy) grows with the cube of the network width.
Circuit area may scale linearly with the number of EUs. This creates an area/power tradeoff: having more EUs allows lower memory bandwidth, and slightly lower core power due to reduced clock frequencies and voltages. Worst case latency may be proportional to the number of EUs, since search cannot proceed until the acoustic model is evaluated. With 32 EUs, the acoustic model occupies 44% of total ASR area. Circuit area may also scale linearly with the amount of memory provided to each EU. In addition, the user can limit the number of frames evaluated by acoustic model 126 through software settings in a host system, for example. This may increase amortized memory bandwidth and runtime, but decrease latency.
Sequencer and EU
The interface 214 between sequencer 202 and EUs 206 may include a first-in-first-out (FIFO) command module and a FIFO data module. In embodiments, each EU has minimal control logic and instead executes commands from sequencer 202 on data in memory, similar to an arithmetic-logic unit (ALU). In embodiments, each active EU may manipulate a vector corresponding to one acoustic frame, and store the vector in its local memory.
Compressed parameter stream 224 may have a hierarchical format that can be parsed via stack 228. An NN represented by this format is a list of layers. Each layer can be specified in either sparse or dense form. Both forms include a bias vector and a quantized weight matrix. Each weight matrix has its own quantizer to account for different distributions of weights without loss of accuracy. As the parameter stream is streamed in, quantizer levels from the stream may be stored in quantization table SRAM 222. In a dense layer, the next structure in the stream may be a bit-packed group of quantizer indices. Sequencer 202 may read each quantizer level from quantization table SRAM 222 and pair it with a serially incremented index. In such an embodiment, the sequencer may instruct the EUs to perform matrix/vector multiplication in row-major order. Sparse layers may contain additional structures to improve compression, which can be parsed by sequencer 202.
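A software analogue of decoding one dense layer from such a stream might look like the following sketch; the field layout, byte order, and one-byte index packing are assumptions for illustration, and the additional sparse-layer structures are not shown.

```python
import io
import numpy as np

def read_dense_layer(stream, rows, cols, bits=8):
    """Decode one dense layer: quantizer table, bias vector, packed weight indices.

    Assumed layout (illustrative only): a float32 table of 2**bits quantizer
    levels, a float32 bias of length `cols`, then row-major weight indices
    stored one byte each (hence bits=8 here for simplicity).
    """
    levels = np.frombuffer(stream.read(4 * (2 ** bits)), dtype=np.float32)
    bias = np.frombuffer(stream.read(4 * cols), dtype=np.float32)
    idx = np.frombuffer(stream.read(rows * cols), dtype=np.uint8)
    weights = levels[idx].reshape(rows, cols)   # dequantize by table lookup
    return weights, bias

# Round-trip example with a toy 4x3 layer.
levels = np.linspace(-1, 1, 256, dtype=np.float32)
bias = np.zeros(3, dtype=np.float32)
idx = np.arange(12, dtype=np.uint8)
blob = io.BytesIO(levels.tobytes() + bias.tobytes() + idx.tobytes())
W, b = read_dense_layer(blob, rows=4, cols=3)
print(W.shape, b.shape)                          # (4, 3) (3,)
```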
Search
At the beginning of the utterance, a starting state (i.e., state 0) with a likelihood of 1 may be inserted, followed by all the other initial states of the WFST 304 that are reachable via ε arcs from the starting state.
A forward pass of Viterbi search may be performed at each time step (frame). At the beginning of the forward pass, the state list for frame t+1 may be empty. Each state from frame t may be read and the WFST may be queried for arcs leaving that state. The destination of each arc may be another state which becomes a hypothesis for frame t+1. The likelihood of these new hypotheses may include the likelihood of being in the source state, the transition probability (arc weight) from the WFST 304, and the acoustic likelihood for the appropriate tied state distribution (arc input label). Following the Viterbi approximation, only the most likely arc into each state is considered for selecting and scoring hypotheses, where p(x_{t+1}) is approximated as max_{x_t} p(x_t) p(x_{t+1}|x_t) p(y_{t+1}|x_{t+1}), where x_t is the index of a WFST state considered by the search at time t and y_t is the feature vector representing the audio signal at time t.
The forward pass may be divided into two phases: the "non-ε" phase and the "ε" phase. In the non-ε phase, the source states from frame t may be expanded. However, in embodiments, only the arcs with non-ε (nonzero) input labels are considered. The ε-input arcs may be discarded because they do not advance the hypothesis forward in time. Each non-ε arc may be scored with a likelihood from the acoustic model, and the destination states with sufficiently high scores may be saved as hypotheses for frame t+1. Then, in the ε phase, all the source states from frame t+1 may be expanded. Only the ε-input arcs are considered, so the acoustic model is not required. The destinations of high-scoring arcs may be saved as hypotheses for frame t+1 (i.e., they do not advance forward in time). To eventually recover the word sequence, the list of active states may be saved after each forward pass (as a "snapshot" to external memory 130).
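The two-phase forward pass can be sketched in software as follows; the WFST access helper, the beam-pruning convention, and the use of log-likelihoods (so that products in the Viterbi recursion become sums) are assumptions of the example rather than features of the circuit.

```python
import math

def forward_pass(active, wfst_arcs, acoustic_loglik, beam=10.0):
    """One Viterbi frame: `active` maps state -> log-likelihood at frame t.

    wfst_arcs(state) yields (dest, in_label, out_label, log_weight);
    in_label == 0 denotes an epsilon (non-emitting) arc, and
    acoustic_loglik[in_label] is the acoustic score for a tied state.
    Returns the hypotheses (state -> log-likelihood) for frame t+1.
    """
    nxt = {}

    def relax(dest, score):
        if score > nxt.get(dest, -math.inf):
            nxt[dest] = score                   # Viterbi: keep only the best arc in

    # Non-epsilon phase: expand frame-t states along emitting arcs.
    for state, score in active.items():
        for dest, ilabel, _olabel, w in wfst_arcs(state):
            if ilabel != 0:
                relax(dest, score + w + acoustic_loglik[ilabel])

    # Epsilon phase: expand the new frame-(t+1) states along non-emitting arcs
    # (a single pass, assuming the WFST was pre-processed to avoid recursion).
    for state in list(nxt):
        for dest, ilabel, _olabel, w in wfst_arcs(state):
            if ilabel == 0:
                relax(dest, nxt[state] + w)

    if not nxt:
        return nxt
    best = max(nxt.values())                    # beam pruning relative to the best score
    return {s: v for s, v in nxt.items() if v > best - beam}
```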
Referring to
After the audio has been consumed to create feature vectors, and the feature vectors have been consumed by forward search, a final weight update may be performed. This involves reading all stored states and adding final weights from the WFST to their likelihood scores. In embodiments, not all states have a final weight (or, equivalently, their final weight may be infinite). In an embodiment, if any of the states have a final weight, then only those are considered as valid endpoints of the search. If none of the states have a final weight, then all are considered valid. The most likely valid state is used as the endpoint for a backtrace which produces a sequence of word labels. This backtrace could be constructed by reading all of the state snapshots (stored in external memory) in reverse order, for example.
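In software terms, the final-weight update and backtrace might look like the following sketch; the snapshot representation (one dictionary per frame mapping each active state to its best predecessor and an optional word label) is an assumption of the example.

```python
def finish_and_backtrace(active, final_weights, snapshots):
    """Pick the best valid end state and recover the word sequence.

    `active` maps state -> log-likelihood after the last frame;
    `final_weights` maps state -> final log-weight (states without an entry
    have no final weight); `snapshots` is the per-frame list of dictionaries
    mapping each active state to (best_prev_state, word_label_or_None).
    """
    # If any state has a final weight, only those states are valid endpoints.
    valid = {s: v + final_weights[s] for s, v in active.items() if s in final_weights}
    if not valid:
        valid = active
    state = max(valid, key=valid.get)

    # Walk the stored snapshots in reverse order, collecting word labels.
    words = []
    for frame in reversed(snapshots):
        prev_state, word = frame[state]
        if word is not None:
            words.append(word)
        state = prev_state
    return list(reversed(words))
```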
In embodiments, efficiency features can be added to search module 128 without changing the behavior of the search algorithm. For memory area savings, the two state lists 302 may be consolidated into a single list with separate fields for the “source” (t) and “destination” (t+1) frames. This may be beneficial because most states remain active for several frames at a time. More significant optimizations may be made to reduce off-chip memory bandwidth. WFST compression and caching techniques may be used to reduce read bandwidth, and a word lattice may be used to reduce write bandwidth.
Referring again to
In embodiments, state list 302 may be a hash table that stores active states (i.e. hypotheses for the speech recognition). Each hypothesis may be indexed by a particular WFST state and linked with a (state, time) pair referred to as a word lattice state (WL state) 310.
As the compressed WFST data is received, it is expanded by the arc fetch module 308 to produce a sequence of arcs. Each arc that is expanded has a destination state, input and output labels, and weight.
Before the search is executed, the acoustic model has computed a likelihood for each tied state. For each unscored arc, the appropriate likelihood is retrieved and combined with the source state score and arc weight to obtain the overall likelihood of the destination state. The scored arc carries the information necessary to save a new hypothesis in the state list. However, if the score of the arc does not exceed a particular threshold, the arc may be pruned and discarded by pruning module 312. If the scored arc has a word label (i.e., a non-ε output label), it may be sent to word lattice 306.
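A minimal sketch of this scoring and pruning step is shown below; the tuple layout, the log-likelihood convention, and the threshold handling are assumptions of the example.

```python
def score_arc(source_score, arc, acoustic_loglik, threshold,
              state_list, word_lattice_arcs):
    """Score one fetched arc and either save, forward to the word lattice, or prune it.

    arc = (dest_state, in_label, out_label, log_weight); an out_label of 0 is
    treated as epsilon (no word).  state_list maps dest_state to its best score
    so far; word_lattice_arcs collects arcs that carry word labels.
    """
    dest, ilabel, olabel, w = arc
    score = source_score + w + acoustic_loglik[ilabel]
    if score < threshold:
        return None                          # pruned and discarded (pruning module 312)
    if score > state_list.get(dest, float("-inf")):
        state_list[dest] = score             # new or improved hypothesis for frame t+1
    if olabel != 0:
        word_lattice_arcs.append((dest, olabel, score))   # forwarded to word lattice 306
    return score
```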
Referring to
WFST
The WFST contains time-varying aspects of the HMM. In embodiments, the WFST is a composition of four knowledge sources: the lexicon (phoneme sequences that make up each word), the language model (possible transitions between words), the subphonetic model (different sounds produced over time within each phoneme), and the context dependency (different ways of realizing each phoneme). The information may be stored in the WFST using a key/value store where keys are state IDs and the values contain all outgoing arcs from each state. Each arc may include a weight which can be used when computing hypothesis likelihoods. In embodiments, the WFST data may be trained with quantized weights for memory savings.
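The key/value organization described above can be pictured with a small dictionary, as in the sketch below; the state IDs, labels, and weights are toy values, and the on-chip version stores each value as a compressed, variable-length byte string rather than a Python list.

```python
# Keys are state IDs; each value holds every outgoing arc of that state as
# (destination state, input label, output label, weight).
wfst = {
    0: [(1, 17, 0, 0.3),      # output label 0: no word emitted on this arc
        (2, 42, 0, 1.1)],
    1: [(3, 65, 7, 0.2)],     # output label 7: a word is emitted on this arc
    2: [],                    # state with no outgoing arcs
}

def arcs_from(state_id):
    """Fetch all outgoing arcs for a state (the role of arc fetch module 308)."""
    return wfst.get(state_id, [])

print(arcs_from(0))
```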
The WFST may be cached to reduce memory access and bandwidth. The cache may also be compressed. Because each object in the WFST may have varying size (due to different numbers of possible arcs leaving each state), the objects may be stored as variable length sequences of bytes in the cache, rather than using a one-arc-per-cache-line scheme. Additionally, the cache may have a circular buffer architecture. The circular buffer may be configured to allow for the variable-size objects. This may reduce cache misses and flushes/evictions of data from the cache.
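A simplified software model of such a cache is sketched below; the oldest-first (FIFO) eviction used here only approximates the circular-buffer behavior, and the capacity, class name, and method names are assumptions of the example.

```python
from collections import OrderedDict

class VariableSizeCache:
    """Cache of variable-length byte strings with oldest-first eviction on overflow."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()          # state ID -> compressed arc bytes

    def get(self, state_id):
        return self.entries.get(state_id)     # None signals a cache miss

    def put(self, state_id, blob: bytes):
        # Evict the oldest entries until the new variable-size object fits.
        while self.entries and self.used + len(blob) > self.capacity:
            _, old = self.entries.popitem(last=False)
            self.used -= len(old)
        if len(blob) <= self.capacity:
            self.entries[state_id] = blob
            self.used += len(blob)

cache = VariableSizeCache(capacity_bytes=64)
cache.put(0, b"\x01" * 40)
cache.put(1, b"\x02" * 30)                    # evicts state 0 to make room
print(cache.get(0), len(cache.get(1)))        # None 30
```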
In some embodiments, the WFST may be encoded and/or compressed with an application-specific encoding scheme to maximize hit rates and minimize the bandwidth for cache misses.
Word Lattice
Referring to
The snapshot may vary in size but may typically be 10-50 kB. This may result in an average 1-5 MB/s of writes. At the end of decoding, all the snapshots can be read in reverse order to perform a back-trace, resulting in an additional 1-5 MB/s amortized read bandwidth. At least some of the information in the state snapshots may not be necessary for reconstructing word-level hypotheses.
A word lattice differs from a state lattice in that every arc may have exactly one output (word) label. Each arc in the word lattice may represent a chain of one or more arcs in the state lattice, the last of which has an output label. Because the word lattice may be smaller than a state lattice covering the same search space, snapshots to the external memory are needed less frequently.
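The relationship can be illustrated with the sketch below, in which a chain of state-lattice arcs is collapsed into a single word-lattice arc carrying exactly one word label; the field names and the (state, time) values are assumptions of the example.

```python
def collapse_to_word_arc(state_arcs):
    """Collapse a chain of state-lattice arcs into one word-lattice arc.

    `state_arcs` is an ordered chain of dicts with keys 'src', 'dst',
    'word' (None except on the last arc of the chain) and 'log_weight'.
    The resulting word arc spans the whole chain and carries its one word.
    """
    assert state_arcs[-1]["word"] is not None, "chain must end on a word label"
    return {
        "src": state_arcs[0]["src"],                    # (WFST state, time) pair
        "dst": state_arcs[-1]["dst"],
        "word": state_arcs[-1]["word"],
        "log_weight": sum(a["log_weight"] for a in state_arcs),
    }

chain = [
    {"src": (5, 10), "dst": (9, 11), "word": None,    "log_weight": -0.4},
    {"src": (9, 11), "dst": (3, 12), "word": None,    "log_weight": -0.7},
    {"src": (3, 12), "dst": (8, 13), "word": "hello", "log_weight": -0.2},
]
print(collapse_to_word_arc(chain))   # one word arc from (5, 10) to (8, 13) labeled "hello"
```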
For example, data structures such as those shown in
Referring to
Details related to ASR systems may be found in the dissertation of Michael Price, Energy-scalable Speech Recognition Circuits (June 2016) (on file with the Massachusetts Institute of Technology Libraries), which is incorporated here by reference in its entirety.
Having described preferred embodiments, which serve to illustrate various concepts, structures and techniques, which are the subject of this patent, it will now become apparent to those of ordinary skill in the art that other embodiments incorporating these concepts, structures and techniques may be used. Accordingly, it is submitted that the scope of the patent should not be limited to the described embodiments but rather should be limited only by the spirit and scope of the following claims. All references cited herein are hereby incorporated herein by reference in their entirety.
This application is a CONTINUATION application of U.S. patent application Ser. No. 16/099,589 (filed Nov. 7, 2018), which is a U.S. National Stage Application of PCT Application No. PCT/US2017/035134 (filed May 31, 2017), which in turn claims the benefit of U.S. Provisional Patent Application No. 62/344,081 (filed Jun. 1, 2016). All applications listed in this section are incorporated here by reference in their entirety.