The embodiments are generally directed to end-to-end speech recognition systems, and more specifically to an ensemble end-to-end speech recognition system that includes a phone byte pair encoding system and a character byte pair encoding system implemented as neural networks.
Phones and their context-dependent variants have long been the standard modeling units for conventional speech recognition systems. However, modern end-to-end systems increasingly use characters and character-based sub-words, such as byte pair encoding (BPE) units and word pieces, for automatic speech recognition. Accordingly, what is needed are techniques that improve accuracy and optimize automatic speech recognition systems that use BPE.
The embodiments herein describe an ensemble byte pair encoding (BPE) system that includes a phone BPE system and a character BPE system. The ensemble BPE system performs end-to-end speech recognition.
The embodiments describe a multi-level language model (LM) that includes a sub-word LM and a word LM for decoding words in a phone BPE system and a character BPE system. The multi-level LM may be a recurrent neural network. The LM may include a prefix tree for identifying one or more words.
The embodiments describe an acoustic model in the phone BPE system. The acoustic model may be represented as a neural network. The acoustic model may be trained using phone BPE targets. The training may use a multi-task attention and Connectionist Temporal Classification (CTC) loss.
The embodiments describe a decoder which may convert the phone BPE sequence into words. The decoder may use a one-pass beam search algorithm that efficiently ensembles the phone and character BPE systems in real time and exploits the complementarity between them.
Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities.
In some embodiments, memory 120 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the methods described in further detail herein. For example, as shown, memory 120 includes instructions for a multi-level language model (LM) 125 and ensemble BPE system 165. Multi-level LM 125 may receive an utterance, e.g., a spoken word or words in a natural language, and convert the utterance into text that includes one or more textual representations of the word or words. In some embodiments, multi-level LM 125 includes a sub-word LM 130 and a word LM 140. In some embodiments, sub-word LM 130 may build a word from one or more sequences of characters or sub-words, while word LM 140 may predict the probability of a word given a sequence of preceding words. Both sub-word LM 130 and word LM 140 may generate one or more scores for a word. Multi-level LM 125 may combine scores from sub-word LM 130 and word LM 140 and use the combined scores to determine a word. Multi-level LM 125, sub-word LM 130, and word LM 140 may be “networks” that may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network or system, and/or any training or learning models implemented thereon or therewith. Further, multi-level LM 125, sub-word LM 130, and word LM 140 may each be at least one recurrent neural network.
In some embodiments, multi-level LM 125 may build a prefix tree 150. The prefix tree 150 may be stored in memory 120. The prefix tree 150 may store the pronunciations of characters, sub-words, and words in a word dictionary. The prefix tree 150 may be built by decomposing the pronunciation of each word in a word dictionary into a phone sequence using a BPE decoding algorithm. The decomposition may be greedy because the BPE decoding algorithm may prefer larger sub-words when possible. Typically, the prefix tree 150 may be built once from an existing word dictionary whose pronunciations are converted into phone sequences, and then stored in memory 120. Once prefix tree 150 is built, multi-level LM 125 may use prefix tree 150 to identify words.
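As a minimal sketch (not the disclosed implementation), prefix tree 150 could be organized as a trie whose edges are phone BPE units, with homophones accumulating at the same node. The helper phone_bpe_encode, which greedily decomposes a pronunciation into phone BPE units, is a hypothetical stand-in for the BPE decoding algorithm described above.

```python
class PrefixTreeNode:
    """One node of prefix tree 150: children are keyed by phone BPE sub-word,
    and words whose pronunciations end at this node are stored on the node."""
    def __init__(self):
        self.children = {}   # sub-word -> PrefixTreeNode
        self.words = []      # complete words (homophones share a node)

    def get_tokens(self):
        # sub-words branching out from this node (cf. node.getTokens())
        return self.children.keys()

    def get_words(self):
        # complete words whose pronunciation matches the path from the root
        return self.words


def build_prefix_tree(word_dict, phone_bpe_encode):
    """word_dict maps word -> pronunciation; phone_bpe_encode greedily splits
    a pronunciation into phone BPE units (assumed helper)."""
    root = PrefixTreeNode()
    for word, pronunciation in word_dict.items():
        node = root
        for sub_word in phone_bpe_encode(pronunciation):
            node = node.children.setdefault(sub_word, PrefixTreeNode())
        node.words.append(word)  # homophones accumulate at the same node
    return root
```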
When the multi-level LM 125 receives an utterance or spoken words, the multi-level LM 125 may attempt to convert the spoken words into text words using a forward function 160. Forward function 160 may also be stored in memory 120. To determine a text word from a spoken word, the forward function 160 in the multi-level LM 125 may traverse the prefix tree 150 from its root node to other nodes within prefix tree 150 according to the hypothesized sub-words and accumulate sub-word LM scores from sub-word LM 130 at each step. Each step may be a traversal between two tree nodes in prefix tree 150. In some instances, the forward function 160 may encounter a tree node in the prefix tree 150 containing words whose pronunciations match the sequence of sub-words on the path stemming from the root. At this point, the forward function 160 may use word LM 140 to output the word or words associated with that tree node. Word LM 140 may also generate word LM scores for the word(s), and the accumulated sub-word LM scores from sub-word LM 130 may be replaced with the word LM scores from word LM 140. Subsequently, forward function 160 may move back to the root of the prefix tree 150 to determine the next word in the utterance. A tree node in the prefix tree 150 may contain multiple words when a word is a homophone. When a word boundary is met at a node that contains homophones, multiple hypothesized words may be output, each associated with a different word LM state.
When forward function 160 proceeds with sub-word LM 130, forward function 160 identifies whether the sub-word s is in the list of sub-words branching out from the current node in prefix tree 150 (shown as the function node.getTokens( ) in the referenced figure).
When forward function 160 proceeds to word LM 140, forward function 160 has identified a word boundary for sub-word s. The word boundary occurs when the pronunciation built up to and including sub-word s matches the pronunciation of a word at the tree node. In this case, forward function 160 may retrieve the list of complete words associated with the tree node of prefix tree 150 that is associated with the current state (see the function node.getWords( ) in the referenced figure).
In some embodiments, the state may be a tuple of six elements, shown below:
(Sstate; Slogp; Wstate; Wlogp; node; accum)
Sstate may contain the state of the sub-word LM S, and Slogp may contain the log-probabilities associated with Sstate for a sub-word s in the sub-word LM S. Forward function 160 may use these log-probabilities to determine the sub-word score with the sub-word LM S. Wstate may contain the state of the word LM W, and Wlogp may contain the associated log-probabilities for a word from word LM W. Forward function 160 may use these log-probabilities to determine the word score with word LM W. The node element may contain the current position in the prefix tree T (prefix tree 150). The accum element may contain the accumulated sub-word score since the last word output.
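For illustration only, this six-element state could be represented as a named tuple whose field names follow the notation above; the concrete types of the LM states and log-probability containers are assumptions.

```python
from typing import Any, NamedTuple

class MultiLevelLMState(NamedTuple):
    Sstate: Any    # state of the sub-word LM S
    Slogp: Any     # log-probabilities over next sub-words given Sstate
    Wstate: Any    # state of the word LM W
    Wlogp: Any     # log-probabilities over next words given Wstate
    node: Any      # current position in prefix tree T (prefix tree 150)
    accum: float   # accumulated sub-word score since the last word output
```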
In some embodiments, at the beginning of the utterance of words, the forward function 160 may initialize the states and log-probabilities by accepting the start of sentence token <sos> and may set node to the root of the prefix tree T:
(Sstate; Slogp)←S.forward(default_state; <sos>);
(Wstate; Wlogp)←W.forward(default_state; <sos>);
node←root; accum←0.
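As a hedged sketch of how forward function 160 might operate under the state layout above: S.forward and W.forward are assumed to return a new LM state together with a dict-like mapping from tokens to log-probabilities, Slogp and Wlogp are assumed indexable by token, and oov_penalty is a hypothetical handling of sub-words that do not branch out from the current node. This is illustrative, not the disclosed implementation.

```python
def lm_forward(state, sub_word, S, W, tree_root, oov_penalty=-1e10):
    """Advance multi-level LM 125 by one hypothesized sub-word.
    Returns (new_state, score_delta, words); words is non-empty only at a
    word boundary, where the accumulated sub-word LM score is replaced by
    the word LM score and the tree position resets to the root."""
    Sstate, Slogp, Wstate, Wlogp, node, accum = state

    # the hypothesized sub-word must branch out from the current tree node
    if sub_word not in node.get_tokens():
        return state, oov_penalty, []

    # accumulate the sub-word LM score and advance the LM and the tree
    step_logp = Slogp[sub_word]
    Sstate, Slogp = S.forward(Sstate, sub_word)
    node = node.children[sub_word]
    accum += step_logp

    words = node.get_words()
    if not words:
        # still inside a word: report the incremental sub-word LM score
        return (Sstate, Slogp, Wstate, Wlogp, node, accum), step_logp, []

    # word boundary: replace the accumulated sub-word score with the word LM
    # score (homophones would yield one hypothesis per word; only the best
    # word is kept here for brevity), then go back to the root
    best_word = max(words, key=lambda w: Wlogp[w])
    score_delta = step_logp - accum + Wlogp[best_word]
    Wstate, Wlogp = W.forward(Wstate, best_word)
    return (Sstate, Slogp, Wstate, Wlogp, tree_root, 0.0), score_delta, [best_word]
```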
Going back to the ensemble BPE system 165 stored in memory 120, in some embodiments the ensemble BPE system 165 may include a phone BPE system 170, a character BPE system 180, and a one-pass beam search algorithm 190.
In some embodiments, phone BPE system 170 includes an acoustic model (AM) 172 and a multi-level LM 174. Character BPE system 180 also includes an AM 182 and a multi-level LM 184. AMs 172, 182 may be trained with end-to-end objectives, e.g., the hybrid attention and Connectionist Temporal Classification (CTC) model. AMs 172, 182 may also provide a scoring function that computes the score of the next sub-word given the acoustic inputs and the previously decoded sub-words; this score may be a linear combination of log-probabilities from the attention decoder and the CTC outputs. Multi-level LMs 174, 184 may each be an instance of multi-level LM 125 that is trained on text data; they may share the same instance of multi-level LM 125 or use different instances. Multi-level LMs 174, 184 may provide the forward function 160 for computing the score of the next sub-word given previously decoded words and sub-words.
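As a small sketch of that scoring function, the next-sub-word score could be formed as a weighted sum of the attention-decoder and CTC log-probabilities; the weight name ctc_weight and its default value are assumptions rather than values from the disclosure.

```python
def am_next_subword_score(attention_logp, ctc_logp, ctc_weight=0.3):
    """Hybrid attention/CTC score of the next sub-word given the acoustic
    inputs and previously decoded sub-words: a linear combination of the
    attention-decoder log-probability and the CTC log-probability."""
    return (1.0 - ctc_weight) * attention_logp + ctc_weight * ctc_logp
```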
In ensemble BPE system 165, phone BPE system 170 and character BPE system 180 work together to determine a textual representation of a word. Further, phone BPE system 170 and character BPE system 180 may be complementary to each other, i.e., they may capture different aspects of the language. The one-pass beam search algorithm 190 may determine the textual representation of a spoken word by traversing an ensemble that includes both phone BPE system 170 and character BPE system 180. For example, one-pass beam search algorithm 190 may use the phone BPE system 170 to propose sub-words up to a word boundary. After phone BPE system 170 identifies a word boundary, the one-pass beam search algorithm 190 may decompose the word into a sequence of character BPE units and run the character BPE system 180 to accept the sequence. Next, one-pass beam search algorithm 190 may linearly combine the scores from the phone BPE system 170 and character BPE system 180 up to the word boundary. In this way, the phone BPE system 170 may lead the decoding process and the character BPE system 180 may verify the word identified by the phone BPE system 170. This is because the phone BPE system 170 may be more accurate than the character BPE system 180. The phone BPE system 170 and the character BPE system 180 may synchronize at each word boundary. In such a way, the evidence from character BPE system 180 may be injected as early as possible to adjust the scores of word hypotheses. Compared to a conventional second-pass rescoring algorithm, the one-pass beam search algorithm 190 may avoid generating and saving a large number of hypotheses from the phone BPE system 170.
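A simplified sketch of the word-boundary synchronization described above: the phone BPE system proposes a word, and the character BPE system merely follows by scoring the word's character BPE decomposition. The names char_system.score and spm_encode are illustrative stand-ins for the character BPE scorer and the sub-word tokenizer.

```python
def verify_word_with_char_system(word, char_state, char_system, spm_encode):
    """Character BPE system 180 verifies a word proposed by phone BPE
    system 170: it accepts and scores the word's character BPE decomposition
    without proposing hypotheses of its own."""
    char_word_score = 0.0
    for unit in spm_encode(word):                    # word -> character BPE units
        char_state, step_score = char_system.score(char_state, unit)
        char_word_score += step_score
    return char_state, char_word_score               # combined with the phone score below
```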
In some embodiments, each beam maintained by the one-pass beam search algorithm 190 may be represented as a tuple:

(score, ws, sc1, ys1, st1, sc2, ys2, st2)
where score is the final score of the beam for pruning purposes. The ws is the word hypothesis. The sc1 is the score, ys1 is the output sub-word sequence, and st1 is the multi-level LM state from multi-level LM 174 for phone BPE system 170. The sc2 is the score, ys2 is the output word sequence, and st2 is the multi-level LM state from multi-level LM 184 for character BPE system 180.
In some embodiments, parameter β may be used to combine the LM score from the multi-level LM with the AM score from the corresponding acoustic model within each BPE system (i.e., multi-level LM 174 with AM 172 in phone BPE system 170, and multi-level LM 184 with AM 182 in character BPE system 180). Parameter γ∈[0, 1] may be used to combine the scores from phone BPE system 170 and character BPE system 180.
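A minimal sketch of how these two weights might be applied; the default values of beta and gamma below are placeholders, not values specified in the disclosure.

```python
def system_score(am_logp, lm_logp, beta=0.6):
    """Within one BPE system: AM score combined with the beta-weighted
    multi-level LM score."""
    return am_logp + beta * lm_logp

def ensemble_score(phone_score, char_score, gamma=0.7):
    """Across systems: gamma-weighted combination of the phone BPE and
    character BPE system scores, with gamma in [0, 1]."""
    return gamma * phone_score + (1.0 - gamma) * char_score
```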
In some embodiments, the one-pass beam search algorithm 190 may be terminated using a detection method. Notably, one-pass beam search algorithm 190 executes once through the phone BPE sequence in phone BPE system 170 and once through the corresponding character BPE sequence in character BPE system 180. Further, character BPE system 180 does not propose additional hypotheses but simply follows phone BPE system 170. In this way, the time complexity of the one-pass beam search algorithm 190 is roughly the sum of that of the individual systems for the same beam size.
As illustrated in the referenced figure, in some embodiments, once one or more words are identified, one-pass beam search algorithm 190 decomposes the one or more words into characters using the spm-encode(w) function.
In some embodiments, the one-pass beam search algorithm 190 repeats the above process for multiple beams, which are hypotheses for the words that may correspond to the utterance, such as a spoken word (input x). Next, one-pass beam search algorithm 190 may select the word hypothesis that corresponds to the highest score as the text word for the spoken utterance. One-pass beam search algorithm 190 may also select the several highest scores and identify several textual word candidates for the spoken utterance. The highest scores may be the top configurable number of scores or the scores that are above a configurable threshold.
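One illustrative way the final selection could be performed, assuming each finished beam exposes the score and word-hypothesis fields of the beam tuple described above; the helper name and defaults are assumptions.

```python
def select_hypotheses(finished_beams, top_k=1, min_score=None):
    """Pick the text hypothesis (or the top candidates) from finished beams.
    Each beam is assumed to expose .score (combined score) and .ws (word
    hypothesis), as in the beam tuple described above."""
    ranked = sorted(finished_beams, key=lambda beam: beam.score, reverse=True)
    if min_score is not None:                  # optional configurable threshold
        ranked = [beam for beam in ranked if beam.score >= min_score]
    return [(beam.ws, beam.score) for beam in ranked[:top_k]]
```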
At process 402, a spoken utterance is received. For example, ensemble BPE system 165 receives a spoken utterance, such as a spoken word, and uses the one-pass beam search algorithm 190 to convert the spoken utterance into a textual representation of the spoken word.
At process 404, one or more words are identified using a phone BPE system 170. One-pass beam search algorithm 190 may identify one or more sub-words and continue until the phone BPE system 170 reaches a word boundary, at which point the one or more sub-words become one or more words. The phone BPE system 170 may also generate scores for the one or more sub-words up to the word boundary, at which point the one or more sub-words are converted into one or more words. As discussed above, phone BPE system 170 may use multi-level LM 174 to identify one or more words and both multi-level LM 174 and AM 172 to determine one or more scores for the one or more words.
At process 406, the one or more words are decomposed into a character sequence. For example, one-pass beam search algorithm 190 may convert the one or more words into one or more character sequences, where each sequence includes one or more characters for a corresponding word.
At process 408, one or more words are identified from the character sequences using a character BPE system 180. For example, one-pass beam search algorithm 190 may identify one or more words from the character sequence using character BPE system 180. To identify the one or more words, character BPE system 180 may use multi-level LM 184 to determine sub-words from the characters. The sub-words are then converted to words once a word boundary is met. Multi-level LM 184 may also determine scores for the one or more sub-words and words. Character BPE system 180 may then use AM 182 to determine scores for the words and combine the scores from the multi-level LM 184 and AM 182 for each word. In this way, the character BPE system 180 may verify the one or more words identified using the phone BPE system 170.
At process 410, scores for each word from the phone BPE system 170 and character BPE system 180 are combined. For example, for each word identified in process 404 that matches a word identified in process 408, one-pass beam search algorithm 190 combines the score for the word from the phone BPE system 170 with the score for the word from the character BPE system 180.
At process 412, a text word is determined from the scores. For example, a word that corresponds to the highest score is determined as the text word.
In some embodiments, multi-level LMs 174, 184 and AMs 172, 182 may have neural network architectures that include an encoder, an attention layer, and a decoder. In some embodiments, an encoder in AMs 172, 182 may be shared by an attention layer and the CTC model. Further, the CTC model may include convolutional layers, e.g., two convolutional layers. The decoder may include transformer layers, e.g., six transformer layers. In some embodiments, encoder layers of the encoder may employ self-attention, and decoder layers in the decoder may employ self-attention to previously decoded labels followed by source attention to the encoder outputs. In some embodiments, the attention operations may use four heads of sixty-four dimensions each, and the output of the multi-head attention layer may go through a one-hidden-layer position-wise feed-forward network of rectified linear units (ReLU), e.g., 2048 ReLU units. In another architecture, the decoder may include twenty-four transformer layers and the encoder may include twelve transformer layers. Also, there may be ten attention heads that yield an attention dimension of 384. In this architecture, during training, the attention and feed-forward operations may be randomly skipped with a probability for each layer so that the layer reduces to the identity mapping, and the layer dropout probability linearly increases with depth up to p. In some embodiments, p may be a probability between 0 and 1. The probability may be used to randomly omit the attention operation in a transformer layer, so that the layer reduces to the identity mapping. The probability of that omission is a dropout rate. The rate varies for each layer and may increase from the lower layers, i.e., the layers closer to the input, to the higher layers, i.e., the layers closer to the output. In some embodiments, the rate may be linearly increased from 0 to probability p, where the value of p may be set during training. For example, for an encoder or decoder that has five layers and p=0.5, the dropout rate may be 0.1 for the first layer, 0.2 for the second layer, 0.3 for the third layer, 0.4 for the fourth layer, and 0.5 for the fifth layer.
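A small sketch of the linearly increasing layer-dropout schedule described above; the helper name is illustrative.

```python
def layer_dropout_schedule(num_layers, p):
    """Per-layer probability of skipping the attention/feed-forward operations
    (reducing the layer to the identity mapping); the rate increases linearly
    with depth, from the layers closest to the input up to p at the top layer."""
    return [p * (layer + 1) / num_layers for layer in range(num_layers)]

# Example from the text: five layers and p = 0.5
# layer_dropout_schedule(5, 0.5) -> [0.1, 0.2, 0.3, 0.4, 0.5]
```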
Some examples of computing devices, such as computing device 100 may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the processes of method 400. Some common forms of machine readable media that may include the processes of method 400 are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
This application claims priority to U.S. Provisional Patent Application No. 63/007,054, filed Apr. 8, 2020, which is incorporated by reference herein in its entirety.