The present invention relates generally to automatic speech recognition (ASR), and relates more particularly to Gaussian likelihood computation.
Gaussian mixture models (GMMs) can be used in both the front end processing and the search stage of hidden Markov model (HMM)-based large vocabulary automatic speech recognition (ASR). During front end processing, GMMs are typically used in the computation of posterior vectors for generating feature space minimum phone error (fMPE) transforms that are applied to feature vectors. During the search stage, the GMMs are typically used as acoustic models to model different sounds. During both of these stages, the use of a hierarchical Gaussian codebook can expedite Gaussian likelihood computation.
Gaussian likelihood computation is typically the most computationally intensive operation performed during HMM-based large vocabulary ASR. For instance, Gaussian likelihood computation often consumes thirty to seventy percent of the total recognition time. Thus, the speed with which an ASR system recognizes speech is directly tied to the speed with which it computes the Gaussian likelihoods.
The present invention relates to a method and apparatus for computing Gaussian likelihoods. One embodiment of a method for processing a speech signal includes generating a feature vector for each frame of the speech signal, evaluating the feature vector in accordance with a hierarchical Gaussian shortlist, and producing a hypothesis regarding the content of the speech signal, based on the evaluating.
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
The present invention relates to a method and apparatus for computing Gaussian likelihoods. Embodiments of the present invention use hierarchical Gaussian shortlists to improve the performance of standard vector quantization (VQ)-based Gaussian selection. First, all of the Gaussian components are merged into a number of indexing clusters (e.g., using bottom-up Gaussian clustering). Then, a shortlist is built for all of the clusters in each layer. This speeds the computation of Gaussian likelihoods, making it possible to achieve real-time ASR performance.
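By way of illustration, the following Python sketch shows one plausible bottom-up procedure for merging Gaussian components into indexing clusters. It greedily merges the closest pair of clusters by the Euclidean distance between their centroids; the merge criterion, the function name, and the data layout are illustrative assumptions, not the specific clustering procedure of the invention (practical systems often merge by a likelihood-loss or divergence-based measure instead).

```python
import numpy as np

def cluster_gaussians(means, n_clusters):
    """Greedy bottom-up clustering of Gaussians by their mean vectors.

    A sketch only. Returns, for each input Gaussian, the index of its
    cluster, plus the cluster centroids, which can serve as the means
    of the indexing-layer Gaussians.
    """
    clusters = [[i] for i in range(len(means))]
    centroids = [m.copy() for m in means]
    while len(clusters) > n_clusters:
        # Find the closest pair of cluster centroids.
        best, best_d = (0, 1), np.inf
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                d = np.sum((centroids[i] - centroids[j]) ** 2)
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        # Merge cluster j into cluster i; the new centroid is the
        # member-weighted average of the two old centroids.
        ni, nj = len(clusters[i]), len(clusters[j])
        centroids[i] = (ni * centroids[i] + nj * centroids[j]) / (ni + nj)
        clusters[i].extend(clusters[j])
        del clusters[j], centroids[j]
    assignment = np.empty(len(means), dtype=int)
    for c, members in enumerate(clusters):
        assignment[members] = c
    return assignment, np.stack(centroids)
```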
For a feature vector $x_t$, the likelihood of an $N$-dimensional Gaussian distribution with a mean of $\mu$ and a covariance of $\Sigma$ may be computed as:

$$\mathcal{N}(x_t; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x_t - \mu)^{\top}\Sigma^{-1}(x_t - \mu)\right) \quad \text{(EQN. 1)}$$
In most speech recognition systems, the log likelihood is used for numerical stability, and diagonal covariances are used for data sparsity reasons. If the diagonal covariance is $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2)$, then the log likelihood becomes:

$$\log \mathcal{N}(x_t; \mu, \Sigma) = -\frac{1}{2}\left(N \log 2\pi + \sum_{i=1}^{N} \log \sigma_i^2\right) - \frac{1}{2}\sum_{i=1}^{N}\frac{(x_{t,i} - \mu_i)^2}{\sigma_i^2} \quad \text{(EQN. 2)}$$
The first term does not depend on the feature vector and can be pre-computed before decoding. The second term can be further decomposed into a dot-product format, part of which can also be pre-computed.
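To make the decomposition concrete, the following Python sketch caches the feature-independent portion of EQN. 2 at construction time and rewrites the remainder as dot products against $x$ and $x^2$; the class name and structure are illustrative only.

```python
import numpy as np

class DiagGaussian:
    """Diagonal-covariance Gaussian with the pre-computable parts of
    EQN. 2 cached at construction time (i.e., before decoding)."""

    def __init__(self, mean, var):
        mean = np.asarray(mean, dtype=float)
        var = np.asarray(var, dtype=float)
        n = mean.size
        # Pre-computed constant: -1/2 (N log 2*pi + sum_i log sigma_i^2),
        # plus the feature-independent part of the quadratic term.
        self.const = -0.5 * (n * np.log(2 * np.pi) + np.log(var).sum()
                             + np.sum(mean ** 2 / var))
        # Linear and quadratic weights, so the per-frame work reduces
        # to dot products against x and x**2.
        self.w_lin = mean / var          # coefficient of x_i
        self.w_quad = -0.5 / var         # coefficient of x_i^2

    def log_likelihood(self, x):
        x = np.asarray(x, dtype=float)
        return self.const + self.w_lin @ x + self.w_quad @ (x * x)


# Quick check against a direct evaluation of EQN. 2.
g = DiagGaussian(mean=[1.0, -2.0], var=[0.5, 2.0])
x = np.array([0.3, -1.1])
direct = (-0.5 * (2 * np.log(2 * np.pi) + np.log([0.5, 2.0]).sum())
          - 0.5 * np.sum((x - [1.0, -2.0]) ** 2 / [0.5, 2.0]))
assert np.isclose(g.log_likelihood(x), direct)
```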
Feature space minimum phone error (fMPE) is a training technique that adopts the same objective function as traditional minimum phone error (MPE) techniques for transforming feature vectors during training and decoding.
If $x_t$ denotes the original feature vector at time $t$, then the fMPE-transformed feature vector is:

$$y_t = x_t + M h_t \quad \text{(EQN. 3)}$$
where $h_t$ is a high-dimensional posterior probability vector, and $M$ is a matrix mapping $h_t$ onto the lower-dimensional feature space. The projection matrix $M$ is trained to optimize the MPE criterion. The posterior probability vector $h_t$ is computed by first evaluating the likelihood of the original feature vector over a large set of Gaussians (e.g., all of the Gaussians in the acoustic model) with no priors. Then, for each frame, the posterior probabilities of the contextual frames are also computed and concatenated with those of the specified frame to form the final posterior probability vector $h_t$. Although fMPE yields significant gains in recognition accuracy, it is, as noted above, computationally expensive when implemented naïvely, especially for real-time systems operating on portable devices.
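The following Python sketch illustrates EQN. 3 under assumed shapes: a 39-dimensional feature vector and a three-frame context window of 1000-dimensional posteriors. The window size, dimensionalities, and function names are assumptions made for illustration, and the random matrix merely stands in for a projection trained to the MPE criterion.

```python
import numpy as np

def fmpe_transform(x_t, posteriors, M):
    """EQN. 3: y_t = x_t + M h_t, where h_t concatenates the posterior
    vectors of the current frame and its context frames (a sketch of
    one plausible setup, not the patented configuration)."""
    h_t = np.concatenate(posteriors)     # high-dimensional posterior vector
    return x_t + M @ h_t                 # project to feature space and add

# Toy usage: 39-dim features, 3 context frames of 1000 posteriors each.
rng = np.random.default_rng(0)
x_t = rng.standard_normal(39)
posteriors = [rng.dirichlet(np.ones(1000)) for _ in range(3)]
M = rng.standard_normal((39, 3000)) * 0.01   # toy stand-in for trained M
y_t = fmpe_transform(x_t, posteriors, M)
```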
As illustrated, the system 100 comprises an input device 102, an analog-to-digital converter 104, a front-end processor 106, a pattern classifier 108, a confidence scorer 110, an output device 112, a plurality of acoustic models 114, and a plurality of language models 116. In alternative embodiments, one or more of these components may be optional. In further embodiments still, two or more of these components may be implemented as a single component.
The input device 102 receives input speech signals (e.g., user utterances). These input speech signals comprise data to be processed by the system 100. Thus, the input device 102 may include one or more of: a keyboard, a stylus, a mouse, a microphone, a camera, or a network interface (which allows the system 100 to receive input from remote devices).
The input device 102 is coupled to the analog-to-digital converter 104, which receives the input speech signal from the input device 102. The analog-to-digital converter 104 converts the analog form of the speech signal to a digital waveform. In an alternative embodiment, the speech signal may be digitized before it is provided to the input device 102; in this case, the analog-to-digital converter 104 is not necessary or may be bypassed during processing.
The analog-to-digital converter 104 is coupled to the front-end processor 106, which receives the waveforms from the analog-to-digital converter 104. The front-end processor 106 processes the waveform in accordance with one or more feature analysis techniques (e.g., spectral analysis). In addition, the front-end processor 106 may perform one or more pre-processing techniques (e.g., noise reduction, endpointing, etc.) prior to the feature analysis. The result of this processing is a set of feature vectors, one computed for each frame of the waveform. The front-end processor 106 is coupled to the pattern classifier 108 and delivers the feature vectors to the pattern classifier 108.
The pattern classifier 108 decodes the feature vectors into a string of words that is most likely to correspond to the input speech signal. To this end, the pattern classifier 108 performs decoding and/or searching in accordance with the feature vectors. In one embodiment, and at each frame, the pattern classifier 108 evaluates the corresponding feature vector for at least a subset of Gaussians in a Gaussian codebook (e.g., in accordance with fMPE). In one embodiment, the feature vectors are evaluated using a hierarchical Gaussian shortlist that comprises a subset of the Gaussians in the Gaussian codebook.
In one embodiment, the pattern classifier 108 also performs a search (e.g., a Viterbi search) guided by the acoustic models 114 and the language models 116. This search produces an acoustic model score and a language model score for each hypothesis or proposed string that may correspond to the waveform. The search may also make use of a hierarchical Gaussian shortlist.
The plurality of acoustic models 114 comprises statistical representations of the sounds that make up words. In one embodiment, at least some of the acoustic models comprise finite state networks, where each state comprises a Gaussian mixture model (GMM) that models the statistical representation for an associated sound. In a further embodiment, the finite state networks are weighted.
The plurality of language models 116 comprises probabilities (e.g., in the form of distributions) of sequences of words (e.g., N-grams). Different language models may be associated with different languages, contexts, and applications. In one embodiment, at least some of the language models 116 are grammar files containing predefined combinations of words.
The confidence scorer 110 is coupled to the pattern classifier 108 and receives the string from the pattern classifier 108. The confidence scorer 110 assigns a confidence score to each word in the string before delivering the string and the confidence scores to the output device 112.
The output device 112 is coupled to the confidence scorer 110 and receives the string and confidence scores from the confidence scorer 110. The output device 112 delivers the system output (e.g., textual transcriptions of the input speech signal) to a user or to another device or system. Thus, in one embodiment, the output device 112 comprises one or more of the following: a display, a speaker, a haptic device, or a network interface (which allows the system 100 to send outputs to a remote device).
As discussed above, the system 100 makes use of a set of hierarchical Gaussian shortlists.
The indexing layer comprises a plurality of indexing Gaussians 202_1-202_n (hereinafter collectively referred to as "indexing Gaussians 202"). Each indexing Gaussian 202 corresponds to a cluster 204_1-204_n (hereinafter collectively referred to as "clusters 204") in the Gaussian layer. Thus, each indexing Gaussian 202 may be considered a parent of its corresponding cluster 204.
In one embodiment, the acoustic space is divided into a number of partitions, and a hierarchical Gaussian shortlist such as the hierarchical Gaussian shortlist 200 is built for each partition. The hierarchical Gaussian shortlist 200 for a given partition specifies the subset of Gaussians that are expected to have high likelihood values in the given partition.
In one embodiment, the acoustic space is divided into the partitions using vector quantization (VQ); thus, the partitions may also be referred to as VQ regions. VQ codebooks are then organized as a tree to quickly locate the VQ region within which a given feature vector falls. Next, one list of Gaussians is created for each combination (v, s) of VQ region v and tied acoustic state s. In one embodiment, the list is created empirically by considering a sufficiently large amount of speech data. For each acoustic observation, every Gaussian is evaluated. The Gaussians whose likelihoods are within a predefined threshold of the most likely Gaussian are then added to the list for the combination (v, s) of VQ region and acoustic state. In one embodiment, a minimum size is enforced for each shortlist in order to ensure that there are no empty shortlists.
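One plausible realization of this empirical construction is sketched below in Python. The container layout, the names (`build_shortlists`, `observations`, `loglik_fn`), the beam value, and the back-off padding used to enforce the minimum size are illustrative assumptions, not the specific procedure of the invention.

```python
import numpy as np
from collections import defaultdict

def build_shortlists(observations, loglik_fn, n_gaussians,
                     beam=10.0, min_size=4):
    """Empirically build one Gaussian shortlist per (VQ region, state).

    A sketch under assumptions: `observations` yields (vq_region, state,
    feature) triples drawn from a large speech corpus, `loglik_fn(g, x)`
    scores Gaussian g on feature x, and `beam` is the log-domain
    threshold below the most likely Gaussian.
    """
    shortlists = defaultdict(set)
    freq = np.zeros(n_gaussians)
    for v, s, x in observations:
        scores = np.array([loglik_fn(g, x) for g in range(n_gaussians)])
        best = scores.max()
        freq[int(np.argmax(scores))] += 1
        # Keep every Gaussian within `beam` of the most likely one.
        for g in np.nonzero(scores >= best - beam)[0]:
            shortlists[(v, s)].add(int(g))
    # Enforce a minimum size so that no shortlist is empty: pad with
    # the Gaussians that won most often over the whole corpus.
    backoff = [int(g) for g in np.argsort(-freq)[:min_size]]
    for sl in shortlists.values():
        for g in backoff:
            if len(sl) >= min_size:
                break
            sl.add(g)
    return {k: sorted(v) for k, v in shortlists.items()}
```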
The hierarchical Gaussian shortlist 200 is not directly plotted. Rather, it is illustrated schematically in the accompanying figure.
The method 300 is initialized in step 302 and proceeds to step 304, where the input device 102 acquires a speech signal (e.g., a user utterance). In optional step 306 (illustrated in phantom), the analog-to-digital converter 104 digitizes the speech signal, if necessary, to generate a waveform. In instances where the speech signal is acquired in digital form, digitization by the analog-to-digital converter 104 will not be necessary.
In step 308, the front-end processor 106 processes the frames of the waveform to produce a plurality of feature vectors. As discussed above, the feature vectors are produced on a frame-by-frame basis.
In step 310, the pattern classifier 108 performs a search (e.g., a Viterbi search) in accordance with the feature vectors and with the language models 116. The ultimate result of the search comprises one or more hypotheses (e.g., strings of words) representing the possible content of the speech signal. Each hypothesis is associated with a likelihood that it is the correct hypothesis. In one embodiment, the likelihood is based on a language model score and an acoustic model score.
In one embodiment, the acoustic model score is calculated using hierarchical Gaussian shortlists, as discussed above. In accordance with this embodiment, some states of a given acoustic model (finite state network) are active, and some states are not active. Each feature vector for each frame of the waveform is evaluated against only the active states of the acoustic model.
Specifically, the first step in generating the acoustic model score is to identify, in accordance with a given feature vector, the VQ region most closely associated with the corresponding frame from which the feature vector came. The identified VQ region is then used to guide evaluation of the Gaussians in the Gaussian codebook.
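A minimal Python sketch of the VQ-tree descent described above follows; the dict-based binary-tree layout and the field names are assumptions made for illustration.

```python
import numpy as np

def find_vq_region(x, tree):
    """Descend a VQ tree to locate the region containing feature x.

    Assumed layout (sketch only): leaves are {'region': id}; internal
    nodes are {'left': node, 'right': node, 'c_left': vec, 'c_right':
    vec}, where c_left / c_right are the child codeword centroids.
    """
    node = tree
    while 'region' not in node:
        d_left = np.sum((x - node['c_left']) ** 2)
        d_right = np.sum((x - node['c_right']) ** 2)
        node = node['left'] if d_left <= d_right else node['right']
    return node['region']

# Toy one-level tree over a 2-D feature space.
tree = {
    'c_left': np.array([-1.0, 0.0]), 'left': {'region': 0},
    'c_right': np.array([1.0, 0.0]), 'right': {'region': 1},
}
assert find_vq_region(np.array([-0.4, 0.2]), tree) == 0
```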
Referring again to the hierarchical Gaussian shortlist 200, the feature vector is first evaluated against a shortlist of indexing Gaussians 202 associated with the identified VQ region. The x most likely indexing Gaussians 202 from this evaluation are selected, and only the clusters 204 corresponding to these x indexing Gaussians 202 are subjected to further evaluation.
The further evaluation again comprises evaluation against shortlists. Specifically, each cluster 204 associated with each of the x indexing Gaussians 202 is arranged as a shortlist. This may be referred to as a “Gaussian layer shortlist.” The Gaussian layer shortlist comprises the most probable Gaussians within the associated cluster 204 for the VQ region associated with the given feature vector. In one embodiment, a Gaussian layer shortlist is built for each combination of VQ region and cluster 204. In each cluster 204 that is selected for further evaluation, only the Gaussians in the cluster's Gaussian layer shortlist are evaluated against the feature vector. In this way, Gaussian likelihood computation is limited to a relatively small number of Gaussians in both the indexing layer and the lower Gaussian layer.
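The two-layer evaluation can be sketched as follows in Python. The container names (`index_shortlists`, `cluster_shortlists`), the scoring-function interface, and the default value of x are illustrative assumptions; only the Gaussians actually evaluated are returned, with all others left to be floored by the caller.

```python
def hierarchical_eval(x, v, index_shortlists, indexing, gaussians,
                      cluster_shortlists, top_x=3):
    """Two-layer evaluation of feature x against the shortlist hierarchy.

    Assumed containers (sketch only): `index_shortlists[v]` lists the
    indexing Gaussians shortlisted for VQ region v; `indexing[c]` and
    `gaussians[g]` are callables returning log-likelihoods; and
    `cluster_shortlists[(v, c)]` is the Gaussian-layer shortlist for
    VQ region v and cluster c.
    """
    # Indexing layer: evaluate only the shortlisted indexing Gaussians
    # for region v, then keep the top_x most likely ones.
    candidates = index_shortlists.get(v, list(indexing))
    idx_scores = {c: indexing[c](x) for c in candidates}
    selected = sorted(idx_scores, key=idx_scores.get, reverse=True)[:top_x]
    # Gaussian layer: within each selected cluster, evaluate only the
    # Gaussians on that cluster's shortlist for region v.
    out = {}
    for c in selected:
        for g in cluster_shortlists.get((v, c), ()):
            out[g] = gaussians[g](x)
    return out
```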
When likelihoods have been generated for each of the hypotheses, the method 300 proceeds to optional step 312, where the confidence scorer 110 estimates the confidence levels of the hypotheses, and optionally corrects words in the hypotheses based on word-level posterior probabilities. The output device 112 then outputs at least one of the hypotheses (e.g., as a text transcription of the speech signal) in step 314.
The method 300 terminates in step 316.
Alternatively, embodiments of the present invention (e.g., the likelihood computation module 405) can be represented by one or more software applications (or even a combination of software and hardware, e.g., using Application Specific Integrated Circuits (ASIC)), where the software is loaded from a storage medium (e.g., I/O devices 406) and operated by the processor 402 in the memory 404 of the general purpose computing device 400. Thus, in one embodiment, the likelihood computation module 405 for computing Gaussian likelihoods described herein with reference to the preceding Figures can be stored on a non-transitory computer readable medium (e.g., RAM, magnetic or optical drive or diskette, and the like).
It should be noted that although not explicitly specified, one or more steps of the methods described herein may include a storing, displaying and/or outputting step as required for a particular application. In other words, any data, records, fields, and/or intermediate results discussed in the methods can be stored, displayed, and/or outputted to another device as required for a particular application. Furthermore, steps or blocks in the accompanying Figures that recite a determining operation or involve a decision, do not necessarily require that both branches of the determining operation be practiced. In other words, one of the branches of the determining operation can be deemed as an optional step.
Although various embodiments which incorporate the teachings of the present invention have been shown and described in detail herein, those skilled in the art can readily devise many other varied embodiments that still incorporate these teachings.
This invention was made with Government support under contract no. NBCHD040058 awarded by the Department of the Interior. The Government has certain rights in this invention.