The present invention relates to the electrical, electronic and computer arts, and, more particularly, to automatic speech recognition (ASR) and the like.
ASR is now present on a wide variety of platforms, ranging from huge server farms to small embedded devices. Each platform has its own resource limitations and this variety of requirements makes building acoustic models (AMs) for all these platforms a delicate task of balancing compromises between model size, decoding speed, and accuracy. In practice, AMs are custom built for each of these platforms. Furthermore, in ASR, it is often desirable to dynamically adjust the complexity of models depending on the load on the machine on which recognition is being carried out. In times of heavy use, it is desirable to use models of low complexity so that a larger number of engines can be run. On the other hand, when the resources are available, it is feasible to use more complex and more accurate models. Current techniques rely on building and storing multiple models of varying complexity. This causes an overhead in model development time. It also results in storage cost at run time. Furthermore, we are limited in complexity control by the number of models available.
Principles of the invention provide techniques for model restructuring for client and server based automatic speech recognition. In one aspect, an exemplary method includes the step of obtaining access to a large reference acoustic model for automatic speech recognition. The large reference acoustic model has L states modeled by L mixture models, and the large reference acoustic model has N components. Another step includes identifying a desired number of components Nc, less than N, to be used in a restructured acoustic model derived from the reference acoustic model. The desired number of components Nc is selected based on a computing environment in which the restructured acoustic model is to be deployed. The restructured acoustic model also has L states. Yet another step includes, for each given one of the L mixture models in the reference acoustic model, building a merge sequence which records, for a given cost function, sequential mergers of pairs of the components associated with the given one of the mixture models. Further steps include assigning a portion of the Nc components to each of the L states in the restructured acoustic model; and building the restructured acoustic model by, for each given one of the L states in the restructured acoustic model, applying the merge sequence to a corresponding one of the L mixture models in the reference acoustic model until the portion of the Nc components assigned to the given one of the L states is achieved (note that, as discussed below, the merge can, in some instances, be done “all at once”).
As used herein, “facilitating” an action includes performing the action, making the action easier, helping to carry the action out, or causing the action to be performed. Thus, by way of example and not limitation, instructions executing on one processor might facilitate an action carried out by instructions executing on a remote processor, by sending appropriate data or commands to cause or aid the action to be performed. For the avoidance of doubt, where an actor facilitates an action by other than performing the action, the action is nevertheless performed by some entity or combination of entities.
One or more embodiments of the invention or elements thereof can be implemented in the form of a computer product including a computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more embodiments of the invention or elements thereof can be implemented in the form of a system (or apparatus) including a memory, and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more embodiments of the invention or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s), or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer-readable recordable storage medium (or multiple such media).
Techniques of the present invention can provide substantial beneficial technical effects. For example, one or more embodiments may provide one or more of the following advantages:
These and other features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
As noted, ASR is now present on a wide variety of platforms, ranging from huge server farms to small embedded devices. Each platform has its own resource limitations and this variety of requirements makes building acoustic models (AMs) for all these platforms a delicate task of balancing compromises between model size, decoding speed, and accuracy. In practice, AMs are custom built for each of these platforms. Furthermore, in ASR, it is often desirable to dynamically adjust the complexity of models depending on the load on the machine on which recognition is being carried out. In times of heavy use, it is desirable to use models of low complexity so that a larger number of engines can be run. On the other hand, when the resources are available, it is feasible to use more complex and more accurate models. Current techniques rely on building and storing multiple models of varying complexity.
Advantageously, one or more embodiments reduce overhead in model development time, as compared to current techniques; reduce storage cost at run time; and/or allow for enhanced complexity control, as compared to current techniques, which are typically limited in complexity control by the number of models available.
One or more embodiments employ model restructuring. In this approach, a large reference model is built from data. Smaller models are then derived from this reference model in such a way that the smaller models approximate the distribution specified by the large model. This has several advantages over the current known approaches—it does not incur the extra build time or storage cost. Smaller models can be constructed on demand. One or more embodiments allow an almost continuous control over model complexity.
Note that one or more embodiments are described in the context of Gaussian mixture models but other approaches are possible; e.g., exponential densities as described in Pierre L. Dognin et al., “Restructuring Exponential Family Mixture Models,” Interspeech 2010, September 2010, pp. 62-65.
An issue often encountered in probabilistic modeling is restructuring a model to change its number of components, parameter sharing, or some other structural constraints. Automatic Speech Recognition (ASR) has become ubiquitous and building acoustic models (AMs) that can retain good performance while adjusting their size to application requirements is a challenging problem. AMs are usually built around Gaussian mixture models (GMMs) that can each be restructured, directly impacting the properties of the overall AM. For instance, generating smaller AMs from a reference AM can simply be done by restructuring the underlying GMMs to have fewer components while best approximating the original GMMs. Maximizing the likelihood of a restructured model under the reference model is equivalent to minimizing their Kullback-Leibler (KL) divergence. For GMMs, this is analytically intractable. However, a lower bound to the likelihood can be maximized and a variational expectation-maximization (EM) technique can be derived. Using variational KL divergence and variational EM in the task of AM clustering, one or more embodiments define a greedy clustering technique that can build, on demand, clustered models of any size from a reference model. Non-limiting experimental results show that clustered models are on average within 2.7% of the word error rates (WERs) for equivalent models built from data, and within 9% for a model 20 times smaller than the reference model. This makes clustering a reference model a viable alternative to training models from data.
It is often desirable in probabilistic modeling to approximate one model using another model with a different structure. By changing the number of components or parameters, by sharing parameters differently, or simply by modifying some other constraints, a model can be restructured to better suit the requirements of an application. Model restructuring is particularly relevant to the field of Automatic Speech Recognition (ASR) since ASR is now present on a wide variety of platforms, ranging from huge server farms to small embedded devices. Each platform has its own resource limitations and this variety of requirements makes building acoustic models (AMs) for all these platforms a delicate task of balancing compromises between model size, decoding speed, and accuracy. It is believed that model restructuring can alleviate issues with building AMs for various platforms.
Clustering, parameter sharing, and hierarchy building are three different examples of AM restructuring commonly used in ASR. For instance, clustering an AM to reduce its size, despite a moderate degradation in recognition performance, has an impact on both ends of the platform spectrum. For servers, it means more models can fit into memory and more engines can run simultaneously sharing these models. For mobile devices, models can be custom-built to exactly fit into memory. Sharing parameters across components is another restructuring technique that efficiently provides smaller models with adjustable performance degradation. Restructuring an AM by using a hierarchy speeds up the search for the most likely components and directly impacts recognition speed and accuracy. One or more embodiments cluster models efficiently.
When restructuring models, an issue that arises is measuring similarity between the reference and the restructured model. Minimizing the Kullback-Leibler (KL) divergence (reference is made to S. Kullback, Information Theory and Statistics, Dover Publications, Mineola, N.Y., 1997) between these two models is equivalent to maximizing the likelihood of the restructured model under data drawn from the reference model. Unfortunately, this is intractable for Gaussian Mixture Models (GMMs), a core component of AMs, without resorting to expensive Monte Carlo techniques. However, it is possible to maximize a variational lower bound to the likelihood and derive from it a variational KL divergence (see Pierre L. Dognin et al., “Refactoring acoustic models using variational density approximation,” in ICASSP, April 2009, pp. 4473-4476) as well as an Expectation-Maximization (EM) technique (see Pierre L. Dognin et al., “Refactoring acoustic models using variational expectation maximization,” in Interspeech, September 2009, pp. 212-215) that will update the parameters of a model to better match a reference model. Both variational KL divergence and variational EM can be components of model restructuring. One or more embodiments provide a greedy clustering technique, based on these methods, which provides clustered and refined models of any size, on demand.
For other approaches, based on minimizing the mean-squared error between the two density functions, see Kai Zhang and James T. Kwok, “Simplifying mixture models through function approximation,” in NIPS 19, pp. 1577-1584, MIT Press, 2007; or based on compression using dimension-wise tied Gaussians optimized using symmetric KL divergences, see Xiao-Bing Li et al., “Optimal clustering and non-uniform allocation of Gaussian kernels in scalar dimension for HMM compression,” in ICASSP, March 2005, pp. 669-672.
Non-limiting experimental results show that, starting with a large acoustic model, smaller models can be derived that achieve word error rates (WERs) almost identical to those for similar size models trained from data. This reduces or eliminates the need to train models of every desired size.
Models
Acoustic models are typically structured around phonetic states and take advantage of phonetic context while modeling observations. Let A be an acoustic model composed of L context dependent (CD) states. L is chosen at training time and typically ranges from a few hundred to a few thousand. Each CD state s uses a GMM fs with Ns components, resulting in A having N=ΣsNs Gaussians. A GMM f with continuous observation x∈ℝd is specified as:
where a indexes the components of f, πa is the prior probability, and N(x; μa, Σa) is a Gaussian in x with mean vector μa and covariance matrix Σa, which is symmetric and positive definite. In general, Σa is a full matrix, but in practice it is typically chosen to be diagonal for computation and storage efficiency.
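The display equation described by the two preceding sentences is presumably the standard GMM density. A reconstruction from the definitions given there (component index a, prior πa, Gaussian with mean μa and covariance Σa) is sketched below; the component shorthand f_a, the equation number (inferred from the later equation references), and the typesetting are assumptions:

```latex
f(x) \;=\; \sum_{a} \pi_a \, f_a(x)
     \;=\; \sum_{a} \pi_a \, \mathcal{N}(x;\, \mu_a, \Sigma_a),
\qquad x \in \mathbb{R}^{d}
\tag{1}
```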
Variational KL Divergence
The KL divergence (see Kullback reference) is a commonly used measure of dissimilarity between two probability density functions (pdfs) f(x) and g(x),
where L(f∥g) is the expected log likelihood of g under f,
In the case of two GMMs f and g, the expression for L(f∥g) becomes:
where the integral ∫fa log Σbωbgb is analytically intractable. As a consequence, DKL(f∥g) is intractable for GMMs.
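The display equations that the preceding sentences introduce can be reconstructed, with some confidence, from the surrounding definitions: the later text uses equation (3) as the identity DKL(f∥g)=L(f∥f)−L(f∥g) and equation (5) as the GMM form of L(f∥g) containing the integral quoted above. The following is a sketch under those assumptions; the numbering is inferred from those references, and ω_b denotes the prior of component g_b of g:

```latex
\begin{align}
D_{KL}(f \,\|\, g) &= \int f(x) \,\log \frac{f(x)}{g(x)} \, dx \tag{2} \\
                   &= L(f \,\|\, f) - L(f \,\|\, g), \tag{3} \\
L(f \,\|\, g)      &= \int f(x) \,\log g(x) \, dx, \tag{4} \\
L(f \,\|\, g)      &= \sum_{a} \pi_a \int f_a(x) \,\log \sum_{b} \omega_b \, g_b(x) \, dx. \tag{5}
\end{align}
```

The sum inside the logarithm in (5) is what prevents a closed form, even though each pairwise term between single Gaussians has one.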
One solution presented in the Dognin Interspeech reference provides a variational approximation to DKL(f∥g). This is done by first providing variational approximations to L(f∥f) and L(f∥g) and then using equation (3). In order to define a variational approximation to equation (5), variational parameters φb|a are introduced as a measure of the affinity between the Gaussian component fa of f and component gb of g. The variational parameters must satisfy the constraints:
By using Jensen's inequality, a lower bound is obtained for equation (5),
The lower bound on L(f∥g), given by the variational approximation Lφ(f∥g), can be maximized with respect to (w.r.t.) φ and the best bound is given by:
By substituting φ̂b|a in equation (7), the following expression for Lφ̂(f∥g) is obtained:
Lφ̂(f∥g) is the best variational approximation of the expected log likelihood L(f∥g) and is referred to as the variational likelihood. Similarly, the variational likelihood Lφ̂(f∥f), which maximizes a lower bound on L(f∥f), is:
The variational KL divergence KL(f∥g) is obtained directly from equations (10) and (11) since KL(f∥g)=Lφ̂(f∥f)−Lφ̂(f∥g),
where KL(f∥g) is based on the KL divergences between all individual components of f and g.
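As a rough illustration of how the variational KL divergence can be evaluated from pairwise component divergences, the sketch below assumes diagonal covariances and assumes the commonly published form of the approximation, KL(f∥g) = Σa πa log( Σa′ πa′ exp(−DKL(fa∥fa′)) / Σb ωb exp(−DKL(fa∥gb)) ). The function names and the array layout (one row per component) are illustrative choices, not taken from this document.

```python
import numpy as np

def pairwise_kl(mu_f, var_f, mu_g, var_g):
    """K[a, b] = D_KL(f_a || g_b) for diagonal Gaussians (closed form)."""
    mf, vf = mu_f[:, None, :], var_f[:, None, :]
    mg, vg = mu_g[None, :, :], var_g[None, :, :]
    return 0.5 * np.sum(np.log(vg / vf) + (vf + (mf - mg) ** 2) / vg - 1.0, axis=2)

def log_weighted_sum(log_w, kl):
    """log sum_b w_b * exp(-kl[a, b]) for each row a, computed stably."""
    z = log_w[None, :] - kl
    top = z.max(axis=1, keepdims=True)
    return (top + np.log(np.exp(z - top).sum(axis=1, keepdims=True)))[:, 0]

def variational_kl(pi, mu_f, var_f, omega, mu_g, var_g):
    """Variational approximation to D_KL(f || g) between two diagonal GMMs."""
    self_term = log_weighted_sum(np.log(pi), pairwise_kl(mu_f, var_f, mu_f, var_f))
    cross_term = log_weighted_sum(np.log(omega), pairwise_kl(mu_f, var_f, mu_g, var_g))
    return float(np.sum(pi * (self_term - cross_term)))

# Single-component case: the approximation reduces to the exact Gaussian KL.
one = np.array([1.0])
kl_single = variational_kl(one, np.array([[0.0]]), np.array([[1.0]]),
                           one, np.array([[1.0]]), np.array([[1.0]]))
```

When both mixtures have a single component the two log-sum terms collapse and the result is the exact closed-form divergence between the two Gaussians, which gives a convenient sanity check.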
Generalized Variational KL Divergence
One or more embodiments extend the variational KL to weighted densities
For pdfs f and g, ∫f=∫g=1 and equation (14) yields DKL(f∥g) as expected. If
where KL(
Weighted Local Maximum Likelihood
Consider a weighted GMM IQ where
An especially useful case of the generalized variational KL divergence is KL(
where g=
KL(
In this special case, it is clear from equation (20) that KL(
Variational Expectation-Maximization
In the context of restructuring models, the variational KL divergence KL(f∥g) can be minimized by updating the parameters of the restructured model g to match the reference model f. Since the variational KL divergence KL(f∥g) gives an approximation to DKL(f∥g), KL(f∥g) can be minimized w.r.t. the parameters of g, {πb, μb, Σb}. It is sufficient to maximize Lφ(f∥g), as Lφ(f∥f) is constant in g. Although equation (10) is not easily maximized w.r.t. the parameters of g, Lφ(f∥g) in equation (7) can be maximized, leading to an Expectation-Maximization (EM) technique.
In one or more embodiments, it is desirable to maximize Lφ(f∥g) w.r.t. φ and the parameters {πb, μb, Σb} of g. This can be achieved by defining a variational Expectation-Maximization (varEM) technique that first maximizes Lφ(f∥g) w.r.t. φ and then, with φ fixed, maximizes Lφ(f∥g) w.r.t. the parameters of g. Previously, the best lower bound on L(f∥g), namely Lφ̂(f∥g), was found to be given by φ̂b|a in equation (9). This is the expectation (E) step:
For a fixed φb|a=φ̂b|a, it is now possible to find the parameters of g that maximize Lφ(f∥g). The maximization (M) step is:
The technique alternates between the E-step and M-step, increasing the variational likelihood in each step.
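The alternation can be sketched as follows for diagonal-covariance GMMs. The update formulas used here are the commonly published ones and are an assumption about the E-step and M-step equations referenced above: the E-step sets φb|a proportional to the prior of gb times exp(−DKL(fa∥gb)), and the M-step moment-matches each gb to the components of f weighted by πa φb|a. All function names are illustrative.

```python
import numpy as np

def pairwise_kl(mu_f, var_f, mu_g, var_g):
    """K[a, b] = D_KL(f_a || g_b) for diagonal Gaussians (closed form)."""
    mf, vf = mu_f[:, None, :], var_f[:, None, :]
    mg, vg = mu_g[None, :, :], var_g[None, :, :]
    return 0.5 * np.sum(np.log(vg / vf) + (vf + (mf - mg) ** 2) / vg - 1.0, axis=2)

def e_step(mu_f, var_f, omega, mu_g, var_g):
    """phi[a, b] proportional to omega_b * exp(-D_KL(f_a || g_b)); rows sum to 1."""
    logits = np.log(omega)[None, :] - pairwise_kl(mu_f, var_f, mu_g, var_g)
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the exponentials
    phi = np.exp(logits)
    return phi / phi.sum(axis=1, keepdims=True)

def m_step(pi, mu_f, var_f, phi):
    """Moment-match each g_b to the components of f weighted by pi_a * phi[a, b]."""
    w = pi[:, None] * phi                        # shape (A, B)
    omega = w.sum(axis=0)                        # assumed > 0 for every b
    mu_g = (w.T @ mu_f) / omega[:, None]
    second_moment = (w.T @ (var_f + mu_f ** 2)) / omega[:, None]
    return omega, mu_g, second_moment - mu_g ** 2

def var_em(pi, mu_f, var_f, omega, mu_g, var_g, iters=20):
    """Alternate E- and M-steps for a fixed number of iterations."""
    for _ in range(iters):
        phi = e_step(mu_f, var_f, omega, mu_g, var_g)
        omega, mu_g, var_g = m_step(pi, mu_f, var_f, phi)
    return omega, mu_g, var_g
```

A fixed iteration count is used only to keep the sketch short; a practical implementation would more likely stop when the variational likelihood no longer improves.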
Discrete Variational EM
If φb|a is constrained to {0,1}, this provides a hard assignment of the components of f to the components of g. Let Φb|a be the constrained φb|a. In the constrained E-step, a given a is assigned to the b for which φb|a is greatest. That is, find b̂=arg maxb φb|a, and set Φb̂|a=1 and Φb|a=0 for all b≠b̂. In the rare case where several components attain the same value maxbφb|a, choose the smallest index of all these components as b̂. The M-step remains the same, and the resulting gb is the maximum likelihood Gaussian given the subset Q of component indices from f provided by Φ; equations (23)-(25) are then similar to the merge steps in (16)-(18), with Φ providing the subset Q. This is called the discrete variational EM (discrete varEM).
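The constrained E-step described above amounts to a row-wise arg max with a smallest-index tie-break. A minimal sketch follows, assuming the soft affinities are already available as a matrix with one row per component of f; the helper names are hypothetical.

```python
import numpy as np

def discrete_e_step(phi):
    """Convert soft affinities phi[a, b] into hard 0/1 assignments.

    np.argmax returns the first maximal index along the axis, which matches
    the smallest-index tie-break described in the text.
    """
    phi = np.asarray(phi, dtype=float)
    best = np.argmax(phi, axis=1)
    hard = np.zeros_like(phi)
    hard[np.arange(phi.shape[0]), best] = 1.0
    return hard

def subsets(hard):
    """For each component b of g, list the indices a of f assigned to it."""
    return {b: np.flatnonzero(hard[:, b]).tolist() for b in range(hard.shape[1])}

example = discrete_e_step([[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]])
```

Feeding the resulting 0/1 matrix into an unchanged M-step yields, for each gb, the moment-matched Gaussian of its assigned subset. Any gb left with an empty subset would need separate handling, which this sketch does not attempt.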
Model Clustering
Clustering down an acoustic model A of size N into a model Ac of target size Nc means reducing the overall number of Gaussian components. In the reference model A, each state s uses a GMM fs with Ns components to model observations. Create a new model Ac by clustering down each fs independently into fsc of size Nsc such that 1≦Nsc≦Ns. The final clustered model Ac has Nc Gaussian components with Nc=ΣsNsc≦N.
A greedy approach is taken to produce fsc from fs, which finds the best sequence of merges to perform within each fs so that Ac reaches the target size Nc. This procedure can be divided into two independent parts: 1) cluster down fs optimally to any size; and 2) define a criterion to decide the optimal target size Nsc for each fs under the constraint Nc=ΣsNsc. Once the best sequence of merges and Nsc are known, it is straightforward to produce fsc by using equations (16)-(18).
Greedy Clustering and Merge-Sequence Building
The greedy technique used in one or more embodiments clusters down fs by sequentially merging component pairs that are similar. For a given cost function, this sequence of merges is deterministic and unique. It is therefore possible to cluster fs all the way to one final component, while recording each merge and its cost into a merge-sequence S(fs).
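A minimal sketch of building and replaying such a merge sequence is given below for diagonal-covariance components. Two assumptions are made and should be read as such: the merge itself is taken to be moment matching, and the cost is a stand-in, namely the weighted sum of KL divergences from each merged component to the merged Gaussian, which is not necessarily identical to the wLML cost used in the embodiments. All names are illustrative.

```python
import numpy as np

def merge(w1, m1, v1, w2, m2, v2):
    """Moment-matched merge of two weighted diagonal Gaussians."""
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    v = (w1 * (v1 + (m1 - m) ** 2) + w2 * (v2 + (m2 - m) ** 2)) / w
    return w, m, v

def merge_cost(c1, c2):
    """Stand-in cost: sum_i w_i * D_KL(f_i || merged), in closed form."""
    (w1, _, v1), (w2, _, v2) = c1, c2
    w, _, v = merge(*c1, *c2)
    return float(0.5 * (w * np.log(v).sum() - w1 * np.log(v1).sum() - w2 * np.log(v2).sum()))

def as_components(weights, means, variances):
    return {i: (float(w), np.asarray(m, float), np.asarray(v, float))
            for i, (w, m, v) in enumerate(zip(weights, means, variances))}

def build_merge_sequence(weights, means, variances):
    """Greedily merge the cheapest pair until one component is left.

    Returns a list of (i, j, new_id, cost) records, one per merge step.
    """
    comps = as_components(weights, means, variances)
    next_id, sequence = len(comps), []
    while len(comps) > 1:
        ids = sorted(comps)
        cost, i, j = min((merge_cost(comps[p], comps[q]), p, q)
                         for n, p in enumerate(ids) for q in ids[n + 1:])
        comps[next_id] = merge(*comps.pop(i), *comps.pop(j))
        sequence.append((i, j, next_id, cost))
        next_id += 1
    return sequence

def apply_merge_sequence(weights, means, variances, sequence, k):
    """Replay the first k recorded merges; returns the remaining components."""
    comps = as_components(weights, means, variances)
    for i, j, new_id, _ in sequence[:k]:
        comps[new_id] = merge(*comps.pop(i), *comps.pop(j))
    return [comps[i] for i in sorted(comps)]
```

With this stand-in cost, multiplying all component weights by a common factor scales every recorded cost by the same factor and leaves the order of merges unchanged. The exhaustive pair search is quadratic per step and is kept only for clarity.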
Technique 1 depicted in
Given fs and S(fs), it is possible to generate clustered models fsc{ks} of any size simply by applying the sequence of merges recorded in S(fs) to fs all the way to the ks-th merge step. These clustered models span from the original fsc{0}=fs to a final fsc{Ns−1}. At each step ks, every new model fsc{ks} has one component less than fsc{ks−1}. Therefore, fsc of any target size Nsc can be generated from fs and S(fs). To generate fsc from fs, there exists another option equivalent to sequentially applying the best merges in S(fs). At each merge step ks, S(fs) can be analyzed to provide Qk
A useful property of the GC technique is that if a weight is applied to fs, the sequence of merges in S(fs) will remain unchanged; only their corresponding costs will be modified. Indeed, if λs is applied to fs so that
KL(λs
Therefore, applying a weight λs to each state s just impacts the wLML costs in S(fs).
If A is composed of L contextual states and N Gaussians, once S(fs) are built, models Ac of any size Nc can be created, such that L≦Nc≦N. The minimum size for Ac is L because one Gaussian is the smallest any fsc can be. Since S(fs) are independently computed across all states s, the only difference between two AMs with identical size Nc is their respective number of components Nsc. Finding good Gaussian assignments Nsc is significant for clustering models.
Gaussian Assignment
For any given target size Nc, many Ac can be produced, each of them with different Nsc for state s. Nsc are chosen so that Nc=ΣsNsc. However, choosing Nsc means putting all states into competition to find an optimal sharing of Nc components. This raises several issues. First, the selection of Nsc should take into account the distortion brought into the model A when creating Ac. Second, it may be desirable to use some prior information about each state s when determining Nsc. The first issue can be addressed by using the cost information recorded in S(fs) when selecting Nsc. These costs are directly related to the distortion brought into the model by each merge. The second issue can be addressed by applying a weight λs to all the costs recorded in S(fs) so as to amplify or decrease them compared to the costs for other states. If choosing Nsc is based on wLML costs across states, this will result in assigning more (or fewer) Gaussians to this state. λs can be chosen in many ways. For instance, during training time, each state is associated with some frames from the training data. The number of frames associated with state s divided by the total number of frames in the training data is the prior for state s and this prior can be used as λs. In practice, it is common that states modeling silence observe a large amount of the training data and therefore should be assigned greater Nsc than other states. λs can also be based on the language models or grammars used for decoding since they also reveal how often a state s is expected to be observed. Three assignment strategies are presented that address either one or both of the two issues raised in this section.
α-Assignment: Appropriately assigning a number of components to each fs is a problem first encountered during AM training. A common approach is to link Ns to the number of frames in the training data modeled by state s. Define cs as the count of frames in the training data that align to a particular state s. Ns is commonly (see Steve Young et al., The HTK Book (for HTK Version 3.3), Cambridge University Engineering Department, 2005) defined by using:
where β=0.2. Since training data can be largely composed of silence, β is empirically chosen to ensure that states modeling silence do not grab most of the N Gaussians, at the expense of states modeling actual speech. With β fixed, finding α is done with an iterative process. For clustering, cs can be obtained for A and this method provides Nsc given Nc. It is referred to as α-assignment, which is intrinsically sensitive to the state priors through cs. However, it does not account for distortion when producing Ac. Combining the greedy clustering with α-assignment is referred to as GC-α in the rest of this document.
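The iterative search for α can be sketched as a bisection. The sketch assumes that the assignment rule has the power-law form Ns = ⌊α·cs^β⌋, which is a guess at the equation referenced above, and adds two further assumptions of its own: each state keeps at least one Gaussian and at most its original Ns. The function and parameter names are hypothetical.

```python
import numpy as np

def alpha_assignment(counts, max_sizes, target, beta=0.2, iters=200):
    """Bisect on alpha so that the per-state sizes sum to (about) `target`.

    counts[s]    : number of training frames aligned to state s
    max_sizes[s] : N_s, the size of f_s in the reference model
    Returns integer sizes; because of rounding, the total is the smallest
    achievable value that is >= target, not necessarily target itself.
    """
    counts = np.asarray(counts, dtype=float)
    max_sizes = np.asarray(max_sizes, dtype=int)

    def sizes(alpha):
        raw = np.floor(alpha * counts ** beta).astype(int)
        return np.clip(raw, 1, max_sizes)

    lo, hi = 0.0, 1.0
    while sizes(hi).sum() < target and hi < 1e12:  # grow until the target is bracketed
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sizes(mid).sum() < target:
            lo = mid
        else:
            hi = mid
    return sizes(hi)

assigned = alpha_assignment([100000, 1000, 10], [50, 50, 50], target=30)
```

With β=0.2, a state seen 10,000 times more often than another receives only about six times as many Gaussians, which illustrates how the exponent damps the dominance of heavily observed states such as silence.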
Global Greedy Assignment: The global greedy assignment (GGA) extends the greedy approach used in building S(fs) to find the best sequence of merges across all states. S(fs) records the sequence of merges within fs, using the wLML cost function. GGA begins by setting the merge sequence index ks=1 for all states s. Then, GGA finds the best merge across all states by comparing the costs recorded in each S(fs)[ks]. For s′, the state hosting the next best merge, the merge sequence index ks′ is increased, ks′=ks′+1, to point to the next best merge within fs′. For each state s, GGA keeps track of which merge sequence index ks it must use to access the next best merge recorded in S(fs)[ks]. This simple technique iterates until the target Nc is reached, ultimately providing Nsc as the number of Gaussians left in each fsc. If a state s has only one component left before Nc is reached, it will be assigned only one Gaussian.
GGA is fast and requires very simple book-keeping. Indeed, no real merge occurs: at each iteration, GGA only needs to follow the sequence within each S(fs). GGA follows a merge sequence that tries, to some extent, to minimize merging distortion to the original model A. For GGA, each state s is equally likely to host the next best merge. However, it is straightforward to allow for different state priors λs by modifying the costs in S(fs)[ks] accordingly, as discussed earlier. Changing the state priors will change the sequence of best merges across states (but not within states). Combining GC with GGA is referred to as GC-GGA in this document.
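The book-keeping described above maps naturally onto a priority queue holding, for each state, only its next recorded merge. The sketch below assumes the per-state merge costs have already been extracted from the merge sequences into plain lists; the function name and the optional `priors` argument (standing in for λs) are illustrative.

```python
import heapq

def global_greedy_assignment(merge_costs, target, priors=None):
    """Pick per-state sizes by always taking the cheapest next merge overall.

    merge_costs[s][k] : cost of the (k+1)-th merge recorded for state s,
                        so state s starts with len(merge_costs[s]) + 1 Gaussians
    target            : desired total number of Gaussians
    priors            : optional per-state weights applied to the costs
    Returns the list of per-state sizes; no actual merge is performed.
    """
    if priors is None:
        priors = [1.0] * len(merge_costs)
    sizes = [len(costs) + 1 for costs in merge_costs]
    # One heap entry per state: (weighted cost of its next merge, state, index).
    heap = [(priors[s] * costs[0], s, 0) for s, costs in enumerate(merge_costs) if costs]
    heapq.heapify(heap)
    total = sum(sizes)
    while total > target and heap:
        _, s, k = heapq.heappop(heap)
        sizes[s] -= 1
        total -= 1
        if k + 1 < len(merge_costs[s]):  # states down to one Gaussian drop out
            heapq.heappush(heap, (priors[s] * merge_costs[s][k + 1], s, k + 1))
    return sizes

example_costs = [[0.1, 0.5, 2.0], [0.3, 0.4], [1.0]]
example_sizes = global_greedy_assignment(example_costs, target=6)
```

In this toy example the three cheapest next merges are taken in order (0.1 from the first state, then 0.3 and 0.4 from the second), leaving sizes of 3, 1 and 2.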
Viterbi Selection: Finding an optimal set of Nsc can be done by using a Viterbi procedure inspired by a different optimal allocation problem in Etienne Marcheret et al., “Optimal quantization and bit allocation for compressing large feature space transforms,” IEEE Workshop on Automatic Speech Recognition & Understanding 2009, pp. 64-69. Within each state s in A, fs is of size Ns. The size of fsc at each merge step ks in S(fs)[ks] is Ns−ks for 1≦ks≦Ns−1. The following Viterbi procedure finds the optimal merge step k*s so that Nc=Σs(Ns−k*s). For each state s, the cumulative wLML cost is required at each step ks such that
Each cost can be adjusted for weight λs if required. The procedure is:
This procedure gives the Gaussian assignments that minimize overall cost. It therefore optimizes the Gaussian assignment while simultaneously attempting to minimize model distortion. Again, λs can be taken into account simply by modifying the cumulative costs before running the Viterbi procedure. Combining the greedy clustering and Viterbi selection is referred to as GC-viterbi in this document.
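One way to realize such a selection is a dynamic program over states, where the running quantity is the total number of merges spent so far. The sketch below is consistent with the description (cumulative per-state costs, a fixed total budget, minimum overall cost) but is not claimed to be the listed procedure; the names are hypothetical and the per-state merge costs are assumed to be available as plain lists.

```python
import numpy as np

def viterbi_selection(merge_costs, target):
    """Choose how many merges k_s to apply per state so that the sizes sum to
    `target` and the total cumulative merge cost is minimal.

    merge_costs[s][k] : cost of the (k+1)-th merge recorded for state s
    Returns the per-state sizes N_s - k_s.
    """
    cum = [np.concatenate(([0.0], np.cumsum(c))) for c in merge_costs]  # cum[s][k]
    budget = sum(len(c) for c in cum) - target      # total merges to perform
    if not 0 <= budget <= sum(len(c) - 1 for c in cum):
        raise ValueError("target size is not reachable")
    best = np.full(budget + 1, np.inf)              # best[m]: min cost using m merges
    best[0] = 0.0
    back = []
    for c in cum:
        new = np.full(budget + 1, np.inf)
        choice = np.zeros(budget + 1, dtype=int)
        for m in range(budget + 1):
            for k in range(min(m, len(c) - 1) + 1):
                cand = best[m - k] + c[k]
                if cand < new[m]:
                    new[m], choice[m] = cand, k
        best = new
        back.append(choice)
    # Backtrack from the full budget to recover each state's merge count.
    m, ks = budget, []
    for choice in reversed(back):
        ks.append(int(choice[m]))
        m -= ks[-1]
    ks.reverse()
    return [len(c) - k for c, k in zip(cum, ks)]

sizes_a = viterbi_selection([[0.1, 0.5, 2.0], [0.3, 0.4], [1.0]], target=6)
sizes_b = viterbi_selection([[1.0, 0.0], [0.6, 0.6]], target=4)
```

The second example shows where a global optimum can differ from a step-by-step greedy choice: two merges in the first state cost 1.0 in total, whereas the two individually cheapest first steps (0.6 each in the second state) cost 1.2. Applying a state weight would amount to scaling each state's cost list before the call.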
Model Refinement
Once A is clustered down into Ac, it is possible to refine the parameters of Ac with varEM by using A as the model to match. Parameters of each fsc will be updated to minimize KL(fs∥fsc). For each state s, fs is used as the reference model and fsc is used as the initial model for varEM. At convergence, a new model Ar is obtained, composed of the set of fsr that minimizes ΣsKL(fs∥fsr). The motivation for refining Ac into Ar is that the greedy clustering changes the structure of A by decreasing Ns to Nsc following a sequence of merges with minimum local costs. However, it may be beneficial to use a global criterion to update parameters in Ac by allowing the parameters of fsc to better match fs, potentially recovering some distortion created within each fsc when merging components.
It is to be emphasized that the experimental results are intended to illustrate what can be expected in some embodiments of the invention, but are not intended to be limiting; other embodiments may yield different results. In one or more embodiments, one goal is to provide a clustered model Ac, refined or not, that can closely match the decoding performance of models trained from data, measured using WERs. The training set for the reference model includes 800 hours of US English data, with 10K speakers for a total of 800K utterances (4.3M words). It includes in-car speech in various noise conditions, recorded at 0, 30 and 60 mph with a 16 kHz sampling frequency. The test set is 39K utterances and contains 206K words. It is a set of 47 different tasks of in-car speech with various US regional accents.
The reference model A100K is a 100K Gaussians model built on the training data. A set of 91 phonemes is used, each phoneme modeled with a three-state left to right hidden Markov model. These states are modeled using two-phoneme left context dependencies, yielding a total of 1519 CD states. The acoustic models for these CD states are built on 40-dimensional features obtained using Linear Discriminant Analysis (LDA) combined with Semi Tied Covariance (STC) transformation. CD states are modeled using GMMs with 66 Gaussians on average. Training includes a sequence of 30 iterations of the EM technique where CD state alignments are re-estimated every few steps of EM. Twenty baseline models were built from training data using 5K, 10K, . . . , 100K Gaussians. All these models have different STCs and lie in different feature spaces. Since all clustered models are in the reference model feature space, for consistency, nineteen models were built using the 100K model's STC (100K-STC) from A5K to A95K. Differences in WERs for these models and the baseline are small, as shown in the table of
Baseline results show that the reference WER for A100K is 1.18%. WERs remain within 15% relative from 95K down to 40K, and then start increasing significantly below 25K. At 5K, WER has increased 110% relative to WER at 100K. For each Gaussian assignment strategy, GC was used with wLML to cluster A100K down to 5K, saving intermediate models every 5K Gaussians (95K, . . . , 5Kc), for a total of 19 clustered models for each of the GC-GGA, GC-α and GC-viterbi techniques. GC-GGA was the first technique implemented and showed promising results. WERs stay close to the 100K-STC results from 95K-65K (sometimes even slightly improving on them), but then diverge slowly afterward and more sharply below 45K. At 5K, GC-GGA gives 3.30% WER, within 30% relative to 2.53% given by A5K.
Results for a technique called GC-models are also reported. GC-models refers to taking the Gaussian assignments directly from the 100K-STC models trained from data. This gives the best assignment N*s chosen by the training procedure. GC-models results are consistently better than those of GC-GGA over the entire 5K-95K range. GC-models is an unrealistic technique as it is typically necessary to train models first to find Gaussian assignments to create clustered models of the same size. However, its results give a clear indication that Gaussian assignment is significant in order to cluster the reference model optimally, especially when creating small models. For 5K models, each fs has an average of only 3.3 Gaussians. A good assignment is important. One difference is that the training procedure, in one or more embodiments, allows for splitting and merging of Gaussians within a CD state during EM. Interestingly, GC-α gives similar WERs to GC-models for the entire 5K-95K range. This is not entirely surprising since it uses a similar criterion to assign Gaussians as in the training. However, only merging is allowed when clustering down A100K. From 45K-95K, GC-α matches or improves on 100K-STC results. Below 45K, a small divergence begins and, at 5K, GC-α gives 2.87%, within 13% of A5K, a clear improvement over GC-GGA at 30% of A5K.
GC-viterbi gives results equivalent to GC-α from 95K to 10K. For 10K, it is slightly better than GC-GGA, but it is almost the same as GC-GGA for 5K. This is counter intuitive as GC-viterbi may be expected to give an “optimal” assignment and therefore better WERs. However, after analysis of Ns given by GC-viterbi, it is clear that the states modeling silence have much smaller Nsc than for GC-α. In A100K, silence states have more Gaussians than any other states, and are likely to overlap more. Therefore, S(fs) for those states have merge steps with smaller costs than all other states. In at least some embodiments, adjusting for state priors is necessary when using GC-viterbi. Without it, the silence models are basically decimated to the benefit of the other states. By using state prior λ*s=λsβ/Σs′λs′β with β=0.2 reminiscent of α-assignment, obtain the best results for GC-viterbi reported in the table of
When clustering A100K into very small models like 5Kc (20 times smaller), achieving WERs close to the 100K-STC models' WERs becomes a delicate equilibrium of allocating Gaussians between speech states and silence states. This is implicitly done with β=0.2 in equation (28). Since β was historically tuned for large models, it could be tuned for smaller models. However, it is believed that a step further should be taken to treat silence and speech states as two different categories. Given the teachings herein, the skilled artisan will be able to implement such an approach. For example, in a non-limiting example above, equation (28) was used for Gaussian assignment for both speech and silence states in an α-assignment scheme. However, instead of using the single equation (28) with the single value of β for both speech states and silence states, different assignment criteria (e.g., different equations or the same equation with different parameters) could be used for assignment of speech states and assignment of silence states. In a non-limiting example, equation (28) could be used for both speech and silence states but with different values of β and/or different values of N. For example, 1000 Gaussians could be assigned to silence and 9000 Gaussians could be assigned to speech. Approaches using other probability density functions (e.g., other than those using Gaussian mixture models) are also possible as discussed elsewhere herein (e.g., exponential family mixture models).
Note that, in some instances, approximately 30% of training data may be associated with silence and about 70% with speech; speech may be associated with thousands of states and silence with only a handful of states. Note also that in at least some instances, “silence” includes conditions of non-speech noise such as background noise, auto noise, sighs, coughs, and the like.
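By way of a non-limiting illustration, the separate treatment of speech and silence states can be sketched as follows. This is a minimal sketch and not a statement of equation (28): it assumes a power-law allocation in which each state receives a share of its category's budget proportional to its frame count raised to the power β. The function name `allocate`, the frame counts, and the choice of β per category are hypothetical.

```python
def allocate(counts, budget, beta):
    """Split `budget` components across states in proportion to count**beta.

    Assumed power-law form (an illustration only); every state keeps at
    least one component.
    """
    if budget < len(counts):
        raise ValueError("need at least one component per state")
    weights = [c ** beta for c in counts]
    total = sum(weights)
    # Reserve one component per state, then split the rest proportionally.
    spare = budget - len(counts)
    shares = [1 + int(spare * w / total) for w in weights]
    # Flooring leaves a few components unassigned; give them to the
    # states with the largest weights.
    leftover = budget - sum(shares)
    order = sorted(range(len(counts)), key=lambda i: weights[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


# Hypothetical frame counts per state, with one budget per category.
speech_counts = [5000, 1200, 300, 80]
silence_counts = [40000, 25000]
speech = allocate(speech_counts, budget=9000, beta=0.2)
silence = allocate(silence_counts, budget=1000, beta=0.2)
```

Keeping the two budgets separate means that the large frame counts of the silence states cannot draw components away from the speech states, and vice versa.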
To improve upon GC-α, model refinement was applied using discrete varEM (dvarEM). WERs are better overall and, at 5K, GC-α with dvarEM reaches 2.76%, within 9% of A5K. Over the 5K-95K range, models built from GC-α with dvarEM are on average within 2.7% of the WERs for 100K-STC models built from data. In fact, for 9 out of 19 clustered models, GC-α with dvarEM is better than the baseline models. This makes clustering a reference model a viable alternative to training models from data.
Speech Recognition Block Diagram
References
The skilled artisan will already be familiar with the following references, which have been cited above, and, given the teachings herein, will be able to make and use embodiments of the invention and perceive the best mode. Nevertheless, out of an abundance of caution, all of the following nine (9) references are fully incorporated herein by reference in their entireties for all purposes.
One or more embodiments provide a set of tools for restructuring acoustic models in an ASR task, while preserving recognition performance compared to equivalent models built from data. These tools were applied to define a greedy clustering technique that can efficiently generate, from a reference model, smaller clustered models of any size. The clustered models have ASR performance comparable to models of the same size built from training data. Advances in Gaussian assignment techniques lead to significant improvement in WER, especially for clustered models with large size reduction. For 5K, WER went from being within 30% of the model built from data to 9% with GC-α with discrete varEM. One or more embodiments provide a greedy clustering process including two independent steps. The first step generates the sequence of best merges for each CD state, while the second step provides a Gaussian assignment for every state. This two step approach is particularly suited for parallelization and is significant in the handling of large models. Furthermore, this greedy clustering technique can generate clustered models on demand as most of the computation is done up front or ‘offline.’ This renders possible applications where smaller models can be built on demand from a reference model to accommodate new and changing constraints over time.
Reference should now be had to flow chart 500 of
An additional step 508 includes identifying a desired number of components Nc, less than N, to be used in the restructured acoustic model. The restructured acoustic model is derived from the reference acoustic model, and the desired number of components Nc is selected based on the computing environment in which the restructured acoustic model is to be deployed. The restructured acoustic model also has L states.
A number of techniques can be used to identify the desired number of components in the restructured model. In some instances, rules of thumb familiar to the skilled artisan can be employed. In other instances, the decision can be based on the hardware limitations of the system that will use the restructured acoustic model; for example, there may only be 400 MB of space available for the restructured acoustic model in a given instance; this leads to the allowable number of components. In some cases, equation (28) (in essence, an empirical formula) can be employed (there generally being a linear relationship between the log of the counts and the log of the number of Gaussians).
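As a purely illustrative example of deriving an allowable number of components from a storage limit, the arithmetic can be sketched as follows. The sketch assumes diagonal-covariance Gaussians stored as 32-bit values (one mean and one variance per dimension, plus one mixture weight), a 40-dimensional feature space, and no storage overhead; none of these figures is taken from the embodiments described herein, and the name `max_components` is hypothetical.

```python
def max_components(budget_bytes, dim, bytes_per_value=4):
    """Largest number of diagonal Gaussians that fits in `budget_bytes`.

    Assumes each component stores `dim` means, `dim` variances, and one
    mixture weight, with no other overhead (an illustrative assumption).
    """
    per_component = (2 * dim + 1) * bytes_per_value
    return budget_bytes // per_component


# 400 MB budget with hypothetical 40-dimensional features.
n_c = max_components(400 * 1024 ** 2, dim=40)
```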
A further step 506 includes, for each given one of the L mixture models in the reference acoustic model, building a merge sequence which records, for a given cost function, sequential mergers of pairs of the components associated with the given one of the mixture models. This can be carried out, for example, as described above in the portion entitled “Greedy Clustering and Merge-Sequence Building” and as depicted in Technique 1 of
Note that the sequence of merges can be done all at once if speed-up is needed. For instance, if it is known that one final weighted component:
f_merged=merge(merge(merge(1,2),merge(3,4)),merge(5,6)),
is desired, where 1,2,3,4,5,6 are indices of components to be merged in the mixture model, this can be carried out all at once as:
f_merged=merge(1,2,3,4,5,6).
This is because the merge() operation employed in one or more embodiments gives the Maximum Likelihood merge of two components, and has the following properties:
merge(a,b)=merge(b,a)
merge(merge(a,b),c)=merge(a,merge(b,c))=merge(b,merge(a,c))=merge(a,b,c)
The order of merges does not matter; the final merged component is always the same. Therefore, if the relevant equations for merge(a,b,c) are available, it can be faster to compute merge(a,b,c) directly than merge(merge(a,b),c). For the avoidance of doubt, as used herein, including the claims, “applying the merge sequence to a corresponding one of the L mixture models in the reference acoustic model until the portion of the Nc components assigned to the given one of the L states is achieved” is intended to cover sequential or “all at once” approaches or combinations thereof.
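The order independence noted above can be checked numerically. The following is a minimal sketch that assumes one-dimensional Gaussian components, each represented as a (weight, mean, variance) tuple, and a merge that matches the weight, mean, and variance of the pooled components; the tuple layout and the numerical values are illustrative assumptions.

```python
import math


def merge(*components):
    """Moment-matched merge of weighted 1-D Gaussians.

    Each component is a (weight, mean, variance) tuple; any number of
    components can be merged in a single call.
    """
    w = sum(c[0] for c in components)
    mean = sum(c[0] * c[1] for c in components) / w
    # Second moment of the pooled density, minus the squared pooled mean.
    second = sum(c[0] * (c[2] + c[1] ** 2) for c in components) / w
    return (w, mean, second - mean ** 2)


a, b, c = (0.2, -1.0, 0.5), (0.5, 0.3, 1.0), (0.3, 2.0, 0.25)
sequential = merge(merge(a, b), c)
reordered = merge(b, merge(a, c))
all_at_once = merge(a, b, c)
assert all(math.isclose(x, y) for x, y in zip(sequential, all_at_once))
assert all(math.isclose(x, y) for x, y in zip(reordered, all_at_once))
```

Because the merged weight, mean, and second moment are all weighted sums over the inputs, grouping the inputs differently does not change the result, which is why the “all at once” form is available.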
Processing continues at 514.
As used herein, a mixture model is a probability density function (pdf) that is represented as a sum of weighted components (each of which can be of a different form, such as Gaussians, exponential densities, and the like), where the weights sum to one and the total integral of the pdf is one. Further, as used herein, the term “component” is understood, in the context of statistical modeling, as being one individual element of the mixture.
It will be appreciated that the restructured model can be stored in a computer-readable storage medium and used in an ASR system.
In some instances, step 510 includes applying an α-assignment scheme, such as described above in the portion entitled “α-Assignment.” Such a scheme can involve, for example, assigning the portion of the Nc components to each of the L states in the restructured acoustic model by taking into account the number of frames in the actual data modeled by a given one of the L states, and correcting for those of the L states modeling silence.
In some cases, step 510 includes applying a global greedy assignment scheme across all the L states, to reduce merge distortion, such as described above in the portion entitled “Global Greedy Assignment.” Such a scheme can involve, for example, iterating to find a best merge across all the L states until only the portion of the Nc components is left for each of the L states in the restructured acoustic model.
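One possible way to organize such an iteration is sketched below. The sketch assumes that a merge sequence has already been built for every state and that the cost of each recorded merge is available as a list per state; the function name `global_greedy` and the cost values are hypothetical, and the sketch is not the specific procedure of the “Global Greedy Assignment” portion.

```python
import heapq


def global_greedy(merge_costs, n_target):
    """Choose how many components each state keeps.

    merge_costs[s][k] is the cost of the k-th merge recorded in the merge
    sequence of state s. The cheapest pending merge across all states is
    applied repeatedly until n_target components remain in total.
    """
    # A state with m recorded merges starts with m + 1 components.
    kept = [len(costs) + 1 for costs in merge_costs]
    total = sum(kept)
    if not len(merge_costs) <= n_target <= total:
        raise ValueError("n_target must lie between L and N")
    # Heap entries: (cost of next merge, state index, position in sequence).
    heap = [(costs[0], s, 0) for s, costs in enumerate(merge_costs) if costs]
    heapq.heapify(heap)
    while total > n_target:
        _, s, k = heapq.heappop(heap)
        kept[s] -= 1
        total -= 1
        if k + 1 < len(merge_costs[s]):
            heapq.heappush(heap, (merge_costs[s][k + 1], s, k + 1))
    return kept


# Three states with 4, 3, and 2 components; reduce 9 components to 5.
assignment = global_greedy([[0.1, 0.4, 0.9], [0.2, 0.3], [0.05]], n_target=5)
```

Only the next pending merge of each state sits in the heap, so each step costs a logarithmic number of operations in the number of states.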
In some embodiments, step 510 includes applying a Viterbi selection scheme, such as described above in the portion entitled “Viterbi Selection.” Such a scheme can involve, for example, recursively optimizing the assignment of the portion of the Nc components to each of the L states in the restructured acoustic model while simultaneously minimizing model distortion.
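A recursive optimization of this kind can be illustrated with a small dynamic program. The sketch below assumes a table giving, for every state, the distortion incurred when that state keeps a given number of components; the name `viterbi_select` and the table values are hypothetical, and the recursion of the “Viterbi Selection” portion may differ in its details.

```python
def viterbi_select(distortion, n_target):
    """Minimize total distortion subject to a total component count.

    distortion[s][n - 1] is the distortion of state s when it keeps n
    components. Returns (components per state, total distortion).
    """
    inf = float("inf")
    # best[t]: minimum distortion of the states seen so far using t components.
    best = [0.0] + [inf] * n_target
    back = []
    for costs in distortion:
        new = [inf] * (n_target + 1)
        choice = [0] * (n_target + 1)
        for t in range(n_target + 1):
            if best[t] == inf:
                continue
            for n, d in enumerate(costs, start=1):
                if t + n <= n_target and best[t] + d < new[t + n]:
                    new[t + n] = best[t] + d
                    choice[t + n] = n
        best = new
        back.append(choice)
    if best[n_target] == inf:
        raise ValueError("n_target cannot be reached")
    # Trace the chosen counts back from the last state to the first.
    counts, t = [], n_target
    for choice in reversed(back):
        counts.append(choice[t])
        t -= choice[t]
    return counts[::-1], best[n_target]


table = [[3.0, 1.0, 0.0], [2.5, 0.0], [0.5, 0.0]]
counts, total = viterbi_select(table, n_target=4)
```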
In at least some cases, the cost function is weighted local maximum likelihood as described above.
Note that
It will be appreciated that in some cases, the method can include actually building the reference model, while in other cases, a model prepared outside the scope of the method can be accessed. Step 504 is intended to cover either case.
In some cases, the states are context-dependent states.
In some cases, the mixture models are Gaussian mixture models.
With reference now to the block diagram of
Then, using component assignor 708, find the number of components for each state, and using model builder 710, generate a new (restructured) model, given the merge sequences and number of components for each state.
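The interplay of merge-sequence generation and model building can be sketched for a single state as follows. This is a minimal sketch under stated assumptions: components are one-dimensional Gaussians given as (weight, mean, variance) tuples, and the cost is a stand-in (the weighted increase in log-variance caused by a merge) rather than the cost function of the embodiments. The function names are hypothetical.

```python
import math
from itertools import combinations


def merge(a, b):
    """Moment-matched merge of two (weight, mean, variance) components."""
    w = a[0] + b[0]
    mean = (a[0] * a[1] + b[0] * b[1]) / w
    second = (a[0] * (a[2] + a[1] ** 2) + b[0] * (b[2] + b[1] ** 2)) / w
    return (w, mean, second - mean ** 2)


def cost(a, b):
    """Stand-in merge cost (an assumption): weighted log-variance increase."""
    m = merge(a, b)
    return 0.5 * (m[0] * math.log(m[2])
                  - a[0] * math.log(a[2]) - b[0] * math.log(b[2]))


def build_merge_sequence(components):
    """Record greedy pairwise merges until a single component remains."""
    live = dict(enumerate(components))
    next_id = len(components)
    sequence = []
    while len(live) > 1:
        # Pick the pair of live components that is cheapest to merge.
        i, j = min(combinations(live, 2),
                   key=lambda p: cost(live[p[0]], live[p[1]]))
        live[next_id] = merge(live.pop(i), live.pop(j))
        sequence.append((i, j, next_id))
        next_id += 1
    return sequence


def apply_merge_sequence(components, sequence, n_keep):
    """Replay recorded merges until only n_keep components remain."""
    live = dict(enumerate(components))
    for i, j, new_id in sequence[: len(components) - n_keep]:
        live[new_id] = merge(live.pop(i), live.pop(j))
    return list(live.values())


state = [(0.25, 0.0, 1.0), (0.25, 0.1, 1.0), (0.25, 5.0, 1.0), (0.25, 5.2, 1.0)]
sequence = build_merge_sequence(state)          # computed once, up front
restructured = apply_merge_sequence(state, sequence, n_keep=2)
```

Because the sequence is recorded once, a restructured state of any size can later be produced by replaying a prefix of it, without recomputing any merge costs.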
Exemplary System and Article of Manufacture Details
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
One or more embodiments of the invention, or elements thereof, can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps.
One or more embodiments can make use of software running on a general purpose computer or workstation. With reference to
Accordingly, computer software including instructions or code for performing the methodologies of the invention, as described herein, may be stored in one or more of the associated memory devices (for example, ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (for example, into RAM) and implemented by a CPU. Such software could include, but is not limited to, firmware, resident software, microcode, and the like.
A data processing system suitable for storing and/or executing program code will include at least one processor 402 coupled directly or indirectly to memory elements 404 through a system bus 410. The memory elements can include local memory employed during actual implementation of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during implementation.
Input/output or I/O devices (including but not limited to keyboards 408, displays 406, pointing devices, and the like) can be coupled to the system either directly (such as via bus 410) or through intervening I/O controllers (omitted for clarity).
Network adapters such as network interface 414 may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
As used herein, including the claims, a “server” includes a physical data processing system (for example, system 412 as shown in
As noted, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon. Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Media block 418 is a non-limiting example. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). Other non-limiting examples of suitable languages include:
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and/or block diagrams in the FIGS. illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that any of the methods described herein can include an additional step of providing a system comprising distinct software modules embodied on a computer readable storage medium; the modules can include, for example, any or all of the elements depicted in the block diagrams and/or described herein; by way of example and not limitation, a merge sequence generator module, a component assigner module, and a model builder module. Optionally, the merge sequence generator module can include a distance matrix generator module and a merge sequence creator module. The method steps can then be carried out using the distinct software modules and/or sub-modules of the system, as described above, executing on one or more hardware processors 402. Each of the distinct modules or sub-modules includes code to implement the corresponding equations. Further, a computer program product can include a computer-readable storage medium with code adapted to be implemented to carry out one or more method steps described herein, including the provision of the system with the distinct software modules. In addition, the resulting restructured acoustic model or models can be stored in a computer-readable storage medium and can be used in a system such as that in
In any case, it should be understood that the components illustrated herein may be implemented in various forms of hardware, software, or combinations thereof; for example, application specific integrated circuit(s) (ASICs), functional circuitry, one or more appropriately programmed general purpose digital computers with associated memory, and the like. Given the teachings of the invention provided herein, one of ordinary skill in the related art will be able to contemplate other implementations of the components of the invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Number | Name | Date | Kind |
---|---|---|---|
6107935 | Comerford et al. | Aug 2000 | A |
6411930 | Burges | Jun 2002 | B1 |
7216077 | Padmanabhan et al. | May 2007 | B1 |
7295978 | Schwartz et al. | Nov 2007 | B1 |
7539617 | Mami et al. | May 2009 | B2 |
20030163313 | Rees | Aug 2003 | A1 |
20030171931 | Chang | Sep 2003 | A1 |
20040059576 | Lucke | Mar 2004 | A1 |
20040260548 | Attias | Dec 2004 | A1 |
20080270127 | Kobayashi et al. | Oct 2008 | A1 |
20100121640 | Zheng et al. | May 2010 | A1 |
Entry |
---|
Pierre L. Dognin et al., “Refactoring acoustic models using variational density approximation,” in ICASSP, Apr. 2009, pp. 4473-4476. |
Pierre L. Dognin et al., “Refactoring acoustic models using variational expectation maximization,” in Interspeech, Sep. 2009, pp. 212-215. |
Kai Zhang and James T. Kwok, “Simplifying mixture models through function approximation,” in NIPS 19, pp. 1577-1584. MIT Press, 2007. |
Xiao-Bing Li et al., “Optimal clustering and non-uniform allocation of Gaussian kernels in scalar dimension for HMM compression,” in ICASSP, Mar. 2005, pp. 669-672. |
Imre Csiszár, “Why least squares and maximum entropy? an axiomatic approach to inference for linear inverse problems,” Annals of Statistics, v. 19, n. 4, pp. 2032-2066, 1991. |
Steve Young et al., The HTK Book (for HTK Version 3.3), Cambridge University Engineering Department, 2005. |
E. Marcheret et al., “Optimal quantization and bit allocation for compressing large feature space transforms,” IEEE Workshop on ASR & Understanding 2009, pp. 64-69. |
Pierre L. Dognin et al., “Restructuring Exponential Family Mixture Models”, Interspeech 2010, Sep. 2010, pp. 62-65. |
Solomon Kullback, “Information Theory and Statistics”. Dover Publications, Jul. 7, 1997. |
Number | Date | Country | |
---|---|---|---|
20120150536 A1 | Jun 2012 | US |