This invention relates to scoring of acoustically-based events in a word spotting system.
Word spotting systems are used to detect the presence of specified keywords, phrases, or other linguistic events in an acoustically-based signal. Many word spotting systems provide a score associated with each detection. Such scores can be useful for characterizing which detections are more likely to correspond to true events (“hits”) rather than misses, which are sometimes referred to as false alarms.
Some word spotting systems make use of statistical models, such as Hidden Markov Models (HMMs), which are trained based on a training corpus of speech. In such systems, probabilistically motivated scores have been used to characterize the detections. One such score is a posterior probability (or equivalently a logarithm of the posterior probability) that the keyword occurred (e.g., started or ended) at a particular time, given the acoustically-based signal and the HMMs for the keyword of interest and for other speech.
It has been observed that the probabilistically motivated scores can be variable, depending on factors such as the audio conditions and the specific word or phrase that is being detected. For example, scores obtained in different audio conditions or for different words and phrases are not necessarily comparable.
In one aspect, in general, the invention features a method and corresponding software and a system for scoring acoustically-based events in a speech processing system. Data characterizing an instance of an event are first accepted. These data include a score for the event. The event is associated with a number of component events from a set of component events. Probability models are also accepted for component scores associated with each of the set of component events in each of two or more possible classes of the event. The event is then scored. This scoring includes computing a probability of one of the two or more possible classes for the event using the accepted probability models.
Aspects of the invention can include one or more of the following features:
The two or more classes of the event can include true occurrence of the event, and the classes can include false detection of the event.
The acoustically-based event can include a linguistically-defined event, which can include one or more word events. The component events can include subword units, such as phonemes.
The probability models for the component scores can be Gaussian models.
The method can further include accepting data characterizing multiple instances of events, such that at least some of the events are known to belong to each of the two or more classes of events. The method can further include estimating parameters for the probability models for the component scores from the data characterizing the multiple instances of events. Estimating the parameters can include applying a Gibbs sampling approach.
Aspects of the invention can have one or more of the following advantages.
The approach can make scores for different events, which may have different phonetic content, more comparable.
The overall accuracy of a word spotting system can be improved using this approach.
Other features and advantages of the invention are apparent from the following description, and from the claims.
Referring to
Referring first to the runtime subsystem 102, which is shown in
The word spotting engine 120 processes the unknown speech 126 to detect instances of the events specified by the queries. These detections are termed putative events 144. Each putative event is associated with a score and the identity of the query that was detected, as well as an indication of when the putative event occurred in the unknown speech (e.g., a start time and/or an end time). In this version of the system, the score associated with a putative event is a probability that the event started at the indicated time conditioned on the entire unknown speech signal 126 and based on the models 122. These scores that are output from the word spotting engine 120 are referred to below as “raw scores.”
The raw scores for the putative events 144 are processed by a score normalizer 140 to produce putative events with normalized scores 152. The score normalizer 140 makes use of normalization parameters 142, which are determined by the training subsystem 101. Generally, the score normalizer 140 uses the phonetic content of a query and the normalization parameters that are associated with that phonetic content to map the raw score for the query to a normalized score. The normalized score can be interpreted as a probability that the putative event is a true detection of the query. The normalized score is a number between 0.0 and 1.0, with a larger number being associated with a greater certainty that the putative event is a true detection of the query.
Referring to
The normalization parameters 142 are estimated by a normalization parameter estimator 130. This parameter estimator takes as inputs a set of true instances of query events along with their associated raw scores 132, as well as a set of false alarms and their scores 134, that were produced by the word spotting engine 120 when run on training speech (B) 124. These sets of true events and false alarms include instances associated with a number of different queries, which together provide a sampling of the subword units used to represent the queries. Preferably, training speech (A) 112, which is used to estimate models 122, and training speech (B) 124 are different, although the procedure can be carried out with the same training speech, optionally using one of a variety of statistical jackknifing techniques with the same speech.
The normalization parameter estimator 130 and the associated score normalizer 140 are based on a probabilistic model that treats each raw score, R(q), for an instance of a putative detection of a query q, expressed as a logarithm of a probability that the query q occurred, as having an additive form that includes terms each associated with a different subword (phonetic) unit of the query. That is, if the query q is represented as the sequence of N units $s_1,\dots,s_N$ (the dependence of the length N on the specific query q is omitted to simplify the notation), then the raw score is represented as $R(q)=\sum_{i=1}^{N} r_{s_i}$, where $r_{s_i}$ is the component score associated with unit $s_i$.
The queries are all represented using a common limited set of subword units, in this version of the system, a set of approximately L=40 English phonemes. Normalization parameters 142 therefore include parameters for 2L distributions, two for each subword unit s, one for a true detection (“Hit”), Ps(r|Hit), and one for a miss (false alarm), Ps(r|Miss).
Each of the distributions associated with the subword units is modeled as a Gaussian (Normal) distribution, with one variance shared among the Hit distributions and another shared among the Miss distributions. Specifically, the distributions take the form:
$P_s(r\mid\mathrm{Hit})=N(r;\ \mu_{H,s},\ \sigma_H^2)$
and
$P_s(r\mid\mathrm{Miss})=N(r;\ \mu_{M,s},\ \sigma_M^2)$.
Therefore normalization parameters 142 include 2L means $\mu_{H,s}$ and $\mu_{M,s}$, and two variances $\sigma_H^2$ and $\sigma_M^2$.
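Because of the additive form $R(q)=\sum_{i=1}^{N} r_{s_i}$, and treating the component scores as independent, the raw score for a query is itself Gaussian in each class, with a mean equal to the sum of the component means and a variance equal to N times the shared component variance:
$P^{(q)}(R(q)\mid\mathrm{Hit})=N\left(R(q);\ \sum_{i=1}^{N}\mu_{H,s_i},\ N\sigma_H^2\right)$
and similarly
$P^{(q)}(R(q)\mid\mathrm{Miss})=N\left(R(q);\ \sum_{i=1}^{N}\mu_{M,s_i},\ N\sigma_M^2\right)$.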
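As an illustration of how the per-unit parameters combine into a query-level distribution, the following minimal Python sketch sums the component means and scales the shared variance by the number of units. It assumes the component scores are independent. The function names, the phoneme labels, and the numeric values are hypothetical and are not taken from the described system.

```python
import math


def query_score_distribution(units, means, variance):
    """Return (mean, variance) of the raw score R(q) for one class.

    units    -- subword unit sequence s_1..s_N of the query
    means    -- dict mapping each unit to its component-score mean
    variance -- variance shared by all component scores in this class
    Assumes independent Gaussian component scores, so means and
    variances add across the N units.
    """
    return sum(means[s] for s in units), len(units) * variance


def gaussian_logpdf(x, mean, variance):
    """Log density of a Normal(mean, variance) distribution at x."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)


# Hypothetical Hit parameters for a three-unit query.
hit_means = {"k": -1.0, "ae": -0.5, "t": -1.5}
mean, variance = query_score_distribution(["k", "ae", "t"], hit_means, 0.25)
log_density = gaussian_logpdf(-3.2, mean, variance)
```

With these illustrative values the query-level mean is the sum of the three unit means and the variance is three times the shared component variance.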
The score normalizer 140 takes as input a raw score R(q) for a query q, which is represented as the sequence of units s1, . . . , sN, and outputs a normalized score, which is computed as a probability Pr(Hit|R(q)) based on the normalization parameters. Score normalizer 140 implements a computation based on Bayes' Rule:
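$\Pr(\mathrm{Hit}\mid R(q))=\dfrac{P^{(q)}(R(q)\mid\mathrm{Hit})\,\Pr(\mathrm{Hit})}{P^{(q)}(R(q))}$
where
$P^{(q)}(R(q))=P^{(q)}(R(q)\mid\mathrm{Hit})\,\Pr(\mathrm{Hit})+P^{(q)}(R(q)\mid\mathrm{Miss})\,(1-\Pr(\mathrm{Hit}))$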
The a priori probability that a detection is a hit, Pr(Hit), is treated as independent of the query. This a priori probability is computed from the relative number of true query events 132 and false alarms 134, and is also stored as one of the normalization parameters 142.
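The Bayes' Rule computation performed by the score normalizer might be sketched as follows. This is an illustrative Python sketch, not the described implementation: it assumes that the query-level Hit and Miss distributions are Gaussians whose means are sums of the per-unit means and whose variances are N times the shared variances, and it works in the log domain for numerical stability. The function and parameter names are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, variance):
    """Log density of a Normal(mean, variance) distribution at x."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)


def normalize_score(raw_score, units, hit_means, miss_means,
                    hit_var, miss_var, prior_hit):
    """Return Pr(Hit | R(q)) for a raw score and a unit sequence.

    prior_hit must lie strictly between 0 and 1.
    """
    n = len(units)
    log_hit = gaussian_logpdf(
        raw_score, sum(hit_means[s] for s in units), n * hit_var
    ) + math.log(prior_hit)
    log_miss = gaussian_logpdf(
        raw_score, sum(miss_means[s] for s in units), n * miss_var
    ) + math.log(1.0 - prior_hit)
    # Subtract the larger term before exponentiating to avoid underflow.
    largest = max(log_hit, log_miss)
    hit = math.exp(log_hit - largest)
    miss = math.exp(log_miss - largest)
    return hit / (hit + miss)


# Hypothetical parameters for a two-unit query.
hit_means = {"k": -1.0, "ae": -0.5}
miss_means = {"k": -3.0, "ae": -2.5}
score = normalize_score(-2.0, ["k", "ae"], hit_means, miss_means, 0.5, 0.5, 0.2)
```

The returned value always lies between 0.0 and 1.0, and with equal variances it increases as the raw score moves from the Miss mean toward the Hit mean.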
Referring to
Referring to
$\Pr(\mathrm{Hit}),\ \{\mu_{H,i},\ \mu_{M,i}\}_{i=1,\dots,L},\ \sigma_H^2,\ \sigma_M^2$:
The normalization parameter estimator 130 estimates the parameter Pr(Hit) as the fraction of the total number of detections that are true hits. Alternatively, this parameter is set to a quantity that reflects the estimated fraction of true events among those that will later be detected by the word spotting engine on the unknown speech, or is set to some other constant according to other criteria, such as by optimizing the quantity to increase accuracy.
The normalization parameter estimator 130 estimates the parameters for the hits, $\{\mu_{H,i}\}_{i=1,\dots,L}$ and $\sigma_H^2$, from the set of true hits 132 independently of the corresponding parameters that it estimates from the false alarms 134. For notational simplicity, we drop the subscripts H and M in the following discussion, and refer to the entire set of means for either the hits or the misses as $\mu\equiv\{\mu_i\}_{i=1,\dots,L}$. Similarly, the entire set of queries and their corresponding raw scores are denoted $Q\equiv\{q\}$ and $R\equiv\{R(q)\}$, respectively. (In the discussion below, each element of the sets corresponds to a single instance of a query.)
Referring to
$(\hat{\mu},\hat{\sigma})=\arg\max_{\mu,\sigma} P(R\mid Q,\mu,\sigma)$
A function em_estimate() is executed to yield an approximation $(\hat{\mu}^{(1)},\hat{\sigma}^{(1)})$ of this ML estimate. The details of this procedure are discussed further below with reference to
The Gibbs_sample() procedure continues with a three-step iteration (lines 320-350). In the first step of the iteration (line 330), a function sample_factor() is used to generate a random sampling of the component scores based on the raw scores for the queries and the current parameter values. This function yields a set $\{\tilde{r}^{(q)}\}$ with one vector element per query, where $\tilde{r}^{(q)}\equiv(\tilde{r}_1^{(q)},\dots,\tilde{r}_N^{(q)})$ is the vector of component scores for query q, and N is the length of the phonetic representation of q. For each of the queries, the component scores are drawn at random, constrained to match the total raw score for the query, $\sum_i \tilde{r}_i^{(q)}=R(q)$. The sample_factor() function is described below with reference to
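One way to draw component scores that sum exactly to the raw score is sketched below in Python. This is an assumption about how such a constrained draw could be made, not a reproduction of the sample_factor() procedure of the described system: it assumes independent Gaussian component scores with a shared standard deviation, in which case drawing unconstrained values and spreading the leftover difference equally over the N components yields a draw from the conditional distribution given the sum. The argument names are hypothetical.

```python
import random


def sample_factor(raw_score, units, means, sigma, rng):
    """Draw component scores for one query, constrained to sum to raw_score.

    units -- subword unit sequence of the query
    means -- dict mapping each unit to its current mean estimate
    sigma -- current shared standard deviation of the component scores
    rng   -- a random.Random instance
    """
    # Unconstrained draws from each unit's Gaussian.
    draws = [rng.gauss(means[s], sigma) for s in units]
    # Spread the mismatch equally; with a shared variance this projects
    # the draw onto the constraint sum(components) == raw_score.
    shift = (raw_score - sum(draws)) / len(units)
    return [d + shift for d in draws]


rng = random.Random(1)
means = {"k": -1.0, "ae": -0.5, "t": -1.5}
components = sample_factor(-3.2, ["k", "ae", "t"], means, 0.5, rng)
```

The equal split of the mismatch relies on the shared variance; with unit-specific variances the mismatch would instead be apportioned in proportion to each variance.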
In the next step of the iteration (line 340), the randomly drawn component scores are used in a function sample_mean() to reestimate the means of the component scores, $\hat{\mu}^{(i)}=(\mu_1^{(i)},\dots,\mu_L^{(i)})^T$. The sample_mean() function is described below with reference to
In the third and final step of the iteration (line 350), the randomly drawn component scores and the newly updated means of the distributions of the component scores are used in a function sample_sig() to reestimate the shared standard deviation of the distributions, $\hat{\sigma}^{(i)}$.
After the specified number of iterations (num_iter), the Gibbs_sample() procedure returns the current estimate of the parameters of the distributions for the component scores (line 360).
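The three-step iteration can be illustrated with the following Python sketch. It is a sketch under stated assumptions rather than the described procedure: the means are drawn under a flat prior, the variance is drawn under a prior proportional to 1/σ², and the component scores are independent Gaussians with a shared standard deviation. The function bodies, argument lists, and example data are hypothetical; only the function names follow the text above.

```python
import math
import random


def sample_factor(raw_score, units, means, sigma, rng):
    """Draw component scores constrained to sum to raw_score."""
    draws = [rng.gauss(means[s], sigma) for s in units]
    shift = (raw_score - sum(draws)) / len(units)
    return [d + shift for d in draws]


def sample_mean(queries, components, sigma, rng):
    """Draw each unit mean from its conditional given the component scores
    (assumed flat prior: Normal(sample average, sigma^2 / count))."""
    totals, counts = {}, {}
    for (units, _), comps in zip(queries, components):
        for s, r in zip(units, comps):
            totals[s] = totals.get(s, 0.0) + r
            counts[s] = counts.get(s, 0) + 1
    return {s: rng.gauss(totals[s] / counts[s], sigma / math.sqrt(counts[s]))
            for s in totals}


def sample_sig(queries, components, means, rng):
    """Draw the shared standard deviation (assumed prior ~ 1/sigma^2, so the
    precision is Gamma(shape=n/2, scale=2/sum of squared errors))."""
    squared_error, count = 0.0, 0
    for (units, _), comps in zip(queries, components):
        for s, r in zip(units, comps):
            squared_error += (r - means[s]) ** 2
            count += 1
    precision = rng.gammavariate(count / 2.0, 2.0 / squared_error)
    return math.sqrt(1.0 / precision)


def gibbs_sample(queries, means, sigma, num_iter, rng):
    """Run num_iter sweeps and return the final (means, sigma)."""
    means = dict(means)
    for _ in range(num_iter):
        components = [sample_factor(raw, units, means, sigma, rng)
                      for units, raw in queries]
        means.update(sample_mean(queries, components, sigma, rng))
        sigma = sample_sig(queries, components, means, rng)
    return means, sigma


# Hypothetical (unit sequence, raw score) pairs for one class.
queries = [(["k", "ae", "t"], -3.2), (["t", "ae"], -1.9), (["k", "ae"], -1.4),
           (["t"], -1.6), (["k"], -1.1), (["ae", "t"], -2.3), (["ae"], -0.4)]
start = {"k": 0.0, "ae": 0.0, "t": 0.0}
means, sigma = gibbs_sample(queries, start, 1.0, 50, random.Random(0))
```

Because the result is a single random draw, repeated runs with different seeds return different parameter values; averaging over later sweeps is one common way to reduce that variability.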
Referring to
The em_iterate() procedure makes use of the Expectation-Maximization (EM) algorithm, starting at the initial estimate $(\hat{\mu}^{(0)},\hat{\sigma}^{(0)})$ and iterating until a stopping condition, in this case the maximum number of iterations num_iter, is reached. Each iteration involves two steps. First, a function expect_factor() (line 430) is used to determine expected values of sufficient statistics for updating the parameter values, and then a function maximize_like() (line 440) uses these expected values to reestimate the parameter values. After the maximum number of iterations is reached, the current estimates of the parameter values are returned as an approximation of the Maximum Likelihood estimate of the parameter values.
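The two-step EM iteration might look like the following Python sketch. It is an illustration under the assumption of independent Gaussian component scores with a shared standard deviation, in which case the expected component scores given the raw score are the current means plus an equal share of the mismatch, and each component has conditional variance σ²(1 − 1/N). The function bodies and argument lists are hypothetical; only the function names follow the text above.

```python
import math


def expect_factor(queries, means, sigma):
    """E-step: expected component scores per query, plus the summed
    conditional variance of the components given each raw score."""
    expected, spread = [], 0.0
    for units, raw in queries:
        n = len(units)
        shift = (raw - sum(means[s] for s in units)) / n
        expected.append([means[s] + shift for s in units])
        # Each of the n components has variance sigma^2 * (1 - 1/n).
        spread += sigma ** 2 * (n - 1)
    return expected, spread


def maximize_like(queries, expected, spread, means):
    """M-step: reestimate the unit means and the shared deviation."""
    totals, counts = {}, {}
    for (units, _), comps in zip(queries, expected):
        for s, r in zip(units, comps):
            totals[s] = totals.get(s, 0.0) + r
            counts[s] = counts.get(s, 0) + 1
    new_means = dict(means)
    new_means.update({s: totals[s] / counts[s] for s in totals})
    squared_error = spread + sum(
        (r - new_means[s]) ** 2
        for (units, _), comps in zip(queries, expected)
        for s, r in zip(units, comps)
    )
    return new_means, math.sqrt(squared_error / sum(counts.values()))


def em_iterate(queries, means, sigma, num_iter):
    """Run num_iter EM iterations from the initial (means, sigma)."""
    for _ in range(num_iter):
        expected, spread = expect_factor(queries, means, sigma)
        means, sigma = maximize_like(queries, expected, spread, means)
    return means, sigma


# Hypothetical (unit sequence, raw score) pairs for one class.
queries = [(["k", "ae", "t"], -3.2), (["t", "ae"], -1.9), (["k", "ae"], -1.4),
           (["t"], -1.6), (["k"], -1.1), (["ae", "t"], -2.3), (["ae"], -0.4)]
means, sigma = em_iterate(queries, {"k": 0.0, "ae": 0.0, "t": 0.0}, 1.0, 20)
```

For a query with a single unit, the expected component score equals the raw score itself, so in that special case the update reduces to the ordinary sample mean and standard deviation.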
Referring to
Referring to
Referring to
Referring to
Referring to
In an optional mode, the normalization parameter estimator does not assume that the variances of the component score distributions are tied to a common value, and it independently estimates each variance using a variant of the procedures shown in
In alternative embodiments, different forms of probability distributions and different parameter estimation methods are used. These estimates can be Maximum Likelihood (ML), Maximum A Posteriori (MAP), Maximum Mutual Information (MMI), or other types of estimates of the parameter values. Various types of prior distributions of parameter values can be used for those estimation techniques that depend on such prior distributions. Various numerical techniques can also be used to optimize or calculate the parameter values.
In the discussion above, each putative instance of a query is associated with a particular phoneme sequence. In alternative forms of the approach, each query may allow multiple different phoneme sequences, for example to allow alternative pronunciations or alternative word sequences. In this alternative approach, the phoneme sequence associated with an instance of a query (hit or miss) can be treated as unknown or as a random variable, which can have a prior distribution based on the query. Also, as introduced above, the subword units are not necessarily phonemes. Larger linguistic units such as syllables, demi-syllables, or whole words can be used, as can arbitrary units derived from data. Also, other forms of models, both statistical and non-statistical, can be used by the word spotting engine to locate the putative events with their associated scores.
The system described above can be implemented in software, with instructions stored on a computer-readable medium, such as a magnetic or an optical disk. The software can be executed on different types of processors, including general purpose processors and signal processors. For example, the system can be hosted on a general purpose computer executing the Windows operating system. Some or all of the functionality can also be implemented using hardware, such as ASICs or custom integrated circuits. The system can be implemented on a single computer, or can be distributed over multiple computers. For example, the training subsystem can be hosted on one computer while the runtime subsystem is hosted on another computer.
It is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the appended claims. Other embodiments are within the scope of the following claims.
This application claims the benefit of U.S. Provisional Application No. 60/489,390 filed Jul. 23, 2003, which is incorporated herein by reference.