Efficient empirical determination, computation, and use of acoustic confusability measures

Description

BACKGROUND OF THE INVENTION
Technical Field

The invention relates to speech recognition. More particularly, the invention relates to efficient empirical determination, computation, and use of an acoustic confusability measure.

Description of the Prior Art

In United States Patent Application Publication No. 20020032549, it is stated:

In the operation of a speech recognition system, some acoustic information is acquired, and the system determines a word or word sequence that corresponds to the acoustic information. The acoustic information is generally some representation of a speech signal, such as the variations in voltage generated by a microphone. The output of the system is the best guess that the system has of the text corresponding to the given utterance, according to its principles of operation.

The principles applied to determine the best guess are those of probability theory. Specifically, the system produces as output the most likely word or word sequence corresponding to the given acoustic signal. Here, “most likely” is determined relative to two probability models embedded in the system: an acoustic model and a language model. Thus, if A represents the acoustic information acquired by the system, and W represents a guess at the word sequence corresponding to this acoustic information, then the system's best guess W* at the true word sequence is given by the solution of the following equation:

W*=argmax_WP(A|W)P(W).

Here P(A|W) is a number determined by the acoustic model for the system, and P(W) is a number determined by the language model for the system. A general discussion of the nature of acoustic models and language models can be found in “Statistical Methods for Speech Recognition,” Jelinek, The MIT Press, Cambridge, Mass. 1999, the disclosure of which is incorporated herein by reference. This general approach to speech recognition is discussed in the paper by Bahl et al., “A Maximum Likelihood Approach to Continuous Speech Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume PAMI-5, pp. 179-190, March 1983, the disclosure of which is incorporated herein by reference.

The acoustic and language models play a central role in the operation of a speech recognition system: the higher the quality of each model, the more accurate the recognition system. A frequently-used measure of quality of a language model is a statistic known as the perplexity, as discussed in section 8.3 of Jelinek. For clarity, this statistic will hereafter be referred to as “lexical perplexity.” It is a general operating assumption within the field that the lower the value of the lexical perplexity, on a given fixed test corpus of words, the better the quality of the language model.

However, experience shows that lexical perplexity can decrease while errors in decoding words increase. For instance, see Clarkson et al., “The Applicability of Adaptive Language Modeling for the Broadcast News Task,” Proceedings of the Fifth International Conference on Spoken Language Processing, Sydney, Australia, November 1998, the disclosure of which is incorporated herein by reference. Thus, lexical perplexity is actually a poor indicator of language model effectiveness.

Nevertheless, lexical perplexity continues to be used as the objective function for the training of language models, when such models are determined by varying the values of sets of adjustable parameters. What is needed is a better statistic for measuring the quality of language models, and hence for use as the objective function during training.

United States Patent Application Publication No. 20020032549 teaches an invention that attempts to solve these problems by:

Providing two statistics that are better than lexical perplexity for determining the quality of language models. These statistics, called acoustic perplexity and the synthetic acoustic word error rate (SAWER), in turn depend upon methods for computing the acoustic confusability of words. Some methods and apparatuses disclosed herein substitute models of acoustic data in place of real acoustic data in order to determine confusability.

In a first aspect of the invention taught in United States Patent Application Publication No. 20020032549, two word pronunciations l(w) and l(x) are chosen from all pronunciations of all words in fixed vocabulary V of the speech recognition system. It is the confusability of these pronunciations that is desired. To do so, an evaluation model (also called valuation model) of l(x) is created, a synthesizer model of l(x) is created, and a matrix is determined from the evaluation and synthesizer models. Each of the evaluation and synthesizer models is preferably a hidden Markov model. The synthesizer model preferably replaces real acoustic data. Once the matrix is determined, a confusability calculation may be performed. This confusability calculation is preferably performed by reducing an infinite series of multiplications and additions to a finite matrix inversion calculation. In this manner, an exact confusability calculation may be determined for the evaluation and synthesizer models.

In additional aspects of the invention taught in United States Patent Application Publication No. 20020032549, different methods are used to determine certain numerical quantities, defined below, called synthetic likelihoods. In other aspects of the invention, (i) the confusability may be normalized and smoothed to better deal with very small probabilities and the sharpness of the distribution, and (ii) methods are disclosed that increase the speed of performing the matrix inversion and the confusability calculation. Moreover, a method for caching and reusing computations for similar words is disclosed.

Such teachings are yet limited and subject to improvement.

SUMMARY OF THE INVENTION

There are three related elements to the invention herein:

Empirically Derived Acoustic Confusability Measures

The first element comprises a means for determining the acoustic confusability of any two textual phrases in a given language. Some specific advantages of the means presented here are:

- Empirically Derived. The measure of acoustic confusability is empirically derived from examples of the application of a specific speech recognition technology. Thus, the confusability scores assigned by the measure may be expected to reflect the actual performance of a deployed instance of the technology, in a particular application.
- Depends Only on Recognizer Output. The procedure described herein does not require access to the internal computational models of the underlying speech recognition technology, and does not depend upon any particular internal structure or modeling technique, such as Hidden Markov Models (HMMs). Only the output of the speech recognition system, comprising the sequence of decoded phonemes, is needed.
- Iteratively Trained. The procedure described is based upon iterative improvement from an initial estimate, and therefore may be expected to be superior to any a priori human assignment of phoneme confusion scores, or to a method that makes only a single, initial estimate of phoneme confusion scores, without iterative improvement.
  
  Techniques for Efficient Computation of Empirically Derived Acoustic Confusability Measures

The second element comprises computational techniques for efficiently applying the acoustic confusability scoring mechanism. Previous inventions have alluded to the use of acoustic confusability measures, but notably do not discuss practical aspects of applying them. In any real-world practical scheme, it is often required to estimate the mutual acoustic confusability of tens of thousands of distinct phrases. Without efficient means of computing the measure, such computations rapidly become impractical. In this patent, we teach means for efficient application of our acoustic confusability measure, allowing practical application to very large-scale problems.

Method for Using Acoustic Confusability Measures

The third element comprises a method for using acoustic confusability measures, derived by whatever means (thus, not limited to the measure disclosed here), to make principled choices about which specific phrases to make recognizable by a speech recognition application.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a basic lattice according to the invention, where the numbers along the left are the row coordinates, the numbers along the top are column coordinates, and the small dots are the nodes of the lattice. The coordinates are used to identify the nodes of the lattice, in the form (row coordinate, column coordinate). Thus, the coordinates of the node in the upper-right corner are (0, 4);

FIG. 2 shows a basic lattice with actual phonemes according to the invention, where the purely symbolic phonemes d₁etc. have been replaced by actual phonemes from the SAMPA phoneme alphabet for US English. The true phoneme sequence shown is a pronunciation of the English word “hazy,” the decoded phoneme sequence is a pronunciation of the word “raise;”

FIG. 3 shows a basic lattice with actual decoding costs according to the invention, where the symbolic decoding costs δ(d|t) have been replaced by the starting values proposed in the text;

FIG. 4: Initial State of Bellman-Ford Algorithm. Each node now has a box to record the minimum-path cost from the source node, at coordinates (0, 0), to the node in question. The cost from the source node to itself is 0, so that value has been filled;

FIG. 5 shows two nodes with minimum path costs after labeling according to the invention, where the costs to reach nodes (0, 1) and (1, 0) have been determined and filled in, and the arcs of the minimum cost path in each case have been marked, by rendering them with a thicker line;

FIG. 6 shows the state of the lattice after a next step of the algorithm according to the invention, where the cost of the minimum cost path to (1, 1), and the arc followed for that path, have both been determined. This is the first non-trivial step of the algorithm. The result is determined as described in the text;

FIG. 7 shows the state of the lattice after completion of the algorithm according to the invention, where every node has been labeled with its minimum cost path, and the associated arcs have all been determined; and where some arbitrary choices, between paths of equal cost, have been made in selecting the minimum cost arcs;

FIG. 8 shows a confusatron output for typical homonyms according to the invention, comprising a small portion of the list of homonyms, generated from a grammar that comprised popular musical artist names, where the parenthesized text is the pronunciation that is shared by the two colliding phrases and, where in each case, the list of colliding phrases appears to the right, enclosed in angle brackets, with list elements separated by a # sign; and

FIG. 9 shows a confusatron output for typical dangerous words according to the invention, comprising a small portion of the list of dangerous words, where each entry is comprised of the nominal truth, and its clarity score, followed by a list (in order of decreasing confusability) of the other literals in the grammar that are likely to be confused with it and, where the items listed below the true literal are likely erroneous decodings, when the given true utterance has been spoken.

DETAILED DESCRIPTION OF THE INVENTION

There are three related elements to the presently preferred embodiment of invention disclosed herein:

Empirically Derived Acoustic Confusability Measures

The first element comprises a means for determining the acoustic confusability of any two textual phrases in a given language. Some specific advantages of the means presented here are:

- Empirically Derived. The measure of acoustic confusability is empirically derived from examples of the application of a specific speech recognition technology. Thus, the confusability scores assigned by the measure may be expected to reflect the actual performance of a deployed instance of the technology, in a particular application.
- Depends Only on Recognizer Output. The procedure described herein does not require access to the internal computational models of the underlying speech recognition technology, and does not depend upon any particular internal structure or modeling technique, such as Hidden Markov Models (HMMs). Only the output of the speech recognition system, comprising the sequence of decoded phonemes, is needed.
- Iteratively Trained. The procedure described is based upon iterative improvement from an initial estimate, and therefore may be expected to be superior to any a priori human assignment of phoneme confusion scores, or to a method that makes only a single, initial estimate of phoneme confusion scores, without iterative improvement.
  
  Techniques for Efficient Computation of Empirically Derived Acoustic Confusability Measures

The second element comprises computational techniques for efficiently applying the acoustic confusability scoring mechanism. Previous inventions have alluded to the use of acoustic confusability measures, but notably do not discuss practical aspects of applying such mechanisms. In any real-world practical scheme, it is often required to estimate the mutual acoustic confusability of tens of thousands of distinct phrases. Without efficient means of computing the measure, such computations rapidly become impractical. In this patent, we teach means for efficient application of our acoustic confusability score, allowing practical application to very large-scale problems.

Method for Using Acoustic Confusability Measures

Empirically Derived Acoustic Confusability Measure

The immediately following discussion explains how to derive and compute an empirically derived acoustic confusability measure. The discussion is divided into several subsections;

- In Section 1, we establish some notation and nomenclature, common to the invention as a whole.
- In Section 2, we explain how to empirically derive our acoustic confusability measure.
- In Section 3, we explain how to use the output of the preceding section to compute the acoustic confusability of any two phrases.

1. Notation and Nomenclature

We first establish some notation and nomenclature. The symbol or expression being defined appears in the left hand column; the associated text explains its meaning or interpretation. Italicized English words, in the associated text, give the nomenclature we use to refer to the symbol and the concept.

- u a single complete utterance, represented as an audio recording
- w a word sequence, phrase, or literal, represented as text. We will use these terms interchangeably.
- X the corpus; thus a sequence of utterances u₁, u₂, . . . u_Cand associated transcriptions T₁, T₂, . . . , T_C, where C is the number of utterances in the corpus. To underscore that the corpus contains audio data, we will sometimes refer to it as the audio corpus or acoustic corpus.
- P the recognized corpus; the result of passing the audio corpus through a given speech recognition system.
- Φ the phoneme alphabet of the human language in question. This is a finite collection of the basic sound units of the language, denoted by some textual names or symbols. For the purposes of this discussion, we will use the Speech Assessment Methods Phonetic Alphabet (SAMPA) for US English, as defined in Language Supplement, OpenSpeech™ Recognizer, (US English), for English in the United States (en-US), Second Edition, May 2004, page 33. Additional discussion may be found in Wikipedia.
- q(w) a pronunciation or baseform (there may be several) of the phrase w, represented as a sequence of phonemes ϕ₁, ϕ₂, . . . , ϕ_Q, where Q is the number of phonemes in the pronunciation. Each ϕ_iis a member of Φ.
- Q(w) the set of all pronunciations of w
- G a grammar, in the sense of an automatic speech recognition system; thus comprising a representation of all phrases (also referred to as word sequences or literals) that the system may recognize, nominally with some symbolic meaning attached to each phrase
- L(G) the language of G; thus a list of all word sequences admissible by G
- Q(L(G)) the set of all pronunciations of all word sequences appearing in L(G); thus a list of one or more phoneme sequences for each word sequence in L(G)
- R a recognizer or automatic speech recognition system; thus a computer system that accepts utterances as input and returns decodings. We will use the terms “recognizer,” “automatic speech recognition system,” “speech recognition system” and “recognition system” interchangeably; they mean the same thing.
- R_Ga recognizer that is constrained to return only decodings that correspond to phrases in the grammar G
- D a decoding; the output of a speech recognition system when presented with an utterance. To exhibit the particular input utterance associated to the decoding, we write D(u). For our purposes, a decoding consists of a pair f, s, where f is the decoded frame sequence (defined below), and s is the associated confidence score. (Note: we may sometimes use D to denote the length of a phoneme sequence, this will be clear from context.)
- sa confidence score; a number, determined by the recognition system, that expresses the likelihood that the decoding returned by the recognizer is correct. By assumption s lies in the interval [0, 1]; if not this can be arranged via a suitable scaling operation. Written s(u) to exhibit the associated utterance u.
- T a transcription or true transcription; regular text in the human language in question. To exhibit the particular utterance associated to the transcription, we write T(u). (Note: we may sometimes use T to denote the length of a phoneme sequence, this will be clear from context.)
- f a frame of decoded speech; thus the recognizer's output for a short segment of speech, nominally a phoneme in Φ. Written f(u) to exhibit the associated utterance u.
- f=f₁f₂. . . f_Fa decoded frame sequence; the sequence of frames associated to a particular decoding, where F is the number of frames in the sequence. Written f(u) or f₁(u) f₂(u) . . . f_F(u) to exhibit the associated utterance u.
- d=d₁d₂. . . d_Na decoded phoneme sequence; the sequence of phonemes, where N is the number of phonemes in the sequence, derived from a particular decoded frame sequence, by the operations of phoneme mapping and coalescing, explained below. Written d₁(u) d₂(u) . . . d_N(u) to exhibit the associated utterance u.
- t=t₁t₂. . . t_Qa true phoneme sequence; the sequence of phonemes, where Q is the number of phonemes in the sequence, derived from a true transcription T, by a means that is explained below. Also known as a pronunciation of T. Written t(u) or t₁(u) t₂(u) . . . t_Q(u) to exhibit the associated utterance u. Compare with the decoded phoneme sequence, as defined above, and note that for one and the same utterance u, the decoded phoneme sequence and true phoneme sequence may and typically will differ, and may even contain different numbers of phonemes.
- c(d|t) the integer-valued count of the phoneme pair d|t, derived as explained below.
- δ(d|t) the decoding cost of decoding phoneme t as phoneme d. If neither d nor t is the empty phoneme (defined below), this is also referred to as the substitution cost of substituting d for t.
- δ_(i)(d|t) the decoding cost of decoding phoneme t as phoneme d, at iteration i of the method. The index i is a so-called “dummy index”; the same quantity may also be denoted using the dummy index m as δ_(m)(d|t) to refer to the decoding cost at iteration m of the method.
- ε the empty phoneme, sometimes called epsilon
- Φ+ε the augmented phoneme alphabet; the set Φ augmented with the empty phoneme. Thus, Φ{ε}. Sometimes written Φ′.
- δ(d|ε) the insertion cost of inserting phoneme d into a decoding.
- δ(ε|t) the deletion cost of deleting phoneme t from a decoding.
- p(d|t) the probability of decoding a true phoneme t as the phoneme d.
- p(i)(d|t) the probability of decoding a true phoneme t as the phoneme d, at iteration i of the method. The index i is a so-called “dummy index”; the same quantity may also be denoted using the dummy index m.
- Π={p(d|t)} a family of conditional probability models, where each p(⋅|t) comprises a probability model, for each t in Φ+ε, over the space Ω=Φ+ε.
- Π_(i)={p_(i)(d|t)} a family of conditional probability models, at iteration i of the method. The index i is a so-called “dummy index”; the same quantity may also be denoted using the dummy index m.
- L a lattice; formally a directed acyclic graph comprising a set of nodes N and a set of arcs or edges E⊂N×N.
- a an arc of L; formally an ordered pair of nodes t,h∈E. If a=t,h, we say that a is an arc from t to h, and refer to t as the tail, and h as the head, of arc a.
- A=a₁, a₂, . . . , a_Ka path (of length K) in L; formally a sequence of arcs a₁, a₂, . . . , a_Kof L, with the property that the head of arc a_iis the tail of arc a_i+1, for each i=1, . . . , K−1.
- l(a) the label of arc a; comprising the phoneme pair x|y, with x, y∈Φ′, that is associated with the given arc in L

2. Method for Constructing an Empirically Derived Acoustic Confusability Measure

We first present an outline of the method, then present a detailed explanation of how to apply the method.

Outline of Method

The method comprises two basic steps. The first step is corpus processing, in which the original corpus is passed through the automatic speech recognition system of interest. This step is non-iterative; that is, the corpus is processed just once by the recognition system. The second step is development of a family of phoneme confusability models. This step is iterative; that is, it involves repeated passes over the corpus, at each step delivering an improved family of confusability models.

Corpus Processing

We assume that we have at our disposal some large and representative set of utterances, in some given human language, with associated reliable transcriptions. We refer to this as the corpus. By an utterance we mean a sound recording, represented in some suitable computer-readable form. By transcription we mean a conventional textual representation of the utterance; by reliable we mean that the transcription may be regarded as accurate. We refer to these transcriptions as the truth, or the true transcriptions.

In this step, we pass the utterances through an automatic speech recognition system, one utterance at a time. For each utterance, the recognition system generates a decoding, in a form called a decoded frame sequence, and a confidence score. As defined above, a frame is a brief audio segment of the input utterance.

The decoded frame sequence comprises the recognizer's best guess, for each frame of the utterance, of the phoneme being enunciated, in that audio frame. As defined above, a phoneme is one of a finite number of basic sound units of a human language.

This decoded frame sequence is then transformed, by a process that we describe below, into a much shorter decoded phoneme sequence. The confidence score is a measure, determined by the recognition system, of the likelihood that the given decoding is correct.

We then inspect the true transcription of the input utterance, and by a process that we describe below, transform the true transcription (which is just regular text, in the language of interest) into a true phoneme sequence.

Thus for each utterance we have confidence score, and a pair of phoneme sequences: the decoded phoneme sequence, and the true phoneme sequence. We refer to this entire collection as the recognized corpus, and denote it as P.

The recognized corpus constitutes the output of the corpus processing step.

Iterative Development of Probability Model Family

From the preceding step, we have at our disposal the recognized corpus P, comprising a large number of pairs of phoneme sequences.

In this step, we iteratively develop a sequence of probability model families. That is, we repeatedly pass through the recognized corpus, analyzing each pair of phoneme sequences to collect information regarding the confusability of any two phonemes. At the end of each pass, we use the information just collected to generate an improved family of probability models. We repeat the procedure until there is no further change in the family of probability models, or the change becomes negligible.

It is important to understand that this step as a whole comprises repeated iterations. In the detailed discussion below, we describe a single iteration, and the criterion for declaring the step as a whole complete.

The output of this step is a family of probability models, which estimates the acoustic confusability of any two members of the augmented phoneme alphabet 0′. From these estimates, by another method that we explain, we may then derive the acoustic confusability measure that we seek.

DETAILED DESCRIPTION OF THE METHOD

We now provide detailed descriptions of the steps outlined above.

Corpus Processing

Let X={<u₁, T₁>, . . . , <u_C, T_C>} be the corpus, comprising C pairs of utterances and transcriptions. For each <u, T> pair in X:

- 1. Recognize. Apply the recognizer R (or for a grammar-based system, the recognizer R_G, where G is a grammar that admits every transcription in the corpus, plus possibly other phrases that are desired to be recognized) to the utterance u, yielding as output a decoded frame sequence f and a confidence score s.
- 2. Optionally Map Phonemes. This step is optional. Let f=f₁f₂. . . f_Fbe the decoded frame sequence. Apply a phoneme map m to each element of the decoded frame sequence, yielding a new decoded frame sequence f′=f′₁f′₂. . . f′_F, where each f′_j=m(f_j).

The purpose of the phoneme map m is to reduce the effective size of the phoneme alphabet, by collapsing minor variants within the phoneme alphabet into a single phoneme. An example would be the mapping of the “p closure” phoneme, often denoted pcl, to the regular p phoneme. Another example would be splitting phoneme pairs, known as diphones, into separate phonemes. This operation can simplify the calculation, and avoids the problem of too finely subdividing the available statistical evidence, which can lead to unreliable estimates of phoneme confusability.

However, this operation may be skipped, or in what amounts to the same thing, the map m may be the identity map on the phoneme alphabet.

Note: it will be obvious to one skilled in the art, that by suitable modification the map m may function to expand rather than to reduce the phoneme alphabet, for instance by including left and/or right phonetic context in the output phoneme. This modification is also claimed as part of this invention.

- 3. Coalesce. Let f′=f′₁f′₂. . . f′_F, be the decoded frame sequence, optionally after the application of Step 2. We now perform the operation of coalescing identical sequential phonemes in the decoded frame sequence, to obtain the decoded phoneme sequence. This is done by replacing each subsequence of identical contiguous phonemes that appear in f′ by a single phoneme of the same type.

Thus if

f′=r r r r eI eI z z z z

is the decoded frame sequence, comprising 10 frames, the result of coalescing f′ is the decoded phoneme sequence

d=r eI z.

Here and above, r, eI and z are all members of the phoneme alphabet Φ. This phoneme sequence corresponds to the regular English language word “raise.” Note that d has three elements, respectively d₁=r, d₂=eI, and d₃=z.

We denote the coalescing operation by the letter g, and write d=g(f′) for the action described above.

- 4. Generate Pronunciation of T. Let T be the transcription of u. By lookup in the dictionary of the recognition system, or by use of the system's automatic pronunciation generation system, generate a pronunciation t for T, also written t=q(T). Thus if T is the regular English word “hazy,” then one possibility is
  
  t=h eI z i:

As above, h, eI, z, and i: are all members of the phoneme alphabet Φ. Note that t has four elements, respectively t₁=h, t₂eI, t₃=z, and t₄=i:.

It should be noted that there may be more than one valid pronunciation for a transcription T There are a number of ways of dealing with this:

- (a) Decode the utterance u with a grammar-based recognizer R_G, where the grammar G restricts the recognizer to emit only the transcription T(u). This is known as a “forced alignment,” and is the preferred embodiment of the invention.
- (b) Pick the most popular pronunciation, if this is known.
- (c) Pick a pronunciation at random.
- (d) Use all of the pronunciations, by enlarging the corpus to contain as many repetitions of u as there are pronunciations of T(u), and pairing each distinct pronunciation with a separate instance of u.
- (e) Pick the pronunciation that is closest, in the sense of string edit distance, to the decoded phoneme sequence d.

By applying these steps sequentially to each element of the corpus X, we obtain the recognized corpus P={<u1, d(u1), t(u1), s(u1)>, . . . , <uC, d(uC), t(uC), s(uC)>}, or more succinctly P={<u1, d1, t1, s1>, . . . , <uC, dC, tC, sC>}.

Iterative Development of Probability Model Family

We now give the algorithm for the iterative development of the required probability model family, Π={p(d|t)}.

- 1. Begin with the recognized corpus P.
- 2. Establish a termination condition τ. This condition typically depends on one or more of: the number of iterations executed, the closeness of match between the previous and current probability family models, respectively Π_(m−1)and Π_(m), or some other consideration. To exhibit this dependency explicitly, we write τ(m, Π_(m−1), Π_(m)).
- 3. Define the family of decoding costs {δ₍₀₎(x|y)|x, y in Φ′} as follows
  - δ₍₀₎(x|ε)=2 for each x in Φ
  - δ₍₀₎(ε|x)=3 for each x in Φ
  - δ₍₀₎(x|x)=0 for each x in Φ′
  - δ₍₀₎(x|y)=1 for each x, y in Φ, with x≠y.
    
    Note: these settings are exemplary, and not a defining characteristic of the algorithm. Practice has shown that the algorithm is not very sensitive to these values, so long as δ₍₀₎(x|x)=0, and the other quantities are greater than 0.
- 4. Set the iteration count m to 0.
- 5. For each x, y in Φ′, set the phoneme pair count c(x|y) to 0.
- 6. For each entry <u, d, t, s> in P, perform the following (these steps are explained in greater detail below):
  - a. Construct the lattice L=d×t.
  - b. Populate the lattice arcs with values drawn from the current family of decoding costs, {δ_(m)(x|y)}.
  - c. Apply the Bellman-Ford dynamic programming algorithm, or Dijkstra's minimum cost path first algorithm, to find the shortest path through this lattice, from the upper-left (source) node to the lower-right (terminal) node. The minimum cost path comprises a sequence of arcs A=a₁, a₂, . . . , a_k, in the lattice L, where the tail of arc a₁is the source node, the head of arc a_k, is the terminal node, and the head of arc a_iis the tail of arc a_i+1, for each i=1, . . . , K−1.
  - d. Traverse the minimum cost path determined in step c. Each arc of the path is labeled with some pair x|y, where x and y are drawn from Φ′. For each x|y arc that is traversed, increment the phoneme pair count c(x|y) by 1.
- 7. For each y in Φ′, compute c(y)=Σc(x|y), where the sum runs over all x in Φ′.
- 8. Estimate the family of probability models Π_(m)={p_(m)(x|y)}. For each fixed y in Φ′, this is done by one of the following two formulae:
  - a. If c(x|y) is non-zero for every x in Φ′, then set p_(m)(x|y)=c(x|y)/c(y), for each x in Φ′.
  - b. If c(x|y) is zero for any x in Φ′, apply any desired zero-count probability estimator, also known as a smoothing estimator, to estimate p_(m)(x|y). A typical method is Laplace's law of succession, which is p_(m)(x|y)=(c(x|y)+1)/(c(y)+|Φ′|), for each x in Φ′.
- 9. If m>0, test the termination condition τ(m, Π_{m−1), Π(m)}). If the condition is satisfied, return Π_(m)as the desired probability model family Π={p(d|t)} and stop.
- 10. If the condition is not satisfied, define a new family of decoding costs {δ_(m+1)(x|y)|x, y in Φ′} by δ_(m+1)(x|y)=−log p_(m)(x|y). (The logarithm may be taken to any base greater than 1.)

Note that each p_(m)(x|y) satisfies 0<p_(m)(x|y)<1, and so each δ_m+1)(x|y)>0.

- 11. Increment the iteration counter m and return to step 5 above.
  
  We now provide the additional discussion promised above, to explain the operations in Step 6 above.

Step 6a: Consider the entry <u, d, t, s> of P, with decoded phoneme sequence d=d₁d₂. . . d_N, containing N phonemes, and true phoneme sequence t=t₁t₂. . . t_Q, containing Q phonemes. Construct a rectangular lattice of dimension (N+1) rows by (Q+1) columns, and with an arc from a node (i, j) to each of nodes (i+1, j), (i, j+1) and (i+1, j+1), when present in the lattice. (Note: “node (i, j)” refers to the node in row i, column j of the lattice.) The phrase “when present in the lattice” means that arcs are created only for nodes with coordinates that actually lie within the lattice. Thus, for a node in the rightmost column, with coordinates (i, Q), only the arc (i, Q)→(i+1, Q) is created.)

Step 6b: Label

- each arc (i, j)→(i, j+1) with the cost δ_(m)(ε|t_j)
- each arc (i, j)→(i+1, j) with the cost δ_(m)(d_i|ε)
- each arc (i, j)→(i+1, j+1) with the cost δ_(m)(d_i|t_j).

An example of such a lattice appears, in various versions, in FIGS. 1, 2, and 3 below. FIG. 1 exhibits the lattice labeled with symbols, for the case where N=3 and Q=4, with symbolic expressions for decoding costs. FIG. 2 shows the lattice for the particular case d=r eI z and t=h eI z i:, again with symbolic expressions for decoding costs. FIG. 3 shows the same lattice, with the actual decoding costs for iteration 0 filled in.

Step 6c: The Bellman-Ford dynamic programming algorithm is a well-known method for finding the shortest path through a directed graphic with no negative cycles. We apply it here to find the shortest path from the source node, which we define as node (0, 0), to the terminal node, which we define as node (N, Q).

FIGS. 4, 5, 6, and 7 below demonstrate the application of the Bellman-Ford algorithm to the example of FIG. 3.

FIG. 4 shows the initial state of the algorithm, with the source node labeled with the minimum cost for reaching that node from the source node, which of course is 0.

FIG. 5 shows the state of the algorithm after labeling nodes (0, 1) and (1, 0) with the minimum cost for reaching those nodes from the source node. The arcs traversed to yield the minimum cost has also been exhibited, by thickening the line of the arc.

Because there is only a single arc incident on each of these nodes, the minimum costs are respectively 0+3=3 and 0+2=2. In each case, this quantity is determined as (minimum cost to reach the immediately preceding node)+(cost of traversing the arc from the immediately preceding node).

FIG. 6 shows the state of the algorithm after labeling node (1, 1). The computation here is less trivial, and we review it in detail. Node (1, 1) has three immediate predecessors, respectively (0, 0), (0, 1) and (1, 0). Each node has been labeled with its minimum cost, and so we may compute the minimum cost to (1, 1). This of course is the minimum among the three possible paths to (1, 1), which are:

- from (0, 0), via arc (0, 0)→(1, 1), with total cost 0+1=1
- from (0, 1), via arc (0, 1)→(1, 1), with total cost 3+2=5
- from (1, 0), via arc (1, 0)→(1, 1), with total cost 2+3=5.

It is evident that the path from (0, 0) is the minimum cost path, and this is indicated in FIG. 6.

By repeated application of this process, the minimum cost path from the source node to each node of the lattice may be determined. FIG. 7 shows the final result.

Because the arc costs are guaranteed to be non-negative, it is evident to one skilled in the art that the same computation may be performed, at possibly lower computational cost, using Dijkstra's shortest path first algorithm. The improvement follows from the fact that only the minimum cost path from the source node to the terminal node is required, and so the algorithm may be halted as soon as this has been determined.

The output of this step is a sequence of arcs A=a₁, a₂, . . . , a_k, in the lattice L, known to comprise the minimum cost path from the source node to the terminal node. We write l(a) for the phoneme pair x|y that labels the arc a.

Step 6d: For each arc a_iin the minimum cost path A, labeled with phoneme pair x|y=l(a_i), increment the counter c(x|y) by 1.

This completes the description of the method to construct an empirically derived acoustic confusability measure. The means of using the result of this algorithm to compute the acoustic confusability of two arbitrary phrases is described below.

N-Best Variant of the Method

An important variant of the just-described method to construct an empirically derived acoustic confusability measure, which can improve the accuracy of the resulting measure, is as follows.

It is well known to those skilled in the art that the output of a recognizer R (or R_G, for a grammar-based recognition system), may comprise not a single decoding D, comprising a pair f, s, but a so-called “N-best list,” comprising a ranked series of alternate decodings, written f₁, s₁, f₂, s₂, . . . , f_B, s_B. In this section we explain a variant of the basic method described above, called the “N-Best Variant,” which makes use of this additional information. The N-best variant involves changes to both the corpus processing step, and the iterative development of probability model family step, as follows.

N-Best Variant Corpus Processing

In the N-best variant of corpus processing, for each utterance u, each entry f_i(u), s_i(u) in the N-best list is treated as a separate decoding. All other actions, taken for a decoding of u, are then performed as before. The result is a larger recognized corpus P′.

N-Best Variant Iterative Development of Probability Model Family

In the N-best variant of iterative development of probability model family, there are two changes. First, the input is the larger recognized corpus, P′, developed as described immediately above. Second, in step 6d, as described above, when processing a given entry <u, d, t, s> of P′, each count c(x|y) is incremented by the value s, which is the confidence score of the given entry, rather than by 1.

The rest of the algorithm is unchanged.

3. Method to Compute the Empirically Derived Acoustic Confusability of Two Phrases

In the preceding sections we described how to determine the desired probability model family Π={p(d|t)}. In this section we explain how to use H to compute the acoustic confusability of two arbitrary phrases w and v.

Specifically, we give algorithms for computing two quantities, both relating to acoustic confusability. The first is the raw phrase acoustic confusability r(v|w). This is a measure of the acoustic similarity of phrases v and w. The second is the grammar-relative confusion probability p(v|w, G). This is an estimate of the probability that a grammar-constrained recognizer R_Greturns the phrase v as the decoding, when the true phrase was w. Note that no reference is made to any specific pronunciation, in either quantity.

In both cases, we must come to grips with the fact that the phrases v and w may have multiple acceptable pronunciations. There are a variety of ways of dealing with this, all of which are claimed as part of this patent.

In the process of computing these quantities, we also give expressions that depend upon specific pronunciations (and from which the pronunciation-free expressions are derived). These expressions have independent utility, and also are claimed as part of this patent.

Computation of Raw Pronunciation Acoustic Confusability r(q(v)|q(w)) and Raw Phrase Acoustic Confusability r(v|w)

We first assume that pronunciations q(w)∈Q(w) and q(v)∈Q(v) are given, and explain the computation of the raw pronunciation acoustic confusability, r(q(v)|q(w)). Then we explain methods to determine the raw phrase acoustic confusability r(v|w).

Computation of Raw Pronunciation Acoustic Confusability

Let the probability model family Π={p(d|t)} and the pronunciations q(w) and q(v) be given. Proceed as follows to compute the raw pronunciation acoustic confusability r(q(v)|q(w)):

- 1. Define the decoding costs δ(d|t)=−log p(d|t) for each d, t∈Φ′.
- 2. Construct the lattice L=q(v)×q(w), and label it with phoneme decoding costs δ(d|t), depending upon the phonemes of q(v) and q(w). This means performing the actions of Steps 6a and 6b, as described above, “Iterative Development of Probability Model Family,” with the phoneme sequences q(v) and q(w) in place of d and t respectively.
- 3. Perform the actions of Step 6c above to find the minimum cost path A=a₁, a₂, . . . , a_k, from the source node to the terminal node of L.
- 4. Compute S, the cost of the minimum cost path A, as the sum of the decoding costs δ(l(a)) for each arc a∈A. (Recall that l(a) is the phoneme pair x|y that labels a.) Thus,

$S = \sum_{i = 1}^{K} δ (l (a_{i})) .$

- 5. Compute r(q(v)|q(w))=exp(−S); this is the raw pronunciation acoustic confusability of q(v) and q(w). Here the exponential is computed to the same base as that used for the logarithm, in preceding steps.

Note that equivalently

$r (q (v) ❘ q (w)) = \prod_{i = 1}^{K} p (l (a_{i})),$

and indeed this quantity may be computed directly from the lattice L, by suitable modification of the steps given above.

We have described here one method of computing a measure of the acoustic confusability r(q(v)|q(w)) of two pronunciations, q(w) and q(v). In what follows we describe methods of manipulating this measure to obtain other useful expressions. It is to be noted that while the expressions developed below assume the existence of some automatic means of quantitatively expressing the confusability of two pronunciations, they do not depend on the exact formulation presented here, and stand as independent inventions.

Computation of Raw Phrase Acoustic Confusability

We begin by defining r(v|q(w))=Σr(q(v)|q(w)), where the sum proceeds over all q(v)∈Q(v). This accepts any pronunciation q(v) as a decoding of v. The raw phrase acoustic confusability r(v|w), with no reference to pronunciations, may then be determined by any of the following means:

- 1. Worst Case, Summed. Find q(w)∈Q(w) that minimizes r(w|q(w)); call this q†(w). Thus q†(w) is the pronunciation of w that is least likely to be correctly decoded. Set r(v|w)=r(v|q†(w)). This is the preferred implementation.
- 2. Worst Case, Individual Pronunciations. For v≠W, set r(v|w)=max {r(q(v)|q(w))}, where the maximum is taken over all q(v)∈Q(v) and q(w)∈Q(w). For v=w, set r(w|w)=min {r(q(w)|q(w))}, where the minimum is taken over all q(w)∈Q(w). Since higher values of r(q(v)|q(w)) imply greater confusability, this assigns to r(v|w) the raw pronunciation confusability of the two most confusable pronunciations of v and w respectively. This is the preferred method.
- 3. Most Common. Assume the two most common pronunciations of v and w are known, respectively q*(v) and q*(w). Set r(v|w)=r(q*(v)|q*(w)).
- 4. Average Case. Assume that a probability distribution on Q(w) is known, reflecting the empirical distribution, within the general population, of various pronunciations q(w) of w. Set r(v|w)=Σp(q(w))r(v|q(w)), where the sum proceeds over all q(w)∈Q(w).
- 5. Random. Randomly select q(v)∈Q(v) and q(w)∈Q(w), and set r(v|w)=r(q(v)|q(w)).

Those skilled in the art will observe ways to combine these methods into additional hybrid variants, for instance by randomly selecting q(v), but using the most common pronunciation q*(w), and setting r(v|w)=r(q(v) q*(w)).

Computation of Grammar-Relative Pronunciation Confusion Probability p(q(v)|q(w), G) and Grammar-Relative Phrase Confusion Probability p(v|w, G)

Suppose that a recognizer is constrained to recognize phrases within a grammar G. We proceed to define expressions that estimate the grammar-relative pronunciation confusion probability p(q(v)|q(w), G), and the grammar-relative phrase confusion probability p(v|w, G).

In what follows we write L(G) for the set of all phrases admissible by the grammar G, and Q(L(G)) for the set of all pronunciations of all such phrases. By assumption L(G) and Q(L(G)) are both finite.

Computation of Grammar Relative Pronunciation Confusion Probability p(q(v)|q(w), G)

Let two pronunciations q(v), q(w)∈Q(L(G)) be given; exact homonyms, that is q(v)=q(w), are to be excluded. We estimate p(q(v)|q(w), G), the probability that an utterance corresponding to the pronunciation q(w) is decoded by the recognizer R_Gas q(v), as follows.

- 1. Compute the normalizer of q(w) relative to G, written Z(q(w), G), as Z(q(w),G)=Σr(q(x)|q(w)), where the sum extends over all q(x)∈Q(L(G)), excluding exact homonyms (that is, cases where q(x)=q(w), for x≠w).
- 2. Set p(q(v)|q(w), G)=r(q(v)|q(w))/Z(q(w), G).
  
  Note: by virtue of the definition of the normalizer, this is in fact a probability distribution over Q(L(G)).
  
  Computation of Grammar-Relative Phrase Confusion Probability p(v|w, G)

Let two phrases v, w ∈L(G) be given. We estimate p(v|w, G), the probability that an utterance corresponding to any pronunciation of w is decoded by the recognizer R_Gas any pronunciation of v, as follows.

As above we must deal with the fact that there are in general multiple pronunciations of each phrase. We proceed in a similar manner, and begin by defining p(v|q(w),G)=Σp(q(v)|q(w),G), where the sum is taken over all q(v)∈Q(v). We may then proceed by one of the following methods:

- 1. Worst Case, Summed. Find q(w)∈Q(w) that minimizes p(w|q(w), G); call this q†(w). Thus q†(w) is the pronunciation of w that is least likely to be correctly decoded. Set p(v|w, G)=p(v|q†(w), G). This is the preferred implementation.
- 2. Worst Case, Individual Pronunciations. For v≠w, set p′(v|w, G)=max{p(q(v)|q(w), G)}, where the maximum is taken over all q(v)∈Q(v) and q(w)∈Q(w). For v=w, set p′(w|w, G)=min{p(q(w)|q(w), G)}, where the minimum is taken over all q(w)∈Q(w). Renormalize the set of numbers {p′(x|w, G)} to obtain a new probability distribution p(x|w, G).
- 3. Most Common. Assume the most common pronunciation of w is known, denoted q*(w). Set p(v|w, G)=p(v|q*(w), G).
- 4. Average Case. Assume the empirical distribution p(q(w)) over Q(w) is known. Set p(v|w, G)=Σp(q(w))p(v|q(w),G), where the sum is taken over all q(w)∈Q(w).
- 5. Random. For any given v, w∈L(G), randomly select q(v) and q(w) from Q(v) and Q(w) respectively, and set p′(v|w, G)=p(q(v)|q(w), G). Renormalize the set of numbers {p′(x|w, G)} to obtain a new probability distribution p(x w, G).

4. Techniques for Efficient Computation of Empirically Derived Acoustic Confusability Measures

In applying measures of acoustic confusability, it is typically necessary to compute a very large number of grammar-relative pronunciation confusion probabilities, p(q(v)|q(w), G), which ultimately depend upon the quantities r(q(v)|q(w)) and Z(q(w), G). We now explain three methods for improving the efficiency of these computations.

Partial Lattice Reuse

For a fixed q(w) in Q(L(G)), it is typically necessary to compute a large number of raw pronunciation confusability values r(q(v)|q(w)), as q(v) takes on each or many values of Q(L(G)). In principle for each q(v) this requires the construction, labeling and minimum-cost-path computation for the lattice L=q(v)×q(w), and this is prohibitively expensive.

This computation can be conducted more efficiently by exploiting the following observation. Consider two pronunciations q(v₁)=d₁₁, d₁₂, . . . , d_1Q1and q(v₂)=d₂₁, d₂₂, . . . , d_2Q2. Suppose that they share a common prefix; that is, for some M≤Q1, Q2 we have d_1j=d_2jfor j=1, . . . , M. Then the first M rows of the labeled and minimum-cost-path-marked lattice L₁=q(v₁)×q(w) can be reused in the construction, labeling and minimum-cost-path computation for lattice L₂=q(v₂)×q(w).

The reuse process consists of retaining the first (M+1) rows of nodes of the L₁lattice, and their associated arcs, labels and minimum-cost-path computation results, and then extending this to the L₂lattice, by adjoining nodes, and associated arcs and labels, corresponding to the remaining Q2-M phonemes of q(v₂). Thereafter, the computation of the required minimum-cost-path costs and arcs proceeds only over the newly-added Q2-M bottom rows of L₂.

For instance, continuing the exemplary lattice illustrated earlier, suppose q(w)=h eI z i:, and take q(v₁)=r eI z (a pronunciation of “raise”) and q(v₂)=r eI t (a pronunciation of “rate”). Then to transform L₁=q(v₁)×q(w) into L₂=q(v₂)×q(w) we first remove all the bottom row of nodes (those with row index of 3), and all arcs incident upon them. These all correspond to the phoneme “z” in q(v₁). (However, we retain all other nodes, and all labels, values and computational results that mark them.) Then we adjoin a new bottom row of nodes, and associated arcs, all corresponding to the phoneme “t” in q(v₂).

Note that it is possible, for example if q(v₂)=r eI (a pronunciation of “ray”), that no additional nodes need be added, to transform L₁into L₂. Likewise, if for example q(v₂)=r eI {circle around (a)} r (a pronunciation of “razor”), it is possible that no nodes need to be removed.

This procedure may be codified as follows:

- 1. Fix q(w) in Q(L(G)). Construct an initial “empty” lattice L₀, consisting of only the very top row of nodes and arcs, corresponding to q(w).
- 2. Sort Q(L(G)) lexicographically by phoneme, yielding an enumeration q(v₁), q(v₂), . . . .
- 3. Set the iteration counter i=1.
- 4. Find the length M of the longest common prefix of q(v_i−1)=d_{1−1 1}, d_{i−1 2}, . . . , d_i−Qi−1and q(v_i)=d_{i 1}, d_{i 2}, . . . , d_{i Qi}. This is the largest integer M such that d_i−1j=d_ijfor j=1, . . . , M.
- 5. Construct lattice L_ifrom L_i−1as follows:
  - a. Remove the bottom Q_i−1−M rows of nodes (and associated arcs, costs and labels) from L_i−1, corresponding to phonemes d_{i−1 M+1}, . . . , d_{i−1 Qi−1}of q(v_i−1), forming interim lattice L*.
  - b. Adjoin Q_i−M rows of nodes (and associated arcs, labeled with costs) to the bottom of L*, corresponding to phonemes d_{i M+1}, . . . , d_{i Qi}of q(v_i), forming lattice L_i=q(v_i)×q(w).
- 6. Execute the Bellman-Ford or Dijkstra's shortest path first algorithm on the newly-added portion of L_i. Compute the value of r(q(v_i)|q(w)) and record the result.
- 7. Increment the iteration counter i. If additional entries of Q(L(G)) remain, go to step 4. Otherwise stop.

It will be obvious to one skilled in the art that this same technique may be applied, with appropriate modifications to operate on the columns rather than the rows of the lattice in question, by keeping q(v) fixed, and operating over an enumeration q(w₁), q(w₂), . . . of Q(L(G)) to compute a sequence of values r(q(v)|q(w₁)), r(q(v)|q(w₂)), . . . .

Pruning

One application of acoustic confusability measures is to find phrases within a grammar, vocabulary or phrase list that are likely to be confused. That is, we seek pairs of pronunciations q(v), q(w), both drawn from Q(L(G)), with v≠w, such that r(q(v)|q(w)), and hence ultimately p(q(v)|q(w), G), is large.

In principle, this involves the computation of r(q(v)|q(w)) for some |Q(L(G))|²distinct pronunciation pairs. Because it is not uncommon for Q(L(G)) to contain as many as 100,000 members, this would entail on the order of 10 billion acoustic confusability computations. Because of the complexity of the computation, this is a daunting task for even a very fast computer.

However, it is possible to simplify this computation, as follows. If it can be established, with a small computational effort, that r(q(v)|q(w))<<r(q(w)|q(w)), then the expensive exact computation of r(q(v)|q(w)) need not be attempted. In this case we declare q(v) “not confusable” with q(w), and take r(q(v)|q(w))=0 in any further computations.

We refer to such a strategy as “pruning.” We now describe two complementary methods of pruning, respectively the method of Pronunciation Lengths, and the method of Pronunciation Sequences.

Pronunciation Lengths

Consider pronunciations q(v)=d₁, d₂, . . . , d_Dand q(w)=t₁, t₂, . . . , t_T. Suppose for a moment that D>>T; in other words that q(v) contains many more phonemes than q(w). Then the minimum cost path through the lattice L=q(v)×q(w) necessarily traverses many edges labeled with insertion costs δ(x|ε), for some x in the phoneme sequence q(v). This entails a lower bound on the minimum cost path through L, which in turn entails an upper bound on r(q(v)|q(w)).

We now explain the method in detail. Let q(v)=d₁, d₂, . . . , d_Dand q(w)=t₁, t₂, . . . , t_T, and let a threshold Θ be given. (The value of Θ may be a fixed number, a function of r(q(w)|q(w)), or determined in some other way.) We proceed to compute an upper bound r†(q(v)|q(w)) on r(q(v)|q(w)).

Let us write δ_i=δ(d_i|ε) for each phoneme d_iof q(v), where i=1, . . . , D. Sort these costs in increasing order, obtaining a sequence δ_i₁≤δ_i₂≤ . . . ≤δ_i_D.

Now, because D is the number of phonemes in q(v), even if the T phonemes of q(w) are exactly matched in the minimum cost path through the lattice, that path must still traverse at least I=−T arcs labeled with the insertion cost of some phoneme d of q(v). In other words, the cost S of the minimum cost path through the lattice is bounded below by the sum of the I smallest insertion costs listed above, S†=δ_i₁+δ_i₂+ . . . +δ_i_I. Because S≥S†, and by definition r(q(v)|q(w))=exp(−S), if we take r†(q(v)|q(w))=exp(−S†) we have r(q(v)|q(w))≤r†(q(v)|q(w)) as desired.

Note: the computation of the exponential can be avoided if we take B=log Θ, and equivalently check that −B≤S†.

A similar bound may be developed for the case T>>D. For this case we consider the phoneme deletion costs δ_i=δ(ε|t_i) for each phoneme t_iof q(w), where i=1, . . . , T. As before, we sort these costs, obtaining the sequence δ_i₁≤δ_i₂≤ . . . ≤δ_i_T. Letting E=T−D, we form the sum S†=δ_i₁+δ_i₂+ . . . +δ_i_g, and proceed as before.

Pronunciation Sequences

The preceding method of Pronunciation Lengths required either D>>T or T>>D, where these are the lengths of the respective pronunciation sequences. We now describe a method that may be applied, under suitable conditions, when D custom character T

For each ϕ in Φ, define δ_sd^min(ϕ)=min{δ(x|ϕ)|x∈Φ′}, and define δ_si^min(ϕ)=min{δ(ϕ|x∈Φ′}. Thus δ_sd^min(ϕ) is the minimum of all costs to delete ϕ or substitute any other phoneme for ϕ, and likewise δ_si^min(ϕ) is the minimum of all costs to insert ϕ or substitute ϕ for any other phoneme. Note that these values are independent of any particular q(v) and q(w), and may be computed once for all time.

To apply the method, as above let q(v)=d₁, d₂. . . , d_Dand q(w)=t₁, t₂, . . . , t_T, and let a threshold Θ be given.

For each ϕ in Φ, define w #(ϕ) and v #(ϕ) to be the number of times the phoneme ϕ appears in q(w) and q(v) respectively. Let n(ϕ)=w #(ϕ)−v #(ϕ).

Now form the sequence W\V=ϕ₁, ϕ_w, . . . , where for each ϕ in Φ with n(ϕ)>0, we insert n(ϕ) copies of ϕ into the sequence. Note that a given ϕ may occur multiple times in W\V, and observe that for each instance of ϕ in W\V, the minimum cost path through the lattice L=q(v)×q(w) must traverse a substitution or deletion arc for ϕ.

Now compute S†=Σδ_sd^min(ϕ), where the sum runs over the entries of W\V. It follows that S, the cost of the true minimal cost path through L, is bounded below by S†. Hence we may define r†(q(v)|q(w))=exp(−S†) and proceed as before.

A similar method applies with the sequence V\W, where we insert n(ϕ)=v #(ϕ)−w #(ϕ) copies of ϕ in the sequence, for n(ϕ)>0. (Note the interchange of v and w here.) We compute S†=Σδ_si^min(ϕ), where the sum runs over the entries of V\W, and proceed as above.

Incremental Computation of Confusability in a Sequence of Grammars

Suppose have two grammars, G and G′, such that L(G) and L(G′) differ from one another by a relatively small number of phrases, and hence so that Q(L(G)) and Q(L(G′)) differ by only a small number of pronunciations. Let us write Q and Q′ for these two pronunciation lists, respectively.

Suppose further that we have already computed a full set of grammar-relative pronunciation confusion probabilities, p(q(v) q(w), G), for the grammar G. Then we may efficiently compute a revised set p(q(v) q(w), G′), as follows.

First observe that the value of a raw pronunciation confusion measure, r(q(v) q(w)), is independent of any particular grammar. While Q′ may contain some pronunciations not in Q, for which new values r(q(v) q(w)) must be computed, most will already be known. We may therefore proceed as follows.

- 1. Compute any r(q(v) q(w)), for q(v), q(w) in Q′, not already known.
- 2. Let A=Q′\Q, that is, newly added pronunciations. Let B=Q\Q′, that is, discarded pronunciations.
- 3. Observe now that the normalizer Z(q(w),G′)=Σr(q(x)|q(w)), where the sum extends over q(x) in Q′, excluding exact homonyms, may be reexpressed as Z(q(w), G′)=Z(q(w), G)+Σ_q(x)∈Ar(q(x)|q(w))−Σ_q(x)∉Br(q(x)|q(w)). Moreover, the old normalizer Z(q(w), G′) is available as the quotient r(q(w) q(w))|p(q(w) q(w), G). Thus the new normalizer Z(q(w), G′) may be computed incrementally, at the cost of computing the two small sums.
- 4. Finally, p(q(v)|q(w), G′) may be obtained as r(q(v)|q(w))/Z(q(w), G′) as above.
- 5. Methods for Using Acoustic Confusability Measures

We now present two of the primary applications of an acoustic confusability measure.

The first of these, the “Confusatron,” is a computer program that takes as input an arbitrary grammar G, with a finite language L(G), and finds phrases in L(G) that are likely to be frequent sources of error, for the speech recognition system. The second is a method, called maximum utility grammar augmentation, for deciding in a principled way whether or not to add a particular phrase to a grammar.

While our discussion presumes the existence of a raw pronunciation confusability measure r(q(v)|q(w)), and/or grammar-relative pronunciation confusion probabilities p(q(v)|q(w), G), the methods presented in this section are independent of the particular measures and probabilities developed in this patent, and stand as independent inventions.

The Confusatron

We now explain a computer program, which we refer to as the “Confusatron,” which automatically analyzes a given grammar G to find so-called “dangerous words.” These are actually elements of L(G) with pronunciations that are easily confusable, by a given automatic speech recognition technology.

The value of the Confusatron is in its ability to guide a speech recognition system designer to decide what phrases are recognized with high accuracy within a given application, and which are not. If a phrase identified as likely to be poorly recognized may be discarded and replaced by another less confusable one, in the design phase, the system is less error-prone, and easier to use. If a phrase is likely to be troublesome, but must nevertheless be included in the system, the designer is at least forewarned, and may attempt to take some mitigating action.

We begin with a description of the Confusatron's function, and its basic mode of operation. We then describe variations; all are claimed as part of the patent.

The Confusatron generates a printed report, comprising two parts.

The first part, an example of which is exhibited in FIG. 8, lists exact homonyms. These are distinct entries v, w in L(G), with v≠w, for which q(v)=q(w), for some q(v)∈Q(v) and q(w)∈Q(w). That is, these are distinct literals with identical pronunciations. Thus no speech recognizer, no matter what its performance, is able to distinguish between v and w, when the utterance presented for recognition contains no additional context, and the utterance presented for recognition matches the given pronunciation. We say that the literals v and w “collide” on the pronunciation q(v)=q(w). Generating this homonym list does not require an acoustic confusability measure, just a complete catalog of the pronunciation set, Q(L(G)).

However, it is the second part that is really useful. Here the Confusatron automatically identifies words with distinct pronunciations that are nevertheless likely to be confused. This is the “dangerous word” list, an example of which is exhibited in FIG. 9.

The Confusatron operates as follows. Let G be a grammar, with finite language L(G), and finite pronunciation set Q(L(G)). Let {p(q(v)|q(w), G)} be a family of grammar-relative pronunciation confusability models, either derived from an underlying raw pronunciation confusion measure r(q(v)|q(w)) as described above, or defined by independent means.

It is useful at this point to introduce the quantity C(q(w), G), called the “clarity” of q(w) in G. This is a statistic of our invention, which is defined by the formula

$C (q (w), G) = 10 \log_{10} (\frac{p (q (w) ❘ q (w), G)}{1 - p (q (w) ❘ q (w), G)}) .$

The unit of this statistic, defined as above, is called a “deciclar,” where “clar” is pronounced to rhyme with “car.” This turns out to be a convenient expression, and unit, in which to measure the predicted recognizability of a given pronunciation q(w), within a given grammar G. Note that the clarity is defined with reference to a particular grammar. If the grammar is clear from context, we do not mention it or denote it in symbols.

Note that the higher the value of p(q(w)|q(w), G), which is the estimated probability that q(w) is recognized as itself, when enunciated by a competent speaker, the larger the value of C(q(w), G). Thus high clarity pronunciations are likely to be correctly decoded, whereas lower clarity pronunciations are less likely to be correctly decoded. This forms the basic operating principle of the Confusatron, which we now state in detail.

- 1. By plotting a histogram of clarity scores of correctly recognized and incorrectly recognized pronunciations, determine a clarity threshold Γ. Words with pronunciations with clarity below Γ are flagged as dangerous. Note: this step presumably need be performed only once, for a given speech recognition technology and acoustic confusability measure.
- 2. Let a grammar G be given. From G, by well-known techniques, enumerate its language L(G). From L(G), by use of the functionality of the automatic speech recognition system, or by other well-known means such as dictionary lookup, enumerate the pronunciation set Q(L(G)).
- 3. For each w in L(G):
  - a. Compute the clarity C(q(w), G) of each q(w) in Q(w). (For this computation, presumably any and all of the previously described speedup techniques may be applied to reduce the execution time of this step.)
  - b. Set the clarity of w, written C(w, G) to the minimum of C(q(w), G), over all q(w) in Q(w). If C(w, G)<Γ, declare w to be dangerous, and emit w and its clarity.
  - c. In conjunction with the clarity computations of step 3a, identify and record the phrases v for which p(v|q(w), G) attains its highest values. Emit those phrases.
  - d. In conjunction with the clarity computations of step 3a, identify and record any exact homonyms of q(w), and emit them separately.

Several important variations of the basic Confusatron algorithm are now noted.

Results for Pronunciations

First, rather than aggregating and presenting clarity results C(q(w), G) over all q(w) in Q(w), it is sometimes preferable to report them for individual pronunciations q(w). This can be useful if it is desirable to identify particular troublesome pronunciations.

Semantic Fusion

Second, there is often some semantic label attached to distinct phrases v and w in a grammar, such that they are known to have the same meaning. If they also have similar pronunciations (say, they differ by the presence of some small word, such as “a”), it is possible that the value of p(q(v)|q(w), G) is high. This may nominally cause q(w) to have low clarity, and thereby lead to flagging w as dangerous, when in fact the pronunciations q(v) that are confusable with q(w) have same underlying meaning to the speech recognition application.

It is straightforward to analyze the grammar's semantic labels, when they are present, and accumulate the probability mass of each p(q(v)|q(w), G) into p(q(w)|q(w), G), in those cases when v and w have the same meaning. This process is known as “semantic fusion,” and it is a valuable improvement on the basic Confusatron, which is also claimed in this patent.

Dangerous Word Detection Only

Suppose our task is only to decide if a given pronunciation q(w) is dangerous or not, that is if C(q(w), G)<Γ. By straightforward algebra, this can be turned into an equivalent comparison p(q(w)|q(w),G)<10^(Γ/10)/(1+10^(Γ/10)). Let us write Ψ for this transformed threshold 10^(Γ/10)/(1+10^(Γ/10).

Recall that p(q(w)|q(w), G)=r(q(w) q(w))|Z(q(w), G), and that the denominator is a monotonically growing quantity, as the defining sum proceeds over all q(v) in Q(L(G)), excluding homonyms of q(w). Now by definition p(q(w)|q(w), G)<Ψiff r(q(w)|q(w))/Z(q(w), G)<Ψ, that is iff Z(q(w), G)>r(q(w)|q(w))/Ψ.

Thus, we can proceed by first computing r(q(w)|q(w)), then accumulating Z(q(w), G), which is defined as Z(q(w),G)=Σr(q(x)|q(w)), where the sum runs over all non-homonyms of q(w) in Q(L(G)), and stopping as soon as the sum exceeds r(q(w)|q(w))/Ψ. If we arrange to accumulate into the sum the quantities r(q(x)|q(w)) that we expect to be large, say by concentrating on pronunciations of length close to that of q(w), then for dangerous words we may hope to terminate the accumulation of Z(q(w), G) without proceeding all the way through Q(L(G)).

Maximum Utility Grammar Augmentation

Suppose we are given a predetermined utility U(w) for recognizing a phrase w in a speech recognition application, a prior probability p(w) of the phrase. Then we may define the value of the phrase, within a grammar G, as V(w, G)=p(w) p(w|w, G) U(w). We may then further define the value of a grammar as the value of all its recognizable phrases; that is, V(G)=ΣV(w,G), where the sum extends over all w in L(G).

Consider now some phrase w that is not in L(G); we are trying to decide whether to add it to G or not. On the one hand, presumably adding the phrase has some value, in terms of enabling new functionality for a given speech recognition application, such as permitting the search, by voice, for a given artist or title in a content catalog.

On the other hand, adding the phrase might also have some negative impact, if it has pronunciations that are close to those of phrases already in the grammar: adding the new phrase could induce misrecognition of the acoustically close, already-present phrases.

Let us write G+w for the grammar G with w added to it. Then a principled way to decide whether or not to add a given phrase w is to compute the gain in value ΔV(w), defined as ΔV(w)=V(G+w)−V(G).

Moreover, given a list of phrases w₁, w₂, . . . , under consideration for addition to G, this method can be used to rank their importance, by considering each ΔV(w _i), and adding the phrases in a greedy manner. By recomputing the value gains at each stage, and stopping when the value gain is no longer positive, a designer can be assured of not inducing any loss in value, by adding too many new phrases.

Although the invention is described herein with reference to the preferred embodiment, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. Accordingly, the invention should only be limited by the Claims included below.

Claims

1. A method comprising: passing utterances for which transcriptions are available through a speech recognition system to produce decodings and corresponding confidence scores;producing a recognized corpus by associating each confidence score with (i) a corresponding one of the decodings or a representation thereof and (ii) a corresponding one of the transcriptions or a representation thereof;iterating from an initial probability model by passing through the recognized corpus to collect information regarding confusability and then using the information to develop an improved probability model; andderiving an acoustic confusability measure based on the improved probability model.
2. The method of claim 1, further comprising: transforming the decodings into decoded phoneme sequences; andtransforming the transcriptions into true phoneme sequences.
3. The method of claim 2, wherein the recognized corpus is produced by associating each confidence score with a pair of phoneme sequences, the pair of phoneme sequences including (i) a corresponding decoded phoneme sequence that is representative of the corresponding decoding and (ii) a corresponding true phoneme sequence that is representative of the corresponding transcription.
4. The method of claim 3, wherein said iterating involves passing each pair of phoneme sequences through the recognized corpus.
5. The method of claim 1, wherein the recognized corpus is repeatedly passed through, such that a sequence of probability models are generated, andwherein after each pass, the information collected regarding confusability is used to generate the improved probability model that is used for a next pass.
6. The method of claim 5, wherein the recognized corpus is repeatedly passed through until there is no further change in the improved probability model.
7. The method of claim 5, wherein the sequence of probability models comprises at least one model π={p(d|t)} wherein each of d and t are phonemes drawn from a phoneme alphabet.
8. The method of claim 1, wherein the acoustic confusability measure is indicative of acoustic similarity, and therefore acoustic confusability, between words or phrases that are in a given language.
9. The method of claim 1, wherein the recognized corpus is produced by generating, for each utterance, at least one decoded phoneme sequence and at least one true phoneme sequence.
10. The method of claim 1, further comprising: recognizing, from a corpus comprising a set of utterances with corresponding transcriptions, an utterance, to yield a recognized utterance comprising at least one decoded frame sequence; andcoalescing identical sequential phonemes of the at least one decoded frame sequence to yield at least one decoded phoneme sequence.
11. The method of claim 10, wherein said recognizing comprises producing a plurality of decodings for the recognized utterance.
12. The method of claim 11, wherein each of the plurality of decodings comprises a corresponding confidence score.
13. The method of claim 1, further comprising: determining, from a corpus comprising a set of utterances with corresponding transcriptions, at least one pronunciation including at least one true phoneme sequence.
14. The method of claim 13, wherein said determining comprises any of the steps of: for each word of each transcription, utilizing a most popular pronunciation;for each word of each transcription, utilizing a pronunciation selected at random;for each word of each transcription, utilizing the pronunciation that is closest by string edit distance to at least one decoded phoneme sequence for the respective word within the at least one decoded phoneme sequence; orfor each word of each transcription, utilizing each of a plurality of pronunciations from a set of all pronunciations of the respective word.
15. A non-transitory medium with instructions stored thereon that, when executed by a system, cause the system to perform operations comprising: obtaining decodings and corresponding confidence scores produced for utterances for which transcriptions are available;producing a recognized corpus by associating each confidence score with (i) a corresponding one of the decodings or a representation thereof and (ii) a corresponding one of the transcriptions or a representation thereof;iterating from an initial probability model by passing through the recognized corpus to collect information regarding confusability and then using the information to develop an improved probability model; andderiving, based on the improved probability model, an acoustic confusability measure that is used by a speech recognition application to selectively limit phrases to make recognizable.
16. The non-transitory medium of claim 15, wherein the decodings and the corresponding confidence scores are obtained by passing the utterances through the speech recognition application.
17. The non-transitory medium of claim 15, wherein the recognized corpus is repeatedly passed through, such that a sequence of probability models are generated, andwherein after each pass, the information collected regarding confusability is used to generate the improved probability model that is used for a next pass.
18. The non-transitory medium of claim 15, wherein the operations further comprise: apply a phoneme map to each element of each decoding, so as to reduce an effective size of a phoneme alphabet across the decodings.
19. The non-transitory medium of claim 18, wherein the effective size of the phoneme alphabet is reduced by collapsing minor variants within the phoneme alphabet into a single phoneme.
20. A computer-implemented system, comprising hardware configured using computer code to: obtaining decodings and corresponding confidence scores that are produced by passing utterances for which transcriptions are available through a speech recognition application;producing a recognized corpus by associating each confidence score with (i) a corresponding one of the decodings or a representation thereof and (ii) a corresponding one of the transcriptions or a representation thereof;iterating from an initial probability model by passing through the recognized corpus to collect information regarding confusability and then using the information to develop an improved probability model; andderiving an acoustic confusability measure based on the improved probability model.
21. The computer-implemented system of claim 20, wherein the hardware is further configured to use the computer code to: provide the acoustic confusability measure to the speech recognition application, which uses the acoustic confusability measure to selectively limit phrases to make recognizable.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/988,292, filed Aug. 7, 2020, which is a continuation of U.S. patent application Ser. No. 16/158,900, filed Oct. 12, 2018, now U.S. Pat. No. 10,748,527, issued Aug. 18, 2020, which is a continuation of U.S. patent application Ser. No. 15/457,964, filed Mar. 13, 2017, now U.S. Pat. No. 10,121,469, issued Nov. 6, 2018, which is a divisional of U.S. patent application Ser. No. 14/574,314, filed Dec. 17, 2014, now U.S. Pat. No. 9,626,965, issued Apr. 18, 2017, which is a divisional application of U.S. patent application Ser. No. 11/932,122, filed Oct. 31, 2007, now U.S. Pat. No. 8,959,019, issued Feb. 17, 2015, which are incorporated herein in their entireties by this reference thereto.

US Referenced Citations (184)

Number	Name	Date	Kind
4980918	Bahl et al.	Dec 1990	A
5381459	Lappington	Jan 1995	A
5553119	McAllister et al.	Sep 1996	A
5581655	Cohen et al.	Dec 1996	A
5611019	Nakatoh et al.	Mar 1997	A
5698834	Worthington et al.	Dec 1997	A
5737723	Riley et al.	Apr 1998	A
5752232	Basore et al.	May 1998	A
5754977	Gardner et al.	May 1998	A
5774859	Houser et al.	Jun 1998	A
5963903	Hon et al.	Oct 1999	A
5987411	Petroni et al.	Nov 1999	A
6009387	Ramaswamy et al.	Dec 1999	A
6012058	Fayyad et al.	Jan 2000	A
6021387	Mozer et al.	Feb 2000	A
6049767	Printz	Apr 2000	A
6073099	Sabourin et al.	Jun 2000	A
6130726	Darbee et al.	Oct 2000	A
6134527	Meunier et al.	Oct 2000	A
6141640	Moo	Oct 2000	A
6182039	Rigazio et al.	Jan 2001	B1
6185530	Ittycheriah et al.	Feb 2001	B1
6195641	Loring et al.	Feb 2001	B1
6243679	Mohri et al.	Jun 2001	B1
6260013	Sejnoha	Jul 2001	B1
6263308	Heckerman et al.	Jul 2001	B1
6298324	Zuberec et al.	Oct 2001	B1
6301560	Masters	Oct 2001	B1
6320947	Joyce et al.	Nov 2001	B1
6336091	Polikaitis et al.	Jan 2002	B1
6374177	Lee et al.	Apr 2002	B1
6374226	Hunt et al.	Apr 2002	B1
6381316	Joyce et al.	Apr 2002	B2
6408272	White et al.	Jun 2002	B1
6415257	Junqua et al.	Jul 2002	B1
6424935	Taylor	Jul 2002	B1
6446035	Grefenstette et al.	Sep 2002	B1
6493667	De Souza et al.	Dec 2002	B1
6523005	Holzapfel	Feb 2003	B2
6658414	Bryan et al.	Dec 2003	B2
6665644	Kanevsky et al.	Dec 2003	B1
6711541	Kuhn et al.	Mar 2004	B1
6711543	Cameron	Mar 2004	B2
6714632	Joyce et al.	Mar 2004	B2
6721633	Funk et al.	Apr 2004	B2
6725022	Clayton et al.	Apr 2004	B1
6728531	Lee et al.	Apr 2004	B1
6754625	Olsen et al.	Jun 2004	B2
6799201	Lee et al.	Sep 2004	B1
6804653	Gabel	Oct 2004	B2
6892083	Shostak	May 2005	B2
6901366	Kuhn et al.	May 2005	B1
6975993	Keiller	Dec 2005	B1
6985865	Packingham et al.	Jan 2006	B1
7013276	Bickley et al.	Mar 2006	B2
7020609	Thrift et al.	Mar 2006	B2
7027987	Franz et al.	Apr 2006	B1
7062477	Fujiwara et al.	Jun 2006	B2
7113981	Slate	Sep 2006	B2
7117159	Packingham et al.	Oct 2006	B1
7158959	Chickering et al.	Jan 2007	B1
7188066	Falcon et al.	Mar 2007	B2
7203645	Pokhariyal et al.	Apr 2007	B2
7219056	Axelrod et al.	May 2007	B2
7231380	Pienkos	Jun 2007	B1
7263487	Hwang	Aug 2007	B2
7263489	Cohen et al.	Aug 2007	B2
7277851	Henton	Oct 2007	B1
7310600	Garner et al.	Dec 2007	B1
7324947	Jordan et al.	Jan 2008	B2
7406417	Hain	Jul 2008	B1
7428555	Yan	Sep 2008	B2
7444282	Choo et al.	Oct 2008	B2
7447636	Schwartz et al.	Nov 2008	B1
7483885	Chandrasekar et al.	Jan 2009	B2
7519534	Maddux et al.	Apr 2009	B2
7590605	Josifovski	Sep 2009	B2
7654455	Bhatti et al.	Feb 2010	B1
7769786	Patel	Aug 2010	B2
7809601	Shaya et al.	Oct 2010	B2
7831549	Tilei et al.	Nov 2010	B2
7844456	Cai et al.	Nov 2010	B2
7860716	Tian et al.	Dec 2010	B2
7881930	Faisman et al.	Feb 2011	B2
7904296	Morris	Mar 2011	B2
7934658	Bhatti et al.	May 2011	B1
7949526	Ju et al.	May 2011	B2
7974843	Schneider	Jul 2011	B2
8165916	Hoffberg et al.	Apr 2012	B2
8306818	Chelba et al.	Nov 2012	B2
8321278	Haveliwala et al.	Nov 2012	B2
8321427	Stampleman et al.	Nov 2012	B2
8374870	Braho et al.	Feb 2013	B2
8515753	Kim et al.	Aug 2013	B2
8577681	Roth et al.	Nov 2013	B2
8793127	Printz et al.	Jul 2014	B2
8959019	Printz et al.	Feb 2015	B2
9626965	Printz et al.	Apr 2017	B2
10121469	Printz et al.	Nov 2018	B2
10748527	Printz et al.	Aug 2020	B2
11587558	Printz	Feb 2023	B2
20010019604	Joyce et al.	Sep 2001	A1
20010037324	Agrawal et al.	Nov 2001	A1
20020015480	Daswani et al.	Feb 2002	A1
20020032549	Axelrod et al.	Mar 2002	A1
20020032564	Ehsani et al.	Mar 2002	A1
20020044226	Risi	Apr 2002	A1
20020046030	Haritsa et al.	Apr 2002	A1
20020049535	Rigo et al.	Apr 2002	A1
20020075249	Kubota et al.	Jun 2002	A1
20020106065	Joyce et al.	Aug 2002	A1
20020107695	Roth et al.	Aug 2002	A1
20020116190	Rockenbeck et al.	Aug 2002	A1
20020116191	Olsen et al.	Aug 2002	A1
20020133340	Basson et al.	Sep 2002	A1
20020146015	Bryan et al.	Oct 2002	A1
20030004728	Keiller	Jan 2003	A1
20030028380	Freeland et al.	Feb 2003	A1
20030033152	Cameron	Feb 2003	A1
20030046071	Wyman	Mar 2003	A1
20030061039	Levin	Mar 2003	A1
20030065427	Funk et al.	Apr 2003	A1
20030068154	Zylka	Apr 2003	A1
20030069729	Bickley et al.	Apr 2003	A1
20030073434	Shostak	Apr 2003	A1
20030088416	Griniasty	May 2003	A1
20030093281	Geilhufe et al.	May 2003	A1
20030125928	Lee et al.	Jul 2003	A1
20030177013	Falcon et al.	Sep 2003	A1
20030212702	Campos et al.	Nov 2003	A1
20040039570	Harengel et al.	Feb 2004	A1
20040077334	Joyce et al.	Apr 2004	A1
20040110472	Witkowski et al.	Jun 2004	A1
20040127241	Shostak	Jul 2004	A1
20040132433	Stern et al.	Jul 2004	A1
20040153319	Yacoub	Aug 2004	A1
20040193408	Hunt	Sep 2004	A1
20040199498	Kapur et al.	Oct 2004	A1
20040249639	Kammerer	Dec 2004	A1
20050010412	Aronowitz	Jan 2005	A1
20050071224	Fikes et al.	Mar 2005	A1
20050125224	Myers et al.	Jun 2005	A1
20050143139	Park et al.	Jun 2005	A1
20050144251	Slate	Jun 2005	A1
20050170863	Shostak	Aug 2005	A1
20050182558	Maruta	Aug 2005	A1
20050198056	Dumais et al.	Sep 2005	A1
20050203751	Stevens et al.	Sep 2005	A1
20050228670	Mahajan et al.	Oct 2005	A1
20060018440	Watkins et al.	Jan 2006	A1
20060028337	Li	Feb 2006	A1
20060050686	Velez-Rivera et al.	Mar 2006	A1
20060064177	Tian et al.	Mar 2006	A1
20060074656	Mathias et al.	Apr 2006	A1
20060085521	Sztybel	Apr 2006	A1
20060136292	Bhati et al.	Jun 2006	A1
20060149635	Bhatti et al.	Jul 2006	A1
20060184365	Odell et al.	Aug 2006	A1
20060206339	Silvera et al.	Sep 2006	A1
20060206340	Silvera et al.	Sep 2006	A1
20060259467	Westphal	Nov 2006	A1
20060271546	Phung	Nov 2006	A1
20060287856	He et al.	Dec 2006	A1
20070027864	Collins et al.	Feb 2007	A1
20070033003	Morris	Feb 2007	A1
20070067285	Blume et al.	Mar 2007	A1
20070150275	Garner et al.	Jun 2007	A1
20070179784	Thambiratnam et al.	Aug 2007	A1
20070192309	Fischer et al.	Aug 2007	A1
20070198265	Yao	Aug 2007	A1
20070213979	Meermeier	Sep 2007	A1
20070214140	Dom et al.	Sep 2007	A1
20070219798	Wang et al.	Sep 2007	A1
20070250320	Chengalvarayan	Oct 2007	A1
20070271241	Morris et al.	Nov 2007	A1
20080021860	Wiegering et al.	Jan 2008	A1
20080046250	Agapi et al.	Feb 2008	A1
20080082322	Joublin et al.	Apr 2008	A1
20080103887	Oldham et al.	May 2008	A1
20080103907	Maislos et al.	May 2008	A1
20080126100	Grost et al.	May 2008	A1
20080154596	Da Palma et al.	Jun 2008	A1
20080250448	Rowe et al.	Oct 2008	A1
20090048910	Shenfield et al.	Feb 2009	A1

Foreign Referenced Citations (26)

Number	Date	Country
0635820	Jan 1995	EP
1341363	Sep 2003	EP
1447792	Aug 2004	EP
1003018	May 2005	EP
1633150	Mar 2006	EP
1633151	Mar 2006	EP
1742437	Jan 2007	EP
00016568	Mar 2000	WO
00021232	Apr 2000	WO
01022112	Mar 2001	WO
01022249	Mar 2001	WO
01022633	Mar 2001	WO
01022712	Mar 2001	WO
01022713	Mar 2001	WO
01039178	May 2001	WO
01057851	Aug 2001	WO
02007050	Jan 2002	WO
02011120	Feb 2002	WO
02017090	Feb 2002	WO
02097590	Dec 2002	WO
04077721	Sep 2004	WO
06033841	Mar 2006	WO
06098789	Sep 2006	WO
04021149	Mar 2007	WO
05079254	May 2007	WO
06029269	May 2007	WO

Non-Patent Literature Citations (19)

Entry
“BBN Intros Speech Recognition for Cellular/Phone Apps”, Newsbytes, U.S.A., Feb. 28, 1995.
“BBN's Voice Navigation for Time-Warner's FSN”, Telemedia News & Views, vol. 2, Issue 12, Dec. 1994, U.S.A.
“Full Servce Network”, Time Warner Cable, The TWC Story \| Eras Menu, 1990-1995, U.S.A.
Colman, P., “The Power of Speech”, Convergence, Aug. 1995, pp. 16-23, U.S.A.
Dawson, F., “Time Warner Pursues Voice as New Remote”, Broadband Week, Multichannel News, U.S.A., Jan. 1, 1995, pp. 31 and 34.
Frozena, J., “(BBN) Time Warner Cable and BBN Hark Systems Corporation Plan to Provide Voice Access to the Information Superhighway”, Business Wire, Cambridge. Massachusetts, U.S.A., Nov. 1, 1994.
Henriques, D., “Dragon Systems, a Former Little Guy, Gets Ready for Market”, New York Times, Business Day, Technology: Market Place, U.S.A., Mar. 1, 1999.
Lefkowitz, L., “Voice-Recognition Home TV Coming This Year; Service Merges Computer, Phone, and Cable Technologies”, Computer Shopper, vol. 15, p. 68, Feb. 1995, U.S. and U.K.
Wikipedia, “Additive smoothing”, 5 pages, downloaded Apr. 8, 2020. (Year: 2020).
Wikipedia, “Law of succession”, 10 pages, downloaded Apr. 8, 2020. (Year: 2020).
Salami , et al., “A Fully Vector Quantised Self-Excited Vocoder”, Int'l Conference on Acoustics, Speech & Signal Processing; vol. 1, par. 3.1; Glasgow, May 1989.
Schotz, S. , “Automatic prediction of speaker age using CART”, Course paper for course in Speech Recognition, Lund University, retrieved online from url: http://person2.sol.lu.se/SusznneSchotz/downloads/SR_paper_SusanneS2004.pdf, 2003, 8 pages.
“Dijkstra's Algorithm”, Wikipedia, downloaded Jul. 18, 2016., Jul. 8, 2016.
Amir, A. , et al., “Advances in Phonetic Word Spotting”, IBM Research Report RJ 10215, Aug. 2001, pp. 1-3.
Belzer , et al., “Symmetric Trellis-Coded Vector Quantization”, IEEE Transactions on Communications, IEEE Service Center, Piscataway, NJ, vol. 45, No. 45, par. II, figure 2, Nov. 1997, pp. 1354-1357.
Chan , et al., “Efficient Codebook Search Procedure for Vector-Sum Excited Linear Predictive Coding of Speech”, IEEE Electronics Letters; vol. 30, No. 22; Stevanage, GB, ISSN 0013-5194, Oct. 27, 1994, pp. 1830-1831.
Chan , “Fast Stochastic Codebook Search Through the Use of Odd-Symmetric Crosscorrelation Basis Vectors”, Int'l Conference on Acoustics, Speech and Signal Processing; Detroit, Michigan, vol. 1, Par. 1; ISBN 0-7803-2461-5, May 1995, pp. 21-24.
Chen , et al., “Diagonal Axes method (DAM): A Fast Search Algorithm for Vecotr Quantization”, IEEE Transactions on Circuits and Systems for Video Technology, Piscataway, NJ; vol. 7, No. 3, ISSN 1051-8215; Par. I, II, Jun. 1997.
Hanzo , et al., “Voice Compression and Communications—Principles and Applications for Fixed and Wireless Channels”, Wiley, ISBN 0-471-15039-8; par. 4.3.3, 2001.

Related Publications (1)

	Number	Date	Country
	20230206914 A1	Jun 2023	US

Divisions (2)

	Number	Date	Country
Parent	14574314	Dec 2014	US
Child	15457964		US
Parent	11932122	Oct 2007	US
Child	14574314		US

Continuations (3)

	Number	Date	Country
Parent	16988292	Aug 2020	US
Child	18171204		US
Parent	16158900	Oct 2018	US
Child	16988292		US
Parent	15457964	Mar 2017	US
Child	16158900		US

Efficient empirical determination, computation, and use of acoustic confusability measures

Information

Patent Number

Date Filed

Date Issued

Inventors

Original Assignees

Examiners

Agents

CPC

Field of Search

US

CPC

International Classifications

Term Extension

Abstract