1. Field of the Invention
This invention relates to the automated acquisition of grammar fragments for recognizing and understanding spoken language.
2. Introduction
In speech-understanding systems, the language models for recognition and understanding are traditionally designed separately. Furthermore, while there is a large amount of literature on automatically learning language models for recognition, most understanding models are designed manually and involve a significant amount of expertise in development.
In general, a spoken language understanding task can have a very complex semantic representation. A useful example is a call-routing scenario, where the machine action transfers a caller to a person or machine that can address and solve problems based on the user's response to an open-ended prompt, such as “How may I help you?” Spoken language understanding tasks associated with call routing are addressed in U.S. patent application Ser. No. 08/528,577, “Automated Phrase Generation”, and U.S. Pat. No. 5,675,707, “Automated Call Routing System”, both filed on Sep. 15, 1995, which are incorporated herein by reference in their entireties. Furthermore, such methods can be embedded within more complex tasks, as disclosed in U.S. patent application Ser. No. 08/943,944, filed Oct. 3, 1997, which is also hereby incorporated by reference in its entirety.
While there is a vast amount of literature on syntactic structure and parsing, much of that work involves a complete analysis of a sentence. It is well known that most natural language utterances cannot be completely analyzed by these methods due to lack of coverage. Thus, many approaches use grammar fragments in order to provide a localized analysis on portions of the utterance where possible, and to treat the remainder of the utterance as background. Typically, these grammar fragments are defined manually and involve a large amount of expertise.
In an attempt to solve some of these problems, U.S. patent application Ser. Nos. 08/960,289 and 08/960,291, both filed Oct. 29, 1997 and hereby incorporated by reference in their entireties, disclose how to advantageously and automatically acquire sequences of words, or “superwords”, and exploit them for both recognition and understanding. This is advantageous because longer units (e.g., area codes) are both easier to recognize and have sharper semantics.
While superwords (or phrases) have been shown to be very useful, many acquired phrases are merely mild variations of each other (e.g., “charge this call to” and “bill this to”). To exploit this similarity, U.S. patent application Ser. No. 08/893,888, filed Jul. 8, 1997 and incorporated herein by reference in its entirety, discloses how to automatically cluster such phrases by combining phrases with similar wordings and semantic associations. These meaningful phrase clusters were then represented as grammar fragments via traditional finite-state machines. This clustering of phrases is advantageous for two reasons: first, statistics of similar phrases can be pooled, thereby providing more robust estimation; and second, the clusters provide robustness to non-salient recognition errors, such as “dialed a wrong number” versus “dialed the wrong number”.
However, in order to utilize these grammar fragments in language models for both speech recognition and understanding, they must be both syntactically and semantically coherent. To achieve this goal, an enhanced clustering mechanism exploiting both syntactic and semantic associations of phrases is required.
A method and apparatus for clustering phrases into grammar fragments is provided. The method and apparatus exploit the succeeding words, preceding words, and semantics associated with each utterance in order to generate grammar fragments consisting of similar phrases. Distances between phrases may be calculated based on the distributions of preceding words, succeeding words, and call-types.
In at least one embodiment, the goal may be to generate a collection of grammar fragments, each representing a set of syntactically and semantically similar phrases. First, phrases observed in the training set may be selected as candidate phrases. Each candidate phrase may have three associated probability distributions: of succeeding contexts, of preceding contexts, and of associated semantic actions. The similarity between candidate phrases may be measured by applying the Kullback-Leibler distance to these three probability distributions. Candidate phrases that are close in all three distances may then be clustered into a grammar fragment. Salient sequences of these fragments may then be automatically acquired and exploited by a spoken language understanding module to determine a call classification.
The preferred embodiments of this invention will be described in detail, with reference to the following figures, wherein:
At step 130, a non-clustered grammar fragment is selected as the reference grammar fragment. Then, at step 140, all other fragments are sorted into three separate distance rankings based on their distances from the reference fragment. The process then proceeds to step 150, where a subset is identified from each of the three ranked lists based on the maximum difference in distance between successive fragments on the list. The process then proceeds to step 160.
At step 160, any fragments that occur in all three subsets are clustered. The process then proceeds to step 170, where the process checks to see if any more grammar fragments need to be examined. If more grammar fragments need to be examined, the process returns to step 130; otherwise the process proceeds to step 180 and ends.
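A minimal Python sketch of this loop follows. It is an illustration, not the claimed apparatus: fragments are represented as frozensets of phrases, and the three distance functions and the gap-based subset selector (made concrete later, in connection with equations (10) through (12)) are assumed to be supplied; all names are illustrative.

```python
from typing import Callable, Dict, FrozenSet, List

Fragment = FrozenSet[str]                       # a cluster of phrases
Distance = Callable[[Fragment, Fragment], float]

def cluster_fragments(fragments: List[Fragment],
                      distances: Dict[str, Distance],
                      pick_subset: Callable) -> List[Fragment]:
    """Steps 130-170: pick a reference fragment, rank the others under each of
    the three distances, cut each ranking at its largest gap, and merge any
    fragment that survives all three cuts into the reference."""
    merged = True
    while merged:                               # step 170: repeat until stable
        merged = False
        for ref in list(fragments):             # step 130: reference fragment
            if ref not in fragments:            # already merged away this pass
                continue
            others = [f for f in fragments if f != ref]
            if not others:
                break
            # steps 140-150: one ranked list and one head subset per distance
            subsets = [set(pick_subset(sorted(others, key=lambda f: d(ref, f)),
                                       d, ref))
                       for d in distances.values()]
            close = set.intersection(*subsets)  # step 160: close in all three
            if close:
                fragments = [f for f in fragments
                             if f != ref and f not in close]
                fragments.append(ref.union(*close))  # cluster into one fragment
                merged = True
    return fragments
```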
The process of
A candidate phrase is a phrase having a higher frequency of occurrence than some arbitrary threshold. The candidate phrases are regarded as units for generating the grammar fragments. A grammar fragment is a cluster of phrases acquired based on their similarity and is represented as a conventional finite-state machine. These fragments are both syntactically and semantically coherent.
The linguistic terms syntax and semantics can be understood from many standpoints, so their usage here must be clarified. Syntactic association signifies the relationship between a grammar fragment and the phrases succeeding or preceding that fragment. Several kinds of succeeding and preceding phrases for each fragment are generally observed in the training transcriptions. If the roles of two fragments in spoken dialog are similar, then the distributions of these phrases will be similar between the fragments. Thus, by syntactic association, we do not explicitly focus on grammatical issues, such as part of speech and tense, but rather on the distribution of phrases surrounding a grammar fragment. Semantic association, on the other hand, focuses on the relationship between a fragment in spoken language and the semantic actions, or call-types, corresponding to the speech. The distribution of call-types for one fragment must be comparable to that for another fragment if the two fragments are to be clustered. The semantic association is therefore the cross-channel association between speech and call-types.
An example of the syntactic and semantic associations of a fragment is illustrated in
In order to generate the syntactic probability distributions, a set of phrases that precede or succeed fragments is generated first. In the following discussion, a phrase that succeeds or precedes a fragment is called its context. Though in our experiments each context consists of a single word, the algorithm can be applied to longer contexts, so we describe the method in the general case. Consequently, a context can contain both words and non-terminal symbols corresponding to grammar fragments.
Since the contexts are the predecessors or successors of a fragment, they consist not only of words but also of the symbols BOS and EOS (beginning and end of sentence). In other words, a phrase in a grammar fragment cannot contain these symbols, because it must have both preceding and succeeding contexts.
Three probability distributions for each grammar fragment are obtained by using the preceding context, succeeding context, and call-type frequencies. The estimated probability distributions focusing on the succeeding and preceding contexts of a fragment $f_j$ are denoted in equations (1) and (2), respectively:

$$p(s_i^{t+1} \mid f_j^t) = \frac{C(f_j^t\, s_i^{t+1})}{\sum_{s_k \in S} C(f_j^t\, s_k^{t+1})} \qquad (1)$$

$$p(s_i^{t-1} \mid f_j^t) = \frac{C(s_i^{t-1}\, f_j^t)}{\sum_{s_k \in S} C(s_k^{t-1}\, f_j^t)} \qquad (2)$$
In both equations (1) and (2), $s_i$ denotes the i-th context in the context frequency list S; $f_j$ is the j-th grammar fragment in the fragment grammar; $w_N$ denotes the N-th word in the context $s_i$; and $N_c$ ($N_c \geq 1$) is the number of items in the context $s_i$. Suffixes such as t, t+1, and t−1 denote the sequential order of words, contexts, or fragments. The function $C(\cdot)$ counts the frequency of a sequence in the training transcriptions, as described above.
The contexts $s_i^{t+1}$ and $s_i^{t-1}$ are equivalent to the word sequences $w_1^{t+1} w_2^{t+2} \cdots w_{N_c}^{t+N_c}$ and $w_{N_c}^{t-N_c} \cdots w_2^{t-2} w_1^{t-1}$, respectively.
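To make these definitions concrete, the following minimal Python sketch estimates the three distributions by relative-frequency counting. It is an illustration under stated assumptions, not the patented implementation: transcriptions are taken to be lists of word tokens with one call-type label per utterance, phrases are tuples of words, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

BOS, EOS = "<BOS>", "<EOS>"

def context_distributions(transcriptions, phrases, n_c=1):
    """Equations (1)-(2): count succeeding and preceding contexts of length n_c
    for each phrase, then normalize the counts into probabilities."""
    succ, prec = defaultdict(Counter), defaultdict(Counter)
    for words in transcriptions:
        padded = [BOS] + list(words) + [EOS]
        for phrase in phrases:                  # phrase: tuple of words
            n = len(phrase)
            # a phrase must have both contexts, so it never contains BOS or EOS
            for t in range(1, len(padded) - n):
                if tuple(padded[t:t + n]) == phrase:
                    succ[phrase][tuple(padded[t + n:t + n + n_c])] += 1
                    prec[phrase][tuple(padded[max(0, t - n_c):t])] += 1
    norm = lambda cnt: {s: v / sum(cnt.values()) for s, v in cnt.items()}
    return ({f: norm(c) for f, c in succ.items()},
            {f: norm(c) for f, c in prec.items()})

def calltype_distributions(transcriptions, calltypes, phrases):
    """Equation (3): count the call-types of the utterances in which each
    phrase occurs, giving the semantic distribution for the third distance."""
    ct = defaultdict(Counter)
    for words, label in zip(transcriptions, calltypes):
        w = tuple(words)
        for phrase in phrases:
            n = len(phrase)
            if any(w[i:i + n] == phrase for i in range(len(w) - n + 1)):
                ct[phrase][label] += 1
    return {f: {c: v / sum(cnt.values()) for c, v in cnt.items()}
            for f, cnt in ct.items()}
```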
As a result, three types of probability distribution are obtained for each fragment. The distance between two fragments is calculated by comparing each type of probability distribution; namely, three distances between two fragments are measured using the succeeding context, preceding context, and call-type probability distributions.
While any distance measurement known to those skilled in the art may be used, the Kullback-Leibler distance is one of the most popular measurements of similarity between two probability distributions. Because of the logarithmic term in the Kullback-Leibler distance, the probabilities in equations (1), (2), and (3) must be positive. Therefore, back-off smoothing is applied in advance to each probability distribution by using a unigram probability distribution. The context frequency list S described above and the set of call-type frequencies are utilized to make the context and call-type unigram probability distributions, respectively. Equation (4) defines the Kullback-Leibler distance between fragments $f_1$ and $f_2$ exploiting the succeeding context probability distributions:

$$d_s(f_1, f_2) = \sum_{s_i \in S} \hat{p}(s_i^{t+1} \mid f_1^t)\, \log \frac{\hat{p}(s_i^{t+1} \mid f_1^t)}{\hat{p}(s_i^{t+1} \mid f_2^t)} \qquad (4)$$

The subscript s in $d_s$ denotes “succeeding context.”
S represents the set of contexts described above, and $s_i$ is one of the contexts. The conditional probabilities $\hat{p}(s_i^{t+1} \mid f_1^t)$ and $\hat{p}(s_i^{t+1} \mid f_2^t)$ are the smoothed distributions for the fragments $f_1$ and $f_2$, respectively. A distance based on the preceding context probability distributions can be measured in the same manner. Equation (5) defines the distance based on the preceding context probability distributions:

$$d_p(f_1, f_2) = \sum_{s_i \in S} \hat{p}(s_i^{t-1} \mid f_1^t)\, \log \frac{\hat{p}(s_i^{t-1} \mid f_1^t)}{\hat{p}(s_i^{t-1} \mid f_2^t)} \qquad (5)$$
The functions $\hat{p}(s_i^{t-1} \mid f_1^t)$ and $\hat{p}(s_i^{t-1} \mid f_2^t)$ are the smoothed predecessor probability distributions for the fragments $f_1$ and $f_2$, respectively. Equation (6) defines the distance based on the call-type probability distributions:

$$d_c(f_1, f_2) = \sum_{c_i \in \mathcal{C}} \hat{p}(c_i \mid f_1)\, \log \frac{\hat{p}(c_i \mid f_1)}{\hat{p}(c_i \mid f_2)} \qquad (6)$$

In equation (6), $c_i$ represents one of the call-types belonging to the call-type set $\mathcal{C}$. The functions $\hat{p}(c_i \mid f_1)$ and $\hat{p}(c_i \mid f_2)$ are the smoothed probability distributions for the call-type $c_i$ associated with fragments $f_1$ and $f_2$, respectively.
In general, the Kullback-Leibler distance is an asymmetric measure; namely, the distance from $f_1$ to $f_2$ is not equal to that from $f_2$ to $f_1$. We therefore symmetrize the Kullback-Leibler distance by defining each type of distance as the average of the two distances measured from both fragments. Thus the fragment distances shown in equations (7), (8), and (9) are used in fragment clustering:

$$D_s(f_1, f_2) = \frac{d_s(f_1, f_2) + d_s(f_2, f_1)}{2} \qquad (7)$$

$$D_p(f_1, f_2) = \frac{d_p(f_1, f_2) + d_p(f_2, f_1)}{2} \qquad (8)$$

$$D_c(f_1, f_2) = \frac{d_c(f_1, f_2) + d_c(f_2, f_1)}{2} \qquad (9)$$
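The smoothed, symmetrized distance can be sketched as follows. The interpolation-style smoothing below merely stands in for the back-off smoothing described above, and the weight alpha is an assumed parameter; the distributions are those produced by the counting sketch above.

```python
import math

def smoothed(p, unigram, alpha=0.9):
    """Make every probability positive by interpolating with the unigram
    distribution; alpha is an assumed interpolation weight."""
    return {s: alpha * p.get(s, 0.0) + (1 - alpha) * q
            for s, q in unigram.items()}

def kl_distance(p, q):
    """Plain Kullback-Leibler distance d(p || q) over a shared, positive support."""
    return sum(pv * math.log(pv / q[s]) for s, pv in p.items() if pv > 0)

def symmetric_distance(p1, p2, unigram):
    """Equations (7)-(9): average of the two asymmetric KL distances between
    the smoothed distributions of two fragments."""
    p1s, p2s = smoothed(p1, unigram), smoothed(p2, unigram)
    return 0.5 * (kl_distance(p1s, p2s) + kl_distance(p2s, p1s))
```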
The basic idea of grammar fragment clustering is that fragments having a comparatively small distance from a reference fragment are regarded as similar and are clustered into the same grammar fragment. In this study, however, three distances, based on preceding contexts, on succeeding contexts, and on call-types, are obtained between fragments. Therefore, only fragments for which all three distances are small are clustered together.
All candidate phrases described above may be generated from the training transcriptions. Each candidate phrase then forms a grammar fragment, yielding the initial set of grammar fragments; namely, each grammar fragment consists of one candidate phrase at the first stage. The fragment clustering algorithm proceeds as follows.
The frequency of each grammar fragment may be obtained by summing its candidate phrase frequencies. A grammar fragment $f_0$ having the highest frequency and consisting of one phrase is selected as the reference fragment. All fragments are sorted in order of their fragment distance from $f_0$. The fragment distance lists based on preceding contexts, on succeeding contexts, and on call-types are sorted independently; thus, three fragment lists in distance order are obtained as the result of the sorting. In each fragment list, the subset of fragments for clustering is determined based on the maximum difference in distance between successive fragments in that list. For instance, in the fragment list based on the succeeding context distance, the number of candidate fragments $N_s(f_0)$ is determined by:

$$N_s(f_0) = \operatorname*{arg\,max}_{1 \le i < N_m} \left[ D_s(f_0, f_{i+1}) - D_s(f_0, f_i) \right] \qquad (10)$$
Symbols $f_i$ and $f_{i+1}$ represent rank-ordered fragments with respect to the succeeding context distance. $D_s(f_0, f_{i+1})$ and $D_s(f_0, f_i)$ are the distances from the reference fragment $f_0$ to the fragments $f_{i+1}$ and $f_i$, respectively. The distance $D_s(f_0, f_i)$ monotonically increases with i. $N_m$ is the maximum number of fragments to be compared. The numbers of candidate fragments based on the distances focusing on preceding contexts, $N_p(f_0)$, and on call-types, $N_c(f_0)$, can be determined in the same way by using the distances $D_p(f_0, f_i)$ and $D_c(f_0, f_i)$. Following these determinations, the maximum number of candidates among the three types of distance is determined by:
$$N(f_0) = \max\{N_p(f_0),\, N_s(f_0),\, N_c(f_0)\} \qquad (11)$$
All fragments whose rank order in each fragment list is within $N(f_0)$ are selected as candidates of similar fragments. Fragments listed within the ranking $N(f_0)$ in all three candidate lists are syntactically and semantically similar to the reference fragment $f_0$, and such fragments are merged into the reference fragment $f_0$. Equation (12) shows the criterion of fragment classification based on fragment distance orders:
$$f_0' = \{\, f_i \mid O_p(f_i) \le N(f_0) \;\wedge\; O_s(f_i) \le N(f_0) \;\wedge\; O_c(f_i) \le N(f_0) \,\} \qquad (12)$$
The symbol $f_0'$ denotes the new grammar fragment generated by this merging. $O_p(f_i)$, $O_s(f_i)$, and $O_c(f_i)$ represent the ranked orders focusing on preceding contexts, succeeding contexts, and call-types, respectively. If there is a fragment similar to the reference fragment, the reference fragment $f_0$ is updated by clustering the similar fragments, and the clustering algorithm is iterated over the updated fragment set. If the grammar fragment $f_0$ is not augmented, $f_0$ remains as one of the candidates when another fragment is selected as the reference in the next iteration.
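The gap-based cut-off of equation (10) and the rank intersection of equations (11) and (12) can be sketched as follows; the fragment representation, the three distance functions passed in as dists, and the n_max default of 80 follow the experiment described later, but the code itself is only an illustration.

```python
def subset_size(ranked, dist, ref, n_max=80):
    """Equation (10): cut a ranking at the largest jump between successive
    distances among the first n_max fragments."""
    ds = [dist(ref, f) for f in ranked[:n_max]]
    gaps = [ds[i + 1] - ds[i] for i in range(len(ds) - 1)]
    if not gaps:
        return len(ds)
    return 1 + max(range(len(gaps)), key=gaps.__getitem__)

def similar_fragments(ref, fragments, dists, n_max=80):
    """Equations (11)-(12): keep the fragments whose rank is within N(f0)
    under all three distances; these are merged into the reference."""
    rank_maps, sizes = [], []
    for d in dists:                   # preceding, succeeding, call-type distances
        ranked = sorted((f for f in fragments if f != ref),
                        key=lambda f: d(ref, f))
        rank_maps.append({f: i + 1 for i, f in enumerate(ranked)})
        sizes.append(subset_size(ranked, d, ref, n_max))
    n0 = max(sizes)                   # equation (11): N(f0)
    return [f for f in fragments      # equation (12): within N(f0) in all lists
            if f != ref and all(rm[f] <= n0 for rm in rank_maps)]
```

In the clustering loop sketched earlier, the pick_subset argument would correspond to something like `lambda ranked, d, ref: ranked[:subset_size(ranked, d, ref)]`.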
Some phrases in a fragment can be partially replaced by references to other grammar fragments. Using fragments having a higher frequency than the fragment in focus, each phrase is parsed, and some of its words are replaced by non-terminal symbols representing other fragments. For example, the phrase “want to make” in the fragment <015> can be decomposed into “want”, “to” and “make”. The words “want” and “make” are each one of the phrases in the fragments <005> and <002>, respectively. Therefore, the phrase “want to make” is an instantiation of the fragment sequence “<005> to <002>”. As a consequence of this parsing, the fragment grammar acquires the ability to represent not only phrases given as input, but also word sequences not observed in the training transcriptions.
The generalization of grammar fragments is performed in order of grammar fragment frequency. A parser replaces the word sequence of each phrase with a non-terminal symbol representing the grammar fragment to which the phrase belongs. When a set of grammar fragments has been created, the frequency of each fragment is obtained by summing the frequencies of the phrases represented by that fragment.
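A minimal sketch of this parsing step follows. It assumes a lexicon mapping the word tuples of higher-frequency fragments to their non-terminal symbols, and it uses a greedy longest-match strategy, which is an assumption of the sketch rather than a detail disclosed above.

```python
def generalize_phrase(phrase, lexicon):
    """Replace any word subsequence that matches a phrase of a higher-frequency
    fragment with that fragment's non-terminal symbol.
    `lexicon` maps word tuples to symbols, e.g. ('want',) -> '<005>'."""
    out, i = [], 0
    while i < len(phrase):
        for j in range(len(phrase), i, -1):     # prefer the longest match at i
            symbol = lexicon.get(tuple(phrase[i:j]))
            if symbol is not None:
                out.append(symbol)
                i = j
                break
        else:                                   # no fragment matched at i
            out.append(phrase[i])
            i += 1
    return tuple(out)

# e.g. generalize_phrase(('want', 'to', 'make'),
#                        {('want',): '<005>', ('make',): '<002>'})
# -> ('<005>', 'to', '<002>')
```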
For call-type classification, salient grammar fragments are automatically generated from the parsed training transcriptions and their associated call-types. Each salient grammar fragment consists of the call-type with the highest association score and a corresponding sequence of conventional words and non-terminal symbols for grammar fragments.
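One way to picture this acquisition step is the sketch below, which approximates the association score by the conditional probability of a call-type given a fragment sequence; the actual salience measure and the thresholds shown are assumptions of the sketch, not the disclosed method.

```python
from collections import Counter, defaultdict

def salient_fragments(parsed_utterances, min_count=5, min_assoc=0.8):
    """For each fragment sequence seen in the parsed transcriptions, keep the
    call-type with the highest association, approximated here by
    P(call-type | sequence); min_count and min_assoc are assumed thresholds."""
    counts = defaultdict(Counter)
    for sequence, calltype in parsed_utterances:  # sequence: words/non-terminals
        counts[sequence][calltype] += 1
    salient = {}
    for sequence, by_type in counts.items():
        total = sum(by_type.values())
        best, n = by_type.most_common(1)[0]
        if total >= min_count and n / total >= min_assoc:
            salient[sequence] = (best, n / total)
    return salient
```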
Using the method discussed in reference to
The grammar fragment clustering subsystem 1100 operates on a database of a large number of utterances, each of which is related to one of a predetermined set of routing objectives and is labeled with its associated routing objective. The operation of this subsystem is essentially carried out by the candidate phrase selector 1120, which selects as an output a set of candidate phrases having a probabilistic relationship with one or more of the set of predetermined routing objectives with which the input speech utterances are associated. The selected candidate phrases are then input to a distance calculation device 1130, which determines the probabilistic distances between candidate phrases. Once the distances are calculated and ranked, the grammar fragment clustering device 1140 clusters those phrases which appear in a subset of all three distance rankings. The operations of the candidate phrase selector 1120, the distance calculation device 1130, and the grammar fragment clustering device 1140 are generally carried out in accordance with the previously described method for selecting and clustering grammar fragments.

Operation of the input speech classification subsystem 1110 begins with the inputting of a user's task objective request, in the caller's natural speech, to the input speech recognizer 1150. The input speech recognizer 1150 may be of any known design and performs the functions of recognizing, or spotting, the existence of one or more grammar fragments in the input speech. The grammar fragment cluster detector 1160 then detects the grammar fragment clusters present among the recognized grammar fragments. As can be seen in the figure, the grammar fragment clusters developed by the grammar fragment clustering subsystem 1100 are provided as an input to the input speech recognizer 1150, the grammar fragment cluster detector 1160, and the classification processor 1170.
The output of the grammar fragment cluster detector 1160, which comprises the detected grammar fragment clusters appearing in the caller's routing objective request, is provided to the classification processor 1170. The classification processor 1170 may apply a confidence function, based on the probabilistic relation between the recognized grammar fragment clusters and the selected task objectives, and make a decision either to implement a particular task objective or, after determining that no decision is likely, to default the user to an operator position.
Experiments were conducted using the method described above. In this task, digit sequences, such as telephone numbers and credit card numbers, are observed in the training transcriptions. These digit sequences can be generalized by using non-terminal symbols. Some non-terminal symbols, such as <digit07> and <digit10>, are used to generalize telephone numbers, and others, such as <digit16> and <digit18>, are applied to credit card numbers. In this experiment, the training transcriptions were first filtered by using such non-terminal symbols. Using these filtered training transcriptions, the phrases, contexts, and grammar fragments were generated by the method described above.
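The digit filtering can be illustrated with a short sketch; the run-length naming follows the <digitNN> convention in the text, while the set of spoken digit words and everything else here is an assumption.

```python
SPOKEN_DIGITS = {"zero", "oh", "one", "two", "three", "four",
                 "five", "six", "seven", "eight", "nine"}

def filter_digit_sequences(words):
    """Replace each run of spoken digits with a length-tagged non-terminal,
    e.g. a seven-digit run becomes '<digit07>'."""
    out, run = [], 0
    for w in list(words) + [None]:       # None is a sentinel to flush the last run
        if w is not None and w.lower() in SPOKEN_DIGITS:
            run += 1
        else:
            if run:
                out.append(f"<digit{run:02d}>")
                run = 0
            if w is not None:
                out.append(w)
    return out

# e.g. filter_digit_sequences("call five five five one two one two".split())
# -> ['call', '<digit07>']
```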
The engine used for speech recognition was the AT&T Watson recognizer. The recognition language model is as described in U.S. patent application Ser. Nos. 08/960,289 and 08/960,291. The acoustic model for the process was trained with a database of telephone-quality fluent speech. The training transcription contained 7,844 sentences, while the test transcription contained 1,000 sentences. For the grammar fragment acquisition, the number of words in a phrase was constrained to be three or fewer in this experiment. Each phrase observed 30 times or more in the training transcription was selected as a candidate to participate in the clustering; in total, 1,108 candidate phrases were obtained. The context length $N_c$ for computing the distances between two fragments was set to one, and 3,582 context phrases were used for creating the syntactic probability distributions. The maximum number of candidate fragments $N_m$ was set to 80.
In the call-type classification, there are two important performance measures. The first measure is the false rejection rate, where a call is falsely rejected or classified as the call-type other. Since such calls are transferred to a human operator, this measure corresponds to a missed opportunity for automation. The second measure is the probability of correct classification. Errors in this measure lead to misunderstanding that must be resolved by a dialog manager.
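Under one plausible reading of these two measures, they could be computed as follows; the exact denominators used in the experiments are not specified above, so this sketch is only illustrative.

```python
def classification_metrics(decisions, references, reject_label="other"):
    """decisions and references are parallel lists of call-type labels;
    `reject_label` marks calls routed to a human operator."""
    pairs = list(zip(decisions, references))
    # false rejection rate: calls sent to an operator that had a real call-type
    false_rej = sum(1 for d, r in pairs
                    if d == reject_label and r != reject_label) / len(pairs)
    accepted = [(d, r) for d, r in pairs if d != reject_label]
    # probability of correct classification among calls the machine acted on
    correct = (sum(1 for d, r in accepted if d == r) / len(accepted)
               if accepted else 0.0)
    return false_rej, correct
```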
The call-type classification performance is significantly improved by the fragment grammar. This improvement results from the salient grammar fragments used in the call-type classifier, which now accept various phrases that are syntactically and semantically similar to the originals, providing generality. From this experimental result, we can conclude that by generalizing grammar fragments, unobserved phrases are obtained without degrading the call-type classification performance.
An example of the variety of phrases accepted by a salient grammar fragment is illustrated in
While this invention has been described in conjunction with the specific embodiments outlined above, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, the preferred embodiments of the invention as set forth above are intended to be illustrative, not limiting. Various changes may be made without departing from the spirit and scope of the invention as defined in the following claims.
This application is a continuation of U.S. patent application Ser. No. 09/666,563, filed on Sep. 21, 2000, which is a continuation of U.S. patent application Ser. No. 09/217,635, filed Dec. 21, 1998, now U.S. Pat. No. 6,173,261, issued Jan. 9, 2001, which claims the benefit of U.S. Provisional Application No. 60/102,433, filed Sep. 30, 1998, the contents of which are incorporated herein by reference in their entirety. This application is also a continuation-in-part of U.S. patent application Ser. No. 08/943,944, filed Oct. 3, 1997, now U.S. Pat. No. 6,192,110, issued Feb. 20, 2001, which is a continuation-in-part of U.S. patent application Ser. No. 08/528,578, filed Sep. 15, 1995, now U.S. Pat. No. 5,675,707, issued Oct. 7, 1997.
Number | Name | Date | Kind |
---|---|---|---|
4477600 | Fesman | Oct 1984 | A |
4866778 | Baker | Sep 1989 | A |
4903305 | Gillick et al. | Feb 1990 | A |
5029214 | Hollander | Jul 1991 | A |
5033088 | Shipman | Jul 1991 | A |
5099425 | Kanno et al. | Mar 1992 | A |
5384892 | Strong | Jan 1995 | A |
5390272 | Repta et al. | Feb 1995 | A |
5434906 | Robinson et al. | Jul 1995 | A |
5457768 | Tsuboi et al. | Oct 1995 | A |
5666400 | McAllister et al. | Sep 1997 | A |
5675707 | Gorin et al. | Oct 1997 | A |
5719921 | Vysotsky et al. | Feb 1998 | A |
5748841 | Morin et al. | May 1998 | A |
5794193 | Gorin | Aug 1998 | A |
5839106 | Bellegarda | Nov 1998 | A |
5860063 | Gorin et al. | Jan 1999 | A |
5905774 | Tatchell et al. | May 1999 | A |
5960384 | Brash | Sep 1999 | A |
6021384 | Gorin et al. | Feb 2000 | A |
6023673 | Bakis et al. | Feb 2000 | A |
6044337 | Gorin et al. | Mar 2000 | A |
6067520 | Lee | May 2000 | A |
6163769 | Acero et al. | Dec 2000 | A |
6173261 | Arai et al. | Jan 2001 | B1 |
6182039 | Rigazio et al. | Jan 2001 | B1 |
6192110 | Abella et al. | Feb 2001 | B1 |
6278967 | Akers et al. | Aug 2001 | B1 |
8666744 | Arai et al. | Mar 2014 | B1 |
Entry |
---|
Emami et al., “Using a Connectionist Model in a Syntactical Based Language Model”, IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003, Proceedings, Apr. 2003. |
IBM TDB'051, Technical Disclosure Bulletin NN85102051, Pseudo-Pos Language Model Having A Redefined Vocabulary, Oct. 1985. |
Patent Application for U.S. Appl. No. 08/943,944, “Method and Apparatus for Generating Semantically Consistent Inputs to a Dialog Manager”, filed Oct. 3, 1997, by Alicia Abella et al. |
Langkilde et al., “Automatic Prediction of Problematic Human-Computer Dialogues in ‘How May I Help You?’,” Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, Keystone, CO, pp. 369-372, 1999. |
Number | Date | Country | |
---|---|---|---|
20140303978 A1 | Oct 2014 | US |
Number | Date | Country | |
---|---|---|---|
60102433 | Sep 1998 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 09666563 | Sep 2000 | US |
Child | 14196536 | US | |
Parent | 09217635 | Dec 1998 | US |
Child | 09666563 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 08943944 | Oct 1997 | US |
Child | 09666563 | US | |
Parent | 08528578 | Sep 1995 | US |
Child | 08943944 | US |