1. Field of the Invention
The present invention relates to the field of speech recognition, and, more particularly, to automatic grammar tuning using statistical language model generation.
2. Description of the Related Art
Speech recognition systems often use one or more language models to improve speech recognition accuracy. Language models provide information concerning a likelihood that various words or phrases will be used in combination with each other. Two basic types of language models include statistical language models and grammar-based language models.
A statistical language model is a probabilistic description of the constraints on word order found in a given language. Most current statistical language models are based on the N-gram principle, where the probability of the current word is calculated on the basis of the identities of the immediately preceding (N-1) words. A statistical language model grammar is not manually written, but is trained from a set of examples that models expected speech, where the set of examples can be referred to as a speech corpus. One significant drawback to statistical language model grammars is that the speech corpus needed to generate a statistical language model grammar can be very large. A reasonably sized speech corpus can, for example, contain over twenty thousand utterances or can contain five thousand complete sentences. A cost incurred to obtain this speech corpus can be prohibitively high.
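By way of non-limiting illustration, the N-gram principle can be sketched for the case N=2 (a bigram model). The sketch below assumes whitespace tokenization, maximum-likelihood estimation without smoothing, and a three-sentence toy corpus; the function name `train_bigram` and the boundary markers `<s>` and `</s>` are hypothetical and are used for illustration only.

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate P(word | previous word) by maximum likelihood (no smoothing)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        # Hypothetical sentence boundary markers.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    # Probability of each word given the single preceding word (N - 1 = 1).
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }

# Toy speech corpus; a practical corpus would be far larger.
corpus = ["check my balance", "check my account", "transfer my balance"]
model = train_bigram(corpus)
print(model[("my", "balance")])  # "balance" follows "my" in 2 of 3 cases
```

A practical statistical language model would typically use a larger N and some form of smoothing so that unseen word sequences do not receive zero probability.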
A grammar-based language model manually specifies a set of rules that are written in a grammar specification language, such as the NUANCE Grammar Specification Language (GSL), a Speech Recognition Grammar Specification (SRGS) compliant language, a JAVA Speech Grammar Format (JSGF) compliant language, and the like. Using the grammar specification language, a set of rules is constructed that together define what may be spoken.
Performance of grammar-based language models can be significantly improved by tuning the grammars, where grammar tuning is a process of improving speech recognition accuracy by modifying speech grammar based on an analysis of its performance. Grammar tuning is often performed during an iterative period of usability testing and application improvement. Grammar tuning often involves amending an existing grammar with commonly spoken phrases, removing highly confusable words, and adding additional ways that a speaker may pronounce a word. For example, cross-wording tuning can fix utterances that contain words which run together. Adding representative probabilities to confusion pairs can correct substitution errors.
Conventionally implemented grammar tuning typically involves manual tuning efforts, which can involve specialized skills. Manual tuning can be an extremely time-consuming activity that can take longer than is practical for a development effort. Further, conventional grammar tuning requires access to grammar source code, which may not be available.
The present invention provides an automatic grammar tuning solution, which selectively replaces an original grammar with an automatically generated statistical language model grammar, referred to as a replacement grammar. The original grammar can be a statistical language model grammar or can be a grammar-based language model grammar. The speech corpus used to create the replacement grammar can be created from logged data. The logged data can be obtained from speech recognition runs that utilized the original grammar. After the replacement grammar is generated, a performance analysis can be performed to determine whether performance of the replacement grammar represents an improvement over the performance of the original grammar. When it does, the original grammar can either be automatically and dynamically replaced with the replacement grammar or an authorized administrator can be presented with an option to replace the original grammar with the replacement grammar.
The present invention can be implemented in accordance with numerous aspects consistent with material presented herein. For example, one aspect of the present invention can include a grammar tuning method. The method can utilize an original speech recognition grammar in a speech recognition system to perform speech recognition operations for multiple recognition instances. Instance data associated with the recognition operations can be stored. A replacement grammar can be automatically generated from the stored instance data, where the replacement grammar is a statistical language model grammar. The original speech recognition grammar, which can be a grammar-based language model grammar or a statistical language model grammar, can be selectively replaced with the replacement grammar. For example, when tested performance for the replacement grammar is better than that for the original grammar, the replacement grammar can replace the original grammar.
Another aspect of the present invention can include a method for tuning speech recognition grammars. The method can perform speech-to-text operations using an original speech recognition grammar. The original speech recognition grammar can be a grammar-based language model grammar. Data for recognition instances associated with the speech-to-text operations can be stored. A set of words and phrases can be created from the stored recognition data. A replacement grammar can be automatically generated from the set of words and phrases. This replacement grammar can be a statistical language model grammar. The original speech recognition grammar can be selectively replaced with the replacement grammar.
Still another aspect of the present invention can include a speech recognition system, which includes a language model processor, a log data store, a statistical language model generator, and a grammar swapper. The language model processor can utilize an original speech recognition grammar in performing speech recognition operations. The log data store can store speech instance data associated with the speech recognition operations. The statistical language model generator can automatically generate a replacement grammar from the speech instance data. The grammar swapper can selectively replace the original speech recognition grammar with the replacement grammar.
It should be noted that various aspects of the invention can be implemented as a program for controlling computing equipment to implement the functions described herein, or a program for enabling computing equipment to perform processes corresponding to the steps disclosed herein. This program may be provided by storing the program in a magnetic disk, an optical disk, a semiconductor memory, or any other recording medium. The program can also be provided as a digitally encoded signal conveyed via a carrier wave. The described program can be a single program or can be implemented as multiple subprograms, each of which interacts within a single computing device or interacts in a distributed fashion across a network space.
It should also be noted that the methods detailed herein can also be methods performed at least in part by a service agent and/or a machine manipulated by a service agent in response to a service request.
There are shown in the drawings, embodiments which are presently preferred, it being understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown.
More specifically, the speech recognition engine 110 can convert received speech 106 into speech recognized text 108, using an acoustic model processor 112 and a language model processor 114. The language model processor 114 can utilize words, phrases, weights, and rules defined by an original grammar 118. The language model processor 114 can be configured to handle grammar-based language model grammars as well as statistical language model grammars. Grammar 118 can be stored in a grammar data store 116.
The speech recognition engine 110 can include machine readable instructions for performing speech-to-text conversions. In one embodiment, the speech recognition engine 110 can be implemented within a clustered server environment, such as within a WEBSPHERE computing environment. Engine 110 can also be implemented within a single server, within a desktop computer, within an embedded device, and the like. The various components of system 100 can be implemented within the same computing space, or within other remotely located spaces, which are communicatively linked to the engine 110.
In one embodiment, the data store 116 can include a plurality of grammars, which are selectively activated. For example, the data store 116 can include context dependent grammars and/or speaker dependent grammars, which are selectively activated depending on conditions of the system 100. Accordingly, grammar 118 can be a context dependent grammar, a context independent grammar, a speaker dependent grammar, or a speaker independent grammar, depending upon implementation specifics for system 100.
Further, the data store 116 can include grammar-based language model grammars and statistical language model grammars. The grammar-based language model grammars can be written in any language including, but not limited to, a NUANCE Grammar Specification language (GSL), a Speech Recognition Grammar Specification (SRGS) compliant language, and a JAVA Speech Grammar Format (JSGF) compliant language.
As speech recognition engine 110 executes, instance data 122 associated with the speech recognition runs can be placed in log data store 120. The instance data 122 can include audio containing speech utterances, speech-converted text, confidence scores for a recognition instance, a context for the recognition instance, and other such data.
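For illustration only, one logged recognition instance might be represented as a simple record, as sketched below. The class name `RecognitionInstance`, its field names, and the example values are hypothetical; the description above specifies only the kinds of data that can be logged, not a storage format.

```python
from dataclasses import dataclass

@dataclass
class RecognitionInstance:
    """One hypothetical entry of instance data in a log data store."""
    audio_path: str        # location of the audio containing the utterance
    recognized_text: str   # speech-converted text
    confidence: float      # confidence score for this recognition instance
    context: str           # context in which the utterance was recognized

# A small in-memory stand-in for the log data store.
log_data_store = [
    RecognitionInstance("utt_0001.wav", "check balance", 0.91, "main_menu"),
    RecognitionInstance("utt_0002.wav", "pay bill", 0.47, "main_menu"),
]

# Example query: instances whose confidence falls below an assumed threshold.
low_confidence = [inst for inst in log_data_store if inst.confidence < 0.5]
```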
The training data store 130 can be an additional repository in which training data is stored. The training data can be generated from the instance data 122 or can be independently obtained. The training data can include speech utterances and associated transcribed text. The text can represent desired results for when the speech utterances are speech-to-text converted.
The grammar enumerator 140 can access the log data store 120 and/or the training data store 130 and can generate a set of words and phrases 150 contained therein. This set of words and phrases 150 can be further processed by the weighing engine 142. The weighing engine can determine a relative frequency of use for each of the words and phrases from data stored in data stores 120 and/or 130, which is used to generate weighed set 152. Set 152 can be conveyed to a grammar generator 144, which uses the weighed set 152 to generate a replacement grammar 154. The replacement grammar 154 can be a statistical language model grammar and the data contained in data stores 120 and/or 130 can be used as a speech corpus for the grammar 154.
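The enumeration and weighing operations described above can be sketched as follows. This is a minimal sketch that assumes the logged data is available as plain text strings, that phrases are contiguous word sequences of at most two words, and that relative frequency is computed among phrases of the same length. The function names `enumerate_phrases` and `weigh` and the parameter `max_len` are hypothetical.

```python
from collections import Counter

def enumerate_phrases(logged_texts, max_len=2):
    """Enumerate every word and contiguous phrase of up to max_len words."""
    phrases = []
    for text in logged_texts:
        tokens = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                phrases.append(" ".join(tokens[i:i + n]))
    return phrases

def weigh(phrases):
    """Assign each phrase its relative frequency among phrases of equal length."""
    counts = Counter(phrases)
    totals = Counter()
    for phrase, count in counts.items():
        totals[len(phrase.split())] += count
    return {
        phrase: count / totals[len(phrase.split())]
        for phrase, count in counts.items()
    }

# Toy logged data standing in for the contents of the log data store.
logged_texts = ["pay bill", "pay bill", "check balance"]
weighed_set = weigh(enumerate_phrases(logged_texts))
print(weighed_set["pay bill"])  # 2 of the 3 two-word phrases
```

A grammar generator could then serialize such a weighed set into whatever statistical language model format the recognition engine accepts; that format is implementation specific and is not shown here.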
Performance analyzer 146 can then compare relative performance of replacement grammar 154 against performance data for corresponding grammar 118. Depending upon the results of the comparisons, a grammar replacement action can be triggered. If so, grammar swapper 148 can replace grammar 118 with grammar 154.
In one embodiment, grammar 118 can be stored within data store 149 for a designated trial time. Operational performance metrics can be captured for the replacement grammar 154 during this trial time. It is possible that the replacement grammar 154 performs worse than the original speech recognition grammar 118 even though performance analyzer 146 predicted improved performance. If operational performance of replacement grammar 154 is worse than the original grammar, the grammar swapper 148 can exchange grammars 118 and 154.
Another reason to store the original speech recognition grammar 118 in data store 149 (assuming grammar 118 is a grammar-based language model grammar) is that manual tuning of grammar 118 can occur subsequently to the swap. Once manually tuned, grammar 118 can have better performance metrics than those of replacement grammar 154. In which case, the grammars can be re-swapped using grammar swapper 148.
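The swap, trial, and re-swap behavior described above can be sketched as follows. This is a minimal sketch in which grammars are represented by opaque objects (strings here), the archive attribute stands in for data store 149, and operational performance is reduced to a single accuracy number. The class name `GrammarSwapper`, its method names, and the function `end_of_trial` are hypothetical.

```python
class GrammarSwapper:
    """Keeps one active grammar and archives the grammar it replaced."""

    def __init__(self, original):
        self.active = original
        self.archived = None  # stand-in for the archive data store

    def replace(self, replacement):
        """Activate the replacement and archive the previously active grammar."""
        self.archived = self.active
        self.active = replacement

    def exchange(self):
        """Swap the active and archived grammars back (a re-swap)."""
        if self.archived is None:
            raise ValueError("no archived grammar to exchange with")
        self.active, self.archived = self.archived, self.active

def end_of_trial(swapper, trial_accuracy, baseline_accuracy):
    """Revert to the archived grammar if the trial performed worse."""
    if trial_accuracy < baseline_accuracy:
        swapper.exchange()

swapper = GrammarSwapper("original_grammar")
swapper.replace("replacement_grammar")
# Assumed operational metrics gathered during the trial time.
end_of_trial(swapper, trial_accuracy=0.78, baseline_accuracy=0.84)
print(swapper.active)  # the original grammar is active again
```

The same `exchange` call covers the case in which the archived original grammar is manually tuned after the swap and then outperforms the replacement grammar.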
Data stores 116, 120, 130, and 149 can be physical or virtual storage spaces configured to store digital content. Each of the data stores 116, 120, 130, and 149 can be physically implemented within any type of hardware including, but not limited to, a magnetic disk, an optical disk, a semiconductor memory, a digitally encoded plastic memory, a holographic memory, or any other recording medium. Further, each data store 116, 120, 130, and 149 can be a stand-alone storage unit as well as a storage unit formed from a plurality of physical devices. Additionally, content can be stored within data stores 116, 120, 130, and 149 in a variety of manners. For example, content can be stored within a relational database structure or can be stored within one or more files of a file storage system, where each file may or may not be indexed for information searching purposes. Further, the data stores 116, 120, 130, and 149 can utilize one or more encryption mechanisms to protect stored content from unauthorized access.
Components of system 100 can be communicatively linked via one or more networks (not shown). The networks can include any hardware, software, and firmware necessary to convey digital content encoded within carrier waves. Content can be contained within analog or digital signals and conveyed through data or voice channels. The networks can include local components and data pathways necessary for communications to be exchanged among computing device components and between integrated device components and peripheral devices. The networks can also include network equipment, such as routers, data lines, hubs, and intermediary servers which together form a packet-based network, such as the Internet or an intranet. The networks can further include circuit-based communication components and mobile communication components, such as telephony switches, modems, cellular communication towers, and the like. The networks can include line based and/or wireless communication pathways.
Method 200 can begin in step 205, where a speech recognition system can be utilized to perform speech recognition operations for multiple recognition instances. The speech recognition system can use an original speech recognition grammar when performing the operations. The speech recognition grammar can be a grammar-based language model grammar or a statistical language model grammar. In step 210, instance data associated with the recognition operations can be stored in a data store.
In step 215, words and phrases contained in the data store can be enumerated. In step 220, the words and phrases can be weighed. The recognition instance data can be used to determine relative usage frequency for weighing purposes. In step 225, a replacement grammar can be generated using the weighed words and phrases. The replacement grammar can be a statistical language model grammar.
In step 230, performance metrics of the replacement grammar can be compared against performance metrics of the original speech recognition grammar. For example, the data store can include a training set of audio. The training set of audio can be automatically generated from the recognition instances and/or can be a standard training set with known results. The comparisons of step 230 can compare confidence scores generated by the grammars and/or can compare generated results against manual transcriptions of the training set.
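One way to compare generated results against manual transcriptions is word error rate, sketched below. Word error rate is a commonly used speech recognition metric and is offered here as an assumption; the description above does not name a specific metric. The function names `word_error_rate` and `mean_wer` and the sample utterances are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcription must not be empty")
    # Dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

def mean_wer(references, hypotheses):
    """Average word error rate over paired transcriptions and results."""
    rates = [word_error_rate(r, h) for r, h in zip(references, hypotheses)]
    return sum(rates) / len(rates)

# Manual transcriptions and assumed outputs from each grammar.
transcriptions = ["check my account balance", "pay my bill"]
original_results = ["check account ballots", "pay my bill"]
replacement_results = ["check my account balance", "pay my bill"]

original_wer = mean_wer(transcriptions, original_results)
replacement_wer = mean_wer(transcriptions, replacement_results)
replacement_is_better = replacement_wer < original_wer
```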
In step 235, a determination can be made as to whether the replacement grammar has better performance metrics than the original speech recognition grammar. If not, the method can loop to step 205, where further recognition instances can be performed using the original speech recognition grammar. Because accuracy of a statistical language model grammar can increase with a larger training corpus and because a statistical language model grammar is generated specifically for a training corpus, the method 200 can be performed iteratively with potentially varying results.
If the performance metrics of the replacement grammar are better than those of the original speech recognition grammar, the method can proceed from step 235 to step 240, where the original speech recognition grammar can be replaced. Replacement can occur automatically and/or based upon a manual selection depending upon implementation specifics. The method can loop from step 240 to step 205, where it can repeat. Thus a speech recognition grammar can be continuously tuned as recognition instance data changes.
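The iterative flow of method 200 can be summarized in the following sketch. It makes several simplifying assumptions that are not part of the description: a grammar is modeled as a mapping from whole utterances to relative frequencies, performance is measured by a toy coverage metric (the fraction of test utterances present in the grammar), and replacement is automatic. All function and variable names are hypothetical.

```python
from collections import Counter

def build_replacement(logged_utterances):
    """Steps 215-225: enumerate, weigh, and generate a replacement grammar."""
    counts = Counter(u.lower() for u in logged_utterances)
    total = sum(counts.values())
    return {utterance: count / total for utterance, count in counts.items()}

def coverage(grammar, test_utterances):
    """Toy performance metric: fraction of test utterances in the grammar."""
    hits = sum(u.lower() in grammar for u in test_utterances)
    return hits / len(test_utterances)

def tuning_pass(active_grammar, logged_utterances, test_utterances):
    """Steps 230-240: keep whichever grammar performs better."""
    replacement = build_replacement(logged_utterances)
    if coverage(replacement, test_utterances) > coverage(active_grammar, test_utterances):
        return replacement   # step 240: replace the original grammar
    return active_grammar    # step 235 "no" branch: keep the current grammar

active = {"pay bill": 1.0}                     # original grammar
test_set = ["pay bill", "check balance"]       # assumed training set
batches = [["pay bill"], ["check balance", "pay bill"]]

log = []
for batch in batches:        # steps 205-210: recognize and store instance data
    log.extend(batch)
    active = tuning_pass(active, log, test_set)

print(sorted(active))        # utterances covered by the active grammar
```

In this toy run the first pass leaves the original grammar in place because the replacement offers no improvement, and the second pass swaps in the replacement once the log contains an utterance the original grammar does not cover.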
The present invention may be realized in hardware, software, or a combination of hardware and software. The present invention may be realized in a centralized fashion in one computer system or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system or other apparatus adapted for carrying out the methods described herein is suited. A typical combination of hardware and software may be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.
The present invention also may be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which when loaded in a computer system is able to carry out these methods. Computer program in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form.
This invention may be embodied in other forms without departing from the spirit or essential attributes thereof. Accordingly, reference should be made to the following claims, rather than to the foregoing specification, as indicating the scope of the invention.
Number | Date | Country |
---|---|---|
20080052076 A1 | Feb 2008 | US |