To recognize and understand the intentions of callers, telephony applications and the like (e.g., a “voice menu” system) typically use Context-Free-Grammars. In general, Context-Free-Grammars are data that provide a specific list of sentences/phrases for which the telephony application listens. When a caller speaks an utterance, a matching sentence/phrase is selected based on weighted parameters and the like, or the caller is asked to repeat the utterance if no matching sentence/phrase is found.
While a Context-Free-Grammars approach is relatively easy and inexpensive to implement, it suffers from a number of problems. For one, disfluencies in speech input are not handled effectively. For another, there is the practical problem of pronunciation mismatch. Users are often unsatisfied and frustrated with voice menu systems because they are given wrong selections or have to repeat the same speech over and over.
Further, Context-Free-Grammars are only as good as their list of sentences/phrases, which is difficult to put together. For example, even though there is often a very large volume of data corresponding to a very large number of calls for a telephony application, much of it cannot be used because manual transcriptions are needed, e.g., on the order of tens of thousands for a single top-level voice menu to handle the large number of variations. After a point, performance does not improve by any significant amount simply by adding new phrases and/or adjusting the parameter weights.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which a classifier, trained with speech-recognized acoustic data having associated semantic labels, is configured to classify speech-recognized text into a semantic label of a predetermined set of such labels. The semantic label is then output, such as for use by a voice menu program (e.g., a telephony application or an automotive application) in branching between menus.
In one aspect, the speech recognizer is an utterance classification-based speech recognizer having a statistical language model iteratively trained on labeled training data (as well as possibly on non-labeled data). The speech recognizer and/or the classifier may operate at a phoneme level, a word level, or another sub-unit level. As will be understood, the technology also includes the capability to use transcribed data, non-transcribed data with semantic labels, and non-transcribed, non-labeled (blind) data to improve results.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards using an information retrieval approach and a classifier to understand a speaker's intent, by matching and/or mapping recognized speech into a cluster of logged classification samples to obtain a semantic label. Performance improves as the search space (database) becomes more complete and more training samples become available. As will be understood, the technology uses data-driven techniques to bypass the Context-Free-Grammars approach, providing higher performance (user satisfaction) at a lower development/maintenance cost.
It should be understood that any of the examples herein are non-limiting. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and classification in general.
U.S. patent application Ser. No. 12/722,556, assigned to the assignee of the present application and hereby incorporated by reference, generally describes the use of information retrieval-based methods to convert speech into a recognition result (text). For example, an utterance may be converted to phonemes or sub-word units, which are then divided into various possible segments. The segments are then measured against word labels based upon TF-IDF or other features, for example, to find acoustic scores for possible words of the utterance. The acoustic scores may be used in various hypotheses along with a length score and a language model score to rank candidate phrases for the utterance. Training is based on creating an acoustic units-to-text data matrix (analogous to a term-document matrix) over the appropriate features. Minimum classification error techniques or other techniques may be used to train the parameters of the routing matrix.
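By way of a non-limiting illustration, the following Python sketch shows one way such TF-IDF-based acoustic scoring might be realized, by analogy with a term-document matrix. The function names, the use of phoneme bigrams as the "terms," and the cosine-style similarity are assumptions for illustration only and do not represent the actual implementation of the incorporated application.

    # Illustrative sketch only: score phoneme segments against word labels
    # using TF-IDF over an acoustic-units-to-text matrix (analogous to a
    # term-document matrix). All names are hypothetical.
    import math
    from collections import Counter

    def ngrams(phones, n=2):
        # Phoneme bigrams serve as the "terms" of the matrix.
        return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

    def build_tfidf(label_to_phones):
        # label_to_phones maps each word label to its phoneme sequence.
        docs = {lab: Counter(ngrams(p)) for lab, p in label_to_phones.items()}
        df = Counter(t for counts in docs.values() for t in counts)
        n_docs = len(docs)
        tfidf = {}
        for lab, counts in docs.items():
            total = sum(counts.values())
            tfidf[lab] = {t: (c / total) * math.log(1 + n_docs / df[t])
                          for t, c in counts.items()}
        return tfidf

    def acoustic_scores(segment_phones, tfidf):
        # Rank candidate word labels for one recognized segment.
        q = Counter(ngrams(segment_phones))
        scores = {}
        for lab, vec in tfidf.items():
            dot = sum(q[t] * w for t, w in vec.items())
            norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
            scores[lab] = dot / norm
        return sorted(scores.items(), key=lambda kv: -kv[1])

Such per-segment scores could then be combined with a length score and a language model score when ranking candidate phrases, as described above.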
Note that traditional speech recognition works on either the word or the phone level. However, an alternative mixed-level voice search implementation may use word-level transcriptions from the training sentences and automatically generate phone-level training sentences from the speech recognition output. Such phone-level recognition units tend to capture disfluency and reduce pronunciation mismatch relative to using only word-level units. For example, “Indiana” may be pronounced “Inny Ana,” which are not words; however, by operating at the phoneme level, both utterances may be mapped to a semantic label such as “destination.”
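As a non-limiting sketch of why phone-level units help, the following Python fragment compares the two pronunciations; the phone strings are hand-written approximations (hypothetical), not the output of any particular recognizer.

    # Illustrative sketch: phone-level overlap between pronunciation
    # variants that have no word-level match.
    def phone_overlap(a, b):
        # Fraction of shared phone bigrams between two phone sequences.
        bigrams = lambda s: {tuple(s[i:i + 2]) for i in range(len(s) - 1)}
        shared = bigrams(a) & bigrams(b)
        return len(shared) / max(len(bigrams(a) | bigrams(b)), 1)

    indiana  = ["IH", "N", "D", "IY", "AE", "N", "AH"]   # "Indiana"
    inny_ana = ["IH", "N", "IY", "AE", "N", "AH"]        # "Inny Ana"

    # The high overlap (about 0.57 here) lets both variants map to the
    # same semantic label (e.g., "destination"), whereas a word-level
    # match for "Indiana" would fail on "Inny Ana".
    print(phone_overlap(indiana, inny_ana))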
In general, the training process iterates, with the speech recognition results evaluated against the labeled training data until the statistical language model 102 is deemed sufficiently good (decision diamond 112); four or five iterations may suffice, for example. Such iterative language model training encourages more consistent speech recognition output, which in turn improves the classification accuracy. It also enables non-transcribed acoustic data to be used to improve the language model.
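By way of example, and not limitation, the iterative training described above may be sketched as follows in Python; train_lm, recognize, and accuracy are simplified hypothetical stubs standing in for a real statistical language model, recognizer, and evaluation step.

    # Illustrative sketch of the iterative language-model training loop.
    from collections import Counter

    def train_lm(sentences):
        # Stub LM: unigram counts (stands in for a real statistical LM).
        return Counter(w for s in sentences for w in s.split())

    def recognize(audio, lm):
        # Stub recognizer: a real system would decode audio using the LM.
        return audio["reference"]

    def accuracy(hyps, refs):
        return sum(h == r for h, r in zip(hyps, refs)) / max(len(refs), 1)

    def iterative_lm_training(labeled_audio, transcripts,
                              max_iters=5, target=0.9):
        lm = train_lm(transcripts)
        for _ in range(max_iters):           # four or five passes may suffice
            hyps = [recognize(a, lm) for a in labeled_audio]
            if accuracy(hyps, transcripts) >= target:
                break                        # LM deemed sufficiently good
            # Retraining on the recognizer's own output encourages more
            # consistent recognition; non-transcribed acoustic data may
            # be folded in the same way.
            lm = train_lm(transcripts + hyps)
        return lm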
Some of the acoustic data 108 are associated with semantic tags (e.g., one million out of two million units may have tags). More particularly, observed utterances are grouped into clusters based on their semantic concepts (e.g., the voice menu's branches). Those with tags are run through a speech recognizer 114 (e.g., corresponding to the recognition process in the iterative training) with the recognition results used to train a classifier.
The recognition results, in conjunction with the semantic tags, are used to train a classification model 116. A typical number of semantic labels is on the order of a dozen or two, and while the number of semantic labels is predetermined based upon those needed for an application, the number may change as new features or the like are added to the application. For example, a “voice menu” task is cast as a semantic classification (voice search) task using the training sentences to see which cluster most closely represents an input query, which then may be mapped to a specific menu. Because the classifier takes text, new menu options may be added without needing actual training utterances, e.g., by using artificial examples entered as text by the system designers.
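By way of a non-limiting sketch, a simple nearest-centroid bag-of-words text classifier in Python can stand in for the classification model 116; the class name and the training sentences below are hypothetical.

    # Illustrative sketch: the voice menu cast as text classification.
    from collections import Counter, defaultdict

    class MenuClassifier:
        def __init__(self):
            self.centroids = defaultdict(Counter)

        def train(self, recognized_text, semantic_label):
            self.centroids[semantic_label].update(
                recognized_text.lower().split())

        def classify(self, recognized_text):
            words = Counter(recognized_text.lower().split())
            def score(label):
                c = self.centroids[label]
                total = sum(c.values()) or 1
                return sum(words[w] * c[w] / total for w in words)
            return max(self.centroids, key=score)

    clf = MenuClassifier()
    clf.train("i need driving directions", "directions")
    clf.train("call my office", "phone")
    # A new menu option added from designer-entered text alone,
    # with no actual training utterances:
    clf.train("what can i say here", "help")
    print(clf.classify("what can i say"))    # -> "help"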
Transcribed data, non-transcribed but categorized data (that is, what the user wants is known, but the exact words spoken are not), and completely blind data (non-transcribed, non-categorized data, i.e., neither the transcription nor the category is known) may be used to improve the statistical language model 102 and/or the classification model 116. To this end, a semi-supervised method (labeling and/or transcription is provided for part of the data while the remaining data is unlabeled and/or non-transcribed) may be used in a performance tuning phase to achieve continued performance improvement in language model tuning and classifier tuning at relatively very low cost. For example, in the partially labeled case, semantic labels may be regenerated by the classification module for reuse in training; these may be weighted (or thresholded) by some confidence measure, with only high-confidence data used for the subsequent learning. It is also possible to weight all of the data equally. For the unlabeled case, it is possible to iterate the language model (and transcriptions) until convergence, and then iterate the classification (and semantic labels) until convergence; it is also possible to interleave the language model and classification updates. Using transcription from speech recognition on otherwise non-transcribed data (e.g., instead of manual transcription) improves the quality of the language model.
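For example, the high-confidence reuse of regenerated labels may be sketched as follows in Python; the margin-based confidence and the threshold value are illustrative assumptions, not the system's actual confidence measure, and all stubs stand in for the real classification module.

    # Illustrative sketch of confidence-thresholded self-training on
    # non-transcribed, semantically unlabeled data.
    def classify_with_confidence(model, text):
        # Stub: score labels by shared-word count; "confidence" is the
        # margin of the best label over the runner-up.
        scores = {lab: len(set(text.split()) & words)
                  for lab, words in model.items()}
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        best = ranked[0]
        second = ranked[1] if len(ranked) > 1 else ("", 0)
        total = (best[1] + second[1]) or 1
        return best[0], best[1] / total

    def retrain(model, examples):
        # Stub: fold newly auto-labeled word sets back into the model.
        for text, label in examples:
            model.setdefault(label, set()).update(text.split())
        return model

    def self_train(model, unlabeled, threshold=0.8, rounds=3):
        for _ in range(rounds):
            auto = []
            for text in unlabeled:
                label, conf = classify_with_confidence(model, text)
                if conf >= threshold:        # keep only high-confidence data
                    auto.append((text, label))
            if not auto:
                break
            model = retrain(model, auto)     # regenerated labels reused
        return model

    model = {"help": {"help", "options"}, "phone": {"call", "dial"}}
    model = self_train(model, ["show me my options", "dial home please"])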
By way of example, consider a top-level voice menu in an automotive application scenario. Semantic tags such as “directions”, “destination,” “help”, “phone” and so forth may be the classification classes for that menu. When a user speaks to say something like “I need to know my options” and thereby provides the acoustic data, the classifier 220 receives corresponding recognized text and determines that this speech belongs to the “help” class. From there, the application can provide an appropriate verbal response (e.g., “are you asking for help?”) and/or take an appropriate action, e.g., branch to a Help menu.
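A minimal dispatch sketch in Python, assuming hypothetical prompt and menu-branching helpers, might look as follows; the menu names and prompt wording are illustrative only.

    # Illustrative sketch: branching on the semantic label output by the
    # classifier.
    def prompt(text):
        print(text)              # stub: a real system would speak the text

    def open_menu(name):
        print("-> " + name)      # stub: a real system would branch here

    MENU_ACTIONS = {
        "help":        lambda: prompt("Are you asking for help?"),
        "directions":  lambda: open_menu("Directions menu"),
        "destination": lambda: open_menu("Destination menu"),
        "phone":       lambda: open_menu("Phone menu"),
    }

    def dispatch(semantic_label):
        action = MENU_ACTIONS.get(semantic_label)
        if action is None:
            prompt("Sorry, I did not understand you.")
        else:
            action()

    dispatch("help")             # -> "Are you asking for help?"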
In one implementation, the classifier 220 need not be limited to only a single output class, but instead may generate n-best ranked results (or results with associated likelihood data), such as when given an imprecise query. These may be used to provide a more helpful and productive confirmation. For example, the speech recognizer may hear “turn-by-turn driving directions,” which the classifier may determine matches two classes reasonably well (e.g., both at around fifty percent probability); the user may thus be asked in return, “Do you want a next turn notification or driving directions to a new destination?” with the user's response then received and matched to the desired class. Also note that if no semantic label has a high enough probability, or if a label comes back as an “unknown” classification or the like, a “Sorry, I did not understand you” or other suitable prompt may be given to the user.
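By way of a non-limiting sketch, such n-best handling might be realized as follows in Python; the threshold values and returned action names are hypothetical choices for illustration.

    # Illustrative sketch of handling n-best classifier output.
    def respond(nbest, accept=0.8, consider=0.4):
        # nbest: list of (semantic_label, probability), best first.
        top_label, top_p = nbest[0]
        if top_p >= accept:
            return ("act", top_label)        # confident: branch directly
        close = [lab for lab, p in nbest if p >= consider]
        if len(close) >= 2:                  # e.g., two classes near 50%
            return ("confirm", close[:2])    # ask the user to disambiguate
        return ("reprompt", None)            # "Sorry, I did not understand you"

    print(respond([("next_turn", 0.51), ("directions", 0.47)]))
    # -> ('confirm', ['next_turn', 'directions'])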
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 310 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 310 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 310. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 330 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 331 and random access memory (RAM) 332. A basic input/output system 333 (BIOS), containing the basic routines that help to transfer information between elements within computer 310, such as during start-up, is typically stored in ROM 331. RAM 332 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 320. By way of example, and not limitation,
The computer 310 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 310 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 380. The remote computer 380 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 310, although only a memory storage device 381 has been illustrated in
When used in a LAN networking environment, the computer 310 is connected to the LAN 371 through a network interface or adapter 370. When used in a WAN networking environment, the computer 310 typically includes a modem 372 or other means for establishing communications over the WAN 373, such as the Internet. The modem 372, which may be internal or external, may be connected to the system bus 321 via the user input interface 360 or other appropriate mechanism. A wireless networking component such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 310, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 399 (e.g., for auxiliary display of content) may be connected via the user interface 360 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 399 may be connected to the modem 372 and/or network interface 370 to allow communication between these systems while the main processing unit 320 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.