Currently, applications for speech recognition systems are widely varied. Many speech recognition applications allow a user to provide a spoken input, and a speech recognition system identifies a semantic value corresponding to the spoken input. Such systems are often implemented in dialog systems which are conducted by telephone.
In a telephone-based dialog system, a user of the system calls in and provides spoken inputs which are recognized by the speech recognizer based on grammars. The speech recognition system may activate different grammars, or different portions of grammars, based on where the application is in the dialog being conducted with the user.
By way of specific example, assume that a dialog system is implemented in a pizza restaurant. The dialog system takes orders from customers that call in by telephone. The dialog system directs the user through a dialog by prompting the user with questions. The speech recognition system then attempts to identify one of a plurality of different expected semantic values based on the user's spoken input in response to the prompt.
For instance, the dialog system may first ask the user “Do you wish to order a pizza?” The speech recognition system would then be expecting the user to give one of a plurality of expected responses, such as: “yes”, “no”, “yes please”, “no thank you”, etc. Assuming that the user responds affirmatively, the dialog system may then ask the user “What size pizza would you like?” The speech recognition system might then activate a portion of the grammar looking for expected responses to that question. For instance, the speech recognition system may activate the portion of the grammar that is looking for semantic values of: “large”, “medium”, “small”, “I'd like a large please”, “Please give me a small”, etc.
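The state-dependent grammar activation described above can be illustrated with a hypothetical sketch. The dialog-state names, phrases, and semantic values below are illustrative assumptions for the pizza example, not structures taken from any actual system described herein.

```python
from typing import Optional

# Hypothetical state-dependent grammars: each dialog state activates a
# grammar mapping expected spoken phrases to semantic values.
GRAMMARS = {
    "confirm_order": {
        "yes": "YES", "yes please": "YES",
        "no": "NO", "no thank you": "NO",
    },
    "ask_size": {
        "large": "LARGE", "i'd like a large please": "LARGE",
        "medium": "MEDIUM",
        "small": "SMALL", "please give me a small": "SMALL",
    },
}

def recognize(dialog_state: str, utterance: str) -> Optional[str]:
    """Return the semantic value for an utterance under the grammar
    active in the given dialog state, or None if the active grammar
    does not anticipate the utterance."""
    return GRAMMARS[dialog_state].get(utterance.strip().lower())
```

Under this sketch, `recognize("ask_size", "Large")` yields the semantic value "LARGE", while an unanticipated response such as "family size" yields no match at all, which is precisely the failure mode discussed next.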
One problem with these types of grammar-based systems is that it is very difficult for the developer of the system to anticipate all of the different ways that a user may respond to any given prompt. For example, if the system is expecting a response indicative of a semantic value of “large”, “medium”, or “small”, the user may instead say “family size”, or “extra large”, neither of which might be anticipated by the dialog system. Therefore, these responses may not be accommodated in the grammars currently active in the speech recognizer.
In the past, one way of tuning the grammars in these types of speech recognition applications was to listen to and manually transcribe call log data for calls that resulted in errors by the speech recognition system. For instance, the audio data corresponding to calls that ended in a hang-up, instead of an order being placed, can be used to tune the system. In using that information, the audio information for a call is first transcribed into written form, which is a laborious and time-consuming process. The erroneous result originally produced by the speech recognition system is provided to the developer, along with the transcribed audio information. The developer then either writes a new grammar rule to accommodate the unexpected response, or manually maps the transcribed data to one of the expected semantic values and uses that mapping in revising the grammar. Of course, this is highly time-consuming and costly, because the audio information not only has to be transcribed, but the transcription must then be used to modify the grammar in some way.
Another type of technology currently in use is referred to as “Wizard of Oz” technology. In this context, “Wizard of Oz” is a term used in the art to describe a method by which voice user interface applications are evaluated where the evaluation subject (the person interacting with the system) believes that he or she is talking to an automated system. In fact, however, the flow of the voice user interface application is entirely under the control of the system designer who is unseen by the evaluation subject. The system designer is presented with a user interface that allows the designer to easily (and in real time) select an appropriate system action based on the subject's input (or response to a question).
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
Speech log data is received, and possible semantic classifications for that log data are obtained from the grammars that were active in the system when the log data was received. Audio information from the log data, along with the possible semantic values, is then presented for user selection. A user selection is received, and corrected log data is generated based on the user-selected semantic value.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
The present subject matter deals with correcting semantic classifications for speech data that is stored in a data log. However, before describing the subject matter in more detail, one illustrative environment in which the present subject matter can be used will be described.
Embodiments are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with various embodiments include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
Embodiments may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Some embodiments are designed to be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 may be operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
While the present subject matter can be used to correct semantic values for any log data, it will be described herein in the context of correcting semantic values associated with speech inputs logged for a voice user interface in a dialog system. However, the invention is not to be so limited, and a wide variety of different speech recognition-based systems can be improved using the present subject matter.
Call log store 202 illustratively stores a log of calls that were made to a dialog system, and that ended in erroneous speech recognitions of the voice data input by the customer or user of the system. While call log store 202 can store a wide variety of information, it illustratively at least stores log data for calls that were erroneously recognized.
Audio information 252 is illustratively audio data that can be played back to a user 212 (shown in
Active grammar information 256 is illustratively one or more indicators that indicate the particular grammars, or portions of grammars, that were active in the speech recognition system during the time of the dialog session during which speech recognition result 254 was recognized. In other words, assume the dialog was asking the customer what size pizza they would like. Then active grammar information 256 will indicate that the active grammars were those grammars (or portions of grammars) that expected a speech input corresponding to semantic values that indicate pizza size.
Data type information 258 is optional, and indicates the particular data type being sought by the speech recognition system at that point in the dialog session. For instance, it may be that the dialog was seeking a name of an American city. That city may illustratively be stored on a list of American cities, and in that case, the data type being sought would be a list. This is optional and its use will be described in greater detail below.
Referring again to
Correction component 204 then accesses the grammars for the underlying speech recognition system and identifies, based on the active grammar information 256 (shown in
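The step of identifying possible semantic values from the active grammars can be sketched as follows. This is a minimal illustration, assuming grammars are keyed by identifier and map phrases to semantic values as in the earlier hypothetical sketch; the function and parameter names are not from the described system.

```python
from typing import Dict, List

def possible_semantic_values(active_grammar_ids: List[str],
                             grammars: Dict[str, Dict[str, str]]) -> List[str]:
    """Collect the distinct semantic values reachable from the grammars
    (or grammar portions) that were active when the utterance was logged,
    preserving the order in which they first appear."""
    values: List[str] = []
    for gid in active_grammar_ids:
        for value in grammars.get(gid, {}).values():
            if value not in values:
                values.append(value)
    return values
```

For the pizza-size example, the logged active-grammar indicator would resolve to the candidate set "LARGE", "MEDIUM", "SMALL", which is then offered to the user for selection.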
Correction component 204 then provides, through user interface 206, the audio information 252, the prior recognition result 254, and the possible semantic values 268 identified from the active grammars. This is indicated by block 266 in
User 212 then actuates a mechanism on user interface 206, such as a radio button or other user-actuable element, and plays audio information 252. User 212 listens to the audio information 252 and determines which of the possible semantic values 268 the audio information should be mapped to. For instance, again assume that the possible semantic values 268 are the pizza sizes “small”, “medium”, and “large”. Each of those possible semantic values will illustratively be presented to user 212 on user interface 206 in a user-selectable way, such as a radio button or other user-selectable input mechanism. Assume that the audio information 252 indicates that the user stated “family size”. User 212 can then select the possible semantic value 268 of “large” by simply clicking on the radio button (or other user interface element) corresponding to the semantic value of “large”. The selected semantic value 270 is then provided from user interface component 206 to correction component 204. Receiving the user selection of the semantic value is indicated by block 272 in
Correction component 204 then generates corrected log data 280. This is indicated by block 276 in
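One plausible shape for the corrected log data is a record that carries the reviewer's selection alongside the original recognition result. The field and type names below are illustrative assumptions, not the actual data layout of log data 280.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CorrectedLogEntry:
    """One corrected log record: the original recognition result is
    retained alongside the human-selected semantic value, so that the
    pair can later drive analysis and grammar tuning."""
    audio_ref: str          # pointer to the stored audio information
    original_result: str    # semantic value the recognizer produced
    corrected_value: str    # semantic value selected by the reviewer

def correct_entry(audio_ref: str, original_result: str,
                  selected_value: str) -> CorrectedLogEntry:
    """Build a corrected log record from a reviewer's selection."""
    return CorrectedLogEntry(audio_ref, original_result, selected_value)
```

Retaining both values is a deliberate choice in this sketch: the (original, corrected) pair is exactly what a downstream analyzer needs to measure semantic accuracy, and what a training component needs to learn new mappings.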
The corrected log data 280 or just the corrected semantic classification data 260 can be used in a wide variety of different ways. For instance, the data can be provided to analyzer 208 which analyzes the data to determine the semantic accuracy of the grammar or speech recognizer. Analyzer 208 can also provide a wide variety of analyses of the data, and output analysis results 300 indicative of the analysis performed by analyzer 208. Analyzing the corrected data and outputting the analysis results is indicated by blocks 352 and 354 in
Corrected semantic classification data 260 (or the corrected log data 280) can also be provided to training component 210. Training component 210 can identify out-of-grammar phrases for various known semantic classes and generate rules in the grammar associated with those out-of-grammar phrases. Training component 210 can also find unknown semantic classes, such as categories that users talk about, but that are not used in the current dialog system (e.g., “extra large” pizza, in addition to small, medium and large). Component 210 can then generate rules in the grammar to accommodate those unknown semantic classes. Training component 210 can also apply machine-learning techniques to automatically update the statistical likelihoods underlying the deployed system's grammars and semantic classification techniques (including, for example, reinforcement learning on positive results) without further user intervention. Training or tuning a speech recognition component (such as a grammar) is indicated by block 356, and outputting the trained component 357 is indicated by block 358.
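The first of those training tasks, surfacing out-of-grammar phrases grouped by the semantic class a reviewer assigned, can be sketched as a simple tally over the corrected log data. This is an illustrative assumption about how such a training component might begin; the names and data shapes are not from the described system.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def out_of_grammar_phrases(corrected_log: List[Tuple[str, str]],
                           grammar: Dict[str, str]) -> Dict[str, Set[str]]:
    """Group logged utterances that the current grammar does not cover
    by the semantic value a human reviewer assigned to them. Each group
    is a candidate new grammar rule (e.g. 'family size' -> LARGE), and a
    group under a semantic value absent from the grammar signals an
    unknown semantic class (e.g. 'extra large')."""
    candidates: Dict[str, Set[str]] = defaultdict(set)
    for utterance, semantic_value in corrected_log:
        if utterance.lower() not in grammar:   # not anticipated
            candidates[semantic_value].add(utterance.lower())
    return dict(candidates)
```

A rule generator could then walk each group and emit a grammar entry per phrase, while groups under previously unseen semantic values would be flagged for a designer to confirm before a new class is added to the dialog.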
It will also be noted that if the expected data type 258 being sought is logged, and provided in log data 260, correction component 204 can use the expected data type 258 in generating a more useful user interface to be presented to user 212 by component 206. For example, assume that the dialog at the time the data was logged was looking for a date, as shown in
In one illustrative embodiment, the designer of the grammar under analysis illustratively includes in the grammar the data type being sought for each of the given grammars or grammar rules. Therefore, when log 202 logs the active grammars, it also logs the data types being sought by the active grammars. In that embodiment, correction component 204 reads the data types being sought and dynamically generates the possible semantic values 268 using user interface structures suitable to the data type being sought (such as the calendar, dropdown boxes, etc.).
It can thus be seen that the present subject matter can be used to drastically streamline the process of tuning grammars, which was previously done using extremely costly and time-consuming manual transcription processes. The present subject matter provides a relatively simple interface for rapidly classifying user utterances into semantic buckets. The semantic information is useful in itself for a wide variety of analytical and tuning purposes, and the analytical and tuning processes are significantly sped up by this subject matter. In addition, the user interface used for transcription automatically presents the transcriber or user with the set of possible semantic values, which can be read directly from the active grammars.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.