Speech recognition systems are currently used in a wide variety of applications. Many speech recognition systems use grammars, such as context free grammars (CFGs). As is known, CFGs use a set of rules yielding words (or tokens) to identify words in a spoken utterance. Authoring these grammars is often one of the most difficult tasks in developing a speech recognition system for a given implementation.
One reason that authoring grammars is so difficult relates to the wide variety of different ways that different users tend to phrase inputs to the speech recognition system. For instance, assume that the implementation for a given speech recognition system is an interactive voice response (IVR) dialog implementation at a pizza restaurant, which accepts orders for pizzas over the phone. Assume further that the IVR unit asks a caller, at some point during the dialog, “What size pizza would you like?” Users will respond to this in many different ways, even if they are all ordering the same size pizza. For instance, users may respond in any of the following ways, or in still other ways:
I'd like a large pizza.
Please give me a large pizza.
I'll take a large pizza please.
I'd like a large pizza please.
I'll have a large pizza, thanks.
These examples illustrate that even though the content portion of the response (that portion of the response which actually answers the prompt) “large pizza” is the same for each example, the preambles (those words preceding the content portion of the response) and the postambles (those words following the content portion of the response) differ widely.
In order for a speech recognition system to handle all of these responses, the grammar in the speech recognition system must contain a rule that accommodates each of these responses. Therefore, in authoring the grammar, the grammar author must not only have knowledge about how users will respond with content (e.g., small, medium, or large pizza), but the grammar author must also be able to think of all of these different preambles and postambles. If the preambles and postambles are not present in the rules in the grammar, then the speech recognition system will not recognize the response by the user.
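The coverage problem described above can be illustrated with a minimal sketch. The three-part rule layout, the function names `normalize` and `recognizes`, and the phrase lists are assumptions made for illustration only; an actual CFG-based recognizer is considerably more involved.

```python
# Toy stand-in for a grammar rule: preamble + content + optional postamble.
# All phrase lists here are illustrative assumptions.
PREAMBLES = ["i'd like a", "i'll take a"]
CONTENTS = ["small pizza", "medium pizza", "large pizza"]
POSTAMBLES = ["", "please"]


def normalize(utterance: str) -> str:
    """Lowercase and drop punctuation so matching is on words only."""
    kept = "".join(ch for ch in utterance.lower() if ch.isalnum() or ch in " '")
    return " ".join(kept.split())


def recognizes(utterance: str) -> bool:
    """Return True if some preamble/content/postamble path covers the utterance."""
    text = normalize(utterance)
    for pre in PREAMBLES:
        for content in CONTENTS:
            for post in POSTAMBLES:
                candidate = " ".join(part for part in (pre, content, post) if part)
                if text == candidate:
                    return True
    return False


# The first response is covered; the second uses a preamble the rules lack.
print(recognizes("I'd like a large pizza."))
print(recognizes("Please give me a large pizza."))
```

In this sketch the second response fails only because "please give me a" is absent from the preamble list, even though its content portion is identical to that of the first.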
One way of addressing this problem involves using an already-authored grammar. An already-existing path through the grammar is specified, and the grammar is asked to predict other paths through the grammar, given the specified path. The grammar is then reconfigured to activate the predicted paths through the grammar when the specified path is activated.
Another way of addressing this problem involves manual transcription. In the exemplary pizza restaurant implementation being discussed, prior to implementing the automated dialog system at the pizza restaurant, a manual system is used in which a human operator speaks with customers and asks the customers the prompt: “What size pizza would you like?” The vocal answers from the customers are then all recorded and transcribed for later use by the grammar author. By reviewing all of the transcribed customer responses, the grammar author is better able to predict the different preambles and postambles that might commonly be used in response to the prompt. Of course, this is relatively time consuming and requires a relatively large amount of resources, and in any case, is anecdotal and subject to error.
The present invention addresses one, some or all of these problems, or it can be used to address different problems, as will be evident by reading the following description.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
A speech grammar is generated using possible answer forms to input prompts. In one embodiment, input prompts are provided to a natural language generation system which generates predicted responses to the input prompts. In one embodiment, a grammar is pre-populated with preambles and postambles from the predicted responses.
The present invention relates generally to grammar authoring or grammar generation. However, before describing the present invention in greater detail, one illustrative environment in which the present invention can be used will be described.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
In order to begin operation of system 200, grammar author 206 generates one or more prompts which will be used in a speech system (such as a dialog system or IVR system) in which the speech recognition system that uses grammar 208 will be deployed. For the sake of example, assume that a dialog system will be implemented in a pizza restaurant to automatically take orders for pizzas from customers that call in on the telephone. Of course, this implementation is exemplary only and a wide variety of other implementations could be used as well.
In any case, in order to generate grammar 208 for that dialog system, grammar author 206 illustratively generates a plurality of prompts 210 that will be used in the dialog system. Such prompts may include, for example:
What size pizza would you like?
What kind of crust would you like?
What toppings would you like?
Grammar author 206 illustratively provides prompts 210 to the grammar authoring tool 202. This is indicated by block 212 in
One grammar authoring tool allows a grammar author 206 to generate a grammar by dragging and dropping portions of a graph, which represent the grammar rules, into a desired configuration. Of course, a wide variety of other grammar authoring tools can be used as well. One embodiment of a user interface display generated by grammar authoring tool 202 is shown in
Grammar authoring tool 202 then provides the prompts 210 to response prediction system 204. Response prediction system 204 can be any type of system trained to predict responses to an input prompt. In one embodiment, the response prediction system 204 is a natural language generation system trained to generate one or more likely natural language outputs in response to a natural language input prompt. The natural language generation system can use any of a wide variety of technologies (such as language models, neural networks, natural language response look-up systems, lexical knowledge bases, information retrieval search systems, machine translation systems, localization systems, etc.) in order to predict user responses to the prompts 210 that are provided to it. This is indicated by block 216 in
Response prediction system 204 receives the prompt 210 from grammar authoring tool 202 and generates likely responses 220 to the prompt 210. The responses can take any of a wide variety of forms. For instance, in one embodiment, the responses 220 are full responses to the prompt 210. In another embodiment, the responses 220 are likely preambles and postambles, which are predicted in view of the prompt 210. This latter embodiment is discussed herein for the sake of example.
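The interface of such a response prediction system can be sketched as follows. The look-up table merely stands in for whatever trained natural language generation system is actually used; the names `PredictedResponses`, `RESPONSE_TABLE`, and `predict_responses` are hypothetical, and keying the table on the first word of the prompt is an assumption made only to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class PredictedResponses:
    """Likely preambles and postambles predicted for one prompt."""
    preambles: list[str] = field(default_factory=list)
    postambles: list[str] = field(default_factory=list)


# Hypothetical look-up table keyed by the first word of a prompt.
# A trained generation system would replace this table entirely.
RESPONSE_TABLE = {
    "what": PredictedResponses(
        preambles=["I'd like a", "Give me a", "I'll have a", "Let me have a"],
        postambles=["please", "thank you", "thanks", "ok"],
    ),
}


def predict_responses(prompt: str) -> PredictedResponses:
    """Return predicted preambles/postambles, or empty lists for unknown prompts."""
    words = prompt.lower().split()
    key = words[0] if words else ""
    return RESPONSE_TABLE.get(key, PredictedResponses())


predicted = predict_responses("What size pizza would you like?")
for preamble in predicted.preambles:
    print(preamble, ". . .")
```

Returning preambles and postambles as separate lists mirrors the embodiment discussed here, in which the content portion is left to the grammar author.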
Having response prediction system 204 generate predicted responses is indicated by block 222 in
In either embodiment, the likely responses 220 can be displayed, through grammar authoring tool 202, to grammar author 206. This is indicated by block 224 in
I'd like a . . .
Give me a . . .
I'll have a . . .
Let me have a . . .
Of course, it will be noted that a wide variety of other preambles may be predicted, given the prompt, and only four are shown for the sake of example.
. . . please
. . . thank you
. . . thanks
. . . ok
Again, of course, a wide variety of other or different postambles might be predicted and those shown are for illustrative purposes only.
In accordance with one embodiment, after displaying the proposed responses, grammar authoring tool 202 simply pre-populates grammar 208 with the likely responses 220 without any further input by grammar author 206. The grammar author 206 can then provide further inputs to grammar authoring tool 202 in order to develop more content portions of the grammar, and in order to reconfigure the grammar, as desired.
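Pre-population of this kind can be sketched as follows, under the assumption that the grammar is held as a simple mapping from rule names to phrase lists. The rule names `Preamble`, `Content`, and `Postamble` and the functions `prepopulate` and `expand` are hypothetical and are not part of any particular grammar format.

```python
from itertools import product


def prepopulate(grammar: dict, preambles: list, postambles: list) -> dict:
    """Fill the preamble and postamble rules; leave content for the author."""
    grammar["Preamble"] = list(preambles)
    # The empty string makes the postamble optional.
    grammar["Postamble"] = [""] + list(postambles)
    grammar.setdefault("Content", [])
    return grammar


def expand(grammar: dict) -> list:
    """List every phrase the three rules accept when concatenated in order."""
    phrases = []
    for pre, content, post in product(
        grammar["Preamble"], grammar["Content"], grammar["Postamble"]
    ):
        phrases.append(" ".join(part for part in (pre, content, post) if part))
    return phrases


# The tool pre-populates; the author then supplies the content portion.
grammar = prepopulate({}, ["I'd like a", "Give me a"], ["please", "thanks"])
grammar["Content"] = ["small pizza", "medium pizza", "large pizza"]
for phrase in expand(grammar):
    print(phrase)
```

Keeping the content rule separate from the predicted rules lets the author extend or reconfigure it without touching the pre-populated preambles and postambles.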
However, in accordance with another embodiment, as illustrated in
In this embodiment, once the grammar author 206 has selected desired responses, the grammar author 206 can then actuate Add button 308 (shown on user interface display 300 in
Again, once the likely responses selected by the grammar author 206 have been populated into grammar 208, grammar author 206 can then complete the remaining portions of the grammar as desired. This is indicated by block 230 in
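The selection step can be sketched in the same spirit. The function name `add_selected` and the use of list indices to represent the author's selections are assumptions; the sketch only models what happens once the selected responses are added to a grammar rule.

```python
def add_selected(rule: list, proposed: list, selected: set) -> list:
    """Append the proposals whose indices were selected, skipping duplicates."""
    for index, phrase in enumerate(proposed):
        if index in selected and phrase not in rule:
            rule.append(phrase)
    return rule


# Proposed preambles as displayed to the author (taken from the examples above).
proposed_preambles = ["I'd like a", "Give me a", "I'll have a", "Let me have a"]

# The author keeps the first and third proposals and discards the rest.
preamble_rule = add_selected([], proposed_preambles, {0, 2})
print(preamble_rule)
```

Unselected proposals never reach the rule, so the author remains in control of which predicted forms the grammar will accept.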
It can thus be seen that proposed response forms to an input prompt in a dialog system can be used to generate a grammar. The proposed responses, in one embodiment, might simply include preambles and/or postambles. In another embodiment, the responses might include content as well. However, a grammar author may likely be well versed in, and have a relatively large amount of knowledge with respect to, content portions of the grammar, but may need the most help in generating preambles and postambles. In that case, only the preambles and postambles need to be predicted. In either case, a natural language generation system can be used in order to generate the proposed responses, and the proposed responses can be automatically generated and populated into a grammar.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.