Users of current speech recognition systems face a number of problems. First, users must become familiar with the speech recognition system and learn how to operate it. In addition, users must train the speech recognition system to better recognize their speech.
To address the first problem (teaching users to use the speech recognition system), current speech recognition tutorial systems attempt to teach the user about the workings of the speech recognizer using a variety of different means. For instance, some systems use tutorial information in the form of help documentation, which can be either electronic or paper documentation, and simply allow the user to read through the help documentation. Still other tutorial systems provide video demonstrations of how users can use different features of the speech recognition system.
Thus, current tutorials do not offer a hands-on experience in which the user can try out speech recognition in a safe, controlled environment. Instead, they only allow the user to watch, or read through, tutorial content. However, it has been found that where a user is simply asked to read tutorial content, even if it is read aloud, the user's retention of meaningful tutorial content is extremely low, bordering on insignificant.
In addition, current speech tutorials are not extensible by third parties. In other words, third party vendors must typically create separate tutorials, from scratch, if they wish to create their own speech commands or functionality, add speech commands or functionality to the existing speech system, or teach existing or new features of the speech system which are not taught by current tutorials.
In order to address the second problem (training the speech recognizer to better recognize the speaker), a number of different systems have also been used. In all such systems, the computer is first placed in a special training mode. In one prior system, the user is simply asked to read a given quantity of predefined text to the speech recognizer, and the speech recognizer is trained using the speech data acquired from the user reading that text. In another system, the user is prompted to read different types of text items, and the user is asked to repeat certain items which the speech recognizer has difficulty recognizing.
In one current system, the user is asked to read the tutorial content out loud, and the speech recognition system is activated at the same time. Therefore, the user is not only reading tutorial content (describing how the speech recognition system works, and including certain commands used by the speech recognition system), but the speech recognizer is actually recognizing the speech data from the user, as the tutorial content is read. The captured speech data is then used to train the speech recognizer. However, in that system, the full speech recognition capability of the speech recognition system is active. Therefore, the speech recognizer can recognize substantially anything in its vocabulary, which may typically include thousands of commands. This type of system is not very tightly controlled. If the speech recognizer recognizes a wrong command, the system can deviate from the tutorial text and the user can become lost.
Therefore, current speech recognition training systems require a number of different things to be effective. The computer must be in a special training mode, have high confidence that the user is going to say a particular phrase, and be actively listening for only a couple of different phrases.
It can thus be seen that speech engine training and user tutorial training address separate problems but are both required for the user to have a successful speech recognition experience.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
The present invention combines speech recognition tutorial training with speech recognizer voice training. The system prompts the user for speech data and simulates, with predefined screenshots, what happens when speech commands are received. At each step in the tutorial process, when the user is prompted for an input, the system is configured such that only a predefined set of user inputs (which may be a single input) will be recognized by the speech recognizer. When a successful recognition is made, the speech data is used to train the speech recognition system.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Appendix A illustrates one exemplary tutorial flow schema used in accordance with one embodiment of the present invention.
The present invention relates to a tutorial system that teaches a user about a speech recognition system, and that also simultaneously trains the speech recognition system based on voice data received from the user. However, before describing the present invention in more detail, one illustrative environment in which the present invention can be used will be described.
Embodiments are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with various embodiments include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
Embodiments may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Some embodiments are designed to be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Tutorial framework 202 provides interactive tutorial information 230 through user interface component 212 to the user 214. The interactive tutorial information 230 walks the user through a tutorial of how to operate the speech recognition system 208. In doing so, the interactive tutorial information 230 will prompt the user for speech data. When the user speaks, the speech data is acquired, such as through a microphone, and provided as a user input 232 to tutorial framework 202. Tutorial framework 202 then provides the user speech data 232 to speech recognition system 208, which performs speech recognition on the user speech data 232. Speech recognition system 208 then provides tutorial framework 202 with speech recognition results 234 that indicate the recognition (or non-recognition) of the user speech data 232.
In response, tutorial framework 202 provides another set of interactive tutorial information 230 to user 214 through user interface component 212. If the user speech data 232 was accurately recognized by speech recognition system 208, then the interactive tutorial information 230 shows the user what happens when the speech recognition system receives that input. Similarly, if the user speech data 232 was not recognized by speech recognition system 208, then the interactive tutorial information 230 shows the user what happens when a non-recognition occurs at that step in the speech recognition system. This continues for each step in the tutorial application that is currently running.
The tutorial content illustratively includes tutorial flow content 216 and a set of screenshots or other user interface display elements 218. Tutorial flow content 216 illustratively describes the complete navigational flow of the tutorial application as well as the user inputs which are allowed at each step in that navigational flow. In one embodiment, tutorial flow content 216 is an XML file that defines a navigational hierarchy for the application.
In any case, the exemplary navigation hierarchy 300 shows that the tutorial application includes one or more topics 302. Each topic has one or more different chapters 304 and can also have pages. Each chapter has one or more different pages 306, and each page has zero or more different steps 308 (for example, an introduction page may have no steps). The steps 308 are actions to be taken by the user in order to navigate through a given page 306 of the tutorial. When all of the steps 308 for a given page 306 of the tutorial have been completed, the user is provided with the option to move on to another page 306. When all the pages for a given chapter 304 have been completed, the user is provided with an option to move on to a subsequent chapter. Of course, when all of the chapters of a given topic have been completed, the user can then move on to another topic of the tutorial. It will also be noted, of course, that the user may be allowed to skip through different levels of the hierarchy, as desired by the developer of the tutorial application.
One concrete example of tutorial flow content 216 is attached to the application as Appendix A. Appendix A is an XML file which completely defines the flow of the tutorial application according to the navigational hierarchy 300 shown in
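By way of example, and not limitation, tutorial flow content of the kind described above might take the following form. The <page>, <step> and <instruction> tokens and the title attribute follow the usage described herein; the <tutorial>, <topic>, <chapter>, <expected> and <screenshot> element names are merely illustrative assumptions and need not match the actual schema set out in Appendix A.

<tutorial>
  <topic title="Working with programs">
    <chapter title="Opening a program">
      <page title="open the Start menu">
        <step>
          <instruction>Say "Start" to open the Start menu.</instruction>
          <expected>start</expected>
          <screenshot>start_menu.png</screenshot>
        </step>
        <step>
          <instruction>Say "All Programs".</instruction>
          <expected>all programs</expected>
          <screenshot>all_programs.png</screenshot>
        </step>
      </page>
    </chapter>
  </topic>
</tutorial>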
Once tutorial content 204 has been generated by a developer (or other tutorial author), the tutorial application for which it was generated can be run by system 200 shown in
The user 214 first opens the tutorial application. This is indicated by block 320 in
Once the tutorial application is opened by the user, tutorial framework 202 accesses the corresponding tutorial content 204 and parses the tutorial flow content 216 into the navigational hierarchy schema, one example of which is represented in
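By way of illustration only, and assuming the hypothetical element names used in the sketch above, the tutorial flow content could be parsed into an in-memory hierarchy of topics, chapters, pages and steps along the following lines; the parsing actually performed by tutorial framework 202 is, of course, not limited to this approach.

import xml.etree.ElementTree as ET

def parse_tutorial_flow(path):
    # Parse tutorial flow content (216) into a nested topic/chapter/page/step structure.
    root = ET.parse(path).getroot()
    topics = []
    for topic in root.findall("topic"):
        chapters = []
        for chapter in topic.findall("chapter"):
            pages = []
            for page in chapter.findall("page"):
                steps = []
                for step in page.findall("step"):
                    steps.append({
                        "instruction": step.findtext("instruction"),
                        "expected": step.findtext("expected"),
                        "screenshot": step.findtext("screenshot"),
                    })
                pages.append({
                    "title": page.get("title"),
                    "practice": page.get("practice") == "true",
                    "steps": steps,  # a page may legitimately have zero steps
                })
            chapters.append({"title": chapter.get("title"), "pages": pages})
        topics.append({"title": topic.get("title"), "chapters": chapters})
    return topics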
The tutorial framework 202 then displays a user interface element to user 214 through user interface 212 that allows the user to start the tutorial. For instance, tutorial framework 202 may display at user interface 212 a start button which can be actuated by the user by simply saying “start” (or another similar phrase) or using a point and click device. Of course, other ways of starting the tutorial application running can be used as well. User 214 then starts the tutorial running. This is indicated by blocks 324 and 326 in
Tutorial framework 202 then runs the tutorial, interactively prompting the user for speech data and simulating, with the screenshots, what happens when the commands which the user has been prompted for are received by the speech recognition system for which the tutorial is being run. This is indicated by block 328 in
The screenshot 502 in
More specifically,
Below the tutorial portion 504 are a plurality of steps 522 which can be taken by the user in order to accomplish a task. As the user takes the steps 522, a demonstration portion 524 of the screenshot demonstrates what happens in the speech recognition program when those steps are taken. For example, when the user says “Start”, “All Programs”, “Accessories”, the demonstration portion 524 of the screenshot presents display 526, which shows that the “Accessories” programs are displayed. Then, when the user says “WordPad”, the display shifts to show that the “WordPad” application is opened.
All of the speech information recognized in the tutorial is provided to speech recognition training system 210 to better train speech recognition system 208.
It should be noted that, at each step 522 in the tutorial, when the user is requested to say a word or phrase, the framework 202 is configured to accept only a predefined set of responses to the prompts for speech data. In other words, if the user is being prompted to say “start”, framework 202 may be configured to accept only a speech input from the user which is recognized as “start”. If the user inputs any other speech data, framework 202 will illustratively provide a screenshot illustrating that the speech input was unrecognized.
Tutorial framework 202 may also illustratively show what happens in the speech recognition system when a speech input is unrecognized. This can be done in a variety of different ways. For instance, tutorial framework 202 can, itself, be configured to only accept predetermined speech recognition results from speech recognition system 208 in response to a given prompt. If the recognition results do not match those allowed by tutorial framework 202, then tutorial framework 202 can provide interactive tutorial information through user interface component 212 to user 214, indicating that the speech was unrecognized. Alternatively, speech recognition system 208 can, itself, be configured to only recognize the predetermined set of speech inputs. In that case, only predetermined rules may be activated in speech recognition system 208, or other steps can be taken to configure speech recognition system 208 such that it does not recognize any speech input outside of the predefined set of possible speech inputs.
In any case, allowing only a predetermined set of speech inputs to be recognized at any given step in the tutorial process provides some advantages. It keeps the user on track in the tutorial, because the tutorial application will know what must be done next, in response to any of the given predefined speech inputs which are allowed at the step being processed. This is in contrast to some prior systems which allowed recognition of substantially any speech input from the user.
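By way of example, and not limitation, the processing of a single tutorial step might be sketched as follows. The recognizer, ui and trainer objects, and the run_step name, are hypothetical interfaces introduced here purely for illustration; as noted above, the restriction to the predefined set could equally be enforced inside speech recognition system 208 itself, for example by activating only the corresponding grammar rules.

def run_step(step, recognizer, ui, trainer, timeout_s=30):
    # Accept only the predefined set of responses for this step.
    allowed = {step["expected"]}                 # the set may contain one or more phrases
    ui.show_prompt(step["instruction"])          # prompt the user for speech data
    audio = ui.capture_speech(timeout_s)         # user speech data (232)
    result = recognizer.recognize(audio)         # recognition result (234)
    if result is not None and result.text.lower() in allowed:
        ui.show_screenshot(step["screenshot"])   # simulate what the real application would do
        trainer.add_sample(audio, result.text)   # pass data and transcription to voice training
        return True
    ui.show_unrecognized()                       # show what a non-recognition looks like
    return False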
Referring again to the flow diagram in
When speech recognition system 208 provides recognition results 234 to tutorial framework 202 indicating that an accurate, and acceptable, recognition has been made, then tutorial framework 202 provides the user speech data 232 along with the recognition result 234 (which is illustratively a transcription of the user speech data 232) to speech recognition training system 210. Speech recognition training system 210 then uses the user speech data 232 and the recognition result 234 to better train the models in speech recognition system 208 to recognize the user's speech. This training can take any of a wide variety of different known forms, and the particular way in which the speech recognition system training is done does not form part of the invention. Performing speech recognition training using the user speech data 232 and the recognition result 234 is indicated by block 332 in
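Because the particular training technique does not form part of the invention, only the hand-off itself is sketched below. The SpeechTrainingBuffer name and the acoustic_model.adapt call are hypothetical placeholders for whatever adaptation method speech recognition training system 210 actually uses.

class SpeechTrainingBuffer:
    # Collects (speech data, transcription) pairs produced by accepted recognitions.
    def __init__(self):
        self.samples = []

    def add_sample(self, audio, transcript):
        # Only pairs from successful, accepted recognitions are added, so the
        # transcript can be treated as a reliable label for the audio.
        self.samples.append((audio, transcript))

    def adapt(self, acoustic_model):
        # Placeholder: any conventional adaptation technique could consume
        # the buffered pairs here.
        for audio, transcript in self.samples:
            acoustic_model.adapt(audio, transcript)
        self.samples.clear()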
The schema has a variety of features which are shown in the example set out in Appendix A. For instance, the schema can be used to create practice pages which will instruct the user to perform a task, which the user has already learned, without immediately providing the exact instruction of how to do so. This allows the user to attempt to recall the specific instruction and enter the specific command without being told exactly what to do. This enhances the learning process.
By way of example, as shown in Appendix A, a practice page can be created by setting a “practice=true” flag in the <page> token. This is done as follows:
<page title="stop listening" practice="true">
This causes the <instruction> under the “step” token not to be shown unless a timeout occurs (such as 30 seconds) or unless the speech recognition system 208 obtains a misrecognition from the user (i.e., the user says the wrong thing).
As a specific example, where the “page title” is set to “stop listening” and the “practice” flag is set to “true”, the display may present the following tutorial language:
“During the tutorial, we will sometimes ask you to practice what you have just learned. If you make a mistake, we will help you along. Do you remember how to show the context menu, or right click menu for the speech recognition interface? Try showing it now!”
This can, for instance, be displayed in the tutorial portion 504, and the tutorial can then simply wait, listening for the user to say the phrase “show speech options”. In one embodiment, once the user says the proper speech command, then the demonstration display portion 524 is updated to show what would be seen by the user if that command were actually given to the application.
However, if the user has not entered a speech command after a predetermined timeout period (such as 30 seconds or any other desirable timeout), or if the user has entered an improper command which will not be recognized by the speech recognition system, then the instruction “try saying ‘show speech options’” is displayed.
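Putting the pieces of this example together, a practice page of the kind just described might be authored as follows. Only the <page> token, its title and practice attributes, and the deferred <instruction> token are taken from the description above; the <prompt>, <expected> and <screenshot> elements and the timeout attribute are illustrative assumptions.

<page title="stop listening" practice="true">
  <step timeout="30">
    <prompt>Do you remember how to show the context menu, or right click menu,
      for the speech recognition interface? Try showing it now!</prompt>
    <expected>show speech options</expected>
    <instruction>try saying "show speech options"</instruction>
    <screenshot>speech_options_menu.png</screenshot>
  </step>
</page>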
It can thus be seen that the present invention combines the tutorial and speech training processes in a desirable way. In one embodiment, the system is interactive in that it shows the user what happens with the speech recognition system when the commands for which the user is prompted are received by the speech recognition system. It also confines the possible recognitions at any step in the tutorial to a predefined set of recognitions in order to make speech recognition more efficient in the tutorial process, and to keep the user in a controlled tutorial environment.
It will also be noted that the tutorial system 200 is easily extensible. In order to provide a new tutorial for new speech commands or new speech functionality, a third party simply needs to author the tutorial flow content 216 and screenshots 218, and they can be easily plugged into framework 202 in tutorial system 200. This can also be done if the third party wishes to create a new tutorial for existing speech commands or functionality, or if the third party wishes to simply alter existing tutorials. In all of these cases, the third party simply needs to author the tutorial content, with referenced screenshots (or other display elements), such that it can be parsed into the tutorial schema used by tutorial framework 202. In the embodiment discussed herein, that schema is a hierarchical schema, although other schemas could just as easily be used.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
The present application is based on and claims the benefit of U.S. provisional patent application Ser. No. 60/712,873, filed Aug. 31, 2005, the content of which is hereby incorporated by reference in its entirety.