The present invention relates to speech technology. More specifically, the present invention provides an object model and interface in managed code (that uses the execution environment for memory management, object lifetime, etc.) such that applications that target the managed code environment can quickly and easily implement speech-related features.
Speech synthesis engines typically include a decoder which receives textual information and converts it to audio information that can be synthesized into speech on an audio device. Speech recognition engines typically include a decoder which receives audio information in the form of a speech signal and identifies a sequence of words from the speech signal.
The process of making speech recognition and speech synthesis more widely available has encountered a number of obstacles. For instance, the engines from different vendors behave differently under similar circumstances. Therefore, it has, in the past, been virtually impossible to change synthesis or recognition engines without inducing errors in applications that have been written with those engines in mind. Also, interactions between application programs and engines can be complex, including cross-process data marshalling, event notification, parameter validation, and default configuration, to name just a few of the complexities.
In an effort to make such technology more readily available, an interface between engines and applications was specified by a set of application programming interfaces (APIs) referred to as the speech application programming interface (SAPI). A description of a number of the features in SAPI is set out in U.S. Patent Publication Number US-2002-0069065-A1.
While these features addressed many of the difficulties associated with speech-related technology, and while they represent a great advancement in the art over prior systems, a number of difficulties still present themselves. For instance, the SAPI interfaces have not been developed and specified in a manner consistent with other interfaces in a wider platform environment, which includes non-speech technologies. This has, to some extent, required application developers who wish to utilize speech-related features offered by SAPI not only to understand the platform-wide APIs and object models, but also to understand the speech-specific APIs and object models exposed by SAPI.
The present invention provides an object model that exposes speech-related functionality to applications that target a managed code environment. In one embodiment, the object model and associated interfaces are implemented consistently with other non-speech related object models and interfaces supported across a platform.
In one specific embodiment of the invention, a dynamic grammar component is provided such that dynamic grammars can be easily authored and implemented on the system. In another embodiment, dynamic grammar sharing is also facilitated. Further, in accordance with yet another embodiment, an asynchronous control pattern is implemented for various speech-related features. In addition, semantic properties are presented in a uniform fashion, regardless of how they are generated in a result set.
Appendix A fully specifies one illustrative embodiment of a set of object models and interfaces used in accordance with one embodiment of the present invention.
The present invention deals with an object model and API set for speech-related features. However, prior to describing the present invention in greater detail, one illustrative embodiment of a computer, and computing environment, in which the present invention can be implemented will be discussed.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 190.
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user-input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Managed code layer 202 also interacts with lower level components including speech recognition engine(s) 210 and text-to-speech (TTS) engine(s) 212. Other non-speech lower level components 214 are shown as well. In one embodiment, managed code layer 202 interacts with SR engines 210 and TTS engines 212 through an optional speech API layer 216 which is discussed in greater detail below. It should be noted, of course, that all of the functionalities set out in optional speech API layer 216, or any portion thereof, can be implemented in managed code layer 202 instead.
Thus, the managed code layer 202 exposes programming object models and associated members that allow speech-related application 206 to quickly, efficiently and easily implement speech-related features (such as, for example, speech recognition and TTS features) provided by SR engine(s) 210 and TTS engine(s) 212. However, because managed code layer 202 is developed consistently across the entire platform 200, the programming models and associated members exposed by layer 202 in order to implement speech-related features are consistent with (that is, designed using the same design principles as) the programming models and members exposed by managed code layer 202 to non-speech applications 208 to implement non-speech related features.
This provides significant advantages over prior systems. For instance, similar operations are treated similarly across the entire platform. Asynchronously loading a picture into a PictureBox object, for example, is treated similarly (at the level of the programming models and members exposed by managed code layer 202) to an asynchronous speech synthesis operation. Similarly, objects and associated members are specified and accessed in the same way across the entire platform. Thus, a user need not learn two different systems when implementing speech-related features and non-speech related features. This significantly enhances the likelihood that speech-related technologies will gain wider acceptance.
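To illustrate this parallel, the following hypothetical sketch compares asynchronous picture loading with asynchronous speech synthesis. The LoadAsync method and LoadCompleted event on the PictureBox object are part of the broader platform; the SpeakAsync method and SpeakCompleted event shown on a Voice instance are assumed member names used only for illustration, and pictureBox and voice are assumed to be existing instances.

    // Asynchronously load a picture: subscribe to a completion event, then start the operation.
    pictureBox.LoadCompleted += (sender, e) => { /* the picture is now available */ };
    pictureBox.LoadAsync("http://example.com/picture.jpg");

    // Asynchronously synthesize speech using the same event-based pattern
    // (SpeakCompleted and SpeakAsync are assumed member names on the Voice class).
    voice.SpeakCompleted += (sender, e) => { /* the synthesis operation has completed */ };
    voice.SpeakAsync("Welcome to the application.");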
In any case,
SystemRecognizer 244, in one illustrative embodiment, is an object class that represents a proxy to a system-wide speech recognizer instance (or speech recognition engine instance). SystemRecognizer 244 illustratively derives from a base class (such as Recognizer) from which other recognizers (such as LocalRecognizer 246) derive. SystemRecognizer 244 generates events for recognitions, partial recognitions, and unrecognized speech. SystemRecognizer 244 also illustratively exposes methods and properties for (among many other things) obtaining attributes of the underlying recognizer or recognition engine represented by SystemRecognizer 244, for returning audio content along with recognition results, and for returning a collection of grammars that are currently attached to SystemRecognizer 244.
LocalRecognizer 246 also inherits from the base Recognizer class and is illustratively an object class that represents an in-process instance of a speech recognition engine in the address space of the application. Therefore, unlike SystemRecognizer 244, which is shared with other processes in environment 200, LocalRecognizer 246 is totally under the control of the process that creates it.
Each instance of LocalRecognizer 246 represents a single recognition engine 210. The application 206 that owns the instance of LocalRecognizer 246 can connect to each recognition engine 210, in one or more recognition contexts, from which the application can control the recognition grammars to be used, start and stop recognition, and receive events and recognition results. In one embodiment, the handling of multiple recognition processes and contexts is discussed in greater detail in U.S. Patent Publication Number US-2002-0069065-A1.
LocalRecognizer 246 also illustratively includes methods that can be used to call for synchronous or asynchronous recognition (discussed in greater detail below), and to set the input source for the audio to be recognized. For instance, the input source can be set to a URI string that specifies the location of input data, a stream, etc.
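By way of a hypothetical sketch (the member names shown are assumptions used only for illustration), an application might use LocalRecognizer 246 as follows:

    LocalRecognizer recognizer = new LocalRecognizer();

    // Set the input source for the audio to be recognized, for instance a URI string or a stream.
    recognizer.SetInput("file://c:/audio/command.wav");   // assumed member name

    // Synchronous recognition: the call returns when the engine produces a result.
    RecognitionResult result = recognizer.Recognize();    // assumed member name

    // Asynchronous recognition: the call returns immediately and a recognition
    // event is generated when the engine produces a result.
    recognizer.RecognizeAsync();                          // assumed member name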
RecognitionResult 250 is illustratively an object class that provides data for the recognition event, the rejected recognition event and hypothesis events. It also illustratively has properties that allow the application to obtain the results of a recognition. RecognitionResult component 250 is also illustratively an object class that represents the result provided by a speech recognizer, when the speech recognizer processes audio and attempts to recognize speech. RecognitionResult 250 illustratively includes methods and properties that allow an application to obtain alternate phrases which may have been recognized, a confidence measure associated with each recognition result, an identification of the grammar that produced the result and the specific rule that produced the result, and additional SR engine-specific data.
Further, RecognitionResult 250 illustratively includes a method that allows an application to obtain semantic properties. In other words, semantic properties can be associated with items in rules in a given grammar. This can be done by specifying the semantic property with a name/value pair or by attaching script to a rule that dynamically determines which semantic property to emit based on evaluation of the expression in the script. When a rule that is associated with a semantic property is activated during recognition (i.e., when that rule spawns the recognition result), a method on RecognitionResult 250 allows the application to retrieve the semantic property tag that identifies the semantic property that was associated with the activated rule. The handler object that handles the RecognitionEvent can then be used to retrieve the semantic property which can be used by the application in order to shortcut otherwise necessary processing. In accordance with one embodiment, the semantic property is retrieved and presented by RecognitionResult 250 as a standard collection of properties regardless of whether it is generated by a name/value pair or by script.
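For example, given a RecognitionResult instance named result, a hypothetical sketch of retrieving the semantic properties (the member name shown is an assumption used only for illustration) might read:

    // The semantic properties are presented as a standard collection of name/value
    // pairs, whether they were generated by name/value pairs in the grammar or by
    // script attached to a rule.
    foreach (KeyValuePair<string, object> property in result.SemanticProperties)   // assumed member name
    {
        Console.WriteLine(property.Key + " = " + property.Value);
    }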
Grammar 252 is illustratively a base object class that comprises a logical housing for individual recognition rules and dictation grammars. Grammar 252 thus generates events when the grammar spawns a full recognition, a non-recognition, or a partial recognition. Grammar 252 also illustratively exposes methods and properties that can be used to load a grammar into the object from various sources, such as a stream or another specified location. The methods and properties exposed by Grammar object 252 also illustratively specify whether the grammar and individual rules in the grammar are active or inactive, and the specific speech recognizer 244 that hosts the grammar.
In addition, Grammar object 252 illustratively includes a property that points to another object class that is used to resolve rule references. For instance, as is discussed in greater detail below with respect to
DictationGrammar 254 is illustratively an object class that derives from Grammar object 252. DictationGrammar object 254 includes individual grammar rules and dictation grammars. It also illustratively includes a Load method that allows these grammars to be loaded from different sources.
SrgsGrammar component 260 also illustratively inherits from the base Grammar class 252. However, SrgsGrammar class 260 is configured to expose methods and properties, and to generate events, that enable a developer to quickly and easily generate grammars that conform to the standardized Speech Recognition Grammar Specification (SRGS) adopted by the W3C (the World Wide Web Consortium). As is well known, the SRGS standard is an XML format and structure for defining speech recognition grammars. It includes a small number of XML elements, such as a grammar element, a rule element, an item element, and a one-of element, among others. The SrgsGrammar class 260 includes properties to get and set all rules and to get and set a root rule in the grammar. Class 260 also includes methods to load SrgsGrammar class instances from a string, a stream, etc.
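As one hypothetical illustration (the LoadFromString and Root members shown are assumed names used only for illustration), an application might load a small grammar, expressed in the standard SRGS XML format, from a string:

    string srgsXml =
        "<grammar xmlns=\"http://www.w3.org/2001/06/grammar\" version=\"1.0\"" +
        " xml:lang=\"en-US\" root=\"playCommand\">" +
        "  <rule id=\"playCommand\">" +
        "    <item>play music by</item>" +
        "    <one-of>" +
        "      <item>artist one</item>" +
        "      <item>artist two</item>" +
        "    </one-of>" +
        "  </rule>" +
        "</grammar>";

    SrgsGrammar grammar = new SrgsGrammar();
    grammar.LoadFromString(srgsXml);   // assumed member name
    Rule rootRule = grammar.Root;      // assumed member name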
SrgsGrammar 260 (or another class) also illustratively exposes methods and properties that allow the quick and efficient dynamic creation and implementation of grammars. This is described in greater detail with respect to
Category 256 is described in greater detail later in the specification. Briefly, Category 256 can be used to associate grammars with categories. This better facilitates use of the present invention in a multi-application environment.
GrammarCollection component 258 is illustratively an object class that represents a collection of grammar objects 252. It illustratively includes methods and properties that allow an application to insert or remove grammars from the collection.
While
Voice 280 is illustratively a primary object class that can be accessed in order to implement fundamental speech synthesis features. For instance, in one illustrative embodiment, Voice class 280 generates events when speech has started or ended, or when bookmarks are reached. It also illustratively includes methods which can be called to implement a speak operation, or an asynchronous speak operation. Those methods also allow the application to specify the source of the input to be spoken. Similarly, Voice class 280 illustratively exposes properties and methods which allow an application to get the attributes of the voice, get and set the rate of speech and the particular synthesizer where this Voice class 280 is to be used, as well as to get and set the volume of the voice.
VoiceAttributes 282 is illustratively an object class that represents the attributes of the TTS voice being used to synthesize the input. VoiceAttributes class 282 illustratively exposes a method that allows an application to instantiate a voice by iterating through a list of available voices and checking properties of each voice against desired properties, or by specifying another identifier for the voice. Such properties can include, for example, the approximate age of the speaker, the desired gender for the speaker, a platform-specific voice, cultural information related to the speaker, audio formats supported by the synthesizer to be used, or a specific vendor that has provided the voice.
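For example, a hypothetical sketch of selecting a voice by checking the attributes of each available voice against desired properties (all member and property names shown are assumptions used only for illustration) might read:

    foreach (VoiceAttributes attributes in Voice.GetInstalledVoices())   // assumed member name
    {
        // Check properties such as gender and cultural information against the
        // properties desired by the application.
        if (attributes.Gender == "Female" && attributes.Culture == "en-US")   // assumed property names
        {
            voice.Select(attributes);   // assumed member name; voice is an existing Voice instance
            break;
        }
    }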
SpeechSynthesizer 285 is illustratively an object class that exposes elements used to represent a TTS engine. SpeechSynthesizer 285 exposes methods that allow pause, skip and resume operations. It also generates events for changes in audio level, and synthesis of phonemes and visemes. Of course, other exposed members are discussed in Appendix A hereto.
Again, as with respect to
In any case, a recognizer (such as SystemRecognizer 244 or LocalRecognizer 246) is first selected and the selected recognizer is instantiated. This is indicated by blocks 300 and 302 in
Once the grammar is created, it is attached to the recognizer. This is indicated by block 306 in
After the grammar has been activated, an event handler that is to be used when a recognition occurs is identified. This is indicated by block 310. At this point, the speech recognizer has been instantiated and a grammar has been created and assigned to it. The grammar has been activated and therefore the speech recognizer is simply listening and waiting to generate a recognition result from an input.
Thus, the next step is to wait until a rule in an active grammar has caused a recognition. This is indicated by block 312. When that occurs, the recognizer generates a recognition event at block 314. The recognition event is propagated to the application through the event handler. This is indicated at block 316. A RecognitionResult class is also generated and made available to the application. This is indicated by block 318. The application, having access to the RecognitionEventArgs and RecognitionResult classes can obtain all of the necessary information about the recognition result in order to perform its processing. Of course, since the recognition event and the recognition result are propagated up to the application layer through managed code layer 202, they are provided to the application layer through APIs and an object model and associated members that are consistent with the APIs and object model and associated members used to provide other non-speech applications with information, across the platform.
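The flow just described can be summarized in a hypothetical managed-code sketch. The class names follow the object model described above, while the individual member names are assumptions used only for illustration.

    void StartListening()
    {
        // Select and instantiate a recognizer (blocks 300 and 302).
        SystemRecognizer recognizer = new SystemRecognizer();

        // Create a grammar and attach it to the recognizer (block 306).
        Grammar grammar = new Grammar();
        grammar.Load("commands.grxml");            // assumed member name and file name
        recognizer.Grammars.Add(grammar);          // assumed member name

        // Activate the grammar and identify an event handler for recognitions (block 310).
        grammar.Enabled = true;                    // assumed member name
        recognizer.Recognition += OnRecognition;   // assumed event name
    }

    // When a rule in an active grammar causes a recognition, the recognizer generates
    // a recognition event and a RecognitionResult is made available (blocks 312-318).
    void OnRecognition(object sender, RecognitionEventArgs e)
    {
        RecognitionResult result = e.Result;       // assumed member name
    }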
Next, the application can set the characteristics of the voice to be synthesized. This is indicated by block 352. Again, of course, the application need not revise or set any characteristics of the voice, but can simply provide a source of information to be synthesized and call the Speak method on Voice class 280. Default voice characteristics will be used. However, as illustrated in
Next, the application calls a Speak method on Voice class 280. This is indicated by block 354. In one illustrative embodiment, the application can call a synchronous speak method or an asynchronous speak method. If the synchronous speak method is called, the instantiated TTS engine generates the synthesis data from the identified source and the method returns when the TTS engine has completed that task.
However, the application can also call an asynchronous speak method on Voice class 280. In that case, Voice class 280 can return to the application a pointer to another object that can be used by the application to monitor the progress of the speech synthesis operation being performed, to simply wait until the speech synthesis operation is completed, or to cancel the speech synthesis operation. Of course, it should be noted that in another embodiment the Voice class itself implements these features rather than pointing the application to a separate object. In any case, however, the asynchronous speech pattern is illustratively enabled by the present invention. When the speech operation is complete, an event is generated.
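A hypothetical sketch of the two speak patterns just described (the member names shown are assumptions used only for illustration) might read:

    Voice voice = new Voice();

    // Synchronous speak: the method returns when the TTS engine has completed synthesis.
    voice.Speak("Welcome to the sample application.");                     // assumed member name

    // Asynchronous speak: the method returns immediately and an event is
    // generated when the speech operation is complete.
    voice.SpeakCompleted += (sender, e) => { /* synthesis complete */ };   // assumed event name
    voice.SpeakAsync("This sentence is synthesized in the background.");   // assumed member name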
Again, as with the flow diagram set out in
FIG. 7 illustrates one implementation that uses XML to define a simplified grammar. The specific recognizer can then compile this into its internal data structures. The XML statements set out in
It will be noted that, in accordance with the present invention, instead of specifying the “one-of” element, the rule could contain a reference to another rule which specifies the list of artists.
Such a statement could be:
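    <ruleref uri="ArtistNames.grxml"/>   <!-- the URI value shown is illustrative only -->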
In that statement, the URI identifies a uniform resource identifier for the artist names. The rule for the artist names can be written to obtain artist names from a database by performing a database query operation, by identifying an already compiled list of artist names, or by specifying any other way of identifying artist names. However, in one embodiment, the artist names rule is left empty and is computed and loaded at run time. Therefore, the list of artists need not already be generated and stored in memory (thus consuming memory space). Instead, it can be generated at run time and loaded into the grammar, only when the grammar is instantiated.
The second line in
Once the grammar object g has been created, and a rule object r has been created, the next task is to identify an item to be included in the rule. Thus, the third line of
The next step is to create an item object containing a list of all of the artists for which music has been stored. Therefore, the fourth line includes a OneOf statement that identifies the rule r, the elements container, and calls the AddOneOf method that contains a list of all of the artists. The rule r is then identified as the root.
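In a hypothetical managed-code form (the member names, rule name, item text and artist names shown are assumptions used only for illustration), the statements just described might read:

    Grammar g = new Grammar();                         // first line: create the grammar object g
    Rule r = new Rule("playByArtist");                 // second line: create the rule object r
    r.Elements.Add("play music by");                   // third line: the item to be included in the rule
    r.Elements.AddOneOf("artist one", "artist two");   // fourth line: a OneOf containing the list of artists
    g.Root = r;                                        // the rule r is identified as the root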
The last three lines of
The grammar can be made even more dynamic by replacing the fourth line of code shown in
The empty OneOf object is filled at run time. Using standard platform-based conventions and constructs, the empty OneOf object can be filled in as follows:
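    oneOf.Add([xyz]);   // hypothetical form; oneOf denotes the empty OneOf object created above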
where [xyz] is an expression that will be evaluated at run time to obtain a set of artists in an array. This expression can be a relational database query, a web service expression, or any other way of specifying an array of artists.
The present invention thus provides a grammar maintenance component 424 that is used to maintain grammars 420 and 422. Hence, when a grammar class, such as Grammar class 252, is invoked to add or change a rule, that is detected by grammar maintenance component 424. For instance, assume that an application invokes a method on Grammar class 252 to change a rule in grammar 420 for Application A. Grammar maintenance component 424 detects a change in that rule and identifies all of the grammars that refer to that rule so that they can be updated with the changed rule. In this way, even dynamic grammars can be shared and can refer to one another without the grammars becoming outdated relative to one another.
Next, component 424 identifies any grammars that refer to the changed grammar. This is indicated by block 432. Component 424 then propagates changes to the identified grammars as indicated by block 434.
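A hypothetical sketch of this propagation logic (the type and member names shown are assumptions used only for illustration) might read:

    // Invoked by grammar maintenance component 424 when a rule in a grammar is added or changed.
    void OnRuleChanged(Grammar changedGrammar, Rule changedRule)
    {
        // Identify any grammars that refer to the changed grammar (block 432).
        foreach (Grammar grammar in loadedGrammars)            // assumed collection of loaded grammars
        {
            if (grammar.References(changedGrammar))            // assumed member name
            {
                // Propagate the change to the referring grammar (block 434).
                grammar.UpdateRuleReference(changedRule);      // assumed member name
            }
        }
    }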
In one embodiment, grammar maintenance component 424 is implemented in the optional speech API layer 216 (such as SAPI). However, it could be fully implemented in managed code layer 202 as well.
While a plurality of other features are further defined in Appendix A hereto, an additional feature is worth specifically mentioning in the specification. The present invention can illustratively be used to great benefit in multiple-application environments, such as on the desktop. In such an environment, multiple applications are often running at the same time. In fact, those multiple applications may all be interacting with managed code layer 202 to obtain speech-related features and services.
For instance, a command-and-control application may be running which will recognize and execute command and control operations based on a spoken input. Similarly, however, a dictation application may also be running which allows the user to dictate text into a document. In such an environment, both applications will be listening for speech, and attempting to activate grammar rules to identify that speech.
It may happen that grammar rules coincide in different applications such that grammars associated with two different applications can be activated based on a single speech input. For instance, assume that the user is dictating text into a document which includes the phrase “The band wanted to play music by band xyz,” where the phrase “band xyz” identifies an artist or band. In that instance, the command and control application may activate a rule and attempt to invoke a media component in order to “play music by band xyz.” Similarly, the dictation application may attempt to identify the sentence dictated using its dictation grammar and input that text into the document being dictated.
In accordance with one embodiment of the present invention, and as discussed above, with respect to
The property value will be changed based on different contexts. This allows the present system to be used in a multi-application environment while significantly reducing the likelihood that such misrecognitions will take place.
It can thus be seen that the present invention provides an entirely new API and object model and associated members for allowing speech-related features to be implemented quickly and easily. The API and object model have a form that is illustratively consistent with other APIs and object models supported by a platform-wide environment. Similarly, the present invention can be implemented in a managed code environment to enable managed code applications to take advantage of speech-related features very quickly and easily.
The present invention also provides a beneficial mechanism for generating and maintaining dynamic grammars, for eventing, and for implementing asynchronous control patterns for both speech synthesis and speech recognition features. One embodiment of the present invention sits on top of an API layer, such as SAPI, and wraps and extends the functionality provided by the SAPI layer. Thus, that embodiment of the present invention can utilize, among other things, the multi-process shared engine