Embodiments of the invention relate to voice-driven systems, and in particular to a voice-driven system that includes cooperating voice dialog and business logic interpreters.
In general, voice-enabled software is computationally intensive. For example, voice-enabled software often requires consideration of voice-enabled operations (e.g., capturing and converting speech input from a user and/or providing speech output to a user) that operate pursuant to a particular flow of dialog. It also often requires consideration of other logical operations, such as determinations of the truth of a particular condition.
Operations in voice-enabled software are typically implemented in a serial manner, that is, operations are completed as they are encountered. For example, VoiceXML (VXML) has been implemented in some voice-enabled software to provide speech synthesis and speech recognition as well as the business logic for voice-enabled software. VXML for voice-enabled software contains both the control flow as well as the business logic in the XML itself. However, this serial flow prevents optimization of the voice-enabled software. Specifically, VXML typically requires that operations be completed as they are encountered, thus leaving no capability for optimization to improve the operation of the voice-enabled software.
Embodiments of the invention address the deficiencies of the prior art by providing a method, apparatus, and program product to cooperatively mediate between voice-enabled operations and business logic. The method comprises receiving XML data and generating at least one object from the XML data. The method further comprises, in response to determining that the at least one object has been called, implementing an operation defined by a portion of the object.
Embodiments of the invention provide for the creation of voice dialog objects that can be subsequently called during a dialog flow. In this manner, embodiments of the invention allow for the mediation between voice-enabled operations and business logic, allowing pre-processing of some operations and/or parallelizing of those operations. Thus, the efficiency and operation of the voice-enabled application can be increased without sacrificing operational capability. These and other advantages will be apparent in light of the following figures and detailed description.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with a general description of the invention given above and the detailed description of the embodiments given below, serve to explain the principles of the invention.
It should be understood that the appended drawings are not necessarily to scale, presenting a somewhat simplified representation of various features illustrative of the basic principles of embodiments of the invention. The specific design features of embodiments of the invention as disclosed herein, including, for example, specific dimensions, orientations, locations, and shapes of various illustrated components, as well as specific sequences of operations (e.g., including concurrent and/or sequential operations), will be determined in part by the particular intended application and use environment. Certain features of the illustrated embodiments may have been enlarged or distorted relative to others to facilitate visualization and clear understanding.
Turning now to the drawings, wherein like numbers denote like parts throughout the drawings,
The VCS 12 is coupled to at least one peripheral device through an input/output device interface 38 (illustrated as, and hereinafter, “I/O I/F” 38). In particular, the VCS 12 receives data from a user through at least one user interface 40 (including, for example, a keyboard, a mouse, a microphone, and/or another user interface) and/or outputs data to the user through at least one output device 42 (including, for example, a display, speakers, a printer, and/or another output device). Moreover, in some embodiments, the I/O I/F 38 communicates with a device that is operative as a user interface 40 and output device 42 in combination, such as a touch screen display (not shown).
The VCS 12 is typically under the control of an operating system 44 and executes or otherwise relies upon various computer software applications, sequences of operations, components, programs, files, objects, modules, etc., consistent with embodiments of the invention. In specific embodiments, the VCS 12 executes or otherwise relies on a voice client application 46 to manage the cooperation of voice dialogs and business logic. The voice client application is referred to hereinafter as a “VoiceArtisan” application 46. The mass storage 34 of the VCS 12 includes a voice dialog data structure 48 and a log data structure 50. The VoiceArtisan application 46 may further log data associated with its operation and store that data in the log data structure 50.
The mobile system 16 is configured to implement a voice dialog flow (e.g., a voice enabled set of steps, such as for a pick-and-place, voice-assisted, or voice-directed operation), capture speech input, and execute business logic. In those embodiments in which the functionality of the VCS 12 and mobile system 16 are separate, the mobile system 16 is also configured to communicate with the VoiceArtisan application 46 across the network 18.
In some embodiments, the user 64 interfaces with the mobile device 60 (and the mobile device 60 interfaces with the user 64) through the headset 62, which is coupled to the mobile device 60 through a cord 68. In alternative embodiments, the headset 62 is a wireless headset and coupled to the mobile device 60 through a wireless signal (not shown). The headset 62 includes a speaker 70 and a microphone 72. The speaker 70 is configured to play audio (e.g., such as speech output associated with a voice dialog to instruct the user 64 to perform an action), while the microphone 72 is configured to capture speech input from the user 64 (e.g., such as for conversion to machine readable input). As such, and in some embodiments, the user 64 interfaces with the mobile device 60 hands-free through the headset 62.
In some embodiments, the mobile device 60 additionally includes at least one input/output interface 86 (illustrated as, and hereinafter, “I/O I/F” 86) configured to communicate with at least one peripheral other than the headset 62. Such a peripheral may include at least one of one or more training devices (e.g., to coach a new user through training to use the mobile device 60, headset 62, and/or a system to which they are coupled), image scanners, barcode readers, RFID readers, monitors, printers, user interfaces, output devices, and/or other peripherals (none shown). In specific embodiments, the I/O I/F 86 includes at least one peripheral interface, including at least one of one or more serial, universal serial bus (USB), PC Card, VGA, HDMI, DVI, and/or other interfaces (e.g., other computer, communicative, data, audio, and/or visual interfaces) (none shown). The mobile device 60 also includes a power supply 88, such as a battery, rechargeable battery, rectifier, and/or other power source. The mobile device 60 monitors the voltage from the power supply 88 with a power monitoring circuit 90. In some embodiments, and in response to the power monitoring circuit 90 determining that the power from the power supply 88 is insufficient, the mobile device 60 shuts down to prevent potential damage. The mobile device 60 is configured to communicate with the headset 62 through a headset interface 92 (illustrated as, and hereinafter, “headset I/F” 92), which is in turn configured to couple to the headset 62 through the cord 68 and/or wirelessly. In specific embodiments, the mobile device 60 couples to the headset 62 through the Bluetooth® open wireless technology standard known in the art.
The mobile device 60 may be under the control of and/or otherwise rely upon various software applications, components, programs, files, objects, modules, etc. (hereinafter, “program code”) consistent with embodiments of the invention. This program code may include an operating system 94 (e.g., such as a Windows Embedded Compact operating system as distributed by Microsoft Corporation of Redmond, Wash.) as well as one or more software applications (e.g., configured to operate in an operating system or as “stand-alone” applications). As such, the memory 82 is configured with a voice application 96 to implement dialog flows, execute business logic, and/or communicate with the VoiceArtisan application 46. The memory 82 further includes a data store 98 to store data related to the mobile device 60, headset 62, and/or user 64.
In some embodiments, a suitable mobile device 60 for implementing the present invention is a Talkman® wearable computer available from Vocollect, Inc., of Pittsburgh, Pa. The mobile device 60 is utilized in a voice-enabled system, which uses speech recognition technology for documentation and/or communication. The headset 62 provides hands-free voice communication between the user 64 and the mobile device 60. For example, in one embodiment, the voice application 96 implements a dialog flow, such as for a pick-and-place, voice-assisted, or voice-directed operation. The voice application 96 communicates with the VoiceArtisan application 46 to call voice dialogs. In turn, the voice application 96 can capture speech input for subsequent conversion to a useable digital format (e.g., machine readable input) by the VoiceArtisan application 46.
Referring back to
A person having ordinary skill in the art will recognize that the environments illustrated in
Thus, a person having ordinary skill in the art will recognize that other alternative hardware and/or software environments may be used without departing from the scope of the invention. For example, the VoiceArtisan application 46 and/or voice application 96 may be configured with fewer or additional modules, while the mass storage 34 and/or memory 82 may be configured with fewer or additional data structures. Additionally, a person having ordinary skill in the art will appreciate that the VCS 12 and/or mobile system 16 may include more or fewer applications disposed therein. As such, other alternative hardware and software environments may be used without departing from the scope of embodiments of the invention.
Moreover, a person having ordinary skill in the art will appreciate that the terminology used to describe various pieces of data, such as XML, Python, voice dialog, business logic instance, programming language, speech input, and machine readable input are merely used for differentiation purposes and not intended to be limiting.
The routines executed to implement the embodiments of the invention, whether implemented as part of an operating system or a specific application, component, program, object, module, or sequence of instructions executed by one or more computing systems, will be referred to herein as a “sequence of operations,” a “program product,” or, more simply, “program code.” The program code typically comprises one or more instructions that are resident at various times in various memory and storage devices in a computing system (e.g., the VCS 12 and/or mobile system 16), and that, when read and executed by one or more processors of the computing system, cause that computing system to perform the operations necessary to execute the steps, elements, and/or blocks embodying the various aspects of the invention.
While the invention has been and hereinafter will be described in the context of fully functioning computing systems, those skilled in the art will appreciate that the various embodiments of the invention are capable of being distributed as a program product in a variety of forms, and that the invention applies equally regardless of the particular type of computer readable media used to actually carry out the distribution. Examples of computer readable media include, but are not limited to, physical and tangible recordable type media such as volatile and nonvolatile memory devices, floppy and other removable disks, hard disk drives, and optical disks (e.g., CD-ROMs, DVDs, etc.), among others.
In addition, various program code described hereinafter may be identified based upon the application or software component within which it is implemented in a specific embodiment of the invention. However, it should be appreciated that any particular program nomenclature that follows is used merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature. Furthermore, given the typically endless number of manners in which computer programs may be organized into routines, procedures, methods, modules, objects, and the like, as well as the various manners in which program functionality may be allocated among various software layers that are resident within a typical computer (e.g., operating systems, libraries, APIs, applications, applets, etc.), it should be appreciated that the invention is not limited to the specific organization and allocation of program functionality described herein.
In general, a dialog flow is created by a user and defines a voice dialog and/or business logic for a voice-enabled operation, such as a pick-and-place, voice-assisted, or voice-directed operation. The user graphically defines the dialog flow in a development environment and further graphically defines the voice dialogs and/or business logic therein. When the dialog flow is built, the development environment generates a Python script for the voice application as well as XML data corresponding to the voice dialogs called in that Python script for the VoiceArtisan application. The VoiceArtisan application, in turn, analyzes the XML data to create voice dialog objects corresponding to the voice dialogs defined in that XML data. When a voice dialog is called by the voice application, the VoiceArtisan application executes a corresponding voice dialog object to perform the function associated with that voice dialog.
In some embodiments, a voice dialog defines a state machine that includes nodes and transitional links that in turn define at least one speech output and/or business logic. Each node represents a state in the state machine, while each link is a transition between states. Without intending to be limiting, types of transitions may include at least one of the following: a default link (in which there is an immediate, unconditional transition); a vocabulary link (in which there is a transition based on recognition of a spoken vocabulary word or phrase); and a conditional link (in which there is a transition based on the truth of a specific condition). Voice dialogs are connected to the business logic via node or link callback methods.
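By way of illustration only, such a state machine may be modeled with node and link structures along the following lines (a minimal sketch in Python; the class and field names are hypothetical and are not drawn from any listing herein):

from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Link:
    # A transition between two nodes of a voice dialog.
    target: "Node"
    kind: str  # "default", "vocabulary", or "conditional"
    word: Optional[str] = None  # spoken word or phrase for a vocabulary link
    condition: Optional[Callable[[], bool]] = None  # predicate for a conditional link

@dataclass
class Node:
    # A state in the voice dialog; may provide speech output and fire an
    # "on entry" business-logic callback when the state is entered.
    name: str
    speech_output: Optional[str] = None
    on_entry: Optional[Callable[[], None]] = None
    links: List[Link] = field(default_factory=list)

    def enter(self) -> None:
        if self.speech_output:
            print("TTS:", self.speech_output)  # stand-in for speech synthesis
        if self.on_entry:
            self.on_entry()  # hand control to the business logic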
For example,
In some embodiments, the first node 202 and/or second node 204 may be assigned respective “on entry” functions. As such, the first node 202 may be assigned an “on entry” function called “first_dialog_state_one( )” (e.g., that indicates that there is a voice dialog for the first node 202 of the first voice dialog 200 to specify “At State One” when that node is entered) while the second node 204 may be assigned an “on entry” function called “first_dialog_state_two( )” (e.g., that indicates that there is a voice dialog for the second node 204 of the first voice dialog 200 to specify “At State Two” when that node is entered).
Also for example,
Consistent with embodiments of the invention, the first voice dialog 200 and/or second voice dialog 210 may be called by a voice application. In particular, the first and/or second voice dialogs 200 and/or 210 may be called by the “main( )” function of a dialog flow executed by the voice application. Specifically, when a dialog flow is built, at least two files are created. The first is the pseudocode for the dialog flow that is executed by the voice application. This pseudocode includes calls to voice dialogs and/or business logic. The second includes XML data that defines the voice dialogs and/or specific business logic associated therewith. The following Code Listing 1 illustrates one embodiment of Python pseudocode for a dialog flow that may be implemented by a voice application, including a “main( )” function illustrating the use of the first and second voice dialogs 200 and 210 as well as business logic.
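A representative sketch of such pseudocode follows; the “run_dialog” helper and the placeholder function bodies are hypothetical stand-ins rather than the generated code itself:

def run_dialog(name):
    # Hypothetical stand-in for the call that hands control to the
    # VoiceArtisan application, which executes the corresponding voice
    # dialog object until that dialog terminates.
    print("running voice dialog:", name)

def first_dialog_state_one():
    # Business logic fired on entry to the first node of first_dialog.
    pass

def first_dialog_state_two():
    # Business logic fired on entry to the second node of first_dialog.
    pass

def second_dialog_condition():
    # Conditional-link predicate evaluated during second_dialog.
    return True

def main():
    run_dialog("first_dialog")   # first_dialog_state_one()/_two() fire here
    # ... business logic, if required ...
    run_dialog("second_dialog")  # second_dialog_condition() is evaluated here
    # ... business logic, if required ...

if __name__ == "__main__":
    main()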
Thus, as is evident from Code Listing 1, control is handed back and forth between voice dialogs and business logic. In particular, Code Listing 1 indicates that execution begins in the “main( )” function, which runs the “first_dialog.” During “first_dialog” execution, there are calls to the “first_dialog_state_one( )” and “first_dialog_state_two( )” functions, which are executed. After the “first_dialog” function terminates, control returns to the “main( )” function to implement business logic, if required. The “main( )” function then runs the “second_dialog” function, in which at least one call to a “second_dialog_condition( )” function is executed. After the “second_dialog” function terminates, control again returns to the “main( )” function to implement business logic, if required. The voice application terminates when the “main( )” function completes.
The XML data, in turn, may be used by a task execution engine of the VoiceArtisan application to construct voice dialog objects representing the voice dialogs. In particular, the VoiceArtisan application may parse the XML data and build the voice dialog objects in C++ as corollaries to the voice dialogs. When a voice dialog is called by the voice application, the corresponding voice dialog object is executed to implement that voice dialog. Specifically, Code Listing 2 illustrates one embodiment of the XML data that includes data about the voice dialogs and/or business logic that are called by Code Listing 1.
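The parse-and-build step may be sketched as follows (in Python rather than C++, purely for brevity of illustration; the XML element and attribute names are hypothetical and do not reproduce Code Listing 2):

import xml.etree.ElementTree as ET

# Hypothetical XML in the spirit of Code Listing 2 (schema names are invented).
SAMPLE_XML = """
<dialogs>
  <dialog name="first_dialog">
    <node name="state_one" speech="At State One" onEntry="first_dialog_state_one"/>
    <node name="state_two" speech="At State Two" onEntry="first_dialog_state_two"/>
    <link from="state_one" to="state_two" type="default"/>
  </dialog>
</dialogs>
"""

def build_dialog_objects(xml_text):
    # Parse the XML data and build one in-memory object per voice dialog,
    # mirroring the task execution engine's construction of dialog objects.
    dialogs = {}
    for dialog_el in ET.fromstring(xml_text).findall("dialog"):
        nodes = {}
        for node_el in dialog_el.findall("node"):
            nodes[node_el.get("name")] = {
                "speech": node_el.get("speech"),
                "on_entry": node_el.get("onEntry"),
                "links": [],
            }
        for link_el in dialog_el.findall("link"):
            nodes[link_el.get("from")]["links"].append(
                {"to": link_el.get("to"), "type": link_el.get("type")})
        dialogs[dialog_el.get("name")] = nodes
    return dialogs

if __name__ == "__main__":
    print(build_dialog_objects(SAMPLE_XML))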
Thus, the XML representation directs the VoiceArtisan application to perform the corresponding actions for a voice dialog, whether that be providing speech output, performing speech recognition, or executing another action. After a speech output, the VoiceArtisan application may pass control back to the voice application to implement business logic, such as the business logic defined by a transitional link. As such, the interaction of the VoiceArtisan application and the voice application allows for the abstraction of functions across the two. However, direct access to the functionality of the VoiceArtisan application is prevented, maintaining the security thereof.
As detailed above, embodiments of the invention may be used to coordinate voice dialogs and business logic for a voice-enabled system, and in particular for a pick-and-place, voice-assisted, and/or voice-directed operation. For example, the dialog flow for a voice application can specify calls for a voice dialog. The voice dialog is recognized by the VoiceArtisan application, which has already created voice dialog objects correlated to the voice dialogs called in the dialog flow. The VoiceArtisan application executes a voice dialog when called. Concurrently, business logic may be implemented by the voice application and/or the VoiceArtisan application based upon information associated with either the voice dialog or the dialog flow.
When the voice application is in communication with the VoiceArtisan application (“Yes” branch of decision block 222), the voice application determines, from a memory, at least one dialog flow to implement (block 224). When the voice application does not determine any dialog flows to implement (“No” branch of decision block 226), the sequence of operations may end.
In specific embodiments, each dialog flow is defined in XML and includes business logic as well as at least one call to a voice dialog. Additionally, a dialog flow may define particular vocabulary words that are used with that dialog flow in addition to those utilized with a voice dialog. As such, when the voice application determines that there is at least one dialog flow to implement (“Yes” branch of decision block 226), the voice application determines whether all words associated with that dialog flow are available to be converted from speech input to machine readable input or vice-versa (e.g., whether the text-to-speech engine and/or a voice recognizer can convert the particular word to machine readable input and/or convert the particular word to speech output, such that the text-to-speech engine and/or voice recognizer have been “trained”) (block 228). When all words for a dialog flow have not been trained (“No” branch of decision block 228), the voice application may capture speech input associated with that word and/or words to train the text-to-speech engine and/or voice recognizer (block 230). When all words for a dialog flow have been trained (“Yes” branch of decision block 228 or block 230), the voice application locates the main module associated with that dialog flow and executes the dialog flow (block 232).
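In a simplified rendering, the check of blocks 228 through 232 might proceed along these lines (a sketch only; the Recognizer stand-in and the flow dictionary are hypothetical):

class Recognizer:
    # Stand-in for the text-to-speech engine and/or voice recognizer.
    def __init__(self):
        self.trained = set()

    def is_trained(self, word):
        return word in self.trained

    def train(self, word):
        # Capture speech input for the word to train the recognizer (block 230).
        print("capturing speech input to train:", word)
        self.trained.add(word)

def implement_dialog_flow(flow, recognizer):
    # Verify every word used by the dialog flow is trained, then locate and
    # execute the flow's main module (blocks 228 through 232).
    for word in flow["vocabulary"]:
        if not recognizer.is_trained(word):   # block 228
            recognizer.train(word)            # block 230
    flow["main"]()                            # block 232

if __name__ == "__main__":
    flow = {"vocabulary": ["ready", "yes", "no"],
            "main": lambda: print("executing dialog flow")}
    implement_dialog_flow(flow, Recognizer())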
In a voice dialog, nodes may be transitioned from one to another with links, which may include default, vocabulary, or conditional links. If there is no link associated with a particular speech output (“No” branch of decision block 248), the sequence of operations ends. However, when there is a link associated with a particular speech output (“Yes” branch of decision block 248), the VoiceArtisan application determines whether the link is a conditional link (e.g., an automatic link) (block 250). In a conditional link, there is a transition from one node to another when a condition associated with that link is true. Thus, when there is a conditional link (“Yes” branch of decision block 250), the VoiceArtisan application transitions to the next node when the condition associated with that link is true (block 252) and the sequence of operations returns to block 248.
However, when the link is not a conditional link (“No” branch of decision block 250), the VoiceArtisan application determines whether the link is a vocabulary link (block 254). When the link is a vocabulary link (“Yes” branch of decision block 254), the VoiceArtisan application transitions to the next node based on the recognition of a spoken vocabulary word or phrase. As such, when the particular vocabulary word or phrase is spoken, the VoiceArtisan application transitions to the next node to send another voice dialog and/or implement business logic (block 256) and returns to block 248. However, when the link is not a vocabulary link (“No” branch of decision block 254), the link may be a default link. In a default link, there is an immediate, unconditional transition from one node to another. As such, the VoiceArtisan application transitions to the next node to send another voice dialog and/or implement business logic (block 258) and returns to block 248.
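The dispatch among the three link types (blocks 248 through 258) may be sketched as follows, using hypothetical link dictionaries (a vocabulary link carries a "word" entry and a conditional link a "condition" predicate):

def next_node(current, spoken_word=None):
    # Return the node to transition to, or None when no link applies (block 248).
    for link in current["links"]:
        if link["type"] == "conditional":        # block 250
            if link["condition"]():              # block 252: transition when true
                return link["to"]
        elif link["type"] == "vocabulary":       # block 254
            if spoken_word == link["word"]:      # block 256: recognized word/phrase
                return link["to"]
        else:                                    # default link
            return link["to"]                    # block 258: immediate transition
    return None

# Example: a vocabulary link from node_a to node_b on the word "ready".
node_b = {"links": []}
node_a = {"links": [{"type": "vocabulary", "word": "ready", "to": node_b}]}
assert next_node(node_a, spoken_word="ready") is node_b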
In some embodiments, control in a dialog flow may be handed off between the voice application and the VoiceArtisan application depending upon the particular operations defined by a dialog flow and/or voice dialog. For example, a vocabulary link of a voice dialog may indicate that a transition occurs when the user says a particular word or phrase. The voice application takes control to capture the speech input of the user and provide it to the VoiceArtisan application for conversion to machine readable input. The VoiceArtisan application converts the speech input to machine readable input, then provides that machine readable input back to the voice application to determine whether the specified word or phrase has been spoken by the user. Thus, the voice application indicates whether to transition to the next node. Also for example, a conditional link may indicate that a transition occurs when a particular barcode is scanned and/or a particular button is pressed. Whether the condition is true, however, is determined by the voice application.
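This handoff for a vocabulary link might be sketched as follows (the function names and the string-based recognition stand-in are hypothetical):

def voice_artisan_convert(audio):
    # VoiceArtisan side: convert captured speech input to machine readable input.
    return audio.strip().lower()  # stand-in for actual speech recognition

def voice_app_should_transition(link_word, audio):
    # Voice application side: capture the user's speech, hand it to the
    # VoiceArtisan application for conversion, then decide whether the word
    # or phrase specified by the vocabulary link was spoken.
    machine_readable = voice_artisan_convert(audio)
    return machine_readable == link_word

if __name__ == "__main__":
    print(voice_app_should_transition("ready", " Ready "))  # True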
While the present invention has been illustrated by a description of the various embodiments and the examples, and while these embodiments have been described in considerable detail, it is not the intention of the applicants to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art. For example, voice dialogs may include more or fewer nodes and transitional links than those illustrated. In particular, a node in a voice dialog may be connected to multiple nodes through multiple transition links (e.g., multiple vocabulary or conditional links). The particular node that is transitioned to may thus be dependent on the particular link (e.g., word, phrase, or condition) used to transition to that node. Moreover, a voice dialog does not necessarily have to include speech output, and may instead include an action (e.g., such as waiting for speech input) or business logic.
Still further, one having ordinary skill in the art will appreciate that the voice application and VoiceArtisan application operate in a cooperative manner. As such, the voice application and VoiceArtisan application may be executed on the same computing system, and in specific embodiments the voice application may be run as a virtual component of the VoiceArtisan application. Thus, the particular nomenclature for the voice application and the VoiceArtisan application is merely for differentiation purposes and is not intended to be limiting. As such, the invention in its broader aspects is therefore not limited to the specific details, apparatuses, and methods shown and described. A person having ordinary skill in the art will appreciate that any of the blocks of the above flowcharts may be deleted, augmented, made to be simultaneous with another, combined, or be otherwise altered in accordance with the principles of the embodiments of the invention. Accordingly, departures may be made from such details without departing from the scope of applicants' general inventive concept.