The present invention relates to the field of speech-enabled interactive voice response (IVR) systems and similar systems involving a dialog between a human and a computer. More particularly, the present invention relates to a system and method of dynamically generating voice application information from a server, and particularly to the dynamic generation of mark-up language documents for delivery to a browser capable of rendering such mark-up language documents on a client computer.
The explosive growth of the Internet, and particularly the World Wide Web, over the last several years cannot be overstated. The corresponding impact on the global economy has been similarly dramatic. Virtually any type of information is available to a user who is even remotely familiar with navigating this network of computers. Yet, there are still instances where information that may be important or even critical to an individual, though otherwise available on the Web, is beyond his or her reach. For example, an individual who is traveling might desire to obtain information regarding flight departures for a particular airline carrier from his current location using a landline phone, mobile phone, wireless personal digital assistant, or similar device. While that information might be readily available from the Web server of the airline carrier, in the past, the traveler did not have access to the Web server from a phone. Recently, however, advances have been made to marry telephones and telephony-based voice applications with the World Wide Web. One such advance is the Voice Extensible Markup Language (VoiceXML).
VoiceXML is a Web-based markup language for representing human/computer dialog. It is similar to Hypertext Markup Language (HTML), but assumes a voice browser having both audio input and output. As seen in
As interest in deployment of speech applications written in VoiceXML expands, the need for a sophisticated and elegant integration of the voice user interface front end and business-rule driven back end becomes ever more important. VoiceXML itself is a satisfactory vehicle for expressing the voice user interface, but it does little to assist in implementing the business rules of the application.
Within the Internet community, the problem of integrating the user interface (HTML browser) and the business-rule-driven back end has been addressed through the use of dynamically generated HTML, where server code is written that defines both the application and its back-end data manipulation. When the user fetches an application via a browser, the application dynamically generates HTML (or XML) that the web server conveys as an HTTP response. The user's input (mouse clicks and keyboard entries) is collected by the browser and returned in an HTTP request (GET or POST) to the server, where it is processed by the application.
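By way of illustration only, a server application following this dynamic generation model might be sketched as follows using the Java Servlet API; the servlet, its business logic, and the generated HTML are hypothetical and are not drawn from any particular system:

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Hypothetical dynamic-generation servlet: every request is answered with
    // dynamically generated HTML, and the user's input is returned to the
    // server in a subsequent HTTP POST request for processing.
    public class FlightInfoServlet extends HttpServlet {

        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            resp.setContentType("text/html");
            PrintWriter out = resp.getWriter();
            out.println("<html><body><form method='post'>"
                    + "Flight number: <input name='flight'/>"
                    + "<input type='submit'/></form></body></html>");
        }

        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            String flight = req.getParameter("flight");   // the user's keyboard entry
            String departure = lookUpDeparture(flight);   // hypothetical back-end data access
            resp.setContentType("text/html");
            resp.getWriter().println("<html><body>Flight " + flight
                    + " departs at " + departure + ".</body></html>");
        }

        private String lookUpDeparture(String flight) {
            return "10:45 AM";   // placeholder for the application's business rules
        }
    }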
This dynamic generation model has been extended by the VoiceXML community for use in speech applications. Server-resident application code interacts with data visible to the server and produces a stream of VoiceXML. But this approach requires the development of custom code for each new application, or (at best) reusable components of the custom code that can be structured as templates that facilitate their reuse.
Accordingly, there is a need for a speech application development and deployment architecture that leverages the best of the dynamic generation architecture described above, yet exploits the extreme simplification of application development provided by an integrated service creation environment, such as the family of application development tools that comprise the Natural Language Speech Assistant (NLSA) developed by Unisys Corporation. The present invention satisfies this need.
The present invention enables an application developer to design a speech-enabled application using existing speech application development tools in an integrated service creation environment, and then to deploy that speech application in a client-server environment in which the speech application dialogue with the user is carried out through the dynamic generation of documents in a particular mark-up language and the rendering of those documents by a suitable client browser.

One embodiment of the invention comprises a server that communicates with a client in a client-server environment to carry out a dialogue with a user, wherein the client comprises a browser that fetches from the server a document containing instructions in a mark-up language and renders the document in accordance with the mark-up language instructions to provide interaction with the user. The server comprises a dialogue flow interpreter (DFI) that reads a data file containing information representing different states of the dialogue with the user and that uses that information to generate, for a given state of the dialogue, objects representing prompts to be played to the user, grammars of expected responses from the user, and other state information. The data file is generated by a speech application developer using an integrated service creation environment, such as the Unisys NLSA. The server further comprises a mark-up language generator that generates, within a document, instructions in the mark-up language of the client browser that represent an equivalent of the objects generated by the DFI. In essence, the mark-up language generator serves as a wrapper around the DFI to transform the information normally generated by the DFI for use with monolithic speech applications into dynamically generated mark-up language documents for use in a browser-based client-server environment. A server application instantiates the DFI and mark-up language generator to provide the overall shell of the speech application and to supply the necessary business logic behind the application. The server application is responsible for delivering generated mark-up language documents to the client browser and for receiving requests and associated information from the browser. An application server (i.e., application hosting software) may be used to direct communications between one or more browsers and one or more different speech applications deployed in this manner.

The speech application development and deployment architecture of the present invention can be used to enable dynamic generation of speech application information in any of a variety of mark-up languages, including VoiceXML, Speech Application Language Tags (SALT), Hypertext Markup Language (HTML), and others. The server can be implemented in a variety of application service provider models, including the Java Server Pages (JSP)/Servlet model developed by Sun Microsystems, Inc. (as defined in the Java Servlet API specification), and the Active Server Pages (ASP)/Internet Information Server (IIS) model developed by Microsoft Corporation.
Other features of the present invention will become evident hereinafter.
The foregoing summary, as well as the following detailed description, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the invention, there is shown in the drawings exemplary constructions of the invention; however, the invention is not limited to the specific methods and instrumentalities disclosed. In the drawings:
As shown, the architecture consists of both an offline environment and a runtime environment. The principal offline component is an integrated service creation environment. In this example, the integrated service creation environment comprises the Natural Language Speech Assistant or “NLSA” (developed by Unisys Corporation, Blue Bell, Pa.). Integrated service creation environments, like the Unisys NLSA, enable a developer to generate a series of data files 215 that define the dialogue flow (sometimes referred to as the “call flow”) of a speech application as well as the prompts to be played, expected user responses, and actions to be taken at each state of the dialogue flow. These data files 215 can be thought of as defining a directed graph where each node represents a state of the dialogue flow and each edge represents a response-contingent transition from one dialog state to another. The data files 215 output from the service creation environment can consist of sound files, grammar files (to constrain the expected user responses received from a speech recognizer), and files that define the dialogue flow (e.g., DFI files) in a form used by a dialogue flow interpreter (DFI) 220, as described more fully below. In the case of the NLSA, the files that define the dialogue flow contain an XML representation of the dialogue flow.
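By way of illustration only, such a directed graph of dialog states might be represented in memory as follows; the class and field names are hypothetical and do not correspond to the actual NLSA file formats:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative in-memory form of a dialogue flow: each node is a dialog
    // state (with its prompt and grammar), and each edge maps a recognized
    // response token to the name of the next state.
    public class DialogState {
        final String name;          // e.g., a hypothetical "Welcome" state
        final String promptFile;    // sound file (or text) of the prompt to be played
        final String grammarFile;   // grammar constraining the expected user responses
        final Map<String, String> transitions = new HashMap<>();  // response token -> next state

        public DialogState(String name, String promptFile, String grammarFile) {
            this.name = name;
            this.promptFile = promptFile;
            this.grammarFile = grammarFile;
        }

        public void addTransition(String responseToken, String nextStateName) {
            transitions.put(responseToken, nextStateName);   // an edge of the directed graph
        }

        public String nextState(String responseToken) {
            return transitions.get(responseToken);   // response-contingent transition
        }
    }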
Referring again to
The Unisys NLSA uses an easy-to-understand spreadsheet metaphor to express the relationships between words and phrases that define precisely what the end user is expected to say at a given state in a dialogue. The tool provides facilities for managing variables and sound files as well as a mechanism for simulating the application prior to the generation of actual code. The tool also produces recording scripts (for managing the recording of the ‘voice’ of the application) and dialog design documents summarizing the application's architecture. Further details regarding the NLSA, and the creation of the data files 215 by that tool, are provided in U.S. Pat. Nos. 5,995,918 and 6,321,198, and in co-pending, commonly assigned application Ser. No. 09/702,244.
The runtime environment of the speech application development and deployment architecture of
In the Unisys NLSA environment, the runtime environment may also include a natural language interpreter (NLI) 225, in the event that its functionality is not provided as part of the ASR 235. The NLI accesses a given grammar file of the data files 215, which expresses valid utterances and associates them with tokens and provides other information relevant to the application. The NLI extracts and processes a user utterance based on the grammar to provide information useful to the application, such as a token representing the meaning of the utterance. This token may then, for example, be used to determine what action the speech application will take in response. The operation of an exemplary NLI is described in U.S. Pat. No. 6,094,635 (in which it is referred to as the “runtime interpreter”) and in U.S. Pat. No. 6,321,198 (in which it is referred to as the “Runtime NLI”).
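By way of illustration only, the role of the NLI can be sketched with the following hypothetical interface and a trivial implementation; the names and the keyword matching shown are illustrative only and do not reflect the actual NLI implementation:

    // Hypothetical interface for a natural language interpreter: given the
    // utterance text returned by the recognizer and the grammar for the
    // current state, it produces a token representing the utterance's meaning.
    public interface NaturalLanguageInterpreter {
        String interpret(String utterance, String grammarFile);
    }

    // Trivial illustrative implementation: maps a few keywords to tokens that
    // the speech application can act upon.
    final class KeywordInterpreter implements NaturalLanguageInterpreter {
        public String interpret(String utterance, String grammarFile) {
            String u = utterance.toLowerCase();
            if (u.contains("yes")) return "YES";
            if (u.contains("no"))  return "NO";
            return "NO_MATCH";   // out-of-grammar utterance
        }
    }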
The dialogue flow interpreter (DFI) 220 is instantiated by the speech application 230. The DFI accesses the representation of the application contained in the data files 215 produced by the service creation environment. The DFI furnishes the critical components of a speech application dialog state, in the form of objects, to the invoking program by consulting the representation of the speech application in the data files 215. In order to understand this process, it is essential to understand the components that make up a dialog state.
Essentially, each state of a dialogue represents one conversational interchange between the application and a user. Components of a state are defined in the following table:
In the Unisys NLSA, a tool within the service creation environment is used to refine each response down to the actual words and phrases that the end user is expected to say. Variables can be introduced in place of constant string literals in prompts and responses, and variables as well as actions can be explicitly associated with data storage activities. The complete specification of a speech application, then, requires the specification of all the application's dialog states, and the specification of each of these internal components for each state.
When invoked by the speech application 230 at runtime, the DFI provides the current dialog state as well as each of the components or objects required to operate that state, such as the prompt to be played in that state, the grammar of responses that the end user is expected to say, and the possible responses together with the variables and actions defined for each response.
The information provided by the DFI is drawn from the representation of the application produced by the service creation environment in the data files 215.
In this manner, the DFI and associated data files 215 contain the code and information necessary to implement most of the speech application dialogue. In its simplest form, therefore, the speech application 230 need only implement a loop, such as that illustrated in
In the Unisys embodiment of a DFI, the speech application 230, which can be coded by the developer in any of a variety of programming languages, such as C, Visual Basic, Java, or any other programming language, instantiates the DFI 220 and invokes it to interpret the design specified in the data files 215. The DFI 220 controls the dialogue flow through the application, supplying all the underlying code that previously the developer would have had to write. The DFI 220 in effect provides a library of “standardized” objects that implement the low-level details of a dialogue. The DFI 220 is implemented as an application programming interface (API) to further simplify the implementation of the speech application 230. The DFI 220 drives the entire dialogue of the speech application 230 from start to finish automatically, thus relieving the developer of the crucial and often complex task of dialogue management. Traditionally, such a process is application-dependent and therefore requires re-implementation for each new application.
As mentioned above, a dialogue of a speech application includes a series of transitions between states. Each state has its own set of properties that include the prompt to be played, the speech recognizer's grammar to be loaded (to listen for what the user of the voice system might say), the reply to a caller's response, and actions to take based on each response. The DFI 220 keeps track of the state of the dialogue at any given time throughout the life of the application, and exposes functions to access state properties.
Referring to
Get_Prompt( ) 320: returns a prompt object containing information defining the appropriate prompt to play; this information may then be passed, for example, to the TTS engine 450, which may convert it to audio data to be played to a user;
Get_Grammar( ) 330: returns a grammar object containing information concerning the appropriate grammar for the current state; this grammar is then loaded into the speech recognition engine (ASR) 445 to constrain the recognition of a valid utterance from a user;
Get_Response( ) 340: returns a response object comprising the actual user response, any variables that this response may contain, and all possible actions defined for this response; and
Advance_State( ) 350: transitions the dialogue to the next state.
Other DFI functions 370 are used to retrieve state-independent properties (i.e., global properties). These include but are not limited to information concerning the directory paths for the various data files 215 associated with the speech application, the application's input mode (e.g., DTMF or Voice), the current state of the dialogue, and the previous state of the dialogue. All of these functions can be called from the speech application 230 code to provide information about the dialogue during the execution of the speech application.
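By way of illustration only, the simple loop referred to above might take roughly the following form in a monolithic speech application; the interface and method names are hypothetical stand-ins for the actual DFI API:

    // Hypothetical sketch of the main loop of a speech application built
    // around the DFI: play the prompt, load the grammar, collect the caller's
    // response, act on it, and advance to the next dialog state.
    public class SpeechApplicationLoop {

        // Illustrative stand-ins for the DFI API and the platform services.
        interface Prompt {}
        interface Grammar {}
        interface Response { boolean isFinal(); }
        interface DFI {
            Prompt getPrompt();                        // cf. Get_Prompt( )
            Grammar getGrammar();                      // cf. Get_Grammar( )
            Response getResponse(String utterance);    // cf. Get_Response( )
            void advanceState(Response response);      // cf. Advance_State( )
        }
        interface Asr { void load(Grammar grammar); String listen(); }
        interface Tts { void play(Prompt prompt); }

        public void run(DFI dfi, Asr asr, Tts tts) {
            Response response;
            do {
                tts.play(dfi.getPrompt());             // play the prompt for the current state
                asr.load(dfi.getGrammar());            // constrain recognition to valid utterances
                String utterance = asr.listen();       // collect the caller's response
                response = dfi.getResponse(utterance); // token, variables, and actions
                // application-specific business logic for the response's actions runs here
                dfi.advanceState(response);            // move to the next dialog state
            } while (!response.isFinal());
        }
    }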
Further details as to the function and operation of the DFI 220 may be found in co-pending, commonly assigned U.S. patent application Ser. No. 09/702,244, entitled “Dialogue Flow Interpreter Development Tool,” filed Oct. 30, 2000.
As described above and illustrated in
The new architecture for speech application development and deployment of the present invention is illustrated in
As shown, the client 435 comprises a browser 440 that fetches from the server a document containing instructions in a mark-up language and renders the document in accordance with the mark-up language instructions to provide interaction with the user. The present invention can be used to enable dynamic generation of speech application information in any of a variety of mark-up languages, including VoiceXML, Speech Application Language Tags (SALT), Hypertext Markup Language (HTML), and others such as Wireless Markup Language (WML) for Wireless Application Protocol (WAP)-based cell phone applications, and the W3 platform for handheld devices. Hence, the browser may comprise a VoiceXML-compliant browser, a SALT-compliant browser, an HTML-compliant browser, a WML-compliant browser, or any other markup language-compliant browser. Examples of VoiceXML-compliant browsers include “SpeechWeb” commercially available from PipeBeach AB, “Voice Genie” commercially available from Voice Genie Technology Inc., and “Voyager” commercially available from Nuance Communications. VoiceXML browser products generally include an automatic speech recognizer 445, a text-to-speech synthesizer 450, and a telephony interface 460. The ASR 445, TTS 450, and telephony interface may also be supplied by different vendors.
As illustrated in
The browser 440 communicates with a server 410 of the present invention through standard Web-based HTTP commands (e.g., GET and POST) transmitted over, for example, the Internet 430. However, the present invention can be deployed over any private or public network, including local area networks, wide-area networks, and wireless networks, whether part of the Internet or not.
Preferably, an application server 425 (i.e., application hosting software) intercepts requests from the client browser 440 and forwards those requests to the appropriate speech application (e.g., server application 415) hosted on the server computer 410. In this manner, multiple speech applications may be available for use by a user.
In addition to the dialogue flow interpreter (DFI) 220 (and optionally the NLI 225) and the data files 215 discussed above, the server 410 further comprises a mark-up language generator 420 that generates, within a document, instructions in the mark-up language supported by the client browser 440 that represent equivalents of the objects generated by the DFI. That is, the mark-up language generator 420 serves as a wrapper around the DFI 220 (and optionally the NLI 225) to transform the information normally generated by the DFI for use with monolithic speech applications, such as the prompt, response, action and other objects discussed above, into dynamically generated mark-up language instructions within a document that can then be served to the client browser 440.
By way of example only, a prompt object returned by the DFI 220 based on the XML representation of the exemplary DFI file illustrated in
The prompt object is essentially a representation in memory of this information. In this example, the mark-up language generator 420 may generate the following VoiceXML instructions for rendering by a VoiceXML-enabled client browser:
These instructions would be generated into a document to be transmitted back to the client browser. The following is an example of a larger document containing a VoiceXML representation of several objects associated with a state of the exemplary dialogue of
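By way of further illustration, a mark-up language generator of the kind described might be sketched as follows; the class names, the prompt text, and the generated VoiceXML are hypothetical and are not taken from the exemplary dialogue:

    // Hypothetical sketch of a mark-up language generator wrapping a DFI
    // prompt object and emitting a fragment of a VoiceXML document.
    public class VoiceXmlGenerator {

        // Minimal stand-in for the prompt object returned by the DFI.
        public static class PromptObject {
            final String text;      // text to be synthesized (or the name of a recorded sound file)
            final String nextUrl;   // where the browser should submit the caller's response
            public PromptObject(String text, String nextUrl) {
                this.text = text;
                this.nextUrl = nextUrl;
            }
        }

        // Translate the prompt object into VoiceXML instructions.
        public String generatePrompt(PromptObject p) {
            StringBuilder vxml = new StringBuilder();
            vxml.append("<form id=\"state\">\n");
            vxml.append("  <field name=\"userResponse\">\n");
            vxml.append("    <prompt>").append(p.text).append("</prompt>\n");
            vxml.append("    <filled>\n");
            vxml.append("      <submit next=\"").append(p.nextUrl)
                .append("\" namelist=\"userResponse\"/>\n");
            vxml.append("    </filled>\n");
            vxml.append("  </field>\n");
            vxml.append("</form>\n");
            return vxml.toString();
        }

        public static void main(String[] args) {
            PromptObject welcome = new PromptObject(
                    "Welcome. Which flight would you like departure information for?",
                    "http://example.com/speechapp");   // hypothetical server application URL
            System.out.println(new VoiceXmlGenerator().generatePrompt(welcome));
        }
    }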
A server application 415, similar to the speech application 230 illustrated in
In one embodiment, the server application 415 may be embodied as an executable script on the server 410 that, in combination with appropriate .asp or .jsp files and the instances of the DFI 220 and mark-up language generator 420, produces the mark-up language document to be returned to the browser 440.
Preferably, the service creation environment will, in addition to producing the data files 215 that define the dialogue of the speech application, also produce the basic shell code of the server application 415 to further relieve the application developer from having to code to a specific client-server specification (e.g., JSP/Servlet or ASP/IIS). All the developer will need to do is to provide the necessary code to implement the business logic of the application. Although other web developers use ASP/IIS and JSP/Servlet techniques on servers to dynamically generate markup language code, the architecture of the present invention is believed to be the first to use an interpretive engine (i.e., the DFI 220) on the server to retrieve essential information representing the application that was itself built by an offline tool.
The DFI 220 is ideally suited to provide the information source from which a mark-up language document can be dynamically produced. Using either an ASP/IIS or JSP/Servlet model, the server application 415 invokes the same DFI methods described above, but the returned objects are then translated by the markup language generator 420 into appropriate mark-up language tags and packaged in a mark-up language document, permitting the server application 415 to stream the dynamically generated mark-up language documents to a remote client browser. Whenever the Action at a given dialogue state includes some database read or write activity, that activity is performed under control of the DFI 220, and the result of the transaction is reflected in the generated mark-up language instructions.
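By way of illustration only, under the JSP/Servlet model such a server application might be sketched roughly as follows; the DialogFlow and MarkupGenerator interfaces are hypothetical stand-ins for the DFI 220 and the mark-up language generator 420:

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Hypothetical server application under the JSP/Servlet model: each
    // browser request advances the dialogue by one state and is answered
    // with a dynamically generated mark-up language document.
    public class SpeechAppServlet extends HttpServlet {

        interface DialogFlow {
            void advance(String recognizedToken);    // apply the caller's response and its actions
            Object currentStateObjects();            // prompt, grammar, and related objects
        }
        interface MarkupGenerator {
            String toDocument(Object stateObjects);  // translate the objects into mark-up instructions
        }

        private DialogFlow dfi;            // would be instantiated against the data files 215 in init()
        private MarkupGenerator generator; // e.g., a VoiceXML generator for a VoiceXML browser

        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            // The browser's request carries the recognized response from the previous state.
            String token = req.getParameter("userResponse");
            if (token != null) {
                dfi.advance(token);        // any database activity runs under control of the DFI
            }
            // Translate the objects for the current state and stream the document to the browser.
            String document = generator.toDocument(dfi.currentStateObjects());
            resp.setContentType("application/voicexml+xml");
            resp.getWriter().write(document);
        }
    }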
Thus, the DFI 220 effectively becomes an extension of the server application 415. In the present embodiment, the speech application dialogue and its associated speech recognition grammars, audio files, and application-specific data that make up the data files 215 reside on server-visible data stores. The files representing the dialogue flow are represented in XML (e.g.,
In operation, the control of a dialogue with a user in accordance with the architecture of the present invention generally occurs as follows:
1. The client browser 440 issues an HTTP request (e.g., a GET) to the server 410, and the application server 425 forwards the request to the appropriate server application 415.
2. The server application 415 invokes the DFI 220 to obtain the prompt, grammar, and other objects for the current dialogue state, and the mark-up language generator 420 translates those objects into mark-up language instructions within a document.
3. The document is returned to the browser 440 in the HTTP response; the browser renders it, playing the prompt to the user (e.g., via the TTS 450 or a recorded sound file) and loading the grammar into the ASR 445 to listen for the user's response.
4. The user's utterance is recognized by the ASR 445, and the result (e.g., a token representing the meaning of the utterance) is returned to the server application 415 in a further HTTP request (e.g., a POST).
5. The server application 415 passes the result to the DFI 220, which performs any actions associated with the response, advances the dialogue to the next state, and the process repeats from step 2.
In embodiments in which the ASR is not equipped to extract the meaning from an utterance, the utterance itself may be passed back to the server application 415 in step 4, and the server application may then invoke an NLI (e.g., NLI 225) to extract the meaning.
In the above manner, state after state is executed until the application has performed the desired task.
Thus, it will be appreciated that the above architecture allows the use of the DFI 220 on the server 410 to retrieve essential information from the data files 215 representing the speech application dialogue (as created by the offline service creation environment). While most solutions involve committing to a particular technology, thus requiring a complete rewrite of an application if the “hosting technology” is changed, the design abstraction approach of the present invention minimizes the commitment to any particular platform. Under the system of the present invention a user does not need to learn a particular mark-up language, nor the intricacies of a particular client-server model (e.g., ASP/IIS or JSP/Servlet).
Benefits of the above architecture include ease of movement between competing Internet technology “standards” such as JSP/Servlet and ASP/IIS. A further benefit is that it protects the user and application designer from changes in an evolving markup language standard (e.g., VoiceXML). Finally, the novel architecture disclosed herein provides for multiple delivery platforms (e.g., VoiceXML for spoken language, WML for WAP-based cell phone applications, and the W3 platform for handheld devices).
The architecture of the present invention may be implemented in hardware or software, or a combination of both. When implemented in software, the program code executes on programmable computers (e.g., server 410 and client 435) that each include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Program code is applied to data entered using the input device to perform the functions described above and to generate output information. The output information is applied to one or more output devices. Such program code is preferably implemented in a high level procedural or object oriented programming language. However, the program code can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. The program code may be stored on a computer-readable medium, such as a magnetic, electrical, or optical storage medium, including without limitation a floppy diskette, CD-ROM, CD-RW, DVD-ROM, DVD-RAM, magnetic tape, flash memory, hard disk drive, or any other machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. The program code may also be transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, over a network, including the Internet or an intranet, or via any other form of transmission, wherein, when the program code is received and loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code combines with the processor to provide a unique apparatus that operates analogously to specific logic circuits.
In the foregoing description, it can be seen that the present invention comprises a new and useful architecture for the development and deployment of speech applications that enables an application developer to design a speech-enabled application using existing speech application development tools in an integrated service creation environment, and then to deploy that speech application in a client-server environment in which the speech application dialogue with the user is carried out through the dynamic generation of documents in a particular mark-up language and the rendering of those documents by a suitable client browser. It should be appreciated that changes could be made to the embodiments described above without departing from the inventive concepts thereof. It should be understood, therefore, that this invention is not limited to the particular embodiments disclosed, but it is intended to cover all modifications within the spirit and scope of the present invention as defined by the appended claims.
Priority application: U.S. Provisional Application No. 60/288,708, filed May 4, 2001 (US, national).
This application claims the benefit of the filing date of U.S. Provisional Application Ser. No. 60/288,708, entitled “Dynamic Generation of Voice Application Information from a Web Server,” filed May 4, 2001, which application is incorporated herein by reference in its entirety. The subject matter disclosed herein is related to the subject matter disclosed in U.S. Pat. No. 5,995,918, entitled “System And Method For Creating A Language Grammar Using A Spreadsheet Or Table Interface,” (issued Nov. 30, 1999), U.S. Pat. No. 6,094,635, entitled “System and Method for Speech Enabled Application,” (issued Jul. 25, 2000), U.S. Pat. No. 6,321,198, entitled “Apparatus for Design and Simulation of Dialogue,” (issued Nov. 20, 2001), and pending U.S. patent application Ser. No. 09/702,244, entitled “Dialogue Flow Interpreter Development Tool,” filed Oct. 30, 2000, all of which are assigned to the assignee of the instant application, and the contents of which are hereby incorporated by reference in their entireties.
PCT filing: PCT/US02/13982, filed May 3, 2002 (WO).