Multimodal interaction is defined as the ability to interact with an application using multiple modes; for example, a user can use speech, keypad, or handwriting for input and can receive output in the form of audio prompts or a visual display. In addition to using multiple modes for input and output, user interaction is synchronized: for instance, if a user has both GUI and speech modes active on a device and fills in an input field via speech, the recognition results may be reflected both in an audio prompt and on the GUI display.
In today's multimodal frameworks, synchronization between the various channels is either hardwired into an application's markup pages using scripts, as is the case in Microsoft's SALT (Speech Application Language Tags) specification, or it is embedded inside a multimodal client. This implies that any change to a multimodal programming model requires re-authoring already deployed applications and/or releasing new versions of multimodal clients. This greatly increases the cost of software maintenance and discourages customers and service providers from adopting new and improved multimodal programming models.
Multimodal interaction always entails some form of synchronization. There are various ways in which multiple channels become synchronized during a multimodal interaction. In a tightly coupled type of synchronization, user interaction is reflected equally in all modalities. For example, if an application uses both audio and GUI to ask a user for a date, when the user says “June 5th”, the result of the recognition is played back to him in speech and displayed in his GUI display as “06/05/2004”. Contrast this with a loosely coupled type of synchronization, which is dominant in rich conversational multimodal applications where modalities are typically used to complement each other rather than to supplement each other. In the latter form of synchronization, a user might say his itinerary in one sentence, “I want to go to Montreal tomorrow and return this Friday”, and have the list of available flights that satisfy his constraints returned in his GUI display as a selection list from which he can choose the flight that best suits him. In both cases, software developers must use programming models that enable them to author either form of interaction.
Multimodal interaction is still in its infancy; various multimodal programming models are emerging in the industry, such as SALT and X+V (XHTML plus Voice). As multimodal interaction matures in the marketplace, various incarnations of these programming models, or variants of them, might be adopted, each of which defines a particular synchronization strategy. In order to maintain the middleware being developed for such applications, it is necessary to create an architecture and a multimodal data flow process that can factor out the particularity of each programming model from the rest of the software components that support it. In the case of multimodal programming models, the particularity lies in the synchronization and authoring strategy adopted by each model. Factoring guarantees interoperability, efficient code maintenance, and an easier migration path for developers and service providers.
The invention provides an architecture for factoring synchronization strategies and authoring schemes out of the rest of the software components needed to handle a multimodal interaction. By implementing this aspect of the invention, both the client side (a modality-specific user agent) and the server-side infrastructure are made agnostic to any particular multimodal authoring technology and/or standard. This means client devices (deployed in vast numbers) can remain intact even though the underlying programming model is changing. On the server side, it means the existing infrastructure can either migrate seamlessly to a new multimodal standard and/or support multiple multimodal programming models simultaneously; this is a significant benefit for application service providers that need to support a wide range of technologies and standards to satisfy diverse customers' requirements.
Supporting the claim above is a mechanism by which the factored-out synchronization strategy components, henceforth referred to as synclets, communicate with the rest of the runtime components. According to a first aspect of the invention, there is provided a factored multimodal interaction architecture for a distributed computing system that includes a plurality of client browsers and at least one multimodal application server that can interact with the clients by means of a plurality of interaction modalities. The factored architecture includes an interaction manager with a multimodal interface, wherein the interaction manager can receive a client request for a multimodal application in one interaction modality and transmit the client request in another modality; a browser adapter for each client browser, each browser adapter including the multimodal interface; and one or more pluggable synchronization modules. Each synchronization module implements one of the plurality of interaction modalities between one of the plurality of clients and the server, so that a synchronization module for an interaction modality mediates communication between the multimodal interface of the client browser adapter and the multimodal interface of the interaction manager.
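By way of illustration only, the core types of such a factored architecture might be sketched in Java as follows; the type and method names are assumptions and are not taken from the specification:

interface MMODMessage {
    String getType();   // e.g. an event or signal name
}

/* The common multimodal interface shared by browser adapters and the
   interaction manager. */
interface MultimodalInterface {
    void deliver(MMODMessage message);   // accept an MMOD message from a peer
}

/* A per-browser adapter; one exists for each client browser. */
interface BrowserAdapter extends MultimodalInterface {
    String getModality();                // e.g. "gui" or "voice"
}

/* A pluggable synchronization module mediating one interaction modality
   between a browser adapter and the interaction manager. */
interface SynchronizationModule {
    void bind(BrowserAdapter adapter, MultimodalInterface interactionManager);
}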
In another aspect of the invention, the architecture includes a servlet filter that can intercept a client request for a multimodal application, and can pass that client request and a library of synchronization modules to the interaction manager, so that the interaction manager can select a synchronization module appropriate for the client request from the library of synchronization modules.
In another aspect of the invention, each multimodal interface of a client browser adapter and the multimodal interface of the interaction manager can communicate via a plurality of multimodal messages, and a synchronization module for an interaction modality is instantiated by the interaction manager upon receiving a client request for that interaction modality, so that the synchronization module can implement an exchange of multimodal messages between the multimodal interface of the client browser adapter and the multimodal interface of the interaction manager.
In another aspect of the invention, the architecture includes a synchronization proxy for each client for encoding the multimodal messages in an internet communication protocol.
In another aspect of the invention, the multimodal messages include multimodal events and multimodal signals.
In another aspect of the invention, the interaction manager is a state machine having an associated state, a loaded state, a ready state, and a not-associated state; the client browser adapter is a state machine having an associated state, a loading state, a loaded state, and a ready state; and a synchronization module is a state machine having an instantiated state, a loaded state, a ready state, and a stale state.
In another aspect of the invention, the client browser adapter enters the associated state when a connection to either the interaction manager or another client has been established; the client browser adapter enters the loading state when it is loading a document; the client browser adapter enters the loaded state when it has completed loading the document; and the client browser adapter enters the ready state when it is ready for multimodal interaction.
In another aspect of the invention, the synchronization module enters the instantiated state when it has been instantiated but has no document to process; the synchronization module enters the loaded state when it has been given a document to process but is waiting for a loaded signal from a client; the synchronization module enters the ready state when it is ready to receive events and send synchronization commands; and the synchronization module enters the stale state when the document being handled is no longer in view for the client.
In another aspect of the invention, the interaction manager enters the associated state when any non-stale synchronization module is in the instantiated state; the interaction manager enters the loaded state if any non-stale synchronization module is in the loaded state; the interaction manager enters the ready state if all non-stale synchronization modules are in the ready state; and the interaction manager enters the not-associated state when there is no client session associated with it.
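A minimal Java sketch of the state aggregation rules just described might read as follows; the enum and method names are assumptions for illustration:

import java.util.List;
import java.util.stream.Collectors;

class InteractionManagerStateSketch {
    enum SyncletState { INSTANTIATED, LOADED, READY, STALE }
    enum IMState { ASSOCIATED, LOADED, READY, NOT_ASSOCIATED }

    /* Derive the interaction manager's state from the states of its
       synchronization modules, per the aggregation rules above. */
    static IMState derive(List<SyncletState> synclets) {
        List<SyncletState> live = synclets.stream()
                .filter(s -> s != SyncletState.STALE)
                .collect(Collectors.toList());
        if (live.isEmpty()) {
            return IMState.NOT_ASSOCIATED; // no synclets: no associated client session
        }
        if (live.stream().allMatch(s -> s == SyncletState.READY)) {
            return IMState.READY;          // all non-stale synclets are ready
        }
        if (live.stream().anyMatch(s -> s == SyncletState.LOADED)) {
            return IMState.LOADED;         // some non-stale synclet is loaded
        }
        return IMState.ASSOCIATED;         // some synclet is merely instantiated
    }
}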
In a further aspect of the invention, the architecture includes an event control interface, by which a client browser adapter or the interaction manager can register or remove an event listener, or dispatch an event to another client browser adapter or to the interaction manager; a command control interface, by which a client browser adapter or the interaction manager can modify the state of another client browser adapter by issuing a synchronization command; and an event listener interface that can provide an event handler to a client browser adapter or the interaction manager.
These aspects of the invention define a modality-independent, multimodal-programming-model-agnostic protocol (a set of interfaces), herein referred to as the Multimodal On Demand (MMOD) protocol.
a-b depict the sequence of MMOD messages exchanged for an X+V multimodal session.
Multimodal Runtime Components
Multimodal interaction requires the presence of one or more modalities, a synchronization module and a server capable of serving/storing the multimodal applications. Users interact via one or more modalities with applications, and their interaction is synchronized as per the particular programming model used and the authoring of the application. The schematic diagram depicted in
The multimodal interaction manager is the component that manages interaction across the various modalities. Interaction management entails various functions, the main three being listed below:
The architecture of a typical multimodal application is illustrated in
In a system of a preferred embodiment of the invention, the synchronization component of interaction management is factored out to allow the rest of the infrastructure to handle multiple programming models, each with its own associated synclets.
The factoring performed on the synclets allows various service providers to contract programmers to develop new synchronization strategies based on a new version of an existing multimodal programming model (as depicted in
Data Flow Process
The diagram depicted in
Interaction Manager Framework
The Interaction Manager (IM) is a framework that supports distributed multimodal interaction. As can be seen from the figure, the Interaction Manager is placed server-side and communicates with the active channels through a set of common interfaces called Multimodal On Demand (MMOD) interfaces. The interfaces of this embodiment will be explained in conjunction with an X+V application using a GUI and a voice modality. The factorization strategy of the exemplary aspect of the invention is not limited to this embodiment, and is applicable to any client interacting with an application through multiple modalities.
Multimodal On Demand Servlet Filter
Referring to
Interaction Manager
The Interaction Manager (IM) 124 is a composite object that typically (but not necessarily) resides server-side and is responsible for acquiring user interaction in one mode and publishing it in all other active modes. In a web environment, the IM can synchronize across multiple browsers, each supporting a particular markup language. In this context, each browser can constitute one interaction mode and thus the IM is responsible for:
To establish and exchange information between the IM 124 and the various client devices 100 and 110, the clients 100, 110 must implement a set of generic multimodal interfaces called Multimodal On Demand (MMOD) interfaces 103, 113. The MMOD interfaces 103, 113 also define a set of messages that can be bound to multiple protocols, e.g. HTTP, SOAP, XML, etc. A distributed client must implement at least one such encoding in order to send and receive MMOD messages over a physical connection. The SyncProxy modules 104, 114 of the client devices 100, 110 are synchronization proxies, each of which implements a particular encoding of the MMOD messages and is responsible for marshalling and unmarshalling events, signals, and commands over the physical connection.
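For illustration, a synchronization proxy binding MMOD messages to one simple wire encoding might be sketched in Java as follows; the framing shown here (one XML-like message per line over a socket) is an assumption, not one of the encodings mandated by the framework:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

/* Sketch of a SyncProxy that marshals and unmarshals MMOD messages over a
   physical connection using an assumed line-oriented encoding. */
class LineOrientedSyncProxy {
    private final PrintWriter out;
    private final BufferedReader in;

    LineOrientedSyncProxy(Socket connection) throws IOException {
        this.out = new PrintWriter(connection.getOutputStream(), true);
        this.in = new BufferedReader(
                new InputStreamReader(connection.getInputStream()));
    }

    /* Marshal an event onto the connection. */
    void sendEvent(String type, String payload) {
        out.printf("<event type=\"%s\">%s</event>%n", type, payload);
    }

    /* Unmarshal the next inbound message (one message per line). */
    String receive() throws IOException {
        return in.readLine();
    }
}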
For maximum adaptability, the IM framework of the preferred embodiment of the invention does not assume that all browser vendors will implement MMOD and its associated protocol bindings. As such, the IM framework includes a set of Browser Adapter classes 102, 112 that implement these MMOD interfaces 103, 113 and SyncProxy classes 104, 114 that implement a particular encoding for MMOD messages. The framework currently contains support for the IE browser 101 and IBM's VoiceXML browser 111.
IM State Machine
The IM 124 has four states:
The IM's state transitions are dependent on the actual synchronization strategy being used during a particular user session. The sequence diagram depicted in
Client State Machine
The IM framework of the preferred embodiment of the invention expects MMOD clients 100 to have the following states:
The IM framework of the preferred embodiment of the invention makes no assumption as to the programming model followed to author the multimodal applications and, as such, can be used for a variety of multimodal programming models such as XHTML+Voice, XHTML+XForms+Voice, SVG+Voice, etc. Each programming model typically dictates a specific synchronization strategy; thus, to support multiple programming models, one needs to support multiple synchronization strategies. The IM framework of the preferred embodiment of the invention defines a mechanism by which multiple synchronization strategies can be implemented without affecting the underlying middleware infrastructure or applications that have already been deployed. This design significantly reduces the time it takes to adopt new programming models and their corresponding synchronization strategies and ensures minimal outage time for applications already deployed on the framework.
Synclets
The synclets 125 are state machines that implement a specific synchronization strategy and coordinate communication over the various channels. The IM framework of the preferred embodiment of the invention specifies an interface to which a synclet author must adhere, allowing these components to plug seamlessly into the rest of the IM framework. During a multimodal interaction with the IM, the MMOD servlet filter chooses a synclet library based on the multimodal document's MIME type. This synclet library is passed to the IM, which uses it to instantiate the appropriate synclet for that document type and bind it to the user session. The MMOD servlet filter then hands the synclet the actual document. The synclet then determines how to handle synchronization between the various active channels; as such, it determines when and how to communicate events and synchronization commands from one channel to the other active channels.
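The selection flow just described might be sketched in Java as follows; SyncletLibrary, Synclet, and the method names are assumptions introduced for illustration:

import java.util.Map;

/* Sketch of the synclet-selection flow: the servlet filter chooses a synclet
   library by the document's MIME type, the IM instantiates a synclet from it
   and binds it to the user session, and the filter hands the synclet the
   document. */
class MmodServletFilterSketch {
    private final Map<String, SyncletLibrary> librariesByMimeType;
    private final InteractionManager im;

    MmodServletFilterSketch(Map<String, SyncletLibrary> libraries,
                            InteractionManager im) {
        this.librariesByMimeType = libraries;
        this.im = im;
    }

    void handle(String sessionId, String mimeType, String document) {
        SyncletLibrary library = librariesByMimeType.get(mimeType);
        Synclet synclet = im.instantiateAndBind(library, sessionId);
        synclet.loadDocument(document); // the synclet now coordinates the channels
    }
}

interface SyncletLibrary { Synclet create(); }
interface Synclet { void loadDocument(String document); }
interface InteractionManager {
    Synclet instantiateAndBind(SyncletLibrary library, String sessionId);
}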
Synclet State Machine
The IM framework of the preferred embodiment of the invention may include one or more synclets, each implementing one or more multimodal programming models. The state of all active synclets during a user session determines the IM's overall state, as described in the first section. The IM polls each synclet for its state during a user interaction, sets its own state, and then informs connected clients of that state. A synclet has four states: instantiated, loaded, ready, and stale.
Another aspect of the preferred embodiment of the invention is a set of abstract interfaces and messages that allow endpoints in a multimodal interaction to communicate with each other, and a protocol to serialize and deserialize MMOD messages. These endpoint interfaces are: (1) the Event Control interface; (2) the Command Control interface; and (3) the Event Listener interface. MMOD is designed as a web service. Its interfaces can be written in any language and its messages bound to a variety of protocols, such as SOAP, SIP, binary, or XML. These multimodal interfaces are key to establishing and maintaining communication with endpoints participating in a multimodal interaction. In addition, synclets and MMOD events each have an interface. In a distributed architecture as shown in
Event Control Interface
The following section of code specifies the interface that MMOD components, such as clients and the IM, use to register and remove event listeners as well as to dispatch events down a browser's tree.
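The original listing is not reproduced here; a minimal Java sketch of such an Event Control interface, with assumed method names and signatures, might read as follows (the MMODEvent and EventListener types are sketched in the sections below):

/* Sketch of the Event Control interface used to register and remove event
   listeners and to dispatch events down a browser's tree. */
interface EventControl {
    void addEventListener(String eventType, EventListener listener);
    void removeEventListener(String eventType, EventListener listener);
    void dispatchEvent(MMODEvent event);
}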
Command Control Interface
This interface allows components to modify the browser's state by issuing synchronization commands on that browser's interface.
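A corresponding Java sketch, with an assumed signature, might read:

import java.util.Map;

/* Sketch of the Command Control interface used to issue synchronization
   commands against a browser, e.g. to set focus or update a field. */
interface CommandControl {
    void issueCommand(String command, Map<String, String> arguments);
}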
Event Listener Interface
This interface is implemented by any component that registers listeners for browser events. The method handleEvent is called whenever that event listener is activated.
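Only the handleEvent method is named in the text; its signature below is an assumption:

/* Sketch of the Event Listener interface; handleEvent is called whenever
   the listener is activated for a registered event. */
interface EventListener {
    void handleEvent(MMODEvent event);
}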
Synclet Interface
A synclet has the following interface:
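The original listing is not reproduced here; a Java sketch consistent with the synclet state machine and data flow described above, with assumed method names, might read:

/* Sketch of a synclet interface; SyncletState is as in the state-machine
   sketch given earlier. */
interface SyncletInterface {
    void loadDocument(String document);   // instantiated -> loaded
    void clientLoaded(String clientId);   // loaded -> ready, on a client's loaded signal
    void onEvent(MMODEvent event);        // receive an event from one channel
    void invalidate();                    // -> stale, when the document leaves view
    SyncletState getState();              // polled by the IM to set its own state
}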
MMOD Events
In the X+V embodiment of the invention, the IM framework supports the following list of MMOD events. This list of events is not exhaustive, and other events can be defined for other interaction modalities.
Note that the Nomatch, Noinput, Vxmldone, RecoResult, and RecoResultEx events are defined for the voice interaction modality.
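The full event list of the original is not reproduced here; the voice-modality events named above might be represented as constants, for example:

/* The five voice-modality events named above; the descriptions follow common
   VoiceXML usage and are assumptions where the text is silent. */
enum VoiceEvent {
    NOMATCH,       // the recognizer matched no active grammar
    NOINPUT,       // no speech was detected before the timeout
    VXMLDONE,      // the VoiceXML dialog has completed
    RECORESULT,    // a recognition result is available
    RECORESULTEX   // an extended recognition result is available
}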
An MMOD event has the following interface:
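The original listing is not reproduced here; a Java sketch with assumed accessors might read:

/* Sketch of an MMOD event: a typed, asynchronous message exchanged between
   endpoints of a multimodal interaction. */
interface MMODEvent {
    String getType();      // e.g. "RecoResult"
    String getSource();    // identifier of the originating endpoint
    Object getPayload();   // modality-specific data, e.g. a recognition result
    long getTimestamp();   // usable with the time synchronization signals
}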
MMOD Signals
Alongside events that are asynchronous in nature, the MMOD protocol also defines a set of signals. Signals, like events, are asynchronous messages that get exchanged between various endpoints of a multimodal interaction. However, unlike events, signals are used to exchange lower level information about the actual participants in a multimodal interaction. The following example list of signals is not exhaustive, and other signals can be defined and still be within the scope of the preferred embodiment of the invention.
The time synchronization signals are used to correct for the network latency that can arise with geographically distributed clients.
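By way of illustration, the kind of offset computation such signals could support is sketched below; this NTP-style estimate is an assumption and is not prescribed by the text:

/* Sketch of an NTP-style clock offset estimate. t0: client send time,
   t1: server receive time, t2: server send time, t3: client receive time
   (all in milliseconds). */
final class ClockSyncSketch {
    /* Estimated offset of the server clock relative to the client clock. */
    static long offsetMillis(long t0, long t1, long t2, long t3) {
        return ((t1 - t0) + (t2 - t3)) / 2;
    }

    /* Estimated round-trip network delay. */
    static long roundTripMillis(long t0, long t1, long t2, long t3) {
        return (t3 - t0) - (t2 - t1);
    }
}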
MMOD Protocol
As mentioned before, MMOD clients exchange a set of messages to establish and maintain communication during a multimodal interaction. The sequences of messages exchanged can vary depending on the configuration of the endpoints. For a peer-to-peer type of configuration, an MMOD browser exchanges messages directly with another MMOD browser, whereas in a peer-to-coordinator type of configuration as shown in
a-b depict the exchange of messages between the GUI browser adapter 102, the voice browser adapter 112, and the IM 124 depicted in
Advantages of the Invention
The exemplary aspects of the invention provide the following advantages, all centered around building an extensible, flexible framework that supports a wide range of multimodal applications and their underlying authoring/programming models:
While the present invention has been described in detail with reference to a preferred embodiment, those skilled in the art will appreciate that various modifications and substitutions can be made thereto without departing from the spirit and scope of the invention as set forth in the appended claims.