System and method for an integrated, multi-modal, multi-device natural language voice services environment

Description

FIELD OF THE INVENTION

The invention relates to an integrated voice services environment in which a plurality of devices can provide various voice services by cooperatively processing free form, multi-modal, natural language inputs, thereby facilitating conversational interactions between a user and one or more of the devices in the integrated environment.

BACKGROUND OF THE INVENTION

As technology has progressed in recent years, consumer electronic devices have emerged to become nearly ubiquitous in the everyday lives of many people. To meet the increasing demand that has resulted from growth in the functionality and mobility of mobile phones, navigation devices, embedded devices, and other such devices, a wealth of features and functions are often provided therein in addition to core applications. Greater functionality also introduces trade-offs, however, including learning curves that often inhibit users from fully exploiting all of the capabilities of their electronic devices. For example, many existing electronic devices include complex human to machine interfaces that may not be particularly user-friendly, which inhibits mass-market adoption for many technologies. Moreover, cumbersome interfaces often result in otherwise desirable features being buried (e.g., within menus that may be tedious to navigate), which has the tendency of causing many users to not use, or even know about, the potential capabilities of their devices.

As such, the increased functionality provided by many electronic devices often tends to be wasted, as market research suggests that many users use only a fraction of the features or applications available on a given device. Moreover, in a society where wireless networking and broadband access are increasingly prevalent, consumers tend to naturally desire seamless mobile capabilities from their electronic devices. Thus, as consumer demand intensifies for simpler mechanisms to interact with electronic devices, cumbersome interfaces that prevent quick and focused interaction can become an important concern. Accordingly, the ever-growing demand for mechanisms to use technology in intuitive ways remains largely unfulfilled.

One approach towards simplifying human to machine interactions in electronic devices includes the use of voice recognition software, which can enable users to exploit features that could otherwise be unfamiliar, unknown, or difficult to use. For example, a recent survey conducted by the Navteq Corporation, which provides data used in a variety of applications such as automotive navigation and web-based applications, demonstrates that voice recognition often ranks among the features most desired by consumers of electronic devices. Even so, existing voice user interfaces, when they actually work, still tend to require significant learning on the part of the user.

For example, many existing voice user interfaces only support requests formulated according to specific command-and-control sequences or syntaxes. Furthermore, many existing voice user interfaces cause user frustration or dissatisfaction because of inaccurate speech recognition. Similarly, by forcing a user to provide pre-established commands or keywords to communicate requests in ways that a system can understand, existing voice user interfaces do not effectively engage the user in a productive, cooperative dialogue to resolve requests and advance a conversation towards a mutually satisfactory goal (e.g., when users may be uncertain of particular needs, available information, or device capabilities, among other things). As such, existing voice user interfaces tend to suffer from various drawbacks, including significant limitations on engaging users in a dialogue in a cooperative and conversational manner.

Additionally, many existing voice user interfaces fall short in utilizing information distributed across various different domains or devices in order to resolve natural language voice-based inputs. Thus, existing voice user interfaces suffer from being constrained to a finite set of applications for which they have been designed, or to devices on which they reside. Although technological advancement has resulted in users often having several devices to suit their various needs, existing voice user interfaces do not adequately free users from device constraints. For example, users may be interested in services associated with different applications and devices, but existing voice user interfaces tend to restrict users from accessing the applications and devices as they see fit. Moreover, users typically can only practicably carry a finite number of devices at any given time, yet content or services associated with users' other devices that are not currently being used may be desired in various circumstances. Accordingly, although users tend to have varying needs, where content or services associated with different devices may be desired in various contexts or environments, existing voice technologies tend to fall short in providing an integrated environment in which users can request content or services associated with virtually any device or network. As such, constraints on information availability and device interaction mechanisms in existing voice services environments tend to prevent users from experiencing technology in an intuitive, natural, and efficient way.

Existing systems suffer from these and other problems.

SUMMARY OF THE INVENTION

According to various aspects of the invention, a system and method for an integrated, multi-modal, multi-device natural language voice services environment may include a plurality of voice-enabled devices each having intent determination capabilities for processing multi-modal natural language inputs in addition to knowledge of the intent determination capabilities of other devices in the environment. Further, the environment may be arranged in a centralized manner, a distributed peer-to-peer manner, or various combinations thereof. As such, the various devices may cooperate to determine intent of multi-modal natural language inputs, and commands, queries, or other requests may be routed to one or more of the devices best suited to take action in response thereto.

According to various aspects of the invention, the integrated natural language voice services environment arranged in the centralized manner includes an input device that receives a multi-modal natural language input, a central device communicatively coupled to the input device, and one or more secondary devices communicatively coupled to the central device. Each of the input device, the central device, and the one or more secondary devices may have intent determination capabilities for processing multi-modal natural language inputs. As such, an intent of a given multi-modal natural language input may be determined in the centralized manner by communicating the multi-modal natural language input from the input device to the central device. Thereafter, the central device may aggregate the intent determination capabilities of the input device and the one or more secondary devices and determine an intent of the multi-modal natural language input using the aggregated intent determination capabilities. The input device may then receive the determined intent from the central device and invoke at least one action at one or more of the input device, the central device, or the secondary devices based on the determined intent.

According to various aspects of the invention, the integrated natural language voice services environment arranged in the distributed manner includes an input device that receives a multi-modal natural language input, a central device communicatively coupled to the input device and one or more secondary devices communicatively coupled to the input device, wherein each of the input device and the one or more secondary devices may have intent determination capabilities for processing multi-modal natural language inputs, as in the centralized implementation. However, the distributed implementation may be distinct from the centralized implementation in that a preliminary intent of the multi-modal natural language input may be determined at the input device using local intent determination capabilities. The multi-modal natural language input may then be communicated to one or more of the secondary devices (e.g., when a confidence level of the intent determination at the input device falls below a given threshold). In such cases, each of the secondary devices determines an intent of the multi-modal natural language input using local intent determination capabilities. The input device collates the preliminary intent determination and the intent determinations of the secondary devices, and may arbitrate among the collated intent determinations to determine an actionable intent of the multi-modal natural language input.
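
By way of non-limiting illustration only, the distributed flow described above may be sketched in Python as follows. The names IntentResult and determine_intent_distributed, the 0.8 threshold, and the rule of arbitrating by highest confidence are assumptions made for this sketch and are not prescribed by the description; other arbitration rules may equally be used.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IntentResult:
    device_id: str
    intent: str
    confidence: float  # assumed to lie in [0, 1]


# A recognizer is modeled as any callable mapping input text to an IntentResult.
Recognizer = Callable[[str], IntentResult]


def determine_intent_distributed(text: str,
                                 local: Recognizer,
                                 secondaries: List[Recognizer],
                                 threshold: float = 0.8) -> IntentResult:
    """Hypothetical sketch of intent determination at the input device."""
    # Preliminary intent determined with local capabilities.
    preliminary = local(text)
    if preliminary.confidence >= threshold:
        return preliminary
    # Confidence fell below the threshold: consult the secondary devices.
    collated = [preliminary] + [recognize(text) for recognize in secondaries]
    # Arbitrate among the collated determinations (assumed rule: top confidence).
    return max(collated, key=lambda result: result.confidence)
```

In this sketch the input device contacts the secondary devices only when its own confidence is unsatisfactory, so a confident local determination is never overridden.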

According to various aspects of the invention, the integrated natural language voice services environment may be arranged in a manner that dynamically selects between a centralized model and a distributed model. For example, the environment includes an input device that receives a multi-modal natural language input and one or more secondary devices communicatively coupled to the input device, each of which has intent determination capabilities for processing multi-modal natural language inputs. A constellation model may be accessible to each of the input device and the one or more secondary devices, wherein the constellation model describes the intent determination capabilities of the input device and the one or more secondary devices. The multi-modal natural language input can be routed for processing at one or more of the input device or the secondary devices to determine an intent thereof based on the intent determination capabilities described in the constellation model. For example, when the constellation model arranges the input device and the secondary devices in the centralized manner, one of the secondary devices may be designated the central device and the natural language input may be processed as described above. However, when the multi-modal natural language input cannot be communicated to the central device, the constellation model may be dynamically rearranged in the distributed manner, whereby the input device and the secondary devices share knowledge relating to respective local intent determination capabilities and operate as cooperative nodes to determine the intent of the multi-modal natural language input using the shared knowledge relating to local intent determination capabilities.

Other objects and advantages of the invention will be apparent based on the following drawings and detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a block diagram of an exemplary multi-modal electronic device that may be provided in an integrated, multi-device natural language voice services environment, according to various aspects of the invention.

FIG. 2 illustrates a block diagram of an exemplary centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.

FIG. 3 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at an input device in the centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.

FIG. 4 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at a central device in the centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.

FIG. 5 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at a secondary device in the centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.

FIG. 6 illustrates a block diagram of an exemplary distributed implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.

FIG. 7 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at an input device in the distributed implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.

DETAILED DESCRIPTION

According to various aspects of the invention, FIG. 1 illustrates a block diagram of an exemplary multi-modal electronic device 100 that may be provided in a natural language voice services environment that includes one or more additional multi-modal devices (e.g., as illustrated in FIGS. 2 and 6). As will be apparent, the electronic device 100 illustrated in FIG. 1 may be any suitable voice-enabled electronic device (e.g., a telematics device, a personal navigation device, a mobile phone, a VoIP node, a personal computer, a media device, an embedded device, a server, or another electronic device). The device 100 may include various components that collectively provide a capability to process conversational, multi-modal natural language inputs. As such, a user of the device 100 may engage in multi-modal conversational dialogues with the voice-enabled electronic device 100 to resolve requests in a free form, cooperative manner.

For example, the natural language processing components may support free form natural language utterances to liberate the user from restrictions relating to how commands, queries, or other requests should be formulated. Rather, the user may employ any manner of speaking that feels natural in order to request content or services available through the device 100 (e.g., content or services relating to telematics, communications, media, messaging, navigation, marketing, information retrieval, etc.). For instance, in various implementations, the device 100 may process natural language utterances utilizing techniques described in U.S. patent application Ser. No. 10/452,147, entitled “Systems and Methods for Responding to Natural Language Speech Utterance,” filed Jun. 3, 2003, and U.S. patent application Ser. No. 10/618,633, entitled “Mobile Systems and Methods for Responding to Natural Language Speech Utterance,” filed Jun. 15, 2003, the disclosures of which are hereby incorporated by reference in their entirety.

Moreover, because the device 100 may be deployed in an integrated multi-device environment, the user may further request content or services available through other devices deployed in the environment. In particular, the integrated voice services environment may include a plurality of multi-modal devices, each of which includes natural language components generally similar to those illustrated in FIG. 1. The various devices in the environment may serve distinct purposes, however, such that available content, services, applications, or other capabilities may vary among the devices in the environment (e.g., core functions of a media device may vary from those of a personal navigation device). Thus, each device in the environment, including device 100, may have knowledge of content, services, applications, intent determination capabilities, and other features available through the other devices by way of a constellation model 130b. Accordingly, as will be described in greater detail below, the electronic device 100 may cooperate with other devices in the integrated environment to resolve requests by sharing context, prior information, domain knowledge, short-term knowledge, long-term knowledge, and cognitive models, among other things.

According to various aspects of the invention, the electronic device 100 may include an input mechanism 105 that can receive multi-modal natural language inputs, which include at least an utterance spoken by the user. As will be apparent, the input mechanism 105 may include any appropriate device or combination of devices capable of receiving a spoken input (e.g., a directional microphone, an array of microphones, or any other device that can generate encoded speech). Further, in various implementations, the input mechanism 105 can be configured to maximize fidelity of encoded speech, for example, by maximizing gain in a direction of the user, cancelling echoes, nulling point noise sources, performing variable rate sampling, or filtering environmental noise (e.g., background conversations). As such, the input mechanism 105 may generate encoded speech in a manner that can tolerate noise or other factors that could otherwise interfere with accurate interpretation of the utterance.

Furthermore, in various implementations, the input mechanism 105 may include various other input modalities (i.e., the input mechanism 105 may be arranged in a multi-modal environment), in that non-voice inputs can be correlated and/or processed in connection with one or more previous, contemporaneous, or subsequent multi-modal natural language inputs. For example, the input mechanism 105 may be coupled to a touch-screen interface, a stylus and tablet interface, a keypad or keyboard, or any other suitable input mechanism, as will be apparent. As a result, an amount of information potentially available when processing the multi-modal inputs may be maximized, as the user can clarify utterances or otherwise provide additional information in a given multi-modal natural language input using various input modalities. For instance, in an exemplary illustration, the user could touch a stylus or other pointing device to a portion of a touch-screen interface of the device 100, while also providing an utterance relating to the touched portion of the interface (e.g., “Show me restaurants around here”). In this example, the natural language utterance may be correlated with the input received via the touch-screen interface, resulting in “around here” being interpreted in relation to the touched portion of the interface (e.g., as opposed to the user's current location or some other meaning).
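
As a minimal, hypothetical sketch of the correlation described above, the following Python fragment resolves the phrase "around here" against a touch input received close in time to the utterance. The types TouchEvent and Utterance, the function resolve_location, and the three-second correlation window are assumptions introduced solely for illustration; the description does not specify how the correlation is performed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TouchEvent:
    # Hypothetical non-voice input: map coordinates selected on a touch screen.
    lat: float
    lon: float
    timestamp: float  # seconds


@dataclass
class Utterance:
    text: str
    timestamp: float  # seconds


def resolve_location(utterance: Utterance,
                     touch: Optional[TouchEvent],
                     current_location: Tuple[float, float],
                     window_s: float = 3.0) -> Tuple[float, float]:
    """Pick the reference point for a phrase such as "around here"."""
    if "here" not in utterance.text.lower():
        return current_location
    # Correlate the utterance with a touch that occurred close to it in time.
    if touch is not None and abs(utterance.timestamp - touch.timestamp) <= window_s:
        return (touch.lat, touch.lon)
    # No correlated touch: fall back to the user's current location.
    return current_location
```

With a touch registered shortly before or after the utterance, the touched coordinates are used; otherwise the sketch falls back to the current location of the user.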

According to various aspects of the invention, the device 100 may include an Automatic Speech Recognizer 110 that generates one or more preliminary interpretations of the encoded speech, which may be received from the input mechanism 105. For example, the Automatic Speech Recognizer 110 may recognize syllables, words, or phrases contained in an utterance using one or more dynamically adaptable recognition grammars. The dynamic recognition grammars may be used to recognize a stream of phonemes through phonetic dictation based on one or more acoustic models. Furthermore, as described in U.S. patent application Ser. No. 11/197,504, entitled “Systems and Methods for Responding to Natural Language Speech Utterance,” filed Aug. 5, 2005, the disclosure of which is hereby incorporated by reference in its entirety, the Automatic Speech Recognizer 110 may be capable of multi-pass analysis, where a primary speech recognition engine may generate a primary interpretation of an utterance (e.g., using a large list dictation grammar) and request secondary transcription from one or more secondary speech recognition engines (e.g., using a virtual dictation grammar having decoy words for out-of-vocabulary words).

Thus, the Automatic Speech Recognizer 110 may generate preliminary interpretations of an utterance in various ways, including exclusive use of a dictation grammar or virtual dictation grammar, or use of various combinations thereof (e.g., when the device 100 supports multi-pass analysis). In any event, the Automatic Speech Recognizer 110 may provide out-of-vocabulary capabilities and may tolerate portions of a speech signal being dropped, the user misspeaking, or other variables that may occur in natural language speech (e.g., stops and starts, stutters, etc.). Furthermore, the recognition grammars employed by the Automatic Speech Recognizer 110 may include vocabularies, dictionaries, syllables, words, phrases, or other information optimized according to various contextual or application-specific domains (e.g., navigation, music, movies, weather, shopping, news, languages, temporal or geographic proximities, or other suitable domains). Moreover, environmental knowledge (e.g., peer-to-peer affinities, capabilities of devices in the environment, etc.), historical knowledge (e.g., frequent requests, prior context, etc.), or other types of knowledge can be used to continually optimize the information contained in the recognition grammars on a dynamic basis.

For example, information contained in the recognition grammars may be dynamically optimized to improve a likelihood of a given utterance being recognized accurately (e.g., following an incorrect interpretation of a word, the incorrect interpretation may be removed from the grammar to reduce a likelihood of the incorrect interpretation being repeated). Accordingly, the Automatic Speech Recognizer 110 may use a number of techniques to generate preliminary interpretations of natural language utterances, including those described, for example, in U.S. patent application Ser. No. 11/513,269, entitled “Dynamic Speech Sharpening,” filed Aug. 31, 2006, the disclosure of which is hereby incorporated by reference in its entirety. Furthermore, the techniques used by the Automatic Speech Recognizer 110 associated with the device 100 may be considered in defining intent determination capabilities of the device 100, and such capabilities may be shared with other devices in the environment to enable convergence of speech recognition throughout the environment (e.g., because various devices may employ distinct speech recognition techniques or have distinct grammars or vocabularies, the devices may share vocabulary translation mechanisms to enhance system-wide recognition).
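
The dynamic optimization mentioned above (removing an incorrect interpretation from the grammar) may be illustrated, without limitation, by the following toy Python sketch. The class AdaptiveGrammar and its methods are hypothetical, and the hypothesis scores are assumed to be supplied by an acoustic front end that is not shown; the sketch is not the technique of the referenced applications.

```python
from typing import Dict, Iterable, Optional, Set


class AdaptiveGrammar:
    """Toy recognition grammar: a set of phrases that can shrink at run time."""

    def __init__(self, phrases: Iterable[str]) -> None:
        self.phrases: Set[str] = set(phrases)

    def best_interpretation(self, scores: Dict[str, float]) -> Optional[str]:
        # Keep only hypotheses the grammar still allows, then take the top score.
        allowed = {p: s for p, s in scores.items() if p in self.phrases}
        return max(allowed, key=allowed.get) if allowed else None

    def report_misrecognition(self, phrase: str) -> None:
        # Remove the rejected phrase so that it cannot be selected again.
        self.phrases.discard(phrase)
```

For instance, if "call bob" is initially preferred over "call rob" and the user rejects it, reporting the misrecognition causes the same scores to yield "call rob" on the next attempt.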

According to various aspects of the invention, the Automatic Speech Recognizer 110 may provide one or more preliminary interpretations of a multi-modal input, including an utterance contained therein, to a conversational language processor 120. The conversational language processor 120 may include various components that collectively operate to model everyday human-to-human conversations in order to engage in cooperative conversations with the user to resolve requests based on the user's intent. For example, the conversational language processor 120 may include, among other things, an intent determination engine 130a, a constellation model 130b, one or more domain agents 130c, a context tracking engine 130d, a misrecognition engine 130e, and a voice search engine 130f. Furthermore, the conversational language processor 120 may be coupled to one or more data repositories 160 and applications associated with one or more domains. Thus, the intent determination capabilities of the device 100 may be defined based on the data and processing capabilities of the Automatic Speech Recognizer 110 and the conversational language processor 120.

More particularly, the intent determination engine 130a may establish meaning for a given multi-modal natural language input based on a consideration of the intent determination capabilities of the device 100 as well as the intent determination capabilities of other devices in the integrated voice services environment. For example, the intent determination capabilities of the device 100 may be defined as a function of processing resources, storage for grammars, context, agents, or other data, and content or services associated with the device 100 (e.g., a media device with a small amount of memory may have a smaller list of recognizable songs than a device with a large amount of memory). Thus, the intent determination engine 130a may determine whether to process a given input locally (e.g., when the device 100 has intent determination capabilities that suggest favorable conditions for recognition), or whether to route information associated with the input to other devices, which may assist in determining the intent of the input.

As such, to determine which device or combination of devices should process an input, the intent determination engine 130a may evaluate the constellation model 130b, which provides a model of the intent determination capabilities for each of the devices in the integrated voice services environment. For instance, the constellation model 130b may contain, among other things, knowledge of processing and storage resources available to each of the devices in the environment, as well as the nature and scope of domain agents, context, content, services, and other information available to each of the devices in the environment. As such, using the constellation model 130b, the intent determination engine 130a may be able to determine whether any of the other devices have intent determination capabilities that can be invoked to augment or otherwise enhance the intent determination capabilities of the device 100 (e.g., by routing information associated with a multi-modal natural language input to the device or devices that appear best suited to analyze the information and therefore determine an intent of the input). Accordingly, the intent determination engine 130a may establish the meaning of a given utterance by utilizing the comprehensive constellation model 130b that describes capabilities within the device 100 and across the integrated environment. The intent determination engine 130a may therefore optimize processing of a given natural language input based on capabilities throughout the environment (e.g., utterances may be processed locally to the device 100, routed to a specific device based on information in the constellation model 130b, or flooded to all of the devices in the environment in which case an arbitration may occur to select a best guess at an intent determination).
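
The three routing outcomes described above (local processing, routing to a specific device, or flooding to all devices) may be sketched in Python as follows. The classes DeviceCapabilities and ConstellationModel, the use of a domain label to characterize an input, and the choice of the capable device having the most memory are assumptions made for this illustration only; an actual constellation model may hold considerably richer information.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DeviceCapabilities:
    # Hypothetical per-device entry in a constellation model.
    domains: Set[str] = field(default_factory=set)
    memory_mb: int = 0


@dataclass
class ConstellationModel:
    devices: Dict[str, DeviceCapabilities]

    def route(self, local_id: str, domain: str) -> List[str]:
        """Return the ids of the devices that should process the input."""
        if domain in self.devices[local_id].domains:
            return [local_id]  # process locally
        capable = [d for d, c in self.devices.items() if domain in c.domains]
        if capable:
            # Route to one capable device (assumed rule: the most memory).
            return [max(capable, key=lambda d: self.devices[d].memory_mb)]
        # No device advertises the domain: flood to every other device.
        return [d for d in self.devices if d != local_id]
```

When the input is flooded, the results returned by the contacted devices would still need to be arbitrated to select a best guess at the intent, which this sketch leaves out.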

Although the following discussion will generally focus on various techniques that can be used to determine the intent of multi-modal natural language inputs in the integrated multi-device environment, it will be apparent that the natural language processing capabilities of any one of the devices may extend beyond the specific discussion that has been provided herein. As such, in addition to the co-pending U.S. Patent Applications referenced above, further natural language processing capabilities that may be employed include those described in U.S. patent application Ser. No. 11/197,504, entitled “Systems and Methods for Responding to Natural Language Speech Utterance,” filed Aug. 5, 2005, U.S. patent application Ser. No. 11/200,164, entitled “System and Method of Supporting Adaptive Misrecognition in Conversational Speech,” filed Aug. 10, 2005, U.S. patent application Ser. No. 11/212,693, entitled “Mobile Systems and Methods of Supporting Natural Language Human-Machine Interactions,” filed Aug. 29, 2005, U.S. patent application Ser. No. 11/580,926, entitled “System and Method for a Cooperative Conversational Voice User Interface,” filed Oct. 16, 2006, U.S. patent application Ser. No. 11/671,526, entitled “System and Method for Selecting and Presenting Advertisements Based on Natural Language Processing of Voice-Based Input,” filed Feb. 6, 2007, and U.S. patent application Ser. No. 11/954,064, entitled “System and Method for Providing a Natural Language Voice User Interface in an Integrated Voice Navigation Services Environment,” filed Dec. 11, 2007, the disclosures of which are hereby incorporated by reference in their entirety.

According to various aspects of the invention, FIG. 2 illustrates a block diagram of an exemplary centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment. As will be apparent from the further description to be provided herein, the centralized implementation of the integrated, multi-device voice services environment may enable a user to engage in conversational, multi-modal natural language interactions with any one of voice-enabled devices 210a-n or central voice-enabled device 220. As such, the multi-device voice services environment may collectively determine intent for any given multi-modal natural language input, whereby the user may request content or voice services relating to any device or application in the environment, without restraint.

As illustrated in FIG. 2, the centralized implementation of the multi-device voice service environment may include a plurality of voice-enabled devices 210a-n, each of which includes various components capable of determining intent of natural language utterances, as described above in reference to FIG. 1. Furthermore, as will be apparent, the centralized implementation includes a central device 220, which contains information relating to intent determination capabilities for each of the other voice-enabled devices 210a-n. For example, in various exemplary implementations, the central device 220 may be designated as such by virtue of being a device most capable of determining the intent of an utterance (e.g., a server, home data center, or other device having significant processing power, memory resources, and communication capabilities making the device suitable to manage intent determination across the environment). In another exemplary implementation, the central device 220 may be dynamically selected based on one or more characteristics of a given multi-modal natural language input, dialogue, or interaction (e.g., a device may be designated as the central device 220 when a current utterance relates to a specific domain).

In the centralized implementation illustrated in FIG. 2, a multi-modal natural language input may be received at one of the voice-enabled devices 210a-n. Therefore, the receiving one of the devices 210a-n may be designated as an input device for that input, while the remaining devices 210a-n may be designated as secondary devices for that input. In other words, for any given multi-modal natural language input, the multi-device environment may include an input device that collects the input, a central device 220 that aggregates intent determination, inferencing, and processing capabilities for all of the devices 210a-n in the environment, and one or more secondary devices that may also be used in the intent determination process. As such, each device 210 in the environment may be provided with a constellation model that identifies all of the devices 210 having incoming and outgoing communication capabilities, thus indicating an extent to which other devices may be capable of determining intent for a given multi-modal natural language input. The constellation model may further define a location of the central device 220, which aggregates context, vocabularies, content, recognition grammars, misrecognitions, shared knowledge, intent determination capabilities, inferencing capabilities, and other information from the various devices 210a-n in the environment.

Accordingly, as communication and processing capabilities permit, the central device 220 may be used as a recognizer of first or last resort. For example, because the central device 220 converges intent determination capabilities across the environment (e.g., by aggregating context, vocabularies, device capabilities, and other information from the devices 210a-n in the environment), inputs may be automatically routed to the central device 220 when used as a recognizer of first resort, or as a recognizer of last resort when local processing at the input device 210 cannot determine the intent of the input with a satisfactory level of confidence. However, it will also be apparent that in certain instances the input device 210 may be unable to make contact with the central device 220 for various reasons (e.g., a network connection may be unavailable, or a processing bottleneck at the central device 220 may cause communication delays). In such cases, the input device 210 that has initiated contact with the central device 220 may shift into decentralized processing (e.g., as described in reference to FIG. 6) and communicate capabilities with one or more of the other devices 210a-n in the constellation model. Thus, when the central device 220 cannot be invoked for various reasons, the remaining devices 210a-n may operate as cooperative nodes to determine intent in a decentralized manner.
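
A minimal Python sketch of the first-resort and last-resort behavior, including the shift into decentralized processing when the central device cannot be reached, is given below for illustration. The exception CentralUnavailable, the function resolve, the modeling of each recognizer as a callable returning an (intent, confidence) pair, and the 0.8 threshold are all assumptions of this sketch rather than features recited in the description.

```python
from typing import Callable, List, Optional, Tuple

Result = Tuple[str, float]  # (intent, confidence)
Recognizer = Callable[[str], Result]


class CentralUnavailable(Exception):
    """Raised by the hypothetical central-device client when contact fails."""


def resolve(text: str,
            local: Recognizer,
            central: Recognizer,
            peers: List[Recognizer],
            central_first: bool = False,
            threshold: float = 0.8) -> Result:
    local_result: Optional[Result] = None
    if not central_first:
        # Central device is a recognizer of last resort: try locally first.
        local_result = local(text)
        if local_result[1] >= threshold:
            return local_result
    try:
        return central(text)
    except CentralUnavailable:
        # Shift into decentralized processing with the remaining devices.
        if local_result is None:
            local_result = local(text)
        candidates = [local_result] + [peer(text) for peer in peers]
        return max(candidates, key=lambda result: result[1])
```

Setting central_first to True models the central device as a recognizer of first resort; in either mode a failure to reach the central device causes the input device and its peers to cooperate instead.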

Additionally, in the multi-device voice services environment, the central device 220 and the various other devices 210a-n may cooperate to create a converged model of capabilities throughout the environment. For example, as indicated above, in addition to having intent determination capabilities based on processing resources, memory resources, and device capabilities, each of the devices 210a-n and the central device 220 may include various other natural language processing components. The voice services environment may therefore operate in an integrated manner by maintaining not only a complete model of data, content, and services associated with the various devices 210a-n, but also of other natural language processing capabilities and dynamic states associated with the various devices 210a-n. As such, the various devices 210a-n may operate with a goal of converging capabilities, data, states, and other information across the environment, either on one device (e.g., the central device 220) or distributed among the various devices 210a-n (e.g., as in the decentralized implementation to be described in FIG. 6).

For example, as discussed above, each device 210 includes an Automatic Speech Recognizer, one or more dynamically adaptable recognition grammars, and vocabulary lists used to generate phonemic interpretations of natural language utterances. Moreover, each device 210 includes locally established context, which can range from information contained in a context stack, context and namespace variables, vocabulary translation mechanisms, short-term shared knowledge relating to a current dialogue or conversational interaction, long-term shared knowledge relating to a user's learned preferences over time, or other contextual information. Furthermore, each device 210 may have various services or applications associated therewith, and may perform various aspects of natural language processing locally. Thus, additional information to be converged throughout the environment may include partial or preliminary utterance recognitions, misrecognitions or ambiguous recognitions, inferencing capabilities, and overall device state information (e.g., songs playing in the environment, alarms set in the environment, etc.).

Thus, various data synchronization and referential integrity algorithms may be employed in concert by the various devices 210a-n and the central device 220 to provide a consistent worldview of the environment. For example, information may be described and transmitted throughout the environment for synchronization and convergence purposes using the Universal Plug and Play protocol designed for computer ancillary devices, although the environment can also operate in a peer-to-peer disconnected mode (e.g., when the central device 220 cannot be reached). However, in various implementations, the environment may also operate in a peer-to-peer mode regardless of the disconnected status, as illustrated in FIG. 6, for example, when the devices 210a-n have sufficient commensurate resources and capabilities for natural language processing.

In general, the algorithms for convergence in the environment can be executed at various intervals, although it may be desirable to limit data transmission so as to avoid processing bottlenecks. For example, because the convergence and synchronization techniques relate to natural language processing, in which any given utterance will typically be expressed over a course of several seconds, information relating to context and vocabulary need not be updated on a time frame of less than a few seconds. However, as communication capabilities permit, context and vocabulary could be updated more frequently to provide real-time recognition or the appearance of real-time recognition. In another implementation, the convergence and synchronization may be permitted to run until completion (e.g., when no requests are currently pending), or the convergence and synchronization may be suspended or terminated when a predetermined time or resource consumption limit has been reached (e.g., when the convergence relates to a pending request, an intent determination having a highest confidence level at the time of cut-off may be used).

By establishing a consistent view of capabilities, data, states, and other information throughout the environment, an input device 210 may cooperate with the central device 220 and one or more secondary devices (i.e., one or more of devices 210a-n, other than the input device) in processing any given multi-modal natural language input. Furthermore, by providing each device 210 and the central device 220 with a constellation model that describes a synchronized state of the environment, the environment may be tolerant of failure by one or more of the devices 210a-n, or of the central device 220. For example, if the input device 210 cannot communicate with the central device 220 (e.g., because of a server crash), the input device 210 may enter a disconnected peer-to-peer mode, whereby capabilities can be exchanged with one or more devices 210a-n with which communications remain available. To that end, when a device 210 establishes new information relating to vocabulary, context, misrecognitions, agent adaptation, intent determination capabilities, inferencing capabilities, or otherwise, the device 210 may transmit the information to the central device 220 for convergence purposes, as discussed above, in addition to consulting the constellation model to determine whether the information should be transmitted to one or more of the other devices 210a-n.

For example, suppose the environment includes a voice-enabled mobile phone that has nominal functionality relating to playing music or other media, and which further has a limited amount of local storage space, while the environment further includes a voice-enabled home media system that includes a mass storage medium that provides dedicated media functionality. If the mobile phone were to establish new vocabulary, context, or other information relating to a song (e.g., a user downloads the song or a ringtone to the mobile phone while on the road), the mobile phone may transmit the newly established information to the home media system in addition to the central device 220. As such, by having a model of all of the devices 210a-n in the environment and transmitting new information to the devices where it will most likely be useful, the various devices may handle disconnected modes of operation when the central device 220 may be unavailable for any reason, while resources may be allocated efficiently throughout the environment.
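
By way of non-limiting illustration, the selection of recipients for newly established information may be sketched as follows. The function name `recipients_for`, the `device_domains` mapping, and the default identifier `"central"` are hypothetical. The sketch assumes that new information is tagged with a single domain and that usefulness is approximated by whether a device supports that domain.

```python
from typing import Dict, List, Set


def recipients_for(
    info_domain: str,
    source_id: str,
    device_domains: Dict[str, Set[str]],
    central_id: str = "central",
) -> List[str]:
    """Return the central device plus peers likely to use the new information.

    Assumption: a peer is a useful recipient when it supports the same domain
    as the newly established vocabulary, context, or other information.
    """
    targets = [central_id]
    targets += sorted(
        dev
        for dev, domains in device_domains.items()
        if dev != source_id and info_domain in domains
    )
    return targets


# Mirrors the mobile phone and home media system example above.
device_domains = {
    "mobile_phone": {"telephony", "music"},
    "home_media": {"music", "movies"},
    "navigation_unit": {"navigation"},
}
targets = recipients_for("music", "mobile_phone", device_domains)
```

Here `targets` contains the central device and the home media system, but not the navigation unit, so the song-related information reaches the device where it will most likely be useful.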

Thus, based on the foregoing discussion, it will be apparent that a centralized implementation of an integrated multi-device voice services environment may generally include a central device 220 operable to aggregate or converge knowledge relating to content, services, capabilities, and other information associated with various voice-enabled devices 210a-n deployed within the environment. In such centralized implementations, the central device 220 may be invoked as a recognizer of first or last resort, as will be described in greater detail with reference to FIGS. 3-5, and furthermore, the other devices 210a-n in the environment may be configured to automatically enter a disconnected or peer-to-peer mode of operation when the central device 220 cannot be invoked for any reason (i.e., devices may enter a decentralized or distributed mode, as will be described in greater detail with reference to FIGS. 6-7). Knowledge and capabilities of each of the devices 210a-n may therefore be made available throughout the voice services environment in a centralized manner, a distributed manner, or various combinations thereof, thus optimizing an amount of natural language processing resources used to determine an intent of any given multi-modal natural language input.

According to various aspects of the invention, FIG. 3 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at an input device in the centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment. Similarly, FIGS. 4 and 5 illustrate corresponding methods associated with a central device and one or more secondary devices, respectively, in the centralized voice service environment. Furthermore, it will be apparent that the processing techniques described in relation to FIGS. 3-5 may generally be based on the centralized implementation illustrated in FIG. 2 and described above, whereby the input device may be assumed to be distinct from the central device, and the one or more secondary devices may be assumed to be distinct from the central device and the input device. However, it will be apparent that various instances may involve a natural language input being received at the central device or at another device, in which case the techniques described in FIGS. 3-5 may vary depending on circumstances of the environment (e.g., decisions relating to routing utterances to a specific device or devices may be made locally, collaboratively, or in other ways depending on various factors, such as overall system state, communication capabilities, intent determination capabilities, or otherwise).

As illustrated in FIG. 3, a multi-modal natural language input may be received at an input device in an operation 310. The multi-modal input may include at least a natural language utterance provided by a user, and may further include other input modalities such as audio, text, button presses, gestures, or other non-voice inputs. It will also be apparent that prior to receiving the natural language input in operation 310, the input device may be configured to establish natural language processing capabilities. For example, establishing natural language processing capabilities may include, among other things, loading an Automatic Speech Recognizer and any associated recognition grammars, launching a conversational language processor to handle dialogues with the user, and installing one or more domain agents that provide functionality for respective application domains or contextual domains (e.g., navigation, music, movies, weather, information retrieval, device control, etc.).

The input device may also be configured to coordinate synchronization of intent determination capabilities, shared knowledge, and other information with the central device and the secondary devices in the environment prior to receiving the input at operation 310. For example, when the input device installs a domain agent, the installed domain agent may bootstrap context variables, semantics, namespace variables, criteria values, and other context related to that agent from other devices in the system. Similarly, misrecognitions may be received from the central device and the secondary devices in order to enable correction of agents that use information relevant to the received misrecognitions, and vocabularies and associated translation mechanisms may be synchronized among the devices to account for potential variations between the Automatic Speech Recognizers used by the various devices (e.g., each device in the environment cannot be guaranteed to use the same Automatic Speech Recognizer or recognition grammars, necessitating vocabulary and translation mechanisms to be shared among the devices that share intent determination capabilities).

Upon establishing and synchronizing natural language processing capabilities and subsequently receiving a multi-modal natural language input in operation 310, the input device may determine whether the environment has been set up to automatically transmit the input to the central device in a decisional operation 320. In such a case, processing proceeds to an operation 360 for transmitting the input to the central device, which may then process the input according to techniques to be described in relation to FIG. 4. If the environment has not been set up to automatically communicate the input to the central device, however, processing proceeds to an operation 330, where the input device performs transcription of the natural language utterance contained in the multi-modal input. For example, the input device may transcribe the utterance using the Automatic Speech Recognizer and recognition grammars associated therewith according to techniques described above and in the above-referenced U.S. Patent Applications.

Subsequently, in an operation 340, an intent of the multi-modal natural language input may be determined at the input device using local natural language processing capabilities and resources. For example, any non-voice input modalities included in the input may be merged with the utterance transcription and a conversational language processor associated with the input device may utilize local information relating to context, domain knowledge, shared knowledge, context variables, criteria values, or other information useful in natural language processing. As such, the input device may attempt to determine a best guess as to an intent of the user that provided the input, such as identifying a conversation type (e.g., query, didactic, or exploratory) or request that may be contained in the input (e.g., a command or query relating to one or more domain agents or application domains).

The intent determination of the input device may be assigned a confidence level (e.g., a device having an Automatic Speech Recognizer that implements multi-pass analysis may assign comparatively higher confidence levels to utterance transcriptions created thereby, which may result in a higher confidence level for the intent determination). The confidence level may be assigned based on various factors, as described in the above-referenced U.S. Patent Applications. As such, a decisional operation 350 may include determining whether the intent determination of the input device meets an acceptable level of confidence. When the intent determination meets the acceptable level of confidence, processing may proceed directly to an operation 380 where action may be taken in response thereto. For example, when the intent determination indicates that the user has requested certain information, one or more queries may be formulated to retrieve the information from appropriate information sources, which may include one or more of the other devices. In another example, when the intent determination indicates that the user has requested a given command (e.g., to control a specific device), the command may be routed to the appropriate device for execution.

Thus, in cases where the input device can determine the intent of a natural language input without assistance from the central device or the secondary devices, communications and processing resources may be conserved by taking immediate action as may be appropriate. On the other hand, when the intent determination of the input device does not meet the acceptable level of confidence, decisional operation 350 may result in the input device requesting assistance from the central device in operation 360. In such a case, the multi-modal natural language input may be communicated to the central device in its entirety, whereby the central device processes the input according to techniques described in FIG. 4. However, should transmission to the central device fail for some reason, the input device may shift into a disconnected peer-to-peer mode where one or more secondary devices may be utilized, as will be described below in relation to FIG. 7. When transmission to the central device occurs without incident, however, the input device may receive an intent determination from the central device in an operation 370, and may further receive results of one or more requests that the central device was able to resolve, or requests that the central device has formulated for further processing on the input device. As such, the input device may take action in operation 380 based on the information received from the central device in operation 370. For example, the input device may route queries or commands to local or remote information sources or devices based on the intent determination, or may present results of the requests processed by the central device to the user.
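
By way of non-limiting illustration, the control flow of FIG. 3 at the input device may be sketched as follows. The names `Intent`, `CentralUnavailable`, and `process_at_input_device`, the callable parameters, and the threshold value of 0.8 are all hypothetical. The recognizers are passed in as plain functions so that the sketch stays self-contained. Selecting the more confident of the local and peer results in the disconnected case is an assumption of this sketch rather than a requirement of the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Intent:
    label: str
    confidence: float  # assumed to lie in [0, 1]
    source: str        # which device produced the determination


class CentralUnavailable(Exception):
    """Raised by ask_central when the central device cannot be reached."""


def process_at_input_device(
    utterance: str,
    local_recognizer: Callable[[str], Intent],
    ask_central: Callable[[str], Intent],
    ask_peers: Callable[[str], Optional[Intent]],
    auto_route_to_central: bool = False,
    threshold: float = 0.8,
) -> Optional[Intent]:
    local = None
    if not auto_route_to_central:            # decisional operation 320
        local = local_recognizer(utterance)  # operations 330 and 340
        if local.confidence >= threshold:    # decisional operation 350
            return local                     # act on it in operation 380
    try:
        return ask_central(utterance)        # operations 360 and 370
    except CentralUnavailable:
        # Disconnected peer-to-peer mode (assumed tie-break: best confidence).
        peer = ask_peers(utterance)
        candidates = [i for i in (local, peer) if i is not None]
        if not candidates:
            return None
        return max(candidates, key=lambda i: i.confidence)
```

With `auto_route_to_central=True` the central device acts as a recognizer of first resort. With the default it acts as a recognizer of last resort, consulted only when the local confidence falls below the threshold.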

Referring to FIG. 4, the central device may receive the multi-modal natural language input from the input device in an operation 410. The central device, having aggregated context and other knowledge from throughout the environment, may thus transcribe the utterance in an operation 420 and determine an intent of the input from the transcribed utterance in an operation 430. As such, the central device may consider information relating to context, domain agents, applications, and device capabilities throughout the environment in determining the intent of the utterance, including identification of one or more domains relevant to the input. However, it will be apparent that utilizing information aggregated from throughout the environment may cause ambiguity or uncertainty in various instances (e.g., an utterance containing the word “traffic” may have a different intent in domains relating to movies, music, and navigation).

As such, once the central device has attempted to determine the intent of the natural language input, a determination may be made in an operation 440 as to whether one or more secondary devices (i.e., other devices in the constellation besides the input device) may also be capable of intent determination in the identified domain or domains. When no such secondary devices can be identified, decisional operation 440 may branch directly to an operation 480 to send to the input device the determined intent and any commands, queries, or other requests identified from the determined intent.

On the other hand, when one or more secondary devices in the environment have intent determination capabilities in the identified domain or domains, the natural language input may be sent to such secondary devices in an operation 450. The secondary devices may then determine an intent as illustrated in FIG. 5, which may include techniques generally similar to those described above in relation to the input device and central device (i.e., the natural language input may be received in an operation 510, an utterance contained therein may be transcribed in an operation 520, and an intent determination made in an operation 530 may be returned to the central device in an operation 540).

Returning to FIG. 4, the central device may collate intent determination responses received from the secondary devices in an operation 460. For example, as indicated above, the central device may identify one or more secondary devices capable of determining intent in a domain that the central device has identified as being relevant to the natural language utterance. As will be apparent, the secondary devices invoked in operation 450 may often include a plurality of devices, and intent determination responses may be received from the secondary devices in an interleaved manner, depending on processing resources, communications throughput, or other factors (e.g., the secondary devices may include a telematics device having a large amount of processing power and a broadband network connection and an embedded mobile phone having less processing power and only a cellular connection, in which case the telematics device may be highly likely to provide results to the central device before the embedded mobile phone). Thus, based on potential variations in response time of secondary devices, the central device may be configured to place constraints on collating operation 460. For example, the collating operation 460 may be terminated as soon as an intent determination has been received from one of the secondary devices that meets an acceptable level of confidence, or the operation 460 may be cut off when a predetermined amount of time has lapsed or a predetermined amount of resources have been consumed. In other implementations, however, it will be apparent that collating operation 460 may be configured to run to completion, regardless of whether delays have occurred or suitable intent determinations have been received. Further, it will be apparent that various criteria may be used to determine whether or when to end the collating operation 460, including the nature of a given natural language input, dialogue, or other interaction, or system or user preferences, among other criteria.
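
By way of non-limiting illustration, the constraints that may be placed on collating operation 460 may be sketched as follows. The function name `collate`, the parameters `accept_at` and `max_wait_s`, and the tuple layout of a response are hypothetical. The sketch simulates interleaved arrival by sorting on a recorded arrival time rather than performing real network input or output.

```python
from typing import Iterable, List, Optional, Tuple

# (device, intent, confidence, arrival time in seconds); assumed layout
Response = Tuple[str, str, float, float]


def collate(
    responses: Iterable[Response],
    accept_at: Optional[float] = None,
    max_wait_s: Optional[float] = None,
) -> List[Response]:
    """Collect responses in arrival order under optional cut-off constraints.

    accept_at:  stop as soon as one response reaches this confidence.
    max_wait_s: ignore responses arriving after this amount of time.
    With both set to None the operation runs to completion.
    """
    collected: List[Response] = []
    for resp in sorted(responses, key=lambda r: r[3]):
        _, _, confidence, arrived = resp
        if max_wait_s is not None and arrived > max_wait_s:
            break
        collected.append(resp)
        if accept_at is not None and confidence >= accept_at:
            break
    return collected


responses = [
    ("embedded_phone", "play_music", 0.60, 2.5),
    ("telematics", "navigate", 0.85, 0.4),
    ("home_media", "play_music", 0.70, 1.0),
]
```

In this sketch, calling `collate(responses, accept_at=0.8)` stops after the telematics response, while `collate(responses, max_wait_s=1.5)` drops only the slower embedded phone.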

In any event, when the collating operation 460 has completed, a subsequent operation 470 may include the central device arbitrating among the intent determination responses received from one or more of the secondary devices previously invoked in operation 450. For example, each of the invoked secondary devices that generate an intent determination may also assign a confidence level to that intent determination, and the central device may consider the confidence levels in arbitrating among the responses. Moreover, the central device may associate other criteria with the secondary devices or the intent determinations received from the secondary devices to further enhance a likelihood that the best intent determination will be used. For example, various ones of the secondary devices may only be invoked for partial recognition in distinct domains, and the central device may aggregate and arbitrate the partial recognitions to create a complete transcription. In another example, a plurality of secondary devices may be invoked to perform overlapping intent determination, and the central device may consider capabilities of the secondary devices to weigh the respective confidence levels (e.g., when one of two otherwise identical secondary devices employs multi-pass speech recognition analysis, the secondary device employing the multi-pass speech recognition analysis may be weighed as having a higher likelihood of success). It will be apparent that the central device may be configured to arbitrate and select one intent determination from among all of the intent hypotheses, which may include the intent determination hypothesis generated by the central device in operation 430. Upon selecting the best intent determination hypothesis, the central device may then provide that intent determination to the input device in operation 480, as well as any commands, queries, or other requests that may be relevant thereto. The input device may then take appropriate action as described above in relation to FIG. 3.
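
By way of non-limiting illustration, arbitration operation 470 may be sketched as a weighted comparison of confidence levels. The function name `arbitrate`, the `capability_weight` mapping, and the multiplier of 1.2 are hypothetical. Multiplying the reported confidence by a per-device weight is only one assumed way of weighing device capabilities, and the disclosure does not prescribe a particular formula.

```python
from typing import Dict, List, Optional, Tuple

# (device, intent, confidence); assumed layout
Hypothesis = Tuple[str, str, float]


def arbitrate(
    hypotheses: List[Hypothesis],
    capability_weight: Optional[Dict[str, float]] = None,
) -> Hypothesis:
    """Select one hypothesis, weighing confidence by device capability.

    Devices missing from capability_weight receive a neutral weight of 1.0.
    """
    weights = capability_weight or {}

    def score(h: Hypothesis) -> float:
        device, _, confidence = h
        return confidence * weights.get(device, 1.0)

    return max(hypotheses, key=score)


hypotheses = [
    ("central", "navigate", 0.70),            # hypothesis from operation 430
    ("single_pass_device", "play_music", 0.75),
    ("multi_pass_device", "navigate", 0.72),
]
# Assumed boost for the device employing multi-pass speech recognition.
chosen = arbitrate(hypotheses, {"multi_pass_device": 1.2})
```

Without the weight the single-pass device would win on raw confidence alone. With the weight the multi-pass device is selected, in keeping with the example above in which that device is weighed as having a higher likelihood of success.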

According to various aspects of the invention, FIG. 6 illustrates a block diagram of an exemplary distributed implementation of the integrated, multi-modal, multi-device natural language voice service environment. As described above, the distributed implementation may also be categorized as a disconnected or peer-to-peer mode that may be employed when a central device in a centralized implementation cannot be reached or otherwise does not meet the needs of the environment. The distributed implementation illustrated in FIG. 6 may generally operate with similar purposes as described above in relation to the centralized implementation (i.e., to ensure that the environment includes a comprehensive model of aggregate knowledge and capabilities of a plurality of devices 610a-n in the environment). Nonetheless, the distributed implementation may operate in a somewhat different manner, in that one or more of the devices 610a-n may be provided with the entire constellation model, or various aspects of the model may be distributed among the plurality of devices 610a-n, or various combinations thereof.

Generally speaking, the plurality of voice-enabled devices 610a-n may be coupled to one another by a voice services interface 630, which may include any suitable real or virtual interface (e.g., a common message bus or network interface, a service-oriented abstraction layer, etc.). The various devices 610a-n may therefore operate as cooperative nodes in determining intent for multi-modal natural language utterances received by any one of the devices 610. Furthermore, the devices 610a-n may share knowledge of vocabularies, context, capabilities, and other information, while certain forms of data may be synchronized to ensure consistent processing among the devices 610a-n. For example, because natural language processing components used in the devices 610a-n may vary (e.g., different recognition grammars or speech recognition techniques may exist), vocabulary translation mechanisms, misrecognitions, context variables, criteria values, criteria handlers, and other such information used in the intent determination process should be synchronized to the extent that communication capabilities permit.

By sharing intent determination capabilities, device capabilities, inferencing capabilities, domain knowledge, and other information, decisions as to routing an utterance to a specific one of the devices 610a-n may be made locally (e.g., at an input device), collaboratively (e.g., a device having particular capabilities relevant to the utterance may communicate a request to process the utterance), or various combinations thereof (e.g., the input device may consider routing to secondary devices only when an intent of the utterance cannot be determined). Similarly, partial recognition performed at one or more of the devices 610a-n may be used to determine routing strategies for further intent determination of the utterance. For example, an utterance that contains a plurality of requests relating to a plurality of different domains may be received at an input device that can only determine intent in one of the domains. In this example, the input device may perform partial recognition for the domain associated with the input device, and the partial recognition may also identify the other domains relevant to the utterance for which the input device does not have sufficient recognition information. Thus, the partial recognition performed by the input device may result in identification of other potentially relevant domains and a strategy may be formulated to route the utterance to other devices in the environment that include recognition information for those domains.
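
By way of non-limiting illustration, a routing strategy derived from partial recognition may be sketched as follows. The function name `plan_routing` and the `device_domains` mapping are hypothetical. The sketch assumes that partial recognition at the input device has already produced a list of domains relevant to the utterance.

```python
from typing import Dict, List, Set, Tuple


def plan_routing(
    detected_domains: List[str],
    input_id: str,
    device_domains: Dict[str, Set[str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Map each detected domain to the devices that should process it.

    Domains the input device supports stay local. Other domains are routed
    to peers holding recognition information for them. Domains that no
    device supports are returned separately as unresolved.
    """
    local = device_domains.get(input_id, set())
    plan: Dict[str, List[str]] = {}
    unresolved: List[str] = []
    for domain in detected_domains:
        if domain in local:
            plan[domain] = [input_id]
            continue
        peers = sorted(
            dev
            for dev, domains in device_domains.items()
            if dev != input_id and domain in domains
        )
        if peers:
            plan[domain] = peers
        else:
            unresolved.append(domain)
    return plan, unresolved


device_domains = {
    "phone": {"telephony"},
    "nav_unit": {"navigation"},
    "media_system": {"music"},
}
plan, unresolved = plan_routing(
    ["telephony", "navigation", "weather"], "phone", device_domains
)
```

In this sketch the telephony request stays on the phone, the navigation request is routed to the navigation unit, and the weather request remains unresolved because no device in the assumed environment supports that domain.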

As a result, multi-modal natural language inputs, including natural language utterances, may be routed among the various devices 610a-n in order to perform intent determination in a distributed manner. However, as the capabilities and knowledge held by any one of the devices 610a-n may vary, each of the devices 610a-n may be associated with a reliability factor for intent determinations generated by the respective devices 610a-n. As such, to ensure that final intent determinations can be relied upon with a sufficient level of confidence, knowledge may be distributed among the devices 610a-n to ensure that reliability metrics for intent determinations provided by each of the devices 610a-n are commensurable throughout the environment. For example, additional knowledge may be provided to a device having a low intent determination reliability, even when such knowledge results in redundancy in the environment, to ensure commensurate reliability of intent determination environment-wide.

Therefore, in distributed implementations of the integrated voice services environment, utterances may be processed in various ways, which may depend on circumstances at a given time (e.g., system states, system or user preferences, etc.). For example, an utterance may be processed locally at an input device and only routed to secondary devices when an intent determination confidence level falls below a given threshold. In another example, utterances may be routed to a specific device based on the modeling of knowledge and capabilities discussed above. In yet another example, utterances may be flooded among all of the devices in the environment, and arbitration may occur whereby intent determinations may be collated and arbitrated to determine a best guess at intent determination.

Thus, utterances may be processed in various ways, including through local techniques, centralized techniques, distributed techniques, and various combinations thereof. Although many variations will be apparent, FIG. 7 illustrates an exemplary method for combined local and distributed processing of multi-modal, natural language inputs in a distributed implementation of the voice service environment, according to various aspects of the invention. In particular, the distributed processing may begin in an operation 710, where a multi-modal natural language input may be received at an input device. The input device may then utilize various natural language processing capabilities associated therewith in an operation 720 to transcribe an utterance contained in the multi-modal input (e.g., using an Automatic Speech Recognizer and associated recognition grammars), and may subsequently determine a preliminary intent of the multi-modal natural language input in an operation 730. It will be apparent that operations 710 through 730 may generally be performed using local intent determination capabilities associated with the input device.

Thereafter, the input device may invoke intent determination capabilities of one or more secondary devices in an operation 740. More particularly, the input device may provide information associated with the multi-modal natural language input to one or more of the secondary devices, which may utilize local intent determination capabilities to attempt to determine intent of the input using techniques as described in relation to FIG. 5. It will also be apparent that, in various implementations, the secondary devices invoked in operation 740 may include only devices having intent determination capabilities associated with a specific domain identified as being associated with the input. In any event, the input device may receive intent determinations from the invoked secondary devices in an operation 750, and the input device may then collate the intent determinations received from the secondary devices. The input device may then arbitrate among the various intent determinations, or may combine various ones of the intent determinations (e.g., when distinct secondary devices determine intent in distinct domains), or otherwise arbitrate among the intent determinations to determine a best guess at the intent of the multi-modal natural language input (e.g., based on confidence levels associated with the various intent determinations). Based on the determined intent, the input device may then take appropriate action in an operation 770, such as issuing one or more commands, queries, or other requests to be executed at one or more of the input device or the secondary devices.
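
By way of non-limiting illustration, combining intent determinations that distinct secondary devices return for distinct domains may be sketched as follows. The function name `combine_by_domain` and the tuple layout are hypothetical. Keeping the most confident determination for each domain is an assumption of this sketch, as the disclosure leaves the arbitration criteria open.

```python
from typing import Dict, List, Tuple

# (device, domain, intent, confidence); assumed layout
Determination = Tuple[str, str, str, float]


def combine_by_domain(
    determinations: List[Determination],
) -> Dict[str, Tuple[str, str, float]]:
    """Keep the most confident determination for each domain.

    Returns a mapping of domain to (device, intent, confidence). When two
    determinations for a domain tie, the one received first is kept.
    """
    best: Dict[str, Tuple[str, str, float]] = {}
    for device, domain, intent, confidence in determinations:
        if domain not in best or confidence > best[domain][2]:
            best[domain] = (device, intent, confidence)
    return best


received = [
    ("input_device", "music", "play_song", 0.55),    # preliminary intent
    ("media_system", "music", "play_song", 0.90),
    ("nav_unit", "navigation", "route_home", 0.80),
]
combined = combine_by_domain(received)
```

Here `combined` holds one entry for the music domain, taken from the media system, and one for the navigation domain, so that a multi-domain input may result in more than one request being issued in operation 770.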

Furthermore, in addition to the exemplary implementations described above, various implementations may include a continuous listening mode of operation where a plurality of devices may continuously listen for multi-modal voice-based inputs. In the continuous listening mode, each of the devices in the environment may be triggered to accept a multi-modal input when one or more predetermined events occur. For example, the devices may each be associated with one or more attention words, such as “Phone, <multi-modal request>” for a mobile phone, or “Computer, <multi-modal request>” for a personal computer. When one or more of the devices in the environment recognize the associated attention word, keyword activation may result, where the associated devices trigger to accept the subsequent multi-modal request. Further, where a plurality of devices in a constellation may be listening, the constellation may use all available inputs to increase recognition rates.

Moreover, it will be apparent that the continuous listening mode may be applied in centralized voice service environments, distributed voice service environments, or various combinations thereof. For example, when each device in the constellation has a different attention word, any given device that recognizes an attention word may consult a constellation model to determine a target device or functionality associated with the attention word. In another example, when a plurality of devices in the constellation share one or more attention words, the plurality of devices may coordinate with one another to synchronize information for processing the multi-modal input, such as a start time for an utterance contained therein.
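
By way of non-limiting illustration, resolving an attention word to a target device may be sketched as follows. The function name `target_for` and the `attention_words` mapping are hypothetical. The sketch operates on an already transcribed utterance and assumes that the attention word is separated from the request by a comma, following the "Phone, <multi-modal request>" pattern described above.

```python
from typing import Dict, Optional, Tuple


def target_for(
    transcription: str,
    attention_words: Dict[str, str],
) -> Optional[Tuple[str, str]]:
    """Return (target device, request) when a known attention word leads.

    attention_words maps a lower-case attention word to a device identifier,
    standing in for a lookup against the constellation model.
    """
    head, separator, rest = transcription.partition(",")
    if not separator:
        return None  # no attention word pattern, so no keyword activation
    device = attention_words.get(head.strip().lower())
    if device is None:
        return None  # leading word is not a registered attention word
    return device, rest.strip()


attention_words = {"phone": "mobile_phone", "computer": "personal_computer"}
result = target_for("Phone, call home", attention_words)
```

In this sketch `result` identifies the mobile phone as the target and "call home" as the subsequent request, whereas an utterance without a registered attention word triggers nothing.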

Implementations of the invention may be made in hardware, firmware, software, or various combinations thereof. The invention may also be implemented as instructions stored on a machine-readable medium, which may be read and executed by one or more processors. A machine-readable medium may include various mechanisms for storing or transmitting information in a form readable by a machine (e.g., a computing device). For example, a machine-readable storage medium may include read only memory, random access memory, magnetic disk storage media, optical storage media, flash memory devices, and others, and a machine-readable transmission medium may include forms of propagated signals, such as carrier waves, infrared signals, digital signals, and others. Further, firmware, software, routines, or instructions may be described in the above disclosure in terms of specific exemplary aspects and implementations of the invention, and as performing certain actions. However, it will be apparent that such descriptions are merely for convenience and that such actions in fact result from computing devices, processors, controllers, or other devices executing the firmware, software, routines, or instructions.

Aspects and implementations may be described as including a particular feature, structure, or characteristic, but every aspect or implementation may not necessarily include the particular feature, structure, or characteristic. Further, when a particular feature, structure, or characteristic has been described in connection with an aspect or implementation, it will be understood that such feature, structure, or characteristic may be included in connection with other aspects or implementations, whether or not explicitly described. Thus, various changes and modifications may be made to the preceding description without departing from the scope or spirit of the invention, and the specification and drawings should therefore be regarded as exemplary only, and the scope of the invention determined solely by the appended claims.

Claims

1. A method of providing an integrated multi-modal, natural language voice services environment comprising one or more of an input device that receives a multi-modal natural language input comprising at least a natural language utterance and a non-voice input related to the natural language utterance, a first device, or one or more secondary devices, the method being implemented in the first device having one or more physical processors programmed with computer program instructions that, when executed by the one or more physical processors, program the first device to perform the method, wherein the one or more secondary devices include at least a second device, the method comprising: obtaining, by the first device from the input device, the multi-modal natural language input; transcribing, by the first device, the natural language utterance; determining, by the first device, a preliminary intent prediction of the multi-modal natural language input based on the transcribed utterance and the non-voice input; transmitting, by the first device, the multi-modal natural language input to the second device; receiving, by the first device from the second device, a second intent prediction of the multi-modal natural language input; determining, by the first device, an intent of the multi-modal natural language input based on the preliminary intent prediction and the second intent prediction; and invoking, by the first device, at least one action at one or more of the input device, the first device, or the one or more secondary devices based on the determined intent.
2. The method of claim 1, wherein invoking the at least one action at one or more of the input device, the first device, or the one or more secondary devices comprises transmitting a request related to the multi-modal natural language input based on the preliminary intent prediction.
3. The method of claim 1, the method further comprising: determining, by the first device, processing capabilities associated with the one or more secondary devices; and selecting, by the first device, based on the processing capabilities associated with the one or more secondary devices, the second device to make the second intent prediction of the multi-modal natural language input.
4. The method of claim 3, the method further comprising: maintaining, by the first device, a constellation model that describes natural language resources, dynamic states, and intent determination capabilities associated with the input device and the one or more secondary devices, wherein the processing capabilities associated with the one or more secondary devices are determined based on the constellation model.
5. The method of claim 4, wherein the intent determination capabilities for a given one of the input device, the first device, or the one or more secondary devices are based on at least one of processing power, storage resources, natural language processing capabilities, or local knowledge.
6. The method of claim 1, the method further comprising: determining, by the first device, a domain relating to the multi-modal natural language input; and selecting, by the first device, based on the domain, the second device to make the second intent prediction of the multi-modal natural language input.
7. The method of claim 6, wherein the one or more secondary devices are associated with different domains, the second device is associated with the domain, and the different domains comprise the domain.
8. The method of claim 1, wherein the input device initially received the multi-modal natural language input.
9. A method of providing an integrated multi-modal, natural language voice services environment comprising one or more of an input device that receives a multi-modal natural language input comprising at least a natural language utterance and a non-voice input related to the natural language utterance, a first device, or one or more secondary devices, the method being implemented in the first device having one or more physical processors programmed with computer program instructions that, when executed by the one or more physical processors, program the first device to perform the method, the method comprising: obtaining, by the first device from the input device, the multi-modal natural language input; transcribing, by the first device, the natural language utterance; determining, by the first device, a preliminary intent prediction of the multi-modal natural language input based on the transcribed utterance and the non-voice input; communicating, by the first device, the multi-modal natural language input to each of the one or more secondary devices, wherein each of the one or more secondary devices determines an intent of the multi-modal natural language input received at the input device using local intent determination capabilities; receiving, by the first device, an intent determination from each of the secondary devices; and arbitrating, by the first device, among the intent determinations received from each of the secondary devices to determine an intent of the multi-modal natural input; and invoking, by the first device, at least one action at one or more of the input device, the first device, or the one or more secondary devices based on the determined intent.
10. A system for processing a multi-modal natural language input, the system comprising: an input device that receives a multi-modal natural language input comprising at least a natural language utterance and a non-voice input related to the natural language utterance; one or more secondary devices, wherein the one or more secondary devices include at least a second device, and a first device having one or more physical processors programmed with computer program instructions that, when executed by the one or more physical processors, program the first device to: obtain, from the input device, the multi-modal natural language input; transcribe the natural language utterance; determine a preliminary intent prediction of the multi-modal natural language input based on the transcribed utterance and the non-voice input; and transmit the multi-modal natural language input to the second device; receive, from the second device, a second intent prediction of the multi-modal natural language input; determine an intent of the multi-modal natural language input based on the preliminary intent prediction and the second intent prediction; and invoke at least one action at one or more of the input device, the first device, or the one or more secondary devices based on the determined intent.
11. The system of claim 10, wherein to invoke the at least one action at one or more of the input device, the first device, or the one or more secondary devices, the first device is further programmed to: transmit a request related to the multi-modal natural language input based on the preliminary intent prediction.
12. The system of claim 10, wherein the first device is further programmed to: determine processing capabilities associated with the one or more secondary devices; and select based on the processing capabilities associated with the one or more secondary devices, the second device to make the second intent prediction of the multi-modal natural language input.
13. The system of claim 12, wherein the first device is further programmed to: maintain a constellation model that describes natural language resources, dynamic states, and intent determination capabilities associated with the input device and the one or more secondary devices, wherein the processing capabilities associated with the one or more secondary devices are determined based on the constellation model.
14. The system of claim 13, wherein the intent determination capabilities for a given one of the input device, the first device, or the one or more secondary devices are based on at least one of processing power, storage resources, natural language processing capabilities, or local knowledge.
15. The system of claim 10, wherein the first device is further programmed to: determine a domain relating to the multi-modal natural language input; and select, based on the domain, the second device to make the second intent prediction of the multi-modal natural language input.
16. The system of claim 15, wherein the one or more secondary devices are associated with different domains, the second device is associated with the domain, and the different domains comprise the domain.
17. The system of claim 10, wherein the input device initially received the multi-modal natural language input.
18. A system for processing a multi-modal natural language input, the system comprising: an input device that receives a multi-modal natural language input comprising at least a natural language utterance and a non-voice input related to the natural language utterance; one or more secondary devices; and a first device having one or more physical processors programmed with computer program instructions that, when executed by the one or more physical processors, program the first device to: obtain, from the input device, the multi-modal natural language input; transcribe the natural language utterance; determine a preliminary intent prediction of the multi-modal natural language input based on the transcribed utterance and the non-voice input; communicate the multi-modal natural language input to each of the one or more secondary devices, wherein each of the one or more secondary devices determines an intent of the multi-modal natural language input received at the input device using local intent determination capabilities; receive an intent determination from each of the secondary devices; and arbitrate among the intent determinations received from each of the secondary devices to determine an intent of the multi-modal natural input, invoke at least one action at one or more of the input device, the first device, or the one or more secondary devices based on the determined intent.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. patent application Ser. No. 14/083,061, entitled “System and Method for an Integrated, Multi-Modal, Multi-Device Natural Language Voice Services Environment,” filed Nov. 18, 2013, which is a continuation of U.S. patent application Ser. No. 12/127,343, entitled “System and Method for an Integrated, Multi-Modal, Multi-Device Natural Language Voice Services Environment,” filed May 27, 2008 (which issued as U.S. Pat. No. 8,589,161 on Nov. 19, 2013), the content of which is hereby incorporated by reference in its entirety.

US Referenced Citations (865)

Number	Name	Date	Kind
4430669	Cheung	Feb 1984	A
4821027	Mallory	Apr 1989	A
4829423	Tennant	May 1989	A
4887212	Zamora	Dec 1989	A
4910784	Doddington	Mar 1990	A
5027406	Roberts	Jun 1991	A
5155743	Jacobs	Oct 1992	A
5164904	Sumner	Nov 1992	A
5208748	Flores	May 1993	A
5265065	Turtle	Nov 1993	A
5274560	LaRue	Dec 1993	A
5357596	Takebayashi	Oct 1994	A
5369575	Lamberti	Nov 1994	A
5377350	Skinner	Dec 1994	A
5386556	Hedin	Jan 1995	A
5424947	Nagao	Jun 1995	A
5471318	Ahuja	Nov 1995	A
5475733	Eisdorfer	Dec 1995	A
5479563	Yamaguchi	Dec 1995	A
5488652	Bielby	Jan 1996	A
5499289	Bruno	Mar 1996	A
5500920	Kupiec	Mar 1996	A
5517560	Greenspan	May 1996	A
5533108	Harris	Jul 1996	A
5537436	Bottoms	Jul 1996	A
5539744	Chu	Jul 1996	A
5557667	Bruno	Sep 1996	A
5559864	Kennedy, Jr.	Sep 1996	A
5563937	Bruno	Oct 1996	A
5577165	Takebayashi	Nov 1996	A
5590039	Ikeda	Dec 1996	A
5608635	Tamai	Mar 1997	A
5615296	Stanford	Mar 1997	A
5617407	Bareis	Apr 1997	A
5633922	August	May 1997	A
5634086	Rtischev	May 1997	A
5652570	Lepkofker	Jul 1997	A
5675629	Raffel	Oct 1997	A
5696965	Dedrick	Dec 1997	A
5708422	Blonder	Jan 1998	A
5721938	Stuckey	Feb 1998	A
5722084	Chakrin	Feb 1998	A
5740256	CastelloDaCosta	Apr 1998	A
5742763	Jones	Apr 1998	A
5748841	Morin	May 1998	A
5748974	Johnson	May 1998	A
5752052	Richardson	May 1998	A
5754784	Garland	May 1998	A
5761631	Nasukawa	Jun 1998	A
5774841	Salazar	Jun 1998	A
5774859	Houser	Jun 1998	A
5794050	Dahlgren	Aug 1998	A
5794196	Yegnanarayanan	Aug 1998	A
5797112	Komatsu	Aug 1998	A
5799276	Komissarchik	Aug 1998	A
5802510	Jones	Sep 1998	A
5829000	Huang	Oct 1998	A
5832221	Jones	Nov 1998	A
5839107	Gupta	Nov 1998	A
5848396	Gerace	Dec 1998	A
5855000	Waibel	Dec 1998	A
5867817	Catallo	Feb 1999	A
5878385	Bralich	Mar 1999	A
5878386	Coughlin	Mar 1999	A
5892813	Morin	Apr 1999	A
5892900	Ginter	Apr 1999	A
5895464	Bhandari	Apr 1999	A
5895466	Goldberg	Apr 1999	A
5897613	Chan	Apr 1999	A
5899991	Karch	May 1999	A
5902347	Backman	May 1999	A
5911120	Jarett	Jun 1999	A
5918222	Fukui	Jun 1999	A
5926784	Richardson	Jul 1999	A
5933822	Braden-Harder	Aug 1999	A
5950167	Yaker	Sep 1999	A
5953393	Culbreth	Sep 1999	A
5960384	Brash	Sep 1999	A
5960397	Rahim	Sep 1999	A
5960399	Barclay	Sep 1999	A
5960447	Holt	Sep 1999	A
5963894	Richardson	Oct 1999	A
5963940	Liddy	Oct 1999	A
5983190	Trower, II	Nov 1999	A
5987404	DellaPietra	Nov 1999	A
5991721	Asano	Nov 1999	A
5995119	Cosatto	Nov 1999	A
5995928	Nguyen	Nov 1999	A
5995943	Bull	Nov 1999	A
6009382	Martino	Dec 1999	A
6014559	Amin	Jan 2000	A
6018708	Dahan	Jan 2000	A
6021384	Gorin	Feb 2000	A
6028514	Lemelson	Feb 2000	A
6035267	Watanabe	Mar 2000	A
6044347	Abella	Mar 2000	A
6049602	Foladare	Apr 2000	A
6049607	Marash	Apr 2000	A
6058187	Chen	May 2000	A
6067513	Ishimitsu	May 2000	A
6073098	Buchsbaum	Jun 2000	A
6076059	Glickman	Jun 2000	A
6078886	Dragosh	Jun 2000	A
6081774	deHita	Jun 2000	A
6085186	Christianson	Jul 2000	A
6101241	Boyce	Aug 2000	A
6108631	Ruhl	Aug 2000	A
6119087	Kuhn	Sep 2000	A
6119101	Peckover	Sep 2000	A
6122613	Baker	Sep 2000	A
6134235	Goldman	Oct 2000	A
6144667	Doshi	Nov 2000	A
6144938	Surace	Nov 2000	A
6154526	Dahlke	Nov 2000	A
6160883	Jackson	Dec 2000	A
6167377	Gillick	Dec 2000	A
6173266	Marx	Jan 2001	B1
6173279	Levin	Jan 2001	B1
6175858	Bulfer	Jan 2001	B1
6185535	Hedin	Feb 2001	B1
6188982	Chiang	Feb 2001	B1
6192110	Abella	Feb 2001	B1
6192338	Haszto	Feb 2001	B1
6195634	Dudemaine	Feb 2001	B1
6195651	Handel	Feb 2001	B1
6199043	Happ	Mar 2001	B1
6208964	Sabourin	Mar 2001	B1
6208972	Grant	Mar 2001	B1
6219346	Maxemchuk	Apr 2001	B1
6219643	Cohen	Apr 2001	B1
6226612	Srenger	May 2001	B1
6233556	Teunen	May 2001	B1
6233559	Balakrishnan	May 2001	B1
6233561	Junqua	May 2001	B1
6236968	Kanevsky	May 2001	B1
6246981	Papineni	Jun 2001	B1
6246990	Happ	Jun 2001	B1
6266636	Kosaka	Jul 2001	B1
6269336	Ladd	Jul 2001	B1
6272455	Hoshen	Aug 2001	B1
6272461	Meredith	Aug 2001	B1
6275231	Obradovich	Aug 2001	B1
6278377	DeLine	Aug 2001	B1
6278968	Franz	Aug 2001	B1
6286002	Axaopoulos	Sep 2001	B1
6288319	Catona	Sep 2001	B1
6292767	Jackson	Sep 2001	B1
6301560	Masters	Oct 2001	B1
6308151	Smith	Oct 2001	B1
6311159	VanTichelen	Oct 2001	B1
6314402	Monaco	Nov 2001	B1
6321196	Franceschi	Nov 2001	B1
6356869	Chapados	Mar 2002	B1
6362748	Huang	Mar 2002	B1
6366882	Bijl	Apr 2002	B1
6366886	Dragosh	Apr 2002	B1
6374214	Friedland	Apr 2002	B1
6374226	Hunt	Apr 2002	B1
6377913	Coffman	Apr 2002	B1
6381535	Durocher	Apr 2002	B1
6385596	Wiser	May 2002	B1
6385646	Brown	May 2002	B1
6393403	Majaniemi	May 2002	B1
6393428	Miller	May 2002	B1
6397181	Li	May 2002	B1
6404878	Jackson	Jun 2002	B1
6405170	Phillips	Jun 2002	B1
6408272	White	Jun 2002	B1
6411810	Maxemchuk	Jun 2002	B1
6411893	Ruhl	Jun 2002	B2
6415257	Junqua	Jul 2002	B1
6418210	Sayko	Jul 2002	B1
6420975	DeLine	Jul 2002	B1
6429813	Feigen	Aug 2002	B2
6430285	Bauer	Aug 2002	B1
6430531	Polish	Aug 2002	B1
6434523	Monaco	Aug 2002	B1
6434524	Weber	Aug 2002	B1
6434529	Walker	Aug 2002	B1
6442522	Carberry	Aug 2002	B1
6446114	Bulfer	Sep 2002	B1
6453153	Bowker	Sep 2002	B1
6453292	Ramaswamy	Sep 2002	B2
6456711	Cheung	Sep 2002	B1
6456974	Baker	Sep 2002	B1
6466654	Cooper	Oct 2002	B1
6466899	Yano	Oct 2002	B1
6470315	Netsch	Oct 2002	B1
6487494	Odinak	Nov 2002	B2
6487495	Gale	Nov 2002	B1
6498797	Anerousis	Dec 2002	B1
6499013	Weber	Dec 2002	B1
6501833	Phillips	Dec 2002	B2
6501834	Milewski	Dec 2002	B1
6505155	Vanbuskirk	Jan 2003	B1
6510417	Woods	Jan 2003	B1
6513006	Howard	Jan 2003	B2
6522746	Marchok	Feb 2003	B1
6523061	Halverson	Feb 2003	B1
6532444	Weber	Mar 2003	B1
6539348	Bond	Mar 2003	B1
6549629	Finn	Apr 2003	B2
6553372	Brassell	Apr 2003	B1
6556970	Sasaki	Apr 2003	B1
6556973	Lewin	Apr 2003	B1
6560576	Cohen	May 2003	B1
6560590	Shwe	May 2003	B1
6567778	ChaoChang	May 2003	B1
6567797	Schuetze	May 2003	B1
6567805	Johnson	May 2003	B1
6570555	Prevost	May 2003	B1
6570964	Murveit	May 2003	B1
6571279	Herz	May 2003	B1
6574597	Mohri	Jun 2003	B1
6574624	Johnson	Jun 2003	B1
6578022	Foulger	Jun 2003	B1
6581103	Dengler	Jun 2003	B1
6584439	Geilhufe	Jun 2003	B1
6587858	Strazza	Jul 2003	B1
6591239	McCall	Jul 2003	B1
6594257	Doshi	Jul 2003	B1
6594367	Marash	Jul 2003	B1
6598018	Junqua	Jul 2003	B1
6601026	Appelt	Jul 2003	B2
6601029	Pickering	Jul 2003	B1
6604075	Brown	Aug 2003	B1
6604077	Dragosh	Aug 2003	B2
6606598	Holthouse	Aug 2003	B1
6611692	Raffel	Aug 2003	B2
6614773	Maxemchuk	Sep 2003	B1
6615172	Bennett	Sep 2003	B1
6622119	Ramaswamy	Sep 2003	B1
6629066	Jackson	Sep 2003	B1
6631346	Karaorman	Oct 2003	B1
6631351	Ramachandran	Oct 2003	B1
6633846	Bennett	Oct 2003	B1
6636790	Lightner	Oct 2003	B1
6643620	Contolini	Nov 2003	B1
6647363	Claassen	Nov 2003	B2
6650747	Bala	Nov 2003	B1
6658388	Kleindienst	Dec 2003	B1
6678680	Woo	Jan 2004	B1
6681206	Gorin	Jan 2004	B1
6691151	Cheyer	Feb 2004	B1
6701294	Ball	Mar 2004	B1
6704396	Parolkar	Mar 2004	B2
6704576	Brachman	Mar 2004	B1
6704708	Pickering	Mar 2004	B1
6707421	Drury	Mar 2004	B1
6708150	Hirayama	Mar 2004	B1
6721001	Berstis	Apr 2004	B1
6721633	Funk	Apr 2004	B2
6721706	Strubbe	Apr 2004	B1
6726636	DerGhazarian	Apr 2004	B2
6732088	Glance	May 2004	B1
6735592	Neumann	May 2004	B1
6739556	Langston	May 2004	B1
6741931	Kohut	May 2004	B1
6742021	Halverson	May 2004	B1
6745161	Arnold	Jun 2004	B1
6751591	Gorin	Jun 2004	B1
6751612	Schuetze	Jun 2004	B1
6754485	Obradovich	Jun 2004	B1
6754627	Woodward	Jun 2004	B2
6754647	Tackett	Jun 2004	B1
6757544	Rangarajan	Jun 2004	B2
6757718	Halverson	Jun 2004	B1
6785651	Wang	Aug 2004	B1
6795808	Strubbe	Sep 2004	B1
6801604	Maes	Oct 2004	B2
6801893	Backfried	Oct 2004	B1
6804330	Jones	Oct 2004	B1
6810375	Ejerhed	Oct 2004	B1
6813341	Mahoney	Nov 2004	B1
6816830	Kempe	Nov 2004	B1
6829603	Wolf	Dec 2004	B1
6832230	Zilliacus	Dec 2004	B1
6833848	Wolff	Dec 2004	B1
6850603	Eberle	Feb 2005	B1
6856990	Barile	Feb 2005	B2
6865481	Kawazoe	Mar 2005	B2
6868380	Kroeker	Mar 2005	B2
6868385	Gerson	Mar 2005	B1
6871179	Kist	Mar 2005	B1
6873837	Yoshioka	Mar 2005	B1
6877001	Wolf	Apr 2005	B2
6877134	Fuller	Apr 2005	B1
6901366	Kuhn	May 2005	B1
6910003	Arnold	Jun 2005	B1
6912498	Stevens	Jun 2005	B2
6915126	Mazzara, Jr.	Jul 2005	B2
6928614	Everhart	Aug 2005	B1
6934756	Maes	Aug 2005	B2
6937977	Gerson	Aug 2005	B2
6937982	Kitaoka	Aug 2005	B2
6941266	Gorin	Sep 2005	B1
6944594	Busayapongchai	Sep 2005	B2
6950821	Faybishenko	Sep 2005	B2
6954755	Reisman	Oct 2005	B2
6959276	Droppo	Oct 2005	B2
6961700	Mitchell	Nov 2005	B2
6963759	Gerson	Nov 2005	B1
6964023	Maes	Nov 2005	B2
6968311	Knockeart	Nov 2005	B2
6973387	Masclet	Dec 2005	B2
6975993	Keiller	Dec 2005	B1
6980092	Turnbull	Dec 2005	B2
6983055	Luo	Jan 2006	B2
6990513	Belfiore	Jan 2006	B2
6996531	Korall	Feb 2006	B2
7003463	Maes	Feb 2006	B1
7016849	Arnold	Mar 2006	B2
7020609	Thrift	Mar 2006	B2
7024364	Guerra	Apr 2006	B2
7027586	Bushey	Apr 2006	B2
7027974	Busch	Apr 2006	B1
7027975	Pazandak	Apr 2006	B1
7035415	Belt	Apr 2006	B2
7036128	Julia	Apr 2006	B1
7043425	Pao	May 2006	B2
7054817	Shao	May 2006	B2
7058890	George	Jun 2006	B2
7062488	Reisman	Jun 2006	B1
7069220	Coffman	Jun 2006	B2
7072834	Zhou	Jul 2006	B2
7072888	Perkins	Jul 2006	B1
7076362	Ohtsuji	Jul 2006	B2
7082469	Gold	Jul 2006	B2
7085708	Manson	Aug 2006	B2
7092928	Elad	Aug 2006	B1
7107210	Deng	Sep 2006	B2
7107218	Preston	Sep 2006	B1
7110951	Lemelson	Sep 2006	B1
7127395	Gorin	Oct 2006	B1
7127400	Koch	Oct 2006	B2
7130390	Abburi	Oct 2006	B2
7136875	Anderson	Nov 2006	B2
7137126	Coffman	Nov 2006	B1
7143037	Chestnut	Nov 2006	B1
7143039	Stifelman	Nov 2006	B1
7146319	Hunt	Dec 2006	B2
7149696	Shimizu	Dec 2006	B2
7165028	Gong	Jan 2007	B2
7170993	Anderson	Jan 2007	B2
7171291	Obradovich	Jan 2007	B2
7174300	Bush	Feb 2007	B2
7177798	Hsu	Feb 2007	B2
7184957	Brookes	Feb 2007	B2
7190770	Ando	Mar 2007	B2
7197069	Agazzi	Mar 2007	B2
7197460	Gupta	Mar 2007	B1
7203644	Anderson	Apr 2007	B2
7206418	Yang	Apr 2007	B2
7207011	Mulvey	Apr 2007	B2
7215941	Beckmann	May 2007	B2
7228276	Omote	Jun 2007	B2
7231343	Treadgold	Jun 2007	B1
7236923	Gupta	Jun 2007	B1
7254482	Kawasaki	Aug 2007	B2
7272212	Eberle	Sep 2007	B2
7277854	Bennett	Oct 2007	B2
7283829	Christenson	Oct 2007	B2
7283951	Marchisio	Oct 2007	B2
7289606	Sibal	Oct 2007	B2
7299186	Kuzunuki	Nov 2007	B2
7301093	Sater	Nov 2007	B2
7305381	Poppink	Dec 2007	B1
7321850	Wakita	Jan 2008	B2
7328155	Endo	Feb 2008	B2
7337116	Charlesworth	Feb 2008	B2
7340040	Saylor	Mar 2008	B1
7366285	Parolkar	Apr 2008	B2
7366669	Nishitani	Apr 2008	B2
7376645	Bernard	May 2008	B2
7386443	Parthasarathy	Jun 2008	B1
7398209	Kennewick	Jul 2008	B2
7406421	Odinak	Jul 2008	B2
7415100	Cooper	Aug 2008	B2
7415414	Azara	Aug 2008	B2
7421393	DiFabbrizio	Sep 2008	B1
7424431	Greene	Sep 2008	B2
7447635	Konopka	Nov 2008	B1
7451088	Ehlen	Nov 2008	B1
7454368	Stillman	Nov 2008	B2
7454608	Gopalakrishnan	Nov 2008	B2
7461059	Richardson	Dec 2008	B2
7472020	Brulle-Drews	Dec 2008	B2
7472060	Gorin	Dec 2008	B1
7472075	Odinak	Dec 2008	B2
7477909	Roth	Jan 2009	B2
7478036	Shen	Jan 2009	B2
7487088	Gorin	Feb 2009	B1
7487110	Bennett	Feb 2009	B2
7493259	Jones	Feb 2009	B2
7493559	Wolff	Feb 2009	B1
7502672	Kolls	Mar 2009	B1
7502738	Kennewick	Mar 2009	B2
7516076	Walker	Apr 2009	B2
7529675	Maes	May 2009	B2
7536297	Byrd	May 2009	B2
7536374	Au	May 2009	B2
7542894	Murata	Jun 2009	B2
7546382	Healey	Jun 2009	B2
7548491	Macfarlane	Jun 2009	B2
7552054	Stifelman	Jun 2009	B1
7558730	Davis	Jul 2009	B2
7574362	Walker	Aug 2009	B2
7577244	Taschereau	Aug 2009	B2
7606708	Hwang	Oct 2009	B2
7620549	DiCristo	Nov 2009	B2
7634409	Kennewick	Dec 2009	B2
7640006	Portman	Dec 2009	B2
7640160	DiCristo	Dec 2009	B2
7640272	Mahajan	Dec 2009	B2
7672931	Hurst-Hiller	Mar 2010	B2
7676365	Hwang	Mar 2010	B2
7676369	Fujimoto	Mar 2010	B2
7684977	Morikawa	Mar 2010	B2
7693720	Kennewick	Apr 2010	B2
7697673	Chiu	Apr 2010	B2
7706616	Kristensson	Apr 2010	B2
7729916	Coffman	Jun 2010	B2
7729918	Walker	Jun 2010	B2
7729920	Chaar	Jun 2010	B2
7734287	Ying	Jun 2010	B2
7748021	Obradovich	Jun 2010	B2
7788084	Brun	Aug 2010	B2
7792257	Vanier	Sep 2010	B1
7801731	Odinak	Sep 2010	B2
7809570	Kennewick	Oct 2010	B2
7818176	Freeman	Oct 2010	B2
7831426	Bennett	Nov 2010	B2
7831433	Belvin	Nov 2010	B1
7856358	Ho	Dec 2010	B2
7873519	Bennett	Jan 2011	B2
7873523	Potter	Jan 2011	B2
7873654	Bernard	Jan 2011	B2
7881936	Longe	Feb 2011	B2
7890324	Bangalore	Feb 2011	B2
7894849	Kass	Feb 2011	B2
7902969	Obradovich	Mar 2011	B2
7917367	DiCristo	Mar 2011	B2
7920682	Byrne	Apr 2011	B2
7949529	Weider	May 2011	B2
7949537	Walker	May 2011	B2
7953732	Frank	May 2011	B2
7974875	Quilici	Jul 2011	B1
7983917	Kennewick	Jul 2011	B2
7984287	Gopalakrishnan	Jul 2011	B2
8005683	Tessel	Aug 2011	B2
8015006	Kennewick	Sep 2011	B2
8024186	De Bonet	Sep 2011	B1
8027965	Takehara	Sep 2011	B2
8032383	Bhardwaj	Oct 2011	B1
8060367	Keaveney	Nov 2011	B2
8069046	Kennewick	Nov 2011	B2
8073681	Baldwin	Dec 2011	B2
8077975	Ma	Dec 2011	B2
8082153	Coffman	Dec 2011	B2
8086463	Ativanichayaphong	Dec 2011	B2
8103510	Sato	Jan 2012	B2
8112275	Kennewick	Feb 2012	B2
8140327	Kennewick	Mar 2012	B2
8140335	Kennewick	Mar 2012	B2
8145489	Freeman	Mar 2012	B2
8150694	Kennewick	Apr 2012	B2
8155962	Kennewick	Apr 2012	B2
8170867	Germain	May 2012	B2
8180037	Delker	May 2012	B1
8195468	Weider	Jun 2012	B2
8200485	Lee	Jun 2012	B1
8204751	Di Fabbrizio	Jun 2012	B1
8219399	Lutz	Jul 2012	B2
8219599	Tunstall-Pedoe	Jul 2012	B2
8224652	Wang	Jul 2012	B2
8255224	Singleton	Aug 2012	B2
8326627	Kennewick	Dec 2012	B2
8326634	DiCristo	Dec 2012	B2
8326637	Baldwin	Dec 2012	B2
8332224	DiCristo	Dec 2012	B2
8340975	Rosenberger	Dec 2012	B1
8346563	Hjelm	Jan 2013	B1
8370147	Kennewick	Feb 2013	B2
8447607	Weider	May 2013	B2
8447651	Scholl	May 2013	B1
8452598	Kennewick	May 2013	B2
8503995	Ramer	Aug 2013	B2
8509403	Chiu	Aug 2013	B2
8515765	Baldwin	Aug 2013	B2
8527274	Freeman	Sep 2013	B2
8577671	Barve	Nov 2013	B1
8589161	Kennewick	Nov 2013	B2
8620659	DiCristo	Dec 2013	B2
8719005	Lee	May 2014	B1
8719009	Baldwin	May 2014	B2
8719026	Kennewick	May 2014	B2
8731929	Kennewick	May 2014	B2
8738380	Baldwin	May 2014	B2
8849652	Weider	Sep 2014	B2
8849670	DiCristo	Sep 2014	B2
8849696	Pansari	Sep 2014	B2
8849791	Hertschuh	Sep 2014	B1
8886536	Freeman	Nov 2014	B2
8972243	Strom	Mar 2015	B1
8983839	Kennewick	Mar 2015	B2
9009046	Stewart	Apr 2015	B1
9015049	Baldwin	Apr 2015	B2
9037455	Faaborg	May 2015	B1
9070366	Mathias	Jun 2015	B1
9070367	Hoffmeister	Jun 2015	B1
9105266	Baldwin	Aug 2015	B2
9171541	Kennewick	Oct 2015	B2
9269097	Freeman	Feb 2016	B2
9305548	Kennewick	Apr 2016	B2
9308445	Merzenich	Apr 2016	B1
9406078	Freeman	Aug 2016	B2
9443514	Taubman	Sep 2016	B1
9502025	Kennewick	Nov 2016	B2
20010039492	Nemoto	Nov 2001	A1
20010041980	Howard	Nov 2001	A1
20010047261	Kassan	Nov 2001	A1
20010049601	Kroeker	Dec 2001	A1
20010054087	Flom	Dec 2001	A1
20020002548	Roundtree	Jan 2002	A1
20020010584	Schultz	Jan 2002	A1
20020015500	Belt	Feb 2002	A1
20020022927	Lemelson	Feb 2002	A1
20020022956	Ukrainczyk	Feb 2002	A1
20020029186	Roth	Mar 2002	A1
20020029261	Shibata	Mar 2002	A1
20020032752	Gold	Mar 2002	A1
20020035501	Handel	Mar 2002	A1
20020040297	Tsiao	Apr 2002	A1
20020049535	Rigo	Apr 2002	A1
20020049805	Yamada	Apr 2002	A1
20020059068	Rose	May 2002	A1
20020065568	Silfvast	May 2002	A1
20020067839	Heinrich	Jun 2002	A1
20020069059	Smith	Jun 2002	A1
20020069071	Knockeart	Jun 2002	A1
20020073176	Ikeda	Jun 2002	A1
20020082911	Dunn	Jun 2002	A1
20020087312	Lee	Jul 2002	A1
20020087326	Lee	Jul 2002	A1
20020087525	Abbott	Jul 2002	A1
20020107694	Lerg	Aug 2002	A1
20020120609	Lang	Aug 2002	A1
20020124050	Middeljans	Sep 2002	A1
20020133347	Schoneburg	Sep 2002	A1
20020133354	Ross	Sep 2002	A1
20020133402	Faber	Sep 2002	A1
20020135618	Maes	Sep 2002	A1
20020138248	Corston-Oliver	Sep 2002	A1
20020143532	McLean	Oct 2002	A1
20020143535	Kist	Oct 2002	A1
20020152260	Chen	Oct 2002	A1
20020161646	Gailey	Oct 2002	A1
20020161647	Gailey	Oct 2002	A1
20020169597	Fain	Nov 2002	A1
20020173333	Buchholz	Nov 2002	A1
20020173961	Guerra	Nov 2002	A1
20020184373	Maes	Dec 2002	A1
20020188602	Stubler	Dec 2002	A1
20020198714	Zhou	Dec 2002	A1
20030005033	Mohan	Jan 2003	A1
20030014261	Kageyama	Jan 2003	A1
20030016835	Elko	Jan 2003	A1
20030036903	Konopka	Feb 2003	A1
20030046071	Wyman	Mar 2003	A1
20030046281	Son	Mar 2003	A1
20030046346	Mumick	Mar 2003	A1
20030064709	Gailey	Apr 2003	A1
20030065427	Funk	Apr 2003	A1
20030069734	Everhart	Apr 2003	A1
20030088421	Maes	May 2003	A1
20030093419	Bangalore	May 2003	A1
20030097249	Walker	May 2003	A1
20030110037	Walker	Jun 2003	A1
20030112267	Belrose	Jun 2003	A1
20030115062	Walker	Jun 2003	A1
20030120493	Gupta	Jun 2003	A1
20030135488	Amir	Jul 2003	A1
20030144846	Denenberg	Jul 2003	A1
20030158731	Falcon	Aug 2003	A1
20030161448	Parolkar	Aug 2003	A1
20030167167	Gong	Sep 2003	A1
20030174155	Weng	Sep 2003	A1
20030182132	Niemoeller	Sep 2003	A1
20030187643	VanThong	Oct 2003	A1
20030204492	Wolf	Oct 2003	A1
20030206640	Malvar	Nov 2003	A1
20030212550	Ubale	Nov 2003	A1
20030212558	Matula	Nov 2003	A1
20030212562	Patel	Nov 2003	A1
20030225825	Healey	Dec 2003	A1
20030233230	Ammicht	Dec 2003	A1
20030236664	Sharma	Dec 2003	A1
20040006475	Ehlen	Jan 2004	A1
20040010358	Oesterling	Jan 2004	A1
20040025115	Sienel	Feb 2004	A1
20040030741	Wolton	Feb 2004	A1
20040036601	Obradovich	Feb 2004	A1
20040044516	Kennewick	Mar 2004	A1
20040098245	Walker	May 2004	A1
20040117179	Balasuriya	Jun 2004	A1
20040117804	Scahill	Jun 2004	A1
20040122673	Park	Jun 2004	A1
20040122674	Bangalore	Jun 2004	A1
20040133793	Ginter	Jul 2004	A1
20040140989	Papageorge	Jul 2004	A1
20040143440	Prasad	Jul 2004	A1
20040148154	Acero	Jul 2004	A1
20040148170	Acero	Jul 2004	A1
20040158555	Seedman	Aug 2004	A1
20040166832	Portman	Aug 2004	A1
20040167771	Duan	Aug 2004	A1
20040172247	Yoon	Sep 2004	A1
20040172258	Dominach	Sep 2004	A1
20040189697	Fukuoka	Sep 2004	A1
20040193408	Hunt	Sep 2004	A1
20040193420	Kennewick	Sep 2004	A1
20040199375	Ehsani	Oct 2004	A1
20040201607	Mulvey	Oct 2004	A1
20040205671	Sukehiro	Oct 2004	A1
20040243393	Wang	Dec 2004	A1
20040243417	Pitts, III	Dec 2004	A9
20040247092	Timmins	Dec 2004	A1
20040249636	Applebaum	Dec 2004	A1
20050015256	Kargman	Jan 2005	A1
20050021331	Huang	Jan 2005	A1
20050021334	Iwahashi	Jan 2005	A1
20050021470	Martin	Jan 2005	A1
20050021826	Kumar	Jan 2005	A1
20050033574	Kim	Feb 2005	A1
20050033582	Gadd	Feb 2005	A1
20050043940	Elder	Feb 2005	A1
20050080632	Endo	Apr 2005	A1
20050102282	Linden	May 2005	A1
20050114116	Fiedler	May 2005	A1
20050125232	Gadd	Jun 2005	A1
20050131673	Koizumi	Jun 2005	A1
20050137850	Odell	Jun 2005	A1
20050137877	Oesterling	Jun 2005	A1
20050143994	Mori	Jun 2005	A1
20050144013	Fujimoto	Jun 2005	A1
20050144187	Che	Jun 2005	A1
20050149319	Honda	Jul 2005	A1
20050216254	Gupta	Sep 2005	A1
20050234727	Chiu	Oct 2005	A1
20050246174	DeGolia	Nov 2005	A1
20050283364	Longe	Dec 2005	A1
20050283532	Kim	Dec 2005	A1
20050283752	Fruchter	Dec 2005	A1
20060041431	Maes	Feb 2006	A1
20060047509	Ding	Mar 2006	A1
20060072738	Louis	Apr 2006	A1
20060074670	Weng	Apr 2006	A1
20060074671	Farmaner	Apr 2006	A1
20060080098	Campbell	Apr 2006	A1
20060100851	Schonebeck	May 2006	A1
20060106769	Gibbs	May 2006	A1
20060129409	Mizutani	Jun 2006	A1
20060130002	Hirayama	Jun 2006	A1
20060182085	Sweeney	Aug 2006	A1
20060206310	Ravikumar	Sep 2006	A1
20060217133	Christenson	Sep 2006	A1
20060236343	Chang	Oct 2006	A1
20060242017	Libes	Oct 2006	A1
20060253281	Letzt	Nov 2006	A1
20060285662	Yin	Dec 2006	A1
20070011159	Hillis	Jan 2007	A1
20070033005	Cristo	Feb 2007	A1
20070033020	Francois	Feb 2007	A1
20070033526	Thompson	Feb 2007	A1
20070038436	Cristo	Feb 2007	A1
20070038445	Helbing	Feb 2007	A1
20070043569	Potter	Feb 2007	A1
20070043574	Coffman	Feb 2007	A1
20070043868	Kumar	Feb 2007	A1
20070050191	Weider	Mar 2007	A1
20070055525	Kennewick	Mar 2007	A1
20070061067	Zeinstra	Mar 2007	A1
20070061735	Hoffberg	Mar 2007	A1
20070073544	Millett	Mar 2007	A1
20070078708	Yu	Apr 2007	A1
20070078709	Rajaram	Apr 2007	A1
20070078814	Flowers	Apr 2007	A1
20070094003	Huang	Apr 2007	A1
20070100797	Thun	May 2007	A1
20070112555	Lavi	May 2007	A1
20070112630	Lau	May 2007	A1
20070118357	Kasravi	May 2007	A1
20070124057	Prieto	May 2007	A1
20070135101	Ramati	Jun 2007	A1
20070146833	Satomi	Jun 2007	A1
20070162296	Altberg	Jul 2007	A1
20070174258	Jones	Jul 2007	A1
20070179778	Gong	Aug 2007	A1
20070185859	Flowers	Aug 2007	A1
20070186165	Maislos	Aug 2007	A1
20070192309	Fischer	Aug 2007	A1
20070198267	Jones	Aug 2007	A1
20070203699	Nagashima	Aug 2007	A1
20070203736	Ashton	Aug 2007	A1
20070208732	Flowers	Sep 2007	A1
20070214182	Rosenberg	Sep 2007	A1
20070250901	McIntire	Oct 2007	A1
20070265850	Kennewick	Nov 2007	A1
20070266257	Camaisa	Nov 2007	A1
20070276651	Bliss	Nov 2007	A1
20070294615	Sathe	Dec 2007	A1
20070299824	Pan	Dec 2007	A1
20080034032	Healey	Feb 2008	A1
20080046311	Shahine	Feb 2008	A1
20080059188	Konopka	Mar 2008	A1
20080065386	Cross	Mar 2008	A1
20080065389	Cross	Mar 2008	A1
20080091406	Baldwin	Apr 2008	A1
20080103761	Printz	May 2008	A1
20080103781	Wasson	May 2008	A1
20080104071	Pragada	May 2008	A1
20080109285	Reuther	May 2008	A1
20080115163	Gilboa	May 2008	A1
20080126091	Clark	May 2008	A1
20080133215	Sarukkai	Jun 2008	A1
20080140385	Mahajan	Jun 2008	A1
20080147396	Wang	Jun 2008	A1
20080147410	Odinak	Jun 2008	A1
20080147637	Li	Jun 2008	A1
20080154604	Sathish	Jun 2008	A1
20080162471	Bernard	Jul 2008	A1
20080177530	Cross	Jul 2008	A1
20080184164	Di Fabbrizio	Jul 2008	A1
20080189110	Freeman	Aug 2008	A1
20080228496	Yu	Sep 2008	A1
20080235023	Kennewick	Sep 2008	A1
20080235027	Cross	Sep 2008	A1
20080270224	Portman	Oct 2008	A1
20080294437	Nakano	Nov 2008	A1
20080294994	Kruger	Nov 2008	A1
20080306743	Di Fabbrizio	Dec 2008	A1
20080319751	Kennewick	Dec 2008	A1
20090006077	Keaveney	Jan 2009	A1
20090006194	Sridharan	Jan 2009	A1
20090018829	Kuperstein	Jan 2009	A1
20090024476	Baar	Jan 2009	A1
20090052635	Jones	Feb 2009	A1
20090067599	Agarwal	Mar 2009	A1
20090076827	Bulitta	Mar 2009	A1
20090106029	DeLine	Apr 2009	A1
20090117885	Roth	May 2009	A1
20090144131	Chiu	Jun 2009	A1
20090144271	Richardson	Jun 2009	A1
20090150156	Kennewick	Jun 2009	A1
20090164216	Chengalvarayan	Jun 2009	A1
20090171664	Kennewick	Jul 2009	A1
20090216540	Tessel	Aug 2009	A1
20090248565	Chuang	Oct 2009	A1
20090248605	Mitchell	Oct 2009	A1
20090259561	Boys	Oct 2009	A1
20090259646	Fujita	Oct 2009	A1
20090265163	Li	Oct 2009	A1
20090271194	Davis	Oct 2009	A1
20090273563	Pryor	Nov 2009	A1
20090276700	Anderson	Nov 2009	A1
20090287680	Paek	Nov 2009	A1
20090299745	Kennewick	Dec 2009	A1
20090299857	Brubaker	Dec 2009	A1
20090304161	Pettyjohn	Dec 2009	A1
20090307031	Winkler	Dec 2009	A1
20090313026	Coffman	Dec 2009	A1
20090319517	Guha	Dec 2009	A1
20100023320	Cristo	Jan 2010	A1
20100029261	Mikkelsen	Feb 2010	A1
20100036967	Caine	Feb 2010	A1
20100049501	Kennewick	Feb 2010	A1
20100049514	Kennewick	Feb 2010	A1
20100057443	Cristo	Mar 2010	A1
20100063880	Atsmon	Mar 2010	A1
20100064025	Nelimarkka	Mar 2010	A1
20100094707	Freer	Apr 2010	A1
20100138300	Wallis	Jun 2010	A1
20100145700	Kennewick	Jun 2010	A1
20100185512	Borger	Jul 2010	A1
20100204986	Kennewick	Aug 2010	A1
20100204994	Kennewick	Aug 2010	A1
20100217604	Baldwin	Aug 2010	A1
20100286985	Kennewick	Nov 2010	A1
20100299142	Freeman	Nov 2010	A1
20100312566	Odinak	Dec 2010	A1
20100318357	Istvan	Dec 2010	A1
20100331064	Michelstein	Dec 2010	A1
20110022393	Waller	Jan 2011	A1
20110106527	Chiu	May 2011	A1
20110112827	Kennewick	May 2011	A1
20110112921	Kennewick	May 2011	A1
20110119049	Ylonen	May 2011	A1
20110131036	DiCristo	Jun 2011	A1
20110131045	Cristo	Jun 2011	A1
20110231182	Weider	Sep 2011	A1
20110231188	Kennewick	Sep 2011	A1
20110238409	Larcheveque	Sep 2011	A1
20110307167	Taschereau	Dec 2011	A1
20120022857	Baldwin	Jan 2012	A1
20120046935	Nagao	Feb 2012	A1
20120101809	Kennewick	Apr 2012	A1
20120101810	Kennewick	Apr 2012	A1
20120109753	Kennewick	May 2012	A1
20120150620	Mandyam	Jun 2012	A1
20120150636	Freeman	Jun 2012	A1
20120239498	Ramer	Sep 2012	A1
20120240060	Pennington	Sep 2012	A1
20120278073	Weider	Nov 2012	A1
20130006734	Ocko	Jan 2013	A1
20130054228	Baldwin	Feb 2013	A1
20130060625	Davis	Mar 2013	A1
20130080177	Chen	Mar 2013	A1
20130211710	Kennewick	Aug 2013	A1
20130253929	Weider	Sep 2013	A1
20130254314	Chow	Sep 2013	A1
20130297293	Cristo	Nov 2013	A1
20130304473	Baldwin	Nov 2013	A1
20130311324	Stoll	Nov 2013	A1
20130332454	Stuhec	Dec 2013	A1
20130339022	Baldwin	Dec 2013	A1
20140006951	Hunter	Jan 2014	A1
20140012577	Freeman	Jan 2014	A1
20140025371	Min	Jan 2014	A1
20140108013	Cristo	Apr 2014	A1
20140156278	Kennewick	Jun 2014	A1
20140195238	Terao	Jul 2014	A1
20140236575	Tur	Aug 2014	A1
20140249821	Kennewick	Sep 2014	A1
20140249822	Baldwin	Sep 2014	A1
20140278413	Pitschel	Sep 2014	A1
20140278416	Schuster	Sep 2014	A1
20140288934	Kennewick	Sep 2014	A1
20140330552	Bangalore	Nov 2014	A1
20140365222	Weider	Dec 2014	A1
20150019211	Simard	Jan 2015	A1
20150019217	Cristo	Jan 2015	A1
20150019227	Anandarajah	Jan 2015	A1
20150066627	Freeman	Mar 2015	A1
20150073910	Kennewick	Mar 2015	A1
20150095159	Kennewick	Apr 2015	A1
20150142447	Kennewick	May 2015	A1
20150170641	Kennewick	Jun 2015	A1
20150193379	Mehta	Jul 2015	A1
20150199339	Mirkin	Jul 2015	A1
20150228276	Baldwin	Aug 2015	A1
20150293917	Bufe	Oct 2015	A1
20150348544	Baldwin	Dec 2015	A1
20150348551	Gruber	Dec 2015	A1
20150364133	Freeman	Dec 2015	A1
20160049152	Kennewick	Feb 2016	A1
20160078482	Kennewick	Mar 2016	A1
20160078491	Kennewick	Mar 2016	A1
20160078504	Kennewick	Mar 2016	A1
20160078773	Carter	Mar 2016	A1
20160110347	Kennewick	Apr 2016	A1
20160148610	Kennewick	May 2016	A1
20160148612	Guo	May 2016	A1
20160188292	Carter	Jun 2016	A1
20160188573	Tang	Jun 2016	A1
20160335676	Freeman	Nov 2016	A1

Foreign Referenced Citations (31)

Number	Date	Country
1433554	Jul 2003	CN
1320043	Jun 2003	EP
1646037	Apr 2006	EP
H11249773	Sep 1999	JP
2001071289	Mar 2001	JP
2006146881	Jun 2006	JP
2008027454	Feb 2008	JP
2008058465	Mar 2008	JP
2011504304	Feb 2011	JP
0171609	Sep 2001	NO
9946763	Sep 1999	WO
0021232	Jan 2000	WO
0046792	Jan 2000	WO
0178065	Oct 2001	WO
2004072954	Jan 2004	WO
2007019318	Jan 2007	WO
2007021587	Jan 2007	WO
2007027546	Jan 2007	WO
2007027989	Jan 2007	WO
2008098039	Jan 2008	WO
2008118195	Jan 2008	WO
2008139928	Jun 2008	WO
2009075912	Jan 2009	WO
2009145796	Jan 2009	WO
2009111721	Sep 2009	WO
2010096752	Jan 2010	WO
2016044290	Mar 2016	WO
2016044316	Mar 2016	WO
2016044319	Mar 2016	WO
2016044321	Mar 2016	WO
2016061309	Apr 2016	WO

Non-Patent Literature Citations (22)

Entry
Davis, Z., et al., “A Personal Handheld Multi-Modal Shopping Assistant”, IEEE, 2006, 9 pages.
“Statement in Accordance with the Notice from the European Patent Office dated Oct. 1, 2007 Concerning Business Methods” (OJ EPO Nov. 2007, 592-593), XP002456252.
Arrington, Michael, “Google Redefines GPS Navigation Landscape: Google Maps Navigation for Android 2.0”, TechCrunch, printed from the Internet <http://www.techcrunch.com/2009/10/28/google-redefines-car-gps-navigation-google-maps-navigation-android/>, Oct. 28, 2009, 4 pages.
Bazzi, Issam et al., “Heterogeneous Lexical Units for Automatic Speech Recognition: Preliminary Investigations”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, Jun. 5-9, 2000, XP010507574, pp. 1257-1260.
Belvin, Robert, et al., “Development of the HRL Route Navigation Dialogue System”, Proceedings of the First International Conference on Human Language Technology Research, San Diego, 2001, pp. 1-5.
Chai et al., “MIND: A Semantics-Based Multimodal Interpretation Framework for Conversational Systems”, Proceedings of the International CLASS Workshop on Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, Jun. 2002, pp. 37-46.
Cheyer et al., “Multimodal Maps: An Agent-Based Approach”, International Conference on Cooperative Multimodal Communication (CMC/95), May 24-26, 1995, pp. 111-121.
El Meliani et al., “A Syllabic-Filler-Based Continuous Speech Recognizer for Unlimited Vocabulary”, Canadian Conference on Electrical and Computer Engineering, vol. 2, Sep. 5-8, 1995, pp. 1007-1010.
Elio et al., “On Abstract Task Models and Conversation Policies” in Workshop on Specifying and Implementing Conversation Policies, Autonomous Agents '99, Seattle, 1999, 10 pages.
Kirchhoff, Katrin, “Syllable-Level Desynchronisation of Phonetic Features for Speech Recognition”, Proceedings of the Fourth International Conference on Spoken Language, 1996, ICSLP 96, vol. 4, IEEE, 1996, 3 pages.
Kuhn, Thomas, et al., “Hybrid In-Car Speech Recognition for Mobile Multimedia Applications”, Vehicular Technology Conference, IEEE, Jul. 1999, pp. 2009-2013.
Lin, Bor-shen, et al., “A Distributed Architecture for Cooperative Spoken Dialogue Agents with Coherent Dialogue State and History”, ASRU'99, 1999, 4 pages.
Lind, R., et al., “The Network Vehicle—A Glimpse into the Future of Mobile Multi-Media”, IEEE Aerosp. Electron. Systems Magazine, vol. 14, No. 9, Sep. 1999, pp. 27-32.
Mao, Mark Z., “Automatic Training Set Segmentation for Multi-Pass Speech Recognition”, Department of Electrical Engineering, Stanford University, CA, copyright 2005, IEEE, pp. I-685 to I-688.
O'Shaughnessy, Douglas, “Interacting with Computers by Voice: Automatic Speech Recognition and Synthesis”, Proceedings of the IEEE, vol. 91, No. 9, Sep. 1, 2003, XP011100665, pp. 1272-1305.
Reuters, “IBM to Enable Honda Drivers to Talk to Cars”, Charles Schwab & Co., Inc., Jul. 28, 2002, 1 page.
Turunen, “Adaptive Interaction Methods in Speech User Interfaces”, Conference on Human Factors in Computing Systems, Seattle, Washington, 2001, pp. 91-92.
Vanhoucke, Vincent, “Confidence Scoring and Rejection Using Multi-Pass Speech Recognition”, Nuance Communications, Menlo Park, CA, 2005, 4 pages.
Weng, Fuliang, et al., “Efficient Lattice Representation and Generation”, Speech Technology and Research Laboratory, SRI International, Menlo Park, CA, 1998, 4 pages.
Wu, Su-Lin, et al., “Incorporating Information from Syllable-Length Time Scales into Automatic Speech Recognition”, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, 1998, vol. 2, IEEE, 1998, 4 pages.
Wu, Su-Lin, et al., “Integrating Syllable Boundary Information into Speech Recognition”, IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-97, 1997, vol. 2, IEEE, 1997, 4 pages.
Zhao, Yilin, “Telematics: Safe and Fun Driving”, IEEE Intelligent Systems, vol. 17, Issue 1, 2002, pp. 10-14.

Related Publications (1)

	Number	Date	Country
	20160217785 A1	Jul 2016	US

Continuations (2)

	Number	Date	Country
Parent	14083061	Nov 2013	US
Child	15090215		US
Parent	12127343	May 2008	US
Child	14083061		US

Abstract