System and method for an integrated, multi-modal, multi-device natural language voice services environment

Information

  • Patent Grant
  • 9711143
  • Patent Number
    9,711,143
  • Date Filed
    Monday, April 4, 2016
  • Date Issued
    Tuesday, July 18, 2017
Abstract
A system and method for an integrated, multi-modal, multi-device natural language voice services environment may be provided. In particular, the environment may include a plurality of voice-enabled devices each having intent determination capabilities for processing multi-modal natural language inputs in addition to knowledge of the intent determination capabilities of other devices in the environment. Further, the environment may be arranged in a centralized manner, a distributed peer-to-peer manner, or various combinations thereof. As such, the various devices may cooperate to determine intent of multi-modal natural language inputs, and commands, queries, or other requests may be routed to one or more of the devices best suited to take action in response thereto.
Description
FIELD OF THE INVENTION

The invention relates to an integrated voice services environment in which a plurality of devices can provide various voice services by cooperatively processing free form, multi-modal, natural language inputs, thereby facilitating conversational interactions between a user and one or more of the devices in the integrated environment.


BACKGROUND OF THE INVENTION

As technology has progressed in recent years, consumer electronic devices have emerged to become nearly ubiquitous in the everyday lives of many people. To meet the increasing demand that has resulted from growth in the functionality and mobility of mobile phones, navigation devices, embedded devices, and other such devices, a wealth of features and functions are often provided therein in addition to core applications. Greater functionality also introduces trade-offs, however, including learning curves that often inhibit users from fully exploiting all of the capabilities of their electronic devices. For example, many existing electronic devices include complex human-to-machine interfaces that may not be particularly user-friendly, which inhibits mass-market adoption for many technologies. Moreover, cumbersome interfaces often result in otherwise desirable features being buried (e.g., within menus that may be tedious to navigate), which tends to cause many users not to use, or even know about, the potential capabilities of their devices.


As such, the increased functionality provided by many electronic devices often tends to be wasted, as market research suggests that many users use only a fraction of the features or applications available on a given device. Moreover, in a society where wireless networking and broadband access are increasingly prevalent, consumers tend to naturally desire seamless mobile capabilities from their electronic devices. Thus, as consumer demand intensifies for simpler mechanisms to interact with electronic devices, cumbersome interfaces that prevent quick and focused interaction can become an important concern. Accordingly, the ever-growing demand for mechanisms to use technology in intuitive ways remains largely unfulfilled.


One approach towards simplifying human to machine interactions in electronic devices includes the use of voice recognition software, which can enable users to exploit features that could otherwise be unfamiliar, unknown, or difficult to use. For example, a recent survey conducted by the Navteq Corporation, which provides data used in a variety of applications such as automotive navigation and web-based applications, demonstrates that voice recognition often ranks among the features most desired by consumers of electronic devices. Even so, existing voice user interfaces, when they actually work, still tend to require significant learning on the part of the user.


For example, many existing voice user interfaces only support requests formulated according to specific command-and-control sequences or syntaxes. Furthermore, many existing voice user interfaces cause user frustration or dissatisfaction because of inaccurate speech recognition. Similarly, by forcing a user to provide pre-established commands or keywords to communicate requests in ways that a system can understand, existing voice user interfaces do not effectively engage the user in a productive, cooperative dialogue to resolve requests and advance a conversation towards a mutually satisfactory goal (e.g., when users may be uncertain of particular needs, available information, or device capabilities, among other things). As such, existing voice user interfaces tend to suffer from various drawbacks, including significant limitations on engaging users in a dialogue in a cooperative and conversational manner.


Additionally, many existing voice user interfaces fall short in utilizing information distributed across various different domains or devices in order to resolve natural language voice-based inputs. Thus, existing voice user interfaces suffer from being constrained to a finite set of applications for which they have been designed, or to devices on which they reside. Although technological advancement has resulted in users often having several devices to suit their various needs, existing voice user interfaces do not adequately free users from device constraints. For example, users may be interested in services associated with different applications and devices, but existing voice user interfaces tend to restrict users from accessing the applications and devices as they see fit. Moreover, users typically can only practicably carry a finite number of devices at any given time, yet content or services associated with devices other than those currently being used may be desired in various circumstances. Accordingly, although users tend to have varying needs, where content or services associated with different devices may be desired in various contexts or environments, existing voice technologies tend to fall short in providing an integrated environment in which users can request content or services associated with virtually any device or network. As such, constraints on information availability and device interaction mechanisms in existing voice services environments tend to prevent users from experiencing technology in an intuitive, natural, and efficient way.


Existing systems suffer from these and other problems.


SUMMARY OF THE INVENTION

According to various aspects of the invention, a system and method for an integrated, multi-modal, multi-device natural language voice services environment may include a plurality of voice-enabled devices each having intent determination capabilities for processing multi-modal natural language inputs in addition to knowledge of the intent determination capabilities of other devices in the environment. Further, the environment may be arranged in a centralized manner, a distributed peer-to-peer manner, or various combinations thereof. As such, the various devices may cooperate to determine intent of multi-modal natural language inputs, and commands, queries, or other requests may be routed to one or more of the devices best suited to take action in response thereto.


According to various aspects of the invention, the integrated natural language voice services environment arranged in the centralized manner includes an input device that receives a multi-modal natural language input, a central device communicatively coupled to the input device, and one or more secondary devices communicatively coupled to the central device. Each of the input device, the central device, and the one or more secondary devices may have intent determination capabilities for processing multi-modal natural language inputs. As such, an intent of a given multi-modal natural language input may be determined in the centralized manner by communicating the multi-modal natural language input from the input device to the central device. Thereafter, the central device may aggregate the intent determination capabilities of the input device and the one or more secondary devices and determine an intent of the multi-modal natural language input using the aggregated intent determination capabilities. The input device may then receive the determined intent from the central device and invoke at least one action at one or more of the input device, the central device, or the secondary devices based on the determined intent.
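The aggregation step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each device's intent determination capabilities are reduced to a set of recognizable phrases and that the central device picks the longest phrase found in the input; the function names and the matching rule are hypothetical and are not taken from the disclosure.

```python
def aggregate_capabilities(per_device_vocab):
    """Merge per-device vocabularies into one map: phrase -> devices that know it."""
    merged = {}
    for device, vocab in per_device_vocab.items():
        for phrase in vocab:
            merged.setdefault(phrase, set()).add(device)
    return merged


def determine_intent_centrally(utterance, merged):
    """Return (matched phrase, candidate devices) or None if nothing matches.

    The longest-match rule is an assumption made for this sketch.
    """
    text = utterance.lower()
    matches = [phrase for phrase in merged if phrase in text]
    if not matches:
        return None
    best = max(matches, key=len)
    return best, sorted(merged[best])
```

In this sketch the input device would forward the utterance, receive the returned pair, and invoke an action at one of the listed devices.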


According to various aspects of the invention, the integrated natural language voice services environment arranged in the distributed manner includes an input device that receives a multi-modal natural language input, a central device communicatively coupled to the input device and one or more secondary devices communicatively coupled to the input device, wherein each of the input device and the one or more secondary devices may have intent determination capabilities for processing multi-modal natural language inputs, as in the centralized implementation. However, the distributed implementation may be distinct from the centralized implementation in that a preliminary intent of the multi-modal natural language input may be determined at the input device using local intent determination capabilities. The multi-modal natural language input may then be communicated to one or more of the secondary devices (e.g., when a confidence level of the intent determination at the input device falls below a given threshold). In such cases, each of the secondary devices determines an intent of the multi-modal natural language input using local intent determination capabilities. The input device collates the preliminary intent determination and the intent determinations of the secondary devices, and may arbitrate among the collated intent determinations to determine an actionable intent of the multi-modal natural language input.
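The distributed flow described above (a preliminary local determination, a confidence check, collation, and arbitration) can be sketched as follows. The threshold value, the IntentResult type, and the highest-confidence arbitration rule are assumptions made for illustration; the disclosure does not specify them.

```python
from dataclasses import dataclass


@dataclass
class IntentResult:
    device: str        # device that produced this determination
    intent: str        # hypothetical intent label
    confidence: float  # hypothetical score in [0, 1]


CONFIDENCE_THRESHOLD = 0.8  # assumed value; the text only says "a given threshold"


def determine_intent_distributed(preliminary, ask_secondaries):
    """Return an actionable intent.

    Secondary devices are consulted only when the local confidence
    falls below the threshold.
    """
    if preliminary.confidence >= CONFIDENCE_THRESHOLD:
        return preliminary
    # Collate the local result with every secondary device's result.
    collated = [preliminary] + list(ask_secondaries())
    # Arbitrate: pick the most confident determination (assumed rule).
    return max(collated, key=lambda result: result.confidence)
```

Here ask_secondaries stands in for whatever transport delivers the input to the secondary devices and returns their determinations.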


According to various aspects of the invention, the integrated natural language voice services environment may be arranged in a manner that dynamically selects between a centralized model and a distributed model. For example, the environment includes an input device that receives a multi-modal natural language input and one or more secondary devices communicatively coupled to the input device, each of which has intent determination capabilities for processing multi-modal natural language inputs. A constellation model may be accessible to each of the input device and the one or more secondary devices, wherein the constellation model describes the intent determination capabilities of the input device and the one or more secondary devices. The multi-modal natural language input can be routed for processing at one or more of the input device or the secondary devices to determine an intent thereof based on the intent determination capabilities described in the constellation model. For example, when the constellation model arranges the input device and the secondary devices in the centralized manner, one of the secondary devices may be designated the central device and the natural language input may be processed as described above. However, when the multi-modal natural language input cannot be communicated to the central device, the constellation model may be dynamically rearranged in the distributed manner, whereby the input device and the secondary devices share knowledge relating to respective local intent determination capabilities and operate as cooperative nodes to determine the intent of the multi-modal natural language input using the shared knowledge relating to local intent determination capabilities.


Other objects and advantages of the invention will be apparent based on the following drawings and detailed description.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 illustrates a block diagram of an exemplary multi-modal electronic device that may be provided in an integrated, multi-device natural language voice services environment, according to various aspects of the invention.



FIG. 2 illustrates a block diagram of an exemplary centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.



FIG. 3 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at an input device in the centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.



FIG. 4 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at a central device in the centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.



FIG. 5 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at a secondary device in the centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.



FIG. 6 illustrates a block diagram of an exemplary distributed implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.



FIG. 7 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at an input device in the distributed implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.





DETAILED DESCRIPTION

According to various aspects of the invention, FIG. 1 illustrates a block diagram of an exemplary multi-modal electronic device 100 that may be provided in a natural language voice services environment that includes one or more additional multi-modal devices (e.g., as illustrated in FIGS. 2 and 6). As will be apparent, the electronic device 100 illustrated in FIG. 1 may be any suitable voice-enabled electronic device (e.g., a telematics device, a personal navigation device, a mobile phone, a VoIP node, a personal computer, a media device, an embedded device, a server, or another electronic device). The device 100 may include various components that collectively provide a capability to process conversational, multi-modal natural language inputs. As such, a user of the device 100 may engage in multi-modal conversational dialogues with the voice-enabled electronic device 100 to resolve requests in a free form, cooperative manner.


For example, the natural language processing components may support free form natural language utterances to liberate the user from restrictions relating to how commands, queries, or other requests should be formulated. Rather, the user may employ any manner of speaking that feels natural in order to request content or services available through the device 100 (e.g., content or services relating to telematics, communications, media, messaging, navigation, marketing, information retrieval, etc.). For instance, in various implementations, the device 100 may process natural language utterances utilizing techniques described in U.S. patent application Ser. No. 10/452,147, entitled “Systems and Methods for Responding to Natural Language Speech Utterance,” filed Jun. 3, 2003, and U.S. patent application Ser. No. 10/618,633, entitled “Mobile Systems and Methods for Responding to Natural Language Speech Utterance,” filed Jun. 15, 2003, the disclosures of which are hereby incorporated by reference in their entirety.


Moreover, because the device 100 may be deployed in an integrated multi-device environment, the user may further request content or services available through other devices deployed in the environment. In particular, the integrated voice services environment may include a plurality of multi-modal devices, each of which includes natural language components generally similar to those illustrated in FIG. 1. The various devices in the environment may serve distinct purposes, however, such that available content, services, applications, or other capabilities may vary among the devices in the environment (e.g., core functions of a media device may vary from those of a personal navigation device). Thus, each device in the environment, including device 100, may have knowledge of content, services, applications, intent determination capabilities, and other features available through the other devices by way of a constellation model 130b. Accordingly, as will be described in greater detail below, the electronic device 100 may cooperate with other devices in the integrated environment to resolve requests by sharing context, prior information, domain knowledge, short-term knowledge, long-term knowledge, and cognitive models, among other things.


According to various aspects of the invention, the electronic device 100 may include an input mechanism 105 that can receive multi-modal natural language inputs, which include at least an utterance spoken by the user. As will be apparent, the input mechanism 105 may include any appropriate device or combination of devices capable of receiving a spoken input (e.g., a directional microphone, an array of microphones, or any other device that can generate encoded speech). Further, in various implementations, the input mechanism 105 can be configured to maximize fidelity of encoded speech, for example, by maximizing gain in a direction of the user, cancelling echoes, nulling point noise sources, performing variable rate sampling, or filtering environmental noise (e.g., background conversations). As such, the input mechanism 105 may generate encoded speech in a manner that can tolerate noise or other factors that could otherwise interfere with accurate interpretation of the utterance.


Furthermore, in various implementations, the input mechanism 105 may include various other input modalities (i.e., the input mechanism 105 may be arranged in a multi-modal environment), in that non-voice inputs can be correlated and/or processed in connection with one or more previous, contemporaneous, or subsequent multi-modal natural language inputs. For example, the input mechanism 105 may be coupled to a touch-screen interface, a stylus and tablet interface, a keypad or keyboard, or any other suitable input mechanism, as will be apparent. As a result, an amount of information potentially available when processing the multi-modal inputs may be maximized, as the user can clarify utterances or otherwise provide additional information in a given multi-modal natural language input using various input modalities. For instance, in an exemplary illustration, the user could touch a stylus or other pointing device to a portion of a touch-screen interface of the device 100, while also providing an utterance relating to the touched portion of the interface (e.g., “Show me restaurants around here”). In this example, the natural language utterance may be correlated with the input received via the touch-screen interface, resulting in “around here” being interpreted in relation to the touched portion of the interface (e.g., as opposed to the user's current location or some other meaning).
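The correlation of a touch input with the phrase "around here" can be illustrated with a short sketch. The time window, the data types, and the function name are hypothetical, and the fallback to the user's current location is only one of the alternative meanings mentioned above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Location = Tuple[float, float]  # (latitude, longitude)


@dataclass
class TouchEvent:
    location: Location  # map point under the touched portion of the screen
    timestamp: float    # seconds


@dataclass
class Utterance:
    text: str
    timestamp: float    # seconds


CORRELATION_WINDOW_S = 3.0  # assumed: how close in time the two inputs must be


def resolve_around_here(utterance: Utterance,
                        touch: Optional[TouchEvent],
                        current_location: Location) -> Optional[Location]:
    """Resolve "around here" to the touched point when a touch is contemporaneous."""
    if "around here" not in utterance.text.lower():
        return None  # nothing to resolve in this utterance
    if touch is not None and abs(touch.timestamp - utterance.timestamp) <= CORRELATION_WINDOW_S:
        return touch.location
    return current_location  # no usable touch: fall back to the user's position
```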


According to various aspects of the invention, the device 100 may include an Automatic Speech Recognizer 110 that generates one or more preliminary interpretations of the encoded speech, which may be received from the input mechanism 105. For example, the Automatic Speech Recognizer 110 may recognize syllables, words, or phrases contained in an utterance using one or more dynamically adaptable recognition grammars. The dynamic recognition grammars may be used to recognize a stream of phonemes through phonetic dictation based on one or more acoustic models. Furthermore, as described in U.S. patent application Ser. No. 11/197,504, entitled “Systems and Methods for Responding to Natural Language Speech Utterance,” filed Aug. 5, 2005, the disclosure of which is hereby incorporated by reference in its entirety, the Automatic Speech Recognizer 110 may be capable of multi-pass analysis, where a primary speech recognition engine may generate a primary interpretation of an utterance (e.g., using a large list dictation grammar) and request secondary transcription from one or more secondary speech recognition engines (e.g., using a virtual dictation grammar having decoy words for out-of-vocabulary words).


Thus, the Automatic Speech Recognizer 110 may generate preliminary interpretations of an utterance in various ways, including exclusive use of a dictation grammar or virtual dictation grammar, or use of various combinations thereof (e.g., when the device 100 supports multi-pass analysis). In any event, the Automatic Speech Recognizer 110 may provide out-of-vocabulary capabilities and may tolerate portions of a speech signal being dropped, the user misspeaking, or other variables that may occur in natural language speech (e.g., stops and starts, stutters, etc.). Furthermore, the recognition grammars employed by the Automatic Speech Recognizer 110 may include vocabularies, dictionaries, syllables, words, phrases, or other information optimized according to various contextual or application-specific domains (e.g., navigation, music, movies, weather, shopping, news, languages, temporal or geographic proximities, or other suitable domains). Moreover, environmental knowledge (e.g., peer-to-peer affinities, capabilities of devices in the environment, etc.), historical knowledge (e.g., frequent requests, prior context, etc.), or other types of knowledge can be used to continually optimize the information contained in the recognition grammars on a dynamic basis.


For example, information contained in the recognition grammars may be dynamically optimized to improve a likelihood of a given utterance being recognized accurately (e.g., following an incorrect interpretation of a word, the incorrect interpretation may be removed from the grammar to reduce a likelihood of the incorrect interpretation being repeated). Accordingly, the Automatic Speech Recognizer 110 may use a number of techniques to generate preliminary interpretations of natural language utterances, including those described, for example, in U.S. patent application Ser. No. 11/513,269, entitled “Dynamic Speech Sharpening,” filed Aug. 31, 2006, the disclosure of which is hereby incorporated by reference in its entirety. Furthermore, the techniques used by the Automatic Speech Recognizer 110 associated with the device 100 may be considered in defining intent determination capabilities of the device 100, and such capabilities may be shared with other devices in the environment to enable convergence of speech recognition throughout the environment (e.g., because various devices may employ distinct speech recognition techniques or have distinct grammars or vocabularies, the devices may share vocabulary translation mechanisms to enhance system-wide recognition).
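The grammar adaptation described above, in which an incorrect interpretation is removed so that it is less likely to be repeated, can be sketched as follows. The RecognitionGrammar class and its methods are hypothetical and greatly simplified; they are not the techniques of the incorporated applications.

```python
class RecognitionGrammar:
    """A toy grammar: the set of phrases the recognizer may currently output."""

    def __init__(self, phrases):
        self.phrases = set(phrases)

    def report_misrecognition(self, wrong_phrase):
        # Drop the phrase so the same incorrect interpretation is not repeated.
        self.phrases.discard(wrong_phrase)

    def candidates(self, scored_hypotheses):
        """Return hypotheses still allowed by the grammar, best score first."""
        allowed = [(p, s) for p, s in scored_hypotheses if p in self.phrases]
        allowed.sort(key=lambda pair: pair[1], reverse=True)
        return [phrase for phrase, _ in allowed]
```

After a user rejects one interpretation, the next pass over the same hypotheses surfaces the runner-up instead.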


According to various aspects of the invention, the Automatic Speech Recognizer 110 may provide one or more preliminary interpretations of a multi-modal input, including an utterance contained therein, to a conversational language processor 120. The conversational language processor 120 may include various components that collectively operate to model everyday human-to-human conversations in order to engage in cooperative conversations with the user to resolve requests based on the user's intent. For example, the conversational language processor 120 may include, among other things, an intent determination engine 130a, a constellation model 130b, one or more domain agents 130c, a context tracking engine 130d, a misrecognition engine 130e, and a voice search engine 130f. Furthermore, the conversational language processor 120 may be coupled to one or more data repositories 160 and applications associated with one or more domains. Thus, the intent determination capabilities of the device 100 may be defined based on the data and processing capabilities of the Automatic Speech Recognizer 110 and the conversational language processor 120.


More particularly, the intent determination engine 130a may establish meaning for a given multi-modal natural language input based on a consideration of the intent determination capabilities of the device 100 as well as the intent determination capabilities of other devices in the integrated voice services environment. For example, the intent determination capabilities of the device 100 may be defined as a function of processing resources, storage for grammars, context, agents, or other data, and content or services associated with the device 100 (e.g., a media device with a small amount of memory may have a smaller list of recognizable songs than a device with a large amount of memory). Thus, the intent determination engine 130a may determine whether to process a given input locally (e.g., when the device 100 has intent determination capabilities that suggest favorable conditions for recognition), or whether to route information associated with the input to other devices, which may assist in determining the intent of the input.


As such, to determine which device or combination of devices should process an input, the intent determination engine 130a may evaluate the constellation model 130b, which provides a model of the intent determination capabilities for each of the devices in the integrated voice services environment. For instance, the constellation model 130b may contain, among other things, knowledge of processing and storage resources available to each of the devices in the environment, as well as the nature and scope of domain agents, context, content, services, and other information available to each of the devices in the environment. As such, using the constellation model 130b, the intent determination engine 130a may be able to determine whether any of the other devices have intent determination capabilities that can be invoked to augment or otherwise enhance the intent determination capabilities of the device 100 (e.g., by routing information associated with a multi-modal natural language input to the device or devices that appear best suited to analyze the information and therefore determine an intent of the input). Accordingly, the intent determination engine 130a may establish the meaning of a given utterance by utilizing the comprehensive constellation model 130b that describes capabilities within the device 100 and across the integrated environment. The intent determination engine 130a may therefore optimize processing of a given natural language input based on capabilities throughout the environment (e.g., utterances may be processed locally to the device 100, routed to a specific device based on information in the constellation model 130b, or flooded to all of the devices in the environment in which case an arbitration may occur to select a best guess at an intent determination).
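The three routing outcomes described above (process locally, route to a specific device, or flood to all devices) can be sketched against a simplified constellation model. The DeviceCapabilities fields and the use of memory size as a tie-breaker are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceCapabilities:
    name: str
    domains: set = field(default_factory=set)  # domains the device has agents for
    memory_mb: int = 0                         # rough proxy for storage resources
    reachable: bool = True


def choose_targets(constellation, local_name, domain):
    """Return the device names that should process an input in the given domain."""
    by_name = {device.name: device for device in constellation}
    if domain in by_name[local_name].domains:
        return [local_name]  # process locally
    others = [d for d in constellation if d.reachable and d.name != local_name]
    capable = [d for d in others if domain in d.domains]
    if capable:
        # Route to the single best-suited device (assumed: the most memory).
        return [max(capable, key=lambda d: d.memory_mb).name]
    # No device claims the domain: flood to every reachable device,
    # after which an arbitration step would select among the results.
    return [d.name for d in others]
```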


Although the following discussion will generally focus on various techniques that can be used to determine the intent of multi-modal natural language inputs in the integrated multi-device environment, it will be apparent that the natural language processing capabilities of any one of the devices may extend beyond the specific discussion that has been provided herein. As such, in addition to the co-pending U.S. Patent Applications referenced above, further natural language processing capabilities that may be employed include those described in U.S. patent application Ser. No. 11/197,504, entitled “Systems and Methods for Responding to Natural Language Speech Utterance,” filed Aug. 5, 2005, U.S. patent application Ser. No. 11/200,164, entitled “System and Method of Supporting Adaptive Misrecognition in Conversational Speech,” filed Aug. 10, 2005, U.S. patent application Ser. No. 11/212,693, entitled “Mobile Systems and Methods of Supporting Natural Language Human-Machine Interactions,” filed Aug. 29, 2005, U.S. patent application Ser. No. 11/580,926, entitled “System and Method for a Cooperative Conversational Voice User Interface,” filed Oct. 16, 2006, U.S. patent application Ser. No. 11/671,526, entitled “System and Method for Selecting and Presenting Advertisements Based on Natural Language Processing of Voice-Based Input,” filed Feb. 6, 2007, and U.S. patent application Ser. No. 11/954,064, entitled “System and Method for Providing a Natural Language Voice User Interface in an Integrated Voice Navigation Services Environment,” filed Dec. 11, 2007, the disclosures of which are hereby incorporated by reference in their entirety.


According to various aspects of the invention, FIG. 2 illustrates a block diagram of an exemplary centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment. As will be apparent from the further description to be provided herein, the centralized implementation of the integrated, multi-device voice services environment may enable a user to engage in conversational, multi-modal natural language interactions with any one of voice-enabled devices 210a-n or central voice-enabled device 220. As such, the multi-device voice services environment may collectively determine intent for any given multi-modal natural language input, whereby the user may request content or voice services relating to any device or application in the environment, without restraint.


As illustrated in FIG. 2, the centralized implementation of the multi-device voice service environment may include a plurality of voice-enabled devices 210a-n, each of which includes various components capable of determining intent of natural language utterances, as described above in reference to FIG. 1. Furthermore, as will be apparent, the centralized implementation includes a central device 220, which contains information relating to intent determination capabilities for each of the other voice-enabled devices 210a-n. For example, in various exemplary implementations, the central device 220 may be designated as such by virtue of being a device most capable of determining the intent of an utterance (e.g., a server, home data center, or other device having significant processing power, memory resources, and communication capabilities making the device suitable to manage intent determination across the environment). In another exemplary implementation, the central device 220 may be dynamically selected based on one or more characteristics of a given multi-modal natural language input, dialogue, or interaction (e.g., a device may be designated as the central device 220 when a current utterance relates to a specific domain).


In the centralized implementation illustrated in FIG. 2, a multi-modal natural language input may be received at one of the voice-enabled devices 210a-n. Therefore, the receiving one of the devices 210a-n may be designated as an input device for that input, while the remaining devices 210a-n may be designated as secondary devices for that input. In other words, for any given multi-modal natural language input, the multi-device environment may include an input device that collects the input, a central device 220 that aggregates intent determination, inferencing, and processing capabilities for all of the devices 210a-n in the environment, and one or more secondary devices that may also be used in the intent determination process. As such, each device 210 in the environment may be provided with a constellation model that identifies all of the devices 210 having incoming and outgoing communication capabilities, thus indicating an extent to which other devices may be capable of determining intent for a given multi-modal natural language input. The constellation model may further define a location of the central device 220, which aggregates context, vocabularies, content, recognition grammars, misrecognitions, shared knowledge, intent determination capabilities, inferencing capabilities, and other information from the various devices 210a-n in the environment.
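The per-input role designation described above can be expressed in a few lines. The function name and the string labels are hypothetical; the identifiers mirror the reference numerals only for readability.

```python
def assign_roles(device_ids, receiving_id, central_id):
    """Designate roles for one input: the receiver is the input device,
    the other devices are secondary, and the central device keeps its role."""
    roles = {
        device: "input" if device == receiving_id else "secondary"
        for device in device_ids
    }
    roles[central_id] = "central"
    return roles
```

Because the designation is made per input, calling the function again with a different receiving device yields a different assignment for the same set of devices.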


Accordingly, as communication and processing capabilities permit, the central device 220 may be used as a recognizer of first or last resort. For example, because the central device 220 converges intent determination capabilities across the environment (e.g., by aggregating context, vocabularies, device capabilities, and other information from the devices 210a-n in the environment), inputs may be automatically routed to the central device 220 when used as a recognizer of first resort, or as a recognizer of last resort when local processing at the input device 210 cannot determine the intent of the input with a satisfactory level of confidence. However, it will also be apparent that in certain instances the input device 210 may be unable to make contact with the central device 220 for various reasons (e.g., a network connection may be unavailable, or a processing bottleneck at the central device 220 may cause communication delays). In such cases, the input device 210 that has initiated contact with the central device 220 may shift into decentralized processing (e.g., as described in reference to FIG. 6) and communicate capabilities with one or more of the other devices 210a-n in the constellation model. Thus, when the central device 220 cannot be invoked for various reasons, the remaining devices 210a-n may operate as cooperative nodes to determine intent in a decentralized manner.


Additionally, in the multi-device voice services environment, the central device 220 and the various other devices 210a-n may cooperate to create a converged model of capabilities throughout the environment. For example, as indicated above, in addition to having intent determination capabilities based on processing resources, memory resources, and device capabilities, each of the devices 210a-n and the central device 220 may include various other natural language processing components. The voice services environment may therefore operate in an integrated manner by maintaining not only a complete model of data, content, and services associated with the various devices 210a-n, but also of other natural language processing capabilities and dynamic states associated with the various devices 210a-n. As such, the various devices 210a-n may operate with a goal of converging capabilities, data, states, and other information across the environment, either on one device (e.g., the central device 220) or distributed among the various devices 210a-n (e.g., as in the decentralized implementation to be described in FIG. 6).


For example, as discussed above, each device 210 includes an Automatic Speech Recognizer, one or more dynamically adaptable recognition grammars, and vocabulary lists used to generate phonemic interpretations of natural language utterances. Moreover, each device 210 includes locally established context, which can range from information contained in a context stack, context and namespace variables, vocabulary translation mechanisms, short-term shared knowledge relating to a current dialogue or conversational interaction, long-term shared knowledge relating to a user's learned preferences over time, or other contextual information. Furthermore, each device 210 may have various services or applications associated therewith, and may perform various aspects of natural language processing locally. Thus, additional information to be converged throughout the environment may include partial or preliminary utterance recognitions, misrecognitions or ambiguous recognitions, inferencing capabilities, and overall device state information (e.g., songs playing in the environment, alarms set in the environment, etc.).


Thus, various data synchronization and referential integrity algorithms may be employed in concert by the various devices 210a-n and the central device 220 to provide a consistent worldview of the environment. For example, information may be described and transmitted throughout the environment for synchronization and convergence purposes using the Universal Plug and Play protocol designed for computer ancillary devices, although the environment can also operate in a peer-to-peer disconnected mode (e.g., when the central device 220 cannot be reached). However, in various implementations, the environment may also operate in a peer-to-peer mode regardless of the disconnected status, as illustrated in FIG. 6, for example, when the devices 210a-n have sufficient commensurate resources and capabilities for natural language processing.


In general, the algorithms for convergence in the environment can be executed at various intervals, although it may be desirable to limit data transmission so as to avoid processing bottlenecks. For example, because the convergence and synchronization techniques relate to natural language processing, in which any given utterance will typically be expressed over a course of several seconds, information relating to context and vocabulary need not be updated on a time frame of less than a few seconds. However, as communication capabilities permit, context and vocabulary could be updated more frequently to provide real-time recognition or the appearance of real-time recognition. In another implementation, the convergence and synchronization may be permitted to run until completion (e.g., when no requests are currently pending), or the convergence and synchronization may be suspended or terminated when a predetermined time or resource consumption limit has been reached (e.g., when the convergence relates to a pending request, an intent determination having a highest confidence level at the time of cut-off may be used).


By establishing a consistent view of capabilities, data, states, and other information throughout the environment, an input device 210 may cooperate with the central device 220 and one or more secondary devices (i.e., one or more of devices 210a-n, other than the input device) in processing any given multi-modal natural language input. Furthermore, by providing each device 210 and the central device 220 with a constellation model that describes a synchronized state of the environment, the environment may be tolerant of failure by one or more of the devices 210a-n, or of the central device 220. For example, if the input device 210 cannot communicate with the central device 220 (e.g., because of a server crash), the input device 210 may enter a disconnected peer-to-peer mode, whereby capabilities can be exchanged with one or more devices 210a-n with which communications remain available. To that end, when a device 210 establishes new information relating to vocabulary, context, misrecognitions, agent adaptation, intent determination capabilities, inferencing capabilities, or otherwise, the device 210 may transmit the information to the central device 220 for convergence purposes, as discussed above, in addition to consulting the constellation model to determine whether the information should be transmitted to one or more of the other devices 210a-n.


For example, suppose the environment includes a voice-enabled mobile phone that has nominal functionality relating to playing music or other media, and which further has a limited amount of local storage space, while the environment further includes a voice-enabled home media system that includes a mass storage medium that provides dedicated media functionality. If the mobile phone were to establish new vocabulary, context, or other information relating to a song (e.g., a user downloads the song or a ringtone to the mobile phone while on the road), the mobile phone may transmit the newly established information to the home media system in addition to the central device 220. As such, by having a model of all of the devices 210a-n in the environment and transmitting new information to the devices where it will most likely be useful, the various devices may handle disconnected modes of operation when the central device 220 may be unavailable for any reason, while resources may be allocated efficiently throughout the environment.
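The selective transmission in the foregoing example might be sketched as follows. The function name, its parameters, and the representation of device functionality as sets of domain labels are hypothetical and serve only to illustrate the idea of sending new information where it will most likely be useful.

```python
def recipients_for(info_domain, source, central, device_domains):
    """Return the devices that should receive newly established information.

    The central device comes first (for convergence), followed by any other
    device whose domains match the new information. `device_domains` maps a
    device name to the set of domains that device serves (an assumption).
    """
    targets = [central] if central != source else []
    targets += sorted(
        device
        for device, domains in device_domains.items()
        if device not in (source, central) and info_domain in domains
    )
    return targets


# A phone that downloads a song notifies the central device and the media system.
domains = {
    "phone": {"telephony", "media"},
    "home_media": {"media"},
    "nav": {"navigation"},
}
print(recipients_for("media", "phone", "server", domains))
```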


Thus, based on the foregoing discussion, it will be apparent that a centralized implementation of an integrated multi-device voice services environment may generally include a central device 220 operable to aggregate or converge knowledge relating to content, services, capabilities, and other information associated with various voice-enabled devices 210a-n deployed within the environment. In such centralized implementations, the central device 220 may be invoked as a recognizer of first or last resort, as will be described in greater detail with reference to FIGS. 3-5, and furthermore, the other devices 210a-n in the environment may be configured to automatically enter a disconnected or peer-to-peer mode of operation when the central device 220 cannot be invoked for any reason (i.e., devices may enter a decentralized or distributed mode, as will be described in greater detail with reference to FIGS. 6-7). Knowledge and capabilities of each of the devices 210a-n may therefore be made available throughout the voice services environment in a centralized manner, a distributed manner, or various combinations thereof, thus optimizing an amount of natural language processing resources used to determine an intent of any given multi-modal natural language input.


According to various aspects of the invention, FIG. 3 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at an input device in the centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment. Similarly, FIGS. 4 and 5 illustrate corresponding methods associated with a central device and one or more secondary devices, respectively, in the centralized voice service environment. Furthermore, it will be apparent that the processing techniques described in relation to FIGS. 3-5 may generally be based on the centralized implementation illustrated in FIG. 2 and described above, whereby the input device may be assumed to be distinct from the central device, and the one or more secondary devices may be assumed to be distinct from the central device and the input device. However, it will be apparent that various instances may involve a natural language input being received at the central device or at another device, in which case the techniques described in FIGS. 3-5 may vary depending on circumstances of the environment (e.g., decisions relating to routing utterances to a specific device or devices may be made locally, collaboratively, or in other ways depending on various factors, such as overall system state, communication capabilities, intent determination capabilities, or otherwise).


As illustrated in FIG. 3, a multi-modal natural language input may be received at an input device in an operation 310. The multi-modal input may include at least a natural language utterance provided by a user, and may further include other input modalities such as audio, text, button presses, gestures, or other non-voice inputs. It will also be apparent that prior to receiving the natural language input in operation 310, the input device may be configured to establish natural language processing capabilities. For example, establishing natural language processing capabilities may include, among other things, loading an Automatic Speech Recognizer and any associated recognition grammars, launching a conversational language processor to handle dialogues with the user, and installing one or more domain agents that provide functionality for respective application domains or contextual domains (e.g., navigation, music, movies, weather, information retrieval, device control, etc.).


The input device may also be configured to coordinate synchronization of intent determination capabilities, shared knowledge, and other information with the central device and the secondary devices in the environment prior to receiving the input at operation 310. For example, when the input device installs a domain agent, the installed domain agent may bootstrap context variables, semantics, namespace variables, criteria values, and other context related to that agent from other devices in the system. Similarly, misrecognitions may be received from the central device and the secondary devices in order to enable correction of agents that use information relevant to the received misrecognitions, and vocabularies and associated translation mechanisms may be synchronized among the devices to account for potential variations between the Automatic Speech Recognizers used by the various devices (e.g., each device in the environment cannot be guaranteed to use the same Automatic Speech Recognizer or recognition grammars, necessitating vocabulary and translation mechanisms to be shared among the devices that share intent determination capabilities).


Upon establishing and synchronizing natural language processing capabilities and subsequently receiving a multi-modal natural language input in operation 310, the input device may determine whether the environment has been set up to automatically transmit the input to the central device in a decisional operation 320. In such a case, processing proceeds to an operation 360 for transmitting the input to the central device, which may then process the input according to techniques to be described in relation to FIG. 4. If the environment has not been set up to automatically communicate the input to the central device, however, processing proceeds to an operation 330, where the input device performs transcription of the natural language utterance contained in the multi-modal input. For example, the input device may transcribe the utterance using the Automatic Speech Recognizer and recognition grammars associated therewith according to techniques described above and in the above-referenced U.S. Patent Applications.


Subsequently, in an operation 340, an intent of the multi-modal natural language input may be determined at the input device using local natural language processing capabilities and resources. For example, any non-voice input modalities included in the input may be merged with the utterance transcription and a conversational language processor associated with the input device may utilize local information relating to context, domain knowledge, shared knowledge, context variables, criteria values, or other information useful in natural language processing. As such, the input device may attempt to determine a best guess as to an intent of the user that provided the input, such as identifying a conversation type (e.g., query, didactic, or exploratory) or request that may be contained in the input (e.g., a command or query relating to one or more domain agents or application domains).


The intent determination of the input device may be assigned a confidence level (e.g., a device having an Automatic Speech Recognizer that implements multi-pass analysis may assign comparatively higher confidence levels to utterance transcriptions created thereby, which may result in a higher confidence level for the intent determination). The confidence level may be assigned based on various factors, as described in the above-referenced U.S. Patent Applications. As such, a decisional operation 350 may include determining whether the intent determination of the input device meets an acceptable level of confidence. When the intent determination meets the acceptable level of confidence, processing may proceed directly to an operation 380 where action may be taken in response thereto. For example, when the intent determination indicates that the user has requested certain information, one or more queries may be formulated to retrieve the information from appropriate information sources, which may include one or more of the other devices. In another example, when the intent determination indicates that the user has requested a given command (e.g., to control a specific device), the command may be routed to the appropriate device for execution.


Thus, in cases where the input device can determine the intent of a natural language input without assistance from the central device or the secondary devices, communications and processing resources may be conserved by taking immediate action as may be appropriate. On the other hand, when the intent determination of the input device does not meet the acceptable level of confidence, decisional operation 350 may result in the input device requesting assistance from the central device in operation 360. In such a case, the multi-modal natural language input may be communicated to the central device in its entirety, whereby the central device processes the input according to techniques described in FIG. 4. However, should transmission to the central device fail for some reason, the input device may shift into a disconnected peer-to-peer mode where one or more secondary devices may be utilized, as will be described below in relation to FIG. 7. When transmission to the central device occurs without incident, however, the input device may receive an intent determination from the central device in an operation 370, and may further receive results of one or more requests that the central device was able to resolve, or requests that the central device has formulated for further processing on the input device. As such, the input device may take action in operation 380 based on the information received from the central device in operation 370. For example, the input device may route queries or commands to local or remote information sources or devices based on the intent determination, or may present results of the requests processed by the central device to the user.
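The control flow of FIG. 3 described above might be sketched as follows. This is a simplified illustration under stated assumptions: the recognizers are passed in as plain callables, confidence is modeled as a number between 0 and 1, the threshold value is arbitrary, and a failed transmission to the central device is modeled as a `ConnectionError`. None of these choices is prescribed by the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    label: str
    confidence: float  # 0.0 to 1.0; the scale is an assumption


def process_at_input_device(
    utterance,
    local_recognizer,
    central_recognizer,
    peer_fallback,
    auto_route_to_central=False,
    threshold=0.7,
):
    """Return the chosen intent and a tag naming where it was determined."""
    if not auto_route_to_central:                 # decisional operation 320
        local = local_recognizer(utterance)       # operations 330 and 340
        if local.confidence >= threshold:         # decisional operation 350
            return local, "local"                 # act directly (operation 380)
    try:
        # Operations 360 and 370: send the input, receive the central intent.
        return central_recognizer(utterance), "central"
    except ConnectionError:
        # Central device unreachable: shift into disconnected peer-to-peer mode.
        return peer_fallback(utterance), "peers"
```

Setting `auto_route_to_central=True` corresponds to using the central device as a recognizer of first resort, while the default corresponds to using it as a recognizer of last resort.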


Referring to FIG. 4, the central device may receive the multi-modal natural language input from the input device in an operation 410. The central device, having aggregated context and other knowledge from throughout the environment, may thus transcribe the utterance in an operation 420 and determine an intent of the input from the transcribed utterance in an operation 430. As such, the central device may consider information relating to context, domain agents, applications, and device capabilities throughout the environment in determining the intent of the utterance, including identification of one or more domains relevant to the input. However, it will be apparent that utilizing information aggregated from throughout the environment may cause ambiguity or uncertainty in various instances (e.g., an utterance containing the word “traffic” may have a different intent in domains relating to movies, music, and navigation).


As such, once the central device has attempted to determine the intent of the natural language input, a determination may be made in an operation 440 as to whether one or more secondary devices (i.e., other devices in the constellation besides the input device) may also be capable of intent determination in the identified domain or domains. When no such secondary devices can be identified, decisional operation 440 may branch directly to an operation 480 to send to the input device the determined intent and any commands, queries, or other requests identified from the determined intent.


On the other hand, when one or more secondary devices in the environment have intent determination capabilities in the identified domain or domains, the natural language input may be sent to such secondary devices in an operation 450. The secondary devices may then determine an intent as illustrated in FIG. 5, which may include techniques generally similar to those described above in relation to the input device and central device (i.e., the natural language input may be received in an operation 510, an utterance contained therein may be transcribed in an operation 520, and an intent determination made in an operation 530 may be returned to the central device in an operation 540).


Returning to FIG. 4, the central device may collate intent determination responses received from the secondary devices in an operation 460. For example, as indicated above, the central device may identify one or more secondary devices capable of determining intent in a domain that the central device has identified as being relevant to the natural language utterance. As will be apparent, the secondary devices invoked in operation 450 may often include a plurality of devices, and intent determination responses may be received from the secondary devices in an interleaved manner, depending on processing resources, communications throughput, or other factors (e.g., the secondary devices may include a telematics device having a large amount of processing power and a broadband network connection and an embedded mobile phone having less processing power and only a cellular connection, in which case the telematics device may be highly likely to provide results to the central device before the embedded mobile phone). Thus, based on potential variations in response time of secondary devices, the central device may be configured to place constraints on collating operation 460. For example, the collating operation 460 may be terminated as soon as an intent determination has been received from one of the secondary devices that meets an acceptable level of confidence, or the operation 460 may be cut off when a predetermined amount of time has lapsed or a predetermined amount of resources have been consumed. In other implementations, however, it will be apparent that collating operation 460 may be configured to run to completion, regardless of whether delays have occurred or suitable intent determinations have been received. 
Further, it will be apparent that various criteria may be used to determine whether or when to end the collating operation 460, including the nature of a given natural language input, dialogue, or other interaction, or system or user preferences, among other criteria.
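The constraints on collating operation 460 might be sketched as follows. To keep the example deterministic, response arrival is simulated with a per-response arrival time instead of real network timing; the `Response` type, the parameter names, and the default limits are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Response(NamedTuple):
    device: str
    arrival_s: float   # simulated seconds after the request was sent
    confidence: float  # confidence reported by the secondary device


def collate(responses, accept_confidence: Optional[float] = 0.8,
            deadline_s: Optional[float] = 2.0):
    """Collect responses in arrival order until a stop condition holds.

    Passing None for both limits lets collation run to completion.
    """
    collected = []
    for response in sorted(responses, key=lambda r: r.arrival_s):
        if deadline_s is not None and response.arrival_s > deadline_s:
            break  # time limit reached before this response arrived
        collected.append(response)
        if accept_confidence is not None and response.confidence >= accept_confidence:
            break  # an acceptable determination has arrived
    return collected
```

A resource consumption limit could be handled in the same way as the deadline, by tracking a running cost in place of the arrival time.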


In any event, when the collating operation 460 has completed, a subsequent operation 470 may include the central device arbitrating among the intent determination responses received from one or more of the secondary devices previously invoked in operation 450. For example, each of the invoked secondary devices that generate an intent determination may also assign a confidence level to that intent determination, and the central device may consider the confidence levels in arbitrating among the responses. Moreover, the central device may associate other criteria with the secondary devices or the intent determinations received from the secondary devices to further enhance a likelihood that the best intent determination will be used. For example, various ones of the secondary devices may only be invoked for partial recognition in distinct domains, and the central device may aggregate and arbitrate the partial recognitions to create a complete transcription. In another example, a plurality of secondary devices may be invoked to perform overlapping intent determination, and the central device may consider capabilities of the secondary devices to weigh the respective confidence levels (e.g., when one of two otherwise identical secondary devices employs multi-pass speech recognition analysis, the secondary device employing the multi-pass speech recognition analysis may be weighed as having a higher likelihood of success). It will be apparent that the central device may be configured to arbitrate and select one intent determination from among all of the intent hypotheses, which may include the intent determination hypothesis generated by the central device in operation 430. Upon selecting the best intent determination hypothesis, the central device may then provide that intent determination to the input device in operation 480, as well as any commands, queries, or other requests that may be relevant thereto. 
The input device may then take appropriate action as described above in relation to FIG. 3.
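Arbitration operation 470 might be sketched as follows, with the central device's own hypothesis from operation 430 included among the candidates. The multiplicative bonus for multi-pass recognition is purely an assumed weighting scheme; the description above states only that device capabilities may be considered when weighing confidence levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    device: str
    intent: str
    confidence: float          # confidence reported by the device
    multi_pass: bool = False   # whether the device uses multi-pass recognition


def arbitrate(hypotheses, multi_pass_bonus=1.1):
    """Pick the hypothesis with the highest capability-weighted confidence."""
    if not hypotheses:
        raise ValueError("no intent hypotheses to arbitrate")

    def weighted(h):
        # Assumed weighting: scale confidence up for multi-pass devices.
        return h.confidence * (multi_pass_bonus if h.multi_pass else 1.0)

    return max(hypotheses, key=weighted)
```

With two otherwise identical secondary devices reporting equal confidence, the one employing multi-pass analysis is selected under this weighting.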


According to various aspects of the invention, FIG. 6 illustrates a block diagram of an exemplary distributed implementation of the integrated, multi-modal, multi-device natural language voice service environment. As described above, the distributed implementation may also be categorized as a disconnected or peer-to-peer mode that may be employed when a central device in a centralized implementation cannot be reached or otherwise does not meet the needs of the environment. The distributed implementation illustrated in FIG. 6 may generally operate with similar purposes as described above in relation to the centralized implementation (i.e., to ensure that the environment includes a comprehensive model of aggregate knowledge and capabilities of a plurality of devices 610a-n in the environment). Nonetheless, the distributed implementation may operate in a somewhat different manner, in that one or more of the devices 610a-n may be provided with the entire constellation model, or various aspects of the model may be distributed among the plurality of devices 610a-n, or various combinations thereof.


Generally speaking, the plurality of voice-enabled devices 610a-n may be coupled to one another by a voice services interface 630, which may include any suitable real or virtual interface (e.g., a common message bus or network interface, a service-oriented abstraction layer, etc.). The various devices 610a-n may therefore operate as cooperative nodes in determining intent for multi-modal natural language utterances received by any one of the devices 610. Furthermore, the devices 610a-n may share knowledge of vocabularies, context, capabilities, and other information, while certain forms of data may be synchronized to ensure consistent processing among the devices 610a-n. For example, because natural language processing components used in the devices 610a-n may vary (e.g., different recognition grammars or speech recognition techniques may exist), vocabulary translation mechanisms, misrecognitions, context variables, criteria values, criteria handlers, and other such information used in the intent determination process should be synchronized to the extent that communication capabilities permit.


By sharing intent determination capabilities, device capabilities, inferencing capabilities, domain knowledge, and other information, decisions as to routing an utterance to a specific one of the devices 610a-n may be made locally (e.g., at an input device), collaboratively (e.g., a device having particular capabilities relevant to the utterance may communicate a request to process the utterance), or various combinations thereof (e.g., the input device may consider routing to secondary devices only when an intent of the utterance cannot be determined). Similarly, partial recognition performed at one or more of the devices 610a-n may be used to determine routing strategies for further intent determination of the utterance. For example, an utterance that contains a plurality of requests relating to a plurality of different domains may be received at an input device that can only determine intent in one of the domains. In this example, the input device may perform partial recognition for the domain associated with the input device, and the partial recognition may also identify the other domains relevant to the utterance for which the input device does not have sufficient recognition information. Thus, the partial recognition performed by the input device may result in identification of other potentially relevant domains and a strategy may be formulated to route the utterance to other devices in the environment that include recognition information for those domains.
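A routing strategy derived from partial recognition, as in the multi-domain example above, might be sketched as follows. The function assumes that partial recognition has already produced the set of relevant domains and that shared capability knowledge is available as a mapping from device name to domain set; both representations are hypothetical.

```python
def plan_routing(relevant_domains, input_device, capabilities):
    """Map each relevant domain to the devices that can determine intent in it.

    Domains the input device supports stay local; every other domain is
    routed to the other devices holding recognition information for it.
    An empty list means no known device covers that domain.
    """
    local_domains = capabilities.get(input_device, set())
    plan = {}
    for domain in sorted(relevant_domains):
        if domain in local_domains:
            plan[domain] = [input_device]
        else:
            plan[domain] = sorted(
                device
                for device, domains in capabilities.items()
                if device != input_device and domain in domains
            )
    return plan


caps = {"phone": {"music"}, "nav": {"navigation"}, "server": {"navigation", "weather"}}
print(plan_routing({"music", "navigation"}, "phone", caps))
```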


As a result, multi-modal natural language inputs, including natural language utterances, may be routed among the various devices 610a-n in order to perform intent determination in a distributed manner. However, as the capabilities and knowledge held by any one of the devices 610a-n may vary, each of the devices 610a-n may be associated with a reliability factor for intent determinations generated by the respective devices 610a-n. As such, to ensure that final intent determinations can be relied upon with a sufficient level of confidence, knowledge may be distributed among the devices 610a-n to ensure that reliability metrics for intent determinations provided by each of the devices 610a-n are commensurable throughout the environment. For example, additional knowledge may be provided to a device having a low intent determination reliability, even when such knowledge results in redundancy in the environment, to ensure commensurate reliability of intent determination environment-wide.


Therefore, in distributed implementations of the integrated voice services environment, utterances may be processed in various ways, which may depend on circumstances at a given time (e.g., system states, system or user preferences, etc.). For example, an utterance may be processed locally at an input device and only routed to secondary devices when an intent determination confidence level falls below a given threshold. In another example, utterances may be routed to a specific device based on the modeling of knowledge and capabilities discussed above. In yet another example, utterances may be flooded among all of the devices in the environment, and arbitration may occur whereby intent determinations may be collated and arbitrated to determine a best guess at intent determination.


Thus, utterances may be processed in various ways, including through local techniques, centralized techniques, distributed techniques, and various combinations thereof. Although many variations will be apparent, FIG. 7 illustrates an exemplary method for combined local and distributed processing of multi-modal, natural language inputs in a distributed implementation of the voice service environment, according to various aspects of the invention. In particular, the distributed processing may begin in an operation 710, where a multi-modal natural language input may be received at an input device. The input device may then utilize various natural language processing capabilities associated therewith in an operation 720 to transcribe an utterance contained in the multi-modal input (e.g., using an Automatic Speech Recognizer and associated recognition grammars), and may subsequently determine a preliminary intent of the multi-modal natural language input in an operation 730. It will be apparent that operations 710 through 730 may generally be performed using local intent determination capabilities associated with the input device.


Thereafter, the input device may invoke intent determination capabilities of one or more secondary devices in an operation 740. More particularly, the input device may provide information associated with the multi-modal natural language input to one or more of the secondary devices, which may utilize local intent determination capabilities to attempt to determine intent of the input using techniques as described in relation to FIG. 5. It will also be apparent that, in various implementations, the secondary devices invoked in operation 740 may include only devices having intent determination capabilities associated with a specific domain identified as being associated with the input. In any event, the input device may receive intent determinations from the invoked secondary devices in an operation 750, and the input device may then collate the intent determinations received from the secondary devices. The input device may then arbitrate among the various intent determinations, or may combine various ones of the intent determinations (e.g., when distinct secondary devices determine intent in distinct domains), or otherwise arbitrate among the intent determinations to determine a best guess at the intent of the multi-modal natural language input (e.g., based on confidence levels associated with the various intent determinations). Based on the determined intent, the input device may then take appropriate action in an operation 770, such as issuing one or more commands, queries, or other requests to be executed at one or more of the input device or the secondary devices.
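The combining step described above, in which distinct secondary devices determine intent in distinct domains, might be sketched as follows. Selecting the highest-confidence determination per domain is one assumed arbitration rule among many, and the `Determination` type and its fields are introduced only for the example.

```python
from typing import NamedTuple


class Determination(NamedTuple):
    device: str
    domain: str
    intent: str
    confidence: float


def combine_by_domain(determinations):
    """Keep the highest-confidence determination for each domain."""
    best = {}
    for d in determinations:
        current = best.get(d.domain)
        if current is None or d.confidence > current.confidence:
            best[d.domain] = d
    return best
```

The preliminary intent determined locally in operation 730 can be passed in alongside the determinations received in operation 750, so that the input device's own hypothesis competes on equal terms.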


Furthermore, in addition to the exemplary implementations described above, various implementations may include a continuous listening mode of operation where a plurality of devices may continuously listen for multi-modal voice-based inputs. In the continuous listening mode, each of the devices in the environment may be triggered to accept a multi-modal input when one or more predetermined events occur. For example, the devices may each be associated with one or more attention words, such as “Phone, <multi-modal request>” for a mobile phone, or “Computer, <multi-modal request>” for a personal computer. When one or more of the devices in the environment recognize the associated attention word, keyword activation may result, where the associated devices are triggered to accept the subsequent multi-modal request. Further, where a plurality of devices in a constellation are listening, the constellation may use all available inputs to increase recognition rates.
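For purposes of illustration, keyword activation of the kind described above might be sketched as follows. The sketch assumes the utterance has already been transcribed to text and that the attention word is separated from the request by a comma, as in the examples above. A practical system would more likely spot the attention word in the audio itself. The `ATTENTION_WORDS` table and the `detect_attention_word` helper are hypothetical names.

```python
# Hypothetical mapping from attention words to the devices they activate.
ATTENTION_WORDS = {
    "phone": "mobile_phone",
    "computer": "personal_computer",
}


def detect_attention_word(transcript, attention_words=ATTENTION_WORDS):
    """Return (device, request) if the transcript opens with a known
    attention word followed by a comma, otherwise None."""
    head, sep, rest = transcript.partition(",")
    if not sep:
        return None  # no comma, so no "<attention word>, <request>" pattern
    device = attention_words.get(head.strip().lower())
    if device is None:
        return None  # leading word is not a registered attention word
    return device, rest.strip()
```

For example, `detect_attention_word("Phone, call Bob")` yields the mobile phone together with the request text that follows the attention word, whereas a transcript without a registered attention word yields `None` and no device is triggered.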


Moreover, it will be apparent that the continuous listening mode may be applied in centralized voice service environments, distributed voice service environments, or various combinations thereof. For example, when each device in the constellation has a different attention word, any given device that recognizes an attention word may consult a constellation model to determine a target device or functionality associated with the attention word. In another example, when a plurality of devices in the constellation share one or more attention words, the plurality of devices may coordinate with one another to synchronize information for processing the multi-modal input, such as a start time for an utterance contained therein.
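The two examples above might be sketched, purely by way of illustration, as follows. The `ConstellationModel` class and the `agree_on_start_time` helper are hypothetical. The rule of adopting the earliest reported start time is an assumption made for the sketch, since the description does not specify how devices synchronize, and device clocks are assumed to be already aligned.

```python
from dataclasses import dataclass, field


@dataclass
class ConstellationModel:
    """Hypothetical registry mapping attention words to the devices that own them."""
    targets: dict = field(default_factory=dict)

    def register(self, attention_word, device):
        """Associate a device with an attention word (case-insensitive)."""
        self.targets.setdefault(attention_word.lower(), []).append(device)

    def resolve(self, attention_word):
        """Return the devices associated with an attention word (empty if unknown)."""
        return list(self.targets.get(attention_word.lower(), []))


def agree_on_start_time(reported_starts):
    """Pick a shared utterance start time from per-device reports.

    Assumption: clocks are aligned and the earliest report reflects the most
    complete capture of the utterance; this rule is illustrative only.
    """
    if not reported_starts:
        raise ValueError("no start times reported")
    return min(reported_starts.values())
```

When an attention word resolves to a single device, that device is the target. When it resolves to several devices, those devices share the attention word and may use something like `agree_on_start_time` to align their view of the utterance before processing it.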


Implementations of the invention may be made in hardware, firmware, software, or various combinations thereof. The invention may also be implemented as instructions stored on a machine-readable medium, which may be read and executed by one or more processors. A machine-readable medium may include various mechanisms for storing or transmitting information in a form readable by a machine (e.g., a computing device). For example, a machine-readable storage medium may include read only memory, random access memory, magnetic disk storage media, optical storage media, flash memory devices, and others, and a machine-readable transmission medium may include forms of propagated signals, such as carrier waves, infrared signals, digital signals, and others. Further, firmware, software, routines, or instructions may be described in the above disclosure in terms of specific exemplary aspects and implementations of the invention, and as performing certain actions. However, it will be apparent that such descriptions are merely for convenience and that such actions in fact result from computing devices, processors, controllers, or other devices executing the firmware, software, routines, or instructions.


Aspects and implementations may be described as including a particular feature, structure, or characteristic, but every aspect or implementation may not necessarily include the particular feature, structure, or characteristic. Further, when a particular feature, structure, or characteristic has been described in connection with an aspect or implementation, it will be understood that such feature, structure, or characteristic may be included in connection with other aspects or implementations, whether or not explicitly described. Thus, various changes and modifications may be made to the preceding description without departing from the scope or spirit of the invention, and the specification and drawings should therefore be regarded as exemplary only, and the scope of the invention determined solely by the appended claims.

Claims
  • 1. A method of providing an integrated multi-modal, natural language voice services environment comprising one or more of an input device that receives a multi-modal natural language input comprising at least a natural language utterance and a non-voice input related to the natural language utterance, a first device, or one or more secondary devices, the method being implemented in the first device having one or more physical processors programmed with computer program instructions that, when executed by the one or more physical processors, program the first device to perform the method, wherein the one or more secondary devices include at least a second device, the method comprising: obtaining, by the first device from the input device, the multi-modal natural language input; transcribing, by the first device, the natural language utterance; determining, by the first device, a preliminary intent prediction of the multi-modal natural language input based on the transcribed utterance and the non-voice input; transmitting, by the first device, the multi-modal natural language input to the second device; receiving, by the first device from the second device, a second intent prediction of the multi-modal natural language input; determining, by the first device, an intent of the multi-modal natural language input based on the preliminary intent prediction and the second intent prediction; and invoking, by the first device, at least one action at one or more of the input device, the first device, or the one or more secondary devices based on the determined intent.
  • 2. The method of claim 1, wherein invoking the at least one action at one or more of the input device, the first device, or the one or more secondary devices comprises transmitting a request related to the multi-modal natural language input based on the preliminary intent prediction.
  • 3. The method of claim 1, the method further comprising: determining, by the first device, processing capabilities associated with the one or more secondary devices; and selecting, by the first device, based on the processing capabilities associated with the one or more secondary devices, the second device to make the second intent prediction of the multi-modal natural language input.
  • 4. The method of claim 3, the method further comprising: maintaining, by the first device, a constellation model that describes natural language resources, dynamic states, and intent determination capabilities associated with the input device and the one or more secondary devices, wherein the processing capabilities associated with the one or more secondary devices are determined based on the constellation model.
  • 5. The method of claim 4, wherein the intent determination capabilities for a given one of the input device, the first device, or the one or more secondary devices are based on at least one of processing power, storage resources, natural language processing capabilities, or local knowledge.
  • 6. The method of claim 1, the method further comprising: determining, by the first device, a domain relating to the multi-modal natural language input; and selecting, by the first device, based on the domain, the second device to make the second intent prediction of the multi-modal natural language input.
  • 7. The method of claim 6, wherein the one or more secondary devices are associated with different domains, the second device is associated with the domain, and the different domains comprise the domain.
  • 8. The method of claim 1, wherein the input device initially received the multi-modal natural language input.
  • 9. A method of providing an integrated multi-modal, natural language voice services environment comprising one or more of an input device that receives a multi-modal natural language input comprising at least a natural language utterance and a non-voice input related to the natural language utterance, a first device, or one or more secondary devices, the method being implemented in the first device having one or more physical processors programmed with computer program instructions that, when executed by the one or more physical processors, program the first device to perform the method, the method comprising: obtaining, by the first device from the input device, the multi-modal natural language input; transcribing, by the first device, the natural language utterance; determining, by the first device, a preliminary intent prediction of the multi-modal natural language input based on the transcribed utterance and the non-voice input; communicating, by the first device, the multi-modal natural language input to each of the one or more secondary devices, wherein each of the one or more secondary devices determines an intent of the multi-modal natural language input received at the input device using local intent determination capabilities; receiving, by the first device, an intent determination from each of the secondary devices; and arbitrating, by the first device, among the intent determinations received from each of the secondary devices to determine an intent of the multi-modal natural input; and invoking, by the first device, at least one action at one or more of the input device, the first device, or the one or more secondary devices based on the determined intent.
  • 10. A system for processing a multi-modal natural language input, the system comprising: an input device that receives a multi-modal natural language input comprising at least a natural language utterance and a non-voice input related to the natural language utterance; one or more secondary devices, wherein the one or more secondary devices include at least a second device, and a first device having one or more physical processors programmed with computer program instructions that, when executed by the one or more physical processors, program the first device to: obtain, from the input device, the multi-modal natural language input; transcribe the natural language utterance; determine a preliminary intent prediction of the multi-modal natural language input based on the transcribed utterance and the non-voice input; and transmit the multi-modal natural language input to the second device; receive, from the second device, a second intent prediction of the multi-modal natural language input; determine an intent of the multi-modal natural language input based on the preliminary intent prediction and the second intent prediction; and invoke at least one action at one or more of the input device, the first device, or the one or more secondary devices based on the determined intent.
  • 11. The system of claim 10, wherein to invoke the at least one action at one or more of the input device, the first device, or the one or more secondary devices, the first device is further programmed to: transmit a request related to the multi-modal natural language input based on the preliminary intent prediction.
  • 12. The system of claim 10, wherein the first device is further programmed to: determine processing capabilities associated with the one or more secondary devices; and select based on the processing capabilities associated with the one or more secondary devices, the second device to make the second intent prediction of the multi-modal natural language input.
  • 13. The system of claim 12, wherein the first device is further programmed to: maintain a constellation model that describes natural language resources, dynamic states, and intent determination capabilities associated with the input device and the one or more secondary devices, wherein the processing capabilities associated with the one or more secondary devices are determined based on the constellation model.
  • 14. The system of claim 13, wherein the intent determination capabilities for a given one of the input device, the first device, or the one or more secondary devices are based on at least one of processing power, storage resources, natural language processing capabilities, or local knowledge.
  • 15. The system of claim 10, wherein the first device is further programmed to: determine a domain relating to the multi-modal natural language input; and select, based on the domain, the second device to make the second intent prediction of the multi-modal natural language input.
  • 16. The system of claim 15, wherein the one or more secondary devices are associated with different domains, the second device is associated with the domain, and the different domains comprise the domain.
  • 17. The system of claim 10, wherein the input device initially received the multi-modal natural language input.
  • 18. A system for processing a multi-modal natural language input, the system comprising: an input device that receives a multi-modal natural language input comprising at least a natural language utterance and a non-voice input related to the natural language utterance; one or more secondary devices; and a first device having one or more physical processors programmed with computer program instructions that, when executed by the one or more physical processors, program the first device to: obtain, from the input device, the multi-modal natural language input; transcribe the natural language utterance; determine a preliminary intent prediction of the multi-modal natural language input based on the transcribed utterance and the non-voice input; communicate the multi-modal natural language input to each of the one or more secondary devices, wherein each of the one or more secondary devices determines an intent of the multi-modal natural language input received at the input device using local intent determination capabilities; receive an intent determination from each of the secondary devices; and arbitrate among the intent determinations received from each of the secondary devices to determine an intent of the multi-modal natural input, invoke at least one action at one or more of the input device, the first device, or the one or more secondary devices based on the determined intent.
CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. patent application Ser. No. 14/083,061, entitled “System and Method for an Integrated, Multi-Modal, Multi-Device Natural Language Voice Services Environment,” filed Nov. 18, 2013, which is a continuation of U.S. patent application Ser. No. 12/127,343, entitled “System and Method for an Integrated, Multi-Modal, Multi-Device Natural Language Voice Services Environment,” filed May 27, 2008 (which issued as U.S. Pat. No. 8,589,161 on Nov. 19, 2013), the content of which is hereby incorporated by reference in its entirety.

20070100797 Thun May 2007 A1
20070112555 Lavi May 2007 A1
20070112630 Lau May 2007 A1
20070118357 Kasravi May 2007 A1
20070124057 Prieto May 2007 A1
20070135101 Ramati Jun 2007 A1
20070146833 Satomi Jun 2007 A1
20070162296 Altberg Jul 2007 A1
20070174258 Jones Jul 2007 A1
20070179778 Gong Aug 2007 A1
20070185859 Flowers Aug 2007 A1
20070186165 Maislos Aug 2007 A1
20070192309 Fischer Aug 2007 A1
20070198267 Jones Aug 2007 A1
20070203699 Nagashima Aug 2007 A1
20070203736 Ashton Aug 2007 A1
20070208732 Flowers Sep 2007 A1
20070214182 Rosenberg Sep 2007 A1
20070250901 McIntire Oct 2007 A1
20070265850 Kennewick Nov 2007 A1
20070266257 Camaisa Nov 2007 A1
20070276651 Bliss Nov 2007 A1
20070294615 Sathe Dec 2007 A1
20070299824 Pan Dec 2007 A1
20080034032 Healey Feb 2008 A1
20080046311 Shahine Feb 2008 A1
20080059188 Konopka Mar 2008 A1
20080065386 Cross Mar 2008 A1
20080065389 Cross Mar 2008 A1
20080091406 Baldwin Apr 2008 A1
20080103761 Printz May 2008 A1
20080103781 Wasson May 2008 A1
20080104071 Pragada May 2008 A1
20080109285 Reuther May 2008 A1
20080115163 Gilboa May 2008 A1
20080126091 Clark May 2008 A1
20080133215 Sarukkai Jun 2008 A1
20080140385 Mahajan Jun 2008 A1
20080147396 Wang Jun 2008 A1
20080147410 Odinak Jun 2008 A1
20080147637 Li Jun 2008 A1
20080154604 Sathish Jun 2008 A1
20080162471 Bernard Jul 2008 A1
20080177530 Cross Jul 2008 A1
20080184164 Di Fabbrizio Jul 2008 A1
20080189110 Freeman Aug 2008 A1
20080228496 Yu Sep 2008 A1
20080235023 Kennewick Sep 2008 A1
20080235027 Cross Sep 2008 A1
20080270224 Portman Oct 2008 A1
20080294437 Nakano Nov 2008 A1
20080294994 Kruger Nov 2008 A1
20080306743 Di Fabbrizio Dec 2008 A1
20080319751 Kennewick Dec 2008 A1
20090006077 Keaveney Jan 2009 A1
20090006194 Sridharan Jan 2009 A1
20090018829 Kuperstein Jan 2009 A1
20090024476 Baar Jan 2009 A1
20090052635 Jones Feb 2009 A1
20090067599 Agarwal Mar 2009 A1
20090076827 Bulitta Mar 2009 A1
20090106029 DeLine Apr 2009 A1
20090117885 Roth May 2009 A1
20090144131 Chiu Jun 2009 A1
20090144271 Richardson Jun 2009 A1
20090150156 Kennewick Jun 2009 A1
20090164216 Chengalvarayan Jun 2009 A1
20090171664 Kennewick Jul 2009 A1
20090216540 Tessel Aug 2009 A1
20090248565 Chuang Oct 2009 A1
20090248605 Mitchell Oct 2009 A1
20090259561 Boys Oct 2009 A1
20090259646 Fujita Oct 2009 A1
20090265163 Li Oct 2009 A1
20090271194 Davis Oct 2009 A1
20090273563 Pryor Nov 2009 A1
20090276700 Anderson Nov 2009 A1
20090287680 Paek Nov 2009 A1
20090299745 Kennewick Dec 2009 A1
20090299857 Brubaker Dec 2009 A1
20090304161 Pettyjohn Dec 2009 A1
20090307031 Winkler Dec 2009 A1
20090313026 Coffman Dec 2009 A1
20090319517 Guha Dec 2009 A1
20100023320 Cristo Jan 2010 A1
20100029261 Mikkelsen Feb 2010 A1
20100036967 Caine Feb 2010 A1
20100049501 Kennewick Feb 2010 A1
20100049514 Kennewick Feb 2010 A1
20100057443 Cristo Mar 2010 A1
20100063880 Atsmon Mar 2010 A1
20100064025 Nelimarkka Mar 2010 A1
20100094707 Freer Apr 2010 A1
20100138300 Wallis Jun 2010 A1
20100145700 Kennewick Jun 2010 A1
20100185512 Borger Jul 2010 A1
20100204986 Kennewick Aug 2010 A1
20100204994 Kennewick Aug 2010 A1
20100217604 Baldwin Aug 2010 A1
20100286985 Kennewick Nov 2010 A1
20100299142 Freeman Nov 2010 A1
20100312566 Odinak Dec 2010 A1
20100318357 Istvan Dec 2010 A1
20100331064 Michelstein Dec 2010 A1
20110022393 Waller Jan 2011 A1
20110106527 Chiu May 2011 A1
20110112827 Kennewick May 2011 A1
20110112921 Kennewick May 2011 A1
20110119049 Ylonen May 2011 A1
20110131036 DiCristo Jun 2011 A1
20110131045 Cristo Jun 2011 A1
20110231182 Weider Sep 2011 A1
20110231188 Kennewick Sep 2011 A1
20110238409 Larcheveque Sep 2011 A1
20110307167 Taschereau Dec 2011 A1
20120022857 Baldwin Jan 2012 A1
20120046935 Nagao Feb 2012 A1
20120101809 Kennewick Apr 2012 A1
20120101810 Kennewick Apr 2012 A1
20120109753 Kennewick May 2012 A1
20120150620 Mandyam Jun 2012 A1
20120150636 Freeman Jun 2012 A1
20120239498 Ramer Sep 2012 A1
20120240060 Pennington Sep 2012 A1
20120278073 Weider Nov 2012 A1
20130006734 Ocko Jan 2013 A1
20130054228 Baldwin Feb 2013 A1
20130060625 Davis Mar 2013 A1
20130080177 Chen Mar 2013 A1
20130211710 Kennewick Aug 2013 A1
20130253929 Weider Sep 2013 A1
20130254314 Chow Sep 2013 A1
20130297293 Cristo Nov 2013 A1
20130304473 Baldwin Nov 2013 A1
20130311324 Stoll Nov 2013 A1
20130332454 Stuhec Dec 2013 A1
20130339022 Baldwin Dec 2013 A1
20140006951 Hunter Jan 2014 A1
20140012577 Freeman Jan 2014 A1
20140025371 Min Jan 2014 A1
20140108013 Cristo Apr 2014 A1
20140156278 Kennewick Jun 2014 A1
20140195238 Terao Jul 2014 A1
20140236575 Tur Aug 2014 A1
20140249821 Kennewick Sep 2014 A1
20140249822 Baldwin Sep 2014 A1
20140278413 Pitschel Sep 2014 A1
20140278416 Schuster Sep 2014 A1
20140288934 Kennewick Sep 2014 A1
20140330552 Bangalore Nov 2014 A1
20140365222 Weider Dec 2014 A1
20150019211 Simard Jan 2015 A1
20150019217 Cristo Jan 2015 A1
20150019227 Anandarajah Jan 2015 A1
20150066627 Freeman Mar 2015 A1
20150073910 Kennewick Mar 2015 A1
20150095159 Kennewick Apr 2015 A1
20150142447 Kennewick May 2015 A1
20150170641 Kennewick Jun 2015 A1
20150193379 Mehta Jul 2015 A1
20150199339 Mirkin Jul 2015 A1
20150228276 Baldwin Aug 2015 A1
20150293917 Bufe Oct 2015 A1
20150348544 Baldwin Dec 2015 A1
20150348551 Gruber Dec 2015 A1
20150364133 Freeman Dec 2015 A1
20160049152 Kennewick Feb 2016 A1
20160078482 Kennewick Mar 2016 A1
20160078491 Kennewick Mar 2016 A1
20160078504 Kennewick Mar 2016 A1
20160078773 Carter Mar 2016 A1
20160110347 Kennewick Apr 2016 A1
20160148610 Kennewick May 2016 A1
20160148612 Guo May 2016 A1
20160188292 Carter Jun 2016 A1
20160188573 Tang Jun 2016 A1
20160335676 Freeman Nov 2016 A1
Foreign Referenced Citations (31)
Number Date Country
1433554 Jul 2003 CN
1320043 Jun 2003 EP
1646037 Apr 2006 EP
H11249773 Sep 1999 JP
2001071289 Mar 2001 JP
2006146881 Jun 2006 JP
2008027454 Feb 2008 JP
2008058465 Mar 2008 JP
2011504304 Feb 2011 JP
0171609 Sep 2001 NO
9946763 Sep 1999 WO
0021232 Jan 2000 WO
0046792 Jan 2000 WO
0178065 Oct 2001 WO
2004072954 Jan 2004 WO
2007019318 Jan 2007 WO
2007021587 Jan 2007 WO
2007027546 Jan 2007 WO
2007027989 Jan 2007 WO
2008098039 Jan 2008 WO
2008118195 Jan 2008 WO
2008139928 Jun 2008 WO
2009075912 Jan 2009 WO
2009145796 Jan 2009 WO
2009111721 Sep 2009 WO
2010096752 Jan 2010 WO
2016044290 Mar 2016 WO
2016044316 Mar 2016 WO
2016044319 Mar 2016 WO
2016044321 Mar 2016 WO
2016061309 Apr 2016 WO
Non-Patent Literature Citations (22)
Entry
Davis, Z., et al., “A Personal Handheld Multi-Modal Shopping Assistant”, IEEE, 2006, 9 pages.
“Statement in Accordance with the Notice from the European Patent Office” dated Oct. 1, 2007 Concerning Business Methods (OJ EPO Nov. 2007, 592-593), XP002456252.
Arrington, Michael, “Google Redefines GPS Navigation Landscape: Google Maps Navigation for Android 2.0”, TechCrunch, printed from the Internet <http://www.techcrunch.com/2009/10/28/google-redefines-car-gps-navigation-google-maps-navigation-android/>, Oct. 28, 2009, 4 pages.
Bazzi, Issam et al., “Heterogeneous Lexical Units for Automatic Speech Recognition: Preliminary Investigations”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, Jun. 5-9, 2000, XP010507574, pp. 1257-1260.
Belvin, Robert, et al., “Development of the HRL Route Navigation Dialogue System”, Proceedings of the First International Conference on Human Language Technology Research, San Diego, 2001, pp. 1-5.
Chai et al., “MIND: A Semantics-Based Multimodal Interpretation Framework for Conversational Systems”, Proceedings of the International CLASS Workshop on Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, Jun. 2002, pp. 37-46.
Cheyer et al., “Multimodal Maps: An Agent-Based Approach”, International Conference on Cooperative Multimodal Communication (CMC/95), May 24-26, 1995, pp. 111-121.
El Meliani et al., “A Syllabic-Filler-Based Continuous Speech Recognizer for Unlimited Vocabulary”, Canadian Conference on Electrical and Computer Engineering, vol. 2, Sep. 5-8, 1995, pp. 1007-1010.
Elio et al., “On Abstract Task Models and Conversation Policies”, in Workshop on Specifying and Implementing Conversation Policies, Autonomous Agents '99, Seattle, 1999, 10 pages.
Kirchhoff, Katrin, “Syllable-Level Desynchronisation of Phonetic Features for Speech Recognition”, Proceedings of the Fourth International Conference on Spoken Language, 1996, ICSLP 96, vol. 4, IEEE, 1996, 3 pages.
Kuhn, Thomas, et al., “Hybrid In-Car Speech Recognition for Mobile Multimedia Applications”, Vehicular Technology Conference, IEEE, Jul. 1999, pp. 2009-2013.
Lin, Bor-shen, et al., “A Distributed Architecture for Cooperative Spoken Dialogue Agents with Coherent Dialogue State and History”, ASRU'99, 1999, 4 pages.
Lind, R., et al., “The Network Vehicle—A Glimpse into the Future of Mobile Multi-Media”, IEEE Aerosp. Electron. Systems Magazine, vol. 14, No. 9, Sep. 1999, pp. 27-32.
Mao, Mark Z., “Automatic Training Set Segmentation for Multi-Pass Speech Recognition”, Department of Electrical Engineering, Stanford University, CA, copyright 2005, IEEE, pp. I-685 to I-688.
O'Shaughnessy, Douglas, “Interacting with Computers by Voice: Automatic Speech Recognition and Synthesis”, Proceedings of the IEEE, vol. 91, No. 9, Sep. 1, 2003, XP011100665, pp. 1272-1305.
Reuters, “IBM to Enable Honda Drivers to Talk to Cars”, Charles Schwab & Co., Inc., Jul. 28, 2002, 1 page.
Turunen, “Adaptive Interaction Methods in Speech User Interfaces”, Conference on Human Factors in Computing Systems, Seattle, Washington, 2001, pp. 91-92.
Vanhoucke, Vincent, “Confidence Scoring and Rejection Using Multi-Pass Speech Recognition”, Nuance Communications, Menlo Park, CA, 2005, 4 pages.
Weng, Fuliang, et al., “Efficient Lattice Representation and Generation”, Speech Technology and Research Laboratory, SRI International, Menlo Park, CA, 1998, 4 pages.
Wu, Su-Lin, et al., “Incorporating Information from Syllable-Length Time Scales into Automatic Speech Recognition”, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, 1998, vol. 2, IEEE, 1998, 4 pages.
Wu, Su-Lin, et al., “Integrating Syllable Boundary Information into Speech Recognition”, IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-97, 1997, vol. 2, IEEE, 1997, 4 pages.
Zhao, Yilin, “Telematics: Safe and Fun Driving”, IEEE Intelligent Systems, vol. 17, Issue 1, 2002, pp. 10-14.
Related Publications (1)
Number Date Country
20160217785 A1 Jul 2016 US
Continuations (2)
Number Date Country
Parent 14083061 Nov 2013 US
Child 15090215 US
Parent 12127343 May 2008 US
Child 14083061 US