System and method for an integrated, multi-modal, multi-device natural language voice services environment

Information

  • Patent Grant
  • Patent Number
    8,589,161
  • Date Filed
    Tuesday, May 27, 2008
  • Date Issued
    Tuesday, November 19, 2013
Abstract
A system and method for an integrated, multi-modal, multi-device natural language voice services environment may be provided. In particular, the environment may include a plurality of voice-enabled devices each having intent determination capabilities for processing multi-modal natural language inputs in addition to knowledge of the intent determination capabilities of other devices in the environment. Further, the environment may be arranged in a centralized manner, a distributed peer-to-peer manner, or various combinations thereof. As such, the various devices may cooperate to determine intent of multi-modal natural language inputs, and commands, queries, or other requests may be routed to one or more of the devices best suited to take action in response thereto.
Description
FIELD OF THE INVENTION

The invention relates to an integrated voice services environment in which a plurality of devices can provide various voice services by cooperatively processing free form, multi-modal, natural language inputs, thereby facilitating conversational interactions between a user and one or more of the devices in the integrated environment.


BACKGROUND OF THE INVENTION

As technology has progressed in recent years, consumer electronic devices have emerged to become nearly ubiquitous in the everyday lives of many people. To meet the increasing demand that has resulted from growth in the functionality and mobility of mobile phones, navigation devices, embedded devices, and other such devices, a wealth of features and functions are often provided therein in addition to core applications. Greater functionality also introduces trade-offs, however, including learning curves that often inhibit users from fully exploiting all of the capabilities of their electronic devices. For example, many existing electronic devices include complex human to machine interfaces that may not be particularly user-friendly, which inhibits mass-market adoption for many technologies. Moreover, cumbersome interfaces often result in otherwise desirable features being buried (e.g., within menus that may be tedious to navigate), which has the tendency of causing many users to not use, or even know about, the potential capabilities of their devices.


As such, the increased functionality provided by many electronic devices often tends to be wasted, as market research suggests that many users use only a fraction of the features or applications available on a given device. Moreover, in a society where wireless networking and broadband access are increasingly prevalent, consumers tend to naturally desire seamless mobile capabilities from their electronic devices. Thus, as consumer demand intensifies for simpler mechanisms to interact with electronic devices, cumbersome interfaces that prevent quick and focused interaction can become an important concern. Accordingly, the ever-growing demand for mechanisms to use technology in intuitive ways remains largely unfulfilled.


One approach towards simplifying human to machine interactions in electronic devices includes the use of voice recognition software, which can enable users to exploit features that could otherwise be unfamiliar, unknown, or difficult to use. For example, a recent survey conducted by the Navteq Corporation, which provides data used in a variety of applications such as automotive navigation and web-based applications, demonstrates that voice recognition often ranks among the features most desired by consumers of electronic devices. Even so, existing voice user interfaces, when they actually work, still tend to require significant learning on the part of the user.


For example, many existing voice user interfaces only support requests formulated according to specific command-and-control sequences or syntaxes. Furthermore, many existing voice user interfaces cause user frustration or dissatisfaction because of inaccurate speech recognition. Similarly, by forcing a user to provide pre-established commands or keywords to communicate requests in ways that a system can understand, existing voice user interfaces do not effectively engage the user in a productive, cooperative dialogue to resolve requests and advance a conversation towards a mutually satisfactory goal (e.g., when users may be uncertain of particular needs, available information, or device capabilities, among other things). As such, existing voice user interfaces tend to suffer from various drawbacks, including significant limitations on engaging users in a dialogue in a cooperative and conversational manner.


Additionally, many existing voice user interfaces fall short in utilizing information distributed across various different domains or devices in order to resolve natural language voice-based inputs. Thus, existing voice user interfaces suffer from being constrained to a finite set of applications for which they have been designed, or to devices on which they reside. Although technological advancement has resulted in users often having several devices to suit their various needs, existing voice user interfaces do not adequately free users from device constraints. For example, users may be interested in services associated with different applications and devices, but existing voice user interfaces tend to restrict users from accessing the applications and devices as they see fit. Moreover, users typically can only practicably carry a finite number of devices at any given time, yet content or services associated with users' devices other than those currently being used may be desired in various circumstances. Accordingly, although users tend to have varying needs, where content or services associated with different devices may be desired in various contexts or environments, existing voice technologies tend to fall short in providing an integrated environment in which users can request content or services associated with virtually any device or network. As such, constraints on information availability and device interaction mechanisms in existing voice services environments tend to prevent users from experiencing technology in an intuitive, natural, and efficient way.


Existing systems suffer from these and other problems.


SUMMARY OF THE INVENTION

According to various aspects of the invention, a system and method for an integrated, multi-modal, multi-device natural language voice services environment may include a plurality of voice-enabled devices each having intent determination capabilities for processing multi-modal natural language inputs in addition to knowledge of the intent determination capabilities of other devices in the environment. Further, the environment may be arranged in a centralized manner, a distributed peer-to-peer manner, or various combinations thereof. As such, the various devices may cooperate to determine intent of multi-modal natural language inputs, and commands, queries, or other requests may be routed to one or more of the devices best suited to take action in response thereto.


According to various aspects of the invention, the integrated natural language voice services environment arranged in the centralized manner includes an input device that receives a multi-modal natural language input, a central device communicatively coupled to the input device, and one or more secondary devices communicatively coupled to the central device. Each of the input device, the central device, and the one or more secondary devices may have intent determination capabilities for processing multi-modal natural language inputs. As such, an intent of a given multi-modal natural language input may be determined in the centralized manner by communicating the multi-modal natural language input from the input device to the central device. Thereafter, the central device may aggregate the intent determination capabilities of the input device and the one or more secondary devices and determine an intent of the multi-modal natural language input using the aggregated intent determination capabilities. The input device may then receive the determined intent from the central device and invoke at least one action at one or more of the input device, the central device, or the secondary devices based on the determined intent.


According to various aspects of the invention, the integrated natural language voice services environment arranged in the distributed manner includes an input device that receives a multi-modal natural language input, a central device communicatively coupled to the input device and one or more secondary devices communicatively coupled to the input device, wherein each of the input device and the one or more secondary devices may have intent determination capabilities for processing multi-modal natural language inputs, as in the centralized implementation. However, the distributed implementation may be distinct from the centralized implementation in that a preliminary intent of the multi-modal natural language input may be determined at the input device using local intent determination capabilities. The multi-modal natural language input may then be communicated to one or more of the secondary devices (e.g., when a confidence level of the intent determination at the input device falls below a given threshold). In such cases, each of the secondary devices determines an intent of the multi-modal natural language input using local intent determination capabilities. The input device collates the preliminary intent determination and the intent determinations of the secondary devices, and may arbitrate among the collated intent determinations to determine an actionable intent of the multi-modal natural language input.
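
By way of illustration only, the distributed flow described above (preliminary local determination, consultation of secondary devices when confidence falls below a threshold, collation, and arbitration) might be sketched in Python as follows. The names IntentResult, CONFIDENCE_THRESHOLD, arbitrate, and determine_intent_distributed are hypothetical, the threshold value is assumed, and selecting the highest-confidence candidate is only one possible arbitration policy; nothing here is prescribed by the description.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IntentResult:
    device: str        # device that produced this determination
    intent: str        # hypothetical intent label
    confidence: float  # assumed to lie in [0.0, 1.0]


# Hypothetical: each device exposes a function from input text to a result.
Determiner = Callable[[str], IntentResult]

CONFIDENCE_THRESHOLD = 0.8  # assumed value, for illustration only


def arbitrate(candidates: List[IntentResult]) -> IntentResult:
    """Assumed arbitration policy: keep the highest-confidence candidate."""
    return max(candidates, key=lambda r: r.confidence)


def determine_intent_distributed(
    text: str,
    local: Determiner,
    secondaries: List[Determiner],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> IntentResult:
    # Preliminary intent determined at the input device.
    preliminary = local(text)
    if preliminary.confidence >= threshold:
        return preliminary
    # Confidence too low: ask each secondary device, then collate.
    collated = [preliminary] + [determine(text) for determine in secondaries]
    return arbitrate(collated)
```

In this sketch the secondary devices are consulted only when the local confidence is insufficient, which mirrors the "e.g." condition in the text; an implementation could equally consult them unconditionally.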


According to various aspects of the invention, the integrated natural language voice services environment may be arranged in a manner that dynamically selects between a centralized model and a distributed model. For example, the environment includes an input device that receives a multi-modal natural language input and one or more secondary devices communicatively coupled to the input device, each of which has intent determination capabilities for processing multi-modal natural language inputs. A constellation model may be accessible to each of the input device and the one or more secondary devices, wherein the constellation model describes the intent determination capabilities of the input device and the one or more secondary devices. The multi-modal natural language input can be routed for processing at one or more of the input device or the secondary devices to determine an intent thereof based on the intent determination capabilities described in the constellation model. For example, when the constellation model arranges the input device and the secondary devices in the centralized manner, one of the secondary devices may be designated the central device and the natural language input may be processed as described above. However, when the multi-modal natural language input cannot be communicated to the central device, the constellation model may be dynamically rearranged in the distributed manner, whereby the input device and the secondary devices share knowledge relating to respective local intent determination capabilities and operate as cooperative nodes to determine the intent of the multi-modal natural language input using the shared knowledge relating to local intent determination capabilities.


Other objects and advantages of the invention will be apparent based on the following drawings and detailed description.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 illustrates a block diagram of an exemplary multi-modal electronic device that may be provided in an integrated, multi-device natural language voice services environment, according to various aspects of the invention.



FIG. 2 illustrates a block diagram of an exemplary centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.



FIG. 3 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at an input device in the centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.



FIG. 4 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at a central device in the centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.



FIG. 5 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at a secondary device in the centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.



FIG. 6 illustrates a block diagram of an exemplary distributed implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.



FIG. 7 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at an input device in the distributed implementation of the integrated, multi-modal, multi-device natural language voice service environment, according to various aspects of the invention.





DETAILED DESCRIPTION

According to various aspects of the invention, FIG. 1 illustrates a block diagram of an exemplary multi-modal electronic device 100 that may be provided in a natural language voice services environment that includes one or more additional multi-modal devices (e.g., as illustrated in FIGS. 2 and 6). As will be apparent, the electronic device 100 illustrated in FIG. 1 may be any suitable voice-enabled electronic device (e.g., a telematics device, a personal navigation device, a mobile phone, a VoIP node, a personal computer, a media device, an embedded device, a server, or another electronic device). The device 100 may include various components that collectively provide a capability to process conversational, multi-modal natural language inputs. As such, a user of the device 100 may engage in multi-modal conversational dialogues with the voice-enabled electronic device 100 to resolve requests in a free form, cooperative manner.


For example, the natural language processing components may support free form natural language utterances to liberate the user from restrictions relating to how commands, queries, or other requests should be formulated. Rather, the user may employ any manner of speaking that feels natural in order to request content or services available through the device 100 (e.g., content or services relating to telematics, communications, media, messaging, navigation, marketing, information retrieval, etc.). For instance, in various implementations, the device 100 may process natural language utterances utilizing techniques described in U.S. patent application Ser. No. 10/452,147, entitled “Systems and Methods for Responding to Natural Language Speech Utterance,” which issued as U.S. Pat. No. 7,398,209 on Jul. 8, 2008, and U.S. patent application Ser. No. 10/618,633, entitled “Mobile Systems and Methods for Responding to Natural Language Speech Utterance,” which issued as U.S. Pat. No. 7,693,720 on Apr. 6, 2010, the disclosures of which are hereby incorporated by reference in their entirety.


Moreover, because the device 100 may be deployed in an integrated multi-device environment, the user may further request content or services available through other devices deployed in the environment. In particular, the integrated voice services environment may include a plurality of multi-modal devices, each of which includes natural language components generally similar to those illustrated in FIG. 1. The various devices in the environment may serve distinct purposes, however, such that available content, services, applications, or other capabilities may vary among the devices in the environment (e.g., core functions of a media device may vary from those of a personal navigation device). Thus, each device in the environment, including device 100, may have knowledge of content, services, applications, intent determination capabilities, and other features available through the other devices by way of a constellation model 130b. Accordingly, as will be described in greater detail below, the electronic device 100 may cooperate with other devices in the integrated environment to resolve requests by sharing context, prior information, domain knowledge, short-term knowledge, long-term knowledge, and cognitive models, among other things.


According to various aspects of the invention, the electronic device 100 may include an input mechanism 105 that can receive multi-modal natural language inputs, which include at least an utterance spoken by the user. As will be apparent, the input mechanism 105 may include any appropriate device or combination of devices capable of receiving a spoken input (e.g., a directional microphone, an array of microphones, or any other device that can generate encoded speech). Further, in various implementations, the input mechanism 105 can be configured to maximize fidelity of encoded speech, for example, by maximizing gain in a direction of the user, cancelling echoes, nulling point noise sources, performing variable rate sampling, or filtering environmental noise (e.g., background conversations). As such, the input mechanism 105 may generate encoded speech in a manner that can tolerate noise or other factors that could otherwise interfere with accurate interpretation of the utterance.


Furthermore, in various implementations, the input mechanism 105 may include various other input modalities (i.e., the input mechanism 105 may be arranged in a multi-modal environment), in that non-voice inputs can be correlated and/or processed in connection with one or more previous, contemporaneous, or subsequent multi-modal natural language inputs. For example, the input mechanism 105 may be coupled to a touch-screen interface, a stylus and tablet interface, a keypad or keyboard, or any other suitable input mechanism, as will be apparent. As a result, an amount of information potentially available when processing the multi-modal inputs may be maximized, as the user can clarify utterances or otherwise provide additional information in a given multi-modal natural language input using various input modalities. For instance, in an exemplary illustration, the user could touch a stylus or other pointing device to a portion of a touch-screen interface of the device 100, while also providing an utterance relating to the touched portion of the interface (e.g., “Show me restaurants around here”). In this example, the natural language utterance may be correlated with the input received via the touch-screen interface, resulting in “around here” being interpreted in relation to the touched portion of the interface (e.g., as opposed to the user's current location or some other meaning).
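
As a minimal sketch of the correlation just described, the following Python shows how a touch input might be used to resolve the phrase "around here" in an accompanying utterance. The MultiModalInput type, the resolve_reference_location function, and the simple substring match are all hypothetical simplifications; an actual implementation would rely on the natural language components of the device 100 rather than string matching.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Coordinates = Tuple[float, float]  # assumed (latitude, longitude) pair


@dataclass
class MultiModalInput:
    utterance: str
    # Point on the displayed map that the user touched, if any (hypothetical).
    touch_location: Optional[Coordinates] = None


def resolve_reference_location(
    inp: MultiModalInput, current_location: Coordinates
) -> Coordinates:
    """Resolve "around here" against the touched point when one accompanies
    the utterance; otherwise fall back to the user's current location."""
    if "around here" in inp.utterance.lower() and inp.touch_location is not None:
        return inp.touch_location
    return current_location
```

For instance, an utterance of "Show me restaurants around here" accompanied by a touch resolves to the touched coordinates, whereas the same utterance alone resolves to the current location.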


According to various aspects of the invention, the device 100 may include an Automatic Speech Recognizer 110 that generates one or more preliminary interpretations of the encoded speech, which may be received from the input mechanism 105. For example, the Automatic Speech Recognizer 110 may recognize syllables, words, or phrases contained in an utterance using one or more dynamically adaptable recognition grammars. The dynamic recognition grammars may be used to recognize a stream of phonemes through phonetic dictation based on one or more acoustic models. Furthermore, as described in U.S. patent application Ser. No. 11/197,504, entitled “Systems and Methods for Responding to Natural Language Speech Utterance,” which issued as U.S. Pat. No. 7,640,160 on Dec. 29, 2009, the disclosure of which is hereby incorporated by reference in its entirety, the Automatic Speech Recognizer 110 may be capable of multi-pass analysis, where a primary speech recognition engine may generate a primary interpretation of an utterance (e.g., using a large list dictation grammar) and request secondary transcription from one or more secondary speech recognition engines (e.g., using a virtual dictation grammar having decoy words for out-of-vocabulary words).


Thus, the Automatic Speech Recognizer 110 may generate preliminary interpretations of an utterance in various ways, including exclusive use of a dictation grammar or virtual dictation grammar, or use of various combinations thereof (e.g., when the device 100 supports multi-pass analysis). In any event, the Automatic Speech Recognizer 110 may provide out-of-vocabulary capabilities and may tolerate portions of a speech signal being dropped, the user misspeaking, or other variables that may occur in natural language speech (e.g., stops and starts, stutters, etc.). Furthermore, the recognition grammars employed by the Automatic Speech Recognizer 110 may include vocabularies, dictionaries, syllables, words, phrases, or other information optimized according to various contextual or application-specific domains (e.g., navigation, music, movies, weather, shopping, news, languages, temporal or geographic proximities, or other suitable domains). Moreover, environmental knowledge (e.g., peer-to-peer affinities, capabilities of devices in the environment, etc.), historical knowledge (e.g., frequent requests, prior context, etc.), or other types of knowledge can be used to continually optimize the information contained in the recognition grammars on a dynamic basis.


For example, information contained in the recognition grammars may be dynamically optimized to improve a likelihood of a given utterance being recognized accurately (e.g., following an incorrect interpretation of a word, the incorrect interpretation may be removed from the grammar to reduce a likelihood of the incorrect interpretation being repeated). Accordingly, the Automatic Speech Recognizer 110 may use a number of techniques to generate preliminary interpretations of natural language utterances, including those described, for example, in U.S. patent application Ser. No. 11/513,269, entitled “Dynamic Speech Sharpening,” which issued as U.S. Pat. No. 7,634,409 on Dec. 15, 2009, the disclosure of which is hereby incorporated by reference in its entirety. Furthermore, the techniques used by the Automatic Speech Recognizer 110 associated with the device 100 may be considered in defining intent determination capabilities of the device 100, and such capabilities may be shared with other devices in the environment to enable convergence of speech recognition throughout the environment (e.g., because various devices may employ distinct speech recognition techniques or have distinct grammars or vocabularies, the devices may share vocabulary translation mechanisms to enhance system-wide recognition).
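
The parenthetical example above (removing an incorrect interpretation from the grammar so that it is less likely to be repeated) can be illustrated with a deliberately simplified sketch. The RecognitionGrammar class and its methods are hypothetical; the grammar is modeled as a plain set of phrases, which is an assumption made purely for illustration and does not reflect how the referenced recognition grammars are actually structured.

```python
from typing import Iterable, List


class RecognitionGrammar:
    """Hypothetical grammar modeled as a set of recognizable phrases."""

    def __init__(self, phrases: Iterable[str]) -> None:
        self.phrases = set(phrases)

    def report_misrecognition(self, wrong_phrase: str) -> None:
        # Remove the incorrect interpretation so it is not offered again.
        self.phrases.discard(wrong_phrase)

    def candidates(self, hypotheses: List[str]) -> List[str]:
        # Keep only hypotheses the grammar still recognizes, preserving order.
        return [h for h in hypotheses if h in self.phrases]
```

After a call such as report_misrecognition("austin"), a later hypothesis list containing both "austin" and "boston" would yield only "boston".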


According to various aspects of the invention, the Automatic Speech Recognizer 110 may provide one or more preliminary interpretations of a multi-modal input, including an utterance contained therein, to a conversational language processor 120. The conversational language processor 120 may include various components that collectively operate to model everyday human-to-human conversations in order to engage in cooperative conversations with the user to resolve requests based on the user's intent. For example, the conversational language processor 120 may include, among other things, an intent determination engine 130a, a constellation model 130b, one or more domain agents 130c, a context tracking engine 130d, a misrecognition engine 130e, and a voice search engine 130f. Furthermore, the conversational language processor 120 may be coupled to one or more data repositories 160 and applications associated with one or more domains. Thus, the intent determination capabilities of the device 100 may be defined based on the data and processing capabilities of the Automatic Speech Recognizer 110 and the conversational language processor 120.


More particularly, the intent determination engine 130a may establish meaning for a given multi-modal natural language input based on a consideration of the intent determination capabilities of the device 100 as well as the intent determination capabilities of other devices in the integrated voice services environment. For example, the intent determination capabilities of the device 100 may be defined as a function of processing resources, storage for grammars, context, agents, or other data, and content or services associated with the device 100 (e.g., a media device with a small amount of memory may have a smaller list of recognizable songs than a device with a large amount of memory). Thus, the intent determination engine 130a may determine whether to process a given input locally (e.g., when the device 100 has intent determination capabilities that suggest favorable conditions for recognition), or whether to route information associated with the input to other devices, which may assist in determining the intent of the input.


As such, to determine which device or combination of devices should process an input, the intent determination engine 130a may evaluate the constellation model 130b, which provides a model of the intent determination capabilities for each of the devices in the integrated voice services environment. For instance, the constellation model 130b may contain, among other things, knowledge of processing and storage resources available to each of the devices in the environment, as well as the nature and scope of domain agents, context, content, services, and other information available to each of the devices in the environment. As such, using the constellation model 130b, the intent determination engine 130a may be able to determine whether any of the other devices have intent determination capabilities that can be invoked to augment or otherwise enhance the intent determination capabilities of the device 100 (e.g., by routing information associated with a multi-modal natural language input to the device or devices that appear best suited to analyze the information and therefore determine an intent of the input). Accordingly, the intent determination engine 130a may establish the meaning of a given utterance by utilizing the comprehensive constellation model 130b that describes capabilities within the device 100 and across the integrated environment. The intent determination engine 130a may therefore optimize processing of a given natural language input based on capabilities throughout the environment (e.g., utterances may be processed locally to the device 100, routed to a specific device based on information in the constellation model 130b, or flooded to all of the devices in the environment in which case an arbitration may occur to select a best guess at an intent determination).
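
The three routing outcomes mentioned above (local processing, routing to a specific device, or flooding to all devices) might be sketched as follows. The DeviceCapabilities and ConstellationModel classes, their fields, and the use of memory size as a tie-breaker are hypothetical assumptions chosen for illustration; the description does not limit the constellation model 130b to these attributes or to this selection rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DeviceCapabilities:
    domains: Set[str]       # domains the device has agents or content for
    memory_mb: int          # assumed proxy for storage and processing resources
    reachable: bool = True  # whether the device can currently be contacted


@dataclass
class ConstellationModel:
    devices: Dict[str, DeviceCapabilities] = field(default_factory=dict)

    def route(self, local_device: str, domain: str) -> List[str]:
        """Return the device names that should process an input in `domain`."""
        if domain in self.devices[local_device].domains:
            return [local_device]  # favorable local conditions: process locally
        others = [
            name for name, caps in self.devices.items()
            if name != local_device and caps.reachable
        ]
        capable = [n for n in others if domain in self.devices[n].domains]
        if capable:
            # Route to one specific device; largest memory is an assumed rule.
            return [max(capable, key=lambda n: self.devices[n].memory_mb)]
        # No device advertises the domain: flood to all reachable devices,
        # after which an arbitration step would select a best guess.
        return others
```

When route returns more than one device, the caller would be expected to arbitrate among the returned intent determinations, as noted in the text.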


Although the following discussion will generally focus on various techniques that can be used to determine the intent of multi-modal natural language inputs in the integrated multi-device environment, it will be apparent that the natural language processing capabilities of any one of the devices may extend beyond the specific discussion that has been provided herein. As such, in addition to the U.S. Patents and U.S. Patent Applications referenced above, further natural language processing capabilities that may be employed include those described in U.S. patent application Ser. No. 11/197,504, entitled “Systems and Methods for Responding to Natural Language Speech Utterance,” which issued as U.S. Pat. No. 7,640,160 on Dec. 29, 2009, U.S. patent application Ser. No. 11/200,164, entitled “System and Method of Supporting Adaptive Misrecognition in Conversational Speech,” which issued as U.S. Pat. No. 7,620,549 on Nov. 17, 2009, U.S. patent application Ser. No. 11/212,693, entitled “Mobile Systems and Methods of Supporting Natural Language Human-Machine Interactions,” which issued as U.S. Pat. No. 7,949,529 on May 24, 2011, U.S. patent application Ser. No. 11/580,926, entitled “System and Method for a Cooperative Conversational Voice User Interface,” filed Oct. 16, 2006, U.S. patent application Ser. No. 11/671,526, entitled “System and Method for Selecting and Presenting Advertisements Based on Natural Language Processing of Voice-Based Input,” which issued as U.S. Pat. No. 7,818,176 on Oct. 19, 2010, and U.S. patent application Ser. No. 11/954,064, entitled “System and Method for Providing a Natural Language Voice User Interface in an Integrated Voice Navigation Services Environment,” filed Dec. 11, 2007, the disclosures of which are hereby incorporated by reference in their entirety.


According to various aspects of the invention, FIG. 2 illustrates a block diagram of an exemplary centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment. As will be apparent from the further description to be provided herein, the centralized implementation of the integrated, multi-device voice services environment may enable a user to engage in conversational, multi-modal natural language interactions with any one of voice-enabled devices 210a-n or central voice-enabled device 220. As such, the multi-device voice services environment may collectively determine intent for any given multi-modal natural language input, whereby the user may request content or voice services relating to any device or application in the environment, without restraint.


As illustrated in FIG. 2, the centralized implementation of the multi-device voice service environment may include a plurality of voice-enabled devices 210a-n, each of which includes various components capable of determining intent of natural language utterances, as described above in reference to FIG. 1. Furthermore, as will be apparent, the centralized implementation includes a central device 220, which contains information relating to intent determination capabilities for each of the other voice-enabled devices 210a-n. For example, in various exemplary implementations, the central device 220 may be designated as such by virtue of being a device most capable of determining the intent of an utterance (e.g., a server, home data center, or other device having significant processing power, memory resources, and communication capabilities making the device suitable to manage intent determination across the environment). In another exemplary implementation, the central device 220 may be dynamically selected based on one or more characteristics of a given multi-modal natural language input, dialogue, or interaction (e.g., a device may be designated as the central device 220 when a current utterance relates to a specific domain).


In the centralized implementation illustrated in FIG. 2, a multi-modal natural language input may be received at one of the voice-enabled devices 210a-n. Therefore, the receiving one of the devices 210a-n may be designated as an input device for that input, while the remaining devices 210a-n may be designated as secondary devices for that input. In other words, for any given multi-modal natural language input, the multi-device environment may include an input device that collects the input, a central device 220 that aggregates intent determination, inferencing, and processing capabilities for all of the devices 210a-n in the environment, and one or more secondary devices that may also be used in the intent determination process. As such, each device 210 in the environment may be provided with a constellation model that identifies all of the devices 210 having incoming and outgoing communication capabilities, thus indicating an extent to which other devices may be capable of determining intent for a given multi-modal natural language input. The constellation model may further define a location of the central device 220, which aggregates context, vocabularies, content, recognition grammars, misrecognitions, shared knowledge, intent determination capabilities, inferencing capabilities, and other information from the various devices 210a-n in the environment.


Accordingly, as communication and processing capabilities permit, the central device 220 may be used as a recognizer of first or last resort. For example, because the central device 220 converges intent determination capabilities across the environment (e.g., by aggregating context, vocabularies, device capabilities, and other information from the devices 210a-n in the environment), inputs may be automatically routed to the central device 220 when used as a recognizer of first resort, or as a recognizer of last resort when local processing at the input device 210 cannot determine the intent of the input with a satisfactory level of confidence. However, it will also be apparent that in certain instances the input device 210 may be unable to make contact with the central device 220 for various reasons (e.g., a network connection may be unavailable, or a processing bottleneck at the central device 220 may cause communication delays). In such cases, the input device 210 that has initiated contact with the central device 220 may shift into decentralized processing (e.g., as described in reference to FIG. 6) and communicate capabilities with one or more of the other devices 210a-n in the constellation model. Thus, when the central device 220 cannot be invoked for various reasons, the remaining devices 210a-n may operate as cooperative nodes to determine intent in a decentralized manner.


Additionally, in the multi-device voice services environment, the central device 220 and the various other devices 210a-n may cooperate to create a converged model of capabilities throughout the environment. For example, as indicated above, in addition to having intent determination capabilities based on processing resources, memory resources, and device capabilities, each of the devices 210a-n and the central device 220 may include various other natural language processing components. The voice services environment may therefore operate in an integrated manner by maintaining not only a complete model of data, content, and services associated with the various devices 210a-n, but also of other natural language processing capabilities and dynamic states associated with the various devices 210a-n. As such, the various devices 210a-n may operate with a goal of converging capabilities, data, states, and other information across the environment, either on one device (e.g., the central device 220) or distributed among the various devices 210a-n (e.g., as in the decentralized implementation to be described in FIG. 6).


For example, as discussed above, each device 210 includes an Automatic Speech Recognizer, one or more dynamically adaptable recognition grammars, and vocabulary lists used to generate phonemic interpretations of natural language utterances. Moreover, each device 210 includes locally established context, which can range from information contained in a context stack, context and namespace variables, vocabulary translation mechanisms, short-term shared knowledge relating to a current dialogue or conversational interaction, long-term shared knowledge relating to a user's learned preferences over time, or other contextual information. Furthermore, each device 210 may have various services or applications associated therewith, and may perform various aspects of natural language processing locally. Thus, additional information to be converged throughout the environment may include partial or preliminary utterance recognitions, misrecognitions or ambiguous recognitions, inferencing capabilities, and overall device state information (e.g., songs playing in the environment, alarms set in the environment, etc.).


Thus, various data synchronization and referential integrity algorithms may be employed in concert by the various devices 210a-n and the central device 220 to provide a consistent worldview of the environment. For example, information may be described and transmitted throughout the environment for synchronization and convergence purposes using the Universal Plug and Play protocol designed for computer ancillary devices, although the environment can also operate in a peer-to-peer disconnected mode (e.g., when the central device 220 cannot be reached). However, in various implementations, the environment may also operate in a peer-to-peer mode regardless of the disconnected status, as illustrated in FIG. 6, for example, when the devices 210a-n have sufficient commensurate resources and capabilities for natural language processing.


In general, the algorithms for convergence in the environment can be executed at various intervals, although it may be desirable to limit data transmission so as to avoid processing bottlenecks. For example, because the convergence and synchronization techniques relate to natural language processing, in which any given utterance will typically be expressed over a course of several seconds, information relating to context and vocabulary need not be updated on a time frame of less than a few seconds. However, as communication capabilities permit, context and vocabulary could be updated more frequently to provide real-time recognition or the appearance of real-time recognition. In another implementation, the convergence and synchronization may be permitted to run until completion (e.g., when no requests are currently pending), or the convergence and synchronization may be suspended or terminated when a predetermined time or resource consumption limit has been reached (e.g., when the convergence relates to a pending request, an intent determination having a highest confidence level at the time of cut-off may be used).
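
By way of a non-limiting illustration, the cut-off behavior described above may be sketched as follows, where convergence either runs to completion or is terminated once a budget has been consumed, in which case the intent determination having the highest confidence level at the time of cut-off is used. The function name, the step-count budget (standing in for a time or resource limit), and the sample intent labels are assumptions made only for the example.

```python
from typing import Callable, Iterable, Optional, Tuple

Hypothesis = Tuple[str, float]  # (intent label, confidence in [0, 1])


def converge_with_cutoff(
    steps: Iterable[Callable[[], Optional[Hypothesis]]],
    max_steps: int,
) -> Tuple[Optional[Hypothesis], bool]:
    """Run synchronization steps until exhausted or until the budget is spent.

    Returns the highest-confidence hypothesis seen so far and a flag that is
    True only when convergence ran to completion.
    """
    best: Optional[Hypothesis] = None
    used = 0
    for step in steps:
        if used >= max_steps:
            return best, False  # cut off: use the best hypothesis so far
        result = step()
        used += 1
        if result is not None and (best is None or result[1] > best[1]):
            best = result
    return best, True


steps = [
    lambda: ("play_song", 0.55),
    lambda: None,  # a step that only synchronizes context
    lambda: ("play_album", 0.80),
    lambda: ("play_radio", 0.95),
]
cut_off = converge_with_cutoff(steps, max_steps=3)
complete = converge_with_cutoff(steps, max_steps=10)
```

With a budget of three steps the fourth hypothesis is never evaluated, so the result differs from the run-to-completion case.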


By establishing a consistent view of capabilities, data, states, and other information throughout the environment, an input device 210 may cooperate with the central device 220 and one or more secondary devices (i.e., one or more of devices 210a-n, other than the input device) in processing any given multi-modal natural language input. Furthermore, by providing each device 210 and the central device 220 with a constellation model that describes a synchronized state of the environment, the environment may be tolerant of failure by one or more of the devices 210a-n, or of the central device 220. For example, if the input device 210 cannot communicate with the central device 220 (e.g., because of a server crash), the input device 210 may enter a disconnected peer-to-peer mode, whereby capabilities can be exchanged with one or more devices 210a-n with which communications remain available. To that end, when a device 210 establishes new information relating to vocabulary, context, misrecognitions, agent adaptation, intent determination capabilities, inferencing capabilities, or otherwise, the device 210 may transmit the information to the central device 220 for convergence purposes, as discussed above, in addition to consulting the constellation model to determine whether the information should be transmitted to one or more of the other devices 210a-n.


For example, suppose the environment includes a voice-enabled mobile phone that has nominal functionality relating to playing music or other media, and which further has a limited amount of local storage space, while the environment further includes a voice-enabled home media system that includes a mass storage medium that provides dedicated media functionality. If the mobile phone were to establish new vocabulary, context, or other information relating to a song (e.g., a user downloads the song or a ringtone to the mobile phone while on the road), the mobile phone may transmit the newly established information to the home media system in addition to the central device 220. As such, by having a model of all of the devices 210a-n in the environment and transmitting new information to the devices where it will most likely be useful, the various devices may handle disconnected modes of operation when the central device 220 may be unavailable for any reason, while resources may be allocated efficiently throughout the environment.
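
By way of a non-limiting illustration, the selection of recipients for newly established information may be sketched as follows. The function name, the domain labels, and the rule that a peer is relevant when it handles the domain of the new information are assumptions made only for the example.

```python
def recipients_for(
    info_domain: str,
    origin: str,
    central: str,
    device_domains: dict[str, set[str]],
) -> list[str]:
    """Return the devices that should receive newly established information.

    The central device always receives the information for convergence;
    other devices receive it only when they handle the relevant domain.
    """
    targets = [central] if origin != central else []
    for device, domains in sorted(device_domains.items()):
        if device in (origin, central):
            continue
        if info_domain in domains:
            targets.append(device)
    return targets


device_domains = {
    "phone": {"music", "telephony"},
    "home_media": {"music", "movies"},
    "car": {"navigation"},
    "server": {"music", "navigation"},
}
# The phone has downloaded a song and established new music vocabulary.
targets = recipients_for("music", origin="phone", central="server",
                         device_domains=device_domains)
```

Here the home media system receives the new vocabulary in addition to the central device, while the navigation-only device does not.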


Thus, based on the foregoing discussion, it will be apparent that a centralized implementation of an integrated multi-device voice services environment may generally include a central device 220 operable to aggregate or converge knowledge relating to content, services, capabilities, and other information associated with various voice-enabled devices 210a-n deployed within the environment. In such centralized implementations, the central device 220 may be invoked as a recognizer of first or last resort, as will be described in greater detail with reference to FIGS. 3-5, and furthermore, the other devices 210a-n in the environment may be configured to automatically enter a disconnected or peer-to-peer mode of operation when the central device 220 cannot be invoked for any reason (i.e., devices may enter a decentralized or distributed mode, as will be described in greater detail with reference to FIGS. 6-7). Knowledge and capabilities of each of the devices 210a-n may therefore be made available throughout the voice services environment in a centralized manner, a distributed manner, or various combinations thereof, thus optimizing an amount of natural language processing resources used to determine an intent of any given multi-modal natural language input.


According to various aspects of the invention, FIG. 3 illustrates a flow diagram of an exemplary method for processing multi-modal, natural language inputs at an input device in the centralized implementation of the integrated, multi-modal, multi-device natural language voice service environment. Similarly, FIGS. 4 and 5 illustrate corresponding methods associated with a central device and one or more secondary devices, respectively, in the centralized voice service environment. Furthermore, it will be apparent that the processing techniques described in relation to FIGS. 3-5 may generally be based on the centralized implementation illustrated in FIG. 2 and described above, whereby the input device may be assumed to be distinct from the central device, and the one or more secondary devices may be assumed to be distinct from the central device and the input device. However, it will be apparent that various instances may involve a natural language input being received at the central device or at another device, in which case the techniques described in FIGS. 3-5 may vary depending on circumstances of the environment (e.g., decisions relating to routing utterances to a specific device or devices may be made locally, collaboratively, or in other ways depending on various factors, such as overall system state, communication capabilities, intent determination capabilities, or otherwise).


As illustrated in FIG. 3, a multi-modal natural language input may be received at an input device in an operation 310. The multi-modal input may include at least a natural language utterance provided by a user, and may further include other input modalities such as audio, text, button presses, gestures, or other non-voice inputs. It will also be apparent that prior to receiving the natural language input in operation 310, the input device may be configured to establish natural language processing capabilities. For example, establishing natural language processing capabilities may include, among other things, loading an Automatic Speech Recognizer and any associated recognition grammars, launching a conversational language processor to handle dialogues with the user, and installing one or more domain agents that provide functionality for respective application domains or contextual domains (e.g., navigation, music, movies, weather, information retrieval, device control, etc.).


The input device may also be configured to coordinate synchronization of intent determination capabilities, shared knowledge, and other information with the central device and the secondary devices in the environment prior to receiving the input at operation 310. For example, when the input device installs a domain agent, the installed domain agent may bootstrap context variables, semantics, namespace variables, criteria values, and other context related to that agent from other devices in the system. Similarly, misrecognitions may be received from the central device and the secondary devices in order to enable correction of agents that use information relevant to the received misrecognitions, and vocabularies and associated translation mechanisms may be synchronized among the devices to account for potential variations between the Automatic Speech Recognizers used by the various devices (e.g., each device in the environment cannot be guaranteed to use the same Automatic Speech Recognizer or recognition grammars, necessitating vocabulary and translation mechanisms to be shared among the devices that share intent determination capabilities).


Upon establishing and synchronizing natural language processing capabilities and subsequently receiving a multi-modal natural language input in operation 310, the input device may determine whether the environment has been set up to automatically transmit the input to the central device in a decisional operation 320. In such a case, processing proceeds to an operation 360 for transmitting the input to the central device, which may then process the input according to techniques to be described in relation to FIG. 4. If the environment has not been set up to automatically communicate the input to the central device, however, processing proceeds to an operation 330, where the input device performs transcription of the natural language utterance contained in the multi-modal input. For example, the input device may transcribe the utterance using the Automatic Speech Recognizer and recognition grammars associated therewith according to techniques described above and in the above-referenced U.S. Patents and U.S. Patent Applications.


Subsequently, in an operation 340, an intent of the multi-modal natural language input may be determined at the input device using local natural language processing capabilities and resources. For example, any non-voice input modalities included in the input may be merged with the utterance transcription and a conversational language processor associated with the input device may utilize local information relating to context, domain knowledge, shared knowledge, context variables, criteria values, or other information useful in natural language processing. As such, the input device may attempt to determine a best guess as to an intent of the user that provided the input, such as identifying a conversation type (e.g., query, didactic, or exploratory) or request that may be contained in the input (e.g., a command or query relating to one or more domain agents or application domains).


The intent determination of the input device may be assigned a confidence level (e.g., a device having an Automatic Speech Recognizer that implements multi-pass analysis may assign comparatively higher confidence levels to utterance transcriptions created thereby, which may result in a higher confidence level for the intent determination). The confidence level may be assigned based on various factors, as described in the above-referenced U.S. Patents and U.S. Patent Applications. As such, a decisional operation 350 may include determining whether the intent determination of the input device meets an acceptable level of confidence. When the intent determination meets the acceptable level of confidence, processing may proceed directly to an operation 380 where action may be taken in response thereto. For example, when the intent determination indicates that the user has requested certain information, one or more queries may be formulated to retrieve the information from appropriate information sources, which may include one or more of the other devices. In another example, when the intent determination indicates that the user has requested a given command (e.g., to control a specific device), the command may be routed to the appropriate device for execution.


Thus, in cases where the input device can determine the intent of a natural language input without assistance from the central device or the secondary devices, communications and processing resources may be conserved by taking immediate action as may be appropriate. On the other hand, when the intent determination of the input device does not meet the acceptable level of confidence, decisional operation 350 may result in the input device requesting assistance from the central device in operation 360. In such a case, the multi-modal natural language input may be communicated to the central device in its entirety, whereby the central device processes the input according to techniques described in FIG. 4. However, should transmission to the central device fail for some reason, the input device may shift into a disconnected peer-to-peer mode where one or more secondary devices may be utilized, as will be described below in relation to FIG. 7. When transmission to the central device occurs without incident, however, the input device may receive an intent determination from the central device in an operation 370, and may further receive results of one or more requests that the central device was able to resolve, or requests that the central device has formulated for further processing on the input device. As such, the input device may take action in operation 380 based on the information received from the central device in operation 370. For example, the input device may route queries or commands to local or remote information sources or devices based on the intent determination, or may present results of the requests processed by the central device to the user.
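
By way of a non-limiting illustration, the decision flow of FIG. 3 may be sketched as follows, with the central device acting as a recognizer of first resort (automatic routing) or of last resort (insufficient local confidence), and with a shift to the disconnected peer-to-peer mode when transmission fails. The class names, the callable parameters standing in for local, central, and peer processing, and the threshold value of 0.7 are assumptions made only for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Intent:
    label: str
    confidence: float


class CentralUnavailable(Exception):
    """Raised by the hypothetical transport when the central device cannot be reached."""


def process_at_input_device(
    utterance: str,
    auto_route: bool,
    local: Callable[[str], Intent],
    central: Callable[[str], Intent],
    peers: Callable[[str], Intent],
    threshold: float = 0.7,
) -> tuple[Intent, str]:
    """Return the chosen intent and the path taken: 'local', 'central', or 'peers'."""
    if not auto_route:                         # decisional operation 320
        intent = local(utterance)              # operations 330 and 340
        if intent.confidence >= threshold:     # decisional operation 350
            return intent, "local"             # act directly in operation 380
    try:
        return central(utterance), "central"   # operations 360 and 370
    except CentralUnavailable:
        return peers(utterance), "peers"       # disconnected peer-to-peer mode


def strong(_: str) -> Intent:
    return Intent("play_music", 0.9)


def weak(_: str) -> Intent:
    return Intent("unknown", 0.3)


def unreachable(_: str) -> Intent:
    raise CentralUnavailable()


handled_locally = process_at_input_device("play some jazz", False, strong, unreachable, weak)
last_resort = process_at_input_device("play some jazz", False, weak, strong, weak)
fallback = process_at_input_device("play some jazz", True, strong, unreachable, weak)
```

In the first call the central device is never contacted, which corresponds to conserving communications and processing resources when local confidence is acceptable.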


Referring to FIG. 4, the central device may receive the multi-modal natural language input from the input device in an operation 410. The central device, having aggregated context and other knowledge from throughout the environment, may thus transcribe the utterance in an operation 420 and determine an intent of the input from the transcribed utterance in an operation 430. As such, the central device may consider information relating to context, domain agents, applications, and device capabilities throughout the environment in determining the intent of the utterance, including identification of one or more domains relevant to the input. However, it will be apparent that utilizing information aggregated from throughout the environment may cause ambiguity or uncertainty in various instances (e.g., an utterance containing the word “traffic” may have a different intent in domains relating to movies, music, and navigation).


As such, once the central device has attempted to determine the intent of the natural language input, a determination may be made in an operation 440 as to whether one or more secondary devices (i.e., other devices in the constellation besides the input device) may also be capable of intent determination in the identified domain or domains. When no such secondary devices can be identified, decisional operation 440 may branch directly to an operation 480 to send to the input device the determined intent and any commands, queries, or other requests identified from the determined intent.


On the other hand, when one or more secondary devices in the environment have intent determination capabilities in the identified domain or domains, the natural language input may be sent to such secondary devices in an operation 450. The secondary devices may then determine an intent as illustrated in FIG. 5, which may include techniques generally similar to those described above in relation to the input device and central device (i.e., the natural language input may be received in an operation 510, an utterance contained therein may be transcribed in an operation 520, and an intent determination made in an operation 530 may be returned to the central device in an operation 540).


Returning to FIG. 4, the central device may collate intent determination responses received from the secondary devices in an operation 460. For example, as indicated above, the central device may identify one or more secondary devices capable of determining intent in a domain that the central device has identified as being relevant to the natural language utterance. As will be apparent, the secondary devices invoked in operation 450 may often include a plurality of devices, and intent determination responses may be received from the secondary devices in an interleaved manner, depending on processing resources, communications throughput, or other factors (e.g., the secondary devices may include a telematics device having a large amount of processing power and a broadband network connection and an embedded mobile phone having less processing power and only a cellular connection, in which case the telematics device may be highly likely to provide results to the central device before the embedded mobile phone). Thus, based on potential variations in response time of secondary devices, the central device may be configured to place constraints on collating operation 460. For example, the collating operation 460 may be terminated as soon as an intent determination has been received from one of the secondary devices that meets an acceptable level of confidence, or the operation 460 may be cut off when a predetermined amount of time has lapsed or a predetermined amount of resources have been consumed. In other implementations, however, it will be apparent that collating operation 460 may be configured to run to completion, regardless of whether delays have occurred or suitable intent determinations have been received. 
Further, various criteria may be used to determine whether or when to end the collating operation 460, including the nature of a given natural language input, dialogue, or other interaction, or system or user preferences, among other criteria.
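
By way of a non-limiting illustration, the constraints that may be placed on collating operation 460 may be sketched as follows. The type and function names, the arrival times, and the confidence values are assumptions made only for the example, and responses are modeled as a list with recorded arrival times rather than as live network traffic.

```python
from typing import Iterable, NamedTuple


class Response(NamedTuple):
    device: str
    intent: str
    confidence: float
    arrived_at: float  # seconds after the input was sent to the secondary devices


def collate(responses: Iterable[Response], accept: float, deadline: float) -> list[Response]:
    """Collect responses in arrival order, stopping early when one meets the
    acceptable confidence level or when the time limit has lapsed."""
    collected: list[Response] = []
    for r in sorted(responses, key=lambda r: r.arrived_at):
        if r.arrived_at > deadline:
            break  # time limit reached: stop waiting for slower devices
        collected.append(r)
        if r.confidence >= accept:
            break  # a sufficiently confident determination is in hand
    return collected


responses = [
    Response("embedded_phone", "play_song", 0.90, 2.5),
    Response("telematics", "navigate", 0.60, 0.4),
    Response("media_system", "play_song", 0.85, 1.1),
]
early_exit = collate(responses, accept=0.8, deadline=2.0)
timed_out = collate(responses, accept=0.99, deadline=1.0)
to_completion = collate(responses, accept=2.0, deadline=float("inf"))
```

Setting an unreachable acceptance level and an unbounded deadline, as in the last call, corresponds to configuring the collating operation to run to completion.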


In any event, when the collating operation 460 has completed, a subsequent operation 470 may include the central device arbitrating among the intent determination responses received from one or more of the secondary devices previously invoked in operation 450. For example, each of the invoked secondary devices that generate an intent determination may also assign a confidence level to that intent determination, and the central device may consider the confidence levels in arbitrating among the responses. Moreover, the central device may associate other criteria with the secondary devices or the intent determinations received from the secondary devices to further enhance a likelihood that the best intent determination will be used. For example, various ones of the secondary devices may only be invoked for partial recognition in distinct domains, and the central device may aggregate and arbitrate the partial recognitions to create a complete transcription. In another example, a plurality of secondary devices may be invoked to perform overlapping intent determination, and the central device may consider capabilities of the secondary devices to weigh the respective confidence levels (e.g., when one of two otherwise identical secondary devices employs multi-pass speech recognition analysis, the secondary device employing the multi-pass speech recognition analysis may be weighed as having a higher likelihood of success). It will be apparent that the central device may be configured to arbitrate and select one intent determination from among all of the intent hypotheses, which may include the intent determination hypothesis generated by the central device in operation 430. Upon selecting the best intent determination hypothesis, the central device may then provide that intent determination to the input device in operation 480, as well as any commands, queries, or other requests that may be relevant thereto. 
The input device may then take appropriate action as described above in relation to FIG. 3.
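
By way of a non-limiting illustration, the arbitration of operation 470 may be sketched as follows, where the confidence level reported by each device is weighed by a capability factor and the hypothesis generated by the central device in operation 430 is included among the candidates. The function name, the multiplicative weighting, the weight of 1.2 for multi-pass recognition, and the intent labels are assumptions made only for the example.

```python
def arbitrate(hypotheses, capability_weight):
    """Select one hypothesis from (device, intent, confidence) tuples.

    Each confidence is multiplied by a per-device capability weight;
    devices without an entry default to a weight of 1.0.
    """
    def score(hypothesis):
        device, _intent, confidence = hypothesis
        return confidence * capability_weight.get(device, 1.0)

    return max(hypotheses, key=score)


hypotheses = [
    ("central", "navigation.traffic_report", 0.70),  # from operation 430
    ("player_a", "music.play_artist", 0.72),          # single-pass recognizer
    ("player_b", "movies.play_title", 0.68),          # multi-pass recognizer
]
weighted = arbitrate(hypotheses, {"player_b": 1.2})
unweighted = arbitrate(hypotheses, {})
```

With the weight applied, the multi-pass device prevails despite reporting a lower raw confidence level; without it, the highest raw confidence level is selected.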


According to various aspects of the invention, FIG. 6 illustrates a block diagram of an exemplary distributed implementation of the integrated, multi-modal, multi-device natural language voice service environment. As described above, the distributed implementation may also be categorized as a disconnected or peer-to-peer mode that may be employed when a central device in a centralized implementation cannot be reached or otherwise does not meet the needs of the environment. The distributed implementation illustrated in FIG. 6 may generally operate with similar purposes as described above in relation to the centralized implementation (i.e., to ensure that the environment includes a comprehensive model of aggregate knowledge and capabilities of a plurality of devices 610a-n in the environment). Nonetheless, the distributed implementation may operate in a somewhat different manner, in that one or more of the devices 610a-n may be provided with the entire constellation model, or various aspects of the model may be distributed among the plurality of devices 610a-n, or various combinations thereof.


Generally speaking, the plurality of voice-enabled devices 610a-n may be coupled to one another by a voice services interface 630, which may include any suitable real or virtual interface (e.g., a common message bus or network interface, a service-oriented abstraction layer, etc.). The various devices 610a-n may therefore operate as cooperative nodes in determining intent for multi-modal natural language utterances received by any one of the devices 610. Furthermore, the devices 610a-n may share knowledge of vocabularies, context, capabilities, and other information, while certain forms of data may be synchronized to ensure consistent processing among the devices 610a-n. For example, because natural language processing components used in the devices 610a-n may vary (e.g., different recognition grammars or speech recognition techniques may exist), vocabulary translation mechanisms, misrecognitions, context variables, criteria values, criteria handlers, and other such information used in the intent determination process should be synchronized to the extent that communication capabilities permit.


By sharing intent determination capabilities, device capabilities, inferencing capabilities, domain knowledge, and other information, decisions as to routing an utterance to a specific one of the devices 610a-n may be made locally (e.g., at an input device), collaboratively (e.g., a device having particular capabilities relevant to the utterance may communicate a request to process the utterance), or various combinations thereof (e.g., the input device may consider routing to secondary devices only when an intent of the utterance cannot be determined). Similarly, partial recognition performed at one or more of the devices 610a-n may be used to determine routing strategies for further intent determination of the utterance. For example, an utterance that contains a plurality of requests relating to a plurality of different domains may be received at an input device that can only determine intent in one of the domains. In this example, the input device may perform partial recognition for the domain associated with the input device, and the partial recognition may also identify the other domains relevant to the utterance for which the input device does not have sufficient recognition information. Thus, the partial recognition performed by the input device may result in identification of other potentially relevant domains and a strategy may be formulated to route the utterance to other devices in the environment that include recognition information for those domains.
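
By way of a non-limiting illustration, a routing strategy formulated from a partial recognition may be sketched as follows. The function name, the plan layout, and the domain labels are assumptions made only for the example, and the partial recognition itself is assumed to have already identified the domains relevant to the utterance.

```python
def routing_strategy(detected_domains, local_domains, device_domains):
    """Split the domains identified by partial recognition into those the
    input device handles itself, those routed to other devices, and those
    for which no device has recognition information."""
    plan = {"local": [], "remote": {}, "unresolved": []}
    for domain in detected_domains:
        if domain in local_domains:
            plan["local"].append(domain)
            continue
        candidates = sorted(
            device for device, domains in device_domains.items() if domain in domains
        )
        if candidates:
            plan["remote"][domain] = candidates
        else:
            plan["unresolved"].append(domain)
    return plan


# An utterance containing requests in three domains, received at a device
# that can only determine intent for music.
plan = routing_strategy(
    detected_domains=["music", "navigation", "weather"],
    local_domains={"music"},
    device_domains={"car": {"navigation"}, "server": {"navigation", "movies"}},
)
```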


As a result, multi-modal natural language inputs, including natural language utterances, may be routed among the various devices 610a-n in order to perform intent determination in a distributed manner. However, as the capabilities and knowledge held by any one of the devices 610a-n may vary, each of the devices 610a-n may be associated with a reliability factor for intent determinations generated by the respective devices 610a-n. As such, to ensure that final intent determinations can be relied upon with a sufficient level of confidence, knowledge may be distributed among the devices 610a-n to ensure that reliability metrics for intent determinations provided by each of the devices 610a-n are commensurable throughout the environment. For example, additional knowledge may be provided to a device having a low intent determination reliability, even when such knowledge results in redundancy in the environment, to ensure commensurate reliability of intent determination environment-wide.
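
By way of a non-limiting illustration, the distribution of additional knowledge to devices having low intent determination reliability may be sketched as follows. The function name, the numeric reliability factors, the reliability floor, and the choice of the most reliable device as the knowledge source are all assumptions made only for the example; the disclosure does not specify how reliability is measured or which device supplies the knowledge.

```python
def replication_plan(reliability: dict[str, float], floor: float) -> dict[str, str]:
    """Map each device whose reliability factor is below the floor to the
    device from which additional knowledge would be copied."""
    # Iterating in sorted order makes ties resolve deterministically.
    source = max(sorted(reliability), key=lambda device: reliability[device])
    return {
        device: source
        for device in sorted(reliability)
        if reliability[device] < floor and device != source
    }


reliability = {"server": 0.92, "phone": 0.55, "car": 0.80}
plan = replication_plan(reliability, floor=0.75)
```

In this sketch the knowledge copied to the phone duplicates knowledge already held by the server, which reflects accepting redundancy in order to make reliability commensurate across the environment.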


Therefore, in distributed implementations of the integrated voice services environment, utterances may be processed in various ways, which may depend on circumstances at a given time (e.g., system states, system or user preferences, etc.). For example, an utterance may be processed locally at an input device and only routed to secondary devices when an intent determination confidence level falls below a given threshold. In another example, utterances may be routed to a specific device based on the modeling of knowledge and capabilities discussed above. In yet another example, utterances may be flooded among all of the devices in the environment, and arbitration may occur whereby intent determinations may be collated and arbitrated to determine a best guess at intent determination.


Thus, utterances may be processed in various ways, including through local techniques, centralized techniques, distributed techniques, and various combinations thereof. Although many variations will be apparent, FIG. 7 illustrates an exemplary method for combined local and distributed processing of multi-modal, natural language inputs in a distributed implementation of the voice service environment, according to various aspects of the invention. In particular, the distributed processing may begin in an operation 710, where a multi-modal natural language input may be received at an input device. The input device may then utilize various natural language processing capabilities associated therewith in an operation 720 to transcribe an utterance contained in the multi-modal input (e.g., using an Automatic Speech Recognizer and associated recognition grammars), and may subsequently determine a preliminary intent of the multi-modal natural language input in an operation 730. It will be apparent that operations 710 through 730 may generally be performed using local intent determination capabilities associated with the input device.


Thereafter, the input device may invoke intent determination capabilities of one or more secondary devices in an operation 740. More particularly, the input device may provide information associated with the multi-modal natural language input to one or more of the secondary devices, which may utilize local intent determination capabilities to attempt to determine intent of the input using techniques as described in relation to FIG. 5. It will also be apparent that, in various implementations, the secondary devices invoked in operation 740 may include only devices having intent determination capabilities associated with a specific domain identified as being associated with the input. In any event, the input device may receive intent determinations from the invoked secondary devices in an operation 750, and the input device may then collate the intent determinations received from the secondary devices. The input device may then arbitrate among the various intent determinations, or may combine various ones of the intent determinations (e.g., when distinct secondary devices determine intent in distinct domains), or otherwise arbitrate among the intent determinations to determine a best guess at the intent of the multi-modal natural language input (e.g., based on confidence levels associated with the various intent determinations). Based on the determined intent, the input device may then take appropriate action in an operation 770, such as issuing one or more commands, queries, or other requests to be executed at one or more of the input device or the secondary devices.
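
By way of a non-limiting illustration, the collation and arbitration performed by the input device in the method of FIG. 7 may be sketched as follows, where determinations in distinct domains are combined and competing determinations in the same domain are arbitrated by confidence level. The function name, the tuple layout, and the sample values are assumptions made only for the example.

```python
def distributed_intent(preliminary, secondary_results):
    """Combine the preliminary local determination with determinations
    received from secondary devices.

    Each determination is a (domain, intent, confidence) tuple. Distinct
    domains are kept side by side; within one domain the determination
    having the highest confidence level is kept.
    """
    best = {}
    for domain, intent, confidence in [preliminary, *secondary_results]:
        if domain not in best or confidence > best[domain][1]:
            best[domain] = (intent, confidence)
    return best


preliminary = ("music", "play_song", 0.60)      # operation 730
received = [                                     # operation 750
    ("music", "play_album", 0.80),
    ("navigation", "route_home", 0.90),
]
result = distributed_intent(preliminary, received)
```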


Furthermore, in addition to the exemplary implementations described above, various implementations may include a continuous listening mode of operation where a plurality of devices may continuously listen for multi-modal voice-based inputs. In the continuous listening mode, each of the devices in the environment may be triggered to accept a multi-modal input when one or more predetermined events occur. For example, the devices may each be associated with one or more attention words, such as “Phone, <multi-modal request>” for a mobile phone, or “Computer, <multi-modal request>” for a personal computer. When one or more of the devices in the environment recognize the associated attention word, keyword activation may result, where the associated devices are triggered to accept the subsequent multi-modal request. Further, where a plurality of devices in a constellation may be listening, the constellation may use all available inputs to increase recognition rates.
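
Keyword activation on an attention word may be illustrated as follows. The sketch assumes that the utterance has already been transcribed to text and that the attention word is separated from the request by a comma, as in the examples above; the `ATTENTION_WORDS` table and the `parse_attention` function are hypothetical names used only for illustration.

```python
# Hypothetical mapping from attention word to the device it activates.
ATTENTION_WORDS = {
    "phone": "mobile phone",
    "computer": "personal computer",
}


def parse_attention(utterance, attention_words=ATTENTION_WORDS):
    """Return (device, request) when the utterance opens with a known
    attention word followed by a comma; otherwise return None."""
    head, separator, rest = utterance.partition(",")
    device = attention_words.get(head.strip().lower())
    if not separator or device is None:
        return None  # no attention word recognized, so no activation
    return device, rest.strip()


activated = parse_attention("Phone, call home")
ignored = parse_attention("call home")
```

An utterance without a recognized attention word yields `None`, which corresponds to the device continuing to listen without accepting a request.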


Moreover, it will be apparent that the continuous listening mode may be applied in centralized voice service environments, distributed voice service environments, or various combinations thereof. For example, when each device in the constellation has a different attention word, any given device that recognizes an attention word may consult a constellation model to determine a target device or functionality associated with the attention word. In another example, when a plurality of devices in the constellation share one or more attention words, the plurality of devices may coordinate with one another to synchronize information for processing the multi-modal input, such as a start time for an utterance contained therein.
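
Both examples may be sketched in Python. The layout of the constellation model shown here (a dictionary keyed by device name) and the rule of adopting the earliest reported start time are assumptions made for illustration; the description does not prescribe either.

```python
# Hypothetical constellation model: each device lists its attention words.
CONSTELLATION_MODEL = {
    "mobile phone": {"attention_words": {"phone"}},
    "personal computer": {"attention_words": {"computer", "pc"}},
}


def resolve_targets(model, attention_word):
    """Return the devices associated with a recognized attention word."""
    word = attention_word.lower()
    return sorted(
        name for name, entry in model.items() if word in entry["attention_words"]
    )


def synchronize_start_time(reported_start_times):
    """Agree on one utterance start time among devices sharing an attention word.

    Assumed policy: adopt the earliest start time reported by any device.
    """
    return min(reported_start_times.values())


targets = resolve_targets(CONSTELLATION_MODEL, "PC")
start = synchronize_start_time({"mobile phone": 12.40, "personal computer": 12.35})
```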


Implementations of the invention may be made in hardware, firmware, software, or various combinations thereof. The invention may also be implemented as instructions stored on a machine-readable medium, which may be read and executed by one or more processors. A machine-readable medium may include various mechanisms for storing or transmitting information in a form readable by a machine (e.g., a computing device). For example, a machine-readable storage medium may include read only memory, random access memory, magnetic disk storage media, optical storage media, flash memory devices, and others, and machine-readable transmission media may include forms of propagated signals, such as carrier waves, infrared signals, digital signals, and others. Further, firmware, software, routines, or instructions may be described in the above disclosure in terms of specific exemplary aspects and implementations of the invention, and as performing certain actions. However, it will be apparent that such descriptions are merely for convenience and that such actions in fact result from computing devices, processors, controllers, or other devices executing the firmware, software, routines, or instructions.


Aspects and implementations may be described as including a particular feature, structure, or characteristic, but every aspect or implementation may not necessarily include the particular feature, structure, or characteristic. Further, when a particular feature, structure, or characteristic has been described in connection with an aspect or implementation, it will be understood that such feature, structure, or characteristic may be included in connection with other aspects or implementations, whether or not explicitly described. Thus, various changes and modifications may be made to the preceding description without departing from the scope or spirit of the invention, and the specification and drawings should therefore be regarded as exemplary only, and the scope of the invention determined solely by the appended claims.

Claims
  • 1. A method to provide an integrated, multi-modal, natural language voice services environment having an input device, a central device, and one or more secondary devices, wherein the method comprises: receiving, at the central device, a multi-modal natural language input from the input device, wherein the input device initially received the multi-modal natural language input; maintaining, on the input device, the central device, and the one or more secondary devices, a constellation model that describes natural language resources, dynamic states, and intent determination capabilities associated with the input device, the central device, and the one or more secondary devices; aggregating the natural language resources, the dynamic states, and the intent determination capabilities associated with the input device and the one or more secondary devices on the central device to converge the natural language resources, the dynamic states, and the intent determination capabilities held across the natural language voice services environment on the central device; determining, on the central device, a preliminary intent associated with the multi-modal natural language input using the converged natural language resources, dynamic states, and intent determination capabilities held across the natural language voice services environment; sending the multi-modal natural language input from the central device to the one or more secondary devices to invoke the intent determination capabilities associated with the one or more secondary devices; collating, at the central device, intent determination responses received from the one or more secondary devices with the preliminary intent determined on the central device to generate an intent hypothesis associated with the multi-modal natural language input on the central device; and returning the intent hypothesis associated with the multi-modal natural language input and information relating to one or more requests associated with the multi-modal natural language input to the input device, wherein the input device invokes one or more actions based on the returned intent hypothesis and the information relating to one or more requests associated with the multi-modal natural language input.
  • 2. The method of claim 1, wherein the intent determination capabilities associated with the input device, the central device, and the one or more secondary devices include local processing power, local storage resources, and local natural language processing capabilities.
  • 3. The method of claim 1, wherein collating the intent determination responses includes: receiving the intent determination responses from the one or more secondary devices in an interleaved manner; and arbitrating among the interleaved intent determination responses received from the one or more secondary devices and the preliminary intent determined on the central device to generate the intent hypothesis associated with the multi-modal natural language input.
  • 4. The method of claim 3, wherein the generated intent hypothesis comprises one of the interleaved intent determination responses received from the one or more secondary devices or the preliminary intent determined on the central device having a highest confidence level.
  • 5. The method of claim 3, wherein arbitrating among the interleaved intent determination responses and the preliminary intent includes: evaluating, at the central device, the constellation model to determine whether the intent determination capabilities associated with any of the one or more secondary devices include multi-pass speech recognition; and assigning a higher weight to confidence levels associated with any of the interleaved intent determination responses that were generated using multi-pass speech recognition.
  • 6. The method of claim 3, wherein collating the intent determination responses further includes terminating the collating in response to determining that a predetermined amount of time has lapsed, a predetermined amount of resources have been consumed, or one or more of the interleaved intent determination responses received from the one or more secondary devices meets or exceeds an acceptable confidence level.
  • 7. The method of claim 6, wherein the input device that initially received the multi-modal natural language input communicates the multi-modal natural language input to the central device in response to an initial intent determination generated on the input device failing to meet or exceed the acceptable confidence level.
  • 8. The method of claim 1, wherein the natural language resources and the dynamic states associated with the input device, the central device, and the one or more secondary devices include local vocabularies, local vocabulary translation mechanisms, local misrecognitions, local context information, local short-term shared knowledge, and local long-term shared knowledge.
  • 9. The method of claim 1, further comprising operating the natural language voice services environment in a continuous listening mode that causes the input device to initially accept the multi-modal natural language input in response to determining that one or more predetermined events have occurred.
  • 10. The method of claim 1, further comprising identifying, at the central device, one or more domains relevant to the multi-modal natural language input, wherein the central device sends the multi-modal language input to the one or more secondary devices in response to determining that the intent determination capabilities associated therewith have relevance to the one or more identified domains.
  • 11. The method of claim 1, wherein the information returned to the input device includes results associated with the central device resolving the one or more requests and the one or more actions that the input device invokes include presenting the results in response to the multi-modal natural language input.
  • 12. The method of claim 1, wherein the information returned to the input device includes one or more queries or commands formulated on the central device and the one or more actions that the input device invokes include routing the queries or commands to generate results to present in response to the multi-modal natural language input.
  • 13. A system to provide an integrated, multi-modal, natural language voice services environment having an input device, one or more secondary devices, and a central device configured to: receive a multi-modal natural language input from the input device, wherein the input device initially received the multi-modal natural language input; maintain a constellation model and distribute the constellation model to the input device and the one or more secondary devices, wherein the constellation model describes natural language resources, dynamic states, and intent determination capabilities associated with the input device, the central device, and the one or more secondary devices; aggregate the natural language resources, the dynamic states, and the intent determination capabilities associated with the input device and the one or more secondary devices to converge the natural language resources, the dynamic states, and the intent determination capabilities held across the natural language voice services environment; use the converged natural language resources, dynamic states, and intent determination capabilities held across the natural language voice services environment to determine a preliminary intent associated with the multi-modal natural language input; send the multi-modal natural language input to the one or more secondary devices to invoke the intent determination capabilities associated with the one or more secondary devices; collate intent determination responses received from the one or more secondary devices with the determined preliminary intent to generate an intent hypothesis associated with the multi-modal natural language input on the central device; and return the intent hypothesis associated with the multi-modal natural language input and information relating to one or more requests associated with the multi-modal natural language input to the input device, wherein the input device is configured to invoke one or more actions based on the returned intent hypothesis and the information relating to one or more requests associated with the multi-modal natural language input.
  • 14. The system of claim 13, wherein the intent determination capabilities associated with the input device, the central device, and the one or more secondary devices include local processing power, local storage resources, and local natural language processing capabilities.
  • 15. The system of claim 13, wherein to collate the intent determination responses, the central device is further configured to: receive the intent determination responses from the one or more secondary devices in an interleaved manner; and arbitrate among the interleaved intent determination responses received from the one or more secondary devices and the determined preliminary intent to generate the intent hypothesis associated with the multi-modal natural language input.
  • 16. The system of claim 15, wherein the generated intent hypothesis comprises one of the interleaved intent determination responses received from the one or more secondary devices or the preliminary intent determined on the central device having a highest confidence level.
  • 17. The system of claim 15, wherein to arbitrate among the interleaved intent determination responses and the preliminary intent, the central device is further configured to: evaluate the constellation model to determine whether the intent determination capabilities associated with any of the one or more secondary devices include multi-pass speech recognition; and assign a higher weight to confidence levels associated with any of the interleaved intent determination responses that were generated using multi-pass speech recognition.
  • 18. The system of claim 15, wherein to collate the intent determination responses, the central device is further configured to terminate receiving the interleaved intent determination responses in response to a predetermined amount of time having lapsed, a predetermined amount of resources having been consumed, or one or more of the received interleaved intent determination responses meeting or exceeding an acceptable confidence level.
  • 19. The system of claim 18, wherein the input device that initially received the multi-modal natural language input is configured to communicate the multi-modal natural language input to the central device in response to an initial intent determination generated on the input device failing to meet or exceed the acceptable confidence level.
  • 20. The system of claim 13, wherein the natural language resources and the dynamic states associated with the input device, the central device, and the one or more secondary devices include local vocabularies, local vocabulary translation mechanisms, local misrecognitions, local context information, local short-term shared knowledge, and local long-term shared knowledge.
  • 21. The system of claim 13, wherein the central device is further configured to operate the natural language voice services environment in a continuous listening mode that causes the input device to initially accept the multi-modal natural language input in response to determining that one or more predetermined events have occurred.
  • 22. The system of claim 13, wherein the central device is further configured to identify one or more domains relevant to the multi-modal natural language input and send the multi-modal language input to the one or more secondary devices in response to the intent determination capabilities associated therewith having relevance to the one or more identified domains.
  • 23. The system of claim 13, wherein the information returned to the input device includes results associated with the central device resolving the one or more requests and the one or more actions invoked on the input device include presenting the results in response to the multi-modal natural language input.
  • 24. The system of claim 13, wherein the information returned to the input device includes one or more queries or commands that the central device formulates and the one or more actions invoked on the input device include routing the queries or commands to generate results to present in response to the multi-modal natural language input.
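
The arbitration recited in claims 3 through 6 (and the corresponding system claims 15 through 18) may be illustrated with a short Python sketch. It is a hedged illustration only: the `collate` function, the `multi_pass` flag in the constellation model, the weight of 1.2, the acceptable confidence of 0.9, and the timeout value are all hypothetical, and the resource-consumption condition of claim 6 is omitted for brevity.

```python
import time


def collate(responses, constellation_model, acceptable=0.9,
            timeout_s=1.0, multipass_weight=1.2, clock=time.monotonic):
    """Arbitrate among interleaved (device, intent, confidence) responses.

    Confidence levels from devices that the constellation model marks as using
    multi-pass speech recognition are weighted upward (capped at 1.0).
    Collation stops once the time budget lapses or a response meets or exceeds
    the acceptable confidence level.
    """
    deadline = clock() + timeout_s
    best = None
    for device, intent, confidence in responses:
        if clock() > deadline:
            break  # predetermined amount of time has lapsed
        if constellation_model.get(device, {}).get("multi_pass"):
            confidence = min(1.0, confidence * multipass_weight)
        if best is None or confidence > best[2]:
            best = (device, intent, confidence)
        if confidence >= acceptable:
            break  # acceptable confidence level reached
    return best


model = {"car": {"multi_pass": True}, "phone": {"multi_pass": False}}
interleaved = [
    ("phone", "route_to:airport", 0.70),
    ("car", "route_to:airport_parking", 0.65),
    ("tv", "show:flights", 0.95),
]
hypothesis = collate(interleaved, model, acceptable=0.75)
```

Here the response from the car starts with a lower raw confidence than the response from the phone, but the multi-pass weighting lifts it above the acceptable level, so collation ends before the third response is considered.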
20080109285 Reuther et al. May 2008 A1
20080115163 Gilboa et al. May 2008 A1
20080133215 Sarukkai Jun 2008 A1
20080140385 Mahajan et al. Jun 2008 A1
20080147410 Odinak Jun 2008 A1
20080177530 Cross et al. Jul 2008 A1
20080189110 Freeman et al. Aug 2008 A1
20080235023 Kennewick et al. Sep 2008 A1
20080235027 Cross Sep 2008 A1
20080319751 Kennewick et al. Dec 2008 A1
20090052635 Jones et al. Feb 2009 A1
20090067599 Agarwal et al. Mar 2009 A1
20090076827 Bulitta et al. Mar 2009 A1
20090106029 DeLine et al. Apr 2009 A1
20090117885 Roth May 2009 A1
20090144271 Richardson et al. Jun 2009 A1
20090150156 Kennewick et al. Jun 2009 A1
20090171664 Kennewick et al. Jul 2009 A1
20090216540 Tessel et al. Aug 2009 A1
20090271194 Davis et al. Oct 2009 A1
20090273563 Pryor Nov 2009 A1
20090276700 Anderson et al. Nov 2009 A1
20090313026 Coffman et al. Dec 2009 A1
20100023320 Di Cristo et al. Jan 2010 A1
20100029261 Mikkelsen et al. Feb 2010 A1
20100036967 Caine et al. Feb 2010 A1
20100049501 Kennewick et al. Feb 2010 A1
20100049514 Kennewick et al. Feb 2010 A1
20100057443 Di Cristo et al. Mar 2010 A1
20100063880 Atsmon et al. Mar 2010 A1
20100145700 Kennewick et al. Jun 2010 A1
20100185512 Borger et al. Jul 2010 A1
20100204986 Kennewick et al. Aug 2010 A1
20100204994 Kennewick et al. Aug 2010 A1
20100217604 Baldwin et al. Aug 2010 A1
20100286985 Kennewick et al. Nov 2010 A1
20100299142 Freeman et al. Nov 2010 A1
20100312566 Odinak et al. Dec 2010 A1
20110112827 Kennewick et al. May 2011 A1
20110112921 Kennewick et al. May 2011 A1
20110131036 Di Cristo et al. Jun 2011 A1
20110131045 Cristo et al. Jun 2011 A1
20110231182 Weider et al. Sep 2011 A1
20110231188 Kennewick et al. Sep 2011 A1
20120022857 Baldwin et al. Jan 2012 A1
20120101809 Kennewick et al. Apr 2012 A1
20120101810 Kennewick et al. Apr 2012 A1
20120109753 Kennewick et al. May 2012 A1
20120150636 Freeman et al. Jun 2012 A1
20120278073 Weider et al. Nov 2012 A1
20130054228 Baldwin et al. Feb 2013 A1
20130211710 Kennewick et al. Aug 2013 A1
20130253929 Weider et al. Sep 2013 A1
Foreign Referenced Citations (16)
Number Date Country
1 320 043 Jun 2003 EP
1 646 037 Apr 2006 EP
WO 9946763 Sep 1999 WO
WO 0021232 Apr 2000 WO
WO 0046792 Aug 2000 WO
WO 0178065 Oct 2001 WO
WO 2004072954 Aug 2004 WO
WO 2007019318 Feb 2007 WO
WO 2007021587 Feb 2007 WO
WO 2007027546 Mar 2007 WO
WO 2007027989 Mar 2007 WO
WO 2008098039 Aug 2008 WO
WO 2008118195 Oct 2008 WO
WO 2009075912 Jun 2009 WO
WO 2009145796 Dec 2009 WO
WO 2010096752 Aug 2010 WO
Non-Patent Literature Citations (18)
Entry
Statement in Accordance with the Notice from the European Patent Office dated Oct. 1, 2007 Concerning Business Methods (OJ EPO Nov. 2007, 592-593), XP002456252.
Reuters, “IBM to Enable Honda Drivers to Talk to Cars”, Charles Schwab & Co., Inc., Jul. 28, 2002, 1 page.
Lin, Bor-shen, et al., “A Distributed Architecture for Cooperative Spoken Dialogue Agents with Coherent Dialogue State and History”, ASRU'99, 1999, 4 pages.
Kuhn, Thomas, et al., “Hybrid In-Car Speech Recognition for Mobile Multimedia Applications”, Vehicular Technology Conference, IEEE, Jul. 1999, pp. 2009-2013.
Belvin, Robert, et al., “Development of the HRL Route Navigation Dialogue System”, Proceedings of the First International Conference on Human Language Technology Research, San Diego, 2001, pp. 1-5.
Lind, R., et al., “The Network Vehicle—A Glimpse into the Future of Mobile Multi-Media”, IEEE Aerosp. Electron. Systems Magazine, vol. 14, No. 9, Sep. 1999, pp. 27-32.
Zhao, Yilin, “Telematics: Safe and Fun Driving”, IEEE Intelligent Systems, vol. 17, Issue 1, 2002, pp. 10-14.
Chai et al., “MIND: A Semantics-Based Multimodal Interpretation Framework for Conversational System”, Proceedings of the International CLASS Workshop on Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, Jun. 2002, pp. 37-46.
Cheyer et al., “Multimodal Maps: An Agent-Based Approach”, International Conference on Cooperative Multimodal Communication (CMC/95), May 24-26, 1995, pp. 111-121.
Elio et al., “On Abstract Task Models and Conversation Policies” in Workshop on Specifying and Implementing Conversation Policies, Autonomous Agents '99, Seattle, 1999, 10 pages.
Turunen, “Adaptive Interaction Methods in Speech User Interfaces”, Conference on Human Factors in Computing Systems, Seattle, Washington, 2001, pp. 91-92.
Mao, Mark Z., “Automatic Training Set Segmentation for Multi-Pass Speech Recognition”, Department of Electrical Engineering, Stanford University, CA, copyright 2005, IEEE, pp. I-685 to I-688.
Vanhoucke, Vincent, “Confidence Scoring and Rejection Using Multi-Pass Speech Recognition”, Nuance Communications, Menlo Park, CA, 2005, 4 pages.
Weng, Fuliang, et al., “Efficient Lattice Representation and Generation”, Speech Technology and Research Laboratory, SRI International, Menlo Park, CA, 1998, 4 pages.
El Meliani et al., “A Syllabic-Filler-Based Continuous Speech Recognizer for Unlimited Vocabulary”, Canadian Conference on Electrical and Computer Engineering, vol. 2, Sep. 5-8, 1995, pp. 1007-1010.
Arrington, Michael, “Google Redefines GPS Navigation Landscape: Google Maps Navigation for Android 2.0”, TechCrunch, printed from the Internet <http://www.techcrunch.com/2009/10/28/google-redefines-car-gps-navigation-google-maps-navigation-android/>, Oct. 28, 2009, 4 pages.
Bazzi, Issam et al., “Heterogeneous Lexical Units for Automatic Speech Recognition: Preliminary Investigations”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, Jun. 5-9, 2000, XP010507574, pp. 1257-1260.
O'Shaughnessy, Douglas, “Interacting with Computers by Voice: Automatic Speech Recognition and Synthesis”, Proceedings of the IEEE, vol. 91, No. 9, Sep. 1, 2003, XP011100665, pp. 1272-1305.
Related Publications (1)
Number Date Country
20090299745 A1 Dec 2009 US