This application relates generally to computer technology, including but not limited to voice assistants for devices and related libraries.
Voice-based assistants that interact with a user through audio/voice input and output have grown in popularity alongside the growth of the Internet and cloud computing. These assistants can provide an interface for consuming digital media, as well as various types of information, including news, sports scores, weather, and stocks, to name a few examples.
A user may have multiple devices where voice-based assistant functionality is desirable. It is desirable to have a voice-based assistant that can be implemented and used across a variety of devices, that can provide a consistent experience across the variety of devices, and that can support functionality specific to a particular device.
The implementations described in this specification are directed to embedding or including a voice assistant in embedded systems and/or devices in a way that enables control of the local device for a wide variety of operating system platforms.
In accordance with some implementations, a thin, low-resource-usage device-side library has features including local processing of audio data, listening for wakewords or hotwords, and sending user requests. Additional features include connectivity to a cloud brain, an extensible voice action control system, a portability layer allowing integration into many diverse operating environments, and the capability to be updated asynchronously to the rest of the client software.
The described implementations have an advantage of providing a similar user experience for interacting with a voice assistant across many different devices.
The described implementations have another advantage: they decouple innovation in voice assistant capabilities from innovation in the device itself. For example, if an improved recognition pipeline were created, it could be pushed out to devices; the device manufacturer need not do anything to receive it and can still benefit from previous voice commands.
In accordance with some implementations, a method at an electronic device with an audio input system, one or more processors, and memory storing one or more programs for execution by the one or more processors includes: receiving a verbal input at the device; processing the verbal input; transmitting a request to a remote system, the request including information determined based on the verbal input; receiving a response to the request, wherein the response is generated by the remote system in accordance with the information based on the verbal input; and performing an operation in accordance with the response, where one or more of the receiving, processing, transmitting, receiving, and performing are performed by one or more voice processing modules of a voice assistant library executing on the electronic device, the voice processing modules providing a plurality of voice processing operations that are accessible to one or more application programs and/or operating software executing or executable on the electronic device.
In some implementations, a device-agnostic voice assistant library for electronic devices that include an audio input system includes: one or more voice processing modules configured to execute on a common operating system implemented on a plurality of different electronic device types, the voice processing modules providing a plurality of voice processing operations that are accessible to application programs and operating software executing on the electronic devices, thereby enabling portability of voice-enabled applications configured to interact with one or more of the voice processing operations.
In some implementations, an electronic device includes an audio input system, one or more processors, and memory storing one or more programs to be executed by the one or more processors. The one or more programs include instructions for: receiving a verbal input at the device; processing the verbal input; transmitting a request to a remote system, the request including information determined based on the verbal input; receiving a response to the request, wherein the response is generated by the remote system in accordance with the information based on the verbal input; and performing an operation in accordance with the response, where one or more of the receiving, processing, transmitting, receiving, and performing are performed by one or more voice processing modules of a voice assistant library executing on the electronic device, the voice processing modules providing a plurality of voice processing operations that are accessible to one or more application programs and/or operating software executing or executable on the electronic device.
In some implementations, a non-transitory computer readable storage medium stores one or more programs. The one or more programs include instructions which, when executed by an electronic device with an audio input system and one or more processors, cause the electronic device to: receive a verbal input at the device; process the verbal input; transmit a request to a remote system, the request including information determined based on the verbal input; receive a response to the request, wherein the response is generated by the remote system in accordance with the information based on the verbal input; and perform an operation in accordance with the response, where one or more of the receiving, processing, transmitting, receiving, and performing are performed by one or more voice processing modules of a voice assistant library executing on the electronic device, the voice processing modules providing a plurality of voice processing operations that are accessible to one or more application programs and/or operating software executing or executable on the electronic device.
Like reference numerals refer to corresponding parts throughout the drawings.
Reference will now be made in detail to various implementations, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention and the described implementations. However, the invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the implementations.
In some implementations, an objective of a voice assistant is to provide users a personalized voice interface that is available across a variety of devices and enables a wide variety of use cases, providing a consistent experience throughout a user's day. The voice assistant and/or related functionality may be integrated into first-party and third-party products and devices.
An example use case involves media. Voice commands may be used to initiate and control playback of music, radio, podcasts, news, and other audio media. For example, a user can utter voice commands (e.g., “play jazz music,” “play 107.5 FM,” “skip to next song,” “play ‘Serial’”) to play or control various types of audio media. Such commands may be used to play audio media from a variety of sources, such as online streams of terrestrial radio stations, music subscription services, local storage, remote storage, and so on. In addition, the voice assistant may use integrations that are available with casting devices to support additional content.
Another example use case involves remote playback. The user may issue a voice command to a casting device that includes the voice assistant functionality, and in accordance with the voice command, media is played back on (e.g., cast to) a device specified in the command, on the devices in a specified group of one or more devices, or on one or more devices in an area specified in the command. The user can also specify generic categories or specific content in the command, and the appropriate media is played in accordance with the specified category or content in the command.
A further example use case involves non-media functionality, such as productivity features (e.g., timers, alarm clocks, calendar), home automation, questions and answers powered by a search engine (e.g., search queries), fun (e.g., assistant personality, jokes, games, Easter eggs), and everyday tasks (e.g., transportation, navigation, food, finance, gifts, etc.).
In some implementations, the voice assistant is provided as an optional feature of a casting device, and the voice assistant functionality may be updated as part of the casting device.
In some implementations, detection of hotwords or keywords in voice commands and verbal inputs from users is performed by the application processor (e.g., performed at the client device or casting device to which the user speaks the voice command or verbal input). In some implementations, detection of hotwords is performed by an external digital signal processor (e.g., performed by a server system processing the voice commands, as opposed to the client or casting device to which the user speaks the voice command or verbal input).
In some implementations, a device with the voice assistant feature includes one or more of: far-field support, “push to assist” or “push to talk” (e.g., a button to initiate voice assistant functionality), and AC power.
In some implementations, the voice assistant includes application programming interfaces (APIs) for one or more of: audio input (e.g., microphone, media loopback for ongoing playback), microphone state (e.g., on/off), ducking (e.g., reducing the volume of all outputs when the assistant is triggered through either hotword or push to talk), and new assistant events and status messages (e.g., assistant was triggered (e.g., heard hotword, pushed assistant button), listening to speech, waiting on server, responding, responding finished, alarm/timer is playing).
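To make ducking concrete, the following C++ sketch shows one way an output mixer could attenuate every stream while the assistant is active and restore the user's levels afterwards. The class `OutputMixer`, its methods, and the 0.2 attenuation factor are hypothetical and chosen only for illustration; they are not part of the described APIs.

```cpp
#include <map>
#include <string>

// Hypothetical mixer used only to illustrate ducking; not the library's API.
class OutputMixer {
 public:
  // Registers or updates the user-selected volume (0.0 to 1.0) of a stream.
  void SetVolume(const std::string& stream, double volume) {
    user_volume_[stream] = volume;
  }

  // Called when the assistant is triggered by a hotword or push to talk.
  void Duck() { ducked_ = true; }

  // Called when the assistant has finished responding.
  void Unduck() { ducked_ = false; }

  // Effective volume: the user volume, scaled down while ducked.
  // Unknown streams are treated as silent.
  double EffectiveVolume(const std::string& stream) const {
    auto it = user_volume_.find(stream);
    if (it == user_volume_.end()) return 0.0;
    return ducked_ ? it->second * kDuckFactor : it->second;
  }

 private:
  static constexpr double kDuckFactor = 0.2;  // assumed attenuation
  std::map<std::string, double> user_volume_;
  bool ducked_ = false;
};
```

Keeping the user-selected volume separate from the ducked flag means that un-ducking restores each stream exactly, even if several streams were playing when the assistant was triggered.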
In some implementations, the device with the voice assistant functionality may communicate with another device (e.g., with a configuration application on a smartphone) for configuration purposes, to enable or facilitate the functionality of the voice assistant on the device (e.g., to set up the voice assistant functionality on the device or to provide tutorials to the user). The configuration may include specifying a device location, associating the device with a user account, obtaining user opt-in to voice control, linking to and prioritizing media services (e.g., video streaming services, music streaming services), configuring home automation, etc.
In some implementations, the device with the voice assistant may include one or more user interface elements or indications to the user. One or more of the user interface elements are physical (e.g., as light patterns displayed using one or more LEDs, as sound patterns output by the speaker), and may include one or more of: a “push to assist” or “push to talk” trigger not dependent on a hotword, a “mute microphone” trigger and visual status indication, an “awaiting hotword status” visual indication, a “hotword detected” visual indication, an “assistant is actively listening” visual indication visible at some distance (e.g., 15 feet), an “assistant is working/thinking” visual indication, a “voice message/notification is available” visual indication, a “volume level” control method and status indicator, and a “pause/resume” control method. In some implementations, these physical user interface elements are provided by the client device or casting device. In some implementations, the voice assistant supports a common set of user interface elements or indications across different devices, for consistency of experience across the different devices.
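One way to keep these indications consistent across devices is for every device to derive its visual indication from a shared set of assistant states. The C++ sketch below assumes a hypothetical `AssistantState` enumeration and invented LED pattern names; an actual device would map the states onto its own LED hardware.

```cpp
#include <string>

// Hypothetical assistant states, mirroring the indications listed above.
enum class AssistantState {
  kAwaitingHotword,
  kHotwordDetected,
  kListening,
  kThinking,
  kNotificationAvailable,
  kMicMuted,
};

// Hypothetical LED pattern description (pattern names are invented).
struct LedPattern {
  std::string name;
  bool animated;
};

// Maps each state to exactly one pattern so all devices show the same cue.
LedPattern IndicationFor(AssistantState state) {
  switch (state) {
    case AssistantState::kAwaitingHotword:
      return {"off", false};
    case AssistantState::kHotwordDetected:
      return {"flash", true};
    case AssistantState::kListening:
      return {"solid-bright", false};
    case AssistantState::kThinking:
      return {"spin", true};
    case AssistantState::kNotificationAvailable:
      return {"pulse", true};
    case AssistantState::kMicMuted:
      return {"solid-red", false};
  }
  return {"off", false};  // not reached for valid enum values
}
```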
In some implementations, the voice assistant supports device-specific commands and/or hotwords as well as a standardized, predefined set of commands and/or hotwords.
In some implementations, the casting device 106 is communicatively coupled to a client 102. The client 102 may include an application or module (e.g., a casting device settings app) that facilitates configuration of the casting device 106, including voice assistant features.
In some implementations, the casting device 106 is coupled to a display 144.
In some implementations, the casting device 106 includes one or more visual indicators 142 (e.g., LED lights).
In some implementations, the casting device 106 includes a receiver module 146. In some implementations, the receiver module 146 operates the casting device 106, including, for example, performing hardware functions and communicating with a content source. In some implementations, there are different receiver modules 146 at the casting device 106 for different content sources. In some implementations, the receiver module 146 includes respective sub-modules for different content sources.
The voice assistant client device 104 (e.g., a smartphone, a laptop or desktop computer, a tablet computer, a voice command device, a mobile device or in-vehicle system with GOOGLE ASSISTANT by GOOGLE INC., GOOGLE HOME by GOOGLE INC.) includes an audio input device 132 (e.g., a microphone) and an audio output device 134 (e.g., one or more speakers, headphones). In some implementations, a voice assistant client device 104 (e.g., voice command device, a mobile device or in-vehicle system with GOOGLE ASSISTANT by GOOGLE INC., GOOGLE HOME by GOOGLE INC.) is communicatively coupled to a client 140 (e.g., a smartphone, a tablet device). The client 140 may include an application or module (e.g., a voice command device settings app) that facilitates configuration of the voice assistant client device 104, including voice assistant features.
In some implementations, the voice assistant client device 104 includes one or more visual indicators 152 (e.g., LED lights). An example of a voice assistant client device with visual indicators (e.g., LED lights) is illustrated in FIG. 4A of U.S. Provisional Application No. 62/336,566, titled “LED Design Language for Visual Affordance of Voice User Interfaces,” filed May 13, 2016, which is incorporated by reference herein in its entirety.
The casting device 106 and the voice assistant client device 104 include respective instances of a voice assistant module or library 136. The voice assistant module/library 136 is a module/library that implements voice assistant functionality across a variety of devices (e.g., casting device 106, voice assistant client device 104). The voice assistant functionality is consistent across the variety of devices, while still allowing for device-specific features (e.g., support for controlling device-specific features through the voice assistant). In some implementations, the voice assistant module or library 136 is the same or similar across devices; instances of the same library can be included in a variety of devices.
In some implementations, depending on the type of device, the voice assistant module/library 136 is included in an application installed in the device, in the device operating system, or embedded in the device (e.g., embedded in the firmware).
In some implementations, the voice assistant module/library 136-1 at the casting device 106 communicates with the receiver module 146 to perform voice assistant operations.
In some implementations, the voice assistant module/library 136-1 at the casting device 106 can control or otherwise affect the visual indicators 142.
In some implementations, the voice assistant module/library 136-2 at the voice assistant client device 104 can control or otherwise affect the visual indicators 152.
The casting device 106 and the voice assistant client device 104 are communicatively coupled to a server system 114 through one or more communication networks 112 (e.g., local area networks, wide area networks, the Internet). The voice assistant module/library 136 detects (e.g., receives) verbal input picked up (e.g., captured) by the audio input device 108/132, processes the verbal input (e.g., to detect hotwords), and transmits the processed verbal input or an encoding of the processed verbal input to the server 114. The server 114 receives the processed verbal input or an encoding thereof, and processes the received verbal input to determine the appropriate response to the verbal input. The appropriate response may be content, information, or instructions, commands, or metadata directing the casting device 106 or voice assistant client device 104 to perform a function or operation. The server 114 sends the response to the casting device 106 or voice assistant client device 104, where the content or information is output (e.g., through audio output device 110/134) and/or a function is performed. As part of the processing, the server 114 may communicate with one or more content or information sources 138 to obtain content or information, or references to such, for the response. In some implementations, the content or information sources 138 include, for example, search engines, databases, information associated with the user's account (e.g., calendar, task list, email), websites, and media streaming services. In some implementations, a voice assistant client device 104 and a casting device 106 may communicate or interact with each other. Examples of such communication or interaction, as well as example operations of a voice assistant client device 104 (e.g., GOOGLE HOME by GOOGLE INC.), are described in U.S. Provisional Application No. 62/336,566, titled “LED Design Language for Visual Affordance of Voice User Interfaces,” filed May 13, 2016, U.S. Provisional Application No. 62/336,569, titled “Voice-Controlled Closed Caption Display,” filed May 13, 2016, and U.S. Provisional Application No. 62/336,565, titled “Media Transfer among Media Output Devices,” filed May 13, 2016, all of which are incorporated by reference herein in their entirety.
In some implementations, the voice assistant module/library 136 receives verbal input captured by the audio input device 108/132 and transmits the verbal input (with no or little processing) or an encoding thereof to the server 114. The server 114 processes the verbal input to detect hotwords, determine an appropriate response, and send the response to the casting device 106 or voice assistant client device 104.
If the server 114 determines that the verbal input includes a command for the casting device 106 or the voice assistant client device 104 to perform a function, the server 114 includes in the response instructions or metadata that direct the casting device 106 or the voice assistant client device 104 to perform the function. The function may be specific to the device, and support for such functions in the voice assistant may be included in the casting device 106 or the voice assistant client device 104 as a custom module or function added to or linked with the voice assistant module/library 136.
In some implementations, the server 114 includes, or is coupled to, a voice processing backend 148 that performs the verbal input processing operations and determines responses to the verbal inputs.
In some implementations, the server 114 includes a downloadable voice assistant library 150. The downloadable voice assistant library 150 (e.g., the same as voice assistant library 136, or an update thereof) may include new features and functionality or updates, and can be downloaded to add the voice assistant library to a device or to update a voice assistant library 136.
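Deciding whether the downloadable voice assistant library 150 should replace an installed library 136 requires comparing versions. The following C++ sketch assumes, purely for illustration, dotted numeric version strings such as "1.10.0"; the actual versioning scheme is not specified here, and `ParseVersion` and `IsUpdateAvailable` are hypothetical helpers.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Splits a dotted version string such as "1.4.2" into numeric components.
// Assumes every component is a non-empty decimal number.
std::vector<int> ParseVersion(const std::string& version) {
  std::vector<int> parts;
  std::stringstream stream(version);
  std::string item;
  while (std::getline(stream, item, '.')) {
    parts.push_back(std::stoi(item));
  }
  return parts;
}

// Returns true if `available` is newer than `installed`. Missing trailing
// components count as zero, so "1.2" and "1.2.0" compare equal.
bool IsUpdateAvailable(const std::string& installed,
                       const std::string& available) {
  std::vector<int> current = ParseVersion(installed);
  std::vector<int> candidate = ParseVersion(available);
  size_t n = std::max(current.size(), candidate.size());
  current.resize(n, 0);
  candidate.resize(n, 0);
  return candidate > current;  // lexicographic comparison of components
}
```

Comparing numerically rather than as strings matters here: as text, "1.10.0" would sort before "1.2.0".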
Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some implementations, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
In some implementations, the voice assistant client device 104 or casting device 106 includes one or more libraries and one or more application programming interfaces (APIs) for voice assistant and related functionality. These libraries may be included in or linked to by the voice assistant module 136 or receiver module 146. The libraries include modules associated with voice assistant functionality or other functions that facilitate voice assistant functionality. The APIs provide interfaces to hardware and other software (e.g., operating system, other applications) that facilitate voice assistant functionality. For example, a voice assistant client library 240, debugging library 242, platform APIs 244, and POSIX APIs 246 may be stored in memory 206. These libraries and APIs are further described below with reference to
In some implementations, the voice assistant client device 104 or casting device 106 includes a voice application 250 that uses the modules and functions of the voice assistant client library 240, and optionally debugging library 242, platform APIs 244, and POSIX APIs 246. In some implementations, the voice application 250 is a first-party or third-party application that is voice-enabled through use of the voice assistant client library 240, etc.
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various implementations. In some implementations, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
Memory 306 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 306, optionally, includes one or more storage devices remotely located from one or more processing units 302. Memory 306, or alternatively the non-volatile memory within memory 306, includes a non-transitory computer readable storage medium. In some implementations, memory 306, or the non-transitory computer readable storage medium of memory 306, stores the following programs, modules, and data structures, or a subset or superset thereof:
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various implementations. In some implementations, memory 306, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 306, optionally, stores additional modules and data structures not described above.
In some implementations, the voice assistant module 136 (
In some implementations, a library includes modules that support audio signal processing operations, including, for example, bandpass, filtering, erasing, and hotword detection. In some implementations, a library includes modules for connecting to backend (e.g., server-based) speech processing systems. In some implementations, a library includes modules for debugging (e.g., debugging speech recognition, debugging hardware issues, automated testing).
In some implementations, the libraries are flexible; the libraries may be used across multiple device types and incorporate the same voice assistant functionality.
In some implementations, the libraries depend on standard shared objects (e.g., standard Linux shared objects), and thus are compatible with different operating systems or platforms that use these standard shared objects (e.g., various Linux distributions and flavors of embedded Linux).
In some implementations, the POSIX APIs 246 provide standard APIs for compatibility with various operating systems. Thus, the voice assistant client library 240 may be included in devices running different POSIX-compliant operating systems, and the POSIX APIs 246 provide a compatibility interface between the voice assistant client library 240 and the different operating systems.
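As an illustration of such a compatibility layer, the C++ sketch below wraps two POSIX calls, `clock_gettime` with `CLOCK_MONOTONIC` and `nanosleep`, behind helper functions in a `portability` namespace. The helper names are hypothetical and are not the actual POSIX APIs 246; the point is that library code calling only helpers of this kind builds unchanged on any POSIX-compliant system.

```cpp
#include <cerrno>
#include <cstdint>
#include <time.h>

// Hypothetical portability helpers written only against POSIX interfaces.
namespace portability {

// Monotonic time in milliseconds; unaffected by wall-clock adjustments.
int64_t MonotonicMillis() {
  timespec ts{};
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return static_cast<int64_t>(ts.tv_sec) * 1000 + ts.tv_nsec / 1000000;
}

// Sleeps for `ms` milliseconds, resuming if a signal interrupts the sleep.
void SleepMillis(int64_t ms) {
  timespec req{};
  req.tv_sec = ms / 1000;
  req.tv_nsec = (ms % 1000) * 1000000;
  timespec rem{};
  while (nanosleep(&req, &rem) == -1 && errno == EINTR) {
    req = rem;  // interrupted: sleep for the remaining time only
  }
}

}  // namespace portability
```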
In some implementations, the libraries include modules to support and facilitate base use cases available across the different types of devices that implement the voice assistant (e.g., timers, alarms, volume control).
In some implementations, the voice assistant client library 240 includes a controller interface 402 that includes functions or modules for starting, configuring, and interacting with the voice assistant. In some implementations, the controller interface 402 includes a “Start( )” function or module 404 for starting the voice assistant at the device; a “RegisterAction( )” function or module 406 for registering an action with the voice assistant (e.g., so that the action can be invoked via the voice assistant); a “Reconfigure( )” function 408 for reconfiguring the voice assistant with updated settings; and a “RegisterEventObserver( )” function 410 for registering with the assistant a set of functions for basic events.
In some implementations, the voice assistant client library 240 includes multiple functions or modules associated with particular voice assistant functionality. For example, a hotword detection module 412 processes voice inputs to detect hotwords. A speech processing module 414 processes speech in voice inputs, and converts speech to text or vice versa (e.g., identifying words and phrases, speech-to-textual-data conversion, textual-data-to-speech conversion). An action processing module 416 performs actions and operations responsive to verbal inputs. A local timers/alarms/volume control module 418 facilitates alarm clock, timer, and volume control functionality at the device and control of same by voice input (e.g., maintaining timers, clocks, and alarm clocks at the device). A logging/metrics module 420 records (e.g., logs) voice inputs and responses, and determines and records related metrics (e.g., response time, idle time, etc.). An audio input processing module 422 processes the audio of voice inputs. An MP3 decoding module 424 decodes MP3-encoded audio. An audio input module 426 captures audio through an audio input device (e.g., a microphone). An audio output module 428 outputs audio through an audio output device (e.g., a speaker). An event queueing and state tracking module 430 queues events associated with the voice assistant at the device and tracks a state of the voice assistant at the device.
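The event queueing and state tracking module 430 can be pictured as a thread-safe queue: producers such as hotword detection or network callbacks post events, and a single consumer processes them in order. The C++ sketch below is a minimal illustration under that assumption; the `EventQueue` class, its string-valued events, and the rule that the last processed event becomes the tracked state are hypothetical simplifications.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Hypothetical event queue with state tracking (cf. module 430).
class EventQueue {
 public:
  // May be called from any thread.
  void Post(const std::string& event) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      events_.push(event);
    }
    ready_.notify_one();
  }

  // Blocks until an event is available, records it as the current
  // state, and returns it. Intended for a single consumer thread.
  std::string WaitAndPop() {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return !events_.empty(); });
    std::string event = events_.front();
    events_.pop();
    state_ = event;
    return event;
  }

  std::string CurrentState() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return state_;
  }

 private:
  mutable std::mutex mutex_;
  std::condition_variable ready_;
  std::queue<std::string> events_;
  std::string state_ = "idle";
};
```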
In some implementations, the debugging library 242 provides modules and functions for debugging. For example, an HTTP server module 432 facilitates debugging of connectivity issues, and a debug server/audio streaming module 434 facilitates debugging of audio issues.
In some implementations, the platform API 244 provides an interface between the voice assistant client library 240 and hardware functionality of the device. For example, the platform API includes a button input interface 436 for capturing button inputs on the device, a loopback audio interface 438 for capturing loopback audio, a logging and metrics interface 440 for logging and determining metrics, an audio input interface 442 for capturing audio input, an audio output interface 444 for outputting audio, and an authentication interface 446 for authenticating a user with other services that may interact with the voice assistant. An advantage of the voice assistant client library organization depicted in
Example code for classes and functions corresponding to the controller 402 (“Controller”) and related classes is shown below. These classes and functions can be employed via common APIs by applications that are executable on a variety of devices.
The class “ActionModule” below facilitates an application registering its own modules to handle commands provided by the voice assistant server:
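For illustration only, and not as the library's actual declaration, the C++ sketch below shows how an application-registered action module might map action names received from the voice assistant server to local handlers. The `Args` and `Handler` types and the `AddHandler` and `Handle` methods are assumptions.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical application-provided action module.
class ActionModule {
 public:
  using Args = std::map<std::string, std::string>;
  using Handler = std::function<bool(const Args&)>;

  explicit ActionModule(std::string name) : name_(std::move(name)) {}

  const std::string& name() const { return name_; }

  // Associates a server-issued action name with a local handler.
  void AddHandler(const std::string& action, Handler handler) {
    handlers_[action] = std::move(handler);
  }

  // Returns false if the action is unknown or its handler reports failure.
  bool Handle(const std::string& action, const Args& args) const {
    auto it = handlers_.find(action);
    return it != handlers_.end() && it->second(args);
  }

 private:
  std::string name_;
  std::map<std::string, Handler> handlers_;
};
```

A device-specific command such as adjusting built-in lights would then be registered by the manufacturer's application without any change to the library itself.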
The class “BuildInfo” below may be used to describe the application running the voice assistant client library 240 or the voice assistant client device 104 itself (e.g., with identifiers or version numbers of the application, the platform, and/or the device):
The class “EventDelegate” below defines functions associated with basic events, such as start of speech recognition, start and completion of the voice assistant outputting a voice response, etc.:
The class “DefaultEventDelegate” below provides do-nothing overrides for certain events:
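The relationship between the two classes can be illustrated with a hypothetical C++ sketch: a pure virtual delegate, a default subclass whose overrides do nothing, and an application delegate that overrides only the events it needs. The callback names below are invented for the example and are not the library's actual signatures.

```cpp
#include <string>
#include <vector>

// Hypothetical event callbacks for basic assistant events.
class EventDelegate {
 public:
  virtual ~EventDelegate() = default;
  virtual void OnRecognitionStarted() = 0;
  virtual void OnSpeechResponseStarted() = 0;
  virtual void OnSpeechResponseFinished() = 0;
};

// Do-nothing overrides, so applications override only what they care about.
class DefaultEventDelegate : public EventDelegate {
 public:
  void OnRecognitionStarted() override {}
  void OnSpeechResponseStarted() override {}
  void OnSpeechResponseFinished() override {}
};

// Example application delegate reacting only to the voice response,
// e.g., to switch an LED on and off; it records what it did in `log`.
class LedEventDelegate : public DefaultEventDelegate {
 public:
  void OnSpeechResponseStarted() override { log.push_back("led:on"); }
  void OnSpeechResponseFinished() override { log.push_back("led:off"); }
  std::vector<std::string> log;
};
```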
The class “Settings” below defines settings (e.g., locale, geolocation, file system directory) that may be provided to the controller 402.
The class “Controller” below corresponds to the controller 402, and the Start( ), Reconfigure( ), RegisterAction( ), and RegisterEventObserver( ) functions correspond to functions Start( ) 404, Reconfigure( ) 408, RegisterAction( ) 406, and RegisterEventObserver( ) 410, respectively.
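A minimal C++ sketch of how a controller exposing these four functions might be organized, together with a settings structure, is shown next. It is a hypothetical illustration, not the library's implementation: the `Settings` fields, the `EventObserver` interface, and the `Dispatch` helper (which runs a registered action when a server response names it) are assumptions, and the function bodies only record state.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical settings (cf. locale, geolocation, file system directory).
struct Settings {
  std::string locale = "en-US";
  double latitude = 0.0;
  double longitude = 0.0;
  std::string data_directory;  // where the library may store its files
};

// Minimal observer standing in for the registered event functions.
class EventObserver {
 public:
  virtual ~EventObserver() = default;
  virtual void OnEvent(const std::string& event) = 0;
};

class Controller {
 public:
  using Action = std::function<void(const std::string& args)>;

  // Starts the assistant once; returns false if it is already running.
  bool Start(const Settings& settings) {
    if (started_) return false;
    settings_ = settings;
    started_ = true;
    Notify("started");
    return true;
  }

  // Applies updated settings without restarting.
  void Reconfigure(const Settings& settings) {
    settings_ = settings;
    Notify("reconfigured");
  }

  void RegisterAction(const std::string& name, Action action) {
    actions_[name] = std::move(action);
  }

  void RegisterEventObserver(std::shared_ptr<EventObserver> observer) {
    observers_.push_back(std::move(observer));
  }

  // Runs a registered action named in a server response.
  bool Dispatch(const std::string& name, const std::string& args) {
    auto it = actions_.find(name);
    if (!started_ || it == actions_.end()) return false;
    it->second(args);
    return true;
  }

  const Settings& settings() const { return settings_; }

 private:
  void Notify(const std::string& event) {
    for (const auto& observer : observers_) observer->OnEvent(event);
  }

  bool started_ = false;
  Settings settings_;
  std::map<std::string, Action> actions_;
  std::vector<std::shared_ptr<EventObserver>> observers_;
};
```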
In some implementations, the voice assistant client device 104 or casting device 106 implements a platform (e.g., a set of interfaces for communicating with other devices using the same platform, and an operating system configured to support the set of interfaces). The example code below illustrates the functions associated with an interface for the voice assistant client library 240 to interact with the platform.
The class “Authentication” below defines an authentication token for authenticating the user of the voice assistant with certain accounts:
The class “OutputStreamType” below defines types of audio output streams:
The class “SampleFormat” below defines supported audio sample formats (e.g., PCM formats):
“BufferFormat” below defines a format of data stored in an audio buffer at the device:
The class “AudioBuffer” defines a buffer for audio data:
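As an illustrative C++ sketch (the enumerator names and the `BytesPerSample` helper are assumptions, not the library's definitions), the format classes and the audio buffer might fit together as follows, with the buffer using its format to compute frame counts and durations.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stream types and sample formats (illustrative values only).
enum class OutputStreamType { kTts, kAlarm, kCalibration };
enum class SampleFormat { kInterleavedS16, kInterleavedS32, kInterleavedF32 };

struct BufferFormat {
  int sample_rate_hz;
  SampleFormat sample_format;
  int channels;
};

// Bytes occupied by one sample of the given format.
inline size_t BytesPerSample(SampleFormat format) {
  switch (format) {
    case SampleFormat::kInterleavedS16: return 2;
    case SampleFormat::kInterleavedS32: return 4;
    case SampleFormat::kInterleavedF32: return 4;
  }
  return 0;
}

// Raw audio bytes plus the format needed to interpret them.
class AudioBuffer {
 public:
  AudioBuffer(BufferFormat format, size_t frames)
      : format_(format),
        data_(frames * format.channels * BytesPerSample(format.sample_format)) {}

  // A frame holds one sample per channel.
  size_t frames() const {
    return data_.size() /
           (format_.channels * BytesPerSample(format_.sample_format));
  }

  // Duration in milliseconds at the buffer's sample rate.
  double DurationMs() const {
    return 1000.0 * frames() / format_.sample_rate_hz;
  }

  size_t size_bytes() const { return data_.size(); }

 private:
  BufferFormat format_;
  std::vector<uint8_t> data_;
};
```

For example, 1600 frames of 16 kHz mono 16-bit PCM occupy 3200 bytes and last 100 ms.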
The class “AudioOutput” below defines an interface for audio output:
The class “AudioInput” below defines an interface for capturing audio input:
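The C++ sketch below shows, under the assumption of 16-bit samples and a callback-based capture model, how such input and output interfaces might be declared, together with in-memory fakes that allow code using them to be exercised without audio hardware. All class and method names are hypothetical.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical interfaces; a platform port would implement these over its
// own audio stack (e.g., ALSA on embedded Linux).
class AudioOutput {
 public:
  virtual ~AudioOutput() = default;
  virtual void Write(const std::vector<int16_t>& samples) = 0;
};

class AudioInput {
 public:
  using Callback = std::function<void(const std::vector<int16_t>& samples)>;
  virtual ~AudioInput() = default;
  virtual void Start(Callback on_data) = 0;  // begin delivering captured audio
  virtual void Stop() = 0;
};

// In-memory fakes for testing code against the interfaces above.
class FakeAudioOutput : public AudioOutput {
 public:
  void Write(const std::vector<int16_t>& samples) override {
    written.insert(written.end(), samples.begin(), samples.end());
  }
  std::vector<int16_t> written;
};

class FakeAudioInput : public AudioInput {
 public:
  void Start(Callback on_data) override { callback_ = std::move(on_data); }
  void Stop() override { callback_ = nullptr; }

  // Simulates the microphone delivering a chunk of samples; ignored when
  // capture is stopped.
  void Deliver(const std::vector<int16_t>& samples) {
    if (callback_) callback_(samples);
  }

 private:
  Callback callback_;
};
```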
The class “Resources” below defines access to system resources:
The class “PlatformApi” below specifies a platform API (e.g., platform API 244) for the voice assistant client library 240:
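To illustrate the role of the platform API as the single porting surface, the hypothetical C++ sketch below bundles authentication, resource access, and audio interfaces into one abstract `PlatformApi` class, with an in-memory implementation for testing. The method names and the placeholder audio classes are assumptions made for the example.

```cpp
#include <map>
#include <string>
#include <utility>

// Placeholders for the audio interfaces described above (bodies omitted).
class AudioInput {
 public:
  virtual ~AudioInput() = default;
};
class AudioOutput {
 public:
  virtual ~AudioOutput() = default;
};

// Supplies a token for authenticating the user with other services.
class Authentication {
 public:
  virtual ~Authentication() = default;
  virtual std::string GetAuthToken() = 0;
};

// Gives the library access to named system resources (e.g., sound files).
class Resources {
 public:
  virtual ~Resources() = default;
  virtual std::string Read(const std::string& name) = 0;
};

// Everything the library needs from the device in one interface, so porting
// to a new device means implementing this class.
class PlatformApi {
 public:
  virtual ~PlatformApi() = default;
  virtual AudioInput& GetAudioInput() = 0;
  virtual AudioOutput& GetAudioOutput() = 0;
  virtual Authentication& GetAuthentication() = 0;
  virtual Resources& GetResources() = 0;
};

// In-memory implementations for testing (illustration only).
class StaticAuthentication : public Authentication {
 public:
  explicit StaticAuthentication(std::string token) : token_(std::move(token)) {}
  std::string GetAuthToken() override { return token_; }

 private:
  std::string token_;
};

class InMemoryResources : public Resources {
 public:
  std::string Read(const std::string& name) override {
    auto it = files.find(name);
    return it == files.end() ? std::string() : it->second;
  }
  std::map<std::string, std::string> files;
};

class FakePlatform : public PlatformApi {
 public:
  explicit FakePlatform(std::string token) : auth_(std::move(token)) {}
  AudioInput& GetAudioInput() override { return input_; }
  AudioOutput& GetAudioOutput() override { return output_; }
  Authentication& GetAuthentication() override { return auth_; }
  Resources& GetResources() override { return resources_; }
  InMemoryResources& resources() { return resources_; }

 private:
  AudioInput input_;
  AudioOutput output_;
  StaticAuthentication auth_;
  InMemoryResources resources_;
};
```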
In some implementations, volume control may be handled outside of the voice assistant client library 240. For example, the system volume may be maintained by the device outside of the control of the voice assistant client library 240. As another example, the voice assistant client library 240 may still support volume control, but requests for volume control to the voice assistant client library 240 are directed to the device.
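The delegation described above might look like the following hypothetical C++ sketch: when the device supplies a volume callback, requests are forwarded to the device and the library keeps no volume of its own; otherwise the library maintains the level locally. The class name, the 0 to 100 percent scale, and the default level of 50 are assumptions.

```cpp
#include <algorithm>
#include <functional>
#include <utility>

// Hypothetical volume handler that can defer to the device.
class VolumeControl {
 public:
  using DeviceSetter = std::function<void(int percent)>;

  // Without a device setter, the library keeps its own volume level.
  VolumeControl() = default;
  explicit VolumeControl(DeviceSetter device_setter)
      : device_setter_(std::move(device_setter)) {}

  // Handles a request such as "set volume to 70" (clamped to 0..100).
  void SetVolume(int percent) {
    percent = std::clamp(percent, 0, 100);
    if (device_setter_) {
      device_setter_(percent);  // the device remains the source of truth
    } else {
      local_percent_ = percent;
    }
  }

  int local_percent() const { return local_percent_; }

 private:
  DeviceSetter device_setter_;
  int local_percent_ = 50;  // assumed default level
};
```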
In some implementations, alarm and timer functionality in the voice assistant client library 240 may be disabled by the user or disabled when implementing the library at a device.
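A hypothetical C++ sketch of a local timer module with an enable flag follows. Time is passed in explicitly (in seconds) rather than read from a clock, an assumption that keeps the example deterministic; the class and method names are illustrative.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical local timer module that the user or the device integrator can
// disable; when disabled, new timers are rejected so the assistant can tell
// the user the feature is unavailable.
class LocalTimers {
 public:
  explicit LocalTimers(bool enabled) : enabled_(enabled) {}

  void SetEnabled(bool enabled) { enabled_ = enabled; }

  // Schedules a timer expiring `duration_s` seconds after `now_s`.
  // Returns false if the feature is disabled.
  bool AddTimer(const std::string& id, long now_s, long duration_s) {
    if (!enabled_) return false;
    deadlines_[id] = now_s + duration_s;
    return true;
  }

  // Removes and returns the ids of timers whose deadline has been reached.
  std::vector<std::string> PopExpired(long now_s) {
    std::vector<std::string> expired;
    for (auto it = deadlines_.begin(); it != deadlines_.end();) {
      if (it->second <= now_s) {
        expired.push_back(it->first);
        it = deadlines_.erase(it);
      } else {
        ++it;
      }
    }
    return expired;
  }

 private:
  bool enabled_;
  std::map<std::string, long> deadlines_;
};
```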
In some implementations, the voice assistant client library 240 also supports an interface to LEDs on the device, to facilitate display of LED animations on the device LEDs.
In some implementations, the voice assistant client library 240 may be included in or linked to by a casting receiver module (e.g., receiver module 146) at a casting device 106. The linkage between the voice assistant client library 240 and the receiver module 146 may include, for example, support for additional actions (e.g., local media playback), and support for control of LEDs on the casting device 106.
The device receives (502) a verbal input at the device. The client device 104/casting device 106 captures a verbal input (e.g., voice input) uttered by a user.
The device processes (504) the verbal input. The client device 104/casting device 106 processes the verbal input. The processing may include hotword detection, conversion to textual data, and identification of words and phrases corresponding to commands, requests, and/or parameters provided by the user. In some implementations, the processing may be minimal or there may be no processing at all. For example, the processing may include encoding the verbal input audio for transmission to server 114, or preparing the captured raw audio of the verbal input for transmission to server 114.
The device transmits (506) a request to a remote system, the request including information determined based on the verbal input. The client device 104/casting device 106 determines a request from the verbal input by processing the verbal input to identify the request and one or more associated parameters from the verbal input. The client device 104/casting device 106 transmits the determined request to the remote system (e.g., server 114), where the remote system determines and generates a response to the request. In some implementations, the client device 104/casting device 106 transmits the verbal input (e.g., as an encoded audio, as raw audio data) to the server 114, and the server 114 processes the verbal input to determine the request and associated parameters.
The device receives (508) a response to the request, where the response is generated by the remote system in accordance with the information based on the verbal input. The remote system (e.g., the server 114) determines and generates a response to the request, and transmits the response to the client device 104/casting device 106.
The device performs (510) an operation in accordance with the response. The client device 104/casting device 106 performs one or more operations in accordance with the received response. For example, if the response is a command to the device to output certain information by audio, the client device 104/casting device 106 retrieves the information, converts the information to speech audio output, and outputs the speech audio through the speaker. As another example, if the response is a command to the device to play media content, the client device 104/casting device 106 retrieves the media content and plays the media content.
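Steps 502 through 510 above can be sketched as a single function with injected collaborators, taking the minimal-processing path in which the device only encodes the captured audio and the remote system derives the request. The JSON field name `"audio"`, the base64 encoding, and the response shape used in the example are assumptions for illustration, not the actual wire protocol between the device and server 114.

```python
import base64
import json
from typing import Callable, Dict


def handle_utterance(
    capture: Callable[[], bytes],
    send: Callable[[str], Dict],
    perform: Callable[[Dict], None],
) -> Dict:
    """One pass through the receive/process/transmit/receive/perform flow."""
    audio = capture()                                   # (502) receive verbal input
    payload = base64.b64encode(audio).decode("ascii")   # (504) minimal processing: encode
    request = json.dumps({"audio": payload})            # (506) build the request ...
    response = send(request)                            # ... transmit it; (508) receive response
    perform(response)                                   # (510) perform the operation
    return response
```

In a test, `send` can be a stub standing in for the remote system, for example one that decodes the audio and replies with a `"speak"` action.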
One or more of the receiving, processing, transmitting, receiving, and performing are performed by one or more voice processing modules of a voice assistant library executing on the electronic device, the voice processing modules providing a plurality of voice processing operations that are accessible to one or more application programs and/or operating software executing or executable on the electronic device (512). The client device 104/casting device 106 may have a voice assistant client library 240 that includes functions and modules for performing one or more of the receiving, processing, transmitting, receiving, and performing steps. The modules of the voice assistant client library 240 provide multiple voice processing and assistant operations that are accessible to applications, operating systems, and platform software at the client device 104/casting device 106 that include or link to the library 240 (e.g., run the library 240 and related APIs).
In some implementations, at least some voice processing operations associated with the voice processing modules are performed on the remote system, which is interconnected with the electronic device via a wide area network. For example, the processing of the verbal input to determine the request may be performed by the server 114, which is connected with the client device 104/casting device 106 through network(s) 112.
In some implementations, the voice assistant library is executable on a common operating system that is operable on a plurality of different device types, thereby enabling portability of voice-enabled applications configured to interact with one or more of the voice processing operations. The voice assistant client library 240 (and related libraries and APIs, e.g., debugging library 242, platform API 244, POSIX API 246) uses standard elements (e.g., objects) of a predefined operating system (e.g., Linux) and is thus operable on a variety of devices that run a distribution or flavor of that operating system (e.g., different Linux or Linux-based distributions or flavors). In this manner, voice assistant functionality is available to a variety of devices, and the voice assistant experience is consistent across them.
In some implementations, the request and response may be handled at the device. For example, for basic functions that may be local to the device such as timers, alarm clocks, clocks, and volume control, the client device 104/casting device 106 may process the verbal input and determine that the request corresponds to one of these basic functions, determine the response at the device, and perform one or more operations in accordance with the response. The device may still report the request and response to the server 114 for logging purposes.
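A minimal sketch of local-first handling: basic requests such as timers and volume are recognized and answered on the device, still reported to the server for logging, and everything else is deferred to the server. The regular expressions, the `LocalFirstDispatcher` name, and the log record shape are hypothetical; a real device would typically match against the output of its speech processing rather than fixed English phrases.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical patterns for two basic functions handled on the device.
TIMER = re.compile(r"set a timer for (\d+) minutes?")
VOLUME = re.compile(r"set (?:the )?volume to (\d+)")


class LocalFirstDispatcher:
    """Handles basic requests locally; everything else goes to the server."""

    def __init__(self, send_to_server: Callable[[str], str],
                 log_to_server: Callable[[Dict[str, str]], None]) -> None:
        self.send_to_server = send_to_server
        self.log_to_server = log_to_server
        self.timers: List[int] = []  # pending timer durations, in seconds
        self.volume = 50             # current level, 0-100

    def _try_local(self, text: str) -> Optional[str]:
        m = TIMER.fullmatch(text)
        if m:
            minutes = int(m.group(1))
            self.timers.append(minutes * 60)
            unit = "minute" if minutes == 1 else "minutes"
            return f"Timer set for {minutes} {unit}."
        m = VOLUME.fullmatch(text)
        if m:
            self.volume = max(0, min(100, int(m.group(1))))
            return f"Volume set to {self.volume}."
        return None

    def handle(self, text: str) -> str:
        text = text.lower().strip()
        reply = self._try_local(text)
        if reply is None:
            return self.send_to_server(text)  # not a basic function: defer to server
        # Handled locally; still report the request and response for logging.
        self.log_to_server({"request": text, "response": reply})
        return reply
```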
In some implementations, a device-agnostic voice assistant library for electronic devices that include an audio input system includes one or more voice processing modules configured to execute on a common operating system implemented on a plurality of different electronic device types, the voice processing modules providing a plurality of voice processing operations that are accessible to application programs and operating software executing on the electronic devices, thereby enabling portability of voice-enabled applications configured to interact with one or more of the voice processing operations. The voice assistant client library 240 can run on a variety of devices that share the same predefined operating system base as the library (e.g., both the library and the device operating system are Linux-based); the library is therefore device-agnostic. The library 240 provides multiple modules for voice assistant functionality that is accessible to applications across the variety of devices.
In some implementations, at least some voice processing operations associated with the voice processing modules are performed on a backend server interconnected with the electronic devices via a wide area network. For example, the library 240 includes modules that communicate with the server 114 to transmit the verbal input to the server 114 for processing to determine the request.
In some implementations, the voice processing operations include device-specific operations configured to control devices coupled (e.g., directly or communicatively) with the electronic devices. The library 240 may include functions or modules for controlling other devices coupled to the client device 104/casting device 106 (e.g., wireless speakers, smart televisions).
In some implementations, the voice processing operations include information and media request operations configured to provide requested information and/or media content to a user of the electronic devices or on devices coupled (e.g., directly or communicatively) with the electronic devices. The library 240 may include functions or modules for retrieving information or media and providing the information or media (e.g., read email out loud, read news articles out loud, play streaming music) on the client device 104/casting device 106 or on a coupled device.
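Both kinds of operations above, controlling a coupled device and fulfilling an information or media request, can be exposed to application programs through a common registry of named actions. The Python sketch below is a hypothetical illustration of such an extensible action system; the `VoiceActionRegistry` class and the action names `tv.set_power` and `media.play` are invented for the example and are not part of the described library.

```python
from typing import Any, Callable, Dict


class VoiceActionRegistry:
    """Hypothetical registry through which device software exposes voice actions."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def register(self, action: str, handler: Callable[..., Any]) -> None:
        """Make `handler` available under `action`; names must be unique."""
        if action in self._handlers:
            raise ValueError(f"action already registered: {action}")
        self._handlers[action] = handler

    def dispatch(self, action: str, **params: Any) -> Any:
        """Invoke the handler for `action` with the request's parameters."""
        handler = self._handlers.get(action)
        if handler is None:
            raise KeyError(action)
        return handler(**params)
```

A device integrator might register a handler that powers a coupled television on or off, and an application might register one that starts a media stream; the library then dispatches resolved requests by name without knowing what either handler does.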
It will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first contact could be termed a second contact, and, similarly, a second contact could be termed a first contact, without changing the meaning of the description, so long as all occurrences of the “first contact” are renamed consistently and all occurrences of the “second contact” are renamed consistently. The first contact and the second contact are both contacts, but they are not the same contact.
The terminology used herein is for the purpose of describing particular implementations only and is not intended to be limiting of the claims. As used in the description of the implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used herein, the term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in accordance with a determination” or “in response to detecting,” that a stated condition precedent is true, depending on the context. Similarly, the phrase “if it is determined [that a stated condition precedent is true]” or “if [a stated condition precedent is true]” or “when [a stated condition precedent is true]” may be construed to mean “upon determining” or “in response to determining” or “in accordance with a determination” or “upon detecting” or “in response to detecting” that the stated condition precedent is true, depending on the context.
Reference will now be made in detail to various implementations, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention and the described implementations. However, the invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the implementations.
The foregoing description, for purpose of explanation, has been described with reference to specific implementations. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The implementations were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various implementations with various modifications as are suited to the particular use contemplated.
This application is a continuation of U.S. patent application Ser. No. 17/832,049, filed Jun. 3, 2022, which is a continuation of U.S. patent application Ser. No. 16/888,346, filed May 29, 2020, which is a continuation of U.S. patent application Ser. No. 16/020,971, filed Jun. 27, 2018, which is a continuation of U.S. patent application Ser. No. 15/592,137, filed May 10, 2017, which claims the benefit of U.S. Provisional Application No. 62/336,551, filed May 13, 2016; U.S. Provisional Application No. 62/336,566, filed May 13, 2016; U.S. Provisional Application No. 62/336,569, filed May 13, 2016; U.S. Provisional Application No. 62/336,565, filed May 13, 2016; and U.S. Provisional Application No. 62/334,434, filed May 10, 2016, each of which is hereby incorporated by reference herein in its entirety.
Prior publication data:
Publication Number | Date | Country
---|---|---
20230368789 A1 | Nov 2023 | US
Provisional applications:
Application Number | Date | Country
---|---|---
62336565 | May 2016 | US
62336566 | May 2016 | US
62336569 | May 2016 | US
62336551 | May 2016 | US
62334434 | May 2016 | US
Continuations:
Relation | Application Number | Filing Date | Country
---|---|---|---
Parent | 17832049 | Jun 2022 | US
Child | 18226045 | | US
Parent | 16888346 | May 2020 | US
Child | 17832049 | | US
Parent | 16020971 | Jun 2018 | US
Child | 16888346 | | US
Parent | 15592137 | May 2017 | US
Child | 16020971 | | US