Homes are becoming more wired and connected with the proliferation of computing devices such as desktops, tablets, entertainment systems, and portable communication devices. As these computing devices evolve, many different ways have been introduced to allow users to interact with computing devices, such as through mechanical devices (e.g., keyboards, mice, etc.), touch screens, motion, and gesture. Another way to interact with computing devices is through speech.
One drawback with this mode is that vocal interaction with computers can be affected by background noise. This can be particularly problematic in the home environment, where audio devices, such as televisions and radios, may output verbal utterances that the computer interprets as user input. Accordingly, there is a need for techniques to cancel vocal background noise in such voice controlled computing environments.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical components or features.
An architecture in which users can request and receive information from cloud-based services through a hands-free, electronic voice controlled assistant is described in this document. The voice controlled assistant may be positioned in a room (e.g., at home, work, store, etc.) to receive user input in the form of voice interactions, such as spoken requests or a conversational dialogue. The voice input may be transmitted to a network accessible computing platform, or “cloud service”, which processes and interprets the input to perform some function. Since the voice controlled assistant is located in a room, there is a chance that background sources of speech, music, or other noise, such as from a television or radio, may adversely impact the user's intended vocal input to the assistant. Accordingly, the architecture described herein is designed to intelligently remove the background noise while isolating and preserving the user's vocal input.
The architecture may be implemented in many ways. One illustrative implementation is described below in which the voice controlled assistant is placed within a room. However, the architecture may be implemented in many other contexts and situations in which background speech may adversely disrupt user voice interaction.
Illustrative Environment
Generally, the voice controlled assistant 104 has a microphone and speaker to facilitate audio interactions with a user 112. The voice controlled assistant 104 is implemented without a haptic input component (e.g., keyboard, keypad, touch screen, joystick, control buttons, etc.) or a display. In certain implementations, a limited set of one or more haptic input components may be employed (e.g., a dedicated button to initiate a configuration, power on/off, etc.). Nonetheless, the primary and potentially only mode of user interaction with the electronic assistant 104 is through voice input and audible output. One example implementation of the voice controlled assistant 104 is provided below in more detail with reference to
The microphone of the voice controlled assistant 104 detects words and sounds uttered from the user 112. The user may speak predefined commands (e.g., “Awake”; “Sleep”), or use a more casual conversation style when interacting with the assistant 104 (e.g., “I'd like to go to a movie. Please tell me what's playing at the local cinema.”). The voice controlled assistant receives the user's vocal input and transmits it over the network 108 to the cloud services 106. The vocal input is interpreted to form an operational request or command, which is then processed at the cloud services 106. The requests may be for essentially any type of operation that can be performed by cloud services, such as database inquiries, requesting and consuming entertainment (e.g., gaming, finding and playing music, movies or other content, etc.), personal management (e.g., calendaring, note taking, etc.), online shopping, financial transactions, and so forth.
In
The voice controlled assistant 104 may be communicatively coupled to the network 108 via wired technologies (e.g., wires, USB, fiber optic cable, etc.), wireless technologies (e.g., RF, cellular, satellite, Bluetooth, etc.), or other connection technologies. The network 108 is representative of any type of communication network, including a data and/or voice network, and may be implemented using a wired infrastructure (e.g., cable, CAT5, fiber optic cable, etc.), a wireless infrastructure (e.g., RF, cellular, microwave, satellite, Bluetooth, etc.), and/or other connection technologies. The network 108 carries data, such as audio data, between the cloud services 106 and the voice controlled assistant 104.
The cloud services 106 generally refer to a network accessible platform implemented as a computing infrastructure of processors, storage, software, data access, and so forth that is maintained and accessible via a network such as the Internet. Cloud services 106 do not require end-user knowledge of the physical location and configuration of the system that delivers the services. Common expressions associated with cloud services include “on-demand computing”, “software as a service (SaaS)”, “platform computing”, “network accessible platform”, and so forth.
The cloud services 106 include a command response system 120 that is implemented by one or more servers, such as servers 122(1), 122(2), . . . , 122(S). The servers 122(1)-(S) may host any number of applications that can process the user input received from the voice controlled assistant 104, and produce a suitable response. These servers 122(1)-(S) may be arranged in any number of ways, such as server farms, stacks, and the like that are commonly used in data centers. One example implementation of the command response system 120 is described below in more detail with reference to
As noted above, because the voice controlled assistant 104 is located in a room, other ambient noise may be introduced into the environment that is unintended for detection by the assistant 104. The background noise may be human voices, singing, music, movie sound tracks, gaming sound effects, and the like. In the
The voice controlled assistant 104 captures both the user command and the background noise. As the assistant is intentionally designed with limited functionality to keep costs low, there may be limited or no noise canceling capabilities implemented on the assistant 104. Instead, the aggregated audio data that includes the user command and background noise is transmitted over the network 108 to the cloud services 106. This is represented in
The command response system 120 in the cloud services 106 hosts an intelligent noise canceling application 124 to reduce or eliminate the background audio from the aggregated audio data to restore the user command as the primary input, and then process the user command. In the illustrated implementation, the noise canceling application 124 includes a noise identifier 126 to identify background noises in the aggregated audio data received from the assistant 104, a command isolation module 128 to filter out the noises to isolate the user command, and a command processing module 130 to process the user command to generate an appropriate response.
The noise identifier 126 is configured to ascertain content of the background noise contained in the aggregated audio data received from the voice controlled assistant 104. There are many ways for the noise identifier 126 to make this determination. In one implementation, the noise identifier 126 listens to the aggregated audio data and attempts to identify a signature of the background noise. The command response system 120 may maintain a library of sounds that have been previously identified and recorded from the user's home 102 and evaluate the current background noise relative to that collection.
In another implementation, the noise identifier 126 may conduct searches at other resource systems accessible on the Internet. In
In one scenario, the noise identifier 126 may conduct a web search for an audio signature of a background sound by sending a query to the audio source information system 132. The content detection application 136, executing on the servers 134(1)-(T), may analyze the background sound and attempt to identify a match. As one example, when attempting to identify background music, the application 136 may be implemented as a music identification application, such as Shazam™, that identifies the song, track, and/or artist.
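The signature matching described above can be made concrete with a highly simplified sketch: each frame of the background sound is reduced to its dominant frequency bin, and the resulting signature is compared against a small library. This is a minimal illustration under assumed parameters (sample rate, frame size); the function names and matching scheme are invented for the sketch and are not the actual Shazam or system API.

```python
# Simplified, hypothetical sketch of signature-based audio identification.
# Frame size and sample rate are arbitrary assumptions for illustration.
import numpy as np

def spectral_signature(samples, frame=1024):
    """Reduce audio to the dominant frequency bin of each frame."""
    n_frames = len(samples) // frame
    peaks = []
    for i in range(n_frames):
        window = samples[i * frame:(i + 1) * frame]
        spectrum = np.abs(np.fft.rfft(window))
        peaks.append(int(np.argmax(spectrum)))  # index of strongest bin
    return tuple(peaks)

def identify(query, library):
    """Return the library entry whose signature best matches the query."""
    sig = spectral_signature(query)
    best, best_score = None, 0
    for name, ref_sig in library.items():
        score = sum(a == b for a, b in zip(sig, ref_sig))
        if score > best_score:
            best, best_score = name, score
    return best
```

A production fingerprinting system would instead hash constellations of spectral peaks to be robust to noise and time offsets; the frame-by-frame comparison here only conveys the general idea.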
In another scenario, the noise identifier 126 may ascertain which station or program channel is playing on the user's TV 118. The identifier 126 may query the user's media system (if accessible) or analyze the noise and attempt to find programming that matches. The identifier 126 may also access the electronic programming guide (EPG) 138 available online at the audio source information system 132 to find a matching program at the appropriate time slot.
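The EPG time-slot lookup described above amounts to finding which program is airing on a channel at a given time. The sketch below uses a hypothetical schedule format; the channel names, times, and titles are invented for illustration and do not reflect any actual EPG API.

```python
# Hypothetical in-memory EPG: channel -> list of (start, end, title),
# with times expressed as minutes past midnight. Illustrative only.
SCHEDULE = {
    "channel_5": [
        (16 * 60, 18 * 60, "The Sound of Music"),
        (18 * 60, 19 * 60, "Evening News"),
    ],
}

def program_at(channel, minutes_past_midnight):
    """Return the title airing on 'channel' at the given time, if any."""
    for start, end, title in SCHEDULE.get(channel, []):
        if start <= minutes_past_midnight < end:
            return title
    return None  # no matching program in the appropriate time slot
```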
In any one of these scenarios and examples, once the content is identified, that content, or a source feed of the content, is retrieved locally or from a remote site, such as the content store 140 at system 132. More specifically, the identified content may be retrieved from a store or a source of the content (such as a live news feed or streaming programming content). The content matching the background noise is returned to the noise canceling application 124 as represented by packet 141 containing the background audio (BA).
The content is provided to the command isolation module 128 of the noise canceling application 124. The command isolation module 128 implements an adaptive noise cancellation algorithm to eliminate or otherwise reduce that part of the noise from the aggregated audio data received from the voice controlled assistant 104. The adaptive noise cancellation algorithm subtracts the content from the aggregated data to return clearer audio that primarily features the user command. This is represented by the subtraction of the background audio (BA) from the aggregate audio (BA+UC) to return the user command audio (UC).
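The subtraction of BA from (BA+UC) can be illustrated with a least-mean-squares (LMS) adaptive filter, one standard adaptive noise cancellation technique: the retrieved content serves as the reference input, and the filter adapts so that the residual error approximates the user command. This is a minimal sketch, not the system's actual algorithm; the filter length and step size are arbitrary assumptions.

```python
# Minimal LMS adaptive noise canceller: 'reference' is the retrieved
# background content (BA), 'aggregate' is BA + UC. Parameters are
# illustrative, not tuned values from the described system.
import numpy as np

def lms_cancel(aggregate, reference, taps=16, mu=0.01):
    """Filter 'reference' to cancel it from 'aggregate'; the residual
    error approximates the user command (UC)."""
    w = np.zeros(taps)
    residual = np.zeros(len(aggregate))
    for n in range(taps - 1, len(aggregate)):
        x = reference[n - taps + 1:n + 1][::-1]  # most recent sample first
        estimate = w @ x                         # filtered background estimate
        error = aggregate[n] - estimate          # residual ~ user command
        w += 2 * mu * error * x                  # LMS weight update
        residual[n] = error
    return residual
```

Because the user command is uncorrelated with the reference content, the filter converges toward canceling only the background component, leaving the command in the residual.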
The command processing module 130 receives the user command (UC) extracted from the processed audio data by the command isolation module 128, and processes the user command data. The user command data may be in any number of forms. For instance, it may be a simple word or phrase that is matched to a set of pre-defined words and phrases to find a corresponding action or operation to be executed. In other implementations, the user command data may be a conversational dialogue. The command processing module 130 may employ a natural language processing engine to interpret the statements and act on those statements.
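The first processing path above, matching a simple word or phrase against a set of pre-defined commands, might be sketched as follows. The command set and action names are illustrative assumptions, not the system's actual vocabulary.

```python
# Hypothetical pre-defined command table; entries are invented for the sketch.
ACTIONS = {
    "awake": "wake_device",
    "sleep": "enter_standby",
}

def match_command(utterance):
    """Normalize an utterance and look up a pre-defined action.

    Returns None when no pre-defined phrase matches, in which case a
    system like the one described would fall back to natural language
    processing of the conversational dialogue."""
    key = utterance.strip().lower().rstrip(".!?")
    return ACTIONS.get(key)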
The operations associated with the user input may be essentially any activity that can be carried out by a computerized system. For instance, the user may request a search (e.g., “what is playing at the local cinema?”), or engage in online shopping (e.g., “how much are a pair of size 6 leather boots?”), or conduct a financial transaction (e.g., “please move $100 to my checking account”). In the first instance, the command processing module 130 may query a website of a local cinema or a more general entertainment website for a listing of shows and times. In the second scenario, the command processing module 130 may query one or more online retailer sites to identify leather boots and associated prices. In the last scenario, the command processing module 130 may interact with the user's financial institution to transfer funds (e.g., $100) from a savings account to a checking account.
Once an operation is performed, the command processing module 130 formulates a response. The response is formatted as audio data that is returned to the voice controlled assistant 104 over the network 108. This response is represented by a packet 143. When received, the voice controlled assistant 104 audibly plays the response for the user. Using the above examples, the assistant 104 may output statements like, “The Sound of Music is playing today at 4:00 pm and 7:30 pm”; or “A pair of light brown leather boots by Frye is available for $175. Do you want to purchase?”; or “To make this transfer, please tell me your date of birth and the last four digits of your account.”
Illustrative Voice Controlled Assistant
In the illustrated implementation, the voice controlled assistant 104 includes a processor 202 and memory 204. The memory 204 may include computer-readable storage media (“CRSM”), which may be any available physical media accessible by the processor 202 to execute instructions stored on the memory. In one basic implementation, CRSM may include random access memory (“RAM”) and Flash memory. In other implementations, CRSM may include, but is not limited to, read-only memory (“ROM”), electrically erasable programmable read-only memory (“EEPROM”), or any other medium which can be used to store the desired information and which can be accessed by the processor 202.
Several modules, such as instructions, datastores, and so forth, may be stored within the memory 204 and configured to execute on the processor 202. An operating system module 206 is configured to manage hardware and services (e.g., wireless unit, USB, Codec) within and coupled to the assistant 104 for the benefit of other modules. A speech recognition module 208 and an acoustic echo cancellation module 210 provide some basic speech recognition functionality. In some implementations, this functionality may be limited to specific commands that perform fundamental tasks like waking up the device, configuring the device, cancelling an input, and the like. The amount of speech recognition capabilities implemented on the assistant 104 is an implementation detail, but the architecture described herein supports having some speech recognition at the local assistant 104 together with more expansive speech recognition at the cloud services 106. A configuration module 212 may also be provided to assist in an automated initial configuration of the assistant (e.g., find wifi connection, enter key, etc.) to enhance the user's out-of-box experience, as well as reconfigure the device at any time in the future.
The voice controlled assistant 104 includes one or more microphones 214 to receive audio input, such as user voice input, and one or more speakers 216 to output audio sounds. A codec 218 is coupled to the microphone 214 and speaker 216 to encode and/or decode the audio signals. The codec may convert audio data between analog and digital formats. A user may interact with the assistant 104 by speaking to it, and the microphone 214 captures the user speech. The codec 218 encodes the user speech and transfers that audio data to other components. The assistant 104 can communicate back to the user by emitting audible statements through the speaker 216. In this manner, the user interacts with the voice controlled assistant simply through speech, without use of a keyboard or display common to other types of devices.
The voice controlled assistant 104 includes a wireless unit 220 coupled to an antenna 222 to facilitate a wireless connection to a network. The wireless unit 220 may implement one or more of various wireless technologies, such as wifi, Bluetooth, RF, and so on.
A USB port 224 may further be provided as part of the assistant 104 to facilitate a wired connection to a network, or a plug-in network device that communicates with other wireless networks. In addition to the USB port 224, or as an alternative thereto, other forms of wired connections may be employed, such as a broadband connection. A power unit 226 is further provided to distribute power to the various components on the assistant 104.
The voice controlled assistant 104 is designed to support audio interactions with the user, in the form of receiving voice commands (e.g., words, phrases, sentences, etc.) from the user and outputting audible feedback to the user. Accordingly, in the illustrated implementation, there are no haptic input devices, such as navigation buttons, keypads, joysticks, keyboards, touch screens, and the like. Further, there is no display for text or graphical output. In one implementation, the voice controlled assistant 104 may include non-input control mechanisms, such as basic volume control button(s) for increasing/decreasing volume, as well as power and reset buttons. There may also be a simple light element (e.g., LED) to indicate a state such as, for example, when power is on. But, otherwise, the assistant 104 does not use or need to use any input devices or displays.
Accordingly, the assistant 104 may be implemented as an aesthetically appealing device with smooth and rounded surfaces, with some apertures for passage of sound waves, and merely having a power cord and optionally a wired interface (e.g., broadband, USB, etc.). Once plugged in, the device may self-configure automatically, or with slight aid from the user, and be ready to use. As a result, the assistant 104 may generally be produced at a low cost. In other implementations, other I/O components may be added to this basic model, such as specialty buttons, a keypad, display, and the like.
Illustrative Cloud Services
In the illustrated implementation, the noise identifier 126, command isolation module 128, and command processing module 130 are shown as software components or computer-executable instructions stored in the memory 304 and executed by one or more processors 302. The noise identifier 126 receives the aggregated audio data from the voice controlled assistant 104 and identifies the noise included in the audio data that is not attributable to the user. The noise identifier 126 may try to analyze the noise locally, and attempt to identify the content and source. The noise identifier 126 may alternatively query other resources on the web to attempt to identify the content and source associated with the background noise.
In
The content detection module 308 analyzes the audio data received from the voice controlled assistant 104 and attempts to isolate the background noise segment. From this segment, the content detection module 308 extracts a unique signature that uniquely identifies the background content. The signature may then be compared to content signatures associated with content items. These content signatures may be stored locally or remotely. When a relevant content signature is found, the associated content item is identified as part of the background noise.
Once the identity of the noise content is ascertained, the command isolation module 128 retrieves the content for use in canceling the background noise from the aggregated audio data. The command isolation module 128 is shown as including a content retrieval module 310 and a noise cancellation module 312. The content retrieval module 310 retrieves the content identified by the identifier 126 as that present in the background noise in the aggregated audio data. The module 310 may access content stored locally, or query a remote site for the content. Once the content is retrieved, the noise cancellation module 312 uses the content to at least partially remove the same content from the background noise, thereby leaving the user command data. In one implementation, the noise cancellation module 312 syncs the retrieved content with the background noise component and employs an adaptive noise cancellation algorithm that effectively subtracts the identified and retrieved content from the aggregated audio data. The operation removes the background noise and thus isolates the user command.
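One simple way to sync the retrieved content with the background component before subtraction is to estimate the time offset by cross-correlation. The sketch below assumes the background appears in the aggregate at the same amplitude as the retrieved content, which a practical system would not assume (hence the adaptive algorithm described above); all names are illustrative.

```python
# Illustrative sync-and-subtract: align retrieved content to the aggregate
# by cross-correlation, then subtract it. Assumes matching amplitude, which
# is a simplification relative to true adaptive noise cancellation.
import numpy as np

def align_and_subtract(aggregate, content):
    """Estimate the lag of 'content' within 'aggregate' and subtract it."""
    corr = np.correlate(aggregate, content, mode="full")
    lag = int(np.argmax(corr)) - (len(content) - 1)  # best-aligning offset
    shifted = np.zeros(len(aggregate))
    src_start = max(0, -lag)
    dst_start = max(0, lag)
    n = min(len(content) - src_start, len(aggregate) - dst_start)
    shifted[dst_start:dst_start + n] = content[src_start:src_start + n]
    return aggregate - shifted  # residual ~ user command
```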
The command processing module 130 processes the newly isolated user command. This may be done in any number of ways. In the illustrated implementation, the command processing module 130 includes an optional speech recognition engine 314, a command handler 316, and a response encoder 318. The speech recognition engine 314 converts the user command to a text string. In this text form, the user command can be used in search queries, or to reference associated responses, or to direct an operation, or to be processed further using natural language processing techniques, or so forth. In other implementations, the user command may be maintained in audio form, or be interpreted into other data forms.
The user command is passed to a command handler 316 in its raw or a converted form, and the handler 316 performs essentially any operation that might use the user command as an input. As one example, a text form of the user command may be used as a search query to search one or more databases, such as internal information databases 320(1), . . . , 320(D) or external third party data providers 322(1), . . . , 322(E). Alternatively, an audio command may be compared to a command database (e.g., one or more information databases 320(1)-(D)) to determine whether it matches a pre-defined command. If so, the associated action or response may be retrieved. In yet another example, the handler 316 may use a converted text version of the user command as an input to a third party provider (e.g., providers 322(1)-(E)) for conducting an operation, such as a financial transaction, an online commerce transaction, and the like.
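The routing performed by the handler can be sketched as a small dispatch function that decides which back end should service a recognized command. The routing rules and back-end names below are assumptions invented for the sketch, not the system's actual logic.

```python
# Illustrative command routing; keywords and back-end labels are assumed.
def route_command(command_text):
    """Decide which back end should service a recognized user command."""
    text = command_text.lower()
    if any(word in text for word in ("account", "transfer")):
        return "third_party_financial"   # e.g., fund transfers
    if any(word in text for word in ("price", "how much", "buy")):
        return "online_retailer"         # e.g., shopping queries
    return "database_search"             # default: treat command as a query
```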
Any one of these many varied operations may produce a response. When a response is produced, the response encoder 318 encodes the response for transmission back over the network 108 to the voice controlled assistant 104. In some implementations, this may involve converting the response to audio data that can be played at the assistant 104 for audible output through the speaker to the user.
Illustrative Process
For purposes of describing one example implementation, the blocks are arranged visually in
At 402, the voice controlled assistant 104 captures aggregated audio data containing a user command and background noise. The user command may be a single word, phrase, or conversational-style sentence. The background noise may arise from any number of sources. Of particular interest are background noises emanating from content playing devices, such as televisions, radios, stereo systems, DVD players, game consoles, and the like.
At 404, the aggregated audio data 123 captured by the assistant 104 is transmitted over the network 108 to the command response system 120 in the cloud services 106. At 406, the command response system 120 receives the aggregated audio data from the voice controlled assistant 104.
At 408, the command response system 120 identifies content forming at least part of the background noise of the aggregated audio data. There are several ways to identify content. In one approach, the system 120 may employ a content detection module 308 to analyze the audio data, perhaps extracting a unique signature, and attempting to match the noise portions with existing content or signatures. In another approach, the system 120 examines possible sources of background content that the user may be consuming as part of his/her regular habits, such as patterns in viewing TV programming, or listening to favorite music, or playing a particular collection of video games. In still another approach, the system 120 may query other services, such as audio source information system 132 in
At 410, the content identified as forming at least part of the background noise is retrieved. The command response system 120 may store content locally, and simply retrieve that content. Alternatively, the content may be available from another provider, and the system 120 queries that provider for the content.
At 412, the retrieved content is used to at least partially remove the background noise from the aggregated audio data. In one approach, an adaptive noise cancellation algorithm may be applied to subtract the retrieved content from the aggregated audio data, thereby canceling or reducing the background noise. This process leaves the user command in a clearer and more understandable state.
At 414, the newly isolated user command is interpreted. This may be accomplished in many ways, as represented by sub-operations 414(1), . . . , 414(K). As examples of potential approaches to interpret the user command, at 414(1), the user command may be converted from audio to text for processing. A speech recognition engine may be used to make this conversion. Alternatively, at 414(K), the post-cancellation audio data may be analyzed to extract pre-defined command words.
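Sub-operation 414(K) can be sketched in the text domain as scanning the post-cancellation data, once transcribed, for pre-defined command words. The word set below is invented for illustration, and operating on a transcript is a stand-in for the audio-domain analysis the description contemplates.

```python
# Text-domain stand-in for extracting pre-defined command words from
# post-cancellation data; the command vocabulary is an assumption.
COMMAND_WORDS = {"awake", "sleep", "cancel"}

def extract_commands(transcript):
    """Return any pre-defined command words found in the transcript."""
    words = (w.strip(".,!?") for w in transcript.lower().split())
    return [w for w in words if w in COMMAND_WORDS]
```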
With continuing reference to the process 400 in
At 504, the response may be converted into audio data. For instance, a response from a database search may be converted into an audible presentation of the results set. As another example, a user command seeking a price of an e-commerce item may produce a response that, when converted into audio, audibly describes the e-commerce item and associated pricing.
At 506, the response audio data 143 is transmitted back from the command response system 120 to the voice controlled assistant 104. At 508, the response audio data is received from the network at the voice controlled assistant 104.
At 510, the assistant 104 audibly emits the response audio data through the speaker to the user. In this manner, the user is provided with audio feedback from the original user command. Depending on network speeds and the type of operation requested, the time lapse between entry of the user command and output of the response may range on average from near instantaneous to a few seconds.
Although the subject matter has been described in language specific to structural features, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features described. Rather, the specific features are disclosed as illustrative forms of implementing the claims.