Smart speakers are computing devices that are configured to receive voice commands that allow a user to interact with the device. The device may include an integrated virtual assistant for processing voice commands and/or may rely on a remote server to process the voice commands received at the device. Smart speakers can be used to interact with various online services. A user may use a smart speaker to search for and receive information from network-connected service providers. The user may also conduct sensitive transactions with financial service providers, medical information providers, and/or providers of goods or services. Some smart speakers are configured to provide audio content only, while others may include screens that provide an interface for interacting with the smart speaker and may be used to display text, image, and/or video content. Smart speakers may also be configured to include home automation features that allow the user to control lighting, security, heating and ventilation systems, smart appliances, and/or other automated features within a home or business. As smart speakers become a ubiquitous feature in many homes and businesses, these devices have become a target for attackers attempting to exploit them to fraudulently obtain goods or services, conduct fraudulent financial transactions, and/or fraudulently obtain other sensitive information.
An example method for operating a voice-activated computing device according to the disclosure includes receiving audio content comprising a voice command, monitoring electromagnetic (EM) emissions using an EM detector of the voice-activated computing device, determining whether the audio content comprising the voice command was generated electronically or was issued by a human user based on the EM emissions detected while receiving the audio content comprising the voice command, and preventing the voice command from being executed by the voice-activated computing device responsive to determining that the voice command was generated electronically.
Implementations of such a method can include one or more of the following features. Determining whether the audio content comprising the voice command was issued electronically or by a human user includes correlating changes in the audio content comprising the voice command with changes in the EM emissions detected by the EM detector to determine a security indicator, and determining whether the voice command was generated electronically based on the security indicator. The changes in the audio content comprise changes in at least one of the volume and the frequency of the audio content. Calibrating the EM detector to generate baseline EM emissions, and correlating the changes in the audio content comprising the voice command with changes in the EM emissions includes subtracting the baseline EM emissions from the EM emissions before correlating the changes in the audio content with the changes in the EM emissions. Calibrating the EM detector to generate the baseline EM emissions includes detecting EM emissions generated by the voice-activated computing device. Determining that the voice command was generated electronically responsive to the security indicator exceeding a predetermined threshold. Determining whether the audio content comprising the voice command was issued electronically or by a human user includes sending the audio content and information regarding the EM emissions to a remote server for analysis, and receiving, from the remote server, an indication whether the voice command was generated electronically or by a human user. Determining whether the audio content comprising the voice command was issued electronically or by a human user includes receiving an indication from the EM detector that the EM detector has detected abnormal EM variations.
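As an illustrative sketch only, and not part of the claimed subject matter, the correlation-based security indicator and threshold test described above might be implemented as follows, assuming the audio envelope and EM emission magnitudes have already been sampled on a common time base; the function names and the 0.8 threshold are illustrative assumptions, not values specified by the disclosure:

```python
import numpy as np

def security_indicator(audio_envelope, em_magnitude):
    """Correlate frame-by-frame changes in the received audio with
    changes in the detected EM emissions (Pearson correlation of the
    first differences). A value near 1.0 suggests the sound was
    produced by a nearby loudspeaker rather than a human voice."""
    d_audio = np.diff(np.asarray(audio_envelope, dtype=float))
    d_em = np.diff(np.asarray(em_magnitude, dtype=float))
    if d_audio.std() == 0.0 or d_em.std() == 0.0:
        return 0.0  # no variation to correlate
    return float(np.corrcoef(d_audio, d_em)[0, 1])

def should_block_command(audio_envelope, em_magnitude, threshold=0.8):
    """Prevent execution when the security indicator exceeds a
    predetermined threshold (the 0.8 value is illustrative)."""
    return security_indicator(audio_envelope, em_magnitude) > threshold
```

A loudspeaker's EM emissions tend to track its audio output, so differencing both signals before correlating suppresses static offsets (such as ambient EM levels) and isolates the co-varying changes the disclosure describes.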
An example voice-activated computing device according to the disclosure includes means for receiving audio content comprising a voice command using means for receiving sound, means for monitoring electromagnetic (EM) emissions, means for determining whether the audio content comprising the voice command was generated electronically or was issued by a human user based on the EM emissions detected while receiving the audio content comprising the voice command, and means for preventing the voice command from being executed by the voice-activated computing device responsive to determining that the voice command was generated electronically.
Implementations of such a voice-activated computing device can include one or more of the following features. The means for determining whether the audio content comprising the voice command was issued electronically or by a human user include means for correlating changes in the audio content comprising the voice command with changes in the EM emissions detected by the means for detecting EM emissions to determine a security indicator, and means for determining whether the voice command was generated electronically based on the security indicator. The changes in the audio content comprise changes in at least one of the volume, the frequency, the cadence, and the voice pattern of the audio content. Means for calibrating the means for detecting EM emissions to generate baseline EM emissions; and the means for correlating the changes in the audio content comprising the voice command with changes in the EM emissions further include means for subtracting the baseline EM emissions from the EM emissions before correlating the changes in the audio content with the changes in the EM emissions. The means for calibrating the means for detecting EM emissions to generate the baseline EM emissions includes means for detecting EM emissions generated by the voice-activated computing device. Means for determining that the voice command was generated electronically responsive to the security indicator exceeding a predetermined threshold. The means for determining whether the audio content comprising the voice command was issued electronically or by a human user includes means for sending the audio content and information regarding the EM emissions to a remote server for analysis, and means for receiving, from the remote server, an indication whether the voice command was generated electronically or by a human user.
The means for determining whether the audio content comprising the voice command was issued electronically or by a human user further includes means for receiving an indication from the means for monitoring EM emissions that abnormal EM variations have been detected.
An example voice-activated computing device according to the disclosure includes an electromagnetic (EM) detector configured to monitor for EM emissions, a microphone, and a processor communicatively coupled to the EM detector and the microphone. The processor is configured to receive audio content comprising a voice command using the microphone, monitor electromagnetic (EM) emissions using the EM detector, determine whether the audio content comprising the voice command was generated electronically or was issued by a human user based on the EM emissions detected while receiving the audio content comprising the voice command, and prevent the voice command from being executed by the voice-activated computing device responsive to determining that the voice command was generated electronically.
Implementations of such a voice-activated computing device can include one or more of the following features. The processor being configured to determine whether the audio content comprising the voice command was issued electronically or by a human user is further configured to correlate changes in the audio content comprising the voice command with changes in the EM emissions detected by the EM detector to determine a security indicator, and determine whether the voice command was generated electronically based on the security indicator. The changes in the audio content comprise changes in at least one of the volume, the frequency, the cadence, and the voice pattern of the audio content. The processor is further configured to calibrate the EM detector to generate baseline EM emissions; and the processor being configured to correlate the changes in the audio content comprising the voice command with changes in the EM emissions is further configured to subtract the baseline EM emissions from the EM emissions before correlating the changes in the audio content with the changes in the EM emissions. The processor being configured to calibrate the EM detector to generate the baseline EM emissions is further configured to detect, using the EM detector, EM emissions generated by the voice-activated computing device. The processor is further configured to determine that the voice command was generated electronically responsive to the security indicator exceeding a predetermined threshold. The processor being configured to determine whether the audio content comprising the voice command was issued electronically or by a human user is further configured to send the audio content and information regarding the EM emissions to a remote server for analysis, and receive, from the remote server, an indication whether the voice command was generated electronically or by a human user.
The processor being configured to determine whether the audio content comprising the voice command was issued electronically or by a human user is further configured to receive an indication from the EM detector that the EM detector has detected abnormal EM variations.
An example non-transitory, computer-readable medium, having stored thereon computer-readable instructions for operating a voice-activated computing device, according to the disclosure includes instructions configured to cause the voice-activated computing device to receive audio content comprising a voice command, monitor electromagnetic (EM) emissions using an EM detector of the voice-activated computing device, determine whether the audio content comprising the voice command was generated electronically or was issued by a human user based on the EM emissions detected while receiving the audio content comprising the voice command, and prevent the voice command from being executed by the voice-activated computing device responsive to determining that the voice command was generated electronically.
Implementations of such a non-transitory, computer-readable medium can include one or more of the following features. The instructions configured to cause the voice-activated computing device to determine whether the audio content comprising the voice command was issued electronically or by a human user further comprise instructions configured to cause the voice-activated computing device to correlate changes in the audio content comprising the voice command with changes in the EM emissions detected by the EM detector to determine a security indicator, and determine whether the voice command was generated electronically based on the security indicator. The changes in the audio content comprise changes in at least one of the volume, the frequency, the cadence, and the voice pattern of the audio content. Instructions configured to cause the voice-activated computing device to calibrate the EM detector to generate baseline EM emissions; and the instructions configured to cause the voice-activated computing device to correlate the changes in the audio content comprising the voice command with changes in the EM emissions further comprise instructions configured to cause the voice-activated computing device to subtract the baseline EM emissions from the EM emissions before correlating the changes in the audio content with the changes in the EM emissions. Instructions configured to cause the voice-activated computing device to calibrate the EM detector to generate the baseline EM emissions include instructions configured to cause the voice-activated computing device to detect EM emissions generated by the voice-activated computing device. Instructions configured to cause the voice-activated computing device to determine that the voice command was generated electronically responsive to the security indicator exceeding a predetermined threshold.
Like reference symbols in the various drawings indicate like elements, in accordance with certain example implementations.
Techniques for detecting and preventing voice-based attacks on smart speakers and other voice-activated computing devices are provided. The techniques disclosed herein can distinguish between electronically-generated voice commands and voice commands that were issued by a human user. Electronically-generated voice commands can be identified by analyzing electromagnetic (EM) emissions detected while audio content that includes a voice command is received and correlating changes in the EM emissions with changes in the audio content. For electronically-generated voice commands, changes in the volume, frequency, cadence, voice pattern, and/or other aspects of the audio content comprising the voice command that correlate with changes in the EM emissions can be indicative of the voice command having been generated electronically from a loudspeaker of the voice-activated computing device or another device. Human-generated voice commands will not exhibit the EM fluctuations that are indicative of a voice command having been electronically generated.
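The audio attributes that feed such a correlation, for example a volume envelope and a per-frame dominant frequency, might be extracted as in the following sketch; the 16 kHz sample rate and 10 ms frame length are illustrative choices not specified by the disclosure:

```python
import numpy as np

def volume_envelope(audio, frame_len=160):
    """Short-window RMS 'volume' track of an audio signal.
    A frame_len of 160 samples is 10 ms at a 16 kHz rate."""
    n = len(audio) // frame_len
    frames = np.asarray(audio[: n * frame_len], dtype=float).reshape(n, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def dominant_frequency(audio, frame_len=160, sample_rate=16_000):
    """Per-frame dominant frequency via an FFT peak; one of several
    attribute tracks that could be correlated with EM variations."""
    n = len(audio) // frame_len
    frames = np.asarray(audio[: n * frame_len], dtype=float).reshape(n, frame_len)
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    bins = np.argmax(spectra[:, 1:], axis=1) + 1  # skip the DC bin
    return bins * sample_rate / frame_len
```

Either track, once framed to the same rate as the EM samples, can serve as the "changes in the audio content" side of the correlation described above.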
The techniques disclosed herein can be used to detect and prevent voice-based attacks on smart speakers and other voice-activated computing devices. One such attack directly attacks the voice-activated computing device to assume control over a speaker of the voice-activated computing device. The attacker may introduce malicious software onto the voice-activated computing device that is configured to record and play back voice commands that were issued by a user of the voice-activated computing device. The malicious software can be configured to record and play back security passcodes and/or other authentication credentials required to access the content and/or services that the attacker wishes to access. The malicious software can also be configured to implement machine learning techniques to synthesize voice commands that the user of the smart speaker has not vocalized. The malicious software can also be configured to generate hidden voice commands that are inaudible to a human but may be detectable and acted upon by the voice-activated computing device. The malicious software can even be configured to generate garbled sounds that include hidden voice commands.
The techniques disclosed herein can be used to detect such attacks and to prevent voice commands entered as part of such an attack from being executed by the voice-activated computing device. The voice-activated computing device can include an EM detector that is configured to detect EM emissions including those that are generated by the speaker of the voice-activated computing device when the speaker is outputting audio content. The voice-activated computing device can be configured to calibrate the EM detector to generate baseline EM emissions information that can be compared to EM emissions detected while a voice-based command is issued to the voice-activated computing device. The baseline EM emissions include environmental noise that may be generated by other devices proximate to the voice-activated computing device, emissions generated by the built-in speaker of the voice-activated computing device, and/or other EM emissions sources. The voice-activated computing device can be configured to correlate changes in the volume, frequency, cadence, voice pattern, and/or other aspects of the audio content comprising the voice command received by a microphone of the voice-activated computing device with changes in the detected EM emissions to make a determination whether the voice command was issued electronically or by a human user.
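The baseline calibration and subtraction described above can be sketched as follows; representing each EM reading as a vector of per-frequency-bin magnitudes is an assumption for illustration, not a structure defined by the disclosure:

```python
import numpy as np

def calibrate_baseline(quiet_readings):
    """Average EM readings captured while no voice command is being
    received (emissions from the device's own electronics, nearby
    appliances, and other environmental sources)."""
    return np.mean(np.asarray(quiet_readings, dtype=float), axis=0)

def subtract_baseline(em_readings, baseline):
    """Remove the calibrated baseline before correlating the residual
    EM variations with the incoming audio; negative residuals are
    clamped at zero."""
    return np.clip(np.asarray(em_readings, dtype=float) - baseline, 0.0, None)
```

The residual that remains after subtraction is what would be correlated against the audio attributes, so that a noisy EM environment does not inflate the security indicator.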
Another type of threat that the techniques disclosed herein can detect and prevent from issuing commands to the voice-activated computing device are situations where an attacker has compromised the loudspeaker of a device proximate to the voice-activated computing device and uses that loudspeaker to issue commands to the voice-activated computing device. The loudspeaker may be part of a television, computing device, smartphone, and/or other type of device that includes a loudspeaker for outputting audio content. A hacker may introduce malicious code into the device or otherwise induce the device to broadcast voice commands to the voice-activated computing device. The voice command may be embedded in audio or video content that is broadcast, streamed, downloaded, or being played from media on the device. The voice commands may also be inaudible to humans and/or may be embedded in garbled sounds that render the voice commands unintelligible to human listeners. The detection range of the EM detector is determined by the antenna and the amplifier of the EM detector and can be configured to detect electronically-generated voice commands from other devices that are proximate to the voice-activated computing device. The techniques disclosed herein can be used to detect such an attack and to prevent voice commands received as part of such an attack from being executed.
The techniques disclosed herein can monitor electromagnetic variations occurring during voice command inputs and apply correlation analysis to the EM variations and changes in one or more attributes of audio content that may comprise a voice command to determine whether the voice command was issued by a human user or was likely to have been issued electronically. Changes in the volume, the frequency, the cadence, and the voice pattern for voice commands issued electronically through a speaker, such as that illustrated in
Variations in the magnetic field can be monitored at the voice-activated computing device using a built-in magnetometer or other sensor capable of detecting changes in the magnetic field surrounding the voice-activated computing device. The voice-activated computing device can also include an electromagnetic (EM) detector instead of or in addition to the magnetometer or other sensor. The EM detector can be configured to be more sensitive to variations in the magnetic field proximate to the voice-activated computing device and can provide a greater detection range than may otherwise be possible using just the built-in magnetometer or other sensor capable of detecting changes in the magnetic field. The variations in the magnetic field can be correlated with changes in the audio input that includes the voice-command(s) detected by the voice-activated computing device. The voice-activated computing device can be configured to generate a correlation score, where a higher correlation between fluctuations in the magnetic field and changes in the volume, frequency, or other attributes of the voice-command is indicative of the voice command having been electronically issued using a loudspeaker. The example loudspeaker illustrated in
The voice-activated computing device 105 can include an electromagnetic (EM) detector and/or other type of sensor(s) configured to detect EM emissions. The voice-activated computing device can also include a microphone for capturing audio content that may comprise one or more voice commands. Human-generated voice commands will not exhibit the EM fluctuations that are associated with electronically generated audio content.
The operating environment 10 may include a voice command source 160. The voice command source 160 is an electronic device capable of generating audio output that can include voice commands that can be received by the voice-activated computing device 105. The voice command source 160 can comprise a television, computing device, smartphone, and/or other type of device that includes a loudspeaker for outputting audio content. As discussed in the preceding examples, a hacker may introduce malicious code into or otherwise induce the voice command source 160 to broadcast voice commands that may be detected by the voice-activated computing device 105 in an attempt to cause the voice-activated computing device 105 to execute the voice commands. In some operating environments, the voice command source 160 may be a component of the voice-activated computing device 105. A hacker may introduce malicious software into the voice-activated computing device 105 and assume control over the loudspeaker of the voice-activated computing device 105 to output audio content comprising voice commands in an attempt to cause the voice-activated computing device 105 to execute the voice commands.
The operating environment 10 may include one or more wireless access points 150 and/or one or more wireless base stations 140. The one or more wireless access points 150 and/or the one or more wireless base stations 140 are configured to provide wireless network connectivity to the voice-activated computing device 105. The wireless access points 150 and the wireless base stations 140 are configured to provide connectivity via a network 125 (e.g., a cellular wireless network, a Wi-Fi network, a packet-based private or public network, such as the public Internet). The voice-activated computing device 105 may be configured, in some embodiments, to operate and interact with multiple types of other communication systems/devices, including local area network devices (or nodes), such as WLAN for indoor communication, femtocells, Bluetooth® wireless technology-based transceivers, and other types of indoor communication network nodes, wide area wireless network nodes, satellite communication systems, etc., and as such the voice-activated computing device 105 may include one or more interfaces to communicate with the various types of communications systems.
The operating environment 10 may further include a server 110 configured to communicate, via a network 125, or via wireless transceivers included with the server 110, with multiple network elements or nodes, and/or computing devices. For example, the server 110 may be configured to provide content accessible by the voice-activated computing device 105, such as downloadable application content, navigation data, browser-accessible content, and/or access to other types of data. The server 110 can be configured to receive sensor data from the voice-activated computing device 105 associated with audio content that includes a voice command. The server 110 can be configured to analyze the sensor data and the audio content to make a determination whether a voice command was generated electronically or was issued by a human user. This determination can be based on correlating fluctuations in electromagnetic (EM) emissions detected by the voice-activated computing device 105 with changes to one or more attributes in the audio content, such as changes in pitch or frequency. The voice-activated computing device 105 can be configured to analyze the audio and EM emissions data without relying on the server 110 in some implementations.
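A minimal sketch of the client-to-server exchange described above follows; the JSON layout and field names are hypothetical, since the disclosure does not specify a wire format:

```python
import base64
import json

def build_analysis_request(audio_bytes, em_samples, device_id):
    """Package the captured audio and EM emission data for the remote
    server. The field names here are illustrative assumptions."""
    return json.dumps({
        "device_id": device_id,
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "em_samples": list(em_samples),
    })

def parse_analysis_response(body):
    """Extract the server's verdict: True means the command was judged
    to be electronically generated and should not be executed."""
    return bool(json.loads(body)["generated_electronically"])
```

On receiving a True verdict, the device would suppress execution of the command, mirroring the on-device decision path used when no server is involved.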
As shown, the computing device 200 can include a network interface 205 that can be configured to provide wired and/or wireless network connectivity to the computing device 200. The network interface can include one or more local area network transmitters, receivers, and/or transceivers that can be connected to one or more antennas (not shown). The one or more local area network transmitters, receivers, and/or transceivers comprise suitable devices, circuits, hardware, and/or software for communicating with and/or detecting signals to/from one or more of the wireless local area network (WLAN) access points, and/or directly with other wireless computing devices within a network. The network interface 205 can also include, in some implementations, one or more wide area network transmitters, receivers, and/or transceivers that can be connected to the one or more antennas (not shown). The wide area network transmitters, receivers, and/or transceivers can comprise suitable devices, circuits, hardware, and/or software for communicating with and/or detecting signals from one or more of, for example, the wireless wide area network (WWAN) access points and/or directly with other wireless computing devices within a network. The network interface 205 can include a wired network interface in addition to one or more of the wireless network interfaces discussed above. The network interface 205 can be used to receive data from and send data to one or more other network-enabled devices via one or more intervening networks.
The processor(s) (also referred to as a controller) 210 may be connected to the memory 215, the voice command analysis unit 270, the user interface 250, and the network interface 205. The processor may include one or more microprocessors, microcontrollers, and/or digital signal processors that provide processing functions, as well as other calculation and control functionality. The processor 210 may be coupled to storage media (e.g., memory) 215 for storing data and software instructions for executing programmed functionality within the computing device. The memory 215 may be on-board the processor 210 (e.g., within the same integrated circuit package), and/or the memory may be external memory to the processor and functionally coupled over a data bus.
A number of software modules and data tables may reside in memory 215 and may be utilized by the processor 210 in order to manage, create, and/or remove content from the computing device 200 and/or perform device control functionality. Furthermore, components of the high level operating system (“HLOS”) 225 of the computing device 200 may reside in the memory 215. As illustrated in
The application module 220 may be a process or thread running on the processor 210 of the computing device 200, which may request data from one or more other modules (not shown) of the computing device 200. Applications typically run within an upper layer of the software architectures and may be implemented in a rich execution environment of the computing device 200 (also referred to herein as a “user space”), and may include games, shopping applications, content streaming applications, web browsers, location aware service applications, etc. The application module 220 can be configured to comprise one or more applications that can be executed on the computing device 200. The application module 220 can be configured to provide a voice-command interface that allows a user of the computing device 200 to issue commands to control the operation of the one or more applications.
The processor 210 includes a trusted execution environment (TEE) 280. The trusted execution environment 280 can be used to implement a secure processing environment for executing secure software applications. The trusted execution environment 280 can be implemented as a secure area of the processor 210 that can be used to process and store sensitive data in an environment that is segregated from the rich execution environment in which the operating system and/or applications (such as those of the application module 220) may be executed. The trusted execution environment 280 can be configured to execute trusted applications that provide end-to-end security for sensitive data by enforcing confidentiality, integrity, and protection of the sensitive data stored therein. The trusted execution environment 280 can be used to store encryption keys, authentication information, and/or other sensitive data. The trusted applications implemented in the trusted execution environment 280 can be configured to provide a voice-command interface that allows a user of the computing device 200 to control the operation of the one or more applications. The trusted applications may also be used to conduct financial transactions, access sensitive data (e.g., medical or financial data associated with a user of the device or proprietary information associated with a company for which the user of the device works), and/or perform other operations of a sensitive nature. In some implementations, some or all of the functionality associated with the trusted applications may be implemented by untrusted applications operating in a rich execution environment of the computing device 200.
The computing device 200 may further include a user interface 250 providing suitable interface systems for outputting audio and/or visual content, and for facilitating user interaction with the computing device 200. For example, the user interface 250 of a typical smart speaker includes at least a microphone for receiving audio input and a speaker for outputting audio content. The computing device 200 is not limited to a smart speaker and some smart speakers may include user interface components in addition to a microphone and speaker. The computing device 200 may include additional user interface components, such as a keypad and/or a touchscreen for receiving user inputs, and a display (which may be separate from the touchscreen or be the touchscreen) for displaying visual content.
The computing device can include sensor(s) 290. The sensor(s) 290 can include an audio sensor and/or other means for detecting sounds including audio content that includes one or more voice commands. Such sensors may be included in addition to a microphone that is part of the user interface 250. The sensor(s) 290 can also include a magnetometer and can include one or more accelerometers.
The magnetometer can comprise a magnetoresistive permalloy sensor, which is used in some types of smart phones, tablet computing devices, and other types of handheld computing devices. For example, some commonly used magnetometers can be configured to measure magnetic fields within ±2 gauss (i.e., 200 microtesla) and are sensitive to magnetic fields of less than 100 microgauss (i.e., 0.01 microtesla).
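The unit relationships quoted above can be verified with two one-line conversions (1 gauss = 100 microtesla):

```python
def gauss_to_microtesla(gauss):
    """1 tesla = 10^4 gauss, so 1 gauss = 100 microtesla."""
    return gauss * 100.0

def microgauss_to_microtesla(microgauss):
    """1 microgauss = 10^-4 microtesla."""
    return microgauss / 10_000.0
```

These confirm the figures in the text: a ±2 gauss range is ±200 microtesla, and a 100 microgauss sensitivity is 0.01 microtesla.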
The electromagnetic (EM) detector 295 is configured to detect EM emissions. The EM detector can be used to detect variations in EM emissions generated by a speaker of an electronic device that is used to electronically issue a voice command to the computing device 200 according to the techniques disclosed herein. The EM detector 295 can be implemented as a separate chip or module that can be connected to a system on a chip (SoC), chipset, or other processing means of the computing device 200. The EM detector 295 may be disposed on the same printed circuit board as the SoC and/or other processing means. In some implementations, the EM detector 295 may be a standalone component that can be configured to integrate some or all of the functionality of the voice command analysis unit 270, such as that illustrated in
The voice command analysis unit 270 can provide means for performing the various example implementations discussed herein unless otherwise specified, such as the techniques illustrated in
For the sake of simplicity, the various features/components/functions illustrated in the schematic boxes of
As shown, the EM detector 1100 can include a data interface 1105 that can be configured to provide wired and/or wireless network connectivity to a voice-activated computing device, such as those illustrated in
The processor(s) (also referred to as a controller) 1110 may be connected to the memory 1115, the voice command analysis unit 1170, anomalous emissions detection unit 1195, the user interface 1150, and the data interface 1105. The processor 1110 may include one or more microprocessors, microcontrollers, and/or digital signal processors that provide processing functions, as well as other calculation and control functionality. The processor 1110 can be coupled to storage media (e.g., memory) 1115 for storing data and software instructions for executing programmed functionality within the computing device, such as operating system software 1125 for operating the EM detector 1100 and EM data 1120 generated by the EM detector 1100. The memory 1115 may be on-board the processor 1110 (e.g., within the same integrated circuit package), and/or the memory may be external memory to the processor and functionally coupled over a data bus.
A number of software modules and data tables may reside in memory 1115 and may be utilized by the processor 1110 in order to manage, create, and/or remove content from the EM detector 1100 and/or perform device control functionality. The memory 1115 may include one or more application modules (not shown) that may be executed by the processor 1110. It is to be noted that the functionality of the modules and/or data structures may be combined, separated, and/or be structured in different ways depending upon the implementation of the EM detector 1100.
The processor 1110 can include a trusted execution environment (TEE) 1180. The trusted execution environment 1180 can be used to implement a secure processing environment for executing secure software applications. The trusted execution environment 1180 can be implemented as a secure area of the processor 1110 that can be used to process and store sensitive data in an environment that is segregated from the rich execution environment in which the operating system and/or applications may be executed. The trusted execution environment 1180 can be configured to execute trusted applications that provide end-to-end security for sensitive data by enforcing confidentiality, integrity, and protection of the sensitive data stored therein. The trusted applications implemented in the trusted execution environment 1180 can be configured to provide means for collecting and analyzing EM emissions data, receiving audio data, and/or for correlating changes to attributes in the audio data to variations in the EM emissions data collected by the EM detector.
The EM detector 1100 can be configured to include a user interface 1150 that enables the user to configure the EM detector 1100. The user interface can comprise a voice interface and/or a graphical user interface. In some implementations, the voice-activated computing device may not include a display capable of displaying a graphical user interface but can be configured to allow a user to connect to the voice-activated computing device via an application on a smartphone, tablet, or other computing device that includes a display capable of displaying such a graphical user interface.
The EM detector 1100 can include sensor(s) 1190. The sensor(s) 1190 can include but are not limited to magnetometer(s), antenna(s), and/or other means for detecting variations in the EM emissions proximate to the EM detector 1100. The EM detector 1100 can also include microphone(s) and/or other audio receiving means for receiving audio content that may include voice commands issued to the voice-activated computing device.
The voice command analysis unit 1170 can provide means for performing the various example implementations discussed herein unless otherwise specified, such as the techniques illustrated in
The EM detector 1100 can include an anomalous emissions detection unit 1195. The EM detector 1100 can include the anomalous emissions detection unit 1195 in addition to or instead of the voice command analysis unit 1170. The anomalous emissions detection unit 1195 can be configured to implement a band filter that is configured to identify abnormal EM variations that fall outside of a predetermined frequency band. The anomalous emissions detection unit 1195 can be configured to output a signal to the voice command analysis unit 1170 and/or to the voice command analysis unit 270 of the voice-activated computing device indicating that abnormal EM variations have been detected and the voice command analysis unit 1170 and/or the voice command analysis unit 270 can be configured to make a determination that the audio content and any voice commands included therein having been generated electronically rather than having been issued by a human user of the voice-activated computing device.
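For illustration only, the band-filter behavior of the anomalous emissions detection unit described above might be sketched as follows. This is a hypothetical Python sketch, not the detector's actual firmware; the band limits, function names, and sample frequencies are assumptions introduced for the example:

```python
# Hypothetical sketch of the anomalous-emissions band filter described above.
# The band limits and sample data are illustrative assumptions, not values
# taken from any actual EM detector implementation.

EXPECTED_BAND_HZ = (50.0, 20_000.0)  # assumed "normal" EM variation band

def abnormal_em_detected(variation_freqs_hz, band=EXPECTED_BAND_HZ):
    """Return True if any detected EM variation falls outside the band,
    in which case a signal would be output to the voice command
    analysis unit."""
    low, high = band
    return any(f < low or f > high for f in variation_freqs_hz)

# A 35 kHz variation falls outside the assumed band, so the detector
# would signal that abnormal EM variations have been detected.
print(abnormal_em_detected([60.0, 35_000.0]))   # True
print(abnormal_em_detected([60.0, 1_000.0]))    # False
```

In such a sketch, the signal output to the voice command analysis unit 1170 and/or 270 would simply be the boolean result of the band check.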
Audio content comprising a voice command can be received at the voice-activated computing device (stage 305). The process illustrated in
In some implementations, the voice command analysis unit 270 and/or the EM detector 295 can be configured to monitor ambient noise proximate to the voice-activated computing device 105 in order to determine baseline audio data for the operating environment in which the voice-activated computing device 105 is disposed. The EM detector 295 can also be configured to monitor ambient EM activity for the operating environment. The baseline audio content and baseline EM data for the operating environment can be taken into account by the voice command analysis unit 270 and/or the EM detector 295 when determining whether a voice command was issued electronically or by a human user.
The voice command analysis unit 270 can be configured to buffer audio input into a protected memory location that is substantially inaccessible to untrusted processes running on the voice-activated computing device. The voice command analysis unit 270 can be configured to buffer the audio input in a memory associated with the trusted execution environment 280 which is inaccessible to applications and processes running in the rich execution environment of the voice-activated computing device. The EM detector 295 can include a memory for buffering audio content received via a microphone or other audio sensors incorporated into the EM detector 295. The voice command analysis unit 270 and/or the EM detector 295 can be configured to detect the end of a voice command input by monitoring the frequency, amplitude, and/or other attributes of the audio signals received and detecting changes in the frequency, amplitude, and/or other attributes indicative that the speech input has been completed. The voice command analysis unit 270 can be configured to send the audio content to a server, such as the server 110, for analysis and identification of the voice command(s) (if any) included in the audio input. The voice command analysis unit 270 can also be configured to stream the audio content to the server 110 as it is received in order to provide as fast a response time as possible to the voice command(s) included in the audio content captured by the voice-activated computing device. The EM detector 295 need not process the audio content to identify the voice command(s) included therein. Identification of the voice command(s) is not required in order to make a determination whether the audio content that includes the voice command(s) has been issued electronically rather than by a human user.
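The end-of-command detection described above, in which the device monitors audio attributes for changes indicating that speech input has completed, can be illustrated with a short hypothetical Python sketch. The silence threshold, frame count, and sample amplitudes are assumptions made for the example, not values from the disclosure:

```python
# Hypothetical sketch of end-of-command detection by monitoring the
# amplitude of received audio frames, as described above. The threshold
# and hangover count are illustrative assumptions.

SILENCE_THRESHOLD = 0.05   # assumed normalized amplitude floor
SILENCE_FRAMES = 3         # assumed consecutive quiet frames ending input

def end_of_command_index(frame_amplitudes):
    """Return the frame index where the speech input appears to end,
    or None if the input is still ongoing."""
    quiet = 0
    for i, amp in enumerate(frame_amplitudes):
        quiet = quiet + 1 if amp < SILENCE_THRESHOLD else 0
        if quiet >= SILENCE_FRAMES:
            # Input ended at the first of the quiet frames.
            return i - SILENCE_FRAMES + 1
    return None

# Speech frames followed by sustained silence end the command at frame 3:
print(end_of_command_index([0.4, 0.6, 0.5, 0.02, 0.01, 0.01]))  # 3
```

A production implementation would of course also consider frequency content and other attributes, as the passage above notes.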
Electromagnetic (EM) emissions can be monitored using the EM detector of the voice-activated computing device (stage 310). The EM detector 295 can begin monitoring EM emissions to identify variations in the EM emissions; such variations may be indicative of a voice command being generated electronically. The magnetometer of the voice-activated computing device can be configured to monitor for EM emissions in some implementations instead of or in addition to the EM detector 295. The monitoring for EM emissions can be triggered by the voice command analysis unit 270 and/or the EM detector 295 detecting the wake-up word or phrase used to trigger the voice-activated computing device to enter into a mode where the voice-activated computing device is operable to receive voice commands. The EM detector 295 can be configured to output EM emissions data that is received by the voice command analysis unit 270. The EM detector 295 can be configured to buffer the EM emissions data in a memory local to the EM detector 295 and/or to perform other analysis on the EM emission data. The voice command analysis unit 270 can be configured to send a signal to the EM detector 295 to provide the EM emissions data to the voice command analysis unit 270 responsive to the voice command analysis unit 270 determining that the voice command entry has been completed.
A determination can be made whether the audio content comprising the voice command was generated electronically or was issued by a human user based on the EM emissions detected while receiving the audio content comprising the voice command (stage 315). The voice command analysis unit 270 can be configured to correlate changes in pitch, frequency, and/or other attributes of the audio content received in stage 305 with variations in the magnetic field detected by the EM detector 295. Where variations in the magnetic field detected by the EM detector 295 are cotemporal with changes in the pitch, frequency, and/or other attributes of the audio content, such occurrences are indicative of the audio content and any voice commands contained therein having been electronically generated rather than having been issued by a human user. Accordingly, the voice command analysis unit 270 can be configured to generate a correlation score (also referred to herein as a “security indicator”) as a result of the correlation of the audio content and the EM emissions data. In some implementations, the voice command analysis unit 270 can be configured to calculate the security indicator using a correlation coefficient function that correlates changes to one or more attributes in the audio data with variations in the EM emissions in the EM data. The correlation coefficient function can be configured to determine a relationship between data points representing the same point in time from the audio data and the EM data to determine whether the changes in the audio data are correlated to the variations in the EM data. The correlation coefficient formula can be configured to return a value that is indicative of whether there is a correlation between changes in the audio content and the EM variations.
In one example implementation, the correlation function is configured to return a value ranging from ‘1’ (one) to ‘−1’ (negative one), where a value of ‘1’ represents a strong positive correlation, a value of ‘0’ (zero) represents no correlation at all, and a value of ‘−1’ represents a strong negative correlation. A correlation value of ‘1’ indicates that for every positive increase in one variable (e.g., an increase in pitch, frequency, etc. of the audio content) there is a positive increase of a fixed proportion in the other variable (e.g., a corresponding increase in the magnetic field). A correlation value of ‘−1’ indicates that for every positive increase in one variable (e.g., an increase in pitch, frequency, etc. of the audio content) there is a negative increase of a fixed proportion in the other variable (e.g., a corresponding decrease in the magnetic field). A value of zero indicates that increases in one variable are not accompanied by proportional positive or negative changes in the other variable, which indicates that the two variables are not related.
The voice command analysis unit 270 can be configured to determine the absolute value of the value output by the correlation function in the preceding example. The voice command analysis unit 270 can be configured to compare the absolute value to a predetermined threshold value and to make a determination that the changes in the audio content are correlated to the EM variations responsive to the absolute value of the output of the correlation function exceeding the predetermined threshold value. A determination that the changes in the audio content are correlated to the EM variations is indicative of the audio content and any voice commands included therein having been generated electronically rather than having been issued by a human user of the voice-activated computing device.
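The correlation-and-threshold check described in the preceding example can be sketched in hypothetical Python. A plain Pearson correlation coefficient is used here because it matches the ‘1’ to ‘−1’ range described above, but the disclosure notes that other correlation techniques may be used; the sample attribute values and the threshold are illustrative assumptions:

```python
# Hypothetical sketch of the correlation-based check described above.
# The attribute samples and threshold are illustrative assumptions; a
# Pearson correlation coefficient is one function that returns a value
# in the described range of 1 to -1.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series
    of time-aligned data points (audio attribute vs. EM field)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

THRESHOLD = 0.8  # assumed predetermined threshold value

def generated_electronically(audio_attr, em_field):
    """An absolute correlation value above the threshold is treated as
    indicating an electronically generated command."""
    return abs(pearson(audio_attr, em_field)) > THRESHOLD

# Audio pitch changes that track the magnetic field suggest a loudspeaker:
print(generated_electronically([1.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4]))  # True
```

Taking the absolute value, as in the passage above, means that strong negative correlation is treated the same as strong positive correlation.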
The example correlation techniques and the specific values discussed herein are merely examples that are used to illustrate these concepts. The voice command analysis unit 270 can be configured to utilize other correlation techniques to determine whether a correlation exists between the changes in the audio content and the EM variations. Furthermore, in some implementations, the EM detector 295 can be configured to perform the correlation function discussed above in addition to or instead of the voice command analysis unit 270.
The EM detector 295 can also be configured to detect abnormal EM variations which may be indicative of the audio content comprising the voice command having been generated electronically using a loudspeaker, such as the example loudspeakers illustrated in
The voice command can be prevented from being executed by the voice-activated computing device responsive to determining that the voice command was generated electronically (stage 320). In response to determining that the voice command was issued electronically and not by a human user, the voice command analysis unit 270 can be configured to prevent the voice command from being executed by the voice-activated computing device. The voice command analysis unit 270 can be configured to perform one or more additional actions in response to determining that the voice command was issued electronically. The voice command analysis unit 270 can be configured to temporarily disable voice command input on the device. The voice command analysis unit 270 can be configured to require an authorization code, personal identification number (PIN), password, or pass phrase from an authorized user of the voice-activated computing device before the voice command analysis unit 270 will enable the voice command functionality of the device. The voice command analysis unit 270 can be configured to power down the device in response to determining that a voice-based attack is underway. The voice command analysis unit 270 can also be configured to initiate a scan on the voice-activated computing device for malicious software, to restore the voice-activated computing device to a previously known state, and/or to reinstall software and/or applications on the voice-activated computing device. The voice command analysis unit 270 can also be configured to run diagnostics on the voice-activated computing device to determine whether any other components of the device have been compromised or damaged. The voice command analysis unit 270 can also be configured to notify the server 110 and/or another trusted third party entity that the voice-activated computing device has been subjected to a voice-based attack and may be compromised.
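The protective actions enumerated above might be organized as a simple response policy. The following hypothetical Python sketch is purely illustrative; the policy names and action identifiers are assumptions introduced for the example, not part of the disclosure:

```python
# Hypothetical sketch of selecting protective actions after a voice
# command is determined to be electronically generated. The policy and
# action names are illustrative assumptions.

def respond_to_electronic_command(policy="lock"):
    """Return an ordered list of protective actions for a given policy."""
    actions = ["block_command"]  # the command is always prevented from executing
    if policy == "lock":
        # Disable voice input until an authorized user supplies a PIN.
        actions += ["disable_voice_input", "require_pin"]
    elif policy == "shutdown":
        # Notify the server of a possible attack, then power down.
        actions += ["notify_server", "power_down"]
    return actions

print(respond_to_electronic_command())
print(respond_to_electronic_command("shutdown"))
```

Whatever policy is chosen, blocking execution of the suspect command is the common first step described in stage 320.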
The trusted third party entity may be a service provider associated with the voice-activated computing device that analyzes voice command inputs received at the voice-activated computing device, and/or provides other services to the voice-activated computing device, such as providing data, conducting transactions on behalf of the user of the voice-activated computing device, and/or other such services. The trusted third party entity may be a home automation service provider that provides services related to controlling and monitoring of Internet-connected appliances and systems of the user's home or business. The voice command analysis unit 270 can also be configured to notify a user of the voice-activated computing device that the device may have been subjected to a voice-based attack so that the user may take measures to address the attack, such as determining whether any unauthorized activity has been conducted through the voice-activated computing device.
Changes in the audio content comprising the voice command can be correlated with changes in the EM emissions detected by the EM detector to determine a security indicator (stage 405). As discussed above, the voice command analysis unit 270 and/or the EM detector 295 can be configured to perform one or more correlation functions on the audio data and the EM data to determine whether changes to one or more attributes of the audio content correlate to EM variations included in the EM data collected by the EM detector 295 while the audio content was being collected.
A determination can be made whether the voice command was generated electronically based on the security indicator (stage 410). As discussed above, the output of the correlation function(s) can be compared to a predetermined threshold value. If the output of the correlation function(s) exceeds the predetermined threshold value, then the voice command analysis unit 270 can make a determination that the voice command has been generated electronically due to the correlations between the changes in the attributes of the audio content and the variations in the EM data. Such a correlation indicates that the voice command was generated using a loudspeaker which caused the EM variations as the audio content comprising the voice command was played by the loudspeaker. As such, the voice command analysis unit 270 can be configured to take one or more actions to prevent a voice-based attack on the voice-activated computing device.
The EM detector can be calibrated to generate baseline EM emissions information (stage 505). The EM detector 295 can be used to capture baseline EM emissions information for the operating environment in which the voice-activated computing device is configured to operate. The baseline EM emissions information can include EM emissions from the voice-activated computing device itself and other sources of EM emissions in the operating environment of the voice-activated computing device. The speaker and/or other electronic components of the voice-activated computing device can generate a magnetic field that is detectable by the EM detector 295. Other electronic devices proximate to the EM detector may also be generating electronic emissions that are not indicative of a voice command being generated electronically. The EM detector 295 can be configured to capture a baseline EM emission reading and/or can be configured to capture baseline EM emissions patterns over time for the operating environment in which the voice-activated computing device is located.
The baseline EM emissions can be subtracted from the EM emissions before correlating the changes in the audio content with the changes in the EM emissions (stage 510). The EM emissions detected by the EM detector 295 while monitoring for EM emissions associated with a voice command, such as in stage 310 of the process of
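The calibration and subtraction of stages 505 and 510 can be illustrated with a short hypothetical Python sketch; the sample field readings are assumptions made for the example:

```python
# Hypothetical sketch of baseline EM calibration (stage 505) and baseline
# subtraction (stage 510). The sample readings are illustrative assumptions.

def calibrate_baseline(readings):
    """Average ambient EM readings captured during calibration to form
    the baseline for the operating environment."""
    return sum(readings) / len(readings)

def remove_baseline(em_samples, baseline):
    """Subtract the baseline so that only EM variations beyond the
    ambient environment remain for the correlation step."""
    return [s - baseline for s in em_samples]

# Ambient readings captured while no voice command is being received:
baseline = calibrate_baseline([0.10, 0.12, 0.11, 0.11])
# Readings captured while receiving audio content; after subtraction,
# only the excess variation remains to be correlated with the audio.
cleaned = remove_baseline([0.11, 0.31, 0.51], baseline)
print(cleaned)
```

A single averaged reading is the simplest form of baseline; as noted above, baseline patterns over time could also be captured and subtracted in the same fashion.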
Electromagnetic emissions generated by the voice-activated computing device can be detected (stage 605). Components of the voice-activated computing device may generate EM emissions that can be detected by the EM detector 295 and/or the magnetometer of the voice-activated computing device. The EM emissions may interfere with the ability of the EM detector 295 and/or the voice command analysis unit 270 to make a determination whether audio content that includes a voice command was generated electronically or was issued by a human user. To counter this problem, a configuration process can be executed by the EM detector and/or the voice command analysis unit 270 in which the EM detector 295 and/or the magnetometer of the voice-activated computing device are configured to monitor EM emissions for a predetermined period of time to generate baseline EM emission data for the voice-activated computing device. EM emissions from other electronic devices proximate to the voice-activated computing device may also be captured, depending upon the sensitivity of the device collecting the baseline data (e.g., the EM detector 295 or the magnetometer). The baseline EM data can be used to establish what the EM emissions in the operating environment of the voice-activated computing device are expected to be. The configuration process to establish the EM emissions can be performed periodically by the voice-activated computing device. The configuration process can be performed each time that the voice-activated computing device is powered up and/or rebooted.
A determination that the voice command was generated electronically can be made responsive to the security indicator exceeding a predetermined threshold (stage 705). As discussed above, the voice command analysis unit 270 or the EM detector 295 can be configured to generate a correlation score or security indicator resulting from the correlation of the audio content and the EM emissions data. The security indicator can comprise a range of values that are indicative of the relationship between changes in attributes of the audio data comprising the voice command and EM variations included in the EM data. The voice command analysis unit 270 can be configured to make a determination that the voice command was issued electronically responsive to the security indicator exceeding the predetermined threshold value.
The threshold value may be determined by a manufacturer or reseller of the voice-activated computing device. The threshold value may also be determined by a service provider associated with the voice-activated computing device that provides voice-command analysis and identification on audio content captured by the voice-activated computing device. The threshold value may also be determined by a user of the voice-activated computing device. The voice-activated computing device can be configured to provide a user interface that enables the user to configure security settings of the voice-activated computing device. The user interface can comprise a voice interface and/or a graphical user interface. In some implementations, the voice-activated computing device may not include a display capable of displaying a graphical user interface but can be configured to allow a user to connect to the voice-activated computing device via an application on a smartphone, tablet, or other computing device that includes a display capable of displaying such a graphical user interface.
The audio content and information regarding the EM emissions can be sent to a remote server for analysis (stage 805). In some implementations, the audio content and the EM emissions data can be sent to a remote server for analysis, such as the server 110. The server 110 can be associated with a service provider that provides various services to the voice-activated computing device, such as voice-command analysis and identification on audio content captured by the voice-activated computing device that includes analyzing EM emissions data associated with the audio content captured by the voice-activated computing device. The service provider may also provide security services for the voice-activated computing device. The server 110 can be configured to receive the audio content and the EM emissions data collected by the voice command analysis unit 270, the EM detector 295, and/or the magnetometer of voice-activated computing device and to perform correlation analysis on the received data similar to that discussed above in the preceding example implementations in which the voice command analysis unit 270 and/or the EM detector 295 performed correlation analysis on the audio and EM emission data. In some implementations, the voice command analysis unit 270 and/or the EM detector 295 can be configured to perform correlation analysis on the audio and EM emission data and the audio and EM emission data can be sent to the server 110 as well for analysis.
An indication can be received from the server whether the voice command was generated electronically or by a human user from the remote server (stage 810). The server 110 can be configured to generate a security indicator similar to that generated by the voice command analysis unit 270 and/or the EM detector 295 in the preceding examples. In some implementations, the security indicator may be a binary value indicating that the voice command was issued electronically or was not issued electronically. In other implementations, the security indicator can comprise a range of values that indicative of the relationship between changes in attributes of the audio data comprising the voice command and EM variations included in the EM data, and the voice command analysis unit 270 can be configured to compare the security indicator to a predetermined threshold to determine whether the voice command was issued electronically or was issued by a human user. In some implementations, the server 110 and the voice command analysis unit 270 and/or the EM detector 295 can be configured to determine a security indicator. In such implementations, the voice command analysis unit 270 may be configured to determine an average of each of these security indicators, and to determine whether the voice command was issued electronically or issued by a human user based on whether the average of the security indicators exceeds the predetermined threshold.
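The averaging of local and server-side security indicators described above can be sketched in hypothetical Python; the indicator values and threshold are assumptions made for the example:

```python
# Hypothetical sketch of combining local and server security indicators
# by averaging, as described above. The indicator values and threshold
# are illustrative assumptions.

THRESHOLD = 0.8  # assumed predetermined threshold value

def command_is_electronic(local_indicator, server_indicator,
                          threshold=THRESHOLD):
    """Average the locally computed and server-computed security
    indicators and compare the average against the threshold."""
    return (local_indicator + server_indicator) / 2 > threshold

# Both indicators high: treated as electronically generated.
print(command_is_electronic(0.9, 0.85))  # True
# Server disagrees strongly: the average falls below the threshold.
print(command_is_electronic(0.9, 0.3))   # False
```

Averaging is only one way of combining multiple indicators; a weighted combination would follow the same pattern.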
An indication can be received from the EM detector that the EM detector has detected abnormal EM variations (stage 905). As discussed above, the EM detector 295 can be configured to include a filter that is configured to generate a signal responsive to EM variations being outside of an expected range. The EM detector 295 can output a signal to the voice command analysis unit 270 that abnormal EM variations have been detected, which may be indicative of a voice command being issued electronically by another device proximate to the voice-activated computing device.
If implemented in part by hardware or firmware along with software, the functions can be stored as one or more instructions or code on a computer-readable medium. Examples include computer-readable media encoded with a data structure and computer-readable media encoded with a computer program. Computer-readable media includes physical computer storage media. A storage medium can be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, semiconductor storage, or other storage devices, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer; disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly or conventionally understood. As used herein, the articles “a” and “an” refer to one or to more than one (i.e., to at least one) of the grammatical object of the article. By way of example, “an element” means one element or more than one element. “About” and/or “approximately” as used herein when referring to a measurable value such as an amount, a temporal duration, and the like, encompasses variations of ±20% or ±10%, ±5%, or ±0.1% from the specified value, as such variations are appropriate in the context of the systems, devices, circuits, methods, and other implementations described herein. “Substantially” as used herein when referring to a measurable value such as an amount, a temporal duration, a physical attribute (such as frequency), and the like, also encompasses variations of ±20% or ±10%, ±5%, or ±0.1% from the specified value, as such variations are appropriate in the context of the systems, devices, circuits, methods, and other implementations described herein.
As used herein, including in the claims, “or” as used in a list of items prefaced by “at least one of” or “one or more of” indicates a disjunctive list such that, for example, a list of “at least one of A, B, or C” means A or B or C or AB or AC or BC or ABC (i.e., A and B and C), or combinations with more than one feature (e.g., AA, AAB, ABBC, etc.). Also, as used herein, unless otherwise stated, a statement that a function or operation is “based on” an item or condition means that the function or operation is based on the stated item or condition and can be based on one or more items and/or conditions in addition to the stated item or condition.