This disclosure generally relates to acoustic devices that include microphone arrays for capturing acoustic signals.
An array of microphones can be used for capturing acoustic signals along a particular direction.
In one aspect, this document features a computer-implemented method that includes receiving information representing audio captured by a microphone array, wherein the information includes multiple datasets each representing audio signals captured in accordance with a sensitivity pattern along a corresponding direction with respect to the microphone array. The method also includes computing, using one or more processing devices, for each of the multiple datasets, one or more quantities indicative of human voice activity captured from the corresponding direction, and generating, based at least on the one or more quantities computed for a plurality of the multiple datasets, a directional audio signal representing audio captured from a particular direction.
In another aspect, this document features an apparatus that includes a microphone array, one or more acoustic transducers configured to generate audio signals, and an audio processing engine that includes memory and one or more processing devices. The audio processing engine is configured to receive information representing the audio captured by the microphone array, wherein the information includes multiple datasets each representing audio signals captured in accordance with a sensitivity pattern along a corresponding direction with respect to the microphone array. The audio processing engine is also configured to compute, for each of the multiple datasets, one or more quantities indicative of human voice activity captured from the corresponding direction, and generate, based at least on the one or more quantities computed for a plurality of the multiple datasets, a directional audio signal representing audio captured from a particular direction.
In another aspect, this document features one or more machine-readable storage devices having encoded thereon computer readable instructions for causing one or more processing devices to perform various operations. The operations include receiving information representing audio captured by a microphone array, wherein the information includes multiple datasets each representing audio signals captured in accordance with a sensitivity pattern along a corresponding direction with respect to the microphone array. The operations also include computing, for each of the multiple datasets, one or more quantities indicative of human voice activity captured from the corresponding direction, and generating, based at least on the one or more quantities computed for a plurality of the multiple datasets, a directional audio signal representing audio captured from a particular direction.
Implementations of the above aspects can include one or more of the following features. The information representing the audio captured by the microphone array can be received from a beamformer configured to process signals captured using the microphone array. Each of the multiple datasets can correspond to a beam generated using the beamformer. The beamformer can be one of: a fixed beamformer or a dynamic beamformer. The one or more quantities indicative of human voice activity can include a likelihood score of human voice activity in the audio signal represented in the dataset for the corresponding direction. The one or more quantities indicative of human voice activity can include a signal-to-noise ratio (SNR). The SNR can be computed as a ratio of a first quantity representing a voice signal and a second quantity representing non-voice signals. The one or more quantities indicative of human voice activity can represent a likelihood score of the presence of a keyword in the audio signal represented in the dataset for the corresponding direction. Generating the directional audio signal can include selecting one of the multiple datasets. Generating the directional audio signal can include causing a dynamic beamformer to capture audio in accordance with a sensitivity pattern generated for the particular direction.
Various implementations described herein may provide one or more of the following advantages. By steering a beamformer based on a direction of voice activity rather than a direction of the most dominant acoustic source, voice input may be accurately captured even in the presence of noise sources generating significant acoustic energy. In some cases, this may improve performance of a voice-activated device in the presence of dominant non-voice noise sources such as an air-conditioner. In some cases, the direction of relevant voice activity may also be determined via detecting the occurrence of a spoken keyword. This in turn may improve the performance of voice-activated devices in the presence of voice signals from multiple speakers.
Two or more of the features described in this disclosure, including those described in this summary section, may be combined to form implementations not specifically described herein.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
This document describes technology for controlling directional capture of audio based on voice activity detection. Various voice-activated devices that can be controlled using spoken commands are currently available. Examples of such devices that are commercially available include Echo® and FIRE TV® manufactured by Amazon Inc. of Seattle, Wash., various iOS® enabled devices manufactured by Apple Inc. of Cupertino, Calif., and Google Home® and other Android® powered devices manufactured by Google Inc. of Mountain View, Calif. Voice activated devices can include an array (e.g., a linear array, a circular array, etc.) of microphones that are used for directional capture of spoken inputs. For example, the signals captured by the microphone array on a device can be processed to emphasize signals captured from a particular direction and/or deemphasize signals from one or more other directions. Such a process is referred to as beamforming, and the directional sensitivity pattern resulting from such a process may be referred to as a beam. A device executing the beamforming process may be referred to as a beamformer. Selection of a sensitivity pattern or beam along a particular direction may be referred to as beam steering.
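By way of illustration only, a minimal delay-and-sum beamformer conveys the idea behind emphasizing audio from one direction: signals from the individual microphones are time-aligned for a chosen look direction and then averaged, so that sound from that direction adds coherently while sound from elsewhere partially cancels. The following Python sketch assumes a far-field source and a uniform linear array; the function name, array geometry, and parameter values are illustrative assumptions and not part of any described embodiment.

```python
import numpy as np

def delay_and_sum(mic_signals, mic_positions, steer_angle_deg, fs, c=343.0):
    """Emphasize audio arriving from steer_angle_deg (uniform linear array).

    mic_signals: (M, T) array of time-domain samples, one row per microphone.
    mic_positions: (M,) microphone x-coordinates in meters.
    """
    angle = np.deg2rad(steer_angle_deg)
    # Relative arrival times for a far-field source at the steering angle.
    delays = mic_positions * np.cos(angle) / c           # seconds, per mic
    T = mic_signals.shape[1]
    freqs = np.fft.rfftfreq(T, d=1.0 / fs)
    spectra = np.fft.rfft(mic_signals, axis=1)           # (M, T//2 + 1)
    # Phase-align every microphone to the look direction, then average:
    # signals from that direction add coherently, others partially cancel.
    aligned = spectra * np.exp(2j * np.pi * freqs * delays[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=T)
```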
In some cases, a beamformer may steer a beam in the direction of the dominant source of acoustic energy. In low-noise environments, where a human speaker is the dominant source of acoustic energy, the beamformer may accurately steer the beam towards the speaker. However, in some cases, where the dominant source of acoustic energy is a noise source, the beamformer may steer the beam towards that source, and as a result deemphasize the voice input from a human speaker. For example, if the microphone array is disposed near a loud sound source (e.g., an air conditioner, a humidifier, a dehumidifier, etc.), the beamformer may steer the beam towards that sound source. In such a case, a voice input coming from another direction may be inadvertently deemphasized. In some situations, when multiple speakers are present in an environment (e.g., a room where multiple people are speaking with one another), the dominant source of acoustic energy may be a person who is not providing a voice input that the microphone array needs to capture. Rather, the voice input may come from a direction that is different from the direction of the dominant source of acoustic energy. In the situations mentioned above, if the beam is steered based on the direction of the dominant noise source, a spoken input coming from another direction may be missed, which in turn may adversely affect the performance of a corresponding voice-activated device.
The technology described herein allows for controlling the direction of audio capture by a microphone array based on voice activity detection (VAD), which may include keyword spotting (KWS). For example, beam steering or otherwise controlling directional audio capture may be implemented based on preliminary outputs indicating the likelihood of presence of voice activity, or a particular keyword, in audio captured from a particular direction. These preliminary outputs may be referred to as soft-VAD outputs (for voice activity detection) or soft-KWS outputs (for keyword spotting), and may be used to determine a direction from which the captured audio is emphasized for subsequent processing. In some cases, determining the direction based on such soft-VAD outputs can help deemphasize acoustic signals originating from non-human dominant sound sources such as an air conditioner, humidifier, dehumidifier, vacuum cleaner, washer, dryer, or other machines, or from animals (e.g., pets). This in turn may improve the performance of an associated voice-activated device in such noisy environments. In some cases, determining the direction based on soft-KWS outputs may also improve the performance of a corresponding voice-activated device by accurately picking up a relevant voice command even when multiple other human speakers are speaking in the environment.
Microphone arrays can be used for capturing acoustic signals along a particular direction. For example, signals captured by multiple microphones in an array may be processed to generate a sensitivity pattern that emphasizes the signals along a beam in the particular direction and suppresses or deemphasizes signals from one or more other directions. An example of such a device 200 is shown in
In some implementations, a directional audio capture device may also be realized using a single microphone together with a slotted interference tube. An example of such a device 250 is shown in
In some implementations, the microphone array on the audio capture device 105 can include directional microphones such as shotgun microphones described above. In some implementations, the audio capture device 105 can include a device that includes multiple microphones separated by passive directional acoustic elements disposed between the microphones. In some implementations, the passive directional acoustic elements include a pipe or tubular structure having an elongated opening along at least a portion of the length of the pipe, and an acoustically resistive material covering at least a portion of the elongated opening. The acoustically resistive material can include, for example, wire mesh, sintered plastic, or fabric, such that acoustic signals enter the pipe through the acoustically resistive material and propagate along the pipe to one or more microphones. The wire mesh, sintered plastic, or fabric includes multiple small openings or holes, through which acoustic signals enter the pipe. The passive directional acoustic elements each therefore act as an array of closely spaced sensors or microphones. Various types and forms of passive directional acoustic elements may be used in the audio capture device 105. Examples of such passive directional acoustic elements are illustrated and described in U.S. Pat. No. 8,351,630, U.S. Pat. No. 8,358,798, and U.S. Pat. No. 8,447,055, the contents of which are incorporated herein by reference. Examples of microphone arrays with passive directional acoustic elements are described in co-pending U.S. application Ser. No. 15/406,045, titled “Capturing Wide-Band Audio Using Microphone Arrays and Passive Directional Acoustic Elements,” the entire content of which is also incorporated herein by reference.
Data generated from the signals captured by the audio capture device 105 may be processed to generate a sensitivity pattern that emphasizes the signals along a “beam” in the particular direction and suppresses signals from one or more other directions. Examples of such beams or sensitivity patterns 107a-107c (107, in general) are depicted in
The audio processing engine 120 can be located at various locations. In some implementations, the audio processing engine 120 may be disposed on the audio capture device 105 or on a voice-activated device associated with the audio capture device 105. In some such cases, the audio processing engine 120 may be disposed as a part of the audio capture device 105 or the associated voice-activated device. In some implementations, the audio processing engine 120 may be located on a device at a location that is remote with respect to the audio capture device 105. For example, the audio processing engine 120 can be located on a remote server, or on a distributed computing system such as a cloud-based system.
In some implementations, the audio processing engine 120 can be configured to process the data generated from the signals captured by the audio capture device 105 and generate audio data that emphasizes audio captured along one or more directions relative to the audio capture device 105. In some implementations, the audio processing engine 120 can be configured to generate the audio data in substantially real-time (e.g., within a few milliseconds) such that the audio data is usable for real-time or near-real-time applications. The allowable or acceptable time delay for the real-time processing in a particular application may be governed, for example, by an amount of lag or processing delay that may be tolerated without significantly degrading a corresponding user experience associated with the particular application. In some implementations, the audio data generated by the audio processing engine 120 can be transmitted, for example, over a network such as the Internet to a remote computing device configured to process the audio data. For example, the audio data generated by the audio processing engine may be sent to a remote server that analyzes the audio data to determine a voice command included in the audio data, and accordingly sends back one or more control signals to a corresponding voice-activated device to affect the operation of that device.
In some implementations, the audio processing engine 120 can be configured to control directional capture of acoustic signals by the microphone array based on calculating a likelihood of voice activity present along a given direction. An example system implementing such a control functionality is illustrated in
In some implementations, the audio processing engine 120 includes a fixed beamformer 310 that generates emphasized directional signals corresponding to multiple directions with respect to the audio capture device 105. For example, the fixed beamformer 310 can be configured to generate N directional signals or beams based on acoustic signals captured by M microphones. M may be greater than, equal to, or less than N. Each of the N beams represents acoustic signals emphasized along a particular discrete direction with respect to the audio capture device 105.
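Building on the delay_and_sum sketch above, a fixed beamformer such as the element 310 might be approximated as a bank of delay-and-sum beams pointed at N fixed look directions; the even angular spacing, beam count, and return format here are illustrative assumptions rather than disclosed features.

```python
import numpy as np

def fixed_beamformer(mic_signals, mic_positions, fs, n_beams=8, c=343.0):
    """Produce N directional signals ("beams") from M microphone channels.

    Returns an (N, T) array whose row i emphasizes audio arriving from
    look_angles[i], along with the fixed look angles themselves.
    """
    look_angles = np.linspace(0.0, 180.0, n_beams)   # fixed, evenly spaced
    beams = [delay_and_sum(mic_signals, mic_positions, a, fs, c)
             for a in look_angles]
    return np.stack(beams), look_angles
```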
The system 300 also includes a beam score calculator 315 that is configured to calculate a preliminary score for one or more of the N beams generated by the fixed beamformer 310. For example, the beam score calculator 315 may calculate beam scores 320a-320n (320, in general) corresponding to each of the N beams, respectively, generated by the fixed beamformer 310. In some implementations, the beam score calculator 315 is configured to calculate the preliminary score based on a likelihood of presence of voice activity along the corresponding direction of the beam. For example, the beam score calculator 315 can be configured to execute a VAD process on the data representing a particular beam, and generate a VAD score as the corresponding beam score 320. In some implementations, the beam score 320 may be a flag that indicates the presence or absence of human speech within the data corresponding to the particular beam.
A VAD process can be used to identify if there is human speech present in the input audio data corresponding to a particular beam. In some implementations, if human speech is present in the data corresponding to a particular beam, the beam score calculator 315 executing the VAD process generates a discrete flag that indicates the presence of such speech, such that one or more actions can be taken based on the flag. Examples of such actions include turning on or off further processing, injection of comfort noise, gating audio pass-through, etc. In some implementations, the beam score calculator 315 can be configured to compute a beam score 320 based on the probability of human speech being present in the audio stream corresponding to the particular beam. Such a beam score 320 may be referred to as a soft-VAD score. Various types of VAD processes may be used in computing such soft-VAD scores. One example of such a process is described in Huang, Liang-sheng, and Chung-ho Yang, “A novel approach to robust speech endpoint detection in car environments,” Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '00), Vol. 3, IEEE, 2000, the entire content of which is incorporated herein by reference.
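Many VAD formulations could produce such a soft score. As one hypothetical example (a deliberate simplification, not a reproduction of the cited endpoint-detection approach), a per-frame energy margin over an estimated noise floor can be squashed into a pseudo-probability; the frame length, noise-floor percentile, and 6 dB margin below are illustrative assumptions.

```python
import numpy as np

def soft_vad_score(beam_signal, fs, frame_ms=20):
    """Crude soft-VAD: speech likelihood in [0, 1] for one beam's audio.

    Compares per-frame log energy against an estimated noise floor;
    a logistic squashing turns the margin into a pseudo-probability.
    """
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(beam_signal) // frame_len
    frames = beam_signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    log_e = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    noise_floor = np.percentile(log_e, 10)      # quietest frames ~ noise
    # Require roughly 6 dB over the floor before a frame counts as speech.
    margin = np.clip(log_e - noise_floor - 6.0, -50, 50)
    frame_probs = 1.0 / (1.0 + np.exp(-margin))
    return float(frame_probs.mean())            # soft score for the beam
```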
In some implementations, the multiple soft-VAD scores corresponding to the different beams may be compared to determine the one or more directions along which a human speech source is likely present. One or more beams corresponding to such directions may then be selected as the direction(s) of interest for further processing. For example, a beam control engine 325 can be used to analyze the beam scores 320 (e.g., the soft-VAD scores) to focus on one or more directions of interest that correspond to high beam scores. The one or more directions of interest may be selected in various ways. In some implementations, the beam control engine 325 can include a multiplexer 335 that is configured to select one of the multiple beams generated by the beamformer. For example, if the beam control engine 325 determines that a particular beam score (e.g., 320a) is higher than the other beam scores, the beam control engine 325 may instruct the multiplexer 335 (e.g., using a control signal) to select the data corresponding to the particular beam (beam 1, in this example) for further processing. In some implementations, more than one beam may also be selected for further processing. For example, if the beam scores 320 corresponding to two particular beams are close to one another, but each is substantially higher than the remaining beam scores, the data corresponding to the two particular beams may be selected for further processing.
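The selection logic attributed to the beam control engine 325 and the multiplexer 335 might be sketched as follows; the tie margin that admits a second close-scoring beam is an illustrative parameter, not a disclosed value.

```python
import numpy as np

def select_beams(beam_scores, tie_margin=0.05):
    """Pick the best-scoring beam; keep runners-up within tie_margin.

    Returns indices of the selected beam(s), mimicking a multiplexer
    that normally passes one beam but can pass more than one when
    scores are close.
    """
    scores = np.asarray(beam_scores)
    best = int(np.argmax(scores))
    return [i for i, s in enumerate(scores)
            if s >= scores[best] - tie_margin]

# Usage: feed soft-VAD scores for each beam, forward the chosen data.
# select_beams([0.12, 0.81, 0.78, 0.20]) -> [1, 2]
```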
In some implementations, the one or more directions of interest may also be selected using a dynamic beamformer that is configured to generate a new dynamic-beam based on, for example, the spatial information indicated by the soft-VAD scores. An example of such a system 350 is depicted in
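One hypothetical way for a dynamic beamformer to use the spatial information in the soft-VAD scores is to steer a new beam toward a score-weighted direction estimate, reusing the delay_and_sum sketch above; the weighting and pruning rules are assumptions for illustration.

```python
import numpy as np

def steer_dynamic_beam(mic_signals, mic_positions, fs,
                       look_angles, beam_scores, c=343.0):
    """Form a new beam along a direction interpolated from per-beam scores.

    The dynamic beam need not coincide with any of the N fixed directions:
    each fixed look angle is weighted by its soft-VAD score, after pruning
    directions that scored well below the best one.
    """
    w = np.asarray(beam_scores, dtype=float).copy()
    w[w < 0.5 * w.max()] = 0.0       # ignore clearly non-voice directions
    w /= w.sum() + 1e-12             # normalize the remaining scores
    target_angle = float(np.dot(w, np.asarray(look_angles)))
    return delay_and_sum(mic_signals, mic_positions, target_angle, fs, c)
```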
In some implementations, a dynamic beamformer may be used without a fixed beamformer. An example of such a system is shown in
The description above primarily uses soft-VAD scores as examples of beam scores 320. However, other types of beam scores 320 are also possible. For example, a beam score 320 can include a signal-to-noise ratio (SNR), wherein the signal represents a voice activity of interest, and the noise represents other unwanted signals such as non-voice acoustic signals as well as undesired voice signals. The SNR may be calculated as a ratio of a first quantity (e.g., amplitude, power, etc.) representing the voice signal of interest, and a second quantity (e.g., amplitude, power, etc.) representing the noise. In some implementations, the beam score calculator 315 can execute a KWS process to generate soft-KWS scores as the beam scores 320. A KWS process can be used to determine if a specified phrase, or a set of one or more “keywords,” is present in a data stream corresponding to a particular beam. In some implementations, if the phrase or set of keywords is present, a flag can be set, and one or more actions may be taken based on whether the flag is set. Examples of keywords or phrases that are used in commercially available systems include “OK Google” used for Google Home® and other Android® powered devices manufactured by Google Inc. of Mountain View, Calif., “Hey Siri” used for iOS® enabled devices manufactured by Apple Inc. of Cupertino, Calif., and “Alexa” used for Echo® and FIRE TV® devices manufactured by Amazon Inc. of Seattle, Wash. The beam score calculator 315 can be configured to use a soft-KWS process to generate a beam score 320 indicative of a likelihood that a particular phrase is present in the data corresponding to a beam. Such beam scores may be referred to as soft-KWS scores, which can then be used, analogously to the soft-VAD scores, to select one or more directions of interest. Upon identifying the one or more directions of interest, the beam control engine 325 can be configured to select a beam generated by a fixed beamformer or cause a dynamic beamformer to generate a dynamic-beam for the one or more directions of interest.
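An SNR-style beam score of the kind described above might be computed by splitting frames into voice and non-voice sets and taking the ratio of their powers; labeling frames with the same energy-margin heuristic as the soft-VAD sketch is an illustrative assumption, as are the parameter values.

```python
import numpy as np

def snr_beam_score(beam_signal, fs, frame_ms=20, vad_threshold=0.5):
    """SNR-style beam score: voice-frame power over non-voice-frame power.

    Power in frames labeled 'voice' is the first quantity; power in the
    remaining frames is the second. Returns the ratio in dB.
    """
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(beam_signal) // frame_len
    frames = beam_signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    power = np.mean(frames ** 2, axis=1) + 1e-12
    log_e = 10 * np.log10(power)
    margin = np.clip(log_e - np.percentile(log_e, 10) - 6.0, -50, 50)
    is_voice = 1.0 / (1.0 + np.exp(-margin)) > vad_threshold
    if not is_voice.any() or is_voice.all():
        return 0.0                               # no usable voice/noise split
    return 10 * np.log10(power[is_voice].mean() / power[~is_voice].mean())
```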
In some implementations, the beam score calculator 315 may be configured to calculate both a soft-VAD score and a soft-KWS score. In such cases, the beam control engine 325 may control a beamformer based on both scores. For example, in an environment where multiple human speakers are present, a soft-KWS score may be used for determining an initial direction of a particular speaker, and then, if the particular speaker changes position, a soft-VAD score calculated based on the particular speaker's voice may be used for controlling the beamformer in accordance with the particular speaker's position. In some implementations, once the particular speaker is identified (using, for example, a soft-KWS score), one or more characteristics of the particular speaker's voice may be identified to determine which voice to use in calculating the soft-VAD scores. In some implementations, an initial direction or beam may be selected based on a soft-KWS score, and then the soft-VAD scores may be used to “follow” the voice corresponding to the initial direction even as that voice changes position. In some implementations, where both a soft-VAD score and a soft-KWS score are available, a combined score may be calculated for each beam as a weighted combination of the two scores. In some implementations, one score may be preferred over the other. For example, a soft-VAD score may be used if no keyword is detected (as indicated, for example, by the absence of a soft-KWS score, or by the soft-KWS score being below a threshold), but the soft-KWS score may be preferred over the soft-VAD score when a keyword is detected.
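The score-combination behavior described in this paragraph might look like the following sketch, where the weight and threshold values are illustrative assumptions rather than disclosed parameters.

```python
def combined_beam_score(soft_vad, soft_kws=None,
                        kws_threshold=0.4, kws_weight=0.7):
    """Blend soft-VAD and soft-KWS scores for one beam.

    When a keyword is plausibly present, the soft-KWS score dominates the
    weighted combination; otherwise the score falls back to soft-VAD alone,
    matching the preference ordering described above.
    """
    if soft_kws is None or soft_kws < kws_threshold:
        return soft_vad                          # no keyword detected
    return kws_weight * soft_kws + (1 - kws_weight) * soft_vad
```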
Operations of the process 400 include receiving information representing audio captured by a microphone array, wherein the information includes multiple datasets each representing audio signals captured in accordance with a sensitivity pattern along a corresponding direction with respect to the microphone array (402). Operations of the process 400 also include computing, for each of the multiple datasets, one or more quantities indicative of human voice activity captured from the corresponding direction (404). In some implementations, the one or more quantities can be computed by a beam score calculator 315 described above. The one or more quantities indicative of human voice activity can include, for example, a likelihood score of human voice activity in the audio signal represented in the dataset for the corresponding direction. Such a likelihood score may be computed, for example, with the help of a voice activity detector. The one or more quantities indicative of human voice activity can also include a signal-to-noise ratio (SNR), wherein the signal represents voice activity of interest, and the noise represents other unwanted signals including non-voice acoustic signals as well as undesired voice signals. The SNR may be calculated as a ratio of a first quantity (e.g., amplitude, power, etc.) representing the voice signal of interest, and a second quantity (e.g., amplitude, power, etc.) representing the noise. In some implementations, the one or more quantities indicative of human voice activity can be substantially similar to the beam scores 320 described above, including, for example, soft-VAD and soft-KWS scores. In some implementations, the one or more quantities indicative of human voice activity can represent a likelihood score of the presence of a keyword in the audio signal represented in the dataset for the corresponding direction.
The process 400 includes generating, based at least on the one or more quantities computed for a plurality of the multiple datasets, a directional audio signal representing audio captured from a particular direction (406). In some implementations, generating the directional audio signal includes selecting one of the multiple datasets. For example, if a fixed beamformer is used to generate the multiple datasets, generating the directional audio signal can include selecting one of the multiple datasets generated by the fixed beamformer. In some implementations, generating the directional audio signal can include causing a dynamic beamformer to capture audio in accordance with a sensitivity pattern generated for the particular direction.
The audio captured in accordance with the sensitivity pattern generated for the particular direction can be used for various purposes. In some implementations, signals generated based on the captured audio may be used in various speech processing applications including, for example, speech recognition, speaker recognition, speaker verification, or other speech classification tasks. In some implementations, the device executing the process 400 (e.g., the audio processing engine 120 or another device or apparatus that includes the audio processing engine) can include a speech processing engine to implement one or more of the speech processing applications mentioned above. In some implementations, the device executing the process 400 may transmit information based on the captured audio to one or more remote computing devices (e.g., servers associated with a cloud-based system) providing speech processing services. In some implementations, one or more control signals for operating a voice-activated device can be generated based on processing the audio captured in accordance with the sensitivity pattern generated for the particular direction.
The functionality described herein, or portions thereof, and its various modifications (hereinafter “the functions”) can be implemented, at least in part, via a computer program product, e.g., a computer program tangibly embodied in an information carrier, such as one or more non-transitory machine-readable media or storage devices, for execution by, or to control the operation of, one or more data processing apparatus, e.g., a programmable processor, a computer, multiple computers, and/or programmable logic components.
A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a network.
Actions associated with implementing all or part of the functions can be performed by one or more programmable processors executing one or more computer programs to perform the functions. All or part of the functions can be implemented as special purpose logic circuitry, e.g., an FPGA (field-programmable gate array) and/or an ASIC (application-specific integrated circuit). In some implementations, at least a portion of the functions may also be executed on a floating point or fixed point digital signal processor (DSP) such as the Super Harvard Architecture Single-Chip Computer (SHARC) developed by Analog Devices Inc.
Processing devices suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. Components of a computer include a processor for executing instructions and one or more memory devices for storing instructions and data.
Other embodiments and applications not specifically described herein are also within the scope of the following claims.
Elements of different implementations described herein may be combined to form other embodiments not specifically set forth above. Elements may be left out of the structures described herein without adversely affecting their operation. Furthermore, various separate elements may be combined into one or more individual elements to perform the functions described herein.