This disclosure relates to wearable audio devices. More particularly, this disclosure relates to wearable audio devices that enhance the user's speech signal.
All examples and features mentioned below can be combined in any technically possible way.
In one aspect, a wearable two-way communication audio device includes a first microphone that provides a first microphone signal, a second microphone that provides a second microphone signal, and one or more processors. The one or more processors are configured to use the second microphone signal to estimate an ambient noise level and adjust an equalization filter based on the estimated ambient noise level. The first microphone signal and the second microphone signal may be processed via a first beamformer to provide a first beamformed signal and the first beamformed signal may be filtered with the equalization filter to provide a noise estimate signal. The one or more processors may also use the noise estimate signal to generate a voice output signal for transmission to a far end recipient.
Implementations may include one of the following features, or any combination thereof.
In some implementations, the one or more processors are configured to adjust the equalization filter by selecting one of a plurality of equalization filters.
In certain implementations, the one or more processors are configured to adjust the equalization filter based on the estimated ambient noise level by calculating an ambient noise energy estimate based on the second microphone signal and setting a noise flag based on the ambient noise energy estimate.
In some cases, the one or more processors are configured to freeze the calculation of the ambient noise energy in response to receiving input from a voice activity detector indicating that a user is speaking.
In certain cases, the one or more processors are configured to: process the first microphone signal and the second microphone signal using a second beamformer to provide a second beamformed signal; calculate a first wind energy estimate based on the second beamformed signal and set a wind flag based on the first wind energy estimate; and adjust the equalization filter based, at least in part, on the wind flag.
In some examples, the one or more processors are configured to: generate a voice estimate signal based, at least in part, on the second beamformed signal; and provide the voice estimate signal and the noise estimate signal to a spectral enhancer to generate the voice output signal.
In certain examples, the second beamformer is a minimum variance distortionless response (MVDR) beamformer.
In some implementations, the first beamformer is a delay-and-subtract (DSub) beamformer.
In certain implementations, the one or more processors are configured to adjust the equalization filter by selecting a different equalization filter for each of a noisy, a quiet, and a windy condition.
In some cases, the one or more processors are configured to: perform subband filtering on the first and second microphone signals to provide subband filtered signals and process the subband filtered signals with the first beamformer to provide the first beamformed signal. The first beamformed signal includes a first beamformed subband signal for each subband. The subband filtered signals may be processed with a second beamformer to provide a second beamformed subband signal for each subband. The one or more processors may also be configured to provide a voice estimate subband signal for each subband derived from a corresponding one of the first beamformed subband signals and a noise estimate subband signal for each subband derived from a corresponding one of the second beamformed subband signals to a spectral enhancer to provide a spectrally enhanced subband signal for each subband. Steady state noise reduction may be performed on the spectrally enhanced subband signal for each subband.
In some implementations, the one or more processors are configured to perform steady state noise reduction on the spectrally enhanced subband signal for each subband by: passing each spectrally enhanced subband signal for each subband through a pair of energy trackers; estimating a signal-to-noise ratio (SNR) for each subband based on output from the energy trackers; and selecting a corresponding gain to apply to a respective one of the spectrally enhanced subband signals for each subband based on the corresponding estimated SNR.
In certain implementations, the pair of energy trackers may include a first energy tracker with a fast attack and a slow decay and a second energy tracker with a slow attack and a fast decay.
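For illustration only, the following Python sketch shows one way such a tracker pair could be realized, assuming one-pole envelope followers over per-frame subband energies; the function names, smoothing coefficients, and SNR-to-gain mapping are assumptions for illustration, not values from this disclosure.

```python
import numpy as np

def track_energy(x, attack, decay):
    """One-pole envelope follower over per-frame subband energy.
    A small coefficient reacts quickly; a large one reacts slowly."""
    env = np.empty(len(x))
    e = 0.0
    for n, energy in enumerate(np.abs(np.asarray(x)) ** 2):
        coef = attack if energy > e else decay
        e = coef * e + (1.0 - coef) * energy
        env[n] = e
    return env

def ssnr_subband(frames, max_cut_db=12.0):
    """Steady state noise reduction for one subband (illustrative tuning).
    The fast-attack/slow-decay tracker follows speech-plus-noise peaks;
    the slow-attack/fast-decay tracker settles near the steady noise
    floor. Their ratio gives a per-frame SNR estimate that selects the
    gain applied to the subband."""
    peak = track_energy(frames, attack=0.3, decay=0.995)    # fast up, slow down
    floor = track_energy(frames, attack=0.995, decay=0.3)   # slow up, fast down
    snr_db = 10.0 * np.log10((peak + 1e-12) / (floor + 1e-12))
    gain_db = np.clip(snr_db - max_cut_db, -max_cut_db, 0.0)
    return np.asarray(frames) * 10.0 ** (gain_db / 20.0)
```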
Another aspect features an audio device that includes a first microphone that provides a first microphone signal; a second microphone that provides a second microphone signal; and one or more processors. The one or more processors are configured to: use the second microphone signal to estimate an ambient noise energy; and adjust an equalization filter based on the estimated ambient noise energy.
Implementations may include one of the above and/or below features, or any combination thereof.
In some implementations, using the second microphone signal to estimate the ambient noise level includes calculating an ambient noise energy estimate based on the second microphone signal and setting a noise flag based on the ambient noise energy estimate. And, adjusting the equalization filter based on the estimated ambient noise level may include adjusting the equalization filter based on the noise flag.
In certain implementations, the one or more processors are further configured to filter another signal with the equalization filter to provide a voice output signal.
In some cases, the one or more processors are configured to adjust the equalization filter by selecting one of a plurality of equalization filters.
In certain cases, the plurality of equalization filters include a first equalization filter that is applied when the estimated ambient noise energy indicates a user is in a noisy environment and a second equalization filter that is applied when the estimated ambient noise energy indicates the user is in a quiet environment.
Implementations may provide one or more of the following benefits.
The systems and methods described herein may reduce wind noise, especially clustering wind noise.
Some implementations may help to reduce low frequency wind noise below 1 kHz without significantly compromising speech intelligibility.
Certain implementations may provide improved noise reduction. In that regard, the systems and methods described herein may use a spectral noise subtraction and/or steady state noise reduction algorithm to reduce the harsh high frequency noise leakage.
Some embodiments may reduce ambient noise, such as HVAC or fan noise, in fairly quiet environments.
Certain embodiments may provide smoother noise level transitions between when the user is talking and when the user stops talking.
Some configurations may provide a more natural voice with fuller bandwidth in quiet conditions than conventional headphones.
Certain configurations may provide noticeably reduced popping/crackling sounds that appear as distortions in conventional headphones.
Some implementations may reduce the effect of a user's voice becoming very quiet or spectrally unbalanced when the earpieces are rotated away from a nominal orientation and/or when the user talks next to a hard surface, such as a wall, or puts their hands behind their head.
It is noted that the drawings of the various implementations are not necessarily to scale. The drawings are intended to depict only typical aspects of the disclosure, and therefore should not be considered as limiting the scope of the implementations. In the drawings, like numbering represents like elements between the drawings.
Aspects and implementations disclosed herein may be applicable to a wide variety of wearable audio devices in various form factors, but are generally directed to devices having at least one inner microphone that is substantially shielded from environmental noise (i.e., acoustically coupled to an environment inside the ear canal of the user) and at least one external microphone substantially exposed to environmental noise (i.e., acoustically coupled to an environment outside the ear canal of the user). Further, various implementations are directed to wearable audio devices that support two-way communications, and may for example include in-ear devices, over-ear devices, and near-ear devices. Form factors may include, e.g., earbuds, headphones, hearing assist devices, and other wearables. Further configurations may include headphones with either one or two earpieces, over-the-head headphones, behind the neck headphones, in-the-ear or behind-the-ear hearing aids, wireless headsets, audio eyeglasses, single earphones or pairs of earphones, as well as hats, helmets, clothing or any other physical configuration incorporating one or two earpieces to enable audio communications and/or ear protection. Further, what is disclosed herein is applicable to wearable audio devices that are wirelessly connected to other devices, that are connected to other devices through electrically and/or optically conductive cabling, or that are not connected to any other device at all.
It should be noted that although specific implementations of wearable audio devices are presented with some degree of detail, such presentations of specific implementations are intended to facilitate understanding through provision of examples and should not be taken as limiting either the scope of disclosure or the scope of claim coverage.
Audio output by the transducer 108 and speech captured by the external microphones 116, 118 within each earpiece are controlled by an audio processing system 122. Audio processing system 122 may be integrated into one or both earpieces 102 or be implemented by an external system. In the case where audio processing system 122 is implemented by an external system, each earpiece 102 may be coupled to the audio processing system 122 in either a wired or wireless configuration. In various implementations, audio processing system 122 may include hardware, firmware and/or software to provide various features to support operations of the wearable audio device 100, including, e.g., providing a power source, amplification, input/output, network interfacing, user control functions, active noise reduction (ANR), signal processing, data storage, data processing, voice detection, etc.
The wearable audio device 100 is configured to provide two-way communications in which the user's voice or speech is captured and then output to an external node via the audio processing system 122. In that regard, the external microphones 116, 118 (alone or in combination with external microphone 120) may be used for capturing the user's voice, and the audio processing system 122 may be used to process those microphone signals to provide a voice signal (aka a "voice output signal") to the far end of a two-way communication (e.g., a phone call).
For that purpose, the audio processing system 122 may include a left earpiece processing system 124 for processing signals from the microphones 110A, 116A, 118A, 120A of the left earpiece 102A, and a right earpiece processing system 126 for processing signals from the microphones 110B, 116B, 118B, 120B of the right earpiece 102B. The audio processing system 122 may also include a combined earpiece processing system 128 for processing signals from the left and right earpiece processing systems 124, 126. For example, the wearable audio device 100 may be configured such that microphone input from only one of the earpieces 102A, 102B (a primary earpiece) is used for providing the voice output signal (e.g., item 302,
The left earpiece processing system 124 may be executed by a first processor in the left earpiece 102A and the right earpiece processing system 126 may be executed by a second processor in the right earpiece 102B. The combined earpiece processing system 128 may be executed by one of the first or second processors, or by a third processor that may reside in the left earpiece 102A, in the right earpiece 102B, or in an external system (such as a mobile device coupled to one or both of the earpieces 102A, 102B).
In implementations that include ANR for enhancing audio signals, the inner microphone 110 may serve as a feedback microphone and the external microphone 120 (alone or in combination with microphones 116 and 118) may serve as a feedforward microphone. In such implementations, each earpiece 102 may utilize an ANR circuit that is in communication with the inner and external microphones 110 and 120. The ANR circuit receives an internal signal generated by the inner microphone 110 and an external signal generated by the external microphone 120 (alone or in combination with microphones 116 and 118) and performs an ANR process for the corresponding earpiece 102. The process includes providing a signal to an electroacoustic transducer (e.g., speaker) 108 disposed in the cavity 106 to generate an anti-noise acoustic signal that reduces or substantially prevents sound from one or more acoustic noise sources that are external to the earpiece 102 from being heard by the user. External microphone 120 may be arranged to face toward a user's concha when the device is worn, e.g., such that the microphone 120 is shielded from wind. Such configurations are disclosed in U.S. patent application Ser. No. 17/362,625 filed on Dec. 27, 2022, entitled “ACTIVE NOISE REDUCTION EARBUD,” now U.S. Pat. No. 11,540,043, the complete disclosure of which is incorporated herein by reference.
System 124 generally includes a domain converter 204 that converts microphone signals from the time domain to the frequency domain. The domain converter 204 also separates spectral components of each microphone signal into multiple sub-bands. For example, the domain converter 204 may process the microphone signals to provide frequencies limited to a particular range, and within that range may provide multiple sub-bands that in combination encompass the full range. In one particular example, the domain converter 204 may provide sixty-four sub-bands covering 125 Hz each across a frequency range of 0 to 8,000 Hz. The domain converter 204 may for example be configured to convert the time domain signal into sub-bands using a weighted overlap add (WOLA) analysis.
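As a rough illustration of this analysis stage, the Python sketch below uses a standard STFT as a stand-in for a WOLA analysis filter bank; the 16 kHz sample rate, window length, and function name are assumptions chosen only to match the 64-band, 125 Hz example above.

```python
from scipy.signal import stft

FS = 16_000      # assumed sample rate: gives a 0-8,000 Hz analysis range
N_BANDS = 64     # sixty-four sub-bands of 125 Hz each, as in the example

def to_subbands(mic_signal):
    """Convert a time-domain mic signal into 64 complex sub-band signals.
    An STFT with nperseg = 2 * N_BANDS yields bins spaced
    FS / (2 * N_BANDS) = 125 Hz apart; the Nyquist bin is dropped to
    keep exactly 64 bands."""
    _, _, spec = stft(mic_signal, fs=FS, nperseg=2 * N_BANDS)
    return spec[:N_BANDS, :]      # shape: (bands, frames)
```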
Each of the subsequent components in the region labeled “sub-band processing” of the example system 124 illustrated in
The domain converter 204 provides the frequency domain signals, 206 and 208, from the first external microphone 116 and the second external microphone 118, respectively, to each of two beamformers 210, 212. The beamformers 210, 212 apply array processing techniques, such as phased array and delay-and-subtract techniques, and may utilize minimum variance distortionless response (MVDR) and linear constraint minimum variance (LCMV) techniques, to adapt the responsiveness of the set of microphones 116, 118 to enhance or reject acoustic signals from various directions. Beamforming enhances acoustic signals from a particular direction, or range of directions, while null steering reduces or rejects acoustic signals from a particular direction or range of directions.
The first beamformer 210 is a beamformer that works to maximize the acoustic response of the set of microphones 116, 118 in the direction of the user's mouth (e.g., directed to the front of and slightly below an earpiece), and provides a first beamformed signal 214. Because of the beamforming performed by the first beamformer 210, the first beamformed signal 214 includes a higher signal energy due to the user's voice than any of the individual microphone signals.
The second beamformer 212 steers a null toward the user's mouth and provides a second beamformed signal 216. The second beamformed signal 216 includes minimal, if any, signal energy due to the user's voice because of the null directed at the user's mouth. Accordingly, the second beamformed signal 216 is composed substantially of components due to background noise and acoustic sources not due to the user's voice, i.e., the second beamformed signal 216 is a signal correlated to the acoustic environment without the user's voice.
In certain examples, the first beamformer 210 is a super-directive near-field beamformer that enhances acoustic response in the direction of the user's mouth, and the second beamformer 212 is a delay-and-subtract algorithm that steers a null, i.e., reduces acoustic response, in the direction of the user's mouth.
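For concreteness, here is a minimal frequency-domain sketch of a delay-and-subtract null steer of the kind attributed to the second beamformer 212; the 16 kHz rate, 128-point analysis, and inter-microphone delay are assumptions, and the super-directive first beamformer is omitted as design-specific.

```python
import numpy as np

def delay_and_subtract(front, rear, delay_s, fs=16_000, n_fft=128):
    """Steer a null toward the mouth. `front` and `rear` are complex
    sub-band frames (bands x frames) from the two external microphones,
    and `delay_s` is the propagation delay between them for sound
    arriving from the user's mouth. Phase-delaying one microphone by
    that amount and subtracting cancels the mouth direction, leaving a
    signal dominated by ambient noise."""
    bands = front.shape[0]
    freqs = np.arange(bands) * fs / n_fft         # band center frequencies
    steer = np.exp(-2j * np.pi * freqs * delay_s)[:, None]
    return steer * front - rear
```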
The first beamformed signal 214 and the frequency domain first external microphone signal 206 (aka “frequency domain COM1 mic signal”) are provided to a wind detector 218, which analyzes those signals to identify whether wind is present. The wind detector 218 calculates an energy difference between the first beamformed signal 214 and the frequency domain COM1 mic signal. In that regard, the wind detector 218 may calculate the energy in each of the first beamformed signal 214 and the frequency domain COM1 mic signal 206 on a sub-band basis and then sum the calculated sub-band energies to determine a total wind energy for each of those signals before determining the difference between those two totals. In some cases, the wind detector 218 may only calculate the energy within a certain frequency band (e.g., 125 Hz to 2 kHz).
If the energy difference between the first beamformed signal 214 and the frequency domain COM1 mic signal 206 exceeds a threshold, then the wind detector 218 identifies that wind is detected. The wind detector 218 produces a wind flag signal 220 based on this analysis. The wind flag signal 220 may be a binary signal (0 or 1) indicating either a wind or a no wind condition.
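A minimal sketch of that comparison follows; the 6 dB threshold and band indices are assumptions (bands 1-16 of 125 Hz approximate the 125 Hz to 2 kHz range). The rationale is that super-directive beamforming boosts uncorrelated low-frequency energy such as wind buffeting, so a large energy excess in the beamformed signal relative to the raw mic suggests wind.

```python
import numpy as np

def detect_wind(beam1, com1, band_lo=1, band_hi=16, threshold_db=6.0):
    """Sum per-band energies of the first beamformed signal and the raw
    COM1 mic signal over the analysis range, then flag wind when the
    beamformed total exceeds the mic total by `threshold_db`."""
    e_beam = np.sum(np.abs(beam1[band_lo:band_hi]) ** 2)
    e_mic = np.sum(np.abs(com1[band_lo:band_hi]) ** 2)
    diff_db = 10.0 * np.log10((e_beam + 1e-12) / (e_mic + 1e-12))
    return 1 if diff_db > threshold_db else 0    # binary wind flag
```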
A frequency domain signal 222 from the third external microphone 120 (aka “feedforward microphone” or “FF mic” or “Concha mic”) is equalized via an equalization (EQ) filter 224 to produce an equalized FF mic signal 226, which is provided to a dynamic wind mixer 228 along with the frequency domain COM1 mic signal 206, the first beamformed signal 214, and the wind flag signal 220. The EQ filter 224 equalizes the FF mic signal 222 to have the same voice spectra as COM1 mic signal 206 or the first beamformed signal 214 before providing the equalized signal 226 to the dynamic wind mixer 228. The COM1 mic signal 206 and the first beamformed signal 214 are assumed to have the same voice spectra by design.
The dynamic wind mixer 228 produces a wind mixer output signal 230 that is based on the wind condition, as indicated by the wind flag signal 220. When the wind flag signal 220 indicates that wind is detected, the dynamic wind mixer 228 switches to a dynamic mixing of the frequency domain COM1 mic signal 206 and the FF mic signal 222. Mixing coefficients for the COM1 mic signal 206 and the equalized FF mic signal 226 are determined based on an estimated wind energy ratio between those two signals. In that regard, the wind mixer 228 may calculate the energy in each of the frequency domain COM1 mic signal 206 and the equalized FF mic signal 226 on a sub-band basis and then sum the calculated sub-band energies to determine a total energy for each of those signals before determining the ratio between those two totals. In some cases, the wind mixer 228 may only calculate the energy within a certain frequency band (e.g., 125 Hz to 2 kHz).
In some implementations, the mixing of the COM1 mic and equalized FF mic signals only happens below a certain frequency (e.g., 2 kHz), and above that frequency the dynamic wind mixer 228 crosses over to the first beamformed signal 214. Thus, depending on the wind condition, the wind mixer output signal 230 corresponds to either the first beamformed signal 214 or a mixed signal that includes a mix of the COM1 mic and equalized FF mic signals 206, 226 at lower frequencies (e.g., below 2 kHz) and which crosses over to the first beamformed signal 214 at higher frequencies (e.g., 2 kHz and above).
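The sketch below illustrates this mixing logic under the assumptions already noted (64 bands of 125 Hz each, a hard crossover at band 16, i.e., 2 kHz); the real system presumably smooths both the crossover and the mixing coefficients over time.

```python
import numpy as np

def wind_mix(com1, ff_eq, beam1, wind_flag, xover_band=16):
    """Dynamic wind mixer sketch (inputs are bands x frames). With no
    wind, pass the first beamformed signal through. With wind, mix the
    COM1 and equalized FF mic signals below the crossover band, weighted
    by their energy ratio so the windier (more energetic) signal gets
    less weight, and keep the beamformed signal above the crossover."""
    if not wind_flag:
        return beam1
    e_com1 = np.sum(np.abs(com1[:xover_band]) ** 2)
    e_ff = np.sum(np.abs(ff_eq[:xover_band]) ** 2)
    w_com1 = e_ff / (e_com1 + e_ff + 1e-12)      # favor the quieter signal
    out = beam1.copy()
    out[:xover_band] = (w_com1 * com1[:xover_band]
                        + (1.0 - w_com1) * ff_eq[:xover_band])
    return out
```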
The wind mixer output signal 230 is provided to a spectral enhancer 232 (aka “noise spectral subtractor” or “NSS”) along with the second beamformed signal (or an equalized version of it, as discussed below). The spectral enhancer 232 uses the wind mixer output signal 230 as a voice estimate and the second beamformed signal as a noise estimate and enhances the short-time spectral amplitude (STSA) of the user's voice/speech, thereby reducing noise in a spectrally enhanced output signal 234. Examples of spectral enhancement that may be implemented in the spectral enhancer 232 include spectral subtraction techniques, minimum mean square error techniques, and Wiener filter techniques. The spectral enhancement via the spectral enhancer 232 improves the voice-to-noise ratio of the output signal 234. Spectral enhancement may further improve system performance when there are more noise sources or changing noise characteristics. The spectral enhancer 232 may operate on the two estimate signals, using their spectral content to further enhance the user's voice component of the output signal 234.
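Of those options, plain magnitude spectral subtraction is the simplest to sketch; the floor and over-subtraction parameters below are illustrative tuning values, not values from this disclosure.

```python
import numpy as np

def spectral_subtract(voice_est, noise_est, oversub=1.0, floor=0.1):
    """Subtract the noise-estimate magnitude from the voice-estimate
    magnitude per band and frame, clamp to a fraction of the original
    magnitude to limit musical-noise artifacts, and reapply the voice
    estimate's phase."""
    mag = np.abs(voice_est) - oversub * np.abs(noise_est)
    mag = np.maximum(mag, floor * np.abs(voice_est))
    return mag * np.exp(1j * np.angle(voice_est))
```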
The output of the spectral enhancer 232 (i.e., the spectrally enhanced output signal 234) is passed through an inverse domain converter 236 that generates a time domain output signal. As mentioned above, the inverse domain converter 236 may be configured to perform the opposite function of the domain converter 204. That is, the inverse domain converter acts to re-combine all the sub-bands into a single output signal (the enhanced speech signal 202) using WOLA synthesis. In some cases, the spectrally enhanced output signal may first be provided to a steady state noise reducer (SSNR) 238, which can help to remove certain ambient noise (such as HVAC noise) and noise in front of the user, and can clean up high frequency noise residue from the spectral enhancement (spectral subtraction). The output of the SSNR 238 (the “noise reduced output signal 240”) can then be provided to the inverse domain converter 236 to generate the output signal 202. Additional details of the SSNR 238 are described below with reference to
In some implementations, the output signal 202 may be provided as the voice output signal that is sent to the far end. In other implementations, additional output stage (time domain) processing 300,
Referring to
That wind energy estimate 244 is shared with the sliding high-pass filter 304, which maps the energy estimate to one of a plurality of different high-pass filters to apply, in order to trade off between wind noise reduction and voice naturalness. When the wind energy is higher, the system chooses a high-pass filter with a higher corner frequency. When the wind energy is lower, the system chooses a high-pass filter with a lower corner frequency.
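A sketch of that mapping follows; the table breakpoints and corner frequencies are assumptions chosen only to show the shape of the tradeoff, and the Butterworth filter stands in for whatever filter family the real design uses.

```python
from scipy.signal import butter, sosfilt

# Illustrative wind-energy-to-corner mapping: windier conditions select
# a higher corner, trading voice naturalness for wind rejection.
HPF_TABLE = [          # (minimum wind energy, corner frequency in Hz)
    (0.2, 150.0),
    (0.5, 300.0),
    (1.0, 500.0),
]

def sliding_hpf(signal, wind_energy, fs=16_000):
    """Pick the high-pass corner from the wind energy estimate and apply
    a second-order Butterworth high-pass filter."""
    corner = 80.0                    # default corner for calm conditions
    for min_energy, fc in HPF_TABLE:
        if wind_energy >= min_energy:
            corner = fc
    sos = butter(2, corner, btype="highpass", fs=fs, output="sos")
    return sosfilt(sos, signal)
```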
In some instances, the wearable audio device 100 may only provide a voice output signal to the far end from one of the earpieces 102A or 102B. In that regard, the wearable audio device 100 may detect and estimate wind noise on both earpieces 102A, 102B, e.g., using the system illustrated in
Otherwise, if the wind flag signals from the left and right earpieces both indicate a wind condition (Wind_left==1 & Wind_right==1), then the combined earpiece processing system 128 looks to the wind energy estimate signals from the left and right earpieces. And, if the estimated wind energy on the left earpiece 102A is less than the estimated wind energy on the right earpiece 102B, that will trigger a role switch causing the left earpiece 102A to be set as the primary earpiece.
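Putting the conditions together, primary-earpiece selection could look like the sketch below. The single-wind-side branch (switch to the calmer earpiece) is an assumption consistent with the surrounding text, since that case is not spelled out here, and the function name is hypothetical.

```python
def pick_primary(wind_left, wind_right, energy_left, energy_right,
                 current="right"):
    """Role-switch sketch. If both earpieces report wind, the one with
    the lower wind energy estimate becomes primary; if only one reports
    wind, the other is assumed to win; otherwise keep the current
    primary earpiece."""
    if wind_left and wind_right:
        return "left" if energy_left < energy_right else "right"
    if wind_left != wind_right:
        return "right" if wind_left else "left"
    return current
```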
Referring again to
In that regard, the earpiece processing system 124 may include a noise level estimator 246. As shown in
The calculated ambient noise level is compared to a threshold. When the ambient noise level estimate exceeds the threshold, the noise level estimator 246 determines that the user is in a noisy environment, and when the ambient noise level estimate is below the threshold, the noise level estimator 246 determines that the user is in a quiet environment. When the user is in a noisy environment, the system applies more aggressive noise reduction.
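A minimal sketch of such an estimator is shown below, assuming a one-pole average of summed band energies; the smoothing coefficient, threshold, and names are illustrative only.

```python
import numpy as np

def update_noise_flag(band_energies, level, alpha=0.98, threshold=1e-4):
    """Smooth the summed per-band energies into an ambient noise level
    estimate and compare it against a threshold: above it, the user is
    deemed to be in a noisy environment (flag = 1); below it, a quiet
    environment (flag = 0). Returns the flag and the updated level."""
    level = alpha * level + (1.0 - alpha) * float(np.sum(band_energies))
    return (1 if level > threshold else 0), level
```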
The noise level estimator 246 provides a noise flag signal 248 to a noise equalizer (EQ) 250. The noise flag signal 248 may be a binary signal (0 or 1) indicating either a quiet (0) or a noisy (1) condition. The noise EQ 250 also receives the wind flag signal 220 from the wind detector 218. Depending on whether the user is in a quiet, noisy, or windy condition, the noise EQ 250 smoothly transitions between different equalization filters to favor different noise characteristics, such that improved noise reduction performance and voice spectrum may be achieved in each scenario. In some implementations, if the wind flag signal 220 indicates a windy condition (that the user is in a windy environment), then the noise EQ 250 will select the equalization filter designed for improved performance in windy conditions. In such implementations, if the wind flag signal 220 instead indicates a no wind condition (the user is not in a windy environment), then the noise EQ 250 will look to the noise flag signal 248, and will apply either an equalization filter designed for improved performance in noisy conditions or an equalization filter designed for improved performance in quiet conditions, depending on whether the noise flag signal 248 indicates a noisy condition or a quiet condition.
The noise EQ 250 applies the selected one of the EQ filters to the second beamformed signal 216 and provides the equalized beamformed signal 252 to the spectral enhancer 232 for processing. The equalized beamformed signal 252 is effectively a noise reference signal for the spectral enhancer 232.
For noisy conditions, the noise spectrum is kept in the low frequencies to help ensure that the spectral enhancer 232 attenuates low frequency noise but maintains the high frequencies for higher voice bandwidth. For quiet conditions, a heavily attenuated equalization filter (relative to the noisy-condition filter) may be used, since there is not much noise to reduce. For wind conditions, the wind EQ filter is selected such that the spectral enhancer 232 attenuates high frequency noise but relaxes attenuation at low frequencies.
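The per-band gain shapes below only echo that qualitative description; the real equalization curves are design-specific. The sketch assumes the 64-band layout used earlier, with band 16 sitting at 2 kHz.

```python
import numpy as np

BANDS = np.arange(64)
EQ_NOISY = np.where(BANDS < 16, 1.0, 0.25)  # keep low-frequency noise spectra
EQ_QUIET = np.full(64, 0.1)                 # heavily attenuated reference
EQ_WIND = np.where(BANDS < 16, 0.25, 1.0)   # keep highs, relax lows

def apply_noise_eq(beam2, wind_flag, noise_flag):
    """Pick the EQ for the current condition and shape the second
    beamformed signal into the noise reference fed to the spectral
    enhancer (beam2 is bands x frames)."""
    eq = EQ_WIND if wind_flag else (EQ_NOISY if noise_flag else EQ_QUIET)
    return eq[:, None] * beam2
```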
In order to have a consistent and smooth noise estimate, a voice activity detector (VAD) 254 may be used to freeze the ambient noise level estimate when the user is talking.
In some cases, the VAD 254 may use a signal 256 from the inner (feedback) microphone 110 to detect voice activity. In some implementations, the inner microphone signal 256 may be filtered, e.g., via an acoustic echo canceller (AEC) 258, to provide a clean feedback (FB) microphone signal 260 to the domain converter 204, and the frequency domain clean FB microphone signal 262 (from the domain converter 204) may be input to the VAD 254. The VAD 254, in turn, provides a VAD flag signal 264 to the noise level estimator 246. The VAD flag signal 264 may be a binary signal (0 or 1) indicating either a voice (user is speaking) or a no voice (user is not speaking) condition. When the VAD flag signal 264 indicates that the user is speaking, the noise level estimator 246 will freeze the ambient noise level estimate until that condition abates.
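The freeze itself can be a one-line guard on the estimator update, as in this sketch (reusing the illustrative one-pole average from the earlier estimator sketch):

```python
def update_noise_level(level, frame_energy, vad_speaking, alpha=0.98):
    """Hold the ambient noise level estimate while the VAD flag
    indicates speech, so the user's own voice does not inflate it;
    resume the one-pole update once the user stops talking."""
    if vad_speaking:
        return level
    return alpha * level + (1.0 - alpha) * frame_energy
```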
As mentioned above, some implementations may include a steady state noise reducer (SSNR) 238 that receives the spectrally enhanced output signal 234 from the spectral enhancer 232 and provides further noise reduction before providing the noise reduced output signal 240 (a noise reduced version of the enhanced output signal 234) to the inverse domain converter 236. The SSNR 238 removes certain noises, such as HVAC noise and noise in front of the user (e.g., from a computer fan), and cleans up high frequency noise residue from the spectral enhancer 232. With reference to
Referring again to
The equalized output signal 308 is provided to the sliding high-pass filter 304, which applies the selected high-pass filter based on the wind energy estimate 244 to provide a filtered output signal 310. In some implementations, the filtered output signal 310 may pass through a limiter 312 before it is sent to the far end.
According to various implementations, a wearable audio device provides the technical effect of enhancing voice pick-up during challenging environmental conditions, e.g., high wind or noise.
It is noted that the implementations described herein are particularly useful for two-way communications such as phone calls, especially when using ear buds. However, the benefits extend beyond phone call applications. These technologies are also applicable to aviation and military use, where voice pickup with ear buds in high-noise conditions is desired. Further potential uses include peer-to-peer applications where the voice pickup is shielded from echo issues normally present. Other use cases may involve automobile ‘car wear’-like applications, wake word or other human-machine voice interfaces in environments where external microphones will not work reliably, self-voice recording/analysis applications that provide discreet environments without picking up external conversations, and any application in which multiple external microphones are not feasible. Further, the implementations may be useful in work-from-home or call center applications by avoiding picking up nearby conversations, thus providing privacy for the user.
It is understood that one or more of the functions of the described systems may be implemented as hardware and/or software, and the various components may include communications pathways that connect components by any conventional means (e.g., hard-wired and/or wireless connection). For example, one or more non-volatile devices (e.g., centralized or distributed devices such as flash memory device(s)) can store and/or execute programs, algorithms and/or parameters for one or more described devices. Additionally, the functionality described herein, or portions thereof, and its various modifications (hereinafter “the functions”) can be implemented, at least in part, via a computer program product, e.g., a computer program tangibly embodied in an information carrier, such as one or more non-transitory machine-readable media, for execution by, or to control the operation of, one or more data processing apparatus, e.g., a programmable processor, a computer, multiple computers, and/or programmable logic components.
A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a network.
Actions associated with implementing all or part of the functions can be performed by one or more programmable processors executing one or more computer programs to perform the functions. All or part of the functions can be implemented as special purpose logic circuitry, e.g., an FPGA (field programmable gate array) and/or an ASIC (application-specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor may receive instructions and data from a read-only memory or a random access memory or both. Components of a computer include a processor for executing instructions and one or more memory devices for storing instructions and data.
It is noted that while the implementations described herein utilize microphone systems to collect input signals, it is understood that any type of sensor can be utilized separately or in addition to a microphone system to collect input signals, e.g., accelerometers, thermometers, optical sensors, cameras, etc.
Additionally, actions associated with implementing all or part of the functions described herein can be performed by one or more networked computing devices. Networked computing devices can be connected over a network, e.g., one or more wired and/or wireless networks such as a local area network (LAN), wide area network (WAN), personal area network (PAN), Internet-connected devices and/or networks, and/or cloud-based computing environments (e.g., cloud-based servers).
In various implementations, electronic components described as being “coupled” can be linked via conventional hard-wired and/or wireless means such that these electronic components can communicate data with one another. Additionally, sub-components within a given component can be considered to be linked via conventional pathways, which may not necessarily be illustrated.
A number of implementations have been described. Nevertheless, it will be understood that additional modifications may be made without departing from the scope of the inventive concepts described herein, and, accordingly, other implementations are within the scope of the following claims.