The present disclosure relates generally to audio signal processing and, in particular, to holographic detection and removal of wind noise.
People use mobile electronic devices that include one or more microphones outdoors. Examples of such devices include augmented reality (AR) devices, smartphones, mobile phones, personal digital assistants, wearable devices, hearing aids, home security monitoring devices, and tablet computers. The output of the microphones can include a significant amount of noise due to wind, which significantly degrades the sound quality. In particular, wind noise may cause microphone signal saturation at high wind speeds and introduce nonlinear acoustic echo. Wind noise may also reduce the performance of various audio operations, such as acoustic echo cancellation (AEC), voice-trigger detection, automatic speech recognition (ASR), voice over internet protocol (VoIP) communication, and audio event detection (e.g., for outdoor home security devices). Wind noise has long been considered a challenging problem, and an effective wind noise detection and removal system is highly sought after for use in various applications.
A mobile electronic device such as a smartphone includes one or more microphones that generate one or more corresponding audio signals. A wind noise detection (WND) subsystem analyzes the audio signals to determine whether wind noise is present. The audio signals may be analyzed using multiple techniques in different domains. For example, the audio signals may be analyzed in the time, spatial, and frequency domains. The WND subsystem outputs a flag or other indicator of the presence (or absence) of wind noise in the set of audio signals.
The WND subsystem may be used in conjunction with a wind noise reduction (WNR) subsystem. If the WND subsystem detects wind noise, the WNR subsystem processes the audio signals to remove or mitigate the wind noise. The WNR subsystem may process the audio signals using multiple techniques in one or more domains. The WNR subsystem outputs the processed audio for use in other applications or by other devices. For example, the output from the WNR subsystem may be used for phone calls, controlling electronic security systems, activating electronic devices, and the like.
The figures depict various embodiments for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described.
Wind noise in the output from a microphone is statistically complicated and typically has highly non-stationary characteristics. As a result, traditional background noise detection and reduction approaches often fail to work properly. This presents a problem for the use of mobile electronic devices in windy conditions as the wind noise may obscure desired features of the output of microphones, such as an individual's voice.
Potential approaches to wind noise detection (WND) include a negative slope fit (NSF) approach and neural network (NN) or machine learning (ML) based approaches. The NSF approach assumes that wind noise decays approximately linearly in the frequency domain. This linear-decay assumption may cause the detection indicator to be inaccurate. NN and ML based wind noise detection approaches often require extensive training to discern wind noise from an audio signal of interest, which can be impractical in some scenarios, particularly where a wide variety of audio signals are of interest. For example, to support various types of wind and voice signals, noise-aware training involves developing a consistent estimate of the noise, which is often very difficult with highly non-stationary wind noise.
Some potential approaches to wind noise reduction (WNR) include non-negative sparse coding, a singular value decomposition (SVD) approach, and a generalized SVD (GSVD) subspace method. The non-negative sparse coding approach converges very slowly to stable results and only works if the signal-to-noise ratio (SNR) is larger than 0.0 decibels (dB), which is not the case in many practical situations. The SVD and GSVD approaches are often too complex to implement on low-power devices and are therefore unusable in many practical applications.
Wind noise is increasingly disruptive to audio signals as the associated wind speed increases. The wind noise spectrum falls off as 1/f, where f is frequency, so wind noise has a strong effect on low-frequency audio signals. The frequency above which wind noise is not significant increases as the wind speed increases. For example, for wind speeds up to 12 mph, the resulting wind noise is typically significant up to about 500 Hz. For higher wind speeds (e.g., 13 to 24 mph), the wind noise can significantly affect the output signal up to approximately 2 kHz. Existing approaches for WND and WNR fail to provide the desired detection and reduction accuracies in the presence of high-speed wind. However, many practical applications involve use of microphones in outdoor environments where such winds are expected.
In various embodiments, a holographic WND subsystem analyzes multiple signals generated from microphone outputs to detect wind noise. These signals may correspond to analysis in two or more of the time domain, the frequency domain, and the spatial domain. The holographic WNR subsystem processes the output from one or more microphones to reduce wind noise. The processing techniques may modify the microphone output in two or more of the time domain, the frequency domain, and the spatial domain. The holographic WND and WNR subsystems can be user-configurable to support voice-trigger, ASR, and VoIP human listener applications. For example, in one embodiment, the WNR subsystem can be configured to focus the wind noise reduction only on the low frequency range up to 2 kHz for voice-trigger and ASR applications so that the voice signal remains uncorrupted above 2 kHz. As another example, for a VoIP human listener application, embodiments of the WNR subsystem can be configured to reduce wind noise up to 3.4 kHz for narrowband voice calls and up to 7.0 kHz for wideband voice calls.
System Overview
The microphone assembly 110 includes M microphones: microphone 1 112 and microphone 2 114 through microphone M 116. M may be any positive integer greater than or equal to two. The microphones 112, 114, 116 each have a location and orientation relative to each other. That is, the relative spacing and orientation of the microphones 112, 114, 116 are predetermined. For example, the microphone assembly 110 of a smartphone might include a stereo pair on the left and right edges of the device pointing forward and a single microphone on the back surface of the device.
The microphone assembly 110 outputs audio signals 120 that are analog or digital electronic representations of the sound waves detected by the corresponding microphones. Specifically, microphone 1 112 outputs audio signal 1 122, microphone 2 114 outputs audio signal 2 124, and microphone M 116 outputs audio signal M 126. In one embodiment, the individual audio signals 122, 124, 126 are composed of a series of audio frames. The m-th frame of an audio signal can be defined as [x(m, 0), x(m, 1), x(m, 2), . . . , x(m, L−1)], where L is the frame length in units of samples.
The WND subsystem 130 receives the audio signals 120 from the microphone assembly 110 and analyzes the audio signals to determine whether a significant amount of wind noise is present. The threshold amount of wind noise above which it is considered significant may be determined based on the use case. For example, if the determination of the presence of significant wind noise is used to trigger a wind noise reduction process (e.g., by the WNR subsystem 150), the threshold amount that is considered significant may be calibrated to balance the competing demands of improving the user experience and making efficient use of the device's computational and power resources. In one embodiment, the WND subsystem 130 analyzes the audio signals 120 in two or more of the time domain, the frequency domain, and the spatial domain. The WND subsystem 130 outputs a flag 140 indicating whether significant wind noise is present in the audio signals 120. Various embodiments of the WND subsystem 130 are described in greater detail below, with reference to
The WNR subsystem 150 receives the flag 140 and the audio signals 120. If the flag 140 indicates the WND subsystem 130 determined wind noise is present in the audio signals 120, the WNR subsystem 150 implements one or more techniques to reduce the wind noise. In one embodiment, the wind reduction techniques used are in two or more of the time domain, the frequency domain, and the spatial domain. The WNR subsystem 150 generates an output 160 that includes the modified audio signals 120 with reduced wind noise. In contrast, if the flag 140 indicates the WND subsystem 130 determined wind noise is not present, the WNR subsystem 150 has no effect on the audio signals 120. That is, the output 160 is the audio signals 120. Various embodiments of the WNR subsystem 150 are described in greater detail below, with reference to
Wind Noise Detection Subsystem
The WND subsystem 130 receives M audio signals 120, where M can be any positive integer greater than one. The energy module 210, the pitch module 220, the spectral centroid module 230, and the coherence module 240 each analyze the audio signals 120, make a determination as to whether significant wind noise is present, and produce an output indicating the determination made. The decision module 260 analyzes the outputs of the other modules and determines whether wind noise is present in the audio signals 120.
The energy module 210 performs analysis in the time domain to determine whether wind noise is present based on the energies of the audio signals 120. In one embodiment, the energy module 210 processes each frame of the audio signals 120 to generate a filtered signal [y(m, 0), y(m, 1), y(m, 2), . . . , y(m, L−1)]. The processing may include applying a low-pass filter (LPF), such as a 100 Hz second-order LPF (since wind noise energy dominates at frequencies lower than 100 Hz when both wind noise and voice are present together). The energies of the filtered signal and the original signal (i.e., E_low and E_total) are calculated by the energy module 210 as follows:
The ratio r_ene(m) between E_low(m) and E_total(m) may be calculated by the energy module 210 as follows:
In some embodiments, the energy module 210 smooths the ratio r_ene(m) as follows:
r_ene,sm(m) = r_ene,sm(m−1) + α * (r_ene(m) − r_ene,sm(m−1))   (4)
where α is a smoothing factor ranging from 0.0 to 1.0. This may increase the robustness of feature extraction. If the smoothed ratio r_ene,sm(m) (or, if smoothing is not used, the unsmoothed ratio r_ene(m)) is larger than an energy threshold (e.g., 0.45), the energy module 210 determines that frame m of the associated audio signal includes significant wind noise. If more than a threshold number (e.g., M/2) of the audio signals 120 indicate the presence of significant wind noise for a given frame, the energy module 210 outputs an indication 212 (e.g., a flag) that it has detected wind noise.
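By way of illustration only, the time-domain energy analysis above can be sketched as follows. The second-order LPF design (an RBJ-style biquad) and the helper names are assumptions for this sketch, not the claimed implementation; the energy sums, the ratio, and the one-pole smoothing correspond to Equations (1) through (4).

```python
import numpy as np

def biquad_lowpass(fc, fs, q=0.707):
    """Second-order low-pass biquad (assumed RBJ cookbook form)."""
    w0 = 2.0 * np.pi * fc / fs
    alpha = np.sin(w0) / (2.0 * q)
    cw = np.cos(w0)
    b = np.array([(1.0 - cw) / 2.0, 1.0 - cw, (1.0 - cw) / 2.0])
    a = np.array([1.0 + alpha, -2.0 * cw, 1.0 - alpha])
    return b / a[0], a / a[0]

def apply_biquad(b, a, x):
    """Direct-form I filtering of a 1-D frame."""
    y = np.zeros(len(x))
    x1 = x2 = y1 = y2 = 0.0
    for n, xn in enumerate(np.asarray(x, dtype=float)):
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y[n] = yn
    return y

def low_band_energy_ratio(frame, fs, fc=100.0):
    """r_ene = E_low / E_total for one audio frame (Equations (1)-(3))."""
    b, a = biquad_lowpass(fc, fs)
    y = apply_biquad(b, a, frame)
    e_low = float(np.sum(y ** 2))
    e_total = float(np.sum(np.asarray(frame, dtype=float) ** 2))
    return e_low / max(e_total, 1e-12)

def smooth(prev, current, alpha=0.1):
    """One-pole smoothing, Equation (4)."""
    return prev + alpha * (current - prev)
```

A frame dominated by low-frequency energy, as wind noise typically is, yields a ratio near 1.0, well above the example threshold of 0.45, while a frame dominated by higher-frequency content yields a ratio near 0.0.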
The pitch module 220 performs analysis in the time domain to determine whether wind noise is present based on the pitches of the audio signals 120. Wind noise generally does not have an identifiable pitch, so extracting pitch information from an audio signal can distinguish between wind noise and desired sound (e.g., a human voice). In one embodiment, each of the audio signals 120 is processed by a 2 kHz LPF, and the pitch f0 is estimated using an autocorrelation approach on the filtered signal. The obtained autocorrelation values may be smoothed over time. If a smoothed autocorrelation value (or unsmoothed value, if smoothing is not used) for a given frame of an audio signal is smaller than an autocorrelation threshold (e.g., 0.40), the pitch module 220 determines that significant wind noise is present in the given frame of the audio signal. If more than a threshold number (e.g., M/2) of the audio signals 120 indicate the presence of significant wind noise for the given frame, the pitch module 220 outputs an indication 222 (e.g., a flag) that it has detected wind noise.
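As an illustrative sketch of the pitch analysis above, the peak of the normalized autocorrelation over candidate pitch lags can serve as the indicator; a frame with no identifiable pitch (such as wind noise) yields a low peak. The lag range (60 Hz to 400 Hz) and function names are assumptions, and the 2 kHz LPF stage is omitted for brevity.

```python
import numpy as np

def max_normalized_autocorr(frame, fs, f_lo=60.0, f_hi=400.0):
    """Peak of the normalized autocorrelation over candidate pitch lags."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    energy = float(np.dot(x, x))
    if energy <= 0.0:
        return 0.0
    lag_min = int(fs / f_hi)                  # shortest lag = highest pitch
    lag_max = min(int(fs / f_lo), len(x) - 1)  # longest lag = lowest pitch
    best = 0.0
    for lag in range(lag_min, lag_max + 1):
        r = float(np.dot(x[:-lag], x[lag:])) / energy
        best = max(best, r)
    return best

def frame_has_wind(frame, fs, threshold=0.40):
    """Wind is indicated when no clear pitch is found (low autocorrelation)."""
    return max_normalized_autocorr(frame, fs) < threshold
```

A periodic frame (e.g., a 200 Hz tone) produces an autocorrelation peak near 1.0 at its pitch lag, while unpitched noise stays well below the example 0.40 threshold.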
The spectral centroid module 230 performs analysis in the frequency domain to determine whether wind noise is present based on the spectral centroids of the audio signals 120. The spectral centroid of an audio signal is correlated to the corresponding sound's brightness. Wind noise generally has a lower spectral centroid than desired sound. In various embodiments, each of the audio signals has a sampling rate, fs, in Hertz (Hz). The audio signals are processed using an N-point fast Fourier transform (FFT). For example, in one embodiment, fs=16 kHz and N=256.
The frequency resolution Δf is given by fs/N. Thus, the frequency at the J-th bin is given by fJ=J*Δf. This enables the bin in which a given frequency is placed to be calculated. For example, the 2.0 kHz frequency is in the J-th bin which can be obtained by the following equation:
J=integer of (2000.0/Δf) (5)
In one embodiment, the spectral centroid f_sc(m) in the m-th frame is calculated as follows:
where X(m, k) represents the magnitude spectrum of the time domain signal in the m-th frame at the k-th bin, and f(k) is the frequency of the k-th bin (i.e., f(k) = k * Δf). Alternatively, the spectral centroid f_sc may be calculated by replacing the magnitude spectrum with the power spectrum in Equation (6).
In some embodiments, the spectral centroid module 230 smooths f_sc(m) as follows:
f_sc,sm(m) = f_sc,sm(m−1) + β * (f_sc(m) − f_sc,sm(m−1))   (7)
where β is a smoothing factor ranging from 0.0 to 1.0. If the smoothed spectral centroid f_sc,sm(m) (or, if smoothing is not used, the unsmoothed spectral centroid f_sc(m)) for a given frame of an audio signal is less than a spectral centroid threshold (e.g., 40 Hz), the spectral centroid module 230 determines that significant wind noise is present in the given frame of the audio signal. If more than a threshold number (e.g., M/2) of the audio signals 120 indicate the presence of significant wind noise for the given frame, the spectral centroid module 230 outputs an indication 232 (e.g., a flag) that it detected wind noise.
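The magnitude-weighted centroid of Equation (6) can be sketched as follows; the FFT length and function name are illustrative assumptions.

```python
import numpy as np

def spectral_centroid(frame, fs, n_fft=256):
    """Magnitude-weighted mean frequency of one frame (Equation (6))."""
    x = np.asarray(frame[:n_fft], dtype=float)
    mag = np.abs(np.fft.rfft(x, n=n_fft))      # bins k = 0 .. N/2
    freqs = np.arange(mag.size) * fs / n_fft   # f(k) = k * delta_f
    denom = mag.sum()
    return float((freqs * mag).sum() / denom) if denom > 0.0 else 0.0
```

A low-frequency frame (wind-like) produces a low centroid, while brighter content pulls the centroid toward its dominant frequency.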
The coherence module 240 performs analysis in the spatial domain to determine whether wind noise is present based on the coherence between audio signals 120. In various embodiments, coherence is a metric indicating the degree of similarity between a pair of audio signals 120. Wind noise generally has very low coherence at lower frequencies (e.g., less than 6 kHz), even for relatively small spatial separations. For example, wind noise is typically incoherent between two microphones separated by 1.8 cm to 10 cm, with the coherence value of wind noise being close to 0.0 for frequencies up to 6 kHz, in contrast to larger values (e.g., above 0.25) for desired sound. The coherence metric may be in a range between 0.0 and 1.0, with 0.0 indicating no coherence and 1.0 indicating the pair of audio signals are identical. Other ranges of correlation values may be used.
In one embodiment, coherence module 240 calculates a set of coherence values at one or more frequencies in a range of interest (e.g., 0 Hz to 6 kHz) for each pair of audio signals 120. Thus, with M audio signals 120, there are K sets of coherence values, where K is the number of distinct pairs:

K = M * (M − 1) / 2
The coherence between a pair of audio signals 120 (e.g., x(t) and y(t)) may be calculated as the magnitude-squared coherence:

Cxy(f) = |Gxy(f)|^2 / (Gxx(f) * Gyy(f))
where Gxy(f) is the cross-spectral density (CSD) (or cross power spectral density (CPSD)) between microphone signals x(t) and y(t), and Gxx(f) and Gyy(f) are the auto-spectral densities of x(t) and y(t), respectively. The CSD (or CPSD) is the Fourier transform of the cross-correlation function, and the auto-spectral density is the Fourier transform of the autocorrelation function.
If a predetermined proportion (e.g., all) of the set of coherence values for a given frame of a pair of audio signals 120 are less than a coherence threshold (e.g., 0.25), this indicates that wind noise is present, because wind noise generally results in lower coherence values than desired sound. If more than a threshold number (e.g., K/2) of the pairs of audio signals 120 indicate the presence of wind noise in the given frame, the coherence module 240 outputs an indication 242 (e.g., a flag) that it detected wind noise.
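An illustrative computation of the per-pair coherence, using Welch-style averaging of the cross- and auto-spectral densities; the segment length, windowing, and function name are assumptions of this sketch. Note that averaging over multiple segments is required, since a single-segment estimate is identically 1.0.

```python
import numpy as np

def msc(x, y, fs, nseg=256):
    """Magnitude-squared coherence |Gxy|^2 / (Gxx * Gyy), Welch-averaged."""
    n = (min(len(x), len(y)) // nseg) * nseg
    xs = np.asarray(x[:n], dtype=float).reshape(-1, nseg)
    ys = np.asarray(y[:n], dtype=float).reshape(-1, nseg)
    win = np.hanning(nseg)
    X = np.fft.rfft(xs * win, axis=1)
    Y = np.fft.rfft(ys * win, axis=1)
    gxy = (np.conj(X) * Y).mean(axis=0)        # cross-spectral density
    gxx = (np.abs(X) ** 2).mean(axis=0)        # auto-spectral densities
    gyy = (np.abs(Y) ** 2).mean(axis=0)
    freqs = np.arange(gxy.size) * fs / nseg
    coh = np.abs(gxy) ** 2 / np.maximum(gxx * gyy, 1e-20)
    return freqs, coh
```

Identical signals yield coherence near 1.0 at every bin, while independent noise (wind-like in this respect) averages near 1/(number of segments), far below the example 0.25 threshold.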
The decision module 260 receives the output from the other modules and determines whether significant wind noise is likely present in each frame. In
In one embodiment, the decision module 260 determines wind noise is likely present if at least a threshold number of the indications (e.g., at least half) indicate the presence of wind noise for a given frame. If the decision module 260 makes such a determination, it outputs a flag 140 or other indication of the presence of wind noise. In the case of
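The at-least-half rule above can be sketched as a simple majority vote over the per-technique indications (e.g., 212, 222, 232, 242); the function name is illustrative.

```python
def decide_wind(indications, min_votes=None):
    """Combine per-technique wind flags with a majority vote.

    By default, wind noise is declared present when at least half of
    the indications report wind noise.
    """
    flags = [bool(f) for f in indications]
    if min_votes is None:
        min_votes = (len(flags) + 1) // 2  # "at least half" (ceiling)
    return sum(flags) >= min_votes
```

Other rules (e.g., requiring a specific module to agree) can be substituted without changing the surrounding structure.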
Wind Noise Reduction Subsystem
The WNR subsystem 150 receives the flag 140 (or other indication of wind noise) generated by the WND subsystem 130. The flag 140 is passed to one or more modules to initiate processing in one or more domains to reduce the wind noise in the audio signals 120 (e.g., audio signal 1 122, audio signal 2 124, and audio signal M 126). In the embodiment shown in
Processing in the time domain is performed by the cutoff frequency estimation module 310 and the ramped sliding HPF module 320. The cutoff frequency estimation module 310 estimates a cutoff-frequency, fc, for use in the time domain processing. In one embodiment, if the flag 140 indicates wind noise is not present, the cutoff frequency estimation module 310 sets fc as 80 Hz. If the flag 140 indicates wind noise is present, the cutoff frequency estimation module 310 calculates a cumulative energy from 80 Hz to 500 Hz for each of the audio signals 120. To reduce computational complexity, either the magnitude spectrum or power spectrum generated by the spectral centroid module 230 may be used to calculate the cumulative energy.
If the cumulative energy of the i-th audio signal (i=1, 2, . . . , M) at frequency fc,i is larger than a cumulative energy threshold (e.g., 200.0), then fc,i may be chosen as a potential cutoff frequency. The value for fc may be calculated as follows:
Thus, fc is dynamically adjusted between 80 Hz and 500 Hz.
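A sketch of the cutoff estimation follows. Because the combination rule of the equation above is not reproduced here, this sketch assumes fc is taken as the largest qualifying per-signal frequency fc,i, clamped to [80 Hz, 500 Hz]; the threshold value, bin math, and function name are likewise assumptions.

```python
import numpy as np

def estimate_cutoff(power_spectra, fs, n_fft=256,
                    e_thresh=200.0, f_min=80.0, f_max=500.0):
    """Estimate fc in [80 Hz, 500 Hz] from cumulative low-band energy.

    power_spectra: per-signal power (or magnitude) spectra, length N/2+1.
    fc,i is the first frequency at which the cumulative energy from
    f_min exceeds e_thresh; fc combines the per-signal candidates
    (assumed here: their maximum, clamped to [f_min, f_max]).
    """
    df = fs / n_fft                       # frequency resolution
    k_min, k_max = int(f_min / df), int(f_max / df)
    candidates = []
    for spec in power_spectra:
        cum = np.cumsum(np.asarray(spec, dtype=float)[k_min:k_max + 1])
        hits = np.nonzero(cum > e_thresh)[0]
        if hits.size:
            candidates.append((k_min + hits[0]) * df)
    if not candidates:
        return f_min                      # no wind energy: default 80 Hz
    return min(max(max(candidates), f_min), f_max)
```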
The ramped sliding HPF module 320 receives the fc value 312 and slides a ramped high-pass filter (HPF) in the frequency domain based on the fc value. In one embodiment, the ramped sliding HPF filter is a second order infinite impulse response (IIR) filter parameterized as follows. Define:
where Q is the quality factor (e.g., Q=0.707). The filter coefficients can then be defined as:
The filter coefficients may be normalized as follows:
HPF numerator B = [b0/a0, b1/a0, b2/a0]   (11)
HPF denominator A = [1.0, a1/a0, a2/a0]   (12)
In one embodiment, when the flag 140 indicates wind noise is present, the ramped sliding HPF module 320 linearly ramps the filter coefficients on each processed audio sample according to coefficient increments (e.g., 0.01). The original A and B vectors of the coefficients are kept unchanged. The increments and the ramping length may be selected such that the filter coefficients reach their final values at the end of the ramping. At the end of ramping, the ramping function may be set to bypass mode, which uses the original A and B vectors, to reduce the computational complexity. Generally, each of the audio signals 120 is processed by the same ramped dynamic sliding HPF although, in some embodiments, one or more audio signals may be processed differently.
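The filter definitions referenced above are consistent with a standard second-order high-pass biquad; the sketch below assumes the RBJ audio-EQ-cookbook form for the unnormalized coefficients and shows the normalization of Equations (11) and (12) together with a linear coefficient ramp. The cookbook form is an assumption, not a statement of the claimed design.

```python
import math

def hpf_coefficients(fc, fs, q=0.707):
    """Normalized second-order HPF coefficients (Equations (11)-(12)).

    Assumes the RBJ cookbook biquad HPF for b0..b2, a0..a2.
    """
    w0 = 2.0 * math.pi * fc / fs
    alpha = math.sin(w0) / (2.0 * q)
    cw = math.cos(w0)
    b0 = (1.0 + cw) / 2.0
    b1 = -(1.0 + cw)
    b2 = (1.0 + cw) / 2.0
    a0 = 1.0 + alpha
    a1 = -2.0 * cw
    a2 = 1.0 - alpha
    B = [b0 / a0, b1 / a0, b2 / a0]   # Equation (11)
    A = [1.0, a1 / a0, a2 / a0]       # Equation (12)
    return B, A

def ramp_coefficients(old, new, n_steps):
    """Linearly ramp each coefficient from old to new over n_steps samples."""
    return [[o + (n - o) * (i + 1) / n_steps for o, n in zip(old, new)]
            for i in range(n_steps)]
```

A high-pass filter has zero gain at DC, which is why the numerator coefficients of Equation (11) sum to zero; the ramp reaches the target coefficients exactly at its final step, after which bypass mode can use the static A and B vectors.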
The adaptive beamforming module 330 processes the audio signals 120 in the spatial domain using an adaptive beamformer. In one embodiment, a differential beamformer is used. The differential beamformer may boost signals that have low correlation between the audio signals 120, particularly at low frequencies. Therefore, a constraint or regularization rule may be used when determining the beamformer coefficients to limit wind noise, which has low correlation at low frequencies. This results in differential beams that have omnidirectional patterns below a threshold frequency (e.g., 500 Hz).
In another embodiment, the adaptive beamforming module 330 uses a minimum variance distortionless response (MVDR) beamformer. The signal-to-noise ratio (SNR) of the output of this type of beamformer is given by:

SNR = (σs^2 * |W^H a(θ)|^2) / (W^H Rn W)   (13)
where W is a complex weight vector, H denotes the Hermitian transpose, Rn is the estimated noise covariance matrix, σs^2 is the desired signal power, and a(θ) is a known steering vector at direction θ. The beamformer output signal at time instant n can be written as y(n) = W^H x(n).
In the case of a point source, the MVDR beamformer may be obtained by minimizing the denominator of the above SNR Equation (13) by solving the following optimization problem:
min_W (W^H Rn W) subject to W^H a(θ) = 1   (14)
where WHa(θ)=1 is the distortionless constraint applied to the signal of interest.
The solution of the optimization problem (14) can be found as follows:
W = λ Rn^−1 a(θ)   (15)
where (·)^−1 denotes the inverse of a positive definite square matrix and λ is a normalization constant that does not affect the output SNR of Equation (13) and can be omitted in some implementations for simplicity.
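As an illustrative sketch, the MVDR solution of Equation (15) can be computed with λ chosen so that the distortionless constraint of Equation (14) holds exactly; the function name is an assumption.

```python
import numpy as np

def mvdr_weights(Rn, a):
    """MVDR weights W = lambda * Rn^{-1} a(theta) (Equation (15)).

    lambda is chosen so the distortionless constraint W^H a(theta) = 1
    of Equation (14) is satisfied. Rn must be a Hermitian positive
    definite noise covariance matrix.
    """
    a = np.asarray(a).reshape(-1, 1)
    rinv_a = np.linalg.solve(Rn, a)          # Rn^{-1} a, no explicit inverse
    return rinv_a / (a.conj().T @ rinv_a)    # normalize so W^H a = 1
```

Solving the linear system rather than forming Rn^−1 explicitly is numerically preferable, which matters for low-power devices with limited precision.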
Regardless of the specific type of beam former and parameterization approach used, the adaptive beamforming module 330 applies the adaptive beamformer to the audio signals 120 to compensate for the wind noise.
The adaptive spectral shaping module 340 processes the audio signals 120 in the frequency domain using a spectral filtering approach (spectral shaping). The spectral shape of the spectral filter is dynamically estimated from a frame having wind noise. The spectral shaping suppresses wind noise in the frequency domain.
In one embodiment, the spectrum of the estimated clean sound of interest in the frequency domain is modeled as follows:
|X(m, k)| = H(m, k) * |Y(m, k)|,  k = 0, 1, . . . , N/2   (16)
where H(m, k) and |Y(m, k)| are the spectral weight and input magnitude spectrum at the k-th bin in the m-th frame, and N is the FFT length. The wind noise spectral shape |W(m, k)|^2 in the m-th frame at the k-th bin can be estimated from the input spectrum when the flag 140 indicates the presence of wind noise. The frequency at the k-th bin is given by fk = k * fs/N (Hz), where fs is the sampling rate.
The frequency domain can be split into two portions by a frequency limit, fLimit. Above fLimit, the adaptive spectral shaping module 340 may perform no (or limited) spectral shaping, while below fLimit, spectral shaping may be used to suppress wind noise. For example, without loss of generality, assume that fLimit is 2 kHz for voice-trigger and ASR applications, 3.4 kHz for narrowband voice calls, and 7.0 kHz for wideband voice calls. The spectral weight can be set to H(m, k) = 1.0 when fk ≥ fLimit; otherwise, H(m, k) can be calculated through one of the following suppression rules:
where μ is a weighting parameter between 0.0 and 1.0. The values of the spectral weight may be constrained such that 0.0 < H(m, k) ≤ 1.0.
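Since the specific suppression rules are not reproduced here, the sketch below substitutes a spectral-subtraction-style rule as one plausible instance; μ, the weight floor, and the function names are assumptions. Bins at or above fLimit receive unity weight, and the noisy phase is reused when the shaped magnitude is converted back to a complex spectrum.

```python
import numpy as np

def spectral_weights(mag_in, wind_mag, fs, n_fft, f_limit=2000.0,
                     mu=0.9, floor=0.05):
    """Per-bin weights H(m, k): unity at/above f_limit, suppression below.

    mag_in:  input magnitude spectrum |Y(m, k)|, bins 0..N/2.
    wind_mag: estimated wind noise magnitude shape (assumed available).
    """
    k = np.arange(mag_in.size)
    fk = k * fs / n_fft                       # frequency of the k-th bin
    ratio = np.minimum(wind_mag / np.maximum(mag_in, 1e-12), 1.0)
    h = np.clip(1.0 - mu * ratio, floor, 1.0)  # keep 0 < H <= 1
    h[fk >= f_limit] = 1.0                     # no shaping at/above f_limit
    return h

def apply_shaping(mag_in, phase, h):
    """|X(m, k)| = H(m, k) * |Y(m, k)| (Equation (16)), noisy phase reused."""
    return h * mag_in * np.exp(1j * phase)
```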
Unlike the WNDR system 100 shown in
The decision module 460 makes a determination of whether noise is present based on the indications 412, 422, 432. In one embodiment, the decision module 460 determines wind noise is present if at least two of the indications 412, 422, 432 indicate the corresponding module detected wind noise. In other embodiments, other rules or conditions may be used to determine whether wind noise is present.
The WNR subsystem 450 receives an indication 440 (e.g., a flag) from the decision module 460 indicating whether wind noise is present. The WNR subsystem 450 includes a cutoff frequency estimation module 470 and a ramped sliding HPF module 480 that process the audio signal 420 in the time domain. The WNR subsystem 450 also includes an adaptive spectral shaping module 490 that processes the audio signal in the frequency domain.
The cutoff frequency estimation module 470 determines a cutoff frequency value 472, fc, from the audio signal 420 and the ramped sliding HPF module 480 applies a ramped sliding HPF to the audio signal. These modules operate in a similar manner to their counterparts in FIG. 3 except that they apply time domain processing to a single audio signal 420, rather than multiple audio signals 120. Likewise, the adaptive spectral shaping module 490 processes the audio signal 420 in the frequency domain in a similar manner to its counterpart in
In the embodiment shown in
The WND subsystem 130 applies 520 multiple wind noise detection techniques to the set of audio signals 120. Each wind noise detection technique generates a flag or other indication of whether wind noise was determined to be present. For example, as described above with reference to
The WND subsystem 130 determines 530 whether wind noise is present in the audio signals 120 based on flags or other indications generated by the wind noise detection techniques. In one embodiment, the WND subsystem 130 determines 530 that wind noise is present if two or more of the wind detection techniques generate an indication of wind noise. In other embodiments, other rules may be applied to determine 530 whether wind noise is present. Regardless of the precise approach used, the WND subsystem 130 generates 540 an indication of whether wind noise is present in the audio signals 120.
If the WND subsystem 130 determines wind noise is present, the WNR subsystem 150 applies 550 one or more processing techniques to the audio signals 120 to reduce the wind noise. As described previously, with reference to
Additional Configuration Information
The foregoing description of the embodiments has been presented for illustration; it is not intended to be exhaustive or to limit the patent rights to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible considering the above disclosure.
Some portions of this description describe the embodiments in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all the steps, operations, or processes described.
Embodiments may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus. Furthermore, any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability. Embodiments may also relate to a product that is produced by a computing process described herein. Such a product may comprise information resulting from a computing process, where the information is stored on a non-transitory, tangible computer readable storage medium and may include any embodiment of a computer program product or other data combination described herein.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the patent rights. It is therefore intended that the scope of the patent rights be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments is intended to be illustrative, but not limiting, of the scope of the patent rights, which is set forth in the following claims.
This application is a continuation of co-pending U.S. application Ser. No. 16/815,664, filed Mar. 11, 2020, which is incorporated by reference in its entirety.
Number | Name | Date | Kind |
---|---|---|---|
9343056 | Goodwin | May 2016 | B1 |
9373340 | Hetherington | Jun 2016 | B2 |
10249322 | Nelke et al. | Apr 2019 | B2 |
10341759 | Dusan | Jul 2019 | B2 |
10425731 | Inoue | Sep 2019 | B2 |
20040161120 | Petersen et al. | Aug 2004 | A1 |
20120140946 | Yen et al. | Jun 2012 | A1 |
20120310639 | Konchitsky | Dec 2012 | A1 |
20130308784 | Dickins et al. | Nov 2013 | A1 |
20150213811 | Elko et al. | Jul 2015 | A1 |
20180090153 | Hosh et al. | Mar 2018 | A1 |
20180277138 | Kudryavtsev et al. | Sep 2018 | A1 |
20190043520 | Kar et al. | Feb 2019 | A1 |
Number | Date | Country | |
---|---|---|---|
Parent | 16815664 | Mar 2020 | US |
Child | 17549697 | US |