This application relates generally to systems that process audio signals, such as speech signals, to remove undesired noise components therefrom.
The term noise suppression generally describes a signal processing technique that attempts to attenuate or remove an undesired noise component from an input signal. Noise suppression may be applied to almost any type of input signal that may include an undesired/interfering component such as a noise component. For example, noise suppression functionality is often implemented in telecommunications devices, such as telephones, Bluetooth® headsets, or the like, to attenuate or remove an undesired background noise component from an input speech signal. In general, an input speech signal may be viewed as comprising both a desired speech component (sometimes referred to as “clean speech”) and a background noise component. Removing the background noise component from the input speech signal ideally leaves only the desired speech component as output.
In multi-microphone systems, noise suppression is often implemented based on the Generalized Sidelobe Canceler (GSC). The GSC consists of a fixed beamformer, a blocking matrix, and an adaptive noise canceler. In the most general case, the fixed beamformer functions to filter M input speech signals received from M microphones to create a so-called speech reference signal comprising a desired speech component and a background noise component. The blocking matrix creates M−1 background noise references by spatially suppressing the desired speech component in the M input speech signals. The adaptive noise canceler then uses the M−1 background noise references to estimate the background noise component in the speech reference signal produced by the fixed beamformer, and suppresses the estimated background noise component from the speech reference signal, thereby ideally leaving only the desired speech component as output.
However, in some multi-microphone systems, at least one microphone is dedicated as a noise reference microphone and at least one microphone is dedicated as a primary speech microphone. The noise reference microphone is positioned to be relatively far from a desired speech source during regular use of the multi-microphone system. In fact, the noise reference microphone can be positioned to be as far from the desired speech source as possible during regular use of the multi-microphone system. Therefore, the input speech signal received by the noise reference microphone often will have a very poor signal-to-noise ratio (SNR). The primary speech microphone, on the other hand, is positioned to be relatively close to the desired speech source during regular use and, as a result, usually receives an input speech signal that has a much better SNR compared to the input speech signal received by the noise reference microphone.
In these multi-microphone systems, with a dedicated noise reference microphone and primary speech microphone, the traditional delay-and-sum fixed beamformer structure of the GSC (described above) may not make much sense because it can result in a speech reference signal with an SNR that is worse than that of the unprocessed input speech signal received by the primary speech microphone. In general, it is possible to get constructive interference between the desired speech components of input speech signals received by multiple microphones using the traditional delay-and-sum fixed beamformer structure. However, in the case of a multi-microphone system with a noise reference microphone and a primary speech microphone as described above, the traditional delay-and-sum fixed beamformer structure is often unable to improve the SNR compared to the primary speech microphone because of the poor SNR of the input speech signal received by the noise reference microphone. Thus, using the traditional delay-and-sum fixed beamformer structure in such a multi-microphone system often will result in a speech reference signal that has a worse SNR than that of the input speech signal received by the primary speech microphone.
Moreover, adaptive algorithms (e.g., a least mean square adaptive algorithm) conventionally used to derive the filters for the blocking matrix and the adaptive noise canceler of the GSC are often slow to converge.
Therefore, what is needed is an approach to multi-channel noise suppression that does not rely on the traditional delay-and-sum fixed beamformer structure of the GSC and/or on slow-to-converge adaptive algorithms for deriving the filters used to suppress noise.
The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the pertinent art to make and use the invention.
The present invention will be described with reference to the accompanying drawings. The drawing in which an element first appears is typically indicated by the leftmost digit(s) in the corresponding reference number.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be apparent to those skilled in the art that the invention, including structures, systems, and methods, may be practiced without these specific details. The description and representation herein are the common means used by those experienced or skilled in the art to most effectively convey the substance of their work to others skilled in the art. In other instances, well-known methods, procedures, components, and circuitry have not been described in detail to avoid unnecessarily obscuring aspects of the invention.
References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
As noted in the background section above, certain multi-microphone systems include a primary speech microphone and a noise reference microphone. The primary speech microphone is positioned to be close to a desired speech source during regular use of the multi-microphone system, whereas the noise reference microphone is positioned to be farther from the desired speech source during regular use of the multi-microphone system. Therefore, the input speech signal received by the primary speech microphone typically will have a better SNR compared to the input speech signal received by the noise reference microphone. In these multi-microphone systems, if the SNR on the noise reference microphone is much worse than the SNR on the primary speech microphone, then the use of a traditional delay-and-sum fixed beamformer structure to suppress background noise generally does not make much sense because it can result in a speech reference signal with an SNR that is worse than that of the unprocessed input speech signal received by the primary speech microphone.
The multi-channel noise suppression systems and methods described herein omit the traditional delay-and-sum fixed beamformer in devices that include a primary speech microphone and at least one noise reference microphone as noted above. The multi-channel noise suppression systems and methods use a blocking matrix (BM) to remove desired speech in the input speech signal received by the noise reference microphone to get a “cleaner” background noise component. Then, an adaptive noise canceler (ANC) is used to remove the background noise in the input speech signal received by the primary speech microphone based on the “cleaner” background noise component to achieve noise suppression.
In accordance with embodiments described herein, the filters implemented by the BM and ANC are derived using closed-form solutions that require calculation of time-varying statistics (for frequency domain implementations) of complex signals in the noise suppression system. Conventionally, adaptive algorithms that are potentially slow to converge have been used to derive such filters. Furthermore, in accordance with embodiments described herein, spatial information embedded in the input speech signals received by the primary speech microphone and the noise reference microphone is exploited to estimate the necessary time-varying statistics to perform closed-form calculations of the filters implemented by the BM and ANC.
It should be noted that, wherever a difference in energy between two signals is used to perform a function or determine a subsequent value as described below (where the difference in energy can be calculated, for example, by subtracting the log-energies of the two signals), a difference in level between the two signals (i.e., difference in signal level) can be used instead.
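As a minimal sketch of the log-energy subtraction mentioned above, the following computes the energy difference between two frames in decibels. The function names, the dB scaling, the energy floor, and the sample values are all illustrative assumptions and are not taken from the specification.

```python
import math

def log_energy_db(frame, floor=1e-12):
    """Log-energy of one frame in dB; the floor (an assumption) avoids log(0)."""
    energy = sum(abs(x) ** 2 for x in frame)
    return 10.0 * math.log10(max(energy, floor))

def energy_difference_db(primary_frame, reference_frame):
    """Difference in energy, computed by subtracting the two log-energies."""
    return log_energy_db(primary_frame) - log_energy_db(reference_frame)

# Hypothetical frames: the reference frame is the primary frame scaled by 0.1.
primary = [0.5, -0.4, 0.3, -0.2]
reference = [0.05, -0.04, 0.03, -0.02]
diff_db = energy_difference_db(primary, reference)
```

Because the hypothetical reference frame has one tenth of the amplitude, its energy is one hundredth of the primary frame's energy, so the difference is 20 dB.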
As shown in
Although the input speech signals received by primary speech microphone 104 and noise reference microphone 106 will each contain desired speech and background noise components, by positioning primary speech microphone 104 so that it is closer to the user's mouth than noise reference microphone 106 during regular use, the level of the user's speech that is captured by primary speech microphone 104 is likely to be greater than the level of the user's speech that is detected by noise reference microphone 106. This, along with the observation that noise sources which are further from the device will produce approximately similar levels on the two microphones, can be exploited to effectively estimate the necessary statistics to calculate filter coefficients for suppressing background noise as will be described further below in regard to
It should be noted that primary speech microphone 104 and noise reference microphone 106 are shown positioned on the respective front and back portions of wireless communication device 102 for illustrative purposes only, and this arrangement is not intended to be limiting. Persons skilled in the relevant art(s) will recognize that primary speech microphone 104 and noise reference microphone 106 can be positioned in any suitable locations on wireless communication device 102.
It should be further noted that a single noise reference microphone 106 is shown in
Moreover, primary speech microphone 104 and noise reference microphone 106 are respectively shown in
Referring now to
As shown in
After N̂2(m, f) has been obtained, ANC 310 is configured to estimate and remove the undesirable background noise component N1(m, f) in P(m, f) to provide, as output, the noise suppressed primary input speech signal Ŝ1(m, f). More specifically, ANC 310 includes an adaptive noise canceler filter 325 configured to filter the “cleaner” background noise component N̂2(m, f) to provide an estimate of the background noise component N1(m, f) in P(m, f). ANC 310 then subtracts the estimated background noise component N̂1(m, f) from P(m, f) using subtractor 330 to provide, as output, the noise suppressed primary input speech signal Ŝ1(m, f).
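The two-stage structure described above (blocking matrix followed by adaptive noise canceler) can be sketched for one frame as follows. This is an illustration only: it assumes one complex tap per frequency bin for both filters, and the function name, the coupling values `h` and `g`, and the signal values are hypothetical.

```python
def suppress_frame(P, R, H, W):
    """Process one frame of complex bins.

    P, R: primary and reference input bins; H, W: per-bin complex taps
    of the blocking matrix filter and the adaptive noise canceler filter.
    """
    # Blocking matrix: subtract the estimated desired speech from R.
    noise_ref = [r - h * p for r, h, p in zip(R, H, P)]
    # Adaptive noise canceler: subtract the estimated noise from P.
    speech_out = [p - w * n for p, w, n in zip(P, W, noise_ref)]
    return speech_out, noise_ref

# Hypothetical two-bin example with idealized acoustic coupling.
h = [0.10 + 0.05j, 0.20 - 0.10j]   # speech coupling, primary -> reference
g = [0.80 - 0.20j, 0.60 + 0.30j]   # noise coupling, reference -> primary
S1 = [1.0 + 2.0j, -0.5 + 0.3j]     # desired speech at the primary microphone
N2 = [0.4 - 0.1j, 0.2 + 0.7j]      # background noise at the reference microphone
P = [s + gi * n for s, gi, n in zip(S1, g, N2)]
R = [hi * s + n for hi, s, n in zip(h, S1, N2)]

H = h                                               # ideal blocking filter
W = [gi / (1 - hi * gi) for hi, gi in zip(h, g)]    # matching ideal canceler
S1_hat, N2_hat = suppress_frame(P, R, H, W)
```

In this idealized case the blocking matrix output contains only (scaled) noise, and the canceler output equals the desired speech. With real signals the filters only approximate the acoustic channels, so some residual remains.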
In an embodiment, and as illustrated in
Although system 300 is described above as being implemented in wireless communication device 102 illustrated in
In the sub-sections that follow, exemplary derivations of closed form solutions for a frequency domain blocking matrix filter 315 and a hybrid approach blocking matrix filter 315 are described. In addition, exemplary derivations of closed form solutions for a frequency domain adaptive noise canceler filter 325 and a hybrid approach adaptive noise canceler filter 325 are described.
2.1 The Blocking Matrix
As noted above, BM 305 includes a blocking matrix filter 315 configured to filter the primary input speech signal P(m, f) to provide an estimate of the desired speech component S2(m, f) in the reference input speech signal R(m, f). BM 305 then subtracts the estimated desired speech component Ŝ2(m, f) from R(m, f) using subtractor 320 to provide the “cleaner” background noise component N̂2(m, f).
Ideally, no residual amount of the desired speech component S2(m, f) is left in the “cleaner” background noise component N̂2(m, f). However, because of the time-varying nature of the signals processed by BM 305 and the inability of the blocking matrix filter to perfectly model the acoustic channel for the desired speech between the two microphones, often some residual amount of the desired speech component S2(m, f) will be left in the “cleaner” background noise component N̂2(m, f). This residual amount of the desired speech component S2(m, f) can be observed at the output of BM 305 (i.e., based on N̂2(m, f)) during periods of time (or frames) when mostly desired speech, and little or no background noise, makes up the primary input speech signal P(m, f). If BM 305 is functioning well, the output of BM 305, N̂2(m, f), should be nearly zero during these periods of time (or frames). The residual amount of desired speech component S2(m, f) can be simply expressed as:
where H(f) is the transfer function of blocking matrix filter 315, m indexes the time or frame, and f indexes a particular frequency component or sub-band.
To achieve the objective of removing the desired speech component S2(m, f) in the reference input speech signal R(m, f), the transfer function H(f) of blocking matrix filter 315 can be derived (or updated) to substantially minimize the power of the residual signal expressed in Eq. (1) during periods of time (or frames) when the primary input speech signal P(m, f) is predominantly equal to the desired speech component S1(m, f). The power of the residual signal, also referred to as a cost function, can be expressed as:
where ( )* indicates complex conjugate.
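For orientation, the residual signal and its power described in the preceding paragraphs can be written out as below. This is a sketch reconstructed from the surrounding definitions (the blocking matrix output is the reference signal minus the filtered primary signal). The symbol for the cost function and the summation over the frames m used for the update are assumptions.

```latex
\hat{N}_2(m,f) = R(m,f) - H(f)\,P(m,f)

E_{\hat{N}_2}(f) = \sum_{m} \hat{N}_2(m,f)\,\hat{N}_2^{*}(m,f)
```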
In the following sub-sections, a frequency domain blocking matrix filter 315 and a hybrid approach blocking matrix filter 315 are derived (or updated) based on this cost function.
2.1.1 Example Derivation of Frequency Domain Blocking Matrix Filter
The frequency domain blocking matrix filter 315 is derived (or updated) based on a closed form solution below assuming a single complex tap per frequency bin. However, persons skilled in the relevant art(s) will recognize based on the teachings herein that the proposed solution can be generalized to multiple taps per bin.
The cost function expressed in Eq. (2) is expanded as:
The gradient of E{circumflex over (N)}
by inserting:
resulting in:
where CR,P*(f) and CP,P*(f) represent time-varying statistics derived (or updated) during periods of time (or frames) when the primary input speech signal P(m, f) is predominantly equal to the desired speech component S1(m, f). This condition can be quantified as the energy of the desired speech exceeding the energy of the background noise by a significant margin. The statistics can be expressed as:
The condition that these statistics be derived (or updated) when the energy of the desired speech is greater than the energy of the background noise in primary input speech signal P(m, f) by a large degree means that reference input speech signal R(m, f) and primary input speech signal P(m, f) generally are dominated by desired speech and ideally include only desired speech. Thus, the calculation of CR,P*(f) as the sum of products of the reference input speech signal R(m, f) and the complex conjugate of the primary input speech signal P(m, f) at a given frequency bin f for some number of frames can be seen as a way of estimating the cross-spectrum at that frequency bin between the desired speech component in the reference input speech signal R(m, f) and the desired speech component in the primary input speech signal P(m, f). Consequently, CR,P*(f) can be referred to as the cross-channel statistics of the desired speech, or just desired speech cross-channel statistics.
Similarly, the calculation of CP,P*(f) as the sum of products of the primary input speech signal P(m, f) and its own complex conjugate at a given frequency bin f for some number of frames can be seen as a way of estimating the power spectrum at that frequency bin of the desired speech component in the primary input speech signal P(m, f). Consequently, CP,P*(f) can be referred to as the desired speech statistics of the primary input speech signal.
Collectively, the cross-channel statistics of the desired speech and the desired speech statistics of the primary input speech signal can be referred to as simply the desired speech statistics. Further details and variants on the method of calculating the desired speech statistics are provided below in section 3.
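The single-tap closed-form solution can be sketched in code as follows. The sketch accumulates the two desired speech statistics described above (sums of R times the conjugate of P, and of P times its own conjugate) over frames flagged as speech-dominated, then takes their ratio per bin. The function name, the `speech_dominant` flags, and the test signals are hypothetical. How frames are classified is not specified here and is left as an input.

```python
def bm_single_tap(P_frames, R_frames, speech_dominant):
    """Estimate one complex blocking-matrix tap per bin.

    P_frames[m][f], R_frames[m][f]: complex bins of frame m.
    speech_dominant[m]: True if frame m should update the statistics.
    Returns H[f] = C_RP[f] / C_PP[f].
    """
    num_bins = len(P_frames[0])
    C_RP = [0j] * num_bins    # cross-channel statistics of the desired speech
    C_PP = [0.0] * num_bins   # desired speech statistics of the primary signal
    for P, R, use in zip(P_frames, R_frames, speech_dominant):
        if not use:
            continue
        for f in range(num_bins):
            C_RP[f] += R[f] * P[f].conjugate()
            C_PP[f] += abs(P[f]) ** 2
    # Guard against bins that never received an update.
    return [C_RP[f] / C_PP[f] if C_PP[f] > 0 else 0j for f in range(num_bins)]

# Hypothetical example: two bins, two speech frames, one excluded frame.
h_true = [0.2 + 0.1j, -0.3j]
P_frames = [
    [1 + 1j, 2 - 1j],
    [0.5 - 2j, -1 + 0.5j],
    [3 + 0j, 1 + 1j],          # treated as noise-dominated, excluded
]
R_frames = [
    [h_true[0] * P_frames[0][0], h_true[1] * P_frames[0][1]],
    [h_true[0] * P_frames[1][0], h_true[1] * P_frames[1][1]],
    [2 - 1j, -1 - 1j],
]
speech_dominant = [True, True, False]
H = bm_single_tap(P_frames, R_frames, speech_dominant)
```

In this noise-free example the estimate recovers the hypothetical coupling `h_true` exactly, up to rounding.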
In the embodiment where blocking matrix filter 315 is implemented in the frequency domain by multiplication, statistics estimator 335, illustrated in
2.1.2 Example Derivation of Hybrid Approach Blocking Matrix Filter
A hybrid variation of blocking matrix filter 315 in accordance with an embodiment of the present invention will now be described. The hybrid variation combines the frequency domain approach described above with a time domain approach. This can be a practical solution to performing noise suppression within a sub-band based audio system where an increased frequency resolution is desirable for the noise suppressor. The limited frequency resolution is expanded by applying a low-order time domain solution to individual frequency bins or sub-bands. This also offers the possibility of expanding the frequency resolution based on a psycho-acoustically motivated frequency resolution, e.g., expanding low frequency regions more than high frequency regions. As a practical example, one may have a sub-band decomposition with 32 complex sub-bands in 0 to 4 kHz. This provides a spectral resolution of 125 Hz, which may be inadequate. Instead of expanding the spectral resolution of all sub-bands to roughly 31 Hz (125 Hz divided by 4) by a 4th order noise suppression filter, it may be desirable to expand the low sub-bands by 4, the middle sub-bands by 2, and leave the upper sub-bands at the native resolution.
The hybrid approach changes the “filtering” with the transfer function H(f) from:
$\hat{S}_2(m,f) = H(f)\,P(m,f)$ (10)
to:
where m indexes the time or frame, f indexes a particular sub-band, and k=0, 1 . . . K indexes the individual filter coefficients for a particular frequency index f, making up the noise suppression time direction filter in that particular frequency bin. Hence, the term time direction filter can be used to refer to the individual noise suppression filters that filter the frequency bins, or sub-band signals, of the primary input speech signal P(m, f) in the time direction.
The residual signal in Eq. (1) can be rewritten based on Eq. (11) as follows:
Substituting Eq. (12) into Eq. (2), the gradient of E{circumflex over (N)}
The set of K+1 equations (for k=0, 1, . . . K) of Eq. (13) provides a matrix equation for every frequency bin f to solve for H(k, f), where k=0, 1, . . . K:
This solution can be written as:
$R_P(f)\cdot H(f) = r_{R,P^*}(f)$ (15)
where:
and the superscript T denotes non-conjugate transpose. The solution per frequency bin to the time direction filter is thus given by:
$H(f) = \left(R_P(f)\right)^{-1}\cdot r_{R,P^*}(f)$ (19)
This solution appears to require a matrix inversion, but in most practical applications a matrix inversion is not needed.
In the embodiment where blocking matrix filter 315 is implemented based on the hybrid approach, statistics estimator 335 is configured to derive (or update) estimates of the statistics expressed in Eq. (16) and Eq. (17) and provide the estimates to controller 340. Controller 340 is then configured to use the estimates of the statistics to configure blocking matrix filter 315. For example, controller 340 can use these values to configure blocking matrix filter 315 in accordance with the transfer function H(f) expressed in Eq. (19), although this is only one example.
Comparing Eq. (16) and Eq. (17) to Eq. (9) and Eq. (8), respectively, it can be seen that similar statistics are calculated by each set of equations, except that instead of calculating statistics only between current frequency bin components of signals, the hybrid solution requires calculation of statistics between vectors of current and past frequency bin components of signals, i.e., a time dimension is now part of the statistics. At the extreme, with no Discrete Fourier Transform (DFT), i.e., a single full band signal (the time domain signal), the hybrid method becomes a pure time domain method, and hence the solution above also provides the solution for a pure time domain approach. The frequency index would become obsolete (as there is only one frequency band), and the signal vectors in the time direction would contain the signal time domain samples. A further simplification in that case is that the time domain signal without DFT is real and not complex, as in the case of the DFT bins or if a complex sub-band analysis has been applied.
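The hybrid solution for one frequency bin can be sketched as below. The sketch builds the least-squares normal equations for a time direction filter with K+1 taps and solves them by Gaussian elimination, which avoids forming an explicit matrix inverse. The row and column layout of the statistics matrix is chosen for this illustration and may differ from the layout of the original equations. The function names and the test signal are hypothetical.

```python
import random

def solve_linear(A, b):
    """Solve A x = b (complex, n x n) by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        if abs(M[pivot][col]) == 0:
            raise ValueError("singular statistics matrix")
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for j in range(col, n + 1):
                M[row][j] -= factor * M[col][j]
    x = [0j] * n
    for i in range(n - 1, -1, -1):                # back substitution
        acc = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / M[i][i]
    return x

def hybrid_bm_taps(p, r, K):
    """Estimate taps H(0..K) for one bin from sub-band samples p[m], r[m]."""
    A = [[0j] * (K + 1) for _ in range(K + 1)]    # statistics of P with its past
    b = [0j] * (K + 1)                            # cross statistics of R and P
    for m in range(K, len(p)):
        past = [p[m - k] for k in range(K + 1)]   # P(m), P(m-1), ..., P(m-K)
        for j in range(K + 1):
            b[j] += r[m] * past[j].conjugate()
            for k in range(K + 1):
                A[j][k] += past[k] * past[j].conjugate()
    return solve_linear(A, b)

# Hypothetical example: recover a known 3-tap time direction filter.
rng = random.Random(0)
K = 2
h_true = [0.3 + 0.1j, -0.2j, 0.05 + 0j]
p = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(40)]
r = [sum(h_true[k] * p[m - k] for k in range(K + 1) if m - k >= 0)
     for m in range(len(p))]
H = hybrid_bm_taps(p, r, K)
```

With K = 0 the matrix equation reduces to a single division per bin, which is the single-tap frequency domain case.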
2.1.3 Alternative Approach to Blocking Matrix
As discussed above, to achieve the objective of removing the desired speech component S2(m, f) in the reference input speech signal R(m, f), the transfer function H(f) of blocking matrix filter 315 can be derived (or updated) to substantially minimize the power of the residual signal, also referred to as a cost function, expressed in Eq. (2) during periods of time (or frames) when the primary input speech signal P(m, f) is predominantly desired speech.
As an alternative method to achieve the objective of removing the desired speech component S2(m, f) in the reference input speech signal R(m, f), the transfer function H(f) of blocking matrix filter 315 can be derived (or updated) to substantially minimize the power of the difference between the background noise component N2(m, f) in the reference input speech signal R(m, f) and the output of BM 305, N̂2(m, f). The power of the difference between the background noise component N2(m, f) and the output of BM 305, N̂2(m, f), can be expressed as:
where ( )* indicates complex conjugate.
Accommodating the hybrid approach, from Eq. (20) the gradient of E{circumflex over (N)}
Using the definitions of sub-section 2.1.2, the solution is given by the following matrix equation:
In practice, the estimation of rN
Hence, Eq. (22) can be simplified to:
H(f)=(RP(f))−1·(rR,P*(f)−rN
Eq. (24) facilitates updating blocking matrix 315 when background noise is present in the environment of primary speech microphone 104 and noise reference microphone 106. This can be beneficial because most environmental background noise is not intermittent like speech, and hence it can be impractical to locate segments of primarily desired speech in primary input speech signal P(m, f) and reference input speech signal R(m, f) for updating the statistics required by the closed-form solution for the blocking matrix 315. The statistics rN
From Eq. (24), the solution according to the alternative approach for a single complex tap, K=0, is easily written as:
or, according to the notation of sub-section 2.1.1, as:
In this alternative embodiment, statistics estimator 335 is configured to obtain (or update) estimates of the statistics used in the calculations of Eq. (25) and/or Eq. (26) and provide the estimates to controller 340. Controller 340 is then configured to use the estimates to configure blocking matrix filter 315. For example, controller 340 can use these values to configure blocking matrix filter 315 in accordance with the transfer function H(f) expressed in Eq. (25) or (26).
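For the single-tap case, the alternative objective can be worked through directly. The steps below are a reconstruction from the stated cost function rather than a copy of the original equations. They assume R(m, f) = S2(m, f) + N2(m, f), and they use $C_{N_2,P^*}(f)$ as a hypothetical name for the cross-statistic between the noise component of the reference signal and the primary signal.

```latex
\begin{aligned}
N_2(m,f) - \hat{N}_2(m,f)
  &= N_2(m,f) - \bigl(R(m,f) - H(f)\,P(m,f)\bigr)
   = H(f)\,P(m,f) - S_2(m,f) \\[4pt]
\frac{\partial}{\partial H^{*}(f)} \sum_{m} \bigl|H(f)\,P(m,f) - S_2(m,f)\bigr|^{2} = 0
  &\;\Longrightarrow\;
  H(f) \sum_{m} P(m,f)\,P^{*}(m,f) = \sum_{m} S_2(m,f)\,P^{*}(m,f) \\[4pt]
H(f)
  &= \frac{C_{R,P^{*}}(f) - C_{N_2,P^{*}}(f)}{C_{P,P^{*}}(f)}
\end{aligned}
```

The last step uses S2(m, f) = R(m, f) − N2(m, f), so the numerator is the cross-channel statistic of the full signals minus a noise-only cross-statistic. This has the same "full statistic minus noise statistic" form as the hybrid solution, with the matrix inverse replaced by a division.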
2.2 The Adaptive Noise Canceler
As noted above, ANC 310 includes an adaptive noise canceler filter 325 configured to filter the “cleaner” background noise component N̂2(m, f) to provide an estimate of the background noise component N1(m, f) in P(m, f). ANC 310 then subtracts the estimated background noise component N̂1(m, f) from P(m, f) using subtractor 330 to provide, as output, the noise suppressed primary input speech signal Ŝ1(m, f).
Ideally, no residual amount of the background noise component N1(m, f) is left in the noise suppressed primary input speech signal Ŝ1(m, f). However, because of the time-varying nature of the signals processed by ANC 310 and the inability of the ANC filter to perfectly model the real unknown channel, often some residual amount of the background noise component N1(m, f) will be left in the noise suppressed primary input speech signal Ŝ1(m, f).
To achieve the objective of removing the background noise component N1(m, f) in the primary input speech signal P(m, f), the transfer function W(f) of adaptive noise canceler filter 325 can be derived (or updated) to substantially minimize the power of the noise suppressed primary input speech signal Ŝ1(m, f). In practice the BM is not perfect in removing all desired speech from N̂2(m, f), and hence it is wise to bias the minimization of the power of the noise suppressed primary input speech signal Ŝ1(m, f) to segments of desired speech absence, i.e., noise presence only. The power of the noise suppressed primary input speech signal Ŝ1(m, f), also referred to as a cost function, can be expressed as:
where ( )* indicates complex conjugate, m indexes the time or frame, and f indexes a particular frequency component or sub-band.
In the following sub-sections, a frequency domain adaptive noise canceler filter 325 and a hybrid approach adaptive noise canceler filter 325 are derived (or updated) based on the cost function expressed in Eq. (27).
2.2.1 Example Derivation of Frequency Domain Adaptive Noise Canceler
The frequency domain adaptive noise canceler filter 325 is derived (or updated) based on a closed form solution below assuming a single complex tap per frequency bin. However, persons skilled in the relevant art(s) will recognize based on the teachings herein that the proposed solution can be generalized to multiple taps per bin.
From
$\hat{S}_1(m,f) = P(m,f) - W(f)\,\hat{N}_2(m,f)$ (28)
where, again, W(f) represents the transfer function of adaptive noise canceler filter 325. The gradient of the cost function EŜ
where C{circumflex over (N)}
C{circumflex over (N)}
Collectively, the background noise statistics of the blocking matrix output and the cross-channel background noise statistics can be referred to as the background noise statistics. Further details and variants on the method of calculating the background noise statistics are provided below in section 3.
If BM 305 is effective (in suppressing the desired speech component S2(m, f) in the “cleaner” background noise component N̂2(m, f)), then the statistics expressed in Eq. (31) and Eq. (32) can be updated each time (or nearly each time) a new frame of primary input speech signal P(m, f) and reference input speech signal R(m, f) is received and processed, regardless of the content on the primary input speech signal P(m, f) and the reference input speech signal R(m, f). However, in an alternative embodiment (and in a potentially safer approach), as mentioned above, the statistics of adaptive noise canceler filter 325 can be updated primarily during periods of time or frames when desired speech is absent.
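The two update policies described above can be compared with a small sketch. It computes a single complex tap per bin as the ratio of the cross-channel background noise statistic to the statistic of the blocking matrix output, either over all frames or only over frames flagged as speech-absent. The function name, the `speech_absent` flags, and the signal values (including the amount of speech leaking through the blocking matrix) are hypothetical.

```python
def anc_single_tap(P_frames, N2_frames, speech_absent=None):
    """Estimate one complex ANC tap per bin.

    P_frames[m][f]: primary input bins; N2_frames[m][f]: blocking matrix output.
    speech_absent: optional list of bools; if None, every frame updates the
    statistics, otherwise only frames flagged True do.
    """
    num_bins = len(P_frames[0])
    C_PN = [0j] * num_bins    # cross-channel background noise statistics
    C_NN = [0.0] * num_bins   # statistics of the blocking matrix output
    for m, (P, N2) in enumerate(zip(P_frames, N2_frames)):
        if speech_absent is not None and not speech_absent[m]:
            continue
        for f in range(num_bins):
            C_PN[f] += P[f] * N2[f].conjugate()
            C_NN[f] += abs(N2[f]) ** 2
    return [C_PN[f] / C_NN[f] if C_NN[f] > 0 else 0j for f in range(num_bins)]

# Hypothetical single-bin example.
g = 0.8 - 0.2j                              # noise coupling into the primary input
noise = [1 + 0j, 0 - 1j, 0.5 + 0.5j]        # blocking matrix output, noise only
# Last frame: speech 2+0j is present and 10% of it leaks into the BM output.
N2_frames = [[n] for n in noise] + [[0.2 + 1j]]
P_frames = [[g * n] for n in noise] + [[2 + g * 1j]]
speech_absent = [True, True, True, False]

W_gated = anc_single_tap(P_frames, N2_frames, speech_absent)
W_all = anc_single_tap(P_frames, N2_frames)
```

In this example the gated estimate recovers the hypothetical coupling `g`, while the estimate over all frames is pulled away from it by the leaked speech, which illustrates why restricting updates to speech-absent frames is described as the safer approach.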
In the embodiment where adaptive noise canceler filter 325 is implemented in the frequency domain as a multiplication, statistics estimator 345, illustrated in
2.2.2 Example Derivation of Hybrid Approach Adaptive Noise Canceler Filter
A hybrid variation of adaptive noise canceler filter 325 in accordance with an embodiment of the present invention will now be described. The derivation of the hybrid approach follows that of sub-section 2.1.2 for blocking matrix filter 315.
The hybrid approach changes the “filtering” with the transfer function W(f) from:
$\hat{N}_1(m,f) = W(f)\,\hat{N}_2(m,f)$ (33)
to:
where m indexes the time or frame, f indexes a particular sub-band, and k=0, 1 . . . K indexes the individual filter coefficients for a particular frequency bin f, making up the noise suppression time direction filter in that particular frequency bin. Hence, the term time direction filter can be used to refer to the individual noise suppression filters that filter the sub-band signals of the “cleaner” background noise component N̂2(m, f) in the time direction.
Eq. (28) can be rewritten based on Eq. (34) as follows:
Substituting Eq. (35) into Eq. (27), the gradient of EŜ
Eq. (36) is dual to Eq. (13). Similar to sub-section 2.1.2, the set of K+1 equations (for k=0, 1, . . . K) of Eq. (36) provides a matrix equation for every frequency bin f to solve for W(k, f), where k=0, 1, . . . K:
This solution can be written as:
$R_{\hat{N}_2}(f)\cdot W(f) = r_{P,\hat{N}_2^*}(f)$ (38)
where:
and the superscript T denotes non-conjugate transpose. The solution per frequency bin to the time direction filter is thus given by:
$W(f) = \left(R_{\hat{N}_2}(f)\right)^{-1}\cdot r_{P,\hat{N}_2^*}(f)$ (42)
This solution appears to require a matrix inversion, but in most practical applications a matrix inversion is not needed.
In the embodiment where adaptive noise canceler filter 325 is implemented based on the hybrid approach, statistics estimator 345 is configured to derive (or update) estimates of the statistics expressed in Eq. (39) and Eq. (40) and provide the estimates to controller 350. Controller 350 is then configured to use the estimates of the statistics to configure adaptive noise canceler filter 325. For example, controller 350 can use these values to configure adaptive noise canceler filter 325 in accordance with the transfer function W(f) expressed in Eq. (42), although this is only one example.
Comparing Eq. (39) and Eq. (40) to Eq. (31) and Eq. (32), respectively, it can be seen that similar statistics are calculated by each set of equations, except that instead of calculating statistics only between current frequency bin components of signals, the hybrid solution requires calculation of statistics between vectors of current and past frequency bin components of signals, i.e. a time dimension is now part of the statistics. At the extreme, with no DFT, i.e. a single full band signal (the time domain signal), the hybrid method becomes a pure time domain method, and hence, the solution above provides the solution also for a pure time domain approach. The frequency index would become obsolete (as there is only one frequency band), and the signal vectors in the time direction would contain the signal time domain samples. A further simplification in that case is that the time domain signal without DFT is real and not complex as in the case of the DFT bins or if a complex sub-band analysis has been applied.
2.2.3 Alternative Approach to Adaptive Noise Canceler
As discussed above, to achieve the objective of removing the background noise component N1(m, f) in the primary input speech signal P(m, f), the transfer function W(f) of adaptive noise canceler filter 325 can be derived (or updated) to substantially minimize the power of the noise suppressed primary input speech signal Ŝ1(m, f) expressed in Eq. (27) during speech absence.
As an alternative method to achieve the objective of removing the background noise component N1(m, f) in the primary input speech signal P(m, f), the transfer function W(f) of adaptive noise canceler filter 325 can be derived (or updated) to substantially minimize the power of the difference between the desired speech component S1(m, f) in the primary input speech signal P(m, f) and the output of ANC 310, Ŝ1(m, f). The power of the difference between the desired speech component S1(m, f) and the output of ANC 310, Ŝ1(m, f), can be expressed as:
where ( )* indicates complex conjugate.
Accommodating the hybrid approach, from Eq. (43) the gradient of EŜ
which is written in matrix form as:
RN̂
where RN̂
and depends on the desired speech component S1(m, f) in the primary input speech signal P(m, f). The desired speech component S1(m, f) is generally not available independent of the background noise component N1(m, f) in the primary input speech signal P(m, f). However, rS
For the general hybrid version, the solution is given by:
W(f)=(f))=(RN̂
and the special 0th order hybrid (non-hybrid, both BM and ANC) version has the following solution:
With a hybrid BM and non-hybrid ANC, the solution is given by:
In this alternative approach, statistics estimator 345 is configured to derive (or update) estimates of the statistics expressed in Eq. (39) and/or Eq. (40) and/or Eq. (47) and provide the estimates to controller 350. Controller 350 is then configured to use the estimates of the statistics to configure adaptive noise canceler filter 325. For example, controller 350 can use these values to configure adaptive noise canceler filter 325 in accordance with the transfer function W(f) expressed in Eq. (48), Eq. (49), or Eq. (50).
As described above in sub-sections 2.1 and 2.2, the closed-form solutions for blocking matrix filter 315 and adaptive noise canceler filter 325 require various statistics to be estimated. In practice, these statistics need to be estimated from the primary input speech signal P(m, f) and the reference input speech signal R(m, f) that contain desired speech mixed with background noise. The statistics will generally vary with time due to, for example, the position of the desired speech source relative to primary speech microphone 104 and noise reference microphone 106 changing, the position of the background noise source(s) relative to primary speech microphone 104 and noise reference microphone 106 changing, etc. The present section describes methods and features that will facilitate the estimation of the time-varying statistics used to solve the closed-form solutions for blocking matrix filter 315 and adaptive noise canceler filter 325 described above in sub-sections 2.1 and 2.2.
3.1 Estimation of Time-Varying Statistics for the Blocking Matrix Filter
As described above in sub-section 2.1.1, deriving (or updating) blocking matrix filter 315 requires knowledge of the statistics CR,P*(f) and CP,P*(f), which can be calculated during periods of time (or frames) of predominantly desired speech. The statistics were expressed generally in Eq. (8) and Eq. (9), reproduced below:
The condition that these statistics be calculated during predominantly desired speech can be quantified by updating them only when the energy of the desired speech exceeds the energy of the background noise in the primary input speech signal P(m, f) by a large degree. Under this condition, the reference input speech signal R(m, f) and the primary input speech signal P(m, f) both consist primarily of desired speech. Thus, the calculation of CR,P*(f) as the sum of products of the reference input speech signal R(m, f) and the complex conjugate of the primary input speech signal P(m, f) at a given frequency bin f over some number of frames can be seen as a way of estimating the cross-spectrum at that frequency bin between the desired speech component in the reference input speech signal R(m, f) and the desired speech component in the primary input speech signal P(m, f). Consequently, and as noted above, CR,P*(f) can be referred to as the cross-channel statistics of the desired speech, or just desired speech cross-channel statistics.
Similarly, the calculation of CP,P*(f) as the sum of products of the primary input speech signal P(m, f) and its own complex conjugate at a given frequency bin f for some number of frames can be seen as a way of estimating the power spectrum at that frequency bin of the desired speech component in the primary input speech signal P(m, f). Consequently, and as noted above, CP,P*(f) can be referred to as the desired speech statistics of the primary input speech signal.
Collectively, the cross-channel statistics of the desired speech and desired speech statistics of the primary input speech signal can be referred to as simply the desired speech statistics.
To accommodate the time-varying nature of CR,P*(f) and CP,P*(f) expressed in Eq. (8) and Eq. (9), these statistics can be estimated using a time window (as is done in Eq. (8) and Eq. (9)) or using a moving average. The calculation of the statistics using a moving average can be expressed as:
CR,P*(m,f)=α(m)·CR,P*(m−1,f)+(1−α(m))·R(m,f)P*(m,f) (51)
CP,P*(m,f)=α(m)·CP,P*(m−1,f)+(1−α(m))·P(m,f)P*(m,f) (52)
where ( )* indicates complex conjugate, m indexes the time or frame, f indexes a particular frequency component, bin, or sub-band, and α(m) is an adaptation factor, which itself is time-varying.
It should be noted that the moving averages expressed in Eq. (51) and Eq. (52), commonly referred to as exponential moving averaging or exponentially weighted moving averaging, are provided for exemplary purposes only and are not intended to be limiting. Persons skilled in the relevant art(s) will recognize that other moving average expressions can be used.
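As a minimal sketch of the recursion in Eq. (51) and Eq. (52), the following Python function performs one update for a single frequency bin. The function and argument names are illustrative and do not appear in this description, and the adaptation factor is assumed to have been determined beforehand.

```python
def update_speech_statistics(c_rp_prev, c_pp_prev, r_bin, p_bin, alpha):
    """One exponential-moving-average step of Eq. (51) and Eq. (52).

    c_rp_prev, c_pp_prev: previous statistics for this bin (hypothetical names).
    r_bin, p_bin: complex bin values R(m, f) and P(m, f) of the current frame.
    alpha: adaptation factor in [0, 1]; alpha == 1 freezes the statistics.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Cross-channel term R(m, f) * conj(P(m, f)) of Eq. (51).
    c_rp = alpha * c_rp_prev + (1.0 - alpha) * r_bin * p_bin.conjugate()
    # Power term P(m, f) * conj(P(m, f)) of Eq. (52); it is real by construction.
    c_pp = alpha * c_pp_prev + (1.0 - alpha) * (p_bin * p_bin.conjugate()).real
    return c_rp, c_pp
```

Calling the function once per frame and per bin, with a value of alpha close to one, leaves the statistics nearly unchanged, which matches the slow-update behavior described for frames that are unlikely to be dominated by desired speech.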
The adaptation factor α(m) is adjusted over time such that it takes a smaller value, less than one and greater than zero, as the likelihood of predominantly desired speech increases, and a comparatively larger value, closer to one, as the likelihood of predominantly desired speech decreases. In practice, this can be achieved by setting α(m) to a smaller value when the energy of the desired speech is likely greater than the energy of the background noise in a current frame of the primary input speech signal P(m, f) by a large degree (resulting in CR,P*(f) and CP,P*(f) being updated quickly), and to a comparatively large value (e.g., a value around 1) when the energy of the desired speech is not likely to be greater than the energy of the background noise in the current frame of the primary input speech signal P(m, f) by a large degree (resulting in CR,P*(f) and CP,P*(f) being updated slowly, or not at all when α(m) is equal to one).
The adaptation factor α(m) can be determined, for example, based on a difference in energy between a current frame of the primary input speech signal P(m, f) received by primary speech microphone 104 and a current frame of the reference input speech signal R(m, f) received by noise reference microphone 106. The difference in energy can be calculated by subtracting the log-energy of the current frame of the reference input speech signal from the log-energy of the current frame of the primary input speech signal in at least one example.
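The log-energy subtraction described above might be sketched as follows. This is an illustration under stated assumptions: the energy is summed over all bins of the frame (a per-band variant is equally conceivable), the result is expressed in dB, and the small floor that avoids taking the logarithm of zero is a hypothetical safeguard not mentioned in this description.

```python
import math


def frame_log_energy_db(bins, floor=1e-12):
    """Log-energy of one frame in dB, summed over its complex bins."""
    energy = sum(abs(x) ** 2 for x in bins)
    # The floor is an assumed guard against log10(0) for silent frames.
    return 10.0 * math.log10(max(energy, floor))


def energy_difference_db(primary_bins, reference_bins):
    """Log-energy of the primary frame minus log-energy of the reference frame."""
    return frame_log_energy_db(primary_bins) - frame_log_energy_db(reference_bins)
```

A positive result indicates that the primary frame carries more energy than the reference frame, which is the situation expected when desired speech dominates.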
For instance, if the difference in energy is 16 dB or higher (indicating likelihood of desired speech dominating any background noise present in the current frame of the primary input speech signal P(m, f)), α(m) can be set equal to a smaller value and, if the difference in energy is 6 dB or less (indicating likelihood of background noise dominating any desired speech present in the current frame of the primary input speech signal P(m, f)), α(m) can be set equal to a comparatively larger value, while a piecewise linear mapping from difference in energy to α(m) can be used in-between these two values. In general, the piecewise linear mapping can be monotonically decreasing in-between the two points.
An example piecewise linear mapping 400 from difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) to adaptation factor α(m) is illustrated in
Using a mapping from difference in energy to α(m) as described above generally means that the statistics expressed in Eq. (51) and Eq. (52) will be updated at a rate directly related to the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f).
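A piecewise linear mapping of this kind could be sketched as shown below. The 6 dB and 16 dB break points come from the example above; the two end values of α(m) (1.0 and 0.9) are assumptions chosen only for illustration, since no specific values are given here.

```python
def alpha_from_energy_difference(diff_db, low_db=6.0, high_db=16.0,
                                 alpha_noise=1.0, alpha_speech=0.9):
    """Map an energy difference in dB to the adaptation factor alpha(m).

    At or below low_db the larger value alpha_noise is returned (slow or no update);
    at or above high_db the smaller value alpha_speech is returned (fast update).
    In between, the mapping decreases linearly. The end values are assumed.
    """
    if diff_db <= low_db:
        return alpha_noise
    if diff_db >= high_db:
        return alpha_speech
    t = (diff_db - low_db) / (high_db - low_db)
    return alpha_noise + t * (alpha_speech - alpha_noise)
```

With these assumed end values, a frame with a large energy difference contributes ten percent of its instantaneous product to the statistics, while a frame at or below 6 dB contributes nothing.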
As shown in
At step 515, a difference in energy between the current frame of the primary input speech signal P(m, f) and the current frame of the reference input speech signal R(m, f) is calculated. For example, the difference in energy can be calculated by subtracting the log-energy of the current frame of the reference input speech signal R(m, f) from the log-energy of the current frame of the primary input speech signal P(m, f).
At step 520, the adaptation factor α(m) is determined, based on at least the difference in energy calculated at step 515. For example, the adaptation factor α(m) can be determined based on a piecewise linear mapping from the difference in energy calculated at step 515 to α(m).
It should be noted that information other than the difference in energy calculated at step 515 can be used to determine the adaptation factor α(m). For example, a voice activity indicator provided by a voice activity detector (not shown) can be used in combination with the difference in energy calculated at step 515 to determine the adaptation factor α(m).
At step 525, the statistics used to determine blocking matrix filter 315 are updated based on the previous values of the statistics, the current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f), and the adaptation factor α(m). For example, the cross-channel statistics of the desired speech CR,P*(m, f) can be updated according to Eq. (51) above using the previous value of the cross-channel statistics of the desired speech CR,P*(m−1, f), the current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f), and the adaptation factor α(m). Similarly, the desired speech statistics of the primary input speech signal CP,P*(m, f) can be updated according to Eq. (52) above using the previous value of the desired speech statistics of the primary input speech signal CP,P*(m−1, f), the current frame of the primary input speech signal P(m, f), and the adaptation factor α(m).
3.1.1 Improved Estimation of Clean Speech Statistics
If there are plenty of frames where the desired speech dominates the background noise in the primary input speech signal P(m, f), then even if there is some background noise, the statistics CR,P*(f) and CP,P*(f) expressed by Eq. (51) and Eq. (52), respectively, can be estimated directly from the primary input speech signal P(m, f) and the reference input speech signal R(m, f) with sufficient accuracy. However, to gain robustness to higher levels of background noise, it may be advantageous to estimate the statistics CR,P*(f) and CP,P*(f) in a more advanced manner. For example, the statistics of the stationary portion of the background noise components N1(m, f) and N2(m, f) can be further estimated and removed when estimating the statistics CR,P*(f) and CP,P*(f) as follows:
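CR,P*(m,f)=α(m)·CR,P*(m−1,f)+(1−α(m))·[R(m,f)P*(m,f)−CN2,N1*(m,f)] (53)
CP,P*(m,f)=α(m)·CP,P*(m−1,f)+(1−α(m))·[P(m,f)P*(m,f)−CN1,N1*(m,f)] (54)
where CN2,N1*(m,f) represents the stationary background noise cross-channel statistics and CN1,N1*(m,f) represents the stationary background noise statistics of the primary input speech signal.
More specifically, the statistics CN2,N1*(m,f) and CN1,N1*(m,f) can be estimated using moving averages:
CN2,N1*(m,f)=αS(m)·CN2,N1*(m−1,f)+(1−αS(m))·R(m,f)P*(m,f) (55)
CN1,N1*(m,f)=αS(m)·CN1,N1*(m−1,f)+(1−αS(m))·P(m,f)P*(m,f) (56)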
where αS(m) is an adaptation factor.
It should be noted that the moving averages expressed in Eq. (55) and Eq. (56), commonly referred to as exponential moving averaging, are provided for exemplary purposes only and are not intended to be limiting. Persons skilled in the relevant art(s) will recognize that other moving average expressions can be used.
The adaptation factor αS(m) can be determined, for example, based on a difference in energy between a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f). For instance, if the difference in energy is −3 dB or less (indicating likelihood of background noise dominating any desired speech in the current frame of the primary input speech signal P(m, f)), αS(m) can be set equal to a small value between zero and one and, if the difference in energy is 6 dB or higher (indicating likelihood of desired speech dominating any background noise present in primary input speech signal P(m, f)), αS(m) can be set equal to a comparatively larger value close to one (or exactly equal to one), while a piecewise linear mapping from difference in energy to αS(m) can be used in-between these two values. In general, the piecewise linear mapping can be monotonically increasing in-between the two points.
An example piecewise linear mapping 600 from difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) to adaptation factor αS(m) is illustrated in
Using a mapping from difference in energy to αS(m) as described above generally means that the statistics expressed in Eq. (55) and Eq. (56) will be updated at a rate inversely related to the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f).
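The following sketch illustrates how the stationary noise statistics might be tracked and removed from the desired speech statistics. It assumes that the noise statistics follow the same exponential-moving-average form as Eq. (51) and Eq. (52) and that the current noise estimate is subtracted from the instantaneous product before averaging. The dictionary keys, the function names, and the end values of αS(m) (0.9 and 1.0) are hypothetical; only the −3 dB and 6 dB break points are taken from the example above.

```python
def alpha_s_from_energy_difference(diff_db, low_db=-3.0, high_db=6.0,
                                   alpha_noise=0.9, alpha_speech=1.0):
    """Monotonically increasing map from energy difference (dB) to alpha_S(m)."""
    if diff_db <= low_db:
        return alpha_noise      # noise likely dominates: update noise statistics
    if diff_db >= high_db:
        return alpha_speech     # speech likely dominates: freeze noise statistics
    t = (diff_db - low_db) / (high_db - low_db)
    return alpha_noise + t * (alpha_speech - alpha_noise)


def update_noise_compensated(stats, r_bin, p_bin, alpha, alpha_s):
    """Update noise statistics, then speech statistics with the noise removed.

    stats holds the previous values under the assumed keys
    'c_n2n1', 'c_n1n1' (stationary noise) and 'c_rp', 'c_pp' (desired speech).
    """
    cross = r_bin * p_bin.conjugate()
    power = (p_bin * p_bin.conjugate()).real
    c_n2n1 = alpha_s * stats["c_n2n1"] + (1.0 - alpha_s) * cross
    c_n1n1 = alpha_s * stats["c_n1n1"] + (1.0 - alpha_s) * power
    c_rp = alpha * stats["c_rp"] + (1.0 - alpha) * (cross - c_n2n1)
    c_pp = alpha * stats["c_pp"] + (1.0 - alpha) * (power - c_n1n1)
    return {"c_n2n1": c_n2n1, "c_n1n1": c_n1n1, "c_rp": c_rp, "c_pp": c_pp}
```

Because the two adaptation factors move in opposite directions with the energy difference, a given frame mostly updates either the noise statistics or the desired speech statistics, but rarely both.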
As shown in
At step 715, a difference in energy between the current frame of the primary input speech signal P(m, f) and the current frame of the reference input speech signal R(m, f) is calculated. For example, the difference in energy can be calculated by subtracting the log-energy of the current frame of the reference input speech signal R(m, f) from the log-energy of the current frame of the primary input speech signal P(m, f).
At step 720, the adaptation factor αS(m) is determined, based on at least the difference in energy calculated at step 715. For example, the adaptation factor αS(m) can be determined based on a piecewise linear mapping from the difference in energy calculated at step 715 to αS(m).
It should be noted that information other than the difference in energy calculated at step 715 can be used to determine the adaptation factor αS(m). For example, a voice activity indicator provided by a voice activity detector (not shown) can be used in combination with the difference in energy calculated at step 715 to determine the adaptation factor αS(m).
At step 725, the stationary background noise statistics are updated based on the previous values of the stationary background noise statistics, the current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f), and the adaptation factor αS(m). For example, the stationary background noise cross-channel statistics CN
3.1.2 Local Variations in Microphone Levels due to Acoustic Factors
In operation of multi-channel noise suppression system 300 illustrated in
In one potential solution to take this variation into account, local variations in the level of primary speech microphone 104 and noise reference microphone 106 due to acoustical factors can be respectively calculated based on the following moving averages:
MPlev(m)=αS·MPlev(m−1)+(1−αS)·MP(m) (57)
MRlev(m)=αS·MRlev(m−1)+(1−αS)·MR(m) (58)
where αS is determined based on the piecewise linear mapping in
The difference between the moving averages expressed in Eq. (57) and Eq. (58) can then be used to compensate for any variation in the microphone input levels due to acoustical factors. For example, the function used to map the difference in energy of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) to the adaptation factor α(m) can be offset by the difference between the moving averages expressed in Eq. (57) and Eq. (58) to provide compensation. Assuming the mapping function illustrated in the plot of
3.1.3 Accommodating Changes in Acoustic Coupling Specific to Primary Speech
In operation of multi-channel noise suppression system 300 illustrated in
In one potential solution to take this potential variation into account, a moving average is maintained of the difference in energy of a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f) and compared to a reference value. More specifically, the moving average is updated based on the difference in energy between a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f) if the frame of the primary input speech signal P(m, f) is indicated as including desired speech. The degree to which the moving average is updated based on each frame can be controlled using a smoothing factor. For example, the smoothing factor can be set to a value that updates the moving average to be equal to 0.99 of the previous moving average value and 0.01 of the difference in energy of the current frame of the primary input speech signal P(m, f) and the current frame of the reference input speech signal R(m, f), assuming the current frame of the primary input speech signal P(m, f) is indicated as including desired speech.
The reference value, to which the moving average is compared, can be determined as a typical difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) for desired speech when the desired speech source is in its nominal (i.e., intended) position relative to the two microphones.
As an example of this feature, if the user's mouth is in its nominal position relative to the two microphones of wireless communication device 102 during a call, the presence of desired speech may be highly likely if the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) is above 10 dB. On the other hand, if the user's mouth is not in its nominal position relative to the two microphones of wireless communication device 102 during a call (e.g., the user's mouth is farther away from at least primary speech microphone 104), then the presence of desired speech may be highly likely if the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) is above 6 dB. Thus, there is an effective loss in coupling of 4 dB for the desired speech because of the mismatch in the position of the user's mouth during the call from its nominal position relative to the two microphones. It should be noted that although the coupling for desired speech was reduced by 4 dB by moving the handset into a suboptimal position, the coupling for noise sources remains about the same (as they are far-field to the device for all practical purposes). Hence, this change in coupling only applies to desired speech.
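The tracking described above can be sketched as a small class. The 0.99 smoothing factor and the 10 dB reference value are taken from the examples in this sub-section; the class name, the method names, and the choice to initialize the moving average at the reference value are assumptions made for illustration.

```python
class CouplingLossTracker:
    """Tracks the speech-gated moving average of the energy difference in dB."""

    def __init__(self, reference_db=10.0, smoothing=0.99):
        self.reference_db = reference_db
        self.smoothing = smoothing
        # Assumed initial state: the source starts in its nominal position.
        self.average_db = reference_db

    def update(self, diff_db, speech_present):
        """Update only on frames indicated as containing desired speech."""
        if speech_present:
            self.average_db = (self.smoothing * self.average_db
                               + (1.0 - self.smoothing) * diff_db)
        return self.estimated_loss_db()

    def estimated_loss_db(self):
        """Effective coupling loss: reference value minus tracked average."""
        return self.reference_db - self.average_db
```

In the example above, speech frames that consistently show a 6 dB difference would drive the estimated loss toward 4 dB, and that value could then be used to offset the mapping to the adaptation factor α(m).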
By keeping track of a moving average of the difference in energy of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) for desired speech as discussed above, and comparing the moving average to a reference value as further discussed above, the effective loss due to suboptimal acoustic coupling for the desired speech can be estimated. This estimated effective loss can then be used to compensate for any actual loss due to suboptimal acoustic coupling for the desired speech. For example, the function used to map the difference in energy of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) to the adaptation factor α(m) can be offset by the estimated effective loss to provide compensation. Assuming the mapping function illustrated in the plot of
In order to update the moving average based on the difference in energy of a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f) when desired speech is indicated to be present in the frame of the primary input speech signal P(m, f), it is obviously necessary to first identify the presence of desired speech. This can be done using several methods. For example, the presence of desired speech can be determined based on whether: (1) an SNR of the primary input speech signal P(m, f) is above a certain threshold; (2) a difference in energy of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) is above a certain threshold; and/or (3) a prediction gain of the reference input speech signal R(m, f) from the primary input speech signal P(m, f) using a blocking matrix with a null forced in the direction of the expected desired speech is above a certain threshold. In one embodiment, at least two of these methods are used to determine the presence of desired speech in a frame of the primary input speech signal P(m, f).
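The embodiment that relies on at least two of the three indicators might be sketched as a simple vote. All threshold values below are placeholders, since this description only refers to "a certain threshold" for each indicator, and the function name is hypothetical.

```python
def desired_speech_present(snr_db, diff_db, prediction_gain_db,
                           snr_threshold_db=10.0, diff_threshold_db=6.0,
                           gain_threshold_db=6.0, min_votes=2):
    """Return True when at least min_votes of the three indicators fire.

    snr_db: SNR of the primary input speech signal.
    diff_db: energy difference between primary and reference signals.
    prediction_gain_db: prediction gain of the reference from the primary signal.
    The thresholds are assumed values for illustration only.
    """
    votes = [
        snr_db > snr_threshold_db,
        diff_db > diff_threshold_db,
        prediction_gain_db > gain_threshold_db,
    ]
    return sum(votes) >= min_votes
```

Requiring agreement between indicators trades some sensitivity for robustness, because a single indicator that fires spuriously does not by itself trigger an update of the moving average.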
3.2 Estimation of Time-Varying Statistics for the Adaptive Noise Canceler
As described above in sub-section 2.2.1, deriving (or updating) adaptive noise canceler filter 325 requires knowledge of the statistics CN̂
CN̂
To accommodate the time varying nature of CN̂
CP,N̂2*(m,f)=γ(m)·CP,N̂2*(m−1,f)+(1−γ(m))·P(m,f)N̂2*(m,f) (61)
CN̂2,N̂2*(m,f)=γ(m)·CN̂2,N̂2*(m−1,f)+(1−γ(m))·N̂2(m,f)N̂2*(m,f) (62)
where ( )* indicates complex conjugate, m indexes the time or frame, f indexes a particular frequency component or sub-band, and γ(m) is an adaptation factor.
It should be noted that the moving averages expressed in Eq. (61) and Eq. (62), commonly referred to as exponential moving averages or exponentially weighted moving averages, are provided for exemplary purposes only and are not intended to be limiting. Persons skilled in the relevant art(s) will recognize that other moving average expressions can be used.
If BM 305 is operating well and providing the “cleaner” background noise component N̂2(m, f) with little or no residual amount of the desired speech component S2(m, f), then the adaptation factor γ(m) can be set to a constant. However, if BM 305 is not operating perfectly and a residual amount of the desired speech component S2(m, f) is left in the “cleaner” background noise component N̂2(m, f), setting the adaptation factor γ(m) to a constant can result in distortion or cancellation of the desired speech. Therefore, the adaptation factor γ(m) can be varied over time according to the likelihood of desired speech being present, and the updating of the statistics expressed in Eq. (61) and in Eq. (62) can be effectively halted when the likelihood of desired speech being present is high.
For the statistics used to derive (or update) blocking matrix filter 315, the difference in energy between a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f) was used as an indicator of speech presence and as an input parameter to determine the adaptation factor α(m). In a similar manner, the difference in energy between a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f) can be used as an indicator of speech presence and as an input parameter to determine the adaptation factor γ(m). However, given that BM 305 removed desired speech from reference input speech signal R(m, f) (at least partially) to produce the “cleaner” background noise component N̂2(m, f), the difference in energy, or a moving average of the difference in energy, between a current frame of the primary input speech signal P(m, f) and a current frame of the “cleaner” background noise component N̂2(m, f) can alternatively be used as an indicator of speech presence and as an input parameter to determine the adaptation factor γ(m). In fact, using the “cleaner” background noise component N̂2(m, f) as opposed to the reference input speech signal R(m, f) can provide better discrimination, assuming BM 305 is functioning well.
As mentioned above, the statistics expressed in Eq. (61) and Eq. (62) for adaptive noise canceler filter 325 represent statistics of the background noise. Thus, the rate at which the statistics are updated will affect the ability of the overall noise suppression system to track and suppress moving background noise sources, e.g., a talking person walking by, a moving vehicle driving by, etc. Updating the statistics expressed in Eq. (61) and Eq. (62) at a fast pace will allow good tracking and suppression of moving noise sources. On the other hand, a fast update pace can potentially degrade steady-state suppression of stationary background noise sources. Therefore, a method referred to as dual adaptive noise cancelation can be used, where one set of statistics is maintained and updated at a fast rate (favoring moving noise sources) and another set of statistics is maintained and updated at a slow rate (favoring steady-state performance). Prior to applying adaptive noise canceler filter 325, one of the two sets of statistics is selected and used to configure the filter.
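For example, the following two sets of the statistics expressed in Eq. (61) and Eq. (62) can be maintained:
CP,N̂2*,fast(m,f)=γfast(m)·CP,N̂2*,fast(m−1,f)+(1−γfast(m))·P(m,f)N̂2*(m,f) (63)
CN̂2,N̂2*,fast(m,f)=γfast(m)·CN̂2,N̂2*,fast(m−1,f)+(1−γfast(m))·N̂2(m,f)N̂2*(m,f) (64)
and
CP,N̂2*,slow(m,f)=γslow(m)·CP,N̂2*,slow(m−1,f)+(1−γslow(m))·P(m,f)N̂2*(m,f) (65)
CN̂2,N̂2*,slow(m,f)=γslow(m)·CN̂2,N̂2*,slow(m−1,f)+(1−γslow(m))·N̂2(m,f)N̂2*(m,f) (66)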
where Eq. (63) and Eq. (64) represent the set of statistics updated at a fast rate (hence, the use of the fast adaptation factor γfast(m)), and Eq. (65) and Eq. (66) represent the set of statistics updated at a slow rate (hence, the use of the slow adaptation factor γslow(m)).
As discussed above, the adaptation factors γfast(m) and γslow(m) can be determined, for example, based on the difference in energy, or a moving average of the difference in energy, between a current frame of the primary input speech signal P(m, f) and a current frame of the “cleaner” background noise component N̂2(m, f).
In general, both mappings set the adaptation factor γ(m) to a large value (e.g., a value of one) if the difference in energy (or moving average of the difference in energy) between a current frame of the primary input speech signal P(m, f) and a current frame of the “cleaner” background noise component N̂2(m, f) is greater than a certain predetermined value (indicating a strong likelihood of desired speech dominating background noise), and to a smaller value greater than zero and smaller than one if the difference in energy (or moving average of the difference in energy) between the current frame of the primary input speech signal P(m, f) and the current frame of the “cleaner” background noise component N̂2(m, f) is less than a certain predetermined value (indicating a strong likelihood of background noise dominating desired speech), while a piecewise linear mapping can be used in-between the two predetermined values.
Using a mapping as described above generally means that the statistics expressed in Eq. (63), Eq. (64), Eq. (65), and Eq. (66) will be updated at a rate inversely related to the difference in energy (or moving average of the difference in energy) between the primary input speech signal P(m, f) and the “cleaner” background noise component N̂2(m, f).
Prior to applying adaptive noise canceler filter 325, one of the two sets of statistics needs to be selected for calculating its transfer function. In at least one embodiment, the set of statistics (i.e., either the fast or slow version) that results in adaptive noise canceler filter 325 producing an output signal with the least amount of power is selected. The output power of adaptive noise canceler filter 325 using each set of statistics can be expressed as:
where
Hence, the final adaptive noise canceler filter 325 is selected according to:
As shown in
At step 915, a difference in energy between the current frame of the primary input speech signal P(m, f) and the “cleaner” background noise component N̂2(m, f) is calculated. Alternatively, a moving average of the difference in energy between the primary input speech signal P(m, f) and the “cleaner” background noise component N̂2(m, f) is updated based on the current frame of each signal.
At step 920, the adaptation factors γslow(m) and γfast(m) are determined based on at least the difference in energy between the current frames of the primary input speech signal P(m, f) and the “cleaner” background noise component N̂2(m, f) calculated at step 915. For example, the adaptation factors γslow(m) and γfast(m) can be respectively determined based on piecewise linear mappings 805 and 810 illustrated in
At step 925, the statistics used to determine adaptive noise canceler filter 325 are updated based on the previous values of the statistics, the current frame of the primary input speech signal P(m, f) and the “cleaner” background noise component N̂2(m, f), and the adaptation factors γslow(m) and γfast(m). For example, the statistics can be updated according to Eq. (63), Eq. (64), Eq. (65), and Eq. (66) above.
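The selection between the fast and slow sets of statistics can be illustrated with the sketch below. It makes several assumptions that are not confirmed by this section: the filter is the single-tap (non-hybrid) solution formed as the ratio of the cross-channel statistic to the noise power statistic, the output is the primary signal minus the filtered noise reference, the output power is measured directly on a handful of recent bin values rather than from the statistics themselves, and ties go to the slow set. All names are hypothetical.

```python
def anc_weight(c_pn, c_nn, eps=1e-12):
    """Assumed single-tap filter: cross statistic over (real) noise power statistic."""
    return c_pn / max(c_nn, eps)


def output_power(weight, p_bins, n_bins):
    """Power of P - W * N_hat over the supplied values of one frequency bin."""
    return sum(abs(p - weight * n) ** 2 for p, n in zip(p_bins, n_bins))


def select_anc_weight(fast_stats, slow_stats, p_bins, n_bins):
    """Pick the filter (fast or slow statistics) giving the least output power.

    fast_stats and slow_stats are (c_pn, c_nn) pairs for one frequency bin.
    """
    w_fast = anc_weight(*fast_stats)
    w_slow = anc_weight(*slow_stats)
    e_fast = output_power(w_fast, p_bins, n_bins)
    e_slow = output_power(w_slow, p_bins, n_bins)
    # Ties fall back to the slow set, an assumed preference for steady state.
    return ("fast", w_fast) if e_fast < e_slow else ("slow", w_slow)
```

When a noise source moves, the fast statistics tend to yield the lower output power and are selected; once the noise field settles, the slow statistics typically win again.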
3.3 Automatic Microphone Calibration
Automatic microphone calibration can be further included in multi-channel noise suppression system 300 illustrated in
More specifically, microphone mismatch estimator 1005 determines and updates a current estimate of the difference in sensitivity between primary speech microphone 104 and noise reference microphone 106 by exploiting the knowledge that in diffuse sound fields (or when the device is far-field relative to a source) the energy of the signals received by primary speech microphone 104 and noise reference microphone 106 should be approximately equal, as well as the fact that aging of the two microphones is a slow process. Therefore, determining when the two microphones are in a diffuse sound field should provide a robust method for updating a current estimate of the difference in sensitivity between the two microphones. The identification of a diffuse sound field can be carried out in several different ways.
For example, one potential method for determining if the two microphones are in a diffuse sound field is to fix the phase according to a specific direction, calculate the corresponding optimal gain for maximum prediction of the signal received by noise reference microphone 106 from the signal received by primary speech microphone 104, and measure the prediction gain. By carrying these steps out for a variety of phases corresponding to a variety of directions, and comparing the prediction gains in different directions, it is possible to determine if sound is coming from multiple directions (indicating a diffuse sound field) or from a well-defined direction.
An alternative or supporting method is to assume diffuse noise when the energies of the signals received by both microphones are within some range of their respective minimum levels (representing the acoustic noise floor on each microphone). The lowest level is generally a result of diffuse environmental ambient noise (as long as it is above the noise floor of non-acoustic noise sources), and is hence suitable for updating a current estimate of the difference in sensitivity between primary speech microphone 104 and noise reference microphone 106.
Additionally, updating of the sensitivity mismatch generally should be avoided when circuit noise, such as thermal noise, dominates. Such noise is picked up after the microphones, electronically rather than acoustically, and consequently is not reflective of the sensitivity of the microphones. Because thermal noise is generally incoherent between the signal paths of the two microphones, it can be mistaken for a diffuse sound field suitable for tracking the sensitivity mismatch. To prevent updating when such noise dominates, an absolute lower level can be established under which no updating or tracking is performed. Other non-acoustic noise sources that should be omitted for tracking of the microphone sensitivity mismatch include wind noise.
Moreover, the expected range of microphone sensitivity mismatch can generally be determined from specifications provided by the microphone manufacturer. Therefore, as a safeguard from divergence of the sensitivity mismatch estimation, the sensitivity mismatch can be updated only if the observed mismatch (without sensitivity mismatch compensation) is below the sum of the microphone production tolerances plus a suitable bias term. The bias term can be used to make sure the estimated microphone sensitivity mismatch can span the entire variation.
After determining a suitable time to update the sensitivity mismatch using, for example, one or more of the methods discussed above, microphone mismatch estimator 1005 updates the current estimated value of the sensitivity mismatch. Microphone mismatch estimator 1005 can update the current estimated value of the sensitivity mismatch based on the difference in energy between a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f) during the suitable time. For example, microphone mismatch estimator 1005 can perform this update in accordance with the following moving average expression:
Mcal(m)=βcal·Mcal(m−1)+(1−βcal)·Mdiff(m) (72)
where Mcal(m) is the current estimated value of the acoustic sensitivity mismatch, Mcal(m−1) is the previous estimated value of the acoustic sensitivity mismatch, Mdiff(m) is the difference in energy between the current frame of the primary input speech signal P(m, f) and the current frame of the reference input speech signal R(m, f) calculated during the suitable time, and βcal is a smoothing factor. The difference in energy can be calculated by subtracting the log-energy of the current frame of the reference input speech signal from the log-energy of the current frame of the primary input speech signal in at least one example.
In general, the objective of automatic microphone calibration is to track long term changes and variation in acoustic sensitivity. Therefore, a value close to (but smaller than) one for the smoothing factor βcal can be used to introduce long term averaging. However, a value close to one will also result in slow initial convergence, so it may be advantageous to vary the smoothing factor βcal such that it has a smaller value immediately following a reset of the current estimated value of the sensitivity mismatch Mcal(m) and gradually increases to a value close to one as updates are performed.
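The moving average of Eq. (72), together with a smoothing factor that starts small after a reset and approaches one as updates are performed, can be sketched as follows. The class name, the geometric ramp, the starting and final values of the smoothing factor, and the use of dB for the log-energy difference are assumptions made for illustration.

```python
import numpy as np

class MismatchEstimator:
    """Sketch of Mcal(m) = beta*Mcal(m-1) + (1-beta)*Mdiff(m), Eq. (72)."""

    def __init__(self, beta_start=0.5, beta_final=0.995, ramp=0.9):
        self.beta_start = beta_start  # assumed value right after a reset
        self.beta_final = beta_final  # assumed long-term value, close to one
        self.ramp = ramp              # assumed rate at which beta approaches beta_final
        self.reset()

    def reset(self):
        self.m_cal = 0.0
        self.beta = self.beta_start

    def update(self, p_frame, r_frame):
        """Call only for frames judged suitable for updating."""
        eps = 1e-12
        e_p = 10.0 * np.log10(np.sum(np.abs(p_frame) ** 2) + eps)
        e_r = 10.0 * np.log10(np.sum(np.abs(r_frame) ** 2) + eps)
        m_diff = e_p - e_r  # log-energy difference, primary minus reference
        self.m_cal = self.beta * self.m_cal + (1.0 - self.beta) * m_diff
        # Move beta geometrically toward its final value.
        self.beta = self.beta_final - (self.beta_final - self.beta) * self.ramp
        return self.m_cal
```

With a reference signal that is consistently half the amplitude of the primary signal, the estimate settles near 6 dB within a few updates and thereafter changes only slowly.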
The current estimated value of the sensitivity mismatch Mcal(m) is passed on to microphone mismatch compensator 1010 and is used by microphone mismatch compensator 1010 to scale reference input speech signal R(m, f) to compensate for any mismatch. The scaled version of reference input speech signal R(m, f) is denoted in
In another embodiment, rather than scaling the primary input speech signal P(m, f) and/or the reference input speech signal R(m, f) based on the current estimated value of the sensitivity mismatch Mcal(m), the current estimated value of the sensitivity mismatch Mcal(m) can be used as an additional input to control the update of the time-varying statistics as described above in the preceding sub-sections.
As shown in
At step 1115, the presence of a diffuse sound field is identified (at least in part) based on the current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) using, for example, one or more of the methods described above in regard to
At step 1120, a difference in energy between the current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) is calculated.
At step 1125, if the presence of a diffuse sound field is identified at step 1115, the current estimated value of the sensitivity mismatch is updated based on the previous estimated value of the sensitivity mismatch and the calculated difference in energy determined at step 1120. For example, the current estimated value of the sensitivity mismatch can be updated according to Eq. (72) above.
Instead of carrying out microphone mismatch estimation and compensation as detailed above, it is possible to track the (diffuse) noise levels on the two microphones and then, rather than using the level difference between the two microphones to control the estimation of statistics, use the level difference normalized by the respective (diffuse) noise levels. This results in the SNR difference between the two microphones, instead of the level difference, being used to control the estimation of statistics. Hence, wherever a level difference is referred to as an input for controlling the update of statistics, it should be understood that a corresponding SNR difference can be used as an alternative, thereby effectively carrying out microphone mismatch compensation implicitly.
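The reason the SNR difference compensates the mismatch implicitly can be seen from a short sketch: a sensitivity offset on one microphone shifts both its signal level and its tracked noise level by the same amount, so the offset cancels. The function name and the dB representation are assumptions for illustration.

```python
def snr_difference_db(level_p_db, level_r_db, noise_p_db, noise_r_db):
    """Level difference normalized by each microphone's (diffuse) noise level.

    All arguments are in dB. A constant sensitivity offset on either
    microphone appears in both its level and its noise level and cancels.
    """
    return (level_p_db - noise_p_db) - (level_r_db - noise_r_db)
```

For example, raising both `level_r_db` and `noise_r_db` by 4 dB, as a more sensitive reference microphone would, leaves the returned value unchanged.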
4.1 Frequency Dependent Adaptation Factor
As can be seen in section 3 above, the estimation of the time-varying statistics used to derive (or update) blocking matrix filter 315 and adaptive noise canceler filter 325 can be controlled by the full-band energy difference of various signals (e.g., the full-band energy difference of primary input speech signal P(m, f) and reference input speech signal R(m, f)). However, improved performance can be expected by allowing the update control of the time-varying statistics to have some frequency resolution.
For example, the update control can be based on frequency dependent energy differences. More specifically, the adaptation factors (which are used as an update control) can become frequency dependent according to the mapping from the frequency dependent energy differences to adaptation factors. The advantage of this can be seen intuitively from a simple example. Assume that desired speech only has content below 1500 Hz and background noise only has content above 2000 Hz. With the full-band energy difference, the algorithm will try to come up with a full-band likelihood of desired speech presence. This likelihood will depend on the relative energies of the desired speech and background noise. On the other hand, if frequency dependent update control is implemented, then updates can be done with likelihood of desired signal presence being one below 1500 Hz and zero above 2000 Hz, and both speech statistics for blocking matrix filter 315 and noise statistics for adaptive noise canceler filter 325 can be updated more optimally.
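The difference between full-band and frequency dependent control can be illustrated with a small sketch that computes both kinds of energy difference for one frame. The function name and the dB representation are assumptions; the mapping from energy difference to adaptation factor is not shown.

```python
import numpy as np

def level_differences_db(P, R, eps=1e-12):
    """Per-bin and full-band energy difference (dB) between P(m, f) and R(m, f).

    P, R: complex spectra of one frame, one value per frequency bin.
    """
    e_p = np.abs(P) ** 2
    e_r = np.abs(R) ** 2
    per_bin = 10.0 * np.log10((e_p + eps) / (e_r + eps))
    full_band = 10.0 * np.log10((e_p.sum() + eps) / (e_r.sum() + eps))
    return per_bin, full_band
```

In the example from the text, with speech confined below 1500 Hz and noise above 2000 Hz, the per-bin difference is large in the speech bins and near zero in the noise bins, while the single full-band value falls somewhere in between and depends on the relative energies of the two components.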
4.2 Switched Blocking Matrix and Adaptive Noise Canceler
When desired speech is absent in the primary input speech signal P(m, f), the speech statistics for blocking matrix filter 315 generally are not updated and the filter remains unchanged. This means that the “cleaner” background noise component N̂2(m, f), produced (in part) by blocking matrix filter 315, during desired speech absence will not only include the background noise component N2(m, f) of the reference input speech signal R(m, f), but also an additive filtered component of the primary input speech signal P(m, f), which contains only background noise and no desired speech. This additive filtered component can effectively complicate the task of adaptive noise canceler filter 325 to the point of the filter providing significantly reduced noise suppression compared to disabling blocking matrix filter 315 during desired speech absence. Therefore, it can be advantageous to operate a switched structure, where blocking matrix filter 315 can be disabled during desired speech absence.
To accommodate such a switched structure, multiple copies of the time-varying statistics used to derive (or update) adaptive noise canceler filter 325 can be maintained. More specifically, one copy of the time-varying statistics used to derive (or update) adaptive noise canceler filter 325 can be maintained for use when blocking matrix filter 315 is enabled and another copy of the time-varying statistics used to derive (or update) adaptive noise canceler filter 325 can be maintained for use when blocking matrix filter 315 is disabled.
4.2.1 Scaled Blocking Matrix
In practice it may be advantageous to use a switching mechanism to turn blocking matrix filter 315 partially on and partially off based on the likelihood of speech being present in the primary input speech signal P(m, f), rather than using a hard switching mechanism that simply turns blocking matrix filter 315 either completely on or completely off. For example, such a soft switching mechanism can be implemented as a scaling of the coefficients of blocking matrix filter 315 with a scaling factor having a value between zero and one that can be adjusted based on the likelihood of desired speech being present in the primary input speech signal P(m, f). A good estimate of the likelihood of desired speech being present in the primary input speech signal P(m, f) can be calculated from the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f).
Furthermore, it can be advantageous to make the scaling factor frequency dependent, as the desired speech source may occupy/dominate certain frequency range(s) while a background noise source may occupy/dominate a different frequency range(s). Frequency dependency can be achieved by not calculating the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) on a full-band basis, but rather based on individual frequency bins, or groups of frequency bins.
The frequency dependent level difference can be calculated as:
Mfrq(m,f)=βr
where P(m, f) and R(m, f) have already been subject to the microphone mismatch compensation. The scaled taps of blocking matrix filter 315 are calculated according to:
Hence, it equals the regular blocking matrix filter 315 during certain desired speech presence (large microphone level difference at the specific frequency bin), is completely off during certain desired speech absence, and assumes a scaled version according to the microphone level difference at the specific frequency bin during uncertainty of desired speech presence. Example values of the parameters are Toff=3 dB and Ton=8 dB.
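The behavior described above (fully on, fully off, and scaled in between) can be sketched per frequency bin as follows. The linear ramp between the lower and upper thresholds is an assumption made for illustration, as are the function and parameter names; only the example threshold values of 3 dB and 8 dB come from the text.

```python
import numpy as np

def blocking_matrix_scale(m_frq_db, t_off_db=3.0, t_on_db=8.0):
    """Per-bin scaling factor in [0, 1] from the per-bin level difference (dB).

    Below t_off_db the blocking matrix is off (0), above t_on_db it is
    fully on (1), and in between an assumed linear ramp is used.
    """
    m = np.asarray(m_frq_db, dtype=float)
    return np.clip((m - t_off_db) / (t_on_db - t_off_db), 0.0, 1.0)

def scaled_taps(h, m_frq_db, t_off_db=3.0, t_on_db=8.0):
    """Scale the per-bin blocking matrix taps h by the soft-switch factor."""
    return blocking_matrix_scale(m_frq_db, t_off_db, t_on_db) * np.asarray(h)
```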
4.2.2 Adaptive Noise Canceler as a Function of the Blocking Matrix
A complication of having a soft decision in the form of blocking matrix scaling, rather than a hard on-off switch, is that it is no longer possible to simply maintain two sets of statistics for the ANC section (one corresponding to the blocking matrix being on, and a second to the blocking matrix being off). The scaling of the blocking matrix will introduce a source of modulation into the output signal of the blocking matrix, on which the statistics for the ANC section are based, which could further complicate the tracking of the ANC statistics. To address this, the solution for the ANC section is further analyzed. The analysis is based on the single complex tap, but can be applied to any of the formulations. From sub-section 2.2:
As opposed to sub-section 3.2 above, where the noise components of CP,N̂
CP,R*fast(m,f)=γfast(m,f)·CP,R*fast(m−1,f)+(1−γfast(m,f))·P(m,f)R*(m,f) (77)
CP,P*fast(m,f)=γfast(m,f)·CP,P*fast(m−1,f)+(1−γfast(m,f))·P(m,f)P*(m,f) (79)
and
CP,R*slow(m,f)=γslow(m,f)·CP,R*slow(m−1,f)+(1−γslow(m,f))·P(m,f)R*(m,f) (80)
CP,P*slow(m,f)=γslow(m,f)·CP,P*slow(m−1,f)+(1−γslow(m,f))·P(m,f)P*(m,f) (81)
CR,R*slow(m,f)=γslow(m,f)·CR,R*slow(m−1,f)+(1−γslow(m,f))·R(m,f)R*(m,f) (82)
Additionally, as indicated by the above equations the fast and slow adaptation factors γfast(m) and γslow(m) can be made frequency dependent by mapping the level difference on a frequency bin basis. The mapping can be identical to that of section 3.2, except for being frequency bin based instead of full-band based.
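Each of the fast and slow statistics above follows the same first-order recursion, differing only in the adaptation factor and in the pair of signals that are correlated. A minimal vectorized sketch over frequency bins is shown below; the function name is hypothetical, and the per-bin adaptation factors are assumed to have been obtained already from the level-difference mapping.

```python
import numpy as np

def update_statistic(c_prev, x, y, gamma):
    """One recursion step per frequency bin:

        C(m, f) = gamma(m, f) * C(m-1, f) + (1 - gamma(m, f)) * X(m, f) * conj(Y(m, f))

    c_prev: previous statistic per bin; x, y: current complex spectra;
    gamma: per-bin adaptation factor in [0, 1].
    """
    return gamma * c_prev + (1.0 - gamma) * x * np.conj(y)
```

For instance, `update_statistic(c_pr_fast, P, R, gamma_fast)` would update the fast cross-statistic of P and R, and `update_statistic(c_rr_slow, R, R, gamma_slow)` the slow auto-statistic of R, where the variable names are placeholders.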
Yet a further refinement is to select taps from the fast and slow tracking ANCs on a frequency bin basis instead of a full-band basis as in section 3.2:
Efast(m,f)=|P(m,f)−Wfast(f)·N̂2(m,f)|² (83)
Eslow(m,f)=|P(m,f)−Wslow(f)·N̂2(m,f)|² (84)
where:
Hence, the final adaptive noise canceler filter 325 is selected according to:
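A per-bin selection based on Eqs. (83) and (84) can be sketched as follows. The rule used here, keeping in each bin whichever tap yields the lower residual energy, is an assumption made for illustration, as are the function name and the single-complex-tap-per-bin representation.

```python
import numpy as np

def select_anc_taps(P, N2_hat, w_fast, w_slow):
    """Choose, per frequency bin, between the fast and slow tracking taps.

    P: primary spectrum; N2_hat: "cleaner" noise reference spectrum;
    w_fast, w_slow: single complex tap per bin for each tracker.
    """
    e_fast = np.abs(P - w_fast * N2_hat) ** 2  # residual energy, Eq. (83)
    e_slow = np.abs(P - w_slow * N2_hat) ** 2  # residual energy, Eq. (84)
    # Assumed rule: keep the tap with the lower residual energy in each bin.
    return np.where(e_fast < e_slow, w_fast, w_slow)
```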
It will be apparent to persons skilled in the relevant art(s) that various elements and features of the present invention, as described herein, can be implemented in hardware using analog and/or digital circuits, in software, through the execution of instructions by one or more general purpose or special-purpose processors, or as a combination of hardware and software.
The following description of a general purpose computer system is provided for the sake of completeness. Embodiments of the present invention can be implemented in hardware, or as a combination of software and hardware. Consequently, embodiments of the invention may be implemented in the environment of a computer system or other processing system. An example of such a computer system 1200 is shown in
Computer system 1200 includes one or more processors, such as processor 1204. Processor 1204 can be a special purpose or a general purpose digital signal processor. Processor 1204 is connected to a communication infrastructure 1202 (for example, a bus or network). Various software implementations are described in terms of this exemplary computer system. After reading this description, it will become apparent to a person skilled in the relevant art(s) how to implement the invention using other computer systems and/or computer architectures.
Computer system 1200 also includes a main memory 1206, preferably random access memory (RAM), and may also include a secondary memory 1208. Secondary memory 1208 may include, for example, a hard disk drive 1210 and/or a removable storage drive 1212, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, or the like. Removable storage drive 1212 reads from and/or writes to a removable storage unit 1216 in a well-known manner. Removable storage unit 1216 represents a floppy disk, magnetic tape, optical disk, or the like, which is read by and written to by removable storage drive 1212. As will be appreciated by persons skilled in the relevant art(s), removable storage unit 1216 includes a computer usable storage medium having stored therein computer software and/or data.
In alternative implementations, secondary memory 1208 may include other similar means for allowing computer programs or other instructions to be loaded into computer system 1200. Such means may include, for example, a removable storage unit 1218 and an interface 1214. Examples of such means may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, a thumb drive and USB port, and other removable storage units 1218 and interfaces 1214 which allow software and data to be transferred from removable storage unit 1218 to computer system 1200.
Computer system 1200 may also include a communications interface 1220. Communications interface 1220 allows software and data to be transferred between computer system 1200 and external devices. Examples of communications interface 1220 may include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via communications interface 1220 are in the form of signals which may be electronic, electromagnetic, optical, or other signals capable of being received by communications interface 1220. These signals are provided to communications interface 1220 via a communications path 1222. Communications path 1222 carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link and other communications channels.
As used herein, the terms “computer program medium” and “computer readable medium” are used to generally refer to tangible storage media such as removable storage units 1216 and 1218 or a hard disk installed in hard disk drive 1210. These computer program products are means for providing software to computer system 1200.
Computer programs (also called computer control logic) are stored in main memory 1206 and/or secondary memory 1208. Computer programs may also be received via communications interface 1220. Such computer programs, when executed, enable the computer system 1200 to implement the present invention as discussed herein. In particular, the computer programs, when executed, enable processor 1204 to implement the processes of the present invention, such as any of the methods described herein. Accordingly, such computer programs represent controllers of the computer system 1200. Where the invention is implemented using software, the software may be stored in a computer program product and loaded into computer system 1200 using removable storage drive 1212, interface 1214, or communications interface 1220.
In another embodiment, features of the invention are implemented primarily in hardware using, for example, hardware components such as application-specific integrated circuits (ASICs) and gate arrays. Implementation of a hardware state machine so as to perform the functions described herein will also be apparent to persons skilled in the relevant art(s).
The present invention has been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
In addition, while various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be understood by those skilled in the relevant art(s) that various changes in form and details can be made to the embodiments described herein without departing from the spirit and scope of the invention as defined in the appended claims. Accordingly, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
This application claims the benefit of U.S. Provisional Patent Application No. 61/413,231, filed on Nov. 12, 2010, which is incorporated herein by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
20120123772 A1 | May 2012 | US |
Number | Date | Country | |
---|---|---|---|
61413231 | Nov 2010 | US |