This application is a National Stage application of PCT/US2014/055329 filed on Sep. 12, 2014, and entitled “Residual Interference Suppression”, which is incorporated herein by reference.
As is known in the art, adaptive interference cancellation (AIC) can form a part of acoustic echo cancellation, adaptive beam forming, adaptive noise cancellation, etc. AIC uses adaptive filters to model the acoustic (reverberant) channel of the interfering signal component. The estimate of the interference component is then subtracted (“cancelled”) from the input signal without distorting the desired signal component. Nevertheless, some residual interference remains after AIC. In many conventional systems, residual interference suppression (RIS), which performs spectral weighting on the AIC output, is applied after AIC.
Embodiments of the invention provide methods and apparatus for providing an enhanced estimation of the power of the residual interference component after AIC by achieving a higher accuracy for less speech distortion than conventional techniques. By using inventive RIS processing, a better signal quality (i.e. a better trade-off between distortion of the desired signal and suppression of the interference) is achieved. Thus, an improved ASR performance for barge-in applications (AEC) and for beamformer post filtering is achieved.
After AIC, the residual component of the interference includes multiple parts. One part is due to the limited length of the adaptive filter: the full length of the acoustic path cannot be modeled and late echoes cannot be cancelled with AIC. Another part is due to a misalignment of the adaptive filter: as the acoustic path changes over time, the filter has to adapt continually and is never perfectly converged.
Conventional systems apply post-filters (dynamic spectral weighting) after AIC where the post-filtering weighting factors are calculated based on estimates of the power of the residual interference after the AIC. Embodiments of the invention improve the accuracy of the power estimate based on an inventive parametric model. With the inventive processing, improved speech enhancement (less speech distortion) and improved ASR performance are achieved.
Embodiments of the invention can also provide adaptation control of the filters within the AIC to allow for a more precise estimate of the misalignment of the filter which is required to calculate the optimal step size for filter adaptation. This improves the AIC performance.
It is understood that embodiments of the invention are applicable to a wide range of applications, such as ASR and hands-free telephony applications, barge-in, acoustic echo cancellation, multichannel reverberation suppression, and the like.
In one aspect of the invention, a method for estimating reverberant spectral variance (RSV) comprises: estimating a power spectral density of residual interference after adaptive interference cancellation (AIC) using first and second components; estimating the first component using a real-valued FIR filter operating on a power spectral density (PSD) of a reference signal; and estimating the second component using an exponential decay over time corresponding to a reverberation time using the PSD of the reference signal.
The method can further include one or more of the following features: using a FIR filter for the first component and using an IIR filter for the second component, the FIR filter has a number of taps, the IIR filter includes a delay element with the delay equal to the length of the FIR filter, determining a common scaling factor for the taps, using gradient descent processing to find the first and/or second component, using equation error principle processing, using a logarithmic cost function for the gradient descent processing, determining parameters A, B, and C and compensating the first component, which corresponds to an early reverberation PSD, from an observed PSD, to drive the adaptation of the parameter B, where the parameter A is a scaling parameter corresponding to a strength of late reverberation, the parameter B describes exponential decay in relation to reverberation time of an enclosure and the parameter C is a common scaling factor for filter time lags, determining the parameter B by extrapolating a log of an AEC filter response linearly and using the resulting late reverb-PSD jointly with an FIR model for the first component, controlling a step size of the adaptive AIC filter, dynamically adjusting a length of the FIR filter corresponding to a reverberation time and/or the ratio of early and late residual interference, the reference signal comprises a loudspeaker signal and the RSV estimate is applied for residual echo suppression, and/or the reference signal comprises a microphone signal and the second component corresponding to late RSV is used for dereverberation.
In another aspect of the invention, an article comprises: a non-transitory storage medium having stored instructions that enable a machine to estimate reverberant spectral variance (RSV), comprising instructions to: estimate a power spectral density of residual interference after adaptive interference cancellation (AIC) using first and second components; estimate the first component using a real-valued FIR filter operating on a power spectral density (PSD) of a reference signal; and estimate the second component using an exponential decay over time corresponding to a reverberation time using the PSD of the reference signal.
The article can further include one or more of the following features: using a FIR filter for the first component and using an IIR filter for the second component, the FIR filter has a number of taps, the IIR filter includes a delay element with the delay equal to the length of the FIR filter, determining a common scaling factor for the taps, using gradient descent processing to find the first and/or second component, using equation error principle processing, using a logarithmic cost function for the gradient descent processing, determining parameters A, B, and C and compensating the first component, which corresponds to an early reverberation PSD, from an observed PSD, to drive the adaptation of the parameter B, where the parameter A is a scaling parameter corresponding to a strength of late reverberation, the parameter B describes exponential decay in relation to reverberation time of an enclosure and the parameter C is a common scaling factor for filter time lags, determining the parameter B by extrapolating a log of an AEC filter response linearly and using the resulting late reverb-PSD jointly with an FIR model for the first component, controlling a step size of the adaptive AIC filter, dynamically adjusting a length of the FIR filter corresponding to a reverberation time and/or the ratio of early and late residual interference, the reference signal comprises a loudspeaker signal and the RSV estimate is applied for residual echo suppression, and/or the reference signal comprises a microphone signal and the second component corresponding to late RSV is used for dereverberation.
In a further aspect of the invention, a system comprises: an AIC module to receive an input signal and a reference signal and generate an AIC output signal; a first PSD module to receive the AIC output signal and generate a first PSD output signal; a second PSD module to receive the reference signal and generate a second PSD output signal; an early and late residual interference PSD estimation module to receive the second PSD output signal and generate an early residual interference output and a late residual interference output, the early and late residual interference PSD estimation module configured to generate the early residual interference output using a real-valued FIR filter operating on a power spectral density (PSD) of the reference signal, and to generate the late residual interference output using an exponential decay over time corresponding to a reverberation time using the PSD of the reference signal; and a residual echo suppression module to process the early residual interference output, the late residual interference output, and the AIC output.
The system can be further configured to include one or more of the following features: using a FIR filter for the first component and using an IIR filter for the second component, the FIR filter has a number of taps, the IIR filter includes a delay element with the delay equal to the length of the FIR filter, determining a common scaling factor for the taps, using gradient descent processing to find the first and/or second component, using equation error principle processing, using a logarithmic cost function for the gradient descent processing, determining parameters A, B, and C and compensating the first component, which corresponds to an early reverberation PSD, from an observed PSD, to drive the adaptation of the parameter B, where the parameter A is a scaling parameter corresponding to a strength of late reverberation, the parameter B describes exponential decay in relation to reverberation time of an enclosure and the parameter C is a common scaling factor for filter time lags, determining the parameter B by extrapolating a log of an AEC filter response linearly and using the resulting late reverb-PSD jointly with an FIR model for the first component, controlling a step size of the adaptive AIC filter, dynamically adjusting a length of the FIR filter corresponding to a reverberation time and/or the ratio of early and late residual interference, the reference signal comprises a loudspeaker signal and the RSV estimate is applied for residual echo suppression, and/or the reference signal comprises a microphone signal and the second component corresponding to late RSV is used for dereverberation.
In a further aspect of the invention, a system comprises: an AIC module to receive an input signal and a reference signal and generate an AIC output signal; a first PSD module to receive the AIC output signal and generate a first PSD output signal; a second PSD module to receive the reference signal and generate a second PSD output signal; an early and late residual interference PSD estimation module to receive the second PSD output signal and generate an early residual interference output and a late residual interference output, the early and late residual interference PSD estimation module configured to generate the early residual interference output using a real-valued FIR filter operating on a power spectral density (PSD) of the reference signal, and to generate the late residual interference output using an exponential decay over time corresponding to a reverberation time using the PSD of the reference signal; a beamforming module to receive the input signal and the reference signal and generate a beamforming output signal; and a dereverberation module to process the late residual interference output and the beamforming output, wherein the early residual interference output is not processed by the dereverberation module.
The system can be further configured to include one or more of the following features: using a FIR filter for the first component and using an IIR filter for the second component, the FIR filter has a number of taps, the IIR filter includes a delay element with the delay equal to the length of the FIR filter, determining a common scaling factor for the taps, using gradient descent processing to find the first and/or second component, using equation error principle processing, using a logarithmic cost function for the gradient descent processing, determining parameters A, B, and C and compensating the first component, which corresponds to an early reverberation PSD, from an observed PSD, to drive the adaptation of the parameter B, where the parameter A is a scaling parameter corresponding to a strength of late reverberation, the parameter B describes exponential decay in relation to reverberation time of an enclosure and the parameter C is a common scaling factor for filter time lags, determining the parameter B by extrapolating a log of an AEC filter response linearly and using the resulting late reverb-PSD jointly with an FIR model for the first component, controlling a step size of the adaptive AIC filter, dynamically adjusting a length of the FIR filter corresponding to a reverberation time and/or the ratio of early and late residual interference, the reference signal comprises a loudspeaker signal and the RSV estimate is applied for residual echo suppression, and/or the reference signal comprises a microphone signal and the second component corresponding to late RSV is used for dereverberation.
The foregoing features of this invention, as well as the invention itself, may be more fully understood from the following description of the drawings in which:
If Hl(k) replicates the transmission characteristics of the interference from its source to the microphone, then subtraction can completely remove the interfering component from the microphone signal Y(k).
In the illustrated embodiment, processing is performed in the short-time Fourier domain with k being the frame index. The frequency index is omitted in the text for better readability. All signals and filter taps are generally complex-valued. Complex conjugates are indicated by a superscript asterisk (*).
In many speech enhancement applications, interfering signal components are superimposed on the desired speech signal. In some cases a reference signal of the interfering sound source is available. An interfering source can include, e.g., a loudspeaker, or even parts of the desired signal such as reverberation. In such cases adaptive interference cancellation (AIC) can be applied.
The residual interference includes a number of components. One component is due to an adaptive filter that has not fully converged (time lags from 0 up to D−1). Another component (time lags greater than or equal to D) is present because it cannot be covered by the adaptive filter due to the finite length of the filter. If the AIC filter is not converged at all, the first part of the residual interference may actually be the complete interference.
For dereverberation often a time lag of 50 ms is used to distinguish early and late reverberation. Reverberation arriving no later than 50 ms is usually referred to as early reverberation (or early reflections), whereas all reverberation arriving after the 50 ms boundary is referred to as late reverberation. This is mainly motivated by psychoacoustic effects. For AEC applications the length of the adaptive filter D is often chosen according to the reverberation time of the enclosure. In living rooms values of 300 ms might be chosen whereas in cars shorter filters with 50 ms might be sufficient.
Embodiments of the invention provide estimation of the reverberant spectral variance (RSV) at the AIC output. It is understood that RSV is identical to the term PSD of the residual interference. In particular, embodiments estimate early and late RSV parts jointly. Embodiments can be applied to residual echo suppression for acoustic echo cancellation and dereverberation in the context of beamforming, for example.
The residual error after adaptive interference cancellation is modelled in the power domain, as follows:
where Φ̃rr(k) is the PSD of the residual interference component after AIC. Kl is a frame-based scalar weighting factor that describes the contribution of the PSD of the reference signal Φxx(k) in recent frames (time lag l). The residual interference PSD is separated into two parts: the first part (time lags 0, . . . , D−1) refers to the region which is covered by the adaptive AIC filter of length D and the second part (time lags D, . . . ) refers to the time lags which cannot be modelled by the AIC filter (due to its finite length D), as set forth below:
Φ̃rr(k) = Φ̃ϵϵ(k) + Φ̃LL(k) (3)
The first part Φ̃ϵϵ(k) arises from misalignment of the adaptive filter and the second part Φ̃LL(k) arises from the late reverberation tail (Eq. 2). Modelling the PSD of the residual interference with the first and second parts provides enhanced performance in comparison with conventional processing.
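Consistent with the definitions above, the model and its two-part split can be sketched as follows. This is a reconstruction from the surrounding text rather than a verbatim formula; in particular, the infinite upper limit of the second sum is an assumption.

```latex
\begin{align}
\tilde{\Phi}_{rr}(k) &= \sum_{l=0}^{\infty} K_l \, \Phi_{xx}(k-l) \\
 &= \underbrace{\sum_{l=0}^{D-1} K_l \, \Phi_{xx}(k-l)}_{\tilde{\Phi}_{\epsilon\epsilon}(k)}
  + \underbrace{\sum_{l=D}^{\infty} K_l \, \Phi_{xx}(k-l)}_{\tilde{\Phi}_{LL}(k)}
\end{align}
```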
Conventional processing uses a simple “coupling factor” which scales the PSD of the reference signal (or a smoothed version of this PSD). This coupling factor has either no temporal context at all, or the temporal context spreads out infinitely long when recursive smoothing is applied.
In contrast to this conventional processing,
The FIR model can be simplified by the assumption that Kl is expected to show equal values for all time lags (at least when Hl(k) has converged sufficiently). It is assumed that the misalignment of the adaptive filter coefficients is equally distributed over all taps. Thus, we can apply a common scaling factor C for all time lags as follows:
Kl=C, for l=0, . . . , D−1.
So then:
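With a common factor C for all lags covered by the filter, the first part of the model reduces to a scaled moving sum of the reference PSD. A sketch under that assumption, with the summation limits taken from the lag range 0, . . . , D−1 stated above:

```latex
\[
\tilde{\Phi}_{\epsilon\epsilon}(k) \;=\; \sum_{l=0}^{D-1} K_l \, \Phi_{xx}(k-l)
\;=\; C \sum_{l=0}^{D-1} \Phi_{xx}(k-l)
\]
```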
For the second part Φ̃LL(k), i.e., late reverberant spectral variance, the residual interference is represented by a parametric model that describes an exponential decay over time: Kl=A·B^l, for l≥D, where B is between 0 and 1 and is closely related to the reverberation time T60 of the enclosing room. A is a scaling parameter that represents the strength of the late reverberation. As set forth below:
The decay of the late reverberation can equivalently be formulated recursively, as set forth below:
Φ̃LL(k) = B·Φ̃LL(k−1) + A·Φxx(k−D) (7)
It should be noted that this recursion can be used to estimate the late RSV from the non-reverberant PSD of the reference (use-case of acoustic echo cancellation) as well as from the reverberant PSD of a microphone signal, such as in beamforming. The parameters A and B should, however, be chosen differently in the two cases.
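The joint model can be illustrated with a minimal Python sketch for a single frequency bin, assuming the early part is C times the sum of the reference PSD over lags 0, . . . , D−1 and the late part follows recursion (7). The function name `estimate_rsv` and the zero initial state are illustrative assumptions. Unrolling the recursion gives lag weights A·B^(l−D) for l≥D, so any constant factor B^D is absorbed into A in this form.

```python
import numpy as np

def estimate_rsv(phi_xx, A, B, C, D):
    """Return (early, late, total) RSV estimates for one frequency bin.

    phi_xx : reference-signal PSD per frame (1-D sequence)
    A, B   : late-reverberation scale and decay, with 0 < B < 1
    C      : common scaling factor for lags 0..D-1
    D      : length of the FIR part in frames
    Frames before the start of phi_xx are treated as zero (assumption).
    """
    phi_xx = np.asarray(phi_xx, dtype=float)
    n = len(phi_xx)
    early = np.zeros(n)
    late = np.zeros(n)
    for k in range(n):
        # Early part: C times the sum of phi_xx over lags 0..D-1.
        lo = max(0, k - D + 1)
        early[k] = C * phi_xx[lo:k + 1].sum()
        # Late part: first-order recursion excited by the D-frame-delayed PSD.
        prev = late[k - 1] if k > 0 else 0.0
        delayed = phi_xx[k - D] if k >= D else 0.0
        late[k] = B * prev + A * delayed
    return early, late, early + late
```

For an impulse in the reference PSD, the early estimate is constant for D frames and the late estimate then decays by the factor B per frame.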
In embodiments of the invention, three parameters A, B and C are used for estimating the residual interference PSD Φrr(k) on the basis of the accessible reference-signal PSD Φxx(k). A method for estimating A, B, while neglecting the influence of Φϵϵ(k), is described above. In accordance with illustrative embodiments, parameter estimation is provided which considers both Φϵϵ(k) and ΦLL(k).
Once the adaptive filter has converged sufficiently, parameters A and B can be extracted from the filter coefficients Hl(k). In one embodiment, A and B can be found by fitting a substantially straight line to log(|Hl(k)|^2). This relies on the assumption that the PSD of the echo component also decays exponentially in the window that is covered by the AIC (time lags 0, . . . , D−1), and it therefore requires a sufficiently long AIC filter.
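The line fit can be sketched in Python as follows, assuming the taps of one frequency bin are available as an array and that log|Hl|^2 is approximately log(A) + l·log(B). The function name `fit_decay` and the mapping of the intercept at lag 0 to A are illustrative assumptions; a different lag origin would rescale A by a power of B.

```python
import numpy as np

def fit_decay(h):
    """Fit a straight line to log|H_l|^2 over the lags l = 0..D-1.

    Returns (A, B): A from the intercept at lag 0, B from the slope.
    """
    h = np.asarray(h)
    lags = np.arange(len(h))
    # Small offset avoids log(0) for taps that are exactly zero.
    log_pow = np.log(np.abs(h) ** 2 + 1e-30)
    slope, intercept = np.polyfit(lags, log_pow, 1)
    return float(np.exp(intercept)), float(np.exp(slope))
```

The fit is only meaningful once the filter has converged, since unconverged taps do not follow the room decay.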
In accordance with illustrative embodiments, knowledge about A and B can be used to estimate the parameter C. For time instances when the input signal Y(k) comprises mainly interference (other components like desired speech or local noise are much smaller) the AIC output PSD Φee(k) (which is accessible) is approximately equal to the residual interference PSD Φ̃rr(k). Then, the third parameter C can be estimated as follows:
It is understood that smoothing can be applied, as well as gradient descent techniques, to find C in an iterative way.
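One way to read this step: if Φee(k) ≈ C·ΣΦxx(k−l) + Φ̃LL(k) over lags 0, . . . , D−1, then C follows by subtracting the late part and dividing by the summed reference PSD. The Python sketch below applies that relation with first-order recursive smoothing. The relation is a reconstruction from the model, and the function name `update_C`, the smoothing constant `alpha` and the clamping at zero are assumptions.

```python
def update_C(C_prev, phi_ee, phi_ll, phi_xx_sum, alpha=0.9, eps=1e-12):
    """Smoothed estimate of the common scaling factor C for one bin.

    phi_ee     : AIC output PSD (interference-dominated frame)
    phi_ll     : late RSV estimate from the recursion
    phi_xx_sum : sum of the reference PSD over lags 0..D-1
    alpha      : smoothing constant in [0, 1); 0 disables smoothing
    """
    # Instantaneous estimate; negative values are clamped to zero.
    c_inst = max(phi_ee - phi_ll, 0.0) / (phi_xx_sum + eps)
    return alpha * C_prev + (1.0 - alpha) * c_inst
```

The update should only be run in frames where the interference dominates, since desired speech in Φee(k) would bias C upward.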
For finding the model parameters without relying on the AIC-filter, embodiments provide signal-based processing. To find the parameters, gradient descent processing is employed that minimizes the error in the mean square sense, as follows:
where Φ̃rr(k)=Φ̃ϵϵ(k)+Φ̃LL(k) is the RSV estimate given by the model and Φrr(k) is the true RSV.
For the FIR part, the adaptation can be computed as:
where μ denotes the step size. Note that the normalization term in the denominator includes the D-th input tap that excites the late reverb model. The excitation of the coefficient B is also contained in the normalization. If the FIR model is simplified to the parameter C (assuming all FIR coefficients have the same value), even simpler schemes like the “sign-algorithm” can be used instead. This only evaluates the sign of the gradient, resulting in a fixed increase of C if the estimated PSD is too small and a fixed decrease in the opposite case. A logarithmic error function can also be used to find C. The NLMS update of the parameters A and B reads as follows:
To increase robustness, the temporal context in the error function (Eq. 9) can be utilized. This can for instance be achieved by exponential forgetting. A sliding window may also be used but consumes more memory.
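The sign-algorithm variant for the simplified FIR model can be sketched as below. It only compares the modelled PSD with the observed one and moves C by a fixed amount. The function name `sign_update_C`, the step value and the non-negativity clamp are illustrative assumptions.

```python
def sign_update_C(C, phi_rr_obs, phi_xx_sum, phi_ll, step=1e-3):
    """One sign-algorithm update of the common scaling factor C.

    phi_rr_obs : observed residual interference PSD
    phi_xx_sum : sum of the reference PSD over lags 0..D-1
    phi_ll     : late RSV estimate
    """
    phi_rr_est = C * phi_xx_sum + phi_ll
    if phi_rr_est < phi_rr_obs:
        C += step      # model PSD too small: increase C
    elif phi_rr_est > phi_rr_obs:
        C -= step      # model PSD too large: decrease C
    return max(C, 0.0)  # a PSD scaling factor cannot be negative
```

With a fixed step, C ends up oscillating within one step of the matching value, so the step trades tracking speed against steady-state jitter.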
As an alternative to the MSE cost function given in Eq. 10, a logarithmic error function may be minimized as follows:
The corresponding gradient descent update rule for the FIR filter can be provided as follows:
The parameters A and B are updated as follows:
A further refinement of this adaptation is to subtract the estimated PSD of the early residual interference Φ̃ϵϵ(k) from Φrr(k) before feeding it into B(k), as depicted in
In embodiments, based on the three parameters A, B, and C the residual interference PSD Φrr(k) is estimated. The PSD of the error signal Φee(k) can be directly accessed. The weighting filter W(k) for the RIS according to the Wiener filter rule is:
W(k) is then applied to E(k) to obtain a further enhanced output signal with suppressed interference:
Eenhanced(k)=E(k)·W(k) (18)
This model helps to estimate Φrr(k) more accurately and thus reduces the speech distortion caused by estimation errors.
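The weighting step can be sketched in Python as follows, assuming the common Wiener form W(k) = 1 − Φ̃rr(k)/Φee(k), which is not reproduced in the text above and is therefore an assumption here. The spectral floor `floor` and the function names are likewise illustrative and not part of the described method.

```python
import numpy as np

def wiener_gain(phi_ee, phi_rr_est, floor=0.1, eps=1e-12):
    """Real-valued suppression gain per bin, limited to [floor, 1]."""
    phi_ee = np.asarray(phi_ee, dtype=float)
    phi_rr_est = np.asarray(phi_rr_est, dtype=float)
    gain = 1.0 - phi_rr_est / (phi_ee + eps)
    return np.clip(gain, floor, 1.0)

def suppress(E, phi_ee, phi_rr_est, floor=0.1):
    """Apply the gain to the complex AIC output spectrum E(k)."""
    return np.asarray(E) * wiener_gain(phi_ee, phi_rr_est, floor)
```

An overestimated Φ̃rr(k) drives the gain toward the floor and distorts speech, which is why the accuracy of the PSD model matters for the output quality.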
For adaptation of the filter Hl(k) normalized least-mean square (NLMS) processing can be applied, as follows:
The optimal step size for adaptation can be computed as:
Control of the step size μ(k) enables good convergence behavior. In embodiments, one aim is to obtain a better estimate of the residual Φϵϵ(k) and thus better convergence of the AIC filter. Generally, the dynamic step size enables the filter to adapt (and converge) quickly when Φϵϵ(k) is large (i.e., the filter is not well converged) and also ensures that the filter adapts slowly when Φϵϵ(k) is small (i.e., it prevents the filter from losing good convergence). A benefit of modelling the late reverb here is to get an estimate for the early residual PSD that is not affected by the late residual PSD. As a consequence, the AIC step size will be small even if there is significant late reverberant energy. This improves the convergence of the AIC filter compared to conventional AIC control methods.
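A minimal Python sketch of one NLMS update per frequency bin with a controlled step size is given below. It assumes the error is E(k) = Y(k) − ΣHl(k)·X(k−l) and that the step size is taken as the ratio Φϵϵ(k)/Φee(k) limited to [0, 1]. Both the ratio and the names `optimal_step` and `nlms_step` are assumptions, since the corresponding formulas are not reproduced in the text above.

```python
import numpy as np

def optimal_step(phi_eps, phi_ee, eps=1e-12):
    """Step size as ratio of early residual PSD to AIC output PSD."""
    return float(np.clip(phi_eps / (phi_ee + eps), 0.0, 1.0))

def nlms_step(H, x_buf, y, mu, eps=1e-12):
    """One NLMS update for one frequency bin.

    H     : complex filter taps H_0..H_{D-1}
    x_buf : reference frames X(k), X(k-1), ..., X(k-D+1)
    y     : input (microphone) value Y(k)
    Returns the updated taps and the error E(k).
    """
    e = y - np.dot(H, x_buf)                   # AIC output for this frame
    norm = np.real(np.vdot(x_buf, x_buf)) + eps  # sum of |X(k-l)|^2
    return H + mu * e * np.conj(x_buf) / norm, e
```

Because the step size depends only on the early residual, strong late reverberation in Φee(k) lowers the ratio instead of triggering fast re-adaptation.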
Depending on the acoustic enclosure, the reverberation time T60 (and thus the reverberation tail/length of the room impulse response) is different. The length of the adaptive filter should be chosen according to the T60 (a large T60 requires a longer adaptive filter). Mobile devices, for example, are used in different acoustic environments with very different T60s. Thus, it may be desirable to adjust D dynamically.
Based on the model parameters, the length D of the adaptive filter can be adjusted automatically using a variety of criteria. From parameter B the T60 can be calculated. The length D can be set to a certain (predefined) percentage of T60 (e.g., 60%). Another criterion could be that the ratio of the two error portions equals a certain (predefined) value Q when the filter has converged sufficiently, e.g.: Φϵϵ(k)/ΦLL(k)≈Q. It is understood that, as an alternative to this ratio, a formula based purely on model parameters can be used.
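The first criterion can be sketched in Python as follows, assuming B is the per-frame decay factor of a power quantity, so that a 60 dB decay corresponds to B^n = 10^−6. The frame shift `frame_shift_s`, the default fraction and the function names are illustrative assumptions.

```python
import math

def t60_from_B(B, frame_shift_s):
    """Reverberation time in seconds from the per-frame PSD decay B.

    Requires 0 < B < 1. A 60 dB power decay means B**n == 1e-6.
    """
    frames = -6.0 / math.log10(B)
    return frames * frame_shift_s

def filter_length_frames(B, frame_shift_s, fraction=0.6):
    """Filter length D in frames as a fixed fraction of T60."""
    t60 = t60_from_B(B, frame_shift_s)
    return max(1, round(fraction * t60 / frame_shift_s))
```

For example, a decay of 60 dB over 50 frames at a 10 ms frame shift corresponds to a T60 of 0.5 s, and a fraction of 0.6 then gives D = 30 frames.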
In a beamforming context, where the AIC module 702 is used for enhanced signal blocking, the described estimator for the late reverb 706 can be applied to perform dereverberation 714 on the output signal of the beamformer 712. Thereby, the early reverberation components will not be suppressed as they had been identified by the joint model and are explicitly not fed into the dereverberation filter 714 as illustrated. The blocking matrix gives the AIC filter output for estimating the parameters of a reverberation model.
Conventional spatial postfilter techniques can use the blocked PSD Φee(k) directly for suppressing the reverb, but may suffer from the early reverb components leading to degradations of the desired signal.
Processing may be implemented in hardware, software, or a combination of the two. Processing may be implemented in computer programs executed on programmable computers/machines that each includes a processor, a storage medium or other article of manufacture that is readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and one or more output devices. Program code may be applied to data entered using an input device to perform processing and to generate output information.
The system can perform processing, at least in part, via a computer program product, (e.g., in a machine-readable storage device), for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). Each such program may be implemented in a high level procedural or object-oriented programming language to communicate with a computer system. However, the programs may be implemented in assembly or machine language. The language may be a compiled or an interpreted language and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. A computer program may be stored on a storage medium or device (e.g., CD-ROM, hard disk, or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer. Processing may also be implemented as a machine-readable storage medium, configured with a computer program, where upon execution, instructions in the computer program cause the computer to operate.
Processing may be performed by one or more programmable processors executing one or more computer programs to perform the functions of the system. All or part of the system may be implemented as special purpose logic circuitry (e.g., an FPGA (field programmable gate array) and/or an ASIC (application-specific integrated circuit)).
Elements of different embodiments described herein may be combined to form other embodiments not specifically set forth above. Various elements, which are described in the context of a single embodiment, may also be provided separately or in any suitable subcombination. Other embodiments not specifically described herein are also within the scope of the following claims.
Having described exemplary embodiments of the invention, it will now become apparent to one of ordinary skill in the art that other embodiments incorporating their concepts may also be used. The embodiments contained herein should not be limited to disclosed embodiments but rather should be limited only by the spirit and scope of the appended claims. All publications and references cited herein are expressly incorporated herein by reference in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2014/055329 | 9/12/2014 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2016/039765 | 3/17/2016 | WO | A |
Number | Date | Country | |
---|---|---|---|
20170287502 A1 | Oct 2017 | US |