The present invention relates to devices, systems and methods for noise suppressing audio signals comprising a combination of at least two audio system input signals each having a source signal portion and a background noise portion.
In audio communication, it is typically expedient to transmit a user's voice undistorted and free of noise. However, communication devices are often employed in noisy environments; the signals picked up by a device's microphones are mixtures of the user's voice and interfering noise.
The characteristics of the sound field at the microphones vary substantially across different signal and noise scenarios. For instance, the sound may come from a single direction or from many directions simultaneously. It may originate far away from or close to the microphones. It may be stationary/constant or non-stationary/transient. The noise may also be generated by wind turbulence at the microphone ports.
Multi-microphone background noise reduction methods fall into two general categories. The first type is beamforming, where the output samples are computed as a linear combination of the input samples. The second type is noise suppression, where the noise component is reduced by applying a time-variant filter to the signal, such as by multiplying the signal by a time- and frequency-dependent gain in a filter bank domain.
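The distinction between the two categories can be illustrated with a minimal Python sketch. The function names `beamform` and `suppress` and the fixed weights are illustrative assumptions, not taken from any cited method; a practical beamformer would typically use frequency-dependent and possibly adaptive weights.

```python
def beamform(x1, x2, w1=0.5, w2=0.5):
    """Beamforming: each output sample is a linear combination of input samples.

    The fixed weights w1 and w2 are hypothetical placeholders.
    """
    return [w1 * a + w2 * b for a, b in zip(x1, x2)]


def suppress(spectrum, gains):
    """Noise suppression: scale each time-frequency bin by its own gain.

    `spectrum` holds the bins of one frame and `gains` the matching
    gains, normally in the range 0..1.
    """
    return [g * z for g, z in zip(gains, spectrum)]
```

The beamformer mixes channels sample by sample, whereas the suppressor leaves the channel mix untouched and only attenuates individual bins.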
When only one microphone or audio input is available, a noise suppression filter cannot be spatially sensitive. It has no access to the spatial features of the sound field, which provide discriminative information about speech and background noise, and it is therefore typically limited to suppressing the stationary or quasi-stationary component of the background noise.
Beamforming and noise suppression may be sequentially applied, since their noise reduction effects are additive.
An example of an adaptive beamformer is disclosed in WO 2009/132646 A1.
A method of separating mixtures of sound is disclosed in “O. Yilmaz and S. Rickard, Blind Separation of Speech Mixtures via Time-Frequency Masking, IEEE Transactions on Signal Processing, Vol. 52, No. 7, pages 1830-1847, July 2004”. Separation masks are computed in a time-frequency representation on the basis of two features, namely the level difference and phase-delay between the two sensor signals.
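As a rough illustration of the two features named above, the following Python sketch derives a level ratio and a phase delay from one time-frequency bin of two sensor signals. It is a simplified reading of that kind of feature, not a reproduction of the cited method; the function name `level_and_delay` and the choice to express the level as an amplitude ratio are assumptions.

```python
import cmath


def level_and_delay(x1, x2, omega):
    """Return (amplitude ratio, delay in seconds) of sensor 2 relative to sensor 1.

    x1, x2: complex values of the same time-frequency bin (x1 non-zero).
    omega:  centre frequency of the bin in rad/s (non-zero).
    The delay is only unambiguous while |omega * delay| < pi,
    because the phase wraps around.
    """
    ratio = x2 / x1
    level = abs(ratio)
    delay = -cmath.phase(ratio) / omega
    return level, delay
```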
A method of combining directional noise suppression and a stationary noise suppression algorithm is disclosed in WO 2009/096958 A1. However, this method does not take into account a spatial noise suppression component which takes advantage of combining a set of spatially discriminative features besides directional features.
The fundamental problem of noise suppression addressed by this invention is to classify a sound signal across time and frequency as being either predominantly a signal of interest, e.g. a user's voice or speech, or predominantly interfering noise, and to apply the relevant filtering to reduce the noise component in the output signal. This classification has a chance of success when the distributions of speech and noise differ.
Exploiting the differing distributions, a number of methods in the literature propose spatial features that map the signals to a one-dimensional classification problem to be subsequently solved. Examples of such features are angle of arrival, proximity, coherence and sum-difference ratio.
The present invention exploits the fact that each of the proposed spatial features carries a degree of uncertainty and that the features may advantageously be combined, achieving a higher degree of classification accuracy than could otherwise have been achieved with any one of the individual spatial features. The proposed spatial features have been selected so that each of them adds discrimination power to the classifier.
In one embodiment of the invention the input to the classifier is a weighted sum of the proposed features.
An object of the present invention is therefore to provide a noise suppressor in the transmit path of a personal communication device which eliminates stationary noise as well as non-stationary background noise.
According to a first aspect of the invention this is achieved by a method of noise suppressing an audio signal comprising a combination of at least two audio system input signals each having a sound source signal portion and a background noise portion, the method comprising steps of:
a) extracting at least two different types of spatial sound field features from the input signals such as discriminative speech and/or background noise features,
b) computing a first intermediate spatial noise suppression gain on the basis of the extracted spatial sound field features,
c) computing a second intermediate stationary noise suppression gain,
d) combining the two intermediate noise suppression gains to form a total noise suppression gain, wherein the two intermediate noise suppression gains are combined by comparing their values and dependent on their ratio or relative difference, determining the total noise suppression gain,
e) applying the total noise suppression gain to the audio signal to generate a noise suppressed audio system output signal.
The method may advantageously be carried out in the frequency domain for at least one frequency sub-band. Well known methods of Fourier transformation such as the Fast Fourier Transformation (FFT) may be applied to convert the signals from time domain to frequency domain. As a result, optimal filtering may be applied in each band. A new frequency spectrum may be calculated every 20 ms or at any other suitable time interval using the FFT algorithm.
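The framing and transformation step might be sketched as follows. This is a minimal illustration under stated assumptions: a 16 kHz sample rate, non-overlapping frames without windowing, and a naive discrete Fourier transform standing in for the FFT so that the example needs only the standard library. The names `split_frames` and `dft` are hypothetical.

```python
import cmath
import math


def split_frames(signal, sample_rate=16000, frame_ms=20):
    """Cut a signal into consecutive frames of frame_ms milliseconds.

    Trailing samples that do not fill a whole frame are dropped.
    """
    n = sample_rate * frame_ms // 1000
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]


def dft(frame):
    """Return bins 0..n//2 of the DFT of a real frame.

    Naive O(n^2) transform used here in place of an FFT.
    """
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]
```

Each returned bin corresponds to one frequency sub-band, so a separate gain can be computed and applied per bin and per frame.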
To achieve the optimum noise suppression gain in step d) mentioned above, the total noise suppression gain may be selected as the minimum gain or the maximum gain of the two intermediate noise suppression gains. If aggressive noise suppression is desired, the minimum gain could be selected. If conservative noise suppression is desired, letting through a larger amount of speech, the maximum gain could be selected.
Within the span of the minimum and the maximum gain a weighting factor may also be applied in step d) to achieve a more flexible total noise suppression gain. The total noise suppression gain is then selected as a linear combination of the two intermediate noise suppression gains. If the same factor 0.5 is applied to the two intermediate gains the result will be the average gain. Other factors such as 0.3 for the first intermediate gain and 0.7 for the second or vice-versa may be applied. The selected combination may be based on a measure of confidence provided by each noise reduction method.
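The three combination rules described for step d) can be sketched in a few lines of Python. The function name `combine_gains`, the `mode` strings and the convention that the two factors sum to one are assumptions made for illustration; the last of these keeps the result within the span of the minimum and the maximum gain.

```python
def combine_gains(g_spatial, g_stationary, mode="min", weight=0.5):
    """Combine two intermediate gains into a total gain for one bin.

    mode="min":      aggressive suppression (smaller gain wins).
    mode="max":      conservative suppression (larger gain wins).
    mode="weighted": weight * g_spatial + (1 - weight) * g_stationary.
    """
    if mode == "min":
        return min(g_spatial, g_stationary)
    if mode == "max":
        return max(g_spatial, g_stationary)
    if mode == "weighted":
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must lie between 0 and 1")
        return weight * g_spatial + (1.0 - weight) * g_stationary
    raise ValueError("unknown mode: " + mode)
```

For example, `combine_gains(0.2, 0.8, "weighted", weight=0.3)` applies the factor 0.3 to the first gain and 0.7 to the second.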
In an embodiment of the invention, the spatial sound field features may comprise sound source proximity and/or sound signal coherence and/or sound wave directionality, such as angle of incidence.
The method may further comprise, prior to step e), a step of spatially filtering the audio signal by means of a beamformer, and subsequently in step e) applying the total noise suppression gain to the output signal from the beamformer. In this way the audio signal will already to some extent have been spatially filtered before applying the total noise suppression gain.
The method may further comprise a step of computing at least one set of spatially discriminative cues derived from the extracted spatial features, and computing the spatial noise suppression gain on the basis of the set(s) of spatially discriminative cues. Computing the spatial noise suppression gain may be done from a linear combination of spatial cues. Preferably the method comprises weighting the mutual relation of the content of the different types of spatial cues in the set of spatial cues as a function of time and/or frequency. In this way e.g. the directionality cue may be chosen to be more predominant in one frequency sub-band and the proximity cue to be more predominant in another frequency sub-band. New spatial cues may be computed every 20 ms or at any other suitable time interval.
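A frequency-dependent weighting of the cues might look like the sketch below. The band names, the weight values in `BAND_WEIGHTS` and the clipping of the result to the range 0 to 1 are hypothetical choices made only to show the idea; the description does not prescribe which cue dominates in which sub-band.

```python
# Hypothetical per-band weights for (proximity cue, directionality cue).
BAND_WEIGHTS = {
    "low": (0.7, 0.3),   # proximity cue dominates (assumed)
    "high": (0.3, 0.7),  # directionality cue dominates (assumed)
}


def spatial_gain(cues, weights):
    """Linear combination of spatial cues, limited to the range 0..1."""
    gain = sum(w * m for w, m in zip(weights, cues))
    return min(1.0, max(0.0, gain))
```

With cues `(1.0, 0.0)`, i.e. a strong proximity cue and no directional support, the low band yields a gain of 0.7 and the high band a gain of 0.3.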
In an embodiment the method comprises computing the stationary noise suppression gain on the basis of a beamformer output signal. This enables the stationary noise suppression filter to calculate an improved estimate of the background noise and desired sound source portions (voice/speech) of the audio system signal.
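The detailed description names a Wiener filter as a typical single channel method for the stationary stage. A minimal sketch of a Wiener-type gain for one bin is given below, assuming that a noise power estimate is already available and that the speech power is approximated as the signal power minus the noise power. The function name `wiener_gain` and the gain floor of 0.1 are assumptions.

```python
def wiener_gain(signal_power, noise_power, floor=0.1):
    """Wiener-type gain for one bin, e.g. of the beamformer output.

    signal_power: smoothed power of the noisy signal in the bin.
    noise_power:  estimated stationary noise power in the same bin.
    floor:        hypothetical lower limit that avoids fully muting a bin.
    """
    if signal_power <= 0.0:
        return floor
    gain = 1.0 - noise_power / signal_power
    return max(floor, min(1.0, gain))
```

Feeding the function with powers measured on the beamformer output rather than on a raw microphone signal corresponds to the embodiment described above.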
The audio system input signals may comprise at least two microphone signals to be processed by the method.
A second aspect of the present invention relates to a system for noise suppressing an audio signal, the audio signal comprising a combination of at least two audio system input signals each having a sound source signal portion and a background noise portion, wherein the system comprises:
The spatial sound field features may further comprise the same features as mentioned above according to the first aspect of the invention. Likewise the total noise suppression gain may be determined and selected in the same way as explained in accordance with the first aspect of the invention.
The system may further comprise an audio beamformer having the two audio system input signals as input and a spatially filtered audio signal as output, the output signal serving as input signal to the output filtering block.
The features of the second aspect of the invention provide at least the same advantages as explained in accordance with the first aspect of the invention.
A third aspect of the invention relates to a headset comprising at least two microphones, a loudspeaker and a noise suppression system according to the second aspect of the invention, wherein the microphone signals serve as input signals to the noise suppression system.
Preferred embodiments of the invention will be described in more detail in connection with the appended drawings, in which:
In
The system processes inputs from at least two audio channels such as the input from two audio microphones placed in a sound field comprising a desired sound source signal such as speech from the mouth of a user of a personal communication device and an undesired background noise e.g. stationary or non-stationary background noise. A typical device for personal communication using the system for noise suppressing may be a headset such as a telephone headset placed on or near the ear of the user. Applying a noise suppression algorithm on the transmitted audio signal in the headset improves the perceived quality of the audio signal received at a far end user during a telephone conversation.
Sound field information is exploited in order to discriminate between user speech and background noise and spatial features such as directionality, proximity and coherence are exploited to suppress sound not originating from the user's mouth.
The microphones typically have different distances to the desired sound source in order to provide signals having different signal to noise ratios making further processing possible in order to efficiently remove the background noise portion of the signal.
In
The aligned input signals are advantageously Fourier transformed by a well known method such as the Fast Fourier Transformation (FFT) 5 to convert the signals from time domain to frequency domain. This enables signal processing in individual frequency sub-bands which ensures an efficient noise reduction as the signal to noise ratio may vary substantially from sub-band to sub-band. The FFT algorithm 5 may alternatively be applied prior to the alignment and matching filters 3, 4.
The spatial noise suppression gain block 6, 7 for computing a first intermediate spatial noise suppression gain comprises spatial feature extraction means and computing means for computing the spatial noise suppression gain on the basis of the extracted spatial sound field features. The features may be discriminative speech and/or background noise features, such as sound source proximity, sound signal coherence and sound wave directionality. One or more of the different types may be extracted. The proximity feature carries information on the distance from the sound source to the signal sensing unit such as two microphones placed in a headset. The user's mouth will be located at a fairly well defined distance from the microphones making it possible to discriminate between speech and noise from the surroundings.
The coherence feature carries information about the similarity of the signals sensed by the microphones. A speech signal from the user's mouth will result in two highly coherent sound source portions in the two input signals, whereas a noise signal will result in a less coherent signal. The directionality feature carries information such as the angle of arrival of an incoming sound wave on the surface of the microphone membranes. The user's mouth will typically be located at a fairly well defined angle of arrival relative to the noise sources. On the basis of these spatial features, the spatial cues are computed and in the further processing, mapped to the spatial gain.
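The description does not give a formula for the coherence feature. One common measure is the magnitude-squared coherence, sketched below for a single frequency bin observed over several frames; its use here, and the function name `coherence`, are assumptions.

```python
def coherence(x1_bins, x2_bins):
    """Magnitude-squared coherence of one frequency bin over several frames.

    x1_bins, x2_bins: complex bin values of the two aligned input
    signals, one value per frame. Returns a value between 0 and 1.
    """
    n = len(x1_bins)
    p1 = sum(abs(a) ** 2 for a in x1_bins) / n
    p2 = sum(abs(b) ** 2 for b in x2_bins) / n
    p12 = sum(a * b.conjugate() for a, b in zip(x1_bins, x2_bins)) / n
    if p1 == 0.0 or p2 == 0.0:
        return 0.0
    return abs(p12) ** 2 / (p1 * p2)
```

Two signals that differ only by a fixed phase shift give a value close to 1, whereas signals whose phase relation changes from frame to frame give a value close to 0.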
A stationary noise suppression gain is computed, typically using a well known single channel stationary noise suppression method such as a Wiener filter. The method will generate a noise estimate and a speech signal estimate. As shown in the embodiment of the invention in
A noise suppression gain combining block 8 for combining the two intermediate noise suppression gains compares their values and determines the total noise suppression gain dependent on the ratio or relative difference of the two values.
To achieve the optimum noise suppression gain, the total noise suppression gain may be selected as the minimum gain or the maximum gain of the two intermediate noise suppression gains. If aggressive noise suppression is desired, the minimum gain could be selected. If conservative noise suppression is desired, letting through a larger amount of speech, the maximum gain could be selected.
Within the span of the minimum and the maximum gain a weighting factor may also be applied to achieve a more flexible total noise suppression gain. The total noise suppression gain is then selected as a linear combination of the two intermediate noise suppression gains. If the same factor 0.5 is applied to the two intermediate gains the result will be the average gain. Other factors such as 0.3 for the first intermediate gain and 0.7 for the second or vice-versa may be applied. The selected combination may be based on a measure of confidence provided by each noise reduction method.
Optionally, the noise suppression gain combining block 8 may comprise a gain refinement filter as shown in
Finally, an output filtering block 11 applies the total noise suppression gain to the audio signal to generate a noise suppressed audio system output signal. Again the audio signal may be a preliminary processed audio signal such as a linear combination of the two audio system input signals provided by a beamformer 10, such as an adaptive beamformer system. The Inverse Fast Fourier Transformation (IFFT) 12 converts the output signal from the frequency domain back to the time domain to provide a processed audio system output signal.
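The output stage can be sketched as a per-bin multiplication followed by an inverse transform. The example below is a simplified illustration: it handles one frame, uses a naive inverse discrete Fourier transform in place of the IFFT, and omits windowing and overlap-add. The names `apply_gain` and `inverse_dft_real` are hypothetical.

```python
import cmath
import math


def apply_gain(spectrum, gains):
    """Multiply each frequency bin by its total noise suppression gain."""
    return [g * z for g, z in zip(gains, spectrum)]


def inverse_dft_real(half_spectrum, n):
    """Rebuild n real samples from bins 0..n//2 of a real signal's DFT.

    Naive O(n^2) transform used here in place of an IFFT.
    """
    out = []
    for t in range(n):
        acc = 0.0
        for k, z in enumerate(half_spectrum):
            term = (z * cmath.exp(2j * math.pi * k * t / n)).real
            # DC and (for even n) Nyquist occur once; all other bins
            # also have a conjugate-symmetric partner and count twice.
            single = k == 0 or (n % 2 == 0 and k == n // 2)
            acc += term if single else 2.0 * term
        out.append(acc / n)
    return out
```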
In the embodiment shown in
In the following, an example will explain how the spatial noise suppression gain may be computed according to the embodiments of the system shown in
In the following a short hand notation is employed, where a filter bank transfer function is assumed but time and bin indices are omitted. A preliminary spatial gain is computed from a linear combination of spatial cues:
where $m_k$, $\alpha_k$ and $Z_{ADM}$ are the spatial cues, the cue weights and the output from e.g. a beamformer, respectively. The operator $\langle\cdot\rangle$ denotes averaging over time, e.g. 20 ms. The spatial cues $m_k$ and the cue weights $\alpha_k$ are designed to produce a spatial gain between 0 and 1. The spatial cue weights may be applied to make one or more of the spatial cues more predominant and, conversely, one or more other spatial cues less predominant in the computation of the spatial noise suppression gain.
The proximity cue may be computed as:
The directional cue may be computed as:
$$m_2 = 1 - \max\left(\lvert k\,\angle P_{12} \rvert - \omega_0,\ 0\right)$$
where $P_1$, $P_2$ and $P_{12}$ are the auto and cross powers of the aligned input signals. The constants $\beta$, $R_0$ and $\omega_0$ parameterize the spatial cue functions, and $k$ is a frequency-dependent normalization factor that maps phase to angle of arrival.
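The directional cue expression above can be evaluated as in the following sketch. The cross power is estimated here as a plain average over frames, and the normalization factor `k` and the constant `omega0` are passed in as parameters because the description does not state their values; both function names are hypothetical.

```python
import cmath


def cross_power(x1_bins, x2_bins):
    """Average cross power of one bin over several frames (assumed estimator)."""
    n = len(x1_bins)
    return sum(a * b.conjugate() for a, b in zip(x1_bins, x2_bins)) / n


def directional_cue(p12, k, omega0):
    """Directional cue: 1 - max(|k * angle(P12)| - omega0, 0).

    p12:    complex cross power of the aligned input signals.
    k:      normalization factor mapping phase to angle (value assumed).
    omega0: tolerance around the target direction (value assumed).
    """
    return 1.0 - max(abs(k * cmath.phase(p12)) - omega0, 0.0)
```

Because the input signals are aligned towards the user's mouth, sound from the target direction gives a cross power phase near zero and a cue of 1; the cue decreases once the normalized phase deviates by more than `omega0`.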
Directional and non-stationary background noise is specifically targeted by the invention, but it also handles stationary noise conditions and wind noise. Advantageously the method and system according to the invention are used in a headset as described above. An embodiment of such a headset 13, having a speaker 14 and two microphones 1, 2 is shown in
Likewise, the method and system may be implemented in other personal communication devices having two or more microphones, such as a mobile telephone, a speakerphone or a hearing aid.
Number | Date | Country | Kind |
---|---|---|---|
2011 00667 | Sep 2011 | DK | national |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2012/066971 | 8/31/2012 | WO | 00 | 6/18/2014 |

Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2013/030345 | 3/7/2013 | WO | A |

Number | Name | Date | Kind |
---|---|---|---|
6584203 | Elko | Jun 2003 | B2 |
20070154031 | Avendano | Jul 2007 | A1 |
20070237341 | Laroche | Oct 2007 | A1 |
20110070926 | Vitte et al. | Mar 2011 | A1 |

Number | Date | Country |
---|---|---|
WO03015458 | Feb 2003 | WO |
WO2009076523 | Jun 2009 | WO |
WO2009096958 | Aug 2009 | WO |

Entry |
---|
Wittkop T. et al.: “Strategy-selective noise reduction for binaural digital hearing aids”, Speech Communication, Elsevier Science Publishers, Amsterdam, NL, vol. 39, Jan. 1, 2003, pp. 111-138, XP002266432, ISSN: 0167-6393. |
Seon Man Kim et al.: “Probabilistic spectral gain modification applied to beamformer-based noise reduction in a car environment”, IEEE Transactions on Consumer Electronics, IEEE Service Center, New York, NY, US, vol. 57, No. 2, May 1, 2011, pp. 866-872, XP011335726, ISSN: 0098-3063. |
Homayoun Kamkar-Parsi A et al.: “Instantaneous Binaural Target PSD Estimation for Hearing Aid Noise Reduction in Complex Acoustic Environments”, IEEE Transactions on Instrumentation and Measurement, IEEE Service Center, Piscataway, NJ, US, vol. 60, No. 4, Apr. 1, 2011, pp. 1141-1154, XP011349547, ISSN: 0018-9456. |
Danish search Report for Danish Application PA 2011 00667 dated Apr. 16, 2012. |
International Search Report for PCT Application No. PCT/EP2012/066971 dated Apr. 10, 2013. |

Number | Date | Country |
---|---|---|
20140307886 A1 | Oct 2014 | US |