This application is a National Stage application of PCT/US2013/049846 filed on Jul. 10, 2013, published in the English language on Jan. 15, 2015 as International Publication Number WO 2015/005914A1, entitled “Methods And Apparatus For Dynamic Low Frequency Noise Suppression”, which is incorporated herein by reference.
As is known in the art, noise suppression in communication systems is desirable to improve the user experience. For example, mobile device communication between two or more parties is improved if the words spoken by the parties are crisp and easy to understand. Noise can make it difficult for the parties to understand what is being said by the other parties.
Conventional communication systems involving speech typically use Wiener filters to suppress stationary noise. However, the Wiener filter response is dependent upon the Signal-to-Noise Ratio (SNR) so that Wiener filters may not react with sufficient quickness to adequately suppress non-stationary noise bursts. As is known in the art, noise bursts can be problematic since it can be difficult to obtain a reliable estimate of the noise power spectral density. In addition, in conventional systems detection of relatively short bursts may be unreliable.
The present invention provides methods and apparatus for speech signal enhancement by dynamically suppressing low frequency noise events without suppressing speech components. With this arrangement, noise events, such as road bumps, can be suppressed without suppressing speech formants.
In one embodiment, a speech signal enhancement system for removing noise from microphone input and providing a cleaned up output signal includes dynamic low frequency noise event suppression in accordance with exemplary embodiments of the invention. Exemplary speech signal enhancement systems can include single and/or multiple microphone systems that are useful for mobile telephone applications. While exemplary embodiments of the invention are shown and described in conjunction with particular applications, components, and processing, it is understood that embodiments of the invention are applicable to audio applications in general in which it is desirable to suppress certain low frequency noise events.
In one aspect of the invention, a method comprises: receiving an input signal, forming a first window of the input signal spanning a first frequency range, forming a second window of the input signal having a second frequency range adjacent to the first frequency range, determining information on any signal peaks in the first and second windows, computing, using a computer processor, a dampening level from the information on the signal peaks in the first and second windows, and adjusting sizes of the first and second windows until a final dampening level is determined for dynamically suppressing non-speech audio events in the input signal.
The method can further include one or more of the following features: the information on the signal peaks comprises a maximum power, the dampening level is computed using a ratio of the maximum powers in the first and second windows, the final dampening level corresponds to a maximum frequency for the first window at which a total dampening for the first window is maximized, adjusting the sizes of the first and second windows by increasing a size of the first window and increasing a size of the second window, wherein the adjusted first and second windows do not overlap and remain adjacent to each other, the final dampening level is only applied to the first window, the first and second windows are of equal size, the first frequency range has a maximum corresponding to maximum frequency for a lowest expected speech formant, forming the first and second windows to capture a first speech formant in the first window and a harmonic of the first speech formant in the second window, the non-speech audio event comprises a road bump, making a voiced/unvoiced determination frame-by-frame and selecting a maximum frequency for the first frequency range based upon the voiced/unvoiced determination, and/or limiting a maximum frequency of the second frequency range based upon a maximum fundamental frequency for speech.
In another aspect of the invention, a system comprises: a dynamic noise suppression module, comprising: a frame module to sample an input signal, a window generation module coupled to the frame module to form a first window spanning a first frequency range and a second window having a second frequency range adjacent to the first frequency range and to adjust the first and second windows, a power module to determine signal peak information for the first window and for the second window, and a dampening computation module to compute a dampening level corresponding to the signal peak information in the first and second windows for suppressing non-speech audio events in the input signal.
The system can further include one or more of the following features: the dampening computation module can compute the dampening level using a ratio of the maximum powers in the first and second windows, the window generation module can adjust the sizes of the first and second windows by increasing a size of the first frequency range and increasing a size of the second window, wherein the adjusted first and second windows do not overlap and remain adjacent to each other, and/or the window generation module can form the first and second windows to capture a first speech formant in the first window and a harmonic of the first speech formant in the second window. In one embodiment, the start of the second window is selected to contain at least the highest harmonic component of the lowest formant to avoid dampening of the formant to background noise level. The first window is selected to end slightly below the frequency at which the highest harmonic of the lowermost formant is expected.
In a further aspect of the invention, an article comprises: at least one computer readable medium including non-transitory stored instructions that enable a machine to: receive an input signal, form a first window spanning a first frequency range, form a second window having a second frequency range adjacent to the first frequency range, determine information on any signal peaks in the first and second windows, compute, using a computer processor, a dampening level from the information on the signal peaks in the first and second windows, and adjust sizes of the first and second windows until a final dampening level is determined for suppressing non-speech audio events in the input signal.
The article can further include instructions for computing the dampening level using a ratio of maximum powers in the first and second windows, and/or instructions for adjusting the sizes of the first and second windows by increasing a size of the first frequency range and increasing a size of the second window, wherein the adjusted first and second windows do not overlap and remain adjacent to each other.
The foregoing features of this invention, as well as the invention itself, may be more fully understood from the following description of the drawings in which:
A noise suppression module 106 receives the pre-processed information from the microphone array 102 and removes noise. In an exemplary embodiment, the noise suppression module 106 includes a dynamic low frequency noise suppression module 108 to suppress relatively short non-stationary noise bursts, such as road bumps.
In one embodiment, the noise suppression module 106 provides a reduced noise signal to a user device 110, such as a mobile telephone. A gain module 112 can receive an output from the device 110 to amplify the signal for a loudspeaker 114 or other sound transducer.
It is understood that it is desirable to remove noises associated with audio events, such as road bumps, without removing speech components. Relatively low frequency audio events, such as road bumps, are often located below the visible part of the human speech harmonics structure, as shown in
In an exemplary embodiment, an input signal, such as from a microphone array, is processed into frames, each having a number of samples. Each frame is analyzed to determine whether speech is present in the frame. In a speech-based embodiment, the sampling rate can be on the order of 8 kHz. Using a Fast Fourier Transform (FFT), about 129 frequency bins can be generated. In an alternative embodiment, a filterbank may be used to obtain a frequency domain representation. A window for identifying speech components, which is described more fully below, can initially include on the order of 2-3 frequency bins. It is understood that any practical sampling rate and number of frequency bins can be used to meet the requirements of a particular application.
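The framing and transform step can be illustrated with a minimal sketch. The FFT length of 256, the Hann analysis window, and the names `FS`, `N_FFT`, and `frame_power_spectrum` are assumptions made for illustration; the text only states an 8 kHz sampling rate and about 129 frequency bins, which is what a 256-point one-sided FFT yields.

```python
import numpy as np

FS = 8000      # sampling rate in Hz (from the text)
N_FFT = 256    # assumed FFT length; gives N_FFT // 2 + 1 = 129 one-sided bins


def frame_power_spectrum(frame: np.ndarray) -> np.ndarray:
    """Return the one-sided power spectrum of a single frame."""
    # The Hann window is an assumption; the text does not name an analysis window.
    windowed = frame * np.hanning(len(frame))
    spectrum = np.fft.rfft(windowed, n=N_FFT)
    return np.abs(spectrum) ** 2


# Example frame: a 100 Hz tone, one frame long.
t = np.arange(N_FFT) / FS
pxx = frame_power_spectrum(np.sin(2 * np.pi * 100.0 * t))
bin_width_hz = FS / N_FFT  # frequency spacing between adjacent bins
```

Under these assumptions each bin spans 31.25 Hz, so an initial window of 2-3 bins covers roughly 60 to 95 Hz.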
In exemplary embodiments of the invention, initial first and second windows, which are described more fully below, are selected to evaluate the frequency and intensity information for identifying whether speech is present or whether a noise event is present. In general, speech should not be filtered while noise events should be dampened to improve the speech quality heard by users. The first and second windows are then adjusted to evaluate the peaks, if any, in the signal from the microphone array to determine whether speech is present or whether a low frequency noise event is present that should be dampened.
Referring again to the illustrative plot 400 in
An exemplary window size is 2-3 frequency bins. The voiced speech of a typical adult male has a fundamental frequency from about 85 to about 180 Hz and the voiced speech of a typical adult female has a fundamental frequency from about 165 to about 270 Hz. In the illustrated embodiment, the first window begins at about 30 Hz and ends at about 216 Hz. The first window 408 starts at a frequency corresponding to a lowest fundamental frequency that is expected, here selected to be 30 Hz. As shown in
It should be noted that if a first speech harmonic component, such as the first peak 402, is in the first window 408, a second harmonic component will be contained in the second window 410 due to the harmonic nature of the speech formants and the initial window frequencies. For example, if a fundamental frequency is 100 Hz, the second harmonic frequency is 200 Hz, and third harmonic frequency is 300 Hz (fn=nf0), and so on.
It is understood that there is an assumption that each harmonic increases in power, i.e., that the harmonics within a formant increase in power with increasing frequency, or at least stay at the same level. Harmonics can decrease in frequency. In one embodiment, a scaling factor α can be used to relax assumptions or to make them more strict, as described more fully below.
In an exemplary embodiment, the second window 410 ends at frequency $K=\min(2k,\ k+f_{0,\max})$, where $f_{0,\max}$ is the maximum fundamental frequency that is expected and $k$ is the maximum frequency in the first window. In most cases, the second window will be the same size as the first window, such that $k+f_{0,\max}$ does not serve to limit the end point of the second window.
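The window boundaries can be sketched as follows. Expressing `k` and the maximum fundamental frequency in frequency bins (rather than Hz) is an assumption, as are the function names and the example value of 8 bins for the maximum fundamental frequency.

```python
def second_window_end(k: int, f0_max_bins: int) -> int:
    """Upper edge K of the second window: K = min(2k, k + f0_max)."""
    return min(2 * k, k + f0_max_bins)


def window_bins(k: int, f0_max_bins: int):
    """Return the bin indices of the first (1..k) and second (k+1..K) windows."""
    K = second_window_end(k, f0_max_bins)
    first = range(1, k + 1)
    second = range(k + 1, K + 1)
    return first, second


# With a small first window the two windows have equal size ...
first_small, second_small = window_bins(k=3, f0_max_bins=8)
# ... while for a large first window the f0_max term limits the second window.
first_large, second_large = window_bins(k=10, f0_max_bins=8)
```

The two windows are adjacent and non-overlapping by construction, since the second window always starts at bin `k + 1`.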
Once the initial first and second windows 408, 410 are established, the maximum power $P_L$, $P_U$ of the respective first (lower) and second (upper) windows is computed as follows:
$P_L=\max\{P_{XX}(l),\ l=1,\ldots,k\}$
$P_U=\max\{P_{XX}(l),\ l=k+1,\ldots,K\}$
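The two maxima can be computed directly from a power spectrum array, as in the sketch below. The function name is hypothetical, and the array is assumed to be indexed so that element 0 is the DC bin, which lies outside both windows.

```python
import numpy as np


def window_max_powers(pxx: np.ndarray, k: int, K: int):
    """Return (P_L, P_U): maximum power in bins 1..k and in bins k+1..K."""
    p_lower = float(np.max(pxx[1:k + 1]))       # first (lower) window
    p_upper = float(np.max(pxx[k + 1:K + 1]))   # second (upper) window
    return p_lower, p_upper


# Toy spectrum: index 0 is DC and is not part of either window.
pxx = np.array([0.0, 1.0, 9.0, 2.0, 4.0, 16.0, 3.0, 1.0])
p_l, p_u = window_max_powers(pxx, k=3, K=6)
```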
In the plot 400 of
The maximum power for the peaks can then be used to compute a dampening factor. In one embodiment, the dampening factor can be defined as set forth below:
where 1 indicates no dampening. In an exemplary embodiment, the dampening factor is determined and held constant for the entire window length.
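The defining equation is not reproduced in this text. A form consistent with the surrounding description (a ratio of the window maxima, with 1 meaning no dampening) is the minimum of 1 and $P_U/P_L$; the sketch below assumes that form, and the function name and the small guard value `eps` are hypothetical.

```python
def dampening_factor(p_lower: float, p_upper: float, eps: float = 1e-12) -> float:
    """Assumed form: H_k = min(1, P_U / P_L), held constant over the first window."""
    # eps avoids division by zero for an empty lower window (an added safeguard).
    return min(1.0, p_upper / max(p_lower, eps))


# Upper window at least as strong as the lower one (speech-like): no dampening.
h_speech = dampening_factor(p_lower=9.0, p_upper=16.0)
# Strong low-frequency peak with nothing above it (bump-like): heavy dampening.
h_bump = dampening_factor(p_lower=100.0, p_upper=4.0)
```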
Where the second peak 404, which is located in the second window 410 in
After generating the initial first and second windows 408, 410 and computing the dampening factor, the sizes of the first and second windows 408, 410 are then adjusted to determine if the dampening is optimized based upon the location of the peaks (if any). In an exemplary embodiment, the first window size is increased by one frequency bin, and the second window is shifted up by one frequency bin and also extended by one frequency bin at its upper end. The dampening factor is re-computed for the new windows. The process of increasing the first and second window sizes and re-computing the dampening is repeated until stopping at a maximum frequency $k_{\max}$, which is chosen in such a way that speech is not suppressed, as described above. In an exemplary embodiment, total dampening is maximized, as set forth below:
$\tilde{H}(l)=\min\{\tilde{H}_k(l),\ k=1,\ldots,k_{\max}\}$
It is understood that minimum coefficients provide maximum dampening. It is desired to maximize dampening in each frequency bin based on the relationships set forth above.
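The window-growing loop and the per-bin minimum can be sketched as below. This assumes the dampening factor for each window size takes the form min(1, P_U/P_L), that window edges are expressed in bins, and that the loop may start at a configurable first-window size `k_start`; these choices and all names are illustrative.

```python
import numpy as np


def total_dampening(pxx, k_start, k_max, f0_max_bins):
    """Per-bin minimum of the dampening factors over all first-window sizes."""
    if k_max > len(pxx) - 2:
        raise ValueError("k_max must leave at least one bin for the second window")
    h = np.ones(len(pxx))  # 1 means no dampening
    for k in range(k_start, k_max + 1):
        K = min(2 * k, k + f0_max_bins, len(pxx) - 1)
        p_lower = np.max(pxx[1:k + 1])
        p_upper = np.max(pxx[k + 1:K + 1])
        h_k = min(1.0, p_upper / max(p_lower, 1e-12))  # assumed ratio form
        # The factor is constant over the first window; keep the strongest dampening.
        h[1:k + 1] = np.minimum(h[1:k + 1], h_k)
    return h


# Bump-like frame: strong energy in bins 1-2, flat background above.
bump = np.array([0, 100, 100, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=float)
h_bump = total_dampening(bump, k_start=2, k_max=4, f0_max_bins=8)

# Speech-like frame: fundamental at bin 3, stronger second harmonic at bin 6.
speech = np.array([0, 1, 1, 10, 1, 1, 20, 1, 1, 1, 1, 1], dtype=float)
h_speech = total_dampening(speech, k_start=2, k_max=4, f0_max_bins=8)
```

In the bump-like frame the bins up to `k_max` are dampened, while the speech-like frame is left untouched because a harmonic of equal or greater power is always found in the second window.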
As this maximum frequency kmax is different for voiced speech (e.g. vowels such as u, o, a, e, i which have a distinct harmonics structure) and unvoiced speech (e.g. fricatives such as sh, f, z which do not have a distinct harmonics structure), a harmonicity detector can be used for a voiced/unvoiced decision. It is understood that a harmonicity detector is to be contrasted with a voice activity detector, which typically distinguishes between speech and non-speech.
As noted above, the initial sizes of the first and second windows may be off in relation to the speech components. For example, while the initial first and second windows may be located in such a way that speech formants are located in the first and second windows for speech from a baritone man, the initial windows may not be located correctly for speech formants for a relatively high-pitched woman.
As shown in
In general, the beginning of the lowermost formant of human speech is not known and is difficult to estimate in noise. In addition, the frequency of low frequency audio events, such as road bumps, is not known since such events can vary in time and can cover a relatively large frequency range. In general, noise events are not harmonic in nature and can be differentiated from speech, which does have harmonic components.
Once the dampening is determined, dampening across the first window can be applied directly by multiplying the noisy speech spectrum Y(l) with the dampening coefficients, as set forth below:
$X_1(l)=\tilde{H}(l)\cdot Y(l)$
In another embodiment, dampening can be combined with other noise suppression or other processing. For example, dampening coefficients may be combined with Wiener noise suppression as follows:
$X_2(l)=\tilde{H}(l)\cdot H(l)\cdot Y(l)$,
where H(l) refers to Wiener or other filter coefficients.
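Both ways of applying the coefficients amount to per-bin multiplication of the complex spectrum, as in this sketch. The function name and the example values are hypothetical.

```python
import numpy as np


def apply_dampening(y, h_tilde, h_other=None):
    """Multiply the noisy spectrum Y(l) by the dampening coefficients,
    optionally combined with other per-bin filter coefficients H(l)."""
    gains = h_tilde if h_other is None else h_tilde * h_other
    return gains * y


y = np.array([1 + 1j, 2 + 0j, 0 + 3j])   # noisy spectrum Y(l)
h_tilde = np.array([0.5, 1.0, 1.0])      # dampening coefficients
h_wiener = np.array([1.0, 0.5, 0.2])     # e.g. Wiener filter coefficients

x1 = apply_dampening(y, h_tilde)             # dampening alone
x2 = apply_dampening(y, h_tilde, h_wiener)   # combined with the other filter
```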
In another embodiment, a scaling factor α can be used to adjust dampening as desired:
The scaling factor can be used to control the aggressiveness of the dampening. Using a factor larger than 1 decreases the dampening and using a factor smaller than one increases the dampening. This allows a trade-off between stronger (e.g., more aggressive) bump suppression with a factor smaller than 1 and less aggressive bump removal (and more speech protection) with a factor larger than 1.
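The scaled expression is not reproduced in this text. One form consistent with the behavior described above, assuming the unscaled factor is the ratio of the window maxima limited to 1, would be:

```latex
\tilde{H}_k(l) = \min\left\{ 1,\ \alpha \cdot \frac{P_U}{P_L} \right\}, \qquad l = 1, \ldots, k
```

Under this assumed form, a value of $\alpha$ larger than 1 moves the coefficient toward 1 (less dampening), a value smaller than 1 moves it toward 0 (more dampening), and $\alpha = 1$ leaves the ratio unchanged.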
Scaling factors may be chosen differently for different filter coefficients in accordance with a generic representation as:
where β is an exponential scaling factor. Where β is 0.5 for example, and α is 1, then
With regard to aggressiveness of the scaling, $\alpha_{k,l}$ can be used instead of $\alpha$ to enable the scaling to be chosen differently for different $k,l$. In an exemplary embodiment, dampening can be defined as:
with $\alpha_{k,l}=\alpha_0^{k-l+1}$. With this arrangement, the larger the distance of a bin from the first window to the second window, the stronger the dampening if $0<\alpha_0<1$ and the less the dampening if $\alpha_0>1$.
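A sketch of bin-dependent scaling follows. It assumes the generic form min(1, α·(P_U/P_L)^β) with α replaced by α0 raised to the power k−l+1; the defining equations are not reproduced in this text, so this combined form and the function name are assumptions.

```python
import numpy as np


def scaled_dampening(p_lower, p_upper, k, alpha0=1.0, beta=1.0):
    """Coefficients for bins l = 1..k with alpha_{k,l} = alpha0 ** (k - l + 1)."""
    l = np.arange(1, k + 1)
    alpha_kl = alpha0 ** (k - l + 1)                   # exponent = distance to second window
    ratio = (p_upper / max(p_lower, 1e-12)) ** beta    # assumed exponential scaling
    return np.minimum(1.0, alpha_kl * ratio)


# alpha0 < 1: bins farther below the second window are dampened more strongly.
h = scaled_dampening(p_lower=100.0, p_upper=25.0, k=3, alpha0=0.5, beta=0.5)
```

Here the ratio term evaluates to 0.5 and the per-bin scaling to 0.125, 0.25, and 0.5, so the lowest bin receives the strongest dampening.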
In exemplary embodiments of the invention, a floor can be provided by comfort noise insertion, as shown in
where v is the “spectral floor” of a Wiener filter and where |Y(l)| and |N(l)| are the (noisy input Y) signal and estimated noise (N) spectral magnitudes. Flooring refers to taking the maximum of $\tilde{H}(l)$ and φ(l). As shown in
As an alternative, noise may be simulated from v·|N(l)|, such as by drawing complex random values which have this magnitude on average. Then $X_1(l)=\tilde{H}(l)\cdot Y(l)$ may be replaced by simulated noise values when $\tilde{H}(l)<\varphi(l)$, which can be referred to as comfort noise insertion.
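Flooring and comfort noise insertion can be sketched as below. The expression for φ(l) is not reproduced in this text; the sketch assumes φ(l) = v·|N(l)|/|Y(l)|, which makes the floored output magnitude equal to v·|N(l)|. Drawing a uniformly random phase at exactly that magnitude is a simplification of "this magnitude on average". All names are hypothetical.

```python
import numpy as np


def floor_gain(h_tilde, y, n_mag, v=0.1, eps=1e-12):
    """Flooring: take the per-bin maximum of the dampening coefficient and phi."""
    phi = v * n_mag / np.maximum(np.abs(y), eps)   # assumed form of phi(l)
    return np.maximum(h_tilde, phi)


def comfort_noise(h_tilde, y, n_mag, v=0.1, rng=None, eps=1e-12):
    """Replace dampened bins that fall below the floor with simulated noise."""
    rng = np.random.default_rng() if rng is None else rng
    phi = v * n_mag / np.maximum(np.abs(y), eps)
    dampened = h_tilde * y
    # Random phase with magnitude v * |N(l)| (simplifying assumption).
    phase = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=y.shape))
    noise = v * n_mag * phase
    return np.where(h_tilde < phi, noise, dampened)


y = np.array([10 + 0j, 1 + 0j])     # noisy spectrum
n_mag = np.array([1.0, 1.0])        # estimated noise magnitudes
h_tilde = np.array([0.01, 1.0])     # first bin is heavily dampened

gains = floor_gain(h_tilde, y, n_mag, v=0.5)
out = comfort_noise(h_tilde, y, n_mag, v=0.5, rng=np.random.default_rng(0))
```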
In step 708, the frequency ranges of the first and second windows are adjusted, such as by increasing a maximum frequency of the first window and increasing a maximum frequency of the second window while keeping the windows adjacent to each other and not overlapping. In step 710, the maximum powers in the adjusted first and second windows are computed and in step 712 the dampening level is re-computed.
In step 714, it is determined whether the maximum frequency for the first window to achieve maximum suppression is reached. If not, processing continues in step 708. If so, in step 716, the total dampening is computed. In step 718, dampening is applied to non-speech noise events, such as road bumps.
The window generator module 754 also adjusts the windows, as described above, to achieve a desired level of non-speech audio event suppression. A power module 756 obtains information on the signal in the first and second windows. In one embodiment, the power module 756 determines the maximum power of the spectrum in the first and second windows. A dampening computation module 758 determines a dampening level based on the signal information in the first and second windows, as described above. An FFT module 760 enables processing in the frequency domain.
While exemplary embodiments of the invention are shown and described as having discrete first and second windows, it is understood that additional windows can be created and that such windows can overlap with other windows. For example, additional overlapping windows can be created to confirm formant and/or noise event locations and/or presence. Also, further windows can be used for adjusting dampening coefficients within a window. Also, while determining a maximum power in a window is described, it is understood that other signal characteristics can be used to determine the presence of speech harmonic components. Further, while exemplary embodiments are shown in conjunction with speech signal enhancement for vehicles, it is understood that other embodiments can include dynamic noise suppression in any system having a microphone array, which includes one or more microphones, receiving speech in environments subject to noise, such as entertainment systems, intercom systems, laptop communication systems, and the like.
Processing may be implemented in hardware, software, or a combination of the two. Processing may be implemented in computer programs executed on programmable computers/machines that each includes a processor, a storage medium or other article of manufacture that is readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and one or more output devices. Program code may be applied to data entered using an input device to perform processing and to generate output information.
The system can perform processing, at least in part, via a computer program product, (e.g., in a machine-readable storage device), for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). Each such program may be implemented in a high level procedural or object-oriented programming language to communicate with a computer system. However, the programs may be implemented in assembly or machine language. The language may be a compiled or an interpreted language and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. A computer program may be stored on a storage medium or device (e.g., CD-ROM, hard disk, or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer. Processing may also be implemented as a machine-readable storage medium, configured with a computer program, where upon execution, instructions in the computer program cause the computer to operate.
Processing may be performed by one or more programmable processors executing one or more computer programs to perform the functions of the system. All or part of the system may be implemented as special purpose logic circuitry (e.g., an FPGA (field programmable gate array) and/or an ASIC (application-specific integrated circuit)).
Having described exemplary embodiments of the invention, it will now become apparent to one of ordinary skill in the art that other embodiments incorporating their concepts may also be used. The embodiments contained herein should not be limited to disclosed embodiments but rather should be limited only by the spirit and scope of the appended claims. All publications and references cited herein are expressly incorporated herein by reference in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2013/049846 | 7/10/2013 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2015/005914 | 1/15/2015 | WO | A |
Number | Date | Country | |
---|---|---|---|
20160019910 A1 | Jan 2016 | US |