The present invention relates generally to signal processing. More particularly, it relates to techniques for suppressing noise in a speech signal, which may be used, for example, in an automobile.
In many applications, a speech signal is received in the presence of noise, processed, and transmitted to a far-end party. One example of such a noisy environment is the passenger compartment of an automobile. A microphone may be used to provide hands-free operation for the automobile driver. The hands-free microphone is typically located at a greater distance from the speaking user than with a regular hand-held phone (e.g., the hands-free microphone may be mounted on the dashboard or on the overhead visor). The distant microphone would then pick up speech and background noise, which may include vibration noise from the engine and/or road, wind noise, and so on. The background noise degrades the quality of the speech signal transmitted to the far-end party, and degrades the performance of an automatic speech recognition device.
One common technique for suppressing noise is the spectral subtraction technique. In a typical implementation of this technique, speech plus noise is received via a single microphone and transformed into a number of frequency bins via a fast Fourier transform (FFT). Under the assumption that the background noise is long-time stationary (in comparison with the speech), a model of the background noise is estimated during time periods of non-speech activity whereby the measured spectral energy of the received signal is attributed to noise. The background noise estimate for each frequency bin is utilized to estimate a signal-to-noise ratio (SNR) of the speech in the bin. Then, each frequency bin is attenuated according to its noise energy content via a respective gain factor computed based on that bin's SNR.
The spectral subtraction technique is generally effective at suppressing stationary noise components. However, due to the time-variant nature of the noisy environment, the noise models estimated in the conventional manner using a single microphone are likely to differ from the actual noise. This may result in an output speech signal having low audible quality, insufficient noise reduction, injected artifacts, or a combination of these.
As can be seen, techniques that can suppress noise in a speech signal, and which may be used in a noisy environment, particularly in an automobile, are highly desirable.
The invention provides techniques to suppress noise from a signal comprised of speech plus noise. In accordance with aspects of the invention, two or more signal detectors (e.g., microphones, sensors, and so on) are used to detect respective signals. At least one detected signal comprises a speech component and a noise component, with the magnitude of each component being dependent on various factors. In an embodiment, at least one other detected signal comprises mostly a noise component (e.g., vibration, engine noise, road noise, wind noise, and so on). Signal processing is then used to process the detected signals to generate a desired output signal having predominantly speech, with a large portion of the noise removed. The techniques described herein may be advantageously used in a signal processing system that is installed in an automobile.
An embodiment of the invention provides a signal processing system that includes first and second signal detectors operatively coupled to a signal processor. The first signal detector (e.g., a microphone) provides a first signal comprised of a desired component (e.g., speech) plus an undesired component (e.g., noise), and the second signal detector (e.g., a vibration sensor) provides a second signal comprised mostly of an undesired component (e.g., various types of noise).
In one design, the signal processor includes an adaptive canceller, a voice activity detector, and a noise suppression unit. The adaptive canceller receives the first and second signals, removes a portion of the undesired component in the first signal that is correlated with the undesired component in the second signal, and provides an intermediate signal. The voice activity detector receives the intermediate signal and provides a control signal indicative of non-active time periods whereby the desired component is detected to be absent from the intermediate signal. The noise suppression unit receives the intermediate and second signals, suppresses the undesired component in the intermediate signal based on a spectrum modification technique, and provides an output signal having a substantial portion of the desired component and with a large portion of the undesired component removed. Various designs for the adaptive canceller, voice activity detector, and noise suppression unit are described in detail below.
Another embodiment of the invention provides a voice activity detector for use in a noise suppression system and including a number of processing units. A first unit transforms an input signal (e.g., based on the FFT) to provide a transformed signal comprised of a sequence of blocks of M elements for M frequency bins, one block for each time instant, and wherein M is two or greater (e.g., M=16). A second unit provides a power value for each element of the transformed signal. A third unit receives the power values for the M frequency bins and provides a reference value for each of the M frequency bins, with the reference value for each frequency bin being the smallest power value received within a particular time window for the frequency bin plus a particular offset. A fourth unit compares the power value for each frequency bin against the reference value for the frequency bin and provides a corresponding output value. A fifth unit provides a control signal indicative of activity in the input signal based on the output values for the M frequency bins.
The third unit may be designed to include first and second lowpass filters, a delay line unit, a selection unit, and a summer. The first lowpass filter filters the power values for each frequency bin to provide a respective sequence of first filtered values for that frequency bin. The second lowpass filter similarly filters the power values for each frequency bin to provide a respective sequence of second filtered values for that frequency bin. The bandwidth of the second lowpass filter is wider than that of the first lowpass filter. The delay line unit stores a plurality of first filtered values for each frequency bin. The selection unit selects the smallest first filtered value stored in the delay line unit for each frequency bin. The summer adds the particular offset to the smallest first filtered value for each frequency bin to provide the reference value for that frequency bin. The fourth unit then compares the second filtered value for each frequency bin against the reference value for the frequency bin.
Various other aspects, embodiments, and features of the invention are also provided, as described in further detail below.
The foregoing, together with other aspects of this invention, will become more apparent when referring to the following specification, claims, and accompanying drawings.
Microphone 110a and sensor 110b provide two respective analog signals, each of which is typically conditioned (e.g., filtered and amplified) and then digitized prior to being subjected to the signal processing by signal processing system 200. For simplicity, this conditioning and digitization circuitry is not shown in
In the embodiment shown in
Adaptive canceller 220 receives the speech plus noise signal s(t) and the mostly noise signal x(t), removes the noise component in the signal s(t) that is correlated with the noise component in the signal x(t), and provides an intermediate signal d(t) having speech and some amount of noise. Adaptive canceller 220 may be implemented using various designs, some of which are described below.
Voice activity detector 230 detects the presence of speech activity in the intermediate signal d(t) and provides an Act control signal that indicates whether or not there is speech activity in the signal s(t). The detection of speech activity may be performed in various manners. One detection technique is described below in
Noise suppression unit 240 receives and processes the intermediate signal d(t) and the mostly noise signal x(t) to remove noise from the signal d(t), and provides an output signal y(t) that includes the desired speech with a large portion of the noise component suppressed. Noise suppression unit 240 may be designed to implement any one or more of a number of noise suppression techniques for removing noise from the signal d(t). In an embodiment, noise suppression unit 240 implements the spectrum modification technique, which provides good performance and can remove both stationary and non-stationary noise (using a time-varying noise spectrum estimate, as described below). However, other noise suppression techniques may also be used to remove noise, and this is within the scope of the invention.
For some designs, adaptive canceller 220 may be omitted and noise suppression is achieved using only noise suppression unit 240. For some other designs, voice activity detector 230 may be omitted.
The signal processing to suppress noise may be achieved via various schemes, some of which are described below. Moreover, the signal processing may be performed in the time domain or frequency domain.
Within adaptive canceller 220a, the speech plus noise signal s(t) is delayed by a delay element 322 and then provided to a summer 324. The mostly noise signal x(t) is provided to an adaptive filter 326, which filters this signal with a particular transfer function h(t). The filtered noise signal p(t) is then provided to summer 324 and subtracted from the speech plus noise signal s(t) to provide the intermediate signal d(t), which contains the speech with a portion of the noise removed.
Adaptive filter 326 includes a “base” filter operating in conjunction with an adaptation algorithm, both of which are not shown in
The base filter within adaptive filter 326 is adapted to implement (or approximate) the transfer function h(t), which describes the correlation between the noise components in the signals s(t) and x(t). The base filter then filters the mostly noise signal x(t) with the transfer function h(t) to provide the filtered noise signal p(t), which is an estimate of the noise component in the signal s(t). The estimated noise signal p(t) is then subtracted from the speech plus noise signal s(t) by summer 324 to generate the intermediate signal d(t), which is representative of the difference or error between the signals s(t) and p(t). The signal d(t) is then provided to the adaptation algorithm within adaptive filter 326, which then adjusts the transfer function h(t) of the base filter to minimize the error.
The adaptation algorithm may be implemented with any one of a number of algorithms such as a least mean square (LMS) algorithm, a normalized least mean square (NLMS) algorithm, a recursive least square (RLS) algorithm, a direct matrix inversion (DMI) algorithm, or some other algorithm. Each of the LMS, NLMS, RLS, and DMI algorithms (directly or indirectly) attempts to minimize the mean square error (MSE), which may be expressed as:
MSE = E{|s(t) − p(t)|²}, Eq (1)
where E{α} is the expected value of α, s(t) is the speech plus noise signal (which mainly contains the noise component during the adaptation periods), and p(t) is the estimate of the noise in the signal s(t). In an embodiment, the adaptation algorithm implemented by adaptive filter 326 is the NLMS algorithm.
The NLMS and other algorithms are described in detail by B. Widrow and S. D. Stearns in a book entitled “Adaptive Signal Processing,” Prentice-Hall Inc., Englewood Cliffs, N.J., 1986. The LMS, NLMS, RLS, DMI, and other adaptation algorithms are described in further detail by Simon Haykin in a book entitled “Adaptive Filter Theory”, 3rd edition, Prentice Hall, 1996. The pertinent sections of these books are incorporated herein by reference.
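The time-domain canceller described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the design of adaptive canceller 220a itself: the function name `nlms_cancel`, the tap count, the step size `mu`, the regularizer `eps`, and the optional `adapt` flags are all hypothetical, and the delay applied by delay element 322 is taken to be zero.

```python
import random


def nlms_cancel(s, x, num_taps=8, mu=0.5, eps=1e-8, adapt=None):
    """Time-domain noise canceller: returns (d, h) with d[t] = s[t] - p[t].

    s      speech-plus-noise samples
    x      mostly-noise reference samples
    adapt  optional per-sample flags; coefficients update only where True
           (hypothetical hook for adapting during noise-only periods)
    """
    h = [0.0] * num_taps      # base-filter coefficients, approximating h(t)
    taps = [0.0] * num_taps   # recent reference samples, newest first
    d = []
    for t, (s_t, x_t) in enumerate(zip(s, x)):
        taps = [x_t] + taps[:-1]
        p_t = sum(hk * xk for hk, xk in zip(h, taps))  # noise estimate p(t)
        d_t = s_t - p_t                                # summer output d(t)
        d.append(d_t)
        if adapt is None or adapt[t]:
            energy = sum(xk * xk for xk in taps) + eps  # NLMS normalization
            h = [hk + mu * d_t * xk / energy for hk, xk in zip(h, taps)]
    return d, h


# Usage: the "noise path" from x to s is a made-up two-tap filter.
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(2000)]
s = [0.8 * x[t] + (0.3 * x[t - 1] if t else 0.0) for t in range(2000)]
d, h = nlms_cancel(s, x)
```

In this noise-only example the coefficients settle near the assumed path (0.8, 0.3) and d(t) decays toward zero. With speech present in s(t), the residual d(t) would retain the speech because it is uncorrelated with x(t).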
Within adaptive canceller 220b, the speech plus noise signal s(t) is transformed by a transformer 422a to provide a transformed speech plus noise signal S(ω). In an embodiment, the signal s(t) is transformed one block at a time, with each block including L data samples for the signal s(t), to provide a corresponding transformed block. Each transformed block of the signal S(ω) includes L elements, Sn(ω0) through Sn(ωL−1), corresponding to L frequency bins, where n denotes the time instant associated with the transformed block. Similarly, the mostly noise signal x(t) is transformed by a transformer 422b to provide a transformed noise signal X(ω). Each transformed block of the signal X(ω) also includes L elements, Xn(ω0) through Xn(ωL−1).
In the specific embodiment shown in
The transformed speech plus noise signal S(ω) is provided to a summer 424. The transformed noise signal X(ω) is provided to an adaptive filter 426, which filters this noise signal with a particular transfer function H(ω). The filtered noise signal P(ω) is then provided to summer 424 and subtracted from the transformed speech plus noise signal S(ω) to provide the intermediate signal D(ω).
Adaptive filter 426 includes a base filter operating in conjunction with an adaptation algorithm. The adaptation may be achieved, for example, via an NLMS algorithm in the frequency domain. The base filter then filters the transformed noise signal X(ω) with the transfer function H(ω) to provide an estimate of the noise component in the signal S(ω).
NLMS units 432a through 432l minimize the intermediate elements Dn(ωj), which represent the error between the estimated noise and the received noise. The estimated noise elements Pn(ωj) are good approximations of the noise component in the speech plus noise elements Sn(ωj). By subtracting the elements Pn(ωj) from the elements Sn(ωj), the noise component is effectively removed from the speech plus noise elements, and the output elements Dn(ωj) would then comprise predominantly the speech component.
Each NLMS unit 432 can be designed to implement the following:
where μ is a weighting factor (typically, 0.01<μ<2.00) used to determine the convergence rate of the coefficients, and Xn*(ωj) is a complex conjugate of Xn(ωj).
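As an illustration, a per-bin NLMS update can be sketched in its textbook form, Hn+1(ωj) = Hn(ωj) + μ·Xn*(ωj)·Dn(ωj)/|Xn(ωj)|², with Pn(ωj) = Hn(ωj)·Xn(ωj) and Dn(ωj) = Sn(ωj) − Pn(ωj). This form is an assumption and is not necessarily the exact expression implemented by NLMS units 432. The function name `nlms_bin_step` and the regularizer `eps` are hypothetical.

```python
def nlms_bin_step(h, s, x, mu=0.5, eps=1e-12):
    """One NLMS step for a single frequency bin (all values complex).

    h  current coefficient H_n(w_j)
    s  speech-plus-noise element S_n(w_j)
    x  reference noise element X_n(w_j)
    Returns (d, h_next), where d is the intermediate element D_n(w_j).
    """
    p = h * x                      # estimated noise element P_n(w_j)
    d = s - p                      # error, which is also the canceller output
    h_next = h + mu * x.conjugate() * d / (abs(x) ** 2 + eps)
    return d, h_next


# Usage: identify a made-up noise path for one bin from noise-only blocks.
true_path = 0.5 + 0.2j
h = 0j
for n in range(50):
    x_n = complex(1.0 + 0.1 * n, -0.5)   # arbitrary reference element
    d_n, h = nlms_bin_step(h, true_path * x_n, x_n)
```

Dividing by |Xn(ωj)|² gives each bin its own normalized step size, which is the property the text associates with faster convergence in the frequency domain.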
The frequency-domain adaptive filter may provide certain advantages over a time-domain adaptive filter, including (1) a reduced amount of computation in the frequency domain, (2) a more accurate estimate of the gradient due to the use of an entire block of data, (3) more rapid convergence from using a normalized step size for each frequency bin, and possibly other benefits.
The noise components in the signals S(ω) and X(ω) may be correlated. The degree of correlation determines the theoretical upper bound on how much noise can be cancelled using a linear adaptive filter such as adaptive filters 326 and 426. If X(ω) and S(ω) are totally correlated, the linear adaptive filter (such as adaptive filters 326 and 426) can cancel the correlated noise components. Since S(ω) and X(ω) are generally not totally correlated, the spectrum modification technique (described below) further suppresses the uncorrelated portion of the noise.
Within voice activity detector 230a, the signal d(t) is provided to an FFT 512, which transforms the signal d(t) into a frequency domain representation. FFT 512 transforms each block of M data samples for the signal d(t) into a corresponding transformed block of M elements, Dk(ω0) through Dk(ωM−1), for M frequency bins (or frequency bands). If the signal d(t) has already been transformed into L frequency bins, as described above in
Lowpass filter 516 filters the power values Pk(ωi) for each frequency bin i, and provides the filtered values Fk1(ωi) to a decimator 518, where the superscript “1” denotes the output from lowpass filter 516. The filtering smooths out the variations in the power values from power estimator 514. Decimator 518 then reduces the sampling rate of the filtered values Fk1(ωi) for each frequency bin. For example, decimator 518 may retain only one filtered value Fk1(ωi) for each set of ND filtered values, where each filtered value is further derived from a block of data samples. In an embodiment, ND may be eight or some other value. The decimated values for each frequency bin are then stored to a respective row of a delay line 520. Delay line 520 provides storage for a particular time duration (e.g., one second) of filtered values Fk1(ωi) for each of the M frequency bins. The decimation by decimator 518 reduces the number of filtered values to be stored in the delay line, and the filtering by lowpass filter 516 removes high frequency components to ensure that aliasing does not occur as a result of the decimation by decimator 518.
Lowpass filter 526 similarly filters the power values Pk(ωi) for each frequency bin i, and provides the filtered values Fk2(ωi) to a comparator 528, where the superscript “2” denotes the output from lowpass filter 526. The bandwidth of lowpass filter 526 is wider than that of lowpass filter 516. Lowpass filters 516 and 526 may each be implemented as a FIR filter, an IIR filter, or some other filter design.
For each time instant k, a minimum selection unit 522 evaluates all of the filtered values Fk1(ωi) stored for each frequency bin i and provides the lowest stored value for that frequency bin. For each time instant k, minimum selection unit 522 provides the M smallest values stored for the M frequency bins. Each value provided by minimum selection unit 522 is then added with a particular offset value by a summer 524 to provide a reference value for that frequency bin. The M reference values for the M frequency bins are then provided to a comparator 528.
For each time instant k, comparator 528 receives the M filtered values Fk2(ωi) from lowpass filter 526 and the M reference values from summer 524 for the M frequency bins. For each frequency bin, comparator 528 compares the filtered value Fk2(ωi) against the corresponding reference value and provides a corresponding comparison result. For example, comparator 528 may provide a one (“1”) if the filtered value Fk2(ωi) is greater than the corresponding reference value, and a zero (“0”) otherwise.
An accumulator 532 receives and accumulates the comparison results from comparator 528. The output of the accumulator is indicative of the number of bins having filtered values Fk2(ωi) greater than their corresponding reference values. A comparator 534 then compares the accumulator output against a particular threshold, Th1, and provides the Act control signal based on the result of the comparison. In particular, the Act control signal may be asserted if the accumulator output is greater than the threshold Th1, which indicates the presence of speech activity in the signal d(t), and de-asserted otherwise.
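The detector chain described above (two lowpass filters, decimator, delay line, minimum selection, offset, per-bin comparison, accumulation, threshold) can be sketched as follows. This is a minimal sketch that starts from per-bin power values rather than from the FFT. The class name `MinTrackingVad`, the one-pole filter form, and every numeric default (filter coefficients, delay-line length, offset, Th1) are assumptions chosen for illustration.

```python
from collections import deque


class MinTrackingVad:
    """Voice activity decision from per-bin power values, one block per call."""

    def __init__(self, num_bins, slow=0.05, fast=0.5, decim=8,
                 history=16, offset=1.0, th1=2):
        self.slow, self.fast = slow, fast   # fast filter has the wider bandwidth
        self.decim = decim                  # N_D: keep one of every decim values
        self.offset, self.th1 = offset, th1
        self.f1 = None                      # narrow lowpass outputs F1
        self.f2 = None                      # wide lowpass outputs F2
        self.lines = [deque(maxlen=history) for _ in range(num_bins)]
        self.count = 0

    def step(self, powers):
        if self.f1 is None:                 # initialize filters on first block
            self.f1, self.f2 = list(powers), list(powers)
        else:
            self.f1 = [f + self.slow * (p - f) for f, p in zip(self.f1, powers)]
            self.f2 = [f + self.fast * (p - f) for f, p in zip(self.f2, powers)]
        if self.count % self.decim == 0:    # decimate, then store in delay line
            for line, f in zip(self.lines, self.f1):
                line.append(f)
        self.count += 1
        # Reference per bin: smallest stored value plus the offset.
        refs = [min(line) + self.offset for line in self.lines]
        votes = sum(1 for f, r in zip(self.f2, refs) if f > r)
        return votes > self.th1             # Act asserted when votes exceed Th1


# Usage: steady background, then a jump in power across all four bins.
vad = MinTrackingVad(num_bins=4)
quiet = [vad.step([1.0] * 4) for _ in range(50)]
loud = [vad.step([10.0] * 4) for _ in range(5)]
```

Tracking the minimum over the delay line lets the reference follow a slowly changing noise floor, while the wider-bandwidth filter reacts quickly to speech onsets.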
The speech plus noise signal s(t) is transformed by a transformer 622a to provide a transformed speech plus noise signal S(ω). Similarly, the mostly noise signal x(t) is transformed by a transformer 622b to provide a transformed mostly noise signal X(ω). In the specific embodiment shown in
It is sometimes advantageous, although it may not be necessary, to filter the magnitude components of S(ω) and X(ω) so that a better estimate of the short-term spectrum magnitude of the respective signal is obtained. One particular filter implementation is a first-order IIR lowpass filter with different attack and release times.
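A first-order IIR smoother with separate attack and release behavior can be sketched as follows, applied to the magnitude sequence of a single frequency bin. The function name `smooth_magnitude` and the coefficient values are assumptions; the text specifies only that the attack and release times differ.

```python
def smooth_magnitude(mags, attack=0.5, release=0.05):
    """Smooth one bin's magnitude sequence with a one-pole lowpass filter.

    A larger coefficient means a shorter time constant. Here the estimate
    rises quickly (attack) and decays slowly (release).
    """
    out = []
    y = 0.0
    for m in mags:
        coeff = attack if m > y else release
        y += coeff * (m - y)
        out.append(y)
    return out


smoothed = smooth_magnitude([1.0, 1.0, 0.0, 0.0])
```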
In the embodiment shown in
Noise spectrum estimator 642a receives the magnitude of the transformed signal S(ω), the magnitude of the transformed signal X(ω), and the Act control signal from voice activity detector 230 indicative of periods of non-speech activity. Noise spectrum estimator 642a then derives the magnitude spectrum estimates for the noise N(ω), as follows:
|N(ω)| = W(ω)·|X(ω)|, Eq (3)
where W(ω) is referred to as the channel equalization coefficient. In an embodiment, this coefficient may be derived based on an exponential average of the ratio of magnitude of S(ω) to the magnitude of X(ω), as follows:
where α is the time constant for the exponential averaging, with 0<α≤1. In a specific implementation, α=1 when voice activity detector 230 indicates a speech activity period and α=0.1 when voice activity detector 230 indicates a non-speech activity period.
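The noise spectrum estimator can be sketched as follows, assuming the exponential average takes the form Wn+1(ω) = α·Wn(ω) + (1−α)·|S(ω)|/|X(ω)|. This form is an assumption, chosen because it is consistent with α=1 freezing the coefficient during speech activity and α=0.1 tracking the ratio during non-speech periods. The function names and the regularizer `eps` are hypothetical.

```python
def update_equalizer(w, s_mag, x_mag, speech_active, eps=1e-12):
    """Update the channel equalization coefficients W for all bins."""
    alpha = 1.0 if speech_active else 0.1   # values from the specific implementation
    return [alpha * wj + (1.0 - alpha) * sj / (xj + eps)
            for wj, sj, xj in zip(w, s_mag, x_mag)]


def noise_magnitude(w, x_mag):
    """Noise magnitude estimate |N| = W * |X| for each bin."""
    return [wj * xj for wj, xj in zip(w, x_mag)]


# Usage: during non-speech, W converges to the ratio |S| / |X| (here 2.0).
w = [1.0, 1.0]
for _ in range(20):
    w = update_equalizer(w, [2.0, 4.0], [1.0, 2.0], speech_active=False)
n_mag = noise_magnitude(w, [1.0, 2.0])
```

Because |X(ω)| is measured continuously, the estimate |N(ω)| follows changes in the reference noise even while W(ω) is held fixed during speech.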
Noise spectrum estimator 642a provides the magnitude spectrum estimates for the noise N(ω) to gain calculator 644a, which then uses these estimates to derive a first set of gain coefficients G1(ω) for a multiplier 646a.
With the magnitude spectrum of the noise |N(ω)| and the magnitude spectrum of the signal |S(ω)| available, a number of spectrum modification techniques may be used to determine the gain coefficients G1(ω). Such spectrum modification techniques include a spectrum subtraction technique, Wiener filtering, and so on.
In an embodiment, the spectrum subtraction technique is used for noise suppression, and gain calculation unit 644a determines the gain coefficients G1(ω) by first computing the SNR of the speech plus noise signal S(ω) and the noise signal N(ω), as follows:
The gain coefficient G1(ω) for each frequency bin ω may then be expressed as:
where Gmin is a lower bound on G1(ω).
Gain calculation unit 644a provides a gain coefficient G1(ω) for each frequency bin j of the transformed signal S(ω). The gain coefficients for all frequency bins are provided to multiplier 646a and used to scale the magnitude of the signal S(ω).
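The gain computation can be sketched as follows, assuming SNR(ω) = |S(ω)|/|N(ω)| and G1(ω) = max((SNR(ω) − 1)/SNR(ω), Gmin). This is one common magnitude-domain form of spectral subtraction and is an assumption, not necessarily the exact expressions used by gain calculation unit 644a. The function names, the default Gmin, and the regularizer `eps` are hypothetical.

```python
def spectral_subtraction_gain(s_mag, n_mag, g_min=0.1, eps=1e-12):
    """Per-bin gain from the signal and noise magnitude spectra."""
    gains = []
    for s, n in zip(s_mag, n_mag):
        snr = s / (n + eps)
        # (snr - 1) / snr equals (|S| - |N|) / |S|; guard the empty-bin case.
        g = (snr - 1.0) / snr if snr > 0.0 else g_min
        gains.append(max(g, g_min))         # Gmin is the lower bound
    return gains


def scale_magnitude(s_mag, gains):
    """Apply the gains to the magnitude of S, as done by the multiplier."""
    return [g * s for g, s in zip(gains, s_mag)]


gains = spectral_subtraction_gain([10.0, 1.0, 0.5], [1.0, 1.0, 1.0])
scaled = scale_magnitude([10.0, 1.0, 0.5], gains)
```

Bins with high SNR pass nearly unchanged, while bins dominated by noise are attenuated down to the floor Gmin, which limits the artifacts that a zero gain would introduce.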
In an aspect, the spectrum subtraction is performed based on a noise N(ω) that is a time-varying noise spectrum derived from the mostly noise signal x(t). This is different from the spectrum subtraction used in conventional single-microphone designs, whereby N(ω) typically comprises mostly stationary or constant values. This type of noise suppression is also described in U.S. Pat. No. 5,943,429, entitled “Spectral Subtraction Noise Suppression Method,” issued Aug. 24, 1999, which is incorporated herein by reference. The use of a time-varying noise spectrum (which more accurately reflects the real noise in the environment) allows for the cancellation of non-stationary noise as well as stationary noise (non-stationary noise cancellation typically cannot be achieved by conventional noise suppression techniques that use a static noise spectrum).
Noise floor estimator 642b receives the magnitude of the transformed signal S(ω) and the Act control signal from voice activity detector 230. Noise floor estimator 642b then derives the magnitude spectrum estimates for the noise N(ω), as shown in equation (4), during periods of non-speech, as indicated by the Act control signal from voice activity detector 230. For the single-channel spectrum modification technique, the same signal S(ω) is used to derive the magnitude spectrum estimates for both the speech and the noise.
Gain calculation unit 644b then derives a second set of gain coefficients G2(ω) by first computing the SNR of the speech component in the signal S(ω) and the noise component in the signal S(ω), as shown in equation (6). Gain calculation unit 644b then determines the gain coefficients G2(ω) based on the computed SNRs, as shown in equation (6).
The spectrum subtraction technique for a single channel is also described by S. F. Boll in a paper entitled “Suppression of Acoustic Noise in Speech Using Spectral Subtraction,” IEEE Trans. Acoustic Speech Signal Proc., April 1979, vol. ASSP-27, pp. 113-121, which is incorporated herein by reference.
Noise floor estimator 642b and gain calculation unit 644b may also be designed to implement a two-channel spectrum modification technique using the speech plus noise signal s(t) and another mostly noise signal that may be derived by another sensor/microphone or a microphone array. The use of a microphone array to derive the signals s(t) and x(t) is described in detail in copending U.S. patent application Ser. No. 10/076,201, entitled “Noise Suppression for a Wireless Communication Device,” filed Feb. 12, 2002, assigned to the assignee of the present application and incorporated herein by reference.
Residual noise suppressor 642c receives the Act control signal from voice activity detector 230 and provides a third set of gain coefficients G3(ω). In an embodiment, the gain coefficients G3(ω) for each frequency bin ω may be expressed as:
where Gα is a particular value and may be selected as 0≤Gα≤1.
As shown in
In the embodiment shown in
In any case, the scaled magnitude component of S(ω) is recombined with the phase component of S(ω) and provided to an inverse FFT (IFFT) 648, which transforms the recombined signal back to the time domain. The resultant output signal y(t) includes predominantly speech and has a large portion of the background noise removed.
The embodiment shown in
The spectrum modification technique is one technique for removing noise from the speech plus noise signal s(t). The spectrum modification technique provides good performance and can remove both stationary and non-stationary noise (using the time-varying noise spectrum estimate described above). However, other noise suppression techniques may also be used to remove noise, and this is within the scope of the invention.
Signal processing system 700 further includes an adaptive beam forming unit 720 coupled to a signal processing unit 730. Beam forming unit 720 processes the signals v(t) from signal detectors 710a through 710n to provide (1) a signal s(t) comprised of speech plus noise and (2) a signal x(t) comprised of mostly noise. Beam forming unit 720 may be implemented with a main beam former and a blocking beam former.
The main beam former combines the detected signals from all or a subset of the signal detectors to provide the speech plus noise signal s(t). The main beam former may be implemented with various designs. One such design is described in detail in the aforementioned U.S. patent application Ser. No. 10/076,201.
The blocking beam former combines the detected signals from all or a subset of the signal detectors to provide the mostly noise signal x(t). The blocking beam former may also be implemented with various designs. One such design is described in detail in the aforementioned U.S. patent application Ser. No. 10/076,201.
Beam forming techniques are also described in further detail by Bernard Widrow et al., in “Adaptive Signal Processing,” Prentice Hall, 1985, pages 412-419, which is incorporated herein by reference.
The speech plus noise signal s(t) and the mostly noise signal x(t) from beam forming unit 720 are provided to signal processing unit 730. Beam forming unit 720 may be incorporated within signal processing unit 730. Signal processing unit 730 may be implemented based on the design for signal processing system 200 in
One or more microphones may also be used to detect background noise. Detection of mostly noise may be achieved by various means such as, for example, by (1) locating the microphone in a distant and/or isolated location, (2) covering the microphone with a particular material, and so on. One or more signal sensors 814 may also be used to detect various types of noise such as vibration, engine noise, motion, wind noise, and so on. Better noise pick up may be achieved by affixing the sensor to the chassis of the automobile.
Microphones 812 and sensors 814 are coupled to a signal processing unit 830, which can be mounted anywhere within or outside the passenger compartment (e.g., in the trunk). Signal processing unit 830 may be implemented based on the designs described above in
The noise suppression described herein provides an output signal having improved characteristics. In an automobile, a large amount of noise is derived from vibration due to road, engine, and other sources. This noise is predominantly low-frequency noise, which is especially difficult to suppress using conventional techniques. With the reference sensor to detect the vibration, a large portion of the noise may be removed from the signal, which improves the quality of the output signal. The techniques described herein allow a user to talk softly even in a noisy environment, which is highly desirable.
For simplicity, the signal processing systems described above use microphones as signal detectors. Other types of signal detectors may also be used to detect the desired and undesired components. For example, vibration sensors may be used to detect car body vibration, road noise, engine noise, and so on.
For clarity, the signal processing systems have been described for the processing of speech. In general, these systems may be used to process any signal having a desired component and an undesired component.
The signal processing systems and techniques described herein may be implemented in various manners. For example, these systems and techniques may be implemented in hardware, software, or a combination thereof. For a hardware implementation, the signal processing elements (e.g., the beam forming unit, signal processing unit, and so on) may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), programmable logic devices (PLDs), controllers, microcontrollers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof. For a software implementation, the signal processing systems and techniques may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory unit (e.g., memory 830 in
The foregoing description of the specific embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the use of the inventive faculty. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein, and as defined by the following claims.
Number | Name | Date | Kind |
---|---|---|---|
5416844 | Nakaji et al. | May 1995 | A |
5426703 | Hamabe et al. | Jun 1995 | A |
5610991 | Janse | Mar 1997 | A |
5917919 | Rosenthal | Jun 1999 | A |
6122610 | Isabelle | Sep 2000 | A |
6453285 | Anderson et al. | Sep 2002 | B1 |
6453291 | Ashley | Sep 2002 | B1 |
6754623 | Deligne et al. | Jun 2004 | B2 |
7062049 | Inoue et al. | Jun 2006 | B1 |
20020152066 | Piket | Oct 2002 | A1 |
20030018471 | Cheng et al. | Jan 2003 | A1 |
Number | Date | Country | |
---|---|---|---|
20030040908 A1 | Feb 2003 | US |
Number | Date | Country | |
---|---|---|---|
60268403 | Feb 2001 | US |