1. Field of the Invention
The present invention generally relates to noise suppressors, and more particularly to a noise suppressor that reduces noise components in a voice signal with overlapping noise.
2. Description of the Related Art
In cellular phone systems and IP (Internet Protocol) telephone systems, ambient noise is input to a microphone in addition to the voice of a speaker. This results in a degraded voice signal, thus impairing the clarity of the voice. Therefore, techniques have been developed to improve speech quality by reducing noise components in the degraded voice signal. (See, for example, Non-Patent Document 1 and Patent Document 1.)
A suppression coefficient calculation part 13 determines a suppression coefficient Gn(f) from the input amplitude component |Xn(f)| and the estimated noise amplitude component μn(f) in accordance with Eq. (1):
A noise suppression part 14 determines an amplitude component S*n(f) after noise suppression from Xn(f) and Gn(f) in accordance with Eq. (2):
S*n(f)=Xn(f)×Gn(f). (2)
A frequency-to-time conversion part 15 converts S*n(f) from the frequency domain to the time domain, thereby determining a signal s*n(k) after the noise suppression.
(Non-Patent Document 1) S. F. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, no. 2, pp. 113-120, 1979
(Patent Document 1) Japanese Laid-Open Patent Application No. 2004-20679
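For illustration only (this sketch is not part of the original disclosure), the related-art processing of parts 13 through 15 can be expressed in Python as follows; since the exact form of Eq. (1) is not reproduced above, the standard spectral-subtraction gain of Non-Patent Document 1 is assumed, and the estimated noise amplitude component μn(f) is taken as given:

import numpy as np

def conventional_suppressor_frame(x_frame, mu):
    # Time-to-frequency conversion of the current frame.
    X = np.fft.rfft(x_frame)
    amp = np.abs(X)  # |Xn(f)|
    # Suppression coefficient calculation part 13; the exact Eq. (1) is not
    # reproduced above, so the standard spectral-subtraction gain is assumed.
    G = np.maximum(amp - mu, 0.0) / np.maximum(amp, 1e-12)
    # Noise suppression part 14, Eq. (2): S*n(f) = Xn(f) x Gn(f).
    S = X * G
    # Frequency-to-time conversion part 15.
    return np.fft.irfft(S, n=len(x_frame))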
However, it is difficult to determine the amplitude component of (short-term) noise overlapping the current frame with accuracy. That is, there is an estimation error between the amplitude component of noise overlapping the current frame and the estimated noise amplitude component (hereinafter referred to as “noise estimation error”).
As a result, the above-described noise estimation error causes excess suppression or insufficient suppression in the noise suppressor. Further, since the noise estimation error greatly varies from frame to frame, excess suppression or insufficient suppression also varies, thus causing temporal variations in noise suppression performance. These temporal variations in noise suppression performance cause abnormal noise known as musical noise.
The following two methods are employed as methods of smoothing an amplitude component.
(First Smoothing Method)
The average of the input amplitude components of a current frame and past several frames is defined as the smoothed amplitude component Pn(f). This method is simple averaging, and the smoothed amplitude component can be given by Eq. (3):
Pn(f)=(1/M)×(|Xn(f)|+|Xn−1(f)|+ . . . +|Xn−M+1(f)|), (3)
where M is the range (number of frames) to be subjected to smoothing.
(Second Smoothing Method)
The weighted average of the amplitude component |Xn(f)| of a current frame and the smoothed amplitude component Pn-1(f) of the immediately preceding frame is defined as the smoothed amplitude component Pn(f). This is called exponential smoothing, and the smoothed amplitude component can be given by Eq. (4):
Pn(f)=α×|Xn(f)|+(1−α)×Pn-1(f), (4)
where α is a smoothing coefficient.
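As a minimal sketch (not part of the original text), the two smoothing methods of Eqs. (3) and (4) can be written as follows; the value α=0.3 is illustrative only:

import numpy as np

def smooth_simple(amp_history):
    # Eq. (3): simple average of the current and past frames' amplitudes.
    # amp_history is a list of |X(f)| arrays, newest last; M = len(amp_history).
    return np.mean(np.stack(amp_history), axis=0)

def smooth_exponential(amp, prev_smoothed, alpha=0.3):
    # Eq. (4): Pn(f) = alpha*|Xn(f)| + (1-alpha)*Pn-1(f).
    return alpha * amp + (1.0 - alpha) * prev_smoothed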
According to a suppression coefficient calculation method using such a smoothed amplitude component, temporal variations in the noise estimation error are reduced, so that generation of musical noise can be suppressed.
However, when the voice of a speaker is input, the smoothed amplitude component is weakened, so that the difference between the amplitude component of the voice signal indicated by a broken line and the smoothed amplitude component indicated by a solid line (hereinafter referred to as “voice estimation error”) increases.
As a result, the suppression coefficient is determined based on a smoothed amplitude component having a large voice estimation error and the estimated noise amplitude, and the input amplitude component is multiplied by this suppression coefficient. This causes a problem in that the voice component contained in the input signal is erroneously suppressed, so that voice quality is degraded. This phenomenon is particularly conspicuous at the head of a voice (the starting section of the voice).
Embodiments of the present invention may solve or reduce one or more of the above-described problems.
According to one embodiment of the present invention, there is provided a noise suppressor in which one or more of the above-described problems are solved or reduced.
According to one embodiment of the present invention, there is provided a noise suppressor that minimizes effects on voice while suppressing generation of musical noise so as to realize stable noise suppression performance.
According to one embodiment of the present invention, there is provided a noise suppressor including a frequency division part configured to divide an input signal into a plurality of bands and output band signals; an amplitude calculation part configured to determine amplitude components of the band signals; a noise estimation part configured to estimate an amplitude component of noise contained in the input signal and determine an estimated noise amplitude component for each of the bands; a weighting factor generation part configured to generate a different weighting factor for each of the bands; an amplitude smoothing part configured to determine smoothed amplitude components, the smoothed amplitude components being the amplitude components of the band signals that are temporally smoothed using the weighting factors; a suppression calculation part configured to determine a suppression coefficient from the smoothed amplitude component and the estimated noise amplitude component for each of the bands; a noise suppression part configured to suppress the band signals based on the suppression coefficients; and a frequency synthesis part configured to synthesize and output the band signals of the bands after the noise suppression output from the noise suppression part.
According to one embodiment of the present invention, there is provided a noise suppressor including a frequency division part configured to divide an input signal into a plurality of bands and output band signals; an amplitude calculation part configured to determine amplitude components of the band signals; a noise estimation part configured to estimate an amplitude component of noise contained in the input signal and determine an estimated noise amplitude component for each of the bands; a weighting factor generation part configured to cause weighting factors to temporally change and to output the weighting factors; an amplitude smoothing part configured to determine smoothed amplitude components, the smoothed amplitude components being the amplitude components of the band signals that are temporally smoothed using the weighting factors; a suppression calculation part configured to determine a suppression coefficient from the smoothed amplitude component and the estimated noise amplitude component for each of the bands; a noise suppression part configured to suppress the band signals based on the suppression coefficients; and a frequency synthesis part configured to synthesize and output the band signals of the bands after the noise suppression output from the noise suppression part.
According to the above-described noise suppressors, generation of musical noise is suppressed while minimizing effects on voice, so that it is possible to realize stable noise suppression performance.
Other objects, features and advantages of the present invention will become more apparent from the following detailed description when read in conjunction with the accompanying drawings.
A description is given below, based on the drawings, of embodiments of the present invention.
As smoothing methods, there are a method that uses an FIR filter and a method that uses an IIR filter, either of which may be selected in the present invention.
(In the Case of Using an FIR Filter)
Pn(f)=w0(f)·|Xn(f)|+w1(f)·|Xn−1(f)|+ . . . +wm(f)·|Xn−m(f)|. (5)
(In the Case of Using an IIR Filter)
Pn(f)=w0(f)·|Xn(f)|+w1(f)·Pn−1(f)+ . . . +wm(f)·Pn−m(f). (6)
In Eqs. (5) and (6) above, m is the number of delay elements forming the filter, and w0(f) through wm(f) are the respective weighting factors of m+1 multipliers forming the filter. By adjusting these values, it is possible to control the strength of smoothing at the time of smoothing an input signal.
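A brief sketch of this per-band smoothing, assuming Eqs. (5) and (6) take the FIR and IIR forms shown above, with one weight vector per frequency bin (illustration only):

import numpy as np

def smooth_fir_per_band(amp_frames, w):
    # Assumed Eq. (5) (FIR form): Pn(f) = sum_i w_i(f) * |Xn-i(f)|.
    # amp_frames: list [|Xn(f)|, |Xn-1(f)|, ..., |Xn-m(f)|]; w: shape (m+1, bins).
    return sum(w[i] * amp_frames[i] for i in range(len(amp_frames)))

def smooth_iir_per_band(amp, prev_smoothed, w):
    # Assumed Eq. (6) (IIR form): Pn(f) = w0(f)*|Xn(f)| + sum_i w_i(f)*Pn-i(f).
    # prev_smoothed: list [Pn-1(f), ..., Pn-m(f)].
    return w[0] * amp + sum(w[i + 1] * prev_smoothed[i] for i in range(len(prev_smoothed)))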
Conventionally, as is apparent from Eqs. (3) and (4), the same weighting factor is used in all frequency bands. On the other hand, according to the present invention, the weighting factor wm(f) is expressed as a function of frequency as in Eqs. (5) and (6), and is characterized in that its value differs from band to band.
Further, in conventional Eq. (4), the smoothing coefficient α as a weighting factor is a constant. Meanwhile, according to the present invention, with the weighting factor wm(f) being a variable, the weighting factor calculation part 23 causes the value of the weighting factor to change temporally.
Any relational expression is selectable as the one used in determining the suppression coefficient Gn(f) from the smoothed amplitude component Pn(f) and the estimated noise amplitude component μn(f). For example, Eq. (1) may be used, or a nonlinear relational expression such as the function func of Eq. (19) described below may be used.
According to a noise suppressor of the present invention, the input amplitude component is smoothed before calculating a suppression coefficient. Accordingly, when the voice of a speaker is not input, it is possible to reduce the noise estimation error, that is, the difference between the amplitude component of noise indicated by a solid line and the estimated noise amplitude component indicated by a broken line.
Further, when the voice of a speaker is input, it is also possible to reduce the voice estimation error, that is, the difference between the amplitude component of a voice signal indicated by a broken line and the smoothed amplitude component indicated by a solid line.
Here, when an input signal of voice with overlapping noise is provided, the output waveform of the conventional noise suppressor can be compared with that of the noise suppressor of the present invention.
The suppression performance at the time of noise input (measured in a voiceless section) is approximately 14 dB in the conventional noise suppressor and approximately 14 dB in the noise suppressor of the present invention. The voice quality degradation at the time of voice input (measured in the head section of a voice) is approximately 4 dB in the conventional noise suppressor, while it is approximately 1 dB in the noise suppressor of the present invention. Thus, there is an improvement of approximately 3 dB. As a result, the present invention can reduce voice quality degradation by reducing suppression of a voice component at the time of voice input, while maintaining the same noise suppression performance as the conventional noise suppressor at the time of noise input.
In the drawing, for each unit time (frame), an FFT part 30 converts the input signal xn(k) of a current frame n from a time domain k to a frequency domain f and determines the frequency domain signal Xn(f) of the input signal. The subscript n represents a frame number.
An amplitude calculation part 31 determines the amplitude component |Xn(f)| from the frequency domain signal Xn(f). A noise estimation part 32 performs voice section detection, and determines the estimated noise amplitude component μn(f) from the input amplitude component |Xn(f)| in accordance with Eq. (7) when the voice of a speaker is not detected.
An amplitude smoothing part 33 determines the averaged amplitude component Pn(f) from the input amplitude component |Xn(f)|, the input amplitude component |Xn−1(f)| of the immediately preceding frame retained in an amplitude retention part 34, and the weighting factor wm(f) retained in a weighting factor retention part 35 in accordance with Eq. (8):
Pn(f)=w0(f)·|Xn(f)|+w1(f)·|Xn−1(f)|, (8)
where the weighting factor wm(f) is set to a different value for each band over the range from 0 to fs/2, fs being the sampling frequency used in digitizing the voice.
A suppression coefficient calculation part 36 determines the suppression coefficient Gn(f) from the averaged amplitude component Pn(f) and the estimated noise amplitude component μn(f) in accordance with Eq. (9):
A noise suppression part 37 determines the amplitude component S*n(f) after noise suppression from Xn(f) and Gn(f) in accordance with Eq. (10):
S*n(f)=Xn(f)×Gn(f). (10)
An IFFT part 38 converts the amplitude component S*n(f) from the frequency domain to the time domain, thereby determining a signal s*n(k) after the noise suppression.
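For illustration only, one frame of the processing of parts 30 through 38 might be sketched as follows; the noise estimate μn(f) of Eq. (7) is taken as an input, and since the exact Eq. (9) is not reproduced above, a spectral-subtraction-style gain is assumed:

import numpy as np

def first_embodiment_frame(x_frame, mu, amp_prev, w0, w1):
    # FFT part 30: time domain -> frequency domain.
    X = np.fft.rfft(x_frame)
    # Amplitude calculation part 31: |Xn(f)|.
    amp = np.abs(X)
    # Amplitude smoothing part 33, Eq. (8): Pn(f) = w0(f)|Xn(f)| + w1(f)|Xn-1(f)|;
    # w0 and w1 are per-band arrays (weighting factor retention part 35).
    P = w0 * amp + w1 * amp_prev
    # Suppression coefficient calculation part 36; the exact Eq. (9) is not
    # reproduced above, so a spectral-subtraction-style gain is assumed.
    G = np.maximum(P - mu, 0.0) / np.maximum(P, 1e-12)
    # Noise suppression part 37, Eq. (10), and IFFT part 38.
    s = np.fft.irfft(X * G, n=len(x_frame))
    return s, amp  # amp is retained as |Xn-1(f)| for the next frame (part 34)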
In the drawing, a channel division part 40 divides the input signal xn(k) into band signals xBPF(i,k) in accordance with Eq. (11) using bandpass filters (BPFs). The subscript i represents a channel number.
xBPF(i,k)=BPF(i,0)·x(k)+BPF(i,1)·x(k−1)+ . . . +BPF(i,M)·x(k−M), (11)
where BPF(i,j) is an FIR filter coefficient for band division, and M is the order of the FIR filter.
An amplitude calculation part 41 calculates a band-by-band input amplitude Pow(i,n) in each frame from the band signal xBPF(i,k) in accordance with Eq. (12). The subscript n represents a frame number.
where N is frame length.
A noise estimation part 42 performs voice section detection, and determines the amplitude component μ(i,n) of estimated noise from the band-by-band input amplitude component Pow(i,n) in accordance with Eq. (13) when the voice of a speaker is not detected.
A weighting factor calculation part 45 compares the band-by-band input amplitude component Pow(i,n) with a predetermined threshold THR1, and calculates a weighting factor w(i,m), where m=0, 1, and 2.
If Pow(i,n)≧THR1,
w(i,0)=0.7,
w(i,1)=0.2, and
w(i,2)=0.1.
If Pow(i,n)<THR1,
w(i,0)=0.4,
w(i,1)=0.3, and
w(i,2)=0.3.
That is, the temporal sum of weighting factors is one for each channel.
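Written directly (for illustration only), this selection rule is:

def select_weights(pow_in, thr1):
    # Weighting factor calculation part 45: the band amplitude Pow(i,n) is
    # compared with THR1; the weight values below are taken from the text,
    # and each set sums to one.
    if pow_in >= thr1:
        return (0.7, 0.2, 0.1)   # strong amplitude (voice likely): track the input
    return (0.4, 0.3, 0.3)       # weak amplitude (noise): smooth more strongly

Giving the current frame the larger weight when the band amplitude is high lets the smoothed component follow voice onsets quickly, while the flatter set applied to weak amplitudes smooths the noise more strongly.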
An amplitude smoothing part 43 calculates a smoothed input amplitude component PowAV(i,n) from band-by-band input amplitude components Pow(i,n−1) and Pow(i,n−2) retained in an amplitude retention part 44, the band-by-band input amplitude component Pow(i,n) from the amplitude calculation part 41, and the weighting factor w(i,m) in accordance with Eq. (14):
PowAV(i,n)=w(i,0)·Pow(i,n)+w(i,1)·Pow(i,n−1)+w(i,2)·Pow(i,n−2). (14)
A suppression coefficient calculation part 46 calculates a suppression coefficient G(i,n) from the smoothed input amplitude component PowAV(i,n) and the estimated noise amplitude component μ(i,n) by Eq. (15):
A noise suppression part 47 determines a band signal s*BPF(i,k) after noise suppression from the band signal xBPF(i,k) and the suppression coefficient G(i,n) in accordance with Eq. (16):
s*BPF(i,k)=xBPF(i,k)×G(i,n). (16)
A channel synthesis part 48 is formed of an adder circuit, and determines an output voice signal s*(k) by adding up and synthesizing the band signals s*BPF(i,k) in accordance with Eq. (17):
s*(k)=s*BPF(1,k)+s*BPF(2,k)+ . . . +s*BPF(L,k), (17)
where L is the number of band divisions.
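For illustration only, one frame of the processing of parts 40 through 48 might be sketched as follows; Eqs. (12), (13), and (15) are not reproduced above, so the mean absolute frame amplitude, an externally supplied noise estimate mu, and a spectral-subtraction-style gain are assumptions. The dictionary state holds, per channel, the last M input samples and the amplitudes Pow(i,n−1) and Pow(i,n−2), initialized to zero:

import numpy as np

def second_embodiment_frame(x, bpf, thr1, mu, state):
    # x: N samples of the current frame; bpf: (L, M+1) FIR coefficients of the
    # band-division filters of Eq. (11); mu: per-channel noise estimate, taken
    # as given since Eq. (13) is not reproduced above.
    num_ch, taps = bpf.shape
    out = np.zeros(len(x))
    for i in range(num_ch):
        # Channel division part 40, Eq. (11): band-pass filtering with history.
        hist = np.concatenate([state["taps"][i], x])
        xb = np.convolve(hist, bpf[i], mode="valid")
        state["taps"][i] = hist[-(taps - 1):]
        # Amplitude calculation part 41; Eq. (12) assumed to be mean |xBPF|.
        pow_in = np.mean(np.abs(xb))
        # Weighting factor calculation part 45 (values from the text).
        w = (0.7, 0.2, 0.1) if pow_in >= thr1 else (0.4, 0.3, 0.3)
        # Amplitude smoothing part 43, Eq. (14).
        p_av = w[0] * pow_in + w[1] * state["pow1"][i] + w[2] * state["pow2"][i]
        state["pow2"][i], state["pow1"][i] = state["pow1"][i], pow_in
        # Suppression coefficient part 46; Eq. (15) assumed spectral-subtraction style.
        g = max(p_av - mu[i], 0.0) / max(p_av, 1e-12)
        # Noise suppression part 47 (Eq. (16)) and channel synthesis part 48 (Eq. (17)).
        out += xb * g
    return out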
In the drawing, for each unit time (frame), the FFT part 30 converts the input signal xn(k) of a current frame n from a time domain k to a frequency domain f and determines the frequency domain signal Xn(f) of the input signal. The subscript n represents a frame number.
The amplitude calculation part 31 determines the amplitude component |Xn(f)| from the frequency domain signal Xn(f). The noise estimation part 32 performs voice section detection, and determines the estimated noise amplitude component μn(f) from the input amplitude component |Xn(f)| in accordance with Eq. (7) when the voice of a speaker is not detected.
An amplitude smoothing part 51 determines the averaged amplitude component Pn(f) from the input amplitude component |Xn(f)|, the averaged amplitude components Pn−1(f) and Pn−2(f) of the past two frames retained in an amplitude retention part 52, and the weighting factor wm(f) from a weighting factor calculation part 53 in accordance with Eq. (18):
Pn(f)=w0(f)·|Xn(f)|+w1(f)·Pn−1(f)+w2(f)·Pn−2(f). (18)
The weighting factor calculation part 53 compares the averaged amplitude component Pn(f) with a predetermined threshold THR2, and calculates the weighting factor wm(f), where m=0, 1, and 2.
If Pn(f)≧THR2,
w0(f)=1.0,
w1(f)=0.0, and
w2(f)=0.0.
If Pn(f)<THR2,
w0(f)=0.6,
w1(f)=0.2, and
w2(f)=0.2.
That is, the temporal sum of the weighting factors is one for each band.
A suppression coefficient calculation part 54 determines the suppression coefficient Gn(f) from the averaged amplitude component Pn(f) and the estimated noise amplitude component μn(f) using a nonlinear function func shown in Eq. (19).
The noise suppression part 37 determines the amplitude component S*n(f) after noise suppression from Xn(f) and Gn(f) in accordance with Eq. (10). The IFFT part 38 converts the amplitude component S*n(f) from the frequency domain to the time domain, thereby determining the signal s*n(k) after the noise suppression.
Thus, by controlling the weighting factor based on an amplitude component after smoothing, it is possible to perform robust and stable control even against non-stationary noise.
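As a sketch of Eq. (18) with the THR2-controlled weights above (illustration only): because wm(f) depends on the smoothed amplitude it helps produce, the comparison here uses the previous frame's Pn−1(f), which is an assumption:

import numpy as np

def smooth_with_amplitude_control(amp, p1, p2, thr2):
    # Weighting factor calculation part 53 analogue: choose w0, w1, w2 by
    # comparing the smoothed amplitude with THR2 (weight values from the text;
    # the previous frame's Pn-1(f) is used here to avoid circularity).
    voice = p1 >= thr2
    w0 = np.where(voice, 1.0, 0.6)
    w1 = np.where(voice, 0.0, 0.2)
    w2 = np.where(voice, 0.0, 0.2)
    # Eq. (18): Pn(f) = w0(f)|Xn(f)| + w1(f)Pn-1(f) + w2(f)Pn-2(f).
    return w0 * amp + w1 * p1 + w2 * p2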
In the drawing, for each unit time (frame), the FFT part 30 converts the input signal xn(k) of a current frame n from a time domain k to a frequency domain f and determines the frequency domain signal Xn(f) of the input signal. The subscript n represents a frame number.
The amplitude calculation part 31 determines the amplitude component |Xn(f)| from the frequency domain signal Xn(f). The noise estimation part 32 performs voice section detection, and determines the estimated noise amplitude component μn(f) from the input amplitude component |Xn(f)| in accordance with Eq. (7) when the voice of a speaker is not detected.
A signal-to-noise ratio calculation part 56 determines a signal-to-noise ratio SNRn(f) band by band from the input amplitude component |Xn(f)| of the current frame and the estimated noise amplitude component μn(f) in accordance with Eq. (20):
A weighting factor calculation part 57 determines the weighting factor w0(f) from the signal-to-noise ratio SNRn(f).
w1(f)=1.0−w0(f). (21)
An amplitude smoothing part 58 determines the averaged amplitude component Pn(f) from the input amplitude component |Xn(f)| of the current frame, the input amplitude component |Xn−1(f)| of the immediately preceding frame retained in the amplitude retention part 34, and the weighting factors w0(f) and w1(f) from the weighting factor calculation part 57 in accordance with Eq. (22):
Pn(f)=w0(f)·|Xn(f)|+w1(f)·|Xn−1(f)|. (22)
The suppression coefficient calculation part 36 determines the suppression coefficient Gn(f) from the averaged amplitude component Pn(f) and the estimated noise amplitude component μn(f) in accordance with Eq. (9). The noise suppression part 37 determines the amplitude component S*n(f) after noise suppression from Xn(f) and Gn(f) in accordance with Eq. (10). The IFFT part 38 converts the amplitude component S*n(f) from the frequency domain to the time domain, thereby determining the signal s*n(k) after the noise suppression.
Thus, by controlling the weighting factor based on the signal-to-noise ratio, it is possible to perform stable control irrespective of the microphone volume (input level).
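For illustration only, the SNR-driven weighting of parts 56 through 58 might be sketched as follows; the exact Eq. (20) and the mapping from SNRn(f) to w0(f) are not reproduced above, so a plain amplitude ratio and a monotone mapping are assumed:

import numpy as np

def snr_controlled_smoothing(amp, amp_prev, mu):
    # SNR calculation part 56 analogue; Eq. (20) is assumed to be the ratio
    # |Xn(f)| / mu_n(f) (an assumption).
    snr = amp / np.maximum(mu, 1e-12)
    # Weighting factor calculation part 57: w0 grows with SNR (assumed
    # monotone mapping), and Eq. (21): w1(f) = 1.0 - w0(f).
    w0 = np.clip(snr / (snr + 1.0), 0.0, 1.0)
    w1 = 1.0 - w0
    # Amplitude smoothing part 58, Eq. (22): Pn(f) = w0*|Xn(f)| + w1*|Xn-1(f)|.
    return w0 * amp + w1 * amp_prev

Because the weights depend only on the ratio of the input to the estimated noise, a uniform change in input level leaves them unchanged, which is the stability property noted above.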
In the drawing, for each unit time (frame), the FFT part 30 converts the input signal xn(k) of a current frame n from a time domain k to a frequency domain f and determines the frequency domain signal Xn(f) of the input signal. The subscript n represents a frame number.
The amplitude calculation part 31 determines the amplitude component |Xn(f)| from the frequency domain signal Xn(f). The noise estimation part 32 performs voice section detection, and determines the estimated noise amplitude component μn(f) from the input amplitude component |Xn(f)| in accordance with Eq. (7) when the voice of a speaker is not detected.
The amplitude smoothing part 51 determines the averaged amplitude component Pn(f) from the input amplitude component |Xn(f)|, the averaged amplitude components Pn−1(f) and Pn−2(f) of the past two frames retained in the amplitude retention part 52, and the weighting factor wm(f) from a weighting factor calculation part 61 in accordance with Eq. (18).
A signal-to-noise ratio calculation part 60 determines the signal-to-noise ratio SNRn(f) band by band from the smoothed amplitude component Pn(f) and the estimated noise amplitude component μn(f) in accordance with Eq. (23):
The weighting factor calculation part 61 determines the weighting factor w0(f) from the signal-to-noise ratio SNRn(f).
The suppression coefficient calculation part 54 determines the suppression coefficient Gn(f) from the averaged amplitude component Pn(f) and the estimated noise amplitude component μn(f) using the nonlinear function func shown in Eq. (19). The noise suppression part 37 determines the amplitude component S*n(f) after noise suppression from Xn(f) and Gn(f) in accordance with Eq. (10). The IFFT part 38 converts the amplitude component S*n(f) from the frequency domain to the time domain, thereby determining the signal s*n(k) after the noise suppression.
Thus, by controlling the weighting factor based on the signal-to-noise ratio after smoothing, it is possible to perform robust and stable control even against non-stationary noise, and to perform stable control irrespective of the microphone volume (input level).
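A corresponding sketch for this embodiment (illustration only); Eq. (23) is assumed to be the ratio of the smoothed amplitude to the noise estimate, the previous frame's Pn−1(f) is used to avoid the circular dependence on Pn(f), and the even split of the remaining weight between w1(f) and w2(f) is likewise an assumption:

import numpy as np

def snr_after_smoothing_weights(p_prev, mu):
    # SNR calculation part 60 analogue; Eq. (23) is assumed to be the ratio of
    # the smoothed amplitude to the estimated noise amplitude, using the
    # previous frame's Pn-1(f) (an assumption to avoid circularity).
    snr = p_prev / np.maximum(mu, 1e-12)
    w0 = np.clip(snr / (snr + 1.0), 0.0, 1.0)   # assumed monotone mapping
    w1 = w2 = (1.0 - w0) / 2.0                  # assumed even split over Eq. (18)'s taps
    return w0, w1, w2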
The amplitude calculation parts 31 and 41 may correspond to an amplitude calculation part, the noise estimation parts 32 and 42 may correspond to a noise estimation part, the weighting factor retention part 35, the weighting factor calculation part 45, and the signal-to-noise ratio calculation parts 56 and 60 may correspond to a weighting factor generation part, the amplitude smoothing parts 33 and 43 may correspond to an amplitude smoothing part, the suppression coefficient calculation parts 36 and 46 may correspond to a suppression calculation part, the noise suppression parts 37 and 47 may correspond to a noise suppression part, the FFT part 30 and the channel division part 40 may correspond to a frequency division part, and the IFFT part 38 and the channel synthesis part 48 may correspond to a frequency synthesis part.
The present invention is not limited to the specifically disclosed embodiment, and variations and modifications may be made without departing from the scope of the present invention.
The present application is a continuation application filed under 35 U.S.C. 111(a) claiming benefit under 35 U.S.C. 120 and 365(c) of PCT International Application No. PCT/JP2004/016027, filed on Oct. 28, 2004, the entire contents of which are hereby incorporated by reference.
Parent application: PCT/JP2004/016027, filed Oct. 2004.
Child application: U.S. application Ser. No. 11/727,062, filed Mar. 2007.