This invention relates to a noise suppression device adapted to suppress background noise mixed with an input signal. The noise suppression device is used for improving a recognition rate of a voice recognition system and improving sound quality of a car navigation system, a mobile phone, a voice communication system such as an intercom, a hands-free communication system, a TV conference system, and a monitoring system, to which voice communication, voice storage, and speech recognition are introduced.
Along with recent advancement of digital signal processing techniques, outdoor voice communication with mobile phones, hands-free voice communication in cars, and hands-free operation with voice recognition are widely available. Since those apparatuses are often used under high-noise environments, background noise is input to a microphone together with voice. This situation brings deterioration of a quality of voice communication and a voice recognition rate. In order to achieve highly accurate voice recognition and comfortable voice communication, a noise suppression device for suppressing the background noise mixed with the input signal is required.
An example of a conventional noise suppression method is disclosed in Non-Patent Literature 1. The conventional method includes converting an input signal of time domain into power spectra, which are a signal of frequency domain, calculating a suppression amount for noise suppression using the power spectra of the input signal and estimated noise spectra that are estimated separately from the input signal, performing amplitude suppression of the power spectra of the input signal using the suppression amount, converting the amplitude-suppressed power spectra and the phase spectra of the input signal into time domain, and obtaining a noise-suppressed signal.
According to the conventional noise suppression method, the suppression amount is calculated based on the ratio of the voice power spectra to the estimated noise power spectra (SN ratio). However, when the SN ratio indicates a negative value (in decibels), a correct suppression amount cannot be obtained. For example, in a voice signal overlaid with car cruising noise having high power in a low frequency region, the low frequency region of the voice is buried in the noise. In this case, the SN ratio becomes negative, and as a result, the low frequency region of the voice signal is excessively suppressed, which degrades voice quality.
In order to solve the foregoing problem, a conventional method for generating and recovering a low frequency region signal that has been lost is disclosed in, for example, Patent Literature 1. This conventional art discloses a voice signal processing apparatus that extracts some of the harmonic components of a fundamental frequency (pitch) signal of voice from an input signal, generates subharmonic components by multiplying the extracted harmonic components by two, and overlays the obtained subharmonic components on the input signal, thereby obtaining a voice signal with improved voice quality. By placing the voice signal processing apparatus in a stage subsequent to a noise suppression device, a noise suppression device having superior low frequency region components can be achieved.
Patent Literature 1: Japanese Patent Laid-Open No. 2008-76988 (pages 5 to 6, FIG. 1)
Non-Patent Literature 1: Y. Ephraim, D. Malah, “Speech Enhancement Using a Minimum Mean Square Error Short-Time Spectral Amplitude Estimator”, IEEE Trans. ASSP, vol. ASSP-32, No. 6 Dec. 1984
However, in the conventional voice signal processing apparatus disclosed in Patent Literature 1, the low frequency region signal is analyzed and generated from an input signal. Therefore, when the input signal includes remaining noise, i.e., when the output signal of the noise suppression device includes the remaining noise, the low frequency region component is affected by the remaining noise. This situation may cause a problem that the voice quality is suddenly degraded. Further, there is a problem that a large amount of calculation and memory are required for generation of the low frequency region component, filtration processing, and control of the degree of overlay of the low frequency region component.
This invention is made to solve the above problems, and has an object to provide a noise suppression device which is capable of achieving a high quality with simple processing.
A noise suppression device according to this invention includes: a power spectrum calculator configured to convert an input signal of time domain into power spectra as a signal of frequency domain; a voice/noise determination unit configured to determine whether the power spectra indicate voice or noise; a noise spectrum estimation unit configured to estimate noise spectra of the power spectra by using a determination result of the voice/noise determination unit; a period component estimation unit configured to analyze a harmonic structure constituting the power spectra, and estimate periodical information about the power spectra; a weighting coefficient calculator configured to calculate a weighting coefficient for weighting the power spectra by using the periodical information, the determination result of the voice/noise determination unit, and signal information about the power spectra; a suppression coefficient calculator configured to calculate a suppression coefficient for suppressing noise included in the power spectra by using the power spectra, the determination result of the voice/noise determination unit, and the weighting coefficient; a spectrum suppression unit configured to suppress amplitude of the power spectra in accordance with the suppression coefficient; and a transformer configured to convert the power spectra whose amplitude has been suppressed by the spectrum suppression unit into a signal of time domain to generate a noise-suppressed signal.
According to this invention, the noise suppression device is provided with: the period component estimation unit configured to analyze a harmonic structure constituting the power spectra, and estimate periodical information about the power spectra; the weighting coefficient calculator configured to calculate a weighting coefficient for weighting the power spectra by using the periodical information, the determination result of the voice/noise determination unit, and signal information about the power spectra; the suppression coefficient calculator configured to calculate a suppression coefficient for suppressing noise included in the power spectra by using the power spectra, the determination result of the voice/noise determination unit, and the weighting coefficient; and the spectrum suppression unit configured to suppress amplitude of the power spectra in accordance with the suppression coefficient. Therefore, even in a frequency band where the voice is buried in the noise, correction can be made to maintain the harmonic structure of voice, excessive suppression of the voice can be avoided, and high quality noise suppression can be achieved.
Hereinafter, embodiments of the present invention will be explained with reference to appended drawings.
The noise suppression device 100 includes an input terminal 1, a Fourier transformer 2, a power spectrum calculator 3, a period component estimation unit 4, a voice/noise section determination unit (voice/noise determination unit) 5, a noise spectrum estimation unit 6, a weighting coefficient calculator 7, an SN ratio calculator (suppression coefficient calculator) 8, a suppression amount calculator 9, a spectrum suppression unit 10, an inverse Fourier transformer (transformer) 11, and an output terminal 12.
Hereinafter, the principle of operation of the noise suppression device 100 will be explained with reference to
Processes are preliminarily performed on voice, music, and the like retrieved through a microphone (not shown) to implement an A/D (analog/digital) conversion, a sampling at a predetermined sampling frequency (for example, 8 kHz), and a partition of the sampled data into units of frames (for example, 10 ms). The frames are input to the noise suppression device 100 through the input terminal 1.
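The framing step described above can be sketched as follows. This is only an illustrative sketch: the function name `split_into_frames` and the handling of a trailing partial frame are assumptions, while the 8 kHz sampling frequency and the 10 ms frame length are the example values given in the text.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=10):
    """Partition sampled data into consecutive, non-overlapping frames."""
    # 8 kHz sampling and 10 ms frames give 80 samples per frame.
    frame_len = sample_rate_hz * frame_ms // 1000
    # Assumption: a trailing partial frame is simply dropped.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


frames = split_into_frames(list(range(250)))
print(len(frames), len(frames[0]))
```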
The Fourier transformer 2 applies a Hanning window or the like to the input signal, and implements a Fast Fourier Transform at, for example, 256 points through a formula (1) shown below to transform the input signal of time domain into spectral components X(λ,k).
X(λ,k)=FT[x(t)] (1)
In this formula, “λ” denotes a frame number applied to the input signal divided into frames, “k” denotes a number designating a frequency component in a frequency band of power spectra (hereinafter referred to as “a spectrum number”), and “FT[ . . . ]” denotes the Fourier transform.
The power spectrum calculator 3 obtains power spectra Y(λ,k) from the spectral components of the input signal through a formula (2) shown below.
Y(λ,k)=√(Re{X(λ,k)}²+Im{X(λ,k)}²); 0≦k<128 (2)
Note that “Re{X(λ,k)}” and “Im{X(λ,k)}” denote a real part and an imaginary part, respectively, of the input signal spectra after the Fourier transform.
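The formulas (1) and (2) can be illustrated with the following minimal sketch. It uses a naive discrete Fourier transform from the standard library instead of a Fast Fourier Transform, which gives the same values at 256 points but more slowly. The function names, the periodic form of the Hanning window, and the zero-padding of short frames are assumptions made for illustration.

```python
import cmath
import math

N_FFT = 256  # transform length given as an example in the text


def hann_window(length):
    """Periodic Hanning window (the exact window form is an assumption)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def spectrum(frame):
    """Formula (1): spectral components X(k) of one windowed frame."""
    w = hann_window(len(frame))
    # Assumption: frames shorter than N_FFT are zero-padded.
    x = [s * wn for s, wn in zip(frame, w)] + [0.0] * (N_FFT - len(frame))
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / N_FFT) for t in range(N_FFT))
            for k in range(N_FFT // 2)]


def power_spectrum(X):
    """Formula (2): Y(k) = sqrt(Re{X(k)}^2 + Im{X(k)}^2) for 0 <= k < 128."""
    return [math.sqrt(c.real ** 2 + c.imag ** 2) for c in X]


# A sinusoid centred on spectrum number 16 should produce its largest Y(k) there.
frame = [math.sin(2.0 * math.pi * 16 * t / N_FFT) for t in range(N_FFT)]
Y = power_spectrum(spectrum(frame))
peak_bin = max(range(len(Y)), key=Y.__getitem__)
print(peak_bin)
```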
The period component estimation unit 4 inputs the power spectra Y(λ,k) output from the power spectrum calculator 3, and analyzes the harmonic structure of the input signal spectra. As shown in
By searching the spectral peaks, periodical information p(λ,k) is set for each spectrum number k. The periodical information “p(λ,k)=1” is set to the maximum value of the power spectra (which is the spectral peak), whereas “p(λ,k)=0” is set to the others. Although all the spectral peaks are extracted in the example of
Subsequently, based on a harmonics period of the observed spectral peaks, the peaks of the voice spectra buried in the noise spectra are estimated. More specifically, as shown in
A normalized autocorrelation function ρN(λ,τ) is obtained from the power spectra Y(λ,k) through a formula (3) shown below.
In this formula, “τ” denotes a delay time, and “FT[ . . . ]” denotes a Fourier transform process. A Fast Fourier Transform may be performed with the same number of points, 256, as in the formula (1). Since the formula (3) follows from the Wiener-Khintchine theorem, details thereof are omitted. Subsequently, the maximum value ρmax(λ) of the normalized autocorrelation function is obtained through a formula (4). The formula (4) represents a search for the maximum value of ρ(λ,τ) within the range of 16≦τ≦96.
ρmax(λ)=max[ρ(λ,τ)], 16≦τ≦96 (4)
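The search in the formula (4) can be sketched as follows. The sketch assumes that the autocorrelation is obtained by transforming the real, symmetric spectrum (the Wiener-Khintchine relation) and is normalized by its value at τ=0; the function names and the zero-energy guard are illustrative assumptions, and the half spectrum of 128 components is treated as one side of a 256-point spectrum.

```python
import math

N_FFT = 256  # same transform length as in the formula (1)


def normalized_autocorrelation(Y, tau):
    """Autocorrelation at lag tau from a one-sided spectrum Y, divided by lag 0."""
    def r(lag):
        # Cosine transform of a real, even spectrum (Nyquist term ignored).
        return Y[0] + 2.0 * sum(Y[k] * math.cos(2.0 * math.pi * k * lag / N_FFT)
                                for k in range(1, len(Y)))
    r0 = r(0)
    if r0 == 0.0:
        return 0.0  # assumption: an all-zero frame has no periodicity
    return r(tau) / r0


def rho_max(Y, tau_min=16, tau_max=96):
    """Formula (4): maximum of the normalized autocorrelation for 16 <= tau <= 96."""
    values = {tau: normalized_autocorrelation(Y, tau) for tau in range(tau_min, tau_max + 1)}
    best_tau = max(values, key=values.get)
    return values[best_tau], best_tau


# Harmonics at every 8th spectrum number correspond to a period of 256 / 8 = 32 samples.
Y_demo = [1.0 if k % 8 == 0 and k > 0 else 0.0 for k in range(128)]
value, tau = rho_max(Y_demo)
print(round(value, 6), tau)
```

At a sampling frequency of 8 kHz, the search range of 16 to 96 samples corresponds to fundamental frequencies from 500 Hz down to roughly 83 Hz.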
The obtained periodical information p(λ,k) and the maximum value of the autocorrelation function ρmax(λ) are respectively output. The periodicity can be analyzed not only through the peak analysis of the power spectra and the autocorrelation function described above, but also through any well-known method such as cepstrum analysis.
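The setting of the periodical information p(λ,k) at the spectral peaks can be illustrated by the following sketch. The text does not state the peak search rule, so a simple local-maximum test is assumed here, and the estimation of peaks buried in noise from the harmonics period is not included; the function name is hypothetical.

```python
def periodical_information(Y):
    """Return p(k): 1 at spectral peaks of Y, 0 elsewhere.

    Assumption: a peak is a spectrum number whose value exceeds both neighbours.
    """
    p = [0] * len(Y)
    for k in range(1, len(Y) - 1):
        if Y[k] > Y[k - 1] and Y[k] > Y[k + 1]:
            p[k] = 1
    return p


p_demo = periodical_information([0, 1, 0, 2, 5, 2, 0, 3, 0])
print(p_demo)
```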
The voice/noise section determination unit 5 inputs the power spectra Y(λ,k) output from the power spectrum calculator 3, the maximum value of the autocorrelation function ρmax(λ) output from the period component estimation unit 4, and noise spectra N(λ,k) output from the noise spectrum estimation unit 6, which will be explained later. The voice/noise section determination unit 5 determines whether the input signal of the current frame indicates voice or noise, and outputs a result of the determination as a determination flag. An example of the determination method of the voice/noise section can be given as follows. When one of or both of a formula (5) and a formula (6) shown below are satisfied, the input signal is determined to be voice, and a Vflag indicating “1 (voice)” as the determination flag is set and output. In the other cases, the input signal is determined to be noise, and a Vflag indicating “0 (noise)” as the determination flag is set and output.
In the formula (5), “N(λ,k)” denotes the estimated noise spectra, and “Spow” and “Npow” denote a summation of the power spectra of the input signal and a summation of the estimated noise spectra, respectively. “THFR
The noise spectrum estimation unit 6 inputs the power spectra Y(λ,k) output by the power spectrum calculator 3 and the determination flag Vflag output by the voice/noise section determination unit 5. The noise spectrum estimation unit 6 estimates and updates the noise spectra through the determination flag Vflag and a formula (7) shown below, and outputs the estimated noise spectra N(λ,k).
In this formula, “N(λ−1,k)” denotes an estimated noise spectra of a previous frame, which has been stored in a storage unit such as a RAM (Random Access Memory) in the noise spectrum estimation unit 6. When the determination flag indicates “Vflag=0” in the formula (7), the input signal of the current frame is determined to be noise. In this case, the estimated noise spectra N(λ−1,k) of the previous frame is updated by using an update coefficient “α” and the power spectra Y(λ,k) of the input signal. Note that the update coefficient α is a predetermined constant within a range of 0<α<1. In a preferable example, α is 0.95, but can be changed depending on a state of the input signal and a noise level.
On the other hand, when the determination flag indicates “Vflag=1” in the formula (7), the input signal of the current frame is determined to be voice. In this case, the estimated noise spectra N(λ−1,k) of the previous frame is output as the estimated noise spectra N(λ,k) of the current frame.
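The update behavior described for the formula (7) can be sketched as follows. The exact form of the formula (7) is not reproduced in this text, so the sketch assumes a common recursive average in which α weights the estimate of the previous frame; the function name is hypothetical.

```python
def update_noise_spectrum(N_prev, Y, vflag, alpha=0.95):
    """Return N(λ,k) from N(λ-1,k), the power spectra Y(λ,k), and Vflag."""
    if vflag == 1:
        # Voice frame: carry the previous estimate over unchanged.
        return list(N_prev)
    # Noise frame. Assumption: alpha weights the previous estimate and
    # (1 - alpha) weights the current power spectra.
    return [alpha * n + (1.0 - alpha) * y for n, y in zip(N_prev, Y)]


print(update_noise_spectrum([1.0], [2.0], vflag=1))
print(update_noise_spectrum([1.0], [2.0], vflag=0))
```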
The weighting coefficient calculator 7 inputs the periodical information p(λ,k) output from the period component estimation unit 4, the determination flag Vflag output from the voice/noise section determination unit 5, and an SN ratio (signal-to-noise ratio) for each spectral component, which is output from the SN ratio calculator 8 explained later. The weighting coefficient calculator 7 calculates a weighting coefficient W(λ,k) for weighting the SN ratio for each spectral component.
In this formula, “W(λ−1,k)” denotes a weighting coefficient of a previous frame, and “β” denotes a predetermined constant for smoothing. Preferably, β is 0.8. “wp(k)” denotes a weighting constant, which is calculated through, for example, a formula (9) shown below. Namely, “wp(k)” is determined by the SN ratio for each spectral component and the determination flag, and is smoothed with a value of wp(k) at the spectrum number k and values at adjacent spectrum numbers. Upon smoothing with the adjacent spectral components, there are advantages of suppressing steepening of the weighting coefficient and absorbing error in the spectral peak analysis.
Note that, under normal circumstances, a weighting constant wZ(k) for “p(λ,k)=0” can be 1.0 without weighting. However, it may be possible to control wZ(k) in the same manner as wp(k), that is, control it depending on the SN ratio for each spectral component and the determination flag.
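The smoothing of the weighting coefficient with the constant β can be sketched as below. The formula (8) itself is not reproduced in this text, so the sketch assumes a first-order recursive smoothing toward wp(k) at spectral peaks and toward wZ(k) elsewhere; the function name and the argument layout are illustrative.

```python
def update_weighting(W_prev, p, wp, wz=1.0, beta=0.8):
    """Return W(λ,k) from W(λ-1,k), the periodical information p, and the constants.

    Assumption: W(λ,k) = beta * W(λ-1,k) + (1 - beta) * target, where the target
    is wp[k] when p[k] == 1 and wz (1.0, i.e. no weighting) otherwise.
    """
    return [beta * w_old + (1.0 - beta) * (wp[k] if p[k] == 1 else wz)
            for k, w_old in enumerate(W_prev)]


W_demo = update_weighting([1.0, 1.0], [1, 0], [2.0, 2.0])
print(W_demo)
```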
When the periodical information indicates “p(λ,k)=1” and the determination flag indicates “Vflag=1 (voice)”, the following is applied to the weighting constant.
And, when the periodical information indicates “p(λ,k)=1” and the determination flag indicates “Vflag=0 (noise)”, the following is applied to the weighting constant.
Note that, “snr(k)” denotes an SN ratio for each spectral component output from the SN ratio calculator 8, and “THSB
The SN ratio calculator 8 calculates a posteriori SNR and a priori SNR for each spectral component by using the power spectra Y(λ,k) output from the power spectrum calculator 3, the estimated noise spectra N(λ,k) output from the noise spectrum estimation unit 6, the weighting coefficient W(λ,k) output from the weighting coefficient calculator 7, and a spectrum suppression amount G(λ−1,k) of a previous frame, which is output from the suppression amount calculator 9 explained later.
The posteriori SNR γ(λ,k) can be calculated through a formula (10) shown below, which uses the power spectra Y(λ,k) and the estimated noise spectra N(λ,k). By giving a weighting based on the formula (9) shown above, a correction can be made so that the posteriori SNR is estimated to be higher at the spectral peak.
The priori SNR ξ(λ,k) is calculated through a formula (11) shown below, which uses the spectrum suppression amount G(λ−1,k) of the previous frame and the posteriori SNR γ(λ−1,k) of the previous frame.
In this formula, “δ” denotes a predetermined constant within a range of 0<δ<1. In the present embodiment, δ is preferably 0.98. Furthermore, “F[ . . . ]” denotes a half-wave rectifier, and performs a flooring to zero when the posteriori SNR indicates a negative value in decibel.
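The two SN ratio estimates can be sketched as follows. The formulas (10) and (11) are not reproduced in this text, so the sketch assumes a weighted posteriori SNR of the form W·Y²/N and the decision-directed priori SNR known from Non-Patent Literature 1, with the half-wave rectifier F[ . . . ] applied to γ−1; the function names are hypothetical.

```python
def posteriori_snr(Y, N, W):
    """Assumed form of the formula (10): gamma(k) = W(k) * Y(k)^2 / N(k)."""
    return [w * y * y / n for y, n, w in zip(Y, N, W)]


def priori_snr(gamma, gamma_prev, G_prev, delta=0.98):
    """Assumed form of the formula (11) (decision-directed estimate).

    xi(k) = delta * G_prev(k)^2 * gamma_prev(k) + (1 - delta) * F[gamma(k) - 1],
    where F floors negative values to zero (posteriori SNR below 0 dB).
    """
    return [delta * g * g * gp + (1.0 - delta) * max(gc - 1.0, 0.0)
            for gc, gp, g in zip(gamma, gamma_prev, G_prev)]


gamma_demo = posteriori_snr([2.0], [1.0], [1.5])
xi_demo = priori_snr(gamma_demo, [4.0], [0.5])
print(gamma_demo, xi_demo)
```

With a weighting coefficient larger than 1 at a spectral peak, the posteriori SNR at that spectrum number is raised, which is the correction the text describes.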
In Embodiment 1, the weighting is performed only on the posteriori SNR. Alternatively, weighting may be performed on the priori SNR or on both of the posteriori SNR and the priori SNR. In those cases, the constant in the above formula (9) may be changed to suit the weighting on the priori SNR.
The foregoing posteriori SNR γ(λ,k) and priori SNR ξ(λ,k) are output to the suppression amount calculator 9, and the priori SNR ξ(λ,k) is also output to the weighting coefficient calculator 7 as the SN ratio for each spectral component.
The suppression amount calculator 9 calculates the spectrum suppression amount G(λ,k), which is the noise suppression amount for each spectral component, by using the priori SNR ξ(λ,k) and the posteriori SNR γ(λ,k) output from the SN ratio calculator 8, and outputs the calculated spectrum suppression amount G(λ,k) to the spectrum suppression unit 10.
As a method for calculating the spectrum suppression amount G(λ,k), for instance, the Joint MAP method may be used. The Joint MAP method is a method of estimating the spectrum suppression amount G(λ,k) on an assumption that the noise signal and the voice signal are in Gaussian distribution. According to the Joint MAP method, the amplitude spectra and the phase spectra which maximize a conditional probability density function are calculated by using the priori SNR ξ(λ,k) and the posteriori SNR γ(λ,k), and the calculated values are used as the estimated values of G(λ,k). The spectrum suppression amount can be expressed as a formula (12) shown below, in which “ν” and “μ” are used as parameters to specify the shape of the probability density function. Note that the following “Reference Literature 1” describes the details of deriving the spectrum suppression amount according to the Joint MAP method, and explanation thereabout is omitted here.
Reference Literature 1: T. Lotter, P. Vary, “Speech Enhancement by MAP Spectral Amplitude Estimation Using a Super-Gaussian Speech Model”, EURASIP Journal on Applied Signal Processing, pp. 1110-1126, No. 7, 2005
In accordance with a formula (13) shown below, the spectrum suppression unit 10 suppresses the input signal for each spectral component, obtains voice signal spectra S(λ,k) whose noise has been suppressed, and outputs them to the inverse Fourier transformer 11.
S(λ,k)=G(λ,k)·Y(λ,k) (13)
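The formula (13) can be illustrated with the following sketch. To keep the example short and verifiable, the suppression amount is computed here with a Wiener-type gain ξ/(1+ξ), which is a stand-in and not the Joint MAP method of the formula (12); the function names are hypothetical.

```python
def wiener_gain(xi):
    """Stand-in suppression amount G(k) = xi(k) / (1 + xi(k)); NOT the Joint MAP gain."""
    return [x / (1.0 + x) for x in xi]


def suppress(G, Y):
    """Formula (13): S(k) = G(k) * Y(k) for each spectral component."""
    return [g * y for g, y in zip(G, Y)]


G_demo = wiener_gain([1.0, 3.0])
S_demo = suppress(G_demo, [2.0, 4.0])
print(G_demo, S_demo)
```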
The inverse Fourier transformer 11 performs an inverse Fourier transform on the obtained voice signal spectra S(λ,k) and superposes the result on the output signal of the previous frame. After that, the output terminal 12 outputs the voice signal s(t) whose noise has been suppressed.
As described above, according to Embodiment 1, even in a frequency band where voice is buried in noise and the SN ratio indicates a negative value, the SN ratio is estimated while being corrected so that the harmonic structure of the voice is maintained. Therefore, excessive suppression of the voice can be avoided, and high quality noise suppression can be achieved.
According to Embodiment 1, since the harmonic structure of voice buried in noise can be corrected by weighting the SN ratio, it is not necessary to generate a quasi-low frequency region signal and the like. Therefore, high quality noise suppression can be achieved with a small amount of processing and a small amount of memory.
Furthermore, according to Embodiment 1, since the weighting is controlled by using the SN ratio for each spectral component of the previous frame and the voice/noise section determination flag, there are advantages of avoiding unnecessary weighting in a frequency band having a high SN ratio or being a noise section, and achieving higher quality noise suppression.
In Embodiment 1, although the harmonic structure of both of the low frequency region and the high frequency region is corrected, an embodiment of the present invention is not limited to it. As necessary, only the low frequency region or only the high frequency region may be corrected. Alternatively, for example, a particular frequency band such as only a band from 500 Hz to 800 Hz may be corrected. This kind of correction of the frequency band is effective for correcting voice buried in narrow-band noise such as wind noise and car engine noise.
In Embodiment 1 explained above, the value of weighting is kept constant along the frequency direction as shown in the formula (9). Embodiment 2 presents a configuration in which the value of weighting varies along the frequency direction.
For example, as a general feature of voice, the harmonic structure in the low frequency region is clear. Therefore, the weighting may be increased in the low frequency region, whereas the weighting can be decreased as the frequency increases. Constituent elements of the noise suppression device according to Embodiment 2 are the same as those of Embodiment 1, and explanation thereabout is omitted.
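One possible way to make the weighting decrease with frequency is sketched below. The linear taper and the end values 2.0 and 1.0 are purely illustrative assumptions; the text only states that the weighting may be larger in the low frequency region and smaller as the frequency increases.

```python
def frequency_dependent_weights(n_bins=128, w_low=2.0, w_high=1.0):
    """Return a weighting constant per spectrum number, tapering linearly.

    Assumption: w_low applies at spectrum number 0 and w_high at the last one.
    """
    step = (w_high - w_low) / (n_bins - 1)
    return [w_low + step * k for k in range(n_bins)]


w_demo = frequency_dependent_weights()
print(w_demo[0], round(w_demo[-1], 6))
```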
As described above, Embodiment 2 is configured such that different weighting is applied for each frequency in estimation of the SN ratio. Therefore, suitable weighting can be achieved for each frequency of voice, and still higher quality noise suppression can be achieved.
Embodiment 1 explained above shows a configuration in which the value of weighting is a predetermined constant as shown in the formula (9). Embodiment 3 presents a configuration in which multiple weighting constants are switched in accordance with an index of voice probability as to an input signal, or are controlled through a predetermined function.
The index of voice probability as to the input signal, that is, a control factor for the mode of the input signal, may be based on the maximum value of the autocorrelation function in the formula (4). When the maximum value is high, that is, when the period structure of the input signal is clear (i.e., it is highly possible that the input signal is voice), the weighting may be increased, whereas the weighting may be decreased when that possibility is low. Alternatively, the autocorrelation function and the voice/noise section determination flag may be used together. Constituent elements of the noise suppression device according to Embodiment 3 are the same as those of Embodiment 1, and explanation thereabout is omitted.
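Switching among multiple weighting constants according to the voice probability can be sketched as follows. The thresholds 0.7 and 0.4 and the constants 2.0, 1.5, and 1.0 are hypothetical values chosen only for illustration, as is the function name.

```python
def select_weight(rho_max, vflag=None):
    """Pick a weighting constant from the maximum autocorrelation value.

    Assumptions: three hypothetical levels; when a determination flag is given
    and indicates noise (0), no weighting (1.0) is applied.
    """
    if vflag == 0:
        return 1.0
    if rho_max >= 0.7:   # clear period structure: voice is highly possible
        return 2.0
    if rho_max >= 0.4:   # moderately clear period structure
        return 1.5
    return 1.0           # weak period structure: no weighting


print(select_weight(0.9), select_weight(0.5), select_weight(0.1), select_weight(0.9, vflag=0))
```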
As described above, Embodiment 3 is configured such that the value of the weighting constant is controlled in accordance with the mode of the input signal. Therefore, when it is highly possible that the input signal is voice, the weighting can be performed so that the periodicity structure of the voice is emphasized. This can avoid a degradation of voice, while noise suppression in higher quality can be achieved.
Embodiment 1 explained above is configured to detect all the spectral peaks for estimating period components. In Embodiment 4, the SN ratio of a previous frame calculated by the SN ratio calculator 8 is output to the period component estimation unit 4, and the period component estimation unit 4 detects spectral peaks only in a frequency band in which the SN ratio is high by using the SN ratio of the previous frame. Likewise, in the calculation of the normalized autocorrelation function ρN(λ,τ), the calculation can be performed only in a frequency band in which the SN ratio is high. The other configuration is the same as the noise suppression device according to Embodiment 1, and explanation thereabout is omitted.
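The restriction of peak detection to frequency bands having a high SN ratio can be sketched as follows. The local-maximum rule and the threshold value are assumptions; the text only specifies that the SN ratio of the previous frame is used to select the bands.

```python
def peaks_in_high_snr_bands(Y, snr_prev, snr_threshold=1.0):
    """Return p(k) = 1 only at spectral peaks whose previous-frame SN ratio is high.

    Assumption: "high" means snr_prev[k] >= snr_threshold (hypothetical value).
    """
    p = [0] * len(Y)
    for k in range(1, len(Y) - 1):
        is_peak = Y[k] > Y[k - 1] and Y[k] > Y[k + 1]
        if is_peak and snr_prev[k] >= snr_threshold:
            p[k] = 1
    return p


p_high = peaks_in_high_snr_bands([0, 3, 0, 0, 4, 0], [0.0, 2.0, 0.0, 0.0, 0.1, 0.0])
print(p_high)
```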
As described above, according to Embodiment 4, the period component estimation unit 4 is configured to detect a spectral peak only in a frequency band in which the SN ratio is high by using the SN ratio of the previous frame received from the SN ratio calculator 8, or to calculate the normalized autocorrelation function only in a frequency band in which the SN ratio is high. Therefore, the detection accuracy of the spectral peaks and the accuracy of voice/noise section determination can be enhanced, and thereby higher quality noise suppression can be achieved.
Embodiments 1 to 4 explained above are configured such that the weighting coefficient calculator 7 weights the SN ratio so as to emphasize the spectral peaks. In contrast, Embodiment 5 presents a configuration in which weighting is performed to emphasize trough portions of the spectra, that is, to reduce the SN ratio in the troughs of the spectra.
The troughs of the spectra may be detected by regarding a central value of spectrum numbers between spectral peaks as a trough portion of the spectra. The other configuration is the same as the noise suppression device according to Embodiment 1, and explanation thereabout is omitted.
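The trough detection described above can be sketched as follows. The function name and the integer rounding of the central spectrum number are assumptions.

```python
def trough_information(p):
    """Mark the central spectrum number between each pair of adjacent peaks.

    p is the periodical information (1 at spectral peaks, 0 elsewhere).
    Assumption: the centre is rounded down, and directly adjacent peaks
    have no trough between them.
    """
    peaks = [k for k, flag in enumerate(p) if flag == 1]
    t = [0] * len(p)
    for left, right in zip(peaks, peaks[1:]):
        if right - left >= 2:
            t[(left + right) // 2] = 1
    return t


t_demo = trough_information([0, 1, 0, 0, 0, 1, 0, 1, 0])
print(t_demo)
```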
As described above, according to Embodiment 5, since the weighting coefficient calculator 7 performs the weighting to reduce the SN ratio at the troughs of the spectra, the frequency structure of voice can be emphasized, and thereby higher quality noise suppression can be achieved.
In Embodiments 1 to 5 explained above, the maximum a posteriori probability method (Joint MAP method) is used for the noise suppression; however, other methods may be used. Examples include the minimum mean square error short-time spectral amplitude method described in Non-Patent Literature 1 and the spectral subtraction method described in Reference Literature 2 shown below.
Reference Literature 2: S. F. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction”, IEEE Trans. on ASSP, Vol. ASSP-27, No. 2, pp. 113-120, April 1979
Embodiments 1 to 5 are each described as applied to a narrow-band telephone (0 to 4000 Hz); however, an embodiment of the present invention is not limited to the narrow-band telephone. For example, the invention can also be applied to voice and acoustic signals of a wide-band telephone supporting 0 to 8000 Hz.
In each of the above embodiments, the output signal whose noise has been suppressed is transmitted in a digital data format to various kinds of voice and acoustic processing apparatuses such as a voice encoding apparatus, a voice recognition apparatus, a voice accumulation apparatus, and a hands-free communication apparatus. The noise suppression device 100 according to each embodiment may be achieved independently or together with the other apparatuses explained above by a DSP (digital signal processor), or may be achieved by executing software programs. The programs may be stored in a storage apparatus of a computer apparatus executing the software programs, or may be distributed on a storage medium such as a CD-ROM. Alternatively, the programs may be provided via a network. Instead of being transmitted to the various kinds of voice and acoustic processing apparatuses, the output signal may be amplified by an amplification apparatus after D/A (digital/analog) conversion and directly output from a speaker as a voice signal.
Embodiments 1 to 5 explained above present configurations in which the SN ratio as a ratio of the power spectra of voice to the estimated noise power spectra is used as signal information of the power spectra. Besides the SN ratio, for example, only the power spectra of the voice may be used, or a ratio between an estimated noise power spectra and a spectra obtained by subtracting the estimated noise power spectra from the power spectra of voice (i.e. power spectra of voice on an assumption that there is no noise) may be used.
Note that, in the invention of the present application, each embodiment can be freely combined, any constituent element of each embodiment can be modified, or any constituent element of each embodiment can be omitted, within the scope of the invention.
The noise suppression device of the present invention suppresses background noise mixed with an input signal. It can be used to improve a recognition rate of a voice recognition system and to improve sound quality of a voice communication system such as a mobile phone and an intercom, a TV conference system, a monitoring system, and a car navigation system to which voice communication, voice storage, and speech recognition are introduced.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP10/05711 | 9/21/2010 | WO | 00 | 2/5/2013 |