This application is based upon and claims the benefit of priority from prior Japanese Patent Application P2004-003108, filed on Jan. 8, 2004; the entire contents of which are incorporated herein by reference.
The present invention relates to a noise suppression apparatus and method for extracting a voice signal from input acoustic signal.
As speech recognition and cellular phones come into practical use in real environments, signal processing methods that remove noise from a noise-corrupted acoustic signal in order to emphasize the voice signal have become important. In particular, the Spectral Subtraction (SS) method is often used because it is effective and easy to implement. The Spectral Subtraction method is disclosed in S. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", IEEE Trans., ASSP-27, No. 2, pp. 113-120, 1979.
The Spectral Subtraction method has the problem that it often produces a perceptually unnatural sound (called "musical noise"). Musical noise is especially noticeable in noise sections. Because the noise signal has statistical variance, subtracting the average noise value from the input signal leaves a discontinuous residual signal, and this subtraction residue is what sounds like musical noise. To solve this problem, an excess suppression method is used: by subtracting a value larger than the estimated noise from the input signal, all fluctuating components of the noise are suppressed. If a subtraction result becomes negative, the negative value is replaced by a small minimum value. However, the excess suppression also acts in voice sections, so the voice is distorted there. The excess suppression method is disclosed in, for example, Z. Goh, K. Tan and B. T. G. Tan, "Postprocessing Method for Suppressing Musical Noise Generated by Spectral Subtraction", IEEE Trans., SAP-6, No. 3, May 1998.
Furthermore, a method that processes the sections generating musical noise so that the musical noise is not perceived is also used. For example, the input signal is multiplied by a small gain and the product is superimposed on the output signal. However, if a signal large enough to mask the musical noise is superimposed, the noise level rises by the amount of the superimposed signal, and the effect of noise suppression is lost.
As mentioned above, excess suppression using a large suppression coefficient reduces musical noise, but distortion often occurs in the voice section. Conversely, in the post-processing method that superimposes the input signal on the musical noise, superimposing a signal large enough to mask the musical noise defeats the noise suppression.
The present invention is directed to a noise suppression apparatus and method capable of suppressing musical noise in a noise section without distorting the voice section.
According to an aspect of the present invention, there is provided a noise suppression apparatus, comprising: a noise estimation unit configured to estimate a noise signal in an input signal; a section decision unit configured to decide a target signal section and a noise signal section in the input signal; a noise suppression unit configured to suppress the noise signal based on a first suppression coefficient from the input signal; a noise excess suppression unit configured to suppress the noise signal based on a second suppression coefficient from the input signal, the second suppression coefficient being larger than the first suppression coefficient; and a switching unit configured to switch between an output signal from said noise suppression unit and an output signal from said noise excess suppression unit based on a decision result of said section decision unit.
According to another aspect of the present invention, there is also provided a noise suppression method, comprising: estimating a noise signal in an input signal; deciding a target signal section and a noise signal section in the input signal; suppressing the noise signal based on a first suppression coefficient from the input signal to obtain a first output signal; suppressing the noise signal based on a second suppression coefficient from the input signal to obtain a second output signal, the second suppression coefficient being larger than the first suppression coefficient; and switching between the first output signal and the second output signal based on a decision result.
According to still another aspect of the present invention, there is also provided a computer program product, comprising: a computer readable program code embodied in said product for causing a computer to suppress a noise, said computer readable program code comprising: a first program code to estimate a noise signal in an input signal; a second program code to decide a target signal section and a noise signal section in the input signal; a third program code to suppress the noise signal based on a first suppression coefficient from the input signal to obtain a first output signal; a fourth program code to suppress the noise signal based on a second suppression coefficient from the input signal to obtain a second output signal, the second suppression coefficient being larger than the first suppression coefficient; and a fifth program code to switch between the first output signal and the second output signal based on a decision result.
Hereinafter, various embodiments of the present invention will be explained by referring to the drawings.
First, the input terminal 101 receives the following signal.
x(t)=s(t)+n(t) (1)
In this equation, "x(t)" is a time-waveform signal received by an input device such as a microphone, "s(t)" is the target signal component (for example, a voice) in x(t), and "n(t)" is a non-target signal component (for example, surrounding noise) in x(t). The frequency conversion unit 102 converts x(t) to the frequency domain with a predetermined window length (for example, using a DFT) and generates "X(f)" (f: frequency).
The noise estimation unit 103 estimates a noise signal "Ne(f)" from X(f). For example, in the case that s(t) is a voice signal, the estimate Ne(f) is obtained from non-utterance sections: in a non-utterance section "x(t)=n(t)", and the average value of |X(f)| over such a section is taken as |Ne(f)|. The estimate "|Se(f)|" is then calculated as follows.
|Se(f)| = |X(f)| − α|Ne(f)| (2)
By converting |Se(f)| back to the time domain, the voice alone can be estimated. |Se(f)| is an amplitude value without a phase term; in general, the phase term of the input signal X(f) is reused for the output. The above equation (2) is the amplitude-spectrum form of the method. The equation (2) can also be written with a power spectrum as follows.
|Se(f)|^b = |X(f)|^b − α|Ne(f)|^b (3)
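The subtraction of the equations (2) and (3) can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes the magnitude spectra are given as numpy arrays and simply clips negative results to zero (the flooring of equation (7) is introduced later).

```python
import numpy as np

def spectral_subtraction(X_mag, Ne_mag, alpha=1.0, b=1):
    """Spectral subtraction in the b-th power domain (b=1: amplitude
    spectrum, b=2: power spectrum), following equation (3).
    X_mag and Ne_mag are the magnitude spectra of the input signal
    and the estimated noise."""
    diff = X_mag**b - alpha * Ne_mag**b
    # Negative results are clipped to zero here; equation (7) later
    # replaces this with flooring by a small fraction of the input.
    diff = np.maximum(diff, 0.0)
    return diff ** (1.0 / b)
```

For example, with b=1 each noise magnitude is subtracted directly; with b=2 the subtraction happens on powers and the square root returns an amplitude.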
By regarding the spectral subtraction as a filter operation, the equation (2) can be written in the following general form.
|Se(f)|^a = {1 − α(|Ne(f)|^b / |X(f)|^b)}^(a/b) · |X(f)|^a (4)
In the case of "(a, b)=(1, 1)", the above equation (4) is equivalent to the equation (2), i.e., spectral subtraction using the amplitude spectrum. In the case of "(a, b)=(2, 2)", the equation (4) represents spectral subtraction using the power spectrum. Furthermore, in the case of "(a, b)=(1, 2)" and "α=1", the equation (4) takes the form of a Wiener-type filter. These can thus be regarded as variants of one method that is uniformly describable in implementation.
In general, X(f) is a complex number and is represented as follows.
X(f) = |X(f)| exp(j·arg(X(f))) (5)
"|X(f)|" is the magnitude of X(f), "arg(X(f))" is the phase, and "j" is the imaginary unit. The magnitude of X(f) is output from the frequency conversion unit 102. Here the magnitude is written in a general form using an exponent "b", because several variations of spectral subtraction exist; the value of "b" is usually "1" or "2". The noise estimation unit 103 calculates an estimated noise |Ne(f)|^b from |X(f)|^b, using the average value of |X(f)|^b over sections regarded as noise sections.
For example, in the noise section, the estimation noise is calculated as follows.
|Ne(f, n)|^b = δ|Ne(f, n−1)|^b + (1−δ)|X(f)|^b (6)
In the above equation, "|Ne(f, n)|^b" is the value for the present frame, "|Ne(f, n−1)|^b" is the value for the previous frame, and "δ" (0<δ<1) controls the degree of smoothing. As a method for deciding the voice section, a section in which the magnitude |X(f)|^b is large may be decided to be the voice section. Alternatively, by calculating the ratio of |X(f)|^b to |Ne(f, n)|^b, a section in which this ratio exceeds some threshold may be decided to be the voice section.
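The recursive smoothing of equation (6), together with a crude ratio-based voice/noise decision of the kind just described, can be sketched as follows. The threshold value used here is an illustrative assumption, not taken from the specification.

```python
import numpy as np

def update_noise_estimate(Ne_prev, X_mag, delta=0.95, b=1):
    """Recursive smoothing of the noise spectrum, equation (6).
    Called only on frames judged to be noise; delta (0 < delta < 1)
    controls the degree of smoothing."""
    return delta * Ne_prev + (1.0 - delta) * X_mag**b

def is_voice_frame(X_mag, Ne, threshold=2.0, b=1):
    """Crude voice/noise decision: a frame whose total magnitude
    exceeds `threshold` times the total noise estimate is taken as
    voice. The threshold value is an illustrative assumption."""
    return float(np.sum(X_mag**b)) > threshold * float(np.sum(Ne))
```

A value of delta near 1 keeps the estimate stable; a smaller delta tracks changing noise faster.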
In the noise suppression unit 104 and the noise excess suppression unit 105, the output |Ne(f)|^b of the noise estimation unit 103 is subtracted from the output |X(f)|^b of the frequency conversion unit 102, and the subtraction result |Se(f)|^b is output, as in the equation (3). In the case that the subtracted noise α|Ne(f)|^b is larger than the input |X(f)|^b, several processing methods may be used; for example, the following equation can be used.
|Se(f)|^b = Max(|X(f)|^b − α|Ne(f)|^b, β|X(f)|^b) (7)
In this equation, Max(x, y) returns the larger of "x" and "y", "α" is a suppression coefficient, and "β" is a flooring coefficient. The larger the value of α, the more noise is removed, so the noise suppression effect becomes large. However, in the voice section a distortion occurs in the output signal, because part of the voice component is subtracted along with the noise component. "β" is a small positive value that prevents the result from becoming negative. For example, (α, β) is (1.0, 0.01).
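Equation (7) can be sketched directly; the default values of α and β follow the example just given.

```python
import numpy as np

def suppress(X_pow, Ne_pow, alpha=1.0, beta=0.01):
    """Equation (7): subtract alpha times the noise estimate and
    floor the result at beta times the input, so the output spectrum
    never becomes negative. Operates on the b-th power spectrum."""
    return np.maximum(X_pow - alpha * Ne_pow, beta * X_pow)
```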
In the present embodiment, the suppression coefficient "αn" of the noise excess suppression unit 105 is larger than the suppression coefficient "αs" of the noise suppression unit 104. Because of the larger suppression coefficient, the average noise power (noise level) at the output of the noise excess suppression unit 105 is lower than at the output of the noise suppression unit 104. In short, the noise levels of the two outputs differ. The noise level correction signal generation unit 106 compensates for this level difference.
The noise level correction signal generation unit 106 generates a signal obtained by multiplying the input signal |X(f)|^b by a gain, as follows.
|M(f)|^b = (1−αs)|X(f)|^b (8)
The adder 107 adds this signal to an output of the noise excess suppression unit 105.
The switching unit 109 generates an output signal by selecting either the output of the noise suppression unit 104 or the output of the adder 107. The selection is based on the decision result of the voice/noise decision unit 108: in a voice section, the output of the noise suppression unit 104 is selected; in a noise section, the output of the adder 107, i.e., the corrected output of the noise excess suppression unit 105, is selected. Various decision methods can be used in the voice/noise decision unit 108; for example, a method that compares the signal power with a threshold.
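One frame of the processing by units 104 through 109 can be sketched as follows. The coefficient values (αs, αn, β) are illustrative assumptions chosen so that αn > αs, as the embodiment requires.

```python
import numpy as np

def process_frame(X_pow, Ne_pow, is_voice,
                  alpha_s=0.8, alpha_n=2.0, beta=0.01):
    """One frame of the first embodiment. The ordinary suppression
    (unit 104) is used in voice sections; the excess suppression
    (unit 105) plus the level correction signal of equation (8)
    (units 106 and 107) is used in noise sections (unit 109)."""
    normal = np.maximum(X_pow - alpha_s * Ne_pow, beta * X_pow)   # unit 104
    excess = np.maximum(X_pow - alpha_n * Ne_pow, beta * X_pow)   # unit 105
    corrected = excess + (1.0 - alpha_s) * X_pow                  # units 106 + 107
    return normal if is_voice else corrected                      # unit 109
```

Note how the correction restores the noise floor: for a pure-noise frame with |X|^b ≈ |Ne|^b, unit 104 leaves about (1−αs)|X|^b, and the corrected excess-suppressed output comes out at almost the same level.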
The frequency inverse conversion unit 110 converts the output of the switching unit 109 from the frequency domain to the time domain, and a time signal with the voice emphasized is obtained. In the case of frame-by-frame processing, a continuous time signal can be generated by overlap-add. Alternatively, the output of the switching unit 109 itself may be output without conversion to the time domain (without using the frequency inverse conversion unit 110).
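The overlap-add reconstruction mentioned above can be sketched as follows. This assumes the frames have already been inverse-transformed to the time domain and windowed; the hop size is a free parameter.

```python
import numpy as np

def overlap_add(frames, hop):
    """Reconstruct a continuous time signal from equal-length
    time-domain frames by overlap-add, as done after the frequency
    inverse conversion unit 110. `hop` is the frame shift in samples."""
    n = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + n)
    for i, fr in enumerate(frames):
        out[i * hop : i * hop + n] += fr
    return out
```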
Next, the processing of the noise excess suppression unit 105 and the noise level correction signal generation unit 106 is explained in more detail. As mentioned above, spectral subtraction suffers from musical noise, a phenomenon in which the subtraction residue in the noise section sounds unnatural. This phenomenon is explained by referring to
As shown in
In the present invention, the musical noise is eliminated by excess suppression, and the input signal is added only to correct the difference in noise level between the voice section and the noise section. This differs from the prior method of adding the input signal to all sections so that the musical noise is not perceived. Accordingly, in the present invention, by setting a large suppression coefficient in the voice section, the level of the signal to be added in the noise section can be lowered; in short, the reduction of musical noise is not impaired by this operation.
On the other hand, in the prior art, the level of the signal to be added is tied directly to the perceptibility of the musical noise: the smaller the added signal, the more perceptible the musical noise remains. In the equation (8), the gain (1−αs) applied to the input signal is derived as follows.
First, the suppression coefficient αs is set to a value smaller than "1" so that distortion does not occur in the voice section. If the voice section contained only the noise signal, a noise component of (1−αs) times the noise level would remain after subtraction. On the other hand, in the noise section, no noise remains because of the excess suppression. Accordingly, by adding a noise component of (1−αs) times the input to the noise section, the noise level of the noise section is matched with the noise level of the voice section.
If the suppression coefficient αs for the voice section is near "1", the gain (1−αs) of the noise to be added becomes small. In this case, addition of the input signal may be omitted, because the difference in noise level between the voice section and the noise section is hard to perceive. Furthermore, in the case of noise with large variance, the difference in noise level cannot always be compensated by the method of the present embodiment; in that case, a compensation method taking the variance into account can be used.
The suppression coefficient calculation unit 204 calculates a weight coefficient as follows.
ws(f) = {1 − αs(|Ne(f)|^b / |X(f)|^b)}^(a/b) (9)
The excess suppression coefficient calculation unit 205 calculates a weight coefficient as follows.
wn(f) = {1 − αn(|Ne(f)|^b / |X(f)|^b)}^(a/b) (10)
As mentioned above, in the case of "(a, b)=(1, 1)", the noise suppression is the same as spectral subtraction using an amplitude spectrum. In the case of "(a, b)=(2, 2)", it is the same as spectral subtraction using a power spectrum. In the case of "(a, b)=(1, 2)", it takes the form of a Wiener filter. In the suppression coefficient calculation unit 204, the suppression coefficient is "αs", set small enough not to distort the voice in the voice section. In the excess suppression coefficient calculation unit 205, the suppression coefficient is "αn", set large enough to sufficiently eliminate the musical noise in the noise section. This feature is the same as in the first embodiment.
In the noise level correction coefficient generation unit 206, a weight coefficient corresponding to the equation (8) is calculated as follows.
wo(f)=(1−αs) (11)
In an adder 207, following calculation is executed.
wno(f)=wn(f)+wo(f) (12)
Based on the result of the voice/noise decision unit 208, the switching unit 209 selects ws(f) or wno(f) and outputs the final weight coefficient ww(f). In the multiplier 211, this weight coefficient ww(f) is multiplied with the spectrum X(f) of the input signal, and the output signal S(f) is calculated as follows.
S(f)=ww(f)X(f) (13)
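The weight-based formulation of the second embodiment (equations (9) through (13)) can be sketched as follows. This sketch assumes the "(a, b) = (1, 1)" (amplitude) variant and floors each weight at a small β; the coefficient values are illustrative assumptions.

```python
import numpy as np

def weight_for_frame(X_mag, Ne_mag, is_voice,
                     alpha_s=0.9, alpha_n=2.0, beta=0.01):
    """Per-frame weight coefficient ww(f) of the second embodiment,
    sketched for (a, b) = (1, 1). The final output would be
    S(f) = ww(f) * X(f) as in equation (13)."""
    ratio = Ne_mag / np.maximum(X_mag, 1e-12)       # avoid division by zero
    w_s = np.maximum(1.0 - alpha_s * ratio, beta)   # equation (9), unit 204
    w_n = np.maximum(1.0 - alpha_n * ratio, beta)   # equation (10), unit 205
    w_o = 1.0 - alpha_s                             # equation (11), unit 206
    w_no = w_n + w_o                                # equation (12), adder 207
    return w_s if is_voice else w_no                # unit 209
```

Because the whole chain reduces to one real-valued gain per frequency bin, smoothing |X(f)| before computing the ratio (as the text suggests) smooths the gain without touching the complex spectrum X(f) that is finally multiplied in equation (13).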
In the second embodiment, the processing of the first embodiment is merely re-expressed as multiplication by a transfer function. However, by smoothing |X(f)|, local variation of the weight coefficients calculated by the equations (9) and (10) is suppressed, so that the weight coefficients change smoothly over time. As a result, the voice quality improves.
On the other hand, if X(f) in the equation (13) itself were smoothed, the output signal would become blurred; accordingly, X(f) in the equation (13) should not be smoothed. As a smoothing method for |X(f)| in the equations (9) and (10), for example, the method of the equation (6) can be used. The smoothing of the second embodiment can also be applied to the first embodiment, but it is executed more simply in the second embodiment.
In the same way as in the first embodiment, if the suppression coefficient "αs" for the voice section is near "1", the gain (1−αs) of the noise to be added is small, and the noise need not be added because the difference in noise level between the voice section and the noise section is hard to perceive. Furthermore, in the case of noise with large variance, the difference in noise level cannot be completely compensated by this method; in that case, a compensation method taking the variance into account can be used.
In the third embodiment, this ratio is used to select the weight coefficient. The "SNR" need not be calculated over all bands; it may be calculated only in a band in which the voice power is concentrated.
A method that emphasizes sound from a predetermined direction using a plurality of microphones, such as a microphone array, can also be utilized. In this method, the problem of whether the input signal is voice or noise can be replaced by the problem of whether the signal arrives from a predetermined direction. In the voice/noise decision unit 508, each of a plurality of input signals is decided to be voice or noise based on the arrival direction of the signal. For example, as shown in
In the equation (15), "X1*(f)" is the complex conjugate of X1(f), "arg" is an operator that extracts the phase, and "M" is the number of frequency bins. A signal from the front direction is received with the same phase by the two microphones, so when the signal of one microphone is multiplied by the complex conjugate of the signal of the other microphone, the phase term becomes zero. Accordingly, for a signal ideally received from the front direction, the minimum of "Ph" in the equation (15) is "0". For a signal received from another direction, the more the direction deviates from the front, the larger the value of Ph. Accordingly, by setting a suitable threshold, voice/noise can be decided. In the case of more than two microphones, the value "Ph" of the equation (15) can be calculated, for example, for every pair of microphones.
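A plausible form of the measure Ph just described can be sketched as follows. The exact definition of equation (15) is not reproduced here, so this sketch, using the average absolute phase of X1(f)·X2*(f) over M bins and an illustrative threshold, is an assumption consistent with the described behavior (Ph = 0 for an in-phase front-direction source).

```python
import numpy as np

def phase_measure(X1, X2):
    """Average absolute phase of X1(f) * conj(X2(f)) over M frequency
    bins. For a front-direction source the two microphone spectra are
    in phase, so the measure is near 0; it grows as the arrival
    direction deviates from the front."""
    M = len(X1)
    return float(np.sum(np.abs(np.angle(X1 * np.conj(X2)))) / M)

def is_front(X1, X2, threshold=0.3):
    """Voice/noise decision by thresholding the phase measure.
    The threshold value is an illustrative assumption."""
    return phase_measure(X1, X2) < threshold
```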
In the integrated signal generation unit 512, one signal is generated from the plurality of input signals. For example, in the method called "delay and sum array", the plurality of input signals are added. Concretely, the integrated signal "X(f)" is represented using the input signals X1(f)~XN(f) as follows.
X(f) = (1/N)·(X1(f) + X2(f) + … + XN(f)) (16)
In the equation (16), "N" represents the number of microphones.
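The integration of the N microphone spectra can be sketched as a simple average; this assumes, as in the text, that front-direction signals are already in phase across the microphones.

```python
import numpy as np

def delay_and_sum(spectra):
    """Delay-and-sum integration for the front direction: average the
    N microphone spectra X1(f)..XN(f). In-phase (front-direction)
    components add constructively; off-axis components partially
    cancel because of their phase shifts."""
    return np.mean(np.asarray(spectra), axis=0)
```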
In this method, target signals arriving from the front direction are emphasized because they have the same phase, while signals arriving from other directions are weakened because of their phase shifts. As a result, the target signal is emphasized while the noise signal is suppressed. Accordingly, combined with the noise suppression effect of the spectral subtraction at the post stage, higher noise suppression can be realized than with one microphone.
Furthermore, by detecting the voice section using a plurality of microphones, higher detection ability can be realized than with one microphone. For example, a disturbance sound arriving from a side direction is hard to distinguish from a voice with one microphone. With a plurality of microphones, however, it can be distinguished from a voice signal (received from the front direction) using the phase term, as in the equation (15).
In
In the sixth embodiment, the target signal emphasis unit 630 and the target signal elimination unit 631 are realized by a Griffiths-Jim beam former as a representative adaptive array. This component is now explained.
A Griffiths-Jim beam former can form a sharp null, i.e., a direction in which the sensitivity drops steeply, toward a disturbance sound. This characteristic is suitable for the target signal elimination unit 631, which treats the voice from the front direction as a disturbance sound to be eliminated.
Furthermore, the output signal of the target signal elimination unit 631 is used as an input signal of the noise estimation unit 603. The noise estimation unit 603 finds non-voice sections by observing X(f) and generates an estimated noise by smoothing over the non-voice sections. On the other hand, the output of the target signal elimination unit 631 is always noise, and it is also used for the noise estimation. Accordingly, by using these two signals, noise estimation of high accuracy can be executed.
The spectrum of a voice along the frequency direction includes sections with amplitude and sections without amplitude; in short, the voice spectrum includes peaks and troughs. A frequency at a trough can be regarded as a noise section, and the processing for noise sections, such as noise level estimation or excess suppression, can be applied there. By dividing the frequency axis into subbands, a plurality of subband noise suppression units 750 each execute noise suppression on one subband. That is, based on the voice/noise decision for each subband by the voice/noise decision unit 708, each subband noise suppression unit 750 switches the noise suppression method between the voice section and the noise section. As a result, the quality of the voice section improves.
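The per-subband switching can be sketched as follows. The equal-width band split, the ratio-based per-subband decision, and the coefficient values are all illustrative assumptions; the specification leaves these choices open.

```python
import numpy as np

def subband_process(X_pow, Ne_pow, n_bands=4,
                    alpha_s=0.8, alpha_n=2.0, beta=0.01, thr=2.0):
    """Seventh-embodiment sketch: split the spectrum into equal-width
    subbands, decide voice/noise per subband, and switch between
    ordinary and excess suppression (with level correction) subband
    by subband, as the subband noise suppression units 750 do."""
    outs = []
    for Xb, Nb in zip(np.array_split(X_pow, n_bands),
                      np.array_split(Ne_pow, n_bands)):
        voice = np.sum(Xb) > thr * np.sum(Nb)         # per-subband decision
        a = alpha_s if voice else alpha_n
        y = np.maximum(Xb - a * Nb, beta * Xb)        # equation (7) per subband
        if not voice:
            y = y + (1.0 - alpha_s) * Xb              # equation (8) correction
        outs.append(y)
    return np.concatenate(outs)
```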
In the seventh embodiment, an integrated signal is generated from the plurality of input signals and is then divided into subbands. Alternatively, the plurality of input signals may first be divided into subbands, and an integrated signal may be generated for each subband.
For embodiments of the present invention, the processing of the present invention can be accomplished by a computer-executable program, and this program can be realized in a computer-readable memory device.
In embodiments of the present invention, the memory device, such as a magnetic disk, a floppy disk, a hard disk, an optical disk (CD-ROM, CD-R, DVD, and so on), an optical magnetic disk (MD and so on) can be used to store instructions for causing a processor or a computer to perform the processes described above.
Furthermore, based on instructions in the program installed from the memory device onto the computer, an OS (operating system) running on the computer, or middleware (MW) such as database management software or network software, may execute part of the processing to realize the embodiments.
Furthermore, the memory device is not limited to a device independent of the computer; it also includes a memory device storing a program downloaded through a LAN or the Internet. Moreover, the memory device is not limited to one device: when the processing of the embodiments is executed using a plurality of memory devices, these together constitute the memory device of the embodiments. The configuration of the device may be arbitrary.
In embodiments of the present invention, the computer executes each processing stage of the embodiments according to the program stored in the memory device. The computer may be one apparatus such as a personal computer or a system in which a plurality of processing apparatuses are connected through a network. Furthermore, in the present invention, the computer is not limited to a personal computer. Those skilled in the art will appreciate that a computer includes a processing unit in an information processor, a microcomputer, and so on. In short, the equipment and the apparatus that can execute the functions in embodiments of the present invention using the program are generally called the computer.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
Number | Date | Country | Kind |
---|---|---|---|
2004-003108 | Jan 2004 | JP | national |