Noise reduction for automatic speech recognition

Information

  • Patent Application
  • 20070260454
  • Publication Number
    20070260454
  • Date Filed
    November 14, 2006
  • Date Published
    November 08, 2007
Abstract
Disclosed herein is a noise reduction method for automatic speech recognition, including: computing a magnitude spectrum of a noisy speech containing a clean speech to be recognized and noise affecting the clean speech; computing a power spectrum of the noisy speech; computing an estimate of a power spectrum of the clean speech; computing an estimate of a power spectrum of the noise; computing an estimate of an a priori signal-to-noise ratio as a function of the estimate of the power spectrum of the clean speech and the estimate of the power spectrum of the noise; computing an estimate of an a posteriori signal-to-noise ratio as a function of the power spectrum of the noisy speech and the estimate of the power spectrum of the noise; computing an attenuation gain as a function of the estimate of the a priori signal-to-noise ratio and the estimate of the a posteriori signal-to-noise ratio; and computing an estimate of a magnitude spectrum of the clean speech as a function of the magnitude spectrum of the noisy speech and the attenuation gain. Computing the estimates of the a priori and the a posteriori signal-to-noise ratios includes computing a noise weighting factor for weighting the estimate of the power spectrum of the noise in the computation of the estimates of the a priori and the a posteriori signal-to-noise ratios; computing a spectral flooring factor for flooring the estimates of the a priori and the a posteriori signal-to-noise ratios; and computing the estimates of the a priori and the a posteriori signal-to-noise ratios also as a function of the noise weighting factor and the spectral flooring factor.
Description

BRIEF DESCRIPTION OF THE DRAWINGS

For a better understanding of the present invention, a preferred embodiment, which is intended purely by way of example and is not to be construed as limiting, will now be described with reference to the attached drawings, wherein:



FIG. 1 shows a block diagram of common sources of speech degradation;



FIG. 2 shows a block diagram of noise reduction for automatic speech recognition;



FIGS. 3 and 4 show plots of a noise overestimation factor and a spectral flooring factor as a function of a global signal-to-noise ratio and used in the noise reduction method according to the present invention;



FIG. 5 shows a standard Ephraim-Malah spectral attenuation rule; and



FIGS. 6-10 show a modified Ephraim-Malah spectral attenuation rule according to the present invention at different global signal-to-noise ratios.





DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS OF THE INVENTION

The following discussion is presented to enable a person skilled in the art to make and use the invention. Various modifications to the embodiments will be readily apparent to those skilled in the art, and the generic principles herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein and defined in the attached claims.


The present invention relates to an automatic speech recognition system including a noise reduction system based on the Spectral Attenuation Technique, and in particular on the Ephraim-Malah spectral attenuation rule, wherein the global formula of the gain $G_k(\gamma_k, \xi_k)$ is unchanged, whereas the estimates $\hat{\xi}_k(m)$ and $\hat{\gamma}_k(m)$ of the a priori and the a posteriori signal-to-noise ratios are modified by making them dependent on a noise weighting factor α(m) and on a spectral flooring factor β(m), as follows:












$$\hat{\gamma}_k(m) = \max\!\left( \frac{|Y_k(m)|^2}{\alpha(m)\,|\hat{D}_k(m)|^2} - 1,\; \beta(m) \right) + 1 \qquad (9)$$

$$\hat{\xi}_k(m) = \max\!\left( \eta(m)\,\frac{|\hat{X}_k(m-1)|^2}{\alpha(m)\,|\hat{D}_k(m-1)|^2} + \bigl(1 - \eta(m)\bigr)\bigl[\hat{\gamma}_k(m) - 1\bigr],\; \beta(m) \right), \qquad \eta(m) \in [0, 1) \qquad (10)$$

where:

    • $|Y_k(m)|^2$ is the k-th spectral line of the power spectrum of the noisy speech;
    • $|\hat{X}_k(m)|^2$ is the k-th spectral line of the estimate of the power spectrum of the clean speech;
    • $|\hat{D}_k(m)|^2$ is the k-th spectral line of the estimate of the power spectrum of the additive noise;
    • $\hat{\xi}_k(m)$ is the estimate of the a priori signal-to-noise ratio relating to the k-th spectral line;
    • $\hat{\gamma}_k(m)$ is the estimate of the a posteriori signal-to-noise ratio relating to the k-th spectral line;
    • α(m) is the noise weighting factor for weighting, namely overestimating or underestimating, the estimate $|\hat{D}_k(m)|^2$ of the power spectrum of the noise in the computation of the estimates $\hat{\xi}_k(m)$, $\hat{\gamma}_k(m)$ of the a priori and the a posteriori signal-to-noise ratios;
    • β(m) is the spectral flooring factor for flooring the estimates $\hat{\xi}_k(m)$, $\hat{\gamma}_k(m)$ of the a priori and the a posteriori signal-to-noise ratios; and
    • η(m) is a weighting coefficient for appropriately weighting the two terms in formula (10).
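
Purely by way of illustration, formulas (9) and (10) may be sketched in Python as follows. The function name `snr_estimates`, the default `eta=0.98` and the small guard `EPS` against division by zero are assumptions introduced for this sketch and do not appear in the description; the spectra are passed as plain lists of per-line power values.

```python
from typing import List, Sequence, Tuple

EPS = 1e-12  # assumed guard against a zero noise estimate (not in the text)


def snr_estimates(
    noisy_power: Sequence[float],       # |Y_k(m)|^2
    noise_power: Sequence[float],       # |D^_k(m)|^2
    prev_clean_power: Sequence[float],  # |X^_k(m-1)|^2
    prev_noise_power: Sequence[float],  # |D^_k(m-1)|^2
    alpha: float,                       # noise weighting factor alpha(m)
    beta: float,                        # spectral flooring factor beta(m)
    eta: float = 0.98,                  # assumed default; text only requires 0 <= eta < 1
) -> Tuple[List[float], List[float]]:
    """Return (gamma_hat, xi_hat) per spectral line, following (9) and (10)."""
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1)")
    gamma_hat: List[float] = []
    xi_hat: List[float] = []
    for y2, d2, x2_prev, d2_prev in zip(
        noisy_power, noise_power, prev_clean_power, prev_noise_power
    ):
        # Formula (9): weighted a posteriori SNR, floored at beta.
        g = max(y2 / max(alpha * d2, EPS) - 1.0, beta) + 1.0
        # Formula (10): decision-directed a priori SNR, floored at beta.
        x = max(
            eta * x2_prev / max(alpha * d2_prev, EPS) + (1.0 - eta) * (g - 1.0),
            beta,
        )
        gamma_hat.append(g)
        xi_hat.append(x)
    return gamma_hat, xi_hat
```

In this sketch a larger `alpha` lowers both estimates (the noise is overestimated), while `beta` bounds them from below, which is the behaviour the two factors are described as having.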


The noise weighting factor α(m) and the spectral flooring factor β(m) are a function of the global signal-to-noise ratio SNR(m), which is defined as:










$$\mathrm{SNR}(m) = 10 \log_{10}\!\left( \frac{\sum_k |Y_k(m)|^2}{\sum_k |\hat{D}_k(m)|^2} \right) \qquad (11)$$

FIGS. 3 and 4 show a preferred development of the noise weighting factor α(m) and the spectral flooring factor β(m) versus the global signal-to-noise ratio SNR(m). The noise weighting factor α(m) and the spectral flooring factor β(m) are piece-wise linear functions and may be mathematically defined as follows:










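$$\alpha(m) = \begin{cases} 1.5 & \text{if } \mathrm{SNR}(m) < 0 \\ 1.5 - \dfrac{(1.5 - 0.001)}{20} \cdot \mathrm{SNR}(m) & \text{if } 0 \le \mathrm{SNR}(m) \le 20 \\ 0.001 & \text{if } \mathrm{SNR}(m) > 20 \end{cases} \qquad (12)$$

$$\beta(m) = \begin{cases} 0.01 & \text{if } \mathrm{SNR}(m) < 0 \\ \dfrac{(1.0 - 0.01)}{20} \cdot \mathrm{SNR}(m) & \text{if } 0 \le \mathrm{SNR}(m) \le 20 \\ 1.0 & \text{if } \mathrm{SNR}(m) > 20 \end{cases} \qquad (13)$$
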
The values indicated in formulas (12) and (13) are intended purely by way of example and are not to be construed as limiting. In general, other values could be usefully employed, while maintaining the general development of the noise weighting factor α(m) and of the spectral flooring factor β(m) versus the global signal-to-noise ratio SNR(m).
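
By way of a non-limiting sketch, the global signal-to-noise ratio of formula (11) and the piece-wise linear factors of formulas (12) and (13) may be computed as shown below. The function names are hypothetical. The helper `ramp` interpolates linearly between the two constant values; for α(m) this coincides with formula (12), whereas for β(m) it equals 0.01 + (1.0 − 0.01)/20 · SNR(m), which differs from the middle branch of formula (13) as written by the constant 0.01. That offset is an assumption made here so that the curve is continuous at both thresholds.

```python
import math
from typing import Sequence


def global_snr_db(noisy_power: Sequence[float], noise_power: Sequence[float]) -> float:
    """Formula (11): ratio of summed noisy power to summed noise power, in dB.

    Both sums are assumed to be strictly positive.
    """
    return 10.0 * math.log10(sum(noisy_power) / sum(noise_power))


def ramp(snr_db: float, low_value: float, high_value: float,
         low_thr: float = 0.0, high_thr: float = 20.0) -> float:
    """Constant below low_thr, constant above high_thr, linear in between."""
    if snr_db < low_thr:
        return low_value
    if snr_db > high_thr:
        return high_value
    t = (snr_db - low_thr) / (high_thr - low_thr)
    return low_value + t * (high_value - low_value)


def noise_weighting(snr_db: float) -> float:
    """alpha(m) with the example values of formula (12)."""
    return ramp(snr_db, 1.5, 0.001)


def spectral_floor(snr_db: float) -> float:
    """beta(m) with the example end values of formula (13) (continuous variant)."""
    return ramp(snr_db, 0.01, 1.0)
```

With these example values, a frame at 10 dB would receive a noise weighting of about 0.75 and a spectral floor of about 0.5; other thresholds and end values can be passed to `ramp` directly.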


In particular, the noise weighting factor α(m) versus the global signal-to-noise ratio SNR(m) should have a first substantially constant value when the global signal-to-noise ratio SNR(m) is lower than a first threshold, a second substantially constant value lower than the first substantially constant value when the global signal-to-noise ratio SNR(m) is higher than a second threshold, and values decreasing from the first substantially constant value to the second substantially constant value when the global signal-to-noise ratio SNR(m) increases from the first threshold to the second threshold.


The spectral flooring factor β(m) versus the global signal-to-noise ratio SNR(m) should have a first substantially constant value when the global signal-to-noise ratio SNR(m) is lower than a first threshold, a second substantially constant value higher than the first substantially constant value when the global signal-to-noise ratio SNR(m) is higher than a second threshold, and values increasing from the first substantially constant value to the second substantially constant value when the global signal-to-noise ratio SNR(m) increases from the first threshold to the second threshold. The developments may be piece-wise linear, as shown in FIGS. 3 and 4, or may be continuous curves of similar shape, in which the intermediate non-constant stretch is either linear, as in FIGS. 3 and 4, or curved, e.g., a cosine-like or a sine-like curve, and the transitions from the intermediate non-constant stretch to the constant stretches are rounded or smoothed.
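
One possible smoothed development of the kind mentioned above is a raised-cosine transition between the two constant values, sketched below in Python. The function name `smooth_ramp`, the raised-cosine shape and the reuse of the example thresholds 0 dB and 20 dB are assumptions made for illustration only; any other monotonic curve with rounded transitions would fit the description equally well.

```python
import math


def smooth_ramp(snr_db: float, low_value: float, high_value: float,
                low_thr: float = 0.0, high_thr: float = 20.0) -> float:
    """Constant outside [low_thr, high_thr], raised-cosine transition inside.

    The transition has zero slope at both thresholds, so it joins the
    constant stretches without a corner.
    """
    if snr_db <= low_thr:
        return low_value
    if snr_db >= high_thr:
        return high_value
    t = (snr_db - low_thr) / (high_thr - low_thr)
    w = 0.5 * (1.0 - math.cos(math.pi * t))  # rises monotonically from 0 to 1
    return low_value + w * (high_value - low_value)


# Example end values taken from the piece-wise linear definitions:
# alpha falls from 1.5 to 0.001, beta rises from 0.01 to 1.0.
alpha_curve = [smooth_ramp(s, 1.5, 0.001) for s in range(-5, 26, 5)]
beta_curve = [smooth_ramp(s, 0.01, 1.0) for s in range(-5, 26, 5)]
```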


The estimate $|\hat{D}_k(m)|^2$ of the power spectrum of the noise in formulas (9), (10) and (11) is computed by means of a first-order recursion as disclosed in the aforementioned Noise Estimation Techniques for Robust Speech Recognition.


Preferably, the first-order recursion may be implemented in conjunction with a standard energy-based Voice Activity Detector, which is a well-known system that detects the presence or absence of speech based on a comparison of the total energy of the speech signal with an adaptive threshold, and outputs a Boolean flag (VAD) having a “true” value when voice is present and a “false” value when voice is absent. When a standard energy-based Voice Activity Detector is used, the estimate $|\hat{D}_k(m)|^2$ of the power spectrum of the noise may be computed as follows:















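$$|\hat{D}_k(m)|^2 = \begin{cases} \lambda\,|\hat{D}_k(m-1)|^2 + (1 - \lambda)\,|Y_k(m)|^2 & \text{if } \left\{ \bigl|\, |Y_k(m)|^2 - |\hat{D}_k(m)|^2 \,\bigr| \le \mu\,\sigma(m) \right\} \vee \left\{ \mathrm{VAD} = \text{false} \right\} \\ |\hat{D}_k(m-1)|^2 & \text{otherwise} \end{cases} \qquad (14)$$
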
where λ is a weighting factor which controls the update speed of the recursion and ranges between 0 and 1, preferably has a value of 0.9, μ is a multiplication factor which controls the allowed dynamics of the noise and preferably has a value of 4.0, and σ(m) is the noise standard deviation, estimated as follows:





$$\sigma^2(m) = \lambda\,\sigma^2(m-1) + (1 - \lambda)\left( |Y_k(m)|^2 - |\hat{D}_k(m)|^2 \right)^2 \qquad (15)$$
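
A minimal Python sketch of the recursion of formulas (14) and (15) is given below, under several assumptions that are not stated in the description: the deviation test uses the previous-frame noise estimate and the previous variance (so that the update is not self-referential), the variance is tracked separately for each spectral line, the test is a less-than-or-equal comparison, and the deviation test and the VAD test are combined with a logical OR. The function name `update_noise_estimate` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def update_noise_estimate(
    noisy_power: Sequence[float],       # |Y_k(m)|^2
    prev_noise_power: Sequence[float],  # |D^_k(m-1)|^2
    prev_sigma2: Sequence[float],       # sigma^2(m-1), kept per spectral line (assumption)
    vad: bool,                          # True when voice is present
    lam: float = 0.9,                   # preferred value of lambda
    mu: float = 4.0,                    # preferred value of mu
) -> Tuple[List[float], List[float]]:
    """Return the updated noise power and variance for one frame."""
    new_noise: List[float] = []
    new_sigma2: List[float] = []
    for y2, d2_prev, s2_prev in zip(noisy_power, prev_noise_power, prev_sigma2):
        # Deviation of the current frame from the running noise estimate,
        # compared with mu times the noise standard deviation.
        within_dynamics = abs(y2 - d2_prev) <= mu * math.sqrt(s2_prev)
        if within_dynamics or not vad:  # assumed OR combination
            d2 = lam * d2_prev + (1.0 - lam) * y2  # first-order recursion
        else:
            d2 = d2_prev  # hold the previous estimate
        # Formula (15): recursive estimate of the noise variance.
        s2 = lam * s2_prev + (1.0 - lam) * (y2 - d2) ** 2
        new_noise.append(d2)
        new_sigma2.append(s2)
    return new_noise, new_sigma2
```

Calling the function once per frame, and feeding its two outputs back in as `prev_noise_power` and `prev_sigma2`, would track the noise spectrum over time; the initial values would need to be chosen by the implementer, for example from the first frames of the signal.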



FIG. 5 shows the standard Ephraim-Malah spectral attenuation rule ($G_k$, $\xi_k(m)$ and $\gamma_k(m)$ computed according to formulas (3), (7) and (8)), whereas FIGS. 6-10 show the modified Ephraim-Malah spectral attenuation rule according to the present invention ($G_k$, $\xi_k(m)$ and $\gamma_k(m)$ computed according to formulas (3), (10) and (9)) at different global signal-to-noise ratios SNR(m) (0, 5, 10, 15 and 20 dB). The skilled person will appreciate that the effect of the introduced modification is a gradual reduction of the attenuation produced by the original gain in areas where the a posteriori signal-to-noise ratio $\gamma_k(m)$ is high, as the global signal-to-noise ratio SNR(m) increases.


Extensive experimental work has been performed to validate the invention, and some results, which may be useful to highlight the features of the invention, are hereinafter reported.


In particular, experiments were conducted with an automatic speech recognition system, using noise reduction with the standard Ephraim-Malah spectral attenuation and with the noise reduction proposed in the invention. The automatic speech recognition system was trained for the target languages using large domain- and task-independent corpora, not collected in noisy environments and without added noise.


The experiment was performed on the Aurora3 corpus, which is a standard corpus defined by the ETSI Aurora Project for noise reduction tests and is made of connected digits recorded in a car in several languages (Italian, Spanish and German). A high-mismatch test set and a noisy component of the training set (used as a test set) were employed.


The modification of the Ephraim-Malah spectral attenuation rule according to the invention produces an average error reduction of 28.9% with respect to the state of the art Wiener Spectral Subtraction, and an average error reduction of 22.9% with respect to the standard Ephraim-Malah Spectral Attenuation Rule. The average error reduction with respect to no de-noising is 50.2%.


Finally, it is clear that numerous modifications and variants can be made to the present invention, all falling within the scope of the invention, as defined in the appended claims.

Claims
  • 1. A noise reduction method for automatic speech recognition, including: computing a magnitude spectrum ($|Y_k(m)|$) of a noisy speech containing a clean speech to be recognized and noise affecting the clean speech; computing a power spectrum ($|Y_k(m)|^2$) of the noisy speech; computing an estimate ($|\hat{X}_k(m)|^2$) of a power spectrum of the clean speech; computing an estimate ($|\hat{D}_k(m)|^2$) of a power spectrum of the noise; computing an estimate ($\hat{\xi}_k(m)$) of an a priori signal-to-noise ratio as a function of the estimate ($|\hat{X}_k(m)|^2$) of the power spectrum of the clean speech and the estimate ($|\hat{D}_k(m)|^2$) of the power spectrum of the noise; computing an estimate ($\hat{\gamma}_k(m)$) of an a posteriori signal-to-noise ratio as a function of the power spectrum ($|Y_k(m)|^2$) of the noisy speech and the estimate ($|\hat{D}_k(m)|^2$) of the power spectrum of the noise; computing an attenuation gain ($G_k(m)$) as a function of the estimate ($\hat{\xi}_k(m)$) of the a priori signal-to-noise ratio and the estimate ($\hat{\gamma}_k(m)$) of the a posteriori signal-to-noise ratio; computing an estimate ($|\hat{X}_k(m)|$) of a magnitude spectrum of the clean speech as a function of the magnitude spectrum ($|Y_k(m)|$) of the noisy speech and the attenuation gain ($G_k(m)$); characterized in that computing the estimates ($\hat{\xi}_k(m)$, $\hat{\gamma}_k(m)$) of the a priori and the a posteriori signal-to-noise ratios includes: computing a noise weighting factor (α(m)) for weighting the estimate ($|\hat{D}_k(m)|^2$) of the power spectrum of the noise in the computation of the estimates ($\hat{\xi}_k(m)$, $\hat{\gamma}_k(m)$) of the a priori and the a posteriori signal-to-noise ratios; computing a spectral flooring factor (β(m)) for flooring the estimates ($\hat{\xi}_k(m)$, $\hat{\gamma}_k(m)$) of the a priori and the a posteriori signal-to-noise ratios; and computing the estimates ($\hat{\xi}_k(m)$, $\hat{\gamma}_k(m)$) of the a priori and the a posteriori signal-to-noise ratios also as a function of the noise weighting factor (α(m)) and the spectral flooring factor (β(m)).
  • 2. A noise reduction method as claimed in claim 1, wherein the noise weighting factor (α(m)) and the spectral flooring factor (β(m)) are computed as a function of a global signal-to-noise ratio (SNR(m)).
  • 3. A noise reduction method as claimed in claim 2, wherein the noise weighting factor (α(m)) versus the global signal-to-noise ratio (SNR(m)) has a first substantially constant value when the global signal-to-noise ratio (SNR(m)) is lower than a first threshold, a second substantially constant value lower than the first substantially constant value when the global signal-to-noise ratio (SNR(m)) is higher than a second threshold, and decreasing values when the global signal-to-noise ratio (SNR(m)) ranges between the first and the second thresholds.
  • 4. A noise reduction method as claimed in claim 3, wherein the noise weighting factor (α(m)) decreases linearly when the global signal-to-noise ratio (SNR(m)) ranges between the first and the second thresholds.
  • 5. A noise reduction method as claimed in claim 2, wherein the spectral flooring factor (β(m)) versus the global signal-to-noise ratio (SNR(m)) has a first substantially constant value when the global signal-to-noise ratio (SNR(m)) is lower than a first threshold, a second substantially constant value higher than the first substantially constant value when the global signal-to-noise ratio (SNR(m)) is higher than a second threshold, and increasing values when the global signal-to-noise ratio (SNR(m)) ranges between the first and the second thresholds.
  • 6. A noise reduction method as claimed in claim 5, wherein the spectral flooring factor (β(m)) increases linearly when the global signal-to-noise ratio (SNR(m)) ranges between the first and the second thresholds.
  • 7. A noise reduction method as claimed in claim 1, wherein the estimate ($\hat{\gamma}_k(m)$) of the a posteriori signal-to-noise ratio is computed as follows:
  • 8. A noise reduction method as claimed in claim 1, wherein the estimate ($\hat{\xi}_k(m)$) of the a priori signal-to-noise ratio is computed as follows:
  • 9. A noise reduction method as claimed in claim 1, wherein the attenuation gain ($G_k(m)$) is computed as follows:
  • 10. A noise reduction method as claimed in claim 1, wherein the estimate ($|\hat{D}_k(m)|^2$) of the power spectrum of the noise is computed as follows:
  • 11. A noise reduction method as claimed in claim 2, wherein the global signal-to-noise ratio (SNR(m)) is computed as follows:
  • 12. An automatic speech recognition system including a noise reduction system configured to implement the method according to claim 1.
  • 13. A computer program product comprising a computer program code able, when loaded in a processing system, to implement the method according to claim 1.
Continuations (1)
         Number            Date        Country
Parent   PCT/EP04/50816    May 2004    US
Child    11598705                      US