Method for conditioning a digital speech signal

Description

BACKGROUND OF THE INVENTION

The present invention concerns digital speech signal processing techniques.

Many representations of speech signals take account of the harmonic content of such signals resulting from the manner in which they are produced. In most cases, this is reflected in the determination of a pitch frequency of the speech signal.

Digital processing of speech signals has recently expanded greatly in varied domains: speech coding for transmission and storage, speech recognition, noise reduction, echo cancellation, etc. Such processing very frequently uses an estimate of the pitch frequency and particular operations related to the estimated frequency.

Many methods have been developed for estimating the pitch frequency. One method that is routinely used is based on linear prediction, which evaluates a prediction delay that is inversely proportional to the pitch frequency. The delay can be expressed as an integer or fractional number of digital signal sample times. Other methods directly detect breaks in the signal which can be attributed to glottal closures of the speaker, the time intervals between such breaks being inversely proportional to the pitch frequency.

If the digital speech signal is transformed into the frequency domain, as by a discrete Fourier transform, it is necessary to consider a discrete spectrum of the speech signal. The discrete frequencies considered are of the form (a/N)×F_e, where F_e is the sampling frequency, N is the number of samples of the blocks used in the discrete Fourier transform and a is an integer from 0 to N/2−1. These frequencies do not necessarily include the estimated pitch frequency and/or its harmonics. This causes inaccuracy in operations relating to the estimated pitch, which can cause distortion of the processed signal, affecting its harmonic character.

A principal object of the present invention is to propose a method of conditioning the speech signal which makes it less sensitive to the above drawbacks.

SUMMARY OF THE INVENTION

The invention therefore proposes a method of conditioning a digital speech signal processed by successive frames, wherein harmonic analysis of the speech signal is performed to estimate a pitch frequency of the speech signal over each frame in which it features vocal activity. After estimating the pitch frequency of the speech signal over one frame, the speech signal of the frame is conditioned by oversampling it at an oversampling frequency which is a multiple of the estimated pitch frequency.

In processing the speech signal, this enables the frequencies closest to the estimated pitch to be favoured over other frequencies. The harmonic character of the speech signal is therefore preserved as far as possible. To compute spectral components of the speech signal, the conditioned signal is distributed between blocks of N samples which are transformed into the frequency domain and the ratio between the oversampling frequency and the estimated pitch frequency is chosen as a factor of the number N.

The foregoing technique can be refined by estimating the pitch frequency of the speech signal over a frame in the following manner:

estimating time intervals between two consecutive breaks of the signal which can be attributed to glottal closures of the speaker occurring during the frame, the estimated pitch frequency being inversely proportional to said time intervals;

interpolating the speech signal in said time intervals, so that the conditioned signal resulting from such interpolation has a constant time interval between two consecutive breaks.

This approach artificially constructs a signal frame over which the speech signal features breaks at constant intervals. Any variations of the pitch over the duration of a frame are therefore taken into account.

In a further improvement, after processing each conditioned signal frame, a number of the signal samples supplied by such processing is retained which is equal to an integer multiple of the ratio between the sampling frequency and the estimated pitch frequency. This avoids the distortion problems caused by phase discontinuities between frames, which are generally not totally corrected by conventional overlap-add techniques.

Using the oversampling technique to condition the signal yields a good measurement of the degree of voicing of the speech signal over the frame, based on the entropy of the autocorrelation of the spectral components computed on the basis of the conditioned signal. The greater the disturbance of the spectrum, i.e. the more it is voiced, the lower the entropy values. Conditioning the speech signal accentuates the irregularity of the spectrum and therefore the entropy variations, with the result that the latter constitutes a measurement of good sensitivity.

In the remainder of this description, the conditioning method according to the invention is illustrated in a system for suppressing noise in a speech signal. Clearly the method can find applications in many other types of digital speech processing: coding, recognition, echo cancellation, etc.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a noise suppression system implementing the present invention;

FIGS. 2 and 3 are flowcharts of procedures used by a vocal activity detector of the system shown in FIG. 1;

FIG. 4 is a diagram representing the states of a vocal activity detection automaton;

FIG. 5 is a graph showing variations in a degree of vocal activity;

FIG. 6 is a block diagram of a module for overestimating the noise of the system shown in FIG. 1;

FIG. 7 is a graph illustrating the computation of a masking curve;

FIG. 8 is a graph illustrating the use of masking curves in the system shown in FIG. 1;

FIG. 9 is a block diagram of another noise suppression system implementing the present invention;

FIG. 10 is a graph illustrating a harmonic analysis method that can be used in a method according to the invention; and

FIG. 11 shows part of a variant of the block diagram shown in FIG. 9.

DESCRIPTION OF PREFERRED EMBODIMENTS

The noise suppression system shown in FIG. 1 processes a digital speech signal s. A windowing module 10 formats the signal s in the form of successive windows or frames each made up of a number N of digital signal samples. In the usual way, these frames can overlap each other. In the remainder of this description, the frames are considered to be made up of N=256 samples with a sampling frequency F_e of 8 kHz, with Hamming weighting in each window and with 50% overlaps between consecutive windows, although this is not limiting on the invention.

The signal frame is transformed into the frequency domain by a module 11 using a conventional fast Fourier transform (FFT) algorithm to compute the modulus of the spectrum of the signal. The module 11 then delivers a set of N=256 frequency components S_{n,f} of the speech signal, where n is the number of the current frame and f is a frequency from the discrete spectrum. Because of the properties of the digital signals in the frequency domain, only the first N/2=128 samples are used.

Instead of using the frequency resolution available downstream of the fast Fourier transform to compute the estimates of the noise contained in the signal s, a lower resolution is used, determined by a number I of frequency bands covering the bandwidth [0, F_e/2] of the signal. Each band i (1≦i≦I) extends from a lower frequency f(i−1) to a higher frequency f(i), with f(0)=0 and f(I)=F_e/2. The subdivision into frequency bands can be uniform (f(i)−f(i−1)=F_e/(2I)). It can also be non-uniform (for example according to a Bark scale). A module 12 computes the respective averages of the spectral components S_{n,f} of the speech signal in bands, for example by means of a uniform weighting such as:

\begin{matrix} S_{n, i} = \frac{1}{f (i) - f (i - 1)} \sum_{f \in [f (i - 1), f (i) [} S_{n, f} & (1) \end{matrix}

This averaging reduces fluctuations between bands by averaging the contributions of the noise in the bands, which reduces the variance of the noise estimator. Also, this averaging greatly reduces the complexity of the system.
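The band averaging of equation (1) can be sketched as follows. This is only an illustration: it assumes the band edges are given as FFT bin indices (so the width f(i)−f(i−1) is a bin count), and the function name `band_averages` and the choice of I=16 uniform bands are hypothetical, not taken from the description.

```python
def band_averages(spectrum, edges):
    """Average spectral magnitudes over half-open bands [edges[i-1], edges[i])."""
    averages = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = spectrum[lo:hi]
        averages.append(sum(band) / len(band))  # uniform weighting, equation (1)
    return averages

# Assumed layout: 128 usable bins split uniformly into I = 16 bands of 8 bins.
N_HALF, I = 128, 16
edges = [i * N_HALF // I for i in range(I + 1)]
band_values = band_averages([2.0] * N_HALF, edges)
```

A non-uniform subdivision would only change the `edges` list.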

The averaged spectral components S_{n,i} are sent to a vocal activity detector module 15 and a noise estimator module 16. The two modules 15, 16 operate conjointly in the sense that degrees of vocal activity γ_{n,i} measured for the various bands by the module 15 are used by the module 16 to estimate the long-term energy of the noise in the various bands, whereas the long-term estimates \hat{B}_{n,i} are used by the module 15 for a priori suppression of noise in the speech signal in the various bands to determine the degrees of vocal activity γ_{n,i}.

The operation of the modules 15 and 16 can correspond to the flowcharts shown in FIGS. 2 and 3.

In steps 17 through 20, the module 15 effects a priori suppression of noise in the speech signal in the various bands i for the signal frame n. This a priori noise suppression is effected by a conventional non-linear spectral subtraction scheme based on estimates of the noise obtained in one or more preceding frames. In step 17, using the resolution of the bands i, the module 15 computes the frequency response Hp_{n,i} of the a priori noise suppression filter from the equation:

\begin{matrix} {Hp}_{n, i} = \frac{S_{n, i} - α_{n - τ1, i}^{'} \cdot {\hat{B}}_{n - τ1, i}}{S_{n - τ2, i}} & (2) \end{matrix}

where τ1 and τ2 are delays expressed as a number of frames (τ1≧1, τ2≧0), and α′_{n,i} is a noise overestimation coefficient determined as explained later. The delay τ1 can be fixed (for example τ1=1) or variable. The greater the degree of confidence in the detection of vocal activity, the lower the value of τ1.

In steps 18 to 20, the spectral components Êp_{n,i} are computed from:

\begin{matrix} {\hat{E}p}_{n, i} = \max \{ {Hp}_{n, i} \cdot S_{n, i}, {βp}_{i} \cdot {\hat{B}}_{n - τ1, i} \} & (3) \end{matrix}

where βp_i is a floor coefficient close to 0, used conventionally to prevent the spectrum of the noise-suppressed signal from taking negative values or excessively low values which would give rise to musical noise.

Steps 17 to 20 therefore essentially consist of subtracting from the spectrum of the signal an a priori estimate of the noise spectrum, over-weighted by the coefficient α′_{n−τ1,i}.
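A minimal sketch of the a priori suppression of equations (2) and (3) follows, assuming τ1=1 and τ2=0 so that the denominator of (2) is the current component S_{n,i}. The function name, the default floor value and the guard against a zero component are illustrative assumptions, not part of the description.

```python
def a_priori_suppress(S, B_prev, alpha_prev, beta_p=0.01):
    """Per-band a priori noise suppression.

    S          -- band components S_{n,i} of the current frame
    B_prev     -- noise estimates of frame n-1 (tau1 = 1 assumed)
    alpha_prev -- overestimation coefficients of frame n-1
    beta_p     -- floor coefficient (value assumed for illustration)
    """
    out = []
    for s, b, a in zip(S, B_prev, alpha_prev):
        hp = (s - a * b) / s if s > 0 else 0.0   # equation (2) with tau2 = 0
        out.append(max(hp * s, beta_p * b))      # equation (3): floor at beta_p * B
    return out
```

For a band well above the noise the output is the component minus the over-weighted noise estimate; for a band below it, the floor βp_i·\hat{B} is returned instead of a negative value.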

In step 21, the module 15 computes the energy of the a priori noise-suppressed signal in the various bands i for frame n: E_{n,i}=Êp_{n,i}^2. It also computes a global average E_{n,0} of the energy of the a priori noise-suppressed signal by summing the energies for each band E_{n,i}, weighted by the widths of the bands. In the following notation, the index i=0 is used to designate the global band of the signal.

In steps 22 and 23, the module 15 computes, for each band i (0≦i≦I), a magnitude ΔE_{n,i} representing the short-term variation in the energy of the noise-suppressed signal in the band i and a long-term value \bar{E}_{n,i} of the energy of the noise-suppressed signal in the band i. The magnitude ΔE_{n,i} can be computed from a simplified equation:

ΔE_{n, i} = \left| \frac{E_{n - 4, i} + E_{n - 3, i} - E_{n - 1, i} - E_{n, i}}{10} \right| .

As for the long-term energy \bar{E}_{n,i}, it can be computed using a forgetting factor B1 such that 0<B1<1, namely \bar{E}_{n,i}=B1·\bar{E}_{n−1,i}+(1−B1)·E_{n,i}.
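The two quantities of steps 22 and 23 can be sketched as below. The function names and the default forgetting factor 0.9 are assumptions made for illustration; the description only requires 0<B1<1.

```python
def short_term_variation(E_hist):
    """E_hist holds the last five band energies [E_{n-4}, E_{n-3}, E_{n-2}, E_{n-1}, E_n]."""
    e4, e3, _, e1, e0 = E_hist
    return abs((e4 + e3 - e1 - e0) / 10.0)

def long_term_energy(E_bar_prev, E_n, B1=0.9):
    """First-order recursive smoothing with forgetting factor B1 (value assumed)."""
    return B1 * E_bar_prev + (1.0 - B1) * E_n
```

A constant energy history gives a zero short-term variation, while a sudden rise in the last two frames gives a large one.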

After computing the energies E_{n,i} of the noise-suppressed signal, its short-term variations ΔE_{n,i} and its long-term values \bar{E}_{n,i} in the manner indicated in FIG. 2, the module 15 computes, for each band i (0≦i≦I), a value ρ_i representative of the evolution of the energy of the noise-suppressed signal. This computation is effected in steps 25 to 36 in FIG. 3, executed for each band i from i=0 to i=I. The computation uses a long-term noise envelope estimator ba_i, an internal estimator bi_i and a noisy frame counter b_i.

In step 25, the magnitude ΔE_{n,i} is compared to a threshold ε1. If the threshold ε1 has not been reached, the counter b_i is incremented by one unit in step 26. In step 27, the long-term estimator ba_i is compared to the smoothed energy value \bar{E}_{n,i}. If ba_i≧\bar{E}_{n,i}, the estimator ba_i is taken as equal to the smoothed value \bar{E}_{n,i} in step 28 and the counter b_i is reset to zero. The magnitude ρ_i, which is taken as equal to ba_i/\bar{E}_{n,i} (step 36), is then equal to 1.

If step 27 shows that ba_i<\bar{E}_{n,i}, the counter b_i is compared to a limit value bmax in step 29. If b_i>bmax, the signal is considered to be too stationary to support vocal activity. The aforementioned step 28, which amounts to considering that the frame contains only noise, is then executed. If b_i≦bmax in step 29, the internal estimator bi_i is computed in step 33 from the equation:

bi_i=(1−Bm)·\bar{E}_{n,i}+Bm·ba_i (4)

In the above equation, Bm represents an update coefficient from 0.90 to 1. Its value differs according to the state of a vocal activity detector automaton (steps 30 to 32). The state δ_{n−1} is that determined during processing of the preceding frame. If the automaton is in a speech detection state (δ_{n−1}=2 in step 30), the coefficient Bm takes a value Bmp very close to 1 so the noise estimator is very slightly updated in the presence of speech. Otherwise, the coefficient Bm takes a lower value Bms to enable more meaningful updating of the noise estimator in the silence phase. In step 34, the difference ba_i−bi_i between the long-term estimator and the internal noise estimator is compared with a threshold ε2. If the threshold ε2 has not been reached, the long-term estimator ba_i is updated with the value of the internal estimator bi_i in step 35. Otherwise, the long-term estimator ba_i remains unchanged. This prevents sudden variations due to a speech signal causing the noise estimator to be updated.
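Steps 25 to 36 for one band can be sketched as follows. Several points are assumptions of this sketch rather than statements of the description: the comparison of step 34 is taken on the absolute value of the difference, the counter is left untouched when ΔE reaches ε1, the smoothed energy is assumed positive, and all numeric defaults (ε1, ε2, bmax, Bmp, Bms) are placeholders.

```python
SPEECH = 2  # automaton state in which speech is detected

def update_envelope(state, delta_E, E_bar, prev_automaton_state,
                    eps1=1.0, eps2=1.0, b_max=10, Bmp=0.99, Bms=0.95):
    """Update one band in place and return rho = ba / E_bar (step 36).

    state -- dict with 'ba' (long-term noise envelope) and 'b' (frame counter)
    """
    if delta_E < eps1:                                   # steps 25-26
        state['b'] += 1
    if state['ba'] >= E_bar or state['b'] > b_max:       # steps 27 and 29
        state['ba'] = E_bar                              # step 28: noise-only frame
        state['b'] = 0
    else:
        Bm = Bmp if prev_automaton_state == SPEECH else Bms   # steps 30-32
        bi = (1.0 - Bm) * E_bar + Bm * state['ba']            # step 33, equation (4)
        if abs(state['ba'] - bi) < eps2:                 # step 34 (absolute value assumed)
            state['ba'] = bi                             # step 35
    return state['ba'] / E_bar
```

In this sketch a value of ρ equal to 1 means that the band is currently treated as noise only.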

After the magnitudes ρ_i have been obtained, the module 15 proceeds to the vocal activity decisions of step 37. The module 15 first updates the state of the detection automaton according to the magnitude ρ_0 calculated for the whole band of the signal. The new state δ_n of the automaton depends on the preceding state δ_{n−1} and on ρ_0, as shown in FIG. 4.

Four states are possible: δ=0 detects silence, or absence of speech, δ=2 detects the presence of vocal activity and states δ=1 and δ=3 are intermediate rising and falling states. If the automaton is in the silence state (δ_{n−1}=0), it remains there if ρ_0 does not exceed a first threshold SE1, and otherwise goes to the rising state. In the rising state (δ_{n−1}=1), it reverts to the silence state if ρ_0 is smaller than the threshold SE1, goes to the speech state if ρ_0 is greater than a second threshold SE2 greater than the threshold SE1 and it remains in the rising state if SE1≦ρ_0≦SE2. If the automaton is in the speech state (δ_{n−1}=2), it remains there if ρ_0 exceeds a third threshold SE3 lower than the threshold SE2, and enters the falling state otherwise. In the falling state (δ_{n−1}=3), the automaton reverts to the speech state if ρ_0 is higher than the threshold SE2, reverts to the silence state if ρ_0 is below a fourth threshold SE4 lower than the threshold SE2 and remains in the falling state if SE4≦ρ_0≦SE2.
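The transitions just described can be written as a small state function. The threshold values used in the usage lines are arbitrary placeholders chosen only to respect SE2>SE1, SE3<SE2 and SE4<SE2; the description gives no numeric values.

```python
SILENCE, RISING, SPEECH, FALLING = 0, 1, 2, 3

def next_state(prev, rho0, se1, se2, se3, se4):
    """Return the new automaton state from the previous state and rho_0."""
    if prev == SILENCE:
        return SILENCE if rho0 <= se1 else RISING
    if prev == RISING:
        if rho0 < se1:
            return SILENCE
        if rho0 > se2:
            return SPEECH
        return RISING
    if prev == SPEECH:
        return SPEECH if rho0 > se3 else FALLING
    # prev == FALLING
    if rho0 > se2:
        return SPEECH
    if rho0 < se4:
        return SILENCE
    return FALLING

# Placeholder thresholds (assumed values, for illustration only).
TH = dict(se1=2.0, se2=4.0, se3=3.0, se4=1.5)
state = SILENCE
for rho0 in (2.5, 5.0, 2.5, 1.0):
    state = next_state(state, rho0, **TH)
```

With these placeholder thresholds the loop passes through the rising, speech and falling states and ends in silence.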

In step 37, the module 15 also computes the degrees of vocal activity γ_{n,i} in each band i≧1. This degree γ_{n,i} is preferably a non-binary parameter, i.e. the function γ_{n,i}=g(ρ_i) is a function varying continuously in the range from 0 to 1 as a function of the values taken by the magnitude ρ_i. This function has the shape shown in FIG. 5, for example.

The module 16 calculates the estimates of the noise on a band by band basis, and the estimates are used in the noise suppression process, employing successive values of the components S_{n,i} and the degrees of vocal activity γ_{n,i}. This corresponds to steps 40 to 42 in FIG. 3. Step 40 determines if the vocal activity detector automaton has just gone from the rising state to the speech state. If so, the last two estimates \hat{B}_{n−1,i} and \hat{B}_{n−2,i} previously computed for each band i≧1 are corrected according to the value of the preceding estimate \hat{B}_{n−3,i}. The correction is done to allow for the fact that, in the rise phase (δ=1), the long-term estimates of the energy of the noise in the vocal activity detection process (steps 30 to 33) were computed as if the signal included only noise (Bm=Bms), with the result that they may be subject to error.

In step 42, the module 16 updates the estimates of the noise on a band by band basis using the equations:

\tilde{B}_{n,i}=λ_B·\hat{B}_{n−1,i}+(1−λ_B)·S_{n,i} (5)

\hat{B}_{n,i}=γ_{n,i}·\hat{B}_{n−1,i}+(1−γ_{n,i})·\tilde{B}_{n,i} (6)

in which λ_B designates a forgetting factor such that 0<λ_B<1. Equation (6) shows that the non-binary degree of vocal activity γ_{n,i} is taken into account.
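Equations (5) and (6) amount to a recursive update gated by the degree of vocal activity. The sketch below handles one band; the function name and the default forgetting factor are illustrative assumptions.

```python
def update_noise(B_prev, S, gamma, lam=0.9):
    """Noise estimate update for one band.

    B_prev -- previous long-term estimate
    S      -- current band component S_{n,i}
    gamma  -- degree of vocal activity in [0, 1]
    lam    -- forgetting factor lambda_B (value assumed)
    """
    B_tilde = lam * B_prev + (1.0 - lam) * S           # equation (5)
    return gamma * B_prev + (1.0 - gamma) * B_tilde    # equation (6)
```

With γ=1 (speech certain) the estimate is frozen, and with γ=0 it follows the smoothed value of equation (5) completely.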

As previously indicated, the long-term estimates of the noise \hat{B}_{n,i} are overestimated by a module 45 (FIG. 1) before noise suppression by non-linear spectral subtraction. The module 45 computes the overestimation coefficient α′_{n,i} previously referred to, along with an overestimate \hat{B}′_{n,i} which essentially corresponds to α′_{n,i}·\hat{B}_{n,i}.

FIG. 6 shows the organisation of the overestimation module 45. The overestimate \hat{B}′_{n,i} is obtained by combining the long-term estimate \hat{B}_{n,i} and a measurement ΔB^{max}_{n,i} of the variability of the component of the noise in the band i around its long-term estimate. In the example considered, the combination is essentially a simple sum performed by an adder 46. It could instead be a weighted sum.

The overestimation coefficient α′_{n,i} is equal to the ratio between the sum \hat{B}_{n,i}+ΔB^{max}_{n,i} delivered by the adder 46 and the delayed long-term estimate \hat{B}_{n−τ3,i} (divider 47), with a ceiling limit value α_{max}, for example α_{max}=4 (block 48). The delay τ3 is used to correct the value of the overestimation coefficient α′_{n,i}, if necessary, in the rising phases (δ=1), before the long-term estimates have been corrected by steps 40 and 41 from FIG. 3 (for example τ3=3).

The overestimate \hat{B}′_{n,i} is finally taken as equal to α′_{n,i}·\hat{B}_{n−τ3,i} (multiplier 49).

The measurement ΔB^{max}_{n,i} of the variability of the noise reflects the variance of the noise estimator. It is obtained as a function of the values of S_{n,i} and of \hat{B}_{n,i} computed for a certain number of preceding frames over which the speech signal does not feature any vocal activity in band i. It is a function of the differences |S_{n−k,i}−\hat{B}_{n−k,i}| computed for a number K of silence frames (n−k≦n). In the example shown, this function is simply the maximum (block 50). For each frame n, the degree of vocal activity γ_{n,i} is compared to a threshold (block 51) to decide whether or not the difference |S_{n,i}−\hat{B}_{n,i}|, calculated at 52-53, must be loaded into a queue 54 with K locations organised in first-in/first-out (FIFO) mode. If γ_{n,i} does not exceed the threshold (which can be equal to 0 if the function g() has the form shown in FIG. 5), the frame is a silence frame in band i and the FIFO 54 is loaded; otherwise it is not loaded. The maximum value contained in the FIFO 54 is then supplied as the measured variability ΔB^{max}_{n,i}.

The measured variability ΔB^{max}_{n,i} can instead be obtained as a function of the values S_{n,f} (not S_{n,i}) and \hat{B}_{n,i}. The procedure is then the same, except that the FIFO 54 contains, instead of |S_{n−k,i}−\hat{B}_{n−k,i}| for each of the bands i,

\max_{f \in [f (i - 1), f (i) [} \left| S_{n - k, f} - {\hat{B}}_{n - k, i} \right| .

Because of the independent estimates of the long-term fluctuations \hat{B}_{n,i} and short-term variability ΔB^{max}_{n,i} of the noise, the overestimator \hat{B}′_{n,i} makes the noise suppression process highly robust to musical noise.
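The overestimation module 45 can be sketched for a single band as below. The sketch assumes that the queue holds the differences measured on frames without vocal activity in the band (γ not exceeding the threshold), that the oldest available estimate stands in for the delayed estimate until τ3 frames have been seen, and that a non-positive delayed estimate simply yields the ceiling α_max. The class name and the default K are hypothetical.

```python
from collections import deque

class NoiseOverestimator:
    """One-band sketch of adder 46, divider 47, limiter 48, multiplier 49 and FIFO 54."""

    def __init__(self, K=8, alpha_max=4.0, tau3=3, gamma_threshold=0.0):
        self.diffs = deque(maxlen=K)             # FIFO 54 with K locations
        self.history = deque(maxlen=tau3 + 1)    # recent long-term estimates
        self.alpha_max = alpha_max
        self.gamma_threshold = gamma_threshold

    def step(self, S, B_hat, gamma):
        self.history.append(B_hat)
        if gamma <= self.gamma_threshold:        # block 51: frame without vocal activity
            self.diffs.append(abs(S - B_hat))    # blocks 52-53
        delta_max = max(self.diffs) if self.diffs else 0.0   # block 50
        B_delayed = self.history[0]              # estimate from up to tau3 frames ago
        if B_delayed > 0:
            alpha = min((B_hat + delta_max) / B_delayed, self.alpha_max)
        else:
            alpha = self.alpha_max               # assumed fallback
        return alpha, alpha * B_delayed          # coefficient and overestimate
```

Using `deque(maxlen=K)` gives the first-in/first-out behaviour directly: once K differences are stored, loading a new one discards the oldest.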

The module 55 shown in FIG. 1 performs a first spectral subtraction phase. This phase supplies, with the resolution of the bands i (1≦i≦I), the frequency response H^1_{n,i} of a first noise suppression filter, as a function of the components S_{n,i} and \hat{B}_{n,i} and the overestimation coefficients α′_{n,i}. This computation can be performed for each band i using the equation:

\begin{matrix} H_{n, i}^{1} = \frac{\max \{ S_{n, i} - α_{n, i}^{'} \cdot {\hat{B}}_{n, i}, β_{i}^{1} \cdot {\hat{B}}_{n, i} \}}{S_{n - τ4, i}} & (7) \end{matrix}

in which τ4 is an integer delay such that τ4≧0 (for example τ4=0). The coefficient β^1_i in equation (7), like the coefficient βp_i in equation (3), represents a floor used conventionally to avoid negative values or excessively low values of the noise-suppressed signal.

In a manner known in the art (see EP-A-0 534 837), the overestimation coefficient α′_{n,i} in equation (7) could be replaced by another coefficient equal to a function of α′_{n,i} and an estimate of the signal-to-noise ratio (for example S_{n,i}/\hat{B}_{n,i}), this function being a decreasing function of the estimated value of the signal-to-noise ratio. This function is then equal to α′_{n,i} for the lowest values of the signal-to-noise ratio. If the signal is very noisy, there is clearly no utility in reducing the overestimation factor. This function advantageously decreases toward zero for the highest values of the signal-to-noise ratio. This protects the highest energy areas of the spectrum, in which the speech signal is the most meaningful, the quantity subtracted from the signal then tending toward zero.

This strategy can be refined by applying it selectively to the harmonics of the pitch frequency of the speech signal if the latter features vocal activity.

Accordingly, in the embodiment shown in FIG. 1, a second noise suppression phase is performed by a harmonic protection module 56. This module computes, with the resolution of the Fourier transform, the frequency response H^2_{n,f} of a second noise suppression filter as a function of the parameters H^1_{n,i}, α′_{n,i}, \hat{B}_{n,i}, δ_n, S_{n,i} and the pitch frequency f_p=F_e/T_p computed outside silence phases by a harmonic analysis module 57. In a silence phase (δ_n=0), the module 56 is not in service, i.e. H^2_{n,f}=H^1_{n,i} for each frequency f of a band i. The module 57 can use any prior art method to analyse the speech signal of the frame to determine the pitch period T_p, expressed as an integer or fractional number of samples, for example a linear prediction method.

The protection afforded by the module 56 can consist in effecting, for each frequency f belonging to a band i:

\left\{ \begin{matrix} H_{n, f}^{2} = 1 & \text{if } \left\{ \begin{matrix} S_{n, i} - α_{n, i}^{'} \cdot {\hat{B}}_{n, i} > β_{i}^{2} \cdot {\hat{B}}_{n, i} & (8) \\ \text{and } \exists η \text{ integer} / \left| f - η \cdot f_{p} \right| \leq Δf / 2 & (9) \end{matrix} \right. \\ H_{n, f}^{2} = H_{n, f}^{1} & \text{otherwise} \end{matrix} \right.

Δf=F_e/N represents the spectral resolution of the Fourier transform. If H^2_{n,f}=1, the quantity subtracted from the component S_{n,f} is zero. In this computation, the floor coefficients β^2_i (for example β^2_i=β^1_i) express the fact that some harmonics of the pitch frequency f_p can be masked by noise, so that there is no utility in protecting them.

This protection strategy is preferably applied for each of the frequencies closest to the harmonics of f_p, i.e. for any integer η.

If δf_p denotes the frequency resolution with which the analysis module 57 produces the estimated pitch frequency f_p, i.e. if the real pitch frequency is between f_p−δf_p/2 and f_p+δf_p/2, then the difference between the η-th harmonic of the real pitch frequency and its estimate η×f_p (condition (9)) can go up to ±η×δf_p/2. For high values of η, the difference can be greater than the spectral half-resolution Δf/2 of the Fourier transform. To take account of this uncertainty, and to guarantee good protection of the harmonics of the real pitch, each of the frequencies in the range [η×f_p−η×δf_p/2, η×f_p+η×δf_p/2] can be protected, i.e. condition (9) above can be replaced with:

∃η integer/|f−η·f_p|≦(η·δf_p+Δf)/2 (9′)

This approach (condition (9′)) is of particular benefit if the values of η can be high, especially if the process is used in a broadband system.
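Conditions (8), (9) and (9′) can be sketched as follows for one discrete frequency. The function names are hypothetical, and restricting η to values of 1 and above (so that the zero frequency is not treated as a harmonic) is an assumption of the sketch. Setting `delta_fp` to 0 gives condition (9); a positive value gives condition (9′).

```python
def is_near_harmonic(f, f_p, delta_f, delta_fp=0.0):
    """True if some integer eta >= 1 satisfies |f - eta*f_p| <= (eta*delta_fp + delta_f)/2."""
    max_eta = int(f / f_p) + 1
    return any(abs(f - eta * f_p) <= (eta * delta_fp + delta_f) / 2.0
               for eta in range(1, max_eta + 1))

def second_filter(H1, S_i, B_hat_i, alpha_i, beta2_i, f, f_p, delta_f, delta_fp=0.0):
    """Return H2 for frequency f: 1 if the harmonic is protected, H1 otherwise."""
    above_floor = S_i - alpha_i * B_hat_i > beta2_i * B_hat_i   # condition (8)
    if above_floor and is_near_harmonic(f, f_p, delta_f, delta_fp):
        return 1.0
    return H1

DELTA_F = 8000.0 / 256   # spectral resolution for F_e = 8 kHz and N = 256
```

For an assumed pitch of 200 Hz, the bin at 187.5 Hz is within Δf/2 of the fundamental and is protected, whereas the bin at 218.75 Hz is protected only once the pitch uncertainty of condition (9′) is taken into account.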

For each protected frequency, the corrected frequency response H^2_{n,f} can be equal to 1, as indicated above, which in the context of spectral subtraction corresponds to the subtraction of a zero quantity, i.e. to complete protection of the frequency in question. More generally, this corrected frequency response H^2_{n,f} could be taken as equal to a value from 1 to H^1_{n,f} according to the required degree of protection, which corresponds to subtracting a quantity less than that which would be subtracted if the frequency in question were not protected.

The spectral components S^2_{n,f} of a noise-suppressed signal are computed by a multiplier 58:

S^2_{n,f}=H^2_{n,f}·S_{n,f} (10)

This signal S^2_{n,f} is supplied to a module 60 which computes a masking curve for each frame n by applying a psychoacoustic model of how the human ear perceives sound.

The masking phenomenon is a well-known principle of the operation of the human ear. If two frequencies are present simultaneously, it is possible for one of them not to be audible. It is then said to be masked.

There are various methods of computing masking curves. The method developed by J. D. Johnston can be used, for example (“Transform Coding of Audio Signals Using Perceptual Noise Criteria”, IEEE Journal on Selected Areas in Communications, Vol. 6, No. 2, February 1988). That method operates in the Bark frequency scale. The masking curve is seen as the convolution of the spectrum spreading function of the basilar membrane in the bark domain with the exciter signal, which in the present application is the signal S^2_{n,f}. The spectrum spreading function can be modelled in the manner shown in FIG. 7. For each bark band, the contribution of the lower and higher bands convolved with the spreading function of the basilar membrane is computed from the equation:

\begin{matrix} C_{n, q} = \sum_{q^{'} = 0}^{q - 1} \frac{S_{n, q^{'}}^{2}}{{(10^{10 / 10})}^{(q - q^{'})}} + \sum_{q^{'} = q + 1}^{Q} \frac{S_{n, q^{'}}^{2}}{{(10^{25 / 10})}^{(q^{'} - q)}} & (11) \end{matrix}

in which the indices q and q′ designate the bark bands (0≦q,q′≦Q) and S^2_{n,q′} represents the average of the components S^2_{n,f} of the noise-suppressed exciter signal for the discrete frequencies f belonging to the bark band q′.

The module 60 obtains the masking threshold M_{n,q} for each bark band q from the equation:

M_{n,q}=C_{n,q}/R_q (12)

in which R_q depends on whether the signal is relatively more or relatively less voiced. As is well-known in the art, one possible form of R_q is:

10·log_{10}(R_q)=(A+q)·χ+B·(1−χ) (13)

with A=14.5 and B=5.5. χ designates a degree of voicing of the speech signal, varying from 0 (no voicing) to 1 (highly voiced signal). The parameter χ can be of the form known in the art:

\begin{matrix} χ = \min \{ \frac{SFM}{{SFM}_{\max}}, 1 \} & (12) \end{matrix}

where SFM represents the ratio in decibels between the arithmetic mean and the geometric mean of the energy of the bark bands and SFM_{max}=−60 dB.
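The masking threshold computation of equations (11) to (13) can be sketched as below. The sketch follows equation (11) as written, so a band's own component does not contribute to its C value; the function names are hypothetical, and the input list is assumed to hold one averaged component per bark band, indexed from 0 to Q.

```python
def voicing_degree(sfm_db, sfm_max_db=-60.0):
    """Degree of voicing chi = min(SFM / SFM_max, 1), both values in decibels."""
    return min(sfm_db / sfm_max_db, 1.0)

def masking_thresholds(S2_bark, chi, A=14.5, B=5.5):
    """Return the masking threshold M_q for every bark band q."""
    Q = len(S2_bark) - 1
    thresholds = []
    for q in range(Q + 1):
        # Spreading from lower bands: 10 dB attenuation per bark band.
        lower = sum(S2_bark[k] / (10 ** (10 / 10)) ** (q - k) for k in range(q))
        # Spreading from higher bands: 25 dB attenuation per bark band.
        upper = sum(S2_bark[k] / (10 ** (25 / 10)) ** (k - q)
                    for k in range(q + 1, Q + 1))
        C = lower + upper                            # equation (11)
        R_dB = (A + q) * chi + B * (1.0 - chi)       # equation (13)
        thresholds.append(C / 10 ** (R_dB / 10))     # equation (12)
    return thresholds

m = masking_thresholds([0.0, 100.0, 0.0], chi=0.0)
```

In the usage line a single active band masks its upper neighbour far more strongly than its lower one, which reflects the asymmetry of the spreading function.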

The noise suppression system further includes a module 62 which corrects the frequency response of the noise suppression filter as a function of the masking curve M_{n,q} computed by the module 60 and the overestimates \hat{B}′_{n,i} computed by the module 45. The module 62 decides which noise suppression level must really be achieved.

By comparing the envelope of the noise overestimate with the envelope formed by the masking thresholds M_{n,q}, a decision is taken to suppress noise in the signal only to the extent that the overestimate \hat{B}′_{n,i} is above the masking curve. This avoids unnecessary suppression of noise masked by speech.

The new response H^3_{n,f}, for a frequency f belonging to the band i defined by the module 12 and the bark band q, thus depends on the relative difference between the overestimate \hat{B}′_{n,i} of the corresponding spectral component of the noise and the masking curve M_{n,q} in the following manner:

\begin{matrix} H_{n, f}^{3} = 1 - (1 - H_{n, f}^{2}) \cdot \max \{ \frac{{\hat{B}}_{n, i}^{'} - M_{n, q}}{{\hat{B}}_{n, i}^{'}}, 0 \} & (14) \end{matrix}

In other words, the quantity subtracted from a spectral component S_{n,f} in the spectral subtraction process having the frequency response H^3_{n,f} is substantially equal to whichever is the lower of the quantity subtracted from this spectral component in the spectral subtraction process having the frequency response H^2_{n,f} and the fraction of the overestimate \hat{B}′_{n,i} of the corresponding spectral component of the noise which possibly exceeds the masking curve M_{n,q}.
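Equation (14) can be sketched as a one-line correction of the second filter response. The function name is hypothetical and the noise overestimate is assumed to be strictly positive.

```python
def masked_filter(H2, B_over, M):
    """Equation (14): relax the suppression where the noise overestimate is masked.

    H2     -- response of the second noise suppression filter
    B_over -- noise overestimate for the band (assumed > 0)
    M      -- masking threshold of the bark band
    """
    excess = max((B_over - M) / B_over, 0.0)   # fraction of the noise above the mask
    return 1.0 - (1.0 - H2) * excess
```

When the masking threshold is at or above the overestimate the response is 1 and nothing is subtracted; when the threshold is zero the response falls back to H2.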

FIG. 8 illustrates the principle of the correction applied by the module 62. It shows in schematic form an example of a masking curve M_{n,q} computed on the basis of the spectral components S^2_{n,f} of the noise-suppressed signal as well as the overestimate \hat{B}′_{n,i} of the noise spectrum. The quantity finally subtracted from the components S_{n,f} is that shown by the shaded areas, i.e. it is limited to the fraction of the overestimate \hat{B}′_{n,i} of the spectral components of the noise which is above the masking curve.

The subtraction is effected by multiplying the frequency response H^3_{n,f} of the noise suppression filter by the spectral components S_{n,f} of the speech signal (multiplier 64). The module 65 then reconstructs the noise-suppressed signal in the time domain by applying the inverse fast Fourier transform (IFFT) to the frequency samples S^3_{n,f} delivered by the multiplier 64. For each frame, only the first N/2=128 samples of the signal produced by the module 65 are delivered as the final noise-suppressed signal s^3, after overlap-add reconstruction with the N/2=128 last samples of the preceding frame (module 66).

FIG. 9 shows a preferred embodiment of a noise suppression system using the invention. The system includes a number of components similar to corresponding components of the system shown in FIG. 1, for which the same reference numbers are used. Accordingly, the modules 10, 11, 12, 15, 16, 45 and 55 supply in particular the quantities S_{n,i}, \hat{B}_{n,i}, α′_{n,i}, \hat{B}′_{n,i} and H^1_{n,f} used for selective noise suppression.

The frequency resolution of the fast Fourier transform 11 constitutes a limitation of the system shown in FIG. 1. The frequency protected by the module 56 is not necessarily the precise pitch frequency f_p, but the frequency closest to it in the discrete spectrum. In some cases, frequencies relatively far away from the pitch harmonics may be protected. The system shown in FIG. 9 alleviates this drawback by appropriately conditioning the speech signal.

This conditioning modifies the sampling frequency of the signal so that the period 1/f_p exactly covers an integer number of sample times of the conditioned signal.

Many methods of harmonic analysis which can be used by the module 57 are capable of supplying a fractional value of the delay T_p, expressed as a number of samples at the initial sampling frequency F_e. A new sampling frequency f_e is then chosen which is equal to an integer multiple of the estimated pitch frequency, i.e. f_e = p·f_p = p·F_e/T_p = K·F_e, where p is an integer. To avoid losing signal samples, f_e must be higher than F_e. In particular, to facilitate conditioning it is possible to impose the condition that f_e must lie in the range from F_e to 2F_e (1 ≤ K ≤ 2).

Of course, it is not necessary to condition the signal if no vocal activity is detected in the current frame (δ_n≠0) or if the delay T_p estimated by the module 57 is an integer delay.

For each pitch harmonic to correspond to an integer number of samples of the conditioned signal, the integer p must be a factor of the size N of the signal window produced by the module 10: N=αp, where α is an integer. This size N is usually a power of 2 for the implementation of the FFT. It is 256 in the example considered here.

The spectral resolution Δf of the discrete Fourier transform of the conditioned signal is given by the equation Δf = p·f_p/N = f_p/α. It is therefore beneficial to make p small, to maximise α, but large enough to perform oversampling. In the example considered here, where F_e = 8 kHz and N=256, the values chosen for the parameters p and α are indicated in table I.

TABLE I

Pitch frequency f_p	Delay T_p (samples)	p	α
500 Hz < f_p < 1000 Hz	8 < T_p < 16	p = 16	α = 16
250 Hz < f_p < 500 Hz	16 < T_p < 32	p = 32	α = 8
125 Hz < f_p < 250 Hz	32 < T_p < 64	p = 64	α = 4
62.5 Hz < f_p < 125 Hz	64 < T_p < 128	p = 128	α = 2
31.25 Hz < f_p < 62.5 Hz	128 < T_p < 256	p = 256	α = 1
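The selection summarised in table I can be sketched as follows. This is a minimal illustration only: the function name `select_conditioning` is hypothetical, and the rule "p is the smallest power of 2 not less than T_p" is an assumption read off the rows of the table, not a statement of how the module 70 is actually implemented.

```python
import math


def select_conditioning(T_p, F_e=8000.0, N=256):
    """Return (p, alpha, K, f_e, delta_f) for a delay T_p given in samples.

    Hypothetical helper: the power-of-two rule below reproduces the rows of
    table I for F_e = 8 kHz and N = 256, but it is an assumption.
    """
    if not (8 < T_p <= N):
        raise ValueError("T_p lies outside the range covered by table I")
    # Smallest power of two >= T_p, which keeps K = p / T_p between 1 and 2.
    p = 2 ** math.ceil(math.log2(T_p))
    alpha = N // p          # N = alpha * p
    K = p / T_p             # ratio between the two sampling frequencies
    f_e = K * F_e           # new sampling frequency, f_e = p * F_e / T_p
    delta_f = f_e / N       # spectral resolution, equal to f_p / alpha
    return p, alpha, K, f_e, delta_f


if __name__ == "__main__":
    # A fractional delay of 36.5 samples falls in the third row of table I.
    print(select_conditioning(36.5))
```

With these assumptions, a delay of 36.5 samples gives p = 64 and α = 4, and the resolution Δf equals one quarter of the pitch frequency F_e/T_p, so every pitch harmonic falls on a bin of the discrete spectrum.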

The choice is made by a module 70 according to the value of the delay T_p supplied by the harmonic analysis module 57. The module 70 supplies the ratio K between the sampling frequencies to three frequency changer modules 71, 72, 73.

The module 71 transforms the values S_{n,i}, \hat{B}_{n,i}, α′_{n,i}, \hat{B}′_{n,i} and H_{n,f}^{1} relating to the bands i defined by the module 12 into the modified frequency scale (sampling frequency f_e). This transformation merely expands the bands i by the factor K. The transformed values are supplied to the harmonic protection module 56.

The latter module then operates as before to supply the frequency response H_{n,f}^{2} of the noise suppression filter. This response H_{n,f}^{2} is obtained in the same manner as in FIG. 1 (conditions (8) and (9)), except that, in condition (9), the pitch frequency f_p = f_e/p is defined according to the value of the integer delay p supplied by the module 70, the module 70 also supplying the frequency resolution Δf.

The module 72 oversamples the frame of N samples supplied by the windowing module 10. Oversampling by a rational factor K (K = K_1/K_2) consists in first oversampling by the integer factor K_1 and then undersampling by the integer factor K_2. This oversampling and undersampling by integer factors can be effected in the conventional way by means of banks of polyphase filters.
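As an illustration of the decomposition K = K_1/K_2, the sketch below reduces the ratio p/T_p to a pair of integers using Python's standard `fractions` module. The function name `rational_ratio` and the bound `max_denominator` are hypothetical; a practical system would choose the bound according to the precision of the delay estimate.

```python
from fractions import Fraction


def rational_ratio(p, T_p, max_denominator=1000):
    """Return integers (K1, K2) such that K1 / K2 approximates p / T_p.

    Hypothetical helper: the cap on the denominator is an assumption made
    to keep the integer factors of a practical size.
    """
    K = (Fraction(p) / Fraction(T_p)).limit_denominator(max_denominator)
    return K.numerator, K.denominator


if __name__ == "__main__":
    # p = 64 and T_p = 36.5 samples: oversample by 128, then undersample by 73.
    print(rational_ratio(64, 36.5))
```

The resulting pair could then be passed to any polyphase resampler, for example `scipy.signal.resample_poly(x, up, down)`; that library is mentioned only as one possible tool and is not part of the system described.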

The conditioned signal frame s′ supplied by the module 72 includes KN samples at the frequency f_e. The samples are sent to a module 75 which computes their Fourier transform. The transformation can be effected on the basis of two blocks of N=256 samples: one constituted by the first N samples of the frame of length KN of the conditioned signal s′ and the other of the last N samples of that frame. The two blocks therefore have an overlap of (2−K)×100%. For each of the two blocks, a set of Fourier components S_{n,f} is obtained. The components S_{n,f} are supplied to the multiplier 58, which multiplies them by the spectral response H_{n,f}^{2} to deliver the spectral components S_{n,f}^{2} of the first noise-suppressed signal.
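The division of the conditioned frame into two blocks can be sketched as below. The function name `split_into_blocks` is hypothetical, and the frame is assumed to be a plain sequence whose length is the integer nearest to KN.

```python
def split_into_blocks(frame, N=256):
    """Split a conditioned frame of about K*N samples into two blocks of N.

    Returns (first_block, last_block, overlap), where overlap is the number
    of samples shared by the two blocks, i.e. (2 - K) * N.
    """
    if not (N <= len(frame) <= 2 * N):
        raise ValueError("frame length must lie between N and 2N (1 <= K <= 2)")
    first_block = frame[:N]      # first N samples of the frame
    last_block = frame[-N:]      # last N samples of the frame
    overlap = 2 * N - len(frame)
    return first_block, last_block, overlap


if __name__ == "__main__":
    # K = 1.5 gives a frame of 384 samples and an overlap of 128 samples (50%).
    first, last, overlap = split_into_blocks(list(range(384)))
    print(overlap, first[-overlap:] == last[:overlap])
```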

The components S_{n,f}^{2} are sent to the module 60 which computes the masking curves in the manner previously indicated.

When computing the masking curves, the magnitude χ designating the degree of voicing of the speech signal (equation (13)) is preferably taken in the form χ=1−H, where H is an entropy of the autocorrelation of the spectral components S_{n,f}^{2} of the noise-suppressed conditioned signal. The autocorrelations A(k) are computed by a module 76, for example using the equation:

\begin{matrix} A (k) = \frac{\sum_{f = 0}^{N / 2 - 1} S_{n, f}^{2} \cdot S_{n, f + k}^{2}}{\sum_{f = 0}^{N / 2 - 1} \sum_{f^{'} = 0}^{N / 2 - 1} S_{n, f}^{2} \cdot S_{n, f + f^{'}}^{2}} & (15) \end{matrix}

A module 77 then computes the normalised entropy H and supplies it to the module 60 for computing the masking curve (see S. A. McClellan et al.: “Spectral Entropy: an Alternative Indicator for Rate Allocation?”, Proc. ICASSP'94, pages 201-204):

\begin{matrix} H = \frac{\sum_{k = 0}^{N / 2 - 1} A (k) \cdot \log [A (k)]}{\log (N / 2)} & (16) \end{matrix}

Because of the conditioning of the signal, and its noise suppression by the filter H_{n,f}^{2}, the normalised entropy H constitutes a measurement of voicing that is very robust to noise and to pitch variations.
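The voicing measure χ=1−H can be illustrated by the following sketch of equations (15) and (16). Several points are assumptions rather than part of the description: the input is taken to be a list of non-negative magnitudes of the N/2 components S_{n,f}^{2}; indices f+k beyond the last component are wrapped around circularly; and the entropy carries the conventional leading minus sign, which equation (16) as printed does not show, so that H lies between 0 and 1. The function names are hypothetical.

```python
import math


def normalised_autocorrelation(S):
    """Equation (15): autocorrelation of the spectrum, normalised to sum to 1.

    Assumption: the index f + k wraps around circularly.
    """
    n = len(S)
    raw = [sum(S[f] * S[(f + k) % n] for f in range(n)) for k in range(n)]
    total = sum(raw)
    if total == 0:
        raise ValueError("spectrum is identically zero")
    return [r / total for r in raw]


def normalised_entropy(A):
    """Equation (16), with a leading minus sign added (an assumption)."""
    return -sum(a * math.log(a) for a in A if a > 0) / math.log(len(A))


def degree_of_voicing(S):
    """chi = 1 - H, as used when computing the masking curves."""
    return 1.0 - normalised_entropy(normalised_autocorrelation(S))


if __name__ == "__main__":
    print(degree_of_voicing([1.0] * 8))                  # flat spectrum
    print(degree_of_voicing([0.0, 0.0, 1.0] + [0.0] * 5))  # single line
```

Under these assumptions a flat spectrum gives H close to 1 (χ close to 0), while a spectrum reduced to a single line gives H = 0 (χ = 1).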

The correction module 62 operates in the same manner as that of the system shown in FIG. 1, allowing for the overestimated noise \hat{B}′_{n,i} rescaled by the frequency changer module 71. It supplies the frequency response H_{n,f}^{3} of the final noise suppression filter, which is multiplied by the spectral components S_{n,f} of the conditioned signal by the multiplier 64. The resulting components S_{n,f}^{3} are processed back to the time domain by the IFFT module 65. A module 80 at the output of the IFFT module 65 combines, for each frame, the two signal blocks resulting from the processing of the two overlapping blocks supplied by the FFT 75. This combination can consist of a Hamming weighted sum of the samples to form a noise-suppressed conditioned signal frame of KN samples.

The module 73 changes the sampling frequency of the noise-suppressed conditioned signal supplied by the module 80. The sampling frequency is returned to F_e = f_e/K by operations which are the inverse of those effected by the oversampling module 72. The module 73 delivers N=256 samples per frame. After overlap-add reconstruction using the last N/2=128 samples of the preceding frame, only the first N/2=128 samples of the current frame are finally retained to form the final noise-suppressed signal s^{3} (module 66).

In a preferred embodiment, a module 82 manages the windows formed by the module 10 and saved by the module 66, to retain a number M of samples equal to an integer multiple of T_p = F_e/f_p. This avoids problems of phase discontinuity between frames. In a corresponding manner, the management module 82 controls the windowing module 10 so that the overlap between the current frame and the next corresponds to N−M. This overlap of N−M samples is taken into account in the overlap-add operation effected by the module 66 when processing the next frame. From the value of T_p supplied by the harmonic analysis module 57, the module 82 computes the number of samples to be retained M = T_p×E[N/(2T_p)], E[ ] designating the integer part, and controls the modules 10 and 66 accordingly.
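The computation performed by the window management module can be sketched as follows. The function name `retained_samples` is hypothetical, and the example uses an integer delay; how a fractional product is rounded is not specified in the description and is left out of the sketch.

```python
import math


def retained_samples(T_p, N=256):
    """Return (M, overlap) with M = T_p * E[N / (2 * T_p)] and overlap = N - M."""
    M = T_p * math.floor(N / (2 * T_p))  # whole number of pitch periods within N/2
    return M, N - M


if __name__ == "__main__":
    # T_p = 40 samples: three whole periods fit in N/2 = 128 samples.
    print(retained_samples(40))
```

For T_p = 40 and N = 256 this gives M = 120 retained samples and an overlap of N−M = 136 samples with the next frame.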

In the embodiment just described, the pitch frequency is estimated as an average over the frame. The pitch can vary slightly over this duration. It is possible to allow for these variations in the context of the present invention by conditioning the signal to obtain a constant pitch in the frame by artificial means.

This requires the harmonic analysis module 57 to supply the time intervals between consecutive breaks of the speech signal which can be attributed to glottal closures of the speaker occurring during the frame. Methods which can be used to detect such micro-breaks are well known in the art of harmonic analysis of speech signals. In this connection, reference may be made to the following articles: M. BASSEVILLE et al., “Sequential detection of abrupt changes in spectral characteristics of digital signals”, IEEE Trans. on Information Theory, 1983, Vol. IT-29, No. 5, pages 708-723; R. ANDRE-OBRECHT, “A new statistical approach for the automatic segmentation of continuous speech signals”, IEEE Trans. on Acous., Speech and Sig. Proc., Vol. 36, No. 1, January 1988; and C. MURGIA et al., “An algorithm for the estimation of glottal closure instants using the sequential detection of abrupt changes in speech signals”, Signal Processing VII, 1994, pages 1685-1688.

The principle of the above methods is to effect a statistical test between a short-term model and a long-term model. Both models are adaptive linear prediction models. The value of the statistical test w_m is the cumulative sum of the a posteriori likelihood ratio of two distributions, corrected by the Kullback divergence. For a distribution of residues having a Gaussian statistic, the value w_m is given by:

\begin{matrix} w_{m} = \frac{1}{2} \left[ \frac{2 \cdot e_{m}^{0} \cdot e_{m}^{1}}{\sigma_{1}^{2}} - \left(1 + \frac{\sigma_{0}^{2}}{\sigma_{1}^{2}}\right) \cdot \frac{{(e_{m}^{0})}^{2}}{\sigma_{0}^{2}} + \left(1 - \frac{\sigma_{0}^{2}}{\sigma_{1}^{2}}\right) \right] & (17) \end{matrix}

where e_{m}^{0} and σ_{0}^{2} represent the residue computed at the time of sample m of the frame and the variance of the long-term model, e_{m}^{1} and σ_{1}^{2} likewise representing the residue and the variance of the short-term model. The closer the two models, the closer the statistical test value w_m to 0. In contrast, if the two models are far away from each other, the value w_m becomes negative, which denotes a break R in the signal.
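Equation (17) can be evaluated directly, as in the sketch below. The function name `break_statistic` and its argument names are hypothetical; the residues and variances are assumed to be supplied by the two adaptive linear prediction models, which are not reproduced here.

```python
def break_statistic(e0, e1, var0, var1):
    """Evaluate equation (17) for one sample m.

    e0, var0: residue and variance of the long-term model.
    e1, var1: residue and variance of the short-term model.
    """
    ratio = var0 / var1
    return 0.5 * (
        2.0 * e0 * e1 / var1
        - (1.0 + ratio) * e0 * e0 / var0
        + (1.0 - ratio)
    )


if __name__ == "__main__":
    # Identical models: the statistic vanishes.
    print(break_statistic(0.3, 0.3, 2.0, 2.0))
    # Long-term residue much larger than its variance suggests: negative value.
    print(break_statistic(3.0, 0.1, 1.0, 1.0))
```

When both models agree (equal residues and variances) the three terms cancel and the value is 0, in line with the behaviour described above; a large long-term residue drives the value negative.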

Thus FIG. 10 shows one possible example of the evolution of the value w_m, showing the breaks R in the speech signal. The time intervals t_r (r=1,2, etc.) between two consecutive breaks R are computed and expressed as a number of samples of the speech signal. Each interval t_r is inversely proportional to the pitch frequency f_p which is thus estimated locally: f_p = F_e/t_r over the r-th interval.
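The local estimate f_p = F_e/t_r can be sketched as follows, assuming that the positions of the breaks R are already available as sample indices within the frame. The function name `local_pitch` and the example positions are hypothetical.

```python
def local_pitch(break_positions, F_e=8000.0):
    """Return the intervals t_r (in samples) and the local pitch F_e / t_r."""
    intervals = [b - a for a, b in zip(break_positions, break_positions[1:])]
    pitches = [F_e / t for t in intervals]
    return intervals, pitches


if __name__ == "__main__":
    # Hypothetical break positions within one frame.
    print(local_pitch([10, 50, 92, 130]))
```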

The time variations of the pitch (i.e. the fact that the intervals t_r are not all equal over a given frame) can then be corrected to obtain a constant pitch frequency in each of the analysis frames. This correction is effected by modifying the sampling frequency over each interval t_r to obtain constant intervals between two glottal closures after oversampling. Thus the duration between two breaks is modified by oversampling with a variable ratio, so as to lock onto the greatest interval. Also, the conditioning constraint, whereby the oversampling frequency is a multiple of the estimated pitch frequency, is complied with.

FIG. 11 shows the means employed to perform the conditioning of the signal in the latter case. The harmonic analysis module 57 uses the above analysis method and supplies the intervals t_r relating to the signal frame produced by the module 10. For each of these intervals, the module 70 (block 90 in FIG. 11) computes the oversampling ratio K_r = p_r/t_r, where the integer p_r is given by the third column of table I if t_r takes the values indicated in the second column. These oversampling ratios K_r are supplied to the frequency changer modules 72 and 73 so that the interpolations are effected with the sampling ratio K_r over the corresponding time interval t_r.
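The per-interval ratios K_r = p_r/t_r can be sketched as below. As in table I, p_r is taken here to be the smallest power of 2 not less than t_r; that rule and the function name `interval_ratios` are assumptions made for illustration.

```python
import math


def interval_ratios(intervals):
    """Return a list of (p_r, K_r) pairs, one per interval t_r in samples.

    Assumption: p_r is the smallest power of two >= t_r, as in table I.
    """
    result = []
    for t_r in intervals:
        p_r = 2 ** math.ceil(math.log2(t_r))
        result.append((p_r, p_r / t_r))
    return result


if __name__ == "__main__":
    # Three slightly different intervals, all in the same row of table I.
    for t_r, (p_r, K_r) in zip([40, 42, 38], interval_ratios([40, 42, 38])):
        print(t_r, p_r, K_r, K_r * t_r)
```

When all the intervals of a frame fall in the same row of table I, each interval is stretched to the same number p_r of samples, which is what makes the pitch constant over the conditioned frame.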

The greatest time interval T_p of the time intervals t_r supplied by the module 57 for a frame is selected by the module 70 (block 91 in FIG. 11) to obtain a pair p,α as indicated in table I. The modified sampling frequency is then f_e = p·F_e/T_p as previously, the spectral resolution Δf of the discrete Fourier transform of the conditioned signal still being given by Δf = F_e/(α·T_p). For the frequency changer module 71, the oversampling ratio K is given by K = p/T_p (block 92). The module 56 for protecting the pitch harmonics operates in the same manner as before, using for condition (9) the spectral resolution Δf supplied by the block 91 and the pitch frequency f_p = f_e/p defined according to the value of the integer delay p supplied by the block 91.

This embodiment of the invention also implies adaptation of the window management module 82. The number M of samples of the noise-suppressed signal to be retained over the current frame here corresponds to an integer number of consecutive time intervals t_r between two glottal closures (see FIG. 10). This avoids the problems of phase discontinuity between frames, whilst allowing for possible variations of the time intervals t_r over a frame.

Claims

1. Method of conditioning a digital speech signal processed by successive frames, comprising a harmonic analysis of the speech signal to estimate a pitch frequency of the speech signal over each frame in which the speech signal features vocal activity, and, after estimating the pitch frequency of the speech signal over one frame, conditioning the speech signal of said one frame by oversampling the speech signal in the time domain at an oversampling frequency which is an integer multiple of the estimated pitch frequency.
2. Method according to claim 1, wherein spectral components of the speech signal are computed by distributing the conditioned signal into blocks of N samples transformed into the frequency domain, N being a predetermined integer, and wherein the ratio between the oversampling frequency and the estimated pitch frequency is a factor of the number N.
3. Method according to claim 2, wherein the number N is a power of 2.
4. Method according to claim 2, wherein a degree of voicing of the speech signal is estimated over the frame from an entropy of an autocorrelation of spectral components computed on the basis of the conditioned signal.
5. Method according to claim 4, wherein the degree of voicing is measured on the basis of a normalised entropy H of the form: H = \frac{\sum_{k = 0}^{N / 2 - 1} A (k) \cdot \log [A (k)]}{\log (N / 2)}, where A(k) is the normalised autocorrelation defined by: A (k) = \frac{\sum_{f = 0}^{N / 2 - 1} S_{n, f}^{2} \cdot S_{n, f + k}^{2}}{\sum_{f = 0}^{N / 2 - 1} \sum_{f^{'} = 0}^{N / 2 - 1} S_{n, f}^{2} \cdot S_{n, f + f^{'}}^{2}}, S_{n, f}^{2} designating said spectral component of rank f computed on the basis of the oversampled signal.
6. Method according to claim 1, wherein, after processing each conditioned signal frame, a number of signal samples supplied by such processing is retained which is equal to an integer multiple of the ratio between an initial sampling frequency and the estimated pitch frequency.
7. Method according to claim 1, wherein the estimation of the pitch frequency of the speech signal over a frame includes the steps of: estimating time intervals between two consecutive breaks of the signal which can be attributed to glottal closures of a speaker occurring during the frame, the estimated pitch frequency being inversely proportional to said time intervals; interpolating the speech signal in said time intervals, so that the conditioned signal resulting from such interpolation has a constant time interval between two consecutive breaks.
8. Method according to claim 7, wherein, after processing each frame, a number of samples of the speech signal supplied by such processing is retained which corresponds to an integer number of estimated time intervals.
9. Device for conditioning a digital speech signal processed by successive frames, comprising harmonic analysis means to estimate a pitch frequency of the speech signal over each frame in which the speech signal features vocal activity, and conditioning means for conditioning the speech signal of said frame by oversampling the speech signal in the time domain at an oversampling frequency which is an integer multiple of the estimated pitch frequency.
10. Device according to claim 9, comprising means for distributing the conditioned signal into blocks of N samples, N being a predetermined integer, and means for computing spectral components of the speech signal by transforming said blocks into the frequency domain, and wherein the ratio between the oversampling frequency and the estimated pitch frequency is a factor of the number N.
11. Device according to claim 10, wherein the number N is a power of 2.
12. Device according to claim 10, further comprising means for estimating a degree of voicing of the speech signal over each frame from an entropy of an autocorrelation of spectral components computed on the basis of the conditioned signal.
13. Device according to claim 12, wherein the degree of voicing is measured on the basis of a normalised entropy H of the form: H = \frac{\sum_{k = 0}^{N / 2 - 1} A (k) \cdot \log [A (k)]}{\log (N / 2)}, where A(k) is the normalised autocorrelation defined by: A (k) = \frac{\sum_{f = 0}^{N / 2 - 1} S_{n, f}^{2} \cdot S_{n, f + k}^{2}}{\sum_{f = 0}^{N / 2 - 1} \sum_{f^{'} = 0}^{N / 2 - 1} S_{n, f}^{2} \cdot S_{n, f + f^{'}}^{2}}, S_{n, f}^{2} designating said spectral component of rank f computed on the basis of the oversampled signal.
14. Device according to claim 9, wherein, after processing each conditioned signal frame, a number of signal samples supplied by such processing is retained which is equal to an integer multiple of the ratio between an initial sampling frequency and the estimated pitch frequency.
15. Device according to claim 9, wherein the harmonic analysis means include: means for estimating time intervals between two consecutive breaks of the signal which can be attributed to glottal closures of a speaker occurring during a frame, the estimated pitch frequency being inversely proportional to said time intervals; means for interpolating the speech signal in said time intervals, so that the conditioned signal resulting from such interpolation has a constant time interval between two consecutive breaks.
16. Device according to claim 15, wherein, after processing each frame, a number of samples of the speech signal supplied by such processing is retained which corresponds to an integer number of estimated time intervals.

Priority Claims (1)

Number	Date	Country	Kind
97 11641	Sep 1997	FR

PCT Information

Filing Document	Filing Date	Country	Kind
PCT/FR98/01978		WO	00

Publishing Document	Publishing Date	Country	Kind
WO99/14744	3/25/1999	WO	A

US Referenced Citations (15)

Number	Name	Date	Kind
5073938	Galand	Dec 1991	A
5226084	Hardwick et al.	Jul 1993	A
5228088	Kane et al.	Jul 1993	A
5384891	Asakawa et al.	Jan 1995	A
5400434	Pearson	Mar 1995	A
5401897	Depalle et al.	Mar 1995	A
5469087	Eatwell	Nov 1995	A
5555190	Derby et al.	Sep 1996	A
5641927	Pawate et al.	Jun 1997	A
5787398	Lowry	Jul 1998	A
5832437	Nishiguchi et al.	Nov 1998	A
5987413	Dutoit et al.	Nov 1999	A
6064955	Huang et al.	May 2000	A
6115684	Kawahara et al.	Sep 2000	A
6475245	Gersho et al.	Nov 2002	B2

Foreign Referenced Citations (1)

Number	Date	Country
0 438 174	Jul 1991	EP

Non-Patent Literature Citations (6)

Entry
McClellan et al., “Variable-rate CELP based on subband flatness,” IEEE Transactions on Speech and Audio Processing, vol. 5, No. 2, Mar. 1997, pp. 120 to 130.*
McClellan et al., “Spectral entropy: an alternative indicator for rate allocation?” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, Apr. 1994, pp. 1-201 to 1-204.*
C. Murgia et al., “An Algorithm for the Estimation of Glottal Closure Instants Using the Sequential Detection of Abrupt Changes in Speech Signals,” Proceedings of Eusipco-94, 7th European Signal Processing Conference, Edinburgh, vol. 3, Sep. 1994, pp. 1685-1688.
R. Le Bouquin et al., “Enhancement of Noisy Speech Signals: Application to Mobile Radio Communications,” Speech Communication, Jan. 1996, vol. 18, No. 1, pp. 3-19.
S. Nandkumar et al., “Speech Enhancement Based on a New Set of Auditory Constrained Parameters,” Proceedings of the International Conference on Acoustics, Speech, Signal Processing, ICASSP 1994, Apr. 1994, vol. 1, pp. 1-4.
P. Lockwood et al., “Experiments With a Nonlinear Spectral Subtractor (NSS), Hidden Markov Models and the Projection for Robust Speech Recognition in Cars,” Speech Communication, Jun. 1992, vol. 11, No. 2/3, pp. 215-228.
