The invention relates to a smoothing method for suppressing fluctuating artifacts during noise reduction.
In digital voice signal transmission, noise suppression is an important aspect. The audio signals captured by means of a microphone and then digitized contain not only the user signal (
For noise reduction, the spectral representation is an advantageous representation of the signal. In this case, the signal is represented broken down into frequencies. One practical implementation of the spectral representation is short-term spectra, which are produced by dividing the signal into short frames (
An advantage of the spectral representation is that the fundamental voice energy is present in a concentration in a relatively small number of frequency bins (
For the noise reduction, a spectral weighting function is estimated which can be calculated on the basis of different optimization criteria. It provides low values or zero in frequency bins in which there is primarily interference, and values close or equal to one for bins in which voice energy is dominant (
Multiplying the weighting function by the short-term spectrum of the noisy signal produces the filtered spectrum, in which the amplitudes of the frequency bins in which interference is dominant are greatly reduced, while voice components remain almost without influence (
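The multiplication described above can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `apply_weighting`, the toy values and the clipping of the weighting function to the range 0 to 1 are assumptions made for the example and are not taken from the text.

```python
import numpy as np

def apply_weighting(noisy_spectrum, weighting):
    """Multiply a complex short-term spectrum by a real spectral weighting function."""
    # Assumption: the weighting values are limited to the range [0, 1].
    weighting = np.clip(np.asarray(weighting, dtype=float), 0.0, 1.0)
    return weighting * np.asarray(noisy_spectrum, dtype=complex)

# Toy frame: bins 1 and 3 are dominated by voice, bins 0 and 2 by interference.
noisy_spectrum = np.array([0.2 + 0.1j, 4.0 - 1.0j, 0.3j, 2.0 + 2.0j])
weighting = np.array([0.0, 1.0, 0.1, 0.9])
filtered_spectrum = apply_weighting(noisy_spectrum, weighting)
```

Bins with a weighting value of one pass unchanged, whereas bins with a weighting value near zero are strongly attenuated.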
Estimation errors when calculating the spectral weighting function, what are known as fluctuations, occasionally result in excessive weighting values for frequency bins which contain primarily interference (
To suppress fluctuations in the weighting function or in spectral intermediate magnitudes, or to suppress outliers in the filtered spectrum, these spectral magnitudes can be smoothed by an averaging method and thereby rid of excess values. In this case, the spectral values of a plurality of spectrally adjacent or chronologically successive frequency bins are combined to form an average, so that the amplitude of individual outliers is reduced. Smoothing is known over frequency [1: Tim Fingscheidt, Christophe Beaugeant and Suhadi Suhadi. Overcoming the statistical independence assumption w.r.t. frequency in speech enhancement. Proceedings, IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP), 1:1081-1084, 2005], over time [2: Harald Gustafsson, Sven Erik Nordholm and Ingvar Claesson. Spectral subtraction using reduced delay convolution and adaptive averaging. IEEE Transactions on Speech and Audio Processing, 9(8): 799-807, November 2001] or as a combination of temporal and spectral averaging [3: Zenton Goh, Kah-Chye Tan and B.T.G. Tan. Postprocessing method for suppressing musical noise generated by spectral subtraction. IEEE Transactions on Speech and Audio Processing, 6(3):287-292, May 1998]. A drawback of smoothing over frequency is that combining a plurality of frequency bins reduces the spectral resolution, that is to say that it becomes more difficult to distinguish between voice bins and noise bins. Temporal smoothing by combining successive values of a bin reduces the temporal dynamics of the spectral values, that is to say their capability of following rapid changes in the voice over time. The result is distortion of the voice signal (clipping). In addition, an irritating residual noise correlated to the voice signal can become audible (noise shaping). These smoothing methods in the spectral domain therefore need to be adapted to suit the voice signal, generally in a complex fashion.
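The loss of spectral resolution caused by smoothing over frequency can be illustrated with a simple moving average. This is only a sketch assuming NumPy; the three-bin window, the function name `smooth_over_frequency` and the toy weighting values are assumptions and do not reproduce any of the cited methods.

```python
import numpy as np

def smooth_over_frequency(weighting, width=3):
    """Moving average over `width` spectrally adjacent bins (zero-padded at the edges)."""
    kernel = np.ones(width) / width
    return np.convolve(weighting, kernel, mode="same")

# Bin 2 is an isolated outlier; bins 5 to 7 form a voice band.
weighting = np.array([0.0, 0.0, 0.9, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0])
smoothed = smooth_over_frequency(weighting)
```

The outlier in bin 2 is reduced, but the edges of the voice band are attenuated as well and weight leaks into the neighboring noise bins 4 and 8, which is the reduced spectral resolution described above.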
A further known form of smoothing individual short-term spectra over frequency is a method known as “liftering” [4: Andrzej Czyzewski. Multitask noisy speech enhancement system. http://sound.eti.pg.gda.pl/denoise/main.html, 2004], [5: Francois Thibault. High-level control of singing voice timbre transformations. http://www.music.mcgill.ca/thibault/Thesis/-node43.html, 2004]. In this case, the short-term spectrum of a frame λ is first of all transformed into what is known as the cepstral domain. The cepstral representation of the spectral amplitudes G_μ(λ) is calculated as
$$G^{\mathrm{cepst}}_{\mu'}(\lambda) = \mathrm{IDFT}\{\log(G_{\mu}(\lambda))\}, \quad \mu = 0, \ldots, M-1 \qquad (1)$$
where IDFT{·} corresponds to the inverse discrete Fourier transformation of a series of values of length M. This transformation results in M transformation coefficients
$$G^{\mathrm{cepst}}_{\mu'}(\lambda), \quad \mu' = 0, \ldots, M-1,$$
what are known as the cepstral bins with index μ′.
According to equation (1), the cepstrum basically comprises a nonlinear map, namely the logarithmization, of a spectral magnitude available as an absolute value and of a subsequent transformation of this logarithmized absolute value spectrum with a transformation. The advantage of cepstral representation of the amplitudes (
A smoothed short-term spectrum can be calculated by setting cepstral bins with relatively small absolute values to zero and then transforming back the altered cepstrum to a short-term spectrum again. However, since severe fluctuations or outliers result in correspondingly high amplitudes in the cepstrum, these artifacts cannot be detected and suppressed by these methods.
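The liftering procedure described above can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `lifter`, the threshold parameter and the small floor that avoids taking the logarithm of zero are assumptions and are not taken from the cited references.

```python
import numpy as np

def lifter(amplitudes, threshold, floor=1e-12):
    """Set cepstral bins with small absolute values to zero and transform back."""
    # Cepstrum: inverse DFT of the logarithmized amplitude spectrum.
    cepstrum = np.fft.ifft(np.log(np.maximum(amplitudes, floor))).real
    cepstrum[np.abs(cepstrum) < threshold] = 0.0
    # Back to a short-term spectrum: DFT followed by the exponential function.
    return np.exp(np.fft.fft(cepstrum).real)

rng = np.random.default_rng(0)
frame = rng.standard_normal(64)
amplitudes = np.abs(np.fft.fft(frame))   # symmetric amplitude spectrum of a real frame
smoothed = lifter(amplitudes, threshold=0.05)
```

Because the decision is made on the absolute value alone, a strong outlier that produces large cepstral amplitudes passes through this kind of thresholding unchanged.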
As an alternative to liftering, there is also the method according to [6: Petre Stoica and Niclas Sandgren. Smoothed nonparametric spectral estimation via cepstrum thresholding. IEEE Signal Processing Magazine, pages 34-45, November 2006]. In this case, cepstral bins selected on the basis of a criterion are not set to zero, but rather are set to a value which is optimum for estimating long-term spectra for steady signals from short-term spectra. This form of estimation of signal spectra does not generally provide any advantages for highly transient signals such as voice.
Against this background, the invention is based on the object of demonstrating, for the noise reduction, a smoothing method for suppressing fluctuations in the weighting function or in spectral intermediate magnitudes or outliers in filtered short-term spectra which neither reduces the frequency resolution of the short-term spectra nor adversely affects the temporal dynamics of the voice signal.
This object is achieved by means of a smoothing method having the measures of patent claim 1. Advantageous developments are the subject matter of the subclaims.
The smoothing method according to the invention comprises the following steps:
The smoothing method according to the invention uses a transformation such as the cepstrum in order to describe the fundamental structure of a broadband voice signal with as few transformation coefficients as possible. Unlike in known methods, however, the transformation coefficients are not set to zero independently of one another if they are below a threshold value. Instead, the values of transformation coefficients from at least two successive frames are combined by smoothing over time. In this case, the degree of smoothing is made dependent on the extent to which the spectral structure represented by the coefficient is crucial to describing the user signal. By way of example, the degree of temporal smoothing of a coefficient is therefore dependent on whether the transformation coefficient contains a large amount of voice energy or little. This is easier to determine in the cepstrum or similar transformations than in the short-term spectrum. By way of example, it may thus be assumed that the first four cepstral coefficients with indices μ′=0 . . . 3 and additionally the coefficient with the maximum absolute value among the indices μ′ greater than 16 and less than 160 at fs=8000 Hz (pitch) represent voice. Coefficients with a large amount of voice information are smoothed only to the extent that their temporal dynamics do not become less than in the case of a noiseless voice signal. If appropriate, these coefficients are not smoothed at all. Voice distortions are prevented in this way. Since spectral fluctuations and outliers represent a short-term change in the fine structure of a short-term spectrum, they are mapped in the transformed short-term spectrum as a short-term change in those transformation coefficients which represent the fine structure of the short-term spectrum. Since these transformation coefficients have a relatively low rate of change over time in the case of noiseless voice, these very coefficients can be smoothed much more heavily.
Heavier temporal smoothing therefore counteracts the formation of outliers without influencing the structure of the voice. The smoothing method therefore does not result in decreased spectral resolution for voice sounds. The change in the fine structure of the short-term spectrum in the case of successive frames is delayed such that only narrowband spectral changes with time constants below those of noiseless voice are prevented.
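From the smoothed magnitude, denoted as
$$G^{\mathrm{cepst}}_{\mu',\mathrm{smooth}}(\lambda),$$
it is possible to obtain a spectral representation of the smoothed short-term spectrum again by backward transformation. For a cepstral representation, as described in (1), one possible backward transformation is as follows:
$$G_{\mu,\mathrm{smooth}}(\lambda) = \exp\left(\mathrm{DFT}\{G^{\mathrm{cepst}}_{\mu',\mathrm{smooth}}(\lambda)\}\right), \quad \mu' = 0, \ldots, M-1 \qquad (2)$$
where DFT{·} corresponds to the discrete Fourier transformation and exp(·) corresponds to the exponential function, which is applied element by element in (2).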
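The forward transformation (1) and the backward transformation (2) can be sketched as a pair of functions. This is a minimal illustration assuming NumPy's FFT routines; the function names, the frame length M=256 and the small floor that avoids taking the logarithm of zero are assumptions made for the example.

```python
import numpy as np

def forward_transform(amplitudes, floor=1e-12):
    """Cepstral coefficients: inverse DFT of the logarithmized amplitudes, as in (1)."""
    log_spectrum = np.log(np.maximum(amplitudes, floor))
    # For the symmetric amplitude spectrum of a real frame the result is real.
    return np.fft.ifft(log_spectrum).real

def backward_transform(cepstral_coefficients):
    """Amplitude spectrum: element-wise exponential of the DFT, as in (2)."""
    return np.exp(np.fft.fft(cepstral_coefficients).real)

M = 256
rng = np.random.default_rng(1)
frame = rng.standard_normal(M)
amplitudes = np.abs(np.fft.fft(frame))   # symmetric amplitude spectrum of a real frame
coefficients = forward_transform(amplitudes)
restored = backward_transform(coefficients)
```

If the coefficients are passed through unaltered, `restored` matches `amplitudes` up to rounding; any smoothing is applied to `coefficients` between the two calls.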
The advantages which result from the inventive smoothing of short-term spectra are as follows:
It is important to note that the inverse DFT used for the cepstrum in (1) and the DFT for the backward transformation in (2) can be replaced by other transformations without thereby losing the basic properties of the transformation coefficients with regard to the compact representation of voice. The same situation applies to the logarithmization in (1) and the corresponding reversal function in (2), the exponential function. In these cases too, other nonlinear maps and also linear maps are conceivable.
Transformations differ in the base functions they use. Transformation means that the signal is correlated with the various base functions. The resulting degree of correlation between the signal and a base function is then the associated transformation coefficient. A transformation produces as many transformation coefficients as there are base functions. Their number is denoted by M here. Transformations which are important for the invention are those whose base functions break down the short-term spectrum to be transformed into its coarse structure and its fine structure.
A distinguishing feature of transformations is orthogonality. Orthogonal transformation bases contain only base functions which are uncorrelated with one another. If the signal is identical to one of the base functions, orthogonal transformations result in transformation coefficients with the value zero, apart from the coefficient whose base function is identical to the signal. The selectivity of an orthogonal transformation is accordingly high. Nonorthogonal transformations use function bases which are correlated with one another.
A further feature is that the base functions for the application under consideration are discrete and finite, since the processed signal frames are discrete signals with the length of a frame.
An important feature of a transformation is invertibility. If there is an inverse transformation for a transformation (forward transformation), transforming a signal into transformation coefficients and subsequently subjecting these coefficients to the inverse transformation (backward transformation) produces the initial signal again, provided the transformation coefficients have not been altered.
In the signal processing as described here, the Discrete Fourier Transformation (DFT) is a preferred transformation. An associated important algorithm in discrete signal processing is the “Fast Fourier Transformation” (FFT). In addition, the Discrete Cosine Transformation (DCT) and the Discrete Sine Transformation (DST) are frequently used transformations. Here, these transformations are combined under the term “standard transformations”. An already mentioned property of standard transformations which is crucial to the invention is that the amplitudes of the various transformation coefficients represent different degrees of fine structure for the transformed signal. Thus, coefficients with small indices describe the coarse structures of the transformed signal, because the associated base functions are low-frequency harmonic functions. The higher the index of a transformation coefficient up to μ′=M/2, the finer the structures of the transformed signal which are described by said coefficient. For coefficients beyond this, this property is reversed on account of the symmetry of the coefficients. Usually, signal processing involves only the coefficients with indices μ′=0 to μ′=M/2 being processed and the remaining values being ascertained by mirroring the results.
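The mirroring of the results mentioned above can be sketched as follows. This is a minimal illustration assuming NumPy and an even number M of coefficients; the function name `mirror_half` is an assumption made for the example.

```python
import numpy as np

def mirror_half(half, M):
    """Rebuild all M coefficients from those with indices 0 to M/2 (M even)."""
    half = np.asarray(half)
    if half.shape[0] != M // 2 + 1:
        raise ValueError("expected M/2 + 1 coefficients")
    # Indices M/2+1 ... M-1 repeat indices M/2-1 ... 1 in reverse order.
    return np.concatenate([half, half[-2:0:-1]])

M = 16
rng = np.random.default_rng(3)
amplitudes = np.abs(np.fft.fft(rng.standard_normal(M)))
cepstrum = np.fft.ifft(np.log(amplitudes)).real
rebuilt = mirror_half(cepstrum[: M // 2 + 1], M)
```

For the symmetric amplitude spectrum of a real frame, `rebuilt` agrees with the full set of coefficients, so only indices 0 to M/2 need to be processed.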
In addition, the invertibility of the transformations makes it possible to interchange the transformation and its inverse in the forward and backward transformation. In (1), it is thus also possible to use the DFT from (2), for example, if the IDFT from (1) is used in (2).
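The interchangeability of the DFT and the IDFT can be checked numerically. The following is a sketch assuming NumPy and its usual scaling convention, in which the factor 1/M is applied in the inverse transformation; the variable names and the frame length are assumptions made for the example.

```python
import numpy as np

M = 64
rng = np.random.default_rng(2)
log_spectrum = np.log(np.abs(np.fft.fft(rng.standard_normal(M))))

# Variant A: IDFT as forward transformation, DFT as backward transformation.
coefficients_a = np.fft.ifft(log_spectrum).real
restored_a = np.fft.fft(coefficients_a).real

# Variant B: DFT as forward transformation, IDFT as backward transformation.
coefficients_b = np.fft.fft(log_spectrum).real
restored_b = np.fft.ifft(coefficients_b).real
```

Both variants restore the logarithmized spectrum. Under the assumed scaling convention, the coefficients of the two variants differ only by the constant factor M, so a smoothing rule that is linear in the coefficients behaves identically in both cases.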
Advantageously, the spectral coefficients of the short-term spectra are mapped nonlinearly before the forward transformation. A basic property of nonlinear mapping which is advantageous for the invention is dynamic compression of relatively large amplitudes and dynamic expansion of relatively small amplitudes.
Accordingly, the spectral coefficients of the smoothed short-term spectra can be mapped nonlinearly after the backward transformation, the nonlinear mapping after the backward transformation being the reversal of the nonlinear mapping before the forward transformation.
Expediently, the spectral coefficients are mapped nonlinearly before the forward transformation by logarithmization.
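A nonlinear mapping and its reversal can be sketched as a pair of functions. The logarithm and exponential pair follows the text; the power-law pair with exponent 0.3 is purely a hypothetical alternative chosen for illustration, as are the function names and the floor that avoids taking the logarithm of zero. The sketch assumes NumPy.

```python
import numpy as np

def log_map(amplitudes, floor=1e-12):
    """Nonlinear mapping before the forward transformation (logarithmization)."""
    return np.log(np.maximum(amplitudes, floor))

def log_unmap(values):
    """Reversal after the backward transformation (exponential function)."""
    return np.exp(values)

def power_map(amplitudes, exponent=0.3):
    """Hypothetical alternative: compressive power law (assumption, not from the text)."""
    return np.power(amplitudes, exponent)

def power_unmap(values, exponent=0.3):
    """Reversal of the hypothetical power-law mapping."""
    return np.power(values, 1.0 / exponent)

amplitudes = np.array([1e-3, 1e-1, 1.0, 1e1, 1e3])
mapped = log_map(amplitudes)
```

The amplitudes span six decades, whereas the mapped values are spread far more evenly, which illustrates the compression of large amplitudes and the expansion of small ones.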
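A form of temporal smoothing can be achieved by a preferably first-order recursive system:
$$G^{\mathrm{cepst}}_{\mu',\mathrm{smooth}}(\lambda) = \beta_{\mu'}\, G^{\mathrm{cepst}}_{\mu',\mathrm{smooth}}(\lambda-1) + (1-\beta_{\mu'})\, G^{\mathrm{cepst}}_{\mu'}(\lambda)$$
where G^cepst_μ′(λ) denotes the transformation coefficient with index μ′ in frame λ, G^cepst_μ′,smooth(λ) denotes its smoothed value and β_μ′ denotes the smoothing constant for this coefficient.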
Possible values for the smoothing constants for coefficients of the standard transformations in the case of voice signals are β_μ′=0 for μ′=0 . . . 3, β_μ′=0.8 for μ′=4 . . . M/2 with the exception of the transformation coefficients which represent the pitch frequency of a speaker, and β_μ′=0.4 for transformation coefficients which represent the pitch frequency. Methods for determining the pitch coefficient are widely available in the literature. By way of example, to determine the coefficient for the pitch, it is possible to select that coefficient whose index is between μ′=16 and μ′=160 and which has the maximum amplitude of all the coefficients in this index range. For the remaining transformation coefficients with indices μ′=M/2+1 . . . M−1, the symmetry condition β_(M−μ′)=β_μ′ applies. The values are suitable for the standard transformations and also short-term spectra which have arisen from signals where fs=8000 Hz. They can be adapted to suit other systems by proportional conversion. The selection β_μ′=0 means that the relevant coefficients are not smoothed. A crucial property of the invention is that coefficients which describe the coarse profile of the short-term spectrum are smoothed as little as possible if voice signals are being denoised. Thus, the coarse structures of the broadband voice spectrum are protected from smoothing effects. The fine structures of fluctuations or spectral outliers are mapped in the transformation coefficients between μ′=4 and μ′=M/2 in the case of standard transformations, which is why said transformation coefficients, apart from the coefficient representing the pitch of the voice, are smoothed heavily.
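The choice of smoothing constants and the first-order recursive smoothing can be sketched together as follows. This is a minimal illustration assuming NumPy and an even frame length M; the function names, the initialization of the recursion with the first frame, the selection of the pitch coefficient from the current frame, the inclusive index range 16 to 160 and the floor that avoids taking the logarithm of zero are all assumptions made for the example rather than details taken from the text.

```python
import numpy as np

def pitch_index(coefficients, low=16, high=160):
    """Index of the coefficient with maximum absolute value in low..high (inclusive)."""
    high = min(high, len(coefficients) // 2)
    return low + int(np.argmax(np.abs(coefficients[low:high + 1])))

def smoothing_constants(coefficients, beta_fine=0.8, beta_pitch=0.4, n_coarse=4):
    """Smoothing constant per coefficient: 0 for the coarse structure, reduced for the pitch."""
    M = len(coefficients)
    beta = np.full(M, beta_fine)
    beta[:n_coarse] = 0.0                        # indices 0..3 are not smoothed
    beta[pitch_index(coefficients)] = beta_pitch
    # Symmetry condition: the constant for index M - k equals that for index k.
    beta[M // 2 + 1:] = beta[M // 2 - 1:0:-1]
    return beta

def smooth_spectra(amplitude_frames, floor=1e-12):
    """Temporal smoothing of the cepstral coefficients of successive frames."""
    smoothed_cepstrum = None
    output = []
    for amplitudes in amplitude_frames:
        cepstrum = np.fft.ifft(np.log(np.maximum(amplitudes, floor))).real
        if smoothed_cepstrum is None:
            smoothed_cepstrum = cepstrum         # assumption: start from the first frame
        else:
            beta = smoothing_constants(cepstrum)
            smoothed_cepstrum = beta * smoothed_cepstrum + (1.0 - beta) * cepstrum
        output.append(np.exp(np.fft.fft(smoothed_cepstrum).real))
    return np.array(output)
```

As a usage example, a sequence of flat spectra of length M=512 in which one frame contains an isolated outlier in a single bin (and its mirror bin) yields a smoothed spectrum in which the outlier is strongly attenuated, while the coefficients with indices 0 to 3 follow the input without delay.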
Advantageously, the smoothing method is applied to the absolute value or a power of the absolute value of the short-term spectra.
It is particularly advantageous if different time constants are used to smooth the respective transformation coefficients. The time constants can be chosen such that the transformation coefficients which represent primarily voice are smoothed only slightly. Expediently, the transformation coefficients which describe primarily fluctuating background noise and artifacts of the noise reduction algorithms can be smoothed heavily.
The short-term spectrum provided may be the spectral weighting function of a noise reduction algorithm. Advantageously, the short-term spectrum used may also be the spectral weighting function of a post filter for multichannel methods for noise reduction. Expediently, the spectral weighting function is in this case obtained from the minimization of an error criterion.
The short-term spectrum provided may also be a filtered short-term spectrum.
According to another development of the method, the short-term spectrum provided is a spectral weighting function of a multichannel method for noise reduction.
The short-term spectrum provided may also be an estimated coherence or an estimated “Magnitude Squared Coherence” between at least two microphone channels.
Advantageously, the short-term spectrum provided is a spectral weighting function of a multichannel method for speaker or source separation.
In addition, provision is made for the short-term spectrum provided to be a spectral weighting function of a multichannel method for speaker separation on the basis of phase differences for signals in the various channels (Phase Transform—PHAT).
In addition, it is possible for the short-term spectrum used to be a spectral weighting function of a multichannel method on the basis of a “Generalized Cross-Correlation” (GCC). The short-term spectrum provided may also be spectral magnitudes which contain both voice and noise components.
The short-term spectrum provided may also be an estimate of the signal-to-noise ratio in the individual frequency bins. In addition, the short-term spectrum used may be an estimate of the noise power.
The problem of fluctuations in short-term spectra is known not only in audio signal processing. Further advantageous areas of application are image and medical signal processing.
In image processing, the rows of an image can be interpreted as a signal frame, for example, which can be transformed into the spectral domain. In this case, the frequency bins produced are called local frequency bins. When images are processed in the local frequency domain, algorithms are used which are equivalent to those in audio signal processing. Possible fluctuations which these algorithms produce in the local frequency domain result in visual artifacts in the processed image. These are equivalent to tonal noise in audio processing.
In medical signal processing, signals are derived from the human body which may exhibit noise in the manner of audio signals. The noisy signal can be transformed into the spectral domain frame by frame as appropriate. The resultant spectrograms can be processed in the manner of audio spectra.
The smoothing method can be used in a telecommunication network and/or for a broadcast transmission in order to improve the voice and/or image quality and in order to suppress artifacts. In mobile voice communication, distortions in the voice signal arise which are caused firstly by the voice coding methods used (redundancy-reducing voice compression) and the associated quantization noise and secondly by the interference brought about by the transmission channel. Said interference in turn has a high level of temporal and spectral fluctuation and results in a clearly perceptible worsening of the voice quality. In this case, too, the signal processing used at the receiver end or in the network needs to ensure that the quasi-random artifacts are reduced. To improve quality, what are known as post filters and error masking methods have been used to date. Whereas the post filter predominantly has the task of reducing quantization noise, error masking methods are used to suppress transmission-related channel interference. In both applications, improvements can be attained if the smoothing method according to the invention is integrated into the post filter or the masking method. The smoothing method can therefore be used as a post filter, in a post filter, in combination with a post filter, as part of an error masking method or in conjunction with a method for voice and/or image coding (decompression method or decoding method), particularly at the receiver end. When the method is used as a post filter, this means that the method is used for post filtering, that is to say an algorithm which implements the method is used to process the data which arise in the applications. It is also possible to improve the quality of the voice signal in the telecommunication network by smoothing the voice signal spectrum or a magnitude derived therefrom using the smoothing method according to the invention.
The invention is explained in more detail below with reference to illustrations which are shown in the figures.
| Number | Date | Country | Kind |
|---|---|---|---|
| 10 2007 030 209.8 | Jun 2007 | DE | national |

| Filing Document | Filing Date | Country | Kind | 371c Date |
|---|---|---|---|---|
| PCT/DE08/01047 | 6/25/2008 | WO | 00 | 12/18/2009 |