This invention relates generally to denoised speech signals, and more particularly to restoring spectral components attenuated in the speech signals as a result of the denoising.
A speech signal is often acquired in a noisy environment. In addition to reducing the perceptual quality and intelligibility of the speech, noise negatively affects the performance of downstream processing such as coding for transmission and recognition, which are typically optimized for efficient performance on an undistorted “clean” speech signal. For this reason, it becomes necessary to denoise the signal before further processing. A large number of denoising methods are known. Typically, the conventional methods first estimate the noise, and then reduce the noise either by subtraction or filtering.
The problem is that the noise estimate is usually inexact, especially when the noise is time-varying. As a result, some residual noise remains after denoising, and information-carrying spectral components are attenuated. For example, if speech is acquired in a vehicle, then denoising attenuates the high-frequency components of fricated sounds such as /S/, and the very-low-frequency components of nasals and liquids such as /M/, /N/ and /L/. This happens because automotive noise is dominated by high and low frequencies, and reducing the noise attenuates these spectral components in the speech signal.
Although noise reduction results in a signal with improved perceptual quality, the intelligibility of the speech often does not improve, i.e., while the denoised signal sounds undistorted, the ability to make out what was spoken is decreased. In some cases, particularly when the denoising is aggressive or when the noise is time-varying, the denoised signal is less intelligible than the noisy signal.
This problem is the result of imperfect processing. Nevertheless, it is a very real problem for a spoken-interface device that incorporates third-party denoising hardware or software. The denoising techniques are often “black boxes” that are integrated into the device, and only the denoised signal is available. In this case, it becomes important to somehow restore the spectral components of the speech information that the denoising attenuated.
Noise degrades speech signals, affecting the perceptual quality and intelligibility, as well as downstream processing, e.g., coding for transmission or speech recognition. Hence, noisy speech is denoised. Typically, denoising methods subtract or filter an estimate of the noise, which is often inexact. As a result, denoising can attenuate spectral components of the speech, thereby reducing intelligibility.
A training undistorted speech signal is represented as a composition of training undistorted bases. A training denoised speech signal is represented as a composition of training distorted bases. A test denoised speech signal is decomposed as a composition of the training distorted bases. Then, a corresponding test undistorted speech signal can be estimated as an identical composition of the training undistorted bases.
The embodiments of the invention provide a method for restoring spectral components attenuated in a test denoised speech signal as a result of denoising a test speech signal to enhance the intelligibility of the speech in the denoised signal.
The method is constrained by practical aspects of the denoising. First, the denoising is usually a “black box.” The manner in which the noise is estimated and the actual noise reduction procedure are unknown. Second, it is usually impossible or impractical to record the noise itself separately, and no external estimate of the noise is available to understand how the denoising has affected any spectral components of the speech. Third, the processing must restore the attenuated spectral components of the speech without reintroducing the noise into the signal.
The method uses a compositional characterization of the speech signal that assumes that the signal can be represented as a constructive composition of additive bases.
In one embodiment, this characterization is obtained by non-negative matrix factorization (NMF), although other techniques can also be used. NMF factors a matrix into matrices with non-negative elements. NMF has been used for separating mixed speech signals and denoising speech. Compositional models have also been used to extend the bandwidth of bandlimited signals. However, as far as is known, NMF has not been used for the specific problem of restoring attenuated spectral components in a denoised speech signal.
The manner in which the composition of the additive bases is affected by the denoising is relatively constant, and can be obtained from training data comprising stereo pairs of training undistorted signals and training distorted speech signals. By determining how the denoised signal is represented in terms of the composition of the additive bases, the attenuated spectral structures can be estimated from the undistorted versions of the bases, and subsequently restored to provide undistorted speech.
Denoising Model
As shown in the accompanying figure, the lossy denoising is modeled as two functions applied in sequence.
That is, the noisy speech signal S is processed by an ideal “lossless” denoising function F(S) 110 to produce a hypothetical lossless denoised signal X. Then, the denoised signal X is passed through a distortion function D(X) 120 that attenuates the spectral components to produce a lossy signal Y.
The goal is to estimate the denoised signal X, given only the lossy signal Y. The embodiments of the invention express the lossless signal X as a composition of weighted additive bases $w_i B_i$:
$$X = \sum_i w_i B_i. \qquad (1)$$
The bases $B_i$ are assumed to represent uncorrelated building blocks that constitute the individual spectral structures that compose the denoised speech signal X. The distortion function D( ) distorts the bases to modify the spectral structure the bases represent. Thus, any basis $B_i$ is transformed by the distortion function to $B_i^{\mathrm{distorted}} = D(B_i)$.
It is assumed that the distortion transforms any basis independently of other bases, i.e.,
$$D(B_i \mid B_j : j \neq i) = D(B_i),$$
where $D(B_i \mid B_j : j \neq i)$ represents the distortion of the basis $B_i$ given that the other bases $B_j$, $j \neq i$, are also concurrently present. This assumption is invalid unless the bases represent non-overlapping, complete spectral structures. It is also assumed that the manner in which the bases are combined to compose the signal is not modified by the distortion. These assumptions are made to simplify the method. The implication of the above assumptions is that
$$Y = D(X) = D\Big(\sum_i w_i B_i\Big) = \sum_i w_i D(B_i) = \sum_i w_i B_i^{\mathrm{distorted}}. \qquad (2)$$
Eqn. 2 leads to the conclusion that if all bases $B_i$ and their distorted versions $B_i^{\mathrm{distorted}}$ are known, and if the manner in which the distorted bases compose Y can be determined, i.e., if the weights $w_i$ can be estimated, then the denoised signal X can be estimated.
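The reasoning above can be illustrated with a minimal numerical sketch. Everything in it is a hypothetical toy: the distortion is assumed to be a fixed per-component attenuation, the names `B`, `B_dist`, and `attenuation` are illustrative, and an ordinary least-squares solve stands in for the weight estimation, which is only adequate here because the toy has fewer bases than dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 4 bases in an 8-dimensional space (columns are bases).
B = rng.random((8, 4))                    # undistorted bases B_i
attenuation = np.linspace(1.0, 0.2, 8)    # assumed stand-in for the distortion D
B_dist = attenuation[:, None] * B         # distorted bases D(B_i)

w = np.array([0.5, 0.0, 2.0, 1.0])        # composition weights w_i
X = B @ w                                 # lossless signal: sum_i w_i B_i
Y = B_dist @ w                            # lossy signal:    sum_i w_i D(B_i)

# Because this toy D acts independently on each basis, D(X) equals Y.
assert np.allclose(attenuation * X, Y)

# Recover the weights from Y and the distorted bases, then rebuild X
# from the undistorted bases using the identical composition.
w_est, *_ = np.linalg.lstsq(B_dist, Y, rcond=None)
X_est = B @ w_est
```

In this toy case the weights are uniquely determined, so `X_est` matches `X`; with an overcomplete set of bases a constrained estimator is needed instead.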
Restoration Method Overview
Representing the Signal
The model described above is applied to magnitude spectra of the signals, obtained with a short-time Fourier transform (STFT).
An optimal analysis frame for the STFT is 40-64 ms. Hence, the speech signals are segmented by sliding a window of 64 ms over the signals to produce the frames. A Fourier spectrum is computed over each frame to obtain a complex spectral vector. Its magnitude is taken to obtain a magnitude spectral vector. The set of complex spectral vectors for all frames compose the complex spectrogram for the signal. The magnitude spectral vectors for all frames compose the magnitude spectrogram. The spectra for individual frames are represented as vectors, e.g., X(t), Y(t).
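The framing and spectral analysis described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the Hann window, the 25% hop, and the function name `stft` are assumptions that do not come from the description, which specifies only the 64 ms frame length.

```python
import numpy as np

def stft(signal, sample_rate, frame_ms=64.0, hop_fraction=0.25):
    """Return the complex spectrogram with shape (n_bins, n_frames).

    Assumes len(signal) is at least one frame long.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = max(1, int(frame_len * hop_fraction))   # assumed hop size
    window = np.hanning(frame_len)                # assumed window shape
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    # One complex spectral vector per frame (column).
    return np.fft.rfft(frames, axis=0)

# Usage: a 1 kHz tone sampled at 16 kHz, analyzed with 64 ms (1024-sample) frames.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)
complex_spec = stft(tone, sample_rate)
magnitude_spec = np.abs(complex_spec)   # magnitude spectrogram
```

Each column of `magnitude_spec` plays the role of a magnitude spectral vector such as X(t) or Y(t).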
Let S, X, and Y represent magnitude spectrograms of the noisy speech, losslessly denoised speech and lossy denoised speech, respectively. The bases $B_i$, as well as their distorted versions $B_i^{\mathrm{distorted}}$, represent magnitude spectral vectors. The magnitude spectrum of the t-th analysis frame of the signal X, which is represented as X(t), is assumed to be composed from the lossless bases $B_i$ as
$$X(t) = \sum_i w_i(t)\, B_i,$$
and the magnitude spectrum of the corresponding frame of the lossy signal Y is
$$Y(t) = \sum_i w_i(t)\, B_i^{\mathrm{distorted}}.$$
Also, the weights $w_i$ are now all non-negative, because the signs of the weights in the additive model above are incorporated into the phase of the spectra for the bases, and do not appear in the relationship between the magnitude spectra of the signals and the bases.
The spectral restoration method estimates the lossless magnitude spectrogram X from that of the lossy signal Y. The estimated magnitude spectrogram is inverted to a time-domain signal. To do so, the phase from the complex spectrogram of the lossy signal is used.
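A minimal sketch of this inversion step is shown below, assuming non-overlapping rectangular frames so that the overlap-add is exact. The helper names `apply_lossy_phase` and `overlap_add` are hypothetical, and a practical system with overlapping, windowed frames would also need a synthesis window and normalization, which are omitted here.

```python
import numpy as np

def apply_lossy_phase(restored_magnitude, lossy_complex_spec):
    """Attach the phase of the lossy complex spectrogram to a restored magnitude."""
    return restored_magnitude * np.exp(1j * np.angle(lossy_complex_spec))

def overlap_add(complex_spec, frame_len, hop):
    """Invert each column with an inverse FFT and overlap-add the frames."""
    frames = np.fft.irfft(complex_spec, n=frame_len, axis=0)
    n_frames = frames.shape[1]
    signal = np.zeros(frame_len + hop * (n_frames - 1))
    for i in range(n_frames):
        signal[i * hop : i * hop + frame_len] += frames[:, i]
    return signal

# Toy check: 4 non-overlapping frames of 16 samples each.
rng = np.random.default_rng(0)
original = rng.standard_normal(4 * 16)
frames = original.reshape(4, 16).T            # shape (frame_len, n_frames)
lossy_spec = np.fft.rfft(frames, axis=0)      # complex spectrogram of the "lossy" signal
restored_mag = np.abs(lossy_spec)             # stand-in for the estimated magnitude
rebuilt = overlap_add(apply_lossy_phase(restored_mag, lossy_spec), frame_len=16, hop=16)
```

Here the "restored" magnitude is simply the original magnitude, so `rebuilt` reproduces `original`; in the method, the estimated lossless magnitude would take its place.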
Restoration Method Details
For restoration, in a training phase, the lossless bases $B_i$ 211 for the signal X and the corresponding lossy bases $B_i^{\mathrm{distorted}}$ 221 for the signal Y are obtained from training data, i.e., the training undistorted speech signal 201 and the training denoised speech signal 202. After training, during operation of the method, these bases are employed to estimate the denoised signal X.
Obtaining the Bases
Because the distortion function D( ) 120 is unknown, the bases $B_i$ and $B_i^{\mathrm{distorted}}$ are jointly obtained from analysis of joint recordings of the signal X and the corresponding signal Y. Therefore, the joint recordings of the training signals X and Y are needed in the training phase. However, the signal X is not directly available, and the following approximation is used instead.
An undistorted (clean) training speech signal C is artificially corrupted with digitally added noise to obtain the noisy signal S. Then, the signal S is processed with the denoising process 110 to obtain the corresponding signal Y. The “losslessly denoised” signal X is a hypothetical entity that is also unknown. Instead, the original undistorted clean signal C is used as a proxy for X. The denoising process and the distortion function introduce a delay into the signal so that the signals Y and C are shifted in time with respect to one another.
Because the model of Eqn. 2 assumes a one-to-one correspondence between each frame of X and the corresponding frame of Y, the recorded samples of the signals C and Y are time aligned to eliminate any relative time shifts introduced by the denoising. The time shift is estimated by cross-correlating each frame of the signal C and the corresponding frame of the signal Y.
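The alignment step can be sketched as below for a single segment; the same estimate could be applied frame by frame. The function names `estimate_delay` and `align`, the `max_lag` search range, and the synthetic 7-sample delay are assumptions made for illustration only.

```python
import numpy as np

def estimate_delay(reference, delayed, max_lag):
    """Return the lag (in samples) by which `delayed` trails `reference`.

    Both inputs are assumed to have the same length; the search is
    restricted to lags in [-max_lag, max_lag].
    """
    full = np.correlate(delayed, reference, mode="full")
    zero = len(reference) - 1                       # index of zero lag
    scores = full[zero - max_lag : zero + max_lag + 1]
    lags = np.arange(-max_lag, max_lag + 1)
    return int(lags[np.argmax(scores)])

def align(reference, delayed, lag):
    """Trim both signals so that corresponding samples line up."""
    if lag >= 0:
        return reference[: len(reference) - lag], delayed[lag:]
    return reference[-lag:], delayed[: len(delayed) + lag]

# Toy check: the "denoised" signal is the clean one delayed by 7 samples.
rng = np.random.default_rng(0)
clean = rng.standard_normal(512)
denoised = np.concatenate([np.zeros(7), clean[:-7]])
lag = estimate_delay(clean, denoised, max_lag=32)
clean_aligned, denoised_aligned = align(clean, denoised, lag)
```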
The bases $B_i$ are assumed to be the composing bases for the signal X. The bases can be obtained by analysis of magnitude spectra of signals using NMF. However, as an additional constraint, the distorted bases $B_i^{\mathrm{distorted}}$ must be reliably known to actually be distortions of their undistorted counterpart bases $B_i$.
Therefore, an example-based model is used, where such a correspondence is assured. A large number of magnitude spectral vectors are randomly selected from the signal C as the bases $B_i$ for the signal X. The corresponding vectors are selected from the training instances of the signal Y as $B_i^{\mathrm{distorted}}$. This ensures that $B_i^{\mathrm{distorted}}$ is indeed a near-exact distorted version of $B_i$. Because the bases represent spectral structures in the speech, and the potential number of spectral structures in speech is virtually unlimited, a large number of training bases are selected, e.g., 5000 or more. The model of Eqn. 1 thus becomes overcomplete, combining many more elements than the dimensionality of the signal itself.
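A minimal sketch of this example-based selection follows, assuming the two magnitude spectrograms are already time aligned and stored with one frame per column. The function name `sample_paired_bases`, its parameters, and the toy attenuation used in the check are hypothetical.

```python
import numpy as np

def sample_paired_bases(clean_mag, denoised_mag, n_bases, seed=0):
    """Pick the same randomly chosen frames from both spectrograms.

    clean_mag, denoised_mag: arrays of shape (n_bins, n_frames), time aligned.
    Returns (B, B_dist, idx), where column k of B_dist is the distorted
    counterpart of column k of B.
    """
    if clean_mag.shape[1] != denoised_mag.shape[1]:
        raise ValueError("spectrograms must have the same number of frames")
    rng = np.random.default_rng(seed)
    n_frames = clean_mag.shape[1]
    idx = rng.choice(n_frames, size=min(n_bases, n_frames), replace=False)
    return clean_mag[:, idx], denoised_mag[:, idx], idx

# Toy check: the "denoised" spectrogram is a uniformly attenuated clean one.
rng = np.random.default_rng(1)
clean_mag = rng.random((5, 20))
denoised_mag = 0.5 * clean_mag
B, B_dist, idx = sample_paired_bases(clean_mag, denoised_mag, n_bases=8)
```

Because the same frame indices are used for both spectrograms, the pairing between each basis and its distorted version holds by construction.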
Estimating Weights
The method for restoring spectral components in the test denoised signal Y 203 determines how each spectral vector Y(t) of Y is composed by the distorted bases. As stated above, $Y(t) = \sum_i w_i(t)\, B_i^{\mathrm{distorted}}$.
If the set of all training distorted bases 221 is represented as a matrix $\bar{B} = [B_1^{\mathrm{distorted}}\; B_2^{\mathrm{distorted}}\; \ldots]$, and the weights as a vector $W(t) = [w_1(t)\; w_2(t)\; \ldots]^T$, then
$$Y(t) = \bar{B}\, W(t). \qquad (3)$$
The vector W(t) is constrained to be non-negative during the estimation. A variety of update rules are known for learning the weights. For speech and audio signals, it is most effective to employ the update rule that minimizes the generalized Kullback-Leibler distance between $Y(t)$ and $\bar{B}\, W(t)$:
$$W(t) \leftarrow W(t) \otimes \frac{\bar{B}^T \dfrac{Y(t)}{\bar{B}\, W(t)}}{\bar{B}^T \mathbf{1}}, \qquad (4)$$
where $\otimes$ represents component-wise multiplication, $\mathbf{1}$ is a vector of ones, and all divisions are also component-wise. Because the representation is overcomplete, i.e., there are more bases than there are dimensions in Y(t), the equation is underdetermined and multiple solutions for W(t) exist that characterize Y(t) equally well.
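The weight estimation can be sketched with the standard multiplicative update for the generalized Kullback-Leibler distance with the bases held fixed. This is an illustrative sketch under stated assumptions: the iteration count, random initialization, small constant `eps`, and the names `estimate_weights` and `generalized_kl` are not taken from the description.

```python
import numpy as np

def generalized_kl(y, approx):
    """Generalized Kullback-Leibler distance between two positive vectors."""
    return float(np.sum(y * np.log(y / approx) - y + approx))

def estimate_weights(y, B_dist, n_iter=200, eps=1e-12, seed=0):
    """Estimate non-negative weights w such that B_dist @ w approximates y.

    y: magnitude spectral vector, shape (n_bins,)
    B_dist: distorted bases as columns, shape (n_bins, n_bases)
    """
    rng = np.random.default_rng(seed)
    w = rng.random(B_dist.shape[1]) + eps        # positive initialization
    col_sums = B_dist.sum(axis=0) + eps          # B_dist^T applied to a vector of ones
    for _ in range(n_iter):
        approx = B_dist @ w + eps
        # Multiplicative update: keeps w non-negative at every step.
        w *= (B_dist.T @ (y / approx)) / col_sums
    return w

# Toy check with an overcomplete set: 20 bases in a 6-dimensional space.
rng = np.random.default_rng(1)
B_dist = rng.random((6, 20)) + 0.01
w_true = np.zeros(20)
w_true[[2, 7, 11]] = [1.0, 0.5, 2.0]
y = B_dist @ w_true
w_short = estimate_weights(y, B_dist, n_iter=1)
w_long = estimate_weights(y, B_dist, n_iter=200)
kl_short = generalized_kl(y, B_dist @ w_short)
kl_long = generalized_kl(y, B_dist @ w_long)
```

Since the toy problem is underdetermined, `w_long` need not equal `w_true`; the check of interest is that the distance does not increase with further iterations and that the weights stay non-negative.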
Estimating the Speech with Restored Spectral Components
After the weights $W(t) = [w_1(t)\; w_2(t)\; \ldots]^T$ are determined for any Y(t), by Eqn. 2 the corresponding lossless spectrum X(t) can be estimated as $X(t) = \sum_i w_i(t)\, B_i$. Because the estimation procedure is iterative, the exact equality in Eqn. 3 is never achieved. Instead, the product of the matrix of distorted bases and the weight vector W(t) is only an approximation of Y(t).
All divisions and multiplications above are component-wise, and ε>0 to ensure that attenuated spectral components can still be restored when Y(t)=0.
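One way to realize such an estimate is sketched below. The specific expression in `restore_frame` is an assumption: it is a Wiener-style ratio chosen only because it matches the stated properties (component-wise multiplications and divisions, and a nonzero output when Y(t)=0 because ε>0). The function name, the value of `eps`, and the toy attenuation are likewise hypothetical.

```python
import numpy as np

def restore_frame(y, B, B_dist, w, eps=1e-8):
    """Estimate a lossless magnitude spectrum from a lossy one (assumed form).

    y: lossy magnitude spectrum, shape (n_bins,)
    B, B_dist: undistorted and distorted bases as columns, shape (n_bins, n_bases)
    w: non-negative weights estimated from y and B_dist, shape (n_bases,)
    """
    clean_model = B @ w            # composition of undistorted bases
    lossy_model = B_dist @ w       # approximation of y from distorted bases
    # Component-wise ratio; eps keeps the result nonzero when y is zero.
    return (y + eps) * clean_model / (lossy_model + eps)

# Toy check: a fixed per-bin attenuation stands in for the distortion.
rng = np.random.default_rng(0)
B = rng.random((8, 5)) + 0.01
attenuation = np.linspace(1.0, 0.1, 8)
B_dist = attenuation[:, None] * B
w = rng.random(5) + 0.1
x_true = B @ w
y = B_dist @ w
x_hat = restore_frame(y, B, B_dist, w)
x_from_silence = restore_frame(np.zeros(8), B, B_dist, w)
```

When the distorted model matches `y` exactly, as in this toy case, the ratio reduces to the composition of undistorted bases; when `y` is zero, the output is small but still positive.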
Expanding the Bandwidth
Often, the recorded and denoised speech signal has a reduced bandwidth, e.g., if the speech is acquired by telephony, then the speech may only include low frequencies up to 4 kHz, and high frequencies above 4 kHz are lost. In these cases, the method can be extended to restore high-frequency spectral components into the signal. This is also expected to improve the intelligibility of the signal. To expand the bandwidth, a bandwidth reconstruction procedure can be used, see U.S. Pat. No. 7,698,143, “Constructing broad-band acoustic signals from lower-band acoustic signals,” issued to Ramakrishnan et al. on Apr. 13, 2010, incorporated herein by reference. That procedure is only concerned with constructing broad-band acoustic signals from lower-band acoustic signals, and not with denoised speech signals, as here.
In this case, the training data also includes wideband signals for the training undistorted signal C. The training recordings for C and Y are time aligned, and STFT analysis is performed using identical analysis frames. This ensures that in any joint recording there is a one-to-one correspondence between the spectral vectors for the signals C and Y. Consequently, while the bases $B_i^{\mathrm{distorted}}$ 221, drawn from training instances of Y, represent reduced-bandwidth signals, the corresponding bases $B_i$ 211 represent wideband signals and include high-frequency components. After the signals are denoised, low-frequency components are restored using Eqn. 5, and the high-frequency components are obtained as
$$X(t, f) = \sum_i w_i(t)\, B_i(f), \quad f \in \{\text{high frequency}\},$$
where f is an index to specific frequency components of X(t) and $B_i$.
The above estimate only determines spectral magnitudes. To invert the magnitude spectrum to a time-domain signal, the phase is also required. The phase for low-frequency components is taken directly from the reduced-bandwidth lossy denoised signal. For higher frequencies, it is sufficient to replicate the phase terms from the lower frequencies.
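The bandwidth expansion for a single frame can be sketched as follows. The function name `expand_bandwidth`, the convention that the first `n_low` bins form the low band, and the tiling scheme used to replicate the low-band phase are assumptions; the description states only that the lower-frequency phase terms are replicated.

```python
import numpy as np

def expand_bandwidth(low_mag, low_phase, B_wide, w):
    """Build one wideband complex spectral vector.

    low_mag, low_phase: restored magnitude and lossy phase of the low band, shape (n_low,)
    B_wide: wideband undistorted bases as columns, shape (n_wide, n_bases)
    w: weights estimated from the reduced-bandwidth frame, shape (n_bases,)
    """
    n_low = len(low_mag)
    n_wide = B_wide.shape[0]
    mag = np.empty(n_wide)
    mag[:n_low] = low_mag                       # low band: restored magnitudes
    mag[n_low:] = B_wide[n_low:] @ w            # high band: sum_i w_i B_i(f)
    # Assumed replication scheme: tile the low-band phase up the spectrum.
    reps = int(np.ceil(n_wide / n_low))
    phase = np.tile(low_phase, reps)[:n_wide]
    return mag * np.exp(1j * phase)

# Toy check: 4 low-band bins expanded to 8 wideband bins.
rng = np.random.default_rng(0)
B_wide = rng.random((8, 3)) + 0.01
w = np.array([0.5, 1.0, 2.0])
low_mag = np.ones(4)
low_phase = np.array([0.1, -0.5, 1.0, 2.0])
wide_frame = expand_bandwidth(low_mag, low_phase, B_wide, w)
```

The resulting complex vector can then be inverted to the time domain in the same way as the restored low-band spectrum.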
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.