The following disclosure relates to using end-to-end neural networks for suppressing noise and distortion in speech audio signals.
Speech signals acquired in the real world are rarely of pristine quality. In real-world applications, often because of ambient environmental conditions and the location of the microphone relative to the desired talker, speech signals are typically captured in the presence of distortions such as reverberation and/or additive noise. For human listeners, this can result in increased cognitive load and reduced intelligibility. For automated applications such as speech and speaker recognition, this can lead to significant performance degradation. Speech enhancement techniques can be used to minimize the effects of these acoustic degradations. Single-channel speech enhancement aims to reduce the effects of reverberation and noise, thereby improving the quality of the output speech signal.
For several decades, single-channel speech enhancement was addressed using a statistical model-based approach. In such systems, noise suppression was performed via multiplicative masking in the spectral domain, and optimal masks were estimated through statistical inference. In some previous techniques, various statistical cost functions were optimized during mask estimation, and in others, various statistical models were assumed for modeling speech and noise as random processes. Significant progress in noise estimation methods led to impressive noise suppression performance for acoustic environments with stationary noise components. However, for highly non-stationary noise scenarios, statistical model-based approaches to speech enhancement typically result in a high level of speech distortion and musical noise artifacts.
Within the last decade, Deep Neural Networks (DNNs) have emerged as a powerful tool for regression or classification problems, and have set the state-of-the-art across a variety of tasks, e.g., within image, speech, and language processing. Initial applications of DNNs to speech enhancement used them to predict clean speech spectrograms from distorted inputs, both for the task of noise suppression and the suppression of reverberation. Significant performance improvements were observed relative to statistical model-based approaches.
Later applications of neural networks to speech enhancement used DNNs to estimate multiplicative masks which were used for noise suppression in the spectral domain. In some cases, feed-forward networks were utilized, but subsequent work leveraged more advanced network architectures such as Recurrent and Long Short-Term Memory (LSTM) layers. Additional details about existing speech enhancement techniques using DNNs, including a more detailed discussion of single-channel speech enhancement using a statistical model-based approach, are provided in U.S. Pat. No. 11,227,586, entitled “SYSTEMS AND METHODS FOR IMPROVING MODEL-BASED SPEECH ENHANCEMENT WITH NEURAL NETWORKS,” filed Sep. 11, 2019, the content of which is incorporated by reference herein in its entirety.
While some works discussed the unimportance of processing short-time phase information for speech enhancement, recent work has illustrated the potential benefits of phase processing for the task. The previously discussed DNN-based enhancement approaches manipulate spectral magnitudes of the input signal, and thereby leave the short-time phase signal untouched. This motivated recent end-to-end DNN-based enhancement systems which directly process noisy time-domain speech signals and output enhanced waveforms. Many studies explored Fully Convolutional Networks (FCNs), which offer a computationally efficient framework for noise suppression in the waveform domain. More recent studies have utilized the U-Net architecture, which enables longer temporal contexts to be leveraged during end-to-end processing by including a series of downsampling blocks, followed by a series of upsampling blocks.
Training an end-to-end neural network-based speech enhancement system requires a distance measure which operates on time-domain samples. At first, the mean squared error (MSE) between the clean and enhanced waveforms was used to optimize network parameters. Recent work, however, has proposed loss functions which are perceptually motivated. These studies have proposed losses which approximate speech quality metrics such as the Perceptual Evaluation of Speech Quality (PESQ) or the Short-Time Objective Intelligibility (STOI), or use multi-component losses, which include spectral distortion measures.
Accordingly, there exists a need for end-to-end systems and methods that effectively jointly suppress noise and reverberation in speech signals captured in the wild, which could generate enhanced signals for human listening in, by way of non-limiting example, a cellular telephone, or for automated speech applications such as Automatic Speech Recognition (ASR) or Speaker Recognition.
Certain aspects of the present disclosure provide for systems implementing a Speech Enhancement via Attention Masking Network (SEAMNET), which includes an end-to-end system for joint suppression of noise and reverberation.
Examples of SEAMNET systems according to the present disclosure include a neural network-based end-to-end single-channel speech enhancement system designed for joint suppression of noise and reverberation, which examples can accomplish through attention masking. One example property of exemplary SEAMNET systems is a network architecture that contains both an enhancement and an autoencoder path, so that disabling the masking mechanism causes the exemplary SEAMNET system to reconstruct the input speech signal. This allows dynamic control of the level of suppression applied by exemplary SEAMNET systems via a minimum gain level, which is not possible in other state-of-the-art approaches to end-to-end speech enhancement. A novel loss function can be utilized to simultaneously train both the enhancement and the autoencoder paths, and can include a perceptually-motivated waveform distance measure. In addition to the novel architecture, exemplary SEAMNET systems can include a novel method for designing target waveforms for network training, so that joint suppression of additive noise and reverberation can be performed by an end-to-end enhancement system, which has not been previously possible. Experimental results show that exemplary SEAMNET systems outperform a variety of state-of-the-art baseline systems, both in terms of objective speech quality measures and subjective listening tests.
Example applications of SEAMNET systems according to the present disclosure include being utilized for the end task of human listening, in, by way of non-limiting example, a cellular telephone. In this case, an exemplary SEAMNET system can potentially improve the intelligibility of the speech observed in acoustically adverse environments, as well as lower the cognitive load required during listening. Additionally, exemplary SEAMNET systems can be used as a pre-processing step for automated speech applications, such as automatic speech recognition, speaker recognition, and/or auditory attention decoding.
The present disclosure includes several novel contributions. For instance, a formalization of an end-to-end masking-based enhancement architecture, referred to herein as the b-Net. A loss function that simultaneously trains both an enhancement and an autoencoder path within the overall network. A noise suppression system allowing a user to dynamically control the tradeoff between noise suppression and speech quality via a minimum gain threshold during testing. A method for designing target waveforms so that joint suppression of noise and reverberation can be performed in an end-to-end enhancement framework. A derivation of a perceptually-motivated distance measure as an alternative to mean squared error for network training.
The present disclosure also provides experimental results comparing the performance of exemplary SEAMNET systems to state-of-the-art methods, both in terms of objective speech quality metrics and subjective listening tests, and highlights the importance of allowing dynamic user control over the inherent tradeoff between noise suppression and speech quality. Additionally, the benefit of reverberation suppression in an end-to-end system is clearly shown in objective quality measures and subjective listening. Finally, SEAMNET systems according to the present disclosure offer interpretability of several internal mechanisms, and intuitive parallels are drawn to statistical model-based enhancement systems.
Certain embodiments of the present system provide significant levels of noise suppression while maintaining high speech quality, which can reduce the fatigue experienced by human listeners and may ultimately improve speech intelligibility. Embodiments of the present disclosure improve the performance of automated speech systems, such as speaker and language recognition, when used as a pre-processing step. Finally, the embodiments can be used to improve the quality of speech within communication networks.
One example of the present disclosure is a computer-implemented system for recognizing and processing speech that includes a processor configured to execute an end-to-end neural network trained to detect speech in the presence of noise and distortion. The end-to-end neural network is configured to receive an input waveform containing speech and output an enhanced waveform.
The end-to-end neural network can define a b-Net structure that can include an encoder, a mask estimator, and/or a decoder. The encoder can be configured to map the input waveform into a sequence of input embeddings in which speech signal components and non-speech signal components are separable via a scaling procedure. The mask estimator can be configured to generate a sequence of multiplicative attention masks, while the b-Net structure can be configured to utilize the multiplicative attention masks to create a sequence of enhanced embeddings from the sequence of input embeddings. The decoder can be configured to synthesize an output waveform based on the sequence of enhanced embeddings. The neural network can include an autoencoder path and an enhancement path. The autoencoder path can include the encoder and decoder, while the enhancement path can include the encoder, the mask estimator, and the decoder, and the neural network can be configured to receive an input minimum gain that adjusts the relative influence between the autoencoder path and the enhancement path on the enhanced waveform. In some examples, the encoder and/or the decoder can include filter-banks configured to have non-uniform time-frequency partitioning.
The end-to-end neural network can be configured to process two or more input waveforms and output a corresponding enhanced waveform for each of the two or more input waveforms. Further, the mask estimator can include a DNN path for each of the two or more input waveforms with shared layers between each path. In some examples, the encoder can include a single 1-dimensional convolutional neural network (CNN) layer with a plurality of filters and rectified linear activation functions. In some examples, the enhanced embeddings can be generated as element-wise products of the input embeddings and the estimated masks. The decoder can include a single 1-dimensional Transpose-CNN layer with an output filter configured to mimic overlap-and-add synthesis. The mask estimator can include a cepstral extraction network configured to cepstral normalize an output from the encoder. In some examples, the cepstral extraction network can be configured to perform feature normalization and can define a trainable extraction process that can include a log operator and a 1×1 CNN layer.
In some examples, the mask estimator can include a multi-layer fully convolutional network (FCN). The FCN can include a series of convolutional blocks. Each block can include a CNN filter process, a batch normalization process, an activation process, and/or a squeeze and excitation network process (SENet). In some embodiments, the mask estimator can include a sequence of FCNs arranged as a time-delay neural network (TDNN). In some embodiments, the mask estimator can include a plurality of FCNs arranged as a U-Net architecture. In some embodiments, the mask estimator can include a frame-level voice activity detector layer.
Examples of the end-to-end neural network can be trained to estimate clean speech by minimizing a first cost function representing a distance between the output and an underlying clean speech signal. In some examples, the end-to-end neural network can be trained as an autoencoder to reconstruct the noisy input speech by minimizing a second cost function representing a distance between the input speech and the enhanced speech. The end-to-end neural network can be trained to restrict enhancement to the mask estimator by minimizing a third cost function that represents a combination of distance between the output and an underlying clean speech signal and distance between the input speech and the enhanced speech such that, when the mask estimator is disabled, the output of the end-to-end neural network is configured to recreate the input waveform. The end-to-end neural network can be trained to minimize a distance measure between a clean speech signal and reverberant-noisy speech signal using a target waveform according to Equation 16 (see below) with the majority of late reflections suppressed. The end-to-end neural network can be trained using a generalized distance measure according to Equation 20 (see below). The end-to-end neural network can be configured to be dynamically tuned via the input minimum gain threshold to control a level of noise suppression present in the enhanced waveform.
Another example of the present disclosure is a method for training a neural network for detecting the presence of speech that includes constructing an end-to-end neural network configured to receive an input waveform containing speech and output an enhanced waveform. The neural network includes an autoencoder path and an enhancement path. The autoencoder path includes an encoder and a decoder, while the enhancement path includes the encoder, a mask estimator, and the decoder. The neural network is configured to receive an input minimum gain that adjusts the relative influence between the autoencoder path and the enhancement path on the enhanced waveform. The method further includes simultaneously training both the autoencoder path and the enhancement path using a loss function that includes a perceptually-motivated waveform distance measure.
The training method can further include training the neural network to estimate clean speech by minimizing a first cost function representing a distance between the output and an underlying clean speech signal. Further, the training method can include training the neural network as an autoencoder to reconstruct the noisy input speech by minimizing a second cost function representing a distance between the input speech and the enhanced speech. Still further, the training method can include training the neural network to restrict enhancement to the mask estimator by minimizing a third cost function that represents a combination of distance between the output and an underlying clean speech signal and distance between the input speech and the enhanced speech such that, when the mask estimator is disabled, the output of the end-to-end neural network can be configured to recreate the input waveform.
In at least some examples, the action of simultaneously training both the autoencoder path and the enhancement path can include minimizing a distance measure between a clean speech signal and reverberant-noisy speech signal using a target waveform according to Equation 16 (see below) with the majority of late reflections suppressed.
This disclosure will be more fully understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
Certain exemplary embodiments will now be described to provide an overall understanding of the principles of the structure, function, manufacture, and use of the devices and methods disclosed herein. One or more examples of these embodiments are illustrated in the accompanying drawings. Those skilled in the art will understand that the devices and methods specifically described herein and illustrated in the accompanying drawings are non-limiting exemplary embodiments and that the scope of the present disclosure is defined solely by the claims. The features illustrated or described in connection with one exemplary embodiment may be combined with the features of other embodiments. Such modifications and variations are intended to be included within the scope of the present disclosure. In the present disclosure, like-numbered components and/or like-named components of various embodiments generally have similar features when those components are of a similar nature and/or serve a similar purpose, unless otherwise noted or otherwise understood by a person skilled in the art.
Overview
Existing DNN approaches for speech enhancement, such as that shown in
As mentioned earlier, conventional enhancement methods often rely on user tuning to control the tradeoff between noise suppression and speech quality. Turning up the enhancement suppresses more noise, but typically at the cost of some speech distortion, while turning down the suppression leads to fewer distortions, but at the cost of more residual noise. However, in enhancement systems trained in an end-to-end manner, it may be difficult to interpret the internal components of the network. It then becomes very difficult to tune the network in an intuitive way. Examples of SEAMNET according to the present disclosure, however, can be trained in a way that retains the ability to fine tune the network. First, example SEAMNET systems can be trained to estimate clean speech by minimizing the distance between the network output and the underlying clean speech signal.
The costs can be combined, as shown in the example SEAMNET system 203 of
Even so, this type of black-box training can be difficult. To examine what the system was learning, the trained encoders 221 and decoders 229 can be observed, and they are intuitively satisfying from a speech science perspective. An example of the frequency responses of decoder filters is shown in
Finally, to evaluate the relative and absolute performance of example SEAMNET enhancement systems in the speech field, there are a number of quantitative measures available that can roughly correlate with listener perception. Examples of the present SEAMNET system can be evaluated with a number of these metrics, with a comparison between an existing DNN-based system and example SEAMNET systems demonstrating a clear advantage. Examples of SEAMNET systems can also be compared to a number of other recent neural-network based enhancement systems, and examples of SEAMNET can perform on par with or better than the bulk of neural-network based enhancement systems.
While objective speech quality metrics can be useful, in the end what often matters is how good the speech sounds. In conjunction with the present disclosures, informal listening experiments were conducted where participants were played various versions of processed noisy speech and were asked to grade the signals with respect to both overall quality and intelligibility. In a first experiment, signals processed with an example SEAMNET were played at varying maximum attenuation levels (these are levels that the user can tune during testing). It was observed that the reported quality score increases as the attenuation level increases. That is, as the enhancement becomes more aggressive, the perceived quality improves, but saturates at about 25 dB. Examples of SEAMNET are observed to maintain the intelligibility score of the unprocessed signal up to about 25 dB, but a significant drop is seen at about 40 dB. This experiment demonstrates how important the user tuning can be in navigating the tradeoff between noise suppression and speech quality. In another experiment, an example SEAMNET was compared with a DNN-based solution and SEAMNET was observed to provide a significant improvement in reported quality score. Additionally, examples of SEAMNET can maintain the intelligibility of the unprocessed signal, while the DNN-based system shows a significant drop.
b-Net Structure and SEAMNET Architecture
In this section, examples of the SEAMNET architecture are presented in more detail. Specifically, examples of the enhancement path, autoencoder path, and mask estimation network are defined.
The Enhancement Path
Recent studies on end-to-end DNN-based speech enhancement systems have utilized fully convolutional network (FCN) and U-Net architectures. The present example instead explores the b-Net structure illustrated in
The noisy input waveform can be expressed as

$$y_n = [y(n), \ldots, y(n+D-1)]^T, \qquad \text{(Equation 1)}$$

where D is the duration of the input signal 410 in samples, and $x_n$, which denotes the underlying clean speech waveform, is defined similarly. The b-Net system 400 first can include an encoder 421 that maps the input waveform 410 into a sequence of $N_f$ embeddings $Z_n = [z_{n,1}, \ldots, z_{n,N_f}]$, according to Equation 2:

$$Z_n = f_{enc}(y_n). \qquad \text{(Equation 2)}$$
The intended goal of this embedding can be to project the degraded speech into a subspace in which the speech and interfering signal components are separable via a scaling procedure. A mask estimator 430 can then generate a sequence of multiplicative attention masks $M_n = [m_{n,1}, \ldots, m_{n,N_f}]$, according to Equation 3:

$$M_n = f_{mask}(Z_n), \qquad \text{(Equation 3)}$$
and where the elements of $M_n$ lie within the range [0, 1]. The masks can be interpreted as predicting the presence of active speech in the elements of the embedding space. Enhanced versions of the input embeddings, $\hat{Z}_n = [\hat{z}_{n,1}, \ldots, \hat{z}_{n,N_f}]$, can be obtained as the element-wise product of the input embeddings and the estimated masks, expressed according to Equation 4:

$$\hat{z}_{n,t} = m_{n,t} \otimes z_{n,t}. \qquad \text{(Equation 4)}$$
Finally, the decoder 429 can synthesize the output waveform according to Equation 5:

$$\hat{x}_n = f_{dec}(\hat{Z}_n), \qquad \text{(Equation 5)}$$

where $\hat{x}_n$ is the enhanced speech signal. In at least some instances, the input signal 410 and output signal 450 of the example SEAMNET system 400 can be of the same duration, D. The processing chain in Equation 5 can be referred to herein as the enhancement path. The entirety of the example SEAMNET system 400 can be trained jointly using gradient descent, as described later in the SEAMNET Training section below.
In examples of the SEAMNET system, the encoder 421 can be composed of a single 1D CNN layer with $N_e$ filters and ReLU activation functions, with filter dimension $N_{in}$ and a stride of $N_{str}$. The encoder 421 can be designed to mimic conventional short-time analysis of speech. The decoder 429 can be composed of a single 1D Transpose-CNN layer with an output filter dimension $N_{out}$ and an overlap of $N_{str}$, and can be designed to mimic conventional overlap-and-add synthesis. The number of embeddings extracted from an input signal can be given by $N_f = \lceil D / N_{str} \rceil$.
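The structure described by Equations 1-5 can be sketched compactly in PyTorch. The following is a minimal illustrative sketch, not the disclosed implementation: the mask estimator is a stand-in for the full FCN described below, the decoder filter length is assumed equal to the encoder filter length ($N_{out} = N_{in}$), and all layer sizes are arbitrary examples.

```python
import torch
import torch.nn as nn

class BNet(nn.Module):
    """Minimal b-Net sketch: encoder -> attention masking -> decoder."""

    def __init__(self, n_e=256, n_in=320, n_str=160):
        super().__init__()
        # Encoder f_enc: a single 1D CNN layer with N_e filters and ReLU
        # activations, mimicking short-time analysis (Equation 2).
        self.encoder = nn.Sequential(
            nn.Conv1d(1, n_e, kernel_size=n_in, stride=n_str), nn.ReLU()
        )
        # Stand-in mask estimator f_mask (the full system uses an FCN with
        # cepstral extraction; Equation 3). Sigmoid keeps masks in [0, 1].
        self.mask_estimator = nn.Sequential(
            nn.Conv1d(n_e, n_e, kernel_size=1), nn.Sigmoid()
        )
        # Decoder f_dec: a single 1D Transpose-CNN layer mimicking
        # overlap-and-add synthesis (Equation 5).
        self.decoder = nn.ConvTranspose1d(n_e, 1, kernel_size=n_in, stride=n_str)

    def forward(self, y, g_min=0.0):
        z = self.encoder(y)             # input embeddings Z_n (Equation 2)
        m = self.mask_estimator(z)      # attention masks M_n (Equation 3)
        m = m.clamp(min=g_min)          # minimum-gain floor (Equation 13)
        return self.decoder(m * z)      # enhanced waveform (Equations 4-5)

y = torch.randn(1, 1, 16000)            # one second of 16 kHz audio
x_hat = BNet()(y)                       # enhanced output, same duration as input
```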
The b-Net structure of the system 400 can be interpreted as a generalization of statistical model-based speech enhancement methods. With existing systems, the short-time magnitude spectrogram can be extracted from the noisy input waveform, manipulated via a multiplicative mask, and the output waveform can be generated from the enhanced spectrogram through overlap-and-add synthesis using the original noisy phase signal. With the present b-Net, the Fourier analysis can be replaced by a set of generic encoder-decoder bases with non-linear activations, which can be learned jointly with the masking function, specifically for the speech enhancement task. Additionally, at least because signal phase can be implicitly incorporated into the encoder-decoder, in some instances there is no need to preserve or separately enhance the noisy phase component.
The Autoencoder Path
The attention masking module 430 can attenuate interfering signal components within the embedding space. However, a feature of the b-Net architecture can be the ability to disable this masking mechanism. The result can be an autoencoder path, as defined in Equation 6:

$$\hat{y}_n = f_{dec}(Z_n). \qquad \text{(Equation 6)}$$
Other existing speech enhancement solutions using end-to-end architectures such as the FCN or U-Net do not contain an analogous autoencoder path. As discussed in the SEAMNET Training section below, the existence of an autoencoder path allows the user to dynamically control the level of noise suppression via a minimum gain level.
The Mask Estimation Network
In the b-Net architecture of the example system 400, enhancement can be performed via attention masking in the embedding space defined by fenc so that interfering signal components can be appropriately attenuated. The goal of the mask estimation block 430 in
Cepstral Extraction 431: The mask estimation network 430 can include the trainable cepstral extraction process 431 illustrated in more detail in the drawings, where ⊘ denotes the Hadamard division operator. The number of filters in the CNN layer is denoted by $N_c$. The CNN outputs are unit normalized across each filter by first subtracting a filter-dependent Global Mean 464 and element-wise dividing by the filter-dependent Global Standard Deviation 465. The example cepstral extraction mimics conventional cepstral processing, wherein a linear transform, e.g., the discrete cosine transform (DCT), can be applied after a log operation to de-correlate spectral features prior to further processing. However, in the provided approach, the linear transform can be trainable, and can be interpreted as de-correlating the embeddings $z_{n,t}$. Let $C_n = [c_{n,1}, \ldots, c_{n,N_f}]$ denote the sequence of cepstral feature vectors extracted from $Z_n$, where $C_n \in \mathbb{R}^{N_c \times N_f}$. The cepstral features can be normalized according to Equation 7:

$$c_{n,t} \Leftarrow (c_{n,t} - \mu_n) \oslash \lambda_n, \qquad \text{(Equation 7)}$$

with the terms of Equation 7 defined according to Equation 8:

$$\mu_n = \frac{1}{N_f} \sum_{t} c_{n,t}, \qquad \lambda_n = \sqrt{\frac{1}{N_f} \sum_{t} (c_{n,t} - \mu_n)^2}, \qquad \text{(Equation 8)}$$

where the square root can be applied element-wise.
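A minimal sketch of this trainable cepstral extraction is given below, assuming illustrative dimensions and a small offset inside the log for numerical stability (the offset is an assumption, not part of the disclosure):

```python
import torch
import torch.nn as nn

class CepstralExtraction(nn.Module):
    """Sketch of trainable cepstral extraction with global normalization."""

    def __init__(self, n_e=256, n_c=256, eps=1e-6):
        super().__init__()
        self.eps = eps
        # Trainable 1x1 CNN applied after the log, playing the role of the
        # DCT in conventional cepstral processing.
        self.linear = nn.Conv1d(n_e, n_c, kernel_size=1)

    def forward(self, z):                          # z: (batch, N_e, frames)
        c = self.linear(torch.log(z + self.eps))   # log operator + 1x1 CNN
        mu = c.mean(dim=-1, keepdim=True)          # filter-wise global mean
        lam = c.var(dim=-1, keepdim=True).sqrt()   # filter-wise global std (Equation 8)
        return (c - mu) / (lam + self.eps)         # normalization (Equation 7)
```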
Mask Estimation: The normalized encoder features of Equation 7 can be applied to an FCN, as shown in
As can be observed in Table 1, the first five layers exhibit increasing filter dilation rates, allowing the FCN to summarize increasing temporal contexts. The next four layers apply 1×1 CNN layers, and can be interpreted as improving the discriminative power of the overall network. Finally, the FCN can include a layer with channel-wise sigmoid activations, providing outputs within the range [0, 1], which are appropriate for multiplicative masking. Let $h_{n,t} \in \mathbb{R}^{N_h}$ denote the output of the final FCN layer, and let $W_{mask} \in \mathbb{R}^{N_h \times N_e}$ and $b_{mask}$ denote trainable weights and biases. The estimated masks can then be expressed according to Equation 9:

$$m_{n,t} = \sigma(W_{mask}^T h_{n,t} + b_{mask}), \qquad \text{(Equation 9)}$$

where σ(·) denotes the element-wise sigmoid function.
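One plausible form of such a convolutional block, with a squeeze-and-excitation stage re-weighting channels by global temporal context, is sketched below; the channel counts, kernel size, dilation rate, and SENet reduction factor are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: re-weight channels using global temporal context."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, h):                  # h: (batch, channels, frames)
        s = self.fc(h.mean(dim=-1))        # squeeze over time, excite channels
        return h * s.unsqueeze(-1)

class ConvBlock(nn.Module):
    """CNN filter -> batch normalization -> activation -> SENet."""

    def __init__(self, channels=256, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2    # keep the frame count fixed
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.BatchNorm1d(channels), nn.ReLU(), SEBlock(channels),
        )

    def forward(self, h):
        return self.net(h)
```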
Voice Activity Detection: Whereas Equation 9 describes feature-specific masking, aspects of the present disclosure can include a layer that applies additional frame-based masking. If $w_{vad} \in \mathbb{R}^{N_h}$ and $b_{vad}$ denote trainable weights and a bias, a frame-level voice activity estimate can be expressed as Equation 10:

$$v_{n,t} = \sigma(w_{vad}^T h_{n,t} + b_{vad}). \qquad \text{(Equation 10)}$$
The final mask estimation output from Equation 4 can then be expressed in terms of Equations 9 and 10 as Equation 11:

$$m_{n,t} = v_{n,t} \cdot \sigma(W_{mask}^T h_{n,t} + b_{mask}). \qquad \text{(Equation 11)}$$
The final mask estimation layer can be interpreted as performing frame-based voice activity detection, and applying additional attenuation of the input signal during frames that lack active speech signal.
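Equations 9-11 can be realized as two small heads on the FCN output, as in the hedged sketch below (1×1 convolutions apply the weights $W_{mask}$ and $w_{vad}$ across all frames at once; the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Feature-wise masks (Equation 9) gated by a frame-level VAD (Equations 10-11)."""

    def __init__(self, n_h=256, n_e=256):
        super().__init__()
        self.mask = nn.Conv1d(n_h, n_e, kernel_size=1)  # applies W_mask, b_mask
        self.vad = nn.Conv1d(n_h, 1, kernel_size=1)     # applies w_vad, b_vad

    def forward(self, h):                    # h: (batch, N_h, frames), FCN output
        m = torch.sigmoid(self.mask(h))      # feature-specific masks (Equation 9)
        v = torch.sigmoid(self.vad(h))       # voice activity estimate (Equation 10)
        return v * m                         # final masks m_{n,t} (Equation 11)
```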
SEAMNET Training

In this section, an example SEAMNET training process is described. Specifically, simultaneous training of the enhancement and autoencoder paths is disclosed. Additionally, enabling joint suppression of noise and reverberation within an end-to-end system is described. Finally, a perceptually-motivated distance measure is presented.
Training The Enhancement and Autoencoder Paths
In the context of statistical model-based enhancement systems, many studies have addressed the issue of musical noise, which can occur when mask-based enhancement produces a residual noise signal containing narrowband transient components. An efficient technique for minimizing such effects can be applying a minimum gain threshold. Flooring multiplicative enhancement masks at a minimum gain, Gmin, can decrease speech distortion and increase the naturalness of the residual noise signal, helping to avoid perceptually annoying artifacts. A minimum gain threshold can also allow the user to control the inherent tradeoff between speech quality and noise suppression that exists in mask-based enhancement systems.
In conventional enhancement systems, short-time spectral analysis, e.g., the STFT, can be applied to the input signal prior to masking, and the overlap-and-add method can be used to synthesize the output waveform. Using the STFT can guarantee perfect reconstruction of the input signal for Gmin=1.0. By minimizing the distortion associated with the autoencoder path, Equation 6, the combined effect of the encoder and decoder can approximate this perfect reconstruction property. In examples of the SEAMNET system, the ability of the autoencoder path to reconstruct the input can be ensured by using the multi-component loss defined by Equation 12:
$$\mathcal{L} = (1-\alpha) \cdot d(x_n, \hat{x}_n) + \alpha \cdot d(y_n, \hat{y}_n), \qquad \text{(Equation 12)}$$
where d(·) denotes some distance measure, $\hat{x}_n$ is the output of the enhancement path from Equation 5, $\hat{y}_n$ is the output of the autoencoder path from Equation 6, and α is a constant. In this way, the enhancement and autoencoder paths within SEAMNET can be simultaneously trained, and α can control the balance between the two.
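Equation 12 translates directly into a training loss; the sketch below uses MSE as a placeholder for the distance measure d(·), for which the perceptually-motivated measure of Equation 20 can be substituted:

```python
import torch.nn.functional as F

def multi_component_loss(x, x_hat, y, y_hat, alpha=0.5, d=F.mse_loss):
    """Equation 12: simultaneously train the enhancement and autoencoder paths.

    x, x_hat: clean target and enhancement-path output.
    y, y_hat: noisy input and autoencoder-path output.
    alpha balances the two paths; d is any waveform distance measure.
    """
    return (1.0 - alpha) * d(x_hat, x) + alpha * d(y_hat, y)
```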
The b-Net architecture can allow for a minimum gain threshold to be dynamically tuned during enhancement. The enhanced output waveform from Equation 5 can be generalized as Equation 13:
$$\hat{x}_n = f_{dec}(\max\{M_n, G_{min}\} \otimes Z_n), \qquad \text{(Equation 13)}$$
where $G_{min}$ can be specified by the user during testing to control the tradeoff between noise suppression and speech quality. Note that for $G_{min} = 1.0$, the outputs of the enhancement and autoencoder paths are identical, as expressed by Equation 14:

$$\hat{x}_n \big|_{G_{min}=1.0} = \hat{y}_n, \qquad \text{(Equation 14)}$$
and, for a system trained with the multi-component loss from Equation 12, setting $G_{min} = 1.0$ will ensure that the enhancement path output is a close approximation to the original noisy speech, as expressed by Equation 15:

$$\hat{x}_n \big|_{G_{min}=1.0} \approx y_n. \qquad \text{(Equation 15)}$$
This is similar to the perfect reconstruction property of conventional masking-based enhancement systems. Other end-to-end architectures, such as the FCN and U-Net, do not exhibit an analogous reconstruction property. Instead, within such systems, noise suppression is typically performed throughout network layers, and no control over the level of suppression is typically exposed to the user.
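Using the hypothetical BNet sketch from the enhancement-path discussion above, the reconstruction behavior of Equations 14 and 15 can be exercised directly; the gain values are illustrative:

```python
import torch

model = BNet()                        # the hypothetical b-Net sketch from above
y = torch.randn(1, 1, 16000)          # a noisy input waveform

# G_min = 1.0 disables masking: the enhancement path collapses onto the
# autoencoder path (Equation 14), which, for a system trained with the
# multi-component loss, approximately reconstructs the input (Equation 15).
y_hat = model(y, g_min=1.0)

# Lower minimum gains permit progressively more aggressive suppression.
for g_min_db in (-10.0, -25.0, -40.0):
    x_hat = model(y, g_min=10.0 ** (g_min_db / 20.0))
```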
Joint Suppression of Noise and Reverberation
Some existing end-to-end speech enhancement systems have proven successful at suppressing additive noise. However, it is not believed that a study has addressed suppression of reverberation with an end-to-end system, such as provided by aspects of the present disclosure. This may be due, at least in part, to the significant phase distortion introduced by reverberation, which makes a waveform-based mapping difficult to learn. In this section, a novel method is described for designing target waveforms that allow end-to-end systems to be trained to perform joint suppression of both additive noise and reverberation.
Typically, end-to-end systems are trained with parallel data in which known clean speech is corrupted with additive noise; the system learns the inverse mapping. However, in many realistic environments, speech signals are captured in the presence of additive noise and reverberation. As mentioned above, let x(k), w(k), and y(k) denote the underlying clean, reverberated-only, and reverberant-noisy speech signals, respectively. Let Xm,l represent the STFT of x(k), where m and l denote frequency channel and frame index, respectively, and let Wm,l be defined similarly. An enhanced version of Wm,l can be obtained using an oracle Wiener Filter, according to Equation 16:
where $\eta_{max} = 1.0$ and $\eta_{min} = 0.1$ can be the maximum and minimum gain limits. The corresponding waveform, x*(k), can be synthesized via the inverse STFT. The signal x*(k) then represents a version of the reverberant signal w(k) with the majority of late reflections suppressed, but with the phase distortion introduced by early reflections still present. This allows an end-to-end system, such as examples of the present SEAMNET system, to be trained to perform joint suppression of noise and reverberation by learning a mapping from y(k) to x*(k) through the minimization of some distance measure $d(x^*_n, \hat{x}_n)$.
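Equation 16 itself is not reproduced above; the sketch below assumes one standard form of the oracle Wiener gain, $|X_{m,l}|^2 / (|X_{m,l}|^2 + |W_{m,l} - X_{m,l}|^2)$, clipped to $[\eta_{min}, \eta_{max}]$ and applied to the reverberant STFT so that the phase of w(k) is retained. This matches the behavior described in the surrounding text, but should be read as an assumption rather than the disclosed equation:

```python
import torch

def design_target(x, w, n_fft=512, hop=128, eta_min=0.1, eta_max=1.0):
    """Oracle Wiener-filtered target x*(k) from clean x(k) and reverberant w(k)."""
    win = torch.hann_window(n_fft)
    X = torch.stft(x, n_fft, hop_length=hop, window=win, return_complex=True)
    W = torch.stft(w, n_fft, hop_length=hop, window=win, return_complex=True)
    # Assumed oracle Wiener gain, treating late reverberation (W - X) as
    # the interference component; clipped to [eta_min, eta_max].
    gain = X.abs() ** 2 / (X.abs() ** 2 + (W - X).abs() ** 2 + 1e-10)
    gain = gain.clamp(min=eta_min, max=eta_max)
    # Apply the gain to W (retaining its phase, and hence the early-reflection
    # phase distortion) and synthesize x*(k) via the inverse STFT.
    return torch.istft(gain * W, n_fft, hop_length=hop, window=win, length=x.shape[-1])
```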
Perceptually-Motivated Distance Measure
Training an end-to-end speech enhancement system, such as examples of the present SEAMNET system, can require a distance measure that operates on time-domain samples. Initial studies on end-to-end enhancement systems optimized network parameters using the mean squared error (MSE) between the output waveform, $\hat{x}_n$, and the clean waveform, $x_n$, given by Equation 17:

$$d_{MSE}(x_n, \hat{x}_n) = \frac{1}{D} \left\| x_n - \hat{x}_n \right\|_2^2. \qquad \text{(Equation 17)}$$
However, Equation 17 does not take into account properties of human perception of speech, and may not result in an enhanced signal that optimizes perceptual quality. While recent studies have proposed loss functions that address these issues, disclosed herein is an alternative version of MSE, which is perceptually motivated and computationally efficient.
Speech signals exhibit a steep spectral slope so that higher frequencies show a reduced dynamic range. To compensate for this, many conventional speech processing systems include a pre-emphasis filter designed to amplify the higher frequency ranges prior to further processing. Typically, pre-emphasis is implemented as a 1st-order moving average filter, according to Equation 18:
$$x(k) \Leftarrow x(k) - \beta x(k-1). \qquad \text{(Equation 18)}$$
Additionally, human hearing is more sensitive to the smaller waveform amplitudes within a given acoustic signal. In the context of speech signal compression, non-linear companding functions can be used to compensate for this effect during quantization. A commonly studied example is the μ-law companding function, which is expressed as Equation 19:

$$F_\mu(x) = \operatorname{sign}(x) \, \frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)}, \qquad \text{(Equation 19)}$$
where μ controls the level of companding. The MSE loss from Equation 17 can be generalized to include the effects of both pre-emphasis and companding, leading to Equation 20:

$$d_{pMSE}(x_n, \hat{x}_n) = \frac{1}{D} \left\| F_\mu(x_n^{(\beta)}) - F_\mu(\hat{x}_n^{(\beta)}) \right\|_2^2, \qquad \text{(Equation 20)}$$

where $x_n^{(\beta)}$ and $\hat{x}_n^{(\beta)}$ denote the pre-emphasized waveforms from Equation 18, and $F_\mu(\cdot)$ is applied element-wise.
Equation 20 offers a generalized distance measure that can be tuned to account for various properties of human perception. For settings β=0.0 and μ→0.0, the proposed measure can be equivalent to the standard MSE in Equation 17. The perceptually-motivated MSE from Equation 20 can be used during SEAMNET training. When joint suppression of noise and reverberation is enabled, the distance measure $d_{pMSE}(x^*_n, \hat{x}_n)$ can be used.
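A sketch of the resulting distance measure, combining pre-emphasis (Equation 18), μ-law companding (Equation 19), and MSE, using the β and μ values from the experimental setup described below:

```python
import math
import torch

def pre_emphasis(x, beta=0.5):
    """Equation 18: 1st-order pre-emphasis; the first sample passes through."""
    return torch.cat([x[..., :1], x[..., 1:] - beta * x[..., :-1]], dim=-1)

def mu_law(x, mu=5.0):
    """Equation 19: mu-law companding (waveforms assumed scaled to [-1, 1])."""
    return torch.sign(x) * torch.log1p(mu * x.abs()) / math.log1p(mu)

def pmse(x, x_hat, beta=0.5, mu=5.0):
    """Equation 20: perceptually-motivated MSE with pre-emphasis and companding."""
    fx = mu_law(pre_emphasis(x, beta), mu)
    fx_hat = mu_law(pre_emphasis(x_hat, beta), mu)
    return torch.mean((fx - fx_hat) ** 2)
```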
Experimental Results
This section outlines an example experimental procedure. The training corpus is described, and experimental results for examples of the SEAMNET system are provided in terms of objective speech quality metrics and subjective listening tests. The interpretability of various layers within examples of the SEAMNET system are then discussed.
Training Data
As discussed above, some examples of the SEAMNET system may require three-part parallel training data. A corpus of degraded speech can be designed based on clean speech from the TIMIT corpus (ISBN: 1-58563-019-5), using room impulse responses (RIRs) from the Voice-Home package and additive noise and music from the MUSAN data set (available from http://www.openslr.org/17/). Training files were created according to the following recipe: first, clean speech signals, x(k), were simulated by concatenating eight (8) randomly selected TIMIT files, with random amounts of silence between each. Additionally, randomized gains can be applied to each input file to simulate the presence of both near-field and far-field talkers. Next, an RIR can be selected from the Voice-Home set, and artificially windowed to match a target reverberation time uniformly sampled from the range [0.0 s, 0.5 s], giving the reverberant version of the signal, w(k). Finally, two additive noise files can be selected from the MUSAN corpus, the first from the Free-Sound background noise subset, and the other either from the music corpus or the Free-Sound non-stationary noise subset. These files can be combined with random gains, resulting in the noise signal. The noise signal can be mixed with the reverberant speech signal to match a target SNR, with targets sampled substantially uniformly from [−2 dB, 20 dB], resulting in the reverberant and noisy signal, y(k). The duration of the training files averaged 30 s, and the total corpus contained 500 hr of data. In practice, there are several other speech, noise, and RIR libraries that are available, and this paragraph describes just one possible example set.
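The SNR-matching step of this recipe can be sketched as follows; the function and variable names are illustrative, not from the disclosure:

```python
import torch

def mix_at_snr(w, noise, snr_db):
    """Scale `noise` so the mixture w + noise hits the target SNR, then mix."""
    p_speech = w.pow(2).mean()
    p_noise = noise.pow(2).mean()
    gain = torch.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return w + gain * noise

w = torch.randn(16000)                               # reverberant speech w(k)
noise = torch.randn(16000)                           # combined noise signal
snr_db = torch.empty(1).uniform_(-2.0, 20.0).item()  # target SNR in [-2, 20] dB
y = mix_at_snr(w, noise, snr_db)                     # reverberant-noisy y(k)
```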
Experimental Setup
The corpus described above was used to train example SEAMNET systems in a number of experimental tests. Separate versions of the SEAMNET system can be trained for narrowband speech, fs=8 kHz, and wideband speech, fs=16 kHz. The network architecture parameters for each (e.g., narrowband and wideband speech) are summarized in Table 2. The following training parameters were used for both versions: α=0.5 for the multi-component loss in Equation 12, β=0.5 and μ=5.0 for the distance measure in Equation 20, and Gmin=0.0 for Equation 13, though this parameter can be dynamically tuned during testing. During an example SEAMNET training, the Adam optimizer was used for 20 epochs. The narrowband and wideband example versions of the SEAMNET system contained 4.7M and 5.1M trainable parameters, respectively.
Objective Results
A database, such as the Voice Cloning Toolkit (VCTK) database, can include a parallel clean-corrupted speech corpus designed for training and testing enhancement methods. Both the noisy-reverberant and noise-only versions of VCTK test set can be utilized to evaluate the performance of example SEAMNET systems. Except for the results detailed in Table 5, none of the VCTK speech was included in the SEAMNET training procedure, at least in this instance. For all experiments, the minimum gain was set to Gmin=−25 dB.
First, an ablation study was performed to assess the effectiveness of the various components comprising examples of the SEAMNET system, and objective speech quality results are provided in Table 3. Specifically, results are reported in Table 3 in terms of PESQ, STOI, segmental SNR improvement ΔSSNR, and the composite signal, background, and overall quality scores (CSIG, CBAK, COVL, respectively). The first row of Table 3 includes results for the unprocessed input signal. Next, the second row of Table 3 can provide results for a baseline narrowband SEAMNET system, which can follow the b-Net structure from
In each subsequent row of Table 3 beyond the second, an additional feature has been cumulatively added to the example SEAMNET system. The third row provides objective results when the joint noise-reverberation suppression (detailed above) is introduced. Table 3 shows that joint suppression of noise and reverberation can provide significant performance improvements over the conventional training scheme, and the improvements are noticeable across all objective measures. Informal listening revealed that the proposed training method led to significantly attenuated reverberant tails, especially for files with more severe acoustic environments.
The fourth, fifth, and sixth rows of Table 3 detail the incremental results of adding the CMVN, including a SENet layer in the FCN modules, and utilizing the perceptually-motivated distance metric, respectively. Table 3 shows that the addition of each feature led to performance improvements across most of the objective measures. In informal listening tests, these features seemed to reduce residual noise, especially during periods of inactive speech.
Finally, the seventh and last row of Table 3 provides results for adding the VAD layer described above. Including the VAD layer feature provided improvements in STOI and ΔSSNR, but led to performance degradation for other objective measures. During informal listening tests, the VAD layer provided further reduction of residual noise, especially during periods of inactive speech, but at the cost of some speech distortion.
Next, a comparative experiment was designed to compare the performance of an example SEAMNET system with an example of an existing state-of-the-art system, in which a recurrent neural network was used to predict a multiplicative mask in the short-time spectral magnitude domain. Further, in the existing system of the comparative experiment, the mask was trained to perform joint suppression of noise and reverberation. The noisy-reverberant version of the VCTK test set was again employed for this comparative experiment. Table 4 provides results from this comparative experiment in terms of the composite scores for signal, background, and overall quality. Table 4 shows that examples of SEAMNET can provide significant performance improvements relative to the state-of-the-art system, for both the narrowband and wideband systems. One explanation for this improvement is the ability of SEAMNET to enhance the short-time phase signal of the input, which is not possible within the STFT magnitude-only analysis-synthesis context of the state-of-the-art system.
Finally, a second comparative experiment was designed to compare examples of the wide-band SEAMNET system with a variety of state-of-the-art end-to-end enhancement systems. At least because prior end-to-end approaches have only addressed additive noise suppression, the noise-only version of VCTK was used as a test set for this second comparative experiment. Table 5 provides results in terms of composite quality scores for the second comparative experiment. In Table 5, Wiener represents a conventional statistical model-based system, but the remaining baselines represent state-of-the-art, end-to-end DNN-based approaches, all of which were trained using the noisy VCTK training set. For fair comparison, in this experiment, the example SEAMNET system was trained using this set, and the system was trained in a conventional manner to learn a mapping from a waveform with additive noise to the underlying clean version. Table 5 shows that the SEAMNET system performs comparably to the baseline systems, despite not exploiting the full potential of performing joint suppression of noise and reverberation.
Subjective Results
To further test the performance of examples of the SEAMNET system, an informal listening test was conducted to assess the perceived quality of enhanced speech. The listening test was administered in five (5)-trial sessions via a Matlab-based GUI. For each trial, the participant was presented with five (5) unlabeled versions of a randomly chosen sample from the noisy and reverberant VCTK corpus, namely: (1) the original, unprocessed version, (2) the output of the spectral-based enhancement system from the existing state-of-the-art system, (3) the output of an example of the SEAMNET system with Gmin=−10 dB, (4) the output of an example of the SEAMNET system with Gmin=−25 dB, and (5) the output of an example of the SEAMNET system with Gmin=−40 dB.
In the listening test, each participant was first prompted to score each of the samples listened to with respect to overall quality, and was asked to take into account the general level of noise and reverberation in the signal, the naturalness of the speech signal, and the naturalness of the residual noise. Rather than a ranking scheme, participants were asked to assign a value to each sample across a continuous scale ranging from 0 (worst) to 1 (best). They were also instructed to assign these values with regard to their relative ranking of the samples and their perceived degree of preference. Specifically, the following instructions were provided: “If two samples are perceptually very similar, please assign them a small value difference. Samples for which you have a very distinct perceptual preference should have a larger value difference.” Each participant was then prompted to score each of the samples with respect to intelligibility using a similar scale, and was asked to judge the clarity of the words in the given audio.
Results from the listening test are provided in Table 6 and Table 7. In both Table 6 and Table 7, scores are trial-normalized, and averaged across 65 total trials from 13 sessions. That is, for each trial, raw scores from the participants are linearly transformed so that the lowest and highest reported scores are mapped to 0 and 1, respectively. In both Table 6 and Table 7, results in bold denote the best result for each experiment.
Table 6 provides a study on the effect of the minimum gain Gmin on the perceived speech quality of the SEAMNET system example. In terms of overall quality, the Gmin=−25 dB setting resulted in significant performance improvements over each of the other cases. Specifically, the −25 dB setting provided a 14% relative improvement in the trial-normalized overall quality score compared to the −40 dB case, despite the more aggressive noise suppression allowed by the latter system. In terms of intelligibility, the Gmin=−25 dB setting maintained the intelligibility score of the unprocessed input, whereas the −40 dB case suffered a 27% relative degradation. The mildest attenuation case (−10 dB) achieved the highest perceived intelligibility, preferred over the input. While this result has yet to be confirmed by formal quantitative intelligibility tests, it does highlight the quality-intelligibility tradeoff inherent in the enhancement application. Overall, the results in Table 6 show the strong effect of the minimum gain level on the subjective speech quality of examples of the SEAMNET system, and highlight the importance of allowing the listener to control Gmin depending on their specific focus.
Table 7 provides a comparison of an example SEAMNET system with Gmin=−25 dB to an existing state-of-the-art spectral-based enhancement system. The baseline systems from Table 5 were not included in the listening tests at least because they were designed solely for suppression of additive noise. In terms of overall quality, the example SEAMNET system provided a significant improvement in subjective scores relative to the comparison system (e.g., Spectral-Based in Table 7). Specifically, an example SEAMNET system resulted in a 23% relative improvement in the trial-normalized overall quality score. In terms of intelligibility, it can be observed that the Spectral-Based system suffered an 18% relative performance degradation compared to the unprocessed input. The example SEAMNET system, on the other hand, maintained the intelligibility score of the unprocessed input.
Interpretability of Example SEAMNET Systems
An analysis of the learned parameters of example SEAMNET systems offers some observations that are consistent with speech science intuition. For example, the encoder in the SEAMNET system can be interpreted as decomposing the input signal into an embedding space in which speech and interfering signal components are separable via masking. Similarly, the decoder in the SEAMNET system can synthesize an output waveform from the learned embedding. The behavior of examples of the SEAMNET decoder is illustrated in
Various observations can be made from
Certain aspects of the Speech Enhancement via Attention Masking Network, an end-to-end system for joint suppression of noise and reverberation, can be summarized as follows: First, the b-Net, an end-to-end mask-based enhancement architecture. The explicit masking function in the b-Net architecture enables a user to dynamically control the tradeoff between noise suppression and speech quality via a minimum gain threshold. Second, a loss function, which can simultaneously train both an enhancement and an autoencoder path within the overall network. Finally, a method for designing target signals during system training so that joint suppression of noise and reverberation can be performed within an end-to-end enhancement system. The experimental results show that example systems outperform state-of-the-art methods, both in terms of objective speech quality metrics and subjective listening tests.
While the spectrograms of
SEAMNET Algorithm Improvements
A number of improvements to the basic SEAMNET system described above have been developed as well. The following sections detail three architecture/algorithm changes, each of which improves the objective performance of SEAMNET systems: (1) multi-resolution time-frequency partitioned encoder and decoder filters, (2) a U-Net mask estimation network, and (3) multi-channel processing with shared masking layers. In addition to these structural changes, improvements to the objective performance of SEAMNET systems were also developed by expanding the training data used, for example by adding hundreds of hours of noise samples to the training data and increasing the impulse response variability. This expansion and diversification of the training data, in addition to the structural changes detailed below, substantially improved the objective performance of examples of the SEAMNET system. Table 8 shows a comparison between an unprocessed signal, a SEAMNET system configured without these structural changes and improved training data, and finally a SEAMNET system (“Improved SEAMNET”) using all of these structural improvements and expanded training data.
The SEAMNET improvements were designed to enhance the system's ability to represent the input acoustic signal in a perceptually relevant embedding space, and to increase the robustness of the system to varying and difficult acoustic environments. The results in Table 8 were obtained on the Voice Cloning Toolkit (VCTK) test corpus, which contains speech with synthetically added reverberation and noise. The test corpus includes signals sampled at 16 kHz, and none of the test corpus material was included in the training.
Multi-Resolution Encoders and Decoders
The encoder and decoder filters can have a fixed time-frequency partition resolution, as shown in
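One plausible realization of such a multi-resolution encoder runs parallel 1D CNN branches with different filter lengths and concatenates their embeddings; the branch sizes below are illustrative assumptions, not the disclosed configuration:

```python
import torch
import torch.nn as nn

class MultiResolutionEncoder(nn.Module):
    """Parallel encoder branches with different time-frequency resolutions."""

    def __init__(self, n_e=128, kernel_sizes=(80, 160, 320), n_str=40):
        super().__init__()
        # Short filters give fine time resolution; long filters give fine
        # frequency resolution. Padding keeps all branch outputs aligned.
        self.branches = nn.ModuleList(
            [nn.Sequential(
                nn.Conv1d(1, n_e, kernel_size=k, stride=n_str, padding=k // 2),
                nn.ReLU(),
            ) for k in kernel_sizes]
        )

    def forward(self, y):                # y: (batch, 1, samples)
        # Concatenate branch embeddings along the channel axis.
        return torch.cat([branch(y) for branch in self.branches], dim=1)
```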
Mask Estimation Network
The mask estimation network described above, and as shown, for example, in
True Stereo Functionality
The b-Net architectures described above (e.g., system 300 of
The memory 1220 can store information within the system 1200. In some implementations, the memory 1220 can be a computer-readable medium. The memory 1220 can, for example, be a volatile memory unit or a non-volatile memory unit. In some implementations, the memory 1220 can store information related to various sounds, noises, environments, and spectrograms, among other information.
The storage device 1230 can be capable of providing mass storage for the system 1200. In some implementations, the storage device 1230 can be a non-transitory computer-readable medium. The storage device 1230 can include, for example, a hard disk device, an optical disk device, a solid-state drive, a flash drive, magnetic tape, or some other large capacity storage device. The storage device 1230 may alternatively be a cloud storage device, e.g., a logical storage device including multiple physical storage devices distributed on a network and accessed using a network. In some implementations, the information stored on the memory 1220 can also or instead be stored on the storage device 1230.
The input/output device 1240 can provide input/output operations for the system 1200. In some implementations, the input/output device 1240 can include one or more of network interface devices (e.g., an Ethernet card), a serial communication device (e.g., an RS-232 port), and/or a wireless interface device (e.g., a short-range wireless communication device, an 802.11 card, a 3G wireless modem, or a 4G wireless modem). In some implementations, the input/output device 1240 can include driver devices configured to receive input data and send output data to other input/output devices, e.g., a keyboard, a printer, and display devices (such as the GUI 12). In some implementations, mobile computing devices, mobile communication devices, and other devices can be used.
In some implementations, the system 1200 can be a microcontroller. A microcontroller is a device that contains multiple elements of a computer system in a single electronics package. For example, the single electronics package could contain the processor 1210, the memory 1220, the storage device 1230, and input/output devices 1240.
Although an example processing system has been described above, implementations of the subject matter and the functional operations described above can be implemented in other types of digital electronic circuitry, or in computer software, firmware, and/or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible program carrier, for example a computer-readable medium, for execution by, or to control the operation of, a processing system. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine readable propagated signal, or a combination of one or more of them.
Various embodiments of the present disclosure may be implemented at least in part in any conventional computer programming language. For example, some embodiments may be implemented in a procedural programming language (e.g., “C”), or in an object-oriented programming language (e.g., “C++”). Other embodiments of the invention may be implemented as a pre-configured, stand-alone hardware element and/or as preprogrammed hardware elements (e.g., application specific integrated circuits, FPGAs, and digital signal processors), or other related components.
The term “computer system” may encompass all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. A processing system can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (also known as a program, software, software application, script, executable logic, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
Such implementation may include a series of computer instructions fixed either on a tangible, non-transitory medium, such as a computer readable medium. The series of computer instructions can embody all or part of the functionality previously described herein with respect to the system. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile or volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks or magnetic tapes; magneto optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies.
Among other ways, such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the network (e.g., the Internet or World Wide Web). In fact, some embodiments may be implemented in a software-as-a-service model (“SAAS”) or cloud computing model. Of course, some embodiments of the present disclosure may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the present disclosure are implemented as entirely hardware, or entirely software.
Examples of the present disclosure include:
1. A computer-implemented system for recognizing and processing speech, comprising:
a processor configured to execute an end-to-end neural network trained to detect speech in the presence of noise and distortion, the end-to-end neural network configured to receive an input waveform containing speech and output an enhanced waveform.
2. The system of example 1, wherein the end-to-end neural network defines a b-Net structure comprising an encoder path configured to map the input waveform into a sequence of input embeddings in which speech signal components and non-speech signal components are separable via a scaling procedure.
3. The system of example 2, wherein the encoder path comprises a single 1-dimensional convolutional neural network (CNN) layer with a plurality of filters and rectified linear activation functions.
4. The system of example 2 or 3, wherein the b-Net structure comprises a mask estimator configured to generate a sequence of multiplicative attention masks, the b-Net structure being configured to utilize the multiplicative attention masks to create a sequence of enhanced embeddings from the sequence of input embeddings.
5. The system of example 4, wherein the enhanced embeddings are generated as element-wise products of the input embeddings and the estimated masks.
6. The system of example 5, wherein the b-Net structure comprises a decoder path configured to synthesize an output waveform based on the sequence of enhanced embeddings.
7. The system of example 6, wherein the decoder path comprises a single 1-dimensional Transpose-CNN layer with an output filter configured to mimic overlap-and-add synthesis.
8. The system of any of examples 4 to 7, wherein the mask estimator comprises a cepstral extraction network configured to cepstral normalize an output from the encoder path.
9. The system of example 8, wherein the cepstral extraction network is configured to perform feature normalization and define a trainable extraction process that comprises a log operator and a 1×1 CNN layer.
10. The system of any of examples 4 to 9, wherein the mask estimator comprises a multi-layer fully convolutional network (FCN).
11. The system of example 10, wherein the FCN comprises a series of convolutional blocks, each comprising a CNN filter process, a batch normalization process, an activation process, and a squeeze and excitation network process (SENet).
12. The system of example 10 or 11, wherein the mask estimator comprises a frame-level voice activity detector layer.
13. The system of any of examples 4 to 12, wherein the end-to-end neural network is trained to estimate clean speech by minimizing a first cost function representing a distance between the output and an underlying clean speech signal.
14. The system of any of examples 4 to 13, wherein the end-to-end neural network is trained as an autoencoder to reconstruct the noisy input speech by minimizing a second cost function representing a distance between the input speech and the enhanced speech.
15. The system of any of examples 4 to 14, wherein the end-to-end neural network is trained to restrict enhancement to the mask estimator by minimizing a third cost function that represents a combination of distance between the output and an underlying clean speech signal and distance between the input speech and the enhanced speech such that, when the mask estimator is disabled, the output of the end-to-end neural network is configured to recreate the input waveform.
16. The system of any of examples 4 to 15, wherein the end-to-end neural network is trained to minimize a distance measure between a clean speech signal and reverberant-noisy speech signal using a target waveform according to Equation 16 with the majority of late reflections suppressed.
17. The system of any of examples 4 to 16, wherein the end-to-end neural network was trained using a generalized distance measure according to Equation 20.
18. The system of any of examples 4 to 17, wherein the end-to-end neural network is configured to be dynamically tuned via an input minimum gain threshold that controls a level of noise suppression present in the enhanced waveform.
The embodiments of the present disclosure described above are intended to be merely exemplary; numerous variations and modifications will be apparent to those skilled in the art. One skilled in the art will appreciate further features and advantages of the disclosure based on the above-described embodiments. Such variations and modifications are intended to be within the scope of the present invention as defined by any of the appended claims. Accordingly, the disclosure is not to be limited by what has been particularly shown and described, except as indicated by the appended claims. All publications and references cited herein are expressly incorporated herein by reference in their entirety.
This application claims priority to and the benefit of U.S. Provisional Application Ser. No. 63/281,450, entitled “SYSTEMS AND METHODS FOR SPEECH ENHANCEMENT USING ATTENTION MASKING AND END TO END NEURAL NETWORKS,” and filed Nov. 19, 2021, the contents of which is incorporated by reference herein in its entirety.
This invention was made with Government support under Grant No. FA8702-15-D-0001 awarded by the Air Force Office of Scientific Research. The Government has certain rights in the invention.