This disclosure relates to a microphone array configuration invariant, streaming, multichannel neural enhancement frontend for automatic speech recognition.
Robustness of automatic speech recognition (ASR) systems has significantly improved over the years with the advent of neural network-based end-to-end models, large-scale training data, and improved strategies for augmenting training data. Nevertheless, various conditions such as reverberation, significant background noise, and competing speech significantly deteriorate performance of ASR systems. A joint ASR model may be trained to handle these conditions. However, isolating speech in background conditions including speech-based noise and non-speech based noise is particularly challenging.
One aspect of the present disclosure provides a multichannel neural frontend speech enhancement model for speech recognition that includes a speech cleaner, a stack of self-attention blocks each having a multi-headed self attention mechanism, and a masking layer. The speech cleaner receives, as input, a multichannel noisy input signal and a multichannel contextual noise signal, and generates, as output, a single channel cleaned input signal. The stack of self-attention blocks receives, as input, at an initial block of the stack of self-attention blocks, a stacked input including the single channel cleaned input signal output from the speech cleaner and a single channel noisy input signal, and generates, as output, from a final block of the stack of self-attention blocks, an un-masked output. The masking layer receives, as input, the single channel noisy input signal and the un-masked output generated as output from the final block of the stack of self-attention blocks, and generates, as output, enhanced input speech features corresponding to a target utterance.
Implementations of the disclosure may include one or more of the following optional features. In some implementations, the stack of self-attention blocks includes a stack of Conformer blocks. In these implementations, the stack of Conformer blocks may include four Conformer blocks. In some examples, the speech enhancement model executes on data processing hardware residing on a user device. Here, the user device is configured to capture the target utterance and the multichannel contextual noise signal via an array of microphones of the user device. In these examples, the speech enhancement model may be agnostic to a number of microphones in the array of microphones.
In some implementations, the speech cleaner executes an adaptive noise cancelation algorithm to generate the single channel cleaned input signal by applying a finite impulse response (FIR) filter on all channels of the multichannel noisy input signal except for a first channel of the multichannel noisy input signal to generate a summed output, and subtracting the summed output from the first channel of the multichannel noisy input signal. In some examples, a backend speech system is configured to process the enhanced input speech features corresponding to the target utterance. In these examples, the backend speech system includes at least one of an automatic speech recognition (ASR) model or an audio or audio-video calling application.
In some implementations, the speech enhancement model is trained jointly with a backend automatic speech recognition (ASR) model using a spectral loss and an ASR loss. In these implementations, the spectral loss may be based on an L1 loss and an L2 loss between an estimated ratio mask and an ideal ratio mask. Here, the ideal ratio mask is computed using reverberant speech and reverberant noise. Additionally or alternatively, the ASR loss is computed by generating, using an ASR encoder of the ASR model configured to receive enhanced speech features predicted by the speech enhancement model for a training utterance as input, predicted outputs of the ASR encoder for the enhanced speech features, and generating, using the ASR encoder configured to receive target speech features for the training utterance as input, target outputs of the ASR encoder for the target speech features. Here, computing the ASR loss is based on the predicted outputs of the ASR encoder for the enhanced speech features and the target outputs of the ASR encoder for the target speech features.
Another aspect of the disclosure provides a computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations that include receiving a multichannel noisy input signal and a multichannel contextual noise signal, and generating, using a speech cleaner of a speech enhancement model, a single channel cleaned input signal. The operations also include generating, as output from a stack of self-attention blocks of the speech enhancement model configured to receive a stacked input including the single channel cleaned input signal output from the speech cleaner and a single channel noisy input signal, an un-masked output. Here, each self-attention block in the stack of self-attention blocks includes a multi-headed self attention mechanism. The operations further include generating, using a masking layer of the speech enhancement model configured to receive the single channel noisy input signal and the un-masked output generated as output from the stack of self-attention blocks, enhanced input speech features corresponding to a target utterance.
This aspect may include one or more of the following optional features. In some implementations, the stack of self-attention blocks includes a stack of Conformer blocks. In these implementations, the stack of Conformer blocks may include four Conformer blocks. In some examples, the speech cleaner, the stack of self-attention blocks, and the masking layer execute on the data processing hardware residing on a user device. Here, the user device is configured to capture the target utterance and the multichannel contextual noise signal via an array of microphones of the user device. In these examples, the speech enhancement model may be agnostic to a number of microphones in the array of microphones.
In some implementations, the operations further include executing, using the speech cleaner, an adaptive noise cancelation algorithm to generate the single channel cleaned input signal by applying a finite impulse response (FIR) filter on all channels of the multichannel noisy input signal except for a first channel of the multichannel noisy input signal to generate a summed output, and subtracting the summed output from the first channel of the multichannel noisy input signal. In some examples, a backend speech system is configured to process the enhanced input speech features corresponding to the target utterance. In these examples, the backend speech system includes at least one of an automatic speech recognition (ASR) model or an audio or audio-video calling application.
In some implementations, the speech enhancement model is trained jointly with a backend automatic speech recognition (ASR) model using a spectral loss and an ASR loss. In these implementations, the spectral loss may be based on an L1 loss and an L2 loss between an estimated ratio mask and an ideal ratio mask. Here, the ideal ratio mask is computed using reverberant speech and reverberant noise. Additionally or alternatively, the ASR loss is computed by generating, using an ASR encoder of the ASR model configured to receive enhanced speech features predicted by the speech enhancement model for a training utterance as input, predicted outputs of the ASR encoder for the enhanced speech features, and generating, using the ASR encoder configured to receive target speech features for the training utterance as input, target outputs of the ASR encoder for the target speech features. Here, computing the ASR loss is based on the predicted outputs of the ASR encoder for the enhanced speech features and the target outputs of the ASR encoder for the target speech features.
The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.
Like reference symbols in the various drawings indicate like elements.
Robustness of automatic speech recognition (ASR) systems has significantly improved over the years with the advent of neural network-based end-to-end models, large-scale training data, and improved strategies for augmenting training data. Nevertheless, background interference can significantly deteriorate the ability of ASR systems to accurately recognize speech directed toward the ASR system. Background interference can be broadly classified into three groups: device echo; background noise; and competing speech. While separate ASR models may be trained to handle each of these background interference groups in isolation, maintaining multiple task/condition-specific ASR models and switching between the models on the fly during use is not practical.
Device echo may correspond to playback audio output from devices, such as smart home speakers, whereby the playback audio is recorded as echo and can affect performance of a backend speech system, such as an ASR system. Particularly, degradation of performance of the backend speech system is especially severe if the playback audio contains audible speech, e.g., a text-to-speech (TTS) response from a digital assistant.
Background noise with non-speech characteristics is usually well handled using data augmentation strategies like multi-style training (MTR) of the ASR models. Here, a room simulator is used to add noise to the training data, which is then carefully weighted with clean data during training to get a good balance in performance between clean and noisy conditions. As a result, large scale ASR models are robust to moderate levels of non-speech noise. However, background noise can still affect performance of backend speech systems in the presence of low signal-to-noise ratio (SNR) conditions.
Unlike non-speech background noise, competing speech is quite challenging for ASR models that are trained to recognize a single speaker. Training ASR models with multi-talker speech can pose problems in itself, since it is hard to disambiguate which speaker to focus on during inference. Using models that recognize multiple speakers is also sub-optimal since it is hard to know ahead of time how many users to support. Furthermore, such multi-speaker models typically have degraded performance in single-speaker settings, which is undesirable.
The three aforementioned classes of background interference have typically been addressed in isolation of one another, each using separate modeling strategies. Speech separation has received a lot of attention in the recent literature using techniques like deep clustering, permutation invariant training, and using speaker embeddings. When using speaker embeddings, the target speaker of interest is assumed to be known a priori. Techniques developed for speaker separation have also been applied to remove non-speech noise, with modifications to the training data. Acoustic Echo Cancelation (AEC) has also been studied in isolation or together in the presence of background noise. It is well known that improving speech quality does not always improve ASR performance since the distortions introduced by non-linear processing can adversely affect ASR performance. One way to mitigate discrepancies between an enhancement frontend initially processing incoming audio and the resulting ASR performance is to jointly train the enhancement frontend together with the backend ASR model.
Moreover, as the application of large scale multi-domain and multi-lingual ASR models continues to gain interest, the training data for these ASR models typically covers various acoustic and linguistic use cases (e.g., voice search and video captioning), thereby making it challenging to simultaneously address harsher noise conditions. As a result, it is often convenient to train and maintain separate frontend feature processing models capable of handling adverse conditions, without combining them with the backend ASR model.
Implementations herein are directed toward training a frontend speech enhancement model for improving robustness of ASR. The model is practical from the standpoint that it is difficult, if not impossible, to know what class of background interference to address ahead of time, particularly in a streaming ASR setting. Specifically, the frontend speech enhancement model includes a contextual enhancement neural network (CENN) capable of making use of a multichannel noisy input signal and a multichannel contextual noise signal. For speech enhancement and separation, the noise context, i.e., a few seconds of audio before the target utterance to be recognized, carries useful information about the acoustic context. The CENN employs a respective neural network architecture configured to ingest the noisy input and the contextual input to produce enhanced input speech features that may be passed to a backend speech system, such as, an ASR model that may process the enhanced input speech features to generate a speech recognition result for the target utterance. Notably, though the frontend speech enhancement model is designed to operate with a multi-channel array, the frontend speech enhancement model itself is agnostic as to the number of channels in the array or their configuration.
Referring to
Various types of background interference may interfere with the ability of a backend speech system 180 to process the target utterance 12 that specifies the query or command for the device 110. As aforementioned, the background interference may include one or more of a device echo corresponding to playback audio 154 output from the user device (e.g., a smart speaker) 110, competing speech 13 such as utterances other than the target utterance 12 spoken by one or more other users 11 that are not directed toward the device 110, and background noise with non-speech characteristics such as a ringtone 15 from a separate user device 111. Implementations herein employ a multichannel neural frontend speech enhancement model 200 (also referred to as a model 200 or a frontend speech enhancement model 200) that executes on the device 110 and is configured to receive, as input, a multichannel noisy input signal 202 including speech features corresponding to the target utterance 12 and the background interference, and a multichannel contextual noise signal 204 and generate, as output, enhanced input speech features 250 corresponding to the target utterance 12 by processing the multichannel noisy input signal 202 and the multichannel contextual noise signal 204 to remove the background interference. The multichannel noisy input signal 202 includes one or more channels 206, 206a—n of audio. A backend speech system 180 may then process the enhanced input speech features 250 to generate an output 182. 
Notably, the multichannel neural frontend speech enhancement model 200 effectively removes (i.e., masks) the presence of background interference recorded by the device 110 when the user 10 spoke the target utterance 12 such that the enhanced input speech features 250 provided to the backend speech system 180 convey the speech (i.e., target utterance 12) that was intended for the device 110 so that the output 182 generated by the backend speech system 180 is not degraded by the background interference.
In the example shown, the backend speech system 180 includes an ASR system 190 that employs an ASR model 192 to process the enhanced input speech features 250 to generate a speech recognition result (e.g., transcription) for the target utterance 12. The ASR system 190 may further include a natural language understanding (NLU) module (not shown) that performs semantic interpretation on the transcription of the target utterance 12 to identify the query/command directed toward the device 110. As such, the output 182 from the backend speech system 180 may include the transcription and/or instructions to fulfill the query/command identified by the NLU module.
The backend speech system 180 may additionally or alternatively include a hotword detection model (not shown) configured to detect whether or not the enhanced input speech features 250 include a presence of one or more hotwords/warm words the hotword detection model is trained to detect. For instance, the hotword detection model may output a hotword detection score indicating a likelihood that the enhanced input speech features 250 corresponding to the target utterance 12 include a particular hotword/warm word. Detection of a hotword may trigger a wake-up process that causes the device 110 to wake up from a sleep state. For instance, the device 110 may wake up and process the hotword and/or one or more terms preceding/following the hotword.
In additional examples, the backend speech system 180 includes an audio or audio-video calling application (e.g., a video conferencing application). Here, the enhanced input speech features 250 corresponding to the target utterance 12 are used by the audio or audio-video calling application to filter the voice of the target speaker 10 for communications to recipients during an audio or audio-video communication session. The backend speech system 180 may additionally or alternatively include a speaker identification model configured to perform speaker identification using the enhanced input speech features 250 to identify the user 10 that spoke the target utterance 12.
In the example shown, the device 110 captures the multichannel noisy input signal 202 (also referred to as audio data) of the target utterance 12 spoken by the user 10 in the presence of background interference emanating from one or more sources other than the user 10. The multichannel noisy input signal 202 includes one or more single channel noisy input signals 206, 206a-n of audio. The device 110 may correspond to any computing device associated with the user 10 and capable of receiving multichannel noisy input signals 202. Some examples of user devices 110 include, but are not limited to, mobile devices (e.g., mobile phones, tablets, laptops, etc.), computers, wearable devices (e.g., smart watches, smart headphones, etc.), smart appliances, internet of things (IoT) devices, and smart speakers. The device 110 includes data processing hardware 112 and memory hardware 114 in communication with the data processing hardware 112 and storing instructions that, when executed by the data processing hardware 112, cause the data processing hardware 112 to perform one or more operations. The multichannel neural frontend speech enhancement model 200 may execute on the data processing hardware 112. In some examples, the backend speech system 180 executes on the data processing hardware 112.
In some examples, the device 110 includes one or more applications (i.e., software applications) where each application may utilize enhanced input speech features 250 generated by the multichannel neural frontend speech enhancement model 200 to perform various functions within the application. For instance, the device 110 includes an assistant application configured to communicate synthesized playback audio 154 to the user 10 to assist the user 10 with various tasks.
The user device 110 further includes (or is in communication with) an audio subsystem with an array of audio capturing devices (e.g., microphones) 116, 116a-n for capturing and converting spoken utterances 12 within the speech environment into electrical signals and a speech output device (e.g., a speaker) 118 for communicating an audible audio signal (e.g., a synthesized playback audio 154 from the device 110). Each microphone 116 in the array of microphones 116 of the user device 110 may separately record the utterance 12 on a separate dedicated channel 206 of the multichannel noisy input signal 202. For example, the user device 110 may include two microphones 116 that each record the utterance 12, and the recordings from the two microphones 116 may be combined into a two-channel noisy input signal 202 (i.e., stereophonic audio or stereo). That is, the two microphones reside on the user device 110. In some examples, the user device 110 includes more than two microphones 116. Additionally or alternatively, the user device 110 may be in communication with two or more microphones 116 separate/remote from the user device 110. For example, the user device 110 may be a mobile device disposed within a vehicle and in wired or wireless communication (e.g., Bluetooth) with two or more microphones 116 of the vehicle. In some configurations, the user device 110 is in communication with at least one microphone 116 residing on a separate device 111, which may include, without limitation, an in-vehicle audio system, a computing device, a speaker, or another user device. In these configurations, the separate device 111 may also be in communication with the one or more microphones 116 residing on the user device 110.
In some examples, the device 110 is configured to communicate with a remote system 130 via a network (not shown). The remote system 130 may include remote resources 132, such as remote data processing hardware 134 (e.g., remote servers or CPUs) and/or remote memory hardware 136 (e.g., remote databases or other storage hardware). The device 110 may utilize the remote resources 132 to perform various functionality related to speech processing and/or synthesized playback communication. The multichannel neural frontend speech enhancement model 200 and the backend speech system 180 may reside on the device 110 (referred to as on-device systems) or reside remotely (e.g., reside on the remote system 130), but in communication with the device 110. In some examples, one or more backend speech systems 180 reside locally or on-device while one or more other backend speech systems 180 reside remotely. In other words, one or more backend speech systems 180 leveraging the enhanced input speech features 250 output from the multichannel neural frontend speech enhancement model 200 may be local or remote in any combination. For instance, when a system 180 is rather large in size or processing requirements, the system 180 may reside in the remote system 130. Yet when the device 110 may support the size or the processing requirements of one or more systems 180, the one or more systems 180 may reside on the device 110 using the data processing hardware 112 and/or the memory hardware 114. Optionally, one or more of the systems 180 may reside both locally/on-device and remotely. For instance, a backend speech system 180 may default to execute on the remote system 130 when a connection between the device 110 and remote system 130 is available, but when the connection is lost or unavailable, the system 180 instead executes locally on the device 110.
In some implementations, the device 110 or a system associated with the device 110 identifies text that the device 110 will communicate to the user 10 as a response to a query spoken by the user 10. The device 110 may then use a text-to-speech (TTS) system to convert the text into corresponding synthesized playback audio 154 for the device 110 to communicate to the user 10 (e.g., audibly communicate to the user 10) as the response to the query. Once generated, the TTS system communicates the synthesized playback audio 154 to the device 110 to allow the device 110 to output the synthesized playback audio 154. For instance, the device 110 outputs the synthesized playback audio 154 of “today is sunny” at a speaker 118 of the device 110 responsive to the user 10 providing a spoken query for today's weather forecast.
With continued reference to
In
The model 200 may perform speech enhancement by applying noise context modeling where the speech cleaner 300 of the model 200 processes the multichannel contextual noise signal 204 associated with a predetermined duration of noise segments captured by the audio capturing device 116 prior to the target utterance 12 spoken by the user 10. In some examples, the predetermined duration includes six (6) seconds of noise segments. As such, the multichannel contextual noise signal 204 provides noise context. In some examples, the multichannel contextual noise signal 204 includes LFBE features of the noise context signal for use as contextual information.
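As a non-limiting illustration, the selection of the noise context from a captured audio buffer may be sketched as follows. The 16 kHz sample rate, the function name `split_context_and_utterance`, and the assumption that the start of the target utterance is already known are hypothetical choices made only for this sketch; only the six-second duration comes from the description above.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000  # assumed sample rate (not stated in the text)
CONTEXT_SECONDS = 6      # predetermined duration given in the text


def split_context_and_utterance(audio, utterance_start):
    """Split a (channels, samples) buffer into noise context and utterance.

    `utterance_start` is the sample index at which the target utterance
    begins; how that index is detected is outside the scope of this sketch.
    """
    context_len = CONTEXT_SECONDS * SAMPLE_RATE_HZ
    context_start = max(0, utterance_start - context_len)
    noise_context = audio[:, context_start:utterance_start]
    noisy_input = audio[:, utterance_start:]
    return noise_context, noisy_input


# Three microphones, 10 seconds of audio, utterance starting at 7 seconds.
audio = np.random.default_rng(0).standard_normal((3, 10 * SAMPLE_RATE_HZ))
noise_context, noisy_input = split_context_and_utterance(audio, 7 * SAMPLE_RATE_HZ)
print(noise_context.shape, noisy_input.shape)  # (3, 96000) (3, 48000)
```

The same slicing applies regardless of the number of rows in `audio`, i.e., regardless of the number of microphones.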
The speech cleaner 300 may be configured to receive, as input, the multichannel noisy input signal 202 and the multichannel contextual noise signal 204 and generate, as output, a single channel cleaned input signal 340. Here, the speech cleaner 300 includes a finite impulse response (FIR) filter to process the multichannel noisy input signal 202.
In the example shown, for simplicity, the multichannel noisy input signal 202 includes three channels 206a-c each including respective audio features captured by a separate dedicated microphone 116a-c in an array of three microphones 116. However, as mentioned above, the frontend speech enhancement model 200 is agnostic to a number of microphones 116 in the array of microphones 116. In other words, the multichannel noisy input signal 202 can include one channel 206 captured by one microphone 116, two channels 206 captured by two microphones 116, or four or more channels 206 captured by four or more microphones 116 without departing from the scope of the present disclosure.
Here, the FIR module 310 applies the FIR filter on all channels 206 of the multichannel noisy input signal 202 except for a first channel 206a to generate a summed output 312. In other words, the FIR module 310 does not process the first channel 206a of the multichannel noisy input signal 202, but does apply the FIR filter on the second channel 206b and the third channel 206c of the multichannel noisy input signal 202 to generate the summed output 312. The minimization module 320 receives the summed output 312 and the first channel 206a and generates a minimized output 322 by subtracting the summed output 312 from the first channel 206a of the multichannel noisy input signal 202. Mathematically, the FIR filter includes a tapped delay line of length L of three (3) applied to the channels 206b, 206c but not the channel 206a, where determining the minimized output 322 may be expressed as follows:
$$Z_m(n) = Y_0(n) - \sum_{l=0}^{L-1} U_m^{H}\,\tilde{Y}_m(k, n-l), \tag{1}$$
where $\tilde{Y}_m$ is a vector of time-delayed short-time Fourier transform (STFT) processed input for the channels 206b, 206c and $U_m(k)$ is a vector of the filter coefficients to be applied to the channels 206b, 206c. $\tilde{Y}_m$ and $U_m(k)$ may be expressed as follows:
$$\tilde{Y}_m(n) = [Y_m(n), Y_m(n-1), \ldots, Y_m(n-(L-1))]^{T} \tag{2}$$
$$U_m(k) = [U_m(k,0), U_m(k,1), \ldots, U_m(k,N-1)]^{T}, \tag{3}$$
where the filter coefficients may minimize the power of the output as follows:
$$\hat{U}_m(k) = \arg\min_{U_m(k)} E\left[\,|Z_m(n)|^{2}\,\right]. \tag{4}$$
Because the speech cleaner 300 is implemented on the device 110, the cancelation module 330 may use the multichannel contextual noise signal 204 that occurs directly before the utterance 12 in the multichannel noisy input signal 202. In other words, the minimization module 320 generates the minimized output 322 through adaptation during the multichannel contextual noise signal 204 when the utterance 12 is not present in the multichannel noisy input signal 202. The adaptation may include a recursive least squares (RLS) algorithm. Once the speech cleaner 300 detects the utterance 12, the filter coefficients are frozen, where the cancelation module 330 applies the last coefficients before the utterance 12 to the multichannel noisy input signal 202 to cancel the background interference to produce the single channel cleaned input signal 340 as follows:
$$\hat{X}(n) = Y_0(n) - \sum_{l=0}^{L-1} \hat{U}_m^{H}\,\tilde{Y}_m(k, n-l). \tag{5}$$
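As a non-limiting illustration, the adapt-then-freeze behavior of the speech cleaner may be sketched in the STFT domain as follows. For brevity, this sketch fits the filter coefficients with a batch least-squares solve over the noise context instead of the recursive least squares (RLS) adaptation named above, treats frames before the start of each segment as zeros, and solves directly for the conjugated coefficients. The function names and the synthetic test signal are hypothetical.

```python
import numpy as np

NUM_TAPS = 3  # tapped delay line length L given in the text


def delay_matrix(stft, num_taps=NUM_TAPS):
    """Stack delayed frames of every non-reference channel.

    stft: complex array of shape (channels, frames, bins); channel 0 is the
    reference. Returns shape (frames, bins, (channels - 1) * num_taps).
    Frames before the start of the segment are treated as zeros.
    """
    channels, frames, bins = stft.shape
    columns = []
    for m in range(1, channels):
        for l in range(num_taps):
            delayed = np.zeros((frames, bins), dtype=stft.dtype)
            delayed[l:] = stft[m, : frames - l]
            columns.append(delayed)
    if not columns:  # single-microphone case: nothing to filter
        return np.zeros((frames, bins, 0), dtype=stft.dtype)
    return np.stack(columns, axis=-1)


def fit_filters(context_stft, num_taps=NUM_TAPS):
    """Per-bin filters minimizing the output power on the noise context."""
    taps = delay_matrix(context_stft, num_taps)
    reference = context_stft[0]
    bins = reference.shape[1]
    filters = np.zeros((bins, taps.shape[-1]), dtype=complex)
    if taps.shape[-1] == 0:
        return filters
    for k in range(bins):
        filters[k] = np.linalg.lstsq(taps[:, k, :], reference[:, k], rcond=None)[0]
    return filters


def apply_filters(noisy_stft, filters, num_taps=NUM_TAPS):
    """Apply frozen filters and subtract the summed output from channel 0."""
    taps = delay_matrix(noisy_stft, num_taps)
    summed_output = (taps * filters[np.newaxis]).sum(axis=-1)
    return noisy_stft[0] - summed_output


# Synthetic check: one noise source reaching three microphones with fixed
# per-bin gains (a deliberately simple stand-in for a real room).
rng = np.random.default_rng(0)
bins, mics = 5, 3
gains = rng.standard_normal((mics, 1, bins)) + 1j * rng.standard_normal((mics, 1, bins))


def noise_only(frames):
    source = rng.standard_normal((1, frames, bins)) + 1j * rng.standard_normal((1, frames, bins))
    return gains * source  # shape (mics, frames, bins)


filters = fit_filters(noise_only(200))   # adapt on the noise context
test_noise = noise_only(50)              # unseen noise, same mixing
cleaned = apply_filters(test_noise, filters)
print(np.mean(np.abs(test_noise[0]) ** 2), np.mean(np.abs(cleaned) ** 2))
```

Because the filters are fitted over however many non-reference channels are present, the same functions accept one, two, three, or more microphones without modification.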
Referring back to
The encoder 230 receives the stacked input 232 including the single channel cleaned input signal 340 and the single channel 206a of the multichannel noisy input signal 202, and generates, as output, an un-masked output 480. The encoder 230 includes a stack of self-attention blocks 400 (also referred to as blocks 400). Here, an initial block 400 of the stack of self-attention blocks 400 receives the stacked input 232 including the single channel cleaned input signal 340 output from the speech cleaner 300 and the single channel 206 of the multichannel noisy input signal 202, and a final block 400 of the stack of self-attention blocks 400 generates the un-masked output 480.
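As a non-limiting illustration, forming the stacked input may be sketched as a concatenation of the two single channel feature sequences. Concatenating along the feature axis and the 128-dimensional features are assumptions made only for this sketch; the description above does not specify either.

```python
import numpy as np


def stack_inputs(cleaned_features, noisy_features):
    """Concatenate per-frame features of shape (frames, feature_dim)."""
    if cleaned_features.shape != noisy_features.shape:
        raise ValueError("cleaned and noisy features must share a shape")
    return np.concatenate([cleaned_features, noisy_features], axis=-1)


frames, feature_dim = 100, 128  # assumed sizes, for illustration only
cleaned = np.zeros((frames, feature_dim))  # stands in for the cleaned signal
noisy = np.ones((frames, feature_dim))     # stands in for the noisy channel
stacked = stack_inputs(cleaned, noisy)
print(stacked.shape)  # (100, 256)
```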
Each Conformer block 400 may include a feed-forward layer, a self-attention layer, a convolution layer, and a second feed-forward layer. In some implementations, the stack of self-attention blocks 400 includes a stack of Conformer blocks 400. In these implementations, the stack of Conformer blocks 400 includes four (4) layers of Conformer blocks 400 each with 1024 units, 8 attention heads, 15×1 convolutional kernel size, and 64 frames of self-attention to enable a streaming model. An example Conformer block 400 is described in greater detail below with reference to
The masking layer 240 is configured to receive, as input, the un-masked output 480 output by the self-attention blocks 400 of the encoder 230 and the single channel 206a of the multichannel noisy input signal 202, and generate, as output, the enhanced input speech features 250 corresponding to the target utterance 12. In some implementations, the masking layer 240 of the model 200 includes a decoder (not shown) configured to decode the un-masked output 480 into the enhanced input speech features 250 corresponding to the target utterance 12. Here, the decoder may include a simple projection decoder having a single layer, frame-wise fully connected network with sigmoid activation.
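As a non-limiting illustration, a single layer, frame-wise fully connected projection with sigmoid activation may be sketched as follows. The 128 Mel bins, the randomly initialized weights, and the element-wise application of the resulting mask to the noisy features are assumptions made only for this sketch; the function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def projection_decoder(unmasked_output, weights, bias):
    """Map (frames, encoder_dim) to a mask of shape (frames, num_mel_bins).

    The same weights are applied independently to every frame.
    """
    return sigmoid(unmasked_output @ weights + bias)


def masking_layer(unmasked_output, noisy_features, weights, bias):
    """Scale the noisy features element-wise by the predicted mask."""
    mask = projection_decoder(unmasked_output, weights, bias)
    return noisy_features * mask


rng = np.random.default_rng(0)
frames, encoder_dim, num_mel_bins = 10, 1024, 128  # 128 bins is an assumption
unmasked_output = rng.standard_normal((frames, encoder_dim))
noisy_features = rng.uniform(0.1, 1.0, size=(frames, num_mel_bins))
weights = 0.01 * rng.standard_normal((encoder_dim, num_mel_bins))  # untrained
bias = np.zeros(num_mel_bins)

enhanced = masking_layer(unmasked_output, noisy_features, weights, bias)
print(enhanced.shape)  # (10, 128)
```

Because the sigmoid bounds each mask value between zero and one, the output of this sketch never exceeds the noisy input in any time-frequency bin.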
Next, a second concatenation operator 405b concatenates the output noise summary 422 with the first concatenated input 414 to generate a second concatenated input 424. Subsequently, the convolution layer 430 subsamples the second concatenated input 424 including the noise summary 422 of the multi-head self-attention block 420 and the first concatenated input 414, and generates a convolutional output 432. Thereafter, a third concatenation operator 405c concatenates the convolutional output 432 with the second concatenated input 424 to generate a third concatenated input 434. The third concatenated input 434 is provided as input to the second half-feed forward layer 440, which generates an output 442. The output 442 of the second half-feed forward layer 440 is concatenated with the third concatenated input 434 by a fourth concatenation operator 405d to generate a fourth concatenated input 444. Finally, the layernorm module 450 processes the fourth concatenated input 444 from the second half feed-forward layer 440. Mathematically, the block 400 transforms input features x, using modulation features m, to produce output features y, as follows:
The block 400 generates, as an output, the un-masked output 480, which is passed on to the next layer of the self-attention blocks 400. Thus, the inputs 340, 206 are modulated by each of the self-attention blocks 400.
In some implementations, the frontend speech enhancement model 200 is trained jointly with the ASR model 192 of the backend speech system 180 using a spectral loss and the ASR loss 560. The training target 536 for training the multichannel neural frontend speech enhancement model 200 is an ideal ratio mask (IRM). IRMs may be computed using reverberant speech and reverberant noise based on an assumption that speech and noise are uncorrelated in Mel spectral space as follows:
$$M(t,f) = \frac{X(t,f)}{X(t,f) + N(t,f)}. \tag{7}$$
Here, X and N are the reverberant speech and reverberant noise Mel spectrograms, respectively, and t and f represent time and Mel frequency bin indices. The choice to estimate IRMs is based on the targets being bounded in the range [0, 1], which simplifies the estimation process. Moreover, the ASR model 192 used for evaluation may be trained on real and simulated reverberant data, resulting in a trained ASR model 192 that is relatively robust to reverberant speech. Therefore, IRMs derived using reverberant speech as the target still provide substantial gains in performance. The spectral loss during training may be computed based on L1 and L2 losses between the IRM, $M$, and the estimated IRM, $\hat{M}$, as follows:
$$\mathcal{L}_{\text{spec}} = \sum_{t,f} \Big( \left| M(t,f) - \hat{M}(t,f) \right| + \left( M(t,f) - \hat{M}(t,f) \right)^{2} \Big) \qquad (8)$$
where $L_1 = \left| M(t,f) - \hat{M}(t,f) \right|$ and $L_2 = \left( M(t,f) - \hat{M}(t,f) \right)^{2}$.
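The training target and spectral loss described above can be illustrated with a minimal NumPy sketch. It assumes the common ratio form M = X / (X + N) for the IRM, which is consistent with the uncorrelated-speech-and-noise assumption but is an assumption here because the IRM equation itself is not reproduced in this passage. The function names `ideal_ratio_mask` and `spectral_loss` and the stabilizer `eps` are hypothetical and introduced only for illustration.

```python
import numpy as np


def ideal_ratio_mask(speech_mel, noise_mel, eps=1e-8):
    """Return an IRM in [0, 1] for each (t, f) bin.

    Assumed form: M = X / (X + N), with X and N the reverberant speech
    and reverberant noise Mel spectrograms. `eps` (hypothetical) avoids
    division by zero in silent bins.
    """
    speech_mel = np.asarray(speech_mel, dtype=np.float64)
    noise_mel = np.asarray(noise_mel, dtype=np.float64)
    return speech_mel / (speech_mel + noise_mel + eps)


def spectral_loss(irm, irm_est):
    """Sum of L1 and L2 terms over all (t, f) bins, as in equation (8)."""
    diff = np.asarray(irm, dtype=np.float64) - np.asarray(irm_est, dtype=np.float64)
    return float(np.sum(np.abs(diff) + diff ** 2))


# Toy 2x2 spectrograms (time x Mel bins); the values are illustrative only.
X = [[1.0, 3.0], [0.0, 2.0]]
N = [[1.0, 1.0], [4.0, 2.0]]
M = ideal_ratio_mask(X, N)
M_hat = np.full((2, 2), 0.5)  # stand-in for a model's estimated mask
loss = spectral_loss(M, M_hat)
```

Because the mask is a ratio of non-negative quantities, every target value falls in [0, 1], which is the boundedness property the description relies on.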
During inference, the estimated IRM is scaled and floored to reduce speech distortion at the expense of reduced noise suppression. This is especially important, since the ASR model 192 is sensitive to speech distortions and non-linear frontend processing, which is one of the main challenges in improving performance of robust ASR models using enhancement frontends. The enhanced feature may be derived as follows:
$$\hat{X}(t,f) = Y(t,f) \odot \max\!\big(\hat{M}(t,f),\, \beta\big)^{\alpha} \qquad (9)$$
Here, Y is the noisy Mel spectrogram, X̂ is an estimate of the clean Mel spectrogram, and α and β are the exponential mask scalar and the mask floor, respectively. In some examples, α is set to 0.5, and β is set to 0.01. The enhanced features may be log-compressed, i.e., log(X̂), and passed to the ASR model 192 for evaluation.
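The inference-time scaling and flooring of equation (9) can be sketched as follows. This is an illustrative NumPy sketch rather than the disclosed implementation: the function names `enhance_features` and `log_compress` are hypothetical, and the small `eps` added before the logarithm is an assumption made only to keep the log finite for zero-valued bins.

```python
import numpy as np


def enhance_features(noisy_mel, irm_est, alpha=0.5, beta=0.01):
    """Apply equation (9): X_hat = Y * max(M_hat, beta) ** alpha.

    The estimated mask is first floored at `beta`, then raised to the
    power `alpha`, then multiplied elementwise with the noisy features.
    """
    noisy_mel = np.asarray(noisy_mel, dtype=np.float64)
    irm_est = np.asarray(irm_est, dtype=np.float64)
    mask = np.maximum(irm_est, beta) ** alpha
    return noisy_mel * mask


def log_compress(enhanced_mel, eps=1e-8):
    """Log-compress enhanced features; `eps` is a hypothetical stabilizer."""
    return np.log(np.asarray(enhanced_mel, dtype=np.float64) + eps)


# One frame with two Mel bins; the values are illustrative only.
Y = [[4.0, 2.0]]
M_hat = [[0.25, 0.0]]
X_hat = enhance_features(Y, M_hat)   # floor keeps the second bin from being zeroed
features = log_compress(X_hat)
```

With α = 0.5 and β = 0.01, a mask value of 0 is replaced by 0.01 and then softened to 0.1, so a bin is attenuated by at most a factor of ten. This reflects the trade-off stated above: less noise suppression in exchange for less speech distortion.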
At operation 606, the method 600 also includes generating, as output from a stack of self-attention blocks 400 of the speech enhancement model 200 configured to receive a stacked input 232 including the single channel cleaned input signal 340 output from the speech cleaner 300 and a single channel noisy input signal 206, an un-masked output 480. Here, each self-attention block 400 in the stack of self-attention blocks 400 includes a multi-headed self attention mechanism. At operation 608, the method 600 further includes generating, using a masking layer 240 of the speech enhancement model 200 configured to receive the single channel noisy input signal 206 and the un-masked output 480 generated as output from the stack of self-attention blocks 400, enhanced input speech features 250 corresponding to a target utterance 12.
The computing device 700 includes a processor 710, memory 720, a storage device 730, a high-speed interface/controller 740 connecting to the memory 720 and high-speed expansion ports 750, and a low-speed interface/controller 760 connecting to a low-speed bus 770 and the storage device 730. Each of the components 710, 720, 730, 740, 750, and 760 is interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 710 (e.g., data processing hardware 112, 134 of
The memory 720 (e.g., memory hardware 114, 136 of
The storage device 730 is capable of providing mass storage for the computing device 700. In some implementations, the storage device 730 is a computer-readable medium. In various different implementations, the storage device 730 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. In additional implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 720, the storage device 730, or memory on processor 710.
The high-speed controller 740 manages bandwidth-intensive operations for the computing device 700, while the low-speed controller 760 manages less bandwidth-intensive operations. Such allocation of duties is exemplary only. In some implementations, the high-speed controller 740 is coupled to the memory 720, the display 780 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 750, which may accept various expansion cards (not shown). In some implementations, the low-speed controller 760 is coupled to the storage device 730 and a low-speed expansion port 790. The low-speed expansion port 790, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
The computing device 700 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 700a or multiple times in a group of such servers 700a, as a laptop computer 700b, or as part of a rack server system 700c.
Various implementations of the systems and techniques described herein can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform a task. In some examples, a software application may be referred to as an “application,” an “app,” or a “program.” Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.
The non-transitory memory may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by a computing device. The non-transitory memory may be volatile and/or non-volatile addressable semiconductor memory. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), phase change memory (PCM) as well as disks or tapes.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, non-transitory computer readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.
The processes and logic flows described in this specification can be performed by one or more programmable processors, also referred to as data processing hardware, executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touch screen for displaying information to the user and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.
This U.S. patent application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application 63/269,633, filed on Mar. 20, 2022. The disclosure of this prior application is considered part of the disclosure of this application and is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
63269633 | Mar 2022 | US