Aspects and implementations of the present disclosure relate to a machine learning model architecture for a speech enhancement system.
Speech enhancement systems have become integral in improving the perceptual quality and intelligibility of speech waveforms contaminated by additive background noise and reverberation.
Aspects and implementations of the present disclosure will be understood more fully from the detailed description given below and from the accompanying drawings of various aspects and implementations of the disclosure, which, however, should not be taken to limit the disclosure to the specific aspects or implementations, but are for explanation and understanding only.
Aspects of the present disclosure relate to a machine learning model architecture for a speech enhancement system. Speech enhancement systems are typically designed to improve the quality and intelligibility of speech signals, especially in environments with noise or other forms of interference. Speech enhancement systems are crucial in various applications, for example, in speech recognition, hearing aids, and modern telecommunication systems.
Traditional speech enhancement systems utilize one or more methods to achieve better noise reduction and speech quality. For example, some speech enhancement systems, such as a speech enhancement system using spectral subtraction in conjunction with Wiener filtering, produce an estimate of the distribution or composition of the frequencies present in the noise of the speech signal (e.g., a noise spectrum) and/or of the adjustment to be applied to the amplitude of the speech signal (e.g., a gain mask). The speech enhancement system then multiplies the estimated gain mask with the noisy spectrum of the speech signal to produce the enhanced spectrum. Traditional speech enhancement systems operate under the assumption that the noise is stationary or slowly varying, such as a constant background hum, fan noise, or slowly varying traffic noise. In real-world environments, noise is often non-stationary; it varies quickly over time. Non-stationary noise includes, for example, background conversations, doors slamming, cars honking, dogs barking, birds chirping, keyboard typing, etc. Thus, traditional speech enhancement systems tend to fail at reducing such noise and improving speech quality. This is further complicated under low signal-to-noise-ratio (SNR) conditions, in which the noise level is high compared to the speech level, making it difficult to distinguish between noise and speech and leading to less accurate estimation.
In recent years, machine learning and deep learning algorithms have been increasingly used because they are more adept at handling non-stationary noise and preserving speech quality. Most notable is the implementation of deep neural networks (DNNs) in speech enhancement systems (e.g., DNN speech enhancement systems). DNN speech enhancement systems have significantly advanced the field, offering more robust performance than traditional speech enhancement systems in various noise conditions, including non-stationary noise and low SNR conditions. DNN speech enhancement systems are typically time-frequency based or waveform domain-based.
Time-frequency based DNN speech enhancement systems typically estimate one or more masks (e.g., an ideal ratio mask, a spectral magnitude mask, and/or usage of spectral features) that can be applied to the noisy speech spectrum. These masks selectively attenuate or amplify different frequency components, aiming to suppress noise while preserving speech. However, time-frequency based DNN speech enhancement systems typically ignore (fail to manipulate) the phase component due to its complexity, thereby resulting in less accurate estimation and/or enhancement.
Waveform domain-based DNN speech enhancement systems typically process and enhance the raw speech waveform to output a clean speech waveform. However, waveform domain-based DNN speech enhancement systems are typically computationally complex due to their reliance on DNNs with large model sizes and the need for intensive training. This complexity poses challenges for real-time processing, especially with resource-constrained devices.
Aspects and embodiments of the present disclosure address these and other limitations of the existing technology by providing a machine learning (ML) model for a speech enhancement system. In some implementations, the machine learning model includes an encoder, a bottleneck, and a decoder with multiple skip connections that each directly connect an encoder layer of the encoder to a decoder layer of the decoder. In some embodiments, the number of encoder layers of the encoder is substantially equivalent to the number of decoder layers of the decoder. Each encoder layer of the encoder includes a residual network block, such as a Res2Net, and a squeeze-excitation (SE) block. The residual network block includes hierarchical residual-like connections that enable multiple sizes of receptive fields, resulting in multiple feature scales. The SE block adaptively re-calibrates channel-wise feature responses. The bottleneck includes a sequence of uni-directional gated recurrent unit (GRU) layers (e.g., a first GRU layer and a second GRU layer).
During operation, the machine learning (ML) model receives a speech waveform (e.g., an original speech waveform or first speech waveform). The original speech waveform is forwarded to the encoder of the ML model. Each encoder layer of the encoder generates a latent output of the original speech waveform (e.g., feature maps). In particular, for example, a residual network block of a respective encoder layer of the encoder receives feature maps (or latent output of the previous encoder layer) and generates multi-scale feature maps with multiple sizes of receptive fields. The SE block of the respective encoder layer of the encoder receives the multi-scale feature maps. The SE block squeezes the multi-scale feature maps to a vector of average activation. The SE block generates, using the vector of average activation, a vector of scale values. The SE block recalibrates, using the vector of scale values, the multi-scale feature maps. The SE block combines the recalibrated multi-scale feature maps with the original multi-scale feature maps to generate enhanced feature maps. The encoder layer generates a latent representation of the enhanced feature maps (e.g., a latent output). Additionally, each encoder layer downsamples the input, while increasing the number of channels. The output of the last encoder layer of the encoder (e.g., a latent output) is forwarded to a bottleneck.
The bottleneck receives the output of the last encoder layer of the encoder (e.g., the latent output) and outputs a non-linear transformation of the output of the last encoder layer of the encoder. The output of the bottleneck is forwarded to the decoder.
Each decoder layer of the decoder increases the dimensionality of an input (e.g., the output of the bottleneck or an output of a previous decoder layer) by upsampling and halving the number of channels. Each decoder layer further receives, via a skip connection connected to a corresponding encoder layer, an output of the corresponding encoder layer. The output of the corresponding encoder layer, which includes detailed features, and the output of the bottleneck, which includes abstract features, assist each decoder layer in reconstructing a speech waveform. In this manner, each decoder layer upsamples its input while decreasing the number of channels. Typically, the last decoder layer of the decoder results in a single channel that is used to reconstruct a speech waveform with the same dimension as the original speech waveform. The speech waveform reconstructed from the single channel represents a predicted speech waveform, substantially equivalent to an enhanced speech waveform (or a second speech waveform) associated with the original speech waveform (or first speech waveform).
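By way of illustration only, the following is a minimal sketch, in PyTorch, of the encoder-bottleneck-decoder wiring with skip connections described above. The layer internals are reduced to plain convolutions, the skip connections are combined by element-wise addition, and the kernel sizes, strides, channel counts, and the class name SkeletonEnhancer are assumptions introduced for this example rather than details taken from the disclosure.

```python
# Structural sketch (not the disclosed model itself) of the U-shaped wiring:
# encoder layers downsample and grow channels, a recurrent bottleneck transforms
# the latent output, and decoder layers upsample while consuming skip connections.
import torch
import torch.nn as nn

class SkeletonEnhancer(nn.Module):
    def __init__(self, hidden=48, depth=4):
        super().__init__()
        self.encoder = nn.ModuleList()
        self.decoder = nn.ModuleList()
        ch_in = 1
        for i in range(depth):
            ch_out = hidden * (2 ** i)
            # Each encoder layer halves the length and grows the channels.
            self.encoder.append(
                nn.Conv1d(ch_in, ch_out, kernel_size=4, stride=2, padding=1))
            # Decoder layers are built in reverse: upsample and shrink channels.
            self.decoder.insert(0, nn.ConvTranspose1d(
                ch_out, ch_in, kernel_size=4, stride=2, padding=1))
            ch_in = ch_out
        self.bottleneck = nn.GRU(ch_in, ch_in, num_layers=2, batch_first=True)

    def forward(self, noisy):                      # noisy: (batch, 1, time)
        skips = []
        x = noisy
        for enc in self.encoder:
            x = torch.relu(enc(x))
            skips.append(x)                        # saved for the skip connection
        x = self.bottleneck(x.transpose(1, 2))[0].transpose(1, 2)
        for dec in self.decoder:
            x = x + skips.pop()                    # encoder layer -> matching decoder layer
            x = dec(x)
        return x                                   # predicted (enhanced) waveform

# Usage: the input length is chosen as a multiple of 2**depth so shapes align.
model = SkeletonEnhancer()
wav = torch.randn(1, 1, 16000)
print(model(wav).shape)                            # torch.Size([1, 1, 16000])
```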
During training, the ML model is optimized using a loss function calculated based on (i) a difference between a clean speech waveform of the training data associated with an original speech waveform of the training data and an output of the ML model (e.g., an enhanced speech waveform) associated with the original speech waveform of the training data, and (ii) a difference between a magnitude spectrogram of the clean speech waveform of the training data and a magnitude spectrogram of the output of the ML model associated with the original speech waveform of the training data.
Aspects of the present disclosure overcome these deficiencies and others by reducing the computational complexity of the machine learning model used to predict an enhanced speech waveform from an original speech waveform, while simultaneously improving the performance of the prediction.
ML model 150 operates as a denoiser function that predicts a speech waveform (e.g., x_pred) by removing noise (e.g., x_noise) from an original speech waveform (e.g., x_noisy). The predicted speech waveform (e.g., x_pred) is substantially equivalent to an enhanced speech waveform (e.g., x). ML model 150 can be expressed mathematically by: Equation (2): x_pred = f(x_noisy) ≈ x.
ML model 150 may be deployed to an edge device to provide speech enhancement functionality. The edge device may be a computing device of modest processing and memory capabilities and can have access to local data (e.g., via an Internet-of-Things, or IoT, network) and to a cloud service. Computing device may be implemented on (or shared among) any number of computing devices and/or on a cloud. Computing device may be a desktop computer, a laptop computer, a smartphone, a tablet computer, a server, a computing device that accesses a remote server, a computing device that utilizes a virtualized computing environment, a gaming console, a wearable computer, a smart TV, and so on. Computing device may have any number of central processing units (CPUs) and graphical processing units (GPUs), including virtual CPUs and/or virtual GPUs, or any other suitable processing devices. Computing device may further have any number of memory devices, network controllers, peripheral devices, and the like. Peripheral devices may include various sensing devices, photographic cameras, video cameras, microphones, scanners, or any other devices for data intake.
ML model 150 includes an encoder 210, a bottleneck 220, and a decoder 230, with multiple skip connections that each directly connect an encoder layer of encoder 210 to a corresponding decoder layer of decoder 230.
Each encoder layer of the one or more encoder layers 310A-C of encoder 210 includes a CNN block 320, a residual network block 330, a squeeze-excitation (SE) block 340, and a CNN block 350.
The CNN block 320 includes a strided 1-D convolution (e.g., conv1D) layer, a rectified linear unit (ReLU) activation function, and a batch normalization (BN). The strided conv1D layer refers to a set of learnable filters (or kernels) that slide over 1-dimensional input data based on a stride. In some embodiments, the strided conv1D layer is causal, which indicates that the output at any given point in the sequence depends only on the current and past inputs. The stride refers to the number of positions the filter moves over the input data in each step. A stride greater than 1 leads to downsampling, which effectively reduces the length of the output sequence compared to the input. In some embodiments, the stride is set to 2, which indicates that the input is downsampled by a factor of 2 in each encoder layer.
Each filter convolves across the input data to produce a transformed version of the data. The number of filters determines the number of generated feature maps, also known as output channels. In other words, each filter processes all input channels simultaneously, producing a single feature map (or output channel). Therefore, an increased number of filters directly results in an increased number of output channels (or feature maps). The strided conv1D layer down-samples the input in length and increases the number of output channels. For example, each encoder layer inputs 2^(i−2)·H channels and outputs min(2^(i−1)·H, C_m) channels, where H refers to the number of hidden channels of the first encoder layer, C_m refers to a maximum allowed channel dimension, and i refers to the i-th encoder layer of the one or more encoder layers 310A-C. In some embodiments, C_m may be set to a predefined number (e.g., 768).
The ReLU activation function, when applied to the feature maps (e.g., the output of the strided conv1D layer), introduces non-linearity to enhance the representation of important patterns and characteristics within the feature maps. BN normalizes the feature maps by adjusting their mean and variance. BN effectively stabilizes and accelerates the learning process and ensures a consistent feature scale for more effective training of subsequent layers.
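By way of illustration only, the following is a minimal PyTorch sketch of such a first CNN block of an encoder layer: a causal, strided 1-D convolution followed by ReLU and batch normalization, with output channels following min(2^(i−1)·H, C_m). The value H=48, the kernel size of 8, the left-only padding used to obtain causality, and the class name EncoderConvBlock are assumptions; C_m=768 follows the example value given above.

```python
# Sketch of CNN block 320: causal strided conv1D -> ReLU -> batch norm.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderConvBlock(nn.Module):
    def __init__(self, layer_index: int, hidden: int = 48, max_channels: int = 768,
                 kernel_size: int = 8, stride: int = 2):
        super().__init__()
        i = layer_index                                   # i-th encoder layer (1-based)
        in_ch = 1 if i == 1 else min(2 ** (i - 2) * hidden, max_channels)
        out_ch = min(2 ** (i - 1) * hidden, max_channels)
        self.pad = kernel_size - stride                   # left-pad only -> causal output
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, stride=stride)
        self.bn = nn.BatchNorm1d(out_ch)

    def forward(self, x):                                 # x: (batch, in_ch, time)
        x = F.pad(x, (self.pad, 0))                       # no future samples are used
        return self.bn(F.relu(self.conv(x)))              # downsampled by `stride`

block = EncoderConvBlock(layer_index=1)
print(block(torch.randn(2, 1, 16000)).shape)              # torch.Size([2, 48, 8000])
```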
The residual network block 330, such as a Res2Net, receives an output of the CNN block 320 (e.g., feature maps). The residual network block 330 enables the presence of multiple sizes of receptive fields, thereby leading to the extraction of features at multiple scales. To do so, the residual network block 330 splits the inputted feature maps along the channel dimension into a plurality of subsets (e.g., x_1, x_2, x_3, x_4), each of which is associated with a convolution filter K_i.
The residual network block 330 generates an output, represented as y_i, for each subset. The output can be expressed mathematically by:
y_i = x_i for i = 1; y_i = K_i(x_i) for i = 2; and y_i = K_i(x_i + y_(i−1)) for i > 2.
That is, where i = 1, the residual network block 330 outputs x_i as y_i (i.e., y_1 = x_1); where i = 2, the residual network block 330 processes x_i through convolution filter K_i and outputs y_i (i.e., y_2 = K_2(x_2)); and where i > 2, the residual network block 330 processes x_i and y_(i−1) through convolution filter K_i and outputs y_i (e.g., y_3 = K_3(x_3 + y_2)). As a result, early subsets (e.g., y_1, y_2) have smaller receptive fields, capturing finer details, while later subsets (e.g., y_3, y_4, . . . ) effectively have larger receptive fields due to the combination of inputs. Thus, the larger receptive fields capture broader contextual information. Depending on the embodiment, context information for the output of each subset (e.g., y_1, y_2, y_3, y_4) may be increased using a combination of a dilated convolution with a predetermined dilation factor (e.g., 2), a ReLU activation function, and a BN.
The residual network block 330 may concatenate the outputs of the residual network block 330 (e.g., y_1, y_2, y_3, y_4) along the channel dimension. This results in new feature maps (e.g., feature maps y) with a spatial resolution that is the same as that of the inputted feature maps, but with a significantly higher number of channels. Feature maps y capture a wider range of information and allow the network to learn more complex features. The residual network block 330 forwards feature maps y to the SE block 340.
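By way of illustration only, the following is a minimal PyTorch sketch of such a hierarchical residual (Res2Net-style) block: the feature maps are split into subsets x_1 . . . x_s, with y_1 = x_1, y_2 = K_2(x_2), and y_i = K_i(x_i + y_(i−1)) for i > 2, and the outputs are concatenated along the channel dimension. The scale s = 4, the kernel size, the dilation factor of 2, and the class name Res2NetBlock are assumptions; in this simplified sketch each K_i preserves its subset width, so the concatenation restores the input channel count, whereas wider filters would yield the higher channel count described above.

```python
# Sketch of the Res2Net-style residual network block 330 with multi-scale receptive fields.
import torch
import torch.nn as nn

class Res2NetBlock(nn.Module):
    def __init__(self, channels: int, scale: int = 4, dilation: int = 2):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        # One small dilated convolution K_i per subset, except the first (identity).
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(width, width, kernel_size=3,
                          dilation=dilation, padding=dilation),
                nn.ReLU(),
                nn.BatchNorm1d(width),
            )
            for _ in range(scale - 1)
        )

    def forward(self, x):                       # x: (batch, channels, time)
        subsets = torch.chunk(x, self.scale, dim=1)
        outputs = [subsets[0]]                  # y_1 = x_1 (smallest receptive field)
        prev = None
        for i in range(1, self.scale):
            inp = subsets[i] if prev is None else subsets[i] + prev
            prev = self.convs[i - 1](inp)       # y_i = K_i(x_i + y_(i-1))
            outputs.append(prev)
        return torch.cat(outputs, dim=1)        # multi-scale feature maps y

block = Res2NetBlock(channels=48)
print(block(torch.randn(2, 48, 1000)).shape)    # torch.Size([2, 48, 1000])
```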
The SE block 340 includes a residual block 510, a squeeze block 520, an excitation block 530, a scaling block 540, and a residual connection 550.
The residual block 510 receives the feature maps y and extracts and/or transforms features, resulting in modified feature maps y. The modified feature maps y are forwarded to the squeeze block 520. The squeeze block 520 squeezes each channel to a single numeric value using a global average pooling (GAP) layer. The GAP layer can be expressed mathematically by:
z_c = (1/T)·Σ_(t=1)^(T) y_t,
where z_c represents an output value for channel c of the pooled feature maps (e.g., pooling of the modified feature maps y), y_t represents a value at position t (a row and column) of channel c in the modified feature maps y, and T represents the spatial dimensions (height and width) of the modified feature maps y. Accordingly, for each channel c of the modified feature maps y, the GAP layer iterates through all elements t, sums the values of all elements in the channel, and divides the sum by the total number of elements in the channel (T); the resulting value z_c represents the average activation for channel c in the pooled feature maps. The values z_c across all channels together form a vector ž capturing a representation of the global information in the inputted feature maps.
Vector ž is forwarded to the excitation block 530, which calculates weights for each channel, resulting in a vector of scale values s, using a first fully connected (FC) layer, a non-linear activation function, a second FC layer, and a sigmoid activation. The excitation block 530 can be expressed mathematically by: Equation (6): s = σ(W_2·ƒ(W_1·ž + b_1) + b_2).
In view of Equation (6), the first FC layer, characterized by the weight matrix W_1, reduces the dimensionality of vector ž. The non-linear activation function, represented as ƒ( ), is applied to the output of the first FC layer to introduce non-linearity. The second FC layer, characterized by the weight matrix W_2, receives the output of the non-linear activation function and increases the dimensionality back to its original size. The sigmoid activation, represented as σ, receives the output of the second FC layer and transforms each element of that output to a value between 0 and 1. The output of the sigmoid activation, represented as s, is, as noted above, a vector of scale values, one for each channel. The vector of scale values contains channel weights that are used to adaptively recalibrate the feature maps by scaling each channel based on learned importance.
The vector of scale values (e.g., vector s) is forwarded to the scaling block 540. In addition to receiving vector s, the scaling block 540 receives the modified feature maps y. The scaling block 540 recalibrates the modified feature maps y by performing channel-wise multiplication of the modified feature maps y with the vector of scale values (e.g., vector s). Thus, the scaling block 540 adaptively emphasizes or suppresses different features in the modified feature maps y. The scaling block 540 (or channel-wise multiplication) can be expressed mathematically by: Equation (7): z_c = s_c·y_c, where y_c is the c-th channel of the modified feature maps y, s_c is the c-th scale value of s, and z_c is the recalibrated c-th channel of the modified feature maps y. In other words, channel-wise multiplication is an element-wise operation: for each channel in the modified feature maps y, every element of that channel is multiplied by a single scalar value (the channel-specific weight). This operation scales the intensity of the features in each channel. Channels with higher weights will have their features amplified (or emphasized), while those with lower weights will have their features attenuated (or suppressed). Accordingly, the output of the scaling block 540 is the recalibrated feature maps.
The recalibrated feature maps are forwarded to the residual connection 550. In addition to receiving the recalibrated feature maps, the residual connection 550 receives the feature maps y. The residual connection 550, represented as a "+" sign, adds the feature maps y to the recalibrated feature maps to generate enhanced feature maps (e.g., enhanced feature maps y′). The enhanced feature maps y′ are an enriched feature representation that combines original and learned features. The SE block 340 forwards the output of the SE block 340 (e.g., enhanced feature maps y′) to the CNN block 350.
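By way of illustration only, the following is a minimal PyTorch sketch of the squeeze-excitation path described above: a global average pooling squeeze, an excitation of two fully connected layers with a sigmoid (Equation (6)), channel-wise rescaling (Equation (7)), and the residual connection that produces the enhanced feature maps y′. The reduction ratio, the use of ReLU as the non-linear activation ƒ, the representation of the residual block 510 by a single 1×1 convolution, and the class name SEBlock are assumptions.

```python
# Sketch of SE block 340: squeeze -> excitation -> scaling -> residual connection.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.residual = nn.Conv1d(channels, channels, kernel_size=1)   # stand-in for block 510
        self.fc1 = nn.Linear(channels, channels // reduction)          # W_1, b_1
        self.fc2 = nn.Linear(channels // reduction, channels)          # W_2, b_2

    def forward(self, y):                              # y: (batch, channels, time)
        modified = self.residual(y)                    # modified feature maps
        z = modified.mean(dim=2)                       # squeeze: z_c = (1/T)*sum_t y_t
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # excitation, Eq. (6)
        recalibrated = modified * s.unsqueeze(-1)      # scaling: z_c = s_c * y_c, Eq. (7)
        return y + recalibrated                        # residual connection -> enhanced maps y'

se = SEBlock(channels=48)
print(se(torch.randn(2, 48, 1000)).shape)              # torch.Size([2, 48, 1000])
```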
The CNN block 350 receives enhanced feature maps y′. The CNN block 350 includes a convolutional layer (e.g., 1×1 convolutional layer) and a gated linear unit (GLU). The convolutional layer may be used to double the number of channels in enhanced feature maps y′. The GLU may be used to halve the number of channels in enhanced feature maps y′ based on the number of filters. The GLU is an activation function that combines the outputs of two linear units using a gating mechanism, thereby allowing the network to selectively focus on relevant information. The output of the CNN block 350 is a latent representation of enhanced feature maps y′ (e.g., latent output z). Latent representation refers to a learned abstract representation of enhanced feature maps y′ that captures essential information while typically being in a more compressed or simplified form.
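By way of illustration only, the following is a minimal PyTorch sketch of such a 1×1-convolution-plus-GLU block: the convolution doubles the channel count and the GLU gates and halves it, producing the latent output z of the encoder layer. The class name GLUConvBlock and the example channel count are assumptions.

```python
# Sketch of CNN block 350: 1x1 convolution doubles channels, GLU gates and halves them.
import torch
import torch.nn as nn

class GLUConvBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.expand = nn.Conv1d(channels, 2 * channels, kernel_size=1)  # double channels
        self.glu = nn.GLU(dim=1)     # splits channels in half; one half gates the other

    def forward(self, y_prime):                  # y_prime: enhanced feature maps (batch, C, time)
        return self.glu(self.expand(y_prime))    # latent output z: (batch, C, time)

blk = GLUConvBlock(channels=48)
print(blk(torch.randn(2, 48, 1000)).shape)       # torch.Size([2, 48, 1000])
```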
The output of the last encoder layer of encoder 210 (e.g., a final latent output z) is forwarded to bottleneck 220.
Bottleneck 220 receives the final latent output z and outputs a non-linear transformation of the final latent output z (e.g., latent output ž). In some embodiments, the final latent output z is the same size as the latent output ž. Bottleneck 220 includes a sequence of uni-directional gated recurrent unit (GRU) layers (e.g., GRU layer 610A and GRU layer 610B).
Each of the GRU layers can be expressed mathematically by: Equation (8): ž = GRU(z), where GRU refers to a GRU layer (e.g., GRU layer 610A or GRU layer 610B) and ž is the non-linear transformation of the inputted latent output z. Accordingly, GRU layer 610A receives the inputted latent output z, received from encoder 210, and generates a non-linear transformation of the latent output z (e.g., the latent output ž). GRU layer 610A forwards the latent output ž to GRU layer 610B. GRU layer 610B receives its input (e.g., the latent output ž received from GRU layer 610A) and generates a non-linear transformation of that input (e.g., an updated latent output ž). In other words, each GRU layer (e.g., GRU layer 610A and GRU layer 610B) performs a non-linear transformation on its input. It should be noted that other recurrent neural networks (RNNs) may be used; for example, one or more of the GRU layers may be replaced by a long short-term memory (LSTM) network, a bi-directional LSTM with multiple self-attention layers, etc. The latent output ž is forwarded to decoder 230.
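By way of illustration only, the following is a minimal PyTorch sketch of such a bottleneck built from two stacked uni-directional GRU layers; the hidden size is set equal to the channel count so that the latent output ž has the same size as the final latent output z. The class name Bottleneck and the example channel count of 384 are assumptions.

```python
# Sketch of bottleneck 220: two uni-directional GRU layers applied in sequence.
# The transposes adapt between the (batch, channels, time) convolutional layout
# and the (batch, time, features) layout that the GRU expects.
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Two GRU layers (e.g., 610A and 610B) with hidden size equal to the
        # channel count, so the output has the same size as the input.
        self.gru = nn.GRU(channels, channels, num_layers=2, batch_first=True)

    def forward(self, z):                          # z: (batch, channels, time)
        out, _ = self.gru(z.transpose(1, 2))       # non-linear transformation ž
        return out.transpose(1, 2)                 # same shape as z

bottleneck = Bottleneck(channels=384)
print(bottleneck(torch.randn(2, 384, 63)).shape)   # torch.Size([2, 384, 63])
```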
Decoder 230 produces a predicted speech waveform (e.g., x_pred) associated with the original speech waveform (e.g., x_noisy) by processing the latent output ž through a plurality of decoder layers. In some embodiments, the number of decoder layers of decoder 230 matches the number of encoder layers of encoder 210. Decoder 230 can be expressed mathematically by:
x_pred = D(ž) ≈ x,
where D refers to decoder 230 including a plurality of decoder layers, ž is the latent output received from bottleneck 220, and x corresponds to enhanced speech waveform 180. Each decoder layer of decoder 230 includes a convolutional layer (e.g., a 1×1 convolutional layer), a gated linear unit (GLU), and a transposed 1D convolution. The convolutional layer may be used to double the number of channels. The GLU may be used to halve the number of channels. The transposed 1D convolution, also known as a fractionally-strided convolution or deconvolution, is a type of convolutional operation used to increase the dimensionality by upsampling and halving the number of channels. Additionally, as noted above, the skip connection which connects a respective decoder layer to a corresponding encoder layer of encoder 210 provides the respective decoder layer with an output of the corresponding encoder layer. Thus, the latent output from the corresponding encoder layer (i.e., detailed features) and the latent output ž received from bottleneck 220 (i.e., abstract features) assist the decoder layer in reconstructing the speech waveform. Typically, the last decoder layer of decoder 230 results in a single channel that is used to reconstruct a speech waveform with the same dimension as the original speech waveform.
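By way of illustration only, the following is a minimal PyTorch sketch of a single decoder layer: the skip connection from the matching encoder layer is combined with the incoming features, a 1×1 convolution doubles the channels, a GLU halves them, and a transposed 1-D convolution upsamples the sequence while halving the channel count. The element-wise addition used for the skip connection, the kernel size and stride, and the class name DecoderLayer are assumptions.

```python
# Sketch of one decoder layer: skip combination -> 1x1 conv -> GLU -> transposed conv1D.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, in_channels: int, kernel_size: int = 4, stride: int = 2):
        super().__init__()
        self.expand = nn.Conv1d(in_channels, 2 * in_channels, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.upsample = nn.ConvTranspose1d(in_channels, in_channels // 2,
                                           kernel_size, stride=stride, padding=1)

    def forward(self, x, skip):                    # both: (batch, in_channels, time)
        x = x + skip                               # skip connection from the encoder layer
        x = self.glu(self.expand(x))               # 1x1 conv doubles, GLU halves
        return self.upsample(x)                    # longer sequence, half the channels

layer = DecoderLayer(in_channels=96)
x = torch.randn(2, 96, 500)
print(layer(x, x).shape)                           # torch.Size([2, 48, 1000])
```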
For example, the initial decoder layer receives the latent output ž from bottleneck 220. In addition, the initial decoder layer receives the latent output from a corresponding encoder layer of encoder 210. The initial decoder layer reconstructs a speech waveform, which, in some embodiments, is accomplished by upsampling and decreasing the number of channels. The speech waveform is then fed into a subsequent decoder layer. The subsequent decoder layer further reconstructs a subsequent speech waveform, which, as noted above, may be accomplished by upsampling and decreasing the number of channels. The subsequent decoder layer uses the speech waveform received from the initial decoder layer and a latent output from a corresponding encoder layer of encoder 210 to generate the subsequent speech waveform. In other words, each subsequent decoder layer receives a speech waveform from the previous decoder layer and a latent output from a corresponding encoder layer, and reconstructs a subsequent speech waveform. This is repeated until the last decoder layer, which outputs a speech waveform reconstructed from a single channel with the same dimension as original speech waveform 110 (e.g., x_noisy). The speech waveform reconstructed from the single channel represents a predicted speech waveform (e.g., x_pred) associated with original speech waveform 110 (e.g., x_noisy). As noted above, the predicted speech waveform (e.g., x_pred) is substantially equivalent to enhanced speech waveform 180 (e.g., x).
In one embodiment, dataset 710 includes a collection of clean speech recordings from various speakers (e.g., ˜40 speakers) with diverse accents and speaking styles and a collection of noisy speech recordings obtained by adding various types of noise (e.g., babble, car noise, street noise) at different signal-to-noise ratios (SNRs). Training set 720 includes a plurality of utterances (e.g., ˜12,000) from multiple speakers (e.g., ˜30 speakers). Each utterance of training set 720 includes clean utterances mixed with background noise of various noise types taken from dataset 710 (e.g., ˜8 different noise types) and artificial noise types at various SNRs (e.g., 4 different SNRs). Testing set 730 includes a plurality of utterances (e.g., ˜1,000) from multiple unseen speakers (e.g., ˜2 speakers). Each utterance of testing set 730 is mixed with background noise of various noise types taken from dataset 710 (e.g., ˜5 different noise types) at various SNRs (e.g., 4 different SNRs). In some embodiments, the various SNRs used for training set 720 are different from the various SNRs used for testing set 730. In other embodiments, the various SNRs used for training set 720 are the same as the various SNRs used for testing set 730.
In another embodiment, dataset 710 includes a collection of audio clips from a plurality of speakers (e.g., ˜10,000 speakers) with a predetermined sampling frequency (e.g., 16 kHz), organized into a clean set and a noise set. The clean set includes a predetermined amount of clean speech (e.g., ˜500 hours of clean speech). The noise set includes a predetermined amount of noise (e.g., ˜180 hours) from various categories of noise (e.g., 150 noise classes). Training set 720 includes a plurality of clean-noise pairs (e.g., 500 hours of clean and noisy speech pairs) at various SNR levels (e.g., 31 SNR levels between −5 and 25 dB) with a predetermined silence length (e.g., 0.2 seconds) in each clean speech waveform. Testing set 730 includes a plurality of artificial (or synthetic) clean-noise pairs (e.g., 150 pairs) with and without reverb.
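By way of illustration only, the following is a minimal PyTorch sketch of how a noisy utterance of a clean-noise pair can be synthesized by mixing a clean utterance with a noise clip at a chosen SNR, as in the training and testing pairs described above. The scaling formula, the epsilon guard, and the function name mix_at_snr are assumptions.

```python
# Sketch of mixing clean speech with noise at a target SNR (in dB).
import torch

def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    noise = noise[..., : clean.shape[-1]]                  # match lengths
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)     # guard against silent noise
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean = torch.randn(1, 16000)
noise = torch.randn(1, 16000)
noisy = mix_at_snr(clean, noise, snr_db=5.0)
print(noisy.shape)                                          # torch.Size([1, 16000])
```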
During training of ML model 150, each training data pair (x_noisy, x) of training set 720 is randomly cropped into a clip of predetermined length (e.g., a T-second clip). Data augmentation, such as shuffling the noises within a batch, removing a fraction of the frequencies uniformly, or adding decaying echoes of the clean speech and noise, may be applied to the cropped clip. Each training data pair of training set 720 is forwarded to ML model 150 to output an enhanced speech waveform x̂. The enhanced speech waveform x̂ is forwarded to loss function 780. The clean speech waveform x of the training data pair is also forwarded to loss function 780. Loss function 780 thus receives the enhanced speech waveform x̂ and the clean speech waveform x.
Loss function 780 is used for optimization of ML model 150. Loss function 780 incorporates an L1 distance loss over the waveform and a multi-resolution short-time Fourier transform (MRSTFT) loss over the magnitude spectrogram. Both losses are computed between the clean speech waveform x and the enhanced speech waveform x̂. The MRSTFT loss is defined as a sum of a spectral convergence (sc) loss and a magnitude (mag) loss. The MRSTFT loss, referred to as L_MRSTFT, can be expressed mathematically by:
L_MRSTFT = (1/M)·Σ_(i=1)^(M) [ L_sc(x, x̂; ϕ_i) + L_mag(x, x̂; ϕ_i) ], with L_sc(x, x̂; ϕ_i) = ∥S(x; ϕ_i) − S(x̂; ϕ_i)∥_F/∥S(x; ϕ_i)∥_F and L_mag(x, x̂; ϕ_i) = ∥S(x; ϕ_i) − S(x̂; ϕ_i)∥_1,
where S(x; ϕ_i) = |STFT(x)| refers to the linear-scale magnitude spectrogram of x, S(x̂; ϕ_i) = |STFT(x̂)| refers to the linear-scale magnitude spectrogram of x̂, ϕ_i represents the STFT hyper-parameters at the i-th resolution, ∥ ∥_F and ∥ ∥_1 refer to the Frobenius and L1 norms, respectively, and M refers to the total number of resolutions. The STFT hyper-parameters include, for example, a number of FFT bins, hop sizes, and window lengths. The number of FFT bins, for example, may be ∈ {512, 1024, 2048}. The hop sizes, for example, may be ∈ {50, 120, 240}. The window lengths, for example, may be ∈ {240, 600, 1200}.
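By way of illustration only, the following is a minimal PyTorch sketch of the MRSTFT loss described above: for each STFT resolution, a spectral-convergence term (Frobenius norm) and a magnitude term (L1 norm) are computed on the linear-scale magnitude spectrograms and combined over the M resolutions. The pairing of FFT sizes, hop sizes, and window lengths into the three resolutions, the 1/M averaging, the normalization of the L1 term, and the function names are assumptions.

```python
# Sketch of the multi-resolution STFT (MRSTFT) loss.
import torch

RESOLUTIONS = [(512, 50, 240), (1024, 120, 600), (2048, 240, 1200)]  # (n_fft, hop, win)

def magnitude_spectrogram(x, n_fft, hop, win):
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs()                                    # S(x; phi_i) = |STFT(x)|

def mrstft_loss(clean: torch.Tensor, enhanced: torch.Tensor) -> torch.Tensor:
    loss = 0.0
    for n_fft, hop, win in RESOLUTIONS:
        s_clean = magnitude_spectrogram(clean, n_fft, hop, win)
        s_enh = magnitude_spectrogram(enhanced, n_fft, hop, win)
        sc = torch.norm(s_clean - s_enh, p="fro") / torch.norm(s_clean, p="fro")
        mag = torch.norm(s_clean - s_enh, p=1) / s_clean.numel()   # normalized L1
        loss = loss + sc + mag
    return loss / len(RESOLUTIONS)                       # averaged over M resolutions

clean, enhanced = torch.randn(2, 64000), torch.randn(2, 64000)
print(mrstft_loss(clean, enhanced))
```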
Given the MRSTFT loss, the resulting loss function 780 can be expressed mathematically by: L = ∥x − x̂∥_1 + L_MRSTFT(x, x̂), where the first term is the L1 distance between the clean speech waveform x and the enhanced speech waveform x̂ computed over the waveform, and the second term is the MRSTFT loss defined above.
Loss function 780 is used for model optimization. In particular, the result of loss function 780, represented as L (e.g., result 790), is fed back into ML model 150, and a gradient with respect to each parameter is calculated. Additional operations, such as applying an Adam optimizer with momentum, may be performed on the gradients to further optimize the ML model. The Adam optimizer with momentum, also known as AMSGrad, adaptively adjusts the learning rate for each parameter based on their individual gradients and past squared gradients, and accumulates past gradients to build up momentum in the direction of descent. Additionally, a linear warmup with a cosine annealing learning rate may be used during training. Simply put, loss function 780, in conjunction with other optimization methods, adjusts the internal parameters (weights and biases) of ML model 150 to minimize the overall loss. The processes of prediction by ML model 150, loss calculation by loss function 780, and parameter adjustment are repeated through various iterations over the training data.
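By way of illustration only, the following is a minimal PyTorch sketch of one optimization step combining an L1 waveform loss with the MRSTFT loss, using Adam in its AMSGrad variant and a linear-warmup-plus-cosine-annealing learning-rate schedule. The learning rate, warmup length, annealing horizon, and equal weighting of the two loss terms are assumptions, and `model` and `mrstft_loss` refer to the illustrative sketches above rather than to the disclosed implementation.

```python
# Sketch of one training step: prediction, loss, backpropagation, parameter update.
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, amsgrad=True)   # AMSGrad variant
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=500),
        torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000),
    ],
    milestones=[500],
)

def training_step(noisy: torch.Tensor, clean: torch.Tensor) -> float:
    enhanced = model(noisy)                                   # predicted speech waveform
    loss = F.l1_loss(enhanced, clean) \
        + mrstft_loss(clean.squeeze(1), enhanced.squeeze(1))  # waveform L1 + MRSTFT
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()

print(training_step(torch.randn(1, 1, 16000), torch.randn(1, 1, 16000)))
```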
At block 810, processing logic receives, by the speech enhancement system, an original speech waveform. As previously described, the original speech waveform is composed of clean speech that may be corrupted by a convolutive room impulse response (RIR) and/or additive background noise. The speech enhancement system operates as a denoiser function that predicts a speech waveform, which is substantially equivalent to an enhanced speech waveform.
At block 820, processing logic generates a latent output of the original speech waveform. As previously described, an encoder of the speech enhancement system is used to generate the latent output of the original speech waveform. In particular, each encoder layer of the encoder includes a residual network block and an SE block. The residual network block of a respective encoder layer receives feature maps (or the latent output of the previous encoder layer) and generates multi-scale feature maps with multiple sizes of receptive fields.
The SE block of a respective encoder layer receives the multi-scale feature maps from the residual network block. The SE block squeezes the multi-scale feature maps to a vector of average activation. The SE block generates, using the vector of average activation, a vector of scale values. The SE block recalibrates, using the vector of scale values, the multi-scale feature maps. The SE block combines the recalibrated multi-scale feature maps with the original multi-scale feature maps to generate enhanced feature maps. The encoder layer generates a latent representation of the enhanced feature maps (e.g., a latent output). The output of the last encoder layer represents the latent output of the original speech waveform.
In some embodiments, a bottleneck of the speech enhancement system receives the latent output of the original speech waveform and outputs a non-linear transformation of the latent output of the original speech waveform which may be forwarded to the decoder.
At block 830, processing logic reconstructs, from the latent output, an enhanced speech waveform associated with the original speech waveform. The latent output refers to the latent output of the original speech waveform. As previously described, each decoder layer of the decoder increases the dimensionality of an input (e.g., the latent output or an output of a previous decoder layer) by upsampling and halving the number of channels. Each decoder layer further receives, via a skip connection connected to a corresponding encoder layer, an output of the corresponding encoder layer. The output of the corresponding encoder layer, which includes detailed features, and the latent output, which includes abstract features, assist each decoder layer in reconstructing a speech waveform. In this manner, each decoder layer upsamples its input while decreasing the number of channels. The output of the last decoder layer of the decoder is a speech waveform reconstructed from a single channel (e.g., an enhanced speech waveform). In some embodiments, the speech waveform outputted by the last decoder layer has the same dimension as the original speech waveform.
The example computer system 900 includes a processing device (processor) 902, a main memory 904 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), or Rambus DRAM (RDRAM), etc.), a static memory 906 (e.g., flash memory, static random access memory (SRAM), etc.), and a data storage device 918, which communicate with each other via a bus 940.
Processor (processing device) 902 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, the processor 902 can be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. The processor 902 can also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. The processor 902 can include processing logic 922 used to perform the operations discussed herein. The processor 902 is configured to execute instructions 905 for performing the operations discussed herein.
The computer system 900 can further include a network interface device 908. The computer system 900 also can include a video display unit 910 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an input device 912 (e.g., a keyboard, an alphanumeric keyboard, a motion sensing input device, a touch screen), a cursor control device 914 (e.g., a mouse), and a signal generation device 920 (e.g., a speaker).
The data storage device 918 can include a non-transitory machine-readable storage medium 924 (also computer-readable storage medium) on which is stored one or more sets of instructions 926 embodying any one or more of the methodologies or functions described herein. The instructions can also reside, completely or at least partially, within the main memory 904 and/or within the processor 902 during execution thereof by the computer system 900, the main memory 904 and the processor 902 also constituting machine-readable storage media. The instructions can further be transmitted or received over a network 930 via the network interface device 908.
While the computer-readable storage medium 924 (machine-readable storage medium) is shown in an exemplary implementation to be a single medium, the terms “computer-readable storage medium” and “machine-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The terms “computer-readable storage medium” and “machine-readable storage medium” shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure. The terms “computer-readable storage medium” and “machine-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media.
Reference throughout this specification to “one implementation,” “one embodiment,” “an implementation,” or “an embodiment,” means that a particular feature, structure, or characteristic described in connection with the implementation and/or embodiment is included in at least one implementation and/or embodiment. Thus, the appearances of the phrase “in one implementation,” or “in an implementation,” in various places throughout this specification can, but are not necessarily, referring to the same implementation, depending on the circumstances. Furthermore, the particular features, structures, or characteristics can be combined in any suitable manner in one or more implementations.
To the extent that the terms “comprises,” “comprising,” “includes,” “including,” “has,” “contains,” variants thereof, and other similar words are used in either the detailed description or the claims, these terms are intended to be inclusive in a manner similar to the term “comprising” as an open transition word without precluding any additional or other elements.
As used in this application, the terms “block,” “layer,” “component,” “module,” “system,” or the like are generally intended to refer to a computer-related entity, either hardware (e.g., a circuit), software, a combination of hardware and software, or an entity related to an operational machine with one or more specific functionalities. For example, a component can be, but is not limited to being, a process running on a processor (e.g., digital signal processor), a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers. Further, a “device” can come in the form of specially designed hardware; generalized hardware made specialized by the execution of software thereon that enables hardware to perform specific functions (e.g., generating interest points and/or descriptors); software on a computer-readable medium; or a combination thereof.
The aforementioned systems, circuits, modules, and so on have been described with respect to interaction between several components and/or blocks. It can be appreciated that such systems, circuits, components, blocks, and so forth can include those components or specified sub-components, some of the specified components or sub-components, and/or additional components, and according to various permutations and combinations of the foregoing. Sub-components can also be implemented as components communicatively coupled to other components rather than included within parent components (hierarchical). Additionally, it should be noted that one or more components can be combined into a single component providing aggregate functionality or divided into several separate sub-components, and any one or more middle layers, such as a management layer, can be provided to communicatively couple to such sub-components in order to provide integrated functionality. Any components described herein can also interact with one or more other components not specifically described herein but known by those of skill in the art.
Moreover, the words “example” or “exemplary” are used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the words “example” or “exemplary” is intended to present concepts in a concrete fashion. As used in this application, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or.” That is, unless specified otherwise, or clear from context, “X employs A or B” is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then “X employs A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form.
Finally, implementations described herein include a collection of data describing a user and/or activities of a user. In one implementation, such data is only collected upon the user providing consent to the collection of this data. In some implementations, a user is prompted to explicitly allow data collection. Further, the user can opt-in or opt-out of participating in such data collection activities. In one implementation, the collected data is anonymized prior to performing any analysis to obtain any statistical patterns so that the identity of the user cannot be determined from the collected data.
This application claims the benefit of U.S. Provisional Patent Application No. 63/536,034 filed Aug. 31, 2023, the entire contents of which are incorporated by reference herein.