This disclosure pertains to systems, methods, and media for speech enhancement via attenuating distortion.
Audio devices, such as headphones, speakers, etc., are widely deployed. People frequently listen to audio content (e.g., podcasts, radio shows, television shows, music videos, user-generated content, short-form videos, video meetings, teleconferencing meetings, panel discussions, interviews, etc.) that may include distortion, such as reverberation and/or noise. Additionally, audio content may include far-field audio content, such as background noise. Enhancement, such as dereverberation and/or noise suppression, may be performed on such audio content. However, enhancement techniques may introduce unwanted perceptual distortions, such as changes in loudness or timbre.
Throughout this disclosure, including in the claims, the terms “speaker,” “loudspeaker” and “audio reproduction transducer” are used synonymously to denote any sound-emitting transducer (or set of transducers). A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), which may be driven by a single, common speaker feed or multiple speaker feeds. In some examples, the speaker feed(s) may undergo different processing in different circuitry branches coupled to the different transducers.
Throughout this disclosure, including in the claims, the expression performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).
Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system.
Throughout this disclosure including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
At least some aspects of the present disclosure may be implemented via methods. Some methods may involve (a) obtaining, by a control system, a training set comprising a plurality of training samples, each training sample of the plurality of training samples comprising a distorted audio signal and a corresponding clean audio signal. Some methods may involve (b) for a training sample of the plurality of training samples: obtaining, by the control system, a frequency-domain representation of the distorted audio signal; providing, by the control system, the frequency-domain representation of the distorted audio signal to a convolutional neural network (CNN) comprising a plurality of convolutional layers and to a recurrent element, wherein an output of the recurrent element is provided to a subset of the plurality of convolutional layers; generating, by the control system and using the CNN, a predicted enhancement mask, wherein the CNN generates the predicted enhancement mask based at least in part on the output of the recurrent element; generating, by the control system, a predicted enhanced audio signal based at least in part on the predicted enhancement mask; and updating, by the control system, weights associated with the CNN and the recurrent element based at least in part on the predicted enhanced audio signal and the corresponding clean audio signal. Some methods may involve (c) repeating (b) by the control system until a stopping criterion is reached, wherein the updated weights at a time the stopping criterion is reached correspond to a trained machine learning model for enhancing audio signals.
In some examples, obtaining the frequency-domain representation of the distorted audio signal comprises: generating an initial frequency-domain representation of the distorted audio signal; and applying a filter that represents filtering of a human cochlea to the initial frequency-domain representation of the distorted audio signal to generate the frequency-domain representation of the distorted audio signal.
In some examples, the plurality of convolutional layers comprise a first subset of convolutional layers with increasing dilation values and a second subset of convolutional layers with decreasing dilation values. In some examples, an output of a convolutional layer of the first subset of convolutional layers is passed to a convolutional layer of the second subset of convolutional layers having a same dilation value. In some examples, the output of the recurrent element is provided to the second subset of convolutional layers.
In some examples, the output of the recurrent element is provided to the subset of the plurality of convolutional layers by reshaping the output of the recurrent element.
In some examples, generating the predicted enhanced audio signal comprises multiplying the predicted enhancement mask by the frequency-domain representation of the distorted audio signal.
In some examples, methods may further involve using the updated weights to generate at least one enhanced audio signal by providing a distorted audio signal to the trained machine learning model.
In some examples, the recurrent element is a gated recurrent unit (GRU).
In some examples, the recurrent element is a long short-term memory (LSTM) network or an Elman recurrent neural network (RNN).
In some examples, the distorted audio signal includes reverberation and/or noise.
In some examples, the trained machine learning model for enhancing audio signals is used for noise suppression and/or dereverberation.
In some examples, updating the weights associated with the CNN and the recurrent element comprises determining a loss term based at least in part on a degree of reverberation present in the predicted enhanced audio signal. In some examples, the loss term is determined further based on a difference between the predicted enhanced audio signal and a clean audio signal corresponding to the training sample. In some examples, the loss term is a weighted sum of (i) the difference between the predicted enhanced audio signal and the clean audio signal and (ii) the degree of reverberation.
Some methods may involve obtaining, by a control system, a distorted audio signal. Some methods may involve generating, by the control system, a frequency-domain representation of the distorted audio signal. Some methods may involve providing, by the control system, the frequency-domain representation to a trained machine learning model, wherein the trained machine learning model comprises a convolutional neural network (CNN) comprising a plurality of convolutional layers and a recurrent element, wherein an output of the recurrent element is provided to a subset of the plurality of convolutional layers. Some methods may involve determining, by the control system, an enhancement mask based on an output of the trained machine learning model. Some methods may involve generating, by the control system, a spectrum of an enhanced audio signal based at least in part on the enhancement mask and the distorted audio signal. Some methods may involve generating, by the control system, the enhanced audio signal based on the spectrum of the enhanced audio signal.
In some examples, obtaining the frequency-domain representation of the distorted audio signal comprises: generating an initial frequency-domain representation of the distorted audio signal; and applying a filter that represents filtering of a human cochlea to the initial frequency-domain representation of the distorted audio signal to generate the frequency-domain representation of the distorted audio signal.
In some examples, the plurality of convolutional layers comprise a first subset of convolutional layers with increasing dilation values and a second subset of convolutional layers with decreasing dilation values. In some examples, an output of a convolutional layer of the first subset of convolutional layers is passed to a convolutional layer of the second subset of convolutional layers having a same dilation value. In some examples, the output of the recurrent element is provided to the second subset of convolutional layers.
In some examples, the output of the recurrent element is provided to the subset of the plurality of convolutional layers by reshaping the output of the recurrent element.
In some examples, the recurrent element is a gated recurrent unit (GRU).
In some examples, the recurrent element is a long short-term memory (LSTM) network or an Elman recurrent neural network (RNN).
In some examples, generating the enhanced audio signal comprises multiplying the enhancement mask by the frequency-domain representation of the distorted audio signal.
In some examples, the distorted audio signal is a live-captured audio signal.
In some examples, the distorted audio signal includes one or more of reverberation or noise.
In some examples, the trained machine learning model has been trained to perform dereverberation and/or noise suppression.
In some examples, methods further involve causing the enhanced audio signal to be presented by one or more speaker devices.
Some methods involve (a) obtaining, by a control system, a training set comprising a plurality of training samples, each training sample of the plurality of training samples comprising a reverberated audio signal and a corresponding clean audio signal. Some methods involve (b) for a training sample of the plurality of training samples: providing, by the control system, a representation of the reverberated audio signal to a machine learning model comprising a set of weights; generating, by the control system and using the machine learning model, a predicted dereverberated audio signal based at least in part on the representation of the reverberated audio signal; determining, by the control system, a loss term based at least in part on a degree of reverberation present in the predicted dereverberated audio signal; and updating the set of weights based at least in part on the loss term. Some methods involve (c) repeating (b) by the control system until a stopping criterion is reached, wherein the updated set of weights at a time the stopping criterion is reached comprises a trained machine learning model for dereverberating audio signals.
In some examples, the degree of reverberation is based on a two-dimensional acoustic-modulation frequency spectrum of the predicted dereverberated audio signal. In some examples, the degree of reverberation comprises a ratio of energy in a high modulation frequency portion of the two-dimensional acoustic-modulation frequency spectrum to energy over a low modulation frequency portion of the two-dimensional acoustic-modulation frequency spectrum. In some examples, the ratio is determined for each of a plurality of critical bands of the predicted dereverberated audio signal. In some examples, the degree of reverberation comprises a mean of the ratio over the plurality of critical bands.
In some examples, the loss term is determined further based on a difference between the predicted dereverberated audio signal and a clean audio signal corresponding to the training sample. In some examples, the loss term is a weighted sum of: (i) the difference between the predicted dereverberated audio signal and the clean audio signal and (ii) the degree of reverberation.
In some examples, some methods further involve using the updated weights to generate at least one dereverberated audio signal by providing a reverberated audio signal to the trained machine learning model.
Some or all of the operations, functions and/or methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure can be implemented via one or more non-transitory media having software stored thereon.
At least some aspects of the present disclosure may be implemented via an apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some implementations, an apparatus is, or includes, an audio processing system having an interface system and a control system. The control system may include one or more general purpose single- or multi-chip processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or combinations thereof.
Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Like reference numbers and designations in the various drawings indicate like elements.
Audio signals may include various types of distortions, such as noise and/or reverberation. For example, reverberation occurs when an audio signal is distorted by various reflections off of various surfaces (e.g., walls, ceilings, floors, furniture, etc.). Reverberation may have a substantial impact on sound quality and speech intelligibility. Accordingly, dereverberation of an audio signal may be performed, for example, to improve speech intelligibility and clarity.
Sound arriving at a receiver (e.g., a human listener, a microphone, etc.) is made up of direct sound, which includes sound directly from the source without any reflections, and reverberant sound, which includes sound reflected off of various surfaces in the environment. The reverberant sound includes early reflections and late reflections. Early reflections may reach the receiver soon after or concurrently with the direct sound, and may therefore be partially integrated into the direct sound. The integration of early reflections with direct sound creates a spectral coloration effect which contributes to a perceived sound quality. The late reflections arrive at the receiver after the early reflections (e.g., more than 50-80 milliseconds after the direct sound). The late reflections may have a detrimental effect on speech intelligibility. Accordingly, dereverberation may be performed on an audio signal to reduce an effect of late reflections present in the audio signal to thereby improve speech intelligibility.
It may be difficult to perform enhancement (e.g., dereverberation and/or noise suppression) on audio signals such that speech intelligibility is improved by the enhancement and the perceptual quality of the audio signal is preserved. For example, machine learning models, such as deep neural networks, may be used to predict a dereverberation mask that, when applied to a reverberated audio signal, generates a dereverberated audio signal. However, training such machine learning models may be computationally intensive and inefficient. For example, such machine learning models may require a high degree of complexity to achieve an acceptable degree of accuracy. As a more particular example, such machine learning models may include a vast number of layers, thereby requiring that a correspondingly vast number of parameters be optimized. Moreover, such complex machine learning models may be prone to overfitting due to training on limited training sets with a large number of parameters to be optimized. In such cases, such machine learning models may both be computationally intensive to train and ultimately achieve lower performance.
Disclosed herein are methods, systems, media, and techniques for enhancing audio signals using low-complexity machine learning models and/or using augmented training sets. As described herein (e.g., in connection with
In some implementations, an input audio signal can be enhanced using a trained machine learning model. In some implementations, the input audio signal can be transformed to a frequency domain by extracting frequency domain features. In some implementations, a perceptual transformation based on processing by the human cochlea can be applied to the frequency-domain representation to obtain banded features. Examples of a perceptual transformation that may be applied to the frequency-domain representation include a Gammatone filter, an equivalent rectangular bandwidth filter, a transformation based on the Mel scale, or the like. In some implementations, the frequency-domain representation may be provided as an input to a trained machine learning model that generates, as an output, a predicted enhancement mask. The predicted enhancement mask may be a frequency-domain representation of a mask that, when applied to the frequency-domain representation of the input audio signal, generates an enhanced audio signal. In some implementations, an inverse of the perceptual transformation may be applied to the predicted enhancement mask to generate a modified predicted enhancement mask. A frequency-domain representation of the enhanced audio signal may then be generated by multiplying the frequency-domain representation of the input audio signal by the modified predicted enhancement mask. An enhanced audio signal may then be generated by transforming the frequency-domain representation of the enhanced audio signal to the time-domain.
In other words, a trained machine learning model for enhancing audio signals may be trained to generate, for a given frequency-domain input audio signal, a predicted enhancement mask that, when applied to the frequency-domain input audio signal, generates a frequency-domain representation of a corresponding enhanced audio signal. In some implementations, a predicted enhancement mask may be applied to a frequency-domain representation of the input audio signal by multiplying the frequency-domain representation of the input audio signal and the predicted enhancement mask. Alternatively, in some implementations, the logarithm of the frequency-domain representation of the input audio signal may be taken. In such implementations, a frequency-domain representation of the enhanced audio signal may be obtained by subtracting the logarithm of the predicted enhancement mask from the logarithm of the frequency-domain representation of the input audio signal.
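By way of illustration, the following is a minimal sketch, in Python with NumPy, of the two mask-application forms described above: the multiplicative form and the log-domain form. The function names, array shapes, and the assumption that the log-domain mask is defined as an attenuation to be removed (i.e., the reciprocal of the multiplicative gain mask) are illustrative assumptions rather than a required implementation.

```python
import numpy as np

EPS = 1e-12  # guard against log(0)

def apply_multiplicative_mask(input_spectrum, gain_mask):
    # Multiplicative form described above: enhanced = mask * input.
    return gain_mask * input_spectrum

def apply_log_domain_mask(input_spectrum, log_domain_mask):
    # Log-domain form described above: log(enhanced) = log(input) - log(mask).
    # Here the mask is assumed to represent an attenuation to be removed,
    # i.e., the reciprocal of the multiplicative gain mask.
    log_enhanced = np.log(input_spectrum + EPS) - np.log(log_domain_mask + EPS)
    return np.exp(log_enhanced)

# Illustrative usage with placeholder data: 40 bands x 100 frames.
rng = np.random.default_rng(0)
spectrum = rng.random((40, 100)) + 0.1
gain_mask = rng.random((40, 100)) + 0.1
enhanced = apply_multiplicative_mask(spectrum, gain_mask)
```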
It should be noted that, in some implementations, training a machine learning model may include determining weights associated with one or more nodes and/or connections between nodes of the machine learning model. In some implementations, a machine learning model may be trained on a first device (e.g., a server, a desktop computer, a laptop computer, or the like). Once trained, the weights associated with the trained machine learning model may then be provided (e.g., transmitted) to a second device (e.g., a server, a desktop computer, a laptop computer, a media device, a smart television, a mobile device, a wearable computer, or the like) for use by the second device in dereverberating audio signals.
Process 300 can begin at 302 by receiving an input audio signal that includes reverberation. The input audio signal may be a live-captured audio signal, such as live-streamed content, an audio signal corresponding to an in-progress video conference or audio conference, or the like. In some implementations, the input audio signal may be a pre-recorded audio signal, such as an audio signal associated with pre-recorded audio content (e.g., television content, a video, a movie, a podcast, or the like). In some implementations, the input audio signal may be received by a microphone of the user device. In some implementations, the input audio signal may be transmitted to the user device, such as from a server device, another user device, or the like.
At 304, process 300 can extract features of the input audio signal by generating a frequency-domain representation of the input audio signal. For example, process 300 can generate a frequency-domain representation of the input audio signal using a transform, such as a short-time Fourier transform (STFT), a modified discrete cosine transform (MDCT), or the like. In some implementations, the frequency-domain representation of the input audio signal is referred to herein as “binned features” of the input audio signal. In some implementations, the frequency-domain representation of the input audio signal may be modified by applying a perceptually-based transformation that mimics filtering of the human cochlea. Examples of perceptually-based transformations include a Gammatone filter, an equivalent rectangular bandwidth filter, a Mel-scale filter, or the like. The modified frequency-domain transformation is sometimes referred to herein as “banded features” of the input audio signal.
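As an illustrative sketch of block 304, the binned and banded features might be computed as follows in Python; the use of SciPy for the STFT and of a librosa Mel filterbank as the perceptually-based transformation, as well as the specific frame sizes and band count, are assumptions made only for this example.

```python
import numpy as np
import scipy.signal
import librosa

def extract_features(audio, sample_rate=16000, n_fft=512, hop=256, n_bands=40):
    """Compute binned (STFT) and banded (Mel-filtered) features of an input signal."""
    # Binned features: magnitude of a short-time Fourier transform.
    _, _, stft = scipy.signal.stft(audio, fs=sample_rate, nperseg=n_fft,
                                   noverlap=n_fft - hop)
    binned = np.abs(stft)                       # shape: (n_fft // 2 + 1, frames)

    # Banded features: apply a perceptually-motivated filterbank (here, Mel-scale;
    # a Gammatone or equivalent rectangular bandwidth filterbank could be used instead).
    mel_fb = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_bands)
    banded = mel_fb @ binned                    # shape: (n_bands, frames)
    return binned, banded
```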
At 306, process 300 can provide the extracted features (e.g., the frequency-domain representation of the input audio signal or the modified frequency-domain representation of the input audio signal) to a trained machine learning model. The machine learning model may have been trained to generate a dereverberation mask that, when applied to the frequency-domain representation of the input audio signal, generates a frequency-domain representation of a dereverberated audio signal. In some implementations, the logarithm of the extracted features may be provided to the trained machine learning model.
The machine learning model may have any suitable architecture or topology. For example, in some implementations, the machine learning model may be or may include a deep neural network, a convolutional neural network (CNN), a long short-term memory (LSTM) network, a recurrent neural network (RNN), or the like. In some implementations, the machine learning model may combine two or more types of networks. For example, in some implementations, the machine learning model may combine a CNN with a recurrent element. Examples of recurrent elements that may be used include a GRU, an LSTM network, an Elman RNN, or the like. An example of a machine learning model architecture that combines a CNN with a GRU is shown in and described below in connection with
At 308, process 300 can obtain, from an output of the trained machine learning model, a predicted dereverberation mask that, when applied to the frequency-domain representation of the input audio signal, generates a frequency-domain representation of the dereverberated audio signal. In some implementations, process 300 can modify the predicted dereverberation mask by applying an inverse perceptually-based transformation, such as an inverse Gammatone filter, an inverse equivalent rectangular bandwidth filter, or the like.
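One simple way to realize the inverse perceptually-based transformation of block 308 is to map the per-band mask back to per-bin gains using the normalized transpose of the forward filterbank matrix. The sketch below assumes the forward transformation was a filterbank matrix such as the Mel filterbank from the previous example; an exact inverse generally does not exist because banding reduces frequency resolution, so this is only one reasonable approximation.

```python
import numpy as np

def band_mask_to_bin_mask(band_mask, filterbank, eps=1e-12):
    """Map a predicted per-band mask back to per-bin gains.

    band_mask:  (n_bands, frames) mask predicted in the banded domain
    filterbank: (n_bands, n_bins) forward banding matrix (e.g., a Mel filterbank)
    Returns a (n_bins, frames) mask that can be applied to the binned spectrum.
    """
    # Weight each bin by the bands that cover it, normalizing by the total
    # filterbank weight per bin (a simple pseudo-inverse-style mapping).
    weights = filterbank.T                          # (n_bins, n_bands)
    norm = weights.sum(axis=1, keepdims=True) + eps
    return (weights @ band_mask) / norm
```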
At 310, process 300 can generate a frequency-domain representation of the dereverberated audio signal based on the predicted dereverberation mask generated by the trained machine learning model and the frequency-domain representation of the input audio signal. For example, in some implementations, process 300 can multiply the predicted dereverberation mask by the frequency-domain representation of the input audio signal. In instances in which the logarithm of the frequency-domain representation of the input audio signal was provided to the trained machine learning model, process 300 can generate the frequency-domain representation of the dereverberated audio signal by subtracting the logarithm of the predicted dereverberation mask from the logarithm of the frequency-domain representation of the input audio signal. Continuing with this example, process 300 can then exponentiate the difference of the logarithm of the frequency-domain representation of the input audio signal and the logarithm of the predicted dereverberation mask to obtain the frequency-domain representation of the dereverberated audio signal.
At 312, process 300 can generate a time-domain representation of the dereverberated audio signal. For example, in some implementations, process 300 can generate the time-domain representation of the dereverberated audio signal by applying an inverse transform (e.g., an inverse STFT, an inverse MDCT, or the like) to the frequency-domain representation of the dereverberated audio signal.
Process 300 can end at 314.
In some implementations, after generating the time-domain representation of the dereverberated audio signal, the dereverberated audio signal may be played or presented (e.g., by one or more speaker devices of a user device). In some implementations, the dereverberated audio signal may be stored, such as in local memory of the user device. In some implementations, the dereverberated audio signal may be transmitted, such as to another user device for presentation by the other user device, to a server for storage, or the like.
In some implementations, a machine learning model for dereverberating audio signals may be trained using a training set. The training set may include any suitable number of training samples (e.g., 100 training samples, 1000 training samples, 10,000 training samples, or the like), where each training sample includes a clean audio signal (e.g., with no reverberation), and a corresponding reverberated audio signal. As described above in connection with
Training a machine learning model that can robustly generate predicted dereverberation masks for different reverberated audio signals may depend on the quality of the training set. For example, for a machine learning model to be robust, the training set may need to capture reverberation from a vast number of different room types (e.g., rooms having different sizes, layouts, furniture, etc.), a vast number of different speakers, etc. Acquiring such a training set is difficult. For example, a training set may be generated by applying various acoustic impulse responses (AIRs) that each characterize a room's reverberation to a clean audio signal, thereby generating pairs of a clean audio signal and a corresponding reverberated audio signal generated by convolving an AIR with the clean audio signal. However, there may be a limited number of real AIRs available, and the real AIRs that are available may not fully characterize potential reverberation effects (e.g., by not adequately capturing rooms of different dimensions, layouts, etc.).
Disclosed herein are techniques for generating an augmented training set that may be used to train a robust machine learning model for dereverberating audio signals. In some implementations, real AIRs are used to generate a set of synthesized AIRs. The synthesized AIRs may be generated by altering and/or modifying various characteristics of early reflections and/or late reflections of a measured AIR, as shown in and described below in connection with
Process 500 can begin at 502 by obtaining an AIR. The AIR may be a real AIR. For example, the AIR may be measured using a set of microphones within a reverberant room environment. As another example, the AIR may be an AIR generated using a room acoustics model. The AIR may be obtained from any suitable source, such as a database that stores measured AIRs, or the like.
At 504, process 500 can identify a first portion of the AIR that corresponds to early reflections of a direct sound and a second portion of the AIR that corresponds to late reflections of the direct sound. In some implementations, process 500 can identify the first portion and the second portion by identifying a separation boundary between early reflections and late reflections in the AIR. The separation boundary may correspond to a time point in the AIR that divides the AIR into early reflections and late reflections. In some implementations, the separation boundary may be identified by selecting a random value from within a predetermined range. Examples of the predetermined range include 15 milliseconds-85 milliseconds, 20 milliseconds-80 milliseconds, 30 milliseconds-70 milliseconds, or the like. In some implementations, the separation boundary may be a random value selected from any suitable distribution corresponding to the predetermined range (e.g., a uniform distribution, a normal distribution, or the like).
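A minimal sketch of selecting a separation boundary and splitting an AIR into early-reflection and late-reflection portions is shown below; drawing the boundary uniformly at random and measuring it from the start of the impulse response (rather than from a detected direct-path peak) are simplifying assumptions made for the example.

```python
import numpy as np

def split_air(air, sample_rate, rng=None, bound_range_ms=(20.0, 80.0)):
    """Split an acoustic impulse response into early and late reflection portions.

    The separation boundary is drawn uniformly at random from bound_range_ms,
    measured (as a simplification) from the start of the impulse response.
    """
    rng = rng or np.random.default_rng()
    boundary_ms = rng.uniform(*bound_range_ms)
    boundary_sample = int(boundary_ms * 1e-3 * sample_rate)
    early = air[:boundary_sample]
    late = air[boundary_sample:]
    return early, late, boundary_sample
```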
At 506, process 500 can generate one or more synthesized AIRs by modifying portions of the early reflections and/or the late reflections of the AIR. In some implementations, the early reflections and the late reflections may be identified within the AIR based on the separation boundary identified at block 504. In some implementations, process 500 may generate a synthesized AIR by modifying portions of the early reflections of the AIR. For example, as shown in and described above in connection with
In some implementations, process 500 may generate a synthesized AIR by modifying portions of the late reflections of the AIR. For example, as shown in and described above in connection with
As another example, in some implementations, process 500 may generate a synthesized AIR by modifying a decay associated with the late reflections. As a more particular example, in some implementations, process 500 may generate a decay function (e.g., an exponential decay function, a linear decay, etc.). Continuing with this more particular example, process 500 may then modify amplitudes of components of the late reflections subject to the generated decay function. In some implementations, this may cause the synthesized AIR to have late reflection components that are attenuated relative to the corresponding late reflection components of the measured AIR. Conversely, in some implementations, this may cause the synthesized AIR to have late reflection components that are amplified or boosted relative to the corresponding late reflection components of the measured AIR. Modification of the decay associated with the late reflections may change a reverberation time (RT), such as the time for reverberation to decrease by 60 dB (e.g., the RT60).
It should be noted that, in some implementations, a synthesized AIR may include modifications to both the early reflection components and the late reflection components. Moreover, in some implementations, early reflection components and/or late reflection components may be modified in multiple ways in a synthesized AIR relative to the real AIR. For example, in some implementations, a synthesized AIR may include late reflections that have both been truncated and late reflection components that have been modified in amplitude based at least in part on a modified decay applied to the late reflections of the synthesized AIR.
Additionally, in some implementations, the synthesized AIR may be further modified, e.g., in post-processing. For example, in some implementations, a direct-to-reverberant ratio (DRR) associated with the synthesized AIR may be modified. As a more particular example, in some implementations, the DRR associated with the synthesized AIR may be modified by applying a gain to a portion (e.g., an early reflection portion of the synthesized AIR) to increase or decrease the DRR. In some implementations, multiple modified synthesized AIRs may be generated from a single synthesized AIR. For example, in some implementations, multiple modified synthesized AIRs may be generated by applying different gains, each corresponding to a different modified synthesized AIR, to the single synthesized AIR.
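The following sketch combines two of the modifications described above, a modified late-reflection decay and a DRR adjustment applied to the early-reflection portion, into a single illustrative function; the particular exponential decay envelope and its parameterization are assumptions rather than required choices.

```python
import numpy as np

def synthesize_air(air, boundary_sample, decay_scale=1.0, early_gain_db=0.0):
    """Generate a synthesized AIR from a measured AIR (illustrative only).

    decay_scale:   scales an exponential decay applied to the late reflections,
                   attenuating (or, if negative, boosting) them and thereby
                   changing the reverberation time.
    early_gain_db: gain applied to the early-reflection portion to modify the
                   direct-to-reverberant ratio (DRR).
    """
    synthesized = air.astype(float).copy()

    # Modify the late-reflection decay with an exponential envelope.
    late = synthesized[boundary_sample:]
    envelope = np.exp(-decay_scale * np.arange(late.size) / max(late.size, 1))
    synthesized[boundary_sample:] = late * envelope

    # Adjust the DRR by applying a gain to the early-reflection portion.
    synthesized[:boundary_sample] *= 10.0 ** (early_gain_db / 20.0)
    return synthesized
```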
At 508, process 500 can determine whether additional synthesized AIRs are to be generated based on the AIR obtained at block 502. In some implementations, process 500 can determine whether additional synthesized AIRs are to be generated based on whether a target or threshold number of synthesized AIRs has been generated from the AIR. For example, in an instance in which N synthesized AIRs are to be generated from a particular AIR, process 500 can determine whether N synthesized AIRs have been generated from the AIR obtained at block 502. It should be noted that N may be any suitable value, such as 1, 5, 10, 20, 50, 100, 500, 1000, 2000, etc.
If, at 508, process 500 determines that additional synthesized AIRs are not to be generated (“no” at block 508), process 500 can end at 510. Conversely, if, at block 508, process 500 determines that additional synthesized AIRs are to be generated (“yes” at block 508), process 500 can loop back to block 504 and can identify a different first portion of the AIR and second portion of the AIR obtained at block 502. By looping through blocks 504-508, process 500 may generate multiple synthesized AIRs from a single measured AIR.
Process 550 can begin at 552 by obtaining a set of clean input audio signals (e.g., input audio signals without any reverberation and/or noise). The clean input audio signals in the set of clean input audio signals may have been recorded by any suitable number of devices (or microphones associated with any suitable number of devices). For example, in some implementations, two or more of the clean input audio signals may have been recorded by the same device. As another example, in some implementations, each of the clean input audio signals may have been recorded by a different device. In some implementations, two or more of the clean input audio signals may have been recorded in the same room environment. In some implementations, each of the clean input audio signals may have been recorded in a different room environment. In some implementations, a clean input audio signal in the set of clean input audio signals may include any combination of types of audible sounds, such as speech, music, sound effects, or the like. However, each clean input audio signal may be devoid of reverberation, echo, and/or noise.
At block 554, process 550 can obtain a set of AIRs that include real AIRs and/or synthesized AIRs. The set of AIRs may include any suitable number of AIRs (e.g., 100 AIRs, 200 AIRs, 500 AIRs, or the like). The set of AIRs may include any suitable ratio of real AIRs to synthesized AIRs, such as 90% synthesized AIRs and 10% real AIRs, 80% synthesized AIRs and 20% real AIRs, or the like. More detailed techniques for generating synthesized AIRs are shown in and described above in connection with
At block 556, process 550 can, for each pairwise combination of clean input audio signal in the set of clean input audio signals and AIR in the set of AIRs, generate a reverberated audio signal based on the clean input audio signal and the AIR. For example, in some implementations, process 550 can convolve the AIR with the clean input audio signal to generate the reverberated audio signal. In some implementations, given N clean input audio signals and M AIRs, process 550 can generate up to N×M reverberated audio signals.
In some implementations, at block 558, process 550 can, for one or more of the reverberated audio signals generated at block 556, add noise to generate a noisy reverberated audio signal. Examples of noise that may be added include white noise, pink noise, brown noise, multi-talker speech babble, or the like. Process 550 may add different types of noise to different reverberated audio signals. For example, in some implementations, process 550 may add white noise to a first reverberated audio signal to generate a first noisy reverberated audio signal. Continuing with this example, in some implementations, process 550 may add multi-talker speech babble type noise to the first reverberated audio signal to generate a second noisy reverberated audio signal. Continuing still further with this example, in some implementations, process 550 may add brown noise to a second reverberated audio signal to generate a third noisy reverberated audio signal. In other words, in some implementations, different versions of a noisy reverberated audio signal may be generated by adding different types of noise to a reverberated audio signal. It should be noted that, in some implementations, block 558 may be omitted, and the training set may be generated without adding noise to any reverberated audio signals.
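A minimal sketch of blocks 556 and 558, generating a reverberated signal by convolving a clean signal with an AIR and optionally adding noise scaled to a target signal-to-noise ratio, might look as follows; the SNR-based noise scaling is an illustrative choice, since the description above does not prescribe how the noise level is set.

```python
import numpy as np
from scipy.signal import fftconvolve

def make_training_sample(clean, air, noise=None, snr_db=20.0):
    """Generate a (clean, reverberated) training pair, optionally with added noise."""
    # Reverberated signal: convolution of the clean signal with the AIR,
    # truncated to the length of the clean signal.
    reverberated = fftconvolve(clean, air, mode="full")[: len(clean)]

    if noise is not None:
        noise = np.resize(noise, len(reverberated))
        # Scale the noise to achieve the requested signal-to-noise ratio.
        sig_power = np.mean(reverberated ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        gain = np.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))
        reverberated = reverberated + gain * noise

    return clean, reverberated
```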
At the end of block 558, process 550 has generated a training set comprising multiple training samples. Each training sample may include a clean audio signal and a corresponding reverberated audio signal. The reverberated audio signal may or may not include added noise. It should be noted that, in some implementations, a single clean audio signal may be associated with multiple training samples. For example, a clean audio signal may be used to generate multiple reverberated audio signals by convolving the clean audio signal with multiple different AIRs. As another example, a single reverberated audio signal (e.g., generated by convolving a single clean audio signal with a single AIR) may be used to generate multiple noisy reverberated audio signals, each corresponding to a different type of noise added to the single reverberated audio signal. Accordingly, a single clean audio signal may be associated with 10, 20, 30, 100, or more training samples, each comprising a different corresponding reverberated audio signal (or noisy reverberated audio signal).
In some implementations, an augmented training set may be generated for a particular type of audio content. For example, the particular type of audio content may correspond to a type of audio content for which dereverberation may be particularly difficult. By way of example, it may be difficult to perform dereverberation on audio signals that include far-field noise, such as the noise of a dog barking or a baby crying in the background of an audio signal that includes near-field speech (e.g., from a video conference, from an audio call, or the like). Difficulty in performing dereverberation on far-field noise may lead to poor noise management (e.g., denoising of the audio signal). Because dereverberation of far-field noise may be dependent on both room characteristics/acoustics and the particular noise, it may be difficult to train a model to perform dereverberation on such far-field noise. For example, a training dataset used to train such a model may not have enough training samples of the particular type of far-field noise present in an expansive set of room acoustics, thereby making the model trained with such a limited training set less robust. Accordingly, generating an augmented training set for a particular type of audio content may allow for a more robust model to be trained. In some implementations, the particular type of audio content may include particular types of sounds or events (e.g., a dog barking, a baby crying, an emergency siren passing by, or the like) and/or particular audio environments (e.g., an indoor environment, an outdoor environment, an indoor shared workspace, or the like). In some implementations, the augmented training set may be generated by first identifying a training set of audio signals that include the particular type of audio content. For example, a training set that includes dogs barking in the background of near-field speech may be obtained. As another example, a training set that includes a far-field siren passing by in the background of near-field speech may be obtained. In some implementations, because reverberation is generally present in indoor environments, a training set that includes audio content captured in indoor environments (and that does not include audio content generated in outdoor environments) may be obtained. Note that, in some implementations, the training set may be obtained by selecting audio signals from a corpus of audio signals that classifies each audio signal as associated with the particular type of audio content. In some implementations, the augmented training set may be generated by applying synthesized AIRs and/or noise of a particular type (e.g., speech noise, indoor room noise, etc.) to the identified training set to generate the augmented training set.
It should be noted that, in some implementations, an augmented training set may be used for training speech enhancement models other than dereverberation models. For example, in some implementations, such an augmented training set may be used to train machine learning models for noise management (e.g., denoising), machine learning models that perform a combination of noise management and dereverberation, or the like.
A machine learning model for dereverberating audio signals may have various types of architectures. The machine learning model may take, as an input, a frequency-domain representation of a reverberated audio signal and produce, as an output, a predicted dereverberation mask that, when applied to the frequency-domain representation of the reverberated audio signal, generates a frequency-domain representation of a dereverberated (e.g., clean) audio signal. Example architecture types include a CNN, an LSTM, an RNN, a deep neural network, or the like. In some implementations, a machine learning model may combine two or more architecture types, such as a CNN and a recurrent element. In some such implementations, a CNN may be used to extract features of an input reverberated audio signal at different resolutions. In some implementations, a recurrent element may serve as a memory gate that controls an amount of previously provided input data that is used by the CNN. Use of a recurrent element in combination with a CNN may allow the machine learning model to produce smoother outputs. Additionally, use of a recurrent element in combination with a CNN may allow the machine learning model to achieve higher accuracy with a decreased training time. Accordingly, use of a recurrent element in combination with a CNN may improve computational efficiency by decreasing time and/or computational resources used to train a robust, accurate machine learning model for dereverberating audio signals. Examples of types of recurrent elements that may be used include a GRU, an LSTM network, an Elman RNN, and/or any other suitable type of recurrent element or architecture.
In some implementations, a recurrent element may be combined with a CNN such that the recurrent element and the CNN are in parallel. For example, outputs of the recurrent element may be provided to one or more layers of the CNN such that the CNN generates an output based on outputs of layers of the CNN and based on an output of the recurrent element.
In some implementations, a CNN utilized in a machine learning model may include multiple layers. Each layer may extract features of an input reverberated audio signal spectrum (e.g., a frequency-domain representation of the reverberated audio signal) at a different resolution. In some implementations, layers of the CNN may have different dilation factors. Use of a dilation factor greater than 1 effectively increases the receptive field of the convolution filter used for that layer without increasing the number of parameters. Use of dilation factors greater than 1 may therefore allow a machine learning model to be more robustly trained (by increasing the receptive field size) while not increasing in complexity (e.g., by maintaining the number of parameters to be learned or optimized). In one example, a CNN may have a first group of layers with increasing dilation factors and a second group of layers with decreasing dilation factors. In one specific example, the first group of layers may include 6 layers having dilation factors of 1, 2, 4, 8, 12, and 20, respectively. Continuing with this example, the second group of layers may include 5 layers decreasing in dilation factor (e.g., 5 layers having dilation factors of 12, 8, 4, 2, and 1, respectively). The size of the receptive field considered by the CNN is related to the dilation factors, the convolution filter size, the stride, and/or the padding (e.g., whether or not the model is causal). By way of example, given 6 CNN layers with increasing dilation factors of 1, 2, 4, 8, 12, and 20, a convolution filter size of 3×3, a stride of 1, and a causal model, the CNN may have a total receptive field of (2×(1+2+4+8+12+20))+1 frames, or 95 frames. As another example, the same network with a dilation factor of 1 at every layer (i.e., no dilation) would have a receptive field of (2×(1+1+1+1+1+1))+1 frames, or 13 frames. In some implementations, the total receptive field may correspond to a delay line duration that indicates a duration of the spectrum that is considered by the machine learning model. It should be noted that the dilation factors described above are merely exemplary. In some implementations, smaller dilation factors may be used to, for example, decrease the delay duration for real-time audio signal processing.
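The receptive-field arithmetic above can be expressed compactly: with a stride of 1, each layer having kernel size k and dilation d widens the receptive field by (k − 1) × d frames. A short sketch reproducing the two examples given above (assuming 3×3 filters along the time dimension) is:

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in frames) of a stack of dilated convolutions with stride 1.

    Each layer with kernel size k and dilation d widens the receptive field by
    (k - 1) * d frames, so the total is 1 + sum((k - 1) * d) over all layers.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Examples from the text: six layers with dilation factors 1, 2, 4, 8, 12, 20
# and 3x3 filters give (2 * (1+2+4+8+12+20)) + 1 = 95 frames; the same stack
# with a dilation factor of 1 at every layer gives (2 * 6) + 1 = 13 frames.
print(receptive_field([1, 2, 4, 8, 12, 20]))   # 95
print(receptive_field([1, 1, 1, 1, 1, 1]))     # 13
```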
In some implementations, the machine learning model may be zero latency. In other words, the machine learning model may not use look ahead, or future data points. This is sometimes referred to as the machine learning model being causal. Conversely, in some implementations, the machine learning model may implement layers that utilize look ahead blocks.
As illustrated, CNN 606 includes a first set of layers 610 that have increasing dilation factors. In particular, first set of layers 610 includes 6 layers with dilation factors of 1, 2, 4, 8, 12, and 20, respectively. First set of layers 610 is followed by a second set of layers 612 that have decreasing dilation factors. In particular, second set of layers 612 includes 5 layers with dilation factors of 12, 8, 4, 2, and 1, respectively. Second set of layers 612 is followed by a third set of layers 614, which each have a dilation factor of 1. In some implementations, first set of layers 610, second set of layers 612, and third set of layers 614 may each include convolutional blocks. Each convolutional block may utilize a convolutional filter. Although CNN 606 utilizes convolutional filters of a 3×3 size, this is merely exemplary, and, in some implementations, other filter sizes (e.g., 4×4, 5×5, or the like) may be used. As illustrated in
As illustrated in
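A simplified sketch of such a combined CNN/GRU architecture is shown below in PyTorch. It is intended only to illustrate how the output of a GRU running in parallel can be reshaped and provided to the later convolutional layers; the channel count, the omission of the skip connections between same-dilation layers and of the third set of dilation-1 layers, and the use of non-causal padding are simplifications and assumptions made for brevity.

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """A 3x3 convolutional block with a configurable dilation factor."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                              dilation=dilation, padding=dilation)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv(x))

class CnnGruMaskEstimator(nn.Module):
    """Sketch of a CNN with a parallel GRU whose output feeds the later conv layers."""
    def __init__(self, n_bands=40, channels=16):
        super().__init__()
        # First set of layers: increasing dilation factors.
        self.up = nn.ModuleList(
            [DilatedBlock(1 if i == 0 else channels, channels, d)
             for i, d in enumerate([1, 2, 4, 8, 12, 20])])
        # GRU runs in parallel on the banded input features.
        self.gru = nn.GRU(input_size=n_bands, hidden_size=n_bands, batch_first=True)
        # Second set of layers: decreasing dilation factors; the first layer also
        # receives the reshaped GRU output as an extra input channel.
        self.down = nn.ModuleList(
            [DilatedBlock(channels + 1 if i == 0 else channels, channels, d)
             for i, d in enumerate([12, 8, 4, 2, 1])])
        self.out = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, banded):
        # banded: (batch, frames, bands)
        x = banded.unsqueeze(1)                    # (batch, 1, frames, bands)
        for layer in self.up:
            x = layer(x)
        gru_out, _ = self.gru(banded)              # (batch, frames, bands)
        gru_feat = gru_out.unsqueeze(1)            # reshape to (batch, 1, frames, bands)
        x = torch.cat([x, gru_feat], dim=1)        # provide GRU output to later layers
        for layer in self.down:
            x = layer(x)
        return torch.sigmoid(self.out(x)).squeeze(1)   # predicted mask per frame/band

# Illustrative usage on random data: batch of 2, 100 frames, 40 bands.
model = CnnGruMaskEstimator(n_bands=40)
mask = model(torch.randn(2, 100, 40))
```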
In some implementations, a machine learning model may be trained using a loss function that indicates a degree of reverberation associated with a predicted dereverberated audio signal generated using a predicted dereverberation mask generated by the machine learning model. By training the machine learning model to minimize a loss function that includes an indication of a degree of reverberation, the machine learning model may not only generate dereverberated audio signals similar in content to the corresponding reverberated audio signals (e.g., including similar direct sound content as in the reverberated audio signal), but may additionally generate dereverberated audio signals with less reverberation. In some implementations, a loss term, for a particular training sample, may be a combination of (i) a difference between a predicted dereverberated audio signal and a ground-truth clean audio signal and (ii) a degree of reverberation associated with the predicted dereverberated audio signal.
In some implementations, a degree of reverberation included in a loss function may be a speech-to-reverberation modulation energy. In some implementations, the speech-to-reverberation modulation energy may be a ratio of modulation energy at relatively high modulation frequencies relative to modulation energy over all modulation frequencies. In some implementations, the speech-to-reverberation modulation energy may be a ratio of modulation energy at relatively high modulation frequencies relative to modulation energy over relatively low modulation frequencies. In some implementations, relatively high modulation frequencies and relatively low modulation frequencies may be identified based on modulation filters. For example, in an instance in which modulation energy is determined at M modulation frequency bands, the highest N of the M (e.g., 3, 4, 5, etc.) modulation frequency bands may be considered as corresponding to “high modulation frequencies,” and the remaining bands (e.g., M−N) may be considered as corresponding to “low modulation frequencies.”
Process 700 can begin at 702 by obtaining a training set that includes training samples that comprise pairs of reverberated audio signals and clean audio signals. In some implementations, the clean audio signals may be considered “ground-truth” signals that the machine learning model is to be trained to predict, or generate. In some implementations, the training set may be an augmented training set that has been constructed using synthesized AIRs, as described above in connection with
At 704, for a given training sample (e.g., for a given pair of a reverberated audio signal and a clean audio signal), process 700 can provide the reverberated audio signal to a machine learning model to obtain a predicted dereverberation mask. In some implementations, process 700 may provide the reverberated audio signal by determining a frequency-domain representation of the reverberated audio signal and providing the frequency-domain representation of the reverberated audio signal. In some implementations, the frequency-domain representation of the reverberated audio signal may have been filtered or otherwise transformed using a filter that approximates filtering of the human cochlea, as shown in and described above in connection with block 304 of
It should be noted that the machine learning model may have any suitable architecture. For example, the machine learning model may include a deep neural network, a CNN, an LSTM, an RNN, or the like. In some implementations, the machine learning model may combine two or more architectures, such as a CNN and a recurrent element. In some implementations, a CNN may use dilation factors at different layers. A specific example of a machine learning model that may be used is shown in and described above in connection with
At 706, process 700 can obtain a predicted dereverberated audio signal using the predicted dereverberation mask. For example, in some implementations, process 700 can apply the predicted dereverberation mask to the frequency-domain representation of the reverberated audio signal to obtain a frequency-domain representation of the dereverberated audio signal, as shown in and described above in connection with block 310 of
At 708, process 700 can determine a value of a reverberation metric associated with the predicted dereverberated audio signal. The reverberation metric may be a speech-to-reverberation modulation energy (generally denoted herein as f_srmr(z), where z is the predicted dereverberated audio signal) of one or more frames of the predicted dereverberated audio signal. An example equation to determine the speech-to-reverberation modulation energy that considers a ratio of energy in relatively high modulation frequencies to energy in relatively low modulation frequencies is given by:
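\[
f_{\mathrm{srmr}}(z) = \frac{1}{23} \sum_{j=1}^{23} \frac{\sum_{k=K^{*}+1}^{8} z_{j,k}}{\sum_{k=1}^{K^{*}} z_{j,k}}
\]

where K* denotes the number of modulation bands treated as corresponding to low modulation frequencies; this is one example formulation consistent with computing the ratio for each critical band and averaging the ratio over the critical bands, as described above, and the particular split of modulation bands into high and low ranges may be chosen differently.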
In the equation given above, z_{j,k} represents the average modulation energy over frames of the j-th critical band grouped by the k-th modulation filter, where there are 23 critical bands and 8 modulation bands. Higher values of f_srmr(z) are indicative of a higher degree of reverberation. It should be noted that other numbers of critical bands and/or modulation bands may be used to determine the speech-to-reverberation modulation energy.
At 710, process 700 can determine a loss term based on the clean audio signal, the predicted dereverberated audio signal, and the value of the reverberation metric. In some implementations, the loss term may be a combination of (i) a difference between the clean audio signal and the predicted dereverberated audio signal and (ii) the value of the reverberation metric. In some implementations, the combination may be a weighted sum, where the value of the reverberation metric is weighted by an importance of minimizing reverberation in outputs produced using the machine learning model. An example equation of the loss term for a particular predicted dereverberated audio signal (denoted herein as y_pre) and a particular clean audio signal (denoted herein as y_ref) is given by:
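\[
\mathcal{L}(y_{\mathrm{pre}}, y_{\mathrm{ref}}) = (1 - \alpha)\, d(y_{\mathrm{pre}}, y_{\mathrm{ref}}) + \alpha\, f_{\mathrm{srmr}}(y_{\mathrm{pre}})
\]

where d(y_pre, y_ref) denotes the difference between the predicted dereverberated audio signal and the clean audio signal (for example, a mean squared error computed over their spectra), and α is a weighting factor reflecting the importance of minimizing reverberation; this is one example form of the weighted sum described above, and the particular choices of d and α are implementation details.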
As illustrated in the above equation, the loss term may be increased in instances in which there is a relatively high degree of reverberation in the predicted dereverberated audio signal and/or in which the predicted dereverberated audio signal differs substantially from the ground-truth clean audio signal.
At 712, process 700 can update weights of the machine learning model based at least in part on the loss term. For example, in some implementations, process 700 may use gradient descent and/or any other suitable technique to calculate updated weight values associated with the machine learning model. The weights may be updated based on other factors, such as a learning rate, a dropout rate, etc. The weights may be associated with various nodes, layers, etc., of the machine learning model.
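A minimal sketch of a single training iteration corresponding to blocks 706 through 712 might look as follows in PyTorch; the srmr_metric function is a placeholder for a differentiable approximation of the reverberation metric, and the mean-squared-error distance and the weighting factor alpha are illustrative assumptions.

```python
import torch

def training_step(model, optimizer, reverberated_spec, clean_spec,
                  srmr_metric, alpha=0.2):
    """One training iteration: predict a mask, form the weighted-sum loss, update weights.

    srmr_metric is assumed to be a differentiable function returning the degree of
    reverberation of a predicted spectrum (higher = more reverberant).
    """
    optimizer.zero_grad()
    mask = model(reverberated_spec)                      # predicted dereverberation mask
    predicted = mask * reverberated_spec                 # predicted dereverberated spectrum

    distance = torch.mean((predicted - clean_spec) ** 2)  # difference from ground truth
    reverb = srmr_metric(predicted)                        # degree of reverberation
    loss = (1.0 - alpha) * distance + alpha * reverb       # weighted-sum loss

    loss.backward()
    optimizer.step()                                       # gradient-based weight update
    return loss.item()
```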
At block 714, process 700 can determine whether the machine learning model is to continue being trained. Process 700 can determine whether the machine learning model is to continue being trained based on a determination of whether a stopping criterion has been reached. The stopping criterion may include a determination that an error associated with the machine learning model has decreased below a predetermined error threshold, that weights associated with the machine learning model are being changed from one iteration to a next by less than a predetermined change threshold, and/or the like.
If, at block 714, process 700 determines that the machine learning model is not to continue being trained (“no” at block 714), process 700 can end at 716. Conversely, if, at block 714, process 700 determines that the machine learning model is to continue being trained (“yes” at block 714), process 700 can loop back to 704 and can loop through blocks 704-714 with a different training sample.
In some implementations, an augmented training set (e.g., as described above in connection with
As illustrated, system 800 includes a training set creation component 802. Training set creation component 802 may generate an augmented training set that may be used by a machine learning model for dereverberating audio signals. In some implementations, training set creation component 802 may be implemented, for example, on a device that generates and/or stores an augmented training set. Training set creation component 802 may retrieve measured AIRs from an AIR database 806. Training set creation component 802 may then generate synthesized AIRs based on the measured AIRs retrieved from AIR database 806. More detailed techniques for generating synthesized AIRs are shown in and described above in connection with
Augmented training set 808 may then be used to train a machine learning model 810a. In some implementations, machine learning model 810a may have an architecture that includes a CNN and a recurrent element (e.g., a GRU, an LSTM network, an Elman RNN, or the like) in parallel. In particular, the CNN may generate an output based on outputs of layers of the CNN as well as an output of the recurrent element. An example of such an architecture is shown in and described above in connection with
After training, trained machine learning model 810b may utilize trained prediction component 812b (e.g., corresponding to finalized weights) to generate dereverberated audio signals. For example, trained machine learning model 810b may take, as an input, a reverberated audio signal 814, and may generate, as an output, a dereverberated audio signal 816. It should be noted that trained machine learning model 810b may have the same architecture as machine learning model 810a, but may not determine a degree of reverberation at inference time.
According to some alternative implementations the apparatus 900 may be, or may include, a server. In some such examples, the apparatus 900 may be, or may include, an encoder. Accordingly, in some instances the apparatus 900 may be a device that is configured for use within an audio environment, such as a home audio environment, whereas in other instances the apparatus 900 may be a device that is configured for use in “the cloud,” e.g., a server.
In this example, the apparatus 900 includes an interface system 905 and a control system 910. The interface system 905 may, in some implementations, be configured for communication with one or more other devices of an audio environment. The audio environment may, in some examples, be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, etc. The interface system 905 may, in some implementations, be configured for exchanging control information and associated data with audio devices of the audio environment. The control information and associated data may, in some examples, pertain to one or more software applications that the apparatus 900 is executing.
The interface system 905 may, in some implementations, be configured for receiving, or for providing, a content stream. The content stream may include audio data. The audio data may include, but may not be limited to, audio signals. In some instances, the audio data may include spatial data, such as channel data and/or spatial metadata. In some examples, the content stream may include video data and audio data corresponding to the video data.
The interface system 905 may include one or more network interfaces and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). According to some implementations, the interface system 905 may include one or more wireless interfaces. The interface system 905 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system and/or a gesture sensor system. In some examples, the interface system 905 may include one or more interfaces between the control system 910 and a memory system, such as the optional memory system 915.
The control system 910 may, for example, include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
In some implementations, the control system 910 may reside in more than one device. For example, in some implementations a portion of the control system 910 may reside in a device within one of the environments depicted herein and another portion of the control system 910 may reside in a device that is outside the environment, such as a server, a mobile device (e.g., a smartphone or a tablet computer), etc. In other examples, a portion of the control system 910 may reside in a device within one environment and another portion of the control system 910 may reside in one or more other devices of the environment. For example, a portion of the control system 910 may reside in a device that is implementing a cloud-based service, such as a server, and another portion of the control system 910 may reside in another device that is implementing the cloud-based service, such as another server, a memory device, etc. The interface system 905 also may, in some examples, reside in more than one device.
In some implementations, the control system 910 may be configured for performing, at least in part, the methods disclosed herein. According to some examples, the control system 910 may be configured for implementing methods of dereverberating audio signals, training a machine learning model that performs dereverberation of audio signals, generating a training set for a machine learning model that performs dereverberation of audio signals, generating synthesized AIRs for inclusion in a training set, or the like.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. The one or more non-transitory media may, for example, reside in the optional memory system 915.
In some examples, the apparatus 900 may include the optional microphone system 920. The microphone system 920 may include one or more microphones.
According to some implementations, the apparatus 900 may include the optional loudspeaker system 925. The loudspeaker system 925 may include one or more loudspeakers.
Some aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer readable medium (e.g., a disc) which stores code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
Some embodiments may be implemented as a configurable (e.g., programmable) digital signal processor (DSP) that is configured (e.g., programmed and/or otherwise configured) to perform required processing on audio signal(s), including performance of one or more examples of the disclosed methods. Alternatively, embodiments of the disclosed systems (or elements thereof) may be implemented as a general purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and a memory) which is programmed with software or firmware and/or otherwise configured to perform any of a variety of operations including one or more examples of the disclosed methods. Alternatively, elements of some embodiments of the inventive system are implemented as a general purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system also includes other elements (e.g., one or more loudspeakers and/or one or more microphones). A general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or a keyboard), a memory, and a display device.
Another aspect of the present disclosure is a computer readable medium (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) one or more examples of the disclosed methods or steps thereof.
While specific embodiments of the present disclosure and applications of the disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the disclosure described and claimed herein. It should be understood that while certain forms of the disclosure have been shown and described, the disclosure is not to be limited to the specific embodiments described and shown or the specific methods described.
This application claims the priority benefit of U.S. Provisional Application No. 63/260,203, filed on Aug. 12, 2021, and International Application No. PCT/CN2021/106536, filed on Jul. 15, 2021, the contents of which are hereby incorporated in their entirety.