Automatic speech recognition (ASR) is widely deployed in many real-world scenarios via smartphones and voice assistant devices. A major challenge in ASR is dealing with far-field scenarios, where the speech source is at a significant distance from the microphone. As the demand for ASR continues to increase, research and development continue to advance and enhance ASR technologies for use in a wide range of environments.
The following presents a simplified summary of one or more aspects of the present disclosure, to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated features of the disclosure and is intended neither to identify key or critical elements of all aspects of the disclosure nor to delineate the scope of any or all aspects of the disclosure. Its sole purpose is to present some concepts of one or more aspects of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
In some aspects of the present disclosure, methods, systems, and apparatus for scene-aware far-field automatic speech recognition are disclosed. These methods, systems, and apparatus may include steps or components for receiving multiple speech samples at a target scene; generating multiple labeled vectors corresponding to the multiple speech samples; generating multiple intermediate samples based on the multiple labeled vectors for normalizing noise in the multiple labeled vectors; determining multiple pair-wise distances between each of the multiple intermediate samples and each of multiple vectors of a set of acoustic impulse responses (AIRs); selecting a subset of the set of AIRs based on the multiple pair-wise distances; and training a deep learning model based on the subset of the set of AIRs.
In other aspects of the present disclosure, methods, systems, and apparatus providing for scene-aware far-field automatic speech recognition are disclosed. These methods, systems, and apparatus may include receiving a speech sample; generating a labeled vector corresponding to the speech sample; generating one or more intermediate samples based on the labeled vector for normalizing noise in the labeled vector; determining one or more pair-wise distances between each of the one or more intermediate samples and each of multiple vectors of a full set of acoustic impulse responses (AIRs); determining a deep learning model trained with a dataset of a set of AIRs based on the one or more pair-wise distances; and performing speech recognition of the speech sample based on the determined deep learning model.
In further aspects of the present disclosure, methods, systems, and apparatus providing for scene-aware far-field automatic speech recognition are disclosed. These systems and apparatus implementing the method may include a memory and a processor coupled to the memory. The processor is configured, in coordination with the memory, to receive a speech sample; generate a labeled vector corresponding to the speech sample; generate one or more intermediate samples based on the labeled vector for normalizing noise in the labeled vector; determine one or more pair-wise distances between each of the one or more intermediate samples and each of multiple vectors of a set of acoustic impulse responses (AIRs); determine a deep learning model trained with a dataset of a set of AIRs based on the one or more pair-wise distances; and perform speech recognition of the speech sample based on the determined deep learning model.
These and other aspects of the disclosure will become more fully understood upon a review of the drawings and the detailed description, which follows. Other aspects, features, and embodiments of the present disclosure will become apparent to those skilled in the art upon reviewing the following description of specific example embodiments of the present disclosure in conjunction with the accompanying figures. While features of the present disclosure may be discussed relative to certain embodiments and figures below, all embodiments of the present disclosure can include one or more of the advantageous features discussed herein. In other words, while one or more embodiments may be discussed as having certain advantageous features, one or more of such features may also be used in accordance with the various embodiments of the disclosure discussed herein. Similarly, while example embodiments may be discussed below as device, system, or method embodiments, it should be understood that such example embodiments can be implemented in various devices, systems, and methods.
The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations and is not intended to represent the only configurations in which the subject matter described herein may be practiced. The detailed description includes specific details to provide a thorough understanding of various embodiments of the present disclosure. However, it will be apparent to those skilled in the art that the various features, concepts and embodiments described herein may be implemented and practiced without these specific details. In some instances, well-known structures and components are shown in block diagram form to avoid obscuring such concepts.
Far-field scenarios (e.g., inside rooms, buildings, or other open spaces where audible noise may originate from locations comparatively distant from a speaker, such as farther from a speaker than the listener) can present unique challenges when providing for automated speech recognition. In far-field scenarios, speech audio signals undergo reflections off surfaces and diffraction around obstacles in the environment. These effects distort the speech signal in different ways and thereby require much more training data for a speech recognition algorithm to achieve high accuracy and generalization. However, capturing and transcribing far-field speech data with sufficient acoustic variability is challenging due to time and cost constraints. In practice, there is often far less transcribed far-field training data (e.g., audio with an associated transcription) available to a developer than data for clean speech (e.g., speech in which background noise and distortions, such as reverberation and muffling, are minimal, de minimis, or substantially reduced in comparison to the speech).
In some examples, instead of recording multiple speech utterances in the same room in order to develop training data, one can choose to record the acoustic impulse responses (AIRs) and the non-speech background noise in that room. These recordings can then be reused offline to artificially reverberate clean speech as if it had been recorded in that same environment and generate noisy speech data that can be used as training data for a speech recognition algorithm. When the AIRs and noise are properly recorded, using them to create training data for ASR can produce results as good as using real recorded speech for training.
In recent years, there has been an increasing number of recorded impulse response datasets, thereby improving the performance of far-field ASR. However, each dataset may only involve a limited number of different acoustic environments (e.g., hundreds of AIRs recorded in very few rooms). In addition, different recording devices, recording techniques, and software post-processing have been used across datasets, which can reduce the consistency of recording among these datasets. When creating a training set with a combination of all the AIRs collected from different datasets, domain mismatch due to reverberation/noise level and frequency distortion can occur. In general, training ASR systems using various data is considered beneficial to make the trained model generalize to various test conditions. However, the inclusion of mismatched data for one domain might not improve the performance on that specific domain. Further, training solely with mismatched data can significantly degrade the model performance.
In many scenarios, voice assistant devices (e.g., Apple HomePod®, Amazon Echo®, Google Home™) operate in a given or fixed indoor or room environment whose acoustic characteristics do not change frequently. For this reason, knowing more about the acoustic characteristics of the target scenario can help select the appropriate training data, as the devices will be used in the same environment. However, in some cases, detailed information about the acoustic environment in which the ASR systems will be deployed may not be available in advance. Even with many popular far-field ASR benchmarks, the metadata (e.g., room types and dimensions, microphone locations, etc.) is missing or not consistently labeled, making it difficult to extract useful acoustic characteristics of the scene.
In some aspects of the present disclosure, systems and methods include generating scene-aware training data, that is, training data that has acoustic characteristics similar to those of the target scene, without any a priori knowledge of the ground truth scene characteristics or any metadata of the scene. In further aspects of the disclosure, systems and methods include training one or more deep-learning models based on the scene-aware training data, and/or performing far-field automatic speech recognition using the one or more deep-learning models. In some examples, a deep learning-based estimator may be used to non-intrusively compute the sub-band reverberation time of an environment from its speech samples. Thus, a learning-based method can be used to blindly estimate the scene acoustic features, in terms of sub-band reverberation time, from unlabeled recorded signals.
In further examples, certain acoustic characteristics of a scene can be modeled using its reverberation time and can be represented using a multivariate Gaussian distribution. The multivariate Gaussian distribution can be fit to the predicted feature distribution and can be used to draw a desired number of AIR samples that have similar reverberation characteristics. This distribution can be used to select acoustic impulse responses from a large real-world dataset for augmenting speech data. The samples can be used to generate a training set, which is used to train an ASR model for the same environment. The speech recognition system trained on example scene-aware data consistently outperforms the system trained using many more random acoustic impulse responses. The benefit of the disclosed systems and methods is shown by extensively comparing them to alternative data strategies on two public far-field ASR benchmarks: the REVERB challenge and the AMI corpus. The disclosed systems and methods can utilize a subset that is only 5% of the available real-world AIRs for training while consistently outperforming results obtained by using the full set of AIRs. The disclosed systems and methods also outperform uniformly selected subsets of the same size by up to 2.64% absolute word error rate.
In some examples, scenarios can be considered where clean speech is artificially reverberated using AIRs and recorded noise to create far-field training data. In some instances, real-world AIRs or simulated ones can be used. The noise recordings may include non-speech recordings of actual noise at the target location (e.g., machine noise such as HVAC, ambient traffic noise, pet noise, or the like), or may include pre-recorded noises from other locations/environments that are determined to be related to actual noise at the target location. When a set of AIRs and noise recordings is available, the set of AIRs and noise recordings can be mixed using the formulation expressed by:
xr[t]=x[t]*h[t]+d[t],  (Equation 1)
where * denotes a linear convolution, x[t] represents the clean speech signal, h[t] is the AIR corresponding to the speech source, d[t] represents ambient noise, and xr[t] represents the augmented far-field speech that is used for training. Obtaining a large number of x[t] and d[t] can be relatively easy. However, capturing h[t] from the real world is laborious. Over the past decades, multiple research groups have collected a number of usable AIR datasets, as listed in Table 1.
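As a rough illustration of how Equation 1 might be applied in practice, the following Python sketch convolves a clean utterance with an AIR and adds scaled ambient noise; the file names, the signal-to-noise ratio handling, and the helper name reverberate are illustrative assumptions rather than part of the disclosed method.

```python
# Minimal sketch of Equation 1: xr[t] = x[t] * h[t] + d[t]
# Assumes 16 kHz, single-channel WAV files; names and SNR handling are illustrative.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

def reverberate(clean, air, noise, snr_db=20.0):
    """Convolve clean speech with an AIR and add scaled ambient noise."""
    reverbed = fftconvolve(clean, air, mode="full")[: len(clean)]
    noise = np.resize(noise, len(reverbed))          # loop/trim noise to the speech length
    speech_power = np.mean(reverbed ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return reverbed + gain * noise                   # augmented far-field speech

clean, sr = sf.read("clean_utterance.wav")           # x[t]
air, _ = sf.read("room_air.wav")                     # h[t]
noise, _ = sf.read("ambient_noise.wav")              # d[t]
far_field = reverberate(clean, air, noise)           # xr[t], used for training
sf.write("augmented_far_field.wav", far_field, sr)
```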
Although the full set of AIRs may have more diverse reverberation characteristics than any individual dataset, they might not match the characteristics of any target scene. For example, if the ASR device is to be deployed in a very reverberant room, there may be many samples in the full AIR dataset that are not very reverberant and thus will not help with ASR applications in these scenarios, and it may be better to train a model with matched data. Thus, the full available set of AIRs for training may not be optimal. Therefore, instead of training a huge model with all the AIR-augmented data and hoping it can generalize well to all scenes, a few unlabeled speech samples can be analyzed from the target scene, and then a matched set of AIRs can be selected to augment data for this scene. Alternatively, many ASR models that specialize in different scenarios can be pre-trained, and the scene analysis can be directly used to select and load a model already trained on matched data.
The general outline of this process 100 is summarized in the accompanying drawings.
Many acoustic features can be used to describe the acoustic properties of an environment. The AIR fully represents the sound propagation behavior in an environment and is of crucial interest during scene analysis. Some standard acoustic metrics including reverberation time (T60), direct-to-reverberant ratio (DRR), early decay time (EDT), clarity (C80), definition (D50), etc., are useful scalar descriptors that can be calculated from an AIR. In an example scene-aware framework, some of these metrics can be blindly estimated from raw speech signals because the exact AIR may not be available corresponding to the test conditions. One example acoustic metric is T60. T60 can be defined as the time it takes for the initial impulse energy to decay by 60 dB, either for full-band or sub-band. However, it should be appreciated that any other acoustic metrics can be used in the example scene-aware framework. In the example scene-aware framework, sub-band T60 can be predicted to capture the frequency dependency of real-world AIRs.
A deep neural network (DNN) can be used in the example scene-aware framework. In one example, a DNN can include six 2D convolutional layers followed by a fully connected layer. In some examples, the DNN can receive a 4-second speech spectrogram as input, and output multiple sub-band metrics (e.g., 7 sub-band T60s) centered at various frequencies (e.g., 125, 250, 500, 1000, 2000, 4000, and 8000 Hz). In some examples, this estimator is trained on synthetic AIRs and tested on unseen real-world AIRs from a real-world AIR dataset (e.g., the MIT IR Survey, which has diverse T60s). The synthetic AIRs can be used because they are sufficient to obtain T60 estimates with acceptable error. Thus, the real-world AIRs can be reserved for testing, and a 0.23 s mean test error is obtained despite the synthetic vs. real-world domain mismatch. In other examples, real-world AIRs could be used for training the analysis network instead of synthetic AIRs.
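For concreteness, an estimator of the kind described above (six 2D convolutional layers followed by a fully connected layer, mapping a 4-second spectrogram to 7 sub-band T60 values) might be sketched as follows in PyTorch; the channel counts, kernel sizes, pooling, and input spectrogram shape are assumptions made for illustration and are not the disclosed configuration.

```python
# Rough sketch of a sub-band T60 estimator: 6 conv layers + 1 fully connected layer.
# Channel counts, kernel sizes, and the input spectrogram shape are illustrative assumptions.
import torch
import torch.nn as nn

class SubbandT60Estimator(nn.Module):
    def __init__(self, n_subbands=7):
        super().__init__()
        chans = [1, 16, 32, 64, 64, 128, 128]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.ReLU()]
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)          # collapse time/frequency dimensions
        self.fc = nn.Linear(chans[-1], n_subbands)   # one T60 value per sub-band

    def forward(self, spectrogram):                  # (batch, 1, freq_bins, frames)
        feats = self.pool(self.conv(spectrogram)).flatten(1)
        return self.fc(feats)                        # (batch, 7) sub-band T60s in seconds

# Example: a batch of 4-second spectrograms (e.g., 257 frequency bins x 251 frames).
model = SubbandT60Estimator()
dummy = torch.randn(8, 1, 257, 251)
print(model(dummy).shape)  # torch.Size([8, 7])
```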
In some examples, publicly available AIRs can be collected, as listed in Table 1. Because they come in different audio formats, all AIRs can be converted to single-channel forms (e.g., at a 16 kHz sample rate). In the example experiments, 3316 real-world AIRs are retained in total. This serves as a full set of AIRs for experiments. Then, the sub-band T60s (e.g., a vector of length 7) of each AIR can be evaluated. In some scenarios, the T60 vectors of the full AIR set can be denoted as {si}i=1 . . . K, si ∈ ℝ^7, where K=3316. With the sub-band T60 labels, a subset of them can be selected to match with any desired distribution.
For example, suppose N unlabeled noisy speech samples exist from the target scene and M (M<K) real-world AIRs are to be selected to match this scene (M can be independent of N). The T60 estimator described above can be used to create T60 labels for the N unlabeled noisy speech samples, which yields N sub-band T60 vectors {ti}i=1 . . . N, ti ∈ ℝ^7. While multiple measurements in the same room have some noise and the T60s may not be identical, the N sub-band T60 vectors resemble a Gaussian distribution. Therefore, the column-wise mean vector μ and the covariance matrix Σ of the N×7 T60 prediction matrix can be calculated by stacking {ti} vertically. The column-wise mean vector μ and the covariance matrix Σ can be used to draw M intermediate samples {{circumflex over (t)}i}i=1 . . . M following the multivariate Gaussian distribution N(μ, Σ). Finally, the pair-wise distance (e.g., pair-wise Euclidean distance) between all M intermediate samples {{circumflex over (t)}i} and all the T60 vectors of the full AIR set {si} can be computed, forming a distance matrix D ∈ ℝ^(M×K), where Di,j=dist({circumflex over (t)}i, sj). In some examples, the scene matching problem can be expressed as:
ki*=argmink Di,k, i=1, . . . , M,
where {ski*} denotes the T60 vectors of the M selected AIRs, i.e., the subset of the full set of AIRs that best matches the target scene.
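A compact sketch of this matching step, fitting the multivariate Gaussian to the predicted sub-band T60 vectors, drawing M intermediate samples, and selecting the nearest AIR for each sample, is given below; the function name and the stand-in data are illustrative, and the selection here is per-sample nearest-neighbor, so duplicate indices are possible unless deduplicated.

```python
# Sketch of scene-aware AIR selection via a multivariate Gaussian fit over sub-band T60s.
# t_pred: (N, 7) T60 vectors predicted from unlabeled target-scene speech.
# s_full: (K, 7) T60 vectors of the full AIR set.  Names and data are illustrative.
import numpy as np
from scipy.spatial.distance import cdist

def select_scene_aware_airs(t_pred, s_full, m, rng=np.random.default_rng(0)):
    mu = t_pred.mean(axis=0)                        # column-wise mean vector
    sigma = np.cov(t_pred, rowvar=False)            # 7x7 covariance matrix
    t_hat = rng.multivariate_normal(mu, sigma, m)   # M intermediate samples from N(mu, sigma)
    dist = cdist(t_hat, s_full)                     # (M, K) pair-wise Euclidean distances
    return np.argmin(dist, axis=1)                  # index k_i of the closest AIR per sample
                                                    # (duplicates possible; deduplicate if desired)

# Example with random stand-in data: select M = 166 out of K = 3316 AIRs.
t_pred = np.abs(np.random.randn(500, 7)) * 0.3 + 0.5
s_full = np.abs(np.random.randn(3316, 7)) * 0.5 + 0.3
selected = select_scene_aware_airs(t_pred, s_full, m=166)
print(selected.shape)  # (166,) indices into the full AIR set
```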
Various types of DNN algorithms can be utilized in accordance with the concepts described herein: both for creating a T60 estimator and for use as an ASR that utilizes augmented training data to learn to recognize speech.
In some instances, an ASR may be based on a deep-learning model. A deep-learning model can be effectively trained with the selected M samples for the target scene or scenes having similar acoustic characteristics to the target scene. In some embodiments, a deep-learning model can include an input layer, one or more hidden layers (or nodes), and an output layer. In some instances, the input layer may include as many nodes as inputs provided to the deep-learning model. The number (and the type) of inputs provided to the deep-learning model may vary (e.g., based on the types of scenes, the types of acoustic characteristics, the number of scenes to be recognized, the number of channels of audio recording, the frequency spectrum of a recording, sub-bands of the recording, etc.).
The input layer connects to one or more hidden layers. The number of hidden layers varies and may depend on the particular task for the deep-learning model. Additionally, each hidden layer may have a different number of nodes and may be connected to the next layer differently. For example, each node of the input layer may be connected to each node of the first hidden layer. The connection between each node of the input layer and each node of the first hidden layer may be assigned a weight parameter. Additionally, each node of the deep-learning model may also be assigned a bias value. In some configurations, each node of the first hidden layer may not be connected to each node of the second hidden layer. That is, there may be some nodes of the first hidden layer that are not connected to all of the nodes of the second hidden layer. The connections between the nodes of the first hidden layers and the second hidden layers are each assigned different weight parameters. Each node of the hidden layer is generally associated with an activation function. The activation function defines how the hidden layer is to process the input received from the input layer or from a previous input or hidden layer.
Each hidden layer may perform a different function. For example, some hidden layers can be convolutional hidden layers which can, in some instances, reduce the dimensionality of the inputs. Other hidden layers can perform statistical functions such as max pooling, which may reduce a group of inputs to the maximum value; an averaging layer; batch normalization; and other such functions. In some of the hidden layers, each node is connected to each node of the next hidden layer, in which case they may be referred to as dense layers. Deep-learning models that include more than, for example, three hidden layers may be considered deep neural networks.
The last hidden layer in some embodiments of an ASR deep-learning model can be connected to the output layer. Similar to the input layer, the output layer typically has the same number of nodes as the possible outputs. In an example in which the deep-learning model is used for ASR, the output layer may output recognized speech.
Different types of training processes can be used for DNNs. The training processes may include, for example, gradient descent, Newton's method, conjugate gradient, quasi-Newton, Levenberg-Marquardt, among others.
The deep-learning model can be constructed or otherwise trained for a target scene based on the selected M samples using one or more different learning techniques, such as supervised learning, unsupervised learning, reinforcement learning, ensemble learning, active learning, transfer learning, or other suitable learning techniques for neural networks. As an example, supervised learning involves presenting a computer system with example inputs and their actual outputs (e.g., categorizations). In these instances, the deep-learning model is configured to learn a general rule or model that maps the inputs to the outputs based on the provided example input-output pairs.
Different types of deep-learning models can have different network architectures (e.g., number of layers, type of layers, ordering of layers, connections between layers, hyperparameters for layers). In some configurations, deep-learning models can be structured as a single-layer perceptron network, in which a single layer of output nodes is used and inputs are fed directly to the outputs by a series of weights. In other configurations, deep-learning models can be structured as multilayer perceptron networks, in which the inputs are fed to one or more hidden layers before connecting to the output layer.
As one example, a deep-learning model can be configured as a feedforward network, in which the connections between nodes do not form any loops in the network. As another example, a deep-learning model can be configured as a recurrent neural network (“RNN”), in which connections between nodes are configured to allow for previous outputs to be used as inputs while having one or more hidden states, which in some instances may be referred to as a memory of the RNN. RNNs are advantageous for processing time-series or sequential data. Examples of RNNs include long short-term memory (“LSTM”) networks, networks based on or using gated recurrent units (“GRUs”), or the like.
Deep-learning models can be structured with different connections between layers. In some instances, the layers are fully connected, in which each of the inputs in one layer is connected to each of the outputs of the previous layer. Additionally or alternatively, deep-learning models can be structured with trimmed connectivity between some or all layers, such as by using skip connections, dropouts, or the like. In skip connections, the output from one layer jumps forward two or more layers in addition to, or in lieu of, being input to the next layer in the network. An example class of neural networks that implement skip connections are residual neural networks, such as ResNet. In a dropout layer, nodes are randomly dropped out (e.g., by not passing their output on to the next layer) according to a predetermined dropout rate. In some embodiments, a deep-learning model can be configured as a convolutional neural network (“CNN”), in which the network architecture includes one or more convolutional layers. Thus, a deep-learning model trained on the AIR samples selected from the full set of AIRs for a target scene can be used to perform effective far-field speech recognition.
An example of the matching outcome is depicted in the accompanying drawings.
To test the effectiveness of the example scene-aware ASR approach, experiments with two far-field ASR benchmarks were conducted, and the word error rate (WER) resulting from different AIR selection strategies is reported. In the experiments, the example scene-aware AIRs are compared with two alternative strategies for speech augmentation purposes: 1. use all available AIRs (i.e., the full set); 2. select a subset of AIRs with a uniform T60 distribution. The same set of additive noise recorded in the BUT Reverb Database was used during augmentation for all experiments. Both experiments were performed with the Kaldi ASR toolbox using a time delay neural network (TDNN). Other examples of neural networks may also be utilized, such as other convolutional neural networks or other ASR algorithms that require labeled/transcribed training data. In this example, each TDNN was trained on a workstation with two GeForce® RTX 2080 Ti graphics cards.
The AMI corpus includes 100 hours of meeting recordings. The recordings include close-talking speech recorded by individual headset microphones (IHM) and far-field speech recorded by single distant microphones (SDM). While the IHM data is not of anechoic quality, it still has a very high signal-to-noise ratio compared with the SDM data, so it can be considered clean speech. The original Kaldi recipe treats the IHM and SDM partitions as separate tasks (i.e., both training and test sets are from the same partition), so a modified pipeline is used: the IHM data is reverberated using Equation 1 to form the training set, and the trained model is tested on SDM data.
In the experiments, scene matching was performed using the example method described above, targeting the SDM data. In total, 166 real-world AIRs (5% of the full set) were selected for augmentation of the IHM data. Further, because many IHM recordings have very long durations with pauses between utterances, per-segment level speech reverberation was also performed. To do so, each recording was scanned and, whenever a continuous 3 seconds of non-speech was detected, the recording was split at the beginning of the silent frames to prevent inter-segment speech from overlapping after adding reverberation. Each segment was randomly assigned an AIR from the AIR pool (either the full set, the uniform set, or the scene-aware set) for convolution. The original 687 IHM recordings were split into 17749 segments, which enabled better utilization of the AIRs for augmentation.
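A simplified sketch of this per-segment reverberation step is shown below; the energy-threshold silence detector and its parameters are assumptions made for illustration (the experiments above do not specify how non-speech segments were detected), and only the 3-second criterion and the random AIR assignment follow the description.

```python
# Sketch: cut a long recording wherever roughly 3 seconds of low-energy (assumed
# non-speech) audio is found, then assign each segment a randomly chosen AIR from
# the selected pool.  The energy-based detector and thresholds are illustrative.
import numpy as np

def split_on_silence(signal, sr, min_silence_s=3.0, frame_s=0.025, thresh=1e-4):
    frame = int(sr * frame_s)
    n_frames = len(signal) // frame
    energy = np.array([np.mean(signal[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n_frames)])
    silent = energy < thresh
    cuts, run_start = [], None
    for i, s in enumerate(silent):
        if s and run_start is None:
            run_start = i
        elif not s:
            if run_start is not None and (i - run_start) * frame_s >= min_silence_s:
                cuts.append(run_start * frame)       # split at the beginning of the silent run
            run_start = None
    bounds = [0] + cuts + [len(signal)]
    return [signal[a:b] for a, b in zip(bounds[:-1], bounds[1:]) if b > a]

def assign_airs(segments, air_pool, rng=np.random.default_rng(0)):
    # One randomly drawn AIR per segment (full, uniform, or scene-aware pool).
    return [(seg, air_pool[rng.integers(len(air_pool))]) for seg in segments]
```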
Test results are shown in Table 2. The results are also provided for the clean IHM training set not using any AIR, whose WER is the worst, indicating it is difficult to expect good results from severely mismatched scenes (i.e., IHM vs SDM). Overall, the results show that using the full set of AIRs is not optimal. On the SDM test data, using a uniform subset of AIRs achieves 1.2% (absolute) lower WER than the full set. In addition, the example scene-aware subset achieves 1.9% (absolute) lower WER than the full set, making the example scene-aware subset the best augmentation set.
The REVERB challenge is based on the WSJCAM0 corpus, which contains 140 speakers each speaking approximately 110 utterances. In this challenge, 3 rooms of different sizes are used to create artificial reverberation data (simulated rooms) by convolving their AIRs with the clean WSJCAM0 speech. Another large room is used to record re-transmitted WSJCAM0 speech (real room). Microphones are placed at two distances (near and far) from the speaker in both simulated and real rooms. Note that the full set contains the original AIRs from the REVERB challenge as they are sourced in Table 1; to avoid leaking any “ground truth” AIRs into training, these AIRs were excluded from the full set during this experiment. From the reduced full AIR set, 166 AIRs were selected to match the real room scene, and the 166 AIRs were mixed with 7861 clean utterances from REVERB for training. The test was also performed on simulated room 3, which has a similar T60 to the real room.
The results are presented in Table 3 below. For the real room results, using the full set outperforms using a uniform subset. However, the example scene-aware subset still consistently performs the best on both near and far microphone tests. Compared to the same-sized uniform subset, the example scene-aware subset achieves up to a 2.64% (absolute) improvement in WER. Even though the training set is not intentionally matched to simulated room 3, the scene-aware subset still achieves the best far microphone results and the second-best near microphone results.
In the present disclosure, a DNN-based sub-band T60 estimator is used to non-intrusively analyze speech samples from a target environment, and a multivariate Gaussian distribution is fit to effectively represent the T60 distribution. The fitted distribution is used to guide the selection of real-world AIRs, which can generate scene-aware training data for the target environment. On both the REVERB challenge and the AMI corpus, the example scene-aware AIR selection always results in the ASR model with the lowest word error rate.
At block 310, the apparatus can convert multiple acoustic impulse response (AIR) datasets to multiple single-channel AIR samples. For example, multiple research groups have collected usable AIRs. The multiple AIR datasets from multiple research groups can have diverse reverberation characteristics at diverse scenes, which can increase the accuracy of speech recognition in different scenes. However, the multiple AIR datasets can be produced/distributed in different audio formats. Thus, the apparatus can convert the multiple AIR datasets into a single-channel format. For example, the converted multiple single-channel AIR samples can be generated at a 16 kHz sample rate. However, it should be appreciated that the sample rate for the single-channel AIR samples can be any other suitable sample rate. The multiple single-channel AIR samples are described as a set of AIRs or the full set of AIRs above. In some examples, the apparatus may use a single AIR dataset rather than multiple AIR datasets with different audio formats. In other examples, the apparatus may exploit multiple AIR datasets with a single sample rate. In those cases, the apparatus may not need to perform the conversion process of block 310, and block 310 can be optional.
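A conversion step such as block 310 could be sketched as follows, reading each AIR recording, collapsing it to a single channel, and resampling to 16 kHz; the directory layout, file handling, and library choices are illustrative assumptions.

```python
# Sketch of block 310: convert heterogeneous AIR recordings to single-channel, 16 kHz audio.
# Paths and the output naming scheme are illustrative assumptions.
from pathlib import Path
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

TARGET_SR = 16000

def to_mono_16k(path):
    audio, sr = sf.read(str(path), always_2d=True)   # (samples, channels), any source format
    mono = audio.mean(axis=1)                        # collapse to a single channel
    if sr != TARGET_SR:
        g = np.gcd(sr, TARGET_SR)
        mono = resample_poly(mono, TARGET_SR // g, sr // g)
    return mono

out_dir = Path("air_16k_mono")
out_dir.mkdir(exist_ok=True)
for wav in Path("air_datasets").rglob("*.wav"):      # assumed directory of collected AIR datasets
    sf.write(str(out_dir / wav.name), to_mono_16k(wav), TARGET_SR)
```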
In some examples, using the full set of possible/available AIRs to train a deep-learning model for far-field speech recognition might not be optimal due to different scenes having different reverberation characteristics. For example, AIR samples that are not reverberant might not be very effective for training a deep-learning model to recognize speech in a very reverberant room. Thus, blocks 320-370 generally elaborate on how to select a scene-specific subset of the full AIR dataset for training.
At block 320, the apparatus may generate multiple vectors corresponding to the multiple single-channel AIR samples or the full set of AIRs. For example, the apparatus can generate a vector to represent the acoustic characteristics of each AIR of the full set of AIRs. The vectors (e.g., T60 vectors) for the full set of AIRs can be evaluated directly from the impulse responses, by definition, using non-deep-learning methods. For example, the multiple vectors corresponding to the full set of AIRs can include reverberation time (T60) vectors. In some examples, T60 can be defined as the time it takes for the initial impulse energy to decay by 60 dB, either for full-band or sub-band. However, it should be appreciated that the vector is not limited to a T60 vector to represent the acoustic characteristics of a scene. For example, the vector may be indicative of reverberation time (T60), direct-to-reverberant ratio (DRR), early decay time (EDT), clarity (C80), definition (D50), or any other suitable acoustic metrics that represent acoustic characteristics of the corresponding scene.
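As an illustration of evaluating T60 directly from an AIR by its definition, the sketch below uses Schroeder backward integration of the impulse energy and extrapolates the decay slope to 60 dB; the fit range and the omission of sub-band filtering are simplifying assumptions rather than the disclosed procedure.

```python
# Sketch of estimating T60 directly from an AIR via the Schroeder energy decay curve.
import numpy as np

def t60_from_air(air, sr, fit_range_db=(-5.0, -25.0)):
    """Full-band T60 from an AIR via the Schroeder energy decay curve."""
    energy = np.cumsum(air[::-1] ** 2)[::-1]               # Schroeder backward integration
    edc_db = 10.0 * np.log10(energy / energy[0] + 1e-12)   # energy decay curve in dB
    t = np.arange(len(air)) / sr
    hi, lo = fit_range_db
    mask = (edc_db <= hi) & (edc_db >= lo)                 # fit the decay between -5 and -25 dB
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)        # decay rate in dB per second
    return -60.0 / slope                                   # extrapolate to a 60 dB decay

# For sub-band T60s, the AIR would first be band-pass filtered around each center
# frequency (e.g., 125 Hz to 8 kHz); that step is omitted in this sketch.
```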
In some examples, once the apparatus generates the multiple vectors corresponding to the full set of AIRs, the apparatus does not need to regenerate the multiple vectors because the full set of base AIRs does not change. Thus, the apparatus may reuse the multiple vectors corresponding to the full set of AIRs to train multiple deep-learning models. When an additional set of AIRs is added to the full set of AIRs, the apparatus can generate additional vectors corresponding to the additional set and add the additional vectors to the multiple vectors for a bigger base AIR dataset.
At block 330, the apparatus can receive noisy speech samples at a target scene. In some examples, the noisy speech samples are not labeled by a sub-band estimator. In some examples, the apparatus may perform block 330 separate from block 320 or after performing block 320. In various examples, such speech samples are recorded and received with both knowledge and consent of individuals, in accordance with all applicable laws, and securely to ensure the privacy of individuals, to protect confidential information and other sensitive subject matter, etc.
At block 340, the apparatus can generate multiple labeled vectors corresponding to the noisy speech samples using a trained sub-band estimator. For example, the apparatus can blindly estimate certain metrics (e.g., T60, DRR, EDT, C80, D50, etc.) from the noisy speech samples at the target scene using the trained sub-band estimator. In some examples, the sub-band estimator can create T60 labels for the noisy speech samples and generate multiple sub-band T60 vectors with T60 labels corresponding to the multiple speech samples. In some examples, the sub-band estimator can predict sub-band T60 to capture the frequency dependency of real-world AIRs. In a non-limiting example, the sub-band estimator can receive unlabeled noisy speech samples as input, and output labeled T60 vectors. Each vector can include multiple sub-band T60s centered at multiple frequencies (e.g., 7 sub-band T60s centered at 125, 250, 500, 1000, 2000, 4000, and 8000 Hz).
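At inference time, block 340 amounts to computing a spectrogram of each noisy sample and passing it through the trained estimator; the short sketch below assumes the illustrative SubbandT60Estimator shown earlier and uses torchaudio's spectrogram transform with assumed STFT parameters.

```python
# Sketch of block 340: label noisy target-scene speech with sub-band T60 vectors.
# Assumes the illustrative SubbandT60Estimator defined earlier and 16 kHz audio;
# the STFT parameters are assumptions.
import torch
import torchaudio

spec = torchaudio.transforms.Spectrogram(n_fft=512, hop_length=256)

def label_speech(waveforms, estimator):
    """waveforms: (batch, 64000) tensors of 4-second, 16 kHz noisy speech."""
    estimator.eval()
    with torch.no_grad():
        spectrograms = spec(waveforms).unsqueeze(1)   # (batch, 1, freq_bins, frames)
        return estimator(spectrograms)                # (batch, 7) sub-band T60 labels
```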
The apparatus or another apparatus may train the sub-band estimator based on synthetic AIRs. However, it should be appreciated that the estimator may be trained with real-world AIRs. Thus, the multiple labeled vectors can represent acoustic characteristics at the target scene rather than at mixed scenes in the full set of AIRs. In some examples, the sub-band estimator can include six 2D convolutional layers followed by a fully connected layer. Then, the apparatus may select a subset of the full set of AIRs to match with any desired distribution through blocks 350-370.
At block 350, the apparatus can generate multiple intermediate samples based on the multiple labeled vectors. For example, the apparatus can generate multiple intermediate samples because the multiple labeled vectors may not be identical due to some noise. In some examples, the apparatus can calculate the column-wise mean vector μ and the covariance matrix Σ of the N×7 T60 prediction matrix (where N is the number of multiple labeled vectors) by stacking the labeled vectors vertically. The matrix can be written as [v1; v2; . . . ; vN], where vi is the ith predicted T60 vector. Then, multiple intermediate samples can be generated based on the multiple labeled vectors. For example, an intermediate sample ({circumflex over (t)}i) can be directly sampled from a known Gaussian distribution N(μ, Σ) using a random number generator. In some examples, the number of multiple intermediate samples can be the same as or different from the number of multiple labeled vectors. In further examples, the number of multiple intermediate samples can be the same as the number of AIRs in the subset of the full set of AIRs used to train a deep-learning model for the target scene.
At block 360, the apparatus can determine pair-wise distances between the intermediate samples and the multiple vectors. Here, the multiple vectors may correspond to the multiple single-channel AIR samples. In some examples, the M intermediate samples can be expressed as {{circumflex over (t)}i}i=1, . . . , M and the K vectors corresponding to the full set of AIRs can be expressed as {si}i=1, . . . , K. Then, the apparatus can calculate and determine the pair-wise distance (e.g., Euclidean distance) between all intermediate samples {{circumflex over (t)}i} and the multiple vectors {si}, which can be expressed as a distance matrix D ∈ ℝ^(M×K), where Di,j=dist({circumflex over (t)}i, sj). Thus, the apparatus can determine an M×K distance matrix based on the distance between each intermediate sample and each vector.
At block 370, the apparatus can select a subset of the multiple single-channel samples, or of the full set of AIRs, having the minimum distances between the intermediate samples and the multiple vectors. For example, the apparatus may select M vectors among the K multiple vectors corresponding to the M intermediate samples. In some examples, the number of intermediate samples could be 5% of the size of the full set of AIRs. However, it should be appreciated that the number of intermediate samples could be 1%, 2%, 10%, 20%, or any other suitable fraction of the full set of AIRs. The apparatus can select the vector among the multiple vectors that minimizes the Euclidean distance to a given intermediate sample. The apparatus can repeat this process for each intermediate sample. Thus, the apparatus can select M vectors among the multiple vectors, one for each sample of the M intermediate samples. This can be expressed as:
ki*=argmink Di,k, i=1, . . . , M,
where K refers to the number of AIRs in the full set, M indicates the number of the multiple intermediate samples, and Di,k is the entry of the distance matrix corresponding to the ith intermediate sample and the kth vector of the full set of AIRs.
In other examples, the apparatus can determine vectors for corresponding intermediate samples so as to minimize the overall distance. For example, a distance matrix can be expressed as D ∈ ℝ^(M×K), where Di,j=dist({circumflex over (t)}i, sj), {circumflex over (t)}i is an intermediate vector at label i, and sj is a vector of the multiple vectors corresponding to the full set of AIRs at label j. The apparatus can determine a label j for each intermediate vector i to minimize the overall distance. Thus, the labels are chosen such that the sum of the distances between the M intermediate vectors and their assigned vectors among the K vectors corresponding to the full set of AIRs is smaller than for any other assignment.
In some examples, the labels j selected for different intermediate vectors i do not overlap. Since each selected label directly corresponds to a single-channel sample of the full set of AIRs, the apparatus can select a subset of the multiple single-channel samples, or of the full set of AIRs, corresponding to the selected M vectors. In some examples, the selected subset of the full set of AIRs may include AIRs from scenes similar to the target scene and may have acoustic characteristics similar to those of the target scene.
At block 380, the apparatus can train a deep-learning model using the subset of the multiple single-channel AIR samples for the target scene. In addition, the apparatus can exploit the noisy speech samples to train the deep-learning model. Thus, the apparatus can train the deep-learning model with augmented speech data.
In some embodiments, the apparatus can repeatedly perform blocks 330-380 for different target scenes. Thus, the apparatus can train multiple deep-learning models with different sets of AIRs for corresponding target scenes. For example, if multiple rooms exist, some rooms may be reverberant while other rooms are not, and each room can have a different level of acoustic characteristics. As such, multiple deep-learning models corresponding to the multiple rooms can be trained with different subsets of the full set of AIRs. In other examples, some rooms having similar acoustic characteristics can be grouped together and used to train a deep-learning model for the rooms.
At block 410, an apparatus receives a far-field runtime speech sample. In some examples, the apparatus may not receive one or more pieces of information (or any information) describing or otherwise indicating attributes of a scene of the far-field runtime speech sample. In various examples, such speech samples are recorded and received with both knowledge and consent of individuals, in accordance with all applicable laws, and securely to ensure the privacy of individuals, to protect confidential information and other sensitive subject matter, etc.
At block 420, the apparatus can determine a deep-learning model trained with a dataset having similar acoustic characteristics to the far-field runtime speech sample. In the example process 300 described above, one or more deep-learning models can be trained with scene-specific subsets of the full set of AIRs for corresponding target scenes.
In some examples, to determine a scene-aware deep-learning model, the apparatus can generate a labeled vector corresponding to the far-field runtime speech sample using a trained sub-band estimator, as elaborated at block 340 of the example process 300 described above.
In some examples, the dataset used to train the deep-learning model can have the same labels in the full set of AIRs as the labels of the selected subset of the full set of AIRs for the far-field runtime speech sample. In other examples, the apparatus can select a trained deep-learning model when the dataset used to train the deep-learning model shares a predetermined number of AIRs with the selected subset of the full set of AIRs. In further examples, each AIR with a label can have a different weight for training a deep-learning model for a specific scene. Thus, the apparatus can select a trained deep-learning model when the difference between the weighted sum of the AIR dataset used to train the deep-learning model and the weighted sum of the selected subset of the full set of AIRs corresponding to the far-field runtime speech sample is less than a predetermined threshold. However, it should be appreciated that the selection of a deep-learning model based on the training dataset is not limited to the examples described above.
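One simple way to realize this model selection is to store, for each pre-trained model, a summary of the sub-band T60 vectors of the AIR subset it was trained on (for example, their mean) and to pick the model whose summary is closest to the runtime estimate; this nearest-centroid rule and the registry structure below are illustrative assumptions, not the only selection criteria contemplated above.

```python
# Sketch of block 420: pick the pre-trained, scene-specific ASR model whose training-AIR
# T60 statistics best match the runtime estimate.  A nearest-centroid rule is assumed.
import numpy as np

def choose_model(runtime_t60, model_registry):
    """
    runtime_t60:    (7,) sub-band T60 vector estimated from the runtime speech sample.
    model_registry: list of dicts like {"name": ..., "t60_mean": (7,) array, "model": ...}.
    """
    dists = [np.linalg.norm(runtime_t60 - entry["t60_mean"]) for entry in model_registry]
    return model_registry[int(np.argmin(dists))]

registry = [
    {"name": "small_dry_room", "t60_mean": np.full(7, 0.3), "model": None},
    {"name": "large_reverberant_hall", "t60_mean": np.full(7, 1.1), "model": None},
]
best = choose_model(np.full(7, 0.9), registry)
print(best["name"])  # large_reverberant_hall
```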
At block 430, once the apparatus develops (constructs, trains, etc.) the scene-aware deep-learning model, the apparatus can perform speech recognition of the far-field runtime speech sample using the trained deep-learning model.
In some aspects of the disclosure, the apparatus can train multiple deep-learning models based on scene-specific subsets of the full set of AIRs, as described in connection with the example process 300 above.
In some embodiments, a computing device 508 can receive the noisy speech samples at the target scene 502. In some embodiments, the computing device 508 can execute at least a portion of a sound processing system 510 to perform a sound processing task, such as generating labeled vectors corresponding to the noisy speech samples, generating multiple intermediate samples based on the labeled vectors, determining pair-wise distances between the intermediate samples and the multiple vectors, selecting a subset of the multiple single-channel samples having minimum distances between the intermediate samples and the multiple vectors, training a deep-learning model using the subset of the multiple single-channel samples for the target scene, determining a deep-learning model trained with a dataset having similar acoustic characteristics to the far-field speech sample, and/or performing speech recognition of the far-field speech sample.
In some examples, the computing device 508 may include one or more neural networks (e.g., a sub-band estimator to generate labeled vectors, and/or deep-learning models to perform speech recognition). In some examples, the computing device 508 or the sound processing system 510 can transmit noisy speech samples to a server 514 via a communication network 512. In other examples, the computing device or the sound processing system 510 can generate the multiple intermediate samples from the noisy speech samples and transmit the multiple intermediate samples to the server 514 via the communication network 512. Then, the server 514 may include at least a portion of the sound processing system 510 to perform any one or more of the operations or features, such as generating labeled vectors corresponding to the noisy speech samples, generating multiple intermediate samples based on the labeled vectors, determining pair-wise distances between the intermediate samples and the multiple vectors, selecting a subset of the multiple single-channel samples having minimum distances between the intermediate samples and the multiple vectors, training a deep-learning model using the subset of the multiple single-channel samples for the target scene, determining a deep-learning model trained with a dataset having similar acoustic characteristics to the far-field speech sample, and/or performing speech recognition of the far-field speech sample.
In addition, the server 514 may return information to the computing device 508 (and/or other suitable computing device) indicative of an output of a sound processing task performed by the sound processing system 510. However, the trained deep-learning models or sub-band estimator are not limited to being in the computing device 508 or the server 514. For example, the trained deep-learning models or sub-band estimator may be in a separate apparatus (e.g., a separate server, a cloud server, etc.). The sound processing system 510 may execute any or all of the one or more portions of processes 300 and 400.
In some examples, the computing device 508 and/or the server 514 can obtain the full set of AIRs in a database 516. In some instances, the full set of AIRs can be single-channel AIRs. For example, the computing device 508 or the server 514 can receive multiple AIR datasets from other databases and convert the multiple AIRs from the multiple AIR datasets to multiple single-channel AIR samples. Further, the computing device 508 and/or the server 514 can generate multiple vectors corresponding to the multiple single-channel AIR samples and store the multiple vectors and/or the multiple single-channel AIR samples (e.g., the full set of AIRs) in the database 516.
In some embodiments, the communication network 512 can be any suitable communication network or combination of communication networks. For example, the communication network 512 can include a Wi-Fi® network (which can include one or more wireless routers, one or more switches, etc.), a peer-to-peer network (e.g., a Bluetooth network), a cellular network (e.g., a 3G network, a 4G network, a 5G network, etc., complying with any suitable standard, such as CDMA, GSM, LTE™, LTE Advanced, NR, etc.), a wired network, etc. In some embodiments, the communication network 512 can be a local area network, a wide area network, a public network (e.g., the Internet), a private or semi-private network (e.g., a corporate or university intranet), any other suitable type of network, or any suitable combination of networks. Communications links between the devices described herein can be any suitable communications links for communicating data among them.
In some embodiments, the computing device 508 and/or the server 514 can be any suitable computing device or combination of devices, such as a desktop computer, a laptop computer, a smartphone, a tablet computer, a wearable computer, a server computer, a computing device integrated into a vehicle (e.g., an autonomous vehicle), a camera, a robot, a virtual machine being executed by a physical computing device, etc.
The apparatus may operate in the capacity of a server or a client machine in a client-server network environment, as a peer machine in a peer-to-peer (or distributed) network environment, or as a server or a client machine in a cloud computing infrastructure or environment. The apparatus may be a server computer, a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, a switch or bridge, or any apparatus capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that apparatus. Further, while a single apparatus is illustrated, the term “apparatus” shall also be taken to include any collection of apparatuses that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methods discussed herein.
The example apparatus (e.g., computing device 508, server 514) includes a processing device 602, a main memory 604 (such as read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or DRAM, etc.), a static memory 606 (such as flash memory, static random access memory (SRAM), etc.), and a storage device 618, which communicate with each other via a bus 630.
Processing device 602 represents one or more general-purpose processing devices such as a microprocessor, a central processing unit, or the like. More particularly, the processing device 602 may be a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processing device 602 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), a network processor, and the like. The processing device 602 is configured to execute instructions 622 for performing the operations and steps discussed herein.
The apparatus 600 may further include a network interface device 608 for connecting to a LAN, an intranet, the Internet, and/or an extranet. The apparatus 600 also may include a video display unit 610 (such as a liquid crystal display (LCD) or a cathode ray tube (CRT)), and one or more graphics processors 624 (such as a graphics card).
The storage device 618 (e.g., data storage device) may be a machine-readable storage medium (also known as a computer-readable medium) on which is stored one or more sets of instructions 622 (e.g., software or any type of machine-readable instructions) embodying any one or more of the operations or methods described herein. The instructions 622 may also reside, completely or at least partially, within the main memory 604 and/or within the processing device 602 during execution thereof by the computer system 600, the main memory 604 and the processing device 602 also constituting machine-readable storage media.
In an example, the instructions 622 may include transceiving instructions for receiving noisy speech samples at block 330 of the example process 300, and/or sound processing instructions for performing one or more other operations of processes 300 and 400.
While the storage device 618 is shown in an example implementation to be a single medium, the term “machine-readable storage medium” should be taken to include a single medium or multiple media (such as a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “machine-readable storage medium” shall also be taken to include any medium that is capable of storing or encoding a set of instructions for execution by the machine that cause the machine to perform any one or more of the operations or features of the present disclosure. The term “machine-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media and magnetic media. The term “machine-readable storage medium” shall accordingly exclude transitory storage mediums such as signals unless otherwise specified by identifying the machine-readable storage medium as a transitory storage medium or transitory machine-readable storage medium.
In another implementation, a virtual machine 640 may include a module for executing instructions such as transceiving instructions 632, and/or sound processing instructions 634. In computing, a virtual machine (VM) is an emulation of a computer system. Virtual machines are based on computer architectures and provide functionality of a physical computer. Their implementations may involve specialized hardware, software, or a combination of hardware and software.
Some portions of the preceding detailed descriptions have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. Such descriptions and representations (e.g., pre-training a deep-learning model based on a subset of AIRs, determining a trained deep-learning model, and/or performing scene-aware speech recognition) are the ways used by those skilled in the art to effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as “modifying” or “providing” or “calculating” or “determining” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage devices. The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the intended purposes, or it may comprise a computing device or other machine (e.g., a specialized or special purpose machine) selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The examples presented herein are not inherently related to any particular computer or other apparatus. Various types of systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the method. The structure for a variety of these systems will appear as set forth in the description below. In addition, the present disclosure is not described with reference to any particular programming language or form of machine-readable or machine-executable instructions. It will be appreciated that a variety of programming languages may be used to implement the teachings of the disclosure as described herein. Further, machine-readable and machine-executable instructions may exist and be used in various forms including source code, executable code, interpretable code such as an intermediate language, code written in a scripting language, etc.
The present disclosure may be provided as a computer program product, or software, that may include a machine-readable medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform operations in accordance with the present disclosure. A machine-readable medium generally may include any mechanism for storing information in a form readable by a machine (such as a computer). For example, a machine-readable (such as computer-readable) medium includes a machine (such as a computer) readable storage medium such as a read-only memory (“ROM”), random access memory (“RAM”), magnetic disk storage media, optical storage media, flash memory devices, etc.
In the foregoing specification, implementations of the disclosure have been described with reference to specific example implementations thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of implementations of the disclosure as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.
This application claims priority to U.S. Provisional Application No. 63/176,211, filed on Apr. 16, 2021, which hereby is incorporated herein by reference in its entirety.
This invention was made with government support under W911NF1810313 awarded by the Department of the Army, Army Research Office (ARO). The government has certain rights in the invention.