The present disclosure relates generally to the field of automatic speech recognition. More particularly, the present disclosure relates to systems and methods for automatic speech recognition using domain adaptation techniques.
Speech recognition has long been a subject of interest in computing and has many practical applications. For example, automatic speech recognition systems are often used in call centers, field operations, office scenarios, etc. However, current prior art systems for automatic speech recognition are not able to recognize a wide variety of types of speech from different types of speakers, such as speakers of different genders or with different accents. Another drawback of prior art systems is that models trained for speech recognition are biased, by way of their training data, toward one type of speech. For example, a model trained on a database of speech from American speakers might underperform if used with Australian speech. In other words, accent variation in speech poses additional difficulties for automatic speech recognition systems.
Moreover, training neural networks for automatic speech recognition becomes challenging when only limited amounts of supervised training data are available. In order for acoustic models to handle large acoustic variability, a large amount of labeled data is necessary, and such data can be expensive to obtain. In particular, it is expensive to obtain labeled speech data that contains sufficient variation across the different sources of acoustic variability, such as speaker accent, speaker gender, speaking style, different types of background noise, or the type of recording device. Prior art systems fall short in mitigating the effects of the acoustic variability that is inherent in the speech signal.
Several techniques have been proposed to mitigate the effects of acoustic variability in speech data. For example, feature space maximum likelihood linear regression, maximum likelihood linear regression ("MLLR"), maximum a posteriori ("MAP") adaptation, and vocal tract length normalization are all techniques used in generative acoustic models. Also, i-Vectors, learning hidden unit contributions ("LHUC"), and Kullback-Leibler ("KL") divergence regularization are adaptation techniques used for discriminative deep neural network ("DNN") acoustic models. All of these techniques require labeled data from the target domain to perform adaptation, and cannot perform speech recognition using raw speech.
Therefore, in view of existing technology in this field, what would be desirable are systems and methods for automatic speech recognition that operate on raw speech and are invariant to acoustic variability.
The present disclosure relates to systems and methods for automatic speech recognition using domain adaptation techniques. In particular, the present disclosure provides the application of adversarial training to learn features from raw speech that are invariant to acoustic variability. This acoustic variability can be referred to as a domain shift. The present disclosure leverages the architecture of domain adversarial neural networks ("DANNs"), which use data from two different domains. The DANN is a Y-shaped network that consists of a multi-layer convolutional neural network ("CNN") feature extractor module, a label (senone) classifier, and a domain classifier. The system of the present disclosure can be used for multiple applications with domain shifts caused by differences in gender and speaker accents.
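By way of non-limiting illustration, the Y-shaped arrangement described above can be sketched as follows in PyTorch-style Python. The convolutional layer sizes follow the configuration described later in this disclosure, while the module names, classifier depths, and the pooling used to obtain a fixed-length feature vector are illustrative assumptions (the gradient reversal mechanism used for adversarial training is discussed further below):

import torch
import torch.nn as nn

class DomainAdversarialNet(nn.Module):
    """Y-shaped network: a shared multi-layer 1-d CNN feature extractor feeding
    a label (senone) classifier and a domain classifier."""
    def __init__(self, num_senones, feat_dim=128):
        super().__init__()
        # Shared feature extractor G_f (conv sizes as described later in this disclosure)
        self.feature_extractor = nn.Sequential(
            nn.Conv1d(1, 256, kernel_size=64, stride=31), nn.AvgPool1d(2), nn.ReLU(),
            nn.Conv1d(256, feat_dim, kernel_size=15, stride=1), nn.AvgPool1d(2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),   # pooled to a fixed-length vector (an assumption)
        )
        # Label (senone) classifier G_y (depth and width illustrative)
        self.label_classifier = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(), nn.Linear(1024, num_senones),
        )
        # Domain classifier G_d (binary: source vs. target; depth and width illustrative)
        self.domain_classifier = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(), nn.Linear(1024, 2),
        )

    def forward(self, x):
        f = self.feature_extractor(x)      # shared features f_i
        return self.label_classifier(f), self.domain_classifier(f)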
Further, the systems and methods of the present disclosure achieve domain adaptation using domain classification along with label classification. Both the domain classifier and the label (senone) classifier can share a common multi-layer CNN feature extraction module. The network of the present disclosure can be trained to minimize the cross-entropy cost of the label classifier and at the same time maximize the cross-entropy cost of the domain classifier.
Moreover, the systems and methods of the present disclosure provide for unsupervised domain adaptation on discriminative acoustic models trained on raw speech using the DANNs. Unsupervised domain adaptation can be used to reduce acoustic variability due to many factors including, but not limited to, speaker gender and speaker accent. The present disclosure provides systems and methods where domain invariant features can be learned directly from raw speech with significant improvement over the baseline acoustic models trained without domain adaptation.
The foregoing features of the invention will be apparent from the following Detailed Description, taken in connection with the accompanying drawings.
The present disclosure relates to systems and methods for automatic speech recognition using domain adaptation techniques, as discussed in detail below in connection with the accompanying drawings.
As will be discussed herein, the present disclosure provides unsupervised domain adaptation using adversarial training on raw speech features. The present disclosure can solve classification problems with, for example, an input feature vector space X and an output label set Y = {0, 1, 2, . . . , L−1}. S(x, y) and T(x, y) can be unknown joint distributions defined over X×Y, referred to as the source and target distributions, respectively. The unsupervised domain adaptation algorithm takes as input labeled source domain data sampled from S(x, y) and unlabeled target domain data sampled from the marginal distribution T(x), as expressed by Equation 1, below:
$\{(x_i, y_i)\}_{i=0}^{n} \sim S(x, y); \qquad \{x_i\}_{i=n+1}^{n+n'=N} \sim T(x),$ (Equation 1)
where N = n + n' is the total number of input samples. As opposed to the class labels, which can be assumed to be available only for the source domain data, binary domain labels $d_i \in \{0, 1\}$ are defined as $d_i = 0$ for samples drawn from the source distribution and $d_i = 1$ for samples drawn from the target distribution, and can be assumed to be known for each sample.
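As a non-limiting illustration of how the labeled source samples and unlabeled target samples of Equation 1 can be combined during training, the following Python sketch (with hypothetical helper and variable names) attaches the binary domain labels described above to a mixed mini-batch:

import torch

def make_batch(src_x, src_y, tgt_x):
    """Combine labeled source samples and unlabeled target samples into one
    mini-batch, attaching binary domain labels (0 = source, 1 = target)."""
    x = torch.cat([src_x, tgt_x], dim=0)          # raw speech inputs from both domains
    domain = torch.cat([torch.zeros(len(src_x), dtype=torch.long),
                        torch.ones(len(tgt_x), dtype=torch.long)])
    # Class (senone) labels exist only for the source portion of the batch.
    return x, src_y, domain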
The feature extractor $G_f$ is a multi-layer CNN that takes the raw speech input vector $x_i$ and generates a d-dimensional feature vector $f_i \in \mathbb{R}^d$, given by Equation 2, below:
$f_i = G_f(x_i; \Theta_f),$ (Equation 2)
where $\Theta_f$ can be the parameters of the feature extractor, such as the weights and biases of the convolutional layers. The input vector $x_i$ can be from the source distribution S(x, y) or the target distribution T(x). The 1-d convolution operation in a convolutional layer of the network can be defined by Equation 3, below:
Equation 3 gives the feature vector output at index m from the first-layer convolution operation on the input feature vector $x_i$, where $\eta_c^{f_1}$ denotes the k-dimensional vector of weights and biases of the c-th convolutional filter of the first convolutional layer. The function σ(·) is a non-linear activation function such as the sigmoid or ReLU.
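By way of a non-limiting sketch, the operation described above (a k-dimensional filter applied to a window of the raw input, followed by the non-linearity σ(·)) can be implemented as follows in Python; the function and variable names are illustrative assumptions and the disclosure's precise indexing may differ:

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def first_layer_output(x_i, eta_c, bias_c, m, sigma=relu):
    """Feature output at index m of the c-th first-layer convolutional filter:
    the k-dimensional filter weights eta_c are applied to a k-sample window of
    the raw input x_i starting at index m, a bias is added, and the result is
    passed through the non-linearity sigma."""
    k = len(eta_c)
    window = x_i[m:m + k]
    return sigma(np.dot(eta_c, window) + bias_c)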
The label classifier 6 and the domain classifier 8 will now be explained in greater detail. The feature vector $f_i$, which can be extracted from $G_f$, can be mapped to a class label $y_i = G_y(f_i; \Theta_y)$ by the label classifier 6 ($G_y$) and to a domain label $d_i = G_d(f_i; \Theta_d)$ by the domain classifier 8 ($G_d$), as shown in the figures.
The parameter λ can be a hyper-parameter that weighs the relative contribution of the two costs. Equation 4 can also be written in a simpler form, as shown by Equation 5, below:
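By way of non-limiting example, a combined objective consistent with the foregoing description, following the commonly used DANN formulation (assumed here; the exact Equations 4 and 5 of the present disclosure may differ), is:

$$E(\Theta_f, \Theta_y, \Theta_d) \;=\; \sum_{i=0}^{n} L_y^i(\Theta_f, \Theta_y) \;-\; \lambda \sum_{i=0}^{N} L_d^i(\Theta_f, \Theta_d),$$

where $L_y^i$ is the cross-entropy label classification loss, computed only on the labeled source samples, and $L_d^i$ is the cross-entropy domain classification loss, computed on all samples.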
The label classifier 6 can minimize the label classification loss $L_y^i(\Theta_f, \Theta_y)$ on the data from the source distribution S(x, y). Accordingly, the label classifier 6 can optimize the parameters of both the feature extractor ($\Theta_f$) and the label predictor ($\Theta_y$). By doing so, the system of the present disclosure can ensure that the features $f_i$ are discriminative enough to permit good prediction on samples from the source domain. At the same time, the extracted features can be invariant to the shift in domain. In order to obtain domain invariant features, the parameters $\Theta_f$ of the feature extractor can be optimized to maximize the domain classification loss $L_d(\Theta_f, \Theta_d)$ while, at the same time, the parameters $\Theta_d$ of the domain classifier are optimized to classify the input features as accurately as possible. In other words, the domain classifier of the trained network is configured such that it cannot correctly predict the domain labels of the features coming from the feature extractor.
The desired parameters $\hat{\Theta}_f$, $\hat{\Theta}_y$, $\hat{\Theta}_d$ can provide a saddle point during a training phase and can be estimated as follows:
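For example, under the commonly used DANN formulation (assumed here; the exact expressions of the present disclosure may differ), the saddle point can be characterized as:

$$(\hat{\Theta}_f, \hat{\Theta}_y) = \arg\min_{\Theta_f, \Theta_y} E(\Theta_f, \Theta_y, \hat{\Theta}_d), \qquad \hat{\Theta}_d = \arg\max_{\Theta_d} E(\hat{\Theta}_f, \hat{\Theta}_y, \Theta_d).$$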
The model (e.g., the neural network) can be optimized by standard stochastic gradient descent (hereinafter "SGD") based approaches. The parameter updates during SGD can be defined as follows:
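By way of a non-limiting sketch, per-sample update rules consistent with this saddle point objective (assumed here to follow the gradient reversal implementation described below) are:

$$\Theta_f \leftarrow \Theta_f - \eta \left( \frac{\partial L_y^i}{\partial \Theta_f} - \lambda \frac{\partial L_d^i}{\partial \Theta_f} \right), \qquad \Theta_y \leftarrow \Theta_y - \eta \frac{\partial L_y^i}{\partial \Theta_y}, \qquad \Theta_d \leftarrow \Theta_d - \eta \frac{\partial L_d^i}{\partial \Theta_d},$$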
where η is the learning rate. The above equations can be implemented in the form of SGD by using a special Gradient Reversal Layer (hereinafter "GRL") placed at the end of the feature extractor and at the beginning of the domain classifier 8, as can be seen in the figures.
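A minimal sketch of such a gradient reversal layer, written here in PyTorch-style Python (an illustrative implementation rather than the exact code of the disclosed system), is:

import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the
    backward pass, so the feature extractor is trained to maximize the domain
    classification loss while the domain classifier minimizes it."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: domain_logits = domain_classifier(grad_reverse(features, lam))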
Implementation and testing of the system of the present disclosure will now be explained in greater detail. The TIMIT and Voxforge datasets can be used to perform domain adaptation experiments. For the TIMIT speech corpus, domain adaptation can be performed by taking male speech as the source domain and female speech as the target domain. For the Voxforge corpus, domain adaptation can be performed by taking the American accent and the British accent as the source domain and target domain, respectively, and vice-versa. For the TIMIT speech corpus, male and female speakers can be separated into source domain and target domain datasets. TIMIT is a read speech corpus in which a speaker reads a prompt in front of a microphone. It includes a total of 6,300 sentences, with 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States of America. It includes a total of 3,696 training utterances sampled at 16 kHz, excluding all SA utterances because they can create a bias in the dataset. The training set consists of 438 male speakers and 192 female speakers. The core test set is used to report the results; it includes 16 male speakers and 8 female speakers from all 8 dialect regions. For the Voxforge dataset, American accent speech and British accent speech can be taken as two separate domains. Voxforge is a multi-accent speech dataset with 5 second speech samples sampled at 16 kHz. Speech samples are recorded by users with their own microphones, which allows quality to vary significantly among samples. The Voxforge corpus has 64 hours of American accent speech and 13.5 hours of British accent speech, totaling 83 hours of speech. Results can be reported on 400 utterances for each of the two accents. Alignments can be obtained by using an HMM-GMM acoustic model trained using Kaldi, as known by those of skill in the art. The present disclosure is not limited to any dataset or to any of the parameters discussed above and below for testing, implementation, and experimentation.
Raw speech features can be obtained by applying a rectangular window of size 10 milliseconds to the raw speech with a frame shift of 10 milliseconds. A context of 31 frames can be added to the windowed speech features to obtain a total of 310 milliseconds of context-dependent raw speech features. These context-dependent raw speech features can be mean and variance normalized to obtain the final features.
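As an illustrative sketch only (the exact feature pipeline of the present disclosure may differ in details such as padding), the framing, context stacking, and normalization described above can be implemented as follows in Python, using the stated frame length, frame shift, and context width:

import numpy as np

def raw_speech_features(signal, sample_rate=16000, frame_ms=10, shift_ms=10, context=31):
    """Slice raw speech into 10 ms frames, stack a 31-frame context window
    around each frame (about 310 ms total), and mean/variance normalize."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[i * shift:i * shift + frame_len] for i in range(n_frames)])

    half = context // 2
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")   # edge padding assumed
    feats = np.stack([padded[i:i + context].reshape(-1) for i in range(n_frames)])

    # Mean and variance normalization of the final features.
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)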
The feature extractor can be a two-layer convolutional neural network. The first convolutional layer can have a filter size of 64 with 256 feature maps and a step size of 31. The second convolutional layer can have a filter size of 15 with 128 feature maps and a step size of 1. After each convolutional layer, an average-pool layer can be used with a pooling size of 2, followed by a ReLU activation unit. Both the label classifier 6 and the domain classifier 8 can be fully connected neural networks with ReLU activation units, with 4 layers and 6 layers and hidden unit sizes of 1024 and 2048 for TIMIT and Voxforge, respectively. The weights can be initialized in a Glorot fashion. The model can be trained with SGD with momentum, as known by those of skill in the art. The learning rate can be selected during training using the formula
where p increases linearly from 0 to 1 as training progresses, $\eta_0 = 0.01$, α = 10, and β = 0.75. A momentum of 0.9 can also be used. The adaptation parameter λ can be initialized at 0 and gradually changed to 1 according to the formula
where γ is set to 10, as known by those of skill in the art. Domain labels can be switched 10% of the time to stabilize the adversarial training. The present disclosure is not limited to any specific parameter, equation, or dataset as noted above.
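Under the assumption that the commonly used DANN training schedules apply (the exact formulas of the present disclosure may differ), learning rate and adaptation parameter schedules consistent with the stated values of $\eta_0$, α, β, and γ are:

$$\eta_p = \frac{\eta_0}{(1 + \alpha \cdot p)^{\beta}}, \qquad \lambda_p = \frac{2}{1 + \exp(-\gamma \cdot p)} - 1,$$

where p denotes training progress increasing linearly from 0 to 1.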
The results of testing of the system will now be discussed in greater detail. The tests specifically study acoustic variabilities such as speaker gender and speaker accent using the TIMIT and Voxforge speech corpora, respectively. Due to the possibly insufficient amount of labeled female speech data in the TIMIT corpus, domain adaptation tests can be performed only with male speech as the source domain and female speech as the target domain. For Voxforge, tests can be performed by taking the American accent as the source domain and the British accent as the target domain, and vice versa. Additional tests can also be performed by training the acoustic model on the labeled data from both domains, which can function as the lower limit for the achievable word error rate ("WER"). In the tables below, DANN represents the domain adapted acoustic model using labeled data from the source domain and unlabeled data from the target domain, and NN represents the acoustic model trained on the labeled data from the source domain only.
Table 1 below shows the phoneme error rate ("PER"), in percent, for the acoustic model trained on supervised data from the source domain and unsupervised data from the target domain for the TIMIT corpus, taking male speech as the source and female speech as the target.
The first two rows in Table 1 list the PER results for the acoustic model trained on labeled data from both domains with no domain adaptation. This acoustic model can provide effective results and can serve as the lower limit for the PER. Rows 3 and 4 of Table 1 provide results for the acoustic model trained on labeled data from the male speech and adapted using unlabeled data from the female speech. Specifically, row 3 indicates the effect of domain adaptation on the performance on data from the source domain, which is male speech in this case. Row 4 gives the PER for the un-adapted and adapted acoustic models on data from the target domain, which is female speech in this case.
Table 2 above shows the WER, in percent, for acoustic models trained on supervised data from the source domain and unsupervised data from the target domain for the Voxforge dataset, taking the American and British accents as two different acoustic domains.
Table 3 above shows the WER, in percent, for acoustic models trained on supervised data from the source domain and unsupervised data from the target domain for the Voxforge dataset, taking the American and British accents as two different acoustic domains, for MFCC features. Rows 1 and 2 in Table 3 are the WER values for the acoustic model trained on labeled data from both domains and without any domain adaptation. These values can correspond to the lower limit of the WER for both domains. Rows 3 and 4 represent the effect of domain adaptation on the performance of the acoustic model on data from the source domain, which is American and British, respectively. The corresponding NN values are the WER for the acoustic model trained on labeled data from the same domain only. Rows 5 and 6 show the WER for target domain data on the un-adapted and adapted acoustic models.
Table 4 below shows further results of the system of the present disclosure.
The following discussion expresses performance in terms of absolute increases or decreases in error rate with respect to the baseline models. With reference to Table 1, the acoustic variability due to speaker gender is evident from the 12.57% increase in PER for the acoustic model trained on male speech and tested on both male and female speech, as shown in rows 3 and 4 of Table 1 under the NN column. The domain adapted acoustic model, which is trained on labeled male speech as the source domain and unlabeled female speech as the target domain, performs better than the un-adapted model, as shown in the last row of Table 1. Domain adaptation using adversarial training thus succeeded in learning gender invariant features, which leads to a significant improvement over the acoustic model trained on the male speech only. In some cases, learning domain invariant features may come at the cost of domain specific features. Good performance for the female speech can be achieved when the labeled female speech is used alongside the labeled male speech to train the acoustic model. Speaker accent can also be a major source of acoustic variability in the speech signal. This is evident in the degradation in performance of the source-only acoustic model on the target domain as compared to its performance on the source domain. The degradation is 16.61% for the American accent only acoustic model and 4.96% for the British accent only acoustic model, as shown in Table 3. The corresponding accent adapted acoustic models see an improvement for the American and British target domains, respectively. In some cases, a loss of domain specific features during domain adversarial training can impact the results. Moreover, the best performance on the target domain is achieved for the acoustic model trained on labeled data from both domains.
The foregoing tests and results show that unsupervised learning of domain invariant features directly from raw speech using domain adversarial neural networks is an effective approach to automatic speech recognition. As can be seen in the tables above, the domain adapted acoustic models outperform the corresponding un-adapted baseline models in both the gender and accent adaptation experiments.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art may make variations and modifications without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure. What is intended to be protected by Letters Patent is set forth in the following claims.
This application claims the benefit of U.S. Provisional Patent Application No. 62/659,584, filed on Apr. 18, 2018, the entire disclosure of which is expressly incorporated herein by reference.