The present disclosure relates to methods and devices for recognizing spoken keywords in acoustic signals. The invention describes a low-power system that can be used to recognize one or more spoken keywords in a continuous audio stream.
One application of keyword spotting is as a wake word, keyword, or trigger word for hands-free operation of voice interface devices such as smart speakers and smart assistants. In such scenarios, the user speaks a predefined keyword to “wake up” the device before speaking a complete command or query to the device.
Large vocabulary speech recognition is a compute-intensive task, whereas a low-resource keyword spotting algorithm allows the device to operate at low power by using a simpler model that only detects whether a phrase or small set of phrases is spoken. Once a wake word has been detected, the more complex large vocabulary model is used to decode the user query which follows.
Prior art technologies have proposed keyword spotting models with a variety of architectures such as recurrent neural networks (RNNs) combined with convolution layers, or Grid-LSTM RNNs capable of learning sequences in both the time and frequency dimensions. However, these architectures have high computational complexity and require a large amount of training data to work well.
Many of the new smart devices with a voice user interface use small microprocessors, and many are even battery powered. Accordingly, systems and methods for keyword spotting with a small computational footprint and low power requirement remain highly desirable.
In accordance with an aspect of the present disclosure there is provided a method for keyword spotting comprising: obtaining an acoustic signal comprising speech; providing an acoustic signal representation of the acoustic signal to a neural network; and predicting from the neural network a presence of at least one of a plurality of keywords or absence of any of the plurality of keywords in the acoustic signal.
In a further aspect of the method, the acoustic signal representation comprises a feature domain representation obtained by preprocessing the acoustic signal.
In a further aspect of the method, the feature domain representation comprises one of log-Mel filterbank (FBANK), Mel-frequency cepstral coefficients (MFCC), and Perceptual Linear Prediction (PLP).
In a further aspect of the method, the acoustic signal representation is a waveform representation.
In a further aspect of the method, the neural network is a time delayed neural network (TDNN) that produces a sequence of keyword posteriors.
In a further aspect of the method, smoothing is applied to the keyword posteriors.
In a further aspect of the method, predicting the presence or absence of keywords comprises determining if a posterior value for any of the plurality of keywords exceeds a threshold value, and if the posterior value of a respective keyword exceeds the threshold value predicting the presence of the respective keyword in the audio signal.
In a further aspect of the method, a plurality of different threshold values are used for the plurality of keywords.
In a further aspect of the method, the TDNN uses one or more sets of layers to learn phone and keyword targets.
In a further aspect of the method, a first set of layers is initialized by using transfer learning on a related large vocabulary speech recognition task.
In a further aspect of the method, a method for reducing a number of multiplications using dynamic programming is used.
In a further aspect of the method, a total number of multiplications is reduced using frame skipping.
In a further aspect of the method, a voice activity detection (VAD) system is used to minimize computation by the TDNN network, wherein the VAD system only sends the audio signal representation to the TDNN when speech is detected in the background.
In a further aspect, the method further comprises recording the user query which follows keyword detection for further decoding.
In a further aspect of the method, the start and end times of the keyword are found in the audio stream.
In a further aspect of the method, a second neural network is used for second stage decoding, comprising one or more of: a bidirectional GRU RNN model to produce a phone posteriorgram; a histogram of acoustic correlations (HAC) to produce a fixed-length vector from the phone posteriorgram; and a fully-connected network to produce keyword probabilities from the fixed-length vector.
In a further aspect of the method, training data for the neural network is produced by concatenating recordings of commands and user queries at different volume levels and mixing with different types of noises.
In a further aspect of the method, unrelated conversational data is included in the training data.
In a further aspect of the method, upon predicting from the neural network a presence of at least one of the plurality of keywords in the acoustic signal by a first processing core, a second processing core is awoken from a sleep state to perform further processing on the acoustic signal.
In a further aspect of the method, the second processing core verifies the presence of at least one of the plurality of keywords in the acoustic signal before performing further processing of the acoustic signal to determine one or more commands within the acoustic signal.
In a further aspect of the method, the first core is a low-power core and the second core is a high-power core.
In accordance with another aspect of the present disclosure there is further provided a system comprising: a microphone; a memory storing instructions; and a processor coupled to the microphone and memory, the processor executing the instructions, which when executed configure the system to: obtain an acoustic signal comprising speech; provide an acoustic signal representation of the acoustic signal to a neural network; and predict from the neural network a presence of at least one of a plurality of keywords or absence of any of the plurality of keywords in the acoustic signal.
In a further aspect of the system, the acoustic signal representation comprises a feature domain representation obtained by preprocessing the acoustic signal.
In a further aspect of the system, the feature domain representation comprises one of log-Mel filterbank (FBANK), Mel-frequency cepstral coefficients (MFCC), and Perceptual Linear Prediction (PLP).
In a further aspect of the system, the acoustic signal representation is a waveform representation.
In a further aspect of the system, the neural network is a time delayed neural network (TDNN) that produces a sequence of keyword posteriors.
In a further aspect of the system, smoothing is applied to the keyword posteriors.
In a further aspect of the system, predicting the presence or absence of keywords comprises determining if a posterior value for any of the plurality of keywords exceeds a threshold value, and if the posterior value of a respective keyword exceeds the threshold value predicting the presence of the respective keyword in the audio signal.
In a further aspect of the system, a plurality of different threshold values are used for the plurality of keywords.
In a further aspect of the system, the TDNN uses one or more sets of layers to learn phone and keyword targets.
In a further aspect of the system, a first set of layers is initialized by using transfer learning on a related large vocabulary speech recognition task.
In a further aspect of the system, a method for reducing a number of multiplications using dynamic programming is used.
In a further aspect of the system, a total number of multiplications is reduced using frame skipping.
In a further aspect of the system, a voice activity detection (VAD) system is used to minimize computation by the TDNN network, wherein the VAD system only sends the audio signal representation to the TDNN when speech is detected in the background.
In a further aspect of the system, the instructions, when executed, further configure the system to record the user query which follows keyword detection for further decoding.
In a further aspect of the system, the start and end times of the keyword are found in the audio stream.
In a further aspect of the system, a second neural network is used for second stage decoding, comprising one or more of: a bidirectional GRU RNN model to produce a phone posteriorgram; a histogram of acoustic correlations (HAC) to produce a fixed-length vector from the phone posteriorgram; and a fully-connected network to produce keyword probabilities from the fixed-length vector.
In a further aspect of the system, training data for the neural network is produced by concatenating recordings of commands and user queries at different volume levels and mixing with different types of noises.
In a further aspect of the system, unrelated conversational data is included in the training data.
In a further aspect of the system, the processor further comprises a first core and a second core, wherein the first core is a low-power processing core and the second core is a high-power processing core, and when the first core determines the presence of at least one of a plurality of keywords in the acoustic signal, the acoustic signal is provided to the second core for further processing.
In a further aspect of the system, the further processing comprises performing keyword verification.
In a further aspect of the system, the processor operates in a lower power state until the presence of at least one of a plurality of keywords in the acoustic signal is detected, and then transitions to a high power state for performing further processing of the acoustic signal.
Further features and advantages of the present disclosure will become apparent from the following detailed description, taken in combination with the appended drawings, in which:
It will be noted that throughout the appended drawings, like features are identified by like reference numerals.
In accordance with the present disclosure there is provided a method for keyword spotting comprising: obtaining an acoustic signal comprising speech; providing an acoustic signal representation of the acoustic signal to a neural network; and predicting from the neural network a presence of at least one of a plurality of keywords or absence of any of the plurality of keywords in the acoustic signal.
In a further embodiment of the method, the acoustic signal representation comprises a feature domain representation obtained by preprocessing the acoustic signal.
In a further embodiment of the method, the feature domain representation comprises one of log-Mel filterbank (FBANK), Mel-frequency cepstral coefficients (MFCC), and Perceptual Linear Prediction (PLP).
In a further embodiment of the method, the acoustic signal representation is a waveform representation.
In a further embodiment of the method, the neural network is a time delayed neural network (TDNN) that produces a sequence of keyword posteriors.
In a further embodiment of the method, smoothing is applied to the keyword posteriors.
In a further embodiment of the method, predicting the presence or absence of keywords comprises determining if a posterior value for any of the plurality of keywords exceeds a threshold value, and if the posterior value of a respective keyword exceeds the threshold value predicting the presence of the respective keyword in the audio signal.
In a further embodiment of the method, a plurality of different threshold values are used for the plurality of keywords.
In a further embodiment of the method, the TDNN uses one or more sets of layers to learn phone and keyword targets.
In a further embodiment of the method, a first set of layers is initialized by using transfer learning on a related large vocabulary speech recognition task.
In a further embodiment of the method, a method for reducing a number of multiplications using dynamic programming is used.
In a further embodiment of the method, a total number of multiplications is reduced using frame skipping.
In a further embodiment of the method, a voice activity detection (VAD) system is used to minimize computation by the TDNN network, wherein the VAD system only sends the audio signal representation to the TDNN when speech is detected in the background.
In a further embodiment, the method further comprises recording the user query which follows keyword detection for further decoding.
In a further embodiment of the method, the start and end times of the keyword are found in the audio stream.
In a further embodiment of the method, a second neural network is used for second stage decoding, comprising one or more of: a bidirectional GRU RNN model to produce a phone posteriorgram; a histogram of acoustic correlations (HAC) to produce a fixed-length vector from the phone posteriorgram; and a fully-connected network to produce keyword probabilities from the fixed-length vector.
In a further embodiment of the method, training data for the neural network is produced by concatenating recordings of commands and user queries at different volume levels and mixing with different types of noises.
In a further embodiment of the method, unrelated conversational data is included in the training data.
In a further embodiment of the method, upon predicting from the neural network a presence of at least one of the plurality of keywords in the acoustic signal by a first processing core, a second processing core is awoken from a sleep state to perform further processing on the acoustic signal.
In a further embodiment of the method, the second processing core verifies the presence of at least one of the plurality of keywords in the acoustic signal before performing further processing of the acoustic signal to determine one or more commands within the acoustic signal.
In a further embodiment of the method, the first core is a low-power core and the second core is a high-power core.
In accordance with the present disclosure there is further provided a system comprising: a microphone; a memory storing instructions; and a processor coupled to the microphone and memory, the processor executing the instructions, which when executed configure the system to: obtain an acoustic signal comprising speech; provide an acoustic signal representation of the acoustic signal to a neural network; and predict from the neural network a presence of at least one of a plurality of keywords or absence of any of the plurality of keywords in the acoustic signal.
In a further embodiment of the system, the acoustic signal representation comprises a feature domain representation obtained by preprocessing the acoustic signal.
In a further embodiment of the system, the feature domain representation comprises one of log-Mel filterbank (FBANK), Mel-frequency cepstral coefficients (MFCC), and Perceptual Linear Prediction (PLP).
In a further embodiment of the system, the acoustic signal representation is a waveform representation.
In a further embodiment of the system, the neural network is a time delayed neural network (TDNN) that produces a sequence of keyword posteriors.
In a further embodiment of the system, smoothing is applied to the keyword posteriors.
In a further embodiment of the system, predicting the presence or absence of keywords comprises determining if a posterior value for any of the plurality of keywords exceeds a threshold value, and if the posterior value of a respective keyword exceeds the threshold value predicting the presence of the respective keyword in the audio signal.
In a further embodiment of the system, a plurality of different threshold values are used for the plurality of keywords.
In a further embodiment of the system, the TDNN uses one or more sets of layers to learn phone and keyword targets.
In a further embodiment of the system, a first set of layers is initialized by using transfer learning on a related large vocabulary speech recognition task.
In a further embodiment of the system, a method for reducing a number of multiplications using dynamic programming is used.
In a further embodiment of the system, a total number of multiplications is reduced using frame skipping.
In a further embodiment of the system, a voice activity detection (VAD) system is used to minimize computation by the TDNN network, wherein the VAD system only sends the audio signal representation to the TDNN when speech is detected in the background.
In a further embodiment of the system, the instructions, when executed, further configure the system to record the user query which follows keyword detection for further decoding.
In a further embodiment of the system, the start and end times of the keyword are found in the audio stream.
In a further embodiment of the system, a second neural network is used for second stage decoding, comprising one or more of: a bidirectional GRU RNN model to produce a phone posteriorgram; a histogram of acoustic correlations (HAC) to produce a fixed-length vector from the phone posteriorgram; and a fully-connected network to produce keyword probabilities from the fixed-length vector.
In a further embodiment of the system, training data for the neural network is produced by concatenating recordings of commands and user queries at different volume levels and mixing with different types of noises.
In a further embodiment of the system, unrelated conversational data is included in the training data.
In a further embodiment of the system, the processor further comprises a first core and a second core, wherein the first core is a low-power processing core and the second core is a high-power processing core, and when the first core determines the presence of at least one of a plurality of keywords in the acoustic signal, the acoustic signal is provided to the second core for further processing.
In a further embodiment of the system, the further processing comprises performing keyword verification.
In a further embodiment of the system, the processor operates in a lower power state until the presence of at least one of a plurality of keywords in the acoustic signal is detected, and then transitions to a high power state for performing further processing of the acoustic signal.
Embodiments are described below, by way of example only, with reference to
Prior art technologies have used time-delay neural networks for keyword spotting. For example, work in Ming Sun et al., “Compressed time delay neural network for small-footprint keyword spotting,” Interspeech, pp. 3607-3611, 2017, uses a time-delay neural network combined with a hidden Markov model (HMM) for recognizing a keyword such as “Alexa”. Bottleneck layers based on singular value decomposition (SVD) have also been used to reduce the model size. Such methods require keyword training data with phone labels in order to work. The system and method described herein may perform low-powered keyword spotting using a multi-stage time-delay neural network architecture that does not require a separate HMM model or phone-labeled keyword training data.
In a time-delay neural network, different layers or sets of layers act on different time scales. Lower layers look at smaller time scales and produce higher level features with smaller dimensions to be sent to higher layers. This allows the architecture to look at a large time window while reducing the amount of computation required. During training, the input features are repeatedly shifted in time and fed to the model. This introduces time-shift invariance and allows the model to operate on a sequence of any duration.
There are several factors to be considered when designing an effective keyword detection system. Both false positives and false negatives must be kept at a very low rate to provide an acceptable user experience. The amount of computation required by the model should be minimized in order to reduce power drain. Latency must also be kept low to keep the user interface responsive. A neural network architecture is disclosed which reduces the number of computations while maintaining an acceptable level of accuracy.
Referring to
In this example implementation, the input audio data is transformed into the frequency domain and frequency-band features are extracted from the audio for the feature windows 100. The filterbank features are normalized so that they have approximately zero mean and unit variance.
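The normalization step can be illustrated with a minimal sketch. It assumes that the statistics are computed per feature dimension over the frames passed in; the text does not say whether the statistics are per utterance or global, and the function name `normalize_features` and the `eps` guard are hypothetical.

```python
import math


def normalize_features(frames, eps=1e-8):
    """Scale each feature dimension to roughly zero mean and unit variance.

    `frames` is a list of per-frame feature vectors (for example filterbank
    energies). Computing the statistics over the given frames is an
    assumption of this sketch, not something the text specifies.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    # eps avoids division by zero for a constant feature dimension.
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dims)]
        for f in frames
    ]


frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = normalize_features(frames)
```

In a deployed system the means and standard deviations would more likely be estimated once on training data and reused at inference time, which keeps the per-frame cost to one subtraction and one multiplication per feature.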
The phone-NN 101 outputs a vector which represents a posterior probability distribution over different phones 102. These phone posteriors are then used as input for the next set of layers. In an example implementation, 42 posteriors were used—3 representing silence or noise and 39 representing different phones.
The phone-NN 101 looks at a context large enough to fit a typical phone or tri-phone. In an example implementation, a context of 5 frames to the left or in the past and 5 frames to the right or in the future, for a total context of 11 frames, is provided in the fully connected layers 202 as shown in
In an example implementation, the phone posteriors 102 are max-pooled along the time axis to reduce the total number of weights to be sent to the next layers 103, reduce calculations, improve training performance, and reduce overfitting. Alternatively, striding along the time axis could be done to achieve the same effect, which is discussed in a later section.
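As a rough illustration of max-pooling along the time axis, the sketch below pools a sequence of phone posterior vectors with non-overlapping windows. The pool size and the choice to drop trailing frames that do not fill a window are assumptions; the text does not give these details.

```python
def max_pool_time(posteriors, pool_size):
    """Max-pool a [time][phone] sequence along the time axis.

    Uses non-overlapping windows of `pool_size` frames and drops any
    trailing frames that do not fill a whole window (both are assumptions
    made for this sketch).
    """
    pooled = []
    for start in range(0, len(posteriors) - pool_size + 1, pool_size):
        window = posteriors[start:start + pool_size]
        # zip(*window) groups the values of each phone across the window.
        pooled.append([max(column) for column in zip(*window)])
    return pooled


posteriors = [[0.1, 0.9], [0.7, 0.3], [0.2, 0.8], [0.6, 0.4]]
pooled = max_pool_time(posteriors, pool_size=2)
```

With a pool size of 2, the four input frames become two output frames, so the next set of layers sees half as many vectors for the same time span.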
The second set of layers, the word-NN 103, acts as a keyword classifier. It takes the output of the first set 102 as input and outputs the probability of spotting one of the possible keywords at each point in time. The word-NN 103 is a neural network. In an example implementation, the word-NN 103 contains one fully-connected hidden layer with 64 neurons. The output layer may have one neuron for each keyword to be spotted as well as a neuron for background/filler speech.
The word-NN 103 looks at a context large enough to fit an entire wake word. To reduce latency, a large left context and smaller right context can be used. In an implementation, a size of 115 frames in the past and 5 frames in the future was used.
Combined with the context from the phone-NN, this enabled the TDNN to look at a window covering 1215 ms in time.
This window is shifted in time across the input features producing a sequence of posterior probabilities for the wake word detection.
Softmax 104 is utilized to convert the elements of an arbitrary vector into probabilities. A threshold is applied to these probabilities, and keyword detection 107 is triggered when the probability of one of the keywords goes above the threshold. Softmax assigns a probability to each class in a multi-class problem, and these probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would.
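The softmax and thresholding step can be sketched as follows. The label names, the per-keyword threshold dictionary, and the rule of returning the highest-probability keyword above its threshold are illustrative assumptions rather than details taken from the disclosure.

```python
import math


def softmax(logits):
    """Convert an arbitrary vector into probabilities that sum to 1.0."""
    peak = max(logits)  # subtracting the maximum improves numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def detect_keyword(logits, labels, thresholds, filler="filler"):
    """Return (label, probability) for the best keyword above its threshold.

    Returns None when no keyword probability exceeds its threshold. The
    filler/background class never triggers a detection.
    """
    best = None
    for label, prob in zip(labels, softmax(logits)):
        if label == filler:
            continue
        if prob > thresholds[label] and (best is None or prob > best[1]):
            best = (label, prob)
    return best


labels = ["filler", "keyword_a", "keyword_b"]  # hypothetical output classes
thresholds = {"keyword_a": 0.8, "keyword_b": 0.8}  # hypothetical values
hit = detect_keyword([0.0, 4.0, 0.0], labels, thresholds)
miss = detect_keyword([2.0, 0.0, 0.0], labels, thresholds)
```

Keeping the thresholds in a dictionary makes it straightforward to use a different threshold for each keyword, as contemplated in the summary above.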
In the network architecture shown in
Transfer learning is a method for initializing weights by first training the network on a larger corpus for a related task and then using some of the layers of this network to train on the main task. This allows the network to build upon the learning from the larger amount of data of the related task and is particularly useful for scenarios where only a limited amount of data may be available for the main task. Transfer learning and multi-task learning are common practices in keyword spotting because typical keyword spotting tasks have a limited amount of data available. This also helps reduce overfitting.
The lower levels of the TDNN, in this case, the phone-NN 101, only look at small patches of the input data 200. For every incoming patch or speech frame, the phone-NN processes the input using one or more of a fully connected neural network, a convolutional neural network or a recurrent neural network such as 102 in
Preparation of the data is an important step in training the system to work well. In order for the system to work in many different environments, the data used for training should have a similar statistical distribution and similar physical characteristics to the data encountered where the keyword spotting is to be deployed.
In one training method, the following method of artificially creating data was used.
The data available included:
In one implementation, in order to simulate an actual use case where the user is performing a voice query, the keyword and command audio recordings are trimmed of silence and stitched together to create long recordings in the form keyword+command+pause+keyword+command+pause+etc. However, in another implementation such concatenation of data was not used.
The amplitude of the keywords and commands is randomly varied to simulate audio of different loudness. Furthermore, the resultant audios are then mixed with three kinds of noise, namely street, babble, and music, at an average of 10 dB SNR. In addition, clean data is also used.
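A minimal sketch of the loudness variation and noise mixing is given below. It assumes equal-length sample lists and a fixed target SNR; the text mixes at an average of 10 dB, so a real pipeline would presumably draw the SNR and the gain at random. The names `mean_power` and `mix_at_snr` are hypothetical.

```python
import math


def mean_power(samples):
    """Average power of a list of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db, gain=1.0):
    """Scale speech by `gain`, then add noise scaled to reach `snr_db`.

    Assumes `speech` and `noise` have the same length and that the noise
    is not silent. In practice `gain` and `snr_db` would be drawn at
    random to vary loudness and noise level across training examples.
    """
    scaled_speech = [gain * s for s in speech]
    target_noise_power = mean_power(scaled_speech) / (10 ** (snr_db / 10))
    noise_scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + noise_scale * n for s, n in zip(scaled_speech, noise)]


speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```

The noise is scaled rather than the speech so that the loudness variation applied through `gain` is preserved in the mixed signal.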
In addition to these generated command audios, the long conversation data is added to provide more variation in data. This helps reduce the false alarm rate and is intended to simulate the situation of background chatter to which the system should not respond. Since these conversational audios sometimes already contained background noise or music, no extra noise is added to them.
The exact position of the keyword in the training audio files may be unknown. To resolve this issue, the TDNN is applied during training at different positions in the audio. The audio window which generates the maximum keyword probability is used for gradient backpropagation. This is implemented by using a max pooling layer after the Softmax layer. The max pooling layer is removed before creating the final inference model.
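The effect of the max pooling layer during training can be sketched as choosing the window with the highest keyword probability and computing the loss on that window only. The use of a negative log-likelihood loss here is an assumption for illustration; the text does not name the loss function.

```python
import math


def max_pooled_loss(window_keyword_probs):
    """Pick the window with the highest keyword probability.

    Returns (index, loss), where loss is the negative log-likelihood of
    the keyword at that window. Only this window would contribute to the
    gradient for a positive training example.
    """
    best = max(range(len(window_keyword_probs)),
               key=lambda i: window_keyword_probs[i])
    return best, -math.log(window_keyword_probs[best])


# Keyword probabilities from applying the TDNN at three window positions.
index, loss = max_pooled_loss([0.1, 0.5, 0.25])
```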
The computation required by the keyword spotting model may be further reduced by skipping frames during inference. Since the region of interest, where the keyword is spoken, spans several frames, it is reasonable to assume that the TDNN output posteriors would only change smoothly between frames. Frame skipping achieves large reductions in computation by taking advantage of this assumption.
In an example implementation, both the phone-NN and word-NN are strided with a step size of 4 input frames, which was chosen after experimentation with different step sizes. As a result, inference is performed every 40 ms.
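Frame skipping can be sketched as evaluating the model only at every fourth window position. The 10 ms frame shift below is inferred from the statement that a stride of 4 frames gives inference every 40 ms; the `score_window` callable stands in for the network and is hypothetical.

```python
FRAME_SHIFT_MS = 10  # inferred: a 4-frame stride corresponds to 40 ms


def strided_posteriors(features, score_window, window_len, stride=4):
    """Score a sliding window only every `stride` frames.

    Returns a list of (start_time_ms, score) pairs. With stride=1 every
    window position is scored; with stride=4 roughly a quarter are.
    """
    results = []
    for start in range(0, len(features) - window_len + 1, stride):
        window = features[start:start + window_len]
        results.append((start * FRAME_SHIFT_MS, score_window(window)))
    return results


features = list(range(20))  # stand-in for 20 feature frames
skipped = strided_posteriors(features, sum, window_len=8, stride=4)
dense = strided_posteriors(features, sum, window_len=8, stride=1)
```

In this toy example the dense pass scores 13 windows and the strided pass scores 4, which shows where the reduction in computation comes from.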
The description above covers a complete keyword spotting system for one or multiple keywords. However, the accuracy of such systems is often limited because the models have to be small and because of the limitations of single neural networks. To address these issues, some prior art technologies have employed multi-stage keyword spotting models. In wakeword-related embodiments of these systems, a smaller, less accurate model is used as a first low-power stage to detect keywords/wakewords. When the first model detects a wake word candidate, the corresponding audio data, possibly with audio preceding and following the keyword audio, is sent to a second, larger and more accurate model. The keyword detection system fires only when both models indicate that the keyword is present. The second model reduces the false alarm rate while not increasing power requirements substantially, since it is only occasionally invoked. In many prior art systems, the second-stage model is run in the cloud. However, as described further below, both stages may be run on the device. The first stage and the second stage may be performed by the same processor, or a lower-powered processor may be used to perform the first stage keyword spotting and a second, higher-powered processor may be used to perform the second stage keyword spotting.
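The two-stage decision logic can be sketched as below. The two model callables and the threshold values are placeholders; the point of the sketch is only that the second, more expensive model runs when, and only when, the first stage reports a candidate.

```python
def two_stage_detect(audio, first_stage, second_stage,
                     first_threshold, second_threshold):
    """Fire only when both stages indicate the keyword is present.

    `first_stage` and `second_stage` are callables returning a keyword
    probability for the audio segment. The second stage is skipped
    entirely when the first stage does not exceed its threshold.
    """
    if first_stage(audio) <= first_threshold:
        return False
    return second_stage(audio) > second_threshold


second_stage_calls = []


def cheap_model(audio):
    # Hypothetical stand-in for the small first-stage model.
    return audio["p1"]


def costly_model(audio):
    # Hypothetical stand-in for the larger second-stage model.
    second_stage_calls.append(audio)
    return audio["p2"]


rejected_early = two_stage_detect({"p1": 0.2, "p2": 0.9},
                                  cheap_model, costly_model, 0.5, 0.5)
calls_after_reject = len(second_stage_calls)
confirmed = two_stage_detect({"p1": 0.9, "p2": 0.9},
                             cheap_model, costly_model, 0.5, 0.5)
vetoed = two_stage_detect({"p1": 0.9, "p2": 0.1},
                          cheap_model, costly_model, 0.5, 0.5)
```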
As depicted in
As in the first stage, the acoustic model of the second stage comprises a neural network that outputs a vector at each time step which represents a posterior probability distribution over different phones or phonemes. In an implementation, this is a bidirectional GRU RNN with 3 layers, each containing 128 hidden units 301. Other implementations of this acoustic neural network are possible, such as a fully connected network, a convolutional network, a recurrent network with LSTM units, an auto-encoder network, or a combination thereof. The output of this network is a sequence of phoneme probability vectors, also known as a phone posteriorgram 302.
The phone posteriorgram is provided to an HAC. One example implementation of HAC is described in Jort F. Gemmeke, “The self-taught vocal interface,” 2014, pp. 21-22, doi: 10.1109/HSCMA.2014.6843243, incorporated herein by reference. The HAC produces a fixed-length vector representing the phonetic content of the utterance from the variable-length posteriorgram 303. This vector represents the probability of each pair of phonemes occurring within a given delay of each other. The size of the HAC vector is given by d × p², where d is the number of delays used and p is the number of phones. In an implementation, 4 delays are used with 42 phones, resulting in a vector size of 4 × 42² = 7056. The delays used are 20, 50, 90, and 200 ms.
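A simplified HAC computation is sketched below to show where the vector size comes from. For each delay it accumulates the product of phone posteriors at time t and at time t plus the delay, giving one p × p block per delay. The conversion of the delays to frames assumes a 10 ms frame shift, and the cited reference may apply normalization that this sketch omits.

```python
def hac_vector(posteriorgram, delays_in_frames):
    """Build a fixed-length co-occurrence vector from a posteriorgram.

    `posteriorgram` is a [time][phone] list of probabilities. For each
    delay, entry (i, j) accumulates P(phone i at t) * P(phone j at t + delay)
    over all valid t. The result has len(delays) * p * p entries.
    """
    num_phones = len(posteriorgram[0])
    hac = []
    for delay in delays_in_frames:
        block = [0.0] * (num_phones * num_phones)
        for t in range(len(posteriorgram) - delay):
            now = posteriorgram[t]
            later = posteriorgram[t + delay]
            for i, p_i in enumerate(now):
                for j, p_j in enumerate(later):
                    block[i * num_phones + j] += p_i * p_j
        hac.extend(block)
    return hac


# Delays of 20, 50, 90 and 200 ms, assuming a 10 ms frame shift.
DELAY_FRAMES = [ms // 10 for ms in (20, 50, 90, 200)]

# Tiny example: 2 phones, 3 frames, delays of 1 and 2 frames.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_hac = hac_vector(toy, [1, 2])
```

Because the output length depends only on the number of delays and phones, utterances of any duration map to vectors of the same size, which is what allows a fully-connected network to consume them.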
The semantic model is another neural network or related model that takes a posteriorgram as input and outputs the probability of each keyword being in the given utterance. In an implementation, this is a fully-connected neural network with one hidden layer containing 128 hidden units 304. Other models such as auto-encoder, RNN, CNN, or a combination thereof can also be used. Compressed or sparse models can be used to further reduce the computational footprint.
A Softmax layer 305 is applied to the output of the semantic model to produce a probability of each keyword target 306. A threshold is applied and if one of the keyword probabilities exceeds the threshold, then the system indicates the keyword is detected.
The following provides a brief description and results of two experiments: (i) comparison against CNN and (ii) frame skipping. Table 1 provides a summary of each of the models discussed. The second and third columns of the table list the number of parameters and multiplications per second performed during inference for each model. The fourth and fifth columns present the experimentally determined false rejection rates (FRR) for each model on clean and noisy data respectively. All false rejection rates in this section are given for a fixed false alarm rate of 0.5 per hour. In addition, receiver operating characteristic (ROC) curves are plotted for both clean and noisy data.
For each model, the table shows the number of parameters, multiplications per second, and false reject rate in percent on clean data and 10 dB SNR noisy data. FRR values are for a false alarm rate of 0.5 FA/hr.
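One way to obtain a false rejection rate at a fixed false alarm rate is sketched below. It assumes one score per positive utterance and one score per candidate detection in the negative audio, with detection requiring a score strictly above the threshold. This is an illustrative procedure, not necessarily the evaluation procedure used for the reported results.

```python
def frr_at_fa_rate(pos_scores, neg_scores, neg_hours, max_fa_per_hour=0.5):
    """Return (threshold, FRR) at the given false alarm budget.

    The threshold is set at the highest negative score that must NOT fire
    so that at most `max_fa_per_hour * neg_hours` negatives exceed it.
    """
    allowed = int(max_fa_per_hour * neg_hours)
    ranked = sorted(neg_scores, reverse=True)
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    rejected = sum(1 for s in pos_scores if s <= threshold)
    return threshold, rejected / len(pos_scores)


# Toy data: four positive utterances, four negative candidates in 2 hours.
threshold, frr = frr_at_fa_rate(
    pos_scores=[0.95, 0.7, 0.5, 0.65],
    neg_scores=[0.9, 0.6, 0.3, 0.2],
    neg_hours=2.0,
)
```

In this toy case one false alarm is allowed over two hours, so the threshold lands on the second-highest negative score and one of the four positives is rejected.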
The fstride4 CNN keyword spotting system described in Tara N. Sainath and Carolina Parada, “Convolutional neural networks for small-footprint keyword spotting,” Interspeech, 2015, referred to further herein as [Sainath] is used as a baseline. Both the baseline CNN and the current TDNN models are trained on the same data that is described above. However, note that the current dataset is different than the one used in [Sainath]. Furthermore, the amount of training data used in the current experiments is also much smaller than the one used in [Sainath]. Therefore, the performance of the baseline CNN model presented herein differs from that given in [Sainath].
The resulting ROC curves for the baseline CNN, the proposed single-stage TDNN model, and the two-stage model are shown in
As described earlier, the low-powered keyword spotting system may also use frame skipping to further reduce the required computation without causing a large drop in accuracy. Experiments were performed on the single-stage model with strides of 4 for both the phone-NN and the word-NN. ROC curves for these experiments are depicted in
A method for reducing the number of multiplications using dynamic programming can be utilized. Alternatively, the total number of multiplications can be reduced by using frame skipping.
A voice activity detection (VAD) system can be used to minimize computation by the TDNN network, where such VAD system only sends audio data to the TDNN when speech is detected in the background. The user query which follows the keyword detection may be recorded for further decoding. Training data can be produced by concatenating recordings of commands and user queries at different volume levels and mixing with different types of noises. Further, unrelated conversational data can be included in training data.
It would be appreciated by one of ordinary skill in the art that the system and components shown in
Each element in the embodiments of the present disclosure may be implemented as hardware, software/program, or any combination thereof. Software code, either in its entirety or in part, may be stored in a computer readable medium or memory (e.g., as a ROM, for example a non-volatile memory such as flash memory, CD ROM, DVD ROM, Blu-Ray™, a semiconductor ROM, USB, or a magnetic recording medium, for example a hard disk). The program may be in the form of source code, object code, a code intermediate source and object code such as partially compiled form, or in any other form.
This application claims priority to United States Provisional Application No. 62/611,794, filed Dec. 29, 2017, the entirety of which is hereby incorporated by reference for all purposes.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CA2018/051681 | 12/28/2018 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62611794 | Dec 2017 | US |