As speech recognition technology has advanced, voice-activated devices have become increasingly popular and have found new applications. Today, an increasing number of mobile phones, in-home devices, and automobile devices include speech or voice recognition capabilities. Although the speech recognition modules incorporated into such devices are trained to recognize specific keywords, they tend to be unreliable, because keywords may be spoken in noisy environments, by more than one person, at the same time as other keywords, or with any combination of these complications. Unrecognized keywords can frustrate a speaker, and may cause the speaker to stop using voice commands and resort to manual controls.
The present disclosure is directed to systems and methods for detecting keywords in multi-speaker environments, substantially as shown in and/or described in connection with at least one of the figures, as set forth more completely in the claims.
The following description contains specific information pertaining to implementations in the present disclosure. The drawings in the present application and their accompanying detailed description are directed to merely exemplary implementations. Unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals. Moreover, the drawings and illustrations in the present application are generally not to scale, and are not intended to correspond to actual relative dimensions.
Device 110 uses microphone 105 to receive speech or voice commands from a user or a plurality of users, such as a first user and a second user playing a speech controlled video game. A/D converter 115 is configured to receive input speech 106 from microphone 105, and convert input speech 106, which is in analog form, to digitized speech 108, which is in digital form. As shown in
Keyword recognition application 140 is a computer algorithm for recognizing keywords in digitized speech 108. Keyword recognition application 140 includes probability distributions 141 for a plurality of keywords. Probability distributions 141 may include a plurality of probability distributions corresponding to a plurality of keywords. In some implementations, keyword recognition application 140 may learn the plurality of probability distributions corresponding to the plurality of keywords from a plurality of training instances of each keyword.
Keyword recognition application 140 also includes thresholds 143. Thresholds 143 may include a plurality of thresholds, where each threshold may correspond to a keyword of keywords 150. In some implementations, each threshold of thresholds 143 may be a fraction or a percentage, and may be used as a comparator for measuring a portion of a speech segment of digitized speech 108 that includes a keyword. In other implementations, each threshold of thresholds 143 may be a duration that may be used as a comparator for measuring the duration of a keyword in a speech segment of digitized speech 108. In some implementations, thresholds 143 may be based on the training instances of each keyword used to train probability distributions 141.
Keywords 150 include a plurality of keywords that keyword recognition application 140 may be able to recognize in digitized speech 108. In some implementations, keywords 150 may include two keywords, three keywords, or any number of keywords up to M keywords, M being an integer. In some implementations, each keyword of keywords 150 may have a corresponding action. For example, a keyword may be a command for a video game, so that the corresponding action is an action of a character in the video game, or the corresponding action may set a control in the video game or video game system.
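For illustration only, keywords 150, thresholds 143, and their associated actions may be pictured together as a simple registry. The Python sketch below uses hypothetical keywords, threshold values, and actions that are not drawn from the disclosure:

```python
# Hypothetical registry pairing each keyword of keywords 150 with a
# detection threshold (a fraction of a speech segment, per thresholds 143)
# and the action to execute when the keyword is detected.
KEYWORD_REGISTRY = {
    "go":   {"threshold": 0.30, "action": lambda: print("advance player")},
    "jump": {"threshold": 0.25, "action": lambda: print("avoid obstacle")},
}

def execute_action(keyword: str) -> None:
    """Execute the action associated with a recognized keyword."""
    entry = KEYWORD_REGISTRY.get(keyword)
    if entry is not None:
        entry["action"]()
```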
Peripheral component 195 may be a functional component that is part of device 110, or peripheral component 195 may be functionally connected to device 110. Peripheral component 195 may be suitable for executing an action associated with a keyword of keywords 150. For example, peripheral component 195 may be a component for changing the station to which a smart car radio is tuned, or for changing a listening mode of a smart car radio, such as changing from radio to auxiliary mode. Peripheral component 195 may change a temperature setting of an in-home smart thermostat, or change the mode of an in-home smart thermostat, such as from air conditioning to heat. Peripheral component 195 may include a heating element of a smart oven that is activated or deactivated when the oven is turned on or off.
In some implementations, peripheral component 195 may include a display suitable for displaying video content, such as a video game or an on-screen control menu of a video game console or video playback device. In some implementations, peripheral component 195 may be a television, a computer monitor, a display of a tablet computer, or a display of a mobile phone. Peripheral component 195 may be a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a liquid crystal display (LCD), a plasma display, a cathode ray tube (CRT), an electroluminescent display (ELD), or other display appropriate for viewing video content and/or video games.
In situations where spoken words 345a-345c are three distinct keywords, speech recognition application 140 may detect each keyword in speech segment 335 if the fraction of speech segment 335 corresponding to each keyword is greater than the threshold for each keyword. Accordingly, speech recognition application 140 may detect no keywords, one keyword, two keywords, or three keywords in speech segment 335. In some implementations, not shown in
At 430, speech recognition application 140 calculates a first probability of distribution of a first keyword in the first speech segment. Since there may be M keywords, there may be 2^M classes, representing the 2^M possible combinations of the M possible keywords. Speech recognition application 140 may represent each keyword event by a class, where a keyword event is any combination of keywords. For instance, if there are only two possible keywords (e.g., “Go” and “Jump”), speech recognition application 140 will include four classes, representing the events C1=“only Go was uttered”, C2=“only Jump was uttered”, C3=“Go and Jump were uttered simultaneously”, and C4=“neither word was uttered”. Speech recognition application 140 may learn the probability distribution P(X|Ci) from training instances of data from each class. For example, speech recognition application 140 may learn P(X|C3) from instances of recordings in which portions of a spoken “Go” and portions of a spoken “Jump” overlapped. These distributions may be mixture distributions, such as a mixture of distributions from the exponential family. The parameters of the distribution may be learned from the training data using any suitable algorithm.
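To make the 2^M class construction concrete, the sketch below enumerates all keyword combinations for M keywords and fits one distribution per class from labeled training recordings. The Gaussian mixture merely stands in for any suitable mixture of exponential-family distributions; the function names and the choice of scikit-learn are illustrative assumptions, not the disclosed implementation:

```python
from itertools import combinations

from sklearn.mixture import GaussianMixture

def enumerate_classes(keywords):
    """Return all 2^M keyword events, from "no keyword" to all M at once."""
    classes = []
    for r in range(len(keywords) + 1):
        classes.extend(frozenset(c) for c in combinations(keywords, r))
    return classes

def fit_class_distributions(training_data, n_components=8):
    """Learn P(X|Ci) from labeled training data.

    training_data maps each class (a frozenset of keywords) to an array of
    feature vectors of shape (n_frames, n_features), e.g. overlapping "Go"
    and "Jump" recordings for the class frozenset({"go", "jump"}).
    """
    return {c: GaussianMixture(n_components=n_components).fit(X)
            for c, X in training_data.items()}
```

For two keywords, enumerate_classes(["go", "jump"]) yields the four keyword events described above.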
In some implementations, keyword recognition application 140 may treat each speech segment of the plurality of segments such that the fraction α of any segment comprises the first keyword, and the remaining (1−α) comprises the background. Under this model, the probability distribution of the data within each speech segment of digitized speech 108 may be given by:
P(Xtest) = αP(X|Word) + (1 − α)P(X|Background)   (1)
where α represents the fraction of the segment that is taken up by the word. α is unknown and must be determined. In some implementations, speech recognition application 140 may do so using the maximum-likelihood estimator:
α = argmax_γ log(γP(Xtest|Word) + (1 − γ)P(Xtest|Background))   (2)
which determines α as the value of γ that results in the best “fit” of the overall distribution to the test data Xtest.
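As a minimal sketch of this estimator, assuming the segment Xtest has been converted to per-frame likelihoods under each class-conditional distribution (the per-vector treatment formalized below), a simple grid search over γ suffices; all names are illustrative:

```python
import numpy as np

def estimate_alpha_grid(pw, pb, n_grid=1001):
    """Grid-search estimate of the best-fit mixture weight (cf. Equation 2).

    pw, pb: per-frame likelihoods P(X|Word) and P(X|Background) for each
    feature vector of the test segment Xtest, shape (n_frames,).
    """
    gammas = np.linspace(1e-6, 1.0 - 1e-6, n_grid)
    # Total segment log-likelihood for every candidate gamma.
    ll = np.log(np.outer(gammas, pw) + np.outer(1.0 - gammas, pb)).sum(axis=1)
    return gammas[np.argmax(ll)]
```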
In some implementations, different regions of the speech segment may be drawn from different classes, such as when multiple keywords occur in the speech segment. Accordingly, each fraction of the speech segment may be considered separately, such that some fractions may belong to one class (Word or Background) and the rest to the other. Speech recognition application 140 may do so by assuming that every feature vector X in Xtest (which represents a segment with many feature vectors) may be drawn independently. Correspondingly, the class-conditional distributions of vectors, P(X|Word) and P(X|Background), representing respectively the distributions of feature vectors from audio segments that comprise only the keyword and audio segments that include no part of the keyword, are known, having been estimated from training data.
In order to generate Xtest, each vector in Xtest may be individually generated. To generate any individual vector, first the class may be selected, and subsequently the vector may be drawn from the class-conditional distribution. α may then be estimated to maximize log P(Xtest):

α = argmax_γ Σ_{X∈Xtest} log(γP(X|Word) + (1 − γ)P(X|Background))   (3)
Equation 3 may be optimized using any suitable algorithm, such as simple gradient ascent or expectation maximization (EM). The obtained α will represent the estimate of the fraction of the segment Xtest that is dominated by the target word. Equation 3 is a maximum-likelihood estimator, so the overall method is a maximum-likelihood classification algorithm for detecting keywords. The maximum-likelihood formulations P(X|Word) and P(X|Background) must capture the distributions of the data under the kinds of conditions encountered in application scenarios (e.g., ambient noise inside a specific building, outdoors, etc.).
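One way to optimize Equation 3 is expectation maximization under the per-vector independence assumption stated above. The sketch below takes the same per-frame likelihood arrays as the earlier grid search; it is an illustrative implementation, not the disclosed one:

```python
import numpy as np

def estimate_alpha_em(pw, pb, n_iters=50):
    """EM estimate of the fraction alpha of frames dominated by the keyword.

    pw, pb: arrays of shape (n_frames,) holding P(X|Word) and P(X|Background)
    for every feature vector X in Xtest.
    """
    alpha = 0.5                                   # neutral initialization
    for _ in range(n_iters):
        # E-step: posterior probability that each frame belongs to Word.
        resp = alpha * pw / (alpha * pw + (1.0 - alpha) * pb + 1e-300)
        # M-step: alpha becomes the mean responsibility across the segment.
        alpha = resp.mean()
    return alpha
```

Because the log-likelihood in Equation 3 is concave in α (a sum of logarithms of affine functions of α), each EM iteration cannot decrease it, and the estimate converges to the global maximum.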
Conventionally, such distributions have been modeled as mixture distributions of the form:

P(X|Class) = Σ_k w_k P(X|k,Class)

where k represents an index over mixture components, w_k represents the mixture weight of the k-th component, and P(X|k,Class) represents the individual component distributions of the mixture. The most common form for P(X|k,Class) in such applications has been a member of the exponential family of distributions, making P(X|Class) itself a mixture of exponential-family distributions. More generally, P(X|k,Class) may be any distribution that models the data well.
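For concreteness, a one-dimensional Gaussian mixture (Gaussians being members of the exponential family) evaluated exactly in the form above; the weights, means, and standard deviations are placeholder values:

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, weights, means, stds):
    """P(x|Class) = sum_k w_k * P(x|k,Class) with Gaussian components."""
    return sum(w * norm.pdf(x, m, s)
               for w, m, s in zip(weights, means, stds))

# Illustrative three-component mixture.
p = mixture_density(np.linspace(-3.0, 3.0, 7),
                    weights=[0.5, 0.3, 0.2],
                    means=[-1.0, 0.0, 1.5],
                    stds=[0.5, 1.0, 0.7])
```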
Speech recognition application 140 may specify the probability distribution of any vector X in a test segment as:

P(X) = Σ_C α_C P(X|C)   (4)
where the variable C takes one of the 2^M values representing every combination of keywords. Generalizing across the possible classes, each α_C represents the fraction of the segment Xtest that comprises feature vectors belonging to class C. For example, speech recognition application 140 may convert a speech segment to a feature vector sequence. In some implementations, speech recognition application 140 may model a plurality of keyword probability distributions from the feature vector sequence and a background probability distribution from the feature vector sequence, where each keyword probability distribution of the plurality of keyword probability distributions corresponds to a keyword of the plurality of keywords, and the background includes any portion of the speech segment that does not include a keyword. Speech recognition application 140 may learn all of the α_C values from Xtest by maximizing log P(Xtest).
Equation 4 may be optimized using any appropriate algorithm, such as gradient ascent or EM.
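The EM update for Equation 4 generalizes the two-class update sketched earlier to all 2^M classes. In the illustrative sketch below, lik holds P(X|C) evaluated for every frame and class:

```python
import numpy as np

def estimate_class_fractions(lik, n_iters=100):
    """EM estimate of the per-class fractions alpha_C in Equation 4.

    lik: array of shape (n_frames, n_classes) with P(X|C) for each feature
    vector X in Xtest and each of the 2^M classes C.
    """
    n_classes = lik.shape[1]
    alpha = np.full(n_classes, 1.0 / n_classes)   # start from uniform weights
    for _ in range(n_iters):
        # E-step: per-frame posterior over the classes.
        joint = alpha * lik
        resp = joint / (joint.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: each alpha_C is the mean responsibility for class C.
        alpha = resp.mean(axis=0)
    return alpha
```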
Any single keyword may appear in multiple classes. In some implementations, speech recognition application 140 may model the first speech segment as a combination of a plurality of keyword vectors and a plurality of background vectors. For instance, in a two-word example including the keywords “Go” and “Jump,” “Go” features both in C1 (Go only) and C3 (Go and Jump spoken together). Thus, the total fraction of Xtest that comprises “Go” must consider both classes, and will be given by α_Go = α_C1 + α_C3. Speech recognition application 140 may model a speech segment probability distribution as a mixture of the plurality of keyword probability distributions and the background probability distribution. In some implementations, speech recognition application 140 may estimate a plurality of keyword mixture weights corresponding to the plurality of keyword probability distributions and a background mixture weight corresponding to the background probability distribution using any maximum-likelihood technique.
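Continuing the sketch, the aggregation α_Go = α_C1 + α_C3 is a direct sum over every class that contains the keyword (classes represented as frozensets, as in the earlier illustrative examples):

```python
def keyword_fraction(alpha, classes, keyword):
    """Total fraction of Xtest attributed to one keyword across all classes."""
    return sum(a for a, c in zip(alpha, classes) if keyword in c)
```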
In some implementations, the first probability of distribution may be calculated by comparing the first speech segment with a probability distribution of the first keyword from probability distributions 141. Based on the probability distribution of the first keyword from probability distributions 141, keyword recognition application 140 may calculate a probability of the duration of the first keyword in the first speech segment. In some implementations, the ratio of the duration of the first keyword to the duration of the first speech segment may be the first probability of distribution.
At 440, speech recognition application 140 determines that a first fraction of the first speech segment includes the first keyword, in response to comparing the first probability of distribution with a first threshold associated with the first keyword. In some implementations, the first fraction may be a ratio of the duration of the first keyword, according to the first probability of distribution, to the duration of the first speech segment. In other implementations, the first fraction may be a ratio of the portion of the first speech segment determined to be the first keyword to the portion of the first speech segment that is background, where background includes all sound, including background noise and other words, that does not represent the first keyword. In some implementations, background may include keywords other than the first keyword. Speech recognition application 140 may equate each keyword mixture weight of the plurality of keyword mixture weights to a corresponding plurality of probabilities of each keyword of the plurality of keywords and to a corresponding plurality of fractions of the first speech segment that contain each keyword of the plurality of keywords. In some implementations, speech recognition application 140 may determine a first keyword probability and the first fraction of the speech segment including the first keyword based on the first keyword mixture weight.
In some implementations, speech recognition application 140 may compare α with a first threshold of thresholds 143. If α exceeds the first threshold, speech recognition application 140 may determine that the speech segment includes the keyword corresponding to the first threshold. In general, once α_Keyword is computed for all keywords, any keyword for which the corresponding α value exceeds a threshold may be considered to have been detected in the segment. The first threshold may be calibrated to obtain different operating points: a high value of the first threshold will result in conservative, high-precision classification, where the probability ratio must pass a high threshold for the instance to be classified as the first keyword. A high threshold ensures that when an instance is identified as the first keyword, it is identified with high confidence, at the cost of occasionally missing instances of the first keyword because the likelihood ratio does not exceed the threshold. On the other hand, a low value of the first threshold will result in high-recall classification, where instances of the first keyword will rarely be missed, but in exchange a larger fraction of data instances that are not the first keyword will also be classified as the first keyword.
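Detection then reduces to the comparison described above. A minimal sketch, with thresholds 143 represented as a hypothetical per-keyword mapping:

```python
def detect_keywords(fractions, thresholds):
    """Return every keyword whose estimated fraction exceeds its threshold.

    fractions: mapping of keyword -> alpha_Keyword for the segment.
    thresholds: mapping of keyword -> calibrated threshold; raise a threshold
    for high-precision operation, lower it for high-recall operation.
    """
    return [kw for kw, a in fractions.items() if a > thresholds[kw]]
```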
At 450, speech recognition application 140 calculates a second probability of distribution of a second keyword in the first speech segment. Then, at 460, speech recognition application 140 determines that a second fraction of the first speech segment includes the second keyword, in response to comparing the second probability of distribution with a second threshold associated with the second keyword. The second threshold may be calibrated for high-precision results or high-recall results. In some implementations, speech recognition application 140 may determine a second keyword probability and the second fraction of the speech segment including the second keyword based on the second keyword mixture weight.
At 470, speech recognition application 140 executes a first action associated with the first keyword if the first keyword is recognized. In some implementations, the first keyword may be a command for a game, such as a voice-controlled video game. When speech recognition application 140 recognizes the first keyword, speech recognition application 140 may execute the command. For example, the first keyword may be the command “Go,” which may be used to advance a player forward through a video game. When the first keyword “Go” is recognized, speech recognition application 140 may advance the player through the video game. In other implementations, system 100 may include a smart device, such as a smart car radio, a smart thermostat, or a smart oven. Accordingly, execution of the first action may include turning on the smart device, turning off the smart device, changing a setting of the smart device, programming the smart device, etc. Likewise, the second keyword may have an associated action.
At 480, speech recognition application 140 executes a second action associated with the second keyword if the second keyword is recognized. In some implementations, the second keyword may be a command for a game, such as a voice-controlled video game. When speech recognition application 140 recognizes the second keyword, speech recognition application 140 may execute the command. For example, the second keyword may be the command “Jump,” which may be used for a player to avoid hazards or move over obstacles in a video game. When the second keyword “Jump” is recognized, speech recognition application 140 may have the player's character in the game jump. In other implementations, system 100 may include a smart device, and execution of the second action may include turning on the smart device, turning off the smart device, changing a setting of the smart device, programming the smart device, etc.
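Stringing the illustrative sketches above together, a hypothetical end-to-end pass over one speech segment might proceed as follows; every name here refers to the earlier sketches and is an assumption rather than the disclosed implementation:

```python
import numpy as np

def process_segment(features, models, classes, thresholds, actions):
    """Detect keywords in one speech segment and execute their actions."""
    # Evaluate P(X|C) for every frame and class (per 430 and 450).
    lik = np.column_stack([np.exp(models[c].score_samples(features))
                           for c in classes])
    alpha = estimate_class_fractions(lik)
    # Per-keyword fractions and threshold comparisons (per 440 and 460).
    keywords = {kw for c in classes for kw in c}
    fractions = {kw: keyword_fraction(alpha, classes, kw) for kw in keywords}
    # Execute the action associated with each detected keyword (per 470, 480).
    for kw in detect_keywords(fractions, thresholds):
        actions[kw]()
```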
From the above description it is manifest that various techniques can be used for implementing the concepts described in the present application without departing from the scope of those concepts. Moreover, while the concepts have been described with specific reference to certain implementations, a person of ordinary skill in the art would recognize that changes can be made in form and detail without departing from the scope of those concepts. As such, the described implementations are to be considered in all respects as illustrative and not restrictive. It should also be understood that the present application is not limited to the particular implementations described above, but many rearrangements, modifications, and substitutions are possible without departing from the scope of the present disclosure.