The present invention relates to a method for automatic speech recognition. In particular the present invention relates to a method for recognizing a keyword from a spoken utterance.
A method for automatic speech recognition in which one or more keywords are recognized in a spoken utterance is often called keyword spotting. For each keyword to be recognized, a keyword model is trained and stored. Each keyword model is trained for either speaker-dependent or speaker-independent speech recognition and represents, for example, a word or a phrase. A keyword is spotted in the spoken utterance when the utterance itself, or a part thereof, matches best to one of the previously created and stored keyword models.
In recent years, such a method for speech recognition has often been used in mobile equipment, e.g. in mobile phones. With it, the mobile equipment can be partly or fully controlled with voice commands instead of the keyboard. The method is particularly useful in car hands-free equipment, where operating the mobile phone via the keyboard is forbidden. Here, the mobile phone is activated as soon as a keyword is detected in a spoken utterance of the user. The mobile phone then listens for a further spoken utterance and assesses a part thereof as the keyword to be recognized if that part matches best to one of the stored keyword models.
Depending on the acoustic environment in which the mobile equipment is used, or on the user's behaviour, e.g. the pronunciation, the keywords are recognized more or less correctly. For example, the assessment is wrong if the part of the spoken utterance is matched to one of the stored keywords that is not the keyword actually spoken. As a consequence, the hit rate, that is the number of correctly recognized keywords relative to the total number of spoken keywords, strongly depends on the acoustic environment and the user's behaviour.
Methods for automatic speech recognition known from the prior art often use so-called garbage models in addition to the keyword models [A new approach towards Keyword Spotting, Jean-Marc Boite, EUROSPEECH Berlin, 1993, pp. 1273-1276]. For this, a plurality of garbage models is created. Some garbage models represent non-keyword speech, for example lip smacks, breaths, or filler words such as “aeh” or “em”. Other garbage models are created to represent background noise. The garbage models are e.g. phonemes, phoneme cover classes, or complete words. By utilising these garbage models, the false alarm rate, that is the number of wrongly recognized keywords per time unit, is decreased, because parts of the spoken utterance that contain non-keyword speech can be mapped directly to one of the stored garbage models. However, when such a method is applied, the hit rate is also decreased, because a part of the spoken utterance might match one or more of the plurality of garbage models better than the keyword model itself. For example, if the acoustic environment is bad during the recognition phase, the part of the spoken utterance might match a garbage model that represents such an acoustic environment. As a result, that part is assessed as non-keyword speech, which is not the wanted result.
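The two figures of merit defined above can be written out as a short Python sketch. The function and parameter names are illustrative assumptions and are not taken from the cited prior art; the time unit for the false alarm rate is assumed to be hours.

```python
def hit_rate(correct_keywords: int, spoken_keywords: int) -> float:
    """Correctly recognized keywords relative to all spoken keywords."""
    if spoken_keywords <= 0:
        raise ValueError("at least one spoken keyword is required")
    return correct_keywords / spoken_keywords


def false_alarm_rate(wrong_keywords: int, duration_hours: float) -> float:
    """Wrongly recognized keywords per time unit (hours assumed here)."""
    if duration_hours <= 0:
        raise ValueError("duration must be positive")
    return wrong_keywords / duration_hours
```

With these definitions, adding garbage models typically lowers the false alarm rate while also lowering the hit rate, which is the trade-off addressed in the following.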
It is therefore the object of the present invention to provide a method for speech recognition, which increases the hit rate and avoids the disadvantages of the known prior art.
This is solved by the method of claim 1. According to the present invention, there is provided a method for recognizing a keyword from a spoken utterance, with at least one keyword model and a plurality of garbage models, wherein a part of the spoken utterance is assessed as a keyword to be recognized, if that part matches best either to the keyword model or to a garbage sequence model, and wherein the garbage sequence model is a series of consecutive garbage models from that plurality of garbage models.
Essentially, the method of the present invention also assesses a part of a spoken utterance as a keyword to be recognized when that part of the spoken utterance matches best to the garbage sequence model. As an advantage of the present invention, the hit rate is thereby increased, because two models, the keyword model and the garbage sequence model, are used to recognize the keyword from a spoken utterance. In the context of the present invention, a part of the spoken utterance is any time interval of an incoming utterance. The time interval can span the complete utterance or only a short section thereof.
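As an illustration only, this assessment rule can be sketched in Python as follows. The model names and the use of log-likelihood scores are assumptions made for the sketch; the description does not prescribe a particular scoring.

```python
from typing import Dict

# Models that stand for the keyword (names are hypothetical labels).
KEYWORD_LIKE = {"keyword_model", "garbage_sequence_model"}


def is_keyword(scores: Dict[str, float]) -> bool:
    """Return True if the best-matching model represents the keyword.

    scores maps a model name to an assumed log-likelihood of one part of
    the utterance under that model; a higher value is a better match.
    """
    best_model = max(scores, key=lambda name: scores[name])
    return best_model in KEYWORD_LIKE
```

For instance, with assumed scores of -42.0 for the keyword model, -39.5 for the garbage sequence model and -40.1 for the best other garbage path, the part is assessed as the keyword, because the garbage sequence model matches best.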
Advantageously, the method in accordance with the present invention prevents the hit rate from decreasing when garbage models exist which, in series, match the spoken utterance better than the keyword model itself. The present automatic speech recognition method is therefore more robust than speech recognition methods known from the prior art.
Preferably, the garbage sequence model is determined by comparing a keyword utterance, which represents the keyword to be recognized, with the plurality of garbage models, and detecting the series of consecutive garbage models that matches the keyword utterance best. In this way, the garbage sequence model is easily created from existing garbage models as already used in prior art speech recognition methods. Such a prior art method is e.g. based on a finite state syntax in which one or more keyword models and a plurality of garbage models are used to recognize keywords from any incoming utterance. According to the present invention, the garbage sequence model is then created with a finite state syntax that includes only the plurality of garbage models, but not the keyword models. The incoming utterance, which is the keyword utterance and represents the keyword, is compared with the plurality of stored garbage models. Then the series of consecutive garbage models from the plurality of garbage models that best represents the keyword is determined as the garbage sequence model. According to the present invention, this garbage sequence model is then used to recognize the keyword from a spoken utterance if a part of the spoken utterance matches best either to the keyword model or to that determined garbage sequence model.
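The determination of the garbage sequence model can be illustrated by a toy decoder in Python. This is a sketch under stated assumptions, not the decoder of the invention: each garbage model is reduced to a single state, the per-frame scores are hypothetical log-likelihoods, and the `switch_cost` parameter is an assumed penalty for changing from one garbage model to another.

```python
from typing import List, Sequence


def garbage_sequence(frame_scores: Sequence[Sequence[float]],
                     switch_cost: float = 1.0) -> List[int]:
    """Best path through a garbage-only loop, collapsed to a model series.

    frame_scores[t][g] is an assumed log-likelihood of frame t of the
    keyword utterance under garbage model g. switch_cost must be >= 0.
    """
    n_models = len(frame_scores[0])
    best = list(frame_scores[0])          # best score ending in each model
    back: List[List[int]] = []            # back-pointers per frame
    for frame in frame_scores[1:]:
        top = max(range(n_models), key=lambda g: best[g])
        new_best, pointers = [], []
        for g in range(n_models):
            stay = best[g]                   # remain in the same model
            switch = best[top] - switch_cost  # enter from the best model
            if stay >= switch:
                new_best.append(stay + frame[g])
                pointers.append(g)
            else:
                new_best.append(switch + frame[g])
                pointers.append(top)
        best = new_best
        back.append(pointers)
    # Trace back the best path, then merge repeated models into one entry.
    state = max(range(n_models), key=lambda g: best[g])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    series = [path[0]]
    for g in path[1:]:
        if g != series[-1]:
            series.append(g)
    return series
```

The returned list of garbage model indices plays the role of the garbage sequence model; in a real recognizer each garbage model would itself be a multi-state model and the scores would come from acoustic feature vectors.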
In accordance with the method of the present invention, the determined garbage sequence model is privileged over any other path through the plurality of garbage models. In particular, the determined garbage sequence model is privileged over any path that includes the same series of consecutive garbage models. This ensures that the part of the spoken utterance is assessed as the keyword to be recognized even though a similar path through the plurality of garbage models exists. The hit rate is therefore increased, because the part of the spoken utterance is then preferentially assessed as the keyword to be recognized.
In accordance with a first aspect of the present invention, a number of further garbage sequence models which also represent that keyword is determined, and the part of the spoken utterance is assessed as the keyword to be recognized if that part matches best to any of that number of garbage sequence models. The total number of garbage sequence models and the keyword model are then used to recognize the keyword. The hit rate is thereby increased, because an utterance that is spoken slightly less clearly might still match one of the further garbage sequence models and is therefore assessed as the keyword.
The total number of garbage sequence models is preferably determined by calculating a probability value for each garbage sequence model and selecting those garbage sequence models for which the probability value is above a predefined value. Such a calculation of probability values for models is common practice.
The predefined probability value, which is used here to classify a garbage sequence model as representing the keyword or not, is determined empirically.
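The selection step can be sketched in Python as follows. The candidate series, their probability values and the threshold are hypothetical; as stated above, the predefined value would be found empirically.

```python
from typing import List, Sequence, Tuple

# A candidate is (series of garbage model names, probability value).
Candidate = Tuple[Tuple[str, ...], float]


def select_sequence_models(candidates: Sequence[Candidate],
                           predefined_value: float) -> List[Tuple[str, ...]]:
    """Keep the garbage sequence models whose probability value is above
    the predefined value (strictly greater, as an assumed reading)."""
    return [series for series, prob in candidates if prob > predefined_value]
```

All selected series are then stored alongside the keyword model and used in parallel during recognition.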
In accordance with a second aspect of the present invention, further
For this, one garbage sequence model which best represents the keyword is required. This garbage sequence model is determined and stored a priori, before the recognition phase. If, during the recognition phase, a path through the plurality of garbage models is detected which matches best to a part of the spoken utterance, a post-processing step is applied. In that post-processing step, a likelihood that the predefined garbage sequence model is contained in that path is determined. If the likelihood is above a threshold, the path or a part thereof is assumed to be the garbage sequence model. With that assumption, the part of the spoken utterance is assessed as the keyword to be recognized. Because only one garbage sequence model has to be stored, the recognition method according to the second aspect of the present invention consumes less memory and can therefore advantageously be applied when the memory size is limited, for example in mobile phones. Because the threshold can be adjusted at any time as needed, the recognition method according to the second aspect is also highly flexible.
Preferably, the likelihood is calculated based on the determined garbage sequence model, the detected path through the plurality of garbage models, and a garbage model confusion matrix, wherein the garbage model confusion matrix contains the probabilities P(i|j) that a garbage model i will be recognized given that a garbage model j is present.
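One possible way to combine these three inputs is sketched below in Python. The exact formula is not given in this passage, so the sketch makes its own assumptions: the stored garbage sequence model is slid over the detected path, each window is scored as the product of P(i|j) taken position by position, and unseen pairs receive a small floor probability. All names and values are hypothetical.

```python
from typing import Dict, Sequence, Tuple

# (recognized model i, given model j) -> P(i|j); entries are assumed values.
ConfusionMatrix = Dict[Tuple[str, str], float]


def containment_likelihood(path: Sequence[str],
                           reference: Sequence[str],
                           confusion: ConfusionMatrix,
                           floor: float = 1e-6) -> float:
    """Likelihood that `reference` (the stored garbage sequence model)
    is contained in `path` (the detected series of garbage models)."""
    n = len(reference)
    if n == 0 or len(path) < n:
        return 0.0
    best = 0.0
    for start in range(len(path) - n + 1):
        window = path[start:start + n]
        likelihood = 1.0
        for recognized, given in zip(window, reference):
            # Unseen (i, j) pairs fall back to the assumed floor value.
            likelihood *= confusion.get((recognized, given), floor)
        best = max(best, likelihood)
    return best


def assumed_as_sequence_model(likelihood: float, threshold: float) -> bool:
    """Post-processing decision: treat the path as the keyword if the
    likelihood is above the adjustable threshold."""
    return likelihood > threshold
```

Only the single stored series and the confusion matrix need to be kept in memory, and the threshold can be changed without recomputing either of them.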
Advantageously, the at least one garbage sequence model is determined when a keyword model is created for a new keyword to be recognized. The speech recognition method according to the first and the second aspect of the present invention is thereby flexible, because the garbage sequence models are determined as soon as a new keyword is created. This is an advantage for speaker-dependent recognition methods, where the keyword models are created from one or more utterances of one speaker, who in general is the user. The method is then applied as soon as a new keyword is created by the user.
A further aspect of the present invention relates to a computer program product, with program code means for performing the recognition method according to the present invention, when the product is executed in a computing unit.
Preferably the computer program product is stored on a computer-readable recording medium.
In the following the advantages of the present invention will be apparent upon reading the following detailed description of the preferred embodiments and upon the following drawings where:
Automatic speech recognition is used to recognize one or more keywords from a spoken utterance. Therefore, the applied recognition method is depicted as a finite state syntax.
In accordance with the principle concept of the present invention, a garbage sequence model is created, which also represents the keyword. This garbage sequence model then is used to assess the incoming utterances or a part thereof as the keyword to be recognized, if the garbage sequence model matches best to the incoming utterance or to the part of the utterance. The garbage sequence model is defined in the present invention as a series of consecutive garbage models gi. Such a garbage sequence model is preferably created, based on the finite state syntax as depicted in
The method in accordance with the first aspect of the present invention is now described by an example, as depicted in
Advantageously, the determined garbage sequence models are privileged over any path through the plurality of garbage models. In particular, the series of consecutive garbage models that defines the garbage sequence model is always weighted higher than the same series of consecutive garbage models from the plurality of garbage models. The hit rate is thereby increased, because as soon as a series of consecutive garbage models matches best to the part of a spoken utterance, the garbage sequence model is selected and the part of the utterance is assessed as the keyword to be recognized. Although the present invention is explained based on the finite state syntax for one keyword, the invention is also usable for more than one keyword. To privilege the garbage sequence model, a penalty is defined for the garbage models from the plurality of garbage models. This leads to a higher probability for the garbage sequence model compared to an identical series through the plurality of garbage models.
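The effect of the penalty can be shown with a minimal Python sketch. It assumes log-domain scores and a fixed penalty subtracted for every garbage model entered through the general garbage loop; both the scoring scheme and the values are illustrative assumptions.

```python
from typing import Sequence


def sequence_model_score(model_scores: Sequence[float]) -> float:
    """Score of a series taken as the stored garbage sequence model
    (no penalty applied)."""
    return sum(model_scores)


def loop_path_score(model_scores: Sequence[float], penalty: float) -> float:
    """Score of the identical series taken through the general garbage
    loop, where each garbage model pays the assumed penalty."""
    return sum(score - penalty for score in model_scores)
```

For any positive penalty, the same series of garbage models scores higher as the garbage sequence model than as a path through the plurality of garbage models, so the decoder prefers the keyword hypothesis.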
A mapping from a path through a plurality of garbage models to the predefined garbage sequence model is depicted in
The method in accordance with the principle concept of the present invention increases the hit rate. The hit rate is further increased by both described aspects of the present invention. The method in accordance with the first aspect of the present invention is easy to implement and requires less computational effort. The method in accordance with the second aspect of the present invention is more flexible. The hit rate can also be increased by applying a method which combines the features of the first and the second aspect of the present invention. In that case, a part of the spoken utterance is assessed as the keyword when, in accordance with the first aspect, the path directly matches best to one or more predefined garbage sequence models, or when, in accordance with the second aspect, the path is assumed to be the garbage sequence model. The speech recognition method of the present invention is thereby flexible and adaptable to the limitations of the mobile equipment in which it is implemented, e.g. a limited memory size.
In contrast to speech recognition devices known from the prior art, the automatic speech recognition device according to the present invention also assesses any part of the spoken utterance as a keyword to be recognized if that part matches best to at least one of the determined garbage sequence models stored in the memory part. The hit rate is thereby increased.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP02/08585 | 8/1/2002 | WO | | 7/13/2005 |