The invention relates to speech recognition systems, and more particularly to detection of end of utterance in speech recognition systems.
Various speech recognition applications have been developed in recent years, for instance for car user interfaces and for mobile terminals such as mobile phones, PDA devices and portable computers. Known applications for mobile terminals include methods for calling a particular person by saying his/her name aloud into the microphone of the mobile terminal and setting up a call to the number associated with the name whose model best corresponds to the speech input from the user. However, present speaker-dependent methods usually require that the speech recognition system is trained to recognize the pronunciation of each word. Speaker-independent speech recognition improves the usability of a speech-controlled user interface, because the training stage can be omitted. In speaker-independent word recognition, the pronunciation of words can be stored beforehand, and the word spoken by the user can be identified by means of a pre-defined pronunciation, such as a phoneme sequence. Most speech recognition systems use the Viterbi search algorithm, which builds a search through a network of Hidden Markov Models (HMMs) and maintains the most likely path score at each state in this network for each frame or time step.
Detection of end of utterance (EOU) is an important aspect of speech recognition. The aim of EOU detection is to detect the end of speaking as reliably and quickly as possible. Once the EOU detection has been made, the speech recognizer can stop decoding and the user receives the recognition result. Well-functioning EOU detection can also improve the recognition rate, since the noise portion after the speech is omitted.
Different techniques have been developed for EOU detection. For instance, EOU detection may be based on the level of detected energy, on detected zero crossings, or on detected entropy. However, these methods often prove to be too complex for constrained devices such as mobile phones. When speech recognition is performed in a mobile device, a natural place to gather information for EOU detection is the decoder part of the speech recognizer. The advancement of the recognition result for each time index (one frame) can be followed as the recognition process proceeds. The EOU can be detected and the decoding stopped when a pre-determined number of frames have produced (substantially) the same recognition result. This kind of approach to EOU detection has been presented by Takeda K., Kuroiwa S., Naito M. and Yamamoto S. in the publication “Top-Down Speech Detection and N-Best Meaning Search in a Voice Activated Telephone Extension System”, ESCA, EuroSpeech 1995, Madrid, September 1995.
This approach is herein referred to as the “stability check of the recognition result”. However, there are certain situations in which this approach fails: if there is a long enough silence portion before speech data is received, the algorithm will send an EOU detection signal. Hence, end of speech may be erroneously detected even before the user begins to talk. With stability-check-based EOU detection, too early EOU detections may also occur due to delays between names/words, or in certain situations even during speech. In noisy environments it may be the case that such an EOU detection algorithm cannot detect EOU at all.
There is now provided an enhanced method and arrangement for EOU detection. Different aspects of the invention include a speech recognition system, method, an electronic device, and a computer program product, which are characterized by what has been disclosed in the independent claims. Some embodiments of the invention are disclosed in the dependent claims.
According to an aspect of the invention, a speech recognizer of a data processing device is configured to determine whether the recognition result determined from received speech data is stabilized. Further, the speech recognizer is configured to process values of best state scores and best token scores associated with frames of received speech data for end of utterance detection purposes. If the recognition result is stabilized, the speech recognizer is configured to determine whether end of utterance is detected or not, based on the processing of the best state scores and best token scores. The best state score refers generally to the score of the state having the best probability amongst a number of states in a state model used for speech recognition purposes. The best token score refers generally to the best probability of a token amongst a number of tokens used for speech recognition purposes. These scores may be updated for each frame comprising speech information.
An advantage of arranging the detection of end of utterance in this way is that errors relating to silent periods before speech data is received, delays between speech segments, EOU detections during speech, and missed EOU detections (e.g. due to noise) can be reduced or even avoided. The invention also provides a computationally economical way to detect EOU, since pre-calculated state and token scores may be used. Thus the invention is also very well suited to small portable devices such as mobile phones and PDA devices.
According to an embodiment of the invention, a best state score sum is calculated by summing the best state score values of a pre-determined number of frames. In response to the recognition result being stabilized, the best state score sum is compared to a pre-determined threshold sum value. Detection of end of utterance is determined if the best state score sum does not exceed the threshold sum value. This embodiment makes it possible to at least reduce the above-mentioned errors, and is especially useful against errors relating to silent periods before speech data is received and against erroneous EOU detections during speech.
According to an embodiment of the invention, best token score values are determined repetitively and the slope of the best token score values is calculated on the basis of at least two best token score values. The slope is compared to a pre-determined threshold slope value. Detection of end of utterance is determined if the slope does not exceed the threshold slope value. This embodiment makes it possible to at least reduce errors relating to silent periods before speech data is received and to long pauses between words. This embodiment is especially useful (and better than the above embodiment) against erroneous EOU detections during speech, since the best token score slope is very tolerant of noise.
In the following the invention will be described in greater detail by means of preferred embodiments with reference to the attached drawings, in which
FIGS. 3a, 3b, and 3c are flow charts illustrating some embodiments according to an aspect of the invention;
FIGS. 4a and 4b are flow charts illustrating some embodiments according to an aspect of the invention;
The data processing device (TE) comprises a speech recognizer (SR), which may be implemented by software executed in the central processing unit (CPU). The SR implements the typical functions associated with a speech recognizer unit: in essence, it finds a mapping between sequences of speech and pre-determined models of symbol sequences. As assumed below, the speech recognizer SR may be provided with end of utterance detection means having at least part of the features illustrated below. It is also possible that an end of utterance detector is implemented as a separate entity.
The functionality of the invention relating to the detection of end of utterance, described in more detail below, may thus be implemented in the data processing device (TE) by a computer program which, when executed in a central processing unit (CPU), causes the data processing device to implement procedures of the invention. Functions of the computer program may be distributed to several separate program components communicating with one another. In one embodiment the computer program code portions causing the inventive functions are part of the speech recognizer SR software. The computer program may be stored in any memory means, e.g. on the hard disk or a CD-ROM disc of a PC, from which it may be downloaded to the memory MEM of a mobile station MS.
It is also possible to use hardware solutions or a combination of hardware and software solutions to implement the inventive means. Accordingly, each of the computer program products above can be at least partly implemented as a hardware solution, for example as ASIC or FPGA circuits, in a hardware module comprising connecting means for connecting the module to an electronic device and various means for performing said program code tasks, said means being implemented as hardware and/or software.
In one embodiment the speech recognition is arranged in the SR by utilizing hidden Markov models (HMMs). The Viterbi search algorithm may be used to find a match to the target words. This algorithm is a dynamic programming algorithm which builds a search through a network of Hidden Markov Models and maintains the most likely path score at each state in this network for each frame or time step. The search process is time-synchronous: it processes all states at the current frame completely before moving on to the next frame. At each frame, the path scores for all current paths are computed based on a comparison with the governing acoustic and language models. When all the speech data has been processed, the path with the highest score is the best hypothesis. A pruning technique may be used to reduce the Viterbi search space and to improve the search speed. Typically, a threshold is set at each frame in the search, whereby only paths whose score is higher than the threshold are extended to the next frame; all others are pruned away. The most commonly used pruning technique is beam pruning, which advances only those paths whose score falls within a specified range. For more details on HMM-based speech recognition, reference is made to the Hidden Markov Model Toolkit (HTK), which is available at the HTK homepage http://htk.eng.cam.ac.uk/.
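Purely as an illustrative sketch of the frame-synchronous Viterbi decoding and beam pruning described above, the following Python fragment shows the recursion over a single HMM; the function signature, the start-state convention and the beam width are assumptions made for illustration only:

```python
# Illustrative sketch only (not the claimed arrangement): a frame-synchronous
# Viterbi pass over a single HMM with beam pruning. log_trans[i][j] is the
# transition log probability from state i to state j, and log_emit(j, frame)
# returns the emission log probability of the frame in state j.

def viterbi_beam(frames, num_states, log_trans, log_emit, beam=200.0):
    NEG_INF = float("-inf")
    # Initial frame: this sketch assumes the path starts in state 0.
    scores = [NEG_INF] * num_states
    scores[0] = log_emit(0, frames[0])
    for frame in frames[1:]:
        new_scores = [NEG_INF] * num_states
        for j in range(num_states):
            # Standard Viterbi recursion: best predecessor for state j.
            best_prev = max(scores[i] + log_trans[i][j] for i in range(num_states))
            if best_prev > NEG_INF:
                new_scores[j] = best_prev + log_emit(j, frame)
        # Beam pruning: only paths whose score is within `beam` of the best
        # state score of this frame are extended to the next frame.
        best = max(new_scores)
        scores = [s if s >= best - beam else NEG_INF for s in new_scores]
    return max(scores)  # score of the best surviving path
```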
An embodiment of the enhanced multilingual automatic speech recognition system, applicable for instance in a data processing device TE described above, is illustrated in
In the method illustrated in
Token passing is used to transfer score information between states. Each state of an HMM (at time frame t) holds a token comprising information on a partial log probability. A token represents a partial match between the observation sequence (up to time t) and the model. A token-passing algorithm propagates and updates tokens at each time frame and passes the best token (having the highest probability at time t−1) to the next state (at time t). At each time frame, the log probability of a token is accumulated with the corresponding transition and emission probabilities. The best token scores are thus found by examining all possible tokens and selecting the ones having the best scores. As each token passes through the search tree (network), it maintains a history recording its route. For more details on token passing and token scores, reference is made to “Token Passing: a Simple Conceptual Model for Connected Speech Recognition Systems”, Young, Russell, Thornton, Cambridge University Engineering Department, Jul. 31, 1989, which is incorporated herein by reference.
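The token-passing idea can be sketched as follows; the Token class, the single-token-per-state simplification and the way transition and emission log probabilities are supplied are assumptions made for illustration:

```python
from dataclasses import dataclass, field

# Minimal token-passing sketch (illustrative simplifications throughout):
# each state holds one token; at every frame the best incoming token is
# propagated to each state and its log probability is accumulated with the
# corresponding transition and emission log probabilities. Each token also
# keeps a history recording its route through the network.

@dataclass
class Token:
    log_prob: float
    history: list = field(default_factory=list)

def propagate(tokens, log_trans, log_emit, frame):
    """One time frame of token passing over len(tokens) states."""
    n = len(tokens)
    new_tokens = []
    for j in range(n):
        # Pass the best token (highest accumulated log probability at t-1)
        # into state j at time t.
        best_i = max(range(n), key=lambda i: tokens[i].log_prob + log_trans[i][j])
        src = tokens[best_i]
        new_tokens.append(Token(
            log_prob=src.log_prob + log_trans[best_i][j] + log_emit(j, frame),
            history=src.history + [j],
        ))
    return new_tokens

def best_token_score(tokens):
    """Best token score of the current frame, as used for EOU detection below."""
    return max(token.log_prob for token in tokens)
```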
The speech recognizer SR is also configured to determine 202, 203 whether the recognition result determined from received speech data is stabilized. If the recognition result is not stabilized, speech processing may be continued 205 and step 201 may be entered again for the next frames. Conventional stability check techniques may be utilized in step 202. If the recognition result is stabilized, the speech recognizer is configured to determine 204 whether end of utterance is detected or not, based on the processing of the best state scores and best token scores. If the processing of the best state scores and best token scores also indicates that speech has ended, the speech recognizer SR is configured to determine detection of end of utterance and to end speech processing. Otherwise speech processing is continued, and the method may return to step 201 for the next speech frames. By utilizing the best state scores and best token scores together with suitable threshold values, the errors associated with EOU detection based on the stability check alone can be at least reduced. Values already calculated for speech recognition purposes may be utilized in step 204. It is possible that some or all of the best state score and/or best token score processing is done for EOU detection purposes only when the recognition result is stabilized, or the scores may be processed continuously, taking new frames into account. Some more detailed embodiments are illustrated in the following.
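The per-frame logic of steps 201-205 could be organized along the following lines. This is only a sketch: the decoder interface and the three helper callables (the stability check and the two score-based checks detailed in the embodiments below) are assumptions introduced for illustration.

```python
# Sketch of the per-frame EOU logic (cf. steps 201-205). The decoder object
# and the three callables are illustrative assumptions: is_stabilized stands
# for a conventional stability check, state_check and token_check for the
# best-state-score and best-token-score tests described in the embodiments
# below. In this sketch both score-based checks must agree; either one could
# also be used on its own.

def process_frame(frame, decoder, scores, is_stabilized, state_check, token_check):
    decoder.decode(frame)                                # step 201: update scores
    scores["state"].append(decoder.best_state_score())
    scores["token"].append(decoder.best_token_score())

    if not is_stabilized():                              # steps 202, 203
        return False                                     # step 205: continue

    # Step 204: the stabilized result alone is not enough; the score-based
    # processing must also indicate that speech has ended.
    return state_check(scores["state"]) and token_check(scores["token"])
```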
In the method of FIG. 3a, the speech recognizer SR is configured to calculate 301 a best state score sum by summing the best state score values of a pre-determined number of frames.
The speech recognizer SR is configured to compare 302, 303 the best state score sum to a pre-determined threshold sum value. In one embodiment, this step is entered in response to the recognition result being stabilized, which is not shown in FIG. 3a.
FIG. 3b illustrates a further embodiment relating to the method in FIG. 3a.
FIG. 3c illustrates a further embodiment relating to the method described above.
In the following an algorithm for calculating the normalized sum of the last #BSS values is illustrated.
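One possible form of such a calculation is sketched below; the buffer handling, the constant name BSS_BUFFER_SIZE and its value are assumptions made for illustration, the essential points being that the last #BSS best state score values are summed and that the sum is normalized by the size of the BSS buffer:

```python
from collections import deque

# Illustrative sketch: keep the last #BSS best state score values in a
# fixed-size buffer, sum them and normalize the sum by the buffer size.
# BSS_BUFFER_SIZE and the threshold handling are assumptions.

BSS_BUFFER_SIZE = 30                        # number of summed frames (#BSS)
bss_buffer = deque(maxlen=BSS_BUFFER_SIZE)  # oldest value drops out automatically

def update_bss(best_state_score):
    """Store the best state score of the newest frame."""
    bss_buffer.append(best_state_score)

def normalized_bss_sum():
    """Sum of the buffered best state scores, normalized by the buffer size."""
    return sum(bss_buffer) / BSS_BUFFER_SIZE

def bss_indicates_eou(threshold_sum):
    """EOU is indicated when the normalized sum does not exceed the threshold."""
    return normalized_bss_sum() <= threshold_sum
```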
In the above exemplary algorithm the normalization is done based on the size of the BSS buffer.
FIG. 4a illustrates an embodiment for utilizing best token scores for end of utterance detection purposes. In step 401 the speech recognizer SR is configured to determine the best token score value for the current frame (at time T). The speech recognizer SR is configured to calculate 402 the slope of the best token score values on the basis of at least two best token score values. The number of best token score values used in the calculation may be varied; in experiments it has been found adequate to use fewer than ten of the most recent best token score values. In step 403 the speech recognizer SR is configured to compare the slope to a pre-determined threshold slope value. Based on the comparison 403, 404, if the slope does not exceed the threshold slope value, the speech recognizer SR may determine 405 detection of end of utterance. Otherwise speech processing is continued 406 and step 401 may be entered again.
FIG. 4b illustrates a further embodiment relating to the method in FIG. 4a.
In a further embodiment the speech recognizer SR is configured to begin slope calculations only after a pre-determined number of frames has been received. Some or all of the above features relating to best token scores may be repeated for each frame or only for some of the frames.
In the following an algorithm for arranging slope calculation is illustrated:
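A sketch of one way to arrange the calculation is given below; the window length and the least-squares form of the fit are assumptions, the text above requiring only that the slope is calculated from at least two of the most recent best token score values:

```python
# Sketch of the best-token-score slope check: fit a straight line through
# the last few best token score values (their frame indices form the x-axis)
# and use its slope. The window length and the least-squares fit are
# illustrative assumptions.

SLOPE_WINDOW = 5                         # fewer than ten recent values

def token_score_slope(best_token_scores):
    """Least-squares slope of the last SLOPE_WINDOW best token score values."""
    y = best_token_scores[-SLOPE_WINDOW:]
    n = len(y)
    if n < 2:
        return None                      # not enough values yet
    x = range(n)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(i * v for i, v in zip(x, y))
    sum_xx = sum(i * i for i in x)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)

def slope_indicates_eou(best_token_scores, threshold_slope):
    """EOU is indicated when the slope does not exceed the threshold slope."""
    slope = token_score_slope(best_token_scores)
    return slope is not None and slope <= threshold_slope
```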
The formula for calculation of slope in the above algorithm is:
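With the frame index as abscissa and the best token score as ordinate, a least-squares fit over the N most recent values gives the slope; this explicit form matches the sketch above and is given as an assumption rather than as a quotation of the original formula:

$$\mathrm{slope} = \frac{N\sum_{i=1}^{N} x_i y_i \;-\; \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{N\sum_{i=1}^{N} x_i^{2} \;-\; \left(\sum_{i=1}^{N} x_i\right)^{2}},$$

where $x_i$ is the frame index and $y_i$ the best token score of the $i$-th value in the window of the $N$ most recent values.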
According to an embodiment illustrated in
As illustrated in
According to an embodiment, the speech recognizer SR is configured to wait a pre-determined time period from the beginning of speech processing before determining detection of end of utterance. This may be implemented such that the speech recognizer SR does not perform some or all of the above-illustrated features related to end of utterance detection, or such that the speech recognizer SR will not make a positive end of utterance detection decision, until the time period has elapsed. This embodiment makes it possible to avoid EOU detections before speech and errors due to unreliable results at the early stage of speech processing. For instance, tokens have to advance for some time before they provide reasonable scores. As already mentioned, it is also possible to apply a certain number of received frames from the beginning of speech processing as a starting criterion.
According to another embodiment, the speech recognizer SR is configured to determine detection of end of utterance after a maximum number of frames producing substantially the same recognition result has been received. This embodiment may be used in combination with any of the features described above. By setting the maximum number reasonably high, this embodiment makes it possible to end speech processing after a long enough “silence” period even when some criterion for detecting end of utterance has not been fulfilled, e.g. due to some unexpected situation which prevents detection of EOU.
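These two safeguards (a minimum amount of processed speech before any positive decision, and a forced decision after a sufficiently long run of substantially identical results) might wrap the other criteria roughly as follows; the names and the concrete limits are illustrative assumptions only:

```python
# Illustrative guard logic around the EOU decision. MIN_FRAMES and
# MAX_STABLE_FRAMES are assumptions chosen only for illustration.

MIN_FRAMES = 30          # wait this many frames from the start of processing
MAX_STABLE_FRAMES = 150  # force EOU after this many substantially identical results

def eou_decision(frames_processed, stable_frames, other_checks_passed):
    if frames_processed < MIN_FRAMES:
        return False                 # too early for a reliable decision
    if stable_frames >= MAX_STABLE_FRAMES:
        return True                  # long enough "silence": stop anyway
    return other_checks_passed       # otherwise rely on the other criteria
```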
It is important to notice that the problems related to stability-check-based end of utterance detection can best be avoided by combining at least most of the above-illustrated features. Thus the above-illustrated features may be combined in various ways within the invention, thereby creating multiple conditions which must be met before it is determined that end of utterance is detected. The features are suitable both for speaker-dependent and speaker-independent speech recognition. The threshold values can be optimized for different usage situations by testing the functioning of the end of utterance detection in these various situations.
Experiments on these methods have shown that the amount of erroneous EOU detections can be largely avoided by combining the methods, especially in noisy environments. Further, the delays in detecting the end of utterance after the actual end-point were smaller than in EOU detection without the present method.
It will be obvious to a person skilled in the art that, as the technology advances, the inventive concept can be implemented in various ways. The invention and its embodiments are not limited to the examples described above but may vary within the scope of the claims.
Number | Name | Date | Kind |
---|---|---|---|
4821325 | Martin et al. | Apr 1989 | A |
5621859 | Schwartz et al. | Apr 1997 | A |
5740318 | Naito et al. | Apr 1998 | A |
5819222 | Smyth et al. | Oct 1998 | A |
5848388 | Power et al. | Dec 1998 | A |
5884259 | Bahl et al. | Mar 1999 | A |
5999902 | Scahill et al. | Dec 1999 | A |
6076056 | Huang et al. | Jun 2000 | A |
6374219 | Jiang | Apr 2002 | B1 |
6405168 | Bayya et al. | Jun 2002 | B1 |
6873953 | Lennig | Mar 2005 | B1 |
7711561 | Hogenhout et al. | May 2010 | B2 |
20020165715 | Riis et al. | Nov 2002 | A1 |
20040019483 | Deng et al. | Jan 2004 | A1 |
20040254790 | Novak et al. | Dec 2004 | A1 |
20050049873 | Bartur et al. | Mar 2005 | A1 |
20050149337 | Asadi et al. | Jul 2005 | A1 |
Number | Date | Country |
---|---|---|
0895224 | Feb 1999 | EP |
2005017932 | Jan 2005 | JP |
WO 9422131 | Sep 1994 | WO |
Entry |
---|
Stoltze et al., “Integrated Circuits for a Real Time Large Vocabulary Continuous Speech Recognition System”, IEEE Journal of Solid-State Circuits, vol. 26, no. 1, Jan. 1991, pp. 2-11. |
Kuroiwa, S., Naito, M., Yamamoto, S. and Higuchi, N., “Robust speech detection method for telephone speech recognition system”, Speech Communication, vol. 27, no. 2, 1999, pp. 135-148. |
Takeda, K., Kuroiwa, S., Naito, M. and Yamamoto, S., “Top-Down Speech Detection and N-Best Meaning Search in a Voice Activated Telephone Extension System”, 4th European Conference on Speech Communication and Technology (ESCA, EuroSpeech 1995), Madrid, Sep. 1995, pp. 1075-1078. |
Rangoussi, M., Delopoulos, A. and Tsatsanis, M., “On the Use of Higher-Order Statistics for Robust Endpoint Detection of Speech”, IEEE, 1993, pp. 56-60. |
Young, Russell, Thornton, “Token Passing: a Simple Conceptual Model for Connected Speech Recognition Systems”, Cambridge University Engineering Department, Jul. 31, 1989. |
Hidden Markov Model Toolkit (HTK), available at the HTK homepage http://htk.eng.cam.ac.uk/, printed from Internet May 12, 2004. |
Number | Date | Country | |
---|---|---|---|
20050256711 A1 | Nov 2005 | US |