This application claims the priority benefit of Taiwan application serial no. 98145666, filed on Dec. 29, 2009. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
The disclosure is related to an apparatus and a method for generating a threshold for utterance verification which are suitable for a speech recognition system.
An utterance verification function is an indispensable part of a speech recognition system and is capable of effectively preventing mistaken recognition caused by out-of-vocabulary terms. In current utterance verification algorithms, after an utterance verification score is calculated, the score is compared with a threshold. If the score is greater than the threshold, utterance verification is successful; conversely, utterance verification fails. During actual application, an optimal threshold may be obtained by collecting more and more corpuses and analyzing an expected utterance verification result. Most solutions obtain the utterance verification result by using such a framework.
Referring to
Please refer to
The above method limits the application range of the speech recognition system, so that the practical value thereof is greatly reduced. For example, if the speech recognition system is used in an embedded system such as in a system-on-a-chip (SoC) configuration, a method for adjusting the threshold cannot be included due to consideration of costs, so that the above problem must be resolved. As shown in
Many patents, such as the following, are related to utterance verification systems and provide discussion on how to adjust the threshold.
U.S. Pat. No. 5,675,706 provides “Vocabulary Independent Discriminative Utterance Verification For Non-Keyword Rejection In Subword Based Speech Recognition.” In this patent, the threshold is a preset value, and the value is related to two error rates, namely a false alarm rate and a false reject rate. The system manufacturer may perform adjustment by itself and find a balance therebetween. In the method of the invention, at least a recognition object and an expected utterance verification result (such as a false alarm rate or a false reject rate) are used as a basis for obtaining the corresponding threshold. Manual adjustment by the user is not required.
Another U.S. patent, U.S. Pat. No. 5,737,489, provides “Discriminative Utterance Verification For Connected Digits Recognition,” and further specifies that the threshold may be dynamically calculated by collecting data online, thereby solving the problem of configuring the threshold when the external environment changes. Although this patent provides a method for calculating the threshold, the method for collecting data online in this patent is as follows. During speech recognition and operation of the utterance verification system, testing data of the new environment is used to obtain the recognition result through speech recognition. After analysis of the recognition result, the previously configured threshold for utterance verification is updated.
In summary of the various prior art, the most common method is finding the optimal threshold by collecting additional data, and the second most common method is letting the user configure the threshold by himself or herself. The above methods, however, are more or less the same in that a recognition result in a new environment is obtained through speech recognition, an existing term is verified after analysis of the result, and the threshold is updated.
The disclosure provides an apparatus for generating a threshold for utterance verification which is suitable for a speech recognition system. The apparatus for generating the threshold for utterance verification includes a value calculation module, an object score generator, and a threshold determiner. The value calculation module is configured to generate a plurality of values corresponding to a plurality of speech units. The object score generator receives a speech unit sequence of at least one recognition object, and generates at least one value distribution from the values corresponding to the speech unit sequence, the values being selected from the value calculation module. The threshold determiner is configured to receive the value distribution, and to generate a recommended threshold according to an expected utterance verification result and the value distribution.
The disclosure provides a method for generating a threshold for utterance verification which is suitable for a speech recognition system. In the method, a plurality of values corresponding to a plurality of speech units are generated and stored. A speech unit sequence of at least one recognition object is received, and a value distribution is generated from the values corresponding to the speech unit sequence. A recommended threshold is generated according to an expected utterance verification result and the value distribution.
In order to make the aforementioned and other features and advantages of the disclosure more comprehensible, embodiments accompanying figures are described in detail below.
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the disclosure and, together with the description, serve to explain the principles of the disclosure.
A method of calculating a threshold for utterance verification is introduced herein. When a recognition object is determined, a recommended threshold is obtained according to an expected utterance verification result. In addition, extra collection of corpuses or training models is not necessary for the utterance verification introduced here.
Please refer to
For companies in the field of integrated circuit design, the method according to the embodiment provides solutions for speech recognition, so that downstream manufacturers are able to develop speech recognition related products rapidly and efficiently and do not have to worry about the problem of collecting corpuses. The above method is considerably beneficial to the promotion of speech recognition technology.
According to the embodiment, before the operations of speech recognition and utterance verification, the threshold for utterance verification of the recognition object is predicted. In the related art, however, an existing threshold is used, and afterwards, when the speech recognition system and the utterance verification module are operated, the existing threshold is updated while corpuses are collected simultaneously. Hence, the related art is significantly different from the implementation of the disclosure. Additionally, it is not necessary to collect data for analysis during the operations of the speech recognition system and the utterance verification system; instead, existing speech data is used. The existing speech data may be obtained from many sources, for example, a training corpus of the speech recognition system or the utterance verification system. In the method of the disclosure, the threshold for utterance verification is calculated through statistical analysis after the recognition object is determined and before the speech recognition system or the utterance verificator operates, and no extra collection of data is necessary, so that the disclosure is clearly different from the related art.
Please refer to
The speech recognizer 410 performs recognition according to the received speech signal and a recognition object 422, and then outputs a recognition result 412 to the utterance verificator 440. At the same time, the utterance verification threshold generator 430 generates a threshold 432 corresponding to the recognition object 422 and outputs the threshold 432 to the utterance verificator 440. The utterance verificator 440 performs verification according to the recognition result 412 and the threshold 432, so as to verify whether the recognition result 412 is correct, that is, whether the utterance verification score is greater than the threshold 432.
The recognition object for the speech recognizer 410, in the embodiment, is an existing vocabulary set (such as N sets of Chinese terms) which can be read from the recognition object storage unit 420. After the speech signal passes through the speech recognizer 410, the recognition result is transmitted to the utterance verificator 440.
On the other hand, the recognition object is also input into the utterance verification threshold generator 430, and an expected utterance verification result, such as a 10% false reject rate, is provided, so as to obtain a recommended threshold θUV.
In the utterance verification threshold generator 430, according to an embodiment, a hypothesis testing method which is used in statistical analysis may be used to calculate an utterance verification score. The disclosure, however, is not limited to using said method.
There is a null hypothesis model and an alternative hypothesis model (respectively represented by H0 and H1) for each of the speech units. After the recognition result is converted into a speech unit sequence, a null and an alternative hypothesis verification score are calculated for each of the units by using the corresponding null and alternative hypothesis models. The per-unit scores are then added, so as to obtain a null hypothesis verification score (H0 score) and an alternative hypothesis verification score (H1 score) of the whole speech unit sequence. An utterance verification score (UV score) is then obtained through the following equation, in which T represents the total number of frame segments of the speech signal.
Finally, the utterance verification score (UV score) is compared with the threshold θUV. If the UV score is greater than θUV, verification is successful and the recognition result is output.
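The scoring and comparison steps above can be sketched as follows. This is a minimal sketch assuming the UV score takes the common frame-normalized form, i.e. the difference between the summed H0 and H1 scores divided by T; the patent's exact equation appears only in its figures, and all function and variable names here are illustrative.

```python
def uv_score(h0_scores, h1_scores, num_frames):
    """Combine per-unit H0 and H1 verification scores into a UV score.

    h0_scores / h1_scores: per-speech-unit log scores from the null and
    alternative hypothesis models; num_frames: T, the total number of
    frame segments of the speech signal. The normalization by T is an
    assumption (a common hypothesis-testing form).
    """
    return (sum(h0_scores) - sum(h1_scores)) / num_frames

def verify(h0_scores, h1_scores, num_frames, threshold):
    """Verification succeeds when the UV score exceeds theta_UV."""
    return uv_score(h0_scores, h1_scores, num_frames) > threshold
```

For example, with two speech units whose H0 scores are −3.0 and −2.5 and whose H1 scores are −5.0 and −4.0 over 120 frames, the UV score is 3.5/120 ≈ 0.029, which passes a threshold of 0.02 but fails a threshold of 0.05.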
For the following embodiment, please refer to
Last, the scores are respectively added to obtain a null hypothesis verification score (H0 score) and an alternative hypothesis verification score (H1 score) of the whole speech unit sequence, so as to obtain the utterance verification score (UV score), in which T represents the total number of frame segments of the speech signal.
The above utterance verification threshold generator is shown, for example, as a block diagram in
The utterance verification threshold generator 500 includes a processing-object-to-speech-unit processor 520, an object score generator 540, and a threshold determiner 550. The utterance verification threshold generator 500 further includes a value calculation module 530. The value calculation module 530 is used to generate values to be provided to the object score generator 540. According to an embodiment, the value calculation module 530 includes a speech unit verification module 532 and a speech database 534. The speech database 534 is used to store an existing corpus and may be a database having training corpuses or a storage medium into which a user inputs relevant training corpuses. The stored data may be an original audio file, speech feature parameters, or the like. The original audio file is, for example, a file in RAW AUDIO FORMAT® (RAW), WAVEFORM AUDIO FILE FORMAT® (WAV), or AUDIO INTERCHANGE FILE FORMAT® (AIFF). The speech unit verification module 532 calculates the utterance verification scores of each of the speech units from the speech database 534 and provides the utterance verification scores as one or more values to the object score generator 540.
According to the speech unit sequence which is received and according to the one or more values of each of the speech units corresponding to the speech unit sequence which are received from the value calculation module 530, the object score generator 540 generates a value distribution corresponding to the speech unit sequence and provides the value distribution to the threshold determiner 550.
According to an expected utterance verification result 560 and the value distribution which is received, the threshold determiner 550 generates the recommended threshold and outputs the recommended threshold. According to an embodiment, for example, a 10% false reject rate is given. The threshold determiner 550 determines a value in the value distribution corresponding to the expected utterance verification result and outputs said corresponding value as the recommended threshold.
The value calculation module 530 collects a plurality of score samples corresponding to one of the speech units. For example, X score samples are stored for the speech unit phoi, and the corresponding values are also stored. Here the above embodiment which adopts the hypothesis testing method is used as the preferred embodiment, but the disclosure is not limited to using the hypothesis testing method.
For the speech unit pho_i, there are a null hypothesis and an alternative hypothesis verification score (respectively represented by H0score and H1score) for each different sample.
H0score_(pho_i, sample 1) represents the first null hypothesis score sample of pho_i, H1score_(pho_i, sample 1) represents the first alternative hypothesis score sample of pho_i, and T_(pho_i, sample 1) represents the length of frame segments of the first sample of pho_i.
After the utterance verification threshold generator 500 receives the recognition object (assuming that there are W Chinese terms), all the terms are processed through a Chinese term-to-speech unit process of the processing-object-to-speech-unit processor 520, so that each term is converted into a speech unit sequence Seq_i = {pho_1, …, pho_k}, wherein i represents the ith Chinese term, and k is the number of speech units of the ith Chinese term.
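The term-to-speech-unit conversion can be sketched as a pronunciation lexicon lookup. The lexicon entries below are hypothetical illustrations (the patent does not specify a lexicon format), and the example term "qian yi xiang" is taken from the testing vocabulary described later.

```python
# Hypothetical pronunciation lexicon mapping syllables to speech units;
# the entries are illustrative only, not taken from the patent.
LEXICON = {
    "qian":  ["q", "ian"],
    "yi":    ["i"],
    "xiang": ["x", "iang"],
}

def term_to_speech_units(syllables):
    """Convert a term (a list of syllables) into its speech unit
    sequence Seq_i = {pho_1, ..., pho_k} by lexicon lookup."""
    units = []
    for syllable in syllables:
        units.extend(LEXICON[syllable])
    return units
```

For instance, the term "qian yi xiang" would be converted into the speech unit sequence ["q", "ian", "i", "x", "iang"] under this hypothetical lexicon, so k = 5 for that term.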
Next, the speech unit sequence is input into the object score generator 540.
According to the content of the speech unit sequence, the verification scores of the corresponding null hypothesis model and alternative hypothesis model are selected from the value calculation module 530 based on a selection method (such as random selection). The scores are combined by the object score generator 540 into a score sample x of the speech unit sequence according to the following equation.
H0score_sample = H0score_(pho_1, sample N) + … + H0score_(pho_k, sample M)
H1score_sample = H1score_(pho_1, sample N) + … + H1score_(pho_k, sample M)
T_sample = T_(pho_1, sample N) + … + T_(pho_k, sample M)
H0score_(pho_1, sample N) and H1score_(pho_1, sample N) represent the Nth H0 and H1 score samples selected for the first speech unit pho_1 by the value calculation module 530, and H0score_(pho_k, sample M) and H1score_(pho_k, sample M) represent the Mth H0 and H1 score samples selected for the kth speech unit pho_k.
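The random selection and summation described above can be sketched as follows. The final combination of the summed scores into the score sample x is assumed here to be the frame-normalized difference (H0score_sample − H1score_sample)/T_sample; the patent's exact combining equation appears only in its figures, and all names are illustrative.

```python
import random

def make_score_sample(unit_samples, seq):
    """Build one score sample x for a speech unit sequence.

    For every speech unit in seq, one stored (H0, H1, T) sample is
    randomly selected from the value calculation module and the three
    components are summed separately.

    unit_samples: dict mapping a speech unit name to a list of
    (h0_score, h1_score, num_frames) tuples (an assumed layout).
    """
    h0_total = h1_total = t_total = 0.0
    for pho in seq:
        h0, h1, t = random.choice(unit_samples[pho])
        h0_total += h0
        h1_total += h1
        t_total += t
    # Assumed combination into the UV score sample x.
    return (h0_total - h1_total) / t_total
```

With one stored sample per unit the selection is deterministic: for samples (−1.0, −2.0, 10) and (−1.5, −3.5, 10), x = ((−2.5) − (−5.5)) / 20 = 0.15.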
For each Chinese word, P utterance verification scores (UV scores) {x_1, x_2, …, x_P} are generated as the score sample set for the word, and all the score samples of all the words are combined into a score set for the whole recognition object. The score set for the recognition object is then input into the threshold determiner 550.
In the threshold determiner 550, the score set of the whole recognition object is statistically analyzed as a histogram and converted into a cumulative probability distribution, so that the threshold θUV is obtained from the cumulative probability distribution. For example, the threshold at which the cumulative probability value is 0.1 is obtained.
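The threshold determination step can be sketched as follows. Using a simple empirical quantile of the sorted score set, rather than explicit histogram binning, is an implementation choice of this sketch; the function name is illustrative.

```python
def recommend_threshold(uv_scores, expected_rate=0.1):
    """Sketch of the threshold determiner: pick the UV score at which
    the cumulative probability over the pooled score set equals the
    expected utterance verification result, e.g. 0.1 for a 10% false
    reject rate."""
    ordered = sorted(uv_scores)
    # Index where the cumulative probability reaches expected_rate
    # (a simple empirical quantile, without interpolation).
    index = int(expected_rate * len(ordered))
    return ordered[index]
```

For example, given 100 pooled UV score samples, a 10% expected false reject rate picks the score below which 10 of the samples fall.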
According to the above embodiment, the value calculation module 530 may be implemented through the speech unit verification module 532 and the speech database 534. Such an implementation is an embodiment of real-time calculation. Adoption of any technology having an utterance verification function by the value calculation module 530 is within the scope of the disclosure. For example, the technology disclosed in Taiwan Patent Application Publication No. 200421261, entitled “Utterance Verification Method and System,” or in the publication “Confidence measures for speech recognition: A survey” by Hui Jiang, Speech Communication, 2005, may be used in the value calculation module 530, but the disclosure is not limited thereto. According to another embodiment, a speech unit score database may be adopted, and corresponding scores may be directly selected. The disclosure, however, is not limited to using the speech unit score database. The values stored in the speech unit score database are generated by receiving existing speech data, generating corresponding scores through speech segmentation and through the speech unit score generator, and storing the scores in the speech unit score database. The following illustrates an embodiment of the above.
Please refer to
A speech data 602 used as the training corpus may be obtained from an existing available speech database. For example, the 500-PEOPLE TRSC (TELEPHONE READ SPEECH CORPUS) PHONETIC DATABASE® or the SHANGHAI MANDARIN ELDA FDB 1000 PHONETIC DATABASE® is one of the sources that may be used.
By using such a framework, after the recognition object is confirmed, the recommended threshold is obtained according to the expected utterance verification result. In addition, extra collection of a corpus or a training model is not necessary for the utterance verification introduced here. The present embodiment does not require obtaining a recognition result in a new environment through speech recognition, verifying an existing term after analysis of the result, and updating the threshold. According to the present embodiment, before the speech recognition system starts to operate, adjustment of the effects of utterance verification is performed according to the specific recognition objects, so that a recommended threshold is dynamically obtained. The recommended threshold is output for determination by the utterance verificator, so as to obtain a verification result. For integrated circuit design companies, the method according to the present embodiment provides more complete solutions for speech recognition, so that downstream manufacturers are able to develop speech recognition related products rapidly and do not have to worry about the problem of collecting corpuses. The above method is considerably beneficial to the promotion of speech recognition technologies.
In the method, first, the speech data 602 is converted into a plurality of speech units by the speech segmentation processor 610. According to an embodiment, the speech segmentation model 630 is the same as the model used by the utterance verificator when performing forced alignment.
Next, the scores corresponding to each of the speech units are obtained after calculation by the speech unit score generator 620. In the above speech unit score generator 620, the scores are generated through an utterance verification model 640. The utterance verification model 640 is the same as the utterance verification model used in the recognition system. The components of the speech unit score in the speech unit score generator 620 may vary according to the utterance verification method used in the speech recognition system. For example, according to an embodiment, when the utterance verification method is a hypothesis testing method, the speech unit score in the speech unit score generator 620 includes a null hypothesis score which is calculated using the corresponding null hypothesis model of said speech unit, and an alternative hypothesis score which is calculated using the corresponding alternative hypothesis model of said speech unit. According to another embodiment, the null and alternative hypothesis scores of each of the speech units are stored, along with the lengths of the units, in the speech unit score statistic database 650. The above may be defined as a first type of implementation. According to another embodiment, for the null and alternative hypothesis scores of each of the speech units, only the statistical value of the differences in each pair of normalized null and alternative hypothesis scores and the statistical values of the lengths are stored. For example, only the mean and the variance are stored in the speech unit score statistic database 650. The above may be defined as a second type of implementation.
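The second type of implementation, which keeps only statistics rather than raw samples, might be sketched as follows. The normalization of each H0 − H1 difference by the unit length and the use of the population variance are assumptions of this sketch; the patent gives the mean and variance only as examples, and all names are illustrative.

```python
def unit_score_statistics(samples):
    """Reduce a speech unit's raw (H0, H1, T) samples to the mean and
    variance of the length-normalized H0-H1 differences, plus the mean
    and variance of the unit lengths, for the score statistic database.

    samples: list of (h0_score, h1_score, num_frames) tuples
    (an assumed layout)."""
    diffs = [(h0 - h1) / t for h0, h1, t in samples]
    lengths = [t for _, _, t in samples]

    def mean_var(values):
        m = sum(values) / len(values)
        v = sum((x - m) ** 2 for x in values) / len(values)
        return m, v

    return {"diff": mean_var(diffs), "length": mean_var(lengths)}
```

Storing only these statistics trades the flexibility of resampling raw scores for a much smaller database, which suits embedded targets such as the SoC configuration mentioned earlier.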
According to a different utterance verification method, the score of one of speech units may include a null hypothesis score calculated from said one speech unit through a null hypothesis model of said one speech unit, and may also include a plurality of competing scores calculated in the speech database from all the units except said one unit through the null hypothesis model of said one speech unit. For each of the units, the null hypothesis scores and the corresponding competing null hypothesis scores are stored, along with the lengths of the units, into the speech unit score statistic database 650. The above may be defined as a third type of implementation, wherein a subset or all of the corresponding competing null hypothesis scores may be stored. Alternatively, the statistical value of the differences between the above normalized null hypothesis score and the plurality of competing null hypothesis scores thereof and the statistical value of the lengths may be stored. Said statistical values may be obtained by calculation through a mathematical algorithm. For example, the mean and the variance may be stored, wherein the mathematical algorithm is for calculating the arithmetic mean and the geometric mean. The statistical values are stored into the speech unit score statistic database 650. The above may be defined as a fourth type of implementation.
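The fourth type of implementation might be sketched as follows. This is a minimal sketch assuming the competing null hypothesis scores are normalized by the same unit length before the differences are taken; the patent names the mean and the variance only as example statistics, and all names are illustrative.

```python
def competing_diff_stats(null_score, competing_scores, length):
    """Reduce one unit's null hypothesis score and its competing null
    hypothesis scores to the mean and variance of their normalized
    differences, for storage in the score statistic database.

    The shared normalization by `length` is an assumption of this
    sketch; in practice each competing score may have its own length."""
    norm = null_score / length
    diffs = [norm - c / length for c in competing_scores]
    mean = sum(diffs) / len(diffs)
    var = sum((d - mean) ** 2 for d in diffs) / len(diffs)
    return mean, var
```

A large positive mean difference indicates that the unit's own model scores its data much better than the competitors do, which is the information the utterance verificator needs when only statistics, not raw competing scores, are stored.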
The calculation method used in the object score generator 540 in
Referring to
After each of the speech units is processed by the speech unit score generator 620, the utterance verification model 640 is used to calculate the null hypothesis scores (H0) and the alternative hypothesis scores (H1) thereof, which are stored, along with the lengths of the speech units, into the speech unit score statistic database 650.
Please refer to
During calculation of the UV score, for each speech unit of the sequence, one of the corresponding stored samples is randomly selected as the basis for calculation. Said sample includes a null hypothesis score (H0), an alternative hypothesis score (H1), and the length of the speech unit. Last, the scores are added to obtain a null hypothesis verification score (H0 score) and an alternative hypothesis verification score (H1 score), so as to obtain the utterance verification score (UV score), where T is the total number of frame segments of the term “qian yi xiang”.
Next, the following provides a plurality of actual experimental examples for description.
An existing speech database is used for verification. Here, the 500-PEOPLE TRSC (TELEPHONE READ SPEECH CORPUS) PHONETIC DATABASE® is used as an example. From the TRSC DATABASE®, 9006 sentences are selected as the training corpus for the speech segmentation model and the utterance verification model (please refer to the speech segmentation model 630 and the utterance verification model 640 in
A simulated testing speech data is selected from the SHANGHAI MANDARIN ELDA FDB 1000 SPEECH DATABASE®. Three testing vocabulary sets are selected in total.
The testing vocabulary set (1) includes five terms “qian yi xiang” (meaning “the previous item” in Chinese), “xun xi he” (meaning “message box”), “jie xian yuan” (meaning “operator”), “ying da she bei” (meaning “answering equipment”), and “jin ji dian hua” (meaning “emergency phone”) and includes 4865 sentences in total.
The testing vocabulary set (2) includes six terms “jing hao” (meaning “number sign”), “nei bu” (meaning “internal”), “wai bu” (meaning “external”), “da dian hua” (meaning “make a call”), “mu lu” (meaning “index”), and “lie biao” (meaning “list”) and includes 5235 sentences in total.
The testing vocabulary set (3) includes six terms “xiang qian” (meaning “forward”), “hui dian” (meaning “return call”), “shan chu” (meaning “delete”), “gai bian” (meaning “change”), “qu xiao” (meaning “cancel”), and “fu wu” (meaning “service”) and includes 5755 sentences in total.
Each of the three vocabulary sets is operated by, for example, the utterance verification threshold generator shown in
Please refer to
In
As shown in
Although the disclosure has been described with reference to the above embodiments, it is apparent to one of the ordinary skill in the art that modifications to the described embodiments may be made without departing from the spirit of the disclosure. Accordingly, the scope of the disclosure will be defined by the attached claims and not by the above detailed descriptions.
For example, the disclosure may be used alone or with the utterance verificator, as shown in
After summarizing the above possible embodiments, the recognition object and the utterance verification object are collectively called the processing object. The utterance verification threshold generator provided by the disclosure is capable of receiving at least one processing object and outputting the at least one recommended threshold corresponding to the at least one processing object.
Hence, the scope of the disclosure is defined by the following claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
98145666 | Dec 2009 | TW | national |