The present invention is directed, in general, to automatic speech recognition (ASR) and, more particularly, to a system and method for text-to-phoneme (TTP) mapping with prior knowledge.
Speaker-independent name dialing (SIND) is an important application of ASR to mobile telecommunication devices. SIND enables a user to contact a person by simply saying that person's name; no previous enrollment or pre-training of the person's name is required.
Several challenges, such as robustness to environmental distortions and pronunciation variations, stand in the way of extending SIND to a variety of applications. However, providing SIND in mobile telecommunication devices is particularly difficult, because such devices have quite limited computing resources.
SIND requires a list of names (which may number in the thousands) to be recognized; techniques that generate phoneme sequences for those names are therefore necessary. However, because of the above-mentioned limited resources, a large dictionary with many entries cannot be used. It is therefore important to have compact and accurate methods for generating phoneme sequences of name pronunciations in real time. These methods are usually called “text-to-phoneme” (TTP) mapping algorithms.
Conventional TTP mapping algorithms fall into two general categories. One category is algorithms based on phonological rules. The phonological rules are used to map a word to corresponding phone sequences. A rule-based approach usually works well for some languages with “regular” mappings between words and pronunciations, such as Chinese, Japanese or German. In this context, “regular” means that the same grapheme always corresponds to the same phoneme. However, for some other languages, notably English, a rule-based approach may not perform well due to “irregular” mappings between words and pronunciations.
Another category is data-driven approaches, which have come about more recently than rule-based approaches. These approaches include neural networks (see, e.g., Deshmukh, et al., “An advanced system to generate pronunciations of proper nouns,” in ICASSP, 1997, pp. 1467-1470), decision trees (see, e.g., Suontausta, et al., “Low memory decision tree method for text-to-phoneme mapping,” in ASRU, 2003) and N-grams (see, e.g., Maison, et al., “Pronunciation modeling for names of foreign origin,” in ASRU, 2003, pp. 429-34).
Among these data-driven approaches, decision trees are usually more accurate. However, they require relatively large amounts of memory. In order to reduce the size of decision trees so they can be used in mobile telecommunication devices, techniques for removing “irregular” entries from training dictionaries, such as post-processing (see, e.g., Suontausta, et al., supra), have been suggested. These techniques, however, require substantial manual intervention.
Accordingly, what is needed in the art is a new technique for TTP mapping that is not only relatively fast and accurate, but also more suitable for use in mobile telecommunication devices than are the above-described techniques.
To address the above-discussed deficiencies of the prior art, the present invention provides techniques for TTP mapping and systems and methods based thereon.
The foregoing has outlined features of the present invention so that those skilled in the pertinent art may better understand the detailed description of the invention that follows. Additional features of the invention will be described hereinafter that form the subject of the claims of the invention. Those skilled in the pertinent art should appreciate that they can readily use the disclosed conception and specific embodiment as a basis for designing or modifying other structures for carrying out the same purposes of the present invention. Those skilled in the pertinent art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention.
For a more complete understanding of the invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawing, in which:
Described herein are particular embodiments of a novel TTP mapping technique. The technique systematically regularizes dictionaries for training decision-tree pronunciation models (DTPMs) for name recognition. In general, the technique is based upon an Expectation-Maximization (E-M)-like iterative algorithm that iteratively updates estimates of the probability of a particular phoneme given a particular letter. In one embodiment, prior knowledge of letter-to-phoneme (LTP) mapping is incorporated via prior probabilities of a particular phoneme given a particular letter to yield improved TTP performance. In one embodiment, the technique updates posterior probabilities of a particular phoneme given a particular letter by Bayesian updating. In order to remove unreliable LTP mappings and to regularize dictionaries, a threshold may be set and, by comparison with the threshold, LTP mappings having lower posterior probabilities may be removed. As a result, the technique does not require much human effort in developing a small DTPM for SIND. As will be described below, exemplary DTPMs were obtained having a memory size smaller than 250 Kbytes.
Certain embodiments of the technique of the present invention have two advantages over conventional techniques for TTP mapping. First, the technique of the present invention makes better use of prior knowledge to improve TTP performance. This is in contrast to certain prior art methods (e.g., Damper, et al., “Aligning letters and phonemes for speech synthesis,” in ISCA Speech Synthesis Workshop, 2004) that make no use of prior knowledge. Such methods may have a relatively high LTP alignment rate, but they fail to remove some entries, such as foreign pronunciations, that are useless for name recognition in a particular language. Second, the technique of the present invention employs a threshold to regularize the dictionary. The threshold tends to diminish prior probabilities automatically over time. Thus, the substantial human effort that would otherwise be required to dispense manually with entries having lower posterior probabilities is no longer needed. This is in stark contrast with post-processing methods taught, e.g., in Suontausta, et al., supra. Post-processing methods use human LTP-mapping knowledge to remove low-probability entries in a hard-decision way and are therefore tedious and prone to human error.
Having described the technique in general, a wireless telecommunication infrastructure in which the TTP technique of the present invention may be applied will now be described. Then, one embodiment of the TTP technique, including some important implementation issues, will be described. A DTPM based on the TTP technique will next be described. Finally, the performance of an exemplary embodiment of the TTP technique of the present invention will be evaluated in the context of SIND in a mobile telecommunication device.
Accordingly, referring to
One advantageous application for the system or method of the present invention is in conjunction with the mobile telecommunication devices 110a, 110b. Although not shown in
Having described an exemplary environment within which the system or the method of the present invention may be employed, various specific embodiments of the system and method will now be set forth.
The TTP mapping problem may reasonably be viewed as a statistical inference problem. The probability of a phoneme p given a letter l is defined as P(p|l). Given a word entry with an L-length sequence of letters (l1, . . . , lL), a TTP mapping may be carried out by the following Maximum a Posteriori (MAP) probability method:
where P((p1, . . . , pL)|(l1, . . . , lL)) is the probability of a phoneme sequence (p1, . . . , pL) given a letter sequence (l1, . . . , lL). If it is assumed that the phoneme pi is dependent only on the current letter li, the probability may be simplified as:
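Restated in standard notation as a sketch consistent with the definitions above (not the numbered equations of the method themselves), the MAP selection and its letterwise simplification are:

```latex
(\hat{p}_1,\ldots,\hat{p}_L)
  = \arg\max_{(p_1,\ldots,p_L)} P\bigl((p_1,\ldots,p_L)\mid(l_1,\ldots,l_L)\bigr)
  \approx \arg\max_{(p_1,\ldots,p_L)} \prod_{i=1}^{L} P(p_i \mid l_i).
```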
A good estimate of the above probability is required to have good TTP mapping. However, some difficulties arise in achieving good TTP mapping in irregular languages, such as English. For example, English exhibits LTP mapping irregularities. A reasonable alignment between the proper name “Phil” and its pronunciation “f ih l” may be:
In English, it is common for a word to have fewer phonemes than letters. Accordingly, a “null” (or “epsilon”) phone “_” should be inserted in the transcription to maintain a one-to-one mapping. Yet, in “Phil,” it is not clear where the null-phone should be placed, since the following may also be a reasonable alignment:
Cases also occur in which one letter corresponds to two phonemes. For instance, the letter “x” is pronounced as “k s” in word “fox.” “Pseudo-phonemes” are obtained by concatenating two phonemes that are known to correspond to a single letter. In this case, “k_s,” which is a concatenation of the two phonemes, “k” and “s,” is the pseudo-phoneme of the letter “x.”
English also contains entries from other languages. For example, the word "Jolla" is pronounced as "hh ow y ah." The word is common in American English, although it is of Spanish origin. However, such entries increase the "irregularity" of a training dictionary for English name recognition.
Training dictionaries may further contain incorrect entries, such as typographical errors. These incorrect entries increase the overall irregularity of the training dictionary.
Incorporating prior human knowledge into TTP mapping may be helpful in obtaining a good estimate of the above probability. Here, the prior knowledge is incorporated by setting certain prior probabilities P*(p|l) to zero, which removes the otherwise non-zero LTP mapping that would allow the letter l to be pronounced as the phoneme p. For instance, setting P*(p|l)=0, where p is "hh" and l is "j," removes entries such as "Jolla."
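By way of illustration only, the following Python sketch shows one way such prior knowledge might be encoded; the dictionary-of-dictionaries layout, the function name and the numeric values are assumptions made for the sketch, not part of the described method.

```python
# Hypothetical layout: prior[letter][phoneme] holds the prior probability P*(p|l).
# Setting an entry to zero disallows that LTP mapping, e.g. the Spanish-style
# pronunciation of the letter "j" as "hh" (as in "Jolla").
prior = {
    "j": {"jh": 0.5, "y": 0.25, "hh": 0.25, "_": 0.0},  # illustrative values only
}

def forbid_mapping(prior, letter, phoneme):
    """Remove an LTP mapping by forcing its prior probability to zero."""
    if phoneme in prior.get(letter, {}):
        prior[letter][phoneme] = 0.0

forbid_mapping(prior, "j", "hh")  # entries such as "Jolla" are then excluded
```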
Having described the nature of the TTP mapping problem in general, one specific embodiment of the system of the present invention will now be presented in detail. Accordingly, turning now to
The system includes an LTP mapping generator 210. The LTP mapping generator 210 is configured to generate an LTP mapping by iteratively aligning a full training set (e.g., S) with a set of correctly aligned entries (e.g., T) based on statistics of phonemes and letters from the set of correctly aligned entries and redefining the full training set as a union of the set of correctly aligned entries and a set of incorrectly aligned entries (e.g., E) created during the aligning. In the illustrated embodiment, the LTP mapping generator 210 is configured to generate the LTP mapping over a predetermined number (e.g., n) of iterations, represented by the circular line wrapping around the LTP mapping generator 210.
The system further includes a model trainer 220. The model trainer 220 is configured to update prior probabilities of LTP mappings generated by the LTP mapping generator 210 and to evaluate whether the LTP mappings are suitable for training a DTPM 230. In the illustrated embodiment, the model trainer 220 is configured to evaluate a predetermined number (e.g., r) of LTP mappings generated by the LTP mapping generator 210, represented by the curved line leading back from the model trainer 220 to the LTP mapping generator 210.
The operation of certain embodiments of the LTP mapping generator 210 and the model trainer 220 will now be described. Accordingly, turning now to
The technique of
A full training set S is first defined. S consists of two sets T and E, where T is a set of correctly aligned entries, and E is a set of incorrectly aligned entries. The method begins in a start step 305. The method is iterative and has outer and inner loops, viz:
As described above, the method is based upon an E-M-like iterative algorithm. Step 3(a)ii corresponds to the E-step in the E-M algorithm. Step 3(a)iii is the M-step in the E-M algorithm. The normal E-M algorithm may use the estimated posterior probability P(p|l) obtained in Equation (3) in place of P̃(p|l) in Equation (6) for the M-step to perform TTP alignment.
As previously described, prior knowledge of LTP mapping is incorporated into the method; this yields an improved posterior probability P̃(p|l). By Equation (5), the improved posterior probability is obtained in consideration of both the observed LTP pairs and the prior probability of LTP mapping P*(p|l).
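As a minimal sketch only, the following Python function illustrates a re-estimation step of this kind, following the proportionality P̃(p|l) ∝ P(l|p)·Q(p|l) used in the worked example that follows; the data layout and the function name are assumptions, and the sketch is not a reproduction of the numbered equations.

```python
from collections import defaultdict

def update_posteriors(aligned_pairs, prior):
    """Re-estimate an improved posterior P̃(p|l) from the current alignment,
    folding in the prior Q(p|l).

    aligned_pairs : iterable of (letter, phoneme) pairs from the current alignment
    prior         : prior[letter][phoneme] = Q(p|l), e.g. from a flat initialization

    The combination rule used here, P̃(p|l) proportional to P(l|p) * Q(p|l),
    mirrors the example given below for the pseudo-phoneme "y_ih" and the
    letter "A"; it is a sketch, not the numbered equations of the method.
    """
    c_lp = defaultdict(lambda: defaultdict(float))  # C(l, p)
    c_p = defaultdict(float)                        # C(p)
    for l, p in aligned_pairs:
        c_lp[l][p] += 1.0
        c_p[p] += 1.0

    post = {}
    for l, phones in prior.items():
        scores = {}
        for p, q in phones.items():
            likelihood = c_lp[l][p] / c_p[p] if c_p[p] > 0.0 else 0.0  # P(l|p)
            scores[p] = likelihood * q
        z = sum(scores.values())  # plays the role of the normalizer P(l)
        post[l] = {p: (s / z if z > 0.0 else 0.0) for p, s in scores.items()}
    return post
```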
The following gives an example of the motivation for using Equation (5). Only three training cases exist for the phoneme "y_ih," one of which is "POIGNANCY:" "p oy _ _ n y_ih n s iy." Hence, C(A,y_ih)=3, P(A|y_ih)=C(A,y_ih)/C(y_ih)=1.0, and P(y_ih|A)=C(A,y_ih)/C(A)=3/C(A), which approaches zero. But if LTP-pruning is used, P(y_ih|A) will be removed if it is below the threshold θA. Consequently, three cases that could otherwise be used to train the DTPM are lost. In contrast to the normal E-M algorithm, the following results:
P(y_ih|A) = P(A|y_ih)Q(y_ih|A)/P(A) = Q(y_ih|A)/P(A),
which is usually larger than the normal E-M estimate, provided the prior probability Q(y_ih|A) of the phoneme y_ih given the letter A is large.
One implementation issue regarding the method involves the initialization of the prior probability P*(p|l). A flat initialization is done on the prior probability P*(p|l). Given lists of allowed phonemes for each letter l, the prior probability of each phoneme given the letter is set to 1/#p, where #p denotes the number of possible phonemes for the letter l.
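A minimal sketch of this flat initialization, assuming the lists of allowed phonemes per letter are supplied as input (the data layout and names are illustrative assumptions):

```python
def flat_prior(allowed_phones):
    """Flat initialization of the prior P*(p|l): each phoneme allowed for a
    letter receives probability 1/#p, where #p is the number of phonemes
    allowed for that letter."""
    return {l: {p: 1.0 / len(ps) for p in ps} for l, ps in allowed_phones.items()}

# Illustrative phoneme inventories only:
prior = flat_prior({"x": ["k_s", "z", "_"], "j": ["jh", "y", "hh", "_"]})
```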
Another implementation issue regarding the method involves the initialization of co-occurrence matrices. The above iterative algorithm converges to a local optimal estimate of posterior probabilities of a particular phoneme given a particular letter. One possible initialization method may use a naive approach, e.g., Damper, et al., supra. Processing each word of the dictionary in turn, every time a letter l and a phoneme p appear in the same word irrespective of relative position, the corresponding co-occurrence C(l, p) is incremented. Although this would not be expected to give a very good estimate of co-occurrence, it is sufficient to attempt an initial alignment.
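The naive co-occurrence count described here might be sketched as follows; the input format, a list of word/pronunciation pairs, is an assumption.

```python
from collections import defaultdict

def naive_cooccurrence(dictionary):
    """Naive initialization of the co-occurrence matrix: every time a letter l
    and a phoneme p appear in the same dictionary entry, irrespective of
    relative position, C(l, p) is incremented.

    dictionary : iterable of (word, phone_list) pairs, e.g. ("fox", ["f", "aa", "k_s"])
    """
    c = defaultdict(lambda: defaultdict(int))
    for word, phones in dictionary:
        for l in word.lower():
            for p in phones:
                c[l][p] += 1
    return c
```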
Yet another implementation issue regarding the method involves the flooring of LTP mappings to the epsilon phone. It may fairly be assumed that every letter may be pronounced as an epsilon phone. Therefore, the LTP-pruning may prune LTP mappings with low posterior probabilities, except for LTP mappings to the epsilon phone. In addition, a flooring mechanism is set to provide a minimum posterior probability of LTP mappings to the epsilon phone. In one embodiment of the present invention, the flooring value is set to a very small value above zero.
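A sketch of the LTP-pruning step with this epsilon-phone exception; the flooring value shown is an assumption chosen only to illustrate a very small positive value.

```python
def prune_ltp(post, theta, epsilon="_", floor=1e-6):
    """LTP-pruning: drop mappings whose posterior probability falls below the
    threshold theta, except mappings to the epsilon phone, which are instead
    floored to a small positive value (the value 1e-6 is illustrative)."""
    pruned = {}
    for l, phones in post.items():
        kept = {}
        for p, prob in phones.items():
            if p == epsilon:
                kept[p] = max(prob, floor)   # never prune mappings to the epsilon phone
            elif prob >= theta:
                kept[p] = prob               # keep reliable mappings
        pruned[l] = kept
    return pruned
```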
Still another implementation issue regarding the method involves the position-dependent rearrangement. Using the above process, DTPMs may result that generate pronunciations such as:
AARON aa ae r ax n
which has an insertion of “ae” at the second letter “A.” After analyzing the aligned dictionary by the above alignment process, the following typical examples arose:
Notice that the first "A" in "AARON" is aligned to "_," whereas the second letter "A" in the word "AARONS" is aligned to "_." During the DTPM training process, the epsilon phone "_" may not have enough counts to force either the first "A" or the second "A" in "AARON" to produce an epsilon phone; the insertion problem illustrated above arises in such a situation. To address the problem, a position-dependent rearrangement process may be inserted into the above TTP method after step 3(c): if one of the aligned phonemes of two identical letters is an epsilon phone, the rearrangement process swaps the aligned phonemes as required to force the second output phoneme to be the epsilon phone. Table 1 sets forth exemplary pseudo-code for the rearrangement process.
Table 1—Exemplary Pseudo-Code for the Rearrangement Process where l[i] and p[i] are the letter and phone at position i in an aligned TTP pair, respectively.
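The following Python sketch illustrates the described swap rule; the function signature and the representation of an aligned entry as parallel letter and phone lists are assumptions made only for the sketch.

```python
def rearrange(letters, phones, epsilon="_"):
    """Position-dependent rearrangement: for two identical consecutive letters
    where only the first carries the epsilon phone, swap the aligned phonemes
    so that the second letter carries the epsilon phone (e.g. the "AA" in
    "AARON" becomes (aa, _) rather than (_, aa))."""
    phones = list(phones)
    for i in range(len(letters) - 1):
        if letters[i] == letters[i + 1]:
            if phones[i] == epsilon and phones[i + 1] != epsilon:
                phones[i], phones[i + 1] = phones[i + 1], phones[i]
    return phones

# Illustrative usage:
# rearrange(list("AARON"), ["_", "aa", "r", "ax", "n"]) -> ["aa", "_", "r", "ax", "n"]
```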
Since the estimated P̃(p|l) incorporates subjective prior probabilities, examining where large discrepancies exist between P̃(p|l) and the count-based estimate P(p|l)=C(l, p)/C(l) may reveal the following information.
Misspelled words: These words have small counts, and therefore a large discrepancy between P̃(p|l) and P(p|l) may be observed.
Abbreviations: Abbreviations usually require pseudo-phonemes. The number of abbreviations is not large, and therefore a large discrepancy between P̃(p|l) and P(p|l) may be observed.
Misspelled words and abbreviations that are not useful for training pronunciation models may be removed from the training dictionary to avoid these potential discrepancies. In this way, human knowledge of dictionary alignment can also be improved.
The mapping from spelling to the corresponding phoneme may be carried out using a decision-tree based pronunciation model (see, e.g., Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1992). One embodiment of a DTPM based on the TTP technique will now be described. The following conditions hold for the specific embodiment herein described. A single pronunciation is generated for each name. The decision trees are trained on the aligned pronunciation dictionary. A single tree is trained for each letter. A decision tree consists of internal nodes holding questions about context and leaves holding output phonemes. Training cases for the decision trees are composed of letters to the left and right of the current letter and of left phoneme classes (such as vowels and consonants). A training case for the current letter consists of the four letters to its left, the four letters to its right, the four phoneme classes of the four left letters, and the corresponding phoneme of the current letter.
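A minimal sketch of how such training cases might be assembled from an aligned entry; the padding symbol, the phone_class mapping and the record layout are illustrative assumptions, not part of the described embodiment.

```python
def training_cases(letters, phones, phone_class, context=4, pad="#"):
    """Build one training case per letter of an aligned entry: the four left
    letters, the four right letters, the phoneme classes of the four left
    phonemes, and the target phoneme of the current letter.

    phone_class : function mapping a phoneme to a class such as "vowel" or
                  "consonant" (assumed to be supplied by the caller)
    """
    cases = []
    for i, (l, p) in enumerate(zip(letters, phones)):
        left = [letters[i - k] if i - k >= 0 else pad for k in range(context, 0, -1)]
        right = [letters[i + k] if i + k < len(letters) else pad for k in range(1, context + 1)]
        left_classes = [phone_class(phones[i - k]) if i - k >= 0 else pad
                        for k in range(context, 0, -1)]
        cases.append({"letter": l, "left": left, "right": right,
                      "left_phone_classes": left_classes, "target": p})
    return cases
```

Cases built in this way would then be grouped by letter, one group per tree, consistent with training a single tree for each letter.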
In the described embodiment, training is performed in two phases. The first phase splits nodes into child nodes according to an information-theoretic optimization criterion (see, e.g., Quinlan, supra). The splitting continues until the optimization criterion cannot be improved further. The second phase prunes the decision trees by removing those nodes that do not contribute to the modeling accuracy. Pruning is desirable to avoid over-training and to maintain a certain generalization ability. Pruning also reduces the size of the trees, and therefore may be preferred for mobile telecommunication devices in which memory constraints are material. A reduced-error pruning (see, e.g., Quinlan, supra) is used for the second phase. Such reduced-error pruning will be called "DTPM-pruning" herein.
The phoneme sequence of a word is generated by applying the decision tree of each letter from left to right. First, the decision tree corresponding to the letter in question is selected. Then, questions in the tree are answered until a leaf is located. The phoneme stored in the leaf is then selected as the pronunciation of the letter. The process moves to the next letter. The phoneme sequence is constructed by concatenating the phonemes that have been found for the letters of the word. Pseudo-phonemes are split, and epsilon phones are removed from the final phoneme sequence.
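The left-to-right generation procedure might be sketched as follows; the tree interface (a predict method per letter tree) and the context tuple are assumptions made only for the sketch.

```python
def pronounce(word, trees, epsilon="_"):
    """Generate a phoneme sequence by walking one decision tree per letter,
    left to right. `trees` maps a letter to a tree object whose predict()
    method answers the context questions and returns the leaf phoneme.
    Pseudo-phonemes such as "k_s" are split and epsilon phones are dropped."""
    letters = list(word.lower())
    raw = []
    for i, l in enumerate(letters):
        context = (letters, raw, i)      # letter context and left phonemes found so far
        raw.append(trees[l].predict(context))
    output = []
    for p in raw:
        if p != epsilon:
            output.extend(p.split("_"))  # "k_s" -> ["k", "s"]; "f" stays ["f"]
    return output
```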
Having described a DTPM based on the TTP technique, the performance of an exemplary embodiment of the TTP technique of the present invention will be evaluated in the context of SIND in a mobile telecommunication device.
TTP mappings are trained on a so-called “pronunciation dictionary.” The acoustic models in the experiments were trained from the well-known Wall Street Journal (WSJ) database. The well-known CALLHOME American English Lexicon (PRONLEX) (see LDC, “CALLHOME American English Lexicon,” http://www.ldc.upenn.edu/) was also used. Since the task is name recognition, characters such as “.” and “'” were removed from the dictionary. Some English names were also added to the dictionary. The resulting dictionary had 96,500 entries with multiple pronunciations. A DTPM was then trained after TTP alignment of the pronunciation dictionary.
The name database, called WAVES, was collected in a vehicle, using an AKG M2 hands-free distant-talking microphone, in three recording conditions: parked (car parked, engine off), city driving (car driven on a stop-and-go basis) and highway driving (car driven at a relatively constant speed on a highway). In each condition, 20 speakers (ten of them male) uttered English names. The WAVES database contained 1325 English name utterances collected in cars.
The WAVES database was sampled at 8 kHz, with a frame rate of 20 ms. From the speech, 10-dimensional MFCC features and their delta coefficients were extracted. Because it was recorded using hands-free microphones, the WAVES database presented several severe mismatches:
The microphone is distant-talking and band-limited, as compared to the high-quality microphone used to collect the WSJ database.
A substantial amount of background noise is present due to the car environment, with SNR decreasing to 0 dB in highway driving.
Pronunciation variations of names exist, not only because different people often pronounce the same name in different ways, but also as a result of the data-driven pronunciation model.
Although not necessary to an understanding of the performance of the technique of the present invention, the experiment also involved a novel technique introduced in (application Ser. No. [Attorney Docket No. TI-39862P1], supra) and called “IJAC” to compensate for environmental effects on acoustic models.
Experiment 1: TTP as a Function of the Inner-Loop Iteration Number n
At convergence, some posterior probabilities become zero, for example, the posterior probability of “w_ah” given the letter “A.” This observation suggests that the TTP technique properly regularizes training cases for DTPM by removing some LTP mappings with low posterior probability.
Entropy may be used to measure the irregularity of LTP mapping. The entropy is defined as
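One standard form of such an entropy, consistent with the averaging described next, is the entropy of a letter's phoneme distribution; this restatement is a sketch, with the base of the logarithm left unspecified:

```latex
H(l) = -\sum_{p} \tilde{P}(p \mid l)\,\log \tilde{P}(p \mid l)
```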
Averaging over all LTP pairs, the averaged entropy at initialization was determined to be 0.78. After five iterations, the averaged entropy decreased to 0.57. This quantitative result showed that the TTP technique was able to regularize LTP mappings.
Experiment 2: TTP as a Function of the Outer-Loop Iteration Number r
Table 2 shows LTP mapping accuracy as a function of the iteration r for the un-pruned DTPMs.
Table 2 shows that, although the size of the DTPMs was smaller with increased outer-loop iteration, LTP accuracy was lower, and recognition performance degraded. A similar trend can be observed for a pruned DTPM that uses the DTPM-pruning process described above. This trend results from the fact that, at each iteration r, the LTP-pruning process may remove some LTP mappings with a posterior probability lower than the threshold θA. As the size of the training data decreases, the reliability of DTPM estimation decreases.
It is interesting to compare performance as a function of DTPM-pruning.
Given these observations, the pruned DTPMs with r=1 were selected for the experiments that will now be described.
Acoustic models were trained from the WSJ database. The acoustic models were intra-word, context-dependent, triphone models. The models were gender-dependent and had 9573 mean vectors. Mean vectors were tied by a generalized tied-mixture (GTM) process (see, U.S. patent application Ser. No. [Attorney Docket No. TI-39685], supra).
Two types of hidden Markov models (HMMs) were used in the following experiments. One HMM was a generalized tied-mixture HMM with an analysis of pronunciation variation, denoted “HMM-1.” Analysis of pronunciation variation was done by Viterbi-aligning multiple pronunciations of words (yielding statistics for substitution, insertion and deletion errors), tying those mean vectors that belonged to the models that generated the errors and then performing E-M training. Pronunciation variation was analyzed using the WSJ dictionary. The other HMM was a generalized tied-mixture HMM without analysis of pronunciation variation, denoted “HMM-2.” A mixture was tied to the other mixtures with the smallest distances from it. Although the total number of mean vectors was not increased, the average number of mean vectors per state increased from one to ten in these two types of HMMs.
Experiment 3: Performance as a Function of Probability Threshold θA
A probability threshold θA is used for LTP-pruning, i.e., removing LTP mappings with low a posteriori probability P(p|l). The larger the threshold θA, the fewer LTP mappings are allowed. This section presents results with a set of θA values using HMM-1. Experimental results are shown in Table 3, below, together with a plot of the recognition results in
Referring to Table 3, the size of the DTPM was decreased by increasing θA. Without the threshold (i.e., θA=0.0), LTP accuracy was 83.73%. By removing some unreliable LTP mappings with a non-zero θA (θA=0.00001), LTP accuracy increased to 88.73%. However, after a certain value of θA, e.g., θA=0.005, LTP accuracy decreased.
A certain range of θA exists in which the trained DTPM attains a lower WER. Compared to the WER with θA∈[0, 0.001], the WER with θA∈[0.003, 0.01] was lower. In the specific experiment set forth, setting θA=0.003 resulted in the lowest WER in all three driving conditions.
Experiment 4: Performance with Better Prior Knowledge of LTP Mapping
Experiments (using HMM-1) were then conducted with a view to improving the prior probability of a particular phoneme given a particular letter. In particular, some LTP mappings of Spanish origin, such as (J, y) and (J, hh), were removed by setting their prior probabilities to zero. Table 4 shows the results with the modified prior probabilities.
Compared to the results in Table 3, the following observations are made:
Better prior knowledge of LTP mapping is helpful in obtaining a smaller DTPM with better performance. In particular, removal of some Spanish pronunciations via the prior probabilities improves the performance of the DTPM. For instance, compared to the results in Table 3 with θA=0.0, the size of the DTPM was decreased from 244 Kbytes to 243 Kbytes, LTP accuracy was increased from 83.73% to 88.76%, and WER in all three driving conditions was decreased on average by 2.3%.
Above a certain value of θA, the prior probability may not have much effect on the performance of the DTPM. In the experiment, better prior knowledge had an effect on performance with θA∈[0, 0.001], but did not result in improved performance for larger θA. The observation may be due to the small amount of Spanish pronunciation in the training dictionary. This suggests that the proposed TTP technique does not rely much on human effort.
Experiment 5—Performance by Position-Dependent Rearrangement and a Set of Acoustic Models
Now, TTP performance with a position-dependent rearrangement process as described above will be analyzed. Table 5 shows LTP accuracy and memory size of trained DTPMs as a function of various thresholds θA. By comparison with Table 4, the following observations are made:
Given the same θA, the size of the trained DTPMs with the rearrangement process is smaller than that of the trained DTPMs without the rearrangement process. For example, with θA=0.003, the new DTPM is 224 Kbytes, whereas the DTPM in Table 4 is 231 Kbytes.
LTP accuracies are comparable. This observation suggests that the newly-added position-dependent rearrangement process achieves similar LTP performance with smaller memory. Therefore, the new process is useful for TTP.
Based on the newly aligned dictionary with the position-dependent rearrangement process, recognition experiments were performed with both HMM-1 and HMM-2 acoustic models. Tables 6 and 7 show the results with HMM-1 and HMM-2, respectively.
The following observations are made:
As observed in the previous recognition experiments, the recognition performance of the trained DTPMs is dependent on the threshold θA. For example, in the city driving condition in Table 6, the WER with θA=0.003 was 15% lower than the WER with θA=0.001. In Table 6, the WERs with θA=0.003 were 2% lower on average than the WERs with θA=0.002.
Although HMM-1 outperformed HMM-2 with θA∈[0, 0.001], the performance of HMM-2 was better than that of HMM-1 for θA∈[0.002, 0.005], the range in which both HMM-1 and HMM-2 achieved their lowest WERs. For instance, with θA=0.003, HMM-2 outperformed HMM-1 in all three driving conditions by 5%.
Considering both memory size and recognition performance, the DTPM with θA=0.003, used with HMM-2, yielded the best overall performance.
To achieve a good compromise between performance and complexity, it may be desirable to use a look-up table containing phonetic transcriptions of those names that are not correctly transcribed by the decision-tree-based TTP. The look-up table requires only a modest increase of storage space, and the combination of decision-tree-based TTP and look-up table may achieve high performance.
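A minimal sketch of the combination suggested here; the function names and the contents of the exception table are assumptions made only for illustration.

```python
def transcribe(name, dtpm_pronounce, exceptions):
    """Combine decision-tree TTP with an exception look-up table: names whose
    decision-tree transcription is known to be incorrect are served from the
    table, and all other names fall through to the DTPM."""
    if name in exceptions:
        return exceptions[name]
    return dtpm_pronounce(name)

# Illustrative usage (the table contents are hypothetical):
exceptions = {"JOLLA": ["hh", "ow", "y", "ah"]}
```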
Although the present invention has been described in detail, those skilled in the pertinent art should understand that they can make various changes, substitutions and alterations herein without departing from the spirit and scope of the invention in its broadest form.
The present invention is related to U.S. patent application Ser. No. 11/195,895 by Yao, entitled “System and Method for Noisy Automatic Speech Recognition Employing Joint Compensation of Additive and Convolutive Distortions,” filed Aug. 3, 2005, U.S. patent application Ser. No. 11/196,601 by Yao, entitled “System and Method for Creating Generalized Tied-Mixture Hidden Markov Models for Automatic Speech Recognition,” filed Aug. 3, 2005, and U.S. patent application Ser. No. [Attorney Docket No. TI-60051] by Yao, entitled “System and Method for Combined State- and Phone-Level Pronunciation Adaptation for Speaker-Independent Name Dialing,” filed ______, all commonly assigned with the present invention and incorporated herein by reference.