The present invention relates to speech recognition systems, and in particular, it relates to the employment of factors beyond speech content in such systems.
Pitch detection has been a topic of research for many years, and multiple techniques have been proposed in the literature. The nature of these techniques is usually strongly influenced by the application that motivates their development. Speech researchers have developed pitch detection techniques that work well for speech signals, but not necessarily for musical instruments. Similarly, music researchers have developed techniques that work better for music signals and not as well for speech signals. While some consider pitch detection to be a solved problem, others view it as an extremely challenging task. The former view is correct if one seeks only a rough estimate of the pitch, with speed and accuracy not important. If the application requires fast and accurate pitch tracking, however, and if the signal of interest has undetermined properties, then the problem of pitch detection remains unsolved. The most convincing example of such an application is the field of Automatic Speech Recognition. In spite of numerous improvements in front end signal processing in recent years, pitch information remains a feature not fully utilized in most state of the art speech recognizers. There are two main reasons for this. First, inaccurate pitch information actually degrades the performance of a speech recognition system, producing results worse than those obtained without using pitch information at all. Therefore, pitch-dependent speech recognition is only feasible if highly accurate pitch information is available. Second, speech recognition is most often implemented in applications requiring real time results, using only limited computational power. The speech recognition system itself usually takes most of the computational resources. Therefore, if a pitch detection algorithm is to be used to extract the pitch contour, this algorithm is required to run in a fraction of real time.
Thus, while the potential benefits of pitch-based speech recognition are clear, the art has not succeeded in providing an operable system to meet that need.
An aspect of the claimed invention is a method for employing pitch in a speech recognition engine. The process begins by building training models of selected speech samples, which in turn begins by analyzing each sample as a sequential series of frames, each frame having a selected duration and overlap with adjacent frames. A pitch estimate of each frame is detected and recorded, the pitch data is normalized, and the speech recognition parameters of the model are determined, after which the model is stored. Models are stored and updated for each of the set of training samples. The system is then employed to recognize the speech content of a subject, which begins by analyzing the subject as a sequential series of frames, each frame having a selected duration and overlap with adjacent frames. A pitch estimate for each frame is detected and recorded, and the pitch data is normalized. Speech recognition techniques are then employed to recognize the content of the subject, employing the stored models.
Pitch data normalization in the method set out immediately above can include the steps of calculating filterbank energies of each frame; determining a fundamental pitch of each frame; determining a harmonic density of each filterbank; dividing the filterbank energy by the harmonic density for each filterbank; and calculating mel-frequency cepstral coefficients for each frame.
Another aspect of the claimed invention is a method for employing pitch in a speech recognition engine, which begins by building training models of selected speech samples. The training model process begins by analyzing each sample as a sequential series of frames, each frame having a selected duration and overlap with adjacent frames. Then, a pitch estimate of each frame is detected, and each frame is classified into one of a plurality of pitch classifications, based on the pitch estimate. The speech recognition parameters of the sample are determined, and a separate model is stored and updated for each sample, for each preselected pitch range. The speech content of a subject is recognized by the system, commencing with a step of analyzing the subject as a sequential series of frames, each frame having a selected duration and overlap with adjacent frames. The system detects and records a pitch estimate for each frame, and it assigns a pitch classification to each voiced frame, based on the pitch estimate. Applying speech recognition techniques, the system recognizes the content of the subject, employing the set of models corresponding to the pitch classification.
FIGS. 5a and 5b show a method for normalizing speech data as incorporated into embodiments of the claimed invention.
FIGS. 6a and 6b illustrate experimental results achieved with embodiments of the claimed invention.
The following detailed description is made with reference to the figures. Preferred embodiments are described to illustrate the present invention, not to limit its scope, which is defined by the claims. Those of ordinary skill in the art will recognize a variety of equivalent variations on the description that follows.
The training stage 100 creates statistical models based on transcribed training data 102. The models may represent phonemes (subwords), words, or even phrases. Phonemes may be context dependent (bi-phones or tri-phones). Once the models are selected, their statistical properties are defined. For example, their PDF (Probability Density Function) can be modeled by a mixture of Gaussian PDFs. The number of mixtures, the dimension of the features, and the restriction on the transition among states (e.g. left-to-right) are all design parameters. An essential part of the training process is the “feature extraction” 104. This building block receives as input the wave data, divides it into overlapping frames, and for each frame generates a set of features, employing techniques such as Mel Frequency Cepstral Coefficients (MFCC), as known in the art. That step is followed by the model trainer 106, which employs conventional modeling techniques to produce a set of trained models.
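The framing step of feature extraction can be sketched as follows. This is a minimal illustration only: the function name split_into_frames, the 25 ms frame duration, the 10 ms hop, and the 16 kHz sample rate are all assumptions chosen for the example and are not taken from the description above.

```python
def split_into_frames(samples, frame_len, hop):
    """Split a sequence of samples into overlapping frames.

    frame_len and hop are given in samples; consecutive frames overlap by
    frame_len - hop samples. Trailing samples that do not fill a whole
    frame are dropped (an assumption of this sketch).
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop
    return frames


# Hypothetical settings: 25 ms frames with a 10 ms hop at 16 kHz.
sample_rate = 16000
frame_len = sample_rate * 25 // 1000   # 400 samples
hop = sample_rate * 10 // 1000         # 160 samples
signal = [0.0] * sample_rate           # one second of placeholder audio
frames = split_into_frames(signal, frame_len, hop)
```

Each resulting frame would then be passed to the per-frame feature computation.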
The testing, or recognition, stage 110 receives a set of speech data 112 to be recognized. For each input, the system performs feature extraction 114 as in the training process. Extracted features are then sent to the decoder (recognizer) 116, which uses the trained models to find the most probable sequence of models that correspond to the observed features. The output of the testing (recognition) stage is a recognized hypothesis for each utterance to be recognized.
A widely-employed embodiment of a feature recognition method 104 is the MFCC (Mel-Frequency Cepstral Coefficient) system illustrated in
The log of each Mel band energy is then taken and the Discrete Cosine Transform (DCT) of the mel-log-energy vector is calculated, at step 130. The resulting feature vector is the MFCC feature vector, at step 132. Mel-scale energy vectors are usually highly correlated. If the model prototypes are multi-dimensional Gaussian PDFs, a full covariance matrix and its inverse need to be calculated for every Gaussian mixture. This introduces a great deal of complexity to the calculation requirements. The DCT stage is known to de-correlate the features, and therefore their covariance matrix can be approximated by a diagonal matrix. In addition, the combination of log and DCT removes the effect of a constant gain from the features. This means x(t) and a*x(t) produce the same features. This is highly desirable since it removes the need to normalize each frame before feature extraction.
A sample calculation follows:
Let x(t) be the time signal and let m1, m2, . . . be the filterbank energies, so that x(t)→[m1, m2, m3 . . .]
Since the FFT is linear and each filterbank energy is quadratic in the spectral magnitude,
a×x(t)→a²×[m1, m2, m3, . . .] (1)
Taking the log produces:
2 log (a)+log ([m1, m2, m3, . . .]) (2)
The 2 log (a) term acts as a DC bias with respect to the filter bank dimension. Therefore, after taking the DCT, 2 log (a) only appears in the zero-th Cepstral coefficient C0 (the DC component). This coefficient is usually ignored in the features.
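The gain invariance described above can be checked numerically. The sketch below is illustrative only: the energy values, the gain, and the function names dct_ii and cepstrum are assumptions, and an unnormalized DCT-II is used, so the shift in C0 is N×2 log (a) for N filterbanks rather than 2 log (a) itself.

```python
import math


def dct_ii(values):
    """Unnormalized DCT-II: C[k] = sum_n v[n] * cos(pi * k * (n + 0.5) / N)."""
    n_vals = len(values)
    return [
        sum(v * math.cos(math.pi * k * (n + 0.5) / n_vals)
            for n, v in enumerate(values))
        for k in range(n_vals)
    ]


def cepstrum(energies):
    """Take the log of each filterbank energy, then the DCT of the result."""
    return dct_ii([math.log(m) for m in energies])


energies = [4.0, 2.5, 1.2, 0.7, 0.3]        # hypothetical m1..m5
gain = 3.0                                   # the constant a
scaled = [gain ** 2 * m for m in energies]   # energies of a*x(t), per equation (1)

c_ref = cepstrum(energies)
c_scaled = cepstrum(scaled)

# Only the zero-th coefficient changes; all higher coefficients agree.
c0_shift = c_scaled[0] - c_ref[0]
higher_match = all(abs(p - q) < 1e-9 for p, q in zip(c_ref[1:], c_scaled[1:]))
```

Dropping C0 from the feature vector therefore leaves features that do not depend on the gain.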
Speech consists of phonemes (sub-words). Various phonemes and their categories in American English are provided by the TIMIT database commissioned by DARPA, with participation of companies such as Texas Instruments and research centers such as Massachusetts Institute of Technology (hence the name). The database is described in the DARPA publication, The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT).
Phonemes can also be classified into voiced phonemes and unvoiced phonemes. Voiced phonemes are generally vowel sounds, such as /a/ or /u/, while unvoiced phonemes are generally consonants, such as /t/ or /p/. Unvoiced phonemes have no associated pitch information, so no pitch calculation is possible for them. The system must recognize unvoiced samples, however, and make provision for dealing with them. Voiced phonemes (such as /aa/, /m/, /w/, etc.) are quasi-periodic signals and contain pitch information. As known in the art, such quasi-periodic signals can be modeled with a convolution in the time domain or a multiplication in the frequency domain:
s(t)=(e*h)(t)→S(f)=E(f)H(f) (3)
Here, s(t) is the time domain speech signal, e(t) is the pitch-dependent excitation signal that can be modeled as a series of pulses, and h(t) is the pitch-independent filter that contains the phoneme information. In the frequency domain, E(f) is a series of deltas equally spaced at the fundamental frequency. S(f) therefore consists of samples of H(f) at harmonics of the fundamental (pitch) frequency. The observation of S(f) is therefore dependent on the pitch estimate. The analytical goal is to explore how knowledge of pitch can help to better recognize the underlying H(f), which contains the phoneme information.
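The dependence of the observation on pitch can be illustrated with a small sketch. The envelope shape, its 700 Hz resonance, and the function names are hypothetical and chosen only to make the effect visible: the same H(f) is sampled at different frequencies when the pitch changes.

```python
def envelope(freq_hz):
    """Hypothetical smooth envelope |H(f)| with a single resonance at 700 Hz."""
    return 1.0 / (1.0 + ((freq_hz - 700.0) / 200.0) ** 2)


def harmonic_samples(f0_hz, max_freq_hz):
    """Return (frequency, |H(f)|) pairs at each harmonic k*f0 up to max_freq_hz."""
    count = int(max_freq_hz // f0_hz)
    return [(k * f0_hz, envelope(k * f0_hz)) for k in range(1, count + 1)]


low_pitch = harmonic_samples(100.0, 1000.0)    # harmonics at 100, 200, ..., 1000 Hz
high_pitch = harmonic_samples(200.0, 1000.0)   # harmonics at 200, 400, ..., 1000 Hz

# The 100 Hz voice places a harmonic exactly on the resonance peak;
# the 200 Hz voice straddles it and never observes the peak value.
peak_seen_low = max(value for _, value in low_pitch)
peak_seen_high = max(value for _, value in high_pitch)
```

In this toy case the lower-pitched voice observes the full peak while the higher-pitched voice sees only about 80% of it, even though the underlying envelope is identical.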
An important question is how additional pitch information, and the manner of using it in a speech recognition system, affects the system's accuracy. As known in the art, the accuracy of a speech recognition system depends on a variety of factors. Improving the quality of features improves the system and brings closer the achievement of a context-independent, speaker-independent and highly accurate speech recognition system. However, in small systems with limited vocabulary, the use of language models and context dependency may mask the direct gains made by improvements in features.
Table 1 shows various measures of accuracy using the TIMIT database. Frame level recognition does not use any context dependency or language model. It represents the number of frames correctly classified as a phoneme using a single mixture 12-dimensional Gaussian PDF modeling 12-dimensional MFCC features. The accuracy represented by this number depends significantly on the quality of the features. We therefore use the frame-level recognition rate here. We use the TIMIT database with phoneme-level labels. Only voiced phonemes are considered, and each of the 34 voiced phonemes is modeled with a single mixture Gaussian PDF.
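Frame-level recognition with one Gaussian per phoneme can be sketched as below. The diagonal covariance, the two-dimensional toy features, the two phoneme labels, and all numeric values are assumptions made to keep the example short; the description above uses 12-dimensional MFCC features and 34 voiced phonemes.

```python
import math


def log_likelihood(features, mean, variance):
    """Log density of a diagonal-covariance Gaussian at the feature vector."""
    total = 0.0
    for x, mu, var in zip(features, mean, variance):
        total += -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
    return total


def classify_frame(features, models):
    """Return the label whose Gaussian gives the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(features, *models[label]))


def frame_accuracy(frames, labels, models):
    """Fraction of frames whose predicted label matches the reference label."""
    correct = sum(classify_frame(f, models) == y for f, y in zip(frames, labels))
    return correct / len(frames)


# Hypothetical models: label -> (mean vector, variance vector).
models = {
    "aa": ([0.0, 0.0], [1.0, 1.0]),
    "iy": ([3.0, 3.0], [1.0, 1.0]),
}
frames = [[0.2, -0.1], [2.9, 3.3], [1.0, 1.2]]
labels = ["aa", "iy", "iy"]
accuracy = frame_accuracy(frames, labels, models)
```

No language model or context is involved, so this measure reflects mainly how well the features separate the phonemes.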
Since the observation S(f) and therefore the features extracted from it are affected by the value of the pitch, one way to use knowledge of pitch is to train and use “pitch-dependent models”. This concept is similar to the highly researched topic of “gender-dependent models” in which different models are trained and used for male and female speakers. Gender-dependent models have been shown to improve the recognition accuracy. However, their use requires knowledge of the gender of the speaker.
In the embodiment under discussion, pitch is employed to classify the data into one of a number of pitch classes or bins. The number of classes or bins for a given application will be chosen by those in the art as a tradeoff between accuracy (more bins produce greater accuracy) and computational resources (more bins require more computation). Systems employing two and three bins have proved effective and useful, while retaining good operational characteristics. Note that pitch classification includes dealing with unvoiced phonemes.
During the test, or recognition, phase 320, a similar parallel operation occurs, with pitch detection step 330 detecting the pitch employing the same weighting or calculating algorithm as was used for the training data. That pitch information is fed to pitch selection step 328, where the value is used to select the appropriate model from among the sets of pitch-dependent models built during the training phase. Thus, when the model data is fed to recognizer step 326, the model employed is not a generic dataset, as is the case with the prior art, but a model that matches the test data in pitch classification.
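Pitch classification and model selection can be sketched as follows. The 165 Hz boundary, the reservation of class 0 for unvoiced frames, and the names pitch_class, select_models and model_sets are assumptions made for illustration; the description does not specify bin boundaries or how unvoiced frames are assigned.

```python
import bisect


def pitch_class(f0_hz, boundaries_hz):
    """Map a pitch estimate to a class index.

    Class 0 is reserved for unvoiced frames (f0_hz is None or not positive).
    Voiced frames get 1 plus the index of the bin that contains f0_hz, where
    boundaries_hz is a sorted list of bin edges.
    """
    if f0_hz is None or f0_hz <= 0:
        return 0
    return 1 + bisect.bisect_right(boundaries_hz, f0_hz)


def select_models(f0_hz, boundaries_hz, model_sets):
    """Pick the set of pitch-dependent models matching the frame's pitch class."""
    return model_sets[pitch_class(f0_hz, boundaries_hz)]


boundaries = [165.0]   # hypothetical two-bin split of voiced frames
model_sets = {
    0: "unvoiced-models",
    1: "low-pitch-models",
    2: "high-pitch-models",
}
chosen = select_models(210.0, boundaries, model_sets)
```

The same boundaries would be used in training and recognition so that a frame is scored against models built from data in the same pitch class.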
The dramatic improvement in accuracy is easily seen in
Although the embodiment of
Pitch provides considerably increased accuracy, as seen above, but in conventional systems that accuracy is obtained at a cost. First, training conventional, complicated models entails handling a large number of Gaussian Mixtures, which imposes significant computational overhead. Further, such training requires additional training data, which must be gathered and conditioned for use. The embodiment of
An embodiment of a method for achieving that result is shown in
The results of such a calculation are shown in Table 2. Each row shows a different filter bank in the Mel scale. The first column shows the frequency range for that filter bank, the second column shows the number of harmonics in that filter bank for a 150 Hz signal, and the third column shows the number of harmonics for a 200 Hz signal.
It should be noted that each bin is scaled by a non-constant factor due to this pitch difference imposed by conversion to the Mel scale.
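The harmonic count per filter bank can be sketched as below. The band edges are hypothetical, non-overlapping ranges that widen with frequency in the manner of a Mel scale; they are not the ranges of Table 2, and real Mel filterbanks typically use overlapping filters.

```python
def count_harmonics(f0_hz, low_hz, high_hz):
    """Number of harmonics k*f0 (k >= 1) with low_hz <= k*f0 < high_hz."""
    count = 0
    k = 1
    while k * f0_hz < high_hz:
        if k * f0_hz >= low_hz:
            count += 1
        k += 1
    return count


# Hypothetical band edges in Hz, widening with frequency.
bands = [(0.0, 200.0), (200.0, 450.0), (450.0, 800.0), (800.0, 1300.0)]

density_150 = [count_harmonics(150.0, lo, hi) for lo, hi in bands]
density_200 = [count_harmonics(200.0, lo, hi) for lo, hi in bands]
```

With these assumed bands the counts are [1, 1, 3, 3] at 150 Hz and [0, 2, 1, 3] at 200 Hz, so the ratio between the two pitches differs from band to band, and the first band contains no harmonic at all at 200 Hz.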
FIG. 5b illustrates a process 500 for normalizing the pitch data. First, in step 502, the filterbank energies are calculated for each bin, as shown above, producing [m1, m2, m3, . . .]. Then, the fundamental pitch f0 is determined, step 504, as also described above, with provision being made for unvoiced (pitchless) phonemes in frames. That information allows the calculation of the harmonic density, Di=number of harmonics of f0 in the ith bin, step 506. Step 508 normalizes the filterbank energies by the number of harmonics present, so that for each filterbank Mi=mi/Di. Note that if no harmonics are present in a bin, the system can interpolate with adjacent bins. Typically that measure is only required in the first filter bank. At that point, sufficient data is available to allow computation of the MFCC as known in the art, by taking the log and DCT of the normalized energy vector.
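The normalization step can be sketched as follows. The energy and density values are hypothetical, and the handling of empty bins (averaging the normalized values of the immediately adjacent bins) is only one possible reading of "interpolate with adjacent bins".

```python
import math


def normalize_energies(energies, densities):
    """Divide each filterbank energy by its harmonic density, Mi = mi / Di.

    Bins with no harmonics (Di == 0) are filled with the average of the
    normalized values of their immediate neighbors.
    """
    normalized = [m / d if d > 0 else None for m, d in zip(energies, densities)]
    for i, value in enumerate(normalized):
        if value is None:
            neighbors = [
                normalized[j]
                for j in (i - 1, i + 1)
                if 0 <= j < len(normalized) and normalized[j] is not None
            ]
            if not neighbors:
                raise ValueError("no adjacent bin with harmonics to interpolate from")
            normalized[i] = sum(neighbors) / len(neighbors)
    return normalized


energies = [0.9, 4.0, 1.5, 6.0]   # hypothetical m1..m4
densities = [0, 2, 1, 3]          # hypothetical D1..D4; the first bin is empty

normalized = normalize_energies(energies, densities)
log_energies = [math.log(m) for m in normalized]   # input to the DCT stage
```

In this toy case the first bin has no harmonics and takes the value of its only neighbor, giving normalized energies of [2.0, 2.0, 1.5, 2.0].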
Another embodiment employs analysis techniques to achieve improvements over simple normalization. Drawing upon techniques similar to those presented in the study by Xu Shao and Ben Milner, entitled “Predicting Fundamental Frequency from mel-Frequency Cepstral coefficients to Enable Speech Reconstruction,” published in the Journal of the Acoustical Society of America in August 2005 (p. 1134-1143), here one can adjust the density and location of the harmonics found in each filterbank, making both parameters correspond to those of a preselected pitch value.
The process of
Some embodiments of the claimed invention can be combined with the system of
It is contemplated that modifications and combinations will readily occur to those skilled in the art, which modifications and combinations will be within the spirit of the invention and the scope of the following claims.
This application claims the benefit of U.S. provisional patent application No. 60/884,196 entitled “Harmonic Grouping Pitch Detection and Application to Speech Recognition Systems,” filed on Jan. 9, 2007. That application is incorporated by reference for all purposes.
Number | Date | Country
---|---|---
60884196 | Jan 2007 | US