The preferred embodiments of the invention will be described in detail with reference to the accompanying drawings, wherein:
From Bingxi Wang, Speech Coding, Xidian University Press, China, July 2002, it is known (as shown in Table 1) that, for the same consonant, the positions of the corresponding formants vary among the formant frequencies of the six vowels of Mandarin, as well as among speakers (such as between male and female speakers), in a way that is not always consistent with the variation of their vocal tract lengths (VTLs). Considering that the vocal tract length does not vary when the same person utters different vowels, and that what varies is only the vocal tract shape (such as the degree of lip-rounding and the like), it follows that the variation of the formant positions is not completely consistent with the variation of the vocal tract length among different speakers and among different vowels. That is because different vocal tract shapes produce different responses to the same frequency. Therefore, different vocal tract shapes lead to different frequency warps, and different frequency warps in turn produce different formant positions. The formant positions, as influenced by the vocal tract shape, thus play an important role in determining the speaker group.
Nevertheless, although the variation of the formant positions discussed above reflects the vocal tract shape to some extent, the formants mainly reflect the vocal tract length. In the conventional parametric approach discussed in the Technical Background, it is the formant frequency that is used, which only (or mainly) reflects the vocal tract length and fails to reflect the vocal tract shape well. The inventors have further discovered that the pitch may reflect the vocal tract shape very well. Therefore, the basic idea of the invention is to consider both the vocal tract length and the vocal tract shape, that is, both the formant frequency and the pitch, in the parametric approach for recognizing the speaker.
Speech Data Processing Method
Below we will discuss in detail the implementation of these steps. The feature extracting step 101 may be realized with a pitch extractor and a formant frequency extractor. Obviously, in the feature extracting step, extracting the pitch and extracting the formant frequency may be performed in any sequence or simultaneously.
The pitch extractor may adopt any method for estimating pitch. For example, a method similar to the method for estimating pitch as described in the following reference may be adopted: D. Chazan, M. Zibulski, R. Hoory, and G. Cohen, “Efficient Periodicity Extraction Based on Sine-wave Representation and its Application to Pitch Determination of Speech Signals”, EUROSPEECH-2001, Sep. 3-7, 2001, Aalborg Denmark. In that reference, the speech signal is modeled as a finite sum of sine-waves, with time-varying amplitudes, phases and frequencies. Assuming the speech signal x(t) can be approximated by a finite sum of sine-waves, let the approximation of x(t) be of the form:
x(t) = Σ_{i=1}^{N} α_i sin(2π f_i t + φ_i)    (1)

where {α_i, f_i, φ_i}_{i=1}^{N} are the N sine-wave amplitudes (positive and real), frequencies (in Hz) and phase offsets (in radians), respectively.
Its Fourier transform and the utility function are as follows:
For each candidate fundamental frequency f, the comb function c_f(v) is defined such that it attains its maximum values at the arguments v = f, 2f, 3f, . . . , corresponding to the candidate pitch harmonics.
The frequency F0 which maximizes the utility function (3) is selected as the fundamental frequency (pitch) of the signal x(t).
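As an illustration of the comb-based selection described above, the following is a minimal Python sketch that scores each candidate fundamental frequency by summing the magnitude spectrum at its harmonics and keeps the best-scoring candidate. It is a simplification, not the exact sine-wave representation or utility function of the cited reference, and the frequency range and number of harmonics are illustrative assumptions.

```python
import numpy as np

def estimate_pitch_comb(x, fs, f_min=60.0, f_max=400.0, n_harmonics=8):
    """Score each candidate F0 by the spectral magnitude collected at its
    harmonics f, 2f, 3f, ... (a simple comb), and return the best candidate."""
    window = np.hanning(len(x))
    spectrum = np.abs(np.fft.rfft(x * window))
    bin_width = fs / len(x)                      # spacing of the rfft bins in Hz

    best_f0, best_score = f_min, -np.inf
    for f0 in np.arange(f_min, f_max, 1.0):      # candidate fundamental frequencies
        harmonics = f0 * np.arange(1, n_harmonics + 1)
        harmonics = harmonics[harmonics < fs / 2.0]
        bins = np.round(harmonics / bin_width).astype(int)
        score = spectrum[bins].sum()             # comb "utility" of this candidate
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0
```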
As mentioned above, the pitch extracting step and the pitch extractor may adopt any known method or any method developed in the future. For example, any of the following methods may be adopted:
I. AMDF (Average Magnitude Difference Function)
This technique, a variation of autocorrelation analysis, uses a difference signal formed by delaying the input speech by various amounts and subtracting the delayed waveform from the original. The difference signal is always zero at delay=0 and exhibits deep nulls at delays equal to the pitch period of a voiced sound having a quasi-periodic structure. Unlike the computation of the autocorrelation function, the AMDF calculation requires no multiplications, a desirable property for real-time speech processing.
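A minimal sketch of an AMDF-based pitch estimator is given below; the frame is assumed to be longer than the largest lag of interest, and the search range of 60-400 Hz is an illustrative assumption.

```python
import numpy as np

def amdf_pitch(frame, fs, f_min=60.0, f_max=400.0):
    """AMDF pitch estimate: the average |x(n) - x(n-k)| shows deep nulls at
    delays k equal to the pitch period; no multiplications are required."""
    lag_min, lag_max = int(fs / f_max), int(fs / f_min)
    amdf = np.array([np.mean(np.abs(frame[k:] - frame[:-k]))
                     for k in range(lag_min, lag_max + 1)])
    best_lag = lag_min + int(np.argmin(amdf))    # delay with the deepest null
    return fs / best_lag                         # pitch estimate in Hz
```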
II. NCCF (Normalized Cross Correlation Function)
The normalized cross correlation function (NCCF) is defined as follows:
Given a frame of sampled speech s(n), 0 ≤ n ≤ N−1, then:
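The sketch below computes the NCCF of a frame following the commonly used definition (for example, in Talkin's RAPT pitch tracker): the cross-correlation between the frame and its delayed version, normalized by the energies of the two segments. The lag range is an illustrative assumption.

```python
import numpy as np

def nccf(s, lag_min, lag_max):
    """Normalized cross-correlation of a speech frame s(n), 0 <= n <= N-1;
    values near 1.0 appear at lags equal to the pitch period."""
    N = len(s)
    values = np.zeros(lag_max - lag_min + 1)
    for i, k in enumerate(range(lag_min, lag_max + 1)):
        a, b = s[:N - k], s[k:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12  # energy normalization
        values[i] = np.dot(a, b) / denom
    return values
```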
Also, the formant frequency extracting step and the formant frequency extractor 104 may adopt any known method or any method developed in the future.
One possible method is formant estimation based on LPC (linear prediction coefficients). The transfer function of the LPC model is H(z) = 1/A(z) = 1/(1 − Σ_{k=1}^{M} a_k z^{−k}), where A(z) is the prediction polynomial obtained from a speech waveform and a_k are the linear prediction coefficients.
By solving the equation A(z)=0, M/2 pairs of conjugate complex roots (z_i, z_i*) are obtained:

z_i = r_i e^{jθ_i}
z_i* = r_i e^{−jθ_i}    (6)

where r_i is the root modulus and θ_i is the angle.
A number of standard references on speech processing provide the following transformation from the complex roots (6) and the sampling period Ts to the formant frequency F and the bandwidth B.
F_i = θ_i / (2π T_s)
B_i = |log r_i| / (π T_s)    (7)
In the general case, five formant components are enough for formant analysis. Therefore, M is often set to 8-10, so that 4-5 formant components are obtained. Obviously, more or fewer formant components may be used. Nevertheless, at least one formant component should be used for constituting the feature space.
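A minimal sketch of LPC-based formant estimation is shown below: the LPC coefficients are obtained by the autocorrelation method, the roots of A(z) are found, and each root is converted to a formant frequency and bandwidth according to equation (7). The prediction order of 10 corresponds to the 4-5 formant components mentioned above.

```python
import numpy as np
from scipy.linalg import toeplitz

def lpc_formants(frame, fs, order=10):
    """Estimate formant frequencies F_i and bandwidths B_i from the roots of
    the LPC prediction polynomial A(z) = 1 - sum_k a_k z^{-k}."""
    # Autocorrelation (Yule-Walker) solution for the LPC coefficients a_k.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.linalg.solve(toeplitz(r[:order]), r[1:order + 1])

    # Roots of A(z); keep one root of each conjugate pair (positive angle).
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]

    Ts = 1.0 / fs                                      # sampling period
    theta = np.angle(roots)
    F = theta / (2.0 * np.pi * Ts)                     # formant frequencies, eq. (7)
    B = np.abs(np.log(np.abs(roots))) / (np.pi * Ts)   # bandwidths, eq. (7)
    idx = np.argsort(F)
    return F[idx], B[idx]
```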
Then a feature space comprising the pitch F0 and at least one formant frequency (such as F1 to F4) is obtained. The feature space so constituted, comprising the pitch and at least one formant frequency, may be used directly for speaker classification.
In a preferred embodiment, the feature space is decorrelated and an orthogonal feature space is thus obtained. As discussed above, although F0 reflects the vocal tract shape very well, the formant positions are also affected by the vocal tract shape to some extent. That is to say, F0 and the formant frequencies are correlated with each other to some extent. Therefore, in order to eliminate said correlation on the one hand, and to reduce the computational load on the other hand, said feature space is decorrelated, so that the correlation among the features may be eliminated and the number of dimensions of the feature space may be decreased. The decorrelating operation in the feature space constructing step may adopt any decorrelation method; one example is PCA (principal component analysis). The components corresponding to the first several eigenvalues in the result of the PCA may be selected to form a basic set as a sub-space. For example, in this embodiment, three components may be selected. That is, the 5-dimensional feature space given as an example above may be reduced to 3 dimensions, and a 3-dimensional feature set ν_p(t)=Aν_f(t) may thereby be obtained, where A is the PCA matrix. Other examples of decorrelation methods include the K-L (Karhunen-Loève) transform, singular value decomposition (SVD), the DCT transform, etc.
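A minimal sketch of the PCA-based decorrelation is given below, assuming the per-frame feature vectors ν_f(t) = [F0, F1, ..., F4] are stacked into a matrix; projecting onto the three leading principal components yields ν_p(t) = Aν_f(t).

```python
import numpy as np

def pca_decorrelate(V, n_components=3):
    """Decorrelate feature vectors (rows of V, e.g. [F0, F1, F2, F3, F4]) and
    keep the leading principal components; returns the reduced features and
    the PCA matrix A."""
    Vc = V - V.mean(axis=0)                       # center the features
    cov = np.cov(Vc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigen-decomposition of the covariance
    order = np.argsort(eigvals)[::-1]             # sort by decreasing eigenvalue
    A = eigvecs[:, order[:n_components]].T        # PCA matrix A (n_components x n_dims)
    return Vc @ A.T, A
```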
In the training step 103, the classifier of the prior-art parametric approach may also be adopted. According to the prior-art parametric approach, first, unsupervised classification is conducted through clustering, with the number of classes either appointed by the user according to his/her experience or determined directly by the clustering algorithm. Then, the classified training samples are described with a GMM model. In addition, the invention also provides a preferred classification method: supervised classification. A typical supervised classification may directly adopt a GMM model, that is, directly describe the classifier using diagonal Gaussian probability densities.
The supervised classification requires determining a priori the classes of the speech samples. That is, it is necessary to label the classes of the speech samples in advance (not shown). The labeling step may be performed at any time before the training step 103, even before the feature extracting step 101.
To render the classification more accurate, the inventors have further improved the classification technique. In the conventional parametric approach, a blind search is conducted without considering the influence of the phonetic units on the vocal tract; that is, the classification is conducted simply according to "male", "female" or the like. However, as mentioned above, for the same person and the same consonant, the vocal tract shape varies among different vowels. Therefore, in the invention, prior phonetic knowledge is introduced into the training of the classifier model for speaker normalization. Specifically, in the invention, different vowels, such as a plurality of or all of the vowel phonemes a, o, e, i, u, ü in Mandarin, may be used as classification criteria. For example, the classification may be conducted according to the combination of "male" or "female" with all six vowel phonemes, thus obtaining 2×6=12 classes, with each class corresponding to a respective warping factor (without excluding the case where the warping factors of some classes have the same value). Alternatively, the speech samples may be classified, according to the combination of "aged", "adult" or "children" with the vowel phonemes, into 3×6=18 classes. Through such a definition, classes are formed corresponding to different combinations of different speaker types and different vowel phonemes, and the accuracy of the classification may be improved. After classification based on the vowel phonemes, the training speech samples are labeled with their respective classes.
The labeling of the speech samples may be performed through Viterbi alignment. The alignment manner is not limited to Viterbi alignment, and other alignment methods may be adopted, such as manual alignment, DTW (dynamic time warping), etc. The step of labeling the training speech samples with their respective classes reflecting the speaker may be performed at any time before the training step, even before the feature extracting step.
Then, after labeling the classes of the speech samples for training the classifier, supervised training of the classifier may be performed in Step 103. The training may be conducted through the expectation-maximization (EM) algorithm based on the labels obtained above, or through any other algorithm. In the training, the speaker normalization classifier, such as a GMM classifier, is trained using the vector set ν_p(t). Apparently, the classifier is not limited to a GMM classifier, but may be any supervised classifier known to those skilled in the art, such as an NN (neural network) classifier, an SVM (support vector machine) classifier and the like.
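A minimal sketch of the supervised training is given below, using scikit-learn's GaussianMixture with diagonal covariances as the GMM classifier: one model is trained per labeled speaker class (for example, the 12 classes obtained from {male, female} × the six vowel phonemes), and an utterance is later assigned to the class with the highest log-likelihood. The dictionary of labeled feature matrices is an assumed input prepared by the labeling and feature extraction steps.

```python
from sklearn.mixture import GaussianMixture

def train_speaker_class_gmms(features_by_class, n_mix=4):
    """Train one diagonal-covariance GMM per speaker class with the EM algorithm.
    features_by_class maps a class label to an (n_frames, n_dims) array of
    decorrelated features v_p(t)."""
    models = {}
    for label, X in features_by_class.items():
        gmm = GaussianMixture(n_components=n_mix, covariance_type="diag", max_iter=100)
        models[label] = gmm.fit(X)
    return models

def classify_speaker(models, X):
    """Assign the features X of an utterance to the class whose GMM gives the
    highest average log-likelihood."""
    scores = {label: gmm.score(X) for label, gmm in models.items()}
    return max(scores, key=scores.get)
```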
Thus, through the steps discussed above, the training of the speaker classifier is completed.
The classification of the speech data using the classifier trained according to the method as discussed above will be described below.
As shown in
In the recognition step 203, using the features of the speech data to be recognized extracted in Steps 201 and 202, the speech data is classified into an appropriate class by the classifier obtained in the training stage, and the speaker class recognition is thus completed.
In a preferred embodiment, if the feature space is decorrelated in the training stage, then the feature space in the recognition stage should also be decorrelated correspondingly.
As mentioned above, the training stage and the recognition stage according to the invention belong to the first stage of the speaker normalization. For understanding the invention better, the spectrum normalization stage will be described below.
In general, spectrum normalization aims to eliminate the differences in characteristics among speaker classes by using the characteristics of the different speaker classes, and it may be achieved in many ways. One remarkable difference among speaker classes is the variation of the spectrum width. Therefore, the spectrum width should be made consistent before content recognition in order to recognize the speech content accurately. Consequently, at present, the main means for normalizing the speaker classes is extension or compression of the speech spectrum.
In such a situation, the warping factor is the extension ratio or the compression ratio.
When the class is determined, the warping factor may be obtained by various means. For example, the warping factor may be an empirical value, which may be obtained statistically. For the present invention, the warping factor may be obtained externally, or a priori. After the speaker classification according to the invention, the normalization stage may directly utilize the warping factor corresponding to the speaker class.
In a preferred embodiment of the invention, a step for obtaining the spectrum normalization factor (not shown) may be added after the training step, that is, for assigning a respective spectrum normalization factor to each speaker class. Correspondingly, in the recognition stage, once the speech data is classified into a certain class, the warping factor that should be used is also determined.
Generally, the value of the warping factor ranges from 0.8 to 1.2. When determining the warping factors of respective classes, linear or non-linear method may be adopted.
The linear method consists in equi-partitioning the value range 0.8-1.2. For example, in the GMM model, the probability density function is calculated for the speech samples, and each class corresponds to a range of the probability density function. In general, the probability density function ranges from 0 to 1. The ranges of the probability density functions of all the classes are sorted, the value range 0.8-1.2 is equi-partitioned correspondingly and mapped onto the sorted ranges of the probability density functions, and thus the warping factors of the respective classes are obtained. Taking 12 classes as an example, in ascending order of the ranges of the probability density functions of the respective classes, the warping factors are respectively: 0.8, 0.8+(1.2−0.8)/12, 0.8+2*(1.2−0.8)/12, . . . , 0.8+11*(1.2−0.8)/12.
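The equi-partition above can be written as the following small helper; for 12 classes it reproduces the sequence 0.8, 0.8+(1.2−0.8)/12, ..., 0.8+11*(1.2−0.8)/12.

```python
def linear_warping_factors(n_classes=12, low=0.8, high=1.2):
    """Assign warping factors by equi-partitioning [low, high) over the classes,
    taken in ascending order of their probability-density ranges."""
    step = (high - low) / n_classes
    return [low + i * step for i in range(n_classes)]
```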
In the non-linear method, the value range of the warping factor is not divided equidistantly, so that the warping factor is more accurate for the respective classes. Among the non-linear mapping methods, there is a probability function method adopting grid searching. That is, 13 grid points are set in the value range (0.8-1.2) of the warping factor α; each candidate warping factor α is tried to obtain its degree of matching with the HMM acoustic model, and the warping factor having the highest matching degree is taken as the warping factor α of the current speaker class. Specifically, the following mapping function may be defined:

α_i = argmax_α Pr(X_i^α | λ, W_i)

where λ is the HMM acoustic model, W_i is the transcript of the corresponding speech, and X_i^α is the speech obtained by normalizing the speech samples of the speakers of class i using the warping factor α. The function Pr( ) calculates the degree of matching (the value of the likelihood function) with the HMM acoustic model λ.
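The grid search can be sketched as follows; `warp_spectrum` and `score_against_hmm` are hypothetical callbacks standing in for the chosen warping rule and for the likelihood Pr(X^α | λ, W) computed against the HMM acoustic model, since no concrete acoustic-model API is specified here.

```python
import numpy as np

def search_warping_factor(speech, transcript, warp_spectrum, score_against_hmm,
                          low=0.8, high=1.2, n_grid=13):
    """Try 13 candidate warping factors in [0.8, 1.2] and keep the one whose
    normalized speech best matches the HMM acoustic model."""
    candidates = np.linspace(low, high, n_grid)
    scores = [score_against_hmm(warp_spectrum(speech, alpha), transcript)
              for alpha in candidates]
    return candidates[int(np.argmax(scores))]
```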
Of course we may also define a quadratic function or high-order function as shown in
Thus it can be seen that, by using the solution of the invention, both the pitch and the formant frequency may be used to train the speaker classifier better, thereby obtaining classes that better reflect the characteristics of the vocal tract (including both length and shape). Consequently, the speech spectrum may be normalized more accurately and the content recognition rate may be improved.
In practice, the first utterance is generally used to detect the speaker cluster for the coming task session. During one session, the speaker usually does not change, so the same warping factor can be shared. However, the normalization of the speech samples using the warping factor obtained according to the invention, and the content recognition based on the normalized speech samples, are not within the scope of the invention.
In the preferred embodiments discussed above, different warping factors may be obtained for different vowels. If the computing power of the speech recognition equipment permits, the normalization may be carried out with respect to each vowel. In general, however, after a plurality of warping factors is obtained through clustering according to the invention, a single warping factor may be obtained by synthesizing these warping factors, for normalizing the speaker class.
The following is an example of speech spectrum normalization and recognition applying the method according to the invention.
Let x(t) be the input speech signal, let o_α(n) be the n-th filter bank output after normalization, and let α be the obtained speaker-dependent warping factor; then
where φ_α(w) is the warping function, and o_α(n) depends on the speaker-specific warping factor and the warping rules. T_n(w) is the output of the n-th filter, h_n and l_n are respectively the upper and lower frequency limits of the n-th filter, and X is the speech signal.
For solving the problem of bandwidth mismatch between the normalized speech samples and the classification model for content recognition, a piecewise warping rule may be used to ensure that the bandwidth of the normalized spectrum matches the classification model.
where w0 in equation (10) is a fixed frequency which is set by experiment, and b and c can be calculated from w0. According to equation (10), α>1.0 means compressing the spectrum, α<1.0 corresponds to stretching the spectrum, and α=1.0 corresponds to the no-warping case.
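As an illustration, the following is one common piecewise-linear formulation of such a warping rule, not necessarily identical to equation (10): the spectrum is scaled by α below the fixed frequency w0, and the slope of the second segment is chosen so that the warped axis still ends at the original upper band limit, which keeps the bandwidth of the normalized spectrum matched to the recognition model.

```python
def piecewise_warp(w, alpha, w0, w_max):
    """Piecewise-linear frequency warping: scale by alpha up to w0, then use a
    compensating slope so that piecewise_warp(w_max) == w_max."""
    if w <= w0:
        return alpha * w
    slope = (w_max - alpha * w0) / (w_max - w0)   # keeps the warped bandwidth equal to w_max
    return alpha * w0 + slope * (w - w0)
```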
The above is an example of linear normalization. As mentioned above, the normalization may be either linear or bi-linear or non-linear.
Bi-linear normalization:
where φ_α(w) is the normalization function and α is the warping factor.
Non-linear normalization:
where φ_α(w) is the normalization function, α is the warping factor, and w0 is a fixed frequency set through experiment.
In terms of the invention, the bi-linear and non-linear normalizations differ from the linear normalization only in how the warping factor is applied. Such differences belong to the regular means in the field, and a detailed description thereof is omitted here.
Speech Data Processing Apparatus
The pitch extractor 302 and the formant frequency extractor 304 have been described above and the description will not be repeated here.
The training stage performed by the training means 306 has also been described above.
In a preferred embodiment, means for decorrelating the feature space comprised of the pitch extracted by the pitch extractor 302 and the formant frequencies extracted by the formant frequency extractor 304 may be inserted before the training means 306, thus decreasing the number of dimensions and obtaining an orthogonal feature space. The decorrelation has also been described above.
As mentioned above, the training means 306 may conduct either unsupervised or supervised classification. When conducting supervised classification, the speech data processing apparatus further comprises means for labeling the classes of the training speech samples a priori. The labeling means may adopt Viterbi alignment to conduct the labeling. Also, as mentioned above, any suitable alignment method may be used.
Also, for conducting more accurate classification with the context taken into account, the means for labeling the classes of the training speech samples a priori may be further configured to label the speech samples a priori with phonetic knowledge, such as labeling the phonemes a priori, so that the classes synthetically reflect both the speaker and said phonetic knowledge.
Also, as a preferred embodiment, means for giving a spectrum normalization factor corresponding to each speaker class may be incorporated into the speech data processing apparatus according to the invention. The specific implementation has been described before.
In a preferred embodiment, the speech data processing apparatus further comprises a speaker class classifier 308 trained by said training means 306. The speaker class classifier 308 compares the features extracted from the speech data to be recognized by the pitch extractor 302 and the formant frequency extractor 304 with the features of the respective classes in the classifier, and thus classifies the speech data to be recognized into an appropriate class. Then, the spectrum normalization factor associated with that class may be used to normalize the spectrum of the speech data to be recognized, so as to facilitate the content recognition of the speech data. For the other aspects of the speech data processing apparatus, reference may be made to the description of the speech data processing method above.
To evaluate the effect of the invention, we perform a series of experiments on speaker-independent Mandarin speech recognition. The acoustic model is trained from all of the acoustic training data provided by the internal recording database for the automatic speech recognition (ASR) system of IBM Corporation. The testing data are recorded in a stable office condition. 120 speakers (60 male and 60 female) are recorded, with no restriction on their speaking style, for three tasks. There are 15 utterances for each speaker.
The main features of our ASR system are summarized as follows: 40-dimensional acoustic features resulting from 13-dimensional MFCCs (Mel-frequency cepstral coefficients) followed by the application of temporal LDA and MLLT; the acoustic model consists of about 3k HMM states and 33k Gaussian mixtures. The search engine is based on an A* heuristic stack decoder.
To show the efficiency of the proposed algorithm in eliminating speaker variation, experiments on three tasks are performed. The first two are in isolated-word mode, while the third is continuous digit recognition (with lengths from 3 to 8 digits), representing different applications:
1. People Name
2. Stock Name
3. Digits
In the experiments, four methods are compared with each other: the baseline system (without speaker spectrum normalization), the conventional parametric approach, the linear searching method and the VTLS method of the invention. With these methods, different speaker spectrum normalization methods are applied to the same application, and then speech content recognition is conducted with the same speech content recognition method. The different speaker normalization methods are evaluated by comparing the respective error rates of the speech content recognition.
Table 2 shows the recognition word error rates on the three tasks. The warping rule is based on the piecewise mode of equation (10). By using VTLS for people name, stock name, and digits, the average relative word error rates are reduced by 11.20%, 8.45% and 5.81%, respectively, in comparison with the baseline system (that is, without speaker normalization), the parametric method and linear search.
Apparently, the speech data processing method and apparatus improve the recognition rate remarkably.
While the invention has been described with reference to the specific embodiments disclosed herein, it is not confined to the details set forth herein, and there are many alternatives to the components and steps described; the protection scope is intended to cover all variations or equivalents that are obvious to a person skilled in the art who has read the specification.