This application claims priority of Taiwanese Application No. 093112355, filed on May 3, 2004.
1. Field of the Invention
The invention relates to an acoustic model training method, more particularly to an acoustic model training method, in which sub-speech data sets are used to perform adaptation training of acoustic models of a root speech data set so as to obtain acoustic models of the sub-speech data sets.
2. Description of the Related Art
Current mainstream speech recognition techniques are based on the fundamental principle of statistical pattern recognition. A complete speech recognition system can be roughly divided into three levels: audio signal processing, acoustic decoding, and linguistic decoding.
In terms of phonetics, speech sounds in natural speech are continuous, i.e., the demarcation between phonetic segments is not distinct. This is the so-called coarticulation phenomenon. At present, the complicated problem of coarticulation between phonetic segments is overcome mostly by adopting context-dependent models.
Generally speaking, each monosyllable includes at least one phone. Each phone can be divided into an initial and a final, i.e., a consonant and a vowel. Since the same phone will have different acoustic realizations in different sentences due to the effect of coarticulation, far more acoustic models than phones are required, and the number of phones also varies across languages. For instance, there are 40-50 phones in English, whereas there are 37 phones in Chinese. If context-dependent models are built according to context relationships, the required number of acoustic models will be huge: the Chinese language will require about 60,000 acoustic models, whereas the English language will require about 125,000 acoustic models. Besides, the building of each model requires sufficient speech data in order to impart a certain degree of reliability to the model. In order to ensure that there are sufficient speech data to train a reliable model for each speech unit, parameter sharing is the approach usually adopted in speech training.
At present, a decision tree is employed to train acoustic models by letting relevant speech data share parameters. The decision tree is a method of integrating phonetics and acoustics in a top-down approach, in which all the speech data belonging to the same phone are placed at the uppermost level and are then divided into two clusters, such that the differences among elements in the same cluster are smaller, whereas the differences among elements in different clusters are larger. In this way, acoustically similar models can be grouped together, while dissimilar models can be separated. Iterative splitting will yield clusters that serve as sets of shared parameters, and the models in the same cluster can share speech training data and parameters. However, the clusters cannot be split without restraint. If the number of speech data in a cluster is less than a threshold value, i.e., the amount of speech training data in the cluster is sparse, the models trained therefrom will not be robust, thereby resulting in inaccurate models. A current method to solve this problem is to back off to all the speech data in the level immediately above the cluster and use the same as the reference speech data when building the models, that is, to use the models in the level immediately above the cluster as substitutes. For instance, if there are insufficient speech data for the initial phone "an" (meaning speech data in which "a" is followed by "n"), the parameters of the initial phone "a" are used as a back-off substitute for "an". However, in actuality, the threshold value for the number of speech data in the clusters is not easy to determine, and backing off to the parameters of the speech data in the upper level offers little help in enhancing the resolution of the models.
Therefore, the object of this invention is to provide an acoustic model training method which can effectively use available speech data to build a relatively precise acoustic model.
According to one aspect of this invention, an acoustic model training method includes:
(a) constructing a root speech data set, the root speech data set having a plurality of root speech data, each of the root speech data having a root phone;
(b) constructing a Hidden Markov Model for the root speech data set;
(c) constructing a sub-speech data set dependent on the root phone, the sub-speech data set having at least one sub-speech datum, the sub-speech datum having the root phone and a sub-phone adjacent to the root phone; and
(d) using the following equation to update a parameter mean value of the sub-speech data set:
where μ̄i and μ̄d are mean values of Hidden Markov Model parameters for the root speech data set and the sub-speech data set, respectively, ni and nd are numbers of samples of speech data in the root speech data set and the sub-speech data set, respectively, k is a weighted value, and μ̄ is the updated mean value of the Hidden Markov Model parameters for the sub-speech data set.
According to another aspect of this invention, a system for implementing an acoustic model training method is loadable into a computer for constructing acoustic models corresponding to input speech data. The system has a program code recorded thereon to be read by the computer so as to cause the computer to execute the following steps:
(a) constructing a root speech data set, the root speech data set having a plurality of root speech data, each of the root speech data having a root phone;
(b) constructing a Hidden Markov Model for the root speech data set;
(c) constructing a sub-speech data set dependent on the root phone, the sub-speech data set having at least one sub-speech datum, the sub-speech datum having the root phone and a sub-phone adjacent to the root phone; and
(d) using the following equation to update a parameter mean value of the sub-speech data set:
where μ̄i and μ̄d are mean values of Hidden Markov Model parameters for the root speech data set and the sub-speech data set, respectively, ni and nd are numbers of samples of speech data in the root speech data set and the sub-speech data set, respectively, k is a weighted value, and μ̄ is the updated mean value of the Hidden Markov Model parameters for the sub-speech data set.
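A weighted interpolation of the following form is consistent with the variable definitions given above and with the limiting behavior described later in the detailed description; the exact placement of the weight k within the equation is an assumption made here for illustration, so the expression should be read as a sketch of the update rather than as the claimed equation itself:

```latex
\bar{\mu} = \frac{k\, n_i\, \bar{\mu}_i + n_d\, \bar{\mu}_d}{k\, n_i + n_d}
```

Under this form, when nd = 0 the updated mean μ̄ equals μ̄i, and as nd grows it approaches μ̄d.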
Other features and advantages of the present invention will become apparent in the following detailed description of the preferred embodiment with reference to the accompanying drawings, of which:
Before the present invention is described in greater detail, it should be noted that the acoustic model training method according to this invention is suited for use with the language of any country or people, and that although this invention is exemplified using the English language, it should not be limited thereto.
The content of automatic speech recognition (ASR) can be explained briefly in three parts: 1. feature parameter extraction; 2. speech model construction; and 3. similarity comparison and recognition, each of which is described below.
Although an original speech signal can be directly used for recognition after being digitized, the original speech signal is very rarely stored in its entirety for use as standard reference speech samples, since the amount of data is voluminous, the processing time is excessively long, and the recognition efficiency is unsatisfactory. Therefore, it is necessary to perform feature extraction based on the features of the speech signal so as to obtain suitable feature parameters for purposes of comparison and recognition. Prior to feature extraction, the speech signal must be subjected to pre-processing.
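As a rough illustration of this stage, the following sketch performs a typical front end of pre-emphasis, framing, and windowing, and then derives 20 features per frame (the per-frame feature count assumed later in this description); the specific pre-emphasis coefficient, frame sizes, and log band-energy features are illustrative assumptions rather than the pre-processing prescribed by this invention.

```python
import numpy as np

def extract_features(signal, frame_len=400, frame_shift=160,
                     pre_emphasis=0.97, n_features=20):
    """Illustrative front end: pre-emphasis, framing, windowing, and
    20 log band-energy features per frame (not the patent's exact features)."""
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis to boost the high-frequency part of the spectrum.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # Split into overlapping frames and apply a Hamming window
    # (assumes the signal is at least one frame long).
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum of each frame, grouped into n_features log band energies.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    bands = np.array_split(spectrum, n_features, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)
```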
After determining the feature values of the speech signal, a speech model is constructed. A left-to-right Hidden Markov Model (HMM) is adopted in this embodiment to simulate the process of change of the vocal tract in the oral cavity. The building of a speech sample model involves using an abstract probability model as a reference sample to describe speech features. That is, recognized speech is measured not by the magnitude of distortion but by the probability generated from the model.
The major feature of the HMM is the use of a probability density function to describe the variation of a speech signal. When a speech signal is described by states, the state of each frame is locally stationary if it does not transit to a next state. A state transition probability can be used to represent the state transition or stationary process. In addition, a state observation probability can be used to represent the extent of similarity between frames and states.
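A minimal representation of such a left-to-right model, with a state transition matrix and one Gaussian (mean and covariance) per state, might be sketched as follows; the three-state topology matches the three states used below, while the particular transition probabilities shown are placeholders only.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class LeftToRightHMM:
    """Three-state, left-to-right HMM: each state either stays put or moves
    to the next state, and each state holds one Gaussian (mean, covariance)."""
    transitions: np.ndarray = field(default_factory=lambda: np.array([
        [0.6, 0.4, 0.0],   # state 1 -> {1, 2}
        [0.0, 0.6, 0.4],   # state 2 -> {2, 3}
        [0.0, 0.0, 1.0],   # state 3 is the final state
    ]))
    means: list = field(default_factory=list)        # one 20-dim mean vector per state
    covariances: list = field(default_factory=list)  # one 20x20 covariance matrix per state
```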
Each frame has 20 feature values, in which f1(n,i) is defined as the ith speech feature parameter of the nth frame in state 1, whereas f1(i) = (f1(i,1), f1(i,2), …, f1(i,20))′ represents the vector of the speech feature parameters of the ith frame in state 1. Hence, the estimated mean value and covariance in state 1 are the sample mean and sample covariance computed over the frames assigned to state 1.
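Under this even initial distribution, each state's mean vector and covariance matrix are simply the sample statistics of the frames assigned to that state; a sketch, assuming the feature matrix of one utterance is split equally among three states, follows.

```python
import numpy as np

def initialize_state_gaussians(features, n_states=3):
    """Evenly distribute the frames of an utterance among the states and
    estimate each state's mean vector and covariance matrix (initial model)."""
    means, covariances = [], []
    for state_frames in np.array_split(features, n_states, axis=0):
        means.append(state_frames.mean(axis=0))
        # rowvar=False: rows are frames, columns are the 20 feature dimensions.
        covariances.append(np.cov(state_frames, rowvar=False))
    return means, covariances
```

The resulting lists can populate the means and covariances of the model structure sketched above.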
The mean value and covariance of state 2 and state 3 can be obtained in the same manner. However, model building is not completed merely by this even distribution of frames among states; the even distribution is employed only to give each model an initial value. Subsequently, the extent of similarity between the frames and the states needs to be computed, in general using a multivariate Gaussian probability density function as follows:
where i = 1, 2, 3, representing the states, and j = 1, 2, …, Nf, representing the frame number, Nf being the total number of frames. By using Pi,j to represent Pi(xj) and by employing the multivariate Gaussian probability density function, the extent of similarity (similarity probability value) between each frame and each state can be obtained (step 32). Thus, the state to which a frame is comparatively similar can be found. Next, these probability values are used to find many possible paths.
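A direct rendering of this computation, taken in log form for numerical stability, might look like the following sketch; it evaluates every frame xj under every state's Gaussian and returns the matrix of values Pi,j (here as log probabilities).

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log multivariate Gaussian density of frame x under one state's Gaussian."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)   # (x - mu)' Sigma^-1 (x - mu)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def frame_state_similarities(features, means, covariances):
    """Return the matrix of log P_i(x_j): one row per state i, one column per frame j."""
    return np.array([[log_gaussian(x, m, c) for x in features]
                     for m, c in zip(means, covariances)])
```

The state to which a given frame is most similar is then the row with the largest value in that frame's column, and these values feed the subsequent path search.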
With reference to the accompanying drawings, the acoustic model building module 5 has a root phone set unit 51, a root phone model building unit 52, a sub-phone set unit 53, and a sub-phone model building unit 54.
The root phone set unit 51 pre-sets a phone as a root phone. For example, "a" is selected as the root phone; certainly, other phones, such as "e," "i," "o," and "u," can also be selected. Feature files containing speech data of the root phone are selected from the computer 1, and "an," "am," and "ab" (the lower-case letter following "a" denoting the phone that follows "a" in the speech data) all belong to the set, based on which a voluminous root speech data set is constructed. The set may also be referred to as a context-independent phone set.
The root phone model building unit 52 builds an acoustic model dedicated to the speech data of the root phone set. In this embodiment, the Hidden Markov Model is used, and the model provides the mean values μ̄i and μ̄d of "a" and "an" (or "am"), respectively.
The sub-phone set unit 53 classifies sub-speech data relevant to the root phone from the root speech data set, and builds a sub-speech data set. In this embodiment, the method of classification involves using a decision tree.
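As an illustration of how the root speech data set and its dependent sub-speech data sets might be organized, the following sketch groups feature data first by the root phone and then by the phone immediately adjacent to it; the `utterances` input format and the single adjacent-phone split are assumptions for illustration, whereas an actual decision tree would split the set by phonetic questions.

```python
from collections import defaultdict

def build_phone_sets(utterances, root_phone="a"):
    """Group speech data into the root speech data set (all data containing the
    root phone) and into sub-speech data sets keyed by the adjacent phone."""
    root_set = []                   # e.g. every "a..." speech datum
    sub_sets = defaultdict(list)    # e.g. the "an", "am", "ab" sub-speech data sets
    for phones, features in utterances:   # phones: e.g. ["a", "n"]; features: that datum's feature data
        for pos, phone in enumerate(phones):
            if phone != root_phone:
                continue
            root_set.append(features)
            if pos + 1 < len(phones):     # the phone adjacent to the root phone
                sub_sets[root_phone + phones[pos + 1]].append(features)
    return root_set, dict(sub_sets)
```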
The sub-phone model building unit 54 updates the mean values (numerical values) of the sub-phones according to the following equation:
where μ̄i and μ̄d are the mean values of the HMM parameters of the root speech data set and the sub-speech data set, respectively, ni and nd are the numbers of speech data samples contained in the root speech data set and the sub-speech data set, respectively, k is the weighted value, and μ̄ is the updated mean value of the HMM parameters for the sub-speech data set.
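Rendered as code under the same assumed placement of the weight k noted earlier, the update performed by the sub-phone model building unit 54 might look like the following sketch; the function name and the default k = 1.0 are illustrative only.

```python
import numpy as np

def adapt_sub_phone_mean(mu_root, mu_sub, n_root, n_sub, k=1.0):
    """Interpolate the root-set mean and the sub-set mean; a small n_sub keeps
    the result near mu_root, a large n_sub moves it toward mu_sub
    (assumed form of the update, with k weighting the root statistics)."""
    mu_root = np.asarray(mu_root, dtype=float)
    mu_sub = np.asarray(mu_sub, dtype=float)
    return (k * n_root * mu_root + n_sub * mu_sub) / (k * n_root + n_sub)
```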
With reference to the accompanying flow chart, the acoustic model training method of the preferred embodiment proceeds as follows.
Initially, step 60 is performed to input speech training data.
Subsequently, step 61 is performed, in which the root phone set unit 51 pre-sets a phone as a root phone, selects speech data having feature files of the root phone from the computer 1, and constructs a root speech data set. The invention is exemplified herein utilizing the initial phone “a” as the selected root phone, and using 2000 samples.
Then, step 62 is carried out, in which the root phone model building unit 52 builds an acoustic model dedicated to the root speech data set using the HMM. The acoustic model provides the mean values μ̄i and μ̄d (feature parameters) of the speech data signals.
Thereafter, step 63 is performed, in which, after the root phone model building unit 52 has built the acoustic model for the root speech data set, the sub-phone set unit 53 classifies sub-speech data relevant to the root phone from the root speech data set, and constructs a sub-speech data set. In this embodiment, the sub-speech data are those with a selected initial phone “an,” and the number of samples is 15.
Then, step 64 is performed, in which the sub-phone model building unit 54 utilizes the speech data in the sub-speech data set for model adaptation training of the acoustic models of the root speech data set. The adaptation training rule is as follows:
After substituting the actual numbers thereinto:
It is particularly noted that k is a weighted value, which is set depending on actual experimental requirements. It can be seen from the above equation that the updated mean value of the acoustic models of the sub-speech data set lies between μ̄i and μ̄d. Besides, the smaller the number of samples nd, the closer the updated value will be to μ̄i; on the other hand, the greater the number of samples nd, the closer the updated value will be to μ̄d.
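For illustration only, substituting the embodiment's sample counts (ni = 2000, nd = 15) into the assumed interpolation form above, with a hypothetical weight k = 1, gives:

```latex
\bar{\mu} = \frac{1 \cdot 2000 \cdot \bar{\mu}_i + 15 \cdot \bar{\mu}_d}{1 \cdot 2000 + 15}
          \approx 0.9926\,\bar{\mu}_i + 0.0074\,\bar{\mu}_d
```

With only 15 sub-speech samples, the updated mean remains very close to μ̄i, consistent with the behavior just described.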
Finally, step 65 is performed to output the updated value.
In summary, the acoustic model training method of this invention does not employ the backing-off rule which is generally applied in the prior art when making determinations using a decision tree. Instead, this invention adaptively trains the acoustic models of a root speech data set, using a method different from the conventional Hidden Markov Model calculation of the parameter mean values, when building the acoustic models of sub-speech data sets, so that all the speech data in the sub-speech data sets can be used effectively to build the acoustic models of the sub-speech data sets. Thus, this invention provides both facility and robustness, and can positively achieve the stated object.
While the present invention has been described in connection with what is considered the most practical and preferred embodiment, it is understood that this invention is not limited to the disclosed embodiment but is intended to cover various arrangements included within the spirit and scope of the broadest interpretation so as to encompass all such modifications and equivalent arrangements.
Number | Date | Country | Kind |
---|---|---|---|
093112355 | May 2004 | TW | national |