1. Field of the Invention
The present invention relates to a method, a system and a program for modeling common-language speech recognition, by a computer, under the influence of multiple dialects, and also relates to a recording medium that stores the program. The present invention particularly relates to the field of speech recognition by a computer.
2. Description of the Related Art
Enhancing robustness has long been an important and difficult issue in the field of speech recognition. A major factor in the deterioration of speech recognition robustness is the problem of linguistic accents. For example, the Chinese language has many dialects, which makes accents a significant problem and motivates ongoing research. In a conventional speech recognition system, the recognition rate for the standard common language is high, but the recognition rate for an accented common language influenced by dialects (hereinafter referred to as a "dialectal-accented common language" or simply a "dialectal common language") is low. To address this problem, a method such as "adaptation" is generally used as a countermeasure. However, this presupposes that a sufficient amount of data for the dialectal-accented common language is available, and with this method there are cases where the recognition rate for the standard common language drops markedly. Moreover, since there are many kinds of dialects, work efficiency is degraded if an acoustic model must be trained repeatedly for each kind of dialect.
The present invention has been made in view of the foregoing problems, and one purpose thereof is to provide a method for modeling common-language speech recognition, by a computer, under the influence of a plurality of dialects, the method being capable of raising the recognition rate for dialectal-accented common languages with a small amount of data while guaranteeing that the recognition rate for the standard common language is sustained, and to provide a system therefor.
A method for modeling common-language speech recognition by a computer under the influence of a plurality of dialects includes the following steps:
(1) generating a triphone standard common-language model based on training data of standard common language, generating a first monophone dialectal-accented common-language model based on development data of dialectal-accented common language of first kind, and generating a second monophone dialectal-accented common-language model based on development data of dialectal-accented common language of second kind;
(2) generating a first confusion matrix by recognizing the development data of the dialectal-accented common language of first kind using the standard common-language model, and obtaining a temporary merged model in a manner that the first dialectal-accented common-language model is merged into the standard common-language model according to the first confusion matrix; and
(3) generating a second confusion matrix by recognizing the development data of the dialectal-accented common language of second kind using the temporary merged model, and obtaining a recognition model in a manner that the second dialectal-accented common-language model is merged into the temporary merged model according to the second confusion matrix.
The merging method as described in the above steps (2) and (3) is such that:
a probability density function of the temporary merged model is expressed by

$$p'(x \mid s) = \lambda_1\, p(x \mid s) + (1 - \lambda_1)\, p(x \mid d_1)\, P(d_1 \mid s)$$

where x is an observation feature vector of the speech to be recognized, s is a hidden Markov state in the standard common-language model, d1 is a hidden Markov state in the first dialectal-accented common-language model, and λ1 is a linear interpolation coefficient such that 0 < λ1 < 1, and
wherein a probability density function of the merged recognition model is expressed by

$$\begin{aligned}
p''(x \mid s) &= \lambda_2\, p'(x \mid s) + (1-\lambda_2) \sum_{p=1}^{P} P(d_{2p} \mid s)\, p(x \mid d_{2p})\\
&= \lambda_2 \Bigl[ \lambda_1\, p(x \mid s) + (1-\lambda_1) \sum_{m=1}^{M} P(d_{1m} \mid s)\, p(x \mid d_{1m}) \Bigr] + (1-\lambda_2) \sum_{p=1}^{P} P(d_{2p} \mid s)\, p(x \mid d_{2p})\\
&= \sum_{k=1}^{K} w_k^{(sc)\prime}\, N_k^{(sc)}(x) + \sum_{m=1}^{M}\sum_{n=1}^{N} w_{mn}^{(dc1)\prime}\, N_{mn}^{(dc1)}(x) + \sum_{p=1}^{P}\sum_{q=1}^{Q} w_{pq}^{(dc2)\prime}\, N_{pq}^{(dc2)}(x)
\end{aligned}$$

where:
w_k^(sc)′ is a mixture weight for hidden Markov state s of the standard common-language model;
w_mn^(dc1)′ is a mixture weight for hidden Markov state d1 of the first dialectal-accented common-language model;
w_pq^(dc2)′ is a mixture weight for hidden Markov state d2 of the second dialectal-accented common-language model;
K is the number of Gaussian mixture components for hidden Markov model state s in the standard common-language model, and N_k^(sc)(·) is the kth such component;
M is the number of states d1 of the first dialectal-accented common-language model that are regarded as pronunciation variants of state s of the standard common-language model;
N is the number of Gaussian mixture components for hidden Markov model state d1 in the first dialectal-accented common-language model, and N_mn^(dc1)(·) is one such component;
P is the number of states d2 of the second dialectal-accented common-language model that are regarded as pronunciation variants of state s of the standard common-language model; and
Q is the number of Gaussian mixture components for hidden Markov model state d2 in the second dialectal-accented common-language model, and N_pq^(dc2)(·) is one such component.
The method for modeling common-language speech recognition by a computer under the influence of a plurality of dialects according to the above embodiment achieves the following advantageous effects.
Each of a plurality of dialectal-accented common-language models is merged into the standard common-language model using an iterative method, so that the redundant operation of training an acoustic model for each dialect is avoided and work efficiency is enhanced. According to this method, the recognition rate for dialectal-accented common languages is raised while, at the same time, the recognition rate for the standard common language does not deteriorate and in some cases even increases. The method thus resolves the problem of conventional methods in which the recognition rate for the standard common language deteriorates markedly when a dialectal-accented common language is properly handled.
Optional combinations of the aforementioned processes, and implementations of the invention in the form of apparatuses, systems, recording media, computer programs and so forth may also be practiced as additional modes of the present invention.
Embodiments will now be described, by way of example only, with reference to the accompanying drawings, which are meant to be exemplary, not limiting, and in which like elements are numbered alike throughout the several figures.
A description is now given of preferred embodiments of the present invention with reference to drawings.
(1) generating a triphone standard common-language model based on training data of the standard common language, and generating first to nth monophone dialectal-accented common-language models for the respective dialectal-accented common languages of first to nth kinds, based on the development data thereof;
(2) generating a first confusion matrix by recognizing the development data of the dialectal-accented common language of first kind using the standard common-language model, and obtaining a first temporary merged model in a manner that the first dialectal-accented common-language model is merged into the standard common-language model according to the first confusion matrix; and
(3) generating an ith confusion matrix by recognizing the development data of the dialectal-accented common language of the ith kind using the (i−1)th temporary merged model (i being an integer such that 2 ≤ i ≤ n), and obtaining a final recognition model by repeating, from i = 2 to i = n, an operation of merging the ith dialectal-accented common-language model into the (i−1)th temporary merged model according to the ith confusion matrix. A sketch of this iteration in code is given immediately below.
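For illustration only, the iteration of the above steps can be summarized in the following sketch, in which train_triphone, train_monophone, estimate_confusion and merge_models are hypothetical stand-ins for the corresponding operations of an HMM toolkit (none of these names appear in the disclosure):

```python
from typing import Callable, Sequence

# Opaque placeholders for toolkit-specific objects (hypothetical, not from the text).
Model = object
Corpus = object
ConfusionMatrix = object

def build_recognition_model(
    train_triphone: Callable[[Corpus], Model],         # step (1): standard model
    train_monophone: Callable[[Corpus], Model],        # step (1): dialect models
    estimate_confusion: Callable[[Model, Corpus], ConfusionMatrix],  # steps (2)/(3)
    merge_models: Callable[[Model, Model, ConfusionMatrix], Model],  # steps (2)/(3)
    standard_training_data: Corpus,
    dialect_development_data: Sequence[Corpus],        # first to nth kinds
) -> Model:
    """Iteratively merge each dialectal-accented model into the merged model.

    The (i-1)th temporary merged model recognizes the ith development data to
    produce the ith confusion matrix, after which the ith dialect model is
    merged in; the loop invariant matches steps (2) and (3) above.
    """
    merged = train_triphone(standard_training_data)
    for dev_data in dialect_development_data:
        dialect_model = train_monophone(dev_data)
        confusion = estimate_confusion(merged, dev_data)
        merged = merge_models(merged, dialect_model, confusion)
    return merged  # the final recognition model
```

Note that only the data flow is asserted here; the training, recognition and merging operations themselves are supplied by whatever speech recognition toolkit is used.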
The training DB 10-0 is a database that stores the training data of a standard common language.
The development DBs 10-1 to 10-n are databases that store the development data of the dialectal-accented common languages of the first to nth kinds, respectively.
The model generator 30-0 is used to generate a triphone standard common-language model based on the training data of the standard common language stored in the training DB 10-0.
The model generators 30-1 to 30-n are a sequence of blocks that generate the first to nth monophone dialectal-accented common-language models based on the development data of the dialectal-accented common languages of the first to nth kinds stored in the development databases 10-1 to 10-n, respectively.
The confusion matrix generators 40-1 to 40-n are a sequence of blocks that generate the first to nth confusion matrices by recognizing the development data of the dialectal-accented common languages of the first to nth kinds stored in the development databases 10-1 to 10-n. The first confusion matrix generator 40-1 uses the standard common-language model generated by the model generator 30-0, and each subsequent generator 40-i (2 ≤ i ≤ n) uses the (i−1)th temporary merged model generated by the immediately preceding model merging unit.
The model merging unit 50-1 generates a first temporary merged model by merging the first dialectal-accented common-language model generated by the model generator 30-1 into the standard common-language model generated by the model generator 30-0, according to the first confusion matrix generated by the confusion matrix generator 40-1.
The model merging units 50-2 to 50-(n-1) generate the second to (n−1)th temporary merged models by merging the second to (n−1)th dialectal-accented common-language models generated by the model generators 30-2 to 30-(n-1), each into the temporary merged model generated by the immediately preceding model merging unit, according to the second to (n−1)th confusion matrices generated by the corresponding confusion matrix generators 40-2 to 40-(n-1).
The model merging unit 50-n finally generates the recognition model by merging the nth dialectal-accented common-language model generated by the model generator 30-n into the (n−1)th temporary merged model generated by the immediately preceding model merging unit 50-(n-1), according to the nth confusion matrix generated by the confusion matrix generator 40-n.
The control unit 200 controls the model generation unit 100 in such a manner as to operate according to the aforementioned modeling method.
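The block structure just described may be pictured, with the caveat that every identifier below is a hypothetical stand-in rather than part of the disclosure, as a container whose fields correspond one-to-one to the reference numerals:

```python
from dataclasses import dataclass
from typing import Callable, List

# Opaque placeholders for toolkit-specific objects (hypothetical).
Model = object
Corpus = object
ConfusionMatrix = object

@dataclass
class ModelGenerationUnit:
    """Fields mirror the blocks of the model generation unit 100 (sketch only)."""
    training_db: Corpus                                   # training DB 10-0
    development_dbs: List[Corpus]                         # development DBs 10-1..10-n
    standard_generator: Callable[[Corpus], Model]         # model generator 30-0
    dialect_generators: List[Callable[[Corpus], Model]]   # model generators 30-1..30-n
    confusion_generators: List[Callable[[Model, Corpus], ConfusionMatrix]]  # 40-1..40-n
    merging_units: List[Callable[[Model, Model, ConfusionMatrix], Model]]   # 50-1..50-n
```

Under this picture, the control unit 200 corresponds to the loop of build_recognition_model above, handing each confusion matrix generator the most recent merged model.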
A concrete description is given below of a method for modeling a recognition model compatible with two different kinds of dialectal-accented common languages (n = 2).
This modeling method includes the following steps of:
(1) generating a triphone standard common-language model based on training data of standard common language, generating a first monophone dialectal-accented common-language model based on development data of dialectal-accented common language of first kind, and generating a second monophone dialectal-accented common-language model based on development data of dialectal-accented common language of second kind;
(2) acquiring a first confusion matrix by recognizing the development data of the dialectal-accented common language of first kind using the standard common-language model, and obtaining a temporary merged model in a manner that the first dialectal-accented common-language model is merged into the standard common-language model according to the first confusion matrix; and
(3) acquiring a second confusion matrix by recognizing the development data of the dialectal-accented common language of second kind using the temporary merged model, and obtaining a recognition model in a manner that the second dialectal-accented common-language model is merged into the temporary merged model according to the second confusion matrix. A sketch of how such a confusion matrix may be estimated is given after these steps.
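The steps above leave the construction of the confusion matrix itself implicit. The following is therefore only a plausible sketch: it assumes the recognition output has already been aligned state-by-state against the reference labels (the function name and the pre-aligned input are assumptions, not part of the disclosure), accumulates co-occurrence counts, and normalizes each row so that it yields P(d|s):

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

def estimate_confusion_matrix(
    aligned_pairs: Sequence[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """Estimate P(dialect_state | standard_state) from aligned state pairs.

    aligned_pairs holds (standard_state, dialect_state) pairs obtained by
    recognizing the development data with the current model and aligning the
    result against the reference labels.
    """
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for s, d in aligned_pairs:
        counts[s][d] += 1
    matrix: Dict[str, Dict[str, float]] = {}
    for s, row in counts.items():
        total = sum(row.values())
        matrix[s] = {d: c / total for d, c in row.items()}  # each row sums to 1
    return matrix

# Toy usage with invented state labels.
pairs = [("s_a", "d_a"), ("s_a", "d_a"), ("s_a", "d_b"), ("s_b", "d_b")]
print(estimate_confusion_matrix(pairs)["s_a"])  # approx {'d_a': 0.667, 'd_b': 0.333}
```

Row normalization matters below: it is what makes the merged mixture weights sum to one.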
The merging method as described in the above steps (2) and (3) is such that:
the probability density function of the temporary merged model is expressed by

$$p'(x \mid s) = \lambda_1\, p(x \mid s) + (1 - \lambda_1)\, p(x \mid d_1)\, P(d_1 \mid s)$$

where x is an observation feature vector of the speech to be recognized, s is a hidden Markov state in the standard common-language model, d1 is a hidden Markov state in the first dialectal-accented common-language model, and λ1 is a linear interpolation coefficient such that 0 < λ1 < 1.
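As a purely numeric illustration of this interpolation, the following minimal sketch uses invented one-dimensional Gaussian mixtures; the values of λ1 and P(d1|s) are arbitrary example values, not values from the disclosure:

```python
import math
from typing import Sequence

def gaussian(x: float, mean: float, var: float) -> float:
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm(x: float, weights: Sequence[float], means: Sequence[float],
        variances: Sequence[float]) -> float:
    """Density of a one-dimensional Gaussian mixture at x."""
    return sum(w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances))

def p_x_s(x: float) -> float:
    # Toy two-component mixture standing in for p(x|s) of a standard-model state.
    return gmm(x, [0.6, 0.4], [0.0, 1.0], [1.0, 0.5])

def p_x_d1(x: float) -> float:
    # Toy one-component mixture standing in for p(x|d1) of the dialect-model state.
    return gmm(x, [1.0], [0.8], [0.7])

LAM1 = 0.7          # linear interpolation coefficient, tuned on development data
P_D1_GIVEN_S = 1.0  # confusion entry P(d1|s); 1.0 when d1 is the only variant of s

def p_merged(x: float) -> float:
    """p'(x|s) = LAM1 * p(x|s) + (1 - LAM1) * p(x|d1) * P(d1|s)."""
    return LAM1 * p_x_s(x) + (1 - LAM1) * p_x_d1(x) * P_D1_GIVEN_S

print(p_merged(0.5))  # density of the temporary merged state at x = 0.5
```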
Also, the probability density function of the recognition model is expressed by

$$\begin{aligned}
p''(x \mid s) &= \lambda_2\, p'(x \mid s) + (1-\lambda_2) \sum_{p=1}^{P} P(d_{2p} \mid s)\, p(x \mid d_{2p})\\
&= \lambda_2 \Bigl[ \lambda_1\, p(x \mid s) + (1-\lambda_1) \sum_{m=1}^{M} P(d_{1m} \mid s)\, p(x \mid d_{1m}) \Bigr] + (1-\lambda_2) \sum_{p=1}^{P} P(d_{2p} \mid s)\, p(x \mid d_{2p})\\
&= \sum_{k=1}^{K} w_k^{(sc)\prime}\, N_k^{(sc)}(x) + \sum_{m=1}^{M}\sum_{n=1}^{N} w_{mn}^{(dc1)\prime}\, N_{mn}^{(dc1)}(x) + \sum_{p=1}^{P}\sum_{q=1}^{Q} w_{pq}^{(dc2)\prime}\, N_{pq}^{(dc2)}(x)
\end{aligned}$$

where:
w_k^(sc)′ is a mixture weight for hidden Markov state s of the standard common-language model;
w_mn^(dc1)′ is a mixture weight for hidden Markov state d1 of the first dialectal-accented common-language model;
w_pq^(dc2)′ is a mixture weight for hidden Markov state d2 of the second dialectal-accented common-language model;
K is the number of Gaussian mixture components for hidden Markov model state s in the standard common-language model, and N_k^(sc)(·) is the kth such component;
M is the number of states d1 of the first dialectal-accented common-language model that are regarded as pronunciation variants of state s of the standard common-language model;
N is the number of Gaussian mixture components for hidden Markov model state d1 in the first dialectal-accented common-language model, and N_mn^(dc1)(·) is one such component;
P is the number of states d2 of the second dialectal-accented common-language model that are regarded as pronunciation variants of state s of the standard common-language model; and
Q is the number of Gaussian mixture components for hidden Markov model state d2 in the second dialectal-accented common-language model, and N_pq^(dc2)(·) is one such component.
The method according to the present embodiment is characterized in that models created based on various kinds of dialectal-accented data are merged into the standard common-language model in an iterative manner; the fundamental flow of this method is illustrated in the accompanying drawings. First, the probability density function of the temporary merged model is expressed by
$$p'(x \mid s) = \lambda_1\, p(x \mid s) + (1 - \lambda_1)\, p(x \mid d_1)\, P(d_1 \mid s).$$
In the above equation, x is the observation feature vector of the speech to be recognized, s is a hidden Markov state in the standard common-language model, and d1 is a hidden Markov state in the first dialectal-accented common-language model. λ1 is a linear interpolation coefficient such that 0 < λ1 < 1 and indicates a mixture weight in the temporary merged model; in the actual setting, the optimum λ1 is determined through experiments. P(d1|s) is the conditional probability of the hidden Markov state of the first dialectal-accented common-language model given the corresponding hidden Markov state of the standard common-language model, and represents the variation of pronunciation in the dialect of the first kind relative to the standard common language. By the same reasoning, the probability density function of the final merged model may be expressed by

$$\begin{aligned}
p''(x \mid s) &= \lambda_2\, p'(x \mid s) + (1-\lambda_2) \sum_{p=1}^{P} P(d_{2p} \mid s)\, p(x \mid d_{2p})\\
&= \lambda_2 \Bigl[ \lambda_1\, p(x \mid s) + (1-\lambda_1) \sum_{m=1}^{M} P(d_{1m} \mid s)\, p(x \mid d_{1m}) \Bigr] + (1-\lambda_2) \sum_{p=1}^{P} P(d_{2p} \mid s)\, p(x \mid d_{2p})\\
&= \sum_{k=1}^{K} w_k^{(sc)\prime}\, N_k^{(sc)}(x) + \sum_{m=1}^{M}\sum_{n=1}^{N} w_{mn}^{(dc1)\prime}\, N_{mn}^{(dc1)}(x) + \sum_{p=1}^{P}\sum_{q=1}^{Q} w_{pq}^{(dc2)\prime}\, N_{pq}^{(dc2)}(x)
\end{aligned}$$
where:
d2 is a hidden Markov state in the second dialectal-accented common-language model;
λ2 is a linear interpolation coefficient such that 0 < λ2 < 1 and indicates a mixture weight in the final merged model; in the actual setting, the optimum λ2 is determined through experiments;
K is the number of Gaussian mixture components for hidden Markov model state s in the standard common-language model, and N_k^(sc)(·) is the kth such component;
M is the number of states d1 of the first dialectal-accented common-language model that are regarded as pronunciation variants of state s of the standard common-language model, and P(d1m|s) is the corresponding pronunciation-modeling probability;
N is the number of Gaussian mixture components for hidden Markov model state d1 in the first dialectal-accented common-language model, and N_mn^(dc1)(·) is one such component;
P is the number of states d2 of the second dialectal-accented common-language model that are regarded as pronunciation variants of state s, and P(d2p|s) is the corresponding pronunciation-modeling probability; and
Q is the number of Gaussian mixture components for hidden Markov model state d2 in the second dialectal-accented common-language model, and N_pq^(dc2)(·) is one such component.
It is easy to see from the last line of the above equation that the final merged model is constructed by taking the weighted sum of the standard common-language model, the first dialectal-accented model and the second dialectal-accented model, with w_k^(sc)′, w_mn^(dc1)′ and w_pq^(dc2)′ denoting the mixture weights of the three models. Since the confusion matrices P(d1m|s) and P(d2p|s) and the interpolation coefficients λ1 and λ2 are already known, the weights for the Gaussian mixtures of the three models are readily determined: w_k^(sc)′ = λ1λ2·w_k^(sc), w_mn^(dc1)′ = λ2(1−λ1)·P(d1m|s)·w_mn^(dc1), and w_pq^(dc2)′ = (1−λ2)·P(d2p|s)·w_pq^(dc2).
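As a concrete check of this weight determination, the following minimal sketch (all numbers are invented toy values; λ1, λ2 and the confusion rows are assumed given) builds the merged mixture weights from the formulas above and verifies that they sum to one when each confusion-matrix row is normalized:

```python
import numpy as np

# Toy values: lam1/lam2 are tuned on development data; the rows P(d1m|s) and
# P(d2p|s) come from the confusion matrices and each sum to 1.
lam1, lam2 = 0.7, 0.8

w_sc = np.array([0.6, 0.4])        # standard-model weights for state s (K = 2)
P_d1 = np.array([0.5, 0.5])        # P(d1m|s) for M = 2 pronunciation variants
w_dc1 = np.array([[0.7, 0.3],      # dialect-1 weights, one row per variant (N = 2)
                  [0.5, 0.5]])
P_d2 = np.array([1.0])             # P(d2p|s) for P = 1 variant
w_dc2 = np.array([[0.4, 0.6]])     # dialect-2 weights (Q = 2)

# Merged weights, following the last line of the equation above.
w_sc_new = lam1 * lam2 * w_sc
w_dc1_new = lam2 * (1 - lam1) * P_d1[:, None] * w_dc1
w_dc2_new = (1 - lam2) * P_d2[:, None] * w_dc2

total = w_sc_new.sum() + w_dc1_new.sum() + w_dc2_new.sum()
print(total)  # 1.0 (up to float rounding): the merged model is a valid mixture
```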
A description is now given of exemplary embodiments:
As evident from Table 1, the data are divided into the standard common language, the Chuan (an abbreviation of the Sichuan dialect) dialectal common language, and the Minnan dialectal common language, and each is further divided into two parts, namely data for training/development and data for testing.
Baseline:
A mixed training recognition model is used as the baseline. This mixed-training recognition model is trained on the total of the three kinds of data (standard and two dialectal).
Results of experiments:
As evident from Table 3, the use of a model trained by the method of calculation according to the present embodiment clearly improves the recognition rate for the two dialects. At the same time, the recognition rate for the standard common language is also significantly improved. Thus the method according to the above-described embodiment proves viable and effective.
Further, according to the above-described methods, the final recognition model can be obtained by iteratively merging each dialectal-accented common-language model into the standard common-language model.