1. Field of the Invention
The present invention relates to a speech recognition technology, particularly to a Chinese speech recognition system and method.
2. Description of the Related Art
The prosody-aided speech recognition technology has been an important subject in recent years. Prosody refers to the suprasegmental features of continuous speech, including accents, tones, pauses, intonations, rhythms, etc. Prosody is physically expressed by pitch trajectories, energy intensities, durations of voiced segments, and pauses of speech. Prosody closely correlates with linguistic parameters at various levels, including phone, syllable, word, phrase, sentence, and even higher levels. Therefore, prosody is useful for improving speech recognition accuracy.
Conventional prosody-aided technologies can utilize only a few obvious prosodic cues because they lack a large-scale corpus with abundant, reliable and diversified prosodic tags. Therefore, they improve speech recognition performance only to a very limited extent.
Accordingly, the present invention proposes a Chinese speech recognition system and method to overcome the abovementioned problems.
The primary objective of the present invention is to provide a Chinese speech recognition system and method, wherein a prosodic state model, a prosodic break model, a syllable prosodic model, and a syllable juncture model are used to reduce word recognition errors and tone recognition errors and to improve the recognition rates of words, characters, and base-syllables of Chinese speech, and wherein parts of speech (POS), punctuation marks (PM), prosodic breaks, and prosodic states of Chinese speech files are tagged to provide prosodic structures and linguistic information for subsequent voice conversion and voice synthesis.
To achieve the abovementioned objective, the present invention proposes a Chinese speech recognition system, which comprises a language model storage device storing a factored language model; a hierarchical prosodic model comprising a prosodic break (sub-)model, a prosodic state (sub-)model, a syllable prosodic-acoustic (sub-)model and a syllable juncture prosodic-acoustic (sub-)model; a speech recognition device; and a rescorer. The speech recognition device receives a speech signal, recognizes the speech signal and outputs a word lattice. The language model storage device, the hierarchical prosodic model and the speech recognition device are connected with the rescorer. The rescorer receives the word lattice, uses the prosodic break model, prosodic state model, syllable prosodic-acoustic model, syllable juncture prosodic-acoustic model and factored language model to rescore and rerank word arcs of the word lattice, and outputs a language tag, a prosodic tag and a phonetic segmentation tag.
The present invention also proposes a Chinese speech recognition method, which comprises the steps of: receiving a speech signal, recognizing the speech signal and outputting a word lattice; and receiving the word lattice, rescoring word arcs of the word lattice according to a prosodic break model, a prosodic state model, a syllable prosodic-acoustic model, a syllable juncture prosodic-acoustic model and a factored language model, reranking the word arcs, and outputting a language tag, a prosodic tag and a phonetic segmentation tag.
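Conceptually, the rescorer walks the word lattice produced by the first pass and reranks candidate paths with a weighted combination of model scores. The following is a minimal Python sketch of that two-stage idea, not the patent's implementation; the names (WordArc, rescore, the model-name keys) are hypothetical, and the scores are assumed to be log-probabilities.

```python
# Minimal sketch of lattice rescoring: each lattice path is a list of word
# arcs, and each arc carries the log-domain scores assigned by the acoustic,
# language, and prosodic models (keys are hypothetical).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class WordArc:
    word: str
    log_scores: Dict[str, float] = field(default_factory=dict)

def rescore(paths: List[List[WordArc]], weights: Dict[str, float]) -> List[WordArc]:
    """Rerank candidate paths by a weighted sum of per-arc model log-scores."""
    def path_score(path: List[WordArc]) -> float:
        # Assumes every arc carries a score for every weighted model.
        return sum(weights[m] * arc.log_scores[m] for arc in path for m in weights)
    return max(paths, key=path_score)
```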
Below, the embodiments are described in detail in conjunction with the drawings to make the technical contents and accomplishments of the present invention easily understood.
Refer to the corresponding drawings of the embodiments of the present invention.
In one embodiment, Equation (1) is decoded to obtain an optimal language tag Λ_l = {W, POS, PM}, an optimal prosodic tag Λ_p = {B, P} and an optimal phonetic segmentation tag γ_s.
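Equation (1) itself is not reproduced in this text. A plausible reconstruction, consistent with the factor models defined below, is the MAP decoding

$$(\hat{\Lambda}_l, \hat{\Lambda}_p, \hat{\gamma}_s) = \arg\max_{\Lambda_l,\Lambda_p,\gamma_s} P(\Lambda_l,\Lambda_p,\gamma_s \mid X_a, X_p) \approx \arg\max_{\Lambda_l,\Lambda_p,\gamma_s} P(X_a \mid \gamma_s,\Lambda_l)\, P(X \mid \gamma_s,\Lambda_p,\Lambda_l)\, P(Y,Z \mid \gamma_s,\Lambda_p,\Lambda_l)\, P(P \mid B)\, P(B \mid \Lambda_l)\, P(\Lambda_l),$$

wherein X_a denotes the acoustic feature sequence and P(X_a|γ_s,Λ_l) the acoustic model; the symbol X_a and the exact ordering of factors are assumptions, since the original equation is not available here.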
wherein P(B|Λ_l), P(P|B), P(X|γ_s,Λ_p,Λ_l) and P(Y,Z|γ_s,Λ_p,Λ_l) are respectively the prosodic break model, the prosodic state model, the syllable prosodic-acoustic model, and the syllable juncture prosodic-acoustic model, and wherein W = w_1^M = (w_1, …, w_M) is the word sequence, POS = pos_1^M = (pos_1, …, pos_M) the part-of-speech (POS) sequence associated with W, and PM = pm_1^M = (pm_1, …, pm_M) the punctuation-mark (PM) sequence, M being the number of words in the speech signal; B = B_1^N = (B_1, …, B_N) is the prosodic break sequence, and P = {p, q, r}, with p = p_1^N, q = q_1^N and r = r_1^N representing the prosodic state sequences of syllable pitch level, duration and energy level, respectively, N being the number of syllables in the speech signal; X_p = {X, Y, Z} is the prosodic acoustic parameter sequence, wherein X is the syllable prosodic-acoustic parameters, Y the syllable juncture prosodic-acoustic parameters, and Z the syllable juncture difference parameters.
Refer to the corresponding drawing. In rescoring, sixteen probability scores derived from the abovementioned models are combined by Equation (2).
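Equation (2) is not reproduced in this text. Given the definitions that follow, a plausible form is the weighted log-linear combination

$$\mathrm{score} = \Lambda_a \, S^{\mathsf T} = \sum_{i=1}^{16} \alpha_i \, s_i,$$

wherein the s_i are assumed to be the sixteen log-domain probability scores; this reconstruction is an assumption based on the vector definitions below.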
wherein S = [s_1, …, s_16] is a 16-dimensional vector formed by these sixteen probabilities, and wherein Λ_a = [α_1, …, α_16] is a weighting vector determined by a discriminative model combination algorithm.
Refer to the corresponding drawing of the training process of the hierarchical prosodic model, which comprises the following steps:
S10: Initial labeling of break indices: a decision tree is used to label the initial break types.
S12: Initialization of the 12 prosodic (sub-)models.
S14: Sequentially update the affecting patterns (APs) of tone, coarticulation, and base-syllable/final type, with all other APs fixed.
S16: Re-label the prosodic state sequence of each utterance.
S18: Update the APs of the prosodic states, with all other APs fixed.
S20: Re-label the break types.
S22: Update the decision trees of the break-syntax model and the syllable juncture prosodic-acoustic model.
S24: The convergence decision device repeats Steps S14-S22 until convergence is reached. As shown in Step S26, when convergence is reached, the prosodic break model, the prosodic state model, the syllable prosodic-acoustic model and the syllable juncture prosodic-acoustic model are generated. (A schematic sketch of this training loop is given below.)
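Viewed as pseudocode, the loop alternates AP re-estimation with prosodic re-labeling. Below is a minimal Python sketch of the control flow of Steps S10-S26, assuming hypothetical helper callables supplied by the caller (initial_break_labeling, the steps mapping, converged); it illustrates only the iteration structure, not the estimation formulas of the sub-models described below.

```python
# Schematic sketch of the iterative joint prosody labeling and modeling loop.
from typing import Callable, Dict

def train_prosodic_models(corpus,
                          initial_break_labeling: Callable,
                          steps: Dict[str, Callable],
                          converged: Callable,
                          max_iters: int = 50):
    breaks = initial_break_labeling(corpus)                 # S10: decision-tree initial labels
    models = steps["initialize"](corpus, breaks)            # S12: 12 prosodic sub-models
    for _ in range(max_iters):                              # S24: repeat S14-S22
        steps["update_segmental_aps"](models, corpus)       # S14: tone/coarticulation/syllable APs
        states = steps["relabel_states"](models, corpus)    # S16: re-label prosodic states
        steps["update_state_aps"](models, corpus, states)   # S18: prosodic-state APs
        breaks = steps["relabel_breaks"](models, corpus)    # S20: re-label break types
        steps["update_break_trees"](models, corpus, breaks) # S22: break-syntax / juncture trees
        if converged(models):
            break
    return models                                           # S26: the four prosodic models
```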
The joint prosody labeling and modeling processor 32 automatically labels the prosodic state sequence P and the prosodic break sequence B of the speech signal. Using the large-scale prosody-unlabeled database 24 to undertake prosodic tagging and establish the prosodic models is thus time- and cost-efficient for the present invention.
Below, the abovementioned models are introduced. The factored language model is expressed by Equation (3):
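The equation is not reproduced in this text. A trigram-style factored form consistent with the definitions below is one plausible reconstruction (an assumption):

$$P(\Lambda_l) = P(W, POS, PM) \approx \prod_{i=1}^{M} P\!\left(w_i, pos_i, pm_i \mid w_{i-2}^{i-1}, pos_{i-2}^{i-1}, pm_{i-2}^{i-1}\right)$$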
wherein w_i is the ith word, pos_i the ith POS tag, and pm_i the ith PM tag.
The prosodic break model is expressed by Equation (4):
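The equation is not reproduced in this text; consistent with the definition of Λ_{l,n} below, it plausibly factors over syllable junctures:

$$P(B \mid \Lambda_l) \approx \prod_{n=1}^{N} P(B_n \mid \Lambda_{l,n})$$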
wherein Λ_{l,n} denotes the contextual linguistic parameters surrounding syllable n.
The prosodic state model is expressed by Equation (5):
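The equation is not reproduced in this text. One plausible first-order realization, consistent with the three state sequences defined below, conditions each state on the preceding state and break (an assumption):

$$P(P \mid B) = P(p, q, r \mid B) \approx \prod_{n=1}^{N} P(p_n \mid p_{n-1}, B_{n-1})\, P(q_n \mid q_{n-1}, B_{n-1})\, P(r_n \mid r_{n-1}, B_{n-1})$$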
wherein p_n, q_n and r_n are respectively the prosodic states of pitch level, duration and energy level of the nth syllable.
The syllable prosodic-acoustic model is expressed by Equation (6-1):
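The equation is not reproduced in this text; a form consistent with the three sub-models named in the following paragraph is:

$$P(X \mid \gamma_s, \Lambda_p, \Lambda_l) \approx \prod_{n=1}^{N} P(sp_n \mid p_n, B_{n-1}^{n}, t_{n-1}^{n+1})\, P(sd_n \mid q_n, s_n, t_n)\, P(se_n \mid r_n, f_n, t_n)$$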
wherein sp_n, sd_n, se_n, t_n, s_n and f_n are respectively the syllable pitch contour, syllable duration, syllable energy, tone, base-syllable type and final type of the nth syllable, and wherein P(sp_n|p_n,B_{n-1}^n,t_{n-1}^{n+1}), P(sd_n|q_n,s_n,t_n) and P(se_n|r_n,f_n,t_n) are respectively the sub-models of the syllable pitch contour, syllable duration and syllable energy of the nth syllable, with B_{n-1}^n = (B_{n-1}, B_n) and t_{n-1}^{n+1} = (t_{n-1}, t_n, t_{n+1}). Each of the three sub-models takes several affecting patterns (APs) into consideration, and the APs are integrated additively. For example, the sub-model of the pitch contour of the nth syllable is expressed by Equation (6-2):
$$sp_n = sp_n^r + \beta_{t_n} + \beta_{p_n} + \beta^{f}_{B_{n-1},\,t_{n-1}^{n}} + \beta^{b}_{B_n,\,t_n^{n+1}} + \mu_{sp}$$

wherein sp_n is a 4-dimensional vector of orthogonal coefficients of the pitch contour observed in the nth syllable; sp_n^r is the normalized sp_n (the modeling residue); β_{t_n} and β_{p_n} are the APs of the tone t_n and the pitch prosodic state p_n; β^f and β^b are the forward and backward coarticulation APs conditioned on the neighboring breaks and tones; and μ_sp is the global mean of the syllable pitch contour. The sub-model is accordingly realized as

$$P(sp_n \mid p_n, B_{n-1}^{n}, t_{n-1}^{n+1}) = N\!\left(sp_n;\ \beta_{t_n} + \beta_{p_n} + \beta^{f}_{B_{n-1},\,t_{n-1}^{n}} + \beta^{b}_{B_n,\,t_n^{n+1}} + \mu_{sp},\ R_{sp}\right)$$

wherein R_sp is the covariance of the modeling residue.
The sub-model of the syllable duration, P(sd_n|q_n,s_n,t_n), and the sub-model of the syllable energy level, P(se_n|r_n,f_n,t_n), are realized in a similar way:

$$P(sd_n \mid q_n, s_n, t_n) = N\!\left(sd_n;\ \gamma_{t_n} + \gamma_{s_n} + \gamma_{q_n} + \mu_{sd},\ R_{sd}\right)$$
$$P(se_n \mid r_n, f_n, t_n) = N\!\left(se_n;\ \omega_{t_n} + \omega_{f_n} + \omega_{r_n} + \mu_{se},\ R_{se}\right)$$

wherein the γ's and ω's represent the APs for syllable duration and syllable energy level, μ_sd and μ_se are their global means, and R_sd and R_se are the variances of the modeling residues.
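To make the additive AP integration concrete, the following Python sketch evaluates the Gaussian realization of the pitch-contour sub-model of Equation (6-2), assuming the APs and the residue covariance have already been estimated; all variable names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch: log-likelihood of an observed 4-dimensional syllable pitch contour
# under a Gaussian whose mean is the additive combination of the APs.
def log_p_pitch(sp_n, beta_tone, beta_state, beta_fwd, beta_bwd, mu_sp, R_sp):
    mean = beta_tone + beta_state + beta_fwd + beta_bwd + mu_sp  # additive APs
    return multivariate_normal.logpdf(sp_n, mean=mean, cov=R_sp)
```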
The syllable juncture prosodic-acoustic model is expressed by Equation (7-1):
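The equation is not reproduced in this text. A plausible reconstruction, consistent with the five juncture sub-models described below (pause duration, energy dip, pitch jump, and two duration-lengthening factors), is:

$$P(Y, Z \mid \gamma_s, \Lambda_p, \Lambda_l) \approx \prod_{n=1}^{N} P(pd_n \mid B_n, \Lambda_{l,n})\, P(ed_n \mid B_n, \Lambda_{l,n})\, P(pj_n \mid B_n)\, P(dl_n \mid B_n)\, P(df_n \mid B_n)$$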
wherein pd_n and ed_n are respectively the pause duration of the juncture following syllable n and the energy-dip level of juncture n;

$$pj_n = \left(sp_{n+1}(1) - \beta_{t_{n+1}}(1)\right) - \left(sp_n(1) - \beta_{t_n}(1)\right)$$

is the normalized pitch-level jump across juncture n, wherein sp_n(1) is the first dimension of the syllable pitch contour sp_n and β_{t_n}(1) is the first dimension of the tone AP β_{t_n}; and

$$dl_n = sd_n - \gamma_{t_n} - \gamma_{s_n}$$
$$df_n = \left(sd_n - \gamma_{t_n} - \gamma_{s_n}\right) - \left(sd_{n+1} - \gamma_{t_{n+1}} - \gamma_{s_{n+1}}\right)$$

are two normalized duration-lengthening factors before and across juncture n.
In the present invention, pd_n is modeled with a gamma distribution, and the other four sub-models are modeled with normal distributions. Because the space of Λ_{l,n} is too large for the prosodic breaks, Λ_{l,n} is divided into several types, and the parameters of the gamma and normal distributions are estimated simultaneously.
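As a small illustration of these distributional choices, the following Python sketch fits a gamma distribution to pause durations and a normal distribution to one of the other juncture features; the data are toy values, and the use of scipy here is an assumption for illustration, not part of the patent.

```python
import numpy as np
from scipy import stats

# Sketch: maximum-likelihood fits for two juncture sub-models.
pause_durations = np.array([0.02, 0.15, 0.40, 0.08, 0.31])   # seconds (toy data)
shape, loc, scale = stats.gamma.fit(pause_durations, floc=0)  # gamma fit, location fixed at 0
pitch_jumps = np.array([-0.8, 0.1, 1.2, -0.3])                # toy data
mu, sigma = stats.norm.fit(pitch_jumps)                       # normal fit (mean, std)
```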
However, the present invention does not require the prosodic models to adopt the methods and distributions mentioned in the abovementioned embodiments. The method and distribution used by each of the abovementioned four prosodic models can be modified according to practical applications.
Below, the two-stage operating process of the present invention is described with reference to the corresponding drawing.
Below, the process whereby the hierarchical prosodic model 18 generates the abovementioned prosodic break model, prosodic state model, syllable prosodic-acoustic model and syllable juncture prosodic-acoustic model is described with reference to the corresponding drawing.
After the training of the prosodic break model, prosodic state model, syllable prosodic-acoustic model and syllable juncture prosodic-acoustic model is completed, the relationships among the low-level linguistic parameters, the high-level linguistic parameters, the prosodic state sequence P, the prosodic break sequence B, the syllable prosodic-acoustic parameters X, the syllable juncture prosodic-acoustic parameters Y, and the syllable juncture difference parameters Z are established, as shown in the corresponding drawing.
Table 1 shows the results of a speech recognition experiment performed with the speech recognition device of the abovementioned embodiment.
Tables 2, 3 and 4 show the results of the POS, PM and tone decoding experiments, respectively, summarized below:

Experiment | System | Precision | Recall | F-measure
---|---|---|---|---
POS decoding (Table 2) | Basic system | 93.4% | 76.4% | 84.0%
POS decoding (Table 2) | Present invention | 93.4% | 80.0% | 86.2%
PM decoding (Table 3) | Basic system | 55.2% | 37.8% | 44.8%
PM decoding (Table 3) | Present invention | 61.2% | 53.0% | 56.8%
Tone decoding (Table 4) | Basic system | 87.9% | 87.5% | 87.7%
Tone decoding (Table 4) | Present invention | 91.9% | 91.6% | 91.7%
Refer to the corresponding drawings, which show the waveform of an exemplary utterance together with its decoded prosodic tags.
The unit of the time axis of the waveform is the second. Each triangular symbol denotes a short break. There are four prosodic phrases (PPh) in the waveform, and the experiment indeed decodes four PPh's separated by B3 breaks. Prosodic words (PW), separated by B2 breaks, are decoded from each prosodic phrase. It can be observed from the syllable pitch-level prosodic states that pitch-level resets occur at all three B3 breaks, and from the syllable duration prosodic states that the duration of the syllable preceding a B2-3 break is lengthened. The tags show that the prosodic breaks and the prosodic states form hierarchical prosodic structures.
In conclusion, the present invention performs rescoring in two stages. Thereby, the present invention not only improves the accuracy of basic speech recognition but also tags the language, prosody and phonetic segmentation for succeeding applications.
The embodiments described above are only to exemplify the present invention, not to limit the scope of the present invention. Any equivalent modification or variation according to the characteristics or spirit of the present invention is also to be included within the scope of the present invention.
Foreign Application Priority Data

Number | Date | Country | Kind
---|---|---|---
100116350 A | May 2011 | TW | national
100142341 A | Nov 2011 | TW | national
References Cited: U.S. Patent Documents

Number | Name | Date | Kind
---|---|---|---
6978239 | Chu | Dec 2005 | B2
7263488 | Chu et al. | Aug 2007 | B2
7409346 | Acero et al. | Aug 2008 | B2
7747437 | Verhasselt et al. | Jun 2010 | B2
7761296 | Bakis et al. | Jul 2010 | B1
8374873 | Stephens, Jr. | Feb 2013 | B2
20050182625 | Azara et al. | Aug 2005 | A1
20050256713 | Garg et al. | Nov 2005 | A1
20110046958 | Liu et al. | Feb 2011 | A1
Foreign Patent Documents

Number | Date | Country
---|---|---
325556 | Jan 1998 | TW
508564 | Nov 2002 | TW
I319152 | Jan 2010 | TW
I319563 | Jan 2010 | TW
Other Publications

Entry |
---|
Ananthakrishnan et al., "Improved Speech Recognition Using Acoustic and Lexical Correlates of Pitch Accent in a N-Best Rescoring Framework," Proc. ICASSP 2007, vol. 4, pp. IV-873-IV-876, Apr. 2007. |
Lin et al., "Spontaneous Mandarin Speech Recognition with Disfluencies Detected by Latent Prosodic Modeling (LPM)," Proc. International Symposium on Linguistic Patterns in Spontaneous Speech (LPSS 2006), pp. 159-173, 2006. |
Liu et al., "An Implementation of Prosody-Assisted Mandarin Speech Recognition System," Thesis, Institute of Communication Engineering, College of Electrical and Computer Engineering, National Chiao Tung University, pp. 1-82, Jul. 2011. |
Chiang et al., "Unsupervised Joint Prosody Labeling and Modeling for Mandarin Speech," J. Acoust. Soc. Am., vol. 125, no. 2, pp. 1164-1183, Feb. 2009. |
Chang et al., "Enriching Mandarin Speech Recognition by Incorporating a Hierarchical Prosody Model," Thesis, Institute of Communication Engineering, College of Electrical and Computer Engineering, National Chiao Tung University, pp. 1-63, Aug. 2011. |
Sankaranarayanan Ananthakrishnan et al., "Unsupervised Adaptation of Categorical Prosody Models for Prosody Labeling and Speech Recognition," IEEE Trans. on Audio, Speech, and Language Processing, vol. 17, no. 1, pp. 138-149, Jan. 2009. |
Dimitra Vergyri et al., "Prosodic Knowledge Sources for Automatic Speech Recognition," Proc. ICASSP 2003, pp. I-208-I-211, 2003. |
Mari Ostendorf et al., "Prosody Models for Conversational Speech Recognition," Proc. 2nd Plenary Meeting Symp. Prosody and Speech Processing, pp. 147-154, 2003. |
Xin Lei et al., "Word-Level Tone Modeling for Mandarin Speech Recognition," Proc. ICASSP 2007, pp. IV-665-IV-668, 2007. |
Chongjia Ni, "Improved Large Vocabulary Mandarin Speech Recognition Using Prosodic and Lexical Information in Maximum Entropy Framework," Proc. CCPR 2009, pp. 1-4, 2009. |
Yang Liu, "Enriching Speech Recognition with Automatic Detection of Sentence Boundaries and Disfluencies," IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1526-1540, Sep. 2006. |
Publication Data

Number | Date | Country
---|---|---
20120290302 A1 | Nov 2012 | US