The application claims the benefit of Taiwan Patent Application No. 102104478, filed on Feb. 5, 2013, in the Taiwan Intellectual Property Office, the disclosures of which are incorporated herein in their entirety by reference.
The present invention relates to a speech-synthesizing device, and more particularly to a streaming encoder, a prosody information encoding device, a prosody-analyzing device, and a device and method for speech synthesis.
In traditional segment-based speech coding, the prosodic information corresponding to speech segments is usually encoded by directly quantizing the prosodic features, without using a prosodic model with linguistic meaning to perform parameterized prosody coding. Some of these traditional methods encode the duration and pitch contour of the phonemes in each syllable: pre-stored representative duration and pitch contour templates are used as the duration and the pitch contour of the phonemes, without considering a prosody generating model. Speech coded in this way is difficult to subject to prosodic transformation.
Another approach codes the pitch contour by representing it with linear segments. The pitch contour information is represented by the slopes and endpoint values of those linear segments, and representative linear segment templates are stored in a codebook used for coding the pitch contour. The method is simple but does not consider a prosody generating model, so speech coded in this way is difficult to subject to prosodic transformation.
Another method applies scalar quantization to the pitch contour of a phoneme: the average pitch and the slope of the phoneme are used to represent its pitch contour, and scalar quantization is performed on the average pitch and the slope. This method does not consider a prosody generating model, so speech coded in this way is difficult to subject to prosodic transformation.
Another method normalizes the duration and the average pitch of a phoneme by subtracting the average duration and average pitch contour of the corresponding phoneme type from the observed duration and pitch contour, and then performs scalar quantization on the normalized phoneme duration and pitch contour. Such a method may reduce the transmission data rate. Because it does not consider a prosody generating model, however, speech coded in this way is difficult to subject to prosodic transformation.
Yet another method divides the speech into segments with different numbers of frames. Each segment has a pitch contour represented by the average pitch of the frame, while its energy contour is represented with vector quantization. The method does not consider a prosody generating model, so speech coded in this way is difficult to subject to prosodic transformation.
There is also a method that uses piecewise linear approximation (PLA) to represent the pitch. The PLA information includes the pitch values and time information of the endpoints of each segment and of the critical points. Some articles use scalar quantization to represent this information, while others use vector quantization. Other articles describe traditional frame-based speech coders, which quantize the pitch information of each frame and can represent the pitch accurately, but at a high data rate.
Some articles introduce a method that quantizes the pitch contour of a segment with pitch contour templates stored in a codebook and encodes the templates. The method can encode the pitch information at a very low data rate, but with higher distortion.
The encoding process of the prior arts can be summarized as below: (1) segmentation of the speech into segments; and (2) encoding of the spectrum and the prosodic information of the segments. Usually, for one segment, the corresponding phoneme, syllable or the acoustic unit defined by the system can be obtained. The segmentation can be performed by automatic speech recognition or can be done by forced alignment given known phoneme, syllable or the acoustic unit defined by the system. Then, each segment is encoded with the spectrum information and prosodic message thereof.
On the other hand, the reconstruction of the encoded speech by the segment-based speech encoder includes the following steps: (1) decoding and reconstruction of the spectrum and prosodic information; and (2) speech synthesis.
Most prior art technologies pay more attention to the encoding of spectrum information and less to the encoding of prosodic information. The prior art often encodes the prosodic information by quantization, without considering the model behind the prosodic information, and therefore finds it hard to obtain a lower encoding data rate or to transform the encoded speech by systematic methods.
To overcome the drawbacks of the prior art, a speech-synthesizing device is provided, together with a streaming encoder, a prosody information encoding device, a prosody-analyzing device, and a device and method for speech synthesis. The novel design of the present invention not only solves the problems described above but is also easy to implement. Thus, the present invention has industrial utility.
In accordance with one aspect of the present invention, a speech-synthesizing device is provided. The speech-synthesizing device includes a hierarchical prosodic module, a prosody-analyzing device, and a prosody-synthesizing unit. The hierarchical prosodic module generates at least a first hierarchical prosodic model. The prosody-analyzing device receives a low-level linguistic feature, a high-level linguistic feature and a first prosodic feature, and generates at least a prosodic tag based on the low-level linguistic feature, the high-level linguistic feature, the first prosodic feature and the first hierarchical prosodic model. The prosody-synthesizing unit synthesizes a second prosodic feature based on the hierarchical prosodic module, the low-level linguistic feature and the prosodic tag.
In accordance with a further aspect of the present invention, a prosodic information encoding apparatus is provided. The prosodic information encoding apparatus includes a speech segmentation and prosodic feature extracting device, a prosodic structure analysis unit and an encoder. The speech segmentation and prosodic feature extracting device receives an input speech and a low-level linguistic feature to generate a first prosodic feature. The prosodic structure analysis unit receives the first prosodic feature, the low-level linguistic feature and a high-level linguistic feature, and generates a prosodic tag based on the first prosodic feature, the low-level linguistic feature and the high-level linguistic feature. The encoder receives the prosodic tag and the low-level linguistic feature to generate a code stream.
In accordance with a further aspect of the present invention, a code stream generating apparatus is provided. The code stream generating apparatus comprises a prosodic feature extractor, a hierarchical prosodic module and an encoder. The prosodic feature extractor generates a first prosodic feature. The hierarchical prosodic module provides a prosodic structure meaning for the first prosodic feature. The encoder generates a code stream based on the first prosodic feature having the prosodic structure meaning. The hierarchical prosodic module has at least two parameters being ones selected from the group consisting of a syllable duration, a syllable pitch contour, a pause timing, a pause frequency, a pause duration and a combination thereof.
In accordance with a further aspect of the present invention, a method for synthesizing a speech is provided. The method comprises steps of providing a hierarchical prosodic module, a low-level linguistic feature, a high-level linguistic feature and a first prosodic feature; generating at least a prosodic tag based on the low-level linguistic feature, the high-level linguistic feature, the first prosodic feature and the hierarchical prosodic module; and outputting the speech according to the prosodic tag.
In accordance with a further aspect of the present invention, a prosodic structure analysis unit is provided. The prosodic structure analysis unit comprises a first input terminal, a second input terminal, a third input terminal and an output terminal. The first input terminal receives a first prosodic feature. The second input terminal receives a low-level linguistic feature. The third input terminal receives a high-level linguistic feature. The prosodic structure analysis unit generates a prosodic tag at the output terminal based on the first prosodic feature, the low-level and the high-level linguistic features.
In accordance with yet another aspect of the present invention, a prosodic structure analysis apparatus is provided. The prosodic structure analysis apparatus includes a hierarchical prosodic module and a prosodic structure analysis unit. The hierarchical prosodic module generates a hierarchical prosodic model. The prosodic structure analysis unit receives a first prosodic feature, a low-level linguistic feature and a high-level linguistic feature, and generates a prosodic tag based on the first prosodic feature, the low-level and the high-level linguistic features and the hierarchical prosodic model.
The above objects and advantages of the present invention will become more readily apparent to those ordinarily skilled in the art after reviewing the following detailed descriptions and accompanying drawings, in which:
The present invention will now be described more specifically with reference to the following embodiments. It is to be noted that the following descriptions of preferred embodiments of this invention are presented herein for the purposes of illustration and description only; it is not intended to be exhaustive or to be limited to the precise form disclosed.
To achieve the aforementioned objective, the present invention employs a hierarchical prosodic module in a prosody encoding apparatus whose block diagram is shown in
The basic concepts of the present invention are set forth below. First, a speech signal and its corresponding low-level linguistic feature A1 are input into the speech segmentation and prosodic feature extractor 101, which uses an acoustic model to divide the input speech at syllable boundaries and obtains syllable prosodic features for use by the subsequent prosodic structure analysis unit 103.
The main purpose of the hierarchical prosodic module 102 is to describe the prosodic hierarchical structure of Mandarin Chinese. It includes a syllable prosodic-acoustic model, a syllable juncture prosodic-acoustic model, a prosodic state model, and a break-syntax model.
The main purpose of the prosodic structure analysis unit 103 is to use the hierarchical prosodic module 102 to analyze the prosodic feature A3, which is generated by the speech segmentation and prosodic feature extractor 101, and then to represent the speech prosody by prosodic structures in terms of prosodic tags.
The main function of the encoder 104 is to encode the information necessary for reconstructing the speech prosody and to form the bit stream. This information includes the prosodic tag A4 generated by the prosodic structure analysis unit 103 and the input low-level linguistic feature A1.
The main function of the decoder 105 is to decode the bit stream A5 and recover the prosodic tag A6 and the low-level linguistic feature A1 required by the prosodic feature synthesizer unit 106.
The main function of the prosodic feature synthesizer unit 106 is to make use of the decoded prosodic tag A6 and the low-level linguistic feature A1 to synthesize and reconstruct the speech prosodic feature A7, with the input from the hierarchical prosodic module 102 as side information.
The main function of the speech synthesizer 107 is to synthesize the speech with the reconstructed prosodic feature A7 and the low-level linguistic feature A1 based on the hidden Markov model.
The prosodic structure analysis device 108 comprises the hierarchical prosodic module 102 and the prosodic structure analysis unit 103. Using the hierarchical prosodic module 102, the prosodic structure analysis unit 103 represents the prosodic feature A3 of the input speech by prosodic structures in terms of prosodic tags A4.
The prosodic feature synthesizer device 109 comprises the hierarchical prosodic module 102 and the prosodic feature synthesizer unit 106. With the hierarchical prosodic module 102 as the side information provider, the prosodic feature synthesizer unit 106 generates a second prosodic feature A7 from the second prosodic tag A6 and the low-level linguistic feature A1 reconstructed by the decoder 105.
The prosodic message encoding device 110 comprises the speech segmentation and prosodic feature extractor 101, the hierarchical prosodic module 102, the prosodic structure analysis unit 103, the encoder 104 and the prosodic structure analysis device 108. The prosodic message encoding device 110 firstly uses the speech segmentation and prosodic feature extractor 101 to segment the input speech by the low-level linguistic feature A1 and to obtain a first prosodic feature A3. Then the prosodic structure analysis device 108 generates a first prosodic tag A4 based on the first prosodic feature A3, the low-level linguistic feature A1 and a high-level linguistic feature A2. The encoder 104 then forms a code stream A5 based on the first prosodic tag A4 and the low-level linguistic feature A1.
The prosodic message decoding device 111 comprises the hierarchical prosodic module 102, the decoder 105, the prosodic feature synthesizer unit 106, the speech synthesizer 107 and the prosodic feature synthesizer device 109. The decoder 105 decodes the code stream A5, generated from the prosodic message encoding device 110, to reconstruct a second prosodic tag A6 and the low-level linguistic feature A1, which are used to synthesize a second prosodic feature A7 by the prosodic feature synthesizer device 109. The second prosodic feature A7 is then used to generate the output speech by the speech synthesizer 107.
The equations set forth hereinafter are for introducing some preferred embodiments according to the present invention. The following equation is employed by the prosodic structure analysis unit 103 for representing the speech prosody by prosodic structures in terms of prosodic tags. The method is to input the prosodic acoustic feature sequence (A) and the linguistic feature sequence (L) into the prosodic structure analysis unit 103, which may output the best prosodic tag sequence (T). The best prosodic tag sequence (T) can be used for representing the prosodic features of the speech and then for later encoding. The corresponding mathematical equation is:
wherein $A=\{X,Y,Z\}=\{A_1^N\}=\{X_1^N,Y_1^N,Z_1^N\}$ is the prosodic acoustic feature sequence, $N$ is the number of syllables in the speech, and $X$, $Y$ and $Z$ denote the syllable-based prosodic acoustic feature, the inter-syllable prosodic acoustic feature and the differential prosodic acoustic feature, respectively.
$L=\{POS,PM,WL,t,s,f\}=\{L_1^N\}=\{POS_1^N,PM_1^N,WL_1^N,t_1^N,s_1^N,f_1^N\}$ is a linguistic feature sequence, wherein $\{POS,PM,WL\}$ is a high-level linguistic feature sequence in which $POS$, $PM$ and $WL$ denote the part-of-speech sequence, the punctuation mark sequence and the word length sequence, respectively, and $\{t,s,f\}$ is a low-level linguistic feature sequence in which $t$, $s$ and $f$ denote the tone, the base-syllable type and the syllable final type, respectively.
$T=\{B,P\}$ is a prosodic tag sequence, where $B=\{B_1^N\}$ is a prosodic break sequence and $P=\{p,q,r\}$ is a prosodic state sequence, in which $p$, $q$ and $r$ denote the syllable pitch prosodic state sequence, the syllable duration prosodic state sequence and the syllable energy prosodic state sequence, respectively.
The prosodic tag sequence is to describe the Mandarin Chinese prosodic hierarchical structure concerned by the hierarchical prosodic module 102. Referring to
Hierarchical Prosodic Module
P(X|B,P,L)P(Y,Z|B,L)P(P|B)P(B|L)
For realizing the hierarchical prosodic module, more details are described. The model has 4 sub-models, which are syllable prosodic-acoustic model P(X|B,P,L), syllable juncture prosodic-acoustic model P(Y,Z|B,L), prosodic state model P(P|B) and break-syntax model P(B|L).
The syllable prosodic-acoustic model $P(X|B,P,L)$ can be approximated with the following sub-models:
$$P(X|B,P,L) \approx \prod_{n=1}^{N} P(sp_n|B_{n-1}^{n},p_n,t_{n-1}^{n+1})\,P(sd_n|q_n,s_n,t_n)\,P(se_n|r_n,f_n,t_n)$$
wherein $P(sp_n|B_{n-1}^{n},p_n,t_{n-1}^{n+1})$, $P(sd_n|q_n,s_n,t_n)$ and $P(se_n|r_n,f_n,t_n)$ respectively denote the pitch contour model, the duration model and the energy level model of the n-th syllable, $t_n$, $s_n$ and $f_n$ respectively denote the tone, the base-syllable type and the final type of the n-th syllable, and $B_{n-1}^{n}=(B_{n-1},B_n)$ and $t_{n-1}^{n+1}=(t_{n-1},t_n,t_{n+1})$ respectively denote the prosodic break sequence and the tone sequence.
In this embodiment, the three sub-models take more factors into account. Those factors are combined by means of superimposing. Taking the pitch contour of the n-th syllable for example, one may obtain the formula:
$$sp_n = sp_n^r + \beta_{t_n} + \beta_{p_n} + \beta^f_{B_{n-1},tp_{n-1}} + \beta^b_{B_n,tp_n} + \mu_{sp}$$
where $sp_n=[\alpha_{0,n},\alpha_{1,n},\alpha_{2,n},\alpha_{3,n}]$ is a four-dimensional vector representing the pitch contour observed from the n-th syllable. The coefficients can be derived from:
where $F_n(i)$ is the pitch of the i-th frame of the n-th syllable, $M_n+1$ is the number of frames of the n-th syllable having pitch, and
the j-th orthogonal basis.
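The orthogonal expansion described above can be illustrated with a minimal sketch. The text does not reproduce the basis functions, so this example assumes Legendre polynomials sampled on an evenly spaced frame grid and orthonormalized numerically; the function names `orthonormal_basis`, `encode_pitch_contour` and `decode_pitch_contour` are hypothetical, and the normalization may differ from the one used in the embodiment.

```python
import numpy as np


def orthonormal_basis(num_frames, order=3):
    """Return a (num_frames, order + 1) matrix with orthonormal columns.

    Assumption: Legendre polynomials sampled on an evenly spaced grid and
    orthonormalized with QR. The basis in the embodiment is not given in the text.
    """
    if num_frames < order + 1:
        raise ValueError("need at least order + 1 voiced frames")
    grid = np.linspace(-1.0, 1.0, num_frames)
    sampled = np.polynomial.legendre.legvander(grid, order)
    basis, _ = np.linalg.qr(sampled)  # reduced QR: columns are orthonormal
    return basis


def encode_pitch_contour(frame_pitch, order=3):
    """Project the frame-level pitch values onto the basis (4 coefficients for order 3)."""
    frame_pitch = np.asarray(frame_pitch, dtype=float)
    return orthonormal_basis(len(frame_pitch), order).T @ frame_pitch


def decode_pitch_contour(coefficients, num_frames):
    """Rebuild a frame-level contour of the requested length from the coefficients."""
    coefficients = np.asarray(coefficients, dtype=float)
    return orthonormal_basis(num_frames, len(coefficients) - 1) @ coefficients


# Usage: a smooth (cubic) contour over 20 voiced frames is captured by 4 numbers.
grid = np.linspace(-1.0, 1.0, 20)
contour = 5.0 + 0.3 * grid - 0.2 * grid**2 + 0.1 * grid**3
coefficients = encode_pitch_contour(contour)
rebuilt = decode_pitch_contour(coefficients, 20)
```

Whatever the exact basis, the point of the representation is that a syllable-level contour of arbitrary length is summarized by a fixed, small number of coefficients.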
spnr is the modeling residual of spn. βt
P(spn|Bn−1n,pn,tn−1n+1)=N(spn;βt
It is noted that spnr is a noise-like residual signal of very small deviation so that one can model the data with a normal distribution. Likewise, the syllable duration model P(sdn|qn,sn,tn) and the syllable energy level model P(sen|rn,fn,tn) can be expressed as follows:
P(sdn|qn,sn,tn)=N(sdn;γt
P(sen|rn,fn,tn)=N(sen;ωt
where $sd_n$ and $se_n$ are the observed duration and energy level of the n-th syllable, respectively, and $\gamma_x$ and $\omega_x$ represent the affecting factors of syllable duration and syllable energy level, respectively, for the factor $x$.
The syllable-juncture prosodic-acoustic model P(Y,Z|B,L) describes the inter-syllable acoustic characteristics specified for different break type and surrounding linguistic features, and can be approximated with the following 5 sub-models:
The aforementioned formulas describe the pause duration $pd_n$, the energy-dip level $ed_n$, the normalized pitch jump $pj_n$, and two normalized syllable lengthening factors (i.e. $dl_n$ and $df_n$) across the n-th syllable juncture.
The prosodic state model P(P|B) is simulated by three sub-models:
The break-syntax model $P(B|L)$ can be described as follows:
$$P(B|L) \approx \prod_{n} P(B_n|L_n)$$
where $P(B_n|L_n)$ is the break type model for the n-th juncture, and $L_n$ denotes the linguistic feature of the n-th syllable.
The probability can be estimated by many methods. The present embodiment uses a decision tree algorithm for the estimation. A sequential optimization algorithm is used to train the prosodic models, and the maximum likelihood criterion is used to generate prosodic tags.
Prosodic Structure Analysis Unit
The prosodic structure analysis unit labels the hierarchical prosodic structure of the input speech, that is, it looks for the best prosodic tag $T=\{B,P\}$ based on the prosodic-acoustic feature vector sequence ($A$) and the linguistic feature sequence ($L$). The formula is:
$$B^*,P^* = \arg\max_{B,P} Q$$
where $Q=P(B|L)P(P|B)P(X|B,P,L)P(Y,Z|B,L)$.
The methods used by the prosodic structure analysis unit can be realized by obtaining the best solution through the iterative method set forth below:
(1) Initialization: For i=0, the best prosodic break type sequence can be found by:
(2) Iteration: Obtaining the prosodic break type sequence and the prosodic state sequence by iterating the following three steps:
Step 1: Given $B^{i-1}$, re-label the prosodic state sequence of each utterance by the Viterbi algorithm so as to maximize the value of Q:
Step 2: Given $P^{i}$, re-label the break type sequence of each utterance by the Viterbi algorithm so as to maximize the value of Q:
Step 3: If a convergence of the value of Q is reached, exit the iteration process. Otherwise, increase the value of i by 1 and then go back to Step 1.
(3) Termination: Obtaining the best prosodic tags $B^*=B^i$ and $P^*=P^i$.
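The iterative procedure above alternates between re-labeling the prosodic states with the breaks fixed and re-labeling the breaks with the states fixed, stopping when Q no longer improves. The following is a minimal sketch of that alternating maximization only. For brevity it uses exhaustive search over short sequences in place of the Viterbi algorithm, and the scoring function `toy_score` is a hypothetical stand-in for log Q, not the model of the embodiment.

```python
from itertools import product


def alternating_maximization(score, break_types, state_types, length,
                             init_breaks, max_iter=20):
    """Alternately maximize score(breaks, states) over states and over breaks."""
    def best(alphabet, evaluate):
        # Exhaustive search stands in for Viterbi decoding (assumption for brevity).
        return max(product(alphabet, repeat=length), key=evaluate)

    breaks = tuple(init_breaks)                      # (1) initialization
    states = best(state_types, lambda p: score(breaks, p))
    q_prev = score(breaks, states)
    for _ in range(max_iter):                        # (2) iteration
        states = best(state_types, lambda p: score(breaks, p))   # Step 1
        breaks = best(break_types, lambda b: score(b, states))   # Step 2
        q = score(breaks, states)
        if q - q_prev <= 1e-12:                      # Step 3: convergence of Q
            break
        q_prev = q
    return breaks, states, q                         # (3) termination


def toy_score(breaks, states):
    """Hypothetical log-score: prefers alternating states and breaks equal to states."""
    total = 0.0
    for n, (b, p) in enumerate(zip(breaks, states)):
        total += 1.0 if p == n % 2 else 0.0
        total += 0.5 if b == p else 0.0
    return total


best_breaks, best_states, best_q = alternating_maximization(
    toy_score, break_types=(0, 1), state_types=(0, 1), length=3,
    init_breaks=(1, 1, 1))
```

Because each step maximizes Q over one set of tags while the other is held fixed, the value of Q never decreases from one iteration to the next, which is what makes the convergence test in Step 3 meaningful.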
Coding the Prosodic Messages
It can be seen from the hierarchical prosodic module 102 that the syllable pitch contour $sp_n$, the syllable duration $sd_n$ and the syllable energy level $se_n$ are linear combinations of multiple factors. Some of these factors are low-level linguistic features: the tone $t_n$, the base-syllable type $s_n$ and the final type $f_n$. The others are prosodic tags indicating the hierarchical prosodic structure (obtained by the prosodic structure analysis unit 103): the prosodic break-type tag $B_n$ and the prosodic-state tags $p_n$, $q_n$ and $r_n$. Thus, the syllable pitch contour $sp_n$, the syllable duration $sd_n$ and the syllable energy level $se_n$ can be obtained by simply coding and transmitting these factors. The following formulas are applied by the prosodic feature synthesizer unit 106 to reconstruct these three prosodic acoustic features from these factors:
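$$sp_n' = \beta_{t_n} + \beta_{p_n} + \beta^f_{B_{n-1},tp_{n-1}} + \beta^b_{B_n,tp_n} + \mu_{sp}$$
$$sd_n' = \gamma_{t_n} + \gamma_{s_n} + \gamma_{q_n} + \mu_{sd}$$
$$se_n' = \omega_{t_n} + \omega_{f_n} + \omega_{r_n} + \mu_{se}$$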
Notably, the three modeling residuals, $sp_n^r$, $sd_n^r$ and $se_n^r$, may be neglected because their variances are all small. The three means, $\mu_{sp}$, $\mu_{sd}$ and $\mu_{se}$, are sent in advance to the decoder as side information.
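As an illustration of the superposition performed by the prosodic feature synthesizer unit 106, the sketch below sums the affecting patterns selected by the decoded symbols and adds the global mean, with the residuals dropped. The dictionary layout, the key names and all numeric values are hypothetical, and the forward and backward coarticulation patterns are looked up through opaque keys because their exact indexing is an assumption here.

```python
import numpy as np


def reconstruct_syllable(module, tone, base_syllable, final, p, q, r,
                         forward_key, backward_key):
    """Rebuild (pitch contour vector, duration, energy level) for one syllable."""
    pitch = (module["beta_t"][tone] + module["beta_p"][p]
             + module["beta_f"][forward_key] + module["beta_b"][backward_key]
             + module["mu_sp"])
    duration = (module["gamma_t"][tone] + module["gamma_s"][base_syllable]
                + module["gamma_q"][q] + module["mu_sd"])
    energy = (module["omega_t"][tone] + module["omega_f"][final]
              + module["omega_r"][r] + module["mu_se"])
    return pitch, duration, energy


# Hypothetical side information for a single combination of symbols.
module = {
    "beta_t": {1: np.array([0.10, 0.00, 0.00, 0.00])},
    "beta_p": {2: np.array([-0.05, 0.01, 0.00, 0.00])},
    "beta_f": {"B2:1-1": np.array([0.00, 0.02, 0.00, 0.00])},
    "beta_b": {"B0:1-4": np.array([0.00, -0.01, 0.00, 0.00])},
    "mu_sp": np.array([5.00, 0.00, 0.00, 0.00]),
    "gamma_t": {1: 0.02}, "gamma_s": {"ba": -0.01}, "gamma_q": {3: 0.03},
    "mu_sd": 0.20,
    "omega_t": {1: 1.0}, "omega_f": {"a": 2.0}, "omega_r": {0: -0.5},
    "mu_se": 60.0,
}

pitch, duration, energy = reconstruct_syllable(
    module, tone=1, base_syllable="ba", final="a", p=2, q=3, r=0,
    forward_key="B2:1-1", backward_key="B0:1-4")
```

Only the symbol indices travel in the bit stream; the tables in `module` play the role of the side information that the decoder already holds.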
The pause duration pdn is modeled by the syllable juncture pause duration sub-model, g(pdn;αB
In summary, the symbols to be encoded by the encoder 104 include: tone tn, base-syllable type sn, final type fn, break type tag Bn, three prosodic-state tags (pn,qn,rn) and the index of the occupied leaf node Tn in the corresponding BDT. The encoder 104 encodes these symbols with different bit lengths according to their types and composes a bit stream, which is sent to the decoder 105 for decoding; the decoded symbols are then passed to the prosodic feature synthesizer unit 106, which reconstructs the prosodic information for speech synthesis by the speech synthesizer 107. Aside from the bit stream, some parameters of the hierarchical prosodic module 102 are regarded as side information, which is used for restoring the prosodic features and includes the affecting patterns (APs) {βt, βp, βB,tpf, βB,tpb, μsp} of the syllable pitch-contour sub-model, the APs {γt, γs, γq, μsd} of the syllable duration sub-model, the APs {ωt, ωf, ωr, μse} of the syllable energy level sub-model and the means {μT
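The idea of encoding each symbol type with its own bit length can be sketched as simple fixed-width bit packing. The field names and widths in `FIELDS` below are hypothetical placeholders chosen for illustration; they are not the codeword lengths of the embodiment, and a real encoder might use a different coding scheme.

```python
# Hypothetical (name, bit width) layout for the symbols of one syllable.
FIELDS = [
    ("tone", 3), ("base_syllable", 9), ("final", 6), ("break_type", 3),
    ("p", 4), ("q", 4), ("r", 4), ("leaf", 3),
]


def encode_syllable(symbols):
    """Concatenate the fixed-width binary codes of all fields into a bit string."""
    bits = ""
    for name, width in FIELDS:
        value = symbols[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        bits += f"{value:0{width}b}"
    return bits


def decode_syllable(bits):
    """Split the bit string back into the symbol indices, field by field."""
    symbols, position = {}, 0
    for name, width in FIELDS:
        symbols[name] = int(bits[position:position + width], 2)
        position += width
    return symbols


syllable = {"tone": 4, "base_syllable": 300, "final": 17, "break_type": 2,
            "p": 9, "q": 3, "r": 12, "leaf": 5}
bitstring = encode_syllable(syllable)
restored = decode_syllable(bitstring)
```

With this layout every syllable costs the sum of the field widths, so the bit rate is determined by the symbol alphabets and the syllable rate rather than by the frame rate.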
Speech Synthesis
The task of the speech synthesizer 107 is to synthesize speech with HMM-based speech synthesis technology based on the base-syllable type, the syllable pitch contour, the syllable duration, the syllable energy level and the pause duration between syllables. The HMM-based speech synthesis is a technology known to the skilled person in the art.
$$d_{n,c} = \mu_{n,c} + \rho\cdot\sigma_{n,c}^2, \quad c = 1,\dots,C$$
where $n$ and $c$ are integers, and $\mu_{n,c}$ and $\sigma_{n,c}^2$ are respectively the mean and the variance of the Gaussian model for the c-th HMM state of the n-th syllable. $\rho$ is an elongation coefficient, which can be obtained from the following formula:
Notably, the factor $sd_n'$ denotes the syllable duration reconstructed by the prosodic feature synthesizer unit 106. Since the voiced/unvoiced state of each HMM state is determined, the HMM state voiced/unvoiced model 302 and the HMM state duration model 301 together can be used to obtain the duration of voiced sound within a syllable, that is, the number of frames $M_n'+1$. Further, the syllable pitch contours can be reconstructed at the logarithm pitch contour and excitation signal generator 306 based on the following formula:
where $\alpha_{j,n}'$ denotes the j-th dimension of the syllable pitch contour vector reconstructed by the prosodic feature synthesizer unit 106, i.e.:
$$sp_n' = [\alpha_{0,n}',\alpha_{1,n}',\alpha_{2,n}',\alpha_{3,n}']$$
Afterwards, the excitation signal required by the MLSA synthesis filter 307 can be generated from the reconstructed logarithm pitch contour. The spectrum information of each frame, on the other hand, is the MGC parameter set generated by the frame MGC generator 305 using the HMM acoustic model 304, given the HMM state duration, the voiced/unvoiced information, the break type, the prosodic-state tag, the base-syllable type and the syllable energy level. The energy level of each syllable is adjusted to the level reconstructed by the prosodic feature synthesizer unit 106. Finally, the excitation signal and the MGC parameters of each frame are input into the MLSA filter 307 to synthesize the speech.
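The state-duration rule given earlier in this section, in which each HMM state duration is its mean plus the elongation coefficient times its variance, can be sketched as follows. The formula for the elongation coefficient is not reproduced in the text, so this example assumes the choice commonly used in HMM-based synthesis, which makes the state durations of a syllable sum to the reconstructed syllable duration; the function name and the numbers are hypothetical.

```python
def allocate_state_durations(means, variances, syllable_duration):
    """Distribute a target syllable duration over the HMM states of one syllable.

    Assumption: rho = (target - sum of means) / (sum of variances), so that the
    resulting state durations add up exactly to the target duration.
    """
    rho = (syllable_duration - sum(means)) / sum(variances)
    durations = [mean + rho * variance for mean, variance in zip(means, variances)]
    return durations, rho


# Hypothetical state statistics (in frames) for a 3-state syllable model.
durations, rho = allocate_state_durations(
    means=[3.0, 5.0, 4.0], variances=[1.0, 2.0, 1.0], syllable_duration=16.0)
```

States with larger duration variance absorb more of the stretching or shrinking, which is the usual motivation for scaling by the variance. In practice the results would still be rounded to whole frames.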
Experimental Results
Table 1 shows important statistical information of experimental corpus, which includes two major portions: (1) Single speaker Treebank corpus; and (2) Multiple speaker Mandarin Chinese continuous speech database TCC300, which are respectively for evaluating the coding performance of the speaker-dependent and the speaker-independent embodiments of on-site testing as illustrated in
Table 2 shows the codeword length required by each encoding symbol.
Table 3 displays the parameter count for the side information.
Table 4 shows the root-mean-square errors (RMSE) of the prosodic features reconstructed by the prosodic feature synthesizer unit 106. It is appreciated from Table 4 that those errors are relatively small.
Table 5 shows the bit rate performance of the present invention. The average speaker-dependent and speaker-independent transmission bit rates are 114.9±4.78 bits per second and 114.9±14.9 bits per second, respectively; both are very low.
Examples of Speech Rate Conversion
The prosodic encoding method according to the present invention also provides a systematic speech rate conversion platform. The method replaces the hierarchical prosodic module 102 having the original speech rate with another hierarchical prosodic module 102 having a target speech rate for use by the prosodic feature synthesizer unit 106. Statistics of the training corpora used for on-site testing are shown in Table 6. The speaker-dependent training corpus for the experiment was recorded at normal speed. In addition to the normal-speed corpus, there are a fast-speed corpus and a slow-speed corpus, whose corresponding hierarchical prosodic modules can be constructed by the same training method as for the normal-speed one.
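The speech rate conversion described above keeps the decoded symbols unchanged and only swaps the module used for reconstruction. A minimal sketch for the syllable duration alone is shown below; the two parameter tables `normal_module` and `fast_module` and all their values are hypothetical, standing in for modules trained on normal-speed and fast-speed corpora.

```python
def syllable_durations(module, tags):
    """Reconstruct syllable durations from (tone, base syllable, duration state) tags."""
    return [module["gamma_t"][tone] + module["gamma_s"][syllable]
            + module["gamma_q"][q_state] + module["mu_sd"]
            for tone, syllable, q_state in tags]


# Hypothetical duration parameters (seconds) for two speech rates.
normal_module = {"gamma_t": {1: 0.010, 4: -0.010},
                 "gamma_s": {"ba": 0.000, "ma": 0.020},
                 "gamma_q": {0: -0.030, 1: 0.040},
                 "mu_sd": 0.200}
fast_module = {"gamma_t": {1: 0.008, 4: -0.008},
               "gamma_s": {"ba": 0.000, "ma": 0.015},
               "gamma_q": {0: -0.020, 1: 0.030},
               "mu_sd": 0.150}

decoded_tags = [(1, "ba", 0), (4, "ma", 1)]  # same tags for both speech rates
normal_durations = syllable_durations(normal_module, decoded_tags)
fast_durations = syllable_durations(fast_module, decoded_tags)
```

Because the tags describe the prosodic structure rather than raw durations, the relative pattern between syllables is preserved while the overall tempo follows the target-rate module.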
While the invention has been described in terms of what is presently considered to be the most practical and preferred embodiments, it is to be understood that the invention need not be limited to the disclosed embodiments. On the contrary, it is intended to cover various modifications and similar arrangements included within the spirit and scope of the appended claims, which are to be accorded the broadest interpretation so as to encompass all such modifications and similar structures.
1. A speech-synthesizing device, comprising:
a hierarchical prosodic module generating at least a first hierarchical prosodic model;
a prosody-analyzing device, receiving a low-level linguistic feature, a high-level linguistic feature and a first prosodic feature, and generating at least a prosodic tag based on the low-level linguistic feature, the high-level linguistic feature, the first prosodic feature and the first hierarchical prosodic model; and
a prosody-synthesizing unit synthesizing a second prosodic feature based on the hierarchical prosodic module, the low-level linguistic feature and the prosodic tag.
2. A speech-synthesizing device of Embodiment 1, further comprising:
a prosodic feature extractor receiving a speech input and the low-level linguistic feature, segmenting the input speech to form a segmented speech, and generating the first prosodic feature based on the low-level linguistic feature and the segmented speech.
3. A speech-synthesizing device of Embodiment 2 further comprising a prosody-synthesizing device, wherein the first hierarchical prosodic model is generated based on a first speech speed, on a condition that when the prosody-synthesizing device is going to generate a second speech speed being different from the first speech speed, the first hierarchical prosodic model is replaced with a second hierarchical prosodic model having the second speech speed and the prosody-synthesizing unit changes the second prosodic feature to a third prosodic feature.
4. A speech-synthesizing device of Embodiment 3, wherein the speech-synthesizing device generates a speech synthesis with the second synthesized speech based on the third prosodic feature and the low-level linguistic feature.
5. A speech-synthesizing device of Embodiment 1, further comprising:
an encoder receiving the prosodic tag and the low-level linguistic feature to generate a code stream; and
a decoder receiving the code stream, and restoring the prosodic tag and the low-level linguistic feature.
6. A speech-synthesizing device of Embodiment 5, wherein the encoder includes a first codebook providing an encoding bit corresponding to the prosodic tag and the low-level linguistic feature so as to generate the code stream, and the decoder includes a second codebook providing the encoding bit to reconstruct code stream to the prosodic tag and the low-level linguistic feature.
7. A speech-synthesizing device of Embodiment 5, further comprising:
a prosody-synthesizing device receiving the prosodic tag and the low-level linguistic feature reconstructed by the decoder to generate the second prosodic feature including a syllable pitch contour, a syllable duration, a syllable energy level and an inter-syllable pause duration.
8. A speech-synthesizing device of Embodiment 7, wherein the second prosodic feature is reconstructed by a superposition module.
9. A speech-synthesizing device of Embodiment 7, wherein the syllable juncture pause duration is reconstructed by looking up a codebook.
10. A prosodic information encoding apparatus, comprising:
a speech segmentation and prosodic feature extracting device receiving a speech input and a low-level linguistic feature to generate a first prosodic feature;
a prosodic structure analysis unit receiving the first prosodic feature, the low-level linguistic feature and a high-level linguistic feature, and generating a prosodic tag based on the first prosodic feature, the low-level linguistic feature and the high-level linguistic feature; and
an encoder receiving the prosodic tag and the low-level linguistic feature to generate a code stream.
11. A code stream generating apparatus, comprising:
a prosodic feature extractor generating a first prosodic feature;
a hierarchical prosodic module providing a prosodic structure meaning for the first prosodic feature; and
an encoder generating a code stream based on the first prosodic feature having the prosodic structure meaning,
wherein the hierarchical prosodic module has at least two parameters being ones selected from the group consisting of a syllable duration, a pitch contour, a pause timing, a pause frequency, a pause duration and a combination thereof.
12. A method for synthesizing a speech, comprising steps of:
providing a hierarchical prosodic module, a low-level linguistic feature, a high-level linguistic feature and a first prosodic feature;
generating at least a prosodic tag based on the low-level linguistic feature, the high-level linguistic feature, the first prosodic feature and the hierarchical prosodic module; and
outputting the speech according to the prosodic tag.
13. A method of Embodiment 12, further comprising steps of:
providing an inputting speech;
segmenting the inputting speech to generate a segmented input speech;
extracting a prosodic feature from the segmented input speech according to the low-level linguistic feature to generate the first prosodic feature;
analyzing the first prosodic feature to generate the prosodic tag;
encoding the prosodic tag to form a code stream;
decoding the code stream;
synthesizing a second prosodic feature based on the low-level linguistic feature and the prosodic tag; and
outputting the speech based on the low-level linguistic feature and the second prosodic feature.
14. A prosodic structure analysis unit, comprising:
a first input terminal receiving a first prosodic feature;
a second input terminal receiving a low-level linguistic feature;
a third input terminal receiving a high-level linguistic feature; and
an output terminal, wherein the prosodic structure analysis unit generates a prosodic tag at the output terminal based on the first prosodic feature, the low-level and the high-level linguistic features.
15. A speech-synthesizing device, comprising:
a decoder receiving a code stream and restoring the code stream to generate a low-level linguistic feature and a prosodic tag;
a hierarchical prosodic module receiving the low-level linguistic feature and the prosodic tag to generate a second prosodic feature; and
a speech synthesizer generating a synthesized speech based on the low-level linguistic feature and the second prosodic feature.
16. A prosodic structure analysis apparatus, comprising:
a hierarchical prosodic module generating a hierarchical prosodic model; and
a prosodic structure analysis unit receiving a first prosodic feature, a low-level linguistic feature and a high-level linguistic feature, and generating a prosodic tag based on the first prosodic feature, the low-level and the high-level linguistic features and the hierarchical prosodic model.
17. A prosodic structure analysis apparatus of Embodiment 16, wherein the low-level linguistic feature includes a base-syllable type, a syllable-final type, and a tone type of a language.
18. A prosodic structure analysis apparatus of Embodiment 16, wherein the high-level linguistic feature includes a word, a part of speech and a punctuation mark.
19. A prosodic structure analysis apparatus of Embodiment 16, wherein the prosodic feature includes a syllable pitch contour, a syllable duration, a syllable energy level and a syllable juncture pause duration.
20. A prosodic structure analysis apparatus of Embodiment 16, wherein the prosodic structure analysis device performs an optimization algorithm by referring to the low-level linguistic feature and the high-level linguistic feature to generate the prosodic tag.
Number | Date | Country | Kind |
---|---|---|---|
102104478 A | Feb 2013 | TW | national |
Number | Date | Country | |
---|---|---|---|
20140222421 A1 | Aug 2014 | US |