The present invention generally relates to speech synthesis, and in particular to methods and systems for generating prosody in speech synthesis.
Speech synthesis, or text-to-speech (TTS), involves the use of a computer-based system to convert a written document into audible speech. A good TTS system should generate natural (human-like) and highly intelligible speech. In the early years, rule-based TTS systems, or formant synthesizers, were used. These systems generate intelligible speech, but the speech sounds robotic and unnatural.
To generate natural-sounding speech, unit-selection speech synthesis systems were invented. Such a system requires the recording of a large amount of speech. During synthesis, the input text is first converted into phonetic script and segmented into small pieces; the matching pieces are then found in the large pool of recorded speech and stitched together. To accommodate arbitrary input text, the speech recording must be very large, and it is very difficult to change the speaking style. Therefore, for decades, alternative speech synthesis systems that combine the advantages of formant systems (small and versatile) with those of unit-selection systems (naturalness) have been intensively sought.
In a related patent application, a system and method for speech synthesis using timbre vectors are disclosed. The said system and method enable the parameterization of recorded speech signals into a highly amenable format, timbre vectors. From the said timbre vectors, the speech signals can be regenerated with a substantial degree of modification, and the quality is very close to that of the original speech. For speech synthesis, the said modifications include prosody, which comprises the pitch contour, the intensity profile, and the duration of each voice segment. However, in the previous application, U.S. Ser. No. 13/692,584, no systems and methods for the generation of prosody are disclosed. In the current application, systems and methods for generating prosody for an input text are disclosed.
The present invention discloses a parametrical representation of prosody based on polynomial expansion coefficients of the pitch contour near the center of each syllable, and a parametrical representation of the average global pitch contour for different types of phrases. The pitch contour of the entire phrase or sentence is generated by using a polynomial of higher order to connect the individual polynomial representations of the pitch contour near the center of each syllable smoothly over syllable boundaries. The pitch polynomial expansion coefficients near the center of each syllable are generated from a recorded speech database, read from a number of sentences in text form. A pronunciation and context analysis of the said text is performed. By correlating the said pronunciation and context information with the said polynomial expansion coefficients at each syllable, a correlation database is formed. To generate prosody for an input text, word pronunciation and context analysis is first executed. The prosody is generated by using the said correlation database to find the best set of pitch parameters for each syllable, adding them to the corresponding global pitch contour of the phrase type, and then using the interpolation formulas to generate the complete pitch contour for the said phrase of input text. Duration and intensity profile are generated using a similar procedure.
One general problem of the prior-art prosody generating systems is that, because pitch only exists for voiced frames, the pitch signal for a sentence in recorded speech data is always discontinuous and incomplete. Pitch values do not exist on unvoiced consonants and silence. Likewise, during the synthesis step, because the unvoiced consonants and silence sections do not need a pitch value, the predicted pitch contour is also discontinuous and incomplete. In the present invention, in order to build a database for pitch contour prediction, only the pitch values at and near the center of each syllable are required. In order to generate the pitch contour for an input text, the first step is to generate the polynomial expansion coefficients at the center of each syllable, where pitch exists. Then, the pitch values for the entire sentence are generated by interpolation using a set of mathematical formulas. If the consonants at the ends of a syllable are voiced, such as n, m, z, and so on, the continuation of the pitch value is naturally useful. If the consonants at the ends of a syllable are unvoiced, such as s, t, k, the same interpolation procedure is also applied to generate a complete set of pitch marks. Those pitch marks in the time intervals of unvoiced consonants and silence are important for the speech-synthesis method based on timbre vectors, as disclosed in patent application Ser. No. 13/692,584.
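As a minimal sketch of the database-building step described above, the following fits a first-order polynomial to the pitch values of one voiced section, with time measured from the syllable center. The function name `fit_syllable_pitch` and the sample values are hypothetical, and an ordinary closed-form least-squares line fit is assumed here rather than any particular orthogonal-polynomial routine.

```python
def fit_syllable_pitch(times, pitches, center):
    """Least-squares fit of p = A + B * (t - center) to voiced-frame pitch values.

    times   : sample times of the voiced frames of one syllable (seconds)
    pitches : pitch values at those times (any linear-in-log-frequency unit)
    center  : time of the syllable center (seconds)
    Returns (A, B): pitch and pitch slope at the syllable center.
    """
    if len(times) != len(pitches) or len(times) < 2:
        raise ValueError("need at least two matching (time, pitch) samples")

    # Shift the time axis so that the syllable center is at zero.
    x = [t - center for t in times]
    n = len(x)
    mean_x = sum(x) / n
    mean_p = sum(pitches) / n

    # Closed-form solution of the normal equations for a straight line.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("all samples share the same time; slope is undefined")
    sxp = sum((xi - mean_x) * (pi - mean_p) for xi, pi in zip(x, pitches))

    slope = sxp / sxx
    intercept = mean_p - slope * mean_x  # value of the fitted line at the center
    return intercept, slope


# Hypothetical voiced section: five frames around a center at 0.14 s.
frame_times = [0.10, 0.12, 0.14, 0.16, 0.18]
frame_pitch = [60.0 + 25.0 * (t - 0.14) for t in frame_times]
A, B = fit_syllable_pitch(frame_times, frame_pitch, center=0.14)
```

Only frames from the voiced section are passed in, so unvoiced consonants and silence never enter the fit.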
A preferred embodiment of the present invention using polynomial expansion at the center of each syllable is the all-syllable based speech synthesis system. In this system, a complete set of well-articulated syllables in a target language is extracted from a speech recording corpus. Those recorded syllables are parameterized into timbre vectors, then converted into a set of prototype syllables with flat pitch, identical duration, and calibrated intensity at both ends. During speech synthesis, the input text is first converted into a sequence of syllables. The samples of each syllable are extracted from the timbre-vector database of prototype syllables. The prosody parameters are then generated and applied to each syllable using voice transformation with timbre vectors. Each syllable is morphed into a new form according to the continuous prosody parameters, and the syllables are then stitched together using the timbre fusing method to generate the output speech.
The sentence can be segmented into 12 syllables, 105. Each syllable has a voiced section, 106. The middle point of the voiced section is the syllable center, 107.
The pitch contour of the said voiced section 106 of a said syllable 105 can be expanded into a polynomial, centered at the said syllable center 107. The polynomial coefficients of the said voiced section 106 are obtained using least-squares fitting, for example, by using the Gegenbauer polynomials. This method is well known in the literature (see for example Abramowitz and Stegun, Handbook of Mathematical Functions, Dover Publications, New York, Chapter 22, especially pages 790-791). Showing in
The pitch contour on each said voiced section, for example, V between 306 and 307, is approximated by a polynomial using least-squares fitting. In
$$p = A_n + B_n t,$$
where $A_n$ and $B_n$ are the syllable pitch parameters and $t$ is the time measured from the center of the n-th syllable. To make a continuous pitch curve over syllable boundaries, a higher-order polynomial is used. Suppose the next syllable center is located at a time $T$ from the center of the first one. Near the center of the (n+1)-th syllable, where $t=T$, the linear approximation of pitch is
$$p = A_{n+1} + B_{n+1}(t - T).$$
It can be shown directly that a third-order polynomial can connect them together, to satisfy the linear approximations at both syllable centers, as shown in 308 in
$$p = A_n + B_n t + C t^2 + D t^3,$$
where the coefficients C and D are calculated using the following formulas:
Therefore, over the entire sentence, the pitch value and pitch slope of the interpolated pitch contour are continuous, as shown in 204 of
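The cubic connection between two neighboring syllable centers can be sketched as follows. The closed forms for C and D used here are derived from the two end conditions stated above (pitch and slope at t = T must equal those of the next syllable); they are an assumption of this sketch, not a copy of the original formulas, and the function names are hypothetical.

```python
def cubic_bridge(a0, b0, a1, b1, T):
    """Coefficients (C, D) of p(t) = a0 + b0*t + C*t**2 + D*t**3.

    a0, b0 : pitch and slope at the center of syllable n (t = 0)
    a1, b1 : pitch and slope at the center of syllable n+1 (t = T)
    The cubic matches value and slope at both centers.
    """
    if T <= 0:
        raise ValueError("T must be positive")
    da = a1 - a0 - b0 * T   # value still missing at t = T after the linear part
    db = b1 - b0            # slope still missing at t = T
    C = (3.0 * da - db * T) / T ** 2
    D = (db * T - 2.0 * da) / T ** 3
    return C, D


def pitch_at(a0, b0, C, D, t):
    """Interpolated pitch at time t, measured from the center of syllable n."""
    return a0 + b0 * t + C * t ** 2 + D * t ** 3


def slope_at(b0, C, D, t):
    """Time derivative of the interpolated pitch at time t."""
    return b0 + 2.0 * C * t + 3.0 * D * t ** 2


# Hypothetical parameters for two syllable centers 0.2 s apart.
C, D = cubic_bridge(60.0, 10.0, 62.0, -5.0, 0.2)
contour = [pitch_at(60.0, 10.0, C, D, 0.02 * k) for k in range(11)]
```

Because each segment reproduces the pitch and slope of both of its end points, chaining such segments over all syllable pairs gives a contour whose value and slope are continuous, including across unvoiced consonants and silence.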
For expressive speech or tone languages such as Mandarin Chinese, the curvature of the pitch contour at the syllable center may also be included. Near the center of syllable n, the polynomial expansion of the pitch contour includes a quadratic term,
$$p = A_n + B_n t + C_n t^2,$$
and near the center of the (n+1)-th syllable, the polynomial expansion of the pitch contour is
$$p = A_{n+1} + B_{n+1}(t - T) + C_{n+1}(t - T)^2,$$
wherein the coefficients are obtained using least-squares fit from the voiced section of the (n+1)-th syllable. Similar to the linear approximation, using a higher-order polynomial, a continuous curve to connect the two syllables can be obtained,
$$p = A_n + B_n t + C_n t^2 + D t^3 + E t^4 + F t^5,$$
where the coefficients D, E and F are calculated using the following formulas:
The correctness of those formulas can be verified directly.
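One way to carry out such a verification numerically is sketched below. The closed forms for D, E and F are derived here from the three end conditions at t = T (pitch, slope and curvature must equal those of the next syllable); they are an assumption of this sketch, not a copy of the original formulas, and the function names are hypothetical.

```python
def quintic_bridge(a0, b0, c0, a1, b1, c1, T):
    """Coefficients (D, E, F) of
    p(t) = a0 + b0*t + c0*t**2 + D*t**3 + E*t**4 + F*t**5.

    (a0, b0, c0) : expansion coefficients at the center of syllable n (t = 0)
    (a1, b1, c1) : expansion coefficients at the center of syllable n+1 (t = T)
    """
    if T <= 0:
        raise ValueError("T must be positive")
    # Mismatch of value, first and second derivative at t = T
    # left over by the quadratic expansion of syllable n.
    a = a1 - (a0 + b0 * T + c0 * T ** 2)
    b = b1 - (b0 + 2.0 * c0 * T)
    c = 2.0 * c1 - 2.0 * c0
    D = (10.0 * a - 4.0 * b * T + 0.5 * c * T ** 2) / T ** 3
    E = (-15.0 * a + 7.0 * b * T - c * T ** 2) / T ** 4
    F = (6.0 * a - 3.0 * b * T + 0.5 * c * T ** 2) / T ** 5
    return D, E, F


def quintic_derivatives(a0, b0, c0, D, E, F, t):
    """Return (p, p', p'') of the quintic at time t."""
    p = a0 + b0 * t + c0 * t ** 2 + D * t ** 3 + E * t ** 4 + F * t ** 5
    dp = b0 + 2.0 * c0 * t + 3.0 * D * t ** 2 + 4.0 * E * t ** 3 + 5.0 * F * t ** 4
    ddp = 2.0 * c0 + 6.0 * D * t + 12.0 * E * t ** 2 + 20.0 * F * t ** 3
    return p, dp, ddp


# Hypothetical parameters for two syllable centers 0.25 s apart.
D, E, F = quintic_bridge(60.0, 10.0, -20.0, 62.0, -5.0, 15.0, 0.25)
p_end, dp_end, ddp_end = quintic_derivatives(60.0, 10.0, -20.0, D, E, F, 0.25)
# At t = T the three values should equal a1, b1 and 2*c1 respectively.
```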
As shown in
$$p_g = C_0 + C_1 t + C_2 t^2 + C_3 t^3 + C_4 t^4,$$
where $p_g$ is the global pitch contour, and $C_0$ through $C_4$ are the coefficients to be determined by least-squares fitting from the constant terms of the polynomial expansions of said syllables, for example, by using the Gegenbauer polynomials (see for example Abramowitz and Stegun, Handbook of Mathematical Functions, Dover Publications, New York, Chapter 22, especially pages 790-791).
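A minimal sketch of such a fourth-order fit is given below. It solves the ordinary normal equations in the monomial basis by Gaussian elimination, which is a plain substitute for the Gegenbauer-polynomial route mentioned above, not the same procedure. The function name `fit_global_contour`, the rescaling of syllable-center times to the interval [-1, 1], and the sample values are all assumptions made for illustration.

```python
def fit_global_contour(times, avg_pitches, degree=4):
    """Least-squares polynomial fit p_g(t) = C0 + C1*t + ... + C_degree*t**degree.

    times       : syllable-center times (assumed rescaled to about [-1, 1])
    avg_pitches : constant terms of the syllable pitch expansions
    Returns the list [C0, C1, ..., C_degree].
    """
    m = degree + 1
    if len(times) != len(avg_pitches) or len(times) < m:
        raise ValueError("need at least degree + 1 matching samples")

    # Augmented normal-equation matrix [X^T X | X^T y] in the monomial basis.
    M = [
        [sum(t ** (i + j) for t in times) for j in range(m)]
        + [sum(y * t ** i for t, y in zip(times, avg_pitches))]
        for i in range(m)
    ]

    # Forward elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for k in range(col, m + 1):
                M[r][k] -= factor * M[col][k]

    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        rest = sum(M[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (M[i][m] - rest) / M[i][i]
    return coeffs


# Hypothetical 12-syllable sentence with centers spread over [-1, 1].
centers = [-1.0 + 2.0 * k / 11.0 for k in range(12)]
averages = [60.0 - 2.0 * t + 1.5 * t ** 2 + 0.5 * t ** 3 - t ** 4 for t in centers]
global_coeffs = fit_global_contour(centers, averages)
```

Rescaling the time axis keeps the normal equations reasonably well conditioned; an orthogonal-polynomial basis would improve this further.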
Every sentence in the said text corpus is read by a professional speaker 605 as the reference standard for prosody. The voice data are recorded through a microphone in the form of PCM (pulse-code modulation) 606. If an electroglottograph (EGG) instrument is available, the electroglottograph data 607 are simultaneously recorded. Both data are segmented into syllables to match the syllables in the text, 604. Although automatic segmentation of the voice signals into syllables is possible, human inspection is often needed. From the EGG data 607, or combined with the PCM data 606 through a glottal closure instant (GCI) program 608, the pitch contour 609 for each syllable is generated. Pitch is defined as a linear function of the logarithm of frequency or pitch period, preferably in MIDI as in section [0018]. Furthermore, from the PCM data 606, the intensity and duration data 610 of each said syllable are identified.
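For illustration, a pitch value that is linear in the logarithm of frequency can be computed with the standard MIDI note-number convention, in which 440 Hz corresponds to note 69 and one octave spans 12 units. The section referred to above is not reproduced here, so the use of this particular convention, and the function names, are assumptions of this sketch.

```python
import math


def hz_to_midi(freq_hz, ref_hz=440.0):
    """Pitch in MIDI units: 69 at ref_hz, 12 units per octave."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return 69.0 + 12.0 * math.log2(freq_hz / ref_hz)


def period_to_midi(period_s):
    """Pitch in MIDI units from a pitch period given in seconds."""
    if period_s <= 0:
        raise ValueError("pitch period must be positive")
    return hz_to_midi(1.0 / period_s)


# Pitch periods between consecutive glottal closure instants (hypothetical values).
gci_times = [0.000, 0.005, 0.010, 0.015]
pitches = [period_to_midi(t1 - t0) for t0, t1 in zip(gci_times, gci_times[1:])]
```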
The pitch contour of a pitch period in the voiced section of each said syllable is approximated by a polynomial using least-squares fitting 611. The values of average pitch (the constant term of the polynomial expansion) of all syllables in a sentence or a phrase are then fitted with a polynomial using least-squares fitting. The coefficients are then averaged over all phrases or sentences of the same type in the text corpus to generate a global pitch profile for that type, see
The pitch parameters of each syllable, after subtracting the value of the global pitch profile at that time, are correlated with the syllable stress pattern and context information to form a database of syllable pitch parameters 614. The said database enables the generation of syllable pitch parameters from input information about the syllables.
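One simple way to picture such a database is a lookup table keyed by stress and context, holding the average residual pitch parameters observed for each key. The text does not specify how the correlation is computed, so the averaging scheme, the key layout, and every name below are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_pitch_database(samples):
    """Average residual pitch parameters per (stress, context) key.

    samples : iterable of (key, a_residual, b) tuples, where a_residual is the
              syllable pitch minus the global pitch profile at that time and
              b is the pitch slope at the syllable center.
    Returns a dict mapping key -> (mean a_residual, mean b).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for key, a_residual, b in samples:
        acc = sums[key]
        acc[0] += a_residual
        acc[1] += b
        acc[2] += 1
    return {key: (s[0] / s[2], s[1] / s[2]) for key, s in sums.items()}


def predict_pitch_params(database, key, global_pitch, default=(0.0, 0.0)):
    """Syllable pitch parameters (A, B) for an input syllable.

    The stored residual is added back to the global pitch profile value;
    unseen keys fall back to `default` (no residual, flat slope).
    """
    a_residual, b = database.get(key, default)
    return global_pitch + a_residual, b


# Hypothetical training samples: key = (stress pattern, position in phrase).
training = [
    (("stressed", "final"), 2.0, -10.0),
    (("stressed", "final"), 4.0, -14.0),
    (("unstressed", "medial"), -1.0, 0.0),
]
db = build_pitch_database(training)
A, B = predict_pitch_params(db, ("stressed", "final"), global_pitch=60.0)
```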
The right-hand side of
By combining this with the method of speech synthesis using timbre vectors, U.S. patent application Ser. No. 13/692,584, a syllable-based speech synthesis system can be constructed. For many important languages in the world, the number of phonetically different syllables is finite. For example, the Spanish language has 1400 syllables. With the timbre-vector representation, one prototype syllable is sufficient for each syllable. Syllables of different pitch contour, duration and intensity profile can be generated from the one prototype syllable by following the generated prosody and then executing timbre-vector interpolation. Adjacent syllables can be joined together using timbre fusing. Therefore, for any input text, natural-sounding speech can be synthesized.
While this invention has been described in conjunction with the exemplary embodiments outlined above, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, the exemplary embodiments of the invention, as set forth above, are intended to be illustrative, not limiting. Various changes may be made without departing from the spirit and scope of the invention.
The present application is a continuation in part of patent application Ser. No. 13/692,584, entitled “System and Method for Speech Synthesis Using Timbre Vectors”, filed Dec. 3, 2012, by inventor Chongjin Julian Chen.
 | Number | Date | Country
---|---|---|---
Parent | 13692584 | Dec 2012 | US
Child | 14216611 | | US