This application claims the benefit of Chinese Application No. 200610167040.0, filed Dec. 13, 2006 in the State Intellectual Property Office of the People's Republic of China, the contents of which are incorporated herein by reference.
The present invention relates to Chinese speech synthesis technology, more specifically to a processing technology for performing prosodic words grouping on input Chinese sentences in a Chinese speech synthesis system, and more particularly to a Chinese prosodic words forming method and apparatus.
When a plurality of Chinese characters form words or phrases that are pronounced consecutively, they affect one another and form comparatively separate and complete prosodic blocks, whose prosodic characteristics play a very important role in the naturalness of the speech. Combinations of different prosodic blocks usually form different tunes, which give a person's pronunciation different tones. Generally speaking, the main prosodic units in Chinese speech include prosodic words, prosodic phrases and intonational phrases. The prosody of the Chinese language has a layered structure, and this layered prosodic structure forms the rhythm (prosody) of Chinese speech. The boundary of a prosodic unit usually corresponds to a stop, a change in fundamental frequency or a change in the duration of the syllable at the prosodic boundary. Prosody is an important factor affecting the naturalness and comprehensibility of synthesized speech. In a speech synthesis system, the prosodic structure provides the prosodic parameter prediction model with very important information. By predicting such parameters as the fundamental frequency, the duration and the stop, the system controls its mode of pronunciation so as to achieve the corresponding prosodic effect of the prosodic units at each level in the synthesized speech, thereby rendering the pronunciation natural and melodious.
With the ever deeper development of linguistic processing, people need not only to learn more about the prosodic structure of natural speech, but also to find a method for predicting the prosodic structure from the text, so as to enhance the naturalness of synthesized speech or the precision of speech recognition more effectively, and at the same time to deepen the understanding of natural languages.
The prosodic word denotes a group of syllables that are pronounced consecutively in an audio stream; the pronunciations of these syllables are very closely related, and no stop is perceived between them. The prosodic word is the lowest-level element in the layered structure of the prosody, and there is usually a perceptible stop at the boundary of the prosodic word. In other words, there is no perceptible stop inside the prosodic word, as a stop appears only at its boundary. Not all prosodic word boundaries have stops in actual speech. A perceptible stop at the boundary of a prosodic word is acceptable, but any perceptible stop inside a prosodic word will render the speech either hard to understand or unnatural. Consequently, a good prosodic word forming module is of great significance in enhancing the naturalness of synthesized speech.
There have been many published dissertations and patents in the prior art, such as those presented below, relating to the studies on the prosodic word forming module and the enhancement of the naturalness of the synthesized speech.
The contents of these patents and documents are incorporated herein as prior art documents of the present application for invention.
In general, a Chinese speech synthesis system consists of three modules, namely a text analyzing module, a prosody parameter predicting module and a backend synthesizing module. The Chinese text analyzing module includes word segmentation, part of speech annotation, phonetic notation, prosodic structure prediction, etc. The first step is word segmentation, because, unlike texts in other languages such as English, Chinese text has no spaces separating the words. Word segmentation is generally based on part of speech analysis; its result therefore reflects a certain syntactic structure but differs slightly from the prosodic structure. The purpose of prosodic structure prediction is to find an effective method for mapping the contents of the text to a prosodic structure, in order to construct a prediction model from the text to the prosodic characteristics (such as the stop and the tune) that guides the subsequent generation of prosody parameters.
Many studies show that prosodic words differ greatly from lexical words. One reason is that the forming of prosodic words is based not only on the meanings of the words but also on the prosodic requirements of the speech. A prosodic word can contain more than one lexical word, and can also be a part of a relatively long lexical word. The word segmentation module and the part of speech annotating module perform word segmentation and the corresponding part of speech annotation on natural language text based on lexical knowledge.
The following sample sentence describes two processing steps of the text analyzing module, namely word segmentation/part of speech annotation and prosodic structure prediction. As shown in
A text is input as: (once at an extramural activity in which we and the pupils of other schools climbed the Fragrance Hill, no one of us lagged behind, as all climbed to the hilltop by leaps and bounds)”.
The words are divided and the parts of speech are annotated as: /v -/m /q , /w /r /p /f /Ng /v /v /v /ns, /w /r /u -/m /q /v /u, /w /o /d /v /v /u /n /w”.
The prosodic structure is as: /v -/m /q|∥/r /c/f /Ng∥/v /v|/v /ns|∥/r /u|/n∥ /v -/m /q|/v /u|∥/o∥ /d /v /v /u| /n|∥”.
The “|” indicates the boundary of a prosodic word, the “∥” indicates the boundary of a prosodic phrase, and the “|∥” indicates the boundary of an intonational phrase. The boundary of a prosodic phrase or of an intonational phrase is necessarily also a boundary of a prosodic word. The task of the prosodic word forming module is to determine the boundaries of the prosodic words on the basis of the word segmentation and the part of speech annotation. In addition, prosodic word forming is the foundation for predicting prosodic units of a higher level, such as prosodic phrases. Consequently, the quality of the prosodic word forming is of very great significance to the naturalness of the synthesized speech.
Several methods have been proposed in the prior art for the prediction of the boundaries of the Chinese prosodic words, such as the Classification and Regression Tree (CART) method, rule-driven approach, statistical approach and recurrent neural network (RNN) method etc. Part of Speech (POS) and word length information are widely employed in these methods.
Generally speaking, the prediction of prosodic word boundaries in the state of the art cannot be said to be very precise. Boundary prediction errors are usually classified into two types: insertion errors and deletion errors. As discussed above, not all prosodic word boundaries have stops in actual speech. A perceptible stop at the boundary of a prosodic word is acceptable, but any perceptible stop inside a prosodic word will render the speech either hard to understand or unnatural. Therefore, insertion errors produced by the prosodic word forming module do great harm to the synthesized speech. By contrast, deletion errors do far less harm. For instance, the word segmentation result of the last portion of the aforementioned sample sentence, (climbed to . . . by leaps and bounds)”, is (see as shown in
The objective of the present invention rests in providing a Chinese prosodic words forming method and apparatus, so as to overcome the defect as discussed above whereby the type of insertion error of the prosodic word would render the pronunciation hard to understand or unnatural, and to reduce the number of the type of insertion error of prosodic word boundaries. In order to achieve the aforementioned objective, the present invention provides a method of forming Chinese prosodic words, which method comprises the steps of inputting Chinese text; performing process of word segmentation and part of speech annotation for the input Chinese text to generate an initial prosodic word sequence; inserting grids representing prosodic word boundaries for all the words in the initial prosodic word sequence to generate a grid prosodic word sequence; annotating the grids ready to be deleted in the grid prosodic word sequence based on the prosodic word forming means; judging the grids which actually need to be deleted in the grids ready to be deleted based on the prosodic word forming means; deleting the grids which actually need to be deleted in the grid prosodic word sequence, and word forming the words between every two grids in the remaining grids to generate prosodic words.
Word dividing and part of speech annotating the input Chinese text are performed to generate word segmentation result, and generate an initial prosodic word sequence based on the word segmentation result.
The said annotating the grids ready to be deleted in the grid prosodic word sequence based on the prosodic word forming means indicates annotating the grids to be deleted in the same grid prosodic word sequence based on a plurality of prosodic word forming means.
The said judging the grids which actually need to be deleted in the grids ready to be deleted based on the prosodic word forming means indicates comprehensively judging the grids which actually need to be deleted in the grids to be deleted based on a plurality of prosodic word forming means.
The said deleting the grids which actually need to be deleted in the grid prosodic word sequence includes: comprehensively judging the grids ready to be deleted at present based on the plurality of prosodic word forming means, providing trust degree of the grids which need to be deleted for the grids to be deleted at present; and judging whether the grids ready to be deleted need to be deleted based on the trust degree, if yes, deleting the grids to be deleted at present.
The present invention further provides an apparatus of forming Chinese prosodic words, which apparatus comprises an input part for inputting Chinese text; a word segmentation and part of speech annotating part for performing process of word segmentation and part of speech annotation for the input Chinese text to generate an initial prosodic word sequence; a prosodic word grid insert part for inserting grids representing prosodic word boundaries for all the words in the initial prosodic word sequence to generate a grid prosodic word sequence; a prosodic word grid delete part for annotating the grids ready to be deleted in the grid prosodic word sequence based on the prosodic word forming means, judging the grids which actually need to be deleted in the grids ready to be deleted based on the prosodic word forming means, and deleting the grids which actually need to be deleted in the grid prosodic word sequence; and a prosodic word generating part for forming the words between every two grids in the remaining grids to generate prosodic words.
The apparatus further comprises a word dividing result storage part for storing the word dividing result after the process of word dividing and part of speech annotating the input Chinese text to generate an initial prosodic word sequence based on the word segmentation result.
The prosodic word grid deletion part comprises a unit for a plurality of prosodic word forming means used for annotating the grids ready to be deleted in the same grid prosodic word sequence based on the plurality of prosodic word forming means.
The said judging the grids which actually need to be deleted in the grids to be deleted based on the prosodic word forming means indicates comprehensively judging the grids which actually need to be deleted in the grids to be deleted based on the plurality of prosodic word forming means.
The prosodic word grid deletion part further comprises a grid deletion trust degree evaluation unit for comprehensively judging the grids ready to be deleted at present based on the plurality of prosodic word forming means, providing trust degree of the grids which need to be deleted for the grids ready to be deleted at present; and a grid deletion unit for judging whether the grids ready to be deleted at present need to be deleted based on the trust degree, if yes, deleting the grids ready to be deleted at present.
The apparatus further comprises a prosodic word forming result analysis part for analyzing and processing the prosodic words generated by the prosodic word generating part to generate prosodic word forming analysis result.
The present invention further provides a program of forming Chinese prosodic words, which program comprises inputting Chinese text; performing process of word segmentation and part of speech annotation for the input Chinese text to generate an initial prosodic word sequence; inserting grids representing prosodic word boundaries for all the word boundaries in the initial prosodic word sequence to generate a grid prosodic word sequence; annotating the grids ready to be deleted in the grid prosodic word sequence based on the prosodic word forming means; judging the grids which actually need to be deleted in the grids ready to be deleted based on the prosodic word forming means; deleting the grids which actually need to be deleted in the grid prosodic word sequence, and word forming the words between every two grids in the remaining grids to generate prosodic words.
The present invention further provides a readable storage medium of storing Chinese prosodic words forming program, which readable storage medium stores the following programs of inputting Chinese text; performing process of word segmentation and part of speech annotation for the input Chinese text to generate an initial prosodic word sequence; inserting grids representing prosodic word boundaries for all the word boundaries in the initial prosodic word sequence to generate a grid prosodic word sequence; annotating the grids ready to be deleted in the grid prosodic word sequence based on the prosodic word forming means; judging the grids which actually need to be deleted in the grids ready to be deleted based on the prosodic word forming means; deleting the grids which actually need to be deleted in the grid prosodic word sequence, and word forming the words between every two grids in the remaining grids to generate prosodic words.
The advantageous effect of the present invention is that the grid deletion policy makes it possible for a plurality of prosodic word forming means to work in concert. The word segmentation result of the input natural language text is regarded as an initial prosodic word sequence, and it is assumed that prosodic word grids are inserted at all word boundaries. On this basis, the plurality of prosodic word forming means can work in concert, since every prosodic word forming method can delete the grids it considers to be no longer required at the level of the prosodic word. In other words, if any one of the prosodic word forming methods considers a certain grid to be no longer required, this grid is deleted. The present invention thereby overcomes the defect whereby insertion errors of the prosodic word render the pronunciation hard to understand or unnatural, and reduces the number of insertion errors at prosodic word boundaries. Such a framework also makes it easy to incorporate a new prosodic word forming method, thus facilitating the maintenance and modification of the system.
Specific embodiments of the present invention are explained below in combination with the accompanying drawings. As shown in
The apparatus further comprises a word dividing result storage part for storing the word dividing result after the process of word dividing and part of speech annotating the input Chinese text to generate an initial prosodic word sequence based on the word segmentation result.
The prosodic word grid deletion part further comprises a grid deletion trust degree evaluation unit for comprehensively judging the grids ready to be deleted at present based on the plurality of prosodic word forming means, providing trust degree of the grids which need to be deleted for the grids ready to be deleted at present; and a grid deletion unit for judging whether the grids ready to be deleted at present need to be deleted based on the trust degree, if yes, deleting the grids ready to be deleted at present.
The prosodic word grid deletion part comprises a unit for a plurality of prosodic word forming means used for annotating the grids ready to be deleted in the same grid prosodic word sequence based on the plurality of prosodic word forming means. The said judging the grids which actually need to be deleted in the grids to be deleted based on the prosodic word forming means indicates comprehensively judging the grids which actually need to be deleted in the grids to be deleted based on the plurality of prosodic word forming means.
The apparatus further comprises a prosodic word forming result analysis part for analyzing and processing the prosodic words generated by the prosodic word generating part to generate prosodic word forming analysis result.
The present invention can be implemented in a computer, a server or a computer network, wherein the input part can be such devices as a keyboard, a mouse, or a communication interface.
As shown in
The word segmentation and part of speech annotating part (the module 102) performs word segmentation and part of speech annotation on an input text. This module is the basis upon which the Chinese text analysis depends, because, unlike texts in other languages such as English, Chinese text has no spaces separating the words. Accordingly, it is necessary first to perform word segmentation and part of speech annotation on the input text, and the result obtained is written into the module 103 to serve as the basis for the subsequent processing.
In the specific embodiment, the prosodic word grid insert part, the prosodic word grid delete part and the prosodic word generating part can be unified as a prosodic word forming part (the module 104), which is the main body of the present invention. This module employs the grid deletion policy and thereby allows a plurality of prosodic word forming means to work in concert. The word segmentation result of the input text is regarded as an initial prosodic word sequence, and it is assumed that prosodic word grids are inserted at all word boundaries. On this basis, the plurality of prosodic word forming means work in concert to mark with eliminable signs the grids no longer required at the level of the prosodic word. Finally, each of the grids is uniformly judged as to whether it can be deleted, and the actual grid deletion is carried out.
The module 105 is the final prosodic word forming analysis result.
The module 201 is a prosodic word initializing part, which performs initialization of the prosodic words based on the word segmentation and part of speech annotation result stored in the module 103. Specifically, the word segmentation result is regarded as an initial prosodic word sequence, and grids representing prosodic word boundaries are inserted into all word boundaries.
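The initialization described above can be illustrated with a minimal sketch. The `Grid` class, the `init_grids` function and the sample words are hypothetical names and data chosen for illustration only; the sketch assumes that a grid is represented by the position of the word boundary it occupies.

```python
from dataclasses import dataclass, field


@dataclass
class Grid:
    """A candidate prosodic word boundary between two adjacent words."""
    position: int                            # lies between words[position] and words[position + 1]
    marks: set = field(default_factory=set)  # names of the means that marked it eliminable


def init_grids(words):
    """Treat the segmentation result as the initial prosodic word sequence
    and insert one grid at every word boundary."""
    return [Grid(position=i) for i in range(len(words) - 1)]


# Hypothetical segmentation result: (word, part of speech tag) pairs.
words = [("我们", "r"), ("爬", "v"), ("上", "v"), ("山顶", "n")]
grids = init_grids(words)
print([g.position for g in grids])
```

Each grid starts with an empty set of marks, so that the later prosodic word forming means can record their judgments on the same grid sequence.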
The module 202 performs a word forming process based on the prosodic word forming means 1, taking each of the words in the initial word segmentation result as the basic unit. At the same time, the grids that the prosodic word forming means 1 judges should be deleted are marked with eliminable signs by the module 203 (a grid eliminable sign marking part).
Modules 204 through 206 perform word forming processes based on prosodic word forming means 2 to N. They make respective use of the corresponding prosodic word forming means 2 to N to perform word forming on the prosodic words. At the same time, the grids judged in the prosodic word forming means to be deleted are also marked with eliminable signs by the grid eliminable sign marking part. The prosodic word forming means 1 to N can be used as a component part of the prosodic word grid delete part, namely as a prosodic word forming means part, so as to mark the grids ready to be deleted in the same grid prosodic word sequence based on the plurality of prosodic word forming means.
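One way in which several prosodic word forming means might mark grids in the same grid sequence is sketched below. The two rules shown (`short_word_means` and `particle_means`) are hypothetical stand-ins and are not the means of the present invention; the sketch only assumes that each means returns the positions of the grids it judges eliminable, where grid `i` lies between `words[i]` and `words[i + 1]`.

```python
def short_word_means(words):
    """Hypothetical rule: mark the grid between two adjacent one-character words."""
    return {i for i in range(len(words) - 1)
            if len(words[i][0]) == 1 and len(words[i + 1][0]) == 1}


def particle_means(words):
    """Hypothetical rule: attach a particle (tag 'u') to the preceding word."""
    return {i for i in range(len(words) - 1) if words[i + 1][1] == "u"}


def mark_grids(words, means_list):
    """Record, for every grid, the names of the means that marked it eliminable."""
    marks = {i: set() for i in range(len(words) - 1)}
    for means in means_list:
        for i in means(words):
            marks[i].add(means.__name__)
    return marks


# Hypothetical segmentation result: (word, part of speech tag) pairs.
words = [("爬", "v"), ("上", "v"), ("了", "u"), ("山顶", "n")]
marks = mark_grids(words, [short_word_means, particle_means])
print(marks)
```

Because every means only adds marks and never removes a grid itself, a new means can be appended to `means_list` without changing the others.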
The prosodic word forming means 1 to N can be embodied as follows.
Additionally, there are several modes of superimposition for the verbs of the Chinese language, such as “V-V”, “VV” and “V-V” (, and ). They are divided in the word segmentation process as verbal phrases, for example, . In fact, these verbal phrases of the superimposed mode should be regarded as a complete prosodic word in the natural prosody. Consequently, the present invention also designs corresponding prosodic word forming rules for the verbs of the superimposed mode, so as to ensure that they can be correctly formed into a prosodic word. The aforementioned plurality of prosodic word forming means work in concert on the prosodic word forming according to this invention.
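A rule for reduplicated verbs could be sketched as follows. This is an illustrative assumption rather than the rule of the present invention: it treats an immediately repeated verb, or the same verb repeated around a single one-character word, as one unit, and the sample words are chosen for illustration only and are not taken from the text above.

```python
def reduplicated_verb_means(words):
    """Mark the grids inside 'V V' and 'V x V' verb patterns as eliminable.

    words: list of (word, part of speech tag) pairs;
    grid i lies between words[i] and words[i + 1].
    """
    marked = set()
    for i, (surface, pos) in enumerate(words):
        if pos != "v":
            continue
        # 'V V': the same verb immediately repeated.
        if i + 1 < len(words) and words[i + 1][0] == surface:
            marked.add(i)
        # 'V x V': the same verb repeated around one single-character word.
        if (i + 2 < len(words) and words[i + 2][0] == surface
                and len(words[i + 1][0]) == 1):
            marked.update({i, i + 1})
    return marked


# Illustrative input, assumed for this sketch.
words = [("我", "r"), ("看", "v"), ("一", "m"), ("看", "v"), ("书", "n")]
print(reduplicated_verb_means(words))
```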
The module 207 is a grid removing part. This module makes a comprehensive judgment based on the eliminable marks applied by the aforementioned N types of prosodic word forming means to determine the prosodic word grids to be finally deleted. Finally, the words between every two adjacent remaining grids are joined to form a prosodic word, and the analysis result is stored as the prosodic word forming analysis result in the module 208.
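The final word forming step can be sketched as below, assuming that grid `i` lies between `words[i]` and `words[i + 1]` and that the set of deleted grids has already been determined. The function name and the sample data are hypothetical.

```python
def form_prosodic_words(words, deleted):
    """Join the words lying between consecutive remaining grids.

    words: list of word strings; deleted: set of removed grid positions.
    """
    prosodic_words, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        is_last = i == len(words) - 1
        # A remaining grid (or the end of the sentence) closes the prosodic word.
        if is_last or i not in deleted:
            prosodic_words.append("".join(current))
            current = []
    return prosodic_words


# Hypothetical input: grids 0 and 1 were deleted, grid 2 remains.
words = ["爬", "上", "了", "山顶"]
result = form_prosodic_words(words, deleted={0, 1})
print(result)
```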
The module 301 is responsible for traversing all the initial grids.
The module 302 is responsible for checking whether there are grids that have not been processed. This is a simple sequential process. If there are unprocessed grids, they are transferred to the module 303 for processing. If all the grids have been processed, the processing ends.
The module 303 is responsible for checking whether the current grid has been marked with the eliminable sign: if the current grid has been marked with the eliminable sign by at least one prosodic word forming method, the grid is transferred to the module 304; otherwise, it is transferred back to the module 301.
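The traversal performed by the modules 301 through 303, together with the deletion it hands over to the module 304, can be sketched as a simple loop. The representation of the marks as a dictionary from grid position to a set of means names is an assumption made for illustration.

```python
def delete_marked_grids(marks):
    """Traverse all grids and delete each one marked by at least one means.

    marks: dict mapping grid position -> set of names of the means that
    marked the grid eliminable. Returns the set of deleted grid positions.
    """
    deleted = set()
    for position in sorted(marks):      # modules 301/302: sequential traversal
        if marks[position]:             # module 303: marked at least once?
            deleted.add(position)       # module 304: delete the grid
    return deleted


# Hypothetical marks left by two means named "a" and "b".
example = {0: {"a"}, 1: set(), 2: {"a", "b"}}
print(delete_marked_grids(example))
```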
The module 304 is a grid delete part for performing specific operation of deleting the grids.
The module 401 is a grid deletion trust degree evaluation part. This module comprehensively determines the trust degree for deleting the current grid, based on whether each of the N types of prosodic word forming methods has marked the current grid as eliminable.
The module 402 judges as to whether the current grid is eliminable based on the trust degree evaluation result of the module 401: if eliminable, it is transferred to the module 403 for processing; and it is otherwise transferred to the module 301.
The grid deletion trust degree evaluation part can be implemented through a voting mechanism. The simplest voting mechanism works as follows: if more than half of the N types of prosodic word forming means consider it necessary to delete the current grid, the grid deletion trust degree evaluation part considers it necessary to delete the current grid.
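The simple voting mechanism can be sketched as follows. Expressing the trust degree as the fraction of means that marked the grid, and the `threshold` parameter, are assumptions made for illustration; the text above only specifies the "more than half" criterion.

```python
def deletion_trust(marks_for_grid, n_means):
    """Trust degree: fraction of the N means that marked the grid eliminable."""
    return len(marks_for_grid) / n_means


def should_delete(marks_for_grid, n_means, threshold=0.5):
    """Majority vote: delete only if more than half of the means agree."""
    return deletion_trust(marks_for_grid, n_means) > threshold


# Two of three hypothetical means voted for deletion.
print(should_delete({"means_1", "means_2"}, n_means=3))
```

With this criterion a grid marked by exactly half of the means is kept, which errs on the side of retaining a boundary.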
The present invention employs the grid deletion policy to make it possible for a plurality of prosodic word forming means to work in concert. The word segmentation result of the input natural language text is regarded as an initial prosodic word sequence, and it is assumed that prosodic word grids are inserted at all word boundaries. On this basis, the plurality of prosodic word forming means can work in concert, since every prosodic word forming method can delete the grids it considers to be no longer required at the level of the prosodic word. In other words, if any one of the prosodic word forming methods considers a certain grid to be no longer required, this grid is deleted. The present invention thereby avoids, as far as possible, the defect whereby insertion errors of the prosodic word render the pronunciation hard to understand or unnatural, and reduces the number of insertion errors at prosodic word boundaries. Such a framework also makes it easy to incorporate a new prosodic word forming method, thus facilitating the maintenance and modification of the system.
The aforementioned specific embodiments are employed only to explain, rather than to limit, the present invention.
Number | Date | Country | Kind |
---|---|---|---|
200610167040.0 | Dec 2006 | CN | national |