This application claims priority under 35 U.S.C. §119 from Chinese Patent Application No. 201010271135.3 filed Aug. 31, 2010, the entire contents of which are incorporated herein by reference.
1. Field of the Invention
The Present Invention relates to a method and system for achieving Text to Speech. More particularly, the Present Invention is related to a method and system for achieving emotional Text to Speech.
2. Description of the Related Art
Text To Speech (TTS) refers to extract corresponding speech units from an original corpus based on result of rhythm modeling, adjust and modify rhythm feature of the speech units by using specific speech synthesis technology and finally synthesize qualified speech. Currently, the synthesis level of several main speech synthesis tools have all come into practical stage.
It is well known that people can express a variety of emotion during reading, for example, during reading the sentence “Mr. Ding suffers severe paralysis since he is young, but he learns through self-study and finally wins the heart of Ms. Zhao with the help of network”, the former half of which can be read with sad emotion, while the latter half of which can be read with joy emotion. However, the traditional speech synthesis technology will not consider the emotional information accompanied in the text content, that is, when performing speech synthesis, the traditional speech synthesis technology will not consider whether the emotion expressed in the text to be processed is joy, sad or angry.
Emotional TTS has become the focus of TTS research in recent years, the problem that has to be solved in emotional TTS research is to determine emotion state and establish association relationship between emotion state and acoustical feature of speech. The existing emotional TTS technology allows an operator to specify emotion category of a sentence manually, such as manually specify that the emotion category of sentence “Mr. Ding suffers severe paralysis since he is young” is sad, and the emotion category of sentence “but he learns through self-study and finally wins the heart of Ms. Zhao with the help of network” is joy, and process the sentence with the specified emotion category during TTS.
Accordingly, one aspect of the present invention provides a method for achieving emotional Text To Speech (TTS), the method includes the steps of: receiving text data; generating emotion tag for the text data by a rhythm piece; and achieving TTS to the text data corresponding to the emotion tag, where the emotion tags are expressed as a set of emotion vectors; where the emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories.
Another aspect of the present invention provides a system for achieving emotional Text To Speech (TTS), including: a text data receiving module for receiving text data; an emotion tag generating module for generating an emotion tag for the text data by a rhythm piece; and a TTS module for achieving TTS to the text data according to the emotion tag, where the emotion tag is expressed as a set of emotion vectors; and where the emotion vector includes a plurality of emotion scores given based on a plurality of emotion categories.
In the following discussion, a large amount of specific details are provided to facilitate to understand the invention thoroughly. However, for those skilled in the art, it is evident that it does not affect the understanding of the invention without these specific details. It will be recognized that, the usage of any of following specific terms is just for convenience of description, thus the invention should not be limited to any specific application that is identified and/or implied by such terms.
There are unsolved problems in the existing emotional TTS technology. For example, firstly, since each sentence is assigned unified emotion category, the whole sentence is read with unified emotion, the actual effect of which is not natural and smooth; secondly, different sentences are assigned different emotion categories, therefore, there will be abrupt emotion change between sentences; thirdly, the cost of determining emotion of a sentence manually is high and is not adapted to perform batch process on TTS.
The present invention provides a method and system for achieving emotional TTS. The present invention can make TTS effect more natural and closer to real reading. In particular, the present invention generates emotion tag based on rhythm piece instead of whole sentence. The emotion tag in the present invention is expressed as a set of emotion vectors including plurality of emotion scores given based on multiple emotion categories, which makes the rhythm piece in the present invention has more rich and real emotion expression instead of being limited to one emotion category. In addition, the present invention does not need manual intervention, that is, there is no need to specify fixed emotion tag for each sentence manually. The present invention is applicable to various products that need to achieve emotional TTS, including E-book that can perform reading automatically, robot that can perform interactive communication, and various TTS software that can read text content with emotion.
An emotion tag for the text data is generated by rhythm piece at step 103, where the emotion tags are expressed as a set of emotion vectors. The emotion vector includes plurality of emotion scores given based on multiple emotion categories. The rhythm piece can be a word, vocabulary or a phrase. If the text data is in Chinese, according to an embodiment of the present invention, the text data can be divided into several vocabularies, each vocabulary being taken as a rhythm piece and an emotion tag is generated for each vocabulary. If the text data is English, according to an embodiment of the present invention, the text data can be divided into several words, each word being taken as a rhythm piece and an emotion tag is generated for each word. Of course, generally, the invention has no special limitation on the unit of rhythm piece, which can be a phrase with relatively coarse granularity, or it can be a word with relatively fine granularity. The finer the granularity is, the more delicate the emotion tag is. The final synthesis result will be closer to actual pronunciation, but computational load will also increase. The coarser the granularity is and the rougher the emotion tag is, the final synthesis result will have some difference to actual pronunciation. However, computational load will also be relatively low in TTS.
TTS to the text data is achieved according to the emotion tag at step 105. The present invention will use one emotion category for each rhythm piece, instead of using a unified emotion category for one sentence to perform synthesis. When achieving TTs, the present invention considers a degree of each rhythm on each emotion category. The present invention considers the emotion score under each emotion category, in order to realize TTS that is closer to create an actual speech effect. The detailed content will be described below in detail.
As shown in Table 1, emotion vector can be expressed as an array with emotion scores. According to an embodiment of the present invention, normalization process can be performed on each emotion score. In the array with emotion scores for each rhythm piece, the sum of six emotion scores is 1.
The initial emotion score in Table 1 can be obtained in a variety of ways. According to an embodiment of the present invention, the initial emotion score can be a value that is given manually, where a score is given to each emotion category. For a word for that has no initial emotion score, default initial emotion score can be set as shown in Table 2 below.
According to another embodiment of the present invention, emotion categories in a large number of sentences can be marked. For example, emotion category of sentence “I feel so frustrated about his behavior at Friday” is marked as “angry”, emotion category of sentence “I always go to see movie at Friday night” is marked as “happy”. Furthermore, statistic collection can be performed on the emotion category occurred at each word within the large number of sentences. For example, “Friday” has been marked as “angry” for 10 times while been marked as “happy” for 90 times. Distribution of emotion score for word “Friday” is as shown in Table 3.
According to another embodiment of the present invention, the initial emotion score of the rhythm piece can be updated using the final emotion score obtained in prior step of the invention. As a result, the updated emotion score can be stored as initial emotion score. For example, the word “Friday” itself can be a neutral word. If the word “Friday” has been found through step many sentences have expressed a happy emotion when they refer to “Friday”, the initial emotion score of the word “Friday” can be updated from the final emotion score.
Final emotion score and final emotion category of the rhythm piece are determined at step 203. According to an embodiment of the present invention, highest value in the multiple initial emotion scores can be determined as final emotion score, and emotion category represented by the final emotion score can be taken as final emotion category. For example, the final emotion score and final emotion category of each word in Table 1 are determined as shown in Table 4.
As shown in Table 4, the final emotion score of “Don't” is 0.30 and its final emotion category is “angry”.
The emotion vector adjustment training data can be a large amount of text data where emotion score had been adjusted manually. For example, for the sentence “Don't be shy”, the established emotion tag is as shown in
Based on the context of the sentence, initial emotion score of the above sentence is adjusted manually. The adjusted emotion score is shown in Table 6:
As shown in Table 6, the emotion score of “neutral” for word “Don't” has been increased and the emotion score of “angry” has been decreased. The data shown in Table 6 is from the emotion vector adjustment training data. The emotion vector adjustment decision tree can be established based on the emotion vector adjustment training data, so that some rules for performing manual adjustment can be summarized and recorded. The decision tree is a tree structure obtained by performing analysis on the training data with certain rules. A decision tree generally can be represented as a binary tree, where a non-leaf node on the binary tree can either be a series of problems from the semantic context (these problems are conditions for adjusting emotion vector), or can be an answer between “yes” and “no”. The leaf node on the binary tree can include implementation schemes for adjusting emotion score of rhythm piece, where these implementation schemes are the result of emotion vector adjustment.
In addition to using the emotion vector adjustment decision tree to adjust the emotion score, the original emotion score can also be adjusted according to a classifier based on the emotion vector adjustment training data. The working principle of classifier is similar to that of emotion vector adjustment decision tree. The classifier, however, can statistically collect changes in emotion scores under an emotion category, and apply the statistical result to new entered text data to adjust the original emotion score. For example, some known classifiers are Support Vector Machine (SVM) classification technique, Naïve Bayes (NB) etc.
Finally, the process returns to
The marking of the emotion category in Table 7 can be manually marked, or it can be automatically expanded based on manually marked marking of the emotion category. The expansion to the emotion adjacent training data will be described in detail below. There can be a variety of ways for marking, and marking in form of a list shown in Table 7 is one of the ways. In other embodiments, colored blocks can be set to represent different emotion categories, and a marker can mark the word in the emotion adjacent training data by using pens with different colors. Furthermore, default value such as “neutral” can be set for unmarked words, such that emotion categories of the unmarked words are all set as “neutral”.
The information as shown in Table 8 below can be obtained by performing statistic collection on emotion category adjacent condition of a word in a large amount of emotion adjacent training data.
Table 8 shows that in the emotion adjacent training data, the number “1000” corresponds to two emotion categories that are marked “neutral,” where “1000” represent the numbers of words that are adjacent to each other. Similarly, the number “600” corresponds to two emotion categories, where one emotion category is marked “happy” and another emotion category is marked “neutral.”
Table 8 can be a 7×7 table that marks the number of times of words that are adjacent to each other, but can be a table with higher dimensions. According to an embodiment of the present invention, the adjacent data does not consider the order of words of two emotion categories appeared in emotion adjacent training data. Thus, the recorded number that corresponds to “happy” column and “neutral” row is identical to the recorded number that corresponds to “happy” row and “neutral” column.
According to another embodiment of the present invention, when performing a statistic collection on the number of adjacent words with emotional categories, the order of words of two emotion categories is considered, and thus the recorded number of adjacent times that corresponds with “happy” column and “neutral” row can not be identical to that the recorded number that corresponds with “happy” row and “neutral” column.
Next, adjacent probability of two emotion categories can be calculated with the following formula 1:
Where: E1 represents one emotion category; E2 represents another emotion category; num(E1, E2) represents the number of adjacent times of E1 and E2;
represents the sum of number of adjacent times of any two emotion categories; and p(E1, E2) represents adjacent probability of word of these two emotion categories. The adjacent probability is obtained by performing a statistical analysis on emotion adjacent training data, the statistical analysis including: recording the number of times at least two emotion categories adjacent in the emotion adjacent training data.
Furthermore, the present invention can perform normalization process on P(E1, E2), such that the highest value in P(Ei, Ej) is 1, when other P(Ei, Ej) is a relative number, i.e. a smaller number than 1. The normalized adjacent probability of words having two emotion categories is calculated, and can be shown on a table. See Table 9.
Based on Table 9, for one emotion category of at least one rhythm piece, adjacent probability that one emotion category is connected to an emotion category of another rhythm piece can be obtained at step 501. For example, adjacent probability between “Don't,” which has a “neutral” emotion category, and “feel,” which has a “neutral” emotion category, has a value of 1.0. In another example, adjacent probability of the word “Don't” in “neutral” emotion category and the word “feel” in “happy” emotion category is 0.6. Adjacent probability between a word in one emotion category and another word having another emotion category can be obtained.
Final emotion path of the text data is determined based on the adjacent probability and emotion scores of respective emotion categories at step 503. For example, for sentence “Don't feel embarrassed about crying as it helps you release these sad emotions and become happy”, assuming Table 1 has listed emotion tag of that sentence marked in step 303, a total of 616 emotion paths can be described based on all adjacent probabilities obtained in step 501. The path with the highest sum of adjacent probability and the highest sum of emotion score can be selected from these emotion paths at step 503 as final emotion path, as shown in Table 10 below.
In comparison with other emotion paths, the final emotion path indicated by arrows in Table 10 has the highest sum of adjacent probability (1.0+0.3+0.3+0.7+ . . . ) and the highest sum of emotion score (0.2+0.4+0.8+1+0.3+ . . . ). The determination of final emotion path has to comprehensively consider emotion score of each word on one emotion category and adjacent probability of two emotion categories, in order to obtain the path with the highest possibility. The determination of final emotion path can be realized by a plurality of dynamic planning algorithms. For example, the above sum of adjacent probability and sum of emotion score can be weighted, in order to find an emotion path with highest probability after being summed and weighted as final emotion path.
Final emotion category of the rhythm piece is determined based on the final emotion path. Emotion score of the final emotion category then is obtained as final emotion score at step 505. For example, final emotion category of “Don't” is determined as “neutral” and the final emotion score is 0.2.
The determination of final emotion path can make expression of text data smoother and closer to the emotion state expressed during real reading. For example, if emotion smoothing process is not performed, final emotion category of “Don't” can be determined as “angry” instead of “neutral”.
Generally, both the emotion smoothing process and the emotion vector adjustment described in
The emotion vector adjustment emphasizes more on making emotion score comply with true semantic content, while emotion smoothing process emphasizes more on choosing an emotion category for smoothness and avoid abruptness.
As mentioned above, the present invention can further expand the emotion adjacent training data.
According to an embodiment of the present invention, the emotion adjacent training data is automatically expanded based on the formed final emotion path. For example, new emotion adjacent training data as shown in Table 11 below can be further derived from the final emotion path in Table 10, in order to realize expansion of emotion adjacent training data:
According to another embodiment of the present invention, the emotion adjacent training data is automatically expanded by connecting emotion category of the rhythm piece with the highest emotion score. In this embodiment, final emotion category of each rhythm piece is not determined based on final emotion path, but the emotion vector tagged in step 303 is analyzed to select an emotion category represented by highest emotion score in emotion vector. As a result, the process automatically expands the emotion adjacent training data. For example, if Table 1 describes emotion vectors tagged in step 303, then the new emotion adjacent training data derived from these emotion vectors shows expanded data. See Table 12:
Since smoothing process is not performed on the emotion adjacent training data obtained in Table 12, some of its determined emotion categories (such as “Don't”) can sometimes not comply with real emotion condition. However, in comparison with the expansion manner in Table 11, the computation load of the expansion manner in Table 12 is relative low.
The present invention does not exclude using more expansion manner to expand the emotion adjacent training data.
Next, achieving TTS is described in detail. It should be noted that the following embodiment for achieving TTS is applicable to step 307 in the embodiment shown in
At step 603, for each phone in the number of phones, its speech feature is determined according to the following formula 2:
F
i=(1−Pemotion)*Fi-neutral+Pemotion*Fi-emotion formula 2
Where Fi represents value of the ith speech feature of the phone, Pemotion represents final emotion score of the rhythm piece where the phone lies, Fi-neutral represents speech feature value of the ith speech feature in neutral emotion category, and Fi-emotion represents speech feature value of the ith speech feature in the final emotion category.
For example, for vocabulary “embarrassed” in Table 10, its speech feature is:
F
i=(1−0.8)*Fi-neutral+0.8*Fi-uneasiness
The speech feature can be one or more of the following: basic frequency feature, frequency spectrum feature, time length feature. The basic frequency feature can be embodied as one or both of average value of basic frequency feature or variance of basic frequency feature. The frequency spectrum feature can be embodied as 24-dimension line spectrum frequency (LSF), i.e., representational frequencies in spectrum frequency. The 24-dimension line spectrum frequency (LSF) is a set of 24-dimension vector. The time length feature is the duration of that phone.
For each emotion category under each speech feature, there is pre-recorded corpus. For example, an announcer reads a large amount of text data that contain angry, sad, happy, emotion, and etc, and the audio is recorded into corresponding corpus. For a corpus of each emotion category under each speech feature, a TTS decision tree is established, where the TTS decision tree is typically a binary tree. The leaf node of the TTS decision tree records speech feature (including basic frequency feature, frequency spectrum feature or time length feature) that should be owned by each phone. The non-leaf node in the TTS decision tree can either be a series of problems regarding speech feature, or be an answer of “yes” or “no”.
Furthermore, the present invention can also divide a phone into several states, for example, divide a phone into 5 states and establish decision tree relating to each speech feature under each emotion category for the state, and query speech feature of one state of one phone of one rhythm piece in the text data through the decision tree.
However, the present invention is not simply limited to utilize the above method to obtain speech feature of phone under one emotion category to achieve TTS. According to an embodiment of the present invention, during TTS, not only final emotion category of the rhythm piece where a phone lies is considered, but also the final emotion category's corresponding final emotion score (such as Pemotion in formula 2) is considered. It can be seen from formula 2 that the larger the final emotion score is the closer the ith speech feature value of the phone than to the speech feature value of one final emotion category. In contrast, the smaller the final emotion score is, the closer the ith speech feature value of the phone than to speech feature value under “neutral” emotion category. The formula 2 further makes the process of TTS smoother, and avoids abrupt and unnatural TTS effect due to emotion category jump.
Of course, there can be various variations to the TTS method shown in formula 2. For example,
F
i
=F
i-emotion
Speech feature of the phones are determined based on following formula if the final emotion score of the rhythm piece where the phone lies is smaller than a certain threshold (step 615):
F
i
=F
i-neutral
For above two formulas, Fi represents value of the ith speech feature of the phone, Fi-neutral represents speech feature value of the ith speech feature in neutral emotion category, Fi-emotion represents speech feature value of the ith speech feature in the final emotion category.
In practice, the present invention is not only limited to the implementation shown in
Furthermore, the TTS module 909 is further for achieving TTS to the text data according to the final emotion score and final emotion category of the rhythm piece.
The functional flowchart performed and completed by respective modules in
The above and other features of the present invention will become more distinct by a detailed description of embodiments shown in combination with attached drawings. Identical reference numbers represent the same or similar parts in the attached drawings of the invention.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer. Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention.
It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams can represent a module, segment, or portion of code, which includes one or more executable instructions for implementing the specified logical function(s).
It should also be noted that, in some alternative implementations, the functions noted in the block can occur out of the order noted in the figures. For example, two blocks shown in succession can, in fact, be executed substantially concurrently, or the blocks can sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “includes” and/or “including,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Number | Date | Country | Kind |
---|---|---|---|
201010271135.3 | Aug 2011 | CN | national |