The present invention is based upon and claims the benefit of the priority of Japanese patent application No. 2007-307507 filed on Nov. 28, 2007, the disclosure of which is incorporated herein in its entirety by reference thereto.
The present invention relates to a speech synthesis device, speech synthesis method, and speech synthesis program, and particularly to a speech synthesis device, speech synthesis method, and speech synthesis program that synthesize speech from a text.
A variety of speech synthesis devices that analyze a text and generate synthesized speech from speech information indicated by the text using rule synthesis have been developed.
The speech synthesis device shown in
The language processing unit X1 analyzes an input text by reading it and performing morpheme and syntactic analyses on it, and outputs symbol strings indicating how the text is “read” such as phonetic segment symbols, and the lexical category, conjugation, accent type of each morpheme as language processing results to the prosody generating unit X2 and the segment selection unit X3.
The prosody generating unit X2 generates the prosody information (information relating to pitch, duration, and power) of synthesized speech according to the language processing results outputted from the language processing unit X1, and outputs it to the segment selection unit X3 and the waveform generating unit X5. The segment selection unit X3 selects speech segments having high suitability in terms of the language processing results and the generated prosody information from the speech segments stored in the segment information storage unit X4, and outputs the selected speech segments with their attribute information to the waveform generating unit X5. The waveform generating unit X5 generates waveforms having prosodies close to the prosodies generated by the prosody generating unit X2 from the selected speech segments, connects these waveforms, and outputs the result as synthesized speech.
The segment selection unit X3 derives information (called “target segment environment” hereinafter) indicating the characteristics of the target synthesized speech from the inputted language processing results and the prosody information for each predetermined synthesis unit. The target segment environment includes information such as the names of the corresponding, preceding, and succeeding phonemes, the presence or absence of stress, the distance from the accent nucleus, the pitch frequency and power of the synthesis unit, the duration of the unit, the cepstrum, MFCCs (Mel Frequency Cepstral Coefficients), and the Δ amount (the variation per unit time) of these factors. Next, having derived the target segment environment, the segment selection unit X3 selects, from the segment information storage unit X4, a plurality of speech segments matching particular information (mainly the corresponding phoneme) specified by the target segment environment. The selected speech segments become candidates for the speech segments used for synthesis. Then, a “score” (or “cost”) indicating the suitability of each selected candidate as a speech segment used for synthesis is calculated. Since the aim is to generate high quality synthesized speech, the higher the score (or the lower the cost), the higher the speech quality of the synthesized sound. In other words, the score is an indicator for estimating the degree of degradation of the quality of the synthesized speech.
Here, the segment selection unit X3 calculates two kinds of scores: unit score and concatenation score. The unit score indicates the estimated degree of speech quality degradation caused by using the candidate segment in the target segment environment, and is calculated according to the degree of similarity between the segment environment of the candidate segment and the target segment environment. Meanwhile, the concatenation score indicates the estimated degree of speech quality degradation caused when the segment environments of connected speech segments are discontinuous, and is calculated according to the degree of affinity between the segment environments of adjacent candidate segments. A variety of methods are proposed for calculating the unit score and the concatenation score. Generally, the unit score is calculated using the information included in the target segment environment, and the concatenation score is calculated using the pitch frequency, the cepstrum, MFCCs, the short-time autocorrelation, and the power on the connection border of segments, and the Δ amount of these factors. As described, the unit score and the concatenation score are calculated using multiple pieces of various information relating to segments, such as the pitch frequency, the cepstrum, and the power.
Regarding the configuration shown in
The segment selection unit X3 needs to calculate the unit score and the concatenation score for all the candidate segments in order to derive the optimal segment series. As the number of segments stored in the segment information storage unit, i.e., the number of candidate segments, increases, so does the calculation amount required for calculating the scores of the segments, and as a result a much longer processing time is required from the text input to the generation of synthesized speech. A basic means for reducing the calculation amount is therefore to reduce the number of candidate segments for which the unit score and the concatenation score need to be calculated. However, significant speech quality degradation may occur if an inappropriate method is employed to reduce the number of segments. For this reason, methods for reducing the calculation amount required for the segment selection processing without causing significant speech quality degradation have been investigated.
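The growth of the calculation amount with the number of candidates can be illustrated with a small counting sketch. It assumes, purely for illustration, that one unit score is calculated per candidate and that one concatenation score is calculated for every pair of candidates belonging to adjacent synthesis units; the function name `count_score_evaluations` is hypothetical and does not appear in the source.

```python
from typing import Sequence, Tuple


def count_score_evaluations(candidates_per_unit: Sequence[int]) -> Tuple[int, int]:
    """Return (unit score evaluations, concatenation score evaluations).

    Assumption: every candidate of one synthesis unit is paired with every
    candidate of the next synthesis unit when concatenation scores are computed.
    """
    unit_evals = sum(candidates_per_unit)
    concat_evals = sum(
        left * right
        for left, right in zip(candidates_per_unit, candidates_per_unit[1:])
    )
    return unit_evals, concat_evals


# Three synthesis units with 100 candidates each, before and after narrowing to 20.
full = count_score_evaluations([100, 100, 100])
narrowed = count_score_evaluations([20, 20, 20])
```

Under these assumptions, narrowing each unit from 100 to 20 candidates reduces the unit score evaluations by a factor of 5 and the concatenation score evaluations by a factor of 25, which is why reducing the number of candidates is the basic lever for reducing the calculation amount.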
For instance, Patent Document 3 proposes a method that reduces the number of the segments without negatively influencing the speech quality by investigating how frequently the segments stored in the segment information storage unit are used during the speech synthesis and excluding infrequently used segments from the segment information storage unit. Further, Patent Document 4 proposes a method that reduces the calculation amount required for the segment selection by excluding segments having low unit sub-cost from being a candidate, thereby reducing the number of segments for which the unit sub-cost and the concatenation cost are calculated.
[Patent Document 1]
[Patent Document 2]
[Patent Document 3]
[Patent Document 4]
[Non-Patent Document 1]
[Non-Patent Document 2]
[Non-Patent Document 3]
The entire disclosures of the aforementioned Patent Documents 1 to 4 and Non-Patent Documents 1 to 3 are incorporated herein by reference thereto.
The following analysis is given from the viewpoint of the present invention.
In the present application, a score derived for each piece of information is defined as “sub-score” (also called “sub cost”) hereinafter. For instance, sub-scores are a score calculated from the similarity between the pitch frequency of the target segment environment and the pitch frequency of a candidate segment as far as the unit score is concerned, and a score calculated from the similarity between the cepstrums of adjacent candidate segments as far as the concatenation score is concerned. Sub-scores related to the unit score are called “unit sub-scores” and sub-scores related to the concatenation score are called “concatenation sub-scores.” Further, regarding the concatenation score, when two segments are continuous in the original speech waveform, the value of the concatenation score is maximum since the segment environment between these segments is perfectly continuous.
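As a rough illustration of these definitions, the following sketch computes a pitch-based unit sub-score and a pitch-based concatenation sub-score. The scoring formula (negative absolute log-ratio of pitch frequencies, so that 0 is the maximum), the `Segment` fields, and the function names are all assumptions introduced for this example; the source does not specify them, and it names the cepstrum rather than pitch as the example for the concatenation side.

```python
import math
from dataclasses import dataclass

MAX_SCORE = 0.0  # assumed maximum: a perfect match scores 0, mismatches score below 0


@dataclass(frozen=True)
class Segment:
    recording_id: int  # hypothetical: which recorded utterance the segment came from
    index: int         # hypothetical: position of the segment within that recording
    pitch_hz: float


def pitch_unit_sub_score(target_pitch_hz: float, candidate: Segment) -> float:
    """Similarity between the target pitch and the candidate pitch (higher is better)."""
    return -abs(math.log2(candidate.pitch_hz / target_pitch_hz))


def pitch_concatenation_sub_score(left: Segment, right: Segment) -> float:
    """Affinity between adjacent candidates (higher is better)."""
    # Segments that were continuous in the original speech waveform are
    # perfectly continuous, so they receive the maximum value.
    if left.recording_id == right.recording_id and right.index == left.index + 1:
        return MAX_SCORE
    return -abs(math.log2(right.pitch_hz / left.pitch_hz))
```

With this assumed formula, a candidate one octave away from the target pitch receives a unit sub-score of -1.0, while a candidate at exactly the target pitch receives 0.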
However, the methods for reducing the segments in the conventional speech synthesis devices described in the aforementioned Patent Documents and Non-Patent Documents have the following problems.
First, since the method described in Patent Document 3 excludes segments according to the frequency of use, segments may be excluded from being candidates without calculating the sub-scores. Even if a segment is not used frequently, it may have a high score depending on the content of the input text. As a result, speech quality will deteriorate for an input text for which segments excluded based on the frequency of use have high scores.
Further, although Patent Document 4 discloses a configuration in which candidates are narrowed down in two or more stages in order to reduce the calculation amount, it does not disclose concrete means or criteria for appropriately narrowing down the candidates.
Therefore, the speech synthesis devices described in Patent Documents 3 and 4 are able to reduce the calculation amount, but cannot sufficiently prevent speech quality degradation.
The present invention aims at solving the problems described above, and it is an object to provide a speech synthesis device, speech synthesis method, and speech synthesis program capable of realizing speech quality improvement and the reduction of the calculation amount in a balanced manner.
According to a first aspect of the present invention, there is provided a speech synthesis device comprising: a sub-score calculation unit that calculates a segment selection sub-score for selecting an optimal segment, and a candidate narrowing unit that narrows down candidates according to the number of candidate segments and the segment selection sub-score.
According to a second aspect of the present invention, there is provided a speech synthesis method, in a speech synthesis device that generates synthesized speech from an input text, comprising a step of calculating a segment selection sub-score for selecting an optimal segment and of narrowing down candidates according to the number of candidate segments and the segment selection sub-score in a process of selecting an optimal segment.
According to a third aspect of the present invention, there is provided a program having a computer, constituting a speech synthesis device that generates synthesized speech from an input text, execute: processing of calculating a segment selection sub-score for selecting an optimal segment, and candidate narrowing processing of narrowing down candidates according to the number of candidate segments and a segment selection sub-score used when an optimal segment is selected; in a process of selecting an optimal segment for generating synthesized speech from an input text.
According to the present invention, it becomes possible to output synthesized speech without having the speech quality degradation caused by the reduction of the calculation amount. The reason is that segments having the prospect of contributing to high speech quality are selected, taking advantage of the tendency that the sub-scores of optimal segments rise as the number of the candidates increases.
Next, preferred modes for carrying out the present invention will be described in detail with reference to the drawings.
The segment selection sub-scores of the candidates narrowed down by the candidate narrowing unit 70/73 are compiled by a sub-score compiling unit provided separately, and the optimal segment is selected.
The speech synthesis device relating to the present invention narrows down the candidates according to the segment selection sub-score, applying different threshold values (
For the segment selection sub-score, either a unit sub-score or concatenation sub-score can be used.
A threshold value calculation unit that derives the threshold values according to the number of the candidate segments may be provided in the candidate narrowing unit 70/73. In this case, the candidate narrowing unit 70/73 can narrow down the candidates according to these threshold values and the segment selection sub-score.
Based on the fact that having more candidates increases the probability of having a segment close to the target value, the threshold value calculation unit may operate so as to derive a higher threshold value when the number of candidate segments is large than when it is small.
For the calculation of the threshold values by the threshold value calculation unit, the segment selection sub-score can be used. Particularly, more efficient threshold values can be obtained by deriving threshold values according to statistics of the segment selection sub-scores of the optimal segments.
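A threshold value that rises with the number of candidate segments could be sketched as follows. This is only an illustration: the piecewise-linear interpolation, the calibration points, and the name `make_threshold_function` are assumptions made for the example and are not taken from the source.

```python
from bisect import bisect_right
from typing import Callable, Sequence, Tuple


def make_threshold_function(
    points: Sequence[Tuple[int, float]],
) -> Callable[[int], float]:
    """Build a threshold function from (number of candidates, threshold) pairs.

    Assumption: the pairs would come from statistics of the sub-scores of
    optimal segments; here they are simply given by the caller.
    """
    if not points:
        raise ValueError("at least one calibration point is required")
    pts = sorted(points)
    counts = [count for count, _ in pts]
    values = [value for _, value in pts]
    if len(set(counts)) != len(counts):
        raise ValueError("candidate counts must be distinct")
    if any(later < earlier for earlier, later in zip(values, values[1:])):
        raise ValueError("threshold must not decrease as the candidate count grows")

    def threshold(num_candidates: int) -> float:
        # Clamp outside the calibrated range.
        if num_candidates <= counts[0]:
            return values[0]
        if num_candidates >= counts[-1]:
            return values[-1]
        # Linear interpolation between the two neighbouring calibration points.
        hi = bisect_right(counts, num_candidates)
        lo = hi - 1
        frac = (num_candidates - counts[lo]) / (counts[hi] - counts[lo])
        return values[lo] + frac * (values[hi] - values[lo])

    return threshold


# Hypothetical calibration: scores are <= 0 and a higher score is better.
threshold_for = make_threshold_function([(10, -2.0), (100, -1.0), (1000, -0.5)])
```

The monotonicity check enforces the property stated above: a larger number of candidates never yields a lower threshold value.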
Alternatively, a weighting function selection unit that selects a weighting function according to the number of the candidate segments and a weighting unit that weights the segment selection score according to the weighting function and the segment selection score may be provided upstream of the candidate narrowing unit. In this case, the candidate narrowing unit 70/73 narrows down the candidates according to the weighted segment selection score.
Next, a first exemplary embodiment of the present invention will be described in detail with reference to the drawings.
The text storage unit 800M stores a large amount of texts required for analyzing and extracting the characteristics of the unit sub-scores of the optimal segments.
In order to derive an appropriate threshold value, it is preferable that the operations of the language processing unit 801M, the prosody generating unit 802M, the segment selection unit 803M, and the segment information storage unit 804M be identical to the operations of the language processing unit 1, the prosody generating unit 2, the segment selection unit 3, and the segment information storage unit 4 in
The operation of the speech synthesis device of the first exemplary embodiment will be described in detail with reference to the block diagrams in
The first threshold value calculation unit 801 calculates a threshold value, which will be a reference value for narrowing down the candidates, from the number of the candidates supplied by the number-of-candidates obtaining unit 200, and transmits the information to the first candidate narrowing unit 701 (step A2).
The first unit sub-score calculation unit 601 calculates a first unit sub-score according to the language processing results supplied by the language processing unit 1, the prosody information supplied by the prosody generating unit 2, and segment information stored in the segment information storage unit, and transmits the sub-score to the first candidate narrowing unit 701 (step A3).
The first candidate narrowing unit 701 compares the first unit sub-score of each candidate segment supplied by the first unit sub-score calculation unit 601 to the threshold value supplied by the first threshold value calculation unit 801, excludes candidate segments having unit sub-scores lower than the threshold value, and transmits the remaining candidate segments and their unit sub-scores to the second unit sub-score calculation unit 602 (step A4).
The processing from the step A2 to the step A4 is similarly repeated by the second threshold value calculation unit, the second unit sub-score calculation unit, and the second candidate narrowing unit through the Nth threshold value calculation unit, the Nth unit sub-score calculation unit, and the Nth candidate narrowing unit until the last unit sub-score is calculated (step A5). The last Nth candidate narrowing unit 70N transmits the remaining candidate segments and their first through Nth unit sub-scores to the unit sub-score compiling unit 121.
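The repeated processing of the steps A2 to A4 can be sketched as a cascade in which each stage calculates one unit sub-score only for the candidates that survived the previous stage. The sketch is a simplification under stated assumptions: candidates are represented by string identifiers, each stage is a pair of hypothetical callables, and the number of candidates obtained before narrowing is reused in every stage (the visible text does not state whether the count is updated between stages).

```python
from typing import Callable, Dict, List, Sequence, Tuple

SubScoreFn = Callable[[str], float]   # candidate id -> sub-score (higher is better)
ThresholdFn = Callable[[int], float]  # number of candidates -> threshold value


def narrow_candidates(
    candidates: Sequence[str],
    stages: Sequence[Tuple[SubScoreFn, ThresholdFn]],
) -> Dict[str, List[float]]:
    """Return the surviving candidates with their first through Nth sub-scores."""
    num_candidates = len(candidates)  # assumed to be obtained once, before narrowing
    surviving: Dict[str, List[float]] = {cand: [] for cand in candidates}
    for sub_score_fn, threshold_fn in stages:
        threshold = threshold_fn(num_candidates)         # step A2
        next_surviving: Dict[str, List[float]] = {}
        for cand, scores in surviving.items():
            score = sub_score_fn(cand)                   # step A3
            if score >= threshold:                       # step A4: drop lower scores
                next_surviving[cand] = scores + [score]
        surviving = next_surviving
    return surviving
```

Because excluded candidates never reach the later stages, their remaining sub-scores are never calculated, which is where the reduction in the calculation amount comes from.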
The unit sub-score compiling unit 121 derives a unit score corresponding to each candidate segment according to the candidate segments supplied by the Nth candidate narrowing unit 70N and their first through Nth unit sub-scores, and transmits the unit scores along with the candidate segments to the concatenation score calculation unit 13 (step A6).
The unit score can be derived from the unit sub-scores by, for instance, taking a weighted sum of the unit sub-scores as the unit score. In other words, when the i-th unit sub-score is C_i and its weighting coefficient is w_i, the unit score C over the first through Nth unit sub-scores can be derived by the following expression:

[EXPRESSION 1]
$$C = \sum_{i=1}^{N} w_i C_i$$
Note that it is not necessary to calculate the threshold values and narrow down the candidates for all the kinds of sub-scores. The method described above in which the threshold value is derived according to the number of the candidates is expected to be highly effective for sub-scores such as pitch, duration, power, cepstrum, and MFCC. This is because, as the number of the candidates increases, so does the probability of having a candidate segment close to the target value of the target segment environment, and conversely, as the number of the candidates decreases, so does the probability of having a candidate segment close to the target value. On the other hand, it is difficult to expect high efficacy for sub-scores such as the names of the corresponding, preceding, and succeeding phonemes, whether or not there is the stress, and the distance from the accent nucleus since the scores are discrete and the value range is not large.
Here, how the threshold value is derived in the step A2 will be described.
With reference to the flowchart in
The prosody generating unit 802M generates the prosody information for synthesized speech according to the language processing results supplied by the language processing unit 801M, and transmits the information to the segment selection unit 803M (step A8).
The segment selection unit 803M derives the optimal segment according to the language processing results supplied by the language processing unit 801M, the prosody information supplied by the prosody generating unit 802M, and the segment information stored in the segment information storage unit 804M, and transmits the optimal segment to the Mth unit sub-score calculation unit 805M (step A9).
The Mth unit sub-score calculation unit 805M calculates the Mth unit sub-score of the optimal segment supplied by the segment selection unit 803M according to the language processing results supplied by the language processing unit 801M, the prosody information supplied by the prosody generating unit 802M, and the segment information stored in the segment information storage unit 804M, and transmits the sub-score to the optimal segment sub-score analyzing unit 807M (step A10).
The optimal segment sub-score analyzing unit 807M differs from the Mth unit sub-score calculation unit 60M in
Regarding the details of the operations of the language processing unit 801M, the prosody generating unit 802M, the segment selection unit 803M, the segment information storage unit 804M, and the Mth unit sub-score calculation unit 805M, since they are equivalent to the operations of the language processing unit 1, the prosody generating unit 2, the segment selection unit 3, the segment information storage unit 4, and the Mth unit sub-score calculation unit 60M in
The optimal segment sub-score analyzing unit 807M analyzes the Mth unit sub-scores of the optimal segments supplied by the segment information storage unit 804M and the Mth unit sub-score calculation unit 805M, and transmits an analysis value, which will be the reference value when the threshold value function is designed, to the threshold value function generating unit 808M (step A11).
The object of the optimal segment sub-score analyzing unit 807M is to analyze the unit sub-score of the optimal segment, and derive reference value and analysis value useful for designing the threshold value function that calculates the threshold value for effectively narrowing the candidates.
When the candidates are narrowed down, excluding as many non-optimal segments as possible makes the narrowing effective and greatly reduces the calculation amount with only small speech quality degradation. It is therefore important to obtain the characteristics of the sub-scores so that the differences between the optimal segments eventually selected and the ordinary (non-optimal) segments are clear. This can be achieved by, for instance, deriving statistical values such as an average and distribution from the sub-scores of a large number of optimal segments, or by studying the frequency distribution.
In the present example, a method in which the frequency distributions of the scores are derived for different numbers of the candidates in the optimal segment sub-score analyzing unit 807M and the analysis value transmitted to the threshold value function generating unit 808M is derived from the frequency distributions will be described.
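One possible way to derive such frequency distributions and an analysis value from them is sketched below. The binning scheme, the `keep_ratio` criterion (the lowest score bin edge that still covers a given fraction of the optimal segments), and all names are assumptions made for illustration; the analysis values actually used in the present example are defined with reference to the drawings.

```python
import math
from bisect import bisect_right
from collections import Counter, defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def frequency_distributions(
    samples: Iterable[Tuple[int, float]],
    count_bin_edges: Sequence[int],
    score_bin_width: float,
) -> Dict[int, Counter]:
    """Histogram of optimal-segment sub-scores for each candidate-count bin.

    samples: (number of candidates, sub-score of the optimal segment) pairs.
    """
    histograms: Dict[int, Counter] = defaultdict(Counter)
    for num_candidates, sub_score in samples:
        count_bin = bisect_right(count_bin_edges, num_candidates)
        score_bin = math.floor(sub_score / score_bin_width)
        histograms[count_bin][score_bin] += 1
    return dict(histograms)


def analysis_values(
    histograms: Dict[int, Counter],
    score_bin_width: float,
    keep_ratio: float,
) -> Dict[int, float]:
    """Lower edge of the score bin at which keep_ratio of optimal segments is covered."""
    values: Dict[int, float] = {}
    for count_bin, histogram in histograms.items():
        needed = keep_ratio * sum(histogram.values())
        kept = 0
        for score_bin in sorted(histogram, reverse=True):  # best scores first
            kept += histogram[score_bin]
            if kept >= needed:
                values[count_bin] = score_bin * score_bin_width
                break
    return values
```

A value obtained this way for each candidate-count bin could serve as a reference point when the threshold value function is designed, since a threshold at that value would retain roughly `keep_ratio` of the segments that turned out to be optimal.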
In the example shown in
According to the analysis value supplied by the optimal segment sub-score analyzing unit 807M, the threshold value function generating unit 808M derives the threshold value function that derives the threshold value from the number of the candidates, and transmits the threshold value function to the threshold value calculation unit 809M (step A12). In the present example, explanations will be made with the assumption that the analysis values supplied by the optimal segment sub-score analyzing unit 807M are k1, k2, p1, p2, and p3 described above.
Meanwhile, considering that the number of the candidates and the threshold values are proportional to each other, a diode function that passes through p1, p2, and p3 while keeping a distance from k1 and k2, as shown in
It is preferred that the design of the threshold value function, i.e., the processing from the step A7 to the step A12 in the flowchart in
The threshold value calculation unit 809M derives the threshold value according to the threshold value function supplied by the threshold value function generating unit 808M and the number of the candidates supplied by the number-of-candidates obtaining unit 200 in
In the present exemplary embodiment, the speech synthesis device derives the threshold value for narrowing down the candidates from the number of the candidates, utilizing the fact that the sub-scores of the optimal segments rise as the number of the candidates increases. Further, the speech synthesis device excludes segments having low sub-scores from being the candidates using the threshold value derived according to the number of the candidates. As a result, with a high probability, it is possible to exclude segments having a low possibility of being selected as an optimal segment while keeping segments having a prospect of achieving high speech quality. Particularly, the threshold value function that derives the threshold value from the number of the candidates is determined based on the statistics of the sub-scores of the optimal segments. As a result, the possibility of excluding segments that would be deemed optimal segments without any screening is sufficiently low even when the method for narrowing down the candidates described in the present exemplary embodiment is employed.
Next, a second exemplary embodiment of the present invention will be described in detail with reference to the drawings.
The operation of the speech synthesis device of the second exemplary embodiment will be described in detail with reference to the block diagrams in
The first unit sub-score calculation unit 601 calculates the first unit sub-score according to the language processing results supplied by the language processing unit 1, the prosody information supplied by the prosody generating unit 2, and the segment information stored in the segment information storage unit, and transmits the sub-score to the first weighting unit 8111 (the step A3).
The first weighting unit 8111 derives a weight corresponding to the unit sub-score according to the first unit sub-score of each candidate segment supplied by the first unit sub-score calculation unit 601 and the weighting function supplied by the first weighting function selection unit 811, and weights the unit sub-score with it. Then the first weighting unit 8111 transmits the weighted unit sub-score along with the candidate segments to the first candidate narrowing unit 711 (step B2).
The first candidate narrowing unit 711 excludes candidate segments having weighted unit sub-scores lower than a predetermined threshold value according to the candidate segments supplied by the first weighting unit 8111 and the first weighted unit sub-score of each candidate segment, and transmits the remaining candidate segments and their weighted unit sub-scores to the second unit sub-score calculation unit 602 (step B3).
The processing from the step B1 to the step B3 is similarly repeated by the second weighting function selection unit, the second unit sub-score calculation unit, the second weighting unit, and the second candidate narrowing unit through the Nth weighting function selection unit, the Nth unit sub-score calculation unit, the Nth weighting unit, and the Nth candidate narrowing unit until the last unit sub-score is calculated (the step A5). The last Nth candidate narrowing unit 71N transmits the remaining candidate segments and their first through Nth unit sub-scores to the unit sub-score compiling unit 121.
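One stage of this weighting-based narrowing might look like the following sketch. It assumes sub-scores in the range 0 to 1 where higher is better, a table of weighting functions keyed by a minimum number of candidates, and a weighted sub-score formed as weight times sub-score; these choices and all names are hypothetical and are not taken from the source.

```python
from typing import Callable, Dict, Sequence, Tuple

WeightingFn = Callable[[float], float]  # sub-score -> weight


def select_weighting_function(
    num_candidates: int,
    table: Sequence[Tuple[int, WeightingFn]],
) -> WeightingFn:
    """Pick the entry with the largest minimum count not exceeding num_candidates.

    Assumption: the table is sorted by ascending minimum count.
    """
    selected = table[0][1]
    for min_candidates, weighting_fn in table:
        if num_candidates >= min_candidates:
            selected = weighting_fn
    return selected


def narrow_by_weighted_sub_score(
    sub_scores: Dict[str, float],
    num_candidates: int,
    table: Sequence[Tuple[int, WeightingFn]],
    fixed_threshold: float,
) -> Dict[str, float]:
    """Weight each sub-score, then keep candidates at or above a fixed threshold."""
    weighting_fn = select_weighting_function(num_candidates, table)
    kept: Dict[str, float] = {}
    for cand, score in sub_scores.items():
        weighted = weighting_fn(score) * score
        if weighted >= fixed_threshold:
            kept[cand] = weighted
    return kept


# Hypothetical table: no penalty for few candidates, stronger penalty for many.
WEIGHTING_TABLE = [(0, lambda s: 1.0), (100, lambda s: s)]
```

With the hypothetical table above, a candidate with a sub-score of 0.3 survives a fixed threshold of 0.25 when there are few candidates, but is excluded when there are 100 or more, because its weighted sub-score drops to 0.09.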
With reference to
From the weighting functions stored in the weighting function storage unit 851M, the function selection unit 852M selects a weighting function corresponding to the number of the candidates supplied by the number-of-candidates obtaining unit 200 in
According to the present exemplary embodiment, a speech synthesis device that narrows down the candidates using weighted scores rather than the threshold values can be obtained. In particular, segments that would be excluded in the first exemplary embodiment because their scores fall below the threshold value by only a small margin are kept in the second exemplary embodiment, although their scores are made smaller by the weight. As a result, speech quality is expected to improve compared to the first exemplary embodiment, since these remaining segments may contribute to an improvement in speech quality.
Next, a third exemplary embodiment of the present invention will be described in detail with reference to the drawings.
The operation of the speech synthesis device of the third exemplary embodiment will be described in detail with reference to the block diagrams in
Therefore, when there is any unit sub-score having a score exceeding a predetermined threshold value among the supplied unit sub-scores, or when the total sum of the supplied unit sub-scores exceeds the predetermined threshold value, the threshold value is corrected to be smaller. Further, a method in which the threshold value is decreased as the sub-scores increase is effective. On the other hand, when the supplied sub-scores as a whole are small, since the segments are unlikely to be selected as the optimal segments, it is effective to increase the possibility of them being excluded from being the candidates by correcting the threshold value to a larger value. In the present example, all of the first to the M-1th sub-scores are utilized, however a method that utilizes only specific unit sub-scores (for instance only the first unit sub-score, or only the first to the third sub-scores) is effective as well.
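The correction rule described above can be sketched as follows. The fixed correction step `delta` and the parameter names are assumptions for illustration; the source only states the direction of the correction, not its magnitude.

```python
from typing import Sequence


def correct_threshold(
    base_threshold: float,
    previous_sub_scores: Sequence[float],
    high_score: float,
    high_total: float,
    low_total: float,
    delta: float,
) -> float:
    """Correct the Mth threshold using the first to the (M-1)th unit sub-scores."""
    total = sum(previous_sub_scores)
    # A single high sub-score, or a high total, makes the segment a likely
    # optimal segment: lower the threshold so it is less likely to be excluded.
    if any(score > high_score for score in previous_sub_scores) or total > high_total:
        return base_threshold - delta
    # Sub-scores that are small as a whole make the segment an unlikely
    # optimal segment: raise the threshold so it is more likely to be excluded.
    if total < low_total:
        return base_threshold + delta
    return base_threshold
```

A variant that uses only specific sub-scores, as mentioned above, would simply pass a slice such as the first one or the first three sub-scores as `previous_sub_scores`.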
According to the present exemplary embodiment, the speech synthesis device corrects the threshold value calculated by the Mth threshold value calculation unit according to the values of the first to the M-1th unit sub-scores. More particularly, when the first to the M-1th unit sub-scores include a unit sub-score having a high score, the threshold value is corrected to a smaller value so as to increase the possibility of the segment being selected as an optimal segment. As a result, an improvement in speech quality over the first exemplary embodiment can be expected, since segments having high unit scores are less likely to be excluded from being the candidates when the candidates are narrowed down using the Mth unit sub-score.
Next, a fourth exemplary embodiment of the present invention will be described in detail with reference to the drawings.
The operation of the speech synthesis device according to the fourth exemplary embodiment will be described in detail with reference to the block diagram in
The first threshold value calculation unit 831 calculates the threshold value, which will be the reference value for narrowing down the candidates, from the number of the candidates supplied by the number-of-candidates obtaining unit 201, and transmits the threshold value to the first candidate narrowing unit 731 (step D2).
The first concatenation sub-score calculation unit 651 calculates a first concatenation sub-score according to the candidate segments supplied by the unit score calculation unit 11 and the segment information stored in the segment information storage unit 4, and transmits the concatenation sub-score to the first candidate narrowing unit 731 along with the unit scores of the candidate segments supplied by the unit score calculation unit 11 (step D3).
The first candidate narrowing unit 731 compares the first concatenation sub-score of each candidate segment supplied by the first concatenation sub-score calculation unit 651 to the threshold value supplied by the first threshold value calculation unit 831, excludes candidate segments having concatenation sub-scores lower than the threshold value, and transmits the remaining candidate segments and their unit scores and the first concatenation sub-scores to the second concatenation sub-score calculation unit 652 (step D4).
The processing from the step D2 to the step D4 is similarly repeated by the second threshold value calculation unit, the second concatenation sub-score calculation unit, and the second candidate narrowing unit through the Nth threshold value calculation unit, the Nth concatenation sub-score calculation unit, and the Nth candidate narrowing unit until the last concatenation sub-score is calculated (step D5). The last Nth candidate narrowing unit 73N transmits the remaining candidate segments, their unit scores, and their first through Nth concatenation sub-scores to the concatenation sub-score compiling unit 122.
The concatenation sub-score compiling unit 122 derives a concatenation score corresponding to each candidate segment according to the candidate segments supplied by the Nth candidate narrowing unit 73N and their first through Nth concatenation sub-scores, and transmits the concatenation scores to the optimal segment search unit 14 along with the candidate segments and the unit scores (step D6). The concatenation score can be derived from the concatenation sub-scores by, for instance, deeming a weighted sum of the concatenation sub-scores to be the concatenation score as in the case of the unit score in the first exemplary embodiment.
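The weighted-sum compilation mentioned here can be written as a short sketch. The function name `compile_score` and the example weights are hypothetical; the source only states that a weighted sum of the sub-scores may be taken as the score.

```python
from typing import Sequence


def compile_score(sub_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the first through Nth sub-scores of one candidate segment."""
    if len(sub_scores) != len(weights):
        raise ValueError("one weighting coefficient is required per sub-score")
    return sum(w * c for w, c in zip(weights, sub_scores))


# Three concatenation sub-scores of one candidate with hypothetical weights.
concatenation_score = compile_score([1.0, 2.0, 4.0], [0.5, 0.25, 0.25])
```

The same function applies unchanged to unit sub-scores, since both scores are compiled in the same way.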
The Mth concatenation sub-score calculation unit 855M differs from the Mth concatenation sub-score calculation unit 65M in
According to the present exemplary embodiment, a speech synthesis device that narrows down the candidates using the concatenation sub-score rather than the unit sub-score can be obtained. As a result, the entire calculation amount required for calculating the concatenation scores can be reduced. Particularly, in cases where the calculation amount required for the unit scores is small, the types of the concatenation sub-scores are many, and the calculation amount required for the concatenation sub-scores is large, the calculation amount can be reduced greatly, compared to the first to the third exemplary embodiments described above.
When the candidates are narrowed down using the concatenation sub-scores, a method that narrows down candidates having a high possibility of being an optimal segment by selecting a weighting function as in the second exemplary embodiment described above and weighting the score according to the number of the candidates can be employed. In this case, compared to the fourth exemplary embodiment, the fifth exemplary embodiment keeps segments hitherto excluded due to their scores being smaller than the threshold value only by a small value, although their scores are made smaller due to the weight. As a result, an improvement in speech quality over the fourth exemplary embodiment can be expected, since these remaining segments may contribute to an improvement in speech quality.
When the candidates are narrowed down using the concatenation sub-scores, a method that narrows down candidates having a high possibility of being an optimal segment after the threshold value obtained according to the number of the candidates has been corrected according to the scores as in the third exemplary embodiment described above can be employed. In this case, compared to the fourth exemplary embodiment, since segments having high unit scores are less likely to be excluded from being the candidates when they are narrowed down using the Mth unit sub-scores, an improvement in speech quality over the fourth exemplary embodiment can be expected.
The present invention is not limited to the exemplary embodiments described above, and further variations, replacements, and adjustments may be added within the scope of the basic technological concept of the present invention. For instance, although [EXPRESSION 1] is used as an example to calculate the score C in the exemplary embodiments described above, the various score (cost) calculation formulas in Patent Documents 1 and 2 and the Non-Patent Documents may be used instead.
Further, although the configurations and operations of the speech synthesis devices are mainly described in the exemplary embodiments above, the speech synthesis devices can also be realized by a program that causes a computer to function as each means of the speech synthesis devices, or by a program that causes a computer to execute each procedure of the speech synthesis devices.
It should be noted that other objects, features and aspects of the present invention will become apparent in the entire disclosure and that modifications may be done without departing from the gist and scope of the present invention as disclosed herein and claimed as appended herewith.
Also it should be noted that any combination of the disclosed and/or claimed segments, matters and/or items may fall under the modifications aforementioned.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2007-307507 | Nov 2007 | JP | national |

| Filing Document | Filing Date | Country | Kind | 371c Date |
|---|---|---|---|---|
| PCT/JP2008/071351 | 11/25/2008 | WO | 00 | 5/26/2010 |