This application is based upon and claims the benefit of the priority of Japanese patent application No. 2007-039622 filed on Feb. 20, 2007, the disclosure of which is incorporated herein in its entirety by reference thereto.
The present invention relates to speech synthesizing technology, and in particular to a speech synthesizing apparatus, method, and program for synthesizing speech from text.
Heretofore, there have been developed various speech synthesizing apparatuses for analyzing text and generating synthesized speech by rule-based synthesis from speech information indicated by the text.
Referring to
The speech segment information storage unit 15 includes a speech segment storage unit 152 for storing an original speech waveform (referred to below as “speech segment”) divided into speech synthesis units, and an associated information storage unit 151 in which attribute information of each speech segment is stored.
Here, the original speech waveform is a natural speech waveform collected in advance for use in generating synthesized speech.
The attribute information of the speech segments includes phonological information, such as the phoneme context in which each speech segment is uttered, and prosody information, such as pitch frequency, amplitude, continuous time length information, and the like.
In the speech synthesizing apparatus of
The language processing unit 10 performs morphological analysis, syntax analysis, reading analysis and the like, on input text, and outputs a symbol sequence representing a “reading” of a phonemic symbol or the like, a morphological part of speech, conjugation, an accent type and the like, as language processing results, to the prosody generation unit 11 and the segment selection unit 16.
The prosody generation unit 11 generates prosody information (information on pitch, length of time, power, and the like) for the synthesized speech, based on the language processing result output from the language processing unit 10, and outputs the generated prosody information to the segment selection unit 16 and the prosody control unit 18.
The segment selection unit 16 selects speech segments having a high degree of compatibility with regard to the language processing result and the generated prosody information, from among speech segments stored in the speech segment information storage unit 15, and outputs the selected speech segment in conjunction with associated information of the selected speech segment to the prosody control unit 18.
The prosody control unit 18 generates a waveform having a prosody generated by the prosody generation unit 11, from the selected speech segments, and outputs the result to the waveform connection unit 19.
The waveform connection unit 19 connects the speech segments output from the prosody control unit 18 and outputs the result as synthesized speech.
The segment selection unit 16 obtains information (referred to as target segment environment) representing characteristics of target synthesized speech, from the input language processing result and the prosody information, for each prescribed synthesis unit.
The following may be cited as information included in the target segment environment:
respective phoneme names of phoneme in question, preceding phoneme, and subsequent phoneme,
presence or absence of stress,
distance from accent core,
pitch frequency and power for representative point, start point, and end point of a synthesis unit, and
continuous time length of unit.
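As an illustration only, the items listed above could be held in a record such as the following. This is a minimal sketch: the class name, field names, and units are hypothetical and do not appear in the original description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TargetSegmentEnvironment:
    """Target characteristics for one synthesis unit (hypothetical field names)."""
    phoneme: str                           # phoneme in question
    preceding_phoneme: Optional[str]       # None at the start of an utterance
    subsequent_phoneme: Optional[str]      # None at the end of an utterance
    stressed: bool                         # presence or absence of stress
    distance_from_accent_core: int         # counted in units (assumed convention)
    pitch_hz: Tuple[float, float, float]   # start, representative, end point
    power: Tuple[float, float, float]      # start, representative, end point
    duration_s: float                      # continuous time length of the unit


# Example values are invented for illustration.
env = TargetSegmentEnvironment(
    phoneme="a", preceding_phoneme="k", subsequent_phoneme="i",
    stressed=True, distance_from_accent_core=0,
    pitch_hz=(118.0, 125.0, 121.0), power=(0.6, 0.8, 0.7), duration_s=0.085,
)
```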
Next, when the target segment environment is given, the segment selection unit 16 selects a plurality of speech segments matching specific information (mainly the phoneme in question) designated by the target segment environment, from the speech segment information storage unit 15. The selected speech segments form candidates for speech segments used in synthesis.
The segment selection unit 16, with regard to the selected candidate segments, calculates “cost” which is an index indicating suitability as speech segments used in the synthesis. Since generation of synthesized speech of high sound quality is a target, if the cost is small, that is, if the suitability is high, the sound quality of the synthesized sound is high. Therefore, the cost may be said to be an indicator for estimating deterioration of the sound quality of the synthesized speech.
The cost calculated by the segment selection unit 16 includes a unit cost and a concatenation cost.
Since the unit cost represents estimated sound quality deterioration produced by using candidate segments under the target segment environment, computation is executed based on degree of similarity of the segment environment of the candidate segments and the target segment environment.
On the other hand, since concatenation cost represents estimated sound quality deterioration level produced by a segment environment between concatenated speech segments being non-continuous, the cost is calculated based on affinity level of segment environments of adjacent candidate segments.
Various methods of calculating the unit cost and the concatenation cost have been proposed heretofore.
In general, information included in the target segment environment is used in the computation of the unit cost.
Pitch frequency, cepstrum, power, and the Δ amounts thereof (amounts of change per unit time) at the concatenation boundary of a segment are used in the concatenation cost.
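The two kinds of cost described above can be sketched as follows. This is not a formula taken from the cited documents: the dictionary keys, the weights, and the use of log ratios and absolute differences are assumptions chosen only to show that the unit cost compares a candidate with the target, while the concatenation cost compares adjacent candidates at their boundary.

```python
import math


def unit_cost(target, candidate, w_pitch=1.0, w_dur=1.0, w_ctx=1.0):
    """Estimated degradation from using `candidate` under `target` (assumed form)."""
    pitch_term = abs(math.log(candidate["pitch_hz"] / target["pitch_hz"]))
    dur_term = abs(math.log(candidate["duration_s"] / target["duration_s"]))
    # Count mismatches in the preceding and subsequent phoneme context.
    ctx_term = sum(candidate[k] != target[k] for k in ("prev", "next"))
    return w_pitch * pitch_term + w_dur * dur_term + w_ctx * ctx_term


def concatenation_cost(left, right, w_pitch=1.0, w_power=1.0):
    """Estimated degradation from a discontinuity at the joint (assumed form)."""
    pitch_jump = abs(math.log(right["start_pitch_hz"] / left["end_pitch_hz"]))
    power_jump = abs(right["start_power"] - left["end_power"])
    return w_pitch * pitch_jump + w_power * power_jump


# Invented example values.
target = {"pitch_hz": 120.0, "duration_s": 0.08, "prev": "k", "next": "i"}
near = {"pitch_hz": 118.0, "duration_s": 0.08, "prev": "k", "next": "i"}
far = {"pitch_hz": 150.0, "duration_s": 0.12, "prev": "t", "next": "i"}

left = {"end_pitch_hz": 120.0, "end_power": 0.7}
smooth = {"start_pitch_hz": 121.0, "start_power": 0.7}
jump = {"start_pitch_hz": 160.0, "start_power": 0.3}
```

With these values, `unit_cost(target, near)` is smaller than `unit_cost(target, far)`, and `concatenation_cost(left, smooth)` is smaller than `concatenation_cost(left, jump)`.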
The segment selection unit 16 calculates the concatenation cost and the unit cost for each segment, and then obtains a speech segment, for which both the concatenation cost and the unit cost are minimum, uniquely for each synthesis unit.
Since a segment obtained by cost minimization is selected as a segment most suited to speech synthesis from among the candidate segments, it is referred to as an “optimum segment”.
The segment selection unit 16 obtains the optimum segment for every synthesis unit, and finally outputs the sequence of these segments (optimal segment sequence) as a segment selection result to the prosody control unit 18.
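The text does not state how the minimum-cost sequence is found. A common choice for this kind of problem is dynamic programming over the candidate lattice, and the following sketch assumes that approach; the function name and the cost callables are hypothetical.

```python
def search_optimal_sequence(candidates, unit_cost, concat_cost):
    """Return (segments, total_cost) minimizing unit plus concatenation costs.

    candidates: one non-empty list of candidate segments per synthesis unit.
    unit_cost(i, seg): cost of using seg for unit i.
    concat_cost(prev_seg, seg): cost of joining two adjacent segments.
    """
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every synthesis unit needs at least one candidate")
    best = [unit_cost(0, s) for s in candidates[0]]
    back = []  # back[i - 1][k]: best predecessor index for candidate k of unit i
    for i in range(1, len(candidates)):
        prev = candidates[i - 1]
        new_best, pointers = [], []
        for s in candidates[i]:
            scores = [best[j] + concat_cost(p, s) for j, p in enumerate(prev)]
            j_min = min(range(len(prev)), key=scores.__getitem__)
            new_best.append(scores[j_min] + unit_cost(i, s))
            pointers.append(j_min)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best)), key=best.__getitem__)
    total = best[k]
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [candidates[i][k] for i, k in enumerate(path)], total


# Toy example: each "segment" is just its pitch in Hz (invented values).
targets = [100.0, 110.0, 120.0]
candidates = [[100.0, 130.0], [90.0, 112.0], [120.0, 150.0]]
best_sequence, total_cost = search_optimal_sequence(
    candidates,
    unit_cost=lambda i, s: abs(s - targets[i]),
    concat_cost=lambda a, b: 0.1 * abs(b - a),
)
```

In this toy example the search returns the candidates 100.0, 112.0, and 120.0, which lie closest to the target pitch values while keeping the joints smooth.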
In the segment selection unit 16, as described above, the speech segments having a small unit cost are selected, that is, the speech segments having a prosody close to a target prosody (prosody included in the target segment environment) are selected, but it is rare for a speech segment having a prosody equivalent to the target prosody to be selected.
Therefore, in general, after the segment selection, in the prosody control unit 18 a speech segment waveform is processed to make a correction so that the prosody of the speech segment matches the target prosody.
As a representative method of correcting the prosody of the speech segment, a PSOLA (pitch-synchronous-overlap-add) method described in Non-Patent Document 4 is cited.
However, the prosody correction processing is a cause of degradation of synthesized speech. In particular, the change in pitch frequency has a large effect on sound quality degradation, and the larger the amount of the change, the larger is the sound quality deterioration.
To cope with this type of problem, methods are being developed that synthesize speech with as small a prosody change amount as possible. For example, as in Non-Patent Documents 5 and 6, a method has been proposed in which a huge quantity of speech segments is prepared, and no correction at all of the prosody of the speech segments is carried out.
In this type of method, since the quantity of segments is very large, with regard to a certain input text, speech segments having a sufficiently high level of similarity with the target prosody are selected, and even if the prosody is not corrected, synthesized speech having natural prosody is generated.
However, this type of method has problems: it is difficult to generate synthesized speech that always has natural prosody, an extremely large storage capacity is required, and the like.
Alternatively, Non-Patent Document 7 takes an approach in which an upper limit value is set for the change amount of the pitch frequency, segments having various pitch frequencies are recorded, and the like.
[Patent Document 1]
JP Patent Kokai Publication No. JP-P2005-91551A
[Patent Document 2]
JP Patent Kokai Publication No. JP-P2006-84854A
[Non-Patent Document 1]
Huang, Acero, Hon: “Spoken Language Processing”, Prentice Hall, pp. 689-836, 2001.
[Non-Patent Document 2]
Ishikawa: “Prosodic Control for Japanese Text-to-Speech Synthesis”, The Institute of Electronics, Information and Communication Engineers, Technical Report, Vol. 100, No. 392, pp. 27-34, 2000.
[Non-Patent Document 3]
Abe: “An introduction to speech synthesis units”, The Institute of Electronics, Information and Communication Engineers, Technical Report, Vol. 100, No. 392, pp. 35-42, 2000.
[Non-Patent Document 4]
Moulines, Charpentier: “Pitch-Synchronous Waveform Processing Techniques For Text-To-Speech Synthesis Using Diphones”, Speech Communication 9, pp. 453-467, 1990.
[Non-Patent Document 5]
Segi, Takagi, Ito: “A CONCATENATIVE SPEECH SYNTHESIS METHOD USING CONTEXT DEPENDENT PHONEME SEQUENCES WITH VARIABLE LENGTH AS SEARCH UNITS”, Proceedings of 5th ISCA Speech Synthesis Workshop, pp. 115-120, 2004.
[Non-Patent Document 6]
Kawai, Toda, Ni, Tsuzaki, Tokuda: “XIMERA: A NEW TTS FROM AIR BASED ON CORPUS-BASED TECHNOLOGIES”, Proceedings of 5th ISCA Speech Synthesis Workshop, pp. 179-184, 2004.
[Non-Patent Document 7]
Koyama, Yoshioka, Takahashi, Nakamura: “High Quality Speech Synthesis Using Reconfigurable VCV Waveform Segments with Smaller Pitch Modification”, Transactions of the Institute of Electronics, Information and Communication Engineers, D-II, Vol. J83-D-II, No. 11, pp. 2264-2275, 2000.
The entire disclosures of the abovementioned Patent Documents 1 and 2, and Non-Patent Documents 1 to 7 are incorporated herein by reference thereto. The following analysis is given for technology related to the present invention.
A speech synthesizing apparatus described in the abovementioned Non-Patent Document 7 and the like has problems as described below.
Sound quality of synthesized speech is apt to become non-uniform.
In a method that aims to improve the naturalness of the prosody of synthesized speech by performing prosody control, as in Non-Patent Document 7, the policy adopted to reduce the sound quality deterioration accompanying prosody control is to select a speech segment whose prosody has a high degree of similarity to a target prosody, that is, a speech segment whose prosody change amount is small. As a result, a state occurs in which, within the same text (within an optimal segment sequence), the prosody of one speech segment has a high degree of similarity to the target prosody while the prosody of another speech segment has a low degree of similarity, that is, speech segments having different degrees of prosodic similarity are mixed.
With regard to this state a description is given using
In the related art, in each synthesis unit interval, candidate segments closest to the target pitch pattern, u1, u2, u3, u4, and u5 in the example of
Since differences between the target pitch pattern and the candidate segment pitch patterns form the prosody change amounts, a situation as in
When the prosody change amounts within the same sentence are irregular in this way, a sense of non-uniformity of sound quality of the synthesized speech (a certain portion has high sound quality, and another portion has low sound quality) is brought about.
This non-uniformity of sound quality worsens the overall impression of synthesized speech. In particular, if the non-uniformity of sound quality is large, the impression of the synthesized speech is worse than in a case where the sound quality is uniformly low.
Therefore, the present invention has been made in consideration of the abovementioned problems, and it is a principal object of the invention to provide an apparatus, method, and program for eliminating the non-uniformity of sound quality in synthesized speech.
In accordance with a first aspect of the present invention, there is provided a speech synthesizing apparatus that includes a segment selection unit for selecting a segment suited to a target segment environment from among candidate segments, wherein the segment selection unit excludes, from a target of the selection, a segment having a prosody change amount whose magnitude relationship with a selection criterion determined based on a prosody change amount of the candidate segments is a predetermined prescribed relationship. In the present invention, the segment selection unit is provided with a prosody change amount calculation unit that calculates a prosody change amount of each candidate segment, based on prosody information of the target segment environment and the candidate segments, a selection criterion calculation unit that calculates a selection criterion, based on the prosody change amount, a candidate selection unit that narrows down selection candidates, based on the prosody change amount and the selection criterion, and an optimum segment search unit that searches for an optimum segment from among the narrowed-down candidate segments.
According to the abovementioned first aspect of the invention, by calculating the prosody change amount of the candidate segments, and, based on the selection criterion obtained from the prosody change amount in question, excluding, from the candidates, speech segments for which the magnitude relationship between the selection criterion and the prosody change amount is a predetermined prescribed relationship (for example, the prosody change amount is particularly small, comparatively), the variance of the prosody change amount of a speech segment, for which the possibility of being selected is high, is decreased. As a result, since the prosody change amount is made uniform, level of deterioration of sound quality due to prosody control is made uniform, and it is possible to eliminate a sense of non-uniformity of the sound quality.
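The narrowing-down of candidates in the first aspect can be sketched as follows. The text leaves the "predetermined prescribed relationship" open; this sketch assumes the example given above, in which segments with a particularly small prosody change amount are excluded, and it assumes a criterion equal to a fixed fraction of the mean. The function name and the `ratio` parameter are hypothetical.

```python
from statistics import mean, pvariance


def narrow_candidates(change_amounts, ratio=0.5):
    """Return (kept_indices, criterion) for non-negative change amounts.

    Candidates whose change amount is below ratio * mean are excluded.
    With 0 <= ratio <= 1, at least one candidate always survives, because
    the largest change amount is never below the mean.
    """
    criterion = ratio * mean(change_amounts)
    kept = [i for i, dp in enumerate(change_amounts) if dp >= criterion]
    return kept, criterion


# Invented change amounts for five candidates; the first is unusually small.
amounts = [0.02, 0.30, 0.35, 0.40, 0.33]
kept, criterion = narrow_candidates(amounts)
variance_before = pvariance(amounts)
variance_after = pvariance([amounts[i] for i in kept])
```

In this example the first candidate is dropped and the variance of the remaining change amounts is smaller than before, which is the effect the paragraph above describes.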
In accordance with a second aspect of the present invention, there is provided a speech synthesizing apparatus that includes a segment selection unit for selecting a segment suited to a target segment environment from among candidate segments, wherein the segment selection unit includes: an optimum segment search unit that searches for an optimum segment, based on the target segment environment and a segment environment of the candidate segments, a prosody change amount calculation unit that calculates a prosody change amount of each candidate segment, based on prosody information of the target segment environment and the candidate segments, a selection criterion calculation unit that calculates a selection criterion based on the prosody change amount, and a decision unit that decides, in a case where, among the optimum segments, there exists a segment having a prosody change amount whose magnitude relationship with the selection criterion is a predetermined prescribed relationship, that re-execution of search for an optimum segment is necessary, and wherein, in a case where the decision unit decides that the re-execution of the search for an optimum segment is necessary, the optimum segment search unit re-executes the search for the optimum segment.
In the present invention, the prosody change amount calculation unit calculates the prosody change amount for only an optimum segment.
In the present invention, the optimum segment search unit excludes segments that do not satisfy the selection criterion from candidates, and re-executes searching for the optimum segment.
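The search, decide, and re-search loop of the second aspect might look like the following. This is a simplified sketch under several assumptions that are not in the text: each candidate is a `(cost, change_amount)` pair, the search picks the lowest-cost candidate per unit independently (concatenation cost is ignored for brevity), the criterion is a fraction of the mean change amount of the currently chosen segments, and the loop is bounded by a hypothetical `max_rounds`.

```python
from statistics import mean


def select_with_recheck(units, ratio=0.5, max_rounds=10):
    """units: one list per synthesis unit of (cost, change_amount) candidates."""
    pools = [list(u) for u in units]
    for _ in range(max_rounds):
        # Search: cheapest remaining candidate for each unit.
        chosen = [min(pool, key=lambda c: c[0]) for pool in pools]
        # Change amounts are computed only for the chosen (optimum) segments.
        criterion = ratio * mean([c[1] for c in chosen])
        # Decide: a chosen segment with too small a change amount triggers
        # a new search, unless it is the last candidate left for its unit.
        offenders = [i for i, c in enumerate(chosen)
                     if c[1] < criterion and len(pools[i]) > 1]
        if not offenders:
            return chosen
        for i in offenders:
            pools[i].remove(chosen[i])  # exclude, then re-execute the search
    return [min(pool, key=lambda c: c[0]) for pool in pools]


# Invented candidates for three synthesis units.
units = [
    [(1.0, 0.02), (1.5, 0.28)],
    [(1.0, 0.30), (2.0, 0.10)],
    [(1.2, 0.35), (1.1, 0.33)],
]
result = select_with_recheck(units)
```

Here the first round selects `(1.0, 0.02)` for the first unit; its change amount falls below the criterion, so it is excluded and the second round selects `(1.5, 0.28)` instead.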
In accordance with a third aspect of the present invention, there is provided a speech synthesizing apparatus that includes a segment selection unit for selecting a segment suited to a target segment environment from among candidate segments, wherein the segment selection unit includes: a prosody change amount calculation unit that calculates a prosody change amount of each candidate segment, based on prosody information of the target segment environment and the candidate segments, a selection criterion calculation unit that calculates a selection criterion from the prosody change amount, a unit cost calculation unit that calculates a unit cost of each candidate segment based on the target segment environment and a segment environment of the candidate segments, and an optimum segment search unit that searches for an optimum segment from among candidate segments based on the unit cost, and wherein the unit cost calculation unit assigns a penalty to a unit cost of a segment having a prosody change amount whose magnitude relationship with the selection criterion is a predetermined prescribed relationship.
In the present invention, the unit cost calculation unit determines the penalty according to a relative relationship of the prosody change amount and the selection criterion.
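One way to make the penalty depend on the relative relationship between the prosody change amount and the selection criterion is sketched here. The linear shortfall form and the `weight` parameter are assumptions for illustration; the text does not specify the penalty function.

```python
def penalized_unit_cost(unit_cost, change_amount, criterion, weight=1.0):
    """Add a penalty that grows as change_amount falls below criterion."""
    if criterion <= 0 or change_amount >= criterion:
        return unit_cost  # no penalty when the criterion is met
    shortfall = (criterion - change_amount) / criterion  # in (0, 1]
    return unit_cost + weight * shortfall


# Invented values: the second call is penalized, the first is not.
unpenalized = penalized_unit_cost(1.0, 0.30, 0.14)
penalized = penalized_unit_cost(1.0, 0.07, 0.14, weight=2.0)
```

Unlike outright exclusion, a penalty keeps every candidate available to the search, so a segment with a small change amount can still be chosen when its other costs are low enough.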
In the present invention, the selection criterion calculation unit determines the selection criterion based on an average value of the prosody change amount.
In the present invention, the selection criterion calculation unit determines the selection criterion based on a value obtained by smoothing the prosody change amount in a time domain.
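The two ways of deriving the selection criterion can be contrasted in a short sketch. It assumes one representative prosody change amount per synthesis unit, in time order, and uses a centered moving average as the smoothing; the window size, the `ratio` scaling, and the function names are hypothetical.

```python
def average_criterion(change_amounts, ratio=0.5):
    """Single criterion for the whole utterance, from the average value."""
    return ratio * sum(change_amounts) / len(change_amounts)


def smoothed_criterion(change_amounts, window=3, ratio=0.5):
    """One criterion per unit, from a moving average over neighboring units."""
    half = window // 2
    n = len(change_amounts)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)  # clip at the edges
        out.append(ratio * sum(change_amounts[lo:hi]) / (hi - lo))
    return out


# Invented per-unit change amounts in time order.
per_unit = [0.2, 0.4, 0.6]
global_value = average_criterion(per_unit, ratio=1.0)
local_values = smoothed_criterion(per_unit, window=3, ratio=1.0)
```

A smoothed criterion follows local trends, so a passage in which all candidates need large changes is judged against its neighbors rather than against the whole utterance.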
According to the present invention, there is provided a speech synthesizing method that includes a step of selecting a segment suited to a target segment environment from among candidate segments, wherein the step of selecting the segment excludes, from a selection target, a segment having a prosody change amount whose magnitude relationship with a selection criterion determined based on a prosody change amount of the candidate segments is a predetermined prescribed relationship.
According to another aspect of the present invention, there is provided a speech synthesizing method that includes a step of selecting a segment suited to a target segment environment from among candidate segments, wherein the step of selecting the segment includes: a step of calculating a prosody change amount of each candidate segment, based on prosody information of the target segment environment and the candidate segments, a step of calculating a selection criterion based on the prosody change amount, a step of narrowing down selection candidates, based on the prosody change amount and the selection criterion, and a step of searching for an optimum segment from among the narrowed-down candidate segments, and wherein the step of narrowing down the candidate selection excludes, from a target of search for the optimum segment, a segment having a prosody change amount whose magnitude relationship with the selection criterion is a predetermined prescribed relationship.
In the present invention, the step of calculating the selection criterion, includes a step of calculating cost of each candidate segment based on the target segment environment and the segment environment of the candidate segments, and the selection criterion is calculated based on the cost.
According to another aspect of the present invention, there is provided a speech synthesizing method that includes a step of selecting a segment suited to a target segment environment from among candidate segments, wherein the step of selecting the segment includes:
a step of searching for an optimum segment, based on the target segment environment and a segment environment of the candidate segments,
a step of calculating a prosody change amount of each candidate segment, based on prosody information of the target segment environment and the candidate segments,
a step of calculating a selection criterion based on the prosody change amount, and
a step of deciding, in a case where, among the optimum segments, there exists a segment having a prosody change amount whose magnitude relationship with the selection criterion is a predetermined prescribed relationship, that re-execution of search for an optimum segment is necessary, and wherein, in a case where the step of deciding judges that the re-execution of the search for an optimum segment is necessary, the step of searching for the optimum segment re-executes the search for the optimum segment.
In the present invention, a step of calculating the prosody change amount includes: calculating the prosody change amount for only an optimum segment. In the present invention, the step of searching for the optimum segment includes excluding segments that do not satisfy the selection criterion from candidates, and re-executing the search for the optimum segment.
According to another aspect of the present invention, there is provided a speech synthesizing method that includes a step of selecting a segment suited to a target segment environment from among candidate segments, wherein the step of selecting the segment includes: a step of calculating a prosody change amount of each candidate segment, based on prosody information of the target segment environment and the candidate segments, a step of calculating a selection criterion from the prosody change amount, a step of calculating a unit cost of each candidate segment based on the target segment environment and a segment environment of the candidate segments, and a step of searching for an optimum segment from among the candidate segments based on the unit cost, and wherein the step of calculating the unit cost assigns a penalty to a unit cost of a segment having a prosody change amount whose magnitude relationship with the selection criterion is a predetermined prescribed relationship.
In the present invention, the step of calculating the unit cost determines the penalty according to a relative relationship of the prosody change amount and the selection criterion.
In the present invention, the step of calculating the selection criterion determines the selection criterion based on an average value of the prosody change amount.
In the present invention, the step of calculating the selection criterion determines the selection criterion based on a value obtained by smoothing the prosody change amount in a time domain.
According to another aspect of the present invention, there is provided a program for causing a computer, which constitutes a speech synthesizing apparatus, to execute
a processing of selecting a segment suited to a target segment environment from among candidate segments, wherein the processing of selecting the segment includes excluding, from a selection target, a segment having a prosody change amount whose magnitude relationship with a selection criterion determined based on a prosody change amount of the candidate segments is a predetermined prescribed relationship.
According to another aspect of the present invention, there is provided a program for causing a computer, which constitutes a speech synthesizing apparatus, to execute
a processing of selecting a segment suited to a target segment environment from among candidate segments, wherein the processing of selecting the segment includes:
a processing of calculating a prosody change amount of each candidate segment, based on prosody information of the target segment environment and the candidate segments,
a processing of calculating a selection criterion based on the prosody change amount,
a processing of narrowing down the selection candidates, based on the prosody change amount and the selection criterion, and
a processing of searching for an optimum segment from among the narrowed-down candidate segments, and wherein the processing of narrowing down the selection candidates includes
a processing of excluding, from a target of search for the optimum segment, a segment having a prosody change amount whose magnitude relationship with the selection criterion is a predetermined prescribed relationship.
In the computer program according to the present invention, the processing of calculating the selection criterion includes a processing of calculating cost of each candidate segment based on the target segment environment and the segment environment of candidate segments, and includes a processing of calculating the selection criterion based on the cost.
According to another aspect of the present invention, there is provided a program for causing a computer, which constitutes a speech synthesizing apparatus, to execute
a processing of selecting a segment suited to a target segment environment from among candidate segments, wherein the processing of selecting the segment includes:
a processing of searching for an optimum segment, based on the target segment environment and a segment environment of the candidate segments,
a processing of calculating a prosody change amount of each candidate segment, based on prosody information of the target segment environment and the candidate segments,
a processing of calculating a selection criterion based on the prosody change amount, and
a processing of deciding, in a case where, among the optimum segments, there exists a segment having a prosody change amount whose magnitude relationship with the selection criterion is a predetermined prescribed relationship, that re-execution of search for the optimum segment is necessary, and
wherein the processing of deciding includes a process in which, in a case where it is decided that the re-execution of the search for an optimum segment is necessary, the processing of searching for the optimum segment re-executes the search for the optimum segment.
In the computer program according to the present invention, the processing of calculating the prosody change amount includes a processing of calculating the prosody change amount for only the optimum segments.
In the computer program according to the present invention, the processing of searching for the optimum segment includes a processing of excluding segments that do not satisfy the selection criterion from candidates, and re-executing search for the optimum segment.
According to another aspect of the present invention, there is provided a program for causing a computer, which constitutes a speech synthesizing apparatus, to execute
a processing of selecting a segment suited to a target segment environment from among candidate segments, wherein the processing of selecting the segment includes:
a processing of calculating a prosody change amount of each candidate segment, based on prosody information of the target segment environment and the candidate segments,
a processing of calculating a selection criterion from the prosody change amount, a processing of calculating a unit cost of each candidate segment based on the target segment environment and a segment environment of the candidate segments, and
a processing of searching for an optimum segment from among candidate segments based on the unit cost, and wherein the processing of calculating the unit cost includes
a processing of assigning a penalty to a unit cost of a segment having a prosody change amount whose magnitude relationship with the selection criterion is a predetermined prescribed relationship.
In the computer program according to the present invention, the processing of calculating the unit cost includes a processing of determining the penalty according to a relative relationship of the prosody change amount and the selection criterion.
In the computer program according to the present invention, the processing of calculating the selection criterion includes a processing of determining the selection criterion based on an average value of the prosody change amount.
In the computer program according to the present invention, the processing of calculating the selection criterion includes a processing of determining the selection criterion based on a value obtained by smoothing the prosody change amount in a time domain.
According to the present invention, since the segment selection unit selects speech segments so that the prosody change amount is uniform, sound quality deterioration due to prosody control is made uniform, and a sense of non-uniformity of sound quality is eliminated.
Still other features and advantages of the present invention will become readily apparent to those skilled in this art from the following detailed description in conjunction with the accompanying drawings wherein only exemplary embodiments of the invention are shown and described, simply by way of illustration of the best mode contemplated of carrying out this invention. As will be realized, the invention is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the invention. Accordingly, the drawing and description are to be regarded as illustrative in nature, and not as restrictive.
The principle of the present invention will be described. In the present invention, speech segments are selected so that the prosody change amount is uniform. That is, the prosody change amount of the candidate segments is calculated, and, based on a selection criterion obtained from the prosody change amount, speech segments whose prosody change amount is particularly small relative to the others are excluded from the candidates, so that the variance of the prosody change amount of the speech segments that have a high possibility of being selected is decreased. Thus, the prosody change amount is made uniform, the level of sound quality deterioration due to prosody control is made uniform, and it is possible to eliminate a sense of non-uniformity of the sound quality. For example, in a case of applying the present invention to an example shown in
<Exemplary Embodiment 1>
Referring to
Referring to
The unit cost calculation unit 12 generates a target segment environment from a language processing result supplied by a language processing unit 10, and prosody information supplied by a prosody generation unit 11, for each synthesis unit (step A1 in
In the present exemplary embodiment, it is supposed that the target segment environment is composed of:
respective phoneme names of phoneme in question, preceding phoneme, and subsequent phoneme,
distance from accent core,
pitch frequency and power for a representative point of synthesis unit, and
continuous time length of unit.
Next, the unit cost calculation unit 12 selects, as candidate segments, a plurality of speech segments that match specific information designated by the target segment environment from a speech segment information storage unit 15 (step A2 in
The unit cost calculation unit 12 calculates a unit cost of each candidate segment, based on the target segment environment and a segment environment of the candidate segment supplied by the speech segment information storage unit 15, and outputs the unit cost to the prosody change amount calculation unit 20 and the candidate selection unit 22 (step A3).
The prosody change amount calculation unit 20 calculates the prosody change amount of each candidate segment, based on the prosody information supplied by the prosody generation unit 11, the unit cost of each candidate segment supplied by the unit cost calculation unit 12, and attribute information of each candidate segment supplied by the speech segment information storage unit 15, and transmits this to the selection criterion calculation unit 21 and the candidate selection unit 22 (step A4).
The prosody change amount is defined as the change amount of the prosody of a speech segment in the prosody control unit 18. In actuality, the prosody change amount is calculated based on pitch frequency, continuous time length, and power change amount.
Since change in power has little effect on sound quality, in the present exemplary embodiment, power change amount is not dealt with. However, it is possible to deal with power change amount in the same way as the pitch frequency or the continuous time length.
If the change amount of the pitch frequency is Δf, and change amount of the continuous time length is Δt, the prosody change amount Δp is defined by the weighted sum of Expression (1) as described below.
Δp=αΔf+βΔt (1)
In this regard, α and β are weighting coefficients.
Since the pitch frequency has a larger effect on the sound quality, α>β in many cases.
In Expression (1), the change amount of the pitch frequency, the continuous time length, and the like are effective when defined by difference.
In addition to this, a method is also effective using Expression (2) described below, of a weighted addition of logarithms of Δf and Δt.
Δp=α log(Δf)+β log(Δt) (2)
Expression (2) is particularly effective in a case where the change amount of the pitch frequency or the like is defined not by difference but by ratio.
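As a minimal illustrative sketch of Expressions (1) and (2), the following Python function computes the prosody change amount Δp from Δf and Δt. The function name and the default values of α and β are assumptions chosen only for illustration (with α>β, as noted above); they are not values taken from the embodiment.

```python
import math

def prosody_change(delta_f, delta_t, alpha=0.7, beta=0.3, use_log=False):
    """Weighted sum of the pitch and duration change amounts.

    use_log=False follows Expression (1); use_log=True follows Expression (2),
    which suits change amounts defined as ratios (both must then be positive).
    alpha and beta are illustrative weights, not values from the embodiment.
    """
    if use_log:
        return alpha * math.log(delta_f) + beta * math.log(delta_t)
    return alpha * delta_f + beta * delta_t
```

With ratio-based change amounts, an unchanged segment (Δf=Δt=1) yields Δp=0 under Expression (2), which is one reason the logarithmic form pairs naturally with ratios.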
Calculation of the change amount of the continuous time length is based on a ratio or difference of time length before and after a change.
If continuous time lengths before a change and after a change are respectively t and T, the change amount of the continuous time length, when calculated based on a ratio, is defined by the following Expression (3) or (4).
When the difference between t and T is used, Δt is defined, for example, as a distance, as in the following Expression (5) or (6).
Δt=(t−T)² (5)
Δt=|t−T| (6)
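The difference-based definitions of Expressions (5) and (6) can be sketched in Python as follows. The function name and the `squared` switch are hypothetical and serve only to select between the two expressions.

```python
def duration_change(t, T, squared=True):
    """Change amount of the continuous time length, defined by difference.

    t and T are the time lengths before and after the change.
    squared=True follows Expression (5); squared=False follows Expression (6).
    """
    diff = t - T
    return diff * diff if squared else abs(diff)
```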
The change amount of the pitch frequency, similarly to the continuous time length, is calculated based on a ratio or difference of the pitch frequency before and after a change.
However, unlike the case of the continuous time length, since pitch frequency values at, for example, the 3 points of: a start point, a midpoint, and an end point of each unit are often different, calculation using values of a plurality of locations enables calculation of change amount of the pitch frequency with better accuracy.
When the change amount of the pitch frequency is calculated using the pitch frequency at N points, the change amount Δf of the pitch frequency is given by the following Expression (7) or (8).
In this regard, fk and Fk respectively represent the pitch frequency before a change and after a change, and Wk represents a weighting coefficient.
Expression (7) and Expression (8) are definitions when ratio and difference, respectively, are used in the change amount.
In Expression (7), Δf is the product of the ratios (fk/Fk) for k=0 to N−1. When calculation is performed based on the ratio, a logarithm may be used. That is, in Expression (7), fk/Fk may be replaced by log(fk/Fk).
Where a start point, a midpoint, and an end point are used, N=3.
The larger N is, the more accurately the change amount of the pitch frequency can be calculated, but the calculation amount necessary for calculating the change amount becomes large.
If a slope of the pitch frequency at each point is used, it is possible to calculate the prosody change amount with high accuracy and with small calculation amount in comparison to when the value of N is simply made large.
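The multi-point calculation of the pitch frequency change amount can be sketched in Python as follows. The ratio form follows the description of Expression (7) as a product of the ratios fk/Fk; how the weighting coefficients Wk enter that expression is not shown here, so they are omitted from it. The difference form is an assumed weighted sum of absolute differences and is not necessarily identical to Expression (8). The function names and the uniform default weights are likewise assumptions.

```python
import math

def pitch_change_ratio(f, F):
    """Product of the ratios f[k] / F[k] over the N measurement points."""
    return math.prod(fk / Fk for fk, Fk in zip(f, F))

def pitch_change_diff(f, F, w=None):
    """Weighted sum of |f[k] - F[k]| (an ASSUMED form of the difference case)."""
    if w is None:
        w = [1.0 / len(f)] * len(f)   # assumed: uniform weights
    return sum(wk * abs(fk - Fk) for wk, fk, Fk in zip(w, f, F))
```

For N=3, the lists f and F would hold the pitch frequency at the start point, midpoint, and end point of the unit before and after the change.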
The prosody change amount given by the above definitions can be approximated by an intermediate value obtained when unit cost is calculated. When it is desired to reduce calculation amount even at the cost of the approximation accuracy, a method of substituting unit cost or an intermediate value thereof, without calculating the prosody change amount, is effective.
In the selection criterion calculation unit 21, the selection criterion is calculated using a prosody change amount of a candidate segment that has a high possibility of ultimately being selected as an optimum segment, that is, whose unit cost is low.
Therefore, in the prosody change amount calculation unit 20, if the prosody change amount only for candidate segments with a low unit cost is calculated, it is possible to reduce the calculation amount for prosody change amount more than when all candidate segments are targeted.
The selection criterion calculation unit 21 computes the candidate selection criterion necessary for narrowing down the candidate segments, based on the prosody change amount of each candidate segment supplied by the prosody change amount calculation unit 20, to be supplied to the candidate selection unit 22 (step A5).
A principal object of the candidate selection unit 22 is to exclude, from among candidate segments having a high possibility of being ultimately selected as an optimum segment (referred to as “optimum speech segment”), those segments whose prosody change amount is particularly small as compared to others.
Therefore, basically, the prosody change amounts of good candidate segments (segments whose unit cost is low) in each synthesis unit are analyzed as principal targets of analysis, and the selection criterion is calculated.
The selection criterion value may be a value that is common to all the synthesis units, or a value that is calculated sequentially for each synthesis unit. Furthermore, a case is also possible where the value is common in a specific range of an accent phrase or breath group.
A basic calculation procedure for the selection criterion is as follows.
First, for each synthesis unit, an analysis target is selected and a representative value is obtained.
Next, using the representative value of each synthesis unit, a criterion value is calculated.
A method of obtaining a representative value without selecting an analysis target, and a method of calculating the criterion value without obtaining a representative value are also effective.
Selection of the analysis target, calculation of the representative value, and calculation of the selection criterion value, as used in the present exemplary embodiment, are described in further detail below.
<Selection of Analysis Target>
There exist a plurality of methods of selecting a prosody change amount target used when calculating the selection criterion value, that is, methods of selecting the analysis target.
A simplest and most effective method is a method of having as an analysis target the prosody change amount of the best candidate segment (a segment whose unit cost is lowest) of each synthesis unit.
In such a case, since there is one analysis target for each synthesis unit, this method is also a method of obtaining a representative value at the same time.
In a case where a plurality of analysis targets are provided for each synthesis unit, a method of having, as analysis targets, the N segments with the lowest unit cost (the best N segments) in each synthesis unit is effective.
As a matter of course, the prosody change amount of all candidate segments may be the analysis target.
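The selection of analysis targets described above can be sketched in Python as follows. The function name and the representation of a candidate as a (unit cost, prosody change amount) pair are assumptions made for illustration.

```python
def select_analysis_targets(candidates, n=1):
    """Return the prosody change amounts of the n lowest-unit-cost candidates.

    candidates: list of (unit_cost, prosody_change) pairs for one synthesis unit.
    n=1 keeps only the best candidate; n=len(candidates) keeps all of them.
    """
    ranked = sorted(candidates, key=lambda c: c[0])
    return [change for _, change in ranked[:n]]
```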
<Calculation of Representative Value>
In the same way, there exist a plurality of methods of obtaining representative values of each synthesis unit necessary in calculating the selection criterion.
The most often used representative value is a statistical value such as an average value, a median value, a best value, or the like.
Rather than calculating the representative value directly, from the analysis target, a method of calculating the representative value by an analysis target weighted by weightings determined in accordance with the unit cost is also effective. That is, by assigning a large weighting to the prosody change amount of segments whose unit cost is low, in calculating the selection criterion, the effect of segments whose unit cost is low is made large. The weighting in accordance with the unit cost is an effective method, not only for the representative value, but also in calculating the selection criterion from a plurality of analysis targets.
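The following Python sketch illustrates these representative values. It assumes that the "best value" is the prosody change amount of the segment with the lowest unit cost, and that the unit-cost weighting is 1/(1+unit cost); both are assumptions for illustration, as the embodiment does not fix a particular weighting function.

```python
from statistics import mean, median

def representative_value(targets, method="mean"):
    """targets: list of (unit_cost, prosody_change) pairs for one synthesis unit."""
    changes = [change for _, change in targets]
    if method == "mean":
        return mean(changes)
    if method == "median":
        return median(changes)
    if method == "best":          # change amount of the lowest-unit-cost segment
        return min(targets, key=lambda t: t[0])[1]
    if method == "weighted":      # ASSUMED weighting: 1 / (1 + unit_cost)
        weights = [1.0 / (1.0 + cost) for cost, _ in targets]
        return sum(w * c for w, c in zip(weights, changes)) / sum(weights)
    raise ValueError(f"unknown method: {method}")
```

Any weighting that decreases as the unit cost increases would serve the same purpose of emphasizing segments that are likely to be selected.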
<Calculation of Selection Criterion Value>
Representative methods of calculating the selection criterion value are averaging and smoothing.
In a case where an average value is used, basically an average value of the representative value of each synthesis unit is calculated as the selection criterion.
When a common selection criterion in all the synthesis units is to be obtained, calculation is done using the representative value of all the synthesis units, and when a selection criterion is to be obtained for each accent phrase, calculation is done using the representative value of synthesis units composing each accent phrase.
Furthermore, a method of calculating an average value of all analysis targets, rather than a representative value, is also possible.
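The average-based selection criterion can be sketched in Python as follows. The function name and the use of an integer index to identify the accent phrase of each synthesis unit are assumptions for illustration.

```python
def average_criterion(representatives, phrase_ids=None):
    """Return one selection criterion value per synthesis unit.

    representatives: representative value of each synthesis unit, in order.
    phrase_ids: optional accent-phrase index of each unit; when omitted,
    a single average common to all synthesis units is used.
    """
    if phrase_ids is None:
        phrase_ids = [0] * len(representatives)
    sums, counts = {}, {}
    for value, pid in zip(representatives, phrase_ids):
        sums[pid] = sums.get(pid, 0.0) + value
        counts[pid] = counts.get(pid, 0) + 1
    return [sums[pid] / counts[pid] for pid in phrase_ids]
```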
When smoothing is used, basically a selection criterion is calculated for each synthesis unit. Since a value smoothed in a time domain is calculated, in a case where there exist a plurality of analysis targets for each synthesis unit, a method of first obtaining a representative value of each synthesis unit, and of smoothing the representative value in a time domain, is used.
As a representative smoothing method,
Here, in an interval (accent phrase, breath group, or the like) composed of K synthesis units, with a representative value (for example, a prosody change amount of a best candidate segment) of an i-th synthesis unit as Δq(i), in a case where a selection criterion is supposed to be obtained by smoothing by first order leaky integration, a selection criterion L(i) of the i-th synthesis unit is given by the next Expression (9).
L(i)=(1−γ)L(i−1)+γΔq(i), i=0,1, . . . , K−1 (9)
where,
γ is a time constant satisfying 0<γ<1, and
L(−1)=0.
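Expression (9) can be sketched in Python as follows. The function name and the default value of γ are assumptions for illustration only.

```python
def leaky_integration(delta_q, gamma=0.5):
    """Smooth representative values with Expression (9), starting from L(-1) = 0."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must satisfy 0 < gamma < 1")
    criteria, previous = [], 0.0
    for value in delta_q:
        previous = (1.0 - gamma) * previous + gamma * value
        criteria.append(previous)
    return criteria
```

For example, with γ=0.5 the representative values 4, 4, 0 give L(0)=2, L(1)=3, and L(2)=1.5. Because L(−1)=0, the first few criterion values of an interval are pulled toward zero.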
The candidate selection unit 22 narrows down the candidate segments, based on the selection criterion value supplied by the selection criterion calculation unit 21, the prosody change amount of the candidate segments supplied by the prosody change amount calculation unit 20, respective candidate segment information supplied by the unit cost calculation unit 12, and unit costs thereof, and transmits information of the re-selected candidate segments and the unit costs thereof to the concatenation cost calculation unit 13 (step A6).
Basically, in the candidate selection unit 22, based on the selection criterion, from among candidate segments whose unit cost is low, segments whose prosody change amount is small in comparison to others are excluded from optimum segment candidates.
A very simple method is a method of having segments whose prosody change amount is much less than the selection criterion as exclusion targets.
That is, in an i-th synthesis unit, assuming that the selection criterion is L(i), and the prosody change amount of a j-th candidate segment is Δp(i,j), if a value η obtained by the following Expression (10) or (11) is less than a threshold θ, the segment is excluded from the selection candidates.
where W1 and W2 are constants (positive real numbers).
In a case where the prosody change amount Δp(i,j) is defined based on difference, Expression (10) is effective, and in a case when defined based on ratio, Expression (11) is effective.
Otherwise, a method of calculating η based on a ratio of the selection criterion and the prosody change amount is also effective.
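The narrowing down of candidates by comparing η with the threshold θ can be sketched in Python as follows. The form η=W1·Δp(i,j)−W2·L(i) used here is only an assumed difference-based form and is not necessarily identical to Expression (10) or (11); the function name and the tuple layout of a candidate are likewise hypothetical.

```python
def narrow_candidates(candidates, criterion, theta, w1=1.0, w2=1.0):
    """Keep candidates whose eta is not below the threshold theta.

    candidates: list of (segment_id, unit_cost, prosody_change) for one unit.
    eta = w1 * prosody_change - w2 * criterion is an ASSUMED difference-based
    form; the embodiment's own expressions for eta are not reproduced here.
    """
    kept = []
    for seg_id, cost, change in candidates:
        eta = w1 * change - w2 * criterion
        if eta >= theta:
            kept.append((seg_id, cost, change))
    return kept
```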
The concatenation cost calculation unit 13 calculates the concatenation cost of each candidate segment based on candidate segment information supplied by the candidate selection unit 22 and attribute information of each speech segment supplied by the speech segment information storage unit 15, and transmits unit cost and concatenation cost of each candidate segment to the optimum segment search unit 14 (step A7).
The concatenation cost calculation unit 13 is supplied with the unit cost of each segment from the candidate selection unit 22, together with the candidate segment information. However, the concatenation cost calculation unit 13 does not use the unit cost of each segment in the calculation of the concatenation cost.
The optimum segment search unit 14 obtains a speech segment sequence (optimum segment sequence) for which a weighted sum of the unit cost and the concatenation cost is smallest, based on candidate segment information supplied from the concatenation cost calculation unit 13, the unit cost, and the concatenation cost, and transmits the result to the prosody control unit 18 (step A8).
The optimum segment sequence may be searched for by calculating a weighted sum of the unit cost and the concatenation cost, for combinations of all the speech segments. It is also possible to make the search efficient by using dynamic programming.
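A search by dynamic programming can be sketched in Python as follows. This is a generic Viterbi-style search under the assumption that the concatenation cost depends only on adjacent segments; the function name, the data layout, and the weights are hypothetical and are not taken from the embodiment.

```python
def search_optimum_sequence(unit_costs, concat_cost, w_unit=1.0, w_concat=1.0):
    """Viterbi-style search for the lowest-cost segment sequence.

    unit_costs: list (one entry per synthesis unit) of {segment_id: unit_cost}.
    concat_cost: function (prev_segment_id, segment_id) -> concatenation cost.
    Returns (total_cost, [segment_id, ...]).
    """
    # best[s] = (accumulated cost, path) of the best sequence ending in s
    best = {s: (w_unit * c, [s]) for s, c in unit_costs[0].items()}
    for costs in unit_costs[1:]:
        step = {}
        for s, c in costs.items():
            prev_cost, prev_path = min(
                ((pc + w_concat * concat_cost(p, s), path)
                 for p, (pc, path) in best.items()),
                key=lambda item: item[0],
            )
            step[s] = (prev_cost + w_unit * c, prev_path + [s])
        best = step
    return min(best.values(), key=lambda item: item[0])
```

Compared with evaluating every combination, this keeps only the best partial sequence ending in each candidate, so the work grows with the number of candidate pairs between adjacent synthesis units rather than with the number of complete sequences.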
In the present exemplary embodiment, in a case in which the selection criterion is determined in advance in the candidate selection unit 22, or in a case of the selection criterion being input from outside the speech synthesizing apparatus, that is, a case where calculation from the prosody change amount is unnecessary, the selection criterion calculation unit 21 is unnecessary. In this case, it is possible to reduce the calculation amount necessary for calculating the selection criterion.
According to the speech synthesizing apparatus of the present exemplary embodiment, the prosody change amount of candidate segments is calculated, and, based on a selection criterion obtained from this prosody change amount, by excluding speech segments having a particularly small prosody change amount, relatively, from the candidates, the variance of the prosody change amount of the speech segments, for which the possibility of being selected is high, is decreased.
As a result, since the prosody change amount is made uniform, level of deterioration of sound quality due to prosody control is made uniform, and it is possible to eliminate a sense of non-uniformity of the sound quality.
<Exemplary Embodiment 2>
(A) The candidate selection unit 22 is replaced by a candidate selection unit 30.
(B) The prosody change amount calculation unit 20 is replaced by a prosody change amount calculation unit 31.
(C) A decision unit 33 is newly provided.
(D) Instead of the selection criterion calculation unit 21, a selection criterion calculation unit 32 is provided.
(E) In
(F) Furthermore, in
Otherwise, the present exemplary embodiment is the same as the first exemplary embodiment of
The prosody change amount calculation unit 31 calculates the prosody change amount of each candidate segment based on:
optimum segments output from the optimum segment search unit 14,
prosody information supplied by the prosody generation unit 11, and
attribute information of each optimum segment supplied by the speech segment information storage unit 15, and
transmits a result to the selection criterion calculation unit 32 and the decision unit 33 (step B1).
In the present exemplary embodiment, the prosody change amount calculation unit 31 only calculates the prosody change amount of the optimum segments, not the candidate segments. This point is different from the prosody change amount calculation unit 20 of the first exemplary embodiment.
With regard to the method of calculating the prosody change amount, a method is used that is completely the same as the method used by the prosody change amount calculation unit 20 of the first exemplary embodiment.
The selection criterion calculation unit 32 calculates a selection criterion necessary for distinguishing the existence of a segment whose prosody change amount is particularly small, based on the prosody change amount of every segment supplied by the prosody change amount calculation unit 31, and the selection criterion calculation unit 32 supplies the calculated selection criterion to the decision unit 33 (step B2).
The decision unit 33 decides whether or not there exists a segment whose prosody change amount is particularly small in comparison to others, among the optimum segments.
In the present embodiment, the target of the prosody change amount used in the calculation of the selection criterion value is uniquely determined as an optimum segment. This point is different from the selection criterion calculation unit 21 of the first exemplary embodiment.
The method of calculating the selection criterion otherwise is completely the same as the method used by the selection criterion calculation unit 21 of the first exemplary embodiment.
In the present exemplary embodiment, in calculating the selection criterion, the prosody change amount of the optimum segments, selected from among the candidate segments, is used, but, similarly to the first exemplary embodiment, the prosody change amount of the candidate segments may be used. In this case, the selection criterion calculation unit 32 calculates the prosody change amount of the candidate segments, not the optimum segments.
The decision unit 33 decides whether or not there exists a segment whose prosody change amount is particularly small in comparison to others, based on
an optimum segment supplied by the optimum segment search unit 14,
the prosody change amount of each segment supplied by the prosody change amount calculation unit 31, and
the selection criterion supplied by the selection criterion calculation unit 32 (step B3).
The decision unit 33, when it has decided that there exists a segment whose prosody change amount is particularly small in comparison to others, transmits the segment whose prosody change amount is particularly small to the candidate selection unit 30. The decision unit 33, when it is decided that there does not exist a segment whose prosody change amount is particularly small in comparison to others, transmits an optimum segment to the prosody control unit 18.
However, since there is no guarantee that an optimum segment that clears the selection criterion (judged not to exist) is supplied by the optimum segment search unit 14, it is necessary to set an upper limit to the number of times search is repeated.
Therefore, the number of times the search is repeated is recorded, and in a case where the number of times the search is repeated exceeds a prescribed upper limiting value, the optimum segment is transmitted to the prosody control unit 18 (step B4).
The decision method is the same as the method of excluding segments from the selection candidates, in the candidate selection unit 22 of the first exemplary embodiment. That is, if there exists a segment whose prosody change amount is much less than a decision criterion, it is decided that there exists a segment whose prosody change amount is particularly small.
The candidate selection unit 30 excludes one or more segments supplied by the decision unit 33 from among candidate segments supplied by the concatenation cost calculation unit 13, and transmits candidate segments that have not been excluded, and the unit cost and concatenation cost thereof to the optimum segment search unit 14 (step B5).
When there is no segment supplied from the decision unit 33, that is, before the decision unit 33 operates, since there exist no segments to be excluded, output of the concatenation cost calculation unit 13 is transmitted as it is, to the optimum segment search unit 14.
According to the present exemplary embodiment, after selection of the optimum segments, a segment whose prosody change amount is particularly small in comparison to others is detected, the detected segment is excluded from the candidates, and the search is performed again.
Therefore, if completion is possible with search repeated a small number of times, the number of segments that are targets of the prosody change amount calculation is small in comparison to the first exemplary embodiment. That is, with a calculation amount less than the first exemplary embodiment, it is possible to exclude segments whose prosody change amount is small in comparison to others.
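The repeated search of the present exemplary embodiment (steps B1 to B5) can be sketched in Python as follows. The callables `search` and `find_small` stand in for the optimum segment search unit 14 and the decision unit 33, respectively; these names, the data layout, and the guard that keeps at least one candidate in every synthesis unit are assumptions for illustration.

```python
def iterative_search(candidates, search, find_small, max_iterations=5):
    """Repeat the search until no particularly small change amount remains.

    candidates: list (one per synthesis unit) of sets of segment ids.
    search: function(candidates) -> list of chosen segment ids (one per unit).
    find_small: function(chosen) -> list of unit indices whose chosen segment
        has a particularly small prosody change amount (empty list if none).
    max_iterations: upper limit on the number of repeated searches.
    """
    candidates = [set(c) for c in candidates]
    chosen = search(candidates)
    for _ in range(max_iterations):
        # ASSUMED guard: never remove the last remaining candidate of a unit
        flagged = [i for i in find_small(chosen) if len(candidates[i]) > 1]
        if not flagged:
            break
        for i in flagged:
            candidates[i].discard(chosen[i])   # exclude, then search again
        chosen = search(candidates)
    return chosen
```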
<Exemplary Embodiment 3>
The unit cost correction unit 40 corrects unit cost of a candidate segment whose prosody change amount is small in comparison to other segments, based on
a selection criterion supplied by a selection criterion calculation unit 21,
the prosody change amount of the candidate segments supplied by a prosody change amount calculation unit 20,
respective candidate segment information supplied by a unit cost calculation unit 12, and
unit costs thereof.
The unit cost correction unit 40 transmits candidate segments and unit cost thereof to a concatenation cost calculation unit 13 (step C1).
A principal difference from the candidate selection unit 22 of the first exemplary embodiment is that, rather than being completely excluded, such candidate segments are retained, but a value referred to as a “penalty” is added to their unit cost so that they are less likely to be selected as an optimum segment in the optimum segment search unit 14.
In the first exemplary embodiment, in a case where it is difficult to appropriately set the value of the threshold θ and the calculation formula of η in the candidate selection unit 22, it is not possible to appropriately exclude the candidate segments.
In particular, if there exists a candidate segment whose prosody change amount is sufficiently close to the threshold θ but does not satisfy an exclusion criterion, there is a possibility that the candidate segment is selected as an optimum segment and an adverse effect is exerted on making the prosody change amount uniform.
If a penalty is added in accordance with size of ratio or difference between the prosody change amount and the selection criterion value of each segment, a candidate segment whose prosody change amount is sufficiently close to the threshold θ but does not satisfy an exclusion criterion in the first exemplary embodiment, can be expected to be not selected as an optimum segment in the present exemplary embodiment.
As a method of calculating the penalty, a method is effective in which the difference between the prosody change amount and the selection criterion value of each segment is calculated, and using a nonlinear function as shown in
That is,
if the unit cost before correction of a certain segment is C(i,j),
the prosody change amount is Δp(i,j), and
the selection criterion is L(i),
the unit cost after correction
C̃(i,j)
is given by the following Expression (12).
C̃(i,j)=C(i,j)+g(L(i)−Δp(i,j)) (12)
In this regard, in a case where x is input to g(•), with the nonlinear function shown in
In this regard, a1, a2, and b1 are positive real numbers, and
0<a1≦a2, 0<b1 (14)
is satisfied.
A condition required by the nonlinear function g(x) in the above Expression (12) is that if x becomes large, g(x) does not become small (non-decreasing). Besides Expression (13), it is possible to use a linear function that satisfies this condition, a high degree polynomial, or an arbitrary function that includes weighted addition.
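The unit cost correction of Expression (12) can be sketched in Python as follows. The piecewise-linear function used for g(x) here is only an assumed example that satisfies the non-decreasing condition and the constraints 0<a1≦a2 and 0<b1; it is not necessarily identical to Expression (13). The function names and default parameter values are hypothetical.

```python
def penalty(x, a1=1.0, a2=2.0, b1=0.5):
    """ASSUMED piecewise-linear, non-decreasing g(x) with 0 < a1 <= a2, 0 < b1.

    No penalty for x <= 0, slope a1 up to x = b1, slope a2 beyond it.
    """
    if x <= 0.0:
        return 0.0
    if x <= b1:
        return a1 * x
    return a1 * b1 + a2 * (x - b1)

def corrected_unit_cost(unit_cost, criterion, prosody_change):
    """Expression (12): add g(L(i) - delta_p(i, j)) to the unit cost."""
    return unit_cost + penalty(criterion - prosody_change)
```

Under this assumed g(x), a segment whose prosody change amount is at or above the selection criterion keeps its unit cost, while the penalty grows as the prosody change amount falls further below the criterion.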
A method using Expression (12) is effective in a case where the prosody change amount is defined based on a difference, but in a case where the prosody change amount is defined based on a ratio, a method of calculating based on a ratio of the prosody change amount of each segment and a selection criterion value is effective.
In the case of using the ratio, if
the unit cost before correction of a certain segment is C(i,j),
the prosody change amount is Δp(i,j), and
the selection criterion is L(i),
the unit cost after correction
C̃(i,j)
is given by the following Expression (15).
In this regard, in a case where x is input to h(•), with the nonlinear function shown in
In this regard, a3, a4, and b2 are positive real numbers, and
0<a3≦a4, 1.0<b2 (17)
is satisfied.
A condition similar to g(x) is also required in h(x).
In Expression (12), the penalty is given by a sum, but in Expression (15), the penalty is given by a product. As a result, a lower limiting value of the function h(x) is 1.0.
According to the present exemplary embodiment, by adding the penalty calculated based on the difference of the selection criterion value and the prosody change amount of each segment to the unit cost, the selection of the candidate segment as an optimum segment is made difficult in the optimum segment search unit 14.
As a result, a candidate segment whose prosody change amount is sufficiently close to the threshold θ but does not satisfy an exclusion criterion, and which would therefore be selected in an optimum segment sequence in the first exemplary embodiment, is not selected as an optimum segment in the present exemplary embodiment.
As a result, making the prosody change amount uniform is facilitated, and the sense of non-uniformity of sound quality is reduced.
Furthermore, since optimum segments are not completely excluded from selection candidates, a segment that is a target for exclusion in the first exemplary embodiment may be selected in accordance with another selection criterion.
As a result, there is a possibility that the sound quality is improved in comparison to a case of complete exclusion.
The exemplified embodiments and the examples may be changed and adjusted in the scope of all disclosures (including claims) of the present invention and based on the basic technological concept thereof. In the scope of the claims of the present invention, various disclosed elements may be combined and selected in a variety of ways. That is, it is to be understood that modifications and changes that may be made by those skilled in the art according to all disclosures, including the claims, and technological concepts are included.
Number | Date | Country | Kind |
---|---|---|---|
2007-039622 | Feb 2007 | JP | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2008/052574 | 2/15/2008 | WO | 00 | 8/19/2009 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2008/102710 | 8/28/2008 | WO | A |
Number | Date | Country | |
---|---|---|---|
20100076768 A1 | Mar 2010 | US |