This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2010-084319, filed on Mar. 31, 2010; the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to synthesis of speech.
In recent years, speech synthesizers have been proposed that allow a user to correct the intermediate output of the synthesizer and then create synthetic speech from the corrected intermediate output. JP-A 2006-313176 (KOKAI) discloses a technology in which, when a user issues instructions to replace a speech segment constituting synthetic speech, a speech synthesizer adds the speech segment to a disabled speech segment list. The speech synthesizer carries out speech synthesis by referring to the disabled speech segment list to exclude speech segments recorded in the disabled speech segment list from the speech synthesis.
However, according to the technology of JP-A 2006-313176 (KOKAI), it is very difficult for the user to precisely specify the speech segment causing quality degradation of synthetic speech, and speech segments in its vicinity are frequently specified instead. Thus, a technology that effectively disables speech segments causing quality degradation is demanded.
In general, according to one embodiment, a speech synthesizer includes a generation unit that selects speech segments for respective synthesis units to generate a speech segment sequence, which is a sequence of the speech segments; a speech connection unit that synthesizes speech by connecting the speech segments of the speech segment sequence generated by the generation unit; and a prohibition unit that disables, if a speech segment of a first speech segment sequence synthesized by the speech connection unit is different from a speech segment of a second speech segment sequence, which is synthesized by the speech connection unit and has the same synthesis unit as the first speech segment sequence, the speech segment of the first speech segment sequence that is different from the speech segment of the second speech segment sequence.
Exemplary embodiments of a speech synthesizer will be described below with reference to the appended drawings.
Each synthesis unit has a phoneme symbol, prosodic information, and language information about text containing a section corresponding thereto. The synthetic speech is represented by a speech segment sequence. The prosodic information contains, for example, the fundamental frequency, phoneme duration, Mel-Cepstral Coefficients, and power. The language information contains, for example, words, the number of syllables in a word, word corresponding to each synthesis unit, position of each synthesis unit in a word measured in a syllable, and flag indicating whether a syllable in which each synthesis unit is contained is a stressed one or not.
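The information held by a synthesis unit can be sketched as plain data records. The following is a minimal illustration only: the class names, field names, and physical units (Hz, ms) are assumptions introduced here and do not appear in the embodiment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProsodicInfo:
    # Field names and units are hypothetical; the text only lists the kinds of values.
    fundamental_frequency_hz: float
    phoneme_duration_ms: float
    power: float
    mel_cepstral_coefficients: List[float] = field(default_factory=list)


@dataclass
class LanguageInfo:
    word: str                 # word containing the synthesis unit
    syllables_in_word: int    # number of syllables in that word
    syllable_position: int    # position of the unit in the word, counted in syllables
    stressed: bool            # whether the containing syllable is stressed


@dataclass
class SynthesisUnit:
    phoneme: str              # phoneme symbol
    prosody: ProsodicInfo
    language: LanguageInfo


@dataclass
class SpeechSegment:
    segment_id: str           # identifier specific to the speech segment
    phoneme: str
    prosody: ProsodicInfo
    language: LanguageInfo
```

A speech segment is modeled here with the same kinds of information as a synthesis unit so that the two can be compared during selection.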
The operation of the speech synthesizer 10 will be described with reference to
In step S301, the acquisition unit 11 acquires text data intended for speech synthesis from inside or outside the speech synthesizer 10.
In step S302, the language processing unit 12 divides the text data acquired by the acquisition unit 11 into morphemes by performing morphological analysis on the text data. This step may be omitted for languages that are not agglutinative.
In step S303, the language processing unit 12 performs syntax analysis on a sequence of divided morphemes to assign attribute values such as reading information, the part of speech, conjugation, and dependency between morphemes to each morpheme.
In step S304, the language processing unit 12 adds attribute values regarding the prosody such as a phoneme symbol string, position of stressed syllables and their strength to each morpheme of the sequence of morphemes having the attribute values assigned in step S303 based on the assigned attribute values.
In step S305, the prosody processing unit 13 generates, for each synthesis unit, prosodic information that serves as the target of the synthetic speech, based on the attribute values assigned and added to each morpheme in steps S303 and S304. It thereby generates a synthesis unit sequence constituted by a plurality of synthesis units, each having a phoneme symbol, prosodic information, and language information. The present embodiment is described by taking as an example a case in which a phoneme is the synthesis unit, but the present invention is not limited to this.
In step S306, the speech synthesis unit 14 generates synthetic speech from the synthesis unit sequence generated in step S305. If a database used for analysis or acquisition of necessary data is needed in steps S301 to S304, such a database may be provided.
Next, the operation of the speech synthesis unit 14 will be described with reference to
In step S401, the generation unit 141 generates a speech segment sequence by selecting, for each synthesis unit of the synthesis unit sequence generated in step S305, an optimal speech segment from those stored in the candidate segment storage unit 140. For each synthesis unit of the partial sequence of synthesis units specified by the specifying unit 144, speech segments that the prohibition unit 146 has decided to disable are not selected.
In step S402, the speech connection unit 142 synthesizes speech by using the speech segment sequence generated in step S401.
In step S403, the output unit 143 reproduces the synthetic speech generated in step S402. Next, the specifying unit 144 presents information to enable the user to specify sites where quality of synthetic speech is insufficient.
In step S404, the specifying unit 144 accepts a pass/fail result indicating whether quality of synthetic speech is acceptable or insufficient through input from the user.
In step S405, the specifying unit 144 branches off processing depending on the pass/fail result input by the user in step S404. If quality thereof is acceptable (“pass” in step S405), the processing proceeds to step S409. If quality thereof is insufficient (“fail” in step S405), the processing proceeds to step S406.
In step S406, the specifying unit 144 allows the user to specify degraded sites through input from the user.
In step S407, the specifying unit 144 decides candidates of speech segments to be disabled. More specifically, the specifying unit 144 determines a partial sequence of synthesis units corresponding to sites specified in step S406 and a partial sequence of speech segments selected from the partial sequence of the synthesis units.
In step S408, the prohibition unit 146 decides, for each synthesis unit of the partial sequence of synthesis units determined in step S407, speech segments to be disabled based on information recorded in the change segment history storage unit 145.
In step S409, the prohibition unit 146 compares the previous speech segment sequence selected in step S401 for the same sentence with the current speech segment sequence. The prohibition unit 146 then records identifiers specific to the replaced speech segments in the change segment history storage unit 145.
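The control flow of steps S401 to S409 can be sketched as a loop. This is a rough illustration under stated assumptions: the function name, the callback parameters (`select`, `judge`, `locate`, `decide_disabled`, `record_history`), and the `max_rounds` limit are all hypothetical, and speech connection and playback (steps S402 and S403) are omitted.

```python
def improvement_loop(units, select, judge, locate, decide_disabled,
                     record_history, max_rounds=10):
    """Repeat selection until the user accepts the result (hypothetical sketch)."""
    disabled = {}      # synthesis unit index -> set of disabled segment ids
    previous = None    # sequence from the preceding round, if any
    sequence = []
    for _ in range(max_rounds):
        sequence = select(units, disabled)             # step S401
        if judge(sequence):                            # steps S404 and S405: "pass"
            if previous is not None:
                record_history(previous, sequence)     # step S409
            return sequence
        sites = locate(sequence)                       # steps S406 and S407
        for idx, ids in decide_disabled(sites, sequence).items():  # step S408
            disabled.setdefault(idx, set()).update(ids)
        previous = sequence
    return sequence                                    # give up after max_rounds
```

Passing the individual steps in as callbacks keeps the loop independent of how segments are selected or how the user is asked.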
Details of step S401 in
In step S501, the generation unit 141 checks, for each synthesis unit, whether the prohibition unit 146 has decided on any speech segment to be disabled. If there is any speech segment to be disabled (“YES” in step S501), the processing proceeds to step S502; if there is none (“NO” in step S501), the processing proceeds to step S503.
In step S502, the generation unit 141 excludes disabled speech segments to narrow down candidates of speech segments for each synthesis unit in advance.
In step S503, the generation unit 141 reads speech segments appropriate for the synthesis unit from the candidate segment storage unit 140 to preliminarily select a predetermined number of speech segments by comparing phoneme information, prosodic information, and language information held by the synthesis unit and the same kinds of information held by each speech segment. The processing of steps S501 to S503 is performed for all synthesis units. A conventional method may be used as the comparison method in step S503 with necessary information being supplied when needed.
In step S504, the generation unit 141 actually selects one speech segment for each synthesis unit from a plurality of speech segments selected for each synthesis unit in consideration of the degree of appropriateness of connection between each speech segment of adjacent synthesis units and a difference between a target value of the information calculated in step S503 and held by each synthesis unit and a value of the same kind of information held by each speech segment. A conventional method may be used as the method of calculating appropriateness of connection in step S504 with necessary information being supplied when needed.
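Steps S501 to S504 can be sketched as follows. The embodiment leaves the comparison method and the connection measure to conventional techniques, so the dynamic-programming search, the cost callbacks, and all names below (`select_segments`, `target_cost`, `join_cost`, `n_best`) are assumptions made for illustration. Segments are represented as tuples whose first element is the segment identifier.

```python
def select_segments(targets, candidates, disabled, target_cost, join_cost, n_best=3):
    """Pick one segment id per synthesis unit (hypothetical sketch).

    targets[i]    -- target value for synthesis unit i
    candidates[i] -- list of segments (tuples, id first) available for unit i
    disabled      -- dict: unit index -> set of disabled segment ids
    """
    if not targets:
        return []

    # Steps S501 to S503: exclude disabled segments, then preselect the
    # n_best segments closest to the target of each synthesis unit.
    shortlists = []
    for i, target in enumerate(targets):
        banned = disabled.get(i, set())
        allowed = [s for s in candidates[i] if s[0] not in banned]
        if not allowed:
            raise ValueError(f"no usable speech segment left for unit {i}")
        allowed.sort(key=lambda s: target_cost(target, s))
        shortlists.append(allowed[:n_best])

    # Step S504: choose one segment per unit so that the sum of target
    # costs and connection costs between adjacent segments is minimal.
    best = [(target_cost(targets[0], s), [s]) for s in shortlists[0]]
    for i in range(1, len(targets)):
        step = []
        for s in shortlists[i]:
            cost, path = min(
                ((c + join_cost(p[-1], s), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            step.append((cost + target_cost(targets[i], s), path + [s]))
        best = step
    return [s[0] for s in min(best, key=lambda cp: cp[0])[1]]
```

Because connection costs couple adjacent units, disabling one segment can change the segments chosen for neighboring units as well.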
Details of step S408 in
The prohibition unit 146 performs step S601 and step S602 below for each speech segment of the speech segment sequence determined in step S407.
In step S601, the prohibition unit 146 checks whether any speech segment is recorded in the change segment history storage unit 145 before branching off processing. If no speech segment is recorded (“NO” in step S601), the processing proceeds to step S603. If any speech segment is recorded (“YES” in step S601), the processing proceeds to step S602.
In step S602, the prohibition unit 146 stores such speech segments as speech segments (disabled speech segments) not to be used in the synthesis unit. When the above processing is completed for all speech segments, the processing moves to step S603.
In step S603, the prohibition unit 146 branches off processing depending on whether any disabled speech segment is recorded. If any disabled speech segment is recorded (“YES” in step S603), the processing moves to the next processing (step S401 in
In step S604, the specifying unit 144 requests the user to select at least one speech segment to be disabled from the speech segment sequence determined in step S407 of
In step S605, the prohibition unit 146 stores, like step S602, such a speech segment selected as a speech segment (disabled speech segment) not to be used. Speech segments recorded as speech segments (disabled speech segments) not to be used in step S602 or step S605 in this manner are referred to in step S501 of
The operation of the speech synthesis unit 14 of a speech synthesizer according to the first embodiment will be described in detail with reference to
In step S406, as illustrated in
In step S407, as illustrated in
Next, in step S601, the prohibition unit 146 refers to the change segment history storage unit 145 in a state (initial state) in which nothing is recorded, which yields “NO” in step S601 and the processing proceeds to step S603. Since there is no disabled segment here, the processing proceeds to step S604.
In step S604, as illustrated in
In step S605, as illustrated in
Next, after returning to step S401, the speech synthesis unit 14 creates synthetic speech again.
First, in step S501, the generation unit 141 proceeds to step S502 because the speech segment E is recorded as a disabled speech segment (“YES” in step S501) for the synthesis unit /u/ corresponding to the vowel of the syllable “c” of the word 111 (“bag” in English).
In step S502, the generation unit 141 excludes the speech segment E from the targets to be preliminarily selected (step S503) for the synthesis unit.
In step S503, the generation unit 141 performs preliminary selection.
As a result of performing step S501 to step S503 for each synthesis unit, in contrast to the last synthetic speech creation, subsequent processing proceeds and synthetic speech is presented to the user without the speech segment E being selected for the synthesis unit /u/ corresponding to the vowel of the syllable “c” of the word 111 (“bag” in English).
Next, a case where the user finds quality thereof acceptable in step S404 and the speech synthesis unit 14 moves the processing to step S409 will be described.
In step S409, as illustrated in
It is assumed that
In the present embodiment, even if the user cannot identify a speech segment causing quality degradation, replaced speech segments are all recorded when the user recognizes quality improvement. Thus, recorded speech segments contain a defective speech segment that caused quality degradation. By referring to records thereof, it becomes possible to prevent the same defective speech segment from being selected in synthetic speech for other text.
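The recording performed in step S409 can be sketched as a position-by-position comparison of the two sequences. This is a minimal sketch: the function names are hypothetical, and the history is assumed to be a plain set of segment identifiers.

```python
def replaced_segments(previous, current):
    """Return ids from `previous` that differ from `current` at the same unit."""
    if len(previous) != len(current):
        raise ValueError("sequences must cover the same synthesis units")
    return [old for old, new in zip(previous, current) if old != new]


def record_history(history, previous, current):
    """Add every replaced segment id to the history set and return it."""
    history.update(replaced_segments(previous, current))
    return history
```

All replaced segments are recorded, not only the one the user pointed at, which is why the defective segment ends up in the history even when the user could not identify it.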
A concrete example of the method of using the above history will be described with reference to
In step S407, the specifying unit 144 decides candidates of speech segments to be disabled. More specifically, as illustrated in
In step S601, the prohibition unit 146 checks whether any speech segment is recorded in the change segment history storage unit 145 in a state of
In step S602, the prohibition unit 146 stores the speech segment D selected for the consonant /g/ of the syllable “c” as illustrated in
Hereinafter, as shown in
If the user finds the quality of the synthetic speech created and presented in this manner acceptable (step S405), the prohibition unit 146 adds, of the replaced speech segments D and L, the speech segment L, which is not yet recorded, to the change segment history storage unit 145 (step S409), which looks as illustrated in
Thus, according to the present embodiment, speech segments replaced when the user recognizes quality improvement are all recorded and thus, a defective speech segment that caused quality degradation is always contained in the history thereof. Therefore, even if the user cannot identify the defective speech segment that caused degradation in previous improvement work of synthetic speech, quality degradation caused by the same speech segment as before can be avoided without the need for the user to identify the cause (speech segment) thereof again with a precision of the synthesis unit.
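The decision made in step S408 of the first embodiment (steps S601 to S605) can be sketched as below. The function name, the dictionary representation of the degraded site, and the `ask_user` callback are assumptions for illustration.

```python
def decide_disabled_first(site, history, ask_user):
    """Decide disabled segments for a degraded site (hypothetical sketch).

    site     -- dict: synthesis unit index -> id of the segment selected there
    history  -- set of segment ids recorded in the change segment history
    ask_user -- callback returning the unit indexes the user chose to disable
    """
    # Steps S601 and S602: segments found in the history become disabled.
    disabled = {idx: seg for idx, seg in site.items() if seg in history}
    if disabled:                      # step S603: "YES"
        return disabled
    chosen = ask_user(site)           # step S604: fall back to the user
    return {idx: site[idx] for idx in chosen}   # step S605
```

For example, with `site = {3: "D", 4: "E"}` and a history containing only `"D"`, the user is not consulted and the segment at index 3 is disabled.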
The second embodiment will be described. The description here centers on processing that is different from that in the first embodiment and similar processing is omitted when appropriate.
In the present embodiment, the change segment history storage unit 145 has, in addition to the identifier specific to a speech segment shown in the first embodiment, the count (change count) of replacement before and after the user recognizes quality improvement recorded therein by being associated with each speech segment. Because accompanying information such as the change count is recorded and updated, processing content in step S409 (
The prohibition unit 146 performs step S2001 and step S2002 below for each speech segment of the speech segment sequence determined in step S407.
In step S2001, the prohibition unit 146 checks whether any speech segment is recorded in the change segment history storage unit 145 before branching off processing. If no speech segment is recorded (“NO” in step S2001), the processing proceeds to step S2003. If any speech segment is recorded (“YES” in step S2001), the processing proceeds to step S2002.
In step S2002, the prohibition unit 146 stores such speech segments as candidates of speech segments (disabled speech segments) not to be used in the synthesis unit. When the above processing is completed for all speech segments, the processing proceeds to step S2003.
In step S2003, the prohibition unit 146 branches off processing depending on whether any candidate of disabled speech segment is recorded. If any candidate of disabled speech segment is recorded (“YES” in step S2003), the processing moves to step S2006. If no candidate of disabled speech segment is recorded (“NO” in step S2003), the processing proceeds to step S2004.
In step S2004, like in the first embodiment, the specifying unit 144 requests the user to select from the speech segment sequence determined in step S407 of
In step S2005, the prohibition unit 146 stores such a speech segment disabled by the user in step S2004 as a disabled speech segment.
In step S2006, the prohibition unit 146 selects, from the candidates stored in step S2002, the candidate with the maximum change count recorded in the change segment history storage unit 145 and records the candidate as a speech segment (disabled speech segment) not to be used in the synthesis unit thereof. The change count of a candidate that is not recorded in the change segment history storage unit 145 may be treated as 0. If a plurality of candidates with the maximum change count are present, such candidates may all be recorded, or a candidate may be selected from such candidates by using another criterion such as the head of a list.
Disabled speech segments recorded in step S2005 and step S2006 in this manner are referred to in step S501 of
A concrete example of the change segment history storage unit 145 and the prohibition unit 146 will be described with reference to
In step S407, as illustrated in
In step S2001, the prohibition unit 146 checks whether any speech segment is recorded in the change segment history storage unit 145 before branching off processing. If no speech segment is recorded (“NO” in step S2001), the processing proceeds to step S2003. If any speech segment is recorded (“YES” in step S2001), the processing proceeds to step S2002.
In step S2002, the prohibition unit 146 refers to, for example, the change segment history storage unit 145 in the state of
In step S2003, the prohibition unit 146 proceeds to step S2006 because candidates of disabled speech segments are recorded (“YES” in step S2003). Incidentally, if no candidate of disabled speech segment is recorded (“NO” in step S2003), the processing proceeds to step S2004.
Step S2004 and step S2005 are the same as step S604 and step S605 in
In step S2006, the prohibition unit 146 refers to the change segment history storage unit 145 in the state of
Hereinafter, synthetic speech is created, like in
Thus, according to a speech synthesizer in the second embodiment, speech segments replaced when the user recognizes quality improvement are all recorded, and the count of improvements due to replacement of each speech segment is recorded as accompanying information. A speech segment whose count of quality improvement due to non-use thereof is large is preferentially disabled. Accordingly, the accuracy with which the use of a speech segment causing quality degradation common in many synthetic speeches is avoided can be increased.
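The change count handling of the second embodiment can be sketched with a counter keyed by segment identifier. The function names are hypothetical, and incrementing the count once per accepted replacement is an assumption about how the change count is updated. Ties are resolved here by keeping all candidates with the maximum count, which is one of the options the embodiment allows.

```python
from collections import Counter


def record_with_counts(history, previous, current):
    """Step S409 variant: count each replaced segment (hypothetical sketch)."""
    for old, new in zip(previous, current):
        if old != new:
            history[old] += 1


def pick_most_changed(site, history):
    """Steps S2001 to S2006: disable the candidates with the maximum change count.

    site    -- dict: synthesis unit index -> id of the segment selected there
    history -- Counter: segment id -> change count
    """
    # Steps S2001 and S2002: segments present in the history are candidates.
    candidates = {idx: seg for idx, seg in site.items() if seg in history}
    if not candidates:
        return {}   # step S2003 "NO": the caller asks the user (S2004, S2005)
    top = max(history[seg] for seg in candidates.values())
    # Step S2006: keep every candidate that reaches the maximum count.
    return {idx: seg for idx, seg in candidates.items() if history[seg] == top}
```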
The third embodiment will be described. The description here centers on processing that is different from that in the first embodiment and similar processing is omitted when appropriate.
In the present embodiment, the change segment history storage unit 145 has, in addition to the identifier specific to a speech segment shown in the first embodiment, information about a phonemic environment in which the speech segment is used recorded therein by being associated with each speech segment. Because accompanying information such as the information about the phonemic environment is recorded/updated, processing content in step S409 (
The prohibition unit 146 performs step S2701 and step S2702 below for each speech segment of the speech segment sequence determined in step S407.
In step S2701, the prohibition unit 146 checks whether any speech segment is recorded in the change segment history storage unit 145 before branching off processing. If no speech segment is recorded (“NO” in step S2701), the processing proceeds to step S2703. If any speech segment is recorded (“YES” in step S2701), the processing proceeds to step S2702.
In step S2702, the prohibition unit 146 records such speech segments as candidates of speech segments (disabled speech segments) not to be used in the synthesis unit. When the above processing is completed for all speech segments (“NO” in step S2701), the processing proceeds to step S2703.
In step S2703, the prohibition unit 146 branches off processing depending on whether any candidate of disabled speech segment is recorded in step S2702. If any candidate of disabled speech segment is recorded (“YES” in step S2703), the processing moves to step S2706. If no candidate of disabled speech segment is recorded (“NO” in step S2703), the processing proceeds to step S2704.
Step S2704 and step S2705 are the same as step S2004 and step S2005 in
In step S2706, the prohibition unit 146 selects from candidates recorded in step S2702 a candidate whose information about the phonemic environment of each candidate recorded in the change segment history storage unit 145 matches the phoneme of each synthesis unit and adjacent synthesis units thereof and records the candidate as a speech segment (disabled speech segment) not to be used in the synthesis unit. In the present embodiment, the range of synthesis units where the phonemes are compared is set to be a synthesis unit and adjacent synthesis units thereof, but phonemes of a wider range may be considered and compared. Candidates that are not recorded in the change segment history storage unit 145 are treated as not having a matching phonemic environment and are not recorded. If there is a plurality of candidates having matching phonemic environment information, all such candidates may be recorded or a candidate may be selected from such candidates by using another criterion such as the head of a list.
In step S2707, the prohibition unit 146 branches off processing depending on whether any disabled speech segment is recorded in step S2706. If any disabled speech segment is recorded (“YES” in step S2707), the prohibition unit 146 terminates the processing described in the flow chart before proceeding to step S401 in
A concrete example of the change segment history storage unit 145 and the prohibition unit 146 will be described with reference to
In step S407, as illustrated in
In step S2701, the prohibition unit 146 checks whether any speech segment is recorded in the change segment history storage unit 145 before branching off processing. If no speech segment is recorded (“NO” in step S2701), the processing proceeds to step S2703. If any speech segment is recorded (“YES” in step S2701), the processing proceeds to step S2702.
In step S2702, the prohibition unit 146 refers to, for example, the change segment history storage unit 145 in the state of
In step S2703, the prohibition unit 146 proceeds to step S2706 because candidates of disabled speech segments are recorded (“YES” in step S2703). Incidentally, if no candidate of disabled speech segment is recorded (“NO” in step S2703), the processing proceeds to step S2704.
In step S2704, the specifying unit 144 displays the speech segment sequence used by the degraded site to cause the user to select the synthesis unit.
In step S2705, if the user can correctly select the speech segment in the synthesis unit /u/ corresponding to the vowel of the syllable “c” as illustrated in
In step S2706, the prohibition unit 146 refers to the change segment history storage unit 145 in the state of
In step S2707, the prohibition unit 146 proceeds to step S2704 because no disabled speech segment is recorded.
Hereinafter, through processing similar to that in the first embodiment described above, like in
Thus, according to a speech synthesizer in the third embodiment, speech segments replaced when the user recognizes quality improvement are all recorded, and information (phonemic environment) about the environment in which each speech segment is used is recorded as accompanying information. Moreover, each speech segment is disabled only if the speech segment is used in a phonemic environment indicated by the accompanying information thereof. Accordingly, a speech segment is disabled only when it is used in an inappropriate environment that could cause quality degradation, which reduces the frequency with which speech segments used appropriately in other phonemic environments are disabled.
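The phonemic environment check of step S2706 can be sketched as follows. The representation of an environment as a triple of the preceding phoneme, the phoneme itself, and the following phoneme is an assumption, as are the function names and the use of `None` at sequence boundaries.

```python
def environment(phonemes, idx):
    """Return (previous phoneme, phoneme, next phoneme) for the unit at idx."""
    left = phonemes[idx - 1] if idx > 0 else None
    right = phonemes[idx + 1] if idx + 1 < len(phonemes) else None
    return (left, phonemes[idx], right)


def pick_by_environment(site, phonemes, history):
    """Disable only candidates whose recorded environment matches (sketch).

    site     -- dict: synthesis unit index -> id of the segment selected there
    phonemes -- phoneme symbols of the whole synthesis unit sequence
    history  -- dict: segment id -> set of environments recorded for it
    """
    disabled = {}
    for idx, seg in site.items():
        # Segments absent from the history have no matching environment.
        if environment(phonemes, idx) in history.get(seg, set()):
            disabled[idx] = seg
    return disabled   # empty: the caller falls back to user selection
```

A segment recorded in the history for one environment therefore stays available when it is selected between different neighboring phonemes.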
In embodiments from the first embodiment to the third embodiment, like in steps S3401 to S3408
Incidentally, a speech synthesizer according to an embodiment can also be realized by, for example, using a general-purpose computer apparatus as system hardware. That is, each unit of such a speech synthesizer can be realized by causing a processor mounted on the computer apparatus to execute a program. In this case, a speech synthesizer may be realized by pre-installing the program on the computer apparatus or distributing the program stored in a storage medium such as CD-ROM or via a network to install the program on the computer apparatus when appropriate. A plurality of storage media holding speech segment data and whose data acquisition times are different can be realized by appropriately using a memory or hard disk added to the computer apparatus internally or externally or CD-R, CD-RW, DVD-RAM, DVD-R or the like.
According to the embodiments, speech segments causing quality degradation can effectively be disabled.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
2010-084319 | Mar 2010 | JP | national |