This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2010-073694, filed on Mar. 26, 2010; the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a method and an apparatus for editing speech, and a method for synthesizing speech.
As a conventional technique, a phrase-concatenation-based speech synthesis method is well known (for example, JP-A H07-210184 (Kokai)). In this technique, speech uttered by a person is divided into speech units (such as words, paragraphs, or phrases), and each speech unit is stored in a memory in advance. By reading these speech units and concatenating them, a plurality of sentences is output as speech.
In such a speech synthesis method, the same speech units are reused across a plurality of sentences. Accordingly, in comparison with the case where all sentences to be output are stored as speech, the quantity of data to be stored can be reduced.
However, in the above-mentioned speech synthesis method, recorded speech is divided into speech units by hand. Accordingly, speech units having high usage efficiency cannot be created.
In one embodiment, a method for editing speech is disclosed. The method can generate speech information from a text. The speech information includes phonologic information and prosody information. The method can divide the speech information into a plurality of speech units, based on at least one of the phonologic information and the prosody information. The method can search the plurality of speech units for at least two speech units. At least one of the phonologic information and the prosody information in the at least two speech units is identical or similar. In addition, the method can store a speech unit waveform corresponding to one of the at least two speech units as a representative speech unit into a memory.
Hereinafter, embodiments of the present invention will be explained by referring to the drawings. The present invention is not limited to the following embodiments.
In a speech editing apparatus 1 of the first embodiment, phonologic information, prosody information and a speech waveform are created from a text input by a user, by a text-to-speech synthesis method. The speech waveform is divided (split) into speech unit waveforms (units of the speech waveform). Among all speech unit waveforms, at least two speech unit waveforms having identical or similar waveforms are searched for, and a representative speech unit waveform (representing the at least two speech unit waveforms) is selected from them. A speech synthesis apparatus uses this representative speech unit waveform to output speech by concatenating representative speech unit waveforms.
As shown in
The input unit 11 inputs one or a plurality of texts from a user. The input unit 11 may be a keyboard or a handwriting pad. The generation unit 12 generates, by a CPU (Central Processing Unit), a speech waveform corresponding to phonologic information or prosody information of the text (or to both the phonologic information and the prosody information of the text). Moreover, via the input unit 11, the user can input a text that the user wishes to have synthesized by the phrase-concatenation-based speech synthesis method.
The speech waveform is the change of speech amplitude over time. The phonologic information is the content of speech represented by letters or signs. The prosody information represents the rhythm or intonation of speech. When a plurality of texts is input, the generation unit 12 generates the phonologic information, the prosody information and a speech waveform corresponding to each text. For example, the generation unit 12 may generate the speech waveform using a memory (not shown in Fig.) storing speech units corresponding to the phonologic information and the prosody information. The generation unit 12 may be a conventional speech synthesis apparatus that generates speech waveforms from texts.
The division unit 13 divides the speech waveform into speech unit waveforms at a predetermined time by using the speech waveform, the phonologic information and the prosody information. If a plurality of texts is input to the input unit 11, the division unit 13 divides the speech waveform corresponding to each text into speech unit waveforms.
The search unit 14 searches all speech unit waveforms acquired by the division unit 13 for speech unit waveforms having identical or similar waveforms. If a plurality of speech unit waveforms having identical or similar waveforms is found, the search unit 14 selects one of them as a representative speech unit waveform and removes the others. The search unit 14 stores the representative speech unit waveform into a storage unit 50. The representative speech unit waveform may be any one of the plurality of speech unit waveforms having identical or similar waveforms.
The generation unit 12, the division unit 13 and the search unit 14 may be realized by a CPU and a memory (used by the CPU). Hereinafter, operation of the first embodiment is explained in detail.
In
In
The generation unit 12 determines phonologic information of three texts by linguistic analysis (such as morphological analysis and semantic analysis), determines prosody information from the phonologic information, and generates speech waveforms from the phonologic information and the prosody information (S302). In
By using the phonologic information, the division unit 13 segments the speech waveform at a predetermined time, i.e., divides into speech unit waveforms (S303). In
In this case, the unvoiced plosive sound section is a speech waveform section corresponding to a phoneme of an unvoiced plosive sound (such as “k”, “t”, “p”, “ch”). The pause section is a speech waveform section corresponding to the phoneme letter “PAUSE”, which represents silence (a punctuation mark or a period) in the text. In the first embodiment, a section is a range between one arbitrary time and another arbitrary time in the speech waveform.
As shown in
In the same way, the division unit 13 divides the speech waveform 2 into six speech unit waveforms “n i i g a”, “t a h o o m e N e m u”, “k a t e i r u k a t a n i P”, “h a”, “ch i j i g e N z a i n o j y u u”, “t a i n o j yo h o o d e s”. Furthermore, the division unit 13 divides the speech waveform 3 into five speech unit waveforms “k a m a”, “t a h o o m e N e m u”, “k a t e i r u k a t a n i P”, “s i z e N j yu u”, “t a i n o j yo h o o d e s”.
In
If decision result at S304 is No, the search unit 14 leaves the speech unit waveform, and processing is forwarded to S306. If decision result at S304 is Yes, the search unit 14 selects one speech unit waveform from at least two speech unit waveforms having identical or similar waveforms, and removes other speech unit waveforms (S305). The one speech unit waveform is called a representative speech unit waveform. The representative speech unit waveform may be randomly selected from at least two speech unit waveforms having identical or similar waveforms.
For example, in
Then, as to a speech unit waveform 102 (“k a t e i r u k a t a n i P”) divided from the speech waveform 1, a speech unit waveform 105 (“k a t e i r u k a t a n i P”) divided from the speech waveform 2, and a speech unit waveform 109 (“k a t e i r u k a t a n i P”) divided from the speech waveform 3, these speech unit waveforms are decided to be identical or similar.
Furthermore, as to a speech unit waveform 103 (“t a i n o j yo h o o d e s”) divided from the speech waveform 1, a speech unit waveform 107 (“t a i n o j yo h o o d e s”) divided from the speech waveform 2, and a speech unit waveform 110 (“t a i n o j yo h o o d e s”) divided from the speech waveform 3, these speech unit waveforms are decided to be identical or similar.
Furthermore, as to a speech unit waveform 104 (“t a h o o m e N e m u”) divided from the speech waveform 2 and a speech unit waveform 108 (“t a h o o m e N e m u”) divided from the speech waveform 3, these speech unit waveforms are decided to be identical or similar.
The search unit 14 selects the speech unit waveform 101 as a first representative speech unit waveform of the speech unit waveforms 101 and 106. In the same way, the search unit 14 selects the speech unit waveform 102 as a second representative speech unit waveform of the speech unit waveforms 102, 105 and 109. Furthermore, the search unit 14 selects the speech unit waveform 103 as a third representative speech unit waveform of the speech unit waveforms 103, 107 and 110.
Among at least two speech unit waveforms having identical or similar waveforms, the search unit 14 removes (deletes) all speech unit waveforms not selected as the representative speech unit waveform. For example, the search unit 14 removes a speech unit waveform 106 not selected as the first representative speech unit waveform. In the same way, the search unit 14 removes speech unit waveforms 105 and 109 each not selected as the second representative speech unit waveform. Furthermore, the search unit 14 removes speech unit waveforms 107 and 110 each not selected as the third representative speech unit waveform.
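The selection and removal at S305 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the apparatus: the function name `select_representatives`, the placeholder group labels "A" to "D", and the use of an exact-match key in place of an actual waveform similarity measure are all hypothetical. The sketch keeps the first unit found in each group, which agrees with the example above (101, 102 and 103 are kept), although the representative may equally be selected at random. For the group of 104 and 108, for which no representative is named above, the sketch keeps 104 by the same first-found rule.

```python
def select_representatives(units, key=lambda feature: feature):
    """Keep the first unit of each identical-or-similar group; remove the rest.

    units: list of (unit_id, feature) pairs in processing order.
    key:   hypothetical function mapping a feature to a group key; units
           with equal keys are treated as "identical or similar".
    """
    seen = set()
    kept, removed = [], []
    for unit_id, feature in units:
        group = key(feature)
        if group in seen:
            removed.append(unit_id)   # not selected as a representative
        else:
            seen.add(group)
            kept.append(unit_id)      # first found: becomes the representative
    return kept, removed


# Unit numbers follow the example; the labels "A".."D" are placeholders
# standing in for waveforms that were decided to be identical or similar.
units = [(101, "A"), (102, "B"), (103, "C"), (104, "D"), (105, "B"),
         (106, "A"), (107, "C"), (108, "D"), (109, "B"), (110, "C")]
kept, removed = select_representatives(units)
```

A real similarity decision would compare waveforms with some distance measure and a tolerance; the key function is merely the simplest stand-in that reproduces the grouping of the example.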
As shown in
As mentioned above, in the first embodiment, speech units having high usage efficiency can be created, and the total data quantity of speech units to be stored can be easily reduced. Furthermore, only speech units having identical or similar waveforms are searched for among all speech units and replaced by a representative. Accordingly, degradation of sound quality can be suppressed.
Moreover, the first embodiment has been explained for the case of Japanese. However, the same processing can be performed for other languages, for example English.
As shown in
At S302, the generation unit 12 generates a speech waveform 4 corresponding to the text 4, a speech waveform 5 corresponding to the text 5, and a speech waveform 6 corresponding to the text 6. Letters described with speech waveforms 4˜6 represent phonemes. As shown in
At S303, as mentioned-above, the division unit 13 divides the speech waveform into speech unit waveforms at a predetermined time. For example, the division unit 13 divides the speech waveform 4 (represented as phoneme sequence in
In the same way, the division unit 13 divides the speech waveform 5 into seven speech unit waveforms, “t 3R n l E f”, “t A”, “tc D @ n E”, “k s”, “t I n”, “t 3R s E”, “k S @ n”. Furthermore, the division unit 13 divides the speech waveform 6 into eight speech unit waveforms, “t 3R n r aI”, “t A”, “tc D @ l n”, “t 3R s E”, “k S @ n P”, “D E n I m i d i @”, “tc l i r aI”, “t @ g E n”.
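The division of the speech waveforms 5 and 6 listed above can be reproduced at the phoneme level by a short sketch. It assumes a simplified rule, namely that a new unit begins before every unvoiced plosive phoneme and that a unit ends after every pause phoneme “P”. The function name `split_units` and the inclusion of “tc” in the plosive set are assumptions inferred from the listed units. The sketch operates on phoneme symbols only, whereas the division unit 13 divides the waveform itself at a predetermined time, and the Japanese example is not divided at every unvoiced plosive, so the actual rule may include further criteria.

```python
# Assumed set of unvoiced plosive symbols ("tc" inferred from the example).
UNVOICED_PLOSIVES = {"k", "t", "p", "ch", "tc"}
PAUSE = "P"


def split_units(phonemes):
    """Split a phoneme list into units at the assumed boundaries."""
    units, current = [], []
    for symbol in phonemes:
        # A new unit starts before an unvoiced plosive.
        if symbol in UNVOICED_PLOSIVES and current:
            units.append(current)
            current = []
        current.append(symbol)
        # A unit ends after a pause.
        if symbol == PAUSE:
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units


waveform5 = "t 3R n l E f t A tc D @ n E k s t I n t 3R s E k S @ n".split()
units5 = [" ".join(u) for u in split_units(waveform5)]
```

Applied to the phoneme sequence of the speech waveform 5, the sketch yields the seven units listed above.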
At S304, the search unit 14 searches all speech unit waveforms for speech unit waveforms having identical or similar waveforms. For example, the search unit 14 decides that a speech unit waveform 201 (divided from the speech waveform 4) and a speech unit waveform 211 (divided from the speech waveform 6) are identical or similar. In the same way, the search unit 14 decides that a speech unit waveform 202 (divided from the speech waveform 4), a speech unit waveform 206 (divided from the speech waveform 5) and a speech unit waveform 212 (divided from the speech waveform 6) are identical or similar. The search unit 14 decides that a speech unit waveform 203 (divided from the speech waveform 4) and a speech unit waveform 207 (divided from the speech waveform 5) are identical or similar.
Furthermore, the search unit 14 decides that a speech unit waveform 204 (divided from the speech waveform 4) and a speech unit waveform 208 (divided from the speech waveform 5) are identical or similar. The search unit 14 decides that a speech unit waveform 205 (divided from the speech waveform 4) and a speech unit waveform 215 (divided from the speech waveform 6) are identical or similar. The search unit 14 decides that a speech unit waveform 209 (divided from the speech waveform 5) and a speech unit waveform 213 (divided from the speech waveform 6) are identical or similar. The search unit 14 decides that a speech unit waveform 210 (divided from the speech waveform 5) and a speech unit waveform 214 (divided from the speech waveform 6) are identical or similar.
At S305, the search unit 14 selects one speech unit waveform from at least two speech unit waveforms having identical or similar waveforms, and removes (deletes) other speech unit waveforms not selected. For example, the search unit 14 selects the speech unit waveform 201 as a fourth representative speech unit waveform of the speech unit waveforms 201 and 211. In the same way, the search unit 14 selects the speech unit waveform 202 as a fifth representative speech unit waveform of the speech unit waveforms 202, 206 and 212. The search unit 14 selects the speech unit waveform 203 as a sixth representative speech unit waveform of the speech unit waveforms 203 and 207. The search unit 14 selects the speech unit waveform 204 as a seventh representative speech unit waveform of the speech unit waveforms 204 and 208. The search unit 14 selects the speech unit waveform 205 as an eighth representative speech unit waveform of the speech unit waveforms 205 and 215. The search unit 14 selects the speech unit waveform 209 as a ninth representative speech unit waveform of the speech unit waveforms 209 and 213. The search unit 14 selects the speech unit waveform 210 as a tenth representative speech unit waveform of the speech unit waveforms 210 and 214.
The search unit 14 removes (deletes) other speech unit waveforms (not selected as the representative speech unit waveform) in the at least two speech unit waveforms having identical or similar waveforms. For example, the search unit 14 removes the speech unit waveform 211 not selected as the fourth representative speech unit waveform. In the same way, the search unit 14 removes the speech unit waveforms 206 and 212 each not selected as the fifth representative speech unit waveform. The search unit 14 removes the speech unit waveform 207 not selected as the sixth representative speech unit waveform. The search unit 14 removes the speech unit waveform 208 not selected as the seventh representative speech unit waveform. The search unit 14 removes the speech unit waveform 215 not selected as the eighth representative speech unit waveform. The search unit 14 removes the speech unit waveform 213 not selected as the ninth representative speech unit waveform. The search unit 14 removes the speech unit waveform 214 not selected as the tenth representative speech unit waveform.
At S306, the search unit 14 stores the speech unit waveforms remaining without deletion into the storage unit 50. In this way, in the first embodiment, the same processing can be performed for an English text.
In the first embodiment, the search unit 14 selects the representative speech unit waveform from the speech unit waveforms. However, if at least two speech unit waveforms having identical or similar waveforms are included in all speech unit waveforms, the search unit 14 may instead create a representative speech unit waveform based on the at least two speech unit waveforms. For example, from the prosody information of each speech unit waveform, the search unit 14 may newly create a speech unit waveform having a weighted average of duration and a weighted average of fundamental frequency. Briefly, for the prosody information of identical or similar speech unit waveforms, the search unit 14 determines averaged prosody information by calculating a weighted sum of duration and a weighted sum of fundamental frequency (included in the prosody information). Using speech synthesis means such as a text-to-speech synthesis method, the search unit 14 may create a representative speech unit waveform by re-synthesizing a speech unit waveform from the averaged prosody information.
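The averaging of prosody information described above can be sketched as follows. The function name `average_prosody`, the dictionary keys `duration` and `f0`, and the representation of fundamental frequency as a single value per unit (rather than a contour over time) are assumptions made for illustration; the weights are likewise hypothetical and default to equal weighting. Re-synthesis of the waveform from the averaged values is outside this sketch.

```python
def average_prosody(units, weights=None):
    """Return the weighted average duration and fundamental frequency.

    units:   list of dicts with 'duration' (seconds) and 'f0' (Hz).
    weights: optional list of non-negative weights, one per unit.
    """
    if not units:
        raise ValueError("at least one speech unit is required")
    if weights is None:
        weights = [1.0] * len(units)   # equal weighting by default
    if len(weights) != len(units):
        raise ValueError("one weight per speech unit is required")
    total = float(sum(weights))
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    # Weighted sums, normalized by the total weight.
    duration = sum(w * u["duration"] for w, u in zip(weights, units)) / total
    f0 = sum(w * u["f0"] for w, u in zip(weights, units)) / total
    return {"duration": duration, "f0": f0}


similar_units = [{"duration": 0.40, "f0": 120.0},
                 {"duration": 0.50, "f0": 130.0}]
averaged = average_prosody(similar_units)
```

With equal weights, the two units above average to a duration of 0.45 seconds and a fundamental frequency of 125 Hz.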
(Modification 1)
In the first embodiment, the search unit 14 searches speech unit waveforms having identical or similar waveforms. However, in the modification 1, the search unit 14 searches speech units having identical or similar prosody information. In
The above-mentioned condition that “waveforms are identical or similar” is called a condition 1. The above-mentioned condition that “prosody information is identical or similar” is called a condition 2. If the condition 1 is satisfied, the condition 2 is satisfied. However, even if the condition 2 is satisfied, the condition 1 is not always satisfied.
Briefly, the search unit 14 decides whether the condition 2 is satisfied. Because the condition 2 is looser than the condition 1, more speech units are decided to be identical or similar. In this case, in comparison with a decision using the condition 1, the total data quantity of speech units to be stored in the storage unit 50 can be reduced.
(Modification 2)
In the modification 2, the search unit 14 searches speech units having identical or similar phonologic information. In
The above-mentioned condition that “phonologic information is identical or similar” is called a condition 3. If the condition 2 is satisfied, the condition 3 is satisfied. However, even if the condition 3 is satisfied, the condition 2 is not always satisfied.
Briefly, the search unit 14 decides whether the condition 3 is satisfied. In this case, in comparison with a decision using the condition 1 or 2, the total data quantity of speech units to be stored in the storage unit 50 can be reduced.
Moreover, in addition to the phoneme sequence and the accent phoneme, the phonologic information may include, for example, information on a boundary of an accent phrase. The boundary of an accent phrase represents a boundary between adjacent accent phrases, each including an accent. The condition 3 may include a condition that the boundaries of two accent phrases are identical.
(Modification 3)
In the above modifications, the division unit 13 divides the speech waveform generated by the generation unit 12 into speech units. However, the division method is not limited to this. For example, the following method can be used.
From an input text, the generation unit 12 generates phonologic information (including a phoneme sequence in which the text is represented as phonemes) and prosody information (including the duration of each phoneme and the time change of fundamental frequency). Based on the phoneme sequence and the duration, the division unit 13 divides the prosody information into speech units, each being a unit of the prosody information. For example, the prosody information may be divided at an intermediate time within an unvoiced plosive sound (or a pause phoneme). Among the plurality of divided speech units, the search unit 14 searches for at least two speech units for which at least one of the phoneme sequence, the duration and the time change of fundamental frequency is identical or similar. Briefly, based on the phonologic information and the prosody information included in a representative speech unit, by using a speech synthesis method such as a text-to-speech synthesis method, the search unit 14 generates a synthesized speech waveform, i.e., a speech waveform corresponding to the text. The search unit 14 stores the speech waveform into the storage unit 50.
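The division of prosody information in the modification 3 can be illustrated by computing the division times from the phoneme sequence and the durations. This sketch assumes that the "intermediate time" is the midpoint of each unvoiced plosive or pause phoneme; the function name `split_times`, the symbol set and the durations in the usage example are hypothetical and are not taken from the embodiment.

```python
# Assumed phonemes at which the prosody information is divided:
# unvoiced plosives and the pause phoneme "P".
BOUNDARY_PHONEMES = {"k", "t", "p", "ch", "P"}


def split_times(phonemes):
    """Return division times in seconds.

    phonemes: list of (symbol, duration_seconds) pairs in utterance order.
    A division time is placed at the midpoint of each boundary phoneme.
    """
    times = []
    start = 0.0                                   # start time of the current phoneme
    for symbol, duration in phonemes:
        if symbol in BOUNDARY_PHONEMES:
            times.append(start + duration / 2.0)  # midpoint of the phoneme
        start += duration
    return times


# Hypothetical durations, for illustration only.
example = [("a", 0.10), ("t", 0.08), ("a", 0.10),
           ("P", 0.20), ("k", 0.06), ("a", 0.10)]
times = split_times(example)
```

The resulting times can then be used to cut the duration sequence and the fundamental frequency contour into speech units without first generating a waveform.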
In a speech editing apparatus (not shown in Fig.) according to the second embodiment, speech unit waveforms having identical or similar features are first searched for by using the condition 1 (the strictest condition). When the data quantity of the speech unit waveforms remaining after the search is below a predetermined threshold, the speech unit waveforms are stored into the storage unit 50. When the data quantity of the speech unit waveforms remaining after the search is not below the predetermined threshold, speech unit waveforms having identical or similar features are searched for by using the condition 2 (the second strictest condition). By repeating this processing, the data quantity of speech unit waveforms (to be stored into the storage unit 50) is controlled. In the second embodiment, the processing of the search unit 14 differs from that of the first embodiment.
In
After receiving all speech unit waveforms from the division unit 13, the search unit 14 sets an initial value of condition n (n=1, 2, . . . , N (N=3 in this example)) as “n=1” (S1000). The search unit 14 decides whether at least two speech unit waveforms satisfy the condition n (S1001). In the same way as in the modifications 1 and 2, if the condition n is satisfied, the conditions (n+1) to N are also satisfied.
In case of Yes at S1001, the search unit 14 executes processing of S305, and decides whether the total data quantity of the speech unit waveforms remaining without deletion is below a predetermined threshold (S1002). In case of No at S1001, the search unit 14 does not execute processing of S305, and processing is forwarded to S1002.
In case of Yes at S1002, the search unit 14 stores the speech unit waveforms remaining without deletion into the storage unit 50 (S306), and the processing is completed. In case of No at S1002, the search unit 14 decides whether “n=N” holds (S1003).
In case of Yes at S1003, the search unit 14 stores the speech unit waveforms remaining without deletion into the storage unit 50 (S306), and the processing is completed. In case of No at S1003, the search unit 14 increments n by “1” (S1004), and the processing is forwarded to S1001.
In this way, in the second embodiment, the data quantity of speech unit waveforms (to be stored into the storage unit 50) can be limited step by step.
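The flow from S1000 to S1004 can be sketched as a loop over the conditions, from the strictest to the loosest. The function names `dedupe` and `reduce_units`, the dictionary fields `wave`, `prosody`, `phon` and `size`, and the exact-match keys standing in for the "identical or similar" decisions are all assumptions made for illustration. The sample data are hypothetical and are chosen so that each condition implies the next one.

```python
def dedupe(units, key):
    """S1001/S305: keep the first unit of each group sharing the same key."""
    seen, kept = set(), []
    for unit in units:
        group = key(unit)
        if group not in seen:
            seen.add(group)
            kept.append(unit)
    return kept


def reduce_units(units, conditions, threshold):
    """Apply conditions 1..N in turn until the total size is below threshold."""
    remaining = list(units)
    for key in conditions:                        # S1000 (n=1) and S1004 (n+1)
        remaining = dedupe(remaining, key)        # S1001 and S305
        total = sum(u["size"] for u in remaining)
        if total < threshold:                     # S1002
            break
    # Reached on Yes at S1002, or after condition N (Yes at S1003).
    return remaining                              # units to be stored at S306


conditions = [lambda u: u["wave"],      # condition 1 (strictest)
              lambda u: u["prosody"],   # condition 2
              lambda u: u["phon"]]      # condition 3 (loosest)

units = [{"id": 1, "wave": "w1", "prosody": "p1", "phon": "a", "size": 10},
         {"id": 2, "wave": "w2", "prosody": "p1", "phon": "a", "size": 10},
         {"id": 3, "wave": "w3", "prosody": "p2", "phon": "a", "size": 10},
         {"id": 4, "wave": "w4", "prosody": "p3", "phon": "b", "size": 10}]
stored = [u["id"] for u in reduce_units(units, conditions, threshold=25)]
```

In this example, the conditions 1 and 2 leave total sizes of 40 and 30, which are not below the threshold of 25, so the condition 3 is applied and two units remain.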
A speech synthesis apparatus 3 according to the third embodiment artificially synthesizes speech by using the speech unit waveforms stored in the storage unit 50 (as described in the first and second embodiments).
As shown in
As mentioned above, in the third embodiment, a speech synthesis apparatus using speech units having high usage efficiency can be provided.
While certain embodiments have been described, these embodiments have been presented by way of examples only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
2010-073694 | Mar 2010 | JP | national |