This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2014-056667, filed on Mar. 19, 2014; the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a text-to-speech device, a text-to-speech method, and a computer program product.
In recent years, reading out documents using speech synthesis (TTS: Text To Speech) has been attracting a lot of attention. Although books have been read aloud in the past too, the use of TTS makes narration recording unnecessary, thereby making it easier to enjoy the recited voice. Moreover, TTS-based services are now being provided for blogs and Twitter (registered trademark), in which the written text is updated almost in real time. With a TTS-based service, a user can listen to a text being read out while doing some other task.
However, when users write texts in a blog or on Twitter, some of them use leet-speak expressions (hereinafter called “peculiar expressions”) that are not found in normal expressions. The person who sends such a text is intentionally expressing some kind of mood using the peculiar expressions. However, since peculiar expressions are totally different from the expressions in a normal text, conventional text-to-speech devices are not able to correctly analyze a text containing peculiar expressions. For that reason, if a conventional text-to-speech device performs speech synthesis of a text containing peculiar expressions, not only is it impossible to reproduce the mood that the sender wished to express, but the reading also turns out to be completely nonsensical.
According to an embodiment, a text-to-speech device includes a receiver, a normalizer, a selector, a generator, a modifier, and an output unit. The receiver receives an input text which contains a peculiar expression. The normalizer normalizes the input text based on a normalization rule in which the peculiar expression, a normal expression for expressing the peculiar expression in a normal form, and an expression style of the peculiar expression are associated with one another, so as to generate one or more normalized texts. The selector performs language processing with respect to each of the normalized texts, and selects a single normalized text based on a result of the language processing. The generator generates a series of phonetic parameters representing phonetic expression of the single normalized text. The modifier modifies a phonetic parameter in the normalized text corresponding to the peculiar expression in the input text based on a phonetic parameter modification method according to the normalization rule of the peculiar expression. The output unit outputs a phonetic sound which is synthesized using the series of phonetic parameters including the modified phonetic parameter.
An embodiment will be described below in detail with reference to the accompanying drawings.
The analyzer 20 performs language processing with respect to the text received by the text-to-speech device 10. The analyzer 20 includes a receiver 21, a normalizer 22, normalization rules 23, a selector 24, and a language processing dictionary 25.
The synthesizer 30 generates a speech waveform based on the result of language processing performed by the analyzer 20. The synthesizer 30 includes a generator 31, speech waveform generation data 32, a modifier 33, modification rules 34, and an output unit 35.
The normalization rules 23, the language processing dictionary 25, the speech waveform generation data 32, and the modification rules 34 are stored in a memory (not illustrated in
Firstly, the explanation is given about the configuration of the analyzer 20. The receiver 21 receives input of a text containing peculiar expressions. Given below is the explanation of a specific example of a text containing peculiar expressions.
Meanwhile, the receiver 21 can also receive a text expressed in a language other than the Japanese language. In that case, for example, a peculiar expression can be “ooo” (three or more “o” in succession).
Returning to the explanation with reference to
A first cost represents a value counted in the case of applying a normalization rule. When a plurality of normalization rules is applicable to a text, an extremely large number of normalized texts can be generated. Hence, in such a case, the normalizer 22 calculates the total first cost with respect to the text, and applies the normalization rules to the text only up to a predetermined first threshold value of the total first cost, thereby holding down the number of normalized texts that are generated.
In the example illustrated in
Meanwhile, the peculiar expressions for applying normalization rules can be defined not only in units of characters but also using regular expressions or conditional expressions. Moreover, the normal expressions can be defined not only as post-normalization data but also as regular expressions or conditional expressions representing normalization.
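As an illustration of a normalization rule whose peculiar expression is defined using a regular expression, a minimal sketch is given below. The class name `NormalizationRule`, the function `find_applicable`, the style label "elongated", the first cost of 1, and the choice of normalizing three or more "o" in succession to a single "o" are all assumptions made for the example and are not taken from the embodiment.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizationRule:
    pattern: str      # peculiar expression, written as a regular expression
    normal: str       # normal expression that replaces each match
    style: str        # expression style label (hypothetical name)
    first_cost: int   # cost counted when the rule is applied (hypothetical value)


# Hypothetical rule: three or more "o" in succession are normalized to one "o".
RULES = [NormalizationRule(r"o{3,}", "o", "elongated", 1)]


def find_applicable(text, rules):
    """Return (start, end, rule) for every position where a rule matches."""
    hits = []
    for rule in rules:
        for match in re.finditer(rule.pattern, text):
            hits.append((match.start(), match.end(), rule))
    return sorted(hits, key=lambda hit: hit[0])


hits = find_applicable("sooo good", RULES)
```

In this sketch, the "oo" in "good" is left alone because the pattern requires at least three "o" in succession; only the run in "sooo" is reported as a position to which the rule can be applied.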
In the example illustrated in
Meanwhile, generally, there is a possibility that a plurality of normalization rules is applicable at the same position in a text. In such a case, it is possible either to apply any one of the normalization rules to the position, or to apply a plurality of normalization rules to the position at the same time as long as the applied normalization rules do not contradict each other.
Returning to the explanation with reference to
Meanwhile, a normalized-text list may be generated even though the expression concerned is not actually a peculiar expression. Such a normalized-text list is generated because the expression happens to fit a conditional expression, so that normalization rules get applied to it. In that regard, with the aim of selecting the most plausible normalized text from the normalized-text list, the selector 24 calculates second costs. More particularly, the selector 24 performs language processing of a normalized text, and breaks the normalized text down into a morpheme string. Then, the selector 24 calculates a second cost according to the morpheme string.
In the example of the normalized-text list illustrated in
Meanwhile, generally, as the methods for obtaining a suitable morpheme string during language processing, various methods, such as the longest match principle and the clause count minimization method, are known aside from the cost minimization method. However, the selector 24 needs to select the most plausible normalized text from among the normalized texts generated by the normalizer 22. Hence, in the selector 24 according to the embodiment, the cost minimization method is implemented in which the costs of the morpheme strings (equivalent to the second costs according to the embodiment) are also obtained at the same time.
However, the method by which the selector 24 selects the normalized text is not limited to the cost minimization method. Alternatively, for example, from among the normalized texts having the second costs smaller than a predetermined second threshold value, it is possible to select the normalized text having the least number of times of text rewriting according to the normalization rules. Still alternatively, it is possible to select the normalized text having the smallest product of the (total) first cost, which is calculated during the generation of the normalized text, and the second cost, which is calculated from the morpheme string of the normalized text.
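The three selection methods mentioned above can be compared in a minimal sketch. The `Candidate` class, its field names, and the function names are hypothetical; the second costs are assumed to have been calculated already from the morpheme strings, and how they are calculated is outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str                # normalized text
    total_first_cost: float  # summed while applying the normalization rules
    rewrites: int            # number of times the text was rewritten
    second_cost: float       # cost of the morpheme string (assumed precomputed)


def select_min_second_cost(candidates):
    """Cost minimization: the smallest second cost wins."""
    return min(candidates, key=lambda c: c.second_cost)


def select_fewest_rewrites(candidates, second_threshold):
    """Among candidates below the second threshold, pick the fewest rewrites."""
    eligible = [c for c in candidates if c.second_cost < second_threshold]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.rewrites)


def select_min_product(candidates):
    """Pick the smallest product of total first cost and second cost."""
    return min(candidates, key=lambda c: c.total_first_cost * c.second_cost)


candidates = [
    Candidate("text A", total_first_cost=1, rewrites=1, second_cost=12.0),
    Candidate("text B", total_first_cost=3, rewrites=3, second_cost=5.0),
    Candidate("text C", total_first_cost=2, rewrites=2, second_cost=7.0),
]
```

One point to note in this sketch is that a candidate with a total first cost of zero always yields a product of zero regardless of its second cost, so the product-based method is meaningful only when the first costs are positive.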
Returning to the explanation with reference to
The generator 31 makes use of the speech waveform generation data 32, and generates a series of phonetic parameters representing the phonetic expression of the normalized text selected by the selector 24. Herein, the speech waveform generation data 32 contains, for example, synthesis units or acoustic parameters. In the case of using synthesis units in generating the series of phonetic parameters, for example, synthesis unit IDs registered in a synthesis unit dictionary are used. In the case of using acoustic parameters in generating the series of phonetic parameters, for example, acoustic parameters based on the hidden Markov model (HMM) are used.
Regarding the generator 31 according to the embodiment, the explanation is given for an example in which synthesis unit IDs registered in a synthesis unit dictionary are used as phonetic parameters. In the case of using HMM-based acoustic parameters, there are no single numerical values such as IDs. However, if combinations of numerical values are regarded as IDs, the HMM-based acoustic parameters can be treated in essentially the same way as the synthesis unit IDs.
For example, in the case of the normalized text 206, the phonetic expression is /ijada:/ and the prosodic type is 2. Accordingly, the series of phonetic parameters of the normalized text 206 is as illustrated in
Meanwhile, there are times when the selector 24 selects, as the most plausible normalized text, a normalized text not registered in the language processing dictionary 25.
As in the case of the expression 208, there are times when a normalized text includes an unknown word in lowercase characters.
To the modifier 33, the generator 31 outputs the series of phonetic parameters representing the phonetic sound of the normalized text, and outputs the expression styles at the positions in the selected normalized text that correspond to the peculiar expressions present in the input text.
Based on a phonetic parameter modification method according to the normalization rules of peculiar expressions, the modifier 33 modifies the phonetic parameters in the normalized text that correspond to the peculiar expressions in the input text. More particularly, based on the expression styles specified in the normalization rule, the modifier 33 modifies the phonetic parameters that represent the phonetic sound at the positions corresponding to the peculiar expressions in the input text. Herein, there can be a plurality of expression-style-based phonetic parameter modification methods.
Due to the phonetic parameter modification methods illustrated in
Meanwhile, if the text-to-speech device 10 constantly reflects the expression styles of peculiar expressions in the phonetic expression, then sometimes it becomes difficult to hear the phonetic sound. Hence, the configuration can be such that the expression styles set in advance to “reflection not required” by the user are not reflected in the phonetic parameters.
Meanwhile, if modification is done only to the phonetic parameters at the positions in the normalized text that correspond to the peculiar expressions present in the input text, then there is a possibility that the phonetic sound is unnatural. In that regard, the modifier 33 can be configured to modify the entire series of phonetic parameters representing the phonetic sound of the normalized text. In this case, it may be necessary to perform a plurality of modifications to the same section of phonetic parameters. If a plurality of modification methods needs to be implemented in this way, it is desirable that the modifier 33 selects mutually non-conflicting modification methods.
For example, regarding a phonetic parameter modification method for reflecting the expression styles of peculiar expressions in the phonetic parameters, applying “increase the qualifying age” and applying “decrease the qualifying age” contradict each other. In contrast, applying “increase the qualifying age” and applying “keep the volume high for a long duration of time” do not contradict each other.
If non-contradictory modification methods cannot be selected, the modifier 33 can determine the modification methods based on an order of priority set in advance by the user, or can select the modification methods in a random manner.
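The handling of contradictory modification methods can be sketched as follows. The table `CONFLICTS`, the function names, and the greedy strategy of keeping each method only if it does not contradict one already kept are assumptions made for illustration; the method names are the ones quoted in the example above.

```python
import random

# Pairs of modification methods assumed to contradict each other.
CONFLICTS = {
    frozenset({"increase the qualifying age", "decrease the qualifying age"}),
}


def contradicts(a, b):
    return frozenset({a, b}) in CONFLICTS


def choose_methods(requested, priority=None, rng=random):
    """Return a mutually non-contradictory subset of the requested methods.

    If a priority list is given, earlier entries win; methods missing from
    the list are considered last. Otherwise the order is chosen at random.
    """
    if priority is not None:
        order = sorted(
            requested,
            key=lambda m: priority.index(m) if m in priority else len(priority),
        )
    else:
        order = list(requested)
        rng.shuffle(order)
    chosen = []
    for method in order:
        if not any(contradicts(method, kept) for kept in chosen):
            chosen.append(method)
    return chosen


requested = [
    "decrease the qualifying age",
    "increase the qualifying age",
    "keep the volume high for a long duration of time",
]
user_priority = [
    "increase the qualifying age",
    "keep the volume high for a long duration of time",
    "decrease the qualifying age",
]
selected = choose_methods(requested, priority=user_priority)
```

With the priority list shown, the sketch keeps “increase the qualifying age” and “keep the volume high for a long duration of time” and drops “decrease the qualifying age”. Without a priority list, which of the two contradictory methods survives depends on the random order.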
Returning to the explanation with reference to
The text-to-speech device 10 according to the embodiment has the configuration described above. With that, even if an input text contains peculiar expressions that are not used under normal circumstances, speech synthesis can be performed flexibly while reflecting an understanding of the mood. That makes it possible to read out various input texts.
Explained below with reference to flowcharts is a text-to-speech method implemented in the text-to-speech device 10 according to the embodiment. Firstly, the explanation is given for the method by which the analyzer 20 determines a single normalized text corresponding to an input text containing peculiar expressions.
Subsequently, the normalizer 22 calculates combinations of the positions to which the normalization rules are to be applied (Step S3). Then, for each combination, the normalizer 22 calculates the total first cost in the case of applying the normalization rules (Step S4). Subsequently, the normalizer 22 deletes the combinations for which the total first cost is greater than a first threshold value (Step S5). As a result, the number of normalized texts that are generated is held down, thereby reducing the processing load of the selector 24 while determining a single normalized text.
Then, from among the combinations of positions in the text to which the normalization rules are to be applied, the normalizer 22 selects a single combination and applies the normalization rules at the corresponding positions in the text using the selected combination (Step S6). Subsequently, the normalizer 22 determines whether or not all combinations to which the normalization rules are to be applied have been processed (Step S7). If not all combinations have been processed yet (No at Step S7), then the system control returns to Step S6. When all combinations have been processed (Yes at Step S7), the selector 24 selects a single normalized text from the normalized-text list that contains one or more normalized texts generated by the normalizer 22 (Step S8). More particularly, the selector 24 calculates the second costs mentioned above by performing language processing, and selects the normalized text having the smallest second cost.
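Steps S3 to S8 can be sketched roughly as follows. The function names, the tuple layout of `hits`, the inclusion of the empty combination (no rule applied), and in particular `toy_second_cost` are assumptions for illustration; the toy cost merely counts words missing from a small vocabulary and stands in for the language processing of the selector 24, which is not reproduced here.

```python
from itertools import combinations


def generate_normalized_texts(text, hits, first_threshold):
    """hits: non-overlapping (start, end, normal_expression, first_cost) tuples."""
    normalized = []
    for size in range(len(hits) + 1):
        for combo in combinations(hits, size):            # Step S3
            total = sum(hit[3] for hit in combo)          # Step S4
            if total > first_threshold:                   # Step S5
                continue
            pieces, pos = [], 0
            for start, end, normal, _ in sorted(combo):   # Step S6
                pieces.append(text[pos:start])
                pieces.append(normal)
                pos = end
            pieces.append(text[pos:])
            normalized.append(("".join(pieces), total))
    return normalized                                     # all combinations done (S7)


VOCABULARY = {"so", "good"}  # hypothetical stand-in for a dictionary


def toy_second_cost(text):
    """Count words not found in the toy vocabulary (stand-in only)."""
    return sum(1 for word in text.split() if word not in VOCABULARY)


def select_normalized_text(candidates, second_cost):
    """Step S8: pick the normalized text with the smallest second cost."""
    return min(candidates, key=lambda c: second_cost(c[0]))[0]


hits = [(1, 4, "o", 1), (6, 9, "oo", 1)]
candidates = generate_normalized_texts("sooo goood", hits, first_threshold=2)
best = select_normalized_text(candidates, toy_second_cost)
```

Lowering `first_threshold` to 1 in this sketch removes the combination that applies both rules, which shows how the first threshold value limits the number of normalized texts handed to the selector.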
Given below is the explanation of a method by which the synthesizer 30 modifies the phonetic parameters, which are determined from the phonetic expression of a normalized text, according to the expression styles of the peculiar expressions; and reads out the modified phonetic parameters.
Subsequently, the modifier 33 obtains the phonetic parameter modification method according to the expression styles of the peculiar expressions (Step S13).
Then, according to the modification method obtained at Step S13, the modifier 33 modifies the phonetic parameters identified at Step S12 (Step S14). Subsequently, the modifier 33 determines whether or not modification is done with respect to all phonetic parameters at the positions in the normalized text that correspond to the peculiar expressions included in the text that is input to the receiver 21 (Step S15). If not all phonetic parameters have been modified yet (No at Step S15), then the system control returns to Step S12. When all parameters have been modified (Yes at Step S15), the output unit 35 outputs the phonetic sound based on the series of phonetic parameters modified by the modifier 33 (Step S16).
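The loop of Steps S12 to S15 can be sketched as below. The representation of a phonetic parameter as a dictionary with `unit_id` and `duration` keys, the span layout, the style label "elongated", and the doubling of the duration are all hypothetical and chosen only to make the control flow concrete. The `disabled_styles` argument illustrates the option of not reflecting expression styles that the user has set to “reflection not required”.

```python
def modify_parameters(params, styled_spans, methods, disabled_styles=frozenset()):
    """Apply style-dependent modifications to a series of phonetic parameters.

    params:        list of parameter dicts (hypothetical representation)
    styled_spans:  (start, end, style) index ranges in params that correspond
                   to peculiar expressions in the input text
    methods:       mapping from style to a function returning a modified copy
    """
    modified = list(params)
    for start, end, style in styled_spans:        # Step S12: identify parameters
        if style in disabled_styles:              # style not to be reflected
            continue
        method = methods.get(style)               # Step S13: obtain the method
        if method is None:
            continue
        for index in range(start, end):           # Step S14: modify
            modified[index] = method(modified[index])
    return modified                               # loop ends when all spans are done (S15)


params = [{"unit_id": uid, "duration": 1.0} for uid in (101, 102, 103)]
spans = [(2, 3, "elongated")]
methods = {"elongated": lambda p: {**p, "duration": p["duration"] * 2}}

result = modify_parameters(params, spans, methods)
unchanged = modify_parameters(params, spans, methods, disabled_styles={"elongated"})
```

Each method returns a modified copy rather than changing the parameter in place, so the original series remains available, for example when a style is later disabled.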
Lastly, given below is the explanation about an exemplary hardware configuration of the text-to-speech device 10 according to the embodiment.
The control device 41 executes computer programs that are read from the auxiliary memory device 43 and loaded into the main memory device 42. Herein, the main memory device 42 is a memory such as a read only memory (ROM) or a random access memory (RAM). The auxiliary memory device 43 is a hard disk drive (HDD) or a memory card. The display device 44 displays the status of the text-to-speech device 10. The input device 45 receives operation inputs from the user. The communication device 46 is an interface that enables the text-to-speech device 10 to communicate with other devices. The output device 47 is a device such as a speaker that outputs phonetic sound. Moreover, the output device 47 corresponds to the output unit 35 described above.
The computer programs executed in the text-to-speech device 10 according to the embodiment are recorded in the form of installable or executable files in a computer-readable recording medium such as a compact disk read only memory (CD-ROM), a memory card, a compact disk recordable (CD-R), or a digital versatile disk (DVD); and are provided as a computer program product.
Alternatively, the computer programs executed in the text-to-speech device 10 according to the embodiment can be saved as downloadable files on a computer connected to the Internet or can be made available for distribution through a network such as the Internet.
Still alternatively, the computer programs executed in the text-to-speech device 10 according to the embodiment can be stored in advance in a ROM.
The computer programs executed in the text-to-speech device 10 according to the embodiment contain a module for each of the abovementioned functional blocks (i.e., the receiver 21, the normalizer 22, the selector 24, the generator 31, and the modifier 33). As the actual hardware, the control device 41 reads the computer programs from a memory medium and runs them such that the functional blocks are loaded in the main memory device 42. As a result, each of the abovementioned functional blocks is generated in the main memory device 42.
Meanwhile, some or all of the abovementioned constituent elements (the receiver 21, the normalizer 22, the selector 24, the generator 31, and the modifier 33) can be implemented using hardware, such as an integrated circuit, instead of using software.
As explained above, the text-to-speech device 10 according to the embodiment has normalization rules in which peculiar expressions, normal expressions of the peculiar expressions, and expression styles of the peculiar expressions are associated with one another. Based on the expression styles associated with the peculiar expressions in the normalization rules, modification is done to phonetic parameters that represent the phonetic expression at the positions in the normalized text that correspond to the peculiar expressions. As a result, even for a text in which the user has intentionally used peculiar expressions that are not used in normal expressions, the text-to-speech device 10 according to the embodiment can perform appropriate phonetic expression that reflects the intentions of the user.
Meanwhile, the text-to-speech device 10 according to the embodiment can be applied not only for reading out blogs or Twitter but also for reading out comics or light novels. Particularly, if the text-to-speech device 10 according to the embodiment is combined with the character recognition technology, then the text-to-speech device 10 can be applied for reading out the imitative sounds handwritten in the pictures of comics. Besides, if the normalization rules 23, the analyzer 20, and the synthesizer 30 are configured to deal with the English language and the Chinese language, then the text-to-speech device 10 according to the embodiment can be used for those languages too.
While a certain embodiment has been described, the embodiment has been presented by way of example only, and is not intended to limit the scope of the inventions. Indeed, the novel embodiment described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiment described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
2014-056667 | Mar 2014 | JP | national |