1. Field of the Invention
The present invention relates to devices, programs, and methods for text-to-speech read-aloud for converting character data including phonetic characters in a document and outputting speech. More specifically, the present invention relates to a device, a program, and a method for text-to-speech read-aloud for controlling phoneme lengths, particularly, for increasing/reducing a specific phoneme length and so on, in accordance with a read-aloud speed, such as a high read-aloud speed.
2. Description of the Related Art
Technology for the so-called “text-to-speech read-aloud” is known which analyzes character data including phonetic characters, synthesizes speech from the character data through a speech synthesis technique, and outputs the character data in the form of speech. For mobile terminal devices such as mobile phones, a speech synthesis function for reading aloud arbitrary sentences in electronic mail and so on is beginning to be widely available. For personal computers (PCs), software called “screen readers” is beginning to become popular. For understanding of the contents of a sentence, the lengths of phonemes representing vowels, consonants, pauses, and so on which act on the auditory sense are important factors to enhance recognition.
In relation to such text-to-speech read-aloud, Japanese Laid-open Patent Publication No. 6-149283 (Patent Document 1; e.g., Summary of the Invention and
It is now assumed that the speech speed (i.e., the read-aloud speed) is configured to be settable and each phoneme length is set in inverse proportion to the speech speed. For example, when the speech speed is doubled, the phoneme lengths are reduced to ½, and when the speech speed is reduced to ½, the phoneme lengths are doubled. Such a simple relationship, in which the speech speed and the phoneme lengths are in simple inverse proportion to each other, may cause difficulty in hearing, an unpleasant sensation, and a reduction in recognition at a high or low read-aloud speed, even when the speech sounds natural (i.e., is easy to hear) at a normal speech speed.
Patent Document 1, however, does not disclose or suggest such requirements and problems, nor does it disclose or suggest a configuration and so on for addressing them.
According to an aspect of an embodiment, an apparatus for converting text data into a sound signal comprises: a phoneme determiner for determining phoneme data corresponding to a plurality of phonemes and pause data corresponding to a plurality of pauses to be inserted among a series of phonemes in the text data to be converted into the sound signal; a phoneme length adjuster for modifying the phoneme data and the pause data by determining lengths of the phonemes, respectively, in accordance with a speed of the sound signal and selectively adjusting the length of at least one of the phonemes which is placed immediately after one of the pauses so that the at least one of the phonemes is relatively extended timewise as compared to other phonemes; and an output unit for outputting the sound signal on the basis of the phoneme data and pause data adjusted by the phoneme length adjuster.
FIGS. 24a, 24b and 24c, respectively, show speech-synthesis waveforms;
FIGS. 25a and 25b, respectively, show speech-synthesis waveforms;
FIGS. 26a and 26b, respectively, show speech-synthesis waveforms;
FIGS. 27a and 27b, respectively, show speech-synthesis waveforms;
FIGS. 28a and 28b, respectively, show speech-synthesis waveforms;
FIGS. 29a and 29b, respectively, show speech-synthesis waveforms;
FIGS. 30a and 30b, respectively, show speech-synthesis waveforms;
FIGS. 31a and 31b, respectively, show speech-synthesis waveforms; and
A first embodiment of the present invention will now be described with reference to
This text-to-speech read-aloud device (speech reading apparatus, text-to-speech reading apparatus) 2 is one example of a device configuration, a program, and a method for text-to-speech read-aloud according to the present invention, and is implemented by a computer. For example, the text-to-speech read-aloud device 2 includes a speech synthesizing device that converts character data, such as a text sentence (e.g., text with both kanji and kana in the Japanese language), into speech and outputs the speech. The phoneme length of a phoneme immediately after a pause in the character data is controlled in accordance with a speech speed (i.e., a read-aloud speed) to make the output speech easier to hear and to improve recognition of the synthesized speech (read-aloud output). The character data to be read aloud includes phonetic characters, a string of the phonetic characters, and pauses. The phonetic characters or the phonetic character string is an intermediate language including phonetic transcriptions with prosodic symbols which are used for speech synthesis. One example of the phonetic characters is kana characters. Pauses included in the character data represent voiceless periods, such as a period in which no speech conversion is performed. For example, in a Japanese sentence “so tsugyoshi te, shinyou kin koni . . . ” expressed in Roman characters, a comma “,” representing a voiceless period exists between “so tsugyoshi te” and “shinyou kin koni”, and this comma is one example of a pause. The Japanese sentence “so tsugyoshi te, shinyou kin koni . . . ” means “after (he) graduated from (high school), (he has worked) at a bank . . . ”. In other words, “so tsugyoshi te” means “after graduation” and “shinyou kin koni” means “at a bank”.
The pauses considered when controlling the phoneme length of a phoneme immediately after a pause do not include, for example, a Japanese sokuon (a sound expressed by a small-sized kana character “tsu” in Japanese) or a silent period immediately before a plosive. A Japanese sokuon is called a geminate consonant or double consonant in English. A breath group is a unit of human speech in one breath and is preceded and followed by pauses for breath.
In order to achieve such a function, as shown in
The linguistic processor 4 serves as linguistic processing means for inputting text with both kanji and kana, analyzing words by referring to the word dictionary 6, determining phonetic transcriptions, accents, and intonations, and outputting a phonetic character string (intermediate language). The word dictionary 6 contains word types (parts of speech and so on), phonetic transcriptions, accent positions, and so on.
Physically, accent and intonation are closely associated with a temporal change pattern of a pitch frequency. Specifically, the pitch frequency increases at an accent position and decreases according to an increase in intonation. Thus, the linguistic processor 4 divides the input text into breath groups, based on punctuation marks in the text and/or phrases extracted through the word analysis.
The parameter generator 8 serves as parameter generating means for setting phoneme durations, pause durations, and pitch frequency patterns. The parameter generator 8 controls phoneme lengths in accordance with the speech speed.
The parameter generator 8 includes a phoneme-length setter (phoneme length setting unit) 14, a phoneme-length table 16, a phoneme-length controller (phoneme length control unit) 18, and a pitch pattern generator (pitch pattern generating unit) 20.
At the stage of the phonetic character string generated by the linguistic processor 4, the phonemes to be speech-synthesized are determined. The phoneme-length setter 14 serves as means for setting the phoneme length of each phoneme, and sets a phoneme length at a normal speech speed. The phoneme-length table 16 serves as means for storing phoneme lengths that are used at a normal speech speed and that are associated with a phoneme and its preceding and subsequent phonemes. As one example of setting the phoneme lengths, values extracted from a database are stored in the phoneme-length table 16 as the phoneme lengths used at a normal speech speed, and the phoneme lengths are set with reference to these values. The phoneme lengths may be modified according to another parameter element.
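The lookup performed against the phoneme-length table 16 can be pictured with the following minimal sketch. The key layout (preceding phoneme, phoneme, subsequent phoneme), the millisecond values, and the context-independent fallback are all assumptions made for illustration; the description does not specify how the table is organized or what it contains.

```python
from typing import Dict, Optional, Tuple

# Key: (preceding phoneme, phoneme, subsequent phoneme); None marks a pause boundary.
# All durations (milliseconds) are invented placeholder values.
Context = Tuple[Optional[str], str, Optional[str]]

PHONEME_LENGTH_TABLE: Dict[Context, float] = {
    (None, "s", "o"): 95.0,
    ("s", "o", "ts"): 70.0,
}

# Hypothetical fallback used when a context is missing from the table.
DEFAULT_LENGTH_MS: Dict[str, float] = {"s": 90.0, "o": 75.0, "ts": 85.0}


def normal_speed_length(prev: Optional[str], phoneme: str, nxt: Optional[str]) -> float:
    """Return the normal-speed length of a phoneme in its phonetic context."""
    key = (prev, phoneme, nxt)
    if key in PHONEME_LENGTH_TABLE:
        return PHONEME_LENGTH_TABLE[key]
    return DEFAULT_LENGTH_MS[phoneme]
```

Keying on the neighbouring phonemes reflects the statement that stored lengths are associated with a phoneme and its preceding and subsequent phonemes; the fallback is only a convenience of this sketch.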
The phoneme-length controller 18 serves as phoneme-length controlling means. That is, in accordance with a speech speed, the phoneme-length controller 18 controls the phoneme lengths used at the normal speech speed and set by the phoneme-length setter 14. The speech speed is supplied, as control information, from means (not shown) for adjusting a read-aloud speed (set by a user or the like) or the like to the phoneme-length controller 18.
As shown in
According to the phoneme-length controller 18, a phoneme length is adjusted so that it is inversely proportional to the set speech speed relative to a normal speech speed. For example, when the normal speech speed is assumed to be about 7 moras per second and a speech speed of 14 moras per second is set, each phoneme length is adjusted to half, and when a speech speed of 6 moras per second is set, each phoneme length is adjusted to 7/6. In this case, a mora represents a beat and is a unit corresponding to substantially one character when written in kana characters. Kana characters that have diphthongs (e.g., small-sized Japanese kana characters “ya”, “yu”, and “yo”, which are expressed in Roman characters for convenience of description), for example, a kana character “kya”, are each one mora. In the Japanese language, each character (mora) has a similar length.
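The inverse-proportional adjustment can be checked with a short calculation. The sketch below assumes the normal speed of 7 moras per second given in the example; the function names are hypothetical.

```python
from fractions import Fraction
from typing import List

NORMAL_SPEED = 7  # moras per second, the normal speed assumed in the example


def speed_scale(target_speed: int, normal_speed: int = NORMAL_SPEED) -> Fraction:
    """Fixed multiple applied to every phoneme length at the target speed."""
    if target_speed <= 0:
        raise ValueError("speech speed must be positive")
    return Fraction(normal_speed, target_speed)


def scale_lengths(lengths_ms: List[float], target_speed: int) -> List[float]:
    """Scale normal-speed phoneme lengths in inverse proportion to the speed."""
    factor = float(speed_scale(target_speed))
    return [length * factor for length in lengths_ms]
```

With these definitions, `speed_scale(14)` is 1/2 and `speed_scale(6)` is 7/6, matching the two cases in the text.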
The pitch pattern generator 20 serves as pattern generating means for setting a pitch period for each phoneme considering accent information and so on in a phonetic character string.
The pitch extraction/concatenation unit 10 serves as pitch cutting-out/concatenating means that employs, for example, a PSOLA (Pitch Synchronous OverLap and Add) method, which is a pitch conversion method using a waveform overlap-add technique. The waveform dictionary 12 contains phoneme labels indicating to which phonemes specific parts of sound correspond and a pitch mark indicating a pitch period for voiced sound. The pitch extraction/concatenation unit 10 extracts a speech waveform corresponding to two periods from the waveform dictionary 12 based on a parameter generated by the parameter generator 8, multiplies the speech waveform by a window function (e.g., a Hanning window), and multiplies the resulting waveform by a gain for amplitude adjustment, as required. Thereafter, when a desired pitch frequency is different from a pitch frequency in the waveform dictionary 12, the pitch extraction/concatenation unit 10 performs pitch conversion, overlaps and adds the extracted waveforms, and outputs a synthesized speech signal.
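The windowing and overlap-add steps can be sketched as follows. This is a heavily simplified illustration, not the unit 10 itself: it assumes a constant pitch period in samples, pitch marks given as sample indices, and pitch conversion performed simply by re-spacing the windowed segments. The function names and parameters are hypothetical.

```python
import math
from typing import List


def hann(length: int) -> List[float]:
    """Hann (Hanning) window of the given length."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def overlap_add(signal: List[float], pitch_marks: List[int], period: int,
                new_period: int, gain: float = 1.0) -> List[float]:
    """Cut a two-period segment around each pitch mark, window it, apply a
    gain, and overlap-add the segments at a spacing of new_period samples."""
    out = [0.0] * (len(pitch_marks) * new_period + period + 1)
    window = hann(2 * period + 1)  # spans two pitch periods
    for k, mark in enumerate(pitch_marks):
        center = (k + 1) * new_period  # re-spaced position of this segment
        for j, w in enumerate(window):
            src = mark - period + j
            dst = center - period + j
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                out[dst] += gain * w * signal[src]
    return out
```

When `new_period` equals `period`, adjacent Hann windows sum to one, so the interior of the signal is reproduced unchanged; a smaller `new_period` packs the segments closer together, which corresponds to a higher pitch frequency.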
The hardware of the text-to-speech read-aloud device 2 will now be described with reference to
This mobile terminal device (portable terminal, portable terminal device) 200 is one example to which the text-to-speech read-aloud device 2 is applied, and the device, method, and program for text-to-speech read-aloud according to the present invention are not limited to the configuration of the mobile terminal device 200. The mobile terminal device 200 has a communication function and a function for converting character data for a text sentence (e.g., text with both kanji and kana in the case of the Japanese language), such as electronic mail text, into speech and outputting the speech. Thus, as shown in
The processor 202 serves as controlling means for controlling phone communication, execution of speech read-aloud, such as speech synthesis, and so on. The processor 202 is implemented by a CPU (central processing unit) or an MPU (micro processor unit) to execute an OS (operating system) and application programs stored in the storage unit 204. The application programs include a program for executing a procedure for speech-read-aloud processing.
The storage unit 204 is a storage medium that stores the programs executed by the processor 202 and various data used for the execution and that also provides a processing area. The storage unit 204 includes a program storage section 216, a data storage section 218, and a RAM (random access memory) 220. The program storage section 216 stores the OS and the application programs. The data storage section 218 contains the word dictionary 6, the waveform dictionary 12, the phoneme-length table 16 (
The wireless communication unit 206 serves as wireless communicating means for wirelessly transmitting/receiving audio-signal radio waves, packet-signal radio waves, and so on to/from a base station. The wireless communication unit 206 is controlled by the processor 202.
The input section 208 serves as means for inputting, through user operation, control data and a response to a dialog displayed on the display unit 210. The input section 208 includes a keyboard, a touch panel, and so on.
The display unit 210 is controlled by the processor 202 and serves as displaying means for displaying characters, graphics, and so on. The display unit 210 is implemented by, for example, an LCD (liquid crystal display) device. The display unit 210 displays a text sentence for read-aloud and so on.
The sound input unit 212 serves as sound inputting means, which is controlled by the processor 202. The sound input unit 212 includes a microphone 222. Input sound is converted by the microphone 222 into an audio signal, which is then converted into a digital signal and is sent to the processor 202.
The sound output unit 214 serves as sound outputting means, which is controlled by the processor 202. The sound output unit 214 includes a receiver 224 and speakers 226R and 226L which serve as sound converting means. Synthesized speech for read-aloud is reproduced by the receiver 224 and the speakers 226R and 226L.
In the mobile terminal device 200, the text-to-speech read-aloud device 2 described above is constituted by the processor 202, the storage unit 204, the display unit 210, the sound output unit 214, and so on.
As shown in
The mobile terminal device 200 can read aloud various text sentences, including electronic-mail text and novel text. A sentence or the like displayed on a screen 242 of the display unit 210 is speech-synthesized and the speech is reproduced by the receiver 224 or the speakers 226R and 226L. In this case, as shown in
Control of phoneme lengths will now be described with reference to
This processing procedure is one example of a program or method for text-to-speech read-aloud. The processing in the first embodiment includes a process or step of determining whether or not a phoneme in question is a phoneme immediately after a pause, i.e., a speech head (the first phoneme in each breath group), and also includes, as a process or step of controlling the phoneme length, a process or step of increasing the phoneme length of the phoneme when the phoneme is a phoneme at the speech head. This processing procedure is executed by the phoneme-length controller 18 (
In the processing procedure, as shown in
After the phoneme-length setting processing, as processing for phonemes in a breath group, in step S103, a phoneme number n is initialized (n=1), and in steps S104 to S110, control of a phoneme length is performed in accordance with a speech speed. The phoneme-length control is executed for each breath group. A flow from steps S105 to S109 shows processing for phonemes in the breath group. The phoneme-length control includes processing for determining phonemes to be controlled and processing for adjusting phoneme lengths according to the determination results.
Based on recognition of input speech-speed information, the phoneme-length controller 18 controls the phoneme length in accordance with the speech speed. In this case, in step S104, the phoneme length is set to a fixed multiple. In step S105, a determination is made as to whether or not the set speech speed is a high read-aloud speed and whether or not the phoneme in question is the first phoneme (i.e., n=1). Thus, in this processing, the phoneme length of a phoneme immediately after a pause is specified as a phoneme length to be adjusted.
When the speech speed is a high read-aloud speed and the phoneme is the first phoneme (n=1; i.e., YES in step S105), the phoneme length is set or adjusted to a predetermined multiple, for example, 1.5 times, in step S106. On the other hand, when the speech speed is not high and/or the phoneme is not the first phoneme (i.e., NO in step S105), the phoneme length is not adjusted. After the adjustment or non-adjustment, in step S107, the phoneme number n is updated (i.e., n=n+1). In step S108, a determination is made as to whether or not the processing on all phonemes in the breath group is completed, i.e., whether the phoneme number n has passed the number of phonemes in the breath group. Consequently, all phonemes in the breath group are processed.
When the processing on all phonemes in the breath group is completed and the pause at the end of the breath group is reached, in step S109, the length of the pause is set to a fixed multiple in accordance with the speech speed. In step S110, a determination is made as to whether or not processing on all of the input data is completed. Until the processing on all data is completed, the processing from steps S103 to S110 is repeated. After the processing is completed, speech synthesis is executed in step S111 and speech is output.
As described above, the first phoneme in each breath group is modified in accordance with the speech speed and the phoneme length of a phoneme immediately after a pause is adjusted to, for example, 1.5 times during high-speed read-aloud. This arrangement eliminates unclearness during high-speed read-aloud to thereby facilitate hearing, and can improve the recognition of text converted into speech.
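The per-breath-group loop of the first embodiment can be sketched as below. Two points are assumptions of this sketch rather than statements from the description: the 1.5 multiple is applied on top of the speed-scaled length from step S104, and the decision that a speed counts as "high" is passed in as a ready-made flag. The function name is hypothetical.

```python
from typing import List

HEAD_MULTIPLE = 1.5  # example multiple given for step S106


def control_breath_group(lengths: List[float], speed_factor: float,
                         high_speed: bool) -> List[float]:
    """Apply steps S103 to S108 to the phonemes of one breath group."""
    adjusted = []
    for n, length in enumerate(lengths, start=1):  # S103 (n=1), S107 (n=n+1)
        value = length * speed_factor              # S104: fixed multiple
        if high_speed and n == 1:                  # S105: speech head at high speed?
            value *= HEAD_MULTIPLE                 # S106: lengthen the speech head
        adjusted.append(value)
    return adjusted
```

For example, normal-speed lengths of 100, 80, and 60 with a speed factor of 0.5 become 75, 40, and 30 at high speed, so only the phoneme immediately after the pause is lengthened.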
A second embodiment of the present invention will now be described with reference to
This processing procedure is one example of a program or method for text-to-speech read-aloud, and is executed using the text-to-speech read-aloud device 2 (
In the second embodiment, in order to identify a phoneme whose phoneme length is to be increased, the phoneme determining unit 28 (
In the processing procedure, as shown in
Based on recognition of input speech-speed information, the phoneme-length controller 18 controls the phoneme length in accordance with the speech speed. In this case, in step S204, the phoneme-length controller 18 sets the phoneme length to a fixed multiple. In step S205, the phoneme-length controller 18 determines whether or not the speech speed is a high read-aloud speed and the phoneme is the first phoneme (n=1). In this determination processing, the phoneme length of a phoneme (a speech head) immediately after a pause is specified as a phoneme length to be adjusted.
When the speech speed is a high read-aloud speed and the phoneme is the first phoneme (n=1; i.e., Yes in step S205), a determination is made in step S206 as to whether or not the phoneme is a fricative. When the phoneme is also a fricative (Yes in step S206), the phoneme length of the phoneme is set or adjusted to a predetermined multiple α (e.g., α=1.7) in step S207. When the phoneme is the first phoneme but not a fricative (No in step S206), the phoneme length thereof is set or adjusted to a predetermined multiple β (e.g., β=1.5) in step S209.
On the other hand, when the determination in step S205 is No, a determination is made in step S208 as to whether or not the speech speed is a high read-aloud speed and the phoneme is a fricative. When the speech speed is a high read-aloud speed and the phoneme is a fricative (Yes in step S208), the phoneme length thereof is set or adjusted to a predetermined multiple γ (e.g., γ=1.4) in step S210. When the phoneme is neither the first phoneme (n=1) nor a fricative (No in step S208), the phoneme length thereof is not adjusted. That is, in this case, the state in which the phoneme length was set to the fixed multiple in step S204 is maintained.
Thus, when the speech speed is high read-aloud speed and the phoneme is the first phoneme and a fricative, when the speech speed is high read-aloud speed and the phoneme is the first phoneme, when the speech speed is high read-aloud speed and the phoneme is a fricative, or when the phoneme is neither the first phoneme nor a fricative, the phoneme length thereof is adjusted or not adjusted as shown in table 3200 (
After the processing described above, in step S211, the phoneme number n is updated (n=n+1). In step S212, a determination is made as to whether or not the processing on all phonemes in the breath group is completed. As a result, all phonemes in the breath group are processed.
When all phonemes in the breath group have been processed and the pause at the end of the breath group is reached, the length of the pause is set to be a fixed multiple in accordance with the speech speed in step S213. In step S214, a determination is made as to whether or not the processing on all data is completed. Until the processing on all data is completed, the processing from steps S203 to S214 is repeated. After the completion of the processing, speech synthesis is executed in step S215 and speech is output.
Thus, the first phoneme and a fricative in each breath group are modified in accordance with the speech speed, and when a phoneme in question is a phoneme immediately after a pause and/or a fricative or when the phoneme in question is neither thereof, the degree of the increase in the phoneme length is varied. Thus, ease of hearing synthesized speech is enhanced and recognition of read-aloud text converted into speech is improved.
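The four cases of the second embodiment can be written as a small lookup keyed by the two determinations. The example multiples α=1.7, β=1.5, and γ=1.4 come from the text; the fricative set and the function name are hypothetical placeholders, since the description does not list which phoneme labels count as fricatives.

```python
from typing import Dict, Tuple

ALPHA, BETA, GAMMA = 1.7, 1.5, 1.4  # example multiples from steps S207, S209, S210

# Hypothetical set of fricative labels, for illustration only.
FRICATIVES = {"s", "sh", "z", "h", "f"}

# (is speech head, is fricative) -> additional multiple at high speed
MULTIPLE_BY_CASE: Dict[Tuple[bool, bool], float] = {
    (True, True): ALPHA,    # first phoneme and fricative
    (True, False): BETA,    # first phoneme only
    (False, True): GAMMA,   # fricative only
    (False, False): 1.0,    # neither: length from step S204 is kept
}


def extra_multiple(n: int, phoneme: str, high_speed: bool) -> float:
    """Return the additional multiple for phoneme number n in a breath group."""
    if not high_speed:
        return 1.0
    return MULTIPLE_BY_CASE[(n == 1, phoneme in FRICATIVES)]
```

A table-driven form keeps the three multiples in one place, which makes it easy to see that a speech-head fricative receives the largest increase.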
A third embodiment of the present invention will now be described with reference to
This processing procedure is one example of a program or method for text-to-speech read-aloud, and is executed using the text-to-speech read-aloud device 2 (
In the third embodiment, in order to identify a phoneme whose phoneme length is to be adjusted, the phoneme determining unit 28 (
In this processing procedure, as shown in
Based on recognition of input speech-speed information, the phoneme-length controller 18 controls the phoneme length in accordance with the speech speed. In this case, in step S304, the phoneme length is set to a fixed multiple, and in step S305, a determination is made as to whether or not the speech speed is high read-aloud speed and the phoneme is the first phoneme (n=1). In this determination processing, the phoneme length of a phoneme (a speech head) immediately after a pause is specified as a phoneme length to be adjusted.
When the speech speed is a high read-aloud speed and the phoneme is the first phoneme (n=1; i.e., Yes in step S305), the phoneme length is set or adjusted to a predetermined multiple, for example, 1.5 times, in step S306. When the speech speed is not high and/or the phoneme is not the first phoneme (i.e., No in step S305), the phoneme length thereof is not adjusted.
After the processing, in step S307, a determination is made as to whether or not the speech speed is a high read-aloud speed and the phoneme is a vowel. When the speech speed is a high read-aloud speed and the phoneme is a vowel (Yes in step S307), the phoneme length of the phoneme is set or adjusted to a predetermined multiple, for example, 0.9 times, in step S308. When the phoneme is not a vowel (No in step S307), the phoneme length thereof is not adjusted.
After the adjustment or non-adjustment, in step S309, the phoneme number n is updated (i.e., n=n+1). In step S310, a determination is made as to whether or not the processing on all phonemes in the breath group is completed. When the pause at the end of the breath group is reached after the processing on all phonemes in the breath group is executed, the length of the pause is set to a fixed multiple in accordance with the speech speed in step S311. In step S312, a determination is made as to whether or not the processing on all data is completed. Until the processing on all data is completed, the processing from steps S303 to S312 is repeated. After the processing is completed, speech synthesis is executed in step S313 and speech is output.
As described above, the first phoneme and a vowel in each breath group are modified in accordance with the speech speed. That is, the phoneme length of a phoneme immediately after a pause is set to, for example, 1.5 times, whereas the phoneme length of a vowel is set to, for example, 0.9 times. As a result, the time added by the increased phoneme length is offset by the reduction in the phoneme length of the vowel. Thus, ease of hearing synthesized speech is enhanced and recognition of read-aloud text converted into speech is improved while substantially maintaining the total length, without an increase in the overall reproduction time of output speech.
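The compensation between the lengthened speech head and the shortened vowels can be illustrated as follows. The sketch assumes that the multiples compose multiplicatively (so a vowel at the speech head would receive 1.5 × 0.9), that vowels are the five labels listed, and that the high-speed decision is supplied as a flag; none of these details is fixed by the description.

```python
from typing import List

HEAD_MULTIPLE = 1.5   # example multiple for step S306
VOWEL_MULTIPLE = 0.9  # example multiple for step S308
VOWELS = {"a", "i", "u", "e", "o"}  # assumed vowel labels


def control_with_vowels(phonemes: List[str], lengths: List[float],
                        speed_factor: float, high_speed: bool) -> List[float]:
    """Lengthen the speech head and shorten vowels within one breath group."""
    adjusted = []
    for n, (phoneme, length) in enumerate(zip(phonemes, lengths), start=1):
        value = length * speed_factor             # S304: fixed multiple
        if high_speed and n == 1:                 # S305
            value *= HEAD_MULTIPLE                # S306
        if high_speed and phoneme in VOWELS:      # S307
            value *= VOWEL_MULTIPLE               # S308
        adjusted.append(value)
    return adjusted
```

With four phonemes "s", "o", "ts", "u" of 100 each and a speed factor of 0.5, the result is 75, 45, 50, 45: the head gains 25 while the two vowels give back 10, so the total stays closer to the unadjusted 200 than it would with the head increase alone.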
A fourth embodiment of the present invention will now be described with reference to
This processing procedure is one example of a program or method for text-to-speech read-aloud, and is executed using the text-to-speech read-aloud device 2 (
In the fourth embodiment, the phoneme-length controller 18 (
In this processing procedure, as shown in
Based on recognition of input speech-speed information, the phoneme-length controller 18 controls the phoneme length in accordance with the speech speed. In this case, in step S404, the phoneme length is set to a fixed multiple, and in step S405, a determination is made as to whether or not the speech speed is high read-aloud speed and the phoneme is the first phoneme (n=1). Thus, in the determination processing, the phoneme length of a phoneme (a speech head) immediately after a pause is specified as a phoneme length to be adjusted.
When the speech speed is a high read-aloud speed and the phoneme is the first phoneme (n=1; i.e., Yes in step S405), the phoneme length is set or adjusted to a predetermined multiple, for example, 1.5 times, in step S406. When the speech speed is not high and/or the phoneme is not the first phoneme (i.e., No in step S405), the phoneme length thereof is not adjusted.
After the adjustment or non-adjustment, in step S407, the phoneme number n is updated (i.e., n=n+1). In step S408, a determination is made as to whether or not the processing on all phonemes in the breath group is completed. When the pause at the end of the breath group is reached after the processing on all phonemes in the breath group is executed, the length of the pause is set to a fixed multiple in accordance with the speech speed in step S409.
After the setting, in step S410, a total length of the breath group is calculated. In step S411, the phoneme lengths of all phonemes are proportionally adjusted so that the length of the breath group becomes a predetermined length, for example, a length that is the same as or similar to the length when the phoneme lengths are not increased. In step S412, a determination is made as to whether or not the processing on all data is completed. Until the processing on all data is completed, the processing from steps S403 to S412 is repeated. After the processing is completed, speech synthesis is executed in step S413 and speech is output.
As described above, the first phoneme in each breath group is adjusted in accordance with the speech speed, that is, the phoneme length of a phoneme immediately after a pause is adjusted to, for example, 1.5 times, whereas other phonemes in the breath group are proportionally reduced by an amount corresponding to the increase in the phoneme length of the first phoneme. This arrangement enhances ease of hearing synthesized speech while maintaining the length of the breath group and improves recognition of read-aloud text converted into speech.
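The proportional adjustment of steps S410 and S411 amounts to rescaling the breath group back to its unlengthened total. The sketch below assumes that the target is exactly the total before the speech head was lengthened and that the trailing pause is left out of the total; both are choices of this illustration, and the function name is hypothetical.

```python
from typing import List


def normalize_breath_group(scaled: List[float], high_speed: bool,
                           head_multiple: float = 1.5) -> List[float]:
    """Lengthen the speech head, then rescale all phonemes so the breath
    group keeps the total it had before the head was lengthened."""
    target = sum(scaled)                  # length with no increase applied
    adjusted = list(scaled)
    if high_speed and adjusted:
        adjusted[0] *= head_multiple      # S406: lengthen the speech head
    total = sum(adjusted)                 # S410: total length of the group
    if total == 0:
        return adjusted
    ratio = target / total                # S411: proportional adjustment
    return [value * ratio for value in adjusted]
```

For speed-scaled lengths of 50, 40, 30, and 30, the head is first raised to 75 and every phoneme is then multiplied by 150/175. The total returns to 150 while the head remains 1.875 times as long as the second phoneme instead of 1.25 times.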
A fifth embodiment of the present invention will now be described with reference to
This processing procedure is one example of a program or method for text-to-speech read-aloud, and is executed using the text-to-speech read-aloud device 2 (
In the fifth embodiment, the phoneme-length controller 18 (
In this processing procedure, as shown in
Based on recognition of input speech-speed information, the phoneme-length controller 18 controls the phoneme length in accordance with the speech speed. In this case, in step S504, the phoneme length is set to a fixed multiple, and in step S505, a determination is made as to whether or not the speech speed is high read-aloud speed and whether or not the phoneme is the first phoneme (n=1). Thus, in the determination processing, the phoneme length of a phoneme (a speech head) immediately after a pause is specified as a phoneme length to be adjusted.
When the speech speed is a high read-aloud speed and the phoneme is the first phoneme (n=1; i.e., Yes in step S505), the phoneme length is set or adjusted to a predetermined multiple, for example, 1.5 times, in step S506. When the speech speed is not high and/or the phoneme is not the first phoneme (i.e., No in step S505), the phoneme length thereof is not adjusted.
After the adjustment or non-adjustment, in step S507, the phoneme number n is updated (i.e., n=n+1). In step S508, a determination is made as to whether or not the processing on all phonemes in the breath group has been finished. When the pause at the end of the breath group is reached after the processing on all phonemes in the breath group is executed, the length of the pause is set to a fixed multiple in accordance with the speech speed in step S509. In step S510, a determination is made as to whether or not the processing on all data is completed. Until the processing on all data is completed, the processing from steps S503 to S510 is repeated. After the processing of all data is completed, the length of an entire sentence is calculated in step S511. In step S512, the phoneme lengths of all phonemes in the sentence are proportionally adjusted so that the length of the entire sentence, i.e., the amount of read-aloud time, has a predetermined length, for example, a length that is the same as or similar to the length when the phoneme lengths are not increased. After the completion of the processing, speech synthesis is executed in step S513 and speech is output.
As described above, the first phoneme in each breath group is adjusted in accordance with the speech speed, that is, the phoneme length of a phoneme immediately after a pause is adjusted to, for example, 1.5 times, whereas the phoneme lengths of all phonemes in a sentence are proportionally reduced by an amount corresponding to the increase in the phoneme length of the first phoneme. This arrangement enhances ease of hearing synthesized speech while maintaining the length of the entire sentence and improves recognition of read-aloud text converted into speech.
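The sentence-level variant differs from a per-breath-group adjustment in that a single ratio is computed over all breath groups. The following sketch assumes that the target is the unlengthened sentence total and that pauses are excluded from it; the function name and the data layout (a list of breath groups, each a list of phoneme lengths) are hypothetical.

```python
from typing import List


def normalize_sentence(groups: List[List[float]], high_speed: bool,
                       head_multiple: float = 1.5) -> List[List[float]]:
    """Lengthen each speech head, then rescale every phoneme in the sentence
    so the sentence keeps its unlengthened total length."""
    target = sum(sum(group) for group in groups)
    adjusted = [
        [value * head_multiple if high_speed and i == 0 else value
         for i, value in enumerate(group)]
        for group in groups
    ]
    total = sum(sum(group) for group in adjusted)  # S511: sentence length
    ratio = target / total if total else 1.0       # S512: one ratio for all
    return [[value * ratio for value in group] for group in adjusted]
```

Because one ratio is shared, individual breath groups may end up slightly longer or shorter than before even though the sentence total is preserved; a short breath group, whose head is a large share of its length, grows relative to a long one.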
A sixth embodiment of the present invention will now be described with reference to
This processing procedure is one example of a program or method for text-to-speech read-aloud, and is executed using the text-to-speech read-aloud device 2 (
In this processing procedure, as shown in
In the sixth embodiment, in step S604, the phoneme length is set to a fixed multiple corresponding to the speech speed. In step S605, a determination is made as to whether or not the speech speed is a high read-aloud speed and whether or not the phoneme is the first phoneme (n=1). When the speech speed is a high read-aloud speed and the phoneme is the first phoneme (n=1; i.e., Yes in step S605), in step S606, a determination is made as to whether or not the phoneme is a fricative. When the phoneme is also a fricative (Yes in step S606), the phoneme length is set or adjusted to a predetermined multiple α (e.g., α=1.7) in step S607. When the phoneme is the first phoneme but not a fricative (No in step S606), the phoneme length is set or adjusted to a predetermined multiple β (e.g., β=1.5) in step S609.
When the determination in step S605 is No, a determination is made in step S608 as to whether or not the speech speed is a high read-aloud speed and the phoneme is a fricative. When the speech speed is a high read-aloud speed and the phoneme is a fricative (Yes in step S608), the phoneme length is set or adjusted to a predetermined multiple γ (e.g., γ=1.4) in step S610. When the phoneme is neither the first phoneme (n=1) nor a fricative (No in step S608), the phoneme length is not adjusted. That is, in this case, the state in which the phoneme length was set to the fixed multiple in step S604 is maintained.
Thus, when the speech speed is high read-aloud speed and the phoneme is the first phoneme and a fricative, when the speech speed is high read-aloud speed and the phoneme is the first phoneme, when the speech speed is high read-aloud speed and the phoneme is a fricative, or when the phoneme is neither the first phoneme nor a fricative, the phoneme length thereof is adjusted or not adjusted as shown in table 1 illustrated above.
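The four cases can be expressed as a small selection function. The following Python sketch is illustrative only: the function name length_multiple and its boolean parameters are hypothetical, and the values of α, β, and γ are the example values quoted in the text. The returned multiple is assumed to be applied to the phoneme length already set to the fixed multiple for the speech speed.

```python
ALPHA = 1.7  # high speed, first phoneme after a pause and a fricative
BETA = 1.5   # high speed, first phoneme after a pause, not a fricative
GAMMA = 1.4  # high speed, fricative that is not the first phoneme


def length_multiple(high_speed, is_first, is_fricative):
    """Return the additional multiple for one phoneme (1.0 = no adjustment)."""
    if not high_speed:
        return 1.0
    if is_first and is_fricative:
        return ALPHA
    if is_first:
        return BETA
    if is_fricative:
        return GAMMA
    return 1.0
```

Ordering the tests from the most specific case to the least specific keeps the combined case (first phoneme and fricative) from being captured by either single-condition branch.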
After such processing, in step S611, a determination is made as to whether or not the speech speed is a high read-aloud speed and the phoneme is a vowel. When the speech speed is a high read-aloud speed and the phoneme is a vowel (Yes in step S611), the phoneme length thereof is set or adjusted to a predetermined multiple, for example, 0.9 times, in step S612. When the phoneme is not a vowel (No in step S611), the phoneme length is not adjusted.
Thereafter, in step S613, the phoneme number n is updated (n=n+1), as described above. In step S614, a determination is made as to whether or not the processing on all phonemes in the breath group is completed. When the pause at the end of the breath group is reached, the length of the pause is set to a fixed multiple in accordance with the speech speed in step S615. In step S616, a determination is made as to whether or not all data is processed. In step S617, speech synthesis is executed.
As described above, the first phoneme and a fricative in each breath group are adjusted in accordance with the speech speed. Thus, the amount of increase in the phoneme length is varied depending on whether the phoneme in question is a phoneme immediately after a pause and/or a fricative, or neither. When the phoneme is a vowel, the phoneme length thereof is reduced as described above. As a result, the time added by lengthening the phoneme after the pause or the fricative is offset by the time saved by shortening the vowels. This arrangement enhances ease of hearing synthesized speech and improves recognition of read-aloud text converted into speech, without increasing the total reproduction time for speech output.
Seventh Embodiment
A seventh embodiment of the present invention will now be described with reference to
This processing procedure is one example of a program or method for text-to-speech read-aloud, and is executed using the text-to-speech read-aloud device 2 (
In the seventh embodiment, the phoneme-length adjusting unit 24 in the phoneme-length controller 18 has the breath-group-length calculating unit 30, as in the fourth embodiment (
In this processing procedure, as shown in
In the seventh embodiment, in step S704, the phoneme length is set to a fixed multiple corresponding to the speech speed. In step S705, a determination is made as to whether or not the speech speed is a high read-aloud speed and whether or not the phoneme is the first phoneme (n=1). When the speech speed is a high read-aloud speed and the phoneme is the first phoneme (n=1; i.e., Yes in step S705), a determination is made in step S706 as to whether or not the phoneme is a fricative. When the speech speed is a high read-aloud speed and the phoneme is both the first phoneme (n=1) and a fricative (Yes in step S706), the phoneme length is set or adjusted to a predetermined multiple α (e.g., α=1.7) in step S707. When the phoneme is neither the first phoneme (n=1) nor a fricative (No in step S708), the phoneme length is not adjusted. That is, in this case, the state in which the phoneme length was set to the fixed multiple in step S704 is maintained.
When the speech speed is a high read-aloud speed and the phoneme is the first phoneme but not a fricative (No in step S706), the phoneme length is set or adjusted to a predetermined multiple β (e.g., β=1.5) in step S709. When the speech speed is a high read-aloud speed and the phoneme is a fricative but not the first phoneme (Yes in step S708), the phoneme length is set or adjusted to a predetermined multiple γ (e.g., γ=1.4) in step S710.
Thus, the phoneme length is adjusted or left unadjusted as shown in Table 1 illustrated above, according to which of four cases applies: the speech speed is a high read-aloud speed and the phoneme is both the first phoneme and a fricative; the speech speed is a high read-aloud speed and the phoneme is the first phoneme but not a fricative; the speech speed is a high read-aloud speed and the phoneme is a fricative but not the first phoneme; or the phoneme is neither the first phoneme nor a fricative.
After such processing, in step S711, the phoneme number n is updated (n=n+1). In step S712, a determination is made as to whether or not the processing on all phonemes in the breath group is completed. When the pause at the end of the breath group is reached, the length of the pause is set to a fixed multiple in accordance with the speech speed in step S713. Thereafter, in step S714, the length of the entire breath group is calculated. In step S715, the phoneme lengths of all phonemes are proportionally adjusted so that the length of the breath group becomes a predetermined length, for example, a length that is the same as or similar to the length when the phoneme lengths are not increased. In step S716, a determination is made as to whether or not all data is processed. Until the processing on all data is completed, the processing from steps S703 to S716 is repeated. After the completion determination, speech synthesis is executed in step S717 and speech is output.
As described above, the first phoneme and a fricative in each breath group are adjusted in accordance with the speech speed. Thus, the amount of increase in the phoneme length is varied, as described above, depending on whether the phoneme in question is a phoneme immediately after a pause and/or a fricative, or neither, and the phonemes in the breath group are proportionally reduced by an amount corresponding to those increases. This arrangement enhances ease of hearing synthesized speech and improves recognition of read-aloud text converted into speech, while maintaining the length of the breath group.
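The breath-group correction of steps S714 and S715 can be sketched as follows. This Python example is a hedged illustration: the function name normalize_breath_group and its argument layout are assumptions, the input lengths are taken to be already scaled for the speech speed, and the pause at the end of the breath group is not included in the correction.

```python
def normalize_breath_group(base_lengths, multiples):
    """Apply per-phoneme multiples, then rescale the whole breath group so
    that its length equals the length before the multiples were applied.

    base_lengths: phoneme lengths (ms) already set for the speech speed.
    multiples: one multiple per phoneme (e.g. 1.7, 1.5, 1.4, or 1.0).
    """
    if not base_lengths:
        return []
    adjusted = [length * m for length, m in zip(base_lengths, multiples)]
    # Ratio of the predetermined length to the lengthened breath group.
    factor = sum(base_lengths) / sum(adjusted)
    return [length * factor for length in adjusted]
```

Because every phoneme is multiplied by the same correction factor, the lengthened phonemes keep their relative prominence while the other phonemes absorb the extra time.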
An eighth embodiment of the present invention will now be described with reference to
This processing procedure is one example of a program or method for text-to-speech read-aloud, and is executed using the text-to-speech read-aloud device 2 (
In the eighth embodiment, the phoneme-length controller 18 in the text-to-speech read-aloud device 2 (
In this processing procedure, as shown in
In the eighth embodiment, in step S804, the phoneme length is set to a fixed multiple corresponding to the speech speed. In step S805, a determination is made as to whether or not the speech speed is a high read-aloud speed and whether or not the phoneme is the first phoneme (n=1). When the speech speed is a high read-aloud speed and the phoneme is the first phoneme (n=1; i.e., Yes in step S805), a determination is made in step S806 as to whether or not the phoneme is a fricative. When the speech speed is a high read-aloud speed and the phoneme is both the first phoneme (n=1) and a fricative (Yes in step S806), the phoneme length is set or adjusted to a predetermined multiple α (e.g., α=1.7) in step S807. When the phoneme is neither the first phoneme (n=1) nor a fricative (No in step S808), the phoneme length is not adjusted. That is, in this case, the state in which the phoneme length was set to the fixed multiple in step S804 is maintained.
When the speech speed is a high read-aloud speed and the phoneme is the first phoneme but not a fricative (No in step S806), the phoneme length is set or adjusted to a predetermined multiple β (e.g., β=1.5) in step S809. When the speech speed is a high read-aloud speed and the phoneme is a fricative but not the first phoneme (Yes in step S808), the phoneme length is set to a predetermined multiple γ (e.g., γ=1.4) in step S810.
Thus, the phoneme length is adjusted or left unadjusted as shown in Table 1 illustrated above, according to which of four cases applies: the speech speed is a high read-aloud speed and the phoneme is both the first phoneme and a fricative; the speech speed is a high read-aloud speed and the phoneme is the first phoneme but not a fricative; the speech speed is a high read-aloud speed and the phoneme is a fricative but not the first phoneme; or the phoneme is neither the first phoneme nor a fricative.
After such processing, in step S811, the phoneme number n is updated (n=n+1). In step S812, a determination is made as to whether or not the processing on all phonemes in the breath group is completed. When the pause at the end of the breath group is reached, the length of the pause is set to a fixed multiple in accordance with the speech speed in step S813. In step S814, a determination is made as to whether or not all data is processed.
After the processing of all data is completed, the length of an entire sentence is calculated in step S815. In step S816, the phoneme lengths of all phonemes in the sentence are proportionally adjusted so that the length of the entire sentence, i.e., the amount of read-aloud time, has a predetermined length, for example, a length that is the same as or similar to the length when the phoneme lengths are not increased. After the completion of the processing, speech synthesis is executed in step S817 and speech is output.
As described above, the first phoneme and a fricative in each breath group are adjusted in accordance with the speech speed. Thus, the amount of increase in the phoneme length is varied, as described above, depending on whether the phoneme in question is a phoneme immediately after a pause and/or a fricative, or neither, and all phonemes in the sentence are proportionally reduced by an amount corresponding to those increases. This arrangement enhances ease of hearing synthesized speech and improves recognition of read-aloud text converted into speech, while maintaining the length of the entire sentence.
A ninth embodiment of the present invention will now be described with reference to
This processing procedure is one example of a program or method for text-to-speech read-aloud, and is executed using the text-to-speech read-aloud device 2 (
In the processing procedure, as shown in
In the ninth embodiment, in step S904, the phoneme length is set to a fixed multiple corresponding to the speech speed. In step S905, the phoneme number n is updated (n=n+1). In step S906, a determination is made as to whether or not the processing on all phonemes in the breath group is completed.
In this case, in step S907, a determination is made as to whether or not the speech speed is high. When the speech speed is high (Yes in step S907), in step S908, the length of the pause at the end of the breath group is set to a predetermined multiple, for example, one half, relative to the fixed multiple.
When the speech speed is not high (No in step S907), the length of the pause at the end of the breath group is set, when that pause is reached, to a fixed multiple according to the speech speed in step S909. In step S910, a determination is made as to whether or not processing on all data is completed. After processing on all data is completed, speech synthesis is executed in step S911 and speech is output.
As described above, the length of the pause at the end of a breath group is reduced during high-speed read-aloud, to thereby maintain the amount of time for reading the entire length, enhance ease of hearing synthesized speech, and improve recognition of read-aloud text converted into the speech.
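The pause handling of steps S907 to S909 can be sketched as follows. This Python example is illustrative only: the function name pause_length, the boolean high_speed flag (the criterion for "high" is left to the caller), and the use of simple inverse proportion as the "fixed multiple" are assumptions. The one-half reduction is the example value given in the text.

```python
def pause_length(reference_pause_ms, speed, high_speed, reduction=0.5):
    """Return the pause length (ms) at the end of a breath group.

    reference_pause_ms: pause length at the reference (1x) speed.
    speed: speech-speed factor.
    high_speed: True when the read-aloud speed is judged to be high.
    """
    # Fixed multiple for the speech speed (assumed: simple inverse proportion).
    scaled = reference_pause_ms / speed
    # At high speed the pause is shortened further, e.g. to one half.
    return scaled * reduction if high_speed else scaled
```

Under these assumptions, a 300 ms pause at triple speed would become 50 ms instead of 100 ms, while the phoneme lengths themselves are left at their fixed multiple.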
A tenth embodiment of the present invention will now be described with reference to
In the tenth embodiment, in the parameter generator 8, a delimiter changing unit 34 is provided at a stage prior to the phoneme-length setter 14. The delimiter changing unit 34 changes the length of a pause at a delimiter in a breath group containing a phonetic character string generated by the linguistic processor 4 (
In this case, when the phonetic character string resulting from the linguistic processing is assumed to be “yamanashi'ken no koukou wo so tsugyoshi te, shinyou ki'n koni ha*itte yonen me'desu.”, the delimiter changing unit 34 reduces the lengths of breath-group delimiters by one step. Specifically, a middle point “.” having a small pause length is changed to an accent-delimitated blank (without a pause), a comma “,” having a medium pause length is changed to a middle point “.” having a small pause length, and a period “.” having a large pause length is changed to a comma “,” having a medium pause length.
Consequently, the phonetic character string is changed to “yamanashi'ken no koukou wo so tsugyoshi te.shinyou ki'n koni ha*itte yonen me'desu,”, so that the total amount of time for reproducing the read-aloud text can be reduced.
In the processing procedure, as shown in
In the tenth embodiment, in step S1004, the phoneme length is set to a fixed multiple in accordance with the speech speed. After the setting of the phoneme length, in step S1005, a determination is made as to whether or not the character is a period “.”. When the character is a period “.”, the character is replaced with a comma “,” in step S1006 and the process proceeds to step S1011.
When the character is not a period “.” (No in step S1005), a determination is made as to whether or not the character is a comma “,” in step S1007. When the character is a comma “,”, the character is replaced with a middle point “.” in step S1008 and the process proceeds to step S1011.
When the character is not a comma “,” (No in step S1007), a determination is made as to whether or not the character is a middle point “.” in step S1009. When the character is a middle point “.”, the character is replaced with a blank “ ” in step S1010 and the process proceeds to step S1011.
In the processing procedure, in step S1011, the phoneme number n is updated (n=n+1). In step S1012, a determination is made as to whether or not the processing on all phonemes in the breath group is completed. When the pause at the end of the breath group is reached, the length of the pause is set to a fixed multiple in accordance with the speech speed in step S1013. In step S1014, a determination is made as to whether or not the processing on all data is completed. In step S1015, speech synthesis is executed.
In the processing procedure, characters representing breath group delimiters are replaced to reduce the lengths of the delimiters by one step. Specifically, a middle point “.” having a small pause length (e.g., 0.1 second at the normal speech speed) is changed to an accent-delimitated blank (without a pause), a comma “,” having a medium pause length (e.g., 0.3 second at the normal speech speed) is changed to a middle point “.” having a small pause length, and a period “.” having a large pause length (e.g., 0.8 second at the normal speech speed) is changed to a comma “,” having a medium pause length. Thus, the phonetic character string is changed to “yamanashi'ken no koukou wo so tsugyoshi te.shinyou ki'n koni ha*itte yonen me'desu,”. As a result of such change, the total amount of reproduction time can be reduced.
Thus, the phoneme lengths in each breath group are secured and the total amount of time for reproducing a read-aloud sentence can be reduced.
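The one-step reduction of delimiters can be sketched as a lookup table. The following Python example is a hedged illustration: the Delimiter enumeration, the downgrade and total_pause functions, and the token-list representation are hypothetical. Symbolic names are used in place of delimiter characters because the period and the middle point are rendered with the same glyph in this text. The pause lengths are the example values at the normal speech speed quoted above.

```python
from enum import Enum


class Delimiter(Enum):
    """Breath-group delimiters; the value is the example pause in seconds."""
    BLANK = 0.0         # accent-phrase border, no pause
    MIDDLE_POINT = 0.1  # small pause
    COMMA = 0.3         # medium pause
    PERIOD = 0.8        # large pause


# Each delimiter is replaced by the next shorter one (cf. steps S1005-S1010).
DOWNGRADE = {
    Delimiter.PERIOD: Delimiter.COMMA,
    Delimiter.COMMA: Delimiter.MIDDLE_POINT,
    Delimiter.MIDDLE_POINT: Delimiter.BLANK,
}


def downgrade(tokens):
    """Replace every delimiter token; other tokens pass through unchanged."""
    return [DOWNGRADE.get(token, token) for token in tokens]


def total_pause(tokens):
    """Sum the example pause lengths (seconds) of all delimiter tokens."""
    return sum(t.value for t in tokens if isinstance(t, Delimiter))


sentence = [
    "yamanashi'ken no koukou wo so tsugyoshi te", Delimiter.COMMA,
    "shinyou ki'n koni ha*itte yonen me'desu", Delimiter.PERIOD,
]
shortened = downgrade(sentence)
```

With the example values, the total pause time of the sample sentence falls from 1.1 seconds to 0.4 second, while the phonetic segments themselves are untouched.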
(1) Speech-speed information input to the phoneme-length controller 18 will now be described with reference to
(2) Although cases in which the phoneme length of a phoneme immediately after a pause is increased have been described in the above embodiments, the present invention is also applicable to a case in which the phoneme length is reduced.
(3) Although the mobile terminal device 200 (
(4) When the read-aloud speed is high in the above embodiments, some or all pauses in character data may be removed. The pause removal allows the amount of reproduction time to be reduced without compromising ease of hearing.
(5) When the read-aloud speed is low, the phoneme length of a phoneme immediately after a pause may be reduced or may be adjusted to have the same length as a reference speed.
(6) In the above-described sixth embodiment (
(7) Although the processing is performed for each breath group in the above-described tenth embodiment (
(8) Although a fricative is used as an example of a specific phoneme in the second, sixth, seventh, and eighth embodiments and the phoneme length of the fricative is increased, the increase in the length of the fricative may be eliminated or the length of another phoneme other than a fricative may be increased.
A first example will now be described with reference to
When the phoneme length of each phoneme is to be increased in accordance with the speech speed, the text-to-speech read-aloud device 2 (
In the processing, when the input text is, for example, “yamanashiken no koukou wo so tsugyoshi te shinyou kin koni haitte 4nen me desu.” (
In this exemplary text “yamanashi ken no koukou wo so tsugyoshi te shinyou kin koni haitte 4nen me desu.”, “yamanashi” is a noun, a phonetic character string thereof is “yamanashi'”, “ken” is a noun, a phonetic character string thereof is “ken”, “no” is a particle, a phonetic character string thereof is “no”, and a portion subsequent to the “no” is an accent-phrase border and is thus a blank. Further, “koukou” is a noun, a phonetic character string thereof is “koukou”, “wo” is a particle, a phonetic character string thereof is “wo”, and a portion subsequent thereto is an accent-phrase border and is thus a blank. “sotsugyo shi” is a verb, a phonetic character string thereof is “sotsugyo shi”, “te” is a particle, a phonetic character string thereof is “te”, “,” is a breath-group border (having a medium pause length), a phonetic character string thereof is “,”, “shinyo” is a noun, a phonetic character string thereof is “shinyo”, “kinko” is a noun, a phonetic character string thereof is “ki'nko”, “ni” is a particle, a phonetic character string thereof is “ni”, and a portion subsequent thereto is an accent-phrase border and is thus a blank. Further, “hait” is a verb, a phonetic character string thereof is “ha*it”, “te” is a particle, a phonetic character string thereof is “.”, “4” is a numeral, a phonetic character string thereof is “yo”, “nen” is a measure word, a phonetic character string thereof is “nen”, “me” is a postposition of the measure word, a phonetic character string thereof is “me'”, “desu” is a verbal auxiliary, a phonetic character string thereof is “desu”, “.” is a breath-group border (having a large pause length), and a phonetic character string thereof is “.”. Thus, a phonetic character string of the above-noted exemplary text is expressed by “yamanashi'ken no koukou wo so tsugyoshi te, shinyou ki'n koni ha*itte yonen me'desu”. In
In this example, when seven moras per second is assumed to be a reference (1×) speed and about 21 moras per second (i.e., the 3× speech speed) are to be generated, the phoneme lengths for the 1× speed are read from the phoneme-length table 16 (
In contrast, a result of the processing in the first embodiment (
When a phoneme length is to be generated at the 3× speed, the length of phoneme “sh”, which is a speech head subsequent to the pause, is set to 1.5 times the phoneme length obtained from simple inverse proportion. As a result, the phoneme length at the reference (1×) speed is 117 ms, whereas the phoneme length at the 3× speed is 59 ms (117 ms divided by 3 gives 39 ms under simple inverse proportion, and 1.5 times that value is about 59 ms). Comparison of these phoneme lengths with those of the other phonemes “I”, “N”, “y”, “O”, and “O” shows that the phoneme length “117 ms” of phoneme “sh” at the 1× speed is not prominently different from the phoneme lengths of the other phonemes, specifically, the length of phoneme “I”=60 ms, the length of phoneme “N”=60 ms, the length of phoneme “y”=65 ms, the length of phoneme “O”=80 ms, and the length of phoneme “O”=105 ms. In contrast, the phoneme length “59 ms” of phoneme “sh” at the 3× speed is prominently different from the phoneme lengths of the other phonemes, specifically, the length of phoneme “I”=20 ms, the length of phoneme “N”=20 ms, the length of phoneme “y”=22 ms, the length of phoneme “O”=27 ms, and the length of phoneme “O”=35 ms. As a result, ease of hearing is improved and recognition can be enhanced.
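The figures in this example can be reproduced with a few lines of Python. The sketch below is illustrative only: the names REFERENCE_MS, round_half_up, and lengths_at_speed are hypothetical, and rounding half up to the nearest millisecond is an assumption chosen because it matches the values quoted above.

```python
import math

# Phoneme lengths (ms) at the reference (1x) speed, as quoted in the text.
REFERENCE_MS = [("sh", 117), ("I", 60), ("N", 60), ("y", 65), ("O", 80), ("O", 105)]


def round_half_up(value):
    """Round to the nearest integer, with halves rounded upward (assumed)."""
    return math.floor(value + 0.5)


def lengths_at_speed(reference, speed, head_multiple=1.5):
    """Scale lengths in inverse proportion to the speed and lengthen the
    speech-head phoneme (the first entry) by head_multiple."""
    result = []
    for n, (name, length_ms) in enumerate(reference):
        length = length_ms / speed
        if n == 0:  # phoneme immediately after the pause
            length *= head_multiple
        result.append((name, round_half_up(length)))
    return result


at_3x = lengths_at_speed(REFERENCE_MS, 3)
# at_3x == [("sh", 59), ("I", 20), ("N", 20), ("y", 22), ("O", 27), ("O", 35)]
```

At the reference speed the speech-head phoneme is roughly 1.1 to 2 times as long as its neighbours, whereas at the 3× speed with the 1.5 multiple it is roughly 1.7 to 3 times as long, which is the prominence the passage describes.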
Speech-synthesis waveforms resulting from the above-described processing will now be described with reference to
A dotted-line-surrounded portion a in waveform in
In contrast,
When e in waveform in
A represents a waveform obtained at a normal read-aloud speed and B represents a waveform obtained at a high read-aloud speed. Compared with the normal-speed read-aloud of waveform A, in the high-speed read-aloud of waveform B the phoneme lengths immediately after the pauses f and g are reduced in proportion to the speech speed, i.e., to 19 ms at the portion f and 24 ms at the portion g in this example.
In contrast,
When h and i in waveform in
Waveforms resulting from processing according to a fourth example will now be described with reference to
In contrast,
When the pause section m in waveform in
Waveforms resulting from processing according to a fifth example will now be described with reference to
A represents a waveform obtained when the processing in the ninth embodiment (the flowchart in
When the pause sections p and q in waveform in
Technical ideas according to the above-described embodiments of the present invention will be described below.
While preferred embodiments of the present invention have been described above, the present invention is not limited thereto. It is apparent to those skilled in the art that various modifications and changes can be made based on the appended claims or the subject matter of the present invention disclosed herein. Needless to say, such modifications and changes are also encompassed by the scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
2007-167018 | Jun 2007 | JP | national |