1. Field of the Invention
The present invention relates to a speech synthesis technique.
2. Description of the Related Art
For train guidance on station platforms, traffic jam information on expressways, and the like, domain-specific synthesis is used, which combines and concatenates pre-recorded speech data (pre-stored word speech data and phrase speech data). This scheme can obtain synthetic speech with high naturalness because the technique is applied to a specific domain, but cannot synthesize speech corresponding to arbitrary texts.
A concatenative synthesis system, which is a typical rule-based speech synthesis system, generates rule-based synthetic speech by dividing an input text into words, adding pronunciation information to them, and concatenating the speech segments in accordance with the pronunciation information. Although this scheme can synthesize speech corresponding to arbitrary texts, the naturalness of synthetic speech is not high.
Japanese Patent Laid-Open No. 2002-221980 discloses a speech synthesis system which generates synthetic speech by combining pre-recorded speech and rule-based synthetic speech. This system comprises a phrase dictionary holding pre-recorded speech and a pronunciation dictionary holding pronunciations and accents. Upon receiving an input text, the system outputs pre-recorded speech of a word when it is registered in the phrase dictionary, and outputs rule-based synthetic speech of a word which is generated from the pronunciation and accent of the word when it is registered in the pronunciation dictionary.
In speech synthesis disclosed in Japanese Patent Laid-Open No. 2002-221980, since voice quality greatly changes near the boundary between pre-recorded speech and rule-based synthetic speech, the intelligibility may deteriorate.
The present invention has been made in consideration of the above problem, and has as its object to improve intelligibility when synthetic speech is generated by combining pre-recorded speech and rule-based synthetic speech.
According to one aspect of the present invention, a speech synthesis apparatus includes a language analysis unit configured to identify a word by performing language analysis on a supplied text, a selection unit configured to select one of first speech synthesis processing of performing rule-based synthesis based on a result of the language analysis and second speech synthesis processing of performing pre-recorded-speech-based synthesis for playing back pre-recorded speech data as speech synthesis processing to be executed for a word of interest which is extracted from the result of the language analysis, wherein the selection unit selects the first or second speech synthesis processing based on a word adjacent to the word of interest, a process execution unit configured to execute the first or second speech synthesis processing, which is selected by the selection unit, for the word of interest, and an output unit configured to output synthetic speech generated by the process execution unit.
According to another aspect of the present invention, a speech synthesis method includes a language analysis step of identifying a word by performing language analysis on a supplied text, a selection step of selecting one of first speech synthesis processing of performing rule-based synthesis based on a result of the language analysis and second speech synthesis processing of performing pre-recorded-speech-based synthesis for playing back pre-recorded speech data as speech synthesis processing to be executed for a word of interest which is extracted from the result of the language analysis, wherein the selection step selects the first or second speech synthesis processing based on a word adjacent to the word of interest, a process execution step of executing the first or second speech synthesis processing, which is selected in the selection step, for the word of interest, and an output step of outputting synthetic speech generated in the process execution step.
Further features of the present invention will become apparent from the following description of exemplary embodiments with reference to the attached drawings.
Various exemplary embodiments, features, and aspects of the present invention will be described in detail below with reference to the drawings. The present invention is not limited by the disclosure of the embodiments, and not all combinations of the features described in the embodiments are necessarily essential to the solving means of the present invention.
The following embodiments exemplify a case in which a term registered in a language dictionary used for language analysis for rule-based synthesis or registered in pre-recorded speech data for pre-recorded-speech-based synthesis is a word. However, the present invention is not limited to this. A registered term can be a phrase comprising a plurality of word strings or a unit smaller than a word.
Referring to
Referring to
A synthesis selection unit 209 selects a speech synthesis method (rule-based synthesis or pre-recorded-speech-based synthesis) to be applied to a word of interest based on the analysis result held by the analysis result holding unit 203 and the previous selection result held by a selection result holding unit 210. The selection result holding unit 210 holds the speech synthesis method for the word of interest, which is selected by the synthesis selection unit 209, together with the previous result. A speech output unit 211 outputs, via the speech output device 107, the synthetic speech held by the synthetic speech holding unit 208. The language dictionary 212 holds the spelling information, pronunciation information, and the like of words.
Pre-recorded-speech-based synthesis in this embodiment is a method of generating synthetic speech by combining pre-recorded speech data such as pre-recorded words and phrases. When combined, the pre-recorded speech data can either be processed or be output without any processing.
In step S301, the language processing unit 202 extracts a word as a speech synthesis target by performing language analysis on a text as a synthesis target held by the text holding unit 201 by using the language dictionary 212. This embodiment is premised on the procedure of sequentially performing speech synthesis processing from the start of a text. For this reason, words are sequentially extracted from the start of a text. In addition, pronunciation information is added to each word, and information indicating whether there is pre-recorded speech corresponding to each word is extracted from the pre-recorded-speech-based synthesis data 207. The analysis result holding unit 203 holds the analysis result. The process then shifts to step S302.
If it is determined in step S302 that the analysis result held by the analysis result holding unit 203 contains a word which has not been synthesized, the process shifts to step S303. If the analysis result contains no word which has not been synthesized, this processing is terminated.
In step S303, the synthesis selection unit 209 selects a speech synthesis method for the word of interest (the first word) based on the analysis result held by the analysis result holding unit 203 and the speech synthesis method selection results on previously processed words which are held by the selection result holding unit 210. The selection result holding unit 210 holds this selection result. If rule-based synthesis is selected as a speech synthesis method, the process shifts to step S304. If pre-recorded-speech-based synthesis is selected as a speech synthesis method instead of rule-based synthesis, the process shifts to step S305.
In step S304, the rule-based synthesis unit 204 as a process execution unit performs rule-based synthesis for the word of interest by using the analysis result held by the analysis result holding unit 203 and the rule-based synthesis data 205. The synthetic speech holding unit 208 holds the generated synthetic speech. The process then shifts to step S306.
In step S305, the pre-recorded-speech-based synthesis unit 206 as a process execution unit performs pre-recorded-speech-based synthesis for the word of interest by using the analysis result held by the analysis result holding unit 203 and the pre-recorded-speech-based synthesis data 207. The synthetic speech holding unit 208 holds the generated synthetic speech. The process then shifts to step S306.
In step S306, the speech output unit 211 outputs, via the speech output device 107, the synthetic speech held by the synthetic speech holding unit 208. The process returns to step S302.
The following is a selection criterion for a speech synthesis method in step S303 in this embodiment.
Priority is given first to the pre-recorded-speech-based synthesis scheme. In other cases, the same speech synthesis method as that selected for a word (second word) adjacent to the word of interest, e.g., a word immediately preceding the word of interest, is preferentially selected. If no pre-recorded speech of the word of interest is registered, pre-recorded-speech-based synthesis cannot be performed. In this case, therefore, rule-based synthesis is selected. Rule-based synthesis can generally synthesize an arbitrary word, and hence can always be selected.
According to the above processing, a speech synthesis method for the word of interest is selected in accordance with the speech synthesis method for the word immediately preceding the word of interest. This makes it possible to continue using the same speech synthesis method and to reduce the number of times the speech synthesis methods are switched. An improvement in the intelligibility of synthetic speech can therefore be expected.
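The selection criterion of step S303 can be illustrated with a minimal sketch. The function names and the string labels below are hypothetical. The sketch also assumes one reading of the criterion: the initial priority for pre-recorded-speech-based synthesis applies when there is no previously selected method, and rule-based synthesis is used whenever no pre-recorded speech is registered for the word of interest.

```python
from typing import List, Optional

RULE = "rule"                # hypothetical label for rule-based synthesis
PRERECORDED = "prerecorded"  # hypothetical label for pre-recorded-speech-based synthesis


def select_method(has_prerecorded: bool, previous: Optional[str]) -> str:
    """Select a speech synthesis method for the word of interest."""
    if not has_prerecorded:
        # No pre-recorded speech is registered, so only rule-based synthesis is possible.
        return RULE
    if previous is None:
        # Assumption: with no preceding selection, pre-recorded speech has priority.
        return PRERECORDED
    # Otherwise keep the method selected for the immediately preceding word.
    return previous


def select_for_text(availability: List[bool]) -> List[str]:
    """Process words from the start of the text, as in steps S302 and S303."""
    selections: List[str] = []
    previous: Optional[str] = None
    for has_prerecorded in availability:
        previous = select_method(has_prerecorded, previous)
        selections.append(previous)
    return selections
```

Under these assumptions, once a word without pre-recorded speech forces rule-based synthesis, the following words also use rule-based synthesis even when pre-recorded speech exists for them, which keeps the number of switches low.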
In the first embodiment described above, the same speech synthesis method as that selected for a word immediately preceding a word of interest is preferentially selected for the word of interest. In contrast to this, the second embodiment sets the minimization of concatenation distortion as a selection criterion. This will be described in detail below.
The same reference numerals as in
A processing procedure in the speech synthesis apparatus according to this embodiment will be described with reference to
In step S303, the concatenation distortion calculation unit 401 calculates the concatenation distortion between the synthetic speech of a word immediately preceding a word of interest, which is held by the synthetic speech holding unit 208, and synthesis target speech of the word of interest. The synthesis selection unit 209 then selects synthesis candidate speech for which the concatenation distortion calculation unit 401 has calculated minimum concatenation distortion and a speech synthesis method corresponding to it. The selection result holding unit 210 holds this selection result. If the selected speech synthesis method is rule-based synthesis, the process shifts to step S304. If the selected speech synthesis method is not rule-based synthesis but is pre-recorded-speech-based synthesis, the process shifts to step S305.
Referring to
Concatenation distortion in this embodiment is the spectral distance between the end of the synthetic speech of a word immediately preceding a word of interest and the start of synthetic speech of the word of interest. The concatenation distortion calculation unit 401 calculates the concatenation distortion between the synthetic speech 501 of the immediately preceding word and the synthesis candidate speech (speech synthesized from a pronunciation) 502 obtained by rule-based synthesis of the word of interest and the concatenation distortion between the synthetic speech 501 of the immediately preceding word and the synthesis candidate speech 503 obtained by pre-recorded-speech-based synthesis. The synthesis selection unit 209 selects synthesis candidate speech which minimizes concatenation distortion and a speech synthesis method for it.
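The selection by minimum concatenation distortion can be sketched as follows. This is only an illustration under stated assumptions: each speech segment is reduced to a single spectral frame at the concatenation point, the spectral distance is taken to be a Euclidean distance, and the names `spectral_distance` and `select_candidate` are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def spectral_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two spectral frames of equal length (an assumed metric)."""
    if len(a) != len(b):
        raise ValueError("spectral frames must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_candidate(
    prev_end_frame: Sequence[float],
    candidate_start_frames: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the synthesis method whose candidate joins the preceding speech
    with the smallest concatenation distortion, together with that distortion."""
    best_method = min(
        candidate_start_frames,
        key=lambda m: spectral_distance(prev_end_frame, candidate_start_frames[m]),
    )
    return best_method, spectral_distance(prev_end_frame, candidate_start_frames[best_method])


# Hypothetical frames: end of the preceding synthetic speech and start of each candidate.
method, distortion = select_candidate(
    [1.0, 2.0, 3.0],
    {"rule": [1.0, 2.0, 5.0], "prerecorded": [1.0, 2.5, 3.0]},
)
```

In this example the pre-recorded candidate is selected because its starting frame lies closer to the end frame of the preceding speech.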
Obviously, concatenation distortion is not limited to a spectral distance. It can be defined based on an acoustic feature amount typified by a cepstral distance or a fundamental frequency, or by using another known technique. Consider, for example, a speaking rate. In this case, concatenation distortion can be defined based on the difference or ratio between the speaking rate of an immediately preceding word and the speaking rate of synthesis candidate speech. If the speaking rate difference is defined as concatenation distortion, the smaller the difference, the smaller the concatenation distortion. If the speaking rate ratio is defined as concatenation distortion, the smaller the distance of the ratio from a reference ratio of 1, the smaller the concatenation distortion.
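The two speaking-rate definitions can be written down as a short sketch. The function names are hypothetical, and the unit of the speaking rate (for example, morae per second) is an assumption; any positive rate measure would serve.

```python
def rate_difference_distortion(prev_rate: float, cand_rate: float) -> float:
    """Distortion defined as the absolute difference between two speaking rates."""
    return abs(prev_rate - cand_rate)


def rate_ratio_distortion(prev_rate: float, cand_rate: float) -> float:
    """Distortion defined as the distance of the speaking rate ratio from 1."""
    if prev_rate <= 0:
        raise ValueError("the preceding speaking rate must be positive")
    return abs(cand_rate / prev_rate - 1.0)
```

Note that the ratio form shown here is not symmetric in its two arguments; whether the candidate or the preceding word serves as the denominator is a design choice the description leaves open.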
As described above, if there are a plurality of synthesis candidate speech data for a word of interest, setting the minimization of concatenation distortion as a selection criterion makes it possible to select synthesis candidate speech with smaller distortion at a concatenation point and a speech synthesis method for it. An improvement in intelligibility can therefore be expected.
The first and second embodiments are configured to select a speech synthesis method word by word. However, the present invention is not limited to this. For example, it suffices to select synthesis candidate speech of each word and a speech synthesis method for it so as to satisfy a selection criterion for all or part of a supplied text.
The first and second embodiments are based on the premise that the language processing unit 202 uniquely identifies a word. However, the present invention is not limited to this. An analysis result can contain a plurality of solutions. This embodiment exemplifies a case in which there are a plurality of solutions.
Referring to
In step S601, a synthesis selection unit 209 selects an optimal sequence of synthesis candidate speech data which satisfy a selection criterion for all or part of a text based on the analysis result held by the analysis result holding unit 203. A selection result holding unit 210 holds the selected optimal sequence. The process then shifts to step S302.
Assume that the selection criterion adopted by the synthesis selection unit 209 is “to minimize the sum of the number of times of switching of speech synthesis methods and the number of times of concatenation of synthesis candidate speech”.
If it is determined in step S302 that the optimal sequence held by the selection result holding unit 210 contains a word which has not been synthesized, the process shifts to step S303. If there is no word which has not been synthesized, this processing is terminated.
In step S303, the synthesis selection unit 209 causes the processing to be applied to a word of interest to branch to step S304 or step S305 based on the optimal sequence held by the selection result holding unit 210. If rule-based synthesis is selected for the word of interest, the process shifts to step S304. If pre-recorded-speech-based synthesis is selected for the word of interest instead of rule-based synthesis, the process shifts to step S305. Since the processing in steps S304, S305, and S306 is the same as that in the first embodiment, a repetitive description will be omitted.
The selection of a plurality of solutions of a language analysis and an optimal sequence will be described next with reference to
Referring to
Referring to
The example shown in
As is understood, each of these sequences of synthesis candidate speech data represents a selection pattern of speech synthesis methods in consideration of the presence/absence of pre-recorded speech data of each word. This embodiment selects one of obtained selection patterns which minimizes the sum of the number of times of switching of the speech synthesis methods and the number of times of concatenation of words. In this case, the sequence “(7) 801-805” minimizes the sum of the number of times of switching of the speech synthesis methods and the number of times of concatenation of words. The synthesis selection unit 209 therefore selects the sequence “801-805”.
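The search for a sequence that minimizes the sum of the number of switches and the number of concatenations can be sketched as a small dynamic program over a lattice of candidates. The lattice below is hypothetical and does not reproduce the reference numerals of the drawings; the `(start, end, method)` representation and the function name `best_sequence` are likewise assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[int, int, str]  # (start position, end position, synthesis method)


def best_sequence(candidates: List[Candidate], length: int) -> Tuple[int, List[Candidate]]:
    """Find a sequence covering positions 0..length that minimizes
    (number of method switches) + (number of concatenations)."""
    # best[(pos, method)] = (cost, path) for the cheapest path ending at pos with method.
    best: Dict[Tuple[int, str], Tuple[int, List[Candidate]]] = {}
    # Sorting by start position guarantees that every path ending at `start`
    # has been recorded before a candidate beginning at `start` is processed.
    for cand in sorted(candidates):
        start, end, method = cand
        if start == 0:
            options = [(0, [cand])]
        else:
            options = []
            for (pos, prev_method), (cost, path) in best.items():
                if pos == start:
                    # One concatenation, plus one switch if the method changes.
                    step = 1 + (1 if prev_method != method else 0)
                    options.append((cost + step, path + [cand]))
        if not options:
            continue
        cost, path = min(options, key=lambda o: o[0])
        key = (end, method)
        if key not in best or cost < best[key][0]:
            best[key] = (cost, path)
    finals = [value for (pos, _), value in best.items() if pos == length]
    if not finals:
        raise ValueError("no candidate sequence covers the whole text")
    return min(finals, key=lambda v: v[0])


# Hypothetical lattice: a pre-recorded phrase spans the first two words.
lattice = [
    (0, 1, "rule"), (0, 1, "prerecorded"), (1, 2, "rule"),
    (0, 2, "prerecorded"), (2, 3, "rule"), (2, 3, "prerecorded"),
]
cost, path = best_sequence(lattice, 3)
```

In this hypothetical lattice the pre-recorded phrase followed by a pre-recorded word wins with one concatenation and no switch, which mirrors the way a longer registered unit reduces both terms of the criterion.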
A general user dictionary function of speech synthesis registers pairs of spellings and pronunciations in a user dictionary. A speech synthesis apparatus having both the rule-based synthesis function and the pre-recorded-speech-based synthesis function as in the present invention preferably allows a user to register pre-recorded speech in addition to pronunciations. It is further preferable to register a plurality of pre-recorded speech data. Consider a case in which this embodiment is provided with a user dictionary function capable of registering any of combinations of spellings and pronunciations, spellings and pre-recorded speech, and spellings, pronunciations, and pre-recorded speech. A pronunciation registered by the user is converted into synthetic speech by using rule-based synthesis. In addition, pre-recorded speech registered by the user is converted into synthetic speech by using pre-recorded-speech-based synthesis.
Assume that in this embodiment, when there is pre-recorded speech registered in the system, synthetic speech obtained by using pre-recorded-speech-based synthesis is selected. Assume also that if there is no pre-recorded speech registered in the system, synthetic speech obtained by applying rule-based synthesis to a pronunciation is selected.
Depending on the recording environment, pre-recorded speech registered by the user is not always of high quality. Some care is therefore required in selecting the synthetic speech of a word registered by the user. A method of selecting the synthetic speech of a word registered by the user by using information about speech synthesis methods for preceding and succeeding words will be described.
A text holding unit 201 holds a text as a speech synthesis target. A text rule-based synthesis unit 901 performs language analysis on the spelling of an unknown word (to be described later) held by an identification result holding unit 904 by using words whose pronunciations are registered in a language dictionary 212 and user dictionary 906, and then performs rule-based synthesis based on the language analysis result. The text rule-based synthesis unit 901 then outputs the synthetic speech. A pronunciation rule-based synthesis unit 902 receives a pronunciation registered in the user dictionary 906, performs rule-based synthesis, and outputs the synthetic speech. A pre-recorded-speech-based synthesis unit 206 performs pre-recorded-speech-based synthesis for one of the word identification results held by the identification result holding unit 904 which is identified as a word by using pre-recorded-speech-based synthesis data 207, and outputs the synthetic speech. The pre-recorded-speech-based synthesis data 207 holds the pronunciations and pre-recorded speech of words and phrases.
A word identifying unit 903 identifies a word of the text held by the text holding unit 201 by using the spellings of pre-recorded speech data registered in the pre-recorded-speech-based synthesis data 207 and user dictionary 906. The identification result holding unit 904 holds the word identification result. A word identification result may contain a character string (to be referred to as an unknown word in this embodiment) which is not registered in either the pre-recorded-speech-based synthesis data 207 or the user dictionary 906. A word registration unit 905 registers, in the user dictionary 906, the spellings and pronunciations input by the user via an input device 105.
The word registration unit 905 registers, in the user dictionary 906, the pre-recorded speech input by the user via a speech input device 109 and the spellings input by the user via the input device 105. The user dictionary 906 can register any of combinations of spellings and pronunciations, spellings and pre-recorded speech, and spellings, pronunciations, and pre-recorded speech. When the word registered in the user dictionary 906 is present in the identification result holding unit 904, a synthetic speech selection unit 907 selects the synthetic speech of a word of interest in accordance with a selection criterion. The speech output unit 211 outputs the synthetic speech held by a synthetic speech holding unit 208. The synthetic speech holding unit 208 holds the synthetic speech data respectively output from the text rule-based synthesis unit 901, the pronunciation rule-based synthesis unit 902, and the pre-recorded-speech-based synthesis unit 206.
Processing in the speech synthesis apparatus according to this embodiment will be described next with reference to
Referring to
In step S1002, by using pre-recorded speech registered in the pre-recorded-speech-based synthesis data 207 and user dictionary 906, the pre-recorded-speech-based synthesis unit 206 performs pre-recorded-speech-based synthesis for one of the word identification results held by the identification result holding unit 904 which is identified as a word. The synthetic speech holding unit 208 holds the generated synthetic speech. The process then shifts to step S1003.
In step S1003, the text rule-based synthesis unit 901 performs language analysis on the spelling of an unknown word held by the identification result holding unit 904 by using words whose pronunciations are registered in the language dictionary 212 and user dictionary 906, and then performs rule-based synthesis based on the language analysis result. The synthetic speech holding unit 208 holds the generated synthetic speech. The process then shifts to step S1004.
In step S1004, the pronunciation rule-based synthesis unit 902 performs rule-based synthesis for a word, of the word identification results held by the identification result holding unit 904, whose pronunciation is registered in the user dictionary 906. The synthetic speech holding unit 208 holds the generated synthetic speech. The process then shifts to step S1005.
In step S1005, if a plurality of synthesis candidate speech data are present with respect to a word including an unknown word in the identification result holding unit 904, the synthetic speech selection unit 907 selects one of them. The selection result is reflected in the synthetic speech holding unit 208 (for example, the selected synthetic speech is registered, or synthetic speech which is not selected is deleted). The process then shifts to step S1006.
In step S1006, the speech output unit 211 sequentially outputs the synthetic speech data held by the synthetic speech holding unit 208 from the start of the text. This processing is then terminated.
Referring to
Reference numerals 1105, 1106, and 1107 denote synthetic speech data obtained as the results of speech synthesis processing up to step S1004. The synthetic speech 1105 corresponds to the unknown word 1102, and comprises only text rule-based synthetic speech. The synthetic speech 1106 corresponds to the word 1103, and comprises pre-recorded-speech-based synthetic speech, user pre-recorded-speech-based synthetic speech, and user pronunciation rule-based synthetic speech. The synthetic speech 1107 corresponds to the word 1104, and comprises only pre-recorded-speech-based synthetic speech.
The text rule-based synthesis unit 901 outputs text rule-based synthetic speech. The pronunciation rule-based synthesis unit 902 outputs user pronunciation rule-based synthetic speech. The pre-recorded-speech-based synthesis unit 206 outputs pre-recorded-speech-based synthetic speech and user pre-recorded-speech-based synthetic speech.
The processing in step S1005 will be described with reference to
The synthetic speech selection unit 907 selects one of the pre-recorded-speech-based synthetic speech 1202, user pre-recorded-speech-based synthetic speech 1203, and user pronunciation rule-based synthetic speech 1204 which satisfies a selection criterion.
Consider a case in which the selection criterion is “to give priority to the same or similar speech synthesis method as or to an immediately preceding speech synthesis method”. In this case, since the immediately preceding speech synthesis method is text rule-based synthesis, the user pronunciation rule-based synthetic speech 1204 which is a kind of speech based on rule-based synthesis is selected.
If the selection criterion is “to give priority to the same or similar speech synthesis method as or to an immediately succeeding speech synthesis method”, the pre-recorded-speech-based synthetic speech 1202 is selected.
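The two criteria above can be sketched with a simple similarity score. The labels, the grouping of methods into families, and the score values are assumptions for illustration: an identical method scores highest, a method of the same family (rule-based or pre-recorded) scores next, and anything else scores lowest.

```python
from typing import Sequence

# Hypothetical grouping of the four kinds of synthetic speech into two families.
FAMILY = {
    "text_rule": "rule",
    "user_pronunciation_rule": "rule",
    "prerecorded": "prerecorded",
    "user_prerecorded": "prerecorded",
}


def similarity(method: str, neighbor: str) -> int:
    """2 for the same method, 1 for a similar method (same family), 0 otherwise."""
    if method == neighbor:
        return 2
    if FAMILY[method] == FAMILY[neighbor]:
        return 1
    return 0


def select_by_neighbor(candidates: Sequence[str], neighbor: str) -> str:
    """Select the candidate closest to the method used for the neighboring word."""
    return max(candidates, key=lambda m: similarity(m, neighbor))


candidates = ["prerecorded", "user_prerecorded", "user_pronunciation_rule"]
by_preceding = select_by_neighbor(candidates, "text_rule")     # preceding word: text rule-based
by_succeeding = select_by_neighbor(candidates, "prerecorded")  # succeeding word: pre-recorded
```

With the preceding word as the reference, the user pronunciation rule-based candidate is chosen; with the succeeding word as the reference, the system pre-recorded candidate is chosen, matching the two outcomes described above.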
As described above, providing the function of registering a pronunciation and pre-recorded speech in a user dictionary in correspondence with the spelling of each word increases the number of choices for the selection of speech synthesis methods, so that an improvement in intelligibility can be expected.
The fourth embodiment has exemplified the case in which there is only one synthesis candidate speech data before and after a word registered by the user. The fifth embodiment exemplifies a case in which words registered by the user are present consecutively.
Referring to
As in the fourth embodiment, a synthetic speech selection unit 907 selects one synthetic speech data from synthesis candidate speech data in accordance with a predetermined selection criterion. If, for example, the selection criterion is “to minimize the number of times of switching of speech synthesis methods and give priority to pre-recorded-speech-based synthetic speech”, 1301-1302-1305-1308 is selected. If the selection criterion is “to give priority to user pre-recorded-speech-based synthetic speech and minimize the number of times of switching of speech synthesis methods”, 1301-1303-1306-1308 is selected.
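When several user-registered words appear consecutively, the combined criteria can be treated as a lexicographic ordering over all candidate sequences. The sketch below is an assumption-laden illustration: the candidate lists are hypothetical, a switch is counted only between the rule-based and pre-recorded families, and the order of the two criteria is controlled by a flag.

```python
from itertools import product
from typing import List, Sequence, Tuple

FAMILY = {
    "text_rule": "rule",
    "user_pronunciation_rule": "rule",
    "prerecorded": "prerecorded",
    "user_prerecorded": "prerecorded",
}


def count_switches(seq: Sequence[str]) -> int:
    """Count boundaries where the synthesis family changes (an assumed definition)."""
    return sum(1 for a, b in zip(seq, seq[1:]) if FAMILY[a] != FAMILY[b])


def select_sequence(slots: List[List[str]], preferred: str, preference_first: bool) -> Tuple[str, ...]:
    """Enumerate every combination and pick the best one under a two-level criterion."""
    def key(seq: Tuple[str, ...]) -> Tuple[int, int]:
        switches = count_switches(seq)
        preferred_count = sum(1 for m in seq if m == preferred)
        if preference_first:
            return (-preferred_count, switches)
        return (switches, -preferred_count)
    return min(product(*slots), key=key)


user_word = ["prerecorded", "user_prerecorded", "user_pronunciation_rule"]
slots = [["prerecorded"], user_word, user_word, ["prerecorded"]]
fewest_switches = select_sequence(slots, "prerecorded", preference_first=False)
user_first = select_sequence(slots, "user_prerecorded", preference_first=True)
```

Exhaustive enumeration is adequate for a handful of consecutive words; for longer runs a dynamic program over positions would avoid the exponential growth in combinations.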
Considering the possibility that the voice quality of pre-recorded speech registered by the user is unstable, it is also effective to use the selection criterion “to minimize the sum total of concatenation distortion at concatenation points”.
As described above, even if words registered by the user are present consecutively, an improvement in intelligibility can be expected by setting a selection criterion so as to implement full or partial optimization.
The first to fifth embodiments have exemplified the case in which a speech synthesis method is selected for a word of interest based on word information other than that of the word of interest. However, the present invention is not limited to this. The present invention can adopt an arrangement configured to select a speech synthesis method based on only the word information of a word of interest.
The same reference numerals as in
Since a processing procedure in the sixth embodiment is the same as that in the first embodiment, the processing procedure in the sixth embodiment will be described with reference to
The processing procedure in steps S301, S302, S304, S305, and S306 in
In step S303, the waveform distortion calculating unit 1401 calculates the waveform distortion between the synthesis candidate speech obtained by applying rule-based synthesis to a pronunciation registered in the language dictionary 212 and the synthesis candidate speech obtained by applying pre-recorded-speech-based synthesis to pre-recorded speech registered in the user dictionary 906. The synthesis selection unit 209 then compares the waveform distortion obtained by the waveform distortion calculating unit 1401 with a preset threshold. If the waveform distortion is larger than the threshold, the synthesis selection unit 209 selects pre-recorded-speech-based synthesis regardless of speech synthesis methods for preceding and succeeding words. The process then shifts to step S305; otherwise, the process shifts to step S304.
As waveform distortion, a value based on a known technique, e.g., the sum total of the differences between the amplitudes of waveforms at the respective time points or the sum total of spectral distances, can be used. Alternatively, waveform distortion can be calculated by using dynamic programming or the like upon establishing a temporal correlation between two synthesis candidate speech data.
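Both ways of computing waveform distortion, together with the threshold comparison of step S303, can be sketched as follows. The function names are hypothetical, waveforms are represented as plain lists of amplitude samples, and the dynamic programming variant is written as a standard dynamic time warping recurrence, which is one possible choice rather than the one prescribed here.

```python
from typing import Sequence


def amplitude_distortion(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute amplitude differences at the respective time points."""
    if len(a) != len(b):
        raise ValueError("waveforms must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def dtw_distortion(a: Sequence[float], b: Sequence[float]) -> float:
    """Distortion after temporal alignment by dynamic time warping."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the minimum accumulated distortion aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def prefer_user_recording(distortion: float, threshold: float) -> bool:
    """True when the distortion exceeds the threshold, so that
    pre-recorded-speech-based synthesis is selected (step S305)."""
    return distortion > threshold
```

The aligned variant tolerates differences in duration between the two candidates, so a user recording that is merely slower than the rule-based rendering is not penalized in the way a sample-by-sample comparison would penalize it.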
As described above, introducing waveform distortion makes it possible to give priority to the user's intention in registering pre-recorded speech, that is, not merely an intention to increase variations but, for example, the intention to have a word pronounced according to the registered pre-recorded speech.
The sixth embodiment has exemplified the case in which a speech synthesis method is selected for a word of interest in consideration of the waveform distortion between the synthesis candidate speech obtained by applying rule-based synthesis to a pronunciation registered in the language dictionary 212 and the synthesis candidate speech obtained by applying pre-recorded-speech-based synthesis to pre-recorded speech registered in the user dictionary 906. However, targets for which waveform distortion is to be obtained are not limited to them. That is, it suffices to pay attention to the waveform distortion between the synthesis candidate speech based on a pronunciation or pre-recorded speech registered in the system and the synthesis candidate speech based on a pronunciation or pre-recorded speech registered in the user dictionary. In this case, if the waveform distortion is larger than a threshold, priority is given to the synthesis candidate speech based on the pronunciation or pre-recorded speech registered in the user dictionary.
The first and second embodiments have exemplified the case in which when a speech synthesis method is to be selected for each word, a text is processed starting from its start word. However, the present invention is not limited to this, and can adopt an arrangement configured to process a text starting from its end word. When a text is to be processed starting from its end word, a speech synthesis method is selected for a word of interest based on a speech synthesis method for an immediately succeeding word. In addition, the present invention can adopt an arrangement configured to process a text starting from an arbitrary word. In this case, a speech synthesis method is selected for a word of interest based on already selected speech synthesis methods for preceding and succeeding words.
The first to third embodiments have exemplified the case in which the language processing unit 202 divides a text into words by using the language dictionary 212. However, the present invention is not limited to this. For example, the present invention can incorporate an arrangement configured to identify words by using words and phrases included in a language dictionary 212 and pre-recorded-speech-based synthesis data 207.
If rule-based synthesis is selected in step S303 in
If rule-based synthesis is selected, a rule-based synthesis unit 204 processes the word 1507. When the word 1507 is processed, the phrase 1503 is excluded from selection targets in step S302, and the word 1508 is processed next. Referring to
As described above, when the result obtained by performing language analysis by using words and phrases included in the language dictionary 212 and pre-recorded-speech-based synthesis data 207 is to be used, it is necessary to proceed with processing while establishing correspondence between phrases and corresponding words.
When the language dictionary 212 is to be generated, incorporating the information of the words and phrases of the pre-recorded-speech-based synthesis data 207 in the language dictionary 212 makes it unnecessary for the language processing unit 202 to access the pre-recorded-speech-based synthesis data 207 at the time of execution of language analysis.
According to the first embodiment, the selection criterion for speech synthesis methods is “to preferentially select the same speech synthesis method as that selected for an immediately preceding word”. However, the present invention is not limited to this. It suffices to use another selection criterion or combine the above selection criterion with an arbitrary selection criterion.
For example, the selection criterion “to reset a speech synthesis method at a breath group” can be combined with the above selection criterion to give the selection criterion “to select the same speech synthesis method as that selected for an immediately preceding word but to give priority to the pre-recorded-speech-based synthesis method upon resetting the speech synthesis method at a breath group”. Information indicating whether a breath group is detected is one piece of the word information obtained by language analysis. That is, the language processing unit 202 includes a unit configured to determine whether each identified word corresponds to a breath group.
In the case of the selection criterion in the first embodiment, when rule-based synthesis is selected, this method is basically kept selected up to the end of the processing. In contrast to this, in the case of the above combination of selection criteria, since the selection is reset at a breath group, the pre-recorded-speech-based synthesis method can be easily selected. It is therefore possible to expect an improvement in voice quality. Note that switching of the speech synthesis methods at a breath group has almost no influence on intelligibility.
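The combined selection criterion can be illustrated with the following sketch. It assumes that language analysis marks each word with a flag, here called `starts_breath_group`, indicating that a breath group boundary immediately precedes the word; this flag, the `Word` class, and the function name are hypothetical and only one possible reading of the word information.

```python
from dataclasses import dataclass
from typing import List, Optional

PRERECORDED = "pre-recorded"
RULE_BASED = "rule-based"


@dataclass
class Word:
    text: str
    has_recording: bool                # pre-recorded speech exists for this word
    starts_breath_group: bool = False  # hypothetical flag from language analysis


def select_with_breath_group_reset(words: List[Word]) -> List[str]:
    """Keep the preceding word's method, but reset it at each breath group."""
    methods: List[str] = []
    previous: Optional[str] = None
    for word in words:
        if word.starts_breath_group:
            previous = None  # reset: pre-recorded speech regains priority
        if word.has_recording and previous in (None, PRERECORDED):
            method = PRERECORDED
        else:
            method = RULE_BASED
        methods.append(method)
        previous = method
    return methods


if __name__ == "__main__":
    sentence = [
        Word("A", True),
        Word("B", False),
        Word("C", True),
        Word("D", True, starts_breath_group=True),
        Word("E", True),
    ]
    print(select_with_breath_group_reset(sentence))
```

Without the reset, the words "D" and "E" in this example would remain rule-based because "B" has no recording; with the reset, they return to pre-recorded speech at the breath group.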
The second embodiment has exemplified the case in which a single piece of pre-recorded speech data corresponds to a word of interest. However, the present invention is not limited to this, and a plurality of pieces of pre-recorded speech data can exist for one word. In this case, two kinds of concatenation distortion are calculated with respect to the immediately preceding synthetic speech: the distortion for the synthesis candidate speech obtained by applying rule-based synthesis to the pronunciation of the word, and the distortion for each synthesis candidate speech obtained by applying pre-recorded-speech-based synthesis to one of the plurality of pieces of pre-recorded speech data. Among these synthesis candidate speech data, the synthesis candidate speech exhibiting the minimum concatenation distortion is selected. Preparing a plurality of pieces of pre-recorded speech data for one word is an effective method from the viewpoint of versatility and a reduction in concatenation distortion.
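The selection among one rule-based candidate and several pre-recorded candidates can be sketched as follows. The distortion measure used here, a Euclidean distance between a feature vector at the end of the preceding speech and one at the start of each candidate, is only a stand-in chosen for illustration; the `Candidate` class and its fields are likewise hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    label: str                        # e.g. "rule-based" or "recording-1"
    start_features: Sequence[float]   # hypothetical features at the candidate's start


def concatenation_distortion(
    preceding_end: Sequence[float], candidate_start: Sequence[float]
) -> float:
    """Stand-in distortion: Euclidean distance across the concatenation point."""
    return math.dist(preceding_end, candidate_start)


def pick_candidate(
    preceding_end: Sequence[float], candidates: List[Candidate]
) -> Candidate:
    """Return the candidate with the minimum concatenation distortion."""
    return min(
        candidates,
        key=lambda c: concatenation_distortion(preceding_end, c.start_features),
    )


if __name__ == "__main__":
    preceding_end = [1.0, 2.0]
    candidates = [
        Candidate("rule-based", [1.5, 2.5]),
        Candidate("recording-1", [3.0, 3.0]),
        Candidate("recording-2", [1.1, 2.0]),
    ]
    print(pick_candidate(preceding_end, candidates).label)
```

With more recordings per word, the chance that one of them joins smoothly to the preceding speech increases, which is the reduction in concatenation distortion referred to above.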
In the third embodiment, the selection criterion is “to minimize the sum of the number of times of switching of speech synthesis methods and the number of times of concatenation of synthesis candidate speech”. However, the present invention is not limited to this. For example, it suffices to use a known selection criterion such as a criterion for concatenation distortion minimization like that used in the second embodiment or introduce an arbitrary selection criterion.
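The criterion quoted above can be sketched as a small dynamic program, under the following assumptions that are not taken from the embodiments: each synthesis candidate is a segment `(start, end, method)` covering `words[start:end]`, every join between two consecutive segments counts as one concatenation, and every change of method between consecutive segments counts as one switch. The function name `best_sequence` is hypothetical.

```python
from typing import Dict, List, Tuple

PRERECORDED = "pre-recorded"
RULE_BASED = "rule-based"

Segment = Tuple[int, int, str]  # (start, end, method), covering words[start:end]


def best_sequence(n_words: int, segments: List[Segment]) -> Tuple[int, List[Segment]]:
    """Minimize (number of switches + number of concatenations) over the text."""
    # best[(pos, method)] = (cost, path) for covering words[:pos] when the
    # last segment of the path uses `method`.
    best: Dict[Tuple[int, str], Tuple[int, List[Segment]]] = {}
    # Sorting by end position guarantees that all states at a segment's start
    # are final before the segment itself is processed (start < end).
    for seg in sorted(segments, key=lambda s: s[1]):
        start, end, method = seg
        if start == 0:
            options = [(0, [seg])]
        else:
            options = []
            for (pos, prev_method), (cost, path) in list(best.items()):
                if pos == start:
                    switch = 0 if prev_method == method else 1
                    options.append((cost + 1 + switch, path + [seg]))
        if not options:
            continue  # this segment cannot be reached from the start word
        candidate = min(options, key=lambda o: o[0])
        key = (end, method)
        if key not in best or candidate[0] < best[key][0]:
            best[key] = candidate
    finals = [value for (pos, _), value in best.items() if pos == n_words]
    if not finals:
        raise ValueError("no sequence of segments covers the whole text")
    return min(finals, key=lambda v: v[0])


if __name__ == "__main__":
    # Five words: every word can be synthesized by rule; a pre-recorded phrase
    # covers words 0 to 2, and a pre-recorded word covers word 4.
    segments = [(i, i + 1, RULE_BASED) for i in range(5)]
    segments += [(0, 3, PRERECORDED), (4, 5, PRERECORDED)]
    print(best_sequence(5, segments))
```

In this example the pre-recorded phrase is used for the first three words, and the last word stays rule-based because switching back to pre-recorded speech would add one more switch than it saves.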
The fourth embodiment has exemplified the case in which when pre-recorded-speech-based synthetic speech exists, text rule-based synthetic speech is not set as synthesis candidate speech, as shown in
Note that the present invention can be applied to an apparatus comprising a single device or to a system constituted by a plurality of devices.
Furthermore, the invention can be implemented by supplying a software program, which implements the functions of the foregoing embodiments, directly or indirectly to a system or apparatus, reading the supplied program code with a computer of the system or apparatus, and then executing the program code. In this case, so long as the system or apparatus has the functions of the program, the mode of implementation need not rely upon a program.
Accordingly, since the functions of the present invention can be implemented by a computer, the program code installed in the computer also implements the present invention. In other words, the claims of the present invention also cover a computer program for the purpose of implementing the functions of the present invention.
In this case, so long as the system or apparatus has the functions of the program, the program may be executed in any form, such as an object code, a program executed by an interpreter, or script data supplied to an operating system.
Examples of storage media that can be used for supplying the program are a floppy disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a CD-RW, a magnetic tape, a non-volatile memory card, a ROM, and a DVD (a DVD-ROM and a DVD-R).
As for the method of supplying the program, a client computer can be connected to a website on the Internet using a browser of the client computer, and the computer program of the present invention or an automatically-installable compressed file of the program can be downloaded to a recording medium such as a hard disk. Further, the program of the present invention can be supplied by dividing the program code constituting the program into a plurality of files and downloading the files from different websites. In other words, a WWW (World Wide Web) server that downloads, to multiple users, the program files that implement the functions of the present invention by computer is also covered by the claims of the present invention.
It is also possible to encrypt and store the program of the present invention on a storage medium such as a CD-ROM, distribute the storage medium to users, allow users who meet certain requirements to download decryption key information from a website via the Internet, and allow these users to decrypt the encrypted program by using the key information, whereby the program is installed in the user computer.
Besides the cases where the aforementioned functions according to the embodiments are implemented by executing the read program by computer, an operating system or the like running on the computer may perform all or a part of the actual processing so that the functions of the foregoing embodiments can be implemented by this processing.
Furthermore, after the program read from the storage medium is written to a function expansion board inserted into the computer or to a memory provided in a function expansion unit connected to the computer, a CPU or the like mounted on the function expansion board or function expansion unit performs all or a part of the actual processing so that the functions of the foregoing embodiments can be implemented by this processing.
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.
This application claims the benefit of Japanese Patent Application No. 2007-065780, filed Mar. 14, 2007, which is hereby incorporated by reference herein in its entirety.
Foreign application priority data

Number | Date | Country | Kind |
---|---|---|---|
2007-065780 | Mar 2007 | JP | national |

U.S. patent documents cited

Number | Name | Date | Kind |
---|---|---|---|
5745651 | Otsuka et al. | Apr 1998 | A |
5913193 | Huang et al. | Jun 1999 | A |
5930755 | Cecys | Jul 1999 | A |
6175821 | Page et al. | Jan 2001 | B1 |
6253182 | Acero | Jun 2001 | B1 |
6266637 | Donovan et al. | Jul 2001 | B1 |
6308156 | Barry et al. | Oct 2001 | B1 |
6345250 | Martin | Feb 2002 | B1 |
6980955 | Okutani et al. | Dec 2005 | B2 |
7039588 | Okutani et al. | May 2006 | B2 |
7054814 | Okutani et al. | May 2006 | B2 |
7260533 | Kamanaka | Aug 2007 | B2 |
7277855 | Acker et al. | Oct 2007 | B1 |
7742921 | Davis et al. | Jun 2010 | B1 |
20020103648 | Case et al. | Aug 2002 | A1 |
20020193996 | Squibbs et al. | Dec 2002 | A1 |
20030158734 | Cruickshank | Aug 2003 | A1 |
20030177010 | Locke | Sep 2003 | A1 |
20030187651 | Imatake | Oct 2003 | A1 |
20030229496 | Yamada et al. | Dec 2003 | A1 |
20040254792 | Busayapongchai et al. | Dec 2004 | A1 |
20050114137 | Saito et al. | May 2005 | A1 |
20050137870 | Mizutani et al. | Jun 2005 | A1 |
20050209855 | Okutani et al. | Sep 2005 | A1 |
20050288929 | Kuboyama et al. | Dec 2005 | A1 |
20080270140 | Hertz et al. | Oct 2008 | A1 |

Foreign patent documents cited

Number | Date | Country |
---|---|---|
1 511 008 | Feb 2005 | EP |
2002-221980 | Aug 2002 | JP |

Prior publication data

Number | Date | Country |
---|---|---|
20080228487 A1 | Sep 2008 | US |