This application is the National Phase of PCT/JP2007/051669, filed Feb. 1, 2007, which claims priority to Japanese Application No. 2006-031442, filed Feb. 8, 2006, the disclosures of which are hereby incorporated by reference in their entirety.
The present invention relates to a speech synthesizing technology, and more particularly to a speech synthesizing device, a speech synthesizing method, and a speech synthesizing program for synthesizing a speech from text.
Recent advances in the sophistication and downsizing of computers allow speech synthesizing technology to be installed and used in various devices such as car navigation devices, mobile phones, PCs (personal computers), and robots. As the technology spreads across these devices, speech synthesizing devices come to be used in a wide variety of environments.
In a conventional, commonly-used speech synthesizing device, the processing result of prosody (for example, pitch frequency pattern, amplitude, duration time length) generation, unit waveform (for example, waveform having the length of about pitch length or syllabic sound time length extracted from a natural speech) selection, and waveform generation is basically determined uniquely for a phonetic symbol sequence (text analysis result including reading, syntax/part-of-speech information, accent type, etc.). That is, a speech synthesizing device always performs speech synthesizing in the same utterance form (volume, phonation speed, prosody, and voice tone of a voice) in any situation or environment.
However, actual observation of human phonation indicates that, even when the same text is spoken, the utterance form is controlled by the speaker's situation, emotion, or intention. Therefore, a conventional speech synthesizing device, which always uses the same utterance form, does not necessarily make the best use of the characteristics of speech as a communication medium.
To solve the problem with a speech synthesizing device like this, an attempt is made to generate a synthesized speech suited for the user environment and to improve the user's usability by dynamically changing the prosody generation and the unit waveform selection according to the user environment (situation and environment of the place where the user of the speech synthesizing device is present). For example, Patent Document 1 discloses the configuration of a speech synthesizing system that selects the control rule for the prosody and phoneme according to the information indicating the light level of the user environment or the user's position.
Patent Document 2 discloses the configuration of a speech synthesizing device that controls the consonant power, pitch frequency, and sampling frequency based on the power spectrum and frequency distribution information on the ambient noises.
In addition, Patent Document 3 discloses the configuration of a speech synthesizing device that controls the phonation speed, pitch frequency, sound volume, and voice quality based on various types of clocking information including the time of day, date, and day of week.
Non-Patent Documents 1-3 that disclose the music signal analysis and search method, which constitute the background technology of the present invention, are given below. Non-Patent Document 1 discloses a genre estimation method that analyzes the short-time amplitude spectrum and the discrete wavelet conversion coefficients of music signals to find musical characteristics (instrument configuration, rhythm structure) for estimating the musical genre.
Non-Patent Document 2 discloses a genre estimation method that estimates the musical genre from the mel-frequency cepstrum coefficients of the music signal using the tree-structured vector quantization method.
Non-Patent Document 3 discloses a method that calculates the similarity using the spectrum histograms for retrieving the musical signal.
Patent Document 1:
Japanese Patent No. 3595041
Patent Document 2:
Japanese Patent Kokai Publication JP-A-11-15495
Patent Document 3:
Japanese Patent Kokai Publication JP-A-11-161298
Non-Patent Document 1:
Tzanetakis, Essl, Cook: “Automatic Musical Genre Classification of Audio Signals”, Proceedings of ISMIR 2001, pp. 205-210, 2001.
Non-Patent Document 2:
Hoashi, Matsumoto, Inoue: “Personalization of User Profiles for Content-based Music Retrieval Based on Relevance Feedback”, Proceedings of ACM Multimedia 2003, pp. 110-119, 2003.
Non-Patent Document 3:
Kimura et al.: “High-Speed Retrieval of Audio and Video In Which Global Branch Removal Is Introduced”, Journal of The Institute of Electronics, Information and Communication Engineers, D-II, Vol. J85-D-II, No. 10, pp. 1552-1562, October, 2002
To attract the attention of an audience or to impress a message on an audience, background music (hereinafter called BGM) is usually played with a natural speech. For example, BGM is played in the background of a narration in many news programs and information providing programs on TV or radio.
The analysis of those programs indicates that the BGM, and especially the musical genre to which it belongs, is selected according to the utterance form of the speaker, and that the speaker in turn speaks with consideration for the BGM. For example, in a weather forecast program or a traffic information program, the speaker usually speaks in an even tone while gentle-melody BGM, such as easy listening music, plays in the background. Meanwhile, the announcer sometimes speaks the same contents in a lively voice in a special program or a live program.
Blues music is used as the BGM when a poem is read aloud sadly, and the speaker reads the poem aloud emotionally. In addition, a relation can be observed in which religious music is selected to produce a mystic atmosphere and pop music is selected for a bright way of speaking.
Meanwhile, a speech synthesizing device is used in a variety of environments as described above, and a synthesized speech is more often output in a place (a user environment) where various types of music, including the BGM described above, are reproduced. Nevertheless, conventional speech synthesizing devices, including those described in Patent Document 1 and the other documents, have the problem that the utterance form does not match the ambient music, because the music playing in the user environment cannot be taken into consideration in controlling the utterance form of a synthesized speech.
In view of the foregoing, it is an object of the present invention to provide a speech synthesizing device, a speech synthesizing method, and a program capable of synthesizing a speech that matches the music playing in a user environment.
According to a first aspect of the present invention, there is provided a speech synthesizing device that automatically selects an utterance form according to music reproduced in a user environment. More specifically, the speech synthesizing device comprises an utterance form selection unit that analyzes a music signal reproduced in a user environment and determines an utterance form that matches an analysis result of the music signal; and a speech synthesizing unit that synthesizes a speech according to the utterance form.
According to a second aspect of the present invention, there is provided a speech synthesizing method that generates a synthesized speech using a speech synthesizing device, wherein the method comprises a step for analyzing, by the speech synthesizing device, a received music signal reproduced in a user environment and determining an utterance form that matches an analysis result of the music signal; and a step for synthesizing, by the speech synthesizing device, a speech according to the utterance form.
According to a third aspect of the present invention, there is provided a program and a recording medium storing therein the program wherein the program causes a computer, which constitutes a speech synthesizing device, to execute processing for analyzing a received music signal reproduced in a user environment and determining an utterance form, which matches an analysis result of the music signal, from utterance forms prepared in advance; and processing for synthesizing a speech according to the utterance form.
According to the present invention, a synthesized speech can be generated in an utterance form that matches the music, such as the BGM, in the user environment. As a result, a synthesized speech can be output that attracts the user's attention, or that neither spoils the atmosphere of the BGM nor breaks the mood of the user listening to the BGM.
Next, the preferred mode for carrying out the present invention will be described in detail with reference to the drawings.
The prosody generation unit 11 is processing means for generating prosody information from the prosody generation rule, selected based on an utterance form, and a phonetic symbol sequence.
The unit waveform selection unit 12 is processing means for selecting a unit waveform, based on a phonetic symbol sequence and prosody information, from the unit waveform data selected according to an utterance form.
The waveform generation unit 13 is processing means for generating a synthesized speech waveform from prosody information and unit waveform data.
The prosody generation rule (for example, pitch frequency pattern, amplitude, duration time length, etc.), required for producing a synthesized speech in each utterance form, is saved in the prosody generation rule storage units 151 to 15N.
As in the prosody generation rule storage units, unit waveform data (for example, waveform having the length of about pitch length or syllabic sound time length extracted from a natural speech), required for producing a synthesized speech in each utterance form, is saved in the unit waveform data storage units 161 to 16N.
The prosody generation rules and the unit waveform data, which should be saved in the prosody generation rule storage units 151 to 15N and the unit waveform data storage units 161 to 16N, can be generated by collecting and analyzing the natural speeches that match the utterance forms.
In the description of the embodiments given below, the following assignment is assumed. The prosody generation rule and the unit waveform data generated from a loud voice, and required for producing a loud voice, are saved in the prosody generation rule storage unit 151 and the unit waveform data storage unit 161. Those generated from a composed voice, and required for producing a composed voice, are saved in the prosody generation rule storage unit 152 and the unit waveform data storage unit 162. Those generated from a low voice are saved in the prosody generation rule storage unit 153 and the unit waveform data storage unit 163, and those generated from a moderate voice are saved in the prosody generation rule storage unit 15N and the unit waveform data storage unit 16N. The method for generating the prosody generation rule and the unit waveform data from a natural speech does not depend on the utterance form; the same method as that used for a moderate voice can be applied to the other utterance forms.
The musical genre estimation unit 21 is processing means for estimating a musical genre to which a received music signal belongs.
The utterance form selection unit 23 is processing means for determining an utterance form from an estimated musical genre, based on the table saved in the utterance form information storage unit 24.
The table, shown in
Conversely, another configuration is also possible in which only the relation between a musical genre and an utterance form is defined in the utterance form information storage unit 24 and, for the correspondence among an utterance form, a prosody generation rule, and unit waveform data, the prosody generation unit 11 and the unit waveform selection unit 12 themselves select the prosody generation rule and the unit waveform data according to the utterance form.
Although many utterance forms are prepared in the example shown in
In addition, the correspondence between musical genre information and an utterance form defined in the utterance form information storage unit 24 described above may be changed to suit the user's preference or may be selected from the combinations of multiple correspondences, prepared in advance, to suit the user's preference.
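The table lookup performed by the utterance form selection unit 23 can be sketched as follows. This is only a minimal illustration, not the table from the figure: the genre names, the pairing of each genre with an utterance form, the value chosen for N, and the names UtteranceForm, UTTERANCE_FORM_TABLE, and select_utterance_form are all hypothetical. Only the four utterance forms and their storage unit numbering follow the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UtteranceForm:
    name: str
    storage_unit: int  # k selects storage units 15k and 16k


N = 4  # hypothetical number of utterance forms prepared in advance

LOUD = UtteranceForm("loud voice", 1)
COMPOSED = UtteranceForm("composed voice", 2)
LOW = UtteranceForm("low voice", 3)
MODERATE = UtteranceForm("moderate voice", N)

# Hypothetical genre-to-form correspondence; a user could replace this table.
UTTERANCE_FORM_TABLE = {
    "rock": LOUD,
    "easy listening": COMPOSED,
    "blues": LOW,
    "others": MODERATE,
}


def select_utterance_form(genre, table=UTTERANCE_FORM_TABLE):
    """Return the utterance form for a genre, falling back to the "others" entry."""
    return table.get(genre, table["others"])
```

Because the table is passed as an argument, swapping in a different correspondence to suit the user's preference only requires supplying another dictionary.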
Next, the following describes the operation of the speech synthesizing device in this embodiment in detail with reference to the drawings.
If there is no BGM, or if the genre of the received music is not one of the anticipated genres, “others” is output to the utterance form selection unit 23 as the musical genre instead of a specific genre name.
Next, the utterance form selection unit 23 selects the corresponding utterance form from the table (see
According to
Next, the prosody generation unit 11 references the utterance form parameter supplied from the utterance form selection unit 23 and selects the prosody generation rule storage unit, which has the storage unit number specified by the utterance form selection unit 23, from the prosody generation rule storage units 151 to 15N. After that, based on the prosody generation rule in the selected prosody generation rule storage unit, the prosody generation unit 11 generates prosody information from the received phonetic symbol sequence and sends the generated prosody information to the unit waveform selection unit 12 and the waveform generation unit 13 (step A3).
Next, the unit waveform selection unit 12 references the utterance form parameter sent from the utterance form selection unit 23 and selects the unit waveform data storage unit, which has the storage unit number specified by the utterance form selection unit 23, from the unit waveform data storage units 161 to 16N. After that, based on the received phonetic symbol sequence and the prosody information supplied from the prosody generation unit 11, the unit waveform selection unit 12 selects a unit waveform from the selected unit waveform data storage unit, and sends the selected unit waveform to the waveform generation unit 13 (step A4).
Finally, based on the prosody information sent from the prosody generation unit 11, the waveform generation unit 13 connects the unit waveform, supplied from the unit waveform selection unit 12, and outputs the synthesized speech signal (step A5).
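Steps A3 to A5 can be sketched in a heavily simplified form as follows. The sketch assumes that prosody is reduced to one amplitude value per phonetic symbol and that each symbol has exactly one stored unit waveform, so the unit selection does not consult the prosody information as the device described here does. The function name synthesize and the data layout are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def synthesize(
    phonetic_symbols: Sequence[str],
    storage_unit: int,
    prosody_rules: Dict[int, Callable[[Sequence[str]], List[float]]],
    unit_waveforms: Dict[int, Dict[str, List[float]]],
) -> List[float]:
    """Toy pipeline: the storage unit number picks both the rule and the waveform data."""
    # Step A3: select the prosody generation rule and generate prosody information
    # (here, one amplitude per symbol).
    prosody = prosody_rules[storage_unit](phonetic_symbols)
    # Step A4: select unit waveforms from the storage unit with the same number.
    inventory = unit_waveforms[storage_unit]
    units = [inventory[symbol] for symbol in phonetic_symbols]
    # Step A5: connect the unit waveforms, scaling each by its amplitude.
    signal: List[float] = []
    for gain, unit in zip(prosody, units):
        signal.extend(gain * sample for sample in unit)
    return signal
```

The point of the sketch is that a single storage unit number, supplied as part of the utterance form parameter, keeps the prosody generation rule and the unit waveform data consistent with each other.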
As described above, a synthesized speech can be generated in this embodiment in the utterance form produced by the prosody and the unit waveform that match the BGM in the user environment.
Although the embodiment described above has the configuration in which the unit waveform data storage units 161 to 16N are prepared, one for each utterance form, another configuration is also possible in which the unit waveform data storage unit is provided only for the moderate voice. In this case, though the utterance form is controlled only by the prosody generation rule, this configuration has the advantage of significantly reducing the storage capacity of the whole synthesizing device because the size of the unit waveform data is larger than that of other data such as the prosody generation rule.
In the first embodiment described above, the power of the synthesized speech is not controlled; the synthesized speech is assumed to have the same power whether it is output in a low voice or in a loud voice. As a result, depending upon the correspondence between the BGM and the utterance form, if the sound volume of the synthesized speech is much larger than that of the background music, the balance is lost and, in some cases, the speech is offensive to the ear. Conversely, if the sound volume of the synthesized speech is much smaller than that of the background music, not only is the balance lost but, in some cases, the synthesized speech also becomes difficult to hear.
A second embodiment of the present invention, in which an improvement is added to the above-described configuration in such a way that the power of the synthesized speech is controlled, will be described in detail below with reference to the drawings.
Referring to
The table, shown in
This power ratio is a value generated by dividing the power of the synthesized speech by the power of the music signal. That is, a power ratio higher than 1.0 indicates that the power of the synthesized speech is higher than the power of the music signal. For example, referring to
Next, the following describes the operation of the speech synthesizing device in this embodiment in detail with reference to the drawings.
When the waveform generation is completed in step A5, the music signal power calculation unit 19 calculates the average power of the received music signal and sends the resulting value to the synthesized speech power adjustment unit 17 (step B1). The average power Pm(n) of the music signal can be calculated by linear leaky integration, such as Expression (1) given below, where n is the sample number of the signal and x(n) is the music signal.
$$P_m(n) = a\,P_m(n-1) + (1-a)\,x^2(n)$$ [Expression 1]
Note that a is the time constant of the linear leaky integration. Because the power is calculated in order to keep the difference between the sound volume of the synthesized speech and the average sound volume of the BGM from increasing, it is desirable that a be set to a large value, such as 0.9, so that a long-time average power is calculated. Conversely, if a is set to a small value, such as 0.1, the sound volume of the synthesized speech changes frequently and greatly and, as a result, the synthesized speech may become difficult to hear. Instead of the expression given above, it is also possible to use a moving average or the average of all samples of the received signal.
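The linear leaky integration of Expression 1 can be sketched as follows. The function name leaky_average_power and the initial value p0 = 0 are assumptions made for illustration; the text does not specify how the recursion is initialized.

```python
from typing import Iterable, List


def leaky_average_power(x: Iterable[float], a: float = 0.9, p0: float = 0.0) -> List[float]:
    """Return P(n) = a * P(n-1) + (1 - a) * x(n)**2 for every sample of x."""
    if not 0.0 <= a < 1.0:
        raise ValueError("a must satisfy 0 <= a < 1")
    powers: List[float] = []
    p = p0  # assumed initial power; not given in the text
    for sample in x:
        p = a * p + (1.0 - a) * sample * sample
        powers.append(p)
    return powers
```

With a close to 1 the estimate follows the signal slowly and approaches the long-time mean square; with a close to 0 it tracks the instantaneous squared sample, which is the frequent fluctuation the text warns against.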
Next, the synthesized speech power calculation unit 18 calculates the average power of the synthesized speech supplied from the waveform generation unit 13 and sends the calculated average power to the synthesized speech power adjustment unit 17 (step B2). The same method as that used in calculating the music signal power described above can be used also for the calculation of the synthesized speech power.
Finally, the synthesized speech power adjustment unit 17 adjusts the power of the synthesized speech signal supplied from the waveform generation unit 13, based on the music signal power supplied from the music signal power calculation unit 19, the synthesized speech power supplied from the synthesized speech power calculation unit 18, and the power ratio included in the utterance form parameters supplied from the utterance form selection unit 27, and outputs the result as the power-adjusted synthesized speech signal (step B3). More specifically, the synthesized speech power adjustment unit 17 adjusts the power of the synthesized speech so that the ratio between the power of the finally-output synthesized speech signal and the power of the music signal becomes closer to the power ratio value supplied from the utterance form selection unit 27.
Put concretely, the music signal power, the synthesized speech signal power, and the power ratio are used to calculate the power adjustment coefficient by which the synthesized speech signal is multiplied. The power adjustment coefficient must therefore be a value that makes the ratio of the power of the power-adjusted synthesized speech to the power of the music signal almost equal to the power ratio supplied from the utterance form selection unit 27. The power adjustment coefficient c is given by the following expression where Pm is the music signal power, Ps is the synthesized speech power, and r is the power ratio.
The power-adjusted synthesized speech signal y2(n) is given by the following expression where y1(n) is the synthesized speech signal before the adjustment.
$$y_2(n) = c\,y_1(n)$$ [Expression 3]
As described above, more flexible control is possible in which the synthesized speech is output slightly louder than the moderate voice when a loud voice is selected and slightly quieter when a low voice is selected. In this way, it is possible to implement an utterance form that ensures a good balance between the synthesized speech and the BGM.
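The power adjustment of step B3 can be sketched as follows. The expression for c is not reproduced in this text, so the form used here is an assumption derived from the stated definitions: multiplying the signal by c multiplies its power by c squared, and requiring c² · Ps / Pm = r gives c = sqrt(r · Pm / Ps). The function names and the guard for Ps = 0 are hypothetical.

```python
import math
from typing import List, Sequence


def power_adjustment_coefficient(pm: float, ps: float, r: float) -> float:
    """Coefficient c such that (c**2 * ps) / pm equals the target power ratio r."""
    if ps <= 0.0:
        return 1.0  # hypothetical guard: leave a silent speech signal unscaled
    return math.sqrt(r * pm / ps)


def adjust_power(y1: Sequence[float], c: float) -> List[float]:
    """Apply y2(n) = c * y1(n) to every sample."""
    return [c * sample for sample in y1]
```

For example, with Pm = 4.0, Ps = 1.0, and r = 1.0 the coefficient is 2.0, so the adjusted speech has the same average power as the music. A practical device would also need a policy for the case in which no music is playing (Pm = 0), which this sketch does not address.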
Although the genre of the received music is estimated in the first and second embodiments described above, it is also possible to use recently-introduced search and checking methods to analyze the received music more accurately. A third embodiment of the present invention, in which the above-described improvement is added, will be described in detail below with reference to the drawings.
Referring to
The music attribute information search unit 31 is processing means for extracting the characteristic amount, such as a spectrum, from the received music signal. The characteristic amounts of various music signals and the musical genres of those music signals are recorded individually in the music attribute information storage unit 32 so that music can be identified, and its genre can be determined, by checking the characteristic amount.
To search for the music signal using the characteristic amount described above, the method for calculating the similarity in the spectrum histograms, described in Non-Patent Document 3, can be used.
Next, the following describes the operation of the speech synthesizing device in this embodiment in detail with reference to the drawings.
First, the music attribute information search unit 31 extracts the characteristic amount, such as a spectrum, from the received music signal. Next, the music attribute information search unit 31 calculates the similarity between the characteristic amount of the received music signal and each of the characteristic amounts of the music saved in the music attribute information storage unit 32. After that, the musical genre information on the music having the highest similarity is sent to the utterance form selection unit 23 (step D1).
If the maximum of the similarity is lower than the pre-set threshold in step D1, the music attribute information search unit 31 determines that the music corresponding to the received music signal is not recorded in the music attribute information storage unit 32 and outputs “others” as the musical genre.
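The search in step D1, including the threshold test, can be sketched as follows. The sketch uses histogram intersection over normalized histograms as a stand-in similarity measure; it does not reproduce the exact method of Non-Patent Document 3. The function names, the data layout, and the threshold values are hypothetical.

```python
from typing import List, Sequence, Tuple


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity of two normalized histograms: the sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def search_genre(
    query: Sequence[float],
    database: List[Tuple[Sequence[float], str]],
    threshold: float,
) -> str:
    """Return the genre of the most similar stored music, or "others" below the threshold."""
    best_genre, best_similarity = "others", float("-inf")
    for features, genre in database:
        similarity = histogram_intersection(query, features)
        if similarity > best_similarity:
            best_similarity, best_genre = similarity, genre
    # A maximum similarity lower than the threshold means the music is not recorded.
    return best_genre if best_similarity >= threshold else "others"
```

A linear scan such as this grows with the number of stored pieces, which is one reason an externally hosted storage unit may be attractive when many pieces are registered.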
As described above, because this embodiment uses the music attribute information storage unit 32 in which a musical genre is recorded individually for each piece of music, this embodiment can identify a musical genre more accurately than the first and second embodiments described above and can reflect the genre in the utterance form.
The attribute information such as a title, an artist name, and a composer's name, if stored when the music attribute information storage unit 32 is built, allows the utterance form to be determined also by the attribute information other than the musical genre.
When a larger number of music types are stored in the music attribute information storage unit 32, the genres of more music signals can be identified but the capacity of the music attribute information storage unit 32 becomes larger. It is also possible to use a configuration as necessary in which, with the music attribute information storage unit 32 installed outside the speech synthesizing device, wired or wireless communication means is used to access the music attribute information storage unit 32 for calculating the similarity of the characteristic amount of the music signal.
Next, a fourth embodiment of the present invention, in which the reproduction function of music, such as BGM, is added to the speech synthesizing device in the first embodiment described above, will be described in detail below with reference to the drawings.
Music signals as well as the music numbers and musical genres of the music are saved in the music data storage unit 37. The music reproduction unit 35 is means for outputting music signals, saved in the music data storage unit 37, via a speaker or an earphone according to a music number, a sound volume, and reproduction commands such as reproduction, stop, rewind, and fast-forward. The music reproduction unit 35 supplies the music number of the music being reproduced to the reproduced music information acquisition unit 36.
The reproduced music information acquisition unit 36 is processing means, equivalent to the musical genre estimation unit 21 in the first embodiment, that acquires the musical genre information, corresponding to a music number supplied from the music reproduction unit 35, from the music data storage unit 37 and sends the retrieved information to the utterance form selection unit 23.
Next, the following describes the operation of the speech synthesizing device in this embodiment in detail with reference to the drawings.
When the music reproduction unit 35 reproduces specified music, the music number is supplied to the reproduced music information acquisition unit 36 (step D2).
The reproduced music information acquisition unit 36 acquires the genre information on the music, corresponding to the music number supplied from the music reproduction unit 35, from the music data storage unit 37 and sends it to the utterance form selection unit 23 (step D3).
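Steps D2 and D3 amount to a keyed lookup, which can be sketched as follows. The names MusicRecord and acquire_genre and the data layout are hypothetical, and returning "others" when no genre is recorded is an assumption made only to keep the sketch self-contained.

```python
from typing import Dict, NamedTuple, Optional


class MusicRecord(NamedTuple):
    title: str
    genre: Optional[str]  # None when no genre is recorded for the music


def acquire_genre(music_number: int, music_data: Dict[int, MusicRecord]) -> str:
    """Look up the genre of the music being reproduced by its music number."""
    record = music_data.get(music_number)
    if record is None or record.genre is None:
        return "others"  # assumed fallback; not specified for this embodiment
    return record.genre
```

Because the genre is read directly from stored data, no signal analysis is involved and the result does not depend on estimation accuracy.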
This embodiment eliminates the need for the estimation processing and the search processing of a musical genre and allows the musical genre of the BGM, which is being reproduced, to be reliably identified. Of course, if the music reproduction unit 35 can acquire the genre information on the music, which is being reproduced, directly from the music data storage unit 37, another configuration is also possible in which there is no reproduced music information acquisition unit 36 and the musical genre is supplied directly from the music reproduction unit 35 to the utterance form selection unit 23.
If musical genre information is not recorded in the music data storage unit 37, another configuration is also possible in which the musical genre is estimated using the musical genre estimation unit 21 instead of the reproduced music information acquisition unit 36.
If music attribute information other than genres is recorded in the music data storage unit 37, it is also possible to change the utterance form selection unit 23 and the utterance form information storage unit 24 so that the utterance form can be determined by the attribute information other than genres, as described in the third embodiment above.
While the embodiments of the present invention have been described, the technical scope of the present invention is not limited to the embodiments described above but various modifications may be added, or an equivalent may be used, according to the use and the specifications of the speech synthesizing device.
Number | Date | Country | Kind |
---|---|---|---|
2006-031442 | Feb 2006 | JP | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2007/051669 | 2/1/2007 | WO | 00 | 8/7/2008 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2007/091475 | 8/16/2007 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
6424944 | Hikawa | Jul 2002 | B1 |
6446040 | Socher et al. | Sep 2002 | B1 |
6731307 | Strubbe et al. | May 2004 | B1 |
6915261 | Barile | Jul 2005 | B2 |
6990453 | Wang et al. | Jan 2006 | B2 |
7203647 | Hirota et al. | Apr 2007 | B2 |
7365260 | Kawashima | Apr 2008 | B2 |
7603280 | Hirota et al. | Oct 2009 | B2 |
7684991 | Stohr et al. | Mar 2010 | B2 |
20030046076 | Hirota et al. | Mar 2003 | A1 |
20100145702 | Karmarkar | Jun 2010 | A1 |
Number | Date | Country |
---|---|---|
1061863 | Jun 1992 | CN |
5-307395 | Nov 1993 | JP |
08-037700 | Feb 1996 | JP |
8-328576 | Dec 1996 | JP |
10-20885 | Jan 1998 | JP |
11-15488 | Jan 1999 | JP |
11-015495 | Jan 1999 | JP |
11-161298 | Jun 1999 | JP |
2001-309498 | Nov 2001 | JP |
2003-058198 | Feb 2003 | JP |
3595041 | Sep 2004 | JP |
2004-361874 | Dec 2004 | JP |
2005-077663 | Mar 2005 | JP |
2007-86316 | Apr 2007 | JP |
WO 9953612 | Jan 1999 | WO |
WO 0237474 | May 2002 | WO |
Number | Date | Country | |
---|---|---|---|
20100145706 A1 | Jun 2010 | US |