This application is based on Japanese Patent Application No. 2002-054487, filed on Feb. 28, 2002, the entire contents of which are incorporated herein by reference.
A) Field of the Invention
This invention relates to a singing voice synthesizing apparatus, a singing voice synthesizing method, and a singing voice synthesizing program for synthesizing a human singing voice.
B) Description of the Related Art
In a conventional singing voice synthesizing apparatus, data obtained from an actual human singing voice is stored as a database, and data that agrees with the contents of input performance data (musical notes, lyrics, expression, and the like) is chosen from the database. Then, a singing voice that is close to a real human singing voice is synthesized by converting the performance data based on the chosen data.
The principle of this singing voice synthesizing is explained in Japanese Patent Application No. 2001-67258, which was filed by the applicant of the present invention, with reference to the drawings.
The principle of the singing voice synthesizing apparatus described in Japanese Patent Application No. 2001-67258, which uses a timbre template, a stationary template, and an articulation template, is shown in the drawings.
The characteristic parameter is generated by applying these templates as follows.
That is, the long sound part is synthesized by adding the changing component included in the stationary template to the characteristic parameter obtained from the timbre template.
On the other hand, the transition part is synthesized by adding the changing component included in the articulation template to a characteristic parameter obtained from the timbre template; however, the characteristic parameter to which the component is added differs depending on the case. For example, in a case where the phonemes before and after the transition part are both voiced sounds, the changing component included in the articulation template is added to a characteristic parameter obtained by linear interpolation of the characteristic parameter of the preceding phoneme and the characteristic parameter of the following phoneme. In a case where the preceding phoneme is a voiced sound and the following phoneme is a silence, the changing component included in the articulation template is added to the characteristic parameter of the preceding phoneme. In a case where the preceding phoneme is a silence and the following phoneme is a voiced sound, the changing component included in the articulation template is added to the characteristic parameter of the following phoneme. In this way, in the singing voice synthesizing apparatus disclosed in Japanese Patent Application No. 2001-67258, the characteristic parameter generated from the timbre template serves as a standard, and the singing voice is synthesized by changing the characteristic parameter of the articulation part so that it agrees with the characteristic parameter of the timbre part.
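The case analysis of the related art described above can be summarized by the following sketch; the function and parameter names are hypothetical and only illustrate how the base characteristic parameter, to which the changing component of the articulation template is added, is chosen.

```python
def prior_art_transition_base(front_param, rear_param, front_voiced, rear_voiced, ratio):
    """Choose the base characteristic parameter for a transition part (related art).

    front_param / rear_param: characteristic parameter values (e.g. a formant
    frequency) taken from the timbre templates of the front and rear phonemes.
    ratio: position inside the transition part, 0.0 (front) .. 1.0 (rear).
    """
    if front_voiced and rear_voiced:
        # Both phonemes voiced: linear interpolation of the two timbre parameters.
        return (1.0 - ratio) * front_param + ratio * rear_param
    if front_voiced and not rear_voiced:
        # Rear phoneme is a silence: use the front phoneme's parameter.
        return front_param
    if not front_voiced and rear_voiced:
        # Front phoneme is a silence: use the rear phoneme's parameter.
        return rear_param
    return 0.0  # a silence-to-silence transition carries no voiced parameter


# The changing component of the articulation template is then added to this base.
base = prior_art_transition_base(700.0, 300.0, True, True, ratio=0.25)
print(base)  # 600.0
```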
In the singing voice synthesizing apparatus disclosed in Japanese Patent Application No. 2001-67258, there were cases where the synthesized singing voice was unnatural. The causes are as follows:
the change in the characteristic parameter of the transition part differs from the change in the original transition part, because the change of the articulation template is itself altered; and
the long sound part always sounds the same regardless of the kind of the preceding phoneme, because the characteristic parameter of the long sound part is also calculated by adding the changing component of the stationary template to the characteristic parameter generated from the timbre template.
That is, in the singing voice synthesizing apparatus disclosed in Japanese Patent Application No. 2001-67258, there were cases where the synthesized singing voice was unnatural, because the parameters of the long sound part and the transition part were generated by addition based on the characteristic parameter of the timbre template, which is only a part of the whole song.
For example, in the conventional singing voice synthesizing apparatus, in a case of making a singer sing “saita”, the transitions between phonemes are not natural, and the synthesized singing voice sounds unnatural. In some cases, it cannot even be judged what the synthesized singing voice is singing.
That is, in a singing voice, for example, in a case of singing “saita”, it is pronounced without partitions between the phonemes (“sa”, “i” and “ta”); it is normally pronounced by inserting a long sound part and a transition part between the phonemes, as “[#s], sa, (a), [ai], i, (i), [it], ta, (a)” (“#” represents a silence). In this example of “saita”, [#s], [ai] and [it] are the transition parts, and (a), (i) and (a) are the long sound parts. Therefore, when a singing voice is synthesized from performance data such as MIDI information, it is significant how realistically the transition parts and the long sound parts are generated.
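As an illustration of this decomposition, the following sketch (a hypothetical helper, not part of the disclosed apparatus) expands a phoneme sequence bounded by silences into alternating transition parts and long sound parts, reproducing the structure of the “saita” example above.

```python
def expand_to_units(phonemes):
    """Expand a phoneme sequence into transition parts [xy] and long sound parts (x).

    phonemes: e.g. ["#", "sa", "i", "ta", "#"] where "#" is a silence.
    Returns a list of unit labels such as "[#s]", "sa", "(a)", ...
    """
    # Hypothetical mapping from each syllable to its vowel and onset; "#" is silence.
    vowels = {"sa": "a", "i": "i", "ta": "a", "#": "#"}
    onsets = {"sa": "s", "i": "i", "ta": "t", "#": "#"}

    units = []
    for prev, cur in zip(phonemes, phonemes[1:]):
        units.append(f"[{vowels[prev]}{onsets[cur]}]")   # transition part
        if cur != "#":
            units.append(cur)                            # the phoneme itself
            units.append(f"({vowels[cur]})")             # long sound part
    return units


print(expand_to_units(["#", "sa", "i", "ta", "#"]))
# ['[#s]', 'sa', '(a)', '[ai]', 'i', '(i)', '[it]', 'ta', '(a)', '[a#]']
```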
It is an object of the present invention to provide a singing voice synthesizing apparatus that can naturally reproduce a transition part.
According to the present invention, the high naturalness of the synthesized singing voice in the transition part can be maintained.
According to one aspect of the present invention, there is provided a singing voice synthesizing apparatus, comprising: a storage device that stores singing voice information for synthesizing a singing voice; a phoneme database that stores articulation data of a transition part that includes an articulation for a transition from one phoneme to another phoneme, and stationary data of a long sound part that includes a stationary part where one phoneme is stably pronounced; a selecting device that selects data stored in the phoneme database in accordance with the singing voice information; a first outputting device that outputs a characteristic parameter of the transition part by extracting the characteristic parameter of the transition part from the articulation data selected by the selecting device; and a second outputting device that obtains the articulation data before and after the stationary data of a long sound part selected by the selecting device, generates a characteristic parameter of the long sound part by interpolating the two obtained articulation data, and outputs the generated characteristic parameter of the long sound part.
According to another aspect of the present invention, there is provided a singing voice synthesizing method, comprising the steps of: (a) storing, into a phoneme database, articulation data of a transition part that includes an articulation for a transition from one phoneme to another phoneme, and stationary data of a long sound part that includes a stationary part where one phoneme is stably pronounced; (b) inputting singing voice information for synthesizing a singing voice; (c) selecting data stored in the phoneme database in accordance with the singing voice information; (d) outputting a characteristic parameter of the transition part by extracting the characteristic parameter of the transition part from the articulation data selected in the step (c); and (e) obtaining the articulation data before and after the stationary data of a long sound part selected in the step (c), generating a characteristic parameter of the long sound part by interpolating the two obtained articulation data, and outputting the generated characteristic parameter of the long sound part.
According to the present invention, only the articulation template database 52 and the stationary template database 53 are used, and the timbre template is basically not necessary.
After the performance data is divided into the transition parts and the long sound parts, the articulation template is used without change in the transition parts. Therefore, the singing voice in the transition parts, which are significant parts of the song, sounds natural, and the quality of the synthesized singing voice is high.
Also, as for the long sound part, linear interpolation is executed on the characteristic parameters of the transition parts at both ends of the long sound, and a characteristic parameter is generated by adding the changing component included in the stationary template to the interpolated characteristic parameter. The singing voice does not become unnatural, because the interpolation is based on data and the template is used without change.
As shown in the drawing, a voice signal of an actual human singing voice is first analyzed into a deterministic component and a stochastic component.
Next, the voice signal is divided into phonemes by a phoneme dividing unit 32 based on phoneme dividing information. The phoneme dividing information is normally input by a human operator operating a predetermined switch while referring to the waveform of the voice signal.
Then, a characteristic parameter is extracted by a characteristic parameter extracting unit 33 from the deterministic component of the voice signal divided into phonemes. The characteristic parameters include an excitation waveform envelope, a formant frequency, a formant width, a formant intensity, a differential spectrum, and the like.
The excitation waveform envelope (ExcitationCurve) consists of EGain, which represents the magnitude of the vocal cord vibration waveform (dB); ESlopeDepth, which represents the depth from the maximum value to the minimum value of the spectrum envelope of the vocal cord vibration waveform (dB); and ESlope, which represents the slope of the spectrum envelope of the vocal cord vibration waveform. The ExcitationCurve can be expressed by the following equation (A):
ExcitationCurve(f)=EGain+ESlopeDepth*(exp(−ESlope*f)−1) (A)
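Equation (A) can be evaluated directly, as in the following sketch; the parameter values used are arbitrary examples, not values from the disclosure.

```python
import math

def excitation_curve(f_hz, e_gain, e_slope_depth, e_slope):
    """Equation (A): ExcitationCurve(f) = EGain + ESlopeDepth * (exp(-ESlope * f) - 1).

    At f = 0 the curve equals EGain; as f grows it falls toward EGain - ESlopeDepth,
    so ESlopeDepth is the max-to-min depth (dB) and ESlope controls how fast it falls.
    """
    return e_gain + e_slope_depth * (math.exp(-e_slope * f_hz) - 1.0)


# Arbitrary example values, purely for illustration.
for f in (0.0, 1000.0, 5000.0, 20000.0):
    print(f, round(excitation_curve(f, e_gain=60.0, e_slope_depth=40.0, e_slope=0.0005), 2))
```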
The excitation resonance represents chest resonance. It consists of three parameters: a central frequency (ERFreq), a bandwidth (ERBW), and an amplitude (ERAmp), and has the characteristics of a second-order filter.
The formants represent the vocal tract resonance by combining 1 to 12 resonances. Each formant consists of three parameters: a central frequency (FormantFreqi, where i is an integer from 1 to 12), a bandwidth (FormantBWi), and an amplitude (FormantAmpi).
The differential spectrum is a characteristic parameter that holds the difference between the original deterministic component and what can be expressed by the above three: the excitation waveform envelope, the excitation resonance, and the formants.
These characteristic parameters are stored in a phoneme database 10 in correspondence with the name of the phoneme. The stochastic component is also stored in the phoneme database 10 in correspondence with the name of the phoneme. In the phoneme database 10, the data is divided into articulation (phonemic chain) data and stationary data, as shown in the drawing.
The articulation data is a chain of data corresponding to a first phoneme name, a following phoneme name, the characteristic parameters, and the stochastic component.
On the other hand, the stationary data is a chain of data corresponding to one phoneme name, a chain of the characteristic parameters and the stochastic component.
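A minimal sketch of how the two kinds of voice synthesis unit data described above might be represented; the Python class and field names are hypothetical and only mirror the description of the phoneme database 10.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    """Characteristic parameters and stochastic component for one frame (hypothetical layout)."""
    characteristic: dict  # e.g. {"EGain": 60.0, "FormantFreq1": 700.0, ...}
    stochastic: list      # residual (stochastic) spectrum samples

@dataclass
class ArticulationData:
    """Transition-part data: a chain of frames keyed by the two phoneme names."""
    first_phoneme: str        # e.g. "a"
    following_phoneme: str    # e.g. "i"
    frames: List[Frame]

@dataclass
class StationaryData:
    """Long-sound-part data: a chain of frames keyed by a single phoneme name."""
    phoneme: str              # e.g. "a"
    frames: List[Frame]

# The phoneme database (unit 10) then maps keys such as ("a", "i") or "a" to these records.
phoneme_database = {
    ("a", "i"): ArticulationData("a", "i", frames=[]),
    "a": StationaryData("a", frames=[]),
}
```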
Returning to the drawing, a performance data storage unit 11 stores the performance data of the song to be synthesized.
A voice synthesis unit selector 12 accepts input of the performance data kept in the performance data storage unit 11 in units of a frame (hereinafter, this unit of data is called frame data), and reads, from the phoneme database 10, the voice synthesis unit data corresponding to the lyrics data included in the input performance data.
A previous articulation data storage unit 13 and a later articulation data storage unit 14 are used when stationary data is processed. The previous articulation data storage unit 13 stores the articulation data preceding the stationary data to be processed, and the later articulation data storage unit 14 stores the articulation data following the stationary data to be processed.
A characteristic parameter interpolation unit 15 reads the characteristic parameter of the last frame of the articulation data stored in the previous articulation data storage unit 13 and the characteristic parameter of the first frame of the articulation data stored in the later articulation data storage unit 14, and interpolates these characteristic parameters in a time sequence corresponding to the time indicated by a timer 27.
A stationary data storage unit 16 temporarily stores the stationary data out of the voice synthesis unit data read by the voice synthesis unit selector 12. Similarly, an articulation data storage unit 17 temporarily stores the articulation data.
A characteristic parameter change detecting unit 18 reads the stationary data stored in the stationary data storage unit 16, extracts the change (throb) of the characteristic parameter, and outputs it as a change component.
An adding unit K1 outputs deterministic component data of the long sound part by adding the output of the characteristic parameter interpolation unit 15 and the output of the characteristic parameter change detecting unit 18.
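The cooperation of the characteristic parameter interpolation unit 15, the characteristic parameter change detecting unit 18, and the adding unit K1 can be illustrated, for a single scalar parameter, by the following sketch; the helper names, and the interpretation of the change component as the deviation of each stationary frame from the mean, are assumptions made for illustration.

```python
def interpolate_long_sound(prev_last_param, later_first_param, t, duration):
    """Unit 15: linear interpolation, in time, between the last frame of the
    previous articulation data and the first frame of the later articulation data."""
    ratio = 0.0 if duration <= 0 else min(max(t / duration, 0.0), 1.0)
    return (1.0 - ratio) * prev_last_param + ratio * later_first_param


def change_component(stationary_params):
    """Unit 18: extract the change (throb) of the parameter from the stationary data,
    here taken simply as the deviation of each frame from the mean of the stationary data."""
    mean = sum(stationary_params) / len(stationary_params)
    return [p - mean for p in stationary_params]


def long_sound_parameter(prev_last_param, later_first_param, stationary_params, t, duration):
    """Adding unit K1: the interpolated parameter plus the change component of the
    stationary frame corresponding to the current time."""
    base = interpolate_long_sound(prev_last_param, later_first_param, t, duration)
    frame_index = min(int(t / duration * len(stationary_params)), len(stationary_params) - 1)
    return base + change_component(stationary_params)[frame_index]


# Example: a formant frequency drifting from 700 Hz to 650 Hz over a 0.5 s long sound part.
print(long_sound_parameter(700.0, 650.0, [701.0, 699.0, 700.0, 702.0], t=0.25, duration=0.5))  # 674.5
```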
A frame reading unit 19 reads the articulation data stored in the articulation data storage unit 17 as frame data in accordance with the time indicated by the timer 27, and divides it into a characteristic parameter and a stochastic component to be output.
A pitch defining unit 20 defines the pitch of the voice to be finally synthesized, based on the musical note data in the frame data. A characteristic parameter correction unit 21 corrects the characteristic parameter of the long sound part output from the adding unit K1 and the characteristic parameter of the transition part output from the frame reading unit 19, based on dynamics information included in the performance data. A switch SW1 is provided in front of the characteristic parameter correction unit 21, and through it either the characteristic parameter of the long sound part or the characteristic parameter of the transition part is input to the characteristic parameter correction unit 21. Details of the process in the characteristic parameter correction unit 21 are explained later. A switch SW2 switches between the stochastic component of the long sound part read from the stationary data storage unit 16 and the stochastic component of the transition part read from the frame reading unit 19, and outputs the selected component.
A harmonic chain generating unit 22 generates a harmonic chain for formant synthesis on the frequency axis in accordance with the determined pitch.
A spectrum envelope generating unit 23 generates a spectrum envelope in accordance with the characteristic parameter corrected in the characteristic parameter correction unit 21.
A harmonics amplitude/phase calculating unit 24 calculates an amplitude and a phase of each harmonic generated in the harmonic chain generating unit 22, in accordance with the spectrum envelope generated in the spectrum envelope generating unit 23.
An adding unit K2 adds the deterministic component output from the harmonics amplitude/phase calculating unit 24 and the stochastic component output from the switch SW2.
An inverse FFT unit 25 converts the signal from a frequency-domain representation into a time-domain representation by applying the inverse fast Fourier transform (IFFT) to the output of the adding unit K2.
An overlapping unit 26 outputs the synthesized singing voice by overlapping the signals obtained one after another as the lyrics data is processed in time sequential order.
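The back end of the apparatus (units 22 through 26) amounts to sampling a spectrum envelope at the harmonics of the defined pitch, adding the stochastic spectrum, and returning to the time domain with an inverse FFT and overlap-add. The sketch below, which uses NumPy, a toy envelope, and zero phase, is a simplification under those assumptions, not the disclosed implementation.

```python
import numpy as np

def synthesize_frame(pitch_hz, envelope, stochastic_spectrum, n_fft=1024, sample_rate=44100):
    """Sample `envelope(f)` at harmonics of `pitch_hz` (units 22-24), add the
    stochastic spectrum (adding unit K2), and inverse-FFT to a time frame (unit 25)."""
    spectrum = np.array(stochastic_spectrum, dtype=complex)      # length n_fft // 2 + 1
    bin_hz = sample_rate / n_fft
    for k in range(1, int((sample_rate / 2) // pitch_hz) + 1):   # harmonic chain (unit 22)
        f = k * pitch_hz
        amplitude = envelope(f)                                  # amplitude from the envelope (unit 24)
        spectrum[int(round(f / bin_hz))] += amplitude            # deterministic component
    return np.fft.irfft(spectrum, n=n_fft)                       # inverse FFT (unit 25)


def overlap_add(frames, hop):
    """Unit 26: overlap successive frames to produce the final waveform."""
    out = np.zeros(hop * (len(frames) - 1) + len(frames[0]))
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + len(frame)] += frame
    return out


# Toy usage: a flat envelope and a silent stochastic component.
frames = [synthesize_frame(220.0, lambda f: 0.01, np.zeros(513)) for _ in range(4)]
waveform = overlap_add(frames, hop=512)
print(waveform.shape)  # (2560,)
```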
Details of the characteristic parameter correction unit 21 are explained below with reference to the drawing. An amplitude defining unit 41 defines a desired amplitude value A1 based on the dynamics information included in the performance data.
Also, a spectrum envelope generating unit 42 generates a spectrum envelope based on the characteristic parameter output from the switch SW1.
A harmonics chain generating unit 43 generates harmonics based on the pitch defined in the pitch defining unit 20. An amplitude calculating unit 44 calculates an amplitude value A2 corresponding to the generated spectrum envelope and harmonics. The amplitude can be calculated, for example, by an inverse FFT or the like.
An adding unit K3 outputs the difference between the desired amplitude value A1 defined in the amplitude defining unit 41 and the amplitude value A2 calculated in the amplitude calculating unit 44. A gain correcting unit 45 calculates a gain correction amount based on this difference and corrects the characteristic parameter based on the gain correction amount. In this way, a new characteristic parameter matched with the desired amplitude is obtained.
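A minimal sketch of the gain correction performed by units 41 through 45; working in decibels and applying the whole correction to EGain are assumptions, since the text only states that the characteristic parameter is corrected so that the resulting amplitude matches the desired one.

```python
def correct_gain(characteristic, desired_amplitude_db, measure_amplitude_db):
    """Units 41-45: correct the characteristic parameter so that the resulting
    amplitude matches the desired amplitude derived from the dynamics information.

    characteristic: dict of characteristic parameters (must contain "EGain").
    measure_amplitude_db: callable returning the amplitude A2 (dB) produced by a
    given characteristic parameter set (units 42-44).
    """
    a1 = desired_amplitude_db                      # amplitude defining unit 41
    a2 = measure_amplitude_db(characteristic)      # amplitude calculating unit 44
    difference = a1 - a2                           # adding unit K3
    corrected = dict(characteristic)
    corrected["EGain"] = characteristic["EGain"] + difference   # gain correcting unit 45
    return corrected


# Toy example: pretend the measured amplitude simply equals EGain - 6 dB.
params = {"EGain": 60.0, "FormantFreq1": 700.0}
print(correct_gain(params, desired_amplitude_db=57.0,
                   measure_amplitude_db=lambda p: p["EGain"] - 6.0))
# {'EGain': 63.0, 'FormantFreq1': 700.0}
```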
Next, an operation of the singing voice synthesizing apparatus according to the first embodiment of the present invention is explained with reference to the flow chart in the drawing.
The performance data storage unit 11 outputs frame data in time sequential order. A transition part and a long sound part appear by turns, and the processing differs between the transition part and the long sound part.
When frame data is input from the performance data storage unit 11 (S1), the voice synthesis unit selector 12 judges whether the frame data relates to a long sound part or an articulation part (S2). In the case of the long sound part, the previous articulation data, the later articulation data, and the stationary data are transmitted to the previous articulation data storage unit 13, the later articulation data storage unit 14, and the stationary data storage unit 16, respectively (S3).
Then, the characteristic parameter interpolation unit 15 picks up the characteristic parameter of the last frame of the previous articulation data stored in the previous articulation data storage unit 13 and the characteristic parameter of the first frame of the later articulation data stored in the later articulation data storage unit 14. The characteristic parameter of the long sound part being processed is then generated by linear interpolation of these two characteristic parameters (S4).
Also, the characteristic parameter of the stationary data stored in the stationary data storage unit 16 is provided to the characteristic parameter change detecting unit 18, and the change component of the characteristic parameter of the stationary data is extracted (S5). This change component is added, in the adding unit K1, to the characteristic parameter output from the characteristic parameter interpolation unit 15 (S6). The added value is output via the switch SW1 to the characteristic parameter correction unit 21 as the characteristic parameter of the long sound part, and correction of the characteristic parameter is executed (S9). On the other hand, the stochastic component of the stationary data stored in the stationary data storage unit 16 is provided to the adding unit K2 via the switch SW2.
The spectrum envelope generating unit 23 generates a spectrum envelope for this corrected characteristic parameter. The harmonics amplitude/phase calculating unit 24 calculates the amplitude and phase of each harmonic generated in the harmonic chain generating unit 22 in accordance with the spectrum envelope generated in the spectrum envelope generating unit 23. The calculated result is output to the adding unit K2 as a chain of parameters (the deterministic component) of the long sound part being processed.
On the other hand, in the case that the obtained frame data is judged to be a transition part (NO) in Step S2, the articulation data of the transition part is stored in the articulation data storage unit 17 (S7).
Next, the frame reading unit 19 reads the articulation data stored in the articulation data storage unit 17 as frame data in accordance with the time indicated by the timer 27, and divides it into a characteristic parameter and a stochastic component. The characteristic parameter is output to the characteristic parameter correction unit 21, and the stochastic component is output to the adding unit K2. This characteristic parameter of the transition part undergoes the same processing as the characteristic parameter of the long sound part described above, in the characteristic parameter correction unit 21, the spectrum envelope generating unit 23, the harmonics amplitude/phase calculating unit 24, and the like.
Moreover, the switches SW1 and SW2 are switched depending on the kind of data being processed. The switch SW1 connects the characteristic parameter correction unit 21 to the adding unit K1 while the long sound part is processed, and connects the characteristic parameter correction unit 21 to the frame reading unit 19 while the transition part is processed. The switch SW2 connects the adding unit K2 to the stationary data storage unit 16 while the long sound part is processed, and connects the adding unit K2 to the frame reading unit 19 while the transition part is processed.
When the characteristic parameter of the transition part or the long sound part and the stochastic component have been calculated, the added value is processed in the inverse FFT unit 25 and overlapped in the overlapping unit 26 to output the final synthesized waveform (S10).
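The per-frame flow of steps S1 through S9 can be summarized, for a single scalar parameter, by the following toy sketch; every name in it is a hypothetical stand-in for the corresponding unit, and the final inverse FFT and overlapping steps (S10) are omitted because they are sketched above after the description of the overlapping unit 26.

```python
def process_frame(frame, phoneme_db):
    """One simplified pass of the flow chart (S1-S9) for a single scalar parameter.

    frame: dict with keys "kind" ("long" or "transition"), "unit" (database key),
    "t" (position 0..1 inside the unit) and "dynamics" (desired gain offset in dB).
    phoneme_db: hypothetical database; see the toy layout below.
    """
    if frame["kind"] == "long":                                   # S2
        prev, stat, later = phoneme_db[frame["unit"]]             # S3
        base = (1 - frame["t"]) * prev + frame["t"] * later       # S4 (unit 15)
        change = stat[int(frame["t"] * (len(stat) - 1))]          # S5 (unit 18)
        characteristic = base + change                            # S6 (adding unit K1)
    else:
        chain = phoneme_db[frame["unit"]]                         # S7
        characteristic = chain[int(frame["t"] * (len(chain) - 1))]
    return characteristic + frame["dynamics"]                     # S9 (unit 21, as a plain gain)


# Toy database: long sound "a" bordered by parameter values 60 and 58 with a small throb,
# and a transition "[ai]" holding a short chain of parameter values.
db = {"a": (60.0, [0.5, -0.5, 0.3], 58.0), "[ai]": [58.0, 56.0, 54.0]}
print(process_frame({"kind": "long", "unit": "a", "t": 0.5, "dynamics": 0.0}, db))           # 58.5
print(process_frame({"kind": "transition", "unit": "[ai]", "t": 1.0, "dynamics": 2.0}, db))  # 56.0
```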
The singing voice synthesizing apparatus according to a second embodiment of the present invention is explained below with reference to the drawing.
Also, the pitch defining unit 20 defines the pitch based on the musical note information in the performance data, and outputs the result to the voice synthesis unit selector 12.
As for the operation of the second embodiment, the pitch defining unit 20 defines the pitch of the frame data being processed based on the musical note from the performance data storage unit 11, and outputs the result to the voice synthesis unit selector 12. The voice synthesis unit selector 12 reads the articulation data and stationary data that are closest to the defined pitch and to the phoneme information in the lyrics information. The subsequent process is the same as in the first embodiment.
The singing voice synthesizing apparatus according to a third embodiment of the present invention is explained below with reference to the drawing.
Also, the pitch defining unit 20 defines the pitch based on the musical note information in the performance data and the vibrato data from the expression template selector 30A.
As for the operation of the third embodiment, the voice synthesis unit selector 12 reads the articulation data and stationary data from the phoneme database 10 based on the musical note from the performance data storage unit 11, in the same manner as in the first embodiment. The subsequent process is also the same as in the first embodiment.
On the other hand, the expression template selector 30A reads the most suitable vibrato data from the expression database 30 based on the expression information from the performance data storage unit 11. The pitch is defined by the pitch defining unit 20 based on the read vibrato data and the musical note information in the performance data.
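The pitch definition of the third embodiment can be illustrated as follows; the sinusoidal vibrato model and its parameter names are assumptions, since the text only states that the pitch is defined from the musical note information and the vibrato data read from the expression database 30.

```python
import math

def define_pitch(note_hz, vibrato, t):
    """Pitch defining unit 20 (third embodiment, simplified): modulate the note pitch
    with vibrato data read from the expression database (toy sinusoidal model)."""
    depth_cents = vibrato["depth_cents"]   # peak deviation from the note pitch, in cents
    rate_hz = vibrato["rate_hz"]           # vibrato rate
    cents = depth_cents * math.sin(2.0 * math.pi * rate_hz * t)
    return note_hz * (2.0 ** (cents / 1200.0))


vibrato_template = {"depth_cents": 50.0, "rate_hz": 5.5}   # hypothetical expression data
print(round(define_pitch(440.0, vibrato_template, t=0.0), 2))  # 440.0
```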
The present invention has been described in connection with the preferred embodiments. The invention is not limited only to the above embodiments. It is apparent that various modifications, improvements, combinations, and the like can be made by those skilled in the art.
Number | Date | Country | Kind |
---|---|---|---|
2002-054487 | Feb 2002 | JP | national |