1. Field of the Invention
This invention relates to a singing voice-synthesizing method and apparatus for synthesizing singing voices based on performance data being input in real time, and a storage medium storing a program for executing the method.
2. Prior Art
Conventionally, a singing voice-synthesizing method of the above-mentioned kind has been proposed which makes the rise time of a phoneme to be sounded first (first phoneme) in accordance with a note-on signal based on performance data shorter than the rise time of the same phoneme when it is sounded in succession to another phoneme during the note-on period (see e.g. Japanese Laid-Open Patent Publication (Kokai) No. 10-49169).
On the other hand,
The conventional singing voice-synthesizing method suffers from the following problems:
(1) The vowel singing-starting time points of the human singing shown in
(2) Information of a phonetic unit is transmitted immediately before a note-on time point of the phonetic unit, and the singing voice corresponding to the information of the phonetic unit starts to be generated at the note-on time point. Therefore, it is impossible to start generation of the singing voice earlier than the note-on time point.
(3) The singing voice is not controlled in respect of state transitions, such as an attack (rise) portion, and a release (fall) portion. This makes it impossible to synthesize more natural singing voices.
(4) The singing voice is not controlled in respect of effects, such as vibrato. This makes it impossible to synthesize more natural singing voices.
It is an object of the present invention to provide a singing voice-synthesizing method and apparatus which is capable of synthesizing natural singing voices close to human singing voices based on performance data being input in real time, and a storage medium storing a program for executing the method.
To attain the above object, according to a first aspect of the invention, there is provided a singing voice-synthesizing method comprising the steps of inputting phonetic unit information representative of a phonetic unit, time information representative of a singing-starting time point, and singing length information representative of a singing length, in timing earlier than the singing-starting time point, for a singing phonetic unit including a sequence of a first phoneme and a second phoneme, generating a phonetic unit transition time length formed by a generation time length of the first phoneme and a generation time length of the second phoneme, based on the inputted phonetic unit information, determining a singing-starting time point and a singing duration time of the first phoneme and a singing-starting time point and a singing duration time of the second phoneme, based on the generated phonetic unit transition time length, the inputted time information and singing length information, and starting generation of a first singing voice and a second singing voice formed by the first phoneme and the second phoneme at the singing-starting time point of the first phoneme and the singing-starting time point of the second phoneme, respectively, and continuing generation of the first singing voice and the second singing voice for the singing duration time of the first phoneme and the singing duration time of the second phoneme, respectively.
Preferably, the determining step includes setting the singing-starting time point of the first phoneme to a time point earlier than the singing-starting time point represented by the time information.
According to this singing voice-synthesizing method, the phonetic unit information, the time information, and the singing length information are inputted in timing earlier than the singing-starting time point represented by the time information, and a phonetic unit transition time length is formed based on the phonetic unit information. Further, a singing-starting time point and a singing duration time of the first phoneme and a singing-starting time point and a singing duration time of the second phoneme are determined based on the generated phonetic unit transition time length. As a result, as to the first and second phonemes, it is possible to determine desired singing-starting time points before or after the singing-starting time point represented by the time information, or determine singing duration times different from the singing length represented by the singing length information, whereby natural singing sounds can be produced as the first and second singing phonetic units. For example, if the singing-starting time point of the first phoneme can be set to a time point earlier than the singing-starting time point represented by the time information, it is possible to make the rise of a consonant sufficiently earlier than the rise of a vowel to thereby synthesize singing voices close to human singing voices.
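The timing determination described above can be illustrated by a minimal sketch. It assumes times expressed in milliseconds and a hypothetical helper `phoneme_timing`; neither the names nor the exact arithmetic are taken from this specification, and the vowel is simply assumed to be sustained for the whole singing length.

```python
from dataclasses import dataclass


@dataclass
class PhonemeTiming:
    start: float     # singing-starting time point of the phoneme
    duration: float  # singing duration time of the phoneme


def phoneme_timing(note_on, singing_length, consonant_length):
    """Place the consonant before note_on so that the vowel rises at note_on.

    Assumption: the vowel lasts for the whole singing length. The text only
    requires that both phonemes be timed from the transition time length,
    the time information and the singing length information.
    """
    consonant = PhonemeTiming(start=note_on - consonant_length,
                              duration=consonant_length)
    vowel = PhonemeTiming(start=note_on, duration=singing_length)
    return consonant, vowel


# Hypothetical values: note-on at 1000 ms, 500 ms note, 80 ms consonant.
consonant, vowel = phoneme_timing(note_on=1000.0, singing_length=500.0,
                                  consonant_length=80.0)
print(consonant.start, vowel.start)
```

Such a calculation is possible only because the performance data arrives earlier than the singing-starting time point, which leaves room to schedule the consonant ahead of it.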
To attain the above object, according to a second aspect of the invention, there is provided a singing voice-synthesizing method comprising the steps of inputting phonetic unit information representative of a phonetic unit, time information representative of a singing-starting time point, and singing length information representative of a singing length, for a singing phonetic unit, generating a state transition time length corresponding to a rise portion, a note transition portion, or a fall portion of the singing phonetic unit, based on the inputted phonetic unit information, and generating a singing voice formed by the phonetic unit, based on the phonetic unit information, the time information, and the singing length information which have been inputted, the generating step including adding a change in at least one of pitch and amplitude to the singing voice during a time period corresponding to the generated state transition time length.
According to this singing voice-synthesizing method, the state transition time length is generated based on the inputted phonetic unit, and a change in at least one of pitch and amplitude is added to the singing voice during a time period corresponding to the generated state transition time length. This makes it possible to synthesize natural singing voices with feelings of attack, note transition, or release.
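As a rough illustration of adding an amplitude change over the state transition time length of a rise portion, the following sketch assumes a linear ramp. The function name `attack_gain` and the ramp shape are assumptions made for illustration only; the specification does not state the shape of the change.

```python
def attack_gain(t, start, attack_length):
    """Amplitude gain at time t for a voice that starts at `start`.

    Assumption: the gain rises linearly from 0 to 1 over attack_length.
    """
    if t < start:
        return 0.0  # the voice has not started yet
    if attack_length <= 0 or t >= start + attack_length:
        return 1.0  # no attack portion, or the attack has finished
    return (t - start) / attack_length


# Hypothetical 100 ms attack beginning at time 0.
print([attack_gain(t, 0.0, 100.0) for t in (0.0, 50.0, 100.0, 150.0)])
```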
To attain the above object, according to a third aspect of the invention, there is provided a singing voice-synthesizing apparatus comprising an input section that inputs phonetic unit information representative of a phonetic unit, time information representative of a singing-starting time point, and singing length information representative of a singing length, in timing earlier than the singing-starting time point, for a phonetic unit including a sequence of a first phoneme and a second phoneme, a storage section that stores a phonetic unit transition time length formed by a generation time length of the first phoneme and a generation time length of the second phoneme, a readout section that reads out the phonetic unit transition time length from the storage section based on the phonetic unit information inputted by the input section, a calculating section that calculates a singing-starting time point and a singing duration time of the first phoneme, and a singing-starting time point and a singing duration time of the second phoneme, based on the phonetic unit transition time length read by the readout section and the time information and the singing length information which have been inputted by the input section, and a singing voice-synthesizing section that starts generation of a first singing voice and a second singing voice formed by the first phoneme and the second phoneme at the singing-starting time point of the first phoneme and the singing-starting time point of the second phoneme calculated by the calculating section, respectively, and continuing generation of the first singing voice and the second singing voice for the singing duration time of the first phoneme and the singing duration time of the second phoneme calculated by the calculating section, respectively.
This singing voice-synthesizing apparatus implements the singing voice-synthesizing method according to the first aspect of the invention, and hence the same advantageous effects as described for this method can be obtained. Further, since the apparatus is configured such that the phonetic unit transition time length is read from the storage section, the construction of the apparatus and the processing executed thereby can remain simple even if the number of singing phonetic units is increased.
Preferably, the input section inputs modifying information for modifying the generation time length of the first phoneme, and the calculating section modifies the generation time length of the first phoneme in the phonetic unit transition time length read by the readout section according to the modifying information inputted by the input section, and then calculates the singing-starting time point and the singing duration time of the first phoneme and the singing-starting time point and the singing duration time of the second phoneme, based on the phonetic unit transition time length including the modified generation time length of the first phoneme.
According to this preferred embodiment, it is possible to reflect the operator's intention on the singing-starting time points and singing duration times of the first and second phonemes, and hence synthesize more natural singing voices.
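The modification of the generation time length of the first phoneme might look as follows. This is a sketch under the assumption that the modifying information is an expansion/compression ratio applied multiplicatively to the stored length; the helper name `modified_consonant_start` and the values are hypothetical.

```python
def modified_consonant_start(note_on, stored_consonant_length, ratio):
    """Return (start, length) of the consonant after applying the ratio.

    Assumption: the stored generation time length is multiplied by the
    ratio, and the consonant is placed so that it ends at note_on.
    """
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    length = stored_consonant_length * ratio
    return note_on - length, length


# A ratio above 1 lengthens the consonant, so it must start even earlier.
print(modified_consonant_start(1000.0, 80.0, 1.5))
print(modified_consonant_start(1000.0, 80.0, 0.5))
```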
To attain the above object, according to a fourth aspect of the invention, there is provided a singing voice-synthesizing apparatus comprising an input section that inputs phonetic unit information representative of a phonetic unit, time information representative of a singing-starting time point, and singing length information representative of a singing length, for a singing phonetic unit, a storage section that stores state transition time lengths corresponding to a rise portion, a note transition portion, or a fall portion of the singing phonetic unit, a readout section that reads out the state transition time length from the storage section based on the phonetic unit information inputted by the input section, and a singing voice-synthesizing section that generates a singing voice formed by the phonetic unit, based on the phonetic unit information, the time information, and the singing length information which have been inputted by the input section, the singing voice-synthesizing section adding a change in at least one of pitch and amplitude to the singing voice during a time period corresponding to the state transition time length read out by the readout section.
This singing voice-synthesizing apparatus implements the singing voice-synthesizing method according to the second aspect of the invention, and hence the same advantageous effects as described for this method can be obtained. Further, since the apparatus is configured such that the state transition time length is read from the storage section, the construction of the apparatus and the processing executed thereby can remain simple even if the number of singing phonetic units is increased.
Preferably, the input section inputs modifying information for modifying the state transition time lengths, and the singing voice-synthesizing apparatus includes a modifying section that modifies the corresponding state transition time length read out by the readout section based on the modifying information inputted by the input section, the singing voice-synthesizing section adding a change in at least one of pitch and amplitude to the singing voice during a time period corresponding to the state transition time length modified by the modifying section.
According to this preferred embodiment, it is possible to reflect the operator's intention on the state transition time length, and hence synthesize more natural singing voices.
To attain the above object, according to a fifth aspect of the invention, there is provided a singing voice-synthesizing apparatus comprising an input section that inputs phonetic unit information representative of a phonetic unit, time information representative of a singing-starting time point, singing length information representative of a singing length, and effects-imparting information, for a singing phonetic unit, and a singing voice-synthesizing section that generates a singing voice formed by the phonetic unit, based on the phonetic unit information, the time information, and the singing length information which have been inputted by the input section, the singing voice-synthesizing section imparting effects to the singing voice based on the effects-imparting information inputted by the input section.
According to this singing voice-synthesizing apparatus, it is possible to add minute changes in pitch and amplitude, e.g. those in vibrato effect, to singing voices, whereby more natural singing voices can be synthesized.
Preferably, the effects-imparting information inputted by the input section represents an effects-imparting time period, and the singing voice-synthesizing apparatus further comprises a setting section that sets a new effects-imparting time period corresponding to both the effects-imparting time period represented by the effects-imparting information and a second effects-imparting time period of a singing phonetic unit preceding the singing phonetic unit if the effects-imparting time period is continuous from the second effects-imparting time period, the singing voice-synthesizing section imparting effects to the singing voice during the new effects-imparting time period set by the setting section.
According to this preferred embodiment, since effects are imparted over a new effects-imparting time period that covers effects-imparting time periods continuous with each other, the effects are not interrupted, and their continuity is improved.
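The setting of a new effects-imparting time period can be sketched as an interval merge. The representation of a period as a `(start, end)` pair and the function name `merge_effect_periods` are assumptions made for illustration.

```python
def merge_effect_periods(previous, current):
    """Merge two (start, end) periods when `current` continues `previous`.

    Returns the merged period, or `current` unchanged when there is a gap
    between the two periods.
    """
    prev_start, prev_end = previous
    cur_start, cur_end = current
    if cur_start <= prev_end:  # continuous with (or overlapping) the previous one
        return (prev_start, max(prev_end, cur_end))
    return current


# The vibrato of the preceding phonetic unit ends exactly where the next begins.
print(merge_effect_periods((0, 100), (100, 250)))
# A gap between the two periods leaves the current period as it is.
print(merge_effect_periods((0, 100), (120, 250)))
```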
To attain the above object, according to a sixth aspect of the invention, there is provided a singing voice-synthesizing apparatus comprising an input section that inputs phonetic unit information representative of a phonetic unit, time information representative of a singing-starting time point, and singing length information representative of a singing length, for a singing phonetic unit, in timing earlier than the singing-starting time point, a setting section that randomly sets a new singing-starting time point, within a predetermined time range extending before and after the singing-starting time point, based on the time information inputted by the input section, and a singing voice-synthesizing section that generates a singing voice formed by the phonetic unit, based on the phonetic unit information and the singing length information which have been inputted by the input section, and the singing-starting time point set by the setting section, the singing voice-synthesizing section starting generation of the singing voice at the new singing-starting time point set by the setting section.
According to this singing voice-synthesizing apparatus, a new singing-starting time point is randomly set within a predetermined time range extending before and after the singing-starting time point represented by the time information, and a singing voice is generated at the set singing-starting time point. This makes it possible to synthesize more natural singing voices with variations in singing-starting timing.
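The random setting of a new singing-starting time point can be sketched as follows, assuming a uniform distribution over a symmetric range. The distribution, the helper name `randomized_start`, and the numeric values are assumptions, since the text specifies only that the time point is set randomly within a predetermined range.

```python
import random


def randomized_start(note_on, max_offset, rng=random):
    """Pick a start time uniformly within note_on - max_offset .. note_on + max_offset."""
    if max_offset < 0:
        raise ValueError("max_offset must be non-negative")
    return note_on + rng.uniform(-max_offset, max_offset)


rng = random.Random(0)  # seeded only so that the sketch is repeatable
starts = [randomized_start(1000.0, 10.0, rng) for _ in range(5)]
print(starts)
```

Because the range extends before the notified time point as well as after it, this also relies on the performance data being inputted earlier than the singing-starting time point.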
To attain the above object, there is provided a storage medium storing a program for executing the singing voice-synthesizing method according to the first aspect of the invention.
Similarly, there is provided a storage medium storing a program for executing the singing voice-synthesizing method according to the second aspect of the invention.
The above and other objects, features and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings.
The present invention will now be described in detail with reference to the drawings showing a preferred embodiment thereof.
Referring first to
In the present embodiment, performance data which is comprised of phonetic unit information, singing-starting time information, and singing length information is inputted for each of phonetic units which constitute a lyric such as “saita”, each phonetic unit consisting of “sa”, “i”, or “ta”. The singing-starting time information represents an actual singing-starting time point (e.g. timing of a first beat of a time), such as T1 shown in
In the singing voice synthesis, the consonant “s” starts to be generated at the determined singing-starting time point and continues to be generated over the determined singing duration time. This also applies to the phonetic units “i” and “ta”. As a result, the singing voices synthesized by the present method become very natural in which the singing-starting time points and the singing duration times thereof are approximate to those of the
The singing voice-synthesizing apparatus is comprised of a CPU (Central Processing Unit) 12, a ROM (Read Only Memory) 14, a RAM (Random Access Memory) 16, a detection circuit 20, a display circuit 22, an external storage device 24, a timer 26, a tone generator circuit 28, and a MIDI (Musical Instrument Digital Interface) interface 30, all connected to each other via a bus 10.
The CPU 12 performs operations of various processes concerning the generation of musical tones, the synthesis of singing voices, etc. according to programs stored in the ROM 14. The process concerning the synthesis of singing voices (singing voice-synthesizing process) will be described in detail hereinafter with reference to flowcharts shown in
The RAM 16 includes various storage sections used as working areas for processing operations of the CPU 12. As a storage section related to the execution of the present invention, it is provided with, for example, a receiving buffer into which received performance data are written.
The detection circuit 20 detects operating information concerning operations of various operating elements of an operating element group 34 arranged on a panel, not shown.
The display circuit 22 controls the operation of a display 36 to thereby enable various images to be displayed thereon.
The external storage device 24 is comprised of a drive in which at least one type of storage medium, e.g. an HD (hard disk), an FD (floppy disk), a CD (compact disk), a DVD (digital versatile disk), or an MO (magneto-optical disk), can be removably mounted. When a desired storage medium is mounted in the external storage device 24, data can be transferred from the storage medium to the RAM 16. Further, when the storage medium is a writable one, such as an HD or an FD, data can be transferred from the RAM 16 to the storage medium.
As program-recording means, there may be employed a storage medium mounted in the external storage device 24 instead of the ROM 14. In this case, a program stored in the storage medium is transferred from the storage medium to the RAM 16. Then, the CPU 12 is operated according to the program stored in the RAM 16. This makes it possible to add a program or upgrade the same with ease.
The timer 26 generates a tempo clock signal TCL having a repetition period corresponding to a tempo designated by tempo data TM, and the tempo clock signal TCL is supplied to the CPU 12 as an interrupt command. The CPU 12 carries out the singing voice synthesis by executing an interrupt-handling process in response to the tempo clock signal TCL. The tempo designated by the tempo data TM can be varied according to the operation of a tempo-setting operating element of the operating element group 34. The repetition period of generation of the tempo clock signal TCL can be set e.g. to 5 ms.
The tone generator circuit 28 includes a large number of tone-generating channels and a large number of singing voice-synthesizing channels. The singing voice-synthesizing channels synthesize singing voices based on a formant-synthesizing method. In the singing voice-synthesizing process, described hereinafter, singing voice signals are generated from the respective singing voice-synthesizing channels. The thus generated tone signals and/or singing voice signals are converted to sound or acoustic waves by a sound system 38.
The MIDI interface 30 is provided for MIDI communication between the present singing voice-synthesizing apparatus and a MIDI apparatus 39 provided as a separate unit. In the present embodiment, the MIDI interface 30 is used for receiving performance data from the MIDI apparatus 39, so as to synthesize singing voices. The singing voice-synthesizing apparatus may be configured such that performance data for accompaniment for singing is received together with performance data for the singing voice synthesis from the MIDI apparatus 39, and the tone generator circuit 28 generates musical tone signals for the accompaniment based on the performance data for the accompaniment of singing, so that the sound system 38 generates accompaniment sounds.
Next, the outline of the singing voice-synthesizing process carried out by the singing voice-synthesizing apparatus according to the present embodiment will be described with reference to
In a step S42, based on each piece of received performance data, a phonetic unit transition time length and a state transition time length are retrieved from a phonetic unit transition DB (database) 14b and a state transition DB (database) 14c within a singing voice synthesis DB (database) 14A. Based on the phonetic unit transition time length, the state transition time length and the performance data, a singing voice synthesis score is formed. The singing voice synthesis score is comprised of three tracks: a phonetic unit track, a transition track, and a vibrato track. The phonetic unit track contains information of singing-starting time points, singing duration times, etc.; the transition track contains information of starting time points and duration times of transition states, such as attack; and the vibrato track contains information of starting time points and duration times of a vibrato-added state, and the like.
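One possible in-memory layout for such a singing voice synthesis score is sketched below. The class names `ScoreEvent` and `SynthesisScore`, the use of tempo clock counts as the time unit, and all numeric values are hypothetical and serve only to show the three-track arrangement.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScoreEvent:
    label: str     # a phonetic unit, a transition state, or a vibrato event
    start: int     # starting time point (tempo clock counts, assumed unit)
    duration: int  # duration time in the same unit


@dataclass
class SynthesisScore:
    phonetic_unit_track: List[ScoreEvent] = field(default_factory=list)
    transition_track: List[ScoreEvent] = field(default_factory=list)
    vibrato_track: List[ScoreEvent] = field(default_factory=list)


score = SynthesisScore()
# Hypothetical entries for a phonetic unit "sa" with an attack and a vibrato.
score.phonetic_unit_track.append(ScoreEvent("s_a", start=184, duration=24))
score.phonetic_unit_track.append(ScoreEvent("a", start=208, duration=80))
score.transition_track.append(ScoreEvent("Attack", start=184, duration=40))
score.vibrato_track.append(ScoreEvent("Vibrato on", start=240, duration=48))
print(score)
```

Keeping the three tracks separate lets the synthesis step look up the phonetic unit, the transition state, and the vibrato state for a given time point independently of one another.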
In a step S44, the singing voice synthesis is performed by a singing voice-synthesizing engine. More particularly, the singing voice synthesis is carried out based on the performance data inputted in the step S40, the singing voice synthesis scores formed in the step S42, and tone generator control information retrieved from the phonetic unit DB 14a, the phonetic unit transition DB 14b, the state transition DB 14c and the vibrato DB 14d, whereby singing voice signals are generated in the order of voices to be sung. In the singing voice-synthesizing process, a singing voice formed by a single phonetic unit (e.g. “a”) designated by the phonetic unit track or a transitional phonetic unit (e.g. “sa” in which transition from “s” to “a” occurs) and at the same time having pitch designated by the performance data starts to be generated at a singing-starting time point designated by the phonetic unit track and continues to be generated over a singing duration time designated by the phonetic unit track.
To the singing voice thus generated, minute changes in pitch, amplitude and the like can be added at and after the starting time of a transition state, such as attack, designated by the transition track, and the state in which such changes are added to the singing voice can be continued over a duration time of the transition state, such as attack, designated by the transition track. Further, to the singing voice, a vibrato can be added at and after a starting time designated by the vibrato track and the state in which the vibrato is added to the singing voice can be continued over a duration time designated by the vibrato track.
In steps S46 and S48, processes are carried out within the tone generator circuit 28. In the step S46, the singing voice signal is subjected to D/A (digital-to-analog) conversion, and in the step S48, the singing voice signal subjected to the D/A conversion is outputted to the sound system 38 to cause the same to be sounded as a singing voice.
The note information contains note-on information indicative of an actual singing-starting time point, duration information indicative of actual singing length, and pitch information indicative of the pitch of singing voice. The phonetic unit track information contains information of a singing phonetic unit (denoted by PhU), consonant modification information representative of a singing consonant expansion/compression ratio, etc. In the present embodiment, it is assumed that the singing voice synthesis is carried out to synthesize singing voices of a Japanese-language song, and hence the phonemes appearing in the singing voices are consonants and vowels, and further, the phonetic unit state (PhU State) can be a combination of a consonant and a vowel, a vowel alone, or a voiced consonant (nasal sound, semivowel) alone. If the phonetic unit state is the voiced consonant alone, the singing-starting time point of the voiced consonant is similar to that in the vowel-alone case, and hence the phonetic unit state is handled as the vowel alone.
The transition track information contains attack type information indicative of a singing attack type, attack rate information indicative of a singing attack expansion/compression ratio, release type information indicative of a singing release type, release rate information indicative of a singing release expansion/compression ratio, note transition type information indicative of a singing note transition type, etc. The attack type designated by the attack type information includes “normal”, “sexy”, “sharp”, “soft”, etc. The release type information and the note transition type information can also designate one of a plurality of types, similar to the attack type. The note transition means a transition from the present performance data (performance event) to the next performance data (performance event). The singing attack expansion/compression ratio, the singing release expansion/compression ratio, and the note transition expansion/compression ratio are each set to a value larger than 1 when the state transition time length associated therewith is desired to be increased, and to a value smaller than 1 when the same is desired to be decreased. These ratios can also be set to 0, in which case the state transition time length becomes zero and the addition of minute changes in pitch, amplitude and the like accompanying the attack, release and note transition is not carried out.
The vibrato track information contains information of a vibrato number indicative of the number of vibrato events in the present performance data, information of vibrato delay 1 indicative of a delay time of a first vibrato, information of vibrato duration 1 indicative of a duration time of the first vibrato, information of vibrato delay K indicative of a delay time of a K-th vibrato, where K is equal to or larger than 2, information of vibrato duration K indicative of a duration time of the K-th vibrato, and information of vibrato type K indicative of a type of the K-th vibrato. When the number of vibrato events is 0, the information of vibrato delay 1, et seq. are not contained in the vibrato track information. The vibrato type designated by the information of vibrato type 1 to vibrato type K includes “normal”, “sexy”, and “enka (Japanese traditional popular song)”.
Although the singing voice synthesis DB 14A shown in
Next, the information stored in the phonetic unit DB 14a, the phonetic unit transition DB 14b, the state transition DB 14c, and the vibrato DB 14d will be described with reference to
The phonetic unit DB 14a shown in
(a) “V_Sil” represents a phonetic unit transition from a vowel to silence, and, for example, in
(b) “Sil_C” represents a phonetic unit transition from silence to a consonant, and, for example, in
(c) “C_V” represents a phonetic unit transition from a consonant to a vowel, and, for example, in
(d) “Sil_V” represents a phonetic unit transition from silence to a vowel, and, for example, in
(e) “pV_C” represents a phonetic unit transition from a preceding vowel to a consonant, and, for example, in
(f) “pV_V” represents a phonetic unit transition from a preceding vowel to a vowel, and, for example, in
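The six phonetic unit transition types listed in (a) to (f) can be summarized by a small lookup, sketched below. It assumes that the vowels are the five Japanese vowels and that any other phoneme except silence is a consonant; the function names `kind` and `transition_type` are hypothetical.

```python
VOWELS = {"a", "i", "u", "e", "o"}  # assumption: the five Japanese vowels
SILENCE = "Sil"


def kind(phoneme):
    """Classify a phoneme as silence ("Sil"), vowel ("V") or consonant ("C")."""
    if phoneme == SILENCE:
        return "Sil"
    return "V" if phoneme in VOWELS else "C"


def transition_type(preceding, following):
    """Return the transition type label for a pair of adjacent phonemes."""
    table = {
        ("V", "Sil"): "V_Sil",
        ("Sil", "C"): "Sil_C",
        ("C", "V"): "C_V",
        ("Sil", "V"): "Sil_V",
        ("V", "C"): "pV_C",
        ("V", "V"): "pV_V",
    }
    key = (kind(preceding), kind(following))
    if key not in table:
        raise ValueError(f"no transition type listed for {key}")
    return table[key]


# The lyric "saita" sung from silence and back to silence.
pairs = [("Sil", "s"), ("s", "a"), ("a", "i"), ("i", "t"), ("t", "a"), ("a", "Sil")]
print([transition_type(p, f) for p, f in pairs])
```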
The phonetic unit transition DB 14b shown in
The state transition DB 14c shown in
The vibrato DB 14d shown in
Then, in the step S44, according to the formed singing voice synthesis scores, singing voices SS1, SS2, SS3 are synthesized. As a result of the singing voice synthesis, it is possible to start generation of the consonant “s” of the singing voice SS1 at a time point T11 earlier than the time point T1, and further the vowel “a” of the singing voice SS1 at the time point T1. Also, it is possible to start generation of the vowel “i” of the singing voice SS2 at the time point T2. Further, it is possible to start generation of the consonant “t” of the singing voice SS3 at a time point T31 earlier than the time point T3, and further the vowel “a” of the singing voice SS3 at the time point T3. If desired, it is also possible to start generation of the vowel “a” of the phonetic unit “sa” or the vowel “i” of the phonetic unit “i” earlier than the respective time points T1 and T2.
The above description concerns the processes of forming reference scores and singing voice synthesis scores when the transmission and reception of performance data are carried out in the order of actual singing-starting time points. When the transmission and reception of performance data are not carried out in the order of actual singing-starting time points, reference scores and singing voice synthesis scores are formed in manners as illustrated in
Assuming, for example, that performance data S1, S2, and S3 designate, similarly to
The information of duration times of phonetic unit transitions, such as “Sil_a” and “s_a”, is comprised of a combination of the time length of the preceding phonetic unit and the time length of the following phonetic unit, with the boundary between the time lengths being held as time slot information. Therefore, the time slot information can be used to instruct the tone generator circuit 28 to operate according to the duration time of the preceding phonetic unit and the starting time point and duration time of the following phonetic unit. For example, based on the duration time information of the transition Sil_s, the circuit 28 can be instructed to operate according to the duration time of silence and the singing-starting time point T11 and singing duration time of the consonant “s”, and based on the duration time information of the transition s_a, the circuit 28 can be instructed to operate according to the duration time of the consonant “s” and the singing-starting time point T1 and singing duration time of the vowel “a”.
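The use of the boundary held as time slot information can be sketched as follows. The class `TransitionTimeLength`, the function `following_start`, and the numeric values are assumptions for illustration; times are in an arbitrary integer unit.

```python
from dataclasses import dataclass


@dataclass
class TransitionTimeLength:
    preceding: int  # time length of the preceding phonetic unit
    following: int  # time length of the following phonetic unit

    @property
    def total(self):
        """Duration time of the whole phonetic unit transition."""
        return self.preceding + self.following


def following_start(transition_start, time_length):
    """Starting time point of the following phonetic unit (the boundary)."""
    return transition_start + time_length.preceding


# Hypothetical transition Sil_s starting at 900: 20 units of silence,
# after which the consonant "s" starts and lasts 80 units.
sil_s = TransitionTimeLength(preceding=20, following=80)
print(following_start(900, sil_s), sil_s.total)
```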
Information as shown in
As shown in
Information as shown in
The information of the vibrato on event corresponds to the information of the vowel “a” of the phonetic unit “ta” in the phonetic unit track TP, and is used for adding vibrato-like changes in pitch and amplitude to a singing voice synthesized based on the information of the vowel “a”. In the information of the vibrato on event, by setting the starting time point later than the starting time point T3 at which the singing voice “a” is to start being generated, by a delay time DL, a delayed vibrato can be realized. It should be noted that starting time points T11 to T14, T21 to T26, T31 to T33, etc., and duration times D11 to D14, D21 to D26, D31 to D33, etc. can be set as desired by using the number of clocks of the tempo clock signal TCL.
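Expressing a delayed vibrato in clocks of the tempo clock signal TCL might be sketched as below. The 5 ms period is the example value mentioned for the tempo clock signal; the helper names `ms_to_ticks` and `vibrato_start_tick`, the rounding to the nearest clock, and the other numeric values are assumptions.

```python
TICK_MS = 5  # example repetition period of the tempo clock signal TCL


def ms_to_ticks(ms, tick_ms=TICK_MS):
    """Convert a time in milliseconds to a number of tempo clocks (rounded)."""
    return round(ms / tick_ms)


def vibrato_start_tick(vowel_start_tick, delay_ticks):
    """Starting time point of a vibrato delayed by DL after the vowel starts."""
    return vowel_start_tick + delay_ticks


# Hypothetical case: the vowel starts at clock 400 and DL is 300 ms.
delay = ms_to_ticks(300)
print(delay, vibrato_start_tick(400, delay))
```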
By using the singing voice synthesis score SC and the performance data S1 to S3, the singing voice-synthesizing process in the step S44 can synthesize the singing voice as shown in
Following this, the tone generator control information corresponding to the information of the vowel “a” in the track TP and the pitch information of C3 in the performance data S1 is read out from the phonetic unit DB 14a to control the tone generator circuit 28, whereby the vowel “a” continues to be generated. The control time period at this time corresponds to the duration time designated by the information of the vowel “a” in the track TP. Then, the tone generator control information corresponding to the information of the transition a_i in the track TP and the pitch information of D3 in the performance data S2 is read out from the DB 14b to control the tone generator circuit 28, whereby the generation of the vowel “a” is stopped and at the same time the generation of the vowel “i” is started at the time point T2. The control time period at this time corresponds to the duration time designated by the information of the transition “a_i” in the track TP.
Following this, similarly to the above, the tone generator control information corresponding to the information of the vowel “i” and the pitch information of D3 and one corresponding to the information of a transition i_t in the track TP and the pitch information of D3 are sequentially read out to control the tone generator circuit 28, whereby the generation of the vowel “i” is continued until the time point T31, and at this time point T31, the generation of the consonant “t” is started. Then, after starting the generation of the vowel “a” at the time point T3, based on the tone generator control information corresponding to the information of the transition t_a and the pitch information of E3, the tone generator control information corresponding to the information of the vowel a in the track TP and the pitch information of E3 and one corresponding to the information of the transition a_Sil in the track TP and the pitch information of E3 are sequentially read out to control the tone generator circuit 28, whereby the generation of the vowel “a” is continued until the time point T4, and at this time point T4, the state of silence is started. As a result, as the singing voices SS2, SS3, the phonetic units “i” and “ta” are sequentially generated.
In accordance with the generation of the singing voices as described above, the singing voice control is carried out based on the information in the performance data S1 to S3 and the information in the transition track TR. More specifically, before and after the time point T1, the tone generator control information corresponding to the state information of the transition state Attack in the track TR and the information of the transition s_a in the track TP are read out from the state transition DB 14c in
Further, in accordance with generation of the singing voices described above, the singing voice control is carried out based on the information of the performance data S1 to S3, and the information in the vibrato track TB. More specifically, at a time later than the time point T3 by the delay time DL, the tone generator control information corresponding to the information of a vibrato on event in the track TB, the information of the vowel a in the track TP, and the pitch information of E3 in the performance data S3 is read out from the vibrato DB 14d shown in
Next, the performance data-receiving and singing voice synthesis score-forming process will be described with reference to
In a step S50, the initialization of the system is carried out, whereby, for example, the count n of a reception counter in the RAM 16 is set to 0.
In a step S52, the count n of the reception counter is incremented by 1 (n=n+1). Then, in a step S54, a variable m is set to the value or count n of the counter, and performance data at an m-th (m=n) position in the sequence of performance data (hereinafter simply referred to as the “m-th performance data”) is received and written into the receiving buffer in the RAM 16.
In a step S56, it is determined whether or not the m-th (m=n) performance data is at the end of the data, i.e. the last data. If first (m=1) data is received in the step S54, the answer to the question of the step S56 becomes negative (N), and hence the process proceeds to a step S58. In the step S58, m-th (m=n) performance data is read out from the receiving buffer and written into the reference score in the RAM 16. It should be noted that once the first (m=1) performance data has been written into the reference score, subsequent performance data are either added to or inserted into the reference score, as described hereinabove with reference to
Then, in a step S60, it is determined whether or not n>1 holds. If the first (m=1) performance data has been received, the answer to the question of the step S60 becomes negative (N), so that the process returns to the step S52, wherein the count n is incremented to 2, and in the following step S54, second (m=2) performance data is received and written into the receiving buffer. Then, the process proceeds via the step S56 to the step S58, wherein the second (m=2) performance data is added to the reference score.
Then, it is determined in the step S60 whether or not n>1 holds, and in the present case, since the count n is equal to 2, the answer to this question becomes affirmative (Y), so that the singing voice synthesis score-forming process is carried out in a step S61. Although the process in the step S61 will be described in detail with reference to
After the processing in the step S64 is completed, the process returns to the step S52, wherein similarly to the above, the reception of performance data and writing of the received performance data into the reference score are carried out. For example, after the singing voice synthesis score is formed concerning the first (m=1) performance data in the step S64, third (m=3) performance data is received in the step S54, and in the step S58, this data is added to or inserted into the reference score.
If the answer to the question of the step S62 is affirmative (Y), this means that m-th (m=n−1) performance data has been inserted into the reference score, so that the process proceeds to a step S66, wherein singing voice synthesis scores whose actual singing-starting time points are later than that of the m-th (m=n−1) performance data are discarded, and singing voice synthesis scores are newly formed concerning the m-th (m=n−1) data and performance data subsequent thereto in the reference score. For example, assuming that after receiving performance data S1, S3, S4, as shown in
After the processing in the step S66 is completed, the process returns to the step S52, and processing similar to the above is repeatedly carried out. When the m-th (m=n) performance data is at the end of the data, the answer to the question of the step S56 becomes affirmative (Y), and in a step S68, a terminating process (e.g. addition of end information) is carried out. The execution of the step S68 is followed by the singing voice-synthesizing process being carried out in the step S44 in
Then, in a step S76, it is determined whether or not the obtained performance data has been inserted into the reference score when it has been written into the reference score. If the answer to this question is affirmative (Y), in a step S78, singing voice synthesis scores whose actual singing-starting time points are later than that of the obtained performance data are discarded.
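The insert-and-discard behavior described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: each performance datum is represented by its actual singing-starting time point alone, a formed singing voice synthesis score is represented by a placeholder string, and the function names are hypothetical.

```python
import bisect


def write_to_reference_score(starts, formed, new_start):
    """Write one performance datum into the reference score.

    starts : sorted list of actual singing-starting time points
    formed : dict mapping a start time to its already formed score
    Returns True when the datum was inserted before existing data.
    """
    pos = bisect.bisect_right(starts, new_start)
    inserted = pos < len(starts)
    starts.insert(pos, new_start)
    if inserted:
        # Scores that start later than the inserted datum are stale.
        for later in starts[pos + 1:]:
            formed.pop(later, None)
    return inserted


def form_missing_scores(starts, formed):
    """Form (placeholder) scores for every datum that has none."""
    for start in starts:
        formed.setdefault(start, f"score@{start}")


# Hypothetical reception order: three data in time order, then a late one.
starts, formed = [], {}
for t in (0, 960, 1440):
    write_to_reference_score(starts, formed, t)
form_missing_scores(starts, formed)
was_inserted = write_to_reference_score(starts, formed, 480)
```

After the late datum is written, only the score that starts earlier than it survives; the scores for the inserted datum and for the later data are then formed anew by `form_missing_scores`.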
When the processing in the step S78 is completed or if the answer to the question of the step S76 is negative (N), the process proceeds to a step S80, wherein a phonetic unit track-forming process is carried out. This process in the step S80 forms a phonetic unit track TP based on performance data, the management data formed in the step S74, and the stored score data (score data of the preceding performance data). The details of the process will be described hereinafter with reference to
In a step S82, a transition track TR is formed based on the performance information, the management data formed in the step S74, the stored score data, and the phonetic unit track TP. The details of the process in the step S82 will be described hereinafter with reference to
In a step S84, a vibrato track TB is formed based on the performance information, the management data formed in the step S74, the stored score data, and the phonetic unit track TP. The details of the process in the step S84 will be described hereinafter with reference to
In a step S86, score data for the next performance data is formed based on the performance information, the management data formed in the step S74, the phonetic unit track TP, the transition track TR, and the vibrato track TB, and stored. The score data contains an NtN transition time length from the preceding vowel. As shown in
When the performance data is obtained in a step S90, at the following step S92, the singing phonetic unit in the performance data is analyzed. The information of a phonetic unit state represents a combination of a consonant and a vowel, a vowel alone, or a voiced consonant alone. In the following, for convenience, the combination of a consonant and a vowel will be referred to as PhU State=Consonant Vowel, and the vowel alone or the voiced consonant alone as PhU State=Vowel. The information of a phoneme represents the name of a phoneme (name of a consonant and/or name of a vowel), the category of the consonant (nasal sound, plosive sound, half vowel, etc.), whether the consonant is voiced or unvoiced, and so forth.
In a step S94, the pitch of a singing voice in the performance data is analyzed, and the analyzed pitch of the singing voice is set as the pitch information “Pitch”. In a step S96, the actual singing time in the performance data is analyzed, and the actual singing-starting time point of the analyzed actual singing time is set as the current note-on information “Current Note On”. Further, the actual singing length is set as the current note duration information “Current Note Duration”, and a time point later than the actual singing-starting time point by the actual singing length is set as the current note-off information “Current Note Off”.
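The relation between the three items of current note information set in the step S96 can be written down as a short Python sketch. The class and field names are hypothetical, and the clock values are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CurrentNote:
    """Management data items derived in the steps S94 and S96 (names assumed)."""
    pitch: str      # "Pitch"
    note_on: int    # "Current Note On": actual singing-starting time point
    duration: int   # "Current Note Duration": actual singing length

    @property
    def note_off(self) -> int:
        # "Current Note Off" is later than the note-on by the singing length.
        return self.note_on + self.duration


# Hypothetical values in clocks.
note = CurrentNote(pitch="C3", note_on=480, duration=240)
```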
As the current note-on information, the time point obtained by modifying the actual singing-starting time point may be employed. For example, a time point (t0±Δt, where t0 indicates the actual singing-starting time point) obtained by randomly changing the actual singing-starting time point through a random number-generating process or the like, by Δt within a predetermined time range (indicated by two broken lines in
In a step S98, by using the management data of preceding performance data, the singing time points of the present performance data are analyzed. In the management data of the preceding performance data, the information “Preceding Event Number” represents the number of preceding performance data received, of which the rearrangement has been completed. The data “Preceding Score Data” is score data formed and stored in the step S86 when a singing voice synthesis score was formed concerning the preceding performance data. The information “Preceding Note Off” represents a time point at which the preceding actual singing should be terminated. The information “Event State” represents a state of connection (whether silence is interposed) between a preceding singing event and a current singing event determined based on the information “Preceding Note Off” and the current note-on information. In the following, for convenience, a state in which the current singing event is continuous from the preceding singing event (i.e. without silence), as shown in
Next, the phonetic unit track-forming process will be described with reference to
In a step S104, based on the management data, it is determined whether or not Event State=Attack holds. If the answer to this question is affirmative (Y), it means that preceding silence exists, and in a step S106, a silence singing length is calculated. The details of the processing in the step S106 will be described hereinafter with reference to
If the answer to the determination in the step S104 is negative (N), it means that Event State=Transition holds, and hence a preceding vowel exists, so that in a step S108, a preceding vowel singing length is calculated. The details of the process in the step S108 will be described hereinafter with reference to
When the processing in the step S106 or S108 is completed, in a step S110, a vowel singing length is calculated. The details of the processing in the step S110 will be described hereinafter with reference to
In a step S112, management data and score data are obtained. Then, in a step S114, all phonetic unit transition time lengths (phonetic unit transition time lengths obtained in steps S116, S122, S124, S126, S130, S132, S134, all hereinafter referred to) are initialized.
In a step S116, a phonetic unit transition time length of V_Sil (vowel to silence) is retrieved from the DB 14b based on the management data. Assuming, for example, that the vowel is “a”, and the pitch of the vowel is “P1”, the phonetic unit transition time length corresponding to “a_Sil” and “P1” is retrieved from the DB 14b. The processing in the step S116 is related to the fact that in the Japanese language syllables terminate in a vowel.
In a step S118, based on the management data, it is determined whether or not Event State=Attack holds. If the answer to this question is affirmative (Y), it is determined based on the management data in a step S120 whether or not PhU State=Consonant Vowel holds. If the answer to this question is affirmative (Y), a phonetic unit transition time length of Sil_C (silence to consonant) is retrieved from the DB 14b based on the management data in a step S122. Thereafter, in a step S124, based on the management data, a phonetic unit transition time length of C_V (consonant to vowel) is retrieved from the DB 14b.
If the answer to the question of the step S120 is negative (N), it means that PhU State=Vowel holds, so that in a step S126, a phonetic unit transition time length of Sil_V is retrieved from the DB 14b based on the management data. It should be noted that the details of the manner of retrieving the transition time lengths at the respective steps S122 to S126 are the same as described as to the step S116.
If the answer to the question of the step S118 is negative (N), similarly to the step S120, it is determined in a step S128 whether or not PhU State=Consonant Vowel holds. If the answer to this question is affirmative (Y), in a step S130, based on the management data and the score data, a phonetic unit transition time length of pV_C (preceding vowel to consonant) is retrieved from the DB 14b. Assuming, for example, that the score data indicates that the preceding vowel is “a”, and the management data indicates that the consonant is “s” and its pitch is “P2”, a phonetic unit transition time length corresponding to “a_s” and “P2” is retrieved from the DB 14b. Thereafter, in a step S132, similarly to the step S116, a phonetic unit transition time length of C_V (consonant to vowel) is retrieved from the DB 14b based on the management data.
If the answer to the question of the step S128 is negative (N), the process proceeds to a step S134, wherein similarly to the step S130, a phonetic unit transition time length of pV_V (preceding vowel to vowel) is retrieved from the DB 14b based on the management data and the score data.
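The branching of the steps S116 to S134 described above can be summarized in a short Python sketch. The function names and the dictionary standing in for the DB 14b are hypothetical, as are the time lengths; only the selection logic follows the description.

```python
def transitions_to_retrieve(event_state, phu_state, vowel,
                            consonant=None, preceding_vowel=None):
    """Return the names of the transitions whose time lengths are retrieved."""
    if phu_state == "Consonant Vowel" and consonant is None:
        raise ValueError("a consonant is required for PhU State=Consonant Vowel")
    if event_state == "Transition" and preceding_vowel is None:
        raise ValueError("a preceding vowel is required for Event State=Transition")

    names = [f"{vowel}_Sil"]                          # step S116: V_Sil
    if event_state == "Attack":                       # step S118
        if phu_state == "Consonant Vowel":            # step S120
            names.append(f"Sil_{consonant}")          # step S122: Sil_C
            names.append(f"{consonant}_{vowel}")      # step S124: C_V
        else:
            names.append(f"Sil_{vowel}")              # step S126: Sil_V
    else:
        if phu_state == "Consonant Vowel":            # step S128
            names.append(f"{preceding_vowel}_{consonant}")  # step S130: pV_C
            names.append(f"{consonant}_{vowel}")            # step S132: C_V
        else:
            names.append(f"{preceding_vowel}_{vowel}")      # step S134: pV_V
    return names


def retrieve_lengths(db, names, pitch):
    """Look up each transition time length by (name, pitch) in a stand-in DB."""
    return {name: db[(name, pitch)] for name in names}


# Hypothetical stand-in for the DB 14b, with made-up time lengths.
db_14b = {("a_Sil", "P2"): 60, ("a_s", "P2"): 45, ("s_a", "P2"): 50}
names = transitions_to_retrieve("Transition", "Consonant Vowel",
                                vowel="a", consonant="s", preceding_vowel="a")
lengths = retrieve_lengths(db_14b, names, "P2")
```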
First, in a step S136, performance data, management data and score data are obtained. In a step S138, it is determined whether or not PhU State=Consonant Vowel holds. If the answer to this question is affirmative (Y), in a step S140, a consonant singing length is calculated. In this case, as shown in
In a step S142, the silence singing length is calculated. As shown in
First, in a step S146, performance data, management data, and score data are obtained. In a step S148, it is determined whether or not PhU State=Consonant Vowel holds. If the answer to this question is affirmative (Y), in a step S150, the consonant singing length is calculated. In this case, as shown in
Then, in a step S152, the preceding vowel singing length is calculated. As shown in
First, in a step S154, performance information, management data and score data are obtained. In a step S156, the vowel singing length is calculated. In this case, until the next performance data is received, a vowel connecting portion is not made definite. Therefore, it is assumed that “silence is interposed between the present performance data and the next performance data”, and as shown in
When the next performance data is received, the state of connection (Event State) between the present performance data and the next performance data becomes definite, and if Event State=Attack holds for the next performance data, the vowel singing length of the present performance data is not updated, while if Event State=Transition holds for the next performance data, the vowel singing length of the present performance data is updated by the process in the step S152 described above.
First in a step S160, performance information, management data, score data, and data of the phonetic unit track are obtained. In a step S162, an attack transition time length is calculated. To this end, the state transition time length of an attack transition state Attack corresponding to a singing attack type, a phonetic unit, and pitch, is retrieved from the state transition DB 14c shown in
In a step S164, a release transition time length is calculated. To this end, the state transition time length of a release transition state Release corresponding to a singing release type, a phonetic unit, and pitch, is retrieved from the state transition DB 14c based on the performance information and the management data. Then, the retrieved state transition time length is multiplied by a singing release expansion/compression ratio in the performance information to obtain the release transition time length (duration time of the release portion).
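The calculation in the steps S162 and S164 amounts to a table lookup followed by a multiplication, as the following Python sketch illustrates. The dictionary standing in for the state transition DB 14c, its key layout, the type name "normal", and all numeric values are assumptions made for illustration.

```python
def scaled_transition_length(db, state, kind, unit, pitch, ratio):
    """Retrieve a state transition time length and scale it by a ratio.

    db    : stand-in for the state transition DB 14c (hypothetical key layout)
    state : "Attack" or "Release"
    kind  : singing attack type or singing release type
    ratio : expansion/compression ratio from the performance information
    """
    if ratio <= 0:
        raise ValueError("expansion/compression ratio must be positive")
    base = db[(state, kind, unit, pitch)]
    return base * ratio


# Hypothetical entries and ratios.
db_14c = {
    ("Attack", "normal", "a", "C3"): 40,
    ("Release", "normal", "a", "C3"): 30,
}
attack_len = scaled_transition_length(db_14c, "Attack", "normal", "a", "C3", 1.5)
release_len = scaled_transition_length(db_14c, "Release", "normal", "a", "C3", 0.5)
```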
In a step S166, an NtN transition time length is obtained. More specifically, from score data stored in the step S86 in
In a step S168, it is determined whether or not Event State=Attack holds. If the answer to this question is affirmative (Y), a NONE transition time length corresponding to the silence portion (referred to as “NONEn transition time length”) is calculated in a step S170. More specifically, in the case of PhU State=Consonant Vowel, as shown in
In the step S170, the NONE transition time length corresponding to the steady portion (referred to as “NONEs transition time length”) is calculated. In this case, until the next performance data is received, the state of connection following the NONEs transition time length is not made definite. Therefore, it is assumed that “silence is interposed between the present performance data and the next performance data”, and as shown in
If the answer to the question of the step S168 is negative (N), in a step S174, a NONE transition time length corresponding to the steady portion of the preceding performance data (referred to as “pNONEs transition time length”) is calculated. Since the reception of the present performance data has made definite the state of connection with the preceding performance data, the NONEs transition time length and the preceding release transition time length formed based on the preceding performance data are discarded. More specifically, the assumption “silence is interposed between the present performance data and the next performance data” employed in the processing in a step S176, described hereinafter, is annulled. In the step S174, as shown in
In the step S176, the NONE transition time length corresponding to the steady portion (NONEs transition time length) is calculated. In this case, until the next performance data is received, the state of connection with the NONEs transition time length is not made definite. Therefore, it is assumed that “silence is interposed between the present performance data and the next performance data”, and as shown in
First, in a step S180, performance information, management data, score data, and data of a phonetic unit track are obtained. In a step S182, it is determined based on the obtained data whether or not the vibrato event should be continued. If vibrato is started at the actual singing-starting time point of the present performance data, and at the same time the vibrato-added state is continued from the preceding performance data, the answer to this question is affirmative (Y), so that the process proceeds to a step S184. On the other hand, although vibrato is started at the actual singing-starting time point of the present performance data, the vibrato-added state is not continued from the preceding performance data, or if vibrato is not started at the actual singing-starting time point of the present performance data, the answer to this question is negative (N), so that the process proceeds to a step S188.
In many cases, vibrato is sung over a plurality of performance data (notes). Even if vibrato is started at the actual singing-starting time point of the present performance data, there are a case as shown in
In the step S188, it is determined based on the obtained data whether or not the non-vibrato event should be continued. In the
If the vibrato event is to be continued, in the step S184, the preceding vibrato time length is discarded. Then, in a step S186, a new vibrato time length is calculated by connecting (adding) together the preceding vibrato time length and a vibrato time length of vibrato to be started at the actual singing-starting time point of the present note. Then, the process proceeds to the step S194.
If the non-vibrato event is to be continued, in the step S190, the preceding non-vibrato event time length is discarded. Then, a new non-vibrato event time length is calculated by connecting (adding) together the preceding non-vibrato time length and a non-vibrato time length of non-vibrato to be started at the actual singing-starting time point of the present note. Then, the process proceeds to the step S194.
In the step S194, it is determined whether or not the vibrato time length should be added. If the answer to this question is affirmative (Y), first, in a step S196, a non-additional vibrato time length is calculated. More specifically, a non-vibrato time length from the trailing end of the vibrato time length calculated in the step S186 to a vibrato time length to be added is calculated as the non-additional vibrato time length.
Then, in a step S198, an additional vibrato time length is calculated. Then, the process returns to the step S194, wherein the above-described process is repeated. This makes it possible to add a plurality of additional vibrato time lengths.
If the answer to the question of the step S194 is negative (N), the non-vibrato time length is calculated in a step S200. More specifically, a time period from the final time point of a final vibrato event to the end time point of V_Sil within the actual singing time length (time length from Current Note On to Current Note Off) is calculated as the non-vibrato time length.
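The alternation of non-vibrato and vibrato time lengths produced by the steps S194 to S200 can be sketched in Python as follows. The representation of vibrato events as (start, length) pairs and the function name are assumptions for illustration; the sketch also does not cover the continuation of a vibrato or non-vibrato event from the preceding note (the steps S182 to S190).

```python
def vibrato_track_segments(track_start, track_end, vibrato_events):
    """Split [track_start, track_end) into vibrato and non-vibrato segments.

    vibrato_events : iterable of (start, length) pairs, in clocks
    track_end      : end time point of V_Sil for the present note
    Returns a list of (kind, start, length) tuples in time order.
    """
    segments = []
    cursor = track_start
    for start, length in sorted(vibrato_events):
        if start < cursor or start + length > track_end:
            raise ValueError("vibrato events must not overlap or leave the note")
        if start > cursor:
            # Gap before this vibrato event (the non-additional part).
            segments.append(("non-vibrato", cursor, start - cursor))
        segments.append(("vibrato", start, length))
        cursor = start + length
    if cursor < track_end:
        # From the end of the final vibrato event to the end of V_Sil.
        segments.append(("non-vibrato", cursor, track_end - cursor))
    return segments


# Hypothetical note spanning clocks 0 to 1000 with two vibrato events.
segments = vibrato_track_segments(0, 1000, [(200, 300), (700, 100)])
```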
Although in the above steps S142 to S152, the silence singing length or the preceding vowel singing length is calculated such that the singing-starting time point of the vowel of the present performance data coincides with the actual singing-starting time point, this is not limitative, but for the purpose of synthesizing more natural singing voices, the silence singing length, the preceding vowel singing length and the vowel singing length may be calculated as in (1) to (11) described below:
(1) For each of categories (unvoiced/voiced plosive sound, unvoiced/voiced fricative sound, nasal sound, half vowel, etc.) of consonants, a silence singing length, a preceding vowel singing length, and a vowel singing length are calculated.
The phonetic unit connection pattern shown in FIG. 39A corresponds to a case of the preceding vowel “a”-silence-“sa”. The silence singing length is calculated with the consonant singing length C being inserted to lengthen the consonant (“s” in this example) of a phonetic unit formed by a consonant and a vowel. The phonetic unit connection pattern shown in
In the examples shown in
(2) For each of consonants (“p”, “b”, “s”, “z”, “n”, “w”, etc.), a silence singing length, a preceding vowel singing length, and a vowel singing length are calculated.
(3) For each of vowels (“a”, “i”, “u”, “e”, “o”, etc.), a silence singing length, a preceding vowel singing length, and a vowel singing length are calculated.
(4) For each of the categories (unvoiced/voiced plosive sound, unvoiced/voiced fricative sound, nasal sound, half vowel, etc.) of consonants, and at the same time for each vowel (“a”, “i”, “u”, “o”, or the like) continued from the consonant, a silence singing length, a preceding vowel singing length and a vowel singing length are calculated. That is, for each combination of a category to which a consonant belongs and a vowel, the silence singing length, the preceding vowel singing length and the vowel singing length are calculated.
(5) For each of the consonants (“p”, “b”, “s”, “z”, “n”, “w”, etc.), and at the same time for each vowel continued from the consonant, a silence singing length, a preceding vowel singing length and a vowel singing length are calculated. That is, for each combination of a consonant and a vowel, the silence singing length, the preceding vowel singing length and the vowel singing length are calculated.
(6) For each of preceding vowels (“a”, “i”, “u”, “e”, “o”, etc.), a silence singing length, a preceding vowel singing length, and a vowel singing length are calculated.
(7) For each of the preceding vowels (“a”, “i”, “u”, “e”, “o”, etc.), and at the same time for each category (unvoiced/voiced plosive sound, unvoiced/voiced fricative sound, nasal sound, half vowel, or the like) of a consonant continued from the preceding vowel, a silence singing length, a preceding vowel singing length and a vowel singing length are calculated. That is, for each combination of a preceding vowel and a category to which a consonant belongs, the silence singing length, the preceding vowel singing length and the vowel singing length are calculated.
(8) For each of the preceding vowels (“a”, “i”, “u”, “e”, “o”, etc.), and at the same time for each consonant (“p”, “b”, “s”, “z”, “n”, “w”, or the like) continued from the preceding vowel, a silence singing length, a preceding vowel singing length and a vowel singing length are calculated. That is, for each combination of a preceding vowel and a consonant, the silence singing length, the preceding vowel singing length and the vowel singing length are calculated.
(9) For each of the preceding vowels (“a”, “i”, “u”, “e”, “o”, etc.), and at the same time for each vowel (“a”, “i”, “u”, “e”, “o”, or the like) continued from the preceding vowel, a silence singing length, a preceding vowel singing length and a vowel singing length are calculated. That is, for each combination of a preceding vowel and a vowel, the silence singing length, the preceding vowel singing length and the vowel singing length are calculated.
(10) For each of the preceding vowels (“a”, “i”, “u”, “e”, “o”, etc.), for each category (unvoiced/voiced plosive sound, unvoiced/voiced fricative sound, nasal sound, half vowel, or the like) of a consonant continued from the preceding vowel, and for each vowel (“a”, “i”, “u”, “e”, “o”, or the like) continued from the consonant, a silence singing length, a preceding vowel singing length and a vowel singing length are calculated. That is, for each combination of a preceding vowel, a category to which a consonant belongs, and a vowel, the silence singing length, the preceding vowel singing length and the vowel singing length are calculated.
(11) For each of the preceding vowels (“a”, “i”, “u”, “e”, “o”, etc.), for each consonant (“p”, “b”, “s”, “z”, “n”, “w”, or the like) continued from the preceding vowel, and for each vowel (“a”, “i”, “u”, “e”, “o”, or the like) continued from the consonant, a silence singing length, a preceding vowel singing length and a vowel singing length are calculated. That is, for each combination of a preceding vowel, a consonant, and a vowel, the silence singing length, the preceding vowel singing length and the vowel singing length are calculated.
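One way to organize such per-combination settings is a table keyed by the combination, as in the following Python sketch. The description above lists the groupings as alternatives; trying the more specific keys first and falling back to coarser ones is an assumption of this sketch, as are the key layout, the stored quantity (an extra consonant length in clocks), and all values.

```python
# Categories of the consonants named in the list above.
CONSONANT_CATEGORY = {
    "p": "unvoiced plosive",
    "b": "voiced plosive",
    "s": "unvoiced fricative",
    "z": "voiced fricative",
    "n": "nasal",
    "w": "half vowel",
}


def lookup_timing_rule(table, preceding_vowel, consonant, vowel, default=0):
    """Return the most specific timing setting available for a combination."""
    category = CONSONANT_CATEGORY.get(consonant)
    candidates = [
        ("pv+c+v", preceding_vowel, consonant, vowel),  # grouping (11)
        ("pv+c", preceding_vowel, consonant),           # grouping (8)
        ("c+v", consonant, vowel),                      # grouping (5)
        ("c", consonant),                               # grouping (2)
        ("cat", category),                              # grouping (1)
    ]
    for key in candidates:
        if key in table:
            return table[key]
    return default


# Hypothetical table: extra consonant length in clocks.
rules = {
    ("pv+c+v", "a", "s", "a"): 40,
    ("c", "s"): 25,
    ("cat", "nasal"): 10,
}
```

For example, with this assumed table the combination “a”, “s”, “a” resolves to the fully specific entry, whereas “i”, “s”, “a” falls back to the per-consonant entry for “s”.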
The present invention is by no means limited to the embodiment described hereinabove by way of example, but can be practiced in various modifications and variations. Examples of such modifications and variations include the following:
(1) Although in the above described embodiment, after completing the forming of a singing voice synthesis score, singing voices are synthesized according to the singing voice synthesis score, this is not limitative, but while forming a singing voice synthesis score, singing voices may be synthesized based on the formed portion of the score. To do this, it is only required that, while the reception of performance data is preferentially performed by an interrupt handling routine, the singing voice synthesis score be formed based on the received portion of the performance data.
(2) Although in the above embodiment, the formant-forming method is employed for the tone generation method, this is not limitative but a waveform processing method or other suitable method may be employed.
(3) Although in the above embodiment, the singing voice synthesis score is formed by three tracks of a phonetic unit track, a transition track and a vibrato track, this is not limitative, but the same may be formed by a single track. To this end, information of the transition track and the vibrato track may be inserted into the phonetic unit track, as required.
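As an illustration of modification (3), the following Python sketch merges the entries of the three tracks into a single time-ordered track. The representation of each entry as a (start, duration, label) tuple, the function name, and the sample values are assumptions made for illustration.

```python
def merge_into_single_track(phonetic, transition, vibrato):
    """Merge three tracks into one list ordered by starting time point.

    Each input is a list of (start, duration, label) tuples. Entries keep a
    tag naming their original track so that they remain distinguishable.
    """
    tagged = (
        [(s, d, "phonetic", label) for s, d, label in phonetic]
        + [(s, d, "transition", label) for s, d, label in transition]
        + [(s, d, "vibrato", label) for s, d, label in vibrato]
    )
    # sorted() is stable, so entries with equal start times keep the
    # order phonetic, transition, vibrato.
    return sorted(tagged, key=lambda entry: entry[0])


# Hypothetical track contents in clocks.
single = merge_into_single_track(
    phonetic=[(0, 100, "Sil_s"), (100, 50, "s_a")],
    transition=[(100, 80, "Attack")],
    vibrato=[(160, 200, "Vibrato on")],
)
```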
It goes without saying that the above described embodiment, modifications or variations may be realized even in the form of a program as software to thereby accomplish the object of the present invention.
Further, it also goes without saying that the object of the present invention may be accomplished by supplying a storage medium in which is stored software program code executing the singing voice-synthesizing method or realizing the functions of the singing voice-synthesizing apparatus according to the above described embodiment, modifications or variations, and causing a computer (CPU or MPU) of the apparatus to read out and execute the program code stored in the storage medium.
In this case, the program code itself read out from the storage medium achieves the novel functions of the above embodiment, modifications or variations, and the storage medium storing the program constitutes the present invention.
The storage medium for supplying the program code to the system or apparatus may be in the form of a floppy disk, a hard disk, an optical memory disk, a magneto-optical disk, a CD-ROM, a CD-R (CD-Recordable), a DVD-ROM, a semiconductor memory, a magnetic tape, a nonvolatile memory card, or a ROM, for example. Further, the program code may be supplied from a server computer via a MIDI apparatus or a communication network.
Further, needless to say, not only the functions of the above embodiment, modifications or variations can be realized by carrying out the program code read out by the computer but also an OS (operating system) or the like operating on the computer can carry out part or whole of actual processing in response to instructions of the program code, thereby making it possible to implement the functions of the above embodiment, modifications or variations.
Furthermore, it goes without saying that after the program code read out from the storage medium has been written in a memory incorporated in a function extension board inserted in the computer or in a function extension unit connected to the computer, a CPU or the like arranged in the function extension board or the function extension unit may carry out part or whole of actual processing in response to the instructions of the program code, thereby making it possible to achieve the functions of the above embodiment, modifications or variations.
Number | Date | Country | Kind |
---|---|---|---|
2000-402880 | Dec 2000 | JP | national |
Number | Date | Country | |
---|---|---|---|
20030009344 A1 | Jan 2003 | US |