The present invention relates to an electronic musical instrument that generates a singing voice in accordance with the operation of an operation element on a keyboard or the like, an electronic musical instrument control method, and a storage medium.
In one conventional technology, an electronic musical instrument is configured so as to generate a singing voice (vocals) in accordance with the operation of an operation element on a keyboard or the like (for example, see Patent Document 1). This conventional technology includes a keyboard operation element for instructing pitch, a storage unit in which lyric data is stored, an instruction unit that gives an instruction to read lyric data from the storage unit, a read-out unit that sequentially reads lyric data from the storage unit when there has been an instruction from the instruction unit, and a sound source that generates a singing voice at a pitch instructed by the keyboard operation element and with a tone color corresponding to the lyric data read by the read-out unit.
However, with conventional technology such as that described above, suppose that singing voices corresponding to lyrics are to be output in time with the progression of accompaniment data output by the electronic musical instrument. If a singing voice corresponding to the lyrics is output each time a key is specified by a user, no matter which key has been specified, then depending on the way the keys are specified, the progression of the accompaniment data and the singing voices being output may fall out of time with one another. For example, in cases where a single measure contains four musical notes whose sound-generation timings are mutually distinct, the lyrics will run ahead of the progression of the accompaniment data when a user specifies more than four pitches within this single measure, and the lyrics will lag behind the progression of the accompaniment data when a user specifies three or fewer pitches within this single measure.
If lyrics are progressively advanced in this manner each time a user specifies a pitch with a keyboard or the like, the lyrics may, for example, run too far ahead of the accompaniment, or conversely, the lyrics may lag too far behind the accompaniment.
A similar issue exists with respect to the progression of lyrics even when no accompaniment data is output, that is, when only a singing voice is output. Accordingly, the present invention is directed to a scheme that substantially obviates one or more of the problems due to limitations and disadvantages of the related art.
Additional or separate features and advantages of the invention will be set forth in the descriptions that follow and in part will be apparent from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
To achieve these and other advantages and in accordance with the purpose of the present invention, as embodied and broadly described, in one aspect, the present disclosure provides an electronic musical instrument that includes: a performance receiver having a plurality of operation elements to be performed by a user for respectively specifying different pitches of musical notes; a memory that stores musical piece data that includes data of a vocal part, the vocal part including at least first and second notes and respectively associated first and second lyric parts that are to be successively played in the order of the first note and then the second note, wherein the first note has a first pitch and the second note has a second pitch; and at least one processor, wherein the at least one processor performs the following: when the user specifies, via the performance receiver, the first pitch, digitally synthesizing a first singing voice that includes the first lyric part and that has the first pitch in accordance with data of the first note stored in the memory, and causing the digitally synthesized first singing voice to be audibly output; and if the user specifies, via the performance receiver, a third pitch that is different from the second pitch successively after specifying the first pitch, instead of the second pitch of the second note that should have been specified, synthesizing a modified first singing voice that has the third pitch in accordance with data of the first lyric part, and causing the digitally synthesized modified first singing voice to be audibly output without causing the second lyric part of the second note to be audibly output.
In another aspect, the present disclosure provides a method performed by the at least one processor in the above-mentioned electronic musical instrument, the method including the above-mentioned features performed by the at least one processor.
In another aspect, the present disclosure provides a non-transitory computer-readable storage medium having stored thereon a program executable by the above-mentioned at least one processor in the above-mentioned electronic musical instrument, the program causing the at least one processor to perform the above-mentioned features performed by the at least one processor.
According to the present invention, an electronic musical instrument that satisfactorily controls the progression of lyrics can be provided.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory, and are intended to provide further explanation of the invention as claimed.
Embodiments of the present invention will be described in detail below with reference to the drawings.
While using the RAM 203 as working memory, the CPU 201 executes a control program stored in the ROM 202 and thereby controls the operation of the electronic keyboard instrument 100 in
The CPU 201 is provided with the timer 210 used in the present embodiment. The timer 210, for example, counts the progression of automatic performance in the electronic keyboard instrument 100.
In accordance with a sound generation control instruction from the CPU 201, the sound source LSI 204 reads musical sound waveform data from a non-illustrated waveform ROM, for example, and outputs the musical sound waveform data to the D/A converter 211. The sound source LSI 204 is capable of 256-voice polyphony.
When the voice synthesis LSI 205 is given, as music data 215, information relating to lyric text data, pitch, duration, and starting frame by the CPU 201, the voice synthesis LSI 205 synthesizes voice data for a corresponding singing voice and outputs this voice data to the D/A converter 212.
The key scanner 206 regularly scans the pressed/released states of the keys on the keyboard 101 and the operation states of the switches on the first switch panel 102 and the second switch panel 103 in
The LCD controller 208 is an integrated circuit (IC) that controls the display state of the LCD 104.
The voice synthesis LSI 205 includes a voice training section 301 and a voice synthesis section 302. The voice training section 301 includes a training text analysis unit 303, a training acoustic feature extraction unit 304, and a model training unit 305.
The training text analysis unit 303 is input with musical score data 311 including lyric text, pitches, and durations, and the training text analysis unit 303 analyzes this data. In other words, the musical score data 311 includes training lyric data and training pitch data. The training text analysis unit 303 accordingly estimates and outputs a training linguistic feature sequence 313, which is a discrete numerical sequence expressing, inter alia, phonemes, parts of speech, words, and pitches corresponding to the musical score data 311.
The training acoustic feature extraction unit 304 receives and analyzes singing voice data 312 that has been recorded via a microphone or the like when a given singer sang the aforementioned lyric text. The training acoustic feature extraction unit 304 accordingly extracts and outputs a training acoustic feature sequence 314 representing phonetic features corresponding to the singing voice data for a given singer 312.
In accordance with Equation (1) below, the model training unit 305 uses machine learning to estimate an acoustic model λ̂ that maximizes the likelihood P(o | l, λ) that a training acoustic feature sequence 314 (o) will be generated given a training linguistic feature sequence 313 (l) and an acoustic model (λ). In other words, the relationship between a linguistic feature sequence (text) and an acoustic feature sequence (voice sounds) is expressed using a statistical model, which here is referred to as an acoustic model.
λ̂ = argmax_λ P(o | l, λ)    (1)
The model training unit 305 outputs, as the training result 315, the model parameters expressing the acoustic model λ̂ that have been calculated using Equation (1) through machine learning, and the training result 315 is set in an acoustic model unit 306 in the voice synthesis section 302.
The voice synthesis section 302 includes a text analysis unit 307, an acoustic model unit 306, and a vocalization model unit 308. The voice synthesis section 302 performs statistical voice synthesis processing in which singing voice inference data for a given singer 217, corresponding to music data 215 including lyric text, is synthesized by making predictions using the statistical model, referred to herein as an acoustic model, set in the acoustic model unit 306.
As a result of a performance by a user made in concert with an automatic performance, the text analysis unit 307 is input with music data 215, which includes information relating to lyric text data, pitch, duration, and starting frame, specified by the CPU 201 in
The acoustic model unit 306 is input with the linguistic feature sequence 316, and using this, the acoustic model unit 306 estimates and outputs an acoustic feature sequence 317 corresponding thereto. In other words, in accordance with Equation (2) below, the acoustic model unit 306 estimates the value ô of the acoustic feature sequence 317 that maximizes the likelihood P(o | l, λ̂) that an acoustic feature sequence 317 (o) will be generated given the linguistic feature sequence 316 (l) input from the text analysis unit 307 and the acoustic model λ̂ set using the training result 315 of the machine learning performed in the model training unit 305.
ô = argmax_o P(o | l, λ̂)    (2)
The vocalization model unit 308 is input with the acoustic feature sequence 317. With this, the vocalization model unit 308 generates singing voice inference data for a given singer 217 corresponding to the music data 215 including lyric text specified by the CPU 201. The singing voice inference data for a given singer 217 is output from the D/A converter 212, goes through the mixer 213 and the amplifier 214 in
The acoustic features expressed by the training acoustic feature sequence 314 and the acoustic feature sequence 317 include spectral information that models the vocal tract of a person, and sound source information that models the vocal cords of a person. A mel-cepstrum, line spectral pairs (LSP), or the like may be employed for the spectral parameters. A fundamental frequency (F0) indicating the pitch frequency of the voice of a person may be employed for the sound source information. The vocalization model unit 308 includes a sound source generator 309 and a synthesis filter 310. The sound source generator 309 is sequentially input with a sound source information 319 sequence from the acoustic model unit 306. Thereby, the sound source generator 309, for example, generates a sound source signal that periodically repeats at a fundamental frequency (F0) contained in the sound source information 319 and is made up of a pulse train (for voiced phonemes) with a power value contained in the sound source information 319 or is made up of white noise (for unvoiced phonemes) with a power value contained in the sound source information 319. The synthesis filter 310 forms a digital filter that models the vocal tract on the basis of a spectral information 318 sequence sequentially input thereto from the acoustic model unit 306, and using the sound source signal input from the sound source generator 309 as an excitation signal, generates and outputs singing voice inference data for a given singer 217 in the form of a digital signal.
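The source-filter structure of the vocalization model unit described above can be illustrated with a minimal sketch. This is not the embodiment's implementation: the one-pole filter below merely stands in for a vocal-tract filter built from spectral parameters such as mel-cepstra, and the function names are hypothetical.

```python
import random

# Hedged sketch of the vocalization model's source-filter idea: a pulse
# train (voiced) or white noise (unvoiced) excitation signal is passed
# through a filter standing in for the vocal tract.

def excitation(n_samples, f0, sample_rate, voiced, power=1.0):
    """Generate a sound-source signal, as the sound source generator does."""
    if voiced:
        period = int(sample_rate / f0)        # samples per F0 cycle
        return [power if i % period == 0 else 0.0 for i in range(n_samples)]
    return [random.uniform(-power, power) for _ in range(n_samples)]

def synthesis_filter(source, a=0.9):
    """One-pole resonator as a placeholder vocal-tract filter."""
    out, prev = [], 0.0
    for s in source:
        prev = s + a * prev                   # y[n] = x[n] + a * y[n-1]
        out.append(prev)
    return out

voiced_frame = synthesis_filter(excitation(160, 200.0, 16000, voiced=True))
```

At a 16 kHz sampling rate and F0 = 200 Hz, the pulse train repeats every 80 samples, which is the periodicity the filter then shapes.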
In the present embodiment, in order to predict an acoustic feature sequence 317 from a linguistic feature sequence 316, the acoustic model unit 306 is implemented using a deep neural network (DNN). Correspondingly, the model training unit 305 in the voice training section 301 learns model parameters representing non-linear transformation functions for neurons in the DNN that transform linguistic features into acoustic features, and the model training unit 305 outputs, as the training result 315, these model parameters to the DNN of the acoustic model unit 306 in the voice synthesis section 302.
Normally, acoustic features are calculated in units of frames that, for example, have a width of 5.1 msec, and linguistic features are calculated in phoneme units. Accordingly, the unit of time for linguistic features differs from that for acoustic features. The DNN acoustic model unit 306 is a model that represents a one-to-one correspondence between the input linguistic feature sequence 316 and the output acoustic feature sequence 317, and so the DNN cannot be trained using an input-output data pair having differing units of time. Thus, in the present embodiment, the correspondence between acoustic feature sequences given in frames and linguistic feature sequences given in phonemes is established in advance, whereby pairs of acoustic features and linguistic features given in frames are generated.
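The phoneme-to-frame correspondence described above can be sketched as follows; the 5.1 msec frame width is taken from the text, while the data layout and function name are hypothetical. Each phoneme-level linguistic feature is repeated for every frame the phoneme spans, yielding frame-level pairs of linguistic and acoustic features.

```python
FRAME_MSEC = 5.1   # frame width from the embodiment's description

def phonemes_to_frames(phonemes):
    """phonemes: list of (linguistic_feature, duration_msec) tuples.
    Returns one linguistic feature per frame."""
    frame_seq = []
    for feature, duration_msec in phonemes:
        n_frames = round(duration_msec / FRAME_MSEC)
        frame_seq.extend([feature] * n_frames)
    return frame_seq

# A 51 msec phoneme spans 10 frames; a 102 msec phoneme spans 20 frames.
frame_seq = phonemes_to_frames([("k", 51.0), ("i", 102.0)])
```

The resulting per-frame linguistic features can then be paired one-to-one with the per-frame acoustic features for DNN training.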
The model training unit 305 in the voice training section 301 in
During voice synthesis, a linguistic feature sequence 316 phoneme sequence (corresponding to (b) in
The vocalization model unit 308, as depicted using the group of heavy solid arrows 403 in
The DNN is trained so as to minimize squared error. This is computed according to Equation (3) below using pairs of acoustic features and linguistic features denoted in frames.
λ̂ = argmin_λ ½ Σ_{t=1}^{T} ‖o_t − g_λ(l_t)‖²    (3)
In this equation, o_t and l_t respectively represent the acoustic feature and the linguistic feature in the t-th frame, λ̂ represents the model parameters for the DNN of the acoustic model unit 306, and g_λ(·) is the non-linear transformation function represented by the DNN. The model parameters for the DNN are able to be efficiently estimated through backpropagation. When correspondence with the processing within the model training unit 305 in the statistical voice synthesis represented by Equation (1) is taken into account, DNN training can be represented as in Equation (4) below.

λ̂ = argmax_λ Π_{t=1}^{T} N(o_t | μ̃_t, Σ̃_g)    (4)
Here, μ̃_t is given as in Equation (5) below.
μ̃_t = g_λ(l_t)    (5)
As in Equation (4) and Equation (5), relationships between acoustic features and linguistic features are able to be expressed using the normal distribution N(o_t | μ̃_t, Σ̃_g), which uses output from the DNN for the mean vector. Normally, in statistical voice synthesis processing employing a DNN, a covariance matrix that is independent of the linguistic features l_t is used; in other words, the same covariance matrix Σ̃_g is used in all frames. When the covariance matrix Σ̃_g is an identity matrix, Equation (4) expresses a training process equivalent to that in Equation (3).
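As a hedged illustration of the Equation (3) criterion only, the following toy example fits a one-parameter linear map in place of a DNN, with plain gradient descent standing in for backpropagation; the frame data are invented for illustration.

```python
# Toy illustration of the squared-error criterion of Equation (3):
# the parameter is chosen to minimize 1/2 * sum over frames t of
# (o_t - g_lambda(l_t))^2, where g_lambda is here just w * l.

frames = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # invented (l_t, o_t) pairs

w = 0.0        # model parameter (stands in for the DNN weights)
lr = 0.01      # learning rate
for _ in range(2000):
    # gradient of 1/2 * sum (w*l - o)^2 with respect to w
    grad = sum((w * l - o) * l for l, o in frames)
    w -= lr * grad

# w converges to the least-squares solution sum(l*o) / sum(l*l)
```

The closed-form least-squares solution here is 28.5 / 14 ≈ 2.036, which the descent loop approaches; a DNN replaces w·l with a non-linear g_λ(·) but minimizes the same kind of per-frame squared error.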
As described in
Detailed description follows regarding the operation of the present embodiment, configured as in the examples of
The timings t1, t2, t3, t4 in
The following control is performed in cases where, at any one of the original vocalization timings, a user has pressed a key on the keyboard 101 in
For example, in
While the progression of lyrics and the progression of automatic accompaniment are being controlled in the manner described above such that no singing voice is output, when a pitch specified by a user key press matches the pitch that should have been specified, the CPU 201 resumes the progression of lyrics and the progression of automatic accompaniment. For example, after the progression of lyrics and the progression of automatic accompaniment has been stopped at timing t3 in
When the progression of lyrics and the progression of automatic accompaniment have been resumed as described above, the vocalization timing t4 for the “Ra/kle” (the fourth character(s)) lyric data that is to be vocalized next following vocalization of the “i/in” (the third character(s)) singing voice in the “Ki/twin” (the third character(s)) lyric data that was resumed at timing t3′ in
Cases where the pitch that was specified does not match the pitch that should have been specified encompass cases where there is no key press corresponding to a timing at which a specification should have been made. In other words, although not illustrated in
Timing t3 in
In this way, in cases where a user has not specified the correct pitch matching the pitch that should have been specified at an original vocalization timing, the progression of lyrics and the progression of automatic accompaniment fall out of time with one another if the control operation of the present embodiment is not performed, and the user must correct the progression of lyrics each time this happens. However, in the present embodiment, the progression of lyrics and the progression of automatic accompaniment are stopped until the user specifies the correct pitch matching the pitch that should have been specified. This enables natural lyric progression in time with a user performance.
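The stop-and-wait control described above might be sketched as follows, with hypothetical names: state["running"] plays the role of a SongStart-like progression flag, and a missing key press is modeled as None.

```python
# Hedged sketch: at an original vocalization timing, a mismatched (or
# missing) key press stops both lyric and accompaniment progression;
# progression resumes only when the correct pitch is specified.

def on_vocalization_timing(expected_pitch, pressed_pitch, state):
    """state holds a SongStart-like 'running' flag; pressed_pitch may be
    None when no key was pressed at this timing."""
    if pressed_pitch == expected_pitch:
        state["running"] = True      # resume lyrics and accompaniment
        return "vocalize"            # output the singing voice
    state["running"] = False         # stop lyrics and accompaniment
    return "wait"                    # no singing voice is output

state = {"running": True}
on_vocalization_timing(64, 60, state)    # wrong key: progression stops
on_vocalization_timing(64, 64, state)    # correct key: progression resumes
```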
If, at a timing other than an original vocalization timing, a user has performed a key press operation on a key (operation element) on the keyboard 101 in
That is, in cases where a first user operation has been performed on a first operation element such that a singing voice corresponding to a first character(s) indicated by lyric data included in music data is output at a first pitch, the CPU 201 receives first operation information that indicates “note on” resulting from the first user operation, and, based on the first operation information, the CPU 201 outputs the singing voice corresponding to the first character(s) at the first pitch.
In cases where a second user operation has been performed on an operation element for a pitch different than the first pitch and the second pitch among the plurality of operation elements while the singing voice corresponding to the first character(s) and the first pitch is being output and prior to the arrival of the second timing, the CPU 201 receives second operation information that indicates “note on” resulting from the second user operation, and, based on the second operation information, the CPU 201 outputs, at the pitch that is different than the first pitch and the second pitch, the singing voice corresponding to the first character(s) being output without outputting a singing voice corresponding to the second character(s).
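A minimal sketch of this behavior follows, with hypothetical names; in the embodiment the synthesis itself is performed by the voice synthesis LSI 205. While the first character(s) is sounding and before the second timing arrives, a key press whose pitch differs from both the first and second pitches re-outputs the first character(s) at the new pitch rather than advancing to the second character(s).

```python
# Hedged sketch of the second-operation handling described above.

def on_key_press_before_second_timing(first, second, pressed_pitch):
    """first/second are (character, pitch) pairs for the current note and
    the next note of the vocal part."""
    first_char, first_pitch = first
    second_char, second_pitch = second
    if pressed_pitch == second_pitch:
        return (second_char, pressed_pitch)   # advance to the second lyric
    # Pitch differs from both: keep outputting the first character(s),
    # changing only the pitch being vocalized.
    return (first_char, pressed_pitch)

out = on_key_press_before_second_timing(("Ki/Twin", 60), ("Ra/kle", 62), 65)
# out == ("Ki/Twin", 65): the first lyric repeats at the new pitch
```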
Suppose that a user presses the key on the keyboard 101 in
Timing t1′ in
In this way, if the control operation of the present embodiment is not performed, the lyrics will steadily advance in cases where a pitch specified by the user at a timing other than an original vocalization timing does not match a pitch that is to be specified. This results in an unnatural-sounding progression. However, in the present embodiment, the pitch of the singing voice inference data for a given singer 217 that has been vocalized since the original timing immediately before this timing can be changed to the pitch specified by the user, and vocalization continues. In this case, the pitch of singing voice inference data for a given singer 217 corresponding to the “Ki/Twin” (the first character(s)) lyric data vocalized at, for example, the original song playback timing t1 in
Although not illustrated in
In other words, while accompaniment data stored in the memory is being output from the sound source LSI 204 based on an instruction from the CPU 201 and singing voice data corresponding to the first character(s) is being output from the voice synthesis LSI 205 based on an instruction from the CPU 201, in cases where the CPU 201 determines there to be a match with a second pitch corresponding to a second timing from second operation information received in accordance with a user operation, the voice synthesis LSI 205 outputs singing voice data corresponding to the second character(s) without waiting for the second timing to arrive, whereby the progression of singing voice data is moved forward, and the progression of accompaniment data being output from the sound source LSI 204 is also moved forward in time with the moved-up progression of singing voice data.
Alternatively, when the user performs a performance operation at a timing other than an original vocalization timing and the specified pitch does not match the pitch that is to be specified at the next timing, a vocalization corresponding to previously output singing voice inference data for a given singer 217 may be repeated (with the changed pitch). In this case, following the singing voice inference data for a given singer 217 corresponding to the “Ki/Twin” (the first character(s)) lyric data vocalized at, for example, the original song playback timing t1 in
The header chunk is made up of five values: ChunkID, ChunkSize, FormatType, NumberOfTrack, and TimeDivision. ChunkID is a four byte ASCII code “4D 54 68 64” (in base 16) corresponding to the four half-width characters “MThd”, which indicates that the chunk is a header chunk. ChunkSize is four bytes of data that indicate the length of the FormatType, NumberOfTrack, and TimeDivision part of the header chunk (excluding ChunkID and ChunkSize). This length is always “00 00 00 06” (in base 16), for six bytes. FormatType is two bytes of data “00 01” (in base 16). This means that the format type is format 1, in which multiple tracks are used. NumberOfTrack is two bytes of data “00 02” (in base 16). This indicates that in the case of the present embodiment, two tracks, corresponding to the lyric part and the accompaniment part, are used. TimeDivision is data indicating a timebase value, which itself indicates resolution per quarter note. TimeDivision is two bytes of data “01 E0” (in base 16). In the case of the present embodiment, this indicates 480 in decimal notation.
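Assuming the standard MIDI big-endian byte layout, the header chunk fields described above can be decoded as in this sketch; the sample bytes reproduce the values given in the text (format 1, two tracks, TimeDivision 480), and the function name is hypothetical.

```python
# Hedged sketch of decoding the header chunk fields from raw bytes.
# Adjacent hex strings are concatenated: MThd id, size 6, format 1,
# 2 tracks, TimeDivision 0x01E0.
HEADER = bytes.fromhex("4D546864" "00000006" "0001" "0002" "01E0")

def parse_header_chunk(data):
    chunk_id = data[0:4].decode("ascii")            # "MThd"
    chunk_size = int.from_bytes(data[4:8], "big")   # always 6
    format_type = int.from_bytes(data[8:10], "big")
    n_tracks = int.from_bytes(data[10:12], "big")
    time_division = int.from_bytes(data[12:14], "big")
    return chunk_id, chunk_size, format_type, n_tracks, time_division

parsed = parse_header_chunk(HEADER)   # ("MThd", 6, 1, 2, 480)
```

Note that 0x01E0 decodes to 480 in decimal, matching the timebase resolution per quarter note stated above.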
The first and second track chunks are each made up of a ChunkID, ChunkSize, and performance data pairs. The performance data pairs are made up of DeltaTime_1[i] and Event_1[i] (for the first track chunk/lyric part), or DeltaTime_2[i] and Event_2[i] (for the second track chunk/accompaniment part). Note that 0≤i≤L for the first track chunk/lyric part, and 0≤i≤M for the second track chunk/accompaniment part. ChunkID is a four byte ASCII code “4D 54 72 6B” (in base 16) corresponding to the four half-width characters “MTrk”, which indicates that the chunk is a track chunk. ChunkSize is four bytes of data that indicate the length of the respective track chunk (excluding ChunkID and ChunkSize).
DeltaTime_1[i] is variable-length data of one to four bytes indicating a wait time (relative time) from the execution time of Event_1[i−1] immediately prior thereto. Similarly, DeltaTime_2[i] is variable-length data of one to four bytes indicating a wait time (relative time) from the execution time of Event_2[i−1] immediately prior thereto. Event_1[i] is a meta event designating the vocalization timing and pitch of a lyric in the first track chunk/lyric part. Event_2[i] is a MIDI event designating “note on” or “note off” or is a meta event designating time signature in the second track chunk/accompaniment part. In each DeltaTime_1[i] and Event_1[i] performance data pair of the first track chunk/lyric part, Event_1[i] is executed after a wait of DeltaTime_1[i] from the execution time of the Event_1[i−1] immediately prior thereto. The vocalization and progression of lyrics are realized thereby. In each DeltaTime_2[i] and Event_2[i] performance data pair of the second track chunk/accompaniment part, Event_2[i] is executed after a wait of DeltaTime_2[i] from the execution time of the Event_2[i−1] immediately prior thereto. The progression of automatic accompaniment is realized thereby.
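The one-to-four-byte DeltaTime values follow the standard MIDI variable-length-quantity encoding: each byte contributes seven payload bits, and the high bit is set on every byte except the last. A minimal decoder sketch (hypothetical function name):

```python
# Hedged sketch of reading a variable-length DeltaTime from raw bytes.

def read_delta_time(data, offset=0):
    """Return (delta_time_in_ticks, next_offset)."""
    value = 0
    while True:
        b = data[offset]
        value = (value << 7) | (b & 0x7F)   # accumulate 7 payload bits
        offset += 1
        if b & 0x80 == 0:                   # high bit clear: last byte
            return value, offset

# Example: the two bytes 0x81 0x68 encode (1 << 7) | 0x68 = 232 ticks.
# read_delta_time(bytes([0x81, 0x68])) == (232, 2)
```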
After first performing initialization processing (step S701), the CPU 201 repeatedly executes the series of processes from step S702 to step S708.
In this repeat processing, the CPU 201 first performs switch processing (step S702). Here, based on an interrupt from the key scanner 206 in
Next, based on an interrupt from the key scanner 206 in
Next, the CPU 201 processes data that should be displayed on the LCD 104 in
Next, the CPU 201 performs song playback processing (step S705). In this processing, the CPU 201 performs a control process described in
Then, the CPU 201 performs sound source processing (step S706). In the sound source processing, the CPU 201 performs control processing such as that for controlling the envelope of musical sounds being generated in the sound source LSI 204.
Then, the CPU 201 performs voice synthesis processing (step S707). In the voice synthesis processing, the CPU 201 controls voice synthesis by the voice synthesis LSI 205.
Finally, the CPU 201 determines whether or not a user has pressed a non-illustrated power-off switch to turn off the power (step S708). If the determination of step S708 is NO, the CPU 201 returns to the processing of step S702. If the determination of step S708 is YES, the CPU 201 ends the control process illustrated in the flowchart of
First, in
TickTime (sec) = 60 / Tempo / TimeDivision    (6)
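As a worked example of Equation (6), with Tempo = 60 beats per minute and the TimeDivision value of 480 given earlier, one tick lasts 60 / 60 / 480 sec, i.e. about 2.08 msec:

```python
# Equation (6): seconds per tick from tempo (BPM) and ticks per quarter note.

def tick_time_sec(tempo_bpm, time_division):
    return 60.0 / tempo_bpm / time_division

interval = tick_time_sec(60, 480)   # about 0.00208 sec per tick
```

Doubling the tempo halves the tick interval, which is why the timer interrupt period must be recomputed whenever the tempo is changed.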
Accordingly, in the initialization processing illustrated in the flowchart of
Next, the CPU 201 sets a timer interrupt for the timer 210 in
Then, the CPU 201 performs additional initialization processing, such as that to initialize the RAM 203 in
The flowcharts in
First, the CPU 201 determines whether or not the tempo of lyric progression and automatic accompaniment has been changed using a switch for changing tempo on the first switch panel 102 in
Next, the CPU 201 determines whether or not a song has been selected with the second switch panel 103 in
Then, the CPU 201 determines whether or not a switch for starting a song on the first switch panel 102 in
Finally, the CPU 201 determines whether or not any other switches on the first switch panel 102 or the second switch panel 103 in
Similarly to at step S801 in
Next, similarly to at step S802 in
First, with regards to the progression of automatic accompaniment, the CPU 201 initializes the values of both a DeltaT_1 (first track chunk) variable and a DeltaT_2 (second track chunk) variable in the RAM 203 for counting, in units of TickTime, relative time since the last event to 0. Next, the CPU 201 initializes the respective values of an AutoIndex_1 variable in the RAM 203 for specifying an i (1≤i≤L−1) for DeltaTime_1[i] and Event_1[i] performance data pairs in the first track chunk of the music data illustrated in
Next, the CPU 201 initializes the value of a SongIndex variable in the RAM 203, which designates the current song position, to 0 (step S822).
The CPU 201 also initializes the value of a SongStart variable in the RAM 203, which indicates whether to advance (=1) or not advance (=0) the lyrics and accompaniment, to 1 (progress) (step S823).
Then, the CPU 201 determines whether or not a user has configured the electronic keyboard instrument 100 to playback an accompaniment together with lyric playback using the first switch panel 102 in
If the determination of step S824 is YES, the CPU 201 sets the value of a Bansou variable in the RAM 203 to 1 (has accompaniment) (step S825). Conversely, if the determination of step S824 is NO, the CPU 201 sets the value of the Bansou variable to 0 (no accompaniment) (step S826). After the processing at step S825 or step S826, the CPU 201 ends the song-starting processing at step S906 in
First, the CPU 201 performs a series of processes corresponding to the first track chunk (steps S1001 to S1006). The CPU 201 starts by determining whether or not the value of SongStart is equal to 1, in other words, whether or not advancement of the lyrics and accompaniment has been instructed (step S1001).
When the CPU 201 has determined there to be no instruction to advance the lyrics and accompaniment (the determination of step S1001 is NO), the CPU 201 ends the automatic-performance interrupt processing illustrated in the flowchart of
When the CPU 201 has determined there to be an instruction to advance the lyrics and accompaniment (the determination of step S1001 is YES), the CPU 201 then determines whether or not the value of DeltaT_1, which indicates the relative time since the last event in the first track chunk, matches the wait time DeltaTime_1[AutoIndex_1] of the performance data pair indicated by the value of AutoIndex_1 that is about to be executed (step S1002).
If the determination of step S1002 is NO, the CPU 201 increments the value of DeltaT_1, which indicates the relative time since the last event in the first track chunk, by 1, and the CPU 201 allows the time to advance by 1 TickTime corresponding to the current interrupt (step S1003). Following this, the CPU 201 proceeds to step S1007, which will be described later.
If the determination of step S1002 is YES, the CPU 201 executes the first track chunk event Event_1[AutoIndex_1] of the performance data pair indicated by the value of AutoIndex_1 (step S1004). This event is a song event that includes lyric data.
Then, the CPU 201 stores the value of AutoIndex_1, which indicates the position of the song event that should be performed next in the first track chunk, in the SongIndex variable in the RAM 203 (step S1004).
The CPU 201 then increments the value of AutoIndex_1 for referencing the performance data pairs in the first track chunk by 1 (step S1005).
Next, the CPU 201 resets the value of DeltaT_1, which indicates the relative time since the song event most recently referenced in the first track chunk, to 0 (step S1006). Following this, the CPU 201 proceeds to the processing at step S1007.
Then, the CPU 201 performs a series of processes corresponding to the second track chunk (steps S1007 to S1013). The CPU 201 starts by determining whether or not the value of DeltaT_2, which indicates the relative time since the last event in the second track chunk, matches the wait time DeltaTime_2[AutoIndex_2] of the performance data pair indicated by the value of AutoIndex_2 that is about to be executed (step S1007).
If the determination of step S1007 is NO, the CPU 201 increments the value of DeltaT_2, which indicates the relative time since the last event in the second track chunk, by 1, and the CPU 201 allows the time to advance by 1 TickTime corresponding to the current interrupt (step S1008). The CPU 201 subsequently ends the automatic-performance interrupt processing illustrated in the flowchart of
If the determination of step S1007 is YES, the CPU 201 then determines whether or not the value of the Bansou variable in the RAM 203 that denotes accompaniment playback is equal to 1 (has accompaniment) (step S1009) (see steps S824 to S826 in
If the determination of step S1009 is YES, the CPU 201 executes the second track chunk accompaniment event Event_2[AutoIndex_2] indicated by the value of AutoIndex_2 (step S1010). If the event Event_2[AutoIndex_2] executed here is, for example, a “note on” event, the key number and velocity specified by this “note on” event are used to issue a command to the sound source LSI 204 in
However, if the determination of step S1009 is NO, the CPU 201 skips step S1010 and proceeds to the processing at the next step S1011 without executing the current accompaniment event Event_2[AutoIndex_2]. Here, in order to progress in sync with the lyrics, the CPU 201 performs only control processing that advances events.
After step S1010, or when the determination of step S1009 is NO, the CPU 201 increments the value of AutoIndex_2 for referencing the performance data pairs for accompaniment data in the second track chunk by 1 (step S1011).
Next, the CPU 201 resets the value of DeltaT_2, which indicates the relative time since the event most recently executed in the second track chunk, to 0 (step S1012).
Then, the CPU 201 determines whether or not the wait time DeltaTime_2[AutoIndex_2] of the performance data pair indicated by the value of AutoIndex_2 to be executed next in the second track chunk is equal to 0, or in other words, whether or not this event is to be executed at the same time as the current event (step S1013).
If the determination of step S1013 is NO, the CPU 201 ends the current automatic-performance interrupt processing illustrated in the flowchart of
If the determination of step S1013 is YES, the CPU 201 returns to step S1009, and repeats the control processing relating to the event Event_2[AutoIndex_2] of the performance data pair indicated by the value of AutoIndex_2 to be executed next in the second track chunk. The CPU 201 repeatedly performs the processing of steps S1009 to S1013 the same number of times as there are events to be executed simultaneously. The above processing sequence is performed when a plurality of “note on” events are to generate sound at the same timing, as happens in chords and the like, for example.
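The per-tick handling of the second track chunk described in steps S1007 to S1013 can be sketched as follows. This is a minimal illustrative sketch, not the actual implementation: it assumes the performance data pairs are held as two parallel lists (`delta_time_2` and `event_2`), keeps DeltaT_2 and AutoIndex_2 in a `state` dictionary between interrupts, and uses a caller-supplied `play_event` callback in place of the command issued to the sound source LSI 204.

```python
def accompaniment_tick(state, delta_time_2, event_2, bansou, play_event):
    """Advance the second track chunk by one TickTime (steps S1007-S1013)."""
    if state["DeltaT_2"] != delta_time_2[state["AutoIndex_2"]]:
        # Step S1008: wait time not yet reached; only advance the relative time.
        state["DeltaT_2"] += 1
        return
    # Steps S1009-S1013: execute every event whose wait time is 0 at this
    # same tick (for example, the notes of a chord).
    while True:
        if bansou == 1:
            # Step S1010: issue the accompaniment event to the sound source.
            play_event(event_2[state["AutoIndex_2"]])
        # Steps S1011-S1012: advance the index and reset the relative time
        # (when Bansou is 0, only this advancing control is performed).
        state["AutoIndex_2"] += 1
        state["DeltaT_2"] = 0
        if (state["AutoIndex_2"] >= len(delta_time_2)
                or delta_time_2[state["AutoIndex_2"]] != 0):
            break
```

Calling this once per timer interrupt reproduces the behavior described above: two events with a wait time of 0 sound on the same tick, and a nonzero wait time defers the next event by that many ticks.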
First, at step S1004 in the automatic-performance interrupt processing of
If the determination of step S1101 is YES, that is, if the present time is a song playback timing (e.g., t1, t2, t3, t4, t5, t6, t7 in the example of
If the determination of step S1102 is YES, the CPU 201 reads a pitch from the song event Event_1[SongIndex] in the first track chunk of the music data in the RAM 203 indicated by the SongIndex variable in the RAM 203, and determines whether or not a pitch specified by a user key press matches the pitch that was read (step S1103).
If the determination of step S1103 is YES, the CPU 201 sets the pitch specified by a user key press to a non-illustrated register, or to a variable in the RAM 203, as a vocalization pitch (step S1104).
Then, the CPU 201 reads the lyric string from the song event Event_1[SongIndex] in the first track chunk of the music data in the RAM 203 indicated by the SongIndex variable in the RAM 203. The CPU 201 generates music data 215 for vocalizing singing voice inference data for a given singer 217 corresponding to the lyric string that was read, at the vocalization pitch that was set at step S1104 based on the pitch specified by the user key press, and instructs the voice synthesis LSI 205 to perform vocalization processing (step S1105).
The processing at steps S1104 and S1105 corresponds to the control processing mentioned earlier with regards to the song playback timings t1, t2, t3′, t4 in
After the processing of step S1105, the CPU 201 stores the song position at which playback was performed indicated by the SongIndex variable in the RAM 203 in a SongIndex_pre variable in the RAM 203 (step S1106).
Next, the CPU 201 clears the value of the SongIndex variable to a null value so that subsequent timings become non-song playback timings (step S1107).
The CPU 201 then sets the value of the SongStart variable in the RAM 203 controlling the advancement of lyrics and automatic accompaniment to 1, denoting advancement (step S1108). The CPU 201 subsequently ends the song playback processing at step S705 in
As described with regards to timing t3 in
If the determination of step S1103 is NO, that is, if the pitch specified by a user key press does not match the pitch read from the music data, the CPU 201 sets the value of the SongStart variable in the RAM 203 controlling the advancement of lyrics and automatic accompaniment to 0, denoting that advancement is to stop (step S1109). The CPU 201 subsequently ends the song playback processing at step S705 in
As described with regards to timing t3 in
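The branching performed at a song playback timing (steps S1103 to S1109) can be summarized in the following minimal sketch. The names and data layout are illustrative assumptions, not the actual implementation: each entry of `event_1` is taken to be a (pitch, lyric) pair, the RAM variables are kept in a `state` dictionary, and `vocalize` stands in for the music data 215 generated for the voice synthesis LSI 205.

```python
def on_key_press_at_playback_timing(state, event_1, pressed_pitch, vocalize):
    """Steps S1103-S1109: compare the pressed key with the song-event pitch."""
    song_index = state["SongIndex"]
    song_pitch, lyric = event_1[song_index]
    if pressed_pitch == song_pitch:
        # Steps S1104-S1105: vocalize the lyric at the pressed pitch.
        vocalize(lyric, pressed_pitch)
        # Steps S1106-S1108: remember the position just played, clear the
        # timing flag, and let lyrics and accompaniment keep advancing.
        state["SongIndex_pre"] = song_index
        state["SongIndex"] = None
        state["SongStart"] = 1
    else:
        # Step S1109: a non-matching pitch stops the advancement.
        state["SongStart"] = 0
```

A matching key press thus both triggers vocalization and keeps SongStart at 1, while a mismatch silently halts lyric and accompaniment progression until a matching key press arrives.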
If the determination of step S1101 is NO, that is, if the present time is not a song playback timing, the CPU 201 then determines whether or not a new user key press on the keyboard 101 in
If the determination of step S1110 is NO, the CPU 201 ends the song playback processing at step S705 in
If the determination of step S1110 is YES, the CPU 201 generates music data 215 instructing that the pitch of singing voice inference data for a given singer 217 currently undergoing vocalization processing in the voice synthesis LSI 205, which corresponds to the lyric string for song event Event_1[SongIndex_pre] in the first track chunk of the music data in the RAM 203 indicated by the SongIndex_pre variable in the RAM 203, is to be changed to the pitch specified based on the user key press detected at step S1110, and outputs the music data 215 to the voice synthesis LSI 205 (step S1111). At such time, the frame in the music data 215 where a latter phoneme among phonemes in the lyrics already being subjected to vocalization processing starts, for example, in the case of the lyric string “Ki”, the frame where the latter phoneme /i/ in the constituent phoneme sequence /k/ /i/ starts (see (b) and (c) in
Due to the processing at step S1111, the pitch of the vocalization of singing voice inference data for a given singer 217 being vocalized from an original timing immediately before the current key press timing, for example from timing t1 in
After the processing at step S1111, the CPU 201 ends the song playback processing at step S705 in
First, the CPU 201 determines whether or not a new user key press on the keyboard 101 in
If the determination of step S1201 is YES, the CPU 201 then determines whether or not, at step S1004 in the automatic-performance interrupt processing of
If the determination of step S1202 is YES, that is, if the present time is a song playback timing (e.g., t1, t2, t3, t4 in the example of
Then, the CPU 201 reads the lyric string from the song event Event_1[SongIndex] in the first track chunk of the music data in the RAM 203 indicated by the SongIndex variable in the RAM 203. The CPU 201 generates music data 215 for vocalizing singing voice inference data for a given singer 217 corresponding to the lyric string that was read, at the vocalization pitch that was set at step S1203 based on the pitch specified by the user key press, and instructs the voice synthesis LSI 205 to perform vocalization processing (step S1204).
Following this, the CPU 201 reads a pitch from the song event Event_1[SongIndex] in the first track chunk of the music data in the RAM 203 indicated by the SongIndex variable in the RAM 203, and determines whether or not the pitch specified by a user key press matches the pitch that was read from the music data (step S1205).
If the determination of step S1205 is YES, the CPU 201 advances to step S1206. This processing corresponds to the control processing mentioned earlier with regards to the song playback timings t1, t2, t3′, t4 in
At step S1206, the CPU 201 stores the song position at which playback was performed indicated by the SongIndex variable in the RAM 203 in the SongIndex_pre variable in the RAM 203.
Next, the CPU 201 clears the value of the SongIndex variable to a null value so that subsequent timings become non-song playback timings (step S1207).
The CPU 201 then sets the value of the SongStart variable in the RAM 203 controlling the advancement of lyrics and automatic accompaniment to 1, denoting advancement (step S1208). The CPU 201 subsequently ends the song playback processing at step S705 in
If the determination of step S1205 is NO, that is, if the pitch specified by a user key press does not match the pitch read from the music data, the CPU 201 sets the value of the SongStart variable in the RAM 203 controlling the advancement of lyrics and automatic accompaniment to 0, denoting that advancement is to stop (step S1210).
The CPU 201 subsequently ends the song playback processing at step S705 in
This processing corresponds to the control processing mentioned earlier with regards to the song playback timing t3 in
When the determination of step S1201 is YES and the determination of step S1202 is NO, that is, when a user performance (key press) is made at a timing other than a timing at which a singing voice should be vocalized, the following processing is performed.
First, the CPU 201 sets the value of the SongStart variable in the RAM 203 to 0 and momentarily stops the progression of a singing voice and automatic accompaniment (step S1211) (see step S1001 in
Next, the CPU 201 saves the values of the DeltaT_1, DeltaT_2, AutoIndex_1, and AutoIndex_2 variables in the RAM 203, which relate to the current positions of singing voice and automatic accompaniment progression, to the DeltaT_1_now, DeltaT_2_now, AutoIndex_1_now, and AutoIndex_2_now variables (step S1212).
Then, the CPU 201 performs next-song-event search processing (step S1213). This processing finds the SongIndex value that designates event information relating to the singing voice that will come next. The details of this processing will be described later.
Following the search process at step S1213, the CPU 201 reads a pitch from the song event Event_1[SongIndex] in the first track chunk of the music data in the RAM 203 indicated by the value of the SongIndex variable that was found, and determines whether or not the pitch specified by the user key press matches the pitch that was read (step S1214).
If the determination of step S1214 is YES, the CPU 201 advances the control processing through step S1203, step S1204, step S1205 (determination: YES), step S1206, step S1207, and step S1208.
Due to the aforementioned series of control processes, in cases where, at a timing at which no original vocalization timing comes, a user has pressed a key having the same pitch as the pitch that is to be vocalized next, the CPU 201 is able to enact control such that the progression of lyrics and the progression of automatic accompaniment are immediately advanced (made to jump ahead) to the timing of the singing voice that is to be vocalized next.
If the determination of step S1214 is NO, the CPU 201 respectively restores the values of the DeltaT_1, DeltaT_2, AutoIndex_1, and AutoIndex_2 variables to the values held by the DeltaT_1_now, DeltaT_2_now, AutoIndex_1_now, and AutoIndex_2_now variables in the RAM 203 that were saved at step S1212, and any advancement of these variables due to the search process at step S1213 is reverted so that the progression positions from before the search are restored (step S1215).
Then, the CPU 201 generates music data 215 instructing that the pitch of singing voice inference data for a given singer 217 currently undergoing vocalization processing in the voice synthesis LSI 205, which corresponds to the lyric string for song event Event_1[SongIndex_pre] in the first track chunk of the music data in the RAM 203 indicated by the SongIndex_pre variable in the RAM 203, is to be changed to the pitch specified based on the user key press detected at step S1201, and outputs the music data 215 to the voice synthesis LSI 205 (step S1216).
The processing at step S1216 is similar to the processing of step S1111 in
After the processing at step S1216, the CPU 201 sets the value of the SongStart variable in the RAM 203 to 1, thereby causing the progression of lyrics and automatic accompaniment that was temporarily stopped at step S1211 to be resumed (step S1208). The CPU 201 subsequently ends the song playback processing at step S705 in
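The save-search-compare-restore flow of steps S1211 to S1216 can be sketched as follows. This is a hedged illustration under the same assumptions as before: RAM variables live in a `state` dictionary, `event_1` holds (pitch, lyric) pairs, and the `search`, `repitch`, and `advance` callbacks are hypothetical stand-ins for the next-song-event search processing, the pitch-change instruction to the voice synthesis LSI 205, and the matched-pitch path through steps S1203 to S1208, respectively.

```python
def on_key_press_off_timing(state, event_1, pressed_pitch,
                            search, repitch, advance):
    """Steps S1211-S1216: jump ahead to the next song event, or re-pitch."""
    state["SongStart"] = 0                      # S1211: pause progression
    saved = {k: state[k] for k in               # S1212: save current positions
             ("DeltaT_1", "DeltaT_2", "AutoIndex_1", "AutoIndex_2")}
    search(state)                               # S1213: find the next song event
    next_pitch, _lyric = event_1[state["SongIndex"]]
    if pressed_pitch == next_pitch:             # S1214
        # Jump ahead: the S1203-S1208 path vocalizes the next lyric and
        # sets SongStart back to 1.
        advance(state)
    else:
        state.update(saved)                     # S1215: revert the search
        repitch(pressed_pitch)                  # S1216: re-pitch current lyric
        state["SongStart"] = 1                  # resume paused progression
```

In other words, pressing the pitch of the next song event makes lyrics and accompaniment jump ahead, while any other pitch leaves the progression positions untouched and only changes the pitch of the lyric already being vocalized.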
If the determination of step S1201 is NO, that is, when there are no user performances (key presses), the CPU 201 then determines whether or not, at step S1004 in the automatic-performance interrupt processing of
If the determination of step S1209 is NO, the CPU 201 ends the song playback processing at step S705 in
If the determination of step S1209 is YES, the CPU 201 sets the value of the SongStart variable in the RAM 203 controlling the advancement of lyrics and automatic accompaniment to 0, denoting that advancement is to stop (step S1210). The CPU 201 subsequently ends the song playback processing at step S705 in
In other words, in
Whenever the value of DeltaT_2, which indicates the relative time since the last event in the second track chunk, matches the wait time DeltaTime_2[AutoIndex_2] of the automatic accompaniment performance data pair indicated by the value of AutoIndex_2 that is about to be executed and the determination of step S1007 is YES, the CPU 201 advances the value of AutoIndex_2. When the determination of step S1007 is NO, the CPU 201 increments the value of DeltaT_2 to advance the progression of automatic accompaniment, and then returns to the control processing at step S1002.
In the foregoing series of repeating control processes, when the determination of step S1002 is YES, the CPU 201 stores the value of AutoIndex_1 in the SongIndex variable in the RAM 203, and then ends the next-song-event search processing at step S1213 in
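The next-song-event search described above can be sketched as a silent fast-forward of both track positions. This is an illustrative assumption about the data layout, not the actual implementation: the wait times of both track chunks are taken to be parallel lists, the RAM variables live in a `state` dictionary, and no accompaniment event is executed while searching, matching the description that only advancing control is performed.

```python
def search_next_song_event(state, delta_time_1, delta_time_2):
    """Fast-forward both track positions to the next first-track song event."""
    # Advance tick by tick, as the interrupt would, but without producing
    # sound, until the first track's wait time is fully consumed.
    while state["DeltaT_1"] != delta_time_1[state["AutoIndex_1"]]:
        state["DeltaT_1"] += 1
        # Keep the accompaniment position in step with the elapsed ticks.
        if (state["AutoIndex_2"] < len(delta_time_2)
                and state["DeltaT_2"] == delta_time_2[state["AutoIndex_2"]]):
            state["AutoIndex_2"] += 1
            state["DeltaT_2"] = 0
        else:
            state["DeltaT_2"] += 1
    # Corresponds to storing AutoIndex_1 in the SongIndex variable.
    state["SongIndex"] = state["AutoIndex_1"]
```

Because the search mutates the position variables, the caller must save them beforehand (step S1212) so that they can be restored (step S1215) if the searched-for pitch does not match the pressed key.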
As used in the present specification, a “timing corresponding to a first timing” is a timing at which a user operation on the first operation element is accepted, and includes an interval of a predetermined duration prior to the first timing.
Further, as used in the present specification, character(s) such as the “first character(s)” and the “second character(s)” denote character(s) associated with a single musical note, and may be either single characters or multiple characters.
Moreover, while an electronic musical instrument is outputting a singing voice corresponding to a first character(s) and a first pitch based on a first user operation for the first pitch in time with a first timing indicated in music data, and prior to the arrival of a second timing indicated in the music data, in cases where, rather than being performed on an operation element for a second pitch associated with the second timing, a second user operation is performed on the operation element for the first pitch that is being output (in other words, the same operation element is struck in succession), the output of the singing voice for the first character(s) is continued without a singing voice for the second character(s) associated with the second timing being output. At such time, vibrato or another musical effect may be applied to the singing voice for the first character(s) that is being output. When the operation element subjected to the second user operation is an operation element for the second pitch, the singing voice for the second character(s) to be output on or after the second timing is output before the arrival of the second timing. As a result, the lyrics advance, and in accordance with the lyric progression, the accompaniment also advances.
In the embodiments described above, in order to predict an acoustic feature sequence 317 from a linguistic feature sequence 316, the acoustic model unit 306 is implemented using a deep neural network (DNN). Alternatively, in order to make this prediction, the acoustic model unit 306 may be implemented using a hidden Markov model (HMM). In such a case, the model training unit 305 in the voice training section 301 trains a model that considers context so as to more accurately model the acoustic features of a voice. To model acoustic features in detail, in addition to preceding and following phonemes, factors such as accent, part of speech, and phrase length are taken into account. However, since there are a large number of possible contextual combinations, it is not easy to prepare singing voice data with which a context-dependent model can be accurately trained on all contextual combinations. To address this issue, the model training unit 305 may employ decision-tree based context clustering techniques. In decision-tree based context clustering, questions relating to context, such as “Is the preceding phoneme /a/?”, are used to classify context-dependent models, and model parameters for similar contexts are set in the acoustic model unit 306 as training results 315. The context being considered changes depending on the structure of the decision tree. Thus, by selecting the appropriate decision tree structure, highly accurate and highly versatile context-dependent models can be estimated. In the acoustic model unit 306 in the voice synthesis section 302 in
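The decision-tree clustering described above can be illustrated with a minimal sketch. The tree shape and the questions here are invented for illustration only; in a real system the questions and tree structure are selected during training. Internal nodes ask a yes/no question about the phoneme context, and each leaf names a cluster of contexts that share model parameters.

```python
def cluster(context, node):
    """Walk the decision tree; return the leaf cluster for this context."""
    while isinstance(node, tuple):              # internal node: (question, yes, no)
        question, yes_branch, no_branch = node
        node = yes_branch if question(context) else no_branch
    return node                                 # leaf: shared-parameter cluster id

# Hypothetical example tree: first ask whether the preceding phoneme is /a/,
# then whether the current phoneme is voiced.
tree = (lambda c: c["prev"] == "a",
        "cluster_after_a",
        (lambda c: c["voiced"], "cluster_voiced", "cluster_unvoiced"))
```

Contexts that reach the same leaf are modeled with the same parameters, which is how a finite amount of singing voice data can cover the very large space of contextual combinations.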
In the embodiments described above, the present invention is embodied as an electronic keyboard instrument. However, the present invention can also be applied to electronic string instruments and other electronic musical instruments.
The present invention is not limited to the embodiments described above, and various changes in implementation are possible without departing from the spirit of the present invention. Insofar as possible, the functionalities performed in the embodiments described above may be implemented in any suitable combination. Moreover, there are many aspects to the embodiments described above, and the invention may take on a variety of forms through the appropriate combination of the disclosed plurality of constituent elements. For example, if the advantageous effect is still obtained after several constituent elements have been omitted from among all the constituent elements disclosed in the embodiments, the configuration from which these constituent elements have been omitted may be considered to be one form of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2018-078110 | Apr 2018 | JP | national |
Number | Name | Date | Kind |
---|---|---|---|
6337433 | Nishimoto | Jan 2002 | B1 |
8008563 | Hastings | Aug 2011 | B1 |
20020017187 | Takahashi | Feb 2002 | A1 |
20030009344 | Kayama | Jan 2003 | A1 |
20050257667 | Nakamura | Nov 2005 | A1 |
20140006031 | Mizuguchi | Jan 2014 | A1 |
20170140745 | Nayak et al. | May 2017 | A1 |
20180277075 | Nakamura | Sep 2018 | A1 |
20180277077 | Nakamura | Sep 2018 | A1 |
20190096372 | Setoguchi | Mar 2019 | A1 |
20190096379 | Iwase | Mar 2019 | A1 |
20190198001 | Danjyo | Jun 2019 | A1 |
20190318715 | Danjyo | Oct 2019 | A1 |
20190392798 | Danjyo | Dec 2019 | A1 |
20190392799 | Danjyo | Dec 2019 | A1 |
20190392807 | Danjyo | Dec 2019 | A1 |
Number | Date | Country |
---|---|---|
H04-238384 | Aug 1992 | JP |
H06-332449 | Dec 1994 | JP |
2005-331806 | Dec 2005 | JP |
2014-10190 | Jan 2014 | JP |
2014-62969 | Apr 2014 | JP |
2016-206323 | Dec 2016 | JP |
2017-97176 | Jun 2017 | JP |
2017-194594 | Oct 2017 | JP |
Entry |
---|
Japanese Office Action dated May 28, 2019, in a counterpart Japanese patent application No. 2018-078110. (A machine translation (not reviewed for accuracy) attached.). |
Japanese Office Action dated May 28, 2019, in a counterpart Japanese patent application No. 2018-078113. (Cited in the related U.S. Appl. No. 16/384,883 and a machine translation (not reviewed for accuracy) attached.). |
Kei Hashimoto and Shinji Takaki, “Statistical parametric speech synthesis based on deep learning”, Journal of the Acoustical Society of Japan, vol. 73, No. 1 (2017), pp. 55-62 (Mentioned in paragraph Nos. 18-19 of the as-filed specification as a concise explanation of relevance.). |
U.S. Appl. No. 16/384,883, filed Apr. 15, 2019. |
Number | Date | Country | |
---|---|---|---|
20190318712 A1 | Oct 2019 | US |