The present invention relates to an electronic musical instrument that generates a singing voice in accordance with the operation of an operation element on a keyboard or the like, an electronic musical instrument control method, and a storage medium.
Hitherto known electronic musical instruments output a singing voice that is synthesized using concatenative synthesis, in which fragments of recorded speech are connected together and processed (for example, see Patent Document 1).
Patent Document 1: Japanese Patent Application Laid-Open Publication No. H09-050287
However, this method, which can be considered an extension of pulse code modulation (PCM), requires long hours of recording when being developed. Complex calculations for smoothly joining fragments of recorded speech together and adjustments so as to provide a natural-sounding singing voice are also required with this method.
An object of the present invention is to provide an electronic musical instrument that sings well in the singing voice of a given singer at pitches specified through the operation of operation elements by a user due to being equipped with a trained model that has learned the singing voice of the given singer.
Additional or separate features and advantages of the invention will be set forth in the descriptions that follow and in part will be apparent from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
To achieve these and other advantages and in accordance with the purpose of the present invention, as embodied and broadly described, in one aspect, the present disclosure provides an electronic musical instrument including: a plurality of operation elements respectively corresponding to mutually different pitch data; a memory that stores a trained acoustic model obtained by performing machine learning on training musical score data including training lyric data and training pitch data, and on training singing voice data of a singer corresponding to the training musical score data, the trained acoustic model being configured to receive lyric data and pitch data and output acoustic feature data of a singing voice of the singer in response to the received lyric data and pitch data; and at least one processor, wherein the at least one processor: in accordance with a user operation on an operation element in the plurality of operation elements, inputs prescribed lyric data and pitch data corresponding to the user operation of the operation element to the trained acoustic model so as to cause the trained acoustic model to output the acoustic feature data in response to the inputted prescribed lyric data and the inputted pitch data, and digitally synthesizes and outputs inferred singing voice data that infers a singing voice of the singer on the basis of at least a portion of the acoustic feature data output by the trained acoustic model in response to the inputted prescribed lyric data and the inputted pitch data, and on the basis of instrument sound waveform data that are synthesized in accordance with the pitch data corresponding to the user operation of the operation element.
In another aspect, the present invention provides an electronic musical instrument comprising: an operation unit that receives a user performance; and at least one processor, wherein the at least one processor performs the following: in accordance with a user operation specifying a chord on the operation unit, obtaining lyric data of a lyric and obtaining a plurality of pieces of waveform data respectively corresponding to a plurality of pitches indicated by the specified chord; inputting the obtained lyric data to a trained model that has been trained on, and has thereby learned, singing voices of a singer so as to cause the trained model to output acoustic feature data in response thereto; synthesizing each of the plurality of pieces of waveform data with the acoustic feature data outputted from the trained model so as to generate a plurality of pieces of synthesized waveform data corresponding to the plurality of pitches of the specified chord and the lyric; and outputting a polyphonic synthesized singing voice based on the generated plurality of pieces of synthesized waveform data.
In another aspect, the present disclosure provides a method performed by the at least one processor in the electronic musical instruments described above, the method including, via the at least one processor, each step performed by the at least one processor described above.
In another aspect, the present disclosure provides a non-transitory computer-readable storage medium having stored thereon a program executable by the at least one processor in the above-described electronic musical instrument, the program causing the at least one processor to perform each step performed by the at least one processor described above.
According to an aspect of the present invention, an electronic musical instrument can be provided that sings well in the singing voice of a given singer at pitches specified through the operation of operation elements by a user due to being equipped with a trained model that has learned the singing voice of the given singer. Furthermore, according to at least some of the aspects of the present invention, when the user plays a chord, a polyphonic synthesized singing voice corresponding to the chord and the preset lyric can be outputted.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory, and are intended to provide further explanation of the invention as claimed.
Embodiments of the present invention will be described in detail below with reference to the drawings.
While using the RAM 203 as working memory, the CPU 201 executes a control program stored in the ROM 202 and thereby controls the operation of the electronic keyboard instrument 100 in
The ROM 202 (memory) is also pre-stored with melody pitch data (215d) indicating operation elements that a user is to operate, singing voice output timing data (215c) indicating output timings at which respective singing voices for pitches indicated by the melody pitch data (215d) are to be output, and lyric data (215a) corresponding to the melody pitch data (215d).
The CPU 201 is provided with the timer 210 used in the present embodiment. The timer 210, for example, counts the progression of automatic performance in the electronic keyboard instrument 100.
Following a sound generation control instruction from the CPU 201, the sound source LSI 204 reads musical sound waveform data from a non-illustrated waveform ROM, for example, and outputs the musical sound waveform data to the D/A converter 211. The sound source LSI 204 is capable of 256-voice polyphony.
When the voice synthesis LSI 205 is given, as singing voice data 215, lyric data 215a and either pitch data 215b or melody pitch data 215d by the CPU 201, the voice synthesis LSI 205 synthesizes voice data for a corresponding singing voice and outputs this voice data to the D/A converter 212.
The lyric data 215a and the melody pitch data 215d are pre-stored in the ROM 202. Either the melody pitch data 215d pre-stored in the ROM 202 or pitch data 215b for a note number obtained in real time due to a user key press operation is input to the voice synthesis LSI 205 as pitch data.
In other words, when there is a user key press operation at a prescribed timing, an inferred singing voice is produced at a pitch corresponding to the key on which there was a key press operation, and when there is no user key press operation at a prescribed timing, an inferred singing voice is produced at a pitch indicated by the melody pitch data 215d stored in the ROM 202.
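The pitch selection just described can be sketched as follows. This is a minimal illustration only; the function name `select_pitch` and the use of `None` to represent the absence of a key press are assumptions made for the example and do not appear in the embodiment.

```python
from typing import Optional


def select_pitch(pressed_note: Optional[int], melody_note: int) -> int:
    """Return the note number to be sung at a prescribed timing.

    pressed_note: note number of the key pressed by the user at this timing,
                  or None when no key press was detected (assumed encoding).
    melody_note:  note number from the pre-stored melody pitch data.
    """
    # A key press at the prescribed timing takes priority over the stored melody.
    if pressed_note is not None:
        return pressed_note
    # Otherwise fall back to the pitch indicated by the stored melody pitch data.
    return melody_note
```

For example, `select_pitch(64, 60)` selects the pressed key, while `select_pitch(None, 60)` falls back to the stored melody pitch.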
Musical sound output data outputted from designated channel(s) (single or plural channels) of the sound source LSI 204 are inputted to the voice synthesis LSI 205 as instrument sound waveform data 220.
The key scanner 206 regularly scans the pressed/released states of the keys on the keyboard 101 and the operation states of the switches on the first switch panel 102 and the second switch panel 103 in
The LCD controller 609 is an integrated circuit (IC) that controls the display state of the LCD 505.
Along with lyric data 215a, the voice synthesis section 302 is input with pitch data 215b instructed by the CPU 201 on the basis of a key press on the keyboard 101 in
It is important to note that the output inferred singing voice data 217 is not based on sound source data 319 output by the trained model, but is based on instrument sound waveform data 220 output by the sound source LSI 204. Thus, in this aspect of the present invention, the electronic musical instrument 100 uses the instrument sound waveform data 220 output by the sound source LSI 204 instead of (in other words, without using) sound source data 319 output by the trained acoustic model 306. The instrument sound waveform data 220 are instrument sound waveform data having one or more pitches specified by the user by operating the keyboard 101 (or specified by the melody pitch data 215d stored in the ROM 202 if there is no keyboard operation by the user). The instrument sounds for the waveform data that are synthesized here preferably include, but are not limited to, the sounds of brass instruments, string instruments, and organs, as well as animal sounds. The instrument sound may be just one of these sounds, selected by a user operation of the first switch panel 102. Through diligent research, the present inventors have discovered that these listed instrument sounds are particularly effective when combined with the spectral data 318 that carry characteristics of a human singing voice.
In this embodiment of the present invention, if the user presses multiple keys at the keyboard 101 at the same time (specifying a chord, for example), a synthesized singing voice having certain characteristics of a human singing voice and having the corresponding multiple pitches is output (i.e., polyphonic output). That is, in this embodiment, for each of the pitches specified in the chord, the waveform data of the musical instrument having the corresponding pitch is modified by the spectral data 318 (formant information) outputted from the acoustic model 306, thereby adding the vocal characteristics of the singer with respect to which the acoustic model 306 has been trained to the inferred singing voice data 217, which is polyphonically output. This aspect is advantageous because when the user presses multiple keys at the same time, a polyphonic singing voice corresponding to the specified multiple pitches is output.
In conventional vocoders, users needed to sing while operating the keyboard; a microphone to pick up the user's singing voice was necessary. In this embodiment of the present invention, the user need not sing, and a microphone is not needed. Also, as noted above, in this embodiment, with respect to the acoustic feature data 317 (explained below) including spectral data 318 and sound source data 319, only the spectral data 318 is used in synthesizing the inferred singing voice data.
The acoustic effect application section 320 is input with effect application instruction data 215e, as a result of which the acoustic effect application section 320 applies an acoustic effect such as a vibrato effect, a tremolo effect, or a wah effect to the output data 321 output by the voice synthesis section 302.
Effect application instruction data 215e is input to the acoustic effect application section 320 in accordance with the pressing of a second key (for example, a black key) within a prescribed range from a first key that has been pressed by a user (for example, within one octave). The greater the difference in pitch between the first key and the second key, the greater the acoustic effect that is applied by the acoustic effect application section 320.
As illustrated in
The voice training section 301 and the voice synthesis section 302 in
(Non-Patent Document 1)
Kei Hashimoto and Shinji Takaki, “Statistical parametric speech synthesis based on deep learning”, Journal of the Acoustical Society of Japan, vol. 73, no. 1 (2017), pp. 55-62
The voice training section 301 in
The voice training section 301, for example, uses voice sounds that were recorded when a given singer sang a plurality of songs in an appropriate genre as training singing voice data for a given singer 312. Lyric text (training lyric data 311a) for each song is also prepared as training musical score data 311.
The training text analysis unit 303 is input with training musical score data 311, including lyric text (training lyric data 311a) and musical note data (training pitch data 311b), and the training text analysis unit 303 analyzes this data. The training text analysis unit 303 accordingly estimates and outputs a training linguistic feature sequence 313, which is a discrete numerical sequence expressing, inter alia, phonemes and pitches corresponding to the training musical score data 311.
In addition to this input of training musical score data 311, the training acoustic feature extraction unit 304 receives and analyzes training singing voice data for a given singer 312 that has been recorded via a microphone or the like when a given singer sang (for approximately two to three hours, for example) lyric text corresponding to the training musical score data 311. The training acoustic feature extraction unit 304 accordingly extracts and outputs a training acoustic feature sequence 314 representing phonetic features corresponding to the training singing voice data for a given singer 312.
As described in Non-Patent Document 1, in accordance with Equation (1) below, the model training unit 305 uses machine learning to estimate an acoustic model $\hat{\lambda}$ with which the probability $P(o \mid l, \lambda)$ that a training acoustic feature sequence 314 ($o$) will be generated given a training linguistic feature sequence 313 ($l$) and an acoustic model ($\lambda$) is maximized. In other words, a relationship between a linguistic feature sequence (text) and an acoustic feature sequence (voice sounds) is expressed using a statistical model, which here is referred to as an acoustic model.
$\hat{\lambda} = \arg\max_{\lambda} P(o \mid l, \lambda)$ (1)
Here, arg max denotes a computation that calculates the value of the argument underneath arg max that yields the greatest value for the function to the right of arg max.
The model training unit 305 outputs, as training result 315, model parameters expressing the acoustic model $\hat{\lambda}$ that have been calculated using Equation (1) through machine learning.
As illustrated in
The voice synthesis section 302, which is functionality performed by the voice synthesis LSI 205, includes a text analysis unit 307, the trained acoustic model 306, and a vocalization model unit 308. The voice synthesis section 302 performs statistical voice synthesis processing in which output data 321, corresponding to singing voice data 215 including lyric text, is synthesized by making predictions using the statistical model referred to herein as the trained acoustic model 306.
As a result of a performance by a user made in concert with an automatic performance, the text analysis unit 307 is input with singing voice data 215, which includes information relating to phonemes, pitches, and the like for lyrics specified by the CPU 201 in
As described in Non-Patent Document 1, the trained acoustic model 306 is input with the linguistic feature sequence 316, and using this, the trained acoustic model 306 estimates and outputs an acoustic feature sequence 317 (acoustic feature data 317) corresponding thereto. In other words, in accordance with Equation (2) below, the trained acoustic model 306 estimates a value ($\hat{o}$) for an acoustic feature sequence 317 at which the probability $P(o \mid l, \hat{\lambda})$ that an acoustic feature sequence 317 ($o$) will be generated based on a linguistic feature sequence 316 ($l$) input from the text analysis unit 307 and an acoustic model $\hat{\lambda}$ set using the training result 315 of machine learning performed in the model training unit 305 is maximized.
$\hat{o} = \arg\max_{o} P(o \mid l, \hat{\lambda})$ (2)
The vocalization model unit 308 is input with the acoustic feature sequence 317. With this, the vocalization model unit 308 generates output data 321 corresponding to the singing voice data 215 including lyric text specified by the CPU 201. An acoustic effect is applied to the output data 321 in the acoustic effect application section 320, described later, and the output data 321 is converted into the final inferred singing voice data 217. This inferred singing voice data 217 is output from the D/A converter 212, goes through the mixer 213 and the amplifier 214 in
The acoustic features expressed by the training acoustic feature sequence 314 and the acoustic feature sequence 317 include spectral data that models the vocal tract of a person, and sound source data that models the vocal cords of a person. A mel-cepstrum, line spectral pairs (LSP), or the like may be employed for the spectral data. A power value and a fundamental frequency (F0) indicating the pitch frequency of the voice of a person may be employed for the sound source data. The vocalization model unit 308 includes a synthesis filter 310. Instrument sound waveform data 220 that are outputs from designated sound generation channels (single or multiple channels) of the sound source LSI 204 in
As described above, instrument sound waveform data 220 generated and output by the sound source LSI 204 based on the playing of a user on the keyboard 101 (
The sound source LSI 204 may be operated such that, for example, at the same time that the output from a plurality of designated sound generation channels is supplied to the voice synthesis LSI 205 as instrument sound waveform data 220, the output of another channel(s) is output as normal musical sound output data 218. Operation is thus possible in which singing voices for a melody are vocalized by the voice synthesis LSI 205 at the same time that accompaniment sounds are produced as normal instrument sounds or instrument sounds for a melody line are produced.
The instrument sound waveform data 220 input to the synthesis filter 310 in a vocoder mode may be any kind of signal, but in terms of qualities as a sound source signal, instrument sounds that have many harmonic components and can be sustained for long durations, such as, for example, brass sounds, string sounds, and organ sounds, are preferable. Of course, a very amusing effect may also be obtained by deliberately using an instrument sound that does not remotely adhere to this standard, for example an instrument sound that sounds like an animal cry. As one specific example, data obtained by sampling the cry of a pet dog is input to the synthesis filter 310 as an instrument sound. Sound is then produced from the speaker on the basis of inferred singing voice data 217 output from the synthesis filter 310 and the acoustic effect application section 320. This results in a very amusing effect in which it sounds as if the pet dog were singing the lyrics.
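The idea of shaping an arbitrary instrument waveform with frame-by-frame spectral data can be sketched as follows. This is only a simplified stand-in, under stated assumptions: the spectral data for each frame is represented here by a short list of all-pole filter coefficients, whereas the embodiment mentions mel-cepstrum or LSP parameters, and the names `synthesis_filter`, `frame_coeffs`, and `hop` are hypothetical.

```python
from typing import List, Sequence


def synthesis_filter(excitation: Sequence[float],
                     frame_coeffs: Sequence[Sequence[float]],
                     hop: int) -> List[float]:
    """Apply a time-varying all-pole filter to an excitation signal.

    excitation:   instrument sound waveform samples used as the sound source.
    frame_coeffs: one coefficient list [a1, a2, ...] per frame (assumed
                  representation of the per-frame spectral data).
    hop:          number of samples per frame (must be at least 1).
    """
    output: List[float] = []
    last_frame = len(frame_coeffs) - 1
    for n, x in enumerate(excitation):
        # Use the coefficients of the frame that contains sample n.
        coeffs = frame_coeffs[min(n // hop, last_frame)]
        # y[n] = x[n] - sum_k a_k * y[n - k], using only samples already produced.
        feedback = sum(a_k * output[n - 1 - k]
                       for k, a_k in enumerate(coeffs) if n - 1 - k >= 0)
        output.append(x - feedback)
    return output
```

Because the excitation is simply whatever waveform is supplied, the same sketch applies whether the source is a brass tone, an organ tone, or a sampled animal cry. For a chord, each pitch's waveform could be passed through the same per-frame coefficients.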
The sampling frequency of the training singing voice data for a given singer 312 is, for example, 16 kHz (kilohertz). When a mel-cepstrum parameter obtained through mel-cepstrum analysis, for example, is employed for a spectral parameter contained in the training acoustic feature sequence 314 and the acoustic feature sequence 317, the frame update period is, for example, 5 msec (milliseconds). In addition, when mel-cepstrum analysis is performed, the length of the analysis window is 25 msec, and the window function is a twenty-fourth-order Blackman window function.
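The example figures above translate into sample counts as sketched below. The sampling frequency, frame update period, and window length are taken from the text. The helper names and the policy of counting only frames that fit entirely within the signal (no padding) are assumptions made for illustration, and the window uses the standard Blackman coefficients.

```python
import math
from typing import List

SAMPLE_RATE_HZ = 16_000  # sampling frequency given in the text
FRAME_SHIFT_MS = 5       # frame update period given in the text
WINDOW_MS = 25           # analysis window length given in the text


def ms_to_samples(duration_ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate_hz * duration_ms // 1000


def blackman_window(length: int) -> List[float]:
    """Standard Blackman window: 0.42 - 0.5 cos(2*pi*n/(N-1)) + 0.08 cos(4*pi*n/(N-1))."""
    if length < 2:
        return [1.0] * length
    return [0.42
            - 0.5 * math.cos(2 * math.pi * n / (length - 1))
            + 0.08 * math.cos(4 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_count(num_samples: int) -> int:
    """Number of full analysis frames, assuming no padding at the signal edges."""
    hop = ms_to_samples(FRAME_SHIFT_MS)
    window = ms_to_samples(WINDOW_MS)
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop
```

With these values, each frame advances by 80 samples and each analysis window spans 400 samples.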
An acoustic effect such as a vibrato effect, a tremolo effect, or a wah effect is applied to the output data 321 output from the voice synthesis section 302 by the acoustic effect application section 320 in the voice synthesis LSI 205.
A “vibrato effect” refers to an effect whereby, when a note in a song is drawn out, the pitch level is periodically varied by a prescribed amount (depth).
A “tremolo effect” refers to an effect whereby one or more notes are rapidly repeated.
A “wah effect” is an effect whereby the peak-gain frequency of a bandpass filter is moved so as to yield a sound resembling a voice saying “wah-wah”.
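As one illustration of the first of these definitions, a vibrato can be sketched as a sinusoidal variation of pitch around the sung note. The sinusoidal shape, the expression of depth in cents, and the default depth and rate values are assumptions made for the example; the embodiment only states that the pitch is varied periodically by a prescribed amount.

```python
import math


def vibrato_frequency(base_hz: float, t_sec: float,
                      depth_cents: float = 50.0, rate_hz: float = 6.0) -> float:
    """Instantaneous frequency of a note with vibrato applied.

    base_hz:     frequency of the note without vibrato.
    t_sec:       time since the start of the note, in seconds.
    depth_cents: peak pitch deviation in cents (assumed unit and default).
    rate_hz:     vibrato rate in cycles per second (assumed default).
    """
    # Periodic pitch offset, in cents, at time t.
    offset_cents = depth_cents * math.sin(2 * math.pi * rate_hz * t_sec)
    # 1200 cents correspond to one octave, i.e. a factor of two in frequency.
    return base_hz * 2 ** (offset_cents / 1200)
```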
When a user performs an operation whereby a second key (second operation element) on the keyboard 101 (
In this case, the user is able to vary the degree of the pitch effect in the acoustic effect application section 320 by, with respect to the pitch of the first key specifying a singing voice, specifying the second key that is repeatedly struck such that the difference in pitch between the second key and the first key is a desired difference. For example, the degree of the pitch effect can be made to vary such that the depth of the acoustic effect is set to a maximum value when the difference in pitch between the second key and the first key is one octave and such that the degree of the acoustic effect is weaker the lesser the difference in pitch.
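The relationship between pitch difference and effect depth can be sketched as follows. The text only states that the depth is at its maximum at a difference of one octave and weaker for smaller differences; the linear mapping, the clamping of larger intervals to one octave, and the function name `effect_depth` are assumptions made for this example.

```python
def effect_depth(first_note: int, second_note: int, max_depth: float = 1.0) -> float:
    """Depth of the acoustic effect from the interval between two keys.

    first_note:  note number of the key that specifies the singing voice pitch.
    second_note: note number of the key that triggers the effect.
    max_depth:   depth applied when the keys are one octave (12 semitones) apart.
    """
    # Only the size of the interval matters, not its direction.
    interval = abs(second_note - first_note)
    # Assumed behavior: intervals wider than one octave are treated as one octave.
    interval = min(interval, 12)
    # Assumed linear scaling between no effect and the maximum depth.
    return max_depth * interval / 12
```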
The second key on the keyboard 101 that is repeatedly struck may be a white key. However, if the second key is a black key, for example, the second key is less liable to interfere with a performance operation on the first key for specifying the pitch of a singing voice sound.
In the present embodiment, it is thus possible to apply various additional acoustic effects in the acoustic effect application section 320 to output data 321 that is output from the voice synthesis section 302 to generate the final inferred singing voice data 217.
It should be noted that the application of an acoustic effect ends when no key presses on the second key have been detected for a set time (for example, several hundred milliseconds).
As another example, such an acoustic effect may be applied by just one press of the second key while the first key is being pressed, in other words, without repeatedly striking the second key as above. In this case too, the depth of the acoustic effect may change in accordance with the difference in pitch between the first key and the second key. The acoustic effect may also be applied while the second key is being pressed, and application of the acoustic effect ended in accordance with the detection of release of the second key.
As yet another example, such an acoustic effect may be applied even when the first key is released after pressing the second key while the first key was being pressed. This kind of pitch effect may also be applied upon the detection of a “trill”, whereby the first key and the second key are repeatedly struck in an alternating manner.
In the present specification, as a matter of convenience, the musical technique whereby such acoustic effects are applied is sometimes referred to as a “so-called legato playing style”.
Next, a first embodiment of statistical voice synthesis processing performed by the voice training section 301 and the voice synthesis section 302 in
(Non-Patent Document 2)
Shinji Sako, Keijiro Saino, Yoshihiko Nankaku, Keiichi Tokuda, and Tadashi Kitamura, “A trainable singing voice synthesis system capable of representing personal characteristics and singing styles”, Information Processing Society of Japan (IPSJ) Technical Report, Music and Computer (MUS) 2008 (12 (2008-MUS-074)), pp. 39-44, 2008-02-08
In the first embodiment of statistical voice synthesis processing, when a user vocalizes lyrics in accordance with a given melody, HMM acoustic models are trained on how singing voice feature parameters, such as vibration of the vocal cords and vocal tract characteristics, change over time during vocalization. More specifically, the HMM acoustic models model, on a phoneme basis, spectrum and fundamental frequency (and the temporal structures thereof) obtained from the training singing voice data.
First, processing by the voice training section 301 in
Here, $o_t$ represents an acoustic feature in frame $t$, $T$ represents the number of frames, $q = (q_1, \ldots, q_T)$ represents the state sequence of a HMM acoustic model, and $q_t$ represents the state number of the HMM acoustic model in frame $t$. Further, aq
The spectral parameters of singing voice sounds can be modeled using continuous HMMs. However, because logarithmic fundamental frequency (F0) is a variable dimension time series signal that takes on a continuous value in voiced segments and is not defined in unvoiced segments, fundamental frequency (F0) cannot be directly modeled by regular continuous HMMs or discrete HMMs. Multi-space probability distribution HMMs (MSD-HMMs), which are HMMs based on a multi-space probability distribution compatible with variable dimensionality, are thus used to simultaneously model mel-cepstrums (spectral parameters), voiced sounds having a logarithmic fundamental frequency (F0), and unvoiced sounds as multidimensional Gaussian distributions, Gaussian distributions in one-dimensional space, and Gaussian distributions in zero-dimensional space, respectively.
As for the features of phonemes making up a singing voice, it is known that even for identical phonemes, acoustic features may vary due to being influenced by various factors. For example, the spectrum and logarithmic fundamental frequency (F0) of a phoneme, which is a basic phonological unit, may change depending on, for example, singing style, tempo, or on preceding/subsequent lyrics and pitches. Factors such as these that exert influence on acoustic features are called “context”. In the first embodiment of statistical voice synthesis processing, HMM acoustic models that take context into account (context-dependent models) can be employed in order to accurately model acoustic features in voice sounds. Specifically, the training text analysis unit 303 may output a training linguistic feature sequence 313 that takes into account not only phonemes and pitch on a frame-by-frame basis, but also factors such as preceding and subsequent phonemes, accent and vibrato immediately prior to, at, and immediately after each position, and so on. In order to make dealing with combinations of context more efficient, decision tree based context clustering may be employed. Context clustering is a technique in which a binary tree is used to divide a set of HMM acoustic models into a tree structure, whereby HMM acoustic models are grouped into clusters having similar combinations of context. Each node within a tree is associated with a bifurcating question such as “Is the preceding phoneme /a/?” that distinguishes context, and each leaf node is associated with a training result 315 (model parameters) corresponding to a particular HMM acoustic model. For any combination of contexts, by traversing the tree in accordance with the questions at the nodes, one of the leaf nodes can be reached and the training result 315 (model parameters) corresponding to that leaf node selected. 
By selecting an appropriate decision tree structure, highly accurate and highly generalized HMM acoustic models (context-dependent models) can be estimated.
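The tree traversal described above can be sketched as follows. This is a toy illustration only: the class names, the dictionary-based context, the two example questions, and the string labels standing in for model parameters are all assumptions and are not taken from the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Context = Dict[str, str]


@dataclass
class Leaf:
    params: str  # stand-in for the model parameters held at a leaf node


@dataclass
class Node:
    question: Callable[[Context], bool]  # yes/no question about the context
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def find_leaf(tree: Union[Node, Leaf], context: Context) -> Leaf:
    """Traverse the tree by answering each node's question for this context."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(context) else node.no
    return node


# Hypothetical two-question tree for illustration.
example_tree = Node(
    question=lambda ctx: ctx.get("preceding_phoneme") == "a",
    yes=Leaf("cluster_0"),
    no=Node(
        question=lambda ctx: ctx.get("following_phoneme") == "i",
        yes=Leaf("cluster_1"),
        no=Leaf("cluster_2"),
    ),
)
```

Any combination of contexts, including one never seen during training, reaches exactly one leaf, which is what allows parameters to be selected for unseen contexts.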
The duration of states 401 #1 to #3 indicated by the HMM at (a) in
As a result of training, the model training unit 305 in
As a result of training, the model training unit 305 in
Moreover, as a result of training, the model training unit 305 in
Next, processing by the voice synthesis section 302 in
As described in the above-referenced Non-Patent Documents, in accordance with Equation (2), the trained acoustic model 306 estimates a value ($\hat{o}$) for an acoustic feature sequence 317 at which the probability $P(o \mid l, \hat{\lambda})$ that an acoustic feature sequence 317 ($o$) will be generated based on a linguistic feature sequence 316 ($l$) input from the text analysis unit 307 and an acoustic model $\hat{\lambda}$ set using the training result 315 of machine learning performed in the model training unit 305 is maximized. Using the state sequence
estimated by the state duration model at (b) in
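Here, $\mu_{\hat{q}} = [\mu_{\hat{q}_1}^{\top}, \ldots, \mu_{\hat{q}_T}^{\top}]^{\top}$, $\Sigma_{\hat{q}} = \mathrm{diag}[\Sigma_{\hat{q}_1}, \ldots, \Sigma_{\hat{q}_T}]$, and $\mu_{\hat{q}_t}$ and $\Sigma_{\hat{q}_t}$ are the mean vector and the covariance matrix, respectively, for state $\hat{q}_t$.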
$o = Wc$ (5)
Here, $W$ is a matrix whereby an acoustic feature sequence $o$ containing a dynamic feature is obtained from static feature sequence $c = [c_1^{\top}, \ldots, c_T^{\top}]^{\top}$. With Equation (5) as a constraint, the model training unit 305 solves Equation (4) as expressed by Equation (6) below.
Here, ĉ is the static feature sequence with the greatest probability of output under dynamic feature constraint. By taking dynamic features into account, discontinuities at state boundaries can be resolved, enabling a smoothly changing acoustic feature sequence 317 to be obtained. This also makes it possible for high quality singing voice sound output data 321 to be generated in the synthesis filter 310.
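How the constraint $o = Wc$ yields a smooth static sequence can be sketched numerically. The sketch assumes the usual Gaussian formulation, in which the most probable static sequence satisfies $(W^{\top}\Sigma^{-1}W)\,c = W^{\top}\Sigma^{-1}\mu$, a one-dimensional static feature, a diagonal covariance, and a simple delta window $\Delta c_t = 0.5\,(c_{t+1} - c_{t-1})$ with missing neighbors treated as zero. None of these specifics are confirmed by the text, and the function names are hypothetical.

```python
import numpy as np


def build_window_matrix(num_frames: int) -> np.ndarray:
    """Build W so that W @ c interleaves static and delta features per frame."""
    W = np.zeros((2 * num_frames, num_frames))
    for t in range(num_frames):
        W[2 * t, t] = 1.0                  # static feature c_t
        if t > 0:
            W[2 * t + 1, t - 1] = -0.5     # delta uses the previous frame
        if t < num_frames - 1:
            W[2 * t + 1, t + 1] = 0.5      # delta uses the next frame
    return W


def generate_static_sequence(mean: np.ndarray, variance: np.ndarray) -> np.ndarray:
    """Solve (W^T P W) c = W^T P mean, where P is the diagonal precision matrix.

    mean, variance: length 2*T vectors ordered [static_1, delta_1, static_2, ...].
    """
    num_frames = mean.shape[0] // 2
    W = build_window_matrix(num_frames)
    precision = 1.0 / variance
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * mean)
    return np.linalg.solve(A, b)
```

When the supplied means are exactly consistent with some static sequence (that is, `mean = W @ c`), the solver returns that sequence. When the static and delta means disagree, as happens at state boundaries, the result is a compromise that varies smoothly.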
It should be noted that phoneme boundaries in the singing voice data often are not aligned with the boundaries of musical notes established by the musical score. Such timewise fluctuations are considered to be essential in terms of musical expression. Accordingly, in the first embodiment of statistical voice synthesis processing employing HMM acoustic models described above, in the vocalization of singing voices, a technique may be employed that assumes that there will be time disparities due to various influences, such as phonological differences during vocalization, pitch, or rhythm, and that models lag between vocalization timings in the training data and the musical score. Specifically, as a model for lag on a musical note basis, lag between a singing voice, as viewed in units of musical notes, and a musical score may be represented using a one-dimensional Gaussian distribution and handled as a context-dependent HMM acoustic model similarly to other spectral parameters, logarithmic fundamental frequencies (F0), and the like. In singing voice synthesis such as this, in which HMM acoustic models that include context for “lag” are employed, after the boundaries in time represented by a musical score have been established, maximizing the joint probability of both the phoneme state duration model and the lag model on a musical note basis makes it possible to determine a temporal structure that takes musical note fluctuations in the training data into account.
Next, a second embodiment of the statistical voice synthesis processing performed by the voice training section 301 and the voice synthesis section 302 in
As described in the above-referenced Non-Patent Documents, normally, acoustic features are calculated in units of frames that, for example, have a width of 5.1 msec (milliseconds), and linguistic features are calculated in phoneme units. Accordingly, the unit of time for linguistic features differs from that for acoustic features. In the first embodiment of statistical voice synthesis processing in which HMM acoustic models are employed, correspondence between acoustic features and linguistic features is expressed using a HMM state sequence, and the model training unit 305 automatically learns the correspondence between acoustic features and linguistic features based on the training musical score data 311 and training singing voice data for a given singer 312 in
In the second embodiment of statistical voice synthesis processing, the model training unit 305 in the voice training section 301 in
During voice synthesis, a linguistic feature sequence 316 phoneme sequence (corresponding to (b) in
The vocalization model unit 308, as depicted using the group of heavy solid arrows 503 in
As described in the above-referenced Non-Patent Documents, the DNN is trained so as to minimize squared error. This is computed according to Equation (7) below using pairs of acoustic features and linguistic features denoted in frames.
In this equation, $o_t$ and $l_t$ respectively represent an acoustic feature and a linguistic feature in the $t$-th frame, $\hat{\lambda}$ represents model parameters for the DNN of the trained acoustic model 306, and $g_\lambda(\cdot)$ is the non-linear transformation function represented by the DNN. The model parameters for the DNN are able to be efficiently estimated through backpropagation. When correspondence with processing within the model training unit 305 in the statistical voice synthesis represented by Equation (1) is taken into account, DNN training can be represented as in Equation (8) below.
Here, $\tilde{\mu}_t$ is given as in Equation (9) below.
$$\tilde{\mu}_t = g_\lambda(l_t) \qquad (9)$$
As in Equation (8) and Equation (9), relationships between acoustic features and linguistic features are able to be expressed using the normal distribution $\mathcal{N}(o_t \mid \tilde{\mu}_t, \tilde{\Sigma}_t)$, which uses output from the DNN for the mean vector. In the second embodiment of statistical voice synthesis processing in which a DNN is employed, normally, covariance matrices that are independent of the linguistic feature sequences $l_t$ are used. In other words, in all frames, the same covariance matrix $\tilde{\Sigma}_g$ is used regardless of the linguistic feature sequences $l_t$. When the covariance matrix $\tilde{\Sigma}_g$ is an identity matrix, Equation (8) expresses a training process equivalent to that in Equation (7).
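The stated equivalence can be checked with a short derivation. The forms of Equations (7) and (8) written here are reconstructions from the surrounding definitions (a squared-error criterion and a per-frame normal distribution), so they should be read as assumptions rather than as the equations of the original text. The feature dimension $D$ and the frame count $T$ are likewise symbols introduced only for this sketch.

```latex
\begin{aligned}
% Assumed form of Equation (7): squared-error training
\hat{\lambda} &= \arg\min_{\lambda} \frac{1}{2} \sum_{t=1}^{T}
                 \left\lVert o_t - g_\lambda(l_t) \right\rVert^{2} \\
% Assumed form of Equation (8): maximum likelihood under a normal distribution
\hat{\lambda} &= \arg\max_{\lambda} \prod_{t=1}^{T}
                 \mathcal{N}\!\left(o_t \mid \tilde{\mu}_t, \tilde{\Sigma}_t\right),
                 \qquad \tilde{\mu}_t = g_\lambda(l_t) \\
% With the covariance fixed to the identity matrix in every frame
\log \mathcal{N}\!\left(o_t \mid \tilde{\mu}_t, I\right)
              &= -\frac{D}{2}\log(2\pi)
                 - \frac{1}{2}\left\lVert o_t - g_\lambda(l_t) \right\rVert^{2} \\
\arg\max_{\lambda} \sum_{t=1}^{T} \log \mathcal{N}\!\left(o_t \mid \tilde{\mu}_t, I\right)
              &= \arg\min_{\lambda} \frac{1}{2} \sum_{t=1}^{T}
                 \left\lVert o_t - g_\lambda(l_t) \right\rVert^{2}
\end{aligned}
```

The first term of the log density does not depend on $\lambda$, so maximizing the likelihood under an identity covariance selects the same parameters as minimizing the squared error.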
As described in
Detailed description follows regarding the operation of the embodiment of the electronic keyboard instrument 100 of
The header chunk is made up of five values: ChunkID, ChunkSize, FormatType, NumberOfTrack, and TimeDivision. ChunkID is a four byte ASCII code “4D 54 68 64” (in base 16) corresponding to the four half-width characters “MThd”, which indicates that the chunk is a header chunk. ChunkSize is four bytes of data that indicate the length of the FormatType, NumberOfTrack, and TimeDivision part of the header chunk (excluding ChunkID and ChunkSize). This length is always “00 00 00 06” (in base 16), for six bytes. FormatType is two bytes of data “00 01” (in base 16). This means that the format type is format 1, in which multiple tracks are used. NumberOfTrack is two bytes of data “00 02” (in base 16). This indicates that in the case of the present embodiment, two tracks, corresponding to the lyric part and the accompaniment part, are used. TimeDivision is data indicating a timebase value, which itself indicates resolution per quarter note. TimeDivision is two bytes of data “01 E0” (in base 16). In the case of the present embodiment, this indicates 480 in decimal notation.
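As an illustration of the layout just described, the following minimal Python sketch decodes the 14 bytes of such a header chunk. The function name `parse_header_chunk` and the dictionary keys are chosen here for illustration, and the big-endian byte order is assumed from the standard MIDI file convention; none of this is taken from the embodiment's actual firmware.

```python
import struct


def parse_header_chunk(data: bytes) -> dict:
    """Decode ChunkID, ChunkSize, FormatType, NumberOfTrack, TimeDivision."""
    # ">4sIHHH": big-endian, 4-byte ID, 4-byte size, then three 2-byte values.
    chunk_id, chunk_size, format_type, number_of_track, time_division = (
        struct.unpack(">4sIHHH", data[:14])
    )
    if chunk_id != b"MThd":
        raise ValueError("not a header chunk")
    if chunk_size != 6:
        raise ValueError("header chunk body must be six bytes long")
    return {
        "FormatType": format_type,
        "NumberOfTrack": number_of_track,
        "TimeDivision": time_division,
    }


# The byte values quoted in the text above.
header_bytes = bytes.fromhex("4D546864 00000006 0001 0002 01E0")
header = parse_header_chunk(header_bytes)
```

With the values quoted above, the decoded header reports format 1, two tracks, and a timebase of 480.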
The first and second track chunks are each made up of a ChunkID, ChunkSize, and performance data pairs. The performance data pairs are made up of DeltaTime_1[i] and Event_1[i] (for the first track chunk/lyric part), or DeltaTime_2[i] and Event_2[i] (for the second track chunk/accompaniment part). Note that 0≤i≤L for the first track chunk/lyric part, and 0≤i≤M for the second track chunk/accompaniment part. ChunkID is a four byte ASCII code “4D 54 72 6B” (in base 16) corresponding to the four half-width characters “MTrk”, which indicates that the chunk is a track chunk. ChunkSize is four bytes of data that indicate the length of the respective track chunk (excluding ChunkID and ChunkSize).
DeltaTime_1[i] is variable-length data of one to four bytes indicating a wait time (relative time) from the execution time of Event_1[i−1] immediately prior thereto. Similarly, DeltaTime_2[i] is variable-length data of one to four bytes indicating a wait time (relative time) from the execution time of Event_2[i−1] immediately prior thereto. Event_1[i] is a meta event (timing information) designating the vocalization timing and pitch of a lyric in the first track chunk/lyric part. Event_2[i] is a MIDI event (timing information) designating “note on” or “note off” or is a meta event designating time signature in the second track chunk/accompaniment part. In each DeltaTime_1[i] and Event_1[i] performance data pair of the first track chunk/lyric part, Event_1[i] is executed after a wait of DeltaTime_1[i] from the execution time of the Event_1[i−1] immediately prior thereto. The vocalization and progression of lyrics is realized thereby. In each DeltaTime_2[i] and Event_2[i] performance data pair of the second track chunk/accompaniment part, Event_2[i] is executed after a wait of DeltaTime_2[i] from the execution time of the Event_2[i−1] immediately prior thereto. The progression of automatic accompaniment is realized thereby.
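The variable-length wait times can be decoded as in the sketch below. It assumes the usual standard MIDI file encoding, in which each byte carries seven bits of the value and a set top bit means that another byte follows; the text above only states that the data is one to four bytes long. The function name `read_delta_time` is hypothetical.

```python
def read_delta_time(data: bytes, pos: int = 0) -> tuple:
    """Return (wait time in ticks, position of the byte after the field)."""
    value = 0
    for n in range(4):  # a delta time occupies at most four bytes
        byte = data[pos + n]
        value = (value << 7) | (byte & 0x7F)  # append the low seven bits
        if byte & 0x80 == 0:  # top bit clear: this was the last byte
            return value, pos + n + 1
    raise ValueError("delta time longer than four bytes")


# Two bytes 0x83 0x60 encode (3 << 7) | 0x60 = 480 ticks,
# which is one quarter note at a TimeDivision of 480.
ticks, next_pos = read_delta_time(b"\x83\x60")
```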
After first performing initialization processing (step S701), the CPU 201 repeatedly executes the series of processes from step S702 to step S708.
In this repeat processing, the CPU 201 first performs switch processing (step S702). Here, based on an interrupt from the key scanner 206 in
Next, based on an interrupt from the key scanner 206 in
Next, the CPU 201 processes data that should be displayed on the LCD 104 in
Next, the CPU 201 performs song playback processing (step S705). In this processing, the CPU 201 performs a control process described in
Then, the CPU 201 performs sound source processing (step S706). In the sound source processing, the CPU 201 performs control processing such as that for controlling the envelope of musical sounds being generated in the sound source LSI 204.
Then, the CPU 201 performs voice synthesis processing (step S707). In the voice synthesis processing, the CPU 201 controls voice synthesis by the voice synthesis LSI 205.
Finally, the CPU 201 determines whether or not a user has pressed a non-illustrated power-off switch to turn off the power (step S708). If the determination of step S708 is NO, the CPU 201 returns to the processing of step S702. If the determination of step S708 is YES, the CPU 201 ends the control process illustrated in the flowchart of
First, in
TickTime (sec)=60/Tempo/TimeDivision (10)
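As a worked example of Equation (10), the short sketch below evaluates TickTime for a tempo of 120 beats per minute and a TimeDivision of 480. The tempo value is chosen only for illustration, and the function name `tick_time_seconds` is hypothetical.

```python
def tick_time_seconds(tempo_bpm: float, time_division: int) -> float:
    """Equation (10): duration of one tick in seconds."""
    return 60.0 / tempo_bpm / time_division


# 60 / 120 / 480 = 1/960 s, i.e. roughly 1.04 ms per tick.
tick = tick_time_seconds(120, 480)
```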
Accordingly, in the initialization processing illustrated in the flowchart of
Next, the CPU 201 sets a timer interrupt for the timer 210 in
Then, the CPU 201 performs additional initialization processing, such as that to initialize the RAM 203 in
The flowcharts in
First, the CPU 201 determines whether or not the tempo of lyric progression and automatic accompaniment has been changed using a switch for changing tempo on the first switch panel 102 in
Next, the CPU 201 determines whether or not a song has been selected with the second switch panel 103 in
Then, the CPU 201 determines whether or not a switch for starting a song on the first switch panel 102 in
Then, the CPU 201 determines whether or not a switch for selecting an effect on the first switch panel 102 in
Depending on the setting, a plurality of effects may be applied at the same time.
Finally, the CPU 201 determines whether or not any other switches on the first switch panel 102 or the second switch panel 103 in
The CPU 201 subsequently ends the switch processing at step S702 in
Similarly to at step S801 in
Next, similarly to at step S802 in
First, with regards to the progression of automatic performance, the CPU 201 initializes the values of both a DeltaT_1 (first track chunk) variable and a DeltaT_2 (second track chunk) variable in the RAM 203 for counting, in units of TickTime, relative time since the last event to 0. Next, the CPU 201 initializes the respective values of an AutoIndex_1 variable in the RAM 203 for specifying an i value (1≤i≤L−1) for DeltaTime_1[i] and Event_1[i] performance data pairs in the first track chunk of the musical piece data illustrated in
Next, the CPU 201 initializes the value of a SongIndex variable in the RAM 203, which designates the current song position, to 0 (step S822).
The CPU 201 also initializes the value of a SongStart variable in the RAM 203, which indicates whether to advance (=1) or not advance (=0) the lyrics and accompaniment, to 1 (progress) (step S823).
Then, the CPU 201 determines whether or not a user has configured the electronic keyboard instrument 100 to play back an accompaniment together with lyric playback using the first switch panel 102 in
If the determination of step S824 is YES, the CPU 201 sets the value of a Bansou variable in the RAM 203 to 1 (has accompaniment) (step S825). Conversely, if the determination of step S824 is NO, the CPU 201 sets the value of the Bansou variable to 0 (no accompaniment) (step S826). After the processing at step S825 or step S826, the CPU 201 ends the song-starting processing at step S906 in
First, the CPU 201 performs a series of processes corresponding to the first track chunk (steps S1001 to S1006). The CPU 201 starts by determining whether or not the value of SongStart is equal to 1, in other words, whether or not advancement of the lyrics and accompaniment has been instructed (step S1001).
When the CPU 201 has determined there to be no instruction to advance the lyrics and accompaniment (the determination of step S1001 is NO), the CPU 201 ends the automatic-performance interrupt processing illustrated in the flowchart of
When the CPU 201 has determined there to be an instruction to advance the lyrics and accompaniment (the determination of step S1001 is YES), the CPU 201 then determines whether or not the value of DeltaT_1, which indicates the relative time since the last event in the first track chunk, matches the wait time DeltaTime_1[AutoIndex_1] of the performance data pair indicated by the value of AutoIndex_1 that is about to be executed (step S1002).
If the determination of step S1002 is NO, the CPU 201 increments the value of DeltaT_1, which indicates the relative time since the last event in the first track chunk, by 1, and the CPU 201 allows the time to advance by 1 TickTime corresponding to the current interrupt (step S1003). Following this, the CPU 201 proceeds to step S1007, which will be described later.
If the determination of step S1002 is YES, the CPU 201 executes the first track chunk event Event_1[AutoIndex_1] of the performance data pair indicated by the value of AutoIndex_1 (step S1004). This event is a song event that includes lyric data.
Then, the CPU 201 stores the value of AutoIndex_1, which indicates the position of the song event that should be performed next in the first track chunk, in the SongIndex variable in the RAM 203 (step S1004).
The CPU 201 then increments the value of AutoIndex_1 for referencing the performance data pairs in the first track chunk by 1 (step S1005).
Next, the CPU 201 resets the value of DeltaT_1, which indicates the relative time since the song event most recently referenced in the first track chunk, to 0 (step S1006). Following this, the CPU 201 proceeds to the processing at step S1007.
Then, the CPU 201 performs a series of processes corresponding to the second track chunk (steps S1007 to S1013). The CPU 201 starts by determining whether or not the value of DeltaT_2, which indicates the relative time since the last event in the second track chunk, matches the wait time DeltaTime_2[AutoIndex_2] of the performance data pair indicated by the value of AutoIndex_2 that is about to be executed (step S1007).
If the determination of step S1007 is NO, the CPU 201 increments the value of DeltaT_2, which indicates the relative time since the last event in the second track chunk, by 1, and the CPU 201 allows the time to advance by 1 TickTime corresponding to the current interrupt (step S1008). The CPU 201 subsequently ends the automatic-performance interrupt processing illustrated in the flowchart of
If the determination of step S1007 is YES, the CPU 201 then determines whether or not the value of the Bansou variable in the RAM 203 that denotes accompaniment playback is equal to 1 (has accompaniment) (step S1009) (see steps S824 to S826 in
If the determination of step S1009 is YES, the CPU 201 executes the second track chunk accompaniment event Event_2[AutoIndex_2] indicated by the value of AutoIndex_2 (step S1010). If the event Event_2[AutoIndex_2] executed here is, for example, a “note on” event, the key number and velocity specified by this “note on” event are used to issue a command to the sound source LSI 204 in
However, if the determination of step S1009 is NO, the CPU 201 skips step S1010 and proceeds to the processing at the next step S1011 without executing the current accompaniment event Event_2[AutoIndex_2]. Here, in order to progress in sync with the lyrics, the CPU 201 performs only control processing that advances events.
After step S1010, or when the determination of step S1009 is NO, the CPU 201 increments the value of AutoIndex_2 for referencing the performance data pairs for accompaniment data in the second track chunk by 1 (step S1011).
Next, the CPU 201 resets the value of DeltaT_2, which indicates the relative time since the event most recently executed in the second track chunk, to 0 (step S1012).
Then, the CPU 201 determines whether or not the wait time DeltaTime_2[AutoIndex_2] of the performance data pair indicated by the value of AutoIndex_2 to be executed next in the second track chunk is equal to 0, or in other words, whether or not this event is to be executed at the same time as the current event (step S1013).
If the determination of step S1013 is NO, the CPU 201 ends the current automatic-performance interrupt processing illustrated in the flowchart of
If the determination of step S1013 is YES, the CPU 201 returns to step S1009, and repeats the control processing relating to the event Event_2[AutoIndex_2] of the performance data pair indicated by the value of AutoIndex_2 to be executed next in the second track chunk. The CPU 201 repeatedly performs the processing of steps S1009 to S1013 the same number of times as there are events to be simultaneously executed. The above processing sequence is performed when a plurality of “note on” events are to generate sound at simultaneous timings, as for example happens in chords and the like.
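The flow of steps S1001 to S1013 can be summarized in a minimal Python sketch. It is an illustration under stated assumptions, not the embodiment's firmware: each track is modeled as a list of (DeltaTime, Event) pairs, executing an accompaniment event is modeled as appending it to a `played` list, and the end-of-track bounds checks are additions that the text does not describe. The names `PlayerState` and `on_tick` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PlayerState:
    track1: list  # [(DeltaTime_1[i], Event_1[i]), ...] lyric part
    track2: list  # [(DeltaTime_2[i], Event_2[i]), ...] accompaniment part
    song_start: int = 1  # SongStart
    bansou: int = 1      # Bansou: 1 = has accompaniment
    delta_t_1: int = 0
    delta_t_2: int = 0
    auto_index_1: int = 0
    auto_index_2: int = 0
    song_index: object = None  # SongIndex; None stands for the null value
    played: list = field(default_factory=list)


def on_tick(s: PlayerState) -> None:
    """One timer interrupt, i.e. one TickTime of progress."""
    if s.song_start != 1:  # S1001: no instruction to advance
        return
    # First track chunk (lyric part).
    if s.auto_index_1 < len(s.track1):  # bounds guard, not in the text
        if s.delta_t_1 != s.track1[s.auto_index_1][0]:  # S1002 is NO
            s.delta_t_1 += 1                            # S1003
        else:
            s.song_index = s.auto_index_1               # S1004
            s.auto_index_1 += 1                         # S1005
            s.delta_t_1 = 0                             # S1006
    # Second track chunk (accompaniment part).
    if s.auto_index_2 >= len(s.track2):  # bounds guard, not in the text
        return
    if s.delta_t_2 != s.track2[s.auto_index_2][0]:      # S1007 is NO
        s.delta_t_2 += 1                                # S1008
        return
    while True:
        if s.bansou == 1:                               # S1009
            s.played.append(s.track2[s.auto_index_2][1])  # S1010
        s.auto_index_2 += 1                             # S1011
        s.delta_t_2 = 0                                 # S1012
        # S1013: continue only if the next event has a wait time of 0.
        if (s.auto_index_2 >= len(s.track2)
                or s.track2[s.auto_index_2][0] != 0):
            return


state = PlayerState(
    track1=[(0, "la"), (2, "li")],
    track2=[(0, "C on"), (0, "E on"), (2, "C off")],
)
for _ in range(4):
    on_tick(state)
```

In this run the two simultaneous "note on" events are handled within the first interrupt by the loop over steps S1009 to S1013. When `bansou` is 0, the indices still advance so that the accompaniment stays in step with the lyrics, but no event is executed.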
First, at step S1004 in the automatic-performance interrupt processing of
If the determination of step S1101 is YES, that is, if the present time is a song playback timing, the CPU 201 then determines whether or not a new user key press on the keyboard 101 in
If the determination of step S1102 is YES, the CPU 201 sets the pitch specified by a user key press to a non-illustrated register, or to a variable in the RAM 203, as a vocalization pitch (step S1103).
Next, the CPU 201 generates “note on” data for producing musical sound in the designated sound generation channel(s) having the tone color set previously at step S909 in
Then, the CPU 201 reads the lyric string from the song event Event_1[SongIndex] in the first track chunk of the musical piece data in the RAM 203 indicated by the SongIndex variable in the RAM 203. The CPU 201 generates singing voice data 215 for vocalizing, at the vocalization pitch set to the pitch based on a key press that was set at step S1103, output data 321 corresponding to the lyric string that was read, and instructs the voice synthesis LSI 205 to perform vocalization processing (step S1105). The voice synthesis LSI 205 implements the first embodiment or the second embodiment of statistical voice synthesis processing described with reference to
As a result, instrument sound waveform data 220 generated and output by the sound source LSI 204 based on the playing of a user on the keyboard 101 (
If at step S1101 it is determined that the present time is a song playback timing and the determination of step S1102 is NO, that is, if it is determined that no new key press is detected at the present time, the CPU 201 reads the data for a pitch from the song event Event_1[SongIndex] in the first track chunk of the musical piece data in the RAM 203 indicated by the SongIndex variable in the RAM 203, and sets this pitch to a non-illustrated register, or to a variable in the RAM 203, as a vocalization pitch (step S1104).
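The choice of vocalization pitch in steps S1102 to S1104 can be sketched as follows. The representation of a song event as a dictionary with `lyric` and `pitch` keys, the use of MIDI note numbers for pitches, and the function name `choose_vocalization_pitch` are all assumptions made for this illustration.

```python
from typing import Optional


def choose_vocalization_pitch(song_event: dict,
                              pressed_key_pitch: Optional[int] = None) -> int:
    """Pick the pitch at which the current lyric is vocalized."""
    if pressed_key_pitch is not None:  # S1102 is YES: a new key press exists
        return pressed_key_pitch       # S1103: the user's pitch takes priority
    return song_event["pitch"]         # S1104: fall back to the song data


event = {"lyric": "la", "pitch": 60}
pitch_with_key = choose_vocalization_pitch(event, pressed_key_pitch=64)
pitch_without_key = choose_vocalization_pitch(event)
```

The lyric itself comes from the song event in both cases; only the pitch source differs.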
Then, by performing the processing at step S1105 and subsequent steps, described above, the CPU 201 instructs the voice synthesis LSI 205 to perform vocalization processing of the output data 321 (step S1105, S1106). In implementing the first embodiment or the second embodiment of statistical voice synthesis processing described with reference to
After the processing of step S1105, the CPU 201 stores the song position at which playback was performed indicated by the SongIndex variable in the RAM 203 in a SongIndex_pre variable in the RAM 203 (step S1107).
Then, the CPU 201 clears the value of the SongIndex variable so as to become a null value and makes subsequent timings non-song playback timings (step S1108). The CPU 201 subsequently ends the song playback processing at step S705 in
If the determination of step S1101 is NO, that is, if the present time is not a song playback timing, the CPU 201 then determines whether or not what is referred to as a “legato playing style” for applying an effect has been detected on the keyboard 101 in
If the determination of step S1109 is NO, the CPU 201 ends the song playback processing at step S705 in
If the determination of step S1109 is YES, the CPU 201 calculates the difference in pitch between the vocalization pitch set at step S1103 and the pitch of the key on the keyboard 101 in
Then, the CPU 201 sets the effect size in the acoustic effect application section 320 (
The processing of step S1110 and step S1111 enables an acoustic effect such as a vibrato effect, a tremolo effect, or a wah effect to be applied to output data 321 output from the voice synthesis section 302, and a variety of singing voice expressions are implemented thereby.
After the processing at step S1111, the CPU 201 ends the song playback processing at step S705 in
In the first embodiment of statistical voice synthesis processing employing HMM acoustic models described with reference to
In the second embodiment of statistical voice synthesis processing employing a DNN acoustic model described with reference to
In the embodiments described above, statistical voice synthesis processing techniques are employed as voice synthesis methods, and these can be implemented with markedly less memory capacity than conventional concatenative synthesis. For example, in an electronic musical instrument that uses concatenative synthesis, memory having several hundred megabytes of storage capacity is needed for voice sound fragment data. However, the present embodiments get by with memory having just a few megabytes of storage capacity in order to store training result 315 model parameters in
Moreover, in a conventional fragmentary data method, it takes a great deal of time (years) and effort to produce data for singing voice performances since fragmentary data needs to be adjusted by hand. However, because almost no data adjustment is necessary to produce training result 315 model parameters for the HMM acoustic models or the DNN acoustic model of the present embodiments, performance data can be produced with only a fraction of the time and effort. This also makes it possible to provide a lower cost electronic musical instrument. Further, using a server computer 300 available for use as a cloud service, or training functionality built into the voice synthesis LSI 205, general users can train the electronic musical instrument using their own voice, the voice of a family member, the voice of a famous person, or another voice, and have the electronic musical instrument give a singing voice performance using this voice for a model voice. In this case too, singing voice performances that are markedly more natural and have higher quality sound than hitherto are able to be realized with a lower cost electronic musical instrument.
In particular, because instrument sound waveform data 220 for instrument sounds generated by the sound source LSI 204 is used as a sound source signal in the present embodiment, the essence of instrument sounds set in the sound source LSI 204 as well as the vocal characteristics of the singing voice of the singer come through clearly, allowing effective inferred singing voice data 217 to be output. An effect in which a plurality of singing voices seem to be in harmony can also be achieved owing to polyphonic operation being possible. It is thus possible to provide an electronic musical instrument that sings well in a singing voice corresponding to the singing voice of a singer that has been learned on the basis of pitches specified by a user.
In the embodiments described above, the present invention is embodied as an electronic keyboard instrument. However, the present invention can also be applied to electronic string instruments and other electronic musical instruments.
Voice synthesis methods able to be employed for the vocalization model unit 308 in
In the embodiments described above, a first embodiment of statistical voice synthesis processing in which HMM acoustic models are employed and a second embodiment of a voice synthesis method in which a DNN acoustic model is employed were described. However, the present invention is not limited thereto. Any voice synthesis method using statistical voice synthesis processing may be employed by the present invention, such as, for example, an acoustic model that combines HMMs and a DNN.
In the embodiments described above, lyric information is given as musical piece data. However, text data obtained by voice recognition performed on content being sung in real time by a user may be given as lyric information in real time. The present invention is not limited to the embodiments described above, and various changes in implementation are possible without departing from the spirit of the present invention. Insofar as possible, the functionalities performed in the embodiments described above may be implemented in any suitable combination. Moreover, there are many aspects to the embodiments described above, and the invention may take on a variety of forms through the appropriate combination of the disclosed plurality of constituent elements. For example, if after omitting several constituent elements from out of all constituent elements disclosed in the embodiments the advantageous effect is still obtained, the configuration from which these constituent elements have been omitted may be considered to be one form of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention cover modifications and variations that come within the scope of the appended claims and their equivalents. In particular, it is explicitly contemplated that any part or whole of any two or more of the embodiments and their modifications described above can be combined and regarded within the scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
2018-118056 | Jun 2018 | JP | national |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 17036582 | Sep 2020 | US |
Child | 18077151 | | US |
Parent | 16447586 | Jun 2019 | US |
Child | 17036582 | | US |