The present invention relates to an information processing device that outputs synthesized voice such as a singing voice, to an electronic musical instrument, and to an information processing method.
A conventional technique is known for synthesizing high-quality singing voice sounds corresponding to lyrics data, in which, based on stored lyrics data, the corresponding parameters and tone combination parameters are read from a phoneme database, the corresponding voice is synthesized and output by a formant synthesis sound source unit, and unvoiced consonants are produced by a PCM sound source (see, for example, Japanese Patent No. 3233306).
The human singing range is generally about two octaves. Therefore, when the above-mentioned conventional technique is applied to an electronic keyboard having 61 keys and an attempt is made to assign the singing voice of a single person to all the keys, there will be a range that cannot be covered by that one singing voice.
On the other hand, even if an attempt is made to cover the entire keyboard with a plurality of singing voices, an unnatural sense of strangeness occurs at the point where the character of the singing voice switches.
Therefore, it is an object of the present invention to enable the generation of voice data suitable for such a wide range.
In one aspect of the present invention, an information processing device detects a designated sound pitch, and, based on a first data of a first voice model and a second data of a second voice model different from the first voice model, generates a third data corresponding to the detected designated pitch.
According to the present invention, it is possible to generate voice data suitable for a wide range of pitches.
Additional or separate features and advantages of the invention will be set forth in the descriptions that follow and in part will be apparent from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
To achieve these and other advantages and in accordance with the purpose of the present invention, as embodied and broadly described, in one aspect, the present disclosure provides an information processing device for voice synthesis, comprising: at least one processor, implementing a first voice model and a second voice model different from the first voice model, the at least one processor performing the following: receiving data indicating a specified pitch; and causing the first voice model to output a first data and the second voice model to output a second data, and generating and outputting a third data corresponding to the specified pitch based on the first data and second data.
In another aspect, the present disclosure provides an electronic musical instrument, comprising: a performance unit for specifying a pitch; and the above-described information processing device including the at least one processor, the at least one processor receiving the data indicating the specified pitch from the performance unit.
In another aspect, the present disclosure provides an electronic musical instrument, comprising: a performance unit for specifying a pitch; a processor; and a communication interface configured to communicate with an information processing device that is externally provided, the information processing device implementing a first voice model and a second voice model different from the first voice model, wherein the processor causes the communication interface to transmit data indicating the pitch specified by the performance unit to the information processing device and receive from the information processing device data generated in accordance with the first voice model and the second voice model that corresponds to the specified pitch, and wherein the processor synthesizes a singing voice based on the data received from the information processing device and causes the synthesized singing voice to be output.
In another aspect, the present disclosure provides a method performed by at least one processor in an information processing device, the at least one processor implementing a first voice model and a second voice model different from the first voice model, the method comprising, via the at least one processor: receiving data indicating a specified pitch; and causing the first voice model to output a first data and the second voice model to output a second data, and generating and outputting a third data corresponding to the specified pitch based on the first data and second data.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory, and are intended to provide further explanation of the invention as claimed.
Hereinafter, embodiments for carrying out the present invention will be described in detail with reference to the drawings. First, the first embodiment will be described.
The range of the human singing voice, which is an example of voice, is generally about two octaves. On the other hand, when an information processing device attempts to provide a singing voice synthesis function, for example, there is a possibility that the designated sound range exceeds the human singing range and extends to, for example, about five octaves.
Therefore, in the first embodiment, for example, as shown in
Further, in the first embodiment, as shown in
First, the processor detects a specified pitch (step S201). When the information processing apparatus/device is implemented as, for example, an electronic musical instrument, the electronic musical instrument includes, for example, a performance unit 210. Then, for example, the processor detects the specified pitch based on the pitch designating data 211 detected by the performance unit 210.
Here, the information processing device includes, for example, a voice model 220 which is a database system. Then, the processor reads out the first voice data (first data) 223 of the first voice model 221 and the second voice data (second data) 224 of the second voice model 222 from, for example, the database system of the voice model 220. Then, the processor generates morphing data (third data) based on the first voice data 223 and the second voice data 224 (step S202). More specifically, when the voice model 220 is a human singing voice model, the processor generates the morphing data by the interpolation calculation between the formant frequencies of the first singing voice data corresponding to the first voice data 223 and the formant frequencies of the second singing voice data corresponding to the second voice data 224.
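As a non-limiting illustration of the interpolation described above, the following sketch shows one way in which morphing data could be computed from two sets of formant frequencies. The function name, the fixed interpolation ratio of 0.5, and the example frequency values are assumptions made for illustration only and do not represent the actual implementation of the embodiment.

```python
import numpy as np

def generate_morphing_data(first_formants, second_formants, ratio=0.5):
    """Interpolate two sets of formant frequencies (illustrative sketch).

    first_formants, second_formants: formant frequencies in Hz taken from the
    first voice data 223 and the second voice data 224 (assumed representation).
    ratio: 0.0 yields the first voice, 1.0 yields the second voice.
    """
    f1 = np.asarray(first_formants, dtype=float)
    f2 = np.asarray(second_formants, dtype=float)
    return (1.0 - ratio) * f1 + ratio * f2

# Example with hypothetical first three formants of a lower and a higher voice:
morph = generate_morphing_data([500.0, 1500.0, 2500.0], [800.0, 1800.0, 2800.0])
# morph -> array([ 650., 1650., 2650.])
```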
Here, for example, the first voice model 221 stored as the voice model 220 may include a trained (acoustic) model that has learned a first voice (for example, the singing voice of a first actual singer), and the second voice model 222 stored as the voice model 220 may include a trained (acoustic) model that has learned a second voice (for example, the singing voice of a second actual singer).
The processor outputs voice synthesized based on the morphing data generated in step S202 (step S203).
Here, for example, the pitch detected in step S201, as described above, can be in a non-overlapping range between the first range corresponding to the first voice model 221 and the second range corresponding to the second voice model 222. The morphing data of step S202 may thus be generated when there is no voice model corresponding to the range of a designated song. However, even if the first range and the second range overlap with each other, the present invention may be applied so as to generate morphing data based on the voice data of the respective plural voice models.
In the voice generation process of the first embodiment described above, if the music belongs to, for example, the bass side range 1, the voice is synthesized and output based on the first voice data 223 of the first voice model 221 corresponding to that range.
Further, if the music belongs to, for example, the treble side range 2, the voice is synthesized and output based on the second voice data 224 of the second voice model 222 corresponding to that range.
On the other hand, if the music belongs to, for example, the range 3 that is between the ranges 1 and 2, the voice is synthesized and output based on the morphing data (third data) generated in step S202 from the first voice data 223 and the second voice data 224.
As a result of the above processing, it is possible to output, for example, a singing voice having an optimum range that matches the key range of the music.
Next, a second embodiment will be described. In the second embodiment, singing voice models (acoustic models) that model human singing voices are used as models for the voice model 220 of
The CPU 401 executes control operations of the electronic keyboard instrument 300 shown in
The timer 410 may be mounted in the CPU 401, and counts, for example, the progress of automatic playback of performance guide data in the electronic keyboard instrument 300.
The sound source LSI 404 reads, for example, musical sound waveform data from a waveform ROM (not shown), and outputs the data to the D/A converter 411 in accordance with the sound generation control instruction from the CPU 401. The sound source LSI 404 has the ability to produce up to 256 voices at the same time.
When the voice synthesis LSI 405 is given the lyrics information, which is the text data of the lyrics, and the pitch information about the pitch, as the singing voice data 415 from the CPU 401, the voice synthesis LSI 405 synthesizes the singing voice output data 417, which is the voice data of the corresponding singing voice, and outputs the data 417 to the D/A converter 412.
The key scanner 406 constantly scans the key press/release state of the keyboard 301 of
The LED controller 407 is an IC (integrated circuit) that controls the display state of the LED 304 in each key of the keyboard 301 of
The voice synthesis process 500 synthesizes and outputs the singing voice output data 417 based on the singing voice data 415 that includes the lyrics information, the pitch information, and the range information instructed from the CPU 401 of
The details of the basic operation of the voice synthesis process 500 are disclosed in the above patent document, but operations of the voice synthesis process 500 of this embodiment that include operations unique to the second embodiment will be described below.
The voice synthesis process 500 includes a text analysis process 502, an acoustic model process 501, a vocalization model process 503, and a formant interpolation process 506. Among these, the formant interpolation process 506 is a part unique to the second embodiment.
In the second embodiment, the voice synthesis process 500 performs statistical voice synthesis in which the singing voice output data 417 corresponding to the singing voice data 415, which includes the lyrics (the text of the lyrics), the pitch, and the sound range, is generated by inference using an acoustic model set in the acoustic model unit 501, which is a statistical model.
The text analysis process 502 receives the singing voice data 415 including information on lyrics, pitch, range, etc., designated by the CPU 401 of
Further, the text analysis unit 502 generates range information 509 indicating the sound range in the singing voice data 415 and gives it to the formant interpolation processing unit 506. If the range indicated by the range information 509 is within the range of the first range, which is the current or default range, the formant interpolation processing unit 506 requests the acoustic model unit 501 to provide the spectrum information 510 of the first range (hereinafter, “first range spectrum information 510”).
The first range spectrum information 510 may be referred to as first spectrum information, first spectrum data, first voice data, first data, and the like.
On the other hand, if the range indicated by the range information 509 is not within the first range, which is the current range, but is within another preset second range, the formant interpolation processing unit 506 changes the value of the range setting variable to indicate the second range and requests the acoustic model unit 501 to provide the second range spectrum information 511.
On the other hand, if the range indicated by the range information 509 is not included in the first or second range, but is in a range between the first range and the second range, the formant interpolation processing unit requests the acoustic model unit 501 to provide both the first range spectrum information 510 and the second range spectrum information 511.
The second range spectrum information 511 may be referred to as second spectrum information, second spectrum data, second voice data, second data, or the like.
The acoustic model unit 501 receives the above-mentioned linguistic feature sequence 507 and the pitch information 508 from the text analysis unit 502, and also receives the above-mentioned request specifying the above-mentioned range(s) from the formant interpolation processing unit 506.
As a result, the acoustic model unit 501 uses an acoustic model(s) that has been set as a trained result by machine learning, for example, infers the first range spectrum and/or the second range spectrum that corresponds to the phoneme that maximizes the generation probability, and provides them to the formant interpolation processing unit 506 as the first range spectrum information 510 and/or the second range spectrum information 511.
Further, the acoustic model unit 501 infers a sound source corresponding to the phoneme that maximizes the generation probability by using the acoustic model, and provides it as the target sound source information 512 to the sound source generation unit 504 in the vocalization model unit 503.
The formant interpolation processing unit 506 provides the first range spectrum information 510 or the second range spectrum information 511, or spectrum information obtained by interpolating the first range spectrum information 510 and the second range spectrum information 511 (hereinafter referred to as "interpolated spectrum information"), to the synthesis filter unit 505 in the vocalization model unit 503 as the target spectrum information 513.
The target spectrum information 513 may be referred to as morphing data, third data, or the like when it represents interpolated spectrum information.
The vocalization model unit 503 receives the target sound source information 512 output from the acoustic model unit 501 and the target spectrum information 513 output from the formant interpolation processing unit 506, and generates the singing voice output data 417 corresponding to the singing voice data 415. The singing voice output data 417 is output from the D/A converter 412 of
The acoustic features output by the acoustic model unit 501 include spectrum information modeling the human vocal tract and sound source information modeling the human vocal cords. As parameters of the spectrum information, for example, line spectral pairs (LSP), which can efficiently model the plurality of formant frequencies that characterize the human vocal tract, line spectral frequencies (LSF), or mel LSP, which is an improvement of these models (hereinafter collectively referred to as "LSP"), or the like can be adopted. Therefore, the first and second range spectrum information 510/511 output from the acoustic model unit 501, and the target spectrum information 513 output from the formant interpolation processing unit 506, may be frequency parameters based on the above-mentioned LSP, for example.
Cepstrum or mel cepstrum may be adopted as another example of the parameters of the spectrum information.
As the sound source information, the fundamental frequency (F0) indicating the pitch frequency of human voice and its power values (in the case of voiced phonemes) or the power value of white noise (in the case of unvoiced phonemes) can be adopted. Therefore, the target sound source information 512 output from the acoustic model unit 501 can be the parameters of F0 and the power values as described above.
The vocalization model unit 503 includes the sound source generation unit 504 and the synthesis filter unit 505. The sound source generation unit 504 is a portion that models the human vocal cords, and by sequentially receiving a series of target sound source information 512 input from the acoustic model unit 501, generates the sound source data 514 constituted by, for example, a pulse train periodically repeating at the fundamental frequency (F0) with the power values included in the target sound source information 512 (in the case of a voiced phoneme), white noise having the power values included in the target sound source information 512 (in the case of an unvoiced phoneme), or a mixed signal thereof.
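As a non-limiting sketch of the kind of processing performed by the sound source generation unit 504, the following example generates one frame of excitation from F0 and power parameters. The frame length, the scaling of the pulse amplitude, and the function name are assumptions for illustration only.

```python
import numpy as np

def generate_excitation_frame(f0_hz, power, voiced, fs=16000, frame_len=80):
    """One frame of excitation: pulse train (voiced) or white noise (unvoiced)."""
    if voiced and f0_hz > 0:
        period = max(1, int(round(fs / f0_hz)))  # spacing between pulses in samples
        frame = np.zeros(frame_len)
        # Pulse amplitude chosen so the frame's mean-square value is roughly `power`.
        frame[::period] = np.sqrt(power * period)
        return frame
    # Unvoiced phoneme: white noise whose mean-square value is roughly `power`.
    return np.sqrt(power) * np.random.randn(frame_len)

# Example: a voiced frame at F0 = 200 Hz, then an unvoiced frame.
voiced_frame = generate_excitation_frame(200.0, 0.1, voiced=True)
unvoiced_frame = generate_excitation_frame(0.0, 0.01, voiced=False)
```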
The synthesis filter unit 505 is a part that models the human vocal tract, and constructs an LSP digital filter that models the vocal tract based on the LSP frequency parameters included in the target spectrum information 513 that is sequentially input from the acoustic model unit 501 via the formant interpolation processing unit 506. When the digital filter is excited using the sound source input data 514 input from the sound source generation unit 504 as an excitation source signal, the filter output data 515, which is a digital signal, is output from the synthesis filter unit 505. This filter output data 515 is converted into an analog singing voice output signal by the D/A converter 412 of
The sampling frequency for the singing voice output data 417 is, for example, 16 kHz (kilohertz). Further, when, for example, the LSF parameters obtained by LSP analysis processing are adopted as the parameters of the first sound range spectrum information 510, the second sound range spectrum information 511, and the target spectrum information 513, the update frame period is, for example, 5 milliseconds. The analysis window length is, for example, 25 milliseconds, the window function is, for example, a Blackman window, and the analysis order is, for example, 10th order.
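A non-limiting sketch of frame-by-frame synthesis filtering using the parameters listed above (16 kHz sampling, 5 millisecond frames, 10th-order analysis) is shown below. The conversion from LSP parameters to all-pole filter coefficients is assumed to be performed elsewhere and is not shown; the function name and data layout are assumptions for illustration.

```python
import numpy as np
from scipy.signal import lfilter

FS = 16000                # sampling frequency (16 kHz)
FRAME = FS * 5 // 1000    # 5 ms update frame -> 80 samples
ORDER = 10                # analysis order

def synthesize(frames):
    """Frame-by-frame all-pole synthesis (illustrative sketch).

    `frames` is an iterable of (lpc, excitation) pairs for successive 5 ms
    frames, where `lpc` is the all-pole coefficient vector [1, a1, ..., a10],
    assumed to have been converted from the LSP parameters of the target
    spectrum information 513, and `excitation` is that frame's sound source
    input data (length FRAME).
    """
    zi = np.zeros(ORDER)   # filter state carried over between frames
    out = []
    for lpc, excitation in frames:
        y, zi = lfilter([1.0], lpc, excitation, zi=zi)
        out.append(y)
    return np.concatenate(out)
```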
An outline of the overall operation of the second embodiment under the configurations of
At that time, the CPU 401 indicates keys to be played on the keyboard 301 corresponding to the pitch information to be automatically played so as to provide the user with guidance for music practice (performance practice), that is, the user's practice of pressing appropriate keys in synchronization with the automatic playback. More specifically, in this performance guide process, in synchronization with the timing of the automatic playback, the CPU 401 causes the LED 304 of the key to be played next to light with strong brightness, for example, maximum brightness, and causes the LED 304 of the key to be played after the maximally illuminated key to light with weak brightness, for example, half of the maximum brightness.
Next, the CPU 401 acquires performance information, which is information related to performance operations in which the performer presses or releases the key(s) on the keyboard 301 of
Subsequently, if the key press timing (performance timing) on the keyboard 301 and the pitch of the key pressed (performance pitch) by the user taking the performance lesson correctly correspond to the timing information and the pitch information, respectively, that are automatically played back, the CPU 401 causes the lyrics information and pitch information that are to be automatically played back to be input into the text analysis unit 502 of
The singing voice data 415 may contain at least one of lyrics (text data), syllable types (start syllable, middle syllable, end syllable, etc.), lyrics index, corresponding voice pitch (correct voice pitch), and corresponding voicing period (for example, voice start timing, voice end timing, voice duration) (correct voicing period).
For example, as illustrated in
The singing voice data 415 may include information (data in a specific audio file format, MIDI data, etc.) for playing the accompaniment (song data) corresponding to the lyrics. When the singing voice data is presented in the SMF format, the singing voice data 415 may include a track chunk in which data related to singing voice is stored and a track chunk in which data related to accompaniment is stored. The singing voice data 415 may be read from the ROM 402 to the RAM 403. The singing voice data 415 has been stored in a memory (for example, ROM 402, RAM 403) before the performance.
The electronic keyboard instrument 300 may control the progression of automatic accompaniment based on events indicated by the singing voice data 415 (for example, a meta event (timing information) indicating the sound generation timing and pitch of lyrics, a MIDI event indicating note-on or note-off, or a meta event indicating the time signature).
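As a non-limiting illustration of one possible in-memory representation of the singing voice data 415 described above, the following sketch defines simple record types. The field names, units, and the 1-based indexing convention are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SingingVoiceEvent:
    lyric: str          # lyrics text for this event (for example, one syllable)
    syllable_type: str  # "start", "middle", "end", etc.
    pitch: int          # correct voice pitch (for example, a MIDI note number)
    onset: int          # voice start timing (ticks)
    duration: int       # voice duration (ticks)

@dataclass
class SingingVoiceData:
    key_area: int       # 1, 2, or 3: sound range of the entire music (meta event)
    events: List[SingingVoiceEvent] = field(default_factory=list)

    def event_at(self, n: int) -> SingingVoiceEvent:
        """Return the n-th lyric event (n is the 1-based lyrics index)."""
        return self.events[n - 1]
```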
Here, in the acoustic model unit 501, for example, an acoustic model(s) of a singing voice is formed as a learning result by machine learning. But as described above in the first embodiment, the human singing voice range is generally about two octaves. On the other hand, the 61 keys of the keyboard 301 cover, for example, a range of about five octaves.
Therefore, in the second embodiment, of the 61 keys in the keyboard 301, an acoustic model (voice model) that is formed as a result of machine-learning a male singing voice having a low pitch, for example, is assigned to the key area 1 for two octaves on the bass side of the 61-key keyboard 301, and another acoustic model (voice model) that is formed as a result of machine-learning a female singing voice having a high pitch, for example, is assigned to the key area 2 for two octaves on the treble side.
Further, in the second embodiment, of the 61-key keyboard 301, an intermediate singing voice between the male and female voices, morphed from the first range singing voice of the key area 1 and the second range singing voice of the key area 2, is assigned to the key area 3 of the central two octaves.
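The assignment of keys to the key areas 1, 2, and 3 may be sketched, purely as a non-limiting illustration, by the following classification of MIDI note numbers. The boundary values assume a 61-key keyboard spanning MIDI notes 36 to 96; the exact boundaries of the key areas are not specified here and are illustrative assumptions.

```python
LOWEST_KEY = 36    # assumed MIDI note number of the lowest key (61-key keyboard)
HIGHEST_KEY = 96   # assumed MIDI note number of the highest key

def key_area_of(note: int) -> int:
    """Classify a MIDI note into key area 1 (bass), 2 (treble), or 3 (middle)."""
    if note <= LOWEST_KEY + 24:    # roughly two octaves on the bass side
        return 1                   # key area 1: male voice model
    if note >= HIGHEST_KEY - 24:   # roughly two octaves on the treble side
        return 2                   # key area 2: female voice model
    return 3                       # key area 3: morphed intermediate voice

print(key_area_of(40), key_area_of(65), key_area_of(90))  # -> 1 3 2
```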
Here, the singing voice data 415 loaded in advance from the ROM 402 to the RAM 403 may include, as the first meta event, for example, key area data that indicates which one of the key areas 1, 2, and 3 the sound range of the entire music belongs to.
At the start of singing voice synthesis, the formant interpolation processing unit 506 determines which range of the key ranges 1, 2, and 3 exemplified in
As a result, since the start of singing voice synthesis, the acoustic model unit 501 uses the acoustic model of the corresponding first or second range requested by the formant interpolation processing unit 506, and infers the spectrum corresponding to the phoneme that maximizes the generation probability with respect to the linguistic feature sequence 507 and the pitch information 508 received from the text analysis unit 502. The inferred spectrum is then given to the formant interpolation processing unit 506 as the first/second range spectrum information 510/511.
Therefore, in the above-mentioned control operation, if the music belongs to the key area 1 on the bass side of the keyboard 301, the formant interpolation processing unit 506 requests only the first range spectrum information 510, and the spectrum inferred from the acoustic model of the masculine singing voice assigned to the key area 1 is provided, as it is, as the target spectrum information 513.
On the other hand, if the music as a whole belongs to, for example, the key area 2 on the treble side, the formant interpolation processing unit 506 requests only the second range spectrum information 511, and the spectrum inferred from the acoustic model of the feminine singing voice assigned to the key area 2 is provided, as it is, as the target spectrum information 513.
On the other hand, if the music as a whole belongs to, for example, the key area 3 in the middle, the formant interpolation processing unit 506 requests the acoustic model unit 501 for both the first range spectrum information 510 and the second range spectrum information 511.
In response, the acoustic model unit 501 outputs two pieces of spectrum information: the first range spectrum information 510 and the second range spectrum information 511. The first range spectrum information 510 corresponds to the spectrum inferred from the acoustic model of the masculine singing voice, which is pre-assigned to the key area 1, and the second range spectrum information 511 corresponds to the spectrum inferred from the acoustic model of the feminine singing voice, which is pre-assigned to the key area 2. The key areas 1 and 2 are arranged on both sides of the key area 3. The formant interpolation processing unit 506 then calculates the interpolated spectrum information by interpolation processing between the first range spectrum information 510 and the second range spectrum information 511, and outputs the interpolated spectrum information as the morphed target spectrum information 513 to the synthesis filter unit 505 in the vocalization model unit 503.
This target spectrum information 513 may be referred to as morphing data (third voice data), third spectrum information, or the like.
As a result of the above processing, the synthesis filter unit 505 can output the filter output data 515 as the singing voice output data 417, synthesized according to the target spectrum information 513, which is based on acoustic models obtained by machine-learning singing voices, for the key area that best matches the sound range of the music as a whole.
The line 601 shows an example of the vocal tract spectral characteristics of the masculine singing voice assigned to the key area 1.
If the music as a whole belongs to, for example, the key area 1 on the bass side of the keyboard 301, the singing voice output data 417 is synthesized with vocal tract spectral characteristics like the characteristics 601 of the masculine singing voice.
The line 602 shows an example of the vocal tract spectral characteristics of the feminine singing voice assigned to the key area 2.
If the music as a whole belongs to, for example, the key area 2 on the treble side of the keyboard 301, the singing voice output data 417 is synthesized with vocal tract spectral characteristics like the characteristics 602 of the feminine singing voice.
As can be seen by comparing the vocal tract spectral characteristics 601 and 602, the formant frequencies of the feminine singing voice are generally higher than those of the masculine singing voice, as reported in the following reference:
Kasuya et al., "Changes in Pitch and First Three Formant Frequencies of Five Japanese Vowels with Age and Sex of Speakers," Journal of the Acoustic Society 24, 6 (1968).
Here, for the sake of clarity, the vocal tract spectral characteristics 601 in
The line 603 shows an example of the vocal tract spectral characteristics of the intermediate singing voice between the masculine and feminine singing voices assigned to the key area 3.
This shows that the vocal tract spectrum characteristic 603 of the singing voice between men and women in the key area 3 can be calculated by a frequency range interpolation processing from the vocal tract spectrum characteristic 601 of the masculine voice in the key area 1 and the vocal tract spectrum characteristic 602 of the feminine voice in the key area 2.
Because the above-mentioned LSP parameters have a frequency dimension, interpolation in frequency range is known to work well. Therefore, in the second embodiment, when the music as a whole belongs to the key area 3 in the middle of
Then, the formant interpolation processing unit 506 calculates the LSP parameters L3[i] of the key area 3, which constitute the interpolated spectrum information, by executing the interpolation process shown in the following equation (1) between the LSP parameters L1[i] of the first range spectrum information 510 and the LSP parameters L2[i] of the second range spectrum information 511. Here, N is the LSP analysis order.
L3[i] = (L1[i] + L2[i]) / 2  (1 ≤ i ≤ N)    (1)
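A non-limiting sketch of the interpolation of equation (1) is shown below. A weighting other than the midpoint could also be applied if morphing ratios other than one half were desired, although equation (1) itself uses the simple average.

```python
import numpy as np

def interpolate_lsp(l1, l2, weight=0.5):
    """Equation (1) with an optional weight (0.5 reproduces equation (1) exactly).

    l1: LSP parameters L1[i] of the first range spectrum information 510
    l2: LSP parameters L2[i] of the second range spectrum information 511
    Returns L3[i] for 1 <= i <= N, where N is the common analysis order.
    """
    l1 = np.asarray(l1, dtype=float)
    l2 = np.asarray(l2, dtype=float)
    assert l1.shape == l2.shape, "both spectra must use the same analysis order N"
    return (1.0 - weight) * l1 + weight * l2  # weight = 0.5 -> (L1 + L2) / 2
```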
The formant interpolation processing unit 506 provides the LSP parameters L3[i] calculated in this way to the synthesis filter unit 505 in the vocalization model unit 503 as the target spectrum information 513.
As a result of the above processing, the synthesis filter unit 505 can output the filter output data 515, as the singing voice output data 417, synthesized according to the target spectrum information 513 having optimum vocal tract spectral characteristics that well match the sound range of the entire music.
The detailed operation of the second embodiment having the configuration of
First, the CPU 401 assigns the initial value “1” to the lyrics index variable n, which is a variable on the RAM 403 indicating the current position of the lyrics, and assigns, to a range setting variable, which is a variable on the RAM 403 indicating a currently set or default sound range, an initial value that indicates that the current sound range is the key area 1, for example, in
The lyrics index variable n may be a variable indicating the position of a syllable (or character(s)) as counted from the beginning when the entire lyrics are regarded as a character string. For example, the lyrics index variable n can indicate the singing voice data at the nth playback position of the singing voice data 415 shown in
Next, before the start of singing voice synthesis, the CPU 401 reads out key area data indicating which one of the key areas 1, 2, and 3 of
After that, the CPU 401 advances the singing voice synthesis process by repeatedly executing the series of processes from steps S703 to S710 while incrementing the value of the lyrics index variable n by +1 in step S707 until it determines that playing the singing voice data is completed (there is no singing voice data corresponding to the new value of the lyrics index variable n) in step S710.
In a series of iterative processes from steps S703 to S710, the CPU 401 first determines whether or not there is a new key pressed as a result of the key scanner 406 of
If the determination in step S703 is YES, the CPU 401 reads the singing voice data of the nth lyrics indicated by the value of the lyrics index variable n on the RAM 403 from the RAM 403 (step S704).
Next, the CPU 401 transmits the singing voice data 415 instructing the progress of the singing voice including the singing voice data read in step S704 to the voice synthesis LSI 405 (step S705).
Further, the CPU 401 transmits to the sound source LSI 404 sound generation instructions, as the sound generation control data 416, which specify the pitch corresponding to the key of the keyboard 301 pressed by the performer and detected by the key scanner 406, as well as the musical instrument sound previously designated by the performer on the switch panel 303 (step S706).
As a result, the sound source LSI 404 generates the music sound output data 418 corresponding to the sound generation control data 416. The music sound output data 418 is converted into an analog music sound output signal by the D/A converter 411. This analog music sound output signal is mixed with the analog singing voice output signal output from the voice synthesis LSI 405 via the D/A converter 412 by the mixer 413, and the mixed signal is amplified by the amplifier 414, and then output from a speaker or output terminal.
Note that the process of step S706 may be omitted. In this case, no musical tone is produced in response to the performer's key press operation, and the key press operation is used only to advance the singing voice synthesis.
Then, the CPU 401 increments the value of the lyrics index variable n by +1 (step S707).
After the process of step S707 or after the determination of step S703 becomes NO, the CPU 401 determines whether or not there is a new key release as a result of the key scanner 406 of
If the determination in step S708 is YES, the CPU 401 instructs the voice synthesis LSI 405 to mute the singing voice corresponding to the pitch of the key release detected by the key scanner 406, and the sound source LSI 404 to mute the musical sound corresponding to the pitch (step S709). As a result, the corresponding muting operations are executed in the voice synthesis LSI 405 and the sound source LSI 404.
After the processing of step S709, or when the determination in step S708 is NO, the CPU 401 determines whether there is no singing voice data corresponding to the value of the lyrics index variable n incremented in step S707 on the RAM 403, and the playback of the singing voice data should therefore end (step S710).
If the determination in step S710 is NO, the CPU 401 returns to the process of step S703 and proceeds with the process of singing voice synthesis.
When the determination in step S710 becomes YES, the CPU 401 ends the process of singing voice synthesis exemplified in the flowchart of
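The control flow of steps S701 to S710 described above may be summarized, as a non-limiting sketch only, by the following loop. The helper objects `keyboard`, `voice_synth`, and `sound_source` and their methods are hypothetical stand-ins for the key scanner 406, the voice synthesis LSI 405, and the sound source LSI 404.

```python
def run_performance_loop(keyboard, voice_synth, sound_source, singing_voice_data):
    """Illustrative sketch of the CPU main loop (steps S701 to S710)."""
    n = 1                                               # lyrics index (step S701)
    while True:
        key = keyboard.poll_new_key_press()             # step S703
        if key is not None:
            event = singing_voice_data.event_at(n)      # step S704
            voice_synth.advance_lyrics(event)           # step S705
            sound_source.note_on(key.pitch)             # step S706 (may be omitted)
            n += 1                                      # step S707
        released = keyboard.poll_new_key_release()      # step S708
        if released is not None:
            voice_synth.mute(released.pitch)            # step S709
            sound_source.note_off(released.pitch)
        if n > len(singing_voice_data.events):          # step S710
            break                                       # end of singing voice data
```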
The processor of the voice synthesis LSI 405 realizes the respective functions of various parts shown in
First, the text analysis unit 502 waits until the singing voice data 415 is received from the CPU 401 (step S801).
When the singing voice data 415 is received from the CPU 401 and the determination in step S801 is YES, the text analysis unit 502 determines whether or not the sound range is specified by the received singing voice data 415 (see step S702) (step S802).
If the determination in step S802 is YES, the range information 509 is passed from the text analysis unit 502 to the formant interpolation processing unit 506. After that, the formant interpolation processing unit 506 performs the subsequent steps.
The formant interpolation processing unit 506 executes the singing voice optimization processing (step S803). The details of this process will be described later using the flowchart of
After the singing voice data 415 is received again and the determination in step S801 is YES, if the determination in step S802 is NO in the text analysis unit 502, the received singing voice data 415 instructs the advancement of the lyrics (see step S705).
On the other hand, by the singing voice optimization processing of step S803 executed before the start of singing voice synthesis, the formant interpolation processing unit 506 has requested the acoustic model unit 501 for the first or second range spectrum information 510 or 511 or both first range spectrum information 510 and second range spectrum information 511.
Based on each of the above information pieces, the formant interpolation processing unit 506 acquires the respective LSP parameters of the spectrum information that it requested from the acoustic model unit 501 in step S903 or S908 (in the case of S908, the spectrum information requested first), and stores them in the RAM 403 (step S804).
Next, the formant interpolation processing unit 506 determines whether or not the value “1” is set in the interpolation flag stored in the RAM 403 in the singing voice optimization process (step S803), which will be described later, that is, whether or not the interpolation process should be executed (step S805).
If the determination in step S805 is NO (no interpolation processing should be executed), the formant interpolation processing unit 506 takes the LSP parameters of the spectrum information (510 or 511) that were acquired from the acoustic model unit 501 and stored in the RAM 403 in step S804, and sets them, as they are, in the array variables for the target spectrum information 513 on the RAM 403 (step S806).
If the determination in step S805 is YES (the interpolation processing should be executed), the formant interpolation processing unit 506 acquires the respective LSP parameters of the spectrum information that was secondly requested from the acoustic model unit 501 in step S908, and stores them in the RAM 403 (step S807).
Then, the formant interpolation processing unit 506 executes the formant interpolation processing (step S808). Specifically, the formant interpolation processing unit 506 calculates the LSP parameters L3[i] of the interpolated spectrum information from the LSP parameters L1[i] of the spectrum information stored in the RAM 403 in step S804 and the LSP parameters L2[i] of the spectrum information stored in the RAM 403 in step S807 by executing the interpolation processing operation of, for example, the above-mentioned equation (1), and stores them in the RAM 403.
After step S808, the formant interpolation processing unit 506 sets LSP parameters L3 [i] of the interpolated spectrum information that have been stored in the RAM 403 in step S808 to the array variables for the target spectrum information 513 on the RAM 403 (step S809).
After step S806 or S809, the target sound source information 512 output from the acoustic model unit 501 is provided to the sound source generation unit 504 of the vocalization model unit 503. At the same time, the formant interpolation processing unit 506 sets the respective LSP parameters of the target spectrum information 513 stored in the RAM 403 in step S806 or S809 to the LSP digital filter of the synthesis filter unit 505 in the vocalization model unit 503 (step S810). After that, the processing returns to the standby process of waiting for the singing voice data 415 in step S801, which is executed by the text analysis unit 502.
As a result of the above processing, the vocalization model unit 503 outputs the filter output data 515 as the singing voice output data 417 by exciting the LSP digital filter of the synthesis filter unit 505, in which the target spectrum information 513 has been set, with the sound source input data 514 from the sound source generation unit 504, in which the target sound source information 512 has been set.
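Steps S804 to S809 described above may be summarized, as a non-limiting sketch, by the following function. The accessor `acoustic_model.spectrum(r)`, which is assumed to return the LSP parameters inferred for a given range r, and the list-based representation of the target spectrum information are assumptions for illustration.

```python
def build_target_spectrum(acoustic_model, interpolation_flag, requested_ranges):
    """Form the target spectrum information 513 from the requested range(s)."""
    l1 = acoustic_model.spectrum(requested_ranges[0])     # step S804
    if not interpolation_flag:                            # step S805 -> NO
        return list(l1)                                   # step S806: use as-is
    l2 = acoustic_model.spectrum(requested_ranges[1])     # step S807
    l3 = [(a + b) / 2.0 for a, b in zip(l1, l2)]          # step S808: equation (1)
    return l3                                             # step S809: interpolated
```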
First, the formant interpolation processing unit 506 acquires the information of the range (key range) set in the range information 509 handed over from the text analysis unit 502 (step S901).
Next, the formant interpolation processing unit 506 determines whether the sound range of the entire music that is set in the singing voice data 415 acquired in step S901 (see the description of step S702) is within the default sound range (current sound range) set by the range setting variable stored in the RAM 403 (step S902).
Here, for example, the key area 1 is set in the range setting variable as the default sound range.
If the determination in step S902 is YES, the formant interpolation processing unit 506 requests the acoustic model unit 501 for the spectrum information corresponding to the sound range that is currently set in the range setting variable (step S903).
After that, the formant interpolation processing unit 506 sets the value "0", indicating that the interpolation processing will not be needed, in the interpolation flag variable on the RAM 403 (step S904). In this case, when this interpolation flag variable is referred to in step S805, the determination becomes NO and the interpolation processing is not executed.
If the range of the entire song set in the singing voice data 415 acquired in step S901 is not within the default/currently set range and the determination in step S902 is NO, the formant interpolation processing unit 506 determines whether or not the range of the entire song is within another preset range other than the default/current range (for example, the key area 2) (step S905).
If the determination in step S905 is YES, the formant interpolation processing unit 506 replaces the value of the range setting variable indicating the current/default range on the RAM 403 with the value indicating the another preset range (step S906).
Then, the formant interpolation processing unit 506 requests the acoustic model unit 501 for the spectrum information corresponding to the updated range that is set in the range setting variable (step S903), and sets the value of the interpolation flag variable on the RAM 403 to 0 (step S904). Thereafter, the formant interpolation processing unit 506 ends the singing voice optimization process of step S803 of
If the entire range of the music set in the singing voice data 415 acquired in step S901 is not within the range of the default/current range (determination of step S902 is NO), and also is not within the another preset range other than the default/current range (step S905 is also NO), then the formant interpolation processing unit 506 determines whether or not the range of the entire music is between the default/current range indicated by the current range setting variable and the another preset range (step S907).
If the determination in step S907 is YES, the formant interpolation processing unit 506 requests the acoustic model unit 501 for the spectrum information corresponding to the default/current range set in the range setting variable as well as the spectrum information corresponding to the another preset range determined in step S907 (step S908).
After that, the formant interpolation processing unit 506 sets the value "1", indicating that the interpolation processing should be executed, in the interpolation flag variable on the RAM 403 (step S909). When this interpolation flag variable is referred to in step S805, the determination becomes YES and the formant interpolation processing of step S808 is executed.
If the determination in step S907 is NO, the formant interpolation processing unit 506 cannot determine the range. At this time, the formant interpolation processing unit 506 maintains the currently set range, requests the acoustic model unit 501 for the spectrum information corresponding to the default/current range set in the range setting variable (step S903), and sets the value "0" in the interpolation flag variable on the RAM 403 (step S904). After that, the formant interpolation processing unit 506 ends the singing voice optimization process of step S803 of
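The decision logic of steps S902 to S909 may be summarized, purely as a non-limiting sketch, by the following function. The argument names, the boolean `is_between` (assumed to have been evaluated by the caller), and the tuple return value are assumptions for illustration.

```python
def singing_voice_optimization(song_range, current_range, other_range, is_between):
    """Return (ranges_to_request, interpolation_flag, updated_current_range)."""
    if song_range == current_range:                            # step S902 -> YES
        return [current_range], 0, current_range               # steps S903, S904
    if song_range == other_range:                              # step S905 -> YES
        return [other_range], 0, other_range                   # steps S906, S903, S904
    if is_between:                                             # step S907 -> YES
        return [current_range, other_range], 1, current_range  # steps S908, S909
    # step S907 -> NO: the range cannot be determined; keep the current range.
    return [current_range], 0, current_range                   # steps S903, S904
```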
In the second embodiment described above, the singing voice data 415 specifying the range was transmitted to the voice synthesis LSI 405 of
Further, in the singing voice optimization process exemplified in the flowchart of
Further, in the second embodiment described above, in the vocalization model unit 503, the sound source input data 514 that excites the synthesis filter unit 505 was generated by the sound source generation unit 504 of
In the second embodiment described above, the acoustic models set in the acoustic model unit 501 are obtained by machine-learning with training music score data that include training lyrics information, training pitch information, and training sound range information, and with training singing voice data of singers. Here, as the acoustic models, models using a general phoneme database may be adopted instead.
In the second embodiment described above, the voice synthesis LSI 405, which is the information processing device according to one aspect of the present invention and the voice synthesis unit 500, which is one of the functions of the information processing device, respectively shown in
As shown in
In the third embodiment having the configuration examples of
Instead of the BLE-MIDI communication interface 1105, a MIDI communication interface connected to the electronic keyboard instrument 1002 with a wired MIDI cable may be used.
In the third embodiment, the electronic keyboard instrument 1002 of
The CPU 1101 monitors whether or not the key pressing information and the key release information are received from the electronic keyboard instrument 1002 via the BLE-MIDI communication interface 1105.
When the CPU 1101 receives the key press information from the electronic keyboard instrument 1002, the CPU 1101 executes the same processing as in steps S703 and S704 of
Then, the CPU 1101 transmits the singing voice data 415 (see
On the other hand, when the CPU 1101 receives the key release information from the electronic keyboard instrument 1002, the CPU 1101 executes the same processing as a part of the processing in step S709 of
By repeating the control processing of steps S705 and S1203 of
Next, a fourth embodiment will be described.
In the second embodiment having the block configuration of
In the fourth embodiment, the electronic keyboard instrument 1302 and the tablet terminal or the like 1301 are connected by, for example, a USB cable 1203. In this case, the control system of the electronic keyboard instrument 1302 has a block configuration equivalent to that of the control system 400 of the electronic keyboard instrument 300 in the second embodiment illustrated in
If the data capacity allows, wireless communication interfaces such as Bluetooth (registered trademark of Bluetooth SIG, Inc. in the US) and Wi-Fi (registered trademark of Wi-Fi Alliance in the US) or the like may be used instead of the wired USB communication interface.
In
Each functional unit of the acoustic model unit 501, the text analysis unit 502, and the formant interpolation processing unit 506 in the voice synthesis unit 1501 of
Specifically, these processes are processes performed by the CPU 1401 of the tablet terminal or the like 1301 of
However, in step S705 of
Then, as shown in
As a result, the singing voice output data 417 is generated in the voice synthesis LSI 405 (
As described above, in the fourth embodiment, the function of the voice synthesis LSI 405 of the electronic keyboard instrument 1302 and the function of the singing voice synthesis of the tablet terminal or the like 1301 are combined to enable production of synthesized voice in synchronization with the performer's operation on the electronic keyboard instrument 1302.
Here, the acoustic model unit 501 including the trained models may be built in the information processing device side such as a tablet terminal 1301 or a server device, and certain data generation parts, such as the formant interpolation processing unit 506, may be installed in the instrument side, such as in the electronic keyboard instrument 1302. In this case, the information processing device transmits the first range spectrum information 510 and/or the second range spectrum information 511 to the electronic keyboard instrument 1302 for processing, such as formant interpolation processing.
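As a non-limiting sketch of such a split configuration, the following example shows one conceivable way of transferring one or two sets of LSP parameters from the information processing device side to the instrument side and of applying the interpolation of equation (1) on the instrument side. The JSON-based message format and the field names are assumptions for illustration and do not represent an actual protocol of the embodiments.

```python
import json

def pack_spectrum_message(first_lsp, second_lsp=None) -> bytes:
    """Information processing device side: serialize one or two LSP vectors."""
    message = {"first_range_lsp": list(first_lsp)}
    if second_lsp is not None:
        message["second_range_lsp"] = list(second_lsp)
    return json.dumps(message).encode("utf-8")

def unpack_and_interpolate(payload: bytes):
    """Instrument side: decode the message and, when two ranges are present,
    apply the interpolation of equation (1) to obtain the target spectrum."""
    message = json.loads(payload.decode("utf-8"))
    l1 = message["first_range_lsp"]
    l2 = message.get("second_range_lsp")
    if l2 is None:
        return l1
    return [(a + b) / 2.0 for a, b in zip(l1, l2)]
```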
Although the embodiments of the disclosure and their advantages have been described in detail above, those skilled in the art can make various changes, additions, and omissions without departing from the scope of the present invention as set forth in the claims.
In addition, the present invention is not limited to the above-described embodiments, and can be variously modified at the implementation stage without departing from the gist thereof. In addition, the functions executed in the above-described embodiments may be combined as appropriately as possible. The embodiments described above include various stages, and various inventions can be extracted by appropriately combining a plurality of the disclosed constituent requirements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiments, as long as the same or similar effect is obtained, the configuration from which those constituent elements are deleted can be extracted as an invention.
Further, it will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention cover modifications and variations that come within the scope of the appended claims and their equivalents. In particular, it is explicitly contemplated that any part or whole of any two or more of the embodiments and their modifications described above can be combined and regarded within the scope of the present invention.