The present invention relates to a singing synthesis technique for synthesizing singing voices (human voices) in accordance with score data representative of a musical score of a singing music piece.
Voice synthesis techniques, such as techniques for synthesizing singing voices and text-reading voices, are becoming more and more prevalent these days. Such voice synthesis techniques are broadly classified into one based on a voice segment connection scheme and one using voice models based on a statistical scheme. In the voice synthesis technique based on the voice segment connection scheme, segment data indicative of respective waveforms of a multiplicity of phonemes are prestored in a database, and voice synthesis is performed in the following manner. Namely, segment data corresponding to the phonemes constituting the voices to be synthesized are read out from the database in the order in which the phonemes are arranged, and the read-out segment data are interconnected after pitch conversion etc. are performed on the segment data. Many of the voice synthesis techniques in ordinary practical use today are based on the voice segment connection scheme.

Among examples of the voice synthesis technique using voice models is one using a Hidden Markov Model (hereinafter referred to as “HMM”). The Hidden Markov Model (HMM) is intended to model a voice on the basis of probabilistic transition between a plurality of states (sound sources). More specifically, each of the states constituting the HMM outputs a character amount indicative of its specific acoustic characteristics (e.g., fundamental frequency, spectrum, or a characteristic vector comprising these elements), and voice modeling is implemented by determining, by use of the Baum-Welch algorithm or the like, an output probability distribution of character amounts in the individual states and state transition probabilities in such a manner that variation over time in acoustic character of the voice to be modeled can be reproduced with the highest probability. The voice synthesis using the HMM can be outlined as follows.
The voice synthesis technique using the HMM is based on the premise that variation over time in acoustic character is modeled for each of a plurality of kinds of phonemes through machine learning and then stored into a database. The following describes the above-mentioned modeling using the HMM and the subsequent databasing, in relation to a case where a fundamental frequency is used as the character amount indicative of the acoustic character. First, each of a plurality of kinds of voices to be learned is segmented on a phoneme-by-phoneme basis, and a pitch curve indicative of variation over time in fundamental frequency of the individual phonemes is generated. Then, for each of the phonemes, an HMM representing the pitch curve with the highest probability is identified through machine learning using the Baum-Welch algorithm or the like. Then, model parameters defining the HMM (HMM parameters) are stored into a database in association with an identifier indicative of one or more phonemes whose variation over time in fundamental frequency is represented by the HMM. The identifier may indicate a plurality of phonemes because, even for different phonemes, characteristics of variation over time in fundamental frequency may sometimes be represented by a same HMM; doing so can achieve a reduced size of the database. Note that the HMM parameters include data indicative of characteristics of a probability distribution defining appearance probabilities of output frequencies of the states constituting the HMM (e.g., average value and distribution of the output frequencies, and average value and distribution of their change rates (first- or second-order differentiation)) and data indicative of state transition probabilities.
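By way of a non-limiting illustration only, the modeling and databasing outlined above might be sketched as follows in Python. The sketch assumes the third-party hmmlearn package (whose fit() method performs Baum-Welch-style EM training); the pitch curves and the names pitch_curves and model_db are hypothetical stand-ins for the learning data and the database, not the implementation of the conventionally-known technique.

```python
import numpy as np
from hmmlearn import hmm  # assumed third-party package; fit() runs Baum-Welch-style EM

# Hypothetical learning data: one pitch curve (log-F0 per analysis frame) per phoneme.
pitch_curves = {
    "a": np.log(220.0 * 2 ** (np.linspace(0.0, 2.0, 120) / 12.0)),
    "i": np.log(196.0 * 2 ** (np.linspace(0.0, -1.0, 90) / 12.0)),
}

model_db = {}  # phoneme identifier -> HMM parameters (the "database" of the text)
for phoneme, log_f0 in pitch_curves.items():
    # Observation per frame: [log-F0, its change rate], a simple stand-in for the
    # "character amount" and its first-order differentiation.
    obs = np.column_stack([log_f0, np.gradient(log_f0)])

    # Estimate output probability distributions and state transition probabilities.
    model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
    model.fit(obs)

    model_db[phoneme] = {
        "startprob": model.startprob_,  # initial state probabilities
        "transmat": model.transmat_,    # state transition probabilities
        "means": model.means_,          # averages of output values per state
        "covars": model.covars_,        # spreads of output values per state
    }
```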
In a voice synthesis process, on the other hand, HMM parameters corresponding to individual phonemes constituting human voices to be synthesized are read out from the database, and a state transition that may appear with the highest probability in accordance with an HMM represented by the read-out HMM parameters, as well as output frequencies of the individual states, is identified in accordance with a maximum likelihood estimation algorithm (such as the Viterbi algorithm). A time series of fundamental frequencies (i.e., a pitch curve) of the to-be-synthesized voices is represented by a time series of the frequencies identified in the aforementioned manner. Then, control is performed on a sound source (e.g., a sine wave generator) so that the sound source outputs a sound signal whose fundamental frequency varies in accordance with the pitch curve, after which a filter process dependent on the phonemes (e.g., a filter process for reproducing spectra or cepstra of the phonemes) is performed on the sound signal. In this way, the voice synthesis is completed. In many cases, such a voice synthesis technique using HMMs has been used for synthesis of text-reading voices (as disclosed for example in Japanese Patent Application Laid-open Publication No. 2002-268660). In recent years, however, it has been proposed to apply the voice synthesis technique using HMMs to singing synthesis (see, for example, “Trainable Singing Voice Synthesis System Capable of Representing Personal Characteristics and Singing Style”, by Sako Shinji, Saino Keijiro, Nankaku Yoshihiko and Tokuda Keiichi, in a study report “Musical Information Science” of Information Processing Society of Japan, 2008(12), pp. 39-44, Feb. 8, 2008, which will hereinafter be referred to as “Non-patent Literature 1”). In order to synthesize natural singing voices through singing synthesis based on the segment connection scheme, there is a need to database a multiplicity of segment data for each of the voice characters (e.g., high clean voice, husky voice, etc.) of singing persons. With the voice synthesis technique using HMMs, however, data indicative of a probability density distribution for generating character amount data are retained or stored instead of all of the character amounts being stored as data, and thus, such a synthesis technique is suited to be incorporated into small-size electronic equipment, such as portable game machines and portable phones.
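Continuing the illustrative sketch above, and again only as an assumption-laden simplification rather than the procedure of the patent literature, the synthesis side can be caricatured as follows: the expected dwell time of each state stands in for the maximum-likelihood state sequence that the text attributes to a Viterbi-type search, each state emits its mean frequency, and a sine-wave sound source is driven along the resulting pitch curve. The hypothetical model_db is carried over from the previous sketch, and the phoneme-dependent filter process would follow the sine generation.

```python
import numpy as np

def pitch_curve_from_params(params, min_frames=1):
    """Rough stand-in for maximum-likelihood decoding: dwell in each state for its
    expected duration 1/(1 - a_ii) and emit that state's mean log-F0."""
    frames = []
    for state, mean_log_f0 in enumerate(params["means"][:, 0]):
        stay = params["transmat"][state, state]
        dwell = max(min_frames, int(round(1.0 / max(1e-6, 1.0 - stay))))
        frames.extend([mean_log_f0] * dwell)
    return np.exp(np.array(frames))  # back from log-F0 to Hz

def sine_source(f0_curve, frame_rate=100.0, sample_rate=16000):
    """Sound source controlled so that its fundamental frequency follows the pitch curve."""
    f0_per_sample = np.repeat(f0_curve, int(sample_rate / frame_rate))
    phase = 2.0 * np.pi * np.cumsum(f0_per_sample) / sample_rate
    return np.sin(phase)  # a phoneme-dependent filter process would be applied next

signal = sine_source(pitch_curve_from_params(model_db["a"]))
```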
In the case where text-reading voices are to be synthesized using HMMs, it is conventional to model a voice using a phoneme as a minimum component unit of a model while taking into account a context, such as an accent type, a part of speech and an arrangement of preceding and succeeding phonemes; such modeling will hereinafter be referred to as “context-dependent modeling”. This is because, even for a same phoneme, a manner of variation over time in acoustic character of the phoneme can differ if the context differs. Thus, in performing singing synthesis by use of HMMs too, it is considered preferable to perform context-dependent modeling. However, in singing voices, variation over time in fundamental frequency representative of a melody of a music piece is considered to occur independently of a context of phonemes constituting lyrics, and it is considered that a singing expression unique to a singing person appears in such variation over time in fundamental frequency (namely, in the melody singing style). In order to synthesize singing voices that accurately reflect therein a singing expression unique to a singing person in question and that sound more natural, it is considered necessary to accurately model the variation over time in fundamental frequency that is independent of the context of phonemes constituting lyrics. Further, if a phoneme, such as a voiceless consonant, which is considered to have a great influence on pitch variation in singing voices is contained in lyrics, it is necessary to model the variation over time in fundamental frequency while taking into account phoneme-dependent pitch variation. However, it is hard to say that the framework of the conventionally-known technique, where the modeling is performed using phonemes as minimum component units of a model, can appropriately model variation over time in fundamental frequency based on a singing expression that straddles a plurality of phonemes. Furthermore, it is hard to say that the conventionally-known technique has so far appropriately modeled variation over time in fundamental frequency while taking into account phoneme-dependent pitch variation.
In view of the foregoing, it is an object of the present invention to provide a technique which can accurately model a singing expression, unique to a singing person and appearing in a melody singing style of the person, while taking into account phoneme-dependent pitch variation and thereby permits synthesis of singing voices that sound more natural.
In order to accomplish the above-mentioned object, the present invention provides an improved singing synthesizing database creation apparatus, which comprises: an input section to which are input learning waveform data representative of sound waveforms of singing voices of a singing music piece and learning score data representative of a musical score of the singing music piece, the learning score data including note data representative of a melody and lyrics data representative of lyrics associated with individual ones of the notes; a pitch extraction section which analyzes the learning waveform data to generate pitch data indicative of variation over time in fundamental frequency in the singing voices; a separation section which analyzes the pitch data, for each of pitch data sections corresponding to phonemes constituting the lyrics of the singing music piece, by use of the learning score data and separates the pitch data into melody component data representative of a variation component of the fundamental frequency dependent on the melody of the singing music piece and phoneme-dependent component data representative of a variation component of the fundamental frequency dependent on the phoneme constituting the lyrics; a first learning section which generates, in association with a combination of notes constituting the melody of the singing music piece, melody component parameters by performing predetermined machine learning using the learning score data and the melody component data, the melody component parameters defining a melody component model that represents a variation component presumed to be representative of the melody among the variation over time in fundamental frequency between notes in the singing voices, and which stores, into a singing synthesizing database, the generated melody component parameters and an identifier, indicative of the combination of notes to be associated with the melody component parameters, in association with each other; and a second learning section which generates, for each of the phonemes, phoneme-dependent component parameters by performing predetermined machine learning using the learning score data and the phoneme-dependent component data, the phoneme-dependent component parameters defining a phoneme-dependent component model that represents a variation component of the fundamental frequency dependent on the phoneme in the singing voices, and which stores, into the singing synthesizing database, the generated phoneme-dependent component parameters and a phoneme identifier, indicative of the phoneme to be associated with the phoneme-dependent component parameters, in association with each other.
According to the singing synthesizing database creation apparatus of the present invention, pitch data indicative of variation over time in fundamental frequency in the singing voices are generated from the learning waveform data representative of the singing voices of the singing music piece. From the pitch data are separated melody component data, representative of a variation component of the fundamental frequency presumed to represent the melody of the singing music piece, and phoneme-dependent component data, representative of a variation component of the fundamental frequency dependent on a phoneme constituting the lyrics. Then, melody component parameters defining a melody component model, representative of a variation component presumed to represent the melody among the variation over time in fundamental frequency between notes in the singing voices, are generated, through machine learning, from the melody component data and the learning score data (namely, data indicative of a time series of notes constituting the melody of the singing music piece and lyrics to be sung to the notes), and the thus-generated melody component parameters are databased. Meanwhile, phoneme-dependent component parameters defining a phoneme-dependent component model that represents a phoneme-dependent variation component of the fundamental frequency between notes in the singing voices are generated, through machine learning, from the phoneme-dependent component data and the learning score data, and the thus-generated phoneme-dependent component parameters are databased.
Note that the above-mentioned HMMs may be used as the melody component model and the phoneme-dependent component model. The melody component model, defined by the melody component parameters generated in the aforementioned manner, reflects therein a characteristic of the variation over time in fundamental frequency component between the notes (i.e., a characteristic of a singing style of the singing person) that are indicated by the identifier stored in the singing synthesizing database in association with the melody component parameters. Also, the phoneme-dependent component model, defined by the phoneme-dependent component parameters generated in the aforementioned manner, reflects therein a characteristic of phoneme-dependent variation over time in the fundamental frequency. Thus, the present invention permits singing synthesis accurately reflecting therein a singing expression unique to any singing person and pitch variation occurring due to phonemes, by databasing the melody component parameters in a form classified according to combinations of notes and singing persons and the phoneme-dependent component parameters in a form classified according to phonemes, and by performing singing synthesis based on HMMs using the stored content of the singing synthesizing database.
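Purely for illustration, one conceivable layout of such a singing synthesizing database is sketched below in Python; the field and key names (e.g., a note-combination identifier expressed as a tuple of note names) are assumptions made for the example and are not the claimed data format.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class HMMParams:
    startprob: list   # initial state probabilities
    transmat: list    # state transition probabilities
    means: list       # output probability distribution: averages
    covars: list      # output probability distribution: spreads

@dataclass
class SingingSynthesizingDB:
    # (singing person, note-combination identifier) -> melody component parameters
    melody: Dict[Tuple[str, Tuple[str, ...]], HMMParams] = field(default_factory=dict)
    # phoneme identifier -> phoneme-dependent component parameters
    phoneme_dependent: Dict[str, HMMParams] = field(default_factory=dict)

db = SingingSynthesizingDB()
db.melody[("singer_A", ("C3", "E3"))] = HMMParams(
    [1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]], [[130.8], [164.8]], [[1.0], [1.0]])
db.phoneme_dependent["s"] = HMMParams(
    [1.0, 0.0], [[0.8, 0.2], [0.0, 1.0]], [[-30.0], [0.0]], [[1.0], [1.0]])
```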
According to another aspect of the present invention, the present invention provides a pitch curve generation apparatus, which comprises: a singing synthesizing database storing therein, separately for each individual one of a plurality of singing persons, 1) melody component parameters defining a melody component model that represents a variation component presumed to be representative of a melody among variation over time in fundamental frequency between notes in singing voices of the singing person, and 2) an identifier indicative of a combination of one or more notes of which fundamental frequency component variation over time is represented by the melody component model, the singing synthesizing database storing therein sets of the melody component parameters and the identifiers in a form classified according to the singing persons, the singing synthesizing database also storing therein, in association with phoneme-dependent component parameters defining a phoneme-dependent component model that represents a variation component dependent on a phoneme among variation over time in the fundamental frequency, an identifier indicative of the phoneme for which the variation component is represented by the phoneme-dependent component model; an input section to which are input singing synthesizing score data representative of a musical score of a singing music piece and information designating any one of the singing persons for which the melody component parameters are prestored in the singing synthesizing database; a pitch curve generation section which synthesizes a pitch curve of a melody of a singing music piece, represented by the singing synthesizing score data, on the basis of a melody component model defined by the melody component parameters, stored in the singing synthesizing database for the singing person designated by the information inputted via the input section, and a time series of notes represented by the singing synthesizing score data; and a phoneme-dependent component correction section which, for each of pitch curve sections corresponding to phonemes constituting lyrics represented by the singing synthesizing score data, corrects the pitch curve, in accordance with the phoneme-dependent component model defined by the phoneme-dependent component parameters stored for the phoneme in the singing synthesizing database, and outputs the corrected pitch curve.
Further, the present invention may provide a singing synthesizing apparatus which performs driving control on a sound source so that the sound source generates a sound signal in accordance with the pitch curve, and which performs a filter process, corresponding to phonemes constituting the lyrics represented by the singing synthesizing score data, on the sound signal output from the sound source. Note that the aforementioned singing synthesizing database may be created by the aforementioned singing synthesizing database creation apparatus of the present invention.
The present invention may be constructed and implemented not only as the apparatus invention as discussed above but also as a method invention. Also, the present invention may be arranged and implemented as a software program for execution by a processor such as a computer or DSP, as well as a storage medium storing such a software program. In this case, the program may be provided to a user in the storage medium and then installed into a computer of the user, or delivered from a server apparatus to a computer of a client via a communication network and then installed into the computer. Further, the processor used in the present invention may comprise a dedicated processor with dedicated logic built in hardware, not to mention a computer or other general-purpose type processor capable of running a desired software program.
The following will describe embodiments of the present invention, but it should be appreciated that the present invention is not limited to the described embodiments and various modifications of the invention are possible without departing from the basic principles. The scope of the present invention is therefore to be determined solely by the appended claims.
For better understanding of the object and other features of the present invention, its preferred embodiments will be described hereinbelow in greater detail with reference to the accompanying drawings, in which:
b are diagrams showing example stored content of a singing synthesizing database;
b are diagrams showing example stored content of a singing synthesizing database of the second embodiment of the singing synthesis apparatus.
The control section 110 is, for example, in the form of a CPU (Central Processing Unit). The control section 110 functions as a control center of the singing synthesis apparatus 1A by executing various programs prestored in the storage section 150. The storage section 150 includes a non-volatile storage section 154 having prestored therein a database creation program 154a and a singing synthesis program 154b. Processing performed by the control section 110 in accordance with these programs will be described in detail later.
The group of interfaces 120 includes, among others, a network interface for communicating data with another apparatus via a network, and a driver for communicating data with an external storage medium, such as a CD-ROM (Compact Disk Read-Only Memory). In the instant embodiment, learning waveform data indicative of singing voices of a singing music piece and score data (hereinafter referred to as “learning score data”) of the singing music piece are input to the singing synthesis apparatus 1A via suitable ones of the interfaces 120. Namely, the group of interfaces 120 functions as input means for inputting learning waveform data and learning score data to the singing synthesis apparatus 1A, as well as input means for inputting score data indicative of a musical score of a singing music piece that is an object of singing voice synthesis (hereinafter referred to as “singing synthesizing score data”) to the singing synthesis apparatus 1A.
The operation section 130, which includes a pointing device, such as a mouse, and a keyboard, is provided for a user of the singing synthesis apparatus 1A to perform various input operations. The operation section 130 supplies the control section 110 with data indicative of operation performed by the user, such as drag and drop operation using the mouse and depression of any one of the keys on the keyboard. Thus, the content of the operation performed by the user on the operation section 130 is communicated to the control section 110. In the instant embodiment, in response to the user's operation on the operation section 130, an instruction for executing any of the various programs and information indicative of a singing person of singing voices represented by learning waveform data or of a singing person who is an object of singing voice synthesis are input to the singing synthesis apparatus 1A. The display section 140 includes, for example, a liquid crystal display and a drive circuit for the liquid crystal display. On the display section 140 is displayed a user interface screen for prompting the user of the singing synthesis apparatus 1A to operate the apparatus 1A.
In the instant embodiment, the pitch curve generating database of
In the phoneme waveform database, as shown in
The database creation program 154a is a program which causes the control section 110 to perform database creation processing for: extracting note identifiers from a time series of notes represented by learning score data (i.e., a time series of notes constituting a melody of a singing music piece); generating, through machine learning, melody component parameters to be associated with the individual note identifiers, from the learning score data and learning waveform data; and storing, into the pitch curve generating database, the melody component parameters and the note identifiers in association with each other. In the case where the note identifiers are each of the type indicative of a combination of two notes, for example, it is only necessary to extract the note identifiers indicative of combinations of two notes (C3, E3), (E3, C4), . . . sequentially from the beginning of the time series of notes indicated by the learning score data. The singing synthesis program 154b, on the other hand, is a program which causes the control section 110 to perform singing synthesis processing for: causing a user to designate, through operation on the operation section 130, any one of the singing persons for which a pitch curve generating database has already been created; and performing singing synthesis on the basis of singing synthesizing score data and the stored content of the pitch curve generating database for the singing person designated by the user and of the phoneme waveform database. The foregoing is the construction of the singing synthesis apparatus 1A. Processing performed by the control section 110 in accordance with these programs will be described later.
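The extraction of note identifiers as consecutive two-note combinations, as described above, can be illustrated with the short sketch below; the textual note names and the function name are hypothetical and chosen only for the example.

```python
def note_pair_identifiers(melody_notes):
    """['C3', 'E3', 'C4', ...] -> [('C3', 'E3'), ('E3', 'C4'), ...]"""
    return [(melody_notes[i], melody_notes[i + 1])
            for i in range(len(melody_notes) - 1)]

print(note_pair_identifiers(["C3", "E3", "C4", "A3"]))
# [('C3', 'E3'), ('E3', 'C4'), ('C4', 'A3')]
```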
The following describes various processing performed by the control section 110 in accordance with the database creation program 154a and the singing synthesis program 154b.
First, the database creation processing is described. The melody component extraction process SA110 is a process for analyzing the learning waveform data and then generating, on the basis of singing voices represented by the learning waveform data, data indicative of variation over time in fundamental frequency component presumed to represent a melody (such data will hereinafter be referred to as “melody component data”). The melody component extraction process SA110 may be performed in either of the following two specific styles.
In the first style, pitch extraction is performed on the learning waveform data on a frame-by-frame basis in accordance with a pitch extraction algorithm, and a series of data indicative of pitches (hereinafter referred to as “pitch data”) extracted from the individual frames are set as melody component data. The pitch extraction algorithm employed here may be a conventionally-known pitch extraction algorithm. In the second style, on the other hand, a component of phoneme-dependent pitch variation (hereinafter referred to as “phoneme-dependent component”) is removed from the pitch data, so that the pitch data having the phoneme-dependent component removed therefrom are set as melody component data. An example of a specific scheme for removing the phoneme-dependent component from the pitch data may be as follows. Namely, the above-mentioned pitch data are segmented into intervals or sections corresponding to the individual phonemes constituting lyrics represented by the learning score data. Then, for each of the segmented sections where a plurality of notes correspond to one phoneme, linear interpolation is performed between pitches of the preceding and succeeding notes as indicated by one-dot-dash line in
Namely, with the aforementioned second style employed in the instant embodiment, linear interpolation is performed between pitches represented by the preceding and succeeding notes (i.e., pitches represented by positions of the notes on a musical score (or positions in a tone pitch direction)), and a series of pitches indicated by the interpolating straight line are set as melody component data. In short, it is only necessary that the style be capable of generating melody component data by removing a phoneme-dependent pitch variation component, and another style, such as the following, is also possible. For example, the other style may be one in which linear interpolation is performed between a pitch indicated by the pitch data at a time-axial position of the preceding note and a pitch indicated by the pitch data at a time-axial position of the succeeding note, and a series of pitches indicated by the interpolating straight line are set as melody component data. This is because pitches represented by positions, on a musical score, of notes do not necessarily agree with pitches indicated by the pitch data (namely, pitches corresponding to the notes in actual singing voices).
Still another style is possible, in which linear interpolation is performed between pitches indicated by the pitch data at opposite end positions of a section corresponding to a consonant, and a series of pitches indicated by the interpolating straight line are set as melody component data. Alternatively, linear interpolation may be performed between pitches indicated by the pitch data at opposite end positions of a section slightly wider than the section segmented, in accordance with the learning score data, as corresponding to the consonant, to thereby generate melody component data. This is because an experiment conducted by the Applicants has shown that the approach of generating melody component data by performing linear interpolation between the pitches at the opposite end positions of a section slightly wider than the section segmented in accordance with the learning score data can remove a phoneme-dependent pitch variation component occurring due to the consonant more effectively than the approach of generating melody component data by performing linear interpolation between the pitches at the opposite end positions of the section segmented in accordance with the learning score data. Among specific examples of the above-mentioned section slightly wider than the section segmented, in accordance with the learning score data, as corresponding to the consonant are a section that starts at a given position within a section immediately preceding the section corresponding to the consonant and ends at a given position within a section immediately succeeding the section corresponding to the consonant, and a section that starts at a position a predetermined time before a start position of the section corresponding to the consonant and ends at a position a predetermined time after an end position of the section corresponding to the consonant.
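The interpolation used in the second style can be illustrated, under simplifying assumptions, by the sketch below: the pitch data are frame-indexed, the consonant section boundaries are assumed to be known from the learning score data, and the widening margin is an arbitrary value chosen only for the example.

```python
import numpy as np

def remove_phoneme_component(pitch, start, end, margin=5):
    """Replace the pitch values of a consonant section (widened by `margin`
    frames on both sides) with a straight line interpolated between the
    pitches at the opposite end positions of the widened section."""
    lo = max(0, start - margin)
    hi = min(len(pitch) - 1, end + margin)
    melody = pitch.astype(float).copy()
    melody[lo:hi + 1] = np.linspace(melody[lo], melody[hi], hi - lo + 1)
    return melody  # melody component data with the phoneme-dependent dip removed

# Hypothetical pitch data (Hz per frame) with a dip caused by a voiceless consonant.
pitch = np.full(60, 220.0)
pitch[25:32] -= 40.0                     # phoneme-dependent pitch variation
melody_component = remove_phoneme_component(pitch, start=25, end=31)
```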
The aforementioned first style is advantageous in that it can obtain melody component data with ease, but disadvantageous in that it cannot extract accurate melody component data if the singing voices represented by the learning waveform data contain a voiceless consonant (i.e., a phoneme considered to have particularly high phoneme dependency in pitch variation). The aforementioned second style, on the other hand, is disadvantageous in that it increases a processing load for obtaining melody component data as compared to the first style, but advantageous in that it can extract accurate melody component data even if the singing voices contain a voiceless consonant. The phoneme-dependent component removal may be performed only on consonants (e.g., voiceless consonants) considered to have particularly high phoneme dependency in pitch variation. More specifically, in which of the first and second styles the melody component extraction is to be performed may be determined, i.e. switching may be made between the first and second styles, for each set of learning waveform data, depending on whether or not any consonant considered to have particularly high phoneme dependency in pitch variation is contained in the lyrics. Alternatively, switching may be made between the first and second styles for each of the phonemes constituting the lyrics.
In the machine learning process SA120 of
In the case where a transition segment from one note to another is made as an object of modeling as in the example of
Next, a description will be given about the pitch curve generation process SB110 and the filter process SB120 constituting the singing synthesis processing. Similarly to the process performed in the conventionally-known technique using HMMs, the pitch curve generation process SB110 synthesizes a pitch curve corresponding to a time series of notes, represented by the singing synthesizing score data, using the singing synthesizing score data and the stored content of the pitch curve generating database. More specifically, the pitch curve generation process SB110 segments the time series of notes, represented by the singing synthesizing score data, into sets of notes each comprising two or more notes and then reads out, from the pitch curve generating database, melody component parameters corresponding to the sets of notes. For example, in a case where each of the note identifiers used here indicates a combination of two notes, the time series of notes represented by the singing synthesizing score data may be segmented into sets of two notes, and then the melody component parameters corresponding to the sets of notes may be read out from the pitch curve generating database. Then, a process is performed, in accordance with the Viterbi algorithm or the like, for not only identifying a state transition sequence presumed to appear with the highest probability, by reference to state duration probabilities indicated by the melody component parameters, but also identifying, for each of the states, a frequency presumed to appear with the highest probability on the basis of an output probability distribution of frequencies in the individual states. The above-mentioned pitch curve is represented by a time series of the thus-identified frequencies.
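A highly simplified sketch of the pitch curve generation process SB110 follows; it reuses the hypothetical db of the earlier database layout sketch and a caller-supplied decode function standing in for the Viterbi-based identification of per-state frequencies described above, and is not the actual processing of the embodiment.

```python
def generate_pitch_curve(notes, singer, db, decode):
    """notes: e.g. ['C3', 'E3', 'C4']; db.melody maps (singer, note pair) to
    melody component parameters; decode turns those parameters into a
    frame-wise series of fundamental frequencies."""
    curve = []
    for pair in zip(notes[:-1], notes[1:]):      # segment into sets of two notes
        params = db.melody[(singer, pair)]       # read out melody component parameters
        curve.extend(decode(params))             # frequencies presumed most probable
    return curve
```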
After that, as in the conventionally-known voice synthesis process, the control section 110 in the instant embodiment performs driving control on a sound source (e.g., sine waveform generator (not shown in
According to the instant embodiment, as described above, melody component parameters, defining a melody component model representing individual melody components between notes constituting a melody of a singing music piece, are generated for each combination of notes; such generated melody component parameters are databased separately for each singing person. In performing singing synthesis in accordance with the singing synthesizing score data, a pitch curve which represents the melody of the singing music piece represented by the singing synthesizing score data is generated on the basis of the stored content of the pitch curve generating database corresponding to a singing person designated by the user. Because a melody component model defined by melody component parameters stored in the pitch curve generating database represents a melody component unique to the singing person, it is possible to synthesize a melody accurately reflecting therein a singing expression unique to the singing person, by synthesizing a pitch curve in accordance with the melody component model. Namely, with the instant embodiment, it is possible to perform singing synthesis accurately reflecting therein a singing expression based on a style of singing the melody (hereinafter “melody singing expression”) unique to the singing person, as compared to the conventional singing synthesis technique for modeling a singing voice on the phoneme-by-phoneme basis or the conventional singing synthesis technique based on the segment connection scheme.
The singing synthesizing database 154f in the singing synthesis apparatus 1B is different from the singing synthesizing database 154c in the singing synthesis apparatus 1A in that it includes a phoneme-dependent-component correcting database in addition to the pitch curve generating database and phoneme waveform database. In association with each of phoneme identifiers indicative of phonemes that could influence variation over time in fundamental frequency component in singing voices, HMM parameters (hereinafter referred to as “phoneme-dependent component parameters”), defining a phoneme-dependent component model that is an HMM representing a characteristic of the variation over time in fundamental frequency component occurring due to the phonemes, are stored in the phoneme-dependent-component correcting database. As will be later detailed, such a phoneme-dependent-component correcting database is created for each singing person in the course of database creation processing that creates the pitch curve generating database by use of learning waveform data and learning score data.
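As a rough illustration only of the separation implied by this database, and not the embodiment's actual separation processing, the melody component may be obtained by interpolation as in the earlier sketch and the phoneme-dependent component taken as the residual deviation of the pitch data from it.

```python
# Illustrative only: reuses the hypothetical `pitch` and
# `remove_phoneme_component` names from the interpolation sketch above.
melody_component = remove_phoneme_component(pitch, start=25, end=31)
phoneme_dependent_component = pitch - melody_component
```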
The following describes various processing performed by the control section 110 of the singing synthesis apparatus 1B in accordance with the database creation program 154d and the singing synthesis program 154e.
First, the database creation processing is described. As seen in
Next, the singing synthesis processing is described. As shown in
According to the above-described second embodiment, it is possible to perform singing synthesis that reflects therein not only a melody singing expression unique to a designated singing person but also a characteristic of pitch variation occurring due to a phoneme uttering style unique to the designated singing person. Although the second embodiment has been described above in relation to the case where phonemes to be subjected to the pitch curve correction are not particularly limited, the second embodiment may of course be arranged to perform the pitch curve correction only for an interval or section corresponding to a phoneme (e.g., a voiceless consonant) presumed to have a particularly great influence on variation over time in fundamental frequency component of singing voices. More specifically, phonemes presumed to have a particularly great influence on variation over time in fundamental frequency component of singing voices may be identified in advance, and the machine learning process SD130 may be performed only on the identified phonemes to create a phoneme-dependent component correcting database. Further, the phoneme-dependent component correction process SE110 may be performed only on the identified phonemes. Furthermore, whereas the second embodiment has been described above as creating a phoneme-dependent component correcting database for each singing person, it may create a common phoneme-dependent component correcting database for a plurality of singing persons. In the case where a common phoneme-dependent component correcting database is created for a plurality of singing persons like this, a characteristic of pitch variation occurring due to a phoneme uttering style that appears in common to the plurality of singing persons is modeled on a phoneme-by-phoneme basis, and the thus-modeled characteristics are databased. Thus, the second embodiment can perform singing synthesis reflecting therein not only a melody singing expression unique to each of the singing persons but also a characteristic of phoneme-specific pitch variation that appears in common to the plurality of singing persons.
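For illustration only, the correction step described for the second embodiment might look like the following sketch; the additive combination of the components, the section representation and the deviation_of callback are assumptions made for the example rather than the processing of the phoneme-dependent component correction process SE110.

```python
import numpy as np

def correct_pitch_curve(pitch_curve, phoneme_sections, deviation_of):
    """phoneme_sections: list of (phoneme, start_frame, end_frame) taken from the
    singing synthesizing score data; deviation_of(phoneme, length) returns a
    frame-wise deviation generated from the phoneme-dependent component model,
    or None when no correction is to be applied (e.g. when correction is limited
    to voiceless consonants)."""
    corrected = np.asarray(pitch_curve, dtype=float).copy()
    for phoneme, start, end in phoneme_sections:
        deviation = deviation_of(phoneme, end - start + 1)
        if deviation is not None:
            corrected[start:end + 1] += deviation   # apply phoneme-dependent component
    return corrected
```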
The above-described first and second embodiments may of course be modified variously as exemplified below.
(1) Each of the first and second embodiments has been described above in relation to the case where the individual processes that clearly represent the characteristic features of the present invention are implemented by software. However, a melody component extraction means for performing the melody component extraction process SA110, a machine learning means for performing the machine learning process SA120, a pitch curve generation means for performing the pitch curve generation process SB110 and a filter process means for performing the filter process SB120 may each be implemented by an electronic circuit, and the singing synthesis apparatus 1A may be constructed of a combination of these electronic circuits and an input means for inputting learning waveform data and various score data. Similarly, a pitch extraction means for performing the pitch extraction process SD110, a separation means for performing the separation process SD120, machine learning means for performing the machine learning process SA120 and the machine learning process SD130 and a phoneme-dependent component correction means for performing the phoneme-dependent component correction process SE110 may each be implemented by an electronic circuit, and the singing synthesis apparatus 1B may be constructed of a combination of these electronic circuits and the above-mentioned input means, pitch curve generation means and filter process means.
(2) The singing synthesizing database creation apparatus for performing the database creation processing shown in
(3) In each of the above-described embodiments, the database creation program 154a (or 154d), which clearly represents the characteristic features of the present invention, is prestored in the non-volatile storage section 154 of the singing synthesis apparatus 1A (or 1B). However, the database creation program 154a (or 154d) may be distributed in a computer-readable storage medium, such as a CD-ROM, or by downloading via an electric communication line, such as the Internet. Similarly, in each of the above-described embodiments, the singing synthesis program 154b (or 154e) may be distributed in a computer-readable storage medium, such as a CD-ROM, or by downloading via an electric communication line, such as the Internet.
This application is based on, and claims priority to, JP PA 2009-157531 filed on 2 Jul. 2009 and JP PA 2010-131837 filed on 9 Jun. 2010. The disclosures of the priority applications, in their entirety, including the drawings, claims, and specifications thereof, are incorporated herein by reference.
Foreign Patent Documents
JP 2002-268660 A, Sep. 2002, Japan.

Other Publications
European Search Report mailed Oct. 11, 2010, for EP Application No. 10167617.9, five pages.
Gu, H-Y. et al. (Jul. 12, 2008). "Mandarin Singing Voice Synthesis Using ANN Vibrato Parameter Models," Proceedings of the Seventh International Conference on Machine Learning and Cybernetics, Piscataway, NJ, Jul. 12-15, 2008, pp. 3288-3293.
Saitou, T. et al. (Jul. 1, 2005). "Development of an F0 Control Model Based on F0 Dynamic Characteristics for Singing-Voice Synthesis," Speech Communication 46(3-4):405-417.
Sako, Shinji, et al. (Feb. 8, 2008). "A Trainable Singing Voice Synthesis System Capable of Representing Personal Characteristics and Singing Styles," IPSJ SIG Technical Report.
Saino, Keijiro, et al. (Sep. 2006). "An HMM-based Singing Voice Synthesis System."