1. Field of the Invention
The present invention relates to a singing voice synthesizing apparatus that synthesizes a singing voice, a method of synthesizing a singing voice, and a program for realizing the method thereof.
2. Description of the Related Art
In the past, there has been a wide range of attempts to synthesize singing voice.
One of these attempts, an application of speech synthesis by rule, receives inputs of pitch data, which corresponds to the pitch of a note, and of lyric data, and synthesizes speech using a synthesis-by-rule device for text-to-speech synthesis. In most cases, raw waveform data or analyzed and parameterized data are stored in a database in units of phonemes or phoneme chains comprised of two or more phonemes. At the time of synthesis, required voice fragments (phonemes or phoneme chains) are selected, concatenated, and synthesized. Examples are disclosed in Japanese Laid-Open Patent Publications (Kokai) Nos. S62-6299, H10-124082, and H11-1184490, among others.
However, since the object of these technologies is to synthesize a speaking voice, they are not always capable of synthesizing a singing voice with satisfactory quality.
For example, a singing voice synthesized by a method of overlapping and adding waveforms, as typified by PSOLA (Pitch-Synchronous OverLap and Add), has a good degree of comprehensibility, but often sounds unnatural in elongated tones, where the quality of a singing voice varies the most, and also sounds unnatural when there are slight fluctuations of pitch and vibrato, which are essential for a singing voice.
Moreover, attempting to synthesize a singing voice using a waveform concatenating type speech synthesizing device with a large-scale corpus base would require an astronomically large number of fragment data if the original data are to be concatenated and output without any processing.
On the other hand, synthesizers whose original purpose is for synthesizing a singing voice have also been proposed. A well-known example is the synthesis method of formant synthesis (Japanese Laid-Open Patent Publication (Kokai) No. 3-200300). However, although this method offers a large degree of freedom with respect to the quality and fluctuations of vibrato and pitch of elongated sounds, the clarity of synthesized sounds (especially consonants) is poor, and therefore quality is not always satisfactory.
U.S. Pat. No. 5,029,509 discloses a technique known as Spectral Modeling Synthesis (SMS) for analyzing and synthesizing a musical sound using a model that expresses an original sound as comprised of two components, namely a deterministic component and a stochastic component.
With SMS analysis and synthesis, good control of the musical characteristics of a musical sound is possible, and at the same time, in the case of a singing voice, through use of the stochastic component, a high degree of clarity can be expected from even the consonants. Therefore, applying this technique to the synthesis of a singing voice is expected to achieve a synthesized sound having a high degree of clarity and musicality. In fact, Japanese Patent No. 2906970 proposes specific applications for sound synthesis based on SMS analysis and synthesis techniques, and at the same time, also describes a methodology for utilizing SMS techniques in singing voice synthesis (singing synthesizer).
An application of the techniques proposed in the aforementioned Japanese Patent No. 2906970 to a singing voice synthesizing apparatus will be described with reference to
In
When synthesizing a singing voice sound, a phoneme string comprising the desired lyrics is obtained, a phoneme-to-fragment converter 104 determines the required voice fragments (phonemes or phoneme chains) that comprise the phoneme string, and then SMS data (deterministic component and stochastic component) of the required voice fragments is read from the aforementioned database 100. Next, a fragment concatenator 105 concatenates the read-out SMS data of the voice fragments into a time series. For the deterministic component, based on pitch information corresponding to a melody of the song, a deterministic component generator 106 generates harmonic components having the desired pitch while preserving the shape of the spectral envelope of the deterministic component. For example, to synthesize the Japanese word “saita”, the fragments of “#s”, “s”, “s-a”, “a”, “a-i”, “i”, “i-t”, “t”, “t-a”, “a”, and “a#” are concatenated, and the deterministic component of the desired pitch is generated while preserving the shape of the spectral envelope included in the SMS data obtained from the fragment concatenation. Next, the generated deterministic component and the stochastic component are added together by a synthesizing means 107, and the result thereof is transformed into time domain data to obtain synthesized voice.
By thus utilizing these SMS techniques, natural sounding synthesized singing with good comprehensibility can be obtained even for elongated sounds.
However, the method described in the aforementioned Japanese Patent No. 2906970 is overly rudimentary and simplistic, and the following types of problems will occur if a singing voice is synthesized according to that method.
It is a first object of the present invention to provide a singing voice synthesizing apparatus and a singing voice synthesizing method that resolve the above described problems through prescribing a specific method for utilizing the SMS techniques proposed in the aforementioned Japanese Patent No. 2906970 and adding considerable improvements for enhancing the synthesized sound quality, to thereby enable achievement of a natural sounding synthesized singing voice with a good level of comprehensibility, and a program for realizing a singing voice synthesizing method.
It is a second object of the present invention to provide a singing voice synthesizing apparatus and a singing voice synthesizing method that are capable of reducing the size of the aforementioned database and increasing the efficiency with which the database is generated, and a program for realizing a singing voice synthesizing method.
It is a third object of the present invention to provide a singing voice synthesizing apparatus and a singing voice synthesizing method that are capable of adjusting the degree of huskiness in a synthesized voice, and a program for realizing a singing voice synthesizing method.
To attain the objects, the present invention provides a singing voice synthesizing apparatus comprising a phoneme database that stores a plurality of voice fragment data formed of voice fragments each being a single phoneme or a phoneme chain of at least two concatenated phonemes, each of the plurality of voice fragment data comprising data of a deterministic component and data of a stochastic component, an input device that inputs lyrics, a readout device that reads out from the phoneme database the voice fragment data corresponding to the inputted lyrics, a duration time adjusting device that adjusts time duration of the read-out voice fragment data so as to match a desired tempo and manner of singing, an adjusting device that adjusts the deterministic component and the stochastic component of the read-out voice fragment so as to match a desired pitch, and a synthesizing device that synthesizes a singing sound by sequentially concatenating the voice fragment data that have been adjusted by the duration time adjusting device and the adjusting device.
With the above arrangement according to the present invention, through improvement of the SMS techniques, a natural sounding synthesized singing voice with a good level of comprehensibility can be obtained even for elongated sounds, and further, even slight variations of vibrato and pitch do not result in an unnatural sounding synthesized sound.
Preferably, the phoneme database stores a plurality of voice fragment data having different musical expressions for a single phoneme or phoneme chain.
More preferably, the musical expressions include at least one parameter selected from the group consisting of pitch, dynamics and tempo.
In a preferred embodiment of the present invention, the phoneme database stores voice fragment data comprising elongated sounds that are each enunciated by elongating a single phoneme, voice fragment data comprising consonant-to-vowel phoneme chains and vowel-to-consonant phoneme chains, voice fragment data comprising consonant-to-consonant phoneme chains, and voice fragment data comprising vowel-to-vowel phoneme chains.
In a preferred form of the present invention, each of the voice fragment data comprises a plurality of data corresponding respectively to a plurality of frames of a frame string formed by segmenting a corresponding one of the voice fragments, and the data of the deterministic component and the data of the stochastic component of each of the voice fragment data each comprise a series of frequency domain data corresponding respectively to the plurality of frames of the frame string corresponding to each of the voice fragments.
Moreover, in this preferred form, the duration time adjusting device generates a frame string of a desired time length by repeating at least one frame of the plurality of frames of the frame string corresponding to each of the voice fragments, or by thinning out a predetermined number of frames of the plurality of frames of the frame string corresponding to each of the voice fragments.
With this arrangement, since the length of an elongated phoneme and length of a phoneme chain can be adjusted freely, a synthesized singing voice can be obtained at a desired tempo.
More preferably, the duration time adjusting device generates the frame string of a desired time length by repeating a plurality of frames of the frame string corresponding to each of the voice fragments, the duration time adjusting device repeating the plurality of frames in a first direction in which the frame string of a desired time length is generated and in a second direction opposite thereto.
Still more preferably, when repeating the plurality of frames of the frame string corresponding to the data of the stochastic component of each of the voice fragments in the first and second directions, the duration time adjusting device reverses a phase of a phase spectrum of the stochastic component.
Preferably, the singing voice synthesizing apparatus according to the present invention further comprises a fragment level adjusting device that performs smoothing processing or level adjusting processing on the deterministic component and the stochastic component contained in each of the voice fragment data when the voice fragment data are sequentially concatenated by the synthesizing device.
With this arrangement, since a smoothing or level adjusting process is performed at the concatenation boundary between phonemes, noise is not generated when the phonemes are concatenated.
Also preferably, the singing voice synthesizing apparatus according to the present invention further comprises a deterministic component generating device that changes only pitch of the deterministic component to a desired pitch while preserving the spectral envelope shape of the deterministic component contained in each of the voice fragment data when the voice fragment data are sequentially concatenated by the synthesizing device.
Preferably, the phoneme database stores voice fragment data comprising elongated sounds that are each enunciated by elongating a single phoneme, the phoneme database further storing a flat spectrum as an amplitude spectrum of the stochastic component of each of the voice fragment data comprising each of the elongated sounds, obtained by multiplying the amplitude spectrum thereof by an inverse of a typical spectrum within an interval of the elongated sound.
In this case, the amplitude spectrum of the stochastic component of each of the voice fragment data comprising each of the elongated sounds is obtained by multiplying an amplitude spectrum of the stochastic component calculated based on an amplitude spectrum of the deterministic component of the voice fragment data of the elongated sound, by the flat spectrum.
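By way of illustration only, the flattening and restoration of the stochastic amplitude spectrum described above can be sketched as follows. The sketch assumes that the typical spectrum is the per-bin mean over the frames of the elongated sound; the text does not state how the typical spectrum is chosen, and the function names are hypothetical.

```python
def typical_spectrum(frames):
    """Per-bin mean amplitude over all frames (assumed definition of 'typical')."""
    n = len(frames)
    return [sum(f[k] for f in frames) / n for k in range(len(frames[0]))]

def flatten(frames):
    """Multiply each frame by the inverse of the typical spectrum."""
    typ = typical_spectrum(frames)
    return [[a / t for a, t in zip(f, typ)] for f in frames]

def restore(flat_frames, computed_spectrum):
    """At synthesis, multiply the stored flat spectrum by a computed spectrum."""
    return [[x * c for x, c in zip(f, computed_spectrum)] for f in flat_frames]

# Two frames of a hypothetical two-bin stochastic amplitude spectrum
frames = [[2.0, 1.0], [4.0, 3.0]]
flat = flatten(frames)
restored = restore(flat, typical_spectrum(frames))
```

Restoring with the typical spectrum itself reproduces the original frames; at synthesis, a spectrum calculated from the deterministic component would take its place.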
Preferably, the phoneme database does not store amplitude spectra of stochastic components of voice fragment data comprising certain elongated sounds, and the flat spectrum stored as an amplitude spectrum of voice fragment data comprising at least one other elongated sound is used for synthesis of the certain sounds.
Preferably, the amplitude spectrum of the stochastic component calculated based on the amplitude spectrum of the deterministic component has a gain thereof at 0 Hz controlled according to a parameter for controlling a degree of huskiness.
With this arrangement, the degree of huskiness of a synthesized voice can be controlled simply.
To attain the above objects, the present invention also provides a singing voice synthesizing method comprising the steps of storing in a phoneme database a plurality of voice fragment data formed of voice fragments each being a single phoneme or a phoneme chain of at least two concatenated phonemes, each of the plurality of voice fragment data comprising data of a deterministic component and data of a stochastic component, reading out from the phoneme database the voice fragment data corresponding to lyrics inputted by an input device, adjusting time duration of the read-out voice fragment data so as to match a desired tempo and manner of singing, adjusting the deterministic component and the stochastic component of the read-out voice fragment so as to match a desired pitch, and synthesizing a singing sound by sequentially concatenating the voice fragment data that have been adjusted in respect of the time duration and the deterministic component and the stochastic component thereof.
To attain the above objects, the present invention further provides a program for causing a computer to execute the above mentioned singing voice synthesizing method.
To attain the above objects, the present invention further provides a mechanically readable storage medium storing instructions for causing a machine to execute the above mentioned singing voice synthesizing method.
According to the present invention, the synthesized singing voice can be of high quality, having an appropriate tone color for a desired pitch, and is free of noise between concatenated units. Further, the database can be made extremely small in size and can be generated with a higher efficiency. Still further, the degree of huskiness of a synthesized voice can be controlled simply.
The above and other objects, features, and advantages of the invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings.
The singing voice synthesizing apparatus of the present invention has a phoneme database which is comprised of individual phonemes and phoneme chains that have been obtained by dividing into required segments SMS data of deterministic and stochastic components obtained from an SMS analysis of input voices. This database also contains heading information including information indicative of the phonemes and phoneme chains, information indicative of the pitch of voice fragments formed of the phonemes and phoneme chains, and information indicative of musical expressions such as dynamics and tempo thereof. Here, the dynamics information may be either sensory information indicative of whether the voice fragment (phoneme or phoneme chain) is a forte or mezzo forte sound, or physical information indicating the level of the fragment.
Moreover, an SMS analysis means is provided for decomposing the input singing voice into deterministic and stochastic components, and analyzing them in order to generate the aforementioned database. Also, a means (which may be either automatic or manual) for segmenting the SMS data into the required phonemes or phoneme chains (fragments) is provided.
An example of generating the phoneme database will be described with reference to
In
In the case of synthesizing Japanese language lyrics, the voice fragments are comprised of, for example, vowel sound data (one or a plurality of frames), consonant-to-vowel sound data (a plurality of frames), vowel-to-consonant sound data (a plurality of frames), and vowel-to-vowel data (a plurality of frames).
A voice synthesis apparatus that uses voice synthesis by rule or the like normally stores data in its phoneme database in units that are longer than one syllable, such as VCV (vowel-consonant-vowel) or CVC (consonant-vowel-consonant) units. On the other hand, in the singing voice synthesizing apparatus of the present invention which aims to synthesize a singing voice sound, data of elongated sound, which frequently occurs in singing as the enunciation of long vowels, consonant-to-vowel (CV), vowel-to-consonant (VC) sound data, consonant-to-consonant sound data, and vowel-to-vowel sound data are stored in the phoneme database.
The SMS analyzer 13 performs an SMS analysis of original input singing voices and outputs SMS-analyzed data for each frame.
More specifically, the input voice is divided into a series of time frames, and an FFT or other frequency analysis is performed for each frame. From the resulting frequency spectra (complex spectra), amplitude spectra and phase spectra are obtained, and a specific frequency spectrum that corresponds to a peak in the amplitude spectrum is extracted as a line spectrum. In this case, a spectrum containing the fundamental frequency and frequencies in the vicinity of its integer multiples is a line spectrum. This extracted line spectrum corresponds to the deterministic component.
Next, a residual spectrum is obtained by subtracting the line spectrum, which has been extracted as described above, from the spectrum of the input waveform of the frame. Alternatively, temporal waveform data of the deterministic component, which has been synthesized from the extracted line spectrum, is subtracted from the input waveform data of that frame to obtain temporal waveform data of the residual component, and then a frequency analysis of the residual component temporal waveform data is performed to obtain the residual spectrum. The thus-obtained residual spectrum corresponds to the stochastic component.
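By way of illustration only, the decomposition of one frame into a line spectrum and a residual spectrum can be sketched as follows. The sketch makes simplifying assumptions that are not part of the description: a naive DFT is used in place of an FFT, the fundamental is assumed to be known, and the harmonics are assumed to fall exactly on bins that are integer multiples of the fundamental bin. The function names are hypothetical.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def split_frame(spectrum, fundamental_bin):
    """Split a complex spectrum into a line spectrum and a residual spectrum.

    Assumption: harmonics lie exactly on integer multiples of
    fundamental_bin (and on the mirrored negative-frequency bins).
    """
    n = len(spectrum)
    line = [0j] * n
    for k in range(fundamental_bin, n // 2, fundamental_bin):
        line[k] = spectrum[k]
        line[n - k] = spectrum[n - k]
    # Residual spectrum = input spectrum minus the extracted line spectrum
    residual = [s - l for s, l in zip(spectrum, line)]
    return line, residual

# Test frame: harmonics at bins 4 and 8, plus a weak non-harmonic
# component at bin 5 standing in for the residual.
N = 32
frame = [math.sin(2 * math.pi * 4 * t / N)
         + 0.5 * math.sin(2 * math.pi * 8 * t / N)
         + 0.1 * math.sin(2 * math.pi * 5 * t / N)
         for t in range(N)]
line, residual = split_frame(dft(frame), fundamental_bin=4)
```

The harmonic bins end up in `line` and everything else, here the component at bin 5, remains in `residual`.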
The frame period used in the above SMS analysis may have either a certain fixed length, or a variable length that changes according to the pitch or other parameter of the input voice. If the frame period has a variable length, the input voice is processed with a first frame period of fixed length, the pitch is detected, and then the input voice is reprocessed with a frame period of a length that corresponds to the results of the pitch detection; alternatively, a method may be employed, in which the period of the following frame is varied according to the pitch detected from the present frame.
The SMS-analyzed data output for each frame from the SMS analyzer 13 is segmented into the length of a voice fragment stored in the phoneme database by the segmentor 14. More specifically, the SMS-analyzed data is manually or automatically segmented to extract vowel phonemes, vowel-consonant or consonant-vowel phoneme chains, consonant-consonant phoneme chains, and vowel-vowel phoneme chains so as to be optimally suited for singing sound synthesis. Here, long interval data of vowels that are to be elongated and sung (elongated sounds) are also extracted by segmentation as vowel phonemes.
Moreover, the segmentor 14 detects the pitch of the input voice based on the aforementioned SMS analysis results. The pitch detection is performed by first calculating an average pitch value from the frequency of lower-order line spectra in the deterministic component of a frame included in the fragment, and then calculating an average pitch value for all frames.
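By way of illustration only, the two-stage averaging described above can be sketched as follows. It is assumed here that the k-th line spectrum lies near k times the fundamental, so that each lower-order harmonic frequency divided by its order gives one pitch estimate; the number of harmonics used (`max_order`) and the function names are hypothetical.

```python
def frame_pitch(line_freqs_hz, max_order=4):
    """Average pitch of one frame from its lower-order harmonic frequencies.

    line_freqs_hz[k - 1] is assumed to hold the frequency of the k-th harmonic.
    """
    orders = range(1, min(max_order, len(line_freqs_hz)) + 1)
    estimates = [line_freqs_hz[k - 1] / k for k in orders]
    return sum(estimates) / len(estimates)

def fragment_pitch(frames):
    """Average of the per-frame pitch values over all frames of a fragment."""
    pitches = [frame_pitch(f) for f in frames]
    return sum(pitches) / len(pitches)

# Two frames whose harmonics sit at multiples of 220 Hz and 222 Hz
frames = [[220.0, 440.0, 660.0, 880.0], [222.0, 444.0, 666.0, 888.0]]
pitch = fragment_pitch(frames)
```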
In this manner, data of the deterministic component and data of the stochastic component are extracted for each fragment and stored in the phoneme database 10, with headings comprised of information of the pitch of the input singing voice and musical expressions of tempo, dynamics, etc. appended thereto.
Of the data of deterministic and stochastic components contained in the data of each fragment, namely, the SMS data from the aforementioned SMS analyzer 13 that has been segmented into individual fragments by the segmentor 14, the data of deterministic components may be stored either by storing all spectral envelopes (line spectra (harmonic series) strength (amplitude) and phase spectra) of each frame contained in each fragment as they are, or by storing arbitrary functions that express the spectral envelopes instead of the spectral envelopes themselves. The data of deterministic components may also be stored in the form of inverse-transformed temporal waveforms. Furthermore, the data of stochastic components may be stored in the form of strength spectra (amplitude spectra) and phase spectra for each frame of the segment corresponding to each fragment, or in the form of temporal waveform data of each segment. Moreover, the above-noted storage formats are not limitative, but may be varied for each fragment, or according to vocal properties (such as nasal, fricative or plosive sounds) of each segment. In the description that follows, the deterministic component data are stored in the format of spectral envelopes, and the stochastic component data are stored in the format of amplitude spectra and phase spectra. With these types of storage format, the required storage capacity can be reduced.
In this manner, in the singing voice synthesizing apparatus of the present invention, the phoneme database 10 stores a plurality of data corresponding to different pitches, dynamics, tempos, and other musical expressions for each of the same phoneme and the same phoneme chain.
Next, the process of synthesizing singing sounds using the phoneme database 10 created as described above will be described with reference to
In
Reference numeral 22 designates a deterministic component adjusting means that, based on control parameters such as pitch, dynamics and tempo that are included in the melody data of the song, adjusts the data of the deterministic component of fragment data read from the phoneme database 10, and reference numeral 23 designates a stochastic component adjusting means that adjusts the data of the stochastic component.
Reference numeral 24 designates a duration time adjusting means that varies the duration time of fragment data output from the deterministic component adjusting means 22 and from the stochastic component adjusting means 23. Reference numeral 25 designates a fragment level adjusting means that adjusts the level of each fragment data output from the duration time adjusting means 24. Reference numeral 26 designates a fragment concatenating means that concatenates individual fragment data, which have been level-adjusted by the fragment level adjusting means 25, into a time series. Reference numeral 27 designates a deterministic component generating means that, based on the deterministic components of fragment data that have been concatenated by the fragment concatenating means 26, generates deterministic components (harmonic components) having a desired pitch. Reference numeral 28 designates an adding means that synthesizes harmonic components generated by the deterministic component generating means 27 and stochastic components output from the fragment concatenating means 26. Voice synthesis can be achieved by transforming the output from this adding means 28 into a time domain signal.
The processing of each of the above-mentioned blocks will be described below.
The phoneme-to-fragment conversion means 21 generates a fragment string from a phoneme string that has been converted based on the input lyrics, and thereupon selectively reads out voice fragments (phonemes or phoneme chains) from the phoneme database 10. As described previously, even for a single phoneme or phoneme chain, a plurality of data (voice fragment data) are stored in the database corresponding respectively to the pitch, dynamics, tempo, etc. When selecting a fragment, the most suitable one is chosen according to the various control parameters.
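By way of illustration only, the selection of the most suitable fragment among several stored for the same phoneme or phoneme chain can be sketched as follows. The description does not specify a selection criterion; the cost function below (log-ratio distances for pitch and tempo, absolute difference for dynamics, equal weights) and the record fields are assumptions made for this sketch.

```python
import math

def select_fragment(candidates, pitch_hz, dynamics, tempo_bpm):
    """Return the candidate whose heading information is closest to the
    requested control parameters (hypothetical cost function)."""
    def cost(frag):
        return (abs(math.log2(frag["pitch_hz"] / pitch_hz))
                + abs(frag["dynamics"] - dynamics)
                + abs(math.log2(frag["tempo_bpm"] / tempo_bpm)))
    return min(candidates, key=cost)

# Two hypothetical recordings of the same phoneme chain "s-a"
candidates = [
    {"name": "s-a/low",  "pitch_hz": 220.0, "dynamics": 0.5, "tempo_bpm": 120.0},
    {"name": "s-a/high", "pitch_hz": 440.0, "dynamics": 0.8, "tempo_bpm": 120.0},
]
best = select_fragment(candidates, pitch_hz=415.0, dynamics=0.7, tempo_bpm=110.0)
```

Here the request lies closer to the higher-pitched recording, so `best` is the "s-a/high" entry.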
Moreover, instead of selecting a fragment, it may be so arranged that several candidates are selected for interpolation to obtain SMS data to be used for synthesis. The selected voice fragments contain deterministic components and stochastic components which are results of the SMS analysis. These deterministic and stochastic components contain SMS data, namely, the spectral envelopes (strength and phase) of the deterministic components, the spectral envelopes (strength and phase) of the stochastic component, and waveforms themselves. Based on these contents, deterministic components and stochastic components are generated so as to match a desired pitch and required duration time. For example, the shapes of spectral envelopes of deterministic and stochastic components are obtained by interpolation or other means and may be varied so as to match the desired pitch.
Adjustment of Deterministic Component
Adjustment of the deterministic component is performed by the deterministic component adjusting means 22.
In the case of a voiced sound, the deterministic component contains strength and phase spectral envelope information, which are the SMS analysis results. In the case of a plurality of fragments, either the fragment most ideally suited for the desired control parameter (such as pitch) is selected, or a spectral envelope suitable for the desired control parameter is obtained by performing an operation such as interpolating the plurality of fragments. In addition, the shape of the obtained spectral envelope may be further changed according to another control parameter by a suitable method.
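By way of illustration only, obtaining a spectral envelope for a desired pitch by interpolating two fragments can be sketched as follows. Linear interpolation weighted by pitch is an assumption of this sketch; the description leaves the interpolation method open, and the function name is hypothetical.

```python
def interpolate_envelopes(env_low, pitch_low, env_high, pitch_high, target_pitch):
    """Linearly blend two amplitude envelopes sampled on the same frequency grid."""
    w = (target_pitch - pitch_low) / (pitch_high - pitch_low)
    w = min(1.0, max(0.0, w))  # clamp so targets outside the range use the nearer envelope
    return [(1.0 - w) * a + w * b for a, b in zip(env_low, env_high)]

# Envelopes of the same vowel recorded at 200 Hz and 400 Hz (hypothetical values)
env = interpolate_envelopes([1.0, 0.5], 200.0, [0.5, 1.0], 400.0, 300.0)
```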
Moreover, to decrease harsh noises, or to give the sound a special characteristic, band pass filtering may be applied to allow components of a certain frequency band to pass.
An unvoiced sound contains no deterministic component.
Adjustment of Stochastic Component
Since the stochastic component from the SMS analysis of a voiced sound remains influenced by its original pitch, an attempt to match the sound to another pitch may result in an unnatural sound. To prevent this, processing needs to be carried out on low frequency stochastic components to achieve matching with the desired pitch. This processing is performed by the stochastic component adjusting means 23.
The processing of adjustment of the stochastic component will be described with reference to
In the case of an unvoiced sound, the above described processing is unnecessary as it is not affected by the original pitch.
The stochastic component thus obtained by the above processing may further be subjected to additional processing (such as changing the shape of the spectral envelope) according to a control parameter. Moreover, to decrease harsh noises, or to give the sound a special characteristic, band pass filtering may be applied to allow components of a certain frequency band to pass.
Adjustment of Duration Time
In the above described processing, the fragments are processed with their original length maintained, so that singing voice synthesis can only be carried out in fixed timing. Therefore, depending on the desired timing, it is necessary to change the duration of the fragment as required. For example, in the case of a phoneme chain, the fragment length can be made shorter by thinning out frames within the fragment, or made longer by adding duplicate frames within the fragment. Moreover, in the case of a single phoneme (the case of an elongated sound), the elongated part can be made shorter by using only some of the frames within the fragment, or made longer by repeating frames within the fragment.
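By way of illustration only, shortening a frame string by thinning out frames and lengthening it by duplicating frames can be sketched with a single index mapping. Spreading the dropped or duplicated frames evenly over the fragment is an assumption of this sketch, and the function name is hypothetical.

```python
def resize_frame_string(frames, target_len):
    """Return a frame string of target_len frames.

    A shorter target thins out frames; a longer target duplicates frames.
    """
    n = len(frames)
    if n == 0 or target_len <= 0:
        return []
    # Map each output position to an input frame index (integer arithmetic)
    return [frames[i * n // target_len] for i in range(target_len)]

frames = list(range(6))                    # stand-ins for six frames of a fragment
shorter = resize_frame_string(frames, 3)   # thinning out
longer = resize_frame_string(frames, 9)    # duplicating
```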
When repeating frames within a fragment of an elongated sound, it is known that noise at the junction between frames can be decreased by repeating in a manner of advancing in one direction, returning in the reverse direction, and then again advancing in the original direction (in other words, looping within a fixed interval or a random interval), rather than repeating in a single direction. However, in the case where the stochastic component has been segmented into frames (of either fixed or variable length) and stored as frequency domain data, there is a problem when attempting to synthesize a waveform by repeating frequency domain frame data in its original format. The reason is that, when proceeding in the reverse direction, the waveform in the frame must also be reversed with respect to time. To generate such a time-reversed waveform from frame data of the original frequency domain, the phase in the frequency domain may be reversed and transformed into the time domain.
A solution to this problem with generation of a time domain waveform from frame data is to pre-process the frame data so that a time-reversed waveform will be generated.
Let the original waveform be designated by f(t) (which, for the sake of simplicity, is assumed to be infinitely continuous) and the time-reversed waveform by g(t), and let their respective Fourier transforms be F(ω) and G(ω). Then g(t)=f(−t) holds, and since f(t) and g(t) are both real functions, the following relation is established:
G(ω)=F(ω)* (where * indicates a complex conjugate)
When expressed in terms of amplitude and phase, the complex conjugate has the same amplitude and a phase of reversed sign. It follows that all phase spectra of the frequency domain frame data should be reversed in order to generate a time-reversed waveform. In this manner, as shown in FIG. 4C, the waveform even within each frame is reversed with respect to time, and noise and distortion are not generated.
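By way of illustration only, the relation between phase reversal and time reversal can be checked numerically for a single frame. The sketch uses a naive DFT on a finite frame, for which the time reversal is circular (sample 0 stays in place and the remaining samples are mirrored); the function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def reverse_phase(spectrum):
    """Keep each amplitude and negate each phase (the complex conjugate)."""
    return [cmath.rect(abs(c), -cmath.phase(c)) for c in spectrum]

x = [0.0, 1.0, 3.0, -2.0, 0.5, 0.0, -1.0, 2.0]
y = idft(reverse_phase(dft(x)))
# Circular time reversal of x: y[t] should equal x[(-t) mod N]
expected = [x[-t % len(x)] for t in range(len(x))]
```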
The duration time adjusting means 24 performs the above described fragment compression (thinning out of frames), expansion (repeating of frames) and looping (in the case of elongated sounds). Through such processing, the duration (or in other words, the length of the frame string) of each read-out fragment can be adjusted to a desired length.
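By way of illustration only, the looping of an elongated sound in alternating directions can be sketched as a sequence of frame indices, each paired with a flag indicating backward traversal, that is, a frame whose phase spectrum would be reversed before transformation into the time domain. Looping over a fixed interval and the treatment of the turning-point frames are assumptions of this sketch, and the function name is hypothetical.

```python
def pingpong_indices(loop_len, total_len):
    """Return (frame_index, is_reversed) pairs for a back-and-forth loop.

    is_reversed marks frames visited while moving backward.
    """
    if loop_len < 2:
        return [(0, False)] * total_len
    out = []
    idx, step = 0, 1
    for _ in range(total_len):
        out.append((idx, step < 0))
        # Turn around at either end of the loop interval
        if idx + step < 0 or idx + step >= loop_len:
            step = -step
        idx += step
    return out

sequence = pingpong_indices(loop_len=3, total_len=7)
```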
Adjustment of Fragment Level
Furthermore, noise may be audible if the disparity between spectral envelope shapes of the deterministic component and the stochastic component is too large at the concatenation boundary where one fragment is connected to another. Performing a smoothing process over a plurality of frames at their concatenation boundaries can eliminate this problem.
This smoothing process will be described with reference to
Since stochastic components are relatively difficult to hear even if there are differences in tone color and level at the fragment concatenation boundary, here, a smoothing process will be performed for deterministic components only. At this time, to make the data easier to process and to simplify the calculations, as shown in
Next, the two fragments of “a-i” and “i-a” as shown in
As shown in
In this manner, noise at the concatenation boundary between fragments can be avoided by multiplying each parameter (each resonance component, in this case) by a cross-fade parameter, and then adding them up.
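By way of illustration only, multiplying each parameter by a cross-fade parameter and adding the results can be sketched as follows. A linear cross-fade is assumed, and the parameter lists (for example, a gain and a center frequency of one resonance component) are hypothetical.

```python
def crossfade(params_a, params_b, num_frames):
    """Blend from params_a to params_b over num_frames frames.

    Each frame is (1 - w) * a + w * b, with w rising linearly from 0 to 1.
    """
    out = []
    for i in range(num_frames):
        w = i / (num_frames - 1) if num_frames > 1 else 1.0
        out.append([(1.0 - w) * a + w * b for a, b in zip(params_a, params_b)])
    return out

# Hypothetical [gain, center frequency in Hz] on either side of the boundary
blended = crossfade([1.0, 100.0], [3.0, 200.0], 3)
```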
Instead of performing the above described cross-fading, the levels of individual deterministic and stochastic components of fragments may be adjusted so as to make the fragment amplitudes before and after the concatenation boundary nearly equal. The level adjustment can be performed by multiplying the amplitude of each fragment by either a constant or time-varying coefficient.
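By way of illustration only, level adjustment with a constant coefficient can be sketched as follows. The sketch assumes that each fragment is represented by one amplitude value per frame and that the following fragment is scaled to meet the preceding one at the boundary; the function name is hypothetical.

```python
def match_level(prev_amps, next_amps):
    """Scale next_amps by a constant so that its first frame matches the
    last frame of prev_amps."""
    if next_amps[0] == 0:
        return list(next_amps)  # nothing sensible to scale against
    coeff = prev_amps[-1] / next_amps[0]
    return [a * coeff for a in next_amps]

adjusted = match_level([0.2, 0.4, 0.8], [0.4, 0.5, 0.3])
```

A time-varying coefficient could instead let the scaling return gradually to 1 away from the boundary.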
An example of level adjustment will now be described for the case where “a-i” and “i-a” are to be concatenated and synthesized similarly to the above case.
Here, the matching of the gain of the gradient component of each of the fragments will be considered.
As shown in
Next, typical samples (of the parameters of the gradient and resonance components) of each of “a” and “i” phonemes are obtained. The “a-i” data of the first and last frames may be used to obtain these typical samples, for example.
Based on these typical samples, a linear interpolation of the value of the parameter, e.g. gain, of the gradient component is performed first. Next, by sequentially adding together the results of the interpolation and the above calculated gain difference, as shown in
Alternatively to the above described method, the level adjustment may be performed, for example, by transforming deterministic component data into waveform data and then adjusting the levels in the time domain.
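The level adjustment by a constant or time-varying coefficient can be illustrated by the following sketch. It assumes, purely for illustration, that each fragment is reduced to one amplitude value per frame, that the following fragment is the one to be scaled, and that a time-varying coefficient returns linearly to 1; the name `match_level` and the parameter `fade_frames` are hypothetical.

```python
def match_level(prev_amps, next_amps, fade_frames=None):
    """Scale next_amps so that its first frame matches the last of prev_amps.

    fade_frames=None applies a constant coefficient to the whole fragment.
    Otherwise the coefficient moves linearly from the matching ratio back
    to 1.0 over fade_frames frames (a time-varying coefficient).
    """
    ratio = prev_amps[-1] / next_amps[0]
    if fade_frames is None:
        return [a * ratio for a in next_amps]
    out = []
    for i, a in enumerate(next_amps):
        t = min(i / fade_frames, 1.0)
        coeff = ratio + (1.0 - ratio) * t
        out.append(a * coeff)
    return out


constant = match_level([1.0, 2.0], [4.0, 4.0, 4.0])
varying = match_level([1.0, 2.0], [4.0, 4.0, 4.0], fade_frames=2)
```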
After the fragment level adjusting means 25 performs the above described smoothing or level adjusting between fragments, the fragment concatenating means 26 concatenates the fragments.
Next, the deterministic component generating means 27 generates a harmonic series that corresponds to the desired pitch, while preserving the obtained deterministic component spectral envelope, whereby the actual deterministic component is obtained. By adding the stochastic component to the actual deterministic component, a synthesized singing sound is obtained, which is then transformed into a time domain signal. For example, in the case where both the deterministic component and the stochastic component are stored as frequency components, the two components are added together, and the resulting sum is subjected to an inverse FFT, windowing, and overlapping, whereby a synthesized waveform is obtained.
It should be noted that the deterministic component and the stochastic component may each be subjected separately to an inverse FFT, windowing, and overlapping, and the thus processed components may then be added together. Moreover, a sine wave corresponding to each harmonic of the deterministic component may be generated and then added to a stochastic component obtained by performing an inverse FFT and applying windowing and overlapping.
In
Deterministic component data included in the fragment data output from the fragment selecting means 34 is fed to the deterministic component adjusting means 22. In the case where a plurality of fragment data have been read out by the fragment selecting means 34, a spectral envelope interpolator 35 within the deterministic component adjusting means 22 performs interpolation so that the search conditions are satisfied, and as necessary, a spectral envelope shaper 36 changes the shape of the spectral envelope according to the control parameters.
On the other hand, stochastic component data included in the fragment data output from the fragment selecting means 34 is input to the stochastic component adjusting means 23. This stochastic component adjusting means 23 is supplied with pitch information from the pitch determining means 33, and as was described with reference to
The deterministic component data from the deterministic component adjusting means 22 and the stochastic component data from the stochastic component adjusting means 23 are input to the duration time adjusting means 24. Then, the duration time adjusting means 24 changes the time length of the fragment according to a sounding time length which is determined by the melody information and the tempo information. As previously described, in the case where the duration time of the fragment is to be made shorter, the time axis compressor-expander 43 performs the process of thinning out frames, and in the case where the duration time is to be made longer, a loop section 42 performs the loop processing described with reference to the
The fragment data whose duration time has been adjusted by the duration time adjusting means 24 is subjected to a level adjusting process by the fragment level adjusting means 25 as described previously with reference to the
The deterministic components (spectral envelope information) of the fragment data concatenated by the fragment concatenating means 26 are input to the deterministic component generating means 27. This deterministic component generating means 27 is supplied with pitch information from the pitch determining means 33, and based on the spectral envelope information, generates harmonic components corresponding to the pitch information from which the actual deterministic component for each frame is obtained.
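The generation of harmonic components from spectral envelope information can be sketched as follows. The sketch assumes that the spectral envelope is given as a list of (frequency, amplitude) breakpoints that are linearly interpolated; this representation and the names `envelope_at` and `harmonic_series` are assumptions made for illustration only.

```python
def envelope_at(envelope, freq):
    """Linearly interpolate a list of (freq_hz, amplitude) breakpoints."""
    if freq <= envelope[0][0]:
        return envelope[0][1]
    for (f_lo, a_lo), (f_hi, a_hi) in zip(envelope, envelope[1:]):
        if freq <= f_hi:
            return a_lo + (a_hi - a_lo) * (freq - f_lo) / (f_hi - f_lo)
    return envelope[-1][1]


def harmonic_series(envelope, pitch_hz, f_max_hz):
    """Sample the envelope at every integer multiple of the pitch."""
    count = int(f_max_hz // pitch_hz)
    return [(k * pitch_hz, envelope_at(envelope, k * pitch_hz))
            for k in range(1, count + 1)]


# The same envelope sampled at two different pitches: the harmonic
# positions change, but the envelope shape is preserved.
env = [(0.0, 1.0), (1000.0, 0.0)]
low = harmonic_series(env, 250.0, 1000.0)
high = harmonic_series(env, 500.0, 1000.0)
```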
Next, the adder 28 synthesizes a frequency domain signal for each frame by combining stochastic component amplitude and phase spectral envelope information from the fragment concatenating means 26 with deterministic component amplitude spectrum information from the deterministic component generating means 27.
Then, the frequency domain signal for each frame thus synthesized is transformed by an inverse Fourier transform means (inverse FFT means) 51 into a time domain waveform signal. Next, a windowing means 52 multiplies the time domain waveform signal by a windowing function that corresponds to the frame length, and an overlap means 53 synthesizes a time waveform signal by overlapping the time domain waveform signals for respective frames.
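The windowing and overlapping steps may be sketched as follows, taking as input the time domain frames that are assumed to have already been produced by the inverse FFT. The choice of a Hann window and of a hop size of half the frame length are assumptions of this sketch; the text above specifies only that the windowing function corresponds to the frame length.

```python
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def overlap_add(frames, hop):
    """Window each time domain frame and sum the frames at hop-sample offsets."""
    frame_len = len(frames[0])
    window = hann(frame_len)
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for m, frame in enumerate(frames):
        start = m * hop
        for i, sample in enumerate(frame):
            out[start + i] += window[i] * sample
    return out


# Three constant frames of length 8 with 50% overlap: in the fully
# overlapped region the windows sum to one, so the constant is preserved.
signal = overlap_add([[1.0] * 8] * 3, hop=4)
```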
Then, a D/A conversion means 54 converts the thus-synthesized time waveform signal into an analog signal that is output via an amplifier 55 to a speaker 56 to be sounded therefrom.
The phoneme database 10 is loaded into the ROM 62 or the RAM 63. A singing sound is synthesized in the above described manner according to the data input by the lyric-melody input unit 66 and the control parameter input unit 67, and is output from the speaker 71.
The construction of the hardware apparatus of
In the above described embodiment, the fragment data stored in the database 10 is SMS data, which is typically comprised of a spectral envelope of the deterministic component for each unit time (frame), and amplitude and phase spectral envelopes of the stochastic component for each frame. As described above, by storing fragment data of elongated sounds, such as long vowels, a high-quality singing sound can be synthesized. However, especially in the case of elongated sounds, there is the problem of large data sizes due to the storage of deterministic and stochastic components for each time instance (frame) during the interval of the elongated sound.
In the case of deterministic components, it is sufficient to store data for each frequency that is an integer multiple of the fundamental pitch. For example, if the fundamental pitch is 150 Hz and the maximum frequency is 22025 Hz, amplitude (or phase) data need be stored only for the harmonics of 150 Hz up to the maximum frequency, that is, for 146 frequencies. On the other hand, in the case of stochastic components, a much larger quantity of data is required, that is, the amplitude spectral envelope and phase spectral envelope must be stored for all frequencies. If 1024 points are sampled within a frame, the amplitude and phase data for 1024 frequencies are required. Especially in the case of elongated sounds, the quantity of data becomes extremely large since data must be stored for all frames within the interval of the elongated sound. Moreover, the data of the elongated sound interval must be provided for each individual phoneme, and as described above, the data should desirably be provided for each of various pitches to increase naturalness, but this leads to a further increase in the quantity of data in the database.
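The difference in data quantity per frame can be checked with the following arithmetic, using the figures given above (150 Hz, 22025 Hz, 1024 points). The variable names are illustrative only.

```python
pitch_hz = 150.0
f_max_hz = 22025.0
points_per_frame = 1024

# Deterministic component: one value per harmonic of the fundamental pitch.
harmonics_per_frame = int(f_max_hz // pitch_hz)

# Stochastic component: amplitude and phase for every sampled frequency.
stochastic_values_per_frame = 2 * points_per_frame

ratio = stochastic_values_per_frame / harmonics_per_frame
```

Under these figures the stochastic component requires roughly fourteen times as many values per frame as the amplitude data of the deterministic component.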
Therefore, another embodiment of the present invention, which enables the size of the database to be made extremely small, will be described below. According to this embodiment, a means is added for whitening the spectral envelope when storing stochastic component data of elongated sounds to generate the database 10. Also, a means for generating a stochastic component spectral envelope during synthesis of a singing sound is provided within the stochastic component adjusting means. Thus, the data size can be reduced because it is unnecessary to store individual spectral envelopes of the stochastic components of elongated sounds.
Moreover, in the case of an elongated sound, each frequency component in each frame within a certain interval to be processed has a slight fluctuation that is important. The degree of this fluctuation is not considered to change much even when a vowel changes. Therefore, an amplitude spectral envelope of a stochastic component is flattened in advance by some means (whitening) to eliminate the influence of the tone color of the original vowel. The spectrum appears flat due to the whitening. Then, at the time of synthesis, a spectral envelope of the stochastic component is determined based on the shape of the spectral envelope of the deterministic component, and the determined stochastic component spectral envelope is multiplied by the whitened spectral envelope to obtain an amplitude spectrum of the stochastic component. In other words, only the spectral envelope of the stochastic component is generated based on the deterministic component spectral envelope, while the phase included in the original stochastic component of the elongated sound is used as it is. In this manner, stochastic components of different elongated vowel sound data can be generated based on whitened elongated sound data.
As previously noted, the stochastic component amplitude spectrum of an elongated sound is whitened by this spectral whitening means 80, and appears flat. However, at this time, the spectral envelopes of all frames within an interval for processing are not made completely flat (i.e. not the same spectral value at all frequencies). It is important that the small temporal fluctuations of each frequency be retained while making the spectral envelope shape in each frame nearly flat. To this end, as shown in
Here, a typical envelope of an amplitude spectrum within the interval may also be generated, for example, by calculating an average value of the amplitude spectrum for each frequency and using those average values as the typical spectral envelope. Alternatively, the maximum value of each frequency component within the interval may be used as the typical spectral envelope.
As a result, whitened amplitude spectra can be obtained from the filter 83. Moreover, the phase spectra are stored directly as stochastic component information of the fragment.
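The whitening described above can be sketched as follows: a typical spectral envelope is obtained over the interval (as the average or the maximum for each frequency), and every frame is multiplied by its inverse, so that the overall shape becomes flat while the frame-to-frame fluctuation at each frequency is retained. The function names and the small `floor` value used to avoid division by zero are assumptions of this sketch.

```python
def typical_envelope(frames, mode="mean"):
    """Typical amplitude for each frequency over all frames in the interval."""
    n_bins = len(frames[0])
    if mode == "mean":
        return [sum(f[k] for f in frames) / len(frames) for k in range(n_bins)]
    if mode == "max":
        return [max(f[k] for f in frames) for k in range(n_bins)]
    raise ValueError("mode must be 'mean' or 'max'")


def whiten(frames, mode="mean", floor=1e-12):
    """Multiply each frame by the inverse of the typical envelope."""
    inverse = [1.0 / max(e, floor) for e in typical_envelope(frames, mode)]
    return [[a * g for a, g in zip(frame, inverse)] for frame in frames]


# Two frames, two frequencies: the first frequency fluctuates, the second does not.
amplitudes = [[2.0, 8.0], [4.0, 8.0]]
white_mean = whiten(amplitudes, "mean")
white_max = whiten(amplitudes, "max")
```

With the maximum as the typical envelope, no whitened value exceeds 1; with the average, the whitened values of each frequency average to 1.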
In this manner, the stochastic component of an elongated sound is whitened, and the spectral envelope of the deterministic component is used during synthesis to generate the stochastic component. Therefore, a whitened stochastic component obtained from any one vowel can be used in common for all vowels. In other words, in the case of a vowel, a single whitened stochastic component of an elongated sound is sufficient. Of course, a plurality of whitened stochastic components may be provided.
When the whitened stochastic component of an elongated sound is read out from the phoneme database 10, the spectral envelope generating means 90 calculates the amplitude spectral envelope of the stochastic component based on the spectral envelope of the deterministic component, as described above. For example, one possible method assumes that the component at the maximum frequency does not change, and determines the amplitude spectral envelope of the stochastic component by changing only the gradient of the spectral envelope.
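One way to realize such a gradient change is sketched below. The sketch assumes that the envelopes are expressed in decibels at a list of ascending frequencies ending at the maximum frequency, and that the gradient is changed by adding a straight line that is zero at the maximum frequency; these choices and the function names are assumptions made for illustration and are not taken from the embodiment.

```python
def tilt_envelope_db(det_env_db, freqs_hz, gain_change_at_0hz_db):
    """Derive a stochastic envelope from the deterministic envelope.

    A linear tilt is added that equals gain_change_at_0hz_db at 0 Hz and
    0 dB at the maximum frequency, so the top component is unchanged.
    """
    f_max = freqs_hz[-1]
    return [d + gain_change_at_0hz_db * (1.0 - f / f_max)
            for d, f in zip(det_env_db, freqs_hz)]


def stochastic_amplitudes(env_db, whitened):
    """Multiply the envelope (converted to linear gain) by the whitened spectrum."""
    return [10.0 ** (e / 20.0) * w for e, w in zip(env_db, whitened)]


stoch_env_db = tilt_envelope_db([0.0, -10.0, -20.0], [0.0, 500.0, 1000.0], 6.0)
frame_amps = stochastic_amplitudes([0.0, -20.0], [1.5, 0.5])
```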
Then, the determined amplitude spectral envelope, together with the phase spectrum of the stochastic component that has been read at the same time, are input to the stochastic component adjusting means 23. The subsequent processing is the same as was illustrated in
As described above, when the amplitude spectra of stochastic components of elongated sounds are to be whitened and stored, the whitened amplitude spectra of stochastic components of some of the elongated sounds may be stored, while the amplitude spectra of stochastic components of the other elongated sounds are not stored.
In this case, if one of the other elongated sounds is to be synthesized, the amplitude spectra of the stochastic components of this elongated sound are not included in the fragment data of the elongated sound. Therefore, a phoneme that most closely resembles the phoneme to be synthesized is extracted from the database. Using the stochastic components of the elongated sound of the extracted phoneme, amplitude spectra of the stochastic components may be generated in the above described manner.
Moreover, phonemes from which elongated sounds can be generated may be divided into one or more groups, and amplitude spectra of the stochastic components may be generated in the above described manner using one piece of elongated sound data belonging to the group to which the phoneme to be synthesized belongs.
Further, when using the amplitude spectra of stochastic components obtained from the whitened amplitude spectra and the amplitude spectra of deterministic components, all or a part of the frequency axes of the stochastic component phase spectra are shifted so that data indicative of harmonics and their vicinities corresponding to the pitch of the original data becomes indicative of harmonics and their vicinities corresponding to the desired pitch at which the sound is to be reproduced. In other words, a more natural synthesized sound can be obtained by using the phase data indicative of harmonics and their vicinities as it is during synthesis.
According to this embodiment, the database does not have to store an elongated sound stochastic component for every vowel, and therefore the quantity of data can be reduced.
Furthermore, in the case where the spectral envelope of the stochastic component is determined by changing only the gradient of this spectral envelope, the “degree of huskiness” of the synthesized voice can be controlled by correlating the change in gradient with huskiness.
More specifically, the synthesized voice will be husky if it contains many stochastic components, and will be smooth if it contains few stochastic components. Therefore, if the gradient is steep (the gain at 0 Hz is large), the voice will be husky, and if the gradient is slight (the gain at 0 Hz is small), the voice will be smooth. Therefore, as shown in
It is also possible to model the spectral envelope of the deterministic component in a suitable manner and to correlate a parameter of the model with the degree of huskiness. For example, the spectral envelope of the stochastic component may be calculated by correlating the degree of huskiness with one of the parameters (a parameter related to gradient) used in formularizing the spectral envelope of the deterministic component, and changing that parameter.
Furthermore, the degree of huskiness may be constant or may be varied over time. In the case of time-varying huskiness, an interesting effect can be obtained wherein a voice becomes gradually more husky during the elongation of a phoneme.
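A time-varying degree of huskiness may be sketched as follows. The sketch assumes a huskiness value between 0 (smooth) and 1 (husky) that is mapped linearly onto the gain at 0 Hz, in line with the relationship stated above that a larger gain at 0 Hz gives a huskier voice. The decibel range and the function names are hypothetical.

```python
def huskiness_ramp(n_frames, start, end):
    """Huskiness for each frame, changing linearly over the elongated sound."""
    if n_frames == 1:
        return [start]
    return [start + (end - start) * i / (n_frames - 1) for i in range(n_frames)]


def gain_at_0hz_db(huskiness, smooth_db=-6.0, husky_db=6.0):
    """Map a huskiness value in [0, 1] to the gain at 0 Hz (assumed range)."""
    return smooth_db + (husky_db - smooth_db) * huskiness


# A voice that becomes gradually more husky over five frames.
ramp = huskiness_ramp(5, 0.0, 1.0)
gains_db = [gain_at_0hz_db(h) for h in ramp]
```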
Moreover, if the sole purpose is to control the degree of huskiness, it is unnecessary to store the whitened amplitude spectrum of a stochastic component in the phoneme database 10 as described above. As in the first embodiment described above, the amplitude spectrum of the stochastic component of an elongated sound is stored as it is, as for other fragments. During synthesis, a flat spectrum is generated by obtaining a typical amplitude spectrum within the elongated sound interval and multiplying the amplitude spectrum of the stochastic component by the inverse thereof. Then, based on the amplitude spectrum of the deterministic component, the amplitude spectral envelope of the stochastic component is calculated according to the parameter that controls the degree of huskiness. The flat spectrum is then multiplied by the calculated amplitude spectral envelope to obtain the amplitude spectrum of the stochastic component.
Number | Date | Country | Kind |
---|---|---|---|
2000-401041 | Dec 2000 | JP | national |
Number | Date | Country | |
---|---|---|---|
20030009336 A1 | Jan 2003 | US |