1. Technical Field
The present invention relates to a technology of synthesizing voices with various characteristics.
2. Related Art
Conventionally, various technologies have been proposed for applying effects to voices. For example, Japanese Non-examined Patent Publication No. 10-78776 (paragraph 0013 and FIG. 1) discloses a technology that converts the pitch of a voice used as material (hereafter referred to as a "source voice") to generate a concord sound (a voice constituting a chord with the source voice) and adds the concord sound to the source voice for output. Even when only one utterer vocalizes the source voice, this technology can output voices that sound as if multiple persons sang individual melodies in chorus. When the source voice represents a musical instrument's sound, the technology generates voices that sound as if multiple musical instruments were played in concert.
Types of chorus and ensemble include a general chorus, in which multiple performers sing or play individual melodies, and a unison, in which multiple performers sing or play the same melody. The technology described in Japanese Non-examined Patent Publication No. 10-78776 generates a concord sound by converting the source voice pitch. Accordingly, the technology can generate a voice simulating individual melodies sung or played by multiple performers, but cannot provide the source voice with a unison effect in which a common melody is sung or played by multiple performers. The technology described in Japanese Non-examined Patent Publication No. 10-78776 can also output the source voice together with a voice whose acoustic characteristic (voice quality) alone is converted, without changing the source voice pitch. In this manner, it is possible, to some extent, to provide an effect of a common melody sung or played by multiple performers. In this case, however, a scheme for converting source voice characteristics must be provided for each of the voices constituting the unison. Consequently, an attempt to provide a unison composed of many performers enlarges the circuit scale in a configuration that converts source voice characteristics using hardware such as a DSP (Digital Signal Processor), and subjects the processor to excessive processing loads in a configuration that performs this conversion in software. The present invention has been made in consideration of the foregoing.
It is therefore an object of the present invention to synthesize an output voice composed of multiple voices using a simple configuration.
To achieve this object, a voice synthesizer according to the present invention comprises: a data acquisition portion for successively obtaining phonetic entity data (e.g., lyrics data in the embodiment) specifying a phonetic entity; an envelope acquisition portion for obtaining a spectral envelope of a voice segment corresponding to a phonetic entity specified by the phonetic entity data out of a plurality of voice segments corresponding to different phonetic entities; a spectrum acquisition portion for obtaining a conversion spectrum, i.e., a collective frequency spectrum of a conversion voice containing a plurality of parallel generated voices; an envelope adjustment portion for adjusting a spectral envelope of the conversion spectrum obtained by the spectrum acquisition portion so as to approximately match the spectral envelope obtained by the envelope acquisition portion; and a voice generation portion for generating an output voice signal from the conversion spectrum adjusted by the envelope adjustment portion. The term "voice" in the present invention includes various sounds such as a human voice and a musical instrument sound.
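The data flow among these portions can be pictured with a short sketch. The following Python fragment is a hypothetical rendering of the claimed pipeline for a single frame; all function and variable names, and the moving-average envelope estimate, are assumptions introduced for illustration rather than details taken from the patent.

```python
import numpy as np

def smooth(spectrum, width=21):
    """Crude spectral-envelope estimate via a moving average of magnitudes."""
    kernel = np.ones(width) / width
    return np.convolve(spectrum, kernel, mode="same")

def synthesize_frame(phonetic_entity, segment_envelopes, conversion_spectrum):
    """Hypothetical sketch of one frame of the claimed pipeline.

    segment_envelopes: dict mapping a phonetic entity to its spectral
                       envelope, sampled on the same frequency grid as
                       conversion_spectrum (both 1-D magnitude arrays).
    conversion_spectrum: collective magnitude spectrum of a voice in which
                         many utterers vocalize in parallel (unison).
    """
    # Envelope acquisition: look up the envelope of the voice segment
    # corresponding to the specified phonetic entity.
    target_envelope = segment_envelopes[phonetic_entity]

    # Envelope adjustment: rescale the conversion spectrum so that its
    # (smoothed) envelope approximately matches the target envelope.
    current_envelope = np.maximum(smooth(conversion_spectrum), 1e-12)
    adjusted = conversion_spectrum * (target_envelope / current_envelope)

    # Voice generation: return the adjusted spectrum; a later stage would
    # convert it back to a time-domain output voice signal.
    return adjusted

# Toy usage with synthetic data on a 256-bin grid:
grid = np.linspace(0.0, 1.0, 256)
envelopes = {"a": 1.0 - 0.5 * grid}
out = synthesize_frame("a", envelopes, np.abs(np.random.randn(256)))
```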
According to this configuration, the collective spectral envelope of the conversion voice containing multiple parallel vocalized voices is adjusted so as to approximately match the spectral envelope of a source voice collected as a voice segment. Accordingly, it is possible to generate an output voice signal of multiple voices (i.e., a choir sound or ensemble sound) having the voice segment's phonetic entity. In principle, there is no need to provide an independent element for converting a voice segment property for each of the multiple voices contained in the output voice indicated by the output voice signal. The configuration of the inventive voice synthesizer is therefore greatly simplified in comparison with the configuration described in Japanese Non-examined Patent Publication No. 10-78776. In other words, it is possible to synthesize an output voice composed of many voices without complicating the configuration of the voice synthesizer.
The term “voice segment” in the present invention represents the concept including both a phoneme and a phoneme concatenation composed of multiple concatenated phonemes. The phoneme is an audibly distinguishable minimum unit of voice (typically the human voice). The phoneme is classified into a consonant (e.g., “s”) and a vowel (e.g., “a”). The phoneme concatenation is an alternate concatenation of multiple phonemes corresponding to vowels or consonants along the time axis such as a combination of a consonant and a succeeding vowel (e.g., [s_a]), a combination of a vowel and a succeeding consonant (e.g., [i_t]), and a combination of a vowel and a succeeding vowel (e.g., [a_i]). The voice segment can be provided in any mode. For example, the voice segment may be presented as waveforms in a time domain (time axis) or spectra in a frequency domain (frequency axis).
When a sound is actually generated based on an output voice signal generated from the frequency spectrum adjusted by the envelope adjustment portion, the voice's phonetic entity may approximate (ideally match) the voice segment's phonetic entity to such a degree that they are audibly sensed as the same. In this case, the voice segment's spectral envelope is assumed to "approximately match" the conversion spectrum's spectral envelope. Therefore, it is not always necessary to ensure strict correspondence between the voice segment's spectral envelope and the spectral envelope of the conversion voice adjusted by the envelope adjustment portion.
In the voice synthesizer according to the present invention, an output voice signal generated by the voice generation portion is supplied to a sound generation device such as a speaker or an earphone and is output as an output voice. This output voice signal can be used in any mode. For example, the output voice signal may be stored on a recording medium, and another apparatus may reproduce the stored signal to output the output voice. Further, the output voice signal may be transmitted to another apparatus via a communication line, and that apparatus may reproduce the output voice signal as a voice.
In the voice synthesizer according to the present invention, the envelope acquisition portion may use any method to obtain the voice segment's spectral envelope. For example, there may be a configuration provided with a storage portion for storing a spectral envelope corresponding to each of multiple voice segments. In this configuration, the envelope acquisition portion reads, from the storage portion, the spectral envelope of the voice segment corresponding to the phonetic entity specified by the phonetic entity data (first embodiment). This configuration provides the advantage of simplifying the process of obtaining the voice segment's spectral envelope. There may be another configuration provided with a storage portion for storing a frequency spectrum corresponding to each of multiple voice segments. In this configuration, the envelope acquisition portion reads, from the storage portion, the frequency spectrum of the voice segment corresponding to the phonetic entity specified by the phonetic entity data and extracts a spectral envelope from this frequency spectrum.
In preferred embodiments of the present invention, the spectrum acquisition portion obtains a conversion spectrum of the conversion voice corresponding to the phonetic entity specified by the phonetic entity data out of multiple conversion voices vocalized with different phonetic entities. In this mode, the conversion voice serving as a basis for output voice signal generation is selected from conversion voices with multiple phonetic entities. Consequently, more natural output voices can be generated than in the configuration where an output voice signal is generated from a conversion voice with a single phonetic entity.
According to another mode of the present invention, the voice synthesizer further comprises a pitch acquisition portion for obtaining pitch data (e.g., musical note data in the embodiment) specifying a pitch, and a pitch conversion portion for varying the frequency of each peak contained in the conversion spectrum obtained by the spectrum acquisition portion. The envelope adjustment portion adjusts the spectral envelope of the conversion spectrum processed by the pitch conversion portion. According to this mode, the pitch of the output voice signal can be appropriately specified in accordance with the pitch data. Any method of changing the frequency of each peak contained in the conversion spectrum (i.e., any method of changing the conversion voice's pitch) may be used. For example, the pitch conversion portion extends or contracts the conversion spectrum along the frequency axis in accordance with the pitch specified by the pitch data. This mode can adjust the conversion spectrum's pitch using a simple process of multiplying each frequency of the conversion spectrum by a numeric value corresponding to the intended pitch. In still another mode, the pitch conversion portion moves each spectrum distribution region containing a peak's frequency in the conversion spectrum along the frequency axis in accordance with the pitch specified by the pitch data.
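As an illustration of extending or contracting the conversion spectrum along the frequency axis, the sketch below rescales a magnitude spectrum by the ratio of the intended pitch to the original pitch. The resampling by linear interpolation and the toy spectrum are assumptions; the patent specifies only that each frequency is multiplied by a value corresponding to the intended pitch.

```python
import numpy as np

def scale_spectrum(magnitudes, freqs, source_pitch, target_pitch):
    """Stretch or shrink a magnitude spectrum along the frequency axis.

    Every frequency f of the conversion spectrum is mapped to
    f * (target_pitch / source_pitch); the result is resampled back onto
    the original frequency grid by linear interpolation.
    """
    ratio = target_pitch / source_pitch
    scaled_freqs = freqs * ratio
    # Values outside the original band are treated as silence (0.0).
    return np.interp(freqs, scaled_freqs, magnitudes, left=0.0, right=0.0)

# Example: raise a unison spectrum from A3 (220 Hz) to E4 (approx. 329.6 Hz).
freqs = np.linspace(0, 8000, 1024)
spectrum = np.exp(-((freqs % 220) / 30.0) ** 2)   # toy spectrum with 220 Hz peaks
shifted = scale_spectrum(spectrum, freqs, 220.0, 329.6)
```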
Any configuration for changing the output voice pitch may be employed. For example, there may be a configuration provided with the pitch acquisition portion for obtaining pitch data specifying a pitch. In this configuration, the spectrum acquisition portion may obtain the conversion spectrum of the conversion voice with a pitch approximating (ideally matching) the pitch specified by the pitch data out of multiple conversion voices with different pitches.
According to a preferred mode of the present invention, the envelope acquisition portion obtains a spectral envelope for each frame resulting from dividing a voice segment along the time axis. The envelope acquisition portion interpolates between the spectral envelope in the last frame of one voice segment and the spectral envelope in the first frame of the voice segment that follows it, to generate a spectral envelope of the voice corresponding to the gap between both frames. This mode can generate an output voice of any time duration.
Multiple singers or players may generate voices simultaneously (in parallel) at approximately the same pitch. In the frequency spectrum of such voices, the bandwidth of each peak (e.g., bandwidth W2) is larger than the bandwidth of the corresponding peak in the frequency spectrum of a voice generated by a single utterer. In view of this, according to another mode of the present invention, the spectrum acquisition portion obtains either a first conversion spectrum, i.e., a frequency spectrum of a conversion voice, or a second conversion spectrum, which is a frequency spectrum of a voice having almost the same pitch as the conversion voice indicated by the first conversion spectrum and whose peaks have a larger width than those of the first conversion spectrum.
This configuration selects one of the first and second conversion spectra as the frequency spectrum for generating an output voice signal. It is therefore possible to selectively generate an output voice signal having characteristics corresponding to the first conversion spectrum or an output voice signal having characteristics corresponding to the second conversion spectrum. For example, when the first conversion spectrum is selected, it is possible to generate an output voice as if produced by a single singer or a few singers. When the second conversion spectrum is selected, it is possible to generate an output voice as if produced by multiple singers or players. In addition to the first and second conversion spectra, other conversion spectra may be provided for selection by the selection portion. According to a possible configuration, for example, a storage portion may store three or more types of conversion spectra with different peak bandwidths, and the spectrum acquisition portion may select any of these conversion spectra for use in generating output voice signals.
The voice synthesizer according to the present invention is implemented not only by hardware dedicated to voice synthesis such as a DSP, but also by the cooperation of a computer such as a personal computer with a program. The inventive program allows a computer to perform: a data acquisition process of successively obtaining phonetic entity data specifying a phonetic entity; an envelope acquisition process of obtaining a spectral envelope of a voice segment corresponding to a phonetic entity specified by the phonetic entity data out of a plurality of voice segments corresponding to different phonetic entities; a spectrum acquisition process of obtaining a conversion spectrum, i.e., a collective frequency spectrum of a conversion voice containing a plurality of parallel generated voices; an envelope adjustment process of adjusting a spectral envelope of the conversion spectrum obtained by the spectrum acquisition process so as to approximately match the spectral envelope obtained by the envelope acquisition process; and a voice generation process of generating an output voice signal from the conversion spectrum adjusted by the envelope adjustment process.
An inventive program according to another mode allows a computer to perform: a data acquisition process of successively obtaining phonetic entity data specifying a phonetic entity; an envelope acquisition process of obtaining a spectral envelope of a voice segment identified as corresponding to the phonetic entity specified by the phonetic entity data out of a plurality of voice segments corresponding to different phonetic entities; a spectrum acquisition process of obtaining one of a first conversion spectrum, i.e., a frequency spectrum of a conversion voice, and a second conversion spectrum, which is a frequency spectrum of a voice having almost the same pitch as the conversion voice indicated by the first conversion spectrum and which has a peak width larger than that of the first conversion spectrum; an envelope adjustment process of adjusting a spectral envelope of the conversion spectrum obtained by the spectrum acquisition process so as to approximately match the spectral envelope obtained by the envelope acquisition process; and a voice generation process of generating an output voice signal from the conversion spectrum adjusted by the envelope adjustment process. These programs are stored on a computer-readable recording medium (e.g., a CD-ROM) and supplied to users for installation on computers. The programs may also be delivered via a network from a server apparatus for installation on computers.
Further, the present invention is also specified as a method for synthesizing voices. The method comprises the steps of: successively obtaining phonetic entity data specifying a phonetic entity; obtaining a spectral envelope of a voice segment identified as corresponding to the phonetic entity specified by the phonetic entity data out of a plurality of voice segments corresponding to different phonetic entities; obtaining a conversion spectrum, i.e., a collective frequency spectrum of a conversion voice containing a plurality of parallel generated voices; adjusting a spectral envelope of the conversion spectrum obtained by the spectrum acquisition step so as to approximately match the spectral envelope obtained by the envelope acquisition step; and generating an output voice signal from the conversion spectrum adjusted by the envelope adjustment step.
A voice synthesis method based on another aspect of the invention comprises the steps of: successively obtaining phonetic entity data specifying a phonetic entity; obtaining a spectral envelope of a voice segment corresponding to the phonetic entity specified by the phonetic entity data out of a plurality of voice segments corresponding to different phonetic entities; obtaining one of a first conversion spectrum, i.e., a frequency spectrum of a conversion voice, and a second conversion spectrum, which is a frequency spectrum of another conversion voice having almost the same pitch as the conversion voice indicated by the first conversion spectrum and which has a peak width larger than that of the first conversion spectrum; adjusting a spectral envelope of the conversion spectrum obtained at the spectrum acquisition step so as to approximately match the spectral envelope obtained at the envelope acquisition step; and generating an output voice signal from the conversion spectrum adjusted at the envelope adjustment step.
As mentioned above, the present invention can use a simple configuration to synthesize an output voice composed of multiple voices.
The following describes an embodiment that applies the present invention to an apparatus for synthesizing singing sounds of a musical composition.
The data acquisition means 5 successively obtains lyrics data specifying phonetic entities of a musical composition and musical note data specifying pitch P0 and time duration T0 for each note.
The storage means 55 stores envelope data Dev for each voice segment. Envelope data Dev indicates the spectral envelope of a frequency spectrum of a voice segment previously collected from a source voice or reference voice. Such envelope data Dev is created by a data creation apparatus D2 described below.
The data creation apparatus D2 has an FFT portion 92 and a feature extraction portion 93.
The FFT portion 92 divides source voice signal V0 into frames of a specified time duration (e.g., 5 to 10 ms) and performs frequency analysis, including an FFT process, on source voice signal V0 on a frame basis to detect frequency spectrum SP0. The frames of source voice signal V0 are selected so as to overlap one another along the time axis. The embodiment assumes a voice vocalized by one utterer to be the source voice.
The feature extraction portion 93 extracts spectral envelope EV0 from frequency spectrum SP0 detected for each frame and stores it in the storage means 55 as envelope data Dev of the corresponding voice segment.
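What the FFT portion 92 and the feature extraction portion 93 might do can be sketched as follows. The window choice and the cepstral smoothing used to estimate the envelope are assumptions; the description above states only that overlapping frames of roughly 5 to 10 ms are analyzed by FFT and that spectral envelope EV0 is extracted from each frame's frequency spectrum SP0.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames along the time axis."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])

def spectral_envelope(frame, fs):
    """Estimate a spectral envelope from one frame (assumed method:
    magnitude FFT followed by cepstral low-pass smoothing)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))   # SP0
    log_mag = np.log(np.maximum(spectrum, 1e-12))
    cepstrum = np.fft.irfft(log_mag)
    lifter = np.zeros_like(cepstrum)
    cutoff = int(fs / 1000)              # keep only the slowly varying part
    lifter[:cutoff] = 1.0
    lifter[-cutoff + 1:] = 1.0
    smooth_log = np.fft.rfft(cepstrum * lifter).real
    return np.exp(smooth_log)            # EV0 for this frame

fs = 44100
frame_len, hop = int(0.008 * fs), int(0.004 * fs)   # 8 ms frames, 50% overlap
source = np.random.randn(fs)                         # stand-in for source voice V0
envelopes = [spectral_envelope(f, fs) for f in frame_signal(source, frame_len, hop)]
```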
The envelope acquisition means 10 has a voice segment selection portion 11 and an interpolating portion 12. The voice segment selection portion 11 selects, out of the envelope data Dev stored in the storage means 55, the envelope data Dev of the voice segment corresponding to the phonetic entity specified by the lyrics data, and reads it frame by frame.
The spectrum conversion means 20 generates new spectrum SPnew from conversion spectrum SPt acquired by the spectrum acquisition means 30 and spectral envelope EV0 acquired by the envelope acquisition means 10; its configuration and operations are described later.
The spectrum acquisition means 30 provides means for acquiring conversion spectrum SPt and has an FFT portion 31, a peak detection portion 32, and a data generation portion 33. The FFT portion 31 is supplied with conversion voice signal Vt read from the storage means 50. The conversion voice signal Vt is a time-domain signal representing a conversion voice waveform over a specific interval and is stored in the storage means 50 beforehand. Similarly to the FFT portion 92 of the data creation apparatus D2, the FFT portion 31 performs frequency analysis, including an FFT process, on conversion voice signal Vt on a frame basis to detect conversion spectrum SPt. The peak detection portion 32 detects the peaks pt of conversion spectrum SPt.
The embodiment assumes a case where many utterers generate voices at approximately the same pitch Pt (i.e., unison voices for a choir or ensemble), a sound pickup device such as a microphone picks up the voices to generate a collective signal, and the storage means 50 stores this collective signal as conversion voice signal Vt. The FFT process is applied to such a conversion voice signal Vt to produce conversion spectrum SPt. Each peak of conversion spectrum SPt therefore has a larger bandwidth (e.g., bandwidth W2) than the corresponding peak of the frequency spectrum of a voice generated by a single utterer.
The data generation portion 33 generates conversion spectrum data Dt indicating conversion spectrum SPt. Conversion spectrum data Dt is a set of unit data Ut, each containing a frequency Ft and its spectrum intensity Mt; each piece of unit data Ut corresponding to a peak pt detected by the peak detection portion 32 is provided with an indicator A. The spectrum acquisition means 30 outputs conversion spectrum data Dt to the spectrum conversion means 20 on a frame basis.
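A minimal sketch of the peak detection portion 32 and the data generation portion 33 might look like this. The local-maximum test, the threshold, and the record layout are assumptions; the only constraint taken from the description is that each piece of unit data Ut carries a frequency Ft and an intensity Mt and that unit data corresponding to a detected peak is marked with indicator A.

```python
import numpy as np

def build_unit_data(magnitudes, freqs, threshold_ratio=0.1):
    """Build conversion spectrum data Dt as a list of unit data Ut.

    Each record holds a frequency Ft, an intensity Mt, and a flag that
    plays the role of indicator A for spectrum peaks pt.
    """
    peak_floor = threshold_ratio * magnitudes.max()
    unit_data = []
    for i, (f, m) in enumerate(zip(freqs, magnitudes)):
        is_peak = (
            0 < i < len(magnitudes) - 1
            and m > magnitudes[i - 1]
            and m > magnitudes[i + 1]
            and m >= peak_floor
        )
        unit_data.append({"Ft": f, "Mt": m, "indicator_A": is_peak})
    return unit_data

freqs = np.linspace(0, 8000, 512)
magnitudes = np.abs(np.sin(freqs / 180.0)) + 0.05      # toy conversion spectrum SPt
Dt = build_unit_data(magnitudes, freqs)
peaks = [u for u in Dt if u["indicator_A"]]
```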
The following describes the configuration and operations of the spectrum conversion means 20. The spectrum conversion means 20 has a pitch conversion portion 21 and an envelope adjustment portion 22.
The pitch conversion portion 21 converts pitch Pt of conversion spectrum SPt into pitch P0 indicated by the musical note data. More specifically, the pitch conversion portion 21 multiplies frequency Ft contained in each piece of unit data Ut of conversion spectrum data Dt by the ratio P0/Pt, thereby extending or contracting conversion spectrum SPt in the frequency axis direction.
The envelope adjustment portion 22 adjusts the spectral envelope of conversion spectrum SPt processed by the pitch conversion portion 21 so that it approximately matches spectral envelope EV0 indicated by the envelope data Dev supplied from the envelope acquisition means 10.
The envelope adjustment portion 22 first selects one piece of unit data Ut provided with the indicator A out of conversion spectrum data Dt. This unit data Ut contains frequency Ft and spectrum intensity Mt of a peak pt (hereafter referred to as the "focused peak pt") of conversion spectrum SPt. The envelope adjustment portion 22 then adjusts spectrum intensity Mt of the focused peak pt so that the spectral envelope of conversion spectrum SPt approximately matches spectral envelope EV0 at frequency Ft. This adjustment is repeated for every piece of unit data Ut provided with the indicator A, whereby new spectrum SPnew is generated.
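One conceivable realization of this adjustment, assuming the intensities in a small neighborhood around each focused peak pt are rescaled to track spectral envelope EV0 at frequency Ft, is sketched below. The neighborhood width and the ratio-based scaling are assumptions, not details given in the description.

```python
import numpy as np

def adjust_envelope(magnitudes, peak_indices, target_envelope, half_width=5):
    """Rescale the region around each focused peak so that the resulting
    envelope approximately matches the target envelope EV0."""
    adjusted = magnitudes.copy()
    for i in peak_indices:
        gain = target_envelope[i] / max(magnitudes[i], 1e-12)
        lo = max(0, i - half_width)
        hi = min(len(magnitudes), i + half_width + 1)
        # Scale the peak and its neighbouring spectra by the same factor so
        # the peak keeps its local shape but sits on the new envelope.
        adjusted[lo:hi] = magnitudes[lo:hi] * gain
    return adjusted
```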
The pitch conversion portion 21 and the envelope adjustment portion 22 perform their processes for each frame resulting from dividing source voice signal V0 and conversion voice signal Vt. The total number of frames for the conversion voice is limited by the time duration of conversion voice signal Vt stored in the storage means 50. By contrast, time duration T0 indicated by the musical note data varies with the contents of the musical composition, so in many cases the total number of frames for the conversion voice differs from the number of frames corresponding to time duration T0. When the total number of frames for the conversion voice is smaller than the number required for time duration T0, the spectrum acquisition means 30 uses the frames of conversion voice signal Vt in a loop. That is, after outputting the conversion spectrum data Dt corresponding to all frames to the spectrum conversion means 20, the spectrum acquisition means 30 again outputs the conversion spectrum data Dt corresponding to the first frame of conversion voice signal Vt, as sketched below. When the total number of frames for conversion voice signal Vt is greater than the number required for time duration T0, the conversion spectrum data Dt corresponding to the extra frames is simply discarded.
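The loop-wise reuse of conversion-voice frames amounts to a simple modulo mapping, as in the following sketch (frame counts and names are illustrative):

```python
def conversion_frame_index(output_frame, total_conversion_frames):
    """Map an output frame number to a conversion-voice frame, reusing the
    conversion voice from its first frame whenever it runs out."""
    return output_frame % total_conversion_frames

# e.g. a conversion voice of 120 frames feeding a note lasting 300 frames:
indices = [conversion_frame_index(n, 120) for n in range(300)]
```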
The source voice may also be subject to such a mismatch in the number of frames. That is, the total number of frames for the source voice (i.e., the total number of pieces of envelope data Dev corresponding to one phonetic entity) is fixed at the value determined when spectral envelope EV0 was created, whereas time duration T0 indicated by the musical note data varies with the contents of the musical composition. The total number of frames for the source voice corresponding to one phonetic entity may therefore be insufficient for time duration T0. To solve this problem, the embodiment finds the time duration corresponding to the total number of frames of one voice segment and of the subsequent voice segment. When this time duration is shorter than time duration T0 indicated by the musical note data, the embodiment generates a voice for the gap between both voice segments by interpolation, which is performed by the interpolating portion 12.
More specifically, the interpolating portion 12 interpolates between spectral envelope EV0 in the last frame of one voice segment and spectral envelope EV0 in the first frame of the subsequent voice segment to generate a spectral envelope for each frame in the gap between the two voice segments, as sketched below.
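A sketch of that interpolation, assuming a simple linear blend of the two boundary envelopes over the gap frames, is shown below; the description states only that a spectral envelope is generated for the gap from the two boundary frames, so the linear weighting is an assumption.

```python
import numpy as np

def interpolate_envelopes(last_envelope, first_envelope, n_gap_frames):
    """Generate spectral envelopes for the gap between two voice segments by
    blending the last frame of the first segment with the first frame of
    the following segment."""
    envelopes = []
    for k in range(1, n_gap_frames + 1):
        w = k / (n_gap_frames + 1)         # 0 -> last_envelope, 1 -> first_envelope
        envelopes.append((1.0 - w) * last_envelope + w * first_envelope)
    return envelopes

# e.g. fill a 10-frame gap between the segments [a] and [i]:
ev_a = np.linspace(1.0, 0.2, 256)          # toy envelope, last frame of [a]
ev_i = np.linspace(0.8, 0.1, 256)          # toy envelope, first frame of [i]
gap = interpolate_envelopes(ev_a, ev_i, 10)
```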
The voice generation means 40 generates output voice signal Vnew, a time-domain signal, from new spectrum SPnew generated by the spectrum conversion means 20, and supplies output voice signal Vnew to the voice output portion 60.
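The description does not spell out how new spectrum SPnew is converted back into a time-domain signal. A common approach, assumed here purely for illustration, is an inverse FFT per frame followed by windowed overlap-add:

```python
import numpy as np

def overlap_add(frame_spectra, hop, frame_len):
    """Turn a sequence of one-sided frame spectra (SPnew per frame) into a
    time-domain signal by inverse FFT and overlap-add."""
    out = np.zeros(hop * (len(frame_spectra) - 1) + frame_len)
    window = np.hanning(frame_len)
    for n, spec in enumerate(frame_spectra):
        frame = np.fft.irfft(spec, n=frame_len)
        out[n * hop:n * hop + frame_len] += frame * window
    return out
```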
According to the embodiment, as mentioned above, the conversion voice contains multiple voices generated by many utterers, and its spectral envelope EVt is adjusted so as to approximately match spectral envelope EV0 of the source voice. It is therefore possible to generate output voice signal Vnew indicative of multiple voices (i.e., a choir sound or ensemble sound) having a phonetic entity similar to that of the source voice. Even when the source voice is a voice generated by one singer or player, the voice output portion 60 can output a voice that sounds as if many singers or players sang in chorus or played in concert. In principle, there is no need for an independent element that generates each of the multiple voices contained in the output voice, so the configuration of the voice synthesizer D1 is greatly simplified in comparison with the configuration described in Japanese Non-examined Patent Publication No. 10-78776. Further, the embodiment converts pitch Pt of conversion spectrum SPt in accordance with the musical note data, making it possible to generate choir sounds and ensemble sounds at any pitch. There is the further advantage that the pitch conversion is implemented by a simple multiplication process that extends or contracts conversion spectrum SPt in the frequency axis direction.
The following describes a voice synthesizer according to the second embodiment of the present invention. The mutually corresponding parts in the first and second embodiments are designated by the same reference numerals and a detailed description is appropriately omitted for simplicity.
In the second embodiment, the storage means 50 stores a first conversion voice signal Vt1 indicative of a voice generated by a single utterer and a second conversion voice signal Vt2 indicative of multiple voices generated in parallel at approximately the same pitch. The spectrum acquisition means 30 contains a selection portion 34 preceding the FFT portion 31. Based on an externally supplied selection signal, the selection portion 34 selects one of the first conversion voice signal Vt1 and the second conversion voice signal Vt2, reads it from the storage means 50 as conversion voice signal Vt, and supplies it to the FFT portion 31. The selection signal is supplied in accordance with operations on an input device 67, for example. The subsequent configuration and operations are the same as those of the first embodiment.
In this manner, the embodiment selectively uses the first conversion voice signal Vt1 or the second conversion voice signal Vt2 to generate new spectrum SPnew. Selecting the first conversion voice signal Vt1 outputs a single output voice that has both the source voice's phonetic entity and the conversion voice's frequency characteristic. Selecting the second conversion voice signal Vt2 outputs an output voice composed of many voices maintaining the source voice's phonetic entity, as in the first embodiment. According to the embodiment, a user can freely choose between a single voice and multiple voices as the output voice.
While the embodiment has described the configuration where conversion voice signal Vt is selected in accordance with operations on the input device 67, any factor may be used as the criterion for the selection. For example, a timer interrupt may be generated at a specified interval and trigger a change from the first conversion voice signal Vt1 to the second conversion voice signal Vt2, and vice versa. When the voice synthesizer D1 according to the embodiment is applied to a chorus synthesizer, the first conversion voice signal Vt1 and the second conversion voice signal Vt2 may be switched in synchronization with the progress of a played musical composition. While the embodiment has described the configuration where the storage means 50 stores the first conversion voice signal Vt1 indicative of a single voice and the second conversion voice signal Vt2 indicative of multiple voices, the present invention does not limit the number of voices indicated by each conversion voice signal Vt. For example, the first conversion voice signal Vt1 may indicate a conversion voice composed of a specified number of parallel generated voices, and the second conversion voice signal Vt2 may indicate a conversion voice composed of more voices.
The embodiments may be variously modified. The following describes specific modifications. These modifications may be provided in any combination.
(1) The above-mentioned embodiments have exemplified the configuration where the storage means 50 stores conversion voice signal Vt (Vt1 or Vt2) for one pitch Pt. Alternatively, the storage means 50 may store conversion voice signals Vt corresponding to a plurality of different pitches Pt. In this configuration, the spectrum acquisition means 30 selects, out of the stored conversion voice signals Vt, the one whose pitch Pt approximates (ideally matches) pitch P0 indicated by the musical note data, and reads it for use in generating the output voice signal, as sketched below.
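Selecting among conversion voice signals stored for several pitches could be a nearest-pitch lookup, as in the following sketch (the stored pitches and the distance measure are illustrative assumptions):

```python
def select_conversion_voice(stored_voices, target_pitch):
    """Pick the stored conversion voice whose pitch Pt is closest to the
    pitch P0 indicated by the musical note data.

    stored_voices: dict mapping pitch Pt (Hz) to a conversion voice signal Vt.
    """
    best_pitch = min(stored_voices, key=lambda pt: abs(pt - target_pitch))
    return best_pitch, stored_voices[best_pitch]

# e.g. unison recordings stored at three pitches:
voices = {220.0: "Vt_A3", 261.6: "Vt_C4", 329.6: "Vt_E4"}
pitch, signal = select_conversion_voice(voices, 246.9)   # B3 -> nearest is C4
```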
(2) The above-mentioned embodiments have exemplified the configuration where the storage means 50 stores conversion voice signal Vt indicative of a conversion voice containing one phonetic entity at one moment. Alternatively, the storage means 50 may store conversion voice signals Vt of a plurality of conversion voices vocalized with different phonetic entities. In this configuration, the spectrum acquisition means 30 selects and reads the conversion voice signal Vt of the conversion voice corresponding to the phonetic entity specified by the lyrics data. Consequently, more natural output voices can be generated than in the configuration where an output voice signal is generated from a conversion voice with a single phonetic entity.
(3) The above-mentioned embodiments have exemplified the configuration where the storage means 55 stores envelope data Dev indicative of the source voice's spectral envelope EV0. A configuration where the storage means 55 stores other data may also be used. For example, the storage means 55 may store, for each voice segment, data indicative of frequency spectrum SP0. In this configuration, the envelope acquisition means 10 reads the frequency spectrum SP0 of the voice segment corresponding to the specified phonetic entity and extracts spectral envelope EV0 from it.
It may also be preferable to use a configuration where the storage means 55 stores source voice signal V0 itself on a phonetic entity basis. In this configuration, a feature extraction portion 13 provided in the envelope acquisition means 10 applies, to source voice signal V0 read from the storage means 55, processing similar to that of the FFT portion 92 and the feature extraction portion 93 to extract spectral envelope EV0.
(4) The above-mentioned embodiments have exemplified the configuration where frequency Ft contained in each piece of unit data Ut of conversion spectrum data Dt is multiplied by a specific value (P0/Pt) to extend or contract conversion spectrum SPt in the frequency axis direction. However, any method of converting pitch Pt of conversion spectrum SPt may be used. The method according to the above-mentioned embodiments extends or contracts conversion spectrum SPt at the same rate over all bands, so the bandwidth of each peak pt after conversion may become remarkably greater than the bandwidth of the original peak pt, for example when pitch Pt of conversion spectrum SPt is raised substantially.
There has been described the example of converting pitch Pt by a multiplication process applied to frequency Ft of each piece of unit data Ut. Alternatively, the pitch conversion portion 21 may divide conversion spectrum SPt into spectrum distribution regions, each containing the frequency of one peak pt, and move each spectrum distribution region along the frequency axis in accordance with the pitch indicated by the musical note data. Because each region is moved without being stretched, the bandwidth of each peak pt is maintained before and after the pitch conversion. A sketch contrasting this approach with uniform scaling follows.
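In the sketch below, delimiting the regions at the midpoints between adjacent peaks is an assumption; the description says only that each spectrum distribution region containing a peak is moved along the frequency axis.

```python
import numpy as np

def shift_peak_regions(magnitudes, freqs, peak_indices, source_pitch, target_pitch):
    """Move each spectrum distribution region (one region per peak) along the
    frequency axis so its peak lands at the new pitch, keeping each peak's
    bandwidth unchanged (unlike uniform scaling, which widens the peaks)."""
    bin_width = freqs[1] - freqs[0]
    out = np.zeros_like(magnitudes)
    # Delimit regions at the midpoints between adjacent peaks (an assumption).
    bounds = [0] + [(a + b) // 2 for a, b in zip(peak_indices, peak_indices[1:])] + [len(freqs)]
    for k, p in enumerate(peak_indices):
        lo, hi = bounds[k], bounds[k + 1]
        shift = int(round(freqs[p] * (target_pitch / source_pitch - 1.0) / bin_width))
        for i in range(lo, hi):
            j = i + shift
            if 0 <= j < len(out):
                out[j] = magnitudes[i]
    return out

# Toy usage: three harmonics of a 500 Hz unison voice shifted to 600 Hz.
freqs = np.linspace(0, 4000, 512)
spec = np.zeros(512)
peaks = [64, 128, 192]                      # roughly 500, 1000, 1500 Hz bins
for p in peaks:
    spec[p - 3:p + 4] = [0.2, 0.5, 0.8, 1.0, 0.8, 0.5, 0.2]
shifted = shift_peak_regions(spec, freqs, peaks, 500.0, 600.0)
```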
(5) The above-mentioned embodiments have exemplified the configuration where conversion spectrum SPt is derived from conversion voice signal Vt stored in the storage means 50. Alternatively, the storage means 50 may previously store conversion spectrum data Dt indicative of conversion spectrum SPt on a frame basis. In this configuration, the spectrum acquisition means 30 just needs to read conversion spectrum data Dt from the storage means 50 and output the read data to the spectrum conversion means 20; there is no need to provide the FFT portion 31, the peak detection portion 32, or the data generation portion 33. Further, instead of reading conversion spectrum data Dt from the storage means 50, the spectrum acquisition means 30 may acquire conversion spectrum data Dt from a communication apparatus connected via a communication line, for example. In short, the spectrum acquisition means 30 according to the present invention just needs to acquire conversion spectrum SPt, and no special constraints are placed on the acquisition method or source.
(6) The above-mentioned embodiments have exemplified the configuration where pitch Pt of the conversion voice is converted so as to match pitch P0 indicated by the musical note data. However, pitch Pt of the conversion voice may be converted into other pitches. For example, the pitch conversion portion 21 may convert pitch Pt of the conversion voice so that pitch P0 and the converted pitch constitute a concord sound. This configuration can generate, as an output sound, a chorus sound constituted of a main melody and the concord sound. Where the pitch conversion portion 21 is provided, it just needs to be configured to change pitch Pt of the conversion voice in accordance with the musical note data (i.e., in accordance with a change in pitch P0).
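For instance, the target pitch handed to the pitch conversion portion 21 could be offset from pitch P0 by a musical interval so that the two constitute a concord; the sketch below derives a pitch a major third above P0 (the specific interval is an illustrative assumption).

```python
def concord_pitch(melody_pitch_hz, semitones=4):
    """Return a pitch forming a concord with the melody pitch P0
    (default: a major third, i.e. four equal-tempered semitones above)."""
    return melody_pitch_hz * (2.0 ** (semitones / 12.0))

# e.g. for a melody note at C4 (approx. 261.6 Hz):
harmony = concord_pitch(261.6)        # approx. 329.6 Hz, i.e. E4
```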
(7) While the above-mentioned embodiments have exemplified the case of applying the present invention to an apparatus for synthesizing sung or played sounds of musical compositions, the present invention can also be applied to other apparatuses. For example, the present invention can be applied to an apparatus that operates on document data (e.g., text files) indicative of various documents and reads the character strings of the documents aloud. That is, there may be a configuration where the voice segment selection portion 11 selects the envelope data Dev of the phonetic entity corresponding to the character indicated by a character code constituting the text file, reads the selected envelope data Dev from the storage means 55, and uses this envelope data Dev for generation of new spectrum SPnew. "Phonetic entity data" according to the present invention thus represents a concept including all data specifying phonetic entities of output voices, such as the lyrics data in the above-mentioned embodiments and the document data in this modification. When the data acquisition means 5 is configured to obtain pitch data specifying pitch P0, the configuration according to this modification can generate an output voice at any pitch. This pitch data may indicate a user-specified pitch P0 or may be associated with the document data in advance. "Pitch data" according to the present invention represents a concept including all data specifying output voice pitches, such as the musical note data in the above-mentioned embodiments and the pitch data in this modification.
Foreign Application Priority Data:

Number | Date | Country | Kind
---|---|---|---
2005-026855 | Feb 2005 | JP | national
Foreign Patent Documents:

Number | Date | Country
---|---|---
07-146695 | Jun 1995 | JP
10-078776 | Mar 1998 | JP
2004-077608 | Mar 2004 | JP
Publication Data:

Number | Date | Country
---|---|---
20060173676 A1 | Aug 2006 | US