1. Field of the Invention
The present invention relates to a vocoder system and, in particular, to a vocoder system and method for vocal sound synthesis, with which it is possible to improve the performance expression of a sound with a light computational load.
2. Description of the Prior Art
Vocoder systems have been known with which the formant characteristics of a speech signal that is input are detected and employed. Using a musical tone signal produced by operating a keyboard or the like, the musical tone signal is modulated by the speech signal, outputting a distinctive musical tone. With this vocoder system, the speech signal that is input is divided into a plurality of frequency bands by the analysis filter banks, and the levels of each of the frequencies that express the formant characteristics of the speech signal that are output from the analysis filter banks are detected. On the other hand, the musical tone signal that is produced by the keyboard and the like is divided into a plurality of frequency bands by the synthesis filter banks. Then, by amplitude modulation with the envelope curves that correspond to the output of the analysis filter banks, an effect such as that discussed above is applied to the output sound.
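The conventional arrangement described above can be sketched in a few lines. This is a minimal illustration rather than the implementation of any particular product: the band-pass biquad design (the widely used "audio EQ cookbook" form), the one-pole envelope follower, the `q` value, and all function names are assumptions introduced here. Note that the same filter coefficients are used on the analysis and synthesis sides, which is the property the following paragraph discusses.

```python
import math

def bandpass_coeffs(f0, q, fs):
    """Band-pass biquad (0 dB peak gain), normalized so that a0 == 1."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    return (alpha / a0, 0.0, -alpha / a0,
            -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)

def biquad(x, coeffs):
    """Direct-form I filter: y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] - a1 y[n-1] - a2 y[n-2]."""
    b0, b1, b2, a1, a2 = coeffs
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for s in x:
        out = b0 * s + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1, y2, y1 = x1, s, y1, out
        y.append(out)
    return y

def envelope(x, fs, tau=0.01):
    """Rectify, then smooth with a one-pole low-pass (time constant tau, in seconds)."""
    k = math.exp(-1.0 / (tau * fs))
    e, out = 0.0, []
    for s in x:
        e = k * e + (1.0 - k) * abs(s)
        out.append(e)
    return out

def channel_vocoder(speech, carrier, centers, fs, q=4.0):
    """Modulate each carrier band by the envelope of the matching speech band, then sum."""
    n = min(len(speech), len(carrier))
    out = [0.0] * n
    for f0 in centers:
        c = bandpass_coeffs(f0, q, fs)             # identical filter on both sides
        env = envelope(biquad(speech[:n], c), fs)  # analysis side: band level over time
        band = biquad(carrier[:n], c)              # synthesis side: band of the musical tone
        for i in range(n):
            out[i] += env[i] * band[i]
    return out
```

With a silent speech input every envelope stays at zero, so the output is silent regardless of the carrier, which is a quick sanity check of the modulation path.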
However, with the vocoder systems of the past, since the characteristics of each of the filters (the center frequency and bandwidth) of the analysis filter bank and the synthesis filter bank have been set to be equal, the formant characteristics of the speech signal are reflected as they are, unchanged, in the output sound. Thus, it has not been possible to change the formant of the speech that has been input and modulate the output of the synthesis filters. In other words, with the vocoder systems of the past, there is the problem that it is not possible to apply sound changes to the output sound using the sex, age, singing method, special effects, pitch information, strength, and the like. The performance expression of the output sound is, therefore, limited.
To solve this problem, there is a method in which the center frequencies of each of the filters that comprise the synthesis filter bank are changed with respect to the center frequencies of each of the filters that comprise the analysis filter bank. By means of this method, the formant characteristics of the speech signal can be shifted on the frequency axis and changed. It is thus possible to improve the performance expression of the output sound. It is set up, for example, with the speech signal divided into a plurality of frequency bands by the analysis filter bank and, in a specified time t, as is shown in
On the other hand, in those cases where, contrary to what has been discussed above, the formant curve that is produced from the output from the analysis filter bank is, as is shown in
If the center frequencies of each of the filters that comprise the synthesis filter bank are changed in this manner with respect to the center frequencies of each of the corresponding filters that comprise the analysis filter bank, it is possible for the formant characteristics of the speech signal to be changed and for this to be reflected in the output signal, and the performance expression of the output signal can be improved. In Japanese Unexamined Patent Application Publication (Kokai) Number 2001-154674, a vocoder system is disclosed that is related to this method in which the frequency band characteristics (the center frequencies) of the synthesis filter bank are changed appropriately and that has been furnished with a parameter setting means in which parameters are set in order to determine the frequency band characteristics of the synthesis filter bank.
However, in those cases where the method discussed above is employed in order to improve the performance expression of the output sound, the filter coefficients of each of the filters that comprise the synthesis filter bank must be changed. When this is carried out with digital filters, the computational load that is borne by the processing unit for the computation becomes great. In addition, since the synthesis filter bank is actually on the side on which the output sound is produced, in order to prevent the generation of noise, it is necessary to change the filter coefficients for each sample and do the computation; thus, the computational load on the processing unit becomes even greater.
In addition, in those cases where the method discussed above is employed when the formant characteristics are changed during the performance, it is necessary to change the filter coefficients of each of the filters that comprise the synthesis filter bank individually and continuously. Therefore, the computations of the processing unit become complicated and the computational load becomes great.
The present invention resolves these problems and has as its object a vocoder system with which it is possible to improve the performance expression of the output sound with a light computational load.
In accordance with the vocoder system of the present invention, the system comprises formant detection means as well as division means in which the center frequencies are fixed, and the modulation levels, which modulate the levels of each of the frequency bands that have been divided in the division means, are set by the setting means based on the levels of each of the frequency bands that correspond to what has been detected in the formant detection means and on the formant information that changes the formants. Therefore, the invention has the advantageous result that it is possible to improve the performance expression of the output sound with a light computational load and without the need, as in the past, to calculate and change the filter coefficients of each filter for each sample in order to change the center frequency and bandwidth of each of the filters that comprise the division means.
In order to achieve this object, the vocoder system is furnished with formant detection means with which the formant characteristics of the first musical tone signal are detected; musical tone signal input means with which the second musical tone signal that corresponds to specified pitch information is input; division means with which the second musical tone signal that is input in the musical tone signal input means is divided into a plurality of frequency bands, the respective center frequencies of which have been fixed; setting means with which the modulation levels that correspond to each of the frequency bands that have been divided in the previously mentioned division means are set based on the previously mentioned formant characteristics that have been detected in the previously mentioned formant detection means and the formant control information with which the formant characteristics that are detected by the previously mentioned formant detection means are changed; and modulation means with which the level of the signal of each of the frequency bands that have been divided in the previously mentioned division means is modulated based on the modulation level that has been set in the setting means.
The formant characteristics for the first musical tone signal are detected by the formant detection means. On the other hand, the second musical tone signal is input from the musical tone signal input means as the musical tone that corresponds to the specified pitch information and is divided into a plurality of frequency bands by the division means. The setting means sets the modulation level that corresponds to each of the frequency bands that have been divided in the division means based on the formant characteristics that have been detected in the formant detection means and the formant information with which the formant characteristics that have been detected in the formant detection means are changed. In addition, the levels that correspond to each of the frequency bands that have been divided in the division means are modulated by the modulation means based on the modulation levels that have been set.
The formant detection means may comprise a filter or a Fourier transform.
The division means may comprise a filter. The division means may comprise a Fourier transform.
The setting means sets the modulation level that corresponds to each of the frequency bands that have been divided in the division means based on the pitch information and the formant characteristics that have been detected in the formant detection means and the formant control information with which the formant characteristics that have been detected in the formant detection means are changed.
The setting means stores a formant change table that changes the formant non-uniformly and sets the modulation levels that correspond to each of the frequency bands that have been divided in the division means based on the change table.
A detailed description of embodiments of the invention will be made with reference to the accompanying drawings, wherein like numerals designate corresponding parts in the several figures.
a) shows a formant curve that is contoured and produced by the levels of the output signals from each of the filters in a specified time t in two dimensions;
b) shows a formant curve that is produced when the formant curve shown in
c) is a sinc function;
d) shows each of the levels of the formant curve shown in
a) shows a formant curve that is contoured and produced by the levels of the output signals from each of the filters in a specified time t in two dimensions;
b) shows a formant curve that is produced when the formant curve shown in
c) shows each of the levels of the formant curve shown in
a) through 10(c) show the situation in which the formant curves of the input signals that have been detected are changed into the formant curves shown on the right side in accordance with the tables on the left side according to an embodiment of the present invention.
In the following description of preferred embodiments, reference is made to the accompanying drawings which form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the preferred embodiments of the present invention.
The MPU 2 is the central processing unit that controls this entire system 1 and has built in a ROM, in which are stored the various types of control programs that are executed by the MPU 2, and a RAM for the execution of the various types of control programs that are stored in the ROM and in which various types of data are stored temporarily.
The DSP 6 detects the formants by deriving the levels of each of the bands of the speech signal that has been digitally converted. The DSP 6 changes the formants of the input speech signal based on the formant control information that is instructed by the operators 4 and derives the levels that correspond to each of the frequency bands on the synthesis side. On the other hand, in accordance with the instructions of the keyboard 3, the DSP 6 reads out the specified waveforms from the waveform memory 7, divides the waveforms into each of the bands, changes the level of each band based on the changed formant information for that band, synthesizes the outputs of each of the bands, and outputs the result to the D/A converter 9. The processing programs and algorithms are stored in a ROM that is built into the DSP 6. The MPU 2 may also transmit them to the RAM of the DSP 6 as required.
These programs are programs that execute the speech signal analysis process, the envelope interpolation and generation process, the modulation process, and the like that are executed by the analysis filter bank 10, the envelope detector and interpolator 11, and the synthesis filter bank 13, which will be discussed later. In addition, the A/D converter 8, which converts the speech signal that has been input into a digital signal, and the D/A converter 9, which converts the musical tone signal that has been modulated into an analog signal, are connected to the DSP 6.
Next, an explanation will be given in detail regarding the processing that is executed by the DSP 6 while referring to
The envelope detector and interpolator 11 detects the formant curve on the frequency axis for the speech signal in a certain time from the level of each frequency band that has been detected by the analysis filter bank 10 and, together with this, generates a new formant based on the formant control information that changes the formant curve and the pitch information. Here, the formant control information that changes the formant is assigned by a change table such as is shown in
For example, in those cases where the speech that is input is a male voice, presets in order to change to the formants of a female voice and, conversely, in those cases where the speech that is input is a female voice, presets in order to change to the formants of a male voice, are prepared in advance in the change table and may be selected from among them. In addition, the pitch information that is referred to here is the pitch information of the waveform that is produced by the waveform generator 12. The formant curve that is generated is shifted based on the pitch information and the change table is shifted and changed based on the pitch information. The pitch information corresponds to the pitch that is instructed by the keyboard 3 in
The synthesis filter bank 13 divides the musical tone signal that has been input into a plurality of frequency bands and, together with this, amplitude modulates the outputs that have been divided into each of the frequency bands based on the new formant information that has been produced by the envelope detector and interpolator 11. The synthesis filter bank 13 comprises a plurality of filters for different frequency bands, and the characteristics of each filter are fixed corresponding to the respective center frequencies for the bands that have been divided.
The mixer 14 is an adder that mixes the outputs from each of the filters of the synthesis filter bank 13. The outputs from each of the filters of the synthesis filter bank 13 are mixed by the mixer 14, and a musical tone signal having the desired formant characteristics is produced. Incidentally, the signal that has been mixed by the mixer 14 is analog converted by the D/A converter 9 and output from an output system such as a speaker and the like.
Also, in addition to those cases in which a single sound musical tone is produced by the waveform generator 12, there are also cases in which a plurality of musical tones are produced. In those cases, the plurality of musical tones are modulated by a single synthesis filter bank 13.
The synthesis filter bank 13 divides the musical tone signal that has been input to a plurality of frequency bands (0 to n; here the number of analysis filter bank 10 and synthesis filter bank 13 filters has been made the same and each frequency band (center frequency and bandwidth) has also been made the same, but it may also be set up such that they are each different) and, together with this, the outputs that have been divided into each of the frequency bands are amplitude modulated based on the new envelope curve that has been generated by the envelope detector and interpolator 11. The synthesis filter bank 13 comprises a plurality of filters for different frequency bands and the characteristics of each of the filters are fixed corresponding to the respective center frequencies for the bands that have been divided. In addition, each filter is furnished with an amplitude modulator 13a with which the output of each corresponding filter is amplitude modulated based on the new envelope curve that has been generated by the envelope detector and interpolator 11.
The mixer 14 is an adder that mixes the outputs from each of the filters of the synthesis filter bank 13. The outputs from each of the filters of the synthesis filter bank 13 are mixed by the mixer 14, and a musical tone signal having the desired formant characteristics is produced.
a) is a drawing that shows in two dimensions the levels of the output signals from each of the filters for a specified period of time t as contours and the formant curve that is generated. The level of each frequency f1, f2, . . . is a1, a2, . . . respectively.
Next, an explanation will be given of a specific example of the processing that is carried out using the configuration described above. As the first operation example, an explanation will be given regarding the case in which the formant characteristics of the speech signal are expanded and contracted linearly on the frequency axis. When the input signal that has been digitally converted is input to the analysis filter bank 10, the levels of each of the frequency bands (the solid line arrows of
The envelope detector and interpolator 11 contours the levels of each of the frequency bands and produces a formant curve such as that shown in
With regard to the interpolation processing, the simplest one is the linear interpolation method for the values before and after the derived sample value. However, with this linear interpolation method, since the error becomes large when each band division is economized, the preferable interpolation method is the polynomial arithmetic method using the sinc function in which the interpolation of the time series sample signal is utilized.
This interpolation is processing on the frequency axis and not on the time axis. The item in which the sample value is placed and superimposed on the impulse response shown in
$$I_i = Y_i \, \frac{\sin\{\pi(X - i)\}}{\pi(X - i)}$$
Here, $I_i$ indicates the response value in accordance with the sample value $Y_i$, and $Y_i$ indicates the sample value located an amount $i$ from the interpolation point that has been derived. Although the value that has been superimposed is
$$Y = \sum_{i=-\infty}^{+\infty} Y_i \, \frac{\sin\{\pi(X - i)\}}{\pi(X - i)}$$
the length of the impulse response is limited by the window and, since $i$ is finite, the calculation amount can be small.
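The windowed superposition above can be sketched as follows. This is an illustrative sketch only: the text does not name the window, so the Hann taper, the `half_width` of four bands, and the function names are assumptions. Positions are measured in band units along the frequency axis, so `levels[i]` plays the role of the sample value at band i.

```python
import math

def sinc(t):
    """sin(pi t) / (pi t), with the removable singularity at t == 0 filled in."""
    return 1.0 if t == 0.0 else math.sin(math.pi * t) / (math.pi * t)

def interpolate_level(levels, x, half_width=4):
    """Estimate the formant-curve level at fractional band position x.

    Only bands within half_width of x contribute, so the sum is finite.
    A Hann window (an assumption) tapers the truncated sinc response.
    """
    lo = max(0, math.ceil(x - half_width))
    hi = min(len(levels) - 1, math.floor(x + half_width))
    total = 0.0
    for i in range(lo, hi + 1):
        d = x - i
        window = 0.5 * (1.0 + math.cos(math.pi * d / half_width))
        total += levels[i] * sinc(d) * window
    return total
```

At an integer position the sinc term vanishes for every other band, so the detected level of that band is returned unchanged up to rounding error; between bands the neighbouring levels are blended.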
For example, the case in which from the fifth level from the left (the solid line arrow) of
When it is done in this manner and the new formant curve is produced by the envelope detector and interpolator 11, an amplitude envelope is generated based on the new formant curve, and the corresponding musical tone signal output that has been band divided by the synthesis filter bank 13 is amplitude modulated by the amplitude modulator 13a. Therefore, the formant characteristics of the output sound are changed from formant characteristics for which the low frequency side is rich to formant characteristics for which the high frequency side is rich. Since it is only necessary to modulate the amplitude, without the need to change many coefficients in order to change the center frequencies of each of the filters that comprise the synthesis filter bank 13 as in the past, it is possible to lighten the computational load of the DSP 6 that carries out the computation.
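A minimal sketch of this first operation example follows, under several assumptions that are not taken from the text: positions are expressed as band indices, the stretch is anchored at band 0, plain linear interpolation stands in for the sinc method for brevity, and positions beyond the highest detected band are treated as zero level. The function name `stretch_formant` and the `ratio` parameter are hypothetical.

```python
def stretch_formant(levels, ratio):
    """Return new per-band levels for the fixed synthesis bands.

    The level for fixed band k is read from position k / ratio of the
    detected curve, so ratio > 1 moves the formants toward higher bands
    and ratio < 1 moves them toward lower bands. Requires len(levels) >= 2.
    """
    n = len(levels)
    out = []
    for k in range(n):
        pos = k / ratio
        if pos > n - 1:
            out.append(0.0)              # beyond the detected range (assumption)
            continue
        i = min(int(pos), n - 2)         # left neighbour, clamped so i + 1 is valid
        frac = pos - i
        out.append((1.0 - frac) * levels[i] + frac * levels[i + 1])
    return out

# The synthesis filters themselves never change; only these gains do.
detected = [8.0, 4.0, 2.0, 1.0, 0.0]     # low-frequency side is rich
stretched = stretch_formant(detected, 2.0)
```

Only the list of gains is recomputed when the formant is moved, which is the point made above about avoiding coefficient updates in the synthesis filters.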
In addition, by means of the method discussed above, since the timing at which the modulation level for the modulation of the musical tone signal is produced is not that of the synthesis filter bank 13 that outputs the output sound, there is no need to carry this out for each sample and a comparatively slow signal is fine. Therefore, the timing at which the modulation level is produced may be a period of several milliseconds, and the value between the periods can be derived, as is shown in
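One way to derive the values between update periods is sketched below. The interpolation rule is an assumption (a straight-line ramp between successive modulation levels), as are the function name and the `samples_per_period` parameter; the text only states that the levels may be produced every several milliseconds.

```python
def ramp_gains(control_gains, samples_per_period):
    """Expand control-rate modulation levels to per-sample gains.

    control_gains holds one modulation level per update period for a single
    band. Between two successive levels the gain moves in equal steps, so the
    per-sample work is one addition and one multiplication.
    """
    out = []
    for g0, g1 in zip(control_gains, control_gains[1:]):
        step = (g1 - g0) / samples_per_period
        out.extend(g0 + step * n for n in range(samples_per_period))
    out.append(control_gains[-1])        # final level is reached exactly
    return out
```

For example, `ramp_gains([0.0, 1.0, 0.0], 4)` yields nine per-sample gains that rise to 1.0 and fall back to 0.0 in steps of 0.25.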
In
Next, an explanation will be given of the second operation example while referring to
Although a formant change in accordance with sex or age, as in the case of a change from a male voice to a female or a child's voice, is roughly a uniform expansion or contraction on a logarithmic frequency axis, strictly speaking, the sizes of the throats, the palates, and the lips of women and children are different, and there are also individual differences. Therefore, even if a male voice is extended linearly on a logarithmic frequency axis, there will be subtle differences from the voice of a female or of a child, and an unnatural impression is imparted.
In addition, there are cases in which it is desired to change the center frequency or bandwidth of the specific band of the formant characteristics and produce a special effect. For example, there are cases in which it is desired to intentionally move the resonant frequency of the formant in order to match the singing pitch. This is called a singing formant. In this case, since it is not possible to obtain the desired output by simply expanding and contracting the formant on a logarithmic frequency axis, it is necessary to expand and contract the formant non-uniformly on the logarithmic frequency axis.
Therefore, the positions of the low domain, the middle domain, and the high domain are changed by non-uniformly distorting the scale of the logarithmic frequency axis, and the expansion and contraction of the formant on the logarithmic frequency axis is done non-uniformly. With regard to the method with which the scale is distorted, there are those such as the one using a specific function and the method using a numeric table and the like. In this preferred embodiment, the formant of the speech signal is changed non-uniformly on the logarithmic frequency axis using the tables shown on the left sides of
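The table-based distortion of the scale can be illustrated with a small sketch. The table layout used here is hypothetical, since the actual tables appear only in the drawings: each entry is a breakpoint (output position, source position), both normalized to the range 0 to 1 along the logarithmic frequency axis, and the band index is assumed to be evenly spaced on that axis. Linear interpolation is used throughout for brevity.

```python
def lookup(table, x):
    """Piecewise-linear lookup in a list of (x, y) breakpoints sorted by x."""
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x outside table range")

def curve_at(levels, pos):
    """Level of the detected formant curve at fractional band index pos."""
    i = min(int(pos), len(levels) - 2)
    frac = pos - i
    return (1.0 - frac) * levels[i] + frac * levels[i + 1]

def warp_formant(levels, table):
    """New level for each fixed band, read from the table-warped source position."""
    last = len(levels) - 1
    return [curve_at(levels, lookup(table, k / last) * last)
            for k in range(len(levels))]

# Hypothetical table: the lowest quarter of the input is spread over the lower
# half of the output, and the remainder is compressed into the upper half.
table = [(0.0, 0.0), (0.5, 0.25), (1.0, 1.0)]
warped = warp_formant([1.0, 2.0, 3.0, 4.0, 5.0], table)
```

A table with the two breakpoints (0, 0) and (1, 1) leaves the curve unchanged, and a single straight line with a different slope reproduces uniform expansion or contraction; additional breakpoints make the change non-uniform.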
The envelope detector and interpolator 11 sets the modulation level with which the level of the musical tone signal is modulated based on the level of each frequency band that has been detected by the analysis filter bank 10, the tables that are shown on the left side of
Specifically, with the tables that are shown on the left side of
On the other hand, when the formant curve of the speech signal that has been detected by the envelope detector and interpolator 11 is transformed in accordance with the table that is shown on the left side of
In addition, when the formant curve of the speech signal that has been detected by the envelope detector and interpolator 11 is transformed in accordance with the table that is shown on the left side of
The new formant curve that is obtained in this manner serves as a new envelope curve with which the levels that correspond to each of the frequency bands that have been divided by the synthesis filter bank 13 are modulated. In addition, in those cases where the vocoder system 1 is made polyphonic, as has been discussed above, when the formant is changed in accordance with each specified pitch information, an envelope detector and interpolator, a synthesis filter bank, and an amplitude modulator must be prepared for each voice. However, since the change in accordance with the pitch is gentle, if the formant is changed in accordance with a few registers, for example three register groups of high, middle, and low, rather than in accordance with each of the voices, it is possible to reduce the number of synthesis filter banks and the like.
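The register grouping can be sketched as follows. The use of MIDI note numbers and the two split points are purely illustrative assumptions; the text specifies only that a small number of register groups, such as high, middle, and low, may share a formant setting.

```python
def register_for_note(midi_note, low_max=52, middle_max=76):
    """Classify a note into one of three register groups (split points are hypothetical)."""
    if midi_note <= low_max:
        return "low"
    if midi_note <= middle_max:
        return "middle"
    return "high"

def group_voices(notes):
    """Collect sounding notes per register so that each group can share one
    envelope curve, synthesis filter bank, and set of amplitude modulators."""
    groups = {"low": [], "middle": [], "high": []}
    for note in notes:
        groups[register_for_note(note)].append(note)
    return groups
```

With this grouping, at most three sets of synthesis resources are needed no matter how many voices sound at once.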
Explanations were given above of the present invention based on preferred embodiments; however, the present invention is in no way limited to the preferred embodiments that have been discussed above, and it can be easily surmised that various modifications and changes are possible that do not deviate from and are within the scope of the essentials of the present invention. For example, a plurality of digital band pass filters are used as the method with which the formant of the speech that is input is detected but, instead of this, the level for each specified frequency may be detected using Fourier transforms (FFT). In this case, the levels of the fundamental frequencies of the musical tones that have been input and each of their harmonics are derived. Based on the levels of the fundamental wave and the harmonics that have been derived in this way, amplitude modulation of each of the respective components that have been divided by the band pass filters on the synthesis side is possible.
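Deriving the levels of a fundamental and its harmonics with a Fourier transform can be sketched as below. For clarity this evaluates the transform at individual frequencies rather than running a full FFT; that substitution, the function names, and the amplitude scaling are choices made for this illustration. The estimate is exact only when the analysed block spans a whole number of periods and every harmonic lies below half the sampling frequency.

```python
import cmath
import math

def level_at(signal, freq, fs):
    """Amplitude of the sinusoidal component at freq (single-frequency DFT)."""
    n = len(signal)
    acc = sum(s * cmath.exp(-2j * math.pi * freq * i / fs)
              for i, s in enumerate(signal))
    return 2.0 * abs(acc) / n

def harmonic_levels(signal, f0, fs, count):
    """Levels of the fundamental f0 and its next count - 1 harmonics."""
    return [level_at(signal, f0 * h, fs) for h in range(1, count + 1)]
```

For a block containing a unit-amplitude 100 Hz tone plus a half-amplitude 200 Hz tone, `harmonic_levels(block, 100.0, fs, 3)` returns values close to 1.0, 0.5, and 0.0.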
In addition, in the preferred embodiments described above, IIR filters were given as examples of the band pass filters used for analysis and synthesis but FIR filters may also be used. In addition, since the bands for each of the speech signals that have been divided by each band pass filter are limited, resampling may be done at a sampling frequency that corresponds to the band and the count for the performance time is reduced.
In addition, in the preferred embodiments described above, the synthesis filter bank 13 also comprises a plurality of band pass filters and divides the musical tone signal into each frequency band. However, the spectrum waveform may instead be obtained by a Fourier transform (FFT) of the musical tone signal, a window for each frequency band may be placed on the spectrum waveform to divide it, an inverse Fourier transform may be performed on each division, and the musical tone signals for each frequency band may then be synthesized.
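The transform-based division can be sketched as follows. Several simplifications are assumed for illustration: a plain discrete Fourier transform over one whole block is used instead of a framed FFT, the per-band windows are rectangular and non-overlapping, and band edges are given as bin indices. All function names are hypothetical.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def split_bands(x, edges):
    """Divide x into band signals; edges are bin indices covering 0 .. len(x) // 2."""
    n = len(x)
    spec = dft(x)
    bands = []
    for lo, hi in zip(edges, edges[1:]):
        masked = [0j] * n
        for k in range(lo, hi):
            masked[k] = spec[k]
            if 0 < k < n - k:                # copy the mirror bin so the band stays real
                masked[n - k] = spec[n - k]
        bands.append(idft(masked))
    return bands

def resynthesize(bands, gains):
    """Apply one modulation level per band and mix the bands."""
    return [sum(g * b[t] for g, b in zip(gains, bands))
            for t in range(len(bands[0]))]
```

Because the rectangular windows partition the spectrum, mixing the bands with unit gains returns the original block, and any other set of gains reshapes the spectrum band by band.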
In addition, for the vocoder system 1 of these preferred embodiments, an explanation was given regarding the case where specified formant information with which the formant of the speech signal that has been input is changed is applied. However, rather than inputting a speech signal, a speech signal stored in advance may be used: the formant of this stored speech signal is detected, an envelope signal is produced based on that formant, and the musical tone signal is modulated. In addition, with regard to the musical tone signal, this does not have to be limited to an electronic musical instrument such as a piano and the like, and may also be voices, the cries of animals, and sounds produced by nature.
As another method for changing the formant, there is the method in which the center frequency and bandwidth of each of the filters that comprise the analysis filter bank 10 are changed. Specifically, if the center frequencies and the bandwidths of the analysis filter bank 10 are made a fixed percentage smaller than those of the synthesis filter bank 13, the level of each synthesis filter is set based on the level obtained by the corresponding analysis filter. A formant curve such as is shown in
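A small sketch of this alternative follows. The representation of a band as a (center frequency, bandwidth) pair in hertz, the function name, and the example values are assumptions made for illustration.

```python
def scaled_analysis_bank(synthesis_bands, percent_smaller):
    """Derive analysis-bank (center_hz, bandwidth_hz) pairs from the synthesis bank.

    Every center frequency and bandwidth is reduced by the same percentage.
    The level measured by analysis filter k is then applied unchanged to
    synthesis filter k, which sits at a proportionally higher frequency.
    """
    scale = 1.0 - percent_smaller / 100.0
    return [(fc * scale, bw * scale) for fc, bw in synthesis_bands]

synthesis = [(200.0, 50.0), (400.0, 100.0), (800.0, 200.0)]   # hypothetical bands
analysis = scaled_analysis_bank(synthesis, 50.0)
```

Here the synthesis side keeps its fixed filters, and only the analysis side is redesigned; no interpolation of levels between bands is needed because the levels map one to one.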
While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that the invention is not limited to the particular embodiments shown and described and that changes and modifications may be made without departing from the spirit and scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2003-080246 | Mar 2003 | JP | national |
Number | Date | Country |
---|---|---|
5-2390 | Jan 1993 | JP |
2001-154674 | Jun 2001 | JP |
Number | Date | Country |
---|---|---|
20040260544 A1 | Dec 2004 | US |