The present invention relates to an estimation system of spectral envelopes and group delays, and to an audio signal synthesis system.
Many studies have been made on estimation of spectral envelopes, but estimating an appropriate envelope is still difficult. There have been some studies on application of group delays to sound synthesis, and such application needs time information called pitch marks.
For example, source-filter analysis (Non-Patent Document 1) is an important way to deal with human sounds (singing and speech) and instrumental sounds. An appropriate spectral envelope obtained from an audio signal (an observed signal) is useful in a wide range of applications such as high-accuracy sound analysis and high-quality sound synthesis and transformation. If phase information (group delays) can also be estimated appropriately in addition to the spectral envelope, the naturalness of synthesized sounds can be improved.
In the field of sound analysis, great importance has been placed on amplitude spectrum information, but little focus on phase information (group delays). In sound synthesis, however, the phase plays an important role in perceived naturalness. In sinusoidal synthesis, for example, if an initial phase is shifted by more than π/8 from that of natural utterance, perceived naturalness is known to degrade monotonically with the magnitude of the shift (Non-Patent Document 2). Also, in sound analysis and synthesis, the minimum-phase response is known to yield better naturalness than the zero-phase response when obtaining an impulse response from a spectral envelope to define a unit waveform (a waveform for one period) (Non-Patent Document 3). Further, there have been studies on phase control of unit waveforms for improved naturalness (Non-Patent Document 4).
Further, many studies have been made on signal modeling for high-quality synthesis and transformation of audio signals. Some of the studies do not use supplemental information, some of them are accompanied by F0 estimation as supplemental information, and others need phoneme labels. As a typical technique, the Phase Vocoder (Non-Patent Documents 5 and 6) deals with input signals in the form of power spectrogram on the time-frequency domain. This technique enables temporal expansion and contraction of periodic signals, but suffers from reduced quality due to aperiodicity and F0 fluctuation.
In addition, LPC (Linear Predictive Coding) analysis (Non-Patent Documents 7 and 8) and cepstrum analysis are widely known as conventional techniques for spectral envelope estimation. Various modifications and combinations of these techniques have been proposed (Non-Patent Documents 9 to 13). Since the contour of the envelope is determined by the analysis order in LPC or cepstrum analysis, the envelope cannot be represented appropriately for some analysis orders.
In PSOLA (Pitch-Synchronous Overlap-Add) (Non-Patent Documents 1 and 14), known as a conventional F0-adaptive analysis technique, the estimated F0 is used as supplemental information. Time-domain waveforms are cut out as unit waveforms based on pitch marks, and the unit waveforms thus cut out are overlap-added at the fundamental period. This technique can deal with changing F0, and the preserved phase information helps provide high-quality sound synthesis. However, it still has problems such as the difficulty of pitch mark allocation, as well as reduced quality under F0 changes and for non-stationary sounds.
Also in sinusoidal models of voice and music signals (Non-Patent Documents 15 and 16), F0 estimation is used for modeling the harmonic structure. Many extensions of these models have been proposed, such as modeling of harmonic components and broadband components (noise, etc.) (Non-Patent Documents 17 and 18), estimation from the spectrogram (Non-Patent Document 19), iterative estimation of parameters (Non-Patent Documents 20 and 21), estimation based on quadratic interpolation (Non-Patent Document 22), improved temporal resolution (Non-Patent Document 23), estimation of non-stationary sounds (Non-Patent Documents 24 and 25), and estimation of overlapped sounds (Non-Patent Document 26). Most of these sinusoidal models can provide high-quality sound synthesis since they use phase estimation, and some of them have high temporal resolution (Non-Patent Documents 23 and 24).
STRAIGHT (Non-Patent Document 27), a system (VOCODER) based on source-filter analysis, incorporates F0-adaptive analysis and is widely used in the speech research community throughout the world for its high-quality sound analysis and synthesis. In STRAIGHT, the spectral envelope is obtained with periodicity removed from an input audio signal by F0-adaptive smoothing and other processing. The system provides high quality and has high temporal resolution. Extensions of this system include TANDEM-STRAIGHT (Non-Patent Document 28), which eliminates temporal fluctuations by use of tandem windows; emphasis placed on spectral peaks (Non-Patent Document 29); and fast calculation (Non-Patent Document 30). In the STRAIGHT system and these extensions, the following techniques, for example, are introduced in an attempt to improve the naturalness of synthesized sounds: mixed-mode excitation, in which Gaussian noise is convolved with non-periodic components (defined as components which cannot be represented by the sum of harmonics or by a response driven by periodic pulse trains) without estimating the original phase, and group delay randomization in the high frequency range. However, standards for phase manipulation have not been established. Further, excitation extraction (Non-Patent Document 31) extracts excitation signals by deconvolution of the original audio signal with impulse response waveforms of the estimated envelope. It cannot be said that this technique represents the phase efficiently, and it is difficult to apply it to interpolation and conversion. Some studies on sound analysis and synthesis (Non-Patent Documents 32 and 33), which estimate and smooth group delays, need pitch marks.
In addition to the foregoing studies, there are some studies such as Gaussian mixture modeling (GMM) of the spectral envelope, STRAIGHT spectral envelope modeling (Non-Patent Document 34), and formulated joint estimation of F0 and spectral envelope (Non-Patent Document 35).
Problems common to the studies described so far are that the analysis is limited by local observation, that only the harmonic structure (frequency components at integer multiples of F0) is modeled, and that transfer functions between adjacent harmonics can be obtained only by interpolation.
Further, some studies utilize phoneme labels as supplemental information. For example, attempts have been made to estimate a true envelope by integrating spectra at different F0s (in different frames) of the same phoneme at the time of analysis, for the purpose of estimating envelope components between harmonics that cannot be observed (Non-Patent Documents 36 to 38). One such study is directed not to a single sound but to the vocal part in a music audio signal (Non-Patent Document 39). This study assumes that the same phoneme has a similar vocal tract shape. In this case, accurate phoneme labels are required. Furthermore, if the target sound, such as singing voice, fluctuates largely depending upon the context, excessive smoothing may result.
JP10-97287A (Patent Document 1) discloses an invention comprising the steps of: convoluting a random number with a band limit function on the frequency domain to obtain a band-limited random number; multiplying a target value of delay time fluctuation by the band-limited random number to obtain group delay characteristics; calculating an integral of the group delays with respect to frequency to obtain phase characteristics; and multiplying the phase characteristics by an imaginary unit to obtain the exponent of an exponential function, thereby obtaining phase adjusting components.
Patent Document 1: JP10-97287A
Non-Patent Document 1: Zolzer, U. and Amatriain, X., “DAFX—Digital Audio Effects”, Wiley (2002).
Non-Patent Document 2: Ito, M. and Yano, M., “Perceptual Naturalness of Time-Scale Modified Speech”, IEICE (The Institute of Electronics, Information and Communication Engineers) Technical Report EA, pp. 13-18 (2008).
Non-Patent Document 3: Matsubara, T., Morise, M. and Nishiura, T., “Perceptual Effect of Phase Characteristics of the Voiced Sound in High-Quality Speech Synthesis”, Acoustical Society of Japan, Technical Committee of Psychological and Physiological Acoustics Papers, Vol. 40, No. 8, pp. 653-658 (2010).
Non-Patent Document 4: Hamagami, T., “Speech Synthesis Using Source Wave Shape Modification Technique by Harmonic Phase Control”, Acoustical Society of Japan, Journal, Vol. 54, No. 9, pp. 623-631 (1998).
Non-Patent Document 5: Flanagan, J. and Golden, R., “Phase Vocoder”, Bell System Technical Journal, Vol. 45, pp. 1493-1509 (1966).
Non-Patent Document 6: Griffin, D. W., “Multi-Band Excitation Vocoder”, Technical Report (Massachusetts Institute of Technology, Research Laboratory of Electronics) (1987).
Non-Patent Document 7: Itakura, F. and Saito, S., “Analysis Synthesis Telephony based on the Maximum Likelihood Method”, Reports of the 6th Int. Cong. on Acoust., vol. 2, no. C-5-5, pp. C17-20 (1968).
Non-Patent Document 8: Atal, B. S. and Hanauer, S., “Speech Analysis and Synthesis by Linear Prediction of the Speech Wave”, J. Acoust. Soc. Am., Vol. 50, No. 4, pp. 637-655 (1971).
Non-Patent Document 9: Tokuda, K., Kobayashi, T., Masuko, T. and Imai, S., “Mel-generalized Cepstral Analysis—A Unified Approach to Speech Spectral Estimation”, Proc. ICSLP1994, pp. 1043-1045 (1994).
Non-Patent Document 10: Imai, S., and Abe, Y., “Spectral Envelope Extraction by Improved Cepstral Method”, IEICE, Journal, Vol. J62-A, No. 4, pp. 217-223 (1979).
Non-Patent Document 11: Robel, A. and Rodet, X., “Efficient Spectral Envelope Estimation and Its Application to Pitch Shifting and Envelope Preservation”, Proc. DAFx2005, pp. 30-35 (2005).
Non-Patent Document 12: Villavicencio, F., Robel, A. and Rodet, X., “Extending Efficient Spectral Envelope Modeling to Mel-frequency Based Representation”, Proc. ICASSP2008, pp. 1625-1628 (2008).
Non-Patent Document 13: Villavicencio, F., Robel, A. and Rodet, X., “Improving LPC Spectral Envelope Extraction of Voiced Speech by True-Envelope Estimation”, Proc. ICASSP2006, pp. 869-872 (2006).
Non-Patent Document 14: Moulines, E. and Charpentier, F., “Pitch-synchronous Waveform Processing Techniques for Text-to-speech Synthesis Using Diphones”, Speech Communication, Vol. 9, No. 5-6, pp. 453-467 (1990).
Non-Patent Document 15: McAulay, R. and Quatieri, T., “Speech Analysis/Synthesis Based on A Sinusoidal Representation”, IEEE Trans. ASSP, Vol. 34, No. 4, pp. 744-755 (1986).
Non-Patent Document 16: Smith, J. and Serra, X., “PARSHL: An Analysis/Synthesis Program for Non-harmonic Sounds Based on A Sinusoidal Representation”, Proc. ICMC 1987, pp. 290-297 (1987).
Non-Patent Document 17: Serra, X. and Smith, J., “Spectral Modeling Synthesis: A Sound Analysis/Synthesis Based on A Deterministic Plus Stochastic Decomposition”, Computer Music Journal, Vol. 14, No. 4, pp. 12-24 (1990).
Non-Patent Document 18: Stylianou, Y., “Harmonic plus Noise Models for Speech, combined with Statistical Methods, for Speech and Speaker Modification”, Ph.D. Thesis, École Nationale Supérieure des Télécommunications, Paris, France (1996).
Non-Patent Document 19: Depalle, P. and Hélie, T., “Extraction of Spectral Peak Parameters Using a Short-time Fourier Transform Modeling and No Sidelobe Windows”, Proc. WASPAA1997 (1997).
Non-Patent Document 20: George, E. and Smith, M., “Analysis-by-Synthesis/Overlap-Add Sinusoidal Modeling Applied to The Analysis and Synthesis of Musical Tones”, Journal of the Audio Engineering Society, Vol. 40, No. 6, pp. 497-515 (1992).
Non-Patent Document 21: Pantazis, Y., Rosec, O. and Stylianou, Y., “Iterative Estimation of Sinusoidal Signal Parameters”, IEEE Signal Processing Letters, Vol. 17, No. 5, pp. 461-464 (2010).
Non-Patent Document 22: Abe, M. and Smith III, J. O., “Design Criteria for Simple Sinusoidal Parameter Estimation based on Quadratic Interpolation of FFT Magnitude Peaks”, Proc. AES 117th Convention (2004).
Non-Patent Document 23: Bonada, J., “Wide-Band Harmonic Sinusoidal Modeling”, Proc. DAFx-08, pp. 265-272 (2008).
Non-Patent Document 24: Ito, M. and Yano, M., “Sinusoidal Modeling for Nonstationary Voiced Speech based on a Local Vector Transform”, J. Acoust. Soc. Am., Vol. 121, No. 3, pp. 1717-1727 (2007).
Non-Patent Document 25: Pavlovets, A. and Petrovsky, A., “Robust HNR-based Closed-loop Pitch and Harmonic Parameters Estimation”, Proc. INTERSPEECH2011, pp. 1981-1984 (2011).
Non-Patent Document 26: Kameoka, H., Ono, N. and Sagayama, S., “Auxiliary Function Approach to Parameter Estimation of Constrained Sinusoidal Model for Monaural Speech Separation”, Proc. ICASSP 2008, pp. 29-32 (2008).
Non-Patent Document 27: Kawahara, H., Masuda-Katsuse, I. and de Cheveigne, A., “Restructuring Speech Representations Using a Pitch Adaptive Time-frequency Smoothing and an Instantaneous Frequency Based on F0 Extraction: Possible Role of a Repetitive Structure in Sounds”, Speech Communication, Vol. 27, pp. 187-207 (1999).
Non-Patent Document 28: Kawahara, H., Morise, M., Takahashi, T., Nishimura, R., Irino, T. and Banno, H., “Tandem-STRAIGHT: A Temporally Stable Power Spectral Representation for Periodic Signals and Applications to Interference-free Spectrum, F0, and Aperiodicity Estimation”, Proc. of ICASSP 2008, pp. 3933-3936 (2008).
Non-Patent Document 29: Akagiri, H., Morise, M., Irino, T., and Kawahara, H., “Evaluation and Optimization of F0-Adaptive Spectral Envelope Extraction Based on Spectral Smoothing with Peak Emphasis”, IEICE, Journal Vol. J94-A, No. 8, pp. 557-567 (2011).
Non-Patent Document 30: Morise, M., Matsubara, T., Nakano, K., and Nishiura, N., “A Rapid Spectrum Envelope Estimation Technique of Vowel for High-Quality Speech Synthesis”, IEICE, Journal Vol. J94-D, No. 7, pp. 1079-1087 (2011).
Non-Patent Document 31: Morise, M., “PLATINUM: A Method to Extract Excitation Signals for Voice Synthesis System”, Acoust. Sci. & Tech., Vol. 33, No. 2, pp. 123-125 (2012).
Non-Patent Document 32: Banno, H., Jinlin, L., Nakamura, S., Shikano, K., and Kawahara, H., “Efficient Representation of Short-Time Phase Based on Time-Domain Smoothed Group Delay”, IEICE, Journal Vol. J84-D-II, No. 4, pp. 621-628 (2001).
Non-Patent Document 33: Banno, H., Jinlin, L., Nakamura, S., Shikano, K., and Kawahara, H., “Speech Manipulation Method Using Phase Manipulation Based on Time-Domain Smoothed Group Delay”, IEICE, Journal Vol. J83-D-II, No. 11, pp. 2276-2282 (2000).
Non-Patent Document 34: Zolfaghari, P., Watanabe, S., Nakamura, A. and Katagiri, S., “Bayesian Modelling of the Speech Spectrum Using Mixture of Gaussians”, Proc. ICASSP 2004, pp. 553-556 (2004).
Non-Patent Document 35: Kameoka, H., Ono, N. and Sagayama, S., “Speech Spectrum Modeling for Joint Estimation of Spectral Envelope and Fundamental Frequency”, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 6, pp. 1507-1516 (2010).
Non-Patent Document 36: Akamine, M. and Kagoshima, T., “Analytic Generation of Synthesis Units by Closed Loop Training for Totally Speaker Driven Text to Speech System (TOS Drive TTS)”, Proc. ICSLP1998, pp. 1927-1930 (1998).
Non-Patent Document 37: Shiga, Y. and King, S., “Estimating the Spectral Envelope of Voiced Speech Using Multi-frame Analysis”, Proc. EUROSPEECH2003, pp. 1737-1740 (2003).
Non-Patent Document 38: Toda, T. and Tokuda, K., “Statistical Approach to Vocal Tract Transfer Function Estimation Based on Factor Analyzed Trajectory HMM”, Proc. ICASSP2008, pp. 3925-3928 (2008).
Non-Patent Document 39: Fujihara, H., Goto, M. and Okuno, H. G., “A Novel Framework for Recognizing Phonemes of Singing Voice in Polyphonic Music”, Proc. WASPAA2009, pp. 17-20 (2009).
Several conventional methods of estimating spectral envelopes and group delays assume that additional information such as pitch marks and phoneme transcriptions (phoneme labels) is available. Here, a pitch mark is time information indicating a driving point of a waveform (and the time of analysis) for analysis synchronized with the fundamental frequency. The time of excitation of the glottal sound source, or the time at which the amplitude is largest within a fundamental period, is used as a pitch mark. Such conventional methods require a large amount of information for analysis. In addition, the applicability of the estimated spectral envelopes and group delays is limited.
Accordingly, an object of the present invention is to provide an estimation system and an estimation method of spectral envelopes and group delays for sound analysis and synthesis, whereby spectral envelopes and group delays can be estimated from an audio signal with high accuracy and high temporal resolution for high-accuracy analysis and high-quality synthesis of voices (singing and speech).
Another object of the present invention is to provide a synthesis system and a synthesis method of an audio signal with higher synthesis performance than ever.
A further object of the present invention is to provide a computer-readable recording medium recorded with a program for estimating spectral envelopes and group delays for sound analysis and synthesis and a program for audio signal synthesis.
An estimation system of spectral envelopes and group delays for sound analysis and synthesis according to the present invention comprises at least one processor operable to function as a fundamental frequency estimation section, an amplitude spectrum acquisition section, a group delay extraction section, a spectral envelope integration section, and a group delay integration section. The fundamental frequency estimation section estimates F0s from an audio signal at all points of time or at all points of sampling. The amplitude spectrum acquisition section divides the audio signal into a plurality of frames, centering on each point of time or each point of sampling, by using a window having a window length changing or varying with F0 (fundamental frequency) at each point of time or each point of sampling, and performs Discrete Fourier Transform (DFT) analysis on the plurality of frames of the audio signal. Thus, the amplitude spectrum acquisition section acquires amplitude spectra at the respective frames. The group delay extraction section extracts group delays as phase frequency differentials at the respective frames by performing a group delay extraction algorithm accompanied by DFT analysis on the plurality of frames of the audio signal. The spectral envelope integration section obtains overlapped spectra at a predetermined time interval by overlapping the amplitude spectra corresponding to the frames included in a certain period which is determined based on a fundamental period of F0. Then, the spectral envelope integration section averages the overlapped spectra to sequentially obtain a spectral envelope for sound synthesis. The group delay integration section selects a group delay corresponding to a maximum envelope for each frequency component of the spectral envelope from the group delays at a predetermined time interval, and integrates the thus selected group delays to sequentially obtain a group delay for sound synthesis. 
According to the present invention, the overlapped spectra are obtained from amplitude spectra of the respective frames. Then, a spectral envelope for sound synthesis is sequentially obtained from the overlapped spectra thus obtained. From a plurality of group delays, a group delay is selected, corresponding to the maximum envelope of each frequency component of the spectral envelope. Group delays thus selected are integrated to sequentially obtain a group delay for sound synthesis. The spectral envelope for sound synthesis thus estimated has high accuracy. The group delay for sound synthesis thus estimated has higher accuracy than ever.
In the fundamental frequency estimation section, voiced segments and unvoiced segments are identified in addition to the estimation of F0s, and the unvoiced segments are interpolated with F0 values of the voiced segments or predetermined values are allocated to the unvoiced segments as F0. With this, spectral envelopes and group delays can be estimated in unvoiced segments in the same manner as in the voiced segments.
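As an illustrative sketch of this interpolation of unvoiced segments (the function name, default F0 value, and linear interpolation scheme are assumptions for illustration, not prescribed by this description):

```python
import numpy as np

def fill_unvoiced_f0(f0, voiced):
    """Fill unvoiced frames with F0 values interpolated from voiced frames.

    f0     : per-frame F0 estimates in Hz (values in unvoiced frames ignored)
    voiced : boolean array, True where the frame is voiced
    When no voiced frame exists, a predetermined default F0 is allocated.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = np.asarray(voiced, dtype=bool)
    if not voiced.any():
        # no voiced segment at all: allocate a predetermined value (assumed 100 Hz)
        return np.full_like(f0, 100.0)
    idx = np.arange(len(f0))
    # linear interpolation between voiced frames; edges extend the nearest voiced value
    return np.interp(idx, idx[voiced], f0[voiced])
```

With this, envelope and group delay estimation can proceed identically in unvoiced segments, since every frame carries an F0 value.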
In the spectral envelope integration section, the spectral envelope for sound synthesis may be obtained by arbitrary methods of averaging the overlapped spectra. For example, a spectral envelope for sound synthesis may be obtained by calculating a mean value of the maximum envelope and the minimum envelope of the overlapped spectra. Alternatively, a median value of the maximum envelope and the minimum envelope of the overlapped spectra may be used as a mean value to obtain a spectral envelope for sound synthesis. In this manner, a more appropriate spectral envelope can be obtained even if the overlapped spectra greatly fluctuate.
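A minimal sketch of the averaging step described above (not the patented implementation; the function name and the frames-by-bins array layout are assumptions): per frequency bin, the maximum and minimum envelopes of the overlapped spectra are taken and their mean is used as the integrated envelope.

```python
import numpy as np

def integrate_envelope(overlapped_spectra):
    """Average overlapped amplitude spectra via their max/min envelopes.

    overlapped_spectra : 2-D array of shape (n_frames, n_bins); amplitude
    spectra of the frames overlapping one synthesis time.
    Returns a single spectral envelope of shape (n_bins,).
    """
    s = np.asarray(overlapped_spectra, dtype=float)
    upper = s.max(axis=0)          # maximum envelope over the overlapped spectra
    lower = s.min(axis=0)          # minimum envelope
    return 0.5 * (upper + lower)   # mean of the two envelopes
```

Using the max/min envelopes rather than a plain frame average makes the result less sensitive to large fluctuation among the overlapped spectra, which is the motivation stated above.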
Preferably, the maximum envelope is transformed to fill in valleys of the minimum envelope, and the transformed minimum envelope thus obtained is used as the minimum envelope in calculating the mean value. The minimum envelope thus obtained may increase the naturalness of hearing impression of synthesized sounds.
Preferably in the spectral envelope integration section, the spectral envelope for sound synthesis is obtained by replacing amplitude values of the spectral envelope of frequency bins under F0 with a value of the spectral envelope at F0. This is because the estimated spectral envelope of frequency bins under F0 is unreliable. In this manner, the estimated spectral envelope of frequency bins under F0 becomes reliable, thereby increasing the naturalness of hearing impression of the synthesized sounds.
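The replacement below F0 can be sketched as follows (a hedged illustration; the function name, bin rounding, and parameters are assumptions): all envelope values in frequency bins below the bin nearest F0 are overwritten with the value at F0.

```python
import numpy as np

def replace_below_f0(envelope, f0, fs, n_fft):
    """Replace unreliable envelope values below F0 with the value at F0.

    envelope : spectral envelope sampled at n_fft//2 + 1 frequency bins
    f0       : fundamental frequency in Hz
    fs       : sampling frequency in Hz
    n_fft    : DFT size used for analysis
    """
    env = np.array(envelope, dtype=float)
    k_f0 = int(round(f0 * n_fft / fs))  # frequency bin nearest to F0
    env[:k_f0] = env[k_f0]              # flatten everything below F0
    return env
```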
A two-dimensional low-pass filter may be used to filter the replaced spectral envelope. Filtering can remove noise from the replaced spectral envelope, thereby furthermore increasing the naturalness of hearing impression of the synthesized sounds.
In the group delay integration section, it is preferred to store by frequency the group delays in the frames corresponding to the maximum envelopes for respective frequency components of the overlapped spectra, to compensate a time-shift of analysis of the stored group delays, and to normalize the stored group delays for use in sound synthesis. This is because the group delays spread along the time axis or in a temporal direction (at a time interval) according to a fundamental period corresponding to F0. Normalizing the group delays along the time axis may eliminate effects of F0 and obtain group delays transformable according to F0 at the time of resynthesizing.
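The time-shift compensation and normalization described above might look like the following sketch (the signature and the choice of normalizing to the unit interval are assumptions for illustration):

```python
import numpy as np

def normalize_group_delay(gd, t_shift, f0):
    """Compensate the analysis time-shift and normalize by the fundamental period.

    gd      : stored group delays in seconds, one value per frequency bin
    t_shift : time-shift of the analysis frame to compensate, in seconds
    f0      : local fundamental frequency in Hz
    Returns values normalized to [0, 1) of the fundamental period, so that
    they can be rescaled by any F0 at resynthesis time.
    """
    period = 1.0 / f0
    return np.mod((np.asarray(gd, dtype=float) - t_shift) / period, 1.0)
```

This matches the stated motivation: once the spread along the time axis is expressed as a fraction of the fundamental period, the effect of the analysis-time F0 is removed.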
Also in the group delay integration section, it is preferred to obtain the group delay for sound synthesis by replacing values of group delay of frequency bins under F0 with a value of the group delay at F0. This is because the estimated group delays of frequency bins under F0 are unreliable. In this manner, the estimated group delays of frequency bins under F0 become reliable, thereby increasing the naturalness of hearing impression of the synthesized sounds.
Further, in the group delay integration section, it is preferred to smooth the replaced group delays for use in sound synthesis. It is convenient for sound analysis and synthesis if the values of group delays change continuously.
Preferably, in smoothing the replaced group delays for use in sound synthesis, the replaced group delays are converted with the sin and cos functions to remove discontinuity due to the fundamental period; the converted group delays are then filtered with a two-dimensional low-pass filter; and the filtered group delays are converted back to their original form with the tan⁻¹ (arctangent) function for use in sound synthesis. Converting the group delays with the sin and cos functions makes them amenable to two-dimensional low-pass filtering.
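The sin/cos smoothing can be sketched as below (an assumption-laden illustration: the moving-average kernel stands in for whichever two-dimensional low-pass filter is actually used, and the function name is hypothetical). Smoothing the sine and cosine images instead of the raw values avoids artifacts at the wrap-around discontinuity.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def smooth_group_delay(gd, size=(3, 3)):
    """Smooth group delays (normalized to [0, 1) of the fundamental period).

    gd   : 2-D array (time x frequency) of normalized group delays
    size : kernel size of the 2-D low-pass (moving-average) filter
    """
    phase = 2.0 * np.pi * np.asarray(gd, dtype=float)
    # filter the sine and cosine components, which are continuous
    s = uniform_filter(np.sin(phase), size=size, mode="nearest")
    c = uniform_filter(np.cos(phase), size=size, mode="nearest")
    # convert back with the four-quadrant arctangent and renormalize to [0, 1)
    return np.mod(np.arctan2(s, c) / (2.0 * np.pi), 1.0)
```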
An audio signal synthesis system according to the present invention comprises at least one processor operable to function as a reading section, a conversion section, a unit waveform generation section, and a synthesis section. The reading section reads out, in a fundamental period for sound synthesis, the spectral envelopes and group delays for sound synthesis from a data file of the spectral envelopes and group delays for sound synthesis that have been estimated by the estimation system of spectral envelopes and group delays for sound analysis and synthesis according to the present invention. Here, the fundamental period for sound synthesis is a reciprocal of the fundamental frequency for sound synthesis. The spectral envelopes and group delays, which have been estimated by the estimation system, have been stored at a predetermined interval in the data file. The conversion section converts the read-out group delays into phase spectra. The unit waveform generation section generates unit waveforms based on the read-out spectral envelopes and the phase spectra. The synthesis section outputs a synthesized audio signal obtained by performing overlap-add calculation on the generated unit waveforms in the fundamental period for sound synthesis. The sound synthesis system according to the present invention can generally reproduce and synthesize the group delays and attain high-quality naturalness of the synthesized sounds.
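A minimal sketch of the unit waveform generation and overlap-add steps (not the claimed system itself; the function name, FFT sizing, and the centering of the impulse response are assumptions): each read-out envelope is combined with its phase spectrum, inverse-transformed into a unit waveform, and overlap-added at the fundamental period for synthesis.

```python
import numpy as np

def synthesize(envelopes, phases, f0_syn, fs, n_fft):
    """Overlap-add synthesis from spectral envelopes and phase spectra.

    envelopes : (n_units, n_fft//2 + 1) amplitude envelopes read out at
                each synthesis pulse
    phases    : matching phase spectra in radians (e.g. obtained by
                converting group delays into phase)
    f0_syn    : fundamental frequency for synthesis, in Hz
    """
    period = int(round(fs / f0_syn))          # fundamental period in samples
    out = np.zeros(len(envelopes) * period + n_fft)
    for i, (env, ph) in enumerate(zip(envelopes, phases)):
        spectrum = env * np.exp(1j * ph)      # combine amplitude and phase
        unit = np.fft.irfft(spectrum, n_fft)  # unit waveform (one-period response)
        unit = np.roll(unit, n_fft // 2)      # center the response in the frame
        t = i * period
        out[t:t + n_fft] += unit              # overlap-add at the pulse time
    return out
```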
The audio signal synthesis system according to the present invention may include a discontinuity suppression section which suppresses an occurrence of discontinuity along the time axis in a low frequency range of the read-out group delays before the conversion section converts the read-out group delays. Providing the discontinuity suppression section may furthermore increase the naturalness of synthesis quality.
The discontinuity suppression section is preferably configured to smooth group delays in the low frequency range after adding an optimal offset to the group delay for each voiced segment and re-normalizing the group delay. Smoothing in this manner may eliminate instability of the group delays in the low frequency range. In smoothing the group delays, it is preferred to convert the read-out group delays with the sin and cos functions, to subsequently filter the converted group delays with a two-dimensional low-pass filter, and then to convert the filtered group delays back to their original form with the tan⁻¹ (arctangent) function for use in sound synthesis. Thus, two-dimensional low-pass filtering is enabled, thereby facilitating the smoothing.
Further, the audio signal synthesis system according to the present invention preferably includes a compensation section which multiplies the group delays by the fundamental period for sound synthesis as a multiplier coefficient after the conversion section converts the group delays or before the discontinuity suppression section suppresses the discontinuity. With this, it is possible to normalize the group delays which spread along the time axis (at a time interval) according to the fundamental period corresponding to F0, thereby obtaining more accurate phase spectra.
The synthesis section is preferably configured to convert an analysis window into a synthesis window and perform overlap-add calculation in the fundamental period on compensated unit waveforms obtained by windowing the unit waveforms by the synthesis window. The unit waveforms compensated with such synthesis window may increase the naturalness of hearing impression of the synthesized sounds.
An estimation method of spectral envelopes and group delays according to the present invention is implemented on at least one processor to execute a fundamental frequency estimation step, an amplitude spectrum acquisition step, a group delay extraction step, a spectral envelope integration step, and a group delay integration step. In the fundamental frequency estimation step, F0s are estimated from an audio signal at all points of time or at all points of sampling. In the amplitude spectrum acquisition step, the audio signal is divided into a plurality of frames, centering on each point of time or each point of sampling, by using a window having a window length changing or varying with F0 at each point of time or each point of sampling; Discrete Fourier Transform (DFT) analysis is performed on the plurality of frames of the audio signal; and amplitude spectra are thus acquired at the respective frames. In the group delay extraction step, group delays are extracted as phase frequency differentials at the respective frames by performing a group delay extraction algorithm accompanied by DFT analysis on the plurality of frames of the audio signal. In the spectral envelope integration step, overlapped spectra are obtained at a predetermined time interval by overlapping the amplitude spectra corresponding to the frames included in a certain period which is determined based on a fundamental period of F0; and the overlapped spectra are averaged to sequentially obtain a spectral envelope for sound synthesis. In the group delay integration step, a group delay is selected, corresponding to the maximum envelope for each frequency component of the spectral envelope from the group delays at a predetermined time interval, and the thus selected group delays are integrated to sequentially obtain a group delay for sound synthesis.
A program for estimating spectral envelopes and group delays for sound analysis and synthesis adapted to implement the above-mentioned method on a computer is recorded in a non-transitory computer-readable recording medium.
An audio signal synthesis method according to the present invention is implemented on at least one processor to execute a reading step, a conversion step, a unit waveform generation step, and a synthesis step. In the reading step, the spectral envelopes and group delays for sound synthesis are read out, in a fundamental period for sound synthesis, from a data file of the spectral envelopes and group delays for sound synthesis that have been estimated by the estimation method of spectral envelopes and group delays according to the present invention. Here, the fundamental period for sound synthesis is a reciprocal of the fundamental frequency for sound synthesis, and the spectral envelopes and group delays that have been estimated by the estimation method according to the present invention have been stored at a predetermined interval in the data file. In the conversion step, the read-out group delays are converted into phase spectra. In the unit waveform generation step, unit waveforms are generated based on the read-out spectral envelopes and the phase spectra. In the synthesis step, a synthesized audio signal, which has been obtained by performing overlap-add calculation on the generated unit waveforms in the fundamental period for sound synthesis, is output.
A program for audio signal synthesis adapted to implement the above-mentioned audio signal synthesis method on a computer is recorded in a non-transitory computer-readable recording medium.
Now, embodiments of the present invention will be described below in detail.
The estimation system 1 of spectral envelopes and group delays estimates a spectral envelope for sound synthesis as shown in
[Estimation of Spectral Envelopes and Group Delays]
In this embodiment of the present invention, first, a method of obtaining spectral envelopes and group delays for sound synthesis will briefly be described below.
In this embodiment of the estimation system 1 of spectral envelopes and group delays according to the present invention (see
The amplitude spectrum acquisition section 5 performs F0-adaptive analysis as shown in step ST3 of
Specifically, in this embodiment, a Gaussian window ω(τ) of formula (1) with the window length changing according to F0 is used for windowing as shown in
The Gaussian window with σ(t) = 1/(3F0(t)) means that the analysis window length corresponds to two fundamental periods (2 × 3σ(t) = 2/F0(t)). This window length is also used in PSOLA analysis and is known to give a good approximation of the local spectral envelope (refer to Non-Patent Document 1).
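Such an F0-adaptive Gaussian window might be generated as follows (a sketch under stated assumptions: the function name, the truncation at ±3σ, and the sampling-rate handling are illustrative, not taken from formula (1)):

```python
import numpy as np

def f0_adaptive_gaussian_window(f0, fs):
    """Gaussian analysis window whose length tracks two fundamental periods.

    sigma(t) = 1 / (3 * F0(t)) seconds, so the effective window length
    2 * 3 * sigma(t) equals 2 / F0(t), i.e. two fundamental periods.
    f0 : local fundamental frequency in Hz
    fs : sampling frequency in Hz
    """
    sigma = 1.0 / (3.0 * f0)               # standard deviation in seconds
    half = int(round(3.0 * sigma * fs))    # truncate at 3 sigma per side, in samples
    tau = np.arange(-half, half + 1) / fs  # time axis centered on the frame
    return np.exp(-0.5 * (tau / sigma) ** 2)
```

For example, at F0 = 100 Hz and fs = 16 kHz the window spans 2/F0 = 20 ms, i.e. two fundamental periods centered on the analysis time.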
Next, the amplitude spectrum acquisition section 5 performs Discrete Fourier Transform (DFT) including Fast Fourier Transform (FFT) analysis on the divided frames X1 to Xn of the audio signal. Thus, the amplitude spectra Y1 to Yn of the respective frames X1 to Xn are obtained.
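The F0-adaptive windowing and FFT steps above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation; the function name and the zero handling at signal edges are assumptions, and the 4096-point FFT length matches the one used in the experiments described later.

```python
import numpy as np

def f0_adaptive_spectrum(signal, center, f0, sr, n_fft=4096):
    """Amplitude spectrum of one frame, windowed by a Gaussian window
    whose standard deviation follows F0: sigma(t) = 1/(3*F0(t)), so the
    effective window length 2*3*sigma(t) equals two fundamental periods
    2/F0(t).  `center` is the analysis time in samples."""
    sigma = sr / (3.0 * f0)                 # sigma in samples
    half = int(round(3.0 * sigma))          # truncate at +-3 sigma
    tau = np.arange(-half, half + 1)
    window = np.exp(-0.5 * (tau / sigma) ** 2)
    idx = center + tau
    frame = np.zeros(tau.size)
    valid = (idx >= 0) & (idx < signal.size)
    frame[valid] = signal[idx[valid]] * window[valid]
    return np.abs(np.fft.rfft(frame, n_fft))
```

For a stationary sinusoid at F0, the resulting amplitude spectrum peaks at the bin nearest F0, which is a quick sanity check on the window length.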
The spectral envelope integration section 9 overlaps a plurality of amplitude spectra corresponding to the plurality of frames included in a certain period, which is determined based on the fundamental period (1/F0) of F0, at a predetermined interval, namely, in a discrete time of spectral envelope (at an interval of 1 ms in this embodiment). Thus, overlapped spectra are obtained. Then, a spectral envelope SE for sound synthesis is sequentially obtained by averaging the overlapped spectra.
Any method may be employed to obtain “a spectral envelope for sound synthesis” by averaging the overlapped spectra. In this embodiment, a spectral envelope for sound synthesis is obtained by calculating a mean value of the maximum envelope and the minimum envelope (at step ST55). Alternatively, a median value of the maximum envelope and the minimum envelope may be used as the averaged value. In these manners, an appropriate spectral envelope can be obtained even if the overlapped spectra fluctuate greatly.
In this embodiment, the maximum envelope is transformed to fill in the valleys of the minimum envelope at step ST54. Such a transformed envelope is used as the minimum envelope. Such a transformed minimum envelope can increase the naturalness of hearing impression of the synthesized sound.
In the spectral envelope integration section 9, at step ST56, the amplitude values of the spectral envelope of frequency bins under F0 are replaced with the amplitude value of a spectral envelope of frequency bin at F0 for use in the sound synthesis. This is because the spectral envelope of frequency bins under F0 is unreliable. With such replacement, the spectral envelope of frequency bins under F0 becomes reliable, thereby increasing the naturalness of hearing impression of the synthesized sound.
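The replacement at step ST56 can be sketched as below; the function name and the frequency-to-bin mapping are illustrative assumptions.

```python
import numpy as np

def replace_sub_f0_bins(envelope, f0, sr, n_fft=4096):
    """Replace the unreliable amplitude values of the frequency bins
    below F0 with the amplitude value of the bin at F0 (step ST56)."""
    f0_bin = int(round(f0 * n_fft / sr))
    out = envelope.copy()
    out[:f0_bin] = out[f0_bin]
    return out
```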
As described above, step ST50 (steps ST51 through ST56) is performed every predetermined time (1 ms), and a spectral envelope is estimated in each unit time (1 ms). In this embodiment, at step ST57, the replaced spectral envelope is filtered with a two-dimensional low-pass filter. Filtering can remove noise from the replaced spectral envelope, thereby furthermore increasing the naturalness of hearing impression of the synthesized sound.
In this embodiment, the spectral envelope is defined as a mean value of the maximum value (the maximum envelope) and the minimum value (the minimum envelope) of the spectra in the range of integration (at step ST55). The maximum envelope is not simply used as a spectral envelope because the possibility of some sidelobe effect of the analysis window should be considered. On the other hand, a number of valleys due to F0 remain in the minimum envelope, and such a minimum envelope cannot readily be used as a spectral envelope. Then, in this embodiment, the maximum envelope is transformed to overlap the minimum envelope, thereby eliminating the valleys of the minimum envelope while maintaining the contour of the minimum envelope (at step ST54).
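The integration of overlapped spectra into one envelope (steps ST54 and ST55) can be sketched as follows. The exact transformation of the maximum envelope is not fully specified in this excerpt; the uniform log-domain gain used here to lower the maximum envelope onto the level of the minimum envelope is an assumed, simplified stand-in for it.

```python
import numpy as np

def integrate_spectra(overlapped):
    """Integrate amplitude spectra of frames overlapped within the
    integration range: take the bin-wise maximum and minimum envelopes,
    transform the maximum envelope so it fills in the F0 valleys of the
    minimum envelope while keeping its smooth contour (step ST54), and
    average the two (step ST55).  `overlapped` is (n_frames, n_bins)."""
    max_env = overlapped.max(axis=0)
    min_env = overlapped.min(axis=0)
    eps = 1e-12
    # assumed transformation: match the average log level of the two
    gain = np.exp(np.mean(np.log(min_env + eps) - np.log(max_env + eps)))
    filled_min = np.maximum(min_env, max_env * gain)
    return 0.5 * (max_env + filled_min)
```

By construction the result stays between the minimum and maximum envelopes at every bin.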
The group delay integration section 11 as shown in
Since the thus obtained group delay spreads along the time axis, according to the fundamental period corresponding to F0, the group delay is normalized along the time axis. The group delay corresponding to the maximum envelope at frequency f is expressed in formula (2).
ĝ(f,t) <Formula (2)>
The value of frequency bin corresponding to n×F0(t) is expressed in formula (3).
ĝ(f_{n×F0(t)}, t) <Formula (3)>
The fundamental period (1/F0(t)) and the value of the frequency bin of formula (3) are used to normalize the group delay. The normalized group delay g(f,t) is expressed in formula (4).
g(f,t)=mod((ĝ(f,t)−ĝ(f_{n×F0(t)}, t))×F0(t), 1) <Formula (4)>
Here, mod(x,y) denotes the remainder of the division of x by y.
An offset due to different times of analysis is eliminated as shown in Formula (5).
ĝ(f,t)−ĝ(f_{n×F0(t)}, t) <Formula (5)>
Here, n=1 is used, or n=1.5 where the analysis may be unreliable in the proximity of n=1; in such a case, a more reliable result may be obtained based on the value between these harmonics.
As described above, the group delay g(f,t) is normalized in the range of (0,1). However, the following problems remain unsolved due to the division by the fundamental period and integration in the range of the fundamental period.
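The normalization described above can be sketched as follows; the function name is an assumption. Subtracting the reference value at the n×F0 bin removes the offset due to the time of analysis, and dividing by the fundamental period 1/F0(t) (here, multiplying by F0(t)) followed by the remainder operation maps the group delay into [0, 1).

```python
import numpy as np

def normalize_group_delay(g_hat, g_hat_ref, f0):
    """Normalize the group delay into [0, 1): subtract the reference
    value at the n*F0 bin (formula (5)), divide by the fundamental
    period 1/F0, and take the remainder mod 1 (formula (4))."""
    return np.mod((g_hat - g_hat_ref) * f0, 1.0)
```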
(Problem 1) Discontinuity occurs along the frequency axis.
(Problem 2) Step-like discontinuity occurs along the time axis.
Solutions to these problems will be described below.
First, Problem 1 relates to discontinuity due to the fundamental period around F0=318.6284 Hz, 1.25 kHz, 1.7 kHz, etc. as shown in
gπ(f,t)=(g(f,t)×2π)−π <Formula (6)>
gx(f,t)=cos(gπ(f,t)) <Formula (7)>
gy(f,t)=sin(gπ(f,t)) <Formula (8)>
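The mapping in the formulas above places the normalized group delay on the unit circle, so that g=0 and g=1 become the same point and the wrap-around discontinuity along the frequency axis disappears. A minimal sketch (function name assumed):

```python
import numpy as np

def to_circle(g):
    """Map the normalized group delay g in [0, 1) onto the unit circle:
    g_pi = g*2*pi - pi, then gx = cos(g_pi), gy = sin(g_pi)."""
    g_pi = g * 2.0 * np.pi - np.pi
    return np.cos(g_pi), np.sin(g_pi)
```

Values just below 1 and values at 0 map to nearly identical (gx, gy) pairs, which is what makes smoothing across the wrap point safe.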
Next, Problem 2 is similar to a problem with the estimation of spectral envelopes. This is due to the periodic occurrence of waveform driving. Here, in order to solve the problem for the purpose of sound analysis and synthesis, it is convenient if the period changes continuously. For this purpose, gx(f,t) and gy(f,t) are smoothed in advance.
Last, as with the spectral envelopes, since components of frequency bins under F0 are not reliably estimated in many cases, the normalized group delays of frequency bins under F0 are replaced with the value of frequency bin at F0.
Now, how to implement the group delay integration section 11 which operates as described above by using a program installed on a computer will be described below.
In smoothing the group delays, as shown in
The spectral envelopes and group delays obtained in the manner described so far are stored in a memory 13 of
[Sound Synthesis Based on Spectral Envelopes and Group Delays]
In order to use in sound synthesis the spectral envelopes and normalized group delays obtained as described so far, as with conventional sound analysis and synthesis systems, expansion and contraction of the time axis and amplitude control are performed and F0 for sound synthesis is specified. Then, a unit waveform is sequentially generated based on the specified F0 and spectral envelopes for sound synthesis as well as the normalized group delays. Overlap-add calculation is performed on the generated unit waveforms, thereby synthesizing sound. An audio signal synthesis system 2 of
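The overlap-add step of the synthesis described above can be sketched as follows; the function name and the simple rounding of placement times are assumptions, and the unit waveforms are taken as already generated.

```python
import numpy as np

def overlap_add(unit_waveforms, f0_syn, sr, length):
    """Overlap-add pre-generated unit waveforms, placing one waveform
    every fundamental period for synthesis (1/f0_syn seconds)."""
    out = np.zeros(length)
    period = sr / f0_syn          # fundamental period in samples
    t = 0.0
    i = 0
    while t < length and i < len(unit_waveforms):
        u = unit_waveforms[i]
        start = int(round(t))
        n = min(u.size, length - start)
        out[start:start + n] += u[:n]
        t += period
        i += 1
    return out
```

Where the unit waveform is longer than one period, adjacent placements sum in the overlapping region.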
As shown in
In the embodiment as shown in
The discontinuity suppression section 23 re-normalizes the group delays by adding the optimal offset to the group delay for each voiced segment, and then smoothes the group delays in the low frequency range at step ST102B.
In this embodiment, the audio signal synthesis system further comprises a compensation section 25 operable to multiply the group delays by the fundamental period for sound synthesis as a multiplier coefficient after the conversion section 17 of
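The compensation (multiplying the normalized group delays by the fundamental period for synthesis) together with the conversion to a phase spectrum can be sketched as follows. The cumulative-sum integration is an assumed discretization of the relation that the group delay is the negative derivative of phase with respect to angular frequency; the function name is also an assumption.

```python
import numpy as np

def group_delay_to_phase(g_norm, f0_syn, sr, n_fft):
    """Scale normalized group delays by the fundamental period for
    synthesis (compensation), then integrate along frequency to obtain
    a phase spectrum (group delay = -d(phase)/d(omega))."""
    gd = g_norm / f0_syn                  # group delay in seconds
    d_omega = 2.0 * np.pi * sr / n_fft    # bin spacing in rad/s
    return -np.cumsum(gd) * d_omega
```

A constant group delay then yields the expected linear phase.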
In this embodiment, the unit waveform generation section 19 generates unit waveforms by converting the analysis window to the synthesis window and windowing the unit waveform by the synthesis window. The synthesis section 21 performs overlap-add calculation on the generated unit waveforms in the fundamental period.
The use of the unit waveforms thus compensated with the synthesis window can help improve the naturalness of hearing impression of synthesized sound.
The calculation performed at step ST102B will be described below in detail. The group delay g(f,t) is finally recovered by the following calculation from gx(f,t) and gy(f,t), which were converted with the cos and sin functions, respectively.
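Recovering g(f,t) from the smoothed circular representation can be sketched with the four-quadrant arctangent (function name assumed):

```python
import numpy as np

def from_circle(gx, gy):
    """Recover the normalized group delay g in [0, 1) from its circular
    representation gx = cos(g_pi), gy = sin(g_pi), inverting
    g_pi = g*2*pi - pi via the four-quadrant arctangent."""
    g_pi = np.arctan2(gy, gx)
    return np.mod((g_pi + np.pi) / (2.0 * np.pi), 1.0)
```

Round-tripping through the circular representation reproduces the original normalized group delay.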
Where the formant frequency fluctuates, the contour of an estimated group delay may change sharply, significantly affecting the synthesis quality when the power is large in the low frequency range. This can be considered to be caused when the fluctuation due to F0 as described before (see
[Experiments]
Regarding the accuracy of estimating the spectral envelopes by the method according to this embodiment of the present invention, the proposed method was compared with two previous methods known to have high accuracy, STRAIGHT (refer to Non-Patent Document 27) and TANDEM-STRAIGHT (refer to Non-Patent Document 28). An unaccompanied male singing sound (solo vocal) was taken from the RWC Music Database (Goto, M., Hashiguchi, H., Nishimura, T. and Oka, R., “RWC Music Database for Experiments: Music and Instrument Sound Database”, authorized by the copyright holders and available for study and experiment purposes, Information Processing Society of Japan (IPSJ) Journal, Vol. 45, No. 3, pp. 728-738 (2004)) (Music Genre: RWC-MDB-G-2001 No. 91). A female spoken sound was taken from the AIST Humming Database (E008) (Goto, M. and Nishimura, T., “AIST Humming Database: Music Database for Singing Research”, IPSJ SIG Report, 2005-MUS-61, pp. 7-12 (2005)). Instrument sounds, piano and violin sounds, were taken from the RWC Music Database as described above (Piano: RWC-MDB-I-2001, No. 01, 011PFNOM) and (Violin: RWC-MDB-I-2001, No. 16, 161VLGLM). All spectral envelopes were represented with 2049 frequency bins (4096 FFT length), which is frequently used in STRAIGHT, and the unit time of analysis was set to 1 ms. In the embodiment described so far, the temporal resolution means the discrete time step of executing the integration process every 1 ms in the multi-frame integration analysis.
Regarding the estimation of group delays, the analysis results of the synthesized sound with group delays reflected were compared with the analysis results of natural sound. Here, unlike the estimation experiments of spectral envelopes, 4097 frequency bins (FFT length of 8192) were used in order to secure the estimation accuracy of group delays.
[Experiment A: Comparison of Spectral Envelopes]
In this experiment, the analysis results of natural sound were compared with the STRAIGHT spectral envelopes.
In
[Experiment B: Reproduction of Spectral Envelopes]
In this experiment, the accuracy of spectral envelope estimation was evaluated using synthesized sound with known spectral envelopes and F0. Specifically, this experiment used the sounds analyzed and synthesized by STRAIGHT from the natural sound and instrument sound samples described before, and sounds synthesized by a cascade-type Klatt synthesizer (Klatt, D. H., “Software for a Cascade/Parallel Formant Synthesizer”, J. Acoust. Soc. Am., Vol. 67, pp. 971-995 (1980)) with parameter-controlled spectral envelopes.
A list of parameters given to the Klatt synthesizer is shown in Table.
Here, the values of the first and second formant frequencies (F1 and F2) were set to those shown in Table 2 to generate spectral envelopes. Sinusoidal waves were overlapped with the fundamental frequency of 125 Hz to synthesize six kinds of sounds from the generated spectral envelopes.
The following log-spectral distance (LSD) was used in the evaluation of estimation accuracy. Here, T stands for the number of voiced frames, F for the number of frequency bins (=FH−FL+1), (FL,FH) for the frequency range for the evaluation, and Sg(t,f) and Se(t,f) for the ground-truth spectral envelope and an estimated spectral envelope, respectively. Further, α(t) stands for a normalization factor determined by minimizing the square error ε² between Sg(t,f) and α(t)Se(t,f) in order to calculate the log-spectral distance.
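The LSD formula itself is not reproduced in this excerpt; the sketch below uses a common dB-scale form consistent with the symbols described, with α(t) computed as the per-frame least-squares gain that minimizes ε² between Sg(t,f) and α(t)Se(t,f).

```python
import numpy as np

def log_spectral_distance(S_g, S_e, fl=0, fh=None):
    """Log-spectral distance (dB) between the ground-truth envelopes
    S_g and estimates S_e, both of shape (T, n_bins), over the
    evaluation range (fl, fh).  Each frame of S_e is scaled by the
    least-squares gain alpha(t) before the distance is taken."""
    if fh is None:
        fh = S_g.shape[1] - 1
    Sg = S_g[:, fl:fh + 1]
    Se = S_e[:, fl:fh + 1]
    # alpha(t) minimizing sum_f (Sg - alpha*Se)^2 per frame
    alpha = np.sum(Sg * Se, axis=1, keepdims=True) / np.sum(Se * Se, axis=1, keepdims=True)
    d = 20.0 * np.log10(Sg / (alpha * Se))
    return np.sqrt(np.mean(d ** 2))
```

Because of the per-frame gain, an estimate that is a uniform scaling of the ground truth scores a distance of zero.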
Table 3 shows the evaluation results and
[Experiment C: Reproduction of Group Delays]
[Other Remarks]
In this embodiment, the amplitude ranges in which the estimated spectral envelopes lie were also estimated, which can be utilized in voice timbre conversion, transformation of spectral contour, unit-selection and concatenation synthesis, etc.
In this embodiment, group delays can also be stored for synthesis. With the conventional techniques (Non-Patent Documents 32 and 33), smoothing group delays does not improve the synthesis quality. In contrast, the technique proposed in this disclosure can properly fill in the valleys of the envelope by integrating a plurality of frames. In addition, according to the embodiment of the present invention, more detailed analysis is available beyond single pitch-marking analysis since the group delay resonates at a different time for each frequency band. As shown in
The present invention is not limited to the embodiment described so far. Various modifications and variations fall within the scope of the present invention.
According to the present invention, spectral envelopes and phase information can be analyzed with high accuracy and high temporal resolution from voice and instrument sounds, and high quality sound synthesis can be attained while maintaining the analyzed spectral envelopes and phase information. Further, according to the present invention, audio signals can be analyzed, regardless of the difference in sound kind, without needing additional information such as the pitch marks [time information indicating a driving point of a waveform (and the time of analysis) in analysis synchronized with the fundamental frequency, the time of excitation of a glottal sound source, or the time at which the amplitude becomes maximum in the fundamental period] and phoneme information.
Number | Date | Country | Kind |
---|---|---|---|
2012-171513 | Aug 2012 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2013/070609 | 7/30/2013 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2014/021318 | 2/6/2014 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
5602959 | Bergstrom | Feb 1997 | A |
6115684 | Kawahara et al. | Sep 2000 | A |
20120243705 | Bradley | Sep 2012 | A1 |
20120265534 | Coorman | Oct 2012 | A1 |
Number | Date | Country |
---|---|---|
10-97287 | Apr 1998 | JP |
Entry |
---|
Nakatani, Tomohiro, and Toshio Irino. “Robust and accurate fundamental frequency estimation based on dominant harmonic components.” The Journal of the Acoustical Society of America 116.6 (2004): 3690-3700. |
Abe, Toshihiko, Takao Kobayashi, and Satoshi Imai. “Robust pitch estimation with harmonics enhancement in noisy environments based on instantaneous frequency.” Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on. vol. 2. IEEE, 1996. |
Kawahara, Hideki. “Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited.” Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on. vol. 2. IEEE, 1997. |
Duncan, G., B. Yegnarayana, and Hema A. Murthy. “A nonparametric method of formant estimation using group delay spectra.” Acoustics, Speech, and Signal Processing, 1989. ICASSP-89., 1989 International Conference on. IEEE, 1989. |
Blauert, J., and P. Laws. “Group delay distortions in electroacoustical systems.” The Journal of the Acoustical Society of America 63.5 (1978): 1478-1483. |
Klatt, D.H.: “Software for a Cascade/parallel Formant Synthesizer”, J. Acoust. Soc. Am., vol. 67, pp. 971-995 (1980). |
Goto, M. and Nishimura, T.: “AIST Humming Database: Music Database for Singing Research”, IPSJ, SIG Technical Report, 2005-MUS-61, pp. 7-12 (2005). |
Goto, M., Hashiguchi, H., Nishimura, T. and Oka, R.: “RWC Music Database: Database of Copyright-cleared Musical Pieces and Instrument Sounds for Research Purposes”, IPSJ, Transaction vol. 45, No. 3, pp. 728-738 (2004). |
Fujihara, H., Goto, M. and Okuno, H.G: “A Novel Framework for Recognizing Phonemes of Singing Voice in Polyphonic Music”, Proc. WASPAA2009, pp. 17-20 (2009). |
Toda, T. and Tokuda, K.: “Statistical Approach to Vocal Tract Transfer Function Estimation Based on Factor Analyzed Trajectory HMM”, Proc. ICASSP2008, pp. 3925-3928 (2008). |
Shiga, Y. and King, S.: “Estimating the Spectral Envelope of Voiced Speech Using Multi-frame Analysis”, Proc. EUROSPEECH2003, pp. 1737-1740 (2003). |
Akamine, M. and Kagoshima, T.: “Analytic Generation of Synthesis Units by Closed Loop Training for Totally Speaker Driven Text-to-Speech System (TOS Drive TTS)”, Proc. ICSLP1998, pp. 1927-1930 (1998). |
Kameoka, H., Ono, N. and Sagayama, S.: “Speech Spectrum Modeling for Joint Estimation of Spectral Envelope and Fundamental Frequency”, IEEE Transactions on Audio, Speech, and Language Processing vol. 18, No. 6, pp. 1507-1516 (2010). |
Zolfaghari, R, Watanabe, S. Nakamura, A. and Katagiri, S.: “Bayesian Modelling of the Speech Spectrum Using Mixture of Gaussians”, Proc. ICASSP 2004, pp. 553-556 (2004). |
Banno, H., Lu, J., Nakamura, S., Shikano, K. and Kawahara, H.: “Speech Manipulation Method Using Phase Manipulation Based on Time-Domain Smoothed Group Delay”, IEICE, Journal vol. J83-D-11, pp. 2276-2282 (2000). |
Banno, H., Lu, J., Nakamura, S., Shikano, K. and Kawahara, H.: “Efficient Representation of Short-Time Phase Based on Time-Domain Smoothed Group Delay”, IEICE, Journal vol. J84-D-II, No. 4, pp. 621-628 (2001). |
Morise, M: Platinum: “A Method to Extract Excitation Signals for Voice Synthesis System”, Acoust. Sci. & Tech., vol. 33, No. 2, pp. 123-125 (2012). |
Morise, M., Matsubara, T., Nakano, K. and Nishiura, T: “A Rapid Spectrum Envelope Estimation Technique of Vowel for High-Quality Speech Synthesis”, IEICE, Journal vol. J94-D, No. 7, pp. 1079-1087 (2011). |
Akagiri, H., Morise, M., Irino, T. and Kawahara, H.: “Evaluation and Optimization of F0-Adaptive Spectral Envelope Extraction Based on Spectral Smoothing with Peak Emphasis”, IEICE, Journal, vol. J94-A, No. 8, pp. 557-567 (2011). |
Kawahara, H., Morise, M., Takahashi, T., Nishimura, R., Irino, T. and Banno, H.: “TANDEM-STRAIGHT: A Temporally Stable Power Spectral Representation for Periodic Signals and Applications to Interference-free Spectrum, F0, and Aperiodicity Estimation”, Proc. of ICASSP 2008, pp. 3933-3936 (2008). |
Kawahara, H., Masuda-Katsuse, I. and De Cheveigne, A.: “Restructuring Speech Representations Using a Pitch-Adaptive Time-frequency Smoothing and an Instantaneous-Frequency-Based F0 Extraction: Possible Role of a Repetitive Structure in Sounds”, Speech Communication, vol. 27, pp. 187-207 (1999). |
Kameoka, H., Ono, N. and Sagayama, S.: “Auxiliary Function Approach to Parameter Estimation of Constrained Sinusoidal Model for Monaural Speech Separation”, Proc. ICASSP 2008, pp. 29-32 (2008). |
Pavlovets, A. and Petrovsky, A.: “Robust HNR-based Closed-loop Pitch and Harmonic Parameters Estimation”, Proc. INTERSPEECH2011, pp. 1981-1984 (2011). |
Ito, M. and Yano, M.: “Sinusoidal Modeling for Nonstationary Voiced Speech based on a Local Vector Transform”, J. Acoust. Soc. Am., vol. 121, No. 3, pp. 1717-1727 (2007). |
Bonada, J.: “Wide-Band Harmonic Sinusoidal Modeling”, Proc. DAFx-08, pp. 265-272 (2008). |
Abe, M. and Smith III, J.O.: “Design Criteria for Simple Sinusoidal Parameter Estimation based on Quadratic Interpolation of FFT Magnitude Peaks”, Proc. AES 117th Convention (2004). |
Pantazis, Y., Rosec, O. and Stylianou, Y.: “Iterative Estimation of Sinusoidal Signal Parameters”, IEEE Signal Processing Letters, vol. 17, No. 5, pp. 461-464 (2010). |
George, E. and Smith, M.: “Analysis-by-Synthesis/Overlap—Add Sinusoidal Modeling Applied to the Analysis and Synthesis of Musical Tones”, Journal of the Audio Engineering Society, vol. 40, No. 6, pp. 497-515 (1992). |
Depalle, P. and H'Elie, T.: “Extraction of Spectral Peak Parameters Using a Short-time Fourier Transform Modeling and No Sidelobe Windows”, Proc. WASPAA1997 (1997). |
Stylianou, Y.: “Harmonic plus Noise Models for Speech, combined with Statistical Methods, for Speech and Speaker Modification”, PhD Thesis. |
Serra, X. and Smith, J.: “Spectral Modeling Synthesis: a Sound Analysis/Synthesis Based on a Deterministic Plus Stochastic Decomposition”, Computer Music Journal, vol. 14, No. 4, pp. 12-24 (1990). |
Smith, J. and Serra, X.: “PARSHL: an Analysis/Synthesis Program for Non-harmonic Sounds Based on a Sinusoidal Representation”, Proc. ICMC 1987, pp. 290-297 (1987). |
McAulay, R. and Quatieri, T.: “Speech Analysis/Synthesis Based on a Sinusoidal Representation”, IEEE Trans. ASSP, vol. 34, No. 4, pp. 744-755 (1986). |
Moulines, E. and Charpentier, F.: “Pitch-synchronous Waveform Processing Techniques for Text-to-speech Synthesis Using Diphones”, Speech Communication, vol. 9, No. 5-6, pp. 453-467 (1990). |
Villavicencio, F., Robel, A. and Rodet, X.: “Improving LPC Spectral Envelope Extraction of Voiced Speech by True-Envelope Estimation”, Proc. ICASSP2006, pp. 869-872 (2006). |
Villavicencio, F., Robel, A. and Roidet, X.: “Extending Efficient Spectral Envelope Modeling to Mel-frequency Based Representation”, Proc. ICASSP2008, pp. 1625-1628 (2008). |
Robel, A. and Rodet, X.: “Efficient Spectral Envelope Estimation and Its Application to Pitch Shifting and Envelope Preservation”, Proc. DAFx2005, pp. 30-35 (2005). |
Imai, S. and Abe, Y.: “Spectral Envelope Extraction by Improved Cepstral Methods”, IEICE, Journal, vol. J62-A, No. 4, pp. 217-223 (1979). |
Tokuda, K., Kobayashi, T., Masuko, T. and Imai, S.: “Melgeneralized Cepstral Analysis—A Unified Approach to Speech Spectral Estimation”, Proc. ICSLP1994, pp. 1043-1045 (1994). |
Atal, B.S. and Hanauer, S.: “Speech Analysis and Synthesis by Linear Prediction of the Speech Wave”, J. Acoust. Soc. Am., vol. 50, No. 4, pp. 637-655 (1971). |
Itakura, F. and Saito, S.: “Analysis Synthesis Telephony based on the Maximum Likelihood Method”, Reports of the 6th Int. Cong. on Acoustics., vol. 2, No. C-5-5, pp. C17-C20 (1968). |
Griffin, D. W.: “Multi-Band Excitation Vocoder”, RLE Technical Report 524, Massachusetts Institute of Technology, Research Laboratory of Electronics (1987). |
Flanagan, J. and Golden, R.M., “Phase Vocoder”, Bell System Technical Journal, vol. 45, pp. 1493-1509 (1966). |
Hamagami, T.: “Speech Synthesis Using Source Wave Shape Modification Technique by Harmonic Phase Control”, Acoustical Society of Japan, Journal, vol. 54, No. 9, pp. 623-631 (1998). |
Matsubara, T., Morise, M. and Nishiura, T.: “Perceptual Effect of Phase Characteristics of the Voiced Sound in High-Quality Speech Synthesis”, Acoustical Society of Japan, Technical Committee of Psychological and Physiological Acoustics Papers, vol. 40, No. 8, pp. 653-658 (2010). |
Ito, M. and Yano, M.: “Perceptual Naturalness of Time-scale Modified Speech”, IEICE Technical Report EA2007-114, pp. 13-18 (2008). |
Zolzer, U. and Amatriain, X.: “DAFX—Digital Audio Effects”, Wiley (2002). |
Lin K-S, et al: “Speech Applications with a General Purpose Digital Signal Processor”, Proceedings of the Region 5 Conference (Mar. 1985). |
Banno, Hideki, et al: “Efficient Representation of Short-Time Phase Based on Group Delay”, The Transactions of the Institute of Electronics, Information and Communication Engineers, vol. J84-D-II(Apr. 2001). |
Kawahara, H., et al: “Restructuring Speech Representations Using a Pitch-Adaptive Time-Frequency Smoothing and an Instantaneous-Frequency-Based F0 Extraction: Possible Role of a Repetitive Structure in Sounds”, Speech Communication, Elsevier Science Publishers, Amsterdam NL, vol. 27, No. 3-4 (Apr. 1999). |
European Search Report dated Feb. 12, 2016. |
Number | Date | Country | |
---|---|---|---|
20150302845 A1 | Oct 2015 | US |