The invention generally relates to a technique for generating an audio file of a person singing without the person actually singing. In particular, the invention relates to a system and method for generating an audio file of a person singing from the text of a song.
Current commercial text-to-speech (TTS) systems are able to generate high-quality speech. These systems are generally limited to the generation of spoken content in a single voice. However, interest is emerging in techniques for performing identity transformation such as Voice Conversion and Speaker Adaptation. Although there have not been many attempts to extend TTS capacity to singing voice generation, some work has been done on what has been referred to as Speech-to-Singing (STS) transformation. In a pioneering work, psycho-acoustical aspects referred to as vibration and ringing-ness were found to significantly affect the “singing-ness” of the voice. Although an STS schema was proposed using a music score together with F0, spectral, and duration information, there is still a need for a technique that provides a realistic singing voice from text.
The preferred embodiment of the present invention is a technique to enhance the quality of Text-to-Speech (TTS) based Singing Voice generation. Speech-to-singing refers to techniques that transform a spoken voice into singing, mainly by manipulating the duration and pitch of a spoken version of a song's lyrics. The present invention efficiently preserves the speaker identity and improves sound quality (e.g., reducing hoarseness) by incorporating speaker-independent natural singing information into TTS-based Speech-to-Singing (STS). We use TTS as the input speech on a TSTS-like schema to build what we denote for simplicity as a Template-based Text-to-Singing (TTTS) system. Moreover, we propose: 1) enhanced singing generation by integrating singer-independent features from natural singing into a baseline TTSing engine, and 2) the use of a personalized TTS system (i.e., one to which a target speaker identity is applied) as input speech, so that new “virtual singers” can be easily generated from a small amount of adaptation data.
Some embodiments of the invention also include a technique to stretch a vowel segment in such a way that is suitable for singing. Recordings of vowels enunciated at several pitch levels are acquired and their acoustic information used to enhance the timbre of the singing voice. In addition, acoustic information from a singing template is used to further balance the voicing features and energy contours to reduce hoarseness and energy fading.
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, and in which:
Illustrated in
To generate a singing voice for a particular song, the TTTS system requires an a cappella version of the song (the singer's voice without the instrumental portion), the corresponding instrumental content, and the song lyrics. The a cappella version is phonetically labeled to produce template timing, including the time position of each phonetic unit. The template pitch contour is also extracted from the a cappella version using SAC, a method for estimating pitch information from an audio file that is robust for singing voice. SAC is taught by Emilia Gomez and Jordi Bonada in “Towards computer-assisted flamenco transcription: An experimental comparison of automatic transcription algorithms as applied to a cappella singing,” published in Computer Music Journal at vol. 37, no. 2, pp. 73-90, 2013, which is hereby incorporated by reference herein. The lyrics of the song, phonetic labels, and pitch contours represent one of a plurality of singing templates, collectively represented as the template asset in database 110.
The lyrics 112 are transmitted to the TTS system 120, which generates data 122 including (a) phonetic timing information from the lyrics and (b) acoustic features, including Mel-generalized cepstrum (MGC), band aperiodicity (BAP), and fundamental frequency (F0), which are used in a vocoder such as WORLD for waveform generation. The TTS system 120 includes a voice model that is pre-trained on a large speaker corpus and then adapted to the voice of a particular individual. The voice model is adapted to the individual speaker's voice by parameter adaptation using, for example, one hour of speech acquired with one or more microphones 100. In our work, we employ the WORLD vocoder for both the TTS system 120 and the waveform generator. The TTS-based features 122 generated by the TTS system 120 are then transmitted to the phonetic alignment module 130.
The phonetic alignment module 130, also referred to herein as block A, aligns the TTS-based features in duration to match the timing 132 for each phoneme in the singing template. That is, the durations of the phonemes derived from the text are aligned with the durations of the corresponding phonemes in the singing template. The acoustic features 132 of the phonemes after phonetic alignment are then transmitted to the ENTW module 140.
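The duration alignment performed by block A can be illustrated with a minimal sketch. It assumes that each phoneme is resampled uniformly by linear interpolation and that phoneme boundaries are given as frame index pairs with an exclusive end; the function name `align_to_template` and this boundary format are hypothetical and are not taken from the specification.

```python
import numpy as np

def align_to_template(feats, tts_bounds, tpl_bounds):
    """Resample each TTS phoneme segment to the template phoneme duration.

    feats      : (num_frames, order) array of TTS acoustic features (e.g. MGC)
    tts_bounds : list of (start, end) frame indices per phoneme, end exclusive
    tpl_bounds : list of (start, end) frame indices per phoneme in the template
    Assumption: uniform (linear) resampling inside each phoneme.
    """
    feats = np.asarray(feats, dtype=float)
    pieces = []
    for (s, e), (ts, te) in zip(tts_bounds, tpl_bounds):
        seg = feats[s:e]
        n_out = te - ts                       # template duration in frames
        idx = np.arange(len(seg))             # source frame positions
        src = np.linspace(0, len(seg) - 1, n_out)  # where each output frame reads
        # Interpolate every feature dimension independently
        piece = np.stack(
            [np.interp(src, idx, seg[:, k]) for k in range(seg.shape[1])],
            axis=1,
        )
        pieces.append(piece)
    return np.concatenate(pieces, axis=0)
```

After this step every phoneme has exactly the number of frames of its counterpart in the singing template, so later blocks can operate frame by frame against the template.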
The ENTW module 140 receives the acoustic features 132 after the phonetic alignment and modifies the features so that they observe acoustic conditions more suitable for a singing voice. In particular, the ENTW module 140 applies a nonlinear time warping function d(n) to the MGC and BAP from the TTS 120 so as to elongate the vowels. In the following discussion, for any building block, say block H, the notations XH and YH refer to the MGC and BAP output from block H, respectively. The first index refers to the frame number and the second index refers to the MGC order for XH and the BAP order for YH. Subscript T denotes the features of the template song.
In the preferred embodiment, the ENTW module 140 uniformly stretches each vowel segment to match the vowel duration of the singing template. As a result, low-energy frames found at the beginning and at the end of a segment may be elongated. In the preferred embodiment, the ENTW module 140 applies a nonlinear time warping function d(n) to MGC and BAP in such a way that vowel elongation is concentrated near the middle of a spoken vowel. This is acoustically consistent and avoids over-lengthening the border, i.e., the last section of frames generated by the TTS in a vowel ending a word, which generally exhibits lower energy and/or weaker spectral features. Utilizing the relationship between the first MGC coefficient and the sum of the logarithms of the filter bank energies, we approximate the relative energy contour using C0.
For a given vowel segment, let N1 be the first frame and N2 be the last frame of the segment. Our warping function is defined as:
If d(n) is not an integer, the value of XB (d(n), k) is approximated using linear interpolation. BAP and F0 are warped in the same fashion as MGC using the same function d(n). Intuitively, the high energy frames are stretched while the lower energy frames are compressed in such a way that the segment length remains the same, as d(N1)=N1 and d(N2)=N2. The warping affects, i.e., is applied to, only vowel segments.
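The exact expression for d(n) is not reproduced in this text, so the following is only a sketch of one warping function that satisfies the stated properties: fixed endpoints (d(N1)=N1 and d(N2)=N2, here with N1=0), high-energy frames stretched, low-energy frames compressed, and linear interpolation at non-integer positions. The weighting exp(C0) and the `floor` parameter are assumptions made for illustration, as are the names `energy_warp` and `warp_features`.

```python
import numpy as np

def energy_warp(c0, floor=1e-3):
    """Return d(n) for one vowel segment, indexed from 0 to len(c0) - 1.

    c0 : per-frame C0 values, used as a log-domain energy proxy.
    Assumption: frame weight = exp(c0 - max(c0)) + floor (not from the source).
    """
    c0 = np.asarray(c0, dtype=float)
    n = len(c0)
    frames = np.arange(n, dtype=float)
    w = np.exp(c0 - c0.max()) + floor          # strictly positive weights
    seg = 0.5 * (w[1:] + w[:-1])               # weight of each frame interval
    cum = np.concatenate(([0.0], np.cumsum(seg)))
    cum /= cum[-1]                             # cum[0] = 0, cum[-1] = 1
    out_pos = cum * (n - 1)                    # output position of each input frame
    # d is the inverse mapping: which input position each output frame reads
    return np.interp(frames, out_pos, frames)

def warp_features(feats, d):
    """Sample feats at fractional frame positions d using linear interpolation."""
    feats = np.asarray(feats, dtype=float)
    idx = np.arange(feats.shape[0])
    return np.stack(
        [np.interp(d, idx, feats[:, k]) for k in range(feats.shape[1])], axis=1
    )
```

Because the output positions of high-energy input frames are spread far apart, more output frames read from them, which is the stretching behavior described above; the segment length is unchanged.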
The effect of ENTW module 140 is illustrated in waveform outputs in
The timing information at the top of
After the ENTW module 140, the high-energy interval frames are elongated, which is illustrated by the waveform in
In the preferred embodiment, the TTTS system also performs interpolation of the vocalic timbre based on F0. Our “vocalic library” 150 refers to a collection of recordings of vowel exemplars in which a skilled singer sings each vowel at different pitch levels (e.g., low, mid, and high). We found that recordings at different pitch levels have different spectral envelopes, so exemplars at several pitches are needed for accuracy. The recording process is done offline once, and the vocalic library 150 can be used with any singing voice.
For each vowel segment provided as input to the ENTW module 140, the phonetic label (extracted from the template timing) is used to query which vowel exemplars to use from the vocalic library 150. The vocalic timbre interpolator 160 then determines the best pitch level(s) with which to construct the exemplar features based upon the pitch in the song template (F0T). Since the limited number of pitch levels cannot cover all pitch values, we estimate the MGC features XC (n, k) at a certain pitch by linear interpolation from the exemplars whose average F0 is closest to the F0 of that particular frame. It is possible that the vocoder may detect a voiced frame as unvoiced, so we select the minimum BAP value of the exemplars (higher voicing degree), i.e., YC (n, k)=min{Y1(n, k), Y2(n, k)} for each frequency bin k and frame n.
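The per-frame timbre interpolation can be sketched as follows. The sketch assumes one MGC vector and one BAP vector per exemplar for the frame in question, interpolation weights that are linear in Hz, and clamping when the template F0 lies outside the recorded pitch range; none of these details are stated in the specification, and the function name `interpolate_timbre` is hypothetical.

```python
import numpy as np

def interpolate_timbre(f0_target, ex_f0, ex_mgc, ex_bap):
    """Estimate exemplar MGC and BAP for one frame at the template pitch.

    f0_target : template F0 of the frame, in Hz
    ex_f0     : average F0 of each exemplar (at least two distinct values)
    ex_mgc    : (num_exemplars, mgc_order) MGC vectors for this frame
    ex_bap    : (num_exemplars, bap_order) BAP vectors for this frame
    """
    ex_f0 = np.asarray(ex_f0, dtype=float)
    order = np.argsort(ex_f0)
    ex_f0 = ex_f0[order]
    ex_mgc = np.asarray(ex_mgc, dtype=float)[order]
    ex_bap = np.asarray(ex_bap, dtype=float)[order]

    # Pick the two neighbouring pitch levels around the target
    hi = int(np.clip(np.searchsorted(ex_f0, f0_target), 1, len(ex_f0) - 1))
    lo = hi - 1
    # Interpolation weight, clamped outside the recorded range (assumption)
    a = np.clip((f0_target - ex_f0[lo]) / (ex_f0[hi] - ex_f0[lo]), 0.0, 1.0)

    mgc = (1.0 - a) * ex_mgc[lo] + a * ex_mgc[hi]   # X_C(n, k)
    bap = np.minimum(ex_bap[lo], ex_bap[hi])        # Y_C(n, k) = min{Y1, Y2}
    return mgc, bap
```

Taking the element-wise minimum of the two BAP vectors keeps the more strongly voiced value in each band, which guards against a voiced frame being treated as unvoiced.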
The acoustic feature fusion module 170, also known as block D, generates the resulting acoustic features (MGC, BAP, F0) 172 after processing the ones given as input by ENTW module 140 and vocalic timbre interpolator 160. These acoustic features 172 are then used in a vocoder, i.e., waveform reconstruction module 180, to generate the sound waveform from three sources of information: TTS-based features 142, vowel exemplars 162, and the singing template.
The acoustic feature fusion module 170 generates a hybrid MGC by merging the MGC derived from the lyrics with the MGC derived from the singing voice. In particular, the acoustic feature fusion module 170 keeps the first K coefficients of XB (n, k) but replaces the remaining coefficients, starting with the (K+1)-th coefficient, with the MGC from the singing voice, namely XC (n, k). Thus, the low-order coefficients are derived from the phonemes of the lyrics after time warping, while the high-order coefficients are derived from the singing voice.
From our inspection, we found that K=30 is an appropriate order that adds some spectral content (from the exemplar voice) to high frequencies while still maintaining the identity of the virtual singer. Note that this procedure is only executed in vowel frames.
To reduce an abrupt change of the MGC values at the vowel segment boundaries, we gradually increase the effect of the exemplar coefficient values when transitioning from non-vowel frames to vowel frames. We achieve this by using a ramp function with 4 defining points as (M1, M2, M3, M4) in order (shown in
XV(n,k)=rV(n)XC(n,k)+(1−rV(n))XB(n,k)
for k>=K and XV (n, k)=XB (n, k) for k<K.
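The ramped fusion above can be sketched for a single vowel segment. The shape of the ramp is an assumption, since the figure that defines it is not available here: the sketch uses a trapezoid that is 0 before M1, rises linearly to 1 at M2, holds at 1 until M3, and falls back to 0 at M4. The names `ramp` and `fuse_mgc` are hypothetical.

```python
import numpy as np

def ramp(n, m1, m2, m3, m4):
    """Trapezoidal ramp r_V(n); requires m1 < m2 < m3 < m4 (assumed shape)."""
    return np.interp(n, [m1, m2, m3, m4], [0.0, 1.0, 1.0, 0.0])

def fuse_mgc(XB, XC, K, m1, m2, m3, m4):
    """X_V(n,k) = r_V(n) X_C(n,k) + (1 - r_V(n)) X_B(n,k) for k >= K,
    and X_V(n,k) = X_B(n,k) for k < K (coefficient index k starts at 0)."""
    XB = np.asarray(XB, dtype=float)
    XC = np.asarray(XC, dtype=float)
    r = ramp(np.arange(XB.shape[0]), m1, m2, m3, m4)[:, None]
    XV = XB.copy()                          # low-order coefficients untouched
    XV[:, K:] = r * XC[:, K:] + (1.0 - r) * XB[:, K:]
    return XV
```

With K=30, as suggested in the text, only coefficients of index 30 and above are blended toward the exemplar, and the blend fades in and out over the ramp so that no jump appears at the vowel boundaries.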
In addition, we utilize the energy contour and spectral tilt of the template to further enhance the features. To do so, we take the average of XV and XT instead of only XT in order to avoid amplitude instability in the reconstructed waveform, which can occur when the modified C0 contour significantly differs from the original values. We found that applying the same process for the second coefficient C1 also makes the output have more singing characteristics.
We found that the above process works well for sonorant phonemes (e.g., vowels, semivowels, or nasals). However, obstruent phonemes (plosives, fricatives, and affricates) are short and turbulent, making the process unreliable. For this reason, we keep the intervals near the boundaries close to the output of the baseline, with a margin of length M as leeway, when applying a ramp function to transitions between obstruents and sonorants. The ramp function rD (n) is defined by the points (N1−L−M, N1−M, N2+M, N2+M+L), where L is the ramp length and N1 and N2 are the first and last samples of the obstruent segment, respectively. The fused features are then given by:
XD(n,k)=rD(n)XB(n,k)+(1−rD(n))((XV(n,k)+XT(n,k))/2)
for k=0 and 1.
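A sketch of this C0 and C1 fusion around one obstruent segment follows. It assumes the same trapezoidal ramp shape as a plausible reading of the four defining points (equal to 1 between N1−M and N2+M, so the baseline is kept there), and it assumes that coefficients with k>=2 are passed through from XV unchanged, which the text does not state. The name `fuse_energy` is hypothetical.

```python
import numpy as np

def fuse_energy(XB, XV, XT, n1, n2, L, M):
    """X_D(n,k) = r_D(n) X_B(n,k) + (1 - r_D(n)) (X_V(n,k) + X_T(n,k)) / 2
    for k = 0 and 1, around a single obstruent segment [n1, n2].

    Requires L > 0 and M >= 0 so the ramp points are strictly increasing.
    Assumption: coefficients with k >= 2 are copied from X_V unchanged.
    """
    XB = np.asarray(XB, dtype=float)
    XV = np.asarray(XV, dtype=float)
    XT = np.asarray(XT, dtype=float)
    n = np.arange(XB.shape[0])
    points = [n1 - L - M, n1 - M, n2 + M, n2 + M + L]
    r = np.interp(n, points, [0.0, 1.0, 1.0, 0.0])[:, None]   # r_D(n)
    XD = XV.copy()
    avg = 0.5 * (XV[:, :2] + XT[:, :2])     # average of fused and template C0, C1
    XD[:, :2] = r * XB[:, :2] + (1.0 - r) * avg
    return XD
```

For several obstruent segments in one utterance, one option would be to compute a ramp per segment and take the element-wise maximum before blending.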
In 5 to 8 seconds, the harmonic structure in
The TTTS singing voice is generated by the waveform reconstruction module 180, also referred to herein as block E, using the WORLD vocoder with the time-aligned features (MGC and BAP) and the template pitch contour 172. The short-term energy contour of the synthesized singing is scaled by the amplitude scaling module 190, also referred to herein as block F, to match that of the template. Finally, the resulting singing voice is mixed with the corresponding instrumental content, and the complete waveform is transmitted to an audio speaker 199, for example, for the benefit of the user.
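The amplitude scaling of block F can be sketched as a frame-wise gain that matches the short-term RMS energy of the synthesized singing to that of the template. The frame length, hop size, use of RMS as the energy measure, and the name `match_energy` are assumptions chosen for illustration and are not taken from the specification.

```python
import numpy as np

def match_energy(synth, template, frame=400, hop=200, eps=1e-8):
    """Scale synth so its short-term RMS contour follows the template's.

    synth, template : 1-D waveforms at the same sampling rate
    frame, hop      : analysis window and hop in samples (assumed values)
    """
    n = min(len(synth), len(template))
    synth = np.asarray(synth[:n], dtype=float)
    template = np.asarray(template[:n], dtype=float)
    starts = np.arange(0, max(n - frame, 0) + 1, hop)

    def rms(x):
        return np.array([np.sqrt(np.mean(x[s:s + frame] ** 2)) for s in starts])

    gain = rms(template) / (rms(synth) + eps)     # one gain per analysis frame
    centers = starts + frame / 2.0
    # Interpolate the frame gains to a per-sample gain curve
    g = np.interp(np.arange(n), centers, gain)
    return synth * g
```

The scaled voice could then be added sample by sample to the instrumental track to obtain the final mix.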
Illustrated in
We have presented a TTS-based singing framework as well as techniques to enhance the singing voice output. The energy-based nonlinear time warping (ENTW) algorithm appropriately stretches and compresses different portions of each vowel to reduce low-energy intervals. The timbre of the signal is enhanced by supplementary vowel recordings from our vocalic library. The feature fusion algorithm combines the information from the enhanced timbre, the ENTW output, and the reference template to improve the contours of energy and aperiodicity of the singing voice. The listening test validates that the enhanced singing was perceived as having higher quality than the baseline framework without the enhancement techniques. Additionally, the enhancement techniques are flexible enough to use with different voices. Future work will include validating the system with more languages. In addition, we plan to further investigate the characteristics that differ between speech and singing, such as the dynamics of formant frequencies, aperiodicity, and consonants. We also plan to develop other enhancement techniques and utilize other useful information from the template reference to further improve the quality of the singing voices.
One or more embodiments of the present invention may be implemented with one or more computer readable media, wherein each medium may be configured to include thereon data or computer executable instructions for manipulating data. The computer executable instructions include data structures, objects, programs, routines, or other program modules that may be accessed by a processing system, such as one associated with a general-purpose computer, processor, electronic circuit, or module capable of performing various different functions or one associated with a special-purpose computer capable of performing a limited number of functions. Computer executable instructions cause the processing system to perform a particular function or group of functions and are examples of program code means for implementing steps for methods disclosed herein. Furthermore, a particular sequence of the executable instructions provides an example of corresponding acts that may be used to implement such steps. Examples of computer readable media include random-access memory (“RAM”), read-only memory (“ROM”), programmable read-only memory (“PROM”), erasable programmable read-only memory (“EPROM”), electrically erasable programmable read-only memory (“EEPROM”), compact disk read-only memory (“CD-ROM”), or any other device or component that is capable of providing data or executable instructions that may be accessed by a processing system. Examples of mass storage devices incorporating computer readable media include hard disk drives, magnetic disk drives, tape drives, optical disk drives, and solid state memory chips, for example. The term processor as used herein refers to a number of processing devices including electronic circuits such as personal computing devices, servers, general purpose computers, special purpose computers, application-specific integrated circuit (ASIC), and digital/analog circuits with discrete components, for example.
Although the description above contains many specifications, these should not be construed as limiting the scope of the invention but as merely providing illustrations of some of the presently preferred embodiments of this invention.
Therefore, the invention has been disclosed by way of example and not limitation, and reference should be made to the following claims to determine the scope of the present invention.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 62/757,594 filed Nov. 8, 2018, titled “Enhanced virtual singers generation by incorporating singing dynamics to personalized text-to-speech-to-singing,” which is hereby incorporated by reference herein for all purposes.