The present application claims priority to Singapore Patent Application No. 201201581-4, filed 6 Mar. 2012.
The present invention generally relates to voice synthesis, and more particularly relates to a system and method for template-based personalized singing synthesis.
There has been a constant increase in the direct impact of computer-based music technology on the entertainment industry, from the use of Linear Predictive Coding (LPC) to synthesize singing voices in a computer in the 1960s to present-day synthesis technology. For example, singing voice synthesis technology, such as synthesis of singing voices from a spoken voice reading the lyrics, has many applications in the entertainment industry. The advantage of singing synthesis by speech-to-singing conversion is that the timbre of the voice is easily preserved; thus, a higher singing voice quality is easier to achieve and a personalized singing voice can be generated. However, one of the biggest difficulties is that it is not easy to generate a natural melody from a musical score when synthesizing a singing voice.
Based on the source used in the generation of singing, singing voice synthesis can be classified into two categories. In the first category, singing voices are synthesized from the lyrics of a song, which is called lyrics-to-singing synthesis (LTS). Singing voices in the second category are generated from spoken utterances of the lyrics of the song. This is called speech-to-singing (STS) synthesis.
In LTS synthesis, corpus-based methods, such as wave concatenation synthesis and Hidden Markov Model (HMM) synthesis, are mostly used. This is more practical than traditional systems using methods such as vocal tract physical modeling and formant-based synthesis.
Compared to LTS synthesis, STS synthesis has received far less attention. However, STS synthesis can enable a user to produce and listen to his/her singing voice merely by reading the lyrics of songs. For example, STS synthesis can modify the singing of an unprofessional singer by correcting the imperfect parts to improve the quality of his/her voice. As the synthesized singing preserves the timbre of the speaker, the synthesized singing will sound like it is being sung by the speaker, making it possible to create a professional-quality singing voice for poor singers.
However, present STS systems are complex and/or difficult to implement by the end-user. In one conventional method, the singing voice is generated by manually modifying the F0 contour, phoneme duration, and spectrum of a speaking voice. Another STS system has been proposed where the F0, phoneme duration, and spectrum are automatically controlled and modified based not only on the information from the music score of the song, but also on its tempo. A system for synthesizing singing voices in Chinese has also been proposed, yet it requires inputting not only the Chinese speech and the lyrics of the song, but also the music score. The fundamental frequency contour of the synthesized singing voice is generated from the pitch of the score, and the duration is controlled using a piecewise-linear function in order to generate the singing voices.
Thus, what is needed is a system and method for speech-to-singing synthesis which reduces complexity of the synthesis as well as simplifying operations by the end user. Furthermore, other desirable features and characteristics will become apparent from the subsequent detailed description and the appended claims, taken in conjunction with the accompanying drawings and this background of the disclosure.
According to the Detailed Description, a method for speech-to-singing synthesis is provided. The method includes deriving characteristics of a singing voice for a first individual and modifying vocal characteristics of a voice for a second individual in response to the characteristics of the singing voice of the first individual to generate a synthesized singing voice for the second individual.
In accordance with another aspect, a method for speech-to-singing synthesis is provided. The method includes deriving a template of first speech characteristics and first singing characteristics in response to a first individual's speaking voice and singing voice and extracting second speech characteristics from a second individual's speaking voice. The method also includes modifying the second speech characteristics in accordance with the template to generate the second individual's approximated singing voice and aligning acoustic features of the second individual's approximated singing voice in response to the first speech characteristics, the first singing characteristics and the second speech characteristics to generate the second individual's synthesized singing voice.
In accordance with yet another aspect, a method for speech-to-singing synthesis is provided. The method includes extracting pitch contour information and alignment information from a singing voice of a first individual and extracting alignment information and a spectral parameter sequence from a spoken voice of a second individual. The method further includes generating alignment information from the alignment information of the singing voice of the first individual and the alignment information of the spoken voice of the second individual and converting the spectral parameter sequence from the spoken voice of the second individual in response to the alignment information to generate a converted spectral parameter sequence. Finally, the method includes synthesizing a singing voice for the second individual in response to the converted spectral parameter sequence and the pitch contour information of the singing voice of the first individual.
The accompanying figures, where like reference numerals refer to identical or functionally similar elements throughout the separate views and which together with the detailed description below are incorporated in and form part of the specification, serve to illustrate various embodiments and to explain various principles and advantages in accordance with a present embodiment.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been depicted to scale. For example, the dimensions of some of the elements in the block diagrams or flowcharts may be exaggerated in respect to other elements to help to improve understanding of the present embodiments.
The following detailed description is merely exemplary in nature and is not intended to limit the invention or the application and uses of the invention. Furthermore, there is no intention to be bound by any theory presented in the preceding background of the invention or the following detailed description. It is the intent of this invention to present a template-based speech-to-singing (STS) conversion system, in which a template singing voice from an individual, such as a professional singer, is used in the system to synthesize a singing voice from another individual's speaking voice.
Unlike previous techniques that estimate the acoustic features for synthesized singing voices based on the music score of a song, operation in accordance with the present embodiment generates a singing voice merely from a speaking voice reading the lyrics. A user's spoken voice is converted into singing using the timbre of the speaker's voice while applying the melody of a professional voice. In this manner, a singing voice is generated from speech reading the lyrics of a song. The acoustic features are modified based on the differences between singing and speaking voices determined by analyzing and templating singing and speaking voices from the same person. Thus, a music score of a song is advantageously not required as an input, reducing the complexity of the operation of the system and, consequently, making it easier for end-users. In addition, a natural pitch contour is acquired from an actual singing voice without needing to modify a step contour to account for F0 fluctuations such as overshoot and vibrato. This can potentially improve the naturalness and quality of the synthesized singing. Also, by aligning singing and speaking voices automatically, there is no need to perform manual segmentation for the speech, thereby enabling a truly automatic STS system.
Thus, in accordance with the present embodiment, a template-based STS system converts speaking voices into singing voices by automatically modifying the acoustic features of the speech with the help of pre-recorded template voices. Referring to
In the learning stage 102, the template singing voice 110 and the template speaking voice 112 are analyzed to extract the Mel-Frequency Cepstral Coefficients (MFCC) 114, short-time energy (not shown), voiced and unvoiced (VUV) information 116, fundamental frequency (F0) contour 118, and spectrum (not shown). The MFCC 114, energy and VUV 116 are used as acoustic features in the alignment 120 of the singing and speech in order to accommodate their differences in timing and achieve an optimal mapping between them. In accordance with the present embodiment, dynamic time warping (DTW) is used for the alignment 120. The transformation models for the F0 contour 118 (i.e., the F0 modeling 122) and phoneme duration (including the duration modeling 124 and the spectrum modeling 126) are then derived based on the synchronization information (i.e., the synchronization index 128) obtained.
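As a non-limiting illustration of this learning-stage feature extraction, the following Python sketch uses librosa as an assumed toolkit (the specification does not name one); the function name, hop size, and F0 search range are illustrative assumptions.

```python
import numpy as np
import librosa

def extract_alignment_features(wav_path, sr=16000, n_mfcc=12, hop=160):
    """Extract the alignment features named above: MFCC, short-time (log) energy,
    and a voiced/unvoiced (VUV) flag derived from an F0 estimate."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    energy = librosa.feature.rms(y=y, hop_length=hop)                 # short-time energy
    f0, voiced_flag, _ = librosa.pyin(y, fmin=65, fmax=600,
                                      sr=sr, hop_length=hop)          # F0 contour
    vuv = voiced_flag.astype(float)[np.newaxis, :]                    # VUV as 0/1 per frame
    n = min(mfcc.shape[1], energy.shape[1], vuv.shape[1])
    feats = np.vstack([mfcc[:, :n], np.log(energy[:, :n] + 1e-8), vuv[:, :n]])
    return feats, f0[:n]
```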
In the transformation stage 104, features are extracted for the new speaking voice 130, which is usually uttered by a different person from the template speaker. These features are the MFCC, the short-time energy, the VUV information, the F0 contour, and the spectrum. These are modified (i.e., F0 modification 132, a phoneme duration modification 134 and a spectrum modification 136) to approximate those of the singing voice based on the transformation models, generating an F0 contour 140, VUV information 142, an aperiodicity (AP) index 144, and spectrum information 146.
After these features have been modified, the singing voice is synthesized 150 in the last stage 106. For enhancing the musical effect, a backing track and a reverberation effect may be added 152 to the synthesized singing. In our implementation, the analysis of speech and singing voices as well as the singing voice synthesis are carried out using STRAIGHT, a high-quality speech analysis, modification and synthesis system that is an extension of the classical channel VOCODER.
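The specification uses STRAIGHT, which is a MATLAB toolkit; purely as an assumed stand-in, the sketch below uses the freely available WORLD vocoder (pyworld) to illustrate the same decomposition into F0 contour, spectral envelope, and aperiodicity, and the corresponding resynthesis. It is not the implementation described above.

```python
import numpy as np
import soundfile as sf
import pyworld as pw  # WORLD vocoder, used here only as a stand-in for STRAIGHT

def analyze(wav_path, frame_period_ms=5.0):
    """Decompose a voice into F0 contour, spectral envelope, and aperiodicity."""
    x, fs = sf.read(wav_path)
    if x.ndim > 1:                                    # fold stereo to mono if needed
        x = x.mean(axis=1)
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.harvest(x, fs, frame_period=frame_period_ms)   # F0 contour
    sp = pw.cheaptrick(x, f0, t, fs)                           # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, fs)                                  # aperiodicity (AP) index
    return f0, sp, ap, fs

def resynthesize(f0, sp, ap, fs, frame_period_ms=5.0):
    """Reconstruct a waveform from the (possibly modified) parameters."""
    return pw.synthesize(f0, sp, ap, fs, frame_period_ms)
```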
The point of entry and duration of each phoneme in a singing voice are certain to be different from those in a speaking voice. The two voices 110, 112 are therefore aligned 120 before deriving the transformation models 122, 124, 126 and carrying out the acoustic feature conversion 104. The quality of the synthesized singing voice is largely dependent on the accuracy of these alignment results. In accordance with the present embodiment, a two-step DTW-based alignment method using multiple acoustic features is employed at the alignment 120.
Prior to the alignment 120, silence is removed from the signals to be aligned. This silence is detected based on energy and spectral centroid, and removal of the silence in accordance with the present embodiment improves the accuracy of alignment. MFCC 114, short-time energy (not shown) and voiced/unvoiced regions 116 are then extracted as acoustic features for deriving aligned data. The MFCC 114 are popular features used in Automatic Speech Recognition (ASR) and are computed as the cosine transform of the real logarithm of the short-time power spectrum on a Mel-warped frequency scale. Since the same lyrics having equal syllables are uttered in both singing 110 and speech 112, the voiced and unvoiced regions 116 can provide useful information for the alignment 120 and, hence, are extracted as a feature prior to the alignment 120.
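A minimal sketch of the silence-removal step is given below, again assuming librosa; the thresholding scheme (fractions of each signal's own energy and centroid ranges) and all parameter values are illustrative assumptions, since the specification does not give them.

```python
import numpy as np
import librosa

def remove_silence(y, sr, hop=160, frame=400, energy_frac=0.1, centroid_frac=0.15):
    """Drop hop-sized chunks whose short-time energy and spectral centroid both
    fall below thresholds set relative to the signal's own range."""
    energy = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=frame,
                                                 hop_length=hop)[0]
    n = min(len(energy), len(centroid))
    e_thr = energy.min() + energy_frac * (energy.max() - energy.min())
    c_thr = centroid.min() + centroid_frac * (centroid.max() - centroid.min())
    keep = (energy[:n] > e_thr) | (centroid[:n] > c_thr)   # silence = both measures low
    chunks = [y[i * hop:(i + 1) * hop] for i in np.flatnonzero(keep)]
    return np.concatenate(chunks) if chunks else y
```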
Besides the raw features 114, 116, the Delta and Acceleration (Delta-Delta) of these features 114, 116 are also calculated. Frame- and parameter-level normalization are carried out on the features 114, 116 to reduce the acoustic variation across different frames and different parameters. The normalization is performed by subtracting the mean and dividing by the standard deviation of the features 114, 116.
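The Delta and Acceleration features can be computed, for instance, with librosa's delta filter, as in the short sketch below (an assumed toolkit; the normalization itself is spelled out with the feature-set description further below).

```python
import numpy as np
import librosa

def add_dynamics(static_feats):
    """Stack raw features (features x frames) with their Delta and Delta-Delta."""
    delta = librosa.feature.delta(static_feats, order=1)
    delta2 = librosa.feature.delta(static_feats, order=2)
    return np.vstack([static_feats, delta, delta2])
```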
During the alignment 120, the acoustic features of different signals are aligned with each other using the DTW. The DTW algorithm measures the similarity between two sequences which vary in time or speed, aiming to find an optimal match between them. The similarity between the acoustic features of two signals is measured using the cosine distance as follows:

s(i, j) = (x_i · y_j) / (||x_i|| ||y_j||),

where s is the similarity matrix, and x_i and y_j are the feature vectors of the i-th and j-th frames in the two signals, respectively.
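As one sketch of this step, librosa's DTW implementation can be used as a convenience (the specification does not prescribe a particular library); the cosine metric mirrors the similarity measure above.

```python
import librosa

def align_features(feats_a, feats_b):
    """DTW-align two feature matrices (features x frames) under the cosine
    distance and return the warping path as (frame_a, frame_b) pairs."""
    D, wp = librosa.sequence.dtw(X=feats_a, Y=feats_b, metric='cosine')
    return wp[::-1]  # librosa returns the path end-to-start; reverse to start-to-end

# Hypothetical usage with the features sketched earlier:
# path = align_features(feats_singing, feats_speech)
```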
To improve the accuracy in aligning the new speaking utterance to be converted with the template singing voice that is sung by a different speaker, a two-step alignment is implemented. The alignment 120 is the first step and aligns the template singing voice 110 and the template speaking voice 112 from the same speaker. Alignment data from the alignment 120 is then used to derive the mapping models 124, 126 of the acoustic features between singing and speech.
A second alignment step (not shown in
After the mapping between singing and speaking voices is achieved via the alignment 120, the transformation models 124, 126 are derived based on the template voices. The acoustic features of the new speech 130 are then modified 132, 134, 136 to obtain the features for the synthesized singing. Prior to the transformation 104, interpolation and smoothing are carried out on the acoustic features to be converted if their lengths are different from those of the short-time features used in alignment. In view of accuracy and computational load, the template voices are divided into several segments and the transformation models are trained separately for each segment. When a new instance of speech is converted into singing using the trained transformation models, it needs to be segmented similarly to the template speech. In the proposed system, the F0 contour of the speaking voice is modified 132 by acquiring a natural F0 contour 140 from the template singing voice. In doing so, we do not need to modify a step contour to account for F0 fluctuations such as overshoot and vibrato, and the synthesized singing voice can be more natural with the F0 contour of the actual singing. The phoneme durations of the speaking voice are different from those in the singing voice and should be lengthened or shortened during the transformation 104 in accordance with the singing voice at the phoneme duration modification 134.
Unlike conventional STS systems, the musical score is not required as an input to derive the duration of each phoneme in singing, and we also do not need to carry out manual segmentation for each phoneme of the speaking voice before conversion. Instead, the synchronization information from aligning the template voices and the converted speech is used to determine the modification for the phoneme duration 134. The duration of each phoneme in the speech is modified to be equal to that in the template singing. To implement this, the VUV, spectral envelope and aperiodicity (AP) index estimated using a vocoder such as STRAIGHT are compressed or elongated in accordance with the transformation model of phoneme duration.
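One simple way to realize this compression and elongation is to index the speech-frame parameters by the alignment path, as in the hypothetical sketch below; here path is the (singing frame, speech frame) warping path from the alignment, and the frame-indexing scheme is an assumption rather than the specification's exact procedure.

```python
import numpy as np

def warp_to_singing(path, speech_sp, speech_ap, speech_vuv):
    """For each template-singing frame, take the speech frame that the alignment
    path maps to it; speech frames are duplicated (elongated) or dropped
    (compressed) so that phoneme durations follow the singing."""
    n_sing = int(path[:, 0].max()) + 1
    idx = np.zeros(n_sing, dtype=int)
    for i, j in path:            # the path visits every singing frame at least once
        idx[i] = j
    return speech_sp[idx], speech_ap[idx], speech_vuv[idx]
```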
Referring to
The major aim of the analysis 209 of the singing voice 202 is to extract the pitch contour 206 of the singing voice 202 so as to extract the melody of the song from the professional voice. The aim of the analysis 213 of the spoken voice 204 is to extract the spectral parameter sequence 212 from the spoken voice 204 to capture the timbre of the spoken voice 204 for synthesis 220.
In accordance with the present invention, the timing processing 216 obtains the alignment information 215 from the alignment 214 and uses the alignment information 215 to convert the spectral sequence 212 to regenerate the converted spectral parameter sequence 218 of the target singing voice. Compared to the spoken voice 204, some voice segments are stretched to be longer, and some segments are compressed to be shorter. Each piece of the voice segments in the converted spectral parameter sequence 218 will match its corresponding part in the template singing voice 202. The synthesizer 220 then uses the converted spectral parameter sequence 218 and the pitch contour 206 from the template singing voice 202 to synthesize a personalized singing voice 222.
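Continuing the WORLD-based sketch from above (an assumed stand-in for STRAIGHT), the synthesizer 220 can be approximated by driving the vocoder with the converted spectral sequence and the template pitch contour; the function name is hypothetical.

```python
import numpy as np
import pyworld as pw  # WORLD vocoder as an assumed stand-in for STRAIGHT

def synthesize_personalized(sing_f0, conv_sp, conv_ap, fs, frame_period_ms=5.0):
    """Combine the template singing F0 (melody) with the duration-converted
    spectrum and aperiodicity of the user's speech (timbre)."""
    n = min(len(sing_f0), len(conv_sp), len(conv_ap))
    return pw.synthesize(np.ascontiguousarray(sing_f0[:n], dtype=np.float64),
                         np.ascontiguousarray(conv_sp[:n], dtype=np.float64),
                         np.ascontiguousarray(conv_ap[:n], dtype=np.float64),
                         fs, frame_period_ms)
```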
The alignment process 214 can be implemented in accordance with the present embodiment in one of three variants depicted in
Referring to
Feature extraction 304 extracts the alignment information 208 from the template singing voice 202. Similar to feature extraction 304 and feature extraction 306, feature extraction 404 extracts alignment information 406 from the template spoken voice 402. Then a two-step DTW is performed. First, the template singing voice 202 is matched with the template spoken voice 402 by DTW 408 of the alignment information 208 and the alignment information 406. Because the two voices 202, 402 are from the same speaker, the spectra of the two signals are similar with the major differences being in the timing and pitch. Thus, it is easier to align the two signals 208, 406 than to align the two signals 208, 210 (
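The two-step matching can be thought of as composing two warping paths, roughly as in the sketch below; the function and variable names are illustrative, and the paths are assumed to come from the DTW alignment described earlier.

```python
import numpy as np

def path_to_map(path, n_src):
    """Turn a (src_frame, dst_frame) warping path into a per-source-frame lookup."""
    m = np.zeros(n_src, dtype=int)
    for i, j in path:
        m[i] = j
    return m

def two_step_alignment(path_sing_to_tspeech, path_tspeech_to_user,
                       n_sing_frames, n_tspeech_frames):
    """Compose singing -> template-speech and template-speech -> user-speech
    mappings into a direct singing -> user-speech frame mapping."""
    a = path_to_map(path_sing_to_tspeech, n_sing_frames)
    b = path_to_map(path_tspeech_to_user, n_tspeech_frames)
    return b[a]
```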
In accordance with the present embodiment and this second variant of the alignment 214, the template singing and speaking voices 202, 402 are analyzed to extract the Mel-Frequency Cepstral Coefficients (MFCC), short-time energy, voiced and unvoiced (VUV) information, F0 contour and spectrum, which in layman's terms are the pitch, timing and spectrum. The transformation model for F0 122 (
The dynamic time warping (DTW) algorithm is used to align the acoustic features extracted for the template singing and speaking voices 202, 402 and for the individual's speaking voice 204. A two-step alignment is done to align the speaking and singing voices. First, the template singing and speaking voices 202, 402 from the same person are aligned 408 and the alignment data is used to derive mapping models 124, 126 (
Referring to
Text-to-phone conversion 504 extracts alignment information 506 from the lyrics text 502. Then a two-step HMM is performed (similar to the two-step DTW 408, 410 of
A more complete depiction 600 of the template-based personalized singing synthesis method is shown in
Alignment 214 of the input spoken voice 204 (user's voice) and the template singing voice 202 sets up a time mapping between segments of the same sound in the two different sequences. Analysis 606 of the input spoken voice 204, the analysis 209 of the template singing voice 202, and analysis 608 of the template spoken voice 402 extract the spectrum information 212, 610, 612 and the pitch contours 614, 206, 616 from each signal 204, 202, 402.
The template spoken voice 402 and the template singing voice 202 are from the same person. By comparing the analysis 612, 610 of the two voices 402, 202, we are able to find the spectral difference of the two voices to train a spectral conversion rule 618, thereby forming the rule 620 for spectral transformation.
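The specification does not fix the form of the spectral conversion rule 618, 620. As one deliberately simple, hedged illustration, a per-frequency-bin average log-spectral difference between the time-aligned template voices could serve as such a rule; more elaborate choices (for example, GMM-based conversion) are equally compatible.

```python
import numpy as np

def train_spectral_rule(sing_sp_aligned, tspeech_sp_aligned, eps=1e-10):
    """Average log-spectral difference between template singing and template
    speech over time-aligned frames -- a minimal stand-in for the conversion rule."""
    diff = np.log(sing_sp_aligned + eps) - np.log(tspeech_sp_aligned + eps)
    return diff.mean(axis=0)                      # one correction per frequency bin

def apply_spectral_rule(user_sp, log_diff, eps=1e-10):
    """Apply the learned singing-vs-speech correction to the user's spectrum."""
    return np.exp(np.log(user_sp + eps) + log_diff)
```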
At the timing processing 216, the alignment information 215 is used to regenerate the spectral sequence 218 so that the voice segments match those of the singing voice. The rule for spectral transformation 620 is used for the spectral conversion 602, which transforms the regenerated spectral sequence 218 to obtain a spectrally converted sequence 622 of the user's spoken voice. The pitch transposition 604 transposes the pitch contour 616 according to the relationship between the pitch contours 206, 614 to generate a transposed pitch contour 624, thereby bringing the melody of the template singing to a level that is more suitable for the user's voice. Finally, a synthesis component 626 uses the transformed spectral parameter sequence 622 and the transposed pitch contour 624 from the template singing voice to generate the personalized singing voice 222.
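As a sketch of the pitch transposition, one plausible rule (an assumption, since the specification only states that the transposition follows the relationship between the pitch contours) is to shift the template melody toward the register of the other voice by matching the median F0 of voiced frames and rounding the shift to whole semitones.

```python
import numpy as np

def transpose_pitch(template_f0, reference_f0):
    """Shift the template melody toward the register implied by the reference
    contour; unvoiced frames (F0 == 0) are left untouched."""
    t_med = np.median(template_f0[template_f0 > 0])
    r_med = np.median(reference_f0[reference_f0 > 0])
    semitones = np.round(12.0 * np.log2(r_med / t_med))   # whole-semitone shift
    out = template_f0.copy()
    out[out > 0] *= 2.0 ** (semitones / 12.0)
    return out
```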
While implementations of the system and method for personalized speech-to-singing synthesis have been shown in
Referring to
In accordance with the present embodiment, the point of entry and duration of each phoneme in a singing voice are expected to be different from those in a speaking voice. Thus, the two voices should be aligned before deriving the transformation models. The quality of the synthesized singing voices is largely dependent on the accuracy of the alignment results.
As set out previously, the short-time cepstral features, MFCC 114 (
For example, a full feature set used in alignment may have a dimension of M, where M = 40 is the total number of features calculated for each frame. The feature set includes one VUV feature and thirty-nine MFCC-related features (twelve MFCC features, twelve Delta MFCC features, twelve Delta-Delta MFCC features, one (log) frame energy, one Delta (log) frame energy, and one Delta-Delta (log) frame energy). In order to reduce the acoustic variation across different frames and different parameters, frame- and parameter-level normalizations are carried out on the MFCC-related features. Normalization is performed by subtracting the mean and dividing by the standard deviation of the features, given by

x'_ij = (x_ij − μ_pi) / σ_pi (parameter level), followed by x''_ij = (x'_ij − μ_fj) / σ_fj (frame level),

where x_ij is the i-th (i ≤ 39) MFCC coefficient of the j-th frame, μ_pi and σ_pi are the mean and standard deviation of the i-th MFCC coefficient, and μ_fj and σ_fj are the mean and standard deviation of the j-th frame.
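A minimal sketch of this two-level normalization is shown below; the ordering of the parameter-level and frame-level passes follows the formulation above, and the small epsilon guarding against zero variance is an added assumption.

```python
import numpy as np

def normalize(mfcc_feats, eps=1e-8):
    """Parameter-level then frame-level mean/variance normalization of the
    39 MFCC-related features; mfcc_feats has shape (39, n_frames) and the
    single VUV feature is left unnormalized."""
    x = mfcc_feats.astype(float)
    # parameter level: normalize each coefficient across frames
    x = (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + eps)
    # frame level: normalize each frame across coefficients
    x = (x - x.mean(axis=0, keepdims=True)) / (x.std(axis=0, keepdims=True) + eps)
    return x
```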
This feature set is used during the alignment 120, 214 that uses the DTW method. The DTW measures the similarity between two sequences which vary in time or speed, aiming to find an optimal match between them. This method has been widely used in ASR to deal with different speaking speeds. Referring to
Referring to
Thus, in accordance with the present embodiment, a personalized template-based singing voice synthesis system is provided that is able to generate a singing voice from the uttered lyrics of a song. The template singing voice is used to provide a very natural melody of the song, while the user's spoken voice is used to keep the user's natural voice timbre. In doing so, the singing voice is generated with a general user's voice and a professional melody.
The proposed singing synthesis has many potential applications in entertainment, education, and other areas. The method of the present embodiment enables the user to produce and listen to his/her singing voice merely by reading the lyrics of songs. As the template singing voices are used in the system, we are able to acquire a natural pitch contour from that of the actual singing voice without a need to purposely generate natural fluctuations (such as overshoot and vibrato) from a step contour directly from the musical score. This substantially improves the naturalness and quality of the synthesized singing and makes it possible to create professional-quality singing voice for poor singers. As the synthesized singing preserves the timbre of the speaker, it can sound like it is being sung by the speaker.
The technology of the present embodiment and its various alternates and variants can also be used for other scenarios. For example, in accordance with the present embodiment, the singing from an unprofessional singer can be modified by correcting the imperfect parts to improve the quality of his/her voice. Alternatively, a student can be taught how to improve his/her singing by detecting the errors in his/her singing melody.
Thus, it can be seen that a system and method for speech-to-singing synthesis which reduces complexity of the synthesis as well as simplifying operations by the end user has been provided. While exemplary embodiments have been presented in the foregoing detailed description of the invention, it should be appreciated that a vast number of variations exist.
It should further be appreciated that the exemplary embodiments are only examples, and are not intended to limit the scope, applicability, operation, or configuration of the invention in any way. Rather, the foregoing detailed description will provide those skilled in the art with a convenient road map for implementing an exemplary embodiment of the invention, it being understood that various changes may be made in the function and arrangement of elements and method of operation described in an exemplary embodiment without departing from the scope of the invention as set forth in the appended claims.
Number | Date | Country | Kind
---|---|---|---
2012101581-4 | Mar 2012 | SG | national

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/SG2013/000094 | 3/6/2013 | WO | 00