1. Field of the Invention
The invention generally relates to the synthesis of singing voices, and more particularly, to a singing voice synthesis system, method, and apparatus capable of generating a synthesized singing voice with personal tones.
2. Description of the Related Art
In recent years, the processing capability of electronic computing devices has improved substantially, and their applications have increased accordingly. One example is speech/singing voice synthesis systems. In general, speech/singing voice synthesis refers to artificially generating pseudo human voices. Many related products are already commercially available, including virtual singer software, electronic pets, singing tutor software/systems, and software for virtually combining melodies as a composer and singer.
For the conventional singing voice synthesis system, as shown in
Accordingly, embodiments of the invention provide a singing voice synthesis system, method, and apparatus for a user to generate a synthesized singing voice with personal tones. The user does not have to be skilled with music theory, and is just required to intuitively input the voice signals by reading or singing the lyrics according to the tempo cues.
In one aspect of the invention, a singing voice synthesis system is provided. The singing voice synthesis system comprises a storage unit, a tempo unit, an input unit, and a processing unit. The storage unit stores at least one tune. The tempo unit provides a set of tempo cues in accordance with a selected tune from the at least one tune. The input unit receives a plurality of original voice signals corresponding to the selected tune. The processing unit processes the original voice signals and generates a synthesized singing voice signal according to the selected tune.
In another aspect of the invention, a singing voice synthesis method for an electronic computing device with an audio receiver and an audio speaker is provided. The method comprises providing a set of tempo cues in accordance with a selected tune from the at least one tune, receiving, via the audio receiver, a plurality of original voice signals corresponding to the selected tune, processing the original voice signals according to the selected tune, and outputting, via the audio speaker, a synthesized singing voice signal.
In another aspect of the invention, a singing voice synthesis apparatus is provided. The singing voice synthesis apparatus comprises an exterior case, a storage device, a tempo means, an audio receiver, and a processor. The storage device, installed inside of the exterior case and connected to the processor, stores at least one tune. The tempo means, installed outside of the exterior case and connected to the processor, provides a set of tempo cues in accordance with a selected tune from the at least one tune. The audio receiver, installed outside of the exterior case and connected to the processor, receives a plurality of original voice signals corresponding to the selected tune. The processor, installed inside of the exterior case, processes the original voice signals and generates a synthesized singing voice signal according to the selected tune.
Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following descriptions of specific embodiments of the singing voice synthesis systems, methods, and apparatuses.
The invention can be more fully understood by reading the subsequent detailed description and examples with references made to the accompanying drawings, wherein:
The following description is made for the purpose of illustrating the general principles and features of the invention, and should not be taken in a limiting sense. The scope of the invention is best determined by reference to the appended claims. In order to give better examples, the preferred embodiments are given below accompanied with the drawings.
In some embodiments, the selected tune may be a WAV (Waveform Audio) file, in which case the tempo unit 202 marks out the beats of the selected song using a beat tracking technique. In other embodiments, the selected tune may be a MIDI file, in which case the tempo unit 202 retrieves the beats of the selected song from the tempo events in the MIDI file. The tempo unit 202 may provide the set of tempo cues in a variety of ways, such as: visual signs (for example, a moving symbol, flashing symbol, leaping dot, or color-changing pattern) generated by a display; audio signals (for example, the ticking sound of a metronome) generated by an audio speaker; actions (for example, swinging, rotating, leaping, or the waving axis of a metronome) performed by movable machinery; or flashes and color-changing lights generated by a light emitting unit.
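As a rough illustration of retrieving beats from MIDI tempo events, the sketch below converts a list of tempo events into beat times in seconds. It relies on the fact that Standard MIDI Files store tempo as microseconds per quarter note and time as ticks. The function name `beat_times` and the pre-parsed `(tick, microseconds_per_beat)` input format are assumptions made for this example; file parsing is not covered.

```python
def beat_times(tempo_events, ticks_per_beat, total_ticks):
    """Return the time in seconds of every beat from tick 0 to total_ticks.

    tempo_events: list of (tick, microseconds_per_beat) pairs sorted by tick,
                  with the first event at tick 0 (hypothetical input format).
    ticks_per_beat: ticks per quarter note, as given in the MIDI header.
    """
    times = []
    elapsed = 0.0                      # seconds elapsed at last_tick
    last_tick = 0
    us_per_beat = tempo_events[0][1]
    pending = list(tempo_events)
    for tick in range(0, total_ticks + 1, ticks_per_beat):
        # Apply every tempo change that occurs at or before this beat.
        while pending and pending[0][0] <= tick:
            ev_tick, ev_tempo = pending.pop(0)
            elapsed += (ev_tick - last_tick) * us_per_beat / ticks_per_beat / 1e6
            last_tick, us_per_beat = ev_tick, ev_tempo
        times.append(elapsed + (tick - last_tick) * us_per_beat / ticks_per_beat / 1e6)
    return times


if __name__ == "__main__":
    # 120 BPM for two beats, then 60 BPM (values chosen for illustration).
    print(beat_times([(0, 500000), (960, 1000000)], 480, 1920))
```

The resulting list of beat times could then drive any of the cue types listed above, for example by flashing a symbol at each time.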
In order to make sure the established rhythm pattern of the original voice signals is within an acceptable level, in some embodiments, a rhythm analysis unit (not shown) determines whether the established rhythm pattern exceeds a default error threshold value. The established rhythm pattern refers to accuracy (slow or fast) of each word of the lyrics being read or sung, when corresponding to the selected tune. If the established rhythm pattern exceeds the default error threshold value, the rhythm analysis unit (not shown) prompts the user to regenerate the original voice signals and the receiving procedure of the original voice signals is repeated. The determination of whether the established rhythm pattern exceeds the default error threshold value will be described in detail later with reference to
The processing of the original voice signals includes, in some embodiments, flattening all the pitches of the original voice signals to a specific pitch level, and adjusting each of the flattened pitches to its standard pitch indicated by the selected tune to obtain a plurality of adjusted voice signals. The processing of the original voice signals further includes smoothing the adjusted voice signals into a smoothed voice signal. The details are given in the embodiments as follows.
In some embodiments, the processing unit 204 may perform a pitch analysis procedure to flatten the pitches of the original voice signals using pitch tracking and pitch marking techniques, obtaining a plurality of same pitches as a result. Next, the processing unit 204 may perform a pitch adjustment procedure on the same pitches, for instance, the PSOLA (Pitch Synchronous OverLap-Add) method, the Cross-Fading method, or the Resample method, to adjust each of the same pitches to its standard pitch indicated by the tune of the selected song, and obtain a plurality of adjusted voice signals. The detailed operation of the PSOLA method, the Cross-Fading method, and the Resample method will be described later with reference to
In other embodiments, the processing unit 204 further performs a sound effect procedure on the smoothed voice signal. The sound effect procedure may first determine the size of the sampling frame for the smoothed voice signal based on the load on the singing voice synthesis system 200. The sound effect procedure then adjusts the volume and adds vibrato and echo effects to the smoothed voice signal, one sampling frame at a time, to obtain a sound-effected voice signal. The processing unit 204 may choose one of the adjusted voice signals, the smoothed voice signal, and the sound-effected voice signal as the input to an accompaniment procedure. The accompaniment procedure combines the chosen voice signal(s) with the accompaniment of the selected song and generates an accompanied voice signal. It is noted that each of the previously mentioned adjusted voice signals, smoothed voice signal, sound-effected voice signal, and accompanied voice signal may be the presentation of a synthesized singing voice signal of the present invention. The synthesized singing voice signal may be an electronic file having a plurality of voice signals, such as the adjusted voice signals, the smoothed voice signal, the sound-effected voice signal, or the accompanied voice signal. In some other embodiments, the singing voice synthesis system 200 further includes an output unit for outputting the synthesized singing voice signal.
The output unit may be connected to the tempo unit 202 or to another display unit (not shown), so that, while outputting the synthesized singing voice signal, the output unit can use the tempo unit 202 or the display unit to show the beats in any of the previously mentioned forms: visual signals such as moving symbols, flashing symbols, leaping dots, or color-changing patterns; actions such as swinging, rotating, leaping, or the waving axis of a metronome; flashes or color-changing lights; or audio signals such as the ticking sound of a metronome.
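The volume and echo steps of the sound effect procedure described above can be sketched frame by frame as follows. This is only a minimal illustration: the function name, the single-tap echo, and all parameter values are assumptions, vibrato is omitted, and the frame size is simply passed in rather than derived from system load.

```python
def apply_gain_and_echo(samples, frame_size, gain=0.8, delay=4, decay=0.5):
    """Process `samples` one frame at a time, adding a single echo and a gain.

    delay: echo delay in samples (hypothetical value).
    decay: amplitude of the echo relative to the direct sound (hypothetical).
    """
    out = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        for offset, x in enumerate(frame):
            i = start + offset
            # The echo reads from the original signal `delay` samples earlier,
            # so the result does not depend on where frame boundaries fall.
            echoed = samples[i - delay] * decay if i >= delay else 0.0
            out.append(gain * (x + echoed))
    return out
```

Because the echo reads from the input rather than from per-frame state, choosing a smaller or larger frame changes only how much work is done per step, not the output.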
wherein j represents the word number. If the result of function (1) exceeds the default error threshold value μ, then the step of receiving the original voice signals is repeated.
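As an illustration of the threshold check only, the following sketch uses a mean absolute onset deviation as a stand-in error measure. This measure, the function names, and the onset-time inputs are assumptions for the example and are not function (1).

```python
def rhythm_error(word_onsets, note_onsets):
    """Mean absolute difference (seconds) between spoken-word onsets and the
    note onsets of the selected tune. Stand-in measure, not function (1)."""
    if len(word_onsets) != len(note_onsets) or not note_onsets:
        raise ValueError("need one onset per note, and at least one note")
    total = sum(abs(w - n) for w, n in zip(word_onsets, note_onsets))
    return total / len(note_onsets)


def needs_rerecording(word_onsets, note_onsets, mu):
    """True when the error exceeds the default error threshold value mu,
    in which case the original voice signals would be received again."""
    return rhythm_error(word_onsets, note_onsets) > mu
```

With a check of this kind, a recording whose words drift from the tempo cues by more than the threshold on average would trigger the prompt to read or sing again.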
wherein N represents the time length of the sampling process, and n represents the time points within the sampling range. After obtaining the Hamming windows, the PSOLA method continues by overlapping the voice signals re-modeled by the Hamming windows to form new voice signals, which are the previously mentioned adjusted voice signals.
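The windowing and overlap-add steps can be sketched as below. The sketch assumes the standard Hamming window definition, w(n) = 0.54 − 0.46 cos(2πn/(N−1)), and a constant pitch period with evenly spaced pitch marks, which suits a voice signal whose pitches have already been flattened. The function names and the nearest-mark grain selection are assumptions for this example, not details taken from the description.

```python
import math


def hamming(length):
    """Standard Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def psola_constant_period(signal, period, new_period):
    """Simplified PSOLA for a signal with a constant pitch period (in samples).

    Two-period grains centered on the input pitch marks are Hamming-windowed
    and overlap-added at output marks spaced `new_period` apart, which changes
    the pitch while keeping the overall length.
    """
    if period < 1 or new_period < 1:
        raise ValueError("periods must be positive")
    n = len(signal)
    window = hamming(2 * period + 1)
    n_marks = (n - 1) // period - 1      # input marks at period, 2*period, ...
    out = [0.0] * n
    if n_marks <= 0:
        return out
    center = period                      # first output pitch mark
    while center + period <= n - 1:
        # Pick the input pitch mark nearest in time to this output mark.
        k = min(n_marks - 1, round((center - period) / period))
        src = period * (k + 1)
        for i in range(-period, period + 1):
            out[center + i] += signal[src + i] * window[i + period]
        center += new_period
    return out
```

A `new_period` smaller than `period` places grains closer together and raises the pitch; a larger value lowers it. Real voice signals would need pitch marks found by pitch tracking rather than assumed at fixed spacing.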
When a human voice moves from a low pitch to a high pitch, it often reaches a pitch slightly above the target before gliding down to it, especially when the difference between the two pitches is large; computer-generated voices, by contrast, jump directly from the low pitch to the high pitch. To simulate this feature of human voices, one embodiment of the present invention uses a Bézier curve to implement the smoothing procedure. Taking the cubic Bézier curve as an example, four control points are given as shown in
wherein δ represents a parameter, which increases in accordance with the variation of the pitches and has a value between 0 and 1, and $2^{1/12}$ is the ratio between adjacent halftones of the twelve-tone equal temperament scale. For the operator “±”, “+” represents moving from a low pitch to a high pitch, and “−” represents moving from a high pitch to a low pitch. In
$$B(t) = P_0(1-t)^3 + 3P_1 t(1-t)^2 + 3P_2 t^2(1-t) + P_3 t^3, \quad t \in [0,1] \tag{4}$$
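To make the overshoot behavior concrete, the sketch below evaluates the cubic curve of function (4) on scalar pitch values. The control-point values are illustrative assumptions chosen so that the curve briefly rises above the target pitch; they are not derived from the control-point relationship given in the description.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate function (4): the cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    return p0 * u**3 + 3 * p1 * t * u**2 + 3 * p2 * t**2 * u + p3 * t**3


def pitch_glide(p0, p1, p2, p3, steps=50):
    """Sample the curve at steps + 1 evenly spaced values of t."""
    return [cubic_bezier(p0, p1, p2, p3, i / steps) for i in range(steps + 1)]


if __name__ == "__main__":
    # Hypothetical glide from 220 Hz up to 440 Hz; P2 is placed above the
    # target so the curve overshoots before settling on it.
    curve = pitch_glide(220.0, 220.0, 480.0, 440.0)
    print(max(curve) > 440.0)
```

Placing an inner control point beyond the destination pitch is what produces the brief overshoot; with all control points between the two pitches, the curve would stay between them.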
In another embodiment, a quartic Bézier curve is used to implement the smoothing procedure. The relationship between the five control points, P0, P1, P2, P3, and P4, can be expressed with the following function:
wherein δ represents a parameter, which increases in accordance with the variation of the pitches and has a value between 0 and 1, and $2^{1/12}$ is the ratio between adjacent halftones of the twelve-tone equal temperament scale. For the operator “±”, “+” represents moving from a low pitch to a high pitch, and “−” represents moving from a high pitch to a low pitch. In
$$B(t) = P_0(1-t)^4 + 4P_1(1-t)^3 t + 6P_2(1-t)^2 t^2 + 4P_3(1-t)t^3 + P_4 t^4, \quad t \in [0,1] \tag{6}$$
In another embodiment, a quintic Bézier curve is used to implement the smoothing procedure. The relationship between the six control points, P0, P1, P2, P3, P4, and P5, can be expressed with the following function:
wherein δ represents a parameter, which increases in accordance with the variation of the pitches and has a value between 0 and 1, and $2^{1/12}$ is the ratio between adjacent halftones of the twelve-tone equal temperament scale. For the operator “±”, “+” represents moving from a low pitch to a high pitch, and “−” represents moving from a high pitch to a low pitch. In
$$B(t) = P_0(1-t)^5 + 5P_1(1-t)^4 t + 10P_2(1-t)^3 t^2 + 10P_3(1-t)^2 t^3 + 5P_4(1-t)t^4 + P_5 t^5, \quad t \in [0,1] \tag{8}$$
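Functions (4), (6), and (8) are the degree 3, 4, and 5 cases of the general Bézier form, in which the coefficient of each control point is a binomial coefficient times powers of (1 − t) and t. A single routine can therefore cover all three embodiments, as in the following sketch; the function name and the sample control points are assumptions for illustration.

```python
from math import comb


def bezier(points, t):
    """Evaluate a Bezier curve of degree len(points) - 1 at t in [0, 1].

    With 4, 5, or 6 control points this reproduces functions (4), (6),
    and (8) respectively.
    """
    n = len(points) - 1
    return sum(comb(n, i) * p * (1.0 - t) ** (n - i) * t ** i
               for i, p in enumerate(points))
```

A quick sanity check is that evenly spaced control points give a straight line: `bezier([0, 1, 2, 3, 4, 5], t)` equals `5 * t` up to rounding, so any curvature in the smoothed pitch comes from how the control points are placed.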
The electronic computing device may include a display unit generating visual signals to be the set of tempo cues, such as: moving symbols, flashing symbols, leaping dots, or color-changing patterns. The electronic computing device may generate audio signals to be the set of tempo cues, and output the audio signals via the audio speaker. The audio signals may be the ticking sound of a metronome. The electronic computing device may include a movable machinery providing actions to be the set of tempo cues, such as: swinging, rotating, leaping, or the waving axis of a metronome. The electronic computing device may include a light emitting unit generating flashes or color changing lights to be the set of tempo cues. In order to make sure the established rhythm pattern of the original voice signals is at an acceptable level, in some embodiments, the singing voice synthesis method may further determine whether the established rhythm pattern exceeds a default error threshold value according to the tune of the selected song. If the established rhythm pattern exceeds the default error threshold value, the singing voice synthesis method continues with prompting the user to regenerate the original voice signals. The detailed operation of determining the established rhythm pattern is shown in
As shown in
In some embodiments, after the pitch analysis procedure and the pitch adjustment procedure, the singing voice synthesis method, as shown in
In some embodiments, after the pitch analysis procedure, the pitch adjustment procedure, and the smoothing procedure, the singing voice synthesis method, as shown in
In some embodiments, the singing voice synthesis method, as shown in
The electronic computing device implementing the singing voice synthesis method may be a desktop computer, a laptop, a mobile communication device, an electronic toy, or an electronic pet. Moreover, the electronic computing device may include a song database storing tunes of popular songs for the user to select and synthesize with their personalized singing voice. The song database may also store the lyrics of the songs and the corresponding rhythms.
As shown in
The processor 1050 may be an embedded micro-processor including any other necessary components to support the functions thereof. The processor 1050 may be installed in the trunk area of the electronic toy. The processor 1050 is connected to the storage device 1020, the tempo means 1030, and the audio receiver 1040. The processor 1050 mainly processes the original voice signals according to the selected tune and generates a synthesized singing voice signal. In some embodiments, the processing includes flattening the pitches of the original voice signals to obtain a plurality of same pitches, and adjusting each of the same pitches to its standard pitch indicated by the selected tune to obtain a plurality of adjusted voice signals. Further, the processor 1050 may perform a smoothing procedure on the adjusted voice signals to generate a smoothed voice signal.
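One way to picture the adjustment from a flattened pitch to a standard pitch is resampling by a frequency ratio, where each halftone corresponds to a factor of 2^(1/12) in twelve-tone equal temperament. The sketch below uses linear interpolation; the function names and the interpolation choice are assumptions, and this is not presented as the Resample method of the described embodiments.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a shift of `semitones` halftones (12-TET)."""
    return 2.0 ** (semitones / 12.0)


def resample(signal, ratio):
    """Read `signal` at steps of `ratio` samples using linear interpolation.

    A ratio above 1 raises the pitch and shortens the signal; a ratio below 1
    lowers the pitch and lengthens it.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    out = []
    pos = 0.0
    while pos <= len(signal) - 1:
        i = int(pos)
        frac = pos - i
        nxt = signal[i + 1] if i + 1 < len(signal) else signal[i]
        out.append(signal[i] * (1.0 - frac) + nxt * frac)
        pos += ratio
    return out
```

For example, `resample(voice, semitone_ratio(12))` would raise a flattened voice segment by one octave. Because plain resampling also changes duration, each segment would still need to be fitted to the length of its note.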
In other embodiments, the processor 1050 may perform a pitch analysis procedure to obtain the plurality of same pitches using pitch tracking, pitch marking, and pitch flattening techniques. The processor 1050 then performs a pitch adjustment procedure on the same pitches, using the PSOLA method, the Cross-Fading method, or the Resample method, to adjust each of the same pitches to its standard pitch indicated by the selected tune. The detailed operations of the PSOLA method, the Cross-Fading method, and the Resample method are illustrated in
In other embodiments, the processor 1050 may further perform a sound effect procedure on the smoothed voice signal. The sound effect procedure first determines the size of the sampling frame for the smoothed voice signal based on the load on the singing voice synthesis apparatus 1000. The sound effect procedure then adjusts the volume and adds vibrato and echo effects to the smoothed voice signal according to the sampling frame, to obtain a sound-effected voice signal. In other embodiments, the processor 1050 may perform an accompaniment procedure on one of the adjusted voice signals, the smoothed voice signal, and the sound-effected voice signal. The accompaniment procedure combines that voice signal with the accompaniment of the selected song and generates an accompanied voice signal. It is noted that each of the previously mentioned adjusted voice signals, smoothed voice signal, sound-effected voice signal, and accompanied voice signal may be the presentation of a synthesized singing voice signal of the present invention. In addition, the synthesized singing voice signal contains the tone of the user.
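The accompaniment procedure can be pictured as a sample-by-sample mix of the chosen voice signal with the accompaniment track. The following is a minimal sketch under stated assumptions: both signals are lists of floats in the range [-1, 1] at the same sampling rate, and the gain values and function name are hypothetical.

```python
def mix_with_accompaniment(voice, accompaniment, voice_gain=1.0, acc_gain=0.5):
    """Combine a voice signal with an accompaniment into one signal.

    The shorter input is treated as silent past its end, and the sum is
    clipped to [-1, 1] to avoid overflow on playback.
    """
    length = max(len(voice), len(accompaniment))
    mixed = []
    for i in range(length):
        v = voice[i] if i < len(voice) else 0.0
        a = accompaniment[i] if i < len(accompaniment) else 0.0
        s = voice_gain * v + acc_gain * a
        mixed.append(max(-1.0, min(1.0, s)))
    return mixed
```

Keeping the accompaniment gain below the voice gain is one simple way to keep the user's voice audible in the accompanied voice signal.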
In some embodiments, the singing voice synthesis apparatus 1000 may further include an audio speaker (not shown), installed outside of the exterior case 1010 and connected to the processor 1050, for outputting of the synthesized singing voice signal. As shown in
In order to make sure the established rhythm pattern of the original voice signals is at an acceptable level, the processor 1050 may further determine whether the established rhythm pattern exceeds a default error threshold value. If the established rhythm pattern exceeds the default error threshold value, the processor 1050 prompts the user to regenerate the original voice signals and the receiving of the original voice signals is repeated. The detailed operation of determining the established rhythm pattern is depicted in
In the previously mentioned embodiments, the original voice signals are generated by the user reading or singing based on the selected tune and the tempo cues. Each original voice signal corresponds to one note of the selected tune and one tempo cue, so that the original voice signals are ready to be processed without word segmentation. The conventional singing voice synthesis system requires a corpus database to be established, which is usually time-consuming and costly. By contrast, the present invention does not need a corpus database; thus, fewer system resources are required, and better results are obtained with respect to required time and quality. Most importantly, the synthesized singing voice signal contains the tone of the user, and is more fluent and natural sounding.
While the invention has been described by way of example and in terms of preferred embodiments, it is to be understood that the invention is not limited thereto. Those who are skilled in this technology can still make various alterations and modifications without departing from the scope and spirit of this invention. Therefore, the scope of the present invention shall be defined and protected by the following claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
098128479 | Aug 2009 | TW | national |