The present disclosure relates to voice conversion using dynamic time warping, and more particularly, to using shorter uniform frame lengths in dynamic time warping. Consequently, for purposes of illustration and not for purposes of limitation, the exemplary embodiments of the invention are described in a manner consistent with such use, though clearly the invention is not so limited.
In voice conversion, an acoustic feature of speech, such as, for example, its spectral profile or average pitch, may be analyzed to represent it as a sequence of numbers. The feature may then be modified from the source speaker's voice in accordance with statistical properties of a target speaker's voice. A typical voice converter may have a reference vocabulary stored as acoustic patterns called templates. An input utterance may be converted to digital form and compared to the reference templates. The most similar template is selected as the identity of the input.
In order to compare an input pattern, e.g. a spoken word, with a reference, each word is divided into a sequence of time frames. In each time frame, signals representative of acoustic features of the speech pattern are obtained. For each frame of the input word, a frame of the reference word is selected. Signals representative of the similarity or correspondence between each selected pair of frames are obtained responsive to the acoustic feature signals. The correspondence signals for the sequence of input and reference word frame pairs are used to obtain a signal representative of the global or overall similarity between the input word and a reference word template.
Since there are many different ways of pronouncing the same word, the displacement in time of the acoustic features comprising the word is variable. Different utterances of the same word, even by the same individual, may be widely out of time alignment. The selection of frame pairs is therefore not necessarily linear. Matching, for example, the fourth, fifth and sixth frames of the input utterance with the fourth, fifth and sixth frames respectively of the reference word may distort the similarity measure and produce unacceptable errors.
Dynamic time warping (DTW) techniques may be used to align the frames of a test pattern and a reference pattern in an efficient manner. The DTW technique copes with differences in utterance length that arise from the individual speaking characteristics of different speakers. The alignment is efficient in that the global similarity measure assumes an extremum. It may be, for example, that the fifth frame of the test word should be paired with the sixth frame of the reference word to obtain the best similarity measure.
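The disclosure does not reproduce the DTW recurrence itself, so the following is a minimal sketch of the classic algorithm: a cumulative-cost matrix is filled so that each frame of the test pattern may be paired with an earlier, equal, or later frame of the reference pattern, and backtracking recovers the frame-pair path. The function name and Euclidean frame distance are illustrative choices, not taken from the disclosure.

```python
import numpy as np

def dtw_align(source, target):
    """Align two per-frame feature sequences with dynamic time warping.

    source: (Ns, D) array of feature vectors
    target: (Nt, D) array of feature vectors
    Returns the minimum cumulative distance and the aligned frame pairs.
    """
    ns, nt = len(source), len(target)
    dist = lambda i, j: np.linalg.norm(source[i] - target[j])
    # cost[i, j] is the best cost of aligning source[:i+1] with target[:j+1].
    cost = np.full((ns, nt), np.inf)
    cost[0, 0] = dist(0, 0)
    for i in range(ns):
        for j in range(nt):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                cost[i - 1, j] if i > 0 else np.inf,                  # stretch target
                cost[i, j - 1] if j > 0 else np.inf,                  # stretch source
                cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf,    # advance both
            )
            cost[i, j] = dist(i, j) + best_prev
    # Backtrack from the final cell to recover the warping path.
    path, i, j = [(ns - 1, nt - 1)], ns - 1, nt - 1
    while i > 0 or j > 0:
        moves = []
        if i > 0 and j > 0:
            moves.append((cost[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            moves.append((cost[i - 1, j], i - 1, j))
        if j > 0:
            moves.append((cost[i, j - 1], i, j - 1))
        _, i, j = min(moves)
        path.append((i, j))
    return cost[-1, -1], path[::-1]
```

For example, aligning the sequences 0, 1, 2 and 0, 1, 1, 2 pairs the fifth-frame/sixth-frame situation described above: the single middle frame of the shorter sequence is matched against both repeated middle frames of the longer one.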
Since the acoustic feature vector is based on short-term quasi-stationary speech signal analysis, the vector needs to be extracted from the speech waveform frame-by-frame. To ensure that the corresponding frames of the source and target speakers' voices contain substantially similar content, the two speakers need to input speech read from substantially similar text. However, experiments have revealed that the DTW technique has more difficulty in matching frames when the two voices are substantially different than when they are substantially similar.
In recognition of the above-described difficulties in using the conventional dynamic time warping (DTW) technique, the present disclosure describes a system and method for providing shorter uniform frame lengths for the DTW technique in voice conversion. Thus, the present system of providing shorter frame lengths reinforces the DTW technique when the two voice signals are significantly different. However, the present system should also work well in all other cases.
The source signal 100 is represented as i frames, while the target signal 102 is represented as j frames. For the case where the source voice is substantially similar to the target voice, the i-th frame of the source signal 100 should correspond to the j-th frame of the target signal 102. However, in
A block diagram of a speech conversion system 200 in accordance with an embodiment of the present disclosure is shown in
The speech conversion system 200 includes a frame length generator 220 adapted to provide shorter uniform frame lengths than the frame lengths provided by the conventional DTW technique. The system 200 also includes voice unit boundary detectors 202, 212, voice/unvoice detectors 204, 214, and voice frame mark generators 206, 216. The system 200 further includes a training model 222, which receives the frame number and the uniform frame length for each frame number, and generates a conversion operation.
In the illustrated embodiment of
In an alternative embodiment, the system 200 may include only one each of the voice unit boundary detector 202, the voice/unvoice detector 204, and the voice frame mark generator 206. In this particular embodiment, the source and target signals may then be routed or multiplexed through the detectors 202, 204 and the generator 206, sequentially or in parallel.
The voice/unvoice detector 204, 214 segregates the parsed voice unit or syllable into voiced and unvoiced sections. The voice/unvoice segregation is applied to the voice unit to allow the generation of pitch marks or frame marks.
The voice frame mark generator 206, 216 generates these pitch marks or frame marks only on the voiced section of the voice unit. The generator 206, 216, however, may generate any other marks to indicate the voice unit. Typically, the correlated processing duration for the voiced section of the voice unit is approximately between 200 and 400 milliseconds.
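The disclosure does not state how the voice/unvoice detector 204, 214 performs the segregation, so the sketch below uses a common heuristic that is an assumption here: voiced frames tend to have high short-term energy and a low zero-crossing rate, while unvoiced frames show the opposite. The thresholds are illustrative tuning parameters.

```python
import numpy as np

def segregate_voiced(samples, frame_len, energy_thresh=0.01, zcr_thresh=0.25):
    """Classify fixed-length analysis frames as voiced or unvoiced.

    samples   : 1-D array of time-domain speech samples
    frame_len : number of samples per analysis frame
    Returns one boolean per frame (True = voiced), so that frame
    marks are generated only on the voiced section of the voice unit.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = np.mean(frame ** 2)                           # short-term energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0   # zero-crossing rate
        flags.append(bool(energy > energy_thresh and zcr < zcr_thresh))
    return flags
```

A low-frequency sinusoid (a crude stand-in for a voiced segment) is classified as voiced, while broadband noise (a stand-in for an unvoiced fricative) is not.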
The illustrated process includes receiving the number of frames in source (Ns) and target (Nt) signals within a parsed voice unit such as a syllable, at 300. Only the voiced section of the syllable may be processed. At 302, the number of frames in the source signal (Ns) is compared to the number of frames in the target signal (Nt).
If the number of frames (Ns) in the source signal is greater than or equal to the number of frames (Nt) in the target signal, the number of frames (Ns) and the uniform frame length (Ls) of the source signal are unchanged. However, the number of frames (Nt) and the uniform frame length (Lt) of the target signal are modified, at 306. In the illustrated embodiment, the number of frames (Nt) of the target signal is set to the number of frames (Ns) in the source signal. Moreover, the uniform frame length (Lt) of the target signal is set to the time sample period (nt) of the target signal divided by the number of frames (Nt) of the target signal.
Otherwise if the number of frames (Ns) in the source signal is less than the number of frames (Nt) in the target signal, the number of frames (Nt) and the uniform frame length (Lt) of the target signal are unchanged. However, the number of frames (Ns) and the uniform frame length (Ls) of the source signal are modified, at 304. In the illustrated embodiment, the number of frames (Ns) of the source signal is set to the number of frames (Nt) in the target signal. Moreover, the uniform frame length (Ls) of the source signal is set to the time sample period (ns) of the source signal divided by the number of frames (Ns) of the source signal.
Therefore, the above-described process operates to use the larger of the two frame counts for both input signals, thereby obtaining the shorter uniform frame length for the re-framed signal.
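The comparison at 302 and the updates at 304 and 306 can be sketched as follows. Assuming each voiced section already uses uniform frames (so the unchanged signal's frame length equals its sample count divided by its frame count), both branches reduce to adopting the larger frame count for both signals; the function name is illustrative.

```python
def equalize_frames(ns, nt, samples_s, samples_t):
    """Equalize the frame counts of the source and target voiced sections.

    ns, nt               : original frame counts of source and target
    samples_s, samples_t : number of time samples in each voiced section
    The larger frame count is adopted for both signals. The signal that
    had fewer frames is re-framed with that count, which shortens its
    uniform frame length; the other signal's framing is unchanged.
    Returns (n_frames, source_frame_len, target_frame_len).
    """
    n_frames = max(ns, nt)
    source_frame_len = samples_s / n_frames
    target_frame_len = samples_t / n_frames
    return n_frames, source_frame_len, target_frame_len
```

For example, if the source section has 10 frames over 2000 samples and the target section has 8 frames over 1600 samples, the target is re-framed to 10 frames of 160 samples each (shorter than its original 200-sample frames), while the source keeps 200-sample frames.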
The effectiveness of the new process, illustrated in
The normalized MSE between the converted training voice and the target training voice may be computed as follows:
The generator 500 receives the source and target training feature vectors, Xn and Yn, respectively. The feature vectors are then processed, summed, and normalized to produce a mean square error according to equation (1) above. As mentioned above, the normalized MSE generator 500 may be used to measure the effectiveness of the new process illustrated in
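Equation (1) itself is not reproduced in this text, so the sketch below makes an assumption about the normalization: the total squared error between the converted and target training feature vectors is divided by the total energy of the target vectors, which yields a score that is comparable across utterances of different lengths and levels.

```python
import numpy as np

def normalized_mse(converted, target):
    """Normalized mean square error between aligned feature sequences.

    converted, target : (N, D) arrays, one row per aligned frame
    Returns the squared conversion error normalized by the energy of
    the target feature vectors (an assumed form of equation (1)).
    """
    converted = np.asarray(converted, dtype=float)
    target = np.asarray(target, dtype=float)
    err = np.sum((converted - target) ** 2)
    return err / np.sum(target ** 2)
```

A perfectly converted sequence scores 0, and the score grows with the mismatch; in the alternative embodiment described below, a large score would indicate substantially different voices.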
In an alternative embodiment, the normalized MSE generator 500 may be used to determine whether a source signal contains voice substantially different from that of the target signal. If the normalized MSE is large, then the two signals may have substantially different voices. Otherwise, if the normalized MSE is small, then the two signals may have substantially similar voices. Therefore, in the alternative embodiment, the determination may be used to apply the shorter uniform frame length generation only when the two signals have substantially different voices.
Advantages of the present disclosure may be evaluated both objectively and subjectively. The subjective evaluation may be made by listening to the converted voice, which has noise and other artifacts removed. The shorter uniform frame length generation process, illustrated in
While specific embodiments of the invention have been illustrated and described, such descriptions have been for purposes of illustration only and not by way of limitation. Accordingly, throughout this detailed description, for the purposes of explanation, numerous specific details were set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the system and method may be practiced without some of these specific details. In other instances, well-known structures and functions were not described in elaborate detail in order to avoid obscuring the subject matter of the present invention. Accordingly, the scope and spirit of the invention should be judged in terms of the claims which follow.
Filing Document | Filing Date | Country | Kind | 371c Date
---|---|---|---|---
PCT/CN01/00877 | 5/28/2001 | WO | | 5/5/2005