1. Field of the Invention
This invention relates to the field of voice morphing.
2. Description of the Related Art
Voice morphing is the science of transforming a first person's voice into a second person's voice, or a reasonably acceptable approximation of it. In order to have the first or original speaker's speech “sound like” the second or target speaker's speech, it is important to mimic the pitch of the second speaker, and to have the spectral energy peaks of the first speaker in approximately the same place that these peaks appear in the spectrum of the second speaker. It is useful to think of speech as a “source”, whether pitch or noise, and a “filter”, typically made up of the resonances associated with the throat, mouth, and nose of a person. (There are alternate definitions of a filter, like those used by a parrot, or electrical filters, often described with poles, or resonances and bandwidths.) In general, if the general pitch values and the resonance positions in the spectrum closely approximate those of a particular person, then the speech “sounds like” that person. A third variable, speaking rate, also affects how a person sounds.
Since the early days of speech coders based on LPC (Linear Predictive Coding), speech has been manipulated by changing the pitch of the signal, the “formants” of the signal, or both, so that it is made to sound like another speaker.
All of the modern systems of voice morphing require decomposition of the speech signal into a pitch or “source”, and a spectrum or “filter” portion. This signal processing algorithm is well known to one skilled in the art of speech or voice morphing.
There are three inter-dependent issues that must be solved before building a voice morphing system. Firstly, it is important to develop a mathematical model to represent the speech signal so that the synthetic speech can be regenerated and prosody, i.e. rhythm, stress, etc. of speech, can be manipulated without artifacts. Secondly, the various acoustic cues which enable humans to identify speakers must be identified and extracted. Thirdly, the type of conversion function and the method of training and applying the conversion function must be decided.
This decomposition process is error prone and computationally difficult, and the reconstructions are generally only rough approximations of the speech of a particular person.
Creating an efficient and effective transformation between a first speaker and a second target speaker can be done by measuring the average pitch of each speaker, measuring the “formant positions” of the speech of each speaker, and then transforming the speech of the first speaker to match both the average pitch and the formant positions of the second speaker.
Referring to
There are two equivalent processes to accomplish this task, described in
The morphing algorithm requires two parameters for each speaker: the average pitch of each speaker and the formant position warping function that moves formants from the first speaker to the second speaker. The warping function can take one of many forms: the average change in formant frequency that best matches each speaker's formants, the cumulative distribution of each formant for the speech of each talker, or the cumulative distribution of the first three (or four) formants of each speaker over some corpus of speech.
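The two parameters described above can be sketched as follows. This is a minimal illustration only, not the claimed implementation: the function names `pitch_ratio` and `cdf_warp`, the use of quantile matching to compare cumulative distributions, and the `n_quantiles` parameter are all assumptions introduced here, and the pitch and formant measurements are assumed to have been obtained by some separate means.

```python
import numpy as np

def pitch_ratio(f0_source, f0_target):
    """Ratio of average pitch (target / source).

    Frames with f0 <= 0 are treated as unvoiced and ignored (an assumption).
    """
    src = np.asarray(f0_source, dtype=float)
    tgt = np.asarray(f0_target, dtype=float)
    return tgt[tgt > 0].mean() / src[src > 0].mean()

def cdf_warp(formants_source, formants_target, n_quantiles=11):
    """Build a frequency-warping function by matching cumulative distributions.

    Each quantile of the source formant distribution is mapped onto the
    same quantile of the target distribution; frequencies between the
    quantile knots are mapped by linear interpolation.
    """
    q = np.linspace(0.0, 1.0, n_quantiles)
    src_q = np.quantile(np.asarray(formants_source, dtype=float), q)
    tgt_q = np.quantile(np.asarray(formants_target, dtype=float), q)

    def warp(freq_hz):
        # Frequencies outside the observed source range are clamped
        # to the end knots by np.interp.
        return np.interp(freq_hz, src_q, tgt_q)

    return warp

# Hypothetical measurements: the target's formants sit 20% higher.
source_formants = np.linspace(300.0, 3000.0, 50)
target_formants = 1.2 * source_formants
warp = cdf_warp(source_formants, target_formants)
mapped = warp(1000.0)  # frequency at which a 1000 Hz source feature should land
```

Quantile matching is only one way to realize the "cumulative distribution" form; the "average change in formant frequency" form would reduce to a single scale factor or offset.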
Note that this process does not describe mimicking the accent of either speaker, nor does it affect other processes (like word choice, unusual emphasis, idiosyncratic pronunciations, and others) that can affect the identity of a speaker. We are rather creating a framework onto which these more subtle transformations can be later applied, if required or desired.
This patent describes a non-decompositional, computationally efficient method to implement voice morphing.
The invention described herein relates to an exemplary method of morphing the speech of one person into the speech of another, i.e. to make one person sound like another. The traditional means include finding the pitch and formants of each speaker and performing a match. In this invention, the difficult task of locating formants is avoided. Rather, the spectral envelopes are matched, and the spectral envelope of the first speaker's voice is warped to be statistically similar to the spectral envelope of the second speaker's voice.
i illustrates a flow diagram showing the matching of the formants of a first speaker's voice to a second speaker's voice.
We describe the simplest implementation of voice warping here, and discuss the more sophisticated forms later.
The second speaker's pitch is adjusted to match the first speaker's pitch at Step 230. At Step 240, the invention determines how much the second speaker's formants must be moved to match the formants of the first speaker. The formants of the second speaker's speech are moved frame by frame to match the function of the first speaker's formants at Step 250. At Step 260, the signal is reconstructed frame by frame. The entire signal is reconstructed at Step 270.
At Step 550, the invention adjusts the spectrum for this frame by the gain at each frequency. This moves the formants (or any other spectral feature) by the ratio of the speakers' formants. At Step 560, the invention reconstructs the frame of signal by reinserting the phase at each frequency and doing an inverse transform. This can be done in either the log cepstral domain or the power domain using an appropriate arithmetic operation. At Step 560, the invention reconstructs the entire signal using overlap-and-add reconstruction, as is normal in zero-phase filtering operations.
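The per-frame gain adjustment, phase reinsertion, inverse transform, and overlap-and-add reconstruction can be illustrated with the following sketch. It is an assumption-laden illustration rather than the method of the invention: the function name `apply_gain_overlap_add`, the 512-sample frame, the periodic Hann window with 50% overlap, and the use of a single fixed log-gain vector for every frame are all choices made here for brevity. In the method described, the gain would be recomputed for each frame.

```python
import numpy as np

def apply_gain_overlap_add(signal, log_gain, frame_len=512):
    """Apply a per-frequency gain to each frame, then overlap-and-add.

    log_gain holds one natural-log gain value per rfft bin
    (frame_len // 2 + 1 values). The gain is added in the log-magnitude
    domain, which is equivalent to multiplying the magnitude spectrum.
    """
    signal = np.asarray(signal, dtype=float)
    hop = frame_len // 2
    # Periodic Hann window: copies shifted by half a frame sum to 1.
    window = np.hanning(frame_len + 1)[:-1]
    out = np.zeros(len(signal))
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        log_mag = np.log(np.abs(spec) + 1e-12)    # small floor avoids log(0)
        phase = np.angle(spec)                    # keep the original phase
        new_mag = np.exp(log_mag + log_gain)      # adjust magnitude only
        new_spec = new_mag * np.exp(1j * phase)   # reinsert the phase
        out[start:start + frame_len] += np.fft.irfft(new_spec, n=frame_len)
    return out

# With zero log-gain the interior of the signal is reconstructed unchanged.
t = np.arange(2048)
tone = np.sin(2 * np.pi * 5 * t / 2048)
rebuilt = apply_gain_overlap_add(tone, np.zeros(257))
```

Because only the magnitude is modified and the phase is reused, each frame is effectively filtered with a zero-phase filter, which is why overlap-and-add is a natural fit for the final reconstruction.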
The remaining detail is the computation of the envelope of a log spectrum of a frame. An example of this computation may be understood by examining
In
This “cepstrally smoothed” value is used in many other algorithms to represent the spectrum, but it is not what a person hears. Rather, the person hears the energy at the peaks of the spectrum, which we refer to as the “envelope” of the spectrum. The envelope is computed as follows: 1) compute an auxiliary spectrum consisting of, at each frequency, the maximum of the log spectrum and the “cepstrally smoothed” spectrum; 2) cepstrally smooth that auxiliary spectrum in the same manner as described above.
3) Finally, compute the envelope as, at each frequency, the value of the smoothed log spectrum plus a constant times the difference between the smoothed auxiliary spectrum and the smoothed log spectrum. The constant was empirically determined as 4, but may be between 3 and 4.
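The envelope computation can be sketched as below. Several details are assumptions made for illustration: "cepstral smoothing" is taken here to mean keeping only the lowest `n_keep` cepstral coefficients of the log spectrum (low-quefrency liftering), the value `n_keep=30` is arbitrary, and the names `cepstral_smooth` and `spectral_envelope` are hypothetical. Only the constant `k` (4, or between 3 and 4) comes from the description.

```python
import numpy as np

def cepstral_smooth(log_spec, n_keep=30):
    """Smooth a log spectrum by keeping only low-quefrency cepstral terms.

    log_spec holds log-magnitude values for the rfft bins of one frame.
    """
    log_spec = np.asarray(log_spec, dtype=float)
    cep = np.fft.irfft(log_spec)        # real cepstrum (even sequence)
    lifter = np.zeros_like(cep)
    lifter[:n_keep] = 1.0
    if n_keep > 1:
        lifter[-(n_keep - 1):] = 1.0    # keep the mirrored coefficients too
    return np.fft.rfft(cep * lifter).real

def spectral_envelope(log_spec, n_keep=30, k=4.0):
    """Envelope = smoothed + k * (smoothed auxiliary - smoothed)."""
    log_spec = np.asarray(log_spec, dtype=float)
    smooth = cepstral_smooth(log_spec, n_keep)
    aux = np.maximum(log_spec, smooth)            # step 1: auxiliary spectrum
    aux_smooth = cepstral_smooth(aux, n_keep)     # step 2: smooth it
    return smooth + k * (aux_smooth - smooth)     # step 3: push toward peaks

# Example on one Hann-windowed frame of synthetic noise.
rng = np.random.default_rng(0)
frame = rng.standard_normal(512) * np.hanning(512)
log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)
envelope = spectral_envelope(log_spec)
```

Taking the maximum before the second smoothing pass discards the valleys between harmonics, so the scaled difference moves the smooth curve up toward the spectral peaks.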
Following this algorithm, it is possible to move pitch and formants independently, simultaneously, and efficiently, changing speaker A to mimic speaker B. However, the pitch change described here changes the length of the speech signal in proportion to the pitch change. This may be ignored, or it may be corrected by using standard procedures, all of which are well known to someone of ordinary skill in the art.
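The effect of a pitch change on signal length can be illustrated with a simple resampling sketch. This assumes that the pitch change is carried out by resampling, which is consistent with the length change noted above but is not confirmed by the text; the function name `resample_pitch` and the use of linear interpolation are likewise assumptions made for illustration.

```python
import numpy as np

def resample_pitch(signal, ratio):
    """Change pitch by a factor `ratio` using linear-interpolation resampling.

    Played back at the original sample rate, the result has its pitch
    multiplied by `ratio` and its duration divided by `ratio`.
    """
    signal = np.asarray(signal, dtype=float)
    n_out = int(round(len(signal) / ratio))
    positions = np.arange(n_out) * ratio   # fractional read positions
    return np.interp(positions, np.arange(len(signal)), signal)

# Raising the pitch by 25% shortens a 1000-sample signal to 800 samples.
original = np.sin(2 * np.pi * 10 * np.arange(1000) / 1000)
shifted = resample_pitch(original, 1.25)
```

Restoring the original duration would require a separate time-scale modification step, which is outside the scope of this sketch.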
This invention claims priority to Provisional Patent Application No. 61/557,756 titled Method for First Order Morphing.
Number | Date | Country
---|---|---
61557756 | Nov 2011 | US