The foregoing summary of the invention, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the accompanying drawings, which are included by way of example, and not by way of limitation with regard to the claimed invention.
Except with regard to element 27 in
Device 10 is also configured to generate a converted voice based on input received through an input/output (I/O) port 18. In some cases, that input may be a recording of a source voice. The recording is stored in random access memory (RAM) 20 (and/or magnetic disk drive (HDD) 22) and subsequently routed to DSP 14 by microprocessor 16 for segmentation and parameter generation. Parameters for the recorded voice may then be used by microprocessor 16 to generate a converted voice. Device 10 may also receive text data input through I/O port 18 and store the received text in RAM 20 and/or HDD 22. Microprocessor 16 is further configured to generate a converted voice based on text input, as is discussed in more detail below.
After conversion in microprocessor 16, a digitized version of a converted voice is processed by digital-to-analog converter 24 and output through speaker 27. Instead of (or prior to) output of the converted voice via DAC 24 and speaker 27, microprocessor 16 may store a digital representation of the converted voice in random access memory (RAM) 20 and/or magnetic disk drive (HDD) 22. In some cases, microprocessor 16 may output a converted voice (through I/O port 18) for transfer to another device. In other cases, microprocessor 16 may further encode the digital representation of a converted voice (e.g., using linear predictive coding (LPC) or other techniques for data compression).
In some embodiments, microprocessor 16 performs voice conversion and other operations based on programming instructions stored in RAM 20, HDD 22, read-only memory (ROM) 21 or elsewhere. Preparing such programming instructions is within the routine ability of persons skilled in the art once such persons are provided with the information contained herein. In yet other embodiments, some or all of the operations performed by microprocessor 16 are hardwired into microprocessor 16 and/or other integrated circuits. In other words, some or all aspects of voice conversion operations can be performed by an application specific integrated circuit (ASIC) having gates and other logic dedicated to the calculations and other operations described herein. The design of an ASIC to include such gates and other logic is similarly within the routine ability of a person skilled in the art if such person is first provided with the information contained herein. In yet other embodiments, some operations are based on execution of stored program instructions and other operations are based on hardwired logic. Various processing and/or storage operations can be performed in a single integrated circuit or divided among multiple integrated circuits (“chips” or a “chip set”) in numerous ways.
Device 10 could take many forms. Device 10 could be a dedicated voice conversion device. Alternatively, the above-described elements of device 10 could be components of a desktop computer (e.g., a PC), a mobile communication device (e.g., a cellular telephone, a mobile telephone having wireless internet connectivity, or another type of wireless mobile terminal), a personal digital assistant (PDA), a notebook computer, a video game console, etc. In certain embodiments, some of the elements and features described in connection with
In at least some embodiments, a codebook is stored in memory and used to convert a passage in a source voice into a target voice version of that same passage. As used herein, “passage” refers to a collection of words, sentences and/or other units of speech (spoken or textual). Segments of the passage in the source voice are used to select data in a source portion of the codebook. For each of the data selected from the codebook source portion, corresponding data from a target portion of the codebook is used to generate pitch profiles of the passage segments in the target voice. Additional processing can then be performed on those generated pitch profiles.
In some embodiments designed for converting the voice of one human speaker to the voice of another human speaker, codebook creation begins with the source and target speakers each reciting the same training material (e.g., 30-60 sentences chosen to be generally representative of a particular language). Pitch analysis is performed on the source and target voice recitations of the training material. Pitch values at certain intervals are obtained and smoothed. The spoken training material from both speakers is also subdivided into smaller segments (e.g., syllables) using phoneme boundaries and linguistic information. If necessary, F0 outliers at syllable boundaries can be removed. For each training material segment, data representing the source voice speaking that segment is mapped to data representing the target voice speaking that same segment. In particular, the source and target speech signals are analyzed to obtain segmentations (e.g., at the phoneme level). Based on this segmentation and on knowledge of which signal pertains to which sentence(s), the different parts of signals that correspond to each other are identified. If necessary, additional alignment can be performed on a finer level (e.g., for 10 millisecond frames instead of phonemes). In other embodiments, the codebook is designed for use with textual source material. For example, such a codebook could be used to artificially generate a target voice version of a typed passage. In some such textual source embodiments, the source version of the training material is not provided by an actual human speaker. Instead, the source “voice” is the data generated by processing a text version of the training material with a text-to-speech (TTS) algorithm. Examples of TTS systems that could be used to generate a source voice for textual training material include (but are not limited to) concatenation-based unit selection synthesizers, diphone-based systems and formant-based TTS systems. 
The TTS algorithm can output a speech signal for the source text and/or intermediate information at some level between text and a speech signal. The TTS system can output pitch values directly or using some modeled form. The pitch values from the TTS system may correspond directly to the TTS output speech or may be derived from a prosody model.
In some alternate embodiments, dynamic time warping (DTW) can be used to map (based on Mel-frequency Cepstral Coefficients) source speech segments (e.g., 20 millisecond frames) of the codebook training material to target speech segments of the codebook training material.
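As a rough illustration of the DTW mapping mentioned above, the following minimal sketch aligns two sequences of per-frame feature vectors. The function name `dtw_align`, the Euclidean frame distance and the three-way step pattern are assumptions made for illustration only; the feature vectors stand in for MFCC frames, whose extraction is not shown.

```python
import math

def dtw_align(src_frames, tgt_frames):
    """Return (total_cost, path) aligning two lists of feature vectors.

    path is a list of (source_index, target_index) pairs.
    """
    n, m = len(src_frames), len(tgt_frames)
    inf = float("inf")
    # cost[i][j] = cheapest accumulated distance aligning the first i
    # source frames with the first j target frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(src_frames[i - 1], tgt_frames[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the frame-to-frame mapping.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path
```

Each pair in the returned path identifies a source frame and the target frame to which it is mapped; a single source frame may map to several target frames when the target speaker utters the material more slowly.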
In the embodiments described herein, speech is segmented at the syllable level. This approach is robust against labeling errors. Moreover, syllables can also be regarded as natural elemental speech units in many languages, as syllables are meaningful units linguistically and prosodically. For example, the tone sequence theory on intonation modeling concentrates on F0 movements on syllables. However, other segmentation schemes could be employed.
In addition to the data representing the source and target voices speaking various segments, the codebook in some embodiments contains linguistic feature data for some or all of the training material segments. This feature data can be used, in a manner discussed below, to search for an optimal source-target data pair in the codebook. Examples of linguistic features and values thereof are given in Table 1.
Not all of the above features need be used in a particular embodiment, and other features could be employed in addition and/or instead. For example, Van Santen-Hirschberg classifications of onset could be used. Linguistic features describing multiple syllables can also be used (e.g., a feature describing the current syllable and/or the next syllable and/or the preceding syllable). Sentence level features (i.e., information about the sentence in which a particular syllable was uttered) could also be used; examples of sentence level features include pitch declination, sentence duration and mean pitch.
As indicated above, codebook 80 is created using training material that is spoken by source and target voices. The spoken training material is segmented into syllables, and a pitch analysis is performed to generate a pitch contour (a set of pitch values at different times) for each syllable. Pitch analysis can be performed prior to segmentation. Pitch contours can be generated in various manners. In some embodiments, a spectral analysis for input speech (or a TTS analysis of input text) undergoing conversion outputs pitch values (F0) for each syllable. As part of such an analysis, a duration of the analyzed speech (and/or segments thereof) is also provided or is readily calculable from the output. For example,
Returning to
Similarly, a target vector ZjTGT for syllable j is calculated from the target pitch contour zjTGT according to Equation 2.
There are several advantages to storing transformed representations of the training material source and target pitch contour data in codebook 80. Because a transformed representation concentrates most of the information from the pitch contour in the first coefficients, comparisons can be sped up (and/or memory requirements reduced) by using only the first few coefficients when comparing two vectors. As indicated above, pitch contours will often have differing numbers of pitch samples. Even with regard to the same training material syllable, a source speaker may utter that syllable more rapidly or slowly than a target speaker, thereby resulting in contours of different durations (and thus different numbers of pitch samples). When comparing contours of different lengths, the shorter of two DCT vectors can be zero-padded (or the longer of two DCT vectors can be truncated), and a meaningful comparison still results. Transformed representations also permit generation of a contour, from DCT coefficients of an original contour, having a length different from that of the original contour.
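The following minimal sketch illustrates the properties described above: a transform that concentrates contour information in the leading coefficients, and zero-padding or truncation so that vectors of different lengths can be compared. An orthonormal DCT-II is assumed here; the exact transform of equations 1 and 2 may differ in scaling. The names `dct`, `fit_length` and `distance`, and the default coefficient range, are hypothetical.

```python
import math

def dct(contour):
    """Orthonormal DCT-II of a list of pitch values (assumed transform)."""
    n_samples = len(contour)
    coeffs = []
    for k in range(n_samples):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n_samples)
        total = sum(x * math.cos(math.pi * (2 * n + 1) * k / (2 * n_samples))
                    for n, x in enumerate(contour))
        coeffs.append(scale * total)
    return coeffs

def fit_length(coeffs, length):
    """Zero-pad or truncate a coefficient vector to the given length."""
    return (list(coeffs) + [0.0] * length)[:length]

def distance(a, b, first=1, last=4):
    """Squared distance over coefficients first..last-1 (0-based).

    The default range skips the bias coefficient at position 0.
    """
    size = max(len(a), len(b), last)
    a, b = fit_length(a, size), fit_length(b, size)
    return sum((a[k] - b[k]) ** 2 for k in range(first, last))
```

For a flat contour, all of the energy falls in the first (bias) coefficient, which is why that coefficient can be handled separately from the coefficients describing contour shape.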
If a set of training material used to generate a codebook is relatively small, the first coefficient for each source and target vector can be omitted (i.e., set to zero). The first coefficient represents a bias value, and there may not be sufficient data from a small training set to meaningfully use the bias values. In certain embodiments, there may not be entries in the codebook for every syllable of the training material. For example, data for syllables having pitch contours with only a few values may not be included.
For each syllable in the source passage, the process uses source data in codebook 80 to search for the training material syllable for which the corresponding target data will yield a natural sounding contour that could be used in the context of the source passage. As used herein, codebook source data corresponds to codebook target data having the same index (j) (i.e., the source and target data relate to the same training material syllable). As indicated above in connection with
Beginning in block 101 (
The process continues to block 103, where linguistic information (e.g., features such as are described in Table 1) is extracted from the source passage. A pitch analysis is also performed on the source passage, and the data smoothed. Data smoothing can be performed using, e.g., low-pass or median filtering. Explicit smoothing may not be needed in some cases, as some pitch extraction techniques use heavy tracking to ensure appropriate smoothness in the resulting pitch contour. If the source passage is actual speech (either live or recorded), DSP 14 (
The process next determines syllable boundaries for the source passage (block 105). For textual source passages, linguistic information and phoneme durations from the TTS output are used to detect syllable boundaries. This information is directly available from the TTS process, as the TTS process uses that same information in generating speech for the textual source passage. Alternatively, training data from actual voices used to build the TTS voice could be used. For speech source passages, and as set forth above, a text version of the passage will typically be available for use in segmentation. After identifying syllable boundaries, pitch data from block 103 is segmented according to those syllable boundaries. The segmented pitch data is stored as a separate pitch contour for each of the source passage syllables. A duration (di) is also calculated and stored for each source passage pitch contour. A duration of the voiced portion of each source passage pitch contour (d_vi) is also calculated and stored.
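The segmentation of pitch data in block 105 might be sketched as follows. The sketch assumes a frame-level pitch track with one estimate every 10 milliseconds and `None` for unvoiced frames; the function name `segment_pitch`, the dictionary keys and the representation of boundaries as (start, end) pairs in seconds are all hypothetical.

```python
def segment_pitch(pitch, boundaries, frame_s=0.010):
    """Split a frame-level pitch track into per-syllable records.

    pitch      -- list of F0 values, one per frame; None marks an unvoiced frame
    boundaries -- list of (start_s, end_s) syllable boundaries in seconds
    frame_s    -- spacing between pitch estimates in seconds (assumed 10 ms)
    """
    syllables = []
    for start_s, end_s in boundaries:
        first = round(start_s / frame_s)
        last = round(end_s / frame_s)
        frames = pitch[first:last]
        voiced = [f0 for f0 in frames if f0 is not None]
        syllables.append({
            "contour": voiced,             # pitch contour for the syllable
            "d": end_s - start_s,          # syllable duration (di)
            "d_v": len(voiced) * frame_s,  # duration of the voiced portion (d_vi)
        })
    return syllables
```

A syllable whose `contour` holds only one or two values is the kind of case that the sufficiency check in block 115 is meant to catch.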
First level processing is then performed on the source speech passage in block 107. In particular, and for every syllable of the source speech passage, a mean-variance (MV) version of the syllable pitch contour is calculated and stored. In at least some embodiments, the MV version of each syllable is calculated according to Equation 3.
The process then continues to block 111 and flags the pitch contour for the first source passage syllable (i=1) as the source contour undergoing conversion (SCUC). The process then proceeds to block 115 and determines if there are sufficient pitch measurements for the SCUC to permit meaningful use of data from codebook 80. For example, a weakly voiced or (primarily) unvoiced source passage syllable might have only one or two pitch values with an estimation interval of 10 milliseconds, which would not be sufficient for a meaningful contour. If there are insufficient pitch measurements for the SCUC, the process continues along the “No” branch to block 125 and calculates a target voice version of the SCUC using an alternative technique. Additional details of block 125 are provided below.
If there are sufficient pitch measurements for the SCUC, the process continues along the “Yes” branch from block 115 to block 117 to begin a search for an optimal index (jopt) in codebook 80 (
In block 117, a transform vector XiSRC (upper case X) is calculated for the SCUC according to equation 4.
In equation 4, “i” is an index for the SCUC syllable in relation to other syllables in the source passage. The quantity xiSRC(n) (lower case x) is (as in equation 3) a value for pitch at time interval “n” in the SCUC. The value N in equation 4 can be the same as or different from the value of N in equation 1 or equation 2. If N in equation 4 is different from N in equation 1 or equation 2, vector XiSRC can be adjusted in subsequent computations (e.g., as described below in connection with condition 1) by padding XiSRC with “0” coefficients for k=N+1, N+2, etc., or by dropping coefficients for k=N, N−1, etc.
In block 119, a group of candidate codebook indices is found by comparing XiSRC to ZjSRC for all values of index j. In at least some embodiments, the comparison is based on a predetermined number of DCT coefficients (after the first DCT coefficient) in XiSRC and in ZjSRC according to condition 1.
The quantity p in condition 1 is a threshold which can be estimated in various ways. One manner of estimating p is described below. Each value of j which results in satisfaction of condition 1 is flagged as a candidate codebook index. The values “w” and “z” in condition 1 are 2 and 10, respectively, in some embodiments. However, other values could be used.
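The candidate search of block 119 might look like the following minimal sketch. The exact form of condition 1 is not reproduced here, so the sketch assumes that its left side is a sum of squared differences over DCT coefficients w through z (1-based, so that w=2 skips the first coefficient) and that a candidate must fall below threshold p. The function names and the zero-padding of short vectors are assumptions.

```python
def condition1_lhs(x_src, entry, w=2, z=10):
    """Assumed left side of condition 1 for one codebook source vector.

    Sums squared differences over coefficients w..z (1-based); vectors
    shorter than z coefficients are zero-padded.
    """
    size = max(len(x_src), len(entry), z)
    a = list(x_src) + [0.0] * (size - len(x_src))
    b = list(entry) + [0.0] * (size - len(entry))
    return sum((a[k - 1] - b[k - 1]) ** 2 for k in range(w, z + 1))

def find_candidates(x_src, codebook_src, p, w=2, z=10):
    """Return every index j whose source vector satisfies the assumed condition 1."""
    return [j for j, entry in enumerate(codebook_src)
            if condition1_lhs(x_src, entry, w, z) < p]
```

An empty result corresponds to the case in which no candidate indices are found and an alternate conversion technique is used instead.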
The process then continues to block 121. If in block 119 no candidate indices were found (i.e., condition 1 was not satisfied for any value of index j), the process advances to block 125 along the “no” branch. In block 125, a target voice version of the SCUC is generated using an alternate conversion technique. In at least some embodiments, the alternate technique generates a target voice version of the SCUC using the values for xi(n)|MV that were stored in block 107. Other techniques can be used, however. For example, Gaussian mixture modeling, sentence level modeling and/or other modeling techniques could be used. From block 125 the process then proceeds to block 137 (
If one or more candidate indices were found in block 119, the process then advances from block 121 to block 123. In block 123, an optimal codebook index is identified from among the candidate indices. In at least some embodiments, the optimal index is identified by comparing the durations (di and d_vi) calculated in block 105 to values of djSRC and d_vjSRC for each candidate index, as well as by comparing linguistic features (Fj) associated with the candidate codebook indices to features of the SCUC syllable. In particular, a feature vector Fi=[F(1), F(2), . . . , F(M)] is calculated for the SCUC syllable based on the same feature categories used to calculate feature vectors Fj. The SCUC feature vector Fi is calculated using linguistic information extracted in block 103 and the syllable boundaries from block 105. An optimal index is then found using a classification and regression tree (CART).
One example of such a CART is shown in
Use of the CART begins at decision node 201 with the first candidate index identified in block 119 (
In node 209, the value for F0DurDiff (calculated at decision node 201) is again checked. If F0DurDiff is less than 0.0300001 milliseconds, the “Yes” branch is followed, and the candidate is marked as optimal. If F0DurDiff is not less than 0.0300001 milliseconds, the “No” branch is followed to decision node 213. At node 213, the absolute value of the difference between the SCUC syllable duration (di) and the duration of the source syllable for the candidate index (djSRC) is calculated. If that difference (“SylDurDiff”) is not less than 0.14375 milliseconds, the “No” branch is followed to leaf node 215, where the candidate is marked non-optimal. The next candidate index is then used to begin (at node 201) a second pass through the CART.
If the value of SylDurDiff at decision node 213 is less than 0.14375 milliseconds, the yes branch is followed to decision node 217. In node 217 the values for the Van Santen-Hirschberg classification of syllable coda feature of the SCUC syllable and of the candidate index source syllable are compared. If the values are the same, the difference between those values (“CodaTypeDiff”) is “1.” Otherwise the value for CodaTypeDiff is “0”. If CodaTypeDiff=0, the “No” branch is followed to leaf node 219, where the candidate is marked non-optimal. The next candidate index is then used to begin (at node 201) a second pass through the CART. If the value for CodaTypeDiff is 1, the “Yes” branch is followed to leaf node 221, and the index is marked as optimal.
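The branches described for decision nodes 209, 213 and 217 can be sketched as a short function. This covers only the portion of the example CART described in the two preceding paragraphs; the value of F0DurDiff is taken as an input because it is computed earlier in the tree. The function name and return convention are hypothetical, and the thresholds are the example values given in the text.

```python
def evaluate_from_node_209(f0_dur_diff, d_i, d_j_src, coda_i, coda_j):
    """Walk the described branches from decision node 209 down to a leaf.

    Returns (is_optimal, leaf), where leaf is the leaf node number given in
    the text, or None for the leaf reached directly from node 209, which
    the text does not number.
    """
    # Node 209: re-check F0DurDiff (computed earlier, at node 201).
    if f0_dur_diff < 0.0300001:
        return True, None
    # Node 213: difference in overall syllable duration (SylDurDiff).
    syl_dur_diff = abs(d_i - d_j_src)
    if not syl_dur_diff < 0.14375:
        return False, 215
    # Node 217: CodaTypeDiff is 1 when the coda classes match, else 0.
    coda_type_diff = 1 if coda_i == coda_j else 0
    if coda_type_diff == 0:
        return False, 219
    return True, 221
```

Calling the function once per candidate index mirrors the repeated passes through the tree, with each candidate ending at exactly one leaf.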
All of the candidate indices from block 119 of
In particular, the candidate having the smallest value for
(i.e., the left side of condition 1) is chosen. If no candidate is marked optimal after evaluation in the CART, then the candidate that progressed to the least “non-optimal” leaf node is chosen. In particular, each leaf node in the CART is labeled as “optimal” or “non-optimal” based on a probability (e.g., 50%) of whether a candidate reaching that leaf node will be a candidate corresponding to a codebook target profile that will yield a natural sounding contour that could be used in the context of the source passage. The candidate reaching the non-optimal leaf node with the highest probability (e.g., one that may have a probability of 40%) is selected. If no candidates reached an optimal leaf node and more than one candidate reached the non-optimal leaf node with the highest probability, the final selection from those candidates is made based on the candidate having the smallest value for the left side of condition 1.
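The selection rules just described can be summarized in a minimal sketch. The record layout (a dictionary per candidate) and the function name `choose_index` are hypothetical; the `distance` field stands for the left side of condition 1, and `probability` for the probability attached to the leaf node that the candidate reached.

```python
def choose_index(evaluated):
    """Pick a codebook index from CART-evaluated candidates.

    evaluated -- list of dicts with keys:
        "index"       codebook index j
        "optimal"     True if the candidate reached an optimal leaf
        "probability" probability attached to the leaf that was reached
        "distance"    left side of condition 1 for this candidate
    """
    optimal = [c for c in evaluated if c["optimal"]]
    if optimal:
        # One or more optimal candidates: smallest distance wins.
        pool = optimal
    else:
        # No optimal candidate: keep those at the least non-optimal leaf.
        best = max(c["probability"] for c in evaluated)
        pool = [c for c in evaluated if c["probability"] == best]
    return min(pool, key=lambda c: c["distance"])["index"]
```

In both cases the distance acts as the final tie-breaker, so the function always returns exactly one index when given at least one candidate.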
In at least some alternate embodiments, an index is chosen in block 123 according to equation 5.
The quantity “C(m)” in equation 5 is the mth member of a cost vector C that is calculated between Fj and Fi. If Fi=[Fi(1), Fi(2), . . . , Fi(M)] and Fj=[Fj(1), Fj(2), . . . , Fj(M)], cost vector C=[Diff(Fi(1),Fj(1)), Diff(Fi(2),Fj(2)), . . . , Diff(Fi(M),Fj(M))].
For a linguistic feature, the difference between values of a feature can be set to zero if there is a perfect match or to one if there is no match. For example, assume the feature corresponding to Fi(1) and to Fj(1) is Van Santen-Hirschberg classification (see Table 1). Further assume that the classification for the syllable associated with the SCUC is “UV” (Fi(1)=UV) and that the classification for the training material syllable associated with index j is “VS—” (Fj(1)=VS—). In such a case, Diff(Fi(1),Fj(1))=1. In alternate embodiments, non-binary cost values can be used. The quantity “W(m)” in equation 5 is a weight for the mth feature. Calculation of a weight vector W=[W(1), W(2), . . . , W(M)] is described below.
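A weighted-cost selection of this kind might be sketched as follows. Equation 5 is not reproduced here, so the sketch assumes that it selects the candidate index minimizing the sum of W(m)·C(m) over all M features. Consistent with the worked example in which “UV” compared against “VS—” yields a difference of 1, a mismatch is given a cost of 1 and a match a cost of 0. The function names and the dictionary of candidates are hypothetical.

```python
def cost_vector(f_i, f_j):
    """Binary cost per feature: 0 for a match, 1 for a mismatch."""
    return [0 if a == b else 1 for a, b in zip(f_i, f_j)]

def select_index(f_i, candidates, weights):
    """Return the candidate index with the lowest weighted cost.

    f_i        -- feature vector of the SCUC syllable
    candidates -- dict mapping codebook index j to its feature vector Fj
    weights    -- weight vector W, one weight per feature
    """
    def weighted_cost(j):
        costs = cost_vector(f_i, candidates[j])
        return sum(w * c for w, c in zip(weights, costs))
    return min(candidates, key=weighted_cost)
```

With unequal weights, a candidate that mismatches on two lightly weighted features can still be preferred over one that mismatches on a single heavily weighted feature.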
The process advances from block 123 (
In equation 6, the first DCT coefficient is set to zero (ZjTGT(1)=0) so as to obtain a zero-mean contour. If the desired contour has a length different from that of the target version of the codebook syllable for which ZjTGT is used in equation 6, ZjTGT can be padded with 0 coefficients (or some coefficients can be dropped).
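The effect of zeroing the first coefficient and of padding or dropping coefficients can be seen in the following minimal sketch. It assumes that equation 6 is the inverse of an orthonormal DCT-II; the actual scaling in equation 6 may differ. The function name `idct_zero_mean` and the `length` parameter are hypothetical.

```python
import math

def idct_zero_mean(coeffs, length=None):
    """Inverse of an orthonormal DCT-II with the first coefficient forced to zero.

    length -- number of output samples; coefficients are zero-padded or
              truncated to match (defaults to len(coeffs)).
    """
    n_samples = len(coeffs) if length is None else length
    c = (list(coeffs) + [0.0] * n_samples)[:n_samples]
    c[0] = 0.0  # drop the bias term so that the contour has zero mean
    contour = []
    for n in range(n_samples):
        value = sum(math.sqrt(2.0 / n_samples) * c[k]
                    * math.cos(math.pi * (2 * n + 1) * k / (2 * n_samples))
                    for k in range(1, n_samples))
        contour.append(value)
    return contour
```

Because every remaining basis function sums to zero over the output samples, the generated contour has zero mean for any output length, which leaves the pitch level to be set by the later adjustment steps.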
The process then continues to block 129, where the output from block 127 is further adjusted so as to better maintain lexical information of the source passage syllable associated with the SCUC. F0 values in the adjusted contour (xiTGT(n)|a) are calculated according to equation 7.
$$x_i^{TGT}(n)|_{a} = x_i^{TGT}(n) + x_i^{SRC}(n) - z_j^{SRC}(n) \qquad \text{(Equation 7)}$$
In equation 7, “xiSRC” is the source pattern (i.e., the SCUC) and “zjSRC” is the pitch contour for the source version of the syllable corresponding to the key selected in block 123 (i.e., the inverse DCT transformed ZjSRC).
The process then continues to block 131, where the output of block 129 is adjusted in order to predict target sentence pitch declination. F0 values for the adjusted contour (xiTGT(n)|a,μ) are calculated according to equation 8.
$$x_i^{TGT}(n)|_{a,\mu} = x_i^{TGT}(n)|_{a} + x_i(n)|_{MV} \qquad \text{(Equation 8)}$$
The quantity “xi(n)|MV” in equation 8 is described above in connection with equation 3. Adjusting for pitch declination using the mean value helps to avoid large errors that can result using a declination slope mapping approach.
Next, the process determines in block 133 if the boundary between the source passage syllable corresponding to the SCUC and the preceding source passage syllable is continuous in voicing. If not, the process skips to block 137 (described below) along the “No” branch. As can be appreciated, the result in block 133 would be “no” for the first syllable of a passage. As to subsequent passage syllables, the result may be “yes”, in which case the process further adjusts xiTGT(n)|a,μ (from block 131) in block 135 by adding a bias (b) in order to preserve a continuous pitch level. This adjustment is performed using equation 9.
$$x_i^{TGT}(n)|_{a,\mu,c} = x_i^{TGT}(n)|_{a,\mu} + b, \quad \text{where } b = x_{i-1}^{TGT}(N)|_{a,\mu,c} - x_i^{TGT}(1)|_{a,\mu} \qquad \text{(Equation 9)}$$
In equation 9, “xiTGT(1)|a,μ” is the first pitch value in the SCUC after adjustment in block 131 and “xi−1TGT(N)|a,μ,c” is the Nth pitch value in the previous SCUC after all adjustments. The pitch levels in a SCUC can be further (or alternatively) adjusted using the mean values obtained in block 107.
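The three adjustments of blocks 129, 131 and 135 can be chained as in the following minimal sketch. It assumes that all contours for a syllable have already been brought to the same number of samples, and it takes the bias b as the last pitch value of the previous converted syllable minus the first value of the current one, following the description of the bias terms. The function name `adjust_contour` and its parameter names are hypothetical.

```python
def adjust_contour(x_tgt, x_src, z_src, x_mv, prev_last=None):
    """Apply the adjustments of equations 7, 8 and 9 to one syllable contour.

    x_tgt     -- contour output by block 127 (zero-mean target contour)
    x_src     -- source contour undergoing conversion (the SCUC)
    z_src     -- source contour of the selected codebook entry
    x_mv      -- mean-variance values stored in block 107
    prev_last -- final pitch value of the previous converted syllable, or
                 None when voicing is not continuous across the boundary
    """
    # Equation 7: carry over detail of the source passage syllable.
    adjusted = [t + s - z for t, s, z in zip(x_tgt, x_src, z_src)]
    # Equation 8: add the mean-variance values to set the pitch level.
    adjusted = [a + mv for a, mv in zip(adjusted, x_mv)]
    # Equation 9: shift by bias b so that the contour starts where the
    # previous converted syllable ended.
    if prev_last is not None:
        b = prev_last - adjusted[0]
        adjusted = [a + b for a in adjusted]
    return adjusted
```

Passing `prev_last=None` corresponds to the “No” branch from block 133, in which the bias step is skipped.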
In block 137, the final target voice version of the SCUC is stored. The process then determines in block 139 whether there are additional syllables in the source passage awaiting conversion. If so, the process continues on the “yes” branch to block 141, where the next source passage syllable is flagged as the SCUC. The process then returns to block 115 (
However, source passage spectral data can be obtained at the same time as input data used for the process shown in
As indicated above, at least some embodiments utilize a classification and regression tree (CART) when identifying potentially optimal candidates in block 121 of
Similarly, each element bgh of matrix B is found with equation 11 using the first Q members of each target vector ZjTGT, and with ZjTGT(1)=0 for every target vector.
Matrices A and B each have zeros as diagonal values.
During a separate training procedure performed after creation of codebook 80, a CART can be built to predict a group of pre-selected candidates which could be the best alternative in terms of linguistic and durational similarity to the SCUC. The CART training data is obtained from codebook 80 by sequentially using every source vector in the codebook as a CART-training SCUC (CT-SCUC). For example, assume the first source vector contour in codebook 80 is the current CT-SCUC. Values in matrix A from a12 to a1k are searched. If a value a1j is below a threshold, i.e., a1
Neutral samples are not used in the CART training since they fall into a questionable region. The source feature vector values associated with the optimal and the non-optimal CART training samples are matched with the feature vectors of the CT-SCUC used to find those optimal and non-optimal CART training samples, resulting in a binary vector. In the binary vector, a one means that the corresponding feature matched (for example, 1 if both syllables are monosyllabic), and a zero means that the corresponding features were not the same. The absolute duration differences between each CT-SCUC source version syllable duration and the source syllable durations of the optimal and non-optimal CART training samples found with that CT-SCUC are stored, as are absolute duration differences between the duration of the voiced part of each CT-SCUC source version syllable and the durations of the voiced parts of the source syllables of the optimal and non-optimal CART training samples found with that CT-SCUC. Ultimately, a reasonably large number of optimal CART training samples and non-optimal CART training samples, together with corresponding linguistic and durational information, is obtained.
Values for δ0 and δn can be selected heuristically based on the data. The threshold δ1 is made adaptive in such a manner that it depends on the CT-SCUC with which it is being used. It is defined so that a p % deviation from the minimum difference between the closest source contour and the CT-SCUC (e.g., minimum value for agh when comparing the CT-SCUC with other source contours in the codebook) is allowed. The value p is determined by first computing, for each CT-SCUC in the codebook, (1) the minimum distance (e.g., minimum agh) between the source contour for that CT-SCUC and other source contours in the codebook, and (2) the minimum distance between optimal CART training sample source contours found for that CT-SCUC. Then, for each CT-SCUC, the difference between (1) and (2) is calculated and stored. Since there are not always good targets and the mean value could become rather high, the median of these differences is found, and p is that median divided by the largest of the (1)-(2) differences. The value of p is also used in condition 1, above.
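The calculation of p described above can be sketched as follows. The sketch takes the two minimum distances for each CT-SCUC as precomputed inputs and uses absolute differences so that the ratio is well defined; the use of absolute values, the function name `estimate_p` and the zero guard are assumptions.

```python
from statistics import median

def estimate_p(min_dist_all, min_dist_optimal):
    """Estimate p from two per-CT-SCUC lists of minimum distances.

    min_dist_all[i]     -- quantity (1) for the i-th CT-SCUC
    min_dist_optimal[i] -- quantity (2) for the i-th CT-SCUC
    """
    # Absolute (1)-(2) differences, one per CT-SCUC (assumption: magnitude only).
    diffs = [abs(a - b) for a, b in zip(min_dist_all, min_dist_optimal)]
    largest = max(diffs)
    if largest == 0.0:
        return 0.0  # every difference is zero, so no deviation is allowed
    # The median is used rather than the mean, as described in the text.
    return median(diffs) / largest
```

Because the median is divided by the largest difference, the result always lies between 0 and 1.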
The optimal CART training samples and non-optimal CART training samples are used to train the CART. The CART is created by asking a series of questions for features and samples. Numerous references are available regarding techniques for use in CART building and validation. Validation attempts to avoid overfitting. In at least one embodiment, tree functions of the MATLAB programming language are used to validate the CART with 10-fold cross-validation (i.e., a training set is randomly divided into 10 disjoint sets and the CART is trained 10 times; each time a different set is left out to act as a validation set). A validation error gives an estimate of what kind of performance can be expected. The training of a CART seeks to find which features are important in the final candidate selection. There can be many contours very similar to a SCUC (here SCUC refers to a SCUC in the process of
In embodiments which employ equation 5 in block 123, the weight vector W can be found using an LMS algorithm or a perceptron network with a fixed number of iterations.
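One possible form of an LMS weight estimate with a fixed number of iterations is sketched below. The text does not specify the training targets, so the sketch assumes that each training sample pairs a cost vector C with a target value of 1 for a non-optimal sample and 0 for an optimal one. The function name `lms_weights`, the step size and the epoch count are hypothetical.

```python
def lms_weights(samples, n_features, step=0.1, epochs=200):
    """Estimate a weight vector W with the LMS (Widrow-Hoff) update rule.

    samples -- list of (cost_vector, target) pairs; the target is assumed
               to be 1 for a non-optimal sample and 0 for an optimal one
    """
    weights = [0.0] * n_features
    for _ in range(epochs):  # fixed number of passes over the data
        for costs, target in samples:
            predicted = sum(w * c for w, c in zip(weights, costs))
            error = target - predicted
            # Move each weight in proportion to its input and the error.
            weights = [w + step * error * c for w, c in zip(weights, costs)]
    return weights
```

Features whose mismatches reliably accompany non-optimal samples end up with larger weights, while uninformative features are driven toward zero.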
Although the above discussion concentrates on conversion of the pitch prosody component, the invention is not limited in this regard. For example, the techniques described above can also be used for energy contours. A listener perceives speech energy as loudness. In some applications, replicating a target voice energy contour is less important to a convincing conversion than is replication of a target voice pitch contour. In many cases, energy is very susceptible to variation based on conditions other than a target voice (e.g., distance of a speaking person from a microphone). For some voices, however, energy contour may be more important during voice conversion. In such cases, a codebook can also include transformed representations of energy contours for source and target voice versions of the codebook training material. Using that energy data in the codebook, energy contours for syllables of a source passage can be converted using the same techniques described above for pitch contours.
The duration prosodic component can be converted in various manners. As indicated above, a codebook in at least some embodiments includes data for the duration of the source and target versions of each training material syllable. This data (over all training material syllables) can be used to determine a scaling ratio between source and target speakers. For example, a regression line (y=ax+b) can be fit through all source and respective target durations in the codebook. Target duration could then be predicted using the regression coefficients. This scaling ratio can be applied to the output target pitch contour (e.g., prior to storage in block 137 of
In some cases, durations are better modeled in the logarithmic domain. Under such circumstances, the above described duration predicting techniques can be used in the logarithmic domain.
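The regression-based duration prediction, in either the linear or the logarithmic domain, might be sketched as follows. The least-squares fit is the standard closed form for y=ax+b; the function names and the `log_domain` flag are hypothetical, and durations are assumed to be positive.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def predict_duration(d_src, src_durs, tgt_durs, log_domain=False):
    """Predict a target duration from a source duration.

    src_durs, tgt_durs -- paired source and target syllable durations
                          taken from the codebook
    """
    if log_domain:
        # Fit and predict on log-durations, then map back.
        a, b = fit_line([math.log(d) for d in src_durs],
                        [math.log(d) for d in tgt_durs])
        return math.exp(a * math.log(d_src) + b)
    a, b = fit_line(src_durs, tgt_durs)
    return a * d_src + b
```

Fitting in the logarithmic domain models a multiplicative relationship between source and target durations, whereas the linear fit models a scale plus an offset.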
Although specific examples of carrying out the invention have been described, those skilled in the art will appreciate that there are numerous variations and permutations of the above-described systems and methods that are contained within the spirit and scope of the invention as set forth in the appended claims. Examples of such variations include, but are not limited to, the following:
These and other modifications are within the scope of the invention as set forth in the attached claims. In the claims, various portions are prefaced with letter or number references for convenience. However, use of such references does not imply a temporal relationship not otherwise required by the language of the claims.