This application is based upon and claims the benefit of priority from United Kingdom Patent Application No. 1105314.7, filed Mar. 29, 2011; the entire contents of which are incorporated herein by reference.
Embodiments of the present invention described herein generally relate to voice conversion.
Voice Conversion (VC) is a technique for allowing the speaker characteristics of speech to be altered. Non-linguistic information, such as the voice characteristics, is modified while keeping the linguistic information unchanged. Voice conversion can be used for speaker conversion in which the voice of a certain speaker (source speaker) is converted to sound like that of another speaker (target speaker).
The standard approaches to VC employ a statistical feature mapping process. This mapping function is trained in advance using a small amount of training data consisting of utterance pairs of source and target voices. The resulting mapping function is then required to be able to convert any sample of the source speech into that of the target without any linguistic information such as phoneme transcription.
The normal approach to VC is to train a parametric model such as a Gaussian Mixture Model on the joint probability density of source and target spectra and derive the conditional probability density given source spectra to be converted.
The present invention will now be described with reference to the following non-limiting embodiments.
In an embodiment, the present invention provides a method of converting speech from the characteristics of a first voice to the characteristics of a second voice, the method comprising:
The kernels can be derived for either static features on their own or static and dynamic features. Dynamic features take into account the preceding and following frames.
In one embodiment, the speech to be output is determined according to a Gaussian Process predictive distribution:
$$p(y_t \mid x_t, x^*, y^*, \mathcal{M}) = \mathcal{N}(\mu(x_t), \Sigma(x_t)),$$
where yt is the speech vector for frame t to be output, xt is the speech vector for the input speech for frame t, x*, y* is {x1*, y1*}, . . . , {xN*, yN*}, where xt* is the tth frame of training data for the first voice and yt* is the tth frame of training data for the second voice, M denotes the model, μ(xt) and Σ(xt) are the mean and variance of the predictive distribution for given xt.
Further:
and σ is a parameter to be trained, m(x1) is a mean function and k(a,b) is a kernel function representing the similarity between a and b.
The kernel function may be isotropic or non-stationary. The kernel may contain a hyper-parameter or be parameter free.
In an embodiment, the mean function is of the form: m(x)=ax+μ.
In a further embodiment, the speech features are represented by vectors in an acoustic space and said acoustic space is partitioned for the training data such that a cluster of training data represents each part of the partitioned acoustic space, wherein during mapping a frame of input speech is compared with the stored frames of training data for the first voice which have been assigned to the same cluster as the frame of input speech.
In an embodiment, two types of clusters are used, hard clusters and soft clusters. In the hard clusters the boundary between adjacent clusters is hard so that there is no overlap between clusters. The soft clusters extend slightly beyond the boundary of the hard clusters so that there is overlap between the soft clusters. During mapping, the hard clusters will be used for assignment of a vector representing input speech to a cluster. However, the Gramians K* and/or kt may be determined over the soft clusters.
The method may operate using pre-stored training data or it may gather the training data prior to use. The training data is used to train hyper-parameters. If the acoustic space has been partitioned, in an embodiment, the hyper-parameters are trained over soft clusters.
Systems and methods in accordance with embodiments of the present invention can be applied to many uses. For example, they may be used to convert a natural input voice or a synthetic voice input. The synthetic voice input may be speech which is from a speech to speech language converter, a satellite navigation system or the like.
In a further embodiment, systems in accordance with embodiments of the present invention can be used as part of an implant to allow a patient to regain their old voice after vocal surgery.
The above described embodiments apply a Gaussian process (GP) to Voice Conversion. Gaussian processes are non-parametric Bayesian models that can be thought of as a distribution over functions. They provide advantages over the conventional parametric approaches, such as flexibility due to their non-parametric nature.
Further, such a Gaussian Process based approach is resistant to over-fitting.
As such an approach is non-parametric it tackles the issue of the meaning of parameters used in a parametric approach. Also, being non-parametric means that there are only a few hyper-parameters that need to be trained and these parameters maintain their meaning even when more data is introduced. These advantages help to circumvent issues with scaling.
In accordance with further embodiments, a system is provided for converting speech from the characteristics of a first voice to the characteristics of a second voice, the system comprising:
Methods and systems in accordance with embodiments can be implemented either in hardware or on software in a general purpose computer. Further embodiments can be implemented in a combination of hardware and software. Embodiments may also be implemented by a single processing apparatus or a distributed network of processing apparatuses.
Since methods and systems in accordance with embodiments can be implemented by software, systems and methods in accordance with embodiments may be implemented using computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal, e.g. an electrical, optical or microwave signal.
The above voice conversion system converts speech from one speaker (the input speaker) into speech from a different speaker (the target speaker). Ideally, the actual words spoken by the input speaker should be identical to those spoken by the target speaker. The speech of the input speaker is matched to the speech of the target speaker using a mapping function. In embodiments of the present invention, the mapping operation is derived using Gaussian Processes. This is essentially a non-parametric approach to the mapping operation.
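To explain how the mapping operation is derived using Gaussian Processes, it is first useful to understand how the mapping function is derived for a parametric Gaussian Mixture Model. Conditionals and marginals of Gaussian distributions are themselves Gaussian. Namely, if
$$\begin{bmatrix} u \\ v \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_u \\ \mu_v \end{bmatrix}, \begin{bmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{vu} & \Sigma_{vv} \end{bmatrix}\right),$$
then the marginal and conditional distributions are
$$u \sim \mathcal{N}(\mu_u, \Sigma_{uu}), \qquad u \mid v \sim \mathcal{N}\left(\mu_u + \Sigma_{uv}\Sigma_{vv}^{-1}(v - \mu_v),\; \Sigma_{uu} - \Sigma_{uv}\Sigma_{vv}^{-1}\Sigma_{vu}\right).$$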
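Let $x_t$ and $y_t$ be spectral features at frame $t$ for source and target voices, respectively. (For notational simplicity, it is assumed that $x_t$ and $y_t$ are scalar values. Extending them to vectors is straightforward.) GMM-based voice conversion approaches typically model the joint probability density of the source and target spectral features by a GMM as
$$p(z_t \mid \lambda^{(z)}) = \sum_{m=1}^{M} \omega_m \, \mathcal{N}\left(z_t; \mu_m^{(z)}, \Sigma_m^{(z)}\right),$$
where $z_t$ is a joint vector $[x_t, y_t]^\top$, $m$ is the mixture component index, $M$ is the total number of mixture components, and $\omega_m$ is the weight of the $m$-th mixture component. The mean vector and covariance matrix of the $m$-th component, $\mu_m^{(z)}$ and $\Sigma_m^{(z)}$, are given as
$$\mu_m^{(z)} = \begin{bmatrix} \mu_m^{(x)} \\ \mu_m^{(y)} \end{bmatrix}, \qquad \Sigma_m^{(z)} = \begin{bmatrix} \Sigma_m^{(xx)} & \Sigma_m^{(xy)} \\ \Sigma_m^{(yx)} & \Sigma_m^{(yy)} \end{bmatrix}.$$
A parameter set of the GMM is $\lambda^{(z)}$, which consists of weights, mean vectors, and the covariance matrices for individual mixture components.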
The parameter set $\lambda^{(z)}$ is estimated from supervised training data, $\{x_1^*, y_1^*\}, \ldots, \{x_N^*, y_N^*\}$, which is expressed as $x^*$, $y^*$ for the source and target, based on the maximum likelihood (ML) criterion as
$$\hat{\lambda}^{(z)} = \arg\max_{\lambda^{(z)}} \; p(z^* \mid \lambda^{(z)}),$$
where $z^*$ is the set of training joint vectors $z^* = \{z_1^*, \ldots, z_N^*\}$ and $z_t^*$ is the training joint vector at frame $t$, $z_t^* = [x_t^*, y_t^*]^\top$.
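In order to derive the mapping function, the conditional probability density of $y_t$, given $x_t$, is derived from the estimated GMM as follows:
$$p(y_t \mid x_t, \hat{\lambda}^{(z)}) = \sum_{m=1}^{M} P(m \mid x_t, \hat{\lambda}^{(z)}) \, \mathcal{N}\left(y_t; E_{m,t}, D_m\right),$$
$$P(m \mid x_t, \hat{\lambda}^{(z)}) = \frac{\omega_m \, \mathcal{N}(x_t; \mu_m^{(x)}, \Sigma_m^{(xx)})}{\sum_{n=1}^{M} \omega_n \, \mathcal{N}(x_t; \mu_n^{(x)}, \Sigma_n^{(xx)})},$$
$$E_{m,t} = \mu_m^{(y)} + \Sigma_m^{(yx)} \left(\Sigma_m^{(xx)}\right)^{-1} (x_t - \mu_m^{(x)}), \qquad D_m = \Sigma_m^{(yy)} - \Sigma_m^{(yx)} \left(\Sigma_m^{(xx)}\right)^{-1} \Sigma_m^{(xy)},$$
where the superscripts $(x)$ and $(y)$ denote the source and target blocks of the mean vector and covariance matrix of the $m$-th component.
In the conventional approach, the conversion may be performed on the basis of the minimum mean-square error (MMSE) as follows:
$$\hat{y}_t = \mathbb{E}[y_t \mid x_t] = \sum_{m=1}^{M} P(m \mid x_t, \hat{\lambda}^{(z)}) \, E_{m,t}.$$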
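The MMSE mapping for scalar features can be illustrated with a minimal sketch. It weights each component's conditional mean, mu_y + (s_xy / s_xx)(x_t - mu_x), by the posterior probability of that component given the source feature. The dictionary layout, the function names and the two-component toy model are assumptions made for illustration only; they are not taken from the embodiment.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a scalar Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_mmse_convert(x_t, components):
    """MMSE estimate of the target feature for one scalar source frame.

    Each component is a dict with weight "w", means "mu_x" and "mu_y", and
    covariance entries "s_xx" and "s_xy" of a joint scalar GMM (assumed layout).
    """
    # Unnormalised posterior of each mixture component given x_t.
    resp = [c["w"] * gauss_pdf(x_t, c["mu_x"], c["s_xx"]) for c in components]
    total = sum(resp)
    y_hat = 0.0
    for r, c in zip(resp, components):
        # Conditional mean of y given x for this component.
        cond_mean = c["mu_y"] + c["s_xy"] / c["s_xx"] * (x_t - c["mu_x"])
        y_hat += (r / total) * cond_mean
    return y_hat


# Hypothetical two-component joint GMM, for illustration only.
toy_gmm = [
    {"w": 0.5, "mu_x": -1.0, "mu_y": -2.0, "s_xx": 1.0, "s_xy": 0.5},
    {"w": 0.5, "mu_x": 1.0, "mu_y": 2.0, "s_xx": 1.0, "s_xy": 0.5},
]
converted = gmm_mmse_convert(0.0, toy_gmm)
```

Because every frame is converted independently, this rule produces the frame-by-frame mapping that the dynamic features discussed next are intended to smooth.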
In order to avoid each frame being independently mapped, it is possible to consider the dynamic features of the parameter trajectory. Here both the static and dynamic parameters are converted, yielding a set of Gaussian experts to estimate each dimension. Thus
$$z_t = [x_t, y_t, \Delta x_t, \Delta y_t]^\top, \quad (10)$$
$$\Delta x_t = \tfrac{1}{2}(x_{t+1} - x_{t-1}), \quad (11)$$
and similarly for Δyt. Using this modified joint model, a GMM is trained with the following parameters for each component m:
Note that, to limit the number of parameters in the covariance matrix of z, the static and delta parameters are assumed to be conditionally independent given the component. The same process as for the static parameters alone can be used to derive the model parameters. When applying voice conversion to a particular source sequence, this will yield two experts (assuming just delta parameters are added):
As in standard Hidden Markov Model (HMM)-based speech synthesis the sequence ŷ={ŷ1 . . . ŷN} that maximises the output probability given both experts is produced:
In a method and system according to an embodiment of the present invention, the mapping function is derived using non-parametric techniques such as Gaussian Processes. Gaussian processes (GPs) are flexible models that fit well within a probabilistic Bayesian modelling framework. A GP can be used as a prior probability distribution over functions in Bayesian inference. Given any set of N points in the desired domain of the functions, take a multivariate Gaussian whose covariance matrix parameter is the Gramian matrix of the N points with some desired kernel, and sample from that Gaussian. Inference of continuous values with a GP prior is known as GP regression. Thus GPs are also useful as a powerful non-linear interpolation tool. Gaussian processes are an extension of multivariate Gaussian distributions to infinite numbers of variables.
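The sampling construction described above can be sketched as follows: build the Gramian matrix of N points under a kernel, factorise it, and transform white Gaussian noise. The squared exponential kernel form, the jitter value and all function names are assumptions chosen for illustration.

```python
import math
import random


def se_kernel(a, b, length_scale=1.0):
    """Squared exponential kernel for scalar inputs (assumed form)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def gram_matrix(points, kernel, jitter=1e-6):
    """Gramian matrix of the points, with a small diagonal jitter for stability."""
    return [
        [kernel(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(points)]
        for i, a in enumerate(points)
    ]


def cholesky(matrix):
    """Lower-triangular L with L L^T = matrix (matrix must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = math.sqrt(s) if i == j else s / lower[j][j]
    return lower


def sample_gp_prior(points, kernel, rng):
    """Draw one zero-mean function sample evaluated at the given points."""
    lower = cholesky(gram_matrix(points, kernel))
    white = [rng.gauss(0.0, 1.0) for _ in points]
    # Multiplying white noise by L yields a sample with covariance L L^T.
    return [sum(lower[i][k] * white[k] for k in range(i + 1)) for i in range(len(points))]


grid = [0.0, 0.5, 1.0, 1.5, 2.0]
sample = sample_gp_prior(grid, se_kernel, random.Random(0))
```

Each call with a different random state gives a different function drawn from the prior; nearby points receive similar values because the kernel assigns them a high covariance.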
The underlying model for a number of prediction models is that (again considering a single dimension)
$$y_t = f(x_t; \lambda) + \epsilon, \quad (17)$$
where $\epsilon$ is some Gaussian noise term and $\lambda$ are the parameters that define the model.
A Gaussian Process prior can be thought of as representing a distribution over functions.
The above Bayesian likelihood function (17) is used, as before, with a Gaussian process prior for $f(x; \lambda)$:
$$f(x; \lambda) \sim \mathcal{GP}(m(x), k(x, x')), \quad (18)$$
where $k(x, x')$ is a kernel function, which defines the “similarity” between $x$ and $x'$, and $m(x)$ is the mean function. Many different types of kernels can be used. For example:
covLIN: Linear covariance function.
$$k(x_p, x_q) = x_p^\top x_q \quad (K1)$$
covLINard: Linear covariance function with Automatic Relevance Determination, where $P$ is a hyper-parameter to be trained.
$$k(x_p, x_q) = x_p^\top P^{-1} x_q \quad (K2)$$
covLINOne: Linear covariance function with a bias, where $t^2$ is a hyper-parameter to be trained.
covMaterniso: Matern covariance function with $\nu = d/2$, $r = \sqrt{(x_p - x_q)^\top P^{-1} (x_p - x_q)}$ and isotropic distance measure.
$$k(x_p, x_q) = \sigma_f^2 \, f(\sqrt{d}\, r) \, \exp(-\sqrt{d}\, r) \quad (K4)$$
covNNone: Neural network covariance function with a single parameter for the distance measure, where $\sigma_f$ is a hyper-parameter to be trained.
covPoly: Polynomial covariance function, where $c$ is a hyper-parameter to be trained.
$$k(x_p, x_q) = \sigma_f^2 \, (c + x_p^\top x_q)^d \quad (K6)$$
covPPiso: Piecewise polynomial covariance function with compact support.
$$k(x_p, x_q) = \sigma_f^2 \, (1 - r)_+^{j} \, f(r, j)$$
covRQard: Rational Quadratic covariance function with Automatic Relevance Determination, where $\alpha$ is a hyper-parameter to be trained.
covRQiso: Rational Quadratic covariance function with isotropic distance measure.
covSEard: Squared Exponential covariance function with Automatic Relevance Determination.
covSEiso: Squared Exponential covariance function with isotropic distance measure.
covSEisoU: Squared Exponential covariance function with isotropic distance measure with unit magnitude.
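The three kernels whose formulas survive in the list above (K1, K2 and K6) can be sketched for vector inputs as follows. Treating P as a diagonal matrix, passed as a list of its diagonal entries, is an assumption made here for the ARD case; the function names are illustrative.

```python
def dot(a, b):
    """Inner product of two equal-length vectors given as lists."""
    return sum(p * q for p, q in zip(a, b))


def k_lin(x_p, x_q):
    """K1: linear covariance function, x_p^T x_q."""
    return dot(x_p, x_q)


def k_lin_ard(x_p, x_q, p_diag):
    """K2: linear covariance with ARD, x_p^T P^{-1} x_q.

    P is assumed diagonal here; p_diag holds its diagonal entries.
    """
    return sum(a * b / p for a, b, p in zip(x_p, x_q, p_diag))


def k_poly(x_p, x_q, sigma_f, c, d):
    """K6: polynomial covariance, sigma_f^2 * (c + x_p^T x_q)^d."""
    return sigma_f ** 2 * (c + dot(x_p, x_q)) ** d


u = [1.0, 2.0]
v = [3.0, 4.0]
values = (k_lin(u, v), k_lin_ard(u, v, [1.0, 2.0]), k_poly(u, v, 2.0, 1.0, 2))
```

With a diagonal P, a large entry in p_diag shrinks the contribution of the corresponding input dimension, which is how ARD down-weights less relevant dimensions.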
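Using equations 17 and 18 above leads to a Gaussian process predictive distribution:
$$p(y_t \mid x_t, x^*, y^*, \mathcal{M}) = \mathcal{N}(\mu(x_t), \Sigma(x_t)), \quad (19)$$
where $\mu(x_t)$ and $\Sigma(x_t)$ are the mean and variance of the predictive distribution for given $x_t$. These may be expressed as
$$\mu(x_t) = m(x_t) + k_t^\top [K^* + \sigma^2 I]^{-1} (y^* - \mu^*), \quad (20)$$
$$\Sigma(x_t) = k(x_t, x_t) + \sigma^2 - k_t^\top [K^* + \sigma^2 I]^{-1} k_t, \quad (21)$$
where $\mu^*$ is the training mean vector and $K^*$ and $k_t$ are Gramian matrices. They are given as
$$\mu^* = [m(x_1^*), \ldots, m(x_N^*)]^\top, \quad (22)$$
$$K^* = \left[ k(x_i^*, x_j^*) \right]_{i,j=1}^{N}, \quad (23)$$
$$k_t = [k(x_1^*, x_t), \ldots, k(x_N^*, x_t)]^\top. \quad (24)$$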
The above method requires a matrix inversion, which is O(N^3); however, sparse methods and other reductions, such as using a Cholesky decomposition, may be used.
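As a minimal sketch of Eqs. (20) and (21) for scalar features, the predictive mean and variance can be computed with a Cholesky factorisation of K* + σ²I instead of an explicit inverse. The function names, and the kernel and mean function used in the example call, are assumptions for illustration.

```python
import math


def cholesky(matrix):
    """Lower-triangular L with L L^T = matrix (matrix must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = math.sqrt(s) if i == j else s / lower[j][j]
    return lower


def solve_chol(lower, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(lower[i][k] * z[k] for k in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
    return x


def gp_predict(x_t, xs, ys, kernel, mean_fn, noise_var):
    """Predictive mean (Eq. 20) and variance (Eq. 21) for one input frame."""
    n = len(xs)
    # K* + sigma^2 I over the training frames of the first voice.
    k_star = [
        [kernel(xs[i], xs[j]) + (noise_var if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    lower = cholesky(k_star)
    k_t = [kernel(x, x_t) for x in xs]
    resid = [y - mean_fn(x) for x, y in zip(xs, ys)]  # y* - mu*
    alpha = solve_chol(lower, resid)
    v = solve_chol(lower, k_t)
    mu = mean_fn(x_t) + sum(a * b for a, b in zip(k_t, alpha))
    var = kernel(x_t, x_t) + noise_var - sum(a * b for a, b in zip(k_t, v))
    return mu, var


def se(a, b):
    """Unit squared exponential kernel (assumed form, for the example only)."""
    return math.exp(-0.5 * (a - b) ** 2)


mu, var = gp_predict(1.0, [1.0], [2.0], se, lambda x: 0.0, 1.0)
```

The factorisation is still cubic in the number of training frames, but it is numerically more stable than forming the inverse, and the factor can be reused for every input frame mapped against the same training set.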
Using the above method it is possible to use GPs to derive a mapping function between source and target speakers.
From Eqs. (20) and (21) the means and covariance matrices for the prediction can be obtained. However if used directly this would again yield a frame-by-frame prediction. To address this the dynamic parameters can also be predicted. Thus, two GP experts can be produced:
In an embodiment, GPs for each of the static and delta experts are trained independently, though this is not necessary.
If only the static expert is used, then in the same fashion as GMM VC the estimated trajectory is just frame by frame. Thus
In the same fashion as the standard GMM VC process it is possible to use these
As the GP predictive distributions are Gaussian, a standard speech parameter generation algorithm can be used to generate the smooth trajectories of target static features from the GP experts.
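A minimal sketch of such a parameter generation step is given below, assuming scalar features, one static and one delta expert per frame, and the delta window of Eq. (11) with indices clamped at the sequence ends. It maximises the product of the two Gaussian experts by solving the normal equations (WᵀPW) y = WᵀPμ, where W stacks the identity and delta windows and P holds the inverse variances. The boundary handling and all names are assumptions, not details taken from the embodiment.

```python
def delta_row(t, n):
    """Coefficients of delta y_t = 0.5 * (y[t+1] - y[t-1]), clamped at the ends."""
    row = [0.0] * n
    row[min(t + 1, n - 1)] += 0.5
    row[max(t - 1, 0)] -= 0.5
    return row


def solve_spd(a, b):
    """Gaussian elimination without pivoting (a is symmetric positive definite)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for i in range(n):
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= f * m[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def generate_trajectory(mu_s, var_s, mu_d, var_d):
    """Most probable static trajectory given static and delta Gaussian experts."""
    n = len(mu_s)
    a = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for t in range(n):
        # Static expert contributes to the diagonal only.
        a[t][t] += 1.0 / var_s[t]
        b[t] += mu_s[t] / var_s[t]
        # Delta expert couples neighbouring frames.
        row = delta_row(t, n)
        for i in range(n):
            if row[i] == 0.0:
                continue
            b[i] += row[i] * mu_d[t] / var_d[t]
            for j in range(n):
                a[i][j] += row[i] * row[j] / var_d[t]
    return solve_spd(a, b)


ramp = generate_trajectory(
    [0.0, 1.0, 2.0, 3.0, 4.0], [1.0] * 5, [0.5, 1.0, 1.0, 1.0, 0.5], [1.0] * 5
)
```

When the delta variances are very large the delta expert carries no weight and the result reduces to the frame-by-frame static means; smaller delta variances pull the trajectory towards the predicted slopes.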
A Gaussian Process is completely described by its covariance and mean functions. These when coupled with a likelihood function are everything that is needed to perform inference. The covariance function of a Gaussian Process can be thought of as a measure that describes the local covariance of a smooth function. Thus a data point with a high covariance function value with another is likely to deviate from its mean in the same direction as the other point. Not all functions are covariance functions as they need to form a positive definite Gram matrix.
There are two kinds of kernel, stationary and non-stationary. A stationary covariance function is a function of xi−xj. Thus it is invariant to translations in the input space. Non-stationary kernels take into account translation and rotation. Thus isotropic kernels are atemporal when looking at time series, as they will yield the same value wherever they are evaluated if their input vectors are the same distance apart. This contrasts with non-stationary kernels, which will give different values. An example of an isotropic kernel is the squared exponential
which is a function of the distance between its input vectors. An example of a non-stationary kernel is the linear kernel.
k(xp,xq)=xp·xq, (30)
Both types can be of use in voice conversion. Firstly, under stationary assumptions, isotropic kernels can capture the local behaviour of a spectrum well. Non-stationary kernels handle time series better when there is little correlation. The kernels described above are parameter free. It is also possible to have covariance functions that have hyper-parameters that can be trained. One example is a linear covariance function with automatic relevance determination (ARD), where:
$$k(x_p, x_q) = x_p^\top P^{-1} x_q \quad (31)$$
$P^{-1}$ is a free parameter that needs to be trained. For a complete list of the forms of covariance function examined in this work see Appendix A. A combination of kernels can also be used to describe speech signals. There are also a few choices for the mean function of a Gaussian Process: a zero mean, m(x)=0; a constant mean, m(x)=μ; a linear mean, m(x)=ax; or their combination, m(x)=ax+μ. In this embodiment, the combination of constant and linear mean, m(x)=ax+μ, was used for all systems.
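The difference between the two kinds of kernel can be checked numerically: shifting both inputs by the same amount leaves a stationary kernel unchanged but changes the linear kernel. The sketch below assumes a unit squared exponential form for the stationary case and scalar inputs; it also includes the combined mean function m(x)=ax+μ. All names are illustrative.

```python
import math


def k_se(x_p, x_q):
    """Unit squared exponential kernel (assumed form); depends only on x_p - x_q."""
    return math.exp(-0.5 * (x_p - x_q) ** 2)


def k_linear(x_p, x_q):
    """Linear kernel for scalar inputs; depends on the inputs themselves."""
    return x_p * x_q


def linear_mean(x, a, mu):
    """Combined constant and linear mean function m(x) = a*x + mu."""
    return a * x + mu


shift = 3.0
se_before, se_after = k_se(1.0, 2.0), k_se(1.0 + shift, 2.0 + shift)
lin_before, lin_after = k_linear(1.0, 2.0), k_linear(1.0 + shift, 2.0 + shift)
```

Here se_before and se_after agree, while lin_before and lin_after differ, which is the translation behaviour described above.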
Covariance and mean functions have parameters, and selecting good values for these parameters has an impact on the performance of the predictor. These hyper-parameters can be set a priori, but it makes sense to set them to the values that best describe the data, i.e. those that minimize the negative log marginal likelihood of the data. In an embodiment, the hyper-parameters are optimized using Polak-Ribière conjugate gradients to compute the search directions, and a line search using quadratic and cubic polynomial approximations and the Wolfe-Powell stopping criteria is used together with the slope ratio method for guessing initial step sizes.
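The objective being optimised can be sketched as follows for scalar features. The negative log marginal likelihood is ½ rᵀK⁻¹r + ½ log|K| + (N/2) log 2π, with K = K* + σ²I and r = y* − μ*. To keep the sketch short, the conjugate gradient optimiser named above is replaced here by a plain search over a list of candidate noise variances; that substitution, and all function names, are assumptions for illustration.

```python
import math


def cholesky(matrix):
    """Lower-triangular L with L L^T = matrix (matrix must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = math.sqrt(s) if i == j else s / lower[j][j]
    return lower


def solve_chol(lower, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(lower[i][k] * z[k] for k in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
    return x


def neg_log_marginal_likelihood(xs, ys, kernel, mean_fn, noise_var):
    """Negative log marginal likelihood of the training data."""
    n = len(xs)
    k = [
        [kernel(xs[i], xs[j]) + (noise_var if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    lower = cholesky(k)
    resid = [y - mean_fn(x) for x, y in zip(xs, ys)]
    alpha = solve_chol(lower, resid)
    data_fit = 0.5 * sum(r * a for r, a in zip(resid, alpha))
    # 0.5 * log|K| equals the sum of the logs of the Cholesky diagonal.
    complexity = sum(math.log(lower[i][i]) for i in range(n))
    return data_fit + complexity + 0.5 * n * math.log(2.0 * math.pi)


def select_noise_var(xs, ys, kernel, mean_fn, candidates):
    """Pick the candidate noise variance with the lowest objective (grid search)."""
    return min(
        candidates,
        key=lambda v: neg_log_marginal_likelihood(xs, ys, kernel, mean_fn, v),
    )


def se(a, b):
    """Unit squared exponential kernel (assumed form, for the example only)."""
    return math.exp(-0.5 * (a - b) ** 2)


best = select_noise_var([0.0], [3.0], se, lambda x: 0.0, [1.0, 8.0, 64.0])
```

The data-fit term rewards hyper-parameters that explain the residuals, while the log-determinant term penalises overly flexible covariances; a gradient-based optimiser would minimise the same quantity.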
The size of the Gramian matrix K, which is equal to the number of samples in the training data, can be tens of thousands in VC. Computing the inverse of the Gramian matrix requires O(N^3) operations. In an embodiment, the input space is first divided into sub-spaces, and then a GP is trained for each sub-space. This reduces the number of samples that are used to train each GP. This circumvents the issue of slow matrix inversion and also allows a more accurate training procedure that improves the accuracy of the mapping on a per-cluster level. The Linde-Buzo-Gray (LBG) algorithm with the Euclidean distance in mel-cepstral coefficients is used to split the data into its sub-spaces.
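The partitioning step can be sketched with a small LBG-style routine: start from the global centroid, split every centroid into two perturbed copies, refine with nearest-centroid reassignment under the Euclidean distance, and repeat until the codebook reaches the requested size. The additive perturbation, the fixed iteration count and the restriction to power-of-two codebook sizes are simplifying assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def nearest(x, codebook):
    """Index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: sq_dist(x, codebook[i]))


def lbg(points, n_clusters, eps=0.01, n_iter=20):
    """LBG-style codebook; n_clusters is assumed to be a power of two."""
    codebook = [centroid(points)]
    while len(codebook) < n_clusters:
        # Split every centroid into two slightly perturbed copies.
        codebook = [[c + s * eps for c in cb] for cb in codebook for s in (1.0, -1.0)]
        for _ in range(n_iter):
            cells = [[] for _ in codebook]
            for x in points:
                cells[nearest(x, codebook)].append(x)
            # Keep the old centroid if a cell received no points.
            codebook = [centroid(cell) if cell else cb for cell, cb in zip(cells, codebook)]
    return codebook


frames = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
codebook = sorted(lbg(frames, 2))
```

Once the codebook is available, each training frame is assigned to its nearest centroid and a separate GP is trained on the frames of each cell, so every Gramian matrix is much smaller than the full one.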
A voice conversion method in accordance with an embodiment of the present invention will now be described with reference to
The front end unit also removes signals which are not believed to be speech signals and other irrelevant information. Popular front end units comprise apparatus which use filter bank (FBANK) parameters, Mel-frequency Cepstral Coefficients (MFCC) and Perceptual Linear Predictive (PLP) parameters. The output of the front end unit is in the form of an input vector which is in n-dimensional acoustic space.
The speech features are extracted in step S105. In some systems, it may be possible to select between multiple target voices. If this is the case, a target voice will be selected in step S106. The training data which will be described with reference to
Next, kernels are derived which define the similarity between two speech vectors. In step S109, kernels are derived which show the similarity between different speech vectors in the training data. In order to reduce the computing complexity, in an embodiment, the training data will be partitioned as described with reference to
Next, kernels are derived looking this time at the similarity between speech features derived from the training data and the actual input speech.
The method then continues at step S113 of
The training mean vector μ* is then derived using equation 22 and this is the mean taken over all training samples in this embodiment.
A second Gramian matrix kt is derived using equation 24; this uses the kernel functions obtained in step S111, which look at the similarity between training data and input speech.
Then using the results of step S113, S115 and S117, the mean value at each frame is computed for the target speech using equation 25.
The variance value is then computed for each frame of the converted speech using the results derived in steps S113, S115 and S117. The converted speech is the most likely approximation to the target speech. The covariance function has hyper-parameter σ. Hyper-parameter σ can be optimized as previously described using techniques such as Polak-Ribière conjugate gradients to compute the search directions, with a line search using quadratic and cubic polynomial approximations, the Wolfe-Powell stopping criteria, and the slope ratio method for guessing initial step sizes.
Using the results of step S119 and step S121, the most probable static feature y (target speech) from the mean and variances is generated by solving equation 28. The target speech is then output in step S125.
Signals which are believed not to be speech signals and other irrelevant information are removed.
In this embodiment, the speech features are clustered S205 as shown in
The hyper-parameters are trained for each cluster in step S207 and
The procedure is then repeated for each cluster.
In an embodiment where clustering has been performed, in use, an input speech vector which is extracted from the speech which is to be converted is assigned to a cluster. The assignment takes place by seeing in which cluster in acoustic space the input vector lies. The vectors μ(xt) and Σ(xt) are then determined using the data stored for that cluster.
In a further embodiment, soft clusters are used for training the hyper-parameters. Here, the volume of the cluster which is used to train the hyper-parameters for a part of acoustic space is taken over a region of acoustic space which is larger than the said part. This allows the clusters to overlap at their edges and mitigates discontinuities at cluster boundaries. However, in this embodiment, although the clusters extend over a volume larger than the part of acoustic space defined when acoustic space is partitioned in step S205, assignment of a speech vector to be converted will be on the basis of the partitions derived in step S205.
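The distinction between hard assignment and overlapping soft clusters can be sketched as below. The text does not specify how far a soft cluster extends beyond its hard boundary, so the rule used here (a training frame also joins any cluster whose centroid is within a factor 1 + margin of its nearest-centroid distance) is purely a hypothetical choice for illustration, as are the function names.

```python
import math


def distance(a, b):
    """Euclidean distance between two vectors given as lists."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def hard_assign(x, codebook):
    """Index of the single cluster whose centroid is nearest to x."""
    return min(range(len(codebook)), key=lambda i: distance(x, codebook[i]))


def soft_members(points, codebook, margin=0.2):
    """Indices of the training frames belonging to each soft cluster.

    Hypothetical rule: a frame joins every cluster whose centroid lies within
    (1 + margin) times the distance to its nearest centroid.
    """
    members = [[] for _ in codebook]
    for idx, x in enumerate(points):
        dists = [distance(x, c) for c in codebook]
        limit = (1.0 + margin) * min(dists)
        for cluster, d in enumerate(dists):
            if d <= limit:
                members[cluster].append(idx)
    return members


centroids = [[0.0], [10.0]]
train = [[1.0], [4.8], [9.0]]
hard = [hard_assign(x, centroids) for x in train]
soft = soft_members(train, centroids)
```

In this toy example the frame at 4.8 is hard-assigned to the first cluster but contributes to the training set of both soft clusters, which is the overlap at the boundary that the soft clusters are meant to provide.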
Voice conversion systems which incorporate a method in accordance with the above described embodiment are, in general, more resistant to over-fitting and over-smoothing. They also provide an accurate prediction of the formant structure. Over-smoothing exhibits itself when there is not enough flexibility in the modelling of the relationship between the target speaker and input speaker to capture certain structure in the spectral features of the target speaker. The most detrimental manifestation of this is the over-smoothing of the target spectra. When parametric methods are used to model the relationship between the target speaker and input speaker, it is possible to add more parameters. Adding more mixture components allows for more flexibility in the set of mean parameters and can tackle these problems of over-smoothing, but it soon encounters over-fitting to the data, and quality is lost, especially in an objective measure like mel-cepstral distortion. Parametric models also have a more limited ability to benefit as more data is introduced, since they lose flexibility, and the meaning of the parameters can become difficult to interpret.
The above described embodiment applies a Gaussian process (GP) to Voice Conversion. Gaussian processes are non-parametric Bayesian models that can be thought of as a distribution over functions. They provide advantages over the conventional parametric approaches, such as flexibility due to their non-parametric nature.
Further, such a Gaussian Process based approach is resistant to over-fitting.
As such an approach is non-parametric it tackles the issue of the meaning of parameters used in a parametric approach. Also, being non-parametric means that there are only a few hyper-parameters that need to be trained and these parameters maintain their meaning even when more data is introduced. These advantages help to circumvent issues with scaling.
FIGS. 9a and 9b show schematically how the above Gaussian Process based approach differs from parametric approaches. Here, following the previous notation, it is desired to convert speech vectors xt from the first voice to speech vectors yt of the second voice. In the previous parametric based approaches, a set of model parameters λ is derived based on speech vectors of the first voice x1*, . . . , xN* and the second voice y1*, . . . , yN*. The parameters are derived by looking at the correspondence between the speech vectors of the training data for the first voice and the corresponding speech vectors of the training data of the second voice. Once the parameters are derived, they are used to derive the mapping function from the input vector from the first voice xt to the second voice yt. In this stage, only the derived parameters λ are used, as shown in
However, in embodiments according to the present invention, model parameters are not derived and the mapping function is derived by looking at the distribution across all training vectors either across the whole acoustic space or within a cluster if the acoustic space has been partitioned.
To evaluate the performance of the Gaussian Process based approach, a speaker conversion experiment was conducted. Fifty sentences uttered by female speakers, CLB and SLT, from the CMU ARCTIC database were used for training (source: CLB, target: SLT). Fifty sentences, which were not included in the training data, were used for evaluation. Speech signals were sampled at a rate of 16 kHz and windowed with 5 ms of shift, and then 40th-order mel-cepstral coefficients were obtained by using a mel-cepstral analysis technique. The log F0 values for each utterance were also extracted. The feature vectors of source and target speech consisted of 41 mel-cepstral coefficients including the zeroth coefficients. The DTW algorithm was used to obtain time alignments between source and target feature vector sequences. According to the DTW results, joint feature vectors were composed for training joint probability density between source and target features. The total number of training samples was 34,664.
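The time alignment step can be sketched with a basic dynamic time warping routine that returns the aligned index pairs from which joint feature vectors would be composed. The step pattern (diagonal, vertical and horizontal moves with equal weight) and the use of scalar frames in the example are assumptions; the experiment does not state which DTW variant was used.

```python
def dtw_align(src, tgt, dist):
    """Return the list of aligned (source index, target index) pairs."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j] is the best accumulated cost aligning src[:i] with tgt[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


alignment = dtw_align([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0], lambda a, b: abs(a - b))
```

Each pair in the returned path gives one training sample: the source frame and the target frame it was aligned to, so repeated target frames simply reuse the same source frame.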
Five systems were compared in this experiment, which were
They were trained from the composed joint feature vectors. The dynamic features (delta and delta-delta features) were calculated as
$$\Delta x_t = 0.5\,x_{t+1} - 0.5\,x_{t-1},$$
$$\Delta^2 x_t = x_{t+1} - 2x_t + x_{t-1}.$$
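A minimal sketch of computing delta and delta-delta features for a scalar trajectory follows. The delta-delta is taken as the usual second difference x[t+1] − 2x[t] + x[t−1], and frames outside the sequence are replaced by the nearest valid frame; both the second-difference form and the edge handling are assumptions, since the text does not state them explicitly.

```python
def delta(x):
    """First-order dynamic feature: 0.5 * x[t+1] - 0.5 * x[t-1]."""
    n = len(x)
    return [0.5 * x[min(t + 1, n - 1)] - 0.5 * x[max(t - 1, 0)] for t in range(n)]


def delta_delta(x):
    """Second-order dynamic feature: x[t+1] - 2 * x[t] + x[t-1]."""
    n = len(x)
    return [x[min(t + 1, n - 1)] - 2.0 * x[t] + x[max(t - 1, 0)] for t in range(n)]


track = [0.0, 1.0, 4.0, 9.0]
d1 = delta(track)
d2 = delta_delta(track)
```

For the quadratic toy track the interior delta-delta values are constant, as expected for a second difference.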
For GP-based VC, we split the input space (mel-cepstral coefficients from the source speaker) into 32 regions using the LBG algorithm, then trained a GP for each cluster for each dimension. According to the results of a preliminary experiment, we chose a combination of constant and linear functions for the mean function of GP-based VC.
The log F0 values in this experiment were converted by using the simple linear conversion. The speech waveform was re-synthesized from the converted mel-cepstral coefficients and log F0 values through the mel log spectrum approximation (MLSA) filter with pulse-train or white-noise excitation.
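The text does not give the form of the linear conversion. One commonly used form, shown here purely as an assumption, shifts and scales log F0 so that the source speaker's mean and standard deviation match those of the target speaker. The function names and the example values are hypothetical.

```python
import math


def mean_std(values):
    """Mean and (population) standard deviation of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(var)


def convert_log_f0(lf0, src_stats, tgt_stats):
    """Linear log F0 conversion matching mean and standard deviation (assumed form)."""
    src_mean, src_std = src_stats
    tgt_mean, tgt_std = tgt_stats
    return tgt_mean + (tgt_std / src_std) * (lf0 - src_mean)


# Hypothetical voiced-frame log F0 values for each speaker.
src_stats = mean_std([4.0, 5.0, 6.0])
tgt_stats = mean_std([5.0, 5.5, 6.0])
converted_lf0 = convert_log_f0(6.0, src_stats, tgt_stats)
```

In such a scheme the statistics would be estimated over voiced frames only, since log F0 is undefined for unvoiced frames.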
The accuracy of the method in accordance with an embodiment was measured for various kernel functions. The mel-cepstral distortion between the target and converted mel-cepstral coefficients in the evaluation set was used as an objective evaluation measure.
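A commonly used definition of the mel-cepstral distortion, in dB, is (10 / ln 10) · sqrt(2 Σ_d (c_d − ĉ_d)²) per frame, averaged over frames. The sketch below follows that definition and excludes the zeroth (energy) coefficient by default; whether the experiment used exactly this variant is an assumption.

```python
import math


def mel_cepstral_distortion(target, converted, skip_zeroth=True):
    """Average mel-cepstral distortion in dB over time-aligned frames.

    target and converted are lists of frames; each frame is a list of
    mel-cepstral coefficients of the same order.
    """
    start = 1 if skip_zeroth else 0
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for c, c_hat in zip(target, converted):
        sq = sum((a - b) ** 2 for a, b in zip(c[start:], c_hat[start:]))
        total += scale * math.sqrt(2.0 * sq)
    return total / len(target)


mcd = mel_cepstral_distortion([[0.0, 1.0, 0.0]], [[5.0, 0.0, 0.0]])
```

Because the measure is a scaled Euclidean distance between coefficient vectors, it is lower for conversions whose mel-cepstra lie close to the target in that space.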
First, the choice of kernel function (covariance function), the effect of optimizing hyper-parameters, and the effect of dynamic features were evaluated. Tables 1 and 2 show the mel-cepstral distortions between target speech and converted speech by the proposed GP-based mapping with various kernel functions, with and without using dynamic features, respectively.
It can be seen from Table 1 that optimizing the hyper-parameter slightly reduced the distortions and the isotropic kernels appeared to outperform the non-stationary ones. This is believed to be due to the consistency between evaluation measure and kernel function. The mel-cepstral distortion is actually the total Euclidean distance between two sets of mel-cepstral coefficients in dB scale. The linear kernel uses the distance metric in input space (mel-cepstral coefficients), thus the evaluation measure (mel-cepstral distortion) and similarity measure (kernel function) were consistent. Table 2 indicates that the use of dynamic features degraded the mapping quality.
Next, the GP-based conversion in accordance with an embodiment of the invention is compared with the conventional approaches. Table 3 shows the mel-cepstral distortions obtained by the GMM-based approaches with and without dynamic features, trajectory GMMs, and the proposed GP-based approaches. It can be seen from the table that the proposed GP-based approaches achieved significant improvements over the conventional parametric approaches.
It can be seen from the results of
The experimental results above indicate that GP with the simple linear kernel function achieved the lowest mel-cepstral distortion among many kernel functions. It is believed that this is due to the consistency between evaluation measure and kernel function. The mel-cepstral distortion used here is actually the total Euclidean distance between two sets of mel-cepstral coefficients. The linear kernel uses the distance metric in input space (mel-cepstral coefficients), thus the evaluation measure (mel-cepstral distortion) and similarity measure (kernel function) were consistent.
However, it is known that the mel-cepstral distortion is not highly correlated to human perception.
Therefore, in a further embodiment, the kernel function is replaced by a distance metric more correlated to human perception.
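One possible metric is the log-spectral distortion (LSD), where the distance between two power spectra $P(\omega)$ and $\hat{P}(\omega)$ is computed as
$$D_{LS} = \sqrt{\frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ 10 \log_{10} \frac{P(\omega)}{\hat{P}(\omega)} \right]^2 d\omega},$$
where these two spectra can be computed from the mel-cepstral coefficients using a recursive formula. An alternative is the Itakura-Saito distance, which measures the perceived difference between two spectra. It was proposed by Fumitada Itakura and Shuzo Saito in the 1970s and is defined as
$$D_{IS}(P, \hat{P}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \frac{P(\omega)}{\hat{P}(\omega)} - \log \frac{P(\omega)}{\hat{P}(\omega)} - 1 \right] d\omega.$$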
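Both measures can be approximated on sampled power spectra by replacing the integral over frequency with an average over uniformly spaced bins. The sketch below uses the standard definitions, LSD = sqrt(mean((10 log10(P/P̂))²)) and IS = mean(P/P̂ − ln(P/P̂) − 1); the uniform sampling and the function names are assumptions for illustration.

```python
import math


def log_spectral_distortion(p, p_hat):
    """Log-spectral distortion in dB between two sampled power spectra."""
    terms = [(10.0 * math.log10(a / b)) ** 2 for a, b in zip(p, p_hat)]
    return math.sqrt(sum(terms) / len(terms))


def itakura_saito(p, p_hat):
    """Itakura-Saito distance between two sampled power spectra."""
    terms = [a / b - math.log(a / b) - 1.0 for a, b in zip(p, p_hat)]
    return sum(terms) / len(terms)


spec_a = [10.0, 10.0]
spec_b = [1.0, 1.0]
lsd = log_spectral_distortion(spec_a, spec_b)
isd = itakura_saito(spec_a, spec_b)
```

Note that the Itakura-Saito distance is not symmetric in its two arguments, so the order of the reference and the estimated spectrum matters when it is used as a similarity measure.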
The current implementation operates on scalar inputs, but could be extended to vector inputs.
In a further embodiment, a linear combination of isotropic and non-stationary kernels is used, for example combinations of those listed as K1 to K10 above.
In the above embodiments, Gaussian Process based voice conversion is applied to convert the speaker characteristics in natural speech. However, it can also be used to convert synthesised speech, for example the output of an in-car Sat Nav system or a speech to speech translation system.
In a further embodiment, the input speech is not produced by vocal excitations. For example, the input speech could be body-conducted speech, esophageal speech etc. This type of system could be of benefit where a user had received a laryngotomy and was relying on non-larynx based speech. The system could modify the non-larynx based speech to reproduce the original speech of the user before the laryngotomy, thus allowing a user to regain a voice which is close to their original voice.
Voice conversion has many uses, for example modifying a source voice to a selected voice in systems such as in-car navigation systems, uses in games software and also for medical applications to allow a speaker who has undergone surgery or otherwise has their voice compromised to regain their original voice.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel systems and methods described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the systems and methods described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
1105314.7 | Mar 2011 | GB | national |