Having thus described the invention in general terms, reference will now be made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:
FIGS. 1a-1c are schematic block diagrams of a framework for voice conversion according to different exemplary embodiments of the present invention;
FIGS. 2a-2c are schematic block diagrams of a telecommunications apparatus including components of a framework for voice conversion according to different exemplary embodiments of the present invention;
FIGS. 3a-3c are schematic block diagrams of a text-to-speech converter according to different exemplary embodiments of the present invention;
The present invention now will be described more fully hereinafter with reference to the accompanying drawings, in which preferred exemplary embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the exemplary embodiments set forth herein; rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Like numbers refer to like elements throughout.
Exemplary embodiments of the present invention provide a system, method and computer program product for voice conversion whereby a source speech signal associated with a source voice is converted into a target speech signal that is a representation of the source speech signal, but is associated with a target voice. Portions of exemplary embodiments of the present invention may be shown and described herein with reference to the voice conversion framework disclosed in U.S. patent application Ser. No. 11/107,344, entitled: Framework for Voice Conversion, filed Apr. 15, 2005, the contents of which are hereby incorporated by reference in their entirety. It should be understood, however, that exemplary embodiments of the present invention may be equally adaptable to any of a number of different voice conversion frameworks. As explained herein, the framework of the U.S. patent application Ser. No. 11/107,344 is a parametric framework wherein speech may be represented using a set of feature vectors or parameters. It should be understood, however, that exemplary embodiments of the present invention may be equally adaptable to any of a number of other types of frameworks (e.g., waveform frameworks, etc.).
In accordance with exemplary embodiments of the present invention, a source speech signal may be converted into a target speech signal. More particularly, in accordance with a parametric voice conversion framework of one exemplary embodiment of the present invention, encoding parameters related to the source speech signal (source encoding parameters) may be converted into corresponding encoding parameters related to the target speech signal (target encoding parameters). As explained above, a speech signal is frequently represented by a source-filter model of speech whereby a source component of speech (excitation signal), originating from the vocal cords, is shaped by a filter imitating the effect of the vocal tract (vocal tract filter). Thus, for example, vocal tract filter and/or excitation encoding parameters related to the source speech signal may be converted into corresponding vocal tract filter and/or excitation encoding parameters related to the target speech signal.
FIGS. 1a-1c are schematic block diagrams of a framework for voice conversion according to different exemplary embodiments of the present invention. Turning first to FIGS. 1a and 1b, the framework 1a, 1b includes an encoder 10a, 10b configured for encoding a source speech signal into encoding parameters, a decoder 12a, 12b configured for decoding encoding parameters into a target speech signal, and a link 11 therebetween, with the converter for converting source encoding parameters into target encoding parameters integrated into the encoder or the decoder.
As shown and described herein, the encoder 10a, 10b and decoder 12a, 12b of the framework 1a, 1b may be implemented in the same apparatus, such as within a module of a speech processing system. In such instances, the link 11 may be a simple electrical connection. Alternatively, however, the encoder and decoder may be implemented in different apparatuses, and in such instances, the link 11 may be a transmission link (wired or wireless link) between the apparatuses. Locating the encoder and decoder in different apparatuses may be particularly useful in various contexts, such as that of a telecommunications system, as will be discussed below with reference to FIGS. 2a-2c.
FIG. 1c illustrates a framework 1c of yet another exemplary embodiment of the present invention, where the converter 13c is implemented in a component separate from the encoder 10c and decoder 12c. In this regard, the encoder may be configured for encoding a source speech signal into encoding parameters, which may be transferred via link 11-1 to the converter. The converter may convert the encoding parameters into a converted representation thereof, or more particularly convert source parameters into target parameters. The converter may then forward the converted representation of the encoding parameters via a link 11-2 to the decoder. In turn, the decoder may be configured for decoding the converted representation of the encoding parameters into the target speech signal. The encoder, decoder and converter of the framework of FIG. 1c may similarly be implemented in the same apparatus or in different apparatuses, with the links 11-1 and 11-2 being electrical connections or transmission links, as appropriate.
FIG. 2a illustrates a block diagram of a telecommunications apparatus 2a, such as a mobile terminal operable in a mobile communications system, including components of a framework for voice conversion according to one exemplary embodiment of the present invention. A typical use case of such an apparatus is the establishment of a call via a core network of the mobile communications system. As shown, the apparatus includes an antenna 20, an R/F (radio frequency) instance 21, a central processing unit (CPU) 22 or other processor or controller, an audio processor 23 and a speaker 24, although it should be understood that the apparatus may include other components for operation in accordance with exemplary embodiments of the present invention. The antenna may be configured for receiving electromagnetic signals carrying a representation of speech signals, and passing those signals to the R/F instance. The R/F instance may be configured for amplifying, mixing and analog-to-digital converting the signals, and passing the resulting digital speech signals to the CPU. In turn, the CPU may be configured for processing the digital speech signals and triggering the audio processor to generate a corresponding analog speech signal for emission by the speaker.
As also shown in FIG. 2a, the apparatus 2a may further include one or more components of the framework for voice conversion of FIGS. 1a-1c, such that a representation of a source speech signal received by the apparatus may be converted into a target speech signal associated with a target voice for emission by the speaker 24.
FIG. 2b illustrates a block diagram of a telecommunications apparatus 2b including components of a framework for voice conversion according to another exemplary embodiment of the present invention. As shown, components of apparatus 2b with the same function as those of their counterparts in apparatus 2a of FIG. 2a are denoted by the same reference numerals.
FIG. 2c illustrates a block diagram of a telecommunications apparatus 2c including components of a framework for voice conversion according to yet another exemplary embodiment of the present invention. As shown, components of apparatus 2c with the same function as those of their counterparts in apparatuses 2a and 2b of FIGS. 2a and 2b are denoted by the same reference numerals.
FIG. 3a is a schematic block diagram of a text-to-speech (TTS) converter 3a according to one exemplary embodiment of the present invention. The TTS converter of exemplary embodiments of the present invention may be particularly useful in a number of different contexts including, for example, reading of Short Message Service (SMS) messages to a user of a telecommunications apparatus, or reading of traffic information to a driver of a car via a car radio. As shown, the TTS converter includes a voice conversion unit 1, which may be implemented according to any of the frameworks 1a, 1b and 1c of FIGS. 1a-1c.
FIG. 3b is a schematic block diagram of a TTS converter 3b according to another exemplary embodiment of the present invention. As shown, components of TTS converter 3b with the same function as those of their counterparts in TTS converter 3a of FIG. 3a are denoted by the same reference numerals.
FIG. 3c is a schematic block diagram of a TTS converter 3c according to yet another exemplary embodiment of the present invention. Again, components of TTS converter 3c with the same function as those of their counterparts in TTS converters 3a and 3b of FIGS. 3a and 3b are denoted by the same reference numerals.
In accordance with exemplary embodiments of the present invention, voice conversion generally includes feature/parameter extraction (e.g., by encoder 10), conversion model training and voice conversion (e.g., by converter 13), and re-synthesis (e.g., by decoder 12). Each of these phases of voice conversion will now be described below in accordance with such exemplary embodiments of the present invention, although it should be understood that one or more of the respective phases may be performed in manners other than those described herein.
A. Feature/Parameter Extraction
A popular approach in parametric speech coding is to represent the speech signal or the vocal tract excitation signal by a sum of sine waves of arbitrary amplitudes, frequencies and phases:

s(t) = \sum_{m=1}^{L} \alpha_m \cos\left( \int_0^t \omega_m(\tau)\, d\tau + \theta_m \right)
where αm, ωm(t) and θm represent the amplitude, frequency and a fixed phase offset for the m-th sinusoidal component. To obtain a frame-wise representation, the parameters may be assumed to be constant over the analysis window. Thus, the discrete signal s(n) in a given frame may be approximated by

s(n) \approx \sum_{m=1}^{L} A_m \cos\left( \omega_m n + \theta_m \right)
where Am and θm represent the amplitude and the phase of each sine-wave component associated with the frequency track ωm, and L is the number of sine-wave components. In the underlying sinusoidal model, the parameters to be transmitted may include: the frequencies, the amplitudes, and the phases of the found sinusoidal components. The sinusoids are often assumed to be harmonically related at multiples of the fundamental frequency ω0(=2πf0). During voiced speech, the fundamental frequency ω0 corresponds to the speaker's pitch, but ω0 has no physical meaning during unvoiced speech. To further simplify the model, it may be assumed that the sinusoids can be classified as continuous or random-phase sinusoids. The continuous sinusoids represent voiced speech, and can be modeled using a linearly evolving phase. The random-phase sinusoids, on the other hand, represent unvoiced noise-like speech that can be modeled using a random phase.
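By way of illustration only, the following Python sketch synthesizes a single frame from a set of harmonic amplitudes and phases according to the frame-wise model above; the function name, parameter layout and example values are assumptions of this sketch and are not taken from the disclosure.

```python
import numpy as np

def synthesize_frame(amplitudes, phases, f0_hz, fs_hz=8000, frame_len=200):
    """Approximate one frame s(n) = sum_m A_m * cos(m * w0 * n + theta_m),
    with harmonics at multiples of w0 = 2*pi*f0/fs (voiced case)."""
    n = np.arange(frame_len)
    w0 = 2.0 * np.pi * f0_hz / fs_hz
    frame = np.zeros(frame_len)
    for m, (a_m, th_m) in enumerate(zip(amplitudes, phases), start=1):
        frame += a_m * np.cos(m * w0 * n + th_m)
    return frame

# Example: a 120-Hz voiced frame with five harmonics of decaying amplitude.
frame = synthesize_frame(amplitudes=[1.0, 0.6, 0.4, 0.25, 0.15],
                         phases=np.zeros(5), f0_hz=120.0)
```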
To facilitate both voice conversion and speech coding, the sinusoidal model described above can be applied to modeling the vocal tract excitation signal. The excitation signal can be obtained using the well-known linear prediction approach. In other words, the vocal tract contribution can be captured by the linear prediction analysis filter A(z) and the synthesis filter 1/A(z), while the excitation signal can be obtained by filtering the input signal x(t) using the linear prediction analysis filter A(z) as

e(t) = x(t) - \sum_{k=1}^{N} a_k\, x(t-k), \qquad A(z) = 1 - \sum_{k=1}^{N} a_k z^{-k}
where N denotes the order of the linear prediction filter. In addition to the separation into the vocal tract model and the excitation model, the overall gain or energy can be used as a separate parameter to simplify the processing of the spectral information.
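For illustration, the excitation (residual) signal may be obtained by running a speech frame through the analysis filter A(z), for example as in the following sketch; the use of scipy.signal.lfilter and the sign convention A(z) = 1 - sum a_k z^(-k) shown above are assumptions of this example.

```python
import numpy as np
from scipy.signal import lfilter

def lp_residual(x, a):
    """Filter the frame x with A(z) = 1 - sum_k a_k z^-k to obtain the excitation.

    x: input speech frame (1-D array)
    a: predictor coefficients a_1..a_N (without the leading 1)
    """
    analysis = np.concatenate(([1.0], -np.asarray(a)))   # [1, -a_1, ..., -a_N]
    return lfilter(analysis, [1.0], x)

def lp_synthesis(excitation, a):
    """Inverse operation: pass the excitation through the synthesis filter 1/A(z)."""
    analysis = np.concatenate(([1.0], -np.asarray(a)))
    return lfilter([1.0], analysis, excitation)
```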
As described above, the speech representation may include three elements: i) vocal tract contribution modeled using linear prediction, ii) overall gain/energy, and iii) normalized excitation spectrum. The third of these elements, i.e., the residual spectrum, can be further represented using the pitch, the amplitudes of the sinusoids, and voicing information. The encoder 10 may therefore estimate or otherwise extract each of these parameters at regular (e.g., 10-ms) intervals from a source speech signal (e.g., 8-kHz speech signal), in accordance with any of a number of different techniques. Examples of a number of techniques for estimating or otherwise extracting different parameters are explained in greater detail below.
The coefficients of the linear prediction filter can be estimated in a number of different manners including, for example, in accordance with the autocorrelation method and the well-known Levinson-Durbin algorithm, alone or together with a mild bandwidth expansion. This approach helps ensure that the resulting filters are always stable. Each analysis frame includes a speech segment (e.g., a 25-ms speech segment), windowed using a Hamming window. In this regard, the order of the linear prediction filter can be set to 10 for 8-kHz speech, for example. For further processing, the linear prediction coefficients may be converted into a line spectral frequency (LSF) representation. From the viewpoint of voice conversion, this representation can be very convenient since it has a close relation to formant locations and bandwidths, offers favorable properties for different types of processing, and guarantees filter stability.
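A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion and a mild bandwidth expansion follows; the window, filter order and expansion factor shown are illustrative values consistent with the description above, and the LSF conversion step is omitted.

```python
import numpy as np

def lp_coefficients(frame, order=10, bw_expansion=0.996):
    """Estimate predictor coefficients a_1..a_order (A(z) = 1 - sum_k a_k z^-k)
    by the autocorrelation method and the Levinson-Durbin recursion."""
    windowed = frame * np.hamming(len(frame))
    full = np.correlate(windowed, windowed, mode='full')
    r = full[len(windowed) - 1: len(windowed) + order]   # lags 0..order

    a = np.zeros(order + 1)          # a[1..order] hold the predictor coefficients
    err = r[0] + 1e-12               # tiny regularization for silent frames
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= (1.0 - k * k)

    # Mild bandwidth expansion: a_k <- a_k * gamma^k keeps the filter well behaved.
    return a[1:] * bw_expansion ** np.arange(1, order + 1)

# Example: coefficients for a 25-ms frame of 8-kHz speech (200 samples).
a = lp_coefficients(np.random.randn(200), order=10)
```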
One exemplary algorithm for estimating the pitch may include computing a frequency-domain metric using a sinusoidal speech model matching approach. Then, a time-domain metric measuring the similarity between successive pitch cycles can be computed for a fixed number of pitch candidates that received the best frequency-domain scores. The actual pitch estimate can be obtained using the two metrics together with a pitch tracking algorithm that considers a fixed number of potential pitch candidates for each analysis frame. As a final step, the obtained pitch estimate can be further refined using a sinusoidal speech model matching based technique to achieve better than one-sample accuracy.
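The following is a greatly simplified sketch of the two-stage idea described above (a frequency-domain harmonic-matching score followed by a time-domain pitch-cycle similarity check for the best candidates); the candidate grid, the scoring formulas and the omission of the tracking and refinement stages are assumptions of this example.

```python
import numpy as np

def estimate_pitch(frame, fs=8000, f0_min=60.0, f0_max=400.0, n_keep=5):
    """Simplified two-stage pitch estimate for one analysis frame (the frame
    should span at least two pitch periods of the lowest candidate)."""
    n_fft = 2048
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft))

    # Stage 1: frequency-domain score = mean spectral magnitude at the harmonics.
    candidates = np.arange(f0_min, f0_max, 1.0)
    scores = []
    for f0 in candidates:
        harmonics = np.arange(f0, fs / 2.0, f0)
        bins = np.clip(np.round(harmonics / (fs / n_fft)).astype(int),
                       0, len(spectrum) - 1)
        scores.append(spectrum[bins].mean())
    best = candidates[np.argsort(scores)[-n_keep:]]

    # Stage 2: time-domain score = similarity between two successive pitch cycles.
    def cycle_similarity(f0):
        period = int(round(fs / f0))
        if 2 * period > len(frame):
            return -np.inf
        a, b = frame[:period], frame[period:2 * period]
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    return float(max(best, key=cycle_similarity))
```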
Once the final refined pitch value has been estimated, the parameters related to the residual spectrum can be extracted. For these parameters, the estimation can be performed in the frequency domain after applying variable-length windowing and fast Fourier transform (FFT). The voicing information can be first derived for the residual spectrum through analysis of voicing-specific spectral properties separately at each harmonic frequency. The spectral harmonic amplitude values can then be computed from the FFT spectrum. Each FFT bin can be associated with the harmonic frequency closest to it.
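A short sketch of the bin-to-harmonic association and amplitude extraction follows; taking the peak magnitude among the bins associated with each harmonic, and the use of a Hann window, are assumptions of this example (the voicing analysis is omitted).

```python
import numpy as np

def harmonic_amplitudes(residual_frame, f0, fs=8000, n_fft=1024):
    """Associate each FFT bin with the harmonic of f0 closest to it and take,
    for each harmonic, the largest magnitude among its bins as the amplitude."""
    spectrum = np.abs(np.fft.rfft(residual_frame * np.hanning(len(residual_frame)),
                                  n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    n_harm = max(1, int(fs / 2.0 // f0))
    nearest = np.clip(np.round(freqs / f0).astype(int), 1, n_harm)  # harmonic index per bin
    amplitudes = np.zeros(n_harm)
    for h in range(1, n_harm + 1):
        members = spectrum[nearest == h]
        if members.size:
            amplitudes[h - 1] = members.max()
    return amplitudes
```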
Similar to the other parameters, the gain/energy of the source speech signal can be estimated in a number of different manners. This estimation may, for example, be performed in the time domain using the root mean square energy. Alternatively, since the frame-wise energy may significantly vary depending on how many pitch peaks are located inside the frame, the estimation may instead compute the energy of a pitch-cycle length signal.
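A trivial sketch of both alternatives (whole-frame RMS energy, or energy over a single pitch-cycle length when a pitch value is available) is shown below; the interface is an assumption of this example.

```python
import numpy as np

def frame_energy(frame, fs=8000, f0=None):
    """RMS energy of a frame; if a pitch value f0 is given, measure over one
    pitch-cycle length to reduce sensitivity to the number of pitch peaks."""
    segment = frame
    if f0 is not None:
        segment = frame[:max(1, int(round(fs / f0)))]
    return float(np.sqrt(np.mean(np.square(segment))))
```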
B. Voice Conversion Model Training and Conversion
Irrespective of exactly how the source and target speech signals are represented, conversion of a source speech signal to a target speech signal may be accomplished by the converter 13 in a number of different manners, including in accordance with a Gaussian Mixture Model (GMM) approach. Individual features/parameters may utilize different conversion functions or models, but generally, the GMM-based conversion approach has become popular, especially for vocal tract (LSF) conversion. As explained below, before conversion models may be utilized to convert respective parameters of source speech signals into corresponding parameters of target speech signals, the models are typically trained based on a sequence of feature vectors (for respective parameters) from the source and target speakers. The trained GMM-based models may then be used in the conversion phase of voice conversion in accordance with exemplary embodiments of the present invention. Thus, for example, a sequence of vocal tract (LSF) parameter/feature vectors from the source and target speakers may be utilized to train a GMM-based model from which vocal tract (LSF) parameters related to a source speech signal may be converted into corresponding vocal tract (LSF) parameters related to a target speech signal. Also, for example, a sequence of pitch parameter/feature vectors from the source and target speakers may be utilized to train a GMM-based model from which pitch parameters related to a source speech signal may be converted into corresponding pitch parameters related to a target speech signal.
1. Voice Conversion Model Training
The training of a GMM-based model may utilize aligned parametric data from the source and target voices. In this regard, alignment of the parametric data from the source and target voices may be performed in two steps. First, both the source and target speech signals may be segmented, and then a finer-level alignment may be performed within each segment. In accordance with one exemplary embodiment of the present invention, the segmentation may be performed at phoneme-level using hidden Markov models (HMMs), with the alignment utilizing dynamic time warping (DTW). Additionally or alternatively, manually labeled phoneme boundaries may be utilized if such information is available.
More particularly, the speech segmentation could be conducted using very simple techniques such as, for example, measuring spectral change without taking into account knowledge about the underlying phoneme sequence. To achieve better performance, however, information about the phonetic content may be exploited, with segmentation performed using HMM-based models. Segmentation of the source and target speech signals in accordance with one exemplary embodiment may include estimating or otherwise extracting a sequence of feature vectors from the speech signals. The extraction may be performed frame-by-frame, using similar frames as in the parameter extraction procedure described above. Assuming the phoneme sequence associated with the corresponding speech is known, a compound HMM model may be built up by sequentially concatenating the phoneme HMM models. Next, the frame-based feature vectors may be associated with the states of the compound HMM model using a Viterbi search to find the best path. By keeping track of the states, a backtracking procedure can be used to decode the maximum likelihood state sequence. The phoneme boundaries in time may then be recovered by following the transition change from one phoneme HMM to another.
As indicated above, the phoneme-level alignment obtained using the procedure above may be further refined by performing frame-level alignment using DTW. In this regard, DTW is a dynamic programming technique that can be used for finding the best alignment between two acoustic patterns. This may be considered functionally equivalent to finding the best path in a grid to map the acoustic features of one pattern to those of the other pattern. Finding the best path requires solving a minimization problem, minimizing the dissimilarity between the two speech patterns. In one exemplary embodiment, DTW may be applied on Bark-scaled LSF vectors, with the algorithm being constrained to operate within one phoneme segment at a time. In this exemplary embodiment, non-simultaneous silent segments may be disregarded.
Let x=[x1, x2, . . . xn] represent a sequence of feature vectors characterizing n frames of speech content produced by the source speaker, and let y=[y1, y2, . . . ym] represent a sequence of feature vectors characterizing m frames of the same speech content produced by the target speaker. The DTW algorithm may then result in a combination of aligned source and target vector sequences z=[z1, z2, . . . zw], where zk=[xpT yqT]T and (xp, yq) represents aligned vectors for frames p and q, respectively. The combination vector sequence z may then be used to train a conversion model (e.g., a GMM-based model).
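A minimal DTW sketch illustrating the frame-level alignment and the formation of the joint vectors z_k is given below; the Euclidean distance measure and the absence of the per-phoneme constraint and silence handling are simplifications of this example.

```python
import numpy as np

def dtw_align(x, y):
    """Align two feature-vector sequences x (n x d) and y (m x d) with DTW and
    return the joint vectors z_k = [x_p, y_q] along the best path."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)  # pairwise distances
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j],      # insertion
                                                  cost[i, j - 1],      # deletion
                                                  cost[i - 1, j - 1])  # match
    # Backtrack from (n, m) to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return np.array([np.concatenate([x[p], y[q]]) for p, q in path])
```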
Generally, a GMM allows the probability distribution of z to be written as the sum of L multivariate Gaussian components (classes), where its probability density function (pdf) may be written as follows:

p(z) = \sum_{l=1}^{L} \alpha_l\, N(z; \mu_l, \Sigma_l)
where αl represents the prior probability of z for the component l. Also in the preceding, N(z; μl, Σl) represents the Gaussian distribution with the mean vector μl and covariance matrix Σl. GMM-based conversion models may therefore be trained by estimating the parameters (α, μ, Σ) to thereby model the distribution of x (the source speaker's spectral space), such as in accordance with any of a number of different techniques. In various exemplary embodiments of the present invention, the GMM-based conversion model may be trained iteratively through the well-known Expectation Maximization (EM) algorithm or a K-means type of training algorithm.
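As one concrete possibility (an illustrative choice, not a method prescribed by the disclosure), the EM training of the joint model can be delegated to scikit-learn's GaussianMixture fitted on the aligned joint vectors z; the component count and covariance type below are assumptions of this sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_joint_gmm(z, n_components=8, seed=0):
    """Fit a full-covariance GMM to the aligned joint vectors z (shape w x 2d)
    using EM; weights_, means_ and covariances_ then correspond to the
    (alpha_l, mu_l, Sigma_l) parameters of the model."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='full',
                          max_iter=200, random_state=seed)
    gmm.fit(np.asarray(z))
    return gmm
```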
Conventionally, training a conversion model may be accomplished on aligned feature vectors x, y from the source and target speakers. If the training parametric data is noisy, however, the model accuracy may degrade. Before training the GMM-based conversion model, then, exemplary embodiments of the present invention may select for training only those parts of speech where speech content dominates the noise. For simplicity and without loss of generality, presume the case of training data affected by stationary noise (i.e., the noise distribution does not change in time). Consider estimation of the statistics of the frame-wise energy parameter over the sequence of training parametric data. As shown in the accompanying drawings, the resulting distribution of frame energies typically separates into a lower-energy component associated with non-speech or noisy-speech frames and a higher-energy component associated with speech-dominated frames.
As indicated above, exemplary embodiments of the present invention may include estimating or otherwise extracting information related to the energies E (e.g., energy parameters) of frames of the training source and target speech signals, and as such, each frame of source and target speech content may be associated with information related to its energy. As also indicated above, each frame (at a time t) of speech content for the source speaker and target speaker may be characterized by or otherwise associated with a respective feature vector xt and yt, respectively. Accordingly, it may also be the case that each feature vector xt is also associated with information related to the energy Ext of a respective frame (at a time t) of speech content for the source speaker. Similarly, it may be the case that each feature vector yt is also associated with information related to the energy Eyt of a respective frame (at a time t) of speech content for the target speaker. As explained herein, the energy of a frame of speech content for the source speaker or target speaker, Ext or Eyt, may be generically referred to as energy E.
In accordance with exemplary embodiments of the present invention, a threshold energy value Etr may be calculated and compared to the energies of the frames of the source and target speech signals Ext and Eyt, respectively. In this regard, the threshold energy value Etr may be calculated in any of a number of different manners. For example, the threshold energy value Etr may be empirically determined as roughly the smallest energy of perceived and understandable speech, and may be some fraction of the highest level of noisy energy in non-speech frames. As a consequence, the energy E<Etr may indicate the frame is more likely to be non-speech than speech, and vice versa when E≥Etr. In this regard, the threshold energy value Etr may be considered a linear discriminator between the non-speech/noisy-speech pdf (lower SNR frames, resembling a decreasing exponential) and the speech pdf (higher SNR frames).
More particularly, for example, the threshold energy value Etr may be calculated by first considering an overlap in the distributions of speech versus non-speech energies for a converted training sequence x, where a threshold ECmax may be empirically found from the region of overlap between the two distributions.
Along with selecting the threshold ECmax, a value wESmax may be found or otherwise selected. The value wESmax may be selected in a number of different manners, including based upon a primitive VAD developed as an optimally sized windowed energy. The optimality of the window size lies in that it may enable an optimal separation between the pdfs of speech and non-speech windowed energy. The value wESmax may be empirically found in a similar manner.
The threshold energy value Etr may then be derived from the empirically found values ECmax and wESmax.
By comparing the threshold energy value Etr to the energies of the frames of the source and target speech signals Ext and Eyt, respectively, exemplary embodiments of the present invention may identify one or more frames more likely associated with non-speech frames (e.g., E<Etr, identified by VAD as non-speech, etc.), and thereby identify one or more associated frame feature vectors (x, y) more likely to negatively impact the trained GMM-based conversion model. These identified feature vectors may then be withheld from inclusion in the training procedure to thereby facilitate generation of a trained conversion model less affected by noise. The respective feature vectors (x, y) may be withheld from inclusion in the training procedure at any of a number of different points during the model training. In one embodiment, for example, the respective feature vectors (x, y) may be withheld from inclusion in the training procedure during formation of the vector sequence z for training the GMM-based model. Thus, in accordance with exemplary embodiments of the present invention, a noise-reduced vector sequence z′ for training the GMM-based model may be formed to only include vectors zk=[xpT yqT]T with aligned source and target vectors (xp, yq) having associated energies Exp and Eyq greater than or equal to (i.e., ≥) the threshold energy value Etr. This noise-reduced vector sequence z′ may be formed in a number of different manners, such as by selecting the respective vectors zk from the original vector sequence z. Alternatively, the vector sequence z′ may be formed by removing, from the original vector sequence z, vectors zk=[xpT yqT]T with aligned source and target vectors (xp, yq) having associated energies Exp and Eyq less than (i.e., <) the threshold energy value Etr. Although the above description included, in the noise-reduced vector sequence z′, aligned source and target vectors (xp, yq) having associated energies equal to the threshold energy value, the noise-reduced vector sequence z′ may alternatively withhold these vectors along with those having associated energies less than the threshold energy value, if so desired.
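The frame selection just described can be expressed compactly; in the sketch below the arrays of per-frame energies and the threshold are assumed to be available from the preceding steps, and the interface is an assumption of this example.

```python
import numpy as np

def noise_reduced_training_set(z, e_x, e_y, e_tr):
    """Keep only joint vectors z_k whose aligned source and target frames both
    have energy at or above the threshold E_tr.

    z:   aligned joint vectors, shape (w, 2d)
    e_x: energies E_xp of the aligned source frames, shape (w,)
    e_y: energies E_yq of the aligned target frames, shape (w,)
    """
    keep = (np.asarray(e_x) >= e_tr) & (np.asarray(e_y) >= e_tr)
    return np.asarray(z)[keep]
```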
2. Voice Conversion
After training a GMM-based model for each of one or more parameters representing speech content, the trained GMM-based model may be utilized to convert the respective parameter related to a source speech signal (e.g., source encoding parameter) produced by the source speaker into a corresponding parameter related to a target speech signal as produced by the target speaker (e.g., target encoding parameter). As indicated above, for example, one trained GMM-based model may be utilized to convert vocal tract (LSF) parameters related to a source speech signal into corresponding vocal tract (LSF) parameters related to a target speech signal. As also indicated above, for example, another trained GMM-based model may be utilized to convert pitch parameters related to a source speech signal into corresponding pitch parameters related to a target speech signal.
For a particular speech parameter, the conversion of the speech parameter may follow a scheme in which the respective trained GMM model parameterizes a linear function that minimizes the mean squared error (MSE) between the converted source and target vectors. In this regard, the conversion function may be implemented as follows:

F(x_t) = \sum_{i=1}^{L} p_i(x_t) \left[ \mu_i^{y} + \Sigma_i^{yx} \left( \Sigma_i^{xx} \right)^{-1} \left( x_t - \mu_i^{x} \right) \right]    (6)

where x_t is the source feature vector at time t and p_i(x_t) = \alpha_i N(x_t; \mu_i^{x}, \Sigma_i^{xx}) / \sum_{j=1}^{L} \alpha_j N(x_t; \mu_j^{x}, \Sigma_j^{xx}) is the posterior probability of the i-th Gaussian component given x_t.
The covariance matrix Σi may be formed as follows:

\Sigma_i = \begin{bmatrix} \Sigma_i^{xx} & \Sigma_i^{xy} \\ \Sigma_i^{yx} & \Sigma_i^{yy} \end{bmatrix}

where the superscripts denote the source (x) and target (y) blocks, and

\mu_i = \begin{bmatrix} \mu_i^{x} \\ \mu_i^{y} \end{bmatrix}

represents the mean vector of the i-th Gaussian mixture of the GMM.
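A sketch of Equation (6) applied to a single source vector, using a joint GMM such as the one from the earlier training sketch, is given below; the block layout of the means and covariances (source part first, target part second) follows the joint-vector convention zk=[xpT yqT]T used above, and the interface is otherwise an assumption of this example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def convert(gmm, x, d):
    """Apply the joint-GMM conversion function of Equation (6) to one source vector.

    gmm: GaussianMixture fitted on joint vectors [x; y]
    d:   dimensionality of the source part x
    """
    x = np.asarray(x)
    weights, means, covs = gmm.weights_, gmm.means_, gmm.covariances_
    # Posterior p_i(x) computed from the source marginals of each component.
    post = np.array([w * multivariate_normal.pdf(x, mean=m[:d], cov=c[:d, :d])
                     for w, m, c in zip(weights, means, covs)])
    post /= post.sum() + 1e-300
    y = np.zeros(len(means[0]) - d)
    for p_i, m, c in zip(post, means, covs):
        mu_x, mu_y = m[:d], m[d:]
        cov_xx, cov_yx = c[:d, :d], c[d:, :d]
        y += p_i * (mu_y + cov_yx @ np.linalg.solve(cov_xx, x - mu_x))
    return y

# Usage: y_t = convert(gmm, x_t, d=x_t.shape[0])
```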
In one particular instance, conversion of LSF vectors may be performed using an extended vector that also includes the derivative of the LSF vector so as to take some dynamic context information into account, although the derivative may be removed after conversion (retaining the true LSF part). This combined feature vector may be transformed through GMM modeling using Equation (6). The conversion may also utilize several modes, each containing its own GMM model with one or more (e.g., 8) mixtures. In this regard, the modes may be achieved by clustering the LSF data in a data-driven manner.
In another particular instance, conversion of the pitch parameter (pitch vectors) may be performed through an associated GMM-based model in the frequency domain using Equation (6) where, during unvoiced parts, the "pitch" may be left unchanged. A multiple-mixture (e.g., 8-mixture) GMM-based model used for pitch conversion may be trained on aligned data, with a requirement of matched voicing between the source and the target data. After conversion of the pitch parameter, the residual amplitude spectrum may be processed accordingly, as the length of the amplitude spectrum vector may depend on the pitch value at the corresponding time instant. Thus, the residual spectrum, although essentially unchanged, may be re-sampled to fit the dimension dictated by the converted pitch at that time.
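A minimal sketch of the re-sampling step (interpolating the harmonic amplitude vector to the harmonic count implied by the converted pitch) follows; linear interpolation via numpy.interp is an illustrative choice of this example, not a method prescribed by the disclosure.

```python
import numpy as np

def resample_residual_spectrum(amplitudes, n_harmonics_new):
    """Re-sample the harmonic amplitude vector so that its length matches the
    number of harmonics dictated by the converted pitch."""
    old = np.asarray(amplitudes, dtype=float)
    if len(old) == n_harmonics_new:
        return old
    old_axis = np.linspace(0.0, 1.0, num=len(old))
    new_axis = np.linspace(0.0, 1.0, num=n_harmonics_new)
    return np.interp(new_axis, old_axis, old)
```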
C. Re-Synthesis
As described above, the speech representation may include three elements: i) vocal tract contribution modeled using linear prediction, ii) overall gain/energy, and iii) normalized excitation spectrum (represented using the pitch, the amplitudes of the sinusoids, and voicing information). After conversion, one or more desired features/parameters of the source speech signal that have been converted into corresponding features/parameters of the target speech signal, and any remaining features/parameters of the source speech signal not otherwise converted may collectively form features/parameters of the target speech signal. Thus, after conversion, the features/parameters of the target speech signal may be re-synthesized into a target speech signal. In this regard, the features/parameters of the target speech signal may be re-synthesized into the target speech signal in any of a number of different known manners, such as in a known pitch-synchronous manner.
Conventional voice conversion techniques either treat the two classes of utterance content (speech and non-speech) as distinct with different models for conversion, which may generate disturbing artifacts at the speech and non-speech boundary (considering, particularly, that VAD is typically not error-free); or treat all utterance content as one class and transform speech and non-speech frames using the same conversion functions. In the latter case, however, non-speech frames may amplify the input noise or simply become noisy as a consequence of the conversion. Thus, after converting the features/parameters of the source speech signal into the features/parameters of the target speech signal, and before re-synthesis of the target speech signal therefrom, the converter 13 or decoder 12 (or another apparatus therebetween) of exemplary embodiments of the present invention may apply a power function to the energies of frames identified as non-speech (e.g., frames having an energy less than the threshold energy value Etr), thereby suppressing noise in the non-speech frames of the converted speech.
The power function may be represented on a frame-wise basis (for each time t) in any of a number of different manners. For a target energy feature/parameter that has been converted from a corresponding source energy/parameter, for example, the power function Conv may be represented as follows:

Conv(E_t) = F(E_t) \left( E_t / E_{tr} \right)^{\gamma} \ \text{for}\ E_t < E_{tr}, \qquad Conv(E_t) = F(E_t) \ \text{otherwise}

In the preceding, F represents the conventional energy transformation function (see Equation (6)), and γ represents a degree of suppression. The degree of suppression may be calculated or otherwise set to any of a number of different values.
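A sketch of one possible frame-wise application of such a power function is shown below; the exact functional form (attenuation by (E_t/E_tr)^γ for frames below the threshold) is an assumption of this example, consistent with the exemplary representation above.

```python
def apply_power_function(converted_energy, frame_energy, e_tr, gamma=2.0):
    """One possible power function: frames at or above the threshold keep their
    converted energy F(E_t); frames below it are attenuated, with gamma
    controlling the degree of suppression (assumed form, for illustration)."""
    if frame_energy >= e_tr:
        return converted_energy
    return converted_energy * (frame_energy / e_tr) ** gamma
```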
Up to this point, it has been assumed that the model of the noise does not change over time (i.e., that the noise is stationary). In reality, however, this may not be the case. Thus, in accordance with a further aspect of exemplary embodiments of the present invention, the component applying the aforementioned power function (i.e., converter 13, decoder 12 or other apparatus therebetween) may at least partially preserve the time-variant attributes of noise using an online mechanism to build and update local speech and non-speech models. The models of non-speech and speech segments can be iteratively updated in a local history window and, thus, the threshold energy value Etr that delineates them can be updated online in an adaptive manner. In addition or in the alternative, a windowed energy that includes the average energy across a certain number of frames (windows) can also be used as an adaptive factor. Further, an implementation could additionally or alternatively take advantage of a number of other techniques, such as soft VAD or the like, to detect speech and non-speech frames and help build the energy statistics. The threshold energy value Etr may, for example, be determined from local history models of speech versus non-speech energies by any one of the following approaches: (a) a determination of a weighted ratio, such as 20%, between the speech and non-speech energies; (b) a determination based upon the mean and variance of the distributions of speech versus non-speech energies; (c) a determination of a weighted percentile of a distribution of speech energies and/or a distribution of non-speech energies; or (d) a determination of a rank order value in the speech versus non-speech energies (e.g., the fifth smallest speech energy). In any of these approaches, Etr should be sufficiently low so as to not harm speech integrity and sufficiently high to ensure non-speech suppression, thereby serving as a tradeoff between these two competing concerns. Alternatively, such a weighted ratio may serve only for initialization until sufficient statistics are collected about "speech" and "noise" to compute a delineator. Even in this case, however, sudden changes in noise may require special treatment. It may therefore be better in these cases to update the threshold energy value Etr to, e.g., a weighted mean of the local noise energy, with increasing weights for recent frames, until the collected statistics become sufficient to compute the speech/noise delineator.
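A sketch of one way to maintain local speech and non-speech energy models over a history window and to adapt Etr online is given below; the window length, the 20% weighted-ratio update and the classification of frames by the current threshold are illustrative assumptions of this example (a soft VAD could equally drive the classification).

```python
import collections
import numpy as np

class AdaptiveEnergyThreshold:
    """Maintain local speech/non-speech energy histories and adapt E_tr online."""

    def __init__(self, init_threshold, history=200, weight=0.2):
        self.e_tr = init_threshold
        self.weight = weight                       # weighted ratio, e.g. 20%
        self.speech = collections.deque(maxlen=history)
        self.non_speech = collections.deque(maxlen=history)

    def update(self, frame_energy):
        """Classify the frame with the current threshold, update the local
        models, and re-derive the threshold as a weighted point between the
        mean non-speech and mean speech energies."""
        if frame_energy >= self.e_tr:
            self.speech.append(frame_energy)
        else:
            self.non_speech.append(frame_energy)
        if self.speech and self.non_speech:
            mean_speech = np.mean(self.speech)
            mean_noise = np.mean(self.non_speech)
            self.e_tr = mean_noise + self.weight * (mean_speech - mean_noise)
        return self.e_tr
```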
Referring now to the flowchart of the accompanying drawings, a method of voice conversion according to one exemplary embodiment of the present invention may include training a voice conversion model based upon information characterizing frames of training source and target speech signals, where frames having an energy less than a threshold energy value (e.g., Etr) may be withheld from the training, as explained above.
After training the voice conversion model, the model (shown at block 65) may be utilized in the conversion of source speech signals into target speech signals. In this regard, the method may further include receiving, into the trained voice conversion model, information characterizing each of a plurality of frames of a source speech signal (e.g., source encoding parameters), as shown in blocks 64 and 65. Then, as shown in block 66, at least some of the information characterizing each of the frames of the source speech signal may be converted into corresponding information characterizing each of a plurality of frames of a target speech signal (e.g., target encoding parameters) based upon the trained voice conversion model.
The information characterizing each frame of the target speech signal may include an energy (e.g., Eit) of the respective frame (at time t). The method may therefore further include reducing the energy of one or more frames of the target speech signal that have an energy less than the threshold energy value (e.g., Eit<Etr), as shown in block 67. The information characterizing the frames of the target speech signal (e.g., target encoding parameters) including the reduced energy may be configured for synthesizing the target speech signal. The target speech signal may then be synthesized or otherwise decoded from the information characterizing the frames of the target speech signal, including the converted information characterizing the respective frames, as shown in block 68.
Further, to account for a variable noise model, the method may include building models of speech frames and non-speech frames based upon the received information characterizing the source speech signal (e.g., source encoding parameters), as shown in block 69. The threshold energy value (e.g., Etr) may then be adapted based upon the models, the threshold energy value representing a delineation between the speech frames and the non-speech frames, as shown at block 70. The adapted threshold energy value may then be utilized as above, such as to determine the frames of the target speech signal for energy reduction (see block 67). It is noted that the foregoing discussion relates to one exemplary method, and that one or more of the respective operations may be performed in manners or orders other than those described herein.
According to one aspect of the present invention, the functions performed by one or more of the entities or components of the framework, such as the encoder 10, decoder 12 and/or converter 13, may be performed by various means, such as hardware and/or firmware (e.g., processor, application specific integrated circuit (ASIC), etc.), alone and/or under control of one or more computer program products, which may be stored in a non-volatile and/or volatile storage medium. The computer program product for performing one or more functions of exemplary embodiments of the present invention includes a computer-readable storage medium, such as the non-volatile storage medium, and software including computer-readable program code portions, such as a series of computer instructions, embodied in the computer-readable storage medium.
In this regard, it will be understood that each block or step of the flowchart, and combinations of blocks or steps in the flowchart, can be implemented by computer program instructions. These computer program instructions may be loaded onto a computer or other programmable apparatus to produce a machine, such that the instructions which execute on the computer or other programmable apparatus create means for implementing the functions specified in the flowchart block(s) or step(s).
Accordingly, blocks or steps of the flowchart support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that one or more blocks or steps of the flowchart, and combinations of blocks or steps in the flowchart, can be implemented by special purpose hardware-based computer systems which perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.
Many modifications and other embodiments of the invention will come to mind to one skilled in the art to which this invention pertains having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific exemplary embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.