The present invention relates to the fields of speech synthesis and speech processing.
Pitch modification is an important processing component of expressive Text-To-Speech (TTS) synthesis and voice transformation. The pitch modification task may generally appear either in the context of TTS synthesis or in the context of natural speech processing, e.g. for entertainment applications, voice disguisement applications, etc.
Applications such as affective Human Computer Interface (HCI), emotional conversational agents and entertainment demand an extreme pitch modification capability which preserves speech naturalness. However, it is widely acknowledged that pitch modification and synthesized speech naturalness are contradictory requirements.
Pitch modification may be performed, for example, over a non-parameterized speech waveform using Pitch-Synchronous Overlap and Add (PSOLA) method or by using a parametric speech representation. Regardless of the method used, significant raising or lowering of the original tone of speech segments may significantly deteriorate the perceived naturalness of the modified speech signal.
The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent to those of skill in the art upon a reading of the specification and a study of the figures.
The following embodiments and aspects thereof are described and illustrated in conjunction with systems, tools and methods which are meant to be exemplary and illustrative, not limiting in scope.
There is provided, in accordance with an embodiment, a method comprising: receiving an utterance, an original pitch contour of the utterance, and a target pitch contour for the utterance, wherein the utterance comprises a plurality of consecutive frames, and wherein at least one of said frames is a voiced frame; calculating an original intensity contour of said utterance; generating a pitch-modified utterance based on the target pitch contour; calculating an intensity modification factor for each of said frames, based on said original pitch contour and on said target pitch contour, to produce a sequence of intensity modification factors corresponding to said plurality of consecutive frames; calculating a final intensity contour for said utterance by applying said intensity modification factors to said original intensity contour; and generating a coherently-modified speech signal by time-dependent scaling of the intensity of said pitch-modified utterance according to said final intensity contour.
There is provided, in accordance with another embodiment, a computer program product comprising a non-transitory computer-readable storage medium having program code embodied therewith, the program code executable by at least one hardware processor to: receive an utterance, an original pitch contour of the utterance, and a target pitch contour for the utterance, wherein the utterance comprises a plurality of consecutive frames, and wherein at least one of said frames is a voiced frame; calculate an original intensity contour of said utterance; generate a pitch-modified utterance based on the target pitch contour; calculate an intensity modification factor for each of said frames, based on said original pitch contour and on said target pitch contour, to produce a sequence of intensity modification factors corresponding to said plurality of said consecutive frames; calculate a final intensity contour for said utterance by applying said intensity modification factors to said original intensity contour; and generate a coherently-modified speech signal by time-dependent scaling of the intensity of said pitch-modified utterance according to said final intensity contour.
There is provided, in accordance with a further embodiment, a system comprising: (i) a non-transitory storage device having stored thereon instructions for: receiving an utterance, an original pitch contour of the utterance, and a target pitch contour for the utterance, wherein the utterance comprises a plurality of consecutive frames, and wherein at least one of said frames is a voiced frame, calculating the original intensity contour of said utterance, generating a pitch-modified utterance based on the target pitch contour, calculating an intensity modification factor for each of said frames, based on said original pitch contour and on said target pitch contour, to produce a sequence of intensity modification factors corresponding to said plurality of said consecutive frames, calculating a final intensity contour for said utterance by applying said intensity modification factors to said original intensity contour, and generating a coherently-modified speech signal by time-dependent scaling of the intensity of said pitch-modified utterance according to said final intensity contour; and (ii) at least one hardware processor configured to execute said instructions.
In some embodiments, the received utterance is natural speech, and the method further comprises mapping each of said frames to a corresponding speech class selected from a predefined set of speech classes.
In some embodiments, the calculating of the intensity modification factor for each of said frames is based on a pitch-to-intensity transformation modeling the relationship between the instantaneous pitch frequency and the instantaneous intensity of the utterance, and the pitch-to-intensity transformation is represented as a function of a pitch frequency and a set of control parameters.
In some embodiments, each of said frames is mapped to a corresponding speech class selected from a predefined set of speech classes, and the method further comprises setting the values of said control parameters for each of said frames according to its corresponding speech class.
In some embodiments, the method further comprises offline modeling of the pitch-to-intensity relationship to receive said values for said control parameters according to said speech classes.
In some embodiments, the method further comprises setting said control parameters to constant predefined values.
In some embodiments, the pitch-to-intensity transformation is based on log-linear regression, and the set of control parameters comprises the slope coefficient of the regression line of the log-linear regression.
In some embodiments, the intensity modification factor is ten to the power of one twentieth of the product of an average empirical decibels-per-octave ratio and the extent of pitch modification expressed in octaves.
In some embodiments, the value of the ratio of empirical decibels per octave is set to six decibels per octave.
In some embodiments, the calculating of the intensity modification factor for each of said frames comprises: calculating a reference value of the intensity corresponding to an original pitch frequency of the original pitch contour for said each frame, by applying the pitch-to-intensity transformation to the original pitch frequency; calculating a reference value of the intensity corresponding to the target pitch frequency of the target pitch contour for each of said frames by applying the pitch-to-intensity transformation to the target pitch frequency; and dividing the reference value of the intensity corresponding to the target pitch frequency by the reference value of the intensity corresponding to the original pitch frequency.
In some embodiments, the received utterance is natural speech, and the program code is further executable by said at least one hardware processor to map each of said frames to a corresponding speech class selected from a predefined set of speech classes.
In some embodiments, each of said frames is mapped to a corresponding speech class selected from a predefined set of speech classes, and the program code is further executable by said at least one hardware processor to set the values of said control parameters for each of said frames according to its corresponding speech class.
In some embodiments, the program code is further executable by said at least one hardware processor to offline model the pitch to intensity relationship to receive said values for said control parameters according to said speech classes.
In some embodiments, the system further comprises a database, wherein: each of said frames is mapped to a corresponding speech class selected from a predefined set of speech classes, the calculating of the intensity modification factor for each of said frames is based on a pitch-to-intensity transformation represented as a function of a pitch frequency and a set of control parameters, and the database comprises values of said control parameters per speech class of said set of speech classes, and wherein said storage device has further stored thereon instructions for setting values for said control parameters for each of said frames according to its corresponding speech class, wherein the values are fetched from said database.
In addition to the exemplary aspects and embodiments described above, further aspects and embodiments will become apparent by reference to the figures and by study of the following detailed description.
Exemplary embodiments are illustrated in referenced figures. Dimensions of components and features shown in the figures are generally chosen for convenience and clarity of presentation and are not necessarily shown to scale. The figures are listed below.
Disclosed herein is a coherent modification of pitch contour and intensity contour of a speech signal. The disclosed pitch contour and intensity contour modification may enhance pitch-modified speech signals by improving their naturalness. The disclosed enhancement may be applicable to virtually any type of pitch modification technique.
Some prior works, unrelated to pitch modification, reported experimental evidence of a salient positive correlation between instantaneous fundamental frequency and instantaneous loudness of speech. For example, in P. Gramming, et al, “Relationship between changes in voice pitch and loudness”, Journal of Voice, Vol. 2, Issue 2, pp. 118-126, Elsevier, 1988, that correlation phenomenon was observed in professional singers, healthy non-singers and people suffering from voice disorders. In A. Rosenberg and J. Hirschberg, “On the correlation between energy and pitch accent in read English speech”, In Proc. Interspeech 2006, Pittsburgh, Pa., USA, September 2006, a method was proposed for pitch accent prediction based on the pitch-energy correlation.
Advantageously, the disclosed pitch contour and intensity contour modification may harness that correlation phenomenon to improve the sound naturalness of a speech signal otherwise damaged by pitch modification. A statistically-proven observation of a positive correlation between instantaneous pitch frequency and instantaneous intensity (also “loudness”) of a speech signal is herein provided, which is the product of experimentation performed by the inventors. A statistical exemplification of this pitch-intensity relation is shown in
The disclosed pitch contour and intensity contour modification formulates this proven pitch-intensity interrelation and manipulates it to improve the sound naturalness of a pitch-modified speech signal. More specifically, the intensity of a speech segment (represented, for example, as a frame of the speech signal) may be modified in agreement with the pitch modification within the segment. Such coherent modification of the pitch and intensity may significantly reduce the naturalness loss following pitch modification.
The term “utterance”, as referred to herein, may relate to non-synthesized utterance (i.e., natural speech produced by a living being such as humans) and/or synthesized utterance (i.e., artificially-produced speech, e.g., as a result of TTS synthesis).
The utterance may be represented as a digitized speech signal (hereinafter also referred to as ‘raw speech signal’ or simply ‘speech signal’) and/or as a parameterized signal, as discussed below.
The speech signal may be mathematically represented as a function s(n) of discrete time instant n=0, 1, 2, . . . , N corresponding to time moments tn=n·τ, n=0, 1, 2, . . . , N where τ is the time sampling interval, e.g. τ= 1/22050 seconds (s) for a 22050 Hertz (Hz) sampling rate.
For the purpose of speech processing, the time axis may be divided into frames centered at equidistant time offsets k·Δ, where k=0, 1, 2, . . . , K is the frame index and Δ is the frame size. For example, frame size Δ=5 milliseconds (ms) may be used for speech signals sampled at 22050 Hz.
Hereinafter various features of the speech signal calculated at frame level may be considered, such as pitch frequency (or simply ‘pitch’), intensity, line spectrum and spectral envelope. The calculation at frame level may mean that a feature for frame k is derived from a portion of the signal enclosed within a short time window, such as, for example, 5 ms, 10 ms or 20 ms, surrounding the frame center tk=k·Δ. These frame level values may be also referred to as instantaneous values at the centers of the respective frames.
The frame centers may be expressed in discrete time instants nk, k=0, 1, . . . , K. To this end, the frame center moments tk=k·Δ expressed in seconds may be divided by the sampling interval τ and rounded to the nearest integer values. For example, if the frame size is 5 ms and the sampling rate is 22050 Hz, then the frame centers fall at multiples of 110.25 samples, so that the frame centers in discrete time instants are: n0=0, n1=110, n2=220 (rounding the tie at 220.5 down), n3=331 etc.
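By way of a non-limiting illustration, the conversion of frame centers to discrete time instants may be sketched as follows. The function name `frame_centers` and its parameters are hypothetical and are introduced only for this example. Exact rational arithmetic is used so that the rounding is not affected by floating-point error, and exact half-sample ties are resolved to the even integer, which is merely one possible convention.

```python
from fractions import Fraction


def frame_centers(num_frames, frame_size_ms=5, sampling_rate_hz=22050):
    """Return the frame centers n_k (in samples) for k = 0 .. num_frames - 1."""
    centers = []
    for k in range(num_frames):
        # Exact value of t_k / tau = k * frame_size * sampling_rate
        exact = Fraction(k * frame_size_ms * sampling_rate_hz, 1000)
        # round() on a Fraction returns the nearest int; exact .5 ties go to
        # the even integer (an assumption of this sketch, not of the text)
        centers.append(round(exact))
    return centers


print(frame_centers(4))  # [0, 110, 220, 331]
```

With a 5 ms frame size at 22050 Hz the exact quotients are 0, 110.25, 220.5 and 330.75, which is why the fourth frame center is 331 rather than 330.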
The term “pitch contour”, as referred to herein, may relate to the sequence of fundamental frequency (or pitch frequency) values associated with the respective frame centers and may be denoted: {F0k, k=0, 1, . . . , K}. The value F0k=F0(nk) may represent the instantaneous tone level of speech at the time moment k·Δ. The pitch frequency may be set to zero for unvoiced frames, i.e. frames which represent aperiodic parts of the utterance where pitch is undefined.
Speech signal intensity may be a measure of the loudness. The instantaneous intensity may be estimated as the square root of the signal energy. The signal energy may be measured as the sum of squared values of the signal components within a short window surrounding the time moment of interest. Equivalently, the signal energy may be measured as the sum of the squared magnitudes of the Short-Time Fourier Transform (STFT) of the signal. The sequence {I(nk), k=0, 1, . . . , K} of the instantaneous intensity values associated with the frame centers may form what is referred to herein as the “intensity contour”.
As an alternative to representing the utterance as a speech signal, it may be represented parametrically as a sequence {Pk, k=0, 1, . . . , K} of frame-wise sets of vocoder parameters, wherein a set Pk may be associated with the center nk of k-th frame, Pk=P(nk). A speech signal corresponding to the utterance may be then reconstructed from the parametric representation (i.e., the parameterized signal):
$$\Omega: \{P(n_k),\ k=0,1,\ldots,K\} \rightarrow \{s(n),\ n=0,1,\ldots,N\}$$
The contents of the vocoder parameter set Pk and the reconstruction algorithm Ω may depend on the type of the parameterization (i.e., the vocoding technique) employed. The vocoder parameter set may include spectral envelope and excitation components or Sinusoidal Model parameters including harmonic and noise components. The pitch frequency is generally included in the vocoder parameter set.
Frames of the speech signal may be mapped to distinct speech classes. The frame class identity labels may be hereinafter referred to as frame classification information. A speech class may correspond to frames which represent certain phonetic-linguistic context. For example, the frame may belong to a certain phonetic unit which is a part of a sentence subject and is preceded by a consonant and followed by a vowel. Frames associated with the same class may be expected to have similar acoustic properties. It is known that speech manipulation techniques using class dependent transformations of frames may perform better than the ones employing global class independent transformations.
The pitch-intensity relationship may be analyzed per speech class using a TTS voice dataset built by techniques known in the art. The analysis presented below was performed using a TTS voice dataset built from expressive sports news sentences uttered by a female American English speaker and recorded at a 22050 Hz sampling rate.
As a starting phase of the voice dataset building procedure, the speech signals were analyzed at the frame update rate of 5 ms (i.e., frame size=5 ms). The analysis included pitch contour estimation using an algorithm similar to the one disclosed in A. Sorin et al., “The ETSI Extended Distributed Speech Recognition (DSR) standards: client side processing and tonal language recognition evaluation”, In Proc. ICASSP 2004, Montreal, Quebec, Canada, May 2004. The analysis also included estimation of pitch harmonic magnitudes, also known as line spectrum, using the method presented in D. Chazan et al, “High quality sinusoidal modeling of wideband speech for the purpose of speech synthesis and modification”, In Proc. ICASSP 2006, Toulouse, France, May 2006. The line spectrum estimation for each frame was performed using a Hamming windowing function, 2.5 pitch periods long, centered at the frame center. Then a vector of Mel-frequency Regularized Cepstral Coefficients (MRCC) spectral envelope parameters (see S. Shechtman and A. Sorin, “Sinusoidal model parameterization for Hidden Markov Model (HMM)-based TTS system”, in Proc. Interspeech 2010, Makuhari, Japan, September 2010) was calculated for each frame. It should be noted that the specific pitch and line spectrum estimators and the MRCC spectral parameters may be substituted by other estimators and spectral parameters known in the art, respectively.
The frames represented by the MRCC vectors were used in a standard HMM-based phonetic alignment and segmentation procedure with three HMM states per phoneme. Hence, each segment may represent one third of a phoneme. The speech classes generation and segment classification were performed using a standard binary decision tree approach depending on phonetic-linguistic context and the MRCC vectors homogeneity. This process yielded about 5000 speech classes.
Only fully-voiced segments, i.e. segments consisting entirely of voiced frames (frames which represent periodic parts of the utterance), were used for the pitch-intensity relationship analysis. The classes containing fewer than 5 frames found in fully-voiced segments were excluded from the analysis. This pruning procedure retained about 1900 classes containing in total more than 1.6 million frames, which amounts to more than 8000 seconds of purely voiced speech material.
The energy was estimated for each frame as:
$$E = \sum_{i=1}^{N_h} A_i^2 \qquad (1)$$
where Ai is the magnitude of the i-th harmonic and Nh is the number of harmonics in the full frequency band (up to the Nyquist frequency of 11025 Hz) for that frame excluding the direct current component, i.e. excluding the harmonic associated with zero frequency. It should be noted that the line spectrum estimation algorithm (see D. Chazan et al, id) yields the harmonic magnitudes scaled in such a way that:
where $\tilde{s}(n)$, n=1, 2, . . . , T is a representative pitch cycle associated with the frame center, and T is the rounded pitch period in samples corresponding to the F0 value associated with the frame. Thus the energy given by (1) is proportional to the average per-sample energy calculated over the representative pitch cycle derived from the speech signal around the frame center. Estimation of the representative pitch cycle is addressed herein below. Another line spectrum estimator may be used, e.g., spectral peak picking. In such a case, the appropriate scaling factor may be introduced in (1) to reproduce the results of the current analysis.
The intensity I was calculated for each frame as the square root of the energy:
$$I = \sqrt{E} \qquad (3)$$
Thus a frame k may be represented by the intensity Ik and the pitch frequency F0k. Both parameters may be mapped to a logarithmic scale aligned with the human perception of loudness and tone changes, measured in decibel (dB) and octave (oct) units respectively:
$$\mathit{IdB}_k = 20 \cdot \log_{10} I_k \qquad (4)$$
$$\mathit{F0oct}_k = \log_2 \mathit{F0}_k \qquad (5)$$
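By way of a non-limiting illustration, the mapping of a voiced frame to the logarithmic scales of equations (4) and (5) may be sketched as follows. The function name `to_log_scales` is hypothetical, and the rejection of non-positive inputs is an assumption of this sketch, reflecting that the logarithms are undefined for unvoiced frames (where the pitch frequency is set to zero) and for silent frames.

```python
import math


def to_log_scales(intensity, f0_hz):
    """Map a voiced frame's intensity and pitch to (dB, octave) units."""
    if intensity <= 0.0 or f0_hz <= 0.0:
        # Assumption of this sketch: only voiced, non-silent frames are mapped
        raise ValueError("intensity and pitch frequency must be positive")
    i_db = 20.0 * math.log10(intensity)  # equation (4)
    f0_oct = math.log2(f0_hz)            # equation (5)
    return i_db, f0_oct


i_db, f0_oct = to_log_scales(10.0, 256.0)
print(i_db, f0_oct)  # 20.0 8.0
```

On these scales, doubling the pitch frequency adds exactly one octave, and doubling the intensity adds about 6.02 dB.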
Reference is now made to
To quantify the degree of this interdependency, the correlation coefficients between {IdBk} and {F0octk} sequences were calculated at class levels and over the voice dataset globally. For example, the correlation coefficient calculated within the class shown in
The correlation measures obtained in the above evaluation may provide statistical evidence of the pitch-intensity relationship. An analytical expression which models the pitch-intensity relationship utilizing log-linear regression is herein disclosed. Nevertheless, other mathematical representations may be utilized, such as piece-wise linear, piece-wise log-linear or exponential functions. With reference to
$$\mathit{IdB}^{C} = \lambda_1^{C} \cdot \mathit{F0oct}^{C} + \lambda_2^{C} \qquad (6)$$
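The least squares approximation yields the following values for the regression parameters λ1C and λ2C:
$$\lambda_1^{C} = \frac{N_C \sum_{k \in C} \mathit{F0oct}_k \, \mathit{IdB}_k - \sum_{k \in C} \mathit{F0oct}_k \sum_{k \in C} \mathit{IdB}_k}{N_C \sum_{k \in C} \mathit{F0oct}_k^{2} - \left( \sum_{k \in C} \mathit{F0oct}_k \right)^{2}} \qquad (7)$$
$$\lambda_2^{C} = \frac{1}{N_C} \left( \sum_{k \in C} \mathit{IdB}_k - \lambda_1^{C} \sum_{k \in C} \mathit{F0oct}_k \right) \qquad (8)$$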
where NC is the number of frames associated with the class C.
The model given by equations (6), (7) and (8) may be used for prediction of the intensity of a frame from the pitch frequency associated with the frame. Additionally, this model may be used to predict the change in frame intensity when the frame pitch frequency is modified by a given amount.
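By way of a non-limiting illustration, a least-squares fit of the regression line (6) over the frames of one speech class, and its use for prediction, may be sketched as follows. The function names are hypothetical, and the closed-form expressions are the standard ordinary least-squares solution for a straight line, which is assumed here to correspond to the regression parameters referred to above.

```python
def fit_log_linear(f0_oct, i_db):
    """Least-squares slope (dB/octave) and intercept (dB) of I_dB versus F0_oct."""
    n = len(f0_oct)
    if n != len(i_db) or n < 2:
        raise ValueError("need at least two (pitch, intensity) pairs")
    sx = sum(f0_oct)
    sy = sum(i_db)
    sxx = sum(x * x for x in f0_oct)
    sxy = sum(x * y for x, y in zip(f0_oct, i_db))
    denom = n * sxx - sx * sx
    if denom == 0.0:
        raise ValueError("all pitch values are identical; slope is undefined")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


def predict_intensity_db(f0_oct, slope, intercept):
    """Predicted frame intensity in dB for a pitch given in octave units."""
    return slope * f0_oct + intercept


def predict_intensity_change_db(delta_oct, slope):
    """Predicted intensity change in dB for a pitch shift of delta_oct octaves."""
    return slope * delta_oct


# Synthetic frames lying exactly on the line I_dB = 6 * F0_oct - 10
slope, intercept = fit_log_linear([7.0, 7.5, 8.0], [32.0, 35.0, 38.0])
print(round(slope, 6), round(intercept, 6))  # 6.0 -10.0
```

Note that the intercept cancels out when only the intensity change is predicted, so the slope alone determines how much the intensity is expected to move for a given pitch shift.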
The slope coefficient λ1C given by (7) of the regression line (6) may indicate the average constant rate of intensity increase dependent on pitch increase. For example, for the class shown in
The same evaluation performed with another TTS voice dataset derived from neutral speech produced by another American English speaker yielded similar results: the weighted average of the intra-class correlation coefficients was 0.40 and the weighted average of the intra-class slope values was 4.6 dB/octave.
Reference is now made to
In steps 100 and 200, an utterance, indicated Uorg, may be received (also referred to as the original utterance, the received utterance or the received original utterance). The utterance may include a plurality of consecutive segments. Each segment may include one or more consecutive frames. Thus, the utterance may include a plurality of consecutive frames, at least one of which is a voiced frame.
The utterance may be produced, for example, by a human speaker (i.e., natural speech) or by a TTS system (i.e., synthesized speech). The utterance may be represented by a raw speech signal {sorg(n), n=0, 1, . . . N} or by a sequence {Pk, k=0, 1, . . . , K} of frame-wise sets of vocoder parameters as described hereinabove.
Modern TTS systems may employ a concatenative (also known as unit selection) synthesis scheme, a statistical synthesis scheme, or a mix of both. Regardless of the TTS scheme, the synthesized utterance may be composed of segments. A segment may include one or more consecutive speech frames. In concatenative TTS, the segments (represented either by raw data or parameterized) may be extracted from natural speech signals. In statistical TTS, the segment frames may be represented in a parametric form and the vocoder parameter sets may be generated from statistical models. A TTS system may include a voice dataset, which may include a repository of labeled speech segments or a set of statistical models or both.
The hierarchical dichotomies described above are summarized in Table 1 below.
In some embodiments, each frame of the received original utterance may be mapped to a distinct speech class selected from a predefined set of speech classes. Thus, according to step 200, a sequence: {Ck, k=0, 1, . . . , K} of the frame class identity labels may be received along with the original utterance.
In the context of TTS, the frame classification information may be inherently available. A TTS voice dataset may include a collection of speech segments (raw, parameterized or represented by statistical models) clustered according to their phonetic-linguistic context. At synthesis time, the received utterance may be composed of segments selected based on their class labels. The frames included in a segment may be mapped to the class associated with the segment.
In embodiments relating to applications operating on natural speech, the frame classification information may be further generated according to the disclosed method. Each of the frames of the utterance may be mapped to a corresponding speech class selected from a predefined set of speech classes. Alternatively, such information may be received with the original utterance. For example, an Automatic Speech Recognition (ASR) process may be applied to the natural utterance. The ASR may provide a word transcript and phonetic identity of each speech frame. Natural Language Processing (NLP) techniques may be applied to the word transcript to extract required linguistic features such as part of speech, part of sentence, etc. Finally, the phonetic-linguistic features associated with each frame may be mapped to a certain predefined phonetic-linguistic context as it is done in TTS front-end blocks.
As part of steps 100 and/or 200, the original pitch contour and/or the target pitch contour of the received utterance may be also received. The original pitch contour (i.e., prior to modification of the received utterance) may include a sequence of original pitch frequency values, corresponding to the frames of the received utterance, and may be indicated as: {F0org(nk), k=0, 1, 2, . . . , K}. The target pitch contour may include a sequence of target pitch frequency values, corresponding to the frames of the received utterance, and may be indicated as: {F0out(nk), k=0, 1, 2, . . . , K}. Optionally, the original pitch contour and/or the target pitch contour may be measured or generated as part of the disclosed technique. The original pitch contour may be measured and the target pitch contour may be generated in various manners depending on whether the utterance was produced by a human speaker, a concatenative TTS system or by a statistical TTS system.
When natural speech is concerned, the original pitch contour may be measured from the speech signal corresponding to the original utterance by one of the pitch estimation algorithms known in the art, such as the one disclosed in A. Sorin et al. (2004), id. When concatenative TTS is concerned, a concatenation of the segmental pitch contours may be performed. When statistical TTS is concerned, the original pitch contour may be the trajectory of the F0 parameter. The trajectory may be generated from the statistical models trained on the speech dataset, which is used for the modeling of all the vocoder parameters in the system.
For both natural speech and TTS, the target pitch contour may be generated by transforming the original pitch contour. For example: F0out(nk)=avF0out+α·[F0org(nk)−avF0org], where avF0org and avF0out are the utterance level average values of the original and target pitch contours respectively and α is the parameter that controls the dynamics and hence influences the perceived expressiveness of the target pitch contour. The utterance level average avF0out may be set to a desired pre-defined value or made dependent on the average value of the original pitch contour, e.g. avF0out=β·avF0org where β is a control parameter. When TTS (concatenative and statistical) is concerned, the target pitch contour may be generated by a rule-based framework. Alternatively, it may be derived from a relevant, typically expressive and relatively small speech data corpus external to the voice dataset used for the synthesis. The desired pitch values may be generated initially at a lower temporal resolution than a frame, e.g. one value per phoneme, and then interpolated to the frame centers grid.
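By way of a non-limiting illustration, the transformation F0out(nk)=avF0out+α·[F0org(nk)−avF0org] with avF0out=β·avF0org may be sketched as follows. The function name `target_pitch_contour` is hypothetical. Two assumptions of this sketch are not stated in the text: the utterance level average is taken over voiced frames only, and unvoiced frames (pitch frequency zero) are left at zero.

```python
def target_pitch_contour(f0_org, alpha, beta):
    """Scale the pitch dynamics by alpha and the utterance average by beta."""
    voiced = [f for f in f0_org if f > 0.0]
    if not voiced:
        return list(f0_org)  # nothing to transform in a fully unvoiced utterance
    av_org = sum(voiced) / len(voiced)  # assumption: average over voiced frames
    av_out = beta * av_org
    return [
        av_out + alpha * (f - av_org) if f > 0.0 else 0.0  # unvoiced stays 0
        for f in f0_org
    ]


# alpha = 2 doubles the excursions around an unchanged average of 150 Hz
print(target_pitch_contour([100.0, 0.0, 200.0], alpha=2.0, beta=1.0))
# [50.0, 0.0, 250.0]
```

For large values of α the transformed value of a low-pitched frame may become non-positive, so a practical implementation would likely clamp the result to a plausible pitch range.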
Alternatively, the utterance in its original form and the utterance with a modified pitch (i.e. pitch-modified utterance) may be received. The original pitch contour may be then measured based on the utterance in its original form and/or the target pitch contour may be measured based on the pitch-modified utterance.
In steps 110 and 210, an original intensity contour {Iorg(nk), k=0, 1, 2, . . . , K} of the utterance may be calculated. This calculation may be performed by applying an instantaneous intensity estimator to the original utterance.
The instantaneous intensity estimator may be defined to output a value I(n) proportional to the average amplitude of samples of a speech signal corresponding to an utterance within a short time window surrounding the discrete time instant n.
When the representation of the utterance is based on a Sinusoidal Model, an instantaneous intensity estimator operating in frequency domain may be defined as specified by equations (3) and (1). The harmonic magnitudes Ai associated with the frame centered at the time instant n may be determined using a line spectrum estimation algorithm similar to that of D. Chazan et al, id. If the line spectrum is determined by a Short Time Fourier Transform (STFT) peak picking algorithm, then the harmonic magnitudes may be divided by the DC (direct current) value of the spectrum of the windowing function used in the STFT.
In some embodiments, the intensity estimator may be defined to operate over a speech signal in time domain. In this case the intensity I(n) may be also calculated as the square root of the energy E(n) as specified in equation (3), but the energy may be estimated from the speech signal s using a time window surrounding the time instant n. The simplest form of the energy estimation may be given by:
where Ln is the window length (generally frame dependent) and ⌊·⌋ denotes the down integer rounding operation. The value of Ln may be set so that the window includes one pitch cycle:
$$L_n = T_n = [F_s / \mathit{F0}(n)] \qquad (10)$$
where Fs is the sampling frequency, e.g. 22,050 Hz, and [.] denotes the integer rounding operation. However, other settings may be possible. With these settings the expression (9) may represent the average energy of the pitch cycle centered at the time instant n.
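By way of a non-limiting illustration, a time-domain intensity estimator using a window of one pitch cycle may be sketched as follows. The function names are hypothetical. The placement of the window (starting ⌊Ln/2⌋ samples before the time instant n) and the truncation of the window at the signal boundaries are assumptions of this sketch rather than details taken from the text.

```python
import math


def pitch_cycle_length(fs_hz, f0_hz):
    """Window length L_n = T_n = [Fs / F0(n)], rounded to the nearest integer."""
    return int(round(fs_hz / f0_hz))


def frame_intensity(signal, n, fs_hz, f0_hz):
    """Intensity at sample n: square root of the average energy of one cycle."""
    length = pitch_cycle_length(fs_hz, f0_hz)
    start = n - length // 2  # assumption: window roughly centered at n
    window = signal[max(start, 0):min(start + length, len(signal))]
    if not window:
        return 0.0
    energy = sum(x * x for x in window) / len(window)  # average per-sample energy
    return math.sqrt(energy)


# A unit-amplitude sinusoid at 220.5 Hz sampled at 22050 Hz (100 samples/cycle)
sig = [math.sin(2.0 * math.pi * m / 100.0) for m in range(1000)]
print(round(frame_intensity(sig, 500, 22050, 220.5), 4))  # 0.7071
```

Because the window spans exactly one pitch cycle, the estimate for a stationary periodic signal does not depend on where within the cycle the time instant n falls.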
A more robust method may be employed to extract a representative pitch cycle $\tilde{s}(i)$, i=1, 2, . . . , Tn as a weighted average of the pitch cycles occurring in proximity of the time instant n:
where:
w(i), i=1, 2, . . . , 2M+1 is a positive windowing function symmetric about i=M+1, e.g., a Hamming windowing function; and
the interval K(i) spanned by the summation index k is defined so that:
$$\left| -[T_n/2] + i + k T_n \right| \le M.$$
Then the energy E(n) may be calculated as:
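By way of a non-limiting illustration, the extraction of a representative pitch cycle as a window-weighted average of neighboring pitch cycles, followed by an energy calculation, may be sketched as follows. All names are hypothetical. Several details are assumptions of this sketch: the weights are normalized by their sum, [Tn/2] is taken as the integer floor of Tn/2, samples falling outside the signal are skipped, and the energy is taken as the average per-sample energy of the representative cycle.

```python
import math


def hamming(length):
    """Hamming window of odd length 2M+1 >= 3; all of its values are positive."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * j / (length - 1))
            for j in range(length)]


def representative_cycle(signal, n, period, half_width):
    """Weighted average of the pitch cycles within +/- half_width samples of n."""
    w = hamming(2 * half_width + 1)
    cycle = []
    for i in range(1, period + 1):
        base = -(period // 2) + i  # assumption: [Tn/2] taken as floor(Tn/2)
        # All integers k for which |base + k * period| <= half_width
        k_min = -((half_width + base) // period)
        k_max = (half_width - base) // period
        num = den = 0.0
        for k in range(k_min, k_max + 1):
            d = base + k * period  # offset from n of the same phase, k cycles away
            idx = n + d
            if 0 <= idx < len(signal):  # assumption: skip out-of-range samples
                num += w[d + half_width] * signal[idx]
                den += w[d + half_width]
        cycle.append(num / den if den > 0.0 else 0.0)
    return cycle


def cycle_energy(cycle):
    """Average per-sample energy of the representative cycle (assumed form)."""
    return sum(x * x for x in cycle) / len(cycle)


# Perfectly periodic input: the average reproduces a single cycle
sig = [math.sin(2.0 * math.pi * m / 10.0) for m in range(200)]
cyc = representative_cycle(sig, n=100, period=10, half_width=25)
print(round(cycle_energy(cyc), 6))  # 0.5
```

For a perfectly periodic signal the weighted average returns the cycle itself; for real speech it smooths cycle-to-cycle variation, which is what makes the resulting energy estimate more robust.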
In steps 120 and 220, a pitch-modified utterance Uintr (hereafter referred to as intermediate utterance) may be generated based on the target pitch contour. The pitch-modified utterance may be generated by applying a pitch modification technique to the original utterance Uorg. Any pitch modification technique suitable for the original utterance representation form may be applied. Some pitch modification techniques depending on the underlying speech representation are exemplified below.
When raw speech signal is concerned, the pitch modification may be performed by using Pitch Synchronous Overlap and Add (PSOLA) techniques either in time or frequency domain (see E. Moulines, W. Verhelst, “Time-domain and frequency-domain techniques for prosodic modification of speech”, in Speech Coding and Synthesis, B. Klein ed, Elsevier Science Publishers 1995). PSOLA may require calculation of pitch marks (i.e., pitch cycle related epochs) derived from the pitch contour and the signal.
When a parameterized signal is concerned, the pitch modification may be performed in the space of vocoder parameters yielding a parameterized pitch-modified signal. In some embodiments it may be preferable to further convert the parameterized pitch-modified (or intermediate) signal to the form of raw speech signal. The pitch modification algorithm may depend on the parametric representation adopted in the system.
Some of the statistical TTS systems may adopt the source-filter representation framework (see Zen, H., Tokuda, K., and Black, A. W., “Statistical parametric speech synthesis”, Speech Communication, vol. 51, November 2009, pp. 1039-1064) where the source may represent the excitation signal produced by the vocal folds and the filter may represent the vocal tract. Within this framework the main (quasi-periodic) part of the excitation signal may be generated using the target pitch contour. Other vocoding techniques may employ frame-wise sinusoidal representations (see Chazan et al., id; S. Shechtman and A. Sorin, “Sinusoidal model parameterization for HMM-based TTS system”, in Proc. Interspeech 2010, Makuhari, Japan, September 2010; T. F. Quatieri and R. I. McAulay, “Speech Transformations Based on a Sinusoidal Representation.” IEEE Trans. Acoust. Speech Signal Process. ASSP-34, 1449, December 1986; Stylianou, Y., et al, “An extension of the adaptive Quasi-Harmonic Model”, in Proc. ICASSP 2012, Kyoto Japan, March 2012). In such a framework, the pitch modification at frame k may be performed by an interpolation and re-sampling of the frame-wise harmonic structures (also known as line spectra) along the frequency axis at the integral multiples of the target pitch frequency F0out(k).
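The frame-wise resampling of a line spectrum described above may be sketched as follows. This is a minimal illustration only: the function name `resample_line_spectrum`, the parameter `f_max`, and the use of linear interpolation over harmonic amplitudes are assumptions made for the example, and phases are not handled.

```python
import numpy as np

def resample_line_spectrum(amps_org, f0_org, f0_out, f_max):
    """Resample harmonic amplitudes from multiples of f0_org to multiples of f0_out.

    amps_org[h-1] is the amplitude of harmonic h of the original frame.
    Only amplitudes are handled; phases are ignored in this sketch.
    """
    amps_org = np.asarray(amps_org, dtype=float)
    # Frequencies of the original harmonics: f0_org, 2*f0_org, ...
    freqs_org = f0_org * np.arange(1, len(amps_org) + 1)
    # Keep the target harmonics that do not exceed f_max (hypothetical band limit).
    n_out = int(np.floor(f_max / f0_out))
    freqs_out = f0_out * np.arange(1, n_out + 1)
    # Linear interpolation of the amplitude envelope; outside the original
    # harmonic range np.interp holds the nearest end-point value.
    amps_out = np.interp(freqs_out, freqs_org, amps_org)
    return freqs_out, amps_out
```

For example, a frame with harmonics at 100, 200, 300 and 400 Hz may be resampled at multiples of 150 Hz, which reads the interpolated envelope at 150 Hz and 300 Hz.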
The intermediate utterance Uintr may be represented either in a parametric form as a sequence {Pintr(nk), k=0, 1, . . . , K} of modified sets of vocoder parameters or by a modified raw speech signal sintr(n), depending on the representation type of the received original utterance. Although other representations may be applicable, in embodiments based on Sinusoidal Model representation of the original utterance, the intermediate utterance may be kept in the parameterized form while otherwise, it may be represented by a raw speech signal.
In steps 130 and 230, an intensity modification factor ρ(nk) may be calculated for each frame based on the original pitch contour and the target pitch contour. More specifically, an intensity modification factor ρ(nk) may be calculated for each frame k based on the original pitch frequency value F0org(nk) and the target pitch frequency value F0out(nk) associated with that frame. If the frame is unvoiced then the intensity modification factor may be set such that: ρ(nk)=1.
For a voiced frame the calculation of the intensity modification factor may be based on a pitch-to-intensity transformation R modeling the relationship between the instantaneous pitch frequency F0 and instantaneous intensity I. Such a transformation may be defined as:
I=R(F0,λ) (12)
where λ is a set of control parameters. Thus, the transformation may be represented, for example, as a function of the pitch frequency and a set of control parameters.
The calculation of the intensity modification factor ρ(nk) for a voiced frame k may include setting values for the control parameters λ of the pitch-to-intensity transformation R.
According to step 230, the calculation of the intensity modification factor may be further based on the frame classification information. The control parameters may be then set to a value λ*C
With reference to both methods, i.e., the methods of
The calculation of the intensity modification factor may further include calculating a reference value of the intensity Iorgref(nk) corresponding to the original pitch frequency applying the R-transformation of equation (12) to the original pitch frequency as follows:
Iorgref(nk)=R(F0org(nk),λ) (13)
The calculation of the intensity modification factor may further include calculating a reference value of the intensity Imodref(nk) corresponding to the modified (i.e., target) pitch frequency applying the R-transformation of equation (12) to the modified pitch frequency as follows:
Imodref(nk)=R(F0out(nk),λ) (14)
The intensity modification factor ρ(nk) may be then obtained by dividing the reference value of the intensity Imodref(nk) corresponding to the target (i.e., modified) pitch frequency by the reference value of the intensity Iorgref(nk) corresponding to the original pitch frequency:
ρ(nk)=Imodref(nk)/Iorgref(nk) (15)
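Equations (13) to (15), together with the unvoiced-frame rule ρ(nk)=1, may be sketched as follows. The pitch-to-intensity transformation is passed in as a callable `R`; the function names and the power-law `example_R` are hypothetical and serve only to make the sketch executable.

```python
import numpy as np

def intensity_modification_factors(f0_org, f0_out, voiced, R, lam):
    """Per-frame factor rho = R(F0out, lam) / R(F0org, lam); rho = 1 if unvoiced."""
    f0_org = np.asarray(f0_org, dtype=float)
    f0_out = np.asarray(f0_out, dtype=float)
    voiced = np.asarray(voiced, dtype=bool)
    rho = np.ones(len(f0_org))
    # R is evaluated on voiced frames only, so unvoiced F0 placeholders are never used.
    i_org_ref = R(f0_org[voiced], lam)   # equation (13)
    i_mod_ref = R(f0_out[voiced], lam)   # equation (14)
    rho[voiced] = i_mod_ref / i_org_ref  # equation (15)
    return rho

def example_R(f0, lam):
    # Hypothetical transformation used only for illustration: I = lam[0] * F0 ** lam[1].
    return lam[0] * f0 ** lam[1]
```

Keeping `R` as a parameter reflects that the text leaves the form of the transformation open; any callable of the pitch frequency and the control parameters may be substituted.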
In some embodiments, the R-transformation (12) may be defined according to equations (4), (5) and (6), i.e., based on log-linear regression, as follows:
yielding the intensity modification factor in the form:
Hence, in such embodiments, the intensity modification factor depends only on the amount of pitch modification F0out/F0org regardless of the absolute level of F0. With specific reference to the method of
In some embodiments, other types of the R-transformation may be used, e.g. based on an exponential function, or piece-wise linear function in the linear or log scales.
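The log-linear case may be worked through as follows. Equations (4) to (6) are not reproduced in this section, so the derivation assumes that the regression has the form I_dB = λ1 + λ2·F0_oct, with the scale mappings I_dB = 20·log10 I and F0_oct = log2 F0; the parameter names λ1 (offset) and λ2 (slope) are assumptions of this sketch.

```latex
\begin{align*}
  % Assumed log-linear model: intensity in dB is linear in pitch in octaves
  20\log_{10} I &= \lambda_1 + \lambda_2 \log_2 F_0 \\
  % Solving for I gives one possible form of the R-transformation
  R(F_0,\lambda) &= 10^{(\lambda_1 + \lambda_2 \log_2 F_0)/20} \\
  % The offset \lambda_1 cancels in the ratio of equation (15)
  \rho(n_k) &= \frac{R(F_{0,\mathrm{out}}(n_k),\lambda)}{R(F_{0,\mathrm{org}}(n_k),\lambda)}
             = 10^{\frac{\lambda_2}{20}\log_2\frac{F_{0,\mathrm{out}}(n_k)}{F_{0,\mathrm{org}}(n_k)}} \\
  % Worked value: slope of 6 dB per octave, pitch raised by one octave
  \lambda_2 = 6,\ \frac{F_{0,\mathrm{out}}}{F_{0,\mathrm{org}}} = 2
  &\;\Rightarrow\; \rho = 10^{6/20} \approx 1.995
\end{align*}
```

Under this assumption the factor is a function of the ratio F0out/F0org alone, and the offset λ1 plays no role at run-time.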
In an optional step, temporal smoothing may be applied to the sequence {ρ(nk), k=0, 1, 2, . . . , K} of the intensity modification factors. A smoothed sequence {ρs(nk), k=0, 1, 2, . . . , K} of the intensity modification factors may be then generated. Such temporal smoothing may prevent abrupt changes in the final intensity contour.
Any smoothing technique, as known in the art, may be adopted. One choice may be the weighted moving averaging method, i.e. the convolution with a symmetric positive (2I+1)-tap filter v:
The filter v may be defined, for example, as: [v0=3, v1=2, v2=1]. The smoothed sequence of the intensity modification factors may be used in the following steps instead of the original sequence of the intensity modification factors.
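The weighted moving average may be sketched as follows, using the example taps v0=3, v1=2, v2=1. The normalization of the filter to unit sum and the replication of the first and last values at the utterance boundaries are assumptions of this sketch, as is the function name `smooth_factors`.

```python
import numpy as np

def smooth_factors(rho, half_taps=(3.0, 2.0, 1.0)):
    """Smooth rho with a symmetric filter built from taps v0, v1, ..., vI."""
    rho = np.asarray(rho, dtype=float)
    v = np.asarray(half_taps, dtype=float)
    # Build the symmetric (2I+1)-tap kernel [vI, ..., v1, v0, v1, ..., vI].
    kernel = np.concatenate([v[:0:-1], v])
    # Assumption: normalize so that a constant sequence is left unchanged.
    kernel /= kernel.sum()
    pad = len(v) - 1
    # Assumption: replicate edge values so the output keeps the input length.
    padded = np.pad(rho, pad, mode="edge")
    return np.convolve(padded, kernel, mode="valid")
```

With unit-sum normalization a constant factor sequence passes through unchanged, so smoothing alters only the frame-to-frame variation of the factors.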
In steps 140 and 240, a final intensity contour {Iout(nk), k=0, 1, 2, . . . , K} may be calculated. The calculation may be based on the original intensity contour and the sequence of the intensity modification factors.
The sequence of the intensity modification factor values (i.e., per frame) may be applied to the original intensity contour:
Iout(nk)=Iorg(nk)·ρ(nk) (19)
The values of the final intensity contour may be then limited in order to preserve the original amplitude range of the output speech signal and/or to prevent possible clipping of the output signal:
Iout(nk)=min(Iout(nk),Imax) (20)
where Imax is either a predefined value or is derived from the original utterance, for example:
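The application of the factors and the subsequent limiting may be sketched as follows. The sketch reads the limiting step as an upper bound on the contour, and, when no predefined value is given, takes Imax to be the peak of the original intensity contour; both readings, and the function name, are assumptions.

```python
import numpy as np

def final_intensity_contour(i_org, rho, i_max=None):
    """Apply per-frame factors to the original contour and cap the result."""
    i_org = np.asarray(i_org, dtype=float)
    rho = np.asarray(rho, dtype=float)
    if i_max is None:
        # Assumption: derive the cap from the peak of the original contour.
        i_max = i_org.max()
    # Frame-wise product (equation (19)) followed by the upper limit.
    return np.minimum(i_org * rho, i_max)
```

Capping at the original peak keeps the output within the amplitude range of the input, which is one way to address the clipping concern mentioned above.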
In steps 150 and 250, a coherently-modified speech signal (i.e., the output speech signal) may be generated by time-dependent intensity scaling of the intermediate (i.e. pitch-modified) utterance according to the final intensity contour.
The intensity scaling may include determining an intermediate intensity contour {Iintr(nk), k=0, 1, 2, . . . , K} based on the pitch-modified utterance. The intermediate intensity contour may be determined by applying the instantaneous intensity estimator, described hereinabove, to the intermediate (i.e. pitch modified) utterance.
The intensity scaling may further include determining a gain factor contour {g(nk), k=0, 1, 2, . . . , K} comprised of frame-wise gain factor values:
In embodiments employing Sinusoidal Model based speech representation, the intensity scaling may further include multiplying all the sinusoidal component magnitudes by g(nk) for each frame k of the pitch-modified utterance and then transforming the modified parametric representation to the output speech signal sout(n).
Otherwise, the intensity scaling may further include deriving a gain factor signal {g(n), n=0, 1, 2, . . . , N} by up-sampling the gain factor contour {g(nk), k=0, 1, 2, . . . , K} to the sample rate using an interpolation function h:
g(n)=h(n−n_k)·g(n_k)+(1−h(n−n_k))·g(n_{k+1}) for n_k≦n≦n_{k+1}, where 1=h(0)>h(1)>h(2)> . . . >h(n_{k+1}−n_k)=0 (23)
The interpolation function {h(i), i=0, 1, . . . , n_{k+1}−n_k} may be set, for example, to the right half of a Hann windowing function.
The intensity scaling may further include multiplying the raw speech signal corresponding to the intermediate utterance (i.e. the pitch-modified speech signal or the intermediate speech signal) by the gain factor signal:
sout(n)=sintr(n)·g(n) (24)
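The raw-signal branch of the intensity scaling may be sketched as follows. The frame-wise gain is assumed here to be the ratio of the final intensity to the intermediate intensity (an energy-based intensity measure may call for a square root instead); the function names, the `eps` guard, and the constant extension of the gain before the first and after the last frame position are likewise assumptions.

```python
import numpy as np

def frame_gains(i_out, i_intr, eps=1e-12):
    # Assumed form of the frame-wise gain: target over measured intensity.
    return np.asarray(i_out, dtype=float) / np.maximum(np.asarray(i_intr, dtype=float), eps)

def gain_signal(g_frames, frame_pos, n_samples):
    """Interpolate frame-wise gains g(n_k) to a per-sample gain g(n).

    frame_pos holds strictly increasing sample indices n_k within [0, n_samples - 1].
    """
    g_frames = np.asarray(g_frames, dtype=float)
    frame_pos = np.asarray(frame_pos, dtype=int)
    g = np.empty(n_samples)
    # Assumption: hold the first/last gain outside the span of frame positions.
    g[:frame_pos[0]] = g_frames[0]
    g[frame_pos[-1]:] = g_frames[-1]
    for k in range(len(frame_pos) - 1):
        a, b = frame_pos[k], frame_pos[k + 1]
        i = np.arange(b - a + 1)
        # Right half of a Hann window: h(0) = 1, decreasing to h(b - a) = 0.
        h = 0.5 * (1.0 + np.cos(np.pi * i / (b - a)))
        g[a:b + 1] = h * g_frames[k] + (1.0 - h) * g_frames[k + 1]
    return g

def scale_signal(s_intr, g_frames, frame_pos):
    # Multiply the intermediate signal by the per-sample gain, as in equation (24).
    s_intr = np.asarray(s_intr, dtype=float)
    return s_intr * gain_signal(g_frames, frame_pos, len(s_intr))
```

The half-Hann weighting gives a gain that equals g(n_k) at each frame position and moves smoothly toward the next frame value, avoiding gain steps at frame boundaries.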
With specific reference to
A collection of speech data may be required for this step. The collection of speech data may include speech segments mapped to the speech classes expected in the frame classification information. Each speech segment may be divided into frames. A frame may be mapped to the speech class associated with the segment including that frame. In the context of a TTS application, such a collection may be readily available as a part of the voice dataset. In the context of natural utterance processing, such a collection may be composed from a transcribed and phonetically aligned single-speaker data corpus, either publicly available (e.g., TIMIT Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium. University of Pennsylvania. Web. Mar. 1, 2015, https://catalog.ldc.upenn.edu/LDC93S1) or proprietary.
For each class C, all the voiced frames may be gathered. In order to make the estimation more robust, voiced frames belonging to segments that are not fully voiced (i.e., segments containing at least one unvoiced frame) may be excluded. The following sub-steps may be performed for a class C, optionally provided that the number of frames gathered for this class is greater than a predefined amount, e.g., four.
Each analyzed frame k may be represented by the pitch frequency value F0k and the intensity value Ik estimated by using the intensity estimator adopted for the specific case. Two observation vectors IC={Ik, ∀k∈C} and F0C={F0k, ∀k∈C} may be generated by stacking together the frame intensity and pitch frequency values respectively. The optimal set of the parameter values λ*C of the pitch-to-intensity transformation (12) may be determined such that the intensity values predicted by equation (12) yield the best possible approximation of the real intensity values observed in the frames associated with class C:
where WI(x) and WF0(x) are intensity and pitch frequency scale mapping functions respectively. It should be noted that WI(X) and WF0(X) denote the component-wise transformations of the vector X, and R(X,λ) likewise denotes the component-wise transformation of the vector X. In some embodiments, where no scale mapping is performed, an identity scale mapping function may be used (e.g., WI(x)=x). The optimization problem (25) may be solved by a suitable numeric optimization technique, as known in the art.
In some embodiments, the R-transformation may be defined by equation (16) and the scale mapping functions may be defined according to equations (4) and (5), as follows:
WI(I)=IdB=20·log10 I (26a)
WF0(F0)=F0oct=log2 F0 (26b)
In this case, the optimization problem of equation (25) may be solved analytically and the optimal parameter values for a class C may be calculated as specified by equations (7) and (8).
The per-class optimal control parameter values λ*C labeled by the respective class identity may be stored in the intensity prediction models database which may be available at run-time.
Speech classes that do not include enough observation data for the statistically meaningful estimation of the parameters λ (for example, those classes containing fewer than five frames found in fully voiced segments) may be marked in the intensity prediction models database as irrelevant for the intensity modification.
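The per-class estimation may be sketched as follows for the log-linear case with the scale mappings (26a) and (26b). Equations (7) and (8) are not reproduced in this section, so an ordinary least-squares line fit is used as a stand-in; the names `fit_class_model`, `train_models`, `MIN_FRAMES`, and the use of `None` to mark a class as irrelevant are hypothetical.

```python
import numpy as np

MIN_FRAMES = 5  # classes with fewer usable frames are marked irrelevant

def fit_class_model(f0_hz, intensity):
    """Least-squares fit of I_dB = lam1 + lam2 * F0_oct for one speech class."""
    f0_oct = np.log2(np.asarray(f0_hz, dtype=float))             # mapping (26b)
    i_db = 20.0 * np.log10(np.asarray(intensity, dtype=float))   # mapping (26a)
    if len(f0_oct) < MIN_FRAMES:
        return None  # not enough observations for a meaningful estimate
    # np.polyfit returns coefficients from the highest degree down: slope, offset.
    lam2, lam1 = np.polyfit(f0_oct, i_db, 1)
    return lam1, lam2

def train_models(frames_by_class):
    """Map each class label to its fitted (lam1, lam2), or None if irrelevant."""
    return {label: fit_class_model(f0, inten)
            for label, (f0, inten) in frames_by_class.items()}
```

The resulting dictionary stands in for the intensity prediction models database: at run-time a frame's class label selects its parameters, and a `None` entry indicates that no intensity modification is applied for that class.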
Step 260 may be performed offline and prior to the performance of steps 200-250 of the method of
Reference is now made to
Database 350 may be stored on any one or more storage devices such as a Flash disk, a Random Access Memory (RAM), a memory chip, an optical storage device such as a CD, a DVD, or a laser disk; a magnetic storage device such as a tape, a hard disk, storage area network (SAN), a network attached storage (NAS), or others; a semiconductor storage device such as Flash device, memory stick, or the like. Database 350 may be a relational database, a hierarchical database, object-oriented database, document-oriented database, or any other database.
In some embodiments, computing device 310 may include an I/O device 340 such as a terminal, a display, a keyboard, a mouse, a touch screen, a microphone, an input device and/or the like, to interact with system 300, to invoke system 300 and to receive results. It will however be appreciated that system 300 may operate without human operation and without I/O device 340.
In some exemplary embodiments of the disclosed subject matter, storage device 330 may include or be loaded with code for a user interface. The user interface may be utilized to receive input or provide output to and from system 300, for example receiving specific user commands or parameters related to system 300, providing output, or the like.
A standard subjective listening preference test was performed in order to test the disclosed speech modification. Twelve text messages conveying expressive contents (Sports News) were synthesized by a TTS system trained on a neutral voice. During the synthesis, expressive pitch contours were implanted into the speech signals, i.e. the default pitch contours emerging from the TTS system were replaced by externally generated expressive ones. The synthesis was performed twice: A) while preserving the original energy contour; and B) while modifying the energy contour in accordance with the disclosed technique. Ten listeners were presented with 12 pairs of stimuli each. Each pair included the above version A and version B of the same stimulus. After listening to both versions of a stimulus, a listener was instructed to choose between five options: no preference, preference to either version, strong preference to either version. The evaluation revealed average preference of 51% including 14% of strong preference for the version B, i.e., the speech signal modified according to the disclosed technique, and only 25% for the version A.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Number | Name | Date | Kind |
---|---|---|---|
5479564 | Vogten et al. | Dec 1995 | A |
6101470 | Eide | Aug 2000 | A |
6260016 | Holm | Jul 2001 | B1 |
8145491 | Hamza et al. | Mar 2012 | B2 |
8494856 | Latorre et al. | Jul 2013 | B2 |
8744854 | Chen | Jun 2014 | B1 |
20040024600 | Hamza | Feb 2004 | A1 |
20040030546 | Sato | Feb 2004 | A1 |
20050131680 | Chazan | Jun 2005 | A1 |
20060074678 | Pearson et al. | Apr 2006 | A1 |
20080082333 | Nurminen | Apr 2008 | A1 |
20110251840 | Cook | Oct 2011 | A1 |
20130262119 | Latorre-Martinez | Oct 2013 | A1 |
20140195242 | Chen | Jul 2014 | A1 |
Number | Date | Country |
---|---|---|
07003891.4 | Jun 2007 | EP |
Entry |
---|
R. Muralishankar et al., “Modification of Pitch using DCT in the Source Domain”, Speech Communication, vol. 42, Issue 2, Feb. 2004, pp. 143-154. |
Wen-Wei Liao and Jia-Lin Shen., “Improved Prosody Module in a Text-to-Speech System”, The Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Proceedings of the 16th Conference on Computational Linguistics and Speech Processing, Sep. 2004, pp. 345-354. |
E. Moulines, W. Verhelst, “Time-domain and frequency-domain techniques for prosodic modification of speech”, in Speech Coding and Synthesis, B. Klein ed, Elsevier Science Publishers 1995. |
Number | Date | Country | |
---|---|---|---|
20160307560 A1 | Oct 2016 | US |