Text-to-speech (TTS) synthesis is used in computer software and hardware products to convert normal language text into audible speech. In TTS, audio speech samples of a human speaker are prerecorded, processed, and stored in a database as discrete audio segments and supporting data which are later used to form the words and sentences of an input text. A TTS solution provider typically offers a limited selection of prepared voices corresponding to actual human speakers. Those who employ TTS in their products may wish to use multiple voices, such as when producing a multi-speaker conversation. However, as building a voice for TTS is a costly and time-consuming process, providing multiple TTS voices on demand presents a great challenge to TTS solution providers.
In one aspect of the invention a method is provided for text-to-speech synthesis including deriving from a voice dataset a sequence of speech frames corresponding to a text, wherein any of the speech frames is represented in the voice dataset by a parameterized vocal tract component, glottal pulse parameters, and an aspiration noise level, transforming the speech frames in the sequence by applying a voice transformation to any of the parameterized vocal tract component, glottal pulse parameters, and aspiration noise level representing the speech frames, wherein the voice transformation is applied in accordance with a virtual voice specification that includes at least one voice control parameter indicating a value for at least one of timbre, glottal tension and breathiness, and producing a digital audio signal of synthesized speech from the transformed sequence of speech frames.
In other aspects of the invention systems and computer program products embodying the invention are provided.
Aspects of the invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the appended drawings in which:
Reference is now made to
Each speech signal from transcribed speech corpus 100 is provided to a framer 106 together with segmentation information generated for the speech signal by voice dataset builder 102, where the segmentation information indicates the segments associated with the speech signal and the boundary time offsets of each segment. Framer 106 is configured to divide the speech signal into overlapping frames, such as where frame duration and frame shift are set to 30 ms and 5 ms respectively. Framer 106 assigns a unique identifier to each frame and associates each frame with the segment that has the greatest time overlap with that frame. Framer 106 creates for each segment a segmental frames list which lists all the frames associated with that segment in their natural order. Framer 106 stores the segmental frames lists in augmented TTS voice dataset 112 in association with the aforementioned unique voice dataset identifier.
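By way of illustration only, the framing scheme above (30 ms frames at a 5 ms shift, each frame assigned to the segment with the greatest time overlap) might be sketched as follows; the function names and the segment representation are hypothetical, not part of the specification:

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=30, shift_ms=5):
    """Divide a speech signal into overlapping frames (30 ms / 5 ms by
    default, as in the text above). Returns (frame_id, start, samples)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    for frame_id, start in enumerate(range(0, len(signal) - frame_len + 1, shift)):
        frames.append((frame_id, start, signal[start:start + frame_len]))
    return frames

def assign_to_segment(frame_start, frame_len, segments):
    """Pick the segment with the greatest time overlap with the frame.
    `segments` is a list of (segment_id, start_sample, end_sample)."""
    best_id, best_overlap = None, 0
    for seg_id, seg_start, seg_end in segments:
        overlap = min(frame_start + frame_len, seg_end) - max(frame_start, seg_start)
        if overlap > best_overlap:
            best_id, best_overlap = seg_id, overlap
    return best_id
```

A segmental frames list is then simply the frame identifiers grouped by their assigned segment, in natural order.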
The sequence of frames produced by framer 106 for each speech signal is provided to a pitch estimator 108 which is configured to classify each frame as either voiced or unvoiced and determine, in accordance with any pitch estimation algorithm, a pitch frequency F0 for each voiced frame, all of which information is stored in augmented TTS voice dataset 112 in association with the aforementioned unique voice dataset identifier. Unvoiced frames are marked as unvoiced, for example, by setting F0=0.
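Any pitch estimation algorithm may be used. As one illustrative possibility only (not the method of the specification), a basic autocorrelation estimator with a normalized-peak voicing decision, marking unvoiced frames with F0=0 as above, might look like:

```python
import numpy as np

def estimate_f0(frame, sample_rate, fmin=60.0, fmax=400.0, voicing_threshold=0.3):
    """Return an F0 estimate for a voiced frame, or 0 for an unvoiced one
    (unvoiced frames are marked by F0 = 0, per the text above)."""
    frame = frame - np.mean(frame)
    if not np.any(frame):
        return 0.0
    # Autocorrelation restricted to lags corresponding to [fmin, fmax].
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    # Voicing decision: normalized autocorrelation peak must be strong.
    if ac[lag] / ac[0] < voicing_threshold:
        return 0.0
    return sample_rate / lag
```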
Each voiced frame is provided to a glottal encoder 110 which is configured in accordance with conventional techniques to represent each voiced frame as a parameterized vocal tract component, glottal pulse parameters, and an aspiration noise level. The glottal pulse parameters are preferably derived from gain Ee and time instants Tp, Te, Ta and Tc=1/F0 expressing the general properties of glottal flow, such as is shown in
Reference is now made to
The raw glottal source signal h(t) and the pitch frequency F0 are provided to a preliminary glottal pulse fitter 302 which is configured to fit a preliminary glottal pulse of length T0=1/F0 to the raw glottal source signal h(t) using a simplified, single-parameter reduction of the LF pulse shape known as the Rd glottal pulse parameterization, as described in "The LF-model revisited, transformations and frequency domain analysis" (G. Fant, STL-QPSR Journal, vol. 36, no. 2-3, pp. 119-156, 1995). This is done by performing a joint estimation of the Rd parameter and the optimal pulse time offset O relative to the raw glottal signal h(t) by maximizing the correlation coefficient between the synthetic pulse and the pitch-period-long portion of the raw glottal source signal starting at time offset O, as follows:
where:
The simplex search method described in "Convergence Properties of the Nelder-Mead Simplex Method in Low Dimensions" (J. Lagarias, et al., SIAM Journal on Optimization, Vol. 9, No. 1, pp. 112-147, 1998) may be used to solve the scalar optimization problem in equation A1.
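By way of illustration only, the joint (Rd, O) estimation of equation A1 and the gain of equation A2 might be sketched as follows. The exact Rd-parameterized pulse p(Rd) is defined in the Fant reference and is taken here as a caller-supplied function, and a simple grid search stands in for the Nelder-Mead simplex search; all function names are hypothetical:

```python
import numpy as np

def fit_pulse(h, t0, pulse_fn, rd_grid, offsets):
    """Jointly estimate (Rd*, O*) by maximizing the correlation coefficient
    between a synthetic pulse pulse_fn(rd, t0) and the pitch-period-long
    portion of the raw glottal source h starting at offset O (cf. equation
    A1), then compute the preliminary gain Q per equation A2."""
    best_rd, best_o, best_c = None, None, -2.0
    for rd in rd_grid:
        p = pulse_fn(rd, t0)
        for o in offsets:
            seg = h[o:o + t0]
            if len(seg) < t0 or np.std(seg) == 0:
                continue  # skip incomplete or silent segments
            c = np.corrcoef(p, seg)[0, 1]
            if c > best_c:
                best_rd, best_o, best_c = rd, o, c
    # Preliminary gain Q = <p(Rd*), hO*T0> / ||p(Rd*)||^2 (equation A2).
    p = pulse_fn(best_rd, t0)
    q = np.dot(p, h[best_o:best_o + t0]) / np.dot(p, p)
    return best_rd, best_o, q
```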
Preliminary glottal pulse fitter 302 further estimates a preliminary gain Q as:
Q = ⟨p(Rd*), hO*T0⟩/∥p(Rd*)∥² (A2)
Preliminary glottal pulse fitter 302 thus produces values for Rd*, O* and Q.
A noise level estimator 304 is configured to estimate an aspiration noise signal ξ within one pitch cycle, using the output from preliminary glottal pulse fitter 302 and the raw glottal source signal h(t), as follows:
ξ=hO*T0−Q·p(Rd*) (A3)
The aspiration noise signal is high-pass filtered by noise level estimator 304, such as with the cutoff frequency set to 500 Hz, and the aspiration noise level ρ is calculated as:
where ξ̃ is the high-pass filtered aspiration noise signal ξ.
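By way of illustration only, the high-pass filtering and noise-level computation might be sketched as follows. Equation A4 itself is not reproduced above, so the level computed here (the RMS of the filtered residual) is an assumption standing in for the exact definition of ρ, and the crude FFT brick-wall filter is likewise only illustrative:

```python
import numpy as np

def aspiration_noise_level(xi, sample_rate, cutoff_hz=500.0):
    """High-pass filter the aspiration noise residual at cutoff_hz (500 Hz
    per the text above) and return a noise level. The brick-wall FFT filter
    and the RMS level are illustrative stand-ins, not equation A4."""
    spectrum = np.fft.rfft(xi)
    freqs = np.fft.rfftfreq(len(xi), d=1.0 / sample_rate)
    spectrum[freqs < cutoff_hz] = 0.0   # remove low-frequency content
    xi_hp = np.fft.irfft(spectrum, n=len(xi))
    return np.sqrt(np.mean(xi_hp ** 2))
```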
A final glottal pulse fitter 306 is configured to perform pulse shape refinement by optimizing Ta with the fixed Tp* and Te* values corresponding to the optimal Rd* value as follows:
where:
σ=ρ·
The golden section search method with parabolic interpolation described in “Algorithms for Minimization without Derivatives” (R. Brent, Prentice-Hall, Englewood Cliffs, N.J., 1973) may be used to solve the scalar minimization problem in equation A5.
Final glottal pulse fitter 306 thus produces values for Ta* and a factor σ* using equation A6 with substitution Ta=Ta*.
A smoother 308 is configured to smooth the temporal trajectories of the glottal pulse parameters Tp*, Te*, and Ta*, and of the aspiration noise level ρ, within each sequence of consecutive voiced frames. This is done by using a moving averaging window, such as of 7 frames. References below to Tp*, Te*, Ta*, and ρ refer to their smoothed values.
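The 7-frame moving-average smoothing might be sketched as follows; a centered window that shrinks at the edges of the trajectory is one reasonable reading of the text, and the function name is hypothetical:

```python
import numpy as np

def smooth_trajectory(values, window=7):
    """Smooth a per-frame parameter trajectory with a centered moving
    average; near the trajectory edges the window shrinks to fit."""
    values = np.asarray(values, dtype=float)
    half = window // 2
    out = np.empty_like(values)
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out[i] = values[lo:hi].mean()
    return out
```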
A vocal tract parameterization unit 310 estimates the glottal source power spectrum in frequency domain as:
Ψ=P(Tp*,Te*,Ta*)+σ*2·Ξ (A7)
and fits to it a low-order all-pole model, such as of order 6. This yields the glottal source auto-regression (AR) operator ψ, which is then applied to the frame waveform s:
ν=ψ⊗s (A8)
where ⊗ designates the convolution operation.
Vocal tract parameterization unit 310 fits an all-pole model to the output ν of the filtering operation of equation A8, producing the vocal tract AR operator which is converted to an LSF vector. In one embodiment operating at the sampling rate of 22 kHz, the vocal tract all-pole model order is set to 40, and therefore the LSF vector dimension is equal to 40.
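By way of illustration only, a conventional autocorrelation-method all-pole fit via the Levinson-Durbin recursion, and the convolutional filtering of equation A8, might be sketched as follows (the conversion of the AR operator to an LSF vector is omitted; function names are hypothetical):

```python
import numpy as np

def fit_all_pole(x, order):
    """Fit an all-pole (AR) model with the autocorrelation method and the
    Levinson-Durbin recursion. Returns coefficients a[0..order] (a[0]=1)
    such that x[n] is predicted by -sum(a[k] * x[n-k]) for k=1..order."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
        err *= (1.0 - k * k)
    return a

def apply_ar_operator(s, a):
    """Apply an AR operator to a frame waveform by convolution, as in
    equation A8 (nu = psi convolved with s)."""
    return np.convolve(a, s)[:len(s)]
```

Fitting such a model to ν and converting the resulting operator to line spectral frequencies would yield the 40-dimensional LSF vector described above for 22 kHz operation.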
A gain estimator 312 is configured to estimate the glottal pulse gain Ee by comparing one-pitch-cycle energy of the frame waveform to one-pitch-cycle energy of a synthetic waveform.
A reference energy is calculated as:
where s(t) is the frame waveform.
A synthetic energy is calculated as:
where:
Gain estimator 312 calculates the gain as:
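The energy equations themselves are not reproduced above. Assuming the gain scales the synthetic waveform so that its one-pitch-cycle energy matches that of the frame waveform, the computation might be sketched as follows; the square-root energy ratio is an assumption, not the specification's exact formula:

```python
import numpy as np

def estimate_gain(frame, synthetic, t0):
    """Estimate the glottal pulse gain Ee by comparing one-pitch-cycle
    energy of the frame waveform to that of a unit-gain synthetic waveform.
    The square-root energy ratio here is an illustrative assumption."""
    e_ref = float(np.sum(np.square(frame[:t0])))
    e_syn = float(np.sum(np.square(synthetic[:t0])))
    return np.sqrt(e_ref / e_syn)
```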
The following vocoder parameters of each voiced frame are stored in augmented TTS voice dataset 112 (
Reference is now made to
VV = {G, T1, T2, …, Tn, B} (S1)
where:
Each virtual voice control is preferably set within a predefined numeric range, such as within the [−1,1] interval, where a virtual voice control with a zero value indicates that no transformation of a corresponding voice characteristic is to be applied during TTS synthesis. In one embodiment, the virtual voice specification is expressed using a human-readable markup language, such as the Extensible Markup Language (XML). For example, a virtual voice specification may be expressed as:
<virtual-voice glottal_tension=“−0.7” timbre=“{0.4,−0.6,0.3}” breathiness=“−0.2”>This is the text to be synthesized.</virtual-voice>
where:
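By way of illustration only, such an XML virtual voice specification might be parsed into the control set of equation S1 as follows; the attribute names follow the example above, and the brace-delimited timbre list format is an assumption based on that example:

```python
import xml.etree.ElementTree as ET

def parse_virtual_voice(xml_text):
    """Parse a virtual-voice specification like the XML example above into
    the virtual voice controls {G, T1..Tn, B} of equation S1 plus the
    input text; missing attributes default to zero (no transformation)."""
    root = ET.fromstring(xml_text)
    g = float(root.get("glottal_tension", "0"))
    b = float(root.get("breathiness", "0"))
    timbre_attr = root.get("timbre", "{}").strip("{}")
    timbres = [float(v) for v in timbre_attr.split(",")] if timbre_attr else []
    return {"G": g, "T": timbres, "B": b, "text": root.text or ""}

spec = parse_virtual_voice(
    '<virtual-voice glottal_tension="-0.7" timbre="{0.4,-0.6,0.3}" '
    'breathiness="-0.2">This is the text to be synthesized.'
    '</virtual-voice>'
)
```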
A front-end unit 402 is configured in accordance with conventional techniques to process the input text and produce a sequence of contextual phonetic labels and prosodic targets associated with the labels using a TTS voice dataset identified by voice dataset identifier V, where the TTS voice dataset is preferably stored in an augmented TTS voice dataset 404, such as is configured in the manner described hereinabove with reference to augmented TTS voice dataset 112 of
A segment selector 406 is configured in accordance with conventional techniques to process the phonetic labels and prosodic targets and produce a sequence of segment identifiers from the segments in the TTS voice dataset identified by voice dataset identifier V.
A frames sequencer 408 is configured to process the sequence of segment identifiers and, using a segmental frames list stored in augmented TTS voice dataset 404 in association with voice dataset identifier V, produce a sequence of frame identifiers corresponding to the sequence of segments identified by the sequence of segment identifiers.
Front end 402, segment selector 406, and frames sequencer 408 are collectively referred to herein as frame selector 410.
A voice transformer 412 is configured to use the sequence of frame identifiers to identify and retrieve corresponding voiced frames as represented by their vocoder parameters from augmented TTS voice dataset 404 that are associated with voice dataset identifier V. Using the virtual voice controls of equation S1, voice transformer 412 modifies the glottal pulse parameters, vocal tract parameters, and aspiration noise level of each voiced frame as described below.
Glottal Pulse Modification
Voice transformer 412 modifies the glottal pulse parameters of each voiced frame using virtual voice control G. A modified glottal pulse parameters vector Pout is calculated as:
where:
The two polar glottal pulses Plax and Ptense are preferably defined to correspond respectively to a relatively low and a relatively high glottal tension. For example, using the LF glottal pulse model, the following settings may be used:
The glottal pulse modification represents a mixing of the original glottal pulse and either the lax or the tense polar glottal pulse depending on the value of virtual voice control G. The mixing proportion depends on the absolute value of G, where a positive G value increases the perceived glottal tension while a negative G value decreases the perceived glottal tension. The predefined constants γmin and γmax define limits of the mixing proportion. This is shown by way of example in
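The mixing described above might be sketched as follows. The exact mixing formula (equations S2-S4) is not reproduced in the text, so the linear interpolation with proportion γ = γmin + |G|·(γmax − γmin) is an assumption, as are the function name and default constants:

```python
import numpy as np

def mix_glottal_pulse(p_orig, p_lax, p_tense, g, gamma_min=0.0, gamma_max=0.8):
    """Mix the original glottal pulse parameter vector with a polar pulse:
    positive G mixes toward the tense pulse, negative G toward the lax
    pulse, with the proportion growing with |G| between gamma_min and
    gamma_max. The linear interpolation is an illustrative assumption."""
    p_orig = np.asarray(p_orig, dtype=float)
    polar = np.asarray(p_tense if g >= 0 else p_lax, dtype=float)
    gamma = gamma_min + abs(g) * (gamma_max - gamma_min)
    return (1.0 - gamma) * p_orig + gamma * polar
```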
Vocal Tract Modification
Voice transformer 412 modifies the vocal tract component of each voiced frame using virtual voice controls {Ti, i=1, . . . , n}. The vocal tract parameters are converted to the vocal tract power spectrum V(f) in accordance with conventional techniques. A modified vocal tract power spectrum Vmod(f) is calculated as:
Vmod(f) = V(w(f)) (S5)
where w(f) is a monotonic piecewise-linear frequency warping function passing through N break points as follows:
{(yi, xi = w(yi)), i = 1, …, N} (S6)
where (N−2) is equal to or greater than the number n of virtual voice controls {Ti}. The break point coordinates xi and yi, referred to as input and output nodes respectively, are set such that:
x0 = y0 = 0; xk < xk+1; yk < yk+1; xN = yN = Fs/2 (S7)
where Fs is the sampling frequency. The frequency warping function for any frequency f that falls within an interval [yk, yk+1] is calculated as:
In various embodiments, all the input nodes are predefined, whereas the output nodes are set depending on the virtual voice controls T1, T2, . . . , Tn. For example, for synthesizing speech at a sampling rate of Fs=22050 Hz, the input and output nodes may be defined as follows:
x0 = y0 = 0
x1 = 200; y1 = (200, 100, 350, T1)
x2 = 600; y2 = (600, 400, 800, T2)
x3 = 1200; y3 = (1200, 900, 1600, T3)
x4 = 2200; y4 = (2200, 1900, 2600, T4)
x5 = y5 = 4000
x6 = y6 = 11025 (S9)
In equation S9 the function is defined as:
In the equations above, the output nodes y1, y2, y3, y4 are interpolated between predefined values according to the vocal tract control parameters T1, T2, T3 and T4 respectively. Such settings allow for controlling the vocal tract shape in the perceptually important frequency band 0-4000 Hz and thus changing the perceived personality of the synthesized voice.
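The warping might be sketched as follows. The closed form of equation S8 is not reproduced above, but linear interpolation between adjacent break points is the natural reading of a piecewise-linear w(f); the interpolation function of equation S9 is likewise not shown, so its linear-in-T form below is an assumption:

```python
import numpy as np

def warp_frequency(f, x_nodes, y_nodes):
    """Piecewise-linear warping w(f): maps an output frequency f lying in
    [y_k, y_k+1] linearly onto [x_k, x_k+1], so that w(y_i) = x_i
    (cf. equations S5-S7)."""
    return np.interp(f, y_nodes, x_nodes)

def output_node(x0, y_min, y_max, t):
    """Interpolated output node in the spirit of equation S9: t in [-1, 1]
    moves the node from its nominal value x0 toward y_min (t < 0) or
    y_max (t > 0). The exact interpolation function is not shown in the
    text above, so this linear form is an assumption."""
    return x0 + t * ((y_max - x0) if t >= 0 else (x0 - y_min))
```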
The modified power spectrum (S5) is converted to modified vocal tract parameters in accordance with conventional techniques. For example, if the LSF representation is used for the vocal tract component, then an all-pole model is fitted to the modified power spectrum yielding an auto-regression operator which is converted to an LSF vector.
To compensate for energy change resulting from the vocal tract modification, the gain parameter Ee of the frame is modified as:
Aspiration Noise Modification
Voice transformer 412 modifies the aspiration noise level of each voiced frame in accordance with virtual voice control B:
where:
A smoother 414 is configured to perform smoothing of the temporal trajectories of the transformed vocoder parameters of consecutive voiced frames across non-contiguous segment joints, which are joints between two segments that are not consecutive segments of the same speech signal. This smoothing may be implemented by applying a moving averaging window to the parameter trajectory in a vicinity of the non-contiguous segment joints. For example, the vicinity radius may be set to 5 frames, and the moving averaging window size may be set to 7 frames. Preferably, all of the vocoder parameters of the voiced frames are smoothed in this manner.
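The joint-local smoothing might be sketched as follows, using the example settings above (vicinity radius of 5 frames, averaging window of 7 frames); the function name is hypothetical:

```python
import numpy as np

def smooth_at_joints(trajectory, joint_indices, radius=5, window=7):
    """Apply a centered moving average to a vocoder parameter trajectory,
    but only to frames within `radius` frames of a non-contiguous segment
    joint; frames far from any joint are left untouched."""
    traj = np.asarray(trajectory, dtype=float)
    out = traj.copy()
    half = window // 2
    for j in joint_indices:
        for i in range(max(0, j - radius), min(len(traj), j + radius + 1)):
            lo, hi = max(0, i - half), min(len(traj), i + half + 1)
            out[i] = traj[lo:hi].mean()
    return out
```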
A decoder 416 is configured to assemble sequences of consecutive voiced frames, where each voiced frame is represented by its smoothed transformed vocoder parameters. Each such sequence represents a voiced region. Decoder 416 converts each sequence to a speech waveform representing the respective voiced region. Decoder 416 may employ any decoding technique suitable for use with the vocoder parameters described herein. In one embodiment, LSF vocal tract parameters and an LF glottal pulse model are decoded using the method described in "Mixed source model and its adapted vocal-tract filter estimate for voice transformation and synthesis" (G. Degottex et al., Speech Communication, 2013).
A concatenator 418 is configured to compose a final synthesized speech signal by splicing the synthesized voiced region waveforms produced by decoder 416 together with unvoiced frames from the augmented TTS voice dataset 404, which may be performed in accordance with any conventional splicing technique, thereby producing a digital audio signal of synthesized speech.
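One conventional splicing technique, given only as an example since any such technique may be used, is a linear crossfade at each joint:

```python
import numpy as np

def splice(waveforms, overlap=64):
    """Concatenate waveform chunks with a linear crossfade of `overlap`
    samples at each joint, one conventional splicing approach."""
    out = np.asarray(waveforms[0], dtype=float)
    fade_in = np.linspace(0.0, 1.0, overlap)
    for w in waveforms[1:]:
        w = np.asarray(w, dtype=float)
        head = out[:-overlap]
        cross = out[-overlap:] * (1.0 - fade_in) + w[:overlap] * fade_in
        out = np.concatenate([head, cross, w[overlap:]])
    return out
```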
Control values for other types of modifications, such as of pitch and speech rate, may be added to the virtual voice specification and applied by decoder 416 in accordance with conventional techniques.
Although embodiments of the invention described herein employ a unit selection text-to-speech synthesis technology also known as concatenative text-to-speech, the invention, with modifications and variations that will be apparent to those of ordinary skill in the art, can be applied to a system employing statistical text-to-speech technology using statistical models, such as Hidden Markov Models (HMM) or Deep Neural Networks (DNN), for vocoder parameter generation.
Reference is now made to
Any of the elements shown in the drawings and described herein are preferably implemented by one or more computers in computer hardware and/or in computer software embodied in a non-transitory, computer-readable medium in accordance with conventional techniques.
Referring now to
It is to be appreciated that the term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other processing circuitry. It is also to be understood that the term “processor” may refer to more than one processing device and that various elements associated with a processing device may be shared by other processing devices.
The term “memory” as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, a fixed memory device (e.g., hard drive), a removable memory device (e.g., diskette), flash memory, etc. Such memory may be considered a computer readable storage medium.
In addition, the phrase “input/output devices” or “I/O devices” as used herein is intended to include, for example, one or more input devices (e.g., keyboard, mouse, scanner, etc.) for entering data to the processing unit, and/or one or more output devices (e.g., speaker, display, printer, etc.) for presenting results associated with the processing unit.
Embodiments of the invention may include a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the invention.
Aspects of the invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The descriptions of the various embodiments of the invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.