The invention relates to a system and method for converting text to speech. In particular, the invention relates to the conversion of text to speech using fixed-length spectrograms and neural networks.
There are a number of text-to-speech synthesis systems in the prior art. The two main categories of text-to-speech synthesis systems are: unit-selection and parametric systems. Unit-selection methods find and join audio segments together to achieve the synthesized speech. These systems typically have high quality output but also have several drawbacks such as: requires large amounts of memory, require high computational power, and have limited flexibility to customize to new speakers or emotions. Parametric methods use statistical models to learn the conversion from text to speech. Some parametric systems employ hidden Markov models (HMM) to perform the conversion. Unfortunately, these HMM's are not capable of encoding standard spectrograms of phonemes because the temporal length of these phonemes is highly variable. Instead, HMM's assume that the sequences of speech segments have a temporal dependence on each other in highly short segments (under 100 milliseconds), which is a sub-optimal model for human speech since speech has transitions that have a much longer context (sometimes ranging to several words). Also the spectrograms are altered in order make the data suitable for the HMM's. Unfortunately, these alterations and modelling assumptions impact the accuracy with which the spectrograms can be encoded in the HMM. There is therefore a need for a system that can encode spectrograms without loss of precision or robustness.
The invention in some embodiments features a system and method for converting text to speech. The system includes a phoneme generator for first converting the text to a sequence of phonemes, a feature generator defining the character of the phonemes and how they are to be pronounced and accented, a spectrum generator for querying a neural network to retrieve normalized spectrograms based on the input of the sequence of phonemes and the plurality of text features, a duration generator for outputting a plurality of durations that are associated with the phonemes, and a speech synthesizer configured to: modify the temporal length of each normalized spectrogram based on the associated duration, and combine the plurality of the modified spectrograms into speech. The normalized spectrograms are fixed-length spectrograms with uniform temporal length (i.e., data size), which enables them to be effectively encoded into and retrieved from a neural network representation. When converting the text to speech, the normalized spectrograms retrieved from the neural network are generally de-normalized, i.e., expanded in time or compressed in time, thereby producing variable length spectrograms which yield speech that is realistic sounding.
In some embodiments, the pitch contours associated with phonemes are also normalized before retrieval and subsequently de-normalized based on the associated duration. The de-normalized pitch contours are used to convert the de-normalized spectrograms into waveforms that are concatenated into the synthetic speech.
In the embodiment described above, phonemes are the basic unit of speech. In other embodiments, basic speech units are based on diphones, tri-phones, syllables, words, minor phrases, major phrases, other meaningful speech units, or combination thereof for example.
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, and in which:
The present invention is a system and method for producing speech from text. The text is provided directly by a user or is generated by a software application (“app”), for example. The invention then synthesizes speech from the text. The synthesized speech possesses the same linguistic content as the text, and the speaker sounds the same as the speaker that provided the recorded sentences. In the preferred embodiment, the text is converted using three deep neural networks, which are first trained using training speech from this same speaker.
The deep neural networks are each configured to model one of the three main characteristics of human speech necessary to produce speech:
1) speaking rate or duration: The system has to learn how long does the speaker utter a speech unit. The speech unit is typically a phoneme, diphone, triphone, syllable, and word. Shorter or longer speech segments might be considered as units.
2) intonation: Intonation is usually realized as the fundamental frequency (f0) of speech units. The human perception of fundamental frequency is called pitch, which can be represented as logarithm of f0. The fundamental frequency is varied by humans by their movements of vocal cords, generating different intonations.
3) spectrum: The spectral shape of the speech segments are a function of the speaker's vocal tract shape (tongue position, lip shape, etc). For synthesizing speech, we have to estimate the spectral shape from text. The spectral envelope is represented using mel-cepstrum in this embodiment.
Using the three aspects above, the speech is synthesized from text.
Referring to the functional block diagram in
Forced alignment 116 is then used to align, in time, the phonemes of the phoneme sequence with segments of audio for the same phonemes. In particular, forced alignment produces the sequences of phonemes with the start time and end time for each of the phonemes in the audio data. The start time and end time refer to the time in the audio file where the speaker begins uttering the phoneme and finishes uttering the phoneme, respectively. There is a start time and end time for all the phonemes of the phoneme sequence.
Once the start time and end time of a sequence is known, the text feature extraction module 118 extracts various attributes and parameters that characterize the audio data in general and the speaker in particular. In the preferred embodiment, the feature extractor extracts approximately 300 features per phoneme. The sequence of phonemes and features is provided as output in the form of a matrix referred to herein as the “training text matrix” 120, where the size of the two-dimensional matrix is given by the product of the number of phonemes and the number of features per phoneme. In the alternative to forced alignment, the invention is some embodiments utilize human experts to extract and label phonemes and features.
In addition to the training text matrix 110, the system also includes a converter 122 that converts the audio representation of the speaker from an audio file to a representation in terms of Mel Cepstral coefficients and pitch. The Mel Cepstral coefficients and pitch are equivalent to the audio representation in the audio database 112 but are easier to work with.
Based on the output of the forced alignment 116 and the mel cepstal coefficients, a spectrogram generator 120 produces a spectrum matrix 140. The spectrum matrix has a size given by n×(t*f), where n is the number of phonemes, t is the time-resolution of the spectrogram section (preferably 10<t<50), and f is the frequency-resolution of the spectrogram. If the actual given length of the audio segment for the phoneme is different than t, the system may “interpolate” the spectral values to get t samples. To represent a spectrum with a 8 kHz bandwidth using mel-cepstral coefficients, the frequency resolution is set to 100.
In accordance with some embodiments of the invention, the spectrogram generator 120 generates “normalized” spectra 140. Normalized spectra refer to spectra that are represented with a fixed number of data points independent of the duration of the phoneme. In the preferred embodiment, the length of the spectrogram is approximately 50 points in the temporal dimension for all spectrograms. This is different from the prior art where the sampling rate may be fixed which yields a number of spectrogram frames that is linearly proportional to the duration of the phoneme. Using the present invention, however, the entire length of each spectrogram can be properly modeled using the deep neural network.
In addition to the spectrum matrix 140, the duration and pitch module 124 generates a duration matrix 141 and a pitch matrix 142. The duration matrix, which includes the start time and end time for all phoneme, is represented by a matrix having a size n×t. The pitch contour in the pitch matrix is represented by the logarithm of f0, i.e., the pitch or fundamental frequency. The pitch contour is preferably normalized so that all pitch contours have the same fixed-length (analogous to the normalized spectrograms). Interpolation may be used to fill in the times where there is no f0 value available due to unvoiced segments, for example. The pitch matrix 142 has a fixed size given by n×t.
Thereafter, the neural network trainer 150 trains three deep neural networks, one for the spectrum data, one for the duration data, and one for the pitch data. The training is accomplished using a neural network. Let X=[x1, . . . , xN] represent the training text matrix and Y=[y1, . . . , yN] represent the output matrix which is the spectrum matrix, the pitch matrix and the duration matrix for the spectrum DNN, the pitch DNN, and the duration DNN, respectively. The dimension of the matrices are N×M, where N is the number of data samples and M is the dimension of the normalized spectrum representation, the normalized pitch contour, and the duration for the spectrum DNN training, pitch DNN training, and the duration DNN training, respectively.
A DNN model is a sequence of layers of linear transformations of the form Wx+b, where a linear or non-linear transfer function ƒ(.) can be applied to each transformation as y=ƒ(Wx+b). The matrices Wand b are called weight and bias, respectively. The input and output to each layer are represented by x and y, respectively. The goal of DNN training is to estimate all the weights and biases of all the layers, hereon referred to as parameters, of the DNN. We label the parameters and transfer functions of the DNN by subscripts representing their layer number, starting from the input layer. The mapping function for Q-layered DNN is represented as:
Ŷ=F(X)=ƒQ(WQ . . . ƒ2(W2ƒ1(W1X+b2)+b2)+bQ)
The goal of the DNN training stage is to optimize the F function by estimating the parameters of the model:
Ŷ=F(X)
such that Ŷ is the most similar to Y. In the present invention, the Y and Ŷ matrices are deemed to be “similar” if the Mean Squared Error (MSE) and the standard deviation of the spectral representations are minimized. Other similarity measures such as cross-entropy, can also be utilized for DNN training purposes. The proposed cost function, also referred to herein as the proposed criterion, is given by:
Cost=MSE(Y,Y)+STD(Y,Y)
where
and
Where SD(Y) (and similarly, SD(Ŷ)) is a 1×M matrix in which each column represents the standard deviation of each dimension, computed for each dimension m as
The vectors X and Y can be used to train the deep neural network using a batch training process that evaluates all data at once in each iteration before updating the weights and biases using a gradient descent algorithm. In an alternate embodiment, training is performed in mini batches sometimes referred to as stochastic gradient descent where the batch of all samples is split into smaller batches, e.g., 100 samples each, and the weights and biases updated after each mini-batch iteration.
Illustrated in
In contrast to the prior art,
In addition to normalized spectrograms, the preferred embodiment also normalizes the temporal duration of the corresponding pitch contours for purposes of training the deep neural networks 162. After the DNNs are trained and subsequently used to retrieve spectral information, the spectrograms are “de-normalized” and used to produce a sequence of audio data that becomes the synthesized speech.
Illustrated in
Thereafter, a deep neural network is trained 390 to encode the information represented in each of the spectrum matrix, the duration matrix, and the pitch matrix.
Text features include, but are not limited to:
Syllable level:
Word level:
Phrase level:
Utterance:
Illustrated in
The text matrix is then provided as input to the spectrum generator 450, duration generator 460, and pitch generator 470. In turn, the spectrum generator 450 queries the spectrum DNN 160 with each of the individual phonemes of the text matrix and the text features associated with the individual phoneme. The spectrum DNN 160 then outputs the spectrogram for that phoneme and features. The spectrogram represents the voice of the speaker with the attributes defined by the text features. As described above, the temporal length of the spectrogram is fixed so all the phonemes are unit length. A spectrum of unit length is referred to herein as a normalized spectrum.
In parallel to the spectrum generator, the duration generator 460 queries the duration DNN 161 to produce a number representing the time, i.e., the duration, the corresponding phoneme should take to utter based upon the phoneme and the text features. Also in parallel, the pitch generator 470 queries the pitch DNN 162 to produce an array representing the normalized pitch contours over the same unit length used to encode the spectrogram. This pitch contour represents the fundamental frequency sequence with which the phoneme should be spoken based upon the particular phoneme and the text features.
For each phoneme, the speech synthesis module 480 alters the normalized spectrum based on the duration provided by the duration generator. In particular, the de-normalize spectrum module 482 either stretches or compresses the normalized spectrum so that it is the length specified by the duration. In some cases, the normalized spectrogram is down-sampled to remove frames to reduce the length of the spectrum. In other cases, interpolation may be used to add frames to increase the duration beyond that of the normalized spectrogram.
Similar to that above, the de-normalize pitch module 484 either stretches or compresses the normalized pitch contour so that it is the length specified by the duration. In some cases, the pitch array is down-sampled to remove frames to reduce the length. In other cases, interpolation may be used to add frames to increase the duration beyond that of the normalized pitch contour.
After the de-normalized spectrum and de-normalized pitch array are generated, the speech synthesis module 480 generates a segment of speech reflecting this newly generated spectrogram and pitch. The process is repeated for each phoneme of the phoneme sequence and the resulting segments of speech concatenated and filtered as necessary to yield the newly synthesized speech outputted to the database 490.
One or more embodiments of the present invention may be implemented with one or more computer readable media, wherein each medium may be configured to include thereon data or computer executable instructions for manipulating data. The computer executable instructions include data structures, objects, programs, routines, or other program modules that may be accessed by a processing system, such as one associated with a general-purpose computer or processor capable of performing various different functions or one associated with a special-purpose computer capable of performing a limited number of functions. Computer executable instructions cause the processing system to perform a particular function or group of functions and are examples of program code means for implementing steps for methods disclosed herein. Furthermore, a particular sequence of the executable instructions provides an example of corresponding acts that may be used to implement such steps. Examples of computer readable media include random-access memory (“RAM”), read-only memory (“ROM”), programmable read-only memory (“PROM”), erasable programmable read-only memory (“EPROM”), electrically erasable programmable read-only memory (“EEPROM”), compact disk read-only memory (“CD-ROM”), or any other device or component that is capable of providing data or executable instructions that may be accessed by a processing system. Examples of mass storage devices incorporating computer readable media include hard disk drives, magnetic disk drives, tape drives, optical disk drives, and solid state memory chips, for example. The term processor as used herein refers to a number of processing devices including personal computing devices, servers, general purpose computers, special purpose computers, application-specific integrated circuit (ASIC), and digital/analog circuits with discrete components, for example.
Although the description above contains many specifications, these should not be construed as limiting the scope of the invention but as merely providing illustrations of some of the presently preferred embodiments of this invention.
Therefore, the invention has been disclosed by way of example and not limitation, and reference should be made to the following claims to determine the scope of the present invention.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 62/204,878 filed Aug. 13, 2015, titled “Text to speech synthesis using deep neural network with constant unit length spectrogram modelling,” which is hereby incorporated by reference herein for all purposes.
Number | Date | Country | |
---|---|---|---|
62204878 | Aug 2015 | US |