Embodiments described herein relate to a text-to-speech synthesis method, a text-to-speech synthesis system, and a method of training a text-to speech system.
Embodiments described herein also relate to a method of calculating an expressivity score.
Text-to-speech (TTS) synthesis methods and systems are used in many applications, for example in devices for navigation and personal digital assistants. TTS synthesis methods and systems can also be used to provide speech segments that can be used in games, movies or other media comprising speech.
There is a continuing need to improve TTS synthesis systems. In particular, there is a need to improve the quality of speech generated by TTS systems such that the speech generated retains vocal expressiveness. Expressive speech may convey emotional information and sounds natural, realistic and human-like. TTS systems often comprise algorithms that need to be trained using training samples and there is a continuing need to improve the method by which the TTS system is trained such that the TTS system generates expressive speech.
Systems and methods in accordance with non-limiting examples will now be described with reference to the accompanying figures in which:
According to a first aspect of the invention, there is provided a text-to-speech synthesis method comprising:
Methods in accordance with embodiment described herein provide an improvement to text-to-speech synthesis by providing a neural network that is trained to generate expressive speech. Expressive speech is speech that conveys emotional information and sounds natural, realistic and human-like. The disclosed method ensures that the trained neural network can accurately generate speech from text, the generated speech is comprehensible, and is more expressive than speech generated using a neural network trained using the first dataset directly.
In an embodiment, the expressivity score is obtained by extracting a first speech parameter for each audio sample; deriving a second speech parameter from the first speech parameter; comparing the value of the second parameter to the first speech parameter.
In an embodiment, the first speech parameter comprises the fundamental frequency.
In an embodiment, the second speech parameter comprises the average of the first speech parameter of all audio samples in the dataset.
In another embodiment, the first speech parameter comprises a mean of the square of the rate of change of the fundamental frequency.
In an embodiment, the second sub-dataset is obtained by pruning audio samples with lower expressivity scores from the first sub-dataset.
In an embodiment, audio samples with a higher expressivity score are selected from the first training dataset and allocated to the second sub-dataset, and audio samples with a lower expressive score are selected from the first training dataset and allocated to the first sub-dataset.
In an embodiment, the neural network is trained using the first sub-dataset for a first number of training steps, and then using the second sub-dataset for a second number of training steps.
In an embodiment, the neural network is trained using the first sub-dataset for a first time duration, and then using the second sub-dataset for a second time duration.
In an embodiment, the neural network is trained using the first sub-dataset until a training metric achieves a first predetermined threshold, and then further trained using the second sub-dataset. In an example, the training metric is a quantitative representation of how well the output of the trained neural network matches a corresponding audio data sample.
According to a second aspect of the invention, there is provided a method of calculating an expressivity score of audio samples in a dataset, the method comprising: extracting a first speech parameter for each audio sample of the dataset; deriving a second speech parameter from the first speech parameter; and comparing the value of the second parameter to the first parameter.
The disclosed method provides an improvement in the evaluation of an expressivity score for an audio sample. The disclosed method is quick and accurate. Empirically, it has been observed that the disclosed method correlates well with subjective assessments of expressivity made by human operators. The disclosed method is quicker, more consistent, more accurate, and more reliable than assessments of expressivity made by human operators.
According to a third aspect of the invention, there is provided a method of training a text-to-speech synthesis system that comprises a prediction network, wherein the prediction network comprises a neural network, the method comprising: receiving a first training dataset comprising audio data and corresponding text data;
In an embodiment, the method further comprised training the neural network using a second training dataset. The neural network may be trained to gain further speech abilities.
In an embodiment the average expressivity score of the audio data in the second training dataset is higher than the average expressivity score of the audio data in the first training dataset.
According to a fourth aspect of the invention, there is provided a text-to-speech synthesis system comprising:
receiving a first training dataset comprising audio data and corresponding text data;
In an embodiment, the system comprises a vocoder that is configured to convert the speech data into an output speech data. In an example, the output speech data comprises an audio waveform.
In an embodiment, the system comprises an expressivity scorer module configured to calculate an expressivity score for audio samples.
In an embodiment, the prediction network comprises a sequence-to-sequence model.
According to a fifth aspect of the invention, there is provided speech data generated by a text-to-speech system according to the third aspect of the invention. The speech data disclosed is expressive and that conveys emotional information and sounds natural, realistic and human-likes.
In an embodiment, the speech data is an audio file of synthesised expressive speech.
According to a sixth aspect of the invention, there is provided a carrier medium comprising computer readable code configured to cause a computer to perform any of the methods above.
The methods are computer-implemented methods. Since some methods in accordance with examples can be implemented by software, some examples encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal. The carrier medium may comprise a non-transitory computer readable storage medium.
The system comprises a prediction network 21 configured to convert input text 7 into a speech data 25. The speech data 25 is also referred to as the intermediate speech data 25. The system further comprises a Vocoder that converts the intermediate speech data 25 into an output speech 9. The prediction network 21 comprises a neural network (NN). The Vocoder also comprises a NN.
The prediction network 21 receives a text input 7 and is configured to convert the text input 7 into an intermediate speech data 25. The intermediate speech data 25 comprises information from which an audio waveform may be derived. The intermediate speech data 25 may be highly compressed while retaining sufficient information to convey vocal expressiveness. The generation of the intermediate speech data 25 will be described further below in relation to
The text input 7 may be in the form of a text file or any other suitable text form such as ASCII text string. The text may be in the form of single sentences or longer samples of text. A text front-end, which is not shown, converts the text sample into a sequence of individual characters (e.g. “a”, “b”, “c” . . . ). In another example, the text front-end converts the text sample into a sequence of phonemes (/k/, /p/, . . . ).
The intermediate speech data 25 comprises data encoded in a form from which a speech sound waveform can be obtained. For example, the intermediate speech data may be a frequency domain representation of the synthesised speech. In a further example, the intermediate speech data is a spectrogram. A spectrogram may encode a magnitude of a complex number as a function of frequency and time. In a further example, the intermediate speech data 25 may be a mel spectrogram. A mel spectrogram is related to a speech sound waveform in the following manner: a short-term Fourier transform (STFT) is computed over a finite frame size, where the frame size may be 50 ms, and a suitable window function (e.g. a Hann window) may be used; and the magnitude of the STFT is converted to a mel scale by applying a non-linear transform to the frequency axis of the STFT, where the non-linear transform is, for example, a logarithmic function.
The Vocoder module takes the intermediate speech data 25 as input and is configured to convert the intermediate speech data 25 into a speech output 9. The speech output 9 is an audio file of synthesised expressive speech and/or information that enables generation of expressive speech. The Vocoder module will be described further below.
In another example, which is not shown, the intermediate speech data 25 may be in a form from which an output speech 9 can be directly obtained. In such a system, the Vocoder 23 is optional.
The prediction network 21 comprises an Encoder 31, an attention network 33, and decoder 35. As shown in
The Encoder 31 takes as input the text input 7. The encoder 31 comprises a character embedding module (not shown) which is configured to convert the text input 7, which may be in the form words, sentences, paragraphs, or other forms, into a sequence of characters. Alternatively, the encoder may convert the text input into a sequence of phonemes. Each character from the sequence of characters may be represented by a learned 512-dimensional character embedding. Characters from the sequence of characters are passed through a number of convolutional layers. The number of convolutional layers may be equal to three for example. The convolutional layers model longer term context in the character input sequence. The convolutional layers each contain 512 filters and each filter has a 5×1 shape so that each filer spans 5 characters. After the stack of three convolutional layers, the input characters are passed through batch normalization step (not shown) and ReLU activations (not shown). The encoder 31 is configured to convert the sequence of characters (or alternatively phonemes) into encoded features 311 which is then further processed by the attention network 33 and the decoder 35.
The output of the convolutional layers is passed to a recurrent neural network (RNN). The RNN may be a long-short term memory (LSTM) neural network (NN). Other types of RNN may also be used. According to one example, the RNN may be a single bi-directional LSTM containing 512 units (256 in each direction). The RNN is configured to generate encoded features 311. The encoded features 311 output by the RNN may be a vector with a dimension k.
The Attention Network 33 is configured to summarize the full encoded features 311 output by the RNN and output a fixed-length context vector 331. The fixed-length context vector 331 is used by the decoder 35 for each decoding step. The attention network 33 may take information (such as weights) from previous decoding steps (that is, from previous speech frames decoded by decoder) in order to output a fixed-length context vector 331. The function of the attention network 33 may be understood to be to act as a mask that focusses on the important features of the encoded features 311 output by the encoder 31. This allows the decoder 35, to focus on different parts of the encoded features 311 output by the encoder 31 on every step. The output of the attention network 33, the fixed-length context vector 331, may have dimension m, where m may be less than k. According to a further example, the Attention network 33 is a location-based attention network.
According to one embodiment, the attention network 33 takes as input an encoded feature vector 311 denoted ash={h1, h2, . . . , hk}. A(i) is a vector of attention weights (called alignment). The vector A(i) is generated from a function attend(s(i−1), A(i−1), h), where s(i−1) is a previous decoding state and A(i−1) is a previous alignment. s(i−1) is 0 for the first iteration of first step. The attend( ) function is implemented by scoring each element in h separately and normalising the score. G(i) is computed from G(i)=93kA(i,k)×hk. The output of the attention network 33 is generated as Y(i)=generate(s(i−1), G(i)), where generate( ) may be implemented using a recurrent layer of 256 gated recurrent units (GRU) units for example. The attention network 33 also computes a new state s(i)=recurrency(s(i−1), G(i), Y(i)), where recurrency( ) is implemented using LSTM.
The decoder 35 is an autoregressive RNN which decodes information one frame at a time. The information directed to the decoder 35 is be the fixed length context vector 331 from the attention network 33. In another example, the information directed to the decoder 35 is the fixed length context vector 331 from the attention network 33 concatenated with a prediction of the decoder 35 from the previous step. In each decoding step, that is, for each frame being decoded, the decoder may use the results from previous frames as an input to decode the current frame. In an example, as shown in
The parameters of the encoder 31, decoder 35, predictor 39 and the attention weights of the attention network 33 are the trainable parameters of the prediction network 21.
According to another example, the prediction network 21 comprises an architecture according to Shen et al. “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
Returning to
According to an embodiment, the Vocoder 23 comprises a convolutional neural network (CNN). The input to the Vocoder 23 is a frame of the mel spectrogram provided by the prediction network 21 as described above in relation to
According to an alternative example, the Vocoder 23 comprises a convolutional neural network (CNN). The input to the Vocoder 23 is derived from a frame of the mel spectrogram provided by the prediction network 21 as described above in relation to
According to another example, the Vocoder 23 comprises a WaveNet NN architecture such as that described in Shen et al. “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
According to a further example, the Vocoder 23 comprises a WaveGlow NN architecture such as that described in Prenger et al. “Waveglow: A flow-based generative network for speech synthesis.” ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
According to an alternative example, the Vocoder 23 comprises any deep learning based speech model that converts an intermediate speech data 25 into output speech 9.
According to another alternative embodiment, the Vocoder 23 is optional. Instead of a Vocoder, the prediction network 21 of the system 1 further comprises a conversion module (not shown) that converts intermediate speech data 25 into output speech 9. The conversion module may use an algorithm rather than relying on a trained neural network. In an example, the Griffin-Lim algorithm is used. The Griffin-Lim algorithm takes the entire (magnitude) spectrogram from the intermediate speech data 25, adds a randomly initialised phase to form a complex spectogram, and iteratively estimates the missing phase information by: repeatedly converting the complex spectrogram to a time domain signal, converting the time domain signal back to frequency domain using STFT to obtain both magnitude and phase, and updating the complex spectrogram by using the original magnitude values and the most recent calculated phase values. The last updated complex spectrogram is converted to a time domain signal using inverse STFT to provide output speech 9.
According to an example, the prediction network 21 is trained from a first training dataset 41 of text data 41a and audio data 41b pairs as shown in
The training of the Vocoder 23 according to an embodiment is illustrated in
The training of the Vocoder 23 according to another embodiment is illustrated in
In an alternative embodiment which is not shown, the audio data 41b from the original training dataset 41 is assessed by a human operator. In this case, the human operator listens to the audio data 41b and assigns a score to each sample. In yet another alternative embodiment, the audio data 41b is scored by several human operators. Each human operator may assign a different score to the same sample. An average of the different human scores for each sample is taken and assigned to the sample. The outcome of human operator based scoring is that audio samples from the audio data 41b are assigned a score. As explained in relation to
In an embodiment, the audio data 41b is assigned a score by the human operator as well as a label indicating a further property. For example, the further property is an emotion (sad, angry, etc . . . ), an accent (e.g. British English, French . . . ), style (e.g. shouting, whispering etc . . . ), or non-verbal sounds (e.g. grunts, shouts, screams, um's, ah's, breaths, laughter, crying etc . . . ). The TDS module is then configured to receive a label as an input and the TDS module is configured to select text and audio pairs that correspond to the inputted label.
In another embodiment, the label indicating the further property is assigned to the audio data 41b as it is generated. For example, as a voice actor records an audio sample, the voice actor also assigns a label indicating the further property, where, for example, the further property is an emotion (sad, angry, etc . . . ), an accent (e.g. British English, French . . . ), style (e.g. shouting, whispering etc . . . ), or non-verbal sounds (e.g. grunts, shouts, screams, um's, ah's, breaths, laughter, crying etc . . . ). The TDS module is then configured to receive a label as an input and the TDS module is configured to select text and audio pairs that correspond to the inputted label.
According to another embodiment which described further below in relation to
The TDS module will be described further below in relation to
The method of training the prediction network 21 in the configuration shown in
An example of an algorithm for estimating fo is the YIN algorithm in which: (i) the autocorrelation rt of a signal xt over a window W is found; (ii) a difference function (DF) is found from the difference between xt (assumed to be periodic with period T) and xt+T, where xt+T represents signal xt shifted by a candidate value of T; (iii) a cumulative mean normalised difference function (CMNDF) is derived from DF in (ii) to account for errors due to imperfect periodicities; (iv) an absolute threshold is applied to the value of the CMNDF to determine if the candidate value of T is acceptable; (v) considering each local minimum in the CMNDF; and (vi) determining which value of T gives the smallest CMNDF. However, it will be understood that other parameters such as the first three formants (F1, F2, F3) could also be used. It will also be understood that a plurality of speech parameters could be used in combination. The parameter f0 is related to the perception of pitch by the human ear and is sometimes referred to as the pitch. In the examples shown in
A second speech parameter is determined from the first speech parameter. According to an embodiment, the second speech parameter is obtained as the average of the first speech parameter <fn0(t)> for one or more samples in the dataset. In an embodiment, as shown in
According to another embodiment, the second speech parameter is obtained as the mean of the square of the rate of change of the fundamental frequency for one or more samples in the dataset. A discrete value for the expressivity score of an audio sample is computed by the ES module 51.
According to another embodiment, a discrete value for the expressivity score of an audio sample is formed using emf and emv in combination.
According to an example, k=10 such that discrete expressivity scores of 0, 1, 2, . . . ,10 are available. According to one example, a sample having an expressivity score of 1 or above is considered to be expressive. It will be understood, however, that samples having scores above any predetermined level may be considered to be expressive. For example, it may be preferred that a sample having a score above any value from 2, 3, 4, 5, 6, 7,8, 9, 10 or any value therebetween, is considered to be expressive.
According to one example which is not shown, the average is the arithmetic mean, or median, or mode, of all the time averaged fm0(t). Furthermore, for each sample, the variability of fm0(t) for each sample, denoted as σm0, is computed. The average variability, which is the average value of σm0 for all samples is determined. The average variability may be the arithmetic mean, or median, or mode of all values of σm0. The average variability is assigned an expressivity score of zero. For the other end of the scale, the maximum value of σm0 over all m samples is identified and assigned a value of 10. In step 63 and 65, each sample is assigned an expressivity score equal to |σm0−average variability|×10. Although the example above describes a score in the range of 0 to 10, it will be understood that the score could be in the range of 0 to 1, or between any two numbers. Furthermore, it will be understood that although a linear scoring scale is described, other non-linear scales may also be used. The ES module 51 then outputs a score data 41c whose entries correspond to the entries of the audio data 41b.
In one embodiment, the expressivity score is computed for an entire audio sample, that is, for the full utterance.
In another embodiment, the expressivity score is computed for the audio sample on a frame-by-frame basis. The expressivity score computation is performed for several frames of the sample. An expressivity score for the sample is then derived from the expressivity scores for each frame, for example by averaging.
In another embodiment (which is not shown), the audio sample is further labelled with a further property. The further property label is assigned by a human operator for example. For example, the further property is an emotion (sad, happy, angry, etc . . . ), an accent (e.g. British English, French . . . ), style (e.g. shouting, whispering etc . . . ), or non-verbal sounds (e.g. grunts, shouts, screams, um's, ah's, breaths, laughter, crying etc . . . ). In the calculation of the expressivity score described above in relation to
This knowledge is slow to learn but is retained during training with subsequent sub-datasets containing expressive audio data. By progressively training with sub-datasets comprising samples having an increasing average expressivity score, the trained prediction network 21 learns to produce speech having a high expressivity. By contrast, if the prediction network 21 was not provided with increasingly expressive data sets for training, the prediction network 21 would learn to produce speech corresponding to the average of a diverse data set having a low average expressivity score.
The TDS module 53 is configured to change from one sub-dataset to another sub-dataset so that the prediction network 21 may be trained in turn with each sub-dataset.
In one embodiment, the TDS is configured to change sub-dataset after a certain number of training steps have been performed. The first sub-dataset 55-1 may be used for a first number of training steps. The second sub-dataset 55-2 may be used for a second number of training steps. The third sub-dataset 55-3 may be used for a third number of training steps. In one embodiment, the number of training steps are equal. In another embodiment, the number of training steps is different; for example the number of training steps decreases exponentially.
In another embodiment, the TDS is configured to change sub-dataset after an amount of training time has passed. The first sub-dataset 55-1 is used for a first time duration. The second sub-dataset 55-2 is used for a second time duration. The third sub-dataset 55-3 is used for a third time duration. In one embodiment, the time durations are equal. In another embodiment, the time durations are different, and, for example, are reduced when a sub-dataset is changed. For example, the first time duration is one day.
In another embodiment, the TDS is configured to change sub-dataset after a training metric of the neural network training reaches a predetermined threshold. In an example, the training metric is a parameter that indicates how well the output of the trained neural network matches the audio data used for training. An example of a training metric is the validation loss. For example, the TDS is configured to change sub-dataset after the validation loss falls below a certain level. In another embodiment, the training metric is the expressivity score as described in relation to
In yet another embodiment, the prediction network 21 is trained for a predetermined amount of time, and/or a number of training steps, and the performance of the prediction network 21 is verified on test sample text and audio pairs, and if the intermediate speech data 25 meets a predetermined quality, the sub-dataset is changed. In one embodiment, the quality is determined by a human tester who performs a listening test. In another embodiment, the quality is determined comparing the predicted intermediate speech data with the test audio data (which is converted using converter 47 if necessary) to generate an error metric. In yet another embodiment, the quality is determined by obtaining an expressivity score for the intermediate speech data 25b (which is converted to a time domain waveform if necessary) and comparing it with the expressivity score of the corresponding sample from the audio data 41b.
In another example, which is not shown, the sub-datasets 55-1, 55-2, and 55-3 are obtained by sorting samples of the audio data 41b according to their expressivity scores, and allocating the lower scoring samples to sub-dataset 55-1, the intermediate scoring samples to sub-dataset 55-2, and the high scoring samples to sub-dataset 55-3. When the prediction network 21 is trained using these sub-datasets in turn, the prediction network 21 may be trained to generate highly expressive intermediate speech data 25.
The TDS module 53 is configured to select data pairs from the first training dataset 41 to generate sub-dataset 55-1, and to select data pairs from the second training dataset 71 to generate sub-dataset 71-1. In one example, sub-datasets 55-1 and/or 71-1 are formed by using features such as expressivity scoring (as described in relation to
Although this is not shown, it will be understood that the example of
In a further example which is not shown, the prediction network 21 can be trained initially to generate speech 25 according to any of the examples described in relation to
The TTS system 1 comprises a processor 3 and a computer program 5 stored in a non-volatile memory. The TTS system 1 takes as input a text input 7. The text input 7 may be a text file and/or information in the form of text. The computer program 5 stored in the non-volatile memory can be accessed by the processor 3 so that the processor 3 executes the computer program 5. The processor 3 may comprise logic circuitry that responds to and processes the computer program instructions. The TTS system 1 provides as output a speech output 9. The speech output 9 may be an audio file of the synthesised speech and/or information that enables generation of speech.
The text input 7 may be obtained from an external storage medium, a communication network or from hardware such as a keyboard or other user input device (not shown). The output 9 may be provided to an external storage medium, a communication network, or to hardware such as a loudspeaker (not shown).
In an example, the TTS system 1 may be implemented on a cloud computing system, which transmits and receives data. Although a single processor 3 is shown in
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the novel methods and apparatus described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and apparatus described herein may be made.
Number | Date | Country | Kind |
---|---|---|---|
1919101.4 | Dec 2019 | GB | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/GB2020/053266 | 12/17/2020 | WO |