Embodiments described herein relate to a text-to-speech synthesis method, a text-to-speech synthesis system, and a method of training a text-to-speech system.
Text-to-speech (TTS) synthesis methods and systems are used in many applications, for example in devices for navigation and personal digital assistants. TTS synthesis methods and systems can also be used to provide speech segments that can be used in games, movies, audio books, or other media comprising speech.
There is a continuing need to improve TTS synthesis systems. In particular, there is a need to improve the quality of speech generated by TTS systems such that the speech generated is perceived by human listeners to convey a speech attribute. A speech attribute comprises emotion, intention, projection, pace, and/or accent. Speech data that has a speech attribute may sound natural, realistic and human-like. Such speech can be used in games, movies or other media comprising speech. TTS systems often comprise algorithms that need to be trained using training samples and there is a continuing need to improve the method by which the TTS system is trained such that the TTS system generates speech that conveys a speech attribute.
Systems and methods in accordance with non-limiting examples will now be described with reference to the accompanying figures in which:
According to a first aspect of the invention, there is provided a method of text-to-speech synthesis comprising:
Methods in accordance with embodiments described herein provide an improvement to text-to-speech synthesis by providing a prediction network that is trained to generate speech that has a speech attribute, as perceived by human listeners. A speech attribute comprises emotion, intention, projection, pace, and/or accent. Speech data that has a speech attribute may sound natural, realistic and human-like. The disclosed methods ensure that the trained prediction network can accurately generate speech from text, the generated speech is comprehensible, and the generated speech is perceived by human listeners to have a speech attribute.
Further, the disclosed methods address a technical problem tied to computer technology and arising in the realm of text-to-speech synthesis. Namely, the disclosed methods provide an improvement in a method of text-to-speech synthesis that uses a trained prediction network to generate speech that is perceived by humans to have a speech attribute. Training such models requires training data that has a speech attribute. Availability of such speech data is limited. The methods overcome this limitation by providing a method capable of training a prediction network using relatively small datasets. Further, the methods maintain the performance of the prediction network. For example, the claimed methods avoid overfitting of the trained model.
Further, the disclosed methods address a technical problem arising in the realm of text-to-speech synthesis. Namely, the disclosed methods provide an improvement in a method of text-to-speech synthesis that uses a trained prediction network to generate speech that is perceived by humans to have a speech attribute. Training such models requires training data that has a pronounced speech attribute (so that the model learns to produce speech having such patterns). Training a model directly with training data conveying a pronounced speech attribute has been found to result in models with a poor performance (as measured by a performance metric for example). Training a model sequentially such that the strength of the speech attribute of the audio samples seen by the model is gradually increased has been found to result in models with a better performance (as measured by the performance metric). The methods overcome this limitation by sequentially training a first model with a dataset, and then training a second model with another dataset, wherein the dataset used to train the second model has a more pronounced speech attribute on average than the dataset used to train the first model.
In an embodiment, the obtaining of the prediction network further comprises refreshing the second model, wherein refreshing the second model comprises:
In an embodiment, the performance metric comprises one or more of a validation loss, a speech pattern accuracy test, a mean opinion score (MOS), a MUSHRA score, a transcription metric, an attention score, and a robustness score.
In an embodiment, when the performance metric is the validation loss, the first predetermined value is less than the second predetermined value.
In an embodiment, the second predetermined value is 0.6 or less.
In an embodiment, when the performance metric is the transcription metric, the second predetermined value is 1 or less.
In an embodiment, the attention score comprises an attention confidence or a coverage deviation.
In an embodiment, when the performance metric is the attention score and the attention score comprises the attention confidence, the second predetermined value is 0.1 or less.
In an embodiment, when the performance metric is the attention score and the attention score comprises the coverage deviation, the second predetermined value is 1.0 or less.
In an embodiment, when the performance metric is the robustness score, the second predetermined value is 0.1% or less.
In an embodiment, when the performance metric is the MUSHRA score, the second predetermined value is 60 or more.
In an embodiment, the obtaining of the prediction network further comprises:
According to an embodiment, the obtaining of the prediction network further comprises:
According to a second aspect of the invention, there is provided a method of text-to-speech synthesis comprising:
In an embodiment, the obtaining of the prediction network further comprises refreshing the second model, wherein refreshing the second model comprises:
In an embodiment, refreshing the second model is performed until a performance metric reaches a predetermined value.
In an embodiment, refreshing the second model is performed until a performance metric reaches a predetermined value, wherein the performance metric comprises one or more of a validation loss, a speech pattern accuracy test, a mean opinion score (MOS), a MUSHRA score, a transcription metric, an attention score and a robustness score.
In an embodiment, when the performance metric is the validation loss, the predetermined value is 0.6 or less.
In an embodiment, the obtaining of the prediction network further comprises:
According to an embodiment, the obtaining of the prediction network further comprises:
According to an embodiment, the obtaining of the prediction network further comprises refreshing the third model, wherein refreshing the third model comprises:
According to a third aspect of the invention, there is provided a method of training a text-to-speech synthesis system that comprises a prediction network configured to convert received text into speech data having a speech attribute, wherein the speech attribute comprises emotion, intention, projection, pace, and/or accent, the method comprising:
According to a fourth aspect of the invention, there is provided a method of training a text-to-speech synthesis system that comprises a prediction network configured to convert received text into speech data having a speech attribute, wherein the speech attribute comprises emotion, intention, projection, pace, and/or accent, the method comprising:
In an embodiment, the second sub-dataset comprises fewer samples than the first sub-dataset.
In an embodiment, the audio samples of the first sub-dataset and the second sub-dataset are recorded by a human actor.
According to an embodiment, the audio samples of the first sub-dataset and the second sub-dataset are recorded by the same human actor.
In an embodiment, the first model is pre-trained prior to training with the first sub-dataset or the combined dataset.
In an embodiment, the first model is pre-trained using a dataset comprising audio samples from one or more human voices.
In an embodiment, the samples of the first sub-dataset and of the second sub-dataset are from the same domain, wherein the domain refers to the topic that the method is applied in.
According to a fifth aspect of the invention, there is provided a text-to-speech synthesis system comprising:
According to an embodiment, the prediction network comprises a sequence-to-sequence model.
According to a sixth aspect of the invention, there is provided speech data synthesised by a method according to any preceding embodiment.
According to a seventh aspect of the invention, there is provided a carrier medium comprising computer readable code configured to cause a computer to perform the methods of any of the above embodiments.
The methods are computer-implemented methods. Since some methods in accordance with examples can be implemented by software, some examples encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal. The carrier medium may comprise a non-transitory computer readable storage medium.
The system comprises a prediction network 21 configured to convert input text 7 into speech data 25. The speech data 25 is also referred to as the intermediate speech data 25. The system further comprises a Vocoder that converts the intermediate speech data 25 into an output speech 9. The prediction network 21 comprises a neural network (NN). The Vocoder also comprises a NN.
The prediction network 21 receives a text input 7 and is configured to convert the text input 7 into an intermediate speech data 25. The intermediate speech data 25 comprises information from which an audio waveform may be derived. The intermediate speech data 25 may be highly compressed while retaining sufficient information to convey vocal expressiveness. The generation of the intermediate speech data 25 will be described further below in relation to
The text input 7 may be in the form of a text file or any other suitable text form such as an ASCII text string. The text may be in the form of single sentences or longer samples of text. A text front-end, which is not shown, converts the text sample into a sequence of individual characters (e.g. “a”, “b”, “c”, ...). In another example, the text front-end converts the text sample into a sequence of phonemes (/k/, /t/, /p/, ...). Phonemes are units of sound that distinguish one word from another in a particular language. For example, in English, the phonemes /p/, /b/, /d/, and /t/ occur in the words pit, bit, din, and tin respectively.
The intermediate speech data 25 comprises data encoded in a form from which a speech sound waveform can be obtained. For example, the intermediate speech data may be a frequency domain representation of the synthesised speech. In a further example, the intermediate speech data is a spectrogram. A spectrogram may encode a magnitude of a complex number as a function of frequency and time. In a further example, the intermediate speech data 25 may be a mel spectrogram. A mel spectrogram is related to a speech sound waveform in the following manner: a short-time Fourier transform (STFT) is computed over a finite frame size, where the frame size may be 50 ms, and a suitable window function (e.g. a Hann window) may be used; and the magnitude of the STFT is converted to a mel scale by applying a non-linear transform to the frequency axis of the STFT, where the non-linear transform is, for example, a logarithmic function.
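By way of illustration only, the following sketch shows one way such a mel spectrogram may be derived from a waveform. It assumes the librosa library and uses a hypothetical 50 ms frame size and hop length; it is a minimal sketch rather than the implementation of the embodiments.

```python
# Minimal sketch (assumption: librosa is available); parameters are illustrative.
import numpy as np
import librosa

def waveform_to_mel(waveform, sample_rate=22050, n_mels=80):
    frame_length = int(0.050 * sample_rate)   # 50 ms frame size
    hop_length = frame_length // 4            # hypothetical hop size
    # Short-time Fourier transform over finite frames with a Hann window.
    stft = librosa.stft(waveform, n_fft=frame_length,
                        hop_length=hop_length, window="hann")
    magnitude = np.abs(stft)                  # magnitude of the complex STFT
    # Map the linear frequency axis onto the mel scale.
    mel_basis = librosa.filters.mel(sr=sample_rate, n_fft=frame_length,
                                    n_mels=n_mels)
    mel = mel_basis @ magnitude
    # Logarithmic (non-linear) compression, as described above.
    return np.log(np.clip(mel, 1e-5, None))
```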
The Vocoder module takes the intermediate speech data 25 as input and is configured to convert the intermediate speech data 25 into a speech output 9. The speech output 9 is an audio file of synthesised speech and/or information that enables generation of speech. The Vocoder module will be described further below.
Alternatively, the intermediate speech data 25 is in a form from which an output speech 9 can be directly obtained. In such a system, the Vocoder 23 is optional.
The prediction network 21 comprises an Encoder 31, an attention network 33, and a decoder 35. As shown in
The Encoder 31 takes as input the text input 7. The encoder 31 comprises a character embedding module (not shown) which is configured to convert the text input 7, which may be in the form of words, sentences, paragraphs, or other forms, into a sequence of characters. Alternatively, the encoder may convert the text input into a sequence of phonemes. Each character from the sequence of characters may be represented by a learned 512-dimensional character embedding. Characters from the sequence of characters are passed through a number of convolutional layers. The number of convolutional layers may be equal to three for example. The convolutional layers model longer term context in the character input sequence. The convolutional layers each contain 512 filters and each filter has a 5×1 shape so that each filter spans 5 characters. To the outputs of each of the three convolutional layers, a batch normalization step (not shown) and a ReLU activation function (not shown) are applied. The encoder 31 is configured to convert the sequence of characters (or alternatively phonemes) into encoded features 311 which are then further processed by the attention network 33 and the decoder 35.
The output of the convolutional layers is passed to a recurrent neural network (RNN). The RNN may be a long short-term memory (LSTM) neural network (NN). Other types of RNN may also be used. According to one example, the RNN may be a single bi-directional LSTM containing 512 units (256 in each direction). The RNN is configured to generate encoded features 311. The encoded features 311 output by the RNN may be a vector with a dimension k.
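Purely as an illustration, a minimal PyTorch sketch of an encoder of this general form is given below; the layer sizes follow the example values above (a learned 512-dimensional embedding, three convolutional layers of 512 filters spanning 5 characters with batch normalisation and ReLU, and a single bi-directional LSTM of 256 units per direction) and the class and argument names are illustrative only.

```python
import torch.nn as nn

class EncoderSketch(nn.Module):
    def __init__(self, num_symbols, embed_dim=512, kernel_size=5):
        super().__init__()
        self.embedding = nn.Embedding(num_symbols, embed_dim)   # learned 512-d character embedding
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(embed_dim, embed_dim, kernel_size, padding=kernel_size // 2),
                nn.BatchNorm1d(embed_dim),
                nn.ReLU(),
            )
            for _ in range(3)                                    # three convolutional layers
        ])
        # Single bi-directional LSTM: 256 units in each direction (512 in total).
        self.lstm = nn.LSTM(embed_dim, 256, batch_first=True, bidirectional=True)

    def forward(self, character_ids):
        x = self.embedding(character_ids).transpose(1, 2)        # (batch, 512, sequence)
        for conv in self.convs:
            x = conv(x)
        encoded_features, _ = self.lstm(x.transpose(1, 2))       # encoded features 311
        return encoded_features
```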
The Attention Network 33 is configured to summarize the full encoded features 311 output by the RNN and output a fixed-length context vector 331. The fixed-length context vector 331 is used by the decoder 35 for each decoding step. The attention network 33 may take information (such as weights) from previous decoding steps (that is, from previous speech frames decoded by the decoder) in order to output a fixed-length context vector 331. The function of the attention network 33 may be understood as acting as a mask that focusses on the important features of the encoded features 311 output by the encoder 31. This allows the decoder 35 to focus on different parts of the encoded features 311 output by the encoder 31 on every step. The output of the attention network 33, the fixed-length context vector 331, may have dimension m, where m may be less than k. According to a further example, the Attention network 33 is a location-based attention network.
Additionally or alternatively, the attention network 33 takes as input an encoded feature vector 311 denoted as h = {h1, h2, ..., hk}. A(i) is a vector of attention weights (called the alignment). The vector A(i) is generated from a function attend(s(i-1), A(i-1), h), where s(i-1) is the previous decoding state and A(i-1) is the previous alignment. s(i-1) is 0 for the first decoding step. The attend() function is implemented by scoring each element in h separately and normalising the scores. The context vector G(i) is computed as G(i) = ∑k A(i,k)×hk. The output of the attention network 33 is generated as Y(i) = generate(s(i-1), G(i)), where generate() may be implemented using a recurrent layer of 256 gated recurrent units (GRUs) for example. The attention network 33 also computes a new state s(i) = recurrency(s(i-1), G(i), Y(i)), where recurrency() is implemented using an LSTM.
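For illustration, the recurrence above may be sketched in pseudo-Python as follows; attend(), generate() and recurrency() stand in for the learned networks named in the text, and the uniform initial alignment is an assumption.

```python
def run_attention(h, num_decoder_steps, attend, generate, recurrency):
    k = len(h)                     # encoded feature vectors h1..hk
    s = 0                          # previous decoding state (0 at the first step)
    A = [1.0 / k] * k              # previous alignment (assumed uniform initially)
    outputs = []
    for _ in range(num_decoder_steps):
        A = attend(s, A, h)        # score each element of h and normalise
        G = sum(A[i] * h[i] for i in range(k))   # context vector G(i)
        Y = generate(s, G)         # e.g. a recurrent layer of 256 GRU units
        s = recurrency(s, G, Y)    # new state, e.g. an LSTM step
        outputs.append(Y)
    return outputs
```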
The decoder 35 is an autoregressive RNN which decodes information one frame at a time. The information directed to the decoder 35 is the fixed-length context vector 331 from the attention network 33. In another example, the information directed to the decoder 35 is the fixed-length context vector 331 from the attention network 33 concatenated with a prediction of the decoder 35 from the previous step. In each decoding step, that is, for each frame being decoded, the decoder may use the results from previous frames as an input to decode the current frame. In an example, as shown in
The parameters of the encoder 31, decoder 35, predictor 39 and the attention weights of the attention network 33 are the trainable parameters of the prediction network 21.
According to another example, the prediction network 21 comprises an architecture according to Shen et al. “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
Returning to
According to an embodiment, the Vocoder 23 comprises a convolutional neural network (CNN). The input to the Vocoder 23 is a frame of the mel spectrogram provided by the prediction network 21 as described above in relation to
Alternatively, the Vocoder 23 comprises a convolutional neural network (CNN). The input to the Vocoder 23 is derived from a frame of the mel spectrogram provided by the prediction network 21 as described above in relation to
Additionally or alternatively, the Vocoder 23 comprises a WaveNet NN architecture such as that described in Shen et al. “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
Additionally or alternatively, the Vocoder 23 comprises a WaveGlow NN architecture such as that described in Prenger et al. “Waveglow: A flow-based generative network for speech synthesis.” ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
Alternatively, the Vocoder 23 comprises any deep learning based speech model that converts an intermediate speech data 25 into output speech 9.
According to another alternative embodiment, the Vocoder 23 is optional. Instead of a Vocoder, the prediction network 21 further comprises a conversion module (not shown) that converts intermediate speech data 25 into output speech 9. The conversion module may use an algorithm rather than relying on a trained neural network. In an example, the Griffin-Lim algorithm is used. The Griffin-Lim algorithm takes the entire (magnitude) spectrogram from the intermediate speech data 25, adds a randomly initialised phase to form a complex spectrogram, and iteratively estimates the missing phase information by: repeatedly converting the complex spectrogram to a time domain signal, converting the time domain signal back to frequency domain using STFT to obtain both magnitude and phase, and updating the complex spectrogram by using the original magnitude values and the most recent calculated phase values. The last updated complex spectrogram is converted to a time domain signal using inverse STFT to provide output speech 9.
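As an illustrative sketch only, the iteration described above may be implemented along the following lines using numpy and the STFT utilities of librosa; the frame parameters and iteration count are assumptions rather than values prescribed by the embodiment.

```python
import numpy as np
import librosa

def griffin_lim(magnitude, n_iter=60, n_fft=1024, hop_length=256):
    # Start from the magnitude spectrogram with a randomly initialised phase.
    complex_spec = magnitude * np.exp(2j * np.pi * np.random.rand(*magnitude.shape))
    for _ in range(n_iter):
        # Convert the complex spectrogram to a time-domain signal ...
        signal = librosa.istft(complex_spec, hop_length=hop_length)
        # ... and back to the frequency domain to obtain magnitude and phase.
        rebuilt = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
        # Keep the original magnitudes and adopt the most recent phase estimate.
        complex_spec = magnitude * np.exp(1j * np.angle(rebuilt))
    # Final inverse STFT gives the output speech waveform.
    return librosa.istft(complex_spec, hop_length=hop_length)
```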
According to an example, the prediction network 21 is trained from a first training dataset 41 of text data 41a and audio data 41b pairs as shown in
The training of the Vocoder 23 according to an embodiment is illustrated in
The training of the Vocoder 23 according to another embodiment is illustrated in
Additionally or optionally, the first model 1211 is configured so that the intermediate speech data 25 generated results in output speech that has an attribute that is more pronounced compared to that generated by the base model 1210. A speech attribute is related to a speech pattern, where the speech pattern refers to a distinctive manner in which a person expresses himself. When an output speech conveys a speech pattern more strongly than another, it is meant that the output speech is perceived by human listeners to be more distinctive and/or that the speech attribute is perceived by human listeners to be more pronounced.
Speech patterns may be characterised by different speech attributes. Examples of speech attributes comprise emotion, intention, projection, pace, and accents amongst others. Note that the speech data may include sounds such as sobbing or laughing for example.
When an output speech has a speech pattern characterised by the speech attribute of emotion, it is meant that the output speech is perceived by human listeners to convey a particular emotion. When an output speech conveys a higher level of emotion than another, it is meant that the emotion is perceived by human listeners to be more intense. For example, an output speech may convey the emotion of sadness, anger, happiness, surprise etc.... Taking “sadness” as an example, there may be different degrees of sadness. An output speech might sound mildly “sad”, while an output speech where the level of emotion is more pronounced might sound extremely “sad” (e.g. comprising sobbing, or crying).
When an output speech has a speech pattern characterised by the speech attribute of intention, it is meant that the output speech is perceived by human listeners to convey a particular meaning that is separate from the words. For example, the meaning may be conveyed from the manner in which a speaker modulates his voice. When an output speech conveys a higher level of intention than another, it is meant that the intention is perceived by human listeners to be more intense. For example, an output speech may convey sarcasm. A sarcastic comment may be made with the intent of humour or may be made with the intent of being hurtful. For example, the expression “This is fantastic!” can be used to imply a different meaning (even the opposite meaning) depending on how it is said. Taking “sarcasm” as an example of intention, there may be different degrees of sarcasm. An output speech might sound mildly sarcastic, while an output speech that conveys a higher level of intention might sound extremely sarcastic.
When an output speech has a speech pattern characterised by the speech attribute of projection, it is meant that the output speech is perceived by human listeners to convey a particular projection. When an output speech conveys a higher level of projection than another, it is meant that the projection is perceived by human listeners to be more intense. Projection comprises whispering and shouting amongst others. Taking “shouting” as an example, there may be different degrees of shouting. For example, an output speech might come across as a shout when a teacher asks students to “Get out of the room”, while an output speech conveys a higher level of projection when, for example, a police officer is yelling instructions to the public to “Get out of the room” during an emergency situation.
When an output speech has a speech pattern characterised by the speech attribute of pace, it is meant that the output speech is perceived by human listeners to convey a particular pace. Pace may comprise rhythm. Rhythm refers to the timing pattern among syllables and is marked by the stress, timing and quantity of syllables. For example, a measure of rhythm may be obtained by considering the differences between sequences of syllables. Alternatively or additionally, the pace of an output speech comprises a rate of delivery (also referred to as the tempo). When an output speech conveys a higher level of pace than another, it is meant that the pace is perceived by human listeners to be more distinct. For example, the pace may be quantified as the number of syllables per minute.
When an output speech has a speech pattern characterised by the speech attribute of accent, it is meant that the output speech is perceived by human listeners to convey a particular accent. When an output speech conveys a higher level of an accent than another, it is meant that the accent is perceived by human listeners to be stronger. Taking the English language as a non-limiting example, accents may comprise regional variations such as British English, American English, and Australian English amongst others. Accents may also comprise other variations such as a French accent or a German accent amongst others. Taking a French accent in an utterance in the English language as an example, the output speech may have different degrees of French accent; on one hand, the output speech may have a barely noticeable accent, while on the other hand, the output speech may have a strong French accent (where e.g. the vowels might be heavily distorted from the usual English speech).
An output speech may convey a speech pattern that has a single attribute. An output speech may alternatively convey multiple attributes. For example, an output speech may have a speech pattern that conveys sadness (emotion), with a French accent (accent), and projected as a whisper (projection). Any number of attributes may be combined.
For ease of explanation, in the embodiments below, the output speech may be described in terms of speech data having the speech attribute of “emotion”. It will be understood however that any other attribute is equally applicable in those embodiments. It will further be understood that attributes can be combined.
Additionally and optionally, the synthesiser 1 of
Additionally and optionally, the synthesiser 1 of
Alternatively and optionally, the synthesiser 1 of
Additionally and optionally, the synthesiser 1 of
Additionally and optionally, the base model 1210 is trained using voice samples that are neutral and that are not perceived to convey any emotion. The first model 1211 is then trained using voice samples that convey mild emotion. The voice samples used to train the first model 1211 and the voice sample used to train the base model 1210 are perceived by human listeners to be similar. Training the first model 1211 using such voice samples (which convey mild emotion and are perceived to be similar to the neutral voice samples used to train the base model 1210) is found to result in a model that performs well (as measured by a validation loss, which will be described further below).
The samples of the base sub-dataset 100 are obtained from a human actor. By training the base model 1210 using samples from the base sub-dataset 100 (which contain samples from a human actor), the base model 1210 may perform well for that particular actor. Subsequent models trained starting from the base model 1210 may also perform well for the same actor. For example, the subsequent models generate accurate speech and/or exhibit low validation losses (defined further below). Conversely, if subsequent models trained starting from the base model 1210 are then further trained using samples from a different actor, the subsequent models may not perform as well.
Additionally and optionally, the base sub-dataset 100 comprises text and audio samples corresponding to thousands of sentences. Optionally, the base sub-dataset comprises text and audio samples corresponding to one thousand sentences or more.
Additionally and optionally, the base sub-dataset 100 comprises text and audio samples corresponding to thousands of non-repeating sentences from a particular domain (which is described further below) and in a particular language. When the thousands of sentences are non-repeating, the base sub-dataset 100 may be adequate to train the model. For example, when the thousands of sentences are non-repeating, the sentences cover most of the phonemes of the language such that the model sees most of the phonemes during training.
A domain may be understood to refer to different applications such as audiobooks, movies, games, navigation systems, personal assistants etc.... For example, when the synthesiser is used in a TTS system for audiobook applications, the synthesiser may be trained using a base sub-dataset 100 comprising samples that are obtained from an audiobook dataset. An audiobook dataset may contain samples from audiobooks in general (and not just from a specific audiobook). Similarly a game dataset may contain samples from games in general (and not just from specific games).
Additionally and optionally, for a particular language being used, the base sub-dataset 100 comprises samples that represent a neutral speech pattern. A neutral speech pattern is a speech pattern that does not convey any particular attributes. For example, when the attribute concerned is the emotion of sadness, a neutral speech pattern is not perceived by a human listener as conveying sadness.
The first sub-dataset 101 is obtained from a dataset comprising recordings that are from the same domain as the base sub-dataset 100. For example, if the base sub-dataset 100 comprises samples from an audiobook dataset, the first sub-dataset 101 also comprises samples from an audiobook dataset. The first sub-dataset 101 comprises audio samples that convey a degree of sadness that is higher than the degree of sadness conveyed by the base sub-dataset 100 (which comprises neutral recordings). In other words, the attribute of emotion (sadness) is more pronounced in the first sub-dataset 101 than in the base sub-dataset 100.
Additionally and optionally, the audio recordings for the first sub-dataset 101 are obtained from the same the human actor from whom the audio recordings of the base sub-dataset 100 were obtained. As explained above, the first model, being trained starting from the base model 1210, may perform well since it is trained from samples from the same actor.
Additionally and optionally, the first sub-dataset 101 comprises audio and text samples that correspond to the audio and text samples of the base sub-dataset 100. For example, the text samples of the first sub-dataset may be the same as the text in the base sub-dataset; however, the audio samples differ in that the audio samples of the first sub-dataset are perceived to convey a higher degree of sadness than the audio samples of the base sub-dataset.
Audio recordings for the first sub-dataset 101 are obtained as follows. A human actor is provided with a number of text samples which he then reads out to record corresponding audio samples. The text samples correspond to the particular language and to the domain of the samples of the base sub-dataset 100. Taking the attribute of emotion as a non-limiting example, for the first sub-dataset 101, the human actor records the audio samples using a style that conveys emotion. For example, the human actor might record the voice samples using a style that conveys one of the emotions of “anger”, “sadness”, or “surprise”. Audio samples that are recorded in such a style are perceived by listeners as sounding “angry”, “sad”, or “surprised”. The recorded audio samples form the audio 101b of the first sub-dataset 101, and their corresponding text samples form the text 101a.
Additionally and optionally, the first sub-dataset 101 may comprise fewer samples than the base sub-dataset.
Additionally and optionally, the first sub-dataset 101 comprises text and audio samples corresponding to tens of sentences for example.
In order to record an audio sample that conveys a certain emotion, a human voice actor may get himself into character, for example, by placing himself in a situation that generates the emotion he intends to convey. For example, to convey sadness, the voice actor may get into a sad state of mind. The voice actor may find it difficult to maintain such a state for a long time. Therefore, the voice actor may find it difficult to record a large number of audio samples that convey emotion. Therefore, datasets that comprise audio samples conveying high levels of emotion may be small compared to the base sub-dataset and comprise, for example, text and audio samples corresponding to tens of sentences.
In step S103, the base sub-dataset 100 and the first sub-dataset 101 are combined to form a combined dataset 1000. The combined dataset 1000 comprises text and audio samples from the base sub-dataset and from the first sub-dataset.
Additionally and optionally, in step S103, the samples from the base sub-dataset are appended to the samples from the first sub-dataset to form the combined dataset 1000. When samples from the combined dataset are used for training (as described further below), the samples are retrieved in a random order such that the source of the samples (i.e. whether the samples are from the base sub-dataset or the first sub-dataset) varies randomly as the training progresses.
Additionally and optionally, the training batches are acquired from the combined dataset 1000 during training. During training, the entire combined dataset 1000 is not passed to the base model 1210 at once; the combined dataset 1000 is divided into a number of training batches and the training batches are each passed to the base model 1210 in turn. The training batches are configured such that each batch contains samples from the base sub-dataset and from the first sub-dataset in a ratio that is comparable to the ratio of the size of the base sub-dataset to the size of the first sub-dataset. According to an example, when the base sub-dataset has 1000 samples and the first sub-dataset has a size of 100 (a ratio of 10:1), each training batch has samples from both sub-datasets in ratios ranging from 20:1 to 2:1.
Additionally and optionally, when the size of the base sub-dataset is much greater than the size of the first sub-dataset, the size of the base sub-dataset may be reduced, so that the ratio of samples from the two sub-datasets is less skewed. Training with a less skewed training batch may improve the speed at which the model learns. For example, when the base sub-dataset has 10000 samples and the first sub-dataset has a size of 10 (a ratio of 1000:1), the size of the base sub-dataset may be reduced to 1000 samples (a ratio of 100:1).
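A minimal sketch of forming the combined dataset and drawing shuffled training batches from it is given below; the sample counts are the example figures above and the helper name is illustrative only.

```python
import random

base_sub_dataset = [("text", "audio")] * 1000    # e.g. 1000 neutral pairs
first_sub_dataset = [("text", "audio")] * 100    # e.g. 100 more emotional pairs

# Append the base sub-dataset samples to the first sub-dataset samples.
combined_dataset = first_sub_dataset + base_sub_dataset

def training_batches(dataset, batch_size=32):
    # Shuffling means the source of each sample (base or first sub-dataset)
    # varies randomly during training, so each batch mixes the two
    # sub-datasets in roughly the ratio of their sizes on average.
    shuffled = random.sample(dataset, len(dataset))
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]
```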
In step S105, the base model 1210 is trained using the combined dataset 1000. The training of the base model 1210 is similar to the training of the prediction network 21 described in relation to
In step S107, the first model 1211 of the synthesiser is trained. To train the first model 1211, the trained base model 1210 is used as a starting point and further trained using the first sub-dataset 101 (this approach is also referred to as transfer learning). Using the base model 1210 as a starting point means that the parameters of the first model 1211 are initialised to the learned values of the trained base model 1210. The parameters of the first model 1211 are further updated during training with the first sub-dataset 101. Training the first model 1211 is similar to the training of the prediction network 21 described in relation to
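By way of illustration, the transfer-learning step may be sketched as follows; train_on() is a hypothetical stand-in for the training routine described in relation to the prediction network 21.

```python
import copy

def train_first_model(trained_base_model, first_sub_dataset, train_on):
    # Initialise the first model's parameters to the learned values of the
    # trained base model (transfer learning) ...
    first_model = copy.deepcopy(trained_base_model)
    # ... then update the parameters further by training on the first
    # sub-dataset only.
    train_on(first_model, first_sub_dataset)
    return first_model
```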
The first sub-dataset 101 may contain fewer samples than the base sub-dataset 100. As described above, the first sub-dataset 101, which comprises samples conveying more emotion than the samples of the base sub-dataset 100, may comprise tens of samples. It has been found that training the base model 1210 and the first model 1211 as described above in
For each of the sub-datasets 100 and 101 that are used for training, a corresponding validation dataset comprising samples that have similar characteristics (e.g. neutral or conveying emotion) may be available. The validation dataset corresponding to each sub-dataset comprises samples from the sub-dataset that have been randomly selected and kept out of the training. The validation loss of the first model 1211 is obtained using the validation dataset corresponding to the first sub-dataset.
Returning to
By pre-training the base model 1210, it can be ensured that the base model 1210 is able to generate comprehensible speech with sufficient accuracy. Furthermore, the amount of data required to further train the base model 1210 and to train the first model 1211 may be reduced.
In the training method of
The predetermined range or value may be experimentally obtained by training different models under different conditions and then performing a human listening test (described further below) to assess the performance of each model. For example, the different models may be obtained by training a model using a dataset for a number of training steps, saving a model checkpoint (described further below), and repeatedly further training to obtain other model checkpoints. The different model checkpoints then act as the different models used for assessing performance and determining the predetermined range.
An example of a performance metric is the validation loss. Details of the validation loss are provided further below in relation to
For example, to have a good performance, the validation loss of the trained first model 1211 is at most approximately 0.5. For example, at most approximately 0.5 refers to 0.6 or less. For the validation loss of the first model 1211 to be at 0.6 or less, the validation loss of the base model 1210 must be smaller than the validation loss of the first model 1211, since validation losses increase as training progresses (as described below in relation to
It will be understood that in an alternative embodiment, the features described above in relation to
Returning to
In an alternative example, synthesiser 1 comprises a single model which is the first model 1211 for example. In other words, synthesiser 1 may be similar to the synthesiser shown in
Additionally or alternatively, the synthesiser 1 of
In use, the speaker id may be used as an input to influence the model to generate output speech according to the voice of a particular actor. Alternatively, if no speaker id is specified, a default value may be used.
Returning to the synthesiser 1 of
Step S101 is similar to that of
Step S109 is similar to S107, except that the second model of the synthesiser is trained. To train the second model, the trained first model 1211 is used as a starting point and further trained using the second sub-dataset 102 (this approach is also referred to as transfer learning). Using the first model 1211 as a starting point means that the parameters of the second model are initialised to the learned values of the trained first model 1211. The parameters of the second model are further updated during training with the second sub-dataset 102. Training the second model is similar to the training of the prediction network 21 described in relation to
Step S111 for training the third model is similar to S109 except that the second trained model is used as starting point. The third model has the ability to generate speech that conveys more intense emotion than the second model. Step S113 for training the Nth model is similar to S111 and S109 except that the starting point is the previously trained model (the N-1th model).
The method described in relation to
Alternatively, the synthesiser 1 may comprise a subset of the models trained. For example, the synthesiser comprises the base model 1210, the first model 1211 and the second model. Yet alternatively, the synthesiser 1 comprises the third model; or comprises the third model and the fourth model; or comprises the third model, a fourth model, and a fifth model (when N≥5).
Additionally and optionally, in the method described in relation to
In another embodiment which is not shown, a method of training the base model 1210, the first model 1211, a second model, a third model, and further models up to N models, where N is a natural number, is similar to the method described in relation to
Additionally and optionally, the performance metric is the validation loss. Details of the validation loss are provided further below in relation to
For example, when N = 3, the validation loss of the third model after S111 should be 0.6 or less. To obtain this, the validation losses from the previous training stages should aggregate to 0.6 or less in total. Non-limiting example values of the validation loss are: after S105, validation loss = 0.2; after S107, validation loss = 0.3; after S109, validation loss = 0.4. After step S111, the validation loss may then be 0.6 or less, which results in a good performance. During training, in S105, once the validation loss reaches 0.2, the next model (the first model 1211 in this case) is trained. In S107, once the validation loss reaches 0.3, the next model (the second model) is trained. In S109, once the validation loss reaches 0.4, the next model (the third model) is trained.
Non-limiting examples of unsuitable validation losses are: after S105, validation loss = 0.3; after S107, validation loss = 0.5; after S109, validation loss = 0.8.
Although the above examples consider the case where N = 3, training with any value of N is similar. To determine up to what point (i.e. up to what value of validation loss) a training stage (e.g. S105, S107, S109, ...) should be carried out, the rate at which the validation loss increases after S105 and/or S107 is noted. The rate may refer to the increase in validation loss per number of training samples. Using the obtained rate, a projection of the value of the validation loss for the Nth trained model may be obtained. When the projected value for the Nth trained model is 0.6 or less, the training is as described above.
Alternatively, when the projected value of the validation loss for the Nth trained model exceeds 0.6, then one or more additional “Refresh” steps are introduced between the training stages. The “Refresh” step is described in detail further below in relation to
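For illustration only, the projection described above may be sketched as follows; the helper name and the linear projection are assumptions, and the 0.6 threshold is the example value given above.

```python
def needs_refresh(stage_losses, total_stages, threshold=0.6):
    # stage_losses: validation losses after the stages run so far,
    # e.g. [0.2, 0.3, 0.4] after S105, S107 and S109.
    stages_done = len(stage_losses)
    rate = (stage_losses[-1] - stage_losses[0]) / max(stages_done - 1, 1)
    projected = stage_losses[-1] + rate * (total_stages - stages_done)
    return projected > threshold

# The example values above project to about 0.5 (no refresh needed), whereas
# the unsuitable values project well above 0.6 (refresh steps introduced).
print(needs_refresh([0.2, 0.3, 0.4], total_stages=4))   # False
print(needs_refresh([0.3, 0.5, 0.8], total_stages=4))   # True
```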
In a comparative example which is not shown, a model is trained first with the base sub-dataset 100, and then directly with a fifth sub-dataset (which is not shown). The comparative model is compared to a fifth model (N=5) trained as described in relation to the embodiment shown in
In the above comparison, the validation loss obtained from the fifth model trained for 20k training steps is 0.6, when the fifth model is trained according to the embodiment described in
In the above comparison, the speech pattern accuracy as perceived by human listeners is determined by performing a listening test. In the speech pattern accuracy test, the human listeners rate the synthesised speech as high or low (or alternatively, as good or bad), based on how accurate they perceive the speech to be. The fifth model when trained according to the embodiment of
Alternatively or optionally, the speech pattern accuracy as perceived by human listeners is based on a mean opinion score (MOS) test. A mean opinion score is a numerical measure of the quality of an approximation of a real-world signal (e.g. synthesised speech) as judged by humans. For example, when the speech pattern to be tested is intended to portray the attribute of emotion and in particular sadness, human judges might be asked to “Rate on a scale of 1 to 5 how accurate you think this model is at portraying sadness”. The mean opinion score for each model is then obtained by taking the average of all the scores from the different human judges.
In the above comparison, the validation loss is the loss obtained by comparing the output of the trained model with audio samples from a validation data set. The validation loss function may be similar to the loss function used in the training of the prediction network as described in relation to
In step S200, the trained third model from S111 is further trained using the combined dataset 1000. Step S200 is referred to as a ‘refresh’ step and, in general, the refresh step is performed to improve a performance metric of the trained model. As illustrated in
Alternatively, step S200 may be performed until the validation loss drops below a certain threshold value. The threshold value is determined empirically and depends on e.g. the speaker, the language, and the speech pattern or attribute being considered. For example, S200 is performed and the validation loss curve is monitored until the curve suggests that there is no benefit in training further. For example, the validation loss curve may have flattened. At this point a human-based listening test is performed to confirm the quality (as perceived by a human listener).
Alternatively, S200 is performed until at least 2 model states have been saved. Model states may be saved every 500 steps for example.
Yet alternatively, Step S200 may be performed until the generated speech achieves a certain quality. The quality is determined by considering two aspects. Firstly, the validation loss is considered; if the validation loss has dropped below an empirically determined threshold value, then a listening test is performed by human operators to determine whether the generated speech remains comprehensible while conveying the intended emotion. If the model passes the human test, then step S200 is complete. The threshold value may be different for different languages, or for different attributes, or for different voice actors. The threshold value may be determined empirically on a case-by-case basis, for example, by human operators performing listening tests.
Additionally and optionally, the threshold value for the validation loss is approximately 0.5. For example, the threshold value is less than or equal to 0.6. This threshold value has been empirically found to yield models that generate good quality speech.
Alternatively, the predetermined threshold value may be a fraction of the peak value of the validation loss obtained during S111.
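A minimal sketch of the refresh step under these stopping criteria is given below; train_steps() and validation_loss() are hypothetical helpers, and the 500-step checkpoint interval and 0.6 threshold are the example values above.

```python
import copy

def refresh(model, combined_dataset, validation_set,
            train_steps, validation_loss,
            threshold=0.6, checkpoint_every=500):
    # Further train on the combined dataset, saving a model state every 500
    # steps, until the validation loss drops below the threshold.
    saved_states = []
    while validation_loss(model, validation_set) > threshold:
        train_steps(model, combined_dataset, num_steps=checkpoint_every)
        saved_states.append(copy.deepcopy(model))
    return model, saved_states
```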
In S112, a fourth model is trained using the refreshed model obtained in step S200. The validation loss of the fourth model (which is calculated using the validation data corresponding to the fourth sub-dataset) starts at a value similar to the value obtained after S200 and increases further as the training in S112 progresses. Since the validation loss started at a low value, the validation loss remains acceptable and the performance of the fourth model remains acceptable after the training step S112.
Returning to
Alternatively and optionally, the refresh step S200 may be introduced after training any model after the first model 1211. Referring to the training of N models, as described in relation to
Additionally and optionally, two or more refresh steps S200 may be introduced. For example, a refresh step S200 may be introduced between training the first model S107 and before training the second model S109, and between training the third model S111 and training the fourth model S112. Alternatively, a refresh step S200 may be introduced between training the first model S107 and before training the second model S109, and between training the second model S109 and before training the third model S111. Alternatively, a refresh step S200 may be introduced between training the first model S107 and before training the second model S109, between training the second model S109 and before training the third model S111, and between training the third model S111 and training the fourth model S112.
Although the embodiments described above in relation to
In an alternative embodiment, the performance metric is obtained by performing a human listening test as described above to determine the speech pattern accuracy. For example, the human listeners provide a rating of high or low. If the rating is high, the performance of the trained model is considered to be good enough.
In another alternative embodiment, the performance metric is obtained by performing a Mean Opinion Score (MOS) test as described above. For example, when the MOS test is carried out on a scale of 1 to 5, a MOS of greater than 3.5 indicates that the model is good enough.
Yet alternatively, the performance metric is obtained by performing a ‘MUSHRA’ test. MUSHRA stands for Multiple Stimuli with Hidden Reference and Anchor. MUSHRA is a listening test designed to compare two or more audio samples with respect to perceived fidelity. In the MUSHRA test, a human listener is provided with the reference sample (which might be a training sample performed by a human actor, and is labelled as such), test samples, a hidden version of the reference, and one or more anchors (anchors are low-pass filtered versions of the reference). The human listener listens to the different samples and assigns a score to each (on a 0-100 scale). Generally, the human listener would assign a high score to the hidden version of the reference. The score for the test samples would depend upon how their fidelity with respect to the reference is perceived by the human listener. The MUSHRA test is generally performed using several human listeners and an average score for each sample is obtained. The average score from the MUSHRA test (also referred to as the MUSHRA score) is then the performance metric. In an example, a MUSHRA score greater than 60 indicates that the model performs well.
Yet alternatively, the performance metric is a transcription metric. The transcription metric is designed to measure the intelligibility of the trained model. Test sentences are prepared and input into the trained model that is being tested; these sentences are then synthesised into their corresponding speech using the trained model. The resulting audio/speech outputs of the model for these test sentences are then passed through a speech-to-text (STT) system. The text resulting from this inference is then converted into its representative series of phonemes, with punctuation removed. The outputted series of phonemes is then compared, on a sentence-by-sentence basis, to the series of phonemes representing the original input text. If this series of phonemes exactly matches the series of phonemes represented by the original input text, then that specific sentence is assigned a perfect score of 0.0. In this embodiment, the “distance” between the input phoneme string and the output phoneme string is measured using the Levenshtein distance; the Levenshtein distance corresponds to the total number of single character edits (insertions, deletions or substitutions) that are required to convert one string to the other. Alternative methods of measuring the differences and hence the “distance” between the input and output phoneme strings can be used. STT systems are not perfect; in order to ensure the errors being measured by the transcription metric correspond to those produced by the trained model being tested and not the STT system itself, multiple STT systems of differing quality may be used. Sentences with high transcription errors for all STT systems are more likely to contain genuine intelligibility errors caused by the TTS model than those for which only some STT systems give high transcription errors.
In an embodiment, the STT system comprises an acoustic model that converts speech signals into acoustic units in the absence of a language model. In another embodiment, the STT system also comprises a language model. Additionally and optionally, multiple STT models are used and the result is averaged. The output series of phonemes from the STT in step S205 is then compared with the input series of phonemes S201 in step S207. This comparison can be a direct comparison of the acoustic units or phonemes derived from the input text with the output of the STT system. From this, judgement can be made as to whether the output of the trained model can be accurately understood by the STT. If the input series of phonemes exactly matches the output series of phonemes, then it receives a perfect score of 0.0. The distance between the two series of phonemes is the Levenshtein distance as described earlier. This Levenshtein distance/score is calculated on a sentence-by-sentence basis in step S209. Furthermore, a combined score for the test sentences is calculated by averaging the transcription metric score of all the test sentences. The combined score obtained in S209 is then used as a performance metric for the trained model. For example, when the combined score (the averaged Levenshtein distance) is less than or equal to 1, the trained model is considered to perform well. An averaged Levenshtein distance of less than or equal to 1 corresponds to only one insertion, deletion, or substitution required to make the transcriptions match up to the test sentences on average.
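As an illustration of the transcription metric, a minimal sketch is given below; synthesise(), speech_to_text() and phonemise() are hypothetical stand-ins for the trained model, the STT system and the phoneme converter respectively.

```python
def levenshtein(a, b):
    # Dynamic-programming edit distance: insertions, deletions and
    # substitutions each count as one edit.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution
        prev = curr
    return prev[-1]

def transcription_metric(model, test_sentences, synthesise, speech_to_text, phonemise):
    scores = []
    for sentence in test_sentences:
        transcribed = speech_to_text(synthesise(model, sentence))
        scores.append(levenshtein(phonemise(sentence), phonemise(transcribed)))
    # An averaged Levenshtein distance of 1 or less indicates good intelligibility.
    return sum(scores) / len(scores)
```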
Yet alternatively, the performance metric comprises an attention score. Here, the attention weights of the attention mechanism of the trained model are used. From the attention weights, an attention score can be calculated and used as an indication of the quality of the performance of the attention mechanism and thus of model quality. The attention weights are a matrix of coefficients that indicate the strength of the links between the input and output tokens; alternatively this can be thought of as representing the influence that the input tokens have over the output tokens. The input tokens/states may be a sequence of linguistic units (such as characters or phonemes) and the output tokens/states may be a sequence of acoustic units, specifically mel spectrogram frames, that are concatenated together to form the generated speech audio.
The attention mechanism is described above in relation to
In step S902, the attention weights are retrieved from the trained model for the current test sentence, together with its corresponding generated speech. This matrix of weights shows the strength of the connections between the input tokens (current test sentence broken down into linguistic units) and the output tokens (corresponding generated speech broken down into the spectrogram frames).
In step S903, the attention metric/score is calculated using the attention weights pulled from the model. In this embodiment, there are two metrics/scores that can be calculated from the attention mechanism: the ‘confidence’ or the ‘coverage deviation’.
The first attention metric in this embodiment consists of measuring the confidence of the attention mechanism over time. This is a measure of how focused the attention is at each step of synthesis. If, during a step of the synthesis, the attention is focused entirely on one input token (linguistic unit) then this is considered maximum “confidence” and signifies a good model. If the attention is focused on all the input tokens equally then this is considered minimum “confidence”. Whether the attention is “focused” or not can be derived from the attention weights matrix. For a focused attention, a large weighting value is observed between one particular output token (mel frame) and one particular input token (linguistic unit), with small and negligible values between that same output token and the other input tokens. Conversely, for a scattered or unfocused attention, one particular output token would share multiple small weight values with many of the input tokens, in which not one of the weighting values especially dominates the others.
In an embodiment, the attention confidence metric is measured numerically by observing the alignment, αt, at decoder step t, which is a vector whose length is equal to the number of encoder outputs, I, (the number of phonemes in the sentence) and whose sum is equal to 1. If αti represents the ith element of this vector, i.e. the alignment with respect to encoder output i, then the confidence is calculated using a representation of the entropy according to
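The equation referred to here (Equation 1 below) is not reproduced in the text above; a normalised-entropy form consistent with the description that follows is, for example (given as an assumed reconstruction rather than the exact expression of the embodiment):

$$C_t = -\frac{1}{\log I}\sum_{i=1}^{I} \alpha_{ti}\,\log \alpha_{ti} \qquad \text{(Equation 1)}$$

Under this form the value is 0.0 when the attention is focused entirely on one input token and 1.0 when the attention is spread equally over all I input tokens.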
Here a value of 0.0 represents the maximum confidence and 1.0 minimum confidence. To obtain a value for the whole sentence, it is necessary to take the sum over all the decoder steps t and divide by the length of the sentence to get the average attention confidence score, or alternatively take the worst case, i.e. largest value. It is possible to use this metric to find periods during the sentence when the confidence is extremely low and use this to find possible errors in the output.
The latter metric, coverage deviation, looks at how long each input token is attended to during synthesis. Here, an input token being ‘attended to’ by an output token during synthesis means the computation of an output token (acoustic units/mel spectrograms) is influenced by that input token. An output token attending to an input token will show itself as a weighting value close to one within the entry of the attention matrix corresponding to those two tokens. Coverage deviation simultaneously punishes the output token for attending too little, and for attending too much to the linguistic unit input tokens over the course of synthesis. If a particular input token is not attended to at all during synthesis, this may correspond to a missing phoneme or word; if it is attended to for a very long time, it may correspond to a slur or repeated syllable/sound.
In an embodiment, the coverage deviation is measured numerically by observing the attention matrix weightings and summing over the decoder steps. This results in an attention vector, β, whose elements, βi, represent the total attention received by linguistic unit input token i during the synthesis. There are various methods for analysing this attention vector to look for errors and to produce metrics for judging model quality. For example, if the average total attention over all encoder steps is β̄ = T/I, where T is the number of decoder steps, the coverage deviation for the sentence may be computed as

$$CD = \frac{1}{I}\sum_{i=1}^{I}\log\left(1+\left(\beta_i-\bar{\beta}\right)^{2}\right) \qquad (2)$$

Here, if βi = β̄ for every input token i, the coverage deviation takes its minimum value of 0; the further the total attention received by any input token departs from the average, whether too little or too much, the larger the score.
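By way of non-limiting illustration, the total attention vector β and a coverage deviation score of the form of Equation 2 may be computed as follows; the function and variable names are illustrative only.

```python
import numpy as np

def coverage_deviation(alignments: np.ndarray) -> float:
    """Coverage deviation over a sentence (cf. Equation 2).

    alignments: array of shape (T, I). Summing over the decoder steps
    gives beta, the total attention received by each input token. Each
    beta_i is compared with the average total attention T / I, so that
    attending too little or too much both increase the score.
    """
    num_decoder_steps, num_inputs = alignments.shape
    beta = alignments.sum(axis=0)                  # total attention per input token
    beta_bar = num_decoder_steps / num_inputs      # average total attention
    return float(np.mean(np.log1p((beta - beta_bar) ** 2)))
```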
Further, to use the attention score as a performance metric for the trained model, the scores for each test sentence are averaged across the plurality of test sentences and then compared with a threshold. For example: when the attention score is based on attention confidence (Equation 1), an average score below 0.1 indicates that the trained model performs well; when the attention score is based on coverage deviation (Equation 2), an average score below 1.0 indicates that the trained model performs well.
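By way of non-limiting illustration, the model-level check may be sketched as follows, where the per-sentence scores are assumed to have been computed as above and the function name is illustrative.

```python
import numpy as np

def attention_metric_passes(per_sentence_scores, threshold: float) -> bool:
    """Average per-sentence attention scores over the test set and compare
    with a threshold, e.g. 0.1 for the confidence score (Equation 1) or
    1.0 for the coverage deviation score (Equation 2)."""
    return float(np.mean(per_sentence_scores)) < threshold

# Illustrative usage with the example thresholds given above:
# attention_metric_passes(confidence_scores, 0.1)
# attention_metric_passes(coverage_scores, 1.0)
```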
Yet alternatively, the performance metric comprises a metric termed a robustness metric. The robustness metric is based on the presence or absence of a stop token. The robustness metric is designed to determine the probability that a trained Tacotron model will reach the synthesis length limit rather than end in the correct manner, which is to produce a stop-token. A stop-token is a signal issued during active synthesis that instructs the model to end synthesis. A stop-token should be issued when the model is confident that it has reached the end of the sentence, so that speech synthesis can end correctly. Without the issuing of a stop-token, synthesis would continue, generating “gibberish” speech that does not correspond to the inputted text sentence. The failure of the synthesis to end correctly may be caused by a variety of different errors, including a poorly trained stop-token prediction network, long silences or repeating syllables, and unnatural/incorrect speech rates.
The stop-token is a (typically single layer) neural network with a sigmoid activation function. It receives an input vector, vs, which in the Tacotron model is a concatenation of the context vector and the hidden state of the decoder LSTM. Let Ws be the weights matrix of a single later stop-token network. If the hidden state of the LSTM is of dimension NL and the dimension of the context vector is Nc then the dimension of the projection layer weight matrix, Ws, is:
and the output of the layer is computed according to,
where σ is the sigmoid function and the rest of the equation equates to a linear transformation that projects the concatenated inputs down to a scalar. Since the final dimension of the weights vector Ws is 1, the result of Ws · vs is a scalar value and therefore, due to the sigmoid activation function, the output of this layer is a scalar value between 0 and 1. This value is the stop-token and represents the probability that inference has reached the end of the sentence. A threshold is chosen such that, if the stop-token is above this threshold, then inference ceases. This is the correct way for synthesis to end. If, however, this threshold is never reached, then synthesis ends by reaching the maximum allowed number of decoder steps. It is this failure that the robustness check measures.
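By way of non-limiting illustration, a single-layer stop-token computation of this form may be sketched as follows; the function and variable names are illustrative, and in practice the weights would be those of the trained Tacotron decoder.

```python
import numpy as np

def stop_token_probability(context: np.ndarray, hidden: np.ndarray, W_s: np.ndarray) -> float:
    """Single-layer stop-token prediction with a sigmoid activation.

    context: context vector of dimension N_c
    hidden:  decoder LSTM hidden state of dimension N_L
    W_s:     weight vector of length N_c + N_L (final dimension 1)

    Returns a scalar in (0, 1): the probability that inference has
    reached the end of the sentence.
    """
    v_s = np.concatenate([context, hidden])   # input vector v_s
    logit = float(np.dot(W_s, v_s))           # W_s . v_s projects v_s down to a scalar
    return 1.0 / (1.0 + np.exp(-logit))       # sigmoid activation
```

Inference then stops once this value exceeds the chosen threshold, as described above.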
To compute the robustness metric, the process takes a trained model, synthesises a large number of test sentences, typically NS = 10000, and counts the number of sentences NF that end inference by reaching the maximum allowed number of decoder steps, i.e. that fail to produce a stop token. The robustness score is then simply the ratio of these two numbers, NF/NS. The test sentences are chosen to be sufficiently short that, if the sentence were rendered correctly, the model would not reach the maximum allowed number of decoder steps.
Stop tokens may be used to assess the quality of the synthesis, that is, whether a trained model performs well enough.
In step S1102 it is then determined whether during the sentence’s inference a stop token was issued, in other words, whether the gate confidence ever exceeded the given threshold. If a stop token was issued, implying that the generated speech is of good quality and ended appropriately, then that sentence is flagged as ‘good’ in step S1107. Conversely, if a stop token was never issued before the hard limit/fixed duration, implying the presence of ‘gibberish speech’ at the end of the generated audio, then the sentence is flagged as ‘bad’ in step S1105. In step S1109, the robustness score is updated based upon the new ‘good’ or ‘bad’ sentence.
Further, to use the robustness metric as a performance metric for the trained model, once all of the test sentences have passed through inference, the final robustness score is obtained as the number of bad sentences (NF) divided by the total number of sentences (NS). If the robustness metric lies below a threshold value of 0.001, or 0.1%, such that fewer than 1 in 1000 renders fail to produce a stop token, then the trained model is considered to perform well.
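By way of non-limiting illustration, the robustness computation described above may be sketched as follows; synthesise_fn is a hypothetical callable wrapping the trained model's inference, and is assumed to report whether a stop token was issued before the decoder-step limit.

```python
def robustness_score(synthesise_fn, test_sentences, threshold: float = 0.001):
    """Fraction of renders that fail to produce a stop token (N_F / N_S).

    synthesise_fn: hypothetical callable wrapping the trained model's
        inference; assumed to return True if a stop token was issued
        before the maximum allowed number of decoder steps ('good'),
        and False otherwise ('bad').
    Returns the robustness score and whether it falls below the
    example threshold of 0.001 (fewer than 1 in 1000 failures).
    """
    n_fail = sum(1 for sentence in test_sentences if not synthesise_fn(sentence))
    score = n_fail / len(test_sentences)
    return score, score < threshold
```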
Yet alternatively, the performance metric comprises two or more of the previously described metrics in combination. The previously described metrics comprise validation loss, speech pattern accuracy as perceived by human listeners, MOS, MUSHRA, transcription metric, attention score and/or robustness score. When two or more metrics are used in combination, the metrics are obtained individually as explained above.
In one example, if all of the two or more metrics meet their respective thresholds, then the performance metric is assigned a value representing a ‘good’ model and the model is considered to perform well enough.
In another example, the two or more metrics are combined into a single aggregate score for each sentence. A unique threshold is used for each separate metric, and a simple voting system is then used to aggregate the metrics. The voting system consists of allocating a sentence a score of 1 if it crosses the threshold of a metric (fail), and 0 if it does not (pass). This is done for each metric separately, so that each sentence has a total score that essentially represents the number of metrics that the sentence failed. For example, if the metrics being considered are the transcription, attention, and robustness metrics disclosed previously, then each sentence will have a score ranging from 3 (failed all metrics) to 0 (passed all metrics). Further, to use the aggregate score as a performance metric for the trained model, the scores are averaged over all test sentences and, for example, if the averaged aggregate score is 1 or less, then the model is considered to perform well.
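By way of non-limiting illustration, the voting scheme may be implemented along the following lines; the metric names, thresholds and data layout are illustrative only.

```python
def aggregate_score(sentence_metrics: dict, thresholds: dict) -> int:
    """Simple voting system over per-sentence metrics.

    sentence_metrics: metric name -> score for one sentence,
        e.g. {"transcription": 0.05, "attention": 0.2, "robustness": 0.0}
    thresholds: metric name -> failure threshold.

    A metric contributes 1 if the sentence crosses its threshold (fail)
    and 0 otherwise (pass), so the total is the number of metrics failed.
    """
    return sum(1 for name, limit in thresholds.items() if sentence_metrics[name] > limit)

def model_performs_well(all_sentence_metrics, thresholds, max_average: float = 1.0) -> bool:
    """Average the aggregate scores over all test sentences and compare
    with an example limit (an average of 1 or less indicates a good model)."""
    scores = [aggregate_score(m, thresholds) for m in all_sentence_metrics]
    return sum(scores) / len(scores) <= max_average
```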
In S802, a human-based listening test is performed on each audio segment, and each audio segment and its corresponding text is allocated into one of the intermediate sub-datasets 101, 102, 103, up to 10N. Optionally, S802 comprises a mean opinion score test as described above in relation to
Additionally and optionally, the script comprises repetitions of the same sentence or phrase at different positions of the script. For example, a script may comprise the phrase “It is not here” at different positions of the script: early segments corresponding to this phrase may sound mildly sad and would, after S802, be placed in a low sub-dataset such as 101 or 102, while a later segment corresponding to the same phrase may sound very sad and would, after S802, be placed in a high sub-dataset such as 103 or 10N.
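By way of non-limiting illustration, the allocation of S802 may be sketched as follows; the numeric listening-test score and the score boundaries are assumptions for illustration only.

```python
def allocate_to_sub_dataset(attribute_score: float) -> str:
    """Place an (audio segment, text) pair into an intermediate sub-dataset
    based on the perceived strength of the speech attribute from the
    listening test (e.g. a mean opinion score). The boundaries below are
    illustrative: weak-sounding segments go to low sub-datasets such as
    101, the strongest to 10N.
    """
    if attribute_score < 2.0:
        return "sub-dataset 101"
    if attribute_score < 3.0:
        return "sub-dataset 102"
    if attribute_score < 4.0:
        return "sub-dataset 103"
    return "sub-dataset 10N"
```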
In the above in relation to
The TTS system 1100 comprises a processor 3 and a computer program 5 stored in a non-volatile memory. The TTS system 1100 takes as input a text input 7. The text input 7 may be a text file and/or information in the form of text. The computer program 5 stored in the non-volatile memory can be accessed by the processor 3 so that the processor 3 executes the computer program 5. The processor 3 may comprise logic circuitry that responds to and processes the computer program instructions. The TTS system 1100 provides as output a speech output 9. The speech output 9 may be an audio file of the synthesised speech and/or information that enables generation of speech.
The text input 7 may be obtained from an external storage medium, a communication network or from hardware such as a keyboard or other user input device (not shown). The output 9 may be provided to an external storage medium, a communication network, or to hardware such as a loudspeaker (not shown).
In an example, the TTS system 1100 may be implemented on a cloud computing system, which transmits and receives data. Although a single processor 3 is shown in
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the novel methods and apparatus described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and apparatus described herein may be made.
Number | Date | Country | Kind |
---|---|---|---|
2013590.1 | Aug 2020 | GB | national |
This application is a continuation of International Application No. PCT/GB2021/052241 filed Aug. 27, 2021, which claims priority to U.K. Application No. GB2013590.1, filed Aug. 28, 2020; each of which is hereby incorporated by reference in its entirety.
 | Number | Date | Country
---|---|---|---
Parent | PCT/GB2021/052241 | Aug 2021 | WO
Child | 18174145 |  | US