This application claims priority to and the benefit of Korean Patent Application No. 10-2024-0070691 filed in the Korean Intellectual Property Office on May 30, 2024, the entire contents of which are incorporated herein by reference.
The present disclosure relates to a speech generation method and device for performing zero-shot speech generation by using prosody control and random speaker generation.
Recently, speech synthesis technology using neural network models has been gaining attention. In particular, the emergence of text-to-speech (TTS) models based on deep learning has greatly improved the naturalness and fluency of synthesized speech. AI-based speech synthesis is being utilized in various fields, such as virtual assistants and voice guidance systems, and as the quality of speech synthesis improves, user experience, real-time responsiveness, and precision improve as well.
Zero-shot speech generation is a technique for synthesizing speech by using a speaker identity extracted from a given piece of audio. Deep neural network models that perform zero-shot speech generation may synthesize the voice and pronunciation of a new speaker without additional training, even if they have not encountered that speaker before. In other words, in zero-shot speech generation, the voice characteristics of a new speaker may be recognized by using pre-trained models, and natural speech may be generated based on the recognized voice characteristics; research is ongoing to improve the practicality of the technology.
The present disclosure attempts to provide a speech generation method and device capable of controlling prosody elements of generated speech, such as pitch, in zero-shot speech generation.
The present disclosure also attempts to provide a speech generation method and device capable of providing random generation of a speaker identity in zero-shot speech generation.
An exemplary embodiment of the present disclosure provides a speech generation method of performing zero-shot speech generation by using prosody control and random speaker generation, the speech generation method including: receiving paired text and speaker audio for an ith speaker and a jth utterance from a training set; inputting the speaker audio to a speaker encoder to perform extraction of a speaker identity, and obtaining a first embedding representing the speaker identity; inputting the first embedding to a speaker quantizer and obtaining a quantized second embedding; inputting the text and the first embedding to a text prior encoder and obtaining a first intermediate representation; inputting the first intermediate representation and the first embedding to a prosody predictor, adding a prosodic hidden representation to the first intermediate representation, and obtaining a second intermediate representation; inputting the second intermediate representation and the first embedding to an intermediate decoder, and obtaining a final representation; and converting the final representation to a waveform by using a decoder to generate speech.
In some exemplary embodiments, the speech generation method may further include: inputting a linear spectrogram and the first embedding to a speech post encoder and obtaining a third intermediate representation; and aligning the third intermediate representation to the first intermediate representation.
In some exemplary embodiments, the obtaining of the first embedding may further include introducing a first loss according to Equation 1 below.
Herein, g
In some exemplary embodiments, the obtaining of the second embedding may include: obtaining a second embedding according to Equation 2 below, based on an optimal weight for a basis vector found by a self-attention module of the speaker quantizer.
Herein, fquantize is the speaker quantizer, gcontinuous,i,j is the first embedding, Wquantize is a parameter of the speaker quantizer, and B is a codebook including n learnable vectors.
In some exemplary embodiments, the obtaining of the second embedding may include introducing a second loss according to Equation 3 below.
Herein, g
In some exemplary embodiments, the obtaining of the second intermediate representation may include generating predicted prosody values according to Equation 4 and Equation 5 below, and introducing a third loss.
Herein, f is the prosody predictor, zprior is the first intermediate representation, gcontinuous,i,j is the first embedding, W is a parameter of the prosody predictor, g
In some exemplary embodiments, the obtaining of the second intermediate representation may include introducing a fourth loss according to Equation 6 below.
Herein, g
In some exemplary embodiments, the obtaining of the final representation may include introducing a fifth loss according to Equation 7 below.
Herein, g
In some exemplary embodiments, the speech post encoder may include a context organizer for removing speaker information from the linear spectrogram while preserving context information; and a speaker organizer for implanting the speaker information into an output of the context organizer, and the obtaining of the third intermediate representation may include introducing a sixth loss according to Equation 8 below.
Herein, Wpost is a parameter of the speech post encoder, g
In some exemplary embodiments, the obtaining of the third intermediate representation may include introducing a seventh loss according to Equation 9 below,
Herein, Wpost is a parameter of the post-speech encoder, g
In some exemplary embodiments, the aligning of the third intermediate representation to the first intermediate representation may include introducing an eighth loss according to Equation 10 below,
Herein, g
In some exemplary embodiments, the first intermediate representation may be computed according to Equation 11 below,
Herein, fprior is the text prior encoder, xi,j is the text, gcontinuous,i,j is the first embedding, Wspk is a parameter of the speaker encoder, and Wprior may be a parameter of the text prior encoder.
In some exemplary embodiments, the third intermediate representation may be computed according to Equation 12 below.
Herein, fpost is the speech post encoder, yi,j is the speaker audio, gcontinuous,i,j is the first embedding, Wspk is a parameter of the speaker encoder, and Wpost is a parameter of the speech post encoder.
In some exemplary embodiments, the speech generation method may further include implementing zero-shot text-to-speech (TTS) based on the first embedding and the first intermediate representation, or implementing zero-shot voice conversion (VC) based on the first embedding, the first intermediate representation, and the third intermediate representation.
In some exemplary embodiments, the speech generation method may further include implementing random speaker text-to-speech (TTS) based on the second embedding and the first intermediate representation, or implementing random speaker voice conversion (VC) based on the second embedding, the first intermediate representation, and the third intermediate representation.
In some exemplary embodiments, the speech generation method may further include inputting a random seed into the speaker quantizer.
Another exemplary embodiment of the present disclosure provides a speech generation device that executes a program code loaded into one or more memory devices via one or more processors and performs zero-shot speech generation by using prosody control and random speaker generation, in which the program code is executed to: receive paired text and speaker audio for an ith speaker and a jth utterance from a training set; input the speaker audio to a speaker encoder to perform extraction of a speaker identity, and obtain a first embedding representing the speaker identity; input the first embedding to a speaker quantizer and obtain a quantized second embedding; input the text and the first embedding to a text prior encoder and obtain a first intermediate representation; input the first intermediate representation and the first embedding to a prosody predictor, add a prosodic hidden representation to the first intermediate representation, and obtain a second intermediate representation; input the second intermediate representation and the first embedding to an intermediate decoder and obtain a final representation; and convert the final representation to a waveform by using a decoder to generate speech.
In some exemplary embodiments, the program code may be executed to further input a linear spectrogram and the first embedding to a speech post encoder and obtain a third intermediate representation, and align the third intermediate representation to the first intermediate representation.
In some exemplary embodiments, the program code may be executed to further implement zero-shot text-to-speech (TTS) based on the first embedding and the first intermediate representation, or implement zero-shot voice conversion (VC) based on the first embedding, the first intermediate representation, and the third intermediate representation.
In some exemplary embodiments, the program code may be executed to further implement random speaker text-to-speech (TTS) based on the second embedding and the first intermediate representation, or implement random speaker voice conversion (VC) based on the second embedding, the first intermediate representation, and the third intermediate representation.
Hereinafter, the present invention will be described more fully with reference to the accompanying drawings, in which exemplary embodiments of the invention are shown. As those skilled in the art would realize, the described exemplary embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not restrictive. Like reference numerals designate like elements throughout the specification.
Throughout the specification and the claims, unless explicitly described to the contrary, the word “comprise”, and variations such as “comprises” or “comprising”, will be understood to imply the inclusion of stated elements but not the exclusion of any other elements. Terms including an ordinal number, such as first and second, are used for describing various constituent elements, but the constituent elements are not limited by the terms. The terms are used only to discriminate one constituent element from another constituent element.
Terms such as “. . . unit,” “. . . device,” and “module,” as used in the specification may refer to a unit capable of performing at least one function or operation described herein, which may be implemented in hardware or a circuit, software, or a combination of hardware or a circuit and software. Further, at least some configurations or functions of a speech generation method and device for performing zero-shot speech generation by using prosody control and random speaker generation according to exemplary embodiments described below may be implemented as a program or software, and the program or software may be stored on a computer-readable medium.
Referring to
The speech generation device 10 may perform zero-shot speech generation. Zero-shot speech generation is the generation of speech by using a representation of a speaker extracted from audio of a previously unseen speaker, and a neural network model performing the zero-shot speech generation does not need to undergo any adaptation process to learn and generate speech of a new speaker. The speech generation device 10 may include a speaker identity extraction module 110, a speaker identity quantization module 120, a text-to-speech (TTS) pipeline module 130, and a voice conversion (VC) pipeline module 140 to enable control of prosody elements including pitch, and provide random generation of speaker identities in zero-shot speech generation. Hereinafter, the speaker identity extraction module 110, the speaker identity quantization module 120, the TTS pipeline module 130, and the VC pipeline module 140 will be described with reference to
The speech generation device 10 may be trained to generate a speech ŷi,j that is similar to ground-truth yi,j, for given text xi,j and speaker audio yi,j as input, as follows.
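The relation referenced above is not reproduced in this text; based on the variable definitions given in the next paragraph, a plausible reconstruction (a hedged reading, not a quotation of the original equation) is:

```latex
% Plausible form of the training relation: the pipeline f maps paired text and
% speaker audio to generated speech that should approximate the ground truth.
\begin{equation*}
\hat{y}_{i,j} = f\bigl(x_{i,j},\, y_{i,j};\, W_{\mathrm{spk}},\, W_{\mathrm{t2s}}\bigr),
\qquad \hat{y}_{i,j} \approx y_{i,j}
\end{equation*}
```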
Herein, xi,j and yi,j may be the paired sentence and audio for the ith speaker and jth utterance from a specific training set X (where i and j are integers), f is the TTS pipeline, and Wspk and Wt2s may be parameters of the speaker encoder 20 and the TTS pipeline, respectively. Herein, the TTS pipeline may include a text prior encoder 22, a prosody predictor 23, an intermediate decoder 24, and a decoder 25. Meanwhile, a speech post encoder 26 may form a VC pipeline.
For example, the output zpost of the VC pipeline may be aligned with the output zprior of the frame decoder of the text prior encoder 22 of the TTS pipeline, and the output gdiscrete of the speaker quantizer 21 may follow the output gcontinuous of the speaker encoder 20. In this way, the speech generation device 10 may build a jointly trainable pipeline to realize various applications, such as zero-shot TTS, VC, random speaker generation, and prosody control, with only a single-step training of the neural network, and to this end, a total loss Ltotal for the single-step training is introduced in the following form.
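The expression itself is not reproduced in this text; one plausible reading, assuming the hyperparameters weight their respective loss terms in a simple sum, is the following reconstruction:

```latex
% Assumed weighted-sum form of the total loss for single-step joint training;
% the exact expression is not reproduced in the text above.
\begin{equation*}
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{TriniTTS}}
  + \lambda_{\mathrm{frame\_prosody}}\,\mathcal{L}_{\mathrm{frame\_prosody}}
  + \lambda_{\mathrm{adv\_spk}}\,\mathcal{L}_{\mathrm{adv\_spk}}
  + \lambda_{\mathrm{spk}}\,\mathcal{L}_{\mathrm{spk}}
  + \lambda_{\mathrm{quantize}}\,\mathcal{L}_{\mathrm{quantize}}
  + \lambda_{\mathrm{spk\_classification}}\,\mathcal{L}_{\mathrm{spk\_classification}}
\end{equation*}
```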
Herein, LTriniTTS is a loss term for the TTS on which the speech generation device 10 according to the exemplary embodiments is based, and each of Lframe_prosody, Ladv_spk, Lspk, Lquantize, and Lspk_classification will be described later. λframe_prosody, λadv_spk, λspk, λquantize, and λspk_classification may be hyperparameters for their respective loss terms.
The speaker identity extraction module 110 may perform speaker identity extraction by inputting the speaker audio yi,j to the speaker encoder 20, and obtain a first embedding gcontinuous representing the speaker identity. The speaker encoder 20 is a pre-trained speaker recognition model, and in some exemplary embodiments, the speaker encoder 20 may be trained by using an angular prototypical loss.
The speaker encoder 20 may receive speaker audio yi,j as input and output a representation of the speaker identity gcontinuous, as follows.
gcontinuous,i,j = fspk(yi,j; Wspk)
Herein, fspk is the speaker encoder 20, and Wspk may be a parameter of the speaker encoder 20. Since the speaker encoder 20 aims to extract only speaker-specific information, the extracted speaker embedding gcontinuous,i,j may be approximated to the speaker identity si of the ith speaker among all speakers in the training set S as follows: gcontinuous,i,j ≈ si, where si ∼ S. In the speaker encoder 20, the extracted representation gcontinuous may be introduced as a condition for the normalization layers of the elements in the TTS pipeline and the VC pipeline. The extracted representation gcontinuous may also be used as a target for the output gdiscrete of the speaker quantizer 21. To make the speaker embedding (i.e., the first embedding) more discriminative, a speaker classifier may be added to the output of the speaker encoder 20, and a loss Lspk_classification may be introduced for the speaker classifier according to Equation 1 below.
Herein, g
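As a concrete illustration of the speaker classifier mentioned above, the following is a minimal sketch assuming Equation 1 is a standard cross-entropy speaker-classification loss on the first embedding; the layer sizes, speaker count, and PyTorch usage are illustrative assumptions rather than details taken from the disclosure.

```python
# Minimal sketch: a speaker classifier head on the speaker encoder output,
# assuming a cross-entropy loss; dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerClassifier(nn.Module):
    def __init__(self, embed_dim: int = 256, num_speakers: int = 100):
        super().__init__()
        self.proj = nn.Linear(embed_dim, num_speakers)

    def forward(self, g_continuous: torch.Tensor) -> torch.Tensor:
        # g_continuous: (batch, embed_dim) speaker embedding from the speaker encoder
        return self.proj(g_continuous)

# Hypothetical usage: make the first embedding more speaker-discriminative.
classifier = SpeakerClassifier(embed_dim=256, num_speakers=100)
g_continuous = torch.randn(8, 256)            # stand-in for f_spk(y_ij; W_spk)
speaker_ids = torch.randint(0, 100, (8,))     # ground-truth speaker index i
logits = classifier(g_continuous)
loss_spk_classification = F.cross_entropy(logits, speaker_ids)
```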
The speaker identity quantization module 120 may input the first embedding gcontinuous obtained from the speaker encoder 20 to the speaker quantizer 21 and obtain a quantized second embedding gdiscrete. The speaker quantizer 21 may aim to reconstruct the extracted gcontinuous as a weighted sum of basis vectors bi, i=1, . . . , n corresponding to a set of n learnable vectors from a codebook B. To this end, a second embedding gdiscrete may be obtained according to Equation 2 below based on an optimal weight wi for the basis vector bi found by the self-attention module of the speaker quantizer 21. That is, the reconstructed second embedding gdiscrete may be computed as the sum of the products of the basis vector bi and the weight wi computed from the attention layer.
Herein, fquantize may be the speaker quantizer 21, gcontinuous,i,j may be the first embedding, Wquantize may be a parameter of the speaker quantizer 21, and B may be a codebook including n learnable vectors. To train the speaker quantizer 21, a speaker quantization loss, that is, a loss Lquantize, may be introduced between gcontinuous and gdiscrete by using a mean square error, according to Equation 3 below.
Herein, g
To prevent the loss from affecting the parameter Wspk of the speaker encoder 20, gcontinuous may be separated from the speaker quantization loss.
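The quantization step described above can be sketched as follows, under the assumptions that a single-head attention computes the weights over the codebook and that Equation 3 is a mean square error against a detached gcontinuous; the dimensions and module layout are illustrative, not taken from the disclosure.

```python
# Sketch of the speaker quantizer: g_discrete is an attention-weighted sum of
# n learnable codebook vectors, and the quantization loss uses a detached
# g_continuous so gradients do not reach the speaker encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerQuantizer(nn.Module):
    def __init__(self, embed_dim: int = 256, codebook_size: int = 64):
        super().__init__()
        # Codebook B of n learnable basis vectors b_1..b_n.
        self.codebook = nn.Parameter(torch.randn(codebook_size, embed_dim))
        self.query_proj = nn.Linear(embed_dim, embed_dim)
        self.key_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, g_continuous: torch.Tensor) -> torch.Tensor:
        # g_continuous: (batch, embed_dim)
        q = self.query_proj(g_continuous)            # (batch, embed_dim)
        k = self.key_proj(self.codebook)             # (n, embed_dim)
        scores = q @ k.t() / q.shape[-1] ** 0.5      # (batch, n)
        weights = F.softmax(scores, dim=-1)          # weights w_i over the codebook
        return weights @ self.codebook               # sum_i w_i * b_i

quantizer = SpeakerQuantizer()
g_continuous = torch.randn(8, 256)
g_discrete = quantizer(g_continuous)
# Detach the target so the loss does not update the speaker encoder (W_spk).
loss_quantize = F.mse_loss(g_discrete, g_continuous.detach())
```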
The TTS pipeline module 130 may input the text xi,j and the first embedding gcontinuous to the text prior encoder 22 and obtain the first intermediate representation zprior. The text prior encoder 22 may include a phoneme encoder, an alignment search module, a duration predictor, and a frame decoder. In the phoneme encoder, the alignment search module, and the duration predictor, the first embedding, that is, the speaker embedding gcontinuous, may be given as a condition for the normalization layers assigned to the phoneme encoder, the alignment search module, and the duration predictor.
The frame decoder may receive as input an extended text hidden representation htext_extended, which is obtained by repeating the phoneme encoder's text representation along the time dimension so that its length is extended from the number of tokens to the number of frames, and the first embedding gcontinuous. In some exemplary embodiments, the architecture of the frame decoder may be the same as that of the phoneme encoder.
The TTS pipeline module 130 may input the first intermediate representation zprior and the first embedding gcontinuous to the prosody predictor 23, and add the prosodic hidden representations hpitch and henergy to the first intermediate representation zprior to obtain a second intermediate representation zprosody. The prosody predictor 23 may receive as input the output zprior of the frame decoder and the speaker embedding gcontinuous. The main function of the prosody predictor 23 is to generate a predicted pitch value {circumflex over (x)}pitch and a predicted energy value {circumflex over (x)}energy. The normalized pitch and energy values xpitch and xenergy extracted from the ground-truth audio may be used as targets during training. When it is assumed that n is the length of the text token sequence xi,j and m is the number of mel frames of the audio yi,j, the loss term may be computed based on the number of mel frames of yi,j rather than the length of xi,j. In some exemplary embodiments, the prosody prediction loss may be computed at the token level or at the frame level.
Specifically, a predicted prosody value {circumflex over (x)}prosody may be generated and a loss Lframe_prosody may be introduced according to Equation 4 and Equation 5 below.
Herein, f is the prosody predictor 23, zprior is the first intermediate representation, gcontinuous,i,j may be the first embedding, W may be the parameter of the prosody predictor 23, g
In addition, a loss Ltoken_prosody may be introduced according to Equation 6 below.
Herein, g
In the training stage, the ground-truth prosody values xpitch and xenergy may be delivered to the prosody encoder to generate the prosodic hidden representations hpitch and henergy. These hidden representations may then be added to the output zprior of the frame decoder. In the inference stage, however, the predicted prosody values {circumflex over (x)}pitch and {circumflex over (x)}energy are delivered to the prosody encoder, and the prosody may be controlled by adjusting these values with control parameters.
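To make the prosody path concrete, the following sketch shows a prosody predictor producing pitch/energy estimates from zprior and the speaker embedding, a prosody encoder turning prosody values into hidden representations added to zprior, and a hypothetical control parameter scaling the predicted pitch at inference; all module internals (the Conv1d stacks, dimensions, and the pitch_scale name) are assumptions for illustration.

```python
# Sketch of the prosody predictor / prosody encoder interaction described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodyPredictor(nn.Module):
    def __init__(self, hidden: int = 192, spk_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(hidden + spk_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 2, kernel_size=3, padding=1),  # pitch + energy channels
        )

    def forward(self, z_prior, g_continuous):
        # z_prior: (batch, hidden, frames); g_continuous: (batch, spk_dim)
        spk = g_continuous.unsqueeze(-1).expand(-1, -1, z_prior.shape[-1])
        out = self.net(torch.cat([z_prior, spk], dim=1))
        return out[:, 0], out[:, 1]                           # predicted pitch, energy

class ProsodyEncoder(nn.Module):
    def __init__(self, hidden: int = 192):
        super().__init__()
        self.net = nn.Conv1d(2, hidden, kernel_size=3, padding=1)

    def forward(self, pitch, energy):
        # Returns a prosodic hidden representation to add to z_prior.
        return self.net(torch.stack([pitch, energy], dim=1))

predictor, encoder = ProsodyPredictor(), ProsodyEncoder()
z_prior = torch.randn(4, 192, 120)
g_continuous = torch.randn(4, 256)
pitch_hat, energy_hat = predictor(z_prior, g_continuous)

# Training-style frame-level prosody loss against normalized ground-truth values.
x_pitch, x_energy = torch.randn(4, 120), torch.randn(4, 120)
loss_frame_prosody = F.mse_loss(pitch_hat, x_pitch) + F.mse_loss(energy_hat, x_energy)

# Inference-style control: scale the predicted pitch before encoding it and
# adding the prosodic hidden representation to z_prior.
pitch_scale = 1.2   # hypothetical user control parameter
z_prosody = z_prior + encoder(pitch_hat * pitch_scale, energy_hat)
```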
The TTS pipeline module 130 may input the second intermediate representation zprosody and the first embedding gcontinuous to the intermediate decoder 24 to obtain a final representation zfinal. The intermediate decoder 24 may receive the output zprosody of the prosody predictor 23 along with the speaker embedding gcontinuous as input. In some exemplary embodiments, the intermediate decoder 24 may include a fully convolutional neural network with residual connections to capture local information. The intermediate decoder 24 may be the final stage of the intermediate representation before up-sampling to the waveform is performed. To align the output zfinal of the intermediate decoder 24 with the mel-spectrogram xmel of the ground-truth audio by using a mean square error, a loss Lintermediate may be introduced according to Equation 7 below.
Herein, g
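The fully convolutional residual design mentioned for the intermediate decoder 24 could look like the following sketch; the channel counts, kernel sizes, and the way the speaker embedding conditions the block are assumptions, not details from the disclosure.

```python
# Sketch of one residual convolutional block of the kind the intermediate
# decoder is described as using.
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, channels: int = 192, spk_dim: int = 256):
        super().__init__()
        self.spk_proj = nn.Linear(spk_dim, channels)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, z, g_continuous):
        # z: (batch, channels, frames); g_continuous: (batch, spk_dim)
        cond = self.spk_proj(g_continuous).unsqueeze(-1)   # broadcast over frames
        h = self.act(self.conv1(z + cond))
        h = self.conv2(h)
        return z + h                                       # residual connection

block = ResidualConvBlock()
z_prosody = torch.randn(4, 192, 120)
g_continuous = torch.randn(4, 256)
z_final = block(z_prosody, g_continuous)
# The intermediate loss then compares z_final (after a projection to mel bins,
# omitted here) with the ground-truth mel-spectrogram via mean square error.
```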
The TTS pipeline module 130 may convert the final representation zfinal to a waveform to generate the speech ŷi,j by using the decoder 25. In some exemplary embodiments, the decoder 25 may be implemented as a generative adversarial network (GAN)-based decoder.
The VC pipeline module 140 may input a linear spectrogram xspec and the first embedding gcontinuous to the speech post encoder 26, obtain the third intermediate representation zpost, and align the third intermediate representation zpost to the first intermediate representation zprior. The speech post encoder 26 may learn the intermediate representations of the TTS pipeline on-the-fly during training. The speech post encoder 26 may receive as input the linear spectrogram xspec and the speaker embedding gcontinuous, and output a latent variable zpost that best represents the context and speaker information. The speech post encoder 26 may include a context organizer and a speaker organizer.
The context organizer may remove speaker information from the linear spectrogram xspec while preserving contextual information. On the other hand, the speaker organizer may implant speaker information into the output of the context organizer. To ensure that the speaker information is removed after the context organizer, an adversarial speaker classifier may be added to the output of the context organizer. In this regard, a loss Ladv_spk may be introduced according to Equation 8 below.
Herein, Wpost is the parameter of the speech post encoder 26, g
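The disclosure does not state how the adversarial speaker classifier is trained; one common realization, shown here purely as an assumption, is a gradient reversal layer that lets the classifier learn to identify the speaker while the reversed gradients push the context organizer to discard speaker information.

```python
# Sketch of an adversarial speaker classifier via gradient reversal (an
# assumed realization, not stated in the disclosure).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale: float = 1.0):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and optionally scale) the gradient flowing to the context organizer.
        return -ctx.scale * grad_output, None

class AdversarialSpeakerClassifier(nn.Module):
    def __init__(self, hidden: int = 192, num_speakers: int = 100):
        super().__init__()
        self.proj = nn.Linear(hidden, num_speakers)

    def forward(self, context_out):
        # context_out: (batch, frames, hidden) output of the context organizer
        reversed_feat = GradReverse.apply(context_out)
        return self.proj(reversed_feat.mean(dim=1))   # utterance-level speaker logits

clf = AdversarialSpeakerClassifier()
context_out = torch.randn(4, 120, 192)
speaker_ids = torch.randint(0, 100, (4,))
loss_adv_spk = F.cross_entropy(clf(context_out), speaker_ids)
```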
Further, a speaker classifier may be added to the output of the speaker organizer to implant speaker information from a target speaker reference. In this regard, a loss Lspk may be introduced according to Equation 9 below.
Herein, Wpost is the parameter of the speech post encoder 26, g
To ensure that the output zpost of the speaker organizer after the context organizer is aligned with the output zprior of the frame decoder, a loss Lbridge may be introduced according to Equation 10 below.
Herein, g
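Assuming the bridge loss follows the same mean-square-error pattern as the other alignment losses described above, Equation 10 may take a form such as the following; the exact expression, including any stop-gradient placement, is an assumption.

```latex
% Assumed mean-square-error form of the bridge loss aligning the VC-pipeline
% representation with the TTS-pipeline representation.
\begin{equation*}
\mathcal{L}_{\mathrm{bridge}} = \bigl\lVert z_{\mathrm{post}} - z_{\mathrm{prior}} \bigr\rVert_2^2
\end{equation*}
```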
In some exemplary embodiments, the first intermediate representation zprior may be computed according to Equation 11 below.
Herein, fprior is the text prior encoder 22, xi,j is the text, gcontinuous,i,j is the first embedding, Wspk is a parameter of the speaker encoder 20, and Wprior may be a parameter of the text prior encoder 22.
In some exemplary embodiments, the third intermediate representation zpost may be computed according to Equation 12 below.
Herein, fpost may be the speech post encoder 26, yi,j may be the speaker audio, gcontinuous,i,j may be the first embedding, Wspk may be a parameter of the speaker encoder 20, and Wpost may be a parameter of the speech post encoder 26.
According to the present exemplary embodiment, pitch control may be implemented in zero-shot speech generation by integrating the TTS pipeline, the VC pipeline, and the prosody predictor, while new speaker identities may be introduced in speech generation by using the codebook of the speaker quantizer. Furthermore, by building a jointly trainable pipeline, various applications, including zero-shot TTS, VC, random speaker generation, and prosody control, may be implemented with only a single-step training of the neural network. Furthermore, by sharing the prosody predictor and the decoder, computational costs may be reduced and efficiency increased.
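For orientation, the following stub sketches how the described modules chain together on the zero-shot TTS path, with every sub-module replaced by a simple placeholder so that only the data flow is visible; the shapes and placeholder layers are illustrative assumptions, not the actual architectures.

```python
# High-level data-flow sketch of the zero-shot TTS path; each module is a stub.
import torch
import torch.nn as nn

class PipelineSketch(nn.Module):
    def __init__(self, hidden: int = 192, spk_dim: int = 256):
        super().__init__()
        self.speaker_encoder = nn.Linear(80, spk_dim)        # stub for f_spk
        self.text_prior_encoder = nn.Linear(32, hidden)      # stub: text -> z_prior
        self.prosody_predictor = nn.Linear(hidden + spk_dim, hidden)
        self.intermediate_decoder = nn.Linear(hidden + spk_dim, hidden)
        self.decoder = nn.Linear(hidden, 1)                  # stub waveform decoder

    def forward(self, text_feat, ref_mel):
        g_continuous = self.speaker_encoder(ref_mel.mean(dim=1))   # speaker identity
        z_prior = self.text_prior_encoder(text_feat)               # first intermediate rep.
        spk = g_continuous.unsqueeze(1).expand(-1, z_prior.shape[1], -1)
        z_prosody = z_prior + self.prosody_predictor(torch.cat([z_prior, spk], -1))
        z_final = self.intermediate_decoder(torch.cat([z_prosody, spk], -1))
        return self.decoder(z_final).squeeze(-1)                   # waveform frames

sketch = PipelineSketch()
text_feat = torch.randn(2, 50, 32)     # stand-in phoneme features
ref_mel = torch.randn(2, 120, 80)      # stand-in reference audio features
wav_frames = sketch(text_feat, ref_mel)
```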
Referring now to
In some exemplary embodiments, a zero-shot text-to-speech (TTS) may be implemented based on the first embedding gcontinuous and the first intermediate representation zprior. Further, a zero-shot voice conversion (VC) may be implemented based on the first embedding gcontinuous, the first intermediate representation zprior, and the third intermediate representation zpost.
In some exemplary embodiments, a random speaker text-to-speech (TTS) may be implemented based on the second embedding gdiscrete and the first intermediate representation zprior. Further, a random speaker voice conversion (VC) may be implemented based on the second embedding gdiscrete, the first intermediate representation zprior, and the third intermediate representation zpost. In this case, a random seed may be input to the speaker quantizer 31.
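One way the random seed could drive the speaker quantizer's codebook, shown here only as an assumption since the disclosure does not specify the sampling scheme, is to draw random attention weights over the learned basis vectors and take their convex combination as a new speaker embedding.

```python
# Sketch of random speaker generation from the quantizer's codebook; the
# softmax-over-random-logits sampling is an assumption.
import torch

def random_speaker_embedding(codebook: torch.Tensor, seed: int) -> torch.Tensor:
    # codebook: (n, embed_dim) learned basis vectors b_1..b_n
    gen = torch.Generator().manual_seed(seed)
    logits = torch.randn(codebook.shape[0], generator=gen)
    weights = torch.softmax(logits, dim=0)       # random convex combination
    return weights @ codebook                    # g_discrete for a new, unseen speaker

codebook = torch.randn(64, 256)                  # stand-in for the trained codebook B
g_discrete = random_speaker_embedding(codebook, seed=1234)
```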
Referring now to
For a more detailed description of the above method, reference may be made to the description of the exemplary embodiments provided herein, and duplicative descriptions are omitted.
Referring to
The computing device 50 may include at least one of a processor 510, a memory 530, a user interface input device 540, a user interface output device 550, and a storage device 560 communicating via a bus 520. The computing device 50 may also include a network interface 570 electrically connected to the network 40. The network interface 570 may transmit or receive signals to and from other entities over the network 40.
The processor 510 may be implemented in various types, such as a microcontroller unit (MCU), an application processor (AP), a central processing unit (CPU), a graphics processing unit (GPU), a neural processing unit (NPU), and a quantum processing unit (QPU), and may be any semiconductor device that executes instructions stored in the memory 530 or the storage device 560. The processor 510 may be configured to implement the functions and methods described above with respect to
The memory 530 and the storage device 560 may include various forms of volatile or non-volatile storage media. For example, the memory may include a read-only memory (ROM) 531 and a random access memory (RAM) 532. In some exemplary embodiments, the memory 530 may be located inside or outside of the processor 510, and the memory 530 may be coupled to the processor 510 through various means already known in the art.
In some exemplary embodiments, at least some configurations or functions of the speech generation method and device according to the exemplary embodiments may be implemented as programs or software executing on the computing device 50, and the programs or software may be stored on a computer-readable medium. Specifically, the computer-readable medium according to the exemplary embodiment may record a program for executing the steps of implementing the speech generation method and device according to the exemplary embodiment, the program being executed by a computer including the processor 510 that executes programs or instructions stored in the memory 530 or the storage device 560.
In some exemplary embodiments, at least some configurations or functions of the speech generation method and device according to the exemplary embodiments may be implemented using hardware or a circuit of the computing device 50, or may be implemented as separate hardware or a circuit that may be electrically connected to the computing device 50.
According to the exemplary embodiments, pitch control may be implemented in zero-shot speech generation by integrating the TTS pipeline, the VC pipeline, and the prosody predictor, while new speaker identities may be introduced in speech generation by using the speaker quantizer's codebook. Furthermore, by building the jointly trainable pipeline, various applications, including zero-shot TTS, VC, random speaker generation, and prosody control, may be implemented with only a single-step training of the neural network.
Although the above exemplary embodiments of the present invention have been described in detail, the scope of the present invention is not limited thereto, but also includes various modifications and improvements by one of ordinary skill in the art utilizing the basic concepts of the present invention as defined in the following claims.