This application claims priority to and the benefit of Korean Patent Application No. 10-2024-0070691 filed in the Korean Intellectual Property Office on May 30, 2024, the entire contents of which are incorporated herein by reference.
The present disclosure relates to a speech generation method and device for performing zero-shot speech generation by using prosody control and random speaker generation.
Recently, speech synthesis technology using neural network models has been gaining attention. In particular, the emergence of text-to-speech (TTS) models based on deep learning has greatly improved the naturalness and fluency of synthesized speech. AI-based speech synthesis is being utilized in various fields, such as virtual assistants and voice guidance systems, and as the quality of speech synthesis improves, user experience, real-time responsiveness, and precision improve as well.
Zero-shot speech generation is a technique for synthesizing speech by using a speaker identity extracted from a given piece of audio. Deep neural network models that perform zero-shot speech generation may synthesize the voice and pronunciation of a new speaker without additional training, even if they have not encountered that speaker before. In other words, in zero-shot speech generation, the voice characteristics of a new speaker may be recognized by using pre-trained models, and natural speech may be generated based on the recognized voice characteristics; research is ongoing to improve the practicality of the technology.
The present disclosure attempts to provide a speech generation method and device capable of controlling prosody elements of generated speech, such as pitch, in zero-shot speech generation.
The present disclosure also attempts to provide a speech generation method and device capable of providing random generation of a speaker identity in zero-shot speech generation.
An exemplary embodiment of the present disclosure provides a speech generation method of performing zero-shot speech generation by using prosody control and random speaker generation, the speech generation method including: receiving paired text and speaker audio for an ith speaker and a jth utterance from a training set; inputting the speaker audio to a speaker encoder to perform extraction of a speaker identity, and obtaining a first embedding representing the speaker identity; inputting the first embedding to a speaker quantizer and obtaining a quantized second embedding; inputting the text and the first embedding to a text prior encoder and obtaining a first intermediate representation; inputting the first intermediate representation and the first embedding to a prosody predictor, adding a prosodic hidden representation to the first intermediate representation, and obtaining a second intermediate representation; inputting the second intermediate representation and the first embedding to an intermediate decoder, and obtaining a final representation; and converting the final representation to a waveform by using a decoder to generate speech.
In some exemplary embodiments, the speech generation method may further include: inputting a linear spectrogram and the first embedding to a speech post encoder and obtaining a third intermediate representation; and aligning the third intermediate representation to the first intermediate representation.
In some exemplary embodiments, the obtaining of the first embedding may further include introducing a first loss according to Equation 1 below.
Herein, g
In some exemplary embodiments, the obtaining of the second embedding may include: obtaining a second embedding according to Equation 2 below, based on an optimal weight for a basis vector found by a self-attention module of the speaker quantizer.
Herein, fquantize is the speaker quantizer, gcontinuous,i,j is the first embedding, Wquantize is a parameter of the speaker quantizer, and B is a codebook including n learnable vectors.
In some exemplary embodiments, the obtaining of the second embedding may include introducing a second loss according to Equation 3 below.
Herein, g
In some exemplary embodiments, the obtaining of the second intermediate representation may include generating predicted prosody values according to Equation 4 and Equation 5 below, and introducing a third loss.
Herein, f is the prosody predictor, zprior is the first intermediate representation, gcontinuous,i,j is the first embedding, W is a parameter of the prosody predictor, g
In some exemplary embodiments, the obtaining of the second intermediate representation may include introducing a fourth loss according to Equation 6 below.
Herein, g
In some exemplary embodiments, the obtaining of the final representation may include introducing a fifth loss according to Equation 7 below.
Herein, g
In some exemplary embodiments, the speech post encoder may include a context organizer for removing speaker information from the linear spectrogram while preserving context information; and a speaker organizer for implanting the speaker information into an output of the context organizer, and the obtaining of the third intermediate representation may include introducing a sixth loss according to Equation 8 below.
Herein, Wpost is a parameter of the speech post encoder, g
In some exemplary embodiments, the obtaining of the third intermediate representation may include introducing a seventh loss according to Equation 9 below,
Herein, Wpost is a parameter of the post-speech encoder, g
In some exemplary embodiments, the aligning of the third intermediate representation to the first intermediate representation may include introducing an eighth loss according to Equation 10 below,
Herein, g
In some exemplary embodiments, the first intermediate representation may be computed according to Equation 11 below,
Herein, fprior is the text prior encoder, xi,j is the text, gcontinuous,i,j is the first embedding, Wspk is a parameter of the speaker encoder, and Wprior may be a parameter of the text prior encoder.
In some exemplary embodiments, the third intermediate representation may be computed according to Equation 12 below.
Herein, fpost is the speech post encoder, yi,j is the speaker audio, gcontinuous,i,j is the first embedding, Wspk is a parameter of the speaker encoder, and Wpost is a parameter of the speech post encoder.
In some exemplary embodiments, the speech generation method may further include implementing zero-shot text-to-speech (TTS) based on the first embedding and the first intermediate representation, or implementing zero-shot voice conversion (VC) based on the first embedding, the first intermediate representation, and the third intermediate representation.
In some exemplary embodiments, the speech generation method may further include implementing random speaker text-to-speech (TTS) based on the second embedding and the first intermediate representation, or implementing random speaker voice conversion (VC) based on the second embedding, the first intermediate representation, and the third intermediate representation.
In some exemplary embodiments, the speech generation method may further include inputting a random seed into the speaker quantizer.
Another exemplary embodiment of the present disclosure provides a speech generation device that executes a program code loaded into one or more memory devices via one or more processors and performs zero-shot speech generation by using prosody control and random speaker generation, in which the program code is executed to: receive paired text and speaker audio for an ith speaker and a jth utterance from a training set; input the speaker audio to a speaker encoder to perform extraction of a speaker identity, and obtain a first embedding representing the speaker identity; input the first embedding to a speaker quantizer and obtain a quantized second embedding; input the text and the first embedding to a text prior encoder and obtain a first intermediate representation; input the first intermediate representation and the first embedding to a prosody predictor, add a prosodic hidden representation to the first intermediate representation, and obtain a second intermediate representation; input the second intermediate representation and the first embedding to an intermediate decoder and obtain a final representation; and convert the final representation to a waveform by using a decoder to generate speech.
In some exemplary embodiments, the program code may be executed to further input a linear spectrogram and the first embedding to a speech post encoder and obtain a third intermediate representation, and align the third intermediate representation to the first intermediate representation.
In some exemplary embodiments, the program code may be executed to further implement zero-shot text-to-speech (TTS) based on the first embedding and the first intermediate representation, or implement zero-shot voice conversion (VC) based on the first embedding, the first intermediate representation, and the third intermediate representation.
In some exemplary embodiments, the program code may be executed to further implement random speaker text-to-speech (TTS) based on the second embedding and the first intermediate representation, or implement random speaker voice conversion (VC) based on the second embedding, the first intermediate representation, and the third intermediate representation.
Hereinafter, the present invention will be described more fully with reference to the accompanying drawings, in which exemplary embodiments of the invention are shown. As those skilled in the art would realize, the described exemplary embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not restrictive. Like reference numerals designate like elements throughout the specification.
Throughout the specification and the claims, unless explicitly described to the contrary, the word “comprise”, and variations such as “comprises” or “comprising”, will be understood to imply the inclusion of stated elements but not the exclusion of any other elements. Terms including an ordinal number, such as first and second, are used for describing various constituent elements, but the constituent elements are not limited by the terms. The terms are used only to discriminate one constituent element from another constituent element.
Terms such as “. . . unit,” “. . . device,” and “module,” as used in the specification may refer to a unit capable of performing at least one function or operation described herein, which may be implemented in hardware or a circuit, software, or a combination of hardware or a circuit and software. Further, at least some configurations or functions of a speech generation method and device for performing zero-shot speech generation by using prosody control and random speaker generation according to exemplary embodiments described below may be implemented as a program or software, and the program or software may be stored on a computer-readable medium.
Referring to
The speech generation device 10 may perform zero-shot speech generation. Zero-shot speech generation is the generation of speech by using a representation of a speaker extracted from audio of a previously unseen speaker, and a neural network model performing the zero-shot speech generation does not need to undergo any adaptation process to learn and generate speech of a new speaker. The speech generation device 10 may include a speaker identity extraction module 110, a speaker identity quantization module 120, a text-to-speech (TTS) pipeline module 130, and a voice conversion (VC) pipeline module 140 to enable control of prosody elements including pitch, and provide random generation of speaker identities in zero-shot speech generation. Hereinafter, the speaker identity extraction module 110, the speaker identity quantization module 120, the TTS pipeline module 130, and the VC pipeline module 140 will be described with reference to
The speech generation device 10 may be trained to generate a speech ŷi,j that is similar to ground-truth yi,j, for given text xi,j and speaker audio yi,j as input, as follows.
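The relation referenced above is not reproduced in this text; based on the variable definitions given in the next paragraph, a plausible reconstruction (a hedged reading, not a quotation of the original equation) is:

```latex
% Plausible form of the training relation: the pipeline f maps paired text and
% speaker audio to generated speech that should approximate the ground truth.
\begin{equation*}
\hat{y}_{i,j} = f\bigl(x_{i,j},\, y_{i,j};\, W_{\mathrm{spk}},\, W_{\mathrm{t2s}}\bigr),
\qquad \hat{y}_{i,j} \approx y_{i,j}
\end{equation*}
```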
Herein, xi,j and yi,j may be the paired sentence and audio for the ith speaker and jth utterance from a specific training set X (where i and j are integers), f is the TTS pipeline, and Wspk and Wt2s may be parameters of the speaker encoder 20 and the TTS pipeline, respectively. Herein, the TTS pipeline may include a text prior encoder 22, a prosody predictor 23, an intermediate decoder 24, and a decoder 25. Meanwhile, a speech post encoder 26 may form a VC pipeline.
For example, the output zpost of the VC pipeline may be aligned with the output zprior of the frame decoder of the text prior encoder 22 of the TTS pipeline, and the output gdiscrete of the speaker quantizer 21 may follow the output gcontinuous of the speaker encoder 20. In this way, the speech generation device 10 may build a jointly trainable pipeline to realize various applications, such as zero-shot TTS, VC, random speaker generation, and prosody control, with only a single-step training of the neural network, and to this end, a total loss Ltotal for the single-step training is introduced in the following form.
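The expression itself is not reproduced in this text; one plausible reading, assuming the hyperparameters weight their respective loss terms in a simple sum, is the following reconstruction:

```latex
% Assumed weighted-sum form of the total loss for single-step joint training;
% the exact expression is not reproduced in the text above.
\begin{equation*}
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{TriniTTS}}
  + \lambda_{\mathrm{frame\_prosody}}\,\mathcal{L}_{\mathrm{frame\_prosody}}
  + \lambda_{\mathrm{adv\_spk}}\,\mathcal{L}_{\mathrm{adv\_spk}}
  + \lambda_{\mathrm{spk}}\,\mathcal{L}_{\mathrm{spk}}
  + \lambda_{\mathrm{quantize}}\,\mathcal{L}_{\mathrm{quantize}}
  + \lambda_{\mathrm{spk\_classification}}\,\mathcal{L}_{\mathrm{spk\_classification}}
\end{equation*}
```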
Herein, LTriniTTS is a loss term for the TTS on which the speech generation device 10 according to the exemplary embodiments is based, and each of Lframe_prosody, Ladv_spk, Lspk, Lquantize, and Lspk_classification will be described later. λframe_prosody, λadv_spk, λspk, λquantize, and λspk_classification may be hyperparameters for their respective loss terms.
The speaker identity extraction module 110 may perform speaker identity extraction by inputting the speaker audio yi,j to the speaker encoder 20, and obtain a first embedding gcontinuous representing the speaker identity. The speaker encoder 20 is a pre-trained speaker recognition model, and in some exemplary embodiments, the speaker encoder 20 may be trained by using an angular prototypical loss.
The speaker encoder 20 may receive speaker audio yi,j as input and output a representation of the speaker identity gcontinuous, as follows.
gcontinuous,i,j = fspk(yi,j; Wspk)
Herein, fspk is the speaker encoder 20, and Wspk may be a parameter of the speaker encoder 20. Since the speaker encoder 20 aims to extract only speaker-specific information, the extracted speaker embedding gcontinuous,i,j may be approximated to the speaker identity si of the ith speaker among all speakers in the training set S as follows: gcontinuous,i,j ≈ si, where si ∼ S. In the speaker encoder 20, the extracted representation gcontinuous may be introduced as a condition for the normalization layers of the elements in the TTS pipeline and the VC pipeline. The extracted representation gcontinuous may also be used as a target for the output gdiscrete of the speaker quantizer 21. To make the speaker embedding (i.e., the first embedding) more discriminative, a speaker classifier may be added to the output of the speaker encoder 20, and a loss Lspk_classification may be introduced for the speaker classifier according to Equation 1 below.
Herein, g
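As a concrete illustration of the speaker classifier mentioned above, the following is a minimal sketch assuming Equation 1 is a standard cross-entropy speaker-classification loss on the first embedding; the layer sizes, speaker count, and PyTorch usage are illustrative assumptions rather than details taken from the disclosure.

```python
# Minimal sketch: a speaker classifier head on the speaker encoder output,
# assuming a cross-entropy loss; dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerClassifier(nn.Module):
    def __init__(self, embed_dim: int = 256, num_speakers: int = 100):
        super().__init__()
        self.proj = nn.Linear(embed_dim, num_speakers)

    def forward(self, g_continuous: torch.Tensor) -> torch.Tensor:
        # g_continuous: (batch, embed_dim) speaker embedding from the speaker encoder
        return self.proj(g_continuous)

# Hypothetical usage: make the first embedding more speaker-discriminative.
classifier = SpeakerClassifier(embed_dim=256, num_speakers=100)
g_continuous = torch.randn(8, 256)            # stand-in for f_spk(y_ij; W_spk)
speaker_ids = torch.randint(0, 100, (8,))     # ground-truth speaker index i
logits = classifier(g_continuous)
loss_spk_classification = F.cross_entropy(logits, speaker_ids)
```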
The speaker identity quantization module 120 may input the first embedding gcontinuous obtained from the speaker encoder 20 to the speaker quantizer 21 and obtain a quantized second embedding gdiscrete. The speaker quantizer 21 may aim to reconstruct the extracted gcontinuous as a weighted sum of basis vectors bi, i=1, . . . , n corresponding to a set of n learnable vectors from a codebook B. To this end, a second embedding gdiscrete may be obtained according to Equation 2 below based on an optimal weight wi for the basis vector bi found by the self-attention module of the speaker quantizer 21. That is, the reconstructed second embedding gdiscrete may be computed as the sum of the products of the basis vector bi and the weight wi computed from the attention layer.
Herein, fquantize may be the speaker quantizer 21, gcontinuous,i,j may be the first embedding, Wquantize may be a parameter of the speaker quantizer 21, and B may be a codebook including n learnable vectors. To train the speaker quantizer 21, a speaker quantization loss, that is, a loss Lquantize, may be introduced between gcontinuous and gdiscrete by using a mean square error, according to Equation 3 below.
Herein, g
To prevent the loss from affecting the parameter Wspk of the speaker encoder 20, gcontinuous may be separated from the speaker quantization loss.
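The quantization step described above can be sketched as follows, under the assumptions that a single-head attention computes the weights over the codebook and that Equation 3 is a mean square error against a detached gcontinuous; the dimensions and module layout are illustrative, not taken from the disclosure.

```python
# Sketch of the speaker quantizer: g_discrete is an attention-weighted sum of
# n learnable codebook vectors, and the quantization loss uses a detached
# g_continuous so gradients do not reach the speaker encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerQuantizer(nn.Module):
    def __init__(self, embed_dim: int = 256, codebook_size: int = 64):
        super().__init__()
        # Codebook B of n learnable basis vectors b_1..b_n.
        self.codebook = nn.Parameter(torch.randn(codebook_size, embed_dim))
        self.query_proj = nn.Linear(embed_dim, embed_dim)
        self.key_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, g_continuous: torch.Tensor) -> torch.Tensor:
        # g_continuous: (batch, embed_dim)
        q = self.query_proj(g_continuous)            # (batch, embed_dim)
        k = self.key_proj(self.codebook)             # (n, embed_dim)
        scores = q @ k.t() / q.shape[-1] ** 0.5      # (batch, n)
        weights = F.softmax(scores, dim=-1)          # weights w_i over the codebook
        return weights @ self.codebook               # sum_i w_i * b_i

quantizer = SpeakerQuantizer()
g_continuous = torch.randn(8, 256)
g_discrete = quantizer(g_continuous)
# Detach the target so the loss does not update the speaker encoder (W_spk).
loss_quantize = F.mse_loss(g_discrete, g_continuous.detach())
```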
The TTS pipeline module 130 may input the text xi,j and the first embedding gcontinuous to the text prior encoder 22 and obtain the first intermediate representation zprior. The text prior encoder 22 may include a phoneme encoder, an alignment search module, a duration predictor, and a frame decoder. In the phoneme encoder, the alignment search module, and the duration predictor, the first embedding, that is, the speaker embedding gcontinuous, may be given as a condition for the normalization layers assigned to the phoneme encoder, the alignment search module, and the duration predictor.
The frame decoder may receive as input an extended text hidden representation htext_extended, which is obtained by repeating the phoneme encoder's text representation along the time dimension so that its length is extended from the number of tokens to the number of frames, and the first embedding gcontinuous. In some exemplary embodiments, the architecture of the frame decoder may be the same as that of the phoneme encoder.
The TTS pipeline module 130 may input the first intermediate representation zprior and the first embedding gcontinuous to the prosody predictor 23, and add the prosodic hidden representations hpitch and henergy to the first intermediate representation zprior to obtain a second intermediate representation zprosody. The prosody predictor 23 may receive as input the output zprior of the frame decoder and the speaker embedding gcontinuous. The main function of the prosody predictor 23 is to generate a predicted pitch value {circumflex over (x)}pitch and a predicted energy value {circumflex over (x)}energy. The normalized pitch and energy values xpitch and xenergy extracted from the ground-truth audio may be used as targets during training. When it is assumed that n is the length of the text token sequence xi,j and m is the number of mel frames of the audio yi,j, the loss term may be computed based on the number of mel frames of yi,j rather than the length of xi,j. In some exemplary embodiments, the prosody prediction loss may be computed at the token level or at the frame level.
Specifically, a predicted prosody value {circumflex over (x)}prosody may be generated and a loss Lframe_prosody may be introduced according to Equation 4 and Equation 5 below.
Herein, f is the prosody predictor 23, zprior is the first intermediate representation, gcontinuous,i,j may be the first embedding, W may be the parameter of the prosody predictor 23, g
In addition, a loss Ltoken_prosody may be introduced according to Equation 6 below.
Herein, g
In the training stage, the ground-truth prosody values xpitch and xenergy may be delivered to the prosody encoder to generate the prosodic hidden representations hpitch and henergy. These hidden representations may then be added to the output zprior of the frame decoder. In the inference stage, however, the predicted prosody values {circumflex over (x)}pitch and {circumflex over (x)}energy are delivered to the prosody encoder, and the prosody may be controlled by adjusting these values with control parameters.
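To make the prosody path concrete, the following sketch shows a prosody predictor producing pitch/energy estimates from zprior and the speaker embedding, a prosody encoder turning prosody values into hidden representations added to zprior, and a hypothetical control parameter scaling the predicted pitch at inference; all module internals (the Conv1d stacks, dimensions, and the pitch_scale name) are assumptions for illustration.

```python
# Sketch of the prosody predictor / prosody encoder interaction described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodyPredictor(nn.Module):
    def __init__(self, hidden: int = 192, spk_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(hidden + spk_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 2, kernel_size=3, padding=1),  # pitch + energy channels
        )

    def forward(self, z_prior, g_continuous):
        # z_prior: (batch, hidden, frames); g_continuous: (batch, spk_dim)
        spk = g_continuous.unsqueeze(-1).expand(-1, -1, z_prior.shape[-1])
        out = self.net(torch.cat([z_prior, spk], dim=1))
        return out[:, 0], out[:, 1]                           # predicted pitch, energy

class ProsodyEncoder(nn.Module):
    def __init__(self, hidden: int = 192):
        super().__init__()
        self.net = nn.Conv1d(2, hidden, kernel_size=3, padding=1)

    def forward(self, pitch, energy):
        # Returns a prosodic hidden representation to add to z_prior.
        return self.net(torch.stack([pitch, energy], dim=1))

predictor, encoder = ProsodyPredictor(), ProsodyEncoder()
z_prior = torch.randn(4, 192, 120)
g_continuous = torch.randn(4, 256)
pitch_hat, energy_hat = predictor(z_prior, g_continuous)

# Training-style frame-level prosody loss against normalized ground-truth values.
x_pitch, x_energy = torch.randn(4, 120), torch.randn(4, 120)
loss_frame_prosody = F.mse_loss(pitch_hat, x_pitch) + F.mse_loss(energy_hat, x_energy)

# Inference-style control: scale the predicted pitch before encoding it and
# adding the prosodic hidden representation to z_prior.
pitch_scale = 1.2   # hypothetical user control parameter
z_prosody = z_prior + encoder(pitch_hat * pitch_scale, energy_hat)
```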
The TTS pipeline module 130 may input the second intermediate representation zprosody and the first embedding gcontinuous to the intermediate decoder 24 to obtain a final representation zfinal. The intermediate decoder 24 may receive the output zprosody of the prosody predictor 23 along with the speaker embedding gcontinuous as input. In some exemplary embodiments, the intermediate decoder 24 may include a fully convolutional neural network with residual connections to capture local information. The intermediate decoder 24 may be the final stage of the intermediate representation before up-sampling to the waveform is performed. To align the output zfinal of the intermediate decoder 24 with the mel-spectrogram xmel of the ground-truth audio by using a mean square error, a loss Lintermediate may be introduced according to Equation 7 below.
Herein, g
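The fully convolutional residual design mentioned for the intermediate decoder 24 could look like the following sketch; the channel counts, kernel sizes, and the way the speaker embedding conditions the block are assumptions, not details from the disclosure.

```python
# Sketch of one residual convolutional block of the kind the intermediate
# decoder is described as using.
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, channels: int = 192, spk_dim: int = 256):
        super().__init__()
        self.spk_proj = nn.Linear(spk_dim, channels)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, z, g_continuous):
        # z: (batch, channels, frames); g_continuous: (batch, spk_dim)
        cond = self.spk_proj(g_continuous).unsqueeze(-1)   # broadcast over frames
        h = self.act(self.conv1(z + cond))
        h = self.conv2(h)
        return z + h                                       # residual connection

block = ResidualConvBlock()
z_prosody = torch.randn(4, 192, 120)
g_continuous = torch.randn(4, 256)
z_final = block(z_prosody, g_continuous)
# The intermediate loss then compares z_final (after a projection to mel bins,
# omitted here) with the ground-truth mel-spectrogram via mean square error.
```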
The TTS pipeline module 130 may convert the final representation zfinal to a waveform to generate the speech ŷi,j by using the decoder 25. In some exemplary embodiments, the decoder 25 may be implemented as a generative adversarial network (GAN)-based decoder.
The VC pipeline module 140 may input a linear spectrogram xspec and the first embedding gcontinuous to the speech post encoder 26, obtain the third intermediate representation zpost, and align the third intermediate representation zpost to the first intermediate representation zprior. The speech post encoder 26 may learn the intermediate representations of the TTS pipeline on-the-fly during training. The speech post encoder 26 may receive as input the linear spectrogram xspec and the speaker embedding gcontinuous, and output a latent variable zpost that best represents the context and speaker information. The speech post encoder 26 may include a context organizer and a speaker organizer.
The context organizer may remove speaker information from the linear spectrogram xspec while preserving contextual information. On the other hand, the speaker organizer may implant speaker information into the output of the context organizer. To ensure that the speaker information is removed after the context organizer, an adversarial speaker classifier may be added to the output of the context organizer. In this regard, a loss Ladv_spk may be introduced according to Equation 8 below.
Herein, Wpost is the parameter of the speech post encoder 26, g
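The disclosure does not state how the adversarial speaker classifier is trained; one common realization, shown here purely as an assumption, is a gradient reversal layer that lets the classifier learn to identify the speaker while the reversed gradients push the context organizer to discard speaker information.

```python
# Sketch of an adversarial speaker classifier via gradient reversal (an
# assumed realization, not stated in the disclosure).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale: float = 1.0):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and optionally scale) the gradient flowing to the context organizer.
        return -ctx.scale * grad_output, None

class AdversarialSpeakerClassifier(nn.Module):
    def __init__(self, hidden: int = 192, num_speakers: int = 100):
        super().__init__()
        self.proj = nn.Linear(hidden, num_speakers)

    def forward(self, context_out):
        # context_out: (batch, frames, hidden) output of the context organizer
        reversed_feat = GradReverse.apply(context_out)
        return self.proj(reversed_feat.mean(dim=1))   # utterance-level speaker logits

clf = AdversarialSpeakerClassifier()
context_out = torch.randn(4, 120, 192)
speaker_ids = torch.randint(0, 100, (4,))
loss_adv_spk = F.cross_entropy(clf(context_out), speaker_ids)
```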
Further, a speaker classifier may be added to the output of the speaker organizer to implant speaker information from a target speaker reference. In this regard, a loss Lspk may be introduced according to Equation 9 below.
Herein, Wpost is the parameter of the speech post encoder 26, g
To ensure that the output zpost of the speaker organizer after the context organizer is aligned with the output zprior of the frame decoder, a loss Lbridge may be introduced according to Equation 10 below.
Herein, g
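Assuming the bridge loss follows the same mean-square-error pattern as the other alignment losses described above, Equation 10 may take a form such as the following; the exact expression, including any stop-gradient placement, is an assumption.

```latex
% Assumed mean-square-error form of the bridge loss aligning the VC-pipeline
% representation with the TTS-pipeline representation.
\begin{equation*}
\mathcal{L}_{\mathrm{bridge}} = \bigl\lVert z_{\mathrm{post}} - z_{\mathrm{prior}} \bigr\rVert_2^2
\end{equation*}
```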
In some exemplary embodiments, the first intermediate representation zprior may be computed according to Equation 11 below.
Herein, fprior is the text prior encoder 22, xi,j is the text, gcontinuous,i,j is the first embedding, Wspk is a parameter of the speaker encoder 20, and Wprior may be a parameter of the text prior encoder 22.
In some exemplary embodiments, the third intermediate representation zpost may be computed according to Equation 12 below.
Herein, fpost may be the speech post encoder 26, yi,j may be the speaker audio, gcontinuous,i,j may be the first embedding, Wspk may be a parameter of the speaker encoder 20, and Wpost may be a parameter of the speech post encoder 26.
According to the present exemplary embodiment, pitch control may be implemented in zero-shot speech generation by integrating the TTS pipeline, the VC pipeline, and the prosody predictor, while new speaker identities may be introduced in speech generation by using the codebook of the speaker quantizer. Furthermore, by building a jointly trainable pipeline, various applications, including zero-shot TTS, VC, random speaker generation, and prosody control, may be implemented with only a single-step training of the neural network. Furthermore, by sharing the prosody predictor and the decoder, computational costs may be reduced and efficiency increased.
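For orientation, the following stub sketches how the described modules chain together on the zero-shot TTS path, with every sub-module replaced by a simple placeholder so that only the data flow is visible; the shapes and placeholder layers are illustrative assumptions, not the actual architectures.

```python
# High-level data-flow sketch of the zero-shot TTS path; each module is a stub.
import torch
import torch.nn as nn

class PipelineSketch(nn.Module):
    def __init__(self, hidden: int = 192, spk_dim: int = 256):
        super().__init__()
        self.speaker_encoder = nn.Linear(80, spk_dim)        # stub for f_spk
        self.text_prior_encoder = nn.Linear(32, hidden)      # stub: text -> z_prior
        self.prosody_predictor = nn.Linear(hidden + spk_dim, hidden)
        self.intermediate_decoder = nn.Linear(hidden + spk_dim, hidden)
        self.decoder = nn.Linear(hidden, 1)                  # stub waveform decoder

    def forward(self, text_feat, ref_mel):
        g_continuous = self.speaker_encoder(ref_mel.mean(dim=1))   # speaker identity
        z_prior = self.text_prior_encoder(text_feat)               # first intermediate rep.
        spk = g_continuous.unsqueeze(1).expand(-1, z_prior.shape[1], -1)
        z_prosody = z_prior + self.prosody_predictor(torch.cat([z_prior, spk], -1))
        z_final = self.intermediate_decoder(torch.cat([z_prosody, spk], -1))
        return self.decoder(z_final).squeeze(-1)                   # waveform frames

sketch = PipelineSketch()
text_feat = torch.randn(2, 50, 32)     # stand-in phoneme features
ref_mel = torch.randn(2, 120, 80)      # stand-in reference audio features
wav_frames = sketch(text_feat, ref_mel)
```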
Referring now to
In some exemplary embodiments, a zero-shot text-to-speech (TTS) may be implemented based on the first embedding gcontinuous and the first intermediate representation zprior. Further, a zero-shot voice conversion (VC) may be implemented based on the first embedding gcontinuous, the first intermediate representation zprior, and the third intermediate representation zpost.
In some exemplary embodiments, a random speaker text-to-speech (TTS) may be implemented based on the second embedding gdiscrete and the first intermediate representation zprior. Further, a random speaker voice conversion (VC) may be implemented based on the second embedding gdiscrete, the first intermediate representation zprior, and the third intermediate representation zpost. In this case, a random seed may be input to the speaker quantizer 31.
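One way the random seed could drive the speaker quantizer's codebook, shown here only as an assumption since the disclosure does not specify the sampling scheme, is to draw random attention weights over the learned basis vectors and take their convex combination as a new speaker embedding.

```python
# Sketch of random speaker generation from the quantizer's codebook; the
# softmax-over-random-logits sampling is an assumption.
import torch

def random_speaker_embedding(codebook: torch.Tensor, seed: int) -> torch.Tensor:
    # codebook: (n, embed_dim) learned basis vectors b_1..b_n
    gen = torch.Generator().manual_seed(seed)
    logits = torch.randn(codebook.shape[0], generator=gen)
    weights = torch.softmax(logits, dim=0)       # random convex combination
    return weights @ codebook                    # g_discrete for a new, unseen speaker

codebook = torch.randn(64, 256)                  # stand-in for the trained codebook B
g_discrete = random_speaker_embedding(codebook, seed=1234)
```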
Referring now to
For a more detailed description of the above method, reference may be made to the description of the exemplary embodiments provided herein, and duplicative descriptions are omitted.
Referring to
The computing device 50 may include at least one of a processor 510, a memory 530, a user interface input device 540, a user interface output device 550, and a storage device 560 communicating via a bus 520. The computing device 50 may also include a network interface 570 electrically connected to the network 40. The network interface 570 may transmit or receive signals to and from other entities over the network 40.
The processor 510 may be implemented in various types, such as a microcontroller unit (MCU), an application processor (AP), a central processing unit (CPU), a graphics processing unit (GPU), a neural processing unit (NPU), and a quantum processing unit (QPU), and may be any semiconductor device that executes instructions stored in the memory 530 or the storage device 560. The processor 510 may be configured to implement the functions and methods described above with respect to
The memory 530 and the storage device 560 may include various forms of volatile or non-volatile storage media. For example, the memory may include a read-only memory (ROM) 531 and a random access memory (RAM) 532. In some exemplary embodiments, the memory 530 may be located inside or outside of the processor 510, and the memory 530 may be coupled to the processor 510 through various means already known in the art.
In some exemplary embodiments, at least some configurations or functions of the speech generation method and device according to the exemplary embodiments may be implemented as programs or software executing on the computing device 50, and the programs or software may be stored on a computer-readable medium. Specifically, the computer-readable medium according to the exemplary embodiment may record a program for executing the steps of implementing the speech generation method and device according to the exemplary embodiment, the program being executed by a computer including the processor 510 that executes programs or instructions stored in the memory 530 or the storage device 560.
In some exemplary embodiments, at least some configurations or functions of the speech generation method and device according to the exemplary embodiments may be implemented using hardware or a circuit of the computing device 50, or may be implemented as separate hardware or a circuit that may be electrically connected to the computing device 50.
According to the exemplary embodiments, pitch control may be implemented in zero-shot speech generation by integrating the TTS pipeline, the VC pipeline, and the prosody predictor, while new speaker identities may be introduced in speech generation by using the speaker quantizer's codebook. Furthermore, by building the jointly trainable pipeline, various applications, including zero-shot TTS, VC, random speaker generation, and prosody control, may be implemented with only a single-step training of the neural network.
Although the above exemplary embodiments of the present invention have been described in detail, the scope of the present invention is not limited thereto, but also includes various modifications and improvements by one of ordinary skill in the art utilizing the basic concepts of the present invention as defined in the following claims.