Method and Device for Zero-Shot Speech Generation with Prosody Control and Random Speaker Generation

Information

  • Patent Application
  • Publication Number: 20240404509
  • Date Filed: May 30, 2024
  • Date Published: December 5, 2024
Abstract
Disclosed are a speech generation method and device for performing zero-shot speech generation by using prosody control and random speaker generation. The speech generation method may include: receiving paired text and speaker audio for an ith speaker and a jth utterance from a training set; inputting the speaker audio to a speaker encoder to perform extraction of a speaker identity, and obtaining first embedding representing a representation of the speaker identity; inputting the first embedding to a speaker quantizer and obtaining quantized second embedding; inputting the text and the first embedding to a text prior encoder and obtaining a first intermediate representation; inputting the first intermediate representation and the first embedding to a prosody predictor, adding a prosodic hidden representation to the first intermediate representation, and obtaining a second intermediate representation; inputting the second intermediate representation and the first embedding to an intermediate decoder, and obtaining a final representation; and converting the final representation to a waveform by using the decoder to generate speech.
Description
CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to and the benefit of Korean Patent Application No. 10-2024-0070691 filed in the Korean Intellectual Property Office on May 30, 2024, the entire contents of which are incorporated herein by reference.


BACKGROUND
(a) Field

The present disclosure relates to a speech generation method and device for performing zero-shot speech generation by using prosody control and random speaker generation.


(b) Description of the Related Art

Recently, speech synthesis technology using neural network models has been gaining traction. In particular, the emergence of text-to-speech (TTS) models based on deep learning has greatly improved the naturalness and fluency of synthesized speech. AI-based speech synthesis is being utilized in various fields, such as virtual assistants and voice guidance systems, and as the quality of speech synthesis improves, user experience, real-time responsiveness, and precision are improving.


Zero-shot speech generation is a technique for synthesizing speech by using speaker identities that are extracted from a given piece of audio. Notably, deep neural network models that perform zero-shot speech generation may synthesize the voice and pronunciation of a new speaker without additional training, even if the models have never encountered the speaker before. In other words, in zero-shot speech generation, the voice characteristics of a new speaker may be recognized by using pre-trained models, and natural speech may be generated based on the recognized voice characteristics; research is ongoing to improve the practicality of the technology.


SUMMARY

The present disclosure attempts to provide a speech generation method and device capable of controlling prosody elements of generated speech, such as pitch, in zero-shot speech generation.


The present disclosure also attempts to provide a speech generation method and device capable of providing random generation of a speaker identity in zero-shot speech generation.


An exemplary embodiment of the present disclosure provides a speech generation method of performing zero-shot speech generation by using prosody control and random speaker generation, the speech generation method including: receiving paired text and speaker audio for an ith speaker and a jth utterance from a training set; inputting the speaker audio to a speaker encoder to perform extraction of a speaker identity, and obtaining first embedding representing a representation of the speaker identity; inputting the first embedding to a speaker quantizer and obtaining quantized second embedding; inputting the text and the first embedding to a text prior encoder and obtaining a first intermediate representation; inputting the first intermediate representation and the first embedding to a prosody predictor, adding a prosodic hidden representation to the first intermediate representation, and obtaining a second intermediate representation; inputting the second intermediate representation and the first embedding to an intermediate decoder, and obtaining a final representation; and converting the final representation to a waveform by using the decoder to generate speech.


In some exemplary embodiments, the speech generation method may further include: inputting a linear spectrogram and the first embedding to a speech post encoder and obtaining a third intermediate representation; and aligning the third intermediate representation to the first intermediate representation.


In some exemplary embodiments, the obtaining of the first embedding may further include introducing a first loss according to Equation 1 below.












$\mathcal{L}_{spk\_classification} = \mathbb{E}_{g_{continuous,i,j} \sim S}\left\{ -\sum_{i=1}^{C} l_i \log\left( f_{spk}\left( g_{continuous,i,j}; W_{spk} \right) \right) \right\}$   (Equation 1)







Herein, $\mathbb{E}_{g_{continuous,i,j} \sim S}$ is a notation for an expected value given the training set and the first embedding as input, C is the number of speakers in the training set, li is a one-hot vector of the ith speaker, fspk is the speaker encoder, gcontinuous,i,j is the first embedding, and Wspk is a parameter of the speaker encoder.


In some exemplary embodiments, the obtaining of the second embedding may include: obtaining a second embedding according to Equation 2 below, based on an optimal weight for a basis vector found by a self-attention module of the speaker quantizer.











$g_{discrete,i,j} = f_{quantize}\left( g_{continuous,i,j}; W_{quantize}, B \right) = \sum_{i=1}^{n} w_i b_i, \qquad w_i \in W_{quantize}, \; b_i \in B, \; \sum_{i=1}^{n} w_i = 1$   (Equation 2)







Herein, fquantize is the speaker quantizer, gcontinuous,i,j is the first embedding, Wquantize is a parameter of the speaker quantizer, and B is a codebook including n learnable vectors.


In some exemplary embodiments, the obtaining of the second embedding may include introducing a second loss according to Equation 3 below.











$\mathcal{L}_{quantize} = \mathbb{E}_{g_{continuous,i,j} \sim S,\; y_{i,j} \sim X}\left\{ L\left( g_{discrete,i,j},\, g_{continuous,i,j} \right) \right\}$   (Equation 3)







Herein, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; y_{i,j} \sim X}$ is a notation for an expected value given the training set and the first embedding as input, L is a mean squared error, gdiscrete,i,j is the second embedding, and gcontinuous,i,j is the first embedding.


In some exemplary embodiments, the obtaining of the second intermediate representation may include generating predicted prosody values according to Equation 4 and Equation 5 below, and introducing a third loss.











$\hat{x}_{prosody} = f\left( z_{prior},\, g_{continuous,i,j}; W \right)$   (Equation 4)

$\mathcal{L}_{frame\_prosody} = \mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}\left\{ \sum_{i=1}^{m} L\left( \hat{x}_{prosody,m},\, x_{prosody,m} \right) \right\}$   (Equation 5)







Herein, f is the prosody predictor, zprior is the first intermediate representation, gcontinuous,i,j is the first embedding, W is a parameter of the prosody predictor, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}$ is a notation representing the expected value given the training set and the first embedding as input, L is a mean square error, and xprosody is an actual prosody value.


In some exemplary embodiments, the obtaining of the second intermediate representation may include introducing a fourth loss according to Equation 6 below.












$\mathcal{L}_{token\_prosody} = \mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}\left\{ \sum_{k=1}^{n} L\left( \frac{\sum_{l \in a_k} \hat{x}_{prosody,l}}{\lvert a_k \rvert},\; \frac{\sum_{l \in a_k} x_{prosody,l}}{\lvert a_k \rvert} \right) \right\}$   (Equation 6)







Herein, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}$ is a notation for an expected value given the training set and the first embedding as input, L is a mean squared error, ak is a frame sequence corresponding to the kth token of a sentence in a duration alignment between a token of the text and a mel-frame of the speaker audio, x̂prosody is a predicted prosody value, and xprosody is an actual prosody value.


In some exemplary embodiments, the obtaining of the final representation may include introducing a fifth loss according to Equation 7 below.











$\mathcal{L}_{intermediate} = \mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}\left\{ L\left( f\left( z_{prosody},\, g_{continuous,i,j}; W \right),\; x_{Mel} \right) \right\}$   (Equation 7)







Herein, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}$ is a notation for an expected value given the training set and the first embedding as input, L is a mean squared error, f is the intermediate decoder, zprosody is the second intermediate representation, gcontinuous,i,j is the first embedding, W is a parameter of the intermediate decoder, and xMel is a Mel spectrogram of an actual speaker audio.


In some exemplary embodiments, the speech post encoder may include a context organizer for removing speaker information from the linear spectrogram while preserving context information; and a speaker organizer for implanting the speaker information into an output of the context organizer, and the obtaining of the third intermediate representation may include introducing a sixth loss according to Equation 8 below.












$\mathcal{L}_{adv\_spk} = \max_{W_{post}} \; \mathbb{E}_{g_{continuous,i,j} \sim S,\; y_{i,j} \sim X}\left\{ -\sum_{i=1}^{C} l_i \log\left( f_{context}\left( y_{i,j}; W_{post\_context} \right) \right) \right\}$   (Equation 8)







Herein, Wpost is a parameter of the speech post encoder, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; y_{i,j} \sim X}$ is a notation for an expected value given the training set and the first embedding as input, C is the number of speakers in the training set, li is a one-hot vector of the ith speaker, fcontext is the context organizer, yi,j is the speaker audio, and Wpost_context is a parameter of the context organizer.


In some exemplary embodiments, the obtaining of the third intermediate representation may include introducing a seventh loss according to Equation 9 below,












$\mathcal{L}_{spk} = \max_{W_{post}} \; \mathbb{E}_{g_{continuous,i,j} \sim S,\; y_{i,j} \sim X}\left\{ -\sum_{i=1}^{C} l_i \log\left( f_{spk}\left( f_{context}\left( y_{i,j}; W_{post\_context} \right); W_{post\_spk} \right) \right) \right\}$   (Equation 9)







Herein, Wpost is a parameter of the speech post encoder, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; y_{i,j} \sim X}$ is a notation for an expected value given the training set and the first embedding as input, C is the number of speakers in the training set, li is a one-hot vector of the ith speaker, fspk is the speaker organizer, fcontext is the context organizer, yi,j is the speaker audio, Wpost_context is a parameter of the context organizer, and Wpost_spk is a parameter of the speaker organizer.


In some exemplary embodiments, the aligning of the third intermediate representation to the first intermediate representation may include introducing an eighth loss according to Equation 10 below,











$\mathcal{L}_{bridge} = \mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}\left\{ L\left( z_{post},\, z_{prior} \right) \right\}$   (Equation 10)







Herein, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}$ is a notation for an expected value given the training set and the first embedding as input, L is a mean squared error, zpost is the third intermediate representation, and zprior is the first intermediate representation.


In some exemplary embodiments, the first intermediate representation may be computed according to Equation 11 below,










$z_{prior} = f_{prior}\left( x_{i,j},\, g_{continuous,i,j}; W_{spk}, W_{prior} \right)$   (Equation 11)







Herein, fprior is the text prior encoder, xi,j is the text, gcontinuous,i,j is the first embedding, Wspk is a parameter of the speaker encoder, and Wprior may be a parameter of the text prior encoder.


In some exemplary embodiments, the third intermediate representation may be computed according to Equation 12 below.










$z_{post} = f_{post}\left( y_{i,j},\, g_{continuous,i,j}; W_{spk}, W_{post} \right)$   (Equation 12)







Herein, fpost is the speech post encoder, yi,j is the speaker audio, gcontinuous,i,j is the first embedding, Wspk is a parameter of the speaker encoder, and Wpost is a parameter of the speech post encoder.


In some exemplary embodiments, the speech generation method may further include implementing zero-shot text-to-speech (TTS) based on the first embedding and the first intermediate representation, or implementing zero-shot voice conversion (VC) based on the first embedding, the first intermediate representation, and the third intermediate representation.


In some exemplary embodiments, the speech generation method may further include implementing random speaker text-to-speech (TTS) based on the second embedding and the first intermediate representation, or implementing random speaker voice conversion (VC) based on the second embedding, the first intermediate representation, and the third intermediate representation.


In some exemplary embodiments, the speech generation method may further include inputting a random seed into the speaker quantizer.


Another exemplary embodiment of the present disclosure provides a speech generation device that executes a program code loaded into one or more memory devices via one or more processors and performs zero-shot speech generation by using prosody control and random speaker generation, in which the program code is executed to: receive paired text and speaker audio for an ith speaker and a jth utterance from a training set; input the speaker audio to a speaker encoder to perform extraction of a speaker identity, and obtain first embedding representing a representation of the speaker identity; input the first embedding to a speaker quantizer and obtain quantized second embedding; input the text and the first embedding to a text prior encoder and obtain a first intermediate representation; input the first intermediate representation and the first embedding to a prosody predictor, and add a prosodic hidden representation to the first intermediate representation, and obtain a second intermediate representation; input the second intermediate representation and the first embedding to an intermediate decoder and obtain a final representation; and convert the final representation to a waveform by using the decoder to generate speech.


In some exemplary embodiments, the program code may be executed to further input a linear spectrogram and the first embedding to a speech post encoder and obtain a third intermediate representation, and align the third intermediate representation to the first intermediate representation.


In some exemplary embodiments, the program code may be executed to further implement zero-shot text-to-speech (TTS) based on the first embedding and the first intermediate representation, or implement zero-shot voice conversion (VC) based on the first embedding, the first intermediate representation, and the third intermediate representation.


In some exemplary embodiments, the program code may be executed to further implement random speaker text-to-speech (TTS) based on the second embedding and the first intermediate representation, or implement random speaker voice conversion (VC) based on the second embedding, the first intermediate representation, and the third intermediate representation.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a block diagram illustrating a speech generation device according to an exemplary embodiment.



FIG. 2 is a diagram illustrating a training pipeline of a speech generation device according to the exemplary embodiment.



FIG. 3 is a diagram illustrating an inference pipeline of the speech generation device according to the exemplary embodiment.



FIG. 4 is a flow diagram illustrating a speech generation method according to an exemplary embodiment.



FIG. 5 is a diagram illustrating a computing device according to an exemplary embodiment.





DETAILED DESCRIPTION

Hereinafter, the present invention will be described more fully with reference to the accompanying drawings, in which exemplary embodiments of the invention are shown. As those skilled in the art would realize, the described exemplary embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not restrictive. Like reference numerals designate like elements throughout the specification.


Throughout the specification and the claims, unless explicitly described to the contrary, the word “comprise”, and variations such as “comprises” or “comprising”, will be understood to imply the inclusion of stated elements but not the exclusion of any other elements. Terms including an ordinary number, such as first and second, are used for describing various constituent elements, but the constituent elements are not limited by the terms. The terms are used only to discriminate one constituent element from another constituent element.


Terms such as “ . . . unit,” “ . . . device,” and “module,” as used in the specification may refer to a unit capable of performing at least one function or operation described herein, which may be implemented in hardware or circuit, software, or a combination of hardware or circuit and software. Further, at least some configurations or functions of a speech generation method and device for performing zero-shot speech generation by using prosody control and random speaker generation according to exemplary embodiments described below may be implemented as a program or software, and the program or software may be stored on a computer-readable medium.



FIG. 1 is a block diagram illustrating a speech generation device according to an exemplary embodiment, and FIG. 2 is a diagram illustrating a training pipeline of a speech generation device according to the exemplary embodiment.


Referring to FIG. 1, a speech generation device 10 according to an exemplary embodiment may execute program codes loaded into one or more memory devices via one or more processors. For example, the speech generation device 10 may be implemented as a computing device 50, such as that described later with reference to FIG. 5. In this case, the one or more processors may correspond to the processor 510 of the computing device 50, and the one or more memory devices may correspond to the memory 530 of the computing device 50. The program code may be executed by the one or more processors to perform zero-shot speech generation by using prosody control and random speaker generation. In the present specification, the term “module” is used to logically distinguish the functions performed by the program code from each other.


The speech generation device 10 may perform zero-shot speech generation. Zero-shot speech generation is the generation of speech by using a representation of a speaker extracted from audio of a previously unseen speaker, and a neural network model performing the zero-shot speech generation does not need to undergo any adaptation process to learn and generate speech of a new speaker. The speech generation device 10 may include a speaker identity extraction module 110, a speaker identity quantization module 120, a text-to-speech (TTS) pipeline module 130, and a voice conversion (VC) pipeline module 140 to enable control of prosody elements including pitch, and provide random generation of speaker identities in zero-shot speech generation. Hereinafter, the speaker identity extraction module 110, the speaker identity quantization module 120, the TTS pipeline module 130, and the VC pipeline module 140 will be described with reference to FIGS. 1 and 2 together.


The speech generation device 10 may be trained to generate a speech ŷi,j that is similar to ground-truth yi,j, for given text xi,j and speaker audio yi,j as input, as follows.








$\hat{y}_{i,j} = f\left( (x_{i,j}, y_{i,j}) \sim X;\; W_{spk}, W_{t2s} \right)$





Herein, xi,j and yi,j may be paired sentence and audio for the ith speaker and jth utterance from a specific training set X (where i, j are integers). f is a TTS pipeline, and Wspk and Wt2s may be a parameter of the speaker encoder 20 and a parameter of the TTS pipeline, respectively. Herein, the TTS pipeline may include a text prior encoder 22, a prosody predictor 23, an intermediate decoder 24, and a decoder 25. Meanwhile, a speech post encoder 26 may form a VC pipeline.
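As a rough, non-authoritative sketch of how the training-time forward pass composes these modules, the following Python fragment uses hypothetical callables f_spk, f_prior, f_prosody, f_mid, and f_dec as stand-ins for the speaker encoder 20, text prior encoder 22, prosody predictor 23, intermediate decoder 24, and decoder 25 (names and signatures are assumptions, not the patent's implementation):

```python
def tts_forward(text_tokens, speaker_audio, f_spk, f_prior, f_prosody, f_mid, f_dec):
    """Minimal sketch of the TTS pipeline composition described above."""
    g_continuous = f_spk(speaker_audio)            # first embedding (speaker identity)
    z_prior = f_prior(text_tokens, g_continuous)   # first intermediate representation
    z_prosody = f_prosody(z_prior, g_continuous)   # second intermediate representation
    z_final = f_mid(z_prosody, g_continuous)       # final representation
    return f_dec(z_final)                          # waveform, i.e., the generated speech
```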


For example, the output zpost of the VC pipeline may be aligned with an output zprior of a frame decoder of the text prior encoder 22 of the TTS pipeline, and an output gdiscrete of a speaker quantizer 21 may follow an output gcontinuous of the speaker encoder 20. In this way, the speech generation device 10 may build a jointly trainable pipeline to realize various applications, such as zero-shot TTS, VC, random speaker generation, and prosody control, with only a single-step training for the neural network, and to this end, the total loss Ltotal for the single-step training is introduced in the following form.








$\mathcal{L}_{total} = \mathcal{L}_{TriniTTS} + \lambda_{frame\_prosody}\,\mathcal{L}_{frame\_prosody} + \lambda_{adv\_spk}\,\mathcal{L}_{adv\_spk} + \lambda_{spk}\,\mathcal{L}_{spk} + \lambda_{quantize}\,\mathcal{L}_{quantize} + \lambda_{spk\_classification}\,\mathcal{L}_{spk\_classification}$








Herein, $\mathcal{L}_{TriniTTS}$ is a loss term for the TTS on which the speech generation device 10 according to the exemplary embodiments is based, and each of $\mathcal{L}_{frame\_prosody}$, $\mathcal{L}_{adv\_spk}$, $\mathcal{L}_{spk}$, $\mathcal{L}_{quantize}$, and $\mathcal{L}_{spk\_classification}$ will be described later. λframe_prosody, λadv_spk, λspk, λquantize, and λspk_classification may be hyperparameters for their respective loss terms.
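For illustration only, a minimal sketch of how this weighted total loss could be assembled from the individual terms (the dictionary keys are hypothetical names chosen here, not identifiers from the patent):

```python
def total_loss(losses: dict, lambdas: dict):
    """Weighted sum of the loss terms; each lambda is a hyperparameter for its term."""
    total = losses["TriniTTS"]
    for name in ("frame_prosody", "adv_spk", "spk", "quantize", "spk_classification"):
        total = total + lambdas[name] * losses[name]
    return total
```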


The speaker identity extraction module 110 may perform speaker identity extraction by inputting speaker audio yi,j to the speaker encoder 20, and obtain a first embedding gcontinuous representing a representation of the speaker identity. The speaker encoder 20 is a pre-trained speaker recognition model, and in some exemplary embodiments, the speaker encoder 20 may be trained by using angular prototypical loss.


The speaker encoder 20 may receive speaker audio yi,j as input and output a representation of the speaker identity gcontinuous, as follows.






$g_{continuous,i,j} = f_{spk}\left( y_{i,j}; W_{spk} \right)$


Herein, fspk is the speaker encoder 20, and Wspk may be a parameter of the speaker encoder 20. Since the speaker encoder 20 aims to extract only speaker-specific information, the extracted speaker embedding gcontinuous,i,j may be approximated to the speaker identity si of the ith speaker among all speakers in the training set S, that is, gcontinuous,i,j ≈ si where si ∼ S. In the speaker encoder 20, the extracted representation gcontinuous may be introduced as a condition for the normalizing layer for the elements in the TTS pipeline and the VC pipeline. The extracted representation gcontinuous may also be used as a target for the output gdiscrete of the speaker quantizer 21. To make the speaker embedding (i.e., the first embedding) more discriminative, a speaker classifier may be added to the output of the speaker encoder 20, and a loss $\mathcal{L}_{spk\_classification}$ may be introduced for the speaker classifier according to Equation 1 below.












$\mathcal{L}_{spk\_classification} = \mathbb{E}_{g_{continuous,i,j} \sim S}\left\{ -\sum_{i=1}^{C} l_i \log\left( f_{spk}\left( g_{continuous,i,j}; W_{spk} \right) \right) \right\}$   (Equation 1)







Herein, $\mathbb{E}_{g_{continuous,i,j} \sim S}$ may be a notation for the expected value given the training set S and the first embedding as input, C may be the number of speakers in the training set S, li may be the one-hot vector of the ith speaker, fspk may be the speaker encoder 20, gcontinuous,i,j may be the first embedding, and Wspk may be a parameter of the speaker encoder 20.
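As a sketch, Equation 1 reduces to a standard cross-entropy over the C speakers when the one-hot vector li selects the true speaker; a minimal PyTorch version, assuming the classifier has already produced logits, could look like this:

```python
import torch
import torch.nn.functional as F


def speaker_classification_loss(logits: torch.Tensor, speaker_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy form of Equation 1: -sum_i l_i * log p_i, averaged over the batch.

    logits: (batch, C) classifier outputs; speaker_ids: (batch,) integer speaker labels.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Selecting the true speaker's log-probability equals the one-hot inner product.
    return -log_probs.gather(-1, speaker_ids.unsqueeze(-1)).squeeze(-1).mean()
```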


The speaker identity quantization module 120 may input the first embedding gcontinuous obtained from the speaker encoder 20 to the speaker quantizer 21 and obtain a quantized second embedding gdiscrete. The speaker quantizer 21 may aim to reconstruct the extracted gcontinuous as a weighted sum of basis vectors bi, i=1, . . . , n corresponding to a set of n learnable vectors from a codebook B. To this end, a second embedding gdiscrete may be obtained according to Equation 2 below based on an optimal weight wi for the basis vector bi found by the self-attention module of the speaker quantizer 21. That is, the reconstructed second embedding gdiscrete may be computed as the sum of the products of the basis vector bi and the weight wi computed from the attention layer.











$g_{discrete,i,j} = f_{quantize}\left( g_{continuous,i,j}; W_{quantize}, B \right) = \sum_{i=1}^{n} w_i b_i, \qquad w_i \in W_{quantize}, \; b_i \in B, \; \sum_{i=1}^{n} w_i = 1$   (Equation 2)







Herein, fquantize may be the speaker quantizer 21, gcontinuous,i,j may be the first embedding, Wquantize may be the parameter of the speaker quantizer 21, and B may be a codebook including n learnable vectors. To train the speaker quantizer 21, a speaker quantization loss, that is, a loss $\mathcal{L}_{quantize}$, may be introduced between gcontinuous and gdiscrete by using a mean square error L, according to Equation 3 below.











$\mathcal{L}_{quantize} = \mathbb{E}_{g_{continuous,i,j} \sim S,\; y_{i,j} \sim X}\left\{ L\left( g_{discrete,i,j},\, g_{continuous,i,j} \right) \right\}$   (Equation 3)







Herein, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; y_{i,j} \sim X}$ may be a notation for the expected value given the training set S and the first embedding as input, L may be the mean squared error, gdiscrete,i,j may be the second embedding, and gcontinuous,i,j may be the first embedding.


To prevent the loss from affecting the parameter Wspk of the speaker encoder 20, gcontinuous may be separated from the speaker quantization loss.
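A minimal sketch of Equations 2 and 3 under stated assumptions (a single-query attention over the codebook stands in for the patent's self-attention module, and `.detach()` implements the separation of gcontinuous from the quantization loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerQuantizer(nn.Module):
    """Sketch of Equation 2: convex combination of n learnable basis vectors."""

    def __init__(self, dim: int, n_basis: int):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_basis, dim))  # B: n learnable vectors b_i
        self.query = nn.Linear(dim, dim)                         # part of W_quantize

    def forward(self, g_continuous: torch.Tensor) -> torch.Tensor:
        scores = self.query(g_continuous) @ self.codebook.t()    # (batch, n) attention scores
        w = F.softmax(scores, dim=-1)                            # weights w_i, summing to 1
        return w @ self.codebook                                 # g_discrete = sum_i w_i * b_i


def quantize_loss(g_discrete: torch.Tensor, g_continuous: torch.Tensor) -> torch.Tensor:
    """Sketch of Equation 3; detaching g_continuous keeps the gradient away from W_spk."""
    return F.mse_loss(g_discrete, g_continuous.detach())
```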


The TTS pipeline module 130 may input the text xi,j and the first embedding gcontinuous to the text prior encoder 22 and obtain the first intermediate representation zprior. The text prior encoder 22 may include a phoneme encoder, an alignment search module, a duration predictor, and a frame decoder. In the phoneme encoder, the alignment search module, and the duration predictor, the first embedding, that is, the speaker embedding gcontinuous, may be given as a condition for the normalization layers assigned to the phoneme encoder, the alignment search module, and the duration predictor.


The frame decoder may receive as input the first embedding gcontinuous and an extended text hidden representation htext_extended, which is obtained by repeating the phoneme encoder's text representation along the time dimension so that its length is extended from the number of tokens to the number of frames. In some exemplary embodiments, the architecture of the frame decoder may be the same as that of the phoneme encoder.
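As an illustrative sketch (the helper name and shapes are assumptions, not the patent's implementation), extending the token-level representation to frame length amounts to repeating each token's hidden vector by its duration:

```python
import torch


def expand_to_frames(h_text: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each token's hidden vector by its duration in frames: (tokens, dim) -> (frames, dim)."""
    return torch.repeat_interleave(h_text, durations, dim=0)


# Example: 3 tokens lasting 2, 1, and 3 frames become 6 frame vectors.
h_text_extended = expand_to_frames(torch.randn(3, 8), torch.tensor([2, 1, 3]))
assert h_text_extended.shape == (6, 8)
```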


The TTS pipeline module 130 may input the first intermediate representation zprior and the first embedding gcontinuous to the prosody predictor 23, and add the prosodic hidden representations hpitch and henergy to the first intermediate representation zprior to obtain a second intermediate representation zprosody. The prosody predictor 23 may receive as input the output zprior of the frame decoder and the speaker embedding gcontinuous. The main function of the prosody predictor 23 is to generate a predicted pitch value {circumflex over (x)}pitch and an energy value {circumflex over (x)}energy. The normalized pitch and energy values extracted from the ground truth audio xpitch and xenergy may be used as targets during training. When it is assumed that n is the length of the text token xi,j and m is the length of the mel frame of the audio yi,j, the loss term may be computed based on the length of the mel frame of yi,j rather than the length of xi,j. In some exemplary embodiments, the prosody prediction loss may be computed at the token level or at the frame level.


Specifically, a predicted prosody value x̂prosody may be generated and a loss $\mathcal{L}_{frame\_prosody}$ may be introduced according to Equation 4 and Equation 5 below.











$\hat{x}_{prosody} = f\left( z_{prior},\, g_{continuous,i,j}; W \right)$   (Equation 4)

$\mathcal{L}_{frame\_prosody} = \mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}\left\{ \sum_{i=1}^{m} L\left( \hat{x}_{prosody,m},\, x_{prosody,m} \right) \right\}$   (Equation 5)








Herein, f is the prosody predictor 23, zprior is the first intermediate representation, gcontinuous,i,j may be the first embedding, W may be the parameter of the prosody predictor 23, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}$ may be a notation representing the expected value given the training set S and the first embedding as input, L may be the mean square error, and xprosody may be the actual prosody value.


In addition, a loss $\mathcal{L}_{token\_prosody}$ may be introduced according to Equation 6 below.












$\mathcal{L}_{token\_prosody} = \mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}\left\{ \sum_{k=1}^{n} L\left( \frac{\sum_{l \in a_k} \hat{x}_{prosody,l}}{\lvert a_k \rvert},\; \frac{\sum_{l \in a_k} x_{prosody,l}}{\lvert a_k \rvert} \right) \right\}$   (Equation 6)







Herein, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}$ may be the notation for the expected value given the training set S and the first embedding as input, L may be the mean squared error, ak may be the frame sequence corresponding to the kth token of the sentence in the duration alignment between the token of the text xi,j and the mel-frame of the speaker audio yi,j, x̂prosody may be the predicted prosody value, and xprosody may be the actual prosody value.
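A minimal sketch of Equation 6, under the assumption that the duration alignment is available as a per-token frame count (so |a_k| equals durations[k]); the per-token averaging is done with index_add_:

```python
import torch
import torch.nn.functional as F


def token_prosody_loss(x_hat: torch.Tensor, x: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Average predicted and ground-truth frame prosody over each token's frames a_k, then MSE.

    x_hat, x: (frames,) frame-level prosody values; durations: (tokens,) frame counts |a_k|.
    """
    token_ids = torch.repeat_interleave(torch.arange(durations.numel()), durations)
    counts = durations.clamp(min=1).to(x_hat.dtype)
    pred_mean = torch.zeros(durations.numel(), dtype=x_hat.dtype).index_add_(0, token_ids, x_hat) / counts
    true_mean = torch.zeros(durations.numel(), dtype=x.dtype).index_add_(0, token_ids, x) / counts
    return F.mse_loss(pred_mean, true_mean)
```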


In the training stage, the ground truth audio xpitch and xenergy may be delivered to the prosody encoder to generate prosodic hidden representations hpitch and henergy. These hidden representations may then be added to the output zprior of the frame decoder. However, in the inference stage, the predicted prosody values x̂pitch and x̂energy are delivered to the prosody encoder, and the prosody may be controlled by adjusting the values using parameters.
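For example (a sketch only; the scale parameters are hypothetical control knobs not named in the patent), pitch and energy can be adjusted at inference simply by scaling the predicted values before they are passed to the prosody encoder:

```python
def control_prosody(x_hat_pitch, x_hat_energy, pitch_scale: float = 1.0, energy_scale: float = 1.0):
    """Scale the predicted (normalized) pitch and energy to raise or lower the generated prosody."""
    return x_hat_pitch * pitch_scale, x_hat_energy * energy_scale
```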


The TTS pipeline module 130 may input the second intermediate representation zprosody and the first embedding gcontinuous to the intermediate decoder 24 to obtain a final representation zfinal. The intermediate decoder 24 may receive the output zprosody of the prosody predictor 23 along with the speaker embedding gcontinuous as input. In some exemplary embodiments, the intermediate decoder 24 may include a fully convolutional neural network with residual connections to capture local information. The intermediate decoder 24 may be the final stage of the intermediate representation before up-sampling is performed on the waveform. To align the output zfinal of the intermediate decoder 24 with the Mel-spectrogram of the ground truth audio xMel with the mean square error L, a loss $\mathcal{L}_{intermediate}$ may be introduced according to Equation 7 below.











$\mathcal{L}_{intermediate} = \mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}\left\{ L\left( f\left( z_{prosody},\, g_{continuous,i,j}; W \right),\; x_{Mel} \right) \right\}$   (Equation 7)







Herein, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}$ is the notation for the expected value given the training set S and the first embedding as input, L may be the mean squared error, f may be the intermediate decoder 24, zprosody may be the second intermediate representation, gcontinuous,i,j may be the first embedding, W may be the parameter of the intermediate decoder 24, and xMel may be the Mel spectrogram of the actual speaker audio.


The TTS pipeline module 130 may convert the final representation zfinal to a waveform to generate the speech ŷi,j by using the decoder 25. In some exemplary embodiments, the decoder 25 may be implemented as a generative adversarial network (GAN)-based decoder.


The VC pipeline module 140 may input a linear spectrogram xspec and the first embedding gcontinuous to the speech post encoder 26, obtain the third intermediate representation zpost, and align the third intermediate representation zpost to the first intermediate representation zprior. The speech post encoder 26 may learn the intermediate representations of the TTS pipeline on-the-fly during training. The speech post encoder 26 may receive as input the linear spectrogram xspec and the speaker embedding gcontinuous, and output a latent variable zpost that best represents the context and speaker information. The speech post encoder 26 may include a context organizer and a speaker organizer.


A context organizer may remove speaker information from the linear spectrogram xspec while preserving contextual information. On the other hand, the speaker organizer may implant speaker information into the output of the context organizer. To ensure that the speaker information is removed after the context organizer, an adversarial speaker classifier may be added to the output of the context organizer. In this regard, a loss $\mathcal{L}_{adv\_spk}$ may be introduced according to Equation 8 below.












$\mathcal{L}_{adv\_spk} = \max_{W_{post}} \; \mathbb{E}_{g_{continuous,i,j} \sim S,\; y_{i,j} \sim X}\left\{ -\sum_{i=1}^{C} l_i \log\left( f_{context}\left( y_{i,j}; W_{post\_context} \right) \right) \right\}$   (Equation 8)







Herein, Wpost is the parameter of the speech post encoder 26, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; y_{i,j} \sim X}$ may be a notation for the expected value given the training set S and the first embedding as input, C may be the number of speakers in the training set S, li may be the one-hot vector of the ith speaker, fcontext may be the context organizer, yi,j may be the speaker audio, and Wpost_context may be the parameter of the context organizer.
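The patent does not specify how the maximization over Wpost in Equation 8 is realized; one common way to train such an adversarial classifier jointly with the encoder is a gradient reversal layer, sketched below as an assumption rather than the patent's method:

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The classifier still learns to identify the speaker, while the context
        # organizer upstream is pushed to remove speaker information.
        return -grad_output


def adversarial_speaker_logits(context_features: torch.Tensor, classifier: torch.nn.Module) -> torch.Tensor:
    return classifier(GradReverse.apply(context_features))
```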


Further, a speaker classifier may be added to the output of the speaker organizer to implant speaker information from a target speaker reference. In this regard, a loss $\mathcal{L}_{spk}$ may be introduced according to Equation 9 below.











$\mathcal{L}_{spk} = \max_{W_{post}} \; \mathbb{E}_{g_{continuous,i,j} \sim S,\; y_{i,j} \sim X}\left\{ -\sum_{i=1}^{C} l_i \log\left( f_{spk}\left( f_{context}\left( y_{i,j}; W_{post\_context} \right); W_{post\_spk} \right) \right) \right\}$   (Equation 9)







Herein, Wpost is the parameter of the speech post encoder 26, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; y_{i,j} \sim X}$ may be the notation for the expected value given the training set S and the first embedding as input, C may be the number of speakers in the training set S, li may be the one-hot vector of the ith speaker, fspk may be the speaker organizer, fcontext may be the context organizer, yi,j may be the speaker audio, Wpost_context may be the parameter of the context organizer, and Wpost_spk may be the parameter of the speaker organizer.


To ensure that the output zpost of the speaker organizer after the context organizer is aligned with the output zprior of the frame decoder, a loss $\mathcal{L}_{bridge}$ may be introduced according to Equation 10 below.











$\mathcal{L}_{bridge} = \mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}\left\{ L\left( z_{post},\, z_{prior} \right) \right\}$   (Equation 10)







Herein, $\mathbb{E}_{g_{continuous,i,j} \sim S,\; (x_{i,j}, y_{i,j}) \sim X}$ is a notation for the expected value given the training set S and the first embedding as input, L may be the mean squared error, zpost may be the third intermediate representation, and zprior may be the first intermediate representation.
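As a minimal sketch, the bridge loss of Equation 10 is simply a mean squared error that ties the VC-pipeline representation to the TTS-pipeline representation so that either one can feed the shared downstream modules:

```python
import torch.nn.functional as F


def bridge_loss(z_post, z_prior):
    """Equation 10 sketch: MSE between the speech post encoder output and the frame decoder output."""
    return F.mse_loss(z_post, z_prior)
```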


In some exemplary embodiments, the first intermediate representation zprior may be computed according to Equation 11 below.










$z_{prior} = f_{prior}\left( x_{i,j},\, g_{continuous,i,j}; W_{spk}, W_{prior} \right)$   (Equation 11)







Herein, fprior is the text prior encoder 22, xi,j is the text, gcontinuous,i,j is the first embedding, Wspk is a parameter of the speaker encoder 20, and Wprior may be a parameter of the text prior encoder 22.


In some exemplary embodiments, the third intermediate representation zpost may be computed according to Equation 12 below.










$z_{post} = f_{post}\left( y_{i,j},\, g_{continuous,i,j}; W_{spk}, W_{post} \right)$   (Equation 12)







Herein, fpost may be the speech post encoder 26, yi,j may be the speaker audio, gcontinuous,i,j may be the first embedding, Wspk may be a parameter of the speaker encoder 20, and Wpost may be a parameter of the speech post encoder 26.


According to the present exemplary embodiment, pitch control may be implemented in zero-shot speech generation by integrating the TTS pipeline, the VC pipeline, and the prosody predictor, while new speaker identities may be introduced in speech generation by using the codebook of the speaker quantizer. Furthermore, by building a jointly trainable pipeline, various applications, including zero-shot TTS, VC, random speaker generation, and prosody control, may be implemented with only a single-step training of the neural network. Furthermore, by sharing the prosody predictor and the decoder, computational costs may be reduced and efficiency increased.



FIG. 3 is a diagram illustrating the inference pipeline of the speech generation device according to the exemplary embodiment.


Referring now to FIG. 3, the trained speech generation device 10, as described in FIG. 2, may be used to implement applications, such as zero-shot speech generation, random speaker generation, and prosody control. In FIG. 3, a speaker encoder 30, a speaker quantizer 31, a text prior encoder 32, a prosody predictor 33, an intermediate decoder 34, a decoder 35, and a speech post encoder 36 may correspond to the speaker encoder 20, the speaker quantizer 21, the text prior encoder 22, the prosody predictor 23, the intermediate decoder 24, the decoder 25, and the speech post encoder 26 of FIG. 2.


In some exemplary embodiments, a zero-shot text-to-speech (TTS) may be implemented based on the first embedding gcontinuous and the first intermediate representation zprior. Further, a zero-shot voice conversion (VC) may be implemented based on the first embedding gcontinuous, the first intermediate representation zprior, and the third intermediate representation zpost.


In some exemplary embodiments, a random speaker text-to-speech (TTS) may be implemented based on the second embedding gdiscrete and the first intermediate representation zprior. Further, a random speaker voice conversion (VC) may be implemented based on the second embedding gdiscrete, the first intermediate representation zprior, and the third intermediate representation zpost. In this case, a random seed may be input to the speaker quantizer 31.
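One plausible reading of random speaker generation, sketched here as an assumption (the patent only states that a random seed is input to the speaker quantizer 31), is to draw seeded random convex weights over the codebook's n basis vectors to produce a new gdiscrete:

```python
import torch


def random_speaker_embedding(codebook: torch.Tensor, seed: int) -> torch.Tensor:
    """Sample a random convex combination of the speaker quantizer's basis vectors.

    codebook: (n, dim) learnable basis vectors B; returns a (dim,) random speaker embedding.
    """
    gen = torch.Generator().manual_seed(seed)
    w = torch.softmax(torch.randn(codebook.shape[0], generator=gen), dim=0)  # weights sum to 1
    return w @ codebook
```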



FIG. 4 is a flow diagram illustrating a speech generation method according to an exemplary embodiment.


Referring now to FIG. 4, the speech generation method according to the exemplary embodiment may include receiving paired text and speaker audio as input for the ith speaker and the jth utterance from a training set (S401), inputting the speaker audio to a speaker encoder to perform an extraction of a speaker identity, and obtaining a first embedding representing a representation of the speaker identity (S402), inputting the first embedding to a speaker quantizer and obtaining a quantized second embedding (S403), inputting the text and the first embedding into a text prior encoder and obtaining a first intermediate representation (S404), inputting the first intermediate representation and the first embedding to a prosody predictor, adding a prosodic hidden representation to the first intermediate representation, and obtaining a second intermediate representation (S405), inputting the second intermediate representation and the first embedding to an intermediate decoder and obtaining a final representation (S406); and converting the final representation to a waveform by using the decoder to generate speech (S407).


For a more detailed description of the above method, reference may be made to the description of the exemplary embodiments described herein, so that duplicative descriptions are omitted herein.



FIG. 5 is a diagram illustrating a computing device according to an exemplary embodiment.


Referring to FIG. 5, the speech generation method and device according to the exemplary embodiments may be implemented using a computing device 50.


The computing device 50 may include at least one of a processor 510, a memory 530, a user interface input device 540, a user interface output device 550, and a storage device 560 communicating via a bus 520. The computing device 50 may also include a network interface 570 electrically connected to the network 40. The network interface 570 may transmit or receive signals to and from other entities over the network 40.


The processor 510 may be implemented in various types, such as a microcontroller unit (MCU), application processor (AP), central processing unit (CPU), graphic processing unit (GPU), neural processing unit (NPU), and quantum processing unit (QPU), and may be any semiconductor device that executes instructions stored in the memory 530 or the storage device 560. The processor 510 may be configured to implement the functions and methods described above with respect to FIGS. 1 to 4.


The memory 530 and the storage device 560 may include various forms of volatile or non-volatile storage media. For example, the memory may include a read-only memory (ROM) 531 and a random access memory (RAM) 532. In some exemplary embodiments, the memory 530 may be located inside or outside of the processor 510, and the memory 530 may be coupled to the processor 510 through various means already known in the art.


In some exemplary embodiments, at least some configurations or functions of the speech generation method and device according to the exemplary embodiments may be implemented as programs or software executing on the computing device 50, and the programs or software may be stored on a computer-readable medium. Specifically, the computer-readable medium according to the exemplary embodiment may be a recording medium that records a program for executing the steps included in implementing the speech generation method and device according to the exemplary embodiment, the program being executed by a computer including a processor 510 that executes programs or instructions stored in the memory 530 or the storage device 560.


In some exemplary embodiments, at least some configurations or functions of the speech generation method and device according to the exemplary embodiments may be implemented using hardware or circuit of the computing device 50, or may be implemented as separate hardware or circuit that may be electrically connected to computing device 50.


According to the exemplary embodiments, pitch control may be implemented in zero-shot speech generation by integrating the TTS pipeline, the VC pipeline, and the prosody predictor, while new speaker identities may be introduced in speech generation by using the speaker quantizer's codebook. Furthermore, by building the jointly trainable pipeline, various applications, including zero-shot TTS, VC, random speaker generation, and prosody control, may be implemented with only a single-step training of the neural network.


Although the above exemplary embodiments of the present invention have been described in detail, the scope of the present invention is not limited thereto, but also includes various modifications and improvements by one of ordinary skill in the art utilizing the basic concepts of the present invention as defined in the following claims.

Claims
  • 1. A speech generation method of performing zero-shot speech generation by using prosody control and random speaker generation, the speech generation method comprising: receiving paired text and speaker audio for an ith speaker and a jth utterance from a training set; inputting the speaker audio to a speaker encoder to perform extraction of a speaker identity, and obtaining first embedding representing a representation of the speaker identity; inputting the first embedding to a speaker quantizer and obtaining quantized second embedding; inputting the text and the first embedding to a text prior encoder and obtaining a first intermediate representation; inputting the first intermediate representation and the first embedding to a prosody predictor, adding a prosodic hidden representation to the first intermediate representation, and obtaining a second intermediate representation; inputting the second intermediate representation and the first embedding to an intermediate decoder, and obtaining a final representation; and converting the final representation to a waveform by using the decoder to generate speech.
  • 2. The speech generation method of claim 1, further comprising: inputting a linear spectrogram and the first embedding to a speech post encoder and obtaining a third intermediate representation; and aligning the third intermediate representation to the first intermediate representation.
  • 3. The speech generation method of claim 1, wherein: the obtaining of the first embedding includes introducing a first loss according to Equation 1 below,
  • 4. The speech generation method of claim 1, wherein: the obtaining of the second embedding includes obtaining a second embedding according to Equation 2 below, based on an optimal weight for a basis vector found by a self-attention module of the speaker quantizer,
  • 5. The speech generation method of claim 4, wherein: the obtaining of the second embedding includes introducing a second loss according to Equation 3 below,
  • 6. The speech generation method of claim 1, wherein: the obtaining of the second intermediate representation includes generating predicted prosody values according to Equation 4 and Equation 5 below, and introducing a third loss,
  • 7. The speech generation method of claim 6, wherein: the obtaining of the second intermediate representation includes introducing a fourth loss according to Equation 6 below,
  • 8. The speech generation method of claim 1, wherein: the obtaining of the final representation includes introducing a fifth loss according to Equation 7 below,
  • 9. The speech generation method of claim 2, wherein: the speech post encoder includes a context organizer for removing speaker information from the linear spectrogram while preserving context information; and a speaker organizer for implanting the speaker information into an output of the context organizer, and the obtaining of the third intermediate representation includes introducing a sixth loss according to Equation 8 below,
  • 10. The speech generation method of claim 9, wherein: the obtaining of the third intermediate representation includes introducing a seventh loss according to Equation 9 below,
  • 11. The speech generation method of claim 10, wherein: the aligning of the third intermediate representation to the first intermediate representation includes introducing an eighth loss according to Equation 10 below,
  • 12. The speech generation method of claim 11, wherein: the first intermediate representation is computed according to Equation 11 below,
  • 13. The speech generation method of claim 11, wherein: the third intermediate representation is computed according to Equation 12 below,
  • 14. The speech generation method of claim 2, further comprising: implementing zero-shot text-to-speech (TTS) based on the first embedding and the first intermediate representation, or implementing zero-shot voice conversion (VC) based on the first embedding, the first intermediate representation, and the third intermediate representation.
  • 15. The speech generation method of claim 2, further comprising: implementing random speaker text-to-speech (TTS) based on the second embedding and the first intermediate representation, or implementing random speaker voice conversion (VC) based on the second embedding, the first intermediate representation, and the third intermediate representation.
  • 16. The speech generation method of claim 15, further comprising: inputting a random seed into the speaker quantizer.
  • 17. A speech generation device that executes a program code loaded into one or more memory devices via one or more processors and performs zero-shot speech generation by using prosody control and random speaker generation, wherein the program code is executed to: receive paired text and speaker audio for an ith speaker and a jth utterance from a training set; input the speaker audio to a speaker encoder to perform extraction of a speaker identity, and obtain first embedding representing a representation of the speaker identity; input the first embedding to a speaker quantizer and obtain quantized second embedding; input the text and the first embedding to a text prior encoder and obtain a first intermediate representation; input the first intermediate representation and the first embedding to a prosody predictor, and add a prosodic hidden representation to the first intermediate representation, and obtain a second intermediate representation; input the second intermediate representation and the first embedding to an intermediate decoder and obtain a final representation; and convert the final representation to a waveform by using the decoder to generate speech.
  • 18. The speech generation device of claim 17, wherein: the program code is executed to further input a linear spectrogram and the first embedding to a speech post encoder and obtain a third intermediate representation, and align the third intermediate representation to the first intermediate representation.
  • 19. The speech generation device of claim 18, wherein: the program code is executed to further implement zero-shot text-to-speech (TTS) based on the first embedding and the first intermediate representation, or implement zero-shot voice conversion (VC) based on the first embedding, the first intermediate representation, and the third intermediate representation.
  • 20. The speech generation device of claim 18, wherein: the program code is executed to further implement random speaker text-to-speech (TTS) based on the second embedding and the first intermediate representation, or implement random speaker voice conversion (VC) based on the second embedding, the first intermediate representation, and the third intermediate representation.
Provisional Applications (1)
Number: 63504872; Date: May 2023; Country: US