The development of machine learning models for speech synthesis of emotionally expressive voices is challenging due to extensive variability in speaking styles. For example, the same word can be enunciated within a sentence in a variety of different ways to elicit unique characteristics, such as the emotional state of the speaker. As a result, training a successful model to generate a full sentence of speech typically requires a very large dataset, such as twenty hours or more of prerecorded speech.
Even when conventional neural speech generation models are successful, the speech they generate is often not emotionally expressive due at least in part to the fact that the training objective employed in conventional solutions is regression to the mean. Such a regression to the mean training objective encourages the conventional model to output a “most likely” averaged utterance, which tends not to sound convincing to the human ear. Consequently, expressive speech synthesis is usually not successful and remains a largely unsolved problem in the art.
There are provided systems and methods for generating audio including emotionally expressive synthesized content, substantially as shown in and/or described in connection with at least one of the figures, and as set forth more completely in the claims.
The following description contains specific information pertaining to implementations in the present disclosure. One skilled in the art will recognize that the present disclosure may be implemented in a manner different from that specifically discussed herein. The drawings in the present application and their accompanying detailed description are directed to merely exemplary implementations. Unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals. Moreover, the drawings and illustrations in the present application are generally not to scale, and are not intended to correspond to actual relative dimensions.
The present application discloses automated systems and methods for generating audio including emotionally expressive synthesized content using a trained neural network that overcomes the drawbacks and deficiencies in the conventional art. It is noted that, as used in the present application, the terms “automation,” “automated,” and “automating” refer to systems and processes that do not require the participation of a human user, such as a human editor. Although, in some implementations, a human editor may review the synthesized content generated by the automated systems and according to the automated methods described herein, that human involvement is optional. Thus, the methods described in the present application may be performed under the control of hardware processing components of the disclosed automated systems.
It is further noted that, as defined in the present application, a neural network (NN), also known as an artificial neural network (ANN), is a type of machine learning framework in which patterns or learned representations of observed data are processed using highly connected computational layers that map the relationship between inputs and outputs. A “deep neural network”, in the context of deep learning, may refer to a neural network that utilizes multiple hidden layers between input and output layers, which may allow for learning based on features not explicitly defined in raw data. “Online deep learning” may refer to a type of deep learning in which machine learning models are updated using incoming data streams, and are designed to progressively improve their performance of a specific task as new data is received and/or adapt to new patterns of a dynamic system. As such, various forms of NNs may be used to make predictions about new data based on past examples or “training data.” In various implementations, NNs may be utilized to perform image processing or natural-language processing.
It is noted that, as shown by
As shown in
Also shown in
It is further noted that although audio processing system 100 may receive audio sequence template 150 from audio template provider 124 via communication network 120 and network communication links 122, in some implementations, audio template provider 124 may take the form of an audio content database integrated with computing platform 102, or may be in direct communication with audio processing system 100 as shown by dashed communication link 128. Alternatively, in some implementations, audio sequence template 150 may be provided to audio processing system 100 by user 132.
It is also noted that although user system 130 is shown as a desktop computer in
Audio integration software code 110, when executed by hardware processor 104 of computing platform 102, is configured to generate integrated audio sequence 160 based on audio sequence template 150 and descriptive data 134. Although the present application refers to audio integration software code 110 as being stored in system memory 106 for conceptual clarity, more generally, system memory 106 may take the form of any computer-readable non-transitory storage medium.
The expression “computer-readable non-transitory storage medium,” as used in the present application, refers to any medium, excluding a carrier wave or other transitory signal, that provides instructions to hardware processor 104 of computing platform 102. Thus, a computer-readable non-transitory medium may correspond to various types of media, such as volatile media and non-volatile media, for example. Volatile media may include dynamic memory, such as dynamic random access memory (dynamic RAM), while non-volatile media may include optical, magnetic, or electrostatic storage devices. Common forms of computer-readable non-transitory media include, for example, optical discs, RAM, programmable read-only memory (PROM), erasable PROM (EPROM), and FLASH memory.
Moreover, although
Audio sequence template 250A/250B corresponds in general to audio sequence template 150, in
Audio sequence template 150/250A/250B may be a portion of a prerecorded audio voiceover, for example, from which some audio content has been removed to produce audio gap 253. According to various implementations of the present inventive principles, hardware processor 104 is configured to execute audio integration software code 110 to synthesize word or words 254 for insertion into audio gap 253 based on the syntax of audio segment 252 or first and second audio segments 252a and 252b, further based on emotional tone or context 256 of at least one of audio segment 252 and first and second audio segments 252a and 252b, and still further based on descriptive data 134 describing word or words 254. That is to say, word or words 254 are synthesized by audio integration software code 110 to be syntactically correct in usage with audio segment 252 or first audio segment 252a and second audio segment 252b, while also agreeing in emotional tone with emotional tone or context 256 of audio segment 252 or one or both of first and second audio segments 252a and 252b.
It is noted that, as defined for the purposes of the present application, the phrases “emotional tone” and “emotional context” are equivalent and refer to the emotion expressed by the words included in audio segment 252 or first audio segment 252a and second audio segment 252b, as well as the speech cadence and vocalization with which those words are enunciated. Thus, emotional context or emotional tone may include the expression through speech pattern and vocal tone of emotional states such as happiness, sadness, anger, fear, excitement, affection, and dislike, to name a few examples.
It is further noted that, in some implementations, as shown in
Audio integration software code 310, training data 342, descriptive data 334, and integrated audio sequence 360 correspond respectively in general to audio integration software code 110, training data 142, descriptive data 134, and integrated audio sequence 160, in
In addition, audio sequence template 350 corresponds in general to audio sequence template 150/250A/250B in
As shown in
In addition, NN 370/470 includes audio encoder 472 having audio analyzer 472a configured to provide audio spectrogram 474 of audio sequence template 150/250A/250B/350/450 as an input to CNN 472b of audio encoder 472. In other words, audio analyzer 472a of audio encoder 472 is configured to generate audio spectrogram 474 corresponding to audio segment(s) 252/252a/252b and one or more words 254 described by descriptive data 134/334/434. For example, audio analyzer 472a may perform a text-to-speech (TTS) conversion of audio sequence template 150/250A/250B/350/450.
As further shown in
According to the exemplary implementation shown in
Also shown in
It is noted that, when utilized during training, optional discriminator NN 480 may be used by training module 312 to train NN 370/470 using objective function 482 designed to encourage generation of synthesized word or words 254 that agree in emotional tone or context 256 with one or more of audio segment(s) 252/252a/252b of audio sequence template 150/250A/250B/350/450, as well as being syntactically and grammatically consistent with audio segment(s) 252/252a/252b.
It is further noted that, in contrast to “regression to the mean” type objective functions used in the training of conventional speech synthesis solutions, the present novel and inventive solution may employ optional discriminator NN 480 and objective function 482 in the form of an adversarial objective function to bias integrated audio sequence 160/360 away from a “mean” value such that its corresponding acoustic representation 346/446 sounds convincing to the human ear. It is noted that NN 370/470 may be trained using objective function 482 including a syntax reconstruction loss term. However, in some implementations, NN 370/470 may be trained using objective function 482 including an emotional context loss term summed with a syntax reconstruction loss term.
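By way of illustration only, the combination of a syntax reconstruction loss term with an emotional context loss term might be sketched as follows. This sketch makes several assumptions not stated in the present application: the reconstruction term is taken to be a mean squared error over spectrogram values, the emotional context term is taken to be a non-saturating adversarial loss derived from a discriminator score in the range (0, 1], and the function names and the `emotion_weight` parameter are hypothetical.

```python
import math


def reconstruction_loss(predicted, target):
    """Mean squared error between flat lists of spectrogram values (assumed form)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def emotional_context_loss(discriminator_score):
    """Assumed adversarial term: -log D(x), where D(x) in (0, 1] is the
    discriminator's estimate that the synthesis is convincing in context."""
    return -math.log(discriminator_score)


def objective(predicted, target, discriminator_score, emotion_weight=1.0):
    """Sum of the two terms; emotion_weight is a hypothetical balancing factor."""
    return (reconstruction_loss(predicted, target)
            + emotion_weight * emotional_context_loss(discriminator_score))
```

Under these assumptions, a synthesis that matches the target exactly and fully convinces the discriminator yields a loss of zero, while a low discriminator score is penalized even when the reconstruction error is small, which is one way an objective can bias the output away from an averaged utterance.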
As noted above, NN 470 corresponds in general to NN 370, in
The functionality of audio processing system 100 including audio integration software code 110/310 will be further described by reference to
As a preliminary matter, and as noted above, NN 370/470 is trained to synthesize expressive audio that sounds genuine to the human ear. NN 370/470 may be trained using training platform 140, training data 142, and training module 312 of audio integration software code 110/310. The goal of training is to fill in audio gap 253 in audio spectrogram 474 of audio sequence template 150/250A/250B/350/450 with a convincing utterance given emotional context or tone 256.
During training, discriminator NN 480 of NN 370/470 looks at the generated acoustic representation 346/446 and emotional context or tone 256 and determines whether it is a convincing audio synthesis. In addition, user 132 may provide descriptive data 134/334/434 and/or pronunciation exemplar 145, which can help NN 370/470 to appropriately pronounce synthesized word or words 254 for insertion into audio gap 253. For example, where word or words 254 include a phonetically challenging word, or a name or foreign word, pronunciation exemplar 145 may be used as a guide track to guide NN 370/470 toward the proper pronunciation of word or words 254.
In some implementations, sets of training data 142 may be produced using forced alignment to cut full sentences into individual words. A single sentence of training data 142, e.g., audio sequence template 150/250A/250B/350/450, may take the form of a full sentence with one or several word(s) cut out to produce audio gap 253. The goal during training is for NN 370/470 to learn to fill in audio gap 253 with synthesized words that are syntactically and grammatically correct in usage with audio segment(s) 252/252a/252b, while also agreeing with emotional context or tone 256 of audio segment(s) 252/252a/252b.
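By way of illustration only, producing a training example by cutting aligned words out of a full sentence might be sketched as follows. The sketch assumes that a forced aligner has already supplied, for each word, a start frame and an exclusive end frame; the function name `cut_gap` and the tuple layout of the alignment are hypothetical.

```python
def cut_gap(frames, alignment, gap_word_indices):
    """Split a sentence into the audio before the gap, the removed words, and
    the audio after the gap.

    frames: sequence of spectrogram frames for the full sentence.
    alignment: list of (word, start_frame, end_frame) tuples, end exclusive,
        assumed to come from a forced aligner.
    gap_word_indices: indices of one or more consecutive words to cut out.
    """
    start = alignment[gap_word_indices[0]][1]
    end = alignment[gap_word_indices[-1]][2]
    first_segment = frames[:start]      # audio preceding the gap
    removed_target = frames[start:end]  # ground truth the network should learn to fill in
    second_segment = frames[end:]       # audio following the gap
    return first_segment, removed_target, second_segment


# Example usage with ten dummy frames and a three-word sentence.
frames = list(range(10))
alignment = [("the", 0, 2), ("quick", 2, 5), ("fox", 5, 10)]
before, target, after = cut_gap(frames, alignment, [1])
```

In this example the word "quick" is removed, so `before` and `after` play the roles of the audio segments on either side of the gap, and `target` is the removed audio against which a synthesized fill could be compared during training.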
During training, validation of the learning process may be performed by user 132, who may utilize user system 130 to evaluate integrated audio sequence 160/360 generated during training and provide additional descriptive data 134/334/434 based on the accuracy with which integrated audio sequence 160/360 has been synthesized. However, in some implementations, validation of the learning can be performed as an automated process using discriminator NN 480. Once training is completed, audio integration software code 110/310 including NN 370/470 may be utilized in an automated process to generate integrated audio sequence 160/360 including emotionally expressive synthesized content as outlined by flowchart 590.
Referring now to
Audio sequence template 150/250A/250B/350/450 may be received by audio integration software code 110/310 of audio processing system 100, executed by hardware processor 104. As shown in
Flowchart 590 continues with receiving descriptive data 134/334/434 describing one or more words 254 for insertion into audio gap 253 (action 594). Descriptive data 134/334/434 may be received by audio integration software code 110/310 of audio processing system 100, executed by hardware processor 104. As discussed above, in some implementations, as shown in
However, in other implementations, descriptive data 134/334/434 may be included in audio sequence template 150/250A/250B/350/450 and may be identified by audio integration software code 110/310, executed by hardware processor 104. For example, in some implementations, descriptive data 134/334/434 may include the last word in audio segment 252 or first audio segment 252a preceding audio gap 253, or one or more phonemes of such a word. In some of those implementations, descriptive data 134/334/434 may also include the first word in second audio segment 252b following audio gap 253, or one or more phonemes of that word. Alternatively, in some implementations, descriptive data 134/334/434 may include the first word in audio segment 252 following audio gap 253, or one or more phonemes of that word. Alternatively, or in addition, in some implementations, descriptive data 134/334/434 may include pronunciation exemplar 145 provided by user 132, or received directly from pronunciation database 144 by audio integration software code 110. Thus, in various implementations, descriptive data 134/334/434 may include pronunciations from a pronunciation NN model of pronunciation database 144 and/or linguistic features from audio segment(s) 252/252a/252b.
In some implementations, flowchart 590 can conclude with using trained NN 370/470 to generate integrated audio sequence 160/360 using audio sequence template 150/250A/250B/350/450 and descriptive data 134/334/434, where integrated audio sequence 160/360 includes audio segment(s) 252/252a/252b and one or more synthesized words 254 corresponding to the words described by descriptive data 134/334/434 (action 596). Action 596 may be performed by audio integration software code 110/310, executed by hardware processor 104, and using trained NN 370/470.
By way of summarizing the performance of trained NN 370/470 with reference to the specific implementation of audio sequence template 250A, in
Referring to text encoder 471, in one implementation, text encoder 471 may begin with a 256-dimensional text embedding, thereby converting text 351/451 into a sequence of 256-dimensional vectors as first sequence of vector representations 473, also referred to herein as “encoder states.” It is noted that the length of first sequence of vector representations 473 is determined by the length of input text 351/451. In some implementations, text 351/451 may be converted into phonemes or other phonetic pronunciations, while in other implementations, such conversion of text 351/451 may not occur. Additional linguistic features of audio sequence template 150/250A/350/450 may also be encoded together with text 351/451, such as parts of speech, e.g., noun, subject, verb, and so forth.
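By way of illustration only, the conversion of text into a sequence of 256-dimensional vectors whose length follows the length of the input text might be sketched as follows. The sketch assumes a character-level vocabulary and uses randomly initialized vectors as a stand-in for a learned embedding; the names `build_embedding_table` and `embed_text` are hypothetical.

```python
import random

EMBED_DIM = 256  # dimensionality stated in the description


def build_embedding_table(vocabulary, dim=EMBED_DIM, seed=0):
    """Map each token to a vector. Random values stand in for learned weights."""
    rng = random.Random(seed)
    return {token: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for token in vocabulary}


def embed_text(text, table):
    """Return one vector per known character, so that the sequence length is
    determined by the length of the input text."""
    return [table[ch] for ch in text.lower() if ch in table]


table = build_embedding_table("abcdefghijklmnopqrstuvwxyz ")
encoder_states = embed_text("Hello world", table)
```

Here `encoder_states` holds eleven vectors of 256 values each, one per character of the input. An implementation that first converts the text into phonemes would use a phoneme vocabulary in place of the character vocabulary assumed here.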
Audio encoder 472 includes CNN 472b over input audio spectrogram 474, followed by RNN encoder 472c. That is to say, audio encoder 472 takes audio sequence template 150/250A/350/450, converts it into audio spectrogram 474, processes audio spectrogram 474 using CNN 472b and RNN 472c, and outputs a sequence of 256-dimensional vectors as second sequence of vector representations 476.
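By way of illustration only, the first stage of that pipeline, in which audio samples are converted into a spectrogram, might be sketched as follows. The sketch assumes a magnitude short-time Fourier transform with a Hann window; the frame size, hop length, and function name are hypothetical, and the subsequent CNN and RNN stages are not shown.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames, each a list of frame_size // 2 + 1 magnitudes."""
    # Periodic Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    spectrogram = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        # Direct discrete Fourier transform over the non-negative frequency bins.
        spectrum = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n in range(frame_size)))
            for k in range(frame_size // 2 + 1)
        ]
        spectrogram.append(spectrum)
    return spectrogram


# Example usage: a pure tone centered on frequency bin 8.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = magnitude_spectrogram(tone)
```

The direct transform is used only to keep the sketch free of external dependencies; a practical implementation would more likely rely on a fast Fourier transform routine from a numerical library.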
Audio decoder 478 uses two sequence-to-sequence attention mechanisms, shown in
Similarly, audio attention block 477 processes second sequence of vector representations 476 and forms a blended state that summarizes the audio that audio decoder 478 should be paying attention to. Audio decoder 478 combines the blended states from each of text attention block 475 and audio attention block 477 by combining, i.e., concatenating, the vectors of both blended states. Audio decoder 478 then decodes the combined state, updates its own state, and the two attention mechanisms are processed again. This process may continue sequentially until the entire speech is synthesized.
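By way of illustration only, a single decoding step that forms one blended state from the text encoder states and one from the audio encoder states, and then concatenates them, might be sketched as follows. The sketch assumes dot-product attention scoring, which the present application does not specify, and uses small vectors in place of 256-dimensional ones; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Blend `states` using weights derived from their dot product with `query`."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * state[d] for w, state in zip(weights, states))
            for d in range(dim)]


def decoder_step(decoder_state, text_states, audio_states):
    """Form both blended states and concatenate them into one combined state."""
    text_blend = attend(decoder_state, text_states)
    audio_blend = attend(decoder_state, audio_states)
    return text_blend + audio_blend  # list concatenation


combined = decoder_step([0.0, 0.0],
                        [[1.0, 0.0], [0.0, 1.0]],
                        [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]])
```

In this example the combined state has four entries, two from each blended state. The two encoder sequences may differ in length because each attention block reduces its sequence to a single vector before concatenation. In a full decoder, the combined state would be used to update the decoder state, and the step would be repeated until the speech is synthesized.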
As noted above, audio decoder 478 may be implemented as an RNN (e.g., LSTM or GRU). According to the exemplary implementation shown in
Action 596 results in generation of integrated audio sequence 160/360 including synthesized word or words 254. Moreover, and as discussed above, word or words 254 are synthesized by audio integration software code 110/310 to be syntactically and grammatically correct in usage with audio segment(s) 252/252a/252b, while also agreeing in emotional tone with emotional tone or context 256 of one or more of audio segment(s) 252/252a/252b. Once produced using audio integration software code 110/310, integrated audio sequence 160/360 may be stored locally in system memory 106 of audio processing system 100, or may be transmitted, via communication network 120 and network communication links 122, to user system 130.
In some implementations, as shown in
Thus, the present application discloses automated systems and methods for generating audio including emotionally expressive synthesized content. From the above description it is manifest that various techniques can be used for implementing the concepts described in the present application without departing from the scope of those concepts. Moreover, while the concepts have been described with specific reference to certain implementations, a person of ordinary skill in the art would recognize that changes can be made in form and detail without departing from the scope of those concepts. As such, the described implementations are to be considered in all respects as illustrative and not restrictive. It should also be understood that the present application is not limited to the particular implementations described herein, but many rearrangements, modifications, and substitutions are possible without departing from the scope of the present disclosure.