Text-to-speech (TTS) synthesis aims at generating a corresponding speech waveform based on a text input. TTS synthesis is widely applied for speech-to-speech translation, speech customization for certain users, role playing in a fairytale, etc. Neural TTS systems are increasingly adopted for implementing TTS synthesis, and have become one of the most popular directions in the Artificial Intelligence (AI) field in recent years. A neural TTS system may predict acoustic features based on a text input, and further generate a speech waveform based on the predicted acoustic features. Different from traditional TTS techniques, which require well-designed frontend linguistic features, a neural TTS system is modeled in an end-to-end structure and may be trained directly based on, for example, text-speech data pairs. A neural TTS system may jointly optimize pronunciation, prosody, etc. of speech, which results in more natural synthesized speech than the traditional TTS techniques.
This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the present disclosure provide a method and apparatus for generating speech through neural TTS synthesis. A text input may be obtained. A phone feature of the text input may be generated. Context features of the text input may be generated based on a set of sentences associated with the text input. A speech waveform corresponding to the text input may be generated based on the phone feature and the context features.
It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.
The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.
The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
A neural TTS system may be used to generate speech corresponding to a text input. A traditional neural TTS system uses only a phone feature or a character feature of a current text input, such as a current sentence, to generate speech. Herein, the phone feature refers to information representations of phones pronouncing the text input, which is generated based on a phone sequence identified from the text input, wherein the phone sequence is a list of sequential phones that form pronunciation of the text input. The character feature refers to information representations of characters constituting the text input, which is generated based on a character sequence identified from the text input, wherein the character sequence is a list of sequential characters contained in the text input.
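As a minimal sketch of the distinction drawn above, the following Python example identifies a phone sequence and a character sequence from a text input. The toy lexicon is a hypothetical stand-in for a real letter-to-sound module or pronunciation dictionary.

```python
# Hypothetical pronunciation lexicon (ARPAbet-style symbols), standing in
# for a real letter-to-sound (LTS) module or pronunciation dictionary.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def identify_phone_sequence(text):
    """Map each word to its phones and concatenate them in order."""
    phones = []
    for word in text.lower().split():
        phones.extend(TOY_LEXICON[word])
    return phones

def identify_character_sequence(text):
    """List the sequential characters contained in the text input."""
    return list(text)

phone_seq = identify_phone_sequence("hello world")
char_seq = identify_character_sequence("hello")
```

The phone sequence drives pronunciation, while the character sequence simply enumerates the text's characters; either can serve as the encoder input.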
Generally, the same text may correspond to different pronunciations, such as pronunciations having different speech rates, tones, prosodies, emotions, or pleasantness. From a mathematical point of view, conversion from text to speech can be viewed as a large-scale inverse problem, which decompresses a highly compressed source, e.g., text, into a very complex target, e.g., audio signals. The neural TTS system tries to solve this problem with machine learning techniques, which can be viewed as a one-to-many mapping problem from a machine learning point of view. Since the traditional neural TTS system uses only the phone feature or character feature of the current text input to generate speech, and lacks information related to context contained in the text input or in text adjacent to the text input, it cannot effectively solve the one-to-many mapping problem. Therefore, traditional neural TTS systems typically generate less expressive speech with a fixed pattern, such as reading-style speech with a plain prosody and emotion.
In addition, when generating speech for a set of sentences, such as a paragraph, the traditional neural TTS system usually generates a corresponding speech waveform for each sentence in the set of sentences, and then combines all generated speech waveforms, thereby obtaining a speech waveform corresponding to the set of sentences. Further, at the time of combination, pause durations between respective speech waveforms corresponding to respective sentences are usually set to be the same or fixed. This also results in similar rhythms between the sentences, making the generated speech sound monotonous.
Embodiments of the present disclosure propose to improve the speech generating ability of a neural TTS system by further using multi-level context features. The context features may be generated based on a plurality of levels of context information, the context information including, for example, a word sequence of a text input, a word sequence and acoustic features of text adjacent to the text input, position information of the text input, etc. The acoustic features may include various traditional TTS acoustic features, such as mel-spectrum, line spectral pairs (LSP), etc. The position information may refer to information representations of a position of the text input in the adjacent text.
In one aspect, the embodiments of the present disclosure propose to use both a phone feature and multi-level context features of the text input to generate speech corresponding to the text input. The context features generated based on context information may contain prior knowledge about semantics and acoustics. The neural TTS system may learn general patterns of speech rates, tones, prosodies, emotions or pleasantness from such prior knowledge. Therefore, when generating speech, considering more features such as the above-mentioned context features may help alleviate the one-to-many mapping problem and enhance the speech generating ability of the neural TTS system, thereby generating more natural and more expressive speech.
In another aspect, the embodiments of the present disclosure propose that when generating a speech waveform for a set of sentences, such as a paragraph, pauses between the sentences may be modeled, and pause durations may be determined based on the context features. Since the pause durations between the sentences are related to the context features, they may vary with the context features, so the rhythm of the generated speech will be richer and more natural.
As shown in
The phone embedding vector sequence may be provided as an input to an encoder 110 of the neural TTS system 100. The encoder 110 may be based on various network structures. As an example, the encoder 110 may include one or more convolution layers 112 and at least one bidirectional Long Short Term Memory (BLSTM) layer 114. The encoder 110 may convert information contained in the phone embedding vector sequence into a vector space that is more robust and more suitable for learning alignment with acoustic features output by a decoder. For example, the encoder 110 may convert the phone embedding vector sequence into a phone feature 120 in the vector space. Herein, a feature generated by the encoder corresponding to the phone sequence of the text input is referred to as the phone feature. It should be appreciated that in other implementations, the phone embedding model 106 for converting the phone sequence into the phone embedding vector sequence may also be trained or updated in association with the neural TTS system 100. In this case, the phone embedding model 106 may be located inside the neural TTS system 100, for example, inside the encoder 110.
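The lookup that produces the phone embedding vector sequence fed to the encoder can be sketched as follows. The phone inventory, embedding dimension, and random table are illustrative assumptions, not values prescribed by the disclosure.

```python
import numpy as np

# Each phone in the sequence is mapped to one row of an embedding table,
# producing the phone embedding vector sequence that the encoder consumes.
rng = np.random.default_rng(2)
PHONE_INVENTORY = ["sil", "HH", "AH", "L", "OW", "W", "ER", "D"]  # hypothetical
EMBED_DIM = 16                                                     # illustrative

embedding_table = rng.standard_normal((len(PHONE_INVENTORY), EMBED_DIM))

phone_seq = ["HH", "AH", "L", "OW"]
indices = [PHONE_INVENTORY.index(p) for p in phone_seq]
phone_embedding_sequence = embedding_table[indices]  # shape: (num_phones, EMBED_DIM)
```

In a trained system the table would be learned, either in a separate phone embedding model or jointly with the encoder as described above.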
It should be appreciated that as an alternative to identifying the phone sequence, a character sequence may also be identified from the text input 102, and the identified character sequence may be further converted into a character embedding vector sequence. The character embedding vector sequence may also be provided as an input to the encoder 110 to generate a character sequence-based character feature corresponding to the text input 102.
The neural TTS system 100 may include an attention unit 130. The attention unit 130 may apply an attention mechanism which acts as a bridge connecting the encoder 110 and the decoder 140. For example, the attention mechanism may facilitate to make alignment between phone features output by the encoder 110 and acoustic features 150 output by the decoder. Various types of attention mechanism may be applied by the attention unit 130, e.g., soft attention, hard attention, location sensitive attention, Gaussian Mixture Model (GMM) attention, etc.
The decoder 140 may include a pre-net 142 consisting of feed-forward layers, Long Short Term Memories (LSTMs) 144, a linear projection 146, a post-net 148 consisting of convolution layers, etc. The LSTMs 144 may receive an input from the pre-net 142 and provide their output to the linear projection 146, while the processing by the LSTMs 144 is affected by the attention unit 130. The linear projection 146 may provide its output to the pre-net 142 and the post-net 148, respectively. Finally, the output of the post-net 148 is combined with the output of the linear projection 146 to produce the acoustic features 150. In an implementation, the linear projection 146 may also be used to generate stop tokens.
The neural TTS system 100 may also include a vocoder 160. The vocoder 160 may generate a speech waveform 170 based on the acoustic features 150 output by the decoder 140. The vocoder 160 may be based on various network structures, such as a Wavenet vocoder, a Griffin-Lim vocoder, etc.
The traditional neural TTS system 100 in
As shown in
In addition to generating the phone feature 210 of the text input 202, context features 218 of the text input 202 may also be generated by the neural TTS system 200 according to the embodiments of the present disclosure. A set of sentences associated with the text input 202 may be obtained, such as a paragraph 212 in which the text input 202 is located. Context information 214 may be extracted from the paragraph 212, such as a word sequence of the text input 202, a word sequence and acoustic features of text adjacent to the text input 202, position information of the text input 202, etc. The context features 218 may be generated based on the context information 214 through a feature generating unit 216. The feature generating unit 216 may have different structures for different context information. For example, when the context information 214 is the word sequence of the text input 202, the feature generating unit 216 may include a word embedding model, an up-sampling unit, an encoder, etc. The feature generating unit 216 may be located entirely inside the neural TTS system 200, or only partially inside it.
The generated phone feature 210 and context features 218 may be combined into mixed features through a cascading unit 220. An attention unit 222 may apply an attention mechanism on the mixed features, such as a location sensitive attention mechanism. The attended mixed features may be provided to a decoder 224. The decoder 224 may correspond to the decoder 140 in
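The cascading step above can be sketched as a concatenation along the feature dimension, assuming the phone feature and the context features are frame-aligned sequences of vectors. The shapes below are illustrative.

```python
import numpy as np

# The phone feature and the context features are assumed to be aligned to
# the same phone sequence, so one vector of each exists per phone position.
num_phones = 12
phone_dim, context_dim = 256, 64     # illustrative dimensions

phone_feature = np.random.randn(num_phones, phone_dim)
context_features = np.random.randn(num_phones, context_dim)

# Cascade into mixed features: one combined vector per phone position.
mixed_features = np.concatenate([phone_feature, context_features], axis=-1)
```

The attention mechanism and decoder then operate on this mixed sequence exactly as they would on the phone feature alone, but with the extra context channels available at every position.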
The context features 300 may include a current semantic feature 310. Herein, the current semantic feature refers to a feature generated based on a word sequence of a text input, such as a current sentence, which may reflect or contain semantic information of the current sentence. A specific process for generating the current semantic feature will be described later in conjunction with
The context features 300 may also include global features 320. The global features 320 may include historical and future context features of a text input, such as a historical acoustic feature 322, a historical semantic feature 324, a future semantic feature 326, a paragraph semantic feature 328, a position feature 330, etc. Herein, the historical acoustic feature refers to a feature generated based on acoustic features of previous sentences of the text input, which may reflect or contain acoustic information related to a way of expression and an acoustic state of a speaker when speaking the previous sentences. A specific process for generating the historical acoustic feature will be described later in conjunction with
Referring back to
A current semantic feature 420 may be generated based on the word sequence 404 through a feature generating unit 410. The feature generating unit 410 may correspond to the feature generating unit 216 in
The word embedding model 412 may be based on Natural Language Processing (NLP) techniques, such as Neural Machine Translation (NMT). The word embedding model and the neural TTS system have similar sequence-to-sequence encoder-decoder frameworks, which benefits network convergence. In one embodiment, a Bidirectional Encoder Representations from Transformers (BERT) model may be employed as the word embedding model. A word embedding vector sequence may be generated based on the word sequence 404 through the word embedding model 412, wherein each word has a corresponding embedding vector, and all of these embedding vectors form the word embedding vector sequence. A word embedding vector contains meaning and semantic context information of a word, which facilitates improving the naturalness of the generated speech. In addition, the word embedding vector also facilitates solving a word break problem of speech generation for Chinese text.
The word embedding vector sequence may be up-sampled through the up-sampling unit 414 to align with a phone sequence of the text input 402. For example, a word may be pronounced using one or more phones. Thus, during up-sampling, each word embedding vector in the word embedding vector sequence may be repeated a number of times corresponding to a number of phones of the word. The up-sampled word embedding vector sequence may be provided to the encoder 416. The encoder 416 may have a similar network structure as the encoder 110 of
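The up-sampling step above can be sketched directly: each word embedding vector is repeated once per phone of that word, so the word-level sequence aligns with the phone-level sequence. The phone counts below are illustrative.

```python
import numpy as np

# 3 words, each with an 8-dimensional embedding vector (illustrative).
word_embeddings = np.random.randn(3, 8)

# Number of phones used to pronounce each word, e.g. derived from the
# phone sequence identified for the text input (illustrative counts).
phones_per_word = [2, 3, 1]

# Repeat each word's embedding once per phone so the result has one
# vector per phone position: length 2 + 3 + 1 = 6.
upsampled = np.repeat(word_embeddings, phones_per_word, axis=0)
```

The up-sampled sequence has the same length as the phone sequence, so it can later be cascaded with the phone feature position by position.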
The respective acoustic features 502-1, 502-2, . . . , 502-k may be converted to speaker embedding vector sequences for the sentence i−1, sentence i−2, . . . , sentence i-k through acoustic encoders 512-1, 512-2, . . . 512-k, respectively. An exemplary implementation of the acoustic encoder will be described later in conjunction with
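One plausible way to combine the per-sentence speaker embedding vector sequences for sentences i−1 through i−k into a single historical acoustic representation is mean pooling, first over time within each sequence and then across the k sentences. This pooling choice is an assumption for illustration; the disclosure does not prescribe it here.

```python
import numpy as np

# k previous sentences, each represented by a speaker embedding vector
# sequence of (frames, dim); shapes are illustrative.
k, frames, dim = 3, 50, 16
speaker_embedding_seqs = [np.random.randn(frames, dim) for _ in range(k)]

# Mean-pool each sequence over time to one vector per sentence, then
# average across the k sentences to get one historical acoustic vector.
per_sentence = [seq.mean(axis=0) for seq in speaker_embedding_seqs]
historical_acoustic = np.mean(per_sentence, axis=0)
```

A weighted combination (e.g. giving more weight to sentence i−1) would be an equally reasonable variant.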
A historical semantic feature 730 may be generated based on the word sequence 704 through a feature generating unit 710. The feature generating unit 710 may correspond to the feature generating unit 216 in
The word embedding model 712 may generate a word embedding vector sequence based on the word sequence 704. The average pooling layer 714 may perform averaging pooling on the word embedding vector sequence to generate an average segment embedding vector sequence. The average segment embedding vector sequence may be up-sampled through the up-sampling unit 716 to align with a phone sequence of the text input. For example, the average segment embedding vector sequence may be repeated a number of times corresponding to a number of phones of the text input. A compressed representation of the average segment embedding vector sequence may then be obtained through the dense layer 718. The compressed representation may be provided to the encoder 720 to generate a historical semantic feature 730. The encoder 720 may have a similar network structure as the encoder 110 of
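The pipeline above (average pooling, up-sampling to the phone sequence, then a dense compression) can be sketched as follows. The dimensions and the random dense weights are illustrative; a real system would learn them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Word embedding vector sequence for the previous segment: 20 words,
# 32-dimensional embeddings (illustrative).
word_embeddings = rng.standard_normal((20, 32))
num_phones = 45   # phones in the current text input (illustrative)

# Average pooling over the words yields one segment embedding vector.
segment_embedding = word_embeddings.mean(axis=0)

# Up-sample: repeat the segment embedding once per phone of the text input.
upsampled = np.tile(segment_embedding, (num_phones, 1))

# Dense layer: an affine map compressing each vector from 32 to 8 dims.
W, b = rng.standard_normal((32, 8)), np.zeros(8)
compressed = upsampled @ W + b
```

The compressed, phone-aligned representation is then fed to the encoder to produce the historical semantic feature; the future and paragraph semantic pipelines described below follow the same shape transformations.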
In an implementation, the previous segment 702 may include several complete sentences. For example, a number of sentences included in the previous segment 702 may be determined based on a number of characters that the word embedding model 712 may process. Taking the BERT model as a word embedding model as an example, if the number of characters that the BERT model may process is 512, one or more sentences before the text input having a total number of characters not exceeding 512 may be selected as the previous segment 702.
A future semantic feature 830 may be generated based on the word sequence 804 through a feature generating unit 810. The feature generating unit 810 may correspond to the feature generating unit 216 in
The word embedding model 812 may generate a word embedding vector sequence based on the word sequence 804. The average pooling layer 814 may perform averaging pooling on the word embedding vector sequence to generate an average segment embedding vector sequence. The average segment embedding vector sequence may be up-sampled through the up-sampling unit 816 to align with a phone sequence of the text input. For example, the average segment embedding vector sequence may be repeated a number of times corresponding to a number of phones of the text input. A compressed representation of the average segment embedding vector sequence may then be obtained through the dense layer 818. The compressed representation may be provided to the encoder 820 to generate a future semantic feature 830. The encoder 820 may have a similar network structure as the encoder 110 of
In an implementation, the subsequent segment 802 may include several complete sentences. Similar to the previous segment 702, a number of sentences included in the subsequent segment 802 may be determined, for example, based on a number of characters that the word embedding model 812 may process. In another implementation, the subsequent segment 802 may include only one sentence, i.e., the sentence immediately after the text input.
A paragraph semantic feature 930 may be generated based on the word sequence 904 through a feature generating unit 910. The feature generating unit 910 may correspond to the feature generating unit 216 in
The word embedding model 912 may generate a word embedding vector sequence based on the word sequence 904. The average pooling layer 914 may perform averaging pooling on the word embedding vector sequence to generate an average paragraph embedding vector sequence. The average paragraph embedding vector sequence may be up-sampled through the up-sampling unit 916 to align with a phone sequence of the text input. For example, the average paragraph embedding vector sequence may be repeated a number of times corresponding to a number of phones of the text input. A compressed representation of the average paragraph embedding vector sequence may then be obtained through the dense layer 918. The compressed representation may be provided to the encoder 920 to generate a paragraph semantic feature 930. The encoder 920 may have a similar network structure as the encoder 110 of
A position feature 1020 may be generated based on the position information 1002 through a feature generating unit 1010. The feature generating unit 1010 may correspond to the feature generating unit 216 in
A position embedding vector sequence may be generated based on the position information 1002 through the position embedding model 1012. The position embedding vector sequence may be up-sampled through the up-sampling unit 1014 to align with a phone sequence of the text input. For example, the position embedding vector sequence may be repeated a number of times corresponding to a number of phones of the text input. A compressed representation of the up-sampled position embedding vector sequence may then be obtained through the dense layer 1016. The compressed representation may be provided to the encoder 1018 to generate a position feature 1020. The encoder 1018 may have a similar network structure as the encoder 110 of
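The position-feature pipeline above can be sketched with a small embedding table that maps the sentence's position index in the paragraph to a vector, which is then repeated once per phone. The table size and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical position embedding table: up to 100 sentence positions,
# each mapped to a 4-dimensional vector (illustrative sizes).
position_table = rng.standard_normal((100, 4))

sentence_index = 2   # e.g. the text input is the third sentence
num_phones = 30      # phones in the text input (illustrative)

# Look up the position embedding, then repeat it per phone so the
# position feature aligns with the phone sequence.
position_embedding = position_table[sentence_index]
position_feature = np.tile(position_embedding, (num_phones, 1))
```

A dense layer and encoder, as described above, would further transform this aligned sequence in a full implementation.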
The position feature generated based on the position information may reflect a position of the text input in the paragraph. In general, the position of the text input in the paragraph may affect a tone of speech corresponding to the text input. For example, when the text input is at the beginning of the paragraph, its tone tends to be high; when the text input is in the middle of the paragraph, its tone tends to be flat; and when the text input is at the end of the paragraph, its tone tends to be high. Therefore, when generating the speech of the text input, considering the position information of the text input facilitates enhancement of the naturalness of the generated speech.
It should be appreciated that the processes 400-500, 700-1000 in
Referring back to
As shown in
Moreover, an attention mechanism may be applied separately on global features 1110. For example, the global features 1110 may be provided to an attention unit 1112. The global features 1110 may correspond to the global features 320 in
It should be appreciated that the context features 1104 in
In addition, averaging pooling 1112 may be performed on global features 1210 to obtain average global features. The global features 1210 may correspond to the global features 320 in
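As a hedged sketch of the averaging-pooling path: if each global feature is a phone-aligned sequence, pooling over the phone axis yields one average vector per global feature, and the pooled result can be broadcast back to every phone position when mixing with the phone feature. The shapes and the broadcast step are assumptions for illustration.

```python
import numpy as np

# Three global features, each aligned to 25 phone positions with
# 8-dimensional vectors (illustrative shapes).
num_phones, dim = 25, 8
global_features = [np.random.randn(num_phones, dim) for _ in range(3)]

# Averaging pooling over the phone axis: one vector per global feature.
average_global = [g.mean(axis=0) for g in global_features]
average_global = np.concatenate(average_global)   # shape: (3 * dim,)

# Broadcast the pooled vector to every phone position before cascading
# it with the phone-level features.
broadcast = np.tile(average_global, (num_phones, 1))
```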
It should be appreciated that the context features 1204 in
The exemplary neural TTS systems according to the embodiments of the disclosure are described above with reference to
Traditional methods, when generating a speech waveform for a set of sentences, such as a paragraph, usually generate a corresponding speech waveform for each sentence in the set of sentences, and then combine all the generated speech waveforms, thereby obtaining the speech waveform corresponding to the set of sentences. More specifically, when combining the speech waveforms, pause durations between respective speech waveforms corresponding to respective sentences are set to be the same or fixed. In order to obtain speech with a richer and more natural rhythm between sentences, the embodiments of the present disclosure further improve the method for generating speech through neural TTS synthesis.
At step 1310, a text input may be obtained.
At step 1320, a phone sequence may be identified from the text input, such as a current sentence, by various techniques such as LTS.
At step 1330, the phone sequence may be updated by adding a begin token and/or an end token to the beginning and/or the end of the phone sequence, respectively. In an implementation, a mute phone may be used as the begin token and the end token.
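Step 1330 can be sketched directly. Following the text, a mute (silence) phone serves as both the begin token and the end token; the symbol "sil" is a hypothetical name for that phone.

```python
# Hypothetical symbol for the mute phone used as begin/end token.
BEGIN_TOKEN = END_TOKEN = "sil"

def update_phone_sequence(phone_seq, add_begin=True, add_end=True):
    """Add a begin token and/or an end token to the phone sequence."""
    updated = list(phone_seq)
    if add_begin:
        updated = [BEGIN_TOKEN] + updated
    if add_end:
        updated = updated + [END_TOKEN]
    return updated

updated_seq = update_phone_sequence(["HH", "AH", "L", "OW"])
```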
At step 1340, a phone feature may be generated based on the updated phone sequence. For example, the updated phone sequence may be converted into a phone embedding vector sequence through a phone embedding model. The phone embedding vector sequence may then be provided to an encoder, such as the encoder 110 in
At step 1350, a speech waveform corresponding to the text input may be generated based on the phone feature and context features. The speech waveform may be generated in a variety of ways.
In an implementation, as shown in
In another implementation, as shown in
In yet another implementation, as shown in
Since mute phones are added as the begin token and/or the end token at the beginning and/or the end of the phone sequence of the text input, the generated speech waveform corresponding to the text input has a period of silence at the beginning and/or the end, respectively. Since the context features are taken into account when generating the speech waveform, the silence at the beginning and/or the end of the speech waveform is related to the context features and may vary with them. For example, the silence at the end of the speech waveform of the text input and the silence at the beginning of a speech waveform of the next sentence together constitute a pause between the text input and the next sentence. The duration of the pause may accordingly vary with the context features, such that the rhythm of the generated speech corresponding to a set of sentences is richer and more natural.
Firstly, a paragraph text 1402 and a paragraph speech waveform 1404 corresponding to the paragraph text 1402 may be obtained.
At 1406, the paragraph text 1402 may be split into a plurality of sentences.
At 1408, the paragraph speech waveform 1404 may be split into a plurality of speech waveforms corresponding to the plurality of sentences in the paragraph text 1402. At the time of splitting, pauses between the sentences may be reserved. For example, a pause between two adjacent speech waveforms may be split into two portions, with the first portion attached to the end of the previous speech waveform and the latter portion attached to the beginning of the latter speech waveform.
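The splitting at 1408 can be sketched as follows: the pause span between two sentences is cut in half, with the first half attached to the end of the previous waveform and the second half to the beginning of the next. The sample indices and the stand-in waveform are illustrative.

```python
import numpy as np

# Stand-in paragraph waveform of 1000 samples (illustrative).
waveform = np.arange(1000, dtype=float)

# A detected pause between sentence 1 and sentence 2, as sample indices
# (illustrative; a real system would detect this from silence).
pause_start, pause_end = 400, 500

# Split the pause in half: first half goes to the end of the previous
# sentence's waveform, second half to the start of the next sentence's.
mid = (pause_start + pause_end) // 2
waveform_1 = waveform[:mid]    # sentence 1 + first half of the pause
waveform_2 = waveform[mid:]    # second half of the pause + sentence 2
```

Reserving the pause halves this way keeps the training targets consistent with the silence introduced by the begin/end mute phones described above.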
The plurality of sentences obtained from the paragraph text 1402 may correspond to the plurality of speech waveforms obtained from the paragraph speech waveform 1404 one by one, forming a plurality of training data pairs, such as [sentence 1, speech waveform 1], [sentence 2, speech waveform 2], [sentence 3, speech waveform 3], etc. These training data pairs may be used to construct a training corpus.
A text input, such as a current sentence, may be identified from the training corpus and a current phone sequence 1410 may be identified from the text input. A phone feature 1414 may be generated based on the identified current phone sequence 1410 through feature generating 1412. The phone feature 1414 may be generated, for example, by using the approach of generating the phone feature 120 based on the phone sequence 104 described with reference to
A current word sequence 1420 may be identified from the text input. A current semantic feature 1424 may be generated based on the identified current word sequence 1420 through feature generating 1422. The current semantic feature 1424 may be generated, for example, by using the approach of generating the current semantic feature 420 based on the word sequence 404 described with reference to
A previous segment located before the text input may be identified from the training corpus, and a previous word sequence 1430 may be identified from the previous segment. A historical semantic feature 1434 may be generated based on the identified previous word sequence 1430 through feature generating 1432. The historical semantic feature 1434 may be generated, for example, by using the approach of generating the historical semantic feature 730 based on the word sequence 704 described with reference to
A subsequent segment located after the text input may be identified from the training corpus, and a subsequent word sequence 1440 may be identified from the subsequent segment. A future semantic feature 1444 may be generated based on the identified subsequent word sequence 1440 through feature generating 1442. The future semantic feature 1444 may be generated, for example, by using the approach of generating the future semantic features 830 based on the word sequence 804 described with reference to
A plurality of sentences in a paragraph in which the text input is located may be identified from the training corpus, and a paragraph word sequence 1450 may be identified from the plurality of sentences. In an implementation, a paragraph word sequence 1450 may be identified directly from the plurality of sentences in the paragraph. In another implementation, a core sentence may be extracted from the plurality of sentences in the paragraph, and then the paragraph word sequence 1450 may be identified from the extracted core sentence. A paragraph semantic feature 1454 may be generated based on the identified paragraph word sequence 1450 through feature generating 1452. The paragraph semantic feature 1454 may be generated, for example, by using the approach of generating the paragraph semantic feature 930 based on the word sequence 904 described with reference to
Position information 1460 of the text input in the paragraph in which it is located may be obtained through the training corpus. A position feature 1464 may be generated based on the extracted position information 1460 through feature generating 1462. The position feature 1464 may be generated, for example, by using the approach of generating the position feature 1020 based on the position information 1002 described with reference to
At least one previous acoustic feature 1470 corresponding to at least one previous sentence before the text input may also be obtained through the training corpus. It is to be noted that the at least one previous acoustic feature is a ground truth acoustic feature. Taking the text input being sentence i as an example, the at least one previous acoustic feature of the text input may be at least one acoustic feature corresponding to at least one speech waveform of at least one sentence before the text input, such as at least one acoustic feature corresponding to speech waveform i−1 of sentence i−1, speech waveform i−2 of sentence i−2, . . . , speech waveform i-k of sentence i-k. A historical acoustic feature 1474 may be generated based on the obtained at least one previous acoustic feature 1470 through feature generating 1472. The historical acoustic feature 1474 may be generated, for example, by using the approach of generating the historical acoustic feature 520 based on the acoustic features 502-1, 502-2, . . . , 502-k described with reference to
The current semantic feature 1424, historical semantic feature 1434, future semantic feature 1444, paragraph semantic feature 1454, position feature 1464, and historical acoustic feature 1474 generated by the approach described above may be combined into context features 1480.
The phone feature 1414 and context features 1480 generated according to the process illustrated in
Taking training the neural TTS system 1500 using a training data pair [sentence i, speech waveform i] as an example, a phone feature 1502 and context features 1504 of the sentence i may first be obtained. The phone feature 1502 and the context features 1504 of the sentence i may be generated, for example, by the approach shown in
The phone feature 1502 and the context features 1504 of the sentence i may be combined into mixed features through the cascading unit 1506, which may be further processed by the attention unit 1508, the decoder 1510, and the vocoder 1530, to generate a speech waveform i 1540 corresponding to the sentence i. The decoder 1510 may include a pre-net 1512, LSTMs 1514, a linear projection 1516, and a post-net 1518. In a training phase, an input of the pre-net 1512 is from a ground truth acoustic feature 1520 corresponding to a previous speech waveform, i.e., speech waveform i−1 corresponding to sentence i−1.
It should be appreciated that although the foregoing discussion relates to training of the neural TTS system 1500 corresponding to the neural TTS system 200 in
At step 1610, a text input may be obtained.
At step 1620, a phone feature of the text input may be generated.
At step 1630, context features of the text input may be generated based on a set of sentences associated with the text input.
At step 1640, a speech waveform corresponding to the text input may be generated based on the phone feature and the context features.
In an implementation, the generating the phone feature may comprise: identifying a phone sequence from the text input; and generating the phone feature based on the phone sequence.
In an implementation, the generating the context features may comprise: obtaining acoustic features corresponding to at least one sentence of the set of sentences before the text input; and generating the context features based on the acoustic features.
The generating the context features may further comprise: aligning the context features with a phone sequence of the text input.
In an implementation, the generating the context features may comprise: identifying a word sequence from at least one sentence of the set of sentences; and generating the context features based on the word sequence.
The at least one sentence may comprise at least one of: a sentence corresponding to the text input, sentences before the text input, and sentences after the text input.
The at least one sentence may represent content of the set of sentences.
The generating the context features based on the word sequence may comprise: generating a word embedding vector sequence based on the word sequence; generating an average embedding vector sequence corresponding to the at least one sentence based on the word embedding vector sequence; aligning the average embedding vector sequence with a phone sequence of the text input; and generating the context features based on the aligned average embedding vector sequence.
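These operations may be sketched as follows; the deterministic random embedding stands in for a trained word-embedding table, and the even distribution of phone positions over sentences is an illustrative alignment choice:

```python
import numpy as np

def word_embedding(word, dim=4):
    # Deterministic pseudo-embedding per word (stand-in for a learned table).
    seed = sum(ord(c) for c in word)
    return np.random.default_rng(seed).standard_normal(dim)

def sentence_average_embeddings(sentences, dim=4):
    # One average embedding vector per sentence of the word sequence.
    return np.stack([np.mean([word_embedding(w, dim) for w in s.split()], axis=0)
                     for s in sentences])

def align_to_phones(avg_embs, num_phones):
    # Spread phone positions evenly across the per-sentence vectors.
    idx = np.arange(num_phones) * len(avg_embs) // num_phones
    return avg_embs[idx]

avg = sentence_average_embeddings(["the wind rose", "it was dark"])
ctx = align_to_phones(avg, num_phones=6)
```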
In an implementation, the generating the context features may comprise: determining a position of the text input in the set of sentences; and generating the context features based on the position.
The generating the context features based on the position may comprise: generating a position embedding vector sequence based on the position; aligning the position embedding vector sequence with a phone sequence of the text input; and generating the context features based on the aligned position embedding vector sequence.
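A minimal sketch, using a sinusoidal position embedding as an illustrative choice (the disclosure does not prescribe a particular embedding):

```python
import numpy as np

def position_embedding(position, dim=4):
    # Sinusoidal embedding of the sentence's index within its paragraph.
    i = np.arange(dim // 2)
    angles = position / (10000 ** (2 * i / dim))
    return np.concatenate([np.sin(angles), np.cos(angles)])

def position_context(position, num_phones, dim=4):
    # Align with the phone sequence: one copy of the embedding per phone.
    return np.tile(position_embedding(position, dim), (num_phones, 1))

ctx = position_context(position=2, num_phones=3)
```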
In an implementation, the generating the speech waveform may comprise: combining the phone feature and the context features into mixed features; applying an attention mechanism on the mixed features to obtain attended mixed features; and generating the speech waveform based on the attended mixed features.
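A sketch of this implementation with a simple dot-product attention standing in for the attention unit; real systems would use a learned (e.g., location-sensitive) attention, and the decoder query here is a placeholder:

```python
import numpy as np

def attend(query, memory):
    # Dot-product attention: softmax over time steps, then weighted sum.
    scores = memory @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory

phone_feat = np.ones((5, 3))      # 5 phone positions, feature dim 3 (toy)
context_feat = np.zeros((5, 2))   # aligned context features, dim 2 (toy)
# Combine phone and context features into mixed features.
mixed = np.concatenate([phone_feat, context_feat], axis=1)   # (5, 5)
query = np.ones(5)                # stand-in for a decoder state
attended = attend(query, mixed)   # attended mixed features
```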
In an implementation, the generating the speech waveform may comprise: combining the phone feature and the context features into first mixed features; applying a first attention mechanism on the first mixed features to obtain first attended mixed features; applying a second attention mechanism on at least one context feature of the context features to obtain at least one attended context feature; combining the first attended mixed features and the at least one attended context feature into second mixed features; and generating the speech waveform based on the second mixed features.
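This two-attention variant may be sketched as follows, with the same illustrative dot-product attention applied once to the first mixed features and once to a single context feature, and the two attended outputs concatenated into second mixed features:

```python
import numpy as np

def soft_attend(query, memory):
    # Illustrative dot-product attention over a feature sequence.
    scores = memory @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ memory

# First attention over the first mixed features (toy 4x4 sequence).
first_mixed = np.eye(4)
first_attended = soft_attend(np.ones(4), first_mixed)

# Second attention over one context feature (toy 2x2 sequence).
context_feat = np.array([[2.0, 0.0], [0.0, 2.0]])
attended_context = soft_attend(np.ones(2), context_feat)

# Combine into second mixed features for waveform generation.
second_mixed = np.concatenate([first_attended, attended_context])
```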
In an implementation, the generating the speech waveform may comprise: combining the phone feature and the context features into first mixed features; applying an attention mechanism on the first mixed features to obtain first attended mixed features;
performing average pooling on at least one context feature of the context features to obtain at least one average context feature; combining the first attended mixed features and the at least one average context feature into second mixed features; and generating the speech waveform based on the second mixed features.
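The pooling variant may be sketched as follows; the feature values are illustrative toy data:

```python
import numpy as np

def average_pool(context_feat):
    # Average a context feature over its time axis into one vector.
    return context_feat.mean(axis=0)

attended_mixed = np.array([0.5, 0.5, 0.5])          # stand-in attention output
context_feat = np.array([[1.0, 3.0], [3.0, 5.0]])   # (T=2, dim=2) context feature
pooled = average_pool(context_feat)
# Combine into second mixed features for waveform generation.
second_mixed = np.concatenate([attended_mixed, pooled])
```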
In an implementation, the generating the phone feature may comprise:
identifying a phone sequence from the text input; updating the phone sequence by adding a begin token and/or an end token to the phone sequence, wherein the length of the begin token and the length of the end token are determined according to the context features; and generating the phone feature based on the updated phone sequence.
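A sketch of this implementation; the token names and the rule tying token length to paragraph position are illustrative assumptions, not prescribed by the disclosure:

```python
def pad_phone_sequence(phones, is_paragraph_start, is_paragraph_end):
    # Token lengths determined by context: e.g., a longer begin token
    # (leading silence) when the sentence opens a paragraph.
    begin_len = 3 if is_paragraph_start else 1
    end_len = 3 if is_paragraph_end else 1
    return ["<b>"] * begin_len + phones + ["<e>"] * end_len

padded = pad_phone_sequence(["HH", "AH"],
                            is_paragraph_start=True,
                            is_paragraph_end=False)
```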
In an implementation, the set of sentences may be in the same paragraph.
It should be appreciated that the method 1600 may further comprise any steps/processes for generating speech through neural TTS synthesis according to the embodiments of the present disclosure as mentioned above.
In an implementation, the context feature generating module 1730 may be further configured for: obtaining acoustic features corresponding to at least one sentence of the set of sentences before the text input; and generating the context features based on the acoustic features.
In an implementation, the context feature generating module 1730 may be further configured for: identifying a word sequence from at least one sentence of the set of sentences; and generating the context features based on the word sequence.
In an implementation, the phone feature generating module 1720 may be further configured for: identifying a phone sequence from the text input; updating the phone sequence by adding a begin token and/or an end token to the phone sequence, wherein the length of the begin token and the length of the end token are determined according to the context features; and generating the phone feature based on the updated phone sequence.
Moreover, the apparatus 1700 may further comprise any other modules configured for generating speech through neural TTS synthesis according to the embodiments of the present disclosure as mentioned above.
The apparatus 1800 may comprise at least one processor 1810. The apparatus 1800 may further comprise a memory 1820 connected with the processor 1810. The memory 1820 may store computer-executable instructions that, when executed, cause the processor 1810 to perform any operations of the methods for generating speech through neural TTS synthesis according to the embodiments of the present disclosure as mentioned above.
The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for generating speech through neural TTS synthesis according to the embodiments of the present disclosure as mentioned above.
It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
Processors are described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether these processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a microcontroller, a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), state machine, gate logic, discrete hardware circuitry, and other suitable processing components configured to perform the various functions described in this disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a microcontroller, a DSP, or other suitable platforms.
Software should be considered broadly to represent instructions, instruction sets, code, code segments, program code, programs, software modules, applications, software applications, software packages, routines, subroutines, objects, running threads, processes, functions, and the like. Software may reside on a computer-readable medium.
Computer readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, a random access memory (RAM), a read only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the claims.
Number | Date | Country | Kind
---|---|---|---
201910864208.0 | Sep 2019 | CN | national

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2020/038019 | 6/17/2020 | WO |