The invention generally relates to the field of synthetic voice production. In particular, the invention relates to a technique for using generic synthetic voice data to produce speech signals that resemble a user's voice.
Text-to-speech (TTS) synthesis refers to a technique for artificially generating synthetic speech. The synthetic speech is generally composed by a computer system and designed to sound like human speech. Another technique, referred to as the personalization of TTS, seeks to modify the speech synthesized by the TTS system so that it sounds like a target speaker. One of the challenges in doing so is matching the rhythm and speaking style using a small amount of data, generally limited to a small number of utterances from that speaker. As a result, the syllable durations of a typical speaker do not match the syllable durations of the TTS system's output for the same sentence.
The mismatch between a typical speaker and corresponding TTS output is illustrated in
There is therefore a need for a technique for adapting the TTS system speech to match the target speaker, thereby generating synthetic speech that realistically sounds like the target speaker.
The invention in the preferred embodiment is a method and system for personalizing synthetic speech from a text-to-speech (TTS) system. The method comprises: recording target speech data having a plurality of words with onsets and rimes; generating synthetic speech data with the same set of words; identifying pairs of onsets and rimes in the target speech; determining the durations of the onsets and rimes in the target speech and synthetic speech data; generating a plurality of onset scaling factors and rime scaling factors; generating a linguistic feature vector for each of the plurality of words; associating each of the linguistic feature vectors with an onset scaling factor and a rime scaling factor; receiving target text comprising a second plurality of words; identifying pairs of onsets and rimes for the second plurality of words; generating a linguistic feature vector for each of the second plurality of words; identifying onset and rime scaling factors based on the linguistic feature vectors for the second plurality of words; generating synthetic speech based on the target text; compressing or expanding the duration of each onset and rime for the second plurality of words in the synthetic speech based on the identified onset scaling factor and rime scaling factor; generating a waveform from the onsets and rimes with compressed or expanded durations; and playing the waveform to a user. In this embodiment the target speech data substantially consists of Chinese Mandarin speech, and the target text substantially consists of Chinese Mandarin words.
Each linguistic feature vector is associated with a current syllable and comprises a plurality of onset and rime feature attributes, including a group ID attribute, voicing attribute, complexity attribute, and nasality attribute for the current syllable. The group ID attribute is assigned a value from among 10 different groups or categories. The voicing attribute is assigned a value associated with one of a plurality of voicing categories, where the categories differ in the frequency domain representation of the rime, namely the positions of formants in the frequency domain. The complexity attribute is assigned a value associated with one of a plurality of complexity categories based on the number of vowels in the rime. The nasality attribute is assigned a value associated with one of a plurality of nasality categories based on the composition of consonants in the rime.
The linguistic feature vector described above is used to characterize and categorize the onset and rime of a given syllable referred to herein as the “current” syllable. A different linguistic feature vector is generated for each syllable in the target speech data and target text. In some embodiments, the linguistic feature vectors further include an onset feature attribute and a plurality of rime feature attributes characterizing the syllable preceding the current syllable to provide context. The linguistic feature vectors may further include an onset feature attribute and a plurality of rime feature attributes characterizing the syllable following the current syllable for additional context.
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, and in which:
The invention features a speech personalization system and method for generating realistic-sounding synthetic speech. In the preferred embodiment, the speech signal of a Text-to-Speech (TTS) system is corrected to emulate a particular target speaker. That is, the speech signal, after modification with a plurality of scaling factors, accurately reflects the voice and speech patterns of a target speaker while retaining the words of the TTS input. In the preferred embodiment, the TTS system applies the plurality of scaling factors to compress or expand segments of the speech signal outputted from the TTS. Compression is represented by a scaling factor less than one, expansion by a scaling factor greater than one; a scaling factor of one leaves the segment duration unchanged.
In the preferred embodiment, the synthetic speech signal from the TTS system is spoken in the Chinese language. As illustrated in
As illustrated in the corrected TTS speech signal in
Illustrated in
The target speech data is then decomposed and pairs of onsets and rimes identified 405 for all the words (or syllables) of the target speech data. The duration of each pair of onset and rime is then determined 410; these target-speaker durations are denoted dspk.
The words spoken in the target speech data are converted to a string of text which is provided as input into the TTS system. The TTS system output 415 is a synthetic voice speaking the words present in the target speech data, but in a generic, unnatural-sounding voice. As described above, each pair of onset and rime is identified and the corresponding onset and rime durations, denoted dtts, are likewise determined.
Next, a scaling factor is computed 425 for each pair of onset and rime. The initial scaling factor for an onset is computed as follows:
While the initial scaling factor for a rime is given by:
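The two formulas above appear only as figures in the source; a plausible reconstruction, assuming each initial scaling factor is the ratio of the target-speaker duration to the corresponding TTS duration, is:

```latex
S_o = \frac{d^{\,spk}_{onset}}{d^{\,tts}_{onset}}, \qquad
S_r = \frac{d^{\,spk}_{rime}}{d^{\,tts}_{rime}}
```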
An initial scaling factor may be too high or too low to be useful where the target speech data is very noisy. To avoid spurious results, the value of onset scaling factors (So) and rime scaling factors (Sr) may be limited within the range of (0.5 to 2.0) and (0.3 to 3.0), respectively, using the following functions:
So=max(0.5, min(So,2.0)),
Sr=max(0.3, min(Sr,3.0)).
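The clamping functions above can be expressed directly in code; the following is a minimal sketch (Python is used here for illustration only):

```python
# Clamp raw onset/rime scaling factors to the ranges given in the text,
# so that noisy target speech data cannot produce spurious factors.
def clamp_onset(s_o: float) -> float:
    """Limit an onset scaling factor to the range [0.5, 2.0]."""
    return max(0.5, min(s_o, 2.0))

def clamp_rime(s_r: float) -> float:
    """Limit a rime scaling factor to the range [0.3, 3.0]."""
    return max(0.3, min(s_r, 3.0))
```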
Upon completion, there will be two scaling factors for each syllable, one for the onset and one for the rime.
Next, a linguistic feature vector (x) is computed 430 for each syllable. The linguistic feature vector is constructed based on attributes that characterize the syllable as well as the syllable's context in the sentence. In this embodiment, the context includes attributes characterizing a preceding syllable as well as a subsequent syllable, although this context may vary depending on the application. In the preferred embodiment, the linguistic feature vector consists of the following six parts:
1. Group ID vector for onsets and rimes in the current syllable;
2. Group ID vector for onset and rime in the syllable immediately preceding the current syllable;
3. Group ID vector for onset and rime in the syllable immediately following the current syllable;
4. Tone of the current syllable;
5. Tone of the syllable immediately preceding the current syllable; and
6. Tone of the syllable immediately following the current syllable.
The group ID vector comprises one or more numerical values assigned to a phoneme or phoneme combination based on one or more categories of attributes. In the preferred embodiment, the group ID for an onset is selected from TABLE 1 in
In the preferred embodiment, the group ID for a rime is based on three attributes consisting of phoneme “voicing”, phoneme “complexity”, and phoneme “nasality”. The “voicing” attribute is assigned a value ranging between 0 and 10, effectively binning rimes into one of eleven groups of similar phonemes; the bins for the voicing attribute are organized and numbered based on the similarity of the rimes' formants in their spectral representations. The “complexity” attribute is assigned a value ranging between 0 and 2, effectively binning rimes into one of three groups of similar phonemes; the bins for the complexity attribute are numbered based on the number of vowels in the rime. The “nasality” attribute is likewise assigned a value ranging between 0 and 2, effectively binning rimes into one of three groups of similar phonemes: a value of 0 where the phoneme possesses no nasality, a value of 1 where the phoneme ends in the “N” sound, and a value of 2 where the phoneme ends in the “NG” sound.
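The mapping from a rime to its group ID vector can be sketched as follows. The voicing groups come from the patent's tables (not reproduced here), so the lookup contains hypothetical entries chosen to match the worked D-AH example later in the text; the complexity rule (vowel count minus one) is likewise an assumption inferred from that example:

```python
# Sketch: compute a rime's group ID vector [voicing, complexity, nasality].
# VOICING_GROUP is hypothetical; actual bins (0..10) come from the patent's
# rime tables, which group rimes by formant similarity.
VOICING_GROUP = {"AH": 1, "A-NG": 1, "AH-N": 1}
VOWEL_PHONEMES = {"A", "E", "I", "O", "U", "AH", "EH", "IH", "AO", "UH"}

def rime_group_id(rime: str) -> list:
    phonemes = rime.split("-")
    voicing = VOICING_GROUP.get(rime, 0)              # 0..10 per the text
    n_vowels = sum(p in VOWEL_PHONEMES for p in phonemes)
    complexity = min(max(n_vowels - 1, 0), 2)         # 0..2 (assumed rule:
                                                      # vowel count - 1)
    if rime.endswith("NG"):
        nasality = 2          # rime ends in the "NG" sound
    elif rime.endswith("N"):
        nasality = 1          # rime ends in the "N" sound
    else:
        nasality = 0          # no nasality
    return [voicing, complexity, nasality]
```

With these assumptions, rime AH maps to [1,0,0], A-NG to [1,0,2], and AH-N to [1,0,1], consistent with the worked example below.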
In the preferred embodiment, the group ID for a rime is selected from TABLE 2A or 2B in
The numerical values assigned to these group ID attributes are intelligently selected to limit the variability or range of attribute values. This operation effectively reduces the dimensionality of the attributes into a limited number of clusters or groups, each group consisting of similar data types. As a result, similar sounding onsets and rimes may be used to predict the scaling factor for various onsets and rimes even when those particular onsets and rimes are absent from the corpus derived from the target speech data. That is to say, the present invention enables the speech to be time scaled more accurately despite the availability of less training data or incomplete training data.
By way of example, the group ID vectors for the syllable D-AH in the sequence ZH, A-NG, D, AH, H, AH-N are [1], [1,0,0], [7], [1,0,2], [2], and [1,0,1], respectively. In this sequence, D-AH represents the current syllable, ZH and A-NG represent the syllable immediately preceding the current syllable, and H and AH-N represent the syllable immediately following the current syllable. The group ID vectors for onset D and rime AH are given by [1] and [1,0,0], respectively. Similarly, the group ID vectors for the onset and rime of the preceding syllable are [7] and [1,0,2], respectively, while the group ID vectors for the onset and rime of the following syllable are [2] and [1,0,1], respectively. Therefore, the linguistic feature vector for syllable D-AH includes all group ID vectors: [1], [1,0,0], [7], [1,0,2], [2], and [1,0,1].
With regard to tones, there are five possible tones: 0, 1, 2, 3 and 4, which are readily available in a standard Chinese pronunciation dictionary and known to those of ordinary skill in the art. For the syllable D-AH, the tones of the current syllable, preceding syllable, and following syllable are 3, 2, and 3, respectively. After concatenating all group ID vectors and tones, the linguistic feature vector (x) for syllable D-AH is given by [1,1,0,0, 7,1,0,2, 2,1,0,1, 3,2,3].
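The concatenation step above can be sketched directly; the group ID vectors and tones are those given in the worked D-AH example, and the helper simply joins the six parts in the order the preferred embodiment specifies:

```python
# Assemble the 15-element linguistic feature vector for the D-AH example:
# current onset/rime group IDs, preceding onset/rime, following onset/rime,
# then the three tones (current, preceding, following).
def linguistic_feature_vector(cur_onset, cur_rime,
                              prev_onset, prev_rime,
                              next_onset, next_rime,
                              cur_tone, prev_tone, next_tone):
    return (cur_onset + cur_rime
            + prev_onset + prev_rime
            + next_onset + next_rime
            + [cur_tone, prev_tone, next_tone])

x = linguistic_feature_vector([1], [1, 0, 0],   # current syllable: D, AH
                              [7], [1, 0, 2],   # preceding: ZH, A-NG
                              [2], [1, 0, 1],   # following: H, AH-N
                              3, 2, 3)          # tones
# x == [1, 1, 0, 0, 7, 1, 0, 2, 2, 1, 0, 1, 3, 2, 3]
```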
Once the linguistic feature vector (x), and onset and rime scale factors (So and Sr) are extracted for all the syllables, the linguistic feature vector and scale factors are associated with one another 435 using a neural network or other model that can estimate So and Sr for a given linguistic feature vector (x). In the preferred embodiment, two regression trees are generated, one for estimating onset scaling factor (So) and another for estimating rime scaling factor (Sr). In particular, a Gradient Boosting Regression Tree (GBRT) is developed using each linguistic feature vector as the input and the corresponding scaling factors (So and Sr) as the output.
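The training step can be sketched as follows, using scikit-learn's GradientBoostingRegressor as a stand-in GBRT implementation and randomly generated toy data in place of the extracted corpus (feature dimensions and target ranges follow the text; everything else is illustrative):

```python
# Train two GBRTs: one mapping linguistic feature vectors to onset scaling
# factors (S_o), one to rime scaling factors (S_r). Toy data stands in for
# the vectors and clamped scaling factors extracted from the target corpus.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 15)).astype(float)   # feature vectors
so = np.clip(1.0 + 0.1 * rng.standard_normal(200), 0.5, 2.0)
sr = np.clip(1.0 + 0.2 * rng.standard_normal(200), 0.3, 3.0)

onset_gbrt = GradientBoostingRegressor().fit(X, so)     # estimates S_o
rime_gbrt = GradientBoostingRegressor().fit(X, sr)      # estimates S_r

x_new = X[:1]                     # a feature vector for unseen target text
s_o = float(onset_gbrt.predict(x_new)[0])
s_r = float(rime_gbrt.predict(x_new)[0])
```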
Once the regression trees are trained, they may be used to estimate scaling factors for sequences of target text. First, the sequence of text is received 440, either directly from the user or from the TTS system, which in turn may have generated it from an audio sequence provided by a user via mobile phone, for example. The onset and rime are then identified 445 for each of the second plurality of words in the target text. A linguistic feature vector is then generated 450 for each syllable based on the pairs of identified onsets and rimes. Using the GBRT, the linguistic feature vector of each of the second plurality of words is used to estimate 455, look up, or otherwise retrieve an onset scaling factor and rime scaling factor for each syllable. The durations of the syllables of synthetic speech are then identified 465 and those durations compressed or expanded 470 using the respective onset scaling factor and rime scaling factor. The time scale of the audio frames of the synthetic speech may be modified using any of a number of time-warping techniques known to those skilled in the art. The modified speech is then used to generate 475 a waveform that is made available to the user for playback 480 via the speaker in a mobile phone, for example. As one skilled in the art will appreciate, the modified synthetic speech now resembles the voice and exhibits the speech patterns of the target speaker.
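The compression/expansion of a single onset or rime segment can be illustrated with naive linear-interpolation resampling. This is only a stand-in for the pitch-preserving time-warping techniques the text alludes to (e.g., PSOLA- or WSOLA-style methods): simple resampling changes pitch along with duration, which a production system would avoid.

```python
# Naive time-scaling of one onset or rime segment by linear interpolation.
# A factor > 1 expands (lengthens) the segment; a factor < 1 compresses it.
def scale_segment(samples, factor):
    n_out = max(1, round(len(samples) * factor))
    out = []
    for i in range(n_out):
        # Map each output index to a fractional position in the input.
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

segment = [0.0, 0.5, 1.0, 0.5, 0.0]       # a toy 5-sample segment
expanded = scale_segment(segment, 2.0)    # expansion: 10 samples
compressed = scale_segment(segment, 0.5)  # compression: 2 samples
```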
Illustrated in
The onset/rime identification module 610 then identifies pairs of onsets and rimes for each of the words in the training speech and synthetic speech, as well as the durations of those onsets and rimes. The durations of the onsets and rimes are then used to generate onset and rime scaling factors, which are retained in the scaling factors database 630.
Linguistic feature vectors are also generated to characterize each pair of onset and rime in the training speech based on attributes of the syllable as well as the context of the syllable. As described above, the linguistic feature vectors effectively classify syllables into a limited number of clusters or groups based on the voicing, complexity, and nasality attributes of the syllable and context. The Group ID's are retained in the Group ID database 640.
The speech personalization system further includes two GBRT's 650 that associate the onset and rime scaling factors with the linguistic feature vectors. In particular, the system 600 includes a first GBRT trained to estimate an onset scaling factor based on a given linguistic feature vector, and a second GBRT trained to estimate a rime scaling factor based on a linguistic feature vector. Together, the first and second GBRT's 650 generate the two scaling factors necessary to modify the duration of a syllable from the default duration in the generic synthetic voice to the specific duration that matches the target speaker's speech pattern, thus enabling the speech personalization system 600 to tailor the speech to a specific speaker.
In operation, a user may speak into the microphone 672 on the mobile phone 670, for example, and that speech is converted into synthetic speech using the TTS 680. In other embodiments, the user taps the soft keys of a mobile phone keyboard 676 to generate text or a text message to the TTS 680, which then generates the synthetic speech. Linguistic feature vectors characterizing and/or classifying the syllables of the speech are generated and used with the first and second GBRT's to estimate scaling factors for all the onsets and rimes, respectively. The compression/expansion module 660 then applies the scaling factors to modify the time scale of the synthetic speech and produce personalized speech, which is transmitted to the user's phone 670 in the form of a waveform file that may be played back to the user via the phone's speaker 674.
One or more embodiments of the present invention may be implemented with one or more computer readable media, wherein each medium may be configured to include thereon data or computer executable instructions for manipulating data. The computer executable instructions include data structures, objects, programs, routines, or other program modules that may be accessed by a processing system, such as one associated with a general-purpose computer or processor capable of performing various different functions or one associated with a special-purpose computer capable of performing a limited number of functions. Computer executable instructions cause the processing system to perform a particular function or group of functions and are examples of program code means for implementing steps for methods disclosed herein. Furthermore, a particular sequence of the executable instructions provides an example of corresponding acts that may be used to implement such steps. Examples of computer readable media include random-access memory (“RAM”), read-only memory (“ROM”), programmable read-only memory (“PROM”), erasable programmable read-only memory (“EPROM”), electrically erasable programmable read-only memory (“EEPROM”), compact disk read-only memory (“CD-ROM”), or any other device or component that is capable of providing data or executable instructions that may be accessed by a processing system. Examples of mass storage devices incorporating computer readable media include hard disk drives, magnetic disk drives, tape drives, optical disk drives, and solid state memory chips, for example. The term processor as used herein refers to a number of processing devices including personal computing devices, servers, general purpose computers, special purpose computers, application-specific integrated circuit (ASIC), and digital/analog circuits with discrete components, for example.
Although the description above contains many specifics, these should not be construed as limiting the scope of the invention but as merely providing illustrations of some of the presently preferred embodiments of this invention.
Therefore, the invention has been disclosed by way of example and not limitation, and reference should be made to the following claims to determine the scope of the present invention.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 62/469,457 filed Mar. 9, 2017, titled “Modifying syllable durations for personalizing Chinese Mandarin TTS using small corpus,” which is hereby incorporated by reference herein for all purposes.
Number | Name | Date | Kind
---|---|---|---
5651095 | Ogden | Jul 1997 | A
5852802 | Breen | Dec 1998 | A
6094633 | Gaved | Jul 2000 | A
20050005266 | Datig | Jan 2005 | A1
20070219933 | Datig | Sep 2007 | A1
Number | Date | Country
---|---|---
62469457 | Mar 2017 | US