The following specification describes many aspects of improved TTS systems and example embodiments that illustrate some representative combinations with optional aspects. Some examples are process steps or systems of machine components for speech synthesis and its applications. These can be implemented with computers that execute software instructions stored on non-transitory computer-readable media.
The present subject matter describes improved approaches to an optimized TTS system. According to some embodiments, various computer-implemented methods and approaches, including neural network models, can be adopted to implement the present TTS system. The system can generate variable-rate frames via a speech synthesis model, in which key frames are kept and frames that carry little information are omitted. With fewer frames to generate per utterance, the system can reduce execution time and speed up speech synthesis. According to some embodiments, the TTS system can reconstruct and approximate the frames that would have been generated for the input text had no frames been skipped, via various methods, for example, linear interpolation or model inference. As such, the synthesized speech waveforms can remain intelligible and natural.
According to some embodiments, the TTS system can delegate various functions, e.g., interpolation and/or voice synthesis, to a lower-power system for execution. The lower-power system, e.g., a mobile computing device, can then locally interpolate or de-compress the generated key frames, thus reducing the bandwidth needed to deliver voice information to the mobile device. As such, the optimized TTS system can reduce processing latency and bandwidth in speech synthesis. Furthermore, it can also improve data security and privacy and increase the quality of the synthesized speech.
According to some embodiments, for the reconstruction or approximation of frames, each of the generated key frames can include an interpolation parameter. For example, the interpolation parameter can indicate the number of skipped frames between the plurality of key frames, or other interpolation information such as a variable frame rate or period, or an indicated interpolation mode. The interpolation process can be implemented before a vocoder model or directly by a vocoder model. Furthermore, according to some embodiments, the interpolation process is not needed when a neural vocoder can recognize the variable-rate key frames and generate the waveform samples directly from them.
According to some embodiments, a vocoder model can generate speech waveforms based on the reconstructed frames, which comprise both the key frames and the interpolated frames. According to some embodiments, a neural vocoder can directly generate speech waveforms based on the key frames without interpolation. According to some embodiments, the vocoder model can be a neural vocoder or a conventional signal-processing-based vocoder.
To enable the speech synthesis model to generate fewer but more information-rich frames, the model can be trained with compressed datasets. According to some embodiments, various approaches can be adopted to generate the compressed datasets, including choosing the compressed dataset that minimizes the sum of squared errors of approximation. For example, the training data pair can be <text, compressed audio recordings>. The original audio/frames of the training datasets are compressed in such a way that non-essential audio/frames are omitted.
According to some embodiments, a neural vocoder can be trained together with the speech synthesis model with the same compressed datasets so that it can directly generate waveform samples based on the key frames without the interpolation or reconstruction process.
Accordingly, the present TTS system can be efficient and responsive for generating real-time and natural speech for human-computer communications, thus enhancing the user experience of a voice-enabled interface.
A computer implementation of the present subject matter comprises a computer-implemented method of speech synthesis, which comprises: receiving a sequence of symbols; and synthesizing from the sequence of symbols, by a speech synthesis model, a plurality of key frames, wherein the key frames have a variable frame rate, and wherein a key frame comprises at least one interpolation parameter that indicates the variable frame rate.
According to some embodiments, at least one interpolation parameter can indicate, for example, one or more skipped frames between the plurality of key frames, a length of time between the key frames, or an indicated interpolation mode such as linear interpolation or a code book.
According to some embodiments, the speech synthesis model can generate the plurality of key frames based on an average key frame rate input, wherein the ratio of the number of the plurality of key frames and the one or more skipped frames is associated with the average key frame rate input.
According to some embodiments, the TTS system can interpolate one or more interpolated or skipped frames based on the at least one interpolation parameter. A vocoder model can generate speech waveforms based on the key frames and the interpolated frames. It can synthesize waveforms from low-dimensional acoustic representations, such as Bark spectrograms or Mel-spectrograms. According to some embodiments, a vocoder model can be a neural vocoder or a conventional vocoder.
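To make the overall flow concrete, the following is a minimal sketch of the pipeline described above. The model objects and their method names (synthesize_key_frames, interpolate_frames, vocode) are illustrative assumptions, not interfaces defined by this disclosure:

```python
def text_to_speech(symbols, speech_synthesis_model, interpolation_model, vocoder_model):
    # 1. Predict a reduced set of variable-rate key frames from the symbol sequence.
    key_frames = speech_synthesis_model.synthesize_key_frames(symbols)
    # 2. Reconstruct the skipped frames using each key frame's interpolation parameter.
    de_compressed_frames = interpolation_model.interpolate_frames(key_frames)
    # 3. Generate waveform samples from the key frames and the interpolated frames.
    return vocoder_model.vocode(de_compressed_frames)
```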
The present subject matter is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
The present subject matter pertains to improved approaches for a speech synthesis system with low latency and improved efficiency. By predicting fewer frames at a variable frame rate without loss of voice quality, the system can deliver synthesized speech with reduced latency and improved efficiency. Embodiments of the present subject matter are discussed below with reference to the accompanying figures.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. It will be apparent, however, to one skilled in the art that the present subject matter may be practiced without some of these specific details. In addition, the following description provides examples, and the accompanying drawings show various examples for the purposes of illustration. Moreover, these examples should not be construed in a limiting sense as they are merely intended to provide examples of embodiments of the subject matter rather than to provide an exhaustive list of all possible implementations. In other instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the details of the disclosed features of various described embodiments.
The following sections describe process steps and systems of machine components for generating synthesized speech and its applications. These can be implemented with computers that execute software instructions stored on non-transitory computer-readable media. Improved systems for optimized speech synthesis can have one or more of the features described below.
According to some embodiments, in a frame-based mechanism, speech synthesis model 116 can be a neural acoustic model configured to process the phoneme sequence to infer the acoustic frames, such as Mel-scale spectrogram or Bark-scale spectrogram. It can be trained to skip highly redundant frames and only keep key frames that have a high information content relative to one or more prior key frames. Furthermore, each key frame can comprise an interpolation parameter for the later interpolation process to estimate information between key frames.
According to some embodiments, interpolation model 118 can reconstruct or interpolate the skipped frames at least based on the interpolation parameter. The key frames and the interpolated frames are input for vocoder model 120 to generate the speech waveforms 108, which can be transmitted to a client device 101 for communication with a user.
Client device 101 can render waveforms 108 as speech. The client device 101 can be any computing device with a speaker capable of rendering the speech.
According to some embodiments, TTS system 112 can be implemented by a virtual assistant to provide a voice-enabled interface for a client device 101. The virtual assistant can be a software agent that can be integrated into different types of devices and platforms. For example, the virtual assistant can be incorporated into smart speakers. It can also be integrated into voice-enabled applications for specific companies.
Network 110 can comprise a single network or a combination of multiple networks, such as the Internet or intranets, wireless cellular networks, local area networks (LANs), wide area networks (WANs), WiFi, Bluetooth, near-field communication (NFC), etc. Network 110 can also comprise a mixture of private and public networks implemented with various technologies and standards.
For example, some or all functions related to interpolation and voice synthesis can be implemented by processors distributed throughout network 110, such as those of a user's mobile device. This edge computing can not only reduce the latency of speech synthesis but also the bandwidth use of the client device 101. In addition, it can also improve data security and privacy and increase the quality of the synthesized speech.
According to some embodiments, some functions of TTS 152, such as interpolation model 118 and vocoder model 120, can be implemented by client device 101. Accordingly, key frames 107 can be transmitted to client device 101 for voice synthesis 113. Next, interpolation model 118 can reconstruct or interpolate the skipped frames from the key frames 107 based on various reconstruction methods. Accordingly, vocoder model 120 can generate speech waveforms 108 based on the key frames 107 and the interpolated frames.
According to some embodiments, interpolation model 118 can be omitted when vocoder model 120 is a properly-trained neural model that can recognize key frames 107 and generate the complete speech waveforms. As such, vocoder model 120 can generate natural sounding speech waveforms 108 directly based on the key frames 107.
A spectrogram can be considered to be a low-dimensional acoustic representation of the audio for the input text. According to some embodiments, the spectrogram 200 can be generated by segmenting the generated audio signal into frames at a fixed interval, e.g., a 10 ms hop with an overlapping window size of 25 ms, generating a short-time Fourier transform of each windowed frame, and computing the power spectrum of each frequency range. Spectrogram 200, or the corresponding Bark spectrogram, can be the input data for some embodiments of a vocoder model.
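As one illustration of this framing, the sketch below computes a power spectrogram with a 10 ms hop and a 25 ms window using NumPy. The 16 kHz sample rate and the omission of the Mel/Bark binning step are simplifying assumptions for illustration:

```python
import numpy as np

def power_spectrogram(audio, sample_rate=16000, hop_ms=10, win_ms=25):
    """Segment audio into overlapping windowed frames, take a short-time Fourier
    transform of each frame, and compute the power spectrum per frequency bin.
    Mel- or Bark-scale binning would be applied as a separate, subsequent step."""
    hop = int(sample_rate * hop_ms / 1000)            # e.g., 160 samples at 16 kHz
    win = int(sample_rate * win_ms / 1000)            # e.g., 400 samples at 16 kHz
    window = np.hanning(win)
    frames = []
    for start in range(0, len(audio) - win + 1, hop):
        segment = audio[start:start + win] * window   # windowed frame
        spectrum = np.fft.rfft(segment)               # short-time Fourier transform
        frames.append(np.abs(spectrum) ** 2)          # power per frequency bin
    return np.stack(frames)                           # shape: (num_frames, win // 2 + 1)
```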
According to some embodiments, input text 302 can be pre-processed to generate a phoneme sequence, i.e., a sequence of symbols, for input text 302. Furthermore, speech synthesis model 304 can comprise a symbol-rate network 306, and a frame-rate network 308, with location-sensitive attention in between. Both symbol-rate network 306 and frame-rate network 308 can be autoregressive recurrent neural networks. Symbol-rate network 306 can convert the phoneme sequence into a hidden feature representation in a character embedding process, which can be processed by several convolutional layers. The output of the convolutional layers can be further fed into a bi-directional Long Short-Term Memory (LSTM) layer to generate the encoded features. Such encoded features can be the input for a location-sensitive attention layer that generates attention probabilities and location features for the encoded input sequence.
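A minimal sketch of such a symbol-rate network is shown below in PyTorch. The layer counts and dimensions are illustrative assumptions rather than values specified by this disclosure, and the location-sensitive attention layer is omitted for brevity:

```python
import torch
import torch.nn as nn

class SymbolRateEncoder(nn.Module):
    """Sketch of a symbol-rate network: character embedding, convolutional layers,
    and a bi-directional LSTM that produces the encoded features."""
    def __init__(self, num_symbols, embed_dim=512, conv_layers=3, lstm_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(num_symbols, embed_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(embed_dim, embed_dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(embed_dim),
                nn.ReLU(),
            )
            for _ in range(conv_layers)
        ])
        self.lstm = nn.LSTM(embed_dim, lstm_dim, batch_first=True, bidirectional=True)

    def forward(self, symbol_ids):              # (batch, num_symbols)
        x = self.embedding(symbol_ids)          # (batch, time, embed_dim)
        x = x.transpose(1, 2)                   # Conv1d expects (batch, channels, time)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)
        encoded, _ = self.lstm(x)               # (batch, time, 2 * lstm_dim)
        return encoded
```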
Frame-rate network 308 can predict an acoustic frame from the encoded input sequence one frame at a time. According to some embodiments, frame-rate network 308 can feed the encoded input sequence through, respectively, a pre-net with two fully connected layers, the LSTM layers, a linear projection, and a multi-layer convolutional post-net, to generate the acoustic frames.
After the spectrogram frame prediction, the generated acoustic frames 330 are input for a voice synthesizer, such as a vocoder model 352, for generating speech waveforms 360. It can comprise a frame-rate network 354 and a sample-rate network 356.
According to some embodiments, each key frame can comprise an interpolation parameter, which can be used later to interpolate the omitted frames by an interpolation model 432. According to some embodiments, the interpolation parameter can indicate the number of the omitted or skipped frames between two consecutive key frames. According to some embodiments, the interpolation parameter can indicate the variable frame rate of key frames 430. According to some embodiments, the interpolation parameter can indicate a length of time between any two consecutive key frames.
Furthermore, according to some embodiments, the interpolation parameter can indicate a preferred interpolation mode or method. Examples of the interpolation mode can be linear interpolation, parabolic interpolation, the nearest neighbor method, a code book, etc. For example, linear interpolation can apply a distinct linear polynomial between each pair of data points to approximate the curve between them.
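The disclosure does not fix a particular encoding for key frames, but a minimal sketch of one possible in-memory representation carrying the interpolation parameters discussed above could look like the following; the field names and mode list are illustrative assumptions:

```python
from dataclasses import dataclass
from enum import Enum
import numpy as np

class InterpolationMode(Enum):
    LINEAR = 0
    PARABOLIC = 1
    NEAREST_NEIGHBOR = 2
    CODE_BOOK = 3

@dataclass
class KeyFrame:
    """Hypothetical key-frame record; not a wire format defined by this disclosure."""
    spectrum: np.ndarray                                 # Bark- or Mel-scale parameters
    skipped_frames: int                                  # omitted frames until the next key frame
    mode: InterpolationMode = InterpolationMode.LINEAR   # preferred interpolation mode
```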
According to some embodiments, speech synthesis model 404 can generate the key frames based on an average key frame rate input, and the ratio of the number of key frames to the number of omitted frames is associated with the average key frame rate input. For example, the average key frame rate can indicate the ratio of the key frames to the omitted frames. The probability of generating a key frame at a given time step can depend, in part, on the amount of recent bandwidth used. This allows adaptive rate control. It also allows dynamic selection of a quality-versus-bandwidth or quality-versus-processing-performance trade-off.
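One simple way such adaptive rate control could be realized, assuming a hypothetical rule that relates recent bandwidth use to the key-frame probability, is sketched below; a higher result makes the model more likely to emit a key frame at the current time step:

```python
def key_frame_probability(base_probability, recent_bandwidth, bandwidth_budget):
    """Illustrative rate-control rule (an assumption, not defined by this disclosure):
    when recent bandwidth use exceeds the budget, key frames become less likely and
    more frames are skipped; when use is low, more key frames are generated to favor
    voice quality."""
    utilization = recent_bandwidth / max(bandwidth_budget, 1e-9)
    return min(1.0, base_probability / max(utilization, 1e-3))
```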
According to some embodiments, upon receiving key frames 430 with the respective interpolation parameters, interpolation model 432 can interpolate the omitted frames via one or more interpolation modes. As a result, interpolation model 432 can reconstruct a number of interpolated frames that are presumably similar to what the omitted frames would have been if generated by the speech synthesis model 404. According to some embodiments, the interpolated frames can be stitched together, in their respective order, with key frames 430 to form de-compressed frames 434.
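A minimal sketch of this de-compression step using linear interpolation is shown below; it assumes key frames shaped like the hypothetical KeyFrame record sketched earlier (a spectrum plus a count of skipped frames):

```python
import numpy as np

def decompress_frames(key_frames):
    """Insert `skipped_frames` evenly spaced, linearly interpolated frames between each
    pair of consecutive key frames, then stitch key frames and interpolated frames back
    into a constant-rate frame sequence."""
    output = []
    for current, following in zip(key_frames, key_frames[1:]):
        output.append(current.spectrum)
        gap = current.skipped_frames
        for i in range(1, gap + 1):
            t = i / (gap + 1)                                          # fractional position
            output.append((1 - t) * current.spectrum + t * following.spectrum)
    output.append(key_frames[-1].spectrum)                             # final key frame
    return np.stack(output)
```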
According to some embodiments, de-compressed frames 434 can be input for a vocoder model 452 to generate speech waveforms 460. Vocoder model 452 can generate speech waveforms based on the key frames and the interpolated frames. It can synthesize waveforms from low-dimensional acoustic representations, such as Bark spectrograms or Mel-spectrograms. According to some embodiments, a vocoder model can be a neural vocoder or a conventional vocoder. According to some embodiments, vocoder model 452 can be an autoregressive model such as LPCNet, WaveNet, or WaveRNN, or a flow-based model such as WaveGlow. According to some embodiments, vocoder model 452 can be a generative adversarial network (GAN) model such as MelGAN. According to some embodiments, vocoder model 452 can be a diffusion probabilistic model such as WaveGrad or DiffWave. According to some embodiments, the vocoder model can be a signal-processing-based vocoder.
According to some embodiments, vocoder model 452 can be an autoregressive model configured to predict the probability of each waveform sample based on previous waveform samples. It can comprise a frame-rate network 454 and a sample-rate network 456, both of which can be autoregressive RNN models. Due to the interpolation of the skipped frames, speech waveforms 460 can share a similar sound quality as could be achieved with a conventional speech synthesis model that generates all frames.
To enable speech synthesis model 404 to predict key frames 430 with variable frame rates, speech synthesis model 404 can be trained with selected training datasets. For example, the training data pair can be <text, compressed audio recordings>. The original audio/frames of the training datasets can be compressed so that the non-essential audio/frames are omitted.
The original datasets can comprise a number of audio clips of one or more speakers. For example, the LJ Speech dataset comprises short audio recordings of a single speaker along with the transcriptions, whereas the LibriTTS dataset comprises many hours of multi-speaker English audio clips. In addition to English datasets, datasets in other languages, such as Chinese, Japanese, Korean, German, French, and Italian, can also be utilized for training a TTS for a specific market or application. According to some embodiments, either the raw waveform or pre-processed waveforms, e.g., after compression, can be used as input for the training process.
Different approaches or methods can be adopted to generate compressed datasets that retain high fidelity while significantly reducing the number of frames, e.g., to 50% or fewer of the original frames. According to some embodiments, the original datasets, e.g., <text, audio recordings>, can be time-warped at a predetermined omission/compression ratio. For example, the omitted frames can be every other frame, or two of every five frames. According to some embodiments, the omitted frames can be redundant frames that contain substantially similar data values to the "neighboring" frames that are kept in the compressed datasets.
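As an illustration of omitting redundant frames, the sketch below keeps a frame as a key frame only when it differs sufficiently from the last kept frame; the mean-squared-difference measure and the threshold value are assumptions for illustration, not criteria defined by this disclosure:

```python
import numpy as np

def compress_frames(frames, threshold=0.05):
    """Keep a frame only when it differs enough from the previously kept frame; return
    the kept key frames and, for each one, the number of frames omitted after it.
    `frames` is an array of shape (num_frames, num_bins)."""
    kept_indices = [0]                                   # always keep the first frame
    for i in range(1, len(frames)):
        last_kept = frames[kept_indices[-1]]
        if np.mean((frames[i] - last_kept) ** 2) > threshold:
            kept_indices.append(i)
    key_frames = frames[kept_indices]
    skipped_after = np.diff(kept_indices + [len(frames)]) - 1
    return key_frames, skipped_after
```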
Furthermore, various cost/loss functions can be implemented to select the compression approach with the least data loss between the original datasets and the compressed datasets. For example, the cost function can be a mixture of logistics or a normal loss. The system can implement different versions or scenarios of the compressed datasets using each of the possible configurations and select the one with the best performance or least loss. For example, a number of potential cost functions can be evaluated respectively for all the possible sets of frames to omit or key frames to keep. As a result, the set of key frames or compressed audio recordings that renders the least loss can be selected as the compressed training dataset, e.g., <text, compressed audio recordings>.
For example, an exemplary cost function can be based on the sum of squared errors between the Bark parameters linearly interpolated between key frames and those of the omitted original frames that the interpolated parameters replace, as shown below:
Cost = SUM[over omitted frames i] ( SUM[over Bark bins k] ( (interp_value[i, k] − orig_value[i, k])^2 ) )
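The sketch below evaluates this cost for one candidate set of key frames; it assumes the original Bark parameters are available as an array of shape (num_frames, num_bins). Evaluating the cost for each candidate set and keeping the set with the lowest value implements the selection described above:

```python
import numpy as np

def interpolation_cost(original_frames, key_frame_indices):
    """Sum of squared errors, over every omitted frame i and Bark bin k, between the
    linearly interpolated Bark parameters and the original parameters they replace."""
    cost = 0.0
    for start, end in zip(key_frame_indices, key_frame_indices[1:]):
        for i in range(start + 1, end):                   # omitted frames i
            t = (i - start) / (end - start)
            interp_value = (1 - t) * original_frames[start] + t * original_frames[end]
            cost += np.sum((interp_value - original_frames[i]) ** 2)   # sum over Bark bins k
    return cost
```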
According to some embodiments, in addition to the linear interpolation method, other interpolation methods, such as parabolic interpolation, the nearest neighbor method, or a code book, can also be adopted. Furthermore, various training algorithms can be used, such as gradient descent or adaptive moment estimation (e.g., Adam).
According to some embodiments, each key frame can comprise an interpolation parameter, which can be used to interpolate intermediate frames by an interpolation model 432. According to some embodiments, the interpolation parameter can indicate the number of the omitted or skipped frames between two consecutive key frames. According to some embodiments, the interpolation parameter can indicate the variable frame rate of key frames 430. According to some embodiments, the interpolation parameter can indicate a length of time between any two consecutive key frames.
Furthermore, according to some embodiments, the interpolation parameter can indicate a preferred interpolation mode or method. Examples of the interpolation mode can be linear interpolation, parabolic interpolation, nearest neighbor method, code book, etc.
Next, key frames 430 can be input to vocoder model 453 for interpolation and waveform generation. As shown in this Figure, interpolation model 433 associated with vocoder model 453 can interpolate the omitted frames via one or more interpolation modes. As a result, interpolation model 433 can reconstruct a number of interpolated frames that approximate the omitted frames. According to some embodiments, the interpolated frames can be stitched together with key frames 430 to form de-compressed frames 435.
According to some embodiments, de-compressed frames 435 can be input to frame-rate network 455 and sample-rate network 457 for generating speech waveforms 461. Based on the reconstruction of the skipped frames, speech waveforms 461 can have comparable or the same sound quality as the original un-skipped speech waveforms.
According to some embodiments, speech synthesis model 504 can generate the key frames based on an average key frame rate input, and the ratio of the number of key frames and the interpolated frames is associated with the average key frame rate input. For example, the average key frame rate can indicate the ratio of the key frames and the interpolated frames.
According to some embodiments, upon receiving key frames 530 with the respective interpolation parameters, vocoder model 552 can directly recognize and generate speech waveforms 560. According to some embodiments, vocoder model 552 can comprise frame-rate network 554 and sample-rate network 556, both of which can be autoregressive RNN models. As vocoder model 552 can have been trained with datasets that enable it to correlate key frames 530 with speech waveforms 560, it can generate speech waveforms 560 that share a comparable or equal sound quality with the original un-skipped speech waveforms. According to some embodiments, vocoder model 552 can be trained together with speech synthesis model 504 with the same training datasets and configuration.
Furthermore, different approaches or methods can be utilized to generate compressed datasets that retain high fidelity while significantly reducing the number of frames, e.g., to 50% or fewer of the original frames. According to some embodiments, the original datasets, e.g., input text and its corresponding waveforms or audio recordings, can be time-warped at a predetermined omission/compression ratio. For example, the omitted frames can be every other frame, or two of every five frames. According to some embodiments, the omitted frames can be redundant frames that contain substantially similar data values to the "neighboring" frames that are kept in the compressed datasets.
Furthermore, various cost/loss functions can be implemented to select the compression approach with the least data loss between the original datasets and the compressed datasets. Examples of such cost functions include a mixture of logistics or a normal loss. The system can implement different versions or scenarios of the compressed datasets using each of the possible configurations and select the one with the best performance, or least loss. For example, a number of potential cost functions can be evaluated respectively for all the possible sets of frames to omit or key frames to keep. As a result, a specific set of key frames that renders the least loss can be selected as the compressed training dataset.
For example, an exemplary cost function can be based on the sum of squared errors between the Bark parameters linearly interpolated between key frames and those of the omitted original frames that the interpolated parameters replace, as shown below:
Cost = SUM[over omitted frames i] ( SUM[over Bark bins k] ( (interp_value[i, k] − orig_value[i, k])^2 ) )
According to some embodiments, in addition to the linear interpolation method, other interpolation methods, such as parabolic interpolation, nearest neighbor method, code book, can also be adopted.
According to some embodiments, a key frame generated by the system can further comprise an additional interpolation parameter to facilitate the later interpolation/de-compression process.
According to some embodiments, interpolation parameter 706 can indicate a length of time between the key frames, in contrast to simply indicating a number of frames to interpolate at a constant frame rate. In addition, interpolation parameter 706 can comprise an indicated interpolation mode for the later interpolation/de-compression process. Examples of the interpolation mode can be linear interpolation, parabolic interpolation, the nearest neighbor method, a code book, etc. For example, linear interpolation can apply a distinct linear polynomial between each pair of data points to approximate the curve between them.
A mode-based encoding, not shown in a drawing, is one that has a frame parameter that indicates an interpolation mode. A vocoder or interpolation model capable of decoding such frames can use both the number of omitted frames and the encoded interpolation mode to interpolate frames. This can provide even better-sounding interpolation than a single interpolation method built into the vocoder or interpolation model that is based solely on the number of frames skipped and spectrogram information.
It is also possible to use an encoding in which a frame-skipping parameter indicates the number of frames that were skipped after the prior key frame. That is in contrast to an encoding in which the number of frames to skip is the number until the next key frame. In the former approach, interpolation can begin at the vocoder or interpolation model as soon as it receives an encoded key frame, without waiting for the next frame of data.
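A minimal sketch of decoding under this "skip after the prior key frame" encoding is shown below. The stream format (pairs of frame data and skip count) and the interpolate helper are assumptions for illustration:

```python
def stream_decode(key_frame_stream, interpolate):
    """Decode a stream of (frame, skipped_before) pairs, where skipped_before is the
    number of frames omitted between the prior key frame and this one, so each gap can
    be filled as soon as the key frame that closes it arrives. `interpolate(a, b, n)`
    is a placeholder returning n frames between frames a and b."""
    previous = None
    for frame, skipped_before in key_frame_stream:
        if previous is not None:
            yield previous                                # emit the prior key frame
            for interpolated in interpolate(previous, frame, skipped_before):
                yield interpolated                        # fill the announced gap
        previous = frame
    if previous is not None:
        yield previous                                    # emit the final key frame
```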
According to some embodiments, during an interpolation process, an interpolation model can reconstruct the omitted frames via various interpolation approaches. For example, linear interpolation can reconstruct the first interpolated frame 810 and the second interpolated frame 811 to produce the reconstructed data curve 808. The respective length of the first and second interpolated frames can be a constant inter-frame period. The respective values of the first and second interpolated frames can determine the reconstructed data curve 808. According to some embodiments, Nth key frame 802, first interpolated frame 810, second interpolated frame 811, and (N+1)th key frame 804 can be input for a vocoder model for generating sample waveforms of the input text.
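As a small worked example of this reconstruction, assuming illustrative three-bin Bark values for the Nth and (N+1)th key frames with two frames omitted between them:

```python
import numpy as np

nth_key_frame = np.array([2.0, 4.0, 6.0])    # illustrative Bark values for key frame 802
next_key_frame = np.array([5.0, 1.0, 6.0])   # illustrative Bark values for key frame 804

# Two omitted frames are reconstructed at 1/3 and 2/3 of the way along the segment.
first_interpolated = nth_key_frame + (1 / 3) * (next_key_frame - nth_key_frame)   # -> [3., 3., 6.]
second_interpolated = nth_key_frame + (2 / 3) * (next_key_frame - nth_key_frame)  # -> [4., 2., 6.]
```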
According to some embodiments, a vocoder model can handle the interpolation process by reconstructing the omitted frames via various interpolation approaches. For example, linear interpolation can reconstruct the first interpolated frame 810 and the second interpolated frame 811 to create the reconstructed data curve 808. According to some embodiments, the vocoder model can generate sample waveforms based on Nth key frame 802, first interpolated frame 810, second interpolated frame 811, and (N+1)th key frame 804.
According to some embodiments, the interpolation parameter can indicate the number of the skipped frames between two consecutive key frames, e.g., the interpolation parameter is 2, indicating there are two skipped frames. According to some embodiments, the interpolation parameter can indicate a length of time between the key frames. According to some embodiments, the interpolation parameter can comprise an indicated interpolation mode, e.g., linear interpolation, parabolic interpolation, code book, etc.
According to some embodiments, the speech synthesis model can generate the plurality of key frames based on an average encoded key frame rate input, and the ratio of the number of the plurality of key frames to the one or more skipped frames is associated with the average key frame rate input. This is a form of dynamic rate control that enables an adjustable trade-off: fewer key frames and more interpolation reduce the processor performance and power cost of frame generation, while more key frames and less interpolation improve voice quality.
According to some embodiments, after generating the key frames, an interpolation model can interpolate one or more interpolated frames based on the interpolation parameter. Accordingly, a vocoder model can generate speech waveforms corresponding to the sequence of symbols based on the key frames and the interpolated frames.
According to some embodiments, after generating the key frames, a vocoder model can interpolate one or more interpolated frames based on the interpolation parameter. Next, the vocoder model can generate speech waveforms corresponding to the sequence of symbols based on the key frames and the interpolated frames.
According to some embodiments, the interpolation parameter can indicate the number of the skipped frames between two consecutive key frames, e.g., the interpolation parameter is 2, indicating there are two skipped frames. According to some embodiments, the interpolation parameter can indicate a length of time between the key frames. According to some embodiments, the interpolation parameter can comprise an indicated interpolation mode, e.g., linear interpolation, parabolic interpolation, code book, etc.
According to some embodiments, the speech synthesis model can generate the plurality of key frames based on an average key frame rate input, and the ratio of the number of the plurality of key frames and the one or more skipped frames is associated with the average key frame rate input.
At step 1206, after generating the key frames, an interpolation model can interpolate one or more interpolated frames based on the interpolation parameter. At step 1208, a vocoder model can generate speech waveforms corresponding to the sequence of symbols based on the key frames and the interpolated frames.
According to some embodiments, after generating the key frames, a vocoder model can interpolate one or more interpolated frames based on the interpolation parameter. Next, the vocoder model can generate speech waveforms corresponding to the sequence of symbols based on the key frames and the interpolated frames.
Examples shown and described use certain spoken languages. Various embodiments work similarly for other languages or combinations of languages. Some systems are screenless, such as an earpiece, which has no display screen. Some systems are stationary, such as a vending machine. Some systems are mobile, such as an automobile. Some systems are portable, such as a mobile phone. Some systems are for implanting in a human body. Some systems comprise manual interfaces such as keyboards or touchscreens.
Some systems function by running software on general-purpose programmable processors (CPUs) such as ones with ARM or x86 architectures. Some power-sensitive systems and some systems that require especially high performance, such as ones for neural network algorithms, use hardware optimizations. Some systems use dedicated hardware blocks burned into field-programmable gate arrays (FPGAs). Some systems use arrays of graphics processing units (GPUs). Some systems use application-specific integrated circuits (ASICs) with customized logic to give higher performance.
Some physical machines described and claimed herein are programmable in many variables, combinations of which provide essentially an infinite variety of operating behaviors. Some systems herein are configured by software tools that offer many parameters, combinations of which support essentially an infinite variety of machine embodiments.
Hardware blocks, custom processor instructions, co-processors, and hardware accelerators perform neural network processing or parts of neural network processing algorithms with especially high performance and power efficiency. This enables extended battery life for battery-powered devices and reduces heat removal costs in data centers that serve many client devices simultaneously.
In addition, the foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the embodiments of the invention. However, it will be apparent to one skilled in the art that the specific details are not required in order to practice the embodiments of the invention.
It is to be understood that even though numerous characteristics and advantages of various embodiments of the present invention have been set forth in the foregoing description, together with details of the structure and function of various embodiments of the invention, this disclosure is illustrative only. In some cases, certain subassemblies are only described in detail with one such embodiment. Nevertheless, it is recognized and intended that such subassemblies may be used in other embodiments of the invention. Practitioners skilled in the art will recognize many modifications and variations. Changes may be made in detail, especially matters of structure and management of parts within the principles of the embodiments of the present invention to the full extent indicated by the broad general meaning of the terms in which the appended claims are expressed.
Having disclosed exemplary embodiments and the best mode, modifications and variations may be made to the disclosed embodiments while remaining within the scope of the embodiments of the invention as defined by the following claims.