This invention relates to speech synthesis, and more particularly to mitigation of amplitude quantization or other artifacts in synthesized speech signals.
One recent approach to computer-implemented speech synthesis makes use of a neural network to process a series of phonetic labels derived from text to produce a corresponding series of waveform sample values. In some such approaches, the waveform sample values are quantized, for example, to 256 levels of a μ-law non-uniform division of amplitude.
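Purely for illustration, the following Python/NumPy sketch shows one conventional way to quantize waveform samples to 256 μ-law levels and to invert the mapping; the constants and function names are illustrative assumptions and not a specific component described herein.

```python
import numpy as np

MU = 255  # 256 quantization levels

def mu_law_encode(x):
    """Map samples in [-1, 1] to integer codes 0..255 via mu-law companding."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((compressed + 1.0) / 2.0 * MU).astype(np.int64)

def mu_law_decode(codes):
    """Invert the companding: integer codes 0..255 back to samples in [-1, 1]."""
    compressed = 2.0 * (codes.astype(np.float64) / MU) - 1.0
    return np.sign(compressed) * ((1.0 + MU) ** np.abs(compressed) - 1.0) / MU
```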
One or more approaches described below address the technical problem of automated speech synthesis, such as conversion of English text to samples of a waveform that represents a natural-sounding voice speaking the text. In particular, the approaches address improvement of the naturalness of the speech represented in the output waveform, for example, under a constraint of limited computation resources (e.g., processor instructions per second, process memory size) or limited reference data used to configure a speech synthesis system (e.g., total duration of reference waveform data). Very generally, a common aspect of a number of these approaches is that there is a two-stage process of generation of an output waveform y(t), which may be a sampled signal at a sampling rate of 16,000 samples per second, with each sample being represented as a signed 12-bit or 16-bit integer value (i.e., quantization into 2^12 or 2^16 levels). In the discussion below, a “waveform” should be understood to include a time-sampled signal, which can be considered to be or can be represented as a time series of amplitude values (also referred to as samples, or sample values). Other sampling rates and numbers of quantization levels may be used, preferably selected such that the sampling rate and/or the number of quantization levels do not contribute to un-naturalness of the speech represented in the output waveform. The first stage of generation of the waveform involves generation of an intermediate waveform x(t), which is generally represented with fewer quantization levels (e.g., resulting in greater quantization noise) and/or a lower sampling rate (e.g., resulting in smaller audio bandwidth) than the ultimate output y(t) of the synthesis system. The second stage then transforms the intermediate waveform x(t) to produce y(t). In general, y(t) provides improved synthesis as compared to x(t) in one or more characteristics (e.g., types of degradation) such as perceptual quality (e.g., mean opinion score, MOS), a signal-to-noise ratio, a noise level, degree of quantization, a distortion level, and a bandwidth. While the generation of the intermediate waveform, x(t), is directly controlled by the text that is to be synthesized, the transformation from x(t) to y(t) does not, in general, require direct access to the text to be synthesized.
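The two-stage structure just described can be summarized by the following schematic Python sketch, in which first_stage and second_stage are hypothetical placeholders for the text-driven synthesizer and the waveform enhancer, respectively:

```python
def synthesize_two_stage(phonetic_labels, first_stage, second_stage):
    """Two-stage generation: a coarsely quantized and/or lower-rate intermediate
    waveform x(t) is produced from the text-derived labels, then transformed
    into the higher-quality output waveform y(t)."""
    x = first_stage(phonetic_labels)  # e.g., 8-bit mu-law samples, possibly at a lower rate
    y = second_stage(x)               # e.g., 16-bit samples; no access to the labels is required
    return y
```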
Referring to
In the system illustrated in
In the system 100 of
Although the enhancer 120 is applicable to a variety of synthesizer types, the synthesizer 140 shown in
The synthesis network 142 includes a parameterized non-linear transformer (i.e., a component implementing a non-linear transformation) that processes a series of past values of the synthesizer output, x(t−1), . . . , x(t−T), internally generated by passing the output through a series of delay elements 146, denoted herein as x(t−1), as well as the set of control values h(t) 148 for the time t, and produces the amplitude distribution p(t) 143 for that time. In one example of a synthesis network 142, a multiple layer artificial neural network (also equivalently referred to as “neural network”, ANN, or NN below) is used in which the past synthesizer values are processed as a causal convolutional neural network, and the control value is provided to each layer of the neural network.
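As a hedged illustration of the sample-by-sample operation just described (not an implementation of the components 142, 144, or 146 themselves), the following Python/NumPy sketch generates a waveform autoregressively from a hypothetical synthesis_network function, using random sampling as one possible distribution-to-value conversion:

```python
import numpy as np

def generate_waveform(synthesis_network, control_values, T, rng=None):
    """Autoregressive generation: each sample is drawn from the amplitude
    distribution p(t) computed from the T most recent samples and the control
    values h(t) for time t."""
    rng = rng or np.random.default_rng()
    history = np.zeros(T)                 # x(t-1), ..., x(t-T), initially silence
    samples = []
    for h_t in control_values:            # one control vector per output sample time
        p_t = synthesis_network(history, h_t)   # distribution over quantized amplitude levels
        level = rng.choice(len(p_t), p=p_t)     # sampling as a distribution-to-value conversion
        samples.append(level)                   # the level index stands in for the sample value here
        history = np.concatenate(([level], history[:-1]))  # shift the delay line
    return np.array(samples)
```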
In some examples of the multiple-layer synthesis neural network, an output vector of values y from the kth layer of the network depends on the input x from the previous layer (or the vector of past sample values for the first layer), and the vector of control values h as follows:
y = tanh( W_{k,f} * x + V_{k,f}^T h ) ⊙ σ( W_{k,g} * x + V_{k,g}^T h )
where W_{k,f}, W_{k,g}, V_{k,f}, and V_{k,g} are matrices that hold the parameters (weights) for the kth layer of the network, σ( ) is a nonlinearity, such as a rectifier non-linearity or a sigmoidal non-linearity, and the operator ⊙ represents an elementwise multiplication. The parameters of the synthesis network are stored (e.g., in a non-volatile memory) for use by the multiple-layer neural network structure of the network, and impart the synthesis functionality on the network.
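The following PyTorch sketch shows one way such a gated, conditioned layer could be written; the tensor shapes, kernel size, and dilation are illustrative assumptions rather than parameters of the embodiment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConditionedLayer(nn.Module):
    """One layer of the gated form above: causal convolutions over the signal path,
    linear projections of the control values h, and an elementwise tanh/sigmoid gate."""
    def __init__(self, channels, cond_dim, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-only padding keeps the layer causal
        self.conv_f = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv_g = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.cond_f = nn.Linear(cond_dim, channels)
        self.cond_g = nn.Linear(cond_dim, channels)

    def forward(self, x, h):
        # x: (batch, channels, time); h: (batch, time, cond_dim)
        x_pad = F.pad(x, (self.pad, 0))
        f = self.conv_f(x_pad) + self.cond_f(h).transpose(1, 2)
        g = self.conv_g(x_pad) + self.cond_g(h).transpose(1, 2)
        return torch.tanh(f) * torch.sigmoid(g)
```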
As introduced above, the enhancer 120 accepts successive waveform samples x(t) and outputs corresponding enhanced waveform samples y(t). The enhancer includes an enhancement network 122, which includes a parameterized non-linear transformer that processes a history of inputs x(t)=(x(t), x(t−1), . . . , x(t−T)), which are internally generated using a series of delay elements 124, to yield the output y(t) 125.
In one embodiment, with the sampling rate for x(t) and y(t) being the same, the enhancer 120 has the same internal structure as the synthesis network 142, except that there is no control input h(t) and the output is a single real-valued quantity (i.e., there is a single output neural network unit), rather than there being one output per quantization level as with the synthesis network 142. That is, the enhancement network forms a causal (or alternatively, non-causal with look-ahead) convolutional neural network. If the sampling rate of y(t) is higher than that of x(t), then additional inputs may be formed by repeating or interpolating samples of x(t) to yield a matched sampling rate. The parameters of the enhancer are stored (e.g., in a non-volatile memory) for use by the multiple-layer neural network structure of the network, and impart the enhancement functionality on the network.
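For the sampling-rate matching mentioned above, a small NumPy sketch of forming the higher-rate input either by repeating samples or by linear interpolation follows (the integer factor and the choice of method are arbitrary assumptions):

```python
import numpy as np

def match_rate(x, factor, method="interpolate"):
    """Upsample the intermediate waveform x by an integer factor so that it matches
    the enhancer's higher output sampling rate."""
    if method == "repeat":
        return np.repeat(x, factor)               # hold each sample 'factor' times
    t_in = np.arange(len(x))
    t_out = np.arange(len(x) * factor) / factor   # fractional positions in input time
    return np.interp(t_out, t_in, x)              # linear interpolation between samples
```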
The enhancement network 122 and synthesis network 142 have optional inputs, shown in dashed lines in
Referring to
Referring to
In yet another training approach, the parameters of the enhancer 120 and the synthesizer 140 are trained together. For example, the synthesizer 140 and the enhancer 120 are individually trained using an approach described above. As with the approach for training the enhancer 120 illustrated in
In yet another training approach, a “Generative Adversarial Network” (GAN) is used. In this approach, the enhancement network 122 is trained such that resulting output waveforms (i.e., sequences of output samples y(t)) are indistinguishable from true waveforms. In general terms, a GAN approach makes use of a “generator” G(z), which processes a random value z from a predetermined distribution p(z) (e.g., a Normal distribution) and outputs a random value y. For example, G is a neural network. The generator G is parameterized by parameters θ(G), and therefore the parameters induce a distribution p(y). Very generally, training of G (i.e., determining the parameter values θ(G)) is such that p(y) should be indistinguishable from a distribution observed in a reference (training) set. To achieve this criterion, a “discriminator” D(y) is used, which outputs a single value d in the range [0,1] indicating the probability that the input y is an element of the reference set rather than an element randomly generated by G. To the extent that the discriminator cannot tell the difference (e.g., the output d is like flipping a coin), the generator G has achieved the goal of matching the generated distribution p(y) to the reference data. In this approach, the discriminator D(y) is also parameterized with parameters θ(D), and the parameters are chosen to do as good a job as possible in the task of discrimination. There are therefore competing (i.e., “adversarial”) goals: θ(D) values are chosen to make discrimination as good as possible, while θ(G) values are chosen to make it as hard as possible for the discriminator to discriminate. Formally, these competing goals may be expressed using an objective function of the form
V(θ(D), θ(G)) = E_x[ log D(x) ] + E_z[ log( 1 − D(G(z)) ) ]
where the averages are over the reference data (x) and over a random sampling of the known distribution data (z). Specifically, the parameters are chosen according to the criterion
min_{θ(G)} max_{θ(D)} V(θ(D), θ(G))
In the case of neural networks, this criterion may be achieved using a gradient descent procedure, essentially implemented as Back Propagation.
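Purely as an illustrative sketch of such alternating gradient updates (written with PyTorch and the common binary-cross-entropy formulation, with hypothetical generator, discriminator, optimizer, and batch objects, and not the training procedure of any particular embodiment):

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, opt_g, opt_d, real_batch, z_batch):
    """One adversarial step: the discriminator is pushed to separate reference data
    from generated data, and the generator is pushed to make that separation fail.
    The discriminator is assumed to output probabilities of shape (batch, 1)."""
    # Discriminator update: increase log D(x) + log(1 - D(G(z)))
    fake = generator(z_batch).detach()
    d_loss = (F.binary_cross_entropy(discriminator(real_batch), torch.ones(len(real_batch), 1))
              + F.binary_cross_entropy(discriminator(fake), torch.zeros(len(fake), 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: make the discriminator label generated samples as real
    fake = generator(z_batch)
    g_loss = F.binary_cross_entropy(discriminator(fake), torch.ones(len(fake), 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```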
Referring to
Turning to the specific use of the GAN approach to determine the values of the parameters of the enhancement network 122, the role of the generator G is served by the combination of the synthesizer 140 and the enhancer 120, as shown in
The discriminator D(y|h) can have a variety of forms, for example, being a recurrent neural network that accepts the sequences y(t) and h(t) and ultimately, at the end of the sequence, provides the single scalar output d indicating whether the sequence y(t) (i.e., the enhanced synthesized waveform) is a reference waveform or a synthesized waveform corresponding to the control sequence h(t). The neural network of the discriminator D has parameters θ(D). Consistent with the general GAN training approach introduced above, the determination of the parameter values is performed over mini-batches of reference and synthesized utterances.
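A sketch of one possible recurrent discriminator of this kind is shown below in PyTorch; the use of a GRU and the hidden size are assumptions, as the description only requires a recurrent network that consumes y(t) and h(t) and emits a single score d:

```python
import torch
import torch.nn as nn

class SequenceDiscriminator(nn.Module):
    """Recurrent discriminator D(y|h): consumes the waveform samples y(t) together
    with the control values h(t) and emits one probability d for the whole sequence."""
    def __init__(self, cond_dim, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(input_size=1 + cond_dim, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, y, h):
        # y: (batch, time); h: (batch, time, cond_dim)
        inputs = torch.cat([y.unsqueeze(-1), h], dim=-1)
        _, last = self.rnn(inputs)                # final hidden state summarizes the sequence
        return torch.sigmoid(self.out(last[-1]))  # scalar d in [0, 1] per sequence
```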
Alternative embodiments may differ somewhat from the embodiments described above without deviating from the general approach. For example, the output of the synthesis network 142 may be fed directly to the enhancer 120 without passing through a distribution-to-value converter 144. As another example, rather than passing delayed values of x(t) to the synthesis network 142, delayed values of y(t) may be used during training as well as during runtime speech synthesis. In some embodiments, the enhancer 120 also makes use of the control values h(t), or some reduced form of the control values, in addition to the output from the synthesizer 140. Although convolutional neural networks are used in the synthesis network 142 and enhancement network 122 described above, other neural network structures (e.g., recurrent neural networks) may be used. Furthermore, it should be appreciated that neural networks are only one example of a parameterized non-linear transformer, and that other transformers (e.g., kernel-based approaches, parametric statistical approaches) may be used without departing from the general approach.
Referring to
Referring to
In
Returning to the processing of an input utterance by the user, there are several stages of processing that ultimately yield a trigger detection, which in turn causes the device 510 to pass audio data to the server 590. The microphones 521 provide analog electrical signals that represent the acoustic signals acquired by the microphones. These electrical signals are time sampled and digitized (e.g., at a sampling rate of 20 kHz and 16 bits per sample) by analog-to-digital converters 522 (which may include associated amplifiers, filters, and the like used to process the analog electrical signals). As introduced above, the device 510 may also provide audio output, which is presented via a speaker 524. The analog electrical signal that drives the speaker is provided by a digital-to-analog converter 523, which receives as input time sampled digitized representations of the acoustic signal to be presented to the user. In general, acoustic coupling in the environment between the speaker 524 and the microphones 521 causes some of the output signal to feed back into the system in the audio input signals.
An acoustic front end (AFE) 530 receives the digitized audio input signals and the digitized audio output signal, and outputs an enhanced digitized audio input signal (i.e., a time sampled waveform). An embodiment of the acoustic front end 530 may include multiple acoustic echo cancellers, one for each microphone, which track the characteristics of the acoustic coupling between the speaker 524 and each microphone 521 and effectively subtract components of the audio signals from the microphones that originate from the audio output signal. The acoustic front end 530 also includes a directional beamformer that targets a user by providing increased sensitivity to signals that originate from the user's direction as compared to other directions. One impact of such beamforming is reduction of the level of interfering signals that originate in other directions (e.g., measured as an increase in signal-to-noise ratio (SNR)).
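For illustration only, a minimal normalized-LMS echo canceller of the general kind such a front end might contain is sketched below in NumPy; the filter length and step size are arbitrary assumptions, and a deployed canceller would be considerably more elaborate:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=256, mu=0.5, eps=1e-8):
    """Adaptive echo cancellation: estimate the speaker-to-microphone path from the
    loudspeaker reference signal and subtract the estimated echo from the mic signal."""
    w = np.zeros(taps)        # adaptive estimate of the echo path impulse response
    buf = np.zeros(taps)      # most recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.concatenate(([ref[n]], buf[:-1]))
        echo_est = w @ buf
        e = mic[n] - echo_est                     # microphone sample with estimated echo removed
        w += mu * e * buf / (buf @ buf + eps)     # normalized LMS weight update
        out[n] = e
    return out
```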
In alternative embodiments, the acoustic front end 530 may include various features not described above, including one or more of: a microphone calibration section, which may reduce variability between microphones of different units; fixed beamformers, each with a fixed beam pattern from which a best beam is selected for processing; separate acoustic echo cancellers, each associated with a different beamformer; an analysis filterbank for separating the input into separate frequency bands, each of which may be processed, for example, with a band-specific echo canceller and beamformer, prior to resynthesis into a time domain signal; a dereverberation filter; an automatic gain control; and a double-talk detector.
A second stage of processing converts the digitized audio signal to a sequence of feature values, which may be assembled in feature vectors. A feature vector is a numerical vector (e.g., an array of numbers) that corresponds to a time (e.g., a vicinity of a time instant or a time interval) in the acoustic signal and characterizes the acoustic signal at that time. In the system shown in
The normalized feature vectors are provided to a feature analyzer 550, which generally transforms the feature vectors to a representation that is more directly associated with the linguistic content of the original audio signal. For example, in this embodiment, the output of the feature analyzer 550 is a sequence of observation vectors, where each entry in a vector is associated with a particular part of a linguistic unit, for example, part of an English phoneme. For example, the observation vector may include 3 entries for each phoneme of a trigger word (e.g., 3 outputs for each of 6 phonemes in a trigger word “Alexa”) plus entries (e.g., 2 entries or entries related to the English phonemes) related to non-trigger-word speech. In the embodiment shown in
Various forms of feature analyzer 550 may be used. One approach uses probability models with estimated parameters, for instance, Gaussian mixture models (GMMs) to perform the transformation from feature vectors to the representations of linguistic content. Another approach is to use an Artificial Neural Network (ANN) to perform this transformation. Within the general use of ANNs, particular types may be used including Recurrent Neural Networks (RNNs), Deep Neural Networks (DNNs), Time Delay Neural Networks (TDNNs), and so forth. Yet other parametric or non-parametric approaches may be used to implement this feature analysis. In the embodiment described more fully below, a variant of a TDNN is used.
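A sketch of a small TDNN-style analyzer in PyTorch follows; the layer widths, kernel sizes, and dilations are illustrative assumptions and not those of the embodiment described below:

```python
import torch
import torch.nn as nn

class SmallTDNN(nn.Module):
    """TDNN-style feature analyzer: stacked 1-D convolutions over time, each layer
    spanning a wider temporal context, ending in per-frame observation scores."""
    def __init__(self, feat_dim, num_outputs):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(256, num_outputs, kernel_size=1),
        )

    def forward(self, features):
        # features: (batch, time, feat_dim); output: per-frame observation vectors
        scores = self.net(features.transpose(1, 2))   # time axis shrinks with the unpadded context
        return scores.transpose(1, 2).log_softmax(dim=-1)
```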
The communication interface 570 receives an indicator of the part of the input (e.g., the frame number) corresponding to the identified trigger. Based on this identified part of the input, the communication interface 570 selects the part of the audio data (e.g., the sampled waveform) to send to the server 590. In some embodiments, the part that is sent starts at the beginning of the trigger and continues until no more speech is detected in the input, presumably because the user has stopped speaking. In other embodiments, the part corresponding to the trigger is omitted from the part that is transmitted to the server. However, in general, the time interval corresponding to the audio data that is transmitted to the server depends on the time interval corresponding to the detection of the trigger (e.g., the trigger starts the interval, ends the interval, or is present within the interval).
Referring to
Following processing by the runtime speech recognizer 681, the text-based results may be sent to other processing components, which may be local to the device performing speech recognition and/or distributed across data networks. For example, speech recognition results in the form of a single textual representation of the speech, an N-best list including multiple hypotheses and respective scores, a lattice, etc. may be sent to a natural language understanding (NLU) component 691. The NLU component 691 may include a named entity recognition (NER) module 692, which is used to identify portions of text that correspond to a named entity that may be recognizable by the system. An intent classifier (IC) module 694 may be used to determine the intent represented in the recognized text. Processing by the NLU component may be configured according to linguistic grammars 693 and/or skill and intent models 695. After natural language interpretation, a command processor 696, which may access a knowledge base 697, acts on the recognized text. For example, the result of the processing causes an appropriate output to be sent back to the user interface device for presentation to the user.
The command processor 696 may determine word sequences (or equivalent phoneme sequences, or other control input for a synthesizer) for presentation as synthesized speech to the user. The command processor passes the word sequence to the communication interface 570, which in turn passes it to the speech synthesis system 100. In an alternative embodiment (not illustrated), the server 590 includes the speech synthesis system 100, and the command processor causes the conversion of a word sequence to a waveform at the server 590, and passes the synthesized waveform to the user interface device 510.
Referring to
The training procedures, for example, as illustrated in
It should be understood that the device 400 is but one configuration in which the speech synthesis system 100 may be used. In one example, the synthesis system 100 shown as hosted in the device 400 may instead or in addition be hosted on a remote server 490, which generates the synthesized waveform and passes it to the device 400. In another example, the device 400 may host the front-end components 422 and 421, with the speech recognition system 430, the speech synthesis system 100, and the processing system 440 all being hosted in the remote system 490. As another example, the speech synthesis system may be hosted in a computing server, and clients of the server may provide text or control inputs to the synthesis system, and receive the enhanced synthesis waveform in return, for example, for acoustic presentation to a user of the client. In this way, the client does not need to implement a speech synthesizer. In some examples, the server also provides speech recognition services, such that the client may provide a waveform to the server and receive the words spoken, or a representation of the meaning, in return.
The approaches described above may be implemented in software, in hardware, or using a combination of software and hardware. For example, the software may include instructions stored on a non-transitory machine readable medium that when executed by a processor, for example in the user interface device, perform some or all of the procedures described above. Hardware may include special purpose circuitry (e.g., Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) and the like) for performing some of the functions. For example, some of the computations for the neural network transformers may be implemented using such special purpose circuitry.
It is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the appended claims. Other embodiments are within the scope of the following claims.