In the following, different inventive embodiments and aspects will be described. Also, further embodiments will be defined by the enclosed claims. It should be noted that any embodiments as defined by the claims can be supplemented by any of the details (features and functionalities) described in this description.
Also, the embodiments described in this description can be used individually, and can also be supplemented by any of the features herein, or by any feature included in the claims.
Also, it should be noted that individual aspects described herein can be used individually or in combination. Thus, details can be added to each of said individual aspects without adding details to another one of said aspects.
It should also be noted that the present disclosure describes, explicitly or implicitly, features usable in an audio generator and/or a method and/or a computer program product. Thus, any of the features described herein can be used in the context of a device, a method, and/or a computer program product.
Moreover, features and functionalities disclosed herein relating to a method can also be used in a device (configured to perform such functionality). Furthermore, any features and functionalities disclosed herein with respect to a device can also be used in a corresponding method. In other words, the methods disclosed herein can be supplemented by any of the features and functionalities described with respect to the devices.
Also, any of the features and functionalities described herein can be implemented in hardware or in software, or using a combination of hardware and software, as will be described in the section “implementation alternatives”.
Although some aspects are described in the context of a device, it is clear that these aspects also represent a description of the corresponding method, where a feature corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding feature of a corresponding device. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine-readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine-readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, e.g. a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are performed by any hardware apparatus.
The devices described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The devices described herein, or any components of the devices described herein, may be implemented at least partially in hardware and/or in software.
The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The methods described herein, or any part of the methods described herein, may be performed at least partially by hardware and/or by software.
The above described embodiments are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the pending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
The invention is within the technical field of audio generation.
Embodiments of the invention refer to an audio generator, configured to generate an audio signal from an input signal and target data, the target data representing the audio signal. Further embodiments refer to methods for generating an audio signal, and methods for training an audio generator. Further embodiments refer to a computer program product.
In recent years, neural vocoders have surpassed classical speech synthesis approaches in terms of naturalness and perceptual quality of the synthesized speech signals. The best results can be achieved with computationally-heavy neural vocoders like WaveNet and WaveGlow, while light-weight architectures based on Generative Adversarial Networks, e.g. MelGAN and Parallel WaveGAN, are still inferior in terms of the perceptual quality.
Generative models using Deep Learning for generating audio waveforms, such as WaveNet, LPCNet, and WaveGlow, have provided significant advances in natural-sounding speech synthesis. These generative models, which are called neural vocoders in Text-To-Speech (TTS) applications, outperform both parametric and concatenative synthesis methods. They can be conditioned using compressed representations of the target speech (e.g. mel-spectrogram) for reproducing a given speaker and a given utterance.
Prior works have shown that speech coding of clean speech at very low bit-rates can be achieved using such generative models at the decoder side. This can be done by conditioning the neural vocoders with the parameters from a classical low bit-rate speech coder.
Neural vocoders were also used for speech enhancement tasks, like speech denoising or dereverberation.
The main problem of these deep generative models is usually the high number of required parameters, and the resulting complexity both during training and synthesis (inference). For example, WaveNet, considered as state-of-the-art for the quality of the synthesized speech, generates the audio samples sequentially, one by one. This process is very slow and computationally demanding, and cannot be performed in real time.
Recently, lightweight adversarial vocoders based on Generative Adversarial Networks (GANs), such as MelGAN and Parallel WaveGAN, have been proposed for fast waveform generation. However, the reported perceptual quality of the speech generated using these models is significantly below the baseline of neural vocoders like WaveNet and WaveGlow. A GAN for Text-to-Speech (GAN-TTS) has been proposed to bridge this quality gap, but still at a high computational cost.
There exists a great variety of neural vocoders, which all have drawbacks. Autoregressive vocoders, for example WaveNet and LPCNet, may have very high quality, and be suitable for optimization for inference on CPU, but they are not suitable for usage on GPUs, since their processing cannot be parallelized easily, and they cannot offer real-time processing without compromising the quality.
Normalizing flow vocoders, for example WaveGlow, may also have very high quality, and be suitable for inference on a GPU, but they comprise a very complex model, which takes a long time to train and optimize. Such a model is also not suitable for embedded devices.
GAN vocoders, for example MelGAN and Parallel WaveGAN, may be lightweight and suitable for inference on GPUs, but their quality is lower than that of autoregressive models.
In summary, there still does not exist a low complexity solution delivering high fidelity speech. GANs are the most studied approach to achieve such a goal. The present invention is an efficient solution for this problem.
It is an object of the present invention to provide a lightweight neural vocoder solution which generates speech at very high quality and is trainable with limited computational resources.
An embodiment may have an audio generator, configured to generate an audio signal from an input signal and target data, the target data representing the audio signal, comprising: a first processing block, configured to receive first data derived from the input signal and to output first output data, wherein the first output data comprises a plurality of channels, and a second processing block, configured to receive, as second data, the first output data or data derived from the first output data, wherein the first processing block comprises for each channel of the first output data: a conditioning set of learnable layers configured to process the target data to acquire conditioning feature parameters; and a styling element, configured to apply the conditioning feature parameters to the first data or normalized first data; and wherein the second processing block is configured to combine the plurality of channels of the second data to acquire the audio signal.
Another embodiment may have a method for generating an audio signal by an audio generator from an input signal and target data, the target data representing the audio signal, comprising: receiving, by a first processing block, first data derived from the input signal; for each channel of a first output data: processing, by a conditioning set of learnable layers of the first processing block, the target data to acquire conditioning feature parameters; and applying, by a styling element of the first processing block, the conditioning feature parameters to the first data or normalized first data; outputting, by the first processing block, first output data comprising a plurality of channels; receiving, by a second processing block, as second data, the first output data or data derived from the first output data; and combining, by the second processing block, the plurality of channels of the second data to acquire the audio signal.
Another embodiment may have a method to generate an audio signal comprising a mathematical model, wherein the mathematical model is configured to output audio samples at a given time step from an input sequence representing the audio data to generate, wherein the mathematical model is configured to shape a noise vector in order to create the output audio samples using the input representative sequence.
Another embodiment may have a non-transitory digital storage medium having a computer program stored thereon to perform the method for generating an audio signal by an audio generator from an input signal and target data, the target data representing the audio signal, comprising: receiving, by a first processing block, first data derived from the input signal; for each channel of a first output data: processing, by a conditioning set of learnable layers of the first processing block, the target data to acquire conditioning feature parameters; and applying, by a styling element of the first processing block, the conditioning feature parameters to the first data or normalized first data; outputting, by the first processing block, first output data comprising a plurality of channels; receiving, by a second processing block, as second data, the first output data or data derived from the first output data; and combining, by the second processing block, the plurality of channels of the second data to acquire the audio signal, when said computer program is run by a computer.
Another embodiment may have a non-transitory digital storage medium having a computer program stored thereon to perform the method to generate an audio signal comprising a mathematical model, wherein the mathematical model is configured to output audio samples at a given time step from an input sequence representing the audio data to generate, wherein the mathematical model is configured to shape a noise vector in order to create the output audio samples using the input representative sequence, when said computer program is run by a computer.
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
In the figures, similar reference signs denote similar elements and features.
There is proposed, inter alia, an audio generator (e.g., 10), configured to generate an audio signal (e.g., 16) from an input signal (e.g., 14) and target data (e.g., 12), the target data (e.g., 12) representing the audio signal (e.g., 16), comprising at least one of:
The first processing block (e.g., 50) may comprise for each channel of the first output data:
The second processing block (e.g., 45) may be configured to combine the plurality of channels (e.g., 47) of the second data (e.g., 69) to obtain the audio signal (e.g., 16).
There is also proposed a method, e.g. for generating an audio signal (e.g., 16) by an audio generator (e.g., 10) from an input signal (e.g., 14) and target data (e.g., 12), the target data (e.g., 12) representing the audio signal (e.g., 16), comprising:
There is also proposed a method to train a neural network for audio generation, wherein the neural network:
There is also proposed a method to generate an audio signal (e.g. 16) comprising a mathematical model, wherein the mathematical model is configured to output audio samples at a given time step from an input sequence (e.g. 12) representing the audio data (e.g. 16) to generate. The mathematical model may shape a noise vector (e.g. 14) in order to create the output audio samples using the input representative sequence (e.g. 12).
It is in this context that we propose StyleMelGAN (e.g., the audio generator 10), a light-weight neural vocoder, allowing synthesis of high-fidelity speech with low computational complexity. StyleMelGAN is a fully convolutional, feed-forward model that uses Temporal Adaptive DEnormalization, TADE, (e.g., 60a and 60b in
The present application proposes, inter alia, a neural vocoder for generating high quality speech 16, which may be based on a generative adversarial network (GAN). The solution, here called StyleMelGAN (and, for example, implemented in the audio generator 10), is a lightweight neural vocoder allowing synthesis of high-quality speech 16 at low computational complexity. StyleMelGAN is a feed-forward, fully convolutional model that uses temporal adaptive denormalization (TADE) for styling (e.g. at block 77) a latent noise representation (e.g. 69) using, for example, the mel-spectrogram (12) of the target speech waveform. It allows highly parallelizable generation, which is several times faster than real time on both CPUs and GPUs. For training, it is possible to use multi-scale spectral reconstruction losses followed by adversarial losses. This makes it possible to obtain a model able to synthesize high-quality outputs after less than 2 days of training on a single GPU.
Potential applications and benefits from the invention are as follows:
The invention can be applied for Text-to-Speech, and the resulting quality, i.e. the generated speech quality for TTS and copy-synthesis, is close to WaveNet and natural speech. It also provides fast training, such that the model is easy and quick to re-train and personalize. It uses less memory, since it is a relatively small neural network model. Finally, the proposed invention provides a benefit in terms of complexity, i.e. it has a very good quality/complexity tradeoff.
The invention can also be applied for speech enhancement, where it can provide a low complexity and robust solution for generating clean speech from noisy speech.
The invention can also be applied for speech coding, where it can lower significantly the bitrate by transmitting only the parameters needed for conditioning the neural vocoder. Also, in this application the lightweight neural vocoder-based solution is suitable for embedded systems, and especially suitable for upcoming (end-)User Equipment (UE) equipped with a GPU or a Neural Processing Unit (NPU).
Embodiments of the present application refer to an audio generator, configured to generate an audio signal from an input signal and target data, the target data representing the audio signal, comprising a first processing block, configured to receive first data derived from the input signal and to output first output data, wherein the first output data comprises a plurality of channels, and a second processing block, configured to receive, as second data, the first output data or data derived from the first output data, wherein the first processing block comprises for each channel of the first output data a conditioning set of learnable layers configured to process the target data to obtain conditioning feature parameters; and a styling element, configured to apply the conditioning feature parameters to the first data or normalized first data; and wherein the second processing block is configured to combine the plurality of channels of the second data to obtain the audio signal.
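By way of illustration only, the action of the styling element may be sketched as follows. The sketch assumes that the conditioning feature parameters take the form of a scale and a shift per channel and per sample, applied element-wise; the names `apply_styling`, `gamma` and `beta` are hypothetical, and in the audio generator the parameters would be produced by the conditioning set of learnable layers from the target data rather than given as constants.

```python
def apply_styling(first_data, gamma, beta):
    """Apply conditioning feature parameters element-wise.

    first_data, gamma and beta are lists of channels, each channel being a
    list of samples. All three are assumed to have the same shape.
    """
    styled = []
    for x_ch, g_ch, b_ch in zip(first_data, gamma, beta):
        # Scale and shift every sample of this channel (assumed form).
        styled.append([g * x + b for x, g, b in zip(x_ch, g_ch, b_ch)])
    return styled


# Toy values: two channels with three time samples each.
x = [[1.0, -1.0, 0.0], [0.5, 0.5, -1.0]]
gamma = [[2.0, 2.0, 2.0], [1.0, 0.0, -1.0]]
beta = [[0.0, 1.0, -1.0], [0.5, 0.5, 0.5]]
styled = apply_styling(x, gamma, beta)
```

Because the parameters vary along the time axis, the target data may steer each time position of each channel separately.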
According to one embodiment, the conditioning set of learnable layers consists of one or two convolution layers.
According to one embodiment, a first convolution layer is configured to convolute the target data or up-sampled target data to obtain first convoluted data using a first activation function.
According to one embodiment, the conditioning set of learnable layers and the styling element are part of a weight layer in a residual block of a neural network comprising one or more residual blocks.
According to one embodiment, the audio generator further comprises a normalizing element, which is configured to normalize the first data. For example, the normalizing element may normalize the first data to a normal distribution of zero-mean and unit-variance.
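The normalization to zero mean and unit variance may, for example, be sketched as follows. The sketch assumes that the statistics are computed separately for each channel along the time axis, which the description does not prescribe; the function names and the small constant `eps` (added to avoid division by zero) are hypothetical.

```python
import math


def normalize_channel(samples, eps=1e-5):
    """Shift and scale one channel to zero mean and (nearly) unit variance."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return [(s - mean) / math.sqrt(var + eps) for s in samples]


def normalize(first_data, eps=1e-5):
    """Normalize each channel of the first data independently (assumption)."""
    return [normalize_channel(ch, eps) for ch in first_data]


normalized = normalize([[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]])
```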
According to one embodiment, the audio signal is a voice audio signal.
According to one embodiment, the target data is up-sampled, advantageously by non-linear interpolation, by a factor of 2 or a multiple of 2, or a power of 2. In some examples, instead, a factor greater than 2 may be used.
According to one embodiment, the first processing block further comprises a further set of learnable layers, configured to process data derived from the first data using a second activation function, wherein the second activation function is a gated activation function.
According to one embodiment, the further set of learnable layers consists of one or two convolution layers.
According to one embodiment, the second activation function is a softmax-gated hyperbolic tangent, TanH, function.
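A softmax-gated TanH activation may be pictured with the following minimal sketch. The description does not define the gate in detail, so the sketch assumes that two branches (here `gate_input` and `filter_input`, both hypothetical names) are provided by the further set of learnable layers, and that the softmax is taken across channels at each time position.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_gated_tanh(gate_input, filter_input):
    """Gate tanh(filter_input) with a softmax taken across channels.

    Both inputs are indexed as [channel][time] and have the same shape.
    """
    n_ch = len(gate_input)
    n_t = len(gate_input[0])
    out = [[0.0] * n_t for _ in range(n_ch)]
    for t in range(n_t):
        gates = softmax([gate_input[c][t] for c in range(n_ch)])
        for c in range(n_ch):
            out[c][t] = gates[c] * math.tanh(filter_input[c][t])
    return out


gate_input = [[0.0, 1.0], [0.0, -1.0]]
filter_input = [[1.0, 0.5], [-1.0, 2.0]]
activated = softmax_gated_tanh(gate_input, filter_input)
```

With this assumed form, the gates at each time position sum to one over the channels, so the output magnitude never exceeds that of the TanH branch.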
According to one embodiment, the first activation function is a leaky rectified linear unit, leaky ReLU, function.
According to one embodiment, convolution operations run with maximum dilation factor of 2.
According to one embodiment, the audio generator comprises eight first processing blocks and one second processing block.
According to one embodiment, the first data has a lower dimensionality than the audio signal. The first data may have a first dimension or at least one dimension lower than the audio signal. The first data may have one dimension lower than the audio signal but a number of channels greater than the audio signal. The first data may have a total number of samples across all dimensions lower than the audio signal.
According to one embodiment, the target data is a spectrogram, advantageously a mel-spectrogram, or a bitstream.
According to one embodiment, the target data is derived from a text, the target data is a compressed representation of audio data, or the target data is a degraded audio signal.
Further embodiments refer to a method for generating an audio signal by an audio generator from an input signal and target data, the target data representing the audio signal, comprising receiving, by a first processing block, first data derived from the input signal; for each channel of a first output data processing, by a conditioning set of learnable layers of the first processing block, the target data to obtain conditioning feature parameters; and applying, by a styling element of the first processing block, the conditioning feature parameters to the first data or normalized first data; outputting, by the first processing block, first output data comprising a plurality of channels; receiving, by a second processing block, as second data, the first output data or data derived from the first output data; and combining, by the second processing block, the plurality of channels of the second data to obtain the audio signal.
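The final combining step may be illustrated by the following sketch, which assumes that the second processing block forms a weighted sum across the channels at each time position. The description only states that the channels are combined, so the weighted sum, the function name and the weights are hypothetical.

```python
def combine_channels(second_data, weights, bias=0.0):
    """Combine [channel][time] data into one single-channel signal.

    Each output sample is a weighted sum over all channels at that time
    position (an assumed form of the combination).
    """
    n_t = len(second_data[0])
    return [
        sum(w * ch[t] for w, ch in zip(weights, second_data)) + bias
        for t in range(n_t)
    ]


# Three channels with two time samples each (toy values).
second_data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
audio = combine_channels(second_data, [0.5, 0.25, 0.0])
```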
Normalizing may include, for example, normalizing the first data to a normal distribution of zero-mean and unit-variance.
The method can be supplied with any feature or feature combination from the audio generator as well.
Further embodiments refer to a method for training an audio generator as laid out above, wherein training comprises repeating the steps of any one of the methods as laid out above one or more times.
According to one embodiment, the method for training further comprises evaluating the generated audio signal by at least one evaluator, which is advantageously a neural network, and adapting the weights of the audio generator according to the results of the evaluation.
According to one embodiment, the method for training further comprises adapting the weights of the evaluator according to the results of the evaluation.
According to one embodiment, training comprises optimizing a loss function.
According to one embodiment, optimizing a loss function comprises calculating a fixed metric between the generated audio signal and a reference audio signal.
According to one embodiment, calculating the fixed metric comprises calculating one or several spectral distortions between the generated audio signal and the reference signal.
According to one embodiment, calculating the one or several spectral distortions is performed on magnitude or log-magnitude of the spectral representation of the generated audio signal and the reference signal, and/or on different time or frequency resolutions.
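One possible fixed metric of this kind is sketched below: a mean absolute difference of log-magnitude spectra, averaged over several frame lengths. This is an illustration under stated assumptions and not necessarily the metric used in the embodiments; the function names, the rectangular framing, the naive discrete Fourier transform and the chosen resolutions are all hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative frequency bins of a naive DFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(
            frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        mags.append(abs(acc))
    return mags


def log_spectral_distance(generated, reference, frame_len, hop, eps=1e-7):
    """Mean absolute log-magnitude difference at one time/frequency resolution.

    Both signals are assumed to have the same length.
    """
    total, count = 0.0, 0
    for start in range(0, len(reference) - frame_len + 1, hop):
        g = magnitude_spectrum(generated[start:start + frame_len])
        r = magnitude_spectrum(reference[start:start + frame_len])
        for gm, rm in zip(g, r):
            total += abs(math.log(gm + eps) - math.log(rm + eps))
            count += 1
    return total / count


def multi_resolution_distance(generated, reference, resolutions):
    """Average the distance over several (frame_len, hop) resolutions."""
    dists = [
        log_spectral_distance(generated, reference, n, h) for n, h in resolutions
    ]
    return sum(dists) / len(dists)


reference = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
generated = [0.5 * s for s in reference]
resolutions = [(16, 8), (32, 16)]
loss_same = multi_resolution_distance(reference, reference, resolutions)
loss_scaled = multi_resolution_distance(generated, reference, resolutions)
```

Short frames give a fine time resolution and long frames a fine frequency resolution, which is one reason for evaluating the distortion at several resolutions.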
According to one embodiment, optimizing the loss function comprises deriving one or more adversarial metrics by randomly supplying and evaluating a representation of the generated audio signal or a representation of the reference audio signal by one or more evaluators, wherein evaluating comprises classifying the supplied audio signal into a predetermined number of classes indicating a pretrained classification level of naturalness of the audio signal.
According to one embodiment, optimizing the loss function comprises calculating a fixed metric and deriving an adversarial metric by one or more evaluators.
According to one embodiment, the audio generator is first trained using the fixed metric.
According to one embodiment, four evaluators derive four adversarial metrics.
According to one embodiment, the evaluators operate after a decomposition of the representation of the generated audio signal or the representation of the reference audio signal by a filter-bank.
According to one embodiment, each of the evaluators receives as input one or several portions of the representation of the generated audio signal or the representation of the reference audio signal.
According to one embodiment, the signal portions are generated by sampling random windows from the input signal, using random window functions.
According to one embodiment, sampling of the random window is repeated multiple times for each evaluator.
According to one embodiment, the number of times the random window is sampled for each evaluator is proportional to the length of the representation of the generated audio signal or the representation of the reference audio signal.
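The random windowing may, for instance, be sketched as follows. The sketch is simplified to rectangular windows cut at random positions, with the number of windows growing in proportion to the signal length; the function name, the window length and the ratio `windows_per_1000` are hypothetical values chosen only for the example.

```python
import random


def sample_random_windows(signal, window_len, windows_per_1000=4, rng=None):
    """Cut randomly positioned windows out of a signal.

    The number of windows is proportional to the signal length
    (windows_per_1000 windows per 1000 samples, at least one).
    """
    if rng is None:
        rng = random.Random()
    count = max(1, len(signal) * windows_per_1000 // 1000)
    windows = []
    for _ in range(count):
        start = rng.randrange(0, len(signal) - window_len + 1)
        windows.append(signal[start:start + window_len])
    return windows


signal = [float(i) for i in range(2000)]
windows = sample_random_windows(signal, window_len=256, rng=random.Random(0))
```

Each evaluator could be given its own window length, so that different evaluators inspect the signal at different time scales.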
Further embodiments refer to a computer program product including a program for a processing device, comprising software code portions for performing the steps of the methods described herein when the program is run on the processing device.
According to one embodiment, the computer program product comprises a computer-readable medium on which the software code portions are stored, wherein the program is directly loadable into an internal memory of the processing device.
Further embodiments refer to a method to generate an audio signal comprising a mathematical model, wherein the mathematical model is configured to output audio samples at a given time step from an input sequence representing the audio data to generate, wherein the mathematical model is configured to shape a noise vector in order to create the output audio samples using the input representative sequence.
According to one embodiment, the mathematical model is trained using audio data. According to one embodiment, the mathematical model is a neural network. According to one embodiment, the network is a feed-forward network. According to one embodiment, the network is a convolutional network.
According to one embodiment, the noise vector may have a lower dimensionality than the audio signal to generate. The first data may have a first dimension or at least one dimension lower than the audio signal. The first data may have a total number of samples across all dimensions lower than the audio signal. The first data may have one dimension lower than the audio signal but a number of channels greater than the audio signal.
According to one embodiment, a temporal adaptive de-normalization (TADE) technique is used for conditioning the mathematical model using the input representative sequence and therefore for shaping the noise vector.
According to one embodiment, a modified softmax-gated Tanh activates each layer of the neural network.
According to one embodiment, convolution operations run with maximum dilation factor of 2.
According to one embodiment, the noise vector as well as the input representative sequence are up-sampled to obtain the output audio at the target sampling rate.
According to one embodiment, the up-sampling is performed sequentially in different layers of the mathematical model.
According to one embodiment, the up-sampling factor for each layer is 2 or a multiple of 2, such as a power of 2. In some examples, the up-sampling factor may more generally be greater than 2.
According to one embodiment, the generated audio signal is used in a text-to-speech application, wherein the input representative sequence is derived from a text.
According to one embodiment, the generated audio signal is used in an audio decoder, wherein the input representative sequence is a compressed representation of the original audio to transmit or store.
According to one embodiment, the generated audio signal is used to improve the audio quality of a degraded audio signal, wherein the input representative sequence is derived from the degraded signal.
Further embodiments refer to a method to train a neural network for audio generation, wherein the neural network outputs audio samples at a given time step from an input sequence representing the audio data to generate, wherein the neural network is configured to shape a noise vector in order to create the output audio samples using the input representative sequence, wherein the neural network is designed as laid out above, and wherein the training is designed to optimize a loss function.
According to one embodiment, the loss function comprises a fixed metric computed between the generated audio signal and a reference audio signal.
According to one embodiment, the fixed metric is one or several spectral distortions computed between the generated audio signal and the reference signal.
According to one embodiment, the one or several spectral distortions are computed on magnitude or log-magnitude of the spectral representation of the generated audio signal and the reference signal.
According to one embodiment, the one or several spectral distortions forming the fixed metric are computed on different time or frequency resolutions.
According to one embodiment, the loss function comprises an adversarial metric derived by additional discriminative neural networks, wherein the discriminative neural networks receive as input a representation of the generated or of the reference audio signals, and wherein the discriminative neural networks are configured to evaluate how realistic the generated audio samples are.
According to one embodiment, the loss function comprises both a fixed metric and an adversarial metric derived by additional discriminative neural networks.
According to one embodiment, the neural network generating the audio samples is first trained using solely the fixed metric.
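The two-stage use of the metrics may be sketched as follows, assuming that the fixed metric alone is used for an initial number of training steps and that the adversarial metric is added afterwards. The function name, the step threshold `pretrain_steps` and the weight `adv_weight` are hypothetical; the description does not specify such values.

```python
def generator_loss(step, fixed_metric, adversarial_metric,
                   pretrain_steps, adv_weight=1.0):
    """Select the loss used to update the generator at a given step.

    Before pretrain_steps, only the fixed metric is used; afterwards the
    adversarial metric is added (an assumed combination).
    """
    if step < pretrain_steps:
        return fixed_metric
    return fixed_metric + adv_weight * adversarial_metric


loss_early = generator_loss(0, 2.0, 5.0, pretrain_steps=100)
loss_late = generator_loss(100, 2.0, 5.0, pretrain_steps=100)
```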
According to one embodiment, the adversarial metric is derived by 4 discriminative neural networks.
According to one embodiment, the discriminative neural networks operate after a decomposition of the input audio signal by a filter-bank.
According to one embodiment, each discriminative neural network receives as input one or several random windowed versions of the input audio signal.
According to one embodiment, the sampling of the random window is repeated multiple times for each discriminative neural network.
According to one embodiment, the number of times the random window is sampled for each discriminative neural network is proportional to the length of the input audio samples.
The first processing block 50 is shown in
The first output data 69 at each block 50 are in a plurality of channels. The audio generator 10 may include a second processing block 45 (in
The “channels” are not to be understood in the context of stereo sound, but in the context of neural networks (e.g. convolutional neural networks). For example, the input signal (e.g. latent noise) 14 may be in 128 channels (in the representation in the time domain), since a sequence of channels is provided. For example, when the signal has 176 samples and 64 channels, it may be understood as a matrix of 176 columns and 64 rows, while when the signal has 352 samples and 64 channels, it may be understood as a matrix of 352 columns and 64 rows (other schematizations are possible). Therefore, the generated audio signal 16 (which in
The at least the original input signal 14 and/or the generated speech 16 may be a vector). By contrast, the output of each of the blocks 30 and 50a-50h, 42, 44 has in general a different dimensionality. The first data may have a first dimension or at least one dimension lower than that of the audio signal. The first data may have a total number of samples across all dimensions lower than the audio signal. The first data may have one dimension lower than the audio signal but a number of channels greater than the audio signal. At each block 30 and 50a-50h, the signal, evolving from noise 14 towards becoming speech 16, may be upsampled. For example, at the upsampling block 30 before the first block 50a among the blocks 50a-50h, an 88-times upsampling is performed. An upsampling may include, for example, the following sequence: 1) repetition of the same value, 2) insertion of zeros, 3) another repetition or zero insertion followed by linear filtering, etc.
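The upsampling options listed above can be illustrated with a minimal sketch for a single channel. The function names (`upsample_repeat`, `upsample_zeros`, `fir_filter`) and the two-tap filter are hypothetical choices made for illustration only; the text does not prescribe a particular filter. For a signal with several channels, the same operation would be applied to each row.

```python
def upsample_repeat(x, factor):
    """Upsample by repeating each sample `factor` times."""
    return [v for v in x for _ in range(factor)]


def upsample_zeros(x, factor):
    """Upsample by inserting `factor - 1` zeros after each sample."""
    out = []
    for v in x:
        out.append(v)
        out.extend([0.0] * (factor - 1))
    return out


def fir_filter(x, taps):
    """Causal FIR filter: y[n] = sum_k taps[k] * x[n - k], with zero initial state."""
    return [
        sum(t * x[n - k] for k, t in enumerate(taps) if n - k >= 0)
        for n in range(len(x))
    ]


x = [1.0, 2.0, 3.0]
repeated = upsample_repeat(x, 2)            # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
stuffed = upsample_zeros(x, 2)              # [1.0, 0.0, 2.0, 0.0, 3.0, 0.0]
smoothed = fir_filter(stuffed, [1.0, 1.0])  # the filter fills the inserted zeros
```

With this particular two-tap filter, zero insertion followed by filtering gives the same result as plain repetition; a longer filter would instead interpolate between neighbouring samples.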
The generated audio signal 16 may generally be a single-channel signal (e.g. 1×22528). In case multiple audio channels are needed (e.g., for a stereo sound playback) then the claimed procedure shall be in principle iterated multiple times.
Analogously, also the target data 12 can be, in principle, in one single channel (e.g. if it is text) or in multiple channels (e.g. in spectrograms). In any case, it may be upsampled (e.g. by a factor of two, a power of 2, a multiple of 2, or a value greater than 2) to adapt to the dimensions of the signal (59a, 15, 69) evolving along the subsequent layers (50a-50h, 42), e.g. to obtain the conditioning feature parameters 74, 75 in dimensions adapted to the dimensions of the signal.
When the first processing block 50 is instantiated in multiple blocks 50a-50h, the number of channels may, for example, remain the same for the multiple blocks 50a-50h. The first data may have a first dimension or at least one dimension lower than that of the audio signal. The first data may have a total number of samples across all dimensions lower than the audio signal. The first data may have one dimension lower than the audio signal but a number of channels greater than the audio signal.
The signal at the subsequent blocks may have different dimensions from each other. For example, the sample may be upsampled more and more times to arrive, for example, from 88 samples to 22,528 samples at the last block 50h. Analogously, also the target data 12 are upsampled at each processing block 50. Accordingly, the conditioning features parameters 74, 75 can be adapted to the number of samples of the signal to be processed. Accordingly, semantic information provided by the target data 12 is not lost in subsequent layers 50a-50h.
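As a worked example of the sample counts mentioned above, the following sketch assumes that each of the eight blocks 50a-50h doubles the number of samples. This factor is an assumption; it is consistent with going from 88 to 22,528 samples, since 22,528 / 88 = 256 = 2^8. The helper name `length_schedule` is hypothetical, and the 22.05 kHz rate is the one quoted elsewhere in this description.

```python
def length_schedule(start_samples, num_blocks, factor=2):
    """Return the number of samples before the first block and after each block."""
    lengths = [start_samples]
    for _ in range(num_blocks):
        lengths.append(lengths[-1] * factor)
    return lengths


lengths = length_schedule(start_samples=88, num_blocks=8)
print(lengths)
# [88, 176, 352, 704, 1408, 2816, 5632, 11264, 22528]

# Duration of the final signal at a sampling rate of 22.05 kHz.
duration_seconds = lengths[-1] / 22050
print(round(duration_seconds, 3))  # 1.022
```

Under the same assumption, the target data 12 would have to be brought to each of these lengths in turn so that the conditioning feature parameters 74, 75 match the signal at every block.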
It is to be understood that examples may be performed according to the paradigms of generative adversarial networks (GANs). A GAN includes a GAN generator 11 (
As explained by the wording “conditioning set of learnable layers”, the audio generator 10 may be obtained according to the paradigms of conditional GANs, e.g. based on conditional information. For example, conditional information may be constituted by target data (or upsampled version thereof) 12 from which the conditioning set of layers 71-73 (weight layer) are trained and the conditioning feature parameters 74, 75 are obtained. Therefore, the styling element 77 is conditioned by the learnable layers 71-73.
The examples may be based on convolutional neural networks. For example, a small matrix (e.g., filter or kernel), which could be a 3×3 matrix (or a 4×4 matrix, etc.), is convolved along a bigger matrix (e.g., the channels × samples latent or input signal and/or the spectrogram or upsampled spectrogram or, more generally, the target data 12), e.g. implying a combination (e.g., multiplication and sum of the products; dot product, etc.) between the elements of the filter (kernel) and the elements of the bigger matrix (activation map, or activation signal). During training, the elements of the filter (kernel) that minimize the losses are obtained (learnt). During inference, the elements of the filter (kernel) obtained during training are used. Examples of convolutions are at blocks 71-73, 61a, 61b, 62a, 62b (see below). Where a block is conditional (e.g., block 60 of
It is possible to have, in some examples, activation functions downstream of the convolution (ReLu, TanH, softmax, etc.), which may differ according to the intended effect. ReLu may output the maximum between 0 and the value obtained at the convolution (in practice, it maintains the same value if it is positive, and outputs 0 in case of a negative value). Leaky ReLu may output x if x>0, and 0.1*x if x≤0, x being the value obtained by convolution (instead of 0.1 another value, such as a predetermined value within 0.1 ± 0.05, may be used in some examples). TanH (which may be implemented, for example, at block 63a and/or 63b) may provide the hyperbolic tangent of the value obtained at the convolution, e.g.
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
with x being the value obtained at the convolution (e.g. at block 61a and/or 61b). Softmax (e.g. applied, for example, at block 64a and/or 64b) may apply the exponential to each element of the result of the convolution (e.g., as obtained in block 62a and/or 62b), and normalize it by dividing by the sum of the exponentials. Softmax (e.g. at 64a and/or 64b) may provide a probability distribution for the entries of the matrix which results from the convolution (e.g. as provided at 62a and/or 62b). After the application of the activation function, a pooling step may be performed (not shown in the figures) in some examples, but in other examples it may be avoided.
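The activation functions described above can be sketched as follows. This is only an illustration of the definitions given in the text; the function names are hypothetical, and subtracting the maximum inside `softmax` is a common numerical-stability choice that is not stated in the description.

```python
import math


def relu(x):
    """Keep positive values, output 0 for negative values."""
    return max(0.0, x)


def leaky_relu(x, slope=0.1):
    """Output x if x > 0, and slope * x otherwise (slope 0.1 as in the text)."""
    return x if x > 0 else slope * x


def tanh(x):
    """Hyperbolic tangent of the value obtained at the convolution."""
    return math.tanh(x)


def softmax(values):
    """Exponentiate each entry and divide by the sum of the exponentials."""
    peak = max(values)  # assumption: shift by the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])
print(sum(probabilities))  # approximately 1.0, i.e. a probability distribution
```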
Multiple layers of convolutions (e.g. a conditioning set of learnable layers) may be one after another one and/or in parallel to each other, so as to increase the efficiency. If the application of the activation function and/or the pooling are provided, they may also be repeated in different layers (or maybe different activation functions may be applied to different layers, for example).
The input signal 14 (e.g. noise) is processed, at different steps, to become the generated audio signal 16 (e.g. under the conditions set by the conditioning sets of learnable layers 71-73, and on the parameters 74, 75 learnt by the conditioning sets of learnable layers 71-73). Therefore, the input signal is to be understood as evolving in a direction of processing (from 14 to 16 in
It is also noted that the multiple channels of the input signal (or any of its evolutions) may be considered to have a set of learnable layers and a styling element associated thereto. For example, each row of the matrices 74 and 75 is associated to a particular channel of the input signal (or one of its evolutions), and is therefore obtained from a particular learnable layer associated to the particular channel. Analogously, the styling element 77 may be considered to be formed by a multiplicity of styling elements (one for each row of the input signal x, c, 12, 76, 76′, 59, 59a, 59b, etc.).
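The per-channel application of the parameters 74 (γ) and 75 (β) can be sketched as an element-wise scale and shift of the evolving signal. The sketch below is illustrative only: it assumes that γ and β have the same channels × samples shape as the signal and that each channel is normalized to zero mean and unit variance before styling. The normalization step and the names `normalize_channel` and `apply_style` are assumptions, not taken from the description.

```python
import math


def normalize_channel(row, eps=1e-5):
    """Normalize one channel (row) to zero mean and unit variance (assumed step)."""
    mean = sum(row) / len(row)
    variance = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(variance + eps) for v in row]


def apply_style(x, gamma, beta):
    """Compute c * gamma + beta element-wise, where c is the normalized signal x."""
    styled = []
    for row, gamma_row, beta_row in zip(x, gamma, beta):
        c = normalize_channel(row)
        styled.append([ci * g + b for ci, g, b in zip(c, gamma_row, beta_row)])
    return styled


# Two channels, four samples each; each row of gamma and beta styles one channel.
x = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
gamma = [[2.0] * 4, [2.0] * 4]
beta = [[1.0] * 4, [1.0] * 4]
styled = apply_style(x, gamma, beta)
```

In this example the first channel ends up with a mean equal to β, and the constant second channel is mapped to β everywhere, which shows how γ and β set the scale and offset of each channel independently.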
It will be shown that the noise vector 14 is step-by-step processed (e.g., at blocks 50a-50h, 42, 44, 46, etc.), so as to evolve from, e.g., noise 14 to, e.g., speech 16 (the evolving signal will be indicated, for example, with different signals 15, 59a, x, c, 76′, 79, 79a, 59b, 79b, 69, etc.).
At block 30, the input signal (noise) 14 may be upsampled to have 88 samples (different numbers are possible) and 64 channels (different numbers are possible).
As can be seen, eight processing blocks 50a, 50b, 50c, 50d, 50e, 50f, 50g, 50h (altogether embodying the first processing block 50 of
Each of the blocks 50a-50h (50) can also be a TADEResBlock (residual block in the context of TADE, Temporal Adaptive DEnormalization). Notably, each block 50a-50h may be conditioned by the target data (e.g., mel-spectrogram) 12.
At a second processing block 45 (
At least one of the blocks 50a-50h (or each of them, in particular examples) may be, for example, a residual block. A residual block operates a prediction only to a residual component of the signal evolving from the input signal 14 (e.g. noise) to the output audio signal 16. The residual signal is only a part (residual component) of the main signal. For example, multiple residual signals may be added to each other, to obtain the final output audio signal 16.
In
Notably, the addition at adder 65c does not necessarily need to be performed within the residual block 50 (50a-50h). A single addition of a plurality of residual signals 65b′ (each outputted by each of residual blocks 50a-50h) can be performed (e.g., at an adder block in the second processing block 45, for example). Accordingly, the different residual blocks 50a-50h may operate in parallel with each other.
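The two ways of combining residual signals described above can be sketched as follows. The function names are hypothetical, and `predict_residual` stands in for whatever processing a residual block performs; the sketch only shows where the addition takes place.

```python
def residual_block(signal, predict_residual):
    """Add the predicted residual to the main signal inside the block."""
    residual = predict_residual(signal)
    return [s + r for s, r in zip(signal, residual)]


def sum_residuals(residuals):
    """Add several residual signals sample by sample in a single addition."""
    return [sum(values) for values in zip(*residuals)]


# Addition inside the block (as at an adder within the residual block).
inside = residual_block([1.0, 2.0], lambda s: [0.5 * v for v in s])
print(inside)  # [1.5, 3.0]

# One addition of several residual signals performed outside the blocks.
outside = sum_residuals([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(outside)  # [9.0, 12.0]
```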
In the example of
For each replica (600, 601), a conditioning set of learnable layers 71-73 and a styling element 77 are applied (e.g. twice for each block 50) to the signal evolving from the input signal 14 to the audio output signal 16. A first temporal adaptive denormalization (TADE) is performed at TADE block 60a on the first data 59a at the first replica 600. The TADE block 60a performs a modulation of the first data 59a (input signal or, e.g., processed noise) under the conditions set out by the target data 12. In the first TADE block 60a, an upsampling of the target data 12 may be performed at upsampling block 70, to obtain an upsampled version 12′ of the target data 12. The upsampling may be obtained through non-linear interpolation, e.g. using a factor of 2, a power of 2, a multiple of two, or another value greater than 2. Accordingly, in some examples the spectrogram 12′ may have the same dimensions as (e.g. conform to) the signal (76, 76′, x, c, 59, 59a, 59b, etc.) to be conditioned by the spectrogram. An application of stylistic information to the processed noise (first data) (76, 76′, x, c, 59, 59a, 59b, etc.) may be performed at block 77 (styling element). In a subsequent replica 601, another TADE block 60b may be applied to the output 59b of the first replica 600. An example of the TADE block 60 (60a, 60b) is provided in
In examples, the first and second convolutions at 61b and 62b, respectively downstream to the TADE block 60a and 60b, may be performed at the same number of elements in the kernel (e.g., 9, e.g., 3×3). However, the second convolutions 61b and 62b may have a dilation factor of 2. In examples, the maximum dilation factor for the convolutions may be 2 (two).
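The effect of a dilation factor of 2 can be illustrated with a minimal one-dimensional sketch. The three-tap kernel, the zero padding that keeps the output length equal to the input length, and the function name `dilated_conv1d` are assumptions chosen for brevity; the text itself mentions kernels with, e.g., 9 elements.

```python
def dilated_conv1d(x, kernel, dilation=1):
    """Same-length 1D convolution (cross-correlation form) with zero padding.

    Kernel taps are spaced `dilation` samples apart, so a kernel of k taps
    covers (k - 1) * dilation + 1 input samples. Assumes an odd kernel length.
    """
    pad = dilation * (len(kernel) - 1) // 2
    out = []
    for n in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = n + j * dilation - pad
            if 0 <= idx < len(x):  # samples outside the signal count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 1.0, 1.0]

print(dilated_conv1d(x, kernel, dilation=1))  # [3.0, 6.0, 9.0, 12.0, 9.0]
print(dilated_conv1d(x, kernel, dilation=2))  # [4.0, 6.0, 9.0, 6.0, 8.0]
```

With the same number of kernel elements, the dilated convolution combines samples that are two positions apart, which widens the span of input samples seen by each output sample from 3 to 5 without adding parameters.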
After stylistic element 77, the signal is output. The convolutions 72 and 73 do not necessarily have an activation function downstream of them. It is also noted that the parameter γ (74) may be understood as a variance and β (75) as a bias. Also, block 42 of
The following procedure may be performed:
The GAN discriminator 100 of
The GAN discriminator 100 has the role of learning how to distinguish the generated audio signals (e.g., audio signal 16 synthesized as discussed above) from real input signals (e.g. real speech) 104. Therefore, the role of the GAN discriminator 100 is mainly exerted during training (e.g. for learning parameters 72 and 73) and is opposed to the role of the GAN generator 11 (which may be seen as the audio generator 10 without the GAN discriminator 100).
In general terms, the GAN discriminator 100 may receive as input both the audio signal 16 generated by the GAN generator 10 and the real audio signal (e.g., real speech) 104 acquired, e.g., through a microphone, and may process the signals to obtain a metric (e.g., loss) which is to be minimized. The real audio signal 104 can also be considered a reference audio signal. During training, operations like those explained above for synthesizing speech 16 may be repeated, e.g. multiple times, so as to obtain the parameters 74 and 75, for example.
In examples, instead of analyzing the whole reference audio signal 104 and/or the whole generated audio signal 16, it is possible to only analyze a part thereof (e.g. a portion, a slice, a window, etc.). Signal portions in random windows (105a-105d) sampled from the generated audio signal 16 and from the reference audio signal 104 are obtained. For example, random window functions can be used, so that it is not pre-defined a priori which window 105a, 105b, 105c, 105d will be used. Also, the number of windows is not necessarily four, and may vary.
Within the windows (105a-105d), a PQMF (Pseudo Quadrature Mirror Filter-bank) 110 may be applied. Hence, subbands 120 are obtained. Accordingly, a decomposition (110) of the representation of the generated audio signal (16) or the representation of the reference audio signal (104) is obtained.
An evaluation block 130 may be used to perform the evaluations. Multiple evaluators 132a, 132b, 132c, 132d (collectively indicated with 132) may be used (a different number may be used). In general, each window 105a, 105b, 105c, 105d may be input to a respective evaluator 132a, 132b, 132c, 132d. Sampling of the random window (105a-105d) may be repeated multiple times for each evaluator (132a-132d). In examples, the number of times the random window (105a-105d) is sampled for each evaluator (132a-132d) may be proportional to the length of the representation of the generated audio signal (16) or the representation of the reference audio signal (104). Accordingly, each of the evaluators (132a-132d) may receive as input one or several portions (105a-105d) of the representation of the generated audio signal (16) or the representation of the reference audio signal (104).
Each evaluator 132a-132d may be a neural network itself. Each evaluator 132a-132d may, in particular, follow the paradigms of convolutional neural networks. Each evaluator 132a-132d may be a residual evaluator. Each evaluator 132a-132d may have parameters (e.g. weights) which are adapted during training (e.g., in a manner similar to one of those explained above).
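The repeated sampling of random windows can be sketched as follows. The function name `sample_random_windows`, the window length of 512 samples and the rule "one window per `window_length` samples of input" are illustrative assumptions; the description only requires that the number of repetitions be proportional to the length of the signal.

```python
import random


def sample_random_windows(signal, window_length, rng):
    """Cut randomly positioned windows out of `signal`.

    The number of windows grows proportionally to the signal length
    (assumed rule: len(signal) // window_length, at least one).
    """
    if window_length > len(signal):
        raise ValueError("window_length must not exceed the signal length")
    count = max(1, len(signal) // window_length)
    windows = []
    for _ in range(count):
        start = rng.randrange(0, len(signal) - window_length + 1)
        windows.append(signal[start:start + window_length])
    return windows


rng = random.Random(0)           # fixed seed only to make the sketch repeatable
signal = list(range(22050))      # stand-in for one second of audio at 22.05 kHz
windows = sample_random_windows(signal, window_length=512, rng=rng)
print(len(windows))              # 43 windows of 512 samples each
```

In a setting with several evaluators, each evaluator could call this routine with its own window length, so that each one receives several portions of the generated or of the reference signal.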
As shown in
Upstream and/or downstream to the evaluators, convolutional layers 131 and/or 134 may be provided. An upstream convolutional layer 131 may have, for example, a kernel with dimension 15 (e.g., 5×3 or 3×5). A downstream convolutional layer 134 may have, for example, a kernel with dimension 3 (e.g., 3×3).
During training, a loss function (adversarial loss) 140 may be optimized. The loss function 140 may include a fixed metric (e.g. obtained during a pretraining step) between a generated audio signal (16) and a reference audio signal (104). The fixed metric may be obtained by calculating one or several spectral distortions between the generated audio signal (16) and the reference audio signal (104). The distortion may be measured by taking into account:
In examples, the adversarial loss may be obtained by randomly supplying and evaluating a representation of the generated audio signal (16) or a representation of the reference audio signal (104) by one or more evaluators (132). The evaluation may comprise classifying the supplied audio signal (16, 132) into a predetermined number of classes indicating a pretrained classification level of naturalness of the audio signal (14, 16). The predetermined number of classes may be, for example, “REAL” vs “FAKE”.
Examples of losses may be obtained as
where:
The spectral reconstruction loss Lrec is still used for regularization to prevent the emergence of adversarial artifacts. The final loss can be, for example:
where each i is the contribution at each evaluator 132a-132d (e.g. each evaluator 132a-132d providing a different Di) and Lrec is the pretrained (fixed) loss.
During training, there is a search for the minimum value of L, which may be expressed for example as
Other kinds of minimizations may be performed.
In general terms, the minimum adversarial losses 140 are associated to the best parameters (e.g., 74, 75) to be applied to the stylistic element 77.
In the following, examples of the present disclosure will be described in detail using the accompanying descriptions. In the following description, many details are described in order to provide a more thorough explanation of examples of the disclosure. However, it will be apparent to those skilled in the art that other examples can be implemented without these specific details. Features of the different examples described can be combined with one another, unless features of a corresponding combination are mutually exclusive or such a combination is expressly excluded.
It should be pointed out that the same or similar elements, or elements that have the same functionality, can be provided with the same or similar reference symbols or be designated identically; a repeated description of such elements is typically omitted. Descriptions of elements that have the same or similar reference symbols or are labeled the same are interchangeable.
Neural vocoders have proven to outperform classical approaches in the synthesis of natural high-quality speech in many applications, such as text-to-speech, speech coding, and speech enhancement. The first groundbreaking generative neural network to synthesize high-quality speech was WaveNet, and shortly thereafter many other approaches were developed. These models offer state-of-the-art quality, but often at a very high computational cost and very slow synthesis. An abundance of models generating speech with low computational cost was presented in the recent years. Some of these are optimized versions of existing models, while others leverage the integration with classical methods. Besides, many completely new approaches were also introduced, often relying on GANs. Most GAN vocoders offer very fast generation on GPUs, but at the cost of compromising the quality of the synthesized speech.
One of the main objectives of this work is to propose a GAN architecture, which we call StyleMelGAN (and may be implemented, for example, in the audio generator 10), that can synthesize very high-quality speech 16 at low computational cost and fast training. StyleMelGAN’s generator network may contain 3.86 M trainable parameters, and synthesize speech at 22.05 kHz around 2.6x faster than real-time on CPU and more than 54x on GPU. The model may consist, for example, of eight up-sampling blocks, which gradually transform a low-dimensional noise vector (e.g., 30 in
To summarize, StyleMelGAN is proposed, which is a low-complexity GAN for high-quality speech synthesis conditioned on a mel-spectrogram (e.g. 12) via TADE layers (e.g. 60, 60a, 60b). The generator 10 may be highly parallelizable. The generator 10 may be completely convolutional. The aforementioned generator 10 may be trained adversarially with an ensemble of PQMF multi-sampling random window discriminators (e.g. 132a-132d), which may be regularized by multi-scale spectral reconstruction losses. The quality of the generated speech 16 can be assessed using objective (e.g. Fréchet scores) and/or subjective assessments. Two listening tests were conducted, a MUSHRA test for the copy-synthesis scenario and a P.800 ACR test for the TTS one, both confirming that StyleMelGAN achieves state-of-the-art speech quality.
Existing neural vocoders usually synthesize speech signals directly in time-domain, by modelling the amplitude of the final waveform. Most of these models are generative neural networks, i.e. they model the probability distribution of the speech samples observed in natural speech signals. They can be divided in autoregressive, which explicitly factorize the distribution into a product of conditional ones, and non-autoregressive or parallel, which instead model the joint distribution directly. Autoregressive models like WaveNet, SampleRNN and WaveRNN have been reported to synthesize speech signals of high perceptual quality. A big family of non-autoregressive models is the one of Normalizing Flows, e.g. WaveGlow. A hybrid approach is the use of Inverse Autoregressive Flows, which use a factorized transformation between a noise latent representation and the target speech distribution. Examples above mainly refer to autoregressive neural networks.
Early applications of GANs for audio include WaveGAN for unconditioned speech generation, and Gan-Synth for music generation. MelGAN learns a mapping between the mel-spectrogram of speech segments and their corresponding time-domain waveforms. It ensures faster than real-time generation and leverages adversarial training of multi-scale discriminators regularized by spectral reconstruction losses. GAN-TTS is the first GAN vocoder to use uniquely adversarial training for speech generation conditioned on acoustic features. Its adversarial loss is calculated by an ensemble of conditional and unconditional random windows discriminators. Parallel WaveGAN uses a generator, similar to WaveNet in structure, trained using an unconditioned discriminator regularized by a multi-scale spectral reconstruction loss. Similar ideas are used in Multiband-MelGAN, which generates each subband of the target speech separately, saving computational power, and then obtains the final waveform using a synthesis PQMF. Its multiscale discriminators evaluate the full-band speech waveform, and are regularized using a multi-bandscale spectral reconstruction loss. Research in this field is very active and we can cite the very recent GAN vocoders such as VocGan and HooliGAN.
[Figure residue: labels x, c, γ and β of a styling operation of the form c ⊙ γ + β.]
Training GANs is known to be challenging. Using random initialization of the weights (e.g. 74 and 75), the adversarial loss (e.g. 140) can lead to severe audio artifacts and unstable training. To avoid this problem, the generator 10 may first be pretrained using only the spectral reconstruction loss, consisting of error estimates of the spectral convergence and the log-magnitude computed from different STFT analyses. The generator obtained in this fashion can generate very tonal signals, although with significant smearing in high frequencies. This is nonetheless a good starting point for the adversarial training, which can then benefit from a better harmonic structure than if it started directly from a completely random noise signal. The adversarial training then drives the generation to naturalness by removing the tonal effects and sharpening the smeared frequency bands. The hinge loss 140 is used to evaluate the adversarial metric, as can be seen in equation 1 below.
where x is the real speech 104, z is the latent noise 14 (or more in general the input signal), and s is the mel-spectrogram of x (or more in general the target signal 12). It should be noted that the spectral reconstruction loss Lrec (140) is still used for regularization to prevent the emergence of adversarial artifacts. The final loss (140) is according to equation 2, which can be seen below.
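The equations referred to here are not reproduced in this text. As a hedged illustration only, the sketch below uses the standard hinge formulation of a GAN loss, in which a discriminator D is penalized by max(0, 1 − D(x)) on real speech x and by max(0, 1 + D(G(z; s))) on generated speech, and the generator is trained on the negated discriminator scores plus the reconstruction term Lrec. Whether this matches equations 1 and 2 term by term is an assumption; the function names and the use of plain score lists are hypothetical.

```python
def hinge_loss_discriminator(real_scores, fake_scores):
    """Mean of max(0, 1 - D(x)) over real inputs plus max(0, 1 + D(G(z; s))) over generated ones."""
    real_term = sum(max(0.0, 1.0 - d) for d in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + d) for d in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(fake_scores_per_discriminator, reconstruction_loss):
    """Sum over discriminators D_i of the mean of -D_i(G(z; s)), plus L_rec as regularizer."""
    adversarial = sum(
        -sum(scores) / len(scores) for scores in fake_scores_per_discriminator
    )
    return adversarial + reconstruction_loss


# Scores a single discriminator might assign to two real and two generated windows.
d_loss = hinge_loss_discriminator(real_scores=[2.0, 0.5], fake_scores=[-2.0, 0.0])
print(d_loss)  # 0.75

# Two discriminators scoring generated windows, with L_rec = 0.5.
g_loss = generator_loss([[1.0, 3.0], [-1.0, -1.0]], reconstruction_loss=0.5)
print(g_loss)  # -0.5
```

Real inputs scored above 1 and generated inputs scored below −1 contribute nothing to the discriminator loss, which is the margin behaviour that characterizes a hinge loss.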
Weight normalization may be applied to all convolution operations in G (or more precisely the GAN generator 11) and D (or more precisely the discriminator 100). In experiments, StyleMelGAN was trained using one NVIDIA Tesla V100 GPU on the LJSpeech corpus at 22050 Hz. The log-magnitude mel-spectrograms are calculated for 80 mel-bands and are normalized to have zero mean and unit variance. This is only one possibility of course; other values are equally possible. The generator is pretrained for 100,000 steps using the Adam optimizer with a learning rate (lr_g) of 10^-4 and β = {0.5, 0.9}. When starting the adversarial training, the learning rate of G (lr_g) is set to 5 × 10^-5, and FB-RWDs are used with the Adam optimizer with a discriminator learning rate (lr_d) of 2 × 10^-4 and the same β. The FB-RWDs repeat the random windowing (1 s / window_length) times, i.e. one second divided by the window length, at every training step to support the model with enough gradient updates. A batch size of 32 and segments with a length of 1 s, i.e. one second, for each sample in the batch are used. The training lasts for about one and a half million steps, i.e. 1,500,000 steps.
The following lists the models used in experiments:
Objective and subjective evaluations of StyleMelGAN against the pretrained baseline vocoder models listed above have been performed. The subjective quality of the audio TTS outputs was evaluated via a P.800 listening test performed by listeners in a controlled environment. The test set contains unseen utterances recorded by the same speaker and randomly selected from the LibriVox online corpus. These utterances test the generalization capabilities of the models, since they were recorded in slightly different conditions and present varying prosody. The original utterances were resynthesized using the Griffin-Lim algorithm, and these were used in place of the usual anchor condition. This favors the use of the totality of the rating scale.
Traditional objective measures such as PESQ and POLQA are not reliable to evaluate speech waveforms generated by neural vocoders. Instead, the conditional Fréchet Deep Speech Distances (cFDSD) are used. The following cFDSD scores for different neural vocoders show that StyleMelGAN significantly outperforms the other models.
It can be seen that StyleMelGAN outperforms other adversarial and non-adversarial vocoders.
A MUSHRA listening test with a group of 15 expert listeners was conducted. This type of test was chosen because it allows a more precise evaluation of the quality of the generated speech. The anchor is generated using the PyTorch implementation of the Griffin-Lim algorithm with 32 iterations.
The subjective quality of the audio TTS outputs can be evaluated via a P.800 ACR listening test performed by 31 listeners in a controlled environment. The Transformer.v3 model of ESPNET can be used to generate mel-spectrograms of transcriptions of the test set. The same Griffin-Lim anchor can also be added, since this favors the use of the totality of the rating scale.
The following P.800 mean opinion scores (MOS) for different TTS systems show the similar finding that StyleMelGAN clearly outperforms the other models:
The following shows the generation speed in real-time factor (RTF) with number of parameters of different parallel vocoder models. StyleMelGAN provides a clear compromise between generation quality and inference speed.
The number of parameters and the real-time factors for generation on a CPU (e.g. Intel Core i7-6700 3.40 GHz) and a GPU (e.g. Nvidia GeForce GTX1060) are given here for the various models under study.
Finally,
This work presents StyleMelGAN, a lightweight and efficient adversarial vocoder for high-fidelity speech synthesis. The model uses temporal adaptive denormalization (TADE) to deliver sufficient and accurate conditioning to all generation layers instead of just feeding the conditioning to the first layer. For adversarial training, the generator competes against filter bank random window discriminators that provide multiscale representations of the speech signal in both time and frequency domains. StyleMelGAN operates on both CPUs and GPUs by order of magnitude faster than real-time. Experimental objective and subjective results show that StyleMelGAN significantly outperforms prior adversarial vocoders as well as autoregressive, flow-based and diffusion-based vocoders, providing a new state-of-the-art baseline for neural waveform generation.
To conclude, the embodiments described herein can optionally be supplemented by any of the important points or aspects described here. However, it is noted that the important points and aspects described here can either be used individually or in combination and can be introduced into any of the embodiments described herein, both individually and in combination.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a device or a part thereof corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding apparatus or part of an apparatus or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine-readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine-readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are performed by any hardware apparatus.
The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer. The apparatus described herein, or any components of the apparatus described herein, may be implemented at least partially in hardware and/or in software. The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer. The methods described herein, or any parts of the methods described herein, may be performed at least partially by hardware and/or by software.
The above-described embodiments are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the impending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
Number | Date | Country | Kind |
---|---|---|---|
20202058 | Oct 2020 | EP | regional |
PCT/EP2021/059805 | Apr 2021 | WO | international |
This application is a continuation of copending International Application No. PCT/EP2021/078372, filed Oct. 13, 2021, which is incorporated herein by reference in its entirety, and additionally claims priority from European Application No. EP 20 202 058.2, filed Oct. 15, 2020, and International Application No. PCT/EP2021/059805, filed Apr. 14, 2021, all of which are incorporated herein by reference in their entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/EP2021/078372 | Oct 2021 | WO |
Child | 18300810 | | US |