SYSTEM AND METHODS FOR GENERATING REALISTIC WAVEFORMS

Information

  • Patent Application
  • Publication Number: 20250095663
  • Date Filed: December 03, 2024
  • Date Published: March 20, 2025
Abstract
Systems, apparatuses, and methods directed to training a Generator that is part of a Generative Adversarial Network (GAN) to generate “realistic” examples of a distribution. In some embodiments, this includes training the Generator model to receive a tensor as an input and generate a distribution which is then converted to the frequency domain and used to determine a loss term. Similarly, an actual distribution is also converted to the frequency domain and used to determine a loss term. The loss terms, generated distribution, and actual distribution are provided to a Discriminator which generates loss terms used as part of a backpropagation or feedback mechanism to modify the operation of the Generator and/or Discriminator.
Description
BACKGROUND

Many processes depend on the ability to generate realistic audio waveforms from input data. Examples of such processes include speech synthesis, text-to-speech conversion, speech-to-speech translation, and processing of other forms of time series data. In each of these examples, a waveform may be generated and input to a process or device (e.g., an amplifier, speaker, etc.) that generates audible sounds (such as words) from the waveform. In other examples, a waveform is generated and then subjected to further processing to identify a characteristic of a source or an aspect of a received signal. As these techniques are used in more types of applications and across industries, it can be beneficial to improve the realism and accuracy of the waveforms generated by the processes. In the example of generating speech, this will help to improve the clarity of the speech, as well as drive the adoption of the techniques for use in a greater number and variety of operational environments, such as interactive voice response (IVR) systems.


In some cases, the application or use case may include a process designed to generate a “realistic” waveform from input text (or from another type of input). The process may include use of a trained model that generates an output waveform from an input spectrogram (a representation of the spectrum of frequencies in a signal), where the input results from converting text into an audio signal and then into a spectrogram. In some examples, the output waveform is used to “drive” another device (such as a speaker or a speech translation or conversion process). Although using such a model can be effective, training the model can be difficult because the training process may be very sensitive to the form of the input data and the way that the process determines error and updates the model. This is particularly true when the model is a trained neural network where the weights of connections between nodes in different layers may be susceptible to a choice of loss function or other characteristics of the model.


While attempts have been made at generating more realistic waveforms for use in speech synthesis, these attempts have disadvantages with regard to complexity and/or an inability to create sufficiently realistic waveforms from input data. For example, while some conventional approaches can generate a more realistic or "better" waveform (with regard to some characteristic or metric), they are often unable to produce waveforms that can be used to generate sufficiently realistic speech for a specific application. This is believed to be at least partly because conventional approaches to creating a realistic waveform typically rely on using a generative model with a discriminator, where the discriminator operates to compare a generated waveform to an expected waveform or sample of an actual waveform. However, use of a discriminator in this way introduces potential errors because a discriminator is not optimal for evaluating a waveform: a waveform is a relatively complex source of data, and a waveform often does not represent human hearing as well as other types of data, such as a spectrogram. A result is that if a model for generating a waveform incorporates a discriminator, then it may not be able to generate a sufficiently realistic waveform for some uses.


In general, use of a discriminator as part of a generative system intended to output realistic versions of an input or sample (such as a GAN, a generative adversarial network) is highly dependent upon the type of input data, the representation of the input data, the loss or cost functions or terms used for the Generator and Discriminator, and the form of the Generator output used when determining the loss.


What is desired are systems, apparatuses, and methods for generating more realistic outputs from a generative system or network, such as waveforms that can be used to produce realistic sounds for purposes of speech synthesis and similar applications. Embodiments of the disclosure described herein address this and other objectives both individually and collectively.


SUMMARY

The terms “invention,” “the invention,” “this invention,” “the present invention,” “the present disclosure,” or “the disclosure” as used herein are intended to refer broadly to all the subject matter disclosed in this document, the drawings or figures, and to the claims. Statements containing these terms do not limit the subject matter disclosed or limit the meaning or scope of the claims. Embodiments covered by this disclosure are defined by the claims and not by this summary. This summary is a high-level overview of various aspects of the disclosure and introduces some of the concepts that are further described in the Detailed Description section below. This summary is not intended to identify key, essential or required features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification, to any or all figures or drawings, and to each claim.


In some embodiments, the systems, apparatuses, and methods described herein are directed to generating waveforms that can be used to produce more realistic sounding speech when used in a speech synthesis or text-to-speech application. In some embodiments, this includes training a neural network model (referred to as a Generator) that is part of a generative adversarial network (GAN) to generate the waveform from an input spectrogram, while implementing a specific set of processing stages to determine a loss term or terms for use in improving the model during the training process.


A generative adversarial network (GAN) is a class of machine learning frameworks wherein two neural networks contest with each other in the form of a zero-sum game, where one agent's/network's gain is the other agent's/network's loss. Given a set of training data, this technique learns to generate new data with the same statistics (i.e., distribution) as the training data. Though originally proposed as a form of generative model for unsupervised learning, GANs have also proved useful for semi-supervised learning, fully supervised learning, and reinforcement learning (See Wikipedia entry for GAN).


The loss term or terms are used with a back-propagation process (or a different technique for performing a similar feedback process of modifying weights) to update the weights of the model during a training cycle. In one non-limiting example use case, text is converted to a spectrogram and the spectrogram is converted to a waveform by a trained model. The output of the trained model may then be used to drive a transducer that converts the waveform to audible sound, such as speech. The goal is for the trained model to be able to generate/output a waveform that is sufficiently realistic to be used to convincingly represent the desired source (typically a person, or a person with specific speech characteristics, but the source could also be an animal, a siren, or a naturally occurring sound).


In some embodiments, the input to the model being trained (such as the Generator in a GAN) and ultimately to a trained model as part of an inference process may be a tensor that is generated from an image, a spectrogram, an electromagnetic signal, a set of numbers, or a set of values generated by encoding other data, as non-limiting examples. The tensor elements may represent specific characteristics of the image or other source of the input data and may be the result of processing the input data using a suitable technique such as sampling, filtering, thresholding, mapping, or application of a specific mathematical model or technique. In a general sense, the input tensor is a signal, that is, a measurable variation that represents information as a function of time or space. A process for generating new, realistic versions of this signal is represented by one or more embodiments of this disclosure.


For example, an image may be sampled and its intensity or color (in terms of frequency or RGB level) for each sampled pixel or section may be represented as elements of a tensor used as an input. In this example, the GAN or other form of generative model is being used to generate similar images that would not be easily distinguishable by a viewer. In this use, the output of the GAN would be a tensor that is then converted to an image by constructing pixels or image segments from the tensor values.


Similarly, for an input in the form of a time-varying signal, the signal may be sampled and a characteristic (intensity, amplitude, frequency, or phase, as non-limiting examples) represented as a set of elements in a tensor. The GAN or other form of generative model will then produce signals with a similar characteristic. This may be useful in evaluating the ability of a system to detect a specific type of signal and in response, to take an appropriate action. In this use, the output of the GAN would be a tensor that is then converted to a signal by constructing wave phenomena that have characteristics based on the tensor values.


In some embodiments, the training approach is used to produce a trained spectrogram-to-waveform model for use with an end-to-end text-to-speech system. In this use case, text may be converted to speech, with the audible form of the speech processed into a spectrogram. The trained model then generates waveforms that may be used to drive an audio transducer, thereby producing audible sounds. In one example, text may be converted to a spectrogram by a separate model, such as Tacotron-2, which can be used to produce spectrograms from input text using an encoder-decoder architecture. Additional sources of input data and possible uses of the output of the trained model are disclosed and/or described herein.


In some embodiments, the generative model may be what is termed an unconditional model; in one example, the input tensor may be random numbers. In the case of a conditional model, the input tensor values represent data that drives the processing of the input and controls generation of the output with respect to one or more characteristics.


In some embodiments disclosed and/or described herein, the model being trained uses a spectrogram as an input for driving and controlling the output. In other embodiments, a conditional model may use text, a label, a pitch/loudness feature of an audio waveform, or a characteristic of an image or signal, as further non-limiting examples. In such examples, a tensor serving as the conditional input to the Generator may function as a "prompt" to the model to cause or guide production of a desired output. Note that in an image generation context, it is most common to condition a model on an "image class" (such as a label for a set of objects) and/or by using "random noise".


Other objects and advantages of the systems, apparatuses, and methods disclosed and/or described herein may be apparent to one of ordinary skill in the art upon review of the detailed description and the included figures. Throughout the drawings, identical reference characters and descriptions indicate similar, but not necessarily identical, elements. While the embodiments disclosed or described herein are susceptible to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are described in detail herein. However, embodiments of the disclosure are not limited to the exemplary or specific forms described. Rather, the present disclosure covers all modifications, equivalents, and alternatives falling within the scope of the appended claims.





BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the system and methods in accordance with the present disclosure will be described with reference to the drawings, in which:



FIG. 1 is a block diagram illustrating the primary components or elements of a conditional GAN (generative adversarial network);



FIG. 2(a) is a block diagram illustrating the primary functions or operations implemented by an example embodiment of the disclosed system, apparatus, and method for training a GAN model (labeled the Generator Model in the figure) to generate a “realistic” waveform in the situation of an expected form of output;



FIG. 2(b) is a block diagram illustrating the primary functions or operations implemented by an example embodiment of the disclosed system, apparatus, and method for training a GAN model (labeled the Generator Model in the figure) to generate a “realistic” waveform in the situation of a random output;



FIG. 2(c) is a block diagram illustrating the primary functions or operations implemented by an example embodiment of the disclosed system, apparatus, and method for training a GAN model (labeled the Generator Model in the figure) to generate a “realistic” waveform in the situation of a conditional random output;



FIG. 2(d) is a block diagram illustrating the primary functions or operations implemented by an example embodiment of the disclosed system, apparatus, and method for training a GAN model (labeled the Generator Model in the figure) to generate a “realistic” waveform in the situation of using a pre-trained Discriminator;



FIG. 2(e) is a table or chart illustrating a set of value functions that may be used in training a model that is part of a GAN architecture for the indicated type of GAN;



FIG. 3(a) is a flowchart or flow diagram illustrating a process, method, set of operations, or set of functions for training a model, such as a Generator that is part of a GAN architecture to generate a “realistic” waveform, in some embodiments;



FIG. 3(b) is a flowchart or flow diagram illustrating a process, method, set of operations, or set of functions for training a model, such as a Generator that is part of a GAN architecture to generate examples of an input tensor, in some embodiments;



FIGS. 4(a) and 4(b) are diagrams illustrating elements or components that may be present in a computer device or system configured to implement a method, process, function, or operation in accordance with some embodiments; and



FIGS. 5-7 are diagrams illustrating an architecture for a multi-tenant or SaaS platform that may be used in implementing an embodiment of the systems and methods described herein.





Note that the same numbers are used throughout the disclosure and figures to reference like components and features.


DETAILED DESCRIPTION

The subject matter of embodiments of the disclosure is described herein with specificity to meet statutory requirements, but this description is not intended to limit the scope of the claims. The claimed subject matter may be embodied in other ways, may include different elements or steps, and may be used in conjunction with other existing or later developed technologies. This description should not be interpreted as implying any required order or arrangement among or between various steps or elements except when the order of individual steps or arrangement of elements is explicitly noted as being required.


Embodiments of the disclosure will be described more fully herein with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, exemplary embodiments by which the disclosure may be practiced. The disclosure may, however, be embodied in different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy the statutory requirements and convey the scope of the disclosure to those skilled in the art.


Among other things, the present disclosure may be embodied in whole or in part as a system, as one or more methods, or as one or more devices. Embodiments of the disclosure may take the form of a hardware implemented embodiment, a software implemented embodiment, or an embodiment combining software and hardware aspects. For example, in some embodiments, one or more of the operations, functions, processes, or methods described herein may be implemented by one or more suitable processing elements (such as a processor, microprocessor, CPU, GPU, TPU, controller, etc.) that is part of a client device, server, network element, remote platform (such as a SaaS platform), an “in the cloud” service, or other form of computing or data processing system, device, or platform.


The processing element or elements may be programmed with a set of executable instructions (e.g., software instructions), where the instructions may be stored on (or in) one or more suitable non-transitory data storage elements. In some embodiments, the set of instructions may be conveyed to a user through a transfer of instructions or an application that executes a set of instructions (such as over a network, e.g., the Internet). In some embodiments, a set of instructions or an application may be utilized by an end-user through access to a SaaS platform or a service provided through such a platform.


In some embodiments, one or more of the operations, functions, processes, or methods described herein may be implemented by a specialized form of hardware, such as a programmable gate array, application specific integrated circuit (ASIC), or the like. Note that an embodiment of the inventive methods may be implemented in the form of an application, a sub-routine that is part of a larger application, a “plug-in”, an extension to the functionality of a data processing system or platform, an API, or other suitable form. The following detailed description is, therefore, not to be taken in a limiting sense.


In some embodiments, the systems, apparatuses, and methods disclosed and/or described herein are directed to generating waveforms that can be used to produce more realistic sounding speech when used in a speech synthesis or text-to-speech application. In some embodiments, this includes training a neural network model (referred to as a Generator) that is part of a generative adversarial network (GAN) to generate the waveform from an input spectrogram, while implementing a specific set of processing stages to determine a loss term or terms used as a control signal to modify a Generator Model. The input spectrogram may be generated from an example or examples of sounds produced by speech based on text. In some embodiments, a model such as Tacotron-2 may be used to produce spectrograms from input text using an encoder-decoder architecture.
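For illustration only, a minimal sketch of such an end-to-end text-to-spectrogram-to-waveform chain is shown below. It assumes the pre-trained Tacotron2/WaveRNN pipeline bundle distributed with torchaudio as a stand-in; in an embodiment of the disclosure, a Generator trained as described herein would take the place of the bundled vocoder.

```python
# Illustrative end-to-end text-to-speech chain (not the disclosed system):
# text -> Tacotron2 mel spectrogram -> vocoder waveform, using torchaudio's
# pre-trained pipeline bundle as a stand-in for the trained Generator.
import torch
import torchaudio

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()  # text -> token tensor
tacotron2 = bundle.get_tacotron2()       # tokens -> mel spectrogram
vocoder = bundle.get_vocoder()           # mel spectrogram -> waveform

with torch.inference_mode():
    tokens, lengths = processor("Hello world, this is a test.")
    mel, mel_lengths, _ = tacotron2.infer(tokens, lengths)
    waveform, _ = vocoder(mel, mel_lengths)

torchaudio.save("output.wav", waveform[0:1].cpu(), sample_rate=vocoder.sample_rate)
```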


The loss term(s) may include one or more of a Generator loss term, a Discriminator loss term, and an additional loss term introduced as part of more accurately representing source data. The loss term or terms are used with a back-propagation process (or a different technique for performing a similar feedback process of modifying weights) to update the weights of the model during a training cycle. In some embodiments, the Generator and additional loss terms are used to modify the Generator neural network weights, and the Discriminator loss term is used to modify the Discriminator neural network weights.


In one embodiment, the output of the trained model may be used to drive a transducer that converts the waveform to audible sound(s). If the Generator Model output is not a waveform for use in generating audible sounds, then the model output may be used as a basis to construct the desired form, such as an image, a time-series signal, or other format.


As mentioned, conventional approaches to generating more realistic waveforms used to produce audible sounds suffer from limitations or disadvantages. This is believed to be the result (at least in part) of the following two factors:

    • Conventional solutions try to have a model extract and learn relevant features to discriminate between real and fake waveforms. However, a waveform is a physical representation of sound. Unlike a spectrogram, a waveform does not represent human hearing well or accurately enough for some purposes. In these applications (as recognized by the inventor), a spectrogram may be preferred because human hearing is based on a process similar to a real-time spectrogram encoded by the cochlea of the inner ear;
      • Similarly, for other forms of data expressed as an input tensor, there may be a format or representation that “better” represents how such data is perceived or understood by a person or machine—in such cases, conversion of the data to another form may be helpful in training a model to generate more realistic examples of that data. In such cases, an appropriate error term may also need to be introduced as a control signal for the Generator Model;
    • A waveform tends to be high-fidelity. It is conventional to have 24,000 samples per second for a 24 kHz audio waveform. A conventional approach to developing a model will struggle (and may not be capable within a reasonable amount of computational effort) to learn patterns with this number of data points.


Due to the limitations inherent in using a conventional generative model to generate realistic waveforms for speech synthesis or text-to-speech applications, it may be desirable to use a spectrogram as an input. While a spectrogram is believed to represent human hearing well, approaches for generating waveforms from spectrograms using a trained generative model with a spectrogram loss term may not produce sufficiently realistic waveforms.


This is believed to be because using a measure of the spectrogram loss (i.e., a measure of how similar two waveforms will “sound” to a person or be perceived) in training a model does not produce a sufficiently accurate output. This may occur because spectrogram loss tends to prioritize large discrepancies between an original and a predicted or expected spectrogram. Unfortunately, this means that small discrepancies are largely ignored, which can result in a faint and unrealistic buzzing in a final audio output.


In some conventional approaches a spectrogram loss is not used, and the audio is represented strictly as a waveform. This results in poor performance because waveforms are not a good perceptual representation of how humans hear sound, and thus results in unrealistic-sounding audio. In comparison to these methods, the embodiments' use of a spectrogram and spectrogram loss greatly improves the representation of the audio and thus also greatly improves the result.


In some conventional approaches a spectrogram loss is used but it is not used in the context of a generative model. As a result, one may encounter the problems described herein, namely, “small discrepancies are largely ignored, which can result in a faint and unrealistic buzzing in the final audio output”.


In contrast, in the disclosed embodiments, use of a spectrogram loss maintains the perceptual improvement within the model, while use of a generative architecture with a Discriminator pushes the Generator to produce audio that is more difficult to distinguish from real audio; the result is more realistic audio with substantially less buzzing. In this sense, reference herein to "spectrogram loss" may refer to the same loss as in conventional approaches; a distinction, however, is in how this loss contributes to the training of a model.


Even if applied to other types of inputs aside from spectrograms, use of a spectrogram loss measure may result in a waveform that is not suited for an intended use. This is another indication of the importance of how a loss term is specified and used as part of a GAN or other generative process in which feedback is used to control or modify the generative process performed by the Generator Model in cooperation with the Discriminator.



FIG. 1 is a block diagram illustrating the primary components or elements of a conditional GAN 100 (generative adversarial network) that may be used in training a model, as part of a process to implement an embodiment of the disclosure. An aspect of a generative adversarial network (GAN) is the use of “indirect” training through a “Discriminator”, which itself is being updated dynamically. The result is that the Generator network, element, or process of the GAN is not trained to minimize an error metric, but to “fool” the Discriminator network, element, or process. In one sense, the Generator is being trained to produce more and more realistic data samples until the Discriminator reaches a desired level of confidence that the generated samples are “real” data. A loss or cost term may be used as part of the training of each of the respective Generator and Discriminator networks or processes.


In operation, the generative network generates candidate examples of data while the discriminative network evaluates them. Typically, the generative network learns to map from a latent space to a data distribution of interest, while the discriminative network distinguishes candidates produced by the generator (the generative network model) from the true (the actual) data distribution. The generative network's training objective is to increase the error rate of the discriminative network, that is to “fool” the discriminator network by producing novel candidates that the Discriminator thinks are not synthesized and are part of the true data distribution.


A known dataset may serve as initial training data for the Discriminator. Training involves presenting the Discriminator with samples from the training dataset, until it achieves acceptable accuracy. The Generator trains based on whether it succeeds in fooling the Discriminator. The Generator may be seeded with randomized input that is sampled from a predefined latent space (e.g., a multivariate normal distribution). Thereafter, candidates synthesized by the Generator are evaluated by the Discriminator. Independent backpropagation procedures are applied to both networks so that the Generator produces better samples, while the Discriminator becomes more skilled at flagging synthetic samples. When used for image generation, the Generator is typically a deconvolutional neural network, and the Discriminator is a convolutional neural network (See Wikipedia entry for GAN and related topics for additional information).
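As a concrete illustration of this alternating procedure, a minimal PyTorch training loop is sketched below. The two small fully-connected networks and the synthetic "real" data are placeholders chosen for brevity; they are not the architecture of any disclosed embodiment.

```python
# Minimal sketch of alternating GAN updates with placeholder networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, data_dim, batch = 64, 128, 32
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

for step in range(1000):
    real = torch.randn(batch, data_dim) * 2.0 + 1.0  # stand-in for real samples
    z = torch.randn(batch, latent_dim)               # latent variable "Z"
    fake = G(z)

    # Discriminator update: label real samples 1 and synthesized samples 0.
    d_loss = (F.binary_cross_entropy_with_logits(D(real), torch.ones(batch, 1))
              + F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(batch, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: reward fooling the (just-updated) Discriminator.
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```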


A “Random Noise” input is usually termed a “Latent Variable” or “Z”. The “C” in FIG. 1 is the “conditional,” and may be a term, label, classification, category, or item of data (as non-limiting examples). During image (as an example) generation, it is typical for the “conditional” to be an “image class” (i.e., a dog, cat, table, or cup as examples).


As suggested by the figure, a Generator G 102 produces a Generated Sample 104 (e.g., a waveform, image, or other form of input). A Real Sample (e.g., an actual waveform, image, or other form of input) 106 and the Generated Sample are input to a Discriminator D 108. The Discriminator's task is to determine if the Generated Sample is real or fake (i.e., does it satisfy the conditional, or was it generated and hence “fake”). This is based, in the case of a conditional GAN, on using the conditional “C” as an input to the Discriminator. Based on the output of the Discriminator, the operation of the Generator and/or the Discriminator may be modified through a feedback or training loop 110, such as by using a backpropagation technique.


Backpropagation adjusts each weight in a network of layers (such as a convolutional neural network) by calculating the weight's impact on the output—that is, how the output would change if the weight(s) are changed. The impact of a Generator weight depends on the impact of the Discriminator weights it feeds into. In this application of backpropagation or other form of feedback control, backpropagation typically starts at the output and flows back through the Discriminator into the Generator.


Note that a Discriminator typically does not operate as a binary or other form of classifier, such as to classify an output of a model as either real or “fake”. More generally, a Discriminator is a model of a real (actual) data distribution and is used to evaluate additional data points that are generated by the Generator. A Generator model uses the Discriminator's evaluation to improve its ability to generate examples which could be considered part of the real data distribution that the Discriminator is modeling.


As will be described with reference to FIGS. 2(a) through 2(d), the disclosed process for training a Generator model to generate more “realistic” waveforms (as an example) may be implemented as part of multiple GAN designs or architectures. The GAN architectures may vary based on the form of conditional provided to the Generator and Discriminator, and the form of input to the Generator.



FIG. 2(a) is a block diagram illustrating the primary functions or operations implemented by an example embodiment of the disclosed system, apparatus, and method for training a GAN model (labelled the Generator Model 208 in the figure) to generate a more “realistic” waveform in the situation in which the output is of a form or type that is generally expected or known. For example, in text-to-speech, one has a concrete expectation of the text that is read and the speaker who is speaking to create an audible sound. As shown in the figure, the overall architecture or system includes a training pipeline 202 and an inference (or production) pipeline 204.


When trained, a Generator Model 230 (which may be a recurrent neural network (RNN) or convolutional neural network (CNN), as non-limiting examples) operates on input data 228 in the form of a tensor (which in one example, may have been obtained from a spectrogram generated by a text-to-spectrogram processing technique or model) to generate an output tensor 232 (which in one example, may represent or be convertible into an audio waveform). The output waveform generated by the model is such that when it is used to drive an audio transducer (e.g., a speaker in this example), the sounds a person hears will be perceived as a more realistic representation of the input to the inference pipeline than are the sounds produced using conventional approaches to training a Generator.


This is because, at least in part, the system and methods disclosed and/or described incorporate a stage to determine an error or control signal for the training process from both an adversarial (i.e., Generator and Discriminator) based loss and a spectrogram-based (or Lp-Norm, as an example) loss measure where, for a real number p≥1, the p-norm or Lp-norm of x is defined by:





∥x∥p = (|x1|^p + |x2|^p + … + |xn|^p)^(1/p).
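A direct translation of this definition into code, applied to the elementwise difference between two spectrogram tensors, might look as follows (a sketch; p = 1 and p = 2 correspond to the L1 and L2 losses referenced later in connection with FIG. 2(a)):

```python
import torch

def lp_norm_loss(fake_spec: torch.Tensor, real_spec: torch.Tensor,
                 p: float = 1.0) -> torch.Tensor:
    """Lp-norm of the elementwise difference between two spectrograms."""
    diff = (fake_spec - real_spec).flatten()
    return diff.abs().pow(p).sum().pow(1.0 / p)
```

An equivalent result is obtained with torch.linalg.vector_norm(diff, ord=p).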


The improved results obtained from implementing an embodiment of the disclosure are due, at least in part, to incorporating knowledge regarding the way that the output is used or perceived into the training process. In one embodiment, this includes representing or converting inputs to spectrograms and then including a loss term applicable to spectrograms in the feedback or backpropagation loop. In some cases, multiple loss terms or other forms of loss terms may need to be utilized to obtain a desired level of performance for the Generator during the training process.


As noted, the adversarial loss term (222 in the figure) represents both the Generator loss and the Discriminator loss. In one embodiment, the Generator loss and spectrogram loss terms are used to modify the operation of the Generator Model, while the Discriminator loss term is used to modify the operation of the Discriminator.


As part of the disclosed model training process, the output waveform (i.e., a distribution of values as a function of time in one non-limiting example) from the Generator model being trained (indicated as “Fake Signal (Tensor) 210” in the figure) is processed using a Short-time Fourier Transform 214 (or other suitable signal processing technique to transform or convert the output to the frequency domain, such as a cosine transform or other form of Fourier Transform) to generate a spectrogram. A Short-time Fourier transform (STFT) is a Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time. In practice, a procedure for computing STFTs is to divide a longer time signal into shorter segments of equal length and then compute the Fourier transform separately on each shorter segment. This reveals the Fourier spectrum on each shorter segment (See Wikipedia entry).
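A sketch of this stage using PyTorch's built-in STFT is shown below; the window length and hop size are illustrative values (at 24 kHz, n_fft = 1024 corresponds to roughly 43 ms analysis segments), not parameters mandated by the disclosure.

```python
import torch

def to_spectrogram(waveform: torch.Tensor, n_fft: int = 1024,
                   hop: int = 256) -> torch.Tensor:
    """Convert a (batch, samples) waveform tensor to a magnitude spectrogram."""
    window = torch.hann_window(n_fft, device=waveform.device)
    stft = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    return stft.abs()  # keep the magnitude; phase is discarded
```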


As mentioned, a spectrogram is believed to be a better representation of how human hearing perceives sounds. This provides a motivation for conducting the training process using input data that is converted to the form of a spectrogram prior to being introduced to the Discriminator. This also provides a motivation for selecting an appropriate form of loss term to include as part of a feedback control or backpropagation process that impacts the operation of the Generator.


Similarly, an example of a real/expected signal (indicated as “Real Signal (Tensor) 212” in the figure, and which may be an audio segment represented as a waveform and expressed as a tensor) is also processed by a Short-time Fourier Transform 216 for purposes of generating a spectrogram that may be compared to the spectrogram obtained from transforming the waveform generated by the Generator model 208 (i.e., Fake Signal 210).


The spectrograms that result from application of the Fourier or other transform to each of the real and fake waveforms (which are expressed as a tensor 212 for the real signal/waveform and derived from a tensor 210 for the fake signal/waveform) are then used to determine an Lp-Norm loss term 224 (via Lp-Norm Loss Function 220) and are also used as inputs to Discriminator Model 218. Discriminator Model 218 generates an Adversarial (i.e., Generator and Discriminator) Loss term 222 (which as described herein, may take one of multiple forms), which is combined with Lp-Norm Loss 224 to produce Total Loss term 226. Total Loss term 226 may then be used (in whole or in part) as a control signal in a feedback loop or backpropagation process to modify the weights assigned to connections between nodes in a neural network or other layers of Generator Model 208 and/or Discriminator Model 218. In one embodiment, the Generator and Lp-Norm loss terms are used to modify the operation of the Generator Model, and the Discriminator loss term is used to modify the operation of the Discriminator.
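Reusing the to_spectrogram and lp_norm_loss helpers sketched above, one possible realization of a single training cycle of this flow is shown below. The binary cross-entropy adversarial term and the weighting constant are illustrative assumptions; FIG. 2(e) lists other value functions the disclosure contemplates.

```python
import torch
import torch.nn.functional as F

def training_step(generator, discriminator, opt_g, opt_d,
                  input_spec, real_wave, lambda_lp: float = 45.0):
    """One training cycle: adversarial loss plus Lp-Norm (spectrogram) loss.

    lambda_lp weights the Lp-Norm term against the adversarial term; the
    value 45.0 is an illustrative placeholder, not taken from the disclosure.
    """
    fake_wave = generator(input_spec)       # "Fake Signal (Tensor)" 210
    fake_spec = to_spectrogram(fake_wave)   # STFT 214
    real_spec = to_spectrogram(real_wave)   # STFT 216

    # Discriminator loss term: used to modify the Discriminator only.
    d_real = discriminator(real_spec)
    d_fake = discriminator(fake_spec.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator loss term plus Lp-Norm Loss 224 -> Total Loss 226.
    g_adv = discriminator(fake_spec)
    g_loss = (F.binary_cross_entropy_with_logits(g_adv, torch.ones_like(g_adv))
              + lambda_lp * lp_norm_loss(fake_spec, real_spec, p=1.0))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```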


The Adversarial loss term 222 represents an approximation to the error between the Generator Model 208 output and the real data distribution (such as a spectrogram of a real audio signal obtained from input 212). In the discipline of Deep Learning, a loss term typically is represented by a single number or a Tensor. There are multiple ways to calculate an Adversarial loss term, with non-limiting examples shown in FIG. 2(e). Although the Lp loss term can be combined with the Adversarial loss term using multiple types of mathematical operations, typically it is combined using an addition operation between two loss term values. Another example of a suitable combination is a weighted sum or weighted product.


As has been described, an aspect of the disclosed Generator model training pipeline is to utilize a Fourier transform before applying the Discriminator model and then use the Discriminator to evaluate if a generated "fake" waveform is sufficiently "realistic". The Discriminator output is then used to determine an Adversarial loss term which serves as one contribution to a total loss term or factor, with the other part of the total loss factor being an Lp-Norm-based loss derived from considering the transformed representations of a "real" and a "fake" signal in the form of spectrograms.


In some embodiments, the Discriminator neural network weights may be updated as part of a feedback mechanism from processing both “fake” and “real” waveforms that have been transformed by a Fourier transform into the frequency domain. In one sense, the Discriminator uses a “real” signal in order to update its understanding of the “true data distribution”. It then uses updated weights to evaluate the “fake” signal, and to produce a loss value (or term used in a total loss) for updating the Generator model. As noted, the signal input to the Discriminator is a representation in the frequency space or domain.


In the inference pipeline 204, the input tensor 228 may represent a signal, a spectrogram generated from a text-to-spectrogram process (such as the mentioned Tacotron-2), a sampled or processed image, a processed waveform, or a set of time-series data, as non-limiting examples. In the case of a waveform generated from a spectrogram input to a model that is trained using the disclosed technique, the output tensor from the Generator model represents a more “realistic” waveform (after conversion from the output tensor to a waveform that may be used to drive a transducer). This enables creation of sounds more similar to how a person would hear or perceive the input (such as text) if spoken by a real person. This has a benefit of producing speech based on text that is more likely to be accepted and understood by a human listener, and without the potentially jarring impact of a mechanical or “fake” sounding voice.


In one sense, the output of a Generator model trained using an embodiment of the disclosed process (which may be represented as or converted to the form of a spectrogram, waveform, time-series data, or image, as non-limiting examples) is more “realistic” in that the Discriminator is unable (or less able) to tell the difference (i.e., to discriminate) between a generated output (that is a “fake” output) and a real example of the output generated from an actual person speaking (or depending on the input format, a sampled image of a real object, an actual waveform, or set of signals, as non-limiting examples). As noted, in some embodiments, the intermediate data format of a spectrogram serves as a proxy for the input and one that better represents how a person would perceive sounds.


The Discriminator in the disclosed and/or described system operates to make a waveform (as an example output or form obtained from the output) generated by the Generator Model more similar to one that, when converted to sounds, would be perceived as "realistic" by the human ear and by the way the human brain processes sound. From one perspective, the disclosed and/or described system and methods train a model to generate a more realistic sounding waveform by using the intermediate stage of converting data to a spectrogram to control the training process (such as by being used in determining loss signals or factors).


The approach of using an intermediary form of a spectrogram is implemented in some embodiments because (similar to the operation of a human ear) a spectrogram converts a waveform into different frequency intervals and their corresponding loudness (intensity) and is believed to better represent the response of human hearing to an input sound. By using a comparison between a spectrogram representation of both a real and a generated waveform (through the use of a suitable intermediary process, such as a transform), and the two forms of loss (spectrogram/Lp Norm and Adversarial), the training process can produce a trained Generator Model that more accurately generates waveforms that correspond to how a person perceives sounds.


In some embodiments, the disclosed approach may be used for a model (such as a Generator in a GAN) which translates an input into an output that can then be processed with a Fourier transform, cosine transform or similar signal processing technique to transform or convert the output to the frequency domain. In general, the output of the Generator is a “signal” represented as a tensor that can be processed further by application of a Fourier Transform or other mathematical operation. In one embodiment, the model takes a spectrogram as an input (after representing it as a tensor) and predicts a waveform (in the form of a tensor) as the output. The waveform is processed by a Fourier transform to produce a spectrogram. The produced spectrogram is then passed to a Discriminator.


In some embodiments and when used as part of a system to generate more realistic audio segments, the training pipeline 202 may include the following stages or steps:

    • Provision of a set of training data, where each example of training data is a spectrogram and is used as an input to a model (indicated as Generator Model 208 in the figure) being trained;
      • In some embodiments, this may be a spectrogram (represented as a tensor) generated by a text-to-spectrogram model (such as Tacotron-2, which may be used to produce spectrograms from input text using an encoder-decoder architecture);
      • In other embodiments, the input may be in the form of a tensor that represents an image, a set of values obtained by processing an image (such as a tensor formed from values of a characteristic of each pixel or sub-unit of an image), or a set of values obtained from a source (such as a tensor formed from sampling a signal at a sequence of time intervals), as non-limiting examples;
        • In general, the input to the model may be expressed as a tensor 206;
        • As mentioned, the input tensor is a set of values representing the distribution or variation of a signal over time or space. The tensor values can be (re)constructed into a waveform, image, time-series, or other form as needed;
        • Transformation to the frequency domain (via the Fourier or other transform) enables a more accurate determination of the accuracy or “realness” of the Generator output, particularly for the use case of generating sounds, as patterns may be easier to identify in the frequency domain as compared to the time domain;
        • The model being trained 208 may be in the form of a Deep Neural Network (DNN), such as a recurrent neural network (RNN) or a convolutional neural network (CNN), as non-limiting examples;
          • In one example embodiment, the DNN is a spectrogram-to-waveform model, such as the MelGAN generator model (described in MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis, Kumar et al., arXiv:1910.06711);
          • In the case of a different type or form of data, the DNN may be of a type more commonly used for that type of data, such as a RNN for time-series data or a CNN for image data;
    • The Generator Model 208 generates an output tensor (a “fake” or generated example of the input tensor, such as a tensor representation of a signal, waveform, or image) 210 from the input tensor 206. In one example, tensor 210 represents (i.e., may be converted into) a waveform of the type represented by the input 206 (where, as mentioned, the input may be a spectrogram that represents a waveform that may be used to generate audible sounds);
    • To improve the Generator Model 208 (i.e., to set and modify weights associated with connections between nodes in different layers of the model), a comparison is performed between the output tensor 210 and a tensor representation of an expected or real waveform 212, again represented in the form of a tensor containing a distribution of values as a function of time or space;
      • In some embodiments, the expected (real) signal 212 is a tensor representing a waveform obtained from a text-to-speech dataset, for example LJ Speech (a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books, where a transcription is provided for each clip);
        • In the case of other uses or types of input data, the expected (real) signal 212 is a tensor representing an example of the real class of images or signals, as non-limiting examples. The expected or “real” signal 212 may be obtained from a database of examples (such as images of dogs converted to or represented as a tensor) corresponding to the class or category of source data being used;
      • In some embodiments, this comparison is used to determine the error or loss between the generated signal tensor 210 and the real signal tensor 212;
        • The error or loss term(s) are used in whole or in part as a control signal or signals in an adaptive feedback loop or backpropagation process to modify the weights between nodes in the neural network layers used to represent the Generator Model and/or Discriminator Model (though not shown in FIG. 2(a));
        • In some embodiments, prior to determining the error or loss term(s) contributing to a total loss term 226, each of the generated signal tensor (representing the generated waveform) 210 and the real signal tensor 212 are subjected to further processing;
          • In some embodiments, this further processing is a short-time Fourier Transform (STFT), illustrated as functions, operations, or modules 214 and 216);
          •  An STFT processes the signal in chunks, usually a chunk of length 50 milliseconds. A Fourier Transform would process the entire signal at one time. In text-to-speech, where there are a number of distinctive sounds made, it is preferable to process the signal in chunks. In a signal that is continuous, such as an image, a regular Fast Fourier Transform (FFT) can be used. A difference between using the two forms of Fourier Transform is that the error term is calculated based on either distinctive intervals or on the entire signal. Note that an STFT can become a form of FFT if the window length is long enough;
          •  This processing operates to convert or transform each of the tensor representations of waveforms (one obtained from a sample of real data and the other produced by the Generator Model) into a spectrogram, which as mentioned is a more accurate representation of human hearing or human ear responsiveness;
          • In some embodiments, the spectrogram(s) may be further processed into a Mel Spectrogram, log Mel Spectrogram, or mel-frequency cepstrum (a sketch of this conversion follows this list);
          •  The Mel Scale, mathematically speaking, is the result of a non-linear transformation of the frequency scale. The Mel Scale is constructed such that sounds of equal distance from each other on the Mel Scale also "sound" to humans as if they are equal in distance or difference from one another;
          • For other types of data, the generated signal tensor (representing the generated waveform) 210 and the real signal tensor 212 may be converted or transformed into a form that represents a more accurate way in which that type of data is perceived or used in further processing stages (such as by filtering, adjusting, or enhancing specific frequencies, altering signals in a time domain, changing pixel contrasts or color, or other feature);
          •  The conversion, transformation, or other signal processing operations applied to generated signal 210 and to real signal 212 operate to represent the signals in the frequency domain;
        • The transformed waveforms (the output of processing stages 214 and 216) are then used to determine two types of “loss”, with the components of a total loss 226 representing a difference or control signal for adjusting the weights of the deep neural network(s) used to represent a Generator and a Discriminator;
          • In some embodiments, each of the transformed outputs (e.g., the spectrograms generated by processes 214 and 216) is provided as an input to both a process to determine an Lp loss term (such as L1 or L2 loss) 220 and to a Discriminator Model 218 (which in some cases may be characterized as operating as a form of binary classifier to determine if an input is real or fake);
          • The resulting output of the process 220 to determine L1 or L2 loss represents the Lp-Norm (which may be referred to as the spectrogram loss, for example) 224 and the resulting output of the Discriminator is represented as Adversarial Loss 222;
          •  As noted, instead of (or in addition to) L1 or L2 loss, another form of Lp-norm loss may be used;
          •  The Lp Norm Loss Function may be replaced by a different function or process to determine a suitable loss term, depending upon the form of source data and/or the type of processing applied to the tensors at stages 214 and 216;
          •  In one embodiment, different forms of Lp or other type of loss may be used to modify the Generator, with the form of loss term used depending on the performance of the Generator for the given input or distribution of values;
          • The Lp-Norm Loss 224 and Adversarial Loss 222 are combined to generate the Total Loss 226 from the training cycle (as mentioned, total loss is typically calculated using an addition operation, although other combination methods may be used, such as a weighted sum, a fit to a polynomial, or a fit to a formula, as non-limiting examples);
        • The Total Loss 226 term or its individual components are then used to update or modify the weights set during the training cycle, typically through backpropagation (although other adaptive or feedback control techniques may be used);
          • In one embodiment, the Adversarial loss term comprises a Generator loss and a Discriminator loss. The combination of the Generator loss and the Lp-Norm loss (or other loss term corresponding to the processing implemented at stages 214 and 216) is used to modify the operation of the Generator, while the Discriminator loss term is used to modify the operation of the Discriminator.
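As referenced in the list above, a sketch of the optional Mel-scale refinement of the spectrogram stage, using torchaudio with illustrative parameters for 24 kHz audio (the parameter values are assumptions for illustration, not values from the disclosure):

```python
import torch
import torchaudio

# Project a waveform onto the Mel scale; the log compresses the dynamic
# range so that small and large discrepancies contribute more evenly.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000, n_fft=1024, hop_length=256, n_mels=80)

waveform = torch.randn(1, 24000)                     # one second of stand-in audio
log_mel = torch.log(mel_transform(waveform) + 1e-6)  # log Mel Spectrogram
```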


Once the weights associated with the layers of the DNN or other form of model have converged to sufficiently stable values, the Generator Model 208 may be considered trained and the trained model 230 may be used as part of an inference pipeline 204. In a typical use, an input tensor 228 representing a spectrogram may be provided to the trained model 230. In response, the model 230 will operate to generate or produce as an output a tensor 232 representing a “realistic” waveform or other form of source data corresponding to the training process.



FIG. 2(b) is a block diagram illustrating the primary functions or operations implemented by an example embodiment of the disclosed system, apparatus, and method for training a GAN model (labelled the Generator Model in the figure) to generate a more "realistic" waveform in the situation of a random output. Here, "random output" refers to a situation in which the Generator Model may generate data corresponding to a source (such as an image or a sound) that is sampled from the learned dataset distribution, without regard to a specific input or expectation (the Generator Model has learned to model a dataset distribution, and afterwards one can sample from the learned distribution as part of generating the output).


As an example, in this situation, a model trained on faces may produce a random face. As shown in the figure, in this embodiment of using the disclosed and/or described training process, the loss term for the real signal (the Lp Norm loss in FIG. 2(a)) is not calculated since it would not represent a useful quantity, and only the Adversarial Loss term 222 (or a component of it) is used in determining a loss value in a feedback process to improve the operation of the Generator Model 208.


A difference between the architectures illustrated in FIGS. 2(a) and 2(b) is the inclusion of the "Lp Norm Loss" term in FIG. 2(a). The "Lp Norm Loss" is helpful for "guiding" the Generator model to generate a specific type or form of output. For example, in a text-to-speech application, one may be seeking to generate a specific voice-over (i.e., a voice-over that says a predetermined piece of text and that sounds like a particular speaker, or like one with a particular characteristic, such as an accent or cadence) that is also more "realistic" sounding to a human ear. In one sense, the "Lp Norm Loss" helps to generate the specific voice-over, and the use of the "Adversarial loss" helps to ensure that the generated voice-over also sounds realistic to the human ear and auditory processing system.



FIG. 2(c) is a block diagram illustrating the primary functions or operations implemented by an example embodiment of the disclosed system, apparatus, and method for training a GAN model (labelled the Generator Model 208 in the figure) to generate a “realistic” waveform in the situation of a conditional random output. Here, “conditional random output” refers to a situation in which the output is partially expected or known. For example, in an image generation use case, a conditional may be a general label, such as “dog” or “cat”. The Generator Model may then be trained to generate any dog or cat, with any lighting, with any position, and with any fur color, as possible examples of outputs.


The tutorial at https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html provides information to assist in implementing certain components of the processing pipeline. Further, a tutorial that discusses implementing a conditional GAN is found at: https://medium.com/@utk.is.here/training-a-conditional-dc-gan-on-cifar-10-fce88395d610. An article that may assist in implementing a Fourier transform for an Image Signal is found at https://arxiv.org/pdf/1907.06515.pdf.



FIG. 2(c) illustrates a situation in which additional information is included with the “input”, in the form of a “conditional”. For example, in image processing, a conditional might be an image class or label, such as “dog”. In text-to-speech, a conditional might be a speaker or accent the model is trying to mimic. The conditional processing or input can be included in the processing flow of either FIG. 2(a) or FIG. 2(b). FIG. 2(c) illustrates how a conditional can be added to the architecture of FIG. 2(b). The conditional is represented as a tensor.


As shown in FIG. 2(c), in this embodiment of the training process a conditional in the form of a tensor 240 is an optional input to Generator Model 208 and/or to Discriminator Model 218. The conditional 240 provides additional information for the Generator and Discriminator to use in the training process. Although there is often no prescribed method for choosing when to use the conditional in the Generator, the Discriminator, or both, there are situations where each such use may be helpful; these include improving model stability, consistency, and control. Using the conditional only in the Discriminator forces a model to assess fidelity not only in terms of realism but also in how well the output matches the condition. Conditioning the Generator allows the Generator output to be controlled so that it is associated with a specific condition.
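One common way to inject the conditional tensor 240 is to concatenate it with the primary input of the Generator and/or the Discriminator, as in the toy sketch below; the concatenation scheme and layer sizes are assumptions for illustration, and other injection methods (e.g., learned embeddings or feature-wise modulation) exist.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy Generator conditioned by concatenating a conditional tensor."""
    def __init__(self, z_dim=64, cond_dim=16, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + cond_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=-1))

class ConditionalDiscriminator(nn.Module):
    """Toy Discriminator that scores a sample jointly with its condition."""
    def __init__(self, in_dim=128, cond_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim + cond_dim, 256),
                                 nn.LeakyReLU(0.2), nn.Linear(256, 1))

    def forward(self, x, cond):
        return self.net(torch.cat([x, cond], dim=-1))
```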



FIG. 2(d) is a block diagram illustrating the primary functions or operations implemented by an example embodiment of the disclosed system, apparatus, and method for training a GAN model (labelled the Generator Model 208 in the figure) to generate a "realistic" waveform in the situation of using a pre-trained Discriminator 219. As suggested by the figure, in this example implementation of the disclosed training process, the Discriminator Model 219 does not receive a form of a real signal as an input; it generates Adversarial Loss term 222, which is then used (in whole or as components) as an error or control term for modifying Generator Model 208. In this example, Discriminator Model 219 may be based on a pre-trained model which accepts a spectrogram as an input. This may help in scenarios where the Generator needs a well-defined and strong adversarial signal to start learning effectively. It may also reduce training complexity and be helpful in low-resource settings, or when training must be accelerated, such as for prototyping purposes. In general, it can be used to help stabilize the training process.


As indicated by the implementations of the processing flow shown in FIGS. 2(a) through 2(d), embodiments of the disclosed and/or described training process may be utilized to assist in training Generator Models using multiple types of GAN network architectures. In each case, the techniques disclosed and/or described may be used to improve the training process and produce a trained model that can generate a more “realistic” waveform or other desired form of output. Here, realistic refers to a characteristic or characteristics of the waveform (or other form) such that when the waveform is used to generate sounds, those sounds would be perceived by the human ear as a real set of spoken sounds. As mentioned, an embodiment of the disclosure may be applied to other types of data, such as for image processing, where it can be used to ensure that image signals or data generated by a model result in a more realistic image or more accurate representation of a signal, waveform, time-series data, or events.


As described, in some embodiments, the Generator is trained to output a signal based on an input in the form of a spectrogram. The output signal is then converted or transformed to a spectrogram (via a Fourier transform, as an example). An alternative is to train a Generator to output a spectrogram based on an input in the form of a spectrogram. The output spectrogram is then processed in the manner of the Fourier-transformed output signal of the other embodiments. In this alternative embodiment, the Generator is trained to produce spectrograms directly and the signal transformation or conversion stage is not needed (i.e., the “Fake” Signal 210 and its subsequent transformation are not part of the process flow). During the inference process, to produce a waveform, an inverse Fourier transform may be applied to the produced spectrogram.
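A minimal sketch of the inference step for this alternative embodiment is shown below, using PyTorch's inverse short-time Fourier transform; the transform parameters are illustrative assumptions, and a Generator that emits only magnitude spectrograms would additionally require a phase-reconstruction step (e.g., Griffin-Lim).

    import torch

    N_FFT, HOP = 512, 128  # illustrative STFT parameters

    # Assume the trained Generator emitted a complex-valued spectrogram
    # with (N_FFT // 2 + 1) frequency bins and 200 time frames.
    generated_spec = torch.randn(N_FFT // 2 + 1, 200, dtype=torch.complex64)

    # Inverse STFT recovers a time-domain waveform from the spectrogram
    waveform = torch.istft(
        generated_spec,
        n_fft=N_FFT,
        hop_length=HOP,
        window=torch.hann_window(N_FFT))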



FIG. 2(e) is a table or chart illustrating a set of value functions that may be used in training a model that is part of a GAN architecture for the indicated type of GAN (see https://github.com/hwalsuklee/tensorflow-generative-model-collections?tab=readme-ov-file). The value function (or loss term function) represents a term that may be used as part of an adaptive feedback or control process to update the weights in the Generator (the model being trained) and/or the Discriminator. The value function or loss term function may be a function of one or more of the fake data, the Discriminator's evaluation of the fake data, or the Discriminator's evaluation of the real data, as non-limiting examples.


The table or chart of FIG. 2(e) illustrates the conventional cost or loss functions used as part of training a GAN model and represents examples of “error” terms that may be used when training a GAN network. Note that in some cases, there may be two types of errors, a Generator error (LG) used in training the Generator and a Discriminator error (LD) used in training the Discriminator. The combination of the Generator error and the Discriminator error is referred to herein as the Adversarial loss. In some cases, the Discriminator process may be “fixed” and does not require further training (e.g., by using a pre-trained Discriminator).
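For reference, the following is a sketch of the classical (non-saturating) GAN value functions in PyTorch, one example of the loss families listed in FIG. 2(e); other rows of the table substitute different formulas for LG and LD.

    import torch
    import torch.nn.functional as F

    def discriminator_loss(d_real_logits, d_fake_logits):
        # LD: score real inputs toward 1 and generated inputs toward 0
        real = F.binary_cross_entropy_with_logits(
            d_real_logits, torch.ones_like(d_real_logits))
        fake = F.binary_cross_entropy_with_logits(
            d_fake_logits, torch.zeros_like(d_fake_logits))
        return real + fake

    def generator_loss(d_fake_logits):
        # LG: reward the Generator when the Discriminator scores fakes as real
        return F.binary_cross_entropy_with_logits(
            d_fake_logits, torch.ones_like(d_fake_logits))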



FIG. 3(a) is a flowchart or flow diagram illustrating a process, method, set of operations, or set of functions for training a model, such as a Generator that is part of a GAN architecture, to generate a “realistic” waveform, in some embodiments. As shown in the figure, in one embodiment, a process for training a model to generate more realistic waveforms using a GAN architecture may comprise the following steps or stages (a code sketch of these stages follows the list):

    • Process Training Data (Input) into a Tensor Format (as suggested by step or stage 330);
      • In one embodiment, the training data may represent a spectrogram (generated from speech or text, as non-limiting examples), with the tensor elements representing the intensity or amplitude of a waveform as a function of frequency or space bin;
    • Input the Tensor Representing the Training Data to a Generator Model Being Trained (as suggested by step or stage 335);
      • In one embodiment, the Generator or other form of model being trained may be in the form of a deep neural network (DNN);
        • Other forms or architectures may be used for the Generator and may depend upon the type of input data and/or how a realistic version of the input data would be determined, characterized, or otherwise defined;
      • Depending upon the form and source of the training data, the deep neural network (DNN) may be a convolutional neural network (e.g., for images), a recurrent neural network (e.g., for time-series data), or a multi-layer perceptron (e.g., for a signal), as non-limiting examples;
      • In one embodiment, the output of the Generator may be a tensor representing (or convertible into) a waveform;
    • Obtain the Output of the Generator and Process the Output by Performing a Fourier Transform on the Output (as suggested by step or stage 340);
      • This may operate to convert or transform the output into a spectrogram, for example (where as described, a spectrogram may be chosen because of its relationship to how a person perceives an audio waveform);
      • The Fourier transform may be a short-time Fourier transform (STFT), or other similar form of transform or process that converts the output to a representation in the frequency domain;
      • In some embodiments, a different method of converting the model output into a spectrogram or frequency domain may be used and may include a cosine transform, as a non-limiting example;
        • As mentioned herein, the conversion or transformation to a spectrogram is used in some embodiments due to its ability to more accurately or realistically represent how sounds are processed by a person. Similarly, if the input data represents a type or form of data or information that is best or more accurately perceived or understood by a person or machine in a different format, then a process to convert the output of the Generator to that format would be used;
    • Obtain a Real or Expected Output of the Model and Process (if needed) into Tensor Format (as suggested by step or stage 345);
      • As an example, in one embodiment, the real or expected output is a waveform representing a segment of speech;
    • Process the Real or Expected Output by Performing a Fourier Transform on the Real or Expected Output (as suggested by step or stage 350);
      • In the case where a process other than a Fourier transform is used to process the Generator model output, that same process would be applied to the Real or Expected Output;
    • Input the Transformed Generator Output and the Transformed Expected/Real Output to a Discriminator (as suggested by step or stage 355);
    • Input the Transformed Generator Output and the Transformed Expected/Real Output to a Lp Norm (or other error metric) Function (as suggested by step or stage 360);
      • As mentioned, in some embodiments, a different form of the Generator Model output and the Real or Expected output may be preferred, in which case both the conversion/transformation process used and the loss function or method used may differ from those in this example embodiment;
    • Determine the Lp Norm Loss and the Adversarial Loss Terms (as suggested by step or stage 365);
      • As mentioned, Adversarial Loss comprises both a Generator loss and a Discriminator loss, with the Lp Norm or other loss term being combined with the Generator loss term and used to modify the behavior of the Generator Model, and the Discriminator loss term being used to modify the behavior of the Discriminator Model;
    • Combine the Lp Norm Loss and Adversarial Loss Terms into a Total Loss Term (as suggested by step or stage 370);
    • Use the Total Loss Term (or one or more of its components) as a Control Signal for Modifying Generator and/or Discriminator Weights or Parameters (as suggested by step or stage 375);
      • In some embodiments, the total error term may be used as part of a backpropagation or other form of feedback or adaptive control process to modify the weights between nodes or layers in the Generator, and in some cases those of the Discriminator;
      • In one example use case, the output from a trained Generator Model (i.e., a tensor representing a waveform) may be used as an input to an audio transducer (such as a speaker) after appropriate conversion or processing to generate audible sounds.
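The following is a minimal PyTorch sketch of the loss computation in stages 335 through 375, assuming hypothetical generator and discriminator modules; the STFT parameters and the choice of an L1 norm are illustrative, not prescribed.

    import torch
    import torch.nn.functional as F

    N_FFT, HOP = 512, 128            # illustrative STFT parameters
    WINDOW = torch.hann_window(N_FFT)

    def spectrogram(waveform):
        # Stages 340/350: convert a waveform tensor to a magnitude spectrogram
        spec = torch.stft(waveform, n_fft=N_FFT, hop_length=HOP,
                          window=WINDOW, return_complex=True)
        return spec.abs()

    def generator_total_loss(generator, discriminator,
                             input_tensor, real_waveform):
        fake_waveform = generator(input_tensor)   # stage 335
        fake_spec = spectrogram(fake_waveform)    # stage 340
        real_spec = spectrogram(real_waveform)    # stage 350

        # Stages 360/365: Lp norm loss (L1 shown) between the spectrograms
        lp_loss = F.l1_loss(fake_spec, real_spec)

        # Stages 355/365: the Generator's share of the adversarial loss
        d_fake = discriminator(fake_spec)
        adv_loss = F.binary_cross_entropy_with_logits(
            d_fake, torch.ones_like(d_fake))

        # Stage 370: combine into a total loss (simple addition shown);
        # stage 375 backpropagates this term to update the Generator
        return lp_loss + adv_loss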


Although embodiments have been described with reference to text-to-speech or speech synthesis applications, it is noted that there are other possible applications of the training process(es) described. Further, as mentioned, in some embodiments, the input to a Generator model is a tensor, where the values of the tensor may be obtained or derived from sources that include an image, a video, a document, a string of text, a sampled waveform, random numbers, time-series data, a spectrogram, a set of data generated by a system, or sensor data, as non-limiting examples. The data may be subjected to processing steps that include conversion, transformation, filtering, sampling, or thresholding prior to being used to construct the input tensor. As further non-limiting examples:

    • An embodiment of the disclosed and/or described training pipeline can be used with other models that translate an input into an output, where the output can be processed further by applying a Fourier or similar transform (or other form of processing or conversion depending upon the source data or format to enable representing the Generator output and the “real” or actual signal in the frequency domain);
      • In some of the examples described, the Generator model uses a spectrogram (such as may be obtained from a text-to-spectrogram process flow) as an input and “predicts” a waveform as the output. The output waveform is then processed by a Fourier transform to produce a spectrogram. The spectrogram is then passed to a Discriminator, where it is compared to a sample of expected or real data in the appropriate form (e.g., also in the frequency domain);
        • As examples, an input tensor may be generated from sources, including but not limited to:
          • audio waveforms;
          • processed images or scans (including photographs, x-rays, MRI, or ultrasound, which may be sampled or subject to signal processing methods);
          • processed text (e.g., after application of an optical character recognition (OCR) process to produce a digitized version);
          • time-series signals from radar or sonar devices;
          •  this could have applications for systems used in identifying friend or foe (IFF), locating objects, tracking objects using emitted or reflected signals, discriminating between objects, and similar uses;
          • light or sound output by a device;
          • sensor measurements (e.g., heartbeat, pulse, brain waves);
          • wireless communication signals;
          • signals (e.g., traveling waves) generated by seismological events;
          • signals (e.g., time-series data) obtained from processing financial data;
          • random numbers representing “noise”;
        • In one embodiment, the Generator model output is a signal or format that can be processed with a Fourier transform. For example, an image, waveform, or time-series forecast represent outputs that may be processed as disclosed and/or described herein;
          • The output may be in the form of one of the inputs mentioned, and the output of the trained model may be used to represent a “realistic” (or more realistic) form of a wave, signal, trend, or other type of data (by generating more realistic signals of a type being processed for purposes of system testing, system evaluation, or error correction, as non-limiting examples);
    • The Deep Neural Network (DNN) or other model (typically comprised of one or more RNNs or CNNs, or a combination of the two architectures, and which may include other layers, such as an Attention layer or a Feedforward layer) should be compatible with the input and output forms of the data;
    • In one embodiment, the disclosed and/or described solution enables the generation of more realistic waveforms than conventional approaches. This capability may be used for text-to-speech conversion or speech-to-speech conversion, musical-notes-to-music conversion, or noise-to-speech applications, as non-limiting examples;
      • Thus, some embodiments are generally applicable to a use case or problem which requires generating an audio signal or similar type of signal or waveform;
    • As mentioned, the described approach may also be applied to generating “realistic images”. This could be an improvement over conventional approaches used for this purpose, as synthetic images tend to have spectra that are not realistic (the paper found at https://arxiv.org/pdf/2003.08685.pdf describes how to detect fake images through their spectra. See also https://arxiv.org/pdf/1907.06515.pdf);
      • For example, the disclosed technique could be applied to generate realistic images that are more likely to be undetectable by the methods described in the referenced papers;
      • In one example, this may be done by sampling an image (e.g., dividing the image into pixels or sections and determining a characteristic of each pixel or section), converting the sampled data to a tensor, inputting the tensor into a Generator model, and converting or transforming the output of the Generator model into a spectrogram or other representation of its spectra (a sketch of this conversion follows the list). The process then performs one or more of the disclosed and/or described steps or stages: converting or transforming an example of real or expected data, inputting the converted or transformed data to a Discriminator, followed by the other described steps;
    • Although there are GAN models that have been used to generate synthetic time-series data, the approach disclosed and/or described herein may be used to generate more realistic synthetic time-series data—this data could be used for training detection systems, identifying potential threats, or calibrating devices and systems, as examples.
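As a sketch of the frequency-domain conversion for the image use case noted above, the following fragment computes a per-channel 2D log-magnitude spectrum with PyTorch; the image shape is an illustrative assumption.

    import torch

    image = torch.randn(3, 256, 256)  # e.g., an image emitted by a Generator

    # 2D FFT per channel; the shifted log-magnitude spectrum is the kind of
    # representation the referenced papers analyze to detect synthetic images
    spectrum = torch.fft.fft2(image)
    log_magnitude = torch.log1p(torch.fft.fftshift(spectrum).abs())

    # log_magnitude can now be compared to the spectrum of a real image via
    # an Lp norm and/or passed to a Discriminator, as in the audio pipeline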


In general, embodiments may be used for improving the results obtained by conventional signal and image processing techniques that may be applied in the following contexts:

    • speech synthesis, text-to-speech processing, image processing to detect the presence or absence of a feature or of movement, to generate video, to detect false signals, to provide simulations of physical phenomena, to generate handwriting, or for radar or sonar signal processing;
    • other applications of the techniques and methods disclosed herein include the following:
      • Audio signal processing—for electrical signals representing sound, such as speech or music;
      • Image processing—in digital cameras, computers, and imaging systems;
      • Video processing—for interpreting a set of images;
      • Wireless communication—for waveform generation, demodulation, filtering, or equalization applications;
      • Control systems and associated signals derived from sensed waveforms;
      • Array processing—for processing signals from an array of sensors;
      • Process control—here a variety of signals are used, including the industry standard 4-20 mA current loop;
      • Seismology applications;
      • Financial signal processing—analyzing financial data using signal processing techniques, such as for prediction purposes;
      • Feature extraction, such as image understanding and speech recognition;
      • Signal quality improvement, such as by noise reduction, image enhancement, or echo cancellation;
      • Source coding, including audio compression, image compression, and video compression; and
      • Genomic signal processing.


For example, FIG. 3(b) is a flowchart or flow diagram illustrating a process, method, set of operations, or set of functions for training a model, such as a Generator that is part of a generative adversarial network (GAN, where the GAN includes a Generator and a Discriminator), to generate examples of an input tensor, in accordance with some embodiments. As shown in the figure, in one embodiment a process for performing a training cycle for a model using a generative adversarial network architecture may comprise the following steps or stages (a code sketch of the cycle follows the list):

    • Inputting a first tensor including a set of elements to the Generator, each tensor element being a value and the set of elements representing a distribution of values (as suggested by step or stage 380);
      • As non-limiting examples, the tensor values may represent a source such as a signal, a waveform, or an image, and the source may be sampled or otherwise processed into the form of a distribution of values in a tensor;
      • In some embodiments, the values may represent a distribution over space or over time generated from a signal or waveform;
    • Operating the Generator to output a tensor representing a generated distribution of values (as suggested by step or stage 381);
    • Processing the output tensor to convert the generated distribution of values to the frequency domain (step or stage 382);
      • In one embodiment, this may be performed by use of a short-time Fourier transform (STFT) or similar transformation;
    • Determining one or more error terms from the converted generated distribution (step or stage 383);
      • In one embodiment, this may be what is referred to as an Lp norm loss (such as L1 or L2), although other forms of a loss term may also or instead be used;
    • Obtaining a second tensor including a set of elements, each tensor element being a value and the set of elements representing an actual distribution of values (step or stage 384);
      • As noted, this is a real or actual distribution of values, or a real signal represented as a distribution;
    • Converting the actual distribution of values to the frequency domain (step or stage 385);
      • In one embodiment, this may be performed by use of a short-time Fourier transform (STFT) or similar transformation;
      • The transformation or conversion should be the same one as used for the generated distribution;
    • Determining one or more error terms from the converted actual distribution (step or stage 386);
      • In one embodiment, this may be what is referred to as an Lp norm loss (such as L1 or L2), although other forms of a loss term may also or instead be used;
      • The loss term should be the same one or ones as used for the converted generated distribution;
    • Inputting the converted generated distribution and the converted actual distribution to the Discriminator (step or stage 387);
    • Operating the Discriminator to determine an adversarial loss term, the adversarial loss term including a Generator loss term and a Discriminator loss term (step or stage 388); and
    • Using the Generator loss term, the error term from the converted generated distribution and the error term from the converted actual distribution to modify the Generator and using the Discriminator loss term to modify the Discriminator (step or stage 389);
      • The “modification” may be implemented using backpropagation or other feedback mechanism to adjust the weights between the layers of the Generator model and the Discriminator (as an example);
    • Repeating the training cycle until weights between layers of the Generator and weights between layers of the Discriminator become stable (step or stage 390); and
    • Using the Generator as an inference engine to generate realistic examples of a tensor or distribution input to the Generator (step or stage 391).
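The following is a minimal end-to-end sketch of this training cycle (stages 380 through 390), assuming hypothetical generator and discriminator modules, optimizers, a data loader yielding (input, actual) tensor pairs, and a to_freq function standing in for the STFT of stages 382 and 385; it is an illustration, not a definitive implementation.

    import torch
    import torch.nn.functional as F

    def train(generator, discriminator, g_opt, d_opt, loader, to_freq,
              epochs=1):
        for _ in range(epochs):
            for input_tensor, actual_tensor in loader:    # stages 380/384
                fake = generator(input_tensor)            # stage 381
                fake_freq = to_freq(fake)                 # stage 382
                real_freq = to_freq(actual_tensor)        # stage 385

                # Stage 388 (Discriminator side): detach so this update
                # does not flow into the Generator
                d_real = discriminator(real_freq)
                d_fake = discriminator(fake_freq.detach())
                d_loss = (F.binary_cross_entropy_with_logits(
                              d_real, torch.ones_like(d_real))
                          + F.binary_cross_entropy_with_logits(
                              d_fake, torch.zeros_like(d_fake)))
                d_opt.zero_grad()
                d_loss.backward()
                d_opt.step()                              # stage 389 (D)

                # Stages 383/386/388 (Generator side): Lp norm term plus
                # the Generator loss term from the Discriminator
                lp_loss = F.l1_loss(fake_freq, real_freq)
                d_fake_for_g = discriminator(fake_freq)
                g_loss = lp_loss + F.binary_cross_entropy_with_logits(
                    d_fake_for_g, torch.ones_like(d_fake_for_g))
                g_opt.zero_grad()
                g_loss.backward()
                g_opt.step()                              # stage 389 (G)
        # Stage 390: in practice, the loop repeats until the weights stabilize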



FIG. 4(a) is a diagram illustrating elements or components that may be present in a computer device or system configured to implement a method, process, function, or operation in accordance with an embodiment of the system and methods described herein. FIG. 4(a) represents modules that contain software instructions, which when executed, implement the set of steps or stages of FIG. 3(a).


As noted, in some embodiments, the system and methods may be implemented in the form of an apparatus that includes a processing element and set of executable instructions. The executable instructions may be part of a software application and arranged into a software architecture. In general, an embodiment may be implemented using a set of software instructions that are designed to be executed by a suitably programmed processing element (such as a GPU, CPU, microprocessor, processor, controller, computing device, etc.). In a complex application or system such instructions are typically arranged into “modules” with each such module typically performing a specific task, process, function, or operation. The entire set of modules may be controlled or coordinated in their operation by an operating system (OS) or other form of organizational platform.


Each application module or sub-module may correspond to a particular function, method, process, or operation that is implemented by the module or sub-module. Such function, method, process, or operation may include those used to implement one or more aspects of the disclosed system and methods.


The application modules and/or sub-modules may include any suitable computer-executable code or set of instructions (e.g., as would be executed by a suitably programmed processor, microprocessor, or CPU), such as computer-executable code corresponding to a programming language. For example, programming language source code may be compiled into computer-executable code. Alternatively, or in addition, the programming language may be an interpreted programming language such as a scripting language.


The application modules and/or sub-modules may contain one or more sets of instructions for performing a method or function described with reference to the Figures, and the descriptions of the functions and operations provided in the specification. These modules may include those illustrated but may also include a greater number or fewer number than those illustrated. As mentioned, each module may contain a set of computer-executable instructions. The set of instructions may be executed by a programmed processor or processors contained in a server, client device, network element, system, platform, or other component.


A module may contain instructions that are executed by a processor contained in more than one of a server, client device, network element, system, platform, or other component. Thus, in some embodiments, a plurality of electronic processors, with each being part of a separate device, server, or system, may be responsible for executing all or a portion of the software instructions contained in an illustrated module. Thus, although FIG. 4(a) illustrates a set of modules which taken together perform multiple functions or operations, these functions or operations may be performed by different devices or system elements, with certain of the modules (or instructions contained in those modules) being associated with those devices or system elements.


As shown in FIG. 4(a), system 400 may represent a server or other form of computing or data processing system, platform, or device. Modules 402 each contain a set of executable instructions that, when executed by a suitable electronic processor or processors (such as that indicated in the figure by “Physical Processor(s) 430”), cause system (or server, platform, or device) 400 to perform a specific process, operation, function, or method. Modules 402 are stored in a memory 420, which typically includes an Operating System module 404 that contains instructions used (among other functions) to access and control the execution of the instructions contained in other modules. The modules 402 stored in memory 420 are accessed for purposes of transferring data and executing instructions by use of a “bus” or communications line 418, which also serves to permit processor(s) 430 to communicate with the modules for purposes of accessing and executing a set of instructions. Bus or communications line 418 also permits processor(s) 430 to interact with other elements of system 400, such as input or output devices 422, communications elements 424 for exchanging data and information with devices external to system 400, and additional memory devices 426.


Although the following description of the functions of the modules is with reference to a use case of a text-to-speech conversion process, as mentioned, the methods and techniques disclosed herein may be used as part of other data, signal, or image processing pipelines. The following is provided for purposes of an example and is not intended to limit or restrict the functions that may be implemented by execution of the software instructions in each module.


For example, Obtain Training and Expected Data for Model Module 406 may contain computer-executable instructions which when executed by a programmed processor cause the processor or a device in which it is implemented to acquire, access, or otherwise obtain training data and expected data (or desired output data) for use in training a deep learning model, such as a Generator that is part of a GAN. As described with reference to FIGS. 3(a) and 3(b), the input data may take the form of a tensor, such as a spectrogram generated by a text-to-spectrogram model. The expected data is also in the form of a tensor and may be obtained from a waveform obtained from a text-to-speech dataset.


Input Training Data to Generator Model Module 408 may contain computer-executable instructions which when executed by a programmed processor cause the processor or a device in which it is implemented to input the training data tensors to the deep learning or other model and obtain an output tensor. The Generator model may be a neural network, and in some examples may be a recurrent neural network (RNN) or a convolutional neural network (CNN).


Process Output Waveform from Model and Expected Waveform Using STFT Module 410 may contain computer-executable instructions which when executed by a programmed processor cause the processor or a device in which it is implemented to process each of the output tensor generated by the model and the expected tensor using a Fourier transform, such as a short-time Fourier Transform, although other types of such transforms or signal processing techniques may be used to obtain a frequency domain representation.


Input Fourier Transforms of Both Output Waveform and Expected Waveform to Lp Norm Loss Calculation and to Discriminator Module 412 may contain computer-executable instructions which when executed by a programmed processor cause the processor or a device in which it is implemented to perform a calculation or operation to determine each of two loss terms, an Lp Norm loss (e.g., L1, L2, or other form) and a Discriminator loss term, with each loss term based on a combination of both the transform of the output waveform from the Generator and the transform of the expected waveform.


Determine Lp Norm Loss, Discriminator Loss and Total Loss Module 414 may contain computer-executable instructions which when executed by a programmed processor cause the processor or a device in which it is implemented to determine the (spectrogram) loss from the output of the Lp-norm loss calculation, determine the discriminator loss from the output of the discriminator loss calculation, and determine a total loss by combining the two loss terms. The combination may be through addition, scaling, fitting to a polynomial or curve, a weighted combination, or other suitable technique.
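As a sketch of one such combination, a simple weighted sum may be used; the weighting factor is an illustrative hyperparameter, not a value prescribed by this disclosure.

    import torch

    def total_loss(lp_loss: torch.Tensor, adv_loss: torch.Tensor,
                   lambda_lp: float = 10.0) -> torch.Tensor:
        # Weighted combination of the spectrogram (Lp norm) loss and the
        # adversarial loss; lambda_lp balances the two terms
        return lambda_lp * lp_loss + adv_loss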


Use Total Loss to Update Model Weights Module 415 may contain computer-executable instructions which when executed by a programmed processor cause the processor or a device in which it is implemented to use the total loss term determined by the output of module 414 as a control measure or signal (typically in a feedback loop) for updating the weights of the Generator model (and in some cases the Discriminator).


Use Trained Model to Generate “Realistic” Output Tensor (e.g., Waveform) from Input Module 416 may contain computer-executable instructions which when executed by a programmed processor cause the processor or a device in which it is implemented to receive a new input tensor or set of tensors (i.e., data not used in the training process and which may result from pre-processing of a spectrogram, image, or other form of signal) and to process that input using the trained model to generate an output tensor. This use of the trained model is as part of an inference pipeline, as illustrated in FIG. 2(a).


Use Realistic Tensor (Waveform) in Process to Generate Speech or Other Signal Processing Pipeline Module 417 may contain computer-executable instructions which when executed by a programmed processor cause the processor or a device in which it is implemented to use the output of the trained model as a source for a process that generates an audio signal (e.g., speech sounds via an audio transducer), a visual signal (video), or another representation of the output waveform or tensor using a specific signal processing flow.



FIG. 4(b) is a diagram illustrating elements or components that may be present in a computer device or system configured to implement a method, process, function, or operation in accordance with an embodiment of the system and methods described herein. FIG. 4(b) represents modules that contain software instructions, which when executed, implement the set of steps or stages of the flowchart or flow diagram of FIG. 3(b).


The set of modules may contain software instructions, which when executed by a programmed processor, perform the indicated step or stage:

    • Obtain a first tensor including a set of elements and input to the Generator, each tensor element being a value and the set of elements representing a distribution of values (as suggested by module 460);
    • Operate the Generator to output a tensor representing a generated distribution of values (as suggested by module 462);
    • Process the output tensor to convert the generated distribution of values to the frequency domain (as suggested by module 464);
    • Determine one or more error terms from the converted generated distribution (as suggested by module 466);
    • Obtain a second tensor including a set of elements, each tensor element being a value and the set of elements representing an actual distribution of values (as suggested by module 468);
    • Convert the actual distribution of values to the frequency domain (as suggested by module 470);
    • Determine one or more error terms from the converted actual distribution (as suggested by module 472);
    • Input the converted generated distribution and the converted actual distribution to the Discriminator (as suggested by module 474);
    • Operate the Discriminator to determine an adversarial loss term, the adversarial loss term including a Generator loss term and a Discriminator loss term (module 474);
    • Use the Generator loss term, the error term from the converted generated distribution and the error term from the converted actual distribution to modify the Generator and use the Discriminator loss term to modify the Discriminator (as suggested by module 476);
    • Repeat the training cycle until weights between layers of the Generator and weights between layers of the Discriminator become stable (module 476); and
    • Use the Generator as an inference engine to generate realistic examples of a tensor or distribution input to the Generator (module 476).


In some embodiments, the functionality and services provided by the system and methods described herein may be made available to multiple users by accessing an account maintained by a server or service platform. Such a server or service platform may be termed a form of Software-as-a-Service (SaaS). FIG. 5 is a diagram illustrating a SaaS system in which an embodiment of the invention/disclosure may be implemented. FIG. 6 is a diagram illustrating elements or components of an example operating environment in which an embodiment of the invention/disclosure may be implemented. FIG. 7 is a diagram illustrating additional details of the elements or components of the multi-tenant distributed computing service platform of FIG. 6, in which an embodiment of the invention/disclosure may be implemented.


In some embodiments, the system or service(s) described herein may be implemented as micro-services, processes, workflows, or functions performed in response to a user request to process a set of input data, such as an input tensor. The micro-services, processes, workflows, or functions may be performed by a server, data processing element, platform, or system. In some embodiments, the services may be provided by a service platform located “in the cloud”. In such embodiments, the platform is accessible through APIs and SDKs. The described model training services may be provided as micro-services within the platform for each of multiple users, use cases, or companies. The interfaces to the micro-services may be defined by REST and GraphQL endpoints. An administrative console may allow users or an administrator to securely access the underlying request and response data, manage accounts and access, and in some cases, modify the processing workflow or configuration.


Note that although FIGS. 5-7 illustrate a multi-tenant or SaaS architecture that may be used for the delivery of business-related or other applications and services to multiple accounts/users, such an architecture may also be used to deliver other types of data processing services and provide access to other applications. For example, such an architecture may be used to provide the waveform/tensor processing and model training functions and capabilities described herein, as well as using the model output as part of another process. Although in some embodiments, a platform or system of the type illustrated in FIGS. 5-7 may be operated by a 3rd party provider to provide a specific set of business-related applications, in other embodiments, the platform may be operated by a provider and a different business may provide the applications or services for users through the platform. For example, some of the functions and services described with reference to FIGS. 5-7 may be provided by a 3rd party with the provider of the training process and/or trained models maintaining an account on the platform for each company or business using a trained model to provide services to that company's customers.



FIG. 5 is a diagram illustrating a system 500 in which an embodiment of the invention may be implemented or through which an embodiment of the services described herein may be accessed. In accordance with the advantages of an application service provider (ASP) hosted business service system (such as a multi-tenant data processing platform), users of the services described herein may comprise individuals, businesses, stores, organizations, etc. A user may access the services using any suitable client, including but not limited to desktop computers, laptop computers, tablet computers, scanners, smartphones, etc. In general, any client device having access to the Internet may be used to submit a request for the data processing services described herein and to receive and display the resulting output. Users interface with the service platform across the Internet 508 or another suitable communications network or combination of networks. Examples of suitable client devices include desktop computers 503, smartphones 504, tablet computers 505, or laptop computers 506.


System 510, which may be hosted by a third party, may include a set of services 512 and a web interface server 514, coupled as shown in FIG. 5. It is to be appreciated that either or both of services 512 and web interface server 514 may be implemented on one or more different hardware systems and components, even though represented as singular units in FIG. 5. Services 512 may include one or more functions or operations for the model training and inference processes described herein.


In some embodiments, the set of applications available to a company or user may include one or more that perform the functions and methods described herein for the training of a model to generate a specific form of output tensor or waveform and the use of the trained model to generate a more “realistic” waveform, signal, or image. As discussed, and as a non-limiting example, these functions or processing workflows may be used to provide a more “realistic” waveform for purposes of driving an audio transducer as part of a text-to-speech or speech synthesis application.


As examples, in some embodiments, the set of applications, functions, operations or services made available through the platform or system 510 may include:

    • account management services 516, such as
      • a process or service to authenticate a person or user wishing to access the data processing services available through the platform (such as credentials or proof of purchase, verification that the customer has been authorized to use the services, etc.);
      • a process or service to receive a request for generating a desired form of output tensor or waveform from one or more input tensors representing a spectrogram, a sampled image, an output of an executed process, or other form of signal;
      • an optional process or service to generate a price for the requested service or a charge against a service contract;
      • a process or service to generate a container or instantiation of the desired signal processing and related techniques for the user, where the instantiation may be customized for a particular user or account; and
      • other forms of account management services.
    • a process or service 517 for training a model (in the form of a Generator that is part of a GAN) to operate on an input tensor representing a spectrogram, image, or other form of signal to produce a more “realistic” tensor as an output, where the training process may include:
      • A process to receive a set of training data, where each example of training data is a spectrogram and is used as an input to the model being trained;
        • In some embodiments, this may be a spectrogram generated by a text-to-spectrogram model (such as Tacotron-2);
        • The input to the model may be in the form of a tensor;
      • The model being trained may be in the form of a Deep Neural Network (DNN), such as a recurrent neural network (RNN) or a convolutional neural network (CNN);
        • The model generates an output tensor from the input tensor. In one example, this is a waveform predicted from the input;
      • To improve the model (i.e., to set and modify weights associated with connections between nodes in different layers of the model), a comparison is performed between the output tensor and the expected or desired output;
        • In some embodiments, the expected output is a waveform obtained from a text-to-speech dataset, for example the LJ Speech dataset;
        • In some embodiments, this comparison is used to determine the error or loss between the output tensor and the expected output tensor;
        • In some embodiments, prior to determining the error or loss, each of the output waveform and the expected waveform (expressed as tensors) are subjected to further processing;
          • In some embodiments, this further processing is a short-time Fourier Transform (STFT);
        • The transformed waveforms are then used to determine two types of “loss”, with the total loss representing a difference or control signal for adjusting the weights of the neural network;
          • In some embodiments, each of the transformed waveforms is provided as an input to both a process to determine an Lp Norm loss (for example, L1 or L2 loss) and a Discriminator loss (a form of binary classifier, with the loss term expressed as a cost function, such as those illustrated in FIG. 2(e));
          •  The resulting output of the process to determine L1 or L2 loss represents the Lp Norm loss, and the resulting output of the Discriminator represents Discriminator Loss;
        • The Lp Norm Loss and Discriminator Loss are combined to generate the Total Loss from the training cycle;
          • In one embodiment, the two loss terms are added to generate the total loss term, although as noted, other forms of combining the two terms may be used;
        • The Total Loss is then used to update or modify the weights set during the training cycle, typically through backpropagation (although other feedback or adaptive techniques may be used);
    • a process or service 518 for using the trained model with new input data in an inference pipeline;
      • as an example, once the model weights (such as those connecting nodes in layers of the network/model) have converged to sufficiently stable values, the model may be considered trained;
        • the trained model may be used as part of the inference pipeline. In a typical use, an input tensor may be provided to the trained model. In response, the model will operate to generate or produce as an output a tensor representing a realistic waveform (or other form of tensor corresponding to the training data and what it represents);
    • In some embodiments, the output tensor may then be used as part of a process or service 519 that converts text-to-speech, generates synthetic speech, generates an image having specific characteristics, processes a signal to identify a signal characteristic or interpret the signal, etc.; and
    • administrative services 522, such as
      • a process or service to enable the provider of the model training and inference pipelines and/or the platform to administer and configure the processes and services provided to users.


The set of applications, functions, operations or services made available through the platform or system 510 may also or instead include those which implement one or more of the functions, operations, or processes described with reference to the flowchart of FIG. 3(b) and the diagram of FIG. 4(b):


Training Process and Loss/Error Determination





    • Obtain a first tensor including a set of elements and input to the Generator, each tensor element being a value and the set of elements representing a distribution of values;

    • Operate the Generator to output a tensor representing a generated distribution of values;

    • Process the output tensor to convert the generated distribution of values to the frequency domain;

    • Determine one or more error terms from the converted generated distribution;

    • Obtain a second tensor including a set of elements, each tensor element being a value and the set of elements representing an actual distribution of values;

    • Convert the actual distribution of values to the frequency domain;

    • Determine one or more error terms from the converted actual distribution;

    • Input the converted generated distribution and the converted actual distribution to the Discriminator;

    • Operate the Discriminator to determine an adversarial loss term, the adversarial loss term including a Generator loss term and a Discriminator loss term;





Backpropagation/Feedback Process





    • Use the Generator loss term, the error term from the converted generated distribution and the error term from the converted actual distribution to modify the Generator and use the Discriminator loss term to modify the Discriminator;

    • Repeat the training cycle until weights between layers of the Generator and weights between layers of the Discriminator become stable; and





Inference Process Using Trained Model





    • Use the Generator as an inference engine to generate realistic examples of a tensor or distribution input to the Generator.





The platform or system shown in FIG. 5 may be hosted on a distributed computing system made up of at least one, but likely multiple, “servers.” A server is a physical computer dedicated to providing data storage and an execution environment for one or more software applications or services intended to serve the needs of the users of other computers that are in data communication with the server, for instance via a public network such as the Internet. The server, and the services it provides, may be referred to as the “host” and the remote computers, and the software applications running on the remote computers being served, may be referred to as “clients.” Depending on the computing service(s) that a server offers, it could be referred to as a database server, data storage server, file server, mail server, print server, web server, etc. A web server is most often a combination of hardware and the software that helps deliver content, commonly by hosting a website, to client web browsers that access the web server via the Internet.



FIG. 6 is a diagram illustrating elements or components of an example operating environment 600 in which an embodiment of the invention may be implemented. As shown, a variety of clients 602 incorporating and/or incorporated into a variety of computing devices may communicate with a multi-tenant service platform 608 through one or more networks 614. For example, a client may incorporate and/or be incorporated into a client application (e.g., software) implemented at least in part by one or more of the computing devices. Examples of suitable computing devices include personal computers, server computers 604, desktop computers 606, laptop computers 607, notebook computers, tablet computers or personal digital assistants (PDAs) 610, smart phones 612, cell phones, and consumer electronic devices incorporating one or more computing device components, such as one or more electronic processors, microprocessors, central processing units (CPU), or controllers. Examples of suitable networks 614 include networks utilizing wired and/or wireless communication technologies and networks operating in accordance with any suitable networking and/or communication protocol (e.g., the Internet).


The distributed computing service/platform (which may also be referred to as a multi-tenant data processing platform) 608 may include multiple processing tiers, including a user interface tier 616, an application server tier 620, and a data storage tier 624. The user interface tier 616 may maintain multiple user interfaces 617, including graphical user interfaces and/or web-based interfaces. The user interfaces may include a default user interface for the service to provide access to applications and data for a user or “tenant” of the service (depicted as “Service UI” in the figure), as well as one or more user interfaces that have been specialized/customized in accordance with user specific requirements (e.g., represented by “Tenant A UI”, . . . , “Tenant Z UI” in the figure, and which may be accessed via one or more APIs).


The default user interface may include user interface components enabling a tenant to administer the tenant's access to and use of the functions and capabilities provided by the service platform. This may include accessing tenant data, launching an instantiation of a specific application, causing the execution of specific data processing operations, etc. Each application server or processing tier 622 shown in the figure may be implemented with a set of computers and/or components including computer servers and processors, and may perform various functions, methods, processes, or operations as determined by the execution of a software application or set of instructions. The data storage tier 624 may include one or more data stores, which may include a Service Data store 625 and one or more Tenant Data stores 626. Data stores may be implemented with any suitable data storage technology, including structured query language (SQL) based relational database management systems (RDBMS).


Service Platform 608 may be multi-tenant and may be operated by an entity to provide multiple tenants with a set of business-related or other data processing applications, data storage, and functionality. For example, the applications and functionality may include providing web-based access to the functionality used by a business to provide services to end-users, thereby allowing a user with a browser and an Internet or intranet connection to view, enter, process, or modify certain types of information. Such functions or applications are typically implemented by one or more modules of software code/instructions that are maintained on and executed by one or more servers 622 that are part of the platform's Application Server Tier 620. As noted with regards to FIG. 5, the platform system shown in FIG. 6 may be hosted on a distributed computing system made up of at least one, but typically multiple, “servers.”


As mentioned, rather than build and maintain such a platform or system themselves, a business may utilize systems provided by a third party. A third party may implement a business system/platform as described above in the context of a multi-tenant platform, where individual instantiations of a business' data processing workflow (such as the data processing described herein) are provided to users, with each company/business representing a tenant of the platform. One advantage of such multi-tenant platforms is the ability for each tenant to customize their instantiation of the data processing workflow to that tenant's specific business needs or operational methods. Each tenant may be a business or entity that uses the multi-tenant platform to provide business services and functionality to multiple users.



FIG. 7 is a diagram illustrating additional details of the elements or components of the multi-tenant distributed computing service platform of FIG. 6, in which an embodiment of the invention may be implemented. The software architecture shown in FIG. 7 represents an example of an architecture which may be used to implement an embodiment of the invention. In general, an embodiment of the invention may be implemented using a set of software instructions that are designed to be executed by a suitably programmed processing element (such as a CPU, microprocessor, processor, controller, computing device, etc.). In a complex system such instructions are typically arranged into “modules” with each such module performing a specific task, process, function, or operation. The entire set of modules may be controlled or coordinated in their operation by an operating system (OS) or other form of organizational platform.


As noted, FIG. 7 is a diagram illustrating additional details of the elements or components 700 of a multi-tenant distributed computing service platform, in which an embodiment of the invention may be implemented. The example architecture includes a user interface layer or tier 702 having one or more user interfaces 703. Examples of such user interfaces include graphical user interfaces and application programming interfaces (APIs). Each user interface may include one or more interface elements 704. For example, users may interact with interface elements to access functionality and/or data provided by application and/or data storage layers of the example architecture. Examples of graphical user interface elements include buttons, menus, checkboxes, drop-down lists, scrollbars, sliders, spinners, text boxes, icons, labels, progress bars, status bars, toolbars, windows, hyperlinks, and dialog boxes. Application programming interfaces may be local or remote and may include interface elements such as parameterized procedure calls, programmatic objects, and messaging protocols.


The application layer 710 may include one or more application modules 711, each having one or more sub-modules 712. Each application module 711 or sub-module 712 may correspond to a function, method, process, or operation that is implemented by the module or sub-module (e.g., a function or process related to providing business related data processing and services to a user of the platform). Such function, method, process, or operation may include those used to implement one or more aspects of the disclosed system and methods, such as for one or more of the processes or functions described with reference to the Figures:

    • Training a model (in the form of a Generator that is part of a GAN) to operate on an input tensor representing a spectrogram, image, or other form of signal to produce a more “realistic” tensor as an output, where the training process may include:
      • Receiving a set of training data, where each example of training data is a spectrogram and is used as an input to the model being trained;
        • The input to the model may be in the form of a tensor;
      • The model being trained may be in the form of a Deep Neural Network (DNN), such as a recurrent neural network (RNN) or a convolutional neural network (CNN);
        • The model generates an output tensor from the input tensor. In one example, this is a waveform predicted from the input;
      • To improve the model (i.e., to set and modify weights associated with connections between nodes in different layers of the model), performing a comparison between the output tensor and the expected or desired output;
        • In some embodiments, the expected output is a waveform obtained from a text-to-speech dataset, for example the LJ Speech dataset;
        • In some embodiments, this comparison is used to determine the error or loss between the output tensor and the expected output tensor;
        • In some embodiments, prior to determining the error or loss, each of the output waveform and the expected waveform (expressed as tensors) are subjected to further processing;
          • In some embodiments, this further processing is a short-time Fourier Transform (STFT);
        • The transformed waveforms are used to determine two types of “loss”, with the total loss representing a difference or control signal for adjusting the weights of the neural network;
          • In some embodiments, each of the transformed waveforms is provided as an input to both a process to determine an Lp Norm loss (for example, L1 or L2 loss) and a Discriminator loss (a form of binary classifier, with the loss term expressed as a cost function, such as those illustrated in FIG. 2(e));
          •  The resulting output of the process to determine L1 or L2 loss represents the Lp Norm loss, and the resulting output of the Discriminator represents Discriminator Loss;
        • The Lp Norm Loss and Discriminator Loss are combined to generate the Total Loss from the training cycle;
          • In one embodiment, the two loss terms are added to generate the total loss term, although other forms of combining the two terms may be used;
        • The Total Loss is used to update or modify the weights set during the training cycle, typically through backpropagation (although other feedback or adaptive techniques may be used);
    • Using the trained model with new input data in an inference pipeline;
      • Once the model weights (such as those connecting nodes in layers of the network/model) have converged to sufficiently stable values, the model may be considered trained;
        • The trained model may be used as part of the inference pipeline. In a typical use, an input tensor may be provided to the trained model. In response, the model will operate to generate or produce as an output a tensor representing a realistic waveform (or other form of tensor corresponding to the training data and what it represents); and
    • In some embodiments, the output tensor may then be used as part of a process that converts text-to-speech, generates synthetic speech, generates an image having specific characteristics, processes a signal to identify a signal characteristic or interpret the signal, etc.


Similarly, the functions, operations, or processes illustrated or described with reference to FIGS. 3(b) and 4(b) may be implemented in an Application Layer 710 that includes modules and sub-modules containing software instructions corresponding to a function, method, process, or operation that is implemented by the module or sub-module. With reference to FIGS. 3(b) and 4(b), these may include:


Training Process and Loss/Error Determination





    • Obtain a first tensor including a set of elements and input to the Generator, each tensor element being a value and the set of elements representing a distribution of values;

    • Operate the Generator to output a tensor representing a generated distribution of values;

    • Process the output tensor to convert the generated distribution of values to the frequency domain;

    • Determine one or more error terms from the converted generated distribution;

    • Obtain a second tensor including a set of elements, each tensor element being a value and the set of elements representing an actual distribution of values;

    • Convert the actual distribution of values to the frequency domain;

    • Determine one or more error terms from the converted actual distribution;

    • Input the converted generated distribution and the converted actual distribution to the Discriminator;

    • Operate the Discriminator to determine an adversarial loss term, the adversarial loss term including a Generator loss term and a Discriminator loss term;





Backpropagation/Feedback Process





    • Use the Generator loss term, the error term from the converted generated distribution and the error term from the converted actual distribution to modify the Generator and use the Discriminator loss term to modify the Discriminator;

    • Repeat the training cycle until weights between layers of the Generator and weights between layers of the Discriminator become stable; and





Inference Process Using Trained Model





    • Use the Generator as an inference engine to generate realistic examples of a tensor or distribution input to the Generator.
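
A minimal sketch of this inference use, assuming a trained PyTorch generator module and an illustrative input shape:

    import torch

    def infer(generator: torch.nn.Module, input_tensor: torch.Tensor) -> torch.Tensor:
        # Run the trained Generator as an inference engine.
        generator.eval()           # disable training-only behavior (dropout, etc.)
        with torch.no_grad():      # no gradients are needed at inference time
            return generator(input_tensor)

    # Example: a random input tensor produces a "realistic" output tensor.
    # output = infer(generator, torch.randn(1, 128))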





The application modules and/or sub-modules may include any suitable computer-executable code or set of instructions (e.g., as would be executed by a suitably programmed processor, microprocessor, or CPU), such as computer-executable code corresponding to a programming language. For example, programming language source code may be compiled into computer-executable code. Alternatively, or in addition, the programming language may be an interpreted programming language such as a scripting language. Each application server (e.g., as represented by element 622 of FIG. 6) may include each application module. Alternatively, different application servers may include different sets of application modules. Such sets may be disjoint or overlapping.


The data storage layer 720 may include one or more data objects 722 each having one or more data object components 721, such as attributes and/or behaviors. For example, the data objects may correspond to tables of a relational database, and the data object components may correspond to columns or fields of such tables. Alternatively, or in addition, the data objects may correspond to data records having fields and associated services. Alternatively, or in addition, the data objects may correspond to persistent instances of programmatic data objects, such as structures and classes. Each data store in the data storage layer may include each data object. Alternatively, different data stores may include different sets of data objects. Such sets may be disjoint or overlapping.


Note that the example computing environments depicted in FIGS. 5-7 are not intended to be limiting examples. Further environments in which an embodiment of the invention may be implemented in whole or in part include devices (including mobile devices), software applications, systems, apparatuses, networks, SaaS platforms, IaaS (infrastructure-as-a-service) platforms, or other configurable components that may be used by multiple users for data entry, data processing, application execution, or data review.


Although specific examples and embodiments of the disclosure have been described with reference to text-to-speech (TTS) translation, another way to describe the purpose or context in which an embodiment may provide benefits is that of a desire to first model a signal and subsequently be able to generate new examples of that signal. An example outside of TTS is that of image generation where the image is a signal that can be modeled and then subsequently the model can be used to generate new, realistic images. Another example could be a stock price signal and generating new realistic versions of a stock signal.


In addition to the described embodiments and implementations, alternative ways of implementing one or more of the data processing operations or functions may include but are not limited to:

    • An input to the Generator Model in the form of a tensor—the tensor can represent or be derived from multiple types of sources or data formats and may represent one or more characteristics of a source. In one embodiment, the tensor may include or be comprised of random numbers;
      • Non-limiting examples of sources that may be represented as an input tensor include images, text, audio, waveforms representing signals, time-series data corresponding to values of a measured value or metric over time (such as for financial data or physical phenomena);
      • In some embodiments, and depending upon the source, the source data format may need to be transformed, converted, sampled, filtered, or otherwise processed to place it into the form of a tensor that may be used as an input;
    • In some embodiments, the Generator output is preferably in a form or signal that can be processed using a Fourier or other type of transform or operation. Examples include an image, waveform, or time-series forecast, expressed as tensor elements;
      • As mentioned herein, for some types of source data, a more optimal form for processing may depend upon how a desired output is perceived or understood (i.e., processed by the intended consumer);
        • For example, as described, in the case of an audio waveform or signal, it has been found that a spectrogram representation is more accurate with regards to how the human ear operates. This suggested converting or transforming the Generator Model output into a spectrogram through the use of a Fourier Transform. This also impacts the type of error term used when updating the Generator Model (as the Lp norm error was determined and used as part of a control signal);
        • As additional non-limiting examples, for an image, a vision model may be most accurate in representing how the human eye or a computer would process and interpret an image. This information may be used to modify the processing stages disclosed, such as by converting the output of the Generator into a different form through application of the vision model to the Generator output tensor. This may also impact the type of error term used to represent a part of the Generator loss term;
    • The model being trained (for example, a Generator in the form of a deep neural network) should be compatible in architecture and input/output format with the input and output forms;
      • As mentioned, this may be achieved in some processing flows by conversion, transformation, or other set of operations;
    • In a model where the output is expected, the predicted output should be comparable to the expected output. It may also be processed into a comparable format. In other models, the output is random, and there is no expected output;
      • In some embodiments, there are variations where the output may be partially expected. For example, a generative model might be expected to generate a dog without specification of the type of dog. In this case, a classifier output might be used instead of an Lp-norm loss, to ensure that the generated image is in fact a dog (see the sketch following this list);
      • The classifier output would presumably be combined with the Generator loss term to create the loss term(s) used as a control signal for the Generator;
      • In some embodiments, a separate Discriminator loss term may be used as a control signal for the Discriminator.
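
As a hedged sketch of the "partially expected output" variation described in the list above, the following Python (PyTorch) fragment combines a classifier loss (checking that the generated image is in fact the desired class, e.g., "dog") with a Generator loss term; the generator and classifier modules and the class index are hypothetical names introduced for illustration.

    import torch
    import torch.nn.functional as F

    def partial_expectation_loss(generator, classifier, noise, target_class,
                                 generator_loss):
        generated = generator(noise)
        logits = classifier(generated)
        # Penalize generated samples the classifier does not assign to the
        # desired class (used here in place of an Lp-norm term).
        targets = torch.full((logits.size(0),), target_class, dtype=torch.long)
        class_loss = F.cross_entropy(logits, targets)
        # Combined with the Generator loss term as a control signal.
        return class_loss + generator_loss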


In addition to those uses described, other possible uses for the solution represented by the disclosure may include but are not limited to:

    • Generation of realistic waveforms for purposes of text-to-speech, speech-to-speech, musical-notes-to-music, or noise-to-speech conversions, as non-limiting examples. The disclosed approach is generally applicable to any problem that requires generating an audio signal;
    • The approach may also be applied to generating realistic images. As mentioned, synthetic images tend to have spectra that are not realistic. The disclosed approach may be used to generate realistic images that are largely undetectable by conventional techniques;
    • GAN models may be used to generate synthetic time series data. The disclosed approach may be used to generate more realistic synthetic time series data, which have applications in training a model to detect and interpret radar backscatter, emitted signals from a transmitter, etc.


In addition to those described, additional examples of “signals” that may be used as part of the processing flow include but are not limited to:

    • Audio recording, where an analog pressure wave is sampled and converted to a one-dimensional discrete-time signal;
    • Photos/images, where the analog scene of light is sampled using a CCD array and stored as a two-dimensional discrete-space signal;
    • Text, where messages are represented with collections of characters; each character is assigned a standard 16-bit number and those numbers are stored in sequence;
    • Ratings, where ratings for books (Goodreads), movies (Netflix), or vacation rentals (AirBnB) are typically stored using integers (e.g., 0-5); and
    • Epileptic seizure waveform data, sensed human activity/behavior modelling, MRI data, emitted or backscatter wavelengths from crops, unmanned aerial vehicle control signals or collected data, etc.


Examples of “waves” or waveforms that may be processed as part of an embodiment include sound waves, light, water waves, and periodic electrical signals in a conductor. A sound wave is a variation in air pressure, while in light and other electromagnetic radiation the strength of the electric and the magnetic field vary. Water waves are variations in the height of a body of water. In a crystal lattice vibration, atomic positions vary and may generate wave phenomena.


In addition to those described, forms of models that may be used include:

    • RNN (a recurrent neural network) for time-series data;
    • CNN (a convolutional neural network) for images; or
    • Transformers for images or time-series data.


The Discriminator referenced in the figures and description takes as an input a tensor in the frequency domain. The Discriminator operates to evaluate the output of the Generator (the model being trained) to determine how close the output is to real data. The Discriminator produces a loss factor given an input of fake data; in a GAN framework, the Generator typically tries to maximize this loss (that is, to make the Discriminator misclassify generated data as real). In one embodiment, an approach used with a DCGAN may be used to train a Discriminator, although there are other approaches that may be used to train a Discriminator. A Deep Convolutional GAN or DCGAN is an extension of the GAN concept, except that it explicitly uses convolutional and transpose-convolutional layers in the discriminator and generator, respectively. The discriminator is made up of strided convolution layers, batch norm layers, and LeakyReLU activations, without max-pooling layers (i.e., convolution > batch norm > LeakyReLU), as in the sketch below.
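
A minimal PyTorch sketch of a DCGAN-style discriminator stack of the kind just described (strided convolution, then batch norm, then LeakyReLU, with no max pooling); the layer widths and the single-channel 2D input are illustrative assumptions, not the disclosed architecture.

    import torch.nn as nn

    discriminator = nn.Sequential(
        nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1),   # strided conv
        nn.LeakyReLU(0.2),
        nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), # strided conv
        nn.BatchNorm2d(128),                                    # batch norm
        nn.LeakyReLU(0.2),                                      # leaky ReLU
        nn.Conv2d(128, 1, kernel_size=4, stride=1, padding=0),  # score output
    )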


As further examples of the use of the disclosed processing flow to generate a more realistic waveform:

    • Start with text—then generate a spectrogram, input the spectrogram to a trained model to generate a waveform, and use the waveform to generate speech, followed by using the generated speech as part of delivering a service, etc.;
    • As examples, ways in which generated speech, or the model output (a more realistic waveform) might be used include:
      • The model may be used to generate music, or a sound (animal sounds, speech disfluencies, breathing sounds, etc.);
      • The model may be used as part of a speech-to-speech service;
      • The model may be used as part of a voice editing service for post-production situations;
      • The model may be used for brain wave to speech conversion (e.g., for those suffering from an accident or debilitating disease);
      • The model input may be notes for music generation, speech for speech-to-speech, or random noise for unconditional generation, as examples;
        • speech-to-speech use cases may include:
          • Voice changing;
          • Compression;
          • Removing disfluencies; or
          • Removing background noise (noise-canceling).


In addition to those described, implementation options or variations may include the following for the indicated element, component, or process:

    • For the use of backpropagation or another form of adaptive control or feedback;
      • The models described herein may, in some embodiments, be trained with an alternative feedback mechanism to backpropagation. These alternate mechanisms may include:
        • Difference Target Propagation;
        • The HSIC Bottleneck;
        • Online Alternative Minimization with Auxiliary Variables;
        • Decoupled Neural Interfaces Using Synthetic Gradients;
        • REINFORCE; or
        • Random Search;
      • The model weights can be tuned with a variety of gradient estimators. In some embodiments, the weights can be tuned without consideration for the gradient, using methods such as random search or grid search;
    • For the use of the STFT;
      • This can be replaced by other types of Fourier Transform. An STFT is generally specific to time series data. For images, it is not necessary to use a “Short-time Fourier transform”. In one example, a “Discrete-time Fourier Transform” may be used;
      • A cosine transform can be used in place of a Fourier transform, as well. In a similar way, other Fourier-related transforms can be used to transform a signal from the time domain into the frequency domain;
    • For the use of a Discriminator;
      • This can be another type of model or process that is able to detect “fake” data examples. For example, one could use a pre-trained automatic speech recognition (ASR) model for the discrimination function;
      • There are ways to train a Discriminator that do not involve directly classifying examples into fake or real. For example, a WGAN architecture introduces a Wasserstein loss for training the Discriminator;
      • The Discriminator typically needs to provide some sort of process for updating the Generator weights. This process helps ensure that the Generator output is similar to the training data examples;
      • As another example, a technique termed “Feature matching” may be used. This technique changes the cost function for the Generator to minimize the statistical difference between the features of the real images and the generated images. In this example, the Discriminator extracts features, and those features are analyzed for a statistical difference between the real and generated examples. In this case, the Discriminator is not functioning as a binary classifier; however, it is still providing a signal based on whether the generated examples exist in the target distribution (see https://towardsdatascience.com/gan-ways-to-improve-gan-performance-acf37f9f59b);
      • While conventionally, a Discriminator is a binary classifier, it is not required to be one or to operate in that manner. More generally, a Discriminator is a model of the “true data distribution”. The Generator uses the Discriminator to obtain feedback on whether a generated sample is part of the “true data distribution”. For example, a speech-to-text model could be used to discriminate the generated waveforms;
      • The Discriminator does not have to be trained in a DCGAN approach. In some embodiments, the Discriminator is not trained at all. For example, for tabular data, it is feasible to create a simple Discriminator which is not a Deep Neural Network. The Discriminator could also be pre-trained on the “true data distribution”. A pre-trained Discriminator does not have to be further updated with examples of “real” data;
      • While conventionally, the Discriminator is differentiable, it is not required to be. In some embodiments, the Discriminator is not trained with backpropagation, and therefore, it does not need to be differentiable;
    • For the use of L1 or L2 (Lp) Norm loss factors and/or other loss terms;
      • L1 is loss = |x − y|
      • L2 is loss = (x − y)²
      • In general, almost any algorithm, formula, or approach that compares and measures the difference between x and y can be used. There are many different mechanisms for measuring the difference between two tensors. They include simple polynomial functions and more complex Deep Neural Networks, as examples;
      • “LP-norm loss” can be generalized to “content loss or reconstruction loss” which encompasses LP-norm loss as an example. Another example of content loss is feature matching loss, which changes the cost function for the Generator to minimize the statistical difference between the features of the real images and the generated images;
    • In some embodiments, a total loss factor is computed or determined from the Spectrogram Loss and the Adversarial Loss by adding the two loss factors or values together. Usually, weights are applied when they are added together, such as in the form w1*x + w2*y (see the sketch following this list);
      • The Discriminator loss term may be dependent upon the type of GAN used (as suggested by FIG. 2(e), where LD represents Discriminator loss, and LG represents Generator loss). Discriminators may be changed during a training process or for a specific use case and may also be “stacked” and used together. In such cases, the relevant discriminator loss terms may also be combined;
    • The generator loss is a loss function used to train the generator in a GAN, measuring how well the generator's outputs are able to fool the discriminator into classifying them as real. Examples of generator loss terms for different GAN architectures are shown in FIG. 2(e);
    • As mentioned, additional losses can be added to the model without loss of generality. Other categories of losses can include but are not limited to adversarial loss (e.g., vision-aided adversarial loss), perceptual loss (e.g., perceptual similarity loss), regularization loss (e.g., L1, L2, weight-decay, weight-norm, or gradient-regularization) chosen to restrict weights and/or gradients, or style loss (e.g., gram matrix-based style loss, or layer-wise style loss);
    • Furthermore, a loss term can be added to one or more components of the disclosed models. For example, a loss can be placed on the spectrogram, more specifically on the phase of the spectrogram. For TTS, the phase of the spectrogram provides temporal information that affects the perceptual quality of the audio, and adding such a loss may reduce artifacts in the resulting audio. An additional loss can be placed on the output signal, such as a perceptual evaluation of speech quality (PESQ) loss. PESQ is designed to correlate with human judgments of speech quality; thus, models trained with a PESQ loss are likely to produce outputs that sound more “natural”.
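
As a minimal sketch of the loss terms discussed in this list, the following Python (PyTorch) functions compute the L1 and L2 losses between a generated tensor x and an expected tensor y, and a weighted Total Loss of the form w1*x + w2*y; the weight values are illustrative hyperparameters, not values from the disclosure.

    import torch

    def l1_loss(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return (x - y).abs().mean()          # L1: mean absolute difference

    def l2_loss(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return ((x - y) ** 2).mean()         # L2: mean squared difference

    def total_loss(spectrogram_loss, adversarial_loss, w1=1.0, w2=1.0):
        # Total Loss = w1 * Spectrogram Loss + w2 * Adversarial Loss
        return w1 * spectrogram_loss + w2 * adversarial_loss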


As disclosed, in some embodiments, a Generator model processes the input to generate an output which is a signal/waveform. A signal processing algorithm, such as a Fourier transform, is then applied to transform the output signal into a frequency-domain representation. Lastly, a Discriminator processes the frequency-domain representation of the signal and determines whether it is a “realistic” example of the signal/waveform.


Embodiments include use of an appropriate technique to transform or convert data between a time domain and a frequency domain. This disclosure describes Fourier-transform based transformations to produce data in the frequency domain. Such transformations include but are not limited to STFT, Gabor transform (a special case of STFT), FFT, discrete Fourier transform (DFT), or the discrete cosine transform (DCT).
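
As a brief illustration, the following Python (SciPy/NumPy) sketch applies two of the named transformations, an STFT and a DCT, to a test signal; the sample rate and the signal content are placeholders chosen for the example.

    import numpy as np
    from scipy.signal import stft
    from scipy.fft import dct

    fs = 16000                                    # assumed sample rate (Hz)
    t = np.linspace(0, 1, fs, endpoint=False)
    signal = np.sin(2 * np.pi * 440 * t)          # one second of a 440 Hz tone

    freqs, times, spectrogram = stft(signal, fs=fs, nperseg=512)  # STFT
    dct_coeffs = dct(signal, type=2, norm='ortho')                # DCT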


Another transformation that performs the same function is the LEarnable Audio Frontend (LEAF), introduced in https://arxiv.org/pdf/2101.08596. LEAF also generates a spectrogram, like Fourier-transform-based methods, but does so using machine learning. This makes it better able to capture data-specific features than a predefined transformation.


Wavelet transformations are a group of transformations including but not limited to the continuous wavelet transform (CWT) and the discrete wavelet transform (DWT). In general, they provide a more flexible and detailed time-frequency representation and are better suited for signals with varying structure and for applications requiring detailed time-frequency localization. Examples include high-frequency images, such as a picture of stars, or music audio that may contain large changes in frequency, such as percussion (see Wikipedia).


In addition to this transformation, additional processing steps can be performed on the resulting data. One example is applying a mel filter bank on top of the frequency representation of the data producing a mel-scale spectrogram. As a non-limiting example, for TTS, “compared to linear-frequency domain or time-domain speech enhancement, the major advantage of Mel-spectrogram enhancement is that Mel-frequency presents speech in a more compact way and thus is easier to learn, which will benefit both speech quality and ASR.” (See Mel-FullSubNet: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR).
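
A minimal sketch of this additional step using torchaudio, assuming typical TTS parameter values (22,050 Hz sample rate, 1024-point FFT, 80 mel bins) and a placeholder waveform:

    import torch
    import torchaudio

    mel_transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80)

    waveform = torch.randn(1, 22050)           # placeholder one-second waveform
    mel_spectrogram = mel_transform(waveform)  # shape: (1, 80, frames)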


Additionally, a transformation of the frequency domain can include adjusting the unit of frequency, such as using decibels on a logarithmic scale, power on a polynomial scale, or amplitude on a linear scale. The frequency domain can also be normalized. An example of this is decibels relative to full scale (dBFS), where decibel values are normalized to full scale. This places an upper bound on how loud a frequency component can be, making the scale easier to model, as in the sketch below.
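
A minimal sketch of this adjustment, converting magnitudes to decibels and normalizing so the loudest value sits at 0 dBFS; the epsilon is an assumed guard against taking the log of zero.

    import numpy as np

    def to_dbfs(magnitude: np.ndarray, eps: float = 1e-10) -> np.ndarray:
        db = 20.0 * np.log10(np.maximum(magnitude, eps))  # amplitude -> decibels
        return db - db.max()                              # 0 dBFS at the peak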


As mentioned, the input to the discriminator model can be a general tensor and is not necessarily a spectrogram. In addition, this can be either conditioned or unconditioned. The tensor may be generated or produced from an image, a waveform, a structure, or other suitable form or content.


As described herein, FIG. 2(e) illustrates examples of GAN architectures and loss or cost terms that may be used as part of a training process. Examples of a reason or context in which to select one are the following:

    • Least Squares GAN (LSGAN), which might be used to help reduce issues of vanishing gradients;
    • Wasserstein GAN (including WGAN and WGAN-GP), which is often less prone to mode collapse and could result in more stable training and better convergence (two of these loss formulations are sketched after this list);
    • Deep Regret Analytic GAN (DRAGAN), which focuses on improving local stability and robustness to perturbations in the discriminator;
    • Conditional GAN (CGAN), which generates outputs conditioned on input data (e.g., labels or images) and is thus often useful for tasks requiring control over the output;
    • Info-GAN, which maximizes the mutual information between a subset of latent variables and the generated data, allowing the model to learn disentangled representations that are more interpretable and meaningful;
    • Auxiliary Classifier GAN (ACGAN), which incorporates class information into the GAN framework to give more control over the output. This allows ACGANs to generate more realistic images and improve performance in various applications, such as medical imaging, cybersecurity, and music generation. ACGANs consist of a generator and a discriminator, with the discriminator also acting as a classifier to predict the class of the generated images;
    • Energy-Based GAN, which focuses on minimizing the energy of real samples and maximizing that of fake samples, sometimes providing more stable training;
    • Boundary Equilibrium GAN (BEGAN), which balances the generator and discriminator using a proportional control algorithm, which can make training more stable.
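
As a minimal sketch of two of these choices, the following Python (PyTorch) functions show the Discriminator-side losses for a Least Squares GAN and a Wasserstein GAN given raw (unbounded) Discriminator scores; the formulations follow the standard published versions rather than anything specific to the disclosure.

    import torch

    def lsgan_d_loss(real_scores, fake_scores):
        # Least Squares GAN: push real scores toward 1 and fake scores toward 0.
        return ((real_scores - 1) ** 2).mean() + (fake_scores ** 2).mean()

    def wgan_d_loss(real_scores, fake_scores):
        # Wasserstein GAN: widen the score gap between real and fake samples
        # (written as a quantity to minimize).
        return fake_scores.mean() - real_scores.mean()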


Beyond changing the type of GAN architecture being used, the discriminator itself can be manipulated. For instance, a fixed discriminator can be used, such as a pre-trained ASR model. One may choose this in the following non-limiting scenarios: when the generator needs a well-defined and strong adversarial signal to start learning effectively; to reduce training complexity when resources are limited or training needs to be quickened, such as for prototyping; or to stabilize the training process.


As mentioned herein, as an example, for tabular data, it is feasible to create a simple Discriminator which is not a Deep Neural Network. To implement this, the weights of the discriminator are not updated during the training process, and the discriminator is often pre-trained. As also mentioned, FIG. 2(d) is a block diagram illustrating the primary functions or operations implemented by an example embodiment of the disclosed system, apparatus, and method for training a GAN model (labeled the Generator Model in the figure) to generate a “realistic” waveform in the situation of using a pre-trained Discriminator. In addition to using a fixed discriminator, the discriminator does not necessarily need to be differentiable. Non-limiting examples include decision tree-based discriminators and MetricGAN. Non-differentiable discriminators can be more suitable for certain tasks that require explicit decision boundaries or where the data has a nonlinear structure. They can also be helpful if one wants to more easily interpret the discriminator's decision process.


As also mentioned, not only can the discriminator be manipulated, but embodiments are not limited to a single discriminator; depending on what is being tested, there is no limit to the number of discriminators included in the training. An example of this is the D2GAN architecture, which has two discriminators. This could be useful if the data inherently has multiple characteristics (distributions, scales, or domains), with each discriminator specializing in learning different aspects of those characteristics. To implement this, instead of the expected and the generated signals going into a single discriminator, they can go in parallel to any number of discriminators, and the output of each can be combined to produce the final label, as in the sketch below.
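
A minimal sketch of this parallel arrangement, assuming a list of PyTorch discriminator modules that all accept the same input shape; averaging is just one way the outputs might be combined.

    import torch

    def multi_discriminator_score(discriminators, spectrogram):
        scores = [d(spectrogram) for d in discriminators]  # run in parallel
        return torch.stack(scores).mean(dim=0)             # combine the outputs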


As mentioned, a motivation for the disclosed processing flow is that conventional models for generating waveforms do not produce realistic outputs. For example, the following paper describes some of the errors these models produce: https://arxiv.org/pdf/2010.14356.pdf. As noted in the paper's abstract, “A number of recent advances in neural audio synthesis rely on up-sampling layers, which can introduce undesired artifacts. In computer vision, up-sampling artifacts have been studied and are known as checkerboard artifacts (due to their characteristic visual pattern). However, their effect has been overlooked so far in audio processing. Here, we address this gap by studying this problem from the audio signal processing perspective. We (the paper's authors) first show that the main sources of up-sampling artifacts are: (i) the tonal and filtering artifacts introduced by problematic up-sampling operators, and (ii) the spectral replicas that emerge while up-sampling. We then compare different up-sampling layers, showing that nearest neighbor up-sampler(s) can be an alternative to the problematic (but state-of-the-art) transposed and subpixel convolutions which are prone to introduce tonal artifacts”.


The Generator model that processes the input can be a conditional or unconditional generator (a minimal sketch contrasting the two follows this list):

    • A conditional generator model needs some sort of input that correlates with the output. For example:
      • Transcription of audio can be mapped back to the audio;
      • A stock price is correlated with the stock class;
      • A speaker is correlated with the audio pitch and loudness; or
      • A movie rating is correlated with the director of the movie and the budget;
    • An unconditional generator model takes random noise as the input. For example:
      • An unconditional generator model can generate random music that is not conditional on an input; or
      • An unconditional generator model can generate random images.
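
A minimal sketch contrasting the two input styles, with placeholder tensor sizes; in the conditional case, a conditioning vector (e.g., an encoded transcription or label) is concatenated with the noise.

    import torch

    noise = torch.randn(1, 64)            # random noise vector

    # Unconditional: the noise alone is the Generator input.
    unconditional_input = noise

    # Conditional: append a conditioning vector correlated with the output.
    condition = torch.randn(1, 32)        # placeholder encoded condition
    conditional_input = torch.cat([noise, condition], dim=1)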


The present invention as described above can be implemented in the form of control logic using computer software in a modular or integrated manner. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement the present invention using hardware and a combination of hardware and software.


In some embodiments, certain of the methods, models or functions described herein may be embodied in the form of a trained neural network, where the network is implemented by the execution of a set of computer-executable instructions or representation of a data structure. The instructions may be stored in (or on) a non-transitory computer-readable medium and executed by a programmed processor or processing element. The set of instructions may be conveyed to a user through a transfer of instructions or an application that executes a set of instructions (such as over a network, e.g., the Internet). The set of instructions or an application may be utilized by an end-user through access to a SaaS platform or a service provided through such a platform. A trained neural network, trained machine learning model, or other form of decision or classification process may be used to implement one or more of the methods, functions, processes, or operations described herein. Note that a neural network or deep learning model may be characterized in the form of a data structure in which are stored data representing a set of layers containing nodes, and connections between nodes in different layers are created (or formed) that operate on an input to provide a decision or value as an output.


In general terms, a neural network may be viewed as a system of interconnected artificial “neurons” that exchange messages between each other. The connections have numeric weights that are “tuned” during a training process, so that a properly trained network will respond correctly when presented with an image or pattern to recognize (for example). In this characterization, the network consists of multiple layers of feature-detecting “neurons”; each layer has neurons that respond to different combinations of inputs from the previous layers. Training of a network is performed using a “labelled” dataset of inputs covering a wide assortment of representative input patterns, each associated with its intended output response. Training uses general-purpose methods to iteratively determine the weights for intermediate and final feature neurons. In terms of a computational model, each neuron calculates the dot product of its inputs and weights, adds the bias, and applies a non-linear trigger or activation function (for example, a sigmoid response function), as in the sketch below.
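
As a minimal sketch of the per-neuron computation just described (dot product of inputs and weights, plus a bias, passed through a sigmoid activation):

    import numpy as np

    def neuron(inputs: np.ndarray, weights: np.ndarray, bias: float) -> float:
        z = np.dot(inputs, weights) + bias   # weighted sum of inputs, plus bias
        return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation function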


Machine learning (ML) is being used to enable the analysis of data and assist in making decisions in multiple industries. To benefit from using machine learning, a machine learning algorithm is applied to a set of training data and labels to generate a “model” which represents what the application of the algorithm has “learned” from the training data. Each element (or example, in the form of one or more parameters, variables, characteristics, or “features”) of the set of training data is associated with a label or annotation that defines how the element should be classified by the trained model. A machine learning model is an algorithm that, based on the data and training provided to it, can predict outcomes or make a decision (such as a classification) regarding a sample of input data. When trained, the model will operate on a new element of input data to generate the correct label or classification as an output.


The disclosure includes the following clauses and embodiments:


1. A method of training a generative adversarial network including a Generator and a Discriminator, comprising performing a training cycle by:

    • inputting a first tensor including a set of elements to the Generator, each tensor element being a value and the set of elements representing a distribution of values;
    • operating the Generator to output a tensor representing a generated distribution of values;
    • processing the output tensor to convert the generated distribution of values to the frequency domain;
    • determining one or more error terms from the converted generated distribution;
    • obtaining a second tensor including a set of elements, each tensor element being a value and the set of elements representing an actual distribution of values;
    • converting the actual distribution of values to the frequency domain;
    • determining one or more error terms from the converted actual distribution;
    • inputting the converted generated distribution and the converted actual distribution to the Discriminator;
    • operating the Discriminator to determine an adversarial loss term, the adversarial loss term including a Generator loss term and a Discriminator loss term; and
    • using the Generator loss term, the error term from the converted generated distribution and the error term from the converted actual distribution to modify the Generator and using the Discriminator loss term to modify the Discriminator.


2. The method of clause 1, wherein converting the generated distribution of values to the frequency domain and converting the actual distribution of values to the frequency domain further comprises using a Fourier Transform.


3. The method of clause 1, wherein determining one or more error terms from the converted generated distribution and determining one or more error terms from the converted actual distribution further comprise determining an Lp loss.


4. The method of clause 1, wherein the set of elements included in the first tensor is derived from a waveform or signal.


5. The method of clause 4, wherein the waveform corresponds to audio content.


6. The method of clause 1, wherein the set of elements included in the first tensor is derived from an image.


7. The method of clause 1, wherein the set of elements included in the second tensor is derived from a waveform or signal.


8. The method of clause 1, wherein the set of elements included in the second tensor is derived from an image.


9. The method of clause 1, wherein the Generator includes a set of layers, with each layer containing a plurality of nodes, with a plurality of connections between a set of nodes in a first layer and a set of nodes in a second layer adjacent to the first layer, and with a set of weights with each weight in the set of weights associated with one of the plurality of connections, and the method further comprises repeating the training cycle until weights between layers of the Generator and weights between layers of the Discriminator become stable.


10. The method of clause 1, further comprising using the Generator as an inference engine to generate realistic examples of a tensor input to the Generator.


11. A system for training a generative adversarial network including a Generator and a Discriminator by performing one or more training cycles, comprising:

    • one or more electronic processors operable to execute a set of computer-executable instructions; and
    • the set of computer-executable instructions stored in a non-transitory medium, wherein when executed, the instructions cause the one or more electronic processors to
      • input a first tensor including a set of elements to the Generator, each tensor element being a value and the set of elements representing a distribution of values;
      • operate the Generator to output a tensor representing a generated distribution of values;
      • process the output tensor to convert the generated distribution of values to the frequency domain;
      • determine one or more error terms from the converted generated distribution;
      • obtain a second tensor including a set of elements, each tensor element being a value and the set of elements representing an actual distribution of values;
      • convert the actual distribution of values to the frequency domain;
      • determine one or more error terms from the converted actual distribution;
      • input the converted generated distribution and the converted actual distribution to the Discriminator;
      • operate the Discriminator to determine an adversarial loss term, the adversarial loss term including a Generator loss term and a Discriminator loss term; and
      • use the Generator loss term, the error term from the converted generated distribution and the error term from the converted actual distribution to modify the Generator and use the Discriminator loss term to modify the Discriminator.


12. The system of clause 11, wherein converting the generated distribution of values to the frequency domain and converting the actual distribution of values to the frequency domain further comprises using a Fourier Transform.


13. The system of clause 11, wherein determining one or more error terms from the converted generated distribution and determining one or more error terms from the converted actual distribution further comprise determining an Lp loss.


14. The system of clause 11, wherein the Generator includes a set of layers, with each layer containing a plurality of nodes, with a plurality of connections between a set of nodes in a first layer and a set of nodes in a second layer adjacent to the first layer, and with a set of weights with each weight in the set of weights associated with one of the plurality of connections, and the instructions further cause the one or more electronic processors to repeat the training cycle until weights between layers of the Generator and weights between layers of the Discriminator become stable.


15. The system of clause 12 wherein the instructions further cause the one or more electronic processors to use the Generator as an inference engine to generate realistic examples of a tensor input to the Generator.


16. A non-transitory medium including a set of computer-executable instructions that when executed by one or more programmed electronic processors, cause the processors to train a generative adversarial network including a Generator and a Discriminator by performing one or more training cycles by:

    • inputting a first tensor including a set of elements to the Generator, each tensor element being a value and the set of elements representing a distribution of values;
    • operating the Generator to output a tensor representing a generated distribution of values;
    • processing the output tensor to convert the generated distribution of values to the frequency domain;
    • determining one or more error terms from the converted generated distribution;
    • obtaining a second tensor including a set of elements, each tensor element being a value and the set of elements representing an actual distribution of values;
    • converting the actual distribution of values to the frequency domain;
    • determining one or more error terms from the converted actual distribution;
    • inputting the converted generated distribution and the converted actual distribution to the Discriminator;
    • operating the Discriminator to determine an adversarial loss term, the adversarial loss term including a Generator loss term and a Discriminator loss term; and
    • using the Generator loss term, the error term from the converted generated distribution and the error term from the converted actual distribution to modify the Generator and using the Discriminator loss term to modify the Discriminator.


17. The non-transitory medium of clause 16, wherein converting the generated distribution of values to the frequency domain and converting the actual distribution of values to the frequency domain further comprises using a Fourier Transform.


18. The non-transitory medium of clause 16, wherein determining one or more error terms from the converted generated distribution and determining one or more error terms from the converted actual distribution further comprise determining an Lp loss.


19. The non-transitory medium of clause 16, wherein the Generator includes a set of layers, with each layer containing a plurality of nodes, with a plurality of connections between a set of nodes in a first layer and a set of nodes in a second layer adjacent to the first layer, and with a set of weights with each weight in the set of weights associated with one of the plurality of connections, and the instructions further cause the one or more electronic processors to repeat the training cycle until weights between layers of the Generator and weights between layers of the Discriminator become stable.


20. The non-transitory medium of clause 16, wherein the instructions further cause the one or more electronic processors to use the Generator as an inference engine to generate realistic examples of a tensor input to the Generator.


Any of the software components, processes or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as Python, Java, JavaScript, C, C++, or Perl using conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands in (or on) a non-transitory computer-readable medium, such as a random-access memory (RAM), a read-only memory (ROM), a magnetic medium such as a hard drive or a floppy disk, or an optical medium such as a CD-ROM. In this context, a non-transitory computer-readable medium is almost any medium suitable for the storage of data or an instruction set, aside from a transitory waveform. Any such computer-readable medium may reside on or within a single computational apparatus and may be present on or within different computational apparatuses within a system or network.


According to one example implementation, the term processing element or processor, as used herein, may be a central processing unit (CPU), or conceptualized as a CPU (such as a virtual machine). In this example implementation, the CPU or a device in which the CPU is incorporated may be coupled, connected, and/or in communication with one or more peripheral devices, such as a display. In another example implementation, the processing element or processor may be incorporated into a mobile computing device, such as a smartphone or tablet computer.


The non-transitory computer-readable storage medium referred to herein may include a number of physical drive units, such as a redundant array of independent disks (RAID), a floppy disk drive, a flash memory, a USB flash drive, an external hard disk drive, thumb drive, pen drive, key drive, a High-Density Digital Versatile Disc (HD-DVD) optical disc drive, an internal hard disk drive, a Blu-Ray optical disc drive, or a Holographic Digital Data Storage (HDDS) optical disc drive, synchronous dynamic random access memory (SDRAM), or similar devices or other forms of memories based on similar technologies. Such computer-readable storage media allow the processing element or processor to access computer-executable process steps, application programs and the like, stored on removable and non-removable memory media, to off-load data from a device or to upload data to a device. As mentioned, with regards to the embodiments described herein, a non-transitory computer-readable medium may include almost any structure, technology or method apart from a transitory waveform or similar medium.


Certain implementations of the disclosed technology are described herein with reference to block diagrams of systems, and/or to flowcharts or flow diagrams of functions, operations, processes, or methods. It will be understood that one or more blocks of the block diagrams, or one or more stages or steps of the flowcharts or flow diagrams, and combinations of blocks in the block diagrams and stages or steps of the flowcharts or flow diagrams, respectively, may be implemented by computer-executable program instructions. Note that in some embodiments, one or more of the blocks, or stages or steps may not necessarily need to be performed in the order presented or may not necessarily need to be performed at all.


These computer-executable program instructions may be loaded onto a general-purpose computer, a special purpose computer, a processor, or other programmable data processing apparatus to produce a specific example of a machine, such that the instructions that are executed by the computer, processor, or other programmable data processing apparatus create means for implementing one or more of the functions, operations, processes, or methods described herein. These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable data processing apparatus to function in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement one or more of the functions, operations, processes, or methods described herein.


While certain implementations of the disclosed technology have been described in connection with what is presently considered to be the most practical and various implementations, it is to be understood that the disclosed technology is not to be limited to the disclosed implementations. Instead, the disclosed implementations are intended to cover various modifications and equivalent arrangements included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.


This written description uses examples to disclose certain implementations of the disclosed technology, and also to enable any person skilled in the art to practice certain implementations of the disclosed technology, including making and using any devices or systems and performing any incorporated methods. The patentable scope of certain implementations of the disclosed technology is defined in the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural and/or functional elements that do not differ from the literal language of the claims, or if they include structural and/or functional elements with insubstantial differences from the literal language of the claims.


All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and/or were set forth in its entirety herein.


The use of the terms “a” and “an” and “the” and similar referents in the specification and in the following claims are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “having,” “including,” “containing” and similar referents in the specification and in the following claims are to be construed as open-ended terms (e.g., meaning “including, but not limited to,”) unless otherwise noted. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value inclusively falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein may be performed in any suitable order unless otherwise indicated herein or clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate embodiments of the invention and does not pose a limitation to the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to each embodiment of the present invention.


As used herein (i.e., in the claims, figures, and specification), the term “or” is used inclusively to refer to items in the alternative and in combination.


Different arrangements of the components depicted in the drawings or described above, as well as components and steps not shown or described are possible. Similarly, some features and sub-combinations are useful and may be employed without reference to other features and sub-combinations. Embodiments of the invention have been described for illustrative and not restrictive purposes, and alternative embodiments will become apparent to readers of this patent. Accordingly, the present invention is not limited to the embodiments described above or depicted in the drawings, and various embodiments and modifications may be made without departing from the scope of the claims below.

Claims
  • 1. A method of training a generative adversarial network including a Generator and a Discriminator, comprising performing a training cycle by:
    inputting a first tensor including a set of elements to the Generator, each tensor element being a value and the set of elements representing a distribution of values;
    operating the Generator to output a tensor representing a generated distribution of values;
    processing the output tensor to convert the generated distribution of values to the frequency domain;
    determining one or more error terms from the converted generated distribution;
    obtaining a second tensor including a set of elements, each tensor element being a value and the set of elements representing an actual distribution of values;
    converting the actual distribution of values to the frequency domain;
    determining one or more error terms from the converted actual distribution;
    inputting the converted generated distribution and the converted actual distribution to the Discriminator;
    operating the Discriminator to determine an adversarial loss term, the adversarial loss term including a Generator loss term and a Discriminator loss term; and
    using the Generator loss term, the error term from the converted generated distribution and the error term from the converted actual distribution to modify the Generator and using the Discriminator loss term to modify the Discriminator.
  • 2. The method of claim 1, wherein converting the generated distribution of values to the frequency domain and converting the actual distribution of values to the frequency domain further comprises using a Fourier Transform.
  • 3. The method of claim 1, wherein determining one or more error terms from the converted generated distribution and determining one or more error terms from the converted actual distribution further comprise determining an Lp loss.
  • 4. The method of claim 1, wherein the set of elements included in the first tensor is derived from a waveform or signal.
  • 5. The method of claim 4, wherein the waveform corresponds to audio content.
  • 6. The method of claim 1, wherein the set of elements included in the first tensor is derived from an image.
  • 7. The method of claim 1, wherein the set of elements included in the second tensor is derived from a waveform or signal.
  • 8. The method of claim 1, wherein the set of elements included in the second tensor is derived from an image.
  • 9. The method of claim 1, wherein the Generator includes a set of layers, with each layer containing a plurality of nodes, with a plurality of connections between a set of nodes in a first layer and a set of nodes in a second layer adjacent to the first layer, and with a set of weights with each weight in the set of weights associated with one of the plurality of connections, and the method further comprises repeating the training cycle until weights between layers of the Generator and weights between layers of the Discriminator become stable.
  • 10. The method of claim 1, further comprising using the Generator as an inference engine to generate realistic examples of a tensor input to the Generator.
  • 11. A system for training a generative adversarial network including a Generator and a Discriminator by performing one or more training cycles, comprising:
    one or more electronic processors operable to execute a set of computer-executable instructions; and
    the set of computer-executable instructions stored in a non-transitory medium, wherein when executed, the instructions cause the one or more electronic processors to
      input a first tensor including a set of elements to the Generator, each tensor element being a value and the set of elements representing a distribution of values;
      operate the Generator to output a tensor representing a generated distribution of values;
      process the output tensor to convert the generated distribution of values to the frequency domain;
      determine one or more error terms from the converted generated distribution;
      obtain a second tensor including a set of elements, each tensor element being a value and the set of elements representing an actual distribution of values;
      convert the actual distribution of values to the frequency domain;
      determine one or more error terms from the converted actual distribution;
      input the converted generated distribution and the converted actual distribution to the Discriminator;
      operate the Discriminator to determine an adversarial loss term, the adversarial loss term including a Generator loss term and a Discriminator loss term; and
      use the Generator loss term, the error term from the converted generated distribution and the error term from the converted actual distribution to modify the Generator and use the Discriminator loss term to modify the Discriminator.
  • 12. The system of claim 11, wherein converting the generated distribution of values to the frequency domain and converting the actual distribution of values to the frequency domain further comprises using a Fourier Transform.
  • 13. The system of claim 11, wherein determining one or more error terms from the converted generated distribution and determining one or more error terms from the converted actual distribution further comprise determining an Lp loss.
  • 14. The system of claim 11, wherein the Generator includes a set of layers, with each layer containing a plurality of nodes, with a plurality of connections between a set of nodes in a first layer and a set of nodes in a second layer adjacent to the first layer, and with a set of weights with each weight in the set of weights associated with one of the plurality of connections, and the instructions further cause the one or more electronic processors to repeat the training cycle until weights between layers of the Generator and weights between layers of the Discriminator become stable.
  • 15. The system of claim 12 wherein the instructions further cause the one or more electronic processors to use the Generator as an inference engine to generate realistic examples of a tensor input to the Generator.
  • 16. A non-transitory medium including a set of computer-executable instructions that when executed by one or more programmed electronic processors, cause the processors to train a generative adversarial network including a Generator and a Discriminator by performing one or more training cycles by:
    inputting a first tensor including a set of elements to the Generator, each tensor element being a value and the set of elements representing a distribution of values;
    operating the Generator to output a tensor representing a generated distribution of values;
    processing the output tensor to convert the generated distribution of values to the frequency domain;
    determining one or more error terms from the converted generated distribution;
    obtaining a second tensor including a set of elements, each tensor element being a value and the set of elements representing an actual distribution of values;
    converting the actual distribution of values to the frequency domain;
    determining one or more error terms from the converted actual distribution;
    inputting the converted generated distribution and the converted actual distribution to the Discriminator;
    operating the Discriminator to determine an adversarial loss term, the adversarial loss term including a Generator loss term and a Discriminator loss term; and
    using the Generator loss term, the error term from the converted generated distribution and the error term from the converted actual distribution to modify the Generator and using the Discriminator loss term to modify the Discriminator.
  • 17. The non-transitory medium of claim 16, wherein converting the generated distribution of values to the frequency domain and converting the actual distribution of values to the frequency domain further comprises using a Fourier Transform.
  • 18. The non-transitory medium of claim 16, wherein determining one or more error terms from the converted generated distribution and determining one or more error terms from the converted actual distribution further comprise determining an Lp loss.
  • 19. The non-transitory medium of claim 16, wherein the Generator includes a set of layers, with each layer containing a plurality of nodes, with a plurality of connections between a set of nodes in a first layer and a set of nodes in a second layer adjacent to the first layer, and with a set of weights with each weight in the set of weights associated with one of the plurality of connections, and the instructions further cause the one or more electronic processors to repeat the training cycle until weights between layers of the Generator and weights between layers of the Discriminator become stable.
  • 20. The non-transitory medium of claim 16, wherein the instructions further cause the one or more electronic processors to use the Generator as an inference engine to generate realistic examples of a tensor input to the Generator.
CROSS REFERENCE TO RELATED APPLICATION

This application is a continuation-in-part of U.S. Non-Provisional application Ser. No. 18/747,197, entitled “System and Methods for Generating Realistic Waveforms,” filed Jun. 18, 2024, which is a continuation of U.S. Non-Provisional application Ser. No. 17/739,642, entitled “System and Methods for Generating Realistic Waveforms,” filed May 9, 2022, now issued U.S. Pat. No. 12,051,428. This application also claims the benefit of U.S. Provisional Application No. 63/186,634, entitled “System and Methods for Generating Realistic Waveforms,” filed May 10, 2021, the disclosure of which is incorporated, in its entirety (including the Appendix) by this reference.

Provisional Applications (1)
Number Date Country
63186634 May 2021 US
Continuations (1)
Number Date Country
Parent 17739642 May 2022 US
Child 18747197 US
Continuation in Parts (1)
Number Date Country
Parent 18747197 Jun 2024 US
Child 18967109 US