The present disclosure relates to the distillation of diffusion models. In disclosed embodiments, a deep equilibrium model may be utilized in a distilled architecture. Disclosed embodiments may enable distilling diffusion models directly from initial noise to a resulting image.
Diffusion models have shown exceptional performance across a broad spectrum of generative tasks, including high-quality image generation, audio, and video synthesis. Knowledge distillation refers to the process of transferring knowledge from a large model or set of models to a single smaller (and faster) model that can be practically deployed under real-world constraints. Knowledge distillation may be viewed as a form of model compression.
A key drawback of diffusion models is their slow generative process, which restricts the practical applicability of diffusion models in real-time or resource-constrained scenarios. Existing distillation methods for diffusion models aim to condense the multi-step sampling process into a more efficient few-step or single-step process. However, these methods demand multiple training passes to distill the lengthy sampling process, and they require large memory and computing resources because a dual copy of the model must be maintained. Disclosed embodiments may distill a multi-step diffusion process into a single-step generative model, using solely noise/image pairs.
In some embodiments, disclosed methods comprise converting noise into a noise embedding vector; tokenizing the noise embedding vector via an injection transformer; inputting the tokenized noise into an equilibrium transformer; solving a fixed point via the equilibrium transformer; and decoding the fixed point to generate an image sample.
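The sequence of operations above can be sketched end-to-end. The following is a minimal, runnable illustration in which toy random matrices stand in for the learned embedding, equilibrium, and decoder layers, and naive fixed-point iteration stands in for a black-box root solver; names such as `solve_fixed_point` and `generate`, and all shapes and constants, are illustrative and not part of the disclosed implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, D = 8, 8, 3, 16                 # noise height/width/channels, token width

# Hypothetical stand-in weights; a real GET uses learned transformer layers.
W_embed = rng.normal(0.0, 0.1, (W * C, D))   # noise embedding (one row of noise -> one token)
A = rng.normal(0.0, 0.05, (D, D))            # weight-tied equilibrium layer (contractive scale)
W_dec = rng.normal(0.0, 0.1, (D, W * C))     # decoder back to pixels

def solve_fixed_point(f, z0, iters=100):
    """Naive fixed-point iteration; Broyden's method or Anderson acceleration
    would converge faster in a real DEQ forward pass."""
    z = z0
    for _ in range(iters):
        z = f(z)
    return z

def generate(e):
    """One-step generation: noise -> tokens -> fixed point -> image sample."""
    h = e.reshape(H, W * C) @ W_embed            # (1) convert noise to embedding tokens
    n = h                                        # (2) injection (identity stand-in here)
    f = lambda z: np.tanh(z @ A + n)             # (3) weight-tied equilibrium layer
    z_star = solve_fixed_point(f, np.zeros_like(h))   # (4) solve for the fixed point
    return (z_star @ W_dec).reshape(H, W, C)     # (5) decode the fixed point to an image

e = rng.normal(size=(H, W, C))
x_tilde = generate(e)
```

The input injection n entering every iteration of f is what lets a single weight-tied layer, iterated to a fixed point, behave like an arbitrarily deep network conditioned on the noise.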
Embodiments of the present disclosure are described herein. It is to be understood, however, that the disclosed embodiments are merely examples and other embodiments can take various and alternative forms. The figures are not necessarily to scale; some features could be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention. As those of ordinary skill in the art will understand, various features illustrated and described with reference to any one of the figures can be combined with features illustrated in one or more other figures to produce embodiments that are not explicitly illustrated or described. The combinations of features illustrated provide representative embodiments for typical applications. Various combinations and modifications of the features consistent with the teachings of this disclosure, however, could be desired for particular applications or implementations.
Diffusion models have demonstrated remarkable performance on a wide range of generative tasks such as high-quality image generation and manipulation, audio synthesis, video, 3D shape, text, and molecule generation. These models are trained with a denoising objective derived from score matching, variational inference, or optimal transport, enabling them to generate clean data samples by progressively denoising the initial Gaussian noise during the inference process. Unlike adversarial training, the denoising objective leads to a more stable training procedure, which in turn allows diffusion models to scale up effectively. Despite the promising results, one major drawback of diffusion models is their slow generative process, which often necessitates hundreds to thousands of model evaluations. This computational complexity limits the applicability of diffusion models in real-time or resource-constrained scenarios.
In an effort to speed up the slow generative process of diffusion models, researchers have proposed distillation methods aimed at distilling the multi-step sampling process into a more efficient few-step or single-step process. However, these techniques often come with their own set of challenges. The distillation targets must be carefully designed to successfully transfer knowledge from the larger model to the smaller one. Further, distilling a long sampling process into a few-step process often calls for multiple training passes. Most of the prevalent techniques for online distillation require maintaining dual copies of the model, leading to increased memory and computing requirements. As a result, there is a clear need for more streamlined and efficient approaches that address the computational demands of distilling diffusion models without sacrificing the generative capabilities.
One proposed technique of knowledge distillation distills a multi-step Denoising Diffusion Implicit Model (“DDIM”) sampler into a single-step sampler by training the student model on one million synthetic image samples. Other techniques, such as progressive distillation, aim to distill a T-step teacher DDIM into a new T/2 step student DDIM, repeating this process until one-step models are achieved. Transitive closure time-distillation generalizes progressive distillation to distill N>2 steps together at once to reduce the overall number of training phases. Consistency models achieve online distillation in a single pass by taking advantage of a carefully designed teacher and distillation loss objective. Diffusion Fourier neural operator (“DFNO”) maps the initial Gaussian distribution to the solution trajectory of the reverse diffusion process by inserting the temporal Fourier integral operators in the pretrained U-Net backbone. Another approach proposes to distill classifier-free guided diffusion models into few-step generative models by first distilling a combined conditional and unconditional model, and then progressively distilling the resulting model for faster generation.
While distillation may be an effective approach to speed up sampling in existing diffusion models, there are alternate lines of work that reduce the length of sampling chains by considering alternate formulations of the diffusion model, correcting bias and truncation errors in the denoising process, or employing training-free fast samplers at inference. Several works, such as DDIM, Improved DDPM, FastDPM, SGM-CLD, and EDM, modify or optimize the forward diffusion process so that the denoising process can be made more efficient. For example, DDIM is based on a non-Markovian definition of the diffusion process that has the same training objective but whose reverse process is much faster to sample from. DPM-Solver and GENIE are higher-order ordinary differential equation (“ODE”) solvers that generate samples in a few steps. There are also approaches that combine diffusion models with other families of generative models for faster sampling.
It is an objective of the present disclosure to streamline the distillation process of diffusion models while retaining perceptual quality of images generated by the original model. Disclosed embodiments may exhibit enhancements over previous approaches. Disclosed embodiments may comprise a simple and effective technique that may distill a multi-step diffusion process into a single-step generative model using solely noise/image pairs. Disclosed embodiments may include an architecture comprising a novel Deep Equilibrium (“DEQ”) model, which may be referred to herein as a Generative Equilibrium Transformer (“GET”). In some disclosed embodiments, GET may be interpreted as an infinite depth network using weight-tied transformer layers, which solve for a fixed point in the forward pass. Disclosed embodiments allow for adaptive application of these layers in the forward pass, striking a balance between inference speed and sample quality. Furthermore, disclosed embodiments incorporate an almost parameter-free class conditioning mechanism in the architecture, expanding its utility to class-conditional image generation.
Approaches for distillation disclosed herein, which use noise/image pairs generated by a diffusion model, may be applied to both Vision Transformer (“ViT”) and GET-based architectures. However, in disclosed embodiments, the GET-based architecture achieves substantially improved results with smaller architectures. The GET-based architecture may produce perceptual image quality on par with or superior to other complex distillation techniques, such as progressive distillation, in the context of both conditional and unconditional image generation. In some embodiments, GET exhibits significantly better parameter and data efficiency compared to architectures like ViT; for example, GET matches the Fréchet inception distance (“FID”) scores of a 5× larger ViT, underscoring the transformative potential of GET in enhancing the efficiency of generative models.
Disclosed embodiments may be used to generate synthetic image datasets to train machine learning systems that can be used for any computer vision task, including: 1) computer-controlled machines, robots, vehicles, domestic appliances, power tools, manufacturing machines, personal assistants, and access control systems; and 2) systems for conveying information, such as surveillance systems or medical (imaging) systems. Disclosed embodiments may also be used to augment existing training datasets by generating additional images or by modifying the images in an existing dataset.
The slow generative process of diffusion models, requiring hundreds of model evaluations, limits their practicality in real-time or resource-constrained scenarios. To speed up the generative process, distillation methods have emerged that aim at distilling the multi-step sampling process into a more efficient few-step or single-step process. These methods are usually complex: they need multiple stages of training, the resulting distilled model may not retain the perceptual quality of the images generated by the original model, and many established approaches must keep a dual copy of the model, which in turn increases memory requirements.
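The single-stage alternative pursued herein, training a one-step model directly on stored noise/image pairs, can be illustrated with a toy regression. Below, a fixed linear map stands in for the multi-step teacher sampler (in practice, the "images" come from running a pretrained diffusion sampler on stored Gaussian noises), and a linear student stands in for the distilled one-step model; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                     # toy data dimensionality

# Offline noise/image pairs: a fixed linear map is a toy proxy for the teacher sampler.
W_teacher = rng.normal(size=(d, d))
noises = rng.normal(size=(256, d))
images = noises @ W_teacher

# Single-stage distillation: regress the one-step student directly onto the pairs.
# No dual model copy and no multi-stage schedule is needed for this objective.
W_student = np.zeros((d, d))
lr = 0.05
for step in range(500):
    idx = rng.integers(0, 256, size=32)
    e, x = noises[idx], images[idx]
    grad = e.T @ (e @ W_student - x) / len(idx)   # gradient of mean squared error
    W_student -= lr * grad

final_loss = np.mean((noises @ W_student - images) ** 2)
```

A GET would replace the linear student, but the training loop keeps this shape: sample stored (noise, image) pairs, run the student once, and minimize the reconstruction error.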
Deep equilibrium (“DEQ”) models (e.g., as described in Deep Equilibrium Models, Bai et al.) compute internal representations by solving for a fixed point in their forward pass. Specifically, consider a weight-tied deep feedforward model with L layers,

z^(l+1)=ƒθ(z^(l); x), l=0, . . . , L−1,

where x∈R^n is the input injected at every layer and z^(l) is the hidden state after layer l. As L→∞, the hidden state converges to a fixed point z*=ƒθ(z*; x).
DEQs directly solve for this fixed point z* using black-box root-finding algorithms, such as Broyden's method or Anderson acceleration, in the forward pass. Because explicit backpropagation through the exact operations of the forward pass is not possible, DEQs utilize implicit differentiation to analytically differentiate through the fixed point. Let gθ(z*; x)=ƒθ(z*; x)−z*; then the Jacobian of z* with respect to the model weights θ is given by

∂z*/∂θ=−(Jgθ|z*)^(−1) ∂ƒθ(z*; x)/∂θ,

where Jgθ|z* denotes the Jacobian of gθ with respect to z evaluated at z*.
Computing the inverse of the Jacobian can quickly become intractable while dealing with high-dimensional feature maps. The inverse-Jacobian term may be replaced with an identity matrix (i.e., a Jacobian-free or an approximate inverse-Jacobian) without affecting the final performance.
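A scalar toy case illustrates implicit differentiation through a fixed point and the Jacobian-free simplification. The layer ƒ(z; w, x)=tanh(wz+x) below is an illustrative stand-in; the implicit gradient is checked against finite differences taken through the solver itself.

```python
import numpy as np

def f(z, w, x):
    """One weight-tied layer f_theta(z; x) = tanh(w*z + x), scalar toy case."""
    return np.tanh(w * z + x)

def solve(w, x, iters=200):
    """Fixed-point iteration for z* = f(z*; x); a contraction here since |w| < 1."""
    z = 0.0
    for _ in range(iters):
        z = f(z, w, x)
    return z

w, x = 0.5, 0.3
z_star = solve(w, x)

# Implicit differentiation: with g(z) = f(z) - z = 0 at z*,
# dz*/dw = (df/dw) / (1 - df/dz), both partials evaluated at z*.
s = 1.0 - np.tanh(w * z_star + x) ** 2        # sech^2 factor of tanh
dz_dw_implicit = (s * z_star) / (1.0 - s * w)

# Jacobian-free variant: replace the inverse term (1 - df/dz)^(-1) by identity.
dz_dw_jacfree = s * z_star

# Finite-difference check differentiating through the converged solver.
eps = 1e-6
dz_dw_fd = (solve(w + eps, x) - solve(w - eps, x)) / (2 * eps)
```

The Jacobian-free value differs from the exact implicit gradient in magnitude but preserves a useful descent direction, which is why the inverse-Jacobian term can be dropped in practice.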
Diffusion models, or score-based generative models, progressively perturb images with an increasing amount of Gaussian noise and then reverse this process through sequential denoising to generate images. Specifically, consider a dataset of i.i.d. samples from pdata; the diffusion process {x(t)} for t∈[0, T] is given by an Itô stochastic differential equation (“SDE”):

dx=ƒ(x, t)dt+g(t)dw,
where w is the standard Wiener process, ƒ(⋅, t): R^d→R^d is the drift coefficient, g(⋅): R→R is the diffusion coefficient, and x(0)˜pdata and x(T)˜N(0, I). All diffusion processes have a corresponding deterministic process known as the probability flow ODE (“PF-ODE”) whose trajectories share the same marginal probability densities as the SDE. This ODE can be written as:

dx=−{dot over (σ)}(t)σ(t)∇x log p(x; σ(t))dt,
where σ(t) is the noise schedule of the diffusion process and ∇x log p(x; σ(t)) represents the score function. It has been shown that the optimal choice of σ(t) in Eq. (5) is σ(t)=t. Thus, the PF-ODE can be simplified to

dx/dt=(x−Dθ(x, t))/t,
where Dθ(⋅, t) is a denoiser function parametrized with a neural network that minimizes the expected L2 denoising error for samples drawn from pdata. Samples can be efficiently generated from this ODE through numerical methods like Euler's method, Runge-Kutta method, and Heun's second-order solver.
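A toy one-dimensional case, for which the optimal denoiser of Gaussian data is available in closed form, illustrates integrating the simplified PF-ODE dx/dt=(x−Dθ(x, t))/t with Heun's second-order solver. The geometric time grid and step count below are illustrative choices, not the disclosed schedule.

```python
import numpy as np

s = 1.0                                   # std of a toy 1-D data distribution N(0, s^2)

def denoise(x, t):
    """Closed-form optimal denoiser D(x, t) for N(0, s^2) data at noise level sigma(t)=t."""
    return x * s**2 / (s**2 + t**2)

def heun_sample(x_T, T=80.0, t_min=1e-3, steps=100):
    """Integrate dx/dt = (x - D(x, t)) / t from t=T down to t_min with Heun's method."""
    ts = np.geomspace(T, t_min, steps + 1)
    x = x_T
    for t, t_next in zip(ts[:-1], ts[1:]):
        d = (x - denoise(x, t)) / t                       # slope at t
        x_euler = x + (t_next - t) * d                    # Euler predictor
        d_next = (x_euler - denoise(x_euler, t_next)) / t_next
        x = x + (t_next - t) * 0.5 * (d + d_next)         # Heun corrector: average slopes
    return x

x_T = 3.0
x_0 = heun_sample(x_T)
# For this toy denoiser the ODE has the closed form x(t) = x_T * sqrt((s^2+t^2)/(s^2+T^2)),
# giving a reference value at t = t_min to compare against.
x_exact = x_T * np.sqrt((s**2 + 1e-6) / (s**2 + 80.0**2))
```

Each Heun step costs two denoiser evaluations; the hundreds of such evaluations in a full sampling chain are exactly what the disclosed one-step distillation removes.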
Disclosed embodiments comprise a one-step distilled diffusion model, referred to herein as GET. GET may comprise a DEQ vision transformer that may distill diffusion models into generative models that are capable of rapidly sampling images using only a single model evaluation. The GET model may map a set of Gaussian noises e and optional class labels (indicated by a dotted box in
Referring to
The noise embedding network 110 may convert input noise e∈R^(H×W×C) into a sequence of two-dimensional (“2D”) patches p∈R^(N×(P²·C)), where P is the patch size and N=HW/P² is the resulting number of patches. By adding standard sinusoidal position encoding, the noise embedding vector h is produced. Because the entire generative process is being directly distilled, the time embedding t that is common in standard diffusion models is not needed.
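The sinusoidal position encoding added to the patch tokens may be computed as in the standard transformer formulation; the function name below is illustrative.

```python
import numpy as np

def sinusoidal_position_encoding(n_tokens, d_model):
    """Standard fixed sinusoidal encoding, added to the patch tokens to form h.
    Even channels carry sin, odd channels carry cos, at geometrically spaced frequencies."""
    pos = np.arange(n_tokens)[:, None]                    # token (patch) index
    i = np.arange(d_model // 2)[None, :]                  # frequency index
    angles = pos / (10000.0 ** (2.0 * i / d_model))
    pe = np.empty((n_tokens, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_position_encoding(n_tokens=16, d_model=32)
```

Because the encoding is fixed rather than learned, it adds no parameters, in keeping with the lightweight conditioning used throughout the architecture.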
The injection transformer 130, in accordance with equation (8), may transform tokenized noise (e.g., embedding vector h) to an intermediate representation n. In
The injection transformer 130 and the equilibrium transformer 140 may be composed of a sequence of transformer blocks (i.e., GET blocks) that have similar block designs, for example, as illustrated in
GET blocks may utilize a similar block design for the noise injection (i.e., InjectionT) and the equilibrium layer (i.e., EquilibriumT), that may differ only at the injection interface. Specifically, the GET block may be built upon a standard Pre-LN transformer block, as shown below:

z′=z+Attention(LN(z), u),
z″=z′+FFN(LN(z′)).
Here, z∈R^(N×D) represents the latent token, u∈R^(N×3D) is the input injection, and LN, FFN, and Attention respectively stand for Layer Normalization, a 2-layer Feed-Forward Network with a hidden dimension of size D×E, and an attention layer with an injection interface.
For GET blocks used in the injection transformer 130, u is equal to the class embedding token c∈R^(1×3D) for conditional image generation; that is, u=c for conditional models and u=0 otherwise. For GET blocks in the equilibrium transformer, u is the broadcast sum of the noise injection n∈R^(N×D) and the class embedding token c∈R^(1×3D); that is, u=n+c for conditional models and u=n otherwise.
A standard transformer attention layer may be modified to incorporate an additive injection interface before the query q∈R^(N×D), key k∈R^(N×D), and value v∈R^(N×D):

[q, k, v]=zW_i+u, Attention(z, u)=MHA(q, k, v)W_0,
where W_i∈R^(D×3D) and W_0∈R^(D×D). The injection interface enables interactions between the latent tokens and the input injection in the multi-head dot-product attention (“MHA”) operation,

qk^T=(zW_q+u_q)(zW_k+u_k)^T,
where W_q, W_k∈R^(D×D) are slices from W_i, and u_q, u_k∈R^(N×D) are slices from u. This scheme adds no additional computational cost compared to the standard MHA operation, yet it achieves a similar effect as cross-attention and offers good stability during training.
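The additive injection interface can be sketched in a single-head form. The weights and the `injection_attention` name below are illustrative toy stand-ins; the point shown is that the injection u is simply added to the fused q/k/v projection, so the cost matches standard attention, and a 1×3D class token broadcasts over all token positions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 4, 8                               # tokens, model width

W_i = rng.normal(0.0, 0.1, (D, 3 * D))    # fused query/key/value projection
W_0 = rng.normal(0.0, 0.1, (D, D))        # output projection

def layer_norm(z, eps=1e-5):
    return (z - z.mean(-1, keepdims=True)) / np.sqrt(z.var(-1, keepdims=True) + eps)

def softmax(a):
    a = a - a.max(-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(-1, keepdims=True)

def injection_attention(z, u):
    """Single-head sketch of the injection interface: u is ADDED to the fused
    q/k/v projections, so no extra matrix multiply is needed versus standard attention."""
    qkv = layer_norm(z) @ W_i + u         # additive injection before q, k, v
    q, k, v = np.split(qkv, 3, axis=-1)
    att = softmax(q @ k.T / np.sqrt(D))
    return att @ v @ W_0

z = rng.normal(size=(N, D))
u = rng.normal(size=(1, 3 * D))           # e.g., a class token broadcast over positions
out = injection_attention(z, u)
```

Setting u to zero recovers the standard attention layer, which is why the same block design serves both unconditional and class-conditional models.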
The output from the equilibrium transformer 140, z*, may be decoded and rearranged to generate an image sample x̃ 160 in accordance with equation (10). The decoder 150 may comprise a Layer Normalization component and a linear layer to generate patches
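The decoder-side rearrangement of decoded patches back into an image can be sketched as follows. Shapes are illustrative, and the Layer Normalization and linear decode are omitted to focus on the patch rearrangement; `patchify` is included only to verify the round trip.

```python
import numpy as np

H = W = 8; C = 3; P = 4                   # image size, channels, patch size
G = H // P                                # patches per side
N = G * G                                 # number of patches

def patchify(x):
    """(H, W, C) image -> N flattened P*P*C patches (as in the embedding network)."""
    x = x.reshape(G, P, G, P, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(N, P * P * C)

def unpatchify(p):
    """Decoder-side rearrangement: N decoded patches -> (H, W, C) image sample."""
    p = p.reshape(G, G, P, P, C).transpose(0, 2, 1, 3, 4)
    return p.reshape(H, W, C)

x = np.random.default_rng(3).normal(size=(H, W, C))
roundtrip = unpatchify(patchify(x))
```

The two reshape/transpose pairs are exact inverses, so no information is lost between tokenization on the input side and rearrangement on the output side.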
The injection transformer in this embodiment 200 may include one or more GET blocks 214. Each GET block 214 includes layer normalization components 206, 210. Layer normalization is a technique used in deep learning to normalize the activations (output) of a neural network layer. It independently normalizes each training example across its features, reducing dependency on batch size. The layer normalization component 206 receives as input embedding vector h (not shown) from the embedding network 202.
Each GET block 214 includes a multi-head attention (“MHA”) component 208. Multi-head attention is a component used in transformer-based models. Attention mechanisms allow models to focus on different parts of the input sequence when making predictions or generating output. Multi-head attention extends this idea by using multiple sets of attention weights, called “attention heads,” to capture different types of relationships and information in the input sequence. The MHA component 208 may receive, as an input, the output from the layer normalization component 206. The MHA component 208 may optionally receive, as an input, the embedding vector c.
Each GET block 214 includes a second layer normalization component 210. The layer normalization component 210 may receive, as an input, the output of the MHA component 208.
Each GET block 214 includes a multi-layer perceptron (“MLP”) component 212. An MLP may consist of linear layers combined with non-linear activation functions. The linear layers perform transformations that involve weights and biases, while the non-linear activation functions introduce non-linearities into the network. The MLP component 212 may receive, as an input, the output from the second layer normalization component 210.
In
The equilibrium transformer in
As illustrated in
In some embodiments, as depicted in
In some embodiments, dedicated hardware implementations, such as application specific integrated circuits, programmable logic arrays and other hardware devices, can be constructed to implement one or more of the methods or functionalities described herein. Applications that may include the apparatus and systems of various embodiments can broadly include a variety of electronic and computer systems. One or more embodiments described herein may implement functions using two or more specific interconnected hardware modules or devices with related control and data signals that can be communicated between and through the modules, or as portions of an application-specific integrated circuit. Accordingly, the present system encompasses software, firmware, and hardware implementations, or combinations thereof.
While the computer-readable medium is shown to be a single medium, the term “computer-readable medium” includes a single medium or multiple media, such as a centralized or distributed database, and/or associated caches and servers that store one or more sets of instructions. The term “computer-readable medium” shall also include any medium that is capable of storing or encoding a set of instructions for execution by a processor or that cause a computer system to perform any one or more of the methods or functionalities disclosed herein.
In some embodiments, some or all of the computer-readable media will be non-transitory media. In a particular non-limiting, exemplary embodiment, the computer-readable medium can include a solid-state memory such as a memory card or other package that houses one or more non-volatile read-only memories. Further, the computer-readable medium can be a random access memory or other volatile re-writable memory. Additionally, the computer-readable medium can include a magneto-optical or optical medium, such as a disk or tapes or other storage device to capture carrier wave signals such as a signal communicated over a transmission medium.
Disclosed herein is a Generative Equilibrium Transformer (“GET”), a deep equilibrium vision transformer that is well-suited for single-step generative models. GET's unique architecture allows striking a balance between inference speed and the quality of generated images. Also disclosed herein is a streamlined diffusion distillation process, along with a showing that training directly on noise/image pairs from diffusion models (with the GET architecture) is an effective strategy for distilling a multi-step sampling chain into a one-step generative model, in both class-conditional and unconditional cases. One or more disclosed embodiments show that implicit models for generative tasks can strictly outperform classic networks in terms of performance, model size, model compute, training memory, and speed.
While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms encompassed by the claims. The words used in the specification are words of description rather than limitation, and it is understood that various changes can be made without departing from the spirit and scope of the disclosure. As previously described, the features of various embodiments can be combined to form further embodiments of the invention that may not be explicitly described or illustrated. While various embodiments could have been described as providing advantages or being preferred over other embodiments or prior art implementations with respect to one or more desired characteristics, those of ordinary skill in the art recognize that one or more features or characteristics can be compromised to achieve desired overall system attributes, which depend on the specific application and implementation. These attributes can include, but are not limited to strength, durability, marketability, appearance, packaging, size, serviceability, weight, manufacturability, ease of assembly, etc. As such, embodiments described as less desirable than other embodiments or prior art implementations with respect to one or more characteristics are not outside the scope of the disclosure and can be desirable for particular applications.