In at least one aspect, the present invention relates to autoencoding for compressing and denoising images and text and other forms of data. In another aspect, a method related to autoencoding for generating images, text, and other forms of data is provided.
Ordinary unidirectional autoencoders (AEs) use separate networks for encoding and decoding. AE networks learn or approximate identity mappings from unlabeled data or patterns. AEs can compress or summarize patterns or text, which lets them generate new patterns from old patterns. They can combine with large language models (LLMs) such as chat-AI GPTs to improve the performance of LLMs [1]-[3]. AEs also apply to a wide range of problems in data compression or dimension reduction [4]-[6], image denoising [7]-[11], feature extraction [12], anomaly detection [13]-[16], collaborative filtering [17], [18], and sentiment analysis [19]-[21].
Variational autoencoders (VAEs) build on traditional AEs by introducing probabilistic representations to model the data distribution more effectively. VAEs are widely used for tasks such as generative modeling, dimensionality reduction, and anomaly detection. They employ separate networks for encoding and decoding and utilize a probabilistic latent space to enable data reconstruction and generation. This probabilistic framework allows VAEs to balance reconstruction accuracy and the regularization of latent space, making them a versatile tool for various applications in machine learning [37]-[41].
Although current autoencoder technology works well, traditional autoencoders, including unidirectional VAEs, rely on separate networks for encoding and decoding, doubling the parameter count and computational burden.
Accordingly, there is a need for improving efficiency, scalability, and performance in data modeling tasks performed by autoencoders.
In a variation, the bidirectional autoencoder is a bidirectional variational autoencoder. Characteristically, the bidirectional variational autoencoder uses a single parametrized network for encoding and decoding.
A method of training a bidirectional autoencoder is provided. The bidirectional autoencoder includes a single bidirectional network for encoding and decoding wherein the encoding and decoding use the same synaptic weights. The single bidirectional network is configured to run as an encoder in a forward direction and as a decoder in a backward direction. The method includes steps of performing a forward pass through the bidirectional network to encode input data into a latent representation using a first set of synaptic weights and performing a backward pass through the bidirectional network to decode the latent representation into reconstructed data using a transpose of the first set of synaptic weights. The method also includes a step of optimizing a training error function with an optimization process that incorporates forward likelihood and backward likelihood to enhance data reconstruction accuracy.
In another aspect, a method for training a bidirectional variational autoencoder is provided. The bidirectional variational autoencoder includes a single bidirectional network for encoding and decoding wherein the encoding and decoding use the same synaptic weights. The single bidirectional network is configured to run as an encoder in a forward direction and as a decoder in a backward direction. The method includes steps of receiving a dataset of input samples and defining a latent space dimension and initializing synaptic weights for encoding and decoding, a learning rate, and other hyperparameters. The method further includes a step of iteratively performing steps 1)-4) comprising:
In another aspect, the training loss for the bidirectional variational autoencoder is minimized using a gradient-based optimization algorithm.
For a further understanding of the nature, objects, and advantages of the present disclosure, reference should be made to the following detailed description, read in conjunction with the following drawings, wherein like reference numerals denote like elements and wherein:
Reference will now be made in detail to presently preferred embodiments and methods of the present invention, which constitute the best modes of practicing the invention presently known to the inventors. The Figures are not necessarily to scale. However, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. Therefore, specific details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for any aspect of the invention and/or as a representative basis for teaching one skilled in the art to variously employ the present invention.
It is also to be understood that this invention is not limited to the specific embodiments and methods described below, as specific components and/or conditions may, of course, vary. Furthermore, the terminology used herein is used only for the purpose of describing particular embodiments of the present invention and is not intended to be limiting in any way.
It must also be noted that, as used in the specification and the appended claims, the singular form “a,” “an,” and “the” comprise plural referents unless the context clearly indicates otherwise. For example, reference to a component in the singular is intended to comprise a plurality of components.
The term “comprising” is synonymous with “including,” “having,” “containing,” or “characterized by.” These terms are inclusive and open-ended and do not exclude additional, unrecited elements or method steps.
The phrase “consisting of” excludes any element, step, or ingredient not specified in the claim. When this phrase appears in a clause of the body of a claim, rather than immediately following the preamble, it limits only the element set forth in that clause; other elements are not excluded from the claim as a whole.
The phrase “consisting essentially of” limits the scope of a claim to the specified materials or steps, plus those that do not materially affect the basic and novel characteristic(s) of the claimed subject matter.
With respect to the terms “comprising,” “consisting of,” and “consisting essentially of,” where one of these three terms is used herein, the presently disclosed and claimed subject matter can include the use of either of the other two terms.
It should also be appreciated that integer ranges explicitly include all intervening integers. For example, the integer range 1-10 explicitly includes 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. Similarly, the range 1 to 100 includes 1, 2, 3, 4 . . . 97, 98, 99, 100. Similarly, when any range is called for, intervening numbers that are increments of the difference between the upper limit and the lower limit divided by 10 can be taken as alternative upper or lower limits. For example, if the range is 1.1 to 2.1, the following numbers 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, and 2.0 can be selected as lower or upper limits.
When referring to a numerical quantity, in a refinement, the term “less than” includes a lower non-included limit that is 5 percent of the number indicated after “less than.” A lower non-included limit means that the numerical quantity being described is greater than the value indicated as the lower non-included limit. For example, “less than 20” includes a lower non-included limit of 1 in a refinement. Therefore, this refinement of “less than 20” includes a range between 1 and 20. In another refinement, the term “less than” includes a lower non-included limit that is, in increasing order of preference, 20 percent, 10 percent, 5 percent, 1 percent, or 0 percent of the number indicated after “less than.”
The term “one or more” means “at least one” and the term “at least one” means “one or more.” The terms “one or more” and “at least one” include “plurality” as a subset.
The term “substantially,” “generally,” or “about” may be used herein to describe disclosed or claimed embodiments. The term “substantially” may modify a value or relative characteristic disclosed or claimed in the present disclosure. In such instances, “substantially” may signify that the value or relative characteristic it modifies is within ±0%, 0.1%, 0.5%, 1%, 2%, 3%, 4%, 5% or 10% of the value or relative characteristic.
The processes, methods, or algorithms disclosed herein can be deliverable to/implemented by a processing device, controller, or computer, which can include any existing programmable electronic control unit or dedicated electronic control unit. Similarly, the processes, methods, or algorithms can be stored as data and instructions executable by a controller or computer in many forms including, but not limited to, information permanently stored on non-writable storage media such as ROM devices and information alterably stored on writeable storage media such as floppy disks, magnetic tapes, CDs, RAM devices, and other magnetic and optical media. The processes, methods, or algorithms can also be implemented in a software executable object. Alternatively, the processes, methods, or algorithms can be embodied in whole or in part using suitable hardware components, such as Application Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), state machines, controllers or other hardware components or devices, or a combination of hardware, software, and firmware components.
Throughout this application, where publications are referenced, the disclosures of these publications in their entireties are hereby incorporated by reference into this application to more fully describe the state of the art to which this invention pertains.
The term “likelihood” refers to a function that measures how plausible a set of parameters θ is for a given statistical model, based on observed data x. Formally, if p(x|θ) represents the probability density or mass function of the observed data conditioned on the parameters, then the likelihood L(θ|x) is defined as L(θ|x)=p(x|θ). Unlike a probability distribution, the likelihood is not normalized with respect to θ and does not sum (or integrate) to 1. The likelihood function is central to many statistical and machine learning methods, such as Maximum Likelihood Estimation (MLE), where the goal is to find the parameter values θ that maximize L(θ|x), making the observed data most plausible under the model.
The term “joint likelihood” extends the concept of likelihood to account for multiple observations or events, providing a measure of how plausible the entire dataset is, given a statistical model. For a dataset consisting of n independent and identically distributed (i.i.d.) observations X={x1, x2, . . . , xn}, the joint likelihood is the product of the likelihoods of each observation: L(θ|X)=Πi=1n p(xi|θ). If the observations are not independent, the joint likelihood incorporates their dependencies and is expressed as L(θ|X)=p(x1, x2, . . . , xn|θ). The joint likelihood plays a critical role in Bayesian inference and machine learning, as it combines information from all data points to quantify how well the model parameters θ explain the observed dataset.
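As a purely illustrative example, the joint log-likelihood of a small set of i.i.d. observations under a hypothetical Gaussian model with parameters θ=(μ, σ) may be computed as in the following sketch; the data values, parameter choices, and the Gaussian model itself are assumptions made only for illustration.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical i.i.d. observations and candidate parameters (illustrative values only).
    x = np.array([1.2, 0.8, 1.5, 0.9])
    mu, sigma = 1.0, 0.5

    # Joint likelihood L(theta|X) = product over i of p(x_i|theta).
    # Summing log-densities gives the joint log-likelihood and avoids numerical underflow.
    log_likelihood = np.sum(norm.logpdf(x, loc=mu, scale=sigma))
    joint_likelihood = np.exp(log_likelihood)
    print(joint_likelihood, log_likelihood)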
The term “latent representation” refers to a compact, often lower-dimensional, encoding of input data that captures its most relevant features or characteristics. In the context of machine learning and neural networks, latent representations are typically the outputs of an intermediate layer in the network, such as the bottleneck layer of an autoencoder. In the context of a bidirectional autoencoder, a latent representation is the intermediate output generated during the forward pass, where the input data is transformed into a condensed form. This representation is stored in the latent space and serves as a bridge between the encoding and decoding processes.
In an embodiment, a bidirectional autoencoder is provided. Referring to
In the bidirectional autoencoder, convolutional neural network 16, fully connected neural network 18, and output layer 20 work together within a single bidirectional network that performs both encoding and decoding using the same synaptic weights. Convolutional neural network 16 plays a critical role in feature extraction, particularly for high-dimensional and structured input data such as images. Through convolutional operations, convolutional neural network 16 detects localized spatial features like edges, textures, and patterns, progressively abstracting them across multiple layers. This process not only captures meaningful details but also reduces the input's spatial dimensions, retaining essential information while improving computational efficiency. Following convolutional neural network 16, fully connected neural network 18 integrates and transforms the extracted feature maps into a holistic representation. Each neuron in fully connected neural network 18 connects to all outputs of the previous layer, allowing for global interactions among the features. Fully connected neural network 18 maps the high-dimensional output of convolutional neural network 16 into a compact latent representation (z) in the latent space. This latent representation captures the most relevant characteristics of the input data, serving as the interface between the encoding and decoding phases. The output layer 20 of fully connected neural network 18 finalizes this process during the forward pass, outputting the latent representation (z), which is used for reconstruction or downstream tasks such as generation and classification. Together, convolutional neural network 16, fully connected neural network 18, and output layer 20 form an efficient and streamlined processing pipeline for the bidirectional autoencoder. During the forward pass (encoding), the input is processed through convolutional neural network 16, fully connected neural network 18, and output layer 20 to produce the latent representation (z). In the backward pass (decoding), the process is reversed, with the latent representation passed back through the output layer 20, fully connected neural network 18, and convolutional neural network 16 to reconstruct the input data. Convolutional neural network 16 specializes in capturing localized patterns, while the fully connected neural network 18 and output layer 20 work together to integrate these features into a cohesive and compact representation. This complementary relationship ensures that the bidirectional autoencoder efficiently encodes and decodes high-dimensional data while maintaining performance and reducing computational overhead.
In another aspect, the bidirectional autoencoder operates such that during the forward pass (or forward inference), the input data is processed through a rectangular weight matrix W, transforming the input into a latent representation. During the backward pass (or backward inference), the process is reversed to reconstruct the original input from the latent representation. This reconstruction is achieved by passing the latent representation through the transpose of the same weight matrix, WT. By using the same weight matrix for both encoding (forward pass) and decoding (backward pass), the model simplifies its structure, reduces the number of trainable parameters, and ensures consistent transformations between the input and latent spaces.
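The following minimal sketch illustrates this shared-weight arrangement with a single linear layer, assuming PyTorch; the layer sizes, sigmoid activations, and separate bias vectors are illustrative simplifications of the multi-layer networks described herein.

    import torch

    torch.manual_seed(0)
    x = torch.rand(8, 784)                  # batch of flattened input patterns (e.g., 28x28 images)
    W = 0.01 * torch.randn(784, 64)         # single rectangular weight matrix shared by both passes
    b_z = torch.zeros(64)                   # latent-layer bias
    b_x = torch.zeros(784)                  # reconstruction-layer bias

    z = torch.sigmoid(x @ W + b_z)          # forward pass (encoding) through W
    x_hat = torch.sigmoid(z @ W.t() + b_x)  # backward pass (decoding) through the transpose W^T

Because the backward pass reuses W through its transpose, the encoder and decoder share a single set of trainable weights.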
In another aspect, the bidirectional autoencoder learns or approximates an identity mapping from an input pattern space to the same or similar output pattern space. In a refinement, the bidirectional autoencoder is trained with a bidirectional backpropagation algorithm. In a further refinement, training maximizes a backward likelihood p(x|y, θ) of the bidirectional networks. In still a further refinement, an error function εM(x, θ) is minimized during training. Characteristically, the training error function εM(x, θ) equals the negative log-likelihood of M training samples with the assumption of independent and identical distribution.
In another aspect, a method of training a bidirectional autoencoder (BAE) is provided. The method includes performing a forward pass on input data to encode it into a latent representation using a first set of synaptic weights and performing a backward pass through the bidirectional network to decode the latent representation into reconstructed data using a transpose of the first set of synaptic weights. The method further includes a step of optimizing a training error function with an optimization process that incorporates forward likelihood and backward likelihood to enhance (e.g., increase) data reconstruction accuracy.
In a refinement, the optimization includes calculating a backward-error loss function based on the difference between the reconstructed data and the input data and updating the synaptic weights using a gradient descent algorithm to minimize the loss function. In a refinement, the loss function is a negative log-likelihood function of the reconstructed data, which can be expressed as a cross-entropy between the input data and the reconstructed data. Advantageously, the forward and backward passes in the method are performed using shared synaptic weights within a single network, eliminating the need for separate encoder and decoder networks. In a further refinement, the method further comprises dynamically adjusting training parameters during training, including at least one of batch size, learning rate, number of epochs, or number of iterations per epoch. The forward pass in the method can transform the input data into a latent representation through a neural network, and the backward pass reconstructs the input data using the transposed synaptic weights. In still a further refinement, the method further comprises applying the trained bidirectional autoencoder to tasks including image compression, image denoising, or feature extraction. The training process in the method can be validated using performance metrics, including Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM).
In another aspect, an algorithm trains the bidirectional autoencoder to minimize the backward error Eb(θ), ensuring the reconstructed outputs (ax) match the target outputs (y). The forward pass compresses the input into a latent representation, while the backward pass reconstructs the input. The gradient descent step iteratively adjusts the network parameters θ, enabling the autoencoder to learn an efficient representation of the data and perform tasks like image denoising. The input requirements include a dataset (D) of training samples where D={(x(i), y(i))}i=1N. In step 3 (Forward Pass (Encoding)), the input signal is encoded into the latent representation:
az(l)=Nθ(x(l))
where Nθ is the encoding function, parameterized by θ, which maps input x(l) to a lower-dimensional latent space, and az(l) is the latent (encoded) representation of input x(l). In step 4 (Backward Pass (Decoding)), the latent variable az(l) is decoded to reconstruct the input signal using the decoder NθT (the transpose of the encoder):
ax(l)=NθT(az(l))
NθT is the decoding function, parameterized by θ, which reconstructs the input signal from the latent representation, and ax(l) is the reconstructed version of the input signal. In step 5 (Compute Backward Error (Eb(θ))), the backward error is computed using the binary cross-entropy loss function:
Eb(θ)=−Σl=1LΣk=1K[yk(l) ln akx(l)+(1−yk(l)) ln(1−akx(l))]
where yk(l) is the ground truth (target) value for the k-th output of the l-th sample. yk(l) is a binary value where yk(l)=1 if the k-th class is the correct target and yk(l)=0 otherwise. akx(l) is the predicted activation (output probability) for the k-th output of the l-th sample, computed during the backward pass. L is the number of samples in the current batch and K is the number of output nodes (or classes). The first term, yk(l) ln akx(l), measures the loss for correctly predicting the target class. The second term, (1−yk(l)) ln(1−akx(l)), penalizes the model for assigning probabilities to incorrect classes. In step 6 (Update the Weights), the model parameters θ are updated using gradient descent: θ(m+1)=θ(m)−η∇θEb(θ)|θ=θ(m), where η is the learning rate, which controls the magnitude of the parameter update, and ∇θEb(θ) is the gradient of the backward error Eb(θ) with respect to the parameters θ. In step 7, the loop continues over batches and epochs until the total number of epochs M is completed.
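The following sketch shows one way the above loop could be realized for a single-hidden-layer bidirectional network, assuming PyTorch, pixel values scaled to [0, 1], and a binary cross-entropy backward error; the layer sizes, learning rate, and epoch count are illustrative rather than prescriptive.

    import torch

    def train_bae(data_loader, n_in=784, n_latent=64, eta=1e-3, epochs=10):
        # data_loader is assumed to yield batches of input tensors with values in [0, 1].
        # Single set of synaptic weights theta: the encoder uses W and the decoder reuses W^T.
        W = torch.nn.Parameter(0.01 * torch.randn(n_in, n_latent))
        b_z = torch.nn.Parameter(torch.zeros(n_latent))
        b_x = torch.nn.Parameter(torch.zeros(n_in))
        opt = torch.optim.SGD([W, b_z, b_x], lr=eta)    # gradient-descent update of theta
        bce = torch.nn.BCELoss()                        # binary cross-entropy backward error E_b

        for epoch in range(epochs):                     # loop over epochs
            for x in data_loader:                       # loop over mini-batches
                x = x.view(x.size(0), -1)               # flatten each sample
                z = torch.sigmoid(x @ W + b_z)          # forward pass (encoding)
                x_hat = torch.sigmoid(z @ W.t() + b_x)  # backward pass (decoding) through W^T
                loss = bce(x_hat, x)                    # compute backward error E_b(theta)
                opt.zero_grad()
                loss.backward()                         # gradient of E_b with respect to theta
                opt.step()                              # theta <- theta - eta * gradient
        return W, b_z, b_x

Here the target equals the input (y = x), consistent with the identity mapping that the autoencoder approximates.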
In another aspect, the bidirectional autoencoder 10 described herein has a range of applications across various domains due to its ability to encode input data into compact latent representations and reconstruct data with high fidelity. One key application is image compression, where the BAE reduces the dimensionality of high-resolution images by encoding them into a latent space, significantly lowering storage and transmission requirements. The reconstructed images from the compressed features retain essential details, making this approach highly effective for efficient storage in resource-constrained systems and reducing bandwidth usage in image transmission. In a refinement, a method for compressing images using bidirectional autoencoder 10 involves processing the input data through a series of coordinated steps. First, an input image is received through an input layer of the bidirectional network. During the forward pass, spatial features are extracted from the input image using a convolutional neural network 16. These features are then transformed into a latent representation by a fully connected neural network 18, which serves as a compact and efficient encoding of the input image. During the backward pass, the input image is reconstructed from the latent representation using transposed operations of the convolutional neural network 16 and fully connected neural network 18. The latent representation, which encapsulates the essential information of the input image in a reduced-dimensional form, is then stored or transmitted as the compressed representation of the image. This method ensures efficient compression while maintaining the ability to reconstruct the original image with high fidelity.
In another aspect, another important application is image denoising, where bidirectional autoencoder 10 is trained to reconstruct clean images from noisy inputs. By learning to separate noise from underlying features in the data, bidirectional autoencoder 10 enhances image quality, making it valuable for applications such as medical imaging (e.g., improving the clarity of MRI or CT scans), surveillance (e.g., refining low-light or grainy video feeds), and professional photography (e.g., removing unwanted visual artifacts). In a refinement, a method for denoising images using a bidirectional autoencoder 10 includes the following steps. First, a noisy input image is received through an input layer of the bidirectional network. During the forward pass, spatial features are extracted from the noisy image using a convolutional neural network 16, which captures essential patterns while reducing noise. These extracted features are then transformed into a latent representation by a fully connected neural network 18, providing a compact encoding of the input data. In the backward pass, a denoised image is reconstructed from the latent representation using transposed operations of the convolutional neural network 16 and fully connected neural network 18. To enhance the network's performance, the parameters are optimized by minimizing a reconstruction loss, which quantifies the difference between the reconstructed denoised image and a reference image, ensuring high-quality denoising results.
In another aspect, bidirectional autoencoder 10 also excels in feature extraction, where it learns a compact latent representation of the input data that preserves the most relevant features. These latent features can be used in downstream tasks such as clustering, classification, and anomaly detection. For example, in image recognition tasks, the latent features extracted by the BAE can serve as inputs to classifiers, reducing computational complexity and improving model performance.
In addition, bidirectional autoencoder 10 is a powerful tool for data visualization. By combining the encoded features with dimensionality reduction techniques like t-SNE, it enables the visualization of high-dimensional datasets in a lower-dimensional space, facilitating pattern recognition and exploratory data analysis. For instance, in research and diagnostics, t-SNE visualizations of the BAE's encoded features can reveal relationships or clusters that are otherwise hidden in raw high-dimensional data.
The BAE is also highly effective for representation learning, where the encoded latent space provides a meaningful representation of the input data. These learned representations can be reused in transfer learning scenarios or as pre-trained features for other machine learning models, improving their performance and reducing training times. For example, in complex tasks like object detection, the BAE's learned features can serve as a robust foundation for downstream models.
Another significant application is anomaly detection, where bidirectional autoencoder 10 identifies discrepancies between reconstructed and input data. This capability is especially valuable in fields such as fraud detection, where unusual patterns in financial transactions can be flagged, or in industrial settings, where equipment malfunctions can be detected early by identifying deviations from normal operation patterns.
Finally, bidirectional autoencoder 10 can be applied to domain-specific tasks such as handwriting recognition using datasets like MNIST and object recognition using datasets like CIFAR-10. In these scenarios, bidirectional autoencoder 10 leverages its ability to encode and reconstruct data to process and analyze structured datasets, improving classification accuracy and enabling enhanced image analysis. These applications demonstrate bidirectional autoencoder 10's versatility in solving real-world problems that require data reconstruction, noise reduction, and dimensionality reduction.
In another aspect, an application-specific integrated circuit (ASIC) for bidirectional autoencoding is provided. The ASIC includes an input interface configured to receive input data (e.g., noisy image data). The ASIC further includes a bidirectional autoencoding processing unit, a memory module, an optional control unit, and an output interface. The bidirectional autoencoding processing unit includes a forward processing module that encodes the input image data into a latent representation using a single set of synaptic weight matrices and a backward processing module that decodes the latent representation into a reconstructed image by applying the transpose of the synaptic weight matrices. The memory module stores the synaptic weight matrices shared between the forward and backward processing modules and the latent representation generated during encoding. In a refinement, the control unit is configured to execute a bidirectional backpropagation training algorithm that optimizes the synaptic weight matrices by maximizing the backward likelihood of the reconstructed image with respect to the input data, using cross-entropy as the loss function and operating in an inference mode to perform tasks (e.g., real-time image compression and denoising) by processing data through the bidirectional autoencoding processing unit. The ASIC further includes an output interface configured to output the reconstructed data. In a refinement, a performance monitoring module computes peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) metrics to evaluate the quality of the reconstructed image.
In a variation, a system for encoding and decoding data using a bidirectional variational autoencoder is provided. Referring to
μ=Wθh+bμ
where Wθ is the weight matrix used for predicting the mean, h is the feature vector, and bμ is the bias vector for branch 42. The other branch 44 predicts the log-variance (ln σ2).
ln σ2=Vθh+blnσ2
where Vθ is the weight matrix used for predicting the log-variance, h is the feature vector, and blnσ2 is the bias vector for branch 44. The output of these layers is used in the reparameterization step to sample from the latent distribution: z=μ+ϵ·σ where σ=√(exp(ln σ2)).
In the variational autoencoder system 30, the convolutional neural network 36 and fully connected neural network 38 work in tandem to encode input data into a latent representation and facilitate its reconstruction. Convolutional neural network 36 is primarily tasked with extracting meaningful features from the input data, such as images, vectors, or other structured forms. By processing the input through multiple convolutional layers, convolutional neural network 36 detects spatial features like edges, patterns, and textures, which are essential for understanding the input's underlying structure. Additionally, convolutional neural network 36 reduces the spatial dimensions of the data through operations such as pooling, retaining critical information while improving computational efficiency. The extracted feature maps generated by convolutional neural network 36 are then passed to fully connected neural network 38 for further processing. Fully connected neural network 38 plays a complementary role by integrating the features from convolutional neural network 36 and mapping them into a more compact, structured latent space representation. Each neuron in fully connected neural network 38 is fully connected to the outputs of the preceding layer, enabling it to combine the extracted features into a cohesive representation. Fully connected neural network 38 generates two outputs: one branch predicts the mean (μ) and the other predicts the log-variance (ln σ2) of the latent distribution. These parameters are essential for the reparameterization step, where the latent variable (z) is sampled using the formula z=μ+ϵ·σ, with σ=√(exp(ln σ2)). This probabilistic encoding bridges the encoding and decoding phases of the variational autoencoder, allowing the model to sample from the latent distribution. Together, convolutional neural network 36 and fully connected neural network 38 form a cohesive pipeline where the CNN specializes in extracting localized features and reducing dimensionality, while the fully connected neural network 38 integrates these features into the latent representation and prepares them for the probabilistic framework of the VAE. This collaboration ensures efficient encoding and reconstruction of high-dimensional data, leveraging the strengths of both local feature extraction and global feature integration.
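A minimal sketch of the two output branches and the reparameterization step is given below, assuming PyTorch; the shared feature extractor is reduced to a single linear mapping into the feature vector h, and all dimensions are illustrative.

    import torch
    import torch.nn as nn

    class LatentHeads(nn.Module):
        """Maps a shared feature vector h to the mean and log-variance of the latent Gaussian,
        then samples z with the reparameterization z = mu + eps * sigma."""
        def __init__(self, feat_dim=128, latent_dim=20):
            super().__init__()
            self.mu_head = nn.Linear(feat_dim, latent_dim)      # branch predicting the mean
            self.logvar_head = nn.Linear(feat_dim, latent_dim)  # branch predicting ln(sigma^2)

        def forward(self, h):
            mu = self.mu_head(h)
            logvar = self.logvar_head(h)
            sigma = torch.exp(0.5 * logvar)      # sigma = sqrt(exp(ln sigma^2))
            eps = torch.randn_like(sigma)        # eps drawn from a standard normal distribution
            z = mu + eps * sigma                 # reparameterized sample from the latent distribution
            return z, mu, logvar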
In another aspect, the bidirectional variational autoencoder system 30 includes a bidirectional backpropagation algorithm configured to optimize directional likelihoods for the forward and backward passes simultaneously and a memory module for storing the synaptic weights. The bidirectional structure reduces parameter count compared to traditional encoder-decoder architectures.
In another aspect, during training, the bidirectional variational autoencoder system 30 learns to approximate the posterior q(z|x, θ) by predicting its parameters: the mean (μ) and the log-variance (ln σ2). These parameters are updated iteratively as the network learns from the data, meaning the posterior distribution evolves to better represent the latent structure of the input data. The posterior is refined to encode data-specific features while regularizing against the prior p(z).
In another aspect, the bidirectional variational autoencoder is trained with a bidirectional backpropagation algorithm that jointly optimizes the single bidirectional network's joint bidirectional likelihood. In a refinement, the bidirectional variational autoencoder allows direct computation of ln p(x|θ) wherein p(x|θ) is the data likelihood. In a further refinement, the bidirectional variational autoencoder uses the same bidirectional associative network to model the encoding and decoding phases. As set forth below in more detail, a forward-pass likelihood qf(z|x, θ) models the encoding and a backward-pass likelihood pb(x|z, θ) models the decoding.
In another aspect, the invention provides a system wherein the bidirectional backpropagation algorithm uses an adaptive learning rate optimizer.
In another aspect, the invention provides a method for encoding and decoding data using a bidirectional variational autoencoder. The method involves receiving input data, encoding the input data into a latent representation using a forward pass through a neural network, and decoding the latent representation back into reconstructed data using a backward pass through the same neural network. The method further includes optimizing the encoding and decoding processes using a bidirectional backpropagation algorithm and regularizing the latent space using a predefined prior distribution over the latent variables. In a refinement, the prior distribution is a Gaussian distribution with mean zero and unit variance.
In another aspect, the method further includes visualizing the latent space representations using dimensionality reduction techniques, such as t-Distributed Stochastic Neighbor Embedding (t-SNE). This technique transforms high-dimensional latent representations into a two- or three-dimensional space, making it possible to visually assess the structure and separability of the latent features.
In another aspect, the invention provides a bidirectional variational autoencoder produced by the following method. The method includes configuring a neural network with shared weights for bidirectional operations, training the neural network with a bidirectional backpropagation algorithm to optimize forward and backward directional likelihoods, using a Gaussian prior distribution to regularize the latent space, and validating the autoencoder's performance on image reconstruction, classification, interpolation, and generation tasks.
In another aspect, a system for training a bidirectional variational autoencoder is provided. The system includes a training dataset comprising input samples, a neural network configured for bidirectional operations, and a bidirectional backpropagation algorithm to jointly optimize forward and backward directional likelihoods. The system further includes a module for calculating the evidence lower bound (ELBO) as a sum of forward and backward likelihoods and a processor configured to adjust synaptic weights of the neural network based on ELBO optimization.
In another aspect, the invention provides a method for training a bidirectional variational autoencoder. The method involves initializing synaptic weights of a neural network, using forward and backward passes to compute respective directional likelihoods for encoding and decoding operations, and calculating the evidence lower bound (ELBO) as a combination of forward and backward log-likelihoods. The method further includes updating the synaptic weights using gradient-based optimization techniques and iteratively refining the neural network to minimize reconstruction error and ensure adherence to a predefined prior distribution.
In another aspect, the invention provides a bidirectional backpropagation algorithm for training a neural network. The algorithm involves receiving a mini-batch of input data, performing a forward pass to encode input data into a latent representation, and performing a backward pass to decode the latent representation into reconstructed data. The algorithm further includes computing a loss function comprising a reconstruction loss and a KL-divergence term between the variational posterior and the prior distribution and updating synaptic weights of the neural network based on gradients of the loss function. An example of a useful algorithm for training is as follows. The input requirements include a dataset of data {xn}n=1N and latent space dimension J. {xn}n=1N is the dataset of N input samples, where each individual sample is xn. J is the dimension of the latent space, representing the size of the latent variable. Synaptic weights (e.g., θ∈{Nθ, Vθ, Wθ}), learning rate α, and other hyperparameters are initialized. θ is the set of synaptic weights, including Nθ, Vθ, and Wθ, where Nθ is the weight matrix used for the first neural network layer, Vθ is the weight matrix used for predicting the log-variance, and Wθ is the weight matrix used for predicting the mean. After the parameters are initialized, the algorithm executes a loop that includes the following steps. Step 1 includes selecting a mini-batch {xm}m=1B of B samples for each iteration t, where B is the size of the mini-batch selected during each training iteration. Step 2 includes executing a subloop over the B mini-batch samples (indexed by m). Subloop step s1) (Forward Pass (Encoding)) includes predicting the variational mean and log-covariance:
μm=Wθ(Nθ(xm))
and
ln σ̂m2=Vθ(Nθ(xm))
Subloop step s2 includes sampling the latent features zm from the variational Gaussian distribution conditioned on xm:
zm=μm+ϵm·σ̂m
where ϵm is a random noise vector sampled from a standard normal distribution; zm is the latent variable for the m-th sample, computed using the predicted mean, variance, and noise; and N(0, I) is the standard normal distribution with a mean of 0 and an identity covariance matrix I.
Subloop step s3 (Backward Pass (Decoding)) includes mapping the latent variable back to the input space:
âm(x)=NθT(WθT(zm))
âm(x) refers to the decoded output during the backward pass of the BVAE. The subloop terminates after B iterations.
Step 3 includes estimating the negative log-likelihood NLL(x, θ):
where L̂ELBO(x, θ)=−ELBO(x, θ); and N(μm, Diag(σ̂m2)) is the Gaussian distribution defined by the predicted mean μm and the diagonal covariance matrix derived from the variance σ̂m2. During training, L̂ELBO(x, θ) (i.e., the negative of the ELBO) is minimized. Step 4 includes updating θ by backpropagating L̂ELBO(x, θ) through the weights. The loop continues until a predetermined L̂ELBO(x, θ) is achieved or a predetermined number of iterations is completed. The θ are then returned.
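A compact sketch of one training iteration is given below, assuming PyTorch, inputs scaled to [0, 1], a Bernoulli reconstruction term, and a standard Gaussian prior; the shared layers Nθ, Wθ, and Vθ are modeled as linear modules whose weights are reused in transposed form on the backward pass, and all sizes are illustrative.

    import torch
    import torch.nn.functional as F

    def bvae_step(x, N, W, V, opt):
        """One BVAE update on a mini-batch x of flattened samples in [0, 1].
        N, W, V are torch.nn.Linear layers; their weights are reused (transposed) for decoding."""
        h = torch.sigmoid(N(x))                                   # shared feature layer N_theta
        mu = W(h)                                                 # mean branch W_theta
        logvar = V(h)                                             # log-variance branch V_theta
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # sample latent features z_m

        # Backward pass (decoding) through the transposed weights W_theta^T and N_theta^T.
        h_back = torch.sigmoid(F.linear(z, W.weight.t()))
        x_hat = torch.sigmoid(F.linear(h_back, N.weight.t()))

        # Negative ELBO: reconstruction cross-entropy plus KL(q(z|x, theta) || N(0, I)).
        recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        loss = (recon + kl) / x.size(0)

        opt.zero_grad()
        loss.backward()          # backpropagate the negative ELBO through the shared weights
        opt.step()               # gradient-based update of theta
        return loss.item()

In practice N, W, and V can be instantiated as torch.nn.Linear modules and opt as an adaptive optimizer such as torch.optim.Adam over their parameters.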
In another aspect, the invention provides an algorithm wherein the reconstruction loss is calculated using a combination of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM).
In another aspect, the method for training the BVAE includes a step of monitoring performance metrics, including reconstruction accuracy and divergence minimization, to evaluate training effectiveness.
In another aspect, an ASIC for implementing a bidirectional variational autoencoder (BVAE) is provided. The ASIC includes an input interface for receiving input data, a bidirectional variational processing unit, a memory module, a control unit, and an output interface. The bidirectional variational processing unit includes a forward processing module that encodes input data into a latent representation using a probabilistic forward likelihood q(z|x, θ) and a backward processing module that decodes the latent representation into reconstructed data using a probabilistic backward likelihood pb(x|z, θ), with both processes sharing a single set of synaptic weights. The memory module stores the shared synaptic weights, latent representations, and network parameters. In a refinement, the control unit is configured to execute a bidirectional backpropagation algorithm during training, which jointly optimizes the forward and backward likelihoods to maximize the evidence lower bound (ELBO) on the data log-likelihood, incorporating steps such as computing the Kullback-Leibler (KL) divergence for regularization and minimizing reconstruction loss using cross-entropy. In a further refinement, the training process involves iterative updates of network weights based on gradient descent using a stochastic mini-batch approach. The control unit further enables inference tasks such as image reconstruction, classification, interpolation, and generation after training is complete. The ASIC also includes an output interface for providing reconstructed or generated data. In a refinement, a performance monitoring module evaluates the quality of the output data using metrics such as negative log-likelihood (NLL), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and Frechet Inception Distance (FID).
In another aspect, the Bidirectional Variational Autoencoder (BVAE) reduces the number of parameters by about 50% while improving performance. This significant reduction in parameter size not only enhances the efficiency of chip hardware and software but also enables improved applications across various domains, including large language models, speech, image, and knowledge processing. These advancements position the BVAE as a pivotal tool for optimizing hardware and software requirements while simultaneously driving progress in generative AI models and other computational tasks.
Still referring to
Additional details of the invention are set forth below and in O. Adigun and B. Kosko, “Bidirectional Backpropagation Autoencoding Networks for Image Compression and Denoising,” 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 2023, pp. 730-737, doi: 10.1109/ICMLA58977.2023.00107 and Kosko, Bart & Adigun, Olaoluwa. (2024). Bidirectional Variational Autoencoders. 10.1109/IJCNN60899.2024.10650379 and any supplemental materials for these papers; the entire disclosures of which are hereby incorporated by reference in their entirety.
The following examples illustrate the various embodiments of the present invention. Those skilled in the art will recognize many variations that are within the spirit of the present invention and scope of the claims.
A new bidirectional backpropagation algorithm is demonstrated as a method for training an autoencoder (AE) network. The resulting bidirectional AE uses a single network and the same synapses for forward and backward passes.
Bidirectional backpropagation [22], [23] maximizes a network's joint likelihood pf(y|x, θ)pb(x|y, θ). The forward probability pf(y|x, θ) describes the forward pass of input pattern x from the input layer to the output layer. The backward probability pb(x|y, θ) describes the backward pass of target y from the output layer to the input layer. BAE training differs from bidirectional backpropagation because it maximizes just the network's backward likelihood.
Preliminary results also showed that the bidirectional backpropagation architecture extends to variational autoencoders. Here the forward error measures the Kullback-Leibler divergence between the encoded vector z and a target prior probability. The backward error measures the reconstruction error. The bidirectional framework offers a simple alternative to the reparameterization trick in variational AEs [25].
The next section presents the unidirectional AEs used in the simulations.
An ordinary or unidirectional AE consists of two contiguous networks. These are the encoder network and the decoder network.
The output layer of neural networks for image-related tasks can be modeled as discretized independent beta random variables Y1, . . . , YK. Ma and Leijon [26] have found that the beta probability density gives a reasonable model for such image pixel values. The random variables are Yk|X=x˜Beta(α=1+yk, β=2−yk) where the beta density is discretized. This choice of the two beta parameters coincides with a continuous Bernoulli [27], [28]. This discretization models the finite cardinality of the set of all pixel values. Yk denotes the kth neuron or pixel at the output layer of the decoder. It has the target pixel value yk=c/255 for some c∈{0, 1, . . . , 255}.
The pixel values are not continuous, and the support of the beta structure again assumes discretized values. This discretization allows a multi-level representation with 2 or more levels and assists image representation because it gives a multi-level model for the 256 possible values per pixel.
The decoder's output negative log-likelihood equals the double cross-entropy between the output activation ay and the target y. This gives the output likelihood p(yk|x, θ, ϕ) as
The corresponding log-likelihood is
where ψ(yk)=ln 2−ln Γ(1+yk)−ln Γ(2+yk). Then the negative log-likelihood simplifies as
where ε(yk, aky, θ, ϕ) is the double cross-entropy between Yk and aky.
Unidirectional or ordinary backpropagation (BP) trains the AE. This gradient method finds the model weights θ* and ϕ* that locally maximize the decoder's output likelihood. This just minimizes the double cross-entropy:
because the logarithm is a monotonic function and because −ψ(yk) does not depend on θ or ϕ.
Unidirectional BP trains on only the forward error ε(yk, ay, θ, ϕ) over M training samples {x(m)}m=1M. This forward error simplifies as
The corresponding log-likelihood is
where yk(m) is the target at the kth output neuron and where aky(m) is the activation of the kth neuron at the output layer of the decoder. Note that yk(m)=xk(m) because the autoencoder approximates an identity map.
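For concreteness, and assuming the standard double cross-entropy form implied by the definitions above, the forward error over the M training samples can be written as the following sketch (a reconstruction consistent with the surrounding text, not a verbatim quotation of the omitted display):

    \varepsilon\big(y, a^{y}, \theta, \phi\big)
      = -\sum_{m=1}^{M}\sum_{k=1}^{K}
        \Big[\, y_k^{(m)} \ln a_k^{y(m)}
          + \big(1 - y_k^{(m)}\big)\,\ln\!\big(1 - a_k^{y(m)}\big) \Big].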
A bidirectional network runs forward and backward through the same synaptic weights [23], [29]-[31]. A forward inference passes through a given rectangular weight matrix W while the backward inference passes through the matrix transpose WT.
A BAE Nθ learns or approximates an identity mapping from an input pattern space to the same or similar output pattern space. The data-encoding from the pattern x to the latent variable z passes forward through the network Nθ. The encoding of the input image gives
az=Nθ(x). (1.21)
The decoding from z back to x passes backwards through the same network. So, the encoded message decodes as
axb=NθT(az). (1.22)
BAE networks train with a form of the bidirectional backpropagation algorithm [23], [30]. This training maximizes the backward likelihood p(x|y, θ) of the bidirectional networks. So this bidirectional structure differs in kind from the encoder-only structure of the Bidirectional Encoder Representations from Transformers (BERT) model [32].
The training error function εM(x, θ) equals the negative log-likelihood of M training samples with the assumption of independent and identical distribution. The negative log-likelihood of Beta(α=1+xk, β=2−xk) gives the cross-entropy:
where xi(m) is the ith pixel value of the mth sample. The term aixb(m) is the activation of the ith input neuron on the backward pass of the mth sample.
Overall: BAE networks significantly reduced memory usage because they reduced the number of synaptic parameters by about 50%. This favors both large-scale neural models and dedicated hardware implementations.
The supercomputer simulations compared the performance of unidirectional with bidirectional AEs. They tested these autoencoders on image compression and reconstruction and on image denoising.
Two types of autoencoders were tested. The AEs were either fully connected or convolutional. The models used the new logistic nonvanishing (NoVa) hidden neurons [33], [34] because NoVa neurons outperformed rectified linear unit (ReLU) neurons and many others. The NoVa activation perturbs a logistic where the activation a(x) from input x is
where b=0.3 and c=2.0 were used.
Each decoder network mirrored the encoder and used four fully connected hidden layers. The first two hidden layers used 500 neurons per layer and the other two used 1000 neurons per layer.
BAEs with fully connected layers each used one bidirectional network. Each bidirectional network used four fully connected hidden layers. The first two hidden layers used 1000 neurons per layer and the other two used 500 neurons per layer.
The convolutional decoder network used five convolutional layers and two fully connected layers for decoding. The dimensions of the respective input channels and output channels of the convolutional layers were (512, 256, 128, 64, 32) and (256, 128, 64, 32, 3). The two fully connected layers used 2048 neurons and 1024 neurons.
The bidirectional convolutional AEs used a bidirectional convolutional network for encoding and decoding. Each BAE used five convolutional layers and two fully connected layers. The dimensions of the respective input channels and output channels of the convolutional layers were (3, 32, 64, 128, 256) and (32, 64, 128, 256, 512). The two fully connected layers used 2048 neurons and 1024 neurons. The convolutional autoencoders trained over 300 epochs.
AEs and BAEs were compared on the MNIST handwritten digit dataset and the CIFAR-10 image dataset [35]. The MNIST handwritten digit dataset contained the 10 classes of the handwritten digits {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. This dataset consisted of 60,000 training samples with 6,000 samples per class and 10,000 test samples with 1,000 samples per class.
The CIFAR-10 dataset consists of 60,000 color images from 10 categories (K=10). Each image had size 32×32×3. The 10 pattern categories were airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each class consisted of 5,000 training samples and 1,000 testing samples.
The denoising experiments used noise-corrupted input images. The additive-noise denoising used noisy input images x=y+n where the noise n came from the Gaussian probability density N(μ=0; σ) for clean image y. The multiplicative (speckle) noise denoising used the noisy input image x=y*n where n also came from N(μ=0; σ).
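The following sketch illustrates the two corruption models, assuming NumPy and clean images y scaled to [0, 1]; the noise level sigma is an illustrative choice.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 0.1                                       # illustrative noise standard deviation

    def additive_noise(y):
        """Additive corruption x = y + n with n drawn from N(0, sigma^2)."""
        return y + rng.normal(0.0, sigma, y.shape)

    def speckle_noise(y):
        """Multiplicative (speckle) corruption x = y * n with n drawn from N(0, sigma^2)."""
        return y * rng.normal(0.0, sigma, y.shape)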
BAEs outperformed unidirectional AEs on image compression and image denoising. BAEs also reduced the number of parameters by about 50%. The simulations used two performance metrics: peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) [36].
The simulations showed that fully connected BAEs outperformed fully connected unidirectional AEs on both metrics. BAEs also reduced the number of trainable parameters by about 50%. The simulations found these bidirectional benefits for image compression on the MNIST handwritten digit dataset.
The t-distributed stochastic neighbor embedding (t-SNE) method was used to visualize the encoded features. This method uses a statistical approach to map the high-dimensional representation of data {xi}i=1N to their respective low-dimensional representation {yi}i=1N based on the similarity of the datapoints [24]. It is a two-step method. The first step defines the conditional probability pj|i and the joint probability pij over the high-dimensional space. The conditional probability is proportional to the similarity between xi and xj. It uses a Gaussian probability density with mean xi:
pj|i=exp(−∥xi−xj∥2/(2σi2))/Σk≠i exp(−∥xi−xk∥2/(2σi2))
for all j≠i where σi is the variance of the Gaussian with mean xi. The term pi|i=0. The joint probability is
pij=(pj|i+pi|j)/(2N)
where N is the number of datapoints.
The second step maps the high-dimensional representation xi to its corresponding low-dimensional representation yi in a d-dimensional space (d is typically 2 or 3). It uses a heavy-tailed Student's-t density with one degree of freedom (which equals the Cauchy density) to model the low-dimensional joint distribution. So, the joint probability qij of the low-dimensional representations yi and yj has the form:
qij=(1+∥yi−yj∥2)−1/Σk≠l(1+∥yk−yl∥2)−1
for all i≠j and qii=0. The locations of the low-dimensional representations come from minimizing the Kullback-Leibler divergence
KL(P∥Q)=Σi≠j pij ln(pij/qij).
The t-SNE algorithm uses gradient descent to iteratively find the values yi that minimize KL(P∥Q).
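The following sketch shows how the encoded features could be visualized with an off-the-shelf t-SNE implementation, assuming scikit-learn and matplotlib; the random latent codes and labels are placeholders standing in for the actual encoder outputs and class labels.

    import numpy as np
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    # Stand-ins for the encoded features a_z and their class labels (illustrative only).
    latent = np.random.rand(1000, 64)
    labels = np.random.randint(0, 10, 1000)

    # Map the 64-dimensional codes to 2 dimensions for visualization.
    embedded = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(latent)
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=5, cmap='tab10')
    plt.title('t-SNE of encoded features')
    plt.show()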
Simulations also showed that convolutional BAEs outperformed convolutional unidirectional AEs on the CIFAR-10 dataset. Table 1.3 shows that convolutional BAEs slightly outperformed their corresponding unidirectional architecture for image compression. This included a slight increase in the PSNR and the SSIM as well as a reduction of about 50% in the number of parameters. Table 1.4 shows a similar bidirectional benefit of a slight increase in the PSNR and a slight increase in the SSIM for the image denoising task.
Bidirectional autoencoders offer an efficient way to learn autoencoder mappings. The new bidirectional backpropagation algorithm allows a single network to perform encoding and decoding. The bidirectional architecture improved network performance and reduced computing memory because it cut in half the number of trainable synaptic parameters. So, it should have more pronounced bidirectional benefits on larger-scale models and aid hardware implementations. Preliminary simulations also found that these bidirectional benefits extended to variational autoencoders.
This section introduces the new bidirectional variational autoencoder (BVAE) network. This architecture uses a single parametrized network for encoding and decoding. It trains with the new bidirectional backpropagation algorithm that jointly optimizes the network's bidirectional likelihood [37], [38]. The algorithm uses the same synaptic weights both to predict the target y given the input x and to predict the converse x given y. Ordinary or unidirectional VAEs use separate networks to encode and decode.
Unidirectional variational autoencoders (VAEs) are unsupervised machine-learning models that learn data representations [39], [40]. They both learn and infer with directed probability models that often use intractable probability density functions [41]. A VAE seeks the best estimate of the data likelihood p(x|θ) from samples {x(n)}n=1N if x depends on some unobserved feature z and if θ represents the system parameters. The intractability involves marginalizing out the random variable z to give the likelihood p(x|θ):
p(x|θ)=∫p(x|z, θ)p(z|θ)dz.
Kingma and Welling introduced VAEs to solve this computational problem [41]. The VAE includes a new recognition (or encoding) model q(z|x, ϕ) that approximates the intractable likelihood q(z|x, θ). The probability q(z|x, ϕ) represents a probabilistic encoder while p(x|z, θ) represents a probabilistic decoder. These probabilistic models use two neural networks with different synaptic weights.
A BVAE approximates the intractable q(z|x, θ) with the forward likelihood qf(z|x, θ). Then the probabilistic encoder is qf(z|x, θ) and the probabilistic decoder is pb(x|z, θ). So the two densities share the parameter θ and there is no need for a separate network.
VAEs vary based on the choice of latent distribution, the method of training, and the use of joint modeling with other generative models, among other factors. The β-VAE introduced the adjustable hyperparameter β. It balances the latent channel capacity of the encoder network against the reconstruction error of the decoder network [42]. It trains on a weighted sum of the reconstruction error and the Kullback-Leibler divergence DKL(q(z|x, ϕ)∥p(z|θ)). The β-TCVAE (Total Correlation Variational Autoencoder) extends the β-VAE to learn and isolate the sources of disentanglement [43]. A disentangled β-VAE modifies the β-VAE by progressively increasing the information capacity of the latent code during training [44].
Importance weighted autoencoders (IWAEs) use priority weights to derive a strictly tighter lower bound on the loglikelihood [45]. Variants of IWAE include the partially importance weighted auto-encoder (PIWAE), the multiply importance weighted auto-encoder (MIWAE), and the combined importance weighted auto-encoder (CIWAE) [46].
Hyperspherical VAEs use a non-Gaussian latent probability density. They use a von Mises-Fisher (vMF) latent density that gives in turn a hyperspherical latent space [40]. Other VAEs include the Consistency Regularization for Variational AutoEncoder (CRVAE) [39], the InfoVAE [47], and the Hamiltonian VAE [48] and so on. All these VAEs use separate networks to encode and decode.
Vincent et al. [49] suggest the use of tied weights in stacked autoencoder networks. This is a form of constraint that parallels the working of restricted Boltzmann machines (RBMs) [50] and thus a simple type of bidirectional associative memory or BAM [51]. It forces the weights to be symmetric by using WT on the backward pass. The building block here is a shallow network with no hidden layer [52], [53]. They further suggest that combining this constraint with a nonlinear activation would lead to poor reconstruction error.
Bidirectional autoencoders (BAEs) [54] extend bidirectional neural representations to image compression and denoising. BAEs differ from autoencoders with tied weights because they relax that constraint by extending the bidirectional assumption over the depth of a deep network. BAEs differ from bidirectional VAEs because they do not require the joint optimization of the directional likelihoods. This limits the generative capability of BAEs.
The next sections review ordinary VAEs and introduce probabilistic BVAEs using the new B-BP algorithm. Section IV compares them on the four standard image test datasets: MNIST handwritten digits, Fashion-MNIST, CIFAR-10, and CelebA-64 datasets. It was found that BVAEs cut the number of tunable parameters in half while still performing slightly better than the unidirectional VAEs.
Let p(x|θ) denote the data likelihood and z denote the hidden variable. The data likelihood simplifies as
The likelihood q(z|x, θ) is intractable to compute. So unidirectional VAEs introduce a new likelihood that represents the recognition or encoding model. The term qf(z|x, ϕ) represents the forward likelihood of the encoding network that approximates the intractable likelihood q(z|x, θ):
The corresponding data log-likelihood ln p(x|θ) is
Now take the expectation of (8) with respect to qf(z|x, ϕ):
because qf(z|x, ϕ) is a probability density function and its integral over the domain of z equals 1. The expectation of the term on the right-hand side of (8) with respect to qf(z|x, ϕ) is
Combining (8), (9), and (12) gives
The KL-divergence between qf(z|x, ϕ) and q(z|x, θ) yields the following inequality because of Jensen's inequality:
because the negative of the natural logarithm is convex. So
where L(x, θ, ϕ) is the evidence lower bound (ELBO) on the data log-likelihood ln p(x|θ).
Unidirectional VAEs train on an estimate of the ELBO L(x, θ, ϕ) using ordinary or unidirectional backpropagation (BP). This estimate uses the forward pass of the encoder q(z|x, ϕ) to approximate the intractable encoding model q(z|x, θ) and the forward pass of the decoder p(x|z, θ) to model the decoding. The gradient update rules for the encoder and decoder networks at the (n+1)th iteration or training epoch are
where η is the learning rate, ϕ(n) is the encoder parameter, and θ(n) is the decoder parameter after n training iterations.
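Written out, these updates take the standard gradient-ascent form ϕ(n+1) = ϕ(n) + η∇ϕL(x, θ(n), ϕ(n)) and θ(n+1) = θ(n) + η∇θL(x, θ(n), ϕ(n)). This is a sketch that assumes maximization of the ELBO estimate L; stochastic-gradient and minibatch variants replace the gradient with its sample estimate.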
Bidirectional VAEs use the directional likelihoods of a network with parameter θ to approximate the data log-likelihood ln p(x|θ). They use the same bidirectional associative network to model the encoding and decoding phases. The forward-pass likelihood qf(z|x, θ) models the encoding and the backward pass likelihood pb(x|z, θ) models the decoding. So BVAEs do not need an extra likelihood q(z|x, ϕ) or an extra network with parameter ϕ.
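The weight sharing admits a compact realization. The following minimal sketch, written in PyTorch, shows one bidirectional layer that encodes with a weight matrix W on the forward pass and decodes with its transpose on the backward pass; the tanh activation and the separate bias vector for each direction are illustrative assumptions rather than details taken from this disclosure.

    import torch
    import torch.nn.functional as F

    class BidirectionalLayer(torch.nn.Module):
        # One weight matrix W serves both directions: encode with W and decode with W^T.
        def __init__(self, in_dim, latent_dim):
            super().__init__()
            self.W = torch.nn.Parameter(0.01 * torch.randn(latent_dim, in_dim))
            self.b_f = torch.nn.Parameter(torch.zeros(latent_dim))  # forward (encoding) bias
            self.b_b = torch.nn.Parameter(torch.zeros(in_dim))      # backward (decoding) bias

        def encode(self, x):
            # forward pass x -> z with W
            return torch.tanh(F.linear(x, self.W, self.b_f))

        def decode(self, z):
            # backward pass z -> x with the transpose of the same W
            return torch.tanh(F.linear(z, self.W.t(), self.b_b))

A full BVAE stacks such layers and adds the reparameterized latent sampling; the point of the sketch is only that the encoding and decoding passes share the single parameter set θ.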
The data log-likelihood is
Now take the expectation of (29) with respect to qf(z|x, θ) and consider the left-hand side of (29):
The expectation of the right-hand term is
The corresponding data log-likelihood of a BVAE with parameter θ is
The log-likelihood of the BVAE is such that
where L(x, θ) is the ELBO on ln p(x|θ) and where the expectation Ez|x,θ is taken with respect to qf(z|x, θ).
Bidirectional VAEs train on an estimate of the ELBO L(x, θ) that uses bidirectional neural representations [38]. This estimate uses the forward pass qf(z|x, θ) to approximate the intractable encoding model q(z|x, θ) and the reverse pass pb(x|z, θ) to approximate the decoding model. The update rule at the (n+1)th iteration or training epoch is
where η is the learning rate and θ(n) is the autoencoder network parameter just after the nth training iteration.
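One way to write the BVAE objective and its update, as a sketch consistent with the description above (the Kullback-Leibler form of the regularizer is an assumption of the sketch), is L(x, θ) = Eqf(z|x,θ)[ln pb(x|z, θ)] − DKL(qf(z|x, θ)∥p(z|θ)) with the gradient-ascent step θ(n+1) = θ(n) + η∇θL(x, θ(n)). Both directional likelihoods depend on the same θ, so a single update step adjusts both the encoding and decoding behavior of the shared network.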
The performance of unidirectional VAEs and bidirectional VAEs is compared using different tasks, datasets, network architectures, and loss functions. The image test sets for the experiments are described first.
The simulations compared results on four standard image datasets: MNIST handwritten digits [55], Fashion-MNIST [56], CIFAR-10 [57], and CelebA [58] datasets.
The MNIST handwritten digit dataset contains 10 classes of handwritten digits {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. This dataset consists of 60,000 training samples with 6,000 samples per class and 10,000 test samples with about 1,000 samples per class. Each image is a single-channel image with dimension 28×28.
The Fashion-MNIST dataset is a database of fashion images. It is made of 10 classes, namely ankle boot, bag, coat, dress, pullover, sandal, shirt, sneaker, trouser, and t-shirt/top. Each class has 6,000 training samples and 1,000 testing samples. Each image is also a single-channel image with dimension 28×28.
The CIFAR-10 dataset consists of 60,000 color images from 10 categories. Each image has size 32×32×3. The 10 pattern categories are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each class consists of 5,000 training samples and 1,000 testing samples.
The CelebA dataset is a large-scale face dataset of 10,177 celebrities [58]. This dataset is made up of 202,599 color (three-channel) images. It is not a balanced dataset: the number of images per celebrity varies between 1 and 35. The dataset is divided into two splits of 9,160 celebrities for training and 1,017 celebrities for testing the VAEs. This resulted in 185,133 training samples and 17,466 testing samples. Each image is resized to 64×64×3.
The performance of bidirectional VAEs and unidirectional VAEs is compared on the following four tasks.
Simple linear classifiers trained on the VAE-extracted features from the MNIST handwritten digit dataset. A neural classifier with one hidden layer of 256 logistic hidden neurons was used to classify the VAE-extracted features from the Fashion-MNIST and CIFAR-10 datasets.
The t-distributed stochastic neighbor embedding (t-SNE) method was used to visualize the reduced features. This method uses a statistical approach to map the high-dimensional representation of data {xi}i=1N to their respective low-dimensional representation {yi}i=1N based on the similarity of the datapoints [62]. This low-dimensional representation provides insight into the degree of separability among the classes.
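A brief sketch of this visualization step, assuming latent codes already extracted by a trained VAE (the file name and the perplexity value here are placeholders rather than values from the experiments):

    import numpy as np
    from sklearn.manifold import TSNE

    latent_codes = np.load("vae_latent_codes.npy")   # hypothetical array of shape (N, latent_dim)
    embedding_2d = TSNE(n_components=2, perplexity=30.0, init="pca").fit_transform(latent_codes)
    # embedding_2d has shape (N, 2); a scatter plot colored by class label shows class separability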
Different neural network architectures were used for various datasets and tasks.
Variational Autoencoders: Deep convolutional and residual neural network architectures were used. VAEs that trained on the MNIST handwritten digits and Fashion-MNIST datasets used the residual architecture.
Each of the encoder and decoder networks that trained on the CIFAR-10 dataset used six convolutional layers and two fully connected layers. The corresponding BVAEs each used only one such network for both encoding and decoding. The dimension of the hidden convolutional layers is {64↔128↔256↔512↔1024↔2048}. The dimension of the fully connected layers is {2048↔1024↔64}.
The configuration of the VAEs that trained on the CelebA dataset differs slightly. The sub-networks each used nine convolutional layers and two fully connected layers. The dimension of the hidden convolutional layers is {128↔128↔192↔256↔384↔512↔768↔1024↔1024}. The dimension of the fully connected layers is {4096↔2048↔256}.
The VAEs used generalized nonvanishing (G-NoVa) hidden neurons [65], [66]. The G-NoVa activation a(x) of an input x depends on two positive parameters α>0 and β>0. Each layer of a BVAE performs probabilistic inference in both the forward and backward passes. The convolutional layers use bidirectional kernels. The kernels run convolution in the forward pass and transposed convolution in the backward pass. Transposed convolution projects feature maps to a higher-dimensional space [67].
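A minimal sketch of such a bidirectional kernel in PyTorch follows; the channel counts, kernel size, and stride are illustrative assumptions, and the same weight tensor drives both passes:

    import torch
    import torch.nn.functional as F

    kernel = torch.nn.Parameter(0.01 * torch.randn(64, 3, 4, 4))   # (out_channels, in_channels, kH, kW)

    def encode_step(x):
        # encoding pass: ordinary strided convolution with the shared kernel (downsamples)
        return F.conv2d(x, kernel, stride=2, padding=1)

    def decode_step(z):
        # decoding pass: transposed convolution with the same kernel (upsamples back)
        return F.conv_transpose2d(z, kernel, stride=2, padding=1)

    x = torch.rand(8, 3, 32, 32)
    z = encode_step(x)        # shape (8, 64, 16, 16)
    x_hat = decode_step(z)    # shape (8, 3, 32, 32)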
Downstream Classification: Simple linear classifiers were trained on VAE-extracted features from the MNIST digit dataset. Shallow neural classifiers with one hidden layer of 100 hidden neurons each were trained on the extracted features from the Fashion-MNIST images. Similar neural classifiers with one hidden layer of 256 hidden neurons each trained on the VAE-extracted features from the CIFAR-10 dataset.
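A sketch of one such shallow classifier, assuming 256 logistic hidden neurons and a 64-dimensional VAE feature vector (the feature dimension matches the fully connected widths described above but is an assumption of the sketch):

    import torch

    classifier = torch.nn.Sequential(
        torch.nn.Linear(64, 256),   # VAE-extracted features -> hidden layer
        torch.nn.Sigmoid(),         # logistic hidden neurons
        torch.nn.Linear(256, 10),   # 10 output classes
    )
    logits = classifier(torch.rand(32, 64))   # class scores for a batch of 32 feature vectors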
Four implementations of VAEs were considered and compared with their respective bidirectional versions. The four VAEs are the vanilla VAE [41], the β-VAE [42], the β-TCVAE [43], and the IWAE [45]. These trained over the four datasets across the four tasks. The AdamW optimizer [68] was used with the OneCycleLR learning rate scheduler [69]. Each model trained on its respective ELBO estimate.
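A skeleton of this training setup, with a stand-in model, loss, and data loader in place of the actual networks and ELBO estimates (only the optimizer and scheduler choices reflect the description above):

    import torch

    model = torch.nn.Sequential(torch.nn.Linear(784, 64), torch.nn.Tanh(), torch.nn.Linear(64, 784))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
    loader = [torch.rand(128, 784) for _ in range(100)]   # stand-in data loader
    epochs = 10
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=1e-3, steps_per_epoch=len(loader), epochs=epochs)

    for _ in range(epochs):
        for x in loader:
            loss = ((model(x) - x) ** 2).mean()   # stand-in for the negative ELBO estimate
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()                      # OneCycleLR steps once per batch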
A new framework for bidirectional VAEs was designed, and the unidirectional VAEs were implemented using the Pythae framework [70]. All the models trained on a single A100 GPU.
The performance of the VAE models on generative and compression tasks was evaluated using the following quantitative metrics:
where I is an indicator function, zd represents the dth component of the latent variable, and ϵ=0.01. Higher AU means the model uses more features for the latent space representation. But having too many active units can lead to overfitting.
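The terms just defined match the standard active-units (AU) count; a sketch of that standard definition, under the assumption that it is the intended metric, is AU = Σd I(Covx(Ez∼q(z|x)[zd]) > ϵ), so a latent component zd counts as active when the covariance over the data of its posterior mean exceeds the threshold ϵ.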
Bidirectional VAEs encode and decode through the same synaptic web of a deep neural network. This bidirectional flow captures the joint probabilistic structure of both directions during learning and recall. BVAEs cut the synaptic parameter count in half compared with unidirectional VAEs. The simulations on the four image test sets showed that the BVAEs still performed slightly better than the unidirectional VAEs.
While exemplary embodiments are described above, it is not intended that these embodiments describe all forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention. Additionally, the features of various implementing embodiments may be combined to form further embodiments of the invention.
This application claims the benefit of U.S. provisional application Ser. No. 63/609,007 filed Dec. 12, 2023, the disclosure of which is hereby incorporated in its entirety by reference herein.