The field of the present disclosure generally relates to generative models, and more particularly, to aspects of an architecture and methods for generative models that represent data as semantically descriptive symbols.
Recent efforts in deep generative modeling have yielded impressive results, showcasing both the capabilities and some limitations of variational autoencoders (VAEs) and generative adversarial networks (GANs). In some instances, VAEs and GANs have been used to generate high-resolution counterfeit images that are virtually indistinguishable by the naked eye from the real images used as inputs to the VAEs and GANs. However, the latent representations of data used in VAEs and GANs to generate images are generally uninterpretable by a human. Because these representations are uninterpretable, no additional insight regarding the input image and/or the modeling process may be gained by observing them.
Accordingly, in some respects, a need exists for methods and systems that provide an efficient and accurate mechanism for generative models to represent interpretable data.
Embodying systems and methods herein relate to generative models that, in general, have a goal of learning the true distribution of a set of data in order to generate new data points. Neural networks may be used to learn a function to approximate the model distribution to the true distribution. An autoencoder is one type of generative model and can be used to encode an input image into a lower dimensional representation that can store latent information about the input. A variational autoencoder (VAE) model may encode an input image into a lower dimensional representation storing latent information that can be used to generate images similar to the input image with some variability.
In some aspects, VAEs are generative models used to estimate the underlying data distribution from a set of training examples. A VAE may generally include an encoder that maps raw input to a latent variable z and a decoder that uses z to reconstruct the input. A loss function optimized in the VAE may be a combination of (i) the KL (Kullback-Leibler) divergence loss between the latent encoding vector and a known reference (e.g., Gaussian) distribution and (ii) the reconstruction loss at the decoder. Training may be performed in an end-to-end manner with the help of a reparameterization process at the latent sampling stage that converts a non-differentiable node into a differentiable node to thereby allow for backpropagation.
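For illustration, the reparameterization process and the closed-form KL divergence term described above may be sketched as follows. This is a minimal pure-Python sketch under the stated Gaussian assumptions; the function names and list-based representation are illustrative, not the actual network code:

```python
import math
import random

def reparameterize(mu, logvar, rnd=random):
    # Sample z = mu + sigma * eps with eps ~ N(0, 1) per dimension.
    # Moving the randomness into the auxiliary noise eps keeps mu and
    # logvar on a differentiable path, which is what permits
    # backpropagation through the sampling step.
    return [m + math.exp(0.5 * lv) * rnd.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL divergence between N(mu, exp(logvar)) and N(0, 1),
    # summed over latent dimensions.
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))
```

When the encoder outputs the standard normal parameters (mu = 0, logvar = 0), the KL term is zero, which matches the intuition that no divergence penalty is paid for exactly matching the reference distribution.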
In some aspects, the present disclosure may present a number of features and concepts in the context of, for example, a VAE. However, the presented features and concepts may be applied in varying embodiments, including generative models in general unless otherwise specified.
In some embodiments, the present disclosure includes a Symbolic VAE (i.e., a SVAE). In some aspects, the SVAE disclosed herein may be viewed as an extension of a traditional VAE that includes key features on the hidden/latent state of the network. In some embodiments, these features may improve interpretability by capturing explainable image semantics within a discrete symbol space. As with speech and language, discrete representations of latent information may provide several benefits. For example, discrete representations of latent information may be used to model salient classes in auditory/visual data and to represent meaningful policies and states in reinforcement learning applications, among other use cases and applications.
In some aspects, a distinct aspect of a SVAE herein is that latent symbols used to encode (e.g., an image) may serve as the building blocks for a learned private language. Given a sequence of discrete symbols (i.e., a sentence comprising the discrete symbols), systems and processes herein may directly decode the image that was used to generate the sentence. As a consequence, some symbols in a sentence might be manipulated to determine the “meaning” of each one. In some aspects, the present disclosure focuses on how objects in images are constructed, as opposed to how they are described.
Humans may typically use hierarchical labeling to describe entities in the world. WordNet appears to capture this property, where each word has many possible hypernyms (e.g., “color” is a hypernym of “red”). Hierarchical mappings in WordNet have improved interpretability significantly, helping to capture relationships between words. Studies in neuroscience have also shown that rule-based hierarchical models can be used to explain cortical linguistic structure. GAN-Tree, for example, has been shown to use a hierarchical structure to generate multi-modal data distributions. In some aspects, an SVAE disclosed herein may generate symbols following a learned grammar that is both hierarchical and explainable. Some embodiments use a discrete latent space to generate a hierarchical grammar via unsupervised learning methods. These mechanisms may effectively improve model explainability, as they provide greater control in generating data based on symbols. In some aspects, an SVAE herein might demonstrate how an image generated from a sentence of symbols varies as the symbols in the sentence are changed in a systematic manner. Based thereon, symbol manipulations may be associated with semantically noticeable changes in the reconstructed image, thereby effectively grounding the meaning of learned symbols.
Referring to the system architecture of
The array of numbers 220 in
In the present disclosure, a symbolic grounding problem is implemented in terms of a sender LSTM module that receives an input from an encoder and generates a sentence comprising a sequence of symbols (i.e., categorical data that is not continuous), and a receiver LSTM module that receives the sentence, which is then decoded to recreate the input image. Notably, the sender LSTM module enables backpropagation through the sender network by using a process that provides a differentiable approximation of the discrete symbol sampling. In some aspects, the receiver LSTM and the decoder need not do anything further to enable backpropagation since the sender LSTM fully addresses this issue. A receiver LSTM module herein may receive sentences from a sender LSTM module and produce continuous data that may be used by a decoder to recreate the original input image.
When these artificial intelligence (AI) agents are able to reconstruct the signal data from the symbols, then the system is referred to as being grounded. That is, if a SVAE herein is able to recreate the original image the sender LSTM receives based on the symbolic representation thereof by the receiver LSTM, then the sender LSTM and the receiver LSTM are grounded and able to communicate with each other via symbols.
In some embodiments, a SVAE herein may include a number of features to facilitate solving the symbolic grounding problem. In particular, (1) a vocabulary of a sequence of symbols is defined and (2) a length of a sentence comprising the symbols is defined. These constraints may operate to allow the sender LSTM and the receiver LSTM to communicate with each other efficiently and accurately.
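The two constraints above can be sketched as follows: a fixed vocabulary size and a fixed sentence length bound every message the sender emits. This is a minimal Python sketch with illustrative constants and function names (the actual sender is an LSTM; the softmax sampling here merely stands in for its per-position symbol output):

```python
import math
import random

VOCAB_SIZE = 20    # (1) illustrative vocabulary of discrete symbols
SENTENCE_LEN = 3   # (2) illustrative fixed sentence length

def sample_sentence(logits, rnd=random):
    # logits: SENTENCE_LEN rows of VOCAB_SIZE scores, e.g. produced by
    # a sender LSTM at each unrolled step. One symbol index is drawn
    # per position, so every emitted sentence respects the agreed
    # vocabulary and length constraints.
    sentence = []
    for row in logits:
        top = max(row)
        exps = [math.exp(x - top) for x in row]  # stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        sentence.append(rnd.choices(range(len(row)), weights=weights)[0])
    return sentence
```

Because both agents share VOCAB_SIZE and SENTENCE_LEN, the receiver always knows the exact shape of the message it must decode.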
In some aspects, the present disclosure uses categorical symbols for sentences instead of continuous values; the generated sentences are semantically meaningful, wherein the order of the symbols has a direct meaning corresponding to an input; and the input can be reconstructed (as the output) from the sequence of symbols, such that a human can understand the meaning of the symbols.
The SVAE of the present disclosure generates a sequence of symbols using a Long Short-Term Memory (LSTM) network. This differs from, for example, VAEs that use a discrete latent space (e.g., VQ-VAE and VQ-VAE2), wherein the encoder output is quantized into one latent vector from an N-vector codebook. In some embodiments, the sequence of symbols generated by a SVAE herein follows a hierarchy where, for example, the first symbol in the sequence may capture the most discriminative information, such as class/category assignment. In some embodiments, later symbols in a sequence of symbols might represent finer details within the class, such as, for example, child nodes under a parent node. In using an LSTM, some embodiments might capture the grammar that underlies patterns of discrete symbols, rather than (explicitly) encoding the information in terms of independent symbols. This grammar, when effectively captured, may be used for other purposes such as, for example, generating variations of the same image by changing one or more of its associated symbols. As an example, multiple colors of an object could be visualized by varying one or more symbols in a sequence of symbols, even though the SVAE has not actually seen images corresponding to the multiple colors of the object during the training of the SVAE.
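The variation procedure described above can be sketched as a simple sweep: hold all but one symbol fixed and step the remaining position through the vocabulary. This is an illustrative helper (the function name and list representation are hypothetical); decoding each variant through the receiver and decoder would visualize what that symbol contributes:

```python
def symbol_variations(sentence, position, vocab_size):
    # Hold every other symbol fixed and sweep one position through the
    # whole vocabulary, yielding vocab_size candidate sentences. Each
    # variant could then be decoded to an image to reveal the meaning
    # of the swept symbol (e.g., different colors of the same object).
    return [sentence[:position] + [s] + sentence[position + 1:]
            for s in range(vocab_size)]
```

Sweeping the first position would be expected to change the class of the generated image, while sweeping later positions would vary finer details within the class, consistent with the hierarchy described above.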
In some embodiments, the discrete latent symbols generated in one or more SVAEs herein are novel in that they also capture a hierarchical representation.
In some embodiments, a SVAE herein may, in some aspects, be constructed with variational inference like some traditional VAEs.
In some embodiments, a process herein might train the entire deep neural network with the reconstruction loss and KL divergence loss as described in Equation 1 below. Parameters of the encoder, sender, receiver, and decoder modules are jointly optimized by backpropagation:
Loss = E[log P(X|z)] − KL[Q(z|X) || P(z)]  (1)

where E[log P(X|z)] simplifies to taking the binary cross entropy between the reconstructed image and the input image, and KL[Q(z|X) || P(z)] results in:

KL[N(μ(X), σ(X)) || N(0, 1)] = ½ Σ [exp(σ(X)) + μ²(X) − 1 − σ(X)]
It is noted that in some embodiments, simplifications are made by assuming P(z) to be the normal distribution with mean 0 and standard deviation 1. In some aspects, the encoder and decoder consist of two fully connected layers. Since backpropagating gradients across discrete symbols is not possible, some embodiments utilize an estimator (e.g., Gumbel-Softmax) that results in a continuous gradient that is both stable and differentiable.
In some aspects, training a neural network with discrete intermediate outputs exhibits a number of challenges. For instance, standard backpropagation may only work on differentiable functions. Referring to
Instead of learning to describe imagery, the present disclosure focuses more on learning what constitutes an image so that whole images can be reconstructed using latent, symbolic representations.
Various aspects of the present disclosure relating to SVAEs have been tested on two image datasets: MNIST and FashionMNIST. Both datasets consist of about 60,000 training images and 10,000 test images. In a plurality of experiments, an encoder and decoder consisting of two fully-connected layers were used, reducing the dimension of the input image first to 400 and then to 20 in respective layers. The 20-dimensional feature from the last fully-connected layer of the encoder module was fed to a reparameterization layer. The output from the reparameterization layer was then provided to the sender module. The sender and receiver components consist of a single LSTM unrolled based on the sequence length used in different settings. For example, the sender LSTM embedding dimension may be 256 and the hidden layer dimension may be 512. The temperature parameter in Gumbel-Softmax is 1. The Adam optimizer (or another optimization process or algorithm) was used with a learning rate set to 1e−5.
In some aspects, conducted experiments show that discrete symbols capture the semantic properties of an image and can be used to unearth underlying primitives. Without any supervision, it was observed that each symbol represents a concept. As outlined in detail below, the generated symbols form a grammar with useful semantic properties.
As depicted in
Another set of experiments trained a SVAE in accordance with the present disclosure with a vocabulary size of 20 and a sentence length of 3.
A similar result is obtained when the same experiment was performed using different dataset (e.g., the MNIST dataset), as seen in
Compared to, for example, VQ-VAE, aspects of the present disclosure provide better control over generated images because of the added advantage of using sentences instead of a single latent code. Thus, aspects of the present disclosure include a systematic methodology of generating images by exhausting all possible symbol sequences for the given vocabulary size and sentence length.
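The exhaustive generation methodology above amounts to enumerating the full Cartesian product of the vocabulary over the sentence positions. This can be sketched in a few lines (the function name is illustrative; each enumerated tuple would be fed to the receiver and decoder to produce an image):

```python
from itertools import product

def all_sentences(vocab_size, sentence_length):
    # Systematically enumerate every possible sentence, i.e. every
    # sequence of `sentence_length` symbols drawn from a vocabulary of
    # `vocab_size` symbols. There are vocab_size ** sentence_length
    # such sentences in total.
    return list(product(range(vocab_size), repeat=sentence_length))
```

For the experimental setting of a vocabulary size of 20 and a sentence length of 3, this yields 20³ = 8,000 candidate sentences, a tractable number to decode and inspect exhaustively.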
In some aspects and embodiments, the SVAE presented herein provides the benefits and practical application(s) of interpretability and the ability to generate images by varying symbolic encodings. It is noted that the generated symbols form a grammar, where the first symbol might refer to the class of the image, and the next set of symbols express finer features. By exhausting all possible symbol sequences for a given category, it has been demonstrated how finer characteristics are captured in a hierarchical fashion. These aspects provide a foundation that supports an understanding of what the primitives of images are and how each primitive might affect the appearance of various image types.
In some aspects, while the success of deep learning methods has provided exciting new ways to transform data into useful representations, explainability remains a critical problem. In particular, human interfacing with artificial agents relies on modes of communication that can be interpreted by both parties. Significantly, the SVAE disclosed herein provides a framework by implementing a symbolic method for encoding raw data, wherein each symbol appears to have some meaning to human observers (e.g., a “red shoe” versus a “white shoe”).
System 600 includes processor(s) 610 operatively coupled to communication device 620, data storage device 630, one or more input devices 640, one or more output devices 650, and memory 660. Communication device 620 may facilitate communication with external devices, such as a data server and other data sources. Input device(s) 640 may comprise, for example, a keyboard, a keypad, a mouse or other pointing device, a microphone, a knob or a switch, an infra-red (IR) port, a docking station, and/or a touch screen. Input device(s) 640 may be used, for example, to enter information into system 600. Output device(s) 650 may comprise, for example, a display (e.g., a display screen), a speaker, and/or a printer.
Data storage device 630 may comprise any appropriate persistent storage device, including combinations of magnetic storage devices (e.g., magnetic tape, hard disk drives and flash memory), optical storage devices, Read Only Memory (ROM) devices, etc., while memory 660 may comprise Random Access Memory (RAM), Storage Class Memory (SCM) or any other fast-access memory. Files including, for example, generative model representations (e.g., VAE, GAN, etc.), training datasets, output records (e.g., generated recreated images), reparameterization process(es)/models herein, and other data structures may be stored in data storage device 630.
SVAE engine 632 may comprise program code executed by processor(s) 610 (and within the execution engine) to cause system 600 to perform any one or more of the processes or portions thereof disclosed herein to effectuate a SVAE or other symbolic generative model. Embodiments are not limited to execution by a single apparatus. Data storage device 630 may also store data and other program code 636 for providing additional functionality and/or which are necessary for operation of system 600, such as device drivers, operating system files, etc.
In accordance with some embodiments, a computer program application stored in non-volatile memory or a computer-readable medium (e.g., register memory, processor cache, RAM, ROM, hard drive, flash memory, CD ROM, magnetic media, etc.) may include code or executable instructions that when executed may instruct and/or cause a controller or processor to perform methods disclosed herein, such as a method of representing data as semantically descriptive symbols in a generative model.
The computer-readable medium may be a non-transitory computer-readable medium including all forms and types of memory and all computer-readable media except for a transitory, propagating signal. In one implementation, the non-volatile memory or computer-readable medium may be external memory.
Although specific hardware and methods have been described herein, note that any number of other configurations may be provided in accordance with embodiments of the invention. Thus, while there have been shown, described, and pointed out fundamental novel features of the invention, it will be understood that various omissions, substitutions, and changes in the form and details of the illustrated embodiments, and in their operation, may be made by those skilled in the art without departing from the spirit and scope of the invention. Substitutions of elements from one embodiment to another are also fully intended and contemplated. The invention is defined solely with regard to the claims appended hereto, and equivalents of the recitations therein.