The present disclosure relates generally to systems and methods for computer learning that can provide improved computer performance, features, and uses. More particularly, the present disclosure relates to systems and methods to improve computer performance, features, and uses in learning latent structural relations.
Deep neural networks have achieved great successes in many domains, such as computer vision, natural language processing, recommender systems, etc. Disentangled representation learning, which aims to learn factorized representations that discover and disentangle the latent explanatory factors in data, is a fundamental but challenging problem in machine learning and artificial intelligence. Interpretable disentangled representations have demonstrated their power in unsupervised learning and semi-supervised learning.
A major challenge in extracting representations from images with multiple objects lies in the unsupervised setting and the complicated interaction patterns. Most existing approaches may not be applied to this problem because it is challenging to integrate data segmentation and representation learning. Moreover, learning the complicated entity interactions in the real world requires a powerful and flexible prior for latent representations that may adaptively encode complicated structural relations.
Accordingly, what is needed are systems and methods to learn latent structural relations for improved computer performance, features, and uses.
References will be made to embodiments of the disclosure, examples of which may be illustrated in the accompanying figures. These figures are intended to be illustrative, not limiting. Although the disclosure is generally described in the context of these embodiments, it should be understood that it is not intended to limit the scope of the disclosure to these particular embodiments. Items in the figures may not be to scale.
In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the disclosure. It will be apparent, however, to one skilled in the art that the disclosure can be practiced without these details. Furthermore, one skilled in the art will recognize that embodiments of the present disclosure, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer-readable medium.
Components, or modules, shown in diagrams are illustrative of exemplary embodiments of the disclosure and are meant to avoid obscuring the disclosure. It shall also be understood that, throughout this discussion, components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including, for example, being in a single system or component. It should be noted that functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.
Furthermore, connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled,” “connected,” “communicatively coupled,” “interfacing,” “interface,” or any of their derivatives shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections. It shall also be noted that any communication, such as a signal, response, reply, acknowledgement, message, query, etc., may comprise one or more exchanges of information.
Reference in the specification to “one or more embodiments,” “preferred embodiment,” “an embodiment,” “embodiments,” or the like means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the disclosure and may be in more than one embodiment. Also, the appearances of the above-noted phrases in various places in the specification are not necessarily all referring to the same embodiment or embodiments.
The use of certain terms in various places in the specification is for illustration and should not be construed as limiting. A service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated. The terms “include,” “including,” “comprise,” and “comprising” shall be understood to be open terms, and any lists that follow are examples and not meant to be limited to the listed items. A “layer” may comprise one or more operations. The words “optimal,” “optimize,” “optimization,” and the like refer to an improvement of an outcome or a process and do not require that the specified outcome or process has achieved an “optimal” or peak state. The terms memory, database, information base, data store, tables, hardware, cache, and the like may be used herein to refer to a system component or components into which information may be entered or otherwise recorded.
In one or more embodiments, a stop condition may include: (1) a set number of iterations have been performed; (2) an amount of processing time has been reached; (3) convergence (e.g., the difference between consecutive iterations is less than a first threshold value); (4) divergence (e.g., the performance deteriorates); and (5) an acceptable outcome has been reached.
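The five stop conditions listed above may be checked in a training loop along the following lines. This is a minimal sketch, not part of the disclosed embodiments: the `StopConfig` fields, their default values, and the `stop_reason` helper are hypothetical names introduced for illustration, and divergence is interpreted here as a loss that rises for several consecutive iterations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StopConfig:
    # All field names and defaults are hypothetical, chosen for illustration.
    max_iters: int = 1000                # (1) iteration budget
    max_seconds: float = 3600.0          # (2) processing-time budget
    tol: float = 1e-4                    # (3) convergence threshold
    patience: int = 3                    # (4) consecutive deteriorations tolerated
    target_loss: Optional[float] = None  # (5) acceptable outcome, if any


def stop_reason(cfg: StopConfig, iteration: int, elapsed: float,
                losses: List[float]) -> Optional[str]:
    """Return the name of the first stop condition met, or None to continue.

    `losses` is the history of a quantity that is being minimized.
    """
    if iteration >= cfg.max_iters:
        return "max_iters"
    if elapsed >= cfg.max_seconds:
        return "max_seconds"
    if cfg.target_loss is not None and losses and losses[-1] <= cfg.target_loss:
        return "acceptable"
    if len(losses) >= 2 and abs(losses[-1] - losses[-2]) < cfg.tol:
        return "converged"
    # Divergence: the loss rose in each of the last `patience` steps.
    if len(losses) > cfg.patience:
        recent = losses[-(cfg.patience + 1):]
        if all(b > a for a, b in zip(recent, recent[1:])):
            return "diverged"
    return None
```

A loop would call `stop_reason` once per iteration and exit when the result is not `None`.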
One skilled in the art shall recognize that: (1) certain steps may optionally be performed; (2) steps may not be limited to the specific order set forth herein; (3) certain steps may be performed in different orders; and (4) certain steps may be done concurrently.
Any headings used herein are for organizational purposes only and shall not be used to limit the scope of the description or the claims. Each reference/document mentioned in this patent document is incorporated by reference herein in its entirety.
It shall be noted that any experiments and results provided herein are provided by way of illustration and were performed under specific conditions using a specific embodiment or embodiments; accordingly, neither these experiments nor their results shall be used to limit the scope of the disclosure of the current patent document.
It shall also be noted that although embodiments described herein may be within the context of image processing, aspects of the present disclosure are not so limited. Accordingly, the aspects of the present disclosure may be applied or adapted for use in other contexts, including but not limited to language processing.
Deep neural networks have achieved great successes in many domains, such as computer vision, natural language processing, recommender systems, etc. Disentangled representation learning, which aims to learn factorized representations that discover and disentangle the latent explanatory factors in data, is a fundamental but challenging problem in machine learning and artificial intelligence. Interpretable disentangled representations have demonstrated their power in unsupervised learning and semi-supervised learning.
Most existing methods for disentangled representation learning are based on Variational Auto-Encoders (VAEs) or Generative Adversarial Networks (GANs). The commonality of these works is that disentangled representations are extracted from a single entity or object in one data sample. Recently, there has been growing research interest in integrating representation learning with scene segmentation by leveraging generative models. However, very few of these methods consider the interaction among multiple objects or sample portions.
A major challenge in extracting representations from images with multiple objects lies in the unsupervised setting and the complicated interaction patterns. Most existing approaches may not be applied to this problem because it is challenging to integrate data segmentation and representation learning. Moreover, learning the complicated entity interactions in the real world requires a powerful and flexible prior for latent representations that may adaptively encode complicated structural relations.
In the present disclosure, embodiments of a novel approach to learn object representations and encode object relations are presented. In one or more embodiments, the latent representation vector for each object or component in a scene is divided into two sections, a local section and a global section. The local section controls the individual properties that are independent of the other objects. The global section, shared by all the objects in a scene, encodes the object relationships as well as the global latent factors. In one or more embodiments, the inference and interaction between different objects may be handled with a flow-based model. The flow-based structure prior of the latent representation allows rigorous score computation to estimate correlation and causal interaction between two components.
Embodiments of the present model have been applied to different datasets, and significant improvement has been obtained in scene segmentation and object representation learning by considering interactions among different components. Compared to existing methods, embodiments capture significantly more relations between objects. Theoretical properties of bi-level variational auto-encoder embodiments, such as the Evidence Lower Bound (ELBO), are also provided.
Embodiments of the present disclosure develop an approach to disentangle the structural latent representation by leveraging deep generative models, which has direct applications in computer vision, image processing, and many other fields that need to resolve data segmentation or data decomposition. In this section, some related works on scene or image segmentation in computer vision and disentanglement learning are reviewed.
1. Scene Segmentation
Recently, deep generative models have been integrated with unsupervised scene segmentation methods. Some proposed an approach to learn the representation of individual objects and scene segmentation simultaneously. Such a method of integrating iterative amortized inference and VAE is a fully unsupervised approach to learn visual concepts. With this method, a complete system may be trained end-to-end by simply maximizing its ELBO. MONet employed a recurrent attention network to discriminate different objects instead of using complicated amortized inference. The scene is segmented by leveraging the weighted objective with attention masks. At least one of the major differences between embodiments of the present disclosure and the aforementioned methods is that the objects in a scene may interact with each other in the present disclosure without the independence assumption among them.
2. Disentanglement
Variants of VAEs have achieved state-of-the-art (SOTA) performance for unsupervised disentanglement learning. One may assume a specific prior P(z) on the latent space and then parameterize the conditional probability P(x|z) with a deep neural network. The distribution P(z|x) is approximated using a variational distribution Q(z|x). The objective function for a VAE may be expressed as:
$$\mathcal{L}(x) = \mathbb{E}_{Q(z|x)}\big[\log P(x|z)\big] - \mathrm{KL}\big(Q(z|x)\,\|\,P(z)\big)$$
The objective function is the ELBO. It is also possible to introduce various properties into the final representation by modifying the KL term. Some proposed a β-VAE, in which a hyper-parameter β was introduced for the KL regularizer of vanilla VAEs. When β>1, β-VAE penalizes the mutual information between the latent representation and the data sample. There are several different approaches to learn disentangled data representation. Independent component analysis (ICA) has been extended to nonlinear cases to achieve disentanglement of variables.
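The ELBO with a β-weighted KL regularizer may be illustrated as follows. This is a minimal sketch under stated assumptions: the posterior Q(z|x) is taken to be a diagonal Gaussian and the prior P(z) a standard Normal, so that the KL term has a closed form, and the reconstruction log-likelihood is assumed to be supplied by a decoder that is not shown. The function names are hypothetical.

```python
import math
from typing import Sequence


def gaussian_kl(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).

    Assumes a diagonal Gaussian posterior and a standard Normal prior.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))


def beta_elbo(recon_log_lik: float, mu: Sequence[float],
              logvar: Sequence[float], beta: float = 1.0) -> float:
    """ELBO with the KL regularizer scaled by beta.

    beta = 1 gives the vanilla VAE objective; beta > 1 strengthens the
    KL penalty, as in a beta-VAE.
    """
    return recon_log_lik - beta * gaussian_kl(mu, logvar)
```

With a posterior equal to the prior the KL term vanishes and the objective reduces to the reconstruction term alone, for any β.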
Embodiments of the present disclosure involve developing a framework that may seamlessly integrate data segmentation and representation learning. In the present patent disclosure, a latent relational learning prior with message passing scheme and theoretical analysis are disclosed; embodiments of a bi-level VAE framework with a solid derivation of ELBO are presented; and embodiments of the presented framework have been applied to latent relational representation learning and component segmentation. Experiments show that segmentation and disentangled representation of different components may be improved with the inference mechanism of the presented structured VAE with the novel prior.
In this section, embodiments of an information aggregation prior are first introduced. Afterwards, some detailed framework embodiments to learn disentangled structured latent factors are presented. In one or more embodiments, components may refer to objects in an image or to different portions of a data sample.
In the present disclosure, embodiments of an aggregation model to learn interactions among data components are disclosed. The aggregation prior is then extended to bi-level decomposable variational auto-encoders (VAEs) that may learn disentangled latent structural representations from input data. Unlike some previous methods that ignore component or object interactions, embodiments of the present disclosure simultaneously learn component representation and encode component relationships with a bi-level VAE structure. In one or more embodiments, an auto-encoder for a second level or layer is parameterized with a flow-based model that allows performing relational inference, and it may also be taken as the structural prior for part of the first layer auto-encoder's latent distribution. In the present disclosure, theoretical property proofs, some empirical results, and detailed network architecture embodiments are provided. Notations in one or more model embodiments of the bi-level decomposable VAEs are listed in Table 1 below.
[Table 1: Notations for the bi-level decomposable VAE model embodiments]
1. Latent Relational Learning with Message Passing Prior Embodiments
and reconstruction ŷk=ƒk−1(h).
In one or more embodiments, the relationship between yk, k=1, . . . , K and h may be modeled with invertible flow-based networks. In one or more embodiments, flow function ƒk specifies a parametric invertible transformation from the distribution of yk to the latent variable hk, i.e., ƒk: ℝ^l→ℝ^l may be invertible. Here l is the dimension of hk and yk. With hk=ƒk(yk), using the change-of-variables formula, the following equation may be obtained:
$$\log p(y_k) = \log p(h_k) + \log\left|\det\frac{\partial f_k(y_k)}{\partial y_k}\right|$$
As shown in
Therefore, the aggregated latent variable h may be a concise representation that may fully reconstruct all components of the data. In one or more embodiments, ƒk, k=1, . . . , K may be enforced to ensure that hk=h, and yk=ŷk=ƒk−1(h).
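The mapping of each component to a latent variable, the aggregation into h, and the reconstruction ŷk=ƒk−1(h) may be sketched as follows. This is an illustrative sketch only: the elementwise affine map `AffineFlow` is a hypothetical stand-in for a learned flow ƒk, and averaging the per-component latents is an assumption made here, since the aggregation rule itself is not reproduced in this text.

```python
import math
from typing import List, Sequence


class AffineFlow:
    """Elementwise invertible map h = a * y + b (hypothetical stand-in for f_k)."""

    def __init__(self, a: float, b: float):
        if a == 0:
            raise ValueError("a must be nonzero for the map to be invertible")
        self.a, self.b = a, b

    def forward(self, y: Sequence[float]) -> List[float]:
        return [self.a * v + self.b for v in y]

    def inverse(self, h: Sequence[float]) -> List[float]:
        return [(v - self.b) / self.a for v in h]

    def log_abs_det_jacobian(self, y: Sequence[float]) -> float:
        # The Jacobian is diagonal with entries a, so log|det| = l * log|a|.
        return len(y) * math.log(abs(self.a))


def normal_logpdf(v: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (v - mu) ** 2 / (2 * sigma ** 2)


def log_density(flow: AffineFlow, y: Sequence[float],
                mu: float = 0.0, sigma: float = 1.0) -> float:
    """Change of variables: log p(y) = log p(h) + log|det df/dy|, with h = f(y)."""
    h = flow.forward(y)
    return sum(normal_logpdf(v, mu, sigma) for v in h) + flow.log_abs_det_jacobian(y)


def aggregate(flows: Sequence[AffineFlow], ys: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-component latents h_k = f_k(y_k) (averaging is an assumption)."""
    hs = [f.forward(y) for f, y in zip(flows, ys)]
    return [sum(col) / len(hs) for col in zip(*hs)]


def reconstruct(flows: Sequence[AffineFlow], h: Sequence[float]) -> List[List[float]]:
    """Decode the shared latent back to every component: y_hat_k = f_k^{-1}(h)."""
    return [f.inverse(h) for f in flows]
```

When all ƒk map their components to the same latent value, the aggregated h reproduces every component exactly, which is the situation described above in which hk=h and yk=ŷk.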
Latent Variable Aggregation. In one or more embodiments, it is assumed that each entry of hk, k=1, . . . , K follows a Normal distribution, i.e., hk˜N(μk, σ2). In one or more embodiments, the variance σ2 is set as a fixed value across all components. With
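the prior distribution for each entry of h is a Normal distribution N(μ, σ2). Based on an encoder and decoder VAE scheme, model parameters of the aggregation model may be learned by maximizing the ELBO,
$$\log p_f(y) \ge \mathbb{E}_{q_f(h|y)}\big[\log p_f(y|h)\big] - \mathrm{KL}\big(q_f(h|y)\,\|\,p(h)\big) \quad (2)$$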
Given a batch of training samples, the ELBO value may be computed with the message passing procedures. In one or more embodiments,
is used as the sample generated from qƒ (h|y). Given an h, it is expected that it may fully reconstruct the input data. In one or more embodiments, the reconstruction term log pƒ
Here
In one or more embodiments, constant values for both σy2 and σ2 are used, hence the value of C may also be set as a constant. In one or more embodiments, h from a batch of training samples are used to approximate the KL term in (2).
In one or more embodiments, given a data sample y=[y1, y2, . . . , yK], the following lemma regarding the likelihood value computed with the message passing scheme in
Lemma 1. The log-likelihood of y can be approximated by
$$\log p(y) \approx \log p(h) - \tfrac{1}{2}\log\det\big(J_{\hat{y}}(h)^{\top} J_{\hat{y}}(h)\big) \quad (4)$$
Here Jŷ(h)=[Jŷ
Proof: The structure relation between h and y=[y1, y2, . . . , yK] is given in
From equation (5), the log-likelihood for y may be obtained.
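The approximation in Lemma 1 may be checked numerically in a simple case. The sketch below assumes a scalar latent (l=1), so that the stacked Jacobian of the decoders ƒk−1 is a column vector and the determinant term reduces to a sum of squared derivatives; the Jacobian term is read as ½ log det(JᵀJ). The decoders and the finite-difference derivative are hypothetical choices for illustration.

```python
import math
from typing import Callable, Sequence


def normal_logpdf(h: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (h - mu) ** 2 / (2 * sigma ** 2)


def derivative(g: Callable[[float], float], h: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of dg/dh."""
    return (g(h + eps) - g(h - eps)) / (2 * eps)


def approx_log_likelihood(decoders: Sequence[Callable[[float], float]], h: float,
                          mu: float = 0.0, sigma: float = 1.0) -> float:
    """log p(y) ~ log p(h) - 0.5 * log det(J^T J) for y = [g_1(h), ..., g_K(h)].

    With a scalar h, J is the K x 1 column of derivatives g_k'(h), so
    det(J^T J) is the sum of their squares.
    """
    jtj = sum(derivative(g, h) ** 2 for g in decoders)
    return normal_logpdf(h, mu, sigma) - 0.5 * math.log(jtj)
```

For two linear decoders with slopes 3 and 4, the Jacobian term equals −½ log 25 = −log 5, so the log-likelihood at h=0 is the standard Normal log-density at zero minus log 5.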
In one or more embodiments, one may compute the Jacobian matrix of each flow function ƒk, and thus obtain the log-likelihood values. In one or more embodiments, the correlation or causality relations between two data components may be estimated. In an example with two components a and b, to estimate the value of component b given an observed value of a, the latent value h is first estimated as ĥ=ƒa(ya). Then, an estimated ŷb may be obtained as ŷb=ƒb−1 (ƒa(ya)). Meanwhile, the conditional probability may be written as:
log p(yb|ya)≈log p(ĥ)−½ log p(det(Jŷ
In one or more embodiments, the inference scheme may be applied to component relation detection, which is discussed in more detail in the section on component interaction inference.
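The two-component inference described above, in which the latent value is estimated from ya and then decoded into ŷb, may be sketched as follows. The `Flow` container and the particular scalar flows used in the usage note are hypothetical; in practice ƒa and ƒb would be learned invertible networks.

```python
import math
from typing import Callable, NamedTuple


class Flow(NamedTuple):
    """A scalar invertible map and its inverse (hypothetical container)."""
    forward: Callable[[float], float]   # y -> h
    inverse: Callable[[float], float]   # h -> y


def predict_component(flow_a: Flow, flow_b: Flow, y_a: float) -> float:
    """Estimate y_b from y_a: h_hat = f_a(y_a), then y_hat_b = f_b^{-1}(h_hat)."""
    h_hat = flow_a.forward(y_a)
    return flow_b.inverse(h_hat)
```

For example, with ƒa(y)=2y+1 and ƒb(y)=log y, an observation ya=0 gives ĥ=1 and ŷb=exp(1). Swapping the roles of the two flows performs inference in the opposite direction.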
Identifiability. Let yk,i be the i-th entry of yk. In one or more embodiments, a relation r between yu,i and yv,j is defined if there is a mapping or a function that links them. A relation set r may include multiple relations. In one or more embodiments, v is used to represent the set of variables involved in r. In one or more embodiments, a relation set in the present disclosure may be a connected graph with v as the vertex set. Let ℛ be the set of all relation sets in a data set; the following lemma regarding the recovery of relations may then be proved.
Lemma 2. If variable relations in ℛ are monotone, and |ℛ|≤dim(h), then ℛ can be approximately fully recovered.
Proof of Lemma 2: The ELBO for the aggregation model is
With N(μ, σ2) as the prior for each entry of h, the KL term may be approximated with
Here $\hat{\sigma}_i$ is the ith entry of $\hat{\sigma}$, and $\hat{\sigma}$ may be approximated with a batch of training data samples. The KL term regularizes the distributions of all entries of h to be close to the prior N(μ, σ2) individually, and thus to be independent of each other.
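A batch-based approximation of a per-entry KL term of this kind may be sketched as follows. This is an assumption-laden illustration, not the disclosed formula: each entry of h is moment-matched to a Gaussian using the batch mean and batch variance, and the closed-form KL between two univariate Gaussians is summed over entries. The function name `batch_kl` is hypothetical.

```python
import math
from statistics import fmean, pvariance
from typing import Sequence


def batch_kl(h_batch: Sequence[Sequence[float]],
             mu: float = 0.0, sigma: float = 1.0) -> float:
    """Sum over entries i of KL( N(m_i, s_i^2) || N(mu, sigma^2) ).

    m_i and s_i^2 are the batch mean and batch variance of entry i of h.
    Assumes every entry has nonzero variance within the batch.
    """
    total = 0.0
    for column in zip(*h_batch):          # one column per entry of h
        m = fmean(column)
        var = pvariance(column, m)
        total += (math.log(sigma) - 0.5 * math.log(var)
                  + (var + (m - mu) ** 2) / (2 * sigma ** 2) - 0.5)
    return total
```

The value is zero when every entry of h already has the prior's mean and variance over the batch, and it grows as the batch statistics drift away from the prior.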
Without loss of generality, it may be assumed that σy=σ=1. Maximizing the ELBO may be equivalent to the following optimization problem,
In one or more embodiments, a sigmoid function may be used as the last step of each ƒk, and μ may be set as a non-negative function. In an example of a simple case with two variables y1 and y2, and l=dim(h)=1, assuming the relation is y2=ϕ(y1), where ϕ is continuous, monotone, and invertible, the objective in Equation (7) may be rewritten as:
In one or more embodiments, ƒ1(y1)=ƒ2(y2)=h may be obtained; with y2=ϕ(y1), ƒ2−1∘ƒ1=ϕ. Hence ƒ1=ƒ2∘ϕ and ƒ1−1=ϕ−1∘ƒ2−1. For the multivariate case, let r∈ℛ be one of the relation sets that involves multiple variables belonging to v. Each relation set corresponds to an interaction graph with vertex variables in v from different components yk, k=1, . . . , K. Under the assumptions that the relations in ℛ are monotone and invertible, any pair of variables from v may be linked with a function. With the assumption that any pair of variables from different relation sets or graphs are independent of each other, by maximizing the ELBO in equation (7), the VAE model may assign one of the independent latent variables hi to each relation set or graph. Therefore, as long as |ℛ|≤dim(h), the relation sets in ℛ may be approximately fully recovered with the independence of the latent variables h.
With the invertible flow-based model, embodiments of the disclosed bi-level decomposable VAEs may be fit to the nonlinear ICA framework. For component k, suppose the distribution regarding hk is a factorial member of the exponential family with m sufficient statistics, conditioned on uk. Here uk is an additional observed variable. In one or more embodiments, the general form of the distribution may be written as:
Here Qi is the base measure, Zi is the normalizing constant, Ti,j are the components of the sufficient statistic, and λi,j are the corresponding parameters, depending on uk. The variable yk is the output of an arbitrarily complex, invertible, and deterministic transformation from the latent space to the data space, i.e., yk=ƒk−1(hk). Let T=[T1, . . . , Tl], λ=[λ1, . . . , λl], and Θ={θ:=(T, λ, ƒk−1)}; with parameter θ=(T, λ, ƒk−1), the following equation may be obtained:
p
θ(yk,hk|uk)=log pƒ
In one or more embodiments, the set of parameters $\hat{\Theta}$ may be obtained with some learning algorithm, i.e., $\hat{\Theta}=\{\hat{\theta}:=(\hat{T}, \hat{\lambda}, g_k)\}$. Using gk to represent the learned approximation of ƒk−1, and yk=gk(hk), the following equivalence relations on Θ may be defined.
Definition 1. Let ∼ be the equivalence relation on $\hat{\Theta}$. Equation (10) is identifiable up to ∼ if
$$p(y_k, \Theta) = p(y_k, \hat{\Theta}) \Rightarrow \Theta \sim \hat{\Theta}.$$
The elements of the quotient space Θ/∼ are called the identifiability classes.
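Definition 2. Let ∼ be the binary relation on $\hat{\Theta}$ defined by:
$$(T, \lambda, f_k^{-1}) \sim (\hat{T}, \hat{\lambda}, g_k) \Leftrightarrow \exists A, c \mid T(f_k(y_k)) = A\,\hat{T}(g_k^{-1}(y_k)) + c, \quad \forall y_k \in \mathcal{Y}_k,$$
where A is an lm×lm matrix and c is a vector.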
In one or more embodiments, an explicit additional observable variable uk for component k may not be available. However, K−1 signals from other components related to it may be available. Assuming the relations involving component k may be fully recovered and sufficient label support from other components may be obtained, the model is identifiable. In one or more embodiments, y−k is used to represent components other than component k, and uk(y−k) is the additional variable recovered from the relations with other components. In the limit of infinite data and good convergence, the estimating model may give the same conditional likelihood to all data points as the true generating model:
$$p_{T,\lambda,f_k^{-1}}\big(y_k \mid u_k(y_{-k})\big) = p_{\hat{T},\hat{\lambda},g_k}\big(y_k \mid u_k(y_{-k})\big)$$
In one or more embodiments, the domain of ƒk−1 is defined as ℋ=ℋ1× . . . ×ℋl. The following theorem regarding the identifiability of the model may be obtained.
Theorem 1. Assume that the observed data are distributed according to the generative model given by equations (7) and (8), and that the following assumptions hold:
(a) The sufficient statistics Ti,j(h) are differentiable almost everywhere and their derivatives $\partial T_{i,j}(h)/\partial h$ are nonzero almost surely for all h∈ℋi and all 1≤i≤l and 1≤j≤m;
(b) The relations involving component k can be approximately fully recovered and can be represented with uk(y−k); and
(c) There exist lm+1 distinct conditions uk(0), . . . , uk(lm) from y−k such that the matrix L=[λ(uk(1))−λ(uk(0)), . . . , λ(uk(lm))−λ(uk(0))] of size lm×lm is invertible;
Then the model parameters (T, λ, ƒk−1) are ∼A-identifiable.
The proof of Theorem 1 and analysis are shown below. Real-world datasets are usually more complicated, with non-stationary component locations. The present patent disclosure discloses embodiments of a bi-level latent model that integrates the aggregation prior model, attention mechanism, and component segmentation for improved flexibility.
Proof of Theorem 1: The conditional probabilities of pT,X,ƒ
log pT,λ(yk|uk)+log|det Jƒ(yk)|=log p{circumflex over (T)},{circumflex over (X)}(hk′|uk)+log|det Jg
Different from approaches using observed auxiliary variables as conditional variables, it is assumed that the relations with component k may be recovered and signals from other components may be used as conditional labels. Using uk(0), . . . , uk(lm) from conditions (b) and (c), the expression evaluated at uk(0) is subtracted from the expression evaluated at some condition uk(t); with the Jacobian terms removed since they do not depend on uk, the following equation may be obtained:
log ph
In equation (11), both conditional distributions of hk given uk belong to the exponential family. Equation (11) may be rewritten as:
Here the base measures Qi are cancelled out as they do not depend on uk. Equation (12) may be rewritten with inner products as:
With lm equations combined together, equation (15) may be rewritten in a matrix equation form as follows:
$$L^{\top} T(h_k) = \hat{L}^{\top} \hat{T}(h_k') + b \quad (16)$$
By multiplying both sides of equation (16) by the inverse of $L^{\top}$, the following equation may be obtained:
$$T(h_k) = A\,\hat{T}(h_k') + c \quad (17)$$
Here $A = L^{-\top}\hat{L}^{\top}$ and $c = L^{-\top} b$. There may exist m distinct values hk,i1 to hk,im such that
are linearly independent in ℝ^m, for all 1≤i≤l. By defining m vectors hkt=[hk,1t, . . . , hk,it] from multiple points, the Jacobian Q=[JT(hk1), . . . , JT(hkm)] may be obtained, with each entry being a Jacobian of size lm×l, from the derivative of equation (17) with respect to these m vectors. Hence Q is an lm×lm invertible matrix, owing to the linear independence above and the fact that each component of T is univariate. In one or more embodiments, a corresponding matrix $\hat{Q}$ with the Jacobian computed at the same points may be constructed and the following equation may be obtained:
$$Q = A\,\hat{Q} \quad (18)$$
Here $\hat{Q}$ and A are both full rank, as Q is full rank.
2. Embodiments of Bi-Level Latent Structure
In this subsection, embodiments of a generative model that may identify the hierarchy of components in a dataset are disclosed. The generative model uses a generator that maps the latent space to a manifold embedded in the sample input space. It is assumed that there are K conditionally independent components for the samples of a dataset. x=x1 . . . xK is the output variable of the generator, and z=z0z1 . . . zK is the latent variable of the generator, wherein xk is the variable for the kth component, z0 controls the global properties of each sample of x across all components, and zk controls the properties of component k that are independent of the other components and z0. In one or more embodiments, it is assumed that the components are conditionally independent of each other given the latent variable, i.e., xi⊥xk|z, if i≠k. In one or more embodiments, an independence assumption is also made between components and the latent variables of other components, i.e., xi⊥zk|z0, if i≠k. With these two assumptions, the distribution of the generated samples may be shown as:
In one or more embodiments, a hierarchy structure is employed for the latent variables.
As shown in
3. Network Framework Embodiments
Embodiments of a framework are disclosed to encode and decode each component and capture the global latent factor as well. In one or more embodiments, a single VAE framework may be used for encoding and decoding of all components. For each component, the latent vector zk′ contains two sections, i.e., zk′=zkz0(k). zk is used for component k's local latent features, and z0(k) is for the features of component k controlled by the global latent factor z0.
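The division of each component's latent vector into a local section and a global section may be sketched as follows. The ordering (local entries first, global entries last) and the `local_dim` parameter are assumptions made for illustration; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_latent(z_prime: Sequence[float],
                 local_dim: int) -> Tuple[List[float], List[float]]:
    """Split z'_k into (z_k, z_0^(k)): local section first, global section last."""
    if not 0 <= local_dim <= len(z_prime):
        raise ValueError("local_dim must lie between 0 and len(z_prime)")
    return list(z_prime[:local_dim]), list(z_prime[local_dim:])


def collect_global_sections(latents: Sequence[Sequence[float]],
                            local_dim: int) -> List[List[float]]:
    """Gather z_0^(k) for k = 1..K, the inputs to the second-layer model."""
    return [split_latent(z, local_dim)[1] for z in latents]
```

Only the collected global sections are passed on to the second layer; the local sections stay with their own components.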
The encoder e encodes (415) the input and the mask for the kth component (x, mk) into an overall latent variable for the kth component zk′ (zk′=zk z0(k)). zk is used for component k's local latent features, and z0(k) is for the features of component k controlled by the global latent factor z0. In one or more embodiments of the present disclosure, different from MONet, which uses just one layer of latent variables, a flow-based model comprising one or more flow functions (ƒ={ƒ1, . . . , ƒK}) is used as a second layer auto-encoder to transform (420) all global latent variables z0(k), k=1, . . . , K, into transformed global latent variables ƒk(z0(k)), k=1, . . . , K. An aggregated global latent variable (z0) is then generated (425) based on the one or more transformed global latent variables. The aggregated global latent variable z0 is transformed (430), using the flow-based model, back into one or more reconstructed global latent variables (
In one or more embodiments, the message passing prior may curb the model's degree of freedom and may capture the interaction between different segments or components as well. The aggregation prior model shown in
The global latent variable z0 is passed (515) backward through the flow-based model to obtain a reconstructed global latent variable
In one or more embodiments, the relation between the global latent variable z0 and the global latent variable for each component z0(k), k=1, . . . , K is taken as the encoding (with ƒ={ƒ1, . . . , ƒK}) and decoding (with ƒk−1) procedure. With the flow-based model as both encoder and decoder for the second layer of latent variable, the model's degree of freedom may be curbed and interactions between different segments or components may be captured. With $\hat{z}_0$=[z0(1), z0(2), . . . , z0(K)], z0 encodes $\hat{z}_0$ by aggregating outputs of all invertible functions ƒk, i.e.,
With
4. ELBO of the Bi-Level Latent Model Embodiments
In one or more embodiments, to derive the ELBO, a bi-level variational autoencoder (VAE) with simplified notations is used as a start. Afterwards, derivations to the model are extended. For the kth component, the latent variable of its first layer has two sections, zk and z0(k). Only z0(k) connects to layer 2, z0. Therefore the ELBO has two components regarding these two different latent parts. In one or more embodiments, with (x, mk) as the kth component's input for the encoder, (zk, z0(k)) as the first layer latent variable, z0 as the second layer variable, z0(k) and (
Theorem 2. Let ℒk(x, mk; a, e, d, ƒ) be the ELBO regarding component k in the bi-level segmentation VAE model, then:
The derivation of the ELBO is given in subsection a) Derivation of the ELBO below. Here a, e, d, ƒk are the attention, encoder, decoder, and flow function for component k, respectively. In one or more embodiments, the reconstruction term regarding x and mk in the above ELBO (20) may be given as:
Φk=q
In one or more embodiments, the regularization terms for the first layer's latent variable are given by
$$\Psi_k = -\mathrm{KL}\big(q_e(z_k \mid x, m_k)\,\|\,p(z_k)\big) + H\big(z_0^{(k)} \mid x, m_k\big) \quad (22)$$
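The regularization term Ψk may be evaluated in closed form when the latent variables are Gaussian. The sketch below makes assumptions that go beyond the text: the posteriors are diagonal Gaussians parameterized by a mean and a log-variance, the prior p(zk) is a standard Normal, and closed-form expressions are used in place of sampled estimates. All function names are hypothetical.

```python
import math
from typing import Sequence


def gaussian_kl_to_standard_normal(mu: Sequence[float],
                                   logvar: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), assuming a standard Normal prior."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))


def gaussian_entropy(logvar: Sequence[float]) -> float:
    """Differential entropy of a diagonal Gaussian: 0.5 * sum(log(2*pi*e) + logvar)."""
    return 0.5 * sum(math.log(2 * math.pi * math.e) + lv for lv in logvar)


def psi_k(mu_local: Sequence[float], logvar_local: Sequence[float],
          logvar_global: Sequence[float]) -> float:
    """Psi_k = -KL(q_e(z_k|x,m_k) || p(z_k)) + H(z_0^(k) | x, m_k)."""
    return (-gaussian_kl_to_standard_normal(mu_local, logvar_local)
            + gaussian_entropy(logvar_global))
```

The KL term acts only on the local section zk, while the entropy term acts only on the global section z0(k), whose prior is supplied by the second layer.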
In one or more embodiments, all the latent variables are assumed to follow Gaussian distributions. Both the KL and entropy terms may be computed with reparameterization. Improved disentanglement may be achieved with total correlation (TC) for component local representation regarding the KL term in equation (22). In one or more embodiments, the objective function across all components for maximization may be given by:
a) Derivation of the ELBO
In one or more embodiments of the present disclosure, a bi-level VAE enhanced with a recurrent attention mechanism is disclosed. The ELBO of the model may be optimized. As shown in
Proof of Theorem 2: in the bi-level auto-encoder, (x, mk) is the first layer's input, and (zk, z0(k)) is the first layer's latent variable. Meanwhile, z0(k) is also the second layer's input, and z0 is the second layer's latent variable. (
Derivation of the ELBO starts with a bi-level VAE with simplified notations. The derivation is extended to embodiments of the model. In one or more embodiments, zl, l∈{1, 2} is used to represent the latent variable in layer l. Let z={z1, z2}, the following equation may be obtained:
In one or more embodiments, the second term in equation above may be extended as follows.
Accordingly, the ELBO may be written as:
log p(x)≥q(z
In one or more embodiments, for the kth component, the latent variable of its first layer has two sections, zk and z0(k). Only z0(k) connects to layer 2 (z0). Therefore the ELBO has two components regarding these two different latent parts,
log p(x)≥q
Here qe is the posterior distribution for the first layer latent variable parameterized by the encoder e. pd is the distribution for x and mk parameterized with the decoder d. ƒk is the flow-based model for the kth component, and ƒ={ƒ1, . . . , ƒK}. The conditional distribution qƒ (z0|z0(k)) captures the relationship between z0(k) and the other z0(j), j≠k. In one or more embodiments, all the latent variables are assumed to follow Gaussian distributions. In one or more embodiments, the variance of the posterior qƒ (z0| z0(1)z0(2) . . . z0(K)) is set to a fixed value of 1.
5. Inference of the Global Latent Variable Embodiments
In one or more embodiments, for component k, the terms in the ELBO of equation (20) regarding the second layer of latent variable z0 may be given by:
ƒ
q
(z
|z
)[log pƒ
It may be seen that the computation of ƒ
Here Ck is a constant value. The KL term may be calculated as disclosed with respect to equation (2).
6. Causal Direction Embodiments
Section 5 above discloses using a learned aggregation prior model to infer relation between two components. By maximizing the objective (ELBO) of the auto-encoder in the second layer, equation (28) for the objective ƒ
In one or more embodiments, assuming two data components are a pair of cause and effect variables, the causal relation may be detected by extending a cause-effect inference approach. Given a causal-effect pair {c, e} defined by eα=ϕ(c)+αn, wherein α is a positive real number and n follows a noise distribution, the expected variance of the prediction error may be used to reveal the causal direction based on Theorem 3. In one or more embodiments, the ratio of the expected variances of the prediction errors is used as the score to reveal the causal direction of a pair of variables, in accordance with Theorem 3.
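By way of illustration only, the ratio score described above may be sketched on raw scalar samples. The sketch below is an assumption-laden stand-in: it replaces the learned aggregation model with a crude histogram estimate of the expected conditional variance, rescales both variables to [0, 1], and uses hypothetical names (`minmax`, `expected_conditional_variance`, `direction_score`) that do not appear in the disclosure.

```python
import random
from statistics import pvariance


def minmax(values):
    """Rescale values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def expected_conditional_variance(target, given, n_bins=20):
    """Histogram estimate of E[Var[target | given]], with `given` in [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for t, g in zip(target, given):
        bins[min(int(g * n_bins), n_bins - 1)].append(t)
    n = len(target)
    # Weight each bin's variance by the empirical probability of that bin;
    # bins with fewer than two samples contribute zero variance and are skipped.
    return sum(len(b) / n * pvariance(b) for b in bins if len(b) > 1)


def direction_score(x, y, n_bins=20):
    """Ratio of prediction-error variances; below 1 suggests x causes y."""
    x, y = minmax(x), minmax(y)
    forward = expected_conditional_variance(y, x, n_bins)   # error predicting y from x
    backward = expected_conditional_variance(x, y, n_bins)  # error predicting x from y
    return forward / backward


# Illustrative data (made up): e = phi(c) + alpha * n with phi(c) = c**3.
rng = random.Random(0)
cause = [rng.random() for _ in range(2000)]
effect = [c ** 3 + 0.02 * rng.gauss(0.0, 1.0) for c in cause]
score = direction_score(cause, effect)
```

A score below 1 is read here as evidence for the direction from the first argument to the second; a threshold other than 1 may be applied in practice.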
With a well-trained flow-based model shown in
Proof of Theorem 3: For two random variables Z and G, the conditional variance of Z given g is defined by Var[Z|g] := 𝔼[(Z − 𝔼[Z|g])² | g]. Var[Z|G] is the random variable attaining Var[Z|g] when G attains g. Its expectation is given by:
𝔼[Var[Z|G]] := ∫ Var[Z|g] pG(g) dg. (29)
For an invertible function h, equations Var[h(G)|g] = 0 and Var[Z|h(g)] = Var[Z|g] may be obtained.
In one or more embodiments, it may be observed that
𝔼[Var[eα|c]] = 𝔼[Var[ϕ(c)+αn|c]] = α²𝔼[Var[n|c]] = α² (30)
Accordingly, one may have
In the latter step of the above equation, αn+ and αn− vanishes in the limit due to eVar [ϕ−1(e−αn)|e]pe
With Taylor's theorem, one may get:
Here Ē is a value in [e−αn, e]. Furthermore, one may have
Thus, one may have
The last term in equation (34) may be rewritten as:
Here the inequality above is based on the Cauchy-Schwarz inequality. If ϕ is linear, the last term of the above formula becomes 1, as ϕ′=1. Alternatively, a statement may be made about equation (35), thereby completing the proof of Theorem 3.
In one or more embodiments, the causal interaction between a first component (e.g., component i) and a second component (e.g., component j) of an input (e.g., an image) may be inferred using z0(i) and z0(j). The reconstructed global latent variable value (
7. Embodiments of Disentanglement with TC
In one or more embodiments, for each component, the KL term for local latent variable zk in equation (3) in section C.2 may be rewritten as:
Here zki is the ith entry of zk. In one or more embodiments, the total correlation (TC) is penalized to enforce disentanglement of the local latent factors. In one or more embodiments, to compute the second term, a weighted approach may be used for estimating the distribution value q (z).
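By way of illustration only, one commonly used weighted estimator for q(z) is minibatch weighted sampling, in which the aggregate posterior is approximated from the per-sample Gaussian posteriors in a batch. Whether the disclosed embodiments use exactly this estimator is an assumption; the function names and the `dataset_size` parameter below are hypothetical.

```python
import math


def log_normal(z, mu, log_var):
    """Log density of a univariate Gaussian N(mu, exp(log_var)) at z."""
    return -0.5 * (math.log(2.0 * math.pi) + log_var + (z - mu) ** 2 / math.exp(log_var))


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def total_correlation_mws(z, mu, log_var, dataset_size):
    """Estimate TC = E[log q(z) - sum_d log q(z_d)] from a batch.

    z, mu, log_var are lists of M rows with D entries each; row i of z is a
    sample from the posterior with parameters mu[i], log_var[i].
    """
    batch, dims = len(z), len(z[0])
    log_norm = math.log(dataset_size * batch)
    tc = 0.0
    for i in range(batch):
        # per_dim[j][d] = log q(z_i[d] | x_j)
        per_dim = [
            [log_normal(z[i][d], mu[j][d], log_var[j][d]) for d in range(dims)]
            for j in range(batch)
        ]
        log_q_joint = logsumexp([sum(row) for row in per_dim]) - log_norm
        log_q_marginals = sum(
            logsumexp([per_dim[j][d] for j in range(batch)]) - log_norm
            for d in range(dims)
        )
        tc += log_q_joint - log_q_marginals
    return tc / batch
```

With a single sample and a dataset of size one, the joint and the product of marginals coincide, so the estimate is zero; this provides a quick sanity check of the implementation.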
Embodiments of the present disclosure were evaluated with both synthetic data and real-world data. The synthetic data were simulated in a multi-object setting. With this dataset, it was demonstrated that embodiments of the present disclosure may outperform other methods when there are correlations between objects. Those embodiments were further validated using real-world data. In one or more experiments, a causality dataset was also used to evaluate the model's component interaction discovery.
It shall be noted that these experiments and results are provided by way of illustration and were performed under specific conditions using a specific embodiment or embodiments; accordingly, neither these experiments nor their results shall be used to limit the scope of the disclosure of the current patent document.
1. Performance Metric
Some of the experiments were primarily focused on disentanglement, segmentation, and component interaction and performance comparison with MONet.
Disentanglement. Disentanglement evaluation metrics have been proposed previously. Some defined a metric based on the accuracy that a low VC-dimension linear classifier can achieve at identifying a fixed ground truth factor. A drawback of this method is the lack of axis-alignment detection. Others proposed to use the mutual information gap (MIG) between latent variables and the ground truth factors to measure disentanglement. In one or more embodiments of the present disclosure, a regression-based approach was utilized for various experiments. The regression-based approach divides the latent-space data into training, evaluation, and testing sets. The disentanglement score is obtained based on the performance of the learned regression model.
Segmentation. In one or more experiments, an adjusted Rand index (ARI) was employed to evaluate the segmentation. The ground truth mask and predicted mask were converted to binary values, and the similarity of a pair of masks was based on the number of matching entries. In one or more experiments, the ARI score may be computed with the pair-wise similarity matrix.
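By way of illustration only, the adjusted Rand index may be computed from the contingency counts of two labelings. The sketch below uses the standard contingency-table formulation over flattened per-pixel labels, which is an assumption about how the masks are compared; the function name `adjusted_rand_index` is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat label sequences of equal length."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g., both labelings form a single cluster
    return (sum_pairs - expected) / (max_index - expected)


# Illustrative usage: per-pixel component labels of a flattened 2x2 mask (made up).
ground_truth = [0, 0, 1, 1]
predicted = [1, 1, 0, 0]  # same partition with permuted component indices
ari = adjusted_rand_index(ground_truth, predicted)
```

Because the index depends only on the induced partition, a prediction that recovers the same segments under different component indices still receives the maximum score.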
Component Interaction. Causality may refer to the relation between two events, one (effect) preceded by the other (cause). In one or more experiments, the approach disclosed in Section C.5 was applied for correlation and causality component interaction discovery. The baseline for correlation in the experiments used MONet for latent representation learning and the Hilbert-Schmidt Independence Criterion (HSIC) for independence testing. For a fair comparison, embodiments of the disclosed model (also referred to as “CONet” hereinafter) and the baseline model (MONet) used the same network structures for the encoder, decoder, and attention network. Details about the network structure embodiments may be found in Section E. Comparisons were also made against existing methods on benchmark data. In the supplemental file, more results on additional datasets were presented.
2. Simulated Multi-Object Dataset
In this subsection, embodiments of the present disclosure were evaluated using a simulated 2-object dataset comprising images generated with three types of objects: green squares, red circles, and blue diamonds. Multiple samples were used for training, and multiple samples were used for testing. Some exemplary sample images are shown in the first row in
In the first set of experiments, images were generated to contain two objects. Only object pairs {circle, circle}, {circle, square}, {square, square}, and {square, diamond} appear in the same image. Circles and diamonds do not appear in the same image. The γ is set to 0.5 for both models, and the β values are tuned based on disentanglement and segmentation scores for both MONet and the disclosed model.
More Results on Simulation Data Sets:
As shown in 2-object dataset plot in
3. Evaluation Dataset
In one or more experiments, embodiments of the present disclosure were evaluated using an evaluation dataset. Each image in the evaluation dataset may comprise one or more shapes with a background. In one or more experiments, all the available features for disentanglement testing were used for evaluation. The features may include positions (x and y), shape, color (e.g., RGB values), orientation, scale, and visibility (a binary feature indicating whether an object is not null). The disentanglement score may be computed with LASSO as the regressor and α=0.2. In one or more experiments, γ=0.5 for both models, and β is tuned for both methods. The disentanglement and segmentation performance are given in Table 2 after 20 epochs with a learning rate of 10⁻⁴ for both models. Table 2 shows that embodiments of the present disclosure may achieve superior disentanglement and segmentation scores. More results on the segmentation score (ARI) for both methods are presented in the supplemental file. It may be seen that embodiments of the present disclosure may consistently improve the segmentation as the updating steps proceed. Embodiments of the present disclosure may produce visually more reasonable object segmentation.
Additional Results on Evaluation Dataset:
4. Polyomino Dataset
In one or more experiments, multiple images from a polyomino dataset were used. Each polyomino image may comprise several polyominoes, e.g., tetrominoes, sampled from multiple different shapes or orientations. Four components were used for both MONet and embodiments of the present disclosure. Multiple images were randomly selected to evaluate disentanglement and segmentation scores, and multiple images were used to train both models. The experimental procedure follows that of the previous two datasets. Table 3 gives the disentanglement and segmentation scores for both methods after 1 epoch (α=0.1 for the disentanglement score). It may be seen that embodiments of the present disclosure may improve both segmentation and disentanglement.
5. Results on Component Interaction Detection
In this set of experiments, the causal-effect benchmark dataset was prepared with a causal discovery toolbox. Unlike existing methods for causal discovery, in one or more embodiments, each variable in a causal effect pair is tiled into a component in an image and latent representation is used to learn the causality relations. As each component only contains one variable, the model's variable segmentation capability may be tested with the composed images. In one or more experiments, simple causal-effect pairs are considered and the causal directions were ignored. A few pairs of variables with relations are tiled as component pixels in order to form images. Then, the composed images are passed into our network. Similar to previous experiments, the outputs include reconstructed components and component masks. Likewise, m-scores based on ARI are reported to show whether each reconstructed component has been discovered and segmented correctly. In addition, one or more experiments were also done to evaluate whether component pairs that have causal relations may be discovered as well.
In one or more experimental settings, for MONet, the Hilbert-Schmidt Independence Criterion (HSIC), a kernel-based nonparametric independence test, is employed to score the relations between a pair of components. In one or more embodiments of the disclosed model (CONet), the conditional probability of latent variables in section C.4 was used for relationship learning. As shown in Table 4, embodiments of CONet (“Proposed”) outperform MONet in disentanglement score by 1.0%. Remarkably, embodiments of CONet may find many more correct relation pairs than MONet. Such a result shows that embodiments of CONet have a stronger capacity for causality discovery in addition to component segmentation. More results on component interaction detection are available in section D.7.
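By way of illustration only, the HSIC statistic used by the baseline may be sketched as the normalized sum of element-wise products of two centered kernel matrices. The sketch below assumes scalar inputs, Gaussian kernels with hand-picked bandwidths, and a 1/n² normalization (conventions vary); the function names are hypothetical, and in the experiments the inputs would be latent representations rather than scalars.

```python
import math


def gaussian_gram(values, sigma):
    """Gram matrix of a Gaussian kernel over scalar samples."""
    return [
        [math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
        for a in values
    ]


def center(gram):
    """Double-center a Gram matrix, i.e., compute H K H with H = I - 11^T / n."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [
        [gram[i][j] - row_means[i] - col_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(x, y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC between two equally long scalar samples."""
    n = len(x)
    kc = center(gaussian_gram(x, sigma_x))
    lc = center(gaussian_gram(y, sigma_y))
    return sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n)) / n ** 2


# Illustrative usage (values are made up).
samples = [0.0, 1.0, 2.0, 3.0]
dependent_score = hsic(samples, samples)         # a variable compared with itself
constant_score = hsic(samples, [5.0] * 4)        # a constant carries no dependence
```

Larger values indicate stronger statistical dependence; a significance threshold would typically be obtained from a permutation test or an asymptotic approximation, which this sketch omits.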
6. Results on Facial Image Test
Multiple facial images were randomly picked for training and testing. Those facial images comprise multiple attributes, including gender, hair color, with glasses or not, etc. In one or more experiments, at least some attributes were used to assess the disentanglement for both MONet and an embodiment of the present disclosure. Table 5 gives the disentanglement scores for both models (α=0.3). The plots in
7. Results on Causal Direction Detection
The setup for this set of experiments may be similar to that of section D.5. In one or more experiments, the causal-effect benchmark data set ‘tuebingen’ prepared with a causal discovery toolbox was employed. Four pairs of causal-effect variables were taken, and each variable in a causal-effect pair was tiled into a component in an image. Latent representation was used to learn causality relations. As each component only contains one variable, the component's latent z0(k) was used for causal detection.
Embodiments of the present disclosure were compared to a MONet+RECI method. For the MONet+RECI method, the latent representation for each component was learned with MONet, and then the variance of the regression error between latent representations was used to determine causal directions. In this baseline method, the implementation of RECI in the causal discovery toolbox was used. For embodiments using CONet, the scores defined in Theorem 3 were computed according to the prediction errors of the aggregation model, and then the scores were used to determine the causal directions.
To calculate the accuracy of causal detection, the ground truth label for each component was obtained from the models by comparing its mask with the ground truth masks in the data synthesis stage. A list of component pairs with causality scores was obtained after thresholding with a value γ. Afterwards, the percentage of correct causal direction pairs was calculated. Table 6 gives the accuracy of correct causal pairs with different threshold values. It may be seen that embodiments of the disclosed framework learn better representations for causal detection across the reported threshold values.
This section discloses some network structure embodiments for the encoder and decoder, which are shown in Table 7 and Table 8, respectively. The attention network employs one U-net with 5 blocks. The decoder is a spatial broadcast decoder to encourage the VAE to learn spatial features.
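By way of illustration only, the input construction of a spatial broadcast decoder may be sketched as tiling the latent vector over every pixel location and appending fixed coordinate channels. The sketch below shows only this broadcast step, not the convolutional layers of Table 8; the function name `spatial_broadcast` and the coordinate range of [-1, 1] are assumptions.

```python
def spatial_broadcast(z, height, width):
    """Tile latent vector z over a height x width grid and append (x, y) coordinates.

    Returns a nested list of shape height x width x (len(z) + 2).
    """
    def axis(n):
        # Evenly spaced coordinates in [-1, 1]; a single location maps to 0.
        return [0.0] if n == 1 else [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

    ys, xs = axis(height), axis(width)
    return [[list(z) + [x, y] for x in xs] for y in ys]


# Illustrative usage: a 2-dimensional latent broadcast to a 3x4 grid (values made up).
grid = spatial_broadcast([0.5, -0.5], 3, 4)
```

Because every location receives the same latent values and differs only in its coordinate channels, the subsequent layers are encouraged to express position through the latent code rather than through learned upsampling.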
In the present disclosure, embodiments of a novel bi-level framework are disclosed to learn disentangled structured latent factors. In one or more embodiments, the flow-based structure prior of the latent representation enables the model to learn interactions among components via a message-passing scheme. The framework improves upon existing scene segmentation methods regarding both disentanglement and segmentation. The experiments show that the framework embodiments may capture the inner interactions between data components.
One skilled in the art shall understand that the present disclosure may be applicable to various scenarios, e.g., physical interaction extraction. Physical interaction between objects is an important form of common sense or prior knowledge for humans to make actionable decisions. Objects placed within static scenes commonly adhere to certain relations, such as pen and paper, book and bookshelf, cup and desk, etc. Another useful application is to integrate embodiments of the present disclosure with reinforcement learning. With the learned relationships between objects, the search space for an agent may be significantly reduced, allowing the agent to make reasonable decisions more efficiently.
In one or more embodiments of the present disclosure, data segmentation and representation learning are integrated by developing a bi-level VAE framework. With the inference method, the bi-level VAE framework may learn more meaningful structural representations of the data. Besides the data sets presented in the experiments, the framework may be applied to other types of data. Embodiments of the present disclosure may potentially enlarge the application of unsupervised learning and self-supervised learning to broader scenarios, such as information extraction, knowledge discovery, etc.
In one or more embodiments, aspects of the present patent document may be directed to, may include, or may be implemented on one or more information handling systems (or computing systems). An information handling system/computing system may include any instrumentality or aggregate of instrumentalities operable to compute, calculate, determine, classify, process, transmit, receive, retrieve, originate, route, switch, store, display, communicate, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data. For example, a computing system may be or may include a personal computer (e.g., laptop), tablet computer, mobile device (e.g., personal digital assistant (PDA), smart phone, phablet, tablet, etc.), smart watch, server (e.g., blade server or rack server), a network storage device, camera, or any other suitable device and may vary in size, shape, performance, functionality, and price. The computing system may include random access memory (RAM), one or more processing resources such as a central processing unit (CPU) or hardware or software control logic, read only memory (ROM), and/or other types of memory. Additional components of the computing system may include one or more disk drives, one or more network ports for communicating with external devices as well as various input and output (I/O) devices, such as a keyboard, mouse, stylus, touchscreen and/or video display. The computing system may also include one or more buses operable to transmit communications between the various hardware components.
As illustrated in
A number of controllers and peripheral devices may also be provided, as shown in
In the illustrated system, all major system components may connect to a bus 1216, which may represent more than one physical bus. However, various system components may or may not be in physical proximity to one another. For example, input data and/or output data may be remotely transmitted from one physical location to another. In addition, programs that implement various aspects of the disclosure may be accessed from a remote location (e.g., a server) over a network. Such data and/or programs may be conveyed through any of a variety of machine-readable medium including, for example: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, other non-volatile memory (NVM) devices (such as 3D XPoint-based devices), and ROM and RAM devices.
Aspects of the present disclosure may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media shall include volatile and/or non-volatile memory. It shall be noted that alternative implementations are possible, including a hardware implementation or a software/hardware implementation. Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like. Accordingly, the “means” terms in any claims are intended to cover both software and hardware implementations. Similarly, the term “computer-readable medium or media” as used herein includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof. With these implementation alternatives in mind, it is to be understood that the figures and accompanying description provide the functional information one skilled in the art would require to write program code (i.e., software) and/or to fabricate circuits (i.e., hardware) to perform the processing required.
It shall be noted that embodiments of the present disclosure may further relate to computer products with a non-transitory, tangible computer-readable medium that have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present disclosure, or they may be of the kind known or available to those having skill in the relevant arts. Examples of tangible computer-readable media include, for example: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, other non-volatile memory (NVM) devices (such as 3D XPoint-based devices), and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher level code that are executed by a computer using an interpreter. Embodiments of the present disclosure may be implemented in whole or in part as machine-executable instructions that may be in program modules that are executed by a processing device. Examples of program modules include libraries, programs, routines, objects, components, and data structures. In distributed computing environments, program modules may be physically located in settings that are local, remote, or both.
One skilled in the art will recognize no computing system or programming language is critical to the practice of the present disclosure. One skilled in the art will also recognize that a number of the elements described above may be physically and/or functionally separated into modules and/or sub-modules or combined together.
It will be appreciated to those skilled in the art that the preceding examples and embodiments are exemplary and not limiting to the scope of the present disclosure. It is intended that all permutations, enhancements, equivalents, combinations, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present disclosure. It shall also be noted that elements of any claims may be arranged differently including having multiple dependencies, configurations, and combinations.