Aspects of the present disclosure relate generally to artificial intelligence, and more particularly, to a method and a network for visual reasoning.
Artificial Intelligence (AI) is widely deployed in a variety of areas such as image classification, object detection, scene understanding, machine translation and the like. There is increasing interest in visual reasoning, driven by the growth of applications such as visual question answering (VQA), embodied question answering, visual navigation, autonomous driving and the like, where AI models may generally be required to perform high-level cognition over low-level perception results, for example, high-level abstract reasoning upon simple visual concepts such as lines, shapes and the like.
Deep neural networks have been widely applied in visual reasoning, where they may be trained to model the correlation between task input and output, and have achieved success in various visual reasoning tasks with deep and rich representation learning, particularly in perception tasks. Additionally, modularized networks have drawn increasing attention for visual reasoning in recent years. Such networks may unify deep learning and symbolic reasoning, focusing on building neural-symbolic models that aim to combine the best of representation learning and symbolic reasoning. The main idea is to manually design neural modules, each of which represents a primitive step in the reasoning process, and to solve reasoning problems by assembling those modules into symbolic networks corresponding to the respective problems.
With this modularized, neural-symbolic methodology, a conventional visual question answering (VQA) problem, where questions are generally in the form of text, may generally be solved properly. In addition to VQA, abstract visual reasoning has recently been proposed, which extracts abstract concepts or questions directly from a visual input, such as an image, without a natural language question, and conducts reasoning processes accordingly. As reasoning about abstract concepts has been a long-standing challenge in machine learning, the current visual reasoning methods or AI models described above may show unsatisfactory performance on such abstract visual reasoning tasks.
It may be desirable to provide improved methods or AI models for processing abstract visual reasoning tasks.
The following presents a simplified summary of one or more aspects according to the present invention in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
In an aspect of the present invention, a method for visual reasoning is provided. According to an example embodiment of the present invention, the method comprises: providing a network with sets of inputs and sets of outputs, wherein each set of inputs of the sets of inputs maps to one of a set of outputs corresponding to the set of inputs based on visual information on the set of inputs, and wherein the network comprises a Probabilistic Generative Model (PGM) and a set of modules; determining a posterior distribution over combinations of one or more modules of the set of modules through the PGM, based on the provided sets of inputs and sets of outputs; and applying domain knowledge as one or more posterior regularization constraints on the determined posterior distribution.
In another aspect of the present invention, a method for visual reasoning with a network comprising a Probabilistic Generative Model (PGM) and a set of modules is provided. According to an example embodiment of the present invention, the method comprises: providing the network with a set of input images and a set of candidate images; generating a combination of one or more modules of the set of modules based on the set of input images and a posterior distribution over combinations of one or more modules of the set of modules, wherein the posterior distribution is formulated by the PGM trained with domain knowledge applied as one or more posterior regularization constraints; processing the set of input images and the set of candidate images through the generated combination of one or more modules; and selecting a candidate image from the set of candidate images based on a score of each candidate image in the set of candidate images estimated by the processing.
In another aspect of the present invention, a network for visual reasoning is provided. According to an example embodiment of the present invention, the network comprises: a set of modules, wherein each module of the set of modules is implemented as a neural network and has at least one trainable parameter for focusing that module on one or more variable image properties; and a Probabilistic Generative Model (PGM) coupled to the set of modules, wherein the PGM is configured to output a posterior distribution over combinations of one or more modules of the set of modules.
In another aspect of the present invention, an apparatus for visual reasoning comprises a memory; and at least one processor coupled to the memory. According to an example embodiment of the present invention, the at least one processor is configured to provide a network with sets of inputs and sets of outputs, wherein each set of inputs of the sets of inputs maps to one of a set of outputs corresponding to the set of inputs based on visual information on the set of inputs, and wherein the network comprises a Probabilistic Generative Model (PGM) and a set of modules; determine a posterior distribution over combinations of one or more modules of the set of modules through the PGM, based on the provided sets of inputs and sets of outputs; and apply domain knowledge as one or more posterior regularization constraints on the determined posterior distribution.
In another aspect of the present invention, a computer program product for visual reasoning is provided. According to an example embodiment of the present invention, the computer program product comprises processor-executable computer code for providing a network with sets of inputs and sets of outputs, wherein each set of inputs of the sets of inputs maps to one of a set of outputs corresponding to the set of inputs based on visual information on the set of inputs, and wherein the network comprises a Probabilistic Generative Model (PGM) and a set of modules; determining a posterior distribution over combinations of one or more modules of the set of modules through the PGM, based on the provided sets of inputs and sets of outputs; and applying domain knowledge as one or more posterior regularization constraints on the determined posterior distribution.
In another aspect of the present invention, a computer readable medium stores computer code for visual reasoning. According to an example embodiment of the present invention, the computer code, when executed by a processor, causes the processor to provide a network with sets of inputs and sets of outputs, wherein each set of inputs of the sets of inputs maps to one of a set of outputs corresponding to the set of inputs based on visual information on the set of inputs, and wherein the network comprises a Probabilistic Generative Model (PGM) and a set of modules; determine a posterior distribution over combinations of one or more modules of the set of modules through the PGM, based on the provided sets of inputs and sets of outputs; and apply domain knowledge as one or more posterior regularization constraints on the determined posterior distribution.
With the guidance of the domain knowledge, the generated modularized networks may provide structures that precisely represent a human-interpretable reasoning process, which may lead to improved performance.
Other aspects or variations of the present invention, as well as other advantages thereof will become apparent by consideration of the following detailed description and the figures.
The disclosed aspects of the present invention will hereinafter be described in connection with the figures that are provided to illustrate and not to limit the disclosed aspects.
The present invention will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present invention, rather than suggesting any limitations on the scope of the present invention.
Moving one step ahead of traditional computer vision tasks such as image classification and object detection, visual reasoning requires not only a comprehensive understanding of the visual content, but also the capability of reasoning about the extracted concepts to draw a conclusion.
The present disclosure proposes a method for performing an abstract visual reasoning task with a probabilistic neural-symbolic model that is regularized with domain knowledge. A neural-symbolic model may provide a powerful tool for combining symbolic program execution for reasoning with deep representation learning for visual recognition. For instance, a neural-symbolic model may compose, for each set of inputs, a particular modularized network comprising one or more modules selected from a set of modules, such as an inventory of reusable modules. A probabilistic formulation that trains models with stochastic latent variables may yield an interpretable and legible reasoning system with less supervision. Domain knowledge may provide guidance in generating a reasonable modularized network, as the generation generally involves an optimization problem over a mixture of continuous and discrete variables. With the guidance of the domain knowledge, the generated modularized networks may provide structures that precisely represent a human-interpretable reasoning process, which may lead to improved performance.
For instance, the PGM 210 may comprise a variational auto-encoder (VAE), where an encoder of the VAE may formulate a variational posterior distribution of structures of modularized networks, and a decoder of the VAE may formulate a generative distribution. The variational posterior distribution formulated by the encoder may be an estimated posterior distribution of structures of modularized networks based on the observed dataset. The generative distribution formulated by the decoder may be used for reconstruction (e.g., via route 4 described below with reference to the network 800).
For instance, the set of modules 220 may comprise one or more pre-designed neural modules, each representing a primitive step in a reasoning process. For instance, each module of the set of modules 220 may be implemented as a multi-layer neural network with one or more trainable parameters. In an aspect of the present disclosure, the modules of the set of modules 220 may be dynamically assembled with each other to form a particular modularized network, which may be used to map a given set of inputs to the correct output. In an aspect of the present disclosure, the PGM 210 may be used to generate modularized networks with structures corresponding to individual sets of inputs, to predict the respective underlying rules within individual sets of inputs.
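As a non-limiting sketch of this idea, the following code (assuming PyTorch; the names PrimitiveModule, inventory and assemble are hypothetical, and a simple chain-structured assembly is assumed in place of a general structure G=(v,A)) illustrates how reusable neural modules may be dynamically assembled:

```python
import torch.nn as nn

class PrimitiveModule(nn.Module):
    """One reusable neural module: a small multi-layer network that
    represents a primitive step in a reasoning process."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

# Hypothetical inventory of pre-designed modules keyed by primitive step.
inventory = nn.ModuleDict({name: PrimitiveModule(64) for name in ["AND", "OR", "PROG", "ID"]})

def assemble(structure, features):
    """Execute a modularized network for one set of inputs; the structure is
    simplified here to an ordered list of module names (a chain), whereas the
    disclosure allows general structures G = (v, A)."""
    h = features
    for name in structure:
        h = inventory[name](h)
    return h
```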
As examples, various structures of modularized networks may be assembled from one or more modules of the set of modules 220, with different structures corresponding to different sets of inputs. In some aspects of the present disclosure, the modularized networks with the respective illustrated structures may be used to map sets of inputs to their corresponding outputs.
It will be appreciated by those skilled in the art that other structures may be possible, and that other representations for at least part of the set of modules 220 may also be possible.
In block 410, sets of inputs and sets of outputs may be provided to a network 200 or 700, wherein each set of inputs of the sets of inputs may map to one of a set of outputs corresponding to the set of inputs based on visual information on the set of inputs. For instance, the sets of inputs and sets of outputs may comprise a training dataset, such as the Procedurally Generated Matrices (PGM) dataset, the Relational and Analogical Visual rEasoNing (RAVEN) dataset, or the like. The network 200, 700 may comprise a Probabilistic Generative Model (PGM) 210, 710 and a set of modules 220, 720.
In block 420, a posterior distribution related to the set of modules 220, 720 may be determined through the PGM 210, 710 based on the provided sets of inputs and sets of outputs. In an aspect of the present disclosure, a posterior distribution over combinations of one or more modules of the set of modules 220, 720 may be determined through the PGM 210, 710, based on the provided sets of inputs and sets of outputs. In an example, the combinations of one or more modules of the set of modules 220, 720 may comprise modularized networks assembled from one or more modules of the set of modules 220, 720, wherein the modularized networks may have structures that may be represented as G=(v,A). In another example, the combinations of one or more modules of the set of modules 220 may comprise any permutations of one or more modules among the set of modules 220. For instance, the PGM 210 may comprise a VAE. An estimated posterior distribution over structures of modularized networks may be formulated through an encoder of the VAE based on the observed dataset.
In block 430, domain knowledge may be applied to the determined posterior distribution over the set of modules 220 as one or more posterior regularization constraints. For instance, a regularized Bayesian framework (RegBayes) may be used to incorporate human domain knowledge into Bayesian methods by directly applying constraints on the posterior distribution. The flexibility of RegBayes may allow domain knowledge to be considered explicitly, by incorporating it into any Bayesian model as soft constraints.
With the guidance of the domain knowledge, the method 400 may be utilized to generate precise and interpretable structures for different sets of inputs, as the generated structures may capture hidden rules among the sets of inputs.
It will be appreciated by those skilled in the art that other probabilistic generative models may be possible, and that other distributions related to the set of modules 220 may also be possible.
In one aspect of the present disclosure, the one or more posterior regularization constraints may comprise one or more First-Order Logic (FOL) constraints that carry domain knowledge. For instance, a constraint function may consist of first-order logic computations over a structure and a set of inputs. Specifically, each constraint function takes a structure and a set of inputs as input, and computes the designed first-order logic expression as output. The output of the constraint function may take a value in the range [0, 1], indicating the degree to which the structure and the set of inputs satisfy a specific demand, where a lower value indicates stronger correspondence. Therefore, by minimizing the values of such constraint functions during the optimization of the posterior distribution over structures, the network 200 may learn to generate structures that correspond with the applied domain knowledge.
In another aspect of the present disclosure, it may be beneficial to consider inner correlations among constraints. Constraints concerning different aspects of domain knowledge may be independent of each other. On the other hand, constraints applied to different nodes of a structure but sharing the same aspect of domain knowledge may be correlated with each other. Accordingly, the constraints sharing the same aspect of domain knowledge may be grouped into a group of constraints. For instance, a total of L groups of constraints may be proposed, where each group may correspond to a certain reasoning type, including Boolean logical reasoning, temporal reasoning, spatial reasoning, arithmetical reasoning, and the like.
In another aspect of the present disclosure, the one or more FOL constraints may be generated based on one or more properties of each of the sets of inputs. For instance, in a Procedurally Generated Matrices (PGM) dataset, each pair of a set of inputs and the corresponding set of outputs may have one or more rules, where each rule may be represented as a triple [r, o, a] of a relation type r, an object type o and an attribute type a, sampled from respective primitive sets, such as relation types (e.g., progression, XOR, OR, AND), object types (e.g., shape, line) and attribute types (e.g., size, type, colour, position, number).
These triples may determine the abstract reasoning rules exhibited by a particular set of inputs and the corresponding correct output. For instance, if the rules contain the triple [progression, shape, colour], the set of inputs and the corresponding correct output may exhibit a progression relation instantiated on the colour (e.g., greyscale intensity) of shapes. Each attribute type a (e.g., colour) may take one of a finite number of discrete values z ∈ Z (e.g., 10 integers in [0, 255] denoting greyscale intensity). Therefore, a given rule may have a plurality of realizations depending on the values of the attribute types, but all of these realizations share the same underlying abstract rule. The choice of r may constrain the values of z that may be realized. For instance, if r is progression, the values of z may increase along rows or columns in the matrix of input image panels, and may vary across realizations under this rule.
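For illustration only, such rule triples may be encoded as plain tuples; the primitive sets below follow the published PGM dataset convention and are assumptions to the extent the present disclosure does not enumerate them:

```python
# Hypothetical encoding of rule triples [r, o, a] and the primitive sets
# they are sampled from (following the published PGM dataset convention).
RELATION_TYPES = {"progression", "XOR", "OR", "AND", "consistent_union"}
OBJECT_TYPES = {"shape", "line"}
ATTRIBUTE_TYPES = {"size", "type", "colour", "position", "number"}

# A rule set containing a single triple [progression, shape, colour]:
# greyscale intensity of shapes progresses along rows or columns.
rules = [("progression", "shape", "colour")]

for r, o, a in rules:
    assert r in RELATION_TYPES and o in OBJECT_TYPES and a in ATTRIBUTE_TYPES
```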
In an aspect of the present disclosure, the one or more FOL constraints may be generated based on at least one of relation types, object types or attribute types of the sets of inputs. For instance, an example formulation of a FOL constraint may be given by:

$$\Phi_j(G, x) := 1 - \mathbb{1}[v_j \in S(x)] \tag{1}$$

where 1[·] is the indicator function, vj denotes the j-th node in the structure G, S(x) denotes the semantic attributes of a set of inputs x, which may be extracted from one or more triples {[r,o,a]} of the set of inputs x, and vj ∈ S(x) is true if the semantic representation of vj can be found in S(x).
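A minimal sketch of formulation (1), assuming a structure represented as a list of node labels and S(x) represented as a set of semantic attributes (both representations are illustrative assumptions):

```python
def phi_j(structure, semantic_attrs, j):
    """Formulation (1): Phi_j(G, x) = 1 - 1[v_j in S(x)].
    Returns 0.0 when the j-th node's semantic representation is found in
    S(x) (stronger correspondence), and 1.0 otherwise."""
    return 0.0 if structure[j] in semantic_attrs else 1.0

# Example: node "PROG" is supported by the semantic attributes of x.
value = phi_j(["PROG", "AND"], {"PROG", "colour"}, j=0)  # -> 0.0
```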
In one or more aspects of the present disclosure, a group of FOL constraints may be generated based on one or more triples {[r,o,a]} of the set of inputs x, according to a certain aspect of domain knowledge, such as logical reasoning, temporal reasoning, spatial reasoning, or arithmetical reasoning and the like. For instance, logical reasoning may comprise logical AND, OR, XOR and the like. For instance, arithmetical reasoning may comprise arithmetical ADD, SUB, MUL and the like. For instance, spatial reasoning may comprise STRUC (Structure), e.g., for changing the computation rules of input modules, and the like. For instance, temporal reasoning may comprise PROG (Progress), ID (Identical) and the like.
In one or more aspects of the present disclosure, a group of FOL constraints generated according to a certain aspect of domain knowledge may be applied to each node of a structure, respectively. For instance, the constraints in the group may apply one FOL rule to all nodes of the structure to check that aspect of domain knowledge.
It will be appreciated by those skilled in the art that one or more of the aspects described above may be performed by the network 200, 700 or by other networks, systems, or models.
In one example, in the exemplary flow chart of method 400, reasoning tasks may be performed by optimizing the trainable parameters of the PGM 210, 710 and of the modules of the set of modules 220, 720, so as to minimize the prediction loss over observed samples, as formulated by the following objective:

$$\min_{\phi}\min_{\theta}\;\mathrm{err}(\phi,\theta) := \sum_{(x_n, y_n)\in D}\;\sum_{G\sim q_\phi(G|x_n)} \mathcal{L}\big(G,\theta;\,x_n,y_n\big) \tag{2}$$

where φ denotes the trainable parameters of the PGM 210, 710, θ denotes the trainable parameters of the modules of the set of modules 220, 720, D={(xn,yn)}n=1:N denotes a dataset comprising the n-th input xn associated with output yn, and ℒ denotes the prediction loss of executing a modularized network with structure G and module parameters θ on the sample (xn, yn).
In an aspect of the present disclosure, the network 200, 700 may utilize the PGM 210, 710 to formulate a generative distribution pφ(x|G) and a variational distribution qφ(G|x). For instance, an encoder of a VAE may formulate the variational distribution qφ(G|x), and a decoder of the VAE may formulate the generative distribution pφ(x|G). In particular, by optimizing the formulation (2), an estimated posterior distribution of structures p̃φ0(G|x) may be obtained based on the observed dataset.
In another aspect of the present disclosure, one or more FOL constraints may be applied for regularization to generate the new posterior distribution of structures qφ(G|x). Formally, the overall objective may be written as:
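One plausible form of this objective, assuming the standard RegBayes posterior-regularization template with slack variables (the precise published layout may differ), is:

$$\min_{q_\phi,\,\theta,\,\xi}\;\mathrm{KL}\big(q_\phi(G|x)\,\|\,\tilde{p}_{\phi_0}(G|x)\big) + \mathrm{err}(\phi,\theta) + C\sum_{i=1}^{L}\xi_i \quad \text{s.t.}\quad \mathbb{E}_{G\sim q_\phi}\big[\Phi_{ij}(G, x_n)\big] \le \epsilon + \xi_i,\;\; \xi_i \ge 0 \tag{3}$$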
where qφ(G|x) is the regularized posterior distribution of structures, and p̃φ0(G|x) is the estimated posterior distribution of structures obtained as described above.
The functions Φij(G, xn) in formulation (3), whose values may be bounded by the slack variables, are FOL constraints. In one example, each constraint function may take a value in the range [0, 1], where a smaller value denotes better correspondence between the structure G and the input xn according to domain knowledge. It should be noted that the constraint functions may form L groups instead of being independent of each other. The i-th group may comprise Ti correlating constraints, which may correspond to a shared slack variable ξi.
While the main goal of formulation (3) may be to minimize the task loss err, the slack variables ξi=1:L in the formulation take the FOL constraints into consideration. The process of structure generation may thereby be regularized with the applied domain knowledge. In order to reach the minimum of the whole objective, the network 200, 700 may learn to generate structures that satisfy the applied FOL constraints properly. Moreover, the KL-divergence between qφ(G|x) and p̃φ0(G|x) may keep the regularized posterior distribution close to the estimated posterior distribution.
It will be appreciated that one or more additional constraints may be added, and one or more exemplary constraints described above may be omitted.
In block 510, parameters of the PGM 210, 710 and parameters of the modules of the set of modules 220, 720 may be updated alternately by maximizing the evidence of the sets of inputs and the sets of outputs, to obtain an estimated posterior distribution over the combinations of one or more modules of the set of modules 220, 720 and optimized parameters of the modules of the set of modules 220, 720.
In block 520, one or more weights of one or more posterior regularization constraints applied to the estimated posterior distribution over the combinations of one or more modules of the set of modules 220, 720 may be updated, to obtain one or more optimal solutions of the one or more weights.
In block 530, the estimated posterior distribution over the combinations of one or more modules of the set of modules 220, 720 may be adjusted by applying the one or more optimal solutions of the one or more weights and one or more values of the one or more constraints on the estimated posterior distribution.
In block 540, the optimized parameters of the modules of the set of modules 220, 720 may be updated based on the adjusted estimated posterior distribution over the combinations of one or more modules of the set of modules 220, 720, in order to fit the updated structure distribution.
In an example, supposing θ is fixed, the objective of the probabilistic generative model may be given by maximizing the evidence of the observed data samples, which may be written as:
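One plausible form, assuming an ELBO-style evidence objective with a β-weighted KL term (as in a β-VAE) and a γ-scaled prediction likelihood (the precise published formulation may differ), is:

$$\mathrm{prob}(\phi,\theta) = \sum_{(x_n,y_n)\in D}\Big(\mathbb{E}_{G\sim q_\phi(G|x_n)}\big[\log p_\phi(x_n\,|\,G) + \gamma\,\log p_\theta(y_n\,|\,x_n, G)\big] - \beta\,\mathrm{KL}\big(q_\phi(G|x_n)\,\|\,p(G)\big)\Big) \tag{4}$$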
where γ is the scaling hyper-parameter of the prediction likelihood, and β is a constant parameter that satisfies β>1. Since prob(φ,θ) may not be differentiable due to the expectation EG∼qφ(G|x) over discrete structures G, the gradient over φ may be estimated with structures sampled during training.
Supposing the PGM 210, 710 parameters φ have achieved an optimum, the optimization over θ may become an optimization of the network execution performance, which may be written as:

$$\min_{\theta}\;\mathrm{err}(\phi,\theta) = \sum_{(x_n, y_n)\in D}\;\sum_{G\sim q_\phi(G|x_n)} \mathcal{L}\big(G,\theta;\,x_n,y_n\big) \tag{5}$$

where the gradient ∇θ err(φ,θ) may be estimated with stochastic gradient descent (SGD) with structures G sampled during training.
Suppose the results of the above optimization procedure regarding formulation (2) are denoted by φ0 and θ0, and the estimated posterior distribution of structures is denoted by p̃φ0(G|x); the posterior regularization problem of formulation (3), with p̃φ0(G|x) fixed, may then be referred to as formulation (6).
In an aspect of the present disclosure, a dual problem from convex analysis may be applied to find a solution to formulation (6). By introducing the dual variables μ, the optimal distribution of the RegBayes objective may be obtained by the following formulation:
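Assuming the standard RegBayes dual solution, the regularized posterior may take the following exponentiated-reweighting form:

$$q_\phi(G|x) = \frac{\tilde{p}_{\phi_0}(G|x)\,\exp\big(-\sum_{i=1}^{L}\mu_i^{*}\,\Phi_{[i]}^{(D)}(G,x)\big)}{Z(\mu^{*})} \tag{7}$$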
where Φ[i](D)(G,x) is the grouped summation of the FOL constraints in the i-th group,

$$\Phi_{[i]}^{(D)}(G,x) := \sum_{j=1}^{T_i} \Phi_{ij}^{(D)}(G,x) \tag{8}$$

and each Φij(D)(G,x) is an expectation over the observed samples of the corresponding constraint function,

$$\Phi_{ij}^{(D)}(G,x) := \mathbb{E}_{x_n\sim D}\big[\Phi_{ij}(G, x_n)\big] \tag{9}$$
Further, Z(μ*) is the normalization factor for qφ, where μ* is the optimal solution of the dual problem:
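A plausible form of this dual problem, assuming box-constrained dual variables consistent with the projection onto [−C, C] in formulation (12) below, is:

$$\mu^{*} = \arg\max_{\mu\in[-C,\,C]^{L}}\;-\log Z(\mu) - \epsilon\sum_{i=1}^{L}|\mu_i| \tag{10}$$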
where C and ϵ are hyper-parameters in formulation (3).
The optimization of the dual problem (10) may be performed with an approximate stochastic gradient descent (SGD) procedure. Specifically, the gradient may be approximated as:
$$\partial_{\mu_i} L(\mu) = \mathbb{E}_{G\sim q_\phi(G|x)}\big[\Phi_{[i]}^{(D)}(G,x)\big] \approx \hat{\Phi}_{[i]}(G,x) \tag{11}$$
where the first equality is due to duality, and the approximation estimates the expectation with Φ̂[i](G,x), which may be obtained by uniformly sampling over the observed samples and calculating the constraint function values. Specifically, the updates to μi may be given by the SGD rule:
$$\mu_i^{(t+1)} = \mathrm{Proj}_{[-C,C]}\big(\mu_i^{(t)} + r_t\,(-\partial_{\mu_i} L)\big) \tag{12}$$
where Proj[−C,C] denotes the Euclidean projection of the input onto [−C, C], and rt is the step length. After solving for μ*, the regularized posterior distribution of structures qφ(G|x) may be given by the formulation (7). The module parameters θ may be further optimized to fit the updated structure distribution.
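A minimal sketch of the projected SGD rule of formulation (12), assuming the gradient estimate has already been computed from uniformly sampled observed samples (all names are hypothetical):

```python
def sgd_step_mu(mu_i, grad_i, r_t, C):
    """One projected SGD step per formulation (12):
    mu_i(t+1) = Proj_[-C, C](mu_i(t) + r_t * (-grad_i)),
    where grad_i approximates the partial derivative of the dual
    objective, estimated via sampled constraint values (formulation (11))."""
    candidate = mu_i + r_t * (-grad_i)
    # Euclidean projection onto the interval [-C, C].
    return max(-C, min(C, candidate))

# Example: one update with step length 0.1 and box bound C = 1.0.
mu_new = sgd_step_mu(mu_i=0.5, grad_i=0.3, r_t=0.1, C=1.0)
```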
In an example, the overall pipeline of the exemplary optimization process 500 may be presented in Algorithm 1.
Algorithm 1:
1) maximize prob(φ, θ) to update φ;
2) minimize err(φ, θ) to update θ, repeating steps 1)-2) until convergence, to obtain p̃φ0(G|x) and θ0;
3) update the weights μ with the projected SGD rule of formulation (12) until convergence, to obtain μ*;
4) adjust the posterior distribution according to formulation (7) with μ* and the constraint values, to obtain qφ(G|x);
5) minimize err(φ, θ) to update θ, to fit the updated structure distribution.
Here, μ may be considered as the weights of the FOL constraints. In an aspect of the present disclosure, one or more FOL constraints may be grouped into one or more groups of FOL constraints, and the grouped FOL constraints may collectively correspond to only one weight. As illustrated in step 3) of Algorithm 1, the optimization process 500 may have to perform multiple iterations to update each of the weights until convergence. Grouping the FOL constraints may reduce the number of weights, which may save computation resources accordingly.
In another aspect of the present disclosure, a value of a FOL constraint may be determined based on a correlation between a set of inputs and a module in a combination of one or more modules of the set of modules generated according to the estimated posterior distribution given the set of inputs. For instance, the correlation may relate to whether the semantic representation of a module in a structure generated according to the estimated posterior distribution (e.g., given xn, φ0) can be found in S(xn), as illustrated by formulation (1).
In block 610, the network 200, 700 may be provided with a set of input images and a set of candidate images.
In block 620, a combination of one or more modules of the set of modules 220, 720 may be generated based on the set of input images and a posterior distribution over combinations of one or more modules of the set of modules 220, 720, wherein the posterior distribution is formulated by the PGM 210, 710 trained with domain knowledge applied as one or more posterior regularization constraints. In an example, the training process may be performed according to the method 400 and/or the optimization process 500 described above.
In block 630, the set of input images and the set of candidate images may be processed through the generated combination of one or more modules of the set of modules 220, 720.
In block 640, a candidate image may be selected from the set of candidate images based on a score of each candidate image in the set of candidate images estimated by the processing.
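A minimal sketch of this inference flow, assuming hypothetical helpers sample_structure (drawing a structure from the trained posterior qφ(G|x)) and execute (running the assembled modules to score one candidate):

```python
import torch

def answer(network, input_images, candidate_images):
    """Blocks 610-640: generate a structure for the input images, score
    each candidate through the assembled modules, and select the best."""
    with torch.no_grad():
        structure = network.sample_structure(input_images)             # block 620
        scores = [float(network.execute(structure, input_images, c))   # block 630
                  for c in candidate_images]
    return max(range(len(scores)), key=scores.__getitem__)             # block 640
```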
In an aspect of the present disclosure, each module of the set of modules 720 may be configured to perform a pre-designed process on one or more variable image properties, and the one or more variable image properties may result from processing an input image feature map through at least one trainable parameter. For instance, a module of the logical AND type may be represented as follows:
$$f_{\mathrm{AND}}(d, e) = (W_d \cdot d) \wedge (W_e \cdot e) \tag{13}$$
where d and e are input panel features, and Wd and We are trainable parameters for focusing on a specific panel property.
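A minimal sketch of formulation (13), assuming PyTorch and approximating the soft conjunction ∧ with an element-wise minimum (the disclosure does not fix this operator):

```python
import torch
import torch.nn as nn

class AndModule(nn.Module):
    """Logical-AND module per formulation (13): f_AND(d, e) = (W_d d) ^ (W_e e)."""
    def __init__(self, feat_dim):
        super().__init__()
        # W_d and W_e focus the module on a specific panel property.
        self.W_d = nn.Linear(feat_dim, feat_dim, bias=False)
        self.W_e = nn.Linear(feat_dim, feat_dim, bias=False)

    def forward(self, d, e):
        # Element-wise minimum as a soft conjunction (an assumption).
        return torch.minimum(self.W_d(d), self.W_e(e))
```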
In an aspect of the present disclosure, an image panel property may comprise any property that may be exhibited on an image. In another aspect of the present disclosure, with the guidance of domain knowledge, the one or more variable image properties may comprise shape, line, size, type, colour, position, number or the like, based at least in part on the triples {[r,o,a]} on which the constraints may depend.
In an aspect of the present disclosure, the PGM 710 may be configured to output a posterior distribution over structures of modularized networks 730 assembled from the set of modules 720, where the structures 730 may identify the types of the assembled modules and the connections therebetween. The one or more variable image properties of each module 740 may be determined by training the at least one trainable parameter. The separate generation of structures 730 (e.g., generated by the PGM 710) and variable image properties 740 (e.g., generated based on the trainable parameters) may provide the network 700 with more flexibility in high-level concept abstraction and representation learning.
In an example, the method 400 may start with providing the network 800 with sets of inputs and sets of outputs (e.g., via route 1), wherein each set of inputs (e.g., a set X1 of 3×3 image panels) may map to one of a corresponding set of outputs (e.g., Y1) based on visual information on the set of inputs.
The method 400 may repeat the procedure described with reference to the inputs X1 and outputs Y1, e.g., with X2, Y2, X3, Y3, . . . , Xn, Yn. The parameters φ, θ of the encoder 810-1, the decoder 810-2 and the modules of the set of modules 820 may be updated according to the optimization process 500 described above.
In an aspect of the present disclosure, the decoder 810-2 may be used for a backward propagation, e.g., via route 4. In another aspect of the present disclosure, the decoder 810-2 may be omitted.
In an example, the method 600 may be performed for an inference process after the network 800 has been trained according to the method 400 and/or the optimization process 500.
It will be appreciated by those skilled in the art that the posterior distribution unit 850 and/or the sub-network 860 may be incorporated into one or more parts of the network 800, rather than being implemented as separate parts as illustrated.
The various operations, models, and networks described in connection with the disclosure herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. According to an embodiment of the disclosure, a computer program product for visual reasoning may comprise processor-executable computer code for performing the method 400, the optimization process 500, and the method 600 described above.
The preceding description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the various embodiments. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the scope of the various embodiments. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2021/078877 | 3/3/2021 | WO |