The present disclosure relates generally to the field of computer vision. More specifically, the present disclosure relates to computer vision systems and methods for end-to-end training of convolutional neural networks using differentiable dual-decomposition techniques.
Modern computer vision approaches generally utilize convolutional neural networks (CNNs) that excel at hierarchical feature extraction. Conventional computer vision approaches generally utilize conditional random fields (CRFs) that excel at modeling flexible higher-order interactions. The benefits of CNNs and CRFs are complementary, and as such, they are often combined. However, these combined approaches generally utilize a mean-field (MF) approximation technique which does not directly optimize the real problem.
Therefore, there is a need for computer vision systems and methods which can utilize a deep network architecture to provide an alternative to the MF approximation technique for dual-decomposition-based approaches to CRF optimization, while improving the ability of computer systems to more efficiently process data. These and other needs are addressed by the computer vision systems and methods of the present disclosure.
The present disclosure relates to computer vision systems and methods for end-to-end training of convolutional neural networks using differentiable dual-decomposition techniques. In particular, the system allows for segmenting an attribute of an image using a convolutional neural network and a conditional random field that learn to perform semantic image segmentation. Additionally, the system trains the convolutional neural network and the conditional random field using a fixed point algorithm for dual-decomposition and a plurality of images of a dataset.
The foregoing features of the disclosure will be apparent from the following Detailed Description, taken in connection with the accompanying drawings, in which:
The present disclosure relates to computer vision systems and methods for end-to-end training of convolutional neural networks using differentiable dual-decomposition, as discussed in detail below in connection with
End-to-end training of combined convolutional neural networks (CNNs) and conditional random fields (CRFs) generally improves quality in pixel labeling tasks in comparison with decoupled training. Conventional end-to-end trainable systems are generally based on a mean-field (MF) approximation to the CRF. The MF approximation approximates a posterior distribution of a CRF via a set of variational distributions, which are of simple forms and amenable to an analytical solution. This approximation is utilized in end-to-end trainable frameworks where the CRF is combined with a CNN because MF iterations can be unrolled as a set of recurrent convolutional and arithmetic layers, and the approximation is thus fully differentiable.
However, despite computational efficiency and straightforward implementation, MF-based approaches suffer from the assumptions that the underlying latent variables are independent and that the variational distributions are simple. Accordingly, an exact maximum-a-posteriori (MAP) solution of a CRF cannot be realized with MF iterations because, in practical CRFs, latent variables are generally not independent and posterior distributions are complex and cannot be analytically expressed. For example, to employ an efficient inference, a known approach can model pairwise potentials as a weighted sum of a set of Gaussian kernels over pairs of feature vectors, which penalizes or yields a very small boost to dissimilar feature vectors and therefore tends to smooth out pixel labels spatially. Alternatively, a more general pairwise model can encourage an assignment of different labels to a pair of dissimilar feature vectors if they actually belong to different semantic classes in ground-truth.
The system of the present disclosure describes a dual-decomposition solution to MAP inference of a Markov random field (MRF). It should be understood that a CRF can be considered as an MRF conditioned on input data, and therefore inference methods on an MRF are applicable to a CRF if input data and potential functions are fixed per inference. Dual-decomposition does not make assumptions about a distribution of a CRF but instead formulates the MAP inference problem as an energy maximization problem. Directly solving this problem on a graph with cycles is generally NP-hard (nondeterministic polynomial-time hard). However, dual-decomposition relaxes the original problem by decomposing the graph into a set of trees that cover each edge and vertex at least once, such that MAP inferences on these tree-structured sub-problems can be executed efficiently via dynamic programming, while the solution to the original problem can be realized via dual-coordinate descent or sub-gradient descent utilizing solutions from the sub-problems. Dual-decomposition is considered to be an approximate algorithm in that it minimizes a convex upper bound of the original problem and the solutions of the sub-problems do not necessarily agree with each other. However, the approximate primal objective can be realized via heuristic decoding at any time during optimization. Further, the duality gap describes the quality of the current solution, such that a gap of zero yields an optimal solution. In contrast, the MF approximation only yields a local minimum of the Kullback-Leibler (KL) divergence.
With respect to learning parameters for CNNs and CRFs, the MF approximation is fully differentiable and therefore trainable with back-propagation. Known dual-decomposition approaches rely on sub-gradient descent or dual-coordinate-descent to maximize the energy objectives and therefore are not immediately differentiable with respect to CNN parameters that generate CRF potentials. Max-margin learning is generally utilized in such a situation for linear and non-linear deep neural network models. However, these models require that the MAP inference routine be sufficiently robust to determine reasonable margin violators to enable learning. This can be problematic when the underlying CRF potentials are parameterized by non-linear and complex CNNs instead of linear classifiers as in traditional max-margin frameworks (e.g., a structured support vector machine (SVM)). In contrast, the fixed-point iteration of the system of the present disclosure is derived from a special case of more general block coordinate descent algorithms and optimizes the CRF energy directly with respect to the CRF potentials and thus can be jointly trained with CNNs via back-propagation. Additionally, instead of employing a gradient predictor for MAP inference which can provide minimal theoretical guarantees, the system of the present disclosure derives a differentiable dual-decomposition algorithm for MAP inference including properties such as dual-monotonicity and provable optimality.
The systems and methods of the present disclosure derive a fixed-point algorithm having dual-monotonicity and dual-differentiability based on dual-decomposition that overcomes the aforementioned problems. In particular, a smoothed-max operator with negative-entropy regularization provides for the fixed-point algorithm to be fully-differentiable while remaining monotone for the dual objective. Additionally, the system provides an efficient and highly parallel graphics processing unit implementation for the algorithm. Accordingly, the system of the present disclosure can perform end-to-end training of CNNs and CRFs by differentiating through dual-decomposition layers, thereby improving semantic segmentation accuracy on the PASCAL Visual Object Classes (VOC) 2012 dataset over baseline models.
Turning to the drawings,
where Z=ΣL∈
where ψ(⋅) and ϕ(⋅,⋅) are neural networks modeling unary and pairwise potential functions. V denotes the set of all pixel locations and E denotes the pairwise edges in the graph. Determining the mode (or a maximum of the modes in a multimodal case) of the posterior distribution of Equation 1 is equivalent to determining the maximizing configuration L* of Equation 2, as given by Equation 3 and Equation 4 below:
The system 10 maximizes the objective function ƒθ(L; ) w.r.t. L when conducting test time optimization and therefore the optimization problem can be defined by Equation 5 as:
In particular, Equation 5 denotes the MAP inference problem and the system 10 solves this problem via dual-decomposition as described in further detail below.
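By way of illustration only, the score of a candidate labeling under such unary and pairwise potentials can be computed along the following lines; the tensor shapes, the flattened vertex indexing, and the function name are assumptions made for this sketch rather than a restatement of the system 10:

```python
import torch

def labeling_energy(labels, unary, pairwise, edges):
    """Score f(L) = sum_i psi_i(l_i) + sum_ij phi_ij(l_i, l_j) of one labeling.

    labels:   (V,) long tensor holding one state per vertex of the flattened grid
    unary:    (V, L) tensor of unary potentials psi_i(l)
    pairwise: (E, L, L) tensor of edge potentials phi_ij(l', l)
    edges:    (E, 2) long tensor of (i, j) vertex-index pairs
    """
    unary_score = unary[torch.arange(unary.shape[0]), labels].sum()
    li, lj = labels[edges[:, 0]], labels[edges[:, 1]]
    pair_score = pairwise[torch.arange(edges.shape[0]), li, lj].sum()
    return unary_score + pair_score

# Example with a 2 x 2 grid (V = 4), L = 3 labels, and its 4 nearest-neighbor edges.
edges = torch.tensor([[0, 1], [2, 3], [0, 2], [1, 3]])
score = labeling_energy(torch.tensor([0, 1, 1, 2]), torch.randn(4, 3),
                        torch.randn(4, 3, 3), edges)
```

MAP inference (Equation 5) then corresponds to searching for the labeling that maximizes this score.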
In step 92, the system 10 defines a graph G=(V, E) with M×N vertices representing a two-dimensional (2D) grid to define the MAP inference problem on the MRF. Each vertex can select one of the states in the label set ℒ={1, 2, . . . , L}. The system 10 defines a labeling of the 2D grid as L∈ℒ^|M×N| and a state of the vertex at a location i as li. For simplicity, the system 10 discards the dependency on θ and on the input data for all potential functions ƒ, ϕ, and ψ during the derivation of the inference algorithm since they are fixed for each inference. Accordingly, the MAP inference problem on the MRF is defined by Equation 6 as:
In step 94, the system 10 transforms the MAP inference problem defined by Equation 6 to an integer linear programming (ILP) problem. The system 10 denotes xi(l) as a distribution corresponding to vertex/location i and xij(l′, l) as a joint distribution corresponding to a pair of vertices/locations i, j. The system 10 defines a constraint set XG to enforce the pairwise and unary distributions to be consistent and discrete as follows:
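Although the original listing of XG is not reproduced above, a standard local-consistency (“local polytope”) form that matches this description, stated here as an assumption rather than as the verbatim constraint set of the disclosure, can be written as:

```latex
\mathcal{X}_G = \Big\{ \{x_i\}, \{x_{ij}\} \;\Big|\;
  \sum_{l \in \mathcal{L}} x_i(l) = 1 \;\; \forall i \in V; \quad
  \sum_{l' \in \mathcal{L}} x_{ij}(l', l) = x_j(l), \;\;
  \sum_{l \in \mathcal{L}} x_{ij}(l', l) = x_i(l') \;\; \forall ij \in E; \quad
  x_i(l), \, x_{ij}(l', l) \in \{0, 1\} \Big\}
```

The first constraint makes each vertex distribution a valid discrete assignment, and the marginalization constraints tie each edge distribution to the distributions of its two endpoint vertices.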
As such, Equation 6 can be rewritten as an ILP as defined by Equation 7:
where xi and ψi are |ℒ|-dimensional vectors denoting the vertex distribution and scores at vertex i, while xij and ϕij are |ℒ|²-dimensional vectors denoting the edge distribution and scores for a pair of vertices (i,j).
In step 96, the system 10 decomposes the graph G with vertical and horizontal connections of arbitrary length into sets of horizontal and vertical chain sub-problems. Solving Equation 7 is generally NP-hard given an arbitrary graph with cycles. The system 10 tackles Equation 7 by first decomposing the graph G into a set of tree structured and solvable sub-problems such that each edge and vertex of the graph G=(V, E) is covered at least once by these sub-problems. Subsequently, the system 10 adds a set of additional constraints to enforce that maximizing configurations of each sub-problem agree with one another. These constraints are then relaxed by Lagrangian relaxation and the relaxed objective can be optimized by sub-gradient ascent or fixed point updates.
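As a concrete illustration of such a decomposition (a sketch that assumes only nearest-neighbor, stride-1 connections rather than connections of arbitrary length), the horizontal and vertical chain sub-problems of an M×N grid can be enumerated as follows:

```python
def grid_chain_decomposition(M, N):
    """Decompose an M x N grid into horizontal and vertical chain sub-problems.

    Vertices are indexed i = r * N + c. Each chain is an ordered list of vertex
    indices; consecutive entries form the edges covered by that chain.
    """
    horizontal = [[r * N + c for c in range(N)] for r in range(M)]
    vertical = [[r * N + c for r in range(M)] for c in range(N)]
    return horizontal + vertical

# Every vertex is covered by exactly one horizontal and one vertical chain, so
# |T(i)| = 2 for all i, and every stride-1 edge is covered exactly once.
chains = grid_chain_decomposition(4, 5)
```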
In particular, the system 10 defines a set of trees T that covers each vertex i∈V and edge ij∈E at least once. The set of variables corresponding to vertices and edges in tree t∈T and all sets of such variables are denoted as xt and {xt}. Additionally, T(i) and T(ij) respectively denote the set of trees that cover vertex i∈V and edge ij∈E. In view of the foregoing, Equation 7 can be rewritten as Equation 8 below:
As mentioned above, the system 10 decomposes the graph G with vertical and horizontal connections of arbitrary length into sets of horizontal and vertical chain sub-problems. For example,
Referring back to step 96 of
Without a replicated edge, Equation 8 can be rewritten as Equation 10 below:
Applying Lagrange multipliers to relax the second set of constraints (e.g., the agreement constraints among sub-problems) yields Equation 11 below as:
where λt and {λt} respectively denote the dual variables for sub-problem t and for the set of all sub-problems. λit denotes the dual variables for sub-problem t at location i and has a dimension of L (e.g., the number of labels/states). As such, the system 10 can determine that if Σt∈τ(i)λit≠0 for any i∈V, then the maximization over {x} in Equation 11 is unbounded (i.e., +∞). Thus the system 10 enforces Σt∈τ(i)λit=0 for all i∈V, and it follows that Σi∈VΣt∈τ(i)xi·λit=0. Therefore, {x} can be eliminated from Equation 11 to yield Equation 12 as follows:
In step 98, the system 10 derives a block coordinate-descent algorithm that monotonically decreases the objective of Equation 12. It is generally convenient to initialize λ's as 0 and fold them into {ψt} terms such that the system 10 optimizes an equivalent objective to Equation 12 over {ψt} as defined by Equation 13:
The system 10 fixes the dual variables for all sub-problems at all locations except for those at one location k and optimizes with respect to the vector ψkt, ∀t∈τ(k) and primal variables {xt}. The system 10 also defines
as the max-marginal of sub-problem t at location k with xkt(l)=1. Similarly, the system 10 defines the max-marginal vector of sub-problem t at location k as μkt and the max energy of sub-problem t as μt. It should be understood that μkt is a vector value while μkt(l) and μt are scalar values. For a single location k∈V, a coordinate update to ψkt, ∀t∈τ(k) is optimal and can be given by Equation 14 as follows:
The system 10 aims to optimize the linear program as defined by Equation 15 with respect to the dual variables where the primal variables {xt} are included in the max energy terms μt as follows:
where μ̂kt(lk) and ψ̂kt(lk) in the first set of constraints are the max-marginals and unary potentials at location k after applying the update rule of Equation 14. This set of constraints is derived from the fact that μ̂kt(lk)+ψkt(lk)−ψ̂kt(lk)=μkt(lk), ∀lk∈ℒ, and μt=maxlk∈ℒ μkt(lk). Converting Equation 15 to dual form yields Equation 16 below:
It can be inferred from the second set of constraints that the terms αt(lk)=α
It should be understood that the update rule denoted by Equation 14 satisfies Σt∈τ(k)ψ̂kt=ψk. As such, Equation 17 can be simplified as Equation 18 below:
It should be understood that applying Equation 14 yields μ̂kt=μ̂kt′, ∀t, t′∈τ(k); that is, the updated max-marginals at location k agree across all sub-problems covering location k.
In practice, the system 10 could update all ψ's at once, which, although it may yield slower convergence, may provide for efficient parallel algorithms for modern Graphics Processing Unit (GPU) architectures. Intuitively, a neural network can minimize an empirical risk with respect to the coordinate-descent algorithm of the system 10 since the output of any coordinate-descent step is sub-differentiable.
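A minimal sketch of one such parallel step is given below. The chain max-marginal routine and the averaging form of the update are illustrative assumptions chosen to be consistent with the properties stated for Equation 14 (the total unary potential at a location is preserved and the updated max-marginals agree across sub-problems); they are not a verbatim restatement of Equations 14 and 19:

```python
import torch

def chain_max_marginals(unary, pairwise):
    """Max-marginals of a chain sub-problem via max-product dynamic programming
    (one root-to-leaf pass and one leaf-to-root pass).

    unary:    (K, L) unary potentials along the chain
    pairwise: (K - 1, L, L) potentials phi(l_prev, l_next) for consecutive pairs
    returns:  (K, L) tensor whose [k, l] entry is the maximum energy over all
              chain labelings constrained to take state l at position k
    """
    K, _ = unary.shape
    fwd = [unary[0]]
    for k in range(1, K):
        fwd.append(unary[k] + (fwd[-1].unsqueeze(1) + pairwise[k - 1]).max(dim=0).values)
    bwd = [unary[K - 1]]
    for k in range(K - 2, -1, -1):
        bwd.append(unary[k] + (bwd[-1].unsqueeze(0) + pairwise[k]).max(dim=1).values)
    bwd.reverse()
    # unary[k] is counted by both passes, so one copy is subtracted.
    return torch.stack(fwd) + torch.stack(bwd) - unary

def averaging_update(max_marginals, step=1.0):
    """Hypothetical averaging-style correction at one shared location.

    max_marginals: (T, L) max-marginals of the T sub-problems covering the location
    returns:       (T, L) additive corrections to the per-sub-problem unary
                   potentials; they sum to zero over T (preserving the total unary
                   potential) and, with step=1.0, equalize the updated max-marginals.
    """
    avg = max_marginals.mean(dim=0, keepdim=True)
    return step * (avg - max_marginals)
```

When all locations are updated simultaneously, a damped step size (as discussed below with respect to Equation 19) rather than step=1.0 would be used.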
A known theorem states that applying an update to ψit, ∀i∈V, t∈τ(i) does not increase the objective of Equation 13 where the update is defined by Equation 19 below as:
where |V0|=maxt∈T|Vt| and |Vt| denotes a number of vertices in sub-problem t. To prove the theorem, μ̂t denotes the max energy of sub-problem t after applying the update defined by Equation 19 to {ψt} and
μt can be briefly expanded by Equation 21 as:
In view of the above, μt can be considered to be a convex function of ψt because μt is the maximum of a set of affine functions (each of which is defined by a point in XG) of ψt. When |V0|=maxt∈T|Vt|, Jensen's Inequality can be applied to the second term of Equation 20 as follows:
Here it is observed that Σt∈τ(i)
It should be understood that the update rules of the system 10 differ from conventional tree-block coordinate descent and its variants. For example, Equation 19 avoids reparameterization of edge potentials, which can be expensive when forming coordinate-descent steps as differentiable layers, as memory consumption grows linearly with the number of coordinate-descent steps while each step requires O(|E|×|ℒ|²) memory for storing edge potentials. The update rules of the system 10 are similar to a fixed-point update of a known approach, but the monotone fixed-point update of that known approach is effective on a single covering tree and therefore is not amenable to parallel computation. Furthermore, for a simultaneous update to all locations, the system 10 proves a monotone update step-size of
while the known approach merely proves a monotone update step-size
which is smaller. The coordinate-descent fixed-point update algorithm is described in further detail below as Algorithm 1 with respect to
Equation 22 is neither differentiable nor sub-differentiable and is the only non-differentiable portion of Algorithm 1. Accordingly, in step 142, the system 10 applies a softmax function on Σt∈τ(i)μit where μit is a vector of length L to output a probability distribution over L labels defined by Equation 23 below:
In step 144, the system 10 utilizes one-hot encoding of pixel-wise labels as the ground truth. Then, in step 146, the system 10 determines cross-entropy loss and gradients utilizing the softmax of the max-marginals which is sub-differentiable. Thus, the system 10 is end-to-end trainable with respect to CNN parameters θ.
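A sketch of this loss computation, assuming the max-marginals have already been summed over the sub-problems covering each pixel and arranged as a (batch, labels, height, width) tensor, might look like the following:

```python
import torch.nn.functional as F

def crf_cross_entropy(summed_max_marginals, target):
    """Cross-entropy on the softmax of summed max-marginals.

    summed_max_marginals: (B, L, H, W) tensor holding, per pixel, the sum over
                          t in T(i) of the max-marginal vectors mu_i^t
    target:               (B, H, W) long tensor of ground-truth labels; using
                          class indices here is equivalent to one-hot targets
    """
    # F.cross_entropy applies a log-softmax over the label dimension internally,
    # matching the softmax over the L summed max-marginal values per pixel.
    return F.cross_entropy(summed_max_marginals, target)
```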
Referring back to
where C(i) denotes a set of neighbors of node i. The max operator is only sub-differentiable. Common activation functions in neural networks such as Rectified Linear Unit (ReLU) and leaky ReLU are also based on a max operation and are thus sub-differentiable. In practice, the parameters for generating the {ψ, ϕ} terms are often initialized from zero-mean uniform Gaussian distributions. As such, at a start of training the {ψ, ϕ} terms are near-identical over classes and locations while max(⋅) over such terms can be random. It should be understood that in a backward pass, the gradient ∂L/∂μit(l) will only flow through the maximum of the max operator which can hinder learning progression and can yield inferior training and testing results. Accordingly, the system 10 implements a smoothed-max operator with negative-entropy regularization, whose forward pass is a log-sum-exp operator while the backward pass is a soft-max operator.
and defines a gradient of the forward pass of the smoothed-max operator maxΩ(y) by Equation 26 as:
where γ denotes a positive value that controls a strength of the convex regularization. It should be understood that the gradient vector defined by Equation 26 is indicative of a softmax over input values. This ensures that gradients from the loss layer can flow equally through input logits when inputs are near-identical thereby facilitating an initialization of the training process. It should also be understood that the negative-entropy regularized smoothed-max operator of Equation 25 satisfies associativity and distributivity, and that for tree-structured graphs, the smoothed-max-energy determined via a recursive dynamic program is equivalent to the smoothed-max over the combinatorial space XG. As described in detail below, Algorithm 1 can optimize a convex upper bound of Equation 13 when replacing a standard max with the smoothed-max defined by Equation 25 while both Equations 14 and 19 still hold. The system 10 provides for faster convergence of Algorithm 1 when the smoothed-max is utilized with a reasonable γ (e.g., γ=1.0 or 2.0).
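One way to realize such a smoothed-max operator, relying on automatic differentiation so that the backward pass is exactly the softmax described above, is sketched below; the function name and interface are assumptions:

```python
import torch

def smoothed_max(y, gamma=1.0, dim=-1):
    """Negative-entropy-regularized smoothed max: gamma * logsumexp(y / gamma).

    Its gradient with respect to y is softmax(y / gamma), so autograd already
    provides the soft-max backward pass; no custom backward function is needed.
    """
    return gamma * torch.logsumexp(y / gamma, dim=dim)

# Near-identical inputs receive near-uniform gradients, which helps early training.
y = torch.zeros(5, requires_grad=True)
smoothed_max(y, gamma=1.0).backward()
print(y.grad)  # approximately 0.2 for each entry
```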
For simplicity and to prove that Equations 14 and 19 still hold for a monotone fixed-point algorithm for dual-decomposition with smoothed-max, the dual-decomposition objective (e.g., Equation 13) is restated as Equation 27 below:
Since the xt's are independent of each other ∀t∈τ, the max can be moved to the right of Σt∈τ to yield Equation 28:
The max can be replaced with the smoothed-max as defined by Equation 25 to yield a new objective as defined by Equation 29:
It should be understood that the objective with smoothed-max as defined by Equation 29 corresponds to a tree-reweighted belief propagation (TRBP) objective that bounds a partition function (e.g., Z in Equation 1) with a decomposition of a graph into trees, and the system 10 implements a monotone, highly parallel coordinate descent approach to minimize bounds of the partition function. It is noted that dual-decomposition introduces no additional overhead when each edge is covered only once, but if a more complex tree bound is required, then TRBP can be utilized to minimize over edge appearance probabilities. The smoothed-max marginal of state li at location i on sub-problem t is defined by Equation 30 below:
where C(i) denotes a set of neighbors of node i.
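Under chain-structured assumptions analogous to the earlier max-marginal sketch, replacing each max of the dynamic program with the smoothed-max yields smoothed max-marginals; the routine below is an illustrative sketch rather than the exact recursion of Equation 30:

```python
import torch

def chain_smoothed_max_marginals(unary, pairwise, gamma=1.0):
    """Smoothed max-marginals on a chain, with every max of the dynamic program
    replaced by the smoothed-max gamma * logsumexp(. / gamma).

    unary: (K, L); pairwise: (K - 1, L, L) indexed as phi(l_prev, l_next)
    """
    K, _ = unary.shape
    fwd = [unary[0]]
    for k in range(1, K):
        fwd.append(unary[k] + gamma * torch.logsumexp(
            (fwd[-1].unsqueeze(1) + pairwise[k - 1]) / gamma, dim=0))
    bwd = [unary[K - 1]]
    for k in range(K - 2, -1, -1):
        bwd.append(unary[k] + gamma * torch.logsumexp(
            (bwd[-1].unsqueeze(0) + pairwise[k]) / gamma, dim=1))
    bwd.reverse()
    # unary[k] is accumulated by both directional passes, so one copy is removed.
    return torch.stack(fwd) + torch.stack(bwd) - unary
```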
Similar to the max-marginal vector μit for sub-problem t at location i and the max energy μt for sub-problem t, the system 10 defines the smoothed max-marginal vector vit for sub-problem t at location i and the smoothed-max energy vt for sub-problem t. In particular, Equation 31 defines the smoothed-max energy as:
Equation 31 is equivalent for any i∈Vt because a smoothed-max with negative-entropy regularization (e.g., log-sum-exp) over a combinatorial space of a tree-structured problem is equal to a smoothed-max-energy determined via dynamic programs with a smoothed-max operator. This is given by Equation 32 below as:
The system 10 provides and proves a new monotone update rule for Equation 29 when fixing the dual variables for all sub-problems at all locations except for those at one location k and optimizing with respect to the vector ψkt, ∀t∈τ(k). In particular, for a single location k∈V, the following coordinate update to ψkt, ∀t∈τ(k) defined by Equation 33 below is optimal:
The system 10 endeavors to optimize the linear program with respect to ψkt as defined by Equation 34 below and where vkt(lk) is a function of ψkt(lk):
Since γ log(⋅) is a monotonically increasing function (for positive γ), it can be removed from Equation 34 to yield Equation 35 below as:
Defining λkt as the change of ψkt after some update and optimizing over λkt while fixing ψkt yields Equation 36 below:
As such and in terms of λkt(lk), the update rule defined by Equation 33 is equivalent to the solution given by Equation 37 below:
This solution makes vkt(lk)+λkt(lk)=vk
When performing minimization, the denominator and exponent on the right-hand side can be eliminated. Accordingly, the objective is defined by Equation 38 below:
It is equivalent to optimize for each unique lk∈ℒ independently since there is no constraint over different pairs in label space (e.g., ℒ×ℒ). Thus, each optimization problem can be defined by Equation 39 below:
Converting Equation 39 to dual form yields Equation 40 below:
It should be understood that the min can be positioned on the outside of Equation 40 because a Linear Programming (LP) problem always satisfies the Karush-Kuhn-Tucker (KKT) conditions. Setting the derivative of Equation 40 w.r.t. λkt(lk) to zero for any sub-problem t yields Equation 41 below:
In view of the foregoing, Equation 37 satisfies vkt(lk)+λkt(lk)=vk
Additionally, the following proves the monotone update step for simultaneously updating all locations for dual-decomposition with smoothed-max. In particular, the update to ψit, ∀i∈V, t∈τ(i) as defined by Equation 42 below does not increase the objective as defined by Equation 29:
where |V0|=maxt∈τ|Vt| and |Vt| denotes a number of vertices of sub-problem t.
It should be understood that v̂t is denoted as the smoothed-max-energy of sub-problem t after applying the update defined by Equation 42 to {ψt} and that vit is denoted as the smoothed-max-energy of sub-problem t after applying the update defined by Equation 33 for location i. Considering changes in the objective from updating {ψt} according to Equation 42 yields Equation 43 below:
It can be deduced that vt as defined by Equation 23 (and consequently v̂t) is a convex function of ψt in view of the following: (1) Σi∈Vxit·ψit+Σij∈Exijt·ϕij is a non-decreasing convex function of ψt for any xt∈XG, and (2) the log-sum-exp function is a non-decreasing convex function, and a composition of two non-decreasing convex functions is convex, such that vt (and v̂t) is a convex function of ψt. Applying Jensen's Inequality to the second term of Equation 43 yields:
Accordingly, it can be observed that Σt∈τ(i)vit corresponds to the objective defined by Equation 34 before applying the update defined by Equation 33 and that Σt∈τ(i)
As mentioned above, it should be understood that line 6 of Algorithm 1 (as shown in
It should be understood that the system 10 can alternatively define a dynamic programming layer for determining a max-marginal term μit(li) as defined by Equation 45:
It should also be understood that for any leaf node k, C(k)=Ø. In step 192 and without loss of generality, the system 10 defines an M×M pixel grid with stride 1 and stride 2 horizontal and vertical edges (as shown in
It should be understood that with respect to a backward pass, a loss function is determined with smoothed-max-marginals of all locations and labels, thereby necessitating differentiation through all locations and labels. This yields an algorithm having a time complexity of O(M²|ℒ|) (with parallelization over locations and labels at each location) such that a gradient at one location can be affected by gradients from all locations. Alternatively, a backward pass can be performed by commencing from a root/leaf location, passing gradients to a next location while also adding in the gradients of the next location from the loss layer, and recursing until the leaf/root. The system 10 can perform two passes of this dynamic programming layer to determine final gradients for each location, which yields an algorithm having a total time complexity of O(M|ℒ|) and a space complexity of O(M|ℒ|²).
It should be understood that dynamic programming from root to leaf is known as forward dynamic programming and denoted by a subscript ƒ while dynamic programming from leaf to root is known as backward dynamic programming and denoted by a subscript b. In this context, Cƒ(i) denotes a set of previous node(s) of node i in a root-leaf direction, while Cb(i) denotes a set of previous node(s) of node i in a leaf-root direction.
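These directional passes lend themselves to a batched GPU implementation in which all horizontal chains (and, separately, all vertical chains) advance in lock-step. The following sketch of a batched root-to-leaf smoothed-max pass uses assumed tensor shapes and is not a restatement of the disclosed two-pass gradient scheme:

```python
import torch

def batched_chain_smoothed_forward(unary, pairwise, gamma=1.0):
    """Root-to-leaf ("forward") smoothed-max pass over a batch of chains.

    unary:    (C, K, L) potentials for C chains of length K
    pairwise: (C, K - 1, L, L) potentials phi(l_prev, l_next) per chain
    returns:  (C, K, L) forward messages; all chains advance together, which is
              what makes the dynamic program amenable to GPU parallelism.
    """
    C, K, L = unary.shape
    msgs = [unary[:, 0]]                                   # (C, L)
    for k in range(1, K):
        prev = msgs[-1].unsqueeze(2) + pairwise[:, k - 1]  # (C, L, L)
        msgs.append(unary[:, k] + gamma * torch.logsumexp(prev / gamma, dim=1))
    return torch.stack(msgs, dim=1)

# For an M x N grid, the M horizontal chains (and the N vertical chains) can be
# packed along the batch dimension C so that one pass covers all of them at once.
```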
Training, testing, and results of the system 10 will now be described in greater detail in relation to
As mentioned above, the system 10 utilizes the PASCAL VOC 2012 semantic segmentation benchmark and average pixel intersection-over-union (mIoU) of the foreground as a performance measure. The PASCAL VOC dataset is widely utilized to benchmark various computer vision tasks such as object detection and semantic image segmentation. The semantic image segmentation benchmark includes 21 classes including a background class. The training dataset includes 1,464 images, the validation dataset includes 1,449 images and the testing dataset includes 1,456 images with pixel-wise semantic labeling. The system 10 utilizes additional annotations provided by the Semantics Boundaries Dataset (SBD) to augment the original PASCAL VOC 2012 training dataset thereby yielding an augmented training dataset including 10,582 images. The system 10 trains the network 14 on the augmented training dataset and evaluates a performance of the network on the validation dataset.
The system 10 re-implements DeepLabV3 in PyTorch with a block4-backbone (e.g., no additional block following the backbone except for one 1×1 convolution, two 3×3 convolutions and a final 1×1 convolution for classification) and Atrous Spatial Pyramid Pooling (ASPP) as a baseline. The system 10 sets the output stride to 16 for all models. The system 10 utilizes the Residual Neural Network 50 (ResNet-50) and Xception-65 as the backbones. To train the models, the system 10 utilizes a stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a “poly” learning rate policy with an initial learning rate of 0.007 for 30k steps. Additionally, the system 10 utilizes a weight decay of 0.00001 to train the ResNet-50 models and a weight decay of 0.00004 to train the Xception-65 models. With respect to data augmentation, the system 10 applies random color jittering, random scaling from 0.5 to 1.5, random horizontal flip, random rotation between −10° and 10°, and a randomly cropped 513×513 image patch of the transformed image. The system 10 also utilizes synchronized batch normalization with a batch size of 16 for all models, which is the standard for training semantic segmentation models. With respect to the dual-decomposition augmented models, the system 10 obtains the pairwise potentials by applying a fully connected layer over concatenated features of pairs of locations on the feature map before the final layer of the baseline model, executes several steps of the fixed-point iteration (FPI) of Algorithm 1 utilizing output logits from unary and pairwise heads for each of training and inference, and maintains remaining configurations consistent with the baseline model.
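The optimizer and “poly” learning-rate policy described above can be configured along the following lines; the power of the poly policy (0.9 is a common choice) and the helper name are assumptions for this sketch:

```python
import torch

def build_optimizer_and_scheduler(model, max_steps=30000, base_lr=0.007,
                                  weight_decay=1e-5, power=0.9):
    """SGD with momentum 0.9 and a "poly" learning-rate policy:
    lr(step) = base_lr * (1 - step / max_steps) ** power.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: (1.0 - step / max_steps) ** power)
    return optimizer, scheduler

# Per training step: loss.backward(); optimizer.step(); scheduler.step()
```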
As described above, the system 10 utilizes a fixed-point algorithm having dual-monotonicity and dual-differentiability and provides an efficient and highly parallel graphics processing unit implementation for the algorithm. In particular, a smoothed-max operator with negative-entropy regularization provides for the fixed-point algorithm to be fully-differentiable while remaining monotone for the dual objective. Additionally, the system 10 can perform end-to-end training of CNNs and CRFs by differentiating through dual-decomposition layers, thereby improving semantic segmentation accuracy on the PASCAL VOC 2012 dataset over baseline models. It should be understood that additional testing of the system 10 can be performed on other datasets, architectures, and tasks such as human-pose-estimation and stereo matching (which can be formulated as general CRF inference problems) in comparison to other models. It should be understood that certain modifications of the system 10 can include unsupervised/semi-supervised learning by enforcing an agreement of sub-problems.
The functionality provided by the present disclosure could be provided by computer vision software code 306, which could be embodied as computer-readable program code stored on the storage device 304 and executed by the CPU 312 using any suitable, high or low level computing language, such as Python, Java, C, C++, C#, .NET, MATLAB, etc. The network interface 308 could include an Ethernet network interface device, a wireless network interface device, or any other suitable device which permits the server 302 to communicate via the network. The CPU 312 could include any suitable single-core or multiple-core microprocessor of any suitable architecture that is capable of implementing and running the computer software code 306 (e.g., Intel processor). The random access memory 314 could include any suitable, high-speed, random access memory typical of most modern computers, such as dynamic RAM (DRAM), etc.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art can make any variations and modification without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure. What is desired to be protected by Letters Patent is set forth in the following claims.
This application claims priority to U.S. Provisional Patent Application Ser. No. 62/947,874 filed on Dec. 13, 2019, the entire disclosure of which is hereby expressly incorporated by reference.
References Cited — U.S. Patent Documents:
US 2018/0253622 A1, Chen, Sep. 2018.
Other Publications:
Wang, et al. (computer English translation of Chinese patent No. CN108305266A), pp. 1-10. (Year: 2018).
Vemulapalli, et al. (Gaussian Conditional Random Field Network for Semantic Segmentation), pp. 3224-3233 (Year: 2016).
Knöbelreiter, et al. (End-to-End Training of Hybrid CNN-CRF Models for Stereo), pp. 1456-1465 (Year: 2017).
Teichmann, et al. (Convolutional CRFs for Semantic Segmentation), pp. 1-12. (Year: 2018).
Wang, et al., “End-to-End Training of CNN-CRF via Differentiable Dual-Decomposition,” arXiv:1912.02937v1, Dec. 6, 2019 (14 pages).
Mensch, et al., “Differentiable Dynamic Programming for Structured Prediction and Attention,” Proceedings of the 35th International Conference on Machine Learning (2018) (10 pages).
Belanger, et al., “End-to-End Learning for Structured Prediction Energy Networks,” Proceedings of the 34th International Conference on Machine Learning (2017) (11 pages).
Chen, et al., “Rethinking Atrous Convolution for Semantic Image Segmentation,” arXiv:1706.05587v3, Dec. 5, 2017 (14 pages).
Chollet, et al., “Xception: Deep Learning with Depthwise Separable Convolutions,” Proceedings of CVPR (2017) (8 pages).
Domke, “Dual Decomposition for Marginal Inference,” Proceedings of the 25th AAAI Conference on Artificial Intelligence (2011) (6 pages).
Finley, et al., “Training Structural SVMs When Exact Inference is Intractable,” Proceedings of the 25th International Conference on Machine Learning (2008) (8 pages).
Globerson, et al., “Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations,” Proceedings of NIPS (2008) (8 pages).
Hariharan, et al., “Semantic Contours from Inverse Detectors,” 2011 IEEE International Conference on Computer Vision (8 pages).
Jancsary, et al., “Convergent Decomposition Solvers for Tree-Reweighted Free Energies,” Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (2011) (11 pages).
He, et al., “Deep Residual Learning for Image Recognition,” Proceedings of CVPR (2016) (9 pages).
Kolmogorov, et al., “Convergent Tree-Reweighted Message Passing for Energy Minimization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, No. 10, Oct. 2006 (16 pages).
Komodakis, et al., “Efficient Training for Pairwise or Higher Order CRFs Via Dual Decomposition,” Proceedings of CVPR (2011) (8 pages).
Komodakis, et al., “MRF Optimization via Dual Decomposition: Message-Passing Revisited,” Proceedings of ICCV (2007) (8 pages).
Krahenbuhl, et al., “Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials,” Proceedings of NIPS (2011) (9 pages).
Chen, et al., “Learning Deep Structured Models,” Proceedings of the 32nd International Conference on Machine Learning (2015) (10 pages).
Lin, et al., “Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation,” Proceedings of CVPR (2016) (10 pages).
Liu, et al., “Deep Learning Markov Random Field for Semantic Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, No. 8, Aug. 2018 (15 pages).
Wainwright, et al., “A New Class of Upper Bounds on the Log Partition Function,” IEEE Transactions on Information Theory, vol. 51, No. 7, Jul. 2005 (23 pages).
Paszke, et al., “Automatic Differentiation in PyTorch,” 31st Conference on Neural Information Processing Systems (2017) (4 pages).
Shimony, “Finding MAPs for Belief Networks is NP-hard,” Artificial Intelligence 68 (1994) (12 pages).
Song, et al., “End-to-End Learning for Graph Decomposition,” Proceedings of ICCV (2019) (10 pages).
Sontag, et al., “Introduction to Dual Decomposition for Inference,” Optimization for Machine Learning (2011) (37 pages).
Sontag, et al., “Tree Block Coordinate Descent for MAP in Graphical Models,” Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (2009) (8 pages).
Meltzer, et al., “Convergent Message Passing Algorithms: A Unifying View,” Proceedings of UAI (2009) (9 pages).
Tsochantaridis, et al., “Large Margin Methods for Structured and Interdependent Output Variables,” Journal of Machine Learning Research (2005) (32 pages).
Wang, et al., “Non-Local Neural Networks,” Proceedings of CVPR (2018) (10 pages).
Yarkony, et al., “Covering Trees and Lower-Bounds on Quadratic Assignment,” Proceedings of CVPR (2010) (8 pages).
Zheng, et al., “Conditional Random Fields as Recurrent Neural Networks,” arXiv:1502.03240v2, Apr. 30, 2015 (16 pages).
Huang, et al., “CCNet: Criss-Cross Attention for Semantic Segmentation,” Proceedings of ICCV (2019) (10 pages).
The PASCAL Visual Object Classes Challenge 2012 (VOC2012), http://host.robots.ox.ac.uk/pascal/VOC/voc2012/ (11 pages).
Everingham, et al., “The PASCAL Visual Object Classes Challenge: A Retrospective,” International Journal of Computer Vision (2015) (39 pages).
Publication: US 2021/0182675 A1, Jun. 2021.
Related U.S. Application Data: Provisional Application No. 62/947,874, filed Dec. 2019.