The present disclosure relates generally to the field of computer vision technology. More specifically, the present disclosure relates to computer vision systems and methods for machine learning using a set packing framework.
Artificial neural networks (“ANN”) excel at learning functions that map input data vectors (e.g., images of objects such as a dog, a cat, a horse, etc.) to output labels (e.g., semantic label: dog, cat, horse, etc.) by using large quantities of labeled training data. An ANN learns a function that generalizes beyond a training data set to produce the correct label as output on test data not part of the training data set. A possible application of ANNs is object recognition, in which an ANN learns to recognize the presence of objects (e.g., cat, dog, horse, etc.) in images. Large data sets facilitate learning such functions. An example of a large data set is the ImageNet data set, which provides fourteen million training images, each associated with the labels of the objects present in the image.
Localizing each unique instance of objects in crowded images, which is called instance segmentation, is an important related task to object recognition. The common approach to instance segmentation iterates over all possible rectangles of pixels (called bounding boxes) in the image, and predicts the presence of each object in that rectangle. However, combining the hypotheses generated in each rectangle to describe each unique instance of objects is challenging as the hypotheses need not be mutually consistent. For example, multiple predicted hypotheses can share a common pixel, but multiple objects cannot be associated with the same pixel in the ground truth. Heuristics, such as non-max suppression, are often used to remove conflicts between predicted hypotheses. Non-max suppression removes from consideration all but one of each set of “similar” and/or overlapping predictions. Combinatorial optimization provides a principled alternative to non-max suppression heuristics, which is referred to as data association.
Data association uses combinatorial optimization to partition the observations in a data set (e.g., pixels in an image) into a set of hypotheses (e.g., unique instances of objects or background), each associated with a subset of the observations that is consistent with the statistical properties of the known structure of a hypothesis.
The use of combinatorial optimization in computer vision/machine learning has developed largely without influence from the operations research community, and has focused on network flows (called graph cuts), primal dual methods (the most prominent of which is message passing), and compact linear programming (“LP”) relaxations augmented with cutting plane methods. This often leads to less efficient/optimal solvers than are desirable. Further, the capacity of the associated models is limited by not taking advantage of the decades of research in combinatorial optimization in the operations research community.
Recently the core operations research techniques of column generation (“CG”) and (nested) Benders decomposition (called “(N)BD”) have been introduced to the machine learning and computer vision communities. However, the application of these techniques, and the construction of models to support the use of CG and (N)BD is in its infancy.
Therefore, there is a need for computer vision systems and methods which can overcome data association problems in computer vision systems, thereby improving the speed and efficiency of the computer vision systems. These and other needs are addressed by the computer vision systems and methods of the present disclosure.
The present disclosure relates to computer vision systems and methods for machine learning using a set packing framework. The systems and methods disclosed herein include a minimum weight set packing (“MWSP”) framework, which uses advanced methods of integer programming that the system applies to data association problems commonly studied in computer vision. In the present system, an MWSP instance for data association is parameterized by a set of possible hypotheses, each of which is associated with a real valued cost that describes the sensibility of the belief that the members of the hypothesis correspond to a common cause. Using MWSP, the system then selects the lowest total cost set of hypotheses, such that no two selected hypotheses share a common observation. Observations that are not included in any selected hypothesis can be thought of as false observations/noise. Embodiments and examples of the present disclosure will be discussed with regard to multi-person detection, which can be used in, for example, self-driving car applications. The set of observations is the set of all pixels, and the set of possible hypotheses is the power set of pixels. The statistical support for a hypothesis is defined in terms of how well a classifier (such as an ANN) scores the quality of a single person dominating the corresponding pixels.
The foregoing features of the invention will be apparent from the following Detailed Description of the Invention, taken in connection with the accompanying drawings, in which:
The present disclosure relates to computer vision systems and methods for machine learning using a set packing framework, as described in detail below in connection with
The set packing engine 16 models data association as a minimum weight set packing formulation (“MWSP”), which is framed using sets of observations and hypotheses denoted D and G respectively, which are indexed by d and g respectively. The mapping of observations to hypotheses is described using matrix G∈{0, 1}|D|×|G| where Gdg=1 if hypothesis g includes observation d. Real valued costs are associated to hypotheses using Γ∈ℝ|G| where Γg is the cost associated with hypothesis g. The MWSP is formulated as an integer linear program (“ILP”) using γg∈{0, 1} where γg=1 if hypothesis g is included in the set packing, as is expressed in Equation 1, below:
In Equation 1, the objective of optimization is the total cost of all hypotheses in the packing. For every observation d∈D, there is one constraint in Equation 1 that states that no more than one selected hypothesis contains observation d. In an example, the cost of a hypothesis consisting of zero observations is zero.
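As an illustration of the optimization in Equation 1, the following is a minimal brute-force sketch of MWSP on a hypothetical toy instance (the hypotheses and costs below are illustrative, not from the disclosure):

```python
from itertools import combinations

def solve_mwsp(hypotheses, costs):
    """Brute-force minimum weight set packing (Equation 1).
    hypotheses: list of frozensets of observation indices (columns of G).
    costs: parallel list of real valued costs (Gamma_g).
    Returns (best total cost, indices of selected hypotheses)."""
    best_cost, best_sel = 0.0, ()  # the empty packing has cost zero
    n = len(hypotheses)
    for r in range(1, n + 1):
        for sel in combinations(range(n), r):
            # Feasibility: no two selected hypotheses share an observation.
            used, feasible = set(), True
            for g in sel:
                if used & hypotheses[g]:
                    feasible = False
                    break
                used |= hypotheses[g]
            if feasible:
                total = sum(costs[g] for g in sel)
                if total < best_cost:
                    best_cost, best_sel = total, sel
    return best_cost, best_sel

# Toy instance: 4 observations, 4 candidate hypotheses.
hyps = [frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3}), frozenset({0, 3})]
costs = [-2.0, -3.0, -1.5, 1.0]
print(solve_mwsp(hyps, costs))  # → (-3.5, (0, 2)): hypotheses {0,1} and {2,3}
```

Exhaustive search is exponential in the number of hypotheses; the disclosure instead solves the LP relaxation via column generation, as described below.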
Prior art systems generally generate cost terms by training a standard linear classifier to determine the probability a variable/pair of variables takes on a given label/pair of labels. The output probabilities are converted to cost terms by taking the negative log of the probability. However, this is not a mathematically principled approach since it does not consider the ILP context, in which a complete solution to all variables is produced. To correctly model the ILP used to produce a solution, the system 10 uses structured support vector machines (“SVM”). A structured SVM learns a mechanism to produce cost terms for ILPs such that the optimal solution to that ILP is similar to the ground truth (information provided by direct observation). The system 10 learns a structured SVM from large amounts of labeled data using a cutting plane approach where the ground truth solution is separated from other solutions generated in the course of training the structured SVM. Learning for structured SVMs requires repeatedly solving ILPs (or linear programs “LPs”) across problem instances, making learning on large data sets challenging. Other mechanisms that can be used by the system 10 to learn cost terms include herding, which is designed to decrease computational requirements relative to the structured SVM, and provides multiple solutions for a problem instance akin to samples from a probability distribution over solutions.
In step 32, the system 10 formulates correlation clustering as an ILP. Specifically, a graph is expressed with a node set D indexed by d, and an edge set indexed by (d1, d2) with weights θ∈ℝ|D|×|D| indexed by (d1, d2). Correlation clustering partitions the nodes into sets, so as to minimize the sum of the within cluster edges. Correlation clustering is known to be an NP-hard problem. The system 10 uses decision variables x∈{0, 1}|D|×|D|, which are indexed by d, j, where xdj=1 if node d is in cluster j. Clusters are indexed by j, and lie in J={0, 1, 2, . . . , |D|}. Expression y∈{0, 1}|D|×|D|×|D| describes co-association. Specifically, yd1d2j=1 if d1, d2 are part of a common cluster j. Accordingly, correlation clustering as an ILP is expressed by Equations 2-7, below:
The objective of Equation 2 is to minimize the sum of the within cluster edges. Equation 3 is a constraint that enforces that every node is assigned to exactly one cluster. Equations 4, 5, and 6 are constraints that collectively enforce that yd1d2j=1 if xd1j=1 and xd2j=1. Equation 7 is a constraint that enforces integrality of x. It is noted that the integrality of x ensures that y is also integral. The optimization in Equation 2 in which Equation 7 is ignored is referred to as the compact formulation of correlation clustering.
In step 34, the system 10 expands the formulation of the correlation clustering to correspond to a tighter relaxation. By expanding the formulation, the system 10 increases optimization speed. Specifically, the system 10 generates an expanded formulation of correlation clustering that corresponds to a tighter relaxation than the compact formulation. The power set of D, denoted G, is indexed with g. The term G is expressed using G∈{0, 1}|D|×|G| where Gdg=1 if d is in g. The cost associated with each cluster g∈G is defined as the sum of all edges within the cluster g. The cost of clusters is expressed using Γ∈ℝ|G|, which is indexed with g, where Γg is the cost associated with cluster g, and is defined by Equation 8 below:
Equations 9-11, below, frame optimization as selecting the lowest cost non-overlapping subset of G:
The objective in Equation 9 is to minimize the sum of the costs of the clusters selected. The constraint in Equation 10 enforces that every node is assigned to no more than one cluster. If the solution γ does not select a cluster that includes d, then d is in a cluster by itself. The constraint in Equation 11 enforces that γ is integral. The optimization expressed in Equation 9, where Equation 11 is ignored, is referred to as the expanded LP relaxation.
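The cluster costs of Equation 8 and the packing selection of Equations 9-11 can be sketched for a toy graph as follows (the graph weights are hypothetical; real instances are far too large for brute force):

```python
from itertools import combinations

def cluster_cost(cluster, theta):
    # Equation 8: Gamma_g is the sum of edge weights within cluster g.
    return sum(theta.get(tuple(sorted(pair)), 0.0)
               for pair in combinations(cluster, 2))

def correlation_clustering(nodes, theta):
    """Brute-force solve of Equations 9-11: select non-overlapping clusters
    of minimum total cost; nodes not covered by any selected cluster are
    singletons of cost zero, as noted after Equation 10."""
    nodes = list(nodes)
    # Enumerate the non-trivial part of the power set G (clusters of size >= 2).
    G = [frozenset(c) for r in range(2, len(nodes) + 1)
         for c in combinations(nodes, r)]
    costs = [cluster_cost(g, theta) for g in G]
    best_cost, best_sel = 0.0, []
    for r in range(1, len(G) + 1):
        for sel in combinations(range(len(G)), r):
            used, feasible = set(), True
            for gi in sel:
                if used & G[gi]:  # overlap violates Equation 10
                    feasible = False
                    break
                used |= G[gi]
            if feasible:
                total = sum(costs[gi] for gi in sel)
                if total < best_cost:
                    best_cost = total
                    best_sel = [set(G[gi]) for gi in sel]
    return best_cost, best_sel

# Hypothetical 4-node graph; negative weights pull nodes together.
theta = {(0, 1): -1.0, (1, 2): -1.0, (0, 2): 0.5, (2, 3): 2.0}
print(correlation_clustering(range(4), theta))  # → (-1.5, [{0, 1, 2}])
```

Node 3 is left as a singleton because every cluster containing it incurs the positive (2, 3) edge weight.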
In step 36, the system 10 solves the expanded formulation using column generation. Specifically, column generation circumvents the problem of the massive size of the set of hypotheses by constructing a sufficient subset of G, denoted Ĝ, so that solving the LP relaxation of Equation 9 over Ĝ provides the same objective as solving over G. Construction of Ĝ is performed in a cutting plane manner using the Lagrangian dual of the LP relaxation of Equation 9 defined using Ĝ, which will be referred to as the restricted master problem (“RMP”). Primal and dual LP relaxations of Equations 9-11 are expressed in Equation 12, below, where the dual LP relaxation is described using dual variables λd≥0 for all d∈D:
The dual form of Equation 12 has a finite number of variables and |G| constraints, which allows the system 10 to use a cutting plane method to solve the dual form. After the system 10 uses the cutting plane approach to solve the dual form, the corresponding primal solution is provably optimal. The use of the cutting plane method in the dual form can require access to an oracle that provides a violated dual constraint given a dual solution λ. This violated dual constraint corresponds to a negative reduced cost primal variable. The task of finding the lowest reduced cost primal variable is referred to as pricing, whose corresponding optimization is expressed in Equation 13, below:
The optimization in Equation 13 is often not solved by search, but instead as an integer program or a dynamic program. The system 10 can employ specialized solvers to solve the pricing problems that exploit the special structure found in specific problem domains.
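A pricing oracle in the spirit of Equation 13 can be sketched by exhaustive search over an explicit candidate pool (feasible only for toy instances; the reduced cost form Γg+Σd λdGdg below matches the expression used later in the disclosure, and the instance is hypothetical):

```python
def lowest_reduced_cost(hypotheses, costs, lam):
    """Brute-force pricing: return the hypothesis index with the lowest
    reduced cost Gamma_g + sum of lambda_d over the observations in g."""
    best_g, best_rc = None, float("inf")
    for g, (members, gamma) in enumerate(zip(hypotheses, costs)):
        rc = gamma + sum(lam[d] for d in members)
        if rc < best_rc:
            best_g, best_rc = g, rc
    return best_g, best_rc

hyps = [frozenset({0, 1}), frozenset({1, 2}), frozenset({0, 2})]
costs = [-2.0, -1.0, 0.5]
lam = {0: 0.5, 1: 1.0, 2: 0.0}  # dual solution from the RMP (illustrative)
print(lowest_reduced_cost(hyps, costs, lam))  # → (0, -0.5)
```

A negative result signals a violated dual constraint; the corresponding column is added to Ĝ and the RMP is re-solved. In practice the candidate pool is never enumerated, which is why the disclosure uses the specialized dynamic-program and ILP solvers described below.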
In step 62, the system 10 identifies all candidate detections of people in each frame of a video. For example, the system 10 can use a classifier, such as an ANN, to perform the identifications. It is noted that some of these detections can be false detections.
In step 64, the system 10 associates each group of K detections ordered in time, each on a separate frame, with a real cost describing how plausible it is for the K detections to follow each other directly in the track of a single person. In an example, K can be any positive integer (e.g., 3, 4, 5, etc.). These sets are referred to as subtracks in the present disclosure.
The parameter K trades off modeling power and computation requirements. The set of subtracks is pruned by relying on the fact that most subsets of K detections are non-sensible, since the detections are not sufficiently visually similar to correspond to a common person. Similarly, subtracks that do not follow the known statistics of human motion are removed, e.g., humans cannot teleport across space within a few frames of video.
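The motion-statistics pruning can be sketched with a simple speed threshold (the threshold and the (frame, x, y) detection format are hypothetical stand-ins for learned statistics):

```python
def prune_subtracks(subtracks, max_speed):
    """Discard subtracks inconsistent with human motion: any subtrack whose
    consecutive detections move faster than max_speed pixels per frame is
    removed ("humans cannot teleport"). Detections are (frame, x, y)."""
    kept = []
    for s in subtracks:
        ok = True
        for (t1, x1, y1), (t2, x2, y2) in zip(s, s[1:]):
            dt = t2 - t1
            if dt <= 0:  # detections must be strictly ordered in time
                ok = False
                break
            speed = ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5 / dt
            if speed > max_speed:
                ok = False
                break
        if ok:
            kept.append(s)
    return kept

subtracks = [
    [(0, 10, 10), (1, 12, 11), (2, 14, 12)],    # plausible motion: kept
    [(0, 10, 10), (1, 300, 400), (2, 14, 12)],  # teleport: pruned
]
print(prune_subtracks(subtracks, max_speed=20.0))
```

An analogous filter on appearance similarity would remove subtracks whose detections are not visually consistent.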
In step 66, the system 10 formulates the packing of detections into sequences of subtracks as an ILP, and solves the ILP using column generation. Specifically, the system 10 employs a MWSP formulation in which detections correspond to observations and complete tracks correspond to sequences of subtracks. The cost of a track is the sum of the costs of the subtracks that compose it, plus a constant offset. The constant offset penalizes/rewards having additional people in the video, which models a Bayesian prior belief on the number of people in the image.
In step 72, the system 10 defines a set of detections (observations) of people in frames of video as V. The term S is used to denote a set of subtracks, each of which contains K detections. For a given subtrack s∈S, sk indicates the k'th detection in the sequence s={s1, . . . , sK} ordered by time from earliest to latest. It is noted that the detections that compose a subtrack need not be consecutive in time, thus permitting a person to disappear and reappear in video. The mapping of subtracks to tracks is described using T∈{0, 1}|S|×|G| where Tsg=1 indicates that track g contains subtrack s as a sub-sequence.
The set of tracks is denoted as G, where a track is a sequence of subtracks ordered in time such that the latest K−1 elements in time of any subtrack s1 in the sequence are the earliest K−1 elements of the subtrack s2 that immediately succeeds s1. A track can be equivalently described as a sequence of detections ordered in time or a sequence of subtracks ordered in time.
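The succession rule above (the latest K−1 detections of one subtrack must be the earliest K−1 detections of the next) can be sketched as a simple overlap check (detection identifiers are hypothetical):

```python
def may_succeed(s1, s2, K):
    """A subtrack s2 may immediately succeed s1 in a track if the latest
    K-1 detections of s1 equal the earliest K-1 detections of s2.
    Assumes K >= 2 and subtracks given as lists of detection ids."""
    return s1[-(K - 1):] == s2[:K - 1]

# K = 3, so each subtrack contains 3 detections.
s1 = ["a", "b", "c"]
s2 = ["b", "c", "d"]
s3 = ["c", "e", "f"]
print(may_succeed(s1, s2, K=3), may_succeed(s1, s3, K=3))  # → True False
```

This check defines the predecessor sets {→s} used by the pricing dynamic program discussed later.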
Returning to
To permit the construction of tracks that have fewer detections than Km, in step 76, the system 10 augments the set of subtracks with subtracks padded with empty detections. Such subtracks have no possible predecessors or successors.
Returning to
In step 84, a classifier (such as an artificial neural network) associates each pair of detections with a cost for the pair to be associated with a common person. The cost is computed using a negative log odds ratio of the probability that the two detections are/are not associated with a common person. Similarly, a cost is computed to associate each detection with a person. The classifiers take as input local statistics of pixel values around the detections, and/or spatial and angular statistics concerning the relative location of the pair of detections. The cost terms over pairs of detections are referred to as pairwise, and those over a single detection are referred to as unary.
It is noted that person detection in computer vision traditionally relies on tree (pictorial) structured models, which describe the feasibility of poses of the human body according to a cost function defined on a graph, where nodes correspond to body parts, and edges indicate adjacency. Thus, pairwise cost terms are non-zero only between adjacent detections corresponding to the same, or adjacent, body parts in the tree model.
Returning to
In step 94, the system 10 defines a set of people (hypotheses) G as the power set of V. It is noted that a person can contain more than one detection of any given body part. This is a modeling decision and is a consequence of the body part detector firing in multiple places in close proximity corresponding to the same ground truth body part. Similarly, since human body parts are occluded in real images, it is possible for a hypothesis to contain zero detections of some body parts.
In step 96, the system 10 defines the cost of a person using terms θ1∈ℝ|D| and θ2∈ℝ|D|×|D|, which are indexed by d, and d1, d2, respectively. The terms θ1, θ2 are referred to as unary and pairwise, respectively. The term θ1d denotes the cost of including detection d in a person. Similarly, the term θ2d1d2 denotes the cost of including detections d1, d2 in a common person. Here, positive/negative values of θ1d discourage/encourage the use of the detection d in a person. Similarly, positive/negative values of θ2d1d2 discourage/encourage the presence of d1, d2 jointly in a single person. The system 10 models a prior on the number of people in an image using θ0 to denote a constant cost associated with instancing a person. Here, positive/negative values of θ0 discourage/encourage the presence of more people in the packing.
In step 98, the system 10 models a person according to a common tree structured model. The system 10 can augment the tree structure by connecting the neck to every other body part, the left shoulder to the right shoulder, and the right hip to the right shoulder. These augmentations improve performance, as will be discussed in greater detail below. The augmented tree structure is respected with regards to the costs, thus θd1d2 can only be non-zero if Rd1=Rd2, or if Rd2 is a child of Rd1 in the augmented tree. The mapping of people to costs is defined by Equation 15, below:
Returning to
In step 104, for each pair of adjacent superpixels, the system 10 uses a classifier that provides a cost for the pair to be associated with a common cell. Similarly, the system 10 uses a classifier to generate a cost for each superpixel to be part of a cell. These costs are referred to as pairwise and unary, respectively.
In step 106, the system 10 computes a maximum radius and area (volume in 3D images) of cells on annotated data. In step 108, the system 10 formulates identifying each cell in the image as a MWSP problem where elements are superpixels and sets are cells. The cost of a cell is the sum of the pairwise terms associated with pairs of superpixels in the cell, plus the unary terms associated with superpixels in the cell. As in the other applications, the system 10 adds a constant offset to the cost of a cell that penalizes/rewards having additional cells in the image. This offset models a Bayesian prior belief on the number of cells in the image. The system 10 sets the cost of the cell to ∞ if the radius of the cell or the volume of the cell significantly exceeds the known maximum volume and radius of cells on the annotated data.
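The cell cost described above (constant offset plus unary and pairwise terms, with an infinite penalty when the radius or area bound is violated) can be sketched as follows; the distance matrix, costs, and bounds below are hypothetical:

```python
from itertools import combinations

def cell_cost(cell, theta0, theta1, theta2, S, R_max, vol, V_max):
    """Cost of a candidate cell (hypothesis over superpixels): offset theta0
    plus unary and pairwise terms. Returns infinity if no member can serve
    as an anchor within R_max of all members, or the total area exceeds
    V_max (the radius test follows the Equation 16 anchor condition)."""
    has_anchor = any(all(S[a][d] <= R_max for d in cell) for a in cell)
    if not has_anchor or sum(vol[d] for d in cell) > V_max:
        return float("inf")
    cost = theta0 + sum(theta1[d] for d in cell)
    cost += sum(theta2.get(tuple(sorted(p)), 0.0)
                for p in combinations(cell, 2))
    return cost

S = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]      # pairwise superpixel distances
theta1 = {0: -1.0, 1: -1.0, 2: -1.0}       # unary costs
theta2 = {(0, 1): -0.5}                    # pairwise costs
vol = {0: 1, 1: 1, 2: 5}                   # superpixel areas
print(cell_cost({0, 1}, 1.0, theta1, theta2, S, R_max=2, vol=vol, V_max=4))     # → -1.5
print(cell_cost({0, 1, 2}, 1.0, theta1, theta2, S, R_max=2, vol=vol, V_max=4))  # → inf
```

The second cell is rejected because its total area (7) exceeds the bound, mirroring the ∞ penalty assigned in the text.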
∃d*∈D s.t. [Gdg=1]⇒[Sd*d≤Rmax] ∀d∈D   Equation 16
In step 114, the system 10 defines the radius constraint as a cost. Specifically, for any g∈G, a penalty of ∞ is added to Γg if g does not follow the radius constraint. The radius constraint as optimization is expressed in Equation 17, below:
Optionally, the system 10 can require that the anchor be present in the cell. This changes Equations 16 and 17 to the following formula, expressed in Equation 18, below:
Next, the constraint on the area of a cell is considered. In step 116, the system 10 uses Vmax to denote the upper bound on the area of a cell, and Vd to denote the area of a superpixel d. A cell g∈G satisfies the constraint on the area of a cell if the following, expressed below in Equation 19, holds:
In step 118, the system 10 defines the volume constraint as a cost. For any g∈G, a penalty of ∞ is added to Γg if g does not follow the volume constraint. The volume constraint is expressed as a cost using Equation 20, below:
In step 120, the system 10 describes the image level evidence for the quality of a cell using θd and θd1d2. Specifically, the system 10 uses θd to denote the cost for superpixel d to be part of any cell. Similarly, the system 10 uses θd1d2 to denote the cost for d1 and d2 to belong in a common cell. Positive/negative values of θd discourage/encourage the use of the superpixel d in a cell. Similarly, positive/negative values of θd1d2 discourage/encourage the presence of d1, d2 jointly in a single cell. The system 10 models a prior on the number of cells in an image using θ0 to denote a cost associated with instancing a cell. Positive/negative values of θ0 discourage/encourage the presence of more cells in the packing. The cost Γg of a hypothesis g is expressed in Equation 21, below:
The following will discuss the system 10 solving the pricing problem of Equation 13 in the context of the MWSP formulations. In pricing for multi-object tracking, the system 10 formulates the task of identifying the lowest reduced cost track (hypothesis) as a dynamic program. The system 10 considers the structure of that dynamic program and specifies that a subtrack s may be preceded by another subtrack ŝ if the least recent K−1 detections in s correspond to the most recent K−1 detections in ŝ. The system 10 denotes the set of valid subtracks that may precede a subtrack s as {→s}. The system 10 uses ls to denote the reduced cost of the lowest reduced cost track that terminates at subtrack s. Ordering the subtracks by the time of their last detection allows efficient computation of l using the dynamic program expressed in Equation 22, below:
The system 10 can choose to add, not only the lowest reduced cost track to Ĝ, but other distinct negative reduced cost tracks. Such strategies can be implemented by the system 10 since the dynamic program produces the lowest reduced cost track terminating at each subtrack. One such strategy adds to Ĝ the lowest reduced cost track terminating at each detection (excluding those with non-negative reduced cost).
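The recursion of Equation 22 can be sketched as follows; note that rc[s], the reduced-cost contribution of a subtrack (its cost plus the dual terms of the detections it introduces), the predecessor sets, and the per-track offset are all hypothetical inputs for this sketch:

```python
def lowest_reduced_cost_tracks(subtracks_in_time_order, rc, pred, offset):
    """Dynamic program in the spirit of Equation 22.  l[s] is the reduced
    cost of the best track terminating at subtrack s: its own contribution
    rc[s] plus the best (non-positive) predecessor value, since a track may
    also start fresh at s.  Subtracks must be ordered by last-detection time
    so every predecessor is processed before its successor."""
    l = {}
    for s in subtracks_in_time_order:
        best_prev = min((l[p] for p in pred[s]), default=0.0)
        l[s] = rc[s] + min(0.0, best_prev)
    best = min(l, key=l.get)
    return best, l[best] + offset  # add the constant per-track offset

rc = {"s1": -2.0, "s2": -1.0, "s3": 0.5}
pred = {"s1": [], "s2": ["s1"], "s3": ["s1", "s2"]}
print(lowest_reduced_cost_tracks(["s1", "s2", "s3"], rc, pred, offset=0.5))
# → ('s2', -2.5)
```

Because l records the best track ending at every subtrack, the multiple-column strategies described above (e.g., one track per terminating detection) fall out of the same pass.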
In pricing for multi-person pose estimation, the system 10 identifies the lowest reduced cost person (hypothesis), which can be formulated as a set of dynamic programs. A graph is used where nodes correspond to human body parts, and edges indicate adjacency. A subgraph in which the neck is removed corresponds to a tree structure, which motivates the use of dynamic programming to solve the pricing problem. During the pricing step, the system 10 iterates through the power set of neck detections and computes the lowest reduced cost person containing the neck detections. The power set of neck detections is indexed with Ď, and [g↔Ď]=1 is used to indicate that the neck detections in g are exactly those in Ď. Pricing for an arbitrary subset of the neck detections Ď is expressed in Equation 23, below:
To solve Equation 23 as a dynamic program, the system 10 enumerates the power set of pairs of adjacent detections in the tree in the problem domain. Specifically, the system 10 provides notation to assist in formulating Equation 23 as a dynamic program. The system 10 uses R to denote the set of human body parts, which is indexed by r. The system 10 uses Sr to denote the power set of detections of part r, and indexes it with s. The system 10 uses Dr to denote the set of detections of part r. Sr is described using Sr∈{0, 1}|D|×|Sr|, where Srds=1 indicates that detection d is in set s. For convenience, the system 10 can define the neck as part 0, and thus the power set of neck detections is denoted S0.
It is noted that when conditioned on a specific set of neck detections (denoted s0), the pairwise costs from the neck detections to all other detections can be added to unary costs of the other detections. Thus, the augmented-tree structure becomes a typical tree structure, and exact inference can be done via dynamic programming. The system 10 makes the tree directed by choosing a single node to be the root arbitrarily, and orienting edges in the graph going away from the root.
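The folding of neck pairwise costs into unary costs when conditioning on s0 can be sketched as follows (the detection names and cost values are hypothetical):

```python
def fold_neck_costs(unary, pairwise, neck_set):
    """When conditioning on a fixed set of neck detections s0, the pairwise
    cost from each selected neck detection to every other detection can be
    absorbed into that detection's unary cost.  The augmented tree then
    reduces to an ordinary tree, enabling exact dynamic programming."""
    folded = dict(unary)
    for d in folded:
        for n in neck_set:
            folded[d] += pairwise.get((n, d), 0.0)
    return folded

unary = {"wrist1": -1.0, "elbow1": -0.5}
pairwise = {("neck1", "wrist1"): 0.5, ("neck1", "elbow1"): -0.5}
print(fold_neck_costs(unary, pairwise, {"neck1"}))
# → {'wrist1': -0.5, 'elbow1': -1.0}
```

After folding, only tree-adjacent pairwise terms remain, which is exactly the structure the downward dynamic program of Equations 24-26 exploits.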
The system 10 defines the set of children of any human body part r in the tree graph as {r→}. The system 10 defines μrŝ as the reduced cost of the lowest reduced cost sub-tree rooted at r given that its parent r̂ takes on state ŝ. The term μrŝ includes the cost of the pairwise terms between detections of part r̂ with detections of part r, as expressed in Equation 24, below:
Specifically, in Equation 24, the term
computes pairwise costs between part r and its parent r̂, while vrs accounts for the cost of the sub-tree rooted at part r with state s, and is defined by Equations 25 and 26, below:
To compute μrŝ for each ŝ∈Sr̂, the system 10 needs to iterate over all s∈Sr. For most problems, this is feasible. However, considering that |Dr|=|Dr̂|=15, the system 10 would have to enumerate a joint space of over one billion configurations, which can be expensive. Accordingly, the system 10 can use nested Benders decomposition, which is able to solve the dynamic program exactly, with computation that in practice scales as O(|Dr|) time rather than O(|Dr|×|Dr|) time.
In pricing for multi-cell segmentation, the system 10 finds negative reduced cost cells (hypotheses) by exploiting the fact that cells are small and compact. In Equation 21, above, every cell with non-infinite cost is associated with an anchor d* in close proximity to all other superpixels (observations) that compose the cell. The system 10 solves pricing by conditioning on the choice of the anchor d*, and finds the lowest reduced cost cell, denoted gd*, as expressed by Equation 27, below:
The system 10 reconfigures the optimization in Equation 27 as an ILP (seen below in Equations 30-33) using decision variables x∈{0, 1}|D| and y∈{0, 1}|D|×|D|, which are indexed by d and d1, d2 respectively, and where x and y are defined in Equations 28 and 29, below:
The system 10 enforces Equations 28 and 29 with Equations 31, 32, and 33. Equations 31 and 32 state that yd1d2 cannot be set to one unless both d1, d2 are included in the cell gd*. Similarly, Equation 33 states that if both d1, d2 are included in gd*, then yd1d2 is set to one. It is noted that y is entirely governed by x, and it does not need to be explicitly required to be integer in order for the ILP solver to produce an integer solution.
The system 10 can generate many distinct hypotheses with negative reduced cost when solving Equation 27, as a consequence of solving using different choices of d*. Thus, the system 10 can add to the nascent set Ĝ each hypothesis with negative reduced cost generated by solving Equation 27. The system 10 can re-solve the master problem after any negative reduced cost hypothesis is generated. If the anchor is required to be included in a cell for the cell to be feasible, then xd* is required to be set to one in Equation 30.
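For a toy instance, anchored pricing in the spirit of Equation 27 can also be done by brute force over subsets of superpixels near the anchor (the disclosure solves an ILP instead; the costs, distances, and duals below are hypothetical):

```python
from itertools import combinations

def price_anchored_cell(d_star, D, S, R_max, theta0, theta1, theta2, lam):
    """Given anchor d*, search all non-empty subsets of superpixels within
    R_max of d* for the cell with the lowest reduced cost
    theta0 + sum(theta1[d] + lambda[d]) + pairwise terms.
    Exponential in the neighborhood size; usable only on toy instances."""
    near = [d for d in D if S[d_star][d] <= R_max]
    best_cell, best_rc = None, float("inf")
    for r in range(1, len(near) + 1):
        for cell in combinations(near, r):
            rc = theta0 + sum(theta1[d] + lam[d] for d in cell)
            rc += sum(theta2.get(tuple(sorted(p)), 0.0)
                      for p in combinations(cell, 2))
            if rc < best_rc:
                best_cell, best_rc = frozenset(cell), rc
    return best_cell, best_rc

S = [[0, 1, 5], [1, 0, 5], [5, 5, 0]]  # superpixel distances
theta1 = {0: -2.0, 1: -1.0, 2: -3.0}
lam = {0: 0.5, 1: 0.5, 2: 0.5}
print(price_anchored_cell(0, [0, 1, 2], S, R_max=2, theta0=1.0,
                          theta1=theta1, theta2={(0, 1): -0.5}, lam=lam))
# → (frozenset({0, 1}), -1.5)
```

Superpixel 2 is never considered because it lies outside the anchor's radius, which is how conditioning on d* keeps each pricing subproblem small.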
The LP relaxation of MWSP can be tightened by the system 10 employing subset-row inequalities in such a way as to preserve the structure of the pricing problem. The system 10 can add them to the pricing problem, and parameterize them by two integers m1, m2 and a subset D̂⊆D of cardinality m1m2−1. Subset-row inequalities are used to require that the number of hypotheses containing m1 or more members of D̂ must be no greater than m2−1. The most general form of subset-row inequalities is written in Equation 34, below:
Subset-row inequalities where m1=m2=2 will be referred to as triplets. However, all content in this section is fully applicable to the other subset-row inequalities modeled in the present disclosure.
In step 132, the system 10 generates an MWSP formulation tightened using triplets. In step 134, the system 10 determines whether the subset-row inequalities destroy the structure of the pricing problem. When the subset-row inequalities do not destroy the structure of the pricing problem, the system 10 proceeds to step 136, where the system 10 solves the pricing problem while modifying the structure of the pricing problem. This allows the system to use subset-row inequalities to tighten the LP relaxation for multi-cell segmentation. When the subset-row inequalities destroy the structure of the pricing problem, the system 10 proceeds to step 138, where the system 10 solves the pricing problem without modifying the structure of the pricing problem. This permits the use of subset-row inequalities to tighten the LP relaxations for multi-person tracking and multi-person pose estimation. Each step will be discussed in further detail below.
In step 132, the system 10 tightens the LP relaxation of MWSP by enforcing that, for any set of three unique observations, the number of selected hypotheses that include two or more of its members can be no larger than one. The system 10 describes the set of sets of three unique observations by C, and indexes it with c. The membership of c is described using [d∈c], where [d∈c]=1 if observation d is in c, and otherwise [d∈c]=0. The mapping of triplets to hypotheses is described using matrix C∈{0, 1}|C|×|G|, which is indexed by c, g. Here, Ccg=1 if at least two of the observations in c are present in g. The LP relaxation for MWSP tightened using triplets is expressed using Equation 35, below:
A dual form of Equation 35 is expressed in Equation 36 below, which uses dual variables ψ∈ℝ|C|, which are indexed by c, where ψc is the dual variable associated with the constraint in Equation 35 over c.
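The triplet-to-hypothesis matrix Ccg described above can be constructed directly; a minimal sketch with hypothetical sets:

```python
def triplet_matrix(triplets, hypotheses):
    """Build C_cg for the triplet-tightened relaxation: C_cg = 1 when
    hypothesis g contains at least two of the three observations in
    triplet c, and 0 otherwise."""
    return [[1 if len(c & g) >= 2 else 0 for g in hypotheses]
            for c in triplets]

hyps = [frozenset({0, 1}), frozenset({1, 2, 3}), frozenset({4})]
triplets = [frozenset({0, 1, 2})]
print(triplet_matrix(triplets, hyps))  # → [[1, 1, 0]]
```

Each row of this matrix supplies one subset-row constraint Σg Ccg γg ≤ 1 to the tightened master problem.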
The system 10 can solve Equation 35 using a generalization of column generation, called column/row generation (“CRG”). CRG exploits the fact that the dual LP relaxation has a finite number of variables, thus making it amenable to optimization via a cutting plane method.
As in column generation, the system 10 uses CRG to construct a sufficient set Ĝ by adding negative reduced cost hypotheses (violated dual constraints), given fixed dual variables. CRG augments this procedure by identifying a sufficient set Ĉ by identifying violated constraints given a fixed primal solution. CRG begins with sets Ĝ, Ĉ equal to the empty set, then iterates between solving the optimization in Equation 35 over sets Ĝ, Ĉ, and adding elements to Ĝ, Ĉ. Each iteration produces primal/dual solutions, which facilitate the identification of violated primal/dual constraints. When no violated primal/dual constraints exist, the system 10 terminates CRG. Identifying violated primal constraints is done by iterating over c∈C to identify the c∈C that maximizes Σg∈ĜγgCcg, given fixed γ. While C is too large to include each element as a constraint in the LP relaxation, it is not too large to search over. This is because only triplets where each detection is associated with a fractional valued hypothesis in γ need be considered when iterating over c∈C. Finding the most violated dual constraint (which is called pricing) corresponds to the following optimization expressed in Equation 37, below:
Intelligent schedules can be employed over the operations (e.g., solving the restricted master problem, augmenting Ĝ, and augmenting Ĉ). For example, multiple elements can be added to Ĝ and/or Ĉ each time the restricted master problem is solved. Alternatively, the system can only augment Ĉ when no negative reduced cost elements exist to be added to Ĝ.
Returning to
It is noted that zc is described entirely by x and is set to the smallest possible value at optimality since ψc is non-negative. Thus, the system 10 does not require zc to be integer, since integrality of z is assured given that x is integral.
In step 138, the system 10 solves the pricing problem without modifying the structure of the pricing problem. Specifically, the system 10 finds negative reduced cost primal variables given the dual solution λ, ψ, where ψ cannot be directly considered when using a specialized solver for pricing. First, the system 10 denotes the reduced cost of a hypothesis g as V(Γ, λ, ψ, g). The reduced cost of the lowest reduced cost hypothesis is denoted as V*(Γ, λ, ψ). V(Γ, λ, ψ, g) and V*(Γ, λ, ψ) are expressed in Equation 40, below:
The system 10 applies a specialized solver and ignores the triplet term Σc∈ĈψcCcg, providing a lower bound. Specifically, the system 10 can use a branch and bound (“B&B”) approach. The set of branches in a B&B tree is denoted B. Each branch b∈B is defined by two sets Db+, and Db−. These correspond to observations that must be included in the hypothesis and those that must not be included in the hypothesis respectively. The set of all hypotheses that are consistent with both Db+ and Db− is expressed as Gb±. The bounding and branching operators will be discussed in further detail below. The initial branch b is defined by Db+=Db−={ }.
Regarding the bounding operator, pricing while ignoring the ψ terms is referred to as the independent pricing problem. Term Vb(Γ, λ, ψ) denotes the value of the lowest reduced cost over columns in Gb±. The system 10 computes a lower bound for this value, denoted Vblb, by independently optimizing the independent pricing problem and the triplet penalty, as expressed below in Equation 41:
The system 10 can compute ming∈Gb± Γg+Σd∈D λdGdg for applications in multi-object tracking and multi-person pose estimation. In multi-person tracking, when performing dynamic programming, the system 10 enforces that g∈Gb± as follows: 1) Enforcing Db−: for each subtrack s that includes a d∈Db−, the system sets the corresponding θs value to ∞; and 2) Enforcing Db+: for each subtrack s that includes a detection co-occurring in time with any d∈Db+ (other than d itself), the system sets θs to ∞. Similarly, the system 10 does not consider starting a track after the occurrence of the first member of Db+ in time. After completing the dynamic program generating tracks, the system 10 sets the reduced cost to ∞ for any track terminating prior to the point in time of the last member of Db+. In multi-person pose estimation, the system 10 forces detections in Db+, Db− to be active/inactive, respectively, when generating a person.
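As a non-limiting illustration, the cost modifications that enforce a branch before the dynamic program runs can be sketched as follows (the data layout and names are hypothetical, and co-occurrence is simplified to sharing the same time index):

```python
import math

def apply_branch_to_costs(subtracks, theta, include, exclude):
    """Return modified subtrack costs enforcing a branch (Db+, Db-).
    subtracks: list of subtracks, each a list of (detection_id, time).
    theta: list of subtrack costs, parallel to subtracks.
    exclude (Db-): any subtrack containing an excluded detection is banned.
    include (Db+): any subtrack containing a detection co-occurring in time
    with a required detection, other than that detection itself, is banned."""
    times = {d: t for s in subtracks for (d, t) in s}
    inc_times = {times[d] for d in include if d in times}
    out = list(theta)
    for i, s in enumerate(subtracks):
        for d, t in s:
            if d in exclude:
                out[i] = math.inf  # enforce Db-
            elif t in inc_times and d not in include:
                out[i] = math.inf  # enforce Db+: conflicting co-occurring detection
    return out
```

The remaining rules (not starting tracks after the first member of Db+, and discarding tracks terminating before the last member of Db+) would be enforced inside the dynamic program itself.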
Branch operation will now be discussed. The system 10 expresses an upper bound on Vb(Γ, λ, ψ) as Vbub(Γ, λ, ψ). The system 10 constructs this by adding in the active ψ terms ignored when constructing Vblb (Γ, λ, ψ). Setting gb=arg ming∈Gb±Γg+Σd∈D λdGdg yields Equation 42, below:
The largest triplet term ψc that is included in Vbub(Γ, λ, ψ) but not Vblb(Γ, λ, ψ) is expressed in Equation 43, below:
The system 10 generates eight new branches for each of the eight different ways of splitting the observations in the triplet term corresponding to c* between the include (+) and exclude (−) sets.
It is noted that not all child nodes need be created, as some are guaranteed to be infeasible if some observations in c* already belong to Db− or Db+. For example, assume that c*={d1, d2, d3}. If d1∈Db+, then the child nodes Db2, Db4, Db6 and Db8 are all infeasible because d1 would belong to both the + and − decisions. Furthermore, if d3∈Db−, then nodes Db5, Db6, Db7 and Db8 are infeasible. Thus, only the nodes Db1 and Db3 are feasible, and gb remains an optimal solution for Db1. Note that the branch operator is not applied if ψc*=0.
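As a non-limiting sketch (names hypothetical), the eight candidate children and the feasibility pruning described above can be expressed as follows:

```python
from itertools import product

def branch_children(c_star, D_plus, D_minus):
    """Generate the feasible child branches obtained by splitting the
    triplet c_star between the include (+) and exclude (-) sets.
    A child is infeasible if any detection is forced into both sets."""
    children = []
    for signs in product(('+', '-'), repeat=3):
        inc = set(D_plus) | {d for d, s in zip(c_star, signs) if s == '+'}
        exc = set(D_minus) | {d for d, s in zip(c_star, signs) if s == '-'}
        if inc & exc:
            continue  # detection in both + and - decisions: prune
        children.append((inc, exc))
    return children
```

With Db+=Db−={ }, all eight children are feasible; with d1∈Db+ and d3∈Db−, only the two children fixing d1 to + and d3 to − survive, matching the example above.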
The following section discusses upper bounds on the Lagrange multipliers λ, called dual optimal inequalities (“DOI”), which do not remove all dual optimal solutions. The system 10's use of DOI decreases the search space that column generation needs to explore, thus decreasing the number of iterations of pricing required. For various applications, including cutting stock and image segmentation, DOI are used to dramatically decrease optimization time without sacrificing optimality.
Regarding basic dual optimal inequalities, it is noted that at any given iteration of column generation, the optimal solution to the primal LP relaxation need not lie in the polyhedron of Ĝ. If limited to producing a primal solution over Ĝ, it is useful to allow Σg∈Ĝ Gdgγg to exceed one for some d∈D.
The system 10 uses a slack term ξd≥0 that tracks the presence of any observations included more than once and prevents them from contributing to the objective when the corresponding contribution is negative. Specifically, the system 10 offsets the cost for “over-including” an observation with a cost that at least compensates and likely overcompensates. It is noted that removal of a detection d from a hypothesis increases the cost of a hypothesis by no more than Ξd for each d, where Ξd is expressed by Equation 44, and the expanded MWSP objective and its dual LP relaxation are expressed by Equation 45, both below:
It is noted that the dual relaxation bounds λ from above by Ξ. These bounds are called dual optimal inequalities (DOIs). To ensure that the DOIs are not active at termination of column generation, the system 10 offsets Ξ by a tiny positive constant.
It should be understood that the use of the DOI does not cut off all dual optimal solutions when Ĝ=G. Specifically, the system 10 can map any solution γ, ξ, where ξ is optimal given γ, to a feasible solution. Let g−d̂ denote hypothesis g with detection d̂ removed, as defined in Equation 46, below:
Gdg−d̂=Gdg[d̂≠d]  Equation 46
The system 10 converts γ, ξ to
α←min(γg,ξd)
γg←γg−α
γg−d←γg−d+α
ξd←ξd−α Equation 47
In Equation 47, α is the magnitude of the update to the terms γg, γg−d, ξd. The change in the objective using Equation 47 is expressed in Equation 48, below:
α(−Ξd+Γg−d−Γg) Equation 48
Since Ξd≥Γg−d−Γg by definition, and α is non-negative, the total change in Equation 48 is non-positive, meaning the mapping never increases the objective. Thus, there exists an optimal primal solution in which ξ is the zero vector. Therefore, the use of DOI does not remove all dual optimal solutions.
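The update of Equations 47-48 can be illustrated numerically with a non-limiting sketch (values hypothetical, and γg−d assumed to start at zero):

```python
def map_to_feasible(gamma_g, xi_d, Gamma_g, Gamma_g_minus_d, Xi_d):
    """Apply the update of Equation 47 once, returning the new values of
    gamma_g, gamma_{g-d}, xi_d, and the objective change of Equation 48."""
    alpha = min(gamma_g, xi_d)              # magnitude of the update
    gamma_g_new = gamma_g - alpha           # mass removed from g
    gamma_g_minus_d = alpha                 # mass moved onto the reduced hypothesis
    xi_d_new = xi_d - alpha                 # slack consumed
    delta = alpha * (-Xi_d + Gamma_g_minus_d - Gamma_g)  # Equation 48
    return gamma_g_new, gamma_g_minus_d, xi_d_new, delta
```

Whenever Ξd ≥ Γg−d − Γg, the returned objective change delta is non-positive, so repeating the update drives ξ to zero without worsening the objective.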
This section discusses dual optimal inequalities that are no looser than those discussed above. The system 10 uses Ĝ* to denote the set of hypotheses that are subsets of hypotheses in Ĝ. Thus, at any given point in column generation, the system 10 bounds λd as in Equation 44, above, except replacing the optimization over G with Ĝ*, which is expressed in Equation 49, below:
It is noted that bounds in Equation 49 are not greater than Equation 44 and may increase when elements are added to Ĝ. The DOI in Equations 44 and 49 are referred to as invariant and varying DOI, respectively.
The following section discusses generating a valid DOI for multi-person pose estimation, multi-cell segmentation, and multi-person tracking. Regarding multi-person pose estimation and an invariant DOI, the removal of a detection d from a pose removes from the cost the associated unary term θ1d and any active pairwise terms θ2dd1, θ2d1d. Similarly, if d is the only detection in a pose, then the θ0 term is also removed. The system 10 upper bounds the sum of these three terms by considering only the positive valued pairwise terms and θ1d. If this sum is negative, the system 10 sets the upper bound Ξd to zero, since λ is non-negative by definition. The system 10 expresses Ξd using Equation 50, below:
Regarding multi-person pose estimation and a varying DOI, the system 10 produces Ξd by using the same approach as in Equation 50, except that the system 10 only considers pairwise terms that could be removed when replacing members of Ĝ* with other members of Ĝ*, as expressed below in Equation 51:
The DOI for multi-cell segmentation are identical to the DOI for multi-person pose estimation. Regarding multi-person tracking, the system 10 considers the production of Ξd for tracking. Rather than producing a single track when removing an element d, the system 10 splits the track into two separate tracks, where d defines the boundary and is itself removed. The removal of d causes the removal of the costs of all subtracks including d. This procedure produces an additional track if d is an interior element of the track. Similarly, if d is in every subtrack, then this procedure removes the track entirely.
For invariant DOI, the system denotes δs,d,k to be the lowest total cost sequence of subtracks each including d (e.g., δs,d,K=θs), where the last subtrack in the sequence is s and d is in position k, as expressed in Equation 52, and using δ to express Ξd is shown in Equation 53, both below:
In Equation 53, the system adds the absolute value of θ0, since the removal of all subtracks including d may create two tracks from one or remove a track without replacing it. Further, in Equation 53, all possible sequences of subtracks that contain d are considered. However, with regard to the varying DOI, the system 10 need only consider the sequences of subtracks in tracks in Ĝ. As such, the system denotes by δgs,d,k the lowest total cost sequence of subtracks of g, each including d, where the last subtrack in the sequence is s and d is in position k, as expressed in Equations 54-56, below:
The following section discusses the system 10 generating a lower bound on the LP relaxation at termination of column generation. Given any fixed set Ĝ, solving the restricted master problem (RMP) does not necessarily provide a lower bound on the ILP over G. The system 10 can generate anytime lower bounds by adding to the LP objective the lowest reduced costs of terms generated during pricing.
As discussed above, each observation can be assigned to at most one hypothesis. The system generates a lower bound using Equation 57, below, given any non-negative λ provided by the RMP:
It is noted that the minimization in Equation 57 is the pricing problem called at each iteration of column generation. The bound in Equation 57 can be tightened using an application specific analysis. For example, the corresponding lower bound for multi-person tracking adds to the RMP objective the sum of the negative valued reduced costs of the lowest reduced cost track terminating at each detection, expressed below in Equation 58, where there are no triplets:
It is further noted that Equation 57 provides a lower bound on the optimal packing. Specifically, rewriting the optimization to incorporate that the number of hypotheses selected by any packing is bounded by the number of observations, since every selected hypothesis must contain at least one observation, yields Equation 59, below:
Dualizing the packing constraint and the subset-row inequalities, while retaining in the minimization the constraint that no more than |D| hypotheses are selected, establishes that Equation 59 is equal to Equation 60, below:
The system 10 then relaxes the constraint that λ, ψ are optimal, and reorders terms by γ, which yields that Equation 60 is greater than or equal to Equation 61, below:
It is noted that the inner minimization selects the lowest reduced cost solution |D| times if a negative reduced cost hypothesis exists and otherwise has zero value. Thus, Equation 61 is equal to Equation 62, below:
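As a non-limiting sketch (names hypothetical, triplet terms omitted, and reduced costs following the pricing objective ming Γg+Σd λdGdg used above), the anytime lower bound of Equations 57-62 can be computed as follows:

```python
def anytime_lower_bound(rmp_objective, hypotheses, Gamma, lam, num_observations):
    """Lower bound on the optimal packing: add |D| copies of the most
    negative reduced cost (zero if none is negative) to the RMP objective."""
    def reduced_cost(i):
        # Reduced cost of hypothesis i: its cost plus the duals it covers.
        return Gamma[i] + sum(lam[d] for d in hypotheses[i])
    worst = min(reduced_cost(i) for i in range(len(hypotheses)))
    return rmp_objective + num_observations * min(0.0, worst)
```

At termination of column generation no negative reduced cost hypothesis remains, so the bound coincides with the RMP objective.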
Testing and analysis of the above systems and methods will now be discussed in greater detail. Specifically, computational results will be discussed for the three applications discussed above: multi-person tracking, multi-person pose estimation, and multi-cell segmentation. The system of the present disclosure used a portion of the MOT 2015 training dataset to train and evaluate multi-person tracking in video. The system 10 further used a structured support vector machine (“SVM”) based learning approach as the mechanism to produce cost terms. To generate the set of detections D, the system 10 used the raw detector output provided by the MOT dataset. The system 10 trained models with varying subtrack length (K=2, 3, 4), and allowed for occlusions of up to three frames.
In the problem instance for testing, there are 71 frames and 322 detections in the video. The numbers of subtracks present are 1,068, 3,633 and 13,090 for K=2, 3, 4, respectively. For K=2, 48.5% “Multiple Object Tracking Accuracy,” 11 identity switches, and 9 track fragments were observed, which can be expressed as (48.5, 11, 9). When setting K=3, 4, the performance is (49, 10, 7) and (49.9, 9, 7), respectively. Thus, increasing subtrack length provides noticeable improvements over all metrics.
Each time the present system solves the pricing problem, the present system adds to Ĝ the lowest reduced cost track terminating at each detection, excluding those with non-negative reduced cost. As discussed above, the dynamic programming structure of the pricing problem facilitates this computation.
The testing and analysis in this section used the following enhancements to column generation: anytime lower bounds and subset-row inequalities. However, dual optimal inequalities are not employed. The present system evaluated the above discussed methods on the MPII-multi-person validation set, which consists of 418 images. The present system used the cost terms θ1, θ2 with the following modifications. First, the system sets θ2d1d2=∞ for each pair of unique neck detections d1, d2. This accelerates optimization, since the present system need not explore the entire power set of neck detections during pricing. Second, the present system constructs the sets Vr as follows. The system provides a probability that each detection d is associated with each body part r, denoted pdr. For each detection d, the system assigns it to the set Vr that maximizes this probability. This assignment corresponds to the optimization arg maxr pdr for a given detection d. Third, the system sets θ0 to a single value for the entire data set. Lastly, the system limits the size of Sr to 50,000 for each r∈R. The system constructs Sr as follows: the system iterates over integers k=0, 1, 2, . . . , |Vr|, then adds to Sr the group of configurations containing exactly k detections in Vr. If adding a group would cause Sr to exceed 50,000, then the system does not add the group and terminates construction of Sr.
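The capped construction of Sr described above can be sketched as follows (a non-limiting illustration with hypothetical names; configurations are represented simply as detection subsets):

```python
from itertools import combinations

def build_Sr(Vr, cap=50_000):
    """Construct S_r by adding, for k = 0, 1, ..., |V_r|, the group of all
    configurations with exactly k detections from V_r, stopping before any
    group that would push the size of S_r past the cap."""
    Sr = []
    for k in range(len(Vr) + 1):
        group = list(combinations(sorted(Vr), k))
        if len(Sr) + len(group) > cap:
            break  # the whole group is skipped and construction terminates
        Sr.extend(group)
    return Sr
```

Because groups are added in order of increasing k, the cap preferentially keeps configurations with few detections.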
The set packing relaxation is tight in over 99% of problem instances, and in the remaining cases the gap between the lower and upper bounds is less than 1.5% of the LP objective. The present system produces an integral solution, when the set packing LP is loose, by solving the set packing ILP over Ĝ.
The following section will discuss the performance improvements provided by the system of the present disclosure using DOI (dual optimal inequalities). To establish the value of DOI, the present system decouples the value provided by DOI from that provided by varying the solver. The solver is defined by an LP toolbox (e.g., linprog, CPLEX, Gurobi), options for the toolbox such as the algorithm used (interior point, simplex, etc.), and the computer used. Decoupling the value added by DOI from that added by the solver is important, since some solvers work dramatically better than others, and DOI provide different speedups depending on the solver.
This difference in performance is accounted for by the number of iterations of column generation. Different solvers provide different dual optimal solutions; in column generation, the space of dual optimal solutions rarely consists of a single point but is rather a space of such points. Using dual optimal solutions that are well centered allows column generation to achieve faster convergence. Well centered solutions are solutions that have low L2 norm, meaning that the mass of the dual variables is not concentrated in a small number of variables.
Regarding a well centered solution, consider a step of pricing using a poorly centered dual solution, in which only a small set of observations D− has non-zero dual values. The hypotheses produced in pricing will not include D−, but will otherwise be similar to the columns produced in the first iteration of column generation, where the dual variables have value zero. Thus, the use of poorly centered solutions tends to lead to little progress in column generation.
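The centering notion above can be illustrated with a minimal sketch (values hypothetical): two dual vectors carrying the same total mass, one concentrated on a single observation and one spread evenly, compared by L2 norm.

```python
def l2_norm(v):
    """Euclidean (L2) norm of a dual vector."""
    return sum(x * x for x in v) ** 0.5

# Same total dual mass (5.0) distributed two ways.
concentrated = [5.0, 0.0, 0.0, 0.0, 0.0]  # poorly centered
spread = [1.0, 1.0, 1.0, 1.0, 1.0]        # well centered: lower L2 norm
```

The spread solution has the smaller L2 norm, which is the sense in which it is better centered.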
In the present system, the time spent performing pricing vastly exceeds that for solving the RMP (restricted master problem). Thus, using a faster toolbox, such as CPLEX or Gurobi, to solve the RMP adds little value if the resultant dual solution is not well centered. The solvers used for testing are as follows. Solver one: MATLAB 2016 linprog solver with default settings. Solver two: MATLAB 2017 with the interior point solver on a workstation.
Testing showed that the DOI that vary with Ĝ outperform those that are invariant. The use of DOI provides a large speedup for solver two (nearly a 20-times speedup) but a limited speedup for solver one (only a 1.4-1.6-times speedup). Further, solver one runs on an older computer with an older version of MATLAB than solver two, yet the timing results of solver one are better than those of solver two for each selection of DOI. This is a consequence of solver one producing well centered solutions and solver two not doing so. The use of DOI makes solver two perform almost as well as solver one, demonstrating the value of DOI when the solver is poorly selected.
The following experiments use the column generation enhancement of anytime lower bounds but not subset-row inequalities. DOI are not used in the experiments. The present system applies column generation for multi-cell segmentation on three different data sets. The problem instances include challenging properties, such as densely packed and touching cells, out-of-focus artifacts, and variations in the shape/size of cells.
To generate cost terms, the present system uses an open source toolbox to train a random forest classifier to discriminate: (1) boundaries of in-focus cells; (2) in-focus cells; (3) out-of-focus cells; and (4) background. For training, the present system used less than 1% of the pixels per dataset, with generic features (e.g., Gaussian, Laplacian, and structure tensor features). The outputs of this random forest classifier are also used to generate superpixels.
The performance of the system of the present disclosure was compared with prior art systems, in terms of detection (precision, recall and F-score), and segmentation (Dice coefficient and Jaccard index) which are common measures in bio-image analysis.
Next, performance of the present system with regard to a gap between the upper and lower bounds is considered. The gaps are normalized by dividing by an absolute value of the lower bound. For the three data sets, the proportion of problem instances that achieve normalized gaps under 0.1 are 99.28%, 80% and 100%, on datasets one, two, and three, respectively.
The functionality provided by the present disclosure could be provided by computer vision software code 206, which could be embodied as computer-readable program code stored on the storage device 204 and executed by the CPU 212 using any suitable, high or low level computing language, such as Python, Java, C, C++, C#, .NET, MATLAB, etc. The network interface 208 could include an Ethernet network interface device, a wireless network interface device, or any other suitable device which permits the server 202 to communicate via the network. The CPU 212 could include any suitable single-core or multiple-core microprocessor of any suitable architecture that is capable of implementing and running the computer vision software code 206 (e.g., Intel processor). The random access memory 214 could include any suitable, high-speed, random access memory typical of most modern computers, such as dynamic RAM (DRAM), etc.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art can make any variations and modification without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure. What is desired to be protected by letters patent is set forth in the following claims.
The present application claims the benefit of U.S. Provisional Patent Application Ser. No. 62/845,526 filed on May 9, 2019, the entire disclosure of which is expressly incorporated herein by reference.