The present invention pertains to tracking and data association, and particularly to tracking and associating targets, including those that may be temporarily occluded, merged, stationary, and the like. More particularly, the invention pertains to implementing techniques in tracking and data association.
The invention is a tracking system that incorporates several hypotheses.
FIGS. 2a and 2b show additional graph information for merge and split hypotheses, respectively;
FIGS. 3a, 3b and 3c are pictures showing tracking results for merge, split and reacquisition operations, respectively;
FIGS. 4a and 4b show tracking of targets using ground plane information;
Multiple-target tracking and association is a key component in visual surveillance. Tracking may provide a spatio-temporal description of detected moving regions in the scene. Such low-level information may be critical for recognition of human actions in video surveillance. In the present visual tracking approach, the observations are the detected moving blobs, or detected stationary or moving objects (e.g., faces, people, vehicles, and so forth). These observations may be referred to herein as blobs. Issues related to visual tracking may come from incomplete observations, occlusions and noisy foreground segmentation. The assumption that one detected blob corresponds to one moving object is not always true. Several factors may need to be considered for a good tracking algorithm, as follows. A single moving object (e.g., one person) may be detected as multiple moving blobs, and thus the tracking algorithm should "merge" the detected blobs. Similarly, one detected blob may be composed of multiple moving objects; in this case, the tracking algorithm should "split" and segment the detected blob. A detected blob could be a false alarm due to erroneous motion or object detection; here, the tracking algorithm should filter out these observations. In the presence of static or dynamic occlusions of the moving objects in the scene, one may often observe a partial occlusion, where the appearance information is affected, or a total occlusion, where no observation of the object is available. The lack of observation may also correspond to stop-and-go motion, since the observation may come from motion detection. Also, the number of objects in the scene may vary as new objects enter and leave the field-of-view of a camera, or detection mechanism or module.
A graph representing observations over time may be adopted. A multiple-target tracking approach may be formulated as finding the best multiple paths in the graph. To use the visual observations from an image sequence, both motion and appearance models may be introduced. Given these models, one may associate a weight to each edge defined between two nodes of the graph. Due to noisy foreground segmentation, one target may report multiple foreground regions, and one foreground region may correspond to multiple targets. To deal with these various issues, several types of operations, including merge, split and reacquisition (by appearance), may be introduced. If a prediction of one track at time t+1 has enough spatial overlap with more than one observation at time t+1, a merge operation may generate a new observation. When an observation at time t+1 is the best child of more than one track, this may incur a split operation, which splits a node into several new observations. A reacquisition operation may be used to handle misdetection. New observations added to the graph may carry a hypothesis proposed by a merge, split or reacquisition operation. A final decision about tracking may be made by considering all of the observations in the graph.
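As an illustration of how such an observation graph might be organized in code, the sketch below (with all class and field names invented for illustration, not taken from the source) stores blobs as nodes and attaches likelihood weights to forward-in-time edges:

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """One detected blob at a given frame (a node in the graph)."""
    frame: int
    bbox: tuple                     # (x, y, w, h) in image coordinates
    hypothesis: str = "detected"    # or "merge", "split", "reacquired"

@dataclass
class ObservationGraph:
    """Directed graph of observations over a temporal window."""
    nodes: list = field(default_factory=list)
    edges: dict = field(default_factory=dict)  # (i, j) -> log-likelihood weight

    def add_node(self, obs):
        self.nodes.append(obs)
        return len(self.nodes) - 1

    def add_edge(self, i, j, weight):
        # Edges only point forward in time, linking earlier to later blobs.
        assert self.nodes[i].frame < self.nodes[j].frame
        self.edges[(i, j)] = weight

# Two blobs of the same target in consecutive frames, linked by one edge.
g = ObservationGraph()
a = g.add_node(Observation(frame=1, bbox=(10, 10, 20, 40)))
b = g.add_node(Observation(frame=2, bbox=(12, 11, 20, 40)))
g.add_edge(a, b, weight=-0.3)
```

The hypothesis field mirrors the text's idea that merge, split and reacquisition nodes coexist in the graph with ordinary detections until the final path decision.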
The present multiple-target tracking algorithm may be widely used in a visual surveillance application. An input for the tracking algorithm may include the foreground regions and original image sequences, or detected objects in the image. A foreground region can usually be provided by a motion detection procedure. An observation graph may be constructed, which contains all of the observations within a time period. An edge between nodes may be weighted by a joint motion and appearance likelihood. The motion likelihood may be computed with a Kalman filter. The appearance likelihood may be based on the Kullback-Leibler (KL) distance between two non-parametric appearance models. Next, one may perform an optimal path selection in the graph to find the best temporal and spatial trajectories of the targets.
Multiple-target tracking may be considered as a maximum a posteriori (MAP) problem. To make full use of the visual observations from the image sequence, both motion and appearance likelihoods may be introduced. The graph representation of all observations over time may be adopted. A final decision on the trajectories of the targets may be delayed until enough observation is obtained.
The observations may be expanded with hypotheses added by merge, split and reacquisition operations, which are designed to deal with noisy foreground segmentation due to occlusion, foreground fragmentation and missed detections. These added hypotheses may be validated during the MAP estimate. A MAP formulation of the multiple-target tracking approach, and the motion and appearance likelihoods, may be noted.
In a multiple-target tracking approach, an objective is to track multiple target trajectories over time given noisy measurements provided by a motion detection algorithm. The targets' positions and velocities may be automatically initialized and should not require operator interaction, or they could be provided by the operator. The detector may usually provide image blobs which contain the estimated location and size, and the appearance information as well. Within an arbitrary time span [1, T], there may be an unknown number K of targets in the monitored scene. y_t = {y_t^i : i = 1, . . . , n_t} may denote the observations at time t, and Y = ∪_{t∈{1, . . . , T}} y_t may be the set of all the observations within the duration [1, T]. The multiple-target tracking can be formulated as finding the set of K best paths {τ_1, τ_2, . . . , τ_K} in the temporal and spatial space, where K is unknown. Let τ_k denote a track by the set of its observations: τ_k = {τ_k(1), τ_k(2), . . . , τ_k(T)}, where τ_k(t) ∈ y_t represents the observation of track τ_k at time t.
A graph representation G = <V, E> of all measurements within time [1, T] may be utilized. The graph is a directed graph that consists of a set of nodes V = {y_t^k : t = 1, . . . , T, k = 1, . . . , K}. To account for missing detections, one special measurement y_t^0 may represent the null measurement at time t. A directed edge (y_{t1}^i, y_{t2}^j) ∈ E, t1 < t2, may connect two observations at different time instants.
FIG. 2a shows a hypothesis added by a merge operation. Node 15 is the prediction on the left at t+1. Node 16 is a new node added to the graph on the right at t+1.
The multiple-target tracking may be formulated as a maximum a posteriori (MAP) problem, given the observations over time, to find the K best paths τ*_{1, . . . , K} through the graph of measurements:
τ*_{1, . . . , K} = arg max P(τ_{1, . . . , K} | Y). (1)
The posterior of the K best paths may be represented as the observation likelihood of the K paths and the prior of the K paths. A prior distribution model P(τ_k : k = 1, . . . , K) may be represented as
where T_m is the number of measurements associated with the tracks and F_m is the number of measurements not associated with the tracks. p(F_m) may be a Poisson distribution of F_m, and p_d denotes the detection rate, which may be estimated from prior knowledge of the detection procedure. By introducing this prior information, the posterior of the unknown K paths may be represented as
P(τ_{1, . . . , K} | Y) ∝ P(Y | τ_{1, . . . , K}) P(τ_{1, . . . , K}). (2)
The K-paths multiple-target tracking may be extended to a MAP estimate as
τ*_{1, . . . , K} = arg max (P(Y | τ_{1, . . . , K}) P(τ_{1, . . . , K})). (3)
Since the measurements are image blobs, besides position and dimension (width and height) information, an appearance model may be considered in the tracking approach. To make full use of the visual cues of the observations, both motion and appearance may be considered as likelihood measures. By assuming each target is moving independently, the joint likelihood of the K paths over time [1, T] may be represented as
A joint probability may be defined by a product of the appearance and motion probabilities. This probability maximization approach may be inferred by using a Viterbi algorithm (see Kang et al., "Continuous tracking within and across camera streams", IEEE Conference on CVPR 2003, Madison, Wis., which is hereby incorporated by reference). Other algorithms may be utilized.
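As a rough sketch of such a path-selection step, the dynamic program below finds a single most likely chain of observations, one per frame, in the Viterbi manner. The `frames` and `edge_logp` interfaces are hypothetical, and a full multi-target version would extract K paths rather than one:

```python
def best_path(frames, edge_logp):
    """Viterbi-style dynamic program over an observation trellis.

    `frames` is a list of lists of observation ids (one inner list per
    time step); `edge_logp(u, v)` returns the joint motion+appearance
    log-likelihood of linking observation u to observation v.
    Returns the single best path and its total log-likelihood."""
    # score[v] = best log-likelihood of any path ending at v
    score = {v: 0.0 for v in frames[0]}
    back = {}                                  # back-pointers for traceback
    for prev, cur in zip(frames, frames[1:]):
        new_score = {}
        for v in cur:
            best_u = max(prev, key=lambda u: score[u] + edge_logp(u, v))
            new_score[v] = score[best_u] + edge_logp(best_u, v)
            back[v] = best_u
        score = new_score
    # Trace back from the best final node.
    v = max(score, key=score.get)
    path = [v]
    while path[-1] in back:
        path.append(back[path[-1]])
    return path[::-1], score[v]

# Toy trellis: one observation at t=1, two candidates at t=2, one at t=3.
w = {(0, 1): -1.0, (0, 2): -0.5, (1, 3): -0.2, (2, 3): -0.1}
path, score = best_path([[0], [1, 2], [3]], lambda u, v: w[(u, v)])
# path is [0, 2, 3], the chain with the larger total log-likelihood
```

In the toy example the path through node 2 wins because its edge weights sum to -0.6 rather than -1.2, illustrating how the joint likelihood selects among competing associations.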
A constant-velocity motion model in the 2D image plane and 3D ground plane may be considered. x_t^k may denote the state vector of target k at time t, taken to be [l_x, l_y, w, h, i_x, i_y, l_gx, l_gy] (position, width, height and velocity in the 2D image, and position on the ground plane). One may consider a linear kinematic model,
x_{t+1}^k = A^k x_t^k + w_t^k, (5)
where x_t^k is the state vector for target k at time t. w_t^k may be assumed to have a normal probability distribution, w^k ~ N(0, Q^k). A^k may be a transition matrix. A constant-velocity motion model may be used. The observation y_t^k = [u_x, u_y, w, h, u_gx, u_gy] may contain a measurement of a target's position and size in the 2D image plane and its position on the 3D ground plane. Since observations often contain false alarms, the observation model may be represented as
where y_t^k represents the measurement, which may arise either from a false alarm or from the target. δ_t may be the false alarm rate at time t. The H^k matrix may also serve to take the ground plane into account, as one could use it to map 2D observations to 3D measurements. A measurement may be provided as a linear model of the current state if it is from a target; otherwise it is modeled as a false alarm δ_t, which is assumed to be a uniform distribution.
τ̂_k(t) and P̂_t(τ_k) may denote the posterior state estimate and the posterior estimate of the error covariance matrix of τ_k at time t. Along a track τ_k, the motion likelihood of one edge (τ_k(t1), τ_k(t2)) ∈ E, t1 < t2, may be represented as P_motion(τ_k(t2) | τ̂_k(t1)). Given the transition and observation model in a Kalman filter, the motion likelihood may then be written as
where e = y_t^k − H A τ̂_k(t1) is the innovation and P̃_t is its covariance.
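A minimal sketch of the Kalman machinery behind this motion likelihood is given below, assuming a reduced state [x, y, vx, vy] rather than the full eight-dimensional state of the text; the noise magnitudes are arbitrary illustrative values:

```python
import numpy as np

dt = 1.0
# Constant-velocity transition for the reduced state [x, y, vx, vy].
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],      # we observe position only in this sketch
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)             # process noise w_t^k ~ N(0, Q)
R = 0.10 * np.eye(2)             # measurement noise

def predict(x, P):
    """Time update: propagate state and covariance through the model."""
    return A @ x, A @ P @ A.T + Q

def update(x, P, y):
    """Measurement update; also returns the innovation and its covariance."""
    e = y - H @ x                          # innovation
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    return x + K @ e, (np.eye(4) - K @ H) @ P, e, S

def motion_loglik(e, S):
    """Gaussian log-likelihood of the innovation, the edge's motion score."""
    _, logdet = np.linalg.slogdet(2 * np.pi * S)
    return -0.5 * (e @ np.linalg.inv(S) @ e + logdet)

# A target at the origin moving right at unit speed, confirmed by a blob.
x0, P0 = np.array([0., 0., 1., 0.]), np.eye(4)
x_pred, P_pred = predict(x0, P0)
x_new, P_new, e, S = update(x_pred, P_pred, np.array([1., 0.]))
```

A perfectly predicted measurement gives a zero innovation and the maximal motion score for that edge, matching the role of P_motion in the edge weighting.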
The tracking of each region may rely on the kinematic model, described herein, as well as on an appearance model. The appearance of each detected region may be modeled using a non-parametric histogram. All RGB bins may be concatenated to form a one-dimensional histogram. The appearance likelihood between two image blobs (τ_k(t1), τ_k(t2)) ∈ E, t1 < t2, in track k may be measured using a symmetric Kullback-Leibler (KL) divergence, defined in the following.
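The symmetric KL comparison can be sketched as follows; the histogram construction (8 bins per channel, concatenated, as one plausible reading of the text) and the `eps` guard against empty bins are illustrative choices:

```python
import math

def rgb_histogram(pixels, bins=8):
    """Concatenated per-channel histogram of (r, g, b) pixels with 8-bit
    channels, normalized by the pixel count."""
    pixels = list(pixels)
    hist = [0.0] * (3 * bins)
    for r, g, b in pixels:
        for c, v in enumerate((r, g, b)):
            hist[c * bins + min(v * bins // 256, bins - 1)] += 1.0
    total = max(len(pixels), 1)
    return [h / total for h in hist]

def symmetric_kl(p, q, eps=1e-8):
    """Symmetric Kullback-Leibler divergence KL(p||q) + KL(q||p)
    between two histograms; eps guards empty bins."""
    def kl(a, b):
        return sum(ai * math.log((ai + eps) / (bi + eps))
                   for ai, bi in zip(a, b))
    return kl(p, q) + kl(q, p)

# Two uniformly colored blobs: a reddish one and a bluish one.
h1 = rgb_histogram([(200, 30, 30)] * 50)
h2 = rgb_histogram([(30, 30, 200)] * 50)
```

The divergence is zero for identical histograms and grows with appearance mismatch, so an appearance likelihood can be taken as a decreasing function of it (for example exp of its negative) when weighting an edge.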
Other appearance models may also be used by the present framework.
Given the motion and appearance models, one may associate a weight to each edge defined between two nodes of the graph. This weight may combine the appearance and motion likelihood models presented herein.
In equations (7) and (9), one may assume that the state of the target at time t is determined by the previous state at time t−1, and that the observation at time t is a function of the state at time t alone, i.e., a Markov condition. One may also assume that the motion and appearance of different targets are independent. Thus, the joint likelihood of the K paths in equation (5) may be factorized as in the following.
An augmented graph representation for a multiple hypothesis tracker may be provided. Many multiple target tracking algorithms assume that no two paths pass through the same observation. This assumption appears reasonable when considering punctual observations. However, this assumption may often be violated in the context of a visual tracking situation, where the targets are not regarded as points and the inputs to the tracking algorithm are usually image blobs. A framework may be presented to handle split and merge behavior in estimating the best paths.
Merge and split hypotheses may be considered. Merge and split behaviors may correspond to a recursive association of new observations, given estimated trajectories. At a given time instant t, one may obtain K best paths, which are denoted as [τ_1^t, . . . , τ_K^t]. Using the estimated tracks, one may evaluate how the m_{t+1} observations {y_{t+1}^i : i = 1, . . . , m_{t+1}} at time t+1 fit the estimated tracks which end at time t. The spatial overlap between an estimated state at time t and a new observation may be considered as a primary cue. Several cases may be noted. First, if the prediction of one track spatially overlaps with more than one observation at time t+1, a merge operation may generate a new combined observation.
Second, if the predicted positions and shapes of more than one track spatially overlap within one observation y*_{t+1} at time t+1, then the set of candidate tracks may be κ, |κ| > 1. The "split" operation may proceed as in the following. For each track τ_k^t in κ whose prediction has sufficient overlap with y*_{t+1}, one may adjust the predicted size and location at time t+1 to find the best appearance score s_k under the color appearance model P_color.
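One plausible way to score the spatial-overlap cue driving these merge and split decisions is intersection-over-union between predicted and observed boxes, as sketched below; the overlap threshold is an invented illustrative value:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def merge_candidates(prediction, observations, thresh=0.2):
    """Indices of observations overlapping one track's predicted box;
    two or more hits would trigger a merge hypothesis."""
    return [i for i, o in enumerate(observations)
            if iou(prediction, o) > thresh]

def bounding_union(boxes):
    """The merged observation: the tightest box enclosing all fragments."""
    x = min(b[0] for b in boxes)
    y = min(b[1] for b in boxes)
    x2 = max(b[0] + b[2] for b in boxes)
    y2 = max(b[1] + b[3] for b in boxes)
    return (x, y, x2 - x, y2 - y)

# A tall predicted box overlapping two fragments plus one unrelated blob.
pred = (0, 0, 10, 20)
obs = [(0, 0, 10, 9), (0, 11, 10, 9), (50, 50, 5, 5)]
fragments = merge_candidates(pred, obs)
merged = bounding_union([obs[i] for i in fragments])
```

Here the two fragments are gathered into a single merge-hypothesis box matching the prediction, while the distant blob is ignored; the symmetric test (one observation overlapping several predictions) would drive the split case.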
A reacquisition hypothesis may be considered. Noisy segmentation of foreground regions often provides incomplete observations not suitable for a good estimation of the position of the tracked objects. Indeed, moving objects are often fragmented, several objects may be merged into a single blob, and regions are not necessarily detected in the case of stop-and-go motion.
Additional information may be incorporated from the images for improving appearance-based tracking. Since the appearance histogram of each target has been maintained at each time t, the reacquisition operation may be introduced to keep track of the appearance distribution when the blob does not provide good enough input. The reacquisition approach may be regarded as a mode-seeking approach and has been successfully applied to a tag-to-track situation. The central module of the tracker may perform reacquisition iterations to find the most probable target position in the current frame according to the previous target appearance histogram. In the present multiple-target tracking situation, if a reliable track is not associated with a good observation at time t, due to a fragmented detection, non-detection or a large mismatch in size, one may instantiate a reacquisition algorithm to propose the most probable target position given the appearance of the track. One may note that the histogram used by the reacquisition algorithm may be established using past observations along the path (within a sliding window), instead of using only the latest one. Using a predicted position from the reacquisition, a new observation may be added to the graph. The final decision may be made by considering all of the observations in the graph. To prevent the reacquisition from tracking a target after it leaves the field of view, the reacquisition hypothesis may be considered only for trajectories where the ratio of real nodes to the total number of observations along the track is larger than a certain threshold.
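The reacquisition proposal can be sketched as a search over candidate windows near the predicted position, keeping the one whose histogram best matches the track's appearance model. The `image_patches` interface (candidate position mapped to its histogram) and the distance function are hypothetical stand-ins for the mode-seeking iterations:

```python
def reacquire(image_patches, track_hist, distance):
    """Propose a reacquisition observation for a track with no good blob.

    `image_patches` maps candidate position -> appearance histogram for
    windows around the predicted position; `distance` is an appearance
    divergence (e.g. a symmetric KL).  Returns the best position and its
    histogram, to be added to the graph as a hypothesis node."""
    best_pos = min(image_patches,
                   key=lambda p: distance(image_patches[p], track_hist))
    return best_pos, image_patches[best_pos]

# Two candidate windows; the track's stored histogram matches the second.
patches = {(0, 0): [0.9, 0.1], (5, 0): [0.5, 0.5]}
l1 = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
pos, hist = reacquire(patches, [0.5, 0.5], l1)
```

The proposed node then competes with ordinary detections in the final path selection, rather than being committed immediately, consistent with the graph-based decision described above.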
In use of the present system, a sliding temporal window of 45 frames may be used to implement the present algorithm as an online algorithm. The graph may contain observations between time t and t+45. When new observations are added to the graph, the observations older than t may be removed from the graph.
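The sliding-window maintenance can be sketched as below; the 45-frame window matches the text, while the class name is invented and pruning of edges touching expired nodes is omitted:

```python
from collections import deque

WINDOW = 45  # frames retained in the graph, per the text

class SlidingGraph:
    """Keeps only the last WINDOW frames of observations so the batch
    MAP estimate can run as an online algorithm."""
    def __init__(self, window=WINDOW):
        self.window = window
        self.frames = deque()            # (frame_index, observations)

    def push(self, t, observations):
        self.frames.append((t, observations))
        # Drop observations that have fallen out of the temporal window.
        while self.frames and self.frames[0][0] <= t - self.window:
            self.frames.popleft()

sg = SlidingGraph()
for t in range(100):
    sg.push(t, [])
```

After 100 frames only the most recent 45 remain, so the graph's size, and hence the cost of the path search, stays bounded as new frames arrive.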
The present tracking algorithm may be tested and used on both indoor and outdoor data sets. The data considered may be collected inside of a laboratory, and around parking lots and other facilities. In the considered data set, a large number of partial or complete occlusions between targets (pedestrians and vehicles) may be observed. In conducted tests, the input considered for the tracking algorithm may include the foreground regions and the original image sequence. One may test the accuracy of the present tracking algorithm and compare it to the classical approaches without the added merge, split and reacquisition hypotheses.
FIGS. 3a-3c show data sets with tracking results overlaid and the foreground detected. Due to noisy foreground segmentation, the input foreground for one target could have multiple fragment regions, as shown in FIG. 3a.
In the case where two or more moving objects are very close to each other, one may have a single moving blob for all of the moving objects, as shown in FIG. 3b.
The case where the targets merge into the background is shown in FIG. 3c.
Given the homography between the ground plane and the image plane, the targets may be tracked on the 3D ground plane, as shown in FIGS. 4a and 4b.
The present approach may be used for multiple-target tracking in video surveillance. If the application scenarios are partitioned into easy, medium and difficult cases, many tracking algorithms may handle the easy cases rather well. However, in the medium and difficult cases, multiple targets could be merged into one blob, especially during partial occlusion, and one target could be split into several blobs due to noisy background subtraction. Also, missed detections may happen often in the presence of stop-and-go motion, or when one is unable to distinguish foreground from background regions without adjusting the detection parameters to each sequence considered.
The mechanism introduced here is based on multiple hypotheses which expand the solution space. The present formulation of multiple-target tracking as a maximum a posteriori (MAP) problem, with the set of hypotheses extended by merge, split and reacquisition operations, is very robust. It may deal with noisy foreground segmentation due to occlusion, foreground fragments and missing detections. It shows good performance on various data sets.
In the present specification, some of the matter may be of a hypothetical or prophetic nature although stated in another manner or tense.
Although the invention has been described with respect to at least one illustrative example, many variations and modifications will become apparent to those skilled in the art upon reading the present specification. It is therefore the intention that the appended claims be interpreted as broadly as possible in view of the prior art to include all such variations and modifications.
The present application claims the benefit of U.S. Provisional Application No. 60/804,761, filed Jun. 14, 2006. U.S. Provisional Application No. 60/804,761, filed Jun. 14, 2006, is hereby incorporated by reference. The present application is a continuation-in-part application of U.S. patent application Ser. No. 11/548,185, filed Oct. 10, 2006. U.S. patent application Ser. No. 11/548,185, filed Oct. 10, 2006, is hereby incorporated by reference. The present application is a continuation-in-part application of U.S. patent application Ser. No. 11/562,266, filed Nov. 21, 2006. U.S. patent application Ser. No. 11/562,266, filed Nov. 21, 2006, is hereby incorporated by reference.
Related U.S. Application Data:
Provisional application 60/804,761, filed Jun. 2006 (US).
Parent application 11/548,185, filed Oct. 2006 (US); child application 11/761,171 (US).
Parent application 11/562,266, filed Nov. 2006 (US); child application 11/548,185 (US).