This invention relates generally to the tracking of multiple items.
The tracking of multiple objects (such as, but not limited to, objects in a video sequence) is known in the art. Considerable interest exists in this regard as successful results find application in various use case settings, including but not limited to target identification, surveillance, video coding, and communications. The tracking of multiple objects becomes particularly challenging when objects that are similar in appearance draw close to one another or present partial or complete occlusions. In such cases, modeling the interaction amongst objects and solving the corresponding data association problem comprises a significant problem.
A widely adopted solution to address this need uses a centralized approach that introduces a joint state space representation, concatenating all of the objects' states together to form a large resultant meta state. This approach provides for inferring the joint data association by characterizing all possible associations between objects and observations using any of a variety of known techniques. Though successful for many purposes, such approaches are neither a comprehensive solution nor always a desirable approach in and of themselves.
As one example in this regard, these approaches tend to handle an error merge problem at tremendous computational cost due to the complexity inherent to the high dimensionality of the joint state representation. In general, this complexity tends to grow exponentially with respect to the number of objects being tracked. As a result, in many real world applications these approaches are simply impractical for real-time purposes.
The above needs are at least partially met through provision of the method and apparatus to facilitate disambiguating state information for multiple items described in the following detailed description, particularly when studied in conjunction with the drawings, wherein:
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions and/or relative positioning of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of various embodiments of the present invention. Also, common but well-understood elements that are useful or necessary in a commercially feasible embodiment are often not depicted in order to facilitate a less obstructed view of these various embodiments of the present invention. It will further be appreciated that certain actions and/or steps may be described or depicted in a particular order of occurrence while those skilled in the art will understand that such specificity with respect to sequence is not actually required. It will also be understood that the terms and expressions used herein have the ordinary meaning as is accorded to such terms and expressions with respect to their corresponding respective areas of inquiry and study except where specific meanings have otherwise been set forth herein.
Generally speaking, pursuant to these various embodiments, automatic use of a disjoint probabilistic analysis of captured temporally parsed data regarding at least a first and a second item serves to facilitate disambiguating state information as pertains to the first item from information as pertains to the second item. This can also comprise, for example, using a joint probability as pertains to the temporally parsed data for the first item and the temporally parsed data for the second item, by using, for example, a Bayesian-based probabilistic analysis of the temporally parsed data.
The latter can comprise using, if desired, a transitional probability as pertains to temporally parsed data for the first item as was captured at a first time and temporally parsed data for the first item as was captured at a second time that is different than the first time (by using, for example, a transitional probability as pertains to first state information for the first item as pertains to the first time and second state information for the first item as pertains to the second time) as well as using a transitional probability as pertains to temporally parsed data for the second item as was captured at the first time and temporally parsed data for the second item as was captured at the second time (by using, for example, a transitional probability as pertains to first state information for the second item as pertains to the first time and second state information for the second item as pertains to the second time).
This approach can further comprise, if desired, using a conditional probability as pertains to temporally parsed data for the first item and state information for the first item as well as a conditional probability as pertains to temporally parsed data for the second item and state information for the second item.
In effect, these teachings relate to providing multiple interactive trackers in a manner that extends beyond a traditional use of Bayesian tracking in a tracking structure. In particular, this approach avoids using a joint state representation that introduces high complexity and that requires correspondingly high computational costs. By these teachings, as objects exhibit interaction, such interaction can be modeled in terms of potential functions. By one approach, this can comprise modeling the interactive likelihood densities by a so-called gravitational attraction versus a so-called magnetic repulsion scheme. In addition, if desired, one can approximate a 2nd order state transition density by an ad hoc 1st order inertia Markov chain in a unified particle filtering implementation. The proposed models represent the cumulative effect of virtual physical forces that objects undergo while interacting with one another. Those skilled in the art will recognize and appreciate that these approaches implicitly handle the error merge problems of the prior art and further serve to minimize corresponding object labeling problems.
These and other benefits may become clearer upon making a thorough review and study of the following detailed description. Referring now to the drawings, and in particular to
This step of capturing temporally parsed data can therefore comprise, for example, providing a video stream as provided by a single data capture device of a particular scene (such as a scene of a sidewalk, an airport security line, and so forth) where various of the frames contain data (that is, images of objects) that represent samples captured at different times. Although, as noted, such data can comprise a wide variety of different kinds of objects, for the sake of simplicity and clarity the remainder of this description shall presume that the objects are images of physical objects unless stated otherwise. Those skilled in the art will recognize and understand that this convention is undertaken for the sake of illustration and is not intended as any suggestion of limitation with respect to the scope of these teachings.
This process 100 then provides for automatically using 102, at least in part, disjoint probabilistic analysis of the temporally parsed data to disambiguate state information as pertains to a first such item from information (such as, but not limited to, state information) as pertains to a second such item. Those skilled in the art will understand that this process 100 does not require use of a disjoint probabilistic analysis in this regard under all operating circumstances; in many cases such an approach will only be automatically occasioned when such items approach near (and/or impinge upon) one another. In cases where such items are further apart from one another, if desired, alternative approaches can be employed.
Generally speaking, by one approach, this probabilistic analysis can comprise using, at least in part, a Bayesian-based probabilistic analysis of the temporally parsed data. This can comprise, at least in part, using a joint probability as pertains to the temporally parsed data for the first item and the temporally parsed data for the second item. More detailed examples will be provided below in this regard.
This step can further comprise, if desired, using transitional probabilities as pertain to these items. For example, this step will accommodate using a first transitional probability as pertains to temporally parsed data (such as, but not limited to, first state information) for the first item as was captured at a first time and temporally parsed data (such as, but not limited to, second state information) for this same first item as was captured at a second time that is different than the first time. In a similar fashion, this step will accommodate using another transitional probability as pertains to temporally parsed data (such as, but not limited to, first state information) for the second item as was captured at the first time and temporally parsed data (such as, but not limited to, second state information) for this same second item as was captured at that second time.
This step will also further accommodate, if desired, effecting the aforementioned Bayesian-based probabilistic analysis of the temporally parsed data by using conditional probabilities. In particular, for example, this can comprise using a first conditional probability as pertains to temporally parsed data and state information for the first item and a second conditional probability as pertains to temporally parsed data and state information for the second item. Again, more details regarding such approaches are provided below.
Those skilled in the art will appreciate that the above-described processes are readily enabled using any of a wide variety of available and/or readily configured platforms, including partially or wholly programmable platforms as are known in the art or dedicated purpose platforms as may be desired for some applications. Referring now to
In this illustrative example, a processor 201 operably couples to a memory 202. The memory 202 serves to store the aforementioned captured temporally parsed data regarding at least a first and a second item. By one approach, this memory 202 can be operably coupled to a single image capture device 203 such as, but not limited to, a video camera that provides sequential frames of captured video content of a particular field of view.
The processor 201 is configured and arranged to effect the above-described automatic usage of a disjoint probabilistic analysis of the temporally parsed data to facilitate disambiguation of state information as pertains to the first item from information (such as, but not limited to, state information) as pertains to the second item. This can comprise some or all of the above-mentioned approaches in this regard as well as the more particular examples provided below. By one approach, this processor 201 can comprise a partially or wholly programmable platform as are known in the art. Accordingly, such a configuration can be readily achieved via programming of the processor 201 as will be well understood by those skilled in the art.
Those skilled in the art will recognize and understand that such an apparatus 200 may be comprised of a plurality of physically distinct elements as is suggested by the illustration shown in
A more detailed presentation of a particular approach to effecting such distributed multi-object tracking by use of multiple interactive trackers will now be provided. Again, those skilled in the art will understand and appreciate that this more-detailed description is provided for the purpose of illustration and not by way of limitation with respect to the scope or reach of these teachings.
The described process uses a four-dimensional parametric ellipse to model a visual object's boundary. The state of an individual object is denoted here by $x_t^i = (cx_t^i, cy_t^i, a_t^i, p_t^i)$, where $i = 1, \ldots, M$ is the index of objects, $t$ is the time index, $(cx, cy)$ is the center of the ellipse, $a$ is the major axis, and $p$ is the orientation in radians. The ratio of the major axis to the minor axis of the ellipse is held constant at the value computed during initialization in this example. This approach also denotes the image observation of $x_t^i$ by $z_t^i$, the set of all states up to time $t$ by $x_{0:t}^i$ where $x_0^i$ is a prior initialization, and the set of all observations up to time $t$ by $z_{1:t}^i$. This approach also denotes the interactive observations of $z_t^i$ at time $t$ by $z_t^J$
Since the interactive relationship among observations is likely changing, J may also differ over time. For example, in the graphical model 300 shown in
When multiple visual objects move close to one another or otherwise present partial or complete occlusions, it can be difficult for the trackers to segment and distinguish these spatially adjacent objects from image observations because the interactive observations are not independent (note that $p(z_t^1, \ldots, z_t^M) \neq \prod_{i=1}^{M} p(z_t^i)$). As a result, one cannot reliably factorize the posteriors of different objects. This conditional dependency of objects comprises, in the view of the inventors, a significant reason why multiple independent trackers have difficulty coping with the aforementioned error merge problem as well as the object labeling problem.
By one approach, the present teachings espouse using a separate tracker for each object. In such a case, an error merge problem can occur in at least two cases. First, when two visual objects move closer or begin to present occlusion, the object with the strong observation (in the sense of a large visual image) effectively pulls the tracker of the object with the weaker observation. Second, after occlusion, when two objects move apart, their associated optical trackers often cannot detach and remain bonded while simultaneously tracking the object with the stronger observation.
In these scenarios, it may be helpful to imagine the influence of an invisible force among the interactive trackers that attracts them to merge together when objects move closer and that prevents them from separating when these objects move apart. With this in mind, by analogy, one may then imagine these effects to be associated with each tracker's “mass.” When objects are far apart, the corresponding gravitational force between their trackers is relatively weak and can be effectively ignored. Conversely, when such objects are adjacent or occluded, this attractive force becomes relatively strong. This imaginary construct permits an interesting application of Newton's Laws.
By Newton's Third Law, the forces that two such trackers exert on one another remain equal in magnitude. At the same time, however, Newton's Second Law holds that trackers having different masses will experience correspondingly different accelerations. As a result, after several frames of captured data, the tracker having the smaller mass (which will correlate to a larger acceleration) will be attracted to merge with the object having the larger mass (i.e., the larger observation, which correlates to a smaller acceleration) and thus error merge will likely occur. To resist this excessive attraction, in this analogical example, a repulsive force can be introduced between these interacting trackers.
In particular, when objects move closer, a repulsive force can be introduced and used to prevent the trackers from falsely merging. As the objects move away, this repulsive force can also help the trackers to detach from one another. As will be demonstrated below, another analogy can be introduced to facilitate the introduction of such a repulsive force: magnetic field theory.
Referring again to
The directed link from object $x^i$ to its observation $z^i$ represents a generative relationship and can be characterized by the local observation likelihood $p(z^i \mid x^i)$. The undirected link between observation nodes represents the interaction itself. The structure of the observation layer at each time depends on the spatial relationships among the observations for the objects. That is, when observations for two or more visual objects are sufficiently close or lead to occlusion, an undirected link between them is constructed to represent that dependency.
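By way of a non-limiting illustration, the construction of undirected links between nearby observations might be sketched as follows. The function name `interaction_sets`, the use of ellipse centers, and the Euclidean threshold `max_dist` are assumptions made only for this sketch; the description above says only that a link is constructed when observations are sufficiently close or lead to occlusion.

```python
import math
from typing import Dict, List, Sequence, Tuple


def interaction_sets(
    centers: Sequence[Tuple[float, float]], max_dist: float
) -> Dict[int, List[int]]:
    """Return, for each object index i, the indices of the objects whose
    observations are linked to observation i at the current time.

    Assumption: two observations interact when their ellipse centers lie
    within max_dist of one another (a hypothetical closeness test).
    """
    links: Dict[int, List[int]] = {i: [] for i in range(len(centers))}
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) <= max_dist:
                # The link is undirected, so record it for both nodes.
                links[i].append(j)
                links[j].append(i)
    return links
```

Because the links are recomputed for each frame, the set of interacting indices for a given object can change over time, and an object with an empty set can be tracked independently.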
Those skilled in the art will note that the graphical model 300 illustrated in
By one approach these decomposed graphs all comprise directed acyclic independence graphs as are known in the art. By then applying the separation theorem to the associated moral graphs (where again both such notions are well known in the art) one then obtains the corresponding Markov properties (namely, the conditional independence of the decomposed graphs).
To model the density propagation for each object, one may then estimate the posterior based on all of the involved observations p(x0:ti|z1:ti, z1:tJ
The density propagation for each interactive tracker can be formulated as:
Equation 1 uses the conditional independence property p(zti|x0:ti, z1:t-1i, z1:tJ
The interactive likelihood can be expressed as shown in equation 2:
The local likelihood p(zti|zti) characterizes the so-called gravitational force between interactive observations.
The interactive prior density of x0:ti can be expressed as shown below in equations 3 and 4:
In equation 3 the conditional independence property p(xti, ztJ
By substituting equations 2 and 4 back into equation 1 and then rearranging the order, one obtains:
The densities in the denominator of equation 5 do not depend on $x^i$, and thus the fraction in the second line of equation 5 becomes a normalization constant $k_t$. In equation 6, $p(z_t^i \mid x_t^i)$ is the local likelihood, and $p(x_t^i \mid x_{0:t-1}^i)$ is the state transition density. The present teachings introduce a new density $p(z_t^J \mid x_t^i, z_t^i)$, referred to here as an interactive function, to characterize the interaction among the objects' observations. When the interaction among the objects' observations is not activated, this formulation reduces to multiple independent particle filters. This can easily be achieved by switching $p(z_t^J \mid x_t^i, z_t^i)$ to a uniform distribution.
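As a rough sketch only, the way the interactive function enters a particle weight, and the way a uniform interactive function recovers an independent particle filter, might look like the following. The function name `update_weights` is hypothetical, and the sketch assumes that particles are proposed from the state transition density, so that each weight is proportional to the previous weight times the local likelihood times the interactive function.

```python
from typing import List, Optional, Sequence


def update_weights(
    prev_weights: Sequence[float],
    local_lik: Sequence[float],
    interactive: Optional[Sequence[float]] = None,
) -> List[float]:
    """Re-weight the particles of one tracker and normalize the result.

    Assumption: the proposal equals the state transition density, so the
    unnormalized weight is prev_weight * local_likelihood * interactive.
    Passing interactive=None uses a uniform interactive function.
    """
    n = len(prev_weights)
    if interactive is None:
        interactive = [1.0] * n  # uniform: no interaction is modeled
    raw = [w * l * g for w, l, g in zip(prev_weights, local_lik, interactive)]
    total = sum(raw)
    if total <= 0.0:
        raise ValueError("all particle weights are zero")
    return [r / total for r in raw]
```

Any constant interactive value cancels in the normalization, which is why a uniform distribution leaves each tracker behaving as an ordinary independent particle filter.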
To estimate the posterior derived in the preceding, different density estimation methods (such as the Gaussian mixture model, kernel density estimation, and so forth) can be applied. By one approach, a sequential importance sampling method as is known in the art can provide a useful paradigm, in which the posterior is characterized by a set of $N$ weighted particles $\{x_{0:t}^{i,n}, w_t^{i,n}\}_{n=1}^{N}$.
where δ (.) is the Dirac delta function.
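For illustration only, a weighted set of particles of this kind can be used to form a point estimate of an object's state, because the expectation under a weighted sum of Dirac delta functions is simply the weighted sum of the support points. The function name `weighted_mean` is an assumption of this sketch.

```python
from typing import List, Sequence


def weighted_mean(
    particles: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Return the weighted mean state of a particle set.

    Each particle is a state vector (for example cx, cy, a, orientation)
    and each weight is non-negative; weights are normalized here.
    """
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    dim = len(particles[0])
    return [
        sum(w * p[k] for p, w in zip(particles, weights)) / total
        for k in range(dim)
    ]
```

Averaging an orientation component in this way is only reasonable when the particle orientations do not wrap around; that caveat is a limitation of this sketch.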
This results in a discrete weighted approximation to the true posterior density p(x0:ti|z1:ti, z1:tJ
In the sequential case, one could have particles constituting an approximation to p(x0:t-1i,n|z1:t-1i, z1:t-1J
One can then obtain particles x0:ti,n˜q(x0:ti,n|z1:ti, z1:tJ
For most application purposes, only xtn, xt-1n, and xt-2n need to be stored and one can effectively disregard the path x0:t-3n and the history of observations z1:t-1. By this approach the modified weight becomes as shown in equation 11:
As mentioned above, it becomes useful to introduce a so-called repulsion force to resist excessive attraction among the interactive observations and magnetic field theory provides an analogy to facilitate the description of this force. Consider, for the purposes of example and explanation, a simple case where ztJ
In this analogy the local likelihood p(zti|xti) only characterizes the intensity of the corresponding local magnetic field while the interactive function p(ztJ
where $\alpha_1$ is a normalization constant, $\sigma_1$ is a prior constant that characterizes the allowable maximal interaction distance, and $d_{i,n,t}$ is the distance between the current particle's observation and the interactive observation $z_t^j$. This distance can be, for example, the Euclidean distance $d_{i,n,t} = \lVert z_t^j - z_t^i \mid x_t^{i,n} \rVert$. For some practical purposes it can be acceptable, for simplicity, to use the reciprocal of the area of the objects' overlapping region to represent this distance, to set $\alpha_1 = 1$, and to choose $\sigma_1$ in the range of $10/A_o$ to $50/A_o$, where $A_o$ is the average area of the objects (ellipses) in the initial frame. In such a case the interactive function can be approximately estimated as shown in equation 13:
By one approach it can be useful to recursively locate the interactive observations and iterate the repulsion process to reach a relatively stable state.
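A repulsion-style weighting of the kind described above might be sketched as follows. The exact expressions of equations 12 and 13 are not reproduced in this sketch; the functional form used here, one minus a scaled Gaussian of the distance, is an assumption chosen only because it is small for nearby observations and approaches one for distant observations. The names `repulsion_weight`, `distance`, `sigma`, and `alpha` are likewise hypothetical.

```python
import math


def repulsion_weight(distance: float, sigma: float, alpha: float = 1.0) -> float:
    """Return a repulsion factor for one particle.

    Assumed form: 1 - (1 / alpha) * exp(-distance**2 / sigma**2).
    A particle whose observation sits on top of an interactive observation
    (distance near 0) is down-weighted; a distant particle is left nearly
    untouched (factor near 1).
    """
    return 1.0 - math.exp(-(distance ** 2) / (sigma ** 2)) / alpha
```

Multiplying each particle's weight by such a factor pushes the tracker's probability mass away from the neighboring observation, which is the qualitative behavior the repulsion analogy calls for.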
When zti has two interactive observations ztJ
where $\alpha_{11}$ and $\alpha_{12}$ are normalization constants, $\sigma_{11}$ and $\sigma_{12}$ are again prior constants, and $d_{i,j_1,n,t}$ and $d_{i,j_2,n,t}$ are the distances between the current particle's observation $z_t^i \mid x_t^{i,n}$ and the other interactive observations $z_{t,k}^{j_1}$ and $z_{t,k}^{j_2}$, respectively. For some application purposes it can be acceptable to set $\alpha_{11} = \alpha_{12} = 1$ and to choose $\sigma_{11}$ and $\sigma_{12}$ in the range of $10/A_o$ to $50/A_o$, where $A_o$ is the average area of the objects (ellipses) in the initial frame.
By leveraging this magnetic potential model, the interactive function p(ztJ
By one approach, an ad hoc 1st order inertia Markov chain can serve to estimate the 2nd order state transition density p(xti|xt-1i, xt-2i) and solve the aforementioned object labeling problem with considerably reduced computational cost. This approach is exemplified in equation 15 as follows:
where the state transition density $p(x_t^i \mid x_{t-1}^i)$ can be modeled by a 1st order Markov chain as usual in a typical Bayesian tracking method. This can be estimated by either a constant acceleration model or by a Gaussian random walk model. $\varphi_t^i(\cdot)$ comprises an inertia function and relates to two posteriors.
The inertia weights are defined as shown below in equation 16
where α2 is a normalization term and σ21 and σ22 are prior constants that characterize the allowable variances of a motion vector's direction and speed respectively. In equation 16,
is the angle between
are the Euclidean metrics. Accordingly, the inertia function can be approximated as shown in equation 17 below:
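As a hedged illustration of an inertia weighting of this general kind, the sketch below scores a candidate particle by how well its motion vector agrees, in direction and in speed, with the motion observed over the two preceding time steps. The Gaussian-shaped form, the function name `inertia_weight`, and the parameter names `sigma_dir`, `sigma_speed`, and `alpha` are assumptions of this sketch and are not taken from equations 16 or 17.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def inertia_weight(
    prev2: Point,
    prev1: Point,
    particle: Point,
    sigma_dir: float,
    sigma_speed: float,
    alpha: float = 1.0,
) -> float:
    """Weight a particle by agreement with the recent motion of the object.

    prev2 and prev1 are the estimated centers at times t-2 and t-1, and
    particle is a candidate center at time t.
    """
    ref = (prev1[0] - prev2[0], prev1[1] - prev2[1])        # reference motion
    cur = (particle[0] - prev1[0], particle[1] - prev1[1])  # particle motion
    dot = ref[0] * cur[0] + ref[1] * cur[1]
    cross = ref[0] * cur[1] - ref[1] * cur[0]
    theta = math.atan2(abs(cross), dot)                     # angle in [0, pi]
    speed_diff = math.hypot(*cur) - math.hypot(*ref)        # Euclidean norms
    direction_term = math.exp(-(theta ** 2) / (sigma_dir ** 2))
    speed_term = math.exp(-(speed_diff ** 2) / (sigma_speed ** 2))
    return direction_term * speed_term / alpha
```

A particle that continues the previous motion receives the largest weight, while a particle that would require an abrupt reversal is penalized, which discourages a tracker from jumping onto a neighboring object that is moving differently.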
The prior art has leveraged other image cues such as gradient, color, and motion in order to estimate a local observation likelihood. Here, if desired, one can combine existing color histogram models and a principal component analysis (PCA)-based model to efficiently estimate the local likelihood exemplified by equation 18:
$$p(z_t^i \mid x_t^i) = p_c \cdot p_p \qquad (18)$$
where $p_c$ and $p_p$ are the likelihood densities estimated by the color histogram and PCA models, respectively.
For a color cue, one can use a Bhattacharyya distance to measure the similarity between a reference histogram hoi that is obtained prior to tracking and the histogram hti,n that is determined by particle xti,n for object i. Equation 19 exemplifies such an approach:
where b is the index of bins. The color factor can then be specified by a Gaussian distribution with variance σc as illustrated in equation 20:
In this example, the color space employed is simply the normalized YCbCr space with 8 bins for CbCr and only 4 bins coarsely provided for luminance.
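The color cue described above might be sketched as follows. This is a minimal illustration under stated assumptions: the Bhattacharyya distance is taken in the form of the square root of one minus the Bhattacharyya coefficient, as is common in color-based tracking, and the color factor is taken as a normalized zero-mean Gaussian in that distance. The function names and the parameter name `var_c` are hypothetical.

```python
import math
from typing import Sequence


def bhattacharyya_distance(h_ref: Sequence[float], h: Sequence[float]) -> float:
    """Distance between two histograms over the same bins.

    Both histograms are normalized to sum to one before comparison.
    Assumed form: sqrt(1 - sum_b sqrt(h_ref[b] * h[b])).
    """
    s_ref, s = sum(h_ref), sum(h)
    coeff = sum(math.sqrt((a / s_ref) * (b / s)) for a, b in zip(h_ref, h))
    return math.sqrt(max(0.0, 1.0 - coeff))  # clamp guards against rounding


def color_likelihood(h_ref: Sequence[float], h: Sequence[float], var_c: float) -> float:
    """Gaussian color factor with variance var_c (assumed normalization)."""
    d = bhattacharyya_distance(h_ref, h)
    return math.exp(-(d * d) / (2.0 * var_c)) / math.sqrt(2.0 * math.pi * var_c)
```

In use, `h_ref` would be the reference histogram gathered before tracking and `h` the histogram of the image region selected by a given particle, so that particles covering regions with similar color receive larger factors.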
To apply principal component analysis here, one may first collect a set of training examples of the objects to be tracked. One may then use singular value decomposition to obtain the Karhunen-Loève basis vectors. To measure the likelihood of an image region determined by $x_t^{i,n}$, one can calculate the Mahalanobis distance $d_p$ between the image region and the mean of the training examples. The PCA factor can be defined as a Gaussian distribution with variance $\sigma_p$ as illustrated in equation 21:
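The PCA cue might be sketched as follows, assuming NumPy is available and that each training example and each candidate image region has been flattened into a fixed-length feature vector. The function names, the choice of `n_components`, and the normalized Gaussian form of the factor are assumptions of this sketch rather than details taken from equation 21.

```python
import numpy as np


def fit_pca(train: np.ndarray, n_components: int):
    """Learn a mean, an orthonormal basis, and per-component variances.

    train has one flattened training example per row. The basis rows are
    the leading right singular vectors of the centered training matrix.
    """
    mean = train.mean(axis=0)
    centered = train - mean
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    variances = (s[:n_components] ** 2) / (train.shape[0] - 1)
    return mean, basis, variances


def mahalanobis_distance(region, mean, basis, variances) -> float:
    """Mahalanobis distance of a region from the training mean,
    measured within the retained principal subspace."""
    coeffs = basis @ (region - mean)
    return float(np.sqrt(np.sum(coeffs ** 2 / variances)))


def pca_likelihood(d_p: float, var_p: float) -> float:
    """Gaussian PCA factor with variance var_p (assumed normalization)."""
    return float(np.exp(-(d_p ** 2) / (2.0 * var_p)) / np.sqrt(2.0 * np.pi * var_p))
```

Restricting the distance to the leading components keeps the computation small per particle, which matters when many particles are evaluated for every tracked object in each frame.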
So configured, those skilled in the art will recognize and understand that these teachings comprise a distributed multiple-object tracking architecture that uses multiple interactive trackers and that extends traditional Bayesian tracking structures in a unique way. In particular, this approach eschews the joint state representation approach that tends, in turn, to require high complexity and considerable computational capabilities. Instead, a conditional density propagation mathematical structure is derived for each tracked object by modeling the interaction among the objects' observations in a distributed scheme. By estimating the interactive function and the state transition density using a magnetic-inertia potential model in the particle filtering implementation, these teachings implicitly handle the error merge problems and further lead to resolution of object labeling problems as well. These teachings are sufficiently respectful of computational requirements to readily permit use in a real-time application setting.
Those skilled in the art will recognize that a wide variety of modifications, alterations, and combinations can be made with respect to the above described embodiments without departing from the spirit and scope of the invention, and that such modifications, alterations, and combinations are to be viewed as being within the ambit of the inventive concept.