The above needs are at least partially met through provision of the method and apparatus to facilitate use of conditional probabilistic analysis of multi-point-of-reference samples of an item to disambiguate state information as pertains to the item described in the following detailed description, particularly when studied in conjunction with the drawings, wherein:
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions and/or relative positioning of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of various embodiments of the present invention. Also, common but well-understood elements that are useful or necessary in a commercially feasible embodiment are often not depicted in order to facilitate a less obstructed view of these various embodiments of the present invention. It will further be appreciated that certain actions and/or steps may be described or depicted in a particular order of occurrence while those skilled in the art will understand that such specificity with respect to sequence is not actually required. It will also be understood that the terms and expressions used herein have the ordinary meaning as is accorded to such terms and expressions with respect to their corresponding respective areas of inquiry and study except where specific meanings have otherwise been set forth herein.
Generally speaking, pursuant to these various embodiments, temporally parsed data regarding at least a first item is captured. This temporally parsed data comprises data that corresponds to substantially simultaneous samples of the first item with respect to at least first and second different points of reference. Conditional probabilistic analysis of at least some of this temporally parsed data is then automatically used to disambiguate state information as pertains to this first item. This conditional probabilistic analysis comprises analysis of at least some of the temporally parsed data as corresponds in a given sample to both the first point of reference and the second point of reference.
In cases where there is more than one such item, if desired, these teachings will further accommodate automatically using, at least in part, disjoint probabilistic analysis of the temporally parsed data as pertains to multiple such items to disambiguate state information as pertains to a given one of the points of reference for the first item from information as pertains to the given one of the points of reference for a second such item.
So configured, these teachings facilitate the use of multiple data capture points of view when disambiguating state information for a given item. These teachings achieve such disambiguation in a manner that requires considerably less computational capacity and capability than might otherwise be expected. In particular, these teachings are suitable for use in substantially real-time monitoring settings where a relatively high number of items, such as pedestrians or the like, are likely at any given time to be visually interacting with one another in ways that would otherwise tend to lead to confused or ambiguous monitoring results when relying only upon relatively modest computational capabilities.
Furthermore, and as will be evident below, these teachings provide a superior solution to multi-target occlusion problems by leveraging the availability of multiocular videos. These teachings permit avoidance of the computational complexity that is generally inherent in centralized methods that rely on joint-state representation and joint data association.
These and other benefits may become clearer upon making a thorough review and study of the following detailed description. Referring now to the drawings, an illustrative process 100 that is compatible with many of these teachings will now be presented. This process 100 provides for capturing 101 temporally parsed data regarding at least a first item.
This activity of capturing temporally parsed data can therefore comprise, for example, providing a video stream as provided by data capture devices of a particular scene (such as a scene of a sidewalk, an airport security line, and so forth) where various of the frames contain data (that is, images of objects) that represent samples captured at different times. Although, as noted, such data can comprise a wide variety of different kinds of objects, for the sake of simplicity and clarity the remainder of this description shall presume that the objects are images of physical objects unless stated otherwise. Those skilled in the art will recognize and understand that this convention is undertaken for the sake of illustration and is not intended as any suggestion of limitation with respect to the scope of these teachings.
Pursuant to these teachings, this activity of capturing temporally parsed data can comprise capturing temporally parsed data regarding at least a first item, wherein the temporally parsed data comprises data corresponding to substantially simultaneous samples of the at least first item with respect to at least first and second different points of reference. This can comprise, for example, providing data that has been captured using at least two cameras that are positioned to have differing views of the first item.
It will be understood and recognized by those skilled in the art that such cameras can comprise any combination of similar or dissimilar cameras: true color cameras, enhanced color cameras, monochrome cameras, still image cameras, video capture cameras, and so forth. It would also be possible to employ cameras that react to illumination sources other than visible light, such as infrared cameras or the like.
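By way of a non-limiting illustration, the following sketch (using the OpenCV library, with hypothetical device indices standing in for two cameras having differing views of the scene) shows one way such substantially simultaneous samples might be captured from two points of reference:

```python
# A minimal sketch, assuming OpenCV and two locally attached cameras at
# illustrative device indices 0 and 1 (both indices are assumptions).
import cv2

cam_a = cv2.VideoCapture(0)  # camera A (first point of reference)
cam_b = cv2.VideoCapture(1)  # camera B (second point of reference)

frames = []  # temporally parsed data: one (frame_a, frame_b) pair per sample time
for t in range(100):
    # grab() both devices first so the two exposures are as close in time as
    # the hardware permits, then retrieve() the buffered images.
    cam_a.grab()
    cam_b.grab()
    ok_a, frame_a = cam_a.retrieve()
    ok_b, frame_b = cam_b.retrieve()
    if not (ok_a and ok_b):
        break
    frames.append((frame_a, frame_b))

cam_a.release()
cam_b.release()
```

Each entry of `frames` then corresponds to one substantially simultaneous sample of the monitored scene from the two points of reference.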
This process 100 then provides for automatically using 102, at least in part, conditional probabilistic analysis of at least some of the temporally parsed data as corresponds in a given sample to the first point of reference and the second point of reference to disambiguate state information as pertains to the first item. By one approach, for example, this can comprise using conditional probabilistic analysis with respect to state information as corresponds to the first item. This can also comprise, if desired, determining whether to use a joint conditional probabilistic analysis or a non-joint conditional probabilistic analysis as will be illustrated in more detail below. And, if desired, this can also comprise determining whether to use such conditional probabilistic analysis for only some of the temporally parsed data or for substantially all (or all) of the temporally parsed data as corresponds to the given sample.
As noted above, this process 100 will accommodate the use of data as corresponds to more than one item. When the temporally parsed data comprises data corresponding to substantially simultaneous samples regarding at least a first item and a second item with respect to at least first and second different points of reference, the aforementioned step regarding disambiguation can further comprise automatically using conditional probabilistic analysis of at least some of the temporally parsed data to also disambiguate state information as pertains to the first item from information as pertains to the second item.
When multiple items are present, these teachings will also accommodate, if desired, optionally automatically using 103, at least in part, disjoint probabilistic analysis of the temporally parsed data to disambiguate state information as pertains to a given one of the points of reference for the first item from information as pertains to the given one of the points of reference for the second item. (A complete description of such analysis can be found in a patent application entitled METHOD AND APPARATUS TO DISAMBIGUATE STATE INFORMATION FOR MULTIPLE ITEMS TRACKING as was filed on Oct. 13, 2006 and which was assigned application Ser. No. 11/549,542, the contents of which are fully incorporated herein by this reference.) This, in turn, can optionally comprise using epipolar geometry within a sequential Monte Carlo implementation to substantially avoid attempting to match first item features with second item features. Generally speaking, by one approach, these teachings will accommodate using a distributed Bayesian framework to facilitate multiple-target tracking using multiple collaborative cameras. Viewed generally, these teachings facilitate provision and use of a multiple-camera collaboration model using epipolar geometry to estimate the camera collaboration function efficiently without requiring recovery of the targets' three-dimensional coordinates.
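By way of a hedged illustration of this distributed approach, the sketch below shows one sequential Monte Carlo tracker update in which each particle's posterior weight is formed as a product of independently supplied likelihood factors; in the distributed scheme described above, a tracker would use only the local observation factor when its target is isolated, and would multiply in target interaction and camera collaboration factors only when those are activated. All names, constants, and the toy Gaussian likelihood are illustrative assumptions rather than anything mandated by these teachings:

```python
# A minimal sketch of one distributed tracker's sequential Monte Carlo step.
import numpy as np

rng = np.random.default_rng(0)

def pf_step(particles, weights, likelihood_factors, motion_std=2.0):
    """One update for a single tracker (one target, one camera view).
    likelihood_factors: callables mapping the particle array to per-particle
    likelihoods; include only the factors that are currently activated."""
    # simple random-walk state dynamics p(X_t | X_{t-1}); an assumption
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    for factor in likelihood_factors:
        weights = weights * factor(particles)
    weights = weights / weights.sum()  # normalize
    # resample when the effective sample size collapses
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(weights):
        idx = rng.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights

# toy usage: 2-D position particles with a single Gaussian "local observation"
particles = rng.normal(0.0, 5.0, (200, 2))
weights = np.full(200, 1.0 / 200)
observation = np.array([1.0, -2.0])
local = lambda p: np.exp(-0.5 * np.sum((p - observation) ** 2, axis=1) / 4.0)
particles, weights = pf_step(particles, weights, [local])
```

Because each tracker runs such an update independently, additional targets and cameras add trackers (and likelihood factors) rather than enlarging a joint state, which is the source of the linear scaling noted above.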
A more detailed presentation of a particular approach to effecting such multiple-collaborative-camera processing will now be provided. Again, those skilled in the art will understand and appreciate that this more-detailed description is provided for the purpose of illustration and not by way of limitation with respect to the scope or reach of these teachings.
This example presumes the use of multiple trackers; in particular, one tracker per target in each camera view for multiple-target tracking in multiocular videos. Although this specific example refers to only two cameras for the sake of simplicity and clarity, these teachings can be easily generalized to cases using more cameras.
For the purposes of this explanation, the state of a target in a first camera (referred to hereafter as camera A) is denoted by X_t^{A,i}, where i = 1, . . . , M is the index of targets and t is the time index. The image observation of X_t^{A,i} is denoted by Z_t^{A,i}, the set of all states up to time t by X_{0:t}^{A,i} (where X_0^{A,i} is the initialization prior), and the set of all observations up to time t by Z_{1:t}^{A,i}. One can similarly denote the corresponding notions for targets in a second camera (referred to hereafter as camera B); for instance, the "counterpart" of X_t^{A,i} is X_t^{B,i}. This explanation further uses Z_t^{A,J_t} to denote the set of observations that interact with Z_t^{A,i} at time t, where J_t = {j_{t1}, j_{t2}, . . .}. The elements j_{t1}, j_{t2}, . . . ∈ {1, . . . , M}, with j_{t1}, j_{t2}, . . . ≠ i, are the indexes of targets whose observations interact with Z_t^{A,i}. When there is no interaction of Z_t^{A,i} with other observations at time t, J_t = ∅. Since the interaction structure among observations changes, J_t may vary in time. In addition, Z_{1:t}^{A,J_{1:t}} denotes the set of all such interacting observations up to time t.
Graphical models comprise an intuitive and convenient tool to model and analyze complex dynamic systems.
Pursuant to these teachings, one activates the interaction only when the targets' observations are in close proximity or occlusion. This can be approximately determined by the spatial relation between the targets' trackers since the exact locations of observations are typically unknown.
The directed curve link between the counterpart states of the same target in two cameras represents the "camera collaboration." This collaboration is activated between any possible collection of cameras only for targets that need help to improve their tracking robustness. For instance, such help may be needed when the targets are close to occlusion or are possibly completely occluded by other targets in a camera view. The direction of the link shows which target resorts to which other targets for help. This need-driven scheme avoids performing camera collaboration at all times and for all targets; a tremendous amount of computation is thereby saved.
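As a hedged sketch of how such need-driven activation might be approximated in practice, one can test the proximity of two trackers' bounding ellipses (the margin factor below is an assumed tuning parameter, not something prescribed by these teachings):

```python
# Approximate proximity/occlusion test using ellipse centers and major axes,
# since the exact observation locations are unknown. A minimal sketch.
import math

def trackers_proximate(cx1, cy1, a1, cx2, cy2, a2, margin=1.2):
    """True when two trackers' ellipses are close enough that their targets'
    observations may interact or occlude."""
    dist = math.hypot(cx1 - cx2, cy1 - cy2)
    return dist < margin * (a1 + a2)

# usage: activate collaboration/interaction terms only when needed
if trackers_proximate(100, 80, 20, 115, 85, 18):
    pass  # activate camera collaboration and target interaction here
```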
As one illustrative example in this regard, such collaboration links can be established and severed dynamically as targets move into and out of potential occlusion in a given camera view.
A graphical model of this sort can be decomposed into a set of single-target subgraphs to facilitate distributed analysis.
According to graphical model theory, one can analyze the Markov properties (that is, the conditional independence properties) of every decomposed graph on its corresponding moral graph 601. Pursuant thereto, the following Markov properties can be derived:
p(X_t^{A,i}, Z_t^{A,J_t}, Z_t^{B,i} | X_{0:t-1}^{A,i}, Z_{1:t-1}^{A,i}, Z_{1:t-1}^{A,J_{1:t-1}}, Z_{1:t-1}^{B,i}) = p(X_t^{A,i}, Z_t^{A,J_t}, Z_t^{B,i} | X_{0:t-1}^{A,i}),  (i)

p(Z_t^{A,J_t}, Z_t^{B,i} | X_t^{A,i}, X_{0:t-1}^{A,i}) = p(Z_t^{A,J_t}, Z_t^{B,i} | X_t^{A,i}),  (ii)

p(Z_t^{A,i} | X_{0:t}^{A,i}, Z_{1:t-1}^{A,i}, Z_{1:t}^{A,J_{1:t}}, Z_{1:t}^{B,i}) = p(Z_t^{A,i} | X_t^{A,i}, Z_t^{A,J_t}, Z_t^{B,i}),  (iii)

p(Z_t^{B,i} | X_t^{A,i}, Z_t^{A,i}) = p(Z_t^{B,i} | X_t^{A,i}),  (iv)

p(Z_t^{A,J_t}, Z_t^{B,i} | X_t^{A,i}, Z_t^{A,i}) = p(Z_t^{A,J_t} | X_t^{A,i}, Z_t^{A,i}) p(Z_t^{B,i} | X_t^{A,i}).  (v)
One may now consider a Bayesian conditional density propagation structure for each decomposed graphical model.
By applying Bayes's rule and the Markov properties derived above, a recursive conditional density updating rule can be obtained:

p(X_{0:t}^{A,i} | Z_{1:t}^{A,i}, Z_{1:t}^{A,J_{1:t}}, Z_{1:t}^{B,i}) = k_t p(Z_t^{A,i} | X_t^{A,i}) p(Z_t^{A,J_t} | X_t^{A,i}, Z_t^{A,i}) p(Z_t^{B,i} | X_t^{A,i}) p(X_t^{A,i} | X_{0:t-1}^{A,i}) p(X_{0:t-1}^{A,i} | Z_{1:t-1}^{A,i}, Z_{1:t-1}^{A,J_{1:t-1}}, Z_{1:t-1}^{B,i}),  (1)

where k_t is a normalization constant.
Those skilled in the art will note that the normalization constant k_t does not depend on the states X_{0:t}^{A,i}. In (1), p(Z_t^{A,i} | X_t^{A,i}) is the local observation likelihood for target i in the analyzed camera view A, and p(X_t^{A,i} | X_{0:t-1}^{A,i}) represents the state dynamics; both are similar to their counterparts in traditional Bayesian tracking methods. In addition, p(Z_t^{A,J_t} | X_t^{A,i}, Z_t^{A,i}) denotes the target interaction likelihood among interacting observations within the same view, and p(Z_t^{B,i} | X_t^{A,i}) denotes the camera collaboration likelihood relating the target's projections across views.
When the camera collaboration for a target is not activated and its projections in different views are regarded as independent, the proposed Bayesian multiple-camera tracking framework becomes identical to the Interactively Distributed Multi-Object Tracking (IDMOT) approach as is known in the art, with p(Z_t^{B,i} | X_t^{A,i}) uniformly distributed. When the interaction among the targets' observations is also deactivated, such a formulation further reduces to traditional Bayesian tracking, with p(Z_t^{A,J_t} | X_t^{A,i}, Z_t^{A,i}) uniformly distributed as well.
Since the posterior of each target is generally non-Gaussian, one can posit a nonparametric implementation of the derived Bayesian formulation using the sequential Monte Carlo algorithm, in which a particle set is employed to represent the posterior:

p(X_{0:t}^{A,i} | Z_{1:t}^{A,i}, Z_{1:t}^{A,J_{1:t}}, Z_{1:t}^{B,i}) ≈ {X_{0:t}^{A,i,n}, W_t^{A,i,n}}, n = 1, . . . , N_p,

where {X_{0:t}^{A,i,n}, n = 1, . . . , N_p} are the samples, {W_t^{A,i,n}, n = 1, . . . , N_p} are the associated weights, and N_p is the number of samples.
Considering the derived sequential iteration in (1), if the particles X_{0:t}^{A,i,n} are sampled from the importance density function q(X_t^{A,i} | X_{0:t-1}^{A,i,n}, Z_{1:t}^{A,i}, Z_{1:t}^{A,J_{1:t}}, Z_{1:t}^{B,i}), the associated weights can be updated sequentially as:
W_t^{i,n} ∝ W_{t-1}^{i,n} p(Z_t^{A,i} | X_t^{A,i,n}) p(Z_t^{A,J_t} | X_t^{A,i,n}, Z_t^{A,i}) p(Z_t^{B,i} | X_t^{A,i,n}).  (4)
It has been widely accepted that better importance density functions can make particles more efficient. Here, one can choose the relatively simple function p(X_t^{A,i} | X_{t-1}^{A,i}) as the importance density to highlight the efficiency of using camera collaboration. Other importance densities as are known in the art can also be used to provide better performance as desired.
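As a hedged illustration, and assuming (per the choice just described) that the importance density equals the state dynamics so that the dynamics term cancels, the weight update of (4) can be transcribed directly as follows; the three likelihood arrays are per-particle evaluations of the local observation, target interaction, and camera collaboration densities discussed herein:

```python
# A minimal sketch of the weight update in (4).
import numpy as np

def update_weights(prev_weights, local_lik, interaction_lik, collaboration_lik):
    """Multiply the previous weights by the three likelihood factors of (4),
    then renormalize so the weights form a distribution."""
    w = prev_weights * local_lik * interaction_lik * collaboration_lik
    return w / w.sum()

# toy usage with four particles (values are illustrative only)
w = update_weights(np.full(4, 0.25),
                   np.array([0.9, 0.1, 0.5, 0.5]),   # p(Z_t^{A,i} | X_t^{A,i,n})
                   np.ones(4),                        # interaction deactivated
                   np.array([0.8, 0.2, 0.6, 0.4]))    # p(Z_t^{B,i} | X_t^{A,i,n})
```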
Modeling the densities in (4) is not necessarily trivial and can have great influence on the performance of practical implementations; a proper model can play a significant role in estimating these densities. Different target models, such as a 2D ellipse model, a 3D object model, a snake or dynamic contour model, and so forth, are known in the art. One may also employ a five-dimensional parametric ellipse model that is quite common in the prior art, saves considerable computational cost, and is sufficient to represent the optical tracking results for these purposes. For example, the state X_t^{A,i} is given by (cx_t^{A,i}, cy_t^{A,i}, a_t^{A,i}, b_t^{A,i}, p_t^{A,i}), where i = 1, . . . , M is the index of targets, t is the time index, (cx, cy) is the center of the ellipse, a is the major axis, b is the minor axis, and p is the orientation in radians.
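For illustration, such a five-dimensional ellipse state can be represented as a simple container whose field names merely mirror the notation above:

```python
# A minimal sketch of the five-dimensional parametric ellipse state.
from dataclasses import dataclass

@dataclass
class EllipseState:
    cx: float  # ellipse center, x
    cy: float  # ellipse center, y
    a: float   # major axis
    b: float   # minor axis
    p: float   # orientation (radians)

# illustrative values only
state = EllipseState(cx=160.0, cy=120.0, a=24.0, b=12.0, p=0.3)
```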
Those skilled in the art will recognize that the proposed Bayesian conditional density propagation framework imposes no specific requirements on the cameras (e.g., fixed or moving, calibrated or not, and so forth) or on the collaboration model (e.g., 3D or 2D) so long as the model can provide a good estimate of the density p(Z_t^{B,i} | X_t^{A,i}). Epipolar geometry has been used to model the relation across multiple camera views in different ways. Somewhat contrary to prior uses of epipolar geometry, however, the present teachings will accommodate a paradigm of camera collaboration likelihood modeling that uses a sequential Monte Carlo implementation requiring neither feature matching nor recovery of the target's 3D coordinates, assuming only that the cameras' epipolar geometry is known.
These teachings then contemplate mapping the observation Z_t^{B,i} to camera view 701 and calculating the density there. The observations Z_t^{B,i} and Z_t^{B,j} are initially found by tracking in view 702. They are then mapped to view 701, producing h(Z_t^{B,i}) and h(Z_t^{B,j}), where h(·) is a function of Z_t^{B,i} or Z_t^{B,j} characterizing the epipolar geometry transformation. After that, the collaboration likelihood can be calculated based on h(Z_t^{B,i}) and h(Z_t^{B,j}). Sometimes a more complicated case occurs; for example, target i may be involved in occlusions with other targets in both cameras. In this situation, the above scheme is initialized by randomly selecting one view, say view 702, and using IDMOT to find the observations. These initial estimates may not be very accurate; in this case, one can therefore iterate several times (usually twice is enough) between the different views to obtain more stable estimates.
By one approach, this collaboration likelihood can be modeled as a kernel function of the distance between a given particle and the mapped observation, where the kernel's variance can be chosen as the bandwidth.
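As one hedged sketch of such a model, assuming the fundamental matrix F between the two views is known (with the convention x_B^T F x_A = 0) and assuming a Gaussian kernel whose bandwidth sigma is a tuning parameter, the counterpart observation's center can be mapped to its epipolar line in the analyzed view and each particle scored by its point-to-line distance:

```python
# A minimal sketch of an epipolar camera collaboration likelihood.
import numpy as np

def epipolar_line_in_view_A(F, point_B):
    """h(Z_t^{B,i}): the epipolar line ax + by + c = 0 in view A induced by
    a point (x, y) observed in view B, normalized so distances are in pixels."""
    x, y = point_B
    a, b, c = F.T @ np.array([x, y, 1.0])
    norm = np.hypot(a, b)
    return a / norm, b / norm, c / norm

def collaboration_likelihood(particles_xy, F, point_B, sigma=8.0):
    """p(Z_t^{B,i} | X_t^{A,i,n}) as a Gaussian kernel of the distance from
    each particle center to the mapped epipolar line; sigma is the bandwidth."""
    a, b, c = epipolar_line_in_view_A(F, point_B)
    d = np.abs(a * particles_xy[:, 0] + b * particles_xy[:, 1] + c)
    return np.exp(-0.5 * (d / sigma) ** 2)
```

Note that this scoring involves no feature matching and no 3D reconstruction; only the epipolar geometry is assumed.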
A so-called "magnetic repulsion model" can be employed to estimate the target interaction likelihood, where φ_t^{A,i,n} is the interaction weight of particle X_t^{A,i,n}. This weight can be calculated iteratively, where α and Σ_φ are constants and l_t^{A,i,n} is the distance between the current particle's observation and the neighboring observation.
Different cues have been proposed to estimate the local observation likelihood. For present purposes one can fuse the target's color histogram with a PCA-based model, namely, p(Z_t^{A,i} | X_t^{A,i}) = p_c × p_p, where p_c and p_p are the likelihood estimates obtained from the color histogram and PCA models, respectively.
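The following sketch illustrates one plausible such fusion, assuming a Bhattacharyya-coefficient score for the color term and a reconstruction-error score for the PCA term; the bandwidths and the trained reference histogram, PCA mean, and PCA basis are assumptions rather than prescriptions of these teachings:

```python
# A minimal sketch of the fused local observation likelihood p_c * p_p.
import numpy as np

def color_likelihood(candidate_hist, ref_hist, sigma_c=0.2):
    """p_c: similarity of normalized histograms via the Bhattacharyya coefficient."""
    bc = np.sum(np.sqrt(candidate_hist * ref_hist))
    return np.exp(-(1.0 - bc) / (2.0 * sigma_c ** 2))

def pca_likelihood(patch_vec, mean, basis, sigma_p=25.0):
    """p_p: Gaussian score of the patch's PCA reconstruction error.
    basis has shape (D, k); patch_vec and mean have shape (D,)."""
    coeffs = basis.T @ (patch_vec - mean)   # project onto the learned subspace
    recon = mean + basis @ coeffs           # reconstruct from the subspace
    err = np.linalg.norm(patch_vec - recon)
    return np.exp(-0.5 * (err / sigma_p) ** 2)

def local_observation_likelihood(candidate_hist, ref_hist, patch_vec, mean, basis):
    return color_likelihood(candidate_hist, ref_hist) * pca_likelihood(patch_vec, mean, basis)
```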
For simplicity, one can manually initialize all the targets for experimental or calibration purposes. Many automatic initialization algorithms are available and can be used instead as desired.
To minimize computational cost, one may wish to avoid activating such camera collaboration when targets are far away from each other, since a single-target tracker can achieve reasonable performance under such operating conditions. Moreover, some targets cannot utilize the camera collaboration even when they are involved in occlusions with others, if these targets have no projections in other views. Therefore, a tracker can be configured to activate the camera collaboration, and thus implement the proposed Bayesian multiple-camera tracking, only when its associated target both needs such collaboration and can make use of it. In other situations, the tracker degrades to implement IDMOT or a traditional Bayesian tracker such as multiple independent regular particle filters.
Within a camera view, if the analyzed tracker is isolated from other targets, it will only implement multiple independent regular particle filters (MIPF) 903 to reduce computational cost. When it moves closer to or interacts with other trackers, it can activate either BMCT 902 or IDMOT 901 according to the associated targets' status. This approach tends to ensure that the proposed Bayesian multiple-camera tracking approach using multiocular videos can work better than, and is in any event never inferior to, monocular video implementations of IDMOT or MIPF.
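As an illustrative (and hypothetical) rendering of this mode-selection logic, where the status flags are assumed inputs derived from the proximity and visibility tests described above:

```python
# A minimal sketch of per-tracker mode selection among the three modes above.
def select_mode(interacting_in_view, counterpart_visible):
    if not interacting_in_view:
        return "MIPF"   # isolated target: an independent particle filter suffices
    if counterpart_visible:
        return "BMCT"   # occlusion risk and a usable projection in another view
    return "IDMOT"      # occlusion risk, but no counterpart to collaborate with

assert select_mode(False, True) == "MIPF"
assert select_mode(True, True) == "BMCT"
assert select_mode(True, False) == "IDMOT"
```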
If desired, the tracker can be configured to decide that the associated target has disappeared and should be deleted in either of two cases: (1) the target moves out of the image; or (2) the tracker loses the target and tracks clutter instead. In both situations, the epipolar consistency loop check fails and the local observation weights of the tracker's particles become very small, since there is no longer any target information. On the other hand, in the case where the tracker merely misses its associated target and follows a false target, these processes will not delete the tracker but will instead leave it for further evaluation.
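A hedged sketch of such a deletion test, with an assumed weight threshold and an externally supplied epipolar-consistency flag, might proceed as follows:

```python
# A minimal sketch of the two deletion cases described above.
import numpy as np

def should_delete(center_xy, image_wh, epipolar_consistent, local_weights, w_min=1e-4):
    """Delete when the target has left the image, or when the epipolar
    consistency check fails while local observation weights collapse
    (i.e., the tracker is likely following clutter). w_min is an
    assumed tuning parameter."""
    x, y = center_xy
    w, h = image_wh
    out_of_image = not (0 <= x < w and 0 <= y < h)
    lost_target = (not epipolar_consistent) and (np.mean(local_weights) < w_min)
    return out_of_image or lost_target
```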
There are three different likelihood densities that are beneficially estimated in this Bayesian multiple-camera tracking architecture: (1) the local observation likelihood p(Z_t^{A,i} | X_t^{A,i}); (2) the target interaction likelihood p(Z_t^{A,J_t} | X_t^{A,i}, Z_t^{A,i}); and (3) the camera collaboration likelihood p(Z_t^{B,i} | X_t^{A,i}).
Table 1 presents a comparison of the average computation time of the different likelihood weightings in processing one frame of synthetic sequences using Bayesian multiple-camera tracking as per these teachings. Compared with the most time-consuming component (the local observation likelihood weighting of traditional particle filters), the computational cost required for camera collaboration is negligible. This is primarily for two reasons: first, a tracker activates the camera collaboration only when it encounters potential multi-target occlusions; and second, the epipolar geometry-based camera collaboration likelihood model avoids feature matching and is very efficient.
The computational complexity of the centralized approaches used in many prior art multi-target tracking systems increases exponentially with the number of targets and cameras, since centralized methods rely on joint-state representations. The computational complexity of the proposed distributed architecture, on the other hand, increases only linearly with the number of targets and cameras. Table 2 presents a comparison of the complexity of these two approaches in terms of the number of targets, obtained by running the proposed Bayesian multiple-camera tracking approach and a joint-state representation-based MCMC particle filter while varying the number of targets on synthetic videos. It can be seen that, under the condition of achieving reasonably robust tracking performance, both the required number of particles and the processing speed of the proposed Bayesian multiple-camera tracking approach vary linearly.
These teachings are therefore seen to provide a Bayesian structure that addresses the multi-target occlusion problem for multiple-target tracking application settings that use multiple collaborative cameras. Compared with the common practice of using a joint-state representation, whose computational complexity increases exponentially with the number of targets and cameras, the proposed approach solves the multi-camera multi-target tracking problem in a distributed way whose complexity grows only linearly with the number of targets and cameras.
Moreover, the proposed approach presents a very convenient architecture for tracker initialization of new targets and tracker elimination of vanished targets. The distributed architecture also makes it very suitable for efficient parallelization in complex computer networking applications. The proposed approach does not recover the targets' 3D locations. Instead, it generates multiple estimates, one per camera, for each target in the 2D image plane. For many practical tracking applications such as video surveillance, this is sufficient since the 3D target location is usually not necessary and 3D modeling will require a very expensive computational effort for precise camera calibration and nontrivial feature matching.
The merits of this Bayesian multiple-camera tracking approach compared with 3D tracking approaches include speed, ease of implementation, graceful degradation (fault tolerance), and robust (noise resilient) tracking results in crowded environments. In addition, with the necessary camera calibration information, the 2D estimates can also be projected back to recover the targets' 3D location in the world coordinate system. Furthermore, these teachings present an efficient collaboration model using epipolar geometry with sequential Monte Carlo implementation. This avoids the need for recovery of the targets' 3D coordinates and does not require feature matching, which is difficult to perform in widely separated cameras.
Those skilled in the art will appreciate that the above-described processes are readily enabled using any of a wide variety of available and/or readily configured platforms, including partially or wholly programmable platforms as are known in the art or dedicated purpose platforms as may be desired for some applications. Referring now to the drawings, an illustrative approach to such a platform will now be provided.
In this illustrative embodiment, the apparatus 200 comprises a memory 201 that operably couples to a processor 202. The memory 201 serves to store and hold available the aforementioned captured temporally parsed data regarding at least a first item, wherein the data comprises data corresponding to substantially simultaneous samples of the first item (and other items when present) with respect to at least first and second differing points of reference. Such data can be provided by, for example, a first 203 through an Nth image capture device 204 (where N comprises an integer greater than one) that are each positioned to have differing views of the first item.
The processor 202, in turn, is configured and arranged to effect selected teachings as have been set forth above. This includes, for example, automatically using, at least in part, conditional probabilistic analysis of at least some of the temporally parsed data as corresponds in a given sample to the first point of reference and the second point of reference to disambiguate state information as pertains to the first item.
Those skilled in the art will recognize and understand that such an apparatus 200 may be comprised of a plurality of physically distinct elements as is suggested by the illustration. It is also possible, however, to view this illustration as comprising a logical view, in which case one or more of these elements can be enabled and realized via a shared platform.
Those skilled in the art will recognize that a wide variety of modifications, alterations, and combinations can be made with respect to the above described embodiments without departing from the spirit and scope of the invention, and that such modifications, alterations, and combinations are to be viewed as being within the ambit of the inventive concept.
This is a continuation-in-part of prior application Ser. No. 11/549,542, filed Oct. 13, 2006, which is hereby incorporated herein by reference in its entirety.