The invention relates generally to a system and method for tracking articulated body motion, and more particularly to a system and method for estimating the articulated motion of the head and hands of one or multiple people.
The deployment of video surveillance systems, especially in retail environments, is known. Digital video is necessary to efficiently provide continuous surveillance. Conventional video surveillance systems utilize single methods, such as Multiple Hypothesis Tracking or Joint Probabilistic Data Association Filter, to track multiple objects. A disadvantage with such methods is that prior model assumptions and computational efficiency of such methods are not particularly robust. Another disadvantage is that the entrance and departure of new objects in a scene must be captured by the birth and death of new modes.
One exemplary embodiment of the invention is a system for tracking the movements of persons. The system includes a video capturing device capable of providing stereo views and a computing device coupled to the video capturing device. The computing device includes a computing section capable of performing calculations to support stochastic filtering.
One aspect of the exemplary system embodiment is a system for tracking the behavior of a customer in a retail environment. The system includes at least two pan tilt zoom cameras and a computing device capable of performing calculations to support mode stratified particle filtering.
Another exemplary embodiment of the invention is a method for monitoring the movements of one or more persons. The method includes first visually capturing a scene encompassing one or more strata, second re-sampling each stratum, third redefining each stratum, and fourth adding new or subtracting old strata based upon the arrival or departure of isolated targets within the scene. The method also includes fifth normalizing each stratum and re-performing the second through fifth steps.
One aspect of the exemplary method embodiment is that the step of visually capturing a scene is accomplished with at least two video devices, and the step of re-sampling each stratum includes collecting hypotheses on how the one or more persons in the scene will move.
These and other advantages and features will be more readily understood from the following detailed description of preferred embodiments of the invention that is provided in connection with the accompanying drawings.
Embodiments of the invention, described herein, utilize entropy measures to control the process of sampling particles. The entropy measures are implemented through mode stratification.
A radio frequency identification (RFID) transmitter 34 may optionally be included within the system 10. The transmitter 34 is configured to enable the computing device 40 to obtain information regarding the position of any item upon which an RFID tag 32 (
Finally, the system 10 may include a device controller 60 in communication with the computing device 40. The device controller 60 may control a device in the environment that the tracking system 10 is monitoring, and the device controller 60 may be controlled by the computing device 40. For example, the tracking system 10 may be utilized in an image guided manufacturing environment. In such an environment, computer-numerical-control (CNC) cutting machines may be incorporated. As a safety measure, the CNC cutting machines may be controlled by the device controller 60. Based upon images obtained through the video devices 12, the computing device 40 may determine that a health hazard has arisen (such as, for example, a person's hand has gotten too close to a cutting blade of one of the CNC cutting machines). In such an occurrence, the computing device 40 sends a signal to the device controller 60 to turn off the CNC cutting machine at issue.
As another example, the tracking system 10 may be incorporated in a retail environment. This example will be further described with specific reference to
Next will be described examples of algorithms that may be used in the computing device 40 to deduce head and hand position. The types of algorithms useful in deducing head and hand position may be collectively considered as stochastic filters. One example of a stochastic filter is a condensation filter. Another example of a stochastic filter is a mode stratified particle filter.
Stochastic filters, and in particular the mode stratified particle filter, utilize a Bayesian framework. In Bayesian sequential estimation, three main problems need to be addressed. Multi-modality must be maintained, and maintaining multiple modes of distribution with only a finite number of particles is a challenge. The performance of particle filters depends on prior model assumptions, and control mechanisms must be introduced that will improve the efficiency and robustness of the particle filters. In a dynamic environment, objects will enter and leave a particular scene, and a process modeling that environment must appropriately account for the entrance and exit of objects.
In a full Bayesian approach, the model selection would be based on evidence, which would be computationally infeasible. However, by learning the response of the likelihood function to the background, in the absence of any foreground object, it is possible to measure the distribution model for the posterior. Posterior distribution refers to the probability distribution of a state given all prior information and the current set of observations. Where the scene does not contain any foreground objects, the posterior should be similar to the learned background distribution. The similarity can be measured through relative entropy, i.e., the Kullback-Leibler divergence. Each of the disconnected areas of the scene that contain foreground objects has its own particle set to model the local distribution. Re-sampling of each particle set is done locally, because the dynamics and appearance of different foreground objects are statistically independent. The local re-sampling may be accomplished through, for example, mode stratification. In mode stratification, a stratum is defined to be the set of particles that represents a particular mode of the posterior distribution. The relative entropy of the distribution of each stratum should be significantly different from the learned background distribution. The cumulative measure of the relative entropies characterizes the fit of the observed data to the model. An empirical quantity, called the order parameter, is used to measure this fit. The order parameter is defined as:
where $q_t$ is the learned background response and $p_t^k$ is a distribution associated with the k-th mode at time t. K is the total number of modes in the model, and KL is the Kullback-Leibler divergence. The first term of the order parameter is the relative entropy between the posterior on the non-foreground region and a known background distribution $q_t$. This first term provides a basis to ascertain if a new mode has been formed, due to the appearance of a new object, or if a mode of the distribution no longer corresponds to an object in the scene. A spike of the order parameter indicates a poor fit with the model, and generally corresponds with an event in the scene, i.e., an arrival or departure of an object. A poor fit indicates that K, the total number of modes of the model, must be adjusted to match the posterior distribution.
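The relative entropy that the order parameter is built from can be illustrated with a minimal sketch. It assumes the distributions are discretized over the same set of cells; the function name, the `eps` guard, and the four-cell example values are hypothetical and are not taken from the specification. The sketch shows only the Kullback-Leibler term, not the full order parameter.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Relative entropy KL(p || q) of two discrete distributions, in nats."""
    if len(p) != len(q):
        raise ValueError("p and q must cover the same cells")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            # eps (an assumption) guards against cells where q has no mass
            total += pi * math.log(pi / max(qi, eps))
    return total

# Hypothetical learned background response over four cells
background = [0.25, 0.25, 0.25, 0.25]
# A scene that matches the background, and one with a new peak
empty_scene = [0.25, 0.25, 0.25, 0.25]
new_object = [0.05, 0.05, 0.85, 0.05]

print(kl_divergence(empty_scene, background))  # 0.0: no information beyond background
print(kl_divergence(new_object, background))   # clearly positive: poor fit to background
```

A jump in this quantity from near zero to a clearly positive value is the kind of spike that would signal an arrival or departure.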
The principle behind the algorithm used in mode stratification is to maximize the amount of information contained in the foreground while minimizing the amount of information in the background. Under this principle, a true background distribution that has a high relative entropy with respect to the distribution of the scene with the hypothesized foreground objects removed indicates that there is probably a new foreground object. The principle is implemented using a discretized control space $\mathcal{X}$ obtained as the image $\Xi(X)$ of the configuration space $X$ under the mapping $\Xi: X \to \mathcal{X}$. The control space is utilized (1) to implement a stratification of the configuration space $X$ so that modes can be represented in a statistically independent way, (2) in a re-sampling scheme which adapts automatically to maintain the information contained in each stratum, and (3) to control the birth and death of modes.
Mode stratification is managed in the control space, wherein each stratum is defined and managed. The control space $\mathcal{X}$ is divided into disjoint cells of a fixed volume such that:
$$\mathcal{X} = \bigcup_{i,j} X_{ij} \quad \text{with} \quad X_{ij} \cap X_{kl} = \emptyset \ \text{for } (i,j) \neq (k,l).$$
Based upon this control space partitioning, the k-th stratum at time t, $V_t^k$, is defined as the collection of cells
$$V_t^k := \bigcup_{(i,j) \in I_t^k} X_{ij}$$
where $I_t^k$ is the index set associated with each stratum. The dimensionality of the control space can be equal to that of the configuration space $X$ or be lower. For example, when tracking the location and orientation of faces in three dimensions, it is possible to use a subdivision along the spatial dimensions alone, rather than along the entire six-dimensional configuration space $X$.
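The mapping from configuration space to control-space cells can be sketched as follows. This is a minimal illustration under stated assumptions: the state is taken to be a six-element tuple (position plus orientation), only two position coordinates are used so that cells carry a two-index label (i, j), and `CELL_SIZE`, `control_cell`, and `occupied_cells` are hypothetical names.

```python
import math
from collections import defaultdict

CELL_SIZE = 0.5  # hypothetical cell edge length, in scene units

def control_cell(state, cell_size=CELL_SIZE):
    """Map a configuration-space state to a control-space cell index (i, j).

    Assumes state = (x, y, z, roll, pitch, yaw) and that only (x, y)
    is used for the subdivision; orientation is ignored.
    """
    x, y = state[0], state[1]
    return (math.floor(x / cell_size), math.floor(y / cell_size))

def occupied_cells(states, cell_size=CELL_SIZE):
    """Group particle indices by the control-space cell each state maps to."""
    cells = defaultdict(list)
    for idx, state in enumerate(states):
        cells[control_cell(state, cell_size)].append(idx)
    return dict(cells)

particles = [
    (0.10, 0.20, 1.7, 0.0, 0.0, 0.1),
    (0.30, 0.40, 1.6, 0.0, 0.1, 0.0),
    (2.60, 1.10, 1.8, 0.2, 0.0, 0.0),
]
print(occupied_cells(particles))
```

Because each state maps to exactly one cell, the cells are disjoint by construction, which is the property the partition requires.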
The size and the elements of each $V_t^k$ are adaptively determined in the re-sampling step. A stratum is represented by a particle set $S_t^k$ of size $n_t^k$:
$$S_t^k = \{(x_i^{k,t}, \pi_i^{k,t}, \omega_i^{k,t})\}_{i=1}^{n_t^k}$$
The $\pi$'s are the ensemble weights of each particle, while the $\omega$'s are the stratum (local) weights of each particle. The posterior distribution is represented by the union of these particle sets, $S = \bigcup_k S_t^k$, and is approximated by
The $\pi_i^{k,t}$ encapsulate the relative heights of the peaks represented by each stratum, while the $\omega_i^{k,t}$ encapsulate the likelihood weights of the particles within each stratum. After each re-sampling, the state of each particle $\pi_i^{k,t}$ changes, and so each individual particle set and its state variables and cell membership must be redefined accordingly. Such redefinition itself leads to potential splitting and merging of strata. Furthermore, the control space is used to maintain birth and death of strata that are responsible for managing the appearance and disappearance of tracks over time.
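Re-sampling a single stratum from its local weights can be sketched as below. The specification does not say which re-sampling scheme is used, so systematic re-sampling, a standard particle-filter technique, is substituted here as an assumption; `resample_stratum` and its parameters are hypothetical names.

```python
import random

def resample_stratum(particles, weights, n_out, rng=random):
    """Systematic re-sampling of one stratum.

    particles: list of states in the stratum.
    weights:   non-negative local weights, one per particle.
    Returns n_out states drawn in proportion to the weights.
    """
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("stratum has no weight to re-sample from")
    step = total / n_out
    pointer = rng.random() * step  # single random offset in [0, step)
    out, i, cumulative = [], 0, weights[0]
    for _ in range(n_out):
        # advance to the particle whose weight interval contains the pointer
        while pointer >= cumulative and i < len(weights) - 1:
            i += 1
            cumulative += weights[i]
        out.append(particles[i])
        pointer += step
    return out

# A particle with three times the weight is drawn three times as often
print(resample_stratum(["a", "b"], [1.0, 3.0], 4))
```

Because each stratum is re-sampled on its own, a low-weight stratum keeps its share of particles instead of being starved by a dominant mode elsewhere in the scene.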
With specific reference to
At Step 110, the strata are redefined. Specifically, after the re-sampling step, the preliminary strata particle sets are reorganized into $K_t$ strata $V_t^k$ based on the cells that are occupied under the mapping $\Xi(x)$ for $x \in \bigcup_k S_t^k$. Cells are organized into strata such that
$$V_t^k \cap V_t^{k'} = \emptyset \quad \text{for all } k' \neq k$$
and such that each stratum $V_t^k$ includes one connected component with respect to the control space partition defined in
$$V_t^k := \bigcup_{(i,j) \in I_t^k} X_{ij}.$$
Based upon the preliminary sets $P_t^k$ and the redefined strata, each stratum's particle set is constructed as
$${S'}_t^k = \{(x_i^{m,t}, \pi_i^{m,t}, \omega_i^{m,t}) \in S_t^m : x_i^{m,t} \in V_t^k,\ m = 1, \ldots, K_t\}.$$
Finally, the values of $\omega_i^{m,t}$ and $\pi_i^{m,t}$ are renormalized and the parameters of the measurement scores $\hat{C}_{k,t}$, $\hat{W}_{k,t}$ are updated for each new stratum.
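The regrouping of occupied cells into one connected component per stratum can be sketched with a breadth-first search. This is an illustration only: it assumes two-index cells and 4-connectivity (cells sharing an edge are neighbours), neither of which is stated in the specification, and `connected_strata` is a hypothetical name.

```python
from collections import deque

def connected_strata(occupied):
    """Group occupied control-space cells into connected components.

    occupied: iterable of (i, j) cell indices that contain particles.
    Returns a list of sets; each set is the index set of one stratum.
    """
    remaining = set(occupied)
    strata = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            i, j = queue.popleft()
            # assumed 4-connectivity: visit edge-sharing neighbours only
            for nb in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    component.add(nb)
                    queue.append(nb)
        strata.append(component)
    return strata

# Three touching cells form one stratum; the isolated cell forms another
cells = [(0, 0), (0, 1), (1, 1), (5, 5)]
for index_set in sorted(connected_strata(cells), key=len):
    print(sorted(index_set))
```

Two previously separate strata whose particles drift into adjacent cells come out as one component (a merge), and a stratum whose occupied cells separate comes out as two (a split).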
Next, at Step 115, strata are created (birth) or deleted (death) based upon the arrival or departure of isolated targets. Cells of the control space are identified as belonging either to the background or the foreground. Each cell of the control space is associated with a likelihood value from strata samples occupying the strata cell or, if no particle resides in a cell, by sampling from the background configuration space. The control space is an image of the configuration space under the mapping Ξ(X). Each cell of the control space can be associated with a volume in configuration space as
$$U_{ij} := \{x \in X : \Xi(x) \in X_{ij}\}.$$
The control space distribution is defined as
$$p_{ij,t}^k = p(x \in U_{ij} \mid Z_t^k) = \int_{U_{ij}} \frac{p(Z_t^k \mid x)\, p(x)}{p(Z_t^k)}\, dx.$$
$Z^k$ represents the observations Z with the target corresponding to the k-th stratum removed. The resulting control space distributions directly reflect the modal structure of the current configuration space and can be used to manage the death and birth of strata. If all visible targets are accounted for by existing strata and were to be removed from the configuration space, the remaining control space distribution should contain no further information. Conversely, if visible targets remain, there is a higher information content and a resulting low entropy. Thus, the birth and death of strata can be managed by computing the relative entropy between the control space distribution $p_t^k = \{p_{ij,t}^k\}$, which is hypothesized to contain no targets for the birth process or only a single target for the death process, and a learned background reference distribution $q_t = \{q_{ij,t}\}$, which is known to contain no targets.
The creation of a new stratum is triggered once the relative entropy between the control space distribution and the reference reaches a significant level. The deletion of an existing stratum is similarly decided by calculating the control space distribution for which all but the considered stratum are removed. When the relative entropy between this control space distribution and the background falls below a significant level, uniformity of the control space can be deduced and the stratum is removed. The significance levels can be calculated based on the typical volume, W, of the strata in the control space. Assuming a uniform reference background and a stratum that is uniformly distributed over its control space volume,
where N is the total number of cells in the control space and V is the volume of one control space cell. The stratum size is estimated based on the current noise variance of the target.
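As an illustration of how such a level could follow from N, V, and W: if the background is uniform over all N cells and a stratum is uniform over the W/V cells it covers, the relative entropy between the two is log(N·V/W). The sketch below uses that value as a reference level. It is an assumption that this matches the significance level intended here, and the function names and the 0.5 scaling factor are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Relative entropy KL(p || q) of two discrete distributions, in nats."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def uniform_stratum_level(n_cells, cell_volume, stratum_volume):
    """KL of a stratum uniform over volume W from a background uniform
    over N cells of volume V each: log(N * V / W)."""
    return math.log(n_cells * cell_volume / stratum_volume)

def birth_needed(residual, background, level):
    """residual: control-space distribution with all tracked targets removed."""
    return kl_divergence(residual, background) > level

def death_needed(single, background, level):
    """single: control-space distribution with all but one stratum removed."""
    return kl_divergence(single, background) < level

N, V, W = 16, 1.0, 4.0
background = [1.0 / N] * N
peaked = [0.25] * 4 + [0.0] * 12        # mass concentrated on W / V = 4 cells
level = 0.5 * uniform_stratum_level(N, V, W)  # hypothetical safety factor

print(birth_needed(peaked, background, level))      # True: untracked target
print(death_needed(background, background, level))  # True: stratum holds nothing
```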
Next, at Step 120, the $\omega$ in each stratum is normalized and the $\pi_i^{k,t}$ is normalized over all the strata. Finally, at Step 125, the parameters of the measurement scores $C_{k,t}$, $W_{k,t}$ for each $Z_{k,t}$ are updated for each new stratum.
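The two normalizations of Step 120 can be sketched as follows, assuming each particle is stored as a dictionary with hypothetical keys `"pi"` (ensemble weight) and `"omega"` (local weight). The local weights are scaled to sum to one within each stratum, and the ensemble weights to sum to one over all strata.

```python
def normalize_weights(strata):
    """Return a copy of strata with omega normalized per stratum
    and pi normalized over all strata.

    strata: list of strata; each stratum is a list of particle dicts.
    """
    pi_total = sum(p["pi"] for stratum in strata for p in stratum)
    if pi_total <= 0.0:
        raise ValueError("ensemble weights sum to zero")
    normalized = []
    for stratum in strata:
        omega_total = sum(p["omega"] for p in stratum)
        if omega_total <= 0.0:
            raise ValueError("a stratum has zero local weight")
        normalized.append([
            {**p, "pi": p["pi"] / pi_total, "omega": p["omega"] / omega_total}
            for p in stratum
        ])
    return normalized

strata = [
    [{"pi": 2.0, "omega": 1.0}, {"pi": 2.0, "omega": 3.0}],
    [{"pi": 4.0, "omega": 5.0}],
]
for stratum in normalize_weights(strata):
    print(stratum)
```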
With reference to
Having at least two video devices 12 allows for a three-dimensional analysis of a scene by the use of triangulation and by adding at least a second perspective of the scene. The video devices 12 may be digital video cameras or analog video cameras in conjunction with an analog-to-digital converter (not shown). The video devices 12 may be pan-tilt-zoom cameras. Such pan-tilt-zoom cameras provide capability to rotate the video device 12 view so as to allow the video device 12 to capture a scene at a particular location. As shown in
At Step 100 (
All of the hypotheses of how the actors in a scene will move that are derived through re-sampling are each assigned a numerical value attributable to the weight or likelihood that that hypothesis is a true representation of how the actors 20, 22 actually moved in the scene 16. At Step 120, the numerical values of the likelihood weights are normalized to add up to 1.0. Finally, at Step 125, observation distributions are updated. It is possible that an actor or an actor's hand or head may be obstructed from view of the video devices 12, and therefore subtracted from the scene 16 erroneously. When that occurs, and there is an inconsistency between what is known (for example, there are two actors 20, 22 in the scene 16) and what is hypothesized (there is only one actor in the scene 16), further sampling or other analysis is performed in Step 125 to quell the inconsistency. Steps 105 through 125 are repeated for time t=2, 3, 4, . . . n.
The process as described and shown in
The tracking system 10 may optionally include one or more devices 14 capable of reflecting an image. An example of such a device 14 is a mirrored dome. The mirrored domes 14 may be positioned at various strategic locations within an environment. For example, mirrored domes 14 may be located at various locations that are outside of the sight line of cashiers or other personnel. With the positioning of the mirrored domes 14, the video devices 12 are trained on the mirrored domes 14, instead of the actors, to capture a scene. Through the use of the mirrored domes 14, fewer video devices 12 may be necessary.
There are certain applications where the tracking of customers in a retail environment is important for both behavioral analysis and surveillance. Single modality tracking is, however, challenging due to clutter and occlusion and ambiguities with respect to the vast range of products with which a customer can interact. Next, with reference to
As described above, the stereo video devices 12 are used to capture the scene 16 including the customers 20, 22. The video devices 12 observe the customers 20, 22, and body part locations are tracked in three dimensions and real-time using both anatomical constraints and the mode stratified particle filtering method (
Combining the information on the customers gleaned through the use of the mode stratified particle filtering method and the information obtained through the transmitter 34 and the RFID tag 32, the state of the customer's interaction with the product 30 can be estimated. Behavior analysis of customers, or surveillance, may be performed with the system 10. For example, the obtained information can be used to determine if the customer 20, 22 is tampering with the product 30, or whether the customer 20, 22 is interested in, stealing, or vandalizing the product 30.
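One simple way to combine the two information sources is a distance test between the tracked hand positions and the RFID-reported product position. The sketch below is purely illustrative: the reach threshold, the `tag_moved` flag, the state labels, and the function name are all hypothetical, and the specification does not describe the decision rule actually used.

```python
import math

REACH = 0.15  # hypothetical reach threshold, in metres

def interaction_state(hand_positions, tag_position, tag_moved, reach=REACH):
    """Classify a customer-product interaction.

    hand_positions: tracked 3D hand positions (from the video tracker).
    tag_position:   3D product position (from the RFID reading).
    tag_moved:      True if the RFID-reported position changed recently.
    """
    nearest = min(
        (math.dist(hand, tag_position) for hand in hand_positions),
        default=math.inf,  # no tracked hands at all
    )
    hand_near = nearest <= reach
    if hand_near and tag_moved:
        return "handling"
    if hand_near:
        return "reaching"
    if tag_moved:
        return "moved, no tracked hand nearby"
    return "no interaction"

hands = [(1.00, 0.50, 1.20), (1.40, 0.55, 1.10)]
shelf_item = (1.05, 0.50, 1.25)
print(interaction_state(hands, shelf_item, tag_moved=True))
```

The third state covers the occlusion case, where a product moves while no hand is tracked near it, and could be used to prompt further analysis.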
While the invention has been described in detail in connection with only a limited number of embodiments, it should be readily understood that the invention is not limited to such disclosed embodiments. Rather, the invention can be modified to incorporate any number of variations, alterations, substitutions or equivalent arrangements not heretofore described, but which are commensurate with the spirit and scope of the invention. Additionally, while various embodiments of the invention have been described, it is to be understood that aspects of the invention may include only some of the described embodiments. Accordingly, the invention is not to be seen as limited by the foregoing description, but is only limited by the scope of the appended claims.