 
Patent Application 20090148046
                    This invention relates to variable-bandwidth peak detection in Hough space used for the detection of lines and geometrical shapes in video occurring in a variety of application domains, such as medical, automotive, inspection, and augmented reality. It further relates to error-propagation for uncertainty modeling in joint motion-color space for modeling of dynamic backgrounds in a variety of application domains, such as surveillance and monitoring. Furthermore, the invention relates to variable-bandwidth peak detection in joint color-spatial domains used for video segmentation occurring in various application domains, such as medical and object detection.
Background modeling forms a central module in systems that use computer vision to detect events of interest in a video stream. Most current methods use only the intensity observed at a pixel. Such a model is reasonable when the background is stationary. However, these methods lose discrimination power when the background is dynamic, as with ocean waves, waving trees, rain, and moving clouds, or when objects are camouflaged, that is, similar in color to the background behind them.
A Hough Transform is a method for detecting straight lines and curves in gray level images. For line detection, the equation of a line can be expressed as ρ=xcos(θ)+ysin(θ), where θ is the line orientation and ρ is the distance from the origin to the line. A line is therefore completely specified by a parameter pair (θ,ρ). For straight line detection, the Hough Transform maps each pixel (x,y) from the image space into a parameter space of (θ,ρ), where the contributions from each feature point to each possible pair (θ,ρ) are accrued. For this purpose, the parameter space is divided into cells, with each cell corresponding to a pair of quantized (θ,ρ). A multi-dimensional accumulator array is often used to represent the quantized space. For each feature point, all the parameters associated with the point are estimated, and the corresponding cells of the accumulator are incremented accordingly. This is repeated for all feature points. Lines are found by searching the accumulator array for peaks. The peaks correspond to the parameters of the most likely lines.
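The accumulation described above can be sketched as follows. This is a minimal illustration of the standard unit-vote scheme, not the method of the invention; the function names, the bin resolution, and the sample points are assumptions chosen for the example.

```python
import math

def hough_accumulate(points, n_theta, rho_res, rho_max):
    """Accumulate unit votes in a quantized (theta, rho) parameter space."""
    n_rho = int(round(2 * rho_max / rho_res)) + 1
    acc = [[0] * n_rho for _ in range(n_theta)]
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            # rho = x cos(theta) + y sin(theta), as in the text
            rho = x * math.cos(theta) + y * math.sin(theta)
            r = int(round((rho + rho_max) / rho_res))
            if 0 <= r < n_rho:
                acc[t][r] += 1  # every point contributes one unit vote
    return acc

def strongest_line(acc, n_theta, rho_res, rho_max):
    """Return (theta, rho, votes) of the accumulator cell with the most votes."""
    votes, t, r = max(
        (acc[t][r], t, r) for t in range(len(acc)) for r in range(len(acc[0]))
    )
    return math.pi * t / n_theta, r * rho_res - rho_max, votes

# Ten feature points on the horizontal line y = 20 (illustrative data).
points = [(x, 20) for x in range(0, 50, 5)]
acc = hough_accumulate(points, n_theta=180, rho_res=1.0, rho_max=100.0)
theta, rho, votes = strongest_line(acc, n_theta=180, rho_res=1.0, rho_max=100.0)
```

For these points the peak lies at θ=π/2 and ρ=20, which is the parameter pair of the line y=20.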
The standard Hough Transform adopts a “top hat” strategy to compute the contribution of each point to a hypothesized line. Specifically, the scheme assumes all feature points located within a close range of the hypothesized line contribute equally to the line. The accumulator is, therefore, incremented by a unit for those feature points. This scheme is inadequate in that data points are not all equally reliable: line parameters derived from each feature point may carry different uncertainties. Most Hough Transform techniques estimate the orientation of feature points (edgels) to restrict the range of values of θ a pixel may vote for. The estimate of the orientation of each edge pixel is often uncertain due to: 1) image noise, for example, positional errors from quantization and sensor errors, 2) the small neighborhood associated with the edge detection procedure and the inherent uncertainty of the procedure, and 3) the parametric representation used to define a line. Therefore, feature points vary in uncertainty and should not be treated equally.
Previous efforts in algorithm improvement to Hough Transforms focused on improving the computational efficiency of the Hough Transform, that is, speed and memory. Early efforts in this aspect concentrated on reducing the number of bins used for tessellating the parameter space. Many proposed techniques drew on some form of coarse-to-fine search strategy resulting in a dramatic reduction of cells.
Recent efforts have been focusing on sampling the feature points. The idea is to use only a subset of image features. These efforts give rise to different probabilistic, also called randomized, Hough Transform techniques which increase the computational efficiency and decrease memory usage by means of sampling the image feature space.
Therefore, a need exists for a unified framework that utilizes the uncertainty of transformed data for peak detection and clustering in feature space. A further need exists for a method for background modeling that is able to account for dynamic backgrounds that change according to a certain pattern. A still further need exists to analyze Hough Transforms that are built with uncertainty and a need exists for video segmentation in invariant color spaces.
An embodiment of the present invention comprises using error propagation for building feature spaces with variable uncertainty and using variable-bandwidth mean shift for the analysis of such spaces, to provide peak detection and space partitioning. The invention applies these techniques to construct and analyze Hough spaces for line and geometrical shape detection, as well as to detect objects that are represented by peaks in the Hough space. This invention can be further used for background modeling by taking into account the uncertainty of the transformed image color and uncertainty of the motion flow, to be used in application domains, such as surveillance and monitoring. Furthermore, the invention can be used to segment video data in invariant spaces, by propagating the uncertainty from the original space and using the variable-bandwidth mean shift to detect peaks.
An embodiment of the present invention comprises providing input data to be analyzed from a domain, developing an uncertainty model of the input data in a feature space, and using variable bandwidth mean shift to detect an object of interest.
Another embodiment of the present invention includes deriving the uncertainty model through error propagation.
A further embodiment of the present invention comprises feature space including joint spatial-color space.
A further embodiment of the present invention comprises feature space including invariant space.
A further embodiment of the present invention comprises feature space including parameter space.
A further embodiment of the present invention comprises feature space including joint motion-color space.
A further embodiment of the present invention comprises domains including one or more of medical, surveillance, monitoring, automotive, inspection, and augmented reality.
Another embodiment of the present invention comprises modeling a background using multiple features and uncertainties.
Another embodiment of the present invention comprises modeling a background using multiple features and uncertainties wherein the multiple features include one or more of color, texture, and motion.
A further embodiment of the present invention comprises analyzing a video frame and adding a vector of features to a background model.
A further embodiment of the present invention comprises analyzing a video frame and detecting a change by evaluating a vector of features and a background model.
A still further embodiment of the present invention comprises applying morphological operations to the detections.
The embodiments of the present invention will become more apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
    
    
    
    
    
    
    
A method according to an embodiment of the present invention comprises using error propagation to build feature spaces, and analyzing the feature spaces that are built with uncertainty using variable-bandwidth mean shift to provide peaks and clustering of the feature spaces. Variable bandwidth mean shift identifies modes in joint spatial-color space, while image segments are delineated by detecting valleys surrounding the modes. The main statistical tool that can utilize the variable uncertainty is variable-bandwidth mean shift, an adaptive estimator of the density gradient. This technique is applied to detect high density points, that is, modes, in the feature space. The feature space can be the Hough space, the joint motion-color space, or the joint image-color space.
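Using variable bandwidth mean shift to analyze feature space (step 206) can be described by beginning with a set of d-dimensional points $x_i$, $i=1,\ldots,n$, that exists in the space $R^d$ and a symmetric positive definite $d\times d$ bandwidth matrix $H_i$ that is defined for each data point $x_i$. The matrix $H_i$ quantifies the uncertainty associated with $x_i$. The sample point density estimator with d-variate normal kernel, computed at the point x, is given by
$$\hat{f}_v(x)=\frac{1}{n(2\pi)^{d/2}}\sum_{i=1}^{n}\frac{1}{|H_i|^{1/2}}\exp\left(-\frac{1}{2}D^2(x,x_i,H_i)\right)$$
where
$$D^2(x,x_i,H_i)\equiv(x-x_i)^T H_i^{-1}(x-x_i)$$
is the Mahalanobis distance from x to $x_i$. $H_h$ is the data-weighted harmonic mean of the bandwidth matrices computed at x
$$H_h^{-1}(x)=\sum_{i=1}^{n}w_i(x)H_i^{-1}$$
where the weights
$$w_i(x)=\frac{|H_i|^{-1/2}\exp\left(-\frac{1}{2}D^2(x,x_i,H_i)\right)}{\sum_{j=1}^{n}|H_j|^{-1/2}\exp\left(-\frac{1}{2}D^2(x,x_j,H_j)\right)}$$
satisfy $\sum_{i=1}^{n}w_i(x)=1$. An estimator of the gradient of the true density is the gradient of $\hat{f}_v$
$$\nabla\hat{f}_v(x)=\frac{1}{n(2\pi)^{d/2}}\sum_{i=1}^{n}\frac{1}{|H_i|^{1/2}}H_i^{-1}(x_i-x)\exp\left(-\frac{1}{2}D^2(x,x_i,H_i)\right)$$
By multiplying the above to the left with $H_h(x)$, it results that
$$H_h(x)\nabla\hat{f}_v(x)=\hat{f}_v(x)\,m_v(x)$$
where
$$m_v(x)\equiv H_h(x)\sum_{i=1}^{n}w_i(x)H_i^{-1}x_i-x$$
is the variable-bandwidth mean shift vector. From the above,
$$m_v(x)=H_h(x)\frac{\nabla\hat{f}_v(x)}{\hat{f}_v(x)}$$
which shows that the variable-bandwidth mean shift vector is an adaptive estimator of the normalized gradient of the underlying density.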
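If the bandwidth matrices $H_i$ are all equal to a fixed matrix H, called the analysis bandwidth, the sample point estimator reduces to the simple multivariate density estimator with normal kernel
$$\hat{f}(x)=\frac{1}{n(2\pi)^{d/2}|H|^{1/2}}\sum_{i=1}^{n}\exp\left(-\frac{1}{2}D^2(x,x_i,H)\right)$$
and the mean shift vector reduces to
$$m(x)\equiv\frac{\sum_{i=1}^{n}x_i\exp\left(-\frac{1}{2}D^2(x,x_i,H)\right)}{\sum_{i=1}^{n}\exp\left(-\frac{1}{2}D^2(x,x_i,H)\right)}-x$$
which is the fixed-bandwidth mean shift vector.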
A mode seeking algorithm can be derived by iteratively computing the fixed- or variable-bandwidth mean shift vector. The partition of the feature space is obtained by grouping together all the data points that converged to the same mode.
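The mode seeking iteration and the resulting partition can be sketched in one dimension, where each bandwidth matrix reduces to a scalar variance. This is a minimal sketch under stated assumptions: the function names, tolerances, and sample data are hypothetical, and the merge tolerance used to decide that two points converged to the same mode is an illustrative choice.

```python
import math

def vb_mean_shift_step(x, samples, variances):
    """Variable-bandwidth mean shift vector at x for 1-D data (H_i is a variance)."""
    # Unnormalized weights |H_i|^(-1/2) exp(-D^2 / 2)
    a = [math.exp(-0.5 * (x - xi) ** 2 / hi) / math.sqrt(hi)
         for xi, hi in zip(samples, variances)]
    total = sum(a)
    w = [ai / total for ai in a]
    # Inverse of the data-weighted harmonic mean of the bandwidths
    inv_hh = sum(wi / hi for wi, hi in zip(w, variances))
    weighted = sum(wi * xi / hi for wi, xi, hi in zip(w, samples, variances))
    return weighted / inv_hh - x

def find_mode(x, samples, variances, tol=1e-8, max_iter=500):
    """Translate x by the mean shift vector until the shift is negligible."""
    for _ in range(max_iter):
        shift = vb_mean_shift_step(x, samples, variances)
        x += shift
        if abs(shift) < tol:
            break
    return x

def partition(samples, variances, merge_tol=1e-3):
    """Group together the data points that converge to the same mode."""
    modes, labels = [], []
    for xi in samples:
        m = find_mode(xi, samples, variances)
        for k, mk in enumerate(modes):
            if abs(m - mk) < merge_tol:
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels

# Two clusters of points, each point with its own uncertainty (illustrative data).
samples = [0.9, 1.0, 1.1, 4.9, 5.0, 5.2]
variances = [0.04, 0.01, 0.04, 0.04, 0.01, 0.09]
modes, labels = partition(samples, variances)
```

Setting all variances to the same value turns the step into the fixed-bandwidth mean shift. Points with small variance pull the mode estimate more strongly than uncertain points, which is the purpose of the variable bandwidth.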
Step 202 includes developing an uncertainty model of data. Location-dependent uncertainty, such as covariance matrices, in invariant space will now be described. For a given location (x,y) in the image, denote by $\hat{R}(x,y)$, $\hat{G}(x,y)$, $\hat{B}(x,y)$ the observed color data. Assume that $\hat{R}$, $\hat{G}$, and $\hat{B}$ are normal with mean R, G, and B, and identical standard deviation σ. To derive uncertainties in normalized color space, certain computations can be utilized.
The illumination prior assumption is that a scene contains multiple light sources with the same spectral distribution, with no constraint on individual intensities. An invariant representation of color data is obtained through the transformation $T:R^3\rightarrow R^2$ which normalizes R and G by S=R+G+B
$$r=\frac{R}{S},\qquad g=\frac{G}{S}$$
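In Step 204 a feature space is built using the uncertainty of data described above. Due to the nonlinear character of the transformation T(.), uncertainties in the normalized estimates $\hat{r}$ and $\hat{g}$ depend not only on the sensor noise variance, but also on the true, unknown values of the underlying samples. Under the assumption of a moderate signal to noise ratio, that is, σ<<S, $(\hat{r},\hat{g})^T$ can be approximated as normally distributed with pixel-dependent covariance matrix
$$\Sigma_{\hat{r},\hat{g}}=\begin{pmatrix}\sigma_{\hat{r}}^2 & E[(\hat{r}-r)(\hat{g}-g)]\\ E[(\hat{r}-r)(\hat{g}-g)] & \sigma_{\hat{g}}^2\end{pmatrix}$$
where, by first-order error propagation through r=R/S and g=G/S,
$$\sigma_{\hat{r}}^2=\frac{\sigma^2}{S^2}\left(1-\frac{2R}{S}+\frac{3R^2}{S^2}\right),\qquad \sigma_{\hat{g}}^2=\frac{\sigma^2}{S^2}\left(1-\frac{2G}{S}+\frac{3G^2}{S^2}\right),\qquad E[(\hat{r}-r)(\hat{g}-g)]=\frac{\sigma^2}{S^2}\left(-\frac{R+G}{S}+\frac{3RG}{S^2}\right)$$
In normalized space the covariance matrix for each pixel is different: darker regions in the RGB image, that is, regions with small S, correspond to regions with high variance in the normalized image.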
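The dependence of the normalized-color uncertainty on brightness can be checked numerically. The sketch below applies first-order error propagation to r=R/S and g=G/S with independent, equal-variance noise on R, G, and B; the function name and the pixel values are assumptions made for illustration, and S>0 is assumed.

```python
def normalized_color_covariance(R, G, B, sigma):
    """Return (r, g) and the 2x2 covariance of the normalized color estimates.

    First-order error propagation, assuming independent noise of standard
    deviation sigma on each of R, G, B, and S = R + G + B > 0.
    """
    S = R + G + B
    r, g = R / S, G / S
    scale = sigma ** 2 / S ** 2
    var_r = scale * (1 - 2 * r + 3 * r * r)
    var_g = scale * (1 - 2 * g + 3 * g * g)
    cov_rg = scale * (-(r + g) + 3 * r * g)
    return (r, g), ((var_r, cov_rg), (cov_rg, var_g))

# A bright gray pixel and a dark gray pixel with the same sensor noise.
_, cov_bright = normalized_color_covariance(200.0, 200.0, 200.0, sigma=2.0)
_, cov_dark = normalized_color_covariance(20.0, 20.0, 20.0, sigma=2.0)
variance_ratio = cov_dark[0][0] / cov_bright[0][0]
```

Both pixels have the same chromaticity, but the dark pixel, with S ten times smaller, has a variance one hundred times larger, since the variance scales with 1/S².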
A similar technique can be used to compute optical flow and motion vectors with their associated uncertainties. Preferably the present invention employs optical flow and motion vector techniques described in Bayesian Multi-scale Differential Optical Flow, E. P. Simoncelli, Handbook of Computer Vision and Applications (1999), Vol. 2; Chapter 14; pages 397-422, which is incorporated by reference herein in its entirety.
To model a background (step 208), detect lines and shapes (step 210), and segment video data (step 212), a Hough Transform can be used. A Hough Transform is a technique to represent geometrical structures by voting. The main idea is that a change in representation converts a point grouping problem into a peak detection problem. When detecting lines, every point “votes for” any line it might belong to. In discretized line parameter space, each bucket represents a particular line. The bucket contains the number of edges that support that line segment. Large buckets in Hough space correspond to lines in point space. The original Hough approach does not take into account any possible uncertainty in the positions and orientations of the points voting for possible lines. A variant of the Hough Transform can be described where the votes of points for lines are a function of the uncertainty in the positions and orientations of those points.
  
The uncertainty of the orientation of the gradient at point (xi,yi) is denoted σθ. Most often, edge detection is performed by: (1) smoothing and differentiating the image along x and y using linear filters, (2) estimating the norm of the gradient
$$\|g\|=\sqrt{I_x^2+I_y^2}$$
(3) extracting the local maxima of the norm of the gradient in the image, which are edge points, and (4) estimating the orientation θ=arctan(Iy/Ix). To a first approximation, it can be considered that non-maxima suppression has no influence on the variance of θ and influences only the miss and false positive rates of edge detection. If image smoothing and differentiation is performed by a linear filter W, it can be shown that σθ2=CW*(σ2/∥g∥2), where σ2 is the variance of the image intensity and CW is a constant related to the coefficients of the linear filter W.
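The relation between gradient strength and orientation uncertainty can be illustrated as follows. The function name and the value of the filter constant are hypothetical; `math.atan2` is used in place of arctan(Iy/Ix) only so that the example also handles Ix=0.

```python
import math

def gradient_orientation(ix, iy, sigma, c_w):
    """Return gradient norm, orientation, and orientation variance.

    sigma is the standard deviation of the image intensity noise and c_w is
    the filter-dependent constant C_W (its value here is an assumption).
    """
    norm_sq = ix * ix + iy * iy
    theta = math.atan2(iy, ix)
    var_theta = c_w * sigma ** 2 / norm_sq
    return math.sqrt(norm_sq), theta, var_theta

strong = gradient_orientation(30.0, 40.0, sigma=2.0, c_w=0.5)
weak = gradient_orientation(3.0, 4.0, sigma=2.0, c_w=0.5)
```

The two edges have the same orientation, but the weak edge, with a gradient norm ten times smaller, has an orientation variance one hundred times larger.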
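For an edge point (x,y) with estimated gradient orientation θ, the vote is Θ=(ρ, θ) with ρ=xcos(θ)+ysin(θ). Propagating the orientation uncertainty σθ and the positional uncertainty σp of the edge point through this expression gives the covariance matrix of the vote
$$\Sigma_\Theta=\begin{pmatrix}\sigma_\rho^2 & \sigma_{\rho\theta}\\ \sigma_{\rho\theta} & \sigma_\theta^2\end{pmatrix}$$
with σρ2=k2σθ2+σp2 and σρθ=kσθ2 and k=ycosθ−xsinθ, where k is the derivative of ρ with respect to θ. Because of the uncertainty associated with the vote Θ=(ρ, θ), the edge point 502 (x,y) votes not only in bin Θ=(ρ, θ) but also in the adjacent bins. The contribution of (x,y) to each bin in Hough space is equal to: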
  
    
  
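A vote that spreads over adjacent bins can be sketched as follows. The covariance uses the expressions for σρ2, σρθ, and k given in the text; the bivariate normal form of the contribution is an assumption made for this illustration, and the function names and numeric values are hypothetical.

```python
import math

def vote_covariance(x, y, theta, var_theta, var_p):
    """Covariance entries (var_rho, cov_rho_theta, var_theta) of a vote."""
    k = y * math.cos(theta) - x * math.sin(theta)  # d(rho)/d(theta)
    var_rho = k * k * var_theta + var_p
    cov_rho_theta = k * var_theta
    return var_rho, cov_rho_theta, var_theta

def vote_weight(rho_bin, theta_bin, rho, theta, cov):
    """Assumed bivariate normal contribution of a vote to the bin center."""
    var_rho, c, var_theta = cov
    det = var_rho * var_theta - c * c
    d_rho, d_theta = rho_bin - rho, theta_bin - theta
    # Mahalanobis distance using the closed-form inverse of a 2x2 matrix
    maha = (var_theta * d_rho ** 2 - 2 * c * d_rho * d_theta
            + var_rho * d_theta ** 2) / det
    return math.exp(-0.5 * maha) / (2 * math.pi * math.sqrt(det))

# Edge point (0, 10) with orientation 0, so rho = 0 and k = 10.
x, y, theta = 0.0, 10.0, 0.0
rho = x * math.cos(theta) + y * math.sin(theta)
cov = vote_covariance(x, y, theta, var_theta=0.01, var_p=1.0)
at_vote = vote_weight(rho, theta, rho, theta, cov)
at_neighbor = vote_weight(rho + 1.0, theta, rho, theta, cov)
```

A point with a well-estimated orientation concentrates its vote in a few bins, while an uncertain point spreads a lower contribution over many bins.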
Background Modeling forms a central module in surveillance systems using Computer Vision to detect events of interest in a video stream. Current methods use only the intensity observed at a pixel. Such a model is reasonable when the background is stationary. However, these methods deteriorate in discrimination power when the background is dynamic.
A method according to an embodiment of the present invention accounts for dynamic backgrounds that change according to a certain pattern.
Once optical flow has been determined as described above, a probability distribution on the joint 5-D space of intensity (3 color components) and flow (2 flow components), can be constructed. Although the regular RGB space can be used, improved insensitivity to changes in illumination can be obtained if the normalized RG+intensity I space is used. The intensity is retained with a high variance so that some discriminability is retained between observations that may have the same chromaticity (that is, normalized r and g values) but very different intensities (for example, white, grey and black all have the same chromaticity).
Given previous observations of intensity and flow, the probability distribution can be developed in several ways. A method according to an embodiment of the present invention comprises kernel density estimation. Let x1,x2, . . . xn be n observations determined to belong to a model. The probability density function can be non-parametrically estimated (known as the Parzen window estimate in pattern recognition) using the kernel function K as
$$\hat{f}(x)=\frac{1}{n}\sum_{i=1}^{n}K(x-x_i)$$
Choosing the kernel estimator function, K, to be the Normal function, where Σ represents the kernel function bandwidth, the density can be written as
$$\hat{f}(x)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-x_i)^T\Sigma^{-1}(x-x_i)\right)$$
The combined covariance matrix Σ is derived from the covariances for the normalized color and optical flow. A general form for the covariance matrix can be derived, but for simplicity, the case where the cross-covariance between intensity and optical flow is zero is described. For the covariance of the color in invariant space, the formula obtained by error propagation in invariant space can be used. Assuming that the cross-covariance between intensity and flow is zero, the combined covariance matrix can be written as:
$$\Sigma=\begin{pmatrix}\Sigma_{\hat{r},\hat{g}} & 0 & 0\\ 0 & \sigma_i^2 & 0\\ 0 & 0 & \Theta_f\end{pmatrix}$$
where 0's represent the appropriate zero matrices. In the above formula, $\Sigma_{\hat{r},\hat{g}}$ represents the covariance of the normalized color, σi represents the standard deviation of the intensity, and Θf represents the covariance of the motion flow.
For each new observation, the probability is calculated using the above equations. If the probability is below a certain value, the pixel is considered new, that is, not explained by the background model. This is determined for each pixel in the scene, and detection is performed after applying morphological operations so that noise is removed. Information about the size of the objects is used so that only objects above a certain size are detected. This is done by using not only pixels connected to each other, but also pixels that might not be connected but can otherwise belong to an object.
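The per-pixel test can be sketched for a single pixel as follows. The sketch assumes one shared block-diagonal covariance for all samples, so the 5-D normal kernel factors into a color term, an intensity term, and a flow term; the function names, sample values, covariances, and threshold are hypothetical, and the morphological post-processing is not shown.

```python
import math

def gauss1(d, var):
    """1-D normal density at offset d."""
    return math.exp(-0.5 * d * d / var) / math.sqrt(2 * math.pi * var)

def gauss2(d, cov):
    """2-D normal density at offset d for a symmetric 2x2 covariance."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    maha = (c * d[0] ** 2 - 2 * b * d[0] * d[1] + a * d[1] ** 2) / det
    return math.exp(-0.5 * maha) / (2 * math.pi * math.sqrt(det))

def background_probability(obs, samples, cov_rg, var_i, cov_flow):
    """Kernel density estimate over (r, g, I, u, v) background samples."""
    total = 0.0
    for s in samples:
        d = [o - v for o, v in zip(obs, s)]
        total += gauss2(d[0:2], cov_rg) * gauss1(d[2], var_i) * gauss2(d[3:5], cov_flow)
    return total / len(samples)

# Background samples: similar color, flow close to (1, 0). Illustrative values.
samples = [
    (0.33, 0.33, 100.0, 1.0, 0.0),
    (0.34, 0.33, 102.0, 1.1, 0.0),
    (0.33, 0.34, 98.0, 0.9, 0.1),
]
cov_rg = ((1e-4, 0.0), (0.0, 1e-4))
cov_flow = ((0.04, 0.0), (0.0, 0.04))
var_i = 400.0  # intensity kept with a high variance
threshold = 1e-3

p_background = background_probability((0.33, 0.33, 100.0, 1.0, 0.0), samples, cov_rg, var_i, cov_flow)
p_moving = background_probability((0.33, 0.33, 100.0, -1.0, 0.0), samples, cov_rg, var_i, cov_flow)
is_new = p_moving < threshold
```

The second observation has the same color as the background but moves in the opposite direction, so its probability under the model falls far below the threshold.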
Mixture-model-based and kernel-based methods use only the intensity feature to build a probability distribution on the RGB (or normalized RGB) space. When only the intensity feature is used, objects having colors similar to the background cannot be detected. People camouflaged according to the color of the background can easily escape detection under this model. The problem becomes more severe if the background is dynamic, as with ocean waves, waving trees, and moving clouds, so that a wide variety of intensities can be observed at a particular pixel. Such a wide spectrum in the observations means that the discriminability of the system will be very low and many objects will not be detected. Using the flow feature along with the intensity helps to detect not only objects having a different color than the background, but also objects that have the same color characteristics as the background but move in a direction different from the direction of motion of the background. The discriminability of such a system is retained even in the presence of dynamic backgrounds.
The probability density function in the joint spatial-color domain will now be described. Following color transformation from RGB to normalized rg space, each image pixel z is characterized by a location x=(x1,x2)T and a color c=(c1,c2)T≡(r,g)T. An input image of n pixels is represented as a collection of d-dimensional points zi=(xiT,ciT)T, with d=4 and i=1 . . . n. The 4-dimensional space constructed is called the joint spatial-color domain.
The task of image segmentation reduces to partitioning of data points zi according to their probability density. The number of image segments is determined by the number of modes in the joint space, while segment delineation is defined by the valleys that separate the modes.
To estimate probability density in joint space, a product kernel with variable bandwidth for color coordinates is utilized. The rationale is that in normalized color space the uncertainty varies with the location, as illustrated above. It has been proven that by adapting the kernel bandwidth to the statistics of the data, the estimation bias decreases. The bandwidth matrix associated with the color component of data point i is denoted by Hi=diag{hi12,hi22}. Hi quantifies the uncertainty of ci. The bandwidth for the spatial domain is taken constant and isotropic, that is, H=hI2 where I2 is the unit matrix of dimension 2.
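The density estimator with normal kernel computed at location z=(xT,cT)T is given by
$$\hat{f}(z)=\frac{1}{n(2\pi)^{d/2}}\sum_{i=1}^{n}\frac{1}{|H|^{1/2}|H_i|^{1/2}}\exp\left(-\frac{1}{2}d^2(x,x_i,H)\right)\exp\left(-\frac{1}{2}d^2(c,c_i,H_i)\right)$$
where
$$d^2(c,c_i,H_i)\equiv(c-c_i)^T H_i^{-1}(c-c_i)$$
is the Mahalanobis distance from c to ci. A similar definition holds for d2(x,xi,H).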
Using the notations
  
    
  
  
    
  
  
    
  
the density estimator becomes
  
    
  
The variable bandwidth mean shift equations for mode detection are now described. Additionally, computation of local modes, that is, peaks, of the density function are now described. Mode detection in joint space employs mean shift iterations for both x and c components of z. By taking the gradient of
  
    
  
with respect to x, it results that the mean shift vector for the x component is given by
  
    
  
The gradient of
  
    
  
with respect to c yields a mean shift vector for the c component
  
    
  
where
  
    
  
The above gradients with respect to x and c, provide the components of the joint mean shift vector
  
  
$$m(z)=\left(m_x^T(z),\,m_c^T(z)\right)^T$$
The iterative computation of the above vector and translation of z by that amount, leads to a local mode, that is, peak, of the density. Strictly speaking, the mean shift iterations lead to a stationary point. Additional precautions should be taken to make certain that the convergence point is a local maximum.
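The joint iteration can be sketched for diagonal color bandwidths and an isotropic spatial bandwidth. This is a sketch under stated assumptions rather than the exact equations of the text: the spatial bandwidth matrix is taken as h² times the identity, the weights follow from the product normal kernel, and the function names and pixel data are hypothetical. The check that the convergence point is a local maximum is not included.

```python
import math

def joint_mean_shift_step(z, pixels, h):
    """Mean shift vector at z = (x1, x2, c1, c2).

    pixels holds tuples (x1, x2, c1, c2, v1, v2), where v1 and v2 are the
    per-pixel color variances (the diagonal of H_i).
    """
    x1, x2, c1, c2 = z
    sw = sx1 = sx2 = 0.0
    s_inv1 = s_inv2 = s_c1 = s_c2 = 0.0
    for px1, px2, pc1, pc2, v1, v2 in pixels:
        d_x = ((x1 - px1) ** 2 + (x2 - px2) ** 2) / (h * h)
        d_c = (c1 - pc1) ** 2 / v1 + (c2 - pc2) ** 2 / v2
        a = math.exp(-0.5 * (d_x + d_c)) / math.sqrt(v1 * v2)
        sw += a
        sx1 += a * px1
        sx2 += a * px2
        s_inv1 += a / v1
        s_inv2 += a / v2
        s_c1 += a * pc1 / v1
        s_c2 += a * pc2 / v2
    # Spatial part: weighted mean. Color part: inverse-variance weighted mean.
    new = (sx1 / sw, sx2 / sw, s_c1 / s_inv1, s_c2 / s_inv2)
    return tuple(n - o for n, o in zip(new, z))

def find_joint_mode(z, pixels, h, tol=1e-10, max_iter=200):
    """Translate z by the joint mean shift vector until it stops moving."""
    for _ in range(max_iter):
        shift = joint_mean_shift_step(z, pixels, h)
        z = tuple(a + b for a, b in zip(z, shift))
        if max(abs(s) for s in shift) < tol:
            break
    return z

# Two small regions with different normalized colors (illustrative data).
left = [(float(i), 0.0, 0.2, 0.3, 1e-4, 1e-4) for i in range(3)]
right = [(float(i), 0.0, 0.6, 0.3, 1e-4, 1e-4) for i in range(10, 13)]
pixels = left + right
mode_left = find_joint_mode((0.0, 0.0, 0.2, 0.3), pixels, h=2.0)
mode_right = find_joint_mode((12.0, 0.0, 0.6, 0.3), pixels, h=2.0)
```

Pixels starting in the left region converge to a mode centered on that region with its color, and pixels in the right region converge to a separate mode, which is the basis for the segmentation.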
The segmentation procedure is now described. By estimating the sensor noise σ, the pixel-dependent covariance matrix of the normalized estimates described above can be employed to compute the covariance matrix associated with the normalized color of each pixel. The components of the color bandwidth matrix Hi=diag{hi12,hi22} are taken proportional to $\sigma_{\hat{r}}^2$ and $\sigma_{\hat{g}}^2$, respectively. The mode estimation process is thus adapted to the local uncertainty in the data. In this implementation the contribution of $E[(\hat{r}-r)(\hat{g}-g)]$ is neglected.
Using the algorithm described above, the modes in the joint space are first detected. Since plateaus may appear in the density function, the modes that are sufficiently close to each other are grouped together and a label is assigned to each group. The metric for distance evaluation is based on the matrices H and Hc(z), computed at the convergence point. Region delineation is then obtained by associating each pixel with its mode and assigning to it the label of the group to which the mode belongs.
Segmentation in normalized subspace is thus particularly advantageous when frames of a video sequence are known to contain shadows or illumination effects. At the same time, a decrease in resolution occurs, for example the chair feet are not recovered in the normalized space. Additionally, the generality of the proposed framework is shown. Various embodiments according to the present invention can additionally be applied to other illumination or geometric invariants.
A method according to the present invention can be used for object detection in a variety of scenes. The present invention can be used in applications such as traffic monitoring, surveillance systems in the presence of moving backgrounds (for example, waving trees and ocean waves), activity detection, automatic traffic lights, monitoring in high security areas, and delineating people in a scene for detection. In many of these applications, the background might be dynamic and has to be discounted. This is not possible with current prior art background adaptation methods. The present invention advantageously allows dealing with more complex scenes, and gets better results in scenes where prior art methods are currently being used.
The teachings of the present disclosure are preferably implemented as a combination of hardware and software. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage unit. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more Central Processing Units (“CPUs”), a Random Access Memory (“RAM”), and Input/Output (“I/O”) interfaces. The computer platform may also include an operating system and micro instruction code. The various processes and functions described herein may be either part of the micro instruction code or part of the application program, or any combination thereof, which may be executed by a CPU. In addition, various other peripheral units may be connected to the computer platform such as an additional data storage unit and an output unit.
It is to be further understood that, because some of the constituent system components and steps depicted in the accompanying drawings may be implemented in software, the actual connections between the system components or the process function blocks may differ depending upon the manner in which the present disclosure is programmed. Given the teachings herein, one of ordinary skill in the pertinent art will be able to contemplate these and similar implementations or configurations of the present disclosure.
Although illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the present disclosure is not limited to those precise embodiments, and that various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present disclosure. All such changes and modifications are intended to be included within the scope of the present disclosure as set forth in the appended claims.
This application claims the benefit of U.S. Provisional Application Ser. No. 60/362,015 filed on Mar. 6, 2002, which is incorporated by reference herein in its entirety.
| Number | Date | Country |
|---|---|---|
| 60362015 | Mar 2002 | US |

| Relation | Number | Date | Country |
|---|---|---|---|
| Parent | 10382437 | Mar 2003 | US |
| Child | 12198349 | | US |