Detecting Moving Objects in Video by Classifying on Riemannian Manifolds

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram of a method for training a classifier for classifying test data according to an embodiment of the invention;

FIG. 2 is a flow diagram of a procedure for extracting low-level features from test data according to an embodiment of the invention;

FIG. 3 is a flow diagram of a procedure for converting low-level features to high-level features according to an embodiment of the invention;

FIGS. 4A-4E are flow diagrams of procedures for training a classifier according to an embodiment of the invention;

FIG. 5 is a flow diagram of a procedure for determining an intrinsic mean covariance matrix according to an embodiment of the invention;

FIG. 6 is a flow diagram of details of a procedure for determining high-level features from motion cues according to an embodiment of the invention.

FIG. 7 is pseudo-code of details of a procedure for training a LogitBoost classifier according to an embodiment of the invention.

FIG. 8 a block diagram of a cascaded classifier for objects according to an embodiment of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Classifier Construction and Classification

FIG. 1 shows a method 100 for constructing a trained classifier according to an embodiment of our invention. The classifier is constructed and trained using training data. By training data, we mean that the data are already labeled. The training data are used both to extract (labeled) features and to verify or measure a performance of the trained classifier. The trained classifier can then be used to classify test data.

Low-level features 101 are extracted 200 from training data 102. The low-level features 101 are used to generate 300 high-level features 301. The high level-features are the form of positive definite matrices on an analytical manifold.

A subset 111 of the high-level feature 301 is selected 110. The subset 111 of the selected high-level features are used to determine 120 an intrinsic mean covariance matrix 121. The intrinsic mean covariance matrix 121 defines a tangent space of the analytical manifold for the subset of high-level features. Tangent space is a local Euclidean space. The intrinsic mean matrix is used to map (project) 130 each high-level feature 301 to a feature vector 131 in the local Euclidean space of the manifold. Then, the feature vectors 131 are used to train 400 a classifier model 410 to produce the trained classifier 109.

Subsequently, the trained classifier 601 can be used to classify 140 test data 104. The classification assigns labels 105 to the test data. Feature vectors are produces for the test data in the same manner as described above.

Extract Low-Level Features

FIG. 2 shows the extraction 200 for an example test data 102, e.g., an image or a video. It should be noted that the extraction of low-level features can also be for other data, such as acoustic signal, medical images, data sampled from physical processes, and the like.

The low-level features 101 can include pixel intensities, pixel colors, and derivative low-level features, such as gradients 201, texture 202, color histograms 203, and motion vectors 204.

Generate High-Level Features

The low-level features 101 are used to generate 300 the high-level features 301 on a analytical manifold. In a preferred embodiment, the high-level features are positive definite matrices on a Riemannian manifold, projected onto a tangent space using the intrinsic mean matrix. More specifically the positive definite matrices are covariance matrices of the low-level features. This is done by determining 310 covariance matrices 311 from the low-level features using windows 320.

High-level features in the form of covariance matrices is described generally in U.S. patent application Ser. No. 11/305,427, “Method for Constructing Covariance Matrices From Data Features,” filed by Porikli et al, on Dec. 14, 2005, incorporated herein by reference,

For objects that are symmetric along one or more axes, we construct high-level features for the image windows that are symmetrical parts along the corresponding axes. For example, for human or face, the objects are symmetrical along a vertical line passing through the center of the image, thus, the high level features are computed in two symmetrical regions along that axis instead of only one region.

Covariance Descriptors

The covariance matrix provides a natural way for combining multiple low-features features that might otherwise be correlated. The diagonal entries of each covariance matrix represent the variance of each high-level feature and the non-diagonal entries represent the correlations of the high-level features. Because covariance matrices do not lie in Euclidean space, that method uses a distance metric involving generalized eigenvalues, which follow from Lie group structures of positive definite matrices.

Covariance descriptors, which we adapt for human detection in images according to embodiments of our invention, can be described as follows. A one-dimensional intensity or three-dimensional color image is I, and a W×H×d dimensional low-level feature image extracted from the image I is

F(x, y)=Φ(I, x, y), (1)

where the function Φ can be any mapping, such as intensity, color, gradients, filter responses, etc. For a given rectangular detection window or region R in the feature image F, the d-dimensional features inside the rectangular region R is {z_j}_{i=1 . . . S}. The region R is represented with the d×d covariance matrix of the features

$\begin{matrix} C_{R} = \frac{1}{S - 1} \sum_{i = 1}^{S} (z_{i} - μ) {(z_{i} - μ)}^{T}, & (2) \end{matrix}$

where μ is the mean of the features z, and T is the transform operator.

For the human detection problem, we define the mapping Φ(I, x, y) as eight (d=8) low-level features

$\begin{matrix} {{x y \langle I_{x} \rangle \langle I_{y} \rangle \sqrt{I_{x}^{2} + I_{y}^{2}} \langle I_{xx} \rangle \langle I_{yy} \rangle \arctan \frac{\langle I_{x} \rangle}{\langle I_{y} \rangle}]}^{T}, & (3) \end{matrix}$

where x and y are pixel coordinates, I_x, I_xx, . . . are intensity derivatives, arctan(|I_x|/|I_y|) is an edge orientation, and T is the transpose operator.

We can use different type and number of low-level features for detection.

With the defined mapping, the input image is mapped to the eight-dimensional low-level feature image F as defined by Equation (3). The covariance descriptor of the region r is the 8×8 covariance matrix C_R. Due to symmetry, only an upper triangular part is stored, which has only 36 different values. The descriptor encodes information of the variances of the defined features inside the region, their correlations with each other, and a spatial layout.

The covariance descriptors can be determined using integral images, O. Tuzel, F. Porikli, and P. Meer, “Region covariance: A fast descriptor for detection and classification,” Proc. European Conf. on Computer Vision, Graz, Austria, volume 2, pages 589-600, 2006, incorporated herein by reference.

After constructing d(d+1)/2 integral images, the covariance descriptor of any rectangular region can be determined independent of the size of the region, see Tuzel et al. above. Given an arbitrary sized region R, there are a very large number of covariance descriptors that can be from subregions r_{1,2, . . .}.

As shown in FIG. 8, the integral image 102 can be partitioned into multiple regions 321. Locations of windows 320 are generated 325 from the given training regions 321, according shape and size constraints 322. For each window, the low-level features 101 in the window 320 are used to determine a covariance matrix.

We perform sampling and consider subregions r, starting with a minimum size of 1/10 of the width and height of the detection regions R, at all pixel locations. The size of the subwindow r is incremented in steps of 1/10 along the horizontal or vertical directions, or both, until the subregion equals the region, r=R.

Although this approach might be considered redundant due to overlaps, the overlapping regions are an important factor in detection performances. The boosting mechanism, which is described below, enables us to search for the best regions. The covariance descriptors are robust towards illumination changes. We enhance this property to also include local illumination variations in an image.

A possible feature subregion r is inside the detection region R. We determine the covariance of the detection regions C_Rand subregion C_rusing the integral image representation described above. The normalized covariance matrix is determined by dividing the columns and rows of the covariance matrix C_rwith respective diagonal entries of the matrix C_R. This is equivalent to first normalizing the feature vectors inside the region R to have zero mean and unit standard deviation, and after that, determining the covariance descriptor of the subregion r.

Using the windows 325, the covariance matrices 311 can be constructed 330 on the Riemannian manifold. The matrices 311 can then be normalized 340 using the windows 320 to produce the high-level features 301.

Projection to Tangent Space

The d×d dimensional symmetric positive definite matrices (nonsingular covariance matrices) Sym⁺_d, can be formulated as a connected Riemannian manifold and an invariant Riemannian metric on the tangent space of Sym⁺_d, is

$\begin{matrix} < y, z > x = tr (X^{- \frac{1}{2}} {yX}^{- 1} {zX}^{- \frac{1}{2}}) . & (4) \end{matrix}$

The exponential map associated to the Riemannian metric

$\begin{matrix} \exp_{X} (y) = X^{\frac{1}{2}} \exp (X^{- \frac{1}{2}} {yX}^{- \frac{1}{2}}) X^{\frac{1}{2}} & (5) \end{matrix}$

is a global diffeomorphism (one-to-one, onto and continuously differentiable mapping in both directions). Therefore, the logarithm is uniquely defined at all the points on the manifold

$\begin{matrix} \log_{X} (Y) = X^{\frac{1}{2}} \log (X^{- \frac{1}{2}} {YX}^{- \frac{1}{2}}) X^{\frac{1}{2}} . & (6) \end{matrix}$

The operators exp and log are the conventional matrix exponential and logarithm operators. Not to he confused, the operators exp_Xand log_Xare the manifold specific operators, which are also point dependent, X ∈ Sym⁺_d. The tangent space of Sym⁺_dis the space of d×d symmetric matrices, and both the manifold and the tangent: spaces are m=d(d+1)/2 dimensional.

For symmetric matrices, the conventional matrix exponential and logarithm operators can be determined as follows. As is well known, an eigenvalue decomposition of a symmetric matrix is Σ=UDU^T. The exponential series is

$\begin{matrix} \exp (\sum) = \sum_{k = 0}^{\infty} \frac{\sum^{k}}{k!} = U \exp (D) U^{T}, & (7) \end{matrix}$

where exp(D) is the diagonal matrix of the eigenvalue exponentials. Similarly, the logarithm is

$\begin{matrix} \log (\sum) = \sum_{k = 1}^{\infty} \frac{{(- 1)}^{k - 1}}{k} {(\sum - I)}^{k} = U \log (D) U^{T} . & (8) \end{matrix}$

The exponential operator is always defined, whereas the logarithms only exist for symmetric matrices with positive eigenvalues, Sym⁺_d. From the definition of the geodesic given above, the distance between two points on Sym⁺_dis measured by substituting Equation (6) into Equation (4)

$\begin{matrix} \begin{matrix} d^{2} (X, Y) = < \log_{X} (Y), \log_{X} (Y) > x \\ = tr (\log^{2} (X^{- \frac{1}{2}} {YX}^{- \frac{1}{2}})) . \end{matrix} & (9) \end{matrix}$

We note that an equivalent form of the affine invariant distance metric can be given in terms of the joint eigenvalues of X and Y.

We define an orthogonal coordinate system on the tangent space with the vector operation. The orthogonal coordinates of a vector y on the tangent space at point X is given by a mapping vector

$\begin{matrix} {vec}_{X} (y) = upper (X^{- \frac{1}{2}} {yX}^{- \frac{1}{2}}), & (10) \end{matrix}$

where the upper operator refers to the vector form of the upper triangular part of the matrix. The mapping vec_x, relates the Riemannian metric of Equation (4) on the tangent space to the canonical metric defined in .

Intrinsic Mean Covariance Matrices

We improve classification accuracy by determining the intrinsic mean covariance matrix 121. Covariance matrices do not conform to Euclidean geometry. Therefore, we use elliptical or Riemannian geometry. Several methods are known for determining the mean of symmetric positive definite (Hermitian) matrices, such as our covariance matrices (high-level features 301), see Pennec et at, “A Riemannian framework for tensor computing,” In Intl. J. of Computer Vision, volume 66, pages 41-66, January 2006, incorporated herein by reference.

A set of points on a Riemannian manifold is {X_i}_{i=1 . . . N}. Similar to Euclidean spaces, the Karcher mean of the points on Riemannian manifold is the point on the manifold that minimizes the sum of squared distances

$\begin{matrix} μ = \arg \min_{Y \in ℳ} \sum_{i = 1}^{N} d^{2} (X_{i}, Y), & (11) \end{matrix}$

which m our case is the distance metric d²of Equation (9).

Differentiating the error function with respect to Y and setting it equal to zero, yields

$\begin{matrix} μ^{i + 1} = \exp_{μ^{t}} [\frac{1}{N} \sum_{i = 1}^{N} \log_{μ^{t}} (X_{i})], & (12) \end{matrix}$

which can locate a local minimum of the error function using a gradient descent procedure. The method iterates by determining first order approximations to the mean on the tangent space. We replace the inside of the exponential, i.e., the mean of the tangent vectors, with the weighted mean

$\frac{1}{\sum_{i = 1}^{N} w_{i}} \sum_{i = 1}^{N} w_{i} \log_{μ^{t}} (X_{i}) .$

FIG. 5 shows the selecting step 110 and determining step 120 in greater detail. The intrinsic mean covariance matrix 121 obtained 500 as follows. For a given window 320 and the high-level features 301 therein, a subset of the covariance matrices (high-level features 301) is selected 510. A matrix 521 is selected 520 from the subset 111. The selected matrix 521 is assigned 530 as a current reference matrix.

A mean logarithm with respect to the current reference matrix is determined 540. A weighted sum is determined 550. The weighted sum is compared 560 to the reference matrix, and a change score is determined 570. If the change score is greater than some small threshold ε (Y), then a next matrix is selected and assigned.

Otherwise if not (N), the reference matrix is assigned 590 as the intrinsic mean covariance matrix 121. The intrinsic mean covariance matrix can now be used to map each high-level feature 301 to a corresponding feature vector 131. The feature vectors are used to train the classifier model 410.

Classification on Riemannian Manifolds

A training set of class labels is {(X_i, y_i)}_{i=1 . . . N}, where X ∈ and y₁∈ {0, 1}. We want to find a function F(X): {0, 1}, which partitions the manifold into two based on the training set of class labels.

Such a function, which partitions the manifold, is a complicated notion compared to a similar partitioning in the Euclidean space. For example, consider the simplest form a linear classifier ². A point and a direction vector on ²define a line that partitionsinto the two parts. Equivalently, on a two-dimensional differentiable manifold, we can consider a point on the manifold and a tangent vector on the tangent space of the point, which define a curve on the manifold via an exponential map. For example, if we consider the image of the lines on a 2D-torus, then the curve can never partition the manifold into two parts.

One method for classification maps the manifold to a higher dimensional Euclidean space, which can be considered as flattening the manifold. However in a general case, there is no such mapping that globally preserves the distances between the points on the manifold. Therefore, a classifier trained on the flattened space does not reflect the global structure of the points.

Classifiers

FIGS. 4A-4B show alternative embodiments for training the classifier 601. FIG. 4A shows a single classifier method. FIGS. 4B-4C show a boosted classifier method. FIG. 4D shows a margin cascade boosted classifier method.

Single Classifier

As shown in FIG. 4A for the single classifier 601 is trained as follows. The intrinsic mean covariance matrix is determined 120 from the high-level features 103. The high-level features 301 are projected 412 on a tangent space using the intrinsic mean covariance matrix 121 to map the high-level features. Unique coefficients of the matrix, after the projection, is reconstructed 620 as the feature vector 131. Using a selected classifier model 410 from available classifier models, the selected classifier is then trained using the feature vectors 131.

Boosted Classifier

FIG. 4B shows the steps for training a boosted classifier. A subset 111 of the high-level features 103 is selected 110. The intrinsic mean covariance matrix is determined 120 from the selected subset of high-level features 111. These features are used to train 600 a selected single classifier to produce a trained classifier 601.

The trained classifier is applied 422 to a portion of the training data 102, and a performance 125 of the classifier can be determined 424. If the performance is acceptable, then the classifier is added 426 to the set of classifiers 401.

Additional classifiers can then be evaluated, via step 428, until the desired number of classifiers have been accumulated in the set of classifiers 401. It should be noted, that a different subset of the high-level features can be selected for each classifier to be trained. In this case, an intrinsic mean covariance matrix is determined for each selected subset of high-level features.

For the boosting, the set of classifiers 401 can be further evaluated as shown in FIG. 4C. Using the performance data 425, a best classifier 431 is selected 430. The best classifier 431 is applied 432 to a portion of the training data 102 and a cumulative performance 435 is determined 434. If the cumulative performance 435 is less than a predetermined target performance, in step 436, then the weights of the training data are updated 438, and another classifier can be trained. Otherwise, the training of the boosted classifier is done 439.

We describe an incremental approach by training several weak classifiers on the tangent space and combining the weak classifiers through boosting. We start by defining mappings from neighborhoods on the manifold to the Euclidean space, similar to coordinate charts. Our maps are the logarithm maps, log_X, that map the neighborhood of points X to the tangent spaces T_X. Because this mapping is a homeomorphism around the neighborhood of the point, the structure of the manifold is preserved locally. The tangent space is a vector space, and we train the classifiers on this space. The classifiers can be trained on the tangent space at any point on the manifold. The mean of the points minimizes the sum of squared distances on the manifold. Therefore, the mean is a good approximation up to a first order,

During each iteration, we determine the weighted mean of the points, where the weights are adjusted through boosting. We map the points to the tangent space at the mean and train a weak classifier on this vector space. Because the weights of the samples, which are misclassified during earlier stages of boosting increase, the weighted mean moves towards these points producing more accurate classifiers for these points. This approach minimizes the approximation error through averaging over several weak classifiers.

LogitBoost on Riemannian Manifolds

We start with brief description of the conventional LogitBoost method on vector spaces, J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” Ann. Statist., 28(2):337-407, 2000, incorporated herein by reference.

We consider the binary classification problem, y_i∈ {0, 1}. The probability of the point x being in class 1 is represented by

$\begin{matrix} p (x) = \frac{e^{F (x)}}{e^{F (x)} + e^{- F (x)}} F (x) = \frac{1}{2} \sum_{t = 1}^{L} f_{l} (x) . & (13) \end{matrix}$

The LogitBoost method trains the set of regression functions {ƒ_l(x)}_{l=1 . . . L}(weak functions) by minimizing the negative binomial log-likelihood of the data l(y, p(x)) as

$\begin{matrix} - \sum_{i = 1}^{N} [y_{i} \log (p (x_{i})) + (1 - y_{i}) \log (1 - p (x_{i}))] & (14) \end{matrix}$

through Newton iterations. The LogitBoost method fits a weighted least square regression, ƒ_l(x) of training points (features) x_i∈ to response values z_i∈ with weights w_i.

Our LogitBoost method on Riemannian manifolds is different to the conventional LogitBoost at the level of weak functions. In our method, the domains of the weak functions are in such that ƒ_l(X): . Following the description above, we train the regression functions in the tangent space at the weighted mean of the points on the manifold. We define the weak functions as

ƒ_l(X)=g_i(vec_μ_l(log _μ_l(X))) (15),

and train the functions g_l(x): , and the weighted mean of the points μ_l∈ . Notice that, the mapping vector of Equation (10) gives the orthogonal coordinates of the tangent vectors.

Pseudo-code for the method is shown in FIG. 7. The steps marked with (*) are different from the conventional LogitBoost method. For functions {g_l}_{l=1 . . . L}, is possible to use any form of weighted least squares regression such as linear functions, regression stumps, etc., because the domains of the functions are in .

Input is a training set of class labels is {(X_i, y_i)}_{i=1 . . . N}, where X ∈ and y_i∈ {0, 1}, and we start with weights w_i. Then, we repeat l=1 . . . L the following steps. We compute the response values z_iand weights w_i. Then, we compute the weighted mean of the points μ_l. We map the data points to the tangent space at μ_l. We fit the function g(x) by the weighted least-square regression of z_ito x_iusing weights w_i, and update F(X) where ƒ_lis defined in Equation (15) and p(X) is defined in Equation (13). The method outputs the classifier sign

$[F (X)] = sign [\sum_{i = 1}^{L} f_{l} (X)] .$

Boosted Classifier with Adjustable Margin

For a margin cascade classifier as shown in FIGS. 4D and 4E, a probabilistic classifier model 441 is used to train 440 a strong boosted classifier using the high-level features 301, as described above. The strong trained boosted classifier is then applied 442 to a portion of the training data. 102.

Positive and negative samples of the training data are ordered 443 according to their probabilities to obtain two lists; a positive list for positive samples and a negative list for the negative samples.

Then, the probability of the particular positive example in the positive list that corresponds to the positive detection rate is obtained. This positive sample is assigned as the current positive probability. Similarly, the current negative probability is found using the negative list and the negative detection rate.

Then, the current negative probability is subtracted 444 from the current positive probability to obtain a current gap. A classifier decision threshold is set 453 to the half of the summations of the current positive and current negative probabilities, see FIG. 4E described below in greater detail. Two cost factors are defined. A cost of not missing any positive samples (CP) 451, and cost of minimizing false positive samples (CFP) 452. A margin 448 for classification is set by a user or adjusted based on the detection performance of the classifiers in the cascade.

The margin is used to determine the classifier decision threshold, using the gap and probabilities of target detection and rejection rates 454, based on a target margin 448, and target detection and rejection rates 449. The result can be used to remove 446 true negative samples from the training data 101, and to add 447 false positives as negative training data.

Adjusting the Margin

A size of the margin determines the speed and accuracy of the classifier. If the margin is large, in step 455, based on the CP and CFP costs, then the speed is fast, but the results can be less accurate. Decreasing the size of the margin, slows down the classification but increases the accuracy of the results. If it is desired to not miss any positive samples, then the threshold is shifted 456 towards the negative samples, i.e., the threshold is decreased. If it is desired to not detect any false positive samples, then the threshold is shifted 457 towards positive samples away from the negative training samples, i.e., the threshold value is increased.

We repeat adding 426 classifiers to the boosted classifier, as described above, until the current gap is greater than the margin in step 445.

Cascade of Rejectors

We employ a cascade of rejectors and a boosting framework to increase the speed of classification process. Each rejector is a strong classifier, and consists of a set of weighted linear weak classifiers as described above. The number of weak classifiers at each rejector is determined by the target true and false positive rates. Each weak classifier corresponds to a high-dimensional feature and it splits the high dimensional input space with a decision boundary (hyper plane, etc). Each weak classifier makes its estimation based on a single high-dimensional feature from the bag of high-dimensional features. Boosting works by sequentially fitting weak classifiers to reweighted versions of the training data. Using Gentle-Boost, we fit an additive logistic regression model by stage-wise optimization of the Bernoulli log-likelihood.

Human Detection using Cascade of Rejectors

FIG. 8 shows one embodiment of our cascaded classifier. For human detection we combine, e.g., K=30, of our LogitBoost classifiers 801 on Sym⁺₈with a rejection cascade. The weak classifiers {g_l}_{l=1 . . . L}are linear regression functions trained on the tangent space of Sym⁺_i. The tangent space is an m+36 dimensional vector space. Let N_piand N_nibe the number of positive and negative images in the training set. Because any detection region or window sampled from a negative image is a negative sample, it is possible to generate more negative examples than the number of negative images.

Assume that we are training the k^thcascade level. We classify all the possible detection regions on the negative training images with the cascade of the previous (k−1) classifiers. The samples m, which are misclassified, form the possible negative set (samples classified as positive). Because the cardinality of the possible negative set is very large, we sample Nⁿ=10000 examples from this set as the negative examples at cascade level k. At every cascade level, we consider all the positive training images as the positive training set. There is a single human at each of the positive images, so N_p=N_pi.

A very large number of covariance descriptors can be determined from a single detection region. It is computationally intractable to test all of the descriptors. At each boosting iterations of the k^thLogitBoost level, we sample 200 subregions among all the possible subregions, and construct normalized covariance descriptors as described above. We train the weak classifiers representing each subregion, and add the best classifier which minimizes negative binomial log-likelihood to the cascade level k.

Each level of cascade detector is optimized to correctly detect at least 99.8% of the positive examples, while rejecting at least 35% of the negative examples. In addition, we enforce a margin constraint between the positive samples and the decision boundary. The probability of a sample being positive at cascade level k is p_k(X), evaluated using Equation (13).

The positive example that has the (0.998N_p)^thlargest probability among all the positive examples is X_p. The negative example that has the (0.35N_n)^thsmallest probability among ail the negative examples X_n. We continue to add weak classifiers to cascade level k p_k(X_p)−p_k(X_n)>th_b. We se the threshold th_b=0.2.

When the constraint is satisfied, a new sample is classified as positive by cascade level k if p_k(X)>p_k(X_p)−th_b>p_k(X_n) or equivalently F_k(X)>F_k(X_n). With our method, any of the positive training samples in the top 99.8 percentile have at least th_bmore probability than the decision boundary. The process continues with the training of (k+1)^thcascade level, until k=K.

This method is a modification of our LogitBoost classifier on Riemannian manifolds described above. We determine the weighted means of only the positive examples, because the negative set is not well characterized for detection tasks. Although it rarely happens, if some of the features are totally correlated, there will be singularities in the covariance descriptor. We ignore those eases by adding very small identity matrix to the covariance descriptor.

Object Detection with Motion Cues in Covariance Descriptors

FIG. 6 shows a procedure for determining high-level features from motion cues according to an embodiment of the invention. The features can be used for training a classifier as described herein, and for object detection.

Covariance descriptors, which we adapt for object detection in images according to embodiments of our invention, can also include motion cues that are extracted from the video data. The motion cues can be provided by another sensor, or can be determined by analyzing the video data itself. The pixel-wise object motion information is incorporated as a low-level feature.

In case of system with a moving camera, the apparent motion in the video can be due to object and/or camera motion. Therefore, we use a first training video₁601 from moving camera, and a second training video_2602 from a static camera.

The camera motion in the first training video 601 is compensated 610 to obtain the motion due to the objects only. This is done by stabilizing or aligning the consecutive video images. Image alignment gives the camera motion either as a parametric (affine, perspective, etc.) global motion model, or as a non-parametric dense motion field. In both cases, the consecutive images are aligned. This results in stabilized images 611. Using the stabilized images, 611 the moving objects in the scene are found as the regions that have high motion after the compensation step 611.

For static camera systems, there is no camera motion in the second video 602. Thus, no compensation is required. Any motion present is due to object motion. Therefore, we generate and maintain 660 a statistical background model 662.

We use different motion cues. A motion cue is an additional low-level feature when we determine the high-level features 300.

A first set of motion, cues is obtained from a foreground 661. Using the input images (stabilized images 611 in case of moving cameras), the background model 662 is maintained 660. The changed part of the scene, which is called the foreground 661, is determined by comparing and subtracting 665 the current image 666 from the background models 660. A foreground pixel value corresponds to the distance between the current image pixel and the background models for that pixel. This distance can be thresholded. We use these pixel-wise distances as a first set of motion cues.

A second set of motion cues is obtained by the consecutive image differences 620. A number of difference images 621 are determined by subtracting the current image from one or multiple of the previous images, that are the motion compensated (stabilized) images 611 in case of a moving camera system. The subtraction gives the intensity distance at a pixel. Instead of the intensity distance, other distances, for instance, gradient magnitude distance, orientation difference, can also used.

A third set of motion features is computed by determining 650 an optical flow 651 between the current and previous (stabilized) images 611. The optical flow determination produces a motion vector at every pixel. The motion vectors, which include the vertical and horizontal components of the optical flow vector, or the magnitude and orientation angles of the optical flow vector, are then assigned as the pixel-wise motion cues. Alternatively, a motion vector for each pixel is determined by block-matching or other methods instead of the optical flow.

For moving object detection, we include the motion cues among the low-level features in the aforementioned mapping function Φ(I, x, y). The low-level feature can be used to generate 300 high-level features as described above. Then, the high-level features obtained from training data can be used to train the classifier 601, while the high-level features obtained in a similar manner from test data can be used to detect moving objects using the trained classifier.

Testing

During classification 140 of test data 104, low-level features are extracted and used to generate high-level features, as described above. The high-level features are eventually mapped to feature vectors, as described above, for classification 140. The classification assigns the labels 105 to the test data 104, e.g. human or not.

Although, the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

Effect of the Invention

The embodiments of the invention provide a method for detecting humans in images utilizing covariance matrices as object descriptors and a training method the Riemannian manifolds. The method is not specific to Sym⁺_d, and can be used to train classifiers for points lying on any connected Riemannian manifold.

Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

	Number	Date	Country
Parent	11517645	Sep 2006	US
Child	11763699		US

Detecting Moving Objects in Video by Classifying on Riemannian Manifolds

Information

Publication Number

Date Filed

Date Published

Inventors

CPC

US Classifications

International Classifications

Abstract

Description

Claims

RELATED APPLICATION

Continuation in Parts (1)