The invention relates generally to image processing, and more particularly to separating foreground content from background content in a sequence of images.
Foreground/background (FG/BG) separation can be used in applications such as video surveillance, human-computer interaction, and panoramic photography, where the foreground content has a different motion than the background content. For example, FG/BG separation can improve object detection, object classification, trajectory analysis, and unusual motion detection, leading to a high-level understanding of events represented in a sequence of images (video).
When robust principal component analysis (RPCA) is used for the separation, the RPCA assumes that an observed video signal B ∈ ℝ^(m×n) can be decomposed into a low-rank component X ∈ ℝ^(m×n) and a complementary sparse component S ∈ ℝ^(m×n). Thus, the FG/BG separation can be formulated as an optimization problem for X and S:

min_(X,S) ∥X∥* + λ∥S∥1 subject to B = X + S, (1)

where ∥.∥* is the nuclear norm of a matrix, ∥.∥1 is the l1-norm of a vectorization of the matrix, and λ is a regularization parameter. The solution to the RPCA problem involves computing a full or partial singular value decomposition (SVD) at every iteration.
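For illustration, a minimal numpy sketch of one common way to solve equation (1), an inexact augmented Lagrangian method with one SVD and one element-wise shrinkage per iteration; the parameter defaults and fixed iteration count are assumptions for the sketch, not values prescribed above:

    import numpy as np

    def rpca(B, lam=None, mu=None, n_iter=100):
        """Decompose B into low-rank X and sparse S, equation (1),
        via an inexact augmented Lagrangian (one SVD per iteration)."""
        m, n = B.shape
        lam = lam or 1.0 / np.sqrt(max(m, n))      # common regularizer default
        mu = mu or 0.25 * m * n / (np.abs(B).sum() + 1e-12)
        X = np.zeros_like(B); S = np.zeros_like(B); Y = np.zeros_like(B)
        for _ in range(n_iter):
            # X update: singular value thresholding of (B - S + Y/mu)
            U, sig, Vt = np.linalg.svd(B - S + Y / mu, full_matrices=False)
            X = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
            # S update: element-wise soft-thresholding, cf. equation (4)
            R = B - X + Y / mu
            S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
            # dual ascent on the constraint residual
            Y = Y + mu * (B - X - S)
        return X, S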
To reduce this complexity, several techniques, such as Low-Rank Matrix Fitting (LMaFit), have been described that factorize the low-rank component and optimize over the factors. The factorization represents X = LR^T, where L ∈ ℝ^(m×r), R ∈ ℝ^(n×r), and r ≥ rank(X).
The factorization-based RPCA method can be formulated and solved using an augmented Lagrangian alternating direction method (ADM) as follows:

min_(L,R,S) ½∥L∥F² + ½∥R∥F² + λ∥S∥1 + ⟨Y, E⟩ + (μ/2)∥E∥F², (2)

where ∥.∥F is the Frobenius norm of a matrix, λ is a regularization parameter, Y is the Lagrange dual variable, μ is an augmented Lagrangian parameter, and E = B − LR^T − S. Note that the nuclear norm ∥X∥* in equation (1) is replaced by ½∥L∥F² + ½∥R∥F² in equation (2), where X = LR^T, based on the observation that

∥X∥* = min_(L,R: X=LR^T) ½(∥L∥F² + ∥R∥F²), (3)

where the superscript T denotes the transpose operator.
In the ADM solution, the sparse component is updated with the element-wise soft-thresholding operator

S_(λ/μ)(r) = sign(r)·max(|r| − λ/μ, 0), (4)

wherein the element-wise operator in equation (4) does not impose structure on the sparse component.
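For equation (2), a minimal numpy sketch of one plausible ADM iteration over the factors L, R and the sparse component S; the update order, random initialization, and parameter defaults are assumptions for the sketch:

    import numpy as np

    def factorized_rpca(B, r, lam=1e-2, mu=1.0, n_iter=100):
        """ADM for equation (2): B ~ L @ R.T + S with low-rank factors."""
        m, n = B.shape
        rng = np.random.default_rng(0)
        L = rng.standard_normal((m, r)); R = rng.standard_normal((n, r))
        S = np.zeros_like(B); Y = np.zeros_like(B)
        I = np.eye(r)
        for _ in range(n_iter):
            # L and R updates: regularized least squares; the identity term
            # comes from the nuclear-norm surrogate 1/2||L||_F^2 + 1/2||R||_F^2
            P = B - S + Y / mu
            L = mu * P @ R @ np.linalg.inv(I + mu * R.T @ R)
            R = mu * P.T @ L @ np.linalg.inv(I + mu * L.T @ L)
            # S update: element-wise soft-thresholding, equation (4)
            Q = B - L @ R.T + Y / mu
            S = np.sign(Q) * np.maximum(np.abs(Q) - lam / mu, 0.0)
            # dual ascent on E = B - L R^T - S
            Y = Y + mu * (B - L @ R.T - S)
        return L, R, S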
In recent years, structured sparsity techniques have been applied to RPCA methods. Sparse techniques learn over-complete bases to represent data efficiently. In the art, a sparse matrix is a matrix in which most of the elements are zero. By contrast, if most of the elements are nonzero, then the matrix is considered dense. The fraction of zero (nonzero) elements in a matrix is called the sparsity (density). Structured sparsity is mainly motivated by the observation that sparse data are often not randomly located but tend to cluster.
For example, one learning formulation, called dynamic group sparsity (DGS), uses a pruning step in selecting sparse components that favor local clustering. Another approach enforces group sparsity by replacing the l1-norm in equation (1) with a mixed l2,1-norm defined as

∥S∥2,1 = Σ_(g=1)^s w_g ∥S_g∥2, (5)

where S_g is the component corresponding to group g, g = 1, . . . , s, and the w_g's are weights associated with each group. The resulting problem formulation is

min_(X,S) ∥X∥* + λ∥S∥2,1 subject to B = X + S. (6)
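A small sketch of the mixed l2,1-norm of equation (5), assuming groups are represented as index arrays into the flattened matrix (this group representation is an assumption for the sketch):

    import numpy as np

    def l21_norm(S, groups, w):
        """Mixed l2,1-norm of equation (5): sum over groups g of
        w_g * ||S_g||_2. `groups` is a list of flat index arrays."""
        s = S.ravel()
        return sum(w_g * np.linalg.norm(s[idx]) for w_g, idx in zip(w, groups))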
Most recent FG/BG separation approaches in the PCA family are quite effective for image sequences acquired with a stationary camera and a mostly static background. However, the separation performance degrades for image sequences from a moving camera, which may introduce apparent motion in the background, even with limited motion jitter. In that case, global motion compensation (MC) aligns the images before an RPCA-based FG/BG separation method is applied.
With moving-camera sequences, the motion in the background no longer satisfies the low-rank assumption. Hence, in order to apply the RPCA, global motion compensation using a homography model can be applied to the image sequence in a pre-processing step.
One approach for performing global motion compensation is to compute a homography model for the image sequence. In an 8-parameter homography model h = [h1, h2, . . . , h8]^T, the corresponding pixel x1 = (x1, y1)^T in the current image and x2 = (x2, y2)^T in its reference image are related according to

x2 = (h1·x1 + h2·y1 + h3) / (h7·x1 + h8·y1 + 1), y2 = (h4·x1 + h5·y1 + h6) / (h7·x1 + h8·y1 + 1). (7)

Given local motion information associating a pixel location (x1, y1) in the current image to its corresponding location (x2, y2) in a reference image, the homography model h can be estimated by least squares (LS) fitting: b = Ah, where b is a vector composed by stacking the vectors x2, and the rows of A corresponding to each x2 are specified as

A_(x2) = [ x1  y1  1  0  0  0  −x1·x2  −y1·x2 ;  0  0  0  x1  y1  1  −x1·y2  −y1·y2 ], (8)

where the first row generates x2 and the second row generates y2 in b.
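A minimal numpy sketch of this LS fitting for the 8-parameter model of equations (7) and (8), assuming the local motion information is given as arrays of matched points:

    import numpy as np

    def fit_homography(pts1, pts2):
        """Estimate h = [h1..h8] from matched points, equations (7)-(8).
        pts1, pts2: (N, 2) arrays of (x, y) locations in the current
        and reference images."""
        rows, b = [], []
        for (x1, y1), (x2, y2) in zip(pts1, pts2):
            rows.append([x1, y1, 1, 0, 0, 0, -x1 * x2, -y1 * x2]); b.append(x2)
            rows.append([0, 0, 0, x1, y1, 1, -x1 * y2, -y1 * y2]); b.append(y2)
        A = np.asarray(rows, dtype=float)
        h, *_ = np.linalg.lstsq(A, np.asarray(b, dtype=float), rcond=None)
        return h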
Image sequences with corresponding depth maps are now common, especially with the rapid growth of depth sensors, such as Microsoft Kinect™, and the advancement of algorithms for depth estimation from stereo images. Jointly using depth and color data produces superior separation results. Also, a depth-enhanced method can better deal with illumination changes, shadows, reflections, and camouflage.
The embodiments of the invention provide a method for processing a sequence of images. The method uses an algebraic decomposition for solving a background subtraction problem with a novel PCA framework that uses depth-based group sparsity.
The method decomposes the sequence of images, e.g., a group of pictures (GOP), in the video into a sum of a low-rank component and a group-sparse foreground component. The low-rank component represents the background in the sequence, and the group-sparse component represents the moving foreground objects in the sequence.
For videos acquired of a scene with a moving camera, motion vectors are first extracted from the video, e.g., when the video is encoded as a bitstream with motion vectors. Then, an associated depth map is combined with the motion vectors to compute a parametric perspective model with fourteen parameters that matches the global motion in every image of the video.
In the RPCA problem formulation, the video background is assumed to have small variations that can be modeled using a low-rank component X. Foreground content, e.g., moving objects, represented by S, is assumed to be sparse and to have a different type of motion than the background.
Prior art FG/BG separation algorithms generally do not incorporate the foreground object structure in the separation.
The embodiments provide a structured group-sparsity based PCA method that can overcome larger variations in the background, e.g., from misalignment in global motion compensation on a sequence acquired by a moving camera.
Depth-Weighted Group-Wise PCA
In practical image sequences, the foreground objects (sparse components) tend to be clustered both spatially and temporally rather than evenly distributed. This observation has led to the introduction of group sparsity into RPCA approaches, organizing the sparse component into more structured groups.
Our method uses a depth map of the video to define group structures in a depth-weighted group-wise PCA (DG-PCA) method.
In order to deal with structured sparsity, we replace the l1-norm in the factorized RPCA problem with the mixed l2,1-norm defined in equation (5). The l2,1-norm is based on a monotonically increasing function of the depths in the depth map. The resulting problem is

min_(L,R,S) ½∥L∥F² + ½∥R∥F² + λ∥S∥2,1 + ⟨Y, E⟩ + (μ/2)∥E∥F², (9)

where E = B − LR^T − S as in equation (2).
The background in the video can be aligned 160 in a pre-processing step.
In order to define pixel groups G using the depth map D, an operator G(D) segments the depth map into s groups 102 using the following procedure. In one embodiment of the invention, suppose the depth level ranges from 0 to 255; a pixel with depth d is then classified into group g = floor(d·s/256) + 1, so that the depth range is uniformly partitioned into the s groups.
Consequently, the pixels in B can be clustered into groups B_g with g ∈ {1, . . . , s}. Each B_g is composed of the elements of B that are assigned to segment g. In the same way, L_g, R_g, and the Lagrangian multiplier Y_g are also grouped.
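A minimal sketch of the operator G(D), assuming the uniform depth quantization described above and a flat-index group representation (both assumptions for the sketch):

    import numpy as np

    def depth_groups(D, s):
        """Segment a depth map D (values 0..255) into s groups; returns a
        list of flat index arrays, one per group, cf. operator G(D)."""
        g = (D.astype(np.int64) * s) // 256    # group id in 0..s-1 per pixel
        flat = g.ravel()
        return [np.flatnonzero(flat == k) for k in range(s)]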
Next, steps 3 and 4 of Algorithm 2 solve for the low-rank component (background) using X = LR^T.
Next, in step 5 of Algorithm 2, the operator S_(λ/μ,g) is a group-wise soft-thresholding

S_(λ/μ,g)(r_g) = ( max(∥r_g∥2 − λ·w_g/μ, 0) / (∥r_g∥2 + ε) )·r_g, (10)

where r_g contains the elements of group g in the residual being thresholded, ε is a small constant to avoid division by 0, and w_g defines the group weights in equation (5). Because a foreground object has a higher probability of being nearer to the camera, i.e., of having a larger depth than a background object, we use the following equation to set the group weights,

w_g = c − (c − 1)·d_g/255, (11)
where c is some constant, and d_g is the mean depth of the pixels in group g. w_g is equal to 1 for objects nearest to the camera, d = 255, and it is equal to c for objects farthest from the camera, d = 0. The choice of c controls a threshold that permits foreground pixels to be selected based on their corresponding depths. After S_g is computed for each group g, the sparse component S is obtained by summing up all S_g.
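A minimal numpy sketch of the group-wise update in step 5, assuming the linear weight rule of equation (11) as written above and the flat-index group representation from the earlier sketch:

    import numpy as np

    def group_soft_threshold(Rmat, D, groups, lam, mu, c=4.0, eps=1e-8):
        """Depth-weighted group-wise soft-thresholding, equation (10),
        applied to the residual Rmat. groups: list of flat index arrays
        from depth_groups(); D: depth map aligned with Rmat."""
        r = Rmat.ravel()
        d = D.ravel().astype(float)
        S = np.zeros_like(r)
        for idx in groups:
            if idx.size == 0:
                continue
            d_g = d[idx].mean()                  # mean depth of the group
            w_g = c - (c - 1.0) * d_g / 255.0    # weights: 1 near, c far
            r_g = r[idx]
            norm = np.linalg.norm(r_g)
            S[idx] = (max(norm - lam * w_g / mu, 0.0) / (norm + eps)) * r_g
        # groups partition the indices, so per-group assignment realizes
        # the sum over all S_g
        return S.reshape(Rmat.shape)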
The above steps are iterated until the algorithm converges or a maximum number of iterations is reached.
The pixels that have large values in S, e.g., larger than a predetermined threshold, are outputted as foreground pixels 141.
The method favors group structures where the foreground content, e.g., objects, is closer to the camera. It is also possible within our framework to define the groups as sets of pixels that are spatially connected and have a constant depth, or as connected pixels with a constant depth gradient.
It is worthwhile to mention that the nuclear norm equivalent terms ½∥L∥F² + ½∥R∥F² in equation (9) make Algorithm 2 numerically stable. Without the nuclear norm, the inverse (I + μR_i^T R_i)^(−1) in step 3 of Algorithm 2 becomes (μR_i^T R_i)^(−1), which is unstable when the matrix R_i^T R_i is singular, for example, when the image is relatively dark with B, L, R ≈ 0.
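A small numpy check of this stability point, using a hypothetical degenerate factor R_i (the values here are illustrative, not data from the invention):

    import numpy as np

    mu, r = 1.0, 3
    R_i = np.zeros((5, r))          # degenerate factor, e.g., a dark image
    I = np.eye(r)
    # regularized system from the nuclear-norm surrogate: always solvable,
    # since I + mu * R_i^T R_i is positive definite
    x = np.linalg.solve(I + mu * R_i.T @ R_i, np.ones(r))
    # without the identity term the system is singular and the solve fails
    try:
        np.linalg.solve(mu * R_i.T @ R_i, np.ones(r))
    except np.linalg.LinAlgError:
        print("mu * R_i^T R_i is singular without the nuclear-norm terms")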
Depth-Enhanced Homography Model
In practice, the local motion information associating pixel locations is often inaccurate. In this case, the full 8-parameter model in equation (7) is sensitive to errors in the motion information. Hence, a homography model with a reduced number of parameters is preferred, thus limiting the types of motion in the scene 151.
For example, the 2-, 4-, and 6-parameter models correspond to translational-only, geometric, and affine models, respectively, obtained by setting some coefficients in h to zero. We select the 4-parameter geometric model as our starting point, where h = [h1, h2, h3, 0, 0, h6, 0, 0]^T.
However, motion in a video sequence is generally not planar. Therefore, even after a careful selection of the conventional homography model, it is possible to have large motion estimation errors, which would dramatically degrade the detection rate in a subsequent PCA-like procedure. Accordingly, we use a depth-enhanced homography model. Specifically, six new parameters related to depth are added, so that h = [h1, . . . , h8, h9, . . . , h14]^T. Let z1 and z2 stand for the depths of the corresponding pixels; then the pixel x1 = (x1, y1, z1)^T in the current image and the pixel x2 = (x2, y2, z2)^T in its reference image are related by our depth-enhanced homography model of equation (12).
In equation (12), a depth value of 0 means that the object is far from the camera, e.g., at ∞. A larger depth means that the object is nearer to the camera. Certain simplifications are possible for simpler video sequences. For example, if z2 = z1, then the motion is limited to be within the same depth plane.
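Because equation (12) is not reproduced here, the following sketch only illustrates one plausible form of a 14-parameter depth-enhanced mapping, with the six depth parameters entering linearly in the numerators and the common denominator; this parameterization is an assumption for illustration and may differ from equation (12):

    import numpy as np

    def depth_homography_apply(h, x1, y1, z1):
        """Hypothetical depth-enhanced mapping: equation (7) extended with
        linear depth terms (an illustrative assumption, not equation (12)).
        h: array of the 14 parameters h1..h14 as h[0..13]."""
        den = h[6] * x1 + h[7] * y1 + h[13] * z1 + 1.0
        x2 = (h[0] * x1 + h[1] * y1 + h[2] + h[8] * z1) / den
        y2 = (h[3] * x1 + h[4] * y1 + h[5] + h[9] * z1) / den
        z2 = (h[10] * x1 + h[11] * y1 + h[12] * z1) / den
        return x2, y2, z2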
If the depth-enhanced homography model h is defined as in equation (12), then we can again solve the problem using least squares fitting. Given local motion information associating a pixel location (x1, y1, z1) in the current image with the corresponding location (x2, y2, z2) in the reference image, the homography model h can be estimated by least squares (LS) fitting, b = Ah, where b is a vector composed by stacking the vectors x2, and the rows of A corresponding to each x2 are specified analogously to equation (8), with additional columns for the depth terms.
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.