The present application relates to methods and systems for image and video processing, and in particular, methods and systems for extracting foreground objects from video sequences.
Real-time foreground segmentation for live video relates to methods and systems for image and video processing, and in particular, methods and systems for extracting foreground objects from live video sequences, even where the boundaries between the foreground objects and the background regions are complicated by multiple closed regions, partially covered pixels, similar background colour, similar background textures, etc., in a computationally efficient manner so as to permit so-called “real-time” operation.
Foreground segmentation, also referred to as video cutout, is the extraction of objects of interest from input videos. It is a fundamental problem in computer vision and often serves as a pre-processing step for other video analysis tasks such as surveillance, teleconferencing, action recognition and retrieval. A significant number of techniques have been proposed in both the computer vision and graphics communities. However, some of them are limited to sequences captured by stationary cameras, whereas others require large training datasets or cumbersome user interactions. Furthermore, most existing algorithms are rather complicated and computationally demanding. As a result, there is still a lack of an efficient yet powerful algorithm that can process challenging live video scenes with minimal user interaction.
There is a need for a system and method for foreground segmentation which is both robust, and computationally efficient.
Existing approaches to foreground segmentation may be categorized as unsupervised and supervised.
Unsupervised approaches try to generate background models automatically and detect outliers of the models as foreground. Most of them, referred to as background subtraction approaches, assume that the input video is captured by a stationary camera and model background colors at each pixel location using either generative methods (e.g.: J. Zhong and S. Sclaroff, Segmenting foreground objects from a dynamic textured background via a robust Kalman filter, ICCV, 2003 [Zhong 1]; or J. Sun, W. Zhang, X. Tang, and H. Shum, Background cut, ECCV, 2006 [Sun 2]) or nonparametric methods (for example: Y. Sheikh and M. Shah, Bayesian object detection in dynamic scenes, CVPR, 2005 [Sheikh 3]; or J. Wang, P. Bhat, A. Colburn, M. Agrawala, and M. Cohen, Interactive video cutout, SIGGRAPH, 2005 [Wang 4]). Some of these techniques can handle repetitive background motion, such as rippling water and waving trees, but are unsuitable for a camera in motion.
Considering existing unsupervised methods where camera motion does not change the viewing position, such as PTZ security cameras, the background motion has been described by a homography, which has been used to align different frames before applying the conventional background subtraction methods (e.g. E. Hayman and J. Eklundh, Statistical background subtraction for a mobile observer, ICCV, 2003 [Hayman 5]). The method of Y. Sheikh, O. Javed, and T. Kanade, Background subtraction for freely moving cameras, ICCV, 2009 [Sheikh 6], proposed to deal with freely moving cameras by means of tracking the trajectories of salient features across the whole video, where the trajectories are used for estimating the background trajectory space, based on which foreground feature points can be detected accordingly. While this method automatically detects moving objects, it tends to classify background with repetitive motion as foreground, as well as to confuse large rigidly moving foreground objects with background.
Supervised methods allow users to provide training examples to train the segmentation method being employed. Certain existing methods (for example: V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother, Bilayer segmentation of binocular stereo video, CVPR, 2005 [Kolmogorov 7]; A. Criminisi, G. Cross, A. Blake, and V. Kolmogorov, Bilayer segmentation of live video, CVPR, 2006 [Criminisi 8]; and P. Yin, A. Criminisi, J. Winn, and I. Essa, Tree-based classifiers for bilayer video segmentation, CVPR, 2007 [Yin 9]) integrate multiple visual cues such as color, contrast, motion, and stereo with the help of structured prediction methods such as conditional random fields. Although operational for video conferencing applications, these algorithms require a large set of fully annotated images and considerable offline training, which bring up many issues when attempting to apply them in different scene setups.
Some existing matting algorithms also provide supervised foreground segmentation by modelling the video sequence as a 3D volume of voxels. Users are required to label fore/background on multiple frames or directly on the 3D volume. To enforce temporal coherence, these algorithms usually segment the entire volume at once, which limits their applicability to live video processing.
Video SnapCut, by X. Bai, J. Wang, D. Simons, and G. Sapiro, Video snapcut: robust video object cutout using localized classifiers, SIGGRAPH, 2009 [Bai 10], is one existing technique: starting from a segmentation of the first frame, both global and local classifiers are trained using color and shape cues, and the labelling information is then propagated to the rest of the video frame by frame. Video SnapCut expects users to provide a fine annotation of the entire first frame, which can be challenging for fuzzy objects, and runs at about 1 FPS for VGA-sized videos (excluding the time for matting).
There is a need for a robust, minimally supervised video segmentation technique which is able to operate in real time, and which is able to handle freely moving cameras and/or background images.
There is a need for a video segmentation technique designed for parallel computing which is both easy to implement and has low computation cost, that is capable of dealing with challenging video segmentation scenarios with minimal user interaction.
The present application relates to a foreground/background segmentation approach that is designed for parallel computing, is both easy to implement and has low computation cost, and is capable of dealing with challenging video segmentation scenarios with minimal user interaction.
A number of improvements are proposed. First, the segmentation method maintains two Competing 1-class Support Vector Machines (C-1SVMs) at each pixel location, rather than operating a single classifier. A first competing 1-class Support Vector Machine (C-1SVM) captures the local foreground color densities and a second C-1SVM captures the local background color densities, separately from the first. However, the two C-1SVMs determine the proper foreground/background label for the pixel jointly. Through iterations between training local C-1SVMs and applying them to label the pixels, the algorithm can effectively propagate initial user labeling to the whole image and to consecutive frames. The frame/image is partitioned into known foreground (if the foreground-1SVM says foreground and the background-1SVM says not background), known background (if the foreground-1SVM says not foreground and the background-1SVM says background) and unknown (if the foreground-1SVM and background-1SVM disagree as to classification). Then, optionally, the unknown pixels are forced to either foreground or background by a smoothing function. The smoothing function disclosed in Algorithm 2 below is a globally optimized thresholded cost function biased towards foreground identification. On a step-wise basis as frames advance, the edges of the foreground and background are eroded, and the impact of older frames on the support vectors for individual pixels is attenuated in time.
Choice of grid sizes and novel approaches to structuring the grid for computational purposes provide optional advantages. In general, each pixel may be trained to the algorithm using its own neighbourhood of sufficient size, which may be augmented by training based on the centre points of one or more neighbourhoods elsewhere within the image. In the different approaches shown: (i) a pixel may be trained to the algorithm with reference to all pixels within a shape about the pixel; (ii) a pixel may be trained with reference to all the pixels within its neighbourhood and then with the middle pixels of adjacent neighbourhoods of similar size; or even (iii) a pixel may be trained with reference to its neighbourhood and the centre points of neighbourhoods of similar size not fully adjacent to the neighbourhood of the pixel, but separated by some known distance in a pseudo-adjacency. Furthermore, by exploiting the parallel structure of the proposed algorithm, and appropriate grid spacing and sizes, a real-time processing speed of 14 frames per second (FPS) is achieved for VGA-sized videos.
The steps of the segmentation method disclosed in this application can be summarized as follows.
Step 1: the design parameters of the 1-class support vector machines (1SVMs) to be used to separately classify foreground and background at each pixel are established. The design parameters include: the choice of kernel function k(•,•); whether the C-1SVMs will train based on batch, online or modified online learning, or some combination; the size and shape of the neighbourhood about each pixel upon which each of the C-1SVMs will train; the score function to be used; and the margin γ. Optionally, the initialization step also includes a choice of whether to classify based on the entire neighbourhood, or only on subgroups within the neighbourhood, in which case max-pooling and spatial decay (discussed below) would be used to classify each pixel according to a train-relabel loop.
Step 2: Obtaining an image or video stream for processing. The present method operates either on a single digital image or a video stream of digital images. A single image may be referred to as Io, while a series of frames t in a video stream may be referred to as It.
Step 3: obtain an initial background sample set (BSS) and an initial foreground sample set (FSS). The sample sets of known background and foreground may be provided by user instructions (e.g. swiping a cursor over the image, identifying particular colours, etc.) in a supervised method, or through an automated unsupervised learning method. In the video stream, at a given pixel in time/frame t, the sample sets of BSS and FSS are referred to jointly as the label Lt(p).
Step 4: Training of the C-1SVM occurs as follows. For each pixel, train the background-1SVM (B-1SVM) using the BSS and train the foreground-1SVM (F-1SVM) using the FSS.
Step 5: Classification of each pixel is performed independently by each 1SVM. The classification routine may be run on the entire neighbourhood, or by max-pooling over specified subgroups within the neighbourhood, as discussed below.
Step 6: Relabeling of the BSS and FSS occurs on a pixel-wise basis if the two C-1SVMs agree as to the classification of the pixel as foreground or background. Otherwise, the pixel is not relabelled. Steps 4 through 6 are repeated in a Train-Relabel loop until no new pixels are labelled. Four categories of pixels result: those labelled foreground by both classifiers, those labelled background by both classifiers, those labelled background by the F-1SVM and foreground by the B-1SVM, and those labelled foreground by the F-1SVM and background by the B-1SVM (a minimal sketch of this labelling decision appears after Step 9 below). This is a sufficient output of the segmentation method, but additional steps are also possible.
Step 7: optionally, a binarization step further segments the non-binary output of Step 6 by forcing the unlabeled pixels into foreground or background according to some coherence rule. A global optimizing function (discussed below) has been shown to be useful in this regard.
Step 8: optionally, a smooth/erode function may be used to further prepare the output of Step 6 for insertion back into the algorithm at Step 4 as a proxy for additional supervised user labelling. Using the global optimizer to smooth the data and/or eroding the boundary by a fixed number of pixels prepares new BSS and FSS data for the Train-Relabel loop in the following frame.
Step 9: (not shown) Nothing prohibits additional user relabeling of the BSS and FSS either in respect of a given frame, or prior to segmentation of a future frame.
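For illustration only, the joint labelling decision of Step 6 can be sketched in a few lines of Python; the constant names and function signature below are hypothetical and are not part of the disclosed implementation, which applies this decision per pixel to the outputs of the locally trained C-1SVMs.

FOREGROUND, BACKGROUND, UNKNOWN = 1, 0, -1

def joint_label(fg_says_foreground, bg_says_background):
    # Combine the outputs of the foreground-1SVM and the background-1SVM.
    if fg_says_foreground and not bg_says_background:
        return FOREGROUND        # both classifiers indicate foreground
    if bg_says_background and not fg_says_foreground:
        return BACKGROUND        # both classifiers indicate background
    return UNKNOWN               # the classifiers disagree; resolved at Step 7

# Example: the F-1SVM accepts the pixel and the B-1SVM rejects it.
assert joint_label(True, False) == FOREGROUND
assert joint_label(False, True) == BACKGROUND
assert joint_label(True, True) == UNKNOWN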
Panels (a), (b) and (c) of the accompanying figures show foreground assignment (in black) by the foreground-1SVM for image frame 0 of the test sequence.
Panels (a) and (b) of FIG. 8 compare the segmentation accuracy of a preferred embodiment of the present method for two sequences, where a ground truth segmentation is available for every 5 or 10 frames.
Certain implementations of the present invention will now be described in greater detail with reference to the accompanying drawings.
The method of foreground segmentation proposed herein forgoes casting the problem as a single binary classification and instead creates two competing classifiers which operate independently. Where the competing classifiers disagree, the pixels are labelled unknown and are ultimately resolved in a final step through a globally optimized cost function. Improved performance is predicted for two reasons:
First, foreground and background may not be well separable in the color feature space. For example, the black sweater and the dark background shown in one of the test sequences are nearly indistinguishable in color.
Second, even in areas where both foreground and background examples are available, modeling the two sets separately using the C-1SVMs gives two hyperplanes that enclose the respective training examples more tightly.
In the proposed method, the training can be based either on batch learning or online learning. Training an SVM using a large set of examples is a classical batch learning problem, the solution of which can be found through minimizing a quadratic objective function. Those of skill in the art will appreciate that similar or even better generalization performance can be achieved, at much lower computational cost, by showing all examples repetitively to an online learner rather than performing batch learning. A less noticed but distinct advantage of online learning is that it produces a partially trained model immediately, which is then gradually refined toward the final solution. However, either option may be practised within the scope of the method disclosed in this application.
In one example, the online learner of a preferred embodiment of the foreground segmentation method proceeds as follows. Let ft(•) be the score function of examples at time t, and let k(•,•) be a kernel function. Further, denote by αt the non-negative weight of the example at time t, and by clamp(•,A,B) the identity function of its first argument bounded on both sides by A and B. When a new example xt arrives, the score function becomes:

ft(•) = ft−1(•) + αt k(xt, •)  (1)
In this example, the update rule for the weights is:

αt = clamp(γ − ft−1(xt), 0, C), with αi ← (1−τ)αi for all earlier examples i < t,  (2)
where γ:=1 is the margin, τ∈(0,1) the decay parameter, and C>0 the cut-off value.
Directly applying Eq. (2) adds multiple support vectors to the model; all would come from the same sample and would have different weights. Also, as shown in Eq. (2), once a support vector (xt,αt) is added to the applicable 1SVM, over time its weight αt is only affected by the decay rate (1−τ). Hence, to ensure the support vectors converge to their proper weights, the decay parameter should be carefully adjusted, e.g. using cross-validation results.
In a modified online learning example, and to avoid the complexity of monitoring/performing cross-validation of results, the C-1SVM segmentation method may not rely on the decay at all; instead it may execute an explicit reweighting scheme: if a training example xt arrives and it turns out identical to an existing support vector (xt,αt) inside the model, this support vector is first taken out when computing the score function; it is then included with its newly obtained weight α′t, which substitutes for the original weight αt. To summarize, with S denoting the current set of support vectors:

α′t = clamp(γ − ft(xt) + χ((xt, αt) ∈ S)·αt·k(xt, xt), 0, C)  (3)
where χ(•) is an indicator function: χ(true)=1 and χ(false)=0.
Intuitively, this modified online learning method resets the weight component of a particular support vector (xt,αt) based on how well the separating hyperplane defined by the remaining support vectors is able to classify example xt. This reweighting process can either increase or decrease αt, and hence an implementation of the C-1SVM using modified online learning does not rely on decay as do some prior art methods. This also results in fewer operations and hence a shorter training time.
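The following Python sketch illustrates one way the online learning with explicit reweighting described above could be realized. It is a minimal single-classifier sketch under stated assumptions (Gaussian kernel, colour examples), not the disclosed GPU implementation, and the class and method names are hypothetical.

import numpy as np

class OnlineOneClassSVM:
    # Illustrative online 1SVM with explicit reweighting (a sketch only).
    # Support vectors are (example, weight) pairs; the score of x is the
    # weighted sum of kernel evaluations against the support vectors.

    def __init__(self, gamma=1.0, C=0.5, sigma=10.0):
        self.gamma = gamma      # margin
        self.C = C              # cut-off value for weights
        self.sigma = sigma      # Gaussian kernel bandwidth
        self.sv = []            # list of (example, weight)

    def kernel(self, a, b):
        d = np.asarray(a, float) - np.asarray(b, float)
        return float(np.exp(-np.dot(d, d) / (2.0 * self.sigma ** 2)))

    def score(self, x, exclude=None):
        # Score function f(x); optionally leave one support vector out,
        # as is done when an identical example is shown again.
        return sum(w * self.kernel(v, x)
                   for i, (v, w) in enumerate(self.sv) if i != exclude)

    def learn(self, x):
        # Explicit reweighting: if x matches an existing support vector,
        # recompute its weight from the score of the remaining vectors.
        for i, (v, _) in enumerate(self.sv):
            if np.array_equal(v, x):
                self.sv[i] = (v, float(np.clip(self.gamma - self.score(x, exclude=i), 0.0, self.C)))
                return
        # Otherwise add x as a new support vector with a clamped weight.
        self.sv.append((np.asarray(x, float), float(np.clip(self.gamma - self.score(x), 0.0, self.C))))

    def decay(self, tau):
        # Optional temporal decay applied when moving to a new frame.
        self.sv = [(v, w * (1.0 - tau)) for (v, w) in self.sv]

# Example usage: train on a few foreground colours and score two queries.
svm = OnlineOneClassSVM()
for colour in [(200, 30, 30), (210, 35, 28), (198, 33, 35)]:
    svm.learn(colour)
print(round(svm.score((205, 32, 31)), 3))   # similar colour: high score
print(round(svm.score((10, 10, 10)), 6))    # dissimilar colour: near zero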
Training 1SVMs with large example sets is known to be computationally expensive, which becomes a serious issue in a real-time processing scenario. In addition to online learning, in one example, the present segmentation method proposes "max-pooling" of subgroups, as follows: the whole example set ψ is divided into N non-intersecting groups ψi (0≤i<N) and a 1SVM is trained on each group. Then the original 1SVM score function is approximated by taking the maximum over the 1SVM score functions of the subgroups. That is:
f(x) = max0≤i<N fψi(x)  (4)

where fψi(x) is the score function of the 1SVM trained on subgroup ψi.
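A minimal sketch of the max-pooling approximation of Equation (4) follows; the toy per-subgroup score functions stand in for trained 1SVMs and are purely illustrative.

import numpy as np

def max_pooled_score(x, subgroup_score_fns):
    # Approximate f(x) by the maximum of the subgroup 1SVM score functions.
    return max(f(x) for f in subgroup_score_fns)

def make_toy_score(mean, sigma=10.0):
    # Stand-in for a subgroup 1SVM: a Gaussian bump around the subgroup mean.
    mean = np.asarray(mean, float)
    def f(x):
        d = np.asarray(x, float) - mean
        return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))
    return f

subgroups = [make_toy_score(m) for m in [(200, 30, 30), (40, 160, 40), (20, 20, 180)]]
print(round(max_pooled_score((205, 28, 33), subgroups), 3))  # dominated by the first subgroup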
Different options are proposed for dividing examples into subgroups, thereby exploiting the spatial coherence of images so that the 1SVM trained on each subgroup models a local appearance density.
In addition to the idea of using competing, separately initialized and trained 1SVM classifiers for the foreground segmentation, another improvement lies in the train-relabel procedure between video frames (i.e. in time). Two competing 1SVMs, Fp for foreground and Bp for background, are trained locally for each pixel p using pixels with known labels within the local window/neighbourhood Ωp. Once trained, Fp and Bp are used to jointly label p as either foreground, background, or unknown. Since the knowledge learned from neighbouring pixels in the neighbourhood group Ωp is used for labelling pixel p, the above procedure effectively propagates known foreground and background information to its neighbourhood. As a result, starting from only sparse initial labels, the labelling gradually expands to cover the whole image.

For inter-frame training, a train-relabel procedure similar to the one used for iteratively propagating foreground and background information within a single frame is used for handling inter-frame/temporal changes as well. When a new frame t+1 arrives, the label Lt+1(p) is initialized automatically using the existing Fp and Bp. The initial labels, together with newly observed colors, are then used to conduct the train-relabel process. Since Fp and Bp are trained using all pixels within Ωp of frame t, if any of these pixels moves to within the radius of Ωp of pixel p, Fp and Bp can attempt to classify it. Consequently, the algorithm can handle arbitrary foreground and background movement without a priori motion information, as long as the amount of movement is less than the radius of the neighbourhood grouping Ω.
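To make the propagation idea concrete, the following sketch iterates a simplified train-relabel loop over a small grey-scale image. The per-pixel competing classifiers are stood in for by kernel similarity to the labelled pixels inside each window, which is only a rough proxy for the locally trained C-1SVMs described above; the thresholds and window radius are illustrative.

import numpy as np

FG, BG, UNK = 1, 0, -1

def similarity(a, b, sigma=10.0):
    return float(np.exp(-((float(a) - float(b)) ** 2) / (2.0 * sigma ** 2)))

def train_relabel(image, labels, radius=2, accept=0.6, reject=0.3):
    # Simplified train-relabel loop: propagate FG/BG labels through local
    # windows until no pixel changes.  Stand-in classifiers, not real 1SVMs.
    h, w = image.shape
    labels = labels.copy()
    changed = True
    while changed:
        changed = False
        for y in range(h):
            for x in range(w):
                if labels[y, x] != UNK:
                    continue
                y0, y1 = max(0, y - radius), min(h, y + radius + 1)
                x0, x1 = max(0, x - radius), min(w, x + radius + 1)
                win_img = image[y0:y1, x0:x1].ravel()
                win_lab = labels[y0:y1, x0:x1].ravel()
                def best(cls):
                    vals = [similarity(image[y, x], v)
                            for v, l in zip(win_img, win_lab) if l == cls]
                    return max(vals) if vals else 0.0
                f_score, b_score = best(FG), best(BG)
                if f_score > accept and b_score < reject:
                    labels[y, x] = FG
                    changed = True
                elif b_score > accept and f_score < reject:
                    labels[y, x] = BG
                    changed = True
    return labels

# Tiny example: a bright stripe (foreground) on a dark background,
# seeded with one foreground label and one background label.
img = np.full((5, 9), 20.0)
img[:, 3:6] = 200.0
seed = np.full(img.shape, UNK)
seed[2, 4] = FG
seed[2, 0] = BG
print(train_relabel(img, seed))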
Under ideal situations, where the appearance distributions of foreground and background pixels are locally separable, the above baseline procedure is sufficient. However, the two distributions may intersect due to fuzzy object boundaries, motion blur, or low color contrast. To address these cases, global optimization is applied to enforce the smoothness of the solution. In addition, when moving to a new frame, decaying is applied to existing support vectors for better adapting to temporal changes. The details of the above steps are discussed in each of the following subsections.
When training the two competing classifiers Fp and Bp at each pixel p, the size of the local window Ωp is an important parameter. It needs to be small enough so that the local foreground/background appearance distributions are separable, but also large enough for effective propagation of label information and for covering foreground/background motions.
In all images processed in the Figures, the method of foreground segmentation uses Equation 3 and Equation 4, and the frame window Ωp has been set to 33×33 pixels, large enough to deal with motions of up to 16 pixels between adjacent frames. A skilled user could implement the present invention using larger windows, at a larger computational cost, in order to address greater relative motion of foreground within the video stream. Since using a 33×33 window means 1089 examples are used for training each 1SVM, and training is performed for 1SVMs at all pixel locations, using the entire frame window is not affordable for real-time processing. To reduce the training cost, the techniques noted above for online learning with reweighting and max-pooling of results are applied here.
First, using the max-pooling method discussed above, the 1089 examples inside Ωp are divided into twenty-five subgroups.
As shown in the example, the pixels in the twenty-four non-center subgroups are not used directly for the training of Fp and Bp, but they are used for the training of other pixels within the subgroup radius. Since these local 1SVMs are trained at all pixel locations simultaneously, after splitting the examples into subgroups, the C-1SVMs at each pixel location only need to be trained using pixels in the center subgroup at that location. The training for the remaining twenty-four subgroups will occur at their corresponding center pixel locations.
This strategy permits the method of the present invention to reduce the computational cost of training from using 1089 examples to using just twenty-five examples. For the sake of clarity, the symbols Fp and Bp are reserved for the two C-1SVMs obtained through training with all examples in Ωp, while the symbols F̂p and B̂p denote the C-1SVMs trained using only the pixels in the center subgroup Ω̂p.
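The following sketch shows one possible way of laying out such a subgroup partition: a square window is tiled by a k×k grid of non-intersecting subgroups, each pixel trains its own C-1SVMs only on the centre subgroup, and the remaining examples are covered by the C-1SVMs trained at the centre pixels of the other subgroups. The 35×35 window and 7×7 subgroups below are chosen purely so the arithmetic divides evenly; the disclosed implementation's 33×33 window and twenty-five subgroups may be partitioned differently.

def subgroup_centres(window=35, grid=5):
    # Offsets, relative to the window centre, of the centre pixels of the
    # grid x grid non-intersecting subgroups tiling a square window.
    assert window % grid == 0, "this sketch assumes an evenly divisible window"
    side = window // grid                      # side length of one subgroup
    half = window // 2
    offsets = []
    for i in range(grid):
        for j in range(grid):
            cy = i * side + side // 2 - half   # row offset of the subgroup centre
            cx = j * side + side // 2 - half   # column offset of the subgroup centre
            offsets.append((cy, cx))
    return offsets

centres = subgroup_centres()
print(len(centres))    # 25 subgroups
print(centres[12])     # (0, 0): the centre subgroup, trained at pixel p itself
# The other 24 offsets point at pixels whose own centre-subgroup training
# covers the remaining examples inside the window around p.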
Moreover, as suggested above, online learning may be employed to provide on-demand feedback during the train-relabel process. If a user is unsatisfied with the initialization, or with a later segmentation, the user may force additional labelling onto the frame by further supervision of the foreground and background sets. As shown in Algorithm 1, instead of repetitively showing the examples in Ωp to Fp and Bp and waiting for the training to converge, another implementation of the foreground segmentation method shows these examples only once in each train step. The partially trained 1SVMs are then immediately used to perform relabeling. This enables a faster propagation of labelling information through neighbourhoods and, empirically, a faster convergence of the train-relabel process; for example, the accompanying figures illustrate how labelling propagates outward from the initial user strokes.
Once Fp and Bp are trained, they are used jointly to classify pixel p. That is, pixel p is labelled as foreground or background only if the two competing classifiers output consistent predictions; otherwise pixel p is labeled as unknown.
As shown in Algorithm 2, the relabeling module starts by computing the scores of the observation I(p) based on Equation (3) with both Fp and Bp. As explained above, these two quantities are approximated by taking the maximum of the scores of the F̂q and B̂q classifiers from a set of nearby subgroups. To allow examples closer to p to have a higher influence than those further away, the implementation may optionally define a spatial decay parameter τspatial to attenuate the weight of more distant pixels on the pixel being labelled. Pixel p is then labelled as foreground (or background) only if the loss of Fp (or Bp) is low and the loss of Bp (or Fp) is high. This dual thresholding strategy facilitates the detection of ambiguities, which is used to prevent incorrect labeling information from being propagated. These loss values are also used to produce the label maps shown in the accompanying figures.
One implementation of the C-1SVM, therefore, proposes Algorithm 1 to train C-1SVMs using frame It and label Lt, and Algorithm 2 for relabeling input frame It using the C-1SVMs.
fF(p) = maxq (F̂q(It(p)));
fB(p) = maxq (B̂q(It(p)));
lF(p) = max(0, γ − fF(p));
lB(p) = max(0, γ − fB(p));
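A minimal sketch of the per-pixel relabeling summarized by the fragments above follows: hinge-style losses are computed from the max-pooled foreground and background scores, and a pixel is labelled only when one loss is low and the opposing loss is high. The threshold names and default values are illustrative placeholders rather than the disclosed settings.

FG, BG, UNK = 1, 0, -1

def relabel_pixel(f_score, b_score, gamma=1.0,
                  fg_low=0.1, fg_high=0.3, bg_low=0.4, bg_high=0.4):
    # Dual-thresholded relabeling from the two max-pooled 1SVM scores.
    loss_f = max(0.0, gamma - f_score)   # hinge loss of the foreground 1SVM
    loss_b = max(0.0, gamma - b_score)   # hinge loss of the background 1SVM
    if loss_f < fg_low and loss_b > fg_high:
        return FG                        # confidently foreground
    if loss_b < bg_low and loss_f > bg_high:
        return BG                        # confidently background
    return UNK                           # ambiguous; left for global optimization

print(relabel_pixel(f_score=0.95, b_score=0.2))   # 1  -> foreground
print(relabel_pixel(f_score=0.1, b_score=0.9))    # 0  -> background
print(relabel_pixel(f_score=0.8, b_score=0.8))    # -1 -> unknown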
To the extent that the above algorithm performs SVM training that may be implemented using different variations, which may have other steps, it should be considered within the scope of the invention if the overall approach is to use two competing 1-class support vector machines in the labelling of each pixel. Furthermore, the use of subgroupings Ω̂q which are mere non-intersecting subsets of Ωp is a preferred implementation of the C-1SVM method, but not a requirement. Indeed, the use of subgroupings Ω̂q which are mere subsets of Ωp is considered to be a new approach in respect of all SVMs.
While explicitly labelling ambiguous pixels as unknown helps to prevent error propagation during the train-relabel process, decisions still need to be made for these pixels in the final segmentation. If desired, the segmentation following convergence of Algorithm 2 may be binarized in a simple fashion by forcing label coherence among neighbouring pixels. Alternatively, global optimization may be applied to the results of segmentation after convergence of Algorithm 2, to obtain a more accurate binarization. In addition, the pixel-wise prediction result based on the previous sections often tends to be noisy; hence a global optimizing function Gt is defined whose data term is computed from the loss values lF(p) and lB(p) directly. The contrast term is that used in [Criminisi 8] and [Sun 2], which adaptively penalizes segmentation boundaries based on local color differences.
It is known that the global optimizing function Gt can be calculated efficiently using graph cuts. In one example of the present foreground segmentation technique, a version of the push-relabel algorithm is implemented to compute the min cuts, wherein the number of push-relabel steps is limited to 20. This was sufficient for the examples shown in the Figures, although a skilled practitioner of the present invention may choose a greater or lesser number of iterations based on computational constraints or other considerations.
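For illustration, the following sketch resolves the remaining unknown pixels with a simple iterated neighbourhood vote. It is a deliberately simplified stand-in for the graph-cut (push-relabel) optimization described above, shown only to indicate where the binarization step fits; the small foreground bias mirrors the foreground bias mentioned earlier.

import numpy as np

FG, BG, UNK = 1, 0, -1

def binarize_unknowns(labels, iterations=20, bias_fg=0.5):
    # Force UNK pixels to FG or BG by repeated 3x3 neighbourhood voting,
    # with a small bias towards foreground.  A stand-in for graph cuts.
    lab = labels.astype(float).copy()
    h, w = lab.shape
    for _ in range(iterations):
        for y in range(h):
            for x in range(w):
                if labels[y, x] != UNK:
                    continue              # keep confidently labelled pixels fixed
                ys = slice(max(0, y - 1), min(h, y + 2))
                xs = slice(max(0, x - 1), min(w, x + 2))
                neigh = lab[ys, xs]
                fg_votes = np.sum(neigh == FG) + bias_fg
                bg_votes = np.sum(neigh == BG)
                lab[y, x] = FG if fg_votes >= bg_votes else BG
    return lab.astype(int)

seed = np.array([[1, 1, -1, 0, 0],
                 [1, -1, -1, -1, 0],
                 [1, 1, -1, 0, 0]])
print(binarize_unknowns(seed))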
Dealing with Incoming Frames
When a new frame t+1 arrives, the following preparation procedures are performed before the train-relabel procedure:
First, the classifiers are informed of the results of the binarization, after optional erosion of the boundary pixels. If the global optimizing function Gt was used to binarize (i.e. render binary) the output of Algorithm 2, then this final output is used as the sample space to train the C-1SVMs at frame t+1. Since the ambiguous areas are labeled in Gt using smoothness constraints, using it to train the foreground and background 1SVMs helps to resolve ambiguities in future frames. Nevertheless, pixels along the boundary between the foreground and background often have mixed colors. Using these pixels for training may introduce bad examples to the foreground and background 1SVMs, which in turn may cause incorrect labeling. An optional step of the method, therefore, circumvents this problem by applying a morphological erode to both the foreground and the background regions in Gt to remove ambiguities. In the case shown, erosion by 2 pixels is found to be sufficient.
Second, to allow the C-1SVMs to better adapt to the temporal changes of foreground/background appearance distributions, a temporal decay is applied: after Lt+1 is predicted, the weights of the existing support vectors in the foreground and background 1SVMs are downweighted by a factor (1−τtemporal).
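A short sketch of this preparation step is given below: the binarized result is eroded so that mixed boundary pixels are not reused as training labels, and the existing support-vector weights are decayed. The use of scipy's binary_erosion and the svms interface are assumptions made for brevity; the disclosure specifies a 2-pixel erosion but no particular library.

import numpy as np
from scipy.ndimage import binary_erosion

FG, BG, UNK = 1, 0, -1

def prepare_next_frame(binary_labels, svms, erode_px=2, tau_temporal=0.25):
    # Erode both regions of the binarized result and decay support vectors.
    # `svms` is any iterable of objects exposing a decay(tau) method, e.g.
    # instances of the illustrative OnlineOneClassSVM sketched earlier.
    fg = binary_erosion(binary_labels == FG, iterations=erode_px)
    bg = binary_erosion(binary_labels == BG, iterations=erode_px)
    labels = np.full(binary_labels.shape, UNK, dtype=int)
    labels[fg] = FG          # confident foreground seeds for frame t+1
    labels[bg] = BG          # confident background seeds for frame t+1
    for svm in svms:
        svm.decay(tau_temporal)   # downweight old support vectors by (1 - tau)
    return labels

# Example with a toy mask and no classifiers to decay.
mask = np.zeros((7, 7), dtype=int)
mask[2:5, 2:5] = FG
print(prepare_next_frame(mask, svms=[], erode_px=1))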
In the examples shown, the C-1SVM method was implemented on a GPU using DirectCompute, which is proposed by Microsoft as an alternative to CUDA and is included in the Direct3D 11 API, so as to exploit the inherent parallel structure of the method. For VGA-sized videos, this implementation achieves the real-time processing speed noted above. In each of the examples shown in the accompanying figures, the following parameter settings were used: C=0.5, τtemporal=0.25, τspatial=0.05, with the foreground labelling thresholds set to 0.1 and 0.3, and the background labelling thresholds both set to 0.4.
Notice that the background labelling requirements (that the background loss be low and the foreground loss be high) are intentionally looser than those for foreground objects. Leveraged with a relatively high temporal decay τtemporal, these parameter choices introduce a tendency of accepting unseen examples as background, which allows proper handling of background changes. Furthermore, the kernel function k(•,•) is computed as a Gaussian kernel with σ=10. Where Gaussian kernels are used, the parameters are tuned through cross-validation. The lower the foreground labelling threshold is set, the more easily a pixel is classified as foreground; similarly, the lower the background labelling threshold, the more easily a pixel is classified as background. The method is not limited to the parameters used in these tests, but may be varied according to the art and the disclosures herein. Although the Gaussian kernel is used here, any kernel function known to be suitable for support vector machines could be used, including: homogeneous polynomial, inhomogeneous polynomial, Gaussian radial basis function, and hyperbolic tangent.
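As a concrete reference, the Gaussian (RBF) kernel mentioned above may be written as below, with σ=10 matching the setting used in the tests; the polynomial kernel is included only as an example of an alternative, and its parameters are arbitrary.

import numpy as np

def gaussian_kernel(a, b, sigma=10.0):
    # k(a, b) = exp(-||a - b||^2 / (2 * sigma^2)); sigma = 10 as in the tests.
    d = np.asarray(a, float) - np.asarray(b, float)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))

def polynomial_kernel(a, b, degree=3, c=1.0):
    # Inhomogeneous polynomial kernel, shown for comparison only.
    return float((np.dot(np.asarray(a, float), np.asarray(b, float)) + c) ** degree)

print(round(gaussian_kernel((200, 30, 30), (205, 32, 31)), 3))  # similar colours: near 1
print(round(gaussian_kernel((200, 30, 30), (10, 10, 10)), 6))   # dissimilar colours: near 0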
Results on Testbed Videos:
Results on the standard testbed videos are displayed in the accompanying figures.
Quantitative Evaluation Using Ground Truth:
For a fair comparison with previous approaches which used multiple annotated images for training, the C-1SVM method disclosed herein was trained using both the first frame and one more selected frame where the initially occluded foreground portion is visible.
Ability to Recover from Incorrect Predictions:
While the aforementioned tendency of accepting unseen examples as background helps to correctly handle background changes, such as the background person in “talk” and new background scene in “walk”, it occasionally introduces errors if locally unseen foreground appearances are presented. Nevertheless, the redundancy of using two competing 1SVMs helps limit the effect incorrect labels have on the training of foreground 1SVMs, which gradually recognize the novel foreground colors.
Ability to Work in Background Subtraction Scenario:
With a different set of threshold settings (the foreground labelling threshold set to ∞ and the background labelling threshold set to 0.2), the algorithm can also be initialized by one or a few pure background image(s) instead of any strokes. Under this setup, only the local background 1SVMs are trained initially. As new frames are processed, outliers to the background 1SVMs are classified as foreground, which are then utilized to initiate the training of the local foreground 1SVMs. Meanwhile, inliers are used to update the background 1SVMs, allowing the algorithm to adapt to dynamic changes and background motion, as displayed in the accompanying figures.
The C-1SVM method is easy to implement, simple to use, and capable of handling a variety of difficult scenarios, such as dynamic background, camera motion, topology changes, and fuzzy object boundaries. Experiments on standard testbed videos demonstrate that the implementation of the C-1SVM method used for testing possesses performance comparable or superior to the other state-of-the-art approaches referred to herein.
It would be readily apparent to a person of skill in the art that the above procedures may be configured on other computing platforms in known ways, to achieve different results based on the processing power of such computing systems.
The foregoing embodiments and advantages are merely exemplary and are not to be construed as limiting the present invention. The present teaching can be readily applied to other types of apparatuses. Also, the description of the embodiments of the present invention is intended to be illustrative, and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.