Visual tracking is a fundamental problem in computer vision, with applications in many fields, such as motion analysis, human-computer interaction, robot perception, and traffic monitoring. While much progress has been made in recent years, designing a robust visual tracking algorithm remains challenging, mainly because of varying illumination, camera motion, occlusions, pose variation, and rotation.
One of the issues in visual tracking is matching the target's visual appearance over consecutive video frames. As in many other computer vision problems, such as image retrieval and object recognition, the matching process is largely determined by the interplay of two components: the appearance model of the target, which describes the target under unpredictable variations caused by complex and dynamic scenes, and the distance metric, which determines the matching relations, e.g., between the candidates and the target in visual tracking, or between the whole image set and the query image in image retrieval.
Distinct from those retrieval or recognition tasks, a tracking problem is unique in that the appearance model and the distance metric are both required to adapt to changes of the target as well as the background over video frames.
The appearance model of the target plays a major role in visual tracking. If a strong appearance model can be constructed that is invariant to illumination changes and local deformation, discriminative from the background, and capable of handling occlusions, even a simple, fixed distance metric such as the Euclidean distance yields favorable matching performance. However, obtaining an appearance model that handles all of these difficulties is hard. Although many excellent models have been designed to describe the target in recent years, few algorithms can deal with the various challenges at once; they usually focus on solving particular tracking issues. When a strong appearance model is not readily available, the choice of distance metric for visual tracking becomes particularly critical. Many recent studies have explored learning a single distance metric for tracking and obtained good results.
Recently, a growing number of tracking algorithms built on a collaborative framework have achieved satisfactory performance. These approaches share a simple motivation: a single description of the object is limited in what it can represent and may lose discriminative information that is helpful for matching.
One aspect of the present method includes (1) a distance metric fusion framework for visual tracking and (2) an effective samples selection mechanism in a distance metric learning process.
The method includes tracking a target object in sequential frames of images containing the target object including determining a target image depicting a first tracking position associated with a target object in one frame of an image sequence; generating, by a processor, a plurality of sample windows about the first tracking position of the target object by using sliding windows; classifying, by the processor, image targets in the sliding windows by Gaussian Kernel Regularized Least Square; collecting all sub-windows; computing sub-window scores; creating, by the processor, a negative sample set; creating two weight graphs of the template and target candidates with node links; calculating, by the processor, a distance metric fusion by cross diffusion; and tracking an object to the sequential frames of images using a collaborative distance metric.
The step of collecting all sub-windows can be performed by a sliding window process.
The step of computing sub-windows scores can include selecting samples with a higher response score.
The step of generating a distance metric fusion by cross diffusion includes generating, by the processor, two finite weighted graphs with target candidates and the template as vertices and links between pairs of nodes as edges; and fusing two graphs by cross diffusion.
The method further includes constructing an affinity matrix using K nearest neighbors method where K is a constant.
The method can further include constructing, by the processor, a weighted identity matrix, a plurality of status matrices; and choosing the arithmetic average of two status matrices generated in the last iteration as a final decision matrix.
The step of tracking with a collaborative distance metric uses an RGB-doublet-SVM to project an RGB histogram to a doublet-SVM distance metric state.
The method further includes training a doublet-SVM distance metric and a max-pooling distance metric by positive and negative samples.
The method can include selecting a candidate in the decision matrix having the largest value as a final tracking result of the target object in the current image frame.
An apparatus for tracking an object over a plurality of sequential image frames includes an image sensor generating a plurality of sequential image frames of a target object in a field of view of the image sensor and outputting the sequential image frames, and a processor that receives the sequential image frames from the image sensor. The processor executes program instructions for determining a target image depicting a first tracking position associated with a target object in one frame of an image sequence; generating, by the processor, a plurality of sample windows about the first tracking position of the target object by using sliding windows; classifying, by the processor, image targets in sliding windows by Gaussian Kernel Regularized Least Square; collecting all sub-windows; computing sub-window scores; creating, by the processor, a negative sample set; creating two weight graphs of the template and target candidates with node links; calculating, by the processor, a distance metric fusion by cross diffusion; and tracking an object to the sequential frames of images using a collaborative distance metric.
The processor can collect all sub-windows by a sliding window process.
The processor can compute sub-windows scores including selecting samples with a higher score.
The processor generates a distance metric fusion by cross diffusion including generating, by the processor, two finite weighted graphs with the template and target candidates as vertices and links between pairs of nodes as edges, and fusing two graphs by cross diffusion.
The processor constructs an affinity matrix using a K nearest neighbors method where K is a constant.
The processor constructs a weighted identity matrix, a plurality of status matrices, and chooses the arithmetic average of two status matrices generated in the last iteration as a final decision matrix.
The processor tracks with a collaborative distance metric, using an RGB-doublet-SVM to project an RGB histogram to a doublet-SVM distance metric state.
The processor trains a doublet-SVM distance metric and a max-pooling distance metric by positive and negative samples.
The processor selects a candidate in the decision matrix having the largest value as a final tracking result of the target object in the current image frame.
The various features, advantages and other uses of the present method and apparatus will become more apparent by referring to the following detailed description and drawing in which:
Referring now to the drawing, and to
The apparatus, which can be mounted, for example, on a moving vehicle 80, includes an image sensor or camera 82. By way of example, the camera 82 may be a high definition video camera with a resolution of 1024 pixels per frame.
The camera 82, in addition to possibly being a high definition camera 82, for example, can also be a big-picture/macro-view (normal or wide angle) or a telephoto camera. Two types of cameras, one having normal video output and the other providing a big-picture/macro-view normal or wide angle or telephoto output, may be employed on the vehicle 80. This would enable one camera to produce a big-picture/macro view, while the other camera could zoom in on parts of the target.
The apparatus, using the image sensor or camera 82 and the control system described hereafter and shown in
The method can be implemented by the apparatus which includes a computing device 100 shown in a block diagram form in
The computing device 100 can also include secondary, additional, or external storage 114, for example, a memory card, flash drive, or other forms of computer readable medium. The installed applications 112 can be stored in whole or in part in the secondary storage 114 and loaded into the memory 104 as needed for processing.
The computing device 100 can be mounted on the vehicle 80 or situated remote from the vehicle 80. In the latter case, remote communication will be used between the camera 82 and the computing device 100.
The computing device 100 receives an input in the form of sequential video frame image data 116 from the image sensor or camera 82 mounted on the vehicle 80. The video image data 116 may be stored in the memory 104 and/or the secondary storage 114.
Using a high definition output 116 from the camera 82, the target will have a reasonable size, as shown in
The presently disclosed object tracking method and apparatus use an algorithm in which the matching process plays a central role, and the matching performance in turn relies on the distance metric selected for matching. The matching process in a tracking problem is formulated as:
$$\min_i \; d^2(t, c_i) \tag{1}$$
where $t$ is the target's appearance model and $c_i$ is the $i$-th target candidate's appearance model in the current image frame. $d(\cdot,\cdot)$ is the distance function measured by the distance metric. Here, the distance function takes the form of a Mahalanobis distance, obtained by introducing a measure matrix $M$:
$$d^2(t, c_i) = \|t - c_i\|_M^2 = (t - c_i)^T M (t - c_i), \quad M \succeq 0 \tag{2}$$
Since $M$ is symmetric and positive semi-definite, Eq. 2 can be rewritten as:
$$d^2(t, c_i) = (t - c_i)^T A A^T (t - c_i) \tag{3}$$
where $M = AA^T$ represents the distance metric matrix, and $A^T$ is the projection matrix.
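As an illustration of Eqs. 1 to 3, the following minimal sketch computes the squared distance through the projection $A^T$ and picks the candidate with the smallest distance. The feature dimension, the random data, and the function names `mahalanobis_sq` and `best_candidate` are assumptions made for this example only; they do not come from the disclosure.

```python
import numpy as np

def mahalanobis_sq(t, c, A):
    """Squared distance (t - c)^T A A^T (t - c), computed via the projection A^T."""
    diff = A.T @ (t - c)          # project the difference vector with A^T
    return float(diff @ diff)     # squared Euclidean norm in the projected space

def best_candidate(t, candidates, A):
    """Index i minimizing d^2(t, c_i) over the rows of `candidates` (Eq. 1)."""
    dists = [mahalanobis_sq(t, c, A) for c in candidates]
    return int(np.argmin(dists)), dists

# Hypothetical data: 4-D appearance vectors projected to 2-D.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))
M = A @ A.T                       # symmetric positive semi-definite by construction
t = rng.normal(size=4)
candidates = rng.normal(size=(5, 4))
idx, dists = best_candidate(t, candidates, A)
```

Computing the distance in the projected space and evaluating the quadratic form with $M$ directly give the same value, which is the content of Eq. 3.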
Supervised distance metric learning methods are used instead of unsupervised ones because the tracker can utilize information obtained from the tracking process, which is more reliable. A supervised distance metric learning process needs sufficient samples to guarantee a good result. Unfortunately, collecting and processing them becomes a computational burden that directly conflicts with real-time requirements.
In most cases, sample-selection methods resort to random sampling as a trade-off to gain efficiency. Here, a fast frequency-domain sample-selection method is used to supervise the distance metric training process, a solution that takes all samples into the computation.
The positive and negative samples $x$ are collected with labels $y$ of $+1$ and $-1$ respectively, and used to train a ridge regression to obtain the parameter $z$:
$$\min_z \left[ J(y, h_z(x)) + \mu \|z\|_2^2 \right],$$
where $J(y, h_z(x)) = [y - k(z, x)]^2$ and $k(\cdot,\cdot)$ is a Gaussian kernel.
Once the parameter $z$ is known, the Gaussian kernel regularized least square classifies an input image $x_i$ as:
$$y = k(z, x_i).$$
This is a response score for each input image; an input with a larger score is more likely to be a positive sample.
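The scoring idea can be sketched with a standard kernel regularized least squares solver. This is not the exact parameterization above: the sketch assumes the usual dual form, in which coefficients $\alpha = (K + \mu I)^{-1} y$ are learned and a new input is scored by $\sum_i \alpha_i k(x_i, x)$. The kernel bandwidth convention $\exp(-\|x - y\|^2 / 2\sigma^2)$, the toy 2-D samples, and the names `train_krls` and `response` are likewise assumptions.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Matrix of exp(-||x - y||^2 / (2 sigma^2)) for rows x of X and y of Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def train_krls(X, y, mu=1e-2, sigma=1.0):
    """Dual ridge-regression coefficients alpha = (K + mu I)^{-1} y."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + mu * np.eye(len(X)), y)

def response(X_train, alpha, X_new, sigma=1.0):
    """Response score of each row of X_new; larger means more target-like."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# Toy samples: three positives (+1) and three negatives (-1).
X = np.array([[2.0, 2.0], [2.5, 2.0], [2.0, 2.5],
              [-2.0, -2.0], [-2.5, -2.0], [-2.0, -2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
alpha = train_krls(X, y)
scores = response(X, alpha, np.array([[2.2, 2.1], [-2.2, -2.1]]))
```

In this toy setup the first query lies near the positive samples and receives a positive score, while the second lies near the negatives and receives a negative one.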
In
A pre-process includes two steps. In step one, a simple classifier is trained in the current frame; in the next step, this classifier is used to compute the classification scores of all the sub-windows around the target. By utilizing a circulant structure of sub-windows 130 in
First, all sub-windows are collected by a sliding window approach, denoted by $W = (w_1, w_2, \ldots, w_M)$, $w_i \in \mathbb{R}^{r \times c}$, each with the same size as the target in the current frame. They can be divided into positive and negative sets as
$$W = (W^+, W^-) \tag{4}$$
Each sub-window in $W^+$ has a label $+1$, while each sub-window in $W^-$ has a label $-1$.
A Gaussian Kernel Regularized Least Square (GK-RLS) is chosen as the classifier, which has the form
$$h_z(x) = \kappa(z, x) \tag{5}$$
where $\kappa(\cdot,\cdot)$ is the Gaussian kernel acting on the classifier parameter $z$ and the input data $x$.
Since there is a need to handle non-linear data, the Kernel Trick is used for this problem.
The parameter z of GK-RLS classifier can be obtained by solving the minimization problem of ridge regression:
$$\min_z \left[ J(y, h_z(x)) + \mu \|z\|_2^2 \right] \tag{6}$$
where $y$ is the label vector. In Eq. 6, the former term is the loss function and $\mu \|z\|_2^2$ is the regularization term with $\mu$ being the weight. This is shown graphically in
where FFT is the Fast Fourier Transform and K(x, x) is the Gaussian kernel. For this problem in particular, Eq. 7 becomes
where [1,−1] is the label vector corresponding to positive and negative set.
The trained classifier conducts the second step. In the next image frame, the sub-windows are created by the sliding windows in the area around the target. The large target area is shown by the square 134 in
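$$\text{response} = h_z(S) = \kappa(z, S) \tag{9}$$
which can also be transformed into the frequency domain to obtain an efficient solution as
$$\mathcal{F}(\text{response}) = \mathcal{F}(K) \odot \mathcal{F}(z) \tag{10}$$
where $\mathcal{F}$ denotes the FFT, $\odot$ is the element-wise product, and $K$ is the vector with elements $K(z, S)$.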
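The efficiency of a frequency-domain evaluation comes from the fact that kernel values against every cyclic shift of a signal can be obtained with a single FFT round trip. The sketch below demonstrates this for a 1-D signal and checks it against a brute-force loop. It is an illustration under assumptions, not the disclosed implementation: the 1-D setting, the bandwidth convention $2\sigma^2$, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_all_shifts(x, z, sigma=1.0):
    """k(x, shift_m(z)) for every cyclic shift m, using one FFT round trip."""
    # Circular cross-correlation: corr[m] = sum_n x[n] * z[(n + m) % N]
    corr = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(z)))
    # ||x - shift_m(z)||^2 = ||x||^2 + ||z||^2 - 2 * corr[m]
    sq_dist = x @ x + z @ z - 2.0 * corr
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def gaussian_kernel_brute_force(x, z, sigma=1.0):
    """Same values, computed one shift at a time for comparison."""
    return np.array([
        np.exp(-np.sum((x - np.roll(z, -m)) ** 2) / (2.0 * sigma**2))
        for m in range(len(z))
    ])

rng = np.random.default_rng(1)
x = rng.normal(size=16)
z = np.roll(x, 3)                       # z is x cyclically shifted by 3 samples
fast = gaussian_kernel_all_shifts(x, z)
slow = gaussian_kernel_brute_force(x, z)
best_shift = int(np.argmax(fast))       # shift that best re-aligns z with x
```

The FFT version costs O(N log N) for all N shifts, whereas the loop costs O(N^2), which is why dense sampling of all sub-windows can remain compatible with real-time operation.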
Next, samples that have a higher response score but do not overlap the tracking result of the last frame are selected as negative samples. Samples that overlap the target are excluded because including them would lead the distance metric to mistake the foreground for background.
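The selection rule can be sketched as follows. The box format `(x, y, w, h)`, the strict no-overlap test, the number of negatives `k`, and the function names are all assumptions chosen for illustration; the text above does not specify them.

```python
import numpy as np

def overlaps(box_a, box_b):
    """True if two (x, y, w, h) boxes share any area."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def select_negatives(boxes, scores, last_result, k):
    """Indices of the k highest-scoring boxes that do not overlap `last_result`."""
    order = np.argsort(scores)[::-1]    # indices sorted by descending score
    keep = [int(i) for i in order if not overlaps(boxes[i], last_result)]
    return keep[:k]

last_result = (10, 10, 5, 5)            # tracking result of the last frame
boxes = [(10, 10, 5, 5), (12, 11, 5, 5), (20, 10, 5, 5), (0, 0, 5, 5), (30, 30, 5, 5)]
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.5])
negatives = select_negatives(boxes, scores, last_result, k=2)
```

Here the two highest-scoring boxes overlap the previous result and are skipped, so the selected negatives are the hard background samples that the classifier confuses with the target.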
A finite weighted graph G=(V,E), See
Candidate nodes in a feature: $x_1^{(a)}, \ldots, x_n^{(a)}$
Template node in a feature: $x_{n+1}^{(a)}$
Edge weights in a feature:
where $i, j = 1, \ldots, n+1$ and $\varphi(\cdot,\cdot)$ is the distance measured by the learned metric between nodes.
The weights of the edges encode nodal affinity: nodes connected by an edge with a high weight are considered strongly connected, and edges with low weights represent nearly disconnected nodes. The weight can be viewed as a similarity measure between vertices, so it is nonnegative and symmetric. In this work, the weight of the edge $e_{ij}$ between nodes $i$ and $j$ is defined as
wij=e−d
where $d_{ij}$ represents the distance between nodes $i$ and $j$ and $a$ is a constant that controls the strength of the weight. Usually features and operators are chosen as representations. The normalization of the weight matrix, $P = D^{-1} W$, can be interpreted as a natural kernel acting on functions on $V$, so that each row of $P$ sums to 1, where $D = \mathrm{diag}(\sum_j w_{ij})$ and $W = [w_{ij}]$. The weighted identity matrix
The status matrix $P$ is the normalized weighted graph $W$ ($W = [w_{ij}]$), i.e. $P = D^{-1} W$, where $D = \mathrm{diag}(\sum_j w_{ij})$.
The final decision matrix $D_f$ is the arithmetic mean of the two status matrices after $it$ steps of cross diffusion: $D_f = \frac{1}{2}\left(P^{(1)}_{it+1} + P^{(2)}_{it+1}\right)$.
Under the assumption that local similarities with high values are more reliable than far-away ones, an affinity matrix is constructed using the K nearest neighbors (KNN) method, denoted as $A$ with elements:
where $N(i)$ represents the $k$ nearest neighbors of node $i$. The corresponding natural kernel becomes:
$$Q = R^{-1} A \tag{13}$$
where $R = \mathrm{diag}(\sum_j a_{ij})$. $W$ is a fully connected graph, so $P$ encodes the full information about the similarity of each vertex to all the others, whereas $Q$ only carries the similarity between adjacent nodes and becomes a sparsely connected graph.
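The construction of $W$, $P$, $A$ and $Q$ can be sketched as below. Two details are assumptions, because the corresponding formulas are not fully legible in the text: the weight is taken as $w_{ij} = \exp(-d_{ij}/a)$, and the KNN affinity keeps $w_{ij}$ when $j$ is among the $k$ nearest neighbors of $i$ and is zero otherwise. The function names and the random node features are also hypothetical.

```python
import numpy as np

def row_normalize(W):
    """Return D^{-1} W with D = diag(sum_j w_ij); every row then sums to 1."""
    return W / W.sum(axis=1, keepdims=True)

def knn_affinity(W, k):
    """Keep the k largest off-diagonal weights in each row and zero the rest."""
    A = np.zeros_like(W)
    for i in range(W.shape[0]):
        w = W[i].copy()
        w[i] = -np.inf                  # a node is not its own neighbor
        nbrs = np.argsort(w)[-k:]       # indices of the k most similar nodes
        A[i, nbrs] = W[i, nbrs]
    return A

def build_kernels(dist, a=1.0, k=2):
    """Fully connected kernel P and sparse KNN kernel Q from a distance matrix."""
    W = np.exp(-dist / a)               # assumed form of the edge weight
    P = row_normalize(W)
    Q = row_normalize(knn_affinity(W, k))
    return W, P, Q

rng = np.random.default_rng(2)
points = rng.normal(size=(6, 3))        # hypothetical node features
dist = np.linalg.norm(points[:, None] - points[None], axis=-1)
W, P, Q = build_kernels(dist, a=1.0, k=2)
```

$P$ has no zero entries, while each row of $Q$ has exactly $k$ nonzero entries, which matches the fully connected versus sparsely connected distinction drawn above.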
Two graphs of different metrics are fused by cross diffusion process. Let V(i)=[V(i)
Different perspectives of representing data are complementary, and a single description cannot achieve satisfactory performance because it is incomplete. Intuitively, fusing complementary representations can offer a better result.
In this process, two different methods are used to measure similarity between data nodes, leading to different weight matrices, affinity matrices, and their corresponding normalized matrices, denoted by $W^{(1)}$ and $W^{(2)}$, $P^{(1)}$ and $P^{(2)}$, $A^{(1)}$ and $A^{(2)}$, and $Q^{(1)}$ and $Q^{(2)}$. The fusion process is defined as:
$$P^{(1)}_{it+1} = Q^{(1)} \times P^{(2)}_{it} \times \left(Q^{(1)}\right)^T + \rho I$$
$$P^{(2)}_{it+1} = Q^{(2)} \times P^{(1)}_{it} \times \left(Q^{(2)}\right)^T + \rho I \tag{14}$$
where $it = 0, 1, \ldots$ is the iteration index, with $P^{(1)}_0 = P^{(1)}$ and $P^{(2)}_0 = P^{(2)}$ as the initial states. $\rho I$ is a weighted identity matrix that guarantees self-similarity. This process will finally converge and the two status matrices will become almost the same, so the decision matrix $D_f$ is obtained by computing the arithmetic mean of the two converged status matrices:
$$D_f = \frac{1}{2}\left(P^{(1)}_{it+1} + P^{(2)}_{it+1}\right) \tag{15}$$
Either status matrix can also be chosen as the final decision matrix without a large difference. In every iteration, the status matrix of one measure is modified by the kernel matrix corresponding to the other measure's KNN graph, and the KNN processing can reduce noise between instances, so the decision matrix is robust. The decision matrix is used to match the target and the candidates, and obtains better performance than either measure alone.
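A minimal sketch of the update in Eq. 14 and the averaging in Eq. 15 follows. The fixed iteration count, the value of `rho`, and the random row-stochastic inputs are assumptions for illustration; in practice $Q^{(1)}$ and $Q^{(2)}$ would be sparse KNN kernels, and any stopping criterion or per-iteration normalization is left out because the text does not specify one.

```python
import numpy as np

def cross_diffusion(P1, P2, Q1, Q2, rho=1.0, iterations=5):
    """Iterate Eq. 14 and return the decision matrix D_f of Eq. 15."""
    identity = np.eye(P1.shape[0])
    for _ in range(iterations):
        # Both right-hand sides use the status matrices of the previous iteration.
        P1, P2 = (Q1 @ P2 @ Q1.T + rho * identity,
                  Q2 @ P1 @ Q2.T + rho * identity)
    return 0.5 * (P1 + P2)

rng = np.random.default_rng(3)

def random_row_stochastic(n):
    """Hypothetical stand-in for a normalized kernel: rows sum to 1."""
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

P1, P2, Q1, Q2 = (random_row_stochastic(4) for _ in range(4))
Df = cross_diffusion(P1, P2, Q1, Q2, rho=1.0, iterations=3)
```

The tuple assignment matters: updating $P^{(1)}$ first and then using the new value to update $P^{(2)}$ would no longer match Eq. 14, where both updates read the matrices from iteration $it$.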
Two similarity measure approaches are used; each of them is a feature projecting to a distance metric space model. One is RGB-doublet-SVM, which is to project RGB histogram feature, see
Let $W = (W^+, W^-) = \{(w_l, y_l) \mid l = 1, \ldots, L\}$ be the sample set obtained by the sample selection mechanism introduced in the preceding parts, where $y_l$ is the label, with value $+1$ in $W^+$ and $-1$ in $W^-$, and $L$ is the number of samples. Then, according to the pair constraints, doublets $D$ are constructed for doublet-SVM distance metric learning by extracting any two samples from $W$:
$$D = \{(w_m, w_n, z_l) \mid m, n = 1, \ldots, L,\ m \neq n\} \tag{16}$$
where $z_l = +1$ when $y_m = y_n$ and $z_l = -1$ when $y_m \neq y_n$. By defining a kernel of doublet
A kernel SVM is learned that classifies the training samples best. The doublet-SVM is defined as:
$$z_l\left[(w_{l1} - w_{l2})^T M_1 (w_{l1} - w_{l2}) + c\right] \geq 1 - \xi_l \tag{18}$$
where $\|\cdot\|_F$ denotes the Frobenius norm, $c$ is the bias, $h$ is the weight of the hinge loss penalty and $\xi_l$ is a slack variable. In Eq. 18, $M_1$ represents the doublet-SVM distance metric to be learnt and, according to the Mahalanobis metric definition,
$$M_1 = \sum_i \lambda_i z_i (w_{i1} - w_{i2})(w_{i1} - w_{i2})^T \tag{19}$$
with $\lambda_i$ being the weight. Then the first constraint condition in Eq. 18 becomes
$$z_l\left[(w_{l1} - w_{l2})^T M_1 (w_{l1} - w_{l2}) + c\right] \geq 1 - \xi_l \;\Rightarrow\; \sum_i \lambda_i z_i z_l \kappa_d(D_l, D_i) + z_l c \geq 1 - \xi_l \tag{20}$$
which is the common kernel SVM form. In addition, many existing SVM solvers such as LibSVM [20] can easily solve the Lagrange dual form of Eq. 18.
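The step from Eq. 19 to Eq. 20 can be checked numerically. The sketch below builds doublets from labeled samples, forms $M_1$ from a set of weights, and verifies that the quadratic form equals the kernel expansion. It assumes the doublet kernel is the degree-2 polynomial kernel $\kappa_d(D_i, D_j) = [(w_{i1} - w_{i2})^T (w_{j1} - w_{j2})]^2$, uses unordered pairs, and draws random nonnegative $\lambda_i$ as placeholders for the values an SVM solver would return; these choices and all function names are assumptions.

```python
import numpy as np
from itertools import combinations

def build_doublets(samples, labels):
    """All unordered sample pairs, with z = +1 for equal labels and -1 otherwise."""
    pairs, z = [], []
    for m, n in combinations(range(len(samples)), 2):
        pairs.append((samples[m], samples[n]))
        z.append(1.0 if labels[m] == labels[n] else -1.0)
    return pairs, np.array(z)

def doublet_kernel(Da, Db):
    """Assumed degree-2 polynomial doublet kernel ((a1 - a2)^T (b1 - b2))^2."""
    return float((Da[0] - Da[1]) @ (Db[0] - Db[1])) ** 2

def metric_from_dual(pairs, z, lam):
    """M1 = sum_i lam_i z_i (w_i1 - w_i2)(w_i1 - w_i2)^T, as in Eq. 19."""
    d = pairs[0][0].shape[0]
    M = np.zeros((d, d))
    for (w1, w2), zi, li in zip(pairs, z, lam):
        diff = w1 - w2
        M += li * zi * np.outer(diff, diff)
    return M

rng = np.random.default_rng(4)
samples = rng.normal(size=(5, 3))               # hypothetical sample features
labels = [1, 1, 1, -1, -1]
pairs, z = build_doublets(samples, labels)
lam = rng.random(len(pairs))                    # placeholder dual weights
M1 = metric_from_dual(pairs, z, lam)

# Left and right sides of the identity behind Eq. 20, for the first doublet.
diff0 = pairs[0][0] - pairs[0][1]
lhs = float(diff0 @ M1 @ diff0)
rhs = sum(l * zi * doublet_kernel(pairs[0], p) for l, zi, p in zip(lam, z, pairs))
```

Because the quadratic form depends on the samples only through $\kappa_d$, the problem can be handed to a standard kernel SVM solver with a precomputed kernel matrix over doublets.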
Let $C = [c_1, c_2, \ldots, c_l]$ be the representation of the candidate set in the current frame generated by a particle filter, where $l$ is the number of particles and $c_i \in \mathbb{R}^{M \times N}$. $T$ represents the target in the last frame; $T$ is also an $M \times N$ matrix. The vertices for graph construction are then obtained by concatenating the template and the candidates, $V = [T, C] = [T, c_1, c_2, \ldots, c_l]$. To compute the similarity between nodes, two kinds of features are first extracted from the vertices:
$$F_1 = \mathrm{RGB}(V), \qquad F_2 = \mathrm{SC}(V) \tag{21}$$
where RGB is RGB histogram approach and SC is to extract sparse coding coefficient from each node as a feature. Then two corresponding distance metrics doublet-SVM and Max-Pooling, see
$$D_s(i, j) = [F_s(i) - F_s(j)]\, M_s\, [F_s(i) - F_s(j)]^T \tag{22}$$
where s=1,2 correspond to the two features and distance metrics. By utilizing fusion process described above the decision matrix Df is obtained. Then the target candidate corresponding to the largest value in the first row of Df is chosen as the final tracking result of the current frame, because the first row denotes the similarities between the target and target candidates.
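The pairwise distance of Eq. 22 and the final selection from the first row of $D_f$ can be sketched as follows. The random features, the identity metric, the toy decision matrix, and the function names are assumptions for illustration. The template's own column is skipped when taking the maximum, since node 0 is the template and its self-similarity would otherwise always win.

```python
import numpy as np

def pairwise_metric_distance(F, M):
    """D[i, j] = (F[i] - F[j]) M (F[i] - F[j])^T for row-vector features F[i]."""
    diff = F[:, None, :] - F[None, :, :]            # shape (n, n, d)
    return np.einsum("ijk,kl,ijl->ij", diff, M, diff)

def pick_candidate(Df):
    """Zero-based candidate index with the largest similarity to the template."""
    return int(np.argmax(Df[0, 1:]))                # column 0 is the template itself

rng = np.random.default_rng(5)
F = rng.normal(size=(4, 3))         # row 0: template, rows 1..3: candidates
M = np.eye(3)                       # placeholder for a learned metric M_s
D = pairwise_metric_distance(F, M)

# Toy decision matrix; only the first row is used for the selection.
Df_toy = np.array([[9.0, 0.1, 0.7, 0.3],
                   [0.1, 9.0, 0.2, 0.2],
                   [0.7, 0.2, 9.0, 0.1],
                   [0.3, 0.2, 0.1, 9.0]])
winner = pick_candidate(Df_toy)
```

With the identity metric, $D_s$ reduces to the squared Euclidean distance, which gives a quick sanity check before plugging in a learned $M_s$.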
Accordingly, the apparatus and method perform the following sequence of steps including sampling a target image containing a plurality of sample windows by a sliding window approach in step 140 in
Next, in step 144, the processor collects all sub-windows and, in step 146, computes scores for all of the sub-windows.
The processor creates a negative sample set in step 148 and then computes two weighted graphs of the template and target candidates with node links in step 150. The processor fuses the two graphs by a distance metric fusion using cross diffusion in step 152 to track a target object across a sequence of image frames using a collaborative distance metric in step 154.
Number | Name | Date | Kind |
---|---|---|---|
7720993 | Liu | May 2010 | B2 |
8868489 | Hood | Oct 2014 | B2 |
Number | Date | Country |
---|---|---|
103500345 | Jan 2014 | CN |
Entry |
---|
Jiang et al.; “Learning Adaptive Metric for Robust Visual Tracking”; Jul. 14, 2011; 13 pages; vol. 20, Issue 8; Image Processing, IEEE Transactions. |
Lafon et al.; “Data Fusion and Multi-Cue Data Matching by Diffusion Maps”; Sep. 25, 2006; 21 pages; vol. 28, Issue 11; Pattern Analysis and Machine Intelligence, IEEE Transactions. |
Anonymous; “Collaborative Distance Metric Learning for Visual Tracking”; 2014; 16 pages; ACCV. |
Jia et al.; “Visual tracking via adaptive structural local sparse appearance model”; In: CVPR; 2012; pp. 1822-1829. |
Zhong et al.; “Robust object tracking via sparsity-based collaborative model”; In: CVPR; 2012; pp. 1838-1845. |
Jepson et al.; “Robust online appearance models for visual tracking”; PAMI 25; 2003; pp. 1296-1311. |
Tang et al.; “Co-tracking using semi-supervised support vector machines”; In: ICCV; 2007; pp. 1-8. |
Jiang et al.; “Adaptive and discriminative metric differential tracking”; In: CVPR; 2011; pp. 1161-1168. |
Wang et al.; “Unsupervised metric fusion by cross diffusion”; In: CVPR; 2012; pp. 2997-3004. |
Weinberger et al.; “Distance metric learning for large margin nearest neighbor classification”; In: NIPS; 2005. |
Goldberger et al.; “Neighbourhood components analysis”; In: NIPS; 2004. |
Hong et al.; “Dual-force metric learning for robust distracter-resistant tracker”; In: ECCV (1); 2012; pp. 513-527. |
Li et al.; “Non-sparse linear representations for visual tracking with online reservoir metric learning”; In: CVPR; 2012; pp. 1760-1767. |
Blum; “Combining labeled and unlabeled data with co-training”; In: COLT; 1998; pp. 92-100. |
Leistner et al.; “Semi-supervised boosting using visual similarity learning”; In: CVPR; 2008. |
Babenko et al.; “Robust object tracking with online multiple instance learning”; PAMI 33; 2011; pp. 1619-1632. |
Henriques et al.; “Exploiting the circulant structure of tracking-by-detection with kernels”; In: ECCV (4); 2012; pp. 702-715. |
Wang et al.; “A kernel classification framework for metric learning”; CoRR abs/1309.5823; 2013. |
Davis et al.; “Information-theoretic metric learning”; In: ICML; 2007; pp. 209-216. |
Guillaumin et al.; “Is that you? metric learning approaches for face identification”; In: ICCV; 2009; pp. 498-505. |
Globerson et al.; “Metric learning by collapsing classes”; In: NIPS; 2005. |
Chang et al.; “A library for support vector machines”; 2001; pp. 27. |
Adam et al.; “Robust fragments-based tracking using the integral histogram”; In: CVPR (1); 2006; pp. 798-805. |
Ross et al.; “Incremental learning for robust visual tracking”; IJCV 77; 2008; pp. 125-141. |
Bao et al.; “Real time robust l1 tracker using accelerated proximal gradient approach”; In: CVPR; 2012; pp. 1830-1837. |
Zhang et al.; “Robust visual tracking via multi-task sparse learning” In: CVPR; 2012; pp. 2042-2049. |
Grabner et al.; “Semi-supervised on-line boosting for robust tracking”; In: ECCV (1); 2008; pp. 234-247. |
Kwon et al.; “Visual tracking decomposition”; In: CVPR; 2010; pp. 1269-1276. |