The present invention relates generally to the field of dense point matching in a video sequence. More precisely, the invention relates to a method for filtering a displacement field.
This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present invention that are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present invention. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
The problem of point and path tracking is a widely studied and still open issue with implications in a broad area of computer vision and image processing. On one side, applications such as object tracking, structure from motion, motion clustering and segmentation, and scene classification, among others, may benefit from a set of point trajectories by analyzing an associated feature space. On the other side, applications related to video processing such as augmented reality, texture insertion, scene interpolation, view synthesis, video inpainting and 2D-to-3D conversion eventually require determining a dense set of trajectories or point correspondences that permits propagating large amounts of information (color, disparity, depth, position, etc.) across the sequence. Dense instantaneous motion information is well represented by optical flow fields, and points can simply be propagated through time by accumulating the motion vectors, also called displacement vectors. That is why state-of-the-art methods for dense point tracking, as described by Brox and Malik in "Object segmentation by long term analysis of point trajectories" (Proc. ECCV, 2010) or by Sundaram, Brox and Keutzer in "Dense point trajectories by GPU-accelerated large displacement optical flow" (Proc. ECCV, 2010), are built on top of optical flow and rely on such an accumulation of motion vectors. Finally, such state-of-the-art methods produce a motion field based either on a from-the-reference integration, for instance using the Euler integration disclosed by Sundaram, Brox and Keutzer in the above-cited paper, or on a to-the-reference integration as disclosed in international patent application PCT/EP13/050870 filed on Jan. 17, 2013 by the applicant.
The technical issue is how to combine both representations in order to efficiently exploit their respective benefits: a from-the-reference displacement field better represents the spatio-temporal features of a point (or pixel), while a to-the-reference displacement field offers a more accurate estimation.
The present invention provides such a solution.
The invention is directed to a method for filtering a displacement field between a first image and a second image, a displacement field comprising, for each pixel of the first (reference) image, a displacement vector to the second (current) image. The method comprises a first step of spatio-temporal filtering wherein a weighted sum of neighboring displacement vectors produces, for each pixel of the first image, a filtered displacement vector. The filtering step is remarkable in that a weight in the weighted sum is a trajectory weight, a trajectory weight being representative of a trajectory similarity. Advantageously, the first filtering step allows taking into account trajectory similarities between neighboring points.
According to an advantageous characteristic, a trajectory associated with a pixel of the first image comprises a plurality of displacement vectors from the pixel to a plurality of images. According to another advantageous characteristic, a trajectory weight comprises a distance between the trajectory of the pixel and the trajectory of a neighboring pixel.
In a first embodiment, the first step of spatio-temporal filtering comprises for each pixel of the first image:
Advantageously, in the second filtering step, the backward displacement field is used to refine the forward displacement field built by a from-the-reference integration. Advantageously, the second step is applied on the filtered from-the-reference displacement field. In a variant, the second step is applied on the from-the-reference displacement field directly.
In a variant of the second embodiment, the method comprises a second step of joint forward backward spatial filtering comprising a weighted sum of displacement vectors wherein the displacement vector belongs:
In another variant of the second embodiment, the method comprises, after the second joint forward backward spatial filtering step, a third step of selecting a displacement vector between a previously filtered displacement vector and a current filtered displacement vector. This variant advantageously produces converging displacement fields.
In a third embodiment, the method comprises, before the first spatio-temporal filtering step, a step of occlusion detection wherein the displacement vector of an occluded pixel is discarded in the first and/or second filtering steps.
In a refinement of the third embodiment, the three steps (spatio-temporal filtering, joint forward backward filtering, occlusion detection) are sequentially iterated for each displacement vector of successive second images belonging to a video sequence.
In a further refinement of the third embodiment, the steps are iterated for each inconsistent displacement vector of successive second images belonging to the video sequence. In other words, once displacement vectors have been filtered for a set of N images, the filtering is iterated only for the inconsistent displacement vectors of the same set of N images. Advantageously, in this refinement, only bad displacement vectors (those for which the inconsistency between forward and backward displacement vectors is above a threshold) are processed in a second pass.
According to another aspect, the invention is directed to a graphics processing unit comprising means for executing code instructions for performing the method previously described.
According to another aspect, the invention is directed to a computer-readable medium storing computer-executable instructions performing all the steps of the method previously described when executed on a computer.
Any characteristic or variant embodiment described for the method is compatible with the device intended to implement the disclosed method and with the computer-readable medium.
Preferred features of the present invention will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
a illustrates motion integration strategies through Euler integration method according to prior art;
b illustrates motion integration strategies through inverse integration method according to an international patent application of the applicant;
a illustrates estimated trajectories for rotational motion;
b illustrates estimated trajectories for divergent motion;
c illustrates estimated trajectories for zero motion;
a illustrates position square error through time for rotational motion;
b illustrates position square error through time for divergent motion;
c illustrates position square error through time for zero motion;
a illustrates from-the-reference correspondence point scheme;
b illustrates to-the-reference correspondence point scheme;
In the following description, the term "motion vector" or "displacement vector" d0,N(x) denotes a data set which defines a displacement from a pixel x of a first frame I0 to a corresponding location in a second frame IN of a video sequence, where the indices 0 and N are numbers representative of the temporal frame positions in the video sequence. An elementary motion field defines a motion field between two consecutive frames IN and IN+1.
Respectively the terms “motion vector” or “displacement vector”, “elementary motion vector” or “elementary displacement vector”, “motion field” or “displacement field”, “elementary motion field” or “elementary displacement field” are indifferently used in the following description.
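For illustration purposes only, the fields defined above may be laid out in memory as sketched below in Python/NumPy. This is one plausible representation, with image size and array conventions chosen arbitrarily; it is not prescribed by the method.

```python
import numpy as np

# Illustrative only: one plausible in-memory layout for the fields above.
H, W = 480, 640                             # grid G is H x W pixels (arbitrary size)

# An elementary motion field v_{N,N+1} stores, for every pixel of I_N,
# a 2D displacement vector towards I_{N+1}: shape (H, W, 2).
v = np.zeros((H, W, 2), dtype=np.float32)   # v[y, x] = (dx, dy)

# A displacement field d_{0,N} between the reference I_0 and I_N has the
# same layout; a video sequence yields one elementary field per frame pair.
elementary_fields = [np.zeros((H, W, 2), dtype=np.float32) for _ in range(10)]
```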
A salient idea of the method for filtering a motion field, or a set of motion fields for a video sequence, is to introduce into the filtering information representative of the trajectory similarity of spatially and temporally neighboring points.
Consider a sequence of images $\{I_n\}_{n=0 \ldots N}$ where $I_n: G \to \Lambda$ is defined on the discrete rectangular grid $G$ and $\Lambda$ is the color space. Let $d_{n,m}: \Omega \to \mathbb{R}^2$ be a displacement field defined on the continuous rectangular domain $\Omega$, such that to every $x \in \Omega$ there corresponds a displacement vector $d_{n,m}(x) \in \mathbb{R}^2$ for the ordered pair of images $\{I_n, I_m\}$. Furthermore, let us call $I_0$ the reference image. We pose the following problem: given an input set of elementary optical flow fields $v_{n,n+1}: G \to \mathbb{R}^2$ defined on the grid $G$, compute the displacement vectors $d_{0,m}(x) = d_{0,m}(i,j)$ for all $m: 1 \ldots N$ and for the grid position $x = (i,j) \in G$.
This is essentially the problem of determining the position of the initial point $(i,j)$ of $I_0$ at each subsequent frame, i.e. the trajectory of $(i,j)$ from $I_0$ to $I_N$. The classical solution to this problem is to apply a simple Euler integration method, which is defined by the iteration
$$d_{0,m+1}(i,j) = d_{0,m}(i,j) + v_{m,m+1}\big((i,j) + d_{0,m}(i,j)\big) \qquad (1)$$
from which the trajectory position in $I_{m+1}$ is given by $x_{m+1} = (i,j) + d_{0,m+1}(i,j)$, and $v_{m,m+1}(\cdot)$ is possibly an interpolated value at a non-grid location. Now, is this the best way of computing each displacement vector, and hence the trajectory of $(i,j)$? In an ideal, error-free world, yes; in practice, no.
We shall see how the unavoidable optical flow estimation inaccuracies lead to errors in the estimated displacements. Let us call $d_{0,m+1}(i,j)$ the true displacement vector and $\hat{d}_{0,m+1}(i,j)$ an estimation of it; likewise, the hat notation indicates any estimated, error-prone quantity. For a given iteration of (1) we can express the estimation error $\xi_{0,m+1} = \hat{d}_{0,m+1}(i,j) - d_{0,m+1}(i,j)$ as

$$\xi_{0,m+1} = \xi_{0,m} + \delta_{m,m+1}(\hat{x}_m) + \big[v_{m,m+1}(\hat{x}_m) - v_{m,m+1}(x_m)\big] \qquad (2)$$

with $x_m = (i,j) + d_{0,m}(i,j)$, $\hat{x}_m = (i,j) + \hat{d}_{0,m}(i,j)$, and where $\delta_{m,m+1}(\cdot)$ accounts for the input optical flow estimation error, such that $\hat{v}_{m,m+1}(x) = v_{m,m+1}(x) + \delta_{m,m+1}(x)$. Here we distinguish three types of terms:
The first two terms are inherent to the processes of integration and elementary motion estimation, and thus they can be neither avoided nor neglected. On the other hand, it is interesting to analyze the motion bias term, i.e. the bracketed difference in (2). Its magnitude $B_{0,m}$ is bounded as

$$B_{0,m} = \big\|v_{m,m+1}(\hat{x}_m) - v_{m,m+1}(x_m)\big\| \le \sup_{y \in \rho(x_m)} \big\|v_{m,m+1}(y) - v_{m,m+1}(x_m)\big\| \qquad (3)$$

where $\rho(x_m)$ is a ball of radius $\|\xi_{0,m}\|$ centered at $x_m$.
Note that $\|\xi_{0,m}\|$ is in general increasing (the position estimation error inevitably grows along the sequence), and thus this bound cannot be tightened. In other words, as $\|\xi_{0,m}\|$ is not bounded, the motion bias term can be arbitrarily large, limited only by the maximum flow difference between two (possibly distant) image points. This undesirable behavior is the cause of the ubiquitous position drift observed in dense optical-flow-based tracking algorithms, independently of the flow estimation precision. What equation (3) states is that even small errors introduced by $\delta_{m,m+1}$ may lead to an unbounded drift. How to radically reduce this drift is the concern of what follows.
Surprisingly, we can dramatically reduce the drift effect by proceeding differently while integrating the input optical flow fields. Consider the following iteration for computing $d_{n,m}(i,j)$:
$$d_{n,m}(i,j) = v_{n,n+1}(i,j) + d_{n+1,m}\big((i,j) + v_{n,n+1}(i,j)\big) \qquad (4)$$
for $n = m-1, \ldots, 0$, so that one pass over the index $n$ finally gives the displacement field $d_{0,m}$. Let us discuss the differences between (1) and (4). Euler's method starts at the reference $I_0$ and performs the motion accumulation in the sense of motion, providing a sequential integration. Meanwhile, what we call inverse integration starts from the target image $I_m$ and recursively computes the displacement fields back to the reference image, in a non-causal manner. Note that in (1) a previously estimated displacement value is accumulated with an interpolation of the elementary motion field, which introduces both an error due to the noisy field $v_{m,m+1}$ itself and an error due to evaluating $v_{m,m+1}$ at a position biased by the current accumulated drift. In (4), on the other hand, an elementary flow vector is accumulated with an interpolation of a previously estimated displacement value; the difference is that in this second case the drift is limited to that introduced by $v_{n,n+1}(i,j)$.
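For illustration, the two integration strategies of equations (1) and (4) can be sketched as follows in Python/NumPy. The bilinear interpolation, the border clipping and the (x, y) indexing conventions are assumptions of the sketch, not requirements of the method.

```python
import numpy as np

def interp_field(field, pos):
    """Bilinear interpolation of an (H, W, 2) field at a float position (x, y)."""
    H, W, _ = field.shape
    x = min(max(pos[0], 0.0), W - 1.001)
    y = min(max(pos[1], 0.0), H - 1.001)
    x0, y0 = int(x), int(y)
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * field[y0, x0] + fx * (1 - fy) * field[y0, x0 + 1]
            + (1 - fx) * fy * field[y0 + 1, x0] + fx * fy * field[y0 + 1, x0 + 1])

def euler_integration(flows, x):
    """Equation (1): sequentially accumulate v_{0,1}, ..., v_{N-1,N} from pixel x of I_0."""
    d = np.zeros(2)
    for v in flows:                        # v = v_{m,m+1}
        d = d + interp_field(v, x + d)     # flow is evaluated at the drifted position
    return d                               # d_{0,N}(x)

def inverse_integration(flows, x):
    """Equation (4): start from d_{m,m} = 0 at the target and recurse back to I_0."""
    H, W, _ = flows[0].shape
    d = np.zeros((H, W, 2))                # d_{m,m} = 0 everywhere
    for v in reversed(flows):              # n = m-1, ..., 0
        d_new = np.empty_like(d)
        for j in range(H):
            for i in range(W):
                p = np.array([i, j], float) + v[j, i]
                d_new[j, i] = v[j, i] + interp_field(d, p)
        d = d_new                          # now holds d_{n,m} for the current n
    return d[int(x[1]), int(x[0])]         # d_{0,m}(x) at grid pixel x = (i, j)
```

Note how the roles are swapped: Euler's iteration interpolates the noisy elementary flow at a drifted position, whereas the inverse recursion interpolates a previously estimated displacement field at a position given by a single elementary flow vector.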
a illustrates motion integration strategies through the Euler integration method according to prior art. The Euler integration method, also called direct integration, performs the estimation by sequentially accumulating the motion vectors in the sense of the sequence, that is to say from the first image I0 to the last image Im.
b illustrates motion integration strategies through the inverse integration method according to a method disclosed in international patent application PCT/EP13/050870 filed on Jan. 17, 2013 by the applicant. The inverse integration performs the estimation recursively in the opposite sense, from the last image to the first image.
Effectively, for $n = 0$ we have

$$\xi_{0,m} = \delta_{0,1}(i,j) + d_{1,m}\big((i,j) + \hat{v}_{0,1}(i,j)\big) + \varepsilon_{1,m}\big((i,j) + \hat{v}_{0,1}(i,j)\big) - d_{1,m}\big((i,j) + v_{0,1}(i,j)\big) \qquad (5)$$

where $\varepsilon_{1,m}$ denotes the estimation error of the field $\hat{d}_{1,m}$.
In this case, as $\delta_{0,1}(i,j)$ corresponds to the error term in the estimated optical flow $\hat{v}_{0,1}(i,j)$, we can assume that $\|\delta_{0,1}(i,j)\|$ is kept small (it is not an increasing accumulated error like $\xi_{0,m}$ in (3)), and thus for the motion bias we have

$$B_{0,m} \le \sup_{y \in \rho(x_1)} \big\|d_{1,m}(y) - d_{1,m}(x_1)\big\|$$

with $\rho(x_1)$ a ball of radius $\|\delta_{0,1}(i,j)\|$ centered at $x_1 = (i,j) + v_{0,1}(i,j)$. Assuming continuous displacement fields $d_{n+1,m}$ and a small elementary motion estimation error $\|\delta_{0,1}(i,j)\|$, $\|d_{1,m}(y) - d_{1,m}(x_1)\|$ is bounded, and so is $B_{0,m}$.
By changing the way of integrating the same input optical flows, we have attained a highly desirable property: the bias introduced at each integration step no longer diverges.
We now analyze the behavior of the two integration methods in trajectory estimation by studying the case of stationary affine motion models perturbed by zero-mean Gaussian noise. We assume elementary motion fields of the form $v_{m,m+1}(x) = Ax + b$, and estimated fields $\hat{v}_{m,m+1}(x) = v_{m,m+1}(x) + r_m$ with $r_m \sim \mathcal{N}(0, \sigma^2 I)$. The same input fields are used for estimating trajectories with both methods.
In the case of Euler's integration, the application of equation (1) is straightforward, iterating over $m = 1 \ldots N$. For the inverse integration method, equation (4) is repeated for each $m: 1 \ldots N$ and $n: m-1 \ldots 0$, so as to obtain the series of displacement fields $d_{0,m}$. We have tested three different affine models: a rotational motion, a divergent motion and the zero motion.
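The following driver, reusing the euler_integration and inverse_integration sketches above, illustrates such a simulation for the rotational model; the image size, noise level and rotation angle are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, N, sigma = 64, 64, 20, 0.05
theta = 0.05                                  # small rotation per frame
A = np.array([[np.cos(theta) - 1, -np.sin(theta)],
              [np.sin(theta),      np.cos(theta) - 1]])
c = np.array([W / 2, H / 2])                  # rotate about the image centre

ys, xs = np.mgrid[0:H, 0:W]
grid = np.stack([xs, ys], axis=-1).astype(float)   # grid[y, x] = (x, y)

# Noisy stationary elementary fields: v_hat(x) = A(x - c) + r_m, r_m ~ N(0, sigma^2 I)
flows = [np.einsum('ij,hwj->hwi', A, grid - c)
         + rng.normal(0.0, sigma, (H, W, 2)) for _ in range(N)]

x0 = np.array([48.0, 32.0])
d_euler = euler_integration(flows, x0)
d_inv = inverse_integration(flows, x0)

# Ground truth for the noise-free model: d_{0,N}(x) = ((A+I)^N - I)(x - c)
R = A + np.eye(2)
d_true = (np.linalg.matrix_power(R, N) - np.eye(2)) @ (x0 - c)
print("Euler drift  :", np.linalg.norm(d_euler - d_true))
print("Inverse drift:", np.linalg.norm(d_inv - d_true))
```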
The behavior depicted by the simulations can be predicted by analyzing the stability of each integration method with the theory of dynamical systems. For simplicity, let us consider $v_{m,m+1}(x) = Ax$ for all $m: 0 \ldots N-1$. Then the true displacement fields are $d_{0,m+1}(x) = ((A+I)^{m+1} - I)x$, and for Euler's method $\xi_{0,m+1}(x_0)|_{Euler} = (A+I)\,\xi_{0,m}(x_0)|_{Euler} + r_m$, while for the inverse integration approach $\xi_{0,m+1}(x_0)|_{Inv} = (A+I)^m r_0 + \varepsilon_{1,m+1}(x_1)|_{Inv}$. Essentially, Euler's method error equation is stable if all the eigenvalues $\lambda_i$ of $A$ lie inside the unit circle centered at $-1$ in the complex plane (i.e. $|1 + \lambda_i| < 1$), and possibly unstable (the error may diverge) otherwise. Meanwhile, the inverse approach defines a linear model with a transition matrix equal to the identity, driven by the motion estimation errors $r_m$. Though it is not an asymptotically stable system around the zero-error equilibrium point (i.e. $\|\xi_{0,m+1}(x_0)|_{Inv}\| \to 0$ does not hold), it is always stable in the sense of Lyapunov (loosely, $\|\xi_{0,m+1}(x_0)|_{Inv}\| < \epsilon$ for some $\epsilon > 0$, $\forall m$). The error depends only on the accumulation of instantaneous motion estimation errors, but shows no unstable behavior. Concretely, a divergent field ($\mathrm{Re}(\lambda_i) > 0$), a rotational field ($|1 + \lambda_i| = 1$) or the zero field ($\lambda_i = 0 \Rightarrow |1 + \lambda_i| = 1$) are not well handled by the Euler method. For the inverse method, we must emphasize that our analysis does not imply a zero error or the absence of error accumulation, but a more robust dynamic behavior. Besides, it also appears that the inverse method implicitly performs a temporal filtering of the trajectory, as observed in the figures.
Finally, in the general case of an arbitrary motion model, and thanks to the Grobman-Hartman theorem (see C. Robinson, "Dynamical Systems: Stability, Symbolic Dynamics, and Chaos", Studies in Advanced Mathematics, CRC Press, 2nd edition, 1998), we can study the behavior of both methods through the linear approximations of (1) and (4) around an equilibrium point. This may lead to the problem of analyzing time-varying linear systems, whose stability properties are not trivial to determine. However, we believe one can still obtain useful and analogous conclusions about the behavior of the error function by applying the theory of time-invariant systems.
Within the universe of dense point correspondence estimation we have distinguished two different scenarios, tightly bound to each other and to the concrete application one needs to deal with. Let us set aside for a moment our concern about highly accurate displacement field estimation, and focus on the way we represent the information. Given a reference image, say I0, we might want to determine either: the position in each subsequent image In of every point of I0 (from-the-reference fields d0,n), or, for every point of each image In, its corresponding position in I0 (to-the-reference fields dn,0).
As illustrated on
Now returning to the motion integration methods discussed above, one would ask which is the best option, not only in terms of accuracy, but also in terms of ease of implementation with regard to the reference (from or to), computational load, memory requirements and, of course, concrete application-related issues.
Thus, the from-the-reference scheme presents the following characteristics for each integration method:
Similarly, the to-the-reference scheme presents the following characteristics for each integration method:
On the other side, a trajectory-based (from-the-reference) representation of point correspondences seems more natural for capturing the spatio-temporal features of a point along the sequence, as there is a direct (unambiguous) association between points and the paths they follow. Consequently, refinement tasks such as trajectory-based filtering are easier to formulate. Meanwhile, to-the-reference fields do not directly provide such spatio-temporal information, but can be estimated efficiently and more accurately. The question is then how to combine both representations, which essentially can be formulated as how to pass from one representation to the other in order to efficiently exploit their benefits.
Considering the reference frame I0, we call forward the from-the-reference displacement fields d0,n and backward the to-the-reference displacement fields dn,0. The set of forward vectors d0,n(x) that give the position of pixel x in the frames In describes its trajectory along the sequence. On the other hand, the backward fields dn,0 have been estimated independently and carry consensual, complementary or contradictory information. Forward and backward displacement fields can advantageously be combined, in particular to detect inconsistencies and occlusions (this is widely used in stereo vision, as disclosed for example by G. Egnal and R. Wildes in "Detecting binocular half-occlusions: empirical comparisons of five approaches", PAMI, 24(8):1127-1133, 2002). In addition, one can highlight the interest of combining both approaches in a refinement step, as each one can constrain the other. In this section, both forward and backward displacement fields are combined in order to be mutually improved while taking into account the trajectory aspect.
Occlusions are detected and taken into account in the filtering process. To this end, the forward 52 (respectively, backward 53) displacement field at the reference frame I0 (respectively, In) is used to detect occlusions at frame In (respectively, I0). The occlusion detection method (called OCC by Egnal) works as follows: to detect those pixels of frame I0 that are occluded in frame In, one considers the displacement map $\tilde{d}_{n,0}(x)$ and scans the image In, identifying for each pixel, via its displacement vector, the corresponding position in frame I0. The closest pixel to this (probably non-grid) position in frame I0 is then marked as visible. At the end of this projection step, the pixels that are not marked in frame I0 are classified as occluded in frame In.
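The OCC scheme just described may be sketched as follows; rounding to the nearest grid pixel stands in for "the closest pixel to this non-grid position".

```python
import numpy as np

def detect_occlusions(d_n0, shape_I0):
    """OCC-style detection sketched from the description above: scan I_n,
    project each pixel into I_0 via its backward vector d_{n,0}, mark the
    nearest grid pixel as visible; unmarked pixels of I_0 are classified
    as occluded in I_n."""
    H0, W0 = shape_I0
    visible = np.zeros((H0, W0), dtype=bool)
    Hn, Wn, _ = d_n0.shape
    for j in range(Hn):                       # scan the image I_n
        for i in range(Wn):
            x0 = int(round(i + d_n0[j, i, 0]))
            y0 = int(round(j + d_n0[j, i, 1]))
            if 0 <= x0 < W0 and 0 <= y0 < H0:
                visible[y0, x0] = True        # projection step: mark as visible
    return ~visible                           # True where I_0 pixels are occluded in I_n
```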
Moreover, an inconsistency value is evaluated between the forward and backward displacement fields at the non-occluded pixels. It provides a way to identify unreliable vectors. After the first process iteration, the filtering is limited to the vectors whose inconsistency value is above a threshold.
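The exact inconsistency measure is not reproduced in this excerpt; a common choice, assumed in the following sketch, is the norm of the forward/backward round trip.

```python
import numpy as np

def inconsistency(d_0n, d_n0, x):
    """Round-trip consistency measure (one common choice, assumed here):
    follow the forward vector from pixel x of I_0 to I_n, come back with
    the (interpolated) backward field, and measure how far from x we land.
    interp_field is the bilinear helper from the integration sketch."""
    fwd = d_0n[x[1], x[0]]
    back = interp_field(d_n0, np.asarray(x, dtype=float) + fwd)
    return float(np.linalg.norm(fwd + back))

# After the first pass, filtering is restarted only where the value exceeds
# a threshold tau (hypothetical name): refilter = inconsistency(d_0n, d_n0, x) > tau
```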
In the third step 55, for each frame pair {I0, In}, the forward and backward displacement fields d0,n and dn,0 are jointly processed via multilateral filtering. Moreover, the "trajectory" aspect of the forward fields is taken into account in two ways. First, in addition to the generally used weights, a trajectory similarity weight is introduced, replacing the classical displacement similarity often used when two vectors are compared. Second, the 2D filtering is extended to 2D+t along the trajectories.
Each updated vector 56 results from a weighted average of neighboring forward and backward vectors at the frame pair {I0, In}, and also of forward vectors d0,m (m ∈ [n−Δ, n+Δ]) at the frame pairs {I0, Im}. The updated forward displacement vector $\tilde{d}_{0,n}(x)$ is obtained as follows:
where {x} is a spatial window centered at x and $w_{0,m}^{xy}$ is a weight that links points x and y at frame I0. Similarly, {z} is a spatial window centered at z = x + d0,n(x) and $w_{n,0}^{zy}$ is a weight that links points z and y at frame In. The weight $w_{s,t}^{uv}$ assigned to each displacement vector $d_{s,t}(y)$ is defined as:
with:
$\Gamma_{uv}$ is the Euclidean distance between locations u and v:

$$\Gamma_{uv} = \|u - v\|_2 \qquad (8)$$
The color similarity $\Phi_{uv,s}$ between pixels u and v in Is is defined as follows:
The matching cost $\Theta_{v,st}$ is:

$$\Theta_{v,st} \equiv \Theta_{s,t}\big(v, d_{s,t}(v)\big) = \sum_{c \in \{r,g,b\}} \big|I_s^c(v) - I_t^c\big(v + d_{s,t}(v)\big)\big| \qquad (9)$$
$\rho_{st}$ is a binary value that takes the occlusion detection into account as follows:
The trajectory weight refers to the similarity measurement between the trajectories that support the two currently compared forward vectors. This trajectory similarity is defined as follows:
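The defining equations of the color similarity, the occlusion flag and the trajectory similarity are not reproduced in this excerpt; the following sketch therefore assumes plausible forms for them (a product of Gaussian kernels for the combination, an average point-wise distance for the trajectory similarity) and should be read as an illustration of the listed factors, not as the claimed formula.

```python
import numpy as np

def weight(u, v, I_s, I_t, d_st, traj_u, traj_v, occluded,
           sigma_g=5.0, sigma_c=10.0, sigma_m=10.0, sigma_t=2.0):
    """Sketch of the multilateral weight w_{s,t}^{uv} for displacement
    vector d_{s,t}(v). The factors follow the list above; the combination
    and the sigma_* bandwidths are assumptions of the sketch."""
    if occluded:                                   # rho_st = 0: vector discarded
        return 0.0
    u = np.asarray(u, float); v = np.asarray(v, float)
    gamma = np.linalg.norm(u - v)                  # eq. (8): ||u - v||_2
    ui, vi = u.astype(int), v.astype(int)
    # Colour similarity Phi_{uv,s}: absolute colour difference in I_s (assumed form).
    phi = np.abs(I_s[ui[1], ui[0]] - I_s[vi[1], vi[0]]).sum()
    # Matching cost Theta_{v,st}, eq. (9): SAD between I_s(v) and I_t(v + d_st(v)).
    dv = d_st[vi[1], vi[0]]
    zi = np.clip(np.round(v + dv).astype(int), [0, 0],
                 [I_t.shape[1] - 1, I_t.shape[0] - 1])
    theta = np.abs(I_s[vi[1], vi[0]] - I_t[zi[1], zi[0]]).sum()
    # Trajectory similarity: average point-wise distance between the two
    # supporting trajectories over the shared frames (assumed form).
    tau = np.linalg.norm(np.asarray(traj_u) - np.asarray(traj_v), axis=1).mean()
    return float(np.exp(-(gamma / sigma_g) ** 2 - (phi / sigma_c) ** 2
                        - (theta / sigma_m) ** 2 - (tau / sigma_t) ** 2))
```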
Similarly, the updated backward displacement vector $\tilde{d}_{n,0}(x)$ is obtained as follows:
where {x} and {z} are windows defined respectively in frame In around x and in frame I0 around z = x + dn,0(x).
Pixels y belonging to the spatial window {x} centered at x are determined. From this information, neighboring displacement vectors d0,m(y) are determined from the temporal and spatial windows.
In a second filtering step 65, a joint filtering of backward and forward displacement vectors is performed. In a first variant, the filtered updated forward displacement vectors $\tilde{d}_{0,n}(y)$ 63 and the backward displacement vectors dn,0(y) 64 are processed to produce a filtered forward displacement vector $\tilde{d}_{0,n}(x)$ 66. In a second variant, the same vectors are processed to produce a filtered backward displacement vector $\tilde{d}_{n,0}(x)$ 66. The filtered from-the-reference displacement vectors $\tilde{d}_{0,n}(y)$ 63 are considered for pixels y belonging to the spatial window {x} centered at x, while the to-the-reference displacement vectors dn,0(y) 64 are considered for pixels y belonging to the spatial window {z} centered at z = x + d0,n(x), that is, the endpoint location in image In resulting from the from-the-reference displacement vector d0,n(x) for pixel x of I0.
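For illustration, the first variant of this joint filtering step may be sketched as follows; the window shape, the reuse of the weight sketch above, and the reversal of backward vectors in the sum are assumptions, the update equation itself not being reproduced in this excerpt.

```python
import numpy as np

def filter_forward_vector(x, d_0n, d_n0, w_fwd, w_bwd, radius=3):
    """Sketch of the joint filtering step 65 (first variant): the updated
    forward vector at pixel x of I_0 is a normalised weighted sum over
    forward vectors around x in I_0 and backward vectors around
    z = x + d_{0,n}(x) in I_n. w_fwd(u, y) and w_bwd(u, y) stand for the
    weights w_{0,n}^{uy} and w_{n,0}^{uy} (see the weight sketch above);
    letting a backward vector vote through its reversal -d_{n,0}(y) is an
    assumption of the sketch."""
    H, W, _ = d_0n.shape
    z = np.round(np.asarray(x, float) + d_0n[x[1], x[0]]).astype(int)
    acc, wsum = np.zeros(2), 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yx, yy = x[0] + dx, x[1] + dy            # neighbour y of x in I_0
            if 0 <= yx < W and 0 <= yy < H:
                w = w_fwd((x[0], x[1]), (yx, yy))
                acc += w * d_0n[yy, yx]
                wsum += w
            yx, yy = z[0] + dx, z[1] + dy            # neighbour y of z in I_n
            if 0 <= yx < W and 0 <= yy < H:
                w = w_bwd((z[0], z[1]), (yx, yy))
                acc += w * (-d_n0[yy, yx])           # reversed backward vector
                wsum += w
    return acc / wsum if wsum > 0 else d_0n[x[1], x[0]]
```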
Once the filtering steps 62, 65 have been processed, advantageously in parallel for each pixel of the current image, the spatio-temporal filtered motion field 66 is memorized. The filtered motion field is then available for the filtering of the motion field of the next frame to be processed, or for a second pass of the algorithm as disclosed in
The skilled person will also appreciate that the method can be implemented quite easily, without the need for special equipment, by devices such as PCs. According to different variants, the features described for the method are implemented in software modules or in hardware modules.
The device 7 also comprises a display device 73, such as a display screen, directly connected to the graphical card 72, notably for displaying the rendering of images computed and composed in the graphical card, for example by a video editing tool implementing the filtering according to the invention. According to a variant, the display device 73 is outside the device 7.
It is noted that the word "register" used in the description of memories 72, 76 and 77 designates, in each of the memories mentioned, a memory zone of low capacity (some binary data) as well as a memory zone of large capacity (enabling a whole programme to be stored, or all or part of the data representative of computed data or of data to be displayed).
When powered up, the microprocessor 71 loads and runs the instructions of the algorithm comprised in RAM 77.
The memory RAM 77 comprises in particular:
Algorithms implementing the steps of the method of the invention are stored in the memory GRAM 721 of the graphical card 72 associated with the device 7 implementing these steps. When powered up, and once the data 771 representative of the video sequence have been loaded into the RAM 77, the GPUs 720 of the graphical card load these data into the GRAM 721 and execute the instructions of these algorithms in the form of micro-programs called "shaders", using for example the HLSL (High Level Shader Language) or GLSL (OpenGL Shading Language) languages.
The memory GRAM 721 comprises in particular:
According to a variant, the power supply is outside the device 7.
The invention as described in the preferred embodiments is advantageously computed using a Graphics processing unit (GPU) on a graphics processing board.
The invention is therefore also preferentially implemented as software code instructions stored on a computer-readable medium such as a memory (flash, SDRAM, etc.), said instructions being read by a graphics processing unit.
The foregoing description of the embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above teaching. It is therefore intended that the scope of the invention is not limited by this detailed description, but rather by the claims appended hereto.
Number | Date | Country | Kind |
---|---|---|---|
12305266.4 | Mar 2012 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2013/054165 | 3/1/2013 | WO | 00 |