The present invention relates generally to the field of dense point matching in a video sequence. More precisely, the invention relates to a method for generating a motion field from a current frame to a reference frame belonging to a video sequence from an input set of motion fields.
This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present invention that are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present invention. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
The invention concerns the estimation of dense point correspondences between two frames of a video sequence. This task is complex and many methods have been proposed. There is no perfect estimator able to match any pair of frames. State-of-the-art methods have various strengths and weaknesses with respect to accuracy and robustness, and their respective quality also depends on the video content (image content, type and magnitude of motion, etc.). In particular, the presence of large displacements is a limiting factor of the performance of the estimators, often making the motion estimation between distant frames difficult.
It is relevant to notice that there are numerous motion estimators with different intrinsic characteristics, leading to performances that vary comparatively according to image content. From this remark, a solution consists in applying different estimators to produce various motion fields between two input frames and then deriving a final motion field by merging all these input motion fields. For example, the method described in the paper "FusionFlow: Discrete-Continuous Optimization for Optical Flow Estimation" by V. Lempitsky, S. Roth and C. Rother in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2008, or in the paper "Fusion moves for Markov random field optimization" by the same authors in IEEE Transactions on Pattern Analysis and Machine Intelligence 2010, can be a solution to merge the motion fields pair by pair until a final motion field is obtained. A pixel-wise selection among this large set of dense motion fields is carried out based on an intrinsic vector quality (matching cost) and a spatial regularization. Theoretically, this technique allows one to combine all the benefits of the strategies mentioned above. Nevertheless, the matching can remain inaccurate for difficult cases such as illumination variations, large motion, occlusions, zoom, non-rigid deformations, low color contrast between different motion regions, transparency, or large uniform areas. These problems occur frequently when the estimation is applied to distant frames.
Numerous applications require motion estimation between distant frames. This is particularly the case when the application relies on a small set of key frames to which the other frames refer. This includes video compression and semi-automatic video processing, where an operator applies changes to key frames that must then be propagated to the other frames using motion compensation. For example, consider the task of modifying several images of a video sequence. It would be a tedious task to consistently modify all the frames manually. So it would be useful to automatically propagate these changes to the other frames taking into account the point correspondences between these frames and the key frame.
The invention applies to distant frames, called a current frame and a reference frame, in a sequence but can address motion estimation between any pair of frames and is particularly adapted to pairs for which classical motion estimators have a high error rate.
Concerning distant frames, motion estimation can be obtained through concatenation of elementary optical flow fields. These elementary optical flow fields can be computed between consecutive frames or, for example, skipping every other frame. However, this strategy is very sensitive to motion errors as one erroneous motion vector is enough to make the concatenated motion vector wrong. This becomes very critical in particular when the concatenation involves a high number of elementary vectors.
A solution, described in the international patent application PCT/EP13/050870, addresses motion estimation between a reference frame and each of the other frames in a video sequence. The reference frame is for example the first frame of the video sequence. The solution consists in sequential motion estimation between the reference frame and the current frame, this current frame being successively the frame adjacent to the reference frame, then the next one and so on. The method relies on various input elementary motion fields that are supposed to be available. These motion fields link pairs of frames in the sequence with good quality, as the inter-frame motion range is supposed to be compatible with the motion estimator performance. The current motion field estimation between the current frame and the reference frame relies on previously estimated motion fields (between the reference frame and frames preceding the current one) and elementary motion fields that link the current frame to the previously processed frames: various motion candidates are built by concatenating elementary motion fields and previously estimated motion fields. Then, these various candidate fields are merged to form the current output motion field. This method is a good sequential option but cannot avoid possible drifts in some pixels. Once an error is introduced in a motion field, it can be propagated to the next fields during the sequential processing.
An alternative consists in performing a direct matching between the considered distant frames. However, the motion range is generally very large and the estimation can be very sensitive to ambiguous correspondences, for instance within periodic image patterns. The method described in the international patent application PCT/EP13/050870 has been shown to perform much better than this alternative.
In order to avoid the above-mentioned problems, we propose a method that relies on a new statistical fusion phase of multiple independent motion candidates that are built via concatenation.
The invention is directed to a method for generating a motion field between a current frame and a reference frame belonging to a video sequence from an input set of elementary motion fields. A motion field associated with an ordered pair of frames (Ia and Ib) comprises, for a group of pixels (xa) belonging to a first frame (Ia) of the ordered pair of frames, a motion vector (da,b(xa)) computed from the pixel (xa) in the first frame to an endpoint in a second frame (Ib) of the ordered pair of frames. The method is remarkable in that it comprises steps for:
According to a further advantageous characteristic of motion path determination, the number N of ordered pairs of frames in determined motion paths is smaller than a threshold Nc. According to another further advantageous characteristic, the number N is variable; therefore two motion paths may or may not have the same number of concatenated motion vectors.
According to another further advantageous characteristic, the N ordered pairs of frames in determined motion paths are randomly selected so as to achieve independent motion paths.
According to another further advantageous characteristic the second frame of the previous ordered pair in the sequence is temporally placed before or after the first frame of the ordered pair.
According to another further advantageous characteristic, the first frame of an ordered pair is temporally placed before the current frame or after the reference frame, thus allowing concatenating motion paths from frames outside of the video sequence comprised between the current frame and the reference frame.
According to an advantageous characteristic of motion path selection, the selection comprises minimizing a metric for the selected motion vector among the plurality of candidate motion vectors.
In a first embodiment, the metric comprises the Euclidean distance between candidate endpoint locations.
In a second embodiment, the metric comprises the Euclidean distance between color gain vectors. Color gain vectors are defined in any color space known to those skilled in the art, such as the RGB color space or the LAB color space. A candidate endpoint location results from a candidate motion vector. Color gain vectors are computed between color vectors of a local neighborhood of the candidate endpoint location and color vectors of a local neighborhood of the current pixel belonging to the current frame.
According to a further advantageous characteristic of the first embodiment, the selection comprises a) for each determined candidate motion vector, computing each Euclidean distance between the candidate endpoint location resulting from the determined candidate motion vector and each of the other candidate endpoint locations resulting from the other candidate motion vectors; b) for each determined candidate motion vector, computing a median of the computed Euclidean distances; and c) selecting the motion vector for which the median of the computed Euclidean distances is the smallest.
According to another further advantageous characteristic of the first embodiment, between step a) and step b), a further step comprises, for each determined candidate motion vector, counting the Euclidean distance a number of times representative of a confidence score of the candidate endpoint location resulting from the determined candidate motion vector.
According to a further advantageous characteristic of the motion path selection, candidate motion vectors from the reference frame to the current frame are generated in the same way as the candidate motion vectors from the current frame (Ia) to the reference frame according to the disclosed method, and each candidate motion vector for a pixel of the reference frame is then used to define a new candidate motion vector between the current frame and the reference frame by identifying the endpoint of the vector in the current frame and by assigning the inverted candidate motion vector to the closest pixel in the current frame. Thus an inconsistency value is computed for a candidate motion vector for a current pixel in the current frame by comparing the distance between the endpoint location of the candidate motion vector and the endpoint locations of the inverted vectors of the current pixel when the candidate motion vector is not inverted, or by comparing the distance between the endpoint location of the candidate motion vector and the endpoint locations of the non-inverted vectors of the current pixel when the candidate motion vector is inverted, and by selecting the smallest distance as the inconsistency value. The inconsistency value is used to define the confidence score of the candidate endpoint location.
According to a further advantageous characteristic of the second embodiment, the selection comprises d) for each determined candidate motion vector, computing the Euclidean distance between color gain vectors of a local neighborhood of the candidate endpoint location and color gain vectors of a local neighborhood of the current pixel of the current frame, the candidate endpoint resulting from the determined candidate motion vector; e) for each determined candidate motion vector, computing a median of the computed Euclidean distances between color gain vectors; and f) selecting the motion vector for which the median is the smallest.
According to another further advantageous characteristic of the second embodiment, between step d) and step e), a further step comprises, for each determined candidate motion vector, counting the Euclidean distance between color gain vectors a number of times representative of a confidence score of the candidate endpoint location resulting from the determined candidate motion vector.
According to a first variant of motion path selection, selecting step c) or f) is repeated on subsets of determined candidate motion vectors, resulting in a subset of motion vectors for which the medians are the smallest. The selection is then followed by a global optimization process on the subset of motion vectors in order to select, for each current pixel of the current frame, the best vector with respect to the minimization of a global energy.
According to a second variant of motion path selection, selecting step c) or f) further comprises selecting the P motion vectors for which the median is the smallest, P being an integer. The selection is then followed by a global optimization process on the subset of P motion vectors in order to select, for each pixel of the current frame, the best vector with respect to the minimization of a global energy.
According to any of the variants of motion path selection, the global optimization process comprises the use of gain in the matching cost of the global energy, the use of the inconsistency value in the data cost of the global energy, and the use of gain in the regularization term of the global energy.
According to another further advantageous characteristic, the steps of the method are repeated for a plurality of current frames belonging to the video sequence or to the neighbourhood of the reference frame. Then, the global optimization process further comprises the use of temporal smoothing in the global energy.
According to another further advantageous characteristic, the generated motion field is used as part of the input set of motion fields for iteratively generating a motion field.
A device for generating a set of motion fields comprising a processor configured to:
A device for generating a set of motion fields comprising:
Any characteristic or variant described for the method is compatible with a device intended to process the disclosed methods.
A computer program product comprising program code instructions to execute the steps of the method according to any of claims 1 to 18 when this program is executed on a computer.
A processor readable medium having stored therein instructions for causing a processor to perform at least the steps of the method according to any of claims 1 to 18.
Preferred features of the present invention will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
a illustrates steps of the method according to a preferred embodiment for motion estimation between distant frames;
b illustrates steps of the method according to a refinement of the preferred embodiment for motion estimation between distant frames;
a illustrates the construction of motion vector candidates for a given pixel of a reference frame with respect to another reference frame wherein each motion candidate is obtained by concatenating elementary input vectors with various step values;
b illustrates the construction of motion vector candidates for a given pixel of a reference frame with respect to another reference frame wherein each motion candidate is obtained by concatenating forward and backward elementary input vectors with various step values;
c illustrates the construction of motion vector candidates for a given pixel of a reference frame with respect to another reference frame wherein each motion candidate is obtained by concatenating forward and backward elementary input vectors with various step values and wherein some motion fields may link frames located outside the interval delimited by the reference frames;
A salient idea of the method for generating a set of motion fields for a video sequence is to propose an advantageous sequential method of combining motion fields to produce a long-term matching through an exhaustive search of motion vector paths. A complementary idea of the method is to select a motion vector among a large number of candidate motion vectors, not only based on matching cost but also through the statistical distribution, in terms of spatial location or color gain, of the candidate motion vectors.
Thus the invention concerns two main subjects, namely motion estimation between frames Ia and Ib from the set S of motion candidates, and construction of the motion candidates (set S) for motion estimation between frames Ia and Ib. These two subjects are described below in two separate sub-sections.
a illustrates steps of the method according to a preferred embodiment for motion estimation between distant frames via combinatorial multi-step integration and statistical selection. In a preliminary step 101, multi-step elementary motion estimations are performed to generate the set of input motion fields. In a first step 102, the motion candidates between frames Ia and Ib are constructed using determined motion paths. In a second step 103, a motion field is estimated through a selection process among motion candidates.
Motion Estimation Between Two Frames from an Input Set of Motion Candidates
Let Ia and Ib be two frames of a given video sequence. The goal is to obtain very accurate forward (from pixels of Ia to positions in Ib) and backward (from pixels of Ib to positions in Ia) motion fields between these two frames. Let Sa,b and Sb,a be respectively, the large sets of forward and backward dense motion fields.
For each pixel xa (resp. xb) of frame Ia (resp. Ib), the forward (resp. backward) dense motion fields in Sa,b (resp. in Sb,a) give a large set of candidate positions in frame Ib (resp. Ia). This set of candidate positions is defined as Sa,b(xa) (resp. Sb,a(xb)) in the following. The proposed processing aims at selecting the best correspondences by exploiting the statistical nature of the available information and the intrinsic candidate quality. Moreover, spatial regularization is considered through a global optimization technique.
Backward (resp. forward) motion fields in Sb,a (resp. Sa,b) can be reversed into forward (resp. backward) motion fields. The resulting motion fields are included into set Sa,b (resp. Sb,a). For instance, backward motion fields from pixels of frame Ib are back-projected into frame Ia. For each one, we identify the nearest pixel of the arrival position in frame Ia. Finally, the corresponding displacement vector from Ib to Ia is reversed and assigned to this nearest pixel. This gives a new forward motion vector which is added into Sa,b(xa).
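For illustration, a minimal sketch of this back-projection and inversion step is given below. It assumes, as a convention not specified in the text, that a dense field is stored as an H×W×2 NumPy array of (dx, dy) displacements in pixels; the helper name reverse_motion_field is hypothetical.

```python
import numpy as np

def reverse_motion_field(d_ba, height_a, width_a):
    """Back-project a backward field d_ba (defined on frame Ib) into frame Ia and
    invert it. Pixels of Ia that receive no vector are left as NaN.
    Illustrative sketch only, not the exact patented procedure."""
    d_ab_new = np.full((height_a, width_a, 2), np.nan)
    hb, wb = d_ba.shape[:2]
    for yb in range(hb):
        for xb in range(wb):
            dx, dy = d_ba[yb, xb]
            # Arrival position of the backward vector in frame Ia.
            xa, ya = xb + dx, yb + dy
            # Nearest pixel of the arrival position in Ia.
            xn, yn = int(round(xa)), int(round(ya))
            if 0 <= xn < width_a and 0 <= yn < height_a:
                # The reversed displacement becomes a forward candidate for (xn, yn).
                d_ab_new[yn, xn] = (-dx, -dy)
    return d_ab_new
```

The resulting vectors are then added to Sa,b(xa) alongside the directly estimated forward candidates.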
In the following, the proposed statistical processing 1032 and optimization 1033 techniques are described separately. Then, we present the whole optimal candidate position selection framework and explain how both are combined.
Let Sa,b(xa)={xbn}n∈[[0, . . . , K−1]] be the set of candidate positions xbn (i.e. candidate correspondences) in frame Ib for pixel xa of frame Ia. K corresponds to the cardinality of Sa,b(xa). The goal is to find the optimal candidate position x* within Sa,b(xa), i.e. the best position of xa in frame Ib, by exploiting the statistical information extracted from the sample distribution of the candidate point positions and the quality values assigned to each candidate vector.
The underlying idea is to assume a Gaussian model for the distribution of the position samples, and to try to find its central value, which is then considered as the position estimation x*. Consequently, we suppose that the position candidates in Sa,b(xa) follow a Gaussian probability density with mean μ and variance σ². The probability density function of xbn is thus given by:
Supposing that all the candidate positions xbn are independent, the probability density function of Sa,b(xa) is written as follows:
The maximum likelihood estimator (MLE) of the mean μ and variance σ2 is obtained from maximizing equation (3).
We are interested in the central value, which in the case of a Gaussian distribution coincides with the mean value, the median value and the mode. Thus we seek to estimate μ, regardless of the value of σ². Furthermore, we impose that the estimator must be one of the elements of Sa,b(xa). The optimal candidate position equals
The assumption of Gaussianity can be largely perturbed by erroneous position samples, called outliers. Consequently, a robust estimation of the distribution central value is necessary. For this sake, the mean operator is replaced by the median operator. The estimate becomes:
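As a concrete illustration of this robust median-based estimate (equation (5)), the following sketch, assuming the candidate positions of one pixel are gathered in a K×2 array and using a hypothetical function name, returns the candidate whose median Euclidean distance to the other candidates is smallest:

```python
import numpy as np

def select_by_median_distance(candidates):
    """candidates: (K, 2) array of candidate endpoint positions for one pixel xa.
    Returns the index of the candidate minimizing the median distance to the
    other candidates, i.e. a robust central-value estimate of the distribution."""
    K = len(candidates)
    best_idx, best_median = 0, np.inf
    for n in range(K):
        dists = [np.linalg.norm(candidates[n] - candidates[j])
                 for j in range(K) if j != n]
        med = np.median(dists)
        if med < best_median:
            best_idx, best_median = n, med
    return best_idx
```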
Finally, each candidate position xbn receives a corresponding quality score Q(xbn) computed using an inconsistency value Inc(xbn), as described in the following. Inconsistency concerns a vector (e.g. da,bn) assigned to a pixel (e.g. xa). It is then noted either Inc(xa, da,bn) or Inc(xbn) referring to the endpoint of vector da,bn assigned to pixel xa (xbn=xa+da,bn). More precisely, the inconsistency value assigned to each candidate xbn corresponds to the inconsistency of the corresponding motion vector da,bn(xa), i.e. the motion vector which has been used to obtain xbn. Inconsistency values can be computed in different manners:
In a first variant, as described in equation (6), the inconsistency value Inc(xa, da,b) can be obtained similarly to the left/right checking (LRC) described in the case of stereo vision but applied to forward/backward displacement fields. Thus, we compute the Euclidean distance between the starting point xa in frame Ia and the end position of the backward displacement field db,a starting from (xa+da,b(xa)) in frame Ib.
Inc(xa,da,b)=∥da,b(xa)+db,a(xa+da,b(xa))∥2 (6)
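A sketch of this first variant of the inconsistency (equation (6)) is shown below; it assumes dense fields stored as H×W×2 arrays and samples the backward field at the nearest pixel of the forward endpoint, which is one possible convention:

```python
import numpy as np

def inconsistency(d_ab, d_ba, xa, ya):
    """First-variant inconsistency Inc(xa, d_ab): Euclidean norm of the forward
    vector at (xa, ya) plus the backward vector sampled at (the nearest pixel of)
    the forward endpoint in Ib. Illustrative sketch."""
    fx, fy = d_ab[ya, xa]
    hb, wb = d_ba.shape[:2]
    # Forward endpoint in frame Ib, rounded to the nearest pixel and clipped.
    xb = min(max(int(round(xa + fx)), 0), wb - 1)
    yb = min(max(int(round(ya + fy)), 0), hb - 1)
    bx, by = d_ba[yb, xb]
    # A perfectly consistent pair satisfies d_ab(xa) + d_ba(xa + d_ab(xa)) = 0.
    return float(np.hypot(fx + bx, fy + by))
```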
In a second variant, instead of considering the backward displacement field db,a starting from the nearest pixel (np) of xa+da,b(xa) in frame Ib, an alternative consists in taking into account all the backward displacement vectors in db,a for which the ending point in frame Ia has xa as nearest pixel. In practice, this backward motion field has been transformed into a forward motion field by inversion and added to the set of forward motion fields Sa,b(xa) as described previously. In other words, the second variant consists in computing the Euclidean distance from the current candidate position xbn to the nearest candidate position of the distribution which has been obtained through this procedure of back-projection and inversion.
Once inconsistency values have been computed, a quality score, here denoted as Q(xbn), is defined for each candidate position xbn. Q(xbn) is computed as follows: the maximum and minimum values of Inc(xbn) among all candidates are mapped, respectively, to 0 and a predefined integer value Qmax. Intermediate inconsistency values are then mapped linearly between these two values and the result is rounded to the nearest integer value. Then, Q(xbn) ∈ [0, . . . , Qmax]. In this manner, the higher Q(xbn) is, the smaller the inconsistency Inc(xbn). We aim at favoring high-quality candidate positions in the computation of the estimate x*. In practice, Q(xbn) is used as a voting mechanism: while computing the intervening medians in equation (5), each sample xbj is considered Q(xbj) times to set the occurrence of the elements ∥xbj−xbn∥2². A robust estimate towards the high-quality candidates is thus introduced, which enforces the forward-backward motion consistency.
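The mapping from inconsistencies to integer quality scores and their use as vote counts could, for instance, look as follows (Qmax is shown here with an assumed default of 10 and the helper names are hypothetical):

```python
import numpy as np

def quality_scores(inc, q_max=10):
    """Map inconsistencies Inc(xbn) to integer scores Q(xbn) in [0, q_max]: the
    largest inconsistency maps to 0, the smallest to q_max, intermediate values
    are mapped linearly and rounded."""
    inc = np.asarray(inc, dtype=float)
    lo, hi = inc.min(), inc.max()
    if hi == lo:
        return np.full(inc.shape, q_max, dtype=int)
    return np.rint(q_max * (hi - inc) / (hi - lo)).astype(int)

def weighted_median_distance(n, candidates, q):
    """Median of the distances from candidate n to the others, where the distance
    to candidate j is counted Q(j) times (candidates with Q(j) = 0 are ignored)."""
    dists = []
    for j in range(len(candidates)):
        if j == n:
            continue
        d = float(np.linalg.norm(candidates[n] - candidates[j]))
        dists.extend([d] * int(q[j]))
    return np.median(dists) if dists else np.inf
```

Replacing the plain median of the previous sketch by weighted_median_distance yields the vote-weighted selection described above.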
This statistical processing is applied to each pixel of Ia independently. In addition, it is necessary to include a spatial regularization in order to strive for motion spatial consistency in frame Ia.
The same minimization procedure can be applied on color gain in order to guide the selection to a candidate position which exhibits a gain similarity with a large number of candidate positions within the distribution. Color gain ga,b of pixel xa is a 3-component vector (ga,b=(ga,br,ga,bg,ga,bb)T for R, G, B components) that relates color of this pixel in frame Ia and color of the corresponding point moved at location (xa+da,b(xa)) in frame Ib as follows:
Iac(xa) = ga,bc(xa) · Ibc(xa + da,b(xa))   (7)
Index c refers to one of the 3 color components. The gain can be estimated for example via known correlation methods during motion estimation. A color gain vector can be obtained by applying such methods to each color channel CR, CG, CB, leading to a gain factor for each of these channels. The estimation of the gain of a given pixel involves a block of pixels (e.g. 3×3) centered on the pixel.
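As an illustration of one such estimator, a simple least-squares ratio over a 3×3 block, which is only one of the correlation-type methods alluded to above, the per-channel gain could be computed as:

```python
import numpy as np

def color_gain(block_a, block_b_compensated, eps=1e-6):
    """Per-channel gain g such that block_a ≈ g * block_b_compensated, in the
    sense of equation (7). Both blocks are (3, 3, 3) arrays (rows, columns, R/G/B)
    centered on the pixel and on its motion-compensated correspondent.
    Illustrative least-squares estimator."""
    num = (block_a * block_b_compensated).sum(axis=(0, 1))
    den = (block_b_compensated ** 2).sum(axis=(0, 1)) + eps
    return num / den  # 3-component gain vector (g_r, g_g, g_b)
```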
For the statistical processing, we use the symmetric formula that introduces the gain of point (xa+da,b(xa)) in frame Ib as follows:
Ibc(xa + da,b(xa)) = gb,ac(xa + da,b(xa)) · Iac(xa)   (8)
Replacing the position criterion in equation (5) by a gain criterion, the median operator becomes:
Furthermore, it is possible to consider both locations and gains of the motion candidates in the statistical processing using the following equation:
Scalar δ allows adjusting the weight of the gain-based component with respect to the position-based component.
We propose to combine statistical processing per pixel and a global candidate selection process to include simultaneously:
The statistical processing precedes the application of the global optimization process. Two variants have been considered to form the framework combining statistical processing per pixel and global optimization; they will be described in more detail in
Thus, according to a first variant of candidate position selection, the set Sa,b(xa) of candidate positions xbn is divided randomly into different equally sized subsets. The statistical processing is applied for each subset in order to select the best candidate position per subset. Then, our global optimization approach merges the obtained candidates in order to finally select the optimal one x*.
According to a second variant of candidate position selection, the statistical processing is applied to the whole set Sa,b(xa). Then, the P best candidate positions of the distribution are selected from median minimization, as described in (5). Then, our global optimization approach fuses these P candidate positions in order to finally select the optimal one x*.
We describe now the energy we have defined for global optimization. We consider set Ra,b(xa) of candidate positions coming from the previous selection process.
It consists in performing a global optimization stage that fuses candidate positions in Ra,b(xa) into a single optimal one. We consider Ra,b(xa)={xbn}n∈[[0, . . . , K−1]] as the set of K candidate positions xbn in frame Ib for pixel xa of frame Ia. We introduce L={lx
The data term for each pixel is denoted as
a gain-compensated color matching cost between grid position xa in frame Ia and position
in frame Ib as described in equation (11)
Moreover, inconsistency is introduced in the data cost to make it more robust. It is computed via one of the variants mentioned above. Scalar γd allows adjusting the weight of the inconsistency with respect to the matching cost.
Furthermore, smoothness is imposed by considering that two neighboring pixels should take similar motion values, as one expects for the majority of the points inside a moving scene element (objects, backgrounds, textures). A first possibility would be to favor the situation where both pixels take the same candidate label. This can be done, for instance, by considering a classical discrete interaction such as the Potts model. However, equal labels do not imply that the motion vectors are necessarily similar since, for each pixel, the candidates were generated independently. A better solution is to directly favor the similarity of the motion vectors by introducing the following function to be minimized
where the spatial regularization term involves both motion and gain comparisons with neighboring positions according to the 8-nearest-neighbor neighborhood. αx
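To make the structure of such an energy concrete, the sketch below evaluates, for a given per-pixel labeling, a data cost of the form described above (matching cost plus γd times inconsistency) and an 8-neighbor regularization comparing candidate motion vectors and gains. It only illustrates the general form; the exact terms and weights of the energy, in particular the contrast-dependent weight α, are not reproduced here, and the array layout and parameter names are assumptions.

```python
import numpy as np

EIGHT_NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                   (0, 1), (1, -1), (1, 0), (1, 1)]

def labeling_energy(labels, cand_vec, cand_gain, match_cost, inc,
                    gamma_d=1.0, lambda_r=1.0, alpha=None):
    """labels: (H, W) chosen candidate index per pixel; cand_vec: (H, W, K, 2);
    cand_gain: (H, W, K, 3); match_cost, inc: (H, W, K); alpha: (H, W) weights."""
    H, W = labels.shape
    if alpha is None:
        alpha = np.ones((H, W))
    energy = 0.0
    for y in range(H):
        for x in range(W):
            l = labels[y, x]
            # Data cost: gain-compensated matching cost plus weighted inconsistency.
            energy += match_cost[y, x, l] + gamma_d * inc[y, x, l]
            for dy, dx in EIGHT_NEIGHBORS:
                yn, xn = y + dy, x + dx
                if 0 <= yn < H and 0 <= xn < W:
                    ln = labels[yn, xn]
                    motion_diff = np.linalg.norm(cand_vec[y, x, l] - cand_vec[yn, xn, ln])
                    gain_diff = np.linalg.norm(cand_gain[y, x, l] - cand_gain[yn, xn, ln])
                    # 0.5 because each neighboring pair is visited twice.
                    energy += 0.5 * lambda_r * alpha[y, x] * (motion_diff + gain_diff)
    return energy
```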
The whole framework is applied from Ia to Ib and then from Ib to Ia. Finally, we obtain very accurate forward and backward dense motion fields between these two frames.
b illustrates a refinement of the motion estimation generation 103. As in the previous embodiment, the statistical processing step 1032 is able to select the best candidate positions within a large distribution of candidate positions using criteria based on spatial density and intrinsic candidate quality. As in the previous embodiment, a global optimization step 1033 fuses candidate motion fields by pairs following the approach of Lempitsky et al. in the article entitled "FusionFlow: Discrete-continuous optimization for optical flow estimation" published at CVPR 2008. In this refinement, let Iref and In be respectively the reference frame and the current frame of a given video sequence.
Regarding another variant of candidate position selection in step 1032, for each xref ∈ Iref we select, among the large distribution Tref,n(xref), Ksp=2×K candidate positions through statistical processing. Then, in a step 1033, we randomly group these Ksp candidates by pairs in order to choose the K best candidates
For the first pairs, or in the case of temporary occlusion, the statistical selection is not adapted due to the small number of candidates. Therefore, between 1 and K candidate positions, we do not perform any selection and all the candidates are kept. Between K+1 and Ksp candidates, we use only the global optimization method to obtain the K best candidate fields. If the number of candidates exceeds Ksp, the statistical processing and the global optimization method are applied as explained above.
Another variant of candidate position selection in step 1032 focuses further on inconsistency reduction. The idea is to strongly encourage the selection of from-the-reference motion vectors (i.e. between Iref and In) which are consistent with to-the-reference motion vectors (i.e. between In and Iref). Thus, the inconsistency assigned to a candidate motion vector dref,ni(xref) with i ∈ [[0, . . . , Kx
However, inconsistencies may still remain and we propose to enforce consistency with stronger constraints. The proposed constraints are as follows. First, only input multi-step elementary optical flow vectors which are considered as consistent according to their inconsistency masks can be used to generate motion paths between Iref and In. Second, we introduce an outlier removal step 1031 before the statistical selection. This step consists in ordering all the candidates of the distribution with respect to their inconsistency values. Then, a percentage of R% of bad candidates is removed and the selection is performed on the remaining candidates. Third, at the end of the combinatorial integration and the selection procedure between Iref and In, the optimal displacement field d*ref,n is incorporated into the processing between In and Iref, which aims at enforcing the motion consistency between from-the-reference and to-the-reference displacement fields.
The proposed initial motion candidate generation is applied for both directions: from Iref to In in order to obtain K initial from-the-reference candidate displacement fields as described above and then from In to Iref, where an exactly similar processing leads to K initial to-the-reference candidate displacement fields. All the pairs {Iref,In} are processed in this way. Only Nc, the maximum number of concatenations, changes with respect to the temporal distance between the considered frames. In practice, we determine Nc with equation (14). This function, built empirically, is a good compromise between a too large number of concatenations, which leads to large propagation errors, and the opposite situation, which limits the effectiveness of the statistical processing due to an insignificant total number of candidate positions.
The guided-random selection, which selects for each pair of frames {Iref,In} one part of all the possible motion paths, limits the correlation between candidates respectively estimated for neighbouring frames. This avoids the situation in which a single estimation error is propagated and therefore badly influences the whole trajectory. The example given on
Once the initial motion candidates have been generated, we aim at iteratively refining the estimated displacement fields. The idea is to question the matching between each pixel xref (resp. xn) of Iref (resp. In) and the candidate position x*n (resp. x*ref) in In (resp. Iref) established during the previous iteration or during the initial motion candidates generation phase if the current iteration is the first one.
We propose to compare the previous estimate x*n (resp. x*ref) with respect to one part of all the following other candidate positions described in
Moreover, we take into account a candidate position coming from the previous estimation of d*n,ref (resp. d*ref,n) which is inverted to obtain xnr (resp. xrefr), as illustrated in
Regarding the global optimization step 1034, we introduce temporal smoothing by considering previously estimated motion fields for neighbouring frames to construct new input candidates. Let w be the temporal window. Between Iref and In for instance, we use the elementary optical flow fields vm,n between Im and In with
and m≠n to obtain from x*m ∈ Im the new candidate xnm in In. Conversely, to join Iref from In, the elementary optical flow fields vn,m are concatenated to the optimal displacement fields d*m,ref computed during the previous iteration.
Instead of considering the candidates coming from all the frames of the spatial window, we can:
New candidates can be obtained through:
We perform a global optimization method in order to fuse the previously described set of candidates into a single optimal displacement field, as done in Lempitsky et al., in the paper entitled “Fusion moves for Markov random field optimization”. For this task, a new energy has been built and two formulations are proposed depending on the type (from-the-reference or to-the-reference) of the displacement fields to be refined.
In the from-the-reference case, we introduce L={Ix
one of the candidates listed above. Let
be the corresponding motion vectors. We define the following energy in equation (15) and we use the fusion moves algorithm described by Lempitsky et al. in the two publications mentioned earlier to minimize it:
The data term Eref,nd, described with more details in equation (16), involves the matching cost
and the inconsistency value
with respect to
as described earlier. In addition, we propose to introduce strong temporal smoothness constraints into the energy formulation in order to efficiently guide the motion refinement.
The temporal smoothness constraints translate in three new terms which are computed with respect to each neighbouring candidate x*m defined for the frames inside the temporal window w. These terms are illustrated in
and x*m of Im,
and the ending point of the elementary optical flow vector vm,n starting from x*m (see equation (17)). edm,n encourages the selection of xnm, the candidate coming from the neighbouring frame Im via the elementary optical flow field vm,n and therefore tends to strengthen the temporal smoothness. Indeed, for xnm, the euclidean distance edm,n is equal to 0.
(see equation (18)). If vm,n is consistent, i.e. vm,n≈vn,m, edn,m is approximately equal to 0 which promotes again the selection of xnm, the candidate coming from Im.
The regularization term Eref,nr involves motion similarities with neighbouring positions, as shown in equation (15). αx
Compared to the from-the-reference case, the energy for the refinement of to-the-reference displacement fields is similar except for the data term, equation (19), which involves neither the matching cost between the current candidate and the temporally neighbouring ones nor the Euclidean distance edm,n. This is because trajectories cannot be explicitly handled in this direction. Nevertheless, we compute the Euclidean distance between the ending points of d*n,ref starting from xn ∈ In and d*m,ref concatenated to vn,m.
The global optimization method fuses the displacement fields by pairs and therefore chooses whether or not to update the previous estimations with one of the previously described candidates. The motion refinement phase consists in applying this technique to each pair of frames {Iref,In} in the from-the-reference and to-the-reference directions. The pairs {Iref,In} are processed in a random order so as to encourage temporal smoothness without introducing a sequential correlation between the resulting displacement fields.
This motion refinement phase is repeated iteratively Nit times where one iteration corresponds to the processing of all the pairs {Iref,In}. The proposed statistical multi-step flow is done once the initial motion candidates generation and the Nit iterations of motion refinement have been run through the sequence.
We consider now the situation where input frames Ia and Ib are distant in the sequence (they are not adjacent). In the following, we will call these two frames “reference frames” (also corresponding to a pair of a current frame and a reference frame) to distinguish them from the other frames of the sequence. Depending on the displacement of the objects across the sequence, it often happens that direct estimation between such frames is difficult. An alternative consists in building motion vector candidates by concatenating or summing elementary motion fields that correspond to pairs of frames with smaller inter-frame distance (or step) and performing a statistical analysis.
A first solution to form a candidate consists in simply summing motion vectors of successive pairs of adjacent frames. If we call “step” the distance between two frames, step value is 1 for adjacent frames. We propose to extend this construction of motion candidates to the sum of motion vectors of pairs of frames that are not necessarily adjacent but remain reasonably distant so that this elementary motion field can be expected to be of good quality. This relies on the idea described in the international patent application PCT/EP13/050870 where motion estimation between a reference frame and the other frames of the sequence is carried out sequentially starting from the first frame adjacent to the reference frame. For each pair, multiple candidate motion fields are merged to form the output motion field. Each candidate motion field is built by summing an elementary input motion field and a previously estimated output motion field.
Here, we consider a pair of reference images and different candidates that join the two images. There is no sequential processing. The candidate motion fields are built by summing elementary motion fields with variable steps. Therefore, the number of candidate motion fields is variable. The elementary motion fields join pairs of frames in the interval delimited by the reference frames.
Another version of motion concatenation consists in considering both forward and backward motion fields in the sum. This may have advantages in particular in case of occlusions. In the case that occlusion maps attached to the motion fields are available indicating whether a pixel is occluded or not in another frame, this information is used to possibly stop the construction of a path.
For the same reasons, we can extend the motion candidate construction using elementary motion fields that join frames that are outside the interval delimited by the reference frames.
We suppose that the elementary motion fields have been computed by at least one motion estimator applied to pairs of frames with various steps; for example, steps equal to 1, 2 or 3 as illustrated on
A first solution consists in considering all possible elementary motion fields of step values belonging to a selected set (for example steps equal to 1, 2 or 3) and linking frames of a predefined set of frames (for example all the frames located between the two reference frames plus these reference frames, but as seen above it could also include frames located outside this interval).
Formally, a motion path is obtained through concatenations or sums of elementary optical flow fields across the video sequence. It links each pixel xa of frame Ia to a corresponding position in frame Ib. Elementary optical flow fields can be computed between consecutive frames or with different frame steps, i.e. with larger inter-frame distances. Let Sn={s1,s2, . . . , sQ
Our objective is to obtain a large set of motion paths and consequently a large set of candidate motion maps between Ia and Ib. Given this objective, we propose to initially generate all the possible step sequences (i.e. combinations of steps) in order to join Ib from Ia. Let Γa,b={γ0, . . . , γK−1} be the set of K possible step sequences between Ia and Ib. Γa,b is computed by building a tree structure where each node corresponds to a motion field assigned to a given frame for a given step value (node value). In practice, the construction of the tree is done recursively: we create for each node as many children as the number of steps available at the current instant. A child node is not generated when Ib has already been reached (in which case the current node is considered as a leaf node) or if Ib would be overpassed given the considered step. Finally, once the tree has been completely created, going from the leaf nodes to the root node gives Γa,b, the set of step sequences.
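A compact recursive enumeration of the set Γa,b, restricted for simplicity to forward steps inside the interval between the two reference frames (the backward steps and outside-interval fields described earlier are omitted), could be written as:

```python
def all_step_sequences(distance, steps):
    """Enumerate all step sequences (combinations of steps) whose sum equals
    `distance`, i.e. the set Gamma_{a,b} when Ia and Ib are `distance` frames
    apart. A branch stops when Ib is reached and is cut when it would be
    overpassed."""
    if distance == 0:
        return [[]]
    sequences = []
    for s in steps:
        if s <= distance:  # do not overpass Ib
            for tail in all_step_sequences(distance - s, steps):
                sequences.append([s] + tail)
    return sequences

# Frames 3 apart with elementary steps 1, 2 and 3:
print(all_step_sequences(3, [1, 2, 3]))  # [[1, 1, 1], [1, 2], [2, 1], [3]]
```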
Given a step sequence γi, the pixel xa of Ia is propagated step by step. Denoting by fj the cumulative frame offset reached after the first j steps of γi, the intermediate position is updated as:

x^i_{a+f_{j+1}} = x^i_{a+f_j} + v_{a+f_j, a+f_{j+1}}(x^i_{a+f_j})
Once all the steps sji ∈ γi have been run through, we obtain xbi, i.e. the corresponding position in Ib of xa ∈ Ia obtained with step sequence γi. Finally, at the end of the process, we have a large set of motion maps between Ia and Ib and consequently a large set of candidate positions in Ib for each pixel xa of Ia.
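Following one step sequence for a given pixel then reduces to chaining the corresponding elementary vectors, as in the sketch below (the elementary fields are assumed to be indexed by frame pairs in a dictionary and sampled at the nearest pixel, both of which are conventions chosen only for illustration):

```python
def follow_path(x_a, y_a, frame_a, step_seq, flow):
    """Follow one step sequence gamma_i from pixel (x_a, y_a) of frame `frame_a`.
    flow[(m, n)] is the elementary field v_{m,n} as an (H, W, 2) array of (dx, dy).
    Returns the candidate endpoint x_b^i as a sub-pixel position."""
    x, y, m = float(x_a), float(y_a), frame_a
    for s in step_seq:
        n = m + s
        v = flow[(m, n)]
        h, w = v.shape[:2]
        # Sample the field at the nearest pixel of the current position.
        xi = min(max(int(round(x)), 0), w - 1)
        yi = min(max(int(round(y)), 0), h - 1)
        dx, dy = v[yi, xi]
        x, y = x + dx, y + dy   # x_{a+f_{j+1}} = x_{a+f_j} + v_{a+f_j, a+f_{j+1}}(...)
        m = n
    return x, y
```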
In the case that occlusion maps attached to the motion fields are available indicating whether a pixel is occluded or not in another frame, this information is used to possibly stop the construction of a path. Considering an intermediate point xa+f
Another solution for the construction of multiple paths corresponds to a wider problem addressing the case of more distant reference frames and more steps than in the previous case. The problem will clearly appear with an example. Let us consider a distance of 30 between the reference frames and the following set of steps: 1, 2, 5 and 10. In this case, the number of possible paths using concatenation of elementary motion fields between the two reference frames is 5877241. Of course, all these paths cannot be considered and a different procedure must be introduced to select a reasonable number of paths.
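The path count quoted above can be verified with a short dynamic-programming recursion over the frame distance (the number of ordered step sequences summing to 30 with steps taken from {1, 2, 5, 10}):

```python
def count_paths(distance, steps):
    """Number of step sequences whose steps, taken from `steps`, sum to `distance`."""
    counts = [1] + [0] * distance          # counts[0]: the empty sequence
    for d in range(1, distance + 1):
        counts[d] = sum(counts[d - s] for s in steps if s <= d)
    return counts[distance]

print(count_paths(30, [1, 2, 5, 10]))  # -> 5877241
```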
According to an advantageous characteristic of motion path construction, a first constraint consists in limiting the number of elementary vectors composing the path. Actually, the concatenation of numerous vectors may lead to an important drift and more generally increases the noise level on the resulting vector. So, limiting the number of candidate vectors is reasonable.
According to another advantageous characteristic of motion path construction, a second constraint is imposed by the fact that the candidate vectors should be independent according to our assumption on the statistical processing. In fact, the frequency of appearance of a given step at a given frame should be uniform among all the possible steps arising from this frame in order to avoid a systematic bias towards the more populated branches of the tree. Practically, a problem would occur in particular if an erroneous elementary vector contributes several times to the construction of candidate vectors while the other correct vectors occur just once. In this case, the number of erroneous candidate vectors would be significant and would introduce a bias in the statistical processing.
So, the method consists firstly in considering a maximum number of concatenations Nc for the motion paths. Secondly, once this constraint has been taken into account, we randomly select Ns motion paths (Ns being determined by the storage capability). The random selection is guided by the second constraint above. Indeed, this second constraint ensures a certain independence of the resulting candidate positions in Ib. In practice, for a given frame, each available step must lead to the same (or almost the same) number of step sequences. Each time we select a step sequence γi, we increment the occurrence of each step sji ∈ γi. Thus, the step sequence selection is done as follows. We run through the tree from the root node. For a given frame, we choose the step of minimal occurrence, i.e. the step which has been used less than the other steps defined for the current frame. If several steps return this minimum occurrence value, a random selection is performed among them. This selection of steps is repeated until a leaf node is reached.
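A possible sketch of this guided random selection, keeping per-frame occurrence counters and breaking ties at random, is shown below (Ns and Nc appear as the parameters n_s and n_c; duplicate sequences are not filtered, which a practical implementation might want to do):

```python
import random
from collections import defaultdict

def guided_random_paths(distance, steps, n_s, n_c):
    """Select up to n_s step sequences of at most n_c concatenations: at each
    frame offset, pick the step with the lowest occurrence so far (random choice
    among ties), which balances the use of the available steps."""
    occurrence = defaultdict(int)           # (frame_offset, step) -> times used
    selected = []
    for _ in range(n_s):
        seq, offset = [], 0
        while offset < distance and len(seq) < n_c:
            allowed = [s for s in steps if offset + s <= distance]
            if not allowed:
                break
            least = min(occurrence[(offset, s)] for s in allowed)
            step = random.choice([s for s in allowed
                                  if occurrence[(offset, s)] == least])
            occurrence[(offset, step)] += 1
            seq.append(step)
            offset += step
        if offset == distance:              # keep only paths that actually reach Ib
            selected.append(seq)
    return selected
```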
The skilled person will also appreciate that the method can be implemented quite easily, without the need for special equipment, by devices such as PCs or mobile phones, whether or not they include a graphics processing unit. According to different variants, the features described for the method are implemented in software modules or in hardware modules.
Each feature disclosed in the description and (where appropriate) the claims and drawings may be provided independently or in any appropriate combination. Features described as being implemented in software may also be implemented in hardware, and vice versa. Reference numerals appearing in the claims are by way of illustration only and shall have no limiting effect on the scope of the claims.
Naturally, the invention is not limited to the embodiments previously described. In particular, while the described method is dedicated to dense motion estimation between two frames, the invention is compatible with any method for generating a motion field for sparse motion estimation. Thus, if the statistical processing output is one motion vector per pixel and if global optimization is not considered, the system can also be applied to sparse motion estimation, i.e. statistical processing is applied to motion candidates assigned to any particular point in the current image.
Number | Date | Country | Kind
---|---|---|---
13305139.1 | Feb 2013 | EP | regional
13306076.4 | Jul 2013 | EP | regional

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/EP2014/052164 | 2/4/2014 | WO | 00