The invention relates to a method and to an apparatus for hierarchical motion estimation in which motion vectors are refined in successive levels of increasing search window pixel density and/or decreasing search window size.
Estimation of motion between frames of image sequences is used for applications such as targeted content and in digital video encoding. Known motion estimation methods are based on different motion models and technical approaches such as gradient methods, block matching, phase correlation, ‘optical flow’ methods (often gradient-based) and feature point extraction and tracking. They all have advantages and drawbacks. Orthogonal to and in combination with one of these approaches, hierarchical motion estimation allows a large vector search range and is typically combined with block matching, cf. [1], [2].
In motion estimation generally a cost function is computed by evaluating the image signal of two image frames inside a measurement window.
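For block matching, for instance, a standard formulation of such a cost function (not quoted from the patent itself) is the SAD of a candidate displacement (dx, dy) over the measurement window W:

$\text{SAD}(dx,dy) = \sum_{(x,y)\in W} \left| I_t(x+dx,\, y+dy) - I_{t-1}(x,y) \right|$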
Motion estimation faces a number of different situations in image sequences. A challenging one is when motion is estimated for an image location where there are different objects moving at different speed and/or direction. In this case the measurement window covers these different objects so that the motion estimator is distracted by objects other than the intended one.
In [3] a method of estimating correspondences between stereo pairs of images is presented. In determining the cost function, the method aims to weight pixels “in proportion of the probability that the pixels have the same disparity”. In order to approach this objective, pixels with similar color and located nearby are preferred by means of two kinds of weights (involving empirically determined factors): one related to color difference, the other related to spatial distance. Unfortunately that method has inherent problems with periodic structures because pixels of the same or similar color located a certain distance apart may mislead the motion estimator. Moreover, its concept does not attempt to account for differently moving objects.
A problem to be solved by the invention is to provide reliable motion estimation for image sequences even in situations or locations where the measurement or search window of the motion estimator covers different objects with different motion.
Hierarchical motion estimation is used with several levels of hierarchy. In each level the image is prefiltered, e.g. by means of a 2D mean value filter of an appropriate window size, the filtering strength being reduced from level to level, e.g. by reducing the window size. In each level a block matcher can be used for determining a motion vector for a marker position, for a subset of pixels, or for every pixel of the whole frame in a certain pixel grid. Within the measurement window, the image signal of the related sections of the two compared frames is subsampled as far as the strength of the prefilter allows. A motion vector (update) is computed, e.g. by log(D)-step search or full search, which optimizes a cost function, e.g. by minimizing the SAD (sum of absolute differences) or the SQD (sum of squared differences). Motion estimation is carried out with integer-pel resolution first, followed by sub-pel refinement, which also reduces computational complexity. The processing described provides a motion vector typically applying to the center pixel of the measurement window.
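As an illustration, the following is a minimal Python sketch of the block matching performed in one level of the hierarchy, using SAD and a log(D)-step search. Function names, the window size and the step parameters are assumptions; prefiltering, subsampling, sub-pel refinement and boundary handling are omitted:

```python
import numpy as np

def sad(ref_win, cand_win):
    """Sum of absolute differences of two equally sized windows."""
    return np.abs(ref_win.astype(np.int32) - cand_win.astype(np.int32)).sum()

def log_d_step_search(ref, cur, x, y, win=8, first_step=8):
    """log(D)-step search for the motion vector of the pixel at (x, y):
    test 9 candidates around the current best position, then halve the
    step size (8, 4, 2, 1) until integer-pel resolution is reached."""
    ref_win = ref[y - win:y + win + 1, x - win:x + win + 1]
    dx = dy = 0
    step = first_step
    while step >= 1:
        best = None
        for ddy in (-step, 0, step):
            for ddx in (-step, 0, step):
                cx, cy = x + dx + ddx, y + dy + ddy
                cand = cur[cy - win:cy + win + 1, cx - win:cx + win + 1]
                cost = sad(ref_win, cand)
                if best is None or cost < best[0]:
                    best = (cost, ddx, ddy)
        dx, dy = dx + best[1], dy + best[2]
        step //= 2
    return dx, dy
```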
By evaluating the displaced frame differences in the measurement window—after finding an optimum vector at a certain pixel—a segmentation of the measurement window into different moving object regions is carried out. A corresponding segmentation mask is stored and used as an initial mask in the next level of the hierarchy, and a new mask is determined at the end of this level. Several embodiments further enhance the performance of the basic concept.
Advantageously, the described processing allows estimating motion and tracking of image content or points of interest with improved reliability and accuracy in situations or image locations where different objects are moving at different speed and/or direction.
In principle, the inventive method is adapted for hierarchical motion estimation in which motion vectors are refined in successive levels of increasing search window pixel density and/or decreasing search window size, including the steps:
In principle, in the inventive hierarchical motion estimator motion vectors are refined in successive levels of increasing search window pixel density and/or decreasing search window size, said hierarchical motion estimator including means adapted to:
Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in:
Even if not explicitly described, the following embodiments may be employed in any combination or sub-combination.
If an image sequence contains two or more objects moving in different directions and/or at different speed, the measurement or search window of a (hierarchical) motion estimator—when lying at their edge—will contain image information of all these objects.
With every hierarchy level the hierarchical motion estimator provides true motion vectors closer towards object boundaries (e.g. of a truck on a road), due to the decreasing grid size (i.e. distance of pixels for which a vector is estimated) and/or decreasing size of the measurement window, but not at the boundaries themselves. In the motion compensated image, high ‘displaced frame differences’ (DFD) remain around moving objects, in structured areas of medium-size moving objects, in or around uncovered background regions, and throughout small moving objects, or at least at their front and rear if they are less textured (e.g. a car moving behind a bush).
During the motion estimation process, along the levels of the hierarchy or in the search steps of one level, the measurement window may contain well-matched pixels with a low absolute difference (AD) and other pixels with a high AD, all of which add to the sum of absolute differences (SAD) or the sum of squared differences for a certain test vector. If a vector is to be estimated for a specific pixel location, especially for point-of-interest tracking, ideally only that part of the measurement window should be evaluated which belongs to the same object as that pixel.
In some situations the motion estimator may therefore be misled, e.g. where a foreground object passes by near a point of interest. In such case much of the measurement window is occupied by the misleading object. An improvement can be achieved by taking the vector estimated for the same point of interest in the preceding frame and using it as a candidate vector in the search in the present frame.
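A small sketch of this candidate test (names assumed; the zero vector and the previous frame's vector are compared and the cheaper one seeds the present-frame search):

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences of two equally sized windows."""
    return np.abs(a.astype(np.int32) - b.astype(np.int32)).sum()

def pick_start_vector(ref, cur, x, y, prev_vec, win=8):
    """Evaluate the zero vector and the vector estimated for the same point
    of interest in the preceding frame; start the present-frame search from
    the cheaper candidate (boundary handling omitted)."""
    ref_win = ref[y - win:y + win + 1, x - win:x + win + 1]
    candidates = [(0, 0), tuple(prev_vec)]
    costs = [sad(ref_win,
                 cur[y + dy - win:y + dy + win + 1,
                     x + dx - win:x + dx + win + 1])
             for dx, dy in candidates]
    return candidates[int(np.argmin(costs))]
```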
During the exposure time (which typically is longer under low-light conditions for instance) the sensor elements of a camera integrate the incoming light so that an object moving at high speed relative to the sensor may be blurred. The boundary of a foreground object moving relative to a background object may be smeared and wide, rather than being sharp and narrow. Such situation, however, might be similar in successive frames—a smeared boundary is estimated with respect to a smeared boundary—so that the location of the object boundary might not need to be known with the highest precision.
In cases where the foreground object is e.g. a fence through which a background object is seen (between its pickets), an ideal motion estimator would distinguish the two objects inside the measurement window. That is, the decision whether a pixel belongs to the one object or the other is to be taken based just on that sole pixel rather than on its spatial neighborhood which may belong to another object.
A. Distinguishing Different Object Areas in the Measurement Window
A.1 Basic Processing at the End of the First Level of Hierarchy
A first approach of distinguishing different object areas in the measurement window is this: for a vector found at a certain pixel location by the end of the first (i.e. coarsest) level of the hierarchy, the cost function of all pixels in the measurement window is analyzed. If the DFD or absolute DFD is low for the center pixel then all other picture elements in the measurement window that have a low DFD or absolute DFD are considered as belonging to the same object as that center pixel. Otherwise, if the DFD or absolute DFD is high for the center pixel then all other picture elements in the measurement window that have a high DFD or absolute DFD are considered as belonging to the same object as that center pixel.
DFDs or absolute DFDs can be related to or translated into probabilities of belonging to the same or another object, resulting in a more continuous decision than a binary one. Such probability, related to an object, reflects also the part of the exposure time for which the camera sensor element has seen that object as mentioned above.
As a first approach, a mask with three possible values ‘0’, ‘0.5’ and ‘1’ is computed by comparing the DFD or absolute DFD of each pixel (x,y) against two thresholds thr_low and thr_high:
The ‘0’ and ‘1’ values denote different object areas while the value of (e.g.) ‘0.5’ expresses some uncertainty. A low absolute DFD thus turns into a mask value of ‘0’ which represents object number ‘0’.
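The comparison itself is not written out in the text; a minimal sketch consistent with this description (threshold and function names assumed) is:

```python
import numpy as np

def three_level_mask(abs_dfd, thr_low, thr_high):
    """Map each |DFD(x,y)| in the measurement window to mask values
    '0' (low DFD, object '0'), '1' (high DFD, other object) or
    '0.5' (in between, uncertain)."""
    mask = np.full(abs_dfd.shape, 0.5)
    mask[abs_dfd < thr_low] = 0.0
    mask[abs_dfd > thr_high] = 1.0
    return mask
```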
Alternatively, mask(x,y) can be a finer, continuous function that translates the absolute DFD into a probability between ‘0’ and ‘1’. One example is a linear function.
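The linear function is not reproduced in the text; a plausible ramp consistent with the thresholds and the ‘0’/‘1’ mapping above (an assumed reconstruction) is:

$\text{mask}(x,y) = \min\!\left(1,\ \max\!\left(0,\ \dfrac{|DFD(x,y)| - thr_{low}}{thr_{high} - thr_{low}}\right)\right)$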
Again, a low absolute DFD turns into a mask value of ‘0’ and represents object number ‘0’.
A further improvement can be a continuously differentiable function with a smooth rather than sharp transition at thr_low and thr_high, for example an exponential function which starts steep and saturates towards higher values of |DFD(x,y)|, wherein thr_high determines the gradient of the function at |DFD(x,y)| = thr_low, which gradient is 1/(thr_high − thr_low). For a proper setting of the thr_low value, e.g. the noise level present in the image sequence can be taken into account.
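The exponential function is likewise not reproduced; one reconstruction with exactly the stated properties (steep start, saturation towards ‘1’, gradient 1/(thr_high − thr_low) at |DFD(x,y)| = thr_low) would be:

$\text{mask}(x,y) = 1 - \exp\!\left(-\dfrac{|DFD(x,y)| - thr_{low}}{thr_{high} - thr_{low}}\right)$ for $|DFD(x,y)| \ge thr_{low}$, and $\text{mask}(x,y) = 0$ otherwise.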
A.2 Improvement of Initial Identification of Object Areas
Another motion estimation step may then be appended which operates on the remaining part of the measurement window, i.e. the part identified as belonging to the other object.
In case there still remain high absolute DFD values in this part of the window, this may indicate the presence of another object moving with a different speed and/or direction of motion, i.e. with a different motion vector. Therefore the process can be repeated with a further reduced part of the measurement window, and corresponding to further subdivision of the object area which has been identified first.
A.3 Initialization of Object Mask
A.3.1 First Level of the Hierarchy
Before starting motion estimation in the present frame, the information about the shape of the areas within the measurement window and the motion of the two (or more) areas of the measurement window in the previous frame (see above) can be used for predicting or deriving an initial segmentation of the measurement window in the present frame for use in its first (i.e. coarsest) level of the hierarchy. This will reduce the risk of arriving at a false motion estimate right in the beginning if the disturbing object is prominent in its influence, even though perhaps not big in size.
In a simplified alternative, just one vector is estimated and provided. Before starting motion estimation in the present frame, the information about the shape of the area in the measurement window, and maybe also the single motion vector of the window in the previous frame, could be used to predict or derive an initial segmentation of the measurement window in the present frame for use in its first level of the hierarchy. Although the window may already contain a small part of a disturbing object that has moved by a different amount, this portion will be small enough, when the object starts entering the window, for the processing still to succeed.
An enhanced processing can modify the mask values in an appropriate way in case they are continuous.
The first approach is to use the mask derived in the previous frame, with the simplifying assumption that the other object is still at the same relative position. In fact, there will be a spatial offset resulting from the relative true motion of the two objects.
Because this initial mask is used (only) for the first level of the hierarchy of the present frame, it can be just the mask resulting from the first level of the hierarchy of the previous frame since this refers to the same sampling grid.
A.3.2 Second and Further Levels of the Hierarchy
There are challenging situations where the mask resulting from the first level of the hierarchy in the present frame is not useful for the second level because it masks off (far) fewer pixels than the initial mask obtained from the previous (prev) frame. To overcome such situations, the initial mask can be combined with the present (pres) mask resulting from the first level, e.g. by:
$\text{mask}_{2,pres}(x,y) = \bigl(\text{mask}_{1,prev}(x,y) + \text{mask}_{1,pres}(x,y)\bigr)/2$ ;   (4)

$\text{mask}_{2,pres}(x,y) = w_{prev} \cdot \text{mask}_{1,prev}(x,y) + (1 - w_{prev}) \cdot \text{mask}_{1,pres}(x,y)$ ;   (6)

$\text{mask}_{2,pres}(x,y) = \text{mask}_{1,prev}(x,y)$ .   (7)
Likewise, from the second level of the hierarchy onwards up to the finest level, the mask obtained from the previous level n-1 of the hierarchy in the previous frame can be used instead of the mask obtained from the previous level in the present frame, if the change in length of the present estimated vector compared to the length of the corresponding estimated vector of the previous frame is too large (e.g. by more than a safety factor of ‘4’):
$\text{mask}_{n,pres}(x,y) = \text{mask}_{n-1,prev}(x,y)$ .   (8)
In addition, the vector obtained from the previous frame can also be used instead of the presently estimated vector if the change in vector length is too large (e.g. by more than a safety factor of ‘4’):

$\hat{\vec{d}}_{n,pres}(x,y) = \hat{\vec{d}}_{prev}(x,y)$ .   (9)
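A compact sketch of the combination rule (6) (with (4) as its w_prev = 0.5 special case and (7) as full replacement) and of the fallback rules (8) and (9); all names are assumptions:

```python
import numpy as np

def combine_masks(mask1_prev, mask1_pres, w_prev=0.5):
    """Eq. (6): blend the previous frame's level-1 mask with the present
    one; w_prev = 0.5 gives the plain average of eq. (4), w_prev = 1.0
    the full replacement of eq. (7)."""
    return w_prev * mask1_prev + (1.0 - w_prev) * mask1_pres

def fallback(mask_prev, mask_pres, vec_prev, vec_pres, safety=4.0):
    """Eqs. (8) and (9): if the estimated vector length changed by more
    than the safety factor between frames, distrust the present estimate
    and reuse the previous frame's mask and vector."""
    len_prev, len_pres = np.hypot(*vec_prev), np.hypot(*vec_pres)
    if len_pres > safety * len_prev or len_prev > safety * len_pres:
        return mask_prev, vec_prev
    return mask_pres, vec_pres
```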
A.4 Utilizing Location Information in Next Level of Hierarchy
In the next level of the motion estimation hierarchy, the sections of the measurement window identified in the previous level (and stored for every point of interest to be tracked, or for the complete image) are considered in the motion estimation from the beginning. This information is interpolated (or repeated from the nearest neighbor) to the mostly denser sampling inside the new measurement window, and boundary areas are cut away in order to fit the new, smaller window.
Because the location information relates to the reference image rather than the search image and therefore does not depend on the motion vector, interpolation to the original sampling grid of the image will probably not be necessary.
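A minimal nearest-neighbor version of this mask transfer (the upsampling factor and window sizes are assumptions):

```python
import numpy as np

def transfer_mask(mask, up_factor, new_half_width):
    """Repeat each mask sample 'up_factor' times in x and y (nearest
    neighbor) and crop centrally to the smaller measurement window of
    the next hierarchy level."""
    dense = np.repeat(np.repeat(mask, up_factor, axis=0), up_factor, axis=1)
    cy, cx = dense.shape[0] // 2, dense.shape[1] // 2
    return dense[cy - new_half_width:cy + new_half_width + 1,
                 cx - new_half_width:cx + new_half_width + 1]
```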
A useful segmentation mask is needed from the first level of the hierarchy in order to make further steps successful. This may be difficult in situations where another object enters and covers much (e.g. half) of the measurement window area while the motion estimator finds a match for the other object with significant DFDs only for a few remaining pixels in the window.
A.5 Update of Identification of Object Areas in Each Level
At the end of each hierarchical motion estimation level the cost function of all picture elements in the measurement window (without using a mask) using the newly determined vector is analyzed and segmented as above, and the shape of the part of the measurement window to be used for motion estimation of the center pixel is thus updated and refined. This is continued through the following levels of the hierarchy (maybe not through all of them) unless the search window becomes too small.
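As a structural sketch, this per-level update can be written as a coarse-to-fine loop; the three callables stand for the search, the window DFD computation and the segmentation described above, and are placeholders rather than prescribed interfaces:

```python
def estimate_with_mask_update(estimate_vec, window_abs_dfd, segment, levels):
    """Run motion estimation level by level: estimate a vector using the
    current mask, recompute |DFD| in the window with the new vector, and
    re-segment the window to obtain the refined mask for the next level."""
    mask, vec = None, (0, 0)
    for level in levels:
        vec = estimate_vec(level, vec, mask)
        abs_dfd = window_abs_dfd(level, vec)
        mask = segment(level, abs_dfd)
    return vec, mask
```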
B. Using Probability Values in Cost Function
The mask or probability values defined above can be translated (by a given function) into weighting factors for the absolute DFDs inside the measurement window when computing the cost estimate. If a mask value of ‘0’ is supposed to relate perfectly to the intended object area (i.e. object number ‘0’), the mask values are ‘inverted’, i.e. subtracted from ‘1’, in order to derive a probability of belonging to that object:
$p_0(x,y) = 1 - \text{mask}(x,y)$   (10)

$\text{cost} = \sum_{x,y} w\bigl(p_0(x,y)\bigr) \cdot |DFD(x,y)|$ .   (11)
In a different embodiment, the probability values themselves are used as weighting factors:
$\text{cost} = \sum_{x,y} p_0(x,y) \cdot |DFD(x,y)|$ .   (12)
Depending on the low or high value of the absolute DFD of the center pixel, or if motion estimation is carried out for the remaining part of the measurement window, the probability values can be ‘inverted’, i.e. subtracted from ‘1’ in order to derive their proper meaning:
$p_1(x,y) = 1 - p_0(x,y) = \text{mask}(x,y)$ ,   (13)

whereby a segmentation mask masking off disturbing pixels is used to make this method successful.
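Eqs. (10)-(13) translate directly into a weighted SAD; a minimal sketch (names assumed; the weighting function w of eq. (11) is taken as the identity here, which gives eq. (12)):

```python
import numpy as np

def weighted_cost(abs_dfd, mask, center_is_low_dfd=True):
    """Weighted SAD of eq. (12): pixels probably belonging to another
    object than the center pixel contribute less to the cost."""
    p0 = 1.0 - mask                        # eq. (10)
    p = p0 if center_is_low_dfd else mask  # eq. (13): 'inverted' meaning
    return np.sum(p * abs_dfd)             # eq. (12)
```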
C. Signal-Based and Distance-Based Weights in Cost Function
In [3] a method of stereo matching (called “visual correspondence search”) is presented, i.e. an estimation of correspondences between stereo pairs of images. The application involves different perspectives and different depths. In determining the cost function (called “dissimilarity”), the method aims to weight pixels “in proportion of the probability that the pixels have the same disparity”. In order to achieve this objective, pixels with similar color and located nearby are preferred. A pixel q in the measurement window (called “support window”) is weighted by a factor
$w(p,q) = k \cdot f_s(\Delta c_{pq}) \cdot f_p(\Delta g_{pq})$   (14)
that includes a constant k and two weights f_s and f_p which depend on the color difference (Euclidean distance in color space) and on the spatial distance of the pixels (these weights are called “strength of grouping by color similarity” and “strength of grouping by proximity”, respectively), with
$\Delta c_{pq} = \sqrt{(L_p - L_q)^2 + (a_p - a_q)^2 + (b_p - b_q)^2}$   (15)
defined in the CIE Lab color space (which uses L = 0…100, a = −150…+100, b = −100…+150 and which is related to human visual perception) and
the weighting functions f_s and f_p given without explanation, presumably (in line with [3]) the exponentials $f_s(\Delta c_{pq}) = \exp(-\Delta c_{pq}/\gamma_c)$ (16) and $f_p(\Delta g_{pq}) = \exp(-\Delta g_{pq}/\gamma_p)$ (17), and probably (not given in the paper) with
$\Delta g_{pq} = \sqrt{(x_p - x_q)^2 + (y_p - y_q)^2}$ .   (18)
The constants are given as γ_c = 7, which is said to be a typical value (it might refer to the CIELab signal value ranges, maybe to 100), and γ_p = 36, which is determined empirically (it might refer to the pixel grid, as in another paper by the same authors in which the window size is 35×35 pixels and γ_p = 17.5 is said to be the radius of the window). A measurement window of e.g. size 33×33 is used. The processing is not hierarchical.
That idea is transferred to motion estimation between successive frames by means of hierarchical motion estimation.
For simplicity,

$\text{cost} = \sum_{x,y} w(x,y) \cdot |DFD(x,y)|$   (19)

is used with luminance only (rather than R, G and B), with k = 1, with γ_c scaled to the bit depth b of the image signal, and with γ_p modified by the factor s of the subsampling performed in the measurement window in combination with prefiltering the image signal in the levels of the hierarchy:

$\gamma'_p = \gamma_p \cdot s$ .   (21)
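A sketch of the resulting weights for luminance-only motion estimation, assuming the exponential forms (16)/(17); since the γ_c scaling formula (eq. (20)) is not reproduced in the text, the bit-depth factor 2^(b−8) used here is an assumption:

```python
import numpy as np

def support_weights(abs_lum_diff, spatial_dist, bit_depth=8, s=1,
                    gamma_c=7.0, gamma_p=36.0):
    """Per-pixel weights w(x,y) of eq. (19) inside the measurement window.

    abs_lum_diff: |Y(center) - Y(x,y)| per window pixel;
    spatial_dist: Euclidean distance of (x,y) to the window center, eq. (18);
    s: subsampling factor of the hierarchy level, eq. (21).
    """
    g_c = gamma_c * 2.0 ** (bit_depth - 8)  # assumed form of eq. (20)
    g_p = gamma_p * s                       # eq. (21)
    return np.exp(-abs_lum_diff / g_c) * np.exp(-spatial_dist / g_p)
```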
D. Description of Figures
The hierarchical motion estimator in
An example parameter set for 6 levels of hierarchy is:
For the motion of pixel 40 from the present frame to its search window position, a log(D)-step search is depicted in
As an example a 4-step search is depicted, with step sizes of 8, 4, 2, 1 pels or lines. The corresponding lowpass-filtered and subsampled pixels of the search window are marked by ‘1’ in the coarsest level, ‘2’ in the next level, ‘3’ in the following level, and ‘4’ in the finest level of the hierarchy. In the coarsest level motion vector 42 is found, in the next level motion vector update 43 is found, in the following level motion vector update 44 is found, and in the last level a motion vector update is found which, added to motion vectors 42 to 44, yields the total or final motion vector 41.
Motion estimator 72 passes its motion or displacement vector or vectors dv1, as well as the corresponding segmentation information si1 determined from the absolute pixel DFD values as described above, to motion estimator 75 for update. Motion estimator 75 passes its motion or displacement vector or vectors dv2, as well as the corresponding segmentation information si2, to motion estimator 78 for update. Motion estimator 78 receives displacement vector or vectors dvN-1 as well as the corresponding segmentation information siN-1 and outputs the final displacement vector or vectors dvN as well as the corresponding final segmentation information siN.
Motion estimator 82 passes its motion or displacement vector dv1, as well as the corresponding segmentation information si1 determined from the absolute pixel DFD values as described above, to motion estimator 85 for update. Motion estimator 85 passes its motion or displacement vector or vectors dv2, as well as the corresponding segmentation information si2, to motion estimator 88 for update. Motion estimator 88 receives displacement vector or vectors dvN-1 as well as the corresponding segmentation information siN-1 and outputs the final displacement vector or vectors dvN as well as the corresponding final segmentation information siN. The segmentation information si1 output from motion estimator 82 (and possibly the segmentation information siN output from motion estimator 88) is fed to a frame delay 801, which outputs the corresponding segmentation information si1 from the previous-frame motion estimation as initialisation segmentation information siinit to motion estimator 82 for evaluation.
The described processing can be carried out by a single processor or electronic circuit, or by several processors or electronic circuits operating in parallel and/or operating on different parts of the complete processing. The instructions for operating the processor or the processors according to the described processing can be stored in one or more memories. The at least one processor is configured to carry out these instructions.