The invention relates to a method and to an apparatus for hierarchical motion estimation wherein a segmentation of the measurement window into different moving object regions is performed.
Estimation of motion between frames of image sequences is used in applications such as targeted content and digital video encoding. Known motion estimation methods are based on different motion models and technical approaches such as gradient methods, block matching, phase correlation, ‘optical flow’ methods (often gradient-based), and feature point extraction and tracking. They all have advantages and drawbacks. Orthogonal to, and in combination with, one of these approaches, hierarchical motion estimation allows a large vector search range and is typically combined with block matching, cf. [1], [2]. In motion estimation a cost function is computed by evaluating the image signal of two image frames inside a measurement window.
Motion estimation faces a number of different situations in image sequences. A challenging one is when motion is estimated for an image location where there are different objects moving at different speed and/or direction. In this case the measurement window covers these different objects so that the motion estimator is distracted by objects other than the intended one.
In [3] a method of estimating correspondences between stereo pairs of images is described. In determining the cost function, the method aims to weight pixels “in proportion of the probability that the pixels have the same disparity”. Pixels with similar colour and located nearby are preferred by means of two kinds of weights (involving empirically determined factors): one related to colour difference, the other related to spatial distance. Unfortunately that method has inherent problems with periodic structures because pixels with the same or similar colour but a certain distance apart may mislead the motion estimator. In addition, this concept makes no attempt to account for different motion.
[4] describes reliable motion estimates for image sequences even in situations or locations in an image where the measurement window of the motion estimator covers different objects with different motion. A motion vector is provided that typically applies to the centre pixel of the measurement window. It is therefore most appropriate for estimating the motion of points or pixels of interest and for tracking them. If motion shall be estimated for every pixel in an image, this method can be used in principle. However, since segmentation information is stored for the complete (subsampled) measurement window for every location a vector is estimated for, memory requirements can become huge (e.g. for the later, finer levels of the hierarchy). Moreover, no advantage is taken of multiple pixel-related segmentation information being present as a result of overlapping measurement window positions.
In particular, [4] describes a method adapted for hierarchical motion estimation in which motion vectors are refined in successive levels of increasing search window pixel density and/or decreasing search window size, including the steps:
Thus, with every hierarchy level the hierarchical motion estimator provides true motion vectors closer towards object boundaries (e.g. of a truck on a road), due to the decreasing grid size (i.e. distance of pixels for which a vector is estimated) and/or decreasing size of the measurement window, but not at the boundaries themselves. In the motion compensated image, high ‘displaced frame differences’ (DFD) remain around moving objects, in structured areas of medium-size moving objects, in or around uncovered background regions, and throughout small moving objects, or at least at their front and rear if they are less textured (e.g. a car moving behind a bush). [4] thus describes translating DFDs or absolute DFDs into probabilities of belonging to the same or another object, resulting in a more continuous decision than a binary one. Such a probability, related to an object, also reflects the part of the exposure time for which the camera sensor element has seen that object, as mentioned above.
As a first approach, a mask with three possible values ‘0’, ‘0.5’ and ‘1’ is computed by comparing the DFD or absolute DFD of each pixel (x,y) against two thresholds:
The ‘0’ and ‘1’ values denote different object areas while the value of (e.g.) ‘0.5’ expresses some uncertainty. A low absolute DFD thus turns into a mask value of ‘0’ which represents object number ‘0’.
As a refinement, mask(x,y) represents a finer, continuous function that translates the absolute DFD into a probability between ‘0’ and ‘1’ of belonging to the same or another object. [4] is incorporated by reference herein in its entirety.
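As an illustration, the following is a minimal sketch of both mask variants described above — the three-valued thresholding and the continuous probability. The threshold names thr_low and thr_high and the linear ramp of the continuous variant are assumptions, not values fixed by [4]:

```python
import numpy as np

def dfd_mask(dfd, thr_low, thr_high):
    """Three-valued mask from DFDs: '0' = same object as the centre
    pixel, '1' = other object, '0.5' = uncertain."""
    abs_dfd = np.abs(dfd)
    mask = np.full(abs_dfd.shape, 0.5)
    mask[abs_dfd < thr_low] = 0.0    # low absolute DFD -> object '0'
    mask[abs_dfd > thr_high] = 1.0   # high absolute DFD -> other object
    return mask

def dfd_mask_continuous(dfd, thr_low, thr_high):
    """Continuous variant: linear ramp between the two thresholds,
    interpreted as the probability of belonging to another object."""
    abs_dfd = np.abs(dfd)
    return np.clip((abs_dfd - thr_low) / (thr_high - thr_low), 0.0, 1.0)
```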
The inventive processing solves these issues, as disclosed in claim 1. An apparatus that utilises this method is disclosed in claim 2.
Advantageous additional embodiments of the invention are disclosed in the respective dependent claims.
A hierarchical motion estimation is performed which includes a number of hierarchy levels 17, 18, …, 19 as shown in
As an enhancement, only the image information in the measurement window that is considered as belonging to the same object as the centre pixel of the measurement window is included in calculating (e.g. in steps/stages 17, 18, …, 19) the cost function of the motion estimator, in order to obtain a vector that is specific to that very object, or to a part of it.
The described processing allows estimating motion and tracking specific image content or points of interest with improved reliability and accuracy in situations or image locations where different objects move at different speed and/or direction. It prevents the motion estimator from being distracted by other objects while attempting to determine a motion vector for an image location or pixel which is part of the object in focus, or for any pixel in the image in a regular grid. The processing provides reliable motion estimates for every pixel in an image of an image sequence, and it does so in an efficient way.
In principle, the inventive method is adapted for hierarchical motion estimation, including:
a) using—in each motion estimation hierarchy level—in a pixel block matcher of a corresponding motion estimator a measurement window for comparing correspondingly (sub)sampled pixel values of a current image frame and a delayed previous image frame in order to compute a motion vector for every pixel of interest, wherein—by evaluating displaced frame differences in the measurement window—a segmentation of the measurement window into different moving object regions is performed;
b) storing corresponding segmentation information;
c) using said stored segmentation information as an initial segmentation for motion estimation in the following finer level of the motion estimation hierarchy;
d) determining in said following finer level of the motion estimation hierarchy an updated segmentation information;
e) continuing the processing with step b) until the finest level of said motion estimation hierarchy is reached.
In principle the inventive apparatus is adapted for hierarchical motion estimation, said apparatus including means adapted to:
a) using—in each motion estimation hierarchy level—in a pixel block matcher of a corresponding motion estimator a measurement window for comparing correspondingly (sub)sampled pixel values of a current image frame and a delayed previous image frame in order to compute a motion vector for every pixel of interest, wherein—by evaluating displaced frame differences in the measurement window—a segmentation of the measurement window into different moving object regions is performed;
b) storing corresponding segmentation information;
c) using said stored segmentation information as an initial segmentation for motion estimation in the following finer level of the motion estimation hierarchy;
d) determining in said following finer level of the motion estimation hierarchy an updated segmentation information;
e) continuing the processing with step b) until the finest level of said motion estimation hierarchy is reached.
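As an illustration, the following is a minimal sketch of steps a) and b) for one hierarchy level, using a masked SAD block matcher and a two-threshold DFD mask. All function and parameter names, the thresholds, and the odd window size are assumptions for the sketch, not the claimed implementation:

```python
import numpy as np

def estimate_level(cur, prev, grid, win, search, init_mask=None,
                   thr_low=4.0, thr_high=16.0):
    """One hierarchy level of steps a) and b): for every grid pixel, find
    the best vector by block matching with a masked SAD cost, then derive
    a DFD-based segmentation of the window. 'win' is assumed odd."""
    cur = cur.astype(float)
    prev = prev.astype(float)
    h, w = cur.shape
    half = win // 2
    vectors, masks = {}, {}
    for y in range(half, h - half, grid):
        for x in range(half, w - half, grid):
            cw = cur[y - half:y + half + 1, x - half:x + half + 1]
            m = (init_mask[y - half:y + half + 1, x - half:x + half + 1]
                 if init_mask is not None else np.zeros_like(cw))
            weight = 1.0 - m   # step a): other-object pixels get low weight
            best, best_cost = (0, 0), np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    py, px = y + dy, x + dx
                    if py - half < 0 or py + half >= h or px - half < 0 or px + half >= w:
                        continue
                    pw = prev[py - half:py + half + 1, px - half:px + half + 1]
                    cost = np.sum(weight * np.abs(cw - pw)) / max(weight.sum(), 1e-9)
                    if cost < best_cost:
                        best_cost, best = cost, (dy, dx)
            dy, dx = best
            dfd = np.abs(cw - prev[y + dy - half:y + dy + half + 1,
                                   x + dx - half:x + dx + half + 1])
            # segment the window via the DFD (step a)) and store it (step b))
            mask = np.full(cw.shape, 0.5)
            mask[dfd < thr_low] = 0.0   # same object as the centre pixel
            mask[dfd > thr_high] = 1.0  # other object
            vectors[(y, x)] = best
            masks[(y, x)] = mask
    return vectors, masks
```

Steps c) to e) would invoke this once per level, from coarse to fine, passing a spatially interpolated version of the segmentation stored in the previous level as init_mask until the finest level is reached.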
Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in:
Even if not explicitly described, the following embodiments may be employed in any combination or sub-combination.
I. Identifying Object Locations in the Complete Image—Whole-Frame Mode
I.1 Motion Estimation Type and Memory Requirements
The motion estimation method described in [4] includes segmentation of the measurement window and fits well with the case of estimating motion vectors for pixels of interest (‘pixels-of-interest mode’), since location information—which is available for every pixel in the subsampled measurement window—needs to be stored only around the pixels of interest, and their number will typically be low. When estimating motion vectors for all picture elements in an image (‘whole-frame mode’) the same processing can be applied, i.e. location information obtained from motion estimation for every grid point or pixel for which a vector is estimated in a level of the hierarchy could be stored in the same way. This would require a storage space proportional to the number of grid points for which a vector is estimated, multiplied by the number of pixels in the subsampled measurement window. This is roughly proportional to
numLines/iGrid * numPels/iGrid * (iMeasWin/iSub)^2,
with numLines=number of lines of the image, numPels=number of pels per line of the image, iGrid=horizontal and vertical distance of pixels for which a motion vector is estimated, iMeasWin=horizontal and vertical width in pixels of the measurement window (can also be a rectangular window), iSub=quasi-subsampling factor or horizontal and vertical distance of pixels in the measurement window that are evaluated in determining the cost function.
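For illustration, a minimal sketch evaluating this expression for one hierarchy level and comparing it with the numLines*numPels requirement of the combined-mask method described below. The parameter values are assumptions matching the first-level example given later, except iSub, which is not stated in the text:

```python
def seg_storage_method_a(numLines, numPels, iGrid, iMeasWin, iSub):
    """Mask samples for method (a): one subsampled window mask stored
    individually per grid point for which a vector is estimated."""
    grid_points = (numLines // iGrid) * (numPels // iGrid)
    window_samples = (iMeasWin // iSub) ** 2
    return grid_points * window_samples

# Assumed first-level parameters: 1088x1920 image, 257x257 window,
# iGrid=64; iSub=8 is an assumption.
print(seg_storage_method_a(1088, 1920, 64, 257, 8))  # -> 522240 samples
print(1088 * 1920)  # method (b), one combined mask: 2088960 samples
```

With these assumed values method (a) needs roughly a quarter of the storage of method (b) in the first level; in finer levels iGrid shrinks, the number of grid points grows rapidly, and method (b) becomes the more efficient one.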
As can be seen in the following, determining and storing segmentation information independently for each measurement window position can be sufficient for the first levels of the hierarchy, due to the large iGrid values. On the other hand, no benefit would be taken from the information obtained from the overlapping measurement windows. E.g. in the first one of six levels with a measurement window of size 257×257 pixels and a grid size of 64 pixels, about 75% of the pixels in the measurement window are the same as for the previous grid pixel for which a vector has been estimated. This is depicted in
As an alternative, location information obtained from motion estimation for neighbouring grid points—with their overlapping measurement windows—can be combined in order to generate an object mask or segmentation information which then can be used and refined in the next level of the hierarchy, and so on. This kind of processing is facilitated by considering and merging/combining the location information based on, or given by, the discrete or continuous segmentation information or probabilities obtained by DFD-based segmentation (cf. [4]). The memory requirement is then given by numLines*numPels.
The following Table 1 shows that, for an example of numLines=1088, numPels=1920 and a 6-level hierarchy, in the first (l=1) and second (l=2) motion estimation hierarchy level the method of (a) storing segmentation information individually for the measurement window around every grid point would require less storage space than the method of (b) combining the segmentation information into one image. For the third hierarchy level both methods have almost the same memory requirement, and for the last three levels method (b) is clearly more efficient in terms of memory requirement.
Note that DFD-based segmentation can be carried out in different colour spaces, e.g. by evaluating luminance (Y) or RGB or YUV (solely or jointly). Even different masks can be determined for different colour components such as R, G and B for application examples where they do not match exactly (e.g. in case of storing the three colour components of a film on different grey films individually).
I.e., in the whole-frame mode, without combining segmentation information in the measurement window at and around neighbouring grid pixels, storing and using the segmentation information individually for each window position can already be applied with some benefit in the first motion estimation hierarchy levels.
I.2 Combining New and Previous Segmentation Information
I.2.1 Segmentation Information from Neighbouring Measurement Window Positions
For combining segmentation information of neighbouring window positions in a level of the hierarchy, when a motion vector is estimated at a current pixel location, the measurement window is placed around it and mask or probability information is obtained for every pixel in the quasi-subsampled measurement window. When a motion vector is then estimated at/for the next pixel location, e.g. to the right, the measurement window is placed around this pixel and segmentation information is obtained for some additional pixels at the right side of the window, for correspondingly fewer pixels at the left side, and for many of the same pixels in the rest of the subsampled measurement window. While the measurement window is shifted over a current pixel when estimating motion for a number of pixels to its left and right (according to the width of the measurement window), segmentation information becomes available for this pixel with every new window position, as depicted by
All of these segmentation information or probability value contributions for the current pixel are to be combined, for example by a linear combination (e.g. average) and/or a non-linear combination (e.g. median, max) and/or a logical combination, in order to form a single piece of segmentation information for this pixel, possibly while also considering its neighbourhood. Inversion of a segmentation information or probability value contribution of a pixel may be included where necessary, for instance where the pixel is supposed to belong to a different object than the centre pixel of the measurement window because its segmentation information value differs strongly from, or is inverse to, the segmentation information value at the centre pixel.
To this end, the segmentation information of a pixel as contributed from overlapping windows of neighbouring window positions needs to be evaluated. After having obtained the segmentation information in the measurement window at the dashed position in
In addition to the overlap situations shown in
The segmentation is assumed to be most reliable in cases where the other object covers just a small part of the measurement window, so that the motion estimator nicely matches the major part of it. For the same reason, segmentation information reliability is considered to be high in the first levels of the hierarchy with their large measurement windows. This can be considered in the combining or merging algorithm, e.g. by weighting, at a certain pixel location, the contributions from different window positions depending on the relation of the quantities of pixels in the measurement window belonging to different objects. This can be approximated by the absolute average segmentation information value for the pixel location, which will typically occupy a range between 0.5 (two objects of equal size) and 1 (just one object). This span can be decreased or enlarged artificially, e.g. to 0.25…1.
An example of a small foreground object moving in front of a large background object is depicted in
It is assumed that a pixel (x,y) is located within a measurement window for k subsequent times (k = 1…4 in the horizontal example of
If with the third measurement position the pixel still lies within the overlap area, it receives the segmentation information mask(x,y,3), which is to be combined with the stored segmentation information, e.g. by averaging them with equal or different weights. If all three shall be stored with equal weights, the number of times nSI(x,y) for which the pixel has received segmentation information is stored as well, so that e.g.
And similarly with a fourth measurement position, e.g.
Or in general:
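A minimal sketch of this update, assuming equal weights for all contributions received so far (the notation mask_stored for the merged value is an assumption):

$$\mathrm{mask}_{\mathrm{stored}}(x,y)\;\leftarrow\;\frac{nSI(x,y)\cdot \mathrm{mask}_{\mathrm{stored}}(x,y)+\mathrm{mask}(x,y,k)}{nSI(x,y)+1},\qquad nSI(x,y)\;\leftarrow\;nSI(x,y)+1.$$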
This is continued in case of later vertical overlap, and finally this leads to equal weights for the contributions resulting from the neighbouring measurement window positions.
As an alternative, adaptive weights can be used that take into account the reliability of the segmentation, e.g. based on the number of pixels in the measurement window that get a ‘good’ segmentation, i.e. a low mask value. In this case the sum of previous weights can be stored.
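A compact sketch of both merging variants for one window contribution. The region bounds ys/xs and the reliability measure are assumptions:

```python
import numpy as np

def merge_equal(mask_store, nSI, new_mask, ys, xs):
    """Equal-weight running average of segmentation contributions;
    nSI counts how often each pixel has received information."""
    region = (slice(*ys), slice(*xs))
    n = nSI[region]
    mask_store[region] = (n * mask_store[region] + new_mask) / (n + 1)
    nSI[region] = n + 1

def merge_weighted(mask_store, wsum, new_mask, ys, xs, reliability):
    """Adaptive variant: weight a contribution by a reliability measure
    and keep the sum of previous weights in wsum."""
    region = (slice(*ys), slice(*xs))
    s = wsum[region]
    mask_store[region] = (s * mask_store[region]
                          + reliability * new_mask) / (s + reliability)
    wsum[region] = s + reliability
```

The reliability could, for instance, be taken as the share of ‘good’ (low) mask values in the window, e.g. `float(np.mean(new_mask < 0.25))`.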
An object label, e.g. a number, is assigned in addition. This way, either (a) a complete frame of mask probability information is formed for every object, or (b) one frame containing mask probability information is formed along with a frame array containing object labels, or (c) both mask probabilities and object labels are combined into a single number or information sample representing both. Probability entries will typically be somewhat lower near supposed object boundaries.
I.2.2 Segmentation Information from Different Levels of the Hierarchy
Following a spatial interpolation (linear or non-linear, e.g. majority of 3 of 4 neighbours), the final output mask of a level of the motion estimation hierarchy is used as an input mask of motion estimation in the next level of the motion estimation hierarchy: when estimating an update vector at a certain pixel location, a mask section of the size and respective position of the measurement window is extracted and used as an input mask by the block matcher.
The output mask generated for a present measurement window position from the remaining DFDs as described above is either stored in a separate array (i.e. combined into the new mask of the complete image), or is merged in place (e.g. in a weighted fashion, for instance old vs. new or left/top vs. right/bottom, or by replacement) into the existing (input) mask, thereby refining it immediately.
In this case the combination or merging process described above can be carried further by considering either equal or individual weights for the different levels of the motion estimation hierarchy. E.g. after finalising the segmentation of the first level of the hierarchy as mask1(x,y), all its nSI(x,y) can be virtually set to ‘1’ and new ones can be created and used when determining the mask in the second level as:
and so on for the other levels l:
in case of equal weights.
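A minimal sketch of this equal-weight combination across levels, assuming mask^{(l)} denotes the merged mask after level l and mask_new^{(l)} the mask formed within level l:

$$\mathrm{mask}^{(2)}(x,y)=\frac{\mathrm{mask}^{(1)}(x,y)+\mathrm{mask}^{(2)}_{\mathrm{new}}(x,y)}{2},\qquad\mathrm{mask}^{(l)}(x,y)=\frac{(l-1)\,\mathrm{mask}^{(l-1)}(x,y)+\mathrm{mask}^{(l)}_{\mathrm{new}}(x,y)}{l}.$$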
Alternatively, first the mask of the new level l is formed as described in section I.2.1, and subsequently it is combined with the mask of the previous level according to
Because the measurement window size decreases from one level to the next, it will less often contain image information from different objects. Therefore the mask combination during the later (i.e. finer) levels is to be carried out carefully. To this end, different reliability, and hence weights, can be assigned to the different levels l of the motion estimation hierarchy. A higher (i.e. coarser) level gets a higher weight than the following level—e.g. in the ratios of 6:5:4:3:2:1,
or according to the ratio of the measurement window sizes w_l, because a larger window should mean a more reliable segmentation as it more likely covers different objects (see Table 2 below):
or according to the ratio of the squared measurement window sizes:
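One way to write these three weighting options for an L-level hierarchy (assumed notation; w_l is the measurement window size of level l):

$$g_l=\frac{L+1-l}{\sum_{m=1}^{L}(L+1-m)},\qquad g_l=\frac{w_l}{\sum_{m=1}^{L}w_m},\qquad g_l=\frac{w_l^{2}}{\sum_{m=1}^{L}w_m^{2}},\qquad\mathrm{mask}(x,y)=\sum_{l=1}^{L}g_l\,\mathrm{mask}^{(l)}(x,y).$$

For L = 6 the first option reproduces the 6:5:4:3:2:1 ratios.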
In the example of Table 2 the first three levels of the hierarchy have a contribution of (a) 85% or (b) even 96% or (c) 72% to the overall combination of segmentation information (see also
Table 2: ratio of the window size of each level of the hierarchy vs. (a) the sum of window sizes (sum is 698) and (b) the same for squared window sizes (sum is 131390), and compared with (c) a ratio of 6:5:4:3:2:1 (sum is 21).
The weighted combination can also be implemented successively, starting with the first hierarchy level l=1 and the second hierarchy level (l=2):
etc. And similarly in case of weights based on squared window sizes. The major objective of using the mask is to enhance motion estimation rather than segmenting the image into semantically distinguished objects.
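A sketch of such a successive weighted merge; the per-level window sizes are whatever the hierarchy uses (the Table 2 values are not reproduced here):

```python
def merge_levels_successively(level_masks, window_sizes):
    """Combine per-level masks one level at a time, weighting each level
    by its measurement window size (a larger window is assumed to give a
    more reliable segmentation)."""
    combined = level_masks[0]
    wsum = window_sizes[0]
    for mask_l, w_l in zip(level_masks[1:], window_sizes[1:]):
        combined = (wsum * combined + w_l * mask_l) / (wsum + w_l)
        wsum += w_l
    return combined
```

Replacing w_l by w_l**2 gives the squared-window-size variant.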
In summary there are the following modes for merging segmentation information of neighbouring measurement window positions and of different levels of the motion estimation hierarchy:
Processing in modes 1 and 2 is carried out across the levels of the motion estimation hierarchy without any weighting among them, every new entry has equal weight.
I.2.3 Inversion of Segmentation Information Prior to Merging
As already mentioned above, the present mask(x,y,k) may need to be inverted prior to merging. For this purpose it is compared with the stored mask information in the overlap area of the measurement window, see section I.2.1. The overlap area is given by those pixels for which segmentation information has already been stored, i.e. for which the number of times nSI(x,y) the pixel has received segmentation information is greater than zero. This comparison can be limited to the present level of the hierarchy (by initialising nSI(x,y) with zero at the beginning of each level) or go across all levels (e.g. by initialising nSI(x,y) with zero only at the beginning of the first level).
The simplest method is to compare the mask of the centre pixel of the measurement window at its present position with the stored information which has been obtained from at least one or two previous measurement window positions (see
Reliability of this kind of operation can be improved by taking into account a spatial neighbourhood of the centre pixel—up to the complete overlap area. In this case, if the absolute value of the average of the differences is larger than a characteristic value or threshold of e.g. 0.5 or even 0.75, the mask is inverted prior to merging. Also another measure such as the average product of mask values can be evaluated, e.g. by way of correlation rather than by differences.
For using the segmentation information in motion estimation, the mask around the centre pixel—as extracted from the mask memory—may need to be re-inverted if it is not zero or not small, i.e. if it is greater than a threshold of e.g. 0.5 or even 0.75.
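A minimal sketch of the neighbourhood-based inversion test over the overlap area (the threshold value follows the text; the correlation-based measure is omitted):

```python
import numpy as np

def maybe_invert(new_mask, stored_mask, nSI, threshold=0.5):
    """Invert the new window mask before merging if, on average over the
    overlap area, it disagrees strongly with the stored segmentation."""
    overlap = nSI > 0   # pixels that have already received information
    if not np.any(overlap):
        return new_mask
    diff = np.mean(new_mask[overlap] - stored_mask[overlap])
    return 1.0 - new_mask if abs(diff) > threshold else new_mask
```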
Using original and inverted versions of the segmentation information typically holds in cases where there are just two objects moving relative to each other—a foreground and a background object.
If there are more objects partly covering each other while moving, a segmentation into more levels is necessary, or each object has its own segmentation mask memory assigned to it.
If there are small objects present in the measurement window which are surrounded by a large background object such that they do not touch each other, these small objects typically appear in the segmentation information separately and can therefore be separated from the large background object for use in motion estimation: where a motion vector is estimated for a pixel lying within such a small object, the segmentation information of the background object and the other small object(s) can be masked off so that motion of the first isolated object is estimated. This is achieved by scanning around the centre pixel at a growing distance and comparing the mask information of that pixel with its neighbour or neighbours as seen towards the centre pixel C, as shown in
Using this kind of processing, a small object around the centre pixel can be separated from another small object not connected to it, as shown in
For the motion estimation itself, the mask thus modified will just express whether or not a pixel belongs to the same object as the centre pixel, because for that object the match shall be found in motion estimation. All other pixels will be excluded from determining the cost function and it will not matter whether they belong to just one or more other objects.
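As a sketch, the effect of this outward scanning can be approximated by region growing from the centre pixel C: pixels connected to C through neighbours of similar mask value keep their segmentation information, everything else is masked off. The 4-connectivity and the similarity threshold are assumptions:

```python
from collections import deque
import numpy as np

def isolate_centre_object(mask, sim_thresh=0.25):
    """Keep only pixels connected to the centre pixel through neighbours
    of similar mask value; set all other pixels to 1 ('other object') so
    that they are excluded from the cost function."""
    h, w = mask.shape
    cy, cx = h // 2, w // 2
    keep = np.zeros((h, w), dtype=bool)
    keep[cy, cx] = True
    queue = deque([(cy, cx)])
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and not keep[ny, nx]
                    and abs(mask[ny, nx] - mask[y, x]) < sim_thresh):
                keep[ny, nx] = True
                queue.append((ny, nx))
    out = np.ones_like(mask)   # everything excluded by default
    out[keep] = mask[keep]
    return out
```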
II. Segmentation on a Finer Sampling Grid
Motion estimation performance in terms of resulting motion vectors and computational load benefits from pre-filtering and quasi-subsampling in the measurement window. Up to now segmentation was also tied to that concept. However, once the best match has been found for the present pixel of interest or grid pixel within one level of the hierarchy, segmentation can also be performed separately on a finer sampling grid, e.g. even on the original integer-pel sampling grid. I.e., one approach is to use the same measurement window size for performing segmentation, but to carry it out on the original sampling grid by evaluating every pixel. In addition, segmentation can be carried out on the non-filtered, original frames. In connection with this embodiment there is no need to adapt thresholds to the different levels of the hierarchy.
Moreover, segmentation can use its own adapted window sizes in the levels of the hierarchy. While a small measurement window can be used in motion estimation in the last levels of the hierarchy, segmentation can use a larger window in order to increase the probability of containing more image information near object boundaries, thereby improving the segmentation.
As a benefit from performing segmentation on the original sampling grid, no interpolation of segmentation information would be necessary, and the segmentation mask would always have the best spatial resolution, given by the original sampling grid.
For using the segmentation mask in motion estimation, the segmentation information of the pixels in the quasi-subsampled measurement window can be extracted from the mask memory. Pre-filtering of the mask in order to match the pre-filtering of the image signals might not be necessary, or could even have some drawback, because a mask level of zero should be maintained.
III. Use of Object Locations in Interpolation of Motion Vectors
When interpolating the vector field of one level of the hierarchy before starting the next level, the occurrence of objects with different motion can be considered. Typically a vector of a pixel on the denser new vector grid is interpolated bilinearly from the 4 or 2 neighbours of the less dense grid of the previous level of the motion estimation hierarchy. Instead, in case of four neighbours a majority decision can be taken, or the median is used.
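A minimal sketch of both alternatives for one pixel of the denser grid (hypothetical helper; the neighbour vectors and bilinear weights are supplied by the caller):

```python
import numpy as np

def interpolate_vector(neighbour_vectors, weights=None, use_median=True):
    """Interpolate a motion vector on the denser grid from up to four
    neighbours of the coarser grid: component-wise median (more robust
    near object boundaries) or bilinear weighting."""
    v = np.asarray(neighbour_vectors, dtype=float)   # shape (n, 2)
    if use_median:
        return np.median(v, axis=0)
    w = np.asarray(weights, dtype=float)[:, None]
    return np.sum(w * v, axis=0) / np.sum(w)
```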
IV. Results
Some results obtained in the whole-frame mode are shown in the following. Hierarchical motion estimation using six integer-pel levels has been performed for an HD video sequence representing global motion and containing trucks and cars on a highway, without and with DFD-based segmentation.
The following setting has been used:
in the first level of the hierarchy and ‘1’ otherwise, in order to guarantee that grid pixels for which vectors are estimated in the levels of the hierarchy lie on matching rasters;
The DFD-based segmentation leads to improved motion estimation around moving objects as shown in
The segmentation information obtained along the levels of the hierarchy by merging the segmentation information of neighbouring measurement window positions and of successive levels of the hierarchy is shown in
Along the levels of the hierarchy the segmentation information gets sharper and changes somewhat in intensity. As high DFDs in the measurement window lead to lighter mask contributions and occur especially before and behind moving objects and in moving object areas of high contrast, this holds similarly for the merged segmentation information contributed by neighbouring measurement window positions. For use in motion estimation this is sufficient as the segmentation information successfully masks off pixels that would harm motion estimation to be performed for the object or area around the centre pixel of the measurement window.
Section I.2.3 describes a processing of scanning around the centre pixel at a growing distance and comparing the mask information of that pixel with its neighbour(s) as seen towards the centre pixel. This method successfully helps to distinguish between smaller objects with different motion, like the big truck in the middle moving to the left and the smaller car in front of it moving to the right, as seen in the displacement vector x and y components shown in
Note that the images shown are provided for the purpose of comparing certain methods and are not meant to be an optimum result. E.g. the inhomogeneous vectors in the sky area can be avoided by updating a vector only in nonhomogeneous regions using threshAllowME>0 starting with the second level of the hierarchy. The effect of vectors extending to the bottom right of the centre truck can be reduced by a neighbour vector comparison.
The invention is applicable to many applications requiring motion estimation and/or object or point tracking, such as broadcast and web streaming applications, surveillance or applications of Targeted Content.
The described processing can be carried out by a single processor or electronic circuit, or by several processors or electronic circuits operating in parallel and/or operating on different parts of the complete processing.
The instructions for operating the processor or the processors according to the described processing can be stored in one or more memories. The at least one processor is configured to carry out these instructions.
Number | Date | Country | Kind |
---|---|---|---
15306269.0 | Aug 2015 | EP | regional |