This application is the U.S. national phase of the International Patent Application No. PCT/IB2009/055216 filed Oct. 13, 2009, which claims the benefit of the International Patent Application No. PCT/CN2008/072901 filed in the Chinese Receiving Office on Oct. 31, 2008, the entire content of which is incorporated herein by reference.
The invention relates in general to image processing and more specifically to image prediction.
Prediction is a statistical estimation process in which one or more random variables are estimated from observations of other random variables. It is called prediction when the variables to be estimated are in some sense associated with the “future” and the observable variables are associated with the “past”. One of the simplest and most prevalent prediction techniques is linear prediction, which consists, for instance, in predicting a vector from another vector. The most common use of prediction is the estimation of a sample of a stationary random process (i.e. a stochastic process whose joint probability distribution does not change when shifted in time or space) from observations of several prior samples. Another application of prediction, in image/video compression, is the estimation of a block of pixels from an observed “prior” block of pixels contained in a reference image (also called a forward image). In this case, each predicted image (or picture or frame) is divided into non-overlapping rectangular blocks. The motion vector of each block (i.e. the vector used for prediction, which provides an offset from the coordinates in the predicted picture to the coordinates in a reference picture) is derived using Motion Estimation (ME) in the reference picture. Then each block is predicted using Motion Compensation (MC) with reference to the corresponding block in the reference frame pointed to by the derived motion vector. Both ME and MC are methods known to the person skilled in the art. This approach helps eliminate redundant information and, consequently, fewer bits may be needed to describe the residual (i.e. the difference between the original and the predicted block). However, such an ME/MC prediction method is not the ultimate solution for predicting future frames, as it is based on the assumption that the captured moving object is undergoing translational motion, which is not always true. Besides, for the estimation of images involving non-Gaussian processes, the ME/MC technique cannot fully squeeze out all the information about the past that would help predict future frames.
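As an illustration of this conventional ME/MC baseline, the following sketch performs exhaustive block-matching motion estimation followed by simple motion compensation; the block size, search range, SAD criterion and function names are illustrative choices, not taken from the text.

```python
import numpy as np

def motion_estimate(ref, cur, block=8, search=7):
    """Exhaustive block matching: for each block of `cur`, find the offset in
    `ref` (within +/- `search` pixels) that minimises the sum of absolute
    differences (SAD). Frame dimensions are assumed multiples of `block`."""
    H, W = cur.shape
    mvs = np.zeros((H // block, W // block, 2), dtype=int)
    for by in range(0, H - block + 1, block):
        for bx in range(0, W - block + 1, block):
            blk = cur[by:by + block, bx:bx + block].astype(np.int32)
            best, best_mv = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > H or x + block > W:
                        continue
                    cand = ref[y:y + block, x:x + block].astype(np.int32)
                    sad = np.abs(blk - cand).sum()
                    if best is None or sad < best:
                        best, best_mv = sad, (dy, dx)
            mvs[by // block, bx // block] = best_mv
    return mvs

def motion_compensate(ref, mvs, block=8):
    """Build the predicted frame by copying, for each block, the reference
    block pointed to by its motion vector."""
    H, W = ref.shape
    pred = np.zeros_like(ref)
    for by in range(0, H - block + 1, block):
        for bx in range(0, W - block + 1, block):
            dy, dx = mvs[by // block, bx // block]
            pred[by:by + block, bx:bx + block] = ref[by + dy:by + dy + block,
                                                     bx + dx:bx + dx + block]
    return pred
```

The residual actually coded would then be cur - motion_compensate(ref, motion_estimate(ref, cur)).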
Today there is a need for an image prediction solution that can easily be implemented on existing communication infrastructures and that overcomes the drawbacks of the prior art.
It is an object of the present system to overcome the disadvantages of and/or make improvements over the prior art.
To that end, an embodiment of the invention proposes a method for computing a predicted frame from a first and a second reference frame, said method comprising, for each block of pixels in the predicted frame: a) defining a first block of pixels in the first reference frame collocated with a third block of pixels, which is the block of pixels in the predicted frame, b) defining a second block of pixels corresponding, in the second reference frame, to the first block of pixels along the motion vector of said first block from said first to second reference frames, c1) computing a first set of coefficients allowing a transformation of the pixels of the first block into pixels of the second block, and d1) computing pixels of the third block using the first set of coefficients and pixels from a fourth block collocated in the first reference frame with the second block of pixels.
An embodiment of the invention also relates to an interpolating device for computing a predicted frame from a first and a second reference frame of a video flow, said device being arranged to select said first and second frames from the video flow, said device being further arranged, for each block of pixels in the predicted frame, to: a) define a first block of pixels in the first reference frame collocated with a third block of pixels, which is the block of pixels in the predicted frame, b) define a second block of pixels corresponding, in the second reference frame, to the first block of pixels along the motion vector of said first block from said first to second reference frames, c1) compute a first set of coefficients allowing a transformation of the pixels of the first block into pixels of the second block, and d1) compute pixels of the third block using the first set of coefficients and pixels from a fourth block collocated in the first reference frame with the second block of pixels.
An embodiment of the invention also relates to a system for computing a predicted frame from a first and a second reference frame of a video flow, said system comprising: a transmitting device for transmitting the video flow, and an interpolating device arranged to receive the video flow from the transmitting device and to select said first and second frames from the video flow, said interpolating device being further arranged, for each block of pixels in the predicted frame, to: a) define a first block of pixels in the first reference frame collocated with a third block of pixels, which is the block of pixels in the predicted frame, b) define a second block of pixels corresponding, in the second reference frame, to the first block of pixels along a motion vector of said first block from said first to second reference frames, c1) compute a first set of coefficients allowing a transformation of the pixels of the first block into pixels of the second block, and d1) compute pixels of the third block using the first set of coefficients and pixels from a fourth block collocated in the first reference frame with the second block of pixels.
An embodiment of the invention also relates to a non-transitory computer program product providing computer executable instructions stored on a computer readable medium which, when loaded onto a data processor, cause the data processor to perform the method described above for computing a predicted frame from a first and a second reference frame.
An advantage of the proposed method is that it may adaptively exploit redundancies between successive frames to adjust the derived motion information according to the characteristics of the pixels within a local spatiotemporal area.
Another advantage of the proposed method over existing solutions is that it can adaptively tune the interpolation coefficients (used to predict pixels from existing pixels in previous frames) to match the non-stationary statistical properties of video signals. The interpolation coefficients play a critical role in the accuracy of the prediction: the more accurate the coefficients, the more reliable the predicted frames. Transmitting these coefficients may, however, impose a heavy burden in terms of bit rate for video compression. Thus, the method according to the invention proposes an algorithm for deriving more accurate coefficients for the first block by exploiting the high similarity between the same objects in adjacent frames, hence relieving the nontrivial burden of transmitting such coefficients.
Embodiments of the present invention will now be described solely by way of example and only with reference to the accompanying drawings, where like parts are provided with corresponding reference numerals, and in which:
The following are descriptions of exemplary embodiments that, when taken in conjunction with the drawings, will demonstrate the above-noted features and advantages, and introduce further ones.
In the following description, for purposes of explanation rather than limitation, specific details are set forth, such as architecture, interfaces, techniques, devices, etc., for illustration. However, it will be apparent to those of ordinary skill in the art that other embodiments that depart from these details would still be understood to be within the scope of the appended claims.
Moreover, for the purpose of clarity, detailed descriptions of well-known devices, systems, and methods are omitted so as not to obscure the description of the present system. Furthermore, routers, servers, nodes, base stations, gateways or other entities in a telecommunication network are not detailed as their implementation is beyond the scope of the present system and method.
In addition, it should be expressly understood that the drawings are included for illustrative purposes and do not represent the scope of the present system.
The method according to the invention proposes a model for predicting an image (called the predicted or current image/frame) based on observations made in previous images. In the method according to the invention, the prediction is performed in units of blocks of pixels and may be performed for each block of the predicted image. By extension, an image may itself be regarded as a block (of pixels). By collocated blocks in a first image and a second image, one may understand blocks that are in the exact same location in the two images. For example, in
In this illustrative embodiment, the method comprises, for each block of pixels to be predicted in the predicted frame 260, an act 220 of defining a first block of pixels in the first reference frame collocated with the block of pixels (named the third block) to be predicted in the predicted frame. Then, act 230 defines a second block of pixels corresponding, in the second reference frame, to the first collocated block of pixels along the motion vector of said collocated block from said first to second reference frames. A motion vector is a vector used for inter prediction that provides an offset from the coordinates in the predicted picture to the coordinates in a reference picture. It is used to represent a macroblock or a pixel in the predicted picture based on the position of this macroblock or pixel (or a similar one) in the reference picture. As the first and second reference pictures are known pictures, techniques readily available to the person skilled in the art may be used to derive the motion vector of the collocated block from the first to the second reference frame, and consequently the second block can be defined. Subsequently, a first set of coefficients can be computed in act 240 to allow the transformation of the pixels of the collocated block into pixels of the second block. Finally, act 250 computes the pixels of the predicted frame block using the first set of coefficients and pixels from a fourth block collocated in the first reference frame with the second block of pixels.
In the method according to the invention, the block of pixels to be predicted is derived from the fourth block of pixels in the first reference frame using the first set of coefficients. Since the second and fourth blocks are collocated in the second and first reference frames respectively, using the fourth block to derive the block of pixels to be predicted implies that the motion vector used for the definition of the second block is the same as the motion vector establishing the relationship between the fourth block and the block of pixels to be predicted.
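A minimal end-to-end sketch of these acts for one block is given below, assuming integer-pixel motion, an already derived motion vector (e.g. from a block-matching search such as the one sketched earlier), a least-squares fit of the first set of coefficients, and a square interpolation window of radius r as introduced below; all function and parameter names are illustrative, not taken from the text.

```python
import numpy as np

def neighborhood_matrix(frame, top, left, block, r):
    """One row per pixel of the block located at (top, left); each row holds
    the (2r+1)^2 neighbours of that pixel, used as interpolation support.
    Assumes the window stays inside the frame."""
    rows = []
    for m in range(top, top + block):
        for n in range(left, left + block):
            patch = frame[m - r:m + r + 1, n - r:n + r + 1]
            rows.append(patch.astype(np.float64).ravel())
    return np.array(rows)

def predict_third_block(X_prev, X_prev2, k, l, mv, block=8, r=1):
    """Acts 220-250 for one block: fit coefficients between the first and
    second (known) blocks, then reuse them on the fourth block to predict
    the third block B_t(k,l)."""
    dy, dx = mv  # motion vector of the first block from X_{t-1} to X_{t-2}

    # Act 220: first block, collocated with the block to be predicted.
    B1 = X_prev[k:k + block, l:l + block].astype(np.float64)

    # Acts 230/240: express each pixel of the first block as a linear
    # combination of the (2r+1)^2 neighbours of the corresponding pixel of
    # the second block in X_{t-2}; solve for the coefficients by least squares.
    A = neighborhood_matrix(X_prev2, k + dy, l + dx, block, r)
    alpha, *_ = np.linalg.lstsq(A, B1.ravel(), rcond=None)

    # Act 250: apply the same coefficients to the fourth block, collocated in
    # X_{t-1} with the second block, to predict the third block B_t(k,l).
    A4 = neighborhood_matrix(X_prev, k + dy, l + dx, block, r)
    return (A4 @ alpha).reshape(block, block)
```

Calling predict_third_block for every block position of the predicted frame then yields the complete predicted frame Y_t.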
As shown in the corresponding figure, each pixel of the block to be predicted may be expressed as a linear combination of the pixels within a spatial neighborhood of the corresponding block in the reference frame, weighted by a set of interpolation coefficients (equation (1)).
Interpolation coefficients may be derived using methods known to the person skilled in the art, such as the Mean Squared Error (MSE) criterion and the Least Mean Square (LMS) method, as explained further below.
The radius r of the interpolation set or interpolation filter (i.e. set of interpolation coefficients for the square spatial neighborhood 125) may be used to define the size of the interpolation filter as: (2r+1)×(2r+1). For instance in
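A plausible general form of such an interpolation, assuming a square window of radius r centred on the pixel of the reference block that corresponds to the pixel being predicted, and writing α_{i,j} for the interpolation coefficients, is:

$$\hat{p}(m,n) \;=\; \sum_{i=-r}^{r}\sum_{j=-r}^{r} \alpha_{i,j}\; q(m+i,\ n+j)$$

where q(·,·) denotes pixels of the reference block and p̂(m,n) the interpolated pixel; for r = 1 this is a 3×3 window with nine coefficients.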
A first block B_{t-1}(k,l) 212 in the first reference frame X_{t-1} 211 is defined as being the collocated block of a block of pixels (third block) B_t(k,l) 201 to be predicted in the predicted frame Y_t 205.
As the first and the second reference frames are both known and defined, existing methods known to the person skilled in the art allow defining a second block B_{t-2}(k̃,l̃) along the motion vector ν_{t-1,t-2}(k,l) of the collocated (or first) block B_{t-1}(k,l) 212 from the first reference frame X_{t-1} to the second reference frame X_{t-2}.
A first set of interpolation coefficients may thus be defined as described in
The collocated block in the first reference frame B_{t-1}(k̃,l̃) 241 (also called the fourth block in reference to
A block of pixels to be predicted B_t(k,l) is selected in the predicted frame Y_t in an act 300. A first block B_{t-1}(k,l) in X_{t-1} collocated with the block of pixels to be predicted in Y_t is then defined in act 310. As X_{t-1} and X_{t-2} are both known and defined, a second block B_{t-2}(k̃,l̃) corresponding in X_{t-2} to B_{t-1}(k,l) may thus be defined along the motion vector ν_{t-1,t-2}(k,l) of the collocated block B_{t-1}(k,l) 212 from the first reference frame X_{t-1} to the second reference frame X_{t-2} in act 320. A fourth block B_{t-1}(k̃,l̃) in X_{t-1}, collocated with B_{t-2}(k̃,l̃), is defined in act 330.
Applying the method described above, pixels in B_{t-1}(k,l) may be approximated as a linear combination of pixels of B_{t-2}(k̃,l̃), weighted by the first set of interpolation coefficients α_{i,j} (equation (2)).
The pixel approximation depends on the definition of the interpolation coefficients; indeed, these should be chosen to be the optimum ones.
In equation (2), the pixels in X_{t-2} are known. The pixels in Y_{t-1} are also known, so the pixels approximated by equation (2) may be compared to the corresponding real pixels in Y_{t-1} in order to derive the interpolation coefficients α_{i,j}. In this illustrative embodiment of the method according to the invention, this comparison is performed, as mentioned above, using the mean squared error (MSE) criterion to define the resulting mean squared error:
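A plausible form of this error measure, assuming the window notation above, where the sum runs over the pixels (m,n) of the block and (m̃,ñ) denotes the position in X_{t-2} pointed to by the motion vector ν_{t-1,t-2}(k,l), is:

$$\mathrm{MSE}(\alpha) \;=\; \sum_{(m,n)\in B_{t-1}(k,l)} \Big( Y_{t-1}(m,n) \;-\; \sum_{i=-r}^{r}\sum_{j=-r}^{r} \alpha_{i,j}\, X_{t-2}(\tilde{m}+i,\ \tilde{n}+j) \Big)^{2}$$

The optimum coefficients are those minimising this quantity.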
The MSE as a performance criterion may be viewed as a measure of how much the energy of the signal is reduced by removing from it the information that is predictable from the observations. Since the goal of a predictor is to remove this predictable information, a better predictor corresponds to a smaller MSE.
The Least-Mean-Square (LMS) method may then be used to derive the optimum interpolation coefficients.
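A minimal sketch of such a derivation is shown below as a standard LMS (Widrow-Hoff) update over the pixels of the block; the step size mu, the number of passes and the raster-scan ordering are illustrative assumptions, and a direct least-squares solve (as in the earlier sketch) is an equally valid alternative.

```python
import numpy as np

def lms_fit(target_block, source_frame, top, left, r=1, mu=1e-7, epochs=20):
    """Derive the (2r+1)^2 interpolation coefficients so that each pixel of
    `target_block` is approximated by a weighted sum of the neighbours of the
    corresponding pixel of the source block at (top, left) in `source_frame`.
    Assumes the interpolation window stays inside the frame."""
    w = np.zeros((2 * r + 1) ** 2)                 # interpolation coefficients
    height, width = target_block.shape
    for _ in range(epochs):                        # a few passes over the block
        for m in range(height):
            for n in range(width):
                y, x = top + m, left + n
                patch = source_frame[y - r:y + r + 1,
                                     x - r:x + r + 1].astype(np.float64).ravel()
                err = float(target_block[m, n]) - patch @ w   # prediction error
                w += mu * err * patch                          # Widrow-Hoff update
    return w
```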
The assumption is then made in act 345 that pixels in the block to be predicted may be approximated from pixels in the fourth block using the same first set of coefficients, as there are high redundancies between the two reference frames and the predicted frame. Assuming the optimum interpolation coefficients derived using equation (3) are α_{i,j}, the prediction of B_t(k,l) may then be made as follows, using the same coefficients and equation (1) as previously explained:
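Under the same assumptions, a plausible form of this prediction, with (m̃,ñ) now indexing the corresponding pixel of the fourth block in X_{t-1}, is:

$$\hat{B}_{t}(m,n) \;=\; \sum_{i=-r}^{r}\sum_{j=-r}^{r} \alpha_{i,j}\, X_{t-1}(\tilde{m}+i,\ \tilde{n}+j), \qquad (m,n)\in B_{t}(k,l)$$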
wherein α_{i,j} are the interpolation coefficients obtained in equations (2) and (3).
It can be emphasized that the closer the frames are in the stream of frames, the higher the redundancy and thus the better this assumption holds. Equivalently, this amounts to saying that the same motion vector, i.e.:
ν_{t,t-1}(k,l) = ν_{t-1,t-2}(k,l),
is used to derive a prediction of the pixels in B_t(k,l) from the pixels in B_{t-1}(k̃,l̃) (act 350) as was used to derive a prediction of the pixels in B_{t-1}(k,l) from the pixels in B_{t-2}(k̃,l̃).
In an additional embodiment of the present invention, in reference to
Indeed, symmetrically, the pixels in B_{t-2}(k̃,l̃) may be approximated, or expressed, as a linear combination of the pixels of B_{t-1}(k,l) using a second set of interpolation coefficients β_{i,j} in act 245 (equation (5)).
It may then be assumed that the same second set of coefficients can be used to approximate or express the pixels in B_{t-1}(k̃,l̃) from the pixels in B_t(k,l) in act 255 (making again the assumption that there are high redundancies between the reference and predicted frames, e.g. when they are chosen as adjacent frames in a stream of frames).
However, here, as the pixels in B_t(k,l) are unknown (being the ones to be predicted), they cannot be expressed in this way as linear combinations of the pixels in B_{t-1}(k̃,l̃). But, as the mathematical expression is a linear combination, the pixels in B_t(k,l) may be expressed from the pixels in B_{t-1}(k̃,l̃) using the symmetric counterparts of the interpolation coefficients of the second set:
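Under the same window notation, a plausible form of this backward-direction expression is given below; how the “reverse” coefficients β′_{i,j} relate to the second-set coefficients β_{i,j} of equation (5) is an assumption here (e.g. a spatial reversal β′_{i,j} = β_{−i,−j}):

$$\hat{B}_{t}(m,n) \;=\; \sum_{i=-r}^{r}\sum_{j=-r}^{r} \beta'_{i,j}\, X_{t-1}(\tilde{m}+i,\ \tilde{n}+j)$$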
where β′_{i,j} is the reverse coefficient corresponding to the one derived in equation (5).
Eventually, with this optional embodiment of the method according to the invention, two sets of interpolation coefficients are derived, yielding two expressions/approximations of the pixels in B_t(k,l) from the pixels in B_{t-1}(k̃,l̃). An optimum prediction may thus be derived, for each pixel, from these two approximations of the same pixel by taking the average, or mean, of the two:
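Combining the two sketches above, this average can plausibly be written as:

$$\hat{B}_{t}(m,n) \;=\; \frac{1}{2}\sum_{i=-r}^{r}\sum_{j=-r}^{r}\big(\alpha_{i,j}+\beta'_{i,j}\big)\, X_{t-1}(\tilde{m}+i,\ \tilde{n}+j)$$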
Indeed, equations (4) and (8) approximate the same pixel at (m,n) in frame Y_t from two different directions (forward and backward); this implies that α_{i,j} ≈ β′_{i,j}, which allows the accuracy of the prediction to be improved.
Practically, in the case of, for example, an encoder/decoder system, the method according to the invention is based on the fact that the blocks of pixels in the first and second reference frames are available/known to both the encoder and the decoder, thus allowing the predicted frame to be obtained using data derived from these reference frames. The present method may also be implemented using an interpolating device for computing the predicted frame from first and second reference frames in a video flow. The encoding, decoding or interpolating devices may typically be electronic devices comprising a processor arranged to load executable instructions stored on a computer readable medium, causing said processor to perform the present method. The interpolating device may also be an encoder/decoder that is part of a system for computing the predicted frame from first and second reference frames in a video flow, the system comprising a transmitting device for transmitting the video flow comprising the reference frames to the interpolating device for further computing of the predicted frame.
In the illustrative embodiment described above, the motion vector is of integer-pixel accuracy. However, the method according to the present invention may also achieve sub-pixel accuracy, relying on the existing/known quarter-pixel interpolation method.
In such a case of sub-pixel accuracy, the method according to the invention may be applied, as shown in
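One hedged way to accommodate such sub-pixel accuracy (an implementation choice assumed here, not a scheme spelled out in the text) is to upsample the reference frames first, so that fractional motion vectors map to integer positions in the upsampled frames; the block-based procedure sketched earlier then applies unchanged. The snippet below uses simple bilinear interpolation, whereas standardized codecs use longer interpolation filters.

```python
import numpy as np

def upsample2_bilinear(frame):
    """Double the resolution of `frame` with bilinear interpolation so that
    half-pixel positions become integer positions (illustrative only)."""
    f = frame.astype(np.float64)
    H, W = f.shape
    up = np.zeros((2 * H - 1, 2 * W - 1))
    up[::2, ::2] = f                                  # integer positions
    up[1::2, ::2] = 0.5 * (f[:-1, :] + f[1:, :])      # half-pel, vertical
    up[::2, 1::2] = 0.5 * (f[:, :-1] + f[:, 1:])      # half-pel, horizontal
    up[1::2, 1::2] = 0.25 * (f[:-1, :-1] + f[:-1, 1:] +
                             f[1:, :-1] + f[1:, 1:])  # half-pel, diagonal
    return up
```

A half-pixel motion vector (dy, dx) at the original resolution then corresponds to the integer offset (2·dy, 2·dx) in the upsampled frame.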
To verify the prediction efficiency of the proposed model, an example implementation is made using Distributed Video Coding (DVC) extrapolation. In DVC, the final rebuilt Wyner-Ziv (WZ) frame is composed of the Side Information (SI) plus the errors corrected by the parity bits. Consequently, improving the SI prediction constitutes one of the most critical aspects of improving DVC compression efficiency: if the SI is of high quality, the energy of the residual information needed to correct the error between the SI and the original frame decreases, which reduces the number of parity bits to be transmitted and thus the bit rate. Since the method according to the invention is suitable for predicting the current frame based solely on information available from the past, it may be implemented in the extrapolation application in DVC and compared with existing extrapolation-based approaches. In DVC, since the original pixels are not available at the decoder side, the ME is performed on the past frames. For example, as in
Number | Date | Country | Kind
---|---|---|---
PCT/CN2008/072901 | Oct 2008 | WO | international
Filing Document | Filing Date | Country | Kind | 371c Date
---|---|---|---|---
PCT/IB2009/055216 | 10/20/2009 | WO | 00 | 4/26/2011
Publishing Document | Publishing Date | Country | Kind
---|---|---|---
WO2010/049916 | 5/6/2010 | WO | A
Number | Date | Country
---|---|---
20110206129 A1 | Aug 2011 | US