This application claims the benefit, under 35 U.S.C. §365, of International Application PCT/EP2012/062196, filed Jun. 25, 2012, which was published in accordance with PCT Article 21(2) on Jan. 24, 2013 in English and which claims the benefit of European patent application No. 11305937.2, filed Jul. 19, 2011.
The invention relates to a method and to an apparatus for reframing and encoding an original video signal, wherein the reframing window position and/or size is adapted so as to reduce the encoding cost of the reframed video signal.
Reframing is used to re-size an image or video content, e.g. for displaying video signals with a given aspect ratio on a display having a different aspect ratio. For example, High Definition (HD) video content might not be well suited for display on a small portable device.
EP 1748385 A2 discloses dynamic reframing based on a human visual attention model, in which source video content is appropriately cropped in order to keep the region of interest. The output signal may be encoded and transmitted via a network.
C. Chamaret, O. LeMeur, “Attention-based video reframing: validation using eye-tracking”, 19th International Conference on Pattern Recognition ICPR'08, 8-11 Dec. 2008, Tampa, Fla., USA, also describes reframing applications.
O. LeMeur, P. LeCallet and D. Barba, “Predicting visual fixations on video based on low-level visual features”, Vision Research, vol. 47, no. 19, pp. 2483-2498, September 2007, describes the calculation of a dynamic saliency map, based on a visual attention model.
It appears that no known reframing processing addresses the bit rate and distortion of the output cropped and encoded video signal. For example, a cropping window may track a region of interest without considering the coding complexity of the encoded video signal. This can result in repeated zooming in and out and panning, and thereby in a high coding cost for newly appearing areas. When the final reframed video signal is encoded, for example by using an H.264/AVC encoder, this can result in an increase of the bit rate and/or a decrease of video quality.
A problem to be solved by the invention is to provide a video source signal reframing wherein the cropping window position and size takes into account the rate/distortion of an encoded output video signal containing that cropped window. This problem is solved by the method disclosed in claim 1. An apparatus that utilises this method is disclosed in claim 2.
According to the invention, the cropping window parameters (position and size over time) are constrained in order to optimise the rate/distortion of the encoded output video signal. An initial reframing is improved by considering the video coding context and then by taking into account the coding efficiency cost induced if the reframed video sequence is encoded.
In principle, the inventive method is suited for reframing an original video signal, followed by an encoding of the reframed video signal, said method including the steps:
In principle the inventive apparatus is suited for reframing an original video signal, followed by an encoding of the reframed video signal, said apparatus including:
Advantageous additional embodiments of the invention are disclosed in the respective dependent claims.
Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in:
In
To optimise the rate/distortion of the output video signal encoding, the cropping window follows some rules:
Following are several complementary embodiments that address bit rate reduction of the encoded output reframed video sequence.
A. Improving Temporal Consistency
This section exploits the fact that a temporally more stable image will in principle need fewer bits for encoding.
A.1 Simple Temporal Constraint
A Kalman filter is used to control the position and the size of the cropping window. In a first implementation the noise covariance matrix Q is constrained. The goal of the Kalman filter is to smooth the variations of the raw values provided by the attention model 22, i.e. the variations of the content of the saliency map over time. In the Kalman model, the raw values given by the attention model 22 are treated as noisy measurements from which the optimal cropping parameters (centre and size of the window) are estimated. The current state $x_k$ of the parameters of the cropping window is defined as $x_k = A_k x_{k-1} + B_k u_k + w_k$, where $A_k$ is the state transition model applied to the previous state, $B_k$ is the control input model, $u_k$ is the control vector, and $w_k$ is the state noise with $w_k \sim N(0, Q_k)$, where $N$ denotes the normal distribution with zero mean and covariance $Q_k$.
At time k an observation or measurement $z_k$ of the true state $x_k$ is made and is defined as $z_k = H_k x_k + v_k$,
where $H_k$ is the observation model which maps the true state space into the observed space and $v_k$ is the observation noise with $v_k \sim N(0, R_k)$, which is assumed to be zero-mean Gaussian white noise with covariance $R_k$.
In the reframing application, the Kalman filter is used as follows. The state xk defines the position of the centre of the cropping window and its size. It is defined as:
For each image, the Kalman filter is applied in two steps: prediction phase and update phase.
In the prediction phase, the current state is estimated from the previous state and the state transition model. In this particular case, the Kalman filter is used to stabilise the cropping window parameters. Consequently the predicted state $\hat{x}_k$ is mostly a copy of the previous state:
The update phase is a correction of the prediction using the noisy measure:
$\tilde{y}_k = z_k - \hat{x}_k$ is the measurement residual, where $z_k$ are the raw window parameters given by the attention model, and $H_k$ is the identity matrix.
$x_k = \hat{x}_k + K_k \tilde{y}_k$, $P_k = (I - K_k)\hat{P}_{k-1}$,
wherein
is the matrix gain that minimises the a posteriori error covariance.
As an initial estimate the centre of the screen can be chosen. Q and R are constant diagonal matrices that define the state and measurement noise covariance, respectively. In this context, the simplest way to implement a temporal constraint is to lower the values in the state noise covariance matrix Q.
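The prediction and update phases of this section can be sketched as follows. This is only an illustrative sketch, not the claimed implementation: it assumes identity transition and observation models, no control input, and diagonal Q, R and P, so that each of the four window parameters can be filtered independently. The gain is taken in the textbook Kalman form (predicted variance divided by predicted variance plus measurement variance), and the function name, the noise values and the sample measurements are hypothetical.

```python
def smooth_window(measurements, q, r, initial_state, initial_p=1.0):
    """Smooth raw cropping-window parameters with a per-component Kalman filter.

    measurements: list of (cx, cy, w, h) tuples, e.g. from an attention model.
    q, r: scalar state and measurement noise variances (diagonal Q and R).
    """
    state = list(initial_state)
    p = [initial_p] * len(state)
    smoothed = []
    for z in measurements:
        for i, z_i in enumerate(z):
            # Prediction: the state is copied from the previous picture and
            # its uncertainty grows by the state noise variance q.
            x_pred = state[i]
            p_pred = p[i] + q
            # Update: correct the prediction with the noisy measurement.
            gain = p_pred / (p_pred + r)
            state[i] = x_pred + gain * (z_i - x_pred)
            p[i] = (1.0 - gain) * p_pred
        smoothed.append(tuple(state))
    return smoothed


raw = [(960, 540, 640, 360), (1000, 540, 640, 360), (940, 560, 700, 360)]
start = (960, 540, 640, 360)  # initial estimate, e.g. centred on the screen
loose = smooth_window(raw, q=10.0, r=1.0, initial_state=start)
tight = smooth_window(raw, q=0.01, r=1.0, initial_state=start)
print(loose[1][0], tight[1][0])
```

With the lower value of q the filtered centre moves much less towards the second measurement, which is the temporal constraint described above.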
This solution does not consider video content with moving background (or non-zero dominant motion). The next section considers this more complex case.
A.2 Dominant Motion Constraint
In case of video content with non-zero dominant motion, i.e. with background translation and/or zoom, the cropping window follows the dominant motion in order to improve coding efficiency: this avoids newly appearing blocks at the border of the image and avoids size changes of objects, which improves inter-picture prediction.
In the visual attention model, a motion estimator is used that computes the parameters of the dominant motion with a 6-parameter affine model. In the context previously defined, these parameters are used to derive background translation and zoom.
As before, the Kalman filter state is defined as:
This time the command is non-null and is used to follow dominant motion:
Again, for each image the Kalman filter is applied in two steps: prediction phase and update phase.
In the prediction phase, the prediction step estimates the current state with the previous state and the state transition model. The cropping window parameters are allowed to change only according to the dominant motion:
The update phase is a correction of the prediction using the noisy measurement:
$\tilde{y}_k = z_k - \hat{x}_k$ is the measurement residual, where $z_k$ are the raw window parameters given by the attention model, and $H_k$ is the identity matrix.
is the matrix gain that minimises the a posteriori error covariance.
$x_k = \hat{x}_k + K_k \tilde{y}_k$, $P_k = (I - K_k)\hat{P}_{k-1}$
As an initial estimate the centre of the screen can be chosen. Q and R are constant diagonal matrices that define the state and measurement noise covariance, respectively. Here again the state noise covariance matrix Q defines the relation between the model and the actual output. It also takes into account the noise in the estimation of the dominant motion parameters. If Q has low values, the output is strongly constrained to the model; if Q has high values, the output follows the attention model output more quickly and may exhibit large variations due to noise.
The output of the dominant motion uk can also be integrated into the Kalman filter for improving temporal consistency.
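A possible shape of the dominant-motion variant is sketched below. The control vector is an assumption made for illustration only: the centre is shifted by the background translation and the size is scaled by the background zoom factor, both taken as already derived from the affine model. The function names, the diagonal-covariance simplification and the numeric values are hypothetical.

```python
def predict_with_dominant_motion(state, p, q, translation, zoom):
    """Prediction: previous state plus a control term following dominant motion."""
    cx, cy, w, h = state
    tx, ty = translation
    # Hypothetical control vector: shift the centre by the background
    # translation and scale the window size by the background zoom factor.
    control = (tx, ty, (zoom - 1.0) * w, (zoom - 1.0) * h)
    predicted = tuple(s + u for s, u in zip(state, control))
    predicted_p = tuple(p_i + q for p_i in p)
    return predicted, predicted_p


def update(predicted, predicted_p, measurement, r):
    """Update with identity observation model: blend prediction and measurement."""
    state, p = [], []
    for x_hat, p_hat, z in zip(predicted, predicted_p, measurement):
        gain = p_hat / (p_hat + r)
        state.append(x_hat + gain * (z - x_hat))
        p.append((1.0 - gain) * p_hat)
    return tuple(state), tuple(p)


previous = (960, 540, 640, 360)
predicted, predicted_p = predict_with_dominant_motion(
    previous, p=(1.0,) * 4, q=0.01, translation=(8, 0), zoom=1.0
)
# The attention model still reports the old position; with a small q and a
# large r the output stays close to the motion-compensated prediction.
new_state, new_p = update(predicted, predicted_p, (960, 540, 640, 360), r=100.0)
print(predicted, new_state)
```

In this sketch a background panning of 8 pixels per picture moves the window by almost exactly 8 pixels, so that the window content stays aligned with the background.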
B. Constraining by Macro-Block Cost
It is assumed that the original video sequence 11, 21 comes with encoding cost values if it is a compressed video sequence. As is well known, the coding cost of a given macro-block is represented by the number of bits required to encode that macro-block using a current quantisation parameter q. According to the invention, these input sequence coding costs are used for constraining the reframing. The processing defined in the following can be used in addition to, or independently from, the processing described in section A.
B.1 Constrained by Overall Image Cost
The state covariance noise matrix Q of the Kalman filter can be derived from the overall cost of the picture sequence. If the cost of the input sequence is low, it can be predicted that the cost of the cropped picture will also be low, and as a consequence the constraints for reducing the cost of the cropped sequence can be lowered.
As an example, the matrix Q can be defined as Q=I·(σ−λ·cost), where I is the identity matrix, σ is a constant, λ is a weighting parameter that gives more or less weight to the cost, and cost is the coding cost of the sequence in megabytes per second (MB/s).
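The example definition of Q can be written as a small helper. This is a sketch under stated assumptions: the lower bound `floor` is not part of the formula above and is added here only to keep the variance positive for very high costs, and the matrix size of 4 assumes a state made of window centre and size. All names and numeric values are hypothetical.

```python
def state_noise_variance(cost, sigma, lam, floor=1e-6):
    """Diagonal entry of Q = I * (sigma - lam * cost).

    floor is a hypothetical safeguard that keeps the variance positive.
    """
    return max(sigma - lam * cost, floor)


def state_noise_matrix(cost, sigma, lam, size=4):
    """Build the diagonal matrix Q as a list of rows."""
    value = state_noise_variance(cost, sigma, lam)
    return [[value if i == j else 0.0 for j in range(size)] for i in range(size)]


cheap = state_noise_variance(cost=0.5, sigma=10.0, lam=1.5)
costly = state_noise_variance(cost=2.0, sigma=10.0, lam=1.5)
print(cheap, costly)
```

A cheaper input sequence yields larger values in Q, i.e. a weaker constraint on the cropping window.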
B.2 Constraining Window Enlargement with Macro-Block Cost Map
This implementation deals with the aspect ratio step/stage 233 described in
The aspect ratio AR is the ratio between the width and the height of the original video signal 11, 21. The anisotropic extension refines the cropping window size by extending the cropping window CWiSM(xSM,ySM,wSM,hSM) in a direction depending on the current aspect ratio RSM, where SM refers to the saliency map. The extension is applied either to the width or to the height in order to achieve the targeted aspect ratio RTG.
is the aspect ratio resulting from the extraction from the saliency map, and
is the target aspect ratio.
If RTG>RSM, horizontal extension is performed (on the width of the current rectangle) else vertical extension is performed (on the height of the current rectangle).
Assuming a horizontal extension (for vertical extension in parentheses), one can define:
Once the side of extension is defined, there are still several ways to extend the window. In other words, dright and dleft may be computed in different ways. In the following it is assumed that the width wSM is to be extended to reach the final aspect ratio. The extension may be applied entirely to the left side, such that dleft=dw and dright=0, entirely to the right side, such that dleft=0 and dright=dw, or to both sides in the same proportion, such that
Such solutions are not optimal from a content point of view. Therefore, in the prior art a finer analysis of the saliency map was carried out to favour one side or the other one.
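The choice of the extension direction and the simple ways of distributing the extension can be sketched as follows. The sketch assumes that the extended width equals the target aspect ratio multiplied by the current height (and correspondingly for the height); the function names and the `mode` labels are hypothetical.

```python
def extension_amount(w_sm, h_sm, target_ratio):
    """Return (direction, amount) needed to reach target_ratio."""
    current_ratio = w_sm / h_sm
    if target_ratio > current_ratio:
        # The window is too narrow for the target: extend the width.
        return "horizontal", target_ratio * h_sm - w_sm
    # The window is too wide (or already matching): extend the height.
    return "vertical", w_sm / target_ratio - h_sm


def split_extension(amount, mode="both"):
    """Distribute the extension over the two sides (left/right or top/bottom)."""
    if mode == "first":
        return amount, 0.0
    if mode == "second":
        return 0.0, amount
    return amount / 2.0, amount / 2.0


direction, d_w = extension_amount(400, 300, 16 / 9)
d_left, d_right = split_extension(d_w, mode="both")
print(direction, d_w, d_left, d_right)
```

None of these three distributions looks at the picture content, which is the limitation noted above.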
According to the invention a new criterion is used for the choice of the extended direction: the coding cost based on a macro-block coding efficiency cost map for a current picture, which is depicted in
dright and dleft should be found such that
The bits cost Crightmax and Cleftmax are computed by considering a full extension to the left (dleft=wAR−wSM and dright=0) and fully to the right (dright=wAR−wSM and dleft=0), where
can be defined. Once the saliency quantities available on each side are known, one can estimate the extensions dright and dleft on each direction, when using the equation (1):
B.3 Predefined Window Location Chosen by Macro-Block Cost Map
Another way to constrain the location of the cropping window box is to compute only the cost corresponding to several reachable cropping windows in the neighbourhood, as depicted in
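One way to evaluate a few reachable windows on a macro-block cost map is sketched below. Selecting the candidate with the lowest summed cost is an assumption made for this illustration, as are the function names, the one-macro-block step and the toy cost map; window coordinates are expressed in macro-block units.

```python
def window_cost(cost_map, left, top, width, height):
    """Sum the macro-block costs covered by a window (units: macro-blocks)."""
    return sum(
        cost_map[row][col]
        for row in range(top, top + height)
        for col in range(left, left + width)
    )


def cheapest_neighbour(cost_map, left, top, width, height, step=1):
    """Evaluate the current window and its shifted neighbours; return the cheapest."""
    rows, cols = len(cost_map), len(cost_map[0])
    best = None
    for dy in (-step, 0, step):
        for dx in (-step, 0, step):
            x, y = left + dx, top + dy
            if x < 0 or y < 0 or x + width > cols or y + height > rows:
                continue  # candidate falls outside the picture
            cost = window_cost(cost_map, x, y, width, height)
            if best is None or cost < best[0]:
                best = (cost, x, y)
    return best


cost_map = [
    [9, 9, 9, 9],
    [9, 1, 1, 9],
    [9, 1, 1, 2],
    [9, 9, 2, 2],
]
best = cheapest_neighbour(cost_map, left=2, top=2, width=2, height=2)
print(best)
```

Only nine candidate positions are evaluated here, so the search stays cheap compared with an exhaustive scan of the picture.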
C. Other Features
C.1 Constraint at the Saliency Map Level
Another interesting embodiment is to merge the saliency map with the macro-block cost map, such that an expensive coding cost macro-block decreases its corresponding saliency in the final saliency map. Thereby the potential impact of an expensive macro-block is decreased in the determination of the final cropping window location.
One way to merge the two maps is to apply the following processing, called Coherent Normalisation, Sum plus Product (CNSP):
SMfinal=NC(SM)+NC(MBinv)+(1+NC(SM))·(1+NC(MBinv)),
where MB is a macro-block coding efficiency cost map value in the range 0 . . . 255, MBinv=255−MB, SM is a saliency map value, and NC is the normalisation operator driven by a priori knowledge. Instead of using the global maximum of each map, this operator uses an empirical value.
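The CNSP formula can be sketched per map value as follows. The normalisation shown here (division by an empirical maximum, then clipping to the range 0 to 1) is an assumption about the NC operator, and the default empirical maxima of 255 as well as the function names are hypothetical.

```python
def normalise(value, empirical_max):
    """NC operator sketch: scale by an empirical maximum and clip to [0, 1]."""
    return min(max(value / empirical_max, 0.0), 1.0)


def cnsp(saliency, mb_cost, saliency_max=255.0, cost_max=255.0):
    """Merge one saliency value with one macro-block cost value (0..255)."""
    mb_inv = 255 - mb_cost  # expensive macro-blocks get a low inverted cost
    s = normalise(saliency, saliency_max)
    c = normalise(mb_inv, cost_max)
    return s + c + (1.0 + s) * (1.0 + c)


cheap_block = cnsp(saliency=255, mb_cost=0)
expensive_block = cnsp(saliency=255, mb_cost=255)
print(cheap_block, expensive_block)
```

For the same saliency, the expensive macro-block ends up with a lower merged value, so it weighs less in the determination of the final cropping window location.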
C.2 Cropping Window Displacement Constraint by Encoder Architecture
Some simplifications can be performed to adapt the cropping window to the encoder architecture, but also to improve coding efficiency in some cases:
In addition to such adaptations of the re-framing process for the encoding, it is also possible to include the reframing process within the encoding loop. For instance, the cropping window can be computed during the encoding of a frame such that the encoding and the re-framing are jointly optimised instead of being performed as a pre-processing. There are several advantages in doing that:
Number | Date | Country | Kind |
---|---|---|---|
11305937 | Jul 2011 | EP | regional |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2012/062196 | 6/25/2012 | WO | 00 | 1/17/2014 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2013/010751 | 1/24/2013 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
8160150 | Moore | Apr 2012 | B2 |
20050025387 | Luo | Feb 2005 | A1 |
20070230565 | Tourapis | Oct 2007 | A1 |
20080212897 | Le Meur et al. | Sep 2008 | A1 |
20090153730 | Knee et al. | Jun 2009 | A1 |
20110200302 | Hattori | Aug 2011 | A1 |
20130050574 | Lu | Feb 2013 | A1 |
Number | Date | Country |
---|---|---|
101620731 | Jan 2010 | CN |
1215626 | Jun 2002 | EP |
1679659 | Jul 2006 | EP |
1748385 | Jan 2007 | EP |
2071511 | Jun 2009 | EP |
2141658 | Jan 2010 | EP |
EP 1679659 | Jul 2006 | FR |
EP 1748385 | Jan 2007 | FR |
2370438 | Jun 2002 | GB |
WO2009024966 | Feb 2009 | WO |
WO2009115101 | Sep 2009 | WO |
Entry |
---|
Chamaret et al. “Attention-based video reframing: Validation using eye-tracking” Pattern Recognition, 2008. ICPR 2008. 19th International Conference on Year: 2008. |
Deselaers et al., “Pan, Zoom, Scan—Time-Coherent, Trained Automatic Video Cropping”, 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, USA, Jun. 23, 2008, pp. 1-8. |
Herranz et al., “Adapting Surveillance Video to Small Displays Via Object-Based Cropping”, 8th International Workshop in Image Analysis for Multimedia Interactive Services, Santorini, Greece, Jun. 6, 2007, pp. 1-4. |
Zhang et al., “Zoomed Object Segmentation From Dynamic Scene Containing a Door”, 10th IEEE International Conference on High Performance Computing and Communications, Dalian, China, Sep. 25, 2008, pp. 807-812. |
Bongwon Suh et al., “Automatic thumbnail cropping and its effectiveness”, Proceedings of the 16th Annual ACM Symposium on User Interface Software and Technology, Vancouver, Canada, Nov. 2-5, 2003; vol. 5, No. 2, Jan. 1, 2003, pp. 95-104. |
Chamaret et al., “Attention-based video reframing: validation using eye-tracking”, 19th International Conference on Pattern Recognition ICPR'08, Dec. 8-11, 2008, Tampa, FL, USA. |
Lemeur et al, “Predicting visual fixations on video based on low-level visual features”, Vision Research, vol. 47, No. 19, pp. 2483-2498, Sep. 2007. |
Chen L-Qun et al., “A visual attention model for adapting images on small displays”, Multimedia Systems, ACM New York, NY, USA, vol. 9, No. 4, Oct. 1, 2003; pp. 353-364. |
Search Report Aug. 2, 2012. |
Number | Date | Country | |
---|---|---|---|
20140153651 A1 | Jun 2014 | US |