The present application relates to a method of decoding an encoded video stream, a method of encoding a video stream, a video decoding apparatus, a video encoding apparatus, and a computer-readable medium.
H.264, ITU-T recommendation (03/2010); SERIES H: AUDIOVISUAL AND MULTIMEDIA SYSTEMS; Infrastructure of audiovisual services—Coding of moving video; Advanced video coding for generic audiovisual services; is an international standard which defines H.264 video coding. H.264 is an evolution of the existing video coding standards (H.261, H.262, and H.263) and it was developed in response to the growing need for higher compression of moving pictures for various applications such as videoconferencing, digital storage media, television broadcasting, Internet streaming, and communication. It is also designed to enable the use of the coded video representation in a flexible manner for a wide variety of network environments. The use of H.264 allows motion video to be manipulated as a form of computer data and to be stored on various storage media, transmitted and received over existing and future networks and distributed on existing and future broadcasting channels.
In known video coding standards such as H.264, temporal redundancy in picture information of successive video frames is exploited by prediction of displaced blocks from a previously encoded or decoded picture or frame. This prediction is often referred to as motion compensated prediction, where the motion vector defines the spatial displacement of a pixel or group of pixels from one picture to another. According to the H.264 standard, the motion vector may have quarter pixel accuracy. This means that the motion vector can reference a block (in another picture) at a spatial displacement of, say, 16.75 pixels in a horizontal direction and 11.25 pixels in a vertical direction.
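As an illustration of quarter pixel accuracy, the following minimal sketch splits a motion vector component into its integer pixel part and its quarter-pixel fraction. The function names and the choice of storing displacements as whole numbers of quarter-pixel units are assumptions made for this example, not part of the standard text quoted above.

```python
def to_quarter_units(displacement_pixels: float) -> int:
    """Convert a displacement in pixels to whole quarter-pixel units."""
    return round(displacement_pixels * 4)


def split_quarter_pel(mv_q: int) -> tuple[int, int]:
    """Return (integer pixel offset, quarter-pixel fraction in 0..3)."""
    # Arithmetic shift and mask also behave sensibly for negative vectors:
    # -1 quarter unit becomes integer offset -1 plus fraction 3 (-1 + 0.75).
    return mv_q >> 2, mv_q & 3


mv_x = to_quarter_units(16.75)  # 67 quarter-pixel units
mv_y = to_quarter_units(11.25)  # 45 quarter-pixel units
print(split_quarter_pel(mv_x))  # (16, 3)
print(split_quarter_pel(mv_y))  # (11, 1)
```

A non-zero fraction in either component means that the referenced block lies on sub-pixel positions, so its values have to be interpolated.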
The quarter-pixels (sometimes referred to as Qpels) are sub-pixels that lie between the integer pixels at one quarter intervals. Pixel and sub-pixel values may be defined in terms of luminance and chroma, or red, green and blue intensity values, or any other suitable colour space definition. Sub-pixel values are calculated for a particular picture using an interpolation filter. The interpolation filter is an equation which defines the value of a sub-pixel using the nearby integer pixel values.
During encoding, all sub-pixel values are calculated to allow for the searching of similar blocks of pixels between pictures in order to find motion vectors. During decoding, a sub-pixel value for a referred picture is only calculated when a motion vector for a picture currently being decoded is identified which points to that sub-pixel value. The decoder may receive the motion vector. Alternatively, the decoder may receive an indication of the motion vector. The indication of the motion vector may comprise a reference to a motion vector candidate and a difference vector such that the required motion vector can be derived by summing the motion vector candidate and the difference vector. The indication of the motion vector may also comprise which previously decoded picture to reference. Alternatively, the decoder may receive an indication of which previously decoded picture to reference for a particular set of motion vectors.
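The derivation of a motion vector from a candidate and a difference vector can be sketched as follows. The `MotionVector` type, the candidate list and the function name are hypothetical and serve only to illustrate the summation described above.

```python
from typing import NamedTuple, Sequence


class MotionVector(NamedTuple):
    x: int  # horizontal component, in quarter-pixel units (assumed)
    y: int  # vertical component, in quarter-pixel units (assumed)


def derive_motion_vector(
    candidates: Sequence[MotionVector],
    candidate_index: int,
    difference: MotionVector,
) -> MotionVector:
    """Sum the referenced candidate and the received difference vector."""
    candidate = candidates[candidate_index]
    return MotionVector(candidate.x + difference.x, candidate.y + difference.y)


candidates = [MotionVector(64, 44), MotionVector(0, 0)]
print(derive_motion_vector(candidates, 0, MotionVector(3, 1)))
# MotionVector(x=67, y=45)
```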
b = [A − 5B + 20C + 20D − 5E + F] × (1/32)
This interpolation filter is referred to as a six-tap filter because it uses the values of six other pixel positions. Sub-pixel positions a and c may be calculated using similar filters but having different weightings to allow for their different positions. Because sub-pixels a, b and c are calculated from integer pixel values having the same vertical coordinate as themselves, these sub-pixels can be said to require filtering only in the horizontal direction. Similarly, sub-pixels d, h and l may be obtained from interpolation filters having taps of integer pixel values with a common horizontal coordinate to themselves.
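A minimal sketch of the six-tap filter with weights (1, −5, 20, 20, −5, 1) and division by 32 is given below. The rounding offset of 16, the clipping to the valid sample range and the clamping of tap positions at the row edges are assumptions of this example (the first two follow common H.264 integer arithmetic); the function names are hypothetical.

```python
def clip_pixel(value: int, bit_depth: int = 8) -> int:
    """Clamp a filtered value to the valid sample range."""
    return max(0, min((1 << bit_depth) - 1, value))


def six_tap(a: int, b: int, c: int, d: int, e: int, f: int) -> int:
    """Weighted sum of six integer pixels, scaled by 1/32 with rounding."""
    acc = a - 5 * b + 20 * c + 20 * d - 5 * e + f
    return clip_pixel((acc + 16) >> 5)


def half_pel_between(row: list[int], x: int) -> int:
    """Half-pixel value between row[x] and row[x + 1] (horizontal filtering)."""
    last = len(row) - 1
    # Tap positions x-2 .. x+3, clamped to the row (edge handling is assumed).
    taps = [row[min(max(x + k, 0), last)] for k in range(-2, 4)]
    return six_tap(*taps)


print(half_pel_between([100] * 6, 2))                     # 100
print(half_pel_between([0, 0, 0, 255, 255, 255], 2))      # 128
print(half_pel_between([0, 0, 255, 255, 0, 0], 2))        # 255 (clipped)
```

The last call shows why clipping is needed: the negative outer weights let the filter overshoot the sample range near sharp edges.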
Sub-pixel positions e, f, g, i, j, k, m, n and o require filtering in both the horizontal and the vertical direction, which makes these sub-pixel positions more computationally costly to calculate. The calculation of these sub-pixel values can require the calculation of multiple nearby sub-pixels in order to provide values for taps of the interpolation filter for these pixel positions.
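The grouping of sub-pixel positions by filtering direction can be sketched as follows. The sketch assumes that the labels a to o are laid out row by row in a 4 by 4 grid whose top-left entry is the integer pixel, which is consistent with the grouping described above but is not stated explicitly in this text.

```python
LABELS = "abcdefghijklmno"  # 15 sub-pixel positions around one integer pixel


def offsets(label: str) -> tuple[int, int]:
    """Return (dx, dy) in quarter-pixel units under the assumed layout."""
    index = LABELS.index(label) + 1  # index 0 is the integer pixel itself
    return index % 4, index // 4


def filter_directions(label: str) -> str:
    """Classify a position by the directions in which it must be filtered."""
    dx, dy = offsets(label)
    if dy == 0:
        return "horizontal"
    if dx == 0:
        return "vertical"
    return "both"


for direction in ("horizontal", "vertical", "both"):
    group = "".join(p for p in LABELS if filter_directions(p) == direction)
    print(direction, group)
# horizontal abc
# vertical dhl
# both efgijkmno
```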
Sub-pixel value interpolation is a computationally intensive task and consumes a significant proportion of the processor resources in a video decoder. This leads to increased cost of implementation, increased power consumption, decreased battery life, etc.
Accordingly, an improved method and apparatus for sub-pixel interpolation is required.
According to the method and apparatus disclosed herein, a mask is applied to a picture being referenced, the mask disallowing certain sub-pixel positions and thereby preventing the application of an interpolation filter at those sub-pixel positions. The mask reduces the number of sub-pixel positions for which interpolation must be performed and thus reduces the amount of calculation required in the decoder. The mask can be selected to exclude the more complex sub-pixel positions, for example those that require interpolation in both a vertical and horizontal direction. Thus there is provided an improved trade-off between computational efficiency and decoded video quality.
There is further provided a method for decoding an encoded video stream. The method comprises receiving an indication of a motion vector for a current picture, the motion vector referring to a previously decoded picture. The method also comprises applying a mask, the mask defining a subset of sub-pixel positions of the previously decoded picture which may be referenced by the motion vector for the current picture. The method further comprises identifying at least one pixel value for the current picture by referring to the value of at least one pixel in an allowed pixel position of the previously decoded picture.
By eliminating interpolation for certain sub-pixel positions the amount of calculation required during decoding is reduced. Advantageously, the most computationally intensive sub-pixel positions may be eliminated, giving a significant reduction in decoder computation with a reduced impact on decoded video quality.
The mask may be applied to the previously decoded picture. The mask may allow a subset of sub-pixel positions of the previously decoded picture to be referred to. The mask may define a subset of sub-pixel positions that are allowed to be referenced.
The mask may be dependent upon the quality of the previously decoded picture. Interpolated sub-pixel values in low quality reference pictures give less of an improvement in decoded video quality than interpolated sub-pixel values in high quality reference pictures. Accordingly, determining the allowed sub-pixel positions according to the quality of the reference picture allows for a reduction in decoder computation with a minimal impact on decoded video quality.
There is further provided a method of decoding an encoded video stream. The method comprises receiving an indication of a motion vector for a current picture, the motion vector referring to a previously decoded picture. The method also comprises identifying at least one pixel value for the current picture by referring to at least one sub-pixel in the previously decoded picture as indicated by the motion vector. The method further comprises applying an interpolation filter to the previously decoded picture to identify a value of the at least one referred to sub-pixel, wherein the interpolation filter applied is dependent upon the quality of the previously decoded picture.
In a high quality reference frame, the sub-pixel value interpolation is advantageously calculated taking into account a high number of integer pixel values, such as six integer pixel values in a six-tap interpolation filter. For a low quality reference frame, a sufficient sub-pixel value interpolation may be calculated taking into account a lower number of integer pixel values, such as two integer pixel values in a two-tap interpolation filter.
There is further provided a method of encoding a video stream. The method comprises identifying a motion vector for a current picture, the motion vector referring to a previously encoded picture. The method also comprises applying a mask, the mask defining a subset of sub-pixel positions of the previously decoded picture which may be referenced by the motion vector for the current picture. The method further comprises modifying the motion vector to identify at least one pixel value for the current picture by referring to the value of at least one pixel in an allowed pixel position of the previously decoded picture.
By eliminating interpolation for certain sub-pixel positions in the encoded video stream the amount of calculation required during decoding is reduced.
There is further provided a video decoding apparatus. The apparatus comprises a receiver arranged to receive an indication of a motion vector for a current picture, the motion vector referring to a previously decoded picture. The apparatus also comprises a processor arranged to apply a mask, the mask defining a subset of sub-pixel positions of the previously decoded picture which may be referenced by the motion vector for the current picture.
The processor is further arranged to identify at least one pixel value for the current picture by referring to the value of at least one pixel in an allowed pixel position of the previously decoded picture.
By eliminating interpolation for certain sub-pixel positions the amount of calculation required during decoding is reduced.
There is further provided a video encoding apparatus comprising a processor. The processor is arranged to identify a motion vector for a current picture, the motion vector referring to a previously encoded picture. The processor is also arranged to apply a mask, the mask defining a subset of sub-pixel positions of the previously decoded picture which may be referenced by the motion vector for the current picture. The processor is further arranged to modify the motion vector to identify at least one pixel value for the current picture by referring to the value of at least one pixel in an allowed pixel position of the previously decoded picture.
By eliminating interpolation for certain sub-pixel positions in the encoded video stream the amount of calculation required during decoding is reduced.
There is further provided a computer-readable medium carrying instructions which, when executed by computer logic, cause said computer logic to carry out any of the methods defined herein.
There is further provided a method of decoding an encoded video stream, the method comprising: receiving an indication of a motion vector for a current picture, the motion vector referring to a previously decoded picture; applying a mask, the mask defining a subset of sub-pixel positions of the previously decoded picture which may be referenced by the motion vector for the current picture; if the pixel indicated by the motion vector is in an allowed pixel position, then identifying a pixel value for the current picture by referring to the indicated sub-pixel value in the previously decoded picture; and if the pixel indicated by the motion vector is in a disallowed pixel position, then identifying a pixel value for the current picture by referring to an alternative allowed pixel position.
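The allowed and disallowed branches of this method can be sketched as below. The text does not say how the alternative allowed position is chosen, so the sketch assumes the nearest allowed position within the same integer-pixel cell; the mask contents, the tie-breaking rule and all names are hypothetical.

```python
Position = tuple[int, int]  # (dx, dy) fraction in quarter-pixel units

# Example mask (assumed): integer pixel plus positions filtered in one direction.
AXIS_ONLY: set[Position] = {(dx, 0) for dx in range(4)} | {(0, dy) for dy in range(4)}


def resolve_reference(mv_x_q: int, mv_y_q: int, allowed: set[Position]) -> Position:
    """Return the quarter-pixel position to read, honouring the mask."""
    frac = (mv_x_q & 3, mv_y_q & 3)
    if frac in allowed:
        return mv_x_q, mv_y_q  # allowed: use the indicated position as is
    # Disallowed: pick the closest allowed fraction in the same cell.
    # Ties are broken by position order so the result is deterministic.
    best = min(
        allowed,
        key=lambda p: ((p[0] - frac[0]) ** 2 + (p[1] - frac[1]) ** 2, p),
    )
    return mv_x_q - frac[0] + best[0], mv_y_q - frac[1] + best[1]


print(resolve_reference(66, 44, AXIS_ONLY))  # (66, 44): already allowed
print(resolve_reference(67, 45, AXIS_ONLY))  # (67, 44): moved to an allowed position
```

Restricting the search to one cell keeps the sketch short; a fuller version could also consider allowed positions belonging to neighbouring integer pixels.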
The equations used to calculate sub-pixel values from integer pixel values are referred to herein as filters or interpolation filters. Each image that comprises a frame of a video sequence is referred to herein as a picture; these may also be referred to as frames in the art. The pattern of allowed sub-pixel positions in a picture which may be referred to by a motion vector related to another picture is referred to herein as a mask.
An improved method and apparatus for sub-pixel interpolation will now be described, by way of example only, with reference to the accompanying drawings, in which:
According to a first embodiment, in a video decoding system a mask is applied to a picture being referenced, the mask disallowing certain sub-pixel positions and thereby preventing the application of an interpolation filter at those sub-pixel positions. The mask reduces the number of sub-pixel positions for which interpolation must be performed and thus reduces the amount of calculation required in the decoder. The mask can be selected to exclude the more complex sub-pixel positions, for example those that require interpolation in both a vertical and horizontal direction, to provide an improved trade-off between computational efficiency and decoded video quality.
According to a further embodiment, different masks are selected for different reference pictures. Any previously decoded picture may serve as a reference picture to which a motion vector refers. These pictures can be encoded in different ways and the image quality of any particular received picture varies according to how well it was encoded. According to a method and apparatus disclosed herein, a mask is selected to be applied to a picture being referenced, wherein the number of sub-pixel positions allowed by the mask is proportional to the quality of the reference picture. A high quality reference picture is allowed to be referenced to any sub-pixel position, whereas a low quality reference picture is allowed to be referenced to only a limited number of sub-pixel positions. In this way, the amount of calculation required for sub-pixel interpolation is reduced with minimal impact on video quality.
Pictures may be coded as: I-frames (intracoded frames, coded without reference to any other pictures), P-frames (predicted frames, coded with reference to the previous picture), or B-frames (bi-predicted frames, coded with reference to two other pictures, for example both a previous and a subsequent picture). It should be noted that B-frames can also refer to only previous pictures, as needed in some applications to obtain coding with low delay.
A B-frame is a picture obtained using bi-prediction. Bi-predictions are made with references to two other previously decoded pictures. The two other pictures may be: both preceding the current picture in the series of frames; both following the current picture in the series of frames; or a picture preceding the current picture in the series of frames and a picture following the current picture in the series of frames. It should be noted that the order of picture coding does not necessarily follow the order of pictures in the series of frames. In bi-prediction, because the predicted picture is composed from two reference pictures, twice the number of sub-pixels could be referenced. This means that a motion vector is more likely to refer to sub-pixels whose values have not yet been interpolated and thus more sub-pixel interpolation is required. Bi-prediction therefore has approximately twice the complexity, in terms of filtering operations such as additions, multiplications and shifts, compared to single picture prediction.
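The doubling of interpolation work can be seen in a small sketch of bi-prediction. The text does not specify how the two predictions are combined; the sketch assumes a plain rounded average of the two prediction blocks, and the function name is hypothetical.

```python
def bi_predict(pred0: list[int], pred1: list[int]) -> list[int]:
    """Combine two prediction blocks sample by sample (rounded average, assumed)."""
    return [(p0 + p1 + 1) >> 1 for p0, p1 in zip(pred0, pred1)]


# Each input block is fetched from a different reference picture. If both
# motion vectors point to sub-pixel positions, both blocks must be
# interpolated before they can be combined, which doubles the filtering work.
print(bi_predict([100, 50], [101, 60]))  # [101, 55]
```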
H.264 has B-skip and B-direct modes where the motion vector is predicted from the neighboring macroblocks without any coding of the motion prediction error. This means that if both predicted motion vectors point to sub-pixel positions that are offset in both the horizontal and the vertical direction, the skipped block requires sub-pixel interpolation in both directions twice. H.264 also has a feature called hierarchical B coding. In hierarchical B coding some B-frames are derived from references to at least one other B-frame, using either single picture prediction or bi-prediction.
In these referencing schemes the quality of the pictures varies with position within the group of pictures, and type of picture. Each reference to another picture introduces some minor error. Some pictures are composed using references to pictures which are themselves composed using references to other pictures and for these pictures minor errors accumulate and the quality of the picture decreases. For example, an I-frame gives a high quality picture as this is essentially a compressed still image; no errors are introduced from approximate references to other pictures. A P-frame gives a lower quality picture than an I-frame. A B-frame gives a lower quality picture than a P-frame. Subsequent hierarchical B-frames have lower quality still than a B-frame derived from references to only I-frames and P-frames.
Any previously decoded picture may serve as a reference picture to which a motion vector points. These pictures can be encoded in different ways and the image quality of any particular received picture varies according to how it was encoded. When a reference is made to another picture by way of a motion vector the motion vector may point to a sub-pixel. Where a reference is made to a sub-pixel in a referenced picture that sub-pixel must be calculated using an interpolation filter. For low quality pictures such as B2 in
Quantization Parameters (QP) are used to determine the level of quantization of transform coefficients. A larger QP means a larger quantization step size meaning a lower resolution scale of transform coefficients and so a lower picture quality. In the example of
According to a method and apparatus disclosed herein, a mask is applied to a picture being referenced, the mask disallowing certain sub-pixel positions and thereby preventing the application of an interpolation filter at those sub-pixel positions.
The masks are defined in the decoder. Different masks may be used for different levels of reference picture quality. Each mask indicates, for a particular reference picture quality, which sub-pixel positions may be used as references for subsequent pictures. This allows the complexity of bi-prediction to be controlled dependent upon the reference picture. Reference pictures of higher quality thus have a different sub-pixel mask compared to reference frames of lower quality.
It is advantageous to allow for many sub-pixel positions in a high quality reference picture in order to use the sharpness of the high quality reference picture in current picture prediction. Low quality reference pictures contain less detail and thus a sufficient reference can be made with fewer sub-pixel positions. By masking away sub-pixel positions that have the highest calculation complexity the interpolation cost of the low quality reference frames can be reduced.
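One possible set of masks, ordered from low to high reference picture quality, is sketched below. The particular masks, the number of quality levels and the names are illustrative assumptions and are not taken from the drawings; positions are (dx, dy) fractions in quarter-pixel units.

```python
Position = tuple[int, int]

FULL: set[Position] = {(dx, dy) for dy in range(4) for dx in range(4)}
# Drop every position that needs filtering in both directions.
AXIS_ONLY: set[Position] = {p for p in FULL if p[0] == 0 or p[1] == 0}
# Keep only the integer pixel and the two half-pixel positions on the axes.
HALF_AXIS: set[Position] = {(0, 0), (2, 0), (0, 2)}
INTEGER_ONLY: set[Position] = {(0, 0)}

# Index 0 is the lowest quality level, index 3 the highest (assumed scale).
MASKS_BY_QUALITY = [INTEGER_ONLY, HALF_AXIS, AXIS_ONLY, FULL]


def select_mask(quality_level: int) -> set[Position]:
    """Pick the mask for a reference picture, clamping out-of-range levels."""
    level = max(0, min(len(MASKS_BY_QUALITY) - 1, quality_level))
    return MASKS_BY_QUALITY[level]


for level in range(4):
    print(level, len(select_mask(level)))
# 0 1
# 1 3
# 2 7
# 3 16
```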
The masks 410, 420, 430, 440 in
A picture obtained through bi-prediction using appropriate masks for high and low quality reference frames can maintain much of the coding efficiency and video quality of a system that uses no masking but at a significantly lower interpolation cost at the decoder.
It should be noted that the masking of sub-pixel positions may also be deployed in an encoder. This is done by allowing an encoder to select motion vectors which reference a particular picture only at sub-pixel positions according to a mask determined according to the quality of the referenced picture as described above with reference to a decoder.
In a further alternative, the encoder may transmit the different masks as described above to a decoder for the decoder to implement should it need to reduce computational load and/or improve coding efficiency. The encoder can transmit each mask as a 16-bit field in the Sequence Parameter Set or Picture Parameter Set. Of course, instead of transmitting the mask, the encoder may transmit a flag indicating that a mask should be used.
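A 16-bit representation of a quarter-pixel mask can be sketched as follows, with one bit per position in the 4 by 4 grid of an integer pixel and its 15 sub-pixels. The bit ordering (bit index equal to 4·dy + dx) and the function names are assumptions of this example; the text does not define the syntax of the transmitted field.

```python
Position = tuple[int, int]  # (dx, dy) fraction in quarter-pixel units


def pack_mask(allowed: set[Position]) -> int:
    """Encode the allowed positions as a 16-bit integer."""
    bits = 0
    for dx, dy in allowed:
        bits |= 1 << (4 * dy + dx)
    return bits


def unpack_mask(bits: int) -> set[Position]:
    """Decode a 16-bit integer back into the set of allowed positions."""
    return {(i % 4, i // 4) for i in range(16) if bits & (1 << i)}


axis_only = {(dx, 0) for dx in range(4)} | {(0, dy) for dy in range(4)}
print(hex(pack_mask(axis_only)))  # 0x111f
```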
In another embodiment the processing burden for calculating sub-pixel values is further reduced by using less complex filters for all allowed sub-pixels in a lower quality picture that is being referenced. As explained above, the value of sub-pixel b may be calculated as a weighted average of six nearby integer pixels according to:
b = [A − 5B + 20C + 20D − 5E + F] × (1/32).
With reference to
b = [C + D] × (1/2).
According to the method and apparatus disclosed herein, at least one interpolation filter is applied to a picture being referenced, the interpolation filter giving a value for a sub-pixel position based on nearby integer pixel values. Different interpolation filters are applied according to the quality of the picture being referenced such that the number of integer pixel values referenced by the interpolation filter is proportional to the quality of the reference picture. An interpolation filter with a greater number of taps is used for a high quality reference picture as compared to an interpolation filter used for a low quality reference picture. In this way, the amount of calculation required for sub-pixel interpolation is reduced with minimal impact on video quality.
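Selecting the filter length according to reference picture quality can be sketched as below for sub-pixel b. The rounding offsets, the clipping to an 8-bit range and the boolean quality flag are assumptions of this example; C and D are taken to be the two integer pixels on either side of b, with A, B and E, F further out.

```python
def interpolate_b(row: list[int], x: int, high_quality: bool) -> int:
    """Value of b between row[x] (C) and row[x + 1] (D).

    Assumes 2 <= x <= len(row) - 4 so that all six taps exist.
    """
    c, d = row[x], row[x + 1]
    if not high_quality:
        # Two-tap filter: b = (C + D) / 2, with rounding (assumed).
        return (c + d + 1) >> 1
    a, b, e, f = row[x - 2], row[x - 1], row[x + 2], row[x + 3]
    # Six-tap filter: b = (A - 5B + 20C + 20D - 5E + F) / 32, with rounding.
    acc = a - 5 * b + 20 * c + 20 * d - 5 * e + f
    return max(0, min(255, (acc + 16) >> 5))


row = [10, 20, 30, 40, 50, 60]
print(interpolate_b(row, 2, high_quality=True))   # 35
print(interpolate_b(row, 2, high_quality=False))  # 35
```

On a smooth ramp such as this one both filters agree; they differ near edges and fine detail, which a low quality reference picture retains less of.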
The sub-pixel mask and/or interpolation filter applied to a referenced picture may be determined according to the quality of the referenced picture. The picture quality may be determined from the prediction modes used to create it (e.g. I-frame, P-frame, B-frame, secondary B-frame etc.). The quality of each picture may be indicated in the stream by a sequence parameter at the start of a video bitstream, or by a parameter for each frame or slice in the video bitstream.
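Determining a quality level for a reference picture can be sketched as follows. The numeric levels, the picture type labels and the optional signalled value are hypothetical; they only illustrate that a value carried in the bitstream can take precedence over a level inferred from the prediction mode.

```python
from typing import Optional

# Assumed ranking: pictures further down the referencing chain rank lower.
QUALITY_BY_TYPE = {"I": 3, "P": 2, "B": 1, "hierarchical-B": 0}


def picture_quality(picture_type: str, signalled: Optional[int] = None) -> int:
    """Return the quality level of a reference picture."""
    if signalled is not None:
        return signalled  # per-sequence, per-frame or per-slice parameter
    return QUALITY_BY_TYPE[picture_type]


print(picture_quality("I"))               # 3
print(picture_quality("hierarchical-B"))  # 0
print(picture_quality("B", signalled=2))  # 2
```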
Further still, the sub-pixel mask and/or interpolation filter applied by a decoder may be determined by the decoder itself dependent upon available processing resources. Such an adaptive system allows greater flexibility of resource management in a decoder or a multi-function device incorporating a video decoder.
It will be apparent to the skilled person that the exact order and content of the actions carried out in the method described herein may be altered according to the requirements of a particular set of execution parameters. Accordingly, the order in which actions are described and/or claimed is not to be construed as a strict limitation on the order in which actions are to be performed.
The sub-pixels of the examples described herein have been described in the context of quarter pixels. It should be noted that these examples are in no way limiting of the arrangements to which the disclosed method and apparatus may be applied. For example, the principles disclosed herein can also be applied to ⅛th sub-pixels (eighth-pixels, wherein each integer pixel has 63 associated sub-pixel positions which, together with the integer pixel, form an 8 by 8 grid) or any other pixel sub-division scheme. Further, masks may be provided which limit references to: only half-pixels; only half-pixels and quarter-pixels; and half-pixels, quarter-pixels and eighth-pixels.
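The number of sub-pixel positions associated with each integer pixel follows directly from the sub-division: an n by n grid contains n² positions, one of which is the integer pixel itself. A short sketch, with a hypothetical function name:

```python
def sub_pixel_positions(subdivisions: int) -> int:
    """Sub-pixel positions per integer pixel for a 1/subdivisions scheme."""
    return subdivisions * subdivisions - 1


print(sub_pixel_positions(2))  # 3  (half-pixels)
print(sub_pixel_positions(4))  # 15 (quarter-pixels)
print(sub_pixel_positions(8))  # 63 (eighth-pixels)
```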
Further, while examples have been given in the context of particular video coding standards, these examples are not intended to be the limit of the communications standards to which the disclosed method and apparatus may be applied. For example, while specific examples have been given in the context of H.264/AVC, the principles disclosed herein can also be applied to an MPEG-4 ASP (advanced simple profile) system, HEVC (High Efficiency Video Coding) and indeed any video coding system which uses interpolated sub-pixel values.
This application claims the benefit of U.S. Provisional Application No. 61/301,659 filed Feb. 5, 2010, the entire contents of which are hereby incorporated by reference.
Number | Date | Country
---|---|---
61301659 | Feb 2010 | US