The present embodiments generally relate to video coding and decoding, and in particular to transcoding of video bit-streams.
High Efficiency Video Coding (HEVC) is a new video coding standard developed in a collaborative project between the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group (MPEG) and the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) Video Coding Experts Group (VCEG). The HEVC standard will become MPEG-H Part 2 in ISO/IEC and H.265 in ITU-T.
The HEVC standard introduces a new block structure called the quad-tree block structure to efficiently organize picture data. Each block in the quad-tree structure, denoted coding unit (CU), has a prediction mode and a splitting into prediction sub-blocks, called prediction units (PUs). Each such PU has further parameters such as motion vector(s) or intra prediction direction. The task of an encoder is, for a given video, to find the optimal settings of coding parameters so that the video is represented in an efficient way. The space of possible coding parameter combinations is huge. Thus, finding the optimal quad-tree structure and other coding parameter settings that most efficiently represent a picture is a computationally expensive task.
A major difference between prior video coding standards, such as MPEG-2 and H.264/MPEG-4 Advanced Video Coding (AVC), and the HEVC standard is the way coding units are defined and signaled. MPEG-2 and AVC have 16×16 luma pixel macroblocks. In AVC, each macroblock can have a prediction mode, e.g. inter or intra prediction, and can be further split into 8×8 blocks of pixels. An 8×8 block can in turn be further split into 4×4 blocks. Each sub-block in a macroblock can have a different motion vector for inter prediction or prediction direction for intra prediction. However, all sub-blocks in a macroblock have the same prediction mode. In HEVC, a quad-tree block structure is used. The root in the quad-tree structure is a so called coding tree unit (CTU), which typically has a size of 64×64 luma pixels. Each of these CTUs can be split recursively in a quad-split manner, i.e. a 64×64 CTU can be split into four 32×32 blocks, each of which can be further split into four 16×16 blocks, each of which can be further split into four 8×8 blocks. As an example,
A leaf of the quad-tree structure, which is the resulting end block after splitting the CTU, is called CU. Each CU has a prediction mode, e.g. skip, inter prediction or intra prediction, and a PU split structure for prediction, typically denoted partitioning mode, as well as a transform unit (TU) split structure for applying a block transform to encode and decode the residual data after prediction. The possible PU splits with the corresponding prediction and partitioning modes are shown in
As seen in
In contrast to eight possible directional predictions of intra blocks in AVC, HEVC supports 35 intra prediction modes with 33 distinct prediction directions in addition to the planar and DC prediction modes.
A PU within a CU with inter prediction has a corresponding motion vector or vectors that point(s) to a (respective) prediction reference of a past or future picture. At the encoder, the prediction reference is chosen to be a block of pixels that closely matches the current PU. This matching is evaluated by finding the difference between the pixel values in the current PU and the pixel values in the prediction reference and choosing the prediction reference that gives the smallest residual according to some energy measure or distortion measure, or considering both residual energy or distortion and the number of bits required for representing the coded data, or using similar strategies.
A picture could be partitioned into one or more slices. A slice could be dependent or independent. In the latter case, slices of a single picture could be decoded individually. Similar to the CU prediction mode, a slice could be predicted using the current picture (I-slice), previous pictures (P-slice), or past and future pictures (B-slice).
Finding the optimal quad-tree structure, prediction modes and partition or partitioning modes requires a computationally expensive search through the space of all possible splits and modes. When encoding source video for the first time, this costly encoding process must be carried out. An alternative to searching all possible combinations of coding parameters is to search a subset of such coding parameters. Searching such a subset is less time consuming, but it will also lead to suboptimal compression performance. In general, the bigger the search space and, thus, the more time consuming the search, the better compression performance can be expected. It is very challenging to define a subset such that the search time is significantly reduced while good compression efficiency is retained.
Transcoding of bit-streams encoded with the HEVC standard is required for interoperability. For example, in video conferencing applications a source video bit-stream could be broadcast over a network of heterogeneous devices. Assume two receivers A and B, where receiver A is connected through a Local Area Network (LAN) and receiver B is connected through a wireless connection. It is known that receiver B has access to limited network bandwidth. Hence, the bit-rate of the video stream must be reduced for receiver B.
There is, thus, a need for a technique that enables HEVC transcoding and in particular such a technique that can be computationally efficient, fast and provide good compression efficiency.
It is a general objective to provide an efficient video transcoding.
This and other objectives are met by embodiments disclosed herein.
An aspect of the embodiments relates to a method of transcoding a coding tree unit of a picture in a video sequence. The coding tree unit comprises one or multiple coding units of pixels. The method comprises decoding an input encoded representation of the coding tree unit to obtain coding parameters for the input encoded representation. The method also comprises determining, based on the coding parameters, a search sub-space consisting of a subset of all possible combinations of candidate encoded representations of the coding tree unit. The method further comprises encoding the coding tree unit to get an output encoded representation of the coding tree unit belonging to the search sub-space.
Another aspect of the embodiments relates to a transcoder configured to transcode a coding tree unit of a picture in a video sequence. The coding tree unit comprises one or multiple coding units of pixels. The transcoder comprises a decoder configured to decode an input encoded representation of the coding tree unit to obtain coding parameters for the input encoded representation. The transcoder also comprises a search sub-space determiner configured to determine, based on the coding parameters, a search sub-space consisting of a subset of all possible combinations of candidate encoded representations of the coding tree unit. The transcoder further comprises an encoder configured to encode the coding tree unit to get an output encoded representation of the coding tree unit belonging to the search sub-space.
A further aspect of the embodiments relates to a transcoder configured to transcode a coding tree unit of a picture in a video sequence. The coding tree unit comprises one or multiple coding units of pixels. The transcoder comprises a processor and a memory containing instructions executable by the processor. The processor is operable to decode an input encoded representation of the coding tree unit to obtain coding parameters for the input encoded representation. The processor is also operable to determine, based on the coding parameters, a search sub-space consisting of a subset of all possible combinations of candidate encoded representations of the coding tree unit. The processor is further operable to encode the coding tree unit to get an output encoded representation of the coding tree unit belonging to the search sub-space.
Yet another aspect of the embodiments relates to a transcoder for transcoding a coding tree unit of a picture in a video sequence. The coding tree unit comprises one or multiple coding units of pixels. The transcoder comprises a decoding module for decoding an input encoded representation of the coding tree unit to obtain coding parameters for the input encoded representation. The transcoder also comprises a search sub-space determining module for determining, based on the coding parameters, a search sub-space consisting of a subset of all possible combinations of candidate encoded representations of the coding tree unit. The transcoder further comprises an encoding module for encoding the coding tree unit to get an output encoded representation of the coding tree unit belonging to the search sub-space.
Still another aspect of the embodiments relates to a user equipment or terminal comprising a transcoder according to above.
Another aspect of the embodiments relates to a network device being or configured to be arranged in a network node in a communication network. The network device comprises a transcoder according to above.
A further aspect of the embodiments relates to a computer program configured to transcode a coding tree unit of a picture in a video sequence. The coding tree unit comprises one or multiple coding units of pixels. The computer program comprises code means which when run on a computer causes the computer to decode an input encoded representation of the coding tree unit to obtain coding parameters for the input encoded representation. The code means also causes the computer to determine, based on the coding parameters, a search sub-space consisting of a subset of all possible combinations of candidate encoded representations of the coding tree unit. The code means further causes the computer to encode the coding tree unit to get an output encoded representation of the coding tree unit belonging to the search sub-space.
A related aspect of the embodiments defines a computer program product comprising computer readable code means and a computer program according to above stored on the computer readable code means. Another related aspect of the embodiments defines a carrier comprising a computer program according to above. The carrier is one of an electronic signal, an optical signal, an electromagnetic signal, a magnetic signal, an electric signal, a radio signal, a microwave signal, or a computer-readable storage medium.
The embodiments, together with further objects and advantages thereof, may best be understood by making reference to the following description taken together with the accompanying drawings, in which:
Throughout the drawings, the same reference numbers are used for similar or corresponding elements.
The embodiments generally relate to encoding and decoding of pictures in a video sequence, and in particular to transcoding of video bit-streams into transcoded video bit-streams.
According to a particular embodiment a picture 2 comprises one or, more typically, multiple, i.e. at least two, so called coding tree units (CTUs) or coding tree blocks (CTBs). As is well known in the art, pixels of a picture 2, also denoted samples, generally have a respective pixel value, or sample value, typically representing a color of the pixel. Various color formats and corresponding color components are available, including luminance (luma) and chrominance (chroma). Hence, a pixel generally has a luminance component and chrominance components, defined according to the color space. Luminance, or luma, is the brightness. Chrominance, or chroma, is the color. In such a case, a picture 2 could be decomposed into luma CTBs and chroma CTBs. Thus, a given block of pixels occupying an area of the picture 2 constitutes a luma CTB if the pixels have a respective luma value. Two corresponding chroma CTBs occupy the same area of the picture 2 and have pixels with respective chroma values. A CTU comprises such a luma CTB and the corresponding two chroma CTBs. Reference number 10 in
The size of a CTU 10, and thereby of a luma CTB, could be fixed or predefined, such as 64×64 pixels. Alternatively, the size of the CTU 10 is set by the encoder and signaled to the decoder in the video bit-stream, such as 64×64 pixels, 32×32 pixels or 16×16 pixels.
In the following, the embodiments will be further discussed in connection with transcoding a CTU of a picture in a video sequence. As discussed in the foregoing, the size of a CTU and the size of the luma CTB it comprises are identical. Hence, the embodiments likewise relate to transcoding of a CTB, such as a luma CTB, or more generally a block of pixels in a picture.
A CTU comprises one or more so-called coding units (CUs) of pixels and a luma/chroma CTB correspondingly comprises one or more so-called luma/chroma coding blocks (CBs) of pixels.
In a particular embodiment, a CTU (CTB) is partitioned into one or more CUs (CBs) to form a quad-tree structure as shown in
The left part of
A CTU (L×L pixels) recursively split in a quad-tree structure of CUs implies, herein, that the CTU can be split into four equally sized CUs (L/2×L/2 pixels). Each such CU may be further split into four equally sized CUs (L/4×L/4 pixels) and so on down to a smallest coding unit size or lowest depth.
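Merely as an illustration of this recursion, the following minimal Python sketch (not part of any standard or reference implementation) lists the CU sizes available at each depth for a given CTU size and maximum depth:

def cu_sizes(ctu_size=64, max_depth=3):
    # Depth 0 is the unsplit CTU; every quad-split halves the CU side length.
    return [ctu_size >> depth for depth in range(max_depth + 1)]

print(cu_sizes())  # a 64x64 CTU with maximum depth 3 gives [64, 32, 16, 8]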
In a particular embodiment, the method of
Hence, the transcoding method as shown in
The limitation or restriction of the search sub-space in step S2 based on retrieved coding parameters implies that the transcoding method will be computationally less complex and faster as compared to using an exhaustive search or encoding, which basically involves testing all available candidate encoded representations for the CTU.
In a particular embodiment, step S1 comprises decoding the input encoded representation of the CTU to obtain pixel values of the pixels and the coding parameters. The encoding in step S3 then preferably comprises encoding the pixel values of the pixels to get the output encoded representation belonging to the search sub-space.
The pixel values could be color values, such as one luma value and two chroma values per pixel for a CTU, one luma value per pixel for a luma CTB, or one or two chroma values per pixel for a chroma CTB.
In an embodiment, step S3 of
Thus, in this embodiment the candidate encoded representation that results in the best rate-distortion quality metric of the candidate encoded representations belonging to the search sub-space is selected and used as output encoded representation of the CTU. Various rate-distortion quality metrics are known in the art and can be used according to the embodiments. The rate-distortion quality metric acts as a video quality metric measuring both the deviation from a source material, i.e. raw/decoded video data, and the bit cost for each possible decision outcome, i.e. candidate encoded representation. In an example, the bit cost is accounted for by multiplying it by the Lagrangian, a value representing the relationship between bit cost and quality for a particular quality level. The deviation from the source material is usually measured as the mean squared error in order to maximize the peak-signal-to-noise-ratio (PSNR) video quality metric.
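Expressed as a formula, and with the caveat that the exact metric is encoder-specific, such a rate-distortion cost is typically computed as

J = D + λ·R

where D is the distortion, e.g. the mean squared error between the source pixel values and the reconstructed pixel values, R is the number of bits required to represent the candidate encoded representation, and λ is the Lagrangian described above. The candidate encoded representation with the smallest cost J is then selected.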
Hence, in a preferred embodiment the method in
Generally, a transcoder can be seen as a cascaded combination of a decoder and an encoder. A “simple”, in terms of implementation but not in terms of computational complexity, cascaded transcoder (SCT or ST) would decode the input bitstream, using a decoder, see
Hence, the simple cascaded transcoder will try splitting the CTU from depth zero to the maximum depth, trying each prediction and partitioning mode. A common maximum depth for HEVC encoder configurations is four levels. Assuming a maximum CU size of 64×64 pixels, splitting every CU would produce the quad-tree shown in
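As a rough sketch of why this is expensive, the exhaustive search can be written as the following recursion in Python; rd_cost_of_best_mode and quad_split are hypothetical helpers standing in for the full mode search and the CU split:

def exhaustive_search(cu, depth, max_depth=3):
    # Best rate-distortion cost of coding this CU without splitting,
    # trying every prediction and partitioning mode (hypothetical helper).
    best_cost = rd_cost_of_best_mode(cu)
    if depth < max_depth:
        # Cost of quad-splitting: the four sub-CUs are searched recursively.
        split_cost = sum(exhaustive_search(sub, depth + 1, max_depth)
                         for sub in quad_split(cu))
        best_cost = min(best_cost, split_cost)
    return best_cost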
Herein, various embodiments of using coding parameters extracted from an input encoded representation in the transcoding of the input encoded representation will be presented. In these embodiments, a CTU as exemplified by
In a first embodiment, the at least one CTU of the picture is recursively split in a quad-tree structure of CUs having a respective depth within the quad-tree structure. Each CU has a respective prediction mode and a respective partitioning mode. In this embodiment, the coding parameters extracted in step S1 of
In this embodiment step S2 of
Hence, in this embodiment the quad-tree structure of the output encoded representation will be identical to that of the input encoded representation. This means that the quad-tree structure, prediction and partitioning modes, motion vectors and intra prediction modes as shown in
In an alternative embodiment, b) above involves b) a same or neighboring motion vector (P prediction) or same or neighboring motion vectors (B prediction) as a coding unit i) defined in the input encoded representation, ii) having inter prediction as prediction mode and iii) occupying a same area of the picture as the coding unit. Correspondingly, c) above involves c) a same or neighboring intra prediction mode as a coding unit i) defined in the input encoded representation, ii) having intra prediction as prediction mode and iii) occupying a same area of the picture as the coding unit. Neighboring intra prediction mode is defined further in the sixth embodiment here below and neighboring motion vector is defined further in the seventh embodiment here below.
In a second embodiment, the at least one CTU of the picture is recursively split in a quad-tree structure of CUs. In this second embodiment, the coding parameters extracted in step S1 define the quad-tree structure. Hence, the coding parameters define or enable generation of information defining the quad-tree structure and the split of the CTU into CUs. For instance, the coding parameters could define the quad-tree structure as shown in
In this embodiment, step S2 comprises determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations re-using the quad-tree structure defining the split of the CTU into CUs.
Thus, if the current CTU of the input encoded representation had a quad-tree structure and CU split as shown in
In a third embodiment, the at least one CTU of the picture is recursively split in a quad-tree structure of CUs having a respective depth within the quad-tree structure. In the third embodiment, the coding parameters extracted in step S1 define the respective depths of the coding units in the quad-tree structure.
In this embodiment, step S2 comprises determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which a, preferably each, CU of the CTU has a same or shallower depth as compared to a CU i) defined in the input encoded representation and ii) occupying an area or portion of the picture encompassed by an area or portion occupied by the CU.
Shallower depth implies that the CU has a depth value that is closer to the minimum depth, i.e. zero, as compared to the CU defined in the input encoded representation. For instance, CU number 0 in
Compared to the space of all possible quad-tree splits,
An alternative to copying the quad-tree structure from the input encoded representation is to consider both the quad-tree structure of the input encoded representation and one or more alternative quad-tree structures in the search space. With reference to
An intuitive motivation for evaluating more “shallow” trees is that when an input encoded representation is transcoded to a lower bit rate, the quad-tree structure could be expected to become coarser, since that will reduce the number of bits to signal CTU splits. At the same time, if the bit rate is reduced, the quantization step size is typically increased, so that even if a coarser CU split should model the video content less accurately, there may not be more transform coefficients to code.
In a particular embodiment, step S2 comprises determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which each coding unit has a same depth as a coding unit i) defined in the input encoded representation and ii) occupying a same area in the picture as the coding unit. Thus, in this particular embodiment no coarser quad-tree structures are evaluated.
The depth of a CU defines, in an embodiment, the size of the CU in terms of a number of pixels relative to the size of the CTU.
Coding parameters defining the quad-tree structure or the respective depths of the coding units in the quad-tree structure typically include so-called split flags. For instance, the following set of split flags would represent the quad-tree structure of
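As a purely illustrative example of such signaling, the depth-first flag sequence 1 1 0 0 0 0 0 0 0 would describe a 64×64 CTU split into four 32×32 CUs, of which only the first is split further into four 16×16 CUs. A minimal Python sketch of the corresponding depth-first parse, assuming one flag per CU larger than the smallest CU size:

def parse_split_flags(flags, size=64, min_size=8):
    # Consume one split flag per CU larger than the smallest CU size;
    # the smallest CUs carry no flag and are always leaves.
    flag = flags.pop(0) if size > min_size else 0
    if flag == 0:
        return [size]
    leaves = []
    for _ in range(4):  # depth-first over the four sub-CUs
        leaves += parse_split_flags(flags, size // 2, min_size)
    return leaves

print(parse_split_flags([1, 1, 0, 0, 0, 0, 0, 0, 0]))  # [16, 16, 16, 16, 32, 32, 32]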
In a fourth embodiment, the CTU is recursively split in a quad-tree structure of CUs having a respective prediction mode. In this embodiment, the coding parameters extracted in step S1 define the respective prediction modes. Step S2 then preferably comprises determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which a, preferably each, CU of the CTU has a same prediction mode as a CU i) defined in the input encoded representation and ii) occupying an area of the picture encompassing an area occupied by the CU.
In a particular embodiment the prediction mode is selected from a group consisting of intra prediction, inter prediction and skip mode.
The fourth embodiment reduces the search space by re-using the prediction modes. Consider the example CTU of
This quad-tree is smaller than the full quad-tree shown in
In a fifth embodiment, the CTU is recursively split in a quad-tree structure of CUs having a respective partitioning mode. In this embodiment, the coding parameters extracted in step S1 define the respective partitioning modes. Step S2 then preferably comprises determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which a, preferably each, CU of the CTU has a same or shallower partitioning mode as compared to a CU i) defined in the input encoded representation and ii) occupying an area of the picture encompassing an area occupied by the CU.
In a particular embodiment, a partitioning mode of a CU (CB) defines a respective size of one or more PUs (PBs) in terms of a number of pixels into which the CU (CB) is split. In another particular embodiment, the one or more PUs of the CU have a same prediction mode selected from intra prediction or inter prediction but may have different intra prediction modes or different motion vectors.
Table 1 below indicates the search sub-space of same or shallower partitioning modes for various input partitioning modes.
The search space is reduced in a similar way as in the fourth embodiment. An example of the search space is shown in
For example, a transcoder that decides to split the 64×64 CTU has to evaluate three prediction modes for 2N×2N nodes, i.e. inter 2N×2N, intra 2N×2N and skip 2N×2N. This search space is smaller as compared to the full search space, in which the transcoder would also have to try the possible inter and intra PU splits, e.g. 2N×N, N×2N, etc.
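One possible reading of the "same or shallower partitioning mode" rule is sketched below in Python, under the assumption, made here only for illustration, that a shallower partitioning mode is one with fewer and larger partitions:

# Partition "depth": 2Nx2N (one PU) < 2NxN / Nx2N (two PUs) < NxN (four PUs).
PARTITION_DEPTH = {'2Nx2N': 0, '2NxN': 1, 'Nx2N': 1, 'NxN': 2}

def same_or_shallower_partitions(input_mode):
    # Keep the input mode itself plus every strictly shallower mode.
    d = PARTITION_DEPTH[input_mode]
    return [m for m, md in PARTITION_DEPTH.items() if md < d] + [input_mode]

print(same_or_shallower_partitions('Nx2N'))  # ['2Nx2N', 'Nx2N']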
In a sixth embodiment, the CTU is recursively split in a quad-tree structure of CUs having a respective prediction mode. In this embodiment, the coding parameters extracted in step S1 define at least one respective intra prediction mode of any CU having intra prediction as prediction mode. Step S2 preferably comprises determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which a, preferably each, CU, occupying an area of the picture encompassed by an area occupied by a CU i) defined in the input encoded representation and ii) having intra prediction as prediction mode, has a same or neighboring intra prediction mode as the CU i) defined in the input encoded representation and ii) having intra prediction as prediction mode.
In an embodiment, neighboring intra prediction mode refers to the available intra prediction directions. Generally, there are 35 intra prediction modes: mode 0 represents planar mode, mode 1 represents DC mode and modes 2-34 represent 33 different intra prediction directions. For instance, if the CU i) defined in the input encoded representation and ii) having intra prediction as prediction mode has intra prediction mode number X∈[2,34], then the neighboring intra prediction modes include the intra prediction modes with numbers within the interval [X−Y, X+Y]∩[2,34], where Y is a defined integer. Hence, in particular embodiments Y=1 or Y=2 or Y=3 as illustrative but non-limiting examples. Thus, neighboring intra prediction modes have similar intra prediction directions. For instance, intra prediction modes number 6 and 8 could be regarded as neighboring intra prediction modes for intra prediction mode number 7.
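A minimal Python sketch of this neighborhood, using the clamping to the angular range [2, 34] described above (the function name is illustrative only):

def neighboring_intra_modes(x, y=1):
    # Angular modes within [x - y, x + y], clamped to the valid range [2, 34];
    # planar (0) and DC (1) are not given directional neighbors here.
    return list(range(max(2, x - y), min(34, x + y) + 1))

print(neighboring_intra_modes(7))  # y = 1 around mode 7 gives [6, 7, 8]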
For each I-picture, and for intra-coded CUs in P-pictures and B-pictures, the intra directions of the intra-coded input CUs will be re-used. As shown in
In a seventh embodiment, the CTU is recursively split in a quad-tree structure of CUs having a respective prediction mode. In this embodiment, the coding parameters extracted in step S1 define at least one motion vector of any CU having inter prediction as prediction mode. Step S2 preferably comprises determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which a, preferably each, CU, occupying an area of the picture encompassed by an area occupied by a CU i) defined in the input encoded representation and ii) having inter prediction as prediction mode, has same or neighboring motion vector or vectors as the CU i) defined in the input encoded representation and ii) having inter prediction as prediction mode.
A motion vector could, in an embodiment, be represented by an X-component and a Y-component [X, Y]. In such a case, a neighboring motion vector could be defined as a motion vector within a range of motion vectors [X−x, Y−y] to [X+x, Y+y], such as [X−x, Y−x] to [X+x, Y+x]. In such a case, the parameters x, y or only x, if the same interval is used for both vector components, could be signaled in the video bit-stream or be pre-defined and thereby known to the transcoder. The parameters x, y define the search space of motion vectors around the motion vector [X, Y] that could be used if neighboring motion vectors are available.
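A corresponding sketch of the resulting motion vector search window, with X, Y, x and y as defined above (the function name is illustrative only):

def neighboring_motion_vectors(mv, x=1, y=1):
    # All candidate vectors in the window [X - x, X + x] x [Y - y, Y + y].
    X, Y = mv
    return [(X + dx, Y + dy)
            for dx in range(-x, x + 1)
            for dy in range(-y, y + 1)]

print(len(neighboring_motion_vectors((4, -2))))  # a 3x3 window: 9 candidates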
A P-predicted CU has a single motion vector, whereas a B-predicted CU has two motion vectors.
For each B- or P-picture, motion vectors from inter-coded input CUs will be re-used. For successful re-use, the corresponding CUs will, preferably, also re-use the PU split (partitioning mode) and prediction mode.
In an eighth embodiment, the CTU is recursively split in a quad-tree structure of CUs comprising one or more transform units (TUs) for which a block transform is applied during decoding. In this embodiment, the coding parameters extracted in step S1 define the TU sizes. Step S2 preferably comprises determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which one or more TUs of a, preferably each, CU of the CTU have a same or larger size in terms of number of pixels as compared to the one or more TUs of a CU i) defined in the input encoded representation and ii) occupying a same area of the picture as the CU.
In a particular embodiment, step S2 comprises determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which one or more TUs of a, preferably each, CU of the CTU have a same size in terms of number of pixels as the one or more TUs of a CU i) defined in the input encoded representation and ii) occupying a same area of the picture as the CU.
Prediction residuals of CUs are coded using block transforms. A TU tree structure has its root at the CU level. The luma CB residual may be identical to the luma transform block (TB) or may be further split into smaller luma TBs. The same applies to the chroma TBs.
In an implementation embodiment, TU sizes may be re-used (copied) from the input bit-stream. That would require the fewest evaluations and thus be fast, although it would generally not lead to very good compression results. Alternatively, for each CU, all possible TU sizes could be evaluated by the encoder. That would require the largest number of TU evaluations and thus be the most computationally expensive, while it may lead to the best compression results. As an intermediate alternative, the TU sizes to be evaluated could be derived based on the TU size in the input stream. For instance, if a 16×16 CU uses 8×8 TUs, then both an 8×8 TU size and a 16×16 TU size could be evaluated. The motivation for evaluating coarser TU partitions is that coarser TUs may require less signaling, in particular if the quantization step size used in the input bitstream is larger than the quantization step size used in the output bitstream. Thus, this intermediate alternative could be almost as good in compression efficiency as the exhaustive alternative while not being much slower than re-using TU sizes.
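A sketch of this intermediate alternative in Python, under the simplifying assumptions that TUs are square and that the largest TU evaluated equals the CU size (the function name is illustrative only):

def candidate_tu_sizes(input_tu_size, cu_size):
    # Evaluate the TU size used in the input bit-stream plus every coarser
    # (larger) TU size up to the CU size itself.
    sizes, size = [], input_tu_size
    while size <= cu_size:
        sizes.append(size)
        size *= 2
    return sizes

print(candidate_tu_sizes(8, 16))  # a 16x16 CU with 8x8 input TUs: [8, 16]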
In a ninth embodiment the search sub-space is restricted by adopting any of four base methods defined below or any combination of two, three or, preferably, all four base methods.
While traversing the quad-tree structure the transcoder:
In an alternative variant of this ninth embodiment the transcoder, while traversing the quad-tree structure:
In yet another variant of this ninth embodiment the transcoder, while traversing the quad-tree structure:
In a further variant of this ninth embodiment the transcoder, while traversing the quad-tree structure:
In these variants the transcoder then selects the candidate encoded representation that provides the best encoding according to a video quality metric, such as based on a rate-distortion criterion.
Pseudo-code for implementing a variant of the ninth embodiment is presented below. This pseudo-code or algorithm implemented in the transcoder is then preferably called for each CTU, and it traverses the nodes in the quad-tree structure of the CTU until it reaches a leaf node. If split flags are used to define the quad-tree structure, and 1bin indicates a CU split and 0bin indicates no further CU split, then a leaf node is reached when a split flag has value 0bin or a smallest coding unit size has been reached and no further CU splitting is possible.
The traversed nodes are potential leaves in the output quad-tree structure, and on each traversed node different coding options are tested and the best one is selected.
In the following pseudo-code dO and dI indicate the depth in the quad-tree structure of an output (transcoded) CU and an input CU, respectively.
In this embodiment, the input CTU is recursively split in a quad-tree structure of CUs and the coding parameters extracted in step S1 define the quad-tree structure.
Step S2 preferably comprises determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which
a) a, preferably each, CU of the CTU has a same or shallower depth as compared to a CU i) defined in the input encoded representation and ii) occupying an area of the picture encompassed by an area occupied by the CU, and
b) a root CU and each CU at a depth directly above a leaf CU in the quad-tree structure has skip as prediction mode, or
c) a CU has intra prediction as prediction mode and 2N×2N as partitioning mode, or
d) a CU has inter prediction as prediction mode and 2N×2N as partitioning mode.
In an optional embodiment, step S2 comprises determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which
a) a, preferably each, CU of the CTU has a same or shallower depth as compared to a CU i) defined in the input encoded representation and ii) occupying an area of the picture encompassed by an area occupied by the CU, and
b) a root CU and each CU at a depth directly above a leaf CU in the quad-tree structure and each leaf CU has skip as prediction mode, or
c) a CU has intra prediction as prediction mode and 2N×2N or a same partitioning mode as the CU i) defined in the input encoded representation and ii) occupying an area of the picture encompassed by an area occupied by the CU, or
d) a CU has inter prediction as prediction mode and 2N×2N or a same partitioning mode as the CU i) defined in the input encoded representation and ii) occupying an area of the picture encompassed by an area occupied by the CU.
In another optional embodiment, step S2 comprises determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which
a) a, preferably each, CU of the CTU has a same or shallower depth as compared to a CU i) defined in the input encoded representation and ii) occupying an area of the picture encompassed by an area occupied by the CU, and
b) a root CU and each CU at a depth directly above a leaf CU in the quad-tree structure has skip as prediction mode, or
c) a non-leaf CU has intra prediction as prediction mode and 2N×2N as partitioning mode, or
d) a non-leaf CU has inter prediction as prediction mode and 2N×2N as partitioning mode, or
e) a leaf CU has a same prediction and partitioning modes as the CU i) defined in the input encoded representation and ii) occupying an area of the picture encompassed by an area occupied by the CU.
The various embodiments discussed in the foregoing could be used separately or combined. For instance, the coding parameters could include any combination of parameters or information i) defining the quad-tree structure or respective depths of the CUs in the CTU, ii) defining the respective prediction modes of the CUs in the CTU, iii) defining the respective partitioning modes of the CUs in the CTU, iv) defining the intra prediction modes of any intra predicted CU of the CTU, v) defining the motion vector(s) of any inter predicted CU in the CTU, and vi) defining TU sizes of the CUs in the CTU. When implementing embodiments as combinations of i) to vi) above the combination could use coding parameters being a combination of two of i) to vi), three of i) to vi), four of i) to vi), five of i) to vi) or all of i) to vi).
Generally, intra prediction mode and motion vectors are defined on a PU basis. Hence, a CU that is restricted to have a same or neighboring intra prediction mode as a CU defined in the input encoded representation and having intra prediction as prediction mode preferably implies that the PU(s) of the CU has/have a same or neighboring intra prediction mode as the corresponding PU(s) of the CU defined in the input encoded representation and having intra prediction as prediction mode. Correspondingly, a CU that is restricted to have same or neighboring motion vector or vectors as a CU defined in the input encoded representation and having inter prediction as prediction mode preferably implies that the PU(s) of the CU has/have same or neighboring motion vector or vectors as the corresponding PU(s) of the CU defined in the input encoded representation and having inter prediction as prediction mode.
Hence, in a particular embodiment, the at least one CTU is recursively split in a quad-tree structure of CUs having a respective prediction mode and at least one PU. The coding parameters define at least one respective intra prediction mode of any CU having intra prediction as prediction mode. Determining the search sub-space preferably comprises, in a particular embodiment, determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which a, preferably each, PU belonging to a CU having intra prediction as prediction mode and occupying an area of the picture encompassed by an area occupied by a PU i) defined in the input encoded representation and ii) belonging to a CU having intra prediction as prediction mode has a same or neighboring intra prediction mode as the PU i) defined in the input encoded representation and ii) belonging to a CU having intra prediction as prediction mode.
In another particular embodiment, the at least one CTU is recursively split in a quad-tree structure of CUs having a respective prediction mode and at least one PU. The coding parameters define at least one motion vector for any PU belonging to a CU having inter prediction as prediction mode. Determining the search sub-space preferably comprises, in a particular embodiment, determining, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which a, preferably each, PU belonging to a CU having inter prediction as prediction mode and occupying an area of the picture encompassed by an area occupied by a PU i) defined in the input encoded representation and ii) belonging to a CU having inter prediction as prediction mode has same or neighboring motion vector or vectors as the PU i) defined in the input encoded representation and ii) belonging to a CU having inter prediction as prediction mode.
In an embodiment, the decoder 110 is configured to decode the input encoded representation to obtain pixel values or data of the pixels and the coding parameters. The coding parameters are then preferably input to the search sub-space determiner 120 and the pixel values are preferably input to the encoder 130. The encoder 130 is, in this embodiment, configured to encode the pixel values to get the output encoded representation belonging to the search sub-space determined by the search sub-space determiner 120.
In a particular embodiment, the encoder 130 is configured to select, as the output encoded representation, the candidate encoded representation belonging to the search sub-space and optimizing a rate-distortion quality metric.
In the embodiment as shown in
In an embodiment, the at least one coding tree unit is recursively split in a quad-tree structure of coding units. In this embodiment, the coding parameters define the quad-tree structure. The search sub-space determiner 120 is, in this embodiment, configured to determine, based on the coding parameters, the search sub-space consisting of candidate encoded representations re-using the quad-tree structure defining the split of the coding tree unit into coding units.
In an embodiment, the at least one coding tree unit is recursively split in a quad-tree structure of coding units having a respective depth within the quad-tree structure. In this embodiment, the coding parameters define the respective depths of the coding units in the quad-tree structure. The search sub-space determiner 120 is, in this embodiment, configured to determine, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which a, preferably each, coding unit has a same or shallower depth as compared to a coding unit i) defined in the input encoded representation and ii) occupying an area of the picture encompassed by an area occupied by the coding unit.
In a particular embodiment, the search sub-space determiner 120 is configured to determine, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which each coding unit has a same depth as a coding unit i) defined in the input encoded representation and ii) occupying a same area in the picture as the coding unit.
In an embodiment, the at least one coding tree unit is recursively split in a quad-tree structure of coding units having a respective prediction mode. In this embodiment, the coding parameters define the respective prediction modes. The search sub-space determiner 120 is, in this embodiment, configured to determine, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which a, preferably each, coding unit has a same prediction mode as a coding unit i) defined in the input encoded representation and ii) occupying an area of the picture encompassing an area occupied by the coding unit.
In an embodiment, the at least one coding tree unit is recursively split in a quad-tree structure of coding units having a respective partitioning mode. In this embodiment, the coding parameters define the respective partitioning modes. The search sub-space determiner 120 is, in this embodiment, configured to determine, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which a, preferably each, coding unit has a same or shallower partitioning mode as compared to a coding unit i) defined in the input encoded representation and ii) occupying an area of the picture encompassing an area occupied by the coding unit.
In an embodiment, the at least one coding tree unit is recursively split in a quad-tree structure of coding units having a respective prediction mode. In this embodiment, the coding parameters define at least one respective intra prediction mode of any coding unit having intra prediction as prediction mode. The search sub-space determiner 120 is, in this embodiment, configured to determine, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which a, preferably each, coding unit occupying an area of the picture encompassed by an area occupied by a coding unit i) defined in the input encoded representation and ii) having intra prediction as prediction mode has a same or neighboring intra prediction mode as the coding unit i) defined in the input encoded representation and ii) having intra prediction as prediction mode.
In an embodiment, the at least one coding tree unit is recursively split in a quad-tree structure of coding units having a respective prediction mode. In this embodiment, the coding parameters define at least one motion vector of any coding unit having inter prediction as prediction mode. The search sub-space determiner 120 is, in this embodiment, configured to determine, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which a, preferably each, coding unit occupying an area of the picture encompassed by an area occupied by a coding unit i) defined in the input encoded representation and ii) having inter prediction as prediction mode has same or neighboring motion vector or vectors as the coding unit i) defined in the input encoded representation and ii) having inter prediction as prediction mode.
In an embodiment, the at least one coding tree unit is recursively split in a quad-tree structure of coding units having a respective depth within the quad-tree structure. Each coding unit has a respective prediction mode and a respective partitioning mode. In this embodiment, the coding parameters define the respective depths, the respective prediction modes, the respective partitioning modes, at least one motion vector for any coding unit having inter prediction as prediction mode and at least one intra prediction mode for any coding units having intra prediction as prediction mode. The search sub-space determiner 120 is, in this embodiment, configured to determine, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which a, preferably each, coding unit has
a) a same depth, a same prediction mode and a same partitioning mode as a coding unit i) defined in the input encoded representation and ii) occupying a same area of the picture as the coding unit, and
b) same, or optionally neighboring, motion vector or vectors as a coding unit i) defined in the input encoded representation, ii) having inter prediction as prediction mode and iii) occupying a same area of the picture as the coding unit, or
c) a same, or optionally neighboring, intra prediction mode as a coding unit i) defined in the input encoded representation, ii) having intra prediction as prediction mode and iii) occupying a same area of the picture as the coding unit.
In an embodiment, the at least one coding tree unit is recursively split in a quad-tree structure of coding units comprising one or more transform units for which a block transform is applied during decoding. In this embodiment, the coding parameters define the transform unit sizes. The search sub-space determiner 120 is, in this embodiment, configured to determine, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which one or more transform units of a, preferably each, coding unit have a same or larger size in terms of number of pixels as compared to the one or more transform units of a coding unit i) defined in the input encoded representation and ii) occupying a same area of the picture as the coding unit.
In a particular embodiment, the search sub-space determiner 120 is configured to determine, based on the coding parameters, the search sub-space consisting of candidate encoded representations in which one or more transform units of a, preferably each, coding unit have a same size in terms of number of pixels as the one or more transform units of a coding unit i) defined in the input encoded representation and ii) occupying a same area of the picture as the coding unit.
Coding parameters obtained by the decoder 110 while decoding the input encoded representation are forwarded to an optional adjuster 150 that adjusts the coding parameters to match the down-sampling in pixels. Hence, the adjuster 150 could be used to correspondingly down-sample the coding parameters to match the down-sampled layout of pixels and CUs in the CTU. The optionally adjusted coding parameters are input to the search sub-space determiner 120 to determine the search sub-space for the candidate encoded representations.
A current block of pixels, i.e. PU, is predicted by performing a motion estimation from an already provided block of pixels in the same picture or in a previous or future picture obtained from a decoded picture buffer. The result of the motion estimation is a motion vector allowing identification of the reference block of pixels. The motion vector is utilized in a motion compensation for outputting an inter prediction of the PU.
An intra picture estimation is performed for the PU according to various available intra prediction modes. The result of the intra prediction is an intra prediction mode number. This intra prediction mode number is utilized in an intra picture prediction for outputting an intra prediction of the PU.
Either the output from the motion compensation or the output from the intra picture prediction is selected for the PU. The selected output is input to an error calculator in the form of an adder that also receives the pixel values of the PU. The adder calculates and outputs a residual error as the difference in pixel values between the PU and its prediction.
The error is transformed, scaled and quantized to form quantized transform coefficients that are encoded by an encoder, such as an entropy encoder. In inter coding, the estimated motion vectors are also brought to the entropy encoder, as is the intra prediction data for intra coding.
The transformed, scaled and quantized residual error for the PU is also subject to an inverse scaling, quantization and transform to retrieve the original residual error. This error is added by an adder to the PU prediction output from the motion compensation or the intra picture prediction to create a reference PU of pixels that can be used in the prediction and coding of a next PU of pixels. This new reference PU is first processed by deblocking and sample adaptive offset (SAO) filters to combat any artifacts. The processed new reference PU is then temporarily stored in the decoded picture buffer.
These residual errors are added in an adder to the pixel values of a reference block of pixels. The reference block is determined in a motion estimation or intra prediction depending on whether inter or intra prediction is performed. The resulting decoded PU of pixels output from the adder is input to SAO and deblocking filters to combat any artifacts. The filtered PU is temporarily stored in a decoded picture buffer and can be used as reference block of pixels for any subsequent PU to be decoded. The output from the adder is preferably also input to the intra prediction to be used as an unfiltered reference block of pixels.
The transcoder 100 of
The transcoder 100 described herein could alternatively be implemented e.g. by one or more of a processing unit 12 in a computer 10 and adequate software with suitable storage or memory therefor, a programmable logic device (PLD) or other electronic component(s) as shown in
The steps, functions and/or units described herein may be implemented in hardware using any conventional technology, such as discrete circuit or integrated circuit technology, including both general-purpose electronic circuitry and application-specific circuitry.
Particular examples include one or more suitably configured digital signal processors and other known electronic circuits, e.g. discrete logic gates interconnected to perform a specialized function, or Application Specific Integrated Circuits (ASICs).
Alternatively, at least some of the steps, functions and/or units described herein may be implemented in software such as a computer program for execution by suitable processing circuitry such as one or more processors or processing units.
The flow charts presented herein may therefore be regarded as computer flow diagrams when performed by one or more processors. A corresponding apparatus may be defined as a group of function modules or units, see
Thus,
Examples of processing circuitry and processors includes, but is not limited to, one or more microprocessors, one or more Digital Signal Processors (DSPs), one or more Central Processing Units (CPUs), video acceleration hardware, and/or any suitable programmable logic circuitry such as one or more Field Programmable Gate Arrays (FPGAs), or one or more Programmable Logic Controllers (PLCs).
In an embodiment, the processor 210 and the memory 220 are interconnected to each other to enable normal software execution. An optional input/output (I/O) unit 230 may also be interconnected to the processor 210 and/or the memory 220 to enable input of the bit-stream to be transcoded and output of the transcoded bitstream.
It should also be understood that it may be possible to re-use the general processing capabilities of any conventional device or unit in which the proposed technology is implemented. It may also be possible to re-use existing software, e.g. by reprogramming of the existing software or by adding new software components.
Furthermore, the computer 10 comprises at least one computer program product 13 in the form of a non-volatile memory, for instance an EEPROM (Electrically Erasable Programmable Read-Only Memory), a flash memory or a disk drive. The computer program product 13 comprises a computer program 14, which comprises code means which when run on or executed by the computer 10, such as by the processing unit 12, causes the computer 10 to perform the steps of the method described in the foregoing in connection with
In an embodiment the computer program 14 is a computer program 14 configured to transcode a CTU of a picture in a video sequence. The CTU comprises one or multiple CUs of pixels. The computer program 14 comprises code means, also referred to as program code, which when run on the computer 10 causes the computer to decode an input encoded representation of the CTU to obtain coding parameters for the input encoded representation. The code means also causes the computer 10 to determine, based on the coding parameters, a search sub-space consisting of a subset of all possible combinations of candidate encoded representations of the CTU. The code means further causes the computer 10 to encode the CTU to get an output encoded representation of the CTU belonging to the search sub-space.
An embodiment also relates to a computer program product 13 comprising computer readable code means and a computer program 14 as defined according to above stored on the computer readable code means.
Another embodiment relates to a carrier comprising a computer program as defined above. The carrier is one of an electronic signal, an optical signal, an electromagnetic signal, a magnetic signal, an electric signal, a radio signal, a microwave signal, or a computer-readable storage medium.
An electric signal could be a digital electric signal, such as represented by a series of 0bin and 1bin, or an analogue electric signal. Electromagnetic signals include various types of electromagnetic signals, including infrared (IR) signals. A radio signal could be a radio signal adapted either for short range communication, such as Bluetooth®, or for long range communication.
In an embodiment, the transcoder 100 is implemented in a user equipment or terminal 80 as shown in
The encoded representations are brought from the memory 84 to a transcoder 100, such as the transcoder illustrated in any of
As illustrated in
The present embodiments are particularly suitable for the HEVC video coding standard. In such a case, the HEVC transcoder or transcoding method is configured to transcode an input HEVC encoded representation of a CTU of a picture in a HEVC video sequence into an output HEVC encoded representation.
The embodiments could, however, also be applied to other video coding standards using a quad-tree structure for defining blocks of pixels in a picture. An example of such another video coding standard is VP9.
Herein follows a particular implementation example of a transcoder. Table 2 below lists the functions used by the transcoder.
The transcoder takes the data constituting a picture as input. A picture is usually made of one or more slices and each slice is created from a collection of CTUs. For simplicity, assume each picture is made of a single slice. The transcoder processes each CTU, i.e. largest CU, of a slice in raster scan order. This is illustrated by the following pseudo-code.
loop over every LCU
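Rendered as Python rather than pseudo-code, and with the slice object and transcode_cu as hypothetical placeholders, this top-level loop could look as follows:

# Process every CTU (largest CU) of the slice in raster scan order.
for ctu in current_slice.ctus_in_raster_scan_order():
    transcode_cu(ctu, depth=0)  # recursive traversal, see TranscodeCU below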
The function TranscodeCU recursively traverses the CTU based on quad-tree structure of input picture until a leaf node is reached. This quad-tree structure is a single realization of every possible structure and it is extended by making decisions on other branch possibilities in each tree node.
The implementation example of the transcoder is based on three important observations:
1) Skipped blocks require the least bits to encode;
2) Merging blocks reduces the number of bits;
3) 2N×2N splitting requires only one motion vector to be signaled. Hence, it is very possible that it will require fewer bits to encode the block. The proposed pseudo-code based on these observations is described further below.
On each recursive iteration the transcoder will try to encode the current CU with:
1) 2N×2N intra prediction by re-calculating intra prediction direction;
2) Inter prediction with 2N×2N PU split and re-calculating the motion vector;
3) Inter prediction by re-using partitioning mode and re-calculating motion vector;
4) Skip mode on root CU, the CU before leaf node, and leaf node;
5) Re-using partitioning mode on leaf node and re-calculating the intra prediction directions and motion vectors.
The output CU structure will be a sub-set of the input CU structure, i.e. quad-tree structure, meaning there will be no input quad-tree node that will be split further. Therefore, every CU will be shallower or of equal depth compared to the input CUs. As seen in line 58 of the pseudo-code, the final CU coding will be chosen as the one that is best with regard to the rate-distortion criterion; a sketch assembling these decisions is given after the comment lines below.
current CU is split to four sub-CUs
try intra mode for non-leaf CU
try no split PU for non-leaf CU
N×N PU split is only permitted for the smallest CU size
for non-I-slices: at the root and one node before the leaf, try SKIP mode
leaf node
try SKIP on every leaf
try no split PU for leaf
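Assembled from the comment lines above and the five coding options listed earlier, a hedged Python sketch of TranscodeCU could look as follows; the RD-cost helpers (try_*), the CU accessors and code_cu are hypothetical placeholders, and the actual pseudo-code may differ in detail:

def transcode_cu(cu, depth):
    # Recursively traverse the input quad-tree; at each node, test a
    # restricted set of coding options and keep the one with the lowest
    # rate-distortion cost. Each try_* helper returns an RD cost.
    candidates = []
    if cu.is_split_in_input():            # non-leaf node of the input tree
        candidates.append(try_intra_2Nx2N(cu))   # re-calculate intra direction
        candidates.append(try_inter_2Nx2N(cu))   # no-split PU, re-calculate MV
        if not cu.in_i_slice() and (depth == 0 or cu.children_are_leaves()):
            candidates.append(try_skip(cu))      # SKIP at root and one node before leaf
        # Follow the input split: the four sub-CUs are transcoded recursively.
        candidates.append(sum(transcode_cu(sub, depth + 1)
                              for sub in cu.sub_cus()))
    else:                                 # leaf node of the input tree
        if not cu.in_i_slice():
            candidates.append(try_skip(cu))      # try SKIP on every leaf
        candidates.append(try_inter_2Nx2N(cu))   # no-split PU for leaf
        # Re-use the input partitioning mode, re-calculating intra directions
        # and motion vectors; NxN is only permitted for the smallest CU size.
        candidates.append(try_input_partitioning(cu))
    best = min(candidates)                # best in the rate-distortion sense
    code_cu(cu, best)
    return best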
The performance of the proposed transcoder is measured using 20 video sequences divided into five classes based on video resolution, see Annex. An example is given here. Class C of the test sequences includes video conferencing sequences with 1280×720 resolution. The performance is measured by defining a bit-rate ratio (r) and an overhead (O). The bit-rate ratio determines the bit-rate reduction of the transcoded bit-stream over the input bit-stream:
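In a form consistent with this description, the bit-rate ratio can be assumed to be defined as

r = RS/RB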
In this equation, RS is the bit-rate of the input stream and RB is the base bit-rate. Higher bit-rate ratio means higher compression.
Overhead determines the trade-off between bit-rate and quality:
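In a form consistent with this description, and under the assumption that the base bit-rate RB here denotes the bit-rate of the reference encoding described next, the overhead can be written as

O = (RT − RB)/RB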
wherein RT is the bit-rate of the transcoded stream. Lower overhead is better. The overhead is calculated in comparison to the bit-rate of an encoder that has access to the original video sequence and has encoded it with PSNR quality equal to that of the transcoded bit-stream.
As illustrated in
PSNR and bit-rate comparison of the transcoder of the embodiment against the simple cascaded transcoder for sequence KristenAndSara is shown in
The present embodiments promote inter-operability by enabling, for instance, HEVC to HEVC transcoding. The embodiments provide a fast transcoding that requires low computational power and produces excellent video quality.
The embodiments described above are to be understood as a few illustrative examples of the present invention. It will be understood by those skilled in the art that various modifications, combinations and changes may be made to the embodiments without departing from the scope of the present invention. In particular, different part solutions in the different embodiments can be combined in other configurations, where technically possible.
List of Abbreviations
AT Advanced Transcoding.
AVC Advanced Video Coding.
CODEC enCOder and DECoder.
CTB Coding Tree Block.
CTU Coding Tree Unit.
CU Coding Unit.
DCT Discrete Cosine Transform.
FPR Full Prediction Re-use.
GOP Group of Pictures.
HEVC High Efficiency Video Coding.
IPR Intra Prediction Re-estimation.
ISO International Organization for Standardization.
ITU International Telecommunication Union.
LCU Largest Coding Unit.
MB Macro Block.
MC Motion Compensation.
ME Motion Estimation.
MPEG Moving Picture Experts Group.
MR MV Re-estimation.
NAL Network Abstraction Layer.
P Predictive.
PSNR Peak Signal-to-Noise Ratio.
PU Prediction Unit.
Q Quantization.
QP Quantization Parameter.
RDO Rate-Distortion Optimization.
RD Rate-Distortion.
SVC Scalable Video Coding.
TU Transform Unit.
Proposed HEVC Transcoding Models
A simple drift-free transcoding could be achieved by cascading a decoder and an encoder, where at the encoder side the video is encoded with regard to the target platform specifications. This solution is computationally expensive; however, the video quality is preserved. The preservation of video quality is an important characteristic, since it provides a benchmark for more advanced transcoding methods. This solution is keyworded Simple Cascaded Transcoding (SCT), so as to differentiate it from the advanced cascaded transcoding methods proposed in this Annex.
The developed transcoding models are designed with two goals:
1) understanding the extent to which the transcoding time is reduced by exploiting the information available from the input bit-stream;
2) reducing the bit-rate and producing video quality as close as possible to that of the SCT model while minimizing the transcoding time.
To preserve the video quality, a closed-loop architecture is used. Closed-loop architectures are drift free, and drift is a major source of quality loss.
Four transcoding models are developed for bit-rate reduction. These models are based on the idea that the input bit-stream contains valuable information for fast re-encoding. Based on how the information is reused, these models are key-worded: Full Prediction Reuse (FPR); Intra Prediction Re-estimation (IPR); MV Re-estimation (MR); and Advanced Transcoding (AT).
FPR is spatial-domain transcoding which reuses all the information available in the spatial domain. IPR is similar to FPR, with one major difference: intra prediction is carried out fully for intra pictures, because in applications with a random access requirement there are I-pictures at the beginning of each GOP, and these I-pictures can be expected to have a great impact on the quality of the following B- and P-pictures. To measure this impact the IPR transcoding model is developed.
The MR transcoding model is similar to IPR with the addition of a full MV search. This change is made with the goal of understanding how much the video quality could be improved if the transcoder were free to search for new MVs at the CU level. The AT model is designed to get as close as possible to the cascaded transcoding quality and bit-rate with minimum transcoding time.
Full Prediction Reuse (FPR) Transcoding Model
The idea is to recursively traverse the Coding Tree Unit (CTU) and, when a leaf node is reached, which is determined by examining the depth of the decoded CU from the input bit-stream (CUI), encode the CU using the information available in CUI. The input CTU structure is replicated in the transcoded CTU. A sketch of this traversal is given after the pseudo-code comments below.
branching continues until leaf node is reached
intra mode CUO
inter mode CUO
current CU is split to four sub-CUs
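The traversal can be sketched as follows; new_cu_like, copy_intra_data, copy_inter_data and encode_residual are hypothetical stand-ins for the corresponding encoder routines:

    # Hedged sketch of the FPR traversal: the input quad-tree is replicated
    # and the prediction data of the decoded input CU (CUI) is copied into
    # the output CU (CUO), while residuals are re-calculated.
    def fpr_transcode_cu(cui):
        cuo = new_cu_like(cui)                # replicate size, position, depth
        if cui.is_leaf:
            if cui.is_intra:
                copy_intra_data(cui, cuo)     # intra mode CUO
            else:
                copy_inter_data(cui, cuo)     # inter mode CUO (MVs, PU split)
            encode_residual(cuo)              # residuals are re-calculated
        else:
            # branching continues until a leaf node is reached
            cuo.sub_cus = [fpr_transcode_cu(sub) for sub in cui.sub_cus]
        return cuo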
Intra Prediction Re-Estimation (IPR) Transcoding Model
The model structure is the same as FPR; however, for intra coded pictures the intra directions and modes are re-estimated in the same manner as in the reference encoder. The input CTU structure is replicated in the transcoded CTU.
branching continues until leaf node is reached
intra directions are recalculated for I-slices
intra mode CUO
inter mode CUO
current CU is split to four sub-CUs
MV Re-Estimation (MR) Transcoding Model
In succession to the IPR model, when a leaf node is reached the MVs are re-estimated by examining all the candidates in the same manner as the reference encoder. The input CTU structure is replicated in the transcoded CTU. A sketch of the leaf handling for the IPR and MR variants is given after the pseudo-code comments below.
branching continues until leaf node is reached
re-calculate MV index
re-calculate MV
re-calculate intra directions
current CU is split to four sub-CUs
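The IPR and MR models differ from FPR only in what is re-estimated once a leaf node is reached. A minimal sketch of the leaf handling, with re_estimate_intra_directions, re_estimate_mv and copy_prediction_data as hypothetical stand-ins for the reference-encoder search routines:

    # Hedged sketch of the leaf handling for the FPR, IPR and MR variants.
    def transcode_leaf(cui, cuo, model, slice_type):
        if model == 'IPR':
            if slice_type == 'I':
                re_estimate_intra_directions(cui, cuo)  # full intra search
            else:
                copy_prediction_data(cui, cuo)          # FPR behaviour
        elif model == 'MR':
            if cui.is_intra:
                re_estimate_intra_directions(cui, cuo)  # re-calculate directions
            else:
                re_estimate_mv(cui, cuo)                # full MV (and index) search
        else:  # 'FPR'
            copy_prediction_data(cui, cuo)              # plain re-use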
Advanced Transcoding (AT) Model
Three observations are important for understanding this model:
1) Skipped blocks require the least bits to encode;
2) Merging blocks reduces the number of bits;
3) 2N×2N splitting requires only one motion vector to be signalled; hence it is very possible that it will require fewer bits to encode the block.
The heuristics built upon these observations are: 1) try skip and merge combinations on the root node of the tree and on the node before the leaf node; 2) try inter- and intra-coding with the size of 2N×2N on each node.
current CU is split to four sub-CUs
try intra mode for non-leaf CU
try no split PU for non-leaf CU
N × N PU split is only permitted for the smallest CU size
for non-I-slices: at the root and one node before the leaf, try SKIP mode
leaf node
try SKIP on every leaf
try no split PU for leaf
On each recursive iteration, AT will try to encode the current CU with:
1) 2N×2N intra prediction by re-calculating the intra prediction direction;
2) Inter prediction with a 2N×2N PU split and re-calculating the motion vector;
3) Inter prediction by re-using the PU split mode and re-calculating the motion vector;
4) Skip mode on the root CU, the CU before the leaf node, and the leaf node;
5) Re-using the PU mode on the leaf node and re-calculating the intra prediction directions and motion vectors.
The output CU structure will be a sub-set of the input CU structure, meaning that no input tree node will be split further. Therefore, every output CU will be shallower than or of equal depth to the corresponding input CU. As seen in line 58 of Algorithm 5, the final CU coding is chosen as the one that is best with regard to the rate-distortion criterion.
Performance Evaluation
To test the transcoding performance, a comprehensive set of simulations is designed. The idea is that the SCT model produces the best coding performance; however, it is very slow, because it requires an exhaustive search through the space of tree structures with all possible prediction and partitioning modes. In contrast, the proposed transcoding methods minimize the search for the optimal coding structure by re-using the information available in the input bit-stream. Comparing the performance of the developed transcoding models to the SCT model provides evidence of the gains that are achieved through re-using information available in the input bit-stream.
Video Quality Measurement
It is generally preferred to compress the video signal as much as possible while keeping the video quality close to the original. There are, generally, two categories of methods for measuring video quality: 1) subjective quality, which is based on test procedures devised by the ITU to quantify the quality using human observers; and 2) objective quality, which is a mathematical model approximating the subjective quality of the video. Subjective quality assessment of digital videos requires human observers, which is expensive; hence objective quality is a suitable alternative.
The main mathematical model used by researchers for developing better encoding methods is the well-known Rate-Distortion Optimization (RDO) model. In this model, distortion is optimized against rate, which is the amount of data required for encoding the input data. In video coding, every decision usually affects the Rate-Distortion (RD) values, and the challenge is to find the optimal solution. A commonly used RD criterion in video coding is the PSNR-bit-rate pair.
PSNR
The most common objective quality measure is the Peak Signal-to-Noise Ratio (PSNR). It is measured in decibels (dB) as follows:

PSNR = 10 log_10( (2^n − 1)^2 / MSE(Img1, Img2) )

In the equation, PSNR is measured relative to the Mean Square Error (MSE) between two images (Img1 and Img2), of which one image is the original and the other is the compressed image. n is the number of bits used to represent each pixel, which is normally 8 bits. A higher PSNR means that the input and output images are more similar. Typically PSNR values range between 30 dB and 50 dB, where higher is better. In transcoder design, the PSNR is calculated between the original picture and the decoded picture of the transcoded bit-stream.
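A minimal, runnable realization of this measure in Python/NumPy, assuming two equally sized arrays of pixel values:

    import numpy as np

    def psnr_db(original, compressed, n_bits=8):
        """PSNR in dB between an original and a compressed image, following
        the equation above; the peak pixel value is 2**n_bits - 1."""
        original = np.asarray(original, dtype=np.float64)
        compressed = np.asarray(compressed, dtype=np.float64)
        mse = np.mean((original - compressed) ** 2)
        if mse == 0.0:
            return float('inf')   # identical images
        peak = (2 ** n_bits - 1) ** 2
        return 10.0 * np.log10(peak / mse)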
Sole use of PSNR is insufficient for quantifying coding performance, since a higher PSNR usually requires a higher bit-rate, and a high bit-rate means a lower compression rate. For example, if the encoder output equals its input, i.e. no compression, the PSNR will be highest but the bit-rate will stay the same. The challenge is to reduce the bit-rate as much as possible while keeping the PSNR as high as possible. To mitigate these issues, it is important to compensate for the changes in bit-rate to get a better understanding of transcoding performance.
Bit-Rate
The bit-rate of a bit-stream (R) is calculated by dividing the total number of bits in the bit-stream by the length of the bit-stream measured in seconds; the result is usually expressed in kilobits per second (kbps) or megabits per second (mbps).
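In code form this is a one-line computation; the sketch below simply assumes the total bit count and the stream duration are known:

    def bit_rate_kbps(total_bits, duration_seconds):
        """Average bit-rate R of a bit-stream in kilobits per second."""
        return total_bits / duration_seconds / 1000.0

    # Example: a 12,000,000-bit stream lasting 10 s has R = 1200 kbps.
    print(bit_rate_kbps(12_000_000, 10.0))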
Transcoder Evaluation Concepts
Transcoding performance is measured by calculating the bit-rate and PSNR for base sequences and transcoded sequences. Note that only the PSNR of the luma channel is measured (Y-PSNR). For illustration purposes two sets of plots are created: 1) Rate-Distortion (RD) plots; and 2) average overhead and bit-rate ratio plots.
Bit-Rate Ratio
For bit-rate reduction transcoding, it is convenient to quantify the reduction as an inverse fraction of the input bit-rate. Denote the bit-rate of the input bit-stream by R_S and the bit-rate of the transcoded bit-stream by R_T, then define the bit-rate ratio:

r = R_S / R_T

which determines the ratio by which the bit-rate has been reduced. For example, reducing the input bit-rate by 50% requires a transcoded bit-rate equal to half of the input bit-rate, i.e. r = 2.
Overhead
A reduction in bit-rate will usually cause a reduction in quality. To account for this loss, one solution is to calculate the overhead. First define the transcoding loss (L_T) as the difference between the transcoded bit-rate (R_T) and the base bit-rate (R_B):

L_T = R_T − R_B

The overhead (O) is then defined as the ratio between the transcoding loss (L_T) and the base bit-rate (R_B):

O = L_T / R_B = (R_T − R_B) / R_B
In order to find the base bit-rate (R_B), the bit-rate on the base curve at the point with PSNR equal to that of the RD point of the transcoded sequence is located, see
Notice that PSNR_B = PSNR_T. Hence, to obtain R_B it is sufficient to fit a curve to the RD points on the base plot and interpolate it at PSNR_T. For example, given a set of RD pairs, a cubic polynomial is fit to obtain f(PSNR_B):

R_B = f(PSNR_B) = c_4 PSNR_B^3 + c_3 PSNR_B^2 + c_2 PSNR_B + c_1

f(PSNR_B) is evaluated at PSNR_B = PSNR_T to obtain R_B, where PSNR_T is measured from the transcoded sequence. The algorithm used to calculate the coefficients in the equation above is the monotone piecewise cubic interpolation of Fritsch and Carlson. This interpolation method is chosen since it produces smooth, monotone curves, which are easier to interpret. The overhead demonstrates the price that has to be paid in bits to maintain the PSNR. A low overhead shows that the transcoder coding performance, for equal PSNR values, was close to that of the first encoder; in other words, if the raw video sequence had been available and encoded directly, the RD performance would have been close to (R_B, PSNR_B). Hence a low overhead is better than a high overhead.
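The interpolation step can be sketched with SciPy, whose PchipInterpolator implements exactly this monotone piecewise cubic (Fritsch-Carlson) scheme; the base RD points in the example are made up for illustration only:

    import numpy as np
    from scipy.interpolate import PchipInterpolator

    def base_bit_rate(base_rd_points, psnr_t):
        """Interpolate the base RD curve at PSNR_T to obtain R_B, using
        monotone piecewise cubic (Fritsch-Carlson) interpolation.
        base_rd_points: (R_B, PSNR_B) pairs from the base encodings."""
        rates, psnrs = zip(*base_rd_points)
        order = np.argsort(psnrs)                 # x values must be increasing
        f = PchipInterpolator(np.asarray(psnrs)[order],
                              np.asarray(rates)[order])
        return float(f(psnr_t))

    # Made-up base RD points (kbps, dB) and a transcoded PSNR of 38.5 dB:
    base = [(550.0, 36.1), (830.0, 37.9), (1300.0, 39.6), (2100.0, 41.2)]
    print(base_bit_rate(base, 38.5))   # R_B at PSNR_B = PSNR_T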
Bit-Rate Ratio and Overhead Plots
To provide a reliable estimate of r and O, each video sequence is encoded with n different QP values, and then each bit-stream is transcoded with m different QP values. Plotting the pairs (r, O) = {(r_11, O_11), …, (r_1m, O_1m)} provides a convenient way to summarize the performance of transcoding a single bit-stream.
As an example,
From a performance perspective, the ideal case is a transcoding model that has no overhead while reducing the bit-rate, which is equivalent to an encoder that has access to the original raw video and achieves the bit-rate reduction through encoding with a higher QP. In reality this is impossible, since HEVC encoding is lossy and decoding the bit-stream will produce a raw video sequence that differs from the original.
Average Bit-Rate Ratio and Overhead Plots
Assume there is a video sequence encoded with QP_1. The transcoder will produce a set of new bit-streams encoded with QP_11, …, QP_1m. For the transcoded bit-streams there is a function fit to the points, f(r_1, O_1) = {(r_11, O_11), …, (r_1m, O_1m)}. Encoding the same video sequence with QP_2 and then transcoding with QP_21, …, QP_2m creates another chain of transcoded bit-streams, and a corresponding bit-rate ratio and overhead curve described by f(r_2, O_2) = {(r_21, O_21), …, (r_2m, O_2m)}. Ultimately, for n encoded bit-streams there will be n corresponding f(r_n, O_n).
To summarize the transcoder performance using the set of n×m pairs of (r, O), a simple solution is to fit a curve to each (r, O) set and average the result. The result is called the average bit-rate ratio and overhead curve, denoted by f(r, O).
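One way to realize this averaging is sketched below in Python/NumPy. The family of curves fitted to each (r, O) set is not specified in the foregoing; a low-degree polynomial is assumed here purely for illustration.

    import numpy as np

    def average_curve(rows, degree=2):
        """Fit a polynomial O = f(r) to each row of (r, O) pairs and average
        the coefficients, giving the average bit-rate ratio and overhead
        curve. rows: one list of (r, O) pairs per base QP value."""
        coeffs = []
        for row in rows:
            r, o = zip(*row)
            coeffs.append(np.polyfit(r, o, degree))
        return np.poly1d(np.mean(coeffs, axis=0))

    # Example with two rows of made-up (r, O) pairs:
    rows = [[(1.0, 0.05), (2.0, 0.20), (4.0, 0.60)],
            [(1.2, 0.08), (2.5, 0.30), (5.0, 0.90)]]
    f_avg = average_curve(rows)
    print(f_avg(2.0))   # average overhead at bit-rate ratio 2.0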
In the results reported in this Annex, five base and five transcoded QP values, as shown in Table 3, are used in the simulations. Each row corresponds to an input bit-stream encoded with QP_n and then transcoded with QP_nm. Base QP values of 38, 42 and 46 are included to extend the PSNR-bit-rate curve so as to avoid extrapolation in the RD plots.
The (r, O) pairs in Table 3 can also be visualized through a matrix:

[ (r_11, O_11)  …  (r_1m, O_1m) ]
[      ⋮         ⋱        ⋮      ]
[ (r_n1, O_n1)  …  (r_nm, O_nm) ]
In the matrix, each row corresponds to the input bit-stream encoded with the same QP. It is possible to fit a curve modelling f(r, O) to each row. Averaging the coefficients of these curves then gives the average bit-rate ratio and overhead curve.
The average curve corresponds to a video sequence created by concatenating sub-sequences, where each sub-sequence corresponds to the concatenation of the sequences in each column of the matrix.
f(r, O) corresponds to a single video sequence. It is also possible to group video sequences with similar characteristics and equal spatial resolution and calculate an overall (r, O) curve presenting the transcoder performance for the group. Assume there are p sequences in class A; hence there are p average bit-rate ratio and overhead functions. It is reasonable to average these functions to provide an overview performance curve of a transcoder. This average is denoted as:

f_A(r, O) = (1/p) Σ_{i=1..p} f_i(r, O)
The final curve, shown in the equation above, can be derived for each transcoding method, and drawing the curves in one plot provides a convenient way to compare the transcoding performance over several sequences, where each sequence is encoded and transcoded with a series of QP values. A hypothetical example of such a plot is shown in
By taking a quick glance at
Test Sequences and Encoder Configuration
Table 4 details the video sequences used in the simulations. The test sequences are chosen from the standard set defined by the ITU for evaluating encoder models. In this table, the spatial and temporal resolutions are given by Size and Frames per Second (FPS), respectively. The class is determined by the spatial resolution and can be: A) 2560×1600; B) 1920×1080; C) 1280×720; D) 832×480; or E) 416×240. All sequences use 4:2:0 YUV color sampling.
Two main encoder configurations are used with each simulation: Low-Delay Main (LDM) and Random-Access Main (RAM). Important characteristics of these configurations are detailed in Table 5.
Results
The results for the performance of the developed video transcoder models, described in the foregoing, are included herein using the HEVC reference encoder model HM-8.2. The developed transcoders are based on this software test model.
Base Bit-Stream
Each raw video sequence, shown in Table 4, is encoded and decoded twice (once for each encoder configuration) to produce the base curves of the HEVC reference encoder performance. The PSNR and bit-rate for each encoding are measured and denoted as Rate-Distortion (RD) pairs:
{(R_B1, PSNR_B1), (R_B2, PSNR_B2), …, (R_B8, PSNR_B8)},
where each pair corresponds to the following quantization parameters:
QP={18, 22, 26, 30, 34, 38, 42, 46}.
A total of 20 sequences were encoded twice with each QP value; hence 320 RD points for the base curves have been produced. The chosen sequences cover a wide range of real-world scenarios, ranging over sequences with complex textures, fast motion, video conferencing setups, vibrant colors, and more.
To demonstrate the transcoder performance, a sequence is chosen and discussed throughout this Annex. An important transcoding application is the video conferencing scenario, and the popular sequence Johnny demonstrates a case in point. RD values for the sequence Johnny encoded with the Low Delay Main (LDM) configuration are listed in Table 6. The low delay configuration is mostly used in video conferencing applications to minimize the delay between participants.
QP ≥ 38 in Table 6 produces poor quality videos. Such high QP values are included for completeness in the base RD plots, and they are unused in the simulations.
SCT Model
Each encoded sequence is sent to the transcoder to reduce the bit-rate. The Simple Cascaded Transcoding (SCT) model decodes the sequence and follows the procedure of the reference encoder, with the only difference being a higher QP value. The encoder will code the sequence with the goal of achieving the highest coding performance, and since the encoder is not restricted in any way, it is reasonable to assume that its coding performance is the best possible transcoding performance.
Each sequence coded with a base QP_B = {18, 22, 26, 30, 34, 38} is decoded and re-encoded with QP_B + ΔQP, where ΔQP = {0, 2, 4, 6, 8}. In addition to PSNR and bit-rate, the transcoding time (in seconds) is also measured.
The SCT model performance for Johnny is shown in Table 7. The base bit-rate, R_B, is calculated as described in the foregoing by fitting a curve to the RD_1 through RD_5 values from Table 6 and solving it for the PSNR_T values from Table 7.
To achieve a higher bit-rate reduction, the QP value is increased. For example, as noted in Table 7, ΔQP = 6 reduces the bit-rate five-fold while requiring ~12% overhead. In other words, to reduce the bit-rate to one fifth, the SCT model requires ~12% higher bit-rate compared to direct encoding of the original raw video sequence with matching bit-rate and PSNR.
In addition to bit-rate ratio and overhead plots, a common method to illustrate the coding performance in detail is through PSNR-bit-rate curves. The plots illustrating the RD transcoding performance for the Johnny sequence are shown in
An observation from
FPR Model
The aim of the Full Prediction Reuse (FPR) model is to re-use all the information available in the input bit-stream about the Coding Tree Units and Coding Units. In principle this model will be the fastest model, since the most time-consuming stages of encoding, e.g. motion estimation, are bypassed. It is also interesting to observe the changes in transcoding performance compared to the SCT model. The FPR transcoding performance for the Johnny sequence is shown in Table 8.
A striking observation in Table 8 is the increasing overhead trend of the FPR model, compared to the decreasing overhead trend of the SCT model in Table 7.
The results in Table 8 suggest that FPR transcoding achieves better bit-rate ratio performance for lower QP values than for higher QP values.
PSNR-bit-rate performance for several base QP values for FPR transcoding is plotted in
Compared to the performance reported in Table 8, the FPR transcoding performance is better with base QP = 34. For example, for ΔQP = 6 the overhead is 70.24% compared to 100.27%. However, it should be noted that the overhead depends on both the base and the transcoded RD performance; hence a lower overhead could be the result of a worse RD performance of the base encoding. This behaviour is also observed for the SCT model, where the RD performance for higher base QP values is better than for lower base QP values.
IPR Model
The Intra Prediction Re-estimation (IPR) transcoding model is similar to the FPR model. The difference is in the way I-slices are handled: the IPR transcoding model re-estimates the I-slice prediction data. The motivation for developing this method is that I-slices could improve the coding efficiency of other slices by providing accurate prediction data. This is especially true for the random access configuration of the transcoder, where each GOP is composed of eight pictures and each starting picture is an I-picture.
To evaluate the IPR model, two tests are considered: 1) a sequence is transcoded with the FPR model and the IPR model using the random access configuration; and 2) a sequence is transcoded with the FPR model using the random access and low delay configurations. The random access configuration is important for video playback and video editing. The BQTerrace (1920×1080) video sequence is used in these tests since it provides a typical public scene with motion and textures.
By comparing the overheads reported in Table 10, it is clear that the IPR transcoding model demonstrates better performance, since the overheads are lower in every case. The IPR model performs even better for higher QP values: an 18.82% overhead difference (ΔQP = 8) compared to 5.16% (ΔQP = 2). This means that coding I-slices with better quality is important for higher bit-rate reduction.
To further support the claim that the FPR model has higher transcoding performance for the random access configuration compared to low delay, the transcoding performance of the sequence Johnny with both configurations is reported in Table 11.
The LDM configuration has a single I-slice per bit-stream, as opposed to one I-slice per GOP in the RAM configuration. Based on this fact, the roughly 50% lower overhead of IPR transcoding, as noted in Table 10, is due to the intra prediction re-estimation for the higher number of I-slices.
MR Model
The Motion Re-estimation (MR) transcoding model is developed to understand the transcoder performance when the transcoder is granted the flexibility to re-estimate the motion data for every picture, while still constrained by the quad-tree structure that is copied from the input bit-stream.
Naturally, one would expect that granting the transcoder the ability to optimize the block coding would increase the transcoding performance. However, it is observed that reusing the tree structure and PU splitting of the input bit-stream has affected the performance for ΔQP = 0, as shown in Table 12.
The overhead for the FPR model with ΔQP = 0 was reported as 51.76% in Table 8, whereas the corresponding overhead for the MR model has increased by 23.81%. The reason could be that local optimization of motion data has a negative effect on the global coding performance. The tree structure extracted from the input bit-stream has been optimized in conjunction with the appropriate motion data in a global manner, and reusing that structure with different motion data may lead to somewhat lower performance in some applications.
Comparing the PSNR-bit-rate performance of the FPR and MR models also shows a big improvement. It is observed in
AT Model
The Advanced Transcoding model is designed with the goal of reducing the overhead of the previous transcoding models while maintaining good time complexity. To accomplish this goal it is proposed to extend the search space of CU structures and try other block encoding combinations that differ from the block information of the input bit-stream. It is also important to note that the space of possible combinations of CU tree structures, PU splits, and block modes is huge.
To reduce the search space, the following observations about the HEVC standard are considered:
1) Skipped blocks require the least number of bits to encode; hence, for higher QP values it is reasonable to try to transcode as many skipped blocks as possible;
2) A 2N×2N PU split requires one MV to be signalled; therefore it has a higher chance of reducing the overhead, since it requires fewer bits;
3) Having access to the CU tree structure of the input bit-stream, it is possible to produce a shallow tree structure by pruning the input tree. A pruned tree will have less detail and require fewer bits to code.
In the foregoing, it was observed that the MR model demonstrated better transcoding performance compared to FPR due to the re-estimation of MVs for inter blocks and the re-estimation of intra prediction directions for I-pictures. The improvement of the AT transcoding performance over the MR model is shown in Table 12.
The design goal of the AT model was to produce performance close to the SCT model. Table 13 demonstrates that the performance of these two models is very close. This is also observed in
Class Average Performance
As described in the foregoing, the transcoder performance can be averaged across sequences of the same class. This provides a better overview of how each transcoding method performs given video sequences with certain characteristics.
The 20 sequences were classified into five categories depending on the spatial resolution. Higher-class sequences have a higher spatial resolution and therefore contain a higher number of CTUs.
It is observed that, when using a higher QP to reduce the bit-rate, the FPR and IPR models are not efficient enough to maintain a high bit-rate reduction. This is observed for average bit-rate ratio values above 2.0 in the figure. For every sequence class, FPR and IPR have a rising overhead trend, which supports the observation made in the foregoing that the input block information, used without modification, becomes less efficient when reducing the bit-rate beyond half the input bit-rate.
The FPR and IPR models have demonstrated somewhat lower performance for class C sequences. For class A, those models have shown better transcoding performance up to an average bit-rate ratio of 2.0 compared to the MR model. Except for the performance of the FPR and IPR models for sequence class C, in each case the overhead is below 50% regardless of the bit-rate ratio.
The MR model has a consistent performance trend across every class, where an approximately equal overhead is required for reducing the bit-rate, independent of the bit-rate ratio. This trend falls between the rising trends of the FPR and IPR models and the falling trends of the SCT and AT models. This is interesting, since the re-use of the CU tree structure together with the re-estimation of MVs and intra prediction directions provides an even trade-off between the quality loss due to re-use and the quality gain due to re-estimation.
The goal of the AT model design was to exhibit performance close to that of the SCT model. It is clear that this goal is achieved by observing and comparing the AT and SCT model performance in
As expected, for bit-rate ratios above 2.0, SCT maintained the best transcoding performance with the lowest overhead, whereas the AT model was second. Bit-rate ratios above 2.6 were unachievable with the FPR and IPR models.
Per sequence class transcoding performance with RAM configuration is illustrated in
The major difference in transcoder performance with the RAM configuration compared to the LDM configuration is the lower overhead of the IPR model. The RAM configuration incorporates an I-picture every eight pictures (one GOP), compared to the single I-picture of the LDM configuration. These I-pictures are used as reference pictures for inter prediction, and since they are intra predicted with higher quality, the following pictures will have better reference quality. The effect of such a GOP structure for the RAM configuration is the lower overhead of the IPR model compared to the LDM configuration, since IPR re-calculates the intra prediction directions for I-pictures.
The MR model is similar to IPR in the sense that I-pictures are encoded with re-estimated intra prediction directions. Therefore, in a similar manner to IPR, as observed in
The FPR model has a lower overhead for lower QP values with the RAM configuration compared to LDM, as observed for r < 2.0. The overhead for r < 2.0 corresponds to transcoding with a QP equal to that of the base encoder. Therefore, reusing the block information results in better global coding performance compared to the other transcoding models, which locally optimize the block structure. For higher QP values, however, similar to the LDM configuration, the transcoder performance drops significantly.
Time
In the previous section, it was observed that the SCT model has the best coding performance of the transcoding models, which justifies its use as the reference. The problem with the SCT model is its high computational requirements. To further illustrate this point, transcoding time differences are illustrated in
There is a visible decreasing trend for the SCT and AT models for higher QP values. A high QP usually produces CU tree structures that are shallow compared to lower QP values. This is because a higher QP will cause most of the coefficients from the prediction stage to be truncated to zero, so the encoder will most likely encode more blocks in SKIP mode than at a lower QP; as a result, CUs stop splitting into smaller sub-CUs and the final tree is shallow. Stopping the branching at lower depths of the CU structure makes the search space smaller, which leads to shorter transcoding times.
As expected, the transcoder models with the highest information re-use require the least transcoding time. The FPR and IPR models require approximately equal transcoding times, about 1% of the SCT transcoding time. The motion vector re-estimation of the MR model increases the transcoding time to 5%. The AT model performs differently depending on the sequence class. The fastest transcoding time for the AT model was recorded for class C, at 10%; around 20% transcoding time was observed for the highest resolution sequence class. In general, the transcoding time for the AT model compared to the SCT model has shown a decrease of between 80% and 90%.
Per sequence class transcoding time with RAM configuration is shown in
Transcoding is necessary to enable interoperability between devices with heterogeneous computational resources. Previously developed transcoding methods are insufficient for transcoding bit-streams compatible with the H.265/HEVC video coding standard, which is expected to be an important part of video communication systems in the near future.
An important part of the H.265/HEVC design is the quad-tree based structure of CUs. This structure provides a flexible coding design and higher coding performance; however, searching the space of possible structures is computationally expensive. This Annex has investigated transcoding methods that reduce the search space by reusing the CU structure information available in the input bit-stream.
Overview of Proposed Transcoding Models
This Annex focused on transcoding H.265/HEVC bit-streams for bit-rate reduction and on transcoder evaluation methods. In this regard, four transcoding methods were developed:
1) Full Prediction Re-use (FPR) model: this model re-uses the CU information for intra- and inter-prediction but re-calculates residuals and coefficients;
2) Intra Prediction Re-estimation (IPR): this model is similar to FPR with the difference of re-estimating the prediction data for intra coded pictures;
3) MV Re-estimation (MR): this model is similar to IPR with the difference of re-estimating the motion vectors after copying the CU structure from the input bit-stream; and
4) Advanced Transcoding (AT): this model is a combination of the previous models with specific additions that push the transcoder performance further by efficiently extending the search space of the block coding structure.
Overview of Transcoder Performance
It has been observed that re-using the motion data in conjunction with the CU structure, as in the FPR and IPR models, has limited performance for bit-rate reduction below half the input bit-rate. However, as expected, the FPR and IPR models that re-use the CU information were very fast: compared to the SCT model they required only 1% of the transcoding time. It was also noted that the motion vector re-estimation based model, MR, inverted the increasing overhead trend of FPR and IPR with a small addition of computational complexity, approximately 5% of the SCT model.
Finally, the AT model was designed with consideration of the following observations: 1) using skip mode is likely to reduce the overhead since it requires the least number of bits for encoding; 2) the 2N×2N PU split mode requires only one motion vector to be signalled; and 3) merging blocks reduces the number of bits. It was observed that AT demonstrated performance competitive with that of SCT, within a margin of 5%, while requiring at most 20% of the transcoding time.