The present application generally relates to data compression and, in particular, to methods and devices for context-adaptive video coding.
Data compression occurs in a number of contexts. It is very commonly used in communications and computer networking to store, transmit, and reproduce information efficiently. It finds particular application in the encoding of images, audio and video. Video presents a significant challenge to data compression because of the large amount of data required for each video frame and the speed with which encoding and decoding often needs to occur. A popular video coding standard has been the ITU-T H.264/AVC video coding standard. It defines a number of different profiles for different applications, including the Main profile, Baseline profile and others. A newly-developed video coding standard is the ITU-T H.265/HEVC standard. Other standards include VP-8, VP-9, AVS, and AVS-2.
Many modern video coding standards use context-adaptive entropy coding to maximize coding efficiency. However, many standards also require a degree of independence between pictures or groups of pictures in the video. This often means that context adaptations are lost from picture to picture. In other words, with each new picture or slice, the encoder and decoder re-initialize the context model to a default context model state. If data is sparse, then there may be too few values to adapt the probabilities associated with a particular set of contexts quickly enough to make them efficient.
Reference will now be made, by way of example, to the accompanying drawings which show example embodiments of the present application, and in which:
Similar reference numerals may have been used in different figures to denote similar components.
The present application describes methods and encoders/decoders for encoding and decoding video.
In a first aspect, the present application describes a method of encoding video using a video encoder, the video encoder employing context-adaptive entropy encoding using a context model, the context model having a context model state defining the probability associated with each context defined in the context model, the video encoder storing a pre-defined context model state for initialization of the context model, and the video encoder including a buffer storing at least two context model states each being the context model state after context-adaptive encoding of a respective previously-encoded picture in the video encoder. The method includes, for encoding a current picture of the video, selecting one of the at least two stored context model states from the buffer; initializing the context model for context-adaptively encoding the current picture using the selected one of the at least two stored context model states; and context-adaptively entropy encoding the current picture to produce a bitstream of encoded data.
The present application further discloses a method of decoding video from a bitstream of encoded video using a video decoder, the encoded video having been encoded using context-adaptive entropy encoding using a context model, the context model having a context model state defining the probability associated with each context defined in the context model, the video decoder storing a pre-defined context model state for initialization of the context model, and the video decoder including a buffer storing at least two context model states each being the context model state after context-adaptive decoding of a respective previously-decoded picture in the video decoder. The method includes, for decoding a current picture of the video, selecting one of the at least two stored context model states from the buffer; initializing the context model for context-adaptively decoding the current picture using the selected one of the at least two stored context model states; and context-adaptively entropy decoding the bitstream to reconstruct the current picture.
In yet another aspect, the present application describes a method of encoding video using a video encoder, the video encoder employing context-adaptive entropy encoding using a context model, the context model having a context model state defining the probability associated with each context defined in the context model, the video encoder storing a pre-defined context model state for initialization of the context model, and the video including a series of pictures. The method includes, for a subset of the pictures in the series, initializing the context model for context-adaptively entropy encoding a picture in the subset using the pre-defined context model state, context-adaptively entropy encoding that picture to produce a bitstream of encoded data, wherein the context-adaptively entropy encoding includes updating the context model state during encoding, and storing the updated context model state in a buffer; and then, for each of the remaining pictures in the series, initializing the context model for context-adaptively entropy encoding that picture using one of the stored context model states from the buffer, and context-adaptively entropy encoding that picture.
In a further aspect, the present application describes encoders and decoders configured to implement such methods of encoding and decoding.
In yet a further aspect, the present application describes non-transitory computer-readable media storing computer-executable program instructions which, when executed, configure a processor to perform the described methods of encoding and/or decoding.
Other aspects and features of the present application will be understood by those of ordinary skill in the art from a review of the following description of examples in conjunction with the accompanying figures.
In the description that follows, some example embodiments are described with reference to the H.264/AVC standard for video coding and/or the H.265/HEVC standard. Those ordinarily skilled in the art will understand that the present application is not limited to H.264/AVC or H.265/HEVC but may be applicable to other video coding/decoding standards, including possible future standards, multi-view coding standards, scalable video coding standards, 3D video coding standards, and reconfigurable video coding standards.
In the description that follows, when referring to video or images the terms frame, picture, slice, tile and rectangular slice group may be used somewhat interchangeably. Those of skill in the art will appreciate that a picture or frame may contain one or more slices. A series of frames/pictures may be called a “sequence” in some cases. Other terms may be used in other video coding standards. It will also be appreciated that certain encoding/decoding operations might be performed on a frame-by-frame basis, some are performed on a slice-by-slice basis, some picture-by-picture, some tile-by-tile, and some by rectangular slice group, depending on the particular requirements or terminology of the applicable image or video coding standard. In any particular embodiment, the applicable image or video coding standard may determine whether the operations described below are performed in connection with frames and/or slices and/or pictures and/or tiles and/or rectangular slice groups, as the case may be. Accordingly, those ordinarily skilled in the art will understand, in light of the present disclosure, whether particular operations or processes described herein and particular references to frames, slices, pictures, tiles, rectangular slice groups are applicable to frames, slices, pictures, tiles, rectangular slice groups, or some or all of those for a given embodiment. This also applies to coding tree units, coding units, prediction units, transform units, etc., as will become apparent in light of the description below.
Reference is now made to
The encoder 10 receives a video source 12 and produces an encoded bitstream 14. The decoder 50 receives the encoded bitstream 14 and outputs a decoded video frame 16. The encoder 10 and decoder 50 may be configured to operate in conformance with a number of video compression standards.
The encoder 10 includes a spatial predictor 21, a coding mode selector 20, transform processor 22, quantizer 24, and entropy encoder 26. As will be appreciated by those ordinarily skilled in the art, the coding mode selector 20 determines the appropriate coding mode for the video source, for example whether the subject frame/slice is of I, P, or B type, and whether particular coding units within the frame/slice are inter or intra coded. The transform processor 22 performs a transform upon the spatial domain data. In particular, the transform processor 22 applies a block-based transform to convert spatial domain data to spectral components. For example, in many embodiments a discrete cosine transform (DCT) is used. Other transforms, such as a discrete sine transform, a wavelet transform, or others may be used in some instances. The block-based transform is performed on a transform unit. The transform unit may be the size of the coding unit, or the coding unit may be divided into multiple transform units. In some cases, the transform unit may be non-square, e.g. a non-square quadrature transform (NSQT).
Applying the block-based transform to a block of pixel data results in a set of transform domain coefficients. A “set” in this context is an ordered set in which the coefficients have coefficient positions. In some instances the set of transform domain coefficients may be considered as a “block” or matrix of coefficients. In the description herein the phrases a “set of transform domain coefficients” or a “block of transform domain coefficients” are used interchangeably and are meant to indicate an ordered set of transform domain coefficients.
The set of transform domain coefficients is quantized by the quantizer 24. The quantized coefficients and associated information are then encoded by the entropy encoder 26.
Intra-coded frames/slices (i.e. type I) are encoded without reference to other frames/slices. In other words, they do not employ temporal prediction. However intra-coded frames do rely upon spatial prediction within the frame/slice, as illustrated in
Inter-coded frames/blocks rely upon temporal prediction, i.e. they are predicted using reconstructed data from other frames/pictures. The encoder 10 has a feedback loop that includes a de-quantizer 28, inverse transform processor 30, and deblocking processor 32. The deblocking processor 32 may include a deblocking processor and a filtering processor. These elements mirror the decoding process implemented by the decoder 50 to reproduce the frame/slice. A frame store 34 is used to store the reproduced frames. In this manner, the motion prediction is based on what will be the reconstructed frames at the decoder 50 and not on the original frames, which may differ from the reconstructed frames due to the lossy compression involved in encoding/decoding. A motion predictor 36 uses the frames/slices stored in the frame store 34 as source frames/slices for comparison to a current frame for the purpose of identifying similar blocks. In other words, a motion vector search is carried out to identify a block within another frame/picture. That block is the source of the predicted block or unit. The difference between the predicted block and the original block becomes the residual data that is then transformed, quantized and encoded.
Those ordinarily skilled in the art will appreciate the details and possible variations for implementing video encoders.
The decoder 50 includes an entropy decoder 52, dequantizer 54, inverse transform processor 56, and deblocking processor 60. The deblocking processor 60 may include deblocking and filtering processors. A line buffer 59 stores reconstructed pixel data while a frame/picture is being decoded for use by a spatial compensator 57 in intra-coding. A frame buffer 58 stores fully-reconstructed and deblocked frames for use by a motion compensator 62 in applying motion compensation.
The bitstream 14 is received and decoded by the entropy decoder 52 to recover the quantized coefficients. Side information may also be recovered during the entropy decoding process, including coding mode information, some of which may be supplied to the feedback loop for use in creating the predictions. For example, the entropy decoder 52 may recover motion vectors and/or reference frame information for inter-coded blocks, or intra-coding mode direction information for the intra-coded blocks.
The quantized coefficients are then dequantized by the dequantizer 54 to produce the transform domain coefficients, which are then subjected to an inverse transform by the inverse transform processor 56 to recreate/reconstruct the residual pixel-domain data. The spatial compensator 57 generates the video data from the residual data and a predicted block that it creates using spatial prediction. The spatial prediction applies the same prediction mode/direction as was used by the encoder in reliance upon previously-reconstructed pixel data from the same frame. Inter-coded blocks are reconstructed by creating the predicted block based on a previously-decoded frame/picture and the motion vector decoded from the bitstream. The reconstructed residual data is then added to the predicted block to generate the reconstructed pixel data. Both spatial and motion compensation may be referred to herein as “prediction operations”.
A deblocking/filtering process may then be applied to a reconstructed frame/slice, as indicated by the deblocking processor 60. After deblocking/filtering, the frame/slice is output as the decoded video frame 16, for example for display on a display device. It will be understood that the video playback machine, such as a computer, set-top box, DVD or Blu-Ray player, and/or mobile handheld device, may buffer decoded frames in a memory prior to display on an output device.
Reference is now made to
The actual bin value is then used to update the context model. The bin value may be used to adjust the probability of the context selected for encoding of that bin value. In other words, the actual value of the bin may impact the probability used for encoding the next bin having that same selected context. This feature makes the coding process “context-adaptive”. Context-adaptive models tend to outperform static context models because the context model state is adapted to the actual data over time. Provided there is sufficient data for each context, each context should have improved accuracy as the encoding proceeds.
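By way of a non-limiting illustration, the following sketch (in Python) shows the basic idea of a context whose probability estimate is nudged toward each actual bin value as coding proceeds; the class name and the adaptation rate are illustrative assumptions and do not reflect the update rule of any particular standard.

class AdaptiveContext:
    def __init__(self, p_one=0.5, rate=0.05):
        self.p_one = p_one   # current estimate of the probability that a bin is 1
        self.rate = rate     # adaptation speed (illustrative value)

    def update(self, bin_value):
        # Move the estimate toward the observed bin value.
        target = 1.0 if bin_value else 0.0
        self.p_one += self.rate * (target - self.p_one)

# A run of 1-valued bins pulls the estimate above 0.5, so later 1-valued bins
# coded with this context cost fewer bits.
ctx = AdaptiveContext()
for b in [1, 1, 1, 0, 1]:
    ctx.update(b)
print(round(ctx.p_one, 3))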
The context modeler 102 begins the coding process by initializing the context model state. That is, it must select an initial probability for each context. That initial context model state (i.e. the initial probabilities) is loaded into the context modeler 102 during initialization. In some cases, there may be two or more pre-defined context model initialization states and the encoder may select one of them to use in a particular situation. In many coding processes, to reduce error propagation and improve resiliency, a condition of independence is imposed on partitions of the video, i.e. groups-of-pictures, pictures, slices, etc. For example, in AVS2 and H.264, the context model is re-initialized with the pre-defined context model state for each slice.
The problem created when the context model state is re-initialized to default probabilities at every slice boundary is that there may be too few bins associated with certain contexts in a slice to accurately adapt those contexts in the course of encoding the slice. As a result, the CABAC process is not as effective and efficient as it could be, and BD-rate performance suffers.
If no “independence” condition is imposed at a picture or slice level, then the coding process could initialize the context model state once per video, or sequence, or group-of-pictures, or at another granularity, in order to ensure enough bins are available for each context of the context model to adapt to an accurate probability state. This results in potential problems with error propagation and decoding complexity because of the interdependency among a number of pictures/slices.
Reference is now made to
The context modeler 102 is initialized using either a pre-defined context model state from a pre-defined context initialization register 106, or a stored context model state 120 from a context initialization buffer 108. The pre-defined context initialization register 106 stores the one or more pre-defined context model states.
The context initialization buffer 108 stores the actual context model state from the context modeler 102 after encoding of a previous slice or picture in the video. The entropy encoder 110 encodes a slice or picture, which causes context-adaptive updating of the context model state in the context modeler 102 with each bin (other than bypass coded bins). Once the slice or picture is encoded, the context modeler 102 contains an adapted or updated context model state. This updated context model state is stored in the context initialization buffer 108.
Before encoding the next slice or picture, the entropy encoder 110 determines whether to initialize the context model using a pre-defined context model state from the pre-defined context initialization register 106 or a stored context model state 120 from the context initialization buffer 108. This determination may be based on a number of factors, some of which are described further below.
In some embodiments, the context initialization buffer 108 may store two or more context model states 120 corresponding to the context model state following context-adaptive entropy encoding of a respective two or more previous slices or pictures. The entropy encoder 110 may select one of the stored context model states 120 for use in initializing the context model in the context modeler 102. The selection may be based on a number of factors, some of which are described further below.
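The following sketch illustrates, using hypothetical names and a simple first-in-first-out policy, how such a context initialization buffer might retain the adapted states of the most recently encoded pictures; it is illustrative only and not a prescribed implementation.

from collections import deque

class ContextInitBuffer:
    def __init__(self, capacity=5):
        # Oldest entry is discarded automatically once capacity is reached.
        self.entries = deque(maxlen=capacity)

    def store(self, picture_info, context_state):
        # Keep the adapted state together with descriptive information
        # (e.g. QP, picture type) used later when selecting a state.
        self.entries.append((dict(picture_info), dict(context_state)))

    def candidates(self):
        return list(self.entries)

# After entropy encoding a picture, store its adapted state; before the next
# picture, either pick one of the stored states or fall back to the
# pre-defined state (selection criteria are discussed further below).
buffer = ContextInitBuffer()
buffer.store({"qp": 32, "type": "P"}, {"ctx0": 0.7, "ctx1": 0.4})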
The encoder may insert flags or other syntax elements in the bitstream to indicate to the decoder whether the context model is to be initialized using the pre-defined context initialization register 106 or the context initialization buffer 108 for a particular slice or picture. Illustrative example flags and syntax are provided below.
In some cases, the decoder may be configured to determine whether to initialize using the pre-defined context initialization register 106 or the context initialization buffer 108 for a particular slice or picture using the same logic or algorithmic operations as the encoder.
When selecting between two or more stored context model states in the context initialization buffer 108, the decoder may rely upon a flag or one or more other syntax elements encoded by the encoder to make the selection, or the decoder may make the selection based upon the same logical or algorithmic selection process employed by the encoder.
In one case, in order to allow for backward compatibility, when it is known that the decoder is not able to use various context model states, then only the pre-defined context model set/state is used. However, when it is known that the decoder is able to use various context model sets/states, then the described context model set/state selection process is used. The decoder capability can be determined via any negotiation procedure (e.g. the session description protocol (SDP)).
In some cases, when multiple representations of content are made available, such as in dynamic adaptive streaming over HTTP (DASH) or MPEG Media Transport (MMT), then one of the representations could correspond to a version of the coder that includes such a modification to the entropy context model.
Reference will now be made to
In operation 204, the encoder entropy encodes a picture (or, in some embodiments, a slice or other portion of video) of video using context-adaptive entropy encoding. References herein to determinations or selections being made for a picture may be made at a slice-level in some embodiments or at other levels of granularity.
In this process, the context model states are continuously updated based on actual bin values and context determinations. Following operation 204, the entropy encoder has a context model with adapted state values.
In operation 206, the updated context model state is stored in the context initialization buffer. In one embodiment, the buffer stores only the most recent context state. Accordingly, with each picture that is encoded, the ending context state is stored in the buffer, overwriting the previously-stored context state. In such an embodiment, only the previous picture's (or slice's) context state is available for use in initializing the context model for the next picture (or slice).
In another embodiment, the buffer stores two or more context model states from the respective two or more previously-encoded pictures (or slices). In this embodiment, the entropy encoder selects between the available stored context states when using the buffer to initialize the context model. The number of stored context model states may be predetermined. For example, the buffer may store five context model states from the five most recently encoded pictures (or slices). After encoding of a picture, the corresponding context model state is stored in place of the oldest of the stored context model states in the buffer.
In yet another embodiment, the buffer stores various context states corresponding to, for example, a lossless mode, a transform skip mode, a transform and quantization skip mode, a screen content mode, a High Dynamic Range mode, a wide color gamut mode, a bit-depth mode, a scalable mode, and/or a color component mode (e.g. 4:4:4, 4:2:2). As such, when the bitstream comprises an indication that any one of those modes is used for a picture, a sequence, or a subset of a picture, then the related context states are used.
In yet another embodiment, the buffer stores a plurality of context model states, each corresponding to a particular category or type of picture (or slice). For example, the entropy encoder may store a context model state for each QP value or for a number of subsets of QP values. When a new context model state is to be stored for a QP value (or subset of QP values) that already has an associated stored context model state, then the old context model state is overwritten. In another example, the buffer may store the most recent context model state corresponding to each picture type, e.g. I-pictures, P-pictures, and B-pictures. In another example, the buffer may store the most recent context model state corresponding to each hierarchical layer. In yet another example, the buffer may store only context model states of those pictures that may serve as reference pictures for a prediction operation in encoding future pictures. Other types or characteristics of the pictures may be used, individually or grouped into similar types.
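A sketch of this category-keyed variant follows; the keying by picture type and by coarse QP subset, and the names used, are illustrative assumptions rather than requirements.

class KeyedContextBuffer:
    def __init__(self):
        self.by_key = {}

    @staticmethod
    def key(picture_type, qp):
        # Group QP values into coarse subsets so that nearby QPs share an entry.
        return (picture_type, qp // 6)

    def store(self, picture_type, qp, context_state):
        # A new state for an existing category overwrites the old one.
        self.by_key[self.key(picture_type, qp)] = dict(context_state)

    def lookup(self, picture_type, qp):
        return self.by_key.get(self.key(picture_type, qp))

buffer = KeyedContextBuffer()
buffer.store("B", 30, {"ctx0": 0.6})
print(buffer.lookup("B", 32))   # same category, so the stored B-picture state is reused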
The entropy encoder then determines, in operation 208, whether to use a pre-defined context model state from the pre-defined context initialization register or a stored context model state from the buffer when initializing the context model for encoding the next picture. The determination may be based upon whether the available stored context model states meet certain criteria relative to the picture to-be-encoded. Example criteria may include one or more of: being of the same picture type, having the same or similar QP value, being of the same or higher/lower hierarchical level, being within a threshold temporal distance in the video sequence, being a reference frame used by the picture-to-be-encoded, and other factors, either alone or in combination with any of the above factors.
If the entropy encoder determines that the pre-defined context model state is to be used for initialization, then the method 200 returns to operation 202. Otherwise, it continues at operation 210, where the entropy encoder selects one of the stored context model states from the buffer. The selection may be based upon a number of criteria relative to the picture to-be-encoded. The criteria may be applied alone or in combination. Example criteria may include one or more of: being of the same picture type, having the same or similar QP value, being of the same or higher/lower hierarchical level, being within a threshold temporal distance in the video sequence, being a reference frame used by the picture-to-be-encoded, and other factors, either alone or in combination with any of the above factors.
In operation 212, the context model is initialized using the context model state selected from the buffer in operation 210. The method 200 then proceeds to operation 204 to entropy encode the picture-to-be-encoded using context-adaptive encoding starting from the initialized context model state.
A corresponding example decoding method 300 for reconstructing a video from a bitstream of encoded data is shown in an example flowchart in
The method 300 includes initialization of the context model in operation 302. The same context model used by the encoder is used by the decoder. The context model is initialized with the pre-defined context state. This pre-defined state may be one of equal probabilities for all contexts in some cases. In others, this operation 302 may include selecting between a number of available pre-defined context states. The selection may be based upon the picture type, QP value, or other coding parameters. The selection may alternatively be based upon a flag or syntax element sent by the encoder in the bitstream that identifies the pre-defined context model state to be used.
In operation 304, the decoder entropy decodes a picture (or, in some embodiments, a slice or other portion of video) of video using context-adaptive entropy decoding. References herein to determinations or selections being made for a picture may be made at a slice-level in some embodiments or at other levels of granularity.
In this process, the context model states are continuously updated based on actual bin values and context determinations. Following operation 304, the entropy decoder has a context model with adapted state values.
In operation 306, the updated context model state is stored in the decoder's context initialization buffer. In one embodiment, the buffer stores only the most recent context state. Accordingly, with each picture that is decoded, the ending context state is stored in the buffer, overwriting the previously-stored context state. In such an embodiment, only the previous picture's (or slice's) context state is available for use in initializing the context model for the next picture (or slice) to be decoded.
In another embodiment, the buffer stores two or more context model states from the respective two or more previously-decoded pictures (or slices). In this embodiment, the entropy decoder selects between the available stored context states when using the buffer to initialize the context model. The number of stored context model states may be predetermined. For example, the buffer may store five context model states from the five most recently decoded pictures (or slices). After decoding of a picture, the corresponding context model state is stored in place of the oldest of the stored context model states in the buffer.
In yet another embodiment, the buffer stores a plurality of context model states, each corresponding to a particular category or type of picture (or slice). For example, the entropy decoder may store a context model state for each QP value or for a number of subsets of QP values. When a new context model state is to be stored for a QP value (or subset of QP values) that already has an associated stored context model state, then the old context model state is overwritten. In another example, the buffer may store the most recent context model state corresponding to each picture type, e.g. I-pictures, P-pictures, and B-pictures. In another example, the buffer may store the most recent context model state corresponding to each hierarchical layer. In yet another example, the buffer may store only context model states of those pictures that may serve as reference pictures for a prediction operation in decoding future pictures. Other types or characteristics of the pictures may be used, individually or grouped into similar types.
The entropy decoder then determines, in operation 308, whether to use a pre-defined context model state from the pre-defined context initialization register or a stored context model state from the buffer when initializing the context model for decoding the next picture. The determination may be based upon whether the available stored context model states meet certain criteria relative to the picture-to-be-decoded. Example criteria may include one or more of: being of the same picture type, having the same or similar QP value, being of the same or higher/lower hierarchical level, being within a threshold temporal distance in the video sequence, being a reference frame used by the picture-to-be-decoded, and other factors, either alone or in combination with any of the above factors. In some embodiments, the decoder makes the determination based on a flag or other syntax element from the bitstream. The encoder may insert the flag or other syntax element to signal to the decoder whether to use pre-defined context model states or stored context model states.
If the entropy decoder determines that the pre-defined context model state is to be used for initialization, then the method 300 returns to operation 302. Otherwise, it continues at operation 310, where the entropy decoder selects one of the stored context model states from the buffer. The selection may be based upon a number of criteria relative to the picture to-be-decoded. The criteria may be applied alone or in combination. Example criteria may include one or more of: being of the same picture type, having the same or similar QP value, being of the same or higher/lower hierarchical level, being within a threshold temporal distance in the video sequence, being a reference frame used by the picture-to-be-decoded, and other factors, either alone or in combination with any of the above factors. Again, in some embodiments, the decoder may make this selection based upon a syntax element inserted in the bitstream by the encoder to signal which context model state to use.
In operation 312, the context model is initialized using the context model state selected from the buffer in operation 310. The method 300 then proceeds to operation 304 to entropy decode the picture-to-be-decoded using context-adaptive decoding starting from the initialized context model state.
One of the factors or criteria mentioned above is hierarchical layer. In some video coding processes, groups-of-pictures may have a defined hierarchical structure within them defining whether particular pictures are lower layer pictures or higher layer pictures. The hierarchy may be used for a number of purposes. For example, the reference pictures available to predict a picture may be limited based upon the layer relationship. For instance, a picture may be limited to serving as a reference for pictures in the same or a lower (or higher) layer. In some cases, the candidate reference pictures may be limited in number or in temporal distance. Other conditions or restrictions could be imposed based on the hierarchical structure of the group-of-pictures.
One example hierarchical structure 400 is diagrammatically illustrated in
It will be appreciated that the structure 400 is only illustrative and that other structures may be used in other situations. The structures may have two layers, three layers, or more. Different rules may apply in different embodiments.
In the processes described in
Reference is now made to
In this example process 500, candidate stored context sets are filtered based on a set of conditions. The conditions used in any particular embodiment may be aimed at ensuring the candidate context model states are those developed during encoding/decoding of pictures that are most likely to be statistically similar to the current picture.
In operation 502, the process 500 excludes candidates if they are not associated with a reference frame. That is, the only candidates that are permitted are those context sets that were derived during the encoding/decoding of pictures that are (or are permitted to be) reference pictures to the current picture. In one embodiment, this condition restricts candidates to actual reference pictures used by the current picture. In another embodiment, this condition restricts candidates to pictures that are permitted to be reference pictures to the current picture irrespective of whether they are actually used as such.
Operation 504 excludes any candidate pictures that are not of the same picture type as the current picture. For example, B pictures may only use stored context states associated with other B pictures; I pictures may only use stored context states associated with other I pictures, etc.
Operation 506 excludes candidate stored context states from pictures based on a hierarchical layer relationship. As noted above, this condition may limit candidates to those pictures in the same hierarchical layer in some cases. In other implementations, this condition may limit candidates to those pictures in layers above the current layer. In other implementations, this condition may limit candidates to those pictures in the same layer or layers above.
Operation 508 excludes candidate stored context model states obtained from pictures that are more than a threshold number of pictures temporally distant from the current picture.
In operation 510, from any remaining candidate context model states, a context model state is selected on the basis that its associated picture uses a QP value closest to the QP value used with the current picture.
If no candidates remain for selection at the end of the process 500, then the pre-defined context state initialization values are used, as indicated by operation 512.
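One possible realization of this filtering cascade (operations 502 to 512) is sketched below; the field names, the dictionary representation of a candidate, and the temporal-distance threshold are illustrative assumptions.

def select_stored_state(candidates, current, max_temporal_distance=4):
    # Operation 502: keep only states from (permitted) reference pictures.
    remaining = [c for c in candidates if c["is_reference"]]
    # Operation 504: keep only states from pictures of the same type.
    remaining = [c for c in remaining if c["picture_type"] == current["picture_type"]]
    # Operation 506: keep only states from the same hierarchical layer (one variant).
    remaining = [c for c in remaining if c["layer"] == current["layer"]]
    # Operation 508: drop states that are too distant temporally.
    remaining = [c for c in remaining
                 if abs(c["poc"] - current["poc"]) <= max_temporal_distance]
    if not remaining:
        # Operation 512: fall back to the pre-defined initialization.
        return None
    # Operation 510: among the survivors, pick the one with the closest QP.
    return min(remaining, key=lambda c: abs(c["qp"] - current["qp"]))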
In another embodiment, any one of the above “filtering” conditions by which candidates are excluded may be used as a “selecting” condition. For example, in operation 508 instead of filtering out stored context model states on the basis that they are more than a threshold temporal distance away, minimizing the temporal distance may be the basis for selecting amongst candidate stored context model states. For example, as between two stored context model states, the encoder and decoder may be configured to use the one closest in the display sequence.
In yet another embodiment, the process to select one context model state out of two or more stored states uses hypothesis testing. That is, each stored state is a candidate associated with a cost in hypothesis testing. The cost is determined by assigning a numerical score to each of the relevant conditions (picture type, hierarchical layer, temporal distance, quantization parameters, being a reference picture or not, etc.), and aggregating them. The candidate with the smallest cost is then selected.
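A sketch of such a cost-based selection follows; the individual scores and weights are illustrative assumptions and are not prescribed by the present description.

def candidate_cost(candidate, current):
    cost = 0.0
    cost += 0.0 if candidate["picture_type"] == current["picture_type"] else 4.0
    cost += abs(candidate["layer"] - current["layer"])
    cost += 0.5 * abs(candidate["poc"] - current["poc"])
    cost += 0.25 * abs(candidate["qp"] - current["qp"])
    cost += 0.0 if candidate["is_reference"] else 2.0
    return cost

def select_by_cost(candidates, current):
    # The candidate with the smallest aggregate cost is selected, if any.
    return min(candidates, key=lambda c: candidate_cost(c, current)) if candidates else None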
It will be understood that although the above description refers to context model states associated with respective pictures, the same methods and processes may be applied in the case of context model states associated with slices.
In one embodiment, the context model buffer is updated by storing a context model state after each slice. In one variation, the encoder/decoder determines whether to overwrite a previously stored context model state in the buffer based on whether the associated slices have the same characteristics, for example, whether the slices have the same (or similar) QP values, are from the same picture type, are on the same hierarchical level, etc. More or fewer characteristics may be applied in different implementations. Context model states stored in the buffer for one picture may be available for use by other slices in the same picture.
In one embodiment, context model states may be stored in the buffer after each picture. In other words, a temporary buffer or memory holds the context model states for individual slices and, when the picture is finished encoding/decoding, the temporarily stored context model states are saved to the buffer to make them available for encoding/decoding slices in subsequent pictures.
In yet another embodiment, some frames/pictures are designated for “training” context model states and other frames/pictures are designated for “static” use of trained context model states. Reference is now made to
In this example, the first three pictures 600 (in encoding/decoding sequence, which may or may not be the same as the display sequence) are designated as training pictures 602. The remaining pictures are designated as “static” pictures 604. They are “static” in the sense that the context model state at the end of encoding/decoding these pictures is not saved for use in subsequent initializations.
During the training phase, the context model states developed during encoding/decoding of the training pictures 602 are saved. The resulting context model state(s) are then available for use in initializing the context model during encoding/decoding of the static pictures 604.
In a first embodiment, the training pictures 602 use the pre-defined context model state for initializing the model. At the end of encoding/decoding each training picture 602, the context model state is stored in the buffer, meaning three stored context model states will be available for selection by the static pictures.
In a second embodiment, the first training picture 602 uses the pre-defined context model state for initializing the model. Each subsequent training picture 602 uses the trained context model state from the encoding/decoding of the previous training picture 602. At the end of the training phase, the context model state developed over the course of encoding/decoding the training pictures 602 is saved to the buffer for use by the static pictures 604.
In a third embodiment, the training pictures 602 each use the pre-defined context model state for initializing the model before encoding/decoding of each training picture. At the end of the training phase 602, the encoder/decoder selects one of the trained context model states as the context model state for initialization during encoding/decoding of the static pictures 604. The selection may be based upon QP, hierarchical level, or other factors. In one example, one or more of the trained context model states may be combined to create a merged context model state. Various weighting schemes may be used to merge probabilities of two context model states.
In a fourth embodiment, each training picture 602 uses the pre-defined context model state for encoding/decoding; however, in parallel the training of the context model is carried out starting from the first training picture and carrying through all training pictures. That is, the context model state training is separated from the context model state used for encoding/decoding of training pictures after the first picture. With the second picture, the context model state is re-initialized using the pre-defined context model for the purpose of encoding/decoding, but a separate training context model is maintained that inherits the trained states at the end of the first picture.
By separating the pictures into a training phase and a static phase, error resiliency during the static phase is improved compared to a process in which context model states are saved after each picture and may be used in subsequent pictures.
Reference is now made to
The method 700 is applied to a set of pictures. The set of pictures may be a group-of-pictures (GOP) in some examples. The set of pictures may be a subset of the GOP in some examples.
In operation 702, the context model is initialized with probabilities based upon the pre-defined context model states, i.e. untrained probabilities pre-established for the encoding/decoding process. The first picture of the set of pictures is then encoded/decoded in operation 704 using context-adaptive entropy coding. In the course of this coding, the context probabilities are updated with each bin (excluding bypass coded bins). After encoding/decoding the picture, the encoder/decoder then saves the updated context model state in the context initialization buffer in operation 706.
In operation 708, the encoder/decoder determines whether the training phase is complete. The training phase may be defined as a number of pictures at the beginning of the set of pictures. For example, three or four pictures may be used for training with the remaining pictures belonging to the static phase. The number of pictures included in the training phase may be a configurable number that is set by the encoder and communicated to the decoder in a sequence parameter set or elsewhere in the bitstream. In some embodiments, a syntax element, such as a flag or other element, may signal to the decoder the start/end of training and/or the start/end of the static phase.
If the training phase is not finished, i.e. there are additional training pictures to be encoded, then the method 700 returns to operation 702 to reinitialize the context model with the pre-defined context model state and to encode the next picture. If the training phase is finished, then the method 700 proceeds to operation 710 where the encoder selects one of the stored context model states for encoding of a static picture. As described above, the selection may be based upon one or more conditions that tend to result in selecting a context model state from a picture likely to be statistically similar to the current static picture. In operation 712, the selected context model state is used to initialize the context model. The static picture is then encoded in operation 714 using context-adaptive entropy encoding.
In operation 716, the encoder evaluates whether or not there are additional static pictures to be encoded in the set. If so, then it returns to operation 710 to again select a context model state from the stored context model states in the buffer. That selected state is then used to reinitialize the context model for the entropy encoding of the next static picture.
Once all the static pictures are encoded, the method 700 then returns to operation 702 to repeat the training process for the next set of pictures.
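The following sketch summarizes method 700 for one set of pictures. The entropy coding itself is represented by a placeholder function, and the selection of a stored state is simplified to the most recent one for brevity; the names and the default number of training pictures are illustrative assumptions.

def entropy_code_picture(picture, state):
    # Placeholder: a real coder would adapt the state bin by bin while coding.
    return dict(state)

def code_picture_set(pictures, predefined_state, num_training=3):
    buffer = []
    for index, picture in enumerate(pictures):
        if index < num_training:
            # Operations 702-706: code from the pre-defined state and store the result.
            adapted = entropy_code_picture(picture, predefined_state)
            buffer.append(adapted)
        else:
            # Operations 710-714: initialize from a stored state (here simply the most
            # recent one) and code the static picture; its adapted state is not stored.
            chosen = buffer[-1] if buffer else predefined_state
            entropy_code_picture(picture, chosen)
    return buffer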
Those skilled in the art will appreciate that various modifications to the method 700 may be made to implement other embodiments. For example, in the case of the fourth embodiment described above, in which the training context model states are separated from the encoding context model states used during the training phase (after the first training picture), the encoding operation 704 may be modified to reflect the fact that, after the first picture, two context models are being maintained: one for encoding/decoding and one being trained. Other modifications or variations will be appreciated by those ordinarily skilled in the art in light of the description herein.
In one example embodiment, out of caution it may be desirable to reduce or soften the impact of the training operation in case it results in an inaccurate context model state. In such a case, it may be advantageous to ensure that the stored context model states are a blend of the trained context model states developed during the training phase and the pre-defined context model state prescribed by the video coding standard for initializing the model.
In other words, the stored context model states may be a function of the new trained states and the old pre-defined states.
As an example, if the existing state index is A with MPS=a, and the new state index is B with MPS=b, where MPS means “most probable symbol”, the updated state index may be given as:

C = w_a*A + w_b*B, if a = b
C = the state corresponding to (1/2, 1/2), otherwise
In this example, w_a and w_b are weighting factors. For instance, w_a = w_b = 0.5.
In another example, the corresponding probabilities prob_A and prob_B of the existing state and the new state may be derived first. An estimation for the updated state probability is then:
prob_C = (w_a*prob_A + w_b*prob_B)/(w_a + w_b), if a = b
prob_C = w_b/(w_a + w_b) + (w_a*prob_A - w_b*prob_B)/(w_a + w_b), if a != b and w_b + 2*w_a*prob_A > w_a + 2*w_b*prob_B
prob_C = w_a/(w_a + w_b) + (w_b*prob_B - w_a*prob_A)/(w_a + w_b), otherwise
In this case,

MPS_C = a, if w_b + 2*w_a*prob_A > w_a + 2*w_b*prob_B
MPS_C = b, otherwise
The state may then be updated to the state whose corresponding probability distribution is the closest to prob_C. The MPS is updated as MPS_C.
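The blending above may be summarized by the following sketch, which computes prob_C and MPS_C from the formulas given; the probability values used in the example calls are illustrative only, and the default weights are the w_a = w_b = 0.5 mentioned earlier.

def blend_states(prob_A, a, prob_B, b, w_a=0.5, w_b=0.5):
    total = w_a + w_b
    if a == b:
        prob_C = (w_a * prob_A + w_b * prob_B) / total
        mps_C = a
    elif w_b + 2 * w_a * prob_A > w_a + 2 * w_b * prob_B:
        prob_C = w_b / total + (w_a * prob_A - w_b * prob_B) / total
        mps_C = a
    else:
        prob_C = w_a / total + (w_b * prob_B - w_a * prob_A) / total
        mps_C = b
    return prob_C, mps_C

# Same MPS: the probabilities are simply averaged (approximately 0.7 here).
print(blend_states(0.8, 1, 0.6, 1))
# Different MPS: the blend may keep or flip the most probable symbol
# (approximately 0.65 with MPS 1 here).
print(blend_states(0.9, 1, 0.6, 0))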
An initialization indication flag or group of flags may be used to signal from encoder to decoder whether the context initialization buffer is to be used, and how.
In one embodiment, each slice includes a flag indicating whether the slice uses the buffer-stored context model state or the pre-defined context model state. If the buffer is to be used, then it may by default be the context model state stored for the previously-encoded/decoded slice or picture. It may alternatively be a context model state selected from amongst two or more stored context model states based upon one or more conditions/criteria, such as similar QP values, same or similar hierarchical layer, same picture type, or other factors. In some cases, if no buffer-stored context model state meets the criteria then the slice/picture uses the pre-defined context model state.
In another embodiment, if the flag authorizes use of the stored context model states, then the decoder may use the context model state associated with a reference picture to the current picture. If there is more than one reference picture, then it may select the one that is closest temporally. In one example, it may combine reference picture context model states, perhaps using a weighting mechanism such as that described above in connection with “training” pictures.
In some embodiments, multiple level flags may be defined. For example, an enabling flag may be present in the sequence parameter set to indicate whether there are context initialization flags in the picture parameter set and/or slice header. If yes, then a flag in the picture parameter set may indicate whether or not to use the context initialization buffer for slices in the current picture. A further flag, flags, or code in the slice header may indicate whether that slice uses the pre-defined context model state or a stored context model state and, in some cases, which stored context model state to use.
In another variation, the sequence parameter set flag indicates whether the context initialization is enabled for the sequence. If so, then a flag appears in each slice header to indicate whether that slice uses the pre-defined context model state or a stored context model state. In some cases, the slice header may indicate which stored context model state is selected for that slice; whereas, in other cases the decoder may select which stored context model state to use based on a set of conditions and/or criteria, like those described above.
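The following parsing sketch illustrates one possible arrangement of such multi-level signalling; the flag names, the dictionary representation of the parameter sets, and the syntax layout are hypothetical and are not drawn from any particular standard.

def parse_context_init_flags(sps, pps, slice_header):
    if not sps.get("context_init_enabled_flag", 0):
        return {"use_buffer": False}            # sequence level: only pre-defined states
    if not pps.get("pic_context_init_flag", 0):
        return {"use_buffer": False}            # this picture: only pre-defined states
    use_buffer = bool(slice_header.get("slice_context_init_flag", 0))
    index = slice_header.get("context_init_idx")  # which stored state, if signalled
    return {"use_buffer": use_buffer, "buffer_index": index}

# Example: the sequence and picture enable the feature and the slice selects
# stored context model state number 2.
print(parse_context_init_flags({"context_init_enabled_flag": 1},
                               {"pic_context_init_flag": 1},
                               {"slice_context_init_flag": 1, "context_init_idx": 2}))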
In another example, a flag in the sequence parameter set may indicate whether or not the context initialization buffer should be flushed to restart the process of building stored context model states. In a further example, such a “flush” flag may be defined for the picture parameter set.
The buffer usage and flag syntax may depend on the type of video sequence being encoded/decoded. For example, in an all-intra-coding case (all I-type pictures), if the picture or slice level flag is not present, then the current I picture/slice may be decoded using the updated context model state of the previous I picture/slice, except for an IDR picture, which uses the pre-defined context model state.
On the other hand, if there are flags at the picture/slice level in the all-intra case, those flags may signal whether to use the context model state of the previous I picture/slice, or whether to use the pre-defined context model state.
In the case of a random-access sequence, i.e. I, P and B pictures in a GOP, there are a number of potential flag uses and meanings. In one embodiment, if there is no flag signaling whether to use the buffer, then the I-pictures use the pre-defined context model state, the P-pictures use the context model state of the previously-decoded P-picture, if any, and the B-pictures use the context model state of the previously-decoded B-picture, if any.
In another embodiment, if there is no flag indicating whether to use the buffer, then the B-picture may use the context model state of the previously-decoded B-picture and the I and P-pictures use the pre-defined context model state.
In the case of a low-delay P setting, then if there is no flag indicating whether to use the stored context model states, then the default setting may be to permit the P-picture to use the stored context model state of the previously-decoded P-picture, whereas the I-picture uses the pre-defined context model state.
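These default (no flag present) behaviours may be summarized by the following sketch; the configuration names and the structure of the arguments are illustrative assumptions.

def default_init_source(coding_config, picture_type, is_idr, buffer_has):
    # buffer_has maps a picture type to whether a stored state from a
    # previously decoded picture of that type is available.
    if coding_config == "all_intra":
        if is_idr or not buffer_has.get("I"):
            return "predefined"
        return "stored_I"
    if coding_config == "random_access":
        if picture_type in ("P", "B") and buffer_has.get(picture_type):
            return "stored_" + picture_type
        return "predefined"
    if coding_config == "low_delay_P":
        if picture_type == "P" and buffer_has.get("P"):
            return "stored_P"
        return "predefined"
    return "predefined"

print(default_init_source("random_access", "B", False, {"B": True}))  # stored_B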
Reference is now made to
Reference is now also made to
It will be appreciated that the decoder and/or encoder according to the present application may be implemented in a number of computing devices, including, without limitation, servers, suitably-programmed general purpose computers, audio/video encoding and playback devices, set-top television boxes, television broadcast equipment, and mobile devices. The decoder or encoder may be implemented by way of software containing instructions for configuring a processor or processors to carry out the functions described herein. The software instructions may be stored on any suitable non-transitory computer-readable memory, including CDs, RAM, ROM, Flash memory, etc.
It will be understood that the encoder described herein and the module, routine, process, thread, or other software component implementing the described method/process for configuring the encoder may be realized using standard computer programming techniques and languages. The present application is not limited to particular processors, computer languages, computer programming conventions, data structures, or other such implementation details. Those skilled in the art will recognize that the described processes may be implemented as a part of computer-executable code stored in volatile or non-volatile memory, as part of an application-specific integrated chip (ASIC), etc.
Certain adaptations and modifications of the described embodiments can be made. Therefore, the above discussed embodiments are considered to be illustrative and not restrictive.