The invention relates to motion vector prediction for video coding, and, in particular, though not exclusively, to methods and systems for motion vector prediction, a video decoder apparatus and a video encoder apparatus using such methods and a computer program product for executing such methods.
State of the art video coding standards use a hybrid block-based video coding scheme wherein a video frame is partitioned into video blocks which are subsequently encoded using prediction block-based compression techniques. Here, a video block or in short a block refers to a basic processing unit of a video standard, e.g., coding tree units (CTUs) as defined in HEVC, macroblocks as defined in AVC and super blocks as defined in VP9 and AV1. In certain video coding standards, such as HEVC, blocks may be partitioned into smaller sub-blocks e.g. Coding Units (CUs) and Prediction Units (PUs). Different prediction modes may be used to code each of the blocks or sub-blocks. For example, different intra-prediction modes may be used to code a block based on predictive data within the same frame so as to exploit spatial redundancy within a video frame. Additionally, inter-prediction modes may be used to code a block based on predictive data from another frame so that temporal redundancy across a sequence of video frames can be exploited.
Inter-prediction uses a motion estimation technique to determine motion vectors (MVs). A motion vector identifies a block in an already encoded reference video frame (a past or future video frame) that is suitable for predicting a block in a video frame that needs to be encoded, wherein the block that needs to be encoded and its associated MV are typically referred to as the current block and the current MV, respectively. The difference between a prediction block and a current block may define a residual block, which can be encoded together with metadata, such as the MV, and transmitted to a video playout device that includes a video decoder for decoding the encoded information using the metadata. In turn, MVs may be compressed by exploiting correlations between motion vectors of blocks that have already been encoded and the motion vector of a current block. Translational moving objects in video pictures typically cover many blocks whose motion vectors are similar in direction and length. Video coding standards typically exploit this correlation.
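By way of a toy illustration (the function name and the list-of-lists block format are illustrative assumptions, not part of any coding standard), the residual block described above is simply the sample-wise difference between the current block and the prediction block that the motion vector points to:

```python
def residual_block(current_block, prediction_block):
    """Toy sketch: the residual is the sample-wise difference between the
    current block and the prediction block identified by the motion vector."""
    return [[c - p for c, p in zip(crow, prow)]
            for crow, prow in zip(current_block, prediction_block)]
```

The decoder reverses this relation by adding the decoded residual back to the prediction block identified by the reconstructed motion vector.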
For example, motion vector compression schemes such as the so-called Advanced Motion Vector Prediction (AMVP) algorithm used by HEVC or the Dynamic Reference Motion Vector Prediction (REFMV) scheme used by AV1 aim to compress information about motion vectors in the bitstream by using a motion vector that has already been calculated as a reference for predicting a current MV. Such a motion vector may be referred to as a motion vector predictor (MVP). In this scheme, an MVP for an MV of a current block may be generated by determining candidate MVs of already encoded blocks of the current video frame or co-located blocks of an encoded reference video frame and selecting a candidate MV as the best predictor, the MVP, for the current block based on an optimization scheme such as the well-known rate-distortion optimization (RDO) scheme. The difference between the MVP and the MV and information about the selected motion vector predictor are entropy coded into a bitstream. The decoder uses the information about the selected MVP and the MV difference to reconstruct a motion vector of a current block that needs to be decoded. The motion vector compression schemes of the video standards are all based on the fact that encoded blocks close to the current block, e.g. neighbouring blocks in the same video frame, or encoded co-located blocks in a reference frame, will typically have the same or similar motion vector values due to the spatial correlation of pixels in blocks of consecutive frames.
A problem associated with the inter-prediction algorithms used by most coding standards is that these algorithms assume that motion in pictures relates to uniform translational movement of objects in the video. A motion estimation technique will associate such a moving object with a uniform motion vector field (i.e. all motion vectors have approximately the same size and direction). Motion vectors determined by a motion estimation technique, however, provide an estimate of all motion in video blocks of a video frame: translational local motion associated with moving objects as well as other types of motion in the video. For example, a moving camera may introduce a non-uniform motion vector field, especially if the camera is a 360-degree camera that generates projected video frames (e.g. video frames comprising equirectangular projected pixels). In most video coding standards, such as AVC and HEVC, inter-prediction schemes and associated motion vector predictor schemes are mainly optimized for dealing with uniform motion vector fields. This may result in suboptimal coding and compression efficiency.
Hence, from the above it follows that there is a need in the art for improved methods and systems for coding of video data that includes a global motion field caused by motion, in particular camera motion.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Functions described in this disclosure may be implemented as an algorithm executed by a microprocessor of a computer. Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied, e.g., stored, thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber, cable, RF, etc., or any suitable combination of the foregoing. Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java™, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor, in particular a microprocessor or central processing unit (CPU), of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer, other programmable data processing apparatus, or other devices create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. Additionally, the instructions may be executed by any type of processor, including but not limited to one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The embodiments in this application aim to improve the efficiency of coding video that includes non-uniform motion, e.g. non-uniform global motion or non-uniform local motion relating to rotating objects. Non-uniform global motion is caused by movement of the camera that captures the video, which can be either a virtual camera (in a computer-generated world) or a real camera. Here, the movement may be associated with a physically moving camera (e.g. due to a panning or tilting operation), or with a virtually moving camera (e.g. due to a zooming action).
In known motion vector predictor schemes such as AMVP and REFMV, an algorithm is used by the encoder and the decoder for building a list of motion vector predictor candidates by first evaluating predetermined motion vectors of encoded blocks of the current video frame (the so-called spatial motion vector predictor candidates) and, if these are not available, then evaluating predetermined motion vectors of blocks of one or more reference frames (the so-called temporal motion vector predictor candidates). Applying such an algorithm to video frames comprising non-uniform motion will result in inaccurate motion vector predictors, since the motion vectors of encoded blocks of the current video frame will likely differ in magnitude and direction, and thus in suboptimal compression.
The inventors recognized that a non-uniform motion field in video frames results in a low correlation between motion vectors of blocks that are positioned in the vicinity of the current block, whereas a strong correlation exists between motion vectors of blocks in one or more reference frames. This insight may be used to improve motion vector predictor schemes for predicting motion vectors of blocks that include non-uniform motion. Further, the inventors recognized that non-uniform motion fields in video frames may have distinct and predictable patterns, which depend on the type of video data in the video frames. For example, in the case of spherical video, non-uniform global motion due to camera movement has a distinct and predictable pattern, depending on the projection that is used to project the spherical video onto the rectangular 2D plane of a video frame. In particular, in an equirectangular (ERP) projected video frame, camera motion may cause a characteristic pattern in the video including an expansion point and a compression point in the video frames. Pixels in the area around an expansion point may have motion vectors that point in opposite directions with their tails located at the same point (the expansion point), and pixels in the area around a compression point may have motion vectors that point toward the same point (the compression point), causing complex but accurately predictable motion patterns in these two types of area. Such patterns can be predicted using a simple parametric algorithm that models the movement of the camera in the 3D scene, which causes a global motion in the captured video. This insight may also be used to improve motion vector predictor schemes for predicting motion vectors of blocks that include non-uniform motion.
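To illustrate the predictability of such a pattern, the following sketch models the ERP motion field caused by a small forward camera translation, under the simplifying assumption that all scene points lie at a constant depth (the function name, the step size and the constant-depth assumption are all illustrative, not taken from any standard):

```python
import math

def erp_motion_vector(x, y, width, height, step=0.05, depth=1.0):
    """Sketch of a parametric model: the motion vector (in pixels) at ERP
    pixel (x, y) caused by a small forward camera translation `step`,
    assuming all scene points lie at a constant distance `depth`."""
    # Pixel -> spherical angles (longitude in [-pi, pi], latitude in [-pi/2, pi/2])
    lon = (x / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - y / height) * math.pi
    # Spherical -> 3D point on a sphere of radius `depth` (z = viewing axis)
    px = depth * math.cos(lat) * math.sin(lon)
    py = depth * math.sin(lat)
    pz = depth * math.cos(lat) * math.cos(lon)
    # Camera moves forward along z: the point moves backwards relative to it
    pz -= step
    # Back to spherical angles and ERP pixel coordinates
    new_lon = math.atan2(px, pz)
    new_lat = math.atan2(py, math.hypot(px, pz))
    nx = (new_lon / (2.0 * math.pi) + 0.5) * width
    ny = (0.5 - new_lat / math.pi) * height
    return nx - x, ny - y
```

In this model the motion vector vanishes at the expansion point (the frame centre), and surrounding pixels move radially away from it, matching the expansion-point pattern described above.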
In an aspect, the invention may relate to a method of providing a bitstream comprising video data encoded by an encoder apparatus. In an embodiment, the method may comprise a processor of the encoder apparatus determining a current motion vector of a current block of a current video frame of a sequence of video frames comprising video data.
The method may also comprise the processor determining or receiving motion information, the motion information defining whether the current block is part of a set of blocks in the current video frame that is associated with a non-uniform motion vector field.
The method may also comprise the processor determining a motion vector predictor candidate, wherein the determining includes: selecting one of a plurality of motion vector predictor algorithms, the plurality of motion vector predictor algorithms including at least a first motion vector predictor algorithm and a second motion vector predictor algorithm, the selection being based on the motion information; determining a list of motion vector predictor candidates based on the selected motion vector predictor algorithm; and, selecting the motion vector predictor candidate from the list of motion vector predictor candidates.
The method may further comprise determining a motion vector difference based on the selected motion vector predictor candidate and the current motion vector.
The method may also comprise the processor generating a bitstream, the generating including encoding the motion vector difference, an indication of the selected motion vector predictor candidate and a residual block, the residual block defining a difference between the current block and a prediction block.
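The encoder-side steps above may be sketched as follows (a simplified model: the candidate orderings, the motion-information format and the use of a nearest-candidate cost as a stand-in for full rate-distortion optimization are all assumptions of this sketch, and the names are illustrative):

```python
def encode_motion_vector(current_mv, motion_info, spatial_mvs, temporal_mvs):
    """Sketch: select a predictor algorithm based on the motion information,
    build a candidate list, pick the candidate closest to the current MV,
    and return its index plus the MV difference to be entropy coded."""
    if motion_info["non_uniform"]:
        # First algorithm: temporal candidates first (stronger correlation
        # across frames when the motion field is non-uniform)
        candidates = temporal_mvs + spatial_mvs
    else:
        # Second algorithm: spatial candidates first (classic AMVP-style order)
        candidates = spatial_mvs + temporal_mvs

    # Stand-in for rate-distortion optimization: pick the closest candidate
    def cost(mvp):
        return abs(current_mv[0] - mvp[0]) + abs(current_mv[1] - mvp[1])

    index, mvp = min(enumerate(candidates), key=lambda c: cost(c[1]))
    mvd = (current_mv[0] - mvp[0], current_mv[1] - mvp[1])
    return index, mvd  # signaled in the bitstream together with the residual
```

In a real encoder the candidate choice would be driven by an RDO scheme rather than the simple distance cost used here.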
In an embodiment, the generating a bitstream may comprise inserting an indication of the selected motion vector predictor algorithm or at least part of the motion information into the bitstream. In an embodiment, the method may further comprise the processor instructing a transmitter to transmit the bitstream to a receiver. Hence, in this embodiment, the indication of a motion vector predictor algorithm as selected by the encoder on the basis of the motion information or the motion information is transmitted in the bitstream, e.g. in-band, to a receiver.
In another embodiment, the method may comprise the processor instructing a transmitter to transmit the bitstream and the indication of a motion vector predictor algorithm selected by the encoder or the motion information to a receiver. In this embodiment, the indication of a motion vector predictor algorithm or the motion information may be transmitted separately from the bitstream, e.g. out-of-band, to a receiver.
The current motion vector may define a spatial offset of the current block relative to a prediction block of a previously encoded reference video frame stored in a memory of the encoder apparatus.
In an embodiment, the selection of the motion vector predictor candidate may be based on an optimization scheme such as a rate distortion optimization process.
The invention uses motion information associated with the current video frame to determine motion vector predictor candidates. If non-uniform motion is present in the current video frame (or a region in the current video frame), then the associated motion vectors may form a non-uniform motion vector field. If such a non-uniform motion vector field is determined or signaled, a motion vector predictor algorithm can be selected that takes the non-uniform motion vector field into account when generating motion vector predictor candidates. This way, the compression efficiency can be improved.
In this application, the term prediction block refers to a set of reference samples in one or more reference frames that are used to predict the current block. The reference samples of a prediction block do not necessarily have a one-to-one correspondence with samples of the current block. Often the reference samples are used in an interpolation scheme to predict samples of the current block. Further, it is submitted that the samples of a prediction block may have any suitable shape, e.g. a polygon shape, including rectangular, triangular or other shapes.
In an embodiment, selecting one of a plurality of motion vector predictor algorithms may include: selecting the first motion vector predictor algorithm, if the motion information defines that the current block is part of a set of blocks that are associated with a non-uniform motion vector field; and, selecting the second motion vector predictor algorithm, if the motion information defines that the current block is part of a set of blocks that are associated with a uniform motion vector field.
In an embodiment, determining a list of motion vector predictor candidates may include: the first motion vector predictor algorithm determining at least part of the list of motion vector predictor candidates based on one or more motion vectors of one or more already encoded blocks of one or more reference video frames stored in the memory of the encoder apparatus; or, the first motion vector predictor algorithm determining at least part of the list of motion vector predictor candidates based on a parametric model of the non-uniform motion vector field. In an embodiment, the parametric model may be a parametric algorithm configured to compute the non-uniform motion vector field at the position of the current block.
Hence, in these embodiments, a first algorithm may be used to first evaluate temporal candidates which—due to the non-uniform motion—may have a strong correlation with the motion vector of the current block. Alternatively, a parametric algorithm may be used to build at least part of the list of motion vector predictor candidates.
In an embodiment, the determining a list of motion vector predictor candidates may include: the second motion vector predictor algorithm determining at least part of the list of motion vector predictor candidates based on first evaluating one or more motion vectors of one or more already encoded blocks of the current video frame; and, optionally, after evaluating one or more motion vectors of one or more already encoded blocks of the current video frame, evaluating one or more motion vectors of one or more already encoded blocks of one or more reference video frames stored in the memory of the encoder apparatus.
In this embodiment, if the motion vector field is uniform, a second vector predictor algorithm may be used to determine candidates. In that case, a list may be built by first evaluating spatial candidates.
Current state-of-the-art coding standards such as HEVC, VP9, AV1 and VVC use motion vector prediction schemes wherein spatial candidates are evaluated first and thereafter, if no or not enough suitable spatial candidates are found, temporal candidates. In the case of non-uniform motion fields, however, such a scheme does not yield proper motion vector predictors.
The embodiments in this application propose to determine a list of motion vector predictor candidates based on a first algorithm which is configured to first evaluate motion vector predictor candidates based on motion vectors associated with blocks of samples in reference frames (i.e. temporal candidates) or to evaluate motion vector predictor candidates which are computed based on a parametric model of the non-uniform motion field.
The schemes to determine a motion vector predictor candidate are implemented in both the encoder and the decoder. The indication of the selected motion vector predictor candidate and the motion information are transmitted to the decoder, which selects a motion vector prediction algorithm based on the motion information. The decoder may build a list of candidates on the basis of the selected algorithm and use the indication of the motion vector predictor candidate selected by the encoder to select the motion vector predictor candidate that may be used for reconstructing a current block.
Thus, if a non-uniform motion vector field is present (if the non-uniformity of the motion vectors is above a certain threshold), a first motion vector predictor algorithm (a first algorithm) may be selected which is configured to handle non-uniform motion vectors. This way, the compression efficiency can be improved. If no non-uniform motion vector field is present (if the non-uniformity of the motion vectors is below a certain threshold), the encoder may select a second motion vector predictor scheme (a second algorithm) that is optimized for uniform motion vectors (typically a set of motion vectors representing translational motion).
The second motion vector predictor algorithm may be used to build a list of motion vector predictor candidates by first evaluating spatial candidates. If not enough suitable spatial candidates are available, the second motion vector predictor algorithm may evaluate temporal candidates, typically motion vectors of blocks of samples close to a block in the reference frame that is co-located with the current block of the current frame.
Based on the motion information, a first or second motion vector predictor algorithm may be used to build a candidate list by evaluation of motion vectors associated with blocks of samples that are already encoded. Deriving a list of candidates by evaluating temporal candidates first or by only evaluating temporal candidates will provide substantial advantages in the coding efficiency of video that comprises non-uniform motion vectors.
In an embodiment, the motion information may include one or more parameters for a map function, wherein the map function may be configured to determine or signal one or more first regions in the current frame for which the first algorithm may be used. In another embodiment, the map function may be configured to determine or signal one or more second regions in the current frame for which the second algorithm may be used. The map function may be implemented in both the encoder and the decoder, so that only one or more parameters are needed to determine the complete map.
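A possible map function is sketched below (the parameter names and the circular-region model are illustrative assumptions): both the encoder and the decoder can derive the complete region map from a few signaled parameters, here the positions of the expansion and compression points of an ERP frame and a radius around them.

```python
def region_uses_first_algorithm(block_x, block_y, params):
    """Illustrative map function: from a few signaled parameters, decide
    whether the block at (block_x, block_y) falls in a region where the
    first (non-uniform) motion vector predictor algorithm applies."""
    for cx, cy in (params["expansion_point"], params["compression_point"]):
        # Inside a circle around an expansion/compression point -> non-uniform
        if (block_x - cx) ** 2 + (block_y - cy) ** 2 <= params["radius"] ** 2:
            return True  # first algorithm
    return False  # elsewhere -> second (uniform) algorithm
```

Only the few parameters in `params` would need to be transmitted, since the same function is implemented on both sides.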
In an embodiment, the indication of a motion vector predictor algorithm may include a value for signaling the processor to use the first motion vector predictor algorithm or the second motion vector predictor algorithm.
In an embodiment, the indication of a motion vector predictor algorithm may include a map, preferably a binary map, the map including a plurality of data units, e.g. bits or bytes, each data unit being associated with a block of the current frame, each data unit including a value for signaling the processor to use the first motion vector predictor algorithm or the second motion vector predictor algorithm.
In an embodiment, the indication of a motion vector predictor algorithm may include a map defining one or more first regions for which the first algorithm is used. In an embodiment, the map may also define one or more second regions in the current frame for which the second algorithm is used.
In an embodiment, the video frames of the sequence of video frames may comprise spherical video data. In an embodiment, the spherical video data may be projected onto a rectangular video frame based on a projection model. In an embodiment, the projection model may be an equirectangular or a cubic projection model.
In an embodiment, at least part of the motion information may be included in the bitstream as one or more network abstraction layer, NAL, units. Such NAL units may include at least one of: a non-VCL NAL unit such as a Picture Parameter Set, PPS, or a Sequence Parameter Set, SPS. These NAL units may be NAL units as defined in the HEVC coding standard or coding standard based on the HEVC standard.
In an embodiment, at least part of the motion information may be included in the bitstream in a Slice Header as defined in the VVC coding standard or a Slice Header as defined in the AVC or HEVC coding standard. The encoding and decoding processes in this application may be based on a coding standard, such as a block-based video coding standard. In an embodiment, the video coding standard may be based on one of the AVC, HEVC, VP9, AV1, or VVC coding standard or a coding standard based on one of these standards.
In an embodiment, the processor determining motion information may include: comparing a magnitude and/or a direction of motion vectors of blocks in a region of the current video frame; and, determining that the motion vectors in the region belong to a non-uniform motion field based on the compared magnitudes and/or directions.
Hence, motion vectors associated with blocks in a video frame may define a motion vector field. A set of rules can be used to evaluate the size and direction of motion vectors of the motion vector field of a block-partitioned picture to decide if a set of motion vectors defines a uniform or a non-uniform motion vector field. For example, a motion vector field may be determined as uniform if the magnitude and/or direction of motion vectors of a set of blocks in a region of a video frame are substantially the same (wherein deviations within a limited range may be allowable). In other words, the motion vector field may be determined as uniform if the differences in the magnitude and/or direction of motion vectors associated with the set of blocks are below a certain value, e.g. a threshold value. A motion vector field may be determined as non-uniform if the size and direction of motion vectors of a set of blocks in a frame exhibit substantial differences. For example, differences in the magnitude and/or direction of motion vectors associated with a set of blocks in a particular area of a video frame may be larger than a certain value, e.g. a threshold value.
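Such a rule set may be sketched as follows (the tolerance values and function name are illustrative assumptions): a set of motion vectors is classed as uniform when every vector's magnitude and direction stay within a tolerance of the mean vector.

```python
import math

def is_uniform_field(motion_vectors, mag_tol=0.25, ang_tol=0.35):
    """Sketch of the uniformity rule: compare each motion vector's magnitude
    and direction against the mean vector; any deviation beyond the
    tolerances makes the field non-uniform."""
    mean_x = sum(mv[0] for mv in motion_vectors) / len(motion_vectors)
    mean_y = sum(mv[1] for mv in motion_vectors) / len(motion_vectors)
    mean_mag = math.hypot(mean_x, mean_y)
    mean_ang = math.atan2(mean_y, mean_x)
    for mx, my in motion_vectors:
        # Relative magnitude deviation (guard against a zero-length mean)
        if abs(math.hypot(mx, my) - mean_mag) > mag_tol * max(mean_mag, 1e-9):
            return False
        # Angular deviation, wrapped to [0, pi]
        diff = abs(math.atan2(my, mx) - mean_ang)
        if min(diff, 2.0 * math.pi - diff) > ang_tol:
            return False
    return True
```

A region whose vectors fail this test would be handled by the first (non-uniform) motion vector predictor algorithm.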
A non-uniform motion vector field may be caused by camera movements. In that case the motion vector field may be referred to as a global motion vector field. Non-uniform global motion can be modelled, e.g. described by a mathematical function. Examples of such functions are described in this application. In the case of a non-uniform local motion vector field, e.g. a vector field associated with a non-uniformly moving (e.g. rotating) object, such non-uniform vector fields may be determined by performing object analysis and motion vector analysis on a sequence of video frames.
In an aspect, the invention may relate to a method for reconstructing a block of a video frame from a bitstream encoded by an encoder apparatus, wherein the method may comprise the steps of: a processor of a decoder apparatus receiving a bitstream comprising an encoded current block of a current video frame to be decoded by the decoder apparatus based on one or more already decoded prediction blocks of one or more reference video frames stored in the memory of the decoder apparatus and based on a current motion vector representing a spatial offset of the current block relative to one of the one or more prediction blocks, the current motion vector being associated with a motion vector predictor candidate; the processor receiving an indication of a motion vector predictor algorithm selected by an encoder for determining a list of motion vector predictor candidates from which the encoder has selected the motion vector predictor candidate; or, the processor receiving motion information defining whether the current block is part of a set of blocks in the current frame that is associated with a non-uniform motion vector field, preferably a non-uniform global motion vector field; the processor decoding the encoded current block into a residual block, a motion vector difference and an indication of the selected motion vector predictor candidate, the residual block defining a difference between one of the one or more already decoded prediction blocks and the current block; the processor determining the motion vector predictor candidate, the determining including: selecting one of a plurality of motion vector predictor algorithms, the plurality of motion vector predictor algorithms including at least a first motion vector predictor algorithm and a second motion vector predictor algorithm, the selection being based on the indication of a motion vector predictor algorithm or the motion information; determining the list of motion vector predictor candidates based on the selected
motion vector predictor algorithm; selecting the motion vector predictor candidate from the list of motion vector predictor candidates.
In an embodiment, the method may further comprise the processor determining the current motion vector based on the selected motion vector predictor candidate and the motion vector difference; and, the processor reconstructing the current block based on the prediction block and the residual block, the reconstructing including using the current motion vector to identify the prediction block of a reference video frame stored in the memory of the decoder apparatus.
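The decoder-side steps above may be sketched as follows (a simplified counterpart of the encoder behaviour; the candidate orderings and the motion-information format are assumptions of this sketch): the decoder selects the predictor algorithm from the motion information, rebuilds the same candidate list as the encoder, and reconstructs the current MV from the signaled candidate index and the decoded MV difference.

```python
def decode_motion_vector(mvp_index, mvd, motion_info, spatial_mvs, temporal_mvs):
    """Sketch: rebuild the candidate list using the algorithm selected from
    the motion information, pick the signaled candidate, and add the MV
    difference to reconstruct the current motion vector."""
    if motion_info["non_uniform"]:
        candidates = temporal_mvs + spatial_mvs  # first algorithm: temporal first
    else:
        candidates = spatial_mvs + temporal_mvs  # second algorithm: spatial first
    mvp = candidates[mvp_index]
    return (mvp[0] + mvd[0], mvp[1] + mvd[1])
```

The reconstructed motion vector then identifies the prediction block, to which the residual block is added to reconstruct the current block.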
In an embodiment, the selecting one of a plurality of motion vector predictor algorithms based on the motion information may include: selecting the first motion vector predictor algorithm, if the motion information defines that the current block is part of a set of blocks that are associated with a non-uniform motion vector field; and, selecting the second motion vector predictor algorithm, if the motion information defines that the current block is part of a set of blocks that are associated with a uniform motion vector field.
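By way of illustration, the selection logic of this embodiment may be sketched as follows; the flag name `non_uniform` and the function signature are assumptions made for the example and are not part of the embodiment:

```python
def select_mvp_algorithm(motion_info, first_algorithm, second_algorithm):
    # Select the first algorithm when the current block belongs to a set
    # of blocks associated with a non-uniform (e.g. global) motion vector
    # field, and the second (conventional) algorithm otherwise.
    if motion_info["non_uniform"]:
        return first_algorithm
    return second_algorithm
```

In a decoder, the content of `motion_info` would be derived from the signalled indication or motion information carried in the bitstream.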
In an embodiment, the determining a list of motion vector predictor candidates may include: the first motion vector predictor algorithm determining at least part of the list of motion vector predictor candidates based on one or more motion vectors of one or more already encoded blocks of one or more reference video frames stored in the memory of the encoder apparatus; or, the first motion vector predictor algorithm determining at least part of the list of motion vector predictor candidates based on a parametric model of the non-uniform motion vector field, preferably the parametric model representing a parametric algorithm configured to compute the non-uniform motion vector field at the position of the current block.
In a further aspect, the invention may relate to an encoding apparatus comprising: a computer readable storage medium having at least part of a program embodied therewith; and, a computer readable storage medium having computer readable program code embodied therewith, and a processor, preferably a microprocessor, coupled to the computer readable storage medium, wherein responsive to executing the computer readable program code, the processor is configured to perform executable operations comprising: determining a current motion vector of a current block of a current video frame of a sequence of video frames comprising video data, the current motion vector defining a spatial offset of the current block relative to a prediction block of a previously encoded reference video frame stored in a memory of the encoder apparatus; determining or receiving motion information, the motion information defining whether the current block is part of a set of blocks in the current video frame that is associated with a non-uniform motion vector field, preferably a non-uniform global motion vector field, in the video data of the current video frame; determining a motion vector predictor candidate, wherein the determining includes: selecting one of a plurality of motion vector predictor algorithms, the plurality of motion vector predictor algorithms including at least a first motion vector predictor algorithm and a second motion vector predictor algorithm, the selection being based on the indication of a motion vector predictor algorithm or the motion information; determining a list of motion vector predictor candidates based on the selected motion vector predictor algorithm; selecting the motion vector predictor candidate from the list of motion vector predictor candidates, preferably the selecting being based on an optimization scheme such as a rate distortion optimization process; determining a motion vector difference based on the selected motion vector predictor
candidate and the current motion vector; and, generating a bitstream, the generating including encoding the motion vector difference, an indication of the selected motion vector predictor candidate and a residual block, the residual block defining a difference between the current block and the prediction block and, optionally, inserting an indication of the selected motion vector predictor algorithm or at least part of the motion information into the bitstream.
The encoding apparatus may be configured to execute any of the steps performed by an encoder apparatus as described by the embodiments in the application.
In yet a further aspect, the invention may relate to a decoding apparatus comprising: a computer readable storage medium having at least part of a program embodied therewith; and, a computer readable storage medium having computer readable program code embodied therewith, and a processor, preferably a microprocessor, coupled to the computer readable storage medium, wherein responsive to executing the computer readable program code, the processor is configured to perform executable operations comprising: receiving a bitstream comprising an encoded current block of a current video frame to be decoded by the decoder apparatus based on one or more already decoded prediction blocks of one or more reference video frames stored in the memory of the decoder apparatus and based on a current motion vector representing a spatial offset of the current block relative to one of the one or more prediction blocks, the current motion vector being associated with a motion vector predictor candidate; receiving an indication of a motion vector predictor algorithm selected by an encoder for determining a list of motion vector predictor candidates from which the encoder has selected the motion vector predictor candidate; or, receiving motion information defining whether the current block is part of a set of blocks in the current frame that is associated with a non-uniform motion vector field, preferably a global motion vector field; decoding the encoded current block into a residual block, a motion vector difference and the indication of a motion vector predictor candidate selected by the encoder apparatus, the residual block defining a difference between one of the one or more already decoded prediction blocks and the current block; determining a motion vector predictor candidate, the determining including: selecting one of a plurality of motion vector predictor algorithms, the plurality of motion vector predictor algorithms including at least a first motion vector predictor algorithm and
a second motion vector predictor algorithm, the selection being based on the indication of a motion vector predictor algorithm or the motion information; determining the list of motion vector predictor candidates based on the selected motion vector predictor algorithm; selecting the motion vector predictor candidate from the list of motion vector predictor candidates using the indication of the motion vector predictor candidate selected by the encoder.
In a further embodiment, the executable operations may include: determining the current motion vector based on the selected motion vector predictor candidate and the motion vector difference; and, reconstructing the current block based on the prediction block and the residual block, the reconstructing including using the current motion vector to identify the prediction block of a reference video frame stored in the memory of the decoder apparatus.
The decoding apparatus may be configured to execute any of the steps performed by a decoder apparatus as described by the embodiments in the application.
In another embodiment, the first motion vector predictor algorithm may be configured to build a list of motion vector predictor candidates which is only based on motion vectors of already encoded blocks of one or more reference video frames stored in the memory of the encoder apparatus. In this embodiment, the spatial candidates are not evaluated at all.
In a further embodiment, the invention may relate to a decoding apparatus configured to execute any of the decoding processes defined in this application.
The invention may also relate to a computer program product comprising software code portions configured for, when run in the memory of a computer, executing the method steps according to any of the process steps described above.
The invention will be further illustrated with reference to the attached drawings, which schematically will show embodiments according to the invention. It will be understood that the invention is not in any way restricted to these specific embodiments.
Spherical video data of a panorama composition may be transformed and formatted by projection and mapping operations (step 106) into 2D rectangular video frames which are encoded by a state-of-the-art video encoder (step 108). The encoded video data may be encapsulated into a transport container so that the video data can be transmitted to a playout device comprising a video decoder, which is configured to decode the video data (step 110) into 2D rectangular frames. For presentation of the content to the user, the playout device renders a 3D (polyhedral) object, and textures it with video data of decoded video frames (step 114). Depending on the projection that was used, decoded video frames are transformed back into omnidirectional video data by reversing the packing, mapping and projection operations (step 112). The encoding process 108 may be implemented in a video encoder apparatus and steps 110-114 may be implemented in a media playback device connected to or integrated in e.g. a head mounted display device (HMD).
The transformation of the spherical video data by projection and mapping operations into 2D rectangular video frames is described in more detail with reference to
Depending on the projection model, after mapping, a projected video frame may include some areas 307 that do not comprise any video data. In order to improve compression efficiency, the pixel regions in the projected video frame may be rearranged and resized, hence removing the areas that do not comprise any video data. This process may be referred to as packing. The packing process results in a packed projected video frame 310 including rearranged pixel regions 312 and horizontally and vertically arranged region boundaries 314,316. Similarly,
In the context of 360 video, it is very likely that the video content will not be captured or computer-generated using a camera rotating around one of its axes. This is because motion in the video that is caused by a rotational camera displacement may trigger motion sickness in the viewer. Translational motion however, in particular slow translational motion, is acceptable to most users. Hence, camera motion in 360 video is predominantly translational camera motion. If camera translation is possible, typically a scene in the video content allows the camera to move in a given direction for a good amount of time, e.g. a scene captured by a moving drone. This way an “immersive” video experience can be provided to the viewer. This means that a scene world as a function of time is not a small closed 3D geometry like a cube (as may be the case when using a static 360 camera). On the contrary, the scene world may include at least two vertical planes (“walls”) on the left and right parallel to the camera motion and two horizontal planes top and bottom, “sky” and “ground” respectively. In some cases, such as a large indoor warehouse, the world scene may be further simplified and can be characterized by only two planes, top and bottom. In yet another case, e.g. outdoor footage, the top plane (sky) can be considered at an infinite distance from the camera and thus can be omitted.
It is well known that in the case of translational camera motion, there is a corresponding change in the images captured by the camera. If a camera moves with a certain velocity, a point of a 3D scene world imaged on the image plane of a camera can be assigned a vector indicating a magnitude and direction of the movement of the point on the image plane due to camera movement. The collection of the motion vectors for each point (e.g. pixel or collection of pixels such as a video block) in a video frame may form a motion vector field associated with the video frame. The motion vector field thus forms a representation of 3D motion as it is projected onto a camera image wherein the 3D motion causes a (relative) movement between (parts of) the real-world scene and the camera. The motion vector field can be represented as a function which maps image coordinates to a 2-dimensional vector. The motion field in projected spherical video due to translational movement of a 360-video camera will exhibit a distinct pattern depending on the type of projection that is used to project the spherical video onto a rectangular 2D plane.
In addition, the trigonometric relationship in the triangle OPH gives:
Here, by definition angle θ for all the points of the top plane is within the range [0;π] since d_top ≠ 0. The motion visible on the sphere surface is given by an angular velocity of the point P′ intersecting the sphere and the segment OP. Within the frame of reference of the camera, point P moves, i.e. point P is associated with a non-zero motion vector. In order to determine a displacement of the point P′ on the circle while point P moved by δI, a tangent equation (providing a relation between the position of point P and the angle θ) may be evaluated:
By differentiating the equation with respect to time, the following expressions of the motion field for the plane world scene may be derived (d_top is kept constant with respect to time):
These equations are valid when v⃗_P and v⃗_C are in the same plane OXZ, i.e. P is in the plane OXZ. When P is not in this plane, the point P can be defined in a spherical coordinate system by (r, θ, φ) with φ being the azimuth angle and θ the polar angle in the spherical coordinate system and O its origin.
In the case P is not in the plane OXZ, the previous equation can be generalized by substituting v_P with the projection of v⃗_P onto the line (HP), hence falling back to the case in which v⃗_P and v⃗_C are aligned, which gives:
v_P = −v_C sin φ
This expression provides a horizontal component (an x-component) of the motion field of an ERP projection due to camera motion (see also
In a similar way, the vertical y-component of the motion field can be determined. These expressions indicate that the motion field produced by translational motion in a plane world model is characterized by a specific set of properties and can be analytically modelled using a simple parametric function as described above. As will be shown in the figures, the motion field is either globally zero or characterized by a focus of expansion (an expansion point) and a focus of contraction (contraction point) which are separated by 180 degrees. Moreover, any flow that is present is parallel to the geodesics connecting a focus of expansion to a focus of contraction.
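As a rough illustration, the horizontal component of such a parametric motion field for an ERP frame may be computed as follows; the mapping of pixel column to azimuth and the omission of any polar-angle dependence are simplifying assumptions of this sketch, not part of the derivation above:

```python
import math

def erp_motion_field_x(width, height, v_c):
    # Sketch of the horizontal motion-field component for an
    # equirectangular (ERP) frame under purely translational camera
    # motion in the plane world model. The azimuth phi is assumed to
    # span [-pi, pi) across the frame width; the component is then
    # proportional to -v_c * sin(phi), so the field vanishes at the
    # focus of expansion (phi = 0) and the focus of contraction
    # (phi = +/-pi), which are separated by 180 degrees.
    row = []
    for x in range(width):
        phi = (x / width) * 2.0 * math.pi - math.pi
        row.append(-v_c * math.sin(phi))
    # in this simplification every row carries the same x-component
    return [list(row) for _ in range(height)]
```

A full model would also include the vertical component and the dependence on the polar angle θ and the plane distance d_top.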
For example, the motion field depicted in video frames as shown in
The motion vector associated with a certain video block in a video frame represents a shift in (blocks of) pixels due to motion in the video, including motion caused by camera motion. Therefore, the motion field in a projected video frame (which can be accurately modelled) at the position of a video block provides an effective measure of the contribution of the camera motion to the motion vector. As shown in
Hence, in the white and black areas of
Although the effect of translational camera movement is explained above with reference to an ERP projection, the described effect, i.e. the distinctive pattern of the non-uniform motion field, will also appear in other projections such as cube map. This is because each of these projections will introduce a characteristic pattern in the video.
Projected video frames that include motion fields due to a moving camera need to be encoded by a state-of-the-art video coding system, e.g. HEVC, AV1, VP10, etc. These video coding systems rely on inter-prediction techniques in order to increase the coding efficiency. These techniques rely on temporal correlations between video blocks in different video frames, wherein the correlation between these blocks can be expressed based on a motion vector. Such a motion vector provides an offset of a video block in a decoded picture relative to the same video block in a reference picture (either an earlier video frame or a future video frame). For each video block in the video frame a motion vector can be determined. The motion vectors of all video blocks of a video frame may form a motion vector field. When examining the motion vector field of projected video frames, e.g. video frames comprising ERP or cubic projected video, the motion field at a position of a certain video block provides a measure of the motion of the pixels of that block wherein the motion includes non-uniform global motion due to translational camera movement as described with reference to
Experimental data of motion vector fields of decoded video frames is depicted in
As shown in
The mean MV values for the vertical, Y direction acquired for each of the 60×120 blocks by averaging over the motion vectors of the video frames are shown in
The effect of camera motion causing a non-uniform motion vector field of a distinct pattern in the video and affecting the direction and/or magnitude of the motion vectors determined by a motion estimation module of a video coding system not only occurs in spherical video but also in other types of video such as 2D video. For example,
Hence, from the statistical analysis of the motion vector fields as depicted in
State of the art video coding systems such as HEVC, VP9 and AV1, use inter-prediction techniques to reduce the temporal redundancy in a sequence of video pictures in order to achieve efficient data compression. Inter-prediction uses a motion estimation technique to determine motion vectors (MVs), wherein a motion vector identifies reference samples in a reference video frame that are suitable for predicting a video block in a current video frame that needs to be encoded. The reference samples identified by a motion vector may be referred to as a prediction block. The difference between samples of a prediction block and samples of a current video block may define residual samples of a residual video block. Such a residual video block, or in short residual block, may be encoded together with metadata, such as the MV, into a bitstream and transmitted to a video playout device that includes a video decoder for decoding the encoded information in the bitstream using the metadata. MVs may be further compressed by exploiting correlations between motion vectors. Translational moving objects in video pictures typically cover many blocks that have similar motion vectors in direction and length. Video coding standards typically exploit this redundancy.
For example, HEVC uses a so-called Advanced Motion Vector Prediction (AMVP) algorithm which aims to improve compression by using motion vectors that have already been calculated as references for predicting a current motion vector. In this scheme, a motion vector predictor (MVP) for a block may be generated by determining candidate motion vectors of already encoded blocks of the current video frame or co-located blocks of a decoded reference video frame and selecting a candidate motion vector as the best predictor for the current block based on an optimization scheme such as the well-known rate-distortion optimization (RDO) scheme. The motion vector difference (MVD) between the motion vector predictor and the motion vector and information about the selected motion vector predictor may be entropy coded into a bitstream. The decoder may use the information of the selected motion vector predictor and the motion vector difference to reconstruct a motion vector of a block that needs to be decoded.
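The AMVP-style signalling described above may be sketched as follows; the candidate list is assumed given, and the absolute-difference cost is a stand-in for a full rate-distortion optimization:

```python
def encode_motion_vector(mv, candidates):
    # Pick the candidate predictor (MVP) that minimises the motion
    # vector difference, and signal only the candidate index plus the
    # MVD instead of the full motion vector.
    def cost(c):
        return abs(mv[0] - c[0]) + abs(mv[1] - c[1])
    idx = min(range(len(candidates)), key=lambda i: cost(candidates[i]))
    mvp = candidates[idx]
    mvd = (mv[0] - mvp[0], mv[1] - mvp[1])
    return idx, mvd

def decode_motion_vector(idx, mvd, candidates):
    # Decoder side: rebuild the MV from the signalled index and MVD,
    # assuming the decoder derived the same candidate list.
    mvp = candidates[idx]
    return (mvp[0] + mvd[0], mvp[1] + mvd[1])
```

The compression gain comes from the MVD typically being much smaller, and hence cheaper to entropy code, than the motion vector itself.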
A problem associated with inter-prediction algorithms used by most coding standards is that these inter-prediction algorithms are based on the assumption that the motion relates to translational movement of objects in the video, which is sometimes referred to as local motion compensation. Motion vectors however provide an estimate of all motion in video blocks of a video frame, including for example local motion associated with moving objects, global motion associated with a moving background caused by a moving camera and any other type of motion. Conventional inter-prediction schemes are optimized for local motion when the spatial correlation between motion vectors is close to an identity function, i.e. neighbouring motion vectors tend to have the same amplitude and direction. These inter-prediction schemes are however not designed to deal with other types of motion, such as non-uniform global motion effects. Such non-uniform motion fields are particularly dominant where 360 video is captured by a moving camera. The embodiments in this application address this problem.
In an embodiment, the first motion vector predictor algorithm may be configured to build a list of motion vector predictor candidates by first evaluating already encoded blocks of a reference frame. In an embodiment, if no suitable temporal candidates are found or not enough suitable temporal candidates are found, spatial candidates may be evaluated. In another embodiment, the first algorithm may be configured to only evaluate temporal candidates and not spatial candidates.
In another embodiment, the first motion vector predictor algorithm may be configured to determine a list of motion vector predictor candidates based on a parametric model of the non-uniform motion. For example, the parametric model may be used to compute the non-uniform motion at the position of the current block. Examples of such parametric models are described with reference to
Thus, based on motion information associated with a current block of a current frame, the encoder may decide to use a first motion vector predictor algorithm which is capable of generating suitable motion vector predictor candidates for blocks that have a motion vector that is part of a non-uniform motion field in a part of the current video frame. Deriving a list of candidates by first evaluating temporal candidates (or by evaluating only temporal candidates) will provide substantial advantages in efficient encoding of video that comprises non-uniform motion, such as global motion. These advantages are explained in more detail with reference to
The second motion vector predictor algorithm may be configured to build a list of motion vector predictor candidates by first evaluating motion vectors of already encoded blocks of the current frame, typically encoded blocks in the neighbourhood of the current block. These motion vector predictor candidates may be referred to as spatial candidates. If the algorithm does not find any suitable candidates (or not enough candidates) in the current frame, it may evaluate motion vectors of already encoded blocks in a reference frame, typically blocks close to a block in the reference frame that is co-located with the current block of the current frame. These motion vector predictor candidates may be referred to as temporal candidates. Current state of the art coding standards such as HEVC, VP9, AV1 and VVC use schemes wherein spatial candidates are evaluated first and, if no suitable spatial candidates are found, temporal candidates are evaluated thereafter.
An example of a process of constructing a list of two motion vector predictor candidates is shown in
As shown in the figure, after evaluation of the A and B candidates, the algorithm checks if the list contains two candidates (step 814). If the motion vectors are identical (step 816), one candidate is removed from the list and the algorithm continues with the evaluation of the temporal candidates C0 and C1 (steps 820-1 and 820-2) and checks if, after evaluation of the C candidates, the list contains two candidates (step 822). If this is not the case, a null vector may be added to the list (step 824). Hence, the algorithm is configured to build a list that contains two different motion vector predictor candidates so that the encoder can select the best candidate. When the algorithm of
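The list construction described above may be sketched as follows, assuming candidates are (x, y) tuples and unavailable candidates are represented by None:

```python
def build_candidate_list(spatial, temporal, size=2):
    # Evaluate the spatial (A/B) candidates first, prune duplicates,
    # then fill remaining slots with the temporal (C0/C1) candidates,
    # and finally pad with null vectors (step 824) so that the list
    # always holds `size` entries.
    out = []
    for cand in list(spatial) + list(temporal):
        if cand is not None and cand not in out:
            out.append(cand)
        if len(out) == size:
            return out
    while len(out) < size:
        out.append((0, 0))  # null vector padding
    return out
```

This mirrors the duplicate check of step 816: an identical spatial candidate is discarded so that a temporal candidate can take its slot.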
Hence, in most cases, the algorithm depicted in
A (non-limiting) example of such algorithm is depicted in
The advantage of the embodiment depicted in
It is noted however that the example of
To counter this problem, a motion vector prediction scheme may be used as described above with reference to
In an embodiment, a map, e.g. a binary map, associated with a video frame may be used to signal a decoder which motion vector predictor algorithm to use, e.g. a first motion vector predictor algorithm or a second motion vector predictor algorithm. An example of such map is illustrated in
In another embodiment, the map may be generated by a function in the encoder and decoder. For example, the map may define certain areas based on standardized parameters, which can be efficiently compressed and transmitted in the bitstream to the decoder. In yet another embodiment, the encoder and decoder may use a simple parametric function to estimate global motion in video blocks of video frames. A threshold value may be used to determine if the global motion is above a certain level so that motion vector predictor candidates are determined by evaluating temporal candidates first or by evaluating only temporal candidates.
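The threshold-based decision of this embodiment might be sketched as follows; the scalar global-motion estimate and the threshold value are assumptions of the example:

```python
def choose_algorithm_by_threshold(global_motion_estimate, threshold=0.5):
    # If the global motion estimated (e.g. by a parametric function) at
    # the block position is above the threshold, select the first motion
    # vector predictor algorithm; otherwise select the second,
    # conventional algorithm.
    if global_motion_estimate > threshold:
        return "first"
    return "second"
```

Because the same function and threshold would be used at the encoder and the decoder, no per-block signalling is required in this variant.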
The video encoder apparatus may receive video data 1102 to be encoded. In the example of
The mode select unit 1104 may select one of the coding modes (e.g. intra-prediction or inter-prediction modes based on error results of an optimization function such as a rate-distortion optimization (RDO) function), and provides the resulting intra- or inter-coded block to summer 1106 to generate a block of residual video data (a residual block) and to summer 1128 to reconstruct the encoded block for use as a reference picture. During the encoding process, video encoder 1100 may receive a picture or slice to be coded. The picture or slice may be partitioned into multiple video blocks. An inter-prediction unit 1120 in the mode selection unit 1104 may perform inter-prediction coding of the received video block relative to one or more blocks in one or more reference pictures to provide temporal compression. Alternatively, an intra-prediction unit 1118 in the mode selection unit may perform intra-prediction coding of the received video block relative to one or more neighbouring blocks in the same picture or slice as the block to be coded to provide spatial compression. The video encoder may perform multiple coding passes, e.g., to select an appropriate coding mode for each block of video data.
The partition unit 1103 may further partition video blocks into sub-blocks, based on evaluation of previous partitioning schemes in previous coding passes. For example, the partition unit may initially partition a picture or slice into LCUs, and partition each of the LCUs into sub-CUs based on rate-distortion analysis (e.g., rate-distortion optimization). The partitioning unit may further produce a quadtree data structure indicative of partitioning of an LCU into sub-CUs. Leaf-node CUs of the quadtree may include one or more PUs and one or more TUs.
The motion vector estimation unit 1116 may execute a process of determining motion vectors for video blocks. A motion vector, for example, may indicate a displacement Dx,Dy of a prediction block (a prediction unit or PU) of a video block within a reference picture (or other coded unit) relative to the current block being coded within the current picture (or other coded unit). The motion estimation unit may compute a motion vector by comparing the position of the video block to the position of a prediction block of a reference picture that approximates the pixel values of the video block. Accordingly, in general, data for a motion vector may include a reference picture list (e.g. an (indexed) list of already decoded pictures (video frames) stored in the memory of the encoder apparatus), an index into the reference picture list, a horizontal (x) component and a vertical (y) component of the motion vector. The reference picture may be selected from one or more reference picture lists, e.g. a first reference picture list, a second reference picture list, or a combined reference picture list, each of which identify one or more reference pictures stored in reference picture memory 1114.
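For illustration, the motion vector data enumerated above may be grouped in a structure such as the following; the field names are assumptions for the example and do not correspond to normative syntax elements:

```python
from dataclasses import dataclass

@dataclass
class MotionVectorData:
    # Illustrative container for the data of a motion vector:
    ref_pic_list: int  # which reference picture list (e.g. list 0 or 1)
    ref_idx: int       # index into that reference picture list
    dx: int            # horizontal (x) component of the motion vector
    dy: int            # vertical (y) component of the motion vector
```

An encoder would fill such a structure per block and pass it to the entropy coding stage, possibly after motion vector prediction has replaced (dx, dy) by an MVP index and an MVD.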
The MV motion estimation unit 1116 may generate and send a motion vector that identifies the prediction block of the reference picture to entropy encoding unit 1112 and the inter-prediction unit 1120. That is, motion estimation unit 1116 may generate and send motion vector data that identifies a reference picture list containing the prediction block, an index into the reference picture list identifying the picture of the prediction block, and a horizontal and vertical component to locate the prediction block within the identified picture.
Instead of sending the actual motion vector, a motion vector prediction unit 1122 may predict the motion vector to further reduce the amount of data needed to communicate the motion vector. In that case, rather than encoding and communicating the motion vector itself, the motion vector prediction unit 1122 may generate a motion vector difference (MVD) relative to a known motion vector, the motion vector predictor (MVP). The MVP may be used with the MVD to define the current motion vector. In general, to be a valid MVP, the motion vector being used for prediction must point to the same reference picture as the motion vector currently being coded.
The motion vector prediction unit 1122 may be configured to build a MVP candidate list that may include motion vectors associated with a plurality of already encoded blocks in spatial and/or temporal directions as candidates for a MVP. In an embodiment, the plurality of blocks may include blocks in the current video frame that are already decoded and/or blocks in one or more reference frames, which are stored in the memory of the encoder apparatus. In an embodiment, the plurality of blocks may include neighbouring blocks, i.e. blocks neighbouring the current block in spatial and/or temporal directions, as candidates for a MVP. A neighbouring block may include a block directly neighbouring the current block or a block that is in the neighbourhood of the current block, e.g. within a few blocks distance.
The list may be built using an algorithm that is configured to evaluate the blocks in the spatial and/or temporal directions. Depending on the global motion associated with a block, the motion vector prediction unit may select one of a plurality of motion vector predictor algorithms for determining a MVP candidate list. Non-limiting examples of these algorithms are described with reference to
For example, in an embodiment, it may receive motion information in the form of a map, a global motion map, that signals which of the blocks in the video frames that need to be encoded are associated with global motion of a certain level. In a further embodiment, the map may signal which motion vector predictor algorithm should be used for blocks in a video frame. In an embodiment, the map may be a bitmap, wherein each block is associated with a bit or a bit code that signals a predetermined motion vector predictor algorithm. In another embodiment, it may receive motion information in the form of global motion values which can be used by the decoder (e.g. together with one or more threshold values) to determine which motion vector predictor algorithm should be used. In an embodiment, global motion values may be used to compute an estimate of global motion in a video frame based on a model, e.g. an analytical or a computer model as e.g. described with reference to
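A bitmap-based lookup of the signalled algorithm might, for example, take the following form; the map layout (one bit per block, row-major) is an assumption of the sketch:

```python
def algorithm_for_block(block_x, block_y, global_motion_map):
    # Look up the signalled predictor algorithm for a block from a
    # binary map with one bit per block: 1 selects the first
    # (non-uniform / global motion) algorithm, 0 selects the second
    # (conventional) algorithm.
    bit = global_motion_map[block_y][block_x]
    return "first" if bit else "second"
```

Since the map only carries one bit per block, it can be compressed efficiently, e.g. by run-length or entropy coding, before being inserted into the bitstream.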
If the motion information signals the decoder that the block is part of a set of blocks that are associated with a non-uniform motion vector field, it may select a motion vector predictor algorithm which is configured to determine motion vector candidates by evaluating temporal candidates. In an embodiment, the algorithm may first evaluate temporal candidates before evaluating spatial candidates. In an embodiment, a motion vector predictor algorithm may be configured to only evaluate temporal candidates. It is further noted that the plurality of algorithms may be configured as sub-modules of one motion vector prediction algorithm. What matters is that the motion information determines how the motion vector prediction unit builds the list of motion vector predictor candidates. When multiple MVP candidates are available (from multiple candidate blocks), MV prediction unit 1122 may determine a MVP for a current block according to predetermined selection criteria. For example, MV prediction unit 1122 may select the most accurate predictor from the candidate list based on analysis of encoding rate and distortion (e.g., using a rate-distortion cost analysis or other coding efficiency analysis). Other methods of selecting a motion vector predictor are also possible. Upon selecting a MVP, MV prediction unit may determine a MVP index, which may be used to inform a decoder apparatus where to locate the MVP in the candidate list of MVP candidates. MV prediction unit 1122 may also determine the MVD between the current motion vector and the selected MVP. The MVP index and MVD may be used to reconstruct the motion vector of a current block. Typically, the partition unit and mode selection unit (including the intra- and inter-prediction unit and the motion vector predictor unit) and the motion vector estimation unit may be highly integrated. These units are illustrated separately in the figures for conceptual purposes.
A residual video block may be formed by an adder 1106 subtracting a predicted video block (as identified by a motion vector) received from mode select unit 1104 from the original video block being coded. The transform processing unit 1109 may be used to apply a transform, such as a discrete cosine transform (DCT) or a conceptually similar transform, to the residual video block to form a block of residual transform coefficient values. Transforms that are conceptually similar to DCT may include for example wavelet transforms, integer transforms, sub-band transforms, etc. The transform processing unit 1109 applies the transform to the residual block, producing a transformed residual block. In an embodiment, the transformed residual block may comprise a block of residual transform coefficients. The transform may convert the residual information from a pixel value domain to a transform domain, such as a frequency domain. Transform processing unit 1109 may send the resulting transform coefficients to a quantization unit 1110, which quantizes the transform coefficients to further reduce bit rate.
A controller 1110 may provide syntax elements (metadata) of the encoding process, such as inter-mode indicators, intra-mode indicators, partition information, and syntax information, to entropy coding unit 1112.
Here the syntax elements may include information for signalling (selected) motion vector predictors (for example an indication, e.g. an index in an indexed list, of the MVP candidate selected by the encoder), motion vector differences and metadata associated with the motion vector prediction process as described in the embodiments of this application. The metadata may further include at least part of the motion information that is used by the encoder apparatus to select a motion vector predictor algorithm for determining the list of MVP candidates.
The entropy coding unit 1112 may be configured to encode the quantized transform coefficients and the syntax elements. For example, entropy coding unit may perform context adaptive variable length coding (CAVLC), context adaptive binary arithmetic coding (CABAC), syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding or another entropy coding technique. In the case of context-based entropy coding, context may be based on neighbouring blocks. Following the entropy coding by entropy coding unit, the encoded bitstream may be transmitted to another device (e.g., a video decoder) or stored for later transmission or retrieval.
The inverse quantization and inverse transform unit 1115 may be configured to apply an inverse quantization and inverse transformation, respectively, to reconstruct the residual block in the pixel domain, e.g., for later use as a reference block. Inter-prediction unit 1120 may calculate a reference block by adding the residual block to a prediction block of one of the pictures of reference picture memory 1114. Inter-prediction unit 1120 may also apply one or more interpolation filters to the reconstructed residual block to calculate sub-integer pixel values for use in motion estimation. The reconstructed residual block may be added to the motion prediction block produced by the inter-prediction unit 1120 to produce a reconstructed video block for storage in the reference picture memory 1114. The reconstructed video block may be used by motion vector estimation unit 1116 and inter-prediction unit 1120 as a reference block to inter-code a block in a subsequent picture.
The encoder apparatus may perform a known rate-distortion optimisation (RDO) process in order to find the best coding parameters for coding blocks in a picture. Here, the best coding parameters (including mode decision (intra-prediction or inter-prediction); intra prediction mode estimation; motion estimation; and quantization) refer to the set of parameters that provide the best trade-off between a number of bits used for encoding a block versus the distortion that is introduced by using the number of bits for encoding.
The term rate-distortion optimization is sometimes also referred to as RD optimization or simply “RDO”. RDO schemes that are suitable for AVC and HEVC type coding standards are known as such, see for example, Sze et al. “High efficiency video coding (HEVC).” Integrated Circuit and Systems, Algorithms and Architectures. Springer (2014): 1-375; Section: 9.2.7 RD Optimization. RDO can be implemented in many ways. In one well-known implementation, the RDO problem can be expressed as a minimization of a Lagrangian cost function J with respect to a Lagrangian multiplier λ: J = D + λ·R.
Here, the parameter R represents the rate (i.e. the number of bits required for coding) and the parameter D represents the distortion of the video signal that is associated with a certain rate R. The distortion D may be regarded as a measure of the video quality. Known metrics for objectively determining the quality (objectively in the sense that the metric is content agnostic) include mean-squared error (MSE), peak-signal-to-noise ratio (PSNR) and sum of absolute differences (SAD).
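The three quality metrics named above may be sketched for flat (one-dimensional) pixel blocks as follows; this is a minimal illustration, not an implementation mandated by any standard:

```python
import math

def mse(block_a, block_b):
    """Mean-squared error between two equally sized pixel blocks."""
    n = len(block_a)
    return sum((a - b) ** 2 for a, b in zip(block_a, block_b)) / n

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def psnr(block_a, block_b, max_value=255):
    """Peak signal-to-noise ratio in dB (infinite for identical blocks)."""
    error = mse(block_a, block_b)
    return float('inf') if error == 0 else 10 * math.log10(max_value ** 2 / error)
```

Note that MSE and SAD measure distortion (lower is better), whereas PSNR measures quality (higher is better); PSNR is derived directly from the MSE.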
In the context of HEVC, the rate-distortion cost may require that the encoder apparatus computes a predicted video block using each or at least part of the available prediction modes, i.e. one or more intra-prediction modes and/or one or more inter-prediction modes. The encoder apparatus may then determine a difference video signal between each of the predicted blocks and the current block (here the difference signal may include a residual video block) and transform each of the determined residual video blocks from the spatial domain to the frequency domain into a transformed residual block. Next, the encoder apparatus may quantize each of the transformed residual blocks to generate corresponding encoded video blocks. The encoder apparatus may decode the encoded video blocks and compare each of the decoded video blocks with the current block to determine a distortion metric D. Moreover, the rate-distortion analysis may involve computing the rate R for each encoded video block associated with one of the prediction modes, wherein the rate R includes a number of bits used to signal an encoded video block. The thus determined RD costs, the distortion D and the rate R of the encoded blocks for each of the prediction modes, are then used to select an encoded video block that provides the best trade-off between the number of bits used for encoding the block versus the distortion that is introduced by using the number of bits for encoding.
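The mode selection step described above reduces to minimising the Lagrangian cost J = D + λ·R over the candidate modes. A minimal sketch, in which the function name and the candidate representation (mode name mapped to a distortion/rate pair) are illustrative assumptions:

```python
def select_best_mode(candidates, lam):
    """Pick the prediction mode minimising the Lagrangian cost J = D + lam * R.

    `candidates` maps a mode name to a (distortion, rate) pair, as obtained
    by encoding and decoding the current block under each prediction mode.
    `lam` is the Lagrangian multiplier trading rate against distortion.
    """
    return min(candidates,
               key=lambda mode: candidates[mode][0] + lam * candidates[mode][1])
```

For example, with candidates `{"intra": (100, 40), "inter": (80, 60)}`, a small multiplier (λ = 0.5) favours the low-distortion inter mode, while a large multiplier (λ = 2.0) favours the low-rate intra mode, illustrating how λ steers the trade-off.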
In the example of
Similar to the motion vector predictor unit of the encoder apparatus of
The list may be determined by an algorithm that is configured to evaluate the blocks in the spatial and/or temporal directions. Depending on the global motion associated with a block, the motion vector prediction unit may select one of a plurality of algorithms for building a MVP candidate list. For example, the algorithms as described with reference to
Decoder apparatus 1200 may be configured to receive an encoded video bitstream 1202 that comprises encoded video blocks and associated syntax elements from a video encoder. Entropy decoding unit 1204 decodes the bitstream to generate transformed decoded residual blocks (e.g. quantized coefficients associated with residual blocks), motion vector differences, and syntax elements (metadata) for enabling the video decoder to decode the bitstream. The syntax elements may include metadata associated with the motion vector prediction process as described in the embodiments of this application. For example, in an embodiment, the metadata may include an indication of the MVP candidate selected by the encoder. For example, the indication may include an index in an indexed list of MVP candidates. Further, in an embodiment, the metadata may include at least part of the motion information that was used by the encoder to select a motion vector predictor algorithm for determining the list of MVP candidates.
Parser unit 1206 forwards the motion vector differences and associated syntax elements to prediction unit 1218. The syntax elements may be received at video slice level and/or video block level. For example, by way of background, video decoder 1200 may receive compressed video data that has been compressed for transmission via a network into so-called network abstraction layer (NAL) units. Each NAL unit may include a header that identifies a type of data stored to the NAL unit. There are two types of data that are commonly stored to NAL units. The first type of data stored to a NAL unit is video coding layer (VCL) data, which includes the compressed video data. The second type of data stored to a NAL unit is referred to as non-VCL data, which includes additional information such as parameter sets that define header data common to a large number of NAL units and supplemental enhancement information (SEI).
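In the AVC (H.264) case, the NAL unit header is a single byte comprising a forbidden_zero_bit (1 bit), nal_ref_idc (2 bits) and nal_unit_type (5 bits); types 1-5 carry VCL data, while e.g. type 7 (sequence parameter set) and type 8 (picture parameter set) are non-VCL. A sketch of parsing that byte (the function name is illustrative):

```python
def parse_nal_header(first_byte):
    """Split the one-byte AVC (H.264) NAL unit header into its fields.

    Layout, most significant bit first: forbidden_zero_bit (1 bit),
    nal_ref_idc (2 bits), nal_unit_type (5 bits).
    """
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x1,
        "nal_ref_idc": (first_byte >> 5) & 0x3,
        "nal_unit_type": first_byte & 0x1F,
    }
```

For instance, the byte 0x67 commonly seen at the start of an AVC stream decodes to nal_unit_type 7 (an SPS) with nal_ref_idc 3, and 0x68 decodes to nal_unit_type 8 (a PPS).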
When video blocks of a video frame are intra-coded (I), intra-prediction unit 1220 of prediction unit 1218 may generate prediction data for a video block of the current video slice based on a signalled intra-prediction mode and data from previously decoded blocks of the current picture. When video blocks of a video frame are inter-coded (e.g. B or P), inter-prediction unit 1222 of prediction unit 1218 may produce prediction blocks for a video block of the current video slice based on motion vector differences and other syntax elements received from entropy decoding unit 1204. The prediction blocks may be produced from one or more of the reference pictures within one or more of the reference picture lists stored in the memory of the video decoder. The video decoder may construct the reference picture lists using default construction techniques based on reference pictures stored in reference picture memory 1216.
Inter-prediction unit 1222 may determine prediction information for a video block of the current video slice by parsing the motion vector differences and other syntax elements and using the prediction information to produce prediction blocks for the current video block being decoded. For example, inter-prediction unit 1222 uses some of the received syntax elements to determine a prediction mode (e.g., intra- or inter-prediction) which was used to code the video blocks of the video slice, an inter-prediction slice type (e.g., B slice or a P slice), construction information for one or more of the reference picture lists for the slice, motion vector predictors for each inter-encoded video block of the slice, inter-prediction status for each inter-coded video block of the slice, and other information to decode the video blocks in the current video slice. In some examples, inter-prediction unit 1222 may receive certain motion information from motion vector prediction unit 1224.
The decoder apparatus may retrieve a motion vector difference MVD and an associated encoded block representing a current block that needs to be decoded. In order to determine a motion vector based on the MVD, the motion vector prediction unit 1224 may determine a candidate list of motion vector predictor candidates associated with a current block. The motion vector predictor unit 1224 may be configured to build a list of motion vector predictors in the same way as done by the motion vector predictor unit in the encoder. Thus, based on motion information, the motion vector predictor unit may select one motion vector prediction algorithm from a plurality of motion vector predictor algorithms and use the selected motion vector predictor algorithm to determine a list of motion vector prediction candidates. The motion vector prediction unit may select a motion vector prediction algorithm as e.g. described with reference to
The motion vector prediction algorithm may evaluate motion vector predictor candidates which are associated with blocks in the current frame or a reference frame that have a predetermined position (typically neighbouring) relative to the position of the current block. These relative positions are known to the encoder and the decoder apparatus. Thereafter, the motion vector prediction unit may select a motion vector predictor MVP from the list of motion vector prediction candidates based on the indication of the selected motion vector predictor candidate which was transmitted in the bitstream to the decoder. Based on the MVP and the MVD the inter-prediction unit may determine a prediction block for the current block.
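The decoder-side reconstruction step described above may be sketched as follows; the function name and the tuple representation of motion vectors are illustrative assumptions:

```python
def reconstruct_mv(candidate_list, mvp_index, mvd):
    """Reconstruct the current block's motion vector at the decoder.

    The decoder builds the same MVP candidate list as the encoder, picks
    the predictor indicated by the signalled index, and adds the signalled
    motion vector difference: MV = MVP + MVD.
    """
    mvp = candidate_list[mvp_index]
    return (mvp[0] + mvd[0], mvp[1] + mvd[1])
```

For example, with the candidate list `[(4, 2), (1, 0)]`, a signalled index 0 and an MVD of `(-1, 3)`, the reconstructed motion vector is `(3, 5)`.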
Inverse quantization and inverse transform unit 1208 may inverse quantize, i.e. de-quantize, the quantized transform coefficients provided in the bitstream and decoded by the entropy decoding unit. The inverse quantization process may include the use of a quantization parameter calculated by the video encoder for each video block in the video slice to determine a degree of quantization and, likewise, a degree of inverse quantization to be applied. It may further apply an inverse transform, e.g., an inverse DCT, an inverse integer transform, or a conceptually similar inverse transform process, to the transform coefficients in order to produce residual blocks in the pixel domain.
After the inter-prediction unit 1222 has generated the prediction block for the current video block, the video decoder may form a decoded video block by summing a residual block with the corresponding prediction block. The adder 1209 represents the component or components that perform this summation operation. If desired, a deblocking filter may also be applied to filter the decoded blocks to remove blocking artefacts. Other loop filters (either in the coding loop or after the coding loop) may also be used to smooth pixel transitions, or otherwise improve the video quality. The decoded video blocks in a given picture are then stored in reference picture memory 1216, which stores reference pictures that may be used for subsequent coding of further current blocks. Reference picture memory 1216 also stores decoded video for later presentation on a display device.
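The summation performed by the adder may be sketched per pixel as follows. The clipping of the sum to the valid sample range is an assumption of this sketch (video codecs constrain decoded samples to the range determined by the bit depth):

```python
def reconstruct_block(prediction, residual, bit_depth=8):
    """Sum a prediction block and a decoded residual block per pixel,
    clipping each result to the valid sample range (0..255 for 8 bit)."""
    max_value = (1 << bit_depth) - 1
    return [min(max(p + r, 0), max_value) for p, r in zip(prediction, residual)]
```

For example, a prediction sample 250 plus a residual 10 clips to 255, and a prediction sample 10 plus a residual -20 clips to 0.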
The encoding process depicted in
In a further step 1304, the encoder apparatus may determine or receive motion information associated with the current video frame and select a MVP algorithm from a plurality of MVP algorithms based on the motion information. For example, if the motion information informs the encoder apparatus that a current block is associated with a non-uniform motion field (e.g. the current block is part of a set of blocks for which the differences between motion vectors in magnitude and/or direction are higher than a certain threshold value), then the encoder apparatus may select an MVP algorithm which first (or only) evaluates temporal MVP candidates, i.e. motion vectors of blocks in a reference frame; alternatively or in addition, it may evaluate modelled MVP candidates. If the motion information informs the encoder apparatus that a current block is associated with a uniform motion field (e.g. the current block is part of a set of blocks for which the differences between motion vectors in magnitude and/or direction are lower than a certain threshold value), the encoder apparatus may select an MVP algorithm which first evaluates spatial MVP candidates, i.e. motion vectors of blocks in the current video frame that are already encoded.
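The threshold-based classification in step 1304 may be sketched as follows. The uniformity criterion used here (maximum per-component deviation from the mean motion vector) is one illustrative choice; the embodiments leave the exact magnitude/direction criterion open:

```python
def motion_field_is_uniform(motion_vectors, threshold):
    """Classify a set of block motion vectors as a uniform motion field.

    Illustrative criterion: the field is uniform when every motion vector
    deviates from the mean motion vector by less than the threshold in
    both components.
    """
    n = len(motion_vectors)
    mean_x = sum(mv[0] for mv in motion_vectors) / n
    mean_y = sum(mv[1] for mv in motion_vectors) / n
    return all(abs(mv[0] - mean_x) < threshold and abs(mv[1] - mean_y) < threshold
               for mv in motion_vectors)

def select_mvp_algorithm(motion_vectors, threshold):
    """Uniform field: evaluate spatial candidates first; otherwise temporal first."""
    uniform = motion_field_is_uniform(motion_vectors, threshold)
    return "spatial_first" if uniform else "temporal_first"
```

Thus a set of near-identical motion vectors selects the spatial-first algorithm, while widely diverging motion vectors (e.g. at the boundary of a moving object) select the temporal-first algorithm.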
In step 1306, the selected algorithm may be used to determine a list of MVP candidates. An MVP candidate may be selected from the list on the basis of an optimization scheme such as a rate distortion optimization scheme. Based on the selected MVP candidate and the current MV a motion vector difference may be determined. In a further step 1308, a bitstream is formed, wherein the formation may include encoding the residual block, the motion vector difference MVD and an indication of the selected MVP candidate using an encoding scheme, e.g. an entropy encoding scheme. Further, the formation of the bitstream may include inserting at least part of the motion information into a bitstream so that the motion information can be transmitted in-band to a receiver. In another embodiment, the motion information can be transmitted in an out-of-band channel to the receiver.
The decoding process depicted in
The decoder apparatus may use the motion information to select a motion vector prediction algorithm from a plurality of MVP algorithms that reside in the memory of the decoder apparatus (step 1312) and the selected algorithm may be used to determine a list of MVP candidates. Further, the indication of the selected MVP candidate may be used to select a MVP candidate from the list and a MV may be formed based on the selected MVP candidate and the MVD (step 1314). Finally, the method may include a step of reconstructing the current block based on a prediction block and the residual block, the reconstruction including using the motion vector to identify the prediction block of a reference video frame (step 1316).
Hence, as shown by the process of
The motion information may be formatted into a suitable data format so that it can be inserted in the bitstream together with the video data. For example, in an embodiment, the parameters may be inserted in a header of a suitable data structure of the bitstream. The type of data structure may depend on the video coding standard that is used for encoding and decoding. For example, a Tile Group header as defined in the VVC coding standard may be used to send the motion information to a decoder apparatus. In another embodiment, a Slice header as defined in the AVC and HEVC coding standards may be used to send the motion information to the decoder apparatus. In another embodiment, the parameters may be inserted in one or more NAL units, in particular non-VCL NAL units, such as a picture parameter set (PPS) or a Sequence Parameter Set (SPS). An example of a PPS is provided in table 1:
Thus, the motion information may further include information signalling in which part or parts of the video frame or tile no spatial candidates should be used.
Alternatively, in a further embodiment, the motion information may include a flag, e.g. a binary flag, associated with one or more blocks in a video frame. The flag may signal a decoder apparatus that, when constructing a list of MVP candidates for a block in a video frame, a slice or a tile, no spatial candidates should be used for determining the list of motion vector predictor candidates. For example, such a flag may signal the processor to use either a first motion vector predictor algorithm or a second motion vector predictor algorithm.
In a further embodiment, the motion information may include a binary map comprising a string of values, for example a string of bits or bytes, wherein each value is associated with a block in a frame and signals a decoder to use a first motion vector predictor algorithm or a second motion vector predictor algorithm.
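Expanding such a packed binary map into per-block algorithm choices may be sketched as follows; the raster-scan block order, most-significant-bit-first packing and function name are illustrative assumptions:

```python
def algorithm_per_block(bitmap_bytes, num_blocks):
    """Expand a packed binary map into a per-block algorithm choice.

    Each bit is associated with one block (here assumed in raster order,
    most significant bit first): a 0 selects the first motion vector
    predictor algorithm, a 1 selects the second.
    """
    algorithms = []
    for index in range(num_blocks):
        byte = bitmap_bytes[index // 8]
        bit = (byte >> (7 - index % 8)) & 0x1
        algorithms.append("second" if bit else "first")
    return algorithms
```

For example, the single map byte 0b10100000 signals, for the first three blocks of the frame, the second, first and second algorithm respectively.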
In an embodiment, the motion information described with reference to the embodiments may be sent in-band, i.e. inside the bitstream itself, for example just before the motion vectors, to a decoder apparatus. Alternatively, in an embodiment, the motion information may be sent in an out-of-band channel to the decoder apparatus.
The second video processing device may receive the encoded video data to be decoded through a transmission channel 1506 or any type of medium or device capable of moving the encoded video data from the first video processing device to the second video processing device. In one example, the transmission channel may include a communication medium to enable the first video processing device to transmit encoded video data directly to the second video processing device in real-time. The encoded video data may be transmitted based on a communication standard, such as a wireless communication protocol, to the second video processing device. The communication medium may comprise any wireless or wired communication medium, such as a radio frequency (RF) spectrum or one or more physical transmission lines. The communication medium may form part of a packet-based network, such as a local area network, a wide-area network, or a global network such as the Internet. The communication medium may include routers, switches, base stations, servers or any other equipment that may be useful to facilitate communication between first and second video processing devices.
Alternatively, encoded data may be sent via an I/O interface 1508 of the first video processing device to a storage device 1510. Encoded data may be accessed via an I/O interface 1512 of the second video processing device. Storage device 1510 may include any of a variety of distributed or locally accessed data storage media such as a hard drive, Blu-ray discs, DVDs, CD-ROMs, flash memory, volatile or non-volatile memory, or any other suitable digital storage media for storing encoded video data. In a further example, the storage device may correspond to a file server or another intermediate storage device that may hold the encoded video generated by the first video processing device. The second video processing device may access stored video data from the storage device via streaming or downloading. The file server may be any type of server capable of storing encoded video data and transmitting that encoded video data to the second video processing device. Example file servers include a web server (e.g., for a website), an FTP server, network attached storage (NAS) devices, or a local disk drive. The second video processing device may access the encoded video data through any standard data connection, including an Internet connection. This may include a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., DSL, cable modem, etc.), or a combination of both that is suitable for accessing encoded video data stored on a file server. The transmission of encoded video data from storage device 1510 may be a streaming transmission, a download transmission, or a combination of both.
The techniques of this disclosure are not necessarily limited to wireless applications or settings. The techniques may be applied to video coding in support of any of a variety of multimedia applications, such as over-the-air television broadcasts, cable television transmissions, satellite television transmissions, streaming video transmissions, e.g., via the Internet, encoding of digital video for storage on a data storage medium, decoding of digital video stored on a data storage medium, or other applications. In some examples, system 1500 may be configured to support one-way or two-way video transmission to support applications such as video streaming, video playback, video broadcasting, and/or video telephony.
In the example of
The captured, pre-captured, or computer-generated video may be encoded by video encoder 1516. The encoded video data may be transmitted directly to the second video processing device via I/O interface 1508. The encoded video data may also (or alternatively) be stored onto storage device 1510 for later access by the second video processing device or other devices, for decoding and/or playback.
The second video processing device may further comprise a video decoder 1518, and a display device 1520. In some cases, I/O interface 1512 may include a receiver and/or a modem. I/O interface 1512 of the second video processing device may receive the encoded video data. The encoded video data communicated over the communication channel, or provided on storage device 1510, may include a variety of syntax elements generated by video encoder 1516 for use by a video decoder, such as video decoder 1518, in decoding the video data. Such syntax elements may be included with the encoded video data transmitted on a communication medium, stored on a storage medium, or stored at a file server.
Display device 1520 may be integrated with, or external to, the second video processing device. In some examples, second video processing device may include an integrated display device and also be configured to interface with an external display device. In other examples, second video processing device may be a display device. In general, display device displays the decoded video data to a user, and may comprise any of a variety of display devices such as a liquid crystal display (LCD), a plasma display, an organic light emitting diode (OLED) display, or another type of display device.
Video encoder 1516 and video decoder 1518 may operate according to a video compression standard, such as High Efficiency Video Coding (HEVC), VP9, AV1 or Versatile Video Coding (VVC). Alternatively, video encoder 1516 and video decoder 1518 may operate according to other proprietary or industry standards, such as the ITU-T H.264 standard, alternatively referred to as MPEG-4, Part 10, Advanced Video Coding (AVC), or extensions of such standards. The techniques of this disclosure, however, are not limited to any particular coding standard.
Although not shown in
Video encoder 1516 and video decoder 1518 each may be implemented as any of a variety of suitable encoder circuitry, such as one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), discrete logic, software, hardware, firmware or any combinations thereof. When the techniques are implemented partially in software, a device may store instructions for the software in a suitable, non-transitory computer-readable medium and execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Each of video encoder 1516 and video decoder 1518 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (CODEC) in a respective device.
This disclosure may generally refer to a video encoder “signaling” certain information to another device, such as a video decoder. The term “signaling” may generally refer to the communication of syntax elements and/or other data (metadata) used to decode the compressed video data. Such communication may occur in real- or near-real-time. Alternatively, such communication may occur over a span of time, such as might occur when storing syntax elements to a computer-readable storage medium in an encoded bitstream at the time of encoding, which then may be retrieved by a decoding device at any time after being stored to this medium.
The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chip set). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Number | Date | Country | Kind |
---|---|---|---|
19219743.2 | Dec 2019 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2020/087852 | 12/24/2020 | WO |