A video coding format is a content representation format for storage or transmission of digital video content (such as in a data file or bitstream). It typically uses a standardized video compression algorithm. Examples of video coding formats include H.262 (MPEG-2 Part 2), MPEG-4 Part 2, H.264 (MPEG-4 Part 10), HEVC (H.265), Theora, RealVideo RV40, VP9, and AV1. A video codec is a device or software that provides encoding and decoding for digital video. Codecs are typically implementations of one or more video coding formats.
Recently, there has been an explosive growth of video usage on the Internet. Some websites (e.g., social media websites or video sharing websites) may have billions of users and each user may upload or download one or more videos each day. When a user uploads a video from a user device onto a website, the website may store the video in one or more different video coding formats, each being compatible with or more efficient for a certain set of applications, hardware, or platforms. However, with many uploaded videos from different users and user devices, the quality of the videos varies. Video quality metrics play an essential role in determining the coding parameters for subsequent processing of the uploaded videos. Therefore, improved techniques for calculating video quality metrics would be desirable.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
Video encoder 100 includes many modules. Some of the main modules of video encoder 100 are shown in
Video encoder 100 includes a central controller module 108 that controls the different modules of video encoder 100, including motion estimation module 102, mode decision module 104, decoder prediction module 106, decoder residue module 110, filter 112, and DMA controller 114.
Video encoder 100 includes a motion estimation module 102. Motion estimation module 102 includes an integer motion estimation (IME) module 118 and a fractional motion estimation (FME) module 120. Motion estimation module 102 determines motion vectors that describe the transformation from one image to another, for example, from one frame to an adjacent frame. A motion vector is a two-dimensional vector used for inter-frame prediction; it relates the current frame to the reference frame, and its coordinate values provide the coordinate offsets from a location in the current frame to a location in the reference frame. Motion estimation module 102 estimates the best motion vector, which may be used for inter prediction in mode decision module 104. An inter coded frame is divided into blocks known as macroblocks. Instead of directly encoding the raw pixel values for each block, the encoder tries to find a block similar to the one it is encoding in a previously encoded frame, referred to as a reference frame. This search is performed by a block matching algorithm. If the encoder succeeds in its search, the block may be encoded by a vector, known as a motion vector, which points to the position of the matching block in the reference frame. The process of motion vector determination is called motion estimation.
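As a hedged illustration of block matching in general (not encoder 100's actual IME/FME pipeline), the following sketch performs a full search over a small window and selects the integer motion vector with the lowest sum of absolute differences (SAD); the function name and search parameters are illustrative assumptions.

```python
import numpy as np

def best_motion_vector(cur_block, ref_frame, block_xy, search_range=8):
    """Full-search block matching: return the (dy, dx) offset minimizing SAD."""
    by, bx = block_xy                      # top-left corner of the current block
    bh, bw = cur_block.shape
    best_sad, best_mv = float("inf"), (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bh > ref_frame.shape[0] or x + bw > ref_frame.shape[1]:
                continue                   # candidate block falls outside the reference frame
            sad = np.abs(cur_block.astype(np.int32)
                         - ref_frame[y:y + bh, x:x + bw].astype(np.int32)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv                         # coordinate offsets into the reference frame
```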
Video encoder 100 includes a mode decision module 104. The main components of mode decision module 104 include an inter prediction module 122, an intra prediction module 128, a motion vector prediction module 124, a rate-distortion optimization (RDO) module 130, and a decision module 126. Mode decision module 104 detects one prediction mode among a number of candidate inter prediction modes and intra prediction modes that gives the best results for encoding a block of video.
Decoder prediction module 106 includes an inter prediction module 132, an intra prediction module 134, and a reconstruction module 136. Decoder residue module 110 includes a transform and quantization module (T/Q) 138 and an inverse quantization and inverse transform module (IQ/IT) 140.
Video quality metrics may be used to evaluate the quality of different video codecs, encoders, encoding settings, or transmission variants. Video Multimethod Assessment Fusion (VMAF) is an objective full-reference video quality metric. It predicts subjective video quality based on reference and distorted video sequences. VMAF uses existing image quality metrics and other features to predict video quality. It is a fusion-based video quality assessment method that includes multiple component metrics. The VMAF standard model uses features from three component metrics, including Visual Information Fidelity (VIF), Detail Loss Metric (DLM), and motion, fused with a support vector regression (SVR) model.
Visual Information Fidelity (VIF) is a full reference image quality assessment index based on natural scene statistics and the notion of image information extracted by the human visual system. VIF considers information fidelity loss at four different spatial scales; the standard VMAF model uses the features corresponding to these four spatial scales rather than the full VIF metric. VIF is deployed in the core of the Netflix VMAF video quality monitoring system, which controls the picture quality of all encoded videos streamed by Netflix. Detail Loss Metric (DLM) measures the loss of details and impairments that distract viewer attention. Motion is defined as the Mean Absolute Difference (MAD) of consecutive low-pass filtered video frames. In VMAF, the above component metrics are fused using a support vector machine (SVM) based regression to provide a single output score in the range of 0-100 per video frame, where 100 indicates quality identical to the reference video. Calculating VMAF is computationally intensive and therefore consumes a significant amount of power. Therefore, improved techniques to reduce the complexity of calculating VMAF would be desirable.
Based on model 200, the mutual information between the perceived original image 210 (after HVS processing) and the original image 202 may be computed. Similarly, the mutual information between the perceived distorted image 212 and the original image 202 may be computed. VIF is then defined as the ratio between these two mutual information measures.
VIF uses natural scene statistics to determine the degree of distortion between the tested video frame and the original video frame. For each scale, the VIF statistics are calculated with a sliding window over the image.
At 402, at each decomposed level, low pass filtering is performed on a plurality of patches of pixels corresponding to the original image. At each decomposed level, low pass filtering is performed on a plurality of patches of pixels corresponding to the distorted image. In some embodiments, the same Gaussian kernel that is used for Gaussian blurring may also be used for the low pass filtering at step 402. At 404, the local variances of the plurality of patches of pixels of the original image (σC²) and the distorted image (σD²) at the level are computed. At 406, the covariance between the original and distorted image patches (σCD) is computed. At 408, the gain term g is estimated using
At 410, the distortion term V is determined using σV² = σD² − g·σCD.
At 412, the ratio of the mutual information measures is computed based on the sum of the local statistics combined as follows:
where σN² is the variance of the HVS additive noise. In the VMAF implementation, σN² is set to 2.
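For illustration, the sketch below computes the per-scale statistics of steps 402-412 following the publicly documented VIF formulation (gain g = σCD/σC², distortion σV² = σD² − g·σCD, and logarithmic information terms); the Gaussian window parameters and the numerical guards are assumptions and may differ from any particular implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def vif_scale_ratio(ref, dist, sigma_n_sq=2.0, win_sigma=1.5, eps=1e-10):
    """Ratio of mutual information measures for one scale, per steps 402-412."""
    lp = lambda x: gaussian_filter(x, win_sigma)                 # step 402: low-pass over local patches
    mu_c, mu_d = lp(ref), lp(dist)
    sigma_c_sq = np.maximum(lp(ref * ref) - mu_c * mu_c, 0.0)    # step 404: local variance, original
    sigma_d_sq = np.maximum(lp(dist * dist) - mu_d * mu_d, 0.0)  # step 404: local variance, distorted
    sigma_cd = lp(ref * dist) - mu_c * mu_d                      # step 406: local covariance
    g = sigma_cd / (sigma_c_sq + eps)                            # step 408: gain term
    sigma_v_sq = np.maximum(sigma_d_sq - g * sigma_cd, 0.0)      # step 410: distortion term
    num = np.log2(1.0 + g * g * sigma_c_sq / (sigma_v_sq + sigma_n_sq))
    den = np.log2(1.0 + sigma_c_sq / sigma_n_sq)
    return num.sum() / den.sum()                                 # step 412: ratio of the summed statistics
```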
DLM in VMAF is computed in the wavelet domain. As shown in
As shown in
The approximation L1 sub-band is then processed by a Level-2 (L2) decomposition module 606, which comprises another filter bank with low-pass and high-pass filters. Level-2 module 606 generates a Level-2 wavelet decomposition output 614, which includes an approximation L2 sub-band, a vertical L2 sub-band, a horizontal L2 sub-band, and a diagonal L2 sub-band. The approximation L2 sub-band is then processed by a Level-3 decomposition module 608 that generates a Level-3 wavelet decomposition output 616, and finally the approximation L3 sub-band is processed by a Level-4 decomposition module 610 that generates a Level-4 decomposition output 618. Output 616 and output 618 each include four sub-bands, namely the approximation, vertical, horizontal, and diagonal sub-bands of the level. It should be recognized that the approximation sub-band at each level has a different scale: because each level halves the resolution in both dimensions, the approximation sub-band at each level contains ¼ as many samples as that of the previous level.
The DLM component metric for VMAF applies a four-level Daubechies 2 (db2) DWT to the video frames 602. Daubechies 2 is a 4-tap wavelet. The decomposition module at each level includes a 4-tap decomposition low-pass filter and a 4-tap decomposition high-pass filter.
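As a tooling-level illustration (PyWavelets is assumed here purely for demonstration, not as the encoder's actual implementation), a four-level db2 decomposition with the sub-band layout of outputs 612-618 can be obtained as follows.

```python
import numpy as np
import pywt

frame = np.random.rand(288, 352).astype(np.float64)   # placeholder luma frame
coeffs = pywt.wavedec2(frame, wavelet="db2", level=4)
# coeffs[0] is the Level-4 approximation sub-band; coeffs[1..4] are tuples of
# (horizontal, vertical, diagonal) detail sub-bands from the coarsest to the finest level.
approx_l4 = coeffs[0]
for idx, (h, v, d) in enumerate(coeffs[1:], start=1):
    print(f"detail level {5 - idx}: H {h.shape}, V {v.shape}, D {d.shape}")
```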
With reference again to
For the HVS function, a Contrast Sensitivity Function (CSF) and a Contrast Masking (CM) function are applied to the restored coefficients R, while only the CSF function is applied to O. The CSF function is implemented as sub-band weighting to account for the contrast sensitivity of the nominal spatial frequency for each level. The sub-band weights associated with CSF are as defined in the original DLM standard. The Contrast Masking function assumes that the restored image R and the additive impairment image A are both viewed at the same time and essentially each acts as a mask for the other. Therefore, the coefficients in A are used as the contrast mask for R.
After the restored image R has been processed by the HVS functions (i.e., CSF and CM), then at step (3) of process 500, DLM 510 is computed as a ratio between the Minkowski sum of the coefficients of the restored image R and those of the original image O. DLM only uses the detail sub-bands across all levels and completely ignores the approximation sub-band (which contains very little detail information by the fourth level).
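The simplified sketch below illustrates the decoupling and the Minkowski-sum ratio of step (3). The CSF sub-band weighting, contrast masking, and border handling described above are omitted, and the Minkowski exponent p = 3 follows the commonly published DLM formulation, so this is an assumption-laden illustration rather than the exact computation of process 500.

```python
import numpy as np
import pywt

def dlm_sketch(ref, dist, wavelet="db2", levels=4, eps=1e-10, p=3.0):
    """Simplified DLM: decoupling followed by a Minkowski ratio over detail sub-bands."""
    o_details = pywt.wavedec2(ref, wavelet, level=levels)[1:]    # detail sub-bands only
    t_details = pywt.wavedec2(dist, wavelet, level=levels)[1:]
    num, den = 0.0, 0.0
    for o_level, t_level in zip(o_details, t_details):
        for O, T in zip(o_level, t_level):                       # horizontal, vertical, diagonal
            R = np.clip(T / (O + eps), 0.0, 1.0) * O             # decoupling: restored coefficients
            num += (np.abs(R) ** p).sum() ** (1.0 / p)           # Minkowski sum of restored details
            den += (np.abs(O) ** p).sum() ** (1.0 / p)           # Minkowski sum of original details
    return num / (den + eps)                                     # close to 1 => little detail lost
```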
Among the three component metrics of VMAF, VIF is the most computationally intensive. Therefore, the complexity of VMAF may be reduced by simplifying the VIF calculations, including by modifying VIF to reuse some of the computations that are performed for DLM. As shown above, both DLM and VIF decompose the images into multiple scales but use different ways of decomposing the image. To reduce the complexity of calculating VMAF, both DLM and VIF may be computed based on a single shared decomposition in the wavelet domain.
In the present application, a method of calculating Video Multimethod Assessment Fusion (VMAF) is disclosed. A reference version and a distorted version of a video frame are received. A first component image quality metric included in a plurality of eligible component image quality metrics is computed. A reference version of a video frame is decomposed into a first set of decomposed levels in different scales. A distorted version is decomposed into a second set of decomposed levels in different scales. Detail loss is determined based on the first set and second set of decomposed levels in different scales. A second component image quality metric included in the plurality of eligible component image quality metrics is computed. The first set and second set of decomposed levels in different scales are reused for computing the second component image quality metric. Natural scene statistics are evaluated based on the first set and second set of decomposed levels in different scales. A video quality metric for the distorted version is determined based on at least a portion of the eligible component image quality metrics.
At step 704, a first component image quality metric included in a plurality of eligible component image quality metrics is computed. In some embodiments, the first component image quality metric is DLM, which is one of the three component image quality metrics for VMAF.
With reference to
Step 802 and step 804 may be performed by wavelet transform module 502 in
At step 806 of process 800, detail loss is determined based on the first plurality and second plurality of decomposed levels in different scales. Step 806 may be performed by decoupling module 504, contrast sensitivity function 506, and contrast masking function 508 in process 500, as described above.
The CSF function 506 is implemented as sub-band weighting to account for the contrast sensitivity of the nominal spatial frequency for each level. In some embodiments, the sub-band weights associated with CSF function 506 are those as defined in the original DLM standard. In some embodiments, the sub-band weights are modified based on the new wavelet transform type, i.e., the four-level Haar DWT. For example, the sub-band weights may be adjusted empirically based on the visibility thresholds for wavelet quantization noise.
At step 706, a second component image quality metric included in the plurality of eligible component image quality metrics is computed. In some embodiments, the second component image quality metric is VIF, which is one of the three component image quality metrics for VMAF.
At step 902, the first plurality and second plurality of decomposed levels in different scales are reused for computing the second component image quality metric.
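As a hedged sketch of this reuse, both the reference and distorted frames may be decomposed once with a four-level Haar DWT: the detail sub-bands feed the DLM computation of step 704, while the approximation (LL) band at each level serves as that scale's image for the VIF statistics. The helper below is an illustrative placeholder, not the claimed implementation.

```python
import numpy as np
import pywt

def shared_haar_pyramid(frame, levels=4):
    """One Haar decomposition whose outputs are shared by DLM and VIF."""
    approx_bands, detail_bands = [], []
    cur = frame
    for _ in range(levels):
        cA, (cH, cV, cD) = pywt.dwt2(cur, "haar")
        approx_bands.append(cA)              # LL band: reused as the image at this VIF scale
        detail_bands.append((cH, cV, cD))    # detail sub-bands: reused for the DLM computation
        cur = cA
    return approx_bands, detail_bands

ref_frame = np.random.rand(144, 176)                          # placeholder reference luma plane
dist_frame = ref_frame + 0.01 * np.random.randn(144, 176)     # placeholder distorted luma plane
ref_ll, ref_details = shared_haar_pyramid(ref_frame)
dis_ll, dis_details = shared_haar_pyramid(dist_frame)
# DLM (step 704) consumes ref_details/dis_details; the VIF statistics (steps 904 and 1002-1012)
# are evaluated on each (ref_ll[k], dis_ll[k]) pair, so the decomposition is computed only once.
```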
At step 904, natural scene statistics are evaluated based on the first plurality and second plurality of decomposed levels in different scales. VIF uses natural scene statistics to determine the degree of distortion between the tested video frame and the original video frame. For each scale, the VIF statistics are calculated with a sliding window over the image.
At 1002, at each decomposed level, low pass filtering is performed on a plurality of patches of pixels corresponding to the original image. At each decomposed level, low pass filtering is performed on a plurality of patches of pixels corresponding to the distorted image. In some embodiments, instead of using a Gaussian kernel for the low pass filtering, a box window is used. The advantage is that a box window requires only summations and no multiplications, thereby reducing the amount of computation needed. Using a box filter may also improve the performance of the video quality metrics. In some embodiments, the box window size is 3×3.
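A minimal sketch of the box-window filtering of step 1002, assuming the 3×3 window mentioned above; the local moments it returns feed the variance and covariance computations of steps 1004 and 1006.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def box_local_stats(ref, dist, size=3):
    """Local variances and covariance using a box (uniform) window instead of a Gaussian."""
    lp = lambda x: uniform_filter(x, size=size)                  # 3x3 box window low-pass filter
    mu_c, mu_d = lp(ref), lp(dist)
    sigma_c_sq = np.maximum(lp(ref * ref) - mu_c * mu_c, 0.0)    # step 1004: variance of original
    sigma_d_sq = np.maximum(lp(dist * dist) - mu_d * mu_d, 0.0)  # step 1004: variance of distorted
    sigma_cd = lp(ref * dist) - mu_c * mu_d                      # step 1006: covariance
    return sigma_c_sq, sigma_d_sq, sigma_cd
```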
At 1004, the local variances of the plurality of patches of pixels of the original image (σC²) and the distorted image (σD²) at the level are computed. At 1006, the covariance between the original and distorted image patches (σCD) is computed. At 1008, the gain term g is estimated using
At 1010, the distortion term V is determined using σV² = σD² − g·σCD.
At 1012, the ratio of the mutual information measures is computed based on the sum of the local statistics combined as follows:
where σN² is a parameter representing the variance of the HVS additive noise. In this improved VMAF implementation, σN² is set to 5, instead of 2 as in the original VMAF implementation.
At step 708, a video quality metric for the distorted version with respect to the reference version is determined based on at least a portion of the plurality of eligible component image quality metrics. For example, the video quality metric is determined based on the first component image quality metric determined at step 704 of process 700 and the second component image quality metric determined at step 706 of process 700. In addition, motion is included as the third component image quality metric. Motion may also be beneficially calculated on an LL band of the wavelet pyramid decomposition. As such, it benefits from the low-pass filtering effect of the wavelet approximation filters, which removes noise, and from the lower pixel count of the LL bands compared to the original frame. Features from the three component image quality metrics may be fused with a support vector regression (SVR) model to generate the modified VMAF video quality metric. For example, the four per-scale VIF scores, as they are calculated on the four LL bands of the wavelet decomposition, can serve as features of the support vector regression model, alongside the DLM score evaluated over the same four wavelet decomposition scales and motion score features evaluated on one of the LL bands of the wavelet decomposition.
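The sketch below illustrates such a fusion stage using a generic support vector regressor; the six-feature layout, the scikit-learn tooling, and the placeholder data are assumptions for illustration, since a deployed model would be trained on subjective quality scores.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# One row per frame: four per-scale VIF ratios, the DLM score, and a motion feature
# (e.g., mean absolute difference between consecutive LL bands).
train_features = np.random.rand(200, 6)                  # placeholder training features
train_scores = np.random.uniform(0, 100, size=200)       # placeholder subjective scores

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=4.0, gamma=0.05))
model.fit(train_features, train_scores)

frame_features = np.random.rand(1, 6)                    # features for one distorted frame
vmaf_like_score = float(np.clip(model.predict(frame_features)[0], 0, 100))
```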
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.