This invention relates to a Video Quality Model, a method for training a Video Quality Model and a corresponding device.
As IP networks develop, video communication over wired and wireless IP networks (e.g. IPTV services) has become very popular. Unlike traditional video transmission over cable networks, video delivery over IP networks is much less reliable. The situation is even worse in wireless network environments. Correspondingly, a requirement for Video Quality Modeling and/or Video Quality Measuring (both being denoted VQM herein) is to rate the quality degradation caused by IP transmission impairment (e.g. packet loss, delay, jitter), in addition to that caused by video compression.
When parts of the coded video bitstream are lost during network transmission, the decoder may employ Error Concealment (EC) methods to conceal the lost parts in an effort to reduce the perceptual video quality degradation. However, usually a loss artifact remains after concealment. The less visible the concealed loss artifact is, the more effective is the EC method. The EC effectiveness depends heavily on the video content features and the video compression techniques used.
Rating of EC artifacts determines the initial visible artifact (IVA) level when a packet loss occurs. Further, the IVA will propagate spatio-temporally to the areas that use it as a reference in a predictive video coding framework, like in H.264, MPEG-2, etc. Accurate prediction of the EC artifact level is a fundamental part of VQM for measuring transmission impairment. Different visibility of EC artifacts results from the different EC strategies implemented in the respective decoders. However, the EC method employed by a decoder is not always known before decoding the video.
Thus, one big challenge for VQM on bitstream-level, in particular in the case of network impairment, is to predict the quality level of EC artifacts at the bitstream level before decoding the video. Known solutions that deal with this challenge assume that the EC method used at the decoder is known. But a big problem is that, in practice, there are various versions of implementation of decoders that employ various different EC strategies. EC methods roughly fall into two categories: spatial approaches and temporal approaches. In the spatial category, the spatial correlation between local pixels is exploited, and missing macroblocks (MBs) are recovered by interpolation techniques from the neighboring pixels. In the temporal category, both the coherence of motion fields and the spatial smoothness of pixels along edges across block boundaries are exploited to estimate the motion vector (MV) of a lost MB. In various decoder implementations, these EC methods may be used in combination.
A full-reference (FR) image quality assessment method known in the prior art [1] is limited to a situation where the original frames that do not suffer from network transmission impairment are available. However, in realistic multimedia communication the original signal is often not available. A known no-reference (NR) image quality assessment model [2] is more consistent with realistic video communication situations, but it is not adaptive with respect to EC strategies. An enhanced VQM would be desirable that is capable of adapting automatically to different EC strategies of different decoder implementations that are not known beforehand.
The present invention is based on the recognition of the fact that the effectiveness of various EC methods can be estimated from some common content features and compression technique features. This is valid even if different EC methods are applied to the same case of lost content, which may lead to different EC artifacts levels, such as e.g. spatial EC methods and temporal EC methods. Spatial EC methods recover missing macroblocks (MBs) by interpolation from the neighboring pixels, while temporal EC methods exploit the motion field and the spatial smoothness of pixels on block edges. The invention provides a method and a device for enhanced video quality measurement (VQM) that is capable of adapting automatically to any given decoder implementation that may employ any known or unknown EC strategy. Adaptivity is achieved by training.
Advantageously, the adapted/trained VQM method and device can estimate video quality of a target video when decoded and error concealed by the target video decoder and EC method to be assessed, even without fully decoding and error concealing the target video.
In principle, the present invention comprises selecting training data frames of a predefined type, analyzing predefined typical features of the selected training data frames, decoding the training data frames using the target video decoder (or an equivalent), wherein the decoding may comprise EC, and performing video quality measurement, wherein the video quality of the decoded and error concealed training data frames is measured or estimated using a reference VQM model. The video quality measurement results in a reference VQM metric. Further, a plurality of candidate VQM metrics are calculated from at least some of the analyzed typical features, by a plurality of VQM models (VQMM) or sets of VQM coefficients of at least one given VQMM. The reference VQM metric, the candidate VQM metrics, and the VQMMs or sets of VQM coefficients may be stored. After a plurality of training data frames have been processed in this way, an optimal set of VQM coefficients is determined in an adaptive learning process, wherein the stored candidate VQM metrics are compared and matched with the reference VQM metric. A best-matching candidate VQM metric is determined as the optimal VQM metric, and the corresponding VQM coefficients or the VQMM of the optimal VQM metric are stored as the optimal VQMM. Thus, the stored VQMM or VQM coefficients are optimally suitable for determining the video quality of a video after its decoding and EC using the target decoder and EC strategy. After the training, the VQM model adapted by the determined and stored VQM coefficients can be applied to the target video frames, thereby constituting an adapted VQM tool.
A metric is generally the result, i.e. measure, that is obtained by a measurement method or device, such as a VQM. That is, each measuring algorithm has its own individual metric.
One particular advantage of the invention is that the training dataset can be automatically generated so as to satisfy certain important requirements defined below. Another advantage of the present invention is that an adaptive learning method is employed, which improves modeling of the EC artifacts level assessment for different or unknown EC methods. That is, a VQM model learns the EC effects without having to know and emulate for the assessment the EC strategy employed in any particular target decoder.
In a first aspect, the invention provides a method and a device for generating a training dataset for adaptive VQM, and in particular for learning-based adaptive EC artifacts assessment. In one embodiment, the whole process is performed fully automatically. This has the advantage that the EC artifacts assessment is quick, objective and reproducible.
In one embodiment, interactions from a user are allowed. This has the advantage that the video quality assessment can be subjectively improved by a user.
In principle, the method for generating a training dataset for adapting adaptive VQM to a target video decoder comprises steps of extracting one or more concealed frames from a training video stream, calculating typical features of the extracted frames, decoding the extracted frames and performing EC, wherein the target video decoder and EC unit (or an equivalent) is used, performing a first quality assessment of the one or more extracted frames by a reference VQM model, and performing a second quality assessment of the extracted one or more frames by a plurality of candidate VQM models, each using at least some of the calculated typical features. The second quality assessment employs a self-learning assessment method, and may generate and/or store a training data set for EC artifact assessment.
In one embodiment, a method for generating a training dataset for EC artifacts assessment comprises steps of extracting one or more concealed frames from a training video stream, determining (e.g. calculating) typical features of the extracted frames, decoding the extracted frames and performing EC by using the target video decoder and EC unit (or equivalent), performing a first quality assessment of the decoded extracted frames using a reference VQM model, performing a second quality assessment of the extracted frames by using for each of the decoded extracted frames a plurality of different candidate VQM models or a plurality of different candidate coefficient sets for at least one given VQMM, wherein at least some of the calculated typical features are used, determining from the plurality of VQMMs or VQMM coefficient sets an optimal VQMM or VQMM coefficient set that optimally matches the result of the first quality assessment, wherein for each of the decoded extracted frames the plurality of candidate VQMs are matched with the result of the reference VQM and wherein an optimal VQMM or set of VQMM coefficients is obtained, and providing (e.g. transmitting, or storing for later retrieval) the optimal VQMM or set of VQMM coefficients for video quality assessment of target videos.
In one embodiment, a device for generating a training dataset for EC artifacts assessment comprises a Concealed Frame Extraction module for extracting one or more concealed frames from a training video stream, decoding the extracted frames and performing EC by using the target video decoder and EC unit (or an equivalent), a Typical Feature Calculation unit for calculating typical features of the extracted frames, a Reference Video Quality Assessment unit for performing a first quality assessment of the decoded extracted frames by using a reference VQM model, and a Learning-based EC Artifacts Assessment Module (LEAAM) for performing a second quality assessment of the extracted frames, the LEAAM having a plurality of different candidate VQM models or a plurality of different candidate coefficient sets for a given VQMM, wherein the plurality of different candidate VQMMs or candidate coefficient sets for a given VQMM use at least some of the calculated typical features and are applied to each of the decoded extracted frames. The Learning-based EC Artifacts Assessment Module further has an Analysis, Matching and Selection unit for determining from the plurality of VQMMs or VQMM coefficient sets an optimal VQMM or VQMM coefficient set that optimally matches the result of the first quality assessment, wherein for each of the decoded extracted frames the plurality of candidate VQMs is matched with the reference VQM and wherein an optimal VQMM or set of VQMM coefficients is obtained, and an Output unit for providing (e.g. storing for later retrieval) the optimal VQMM or set of VQMM coefficients for video quality assessment of target videos.
In a second aspect, the present invention provides a VQM method and a VQM tool for a target video, wherein the VQM method and VQM tool comprises an adaptive EC artifact assessment model trained by the generated training dataset. In particular, the invention provides a method for determining video quality of a video frame by using an adaptive VQM model (VQMM) that was automatically adapted to a target video decoder and target EC module (that may be part of, or integrated in, the target video decoder) by the training dataset generated by the above-described method or device. The VQM method according to the second aspect of the invention comprises steps of extracting one or more frames from a target video stream, calculating typical features of the extracted frames, retrieving a stored VQM model and/or stored coefficients of a VQM model, and performing a video quality assessment of the extracted frames by calculating a video quality metric using the retrieved VQM model and/or coefficients of a VQM model, wherein the calculated typical features are used.
According to the second aspect of the invention, a VQM method that is capable of automatically adapting to a target video decoder comprises steps of configuring a VQM model, wherein a stored VQM model or stored coefficients of a VQM model are retrieved and used for configuring, extracting one or more video frames from a target video sequence, calculating typical features of each of the extracted frames, and calculating a video quality metric (e.g. mean opinion score, MOS) of the extracted frames, wherein the configured VQM model and at least some of the calculated typical features are used.
Further, the present invention provides a computer readable medium having executable instructions stored thereon to cause a computer to perform a method for generating a training dataset for EC artifacts assessment that is suitable for automatically adapting to a video decoder and EC unit, wherein adaptive learning is used that is adapted by using a training data set as described above.
VQM according to the invention has the capability to learn different EC effects and later recognize them, in order to be able to estimate video quality when the EC strategy of a decoder is unknown. Advantageously, the invention allows predicting the EC artifacts level in the final picture with improved accuracy.
An advantage of the adaptive EC artifacts measurement solution according to the invention over existing VQM methods is that the EC strategy used in a decoder needs not be known in advance. That is, it is advantageous that the VQM needs not be manually selected for a given target decoder and EC unit. A VQM according to the invention can automatically adapt to different decoders and is more interesting and useful from a practical viewpoint, i.e. more flexible, reliable and user-friendly. Further, a VQM according to the invention can be re-configured. Therefore, it can be applied to different decoders and EC methods, and even can, in a simple manner, be re-adjusted after a decoder update and/or an EC method update. As a result, an EC artifacts level in the final picture can be predicted with improved accuracy even before/without full decoding of the picture, since the typical features that are used for calculating the VQM metric can be obtained from the bitstream without full decoding.
A further advantage of the invention is that, in one embodiment, the whole adaptation process is performed automatically and transparently to users. On the other hand, in one embodiment a user may also input his opinion about image quality and let the quality assessment model be finely tuned according to this input.
Advantageous embodiments of the invention are disclosed in the dependent claims, the following description and the figures.
Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in
a) Learning-based EC Artifacts Assessment Modeling module and separate Target Video Quality Assessment module;
b) Learning-based EC Artifacts Assessment Modeling module with integrated Target Video Quality Assessment module;
A decoder-adaptive EC artifacts assessment solution as implemented in a device for generating a training data set, according to various embodiments of the invention, is illustrated in
In the embodiment shown in
The PIS's are provided to a Quality Assessment module 203a-203c, which performs a quality assessment of the extracted frames and derives a numeric quality score NQS (e.g. mean opinion score, MOS) for each PIS. For this purpose, it uses an automatic or subjective quality assessment model (such as e.g. the FR image quality assessment method known from [1] or [2]), as described below. A PIS together with its numeric quality score NQS forms a sample of a training data set TDS, which is then provided to a Learning-based Error Concealment (EC) Artifacts Assessment Modeling module 204. The training data set TDS comprises several, typically up to several hundred, such samples.
Further, the CFE module 201 provides data to a Typical Feature Calculation (TFC) module 202, which calculates typical features of the PIS's, i.e. the frames that are extracted in the CFE module 201. For example, the CFE module 201 indicates to the TFC module 202 which of the frames is a PIS, and other information. More details on the features are described below. The calculated typical features TF are also provided to the Learning-based EC Artifacts Assessment Modeling module 204.
The Learning-based EC Artifacts Assessment Modeling (LEAAM) module 204 may store the samples of the training data set TDS in a storage S and creates, adapts and/or—in some embodiments—applies a learning-based EC artifacts assessment model, based on the training data set. In one embodiment described below, the LEAAM module 204 operates only on the training data set in order to obtain an optimized model, which can be defined by optimized model coefficients. In another embodiment, the LEAAM module 204 operates also on the actual video to be assessed. One or more template models that can be parameterized using the obtained optimized coefficients may be available to the LEAAM module 204. The optimized model, or its coefficients respectively, can also be stored in the storage S or in another, different storage (not shown), and can be applied to an actual video to be assessed either within the LEAAM module 204 or in a separate Target Video Quality Assessment module 205 described below. Such separate Target Video Quality Assessment module, e.g. implemented in a processor, may access the stored optimized model or model coefficients that are adapted in the LEAAM module 204.
In the following, more details on the above-mentioned blocks are provided.
Concealed Frame Extraction 201
The Concealed Frame Extraction (CFE) module 201 performs at least full decoding and error concealment (EC) of frames that have lost packets, but that refer to (i.e. are predicted from) correctly received prediction references, so that they have no propagated artifacts from their prediction references. These are so-called Processed Image Samples (PIS's). The CFE module 201 also decodes their prediction references, since they are necessary for decoding the PIS's. In one embodiment, also frames that are necessary for EC of the PIS's are decoded. Further, the CFE module 201 provides at its output the decoded and error concealed PIS's at least to the Quality Assessment Module 203a-203c. In one embodiment, the CFE module 201 extracts and processes only predicted frames (i.e. frames that were decoded using prediction). In some simple decoders, no error concealment strategy is implemented at all and the lost data is left empty (pixels are grey). In this case, the PIS is the target frame after full decoding, and “no error concealment” is regarded as a special case of error concealment strategy.
Taking an ITU-T Rec. H.264 standard coded bitstream as an example, the “slice_type” and “frame_num” fields of the slice header syntax and the “max_num_ref_frames” field of the sequence parameter set syntax are parsed 62 to identify one or more frames having only EC artifacts. Then, the frames having only EC artifacts are fully decoded in the EC Video Decoder 65. Full decoding includes at least the inverse integer DCT (IDCT) and motion compensation 63, in addition to syntax parsing 62. For obtaining a target extracted frame (e.g. frame n in
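The frame selection performed by the CFE module can be sketched as follows. This is a minimal illustration assuming that per-frame information (frame number, intra/inter type, detected packet loss, and the list of reference frames) has already been gathered from the parsed syntax; the `FrameInfo` record and the function name are hypothetical and not part of the H.264 syntax.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FrameInfo:
    """Illustrative per-frame record derived from parsed slice headers."""
    frame_num: int
    is_intra: bool                    # derived from slice_type
    has_lost_packets: bool            # detected e.g. from missing NAL units
    reference_frame_nums: List[int]   # frames this frame predicts from

def find_processed_image_samples(frames: List[FrameInfo]) -> List[FrameInfo]:
    """Select frames that lost packets but whose prediction references were
    received intact, so their artifacts stem from error concealment only."""
    lossy = {f.frame_num for f in frames if f.has_lost_packets}
    pis = []
    for f in frames:
        if not f.has_lost_packets:
            continue
        if all(r not in lossy for r in f.reference_frame_nums):
            pis.append(f)
    return pis
```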
Typical Feature Calculation 202
The Typical Feature Calculation module 202 calculates typical features for each frame extracted in the CFE module 201, including so-called effectiveness features or local features, which are calculated at a local level around a lost MB, and condition features, which are calculated at frame level. Effectiveness features are e.g. some or all from the group of spatial motion homogeneity, temporal motion consistence, texture smoothness, and the probabilities of one or more special encoding modes, such as spatial uniformity of motion, temporal uniformity of motion, InterSkipModeRatio and InterDirectModeRatio. The condition features comprise e.g. some or all of Frame Type, ratio of intra-coded MBs or IntraMBsRatio (i.e. number of Intra-coded MBs divided by number of Inter-coded MBs), Motion Index and Texture Index. Condition features are global features of each frame of the training data set. As described in the co-pending patent application [3], the features will be used for emulating a decision process for determining an EC strategy employed by a decoder, i.e. which type of EC method to use.
In one embodiment, a motion index for partially lost P- or B-frames is calculated by averaging the motion vector lengths of the received MBs of the frame, according to
MotionIndex(n) = average{ |mv(n,i,j)| , (i,j) ∈ all received MBs of the frame }
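A minimal sketch of this averaging, assuming the motion vectors of the received MBs have already been extracted from the bitstream (and, where applicable, normalized by their reference-frame distance as described further below):

```python
import math

def motion_index(received_mvs):
    """MotionIndex(n): average motion-vector length over the received MBs of a
    partially lost P- or B-frame. `received_mvs` is a list of (mvx, mvy) pairs."""
    if not received_mvs:
        return 0.0
    return sum(math.hypot(mvx, mvy) for mvx, mvy in received_mvs) / len(received_mvs)
```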
In one embodiment, texture smoothness is obtained from a ratio between DC coefficients and all (DC+AC) coefficients of the MBs that are adjacent to a lost MB. In one embodiment, a texture index is calculated using the texture smoothness values of the MBs that are adjacent to a lost MB and of the lost MBs themselves (the so-called interested MBs), e.g. using the average of the texture smoothness values of these MBs according to
where K is the total number of the interested MBs, and k is the index of an interested MB. The larger the TextureIndex value is, the richer is the texture of the frame. In one embodiment, the texture smoothness is obtained from DCT coefficients of adjacent MBs, e.g. the ratio of DC coefficient energy to the DC+AC coefficient energy, using DCT coefficients of MBs adjacent to a lost MB.
In one embodiment, texture smoothness is calculated according to the following method. For an I-frame that serves as a reference, the texture smoothness of a correctly received MB is calculated using its DCT coefficients according to
k is an index of the DCT coefficients, so that k=0 refers to the DC component; M is the size of the DCT transform; T is a threshold ranging from 0 to 1 that is set empirically according to the dataset (e.g. T=0.8). In H.264, the DCT transform can be of size 16×16, 8×8 or 4×4. If the DCT transform is of size 8×8 (or 4×4), in one method the above equation is applied to the 4 (or 16) basic DCT transform units of the MB individually, and the texture smoothness of the MB is then the average of the texture smoothness values of the 4 (or 16) basic DCT transform units. In another method, for the 4×4 DCT transform, a 4×4 Hadamard transform is applied to the 16 4×4 arrays composed of the same components of the 16 basic 4×4 DCT coefficient units; for the 8×8 DCT transform, a Haar transform is applied to the 64 2×2 arrays composed of the same components of the 4 basic 8×8 DCT coefficient units. In this way 256 coefficients are obtained, regardless of the DCT transform size used by the MB. Then the above equation is used to calculate the texture smoothness of the MB.
Then, for an inter-predicted frame (P- or B-frame) with MB loss, the texture smoothness of a correct MB is calculated according to the above-described smoothness calculation equation, and the texture smoothness of a lost MB is calculated as the median value of those of its neighbor MBs (if they exist) as described above, or equals that of the collocated MB of the previous frame. E.g., in one embodiment, if the motion activity of the current MB (e.g. the above defined spatial homogeneity or motion magnitude) equals zero or the MB has no prediction residual (e.g., skip mode, or DCT coefficients of the prediction residual equal zero), then the texture smoothness of the MB equals that of the collocated MB in the previous frame. Otherwise, the texture smoothness of a correct MB is calculated according to the above-described smoothness calculation equation, and the texture smoothness of a lost MB is calculated as the median value of those of its neighbor MBs (if they exist), or equals that of the collocated MB of the previous frame. The basic idea behind the equation for texture smoothness is that, if the texture is smooth, most of the energy is concentrated in the DC component of the DCT coefficients; on the other hand, for a high-activity MB, the more textured the MB is, the more uniformly the energy of the MB is distributed over the different AC components of the DCT.
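The following sketch illustrates the simplified DC-energy-ratio variant mentioned above (ratio of DC energy to total DC+AC energy) together with the median/collocated fallback for lost MBs. The thresholded formulation with T is not reproduced here, and the helper names are illustrative only.

```python
import statistics

def texture_smoothness(dct_coeffs):
    """Simplified variant: ratio of DC-coefficient energy to total (DC+AC)
    coefficient energy of one transform unit. `dct_coeffs` is a flat list with
    the DC component at index 0; values near 1 indicate smooth texture."""
    total_energy = sum(c * c for c in dct_coeffs)
    if total_energy == 0.0:
        return 1.0  # no residual energy at all: treat as perfectly smooth
    return (dct_coeffs[0] ** 2) / total_energy

def texture_smoothness_of_mb(transform_units):
    """Average over the 4 (8x8) or 16 (4x4) basic transform units of a
    macroblock, per the averaging option described above."""
    return sum(texture_smoothness(u) for u in transform_units) / len(transform_units)

def texture_smoothness_of_lost_mb(neighbor_values, collocated_value):
    """A lost MB inherits the median smoothness of its available neighbor MBs,
    falling back to the collocated MB of the previous frame."""
    return statistics.median(neighbor_values) if neighbor_values else collocated_value
```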
In one embodiment, the InterSkipModeRatio, which is a probability of inter skip_mode, is calculated using the following method:
Skip mode in H.264 means that no further data is present for the MB in the bitstream.
In one embodiment, the InterDirectModeRatio, which is a probability of inter_direct_mode, is calculated using the following method:
Direct mode in H.264 means that no MV differences or reference indices are present for the MB. The blocks in the previous two equations refer to the 4×4-sized blocks of the neighboring MBs of the lost MB, regardless of whether the MB is partitioned into smaller blocks or not.
The above two features InterSkipModeRatio and InterDirectModeRatio may be used separately or together, e.g. added-up. Generally, if a MB is predicted using Skip mode or Direct mode in H.264, its motion can be predicted well from the motion of its spatial or temporal neighbor MBs. Therefore, if this type of MB is lost, it can be concealed with less visible artifacts if temporal EC approaches are applied to recover the missing pixels.
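Since the two equations themselves are not reproduced above, the sketch below assumes the straightforward reading of the definition: each ratio is the fraction of the 4×4 blocks of the neighboring MBs that are coded in skip mode (respectively direct mode).

```python
def mode_ratio(neighbor_block_modes, mode):
    """Assumed form: fraction of the 4x4 blocks of the MBs neighboring a lost MB
    that are coded in the given mode ('skip' or 'direct')."""
    if not neighbor_block_modes:
        return 0.0
    return sum(1 for m in neighbor_block_modes if m == mode) / len(neighbor_block_modes)

# InterSkipModeRatio and InterDirectModeRatio, used separately or added up:
# skip_ratio = mode_ratio(modes, 'skip')
# direct_ratio = mode_ratio(modes, 'direct')
# combined = skip_ratio + direct_ratio
```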
Motion homogeneity may refer to spatial motion uniformity, and motion consistence to temporal motion uniformity. In the following, a frame index is denoted as n and the coordinate of a MB in the frame as (i,j). For a lost MB (i,j) in frame n, the condition features for the frame n and the local features for the MB (i,j) are calculated.
For calculating spatial MV homogeneity, in one embodiment, two separate parameters for spatial uniformity are calculated, one in the x direction and one in the y direction, according to
As long as any of the eight MBs around a lost MB (n,i,j) is received or recovered, its motion vector, if existing, is used to calculate the spatial MV homogeneity. If there is no available neighbor MB, the spatial MV uniformity is set to that of the collocated MB in the previous reference frame (i.e. P-frame or reference B-frame in hierarchical H.264 coding).
For an H.264 video encoder, one MB may be partitioned into sub-blocks for motion estimation. Thus, in the case of an H.264 encoder, the sixteen motion vectors of the 4×4-sized blocks of a MB may be used in the above equation instead of one motion vector per MB. Each motion vector is normalized by the distance from the current frame to the corresponding reference frame. This practice is applied also in the following calculations that involve the manipulation of motion vectors. The smaller the standard deviation of the neighbor MVs is, the more homogeneous is the motion of these MBs. In turn, the more likely it is that the lost MB can be concealed without visible artifacts if a certain type of motion-estimation based temporal EC method is applied. This feature is applicable to lost MBs of inter-predicted frames like P-frames and B-frames. For B-frames, there may be two motion fields, forward and backward. Spatial uniformity is then calculated for the two directions respectively.
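A minimal sketch of this feature, assuming the elided equation computes, per component, the standard deviation of the motion vectors of the available surrounding MBs (or of their 4×4 blocks for H.264), each already normalized by its reference-frame distance:

```python
import statistics

def spatial_mv_homogeneity(neighbor_mvs):
    """Assumed form: per-component standard deviation of the MVs of the received
    or recovered MBs (or 4x4 blocks) around a lost MB. Returns None if no
    neighbor MV is available, in which case the caller falls back to the
    collocated MB of the previous reference frame."""
    if not neighbor_mvs:
        return None
    xs = [mv[0] for mv in neighbor_mvs]
    ys = [mv[1] for mv in neighbor_mvs]
    return statistics.pstdev(xs), statistics.pstdev(ys)
```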
For calculating temporal MV uniformity, in one embodiment, two separate parameters for temporal uniformity are calculated in x direction and in y direction according to
so that the temporal MV uniformity is calculated as the standard deviation of the motion difference between the collocated MBs in adjacent frames. The smaller the standard deviation is, the more uniform is the motion of these MBs along the temporal axis, and in turn, the more likely it is that the lost MB can be concealed without visible artifacts if a motion-projection based temporal EC method is applied. This feature is applicable to lost MBs of both Intra frames (e.g. I_frames) and inter-predicted frames (e.g. P_frames and/or B_frames).
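Analogously to the spatial case, a sketch under the assumption that the elided equation takes, per component, the standard deviation of the MV differences between collocated MBs of the adjacent frames over the neighborhood of the lost MB:

```python
import statistics

def temporal_mv_uniformity(collocated_mvs_prev, collocated_mvs_next):
    """Assumed form: per-component standard deviation of the differences between
    the MVs of collocated MBs in the adjacent frames (e.g. n-1 and n+1), taken
    over the neighborhood of the lost MB."""
    if not collocated_mvs_prev or not collocated_mvs_next:
        return None
    dx = [a[0] - b[0] for a, b in zip(collocated_mvs_prev, collocated_mvs_next)]
    dy = [a[1] - b[1] for a, b in zip(collocated_mvs_prev, collocated_mvs_next)]
    return statistics.pstdev(dx), statistics.pstdev(dy)
```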
If one of the adjacent frames (e.g., frame n+1) is an Intra frame where there is no MV available in the coded bitstream, the MVs of the spatially adjacent MBs (i.e. (n, i±1, j±1)) of the lost MB and those of the temporally adjacent MBs of an inter-predicted frame (i.e. frame n−1 and/or n+1) are used to calculate temporal MV uniformity. That is,
The MV magnitude is calculated as follows. For a simple zero-motion-copy based EC scheme, the larger the MV magnitude is, the more probable it is that the loss artifact is visible. Therefore, in one embodiment, the average of the motion vectors of the neighbor MBs and of the current MB (if not lost) is calculated. That is,
In another embodiment, the magnitude of the median value of the motion vectors of neighbor MBs is used as the motion magnitude of the lost current MB. If the lost current MB has no neighbor MBs, the motion magnitude of the lost current MB is set to that of the collocated MB in the previous frame.
The typical features TF calculated/extracted in the Typical Feature Calculation module 202 can be represented by any values, e.g. numerical or textual (alpha-numerical) values, and they are provided to the LEAAM module 204.
Quality Assessment 203
The Quality Assessment Module 203a-203c can utilize any existing automatic image quality assessment method (or automatic quality assessment model) or subjective image quality assessment method (or subjective quality assessment model). E.g., a full-reference (FR) image quality assessment method known from [1] can be used to obtain the numeric quality score NQS of the extracted frames or pictures, as shown in
Similar to the embodiment shown in
An advantage of the above-described embodiments shown in
The VQM model can be embedded e.g. in a set-top box (STB) at a user's home network.
Learning-Based EC Artifacts Assessment Modeling 204
The Learning-based EC Artifacts Assessment Modeling (LEAAM) module 204 receives values of the calculated/extracted features TF from the Typical Features Calculation module 202, and it receives the samples of the training data set TDS, i.e. each PIS and its related numerical quality score (NQS), from the Quality Assessment module 203. The NQS received from the Quality Assessment module 203 serves as reference NQS. In one embodiment, the LEAAM module 204 creates a learning-based EC artifacts assessment model, based on the training data set. In another embodiment, it adapts an existing pre-defined learning-based EC artifacts assessment model based on the training data set. At least in the latter embodiment, model coefficients for a fixed model are determined by the LEAAM module 204. The module generates or adapts parameters or coefficients for an optimized EC artifacts assessment model and stores them in a storage S. It may also store the received samples of the training data set TDS in the storage S, e.g. for later re-evaluation or re-optimization. Further, the received Typical Feature values TF are stored by the LEAAM module.
In one embodiment, the stored data are structured in a data base such that for each PIS its NQS and the values representing its typical features form a data set. The storage may be within the LEAAM module 204 or within a separate storage S.
Further, in one embodiment the LEAAM module 204 comprises an Output unit 2048 that provides the optimal VQM model or set of VQM model coefficients to subsequent modules (not shown) for video quality assessment of target videos.
The model coefficients and/or the optimized model that are obtained in the LEAAM module 204 during at least a first training phase can be applied to an actual video to be assessed. In one embodiment, a device for automatically adapting a Video Quality Model (VQM) to a video decoder and a device for assessing video quality, which uses the VQM, are integrated together in a product, such as a set-top box (STB). In principle, typical features of the actual video to be assessed are calculated and extracted in the same way as for the training data set. The extracted typical features are then compared with the stored training data base as described below, a best-matching condition feature is determined, and parameters or coefficients for the VQM model according to the best-matching condition feature are selected. These parameters or coefficients define, from among the available trained VQM models, an optimal VQM model for the actual video to be assessed. The optimal VQM model is applied to the actual video to be assessed in a Target Video Quality Assessment (TVQA) module 205, as shown in
In the LEAAM module 204, statistical learning methods may be used to implement the adaptive EC artifacts assessment model. E.g., the LEAAM module may implement the method disclosed in the co-pending patent application [3], i.e. using the above-mentioned condition features to determine which type of EC method to use, and using the local features as parameters of the determined type of EC method. In one embodiment, all the condition features and local features are put into an artificial neural network (ANN) for obtaining the optimal model. Another embodiment, which is an example implementation of this part of the LEAAM module 204, is described in the following.
For condition feature Frame Type, the calculation of the EC artifacts level is
For condition feature IntraMBsRatio, the calculation of the EC artifacts level is
For the combination of the condition features MotionIndex and TextureIndex, the calculation of the EC artifacts level is
T1 and T2 are thresholds that can be determined e.g. by adaptive learning. An advantage of utilizing the piece-wise function form in Eqs. (1-3) is that the decoder may adopt a more advanced EC strategy by choosing a type of EC approach (i.e., spatial EC or temporal EC) according to certain conditions for each portion. If the decoder only adopts one type of EC approach by setting a=e, b=f, c=g, and d=h, the piece-wise function also works, but is less adaptive and therefore may have slightly worse results.
For example, the above-mentioned effectiveness features motionUniformity, textureSmoothness, InterSkipModeRatio and InterDirectModeRatio and the above-mentioned condition features Frame Type, IntraMBsRatio, MotionIndex and TextureIndex are calculated as numerical values in the Typical Feature Calculation module 202 and stored in the storage S for each of the training images (i.e. PIS's), and for each video frame to be assessed. For training the VQM model, the feature values are stored together with the quality score NQS of the training image, which is obtained in the Quality Assessment module 203. For assessing the quality of a target video frame, the calculations according to equations (1)-(3) are performed using the features of the target video frame, with the parameters a1, ..., h3 obtained during the model training.
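Since equations (1)-(3) are not reproduced above, the sketch below only illustrates the general piece-wise form they describe: a condition feature compared against a threshold (T1 or T2) selects a spatial-EC or a temporal-EC branch, and each branch combines the local effectiveness features with its own coefficient set (a..d or e..h). The linear combination used here is an assumed functional form, not the patent's exact equations.

```python
def candidate_ec_artifact_level(condition_value, threshold, local_features, coeffs):
    """Hedged sketch of one piece-wise candidate model in the spirit of Eqs. (1)-(3).

    local_features: (motionUniformity, textureSmoothness,
                     InterSkipModeRatio, InterDirectModeRatio)
    coeffs: {'spatial': (a, b, c, d), 'temporal': (e, f, g, h)}
    For the categorical condition feature Frame Type, the branch would be chosen
    on intra vs. inter instead of on a numeric threshold.
    """
    branch = 'spatial' if condition_value >= threshold else 'temporal'
    c0, c1, c2, c3 = coeffs[branch]
    mu, ts, skip_ratio, direct_ratio = local_features
    # Assumed linear combination of the local effectiveness features
    return c0 * mu + c1 * ts + c2 * skip_ratio + c3 * direct_ratio
```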
The calculation of a texture index may be based on any known texture analysis method, e.g. comparing a DC coefficient and/or selected AC coefficients with thresholds.
In principle it is sufficient to use any two or more of the condition (i.e. global) features, and any two or more of the effectiveness (i.e. local) features. The more features are used, the better will be the result.
The selection according to the correlation (PC coefficient value) can be summarized as in
A correlation is optimized if the correlation coefficient v is at its maximum, so that the results of the optimal candidate VQMM and the reference quality values converge as much as possible. In other words, the optimal candidate VQMM emulates the actual behavior of the target video decoder and EC method best. Tab.1 shows exemplary values of the first three training frames of
In another embodiment, some or all coefficients are pre-defined. Then, a correlation between each candidate numeric quality score value and the reference numeric quality score value is calculated by regression analysis. Tab.2 shows an intermediate result within the LEAAM module 204, comprising a plurality of correlation values v1, v2, v3 and related optimized coefficients of three candidate VQM models, namely Frame Type, IntraMBsRatio and k·MotionIndex+TextureIndex. E.g. if v1>v2 and v1>v3, then Frame Type is the optimal condition feature and the coefficients a1, ..., d1 or e1, ..., h1 are used for the model, depending on the current condition feature (in this case the frame type).
Thus, the LEAAM module 204 determines from the plurality of VQM models (or VQM model coefficient sets) a best VQM model (or best VQM model coefficient set) that optimally matches the result of the first quality assessment, wherein, for each of the decoded extracted frames, the results of the plurality of candidate VQM models are matched with the result of the reference VQM model, and wherein an optimal VQM model (or set of VQM model coefficients) is obtained.
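A sketch of this selection step, assuming the candidate models have already been fitted (e.g. by the regression analysis mentioned above) and using the Pearson correlation between each candidate's scores and the reference NQS values as the matching criterion; the function and variable names are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between candidate scores and reference NQS values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def select_best_candidate(candidates, training_features, reference_nqs):
    """For each fitted candidate VQM model (a callable mapping the typical
    features of one training frame to a quality score), correlate its scores
    with the reference NQS values and keep the best-correlated candidate."""
    best_name, best_v, best_model = None, -1.0, None
    for name, model in candidates.items():
        scores = [model(features) for features in training_features]
        v = pearson(scores, reference_nqs)
        if v > best_v:
            best_name, best_v, best_model = name, v, model
    return best_name, best_v, best_model
```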
Returning to
The extracted frames are decoded and Error Concealment is performed 94. A first quality assessment 95 of the decoded extracted frames is performed, using a Reference Video Quality Measuring model. A second quality assessment 93 of the extracted frames is performed as described above, i.e. by using, for each of the decoded extracted frames, a plurality of different candidate Video Quality Measuring models or a plurality of different candidate coefficient sets for at least one given Video Quality Measuring model, wherein at least some of the calculated typical features are used.
From the plurality of Video Quality Measuring models or Video Quality Measuring model coefficient sets, a best Video Quality Measuring model or best Video Quality Measuring model coefficient set, is determined 96 that optimally matches the result of the first quality assessment, wherein, for each of the decoded extracted frames, the results of the plurality of candidate Video Quality Measuring models are matched 962 with the result (i.e. NQS) of the reference Video Quality Measuring model and wherein an optimal Video Quality Measuring model or set of Video Quality Measuring model coefficients is obtained, as also shown in
Details of embodiments of the second quality assessment module 93 and the determining module 96 for determining the best Video Quality Measuring model or best Video Quality Measuring model coefficient set, i.e. the one that optimally matches the result of the first quality assessment, are also shown in
The feature combination module 141 enumerates the possible combinations of the condition features and local features, e.g. those of equations (1)-(3) above. These can also be complemented by other, further features and their relationships. In one embodiment, the correlation module 142 performs multiple regression analysis for each of the enumerated combinations (e.g. equations (1)-(3)) in order to fit the equation to the training data set and obtain the coefficient set that fits best, e.g. by calculating the corresponding Pearson Correlation values v1, v2, v3. In one embodiment, the selection module (within the second quality assessment module 143) selects the best fitting equation from the equations (1)-(3), i.e. the one that results in the highest PC value, as the optimal model (or model coefficient set, respectively). The extracted frames are decoded and Error Concealment is performed 144. In the first quality assessment module 145, a first quality assessment of the decoded extracted frames is performed, using a Reference Video Quality Measuring model. In the second quality assessment module 143, a second quality assessment of the extracted frames is performed as described above, i.e. by using, for each of the decoded extracted frames, a plurality of different candidate Video Quality Measuring models or a plurality of different candidate coefficient sets for at least one given Video Quality Measuring model, wherein at least some of the calculated typical features are used.
Details of embodiments of the second quality assessment module 143 and the determining module 146 for determining the best Video Quality Measuring model or best Video Quality Measuring model coefficient set, i.e. the one that optimally matches the result of the first quality assessment, are also shown in
This embodiment of the second quality assessment module 143 comprises a selection unit 1431 for selecting a current candidate Video Quality Measuring model or a current candidate coefficient set for a given Video Quality Measuring model, an application module 1432 for applying the current candidate Video Quality Measuring model or current candidate coefficient set to each of the decoded extracted frames (using at least some of the calculated typical features), comparing the result with previous results and storing the best one, and a determining unit 1433 for determining if more candidate VQM models or candidate coefficient sets are available. The depicted embodiment of the determining module 146 for determining the best Video Quality Measuring model or best Video Quality Measuring model coefficient set comprises a selection unit 1461 for selecting from the plurality of VQM models or VQM model coefficient sets a current VQM model or VQM model coefficient set, a matching and selection module 1462 for matching (for each of the decoded extracted frames) the current candidate VQM model with the reference VQM model, selecting an optimal VQM model or set of VQM model coefficients (either the best previous or the current one) and storing it, and a determining unit 1463 for determining if more VQM models or VQM model coefficient sets exist.
In one embodiment, the LEAAM module 204 uses a single fixed template model and determines the model coefficients that optimize the template model. In one embodiment, the LEAAM module 204 can select one of a plurality of template models. In one embodiment, the template model is a default model that can also be used without being optimized; however, the optimization improves the model.
An advantage of the described extraction/calculation of global condition features from an image of the training data set and the local effectiveness features is that they make the model more sensitive to channel artifacts than to compression artifacts. Thus, the model focuses on channel artifacts and depends less on different levels of compression errors. The calculated EC effectiveness level is provided as an estimated visible artifacts level of video quality.
Advantageously, the used features are based on data that can be extracted from the coded video at bitstream-level, i.e. without decoding the bitstream to the pixel domain.
In
In one embodiment, a flow-chart of a method for adapting a VQM to a given decoding and EC method is shown in
The typical features TF of the extracted frames can be calculated before their full decoding and EC. In one embodiment, the typical features TF of the extracted frames are calculated from un-decoded extracted frames. In another embodiment, the typical features TF are calculated from partially decoded extracted frames. In one embodiment, the partial decoding reveals at least one of Frame Type, IntraMBsRatio, MotionIndex and TextureIndex, as well as motionUniformity, textureSmoothness, InterSkipModeRatio and InterDirectModeRatio, according to the above definitions.
Further, in one embodiment as shown in the flow-chart, the method comprises steps of selecting training data frames of a predefined type,
analyzing 1302 predefined typical features of the selected training data frames, decoding 65 the training data frames using the video decoder, wherein the decoding comprises at least error concealment 64,
measuring or estimating a reference video quality metric (measure) for each of the decoded and error concealed training data frames using a reference video quality measurement 1303,
for each of the selected training data frames, calculating 1304 from the analyzed typical features a plurality of candidate video quality measurement measures, wherein for each of the selected training data frames a plurality of different predefined candidate video quality measurement models or candidate sets of video quality measurement coefficients of a given video quality measurement model are used,
storing, for each of the selected training data frames, the plurality of candidate video quality measurement models or candidate sets of video quality measurement coefficients and their calculated candidate video quality measurement measures x1, ..., x3,
determining, from the plurality of candidate video quality measurement models or candidate sets of video quality measurement coefficients, an optimal video quality measurement model or optimal set of video quality measurement coefficients in an adaptive learning process 1304, wherein for each of the selected training data frames the stored candidate video quality measurement measures are compared and matched with the reference video quality measure and a best-matching candidate video quality measurement measure is determined, and
storing the video quality measurement coefficients or the video quality measurement model of the optimal video quality measurement measure.
An advantage of the present invention is that it enables the VQM model to learn the EC effects without having to know and emulate the EC strategy employed in decoder. Therefore, the VQM model can automatically adapt to various real-world decoder implementations.
VQM is used herein as an acronym for Video Quality Modeling, Video Quality Measurement or Video Quality Measuring, which are considered as equivalents.
While there has been shown, described, and pointed out fundamental novel features of the present invention as applied to preferred embodiments thereof, it will be understood that various omissions, substitutions and changes in the apparatus and method described, in the form and details of the devices disclosed, and in their operation, may be made by those skilled in the art without departing from the present invention. Although all candidate VQM models in the described embodiments use the same set of typical features, there may exist cases where one or more of the candidate VQM models use fewer or different typical features than other candidate VQM models.
Further, it is expressly intended that all combinations of those elements that perform substantially the same function in substantially the same way to achieve the same results are within the scope of the invention. Substitutions of elements from one described embodiment to another are also fully intended and contemplated. It will be understood that the present invention has been described purely by way of example, and modifications of detail can be made without departing from the scope of the invention.
Each feature disclosed in the description and (where appropriate) the claims and drawings may be provided independently or in any appropriate combination. Features may, where appropriate be implemented in hardware, software, or a combination of the two. Reference numerals appearing in the claims are by way of illustration only and shall have no limiting effect on the scope of the claims.