This invention relates to digital image quality assessment, and in particular to quality assessment of a distorted image as compared to a pristine image.
Objective image quality assessment (IQA), which aims to automatically quantify the extent of the distortions corrupting the images, provides the quality monitoring criteria or optimization goal in numerous vision-centric systems [1], [2], [3], [4], [5], [6], [7], [8]. Generally speaking, according to the availability of the reference images, existing IQA methods fall into three categories: full-reference IQA (FR-IQA) [9], [10], [11], reduced-reference IQA (RR-IQA) [12], [13], [14], and no-reference IQA (NR-IQA) [15], [16], [17], [18], [19], [20], [21], [22].
Existing FR-IQA methods can be historically divided into two categories according to top-down and bottom-up design philosophies, where the former attempts to model the overall functionalities of human visual system (HVS) with certain hypotheses, and the latter aims to simulate the processing stages in the visual pathway of the HVS. One mild assumption in FR-IQA is that the distorted image is generated from the reference image which is of pristine quality, such that the perceptual fidelity or similarity can be quantified. Therefore, it is natural to treat the given pristine-quality image as the reference in producing a quantitative score that quantifies the degree of fidelity/similarity.
The following references are referred to throughout this specification, as indicated by the numbered brackets:
The invention in one aspect provides an image quality assessment (IQA) method, which includes the steps of providing a pristine image as well as a distorted image related to the pristine image, constructing an equal-quality space of the pristine image at the feature level, finding, within the equal-quality space, a best reference of a distorted feature of the distorted image, and constructing a pseudo-reference feature of the distorted feature.
In some embodiments, the step of constructing an equal-quality space of the pristine image further includes estimating a near-threshold map of a feature extracted from the pristine image, and constructing the equal-quality space under a guidance of the near-threshold map.
In some embodiments, the step of estimating a near-threshold map of a feature extracted from the pristine image, further includes predicting the near-threshold map based on a global spatial correlation map and a local spatial correlation map.
In some embodiments, the method further includes, before the step of predicting the near-threshold map, the steps of calculating a global standard deviation of the feature extracted from the pristine image, calculating a local standard deviation of the feature extracted from the pristine image; and generating the global and local spatial correlation maps based on the global and local standard deviations.
In some embodiments, the step of finding the best reference of the distorted feature further includes locating the best reference of the distorted feature within the equal-quality space in an element-wise minimum distance search manner.
In some embodiments, the method further contains a step of optimizing the constructed equal-quality space using at least one of a quality regression loss, a disturbance maximization loss and a content loss.
In some embodiments, the step of optimizing the constructed equal-quality space uses all of the quality regression loss, the disturbance maximization loss and the content loss.
In some embodiments, in the step of constructing an equal-quality space of the pristine image, the equal-quality space is constructed using a pre-trained artificial neural network.
In some embodiments, the step of finding the best reference of the distorted feature is performed at every layer of the artificial neural network.
In some embodiments, the method further includes a step of predicting a quality score based on the distorted feature and the pseudo-reference feature.
According to another aspect of the invention, there is provided a non-transitory computer-readable recording medium having computer instructions recorded thereon, the computer instructions, when executed on one or more processors, causing the one or more processors to perform operations according to the image quality assessment (IQA) method as mentioned above.
According to a further aspect of the invention, there is provided a computing system that includes one or more processors, and a memory containing instructions that, when executed by the one or more processors, cause the computing system to perform operations according to the image quality assessment (IQA) method as mentioned above.
Embodiments of the invention therefore provide FR-IQA methods involving flexible reference selection. Such methods are dedicated to generating the reference feature by finding the best explanation of the distorted feature within an equal-quality space constructed based on a given pristine feature. Even without the ground-truth reference for distorted images with various distortion types, the pseudo-reference feature learning can be optimized.
The foregoing summary is neither intended to define the invention of the application, which is measured by the claims, nor is it intended to be limiting as to the scope of the invention in any way.
The foregoing and further features of the present invention will be apparent from the following description of embodiments which are provided by way of example only in connection with the accompanying figures, of which:
In contrast, the FLRE paradigm in the first embodiment is developed in the feature space by attempting to obtain the feature-level reference of the distorted image via the selection of its corresponding best explanation within an equal-quality space, enabling the freedom in reference image selection for distorted images. To this end, the FLRE in
The PNTE module 22 and the PRS module 24 are configured in the FLRE to perform the PNTE and PRS strategies, respectively. In particular, the PNTE module predicts the equal-quality map of a given pristine-quality feature, forming an equal-quality space. Subsequently, the PRS strategy is employed to locate the reference of the distorted feature within the equal-quality space in an element-wise minimum distance search manner. Due to the lack of the ground-truth reference (i.e., best explanation) of each distorted image, the pseudo-reference feature learning is optimized under three constraints, i.e., the quality regression loss, the disturbance maximization loss, and the content loss. The FLRE is implemented as a plug-in module before the deterministic FR-IQA process, and experimental results that will be described later demonstrate that combining the FLRE with existing deep feature-based FR-IQA models can significantly improve quality prediction performance, largely surpassing the state-of-the-art methods. Details of the workflow of the FLRE are discussed below.
Given the pristine image Iref and the distorted image Idist as shown in
Next, the design process of the FLRE will be described. As mentioned above, the new FR-IQA paradigm in
Based on the near-threshold characteristics of the HVS, the PNTE module 22 is devised to estimate the near-threshold map of the feature extracted from a given pristine-quality image, thereby constructing an equal-quality space under the guidance of the near-threshold map. As shown in
where Fref(s)∈ℝ^(Cs×Hs×Ws) denotes the feature representation of Iref at the s-th layer of the VGG space. Cs and Hs×Ws represent the channel and spatial dimensions of the feature representation at the s-th layer, respectively. S is the total number of layers. In this embodiment, feature representations from conv1_2, conv2_2, conv3_3, conv4_3, and conv5_3 in the VGG16 network 20 are adopted. Visual perception thresholds are determined by interactions or interference among stimuli. Herein, the spatial correlation in the feature map is taken into account. In particular, the global and local standard deviations of Fref(1) are calculated to obtain the global and local spatial correlation maps (denoted as Mg(1) and Ml(1), respectively), where the size of the sliding window in the local standard deviation calculation is 3×3 with a stride of 1. Subsequently, the near-threshold map is predicted with the aid of the global and local spatial correlation maps. As shown in
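The global and local spatial correlation maps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the (C, H, W) layout and the edge-padding choice used to keep the spatial size under the 3×3, stride-1 window are assumptions.

```python
import numpy as np

def spatial_correlation_maps(feat, window=3):
    """Global and local standard-deviation maps of a (C, H, W) feature."""
    c, h, w = feat.shape
    # Global std: one value per channel, broadcast over the spatial grid.
    m_g = feat.reshape(c, -1).std(axis=1).reshape(c, 1, 1) * np.ones((c, h, w))
    # Local std: window x window sliding window, stride 1; edge padding
    # (an assumption) keeps the output at the original H x W size.
    pad = window // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    m_l = np.empty_like(feat)
    for i in range(h):
        for j in range(w):
            patch = padded[:, i:i + window, j:j + window]
            m_l[:, i, j] = patch.reshape(c, -1).std(axis=1)
    return m_g, m_l
```

A constant feature map yields zero in both maps, since there is no spatial variation to correlate.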
where μc and σc are computed across the spatial dimensions independently for each channel:
Then, the local and global spatial correlation maps are fed into two four-layer convolutional networks, respectively, to generate the local perceptual-based modulation parameters (denoted as γl(1) for scale and βl(1) for bias) and the global perceptual-based modulation parameters (denoted as γg(1) and βg(1)). The two sets of parameters are summed to obtain the final modulation parameters (denoted as γref(1) and βref(1)) for Fref(1),
where αγ and αβ are learnable weight parameters. The modulated feature F̃ref(1) can be generated by denormalizing
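The normalize-modulate-denormalize step can be sketched as below. Since the corresponding equations appear only in the figures, the weighted-sum fusion of the local and global parameters (with `a_gamma`/`a_beta` standing in for the learnable weights αγ and αβ) and the SPADE-style denormalization form are assumptions, not the exact patented formulation.

```python
import numpy as np

def modulate_feature(feat, gamma_l, beta_l, gamma_g, beta_g,
                     a_gamma=0.5, a_beta=0.5):
    """Normalize a (C, H, W) feature per channel, then denormalize it
    with modulation parameters fused from local and global branches."""
    c = feat.shape[0]
    mu = feat.reshape(c, -1).mean(axis=1).reshape(c, 1, 1)
    sigma = feat.reshape(c, -1).std(axis=1).reshape(c, 1, 1) + 1e-8
    normed = (feat - mu) / sigma                      # channel-wise normalization
    gamma = a_gamma * gamma_l + (1.0 - a_gamma) * gamma_g  # assumed fusion rule
    beta = a_beta * beta_l + (1.0 - a_beta) * beta_g
    return gamma * normed + beta                      # assumed denormalization
```

With unit scale and zero bias the output is simply the per-channel normalized feature, which has near-zero channel means.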
The modulated feature is fed into a 3×3 convolution with a stride of 1, generating the near-threshold map F̂ref(1) that is just at the critical point of perceptual equivalence. Subsequently, the feature representations are collected from conv2_2, conv3_3, conv4_3, and conv5_3 in the pre-trained VGG16 network 20 when the input of conv2_2 is replaced from Fref(1) to F̂ref(1). The hierarchical near-threshold feature maps (denoted as F̂ref) can be represented as follows,
where F̂ref(s)∈ℝ^(Cs×Hs×Ws) represents the near-threshold map at the s-th layer in the VGG space. S is the total number of layers. For Fref(s), its perceptual threshold map (denoted as T(s)∈ℝ^(Cs×Hs×Ws)) can be computed as follows,
where |⋅| is the absolute value operation. Therefore, numerous equal-quality features can be generated by varying the feature of the given pristine-quality image according to T(s), constructing the equal-quality space. The equal-quality maps on the bounds of the equal-quality space can be represented as follows,
where Fup(s) is the equal-quality map on the upper bound and Flow(s) is the equal-quality map on the lower bound.
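Per the description above, the threshold map is the element-wise absolute difference between the near-threshold map and the pristine feature; shifting the pristine feature by ±T(s) to form the bounds is an assumption consistent with that description (the bound equations themselves are given only in the figures).

```python
import numpy as np

def equal_quality_bounds(f_ref, f_hat):
    """Threshold map and equal-quality bounds from a pristine feature
    f_ref and its near-threshold counterpart f_hat (same shape)."""
    t = np.abs(f_hat - f_ref)   # perceptual threshold map, |.| from the text
    f_up = f_ref + t            # equal-quality map on the upper bound
    f_low = f_ref - t           # equal-quality map on the lower bound
    return f_low, f_up, t
```

Any feature lying element-wise between `f_low` and `f_up` is then treated as a member of the equal-quality space.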
The human brain can actively infer the best explanation (i.e., ideal reference) of the distorted image [68], [69], [70], [71], [72]. Based on the hypothesis that the best explanation of the distorted feature is the feature among all equal-quality features with the smallest distance to the distorted feature, the PRS strategy is developed to locate the reference of the distorted feature within the equal-quality space in an element-wise minimum distance search manner. Let Fdist(s) be the feature representation of the distorted image Idist at the s-th layer in the VGG space.
where D0(s)(h,w), D1(s)(h,w) and D2(s)(h,w) are the distances from Fref(s)(h,w), Flow(s)(h,w) and Fup(s)(h,w) to Fdist(s)(h,w), respectively. They can be calculated as follows,
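The element-wise minimum distance search of the PRS can be sketched as follows, with the absolute difference assumed as the per-element distance measure:

```python
import numpy as np

def prs_select(f_ref, f_low, f_up, f_dist):
    """For every element, pick whichever of the pristine feature and the
    two equal-quality bounds lies closest to the distorted feature."""
    cands = np.stack([f_ref, f_low, f_up])   # (3, C, H, W) candidate maps
    dists = np.abs(cands - f_dist)           # distances D0, D1, D2 per element
    winner = dists.argmin(axis=0)            # index of the closest candidate
    return np.take_along_axis(cands, winner[None], axis=0)[0]
```

For example, a distorted element near the upper bound is explained by the upper-bound value, while one near the pristine value keeps the pristine value.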
During quality prediction, the quality score Q of the distorted image is computed based on F′ref and Fdist,
where F′ref={F′ref(s); s=1, . . . , S} and Fdist={Fdist(s); s=1, . . . , S}. f is the deep feature-based FR-IQA algorithm.
The objective function is composed of three loss functions, namely the quality regression loss, the disturbance maximization loss and the content loss, denoted by ℒpred, ℒdmax and ℒctt, respectively. In particular, the disturbance maximization loss focuses on maximizing the difference between F̂ref(1) and Fref(1) such that the learned equal-quality space can contain as many equal-quality features as possible. The disturbance maximization loss is described as follows,
where ϵ is a small positive constant to avoid numerical instability when the denominator is close to zero. In this embodiment, ϵ is set to 1×10^-6. The content loss is also utilized to make sure the primary content represented by F̂ref is the same as that of Fref in the same scene. The content loss is represented as follows,
where ℓc denotes the set of VGG16 layers for computing the content loss. In this embodiment, ℓc={conv4_2}.
To keep the pseudo-reference features quality-aware, the quality regression loss is employed to minimize the mean-square error between the ground-truth quality score and the predicted quality score:
where Y is the ground-truth quality score of Idist. The objective function is defined as follows,
where λ1 and λ2 are the weighting factors of ℒdmax and ℒctt, respectively.
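A sketch of the three losses and their combination is given below. The reciprocal form of the disturbance maximization loss and the weighted-sum objective are assumptions consistent with the roles described for ϵ, λ1 and λ2 (the actual loss equations appear only in the figures); the function names are illustrative.

```python
import numpy as np

EPS = 1e-6  # the small constant epsilon from the embodiment

def dmax_loss(f_hat, f_ref):
    # Assumed reciprocal form: minimizing this loss pushes the
    # near-threshold map away from the pristine feature.
    return 1.0 / (np.mean((f_hat - f_ref) ** 2) + EPS)

def content_loss(feats_hat, feats_ref):
    # MSE over the chosen content layers (conv4_2 in the embodiment).
    return sum(np.mean((a - b) ** 2) for a, b in zip(feats_hat, feats_ref))

def pred_loss(q_pred, q_true):
    # Quality regression loss: squared error against the ground truth.
    return float((q_pred - q_true) ** 2)

def total_loss(q_pred, q_true, f_hat, f_ref, feats_hat, feats_ref,
               lam1, lam2):
    # Assumed weighted-sum objective with lambda_1 and lambda_2
    # weighting the disturbance maximization and content losses.
    return (pred_loss(q_pred, q_true)
            + lam1 * dmax_loss(f_hat, f_ref)
            + lam2 * content_loss(feats_hat, feats_ref))
```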
In the next section, the implementation details of the FLRE in
The FLRE is trained on the entire KADID-10k dataset [73] and tested on three traditional IQA datasets (LIVE [74], CSIQ [76] and TID2013 [75]) and three image restoration datasets with human-annotated scores (QADS [77], SHQR [78] and PIPAL [79]). More details are provided in Table I below.
The FLRE is implemented in PyTorch [85]. The Adam [86] optimizer is utilized with an initial learning rate of 1×10^-4 and a weight decay of 5×10^-4. The learning rate is reduced by a factor of 5 after every 5 epochs. The FLRE is trained for 50 epochs for convergence on the KADID-10k dataset, and the batch size is set to 12. In Eqn. (16), the existing deep feature-based methods can be used as f for distance measurement. In this experiment, DISTS [44], LPIPS [43] and DeepWSD [45] are deployed, where DISTS and DeepWSD are models pre-trained on the KADID-10k dataset. LPIPS is retrained on the KADID-10k dataset for a fair comparison. The input size is 256×256×3. During the training of the FLRE, the parameters of the selected FR-IQA method are fixed. When f is LPIPS, the weighting parameters λ1 and λ2 in Eqn. (20) are set to 2 and 15, respectively. When f is DISTS trained on the KADID-10k dataset, λ1 and λ2 are set to 1 and 10, respectively. When f is DeepWSD trained on the KADID-10k dataset, λ1 and λ2 are set to 5 and 10, respectively.
With regard to the evaluation criteria, three common criteria are adopted, i.e., the Spearman rank correlation coefficient (SRCC), the Pearson linear correlation coefficient (PLCC) and the Kendall rank-order correlation coefficient (KRCC), where SRCC and KRCC measure prediction monotonicity and PLCC reflects prediction precision. Higher values of PLCC, KRCC and SRCC indicate that the IQA model is more consistent with HVS perception. A nonlinear logistic mapping function is leveraged to map the predictions of the various IQA methods onto a common space before computing those correlation coefficients. The mapped score Q̃ of the predicted score Q from the FR-IQA method can be computed as
where ξ1, ξ2, ξ3, ξ4 and ξ5 are to be determined during the curve fitting process.
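The text does not reproduce the mapping function itself; the sketch below uses the standard five-parameter logistic from the IQA evaluation literature, which matches the five fitted parameters ξ1 through ξ5 (this functional form is an assumption, with the parameters obtained by nonlinear least-squares curve fitting).

```python
import numpy as np

def logistic_map(q, xi1, xi2, xi3, xi4, xi5):
    """Standard five-parameter logistic mapping of an objective score q
    onto the subjective-score scale; xi1..xi5 come from curve fitting."""
    return xi1 * (0.5 - 1.0 / (1.0 + np.exp(xi2 * (q - xi3)))) + xi4 * q + xi5
```

The mapping is monotone in q for suitable parameters, so SRCC and KRCC are unchanged by it; it only aligns the score scale for PLCC.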
1) Correlation: In this experiment, the aim is to explore the consistency of the proposed model with the HVS in quality prediction. Several conventional FR-IQA models, including PSNR, SSIM [9], MS-SSIM [82], FSIM [38], VSI [83], VIF [40], NLPD [84], GMSD [39], MAD [76], DeepIQA [41], PieAPP [42], LPIPS [43], DISTS [44] and DeepWSD [45], are employed for performance comparison. To ensure a fair comparison, the source codes of all competing models were obtained from the respective authors, except for DeepIQA and LPIPS. DeepIQA and LPIPS used in this experiment are the corresponding versions retrained on the KADID-10k dataset. The comparison results are shown in Table II in
From the experimental results, one can see that compared with the conventional models, the embodiment of the invention combining the FLRE with the existing deep feature-based methods can achieve competitive results on LIVE, TID2013, CSIQ, QADS and PIPAL. Furthermore, it can be found that the FLRE can improve the performance of the original FR-IQA methods. In terms of SRCC, FLRE+LPIPS achieves around 0.23%, 1.39%, 0.05%, 1.80% and 3.70% improvements over LPIPS on LIVE, TID2013, CSIQ, QADS and PIPAL, respectively. FLRE+DISTS achieves around 0.19%, 2.39%, 0.34%, 1.09% and 0.75% improvements over DISTS on LIVE, TID2013, CSIQ, QADS and PIPAL, respectively. FLRE+DeepWSD achieves around 0.15%, 0.31%, 0.37%, 5.67% and 10.99% improvements over DeepWSD on LIVE, TID2013, CSIQ, QADS and PIPAL, respectively. It is worth noting that although FLRE+LPIPS and FLRE+DISTS have lower PLCC values on the LIVE database compared with LPIPS and DISTS, they still perform well in correctly ranking the relative image quality. For the SHRQ dataset, one can observe that the performance of the FLRE exhibits a slight decrease.
The GAN-based algorithms show remarkable visual performance in the image restoration field but pose significant challenges for IQA. Unlike synthetic distortions, the distortion introduced by GAN-based algorithms is more complicated to simulate. In Table IV below, the performance evaluation results of different FR-IQA models are provided with respect to GAN-based distortion on the PIPAL dataset. The experimental results show that LPIPS, DISTS and DeepWSD combined with the FLRE can further improve the prediction performance on GAN-based distortions, even though they are not re-trained on any images generated by image restoration algorithms. In conclusion, the results in Table II and Table IV reveal the effectiveness of the FLRE, which is attributed to the capability of the FLRE to flexibly select the reference feature of the distorted feature among numerous equal-quality features, thereby providing a more accurate reference benchmark for FR-IQA than the given pristine-quality feature.
2) Scatter Plots: To further visualize the performance yielded by the competing FR-IQA models, the scatter plots of the subjective scores against the objective scores predicted by some representative IQA models on the TID2013 and PIPAL datasets are shown in
where K is the total number of distorted images in the dataset. Q̃k and Yk are the fitted objective score and the subjective score of the k-th distorted image, respectively.
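A direct implementation of this MSE criterion over the K distorted images:

```python
import numpy as np

def fit_mse(q_fitted, y_subjective):
    """Mean squared error between fitted objective scores and
    subjective scores over the K distorted images in a dataset."""
    q = np.asarray(q_fitted, dtype=float)
    y = np.asarray(y_subjective, dtype=float)
    return float(np.mean((q - y) ** 2))
```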
To demonstrate the accuracy of the reference predicted by the IQA method in
Compared with the original FR-IQA models, one can find that the MSE values of FLRE+LPIPS, FLRE+DISTS and FLRE+DeepWSD decrease by 55.45%, 70.40% and 87.95%, respectively, on the VVC_JND dataset, and by 55.62%, 66.13% and 87.88%, respectively, on the MCL_JCL dataset. These results demonstrate that the pseudo-reference features predicted by the FLRE can better explain the distorted features compared to the given pristine-quality features.
In
It is worth noting that for the feature at each scale, the channel dimension is reduced by taking the average pooling operation along its channel axis for better visualization. One can observe that the pseudo-reference features generated by the IQA method are closer to the JND features. Furthermore, comparing sub-images (p) and (u) in
To investigate the contributions of different modules and loss functions in the FLRE, ablation experiments are conducted based on FLRE+DeepWSD as an example. The corresponding results are listed in Table VI below, where experiments 1 and 6 are DeepWSD and FLRE+DeepWSD, respectively. In particular, the PRS is ablated from FLRE+DeepWSD and only ℒpred is used to optimize the learnable feature space of the PNTE module in the second experiment. When evaluating the quality of the distorted image, the second variant directly uses the output of the PNTE module to perform IQA without the PRS. The SRCC results in the second experiment demonstrate that the performance of the second variant significantly decreases on all five datasets, revealing that the effectiveness of the FLRE is not attributed to the increased number of convolutional layers. Furthermore, by comparing the results of the second and fourth experiments, one can find that the model using the PRS gains higher performance, which demonstrates that the PRS can effectively select the pseudo-reference feature for IQA. To verify the effectiveness of the PNTE module, the PNTE module is replaced with two 3×3 convolutions with a stride of 1, resulting in the third experimental setting. The SRCC results show that the PNTE module can better learn the equal-quality feature with the guidance of the spatial correlation feature. By comparing the SRCC results of the fourth and fifth variants, one can observe that the learned equal-quality space optimized by ℒdmax leads to better performance on image restoration databases containing GAN-based and CNN-based distorted images. Then, ℒctt is further added to enforce the consistency of primary content between the images in the learned equal-quality space and the given pristine-quality image. The results indicate that the IQA method obtains a higher SRCC value when simultaneously using ℒpred, ℒdmax and ℒctt.
In addition, FLRE+DeepWSD is compared with its eight variants using pseudo-reference features from different layers: (1)-(5) using the pseudo-reference feature from an individual layer; (6)-(8) using pseudo-reference features from multiple layers. In particular, when using the pseudo-reference feature from an individual layer, the model only performs the PNTE module and the PRS at that specific layer. When using pseudo-reference features from multiple layers, the near-threshold map based on Fref(1) is first predicted and fed into VGG16 to construct the equal-quality space at each layer. Subsequently, the PRS is used to locate the reference feature of the distorted feature at each layer. Finally, the reference features at the different layers are obtained for IQA. The SRCC comparison results on the five datasets are reported in Table VII below. One can find that the IQA model achieves the best performance when the given pristine-quality feature at each layer is replaced by its corresponding pseudo-reference feature. This phenomenon demonstrates that the features of different layers jointly govern the image quality. When the given pristine-quality features are replaced by the predicted pseudo-reference features that better explain the distorted image, the FR-IQA model can make a more accurate prediction.
In summary, in the above exemplary embodiment, a new FR-IQA paradigm, the FLRE, is proposed. The method starts by producing the equal-quality space for a given pristine-quality image by identifying the near-threshold distortion. Subsequently, rooted in the widely accepted view that intrinsic and perceptually-meaningful features govern the image quality, the feature-level pseudo-reference of the distorted image is constructed. The main characteristics of the embodiment can be summarized as follows.
The exemplary embodiments are thus fully described. Although the description referred to particular embodiments, it will be clear to one skilled in the art that the invention may be practiced with variation of these specific details. Hence this invention should not be construed as limited to the embodiments set forth herein.
While the embodiments have been illustrated and described in detail in the drawings and foregoing description, the same is to be considered as illustrative and not restrictive in character, it being understood that only exemplary embodiments have been shown and described and do not limit the scope of the invention in any manner. It can be appreciated that any of the features described herein may be used with any embodiment. The illustrative embodiments are not exclusive of each other or of other embodiments not recited herein. Accordingly, the invention also provides embodiments that comprise combinations of one or more of the illustrative embodiments described above. Modifications and variations of the invention as herein set forth can be made without departing from the spirit and scope thereof, and, therefore, only such limitations should be imposed as are indicated by the appended claims.
The functional units and modules of the systems and methods in accordance with the embodiments disclosed herein may be implemented using computing devices, computer processors, or electronic circuitries including but not limited to application-specific integrated circuits (ASIC), field programmable gate arrays (FPGA), and other programmable logic devices configured or programmed according to the teachings of the present disclosure. Computer instructions or software codes running in the computing devices, computer processors, or programmable logic devices can readily be prepared by practitioners skilled in the software or electronic art based on the teachings of the present disclosure.
All or portions of the methods in accordance with the embodiments may be executed in one or more computing devices including server computers, personal computers, laptop computers, and mobile computing devices such as smartphones and tablet computers.
The embodiments include computer storage media and transient and non-transitory memory devices having computer instructions or software codes stored therein which can be used to program computers or microprocessors to perform any of the processes of the present invention. The storage media and transient and non-transitory computer-readable storage media can include but are not limited to floppy disks, optical discs, Blu-ray Discs, DVDs, CD-ROMs, magneto-optical disks, ROMs, RAMs, flash memory devices, or any type of media or devices suitable for storing instructions, codes, and/or data.
Each of the functional units and modules in accordance with various embodiments also may be implemented in distributed computing environments and/or Cloud computing environments, wherein the whole or portions of machine instructions are executed in a distributed fashion by one or more processing devices interconnected by a communication network, such as an intranet, WAN, LAN, the Internet, and other forms of data transmission medium.