This application claims priority to and the benefit of Korean Patent Application No. 10-2022-0039854, filed on Mar. 30, 2022, the disclosure of which is incorporated herein by reference in its entirety.
The present disclosure relates to a method for evaluating video quality based on a non-reference video, and in particular, to a method for evaluating the quality of a received video without the original video using artificial intelligence composed of a plurality of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) capable of setting a learning range.
However, for evaluating the quality of a received video as perceived by a user in a non-reference manner, without the original video, only manually designed algorithms are known. That is, around ten KPIs (Key Performance Indicators), such as BPS (Bits Per Second), brightness, and blur, are provided as inputs to algorithms such as an SVM (Support Vector Machine), and considering that the way a human evaluates video quality is very high-level, such algorithms have the problem of being difficult to operate properly.
Specifically, methods for evaluating the quality of received video data in the related art may be largely classified into three types.
In the case of the above-mentioned full-reference method, since the original video and the received video may be used, it is easy to design an algorithm for obtaining the quality of the received video, and the error between the predicted quality of the received video and the quality perceived by a person is generally smaller than that of the non-reference method. On the other hand, in a real communication environment, it is extremely rare for a receiver to have an original video, so it is not applicable in most real environments (refer to Prior Art 1 below).
In the case of the reduced-reference method described above, while there is an advantage in that the quality of the received video may be predicted even if only some information is available without having all of the original video, there is the burden of additional information processing and transmission, which is an obstacle to applying it in a real environment.
Lastly, in the case of the non-reference method, while there is an advantage in that quality prediction is possible with only the received video, it has the disadvantage that it is very difficult to design an algorithm for determining the quality of received video data from the received video alone. Accordingly, most non-reference methods are designed as algorithms that predict the quality of the received video by utilizing dozens of KPIs such as BPS, brightness, and blur, but since such algorithms are relatively simple, there is the problem that the error between the predicted quality of the received video and the quality actually perceived by a person is large compared to the full-reference method (see Prior Art 2 below).
The present disclosure was made to solve the above matters and for the purpose of providing a method for evaluating video quality based on a non-reference video that evaluates the quality of a received video without an original video using artificial intelligence composed of a plurality of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) capable of setting a learning range.
A method for evaluating video quality based on the non-reference video of the present disclosure for achieving the above object may include: Operation (a) of extracting Red, Green, Blue (RGB) values for one frame of a video; Operation (b) of obtaining an output by providing the extracted RGB values to CNN No. 1; Operation (c) of obtaining outputs by providing the extracted RGB values to each of the second through n-th CNNs (where n is an integer greater than or equal to 2); Operation (d) of repeating Operations (a) to (c) for all frames and merging the outputs of all the CNNs; Operation (e) of obtaining the output of an RNN, with the time dimension reduced to 1, after passing the merged output values to the RNN for learning in the time dimension; and Operation (f) of applying a regression algorithm to the final output of the RNN so that the output becomes one-dimensional, and predicting this value as the video quality value.
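The pipeline of Operations (a) through (f) can be sketched in miniature as follows. This is only an illustrative shape walkthrough: random projections stand in for the trained CNNs and RNN, and all sizes (4 frames, 8 x 8 pixels, 16 filters per CNN, 2 CNNs) are hypothetical values chosen for the sketch, not taken from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, H, W, F = 4, 8, 8, 16      # toy sizes for illustration
n_cnns = 2                           # n = 2 CNNs in this sketch

# Operations (a)-(d): per frame, each stub "CNN" maps 3 x H x W -> F x H x W;
# the filter outputs of all CNNs are merged, then frames are stacked.
per_frame = []
for _ in range(n_frames):
    rgb = rng.random((3, H, W))                                # Operation (a)
    cnn_outs = [rng.random((F, H, W)) for _ in range(n_cnns)]  # (b), (c) stubs
    per_frame.append(np.concatenate(cnn_outs, axis=0))         # merge filters
merged = np.stack(per_frame)  # frames x (n*F) x H x W -- Operation (d)

# Operation (e): global average pooling over space -> frames x total_filters,
# then an "RNN" (stubbed here as a mean over time) reduces the time dim to 1.
pooled = merged.mean(axis=(2, 3))
rnn_out = pooled.mean(axis=0)

# Operation (f): a regression head maps the RNN output to a single scalar,
# which is predicted as the video quality (MOS) value.
w = rng.random(rnn_out.shape[0])
mos_pred = float(w @ rnn_out)
print(merged.shape, pooled.shape)  # (4, 32, 8, 8) (4, 32)
```

The key bookkeeping is that merging concatenates along the filter dimension, while the RNN and regression head progressively collapse the time and feature dimensions to a single scalar.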
In the above configuration, some convolutional layers of CNN No. 1 are disabled for learning, while others are enabled for learning.
Some convolutional layers of CNN No. 1 are pre-trained with a plurality of ImageNet learning data and have fixed coefficients, while the remaining convolutional layers of CNN No. 1 start learning from the pre-trained coefficients but remain capable of additional learning.
When n is 2, some convolutional layers of CNN No. 2 are pre-trained with a plurality of image learning data different from that of CNN No. 1 and their coefficients are fixed, while the remaining convolutional layers of CNN No. 2 start learning from the pre-trained coefficients but remain capable of additional learning.
In the learning operation, error backpropagation is used.
In the RNN, except for the frame-number dimension, which is responsible for the concept of time, the remaining data are reduced to the dimension of the total number of filters through global average pooling, so that the RNN operates as a one-dimensional RNN.
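The global average pooling described above can be illustrated with NumPy. The sizes here (5 frames, 12 total filters, 7 x 7 feature maps) are hypothetical: only the spatial dimensions are averaged away, leaving one feature vector per frame for the RNN.

```python
import numpy as np

n_frames, total_filters, h, w = 5, 12, 7, 7
x = np.arange(n_frames * total_filters * h * w, dtype=float).reshape(
    n_frames, total_filters, h, w)

# Average over the spatial dimensions only, keeping the frame (time)
# dimension, so the RNN receives a 1-D feature vector per frame.
gap = x.mean(axis=(2, 3))
print(gap.shape)  # (5, 12): frames x total number of filters
```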
According to the method for evaluating video quality based on the non-reference video of the present disclosure, a true non-reference video quality monitoring system that requires only the video itself may be implemented with AI. At the same time, by reflecting a plurality of convolutional neural networks capable of setting the learning range in the AI structure design, instead of the KPI method or a quality evaluation algorithm devised mathematically by a person in the related art, it is possible to increase the correlation between the MOS (Mean Opinion Score) value assigned by a person and the AI's predicted value, and thereby to greatly increase consumer satisfaction in line with the rapidly increasing demand for video accompanying the advent of the untact (contactless) era.
Meanwhile, a low MOS value is also correlated with disturbances in the filming environment, and since the method of the present disclosure measures the MOS value of a video using only the video itself, it is possible to determine whether there is an obstacle in the camera filming environment in, for example, a future autonomous driving system. As a result, when the camera is covered by dust or a piece of wood in an autonomous driving system, it is possible to prevent a major life-threatening accident that might otherwise be caused by failing to identify the obstruction properly.
Hereinafter, a preferred example embodiment of a method for evaluating image quality based on a non-reference video according to the present disclosure will be described in detail with reference to the accompanying drawings.
As is well known, artificial neural networks not only learn from data alone, without human subjectivity, but may also operate at a higher level than methods using around dozens of KPIs, in that the video itself is used as an input feature.
In addition, it operates by receiving only the pixel values of the received video as input, instead of operating based on dozens of KPIs such as BPS (Bits Per Second), brightness, and blur.
If KPIs are used, with 20 KPIs in use, the input size of the algorithm is (20 x video length). If pixel values are used instead, a truly non-reference video quality evaluation algorithm may be designed that takes the much higher-dimensional input of "the number of channels x the horizontal width of the video x the vertical height of the video x the length of the video" and requires no additional information such as KPIs. The difficulty that learning from such high-dimensional input entails is addressed by improving the AI structure.
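The difference in input dimensionality can be made concrete with rough arithmetic. The clip length, frame rate, and resolution below (10 seconds at 30 fps, 1920 x 1080) are illustrative assumptions, not values from the disclosure.

```python
# KPI-based input: 20 KPIs per frame over the length of the video.
n_kpis = 20
frames = 10 * 30                                  # 10 s at 30 fps (assumed)
kpi_input = n_kpis * frames                       # 20 x video length

# Raw-pixel input: channels x width x height x video length.
channels, width, height = 3, 1920, 1080           # assumed resolution
pixel_input = channels * width * height * frames

print(kpi_input, pixel_input)  # 6000 1866240000
```

Even for this short clip, the pixel-based input is several orders of magnitude larger than the KPI-based input, which is why the disclosure emphasizes improving the AI structure to make learning tractable.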
This prevents over-fitting to both ImageNet and the video data. In addition, by allowing a plurality of selectively learnable CNNs, pre-trained on image learning data other than ImageNet, the generalization effect of the enlarged image-based learning data sets may be further amplified.
Hereinafter, each operation of the method for evaluating video quality based on the non-reference video of the present disclosure will be described in detail.
The non-reference video quality evaluation artificial neural network according to the method for evaluating video quality based on the non-reference video of the present disclosure makes it possible to stably evaluate video quality in a non-reference manner by requiring no information other than the received video, that is, by imposing no prerequisites such as the need for information beyond the received video. In addition, since no information other than the received video is required, there is the advantage that no additional time, such as for calculating such information, is needed.
However, since the video needs to be expressed mathematically in order for a machine to understand it, an operation of converting each frame of the video into RGB values is required.
The CNN proposed in the method for evaluating video quality based on the non-reference video of the present disclosure is a CNN pre-trained on images. This is to avoid over-fitting, in which performance during actual prediction is not as good as during the learning operation, when it is difficult to secure hundreds of thousands of learning data items, in which a video and the quality value assigned by a person exist as a correct-answer pair, so that the artificial neural network must be trained with only a small amount of video data. Here, CNN No. 1 may be a CNN pre-trained with millions of ImageNet learning data.
Here, if none of the convolutional layers of the CNN are trained at all, there is a problem in that the pattern linking the video and the quality value cannot be learned. Meanwhile, if all convolutional layers of the CNN are trained and the number of video data items for learning is less than hundreds of thousands, over-fitting to the video data occurs: the general patterns learned from images are lost and only the patterns existing in the video data are learned excessively, so that generalization performance is likely to be worse than when the CNN is not trained at all. Therefore, in order to avoid over-fitting to both ImageNet and the video data, some convolutional layers constituting the CNN fix the coefficients pre-learned from ImageNet so that they do not learn, and the remaining convolutional layers start learning from the pre-learned coefficients but are set to enable further learning.
For example, in the case of ResNet50, which is well known as a CNN structure, there are a total of 48 convolution layers in one CNN. In the method of the present disclosure, for example, 39 convolution layers are set to be unable to learn and only 9 convolution layers are set to be able to learn, so that over-fitting may be prevented from occurring for both ImageNet and the video data. In addition, as shown in
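The partial-freezing scheme can be sketched in miniature. The illustration below models each convolution layer as a coefficient array with a trainable flag and applies a dummy gradient step that skips frozen layers; the 48/39/9 split follows the ResNet50 example above, while the array sizes, the stand-in gradient, and the learning rate are arbitrary (a real implementation would rely on a deep-learning framework's parameter-freezing mechanism, such as per-parameter gradient flags).

```python
import numpy as np

rng = np.random.default_rng(1)
# 48 "convolution layers": the first 39 frozen, the last 9 trainable.
layers = [{"coef": rng.random(4), "trainable": i >= 39}
          for i in range(48)]

def sgd_step(layers, lr=0.1):
    """Apply a dummy gradient step, skipping frozen layers."""
    for layer in layers:
        if layer["trainable"]:
            grad = np.ones_like(layer["coef"])  # stand-in gradient
            layer["coef"] -= lr * grad

before = [l["coef"].copy() for l in layers]
sgd_step(layers)
n_changed = sum(not np.allclose(b, l["coef"]) for b, l in zip(before, layers))
print(n_changed)  # 9: only the trainable layers were updated
```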
Assuming that the CNN in this operation is a CNN pre-trained with millions of ImageNet learning data, as in Operation (b), there is a risk of over-fitting to patterns learned only from ImageNet, so the method of the present disclosure proposes the use of multiple CNNs. For example, CNN No. 2 is assumed to be a neural network pre-trained with tens of thousands or more of image learning data different from that of CNN No. 1. In this case, to prevent CNN No. 2 from over-fitting to both the image data and the video data, some of its convolutional layers fix the pre-trained coefficients, and the other convolutional layers start learning from the pre-trained coefficients but are set to enable further learning.
If data having a size of 3 (the R, G, B color space) x the number of height pixels x the number of width pixels per frame is the input of CNN No. 2, the output of CNN No. 2 has a size of the number of filters (the number of filters in the final convolution layer of the CNN) x the number of height pixels x the number of width pixels.
Specifically, when Operations (a) to (c) are repeated, the dimension of the number of frames is added to the output, and as a result, for each CNN, data of the number of frames x the number of filters x the number of height pixels x the number of width pixels is output. Assuming that the numbers of height and width pixels used in the two CNNs are the same, after Operation (d), data having the size of the number of frames x the total number of filters (the sum of the filters of all the CNNs) x the number of height pixels x the number of width pixels is output.
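The shape bookkeeping of Operation (d) can be illustrated as follows. The sizes here are hypothetical: two CNNs whose final convolution layers have 8 and 16 filters respectively, over 6 frames of 5 x 5 feature maps.

```python
import numpy as np

frames, h, w = 6, 5, 5
out_cnn1 = np.zeros((frames, 8, h, w))   # CNN No. 1: 8 filters (assumed)
out_cnn2 = np.zeros((frames, 16, h, w))  # CNN No. 2: 16 filters (assumed)

# Merge along the filter dimension: total filters = 8 + 16 = 24,
# while the frame and spatial dimensions are unchanged.
merged = np.concatenate([out_cnn1, out_cnn2], axis=1)
print(merged.shape)  # (6, 24, 5, 5)
```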
If the neural network has been trained, the MOS value, which is the quality value of the video, may be obtained by providing the received video as input to the neural network through the above-described operations.
Next, video data to be learned is prepared.
Finally, some of the convolutional layers of the CNN trained on the prepared images are selected and configured to be re-learnable, while the remaining convolutional layers are not re-learned, so that the characteristics of the convolutional layers that have already been learned are not changed even as learning proceeds.
Since there are usually considerably more image data sets available than the video data set to be learned, the method of the present disclosure overcomes the limitation of small learning data by using an image data set in addition to the video data set, which is called transfer learning. In this operation, as shown in
Meanwhile, if only one CNN is used, the artificial neural network may become dependent solely on ImageNet, even if it does not over-fit to ImageNet. To prevent this, in the method of the present disclosure, the convolutional neural network capable of setting a learning range is extended to a plurality, and
As shown in
In the above, a preferred example embodiment of the method for evaluating video quality based on the non-reference video of the present disclosure has been described in detail with reference to the accompanying drawings, but this is only an example, and various modifications and changes may be possible within the scope of the technical idea of the present disclosure. Therefore, the scope of the present disclosure will be determined by the description of the claims below. For example, three or more CNNs may be configured.
Number | Date | Country | Kind |
---|---|---|---
10-2022-0039854 | Mar 2022 | KR | national |