This invention relates to video quality measurement, and more particularly, to a method and apparatus for estimating video quality for an encoded video.
With the development of IP networks, video communication over wired and wireless IP networks (for example, IPTV service) has become popular. Unlike traditional video transmission over cable networks, video delivery over IP networks is less reliable. Consequently, in addition to the quality loss from video compression, the video quality is further degraded when a video is transmitted through IP networks. A successful video quality modeling tool needs to rate the quality degradation caused by network transmission impairments (for example, packet loss, transmission delay, and transmission jitter), in addition to the quality degradation caused by video compression.
The present principles provide a method for estimating video quality of a video, comprising the steps of: accessing a bit stream including the video; determining a picture type of a picture in the video as one of a scene-cut frame, non scene-cut I frame, P frame, and B frame; and estimating the video quality for the video in response to the determined picture type as described below. The present principles also provide an apparatus for performing these steps.
The present principles also provide a method for estimating video quality of a video, comprising the steps of: accessing a bit stream including the video; determining a picture type of a picture in the video as one of a scene-cut frame, non scene-cut I frame, P frame, and B frame, wherein the picture type of the picture is determined in response to at least one of a size of the picture and a corresponding GOP length; determining an initial artifact level and a propagated artifact level in response to the picture type; determining an overall artifact level for the picture in response to the initial artifact level and the propagated artifact level; and estimating the video quality for the video in response to the determined overall artifact level as described below. The present principles also provide an apparatus for performing these steps.
The present principles also provide a computer readable storage medium having stored thereon instructions for estimating video quality of a video according to the methods described above.
In recent years, IPTV (Internet Protocol television) service has become one of the most promising applications over the next generation network. For IPTV service to meet the expectations of end users, predicting and monitoring the quality of service (QoS) and quality of experience (QoE) are greatly needed.
Some QoE assessment methods have been developed for the purpose of network quality planning and in-service quality monitoring. ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) has led study efforts and standardized recommendations for these applications. ITU-T Recommendation G.107 (“The E-model, a computational model for use in transmission planning,” March, 2005) and G.1070 (“Opinion model for video-telephony applications,” April, 2007) provide quality planning models, while ITU-T P.NAMS (non-intrusive parametric model for assessment of performance of multimedia streaming) and P.NBAMS (non-intrusive bit stream model for assessment of performance of multimedia streaming) are proposed for quality monitoring.
As payload information is usually encrypted in IPTV, a bit stream level quality model (for example, P.NBAMS) cannot be applied at a device where an encrypted bit stream cannot be decrypted. A packet layer quality model (for example, P.NAMS) can be applied to estimate perceived video quality by using only packet header information. For instance, frame boundaries may be detected by using RTP (Real-time Transport Protocol) timestamps, the number of lost packets may be counted by using RTP sequence numbers, and the number of bytes in a frame may be estimated by the number of TS (Transport Stream) packets in the TS header.
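As an illustration, a minimal sketch of such header-only processing is given below. It assumes plain RTP over UDP carrying an integer number of 188-byte TS packets per RTP payload, and the function names are illustrative rather than part of the described model.

```python
# A minimal sketch (not the described model itself) of packet-layer processing:
# RTP timestamps mark frame boundaries, gaps in RTP sequence numbers reveal
# losses, and the number of 188-byte TS packets per frame approximates the
# frame size in bytes.
import struct

def parse_rtp_header(packet: bytes):
    """Return (sequence_number, timestamp, payload) from the fixed 12-byte RTP header."""
    if len(packet) < 12:
        raise ValueError("packet too short for an RTP header")
    _, _, seq, ts, _ssrc = struct.unpack("!BBHII", packet[:12])
    return seq, ts, packet[12:]

def group_into_frames(rtp_packets):
    """Group packets sharing an RTP timestamp into one frame; count losses from
    gaps in the 16-bit sequence numbers (which wrap around at 2**16)."""
    frames = {}          # timestamp -> {"ts_packets": ..., "lost": ...}
    prev_seq = None
    for pkt in rtp_packets:
        seq, ts, payload = parse_rtp_header(pkt)
        lost = 0 if prev_seq is None else (seq - prev_seq - 1) % 65536
        prev_seq = seq
        info = frames.setdefault(ts, {"ts_packets": 0, "lost": 0})
        info["ts_packets"] += len(payload) // 188   # MPEG-2 TS packets are 188 bytes long
        info["lost"] += lost
    return frames
```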
An exemplary packet layer quality monitor is shown in
In a packet layer quality monitoring framework as shown in
The present principles relate to a no-reference, packet-based video quality measurement tool. The quality prediction method is of the no-reference, or non-intrusive, type and is based on header information, for example, the headers of an MPEG-2 transport stream over RTP. That is, it does not need access to the decoded video. The tool can be operated in user terminals, set-top boxes, home gateways, routers, or video streaming servers.
In the present application, the term “frame” is used interchangeably with “picture.”
An exemplary method 200 for assessing video quality according to the present principles is shown in
It should be noted that the assessment method can also be used with transport protocols other than RTP, for example, when the video is carried directly as an MPEG-2 transport stream (TS). Frame boundaries may be detected using timestamps in the TS header, and the transmission order and packet losses may be derived from the continuity counter in the TS header.
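As an illustration, a minimal sketch of loss detection from the TS continuity counter is given below; it assumes raw 188-byte TS packets and ignores adaptation-field details, so it is a simplification rather than the described method.

```python
# A minimal sketch: the 4-bit continuity_counter (low nibble of TS header byte 3)
# normally increments by one per packet of a given PID, so a jump larger than
# one indicates at least one lost packet on that PID.
def count_cc_gaps(ts_packets):
    last_cc = {}        # PID -> last continuity_counter value seen
    loss_events = 0
    for pkt in ts_packets:
        if len(pkt) != 188 or pkt[0] != 0x47:          # 0x47 is the TS sync byte
            continue
        pid = ((pkt[1] & 0x1F) << 8) | pkt[2]
        cc = pkt[3] & 0x0F
        if pid in last_cc and (last_cc[pid] + 1) % 16 != cc:
            loss_events += 1
        last_cc[pid] = cc
    return loss_events
```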
In the following, the steps of frame type estimation, artifact level estimation, and quality prediction are described in further detail.
Losses occurring in different types of frames may result in different levels of visible artifacts, which lead to different levels of perceived quality for viewers. For example, the effect of a loss occurring in a reference I or P frame is more severe than that of a loss in a non-reference B frame. In the present embodiments, the frame type is estimated based on an estimated GOP structure and the number of bytes in a frame.
We define four frame types (ftype): {ftype=4 (scene-cut frame), ftype=3 (non scene-cut I frame), ftype=2 (P frame), ftype=1 (B frame)}.
Whether a frame is an Intra frame can be determined from a syntax element, for example, “random_access_indicator” in the adaptation field of transport stream (TS) packet.
A scene-cut frame is estimated as a frame at which a scene cut may occur and which therefore usually has a high encoding bitrate. A scene-cut frame may occur at an Intra frame or a non-Intra frame. For a bit stream with an adaptive GOP structure, scene-cut frames mainly correspond to I frames with quite short GOP lengths. For a bit stream with a fixed GOP length, scene-cut frames may be non-Intra frames with quite large numbers of bytes.
Considering different implementations of an encoder with different GOP structures, we estimate whether frame i (i ∈ GOP j) is a scene-cut frame using the following equation:
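A possible form of Eq. (1), inferred from the surrounding description and from the later references to Eqs. (1.1) and (1.2), is given below; it is stated as an assumption rather than as the original equation.

```latex
% Reconstruction (an assumption): a frame is estimated as a scene-cut frame if it
% is a large non-Intra frame (1.1) or an Intra frame starting a short GOP (1.2).
ftype_i = 4 \quad \text{if} \quad
\begin{cases}
  bytes_i > PRE\_I\_Bytes, & \text{frame } i \text{ is a non-Intra frame}, \quad (1.1)\\
  glen_j < 0.5 \times AVE\_GOPLength, & \text{frame } i \text{ is an Intra frame}, \quad (1.2)
\end{cases}
```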
where bytes_i is the number of bytes in frame i, PRE_I_Bytes is the number of bytes in the previous I frame, glen_j is the GOP length of GOP j containing frame i, and AVE_GOPLength is the average GOP length. A GOP starts at a scene-cut frame or I frame and extends until the next scene-cut frame or I frame.
To decide whether frame i (i ∈ GOP j and i is a non-Intra frame) is a P or B frame, AVE_bytes_j is calculated as the average number of bytes of the frames in GOP j, excluding the scene-cut frame or I frame in the GOP. If bytes_i is larger than AVE_bytes_j, frame i is determined to be a P frame; otherwise it is determined to be a B frame. That is,

ftype_i = 2 (P frame) if bytes_i > AVE_bytes_j   (2.1)

ftype_i = 1 (B frame) if bytes_i ≤ AVE_bytes_j   (2.2)
An exemplary method 300 for determining frame type for a frame according to the present principles is shown in
For a non-Intra frame, the method checks whether the frame size is very large, for example, whether the frame size is greater than the frame size of the previous I frame as specified in Eq. (1.1). If the frame size is very large, the non-Intra frame is estimated to be a scene-cut frame (350). Otherwise, if the frame size is not very large, the method checks whether the frame size is large, for example, whether the frame size is greater than the average frame size of the GOP as specified in Eq. (2.1). If the frame size is large, the non-Intra frame is estimated to be a P frame (370), and otherwise a B frame (380).
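As an illustration, a minimal sketch of the frame-type decision described above is given below; it assumes that the frame sizes (in bytes) and the Intra flags have already been recovered from the packet headers, and the function and parameter names are illustrative only.

```python
# Frame-type decision following Eqs. (1.1), (1.2), (2.1), and (2.2); the 0.5
# threshold mirrors the constant referenced for Eq. (1.2).
SCENE_CUT, NON_SC_I, P_FRAME, B_FRAME = 4, 3, 2, 1

def classify_frame(bytes_i, is_intra, prev_i_bytes, gop_len, ave_gop_len, ave_bytes_gop):
    if is_intra:
        # Intra frame: a much-shorter-than-average GOP suggests a scene cut (Eq. (1.2)).
        return SCENE_CUT if gop_len < 0.5 * ave_gop_len else NON_SC_I
    # A non-Intra frame larger than the previous I frame suggests a scene cut (Eq. (1.1)).
    if prev_i_bytes is not None and bytes_i > prev_i_bytes:
        return SCENE_CUT
    # Otherwise compare against the average frame size of the GOP, excluding the
    # I frame or scene-cut frame (Eqs. (2.1) and (2.2)).
    return P_FRAME if bytes_i > ave_bytes_gop else B_FRAME
```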
For an exemplary video sequence,
An Averaged Loss Artifact Extension (ALAE) metric is estimated, based on the estimated frame types and other parameters, to measure the visible degradation caused by video transmission losses. For each frame i, a Loss Artifact Extension (LAE) can be calculated as the sum of the Initial Artifact (IA) caused by losses in the current frame and the Propagated Artifact (PA) caused by losses in reference frames:
LAE_i = IA_i + PA_i.   (3)
The initial artifact level may be calculated as:
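A possible form of Eq. (4), inferred from the definitions that follow and stated as an assumption rather than as the original equation, is the lost-packet ratio weighted by a frame-type dependent factor:

```latex
IA_i = w_i^{IA} \times \frac{lp_i}{tp_i} \qquad (4)
```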
where lp_i is the number of lost packets (including packets lost due to unreliable transmission and the packets that follow a lost packet within the current frame, which cannot be decoded), tp_i is the total number of packets (including the estimated number of lost packets), and w_i^IA is a weighting factor that depends on the frame type, because losses occurring in different types of frames cause different levels of visible artifacts. In one exemplary embodiment, the frame types and the corresponding weighting factors are set as shown in TABLE 1. Because a loss occurring in a scene-cut frame often causes the most serious visible artifacts for viewers, its weighting factor is set to be the largest. A non scene-cut I frame and a P frame usually cause similar levels of visible artifacts since both are used as reference frames, so their weighting factors are set to be the same.
The propagated artifact may be calculated as:
PA_i = w_i^PA × ((1 − α) × LAE_pre1 + α × LAE_pre2),   (5)
where (1 − α) × LAE_pre1 + α × LAE_pre2 estimates the propagated error from the two previous reference frames, and w_i^PA is a weighting factor. In one embodiment, α is set to 0.25 for a P frame and 0.5 for a B frame, and w_i^PA is set to 1 for P and B frames, which means no artifact attenuation, and to 0.5 for an I frame in which a loss occurred (regardless of whether it is a scene-cut frame or not), which means the artifact is attenuated by half. If an I frame is received without loss, w_i^PA is set to 0, which means no error propagation.
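As an illustration, a minimal sketch of the recursion of Eqs. (3)-(5) is given below. The w_i^IA values stand in for TABLE 1, which is not reproduced here, and are purely illustrative; the value of α for I frames and the bookkeeping of the two previous reference frames are likewise assumptions.

```python
# Per-frame loss artifact extension LAE_i = IA_i + PA_i (Eq. (3)), with IA_i as
# reconstructed above (Eq. (4)) and PA_i per Eq. (5).
W_IA = {4: 1.0, 3: 0.75, 2: 0.75, 1: 0.5}    # ftype -> w_i^IA (hypothetical placeholder values)

def loss_artifact_extension(frames):
    """frames: list of dicts with keys 'ftype', 'lp' (lost packets), 'tp' (total packets)."""
    lae_pre1 = lae_pre2 = 0.0                # LAE of the two previous reference frames
    lae_values = []
    for frm in frames:
        ftype, lp, tp = frm["ftype"], frm["lp"], frm["tp"]
        ia = W_IA[ftype] * (lp / tp if tp else 0.0)        # Eq. (4)
        alpha = 0.25 if ftype == 2 else 0.5                # 0.25 for P, 0.5 for B; I-frame value assumed
        if ftype in (3, 4):                                # I frame (scene cut or not)
            w_pa = 0.5 if lp > 0 else 0.0                  # attenuate propagation, or stop it if intact
        else:
            w_pa = 1.0                                     # P and B frames: no attenuation
        pa = w_pa * ((1 - alpha) * lae_pre1 + alpha * lae_pre2)   # Eq. (5)
        cur = ia + pa                                              # Eq. (3)
        lae_values.append(cur)
        if ftype != 1:                                     # B frames are not used as references
            lae_pre1, lae_pre2 = cur, lae_pre1             # assumption: shift the reference history
    return lae_values
```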
One frame may be encoded into several slices, for example, in a high-definition IPTV program. Each slice is an independent decoding unit. That is, a lost packet in one slice may render all following received packets in that slice undecodable, but it will not influence the decoding of received packets in other slice(s) of the frame. Consequently, the number of slices in a frame affects video quality. Thus, in the present embodiments, the number of slices (denoted as s) is considered in the quality modeling.
When the video is encrypted, how a frame is partitioned into slices is unknown, and the exact location of a lost packet within a slice is also unknown. In our experiments, we observe that a video sequence with more slices per frame has a larger LAE value than a sequence with fewer slices per frame, even when the two sequences have similar perceived quality levels and should therefore have similar ALAE values. Based on these experimental results, we use the square root of s (√s) to take into account the effect of the number of slices per frame on the video quality.
The number of slices per frame may be determined from the video applications. For example, a service provider may provide this parameter in a configuration file. If the number of slices per frame is not provided, we set it to a default value, for example, 1.
Using the estimated visible artifact levels (i.e., LAE parameters) and the number of slices in a frame, the average visible artifact level for a video sequence (ALAE) can be calculated as:
where N is the number of frames in the video, f is the frame rate, and s is the number of slices per frame.
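As an illustration, a sketch of the ALAE aggregation is given below. Because Eq. (6) is not reproduced above, the normalization by the number of frames N and by √s follows the surrounding description, while the exact role of the frame rate f cannot be inferred here and is exposed only as an optional, clearly labeled assumption.

```python
import math

def average_loss_artifact_extension(lae_values, s=1, f=None):
    """Average the per-frame LAE values and account for the slices per frame via sqrt(s).
    Passing the frame rate f rescales the result to a per-second value; whether Eq. (6)
    actually uses f this way is an assumption."""
    n = len(lae_values)
    alae = sum(lae_values) / (n * math.sqrt(s))
    if f is not None:
        alae *= f
    return alae
```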
The video quality is then estimated using the ALAE parameter. In the present principles, the quality prediction model predicts video quality by considering both coding artifacts and channel artifacts.
A video program may be compressed at various coding bitrates, resulting in different levels of quality degradation due to video compression. In the present embodiments, video compression artifacts are taken into account when predicting video quality by using the bitrate parameter.
Considering the bitrate parameter and the ALAE parameter, the overall quality for the encrypted video can be obtained, for example, using a logistic function:
where VqN is a normalized mean opinion score (NMOS) within [0,1]. In Eq. (7), the bitrate parameter Br is used to model coding artifacts and the ALAE parameter is used to model slicing channel artifacts. In Eq. (7), a, b, and c are constants, which may be obtained using a least-squares curve fitting method. For example, coefficients a, b, and c may be determined from a training database that is built conforming to ITU-T SG 12.
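As an illustration of how the constants a, b, and c could be obtained by least-squares curve fitting, the sketch below assumes a generic logistic-style mapping of the bitrate Br and the ALAE value to a normalized score; this functional form is an assumption standing in for Eq. (7), not the original model.

```python
import numpy as np
from scipy.optimize import curve_fit

def vq_model(x, a, b, c):
    """Illustrative stand-in for Eq. (7): a logistic term in log-bitrate for coding
    artifacts, multiplied by an exponential decay in ALAE for slicing channel artifacts."""
    br, alae = x
    coding_quality = 1.0 / (1.0 + np.exp(-a * (np.log(br) - b)))
    return coding_quality * np.exp(-c * alae)

def fit_constants(br_train, alae_train, nmos_train):
    """Fit (a, b, c) to a training database of bitrates, ALAE values, and NMOS scores in [0, 1]."""
    popt, _ = curve_fit(vq_model,
                        (np.asarray(br_train, dtype=float), np.asarray(alae_train, dtype=float)),
                        np.asarray(nmos_train, dtype=float),
                        p0=(1.0, 0.0, 1.0), maxfev=10000)
    return popt
```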
Various constants are used in the present embodiments, for example, constant 0.5 in Eq. (1.2), weighting factors in Eqs. (4), (5) and TABLE 1, and coefficients a, b, and c in Eq. (7). When the present principles are applied to different systems than those exemplified in the present application, the equations or the values of the model parameters may be adjusted, for example, for new training databases or different video coding methods.
We compared the proposed quality prediction model with two other models, described respectively in “Parametric packet-layer model for monitoring video quality of IPTV services,” K. Yamagishi and T. Hayashi, ICC, 2008 (hereinafter “Yamagishi”), and “Frame-layer packet-based parametric video quality model for encrypted video in IPTV services,” M. N. Garcia and A. Raake, QoMEX, 2011 (hereinafter “Garcia”). Similar to our method, Yamagishi estimates coding degradation using a logistic function of the bitrate parameter, and loss degradation using an exponential function of the PLF (packet-loss frequency) parameter. The xwpSEQ metric proposed in Garcia is applicable to slicing-type loss degradation and is fitted with a log function.
The Spearman correlations of the slicing-related metrics, namely ALAE in our model, xwpSEQ in Garcia, and PLF in Yamagishi, are shown in FIGS. 5(A)-(C), respectively. In FIGS. 5(A)-(C), the y-axis indicates the NMOS and the x-axis indicates the value of the respective metric. We observe that our proposed method significantly outperforms the methods of Yamagishi and Garcia, which indicates that the proposed metric correlates better with subjective quality. In
In the present application, packet layer quality assessment for monitoring the quality of an encrypted video is proposed. The proposed model is applicable to in-service, non-intrusive applications, and its computational load is quite light because it only uses packet header information and does not need access to media signals. An efficient loss-related metric is proposed to predict the visible artifacts and the perceived quality. The estimation of the visible artifact level is based on the spatio-temporal complexity derived from frame layer information. The overall quality prediction model is capable of handling videos with various slice numbers and different GOP structures, and considers both coding and channel artifacts. The generality of the model is demonstrated on an adequate number of training and validation databases with various configurations. The better performance in metric correlation and RMSE comparison shows the superiority of our model.
The present principles can also be used when the video is not encrypted. That is, even if the video payload information becomes available, and more information about the video can be parsed or decoded, the proposed video quality prediction method may still be desirable because of its low complexity.
Referring to
In one embodiment, a video quality monitor 640 may be used by a content creator. For example, the estimated video quality may be used by an encoder in deciding encoding parameters, such as mode decision or bit rate allocation. In another example, after the video is encoded, the content creator uses the video quality monitor to monitor the quality of the encoded video. If the quality metric does not meet a pre-defined quality level, the content creator may choose to re-encode the video to improve the video quality. The content creator may also rank the encoded videos based on quality and charge for the content accordingly.
In another embodiment, a video quality monitor 650 may be used by a content distributor. A video quality monitor may be placed in the distribution network. The video quality monitor calculates the quality metrics and reports them to the content distributor. Based on the feedback from the video quality monitor, a content distributor may improve its service by adjusting bandwidth allocation and access control.
The content distributor may also send the feedback to the content creator to adjust encoding. Note that improving encoding quality at the encoder may not necessarily improve the quality at the decoder side since a high quality encoded video usually requires more bandwidth and leaves less bandwidth for transmission protection. Thus, to reach an optimal quality at the decoder, a balance between the encoding bitrate and the bandwidth for channel protection should be considered.
In another embodiment, a video quality monitor 660 may be used by a user device. For example, when a user device searches for videos on the Internet, a search result may return many videos or many links to videos corresponding to the requested video content. The videos in the search results may have different quality levels. A video quality monitor can calculate quality metrics for these videos and decide which video to select and store. In another example, the user device may have access to several error concealment techniques. A video quality monitor can calculate quality metrics for the different error concealment techniques and automatically choose which concealment technique to use based on the calculated quality metrics.
The implementations described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed may also be implemented in other forms (for example, an apparatus or program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants (“PDAs”), and other devices that facilitate communication of information between end-users.
Reference to “one embodiment” or “an embodiment” or “one implementation” or “an implementation” of the present principles, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present principles. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment” or “in one implementation” or “in an implementation”, as well as any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
Additionally, this application or its claims may refer to “determining” various pieces of information. Determining the information may include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
Further, this application or its claims may refer to “accessing” various pieces of information. Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
Additionally, this application or its claims may refer to “receiving” various pieces of information. Receiving is, as with “accessing”, intended to be a broad term. Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory). Further, “receiving” is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
As will be evident to one of skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry the bit stream of a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.