The present invention relates to monitoring of digital video/audio signals.
Video quality assessment is currently one of the most challenging problems in the broadcasting industry. No matter what the format of the coded video or the medium of transmission, there are always sources that cause degradation in the coded/transmitted video. Almost all of the current major broadcasters are concerned with the notion of “How good will our video look at the receiver?” Currently, there are very few practical methods and objective metrics to measure video quality. Also, most current metrics/methods are not feasible for real-time video quality assessment due to their high computational complexity.
Watermarking is a technique whereby information is transmitted from a transmitter to a receiver in such a way that the information is hidden within an item of digital media. A major goal of watermarking is to enhance security and copyright protection for digital media.
Whenever a digital video is coded and transmitted, it undergoes some form of degradation. This degradation may be in many forms, for example, blocking artifacts, packet loss, black-outs, lip-synch errors, synchronization loss, etc. Human eyes and ears are very sensitive to these forms of degradation. Hence it is beneficial if the transmitted video undergoes no or only a minimal amount of degradation and quality loss. Almost all the major broadcasting companies are competing to make their media the best quality available. However, in order to improve video quality, methods and metrics are required to determine quality loss. Unfortunately, most of the quality assessment metrics currently available rely on having some form of the original video source available at the receiver. These methods are commonly referred to as Full Reference (FR) and Reduced Reference (RR) quality assessment methods. Methods that do not use any information at the receiver from the original source are called No Reference (NR) quality assessment methods.
While FR and RR methods have the advantage of estimating video quality with high accuracy, they require a large amount of transmitted reference data. This significantly increases the bandwidth requirements of the transmitted video, making these methods impractical for real-time systems (e.g., broadcasting). NR methods are suited to applications where the original media is not available at the receiver. However, their measurement accuracy is low, and the complexity of the blind detection algorithms is high.
Watermarking in digital media has been used for security and copyright protection for many years. In watermarking, information is imperceptibly embedded in the digital media at the encoder. The embedded information can take many different forms, ranging from encrypted codes to pilot patterns. Then, at the decoder, the embedded information is recovered and verified, and in some cases removed from the received signal before the content is opened, played, or displayed. If there is a watermark mismatch, the decoder identifies a possible security/copyright violation and does not open, play, or display the digital media contents. Such watermarking has become a common way to ensure security and copyright preservation in digital media, especially digital images, audio, and video content.
Digital video is, however, often subjected to compression (MPEG-2, MPEG-4, H.263, etc.) and conversion from one format to another (HDTV-SDTV, SDTV-CIF, TV-AVI, etc.). Due to composite processing involving compression, format conversion, resolution changes, brightness changes, filtering, etc., the embedded watermark can easily be destroyed such that it cannot then be decoded at the receiver. This may result in a security/copyright breach and/or distortion in the decoded video. One such scenario is illustrated in
Also, it is often difficult to embed imperceptible watermarks in high-quality video. The embedding strength of video watermarking is therefore limited by the imperceptibility requirement. In this situation, hybrid channel distortion makes it difficult for the watermark to survive in the video.
In recent years, video processing techniques have improved, and high-quality video broadcasts, such as high-definition television (HDTV) broadcasts, are common. Digital video signals of a high-definition television broadcast, etc., are often transmitted to each home through satellite broadcasting or a cable TV network. However, errors sometimes occur during the transmission of video signals for various reasons. When an error occurs, problems such as a video freeze, a blackout, noise, or an audio mute may result, and it becomes necessary to take countermeasures.
Japanese Patent Application Laid-Open No. 2003-20456 discloses a signal monitoring system in which a central processing terminal calculates a difference between a first statistic value based on a video signal (first signal) output from a transmission source and a second statistic value based on a video signal (second signal) output from a relay station or a transmission destination. If the difference is below a threshold value, the transmission is determined to be normal. If the difference exceeds the threshold value, it is determined that transmission trouble has occurred between the transmission source and the relay station, and a warning signal is output to raise an alarm (an alarm display and an alarm sound).
A novel monitoring method provides a reliable way to monitor the quality of video and audio while not demanding that substantially more data be broadcast. In one example of the novel monitoring method, a first video characteristic of a video/audio signal is determined. The term "video/audio signal" as used here generally refers to a signal including both a picture signal (video signal) and an associated sound signal (audio signal). The video/audio signal can be either a raw signal or one involving compressed video/audio information.
The video/audio signal is transmitted from a transmission source to a transmission destination. The first video characteristic is communicated in an audio signal portion of the video/audio signal. This audio-transmitted video characteristic is usable for copyright protection and/or for measuring and improving video quality.
The video/audio signal is received at the transmission destination and the first video characteristic is recovered from the audio signal portion of the video/audio signal. The video/audio signal is also analyzed and a second video characteristic is thereby determined. The same algorithm is used to determine the second video characteristic from the received video and audio signal as was used to determine the first video characteristic from the original video and audio signal prior to transmission.
The recovered first video characteristic is then used to verify or test the determined second video characteristic. If the difference between the first and second video characteristics is greater than a predetermined threshold amount, then an error condition is determined to have occurred. For example, if appropriate parameters are used, then it is determined that a lip-sync error condition likely occurred. If, however, the difference between the first and second video characteristics is below the predetermined threshold amount, then it is determined that an error condition has likely not occurred.
In one example, the first and second video characteristics are determined based at least in part on video frame statistic parameters and are referred to here as “VDNA” (Video DNA) values. A VDNA value may, for example, be a concatenation of many video frame parameter values that are descriptive of, and associated with, a single frame or a group of frames of video. The video frame statistic parameters may together characterize the amount of activity, variance, and/or motion in the video of the video/audio signal. The parameters are used by a novel monitoring apparatus to evaluate video quality using the novel monitoring method set forth above. The amount of information required to be transmitted from the transmission source to the transmission destination in the novel monitoring method is small because the first characteristic, in one example, is communicated using fewer than one hundred bits per frame. Furthermore, in one example the novel quality assessment monitoring method is based on block variance parameters, as more particularly described below, and has proven to be highly accurate.
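By way of illustration only, the following sketch shows how a handful of quantized frame statistics might be concatenated into a compact VDNA value. The parameter names, value ranges, and bit widths below are illustrative assumptions and are not prescribed by the method.

```python
# Sketch: packing quantized frame statistics into a compact VDNA value.
# The parameter names, ranges, and bit widths are illustrative assumptions.

def quantize(value, lo, hi, bits):
    """Map a value in [lo, hi] to an unsigned integer code of width `bits`."""
    span = (1 << bits) - 1
    v = min(max(value, lo), hi)                 # clamp to the valid range
    return round((v - lo) / (hi - lo) * span)

def pack_vdna(video_level, video_activity, motion):
    """Concatenate three quantized parameters into one 32-bit VDNA value."""
    fields = [
        (quantize(video_level, 0.0, 255.0, 8), 8),        # 8-bit Video Level
        (quantize(video_activity, 0.0, 4096.0, 12), 12),  # 12-bit Activity
        (quantize(motion, 0.0, 4096.0, 12), 12),          # 12-bit Motion
    ]
    vdna = 0
    for code, width in fields:
        vdna = (vdna << width) | code
    return vdna

print(f"VDNA = {pack_vdna(128.0, 900.0, 37.5):#010x}")  # 32 bits per frame
```

With the assumed widths, the packed value occupies 32 bits, comfortably under the one-hundred-bit-per-frame budget mentioned above.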
Further details, embodiments and techniques are described in the detailed description below. This summary does not purport to define the invention. The invention is defined by the claims.
The accompanying drawings, where like numerals indicate like components, illustrate embodiments of the monitoring method.
In one example of a monitoring method, a first video characteristic, hereinafter referred to as the first VDNA, is extracted at an encoder/transmitter from a video frame of a video/audio signal. This first VDNA is then embedded in an audio signal portion of the video/audio signal. The audio signal portion corresponds to the video frame. The group of audio samples corresponding to the same video frame is referred to here as an “audio frame”.
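The grouping of audio samples into audio frames follows directly from the audio sample rate and the video frame rate. A minimal sketch, assuming (purely for illustration) 48 kHz audio and 30 frames-per-second video:

```python
# Sketch: grouping audio samples into "audio frames", one per video frame.
# The 48 kHz sample rate and 30 fps frame rate are illustrative assumptions.
SAMPLE_RATE = 48000                               # audio samples per second
FRAME_RATE = 30                                   # video frames per second
SAMPLES_PER_FRAME = SAMPLE_RATE // FRAME_RATE     # 1600 samples per audio frame

def audio_frame(audio, frame_index):
    """Return the slice of audio samples accompanying one video frame."""
    start = frame_index * SAMPLES_PER_FRAME
    return audio[start:start + SAMPLES_PER_FRAME]
```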
At the receiver, the embedded first VDNA is extracted from the audio signal portion of the received video/audio signal. A second VDNA is computed from the received video frame. The same algorithm may be used to determine the second VDNA from the received video frame as was used to determine the first VDNA from the original video frame prior to transmission. The first and second VDNAs are then compared to each other. Depending on the type of application, different decisions can be made if the VDNA parameters do not match. For example, in a security/copyright application, a VDNA mismatch may cause the application to declare a breach. From the point of view of quality assessment, a VDNA mismatch may indicate a loss of quality and/or the presence of errors and distortion in the received video.
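In code, this receiver-side check reduces to a thresholded comparison of the two VDNAs. The following is a minimal sketch; the distance measure and the threshold value are assumptions, not requirements of the method.

```python
import numpy as np

def vdna_mismatch(first_vdna, second_vdna, threshold):
    """Compare the VDNA recovered from the audio against the VDNA recomputed
    from the received video; return True when they differ by too much."""
    a = np.asarray(first_vdna, dtype=float)
    b = np.asarray(second_vdna, dtype=float)
    return np.abs(a - b).max() > threshold

# A security application might declare a breach on mismatch, while a
# quality-assessment application would flag probable quality loss instead.
if vdna_mismatch([12, 48, 7], [12, 41, 7], threshold=5.0):
    print("VDNA mismatch: possible breach or quality loss")
```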
Many different characteristics or parameters can be used as the video characteristic. However, it is desirable that the chosen parameters be relatively insensitive to format conversion and compression, because digital videos often undergo such processing, which changes some frame statistics and renders certain parameters useless. Through extensive simulations, it has been determined that characteristics corresponding to scene changes are less sensitive to format conversions. Hence, in the preferred embodiment, a parameter is used that represents the block variance of the difference between two consecutive frames. Whenever this parameter has a high value, a scene change has likely occurred. This high-valued parameter is then used as the video frame parameter for all subsequent frames until the next scene change is encountered.
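The following sketch illustrates this preferred parameter under stated assumptions: the frames are 8-bit grayscale numpy arrays, the block size is 8×8, and the scene-change threshold is an arbitrary illustrative value.

```python
import numpy as np

def block_variance(diff, block=8):
    """Mean variance over non-overlapping block x block tiles of a frame difference."""
    h = diff.shape[0] // block * block
    w = diff.shape[1] // block * block
    tiles = (diff[:h, :w]
             .reshape(h // block, block, w // block, block)
             .transpose(0, 2, 1, 3)
             .reshape(-1, block * block))
    return tiles.var(axis=1).mean()

def scene_change_parameter(frames, threshold=400.0):
    """Yield one parameter per frame: the block variance of the consecutive
    frame difference at the most recent scene change, held until the next."""
    held, prev = 0.0, None
    for frame in frames:
        if prev is not None:
            bv = block_variance(frame.astype(float) - prev.astype(float))
            if bv > threshold:        # a high value => likely scene change
                held = bv
        prev = frame
        yield held
```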
There are several suitable methods for encoding the first VDNA into the audio signal. These methods are generally referred to as audio watermarking. Two such well-known methods are Quantization Index Modulation (QIM) and Spread Transform Dither Modulation (STDM). Both are recognized, well-developed watermark embedding and detection methods, are usable with the preferred monitoring method, and are briefly described below.
QIM is a general class of embedding and decoding methods that uses a quantized codebook (sometimes called a code-set). Two practical implementations of QIM are Dither Modulation (DM) and Spread Transform Dither Modulation (STDM).
DM involves information bits (e.g., a user ID, VDNA, or encrypted message), dither vectors (a kind of repetition code providing redundancy), an embedder that performs a quantization operation, and a decoder that performs minimum-distance decoding. The embedding strength of DM is adjusted by a step size Δ.
For embedding, it is assumed that the information bits take the values 0 and 1. Two dither vectors, named dither_0 and dither_1, are generated for bit 0 and bit 1, respectively, from a random sequence and the step size Δ. The following steps constitute watermark embedding. 1) If bit 0 is to be embedded, dither_0 is applied. 2) Dither_0 is added to the host media (the original media) and quantization is carried out. 3) Dither_0 is then subtracted from the quantized result. Similar steps are carried out for bit 1.
The following steps are carried out at the decoder. 1) Dither_0 is added to the received (watermarked and possibly attacked) media; the same is done with dither_1. 2) Quantization is carried out on each result, and dither_0 and dither_1 are subtracted from their respective quantized results. 3) Each quantized result is then compared with the received media, and the two sums of squared differences obtained for dither_0 and dither_1 are computed. 4) The transmitted information bit is decided based on the smaller of the two sums (minimum-distance decoding).
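A minimal sketch of DM embedding and minimum-distance decoding as described above follows. The step size and the dither generation are illustrative; in a real system the dither vectors would be derived from a key shared by the embedder and the decoder.

```python
import numpy as np

rng = np.random.default_rng(42)      # stands in for a key shared with the decoder
DELTA = 0.5                          # quantization step size (embedding strength)

def make_dithers(n):
    """dither_0 is random in [0, DELTA); dither_1 is offset by DELTA/2."""
    d0 = rng.uniform(0, DELTA, n)
    d1 = (d0 + DELTA / 2) % DELTA
    return d0, d1

def dm_embed(host, bit, d0, d1):
    """Add the selected dither, quantize with step DELTA, remove the dither."""
    dither = d1 if bit else d0
    return DELTA * np.round((host + dither) / DELTA) - dither

def dm_decode(received, d0, d1):
    """Minimum-distance decoding: requantize under each dither hypothesis and
    pick the bit whose reconstruction lies closer to the received media."""
    dists = []
    for dither in (d0, d1):
        q = DELTA * np.round((received + dither) / DELTA) - dither
        dists.append(np.sum((received - q) ** 2))
    return int(dists[1] < dists[0])

host = rng.normal(size=64)           # 64 host samples (e.g., audio samples)
d0, d1 = make_dithers(64)
marked = dm_embed(host, 1, d0, d1)
noisy = marked + rng.normal(scale=0.05, size=64)   # mild channel distortion
assert dm_decode(noisy, d0, d1) == 1
```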
STDM involves information bits (e.g., a user ID, VDNA, or encrypted message), dither vectors (a kind of repetition code providing redundancy), a spreading vector, an embedder that performs a quantization operation, and a decoder that performs minimum-distance decoding. The strength of STDM is adjusted by the length of the spreading vector and the step size Δ. STDM follows the same procedure as DM except that a spreading vector is applied.
For embedding, it is assumed that the information bits take the values 0 and 1. Two dither vectors, named dither_0 and dither_1, are generated for bit 0 and bit 1 from a random sequence and the step size Δ, and a spreading vector is also provided. The following steps constitute the embedding process. 1) If bit 0 is to be embedded, dither_0 is used (the bit 1 case is analogous). 2) The host media is first projected onto the spreading vector. 3) The projected host media is added to dither_0 (or dither_1 in the case of bit 1) and quantization is carried out. 4) The dither vector (dither_0 or dither_1) is then subtracted from the quantized result.
The following steps are carried out at the decoder. 1) The received media is first projected onto the spreading vector. 2) Dither_0 and dither_1 are then added separately to the projected media. 3) Quantization is carried out, and dither_0 and dither_1 are subtracted from their respective quantized results. 4) The two quantized results are compared with the projected media, and the two sums of squared differences obtained for dither_0 and dither_1 are computed. 5) The transmitted information bit is decided based on the smaller of the two sums (minimum-distance decoding).
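A corresponding sketch of STDM follows: DM applied to the scalar projection of the host onto a spreading vector, with the quantization correction added back along that vector. The step size and the vectors are again illustrative stand-ins for key-derived values.

```python
import numpy as np

rng = np.random.default_rng(7)       # stands in for a shared key
DELTA = 0.5                          # quantization step size

def stdm_embed(host, bit, spread, d0, d1):
    """Dither-quantize the scalar projection of the host onto the spreading
    vector, then add the quantization correction back along that vector."""
    u = spread / np.linalg.norm(spread)
    p = host @ u                                   # scalar projection
    dither = d1 if bit else d0
    p_q = DELTA * np.round((p + dither) / DELTA) - dither
    return host + (p_q - p) * u

def stdm_decode(received, spread, d0, d1):
    """Minimum-distance decoding performed on the projection, as in DM."""
    u = spread / np.linalg.norm(spread)
    p = received @ u
    dists = []
    for dither in (d0, d1):
        p_q = DELTA * np.round((p + dither) / DELTA) - dither
        dists.append((p - p_q) ** 2)
    return int(dists[1] < dists[0])

spread = rng.normal(size=64)
d0 = rng.uniform(0, DELTA)
d1 = (d0 + DELTA / 2) % DELTA
marked = stdm_embed(rng.normal(size=64), 0, spread, d0, d1)
assert stdm_decode(marked, spread, d0, d1) == 0
```

Spreading the quantization error over many host samples is what lets STDM trade embedding rate for robustness relative to plain DM.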
The main advantage of using QIM and STDM is the possibility of blind detection without host-signal (multimedia) interference at the detector.
To calculate a video frame block variance, a video signal VD (see
In one example, Motion is calculated as follows. An image frame is divided into small blocks of 8 pixels × 8 lines, and the average value and the variance of the 64 pixels in each small block are calculated. The Motion is represented by the difference between the average value and variance of each block and those of the co-located block in the frame N frames earlier, and indicates the movement in the image. N is normally 1, 2, or 4. The Video Level is the average value of the pixel values included in an image frame. For the Video Activity, when a variance is obtained for each small block included in the image, the average of the block variances over the frame may be used. Alternatively, the variance of all the pixels included in the image frame may simply be used.
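A minimal sketch of these three statistics, assuming 8-bit grayscale frames represented as numpy arrays and N = 1, is given below. The aggregation of the per-block Motion differences into a single scalar is one plausible reading of the description above, not a prescribed formula.

```python
import numpy as np

def block_stats(frame, block=8):
    """Per-block mean and variance over 8x8 tiles of a grayscale frame."""
    h = frame.shape[0] // block * block
    w = frame.shape[1] // block * block
    tiles = (frame[:h, :w].astype(float)
             .reshape(h // block, block, w // block, block)
             .transpose(0, 2, 1, 3)
             .reshape(-1, block * block))
    return tiles.mean(axis=1), tiles.var(axis=1)

def video_level(frame):
    """Average pixel value of the image frame."""
    return float(frame.mean())

def video_activity(frame):
    """Average of the per-block variances over the frame."""
    return float(block_stats(frame)[1].mean())

def motion(frame, prev_frame):
    """Per-block change in mean and variance versus the co-located block of
    the frame N frames earlier (N = 1 here), aggregated to one scalar."""
    m0, v0 = block_stats(prev_frame)
    m1, v1 = block_stats(frame)
    return float(np.abs(m1 - m0).mean() + np.abs(v1 - v0).mean())
```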
There are many advantages of using VDNA as the embedded video characteristic. A few of these advantages are listed below.
An audio signal has a higher probability of survival as compared to a video signal because the distortion in the audio is usually much less as compared to the distortion in the video when transmitted over common communication channels. Hence the characteristic embedded into the audio has a higher probability of correct detection. This makes the claimed monitoring method more robust.
In the claimed monitoring method, decoded parameters from the audio are compared to the parameters extracted from the received video frame. This means that there is a two-fold redundancy in the claimed monitoring method. First an algorithm checks for characteristic integrity in the audio, and second, the decoded parameters are compared to those extracted from the received video. This two-fold redundancy increases the probability of synchronization and correct detection of characteristics, as well as lowers the probability of a breach in security and copyright applications.
Use of the claimed monitoring method does not impose any bandwidth increase on the transmitted video/audio signal, because no additional side information is transmitted; the first video characteristic is carried within the existing audio signal portion.
There can be many possible applications of the claimed monitoring method technology; a few of them are described here. For example, the technology can be used to implement security and copyright protection in digital videos (e.g., Digital Rights Management).
Since there are two versions of the same VDNA parameters available at the receiver, the novel monitoring method can also be used to assess video quality. The decoded VDNA from the audio can be compared to the VDNA extracted from the received video to determine possible quality loss. In addition to quality assessment, the novel method can also be used for correction and quality improvement. Examples of quality assessment and correction include detecting chroma differences, level changes, and resolution loss.
The novel method can also be used to detect and correct synchronization loss between audio and video in general, and lip-sync errors in particular. Lip-sync is a very common problem in video transmission today. Audio and video packets undergo different amounts of delay in the network and hence arrive out of synchronization at the receiver. As a result, the picture of a person talking is displayed either before or after the actual voice is heard. This technology can be used to synchronize audio and video and correct such errors. The receiver decodes the audio, compares the recovered first VDNA parameters to the second VDNA parameters extracted from a few video frames, and synchronizes the audio with the video such that the first and second VDNAs match.
In a VDNA-based lip-sync detection/correction system, the VDNA is first determined from the video sequence on a frame-by-frame basis. This first video characteristic is then embedded in the audio stream using STDM (or DM). The audio and video streams are then passed on to the encoder and the encoded bitstream is transmitted. At the receiver, the second VDNA is determined from the video stream after decoding. Also, the first VDNA is extracted from the audio stream. The first and second VDNA parameters are then compared. If the difference between them is greater than a specified threshold amount, then the system determines that a lip-sync error has occurred. Now, the VDNA parameter extracted from the audio stream is compared with the VDNA parameters extracted from some of the past video frames. If there is a match, the decoder synchronizes, using conventional methods, the audio stream with the matched video frame. If there is no match, the decoder waits for future frames and compares the VDNA (from audio) with video VDNA from future frames as they arrive at the decoder. As soon as it finds a match, it synchronizes the audio and the video.
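The matching search at the end of this procedure can be sketched as follows; the search window, the distance measure, and the tolerance are assumed values.

```python
import numpy as np

def find_sync_offset(audio_vdna, video_vdnas, current, window=15, tol=2.0):
    """Compare the VDNA recovered from the audio against the VDNAs of nearby
    video frames; return the frame offset restoring sync, or None if no
    match is found yet (in which case the decoder waits for future frames)."""
    a = np.asarray(audio_vdna, dtype=float)
    for offset in sorted(range(-window, window + 1), key=abs):  # nearest first
        idx = current + offset
        if 0 <= idx < len(video_vdnas):
            v = np.asarray(video_vdnas[idx], dtype=float)
            if np.abs(a - v).max() <= tol:      # VDNA match found
                return offset                   # shift audio by this many frames
    return None
```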
Although certain specific embodiments are described above for instructional purposes, the teachings of this patent document have general applicability and are not limited to the specific embodiments described above. Accordingly, various modifications, adaptations, and combinations of various features of the described embodiments can be practiced without departing from the scope of the invention as set forth in the claims.