This invention relates to signal processing.
Random noise can be a major impairment in video signals. Such noise may degrade video quality and subsequent video coding operations. Potential benefits of noise reduction algorithms include improving visual quality by removing noise from the video. Such benefits also include enabling better coding or compression of video signals, since bits may be used to code the signal itself rather than to code the noise.
In one embodiment, a method of processing a video signal includes obtaining a predicted pixel value according to a motion vector and a location of a first pixel value of the video signal; and calculating a second pixel value based on the first pixel value, the predicted pixel value, and a weighting factor. The method also includes estimating a noise statistic of the video signal according to a known signal content of the video signal. In this method, the weighting factor is based on the estimated noise statistic.
In another embodiment, an apparatus includes a motion compensator configured to produce a predicted pixel value according to a motion vector and a location of a first pixel value of a video signal; a pixel value calculator configured to calculate a second pixel value based on the first pixel value, the predicted pixel value, and a weighting factor; a noise estimator configured to estimate a noise statistic of the video signal according to a known signal content of the video signal; and a weighting factor calculator configured to calculate the weighting factor based on the estimated noise statistic.
In another embodiment, an apparatus includes means for obtaining a predicted pixel value according to a motion vector and a location of a first pixel value of the video signal; means for calculating a second pixel value based on the first pixel value, the predicted pixel value, and a weighting factor; means for estimating a noise statistic of the video signal according to a known signal content of the video signal; and means for calculating the weighting factor based on the estimated noise statistic.
a shows a flowchart of a method M100 according to an embodiment.
b shows a flowchart of an implementation M110 of method M100.
a shows a block diagram of an integrated circuit device 400 according to an embodiment.
b shows a block diagram of a video recorder 500 according to an embodiment.
A noise reduction algorithm may be implemented to use motion information. One technique of noise reduction uses motion-adaptive filtering to average all or part of the current video frame with corresponding portions of one or more other frames. In such a technique, temporal filtering may be suspended for a portion of the current frame which differs by more than a threshold value from a corresponding portion of another frame. Another technique of noise reduction uses motion compensation to average all or part of the current video frame with corresponding portions of one or more predicted frames.
Sources of noise may include radio-frequency (RF) noise, jitter, and picture noise such as film grain. The RF noise typically has a Gaussian distribution. It may be desirable to remove effects of RF noise from a video signal without unduly affecting aesthetic features such as film grain noise.
Methods according to embodiments of the invention use information in addition to motion information for noise reduction. Such additional information may include dynamics of video sequences such as scene change (a transition between levels of video complexity) and a distinction between film and video modes (film is progressive and video is interlaced, which may affect the type of motion search performed). Such methods may also include noise estimation based on, for example, information from vertical blanking interval (VBI) lines in video frames. The vertical blanking interval may include deterministic signals, such as closed captioning (CC) timing data, that can be used to estimate noise power. A noise reduction method may also include preservation of local direct-current (DC) level, as changes of the local DC level may be expected to generate artifacts in the video frame.
g(x,y,t)=(1−α)ĝ(x,y,t)+αf(x,y,t), (1)
where f(x, y, t) denotes pixel intensity of the input frame, with the triplet (x, y, t) denoting spatial-temporal coordinates. The signal g(x, y, t) is the output filtered frame, ĝ(x,y,t) is the motion compensated frame, and the weighting factor α is a constant (e.g. 0≦α≦1.0).
One potential advantage of a recursive scheme (e.g., a first-order infinite-impulse-response filter as described by expression (1) above) as compared to a finite-impulse-response filter is a higher ratio between noise attenuation and storage area required. Embodiments may also be configured for application to higher-order filters.
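The first-order recursive update of expression (1) can be sketched in a few lines. This is a minimal illustration, not the described implementation: the function names and the representation of a frame as a list of rows are assumptions made here, and the motion-compensated frame is taken as given.

```python
def temporal_filter_pixel(f, g_hat, alpha):
    """Expression (1): g = (1 - alpha) * g_hat + alpha * f."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * g_hat + alpha * f


def temporal_filter_frame(frame, predicted, alpha):
    """Apply expression (1) to every pixel of a frame stored as a list of rows."""
    return [
        [temporal_filter_pixel(f, p, alpha) for f, p in zip(frame_row, pred_row)]
        for frame_row, pred_row in zip(frame, predicted)
    ]
```

With α equal to 1.0 the input passes through unchanged, and smaller values of α give more weight to the motion-compensated history.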
In the case of ideal motion compensation, the motion compensated frame ĝ(x,y,t) relates to the previous frame as follows:
ĝ(x,y,t)=g(x−vx,y−vy,t−T) (2)
where (vx,vy) denotes the motion vector associated with each pixel and T is the sampling interval. Higher-order implementations of a temporal filter as described above may be configured to compute the output frame based on motion compensated frames relating to one or more other frames as well.
A prediction error e(x, y, t) may be defined as equal to the difference f(x, y, t)−g(x−vx, y−vy, t−T), which we will assume to be small. Note that this is not always true. For example, the model described above does not take into account occlusion of objects, fades, dissolves and scene changes. Such events may prevent a feature present in one frame from being present at any location in another frame. However, it provides a tractable mathematical model, and further embodiments include methods that take such features into account. Embodiments may also be applied to other models of ĝ(x, y, t) that are based on more than one previous frame and/or on one or more future frames. For example, such a model may be bidirectional in time.
Estimating the motion vectors (vx,vy) can be done using any number of algorithms which are known or may be developed in the field. One example of a full-search block matching algorithm for motion estimation is illustrated in
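One block matching criterion that may be used is the sum of absolute differences (SAD). In one example, SAD(vx,vy) is equal to
$$\mathrm{SAD}(v_x,v_y)=\sum_{(x,y)\in B}\left|f(x,y,t)-g(x-v_x,\,y-v_y,\,t-T)\right|,$$
where B denotes the block of pixels being matched, the current frame is the one to be noise reduced, f(x, y, t), and the reference frame is the previously noise-reduced g(x−vx,y−vy,t−T). Then the motion vector (vx,vy) may be defined as the displacement within the search window that minimizes the criterion:
$$(v_x,v_y)=\arg\min_{(v_x,v_y)}\mathrm{SAD}(v_x,v_y).$$
Another block matching criterion that may be used is the sum of squared error (SSE). In one example, SSE(vx,vy) is equal to
$$\mathrm{SSE}(v_x,v_y)=\sum_{(x,y)\in B}\left[f(x,y,t)-g(x-v_x,\,y-v_y,\,t-T)\right]^2,$$
in which case the motion vector (vx,vy) may be defined as
$$(v_x,v_y)=\arg\min_{(v_x,v_y)}\mathrm{SSE}(v_x,v_y).$$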
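A full-search block matcher using either criterion might look like the following sketch. It assumes integer-pel displacements only, frames stored as lists of rows, and that candidate displacements reaching outside the reference frame are simply skipped. The names `sad`, `sse`, and `full_search` are illustrative and do not come from the description.

```python
def sad(cur, ref, x0, y0, vx, vy, nx, ny):
    """Sum of absolute differences between the current block and the
    reference block displaced by (vx, vy)."""
    return sum(
        abs(cur[y][x] - ref[y - vy][x - vx])
        for y in range(y0, y0 + ny)
        for x in range(x0, x0 + nx)
    )


def sse(cur, ref, x0, y0, vx, vy, nx, ny):
    """Sum of squared errors for the same pair of blocks."""
    return sum(
        (cur[y][x] - ref[y - vy][x - vx]) ** 2
        for y in range(y0, y0 + ny)
        for x in range(x0, x0 + nx)
    )


def full_search(cur, ref, x0, y0, nx, ny, search, cost=sad):
    """Try every (vx, vy) in [-search, search] x [-search, search] and
    return the displacement with the lowest cost, plus that cost."""
    height, width = len(ref), len(ref[0])
    best, best_cost = (0, 0), None
    for vy in range(-search, search + 1):
        for vx in range(-search, search + 1):
            # Skip displacements whose reference block leaves the frame.
            if not (0 <= x0 - vx and x0 - vx + nx <= width):
                continue
            if not (0 <= y0 - vy and y0 - vy + ny <= height):
                continue
            c = cost(cur, ref, x0, y0, vx, vy, nx, ny)
            if best_cost is None or c < best_cost:
                best, best_cost = (vx, vy), c
    return best, best_cost
```

The sign convention follows expression (2): the block at (x, y) in the current frame is compared with the block at (x−vx, y−vy) in the reference frame.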
The reference frame may be interpolated. Potential advantages of interpolation include better accuracy in the motion estimation and consequently a decreased prediction error. Interpolation may be performed according to several methods. The following example describes an interpolation by a factor of 2, enabling half-pel motion estimation and compensation. Another typical example includes interpolation by a factor of 4 (quarter-pel interpolation).
Broadcast video may be generated from video and film sources. In general, film is progressive and video is interlaced. It may be useful to define motion estimation and motion compensation in terms of this video characteristic. For example, a method according to an embodiment may be configured to use frame motion estimation for frames that are classified as progressive (i.e., the value of T is 1) and field motion estimation and compensation for frames that are classified as interlaced (the value of T is assumed to be 2).
In such a scheme, top_field current field motion vectors are predicted from one or more top_field reference fields, and bottom_field current field motion vectors are predicted from one or more bottom_field reference fields. This motion estimation decision may also be guided by an inverse telecine algorithm as discussed below. In one example, the block sizes used are Nx=Ny=8 pixels for progressive prediction and Nx=8, Ny=16 pixels for interlaced prediction (using frame coordinates). Typically a search window with size of −8≦Wx≦7.5 and −8≦Wy≦7.5 is enough to track desirable motion. Embodiments may be configured for application to any other values of Nx, Ny, Wx, and/or Wy.
An analytical model for measuring reduction of noise is also presented. Assuming uniform motion with perfect motion compensation, a filter according to expression (1) may be reduced to a one-dimensional temporal filter as illustrated in the manipulation below:
In order to simplify the above model, let us assume that the video sequence is composed of still frames (i.e., vx=vy=0). Since the resulting transfer function only depends on zt, we can reduce the multidimensional transfer function to a one-dimensional transfer function as follows:
In order to estimate the potential noise reduction gains, we assume that the signal input to the filter f(x, y, t) is purely white noise and is denoted by w(x, y, t) with power spectral density Sw(ω)=σw2 and T=1:
with β defined as (1−α).
Using a familiar relationship of spectral estimation, we evaluate the power spectral density as
where the variance is given as
By applying the following solution for the integral of expression (6):
the estimated reduction of noise σw2 is determined as
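One way to carry out this derivation, starting only from expressions (1) and (2) and the stated assumptions (still frames, white-noise input, T=1, β=1−α), is sketched below. The symbols H, S_g, and σ_g² are notation chosen here for the filter transfer function, the output power spectral density, and the output noise variance; the intermediate steps are a worked reconstruction and may differ in form from the expressions referred to in the text.

```latex
% Substitute (2) into (1) and take the z-transform in x, y and t:
G = (1-\alpha)\, z_x^{-v_x} z_y^{-v_y} z_t^{-T}\, G + \alpha F
\;\Longrightarrow\;
H(z_x,z_y,z_t) = \frac{G}{F}
  = \frac{\alpha}{1-(1-\alpha)\, z_x^{-v_x} z_y^{-v_y} z_t^{-T}}

% Still frames (v_x = v_y = 0), T = 1, \beta = 1-\alpha:
H(z_t) = \frac{1-\beta}{1-\beta z_t^{-1}}

% White-noise input with S_w(\omega) = \sigma_w^2:
S_g(\omega) = \left|H(e^{j\omega})\right|^2 \sigma_w^2
  = \frac{(1-\beta)^2\, \sigma_w^2}{1+\beta^2-2\beta\cos\omega}

% Output variance:
\sigma_g^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_g(\omega)\, d\omega ,
\qquad
\int_{-\pi}^{\pi} \frac{d\omega}{1+\beta^2-2\beta\cos\omega}
  = \frac{2\pi}{1-\beta^2} \quad (|\beta|<1)

% Resulting noise reduction:
\frac{\sigma_g^2}{\sigma_w^2}
  = \frac{(1-\beta)^2}{1-\beta^2}
  = \frac{1-\beta}{1+\beta}
  = \frac{\alpha}{2-\alpha}
```

As a check, α=1 gives no reduction, while α=0.5 reduces the noise variance to one third of its input value under these idealized assumptions.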
For newly exposed parts of the video sequence (e.g., at a spatial border or a previously occluded object), the displacement field (vx, vy) may not be defined. Also, the displacement estimate may not always be accurate, especially where expression (2) (ideal motion compensation) does not hold. As such regions typically have very high prediction error, the value of the weighting factor α may be made dependent on the prediction error e(x, y, t). In one example, the value of α is determined according to the following expression:
For intermediate values of the prediction error, α(e) may vary linearly between αb and αe. In other embodiments, the value of α(e) may vary nonlinearly (e.g. as a sigmoid curve) between αb and αe. Values for the parameter αb, the parameter αe, the prediction error thresholds Tb and Te, and/or the shape of the curve may be selected adaptively according to, for example, dynamics of the signal.
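The linear variant can be sketched as a piecewise function. The sketch assumes that α takes the value αb when the magnitude of the prediction error is at or below Tb, takes the value αe at or above Te, and is interpolated linearly in between. The handling of the two end regions is an assumption made here, and the function name is illustrative.

```python
def alpha_of_error(e, alpha_b, alpha_e, t_b, t_e):
    """Piecewise-linear weighting factor as a function of prediction error.

    Assumes t_b < t_e. Small errors give alpha_b (strong filtering);
    large errors give alpha_e (little or no filtering).
    """
    e = abs(e)
    if e <= t_b:
        return alpha_b
    if e >= t_e:
        return alpha_e
    return alpha_b + (alpha_e - alpha_b) * (e - t_b) / (t_e - t_b)
```

For example, with αb=0.25, αe=1.0, Tb=4, and Te=12, an error of 8 lies halfway between the thresholds and yields α=0.625.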
Features as described below may be applied to adapt the extent of filtering to such characteristics as the accuracy of motion estimation, local video characteristics, and/or global video characteristics. Potential advantages of such methods include enabling better quality output video by reducing smearing artifacts.
A scene change may cause a decrease in the visual quality of the first noise-reduced frame of a new scene, since the prediction error is likely to be high. Also, due to the nature of the recursive filtering, artifacts may be propagated into successive frames. Although an adaptive value of α(e) may limit this effect in areas with a high prediction error, some areas of the frame may, at random, have a low prediction error. Thus it may be desirable to disable noise reduction for the first frame of a new scene instead of limiting it on a local (e.g., pixel-by-pixel) basis.
A scene change detection mechanism may be based on a field difference measure calculation such as the following:
where f(x, y, t) denotes pixel intensity with the triplet (x, y, t) denoting spatial-temporal coordinates. In expression (10), w denotes picture width and h picture height. In this example, the calculation is done by measuring differences between field segments that have the same video parity (i.e., top and bottom field slice differences are evaluated). Such an operation is illustrated in
After calculation of M(k), a comparison with a threshold may be performed, with the number of slices having differences greater than (alternatively, not less than) the threshold being counted. In one example, if the number of slices which exceed the threshold corresponds to a picture area that covers more than some portion (e.g. 50%) of the picture, a scene change flag S is set:
where TM denotes a threshold. In one example, TM has a value of 32 times the number of pixels in a strip (e.g., 368,640 for a strip that is 720 pixels wide and 16 pixels high). Such a result may be used to qualify or otherwise modify the value of α(e), as in the following example:
α(e,S)=(1−S)α(e)+S. (12)
A potential advantage of an implementation including scene change detection is a reduction in wasted processing cycles for frames in which a scene change is detected.
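The slice-based scene change test might be sketched as follows. The sketch assumes that the measure M(k) is an unnormalized sum of absolute pixel differences over slice k of two same-parity fields, which is consistent with a threshold TM that scales with the number of pixels in a strip, and that all slices have the same size. The function names are illustrative.

```python
def slice_differences(cur_field, prev_field, slice_height):
    """Assumed form of M(k): for each horizontal slice k, the sum of absolute
    differences between two same-parity fields stored as lists of rows."""
    diffs = []
    for top in range(0, len(cur_field), slice_height):
        bottom = min(top + slice_height, len(cur_field))
        total = 0
        for y in range(top, bottom):
            total += sum(abs(a - b) for a, b in zip(cur_field[y], prev_field[y]))
        diffs.append(total)
    return diffs


def scene_change_flag(diffs, threshold, area_fraction=0.5):
    """Return S = 1 when the slices whose difference exceeds the threshold
    cover more than area_fraction of the picture, else S = 0."""
    exceeding = sum(1 for m in diffs if m > threshold)
    return 1 if exceeding > area_fraction * len(diffs) else 0
```

For a strip that is 720 pixels wide and 16 pixels high, the example threshold from the text would be passed as `32 * 720 * 16`.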
A video frame may consist of two fields. In general, film is generated at 24 frames per second. Film is typically converted to video (e.g., to 480i/29.97 Hz format) using a technique called 2-3 pulldown (or telecine), in which certain fields are repeated in a sequence of four film frames to produce five video frames. The inverse process is called 3-2 pulldown or inverse telecine. It may be desirable for a noise reducer to identify this temporal redundancy in order to discriminate between frames from an interlaced source, such as video, and frames from a progressive source, such as film. If the source is interlaced (e.g., each field in a frame corresponds to a different time interval), it may be desirable to use field motion estimation and compensation. If the source is progressive (e.g., each field in a frame corresponds to the same time interval), it may be desirable to use frame motion estimation and compensation.
In a method that includes inverse telecine processing, if the current frame is determined to be from a film source, it is identified as a progressive frame, and a frame motion procedure is performed. Otherwise, field motion is performed. The field repetition discrimination operation may be complex, since it can be very difficult to distinguish true repeat fields due to noise.
A 3-2 pulldown process may be applied to noise-reduce a five-frame sequence without having to process the repeated fields.
In order to detect a true 3-2 pulldown cadence with reliability, it may be desirable to first collect field differences as in a scene change detector, but with the difference that in this case we are looking for field equalities. In one example, a slice is considered a match if the difference is less than (alternatively, not greater than) a threshold value which is used to allow for noise. A field may then be considered a match if some proportion (e.g. more than 80%) of its slices match:
where repeat_first_field is a flag that denotes field equalities and TP is a threshold to allow for noise. In one example, TP has a value that is ¾ times the number of pixels in a strip (e.g., 4320 for a strip that is 720 pixels wide and 8 pixels high).
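A minimal sketch of the repeated-field test follows. It takes per-slice difference values as input and assumes the strict comparisons (difference less than TP, more than 80% of slices matching). The helper `strip_threshold` and its default ratio of 3/4 simply reproduce the arithmetic of the example above, and both function names are illustrative.

```python
def strip_threshold(width, height, num=3, den=4):
    """Threshold equal to num/den times the number of pixels in a strip."""
    return width * height * num // den


def repeat_field_flag(slice_diffs, t_p, match_fraction=0.8):
    """Return 1 when more than match_fraction of the slices have a
    difference below t_p, i.e. the field is treated as a repeat."""
    matches = sum(1 for m in slice_diffs if m < t_p)
    return 1 if matches > match_fraction * len(slice_diffs) else 0
```

A strip of 720 by 8 pixels gives `strip_threshold(720, 8)` equal to 4320, matching the example value.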
These field equalities may be recorded in a set R. In one example, this set can hold the field equality flags for 5 frames. As a new frame is processed, two new binary values are added to the set, one for the top field and one for the bottom field. A binary value of ‘1’ indicates equality of fields and a binary value of ‘0’ indicates inequality.
If the set R is full, it is shifted to remove the oldest entries to make room for the new data. When the set contains data for 5 frames, the cadence detection can proceed. In one example, set R is compared with a list of valid patterns as shown below:
If R(t) is equivalent to set ST(1), then for the new frame captured at the instant t+1, it should hold R(t+1)≡ST(2), R(t+2)≡ST(3), . . . , R(t+4)≡ST(0). Likewise, if R(t) is equivalent to set SB(1), then for the new frame captured at the instant t+1, it should hold R(t+1)≡SB(2), R(t+2)≡SB(3), . . . , R(t+4)≡SB(0).
When the cadence is broken at t′ (which may occur due to scene changes, insertion of commercials, or video editing, for example), the set R is set to:
R(t′)←(0,0,x,x,x,x,x,x,x,x)
where x denotes a binary value that is not available.
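The bookkeeping for the set R can be sketched with a bounded queue. This sketch covers only the shifting and the reset; the comparison against the list of valid cadence patterns is not shown. It assumes that unavailable entries are represented by their absence from the queue, and that after a broken cadence the two zero flags are the only entries retained. The function names are illustrative.

```python
from collections import deque

FRAMES_IN_SET = 5  # the set holds two flags (top, bottom) per frame


def new_history():
    """Empty set R; the oldest entries fall out once ten flags are held."""
    return deque(maxlen=2 * FRAMES_IN_SET)


def push_frame(history, top_flag, bottom_flag):
    """Add the field-equality flags of a newly processed frame."""
    history.append(top_flag)
    history.append(bottom_flag)


def is_full(history):
    """Cadence detection can proceed only when five frames are recorded."""
    return len(history) == history.maxlen


def reset_on_broken_cadence(history):
    """R <- (0, 0, x, ..., x): keep two zero flags, drop everything else."""
    history.clear()
    history.extend((0, 0))
```

After a reset, four further frames must be pushed before `is_full` returns true again.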
In analog video, there are two types of blanking intervals. The horizontal blanking interval occurs between scan lines, and the vertical blanking interval (VBI) occurs between frames (or fields). The horizontal blanking interval is present in every video line in general and carries horizontal synchronization information. The vertical blanking interval carries vertical synchronization and other types of information, such as a closed captioning (CC) signal. A VBI line of an analog video signal carries deterministic signals that can be used for a noise estimation process.
It may be desired in this waveform to estimate noise variance at segment D in
If we denote by Ns the number of samples available in segment D, then a sampled average may be defined as:
with variance defined as
This calculation may be performed for top (fT) and bottom (fB) fields.
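The sampled average and variance over segment D can be sketched as below. The sketch divides by Ns in the variance, which is an assumption made here; an unbiased estimate would divide by Ns−1 instead. The function names are illustrative, and the same functions would be called once on the samples from the top field and once on those from the bottom field.

```python
def segment_mean(samples):
    """Sampled average over the Ns samples of the flat segment."""
    return sum(samples) / len(samples)


def segment_variance(samples):
    """Sampled variance about the average (normalized by Ns, by assumption)."""
    mean = segment_mean(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)
```

Because the segment is nominally constant, any variance measured this way can be attributed to noise rather than to signal content.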
A method of noise reduction may be implemented such that the initial value of αb from expression (9) is dependent on information obtained during the VBI. In one example, the noise variance σf2 is defined as
and the value of αb is set as follows:
with αe equal to unity.
A method for locating segment D is described. Such a method may include slicing the VBI line to identify locations of the cycles of the sine wave in segment B, the heights of their peaks and troughs, and their period. A section of samples is examined that can be expected to fall within the 7 cycles of the sine wave, based on the number of samples in the entire line. It may be desirable to examine enough samples to include more than one (e.g. about two) complete cycles of the sine wave.
The entire array of samples from a VBI line may be divided into 26 sections of equal width. In order, they are: 7 sine wave cycles, 3 start bits (two blanking level bits and one high bit), 7 data bits, 1 parity bit, 7 data bits, and 1 parity bit (assuming that the color burst and earlier portions of the line are not included in the array of samples). Since the sine wave is the first of these sections in the array of samples, it occupies roughly the first 27% of the samples.
One method of slicing (or parsing) the VBI line includes selecting peak and trough values of the sine wave. The following pseudo-code shows one example of calculations based on two indices located roughly in the center of the sine wave and separated by approximately two cycles of the sine wave. In this method, the minimum and maximum sample values between those two indices are stored as indicators of the peak and trough values of the sine wave:
Although there may be a small number of samples prior to the start of the first cycle of the sine wave and a small number of samples after the final parity bit, these calculations are typically conservative enough to ensure that all the samples from beginning to end are within the sine wave.
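The peak and trough selection might be sketched as follows. The sketch assumes that the sample array begins at the first sine-wave cycle and spans 26 equal sections, so that one cycle is about 1/26 of the array, and it places the two indices one cycle on either side of the middle of the seven-cycle burst. The constant and function names are illustrative.

```python
SECTIONS_PER_LINE = 26  # 7 sine cycles + 3 start bits + 2 * (7 data + 1 parity)
SINE_CYCLES = 7


def sine_peak_and_trough(samples):
    """Return (peak, trough) from a window of about two cycles taken
    near the center of the sine-wave burst."""
    cycle = len(samples) / SECTIONS_PER_LINE   # approximate samples per cycle
    center = cycle * SINE_CYCLES / 2           # middle of the seven cycles
    start = int(center - cycle)                # first index of the window
    stop = int(center + cycle)                 # last index of the window
    window = samples[start:stop + 1]
    return max(window), min(window)
```

The window stays well inside the burst, so a few stray samples before the first cycle do not affect the result.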
Based on the range of values observed in this subset of the sine wave, threshold values may be calculated for determining when the wave is transitioning from one part of its cycle to another. Such threshold values are used during an iteration through the sample array from the beginning, locating each cycle of the sine wave. This iteration may include cycling sequentially through four states: ENTERING_PEAK, IN_PEAK, ENTERING_TROUGH, and IN_TROUGH.
period=((wave[6]-wave[0])<<INDEX_FRAC_BITS)/6;
where typical values for INDEX_FRAC_BITS include 8, 10, and 12.
The index of the sample at the center of any start, data, or parity bit may be calculated by multiplying the sine-wave period by the appropriate amount and adding the result to the index that indicates the location of the seventh sine-wave cycle.
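The fixed-point period and the bit-center calculation can be sketched as follows. Several details are assumptions made for illustration: `wave[i]` is taken to hold the sample index at which sine-wave cycle i starts, each bit is taken to be exactly one sine period wide, and bit number 0 is the first start bit. Under those assumptions the center of bit k lies 1 + k + 0.5 periods after the start of the seventh cycle.

```python
INDEX_FRAC_BITS = 8  # fractional bits of the fixed-point period


def sine_period_fixed(wave):
    """Average period over six full cycles, in fixed point."""
    return ((wave[6] - wave[0]) << INDEX_FRAC_BITS) // 6


def bit_center_index(wave, bit_number):
    """Sample index at the center of a start, data, or parity bit."""
    period = sine_period_fixed(wave)
    # (1 + bit_number + 0.5) periods, kept in integer arithmetic.
    offset = period * (2 * bit_number + 3) // 2
    return wave[6] + (offset >> INDEX_FRAC_BITS)
```

Keeping the period in fixed point avoids accumulating rounding error when the offset spans many bit periods.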
Although it is not necessary to use all the start, data and parity bits in such a noise measurement, methods may include verifying that all the start bits have the expected values and/or that the parity bits correspond correctly to the data bits. If there are any errors in the start bits or parity bits, the very presence of these errors can be used as indicators of a high noise level. Similarly, if an error occurs in the code that attempts to locate the cycles of the sine wave, the occurrence of such error may be used to indicate a high noise level.
It is noted that the principles described above may also be applied to obtain one or more noise statistics (e.g., variance) from other deterministic regions of a video signal. For example, such principles may be applied to determine noise statistics from portions of a vertical blanking interval (also called “field blanking interval”) of an NTSC, PAL, or SECAM (Séquentiel couleur à mémoire) video signal that is configured according to a format such as teletext (e.g., as described in reference specification ETS 300 706), VPS signaling, or Wide Screen Signaling (e.g., as described in specification ETSI 300 294 and/or SMPTE RP 186).
It may be desirable for a noise reducer to take into account variations in the local DC (direct current) level. For example, it may be desirable to minimize local DC changes introduced by the noise reducer, as such changes may be noticeable in the resulting image, especially on large screens and especially in regions of flat color. The error between the current sample to be filtered f(x, y, t) and motion compensated sample g(x−vx, y−vy, t−T) as set forth in expression (9) may not be enough to guarantee local DC preservation. Further embodiments include methods, systems, and apparatus in which the thresholds Tb and Te are functions of the local DC level.
In one example, the local DC level is quantified as follows:
DC(x,y,t)=[f(x−1,y,t)+f(x,y,t)+f(x+1,y,t)]/3.
Different fixed and/or adaptive neighborhoods may also be used. For example, a one-dimensional neighborhood of five pixels (e.g. centered at the current pixel) may be used, although it has been observed that the comparative advantage may decrease for neighborhoods larger than five pixels. The deviation between the DC and the current noisy sample is then set forth as
AC(x,y,t)=|f(x,y,t)−DC(x,y,t)|.
The AC value may be used to control the values of Tb and Te. In one example, values for Tb and Te are determined as follows:
Tb(AC)=max(AC,Kb) and Te(AC,Tb)=min(δ×AC+Tb(AC),Ke);
where Ke>Kb are positive constants used to bound Tb and Te, and δ is a constant that controls the error swing of e=|f(x,y,t)−g(x−vx,y−vy,t−T)|. According to this example (and setting αe equal to 1.0), expression (9) may be rewritten as
Such an expression may be further modified (e.g., according to expression (12) above) to take scene changes into account:
α(e,S,σf2)=(1−S)α(e,σf2)+S.
A system including these features may thus be modeled as
g(x,y,t)=(1−α(e,S,σf2))ĝ(x,y,t)+α(e,S,σf2)f(x,y,t).
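Putting the local DC measure, the adaptive thresholds, and the recursive update together, a per-pixel sketch might look like this. It assumes αe equal to 1.0, a three-pixel horizontal neighborhood, a linear ramp between Tb and Te, and a value of αb supplied by the caller (for example, from a noise estimate). A scene-change flag of 1 forces α to 1 so that the input pixel passes through unfiltered. The function and parameter names are illustrative.

```python
def local_dc(row, x):
    """Mean of the pixel and its two horizontal neighbors."""
    return (row[x - 1] + row[x] + row[x + 1]) / 3.0


def thresholds(ac, k_b, k_e, delta):
    """Tb = max(AC, Kb) and Te = min(delta * AC + Tb, Ke)."""
    t_b = max(ac, k_b)
    t_e = min(delta * ac + t_b, k_e)
    return t_b, t_e


def filter_pixel(row, x, predicted, alpha_b, k_b, k_e, delta, scene_change=0):
    """Filter one pixel of the current row against its predicted value."""
    f = row[x]
    ac = abs(f - local_dc(row, x))
    t_b, t_e = thresholds(ac, k_b, k_e, delta)
    e = abs(f - predicted)
    if e <= t_b:
        alpha = alpha_b
    elif e >= t_e:
        alpha = 1.0
    else:
        # Only reached when t_b < e < t_e, so the divisor is positive.
        alpha = alpha_b + (1.0 - alpha_b) * (e - t_b) / (t_e - t_b)
    alpha = (1 - scene_change) * alpha + scene_change
    return (1.0 - alpha) * predicted + alpha * f
```

In flat regions AC is zero, so Tb and Te both collapse to Kb and the weighting switches directly from αb to 1.0 once the prediction error exceeds Kb.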
As disclosed herein, additional information may be obtained from a video signal and used to guide a motion-adaptive or motion-compensated temporal filtering process. For example, one or more detection mechanisms as described herein (e.g. scene change detection, inverse telecine) may be included and may make such noise reduction more robust. Alternatively or additionally, local DC levels and/or information obtained during other portions of the signal (e.g. one or more horizontal and/or vertical blanking intervals) may be used in a noise reduction operation.
A noise reducer system or apparatus may be implemented to include features as described herein. For example, such a system or apparatus may include mechanisms for scene detection, inverse telecine, and VBI noise level estimation.
a shows a flowchart of a method M100 according to an embodiment. Task T110 estimates a noise statistic of a video signal that includes a pixel value to be processed. For example, task T110 may be configured to calculate the noise statistic according to a deterministic portion of the video signal, such as a portion that occurs during the vertical blanking interval. Task T120 obtains a predicted pixel value according to a motion vector and a location of the pixel value to be processed. In some cases, task T120 may be configured to obtain the predicted pixel value according to other information as well, such as one or more additional motion vectors. Task T130 calculates a filtered pixel value based on the pixel value, the predicted pixel value, and a weighting factor based on the estimated noise statistic. The weighting factor may also depend on the distance between the pixel value and the predicted pixel value.
b shows a flowchart of an implementation M110 of method M100. In this example, tasks T120 and T130 are repeated for some or all of the pixels in a video image. Task T140 encodes the resulting filtered image according to, for example, a DCT-based scheme such as MPEG-1 or MPEG-2.
In one example, processing of a video signal includes the following operations:
1) the digital video signal is buffered for four frames with the respective VBI information;
2) noise estimation is performed for each of these frames, and an individual αb is calculated for each frame;
3) scene change detection is performed once per frame, with detected scene changes disabling noise reduction;
4) inverse telecine is also performed once per frame, enabling a field or frame decision for motion estimation;
5) motion estimation and compensation is performed;
6) motion compensated temporal filtering is then performed, with appropriate Tb and Te evaluated at pixel level; and
7) the filtered frame is then stored in the frame buffer to be used as the reference in the next iteration.
The foregoing presentation of the described embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments are possible, and the generic principles presented herein may be applied to other embodiments as well. For example, the invention may be implemented in part or in whole as a hard-wired circuit, as a circuit configuration fabricated into an application-specific integrated circuit, or as a firmware program loaded into non-volatile storage or a software program loaded from or into a data storage medium as machine-readable code, such code being instructions executable by an array of logic elements such as a microprocessor or other digital signal processing unit. Thus, the present invention is not intended to be limited to the embodiments shown above but rather is to be accorded the widest scope consistent with the principles and novel features disclosed in any fashion herein.
One possible effect of a noise reduction as described herein is to provide a signal that may be encoded at a low bit rate and/or high compression ratio, while reducing noise artifacts and/or preserving high-frequency picture information (e.g. edges) in the signal.
A video signal processed by a method, system, or apparatus for noise reduction as disclosed herein may be provided as input to a video encoder. Examples of an encoding scheme as may be applied by such an encoder, which may be a constant bit rate (CBR) or variable bit rate (VBR) scheme, include DCT-based schemes such as MPEG-1 or MPEG-2. It may be desired to support real-time encoding. In some implementations, a noise reducer according to an embodiment resides on the same chip, or in the same chipset, as such an encoder.
This application claims benefit of U.S. Provisional Pat. Appl. No. 60/669,878, entitled “SYSTEMS, METHODS, AND APPARATUS FOR NOISE REDUCTION,” filed Apr. 11, 2005.
Number | Date | Country | |
---|---|---|---|
20070019114 A1 | Jan 2007 | US |
Number | Date | Country | |
---|---|---|---|
60669878 | Apr 2005 | US |