1. Field of the Invention
Embodiments of the present invention generally relate to a method and apparatus for dynamic illumination compensation for background subtraction.
2. Description of the Related Art
Detecting changes in video taken by a video capture device with a stationary field-of-view, e.g., a fixed mounted video camera with no pan, tilt, or zoom, has many applications. For example, in the computer vision and image understanding domain, background subtraction is a change detection method that is used to identify pixel locations in an observed image where pixel values differ from co-located values in a reference or “background” image. Identifying groups of different pixels can help segment objects that move or change their appearance relative to an otherwise stationary background.
Embodiments of the present invention relate to a method, apparatus, and computer readable medium for background subtraction with dynamic illumination compensation. Embodiments of the background subtraction provide for receiving a frame of a video sequence, computing a gain compensation factor for a tile in the frame as an average of differences between background pixels in the tile and corresponding pixels in a background model, computing a first difference between a pixel in the tile and a sum of a corresponding pixel in the background model and the gain compensation factor, and setting a location in a foreground mask corresponding to the pixel based on the first difference.
Particular embodiments in accordance with the invention will now be described, by way of example only, and with reference to the accompanying drawings:
Specific embodiments of the invention will now be described in detail with reference to the accompanying figures. Like elements in the various figures are denoted by like reference numerals for consistency.
Background subtraction works by first establishing a model or representation of the stationary field-of-view of a camera. Many approaches can be used to define the background model. For example, a naïve technique defines a single frame in a sequence of video frames S as the background model Bt such that
Bt(x,y)=It(x,y),
where S={I0, I1, I2, . . . , It, It+1, . . . } and It and Bt are both N×M arrays of pixel values such that 1≤x≤M and 1≤y≤N. In some instances, the first frame in the sequence is used as the background model, e.g., Bt(x,y)=I0(x,y).
A more sophisticated technique defines a Gaussian distribution to characterize the luma value of each pixel in the model over subsequent frames. For example, the background model Bt can be defined as a pixel-wise, exponentially-weighted running mean of frames, i.e.,
Bt(x,y)=(1−α(t))·It(x,y)+α(t)·Bt−1(x,y), (1)
where α(t) is a function that describes the adaptation rate. In practice, the adaptation rate α(t) is a constant between zero and one. When Bt(x,y) is defined by Eq. 1, the pixel-wise, exponentially-weighted running variance Vt(x,y) is also calculated such that
Vt(x,y)=|(1−α(t))·Vt−1(x,y)+α(t)·Δt(x,y)²|. (2)
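For illustration, the following is a minimal sketch, in Python with NumPy, of the running mean and variance updates of Eqs. 1 and 2; the constant adaptation rate, its default value, and the choice to compute Δt against the previous model before updating are assumptions made for the example, not requirements of the embodiments.

import numpy as np

def update_background(frame, bg_mean, bg_var, alpha=0.95):
    # Exponentially-weighted running mean (Eq. 1) and variance (Eq. 2).
    # frame, bg_mean, bg_var are N x M float arrays; alpha is the
    # (assumed constant) adaptation rate between zero and one.
    delta = frame - bg_mean                                        # Eq. 3, taken against the previous model (assumed)
    new_mean = (1.0 - alpha) * frame + alpha * bg_mean             # Eq. 1, weights as written in the text
    new_var = np.abs((1.0 - alpha) * bg_var + alpha * delta ** 2)  # Eq. 2
    return new_mean, new_var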
In any case, once the background model has been determined, detecting changes between the current frame It and the background Bt is generally a simple pixel-wise arithmetic subtraction, i.e.,
Δt(x,y)=It(x,y)−Bt(x,y). (3)
A pixel-wise threshold Tt(x,y) is often applied to Δt(x,y) to help determine whether the difference in pixel values at a given location (x,y) is large enough to be attributed to a meaningful “change” rather than a negligible artifact of sensor noise. If the pixel-wise mean and variance are established for the background model Bt, the threshold Tt(x,y) is commonly set as a multiple of the standard deviation, e.g., Tt(x,y)=λ√Vt(x,y), where λ is the standard deviation factor.
A two-dimensional binary map Ht for the current frame It is defined as
Ht(x,y)={1 if |Δt(x,y)|>Tt(x,y); otherwise 0} ∀ 1≤x≤M and 1≤y≤N (4)
The operation defined by Eq. 4 is generally known as “background subtraction” and can be used to identify locations in the image where pixel values have changed meaningfully from recent values. These locations are expected to coincide with the appearance of changes, perhaps caused by foreground objects. Pixel locations where no significant change is measured are assumed to belong to the background. That is, the result of the background subtraction, i.e., a foreground mask Ht, is commonly used to classify pixels as foreground pixels or background pixels. For example, Ht(x,y)=1 for foreground pixels versus Ht (x,y)=0 for those associated with the background. In practice, this map is processed by grouping or clustering algorithms, e.g., connected components labeling, to construct higher-level representations, which in turn, feed object classifiers, trackers, dynamic models, etc.
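A minimal sketch of the subtraction and thresholding of Eqs. 3 and 4, assuming the running mean and variance model above and a standard deviation factor λ (here lam); the default value of lam is an arbitrary assumption.

import numpy as np

def foreground_mask(frame, bg_mean, bg_var, lam=2.5):
    # Binary map H_t of Eq. 4: 1 where |frame - model| exceeds the
    # pixel-wise threshold T_t = lam * sqrt(V_t), otherwise 0.
    delta = frame - bg_mean                                # Eq. 3
    threshold = lam * np.sqrt(bg_var)                      # T_t(x,y)
    return (np.abs(delta) > threshold).astype(np.uint8)    # Eq. 4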
There are many factors, or combinations of factors, that can produce transient illumination changes that disrupt background subtraction, including camera automatic gain control and brightly colored objects entering the field of view. In response to dynamic illumination conditions in the overall image, many cameras equipped with gain control apply an additive gain distribution Gt(x,y) to the pixels in the current frame It(x,y) to produce an adjusted frame Ît(x,y) that may be more subjectively appealing to humans. However, this gain is generally unknown to the background subtraction algorithm, which can lead to errors in segmentation. This behavior represents a common issue in real-time vision systems.
Embodiments of the invention provide for background subtraction that compensates for dynamic changes in illumination in a scene. Since each pixel in an image is potentially affected differently during brief episodes of illumination change, the pixels in the current image may be represented as Ît(x,y) such that
Ît(x,y)=It(x,y)+Gt(x,y), (5)
where Gt(x,y) is an additive transient term that is generally negligible outside the illumination episode interval. An additive gain compensation term Ct(x,y) is introduced to the background model that attempts to offset the contribution from the unknown gain term Gt(x,y) that is added to the current frame It(x,y), i.e.,
Ît(x,y)−(Bt(x,y)+Ct(x,y))≈It(x,y)−Bt(x,y). (6)
More specifically, Ct(x,y) is estimated such that Ct(x,y)≈−Gt(x,y).
To estimate the gain compensation term Ct(x,y), the two dimensional (2D) (x,y) locations in a frame where the likelihood of segmentation errors is low are initially established. This helps to identify pixel locations that have both a low likelihood of containing foreground objects and a high likelihood of belonging to the “background”, i.e., of being stable background pixels.
A 2D binary motion history mask Ft is used to assess these likelihoods. More specifically, for each image or frame, the inter-frame difference, which subtracts one time-adjacent frame from another, i.e., It(x,y)−It−1(x,y), provides a measure of change between frames that is independent of the background model. The binary motion history mask Ft is defined by
Ft(x,y)={1 if (Mt(x,y)>0); otherwise 0}, ∀ x,y (7)
where Mt is a motion history image representative of pixel change over q frames, i.e.,
Mt(x,y)={q if (Dt(x,y)=1); otherwise max[0, Mt(x,y)−1]} (8)
where q is the motion history decay constant and Dt is the binary inter-frame pixel-wise difference at time t, i.e.,
Dt(x,y)={1 if |It(x,y)−It−1(x,y)|>τt(x,y); otherwise 0} ∀ 1≤x≤M and 1≤y≤N. (9)
Note that Tt(x,y) and τt(x,y) are not necessarily the same. For simplicity, τt(x,y) is assumed to be an empirically determined constant.
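The motion history computation of Eqs. 7-9 might be realized as in the following sketch; the in-place decay of Mt and the scalar value of τt are assumptions consistent with the text, and the default values of q and tau are arbitrary.

import numpy as np

def motion_history_mask(frame, prev_frame, mhi, q=30, tau=10.0):
    # Update the motion history image M_t (Eq. 8) from the binary
    # inter-frame difference D_t (Eq. 9) and return the binary motion
    # history mask F_t (Eq. 7) together with the updated M_t.
    d = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32)) > tau  # Eq. 9
    mhi = np.where(d, q, np.maximum(0, mhi - 1))                                # Eq. 8
    f = (mhi > 0).astype(np.uint8)                                              # Eq. 7
    return f, mhi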
To estimate the gain distribution Gt(x,y) in frame t, background pixel values in the current frame It(x,y) are monitored to detect changes beyond a threshold β. Although Dt(x,y)=0 indicates no pixel change at (x,y) over the interval between time t and t−1, the inter-frame difference result Dt over a single interval may not provide adequate segmentation for moving objects. For example, the inter-frame difference tends to indicate change most prominently along the leading and trailing edges of moving objects, especially if the objects are homogeneous in appearance. The binary motion history mask Ft is essentially an aggregate of Dt over the past q intervals, providing better evidence of pixel change over that window. A pixel location (x,y) is classified as a background pixel whenever Ft(x,y)=0. As is described in more detail herein, the pixel locations involved in the calculation of the gain compensation term Ct(x,y) are also established by the binary motion history mask Ft.
Applying a single gain compensation term for the entire frame, i.e., Ct(x,y)=constant ∀ x, y, may poorly characterize the additive gain distribution Gt(x,y), especially if that distribution is a non-linear 2D function. To minimize the error between Ct(x,y) and Gt(x,y), Ct(x,y) is instead estimated as a constant c in a 2D piece-wise fashion. For example, estimating and applying Ct(x,y) as a constant over a subset or tile Φ of the image, e.g., 1≤x≤M/4 and 1≤y≤N/4, reduces segmentation errors more than allowing x and y to span the entire N×M image. The constant c for a tile is estimated by averaging the difference between the background model Bt(x,y) and the image Ît(x,y) at 2D (x,y) pixel locations determined by Ft(x,y), i.e.,
Ct(x,y)≈c=1/n·Σ(1−Ft(x,y))·[Ît(x,y)−Bt(x,y)] ∀ x, y ∈ Φ, (10)
where n is the number of pixels that likely belong to the background, or
n=Σ(1−Ft(x,y)). (11)
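A sketch of Eqs. 10 and 11 for a single tile follows; the tile arguments are assumed to be NumPy sub-arrays of the frame, the background model, and the motion history mask, and the guard against n = 0 is an assumption the text does not address.

import numpy as np

def tile_gain_factor(frame_tile, bg_tile, f_tile):
    # Mean illumination change c for one tile Φ (Eqs. 10-11).
    # frame_tile: tile of the (gain-affected) current frame I-hat_t
    # bg_tile:    co-located tile of the background model B_t
    # f_tile:     co-located tile of the binary motion history mask F_t
    bg_pixels = (f_tile == 0)                  # stable background locations
    n = int(np.count_nonzero(bg_pixels))       # Eq. 11
    if n == 0:                                 # assumed fallback: no compensation
        return 0.0
    return float(np.sum((frame_tile - bg_tile)[bg_pixels]) / n)   # Eq. 10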
Note that the constant c is not necessarily the same for all subsets or tiles. The constant c may also be referred to as the mean illumination change or the gain compensation factor. By re-calculating background subtraction compensated by c, i.e.,
Δt,2(x,y)=Ît(x,y)−(Bt(x,y)+c) (12)
and comparing this difference to the original, uncompensated background subtraction, i.e.,
Δt,1(x,y)=Ît(x,y)−Bt(x,y), (13)
segmentation errors that can cause subsequent processing stages to fail can generally be reduced by selecting the result producing the smaller change. That is, the final binary foreground mask is defined as
Ĥt(x,y)={1 if (min[Δt,1(x,y), Δt,2(x,y)]>Tt(x,y)); otherwise 0} ∀ x, y ∈ Φ. (14)
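A minimal per-tile sketch of Eqs. 12-14; taking the element-wise minimum over the signed differences follows Eq. 14 as written, and the threshold tile is assumed to come from the variance-based threshold discussed earlier.

import numpy as np

def compensated_tile_mask(frame_tile, bg_tile, c, threshold_tile):
    # Final binary mask for one tile per Eq. 14.
    d1 = frame_tile - bg_tile             # Eq. 13, uncompensated
    d2 = frame_tile - (bg_tile + c)       # Eq. 12, gain compensated
    return (np.minimum(d1, d2) > threshold_tile).astype(np.uint8)   # Eq. 14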
Embodiments of the gain compensated background subtraction techniques have been shown to result in the same or fewer segmentation errors as compared to uncompensated background subtraction. Further, the compensation is applied to selected areas of an image, e.g., block-based tiles, making the illumination compensated background subtraction amenable to SIMD implementations and software pipelining. In addition, the illumination compensated background subtraction can be applied iteratively, which tends to improve performance.
The luma extraction component 402 receives frames of image data and generates corresponding luma images for use by the other components. The background subtraction component 404 performs gain compensated background subtraction as described herein, e.g., as per Eqs. 7-14 above or the method described below.
The morphological operations component 406 performs morphological operations such as dilation and erosion to refine the foreground mask, e.g., to remove isolated pixels and small regions. The event detection component 408 analyzes the foreground masks to identify and track objects as they enter and leave the scene in the video sequence, to detect events meeting specified criteria, e.g., a person entering and leaving the scene, and to send alerts when such events occur. As part of sending an alert, the event detection component 408 may provide object metadata such as width, height, velocity, color, etc. The event detection component 408 may classify objects as legitimate based on criteria such as size, speed, appearance, etc. The analysis performed by the event detection component 408 may include, but is not limited to, region of interest masking to ignore pixels in the foreground masks that are not in a specified region of interest. The analysis may also include connected components labeling and other pixel grouping methods to represent objects in the scene. It is common practice to further examine the features of these high-level objects for the purpose of extracting patterns or signatures that are consistent with the detection of behaviors or events.
As shown in the accompanying figure, uncompensated background subtraction Δt,1(x,y) 500 is first performed on a tile It(x,y) of the current frame against the background model, and the binary inter-frame pixel-wise difference Dt(x,y) is computed between the current and previous frames.
A motion history image Mt(x,y), representative of the change in pixel values over some number of frames q, is then updated based on the inter-frame motion mask Dt(x,y) 506. The value of q, which may be referred to as the motion history decay constant, may be predetermined based on simulation and/or may be user-specified to correlate with the anticipated speed of typical objects in the scene.
The motion history image Mt(x,y) is then binarized to generate a binary motion history mask Ft(x,y) 508. That is, an (x,y) location in the binary motion history mask Ft(x,y) corresponding to a pixel in the current frame It(x,y) is set to one to indicate that motion has been measured at some point over the past q frames; otherwise, the location is set to zero, indicating no motion has been measured in the pixel location. Locations with no motion, i.e., Ft(x,y)=0, are herein referred to as background pixels. The number of background pixels n in the tile It(x,y) is determined from the binary motion history mask Ft(x,y) 510.
The mean illumination change c is then computed for the tile It(x,y) 512. The mean illumination change c is computed as the average pixel difference Δt,1(x,y) between pixels in the tile It(x,y) that are identified as background pixels in the binary motion history mask Ft(x,y) and the corresponding pixels in the background model Bt(x,y).
A determination is then made as to whether or not gain compensation should be applied to the tile It(x,y) 514. This determination is made by comparing the mean illumination change c to a compensation threshold β. The compensation threshold β may be predetermined based on simulation results and/or may be user-specified. If the mean illumination change c is not less than the compensation threshold β 514, background subtraction with gain compensation is performed on the tile It(x,y) 516 to compute gain compensated pixel differences Δt,2(x,y). That is, a gain compensation factor, which is the mean illumination change c, is added to each pixel in the background model Bt(x,y) corresponding to the tile It(x,y), and the gain compensated background model pixel values are subtracted from the corresponding pixels in the tile It(x,y). If the mean illumination change c is less than the compensation threshold β 514, the pixel differences Δt,2(x,y) are set 518 such that the results of the uncompensated background subtraction Δt,1(x,y) 500 will be selected as the minimum 522.
The minimum differences Δt(x,y) between the uncompensated background subtraction Δt,1(x,y) and the gain compensated background subtraction Δt,2(x,y) are determined 522, and a portion of the foreground mask Ht(x,y) corresponding to the tile It(x,y) is generated by binarizing the minimum differences Δt(x,y) based on a threshold Tt(x,y) 526. The threshold Tt(x,y) is derived from the pixel-wise variance 520, e.g., Tt(x,y)=λ√Vt(x,y). If a minimum difference in Δt(x,y) is less than the threshold Tt(x,y), the corresponding location in the foreground mask is set to indicate a background pixel; otherwise, the corresponding location is set to indicate a foreground pixel.
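Pulling the preceding sketches together, the per-tile decision logic just described might look like the following; the tile size, the default values of λ (lam) and β (beta), and the direct comparison of c (rather than |c|) to β are assumptions for illustration only.

import numpy as np

def illumination_compensated_subtraction(frame, bg_mean, bg_var, f_mask,
                                         tile=64, lam=2.5, beta=2.0):
    # Per-tile gain compensated background subtraction.
    # frame, bg_mean, bg_var are N x M float arrays; f_mask is the binary
    # motion history mask F_t. Returns the binary foreground mask H_t.
    h = np.zeros(frame.shape, dtype=np.uint8)
    threshold = lam * np.sqrt(bg_var)                        # T_t(x,y), step 520
    for y0 in range(0, frame.shape[0], tile):
        for x0 in range(0, frame.shape[1], tile):
            sl = (slice(y0, y0 + tile), slice(x0, x0 + tile))
            d1 = frame[sl] - bg_mean[sl]                     # uncompensated subtraction, step 500
            bg_pix = (f_mask[sl] == 0)                       # background pixels, step 508
            n = np.count_nonzero(bg_pix)                     # step 510
            c = float(np.sum(d1[bg_pix]) / n) if n else 0.0  # mean illumination change, step 512
            if c >= beta:                                    # step 514, as stated in the text
                d2 = frame[sl] - (bg_mean[sl] + c)           # compensated subtraction, step 516
            else:
                d2 = d1                                      # step 518: uncompensated result wins the minimum
            h[sl] = (np.minimum(d1, d2) > threshold[sl]).astype(np.uint8)  # steps 522, 526
    return h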
The RISC processor 704 may be any suitably configured RISC processor. The video/image coprocessors 702 may be, for example, a digital signal processor (DSP) or other processor designed to accelerate image and/or video processing. One or more of the video/image coprocessors 702 may be configured to perform computational operations required for video encoding of captured images. The video encoding standards supported may include, for example, one or more of the JPEG standards, the MPEG standards, and the H.26x standards. The computational operations of the video content analysis, including the background subtraction with dynamic illumination compensation, may be performed by the RISC processor 704 and/or the video/image coprocessors 702. That is, one or more of the processors may execute software instructions to perform the video content analysis and the background subtraction method described herein.
The VPS 706 includes a configurable video processing front-end (Video FE) 708 input interface used for video capture from a CCD imaging sensor module 730 and a configurable video processing back-end (Video BE) 710 output interface used for display devices such as digital LCD panels.
The Video FE 708 includes functionality to perform image enhancement techniques on raw image data from the CCD imaging sensor module 730. The image enhancement techniques may include, for example, black clamping, fault pixel correction, color filter array (CFA) interpolation, gamma correction, white balancing, color space conversion, edge enhancement, detection of the quality of the lens focus for auto focusing, and detection of average scene brightness for auto exposure adjustment.
The Video FE 708 includes an image signal processing module 716, an H3A statistic generator 718, a resizer 719, and a CCD controller 717. The image signal processing module 716 includes functionality to perform the image enhancement techniques. The H3A module 718 includes functionality to support control loops for auto focus, auto white balance, and auto exposure by collecting metrics on the raw image data.
The Video BE 710 includes an on-screen display engine (OSD) 720, a video analog encoder (VAC) 722, and one or more digital to analog converters (DACs) 724. The OSD engine 720 includes functionality to manage display data in various formats for several different types of hardware display windows and it also handles gathering and blending of video data and display/bitmap data into a single display window before providing the data to the VAC 722 in YCbCr format. The VAC 722 includes functionality to take the display frame from the OSD engine 720 and format it into the desired output format and output signals required to interface to display devices. The VAC 722 may interface to composite NTSC/PAL video devices, S-Video devices, digital LCD devices, high-definition video encoders, DVI/HDMI devices, etc.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as disclosed herein. For example, the meaning of the binary values 0 and 1 in one or more of the various binary masks described herein may be reversed.
Those skilled in the art can also appreciate that the method applies generally to any background model-based approach. That is, the method is not unique to any particular background model representation. For example, the approach performs equally well when each pixel in the model is defined by a uniformly weighted running average and running variance. The method also works with various sensor types, even those collecting measurements outside of the visible spectrum. For example, sensors sensitive to thermal and infrared spectra also experience momentary changes in the model representation due to sensor noise and environmental flare-ups. The method described herein can also compensate for such conditions, providing improved segmentation of foreground pixels. The method also works for background models described by a stereo disparity or depth map.
Embodiments of the background subtraction method described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the software may be executed in one or more processors, such as a microprocessor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), or digital signal processor (DSP). Further, the software may be initially stored in a computer-readable medium such as a compact disc (CD), a diskette, a tape, a file, memory, or any other computer readable storage device and loaded and executed in the processor. In some cases, the software may also be sold in a computer program product, which includes the computer-readable medium and packaging materials for the computer-readable medium. In some cases, the software instructions may be distributed via removable computer readable media (e.g., floppy disk, optical disk, flash memory, USB key), via a transmission path from computer readable media on another digital system, etc.
Although method steps may be presented and described herein in a sequential fashion, one or more of the steps shown and described may be omitted, repeated, performed concurrently, and/or performed in a different order than the order shown in the figures and/or described herein. Accordingly, embodiments of the invention should not be considered limited to the specific ordering of steps shown in the figures and/or described herein.
It is therefore contemplated that the appended claims will cover any such modifications of the embodiments as fall within the true scope of the invention.
This application claims benefit of U.S. Provisional Patent Application Ser. No. 61/367,611, filed Jul. 26, 2010, which is incorporated by reference herein in its entirety.