The present invention relates generally to imaging, and more particularly to surveillance imaging analysis systems and methods.
The usefulness of video surveillance systems is becoming increasingly acknowledged as the demand for enhanced safety has increased, both for fixed locations and properties and for moving vehicles. Locations commonly covered by such systems include, for example, harbors, airports, bridges, power plants, parking garages, public spaces, and other high-value assets. Traditionally, such camera networks require labor-intensive deployment and monitoring by human security personnel. Human-monitored systems are, in general, relatively costly and prone to human error. For these reasons, the development of technology to automate the deployment, calibration, and monitoring of such systems will be increasingly important in the field of video surveillance. Moving vehicles, as well, would benefit from the greater safety provided by video surveillance, which could, for example, alert the driver of a passenger car to a foreign object in the road, or in the case of a driverless vehicle, alter the planned route because of a detected object or obstacle in that route.
For example, in automated video surveillance of sensitive infrastructures, it is always desirable to detect and alarm in the event of intrusion. To perform such a task reliably, it is often helpful to classify detected objects beforehand and to also track such detected objects in an attempt to discern from their actions and movements whether the objects pose an actual threat. Detecting an object is no easy task, however. It requires powerful video analytics and complex algorithms supporting those analytics. For example, a pixel in an aligned image sequence, such as a video stream, is represented as a discrete time series. The states of that time series are not observable, but, through an abstraction operation, they do aid in the observation of the state of the pixel at a given time, which depends on the immediately prior state, and not necessarily on earlier states, much as in the way a Hidden Markov Model is used. It often requires determining which portions of a video or image sequence are background and which are foreground, and then detecting the object in the foreground. Further, object detection is complicated when the camera imaging the target moves, either because it is mounted to something which is mobile or because the camera is monitoring a wide field of view by a step-and-stare method of camera movement.
Generally, the video surveillance system is unable to determine the actual size of an object, which can make threat detection even more difficult. With actual size detection, benign objects can be better differentiated from real threats. Moreover, the kinematics of an object, such as its velocity, acceleration, and momentum, are much more difficult to analyze when real size is unknown.
Additionally, georeferencing with a single camera demands the existence of landmark-rich scenes which may not be available in many instances, such as in the surveillance of ports and harbors, or when a site is being remotely—and perhaps covertly—monitored, and it is not feasible to introduce synthetic landmarks into the scene. Clearly, in reference to the above-described issues, the development of systems to improve the efficiency and effectiveness of automated video surveillance is needed.
In one embodiment, a surveillance method includes the steps of providing an imaging means, acquiring images from the imaging means, each image comprising a plurality of pixels. The method further includes producing a histogram of a characteristic property for analysis of each of the pixels, and analyzing the histogram for each pixel so as to assign each pixel with one of a background state and a foreground state. The method further includes classifying each pixel as one of background and foreground based on the assignment of the pixel with the one of the background state and the foreground state.
In another embodiment, a surveillance method includes the steps of providing an imaging means, acquiring images from the imaging means, each image comprising a plurality of pixels, and producing an initial background model of the pixels. The method further includes analyzing characteristic properties of the image including a grayscale characteristic property and a gradient orientation characteristic property, and producing a first histogram for each pixel for the grayscale characteristic property and a second histogram for each pixel for the gradient orientation characteristic property. The method further includes analyzing each of the first and second histograms for each pixel so as to assign each pixel with one of a background state and a foreground state, and classifying each pixel as one of background and foreground based on the assignment of the pixel with the one of the background state and the foreground state.
In yet another embodiment, a surveillance method includes the steps of providing an imaging means, acquiring images from the imaging means, each image comprising a plurality of pixels, and producing an initial background model of the pixels. The method further includes analyzing characteristic properties of the image including a grayscale characteristic property and a gradient orientation characteristic property, producing a first histogram for each pixel for the grayscale characteristic property and a second histogram for each pixel for the gradient orientation characteristic property, and analyzing each of the first and second histograms for each pixel so as to assign each pixel with one of a background state and a foreground state. The method still further includes classifying each pixel as one of background and foreground based on the assignment of the pixel with the one of the background state and the foreground state, and updating the initial background model with each pixel classified as background.
Referring to the drawings:
Reference now is made to the drawings, in which the same reference characters are used throughout the different figures to designate the same elements.
The system includes a pan-tilt-zoom enabled (“PTZ”) camera, and preferably a plurality of cameras, which capture images or video of objects in real space, or terrain space, into camera space, or image space. Terrain space is the real world, and camera space is a two-dimensional abstraction of the terrain space as imaged by the camera. In step 100, the camera records a video stream of the terrain space, as is conventional and well-known. It is noted that the camera may be fixed in position and attitude, may be fixed in position and mobile in attitude, or may be mobile in both position and attitude, as would occur when the camera is mounted to a mobile platform.
In the image stabilization step 101, individual frames in the video stream are adjusted to align images within the frames with respect to each other. Image stabilization reduces the effect of small movements and trembles of the camera that may be caused by wind, mechanical aberrations, movement of the camera or the platform on which the camera is mounted, and the like.
Background modeling is performed in step 102 of
In
A histogram is built for each characteristic property. An exemplary histogram is shown in
The histogram is constructed by first identifying the bin which corresponds in value to the property of the pixel at the pixel coordinate. For example, if a pixel has a gradient orientation of one value, it will correspond to one bin, while if the pixel has a gradient orientation of another value, it will correspond to another bin. Likewise, and for example only, if a pixel has a low grayscale value, it will correspond to a different bin than if it had a high grayscale value. This corresponding bin will be referred to hereinafter as the “encompassing bin.” Once the encompassing bin is identified, the system decrements all other bins in height. In some cases, as the user desires, bins immediately adjacent to the encompassing bin will not be decremented. In other cases, bins neighboring the encompassing bin will not be decremented. The user of the system is given the functionality to designate whether immediately-adjacent bins should not be decremented, whether neighboring bins should not be decremented, and if so, how many neighboring bins should not be decremented. In other words, the user may choose to define “neighboring bins” as the first bins to the sides of the encompassing bin, or may choose to define “neighboring bins” as the first and second bins to the side of the encompassing bin, or may choose to define “neighboring bins” as the first, second, and third bins to the side of the encompassing bin, etc. After decrementing all other bins, the system increments the encompassing bin. The encompassing bin is incremented by a growth factor δf, the value of which is determined by equation (1) below:
δf=k·ln(fmax−f), (1)
where f denotes the current frequency, 0<k≦2.7 is user-defined, and fmax=255. Equation (1) is useful for conserving memory space and preventing buffer overrun or wrapping in a byte-deep circular buffer. In systems with greater amounts of memory and deeper buffers, the equation may be altered accordingly. In other embodiments, the constant k may be assigned different numbers as long as (f+δf) is truncated or capped at the level specified by fmax.
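The update described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the text does not give the decrement amount, so `decay` is a hypothetical parameter; `spared` is a hypothetical name for the user-chosen number of neighboring bins that are left untouched; bins are treated as a linear (non-wrapping) list; and the grown value is truncated to an integer before being capped at fmax.

```python
import math

F_MAX = 255  # ceiling from the text: one byte per bin


def growth_factor(f, k=1.0, f_max=F_MAX):
    """Equation (1): delta_f = k * ln(f_max - f)."""
    if f >= f_max:
        return 0.0  # ln(0) is undefined and the bin is already at the cap
    return k * math.log(f_max - f)


def update_histogram(bins, encompassing, k=1.0, decay=1, spared=0, f_max=F_MAX):
    """Decrement bins outside the spared neighborhood, then grow the encompassing bin.

    decay and spared are ASSUMED parameters; the text leaves both to the user.
    """
    for i in range(len(bins)):
        if abs(i - encompassing) > spared:
            bins[i] = max(0, bins[i] - decay)
    grown = bins[encompassing] + growth_factor(bins[encompassing], k, f_max)
    bins[encompassing] = min(f_max, int(grown))  # truncate, then cap at f_max
    return bins
```

Because the logarithm shrinks as f approaches fmax, a bin that is already frequent grows more slowly than a rarely seen one, which is consistent with the stated goal of avoiding overrun in a byte-deep buffer.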
Once the multi-modal background model for a characteristic property is constructed as a histogram, that histogram is abstracted into vocabulary which can be more easily manipulated, in steps 22 and 23 of
T1 is a low threshold. Bins having frequencies that are below this threshold correspond to pixels which are highly likely to have appeared infrequently, and hence, are likely to be the result or effect of a moving object. T1 is thus helpful for identifying foreground.
T2 is a user-defined threshold. This threshold is set by the user as a selected percentage of the most recurrent bin. Bins having frequencies above this correspond to pixels which have appeared many times before.
T3 is a calculated threshold, determined by T3=T1+γ(T2−T1), where γ is a user-defined constant.
The bins in
Assign as H+ if the bin is the most recurrent;
Assign as H− if the bin frequency exceeds T2 but is less than the most recurrent;
Assign as M+ if the bin frequency exceeds T3 but is less than or equal to T2;
Assign as M− if the bin frequency exceeds T1 but is less than or equal to T3; and
Assign as L if the associated bin frequency is below T1.
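The assignment rules above, together with T3=T1+γ(T2−T1), can be sketched as follows. The function name `assign_symbols` and the parameter `t2_fraction` (the user-selected percentage of the most recurrent bin, expressed as a fraction) are hypothetical. Two cases are not covered by the rules as written and are resolved here by assumption: a frequency exactly equal to T1 is assigned L, and bins tied for the highest frequency are all assigned H+. ASCII hyphens stand in for the minus signs in the symbol names.

```python
def assign_symbols(bins, t1, t2_fraction, gamma):
    """Map each bin frequency to one of the symbols H+, H-, M+, M-, L."""
    peak = max(bins)                 # frequency of the most recurrent bin
    t2 = t2_fraction * peak          # T2: user-selected share of the peak
    t3 = t1 + gamma * (t2 - t1)      # T3 = T1 + gamma * (T2 - T1)
    symbols = []
    for f in bins:
        if f == peak:
            symbols.append("H+")
        elif f > t2:
            symbols.append("H-")
        elif f > t3:
            symbols.append("M+")
        elif f > t1:
            symbols.append("M-")
        else:
            symbols.append("L")      # ASSUMPTION: f == t1 is also treated as L
    return symbols


example = assign_symbols([200, 170, 120, 60, 10], t1=20, t2_fraction=0.8, gamma=0.5)
```

In this example T2 is 160 and T3 is 90, so the five bins receive the five distinct symbols in order.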
In the same way, but with different vocabulary, threshold values are established for the gradient orientation histogram. The gradient orientation histogram is shown in
τ1 is a low threshold. Bins having frequencies that are below this threshold correspond to pixels which are highly likely to have appeared infrequently, and hence, are likely to be the result or effect of a moving object. τ1 is thus helpful for identifying foreground.
τ2 is a user-defined threshold. This threshold is set by the user as a selected percentage of the most recurrent bin. Bins having frequencies above this correspond to pixels which have appeared many times before.
τ3 is a calculated threshold, determined by τ3=τ1+γ(τ2−τ1), where γ is a user-defined constant.
The intermediate symbols in the set {h+, h−, m+, m−, l} are then assigned to the bins based on the frequency of each bin in the gradient orientation histogram with respect to τ1, τ2, and τ3. These symbols are useful when the pixel being analyzed has sufficient, or at least non-negligible, gradient magnitude. When the pixel has a non-negligible gradient magnitude, it will possess a gradient orientation and is mapped to the gradient orientation histogram. Otherwise, if the pixel does not have the requisite gradient magnitude and thus has a negligible gradient magnitude, a different intermediate set of symbols is used, as seen in
After the intermediate symbol sets for grayscale and gradient magnitude have been determined and assigned, they are mapped onto {Ii|i=1 . . . 5} and {Gj|j=1 . . . 5}, respectively. Likewise, those corresponding to the gradient orientation histogram last bin, which exhibit negligible gradient magnitude, are mapped onto {G˜j|j=1 . . . 5}.
Pixel classification occurs next, according to step 103 in
qt=B qt=F
The matrix (2) is useful for predicting a current state of the pixel based on its prior state. For example, if the pixel is currently assigned to the background state, then there is an 80% chance that the pixel will next be assigned the background state, and a corresponding 20% chance that the pixel will next be assigned the foreground state. On the initial classification, pixels are assumed to be background, and they are then classified according to the matrix (2) as well as the methodology described below.
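By way of illustration only, the following sketch shows how a two-state transition matrix of this kind propagates a belief about the prior state one frame forward. Only the background row (0.8 and 0.2) is taken from the text; the foreground row is a placeholder chosen for the example, and the names `TRANSITION` and `predict_state` are hypothetical.

```python
# Rows are the prior state q(t-1); columns are the next state q(t).
TRANSITION = {
    "B": {"B": 0.8, "F": 0.2},  # stated in the text
    "F": {"B": 0.5, "F": 0.5},  # ASSUMED placeholder, not taken from matrix (2)
}


def predict_state(prior_belief, transition=TRANSITION):
    """Return P(q_t) given P(q_{t-1}) under a first-order Markov assumption."""
    predicted = {"B": 0.0, "F": 0.0}
    for previous, weight in prior_belief.items():
        for current, probability in transition[previous].items():
            predicted[current] += weight * probability
    return predicted


# On the initial classification every pixel is assumed to be background.
initial = predict_state({"B": 1.0, "F": 0.0})
```

The prediction depends only on the immediately prior state, which matches the dependence described for the pixel time series.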
Pixel classification begins in step 24 only after the initial background model is constructed, as in steps 20 and 21, and vocabulary for the histogram resulting from each multi-modal background model has been created as well, as in steps 22 and 23, respectively, of
In some cases, in which the corresponding encompassing bins for a pixel in both the grayscale and gradient orientation histograms have different frequencies, but the grayscale frequency is low while the gradient orientation frequency is high, the pixel is likely part of a shadow, and thus, is assigned as background.
Equation (3) below is used when the pixel exhibits a sufficient or valid gradient orientation, as in
and
where
P(Ii|qt=B)=Πfi+Φi(ΠCi−Πfi) (5),
and
P(Gi|qt=B)=πfi+φi(πCi−πfi) (6),
in which Πfi is a user-defined probability floor for the ith symbol, ΠCi is a user-defined probability ceiling for the ith symbol, πfi is a user-defined probability floor parameter for the ith symbol, and πCi is a user-defined probability ceiling parameter for the ith symbol.
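Equations (5) and (6) share one form: a floor plus a coefficient times the gap between ceiling and floor. A minimal sketch of that form follows. The function name is hypothetical, and the restriction of the coefficient to the range 0 to 1 is an assumption made so that the result stays between floor and ceiling; the coefficients themselves are defined by relations that are not reproduced here.

```python
def interpolated_probability(floor, ceiling, coefficient):
    """Shared form of Equations (5) and (6): floor + coefficient * (ceiling - floor)."""
    if not 0.0 <= floor <= ceiling <= 1.0:
        raise ValueError("expected 0 <= floor <= ceiling <= 1")
    if not 0.0 <= coefficient <= 1.0:
        raise ValueError("coefficient assumed to lie in [0, 1]")
    return floor + coefficient * (ceiling - floor)


# Illustrative values only: floor 0.2, ceiling 0.9, coefficient 0.5.
p_gray = interpolated_probability(0.2, 0.9, 0.5)
```

A coefficient of 0 returns the floor and a coefficient of 1 returns the ceiling, so the user-defined bounds fully determine the range of the resulting probability.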
The following coefficients are then used in Equations (5) and (6), which relate the vocabulary abstracted from the histograms to the probabilities, when the pixel exhibits a sufficient or valid gradient orientation (as in
and
Wi+wj|i=1 . . . 5; j=1 . . . 5
Similarly, the following relations are then used in Equations (5) and (6), which relate the vocabulary abstracted from the histograms to the probabilities, when the pixel lacks sufficient gradient magnitude to be assigned a gradient orientation (as in
and
Wi+|i=1 . . . 5; j=1 . . . 5
Individual image pixel coordinates are assigned to the state of background (B) or foreground (F) according to rules defined by the following conditional equations:
where qt denotes the current state of the pixel at the pixel coordinate, and qt−1 denotes the prior state.
It is noted that, where memory and processing power of the system permit, the multiple modes may be modeled on a two-dimensional histogram or scattergram rather than separate histograms.
Assign as H+ if the bin is the most recurrent;
Assign as H− if the bin frequency exceeds T2 but is less than the most recurrent;
Assign as M+ if the bin frequency exceeds T3 but is less than or equal to T2;
Assign as M− if the bin frequency exceeds T1 but is less than or equal to T3; and
Assign as L if the associated bin frequency is below T1. (24)
And, when the pixel does not exhibit sufficient gradient magnitude and thus cannot be assigned a gradient orientation, the following intermediate symbols are used instead of the above intermediate symbols:
Assign as H̃+ if the bin is the most recurrent;
Assign as H̃− if the bin frequency exceeds T2 but is less than the most recurrent;
Assign as M̃+ if the bin frequency exceeds T3 but is less than or equal to T2;
Assign as M̃− if the bin frequency exceeds T1 but is less than or equal to T3; and
Assign as L̃ if the associated bin frequency is below T1. (25)
As before, the intermediate symbols defined in rules (24) and (25) above are mapped onto {νi|i=1 . . . 5} and {νi−|i=1 . . . 5} respectively, for ease of reference. The first of these mappings is used when the pixels have a valid gradient orientation, and the second of these mappings is used when the pixels do not exhibit a sufficient gradient magnitude or any meaningful gradient orientation.
Before finally classifying the pixels as either background or foreground, another series of probabilities are determined. When the pixel has a sufficient gradient magnitude and thus a gradient orientation, the following probabilities are determined:
P(νi|qt=B)=PLi+ψi(PUi−PLi):i=1 . . . 5 (26),
where
PLi is a user-defined lower-bound probability for the ith symbol,
PUi is a user-defined upper-bound probability for the ith symbol,
and
Then, finally, individual pixels are classified to either the background or foreground according to:
Equation (31) ultimately determines whether the system classifies a pixel as foreground or background when the pixel exhibits sufficient gradient magnitude and hence a gradient orientation. Conversely, when the pixel lacks sufficient gradient magnitude and hence gradient orientation, the following set of equations governs the computation of probabilities and classification of pixels:
P({tilde over (ν)}l|qt=B)=+(−):i=1 . . . 5 (32)
where
and
Then, finally, for those pixels lacking sufficient gradient magnitude, the pixels are assigned to the background or foreground according to:
The state of a pixel as background or foreground, as having been determined by the pixel classification step, may later be changed in response to information acquired or applied during subsequent processing steps, such as morphological operation or image segmentation.
The invention thus far described is useful for classifying pixels into foreground and background. A particular challenge in this process is avoiding the absorption into background of objects or information which truly are part of the foreground. For instance, on an airplane runway, foreign object detection is extremely important. Small objects such as nails, scrap metal, rubber, or the like pose a tremendous danger to aircraft during landing and take-off. However, because such objects are essentially inert but may be introduced or moved by large planes which may conceal those objects, detection of them is difficult. The present invention, however, has means for dealing with this challenge.
The present invention has flexibility because the amount of the incrementation and decrementation of the histograms can be adjustably set by the user to coordinate that incrementation and decrementation with respect to a rate at which background model updating occurs, or with respect to a particular region of interest that a user may desire to monitor more or less frequently than other regions. Different background model refresh rates can even be specified for different image regions of interest. For example, distant regions, in which a pixel covers a larger area than a pixel in a close region, may be assigned a slower refresh rate than a close region to allow for the transition of moving objects from one pixel to another.
The system supports the detection of slowly-moving (or non-moving) objects concurrently with fast-moving objects, such as would be the situation with a foreign object on a runway on which an airplane is taxiing. The system sets the rate at which background model updating occurs so that oscillatory movement is absorbed into the background, as such oscillatory movement often has non-random periodicities, such as in the trembling of leaves and branches in the wind. Such a background model will suppress stationary pixels (i.e., pixels not associated with motion). Additionally, the background model will adapt and suppress pixels associated with regular oscillatory motion, as opposed to random oscillatory movement. However, the background model may also suppress motion due to relatively slow objects intruding into monitored premises (such as a human crawling slowly). This presents a hazard, because potential intruders with knowledge of the system's operations would be able to defeat the system by crawling through the monitored terrain space very slowly, and essentially be absorbed into the background. To combat this, a plurality of multi-modal background models are used and updated according to the following numbered steps:
1. An optimally-refreshed background model, which effectively absorbs oscillatory movement as described above, is used until it develops an excessive amount of noise or artifacts.
2. The optimally-refreshed background model is then copied as additional background models.
3. Those additional background models are then used together with the optimally-refreshed background model, but are refreshed at slower rates to detect slowly-moving objects.
4. The optimally-refreshed background model and the additional background models are used concurrently in parallel to both absorb oscillatory movement into the background and to reveal slow-moving objects as foreground, respectively.
5. Steps 2-4 are repeated periodically, at a period much longer than the refresh rate of the optimally-refreshed background model, overwriting the additional background models with new additional background models. Moreover, if additional information is acquired in subsequent processing stages, steps 2-4 are immediately repeated with that new information.
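The numbered steps above can be sketched as a small scheduling structure. This is an illustrative sketch only: the class name, method names, and the use of frame counts as refresh intervals are all assumptions, and the background model is represented by an arbitrary Python object rather than by the per-pixel histograms described earlier.

```python
import copy


class BackgroundModelBank:
    """Hypothetical holder for one fast-refreshing model and its slower copies."""

    def __init__(self, primary, primary_interval, slow_intervals, resync_period):
        self.primary = primary
        self.primary_interval = primary_interval    # frames between primary refreshes
        self.slow_intervals = list(slow_intervals)  # longer intervals for the copies
        self.resync_period = resync_period          # much longer than primary_interval
        # Step 2: copy the optimally-refreshed model as additional models.
        self.slow = [copy.deepcopy(primary) for _ in self.slow_intervals]

    def due(self, frame):
        """Steps 3 and 4: names of the models to refresh on this frame."""
        names = []
        if frame % self.primary_interval == 0:
            names.append("primary")
        for n, interval in enumerate(self.slow_intervals):
            if frame % interval == 0:
                names.append("slow%d" % n)
        return names

    def maybe_resync(self, frame, new_information=False):
        """Step 5: overwrite the copies periodically or when new information arrives."""
        if new_information or (frame > 0 and frame % self.resync_period == 0):
            self.slow = [copy.deepcopy(self.primary) for _ in self.slow_intervals]
            return True
        return False
```

Deep copies are used so that later refreshes of the primary model do not silently alter the slower models, which must diverge from it in order to reveal slow-moving objects.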
The background model can be refreshed at rates other than the pixel classification rate. Pixel classification is performed at a nominal analytics processing frame rate, such as 15 frames per second (“fps”) or 7.5 fps, while the background model refresh intervals remain responsive to detection of an intended class of targets within a certain speed range (say, a crawling human) and may therefore extend to several seconds. More specifically, multi-modal background models with different refresh rates will have their own dedicated and independent subsequent processing stages of pixel classification and segmentation. As such, anomalies (such as noise in the form of unwanted foreground) resulting from the departure of slower-refreshing background models from an optimally-refreshing multi-modal background model will be detected readily, and when the extent of the anomalies becomes excessive, such as by exceeding a pre-defined threshold, the optimally-refreshing multi-modal background model is copied to new additional background models, as described in step 2 above.
In yet another embodiment of the present invention, application of the multi-modal background model is extended from fixed video cameras to cameras mounted on mobile platforms, such as automated guided vehicles (AGVs), unmanned waterborne vessels, high speed trains, mobile phones, and the like. A process 30 for operating a mobile video camera surveillance system is displayed graphically in
Based on the direction and distance of movement, a pixel at one time may correspond to one pixel, several pixels, several partial pixels, or one or more pixels together with one or more partial pixels (and vice versa) at another time, because the area covered by the pixels is not equal; cells closer to the camera cover a smaller area of terrain space than do cells further from the camera. This correspondence is made possible by georeferencing the camera in accordance with the method described in U.S. Pat. No. 8,253,797. As described in that patent, georeferencing the camera produces equations for rays in three-space emanating from the camera, with each ray being associated with a camera pixel coordinate when the camera has a specified attitude or orientation. Then, as the camera moves because of either movement of the camera platform or change in attitude, those equations are correspondingly transformed to reflect the camera's new location and attitude. The camera location is continually determined and monitored by a GPS system carried on board in step 110, and the attitude of the camera is likewise continually determined and monitored by both vertical gyros and course gyros (or MEMS or AHRS equivalents thereof) in step 111 in
In embodiments of the present invention which use step-and-stare cameras, background modeling and pixel classification are useful for detecting permanent changes since the last visit to a particular view from the camera, and also for detecting transient changes while the camera is fixed and focused on one sector. Once pixel classification has occurred and subsequent processing stages like morphological filtering and image segmentation have been completed, a georeferencing step 112 and a location and attitude determination step 113 are again performed, so as to update the background model in step 106′. Those steps 112 and 113 are identical to the steps 111 and 110, respectively, but for the fact that they occur later in the process. PTZ cameras are preferably used for step-and-stare operation. Such PTZ cameras can operate in step-and-stare mode, in which the camera holds a view on a directional alignment and takes a picture or records video, then pans to a new view in a new alignment just offset from the previous alignment, again takes a picture or records video, and repeats this process until the entire field of view has been captured. PTZ cameras are capable of rotating and capturing 360 degrees of view through a contiguous set of sequential steps, with each successive step covering an adjacent sector. The use of PTZ cameras allows a much larger field of view to be monitored than would be possible with a single camera or a plurality of fixed cameras.
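As a simple illustration of covering 360 degrees through contiguous sectors, the following sketch lists the pan angle at which each stare would begin. It assumes a constant horizontal field of view per stare and an optional overlap between adjacent sectors; neither value is specified in the text, and the function name is hypothetical.

```python
import math


def sector_pan_angles(horizontal_fov_deg, overlap_deg=0.0):
    """Pan angles, in degrees, for a full step-and-stare sweep."""
    step = horizontal_fov_deg - overlap_deg  # angular advance per step
    if step <= 0:
        raise ValueError("overlap must be smaller than the field of view")
    count = math.ceil(360.0 / step)          # enough sectors to close the circle
    return [(n * step) % 360.0 for n in range(count)]


angles = sector_pan_angles(45.0)
```

With a 45-degree field of view and no overlap, eight stares cover the full circle; when 360 is not a multiple of the step, the final sector simply overlaps the first.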
The methodology used to determine or characterize a single pixel property is employed to determine or characterize multiple pixel properties (such as grayscale and gradient orientation) and can be used to construct two independent histograms or a single, two-dimensional histogram or scattergram, as described above. The modeling process described here, however, is augmented by additional processing channels yielding a multichannel detection (MCD) scheme. The step-and-stare background modeling and pixel classification process is performed in the context of this MCD scheme, which comprises five channels, {xi|i=1 . . . 5}, each described below:
x1 = {fDiff(bImg_{t−1}, Img_{t+r}) | r=0 . . . R},
x2 = {fg(BgndModel_{t−1}, Img_{t+r}) | r=0 . . . R},
x3 = {fg(BgndModel_{t−1+r}, Img_{t+r}) | r=0 . . . R},
x4 = {fDiff(bImg_{t−1+r}, Img_{t+r}) | r=0 . . . R},
and
x5 = {fDiff(Img_{t−1+r}, Img_{t+r}) | r=0 . . . R}
where:
1. fDiff is a frame-differencing operation between two images yielding a resultant image, where a first image is constructed from the highest frequency value of all of the pixels, and a second image includes the successive images in time. In x1, the multi-modal background model is initially constructed and not updated. In x4, the multi-modal background model is updated. In x5, successive images are compared to each other, rather than to a multi-modal background model.
2. fg is an operation for computing a foreground image from the multi-modal background model resulting from the last visit to a particular camera view with successively incoming images. In x2, the multi-modal background model is not updated, while in x3, the multi-modal background model is updated.
3. bImg_{t−1} is an image whose pixel coordinates retain the characteristic property (grayscale, for example) identified with the dominant mode of the multi-modal background model.
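The frame-differencing channels can be sketched in plain Python as follows. The sketch makes several assumptions that the text does not state: images are nested lists of grayscale values, the difference is taken as an absolute per-pixel difference, and the dominant-mode image assigns each pixel the lower edge of its most frequent bin. All function names are hypothetical, and the fg operation of channels x2 and x3 is not covered.

```python
def dominant_mode_image(histograms, bin_width=1):
    """bImg: for each pixel, the property value of its most frequent bin."""
    return [
        [max(range(len(h)), key=h.__getitem__) * bin_width for h in row]
        for row in histograms
    ]


def frame_difference(a, b):
    """fDiff, ASSUMED here to be an absolute per-pixel difference."""
    return [[abs(x - y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def channel_x1(b_img, frames):
    """x1: a fixed (not updated) background image against each incoming frame."""
    return [frame_difference(b_img, frame) for frame in frames]


def channel_x5(frames):
    """x5: each incoming frame against its immediate predecessor."""
    return [frame_difference(p, c) for p, c in zip(frames, frames[1:])]


# One-row, two-pixel example with three bins per pixel.
b_img = dominant_mode_image([[[1, 9, 2], [7, 0, 0]]])
frames = [[[1, 3]], [[4, 0]]]
x1 = channel_x1(b_img, frames)
x5 = channel_x5(frames)
```

Channel x4 would follow the same pattern as x1, except that the background image is recomputed from the updated model before each difference.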
Before detecting permanent changes from the last visit to a sector, the following steps are performed. The last multi-modal background model that had been saved in memory or in mass storage is retrieved and placed into the fast-access buffer. An image is constructed in which every pixel coordinate retains the grayscale identified with the dominant mode of the multi-modal background model. The reference image associated with that sector (i.e., step) is retrieved and placed in the fast-access buffer. Through a registration process, successive images are aligned and saved prior to the camera moving away from the sector.
The present invention is described above with reference to a preferred embodiment. However, those skilled in the art will recognize that changes and modifications may be made in the described embodiment without departing from the nature and scope of the present invention. To the extent that such modifications and variations do not depart from the spirit of the invention, they are intended to be included within the scope thereof.
Having fully and clearly described the invention so as to enable one having skill in the art to understand and practice the same, the invention claimed is:
This application is a continuation of and claims the benefit of pending U.S. patent application Ser. No. 14/210,435, filed Mar. 13, 2014, which claimed the benefit of U.S. Provisional Application No. 61/785,278, filed Mar. 14, 2013, all of which are hereby incorporated by reference.
Number | Date | Country | |
---|---|---|---|
61785278 | Mar 2013 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 14210435 | Mar 2014 | US |
Child | 15040597 | | US |