1. Field of the Invention
This invention relates to surveillance systems. Specifically, the invention relates to a video-based surveillance system that monitors a wide area by fusing the data from multiple surveillance cameras.
2. Related Art
Some state-of-the-art intelligent video surveillance (IVS) systems can perform content analysis on the image view of each camera. Based on user-defined rules or policies, IVS systems can automatically detect potential threats by detecting, tracking and analyzing the targets in the scene. While this type of system has been proven to be very effective and helpful for video surveillance applications, its capability may be constrained by the fact that an isolated single camera can only monitor a limited area. Further, conventional systems usually do not remember past targets, especially when the past targets appeared to act normally. Thus, a conventional system cannot detect threats that can only be inferred from repeated actions.
Now, security needs demand many more capabilities from IVS. For example, a nuclear power plant may have more than ten intelligent surveillance cameras monitoring the surroundings of one of its critical facilities. It may be desirable to receive an alert when there may be some target (e.g., a human or vehicle) loitering around the site for more than fifteen minutes, or when the same target approaches the site more than three times in a day. The conventional individual camera system would fail to detect the threats because a target of interest may loiter at the site for more than an hour, but not stay in any single camera view for more than two minutes, or the same suspect target might approach the site five times in a day but from different directions.
What may be needed then may be an improved IVS system that overcomes shortcomings of conventional solutions.
The invention includes a method, a system, an apparatus, and an article of manufacture for wide-area site-based video surveillance.
An embodiment of the invention may be a computer-readable medium containing software that, when read by a computer, causes the computer to perform a method for wide-area site-based surveillance. The method includes receiving surveillance data, including view targets, from a plurality of sensors at a site; synchronizing the surveillance data to a single time source; maintaining a site model of the site, wherein the site model comprises a site map, a human size map, and a sensor network model; and analyzing the synchronized data using the site model to determine if the view targets represent a same physical object in the site. The method further includes creating a map target corresponding to a physical object in the site, wherein the map target includes at least one view target; receiving a user-defined global event of interest, wherein the user-defined global event of interest is based on the site map and based on a set of rules; detecting the user-defined global event of interest in real time based on a behavior of the map target; and responding to the detected event of interest according to a user-defined response to the user-defined global event of interest.
In another embodiment, the invention may be a computer-readable medium containing software that, when read by a computer, causes the computer to perform a method for wide-area site-based surveillance, the software comprising: a data receiver module, adapted to receive and synchronize surveillance data, including view targets, from a plurality of sensors at a site; and a data fusion engine, adapted to receive the synchronized data, wherein the data fusion engine comprises: a site model manager, adapted to maintain a site model, wherein the site model comprises a site map, a human size map, and a sensor network model; a target fusion engine, adapted to analyze the synchronized data using the site model to determine if the view targets represent a same physical object in the site, and create a map target corresponding to a physical object in the site, wherein the map target comprises at least one view target; and an event detect and response engine, adapted to detect an event of interest based on a behavior of the map target.
A system for the invention includes a computer system including a computer-readable medium having software to operate a computer in accordance with the invention.
An apparatus for the invention includes a computer including a computer-readable medium having software to operate the computer in accordance with the invention.
An article of manufacture for the invention includes a computer-readable medium having software to operate a computer in accordance with the invention.
Exemplary features and advantages of the invention, as well as the structure and operation of various embodiments of the invention, may be described in detail below with reference to the accompanying drawings.
The foregoing and other features of the invention will be apparent from the following, more particular description of exemplary embodiments of the invention, as illustrated in the accompanying drawings wherein like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The left most digits in the corresponding reference number indicate the drawing in which an element first appears.
The following definitions may be applicable throughout this disclosure, including in the above.
A “video” may refer to motion pictures represented in analog and/or digital form. Examples of video may include: television, movies, image sequences from a video camera or other observer, and computer-generated image sequences.
A “frame” may refer to a particular image or other discrete unit within a video.
An “object” may refer to an item of interest in a video. Examples of an object may include: a person, a vehicle, an animal, and a physical subject.
A “target” may refer to a computer model of an object. The target may be derived from the image processing, with a one-to-one correspondence between targets and objects.
A “view” may refer to what a camera may see for a particular camera viewing position. A camera may have multiple views if its position or viewing angle changes.
A “map” or a “site map” may refer to an image or graphical representation of the site of interest. Examples of a map may include: an aerial photograph, a blueprint, a computer graphical drawing, a video frame, or a normal photograph of the site.
A “view target” may refer to a target from each single camera IVS system and the associated site location information, for each camera.
A “map target” may refer to an integrated model of an object on the map. Each map target may at one time correspond to one and only one object in the real world, but may include several view targets.
A “video sensor” may refer to an IVS system which only processes one camera feed. The inputs may be the frame, and outputs may be tracked targets in that particular camera field of view (FOV).
A “fusion sensor” may refer to the present cross-camera site IVS system which may not process raw video frames. The inputs may be view target data from a single IVS system, or may be map target data from other fusion sensors.
A “sensor” may refer to any apparatus for obtaining information about events occurring in a view. Examples include: color and monochrome cameras, video cameras, static cameras, pan-tilt-zoom cameras, omni-cameras, closed-circuit television (CCTV) cameras, charge-coupled device (CCD) sensors, analog and digital cameras, PC cameras, web cameras, tripwire event detectors, loitering event detectors, and infra-red imaging devices. If not more specifically described, a “camera” refers to any sensing device.
A “computer” may refer to any apparatus that is capable of accepting a structured input, processing the structured input according to prescribed rules, and producing results of the processing as output. The computer can include, for example, any apparatus that accepts data, processes the data in accordance with one or more stored software programs, generates results, and typically includes input, output, storage, arithmetic, logic, and control units. Examples of a computer include: a computer; a general purpose computer; a supercomputer; a mainframe; a super mini-computer; a mini-computer; a workstation; a micro-computer; a server; an interactive television; a web appliance; a telecommunications device with internet access; a hybrid combination of a computer and an interactive television; a portable computer; a personal digital assistant (PDA); a portable telephone; and application-specific hardware to emulate a computer and/or software, for example, a programmable gate array (PGA) or a programmed digital signal processor (DSP). A computer can be stationary or portable. A computer can have a single processor or multiple processors, which can operate in parallel and/or not in parallel. A computer may also refer to two or more computers connected together via a network for transmitting or receiving information between the computers. An example of such a computer may include a distributed computer system for processing information via computers linked by a network.
A “computer-readable medium” may refer to any storage device used for storing data accessible by a computer. Examples of a computer-readable medium include: a magnetic hard disk; a floppy disk; an optical disk, such as a CD-ROM and a DVD; a magnetic tape; a memory chip; and a carrier wave used to carry computer-readable electronic data, such as those used in transmitting and receiving e-mail or in accessing a network.
“Software” may refer to prescribed rules to operate a computer. Examples of software include: software; code segments; instructions; software programs; computer programs; and programmed logic.
A “computer system” may refer to a system having a computer, where the computer comprises a computer-readable medium embodying software to operate the computer.
A “network” may refer to a number of computers and associated devices that may be connected by communication facilities. A network may involve permanent connections such as cables or temporary connections such as those made through telephone, wireless, or other communication links. Examples of a network include: an internet, such as the Internet; an intranet; a local area network (LAN); a wide area network (WAN); and a combination of networks, such as an internet and an intranet.
An exemplary embodiment of the invention may be discussed in detail below. While specific exemplary embodiments may be discussed, it should be understood that this may be done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations can be used without parting from the spirit and scope of the invention.
Embodiments of the present invention may be based on existing single camera IVS systems with increased automatic situation awareness capability under both spatial and temporal domains. The input to the system may be content analysis results from multiple individual cameras, such as tracked humans and vehicles. The output may be tracked targets in the site under monitoring and global events detected by the system. In summary, the task of the system may be to perform data fusion on the information from individual sensors and provide more reliable and powerful surveillance capability.
There may be several major challenges to overcome to achieve data fusion from multiple sensor sources. A first challenge may be to determine how to associate the targets from different cameras. There may be multiple cameras in the site under surveillance, and the cameras may be of different types, e.g., static, PTZ, Omni, etc. The individual cameras or sensors usually may be looking at different areas, and they may or may not have overlapping fields of view. When a physical target is detected, it may be detected simultaneously by multiple cameras but with different target IDs. A target may also be detected by the same camera or by different cameras at different times. The inventive system may receive detected targets from different cameras for every sample moment. Reliably associating the different detected targets that correspond to the same physical target may be difficult. In the present invention, several new techniques and an adaptive mechanism may be developed to solve this problem, which supports different levels of availability of prior knowledge of the site and cameras. The new technologies may include: map-based static, PTZ and omni camera calibration methods; camera network traffic models; human relative size maps; appearance-based target verification; and target fusion algorithms.
A second challenge may be to determine how to provide prompt and easily understandable global and local situation awareness. In addition to detecting what a single camera IVS cannot detect, the wide-area multi-sensor IVS also may need to integrate the potentially duplicated events produced by different individual IVS sensors so as not to confuse the operators. For this purpose, embodiments of the present invention may include a general site model, together with a site-based event detector.
A third challenge may be to determine how to support a large number of cameras and sensors. Since the data may come from distributed sensors and possibly out of sequential order, the data may need to be synchronized with a minimum amount of latency. Data communication among cameras and a center unit may be viable, but increasing the number of cameras may cause a bandwidth limitation issue. Embodiments of the present invention may include a scalable architecture developed to remove this potential limitation.
The input data 202 may include the information gathered by lower-level IVS systems, including other cross-camera site IVS systems (e.g., fusion sensors) as well as individual IVS systems (e.g., video cameras). The input data 202 may be targets, video frames, and/or camera coordinates (e.g., pan-tilt-zoom (PTZ) coordinates). In one embodiment, all the sensors may use the same time server; in other words, they may use the same clock. This may be achieved, for example, through network time synchronization. The input data 202 may include a timestamp from the sensor that produced the data. The data receiver 204 may contain an internal buffer for each input sensor. Due to the different processing latencies in each input sensor and the different network transmission delays, the data on the same object at a certain time may arrive at different times from different sensors. A major task of the data receiver 204 may be to synchronize the input data 202 and pass them to the data fusion engine 206. The user interface 208 may be used to obtain necessary information about the site and the system from the user, and provide visual assistance to the operator for better situation awareness. The data fusion engine 206 may build and maintain the site model, integrate the corresponding input map and view targets into the map targets in the current site, detect all the events of interest in the site and perform user-desired responses to these events. Data storage unit 210 may store and manage all the useful information used or generated by the system. Data sender 212 may be in charge of sending controls to any PTZ cameras in the system and sending map targets to the higher level fusion sensors. The output data 214 may be map targets, current site information and/or other camera commands, e.g., PTZ commands.
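The buffering and synchronization role of the data receiver 204 may be illustrated by the following minimal sketch. It assumes that every item carries a timestamp from the shared clock and that a fixed latency budget is acceptable; the class name `DataReceiver`, the `max_latency` parameter and the release policy are hypothetical and are not taken from this disclosure.

```python
import heapq


class DataReceiver:
    """Buffers timestamped data per sensor and releases it in time order.

    Hypothetical sketch: items are held for `max_latency` seconds so that
    late arrivals from slower sensors can still be ordered correctly.
    """

    def __init__(self, sensor_ids, max_latency=0.5):
        self.max_latency = max_latency
        # One min-heap per input sensor, keyed by timestamp.
        self.buffers = {sid: [] for sid in sensor_ids}
        self._seq = 0  # tie-breaker so payloads are never compared

    def push(self, sensor_id, timestamp, payload):
        self._seq += 1
        heapq.heappush(self.buffers[sensor_id], (timestamp, self._seq, payload))

    def pop_ready(self, now):
        """Return all items older than the latency budget, sorted by time."""
        cutoff = now - self.max_latency
        ready = []
        for sid, buf in self.buffers.items():
            while buf and buf[0][0] <= cutoff:
                ts, _, payload = heapq.heappop(buf)
                ready.append((ts, sid, payload))
        ready.sort(key=lambda item: (item[0], item[1]))
        return ready
```

In such a sketch, a larger `max_latency` would tolerate slower sensors at the cost of delaying the data passed to the data fusion engine 206.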
The map calibration features may be a list of pairs of matching map features and image features. The map calibration features may be optional input, and may be needed only when there are enough matching features observable on both the map and the video frame. Here, the control feature may refer to an image feature on that map having an easily-identified corresponding feature in the video frame.
Camera information may refer to the specific properties of each camera, such as camera type, map location, lens specifications, etc.
Camera relationship description may be needed when both the site map and the camera information are lacking. The relationship description provides the normal entry/exit regions in each camera view and each potential path of a target moving from one camera view to another camera view.
Besides the above system information, the user may specify global event rules (e.g., what event may be of interest), and event response configurations (e.g., how the system should respond to these events). Embodiments of the present invention may provide a wide range of visual information in addition to the source videos. The system may, for example, mark up the targets in both the source video frame and the site map in real-time; display the camera locations on the map and their fixed (for static cameras) or moving (for PTZ cameras) fields of view; and display an alert once the event is triggered.
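As one illustration of a global event rule, the following sketch checks a site-wide loitering rule against a map target whose sightings come from several cameras. The `MapTarget` structure, the `max_gap_s` parameter and the fifteen-minute default are assumptions chosen to mirror the loitering example given in the Related Art section; they are not a definition of the rule engine of the invention.

```python
from dataclasses import dataclass, field


@dataclass
class MapTarget:
    """Hypothetical map target holding sightings merged from all cameras."""
    target_id: int
    sightings: list = field(default_factory=list)  # (timestamp_s, camera_id)


def loitering_seconds(target, max_gap_s=60.0):
    """Longest continuous presence on the site, regardless of camera."""
    times = sorted(t for t, _ in target.sightings)
    if not times:
        return 0.0
    best = 0.0
    start = prev = times[0]
    for t in times[1:]:
        if t - prev > max_gap_s:
            start = t  # the target left the site; start a new visit
        prev = t
        best = max(best, prev - start)
    return best


def is_loitering(target, min_duration_s=15 * 60, max_gap_s=60.0):
    return loitering_seconds(target, max_gap_s) > min_duration_s
```

Because the sightings are pooled per map target rather than per view, a target that never stays in one camera view for more than two minutes may still trigger the rule.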
Traditional camera calibration may be performed by viewing a three-dimensional (3D) reference object with a known Euclidean structure. An example of this approach is described, for example, in R. Y. Tsai. “A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses,” IEEE Journal of Robotics and Automation, 3(4):323-344, August 1987, which may be herein incorporated by reference. This type of technique may yield the best results if the 3D geometry of the reference object is known with high accuracy. In addition, this type of technique may be directly applicable to multi-camera systems by simply repeating the calibration process independently for each camera. However, setting up the 3D reference object with great accuracy may be an elaborate task that requires special equipment and becomes more difficult as the dimensions of the view volume increase.
To reduce such difficulties, a simple and practical camera calibration technique using a model plane with a known 2D reference pattern was proposed independently by P. F. Sturm and S. J. Maybank, “On Plane-Based Camera Calibration: A General Algorithm, Singularities, Applications,” Proc. Computer Vision and Pattern Recognition, volume 1, pages 432-437, 1999, and Z. Zhang, “Flexible Camera Calibration by Viewing a Plane from Unknown Orientations,” Proc. 7th International Conference on Computer Vision, volume 1, pages 666-673, 1999, both of which may be incorporated herein by reference. In this technique, the user may freely place the model plane or the camera at two or more locations and capture images of the reference points. Camera parameters may be recovered from homographs between the model plane and the image plane computed from correspondences between the reference points and their projections. A homograph (more commonly called a homography) may be a matrix associating two planes in 3D space. Although this algorithm may be simpler and may yield good results when calibrating a camera, it may be mainly used for indoor and/or close range applications, where the pattern object captured by the camera may be big enough that the features of the pattern may be easily and accurately detected and measured. Many of the wide area IVS systems may be outdoor applications. To calibrate the camera using the above 2D model plane, a significantly large object may be needed to get the required accuracy. Also, this extra calibration procedure may not be allowed due to physical or cost constraints. These factors may make this model-based calibration impractical for a wide range of commercial applications.
Embodiments of the present invention may use new methods to quickly and accurately extract the physical location and size of a target, as well as to guide the PTZ cameras in the site to focus on the targets of interest. Here, the site-model manager may need to provide three types of information: the site map location of each view target; the actual size of the target; and the object traffic model of the site. These types of information may be stored in map-view mappings 604, human size maps 608 and camera network models 612, which may be created and managed by map-based calibrator 602, view-based calibrator 606 and camera network model manager 610, respectively.
The most commonly used calibration features are matching pairs of points, which are often called control points. Points provide unambiguous matching pairs between the map and the camera view. However, there are potential problems with using matching points for calibration. One problem is that it may be difficult to find the precise corresponding point locations in some environments due to limited resolution, visibility, or the angle of view. As an example, looking at an overhead view of a road, the corner points of the broken lane dividing lines theoretically provide good calibration targets. However, it may be difficult to reliably determine which lane dividing line segment of the map view corresponds to which line segment in the camera view.
Another problem is that the precision of each pair of matching points is usually unknown. The precision of a point measures the sensitivity of the accuracy of the map matching location with respect to the accuracy of the image frame location. For example, at one location in the camera image plane, one pixel movement away from that location may cause 100 pixels of movement away from its original corresponding location on the map. This means the precision of the matching point pair is low. When the camera view is calibrated onto the map, the distance between these matching point pairs is minimized. Different weights may be placed on these points based on the precision of their location measurement. The points with higher precision may have larger weights. Without knowing the precision of the point locations, the same weight may have to be assigned to each point. Thus in some cases, a pair of matching points with poor precision may cause the calibration results to be very unstable. Such poor precision points should have a small weight, or may even be excluded from the set of points used to compute the calibration parameters.
Embodiments of the present invention provide methods to overcome the afore-mentioned problems by applying, for example, more image features for image to map calibration in addition to the matching points. Such new calibration features include, but are not limited to, matching lines or matching convex curves.
For the purposes of calibration, a line may be represented by two end points, while a convex curve may be represented by a sequence of ordered pivot points and each convex curve contains one and only one convex corner. An arbitrary curve may be segmented into a set of convex curves.
The use of these new calibration features allows a camera view to be calibrated onto the map even when there are insufficient matching point features available. In such cases, more complex features such as lines or convex curves may be easier to define and locate. As an example, on a road, it may be difficult to find exact point correspondences, however, lines like the boundaries of the road or the center dividing lines may be well defined and observable both on the map and the camera view. As another example, corner features may be good candidates for calibration points, but if a corner is not “sharp” enough, for example, if the corner is actually an arc shape, it would be difficult for the operator to locate the exact corner point. In this case, the user may use a convex curve by selecting some pivot points on the curve as a calibration feature.
A calibration method according to embodiments of the present invention may require only that the user-defined corresponding line segments or convex curves represent the same line or same curve in reality. The user may not be required to select the exact starting or ending or corner corresponding points of these features. The new calibration features may also provide a precision measure of each matching feature, which in turn may enable the system to choose the calibration control points selectively to provide more stable and optimized calibration results. In addition, line and curve features may cover a much larger area than a single point feature, thus a good matching line or curve feature may at least provide an accurate local calibration for a much bigger local area.
In producing the initial calibration control points from convex curves, each pair of convex curve features may provide one pair of control points. Each input convex curve feature may include a list of ordered pivot points describing the curve. The list should contain one and only one convex corner point, which may be used as the control point. To locate this convex corner point, a continuous convex curve may be first obtained by curve fitting using the ordered pivot points. Next, a pair of start and end pivot points, S and E, respectively, may be used as the two end points. To find the convex corner point, the convex curve may be searched to find the location P where the angle SPE is minimal. Point P may be considered as the convex corner location and used as a control point.
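The corner search described above may be sketched as follows. For simplicity, this sketch replaces the curve-fitting step with piecewise-linear resampling of the ordered pivot points, which is an assumption of the example rather than the method of the invention; the function name `convex_corner` and the parameter `samples_per_segment` are likewise hypothetical.

```python
import numpy as np


def convex_corner(pivots, samples_per_segment=20):
    """Return the point P on the curve where the angle S-P-E is smallest.

    `pivots` is an ordered list of (x, y) points; the first and last
    entries are taken as the end points S and E.
    """
    pts = np.asarray(pivots, dtype=float)
    s, e = pts[0], pts[-1]
    # Resample each pivot-to-pivot segment (stand-in for curve fitting).
    dense = []
    for a, b in zip(pts[:-1], pts[1:]):
        for t in np.linspace(0.0, 1.0, samples_per_segment, endpoint=False):
            dense.append(a + t * (b - a))
    dense = np.array(dense[1:])  # drop S; E is never sampled
    vs = s - dense  # vectors from each candidate P to S
    ve = e - dense  # vectors from each candidate P to E
    cos = np.sum(vs * ve, axis=1) / (
        np.linalg.norm(vs, axis=1) * np.linalg.norm(ve, axis=1)
    )
    angles = np.arccos(np.clip(cos, -1.0, 1.0))
    return dense[np.argmin(angles)]
```

For an L-shaped list of pivot points, the returned point would be the bend of the L, where the angle SPE is 90 degrees and every other sampled point sees S and E under a wider angle.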
To produce the initial calibration control points from lines, the lines used may be line features directly specified by the user, or they may be lines derived from two input matching point features. The intersection of any two lines may be considered as a potential calibration control point.
The precision of the location of the point of intersection may be estimated by the following procedure. Using the point of intersection on the original map as the reference point, add small random Gaussian noise with zero mean and small standard deviation (e.g. 0.5) to the end points of all the related lines on the map. Recompute the point of intersection with the adjusted end points. Calculate the distance between the new point of intersection and the reference point of intersection. This random noise is used to simulate the potential point location error introduced by the operator feature selection process. Repeat the adjustment and recomputation for a statistically significant number of iterations (e.g., more than 100) and compute the mean distance.
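The precision estimate described above may be sketched as a small Monte Carlo simulation. The function names, the fixed random seed and the default of 200 iterations are assumptions of this example; the standard deviation of 0.5 follows the example value given in the text.

```python
import numpy as np


def intersect(l1, l2):
    """Intersection of two infinite lines, each given by two (x, y) points.

    Returns None when the lines are numerically parallel.
    """
    (x1, y1), (x2, y2) = l1
    (x3, y3), (x4, y4) = l2
    d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(d) < 1e-12:
        return None
    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    return np.array([
        (a * (x3 - x4) - (x1 - x2) * b) / d,
        (a * (y3 - y4) - (y1 - y2) * b) / d,
    ])


def intersection_precision(l1, l2, sigma=0.5, iterations=200, seed=0):
    """Mean displacement of the intersection under end-point noise.

    A smaller value indicates a more precise candidate control point.
    """
    rng = np.random.default_rng(seed)
    ref = intersect(l1, l2)
    if ref is None:
        return float("inf")
    l1 = np.asarray(l1, dtype=float)
    l2 = np.asarray(l2, dtype=float)
    dists = []
    for _ in range(iterations):
        # Perturb all four end points with zero-mean Gaussian noise.
        p = intersect(l1 + rng.normal(0.0, sigma, (2, 2)),
                      l2 + rng.normal(0.0, sigma, (2, 2)))
        if p is not None:
            dists.append(np.linalg.norm(p - ref))
    return float(np.mean(dists)) if dists else float("inf")
```

Two nearly parallel lines would be expected to yield a much larger mean distance than two roughly perpendicular lines, which matches the intuition that their intersection is a poor control point.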
The precision of the point of intersection is high when the mean distance is small. The point of intersection may be used as a control point only if the mean distance is less than a threshold distance. This threshold may be determined by the user based on the availability of other higher precision control points and on the desired accuracy of the calibration. For example, one threshold the user may use is the average dimension of the target of interest on the map, for example, 2 meters if the targets of interest are humans and cars, or 5 meters if big trucks are the major targets of interest. One example of this control point generation from matching features is demonstrated in
After the control points are determined, in block 1904 the image plane to map plane homograph is computed using a Direct Linear Transformation (DLT) algorithm, which may use the least-squares method to estimate the transform matrix. This DLT algorithm and other homograph estimating algorithms are available in the camera calibration literature. Here, the existing calibration methods only provide the best solution with minimum mean-square error for the control points. The results may be very sensitive to the location error of the control points, especially if the number of the control points is small or the points are clustered in a small portion of the image frame.
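A basic DLT estimate of the image-to-map homograph may be sketched as follows. This is a minimal, unweighted version that solves the homogeneous least-squares problem with a singular value decomposition; it omits coordinate normalization and any per-point weighting, and the function names `estimate_homography` and `project` are hypothetical.

```python
import numpy as np


def estimate_homography(image_pts, map_pts):
    """Estimate the 3x3 matrix H mapping image points to map points.

    Requires at least four point pairs with no three points collinear.
    """
    rows = []
    for (x, y), (u, v) in zip(np.asarray(image_pts, dtype=float),
                              np.asarray(map_pts, dtype=float)):
        # Each pair contributes two linear equations in the entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The least-squares solution is the right singular vector that
    # belongs to the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]  # assumes h[2, 2] is non-zero


def project(h, pts):
    """Apply homography `h` to an array of (x, y) points."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ h.T
    return homog[:, :2] / homog[:, 2:3]
```

With only a few control points, or with points clustered in a small part of the frame, the estimate from such a sketch would be sensitive to location errors, as noted above.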
The calibration may be improved iteratively by using matching line and convex curve features. After each round of homograph computation, the feature matching errors on the map plane are computed in block 1906. In block 1908, the reduction of the matching error is compared to the last round. If the improvement is insignificant, which may indicate that the process is converging to its optimal result, the iteration may be stopped. Otherwise, in block 1910, some control points may be added or adjusted based on the error measurement. The adjusted control point list is used in block 1904 to perform the next iteration of homograph estimation. Since a line segment or a convex curve may be a more representative feature than a single point, and a line or curve location is more reliable than that of a single point, this iterative refinement process may be very effective to reduce calibration errors and may converge to the optimal result rapidly. More details on line-based calibration error computation and control point adjustments will be illustrated below with respect to
Points p1 and p2 may be directly considered as control points and denoted as P1 and P2. From the list of pivot points p3 through p7 for the convex curve C1, the convex corner point P3 may be extracted as a control point using the method described earlier. The rest of the control points may come from the crossing points of the input lines L1, L2, L3 and the derived line L4. Derived line L4 is formed by the two input point features p1 and p2. In an ideal case, four lines may provide six points of intersection, but in the example shown, L2 and L3 are almost parallel and their point of intersection is close to infinity, which has a very low precision measure. For this reason, the intersection of L2 and L3 must be excluded from the set of control points. Therefore, in this example, eight initial control points may be extracted: two from the two input point features (P1 and P2), one from the convex curve C1 (P3), and five from the intersections of the three user-defined lines (L1, L2, L3) and the one derived line L4 (P4, P5, P6, P7, P8).
The map-view mapping 604 may provide the physical location information of each view target, especially when the view target footprint may be used as the location and velocity estimation. But since the site map may be a 2D representation, the exact 3D dimensions of the target may be still lacking. This physical size information may be used to perform tasks such as target classification. Human size map 608 may be used to provide the physical size information of each map target. The human size map 608 may be a frame size lookup table that shows, on each image position, what the expected average human image height and image area are. To estimate the physical size of an image target, the target's relative size may be compared to the expected average human size at that image position. The target's relative size may be converted to an absolute physical size using an estimated average human size, for example, 1.75 meters in height and 0.5 meters in width and depth.
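The size conversion described above may be sketched as a lookup followed by a ratio. The sketch assumes the human size map is stored as a 2D array of expected human image heights in pixels and is indexed at the target footprint; both the storage layout and the function name `estimate_height_m` are assumptions of this example.

```python
import numpy as np

AVG_HUMAN_HEIGHT_M = 1.75  # example average human height from the text


def estimate_height_m(human_height_map, foot_xy, target_height_px):
    """Convert an image height in pixels to an approximate height in meters.

    `human_height_map[y, x]` holds the expected image height, in pixels,
    of an average human standing at image position (x, y).
    """
    x, y = foot_xy
    expected_px = human_height_map[y, x]
    if expected_px <= 0:
        return None  # no size information learned for this position
    relative_size = target_height_px / expected_px
    return AVG_HUMAN_HEIGHT_M * relative_size
```

For instance, a target twice as tall in the image as the expected human at the same position would be estimated at about 3.5 meters, which may suggest a vehicle rather than a person.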
There may be at least two methods of producing this human size map for each camera. First, if a camera-to-map calibration is available, the map may be generated by projecting the 3D human object back onto the image.
Second, if no camera-to-map calibration is available, the human size map may be generated by self-learning. In self-learning, human detection and tracking may be performed on each sensor. As shown in
Embodiments of the wide-area site-based IVS system of the present invention may support a flexible site model. The site map that the system supports may be in several formats. For example,
If no site map is available, the user may need to provide the camera connection information through the GUI, which may be used by the system to produce the camera network model 612 in the background. An example of camera connection information may be illustrated in
After all the existing view targets have been updated with current location and size information, the view target fusion module looks for any stable new view targets to see if they belong to any existing map targets. If the new view target matches an existing map target, it will be merged into this map target in block 1604, and trigger the map target update in block 1606. Otherwise, the system may produce a new map target based on the new view target. The matching measure between two targets may be the combination of three probabilities: the location matching probability, the size matching probability and the appearance matching probability. The location matching probability may be estimated using the target map location from the map view mapping and the camera network traffic model. The size matching probability may be computed from the relative human size value of each target. The appearance matching probability may be obtained by comparing the two appearance models of the targets under investigation.
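One possible way to combine the three probabilities is sketched below. The text does not specify how they are combined, so the product used here (which treats the three cues as independent) and the acceptance threshold of 0.5 are assumptions of the example, as are the function names.

```python
def match_score(p_location, p_size, p_appearance):
    """Combined matching measure; the product is an assumed combination."""
    return p_location * p_size * p_appearance


def assign_view_target(candidates, threshold=0.5):
    """Pick the best-matching existing map target for a new view target.

    `candidates` maps a map target id to its (p_location, p_size,
    p_appearance) tuple. Returns None when no score reaches `threshold`,
    in which case a new map target would be created.
    """
    best_id, best_score = None, threshold
    for target_id, probs in candidates.items():
        score = match_score(*probs)
        if score >= best_score:
            best_id, best_score = target_id, score
    return best_id
```

A returned id would correspond to the merge in block 1604, while a result of None would correspond to producing a new map target.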
The appearance model of an exemplary embodiment may be a distributed intensity histogram, which includes multiple histograms for different spatial partitions of the target. The appearance matching may be the average correlation between the corresponding spatial partitioned histograms. The tasks of the map target update process in block 1606 may be to determine the primary view target and update the general target properties such as map location, velocity, classification type and stability status, etc. Since target occlusion may cause significant map location estimation errors, a map target also needs to be tested for whether it actually corresponds to another existing map target when the map target switches from one stable status to a different stable status. A stable status means the target has consistent shape and size in a temporal window. One map target may have multiple different stable periods due to occlusions. The map target fusion module 1608 may merge two matching map targets into the one map target that has a longer history.
In block 1706, the automatic target close-up monitoring using PTZ cameras may be performed. Once one target triggers any map-based event, the user may require a PTZ camera to zoom in and follow the target as one type of event response. Based on the target map location and the user-required image target resolution, the system may determine the pan, tilt and zoom level of a dedicated PTZ camera and may control the camera to follow the target of interest. In addition, when multiple PTZ cameras exist, the hand-off from one PTZ camera to another PTZ camera in the site may also be supported. This may be achieved by automatically selecting the camera which can provide the required target coverage with the smallest zoom-in level. A larger zoom-in value usually makes the video more sensitive to camera stabilization and PTZ command latencies, which may be undesirable for this application scenario.
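The distributed intensity histogram and its matching score may be sketched as follows. The sketch assumes a grayscale target patch split into horizontal bands and uses the Pearson correlation coefficient as the correlation measure; the number of partitions, the number of bins and the function names are assumptions of this example.

```python
import numpy as np


def partition_histograms(patch, parts=3, bins=16):
    """Normalized intensity histograms for horizontal bands of a patch."""
    bands = np.array_split(np.asarray(patch), parts, axis=0)
    hists = []
    for band in bands:
        h, _ = np.histogram(band, bins=bins, range=(0, 256))
        hists.append(h / max(h.sum(), 1))
    return np.array(hists)


def appearance_match(hists_a, hists_b):
    """Average correlation between corresponding partition histograms."""
    scores = []
    for a, b in zip(hists_a, hists_b):
        if a.std() == 0 or b.std() == 0:
            scores.append(0.0)  # correlation undefined for flat histograms
        else:
            scores.append(np.corrcoef(a, b)[0, 1])
    return float(np.mean(scores))
```

Keeping one histogram per spatial band means that, for example, a person with a dark shirt and light trousers would not match a person with the opposite clothing, even though their overall intensity histograms may be similar.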
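The camera selection for hand-off may be sketched as follows. The sketch assumes a simple pinhole-style model in which image resolution on the target scales linearly with zoom and inversely with distance; the `PtzCamera` fields, the `px_per_m_at_1m` constant and the function names are hypothetical and do not describe any real camera API.

```python
import math
from dataclasses import dataclass


@dataclass
class PtzCamera:
    name: str
    x: float               # camera map location, in meters
    y: float
    px_per_m_at_1m: float  # pixels per meter on a target 1 m away at 1x zoom
    max_zoom: float

    def required_zoom(self, tx, ty, px_per_m):
        """Zoom level needed to image a target at (tx, ty) at `px_per_m`."""
        distance = math.hypot(tx - self.x, ty - self.y)
        return max(1.0, px_per_m * distance / self.px_per_m_at_1m)


def select_camera(cameras, tx, ty, px_per_m):
    """Return (zoom, camera) for the camera needing the smallest zoom.

    Returns None if no camera can reach the requested resolution.
    """
    options = [(c.required_zoom(tx, ty, px_per_m), c) for c in cameras]
    options = [(z, c) for z, c in options if z <= c.max_zoom]
    if not options:
        return None
    return min(options, key=lambda zc: zc[0])
```

Re-running such a selection as the target moves would hand the target off to whichever camera currently needs the least zoom, which in this simple model is the nearest capable camera.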
Due to the limited computing power and data bandwidth, one data fusion engine 206 may not be able to handle an unlimited number of inputs. Advantages of the present invention include that it may provide high scalability, and may be easily expandable to monitor bigger areas involving more cameras.
All examples and embodiments discussed herein are exemplary and non-limiting.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should instead be defined only in accordance with the following claims and their equivalents.
This application is a continuation-in-part, and claims the benefit, of U.S. application Ser. No. 11/098,579, filed Apr. 5, 2005, of common assignee, entitled “Wide-Area Site-Based Video Surveillance System,” the contents of which are incorporated by reference in their entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 11098579 | Apr 2005 | US |
Child | 11397930 | | US |