1. Field of the Invention
This invention relates generally to methods and system configurations for designing and implementing video surveillance systems. More particularly, this invention relates to an improved off-line calibration process and an on-line selective focus-of-attention procedure for providing wide-area video-based surveillance with selective focus-of-attention using stationary-dynamic camera assemblies. The goal is to enhance the functionality of such a surveillance system by combining static and dynamic surveillance, providing improved dynamic tracking with increased resolution and thereby more effective protection of limited-access areas.
2. Description of the Prior Art
Conventional system configurations and methods for providing security surveillance of limited-access areas are still confronted with the difficulty that the video images have poor resolution and allow very limited flexibility in the control of tracking and focus adjustments. Existing surveillance systems implemented with video cameras provide object-tracking capabilities to follow the movements of persons or objects. However, the resolution and focus adjustments are often inadequate to provide images of sufficient quality to carry out the security functions currently required for controlling access to protected areas.
There has been a surge in the number of surveillance cameras placed in service in the two years since the September 11th attacks. Closed Circuit Television (CCTV) has grown significantly, from a tool used by companies to protect personal property to one used by law enforcement authorities for surveillance of public places. US policymakers, especially in the security and intelligence services, are increasingly turning toward video surveillance as a means to combat terrorist threats and as a response to the public's demand for security. However, important research questions must be addressed before video surveillance data can reliably provide an effective tool for crime prevention.
In carrying out video surveillance to achieve a large area of coverage with a limited supply of hardware, it is often desirable to configure the surveillance cameras in such a way that each camera watches over an extended area. However, if suspicious persons/activities are identified through video analysis, it is then often desirable to obtain close-up views of the suspicious subjects for further scrutiny and potential identification (e.g., to obtain a close-up view of the license plate of a car or the face of a person). These two requirements (a large field-of-view and the ability to perform selective focus-of-attention) oftentimes place conflicting constraints on the system configuration and camera parameters. For instance, a large field-of-view is achieved using a lens of a short focal length, while selective focus-of-attention requires a lens of a long focal length.
Specifically, since any trespass into a limited-access area involves dynamically changing circumstances, with persons and objects continuously moving, the ability to perform dynamic tracking of movement, to determine the positions of persons and objects, and to carry out focus adjustment according to these positions is critical. Additionally, methods and configurations must be provided to produce clear images with sufficient resolution such that the required identity checking and subsequent security actions may be taken accordingly.
Venetianer, et al. disclose in U.S. Pat. No. 6,696,945, entitled “Video Tripwire”, a method for implementing a video tripwire. The method includes calibrating a sensing device to determine sensing device parameters for use by a control computer. The controlling computer system then initializes the system, which includes entering at least one virtual tripwire; obtains data from the sensing device; analyzes that data to determine whether the at least one virtual tripwire has been crossed; and triggers a response to a virtual tripwire crossing. Venetianer et al., however, do not address the difficulty faced by conventional video surveillance systems that such systems are unable to obtain clear images, with sufficient resolution, of a dynamically moving object.
A security officer is now frequently faced with the need to monitor different security areas at once. On the one hand, it is necessary to monitor larger areas to understand the wide field of view. On the other hand, when there is suspicious activity, it is desirable at the same time to use another camera, or the same camera, to zoom in on the activity and gather as much information as possible about the suspects. Under these circumstances, conventional video surveillance technology is still unable to provide an automated way to assist a security officer in effectively monitoring secure areas.
Therefore, a need still exists in the art for video surveillance of protected areas with improved system configurations and with focus-of-attention adjustments kept in synchronization with dynamic tracking of movements, such that the above-mentioned difficulties and limitations may be resolved.
It is therefore an object of the present invention to provide improved procedures and algorithms for calibrating and operating stationary-dynamic camera assemblies in a surveillance system to achieve wide-area coverage and selective focus-of-attention.
It is another object of the present invention to provide an improved system configuration for configuring a video surveillance system that includes a stationary-dynamic camera assembly operated in a cooperative and hierarchical process such that the above discussed difficulties and limitations can be overcome.
Specifically, this invention discloses several preferred embodiments, implemented with procedures and software modules, that provide accurate and efficient results for calibrating both stationary and dynamic cameras in a camera assembly and that allow a dynamic camera to correctly focus on suspicious subjects identified by the companion stationary cameras.
Particularly, an object of this invention is to provide an improved video surveillance system by separating the surveillance functions and by assigning different surveillance functions to different cameras. A stationary camera is assigned the surveillance of a large area and the tracking of object movement, while one or more dynamic cameras are provided to dynamically rotate and adjust focus to obtain clear images of the moving objects detected by the stationary camera. Algorithms to adjust the focus-of-attention are disclosed that effectively carry out the tasks of a dynamic camera, under the command of a stationary camera, to obtain images of a moving object with clearly detectable features.
Briefly, in a preferred embodiment, the present invention includes (1) an off-line calibration module and (2) an on-line focus-of-attention module. The off-line calibration module positions a simple calibration pattern (a checkerboard pattern) at multiple distances in front of the stationary and dynamic cameras. The 3D coordinates and the corresponding 2D image coordinates are used to infer the extrinsic and intrinsic camera parameters. The on-line process involves identifying a target (e.g., a suspicious person identified by the companion stationary cameras through some pre-defined activity analysis) and then using the pan, tilt, and zoom capabilities of the dynamic camera to correctly center on the target and magnify the target images to increase resolution.
In another preferred embodiment, the present invention includes a video surveillance system that utilizes at least two video cameras performing surveillance through a cooperative and hierarchical control process. In a preferred embodiment, the two video cameras include a first video camera functioning as a master camera for commanding a second video camera functioning as a slave camera. A control processor, which may be embodied in a computer, controls the functioning of the cameras. In a preferred embodiment, at least one of the cameras is mounted on a movable platform. In another preferred embodiment, at least one of the cameras has the flexibility of multiple degrees of freedom (DOFs), which may include a rotational freedom to point in different angular directions. In another preferred embodiment, at least one of the cameras is provided to receive a command from another camera to automatically adjust a focal length. In another preferred embodiment, the surveillance system includes at least three cameras arranged in a planar or collinear configuration. In another preferred embodiment, the surveillance system comprises at least three cameras, with one stationary camera and two dynamic cameras disposed on either side of the stationary camera.
This invention discloses a method for off-line calibration using an efficient, robust, and closed-form numerical solution, and a method for on-line selective focus-of-attention using a visual servo principle. In particular, the calibration procedure will, for both stationary and dynamic cameras, correctly compute the camera's pose (position and orientation) in the world coordinate system and will estimate the focal length of the lens used, along with the aspect ratio and center offset of the camera's CCD. The calibration procedure will, for dynamic cameras, correctly and robustly estimate the pan and tilt degrees-of-freedom, including axis position, axis orientation, and angle of rotation as functions of focal length. The selective focus-of-attention procedure will compute the correct pan and tilt maneuvers needed to center a suspect in the dynamic cameras, regardless of whether the optical center is located on the rotation axes and whether the rotation axes are properly aligned with the width and height of the camera's CCD array.
In a preferred embodiment, this invention further discloses a method for configuring a surveillance video system by arranging at least two video cameras, with one of the cameras functioning as a stationary camera that commands and directs a dynamic camera to move and adjust focus to obtain detailed features of a moving object.
These and other objects and advantages of the present invention will no doubt become obvious to those of ordinary skill in the art after having read the following detailed description of the preferred embodiment, which is illustrated in the various drawing figures.
Specifically, the scenario addressed in this application is one where multiple cameras, or multiple camera surveillance functions carried out by a single camera processed by a surveillance controller, are used for monitoring an extended surveillance area. This can be an outdoor parking lot, an indoor arrival/departure lounge in an airport, a meeting hall in a hotel, etc. In order to achieve a large area of coverage with a limited supply of hardware, it is often desirable to configure the surveillance cameras in such a way that each camera covers a large field of view. However, if suspicious persons/activities are identified through video analysis, it is then often desirable to obtain close-up views of the suspicious subjects for further scrutiny and potential identification, e.g., to obtain a close-up view of the license plate of a car or the face of a person. These two requirements, i.e., a large field of view and the ability to perform selective focus-of-attention, oftentimes impose conflicting constraints on the system configuration and camera parameters. For instance, a large field-of-view is achieved using a lens of a short focal length, while selective focus-of-attention requires a lens of a long focal length.
To satisfactorily address these system design issues, a system configuration as disclosed in a preferred embodiment is to construct highly compartmentalized surveillance stations and employ these stations in a cooperative and hierarchical manner to achieve large areas of coverage and selective focus-of-attention. In the present invention, an extended surveillance area, e.g., an airport terminal building, is partitioned into partially overlapped surveillance zones. While the shape and size of a surveillance zone depend on the particular locale, and the number of stationary cameras used and their configuration may vary, the requirement is that the fields-of-view of the stationary cameras deployed should collectively cover the whole surveillance area (a small amount of occlusion by architectural fixtures, decorations, and plantings is unavoidable) and overlap partially to facilitate registration and correlation of events observed by multiple cameras.
Within each surveillance zone, multiple camera groups, e.g., at least one group, should be deployed. Each camera group will comprise at least one stationary camera and one or more dynamic cameras. The cameras in the same group will be connected to a PC, or to multiple networked PCs, which performs a significant amount of video analysis. The stationary camera will have a fixed platform and fixed camera parameters such as the focal length. The dynamic cameras will be mounted on a mobile platform. The platform should provide at least the following degrees of freedom: two rotational degrees of freedom (DOFs) and another DOF for adjusting the focal length of the camera. Other DOFs are desirable but optional. As disclosed in this Application, it is assumed that the rotational DOFs comprise a pan and a tilt. When the camera is held upright, the panning DOF corresponds roughly to a “left-right” rotation of the camera body and the tilting DOF corresponds roughly to a “top-down” rotation of the camera body. However, there is no assumption that such “left-right” and “top-down” motion has to be precisely aligned with the width and height of the camera's CCD array, and there is no assumption that the optical center has to be on the rotation axes.
The relative position of the stationary and dynamic cameras may be collinear or planar for the sake of simplicity. For example, when multiple dynamic cameras are deployed, they could be placed on the two sides of the stationary one. If more than two dynamic cameras are deployed, some planar grid configuration, with the stationary camera in the center of the grid and the dynamic cameras arranged in a satellite configuration around it, should be considered. The exact spacing among the cameras in a camera group can be locale- and application-dependent. One important tradeoff is that there should be sufficient spacing between cameras to ensure an accurate depth computation while maintaining a large overlap of the cameras' fields-of-view. The deployment configuration of multiple camera groups in the same surveillance zone is also application-dependent. However, the placement is often dictated by the requirement that, collectively, the fields-of-view of the cameras should cover the whole surveillance zone. Furthermore, the placement of camera groups in different zones should be such that some overlap exists between the fields-of-view of the cameras in spatially adjacent zones. This will ensure that motion events can be tracked reliably across multiple zones and that smooth handoff policies from one zone to the next can be designed.
As a simple example, in an airport terminal building with multiple terminals, each surveillance zone might comprise the arrival/departure lounge of a single terminal. Multiple camera groups can be deployed within each zone to ensure a complete coverage. Another example is that multiple camera groups can be used to cover a single floor of a parking structure, with different surveillance zones designated for different floors.
Technical issues related to the configuration, deployment, and operation of a multi-camera surveillance system as envisioned above are disclosed, and several of these issues are addressed in this Patent Application. To name a few of them:
Issues Related to the Configuration of the System
Some of the preferred embodiments of this Patent Application disclose practical implementations and methods to operate the stationary-dynamic cameras in the same camera group to achieve selective and purposeful focus-of-attention. Extension of these disclosures to multiple groups certainly falls within the scope of this invention. While the disclosures below address some of the issues in particular, the solutions to the remaining issues are likely covered by this scope as well, since those issues may be dealt with by those of ordinary skill in the art after reviewing and understanding the disclosures of this Patent Application.
For practical implementations, it is assumed that the stationary camera 110 performs a global, wide field-of-view analysis of the motion patterns in a surveillance zone. Based on some pre-specified criteria, the stationary camera is able to identify suspicious behaviors and subjects that need further attention. These behaviors may include loitering around sensitive or restricted areas, entering through an exit, leaving packages behind unattended, driving in a zigzag or intoxicated manner, circling an empty parking lot or a building in a suspicious, reconnoitering way, etc. The question is then how to direct the dynamic cameras to obtain detailed views of the subjects/behaviors under the guidance of the master.
Briefly, the off-line calibration algorithm discloses a closed-form solution that accurately and efficiently calibrates all DOFs of a pan-tilt-zoom (PTZ) dynamic camera. The on-line selective focus-of-attention is formulated as a visual servo problem. This formulation has the special advantage that it remains applicable even with dynamic changes in scene composition and varying object depths, and it does not require the tedious and time-consuming calibration that other methods do.
In this particular application, video analysis and feature extraction processes are not described in detail, as these analyses and processes are covered by several standard video analysis, tracking, and localization algorithms. More details are provided below on the following two techniques: (1) an algorithm for camera calibration and pose registration for both the stationary and dynamic cameras, and (2) an algorithm for visual servo and error compensation for selective focus-of-attention.
In order to better understand the video surveillance and video camera calibration algorithms disclosed below, background technical information is first provided. Extensive research has been conducted in video surveillance. To name a few works, the Video Surveillance and Monitoring (VSAM) project at CMU (R. Collins, A. Lipton, H. Fujiyoshi, and T. Kanade, “Algorithms for Cooperative Multisensor Surveillance,” Proceedings of the IEEE, Vol. 89, 2001, pp. 1456-1477) has developed a multi-camera system that allows a single operator to monitor activities in a cluttered environment using a distributed sensor network. This work has laid the technological foundation for a number of start-up companies. The Sphinx system (Gang Wu, Yi Wu, Long Jiao, Yuan-Fang Wang, and Edward Chang, “Multi-camera Spatio-temporal Fusion and Biased Sequence-data Learning for Security Surveillance,” Proceedings of the ACM Multimedia Conference, Berkeley, Calif., 2003), reported by researchers at the University of California, is a multi-camera surveillance system that addresses motion event detection, representation, and recognition for outdoor surveillance. W4 (I. Haritaoglu, D. Harwood, and L. S. Davis, “W4: Real-time Surveillance of People and Their Activities,” IEEE Transactions on PAMI, Vol. 22, 2000, pp. 809-830), from the University of Maryland, is a real-time system for detecting and tracking people and their body parts. Pfinder (C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, “Pfinder: Real-time Tracking of the Human Body,” IEEE Transactions on PAMI, Vol. 19, 1997, pp. 780-785), developed at MIT, is another people-tracking and activity-recognition system. J. Ben-Arie, Z. Wang, P. Pandit, and S. Rajaram, “Human Activity Recognition Using Multidimensional Indexing,” IEEE Transactions on PAMI, Vol. 24, 2002, presents another system for analyzing human activities using efficient indexing techniques. More recently, a number of workshops and conferences (e.g., the ACM 2nd International Workshop on Video Surveillance and Sensor Networks, New York, 2004, and the IEEE Conference on Advanced Video and Signal Based Surveillance, Miami, Fla., 2003) have been organized to bring together researchers, developers, and practitioners from academia, industry, and government to discuss the various issues involved in developing large-scale video surveillance networks.
As discussed above, many issues related to image and video analysis need to be addressed satisfactorily to enable multi-camera video surveillance. A comprehensive survey of these issues will not be presented in this application; instead, solutions that address two particular challenges, i.e., 1) off-line calibration and 2) on-line selective focus-of-attention, are disclosed in detail below. These two technical challenges are well understood to be critical to the use of stationary-dynamic camera assemblies for video surveillance. The disclosures made in this invention addressing these two issues therefore provide new and improved solutions that differ from, and stand in clear contrast with, the state-of-the-art methods in these areas.
J. Davis and X. Chen, “Calibrating Pan-Tilt Cameras in Wide-area Surveillance Networks,” Proceedings of ICCV, Nice, France, 2003 presented a technique for calibrating a pan-tilt camera off-line. This technique adopted a general camera model that did not assume that the rotational axes were orthogonal or that they were aligned with the imaging optics of the cameras. Furthermore, Davis and Chen argued that the traditional methods of calibrating stationary cameras using a fixed calibration stand were impractical for calibrating dynamic cameras, because a dynamic camera had a much larger working volume. Instead, a novel technique was adopted to generate virtual calibration landmarks using a moving LED. The 3D positions of the LED were inferred, via stereo triangulation, from multiple stationary cameras placed in the environment. To solve for the camera parameters, an iterative minimization technique was proposed.
Zhou et al. X. Zhou, R. T. Collins, T. Kanade, P. Metes, “A Master-Slave System to Acquire Biometric Imagery of Humans at Distance,” Proceedings of 1st ACM Workshop on Video Surveillance, Berkeley, Calif., 2003 presented a technique to achieve selective focus-of-attention on-line using a stationary-dynamic camera pair. The procedure involved identifying, off-line, a collection of pixel locations in the stationary camera where a surveillance subject could later appear. The dynamic camera was then manually moved to center on the subject. The pan and tilt angles of the dynamic camera were recorded in a look-up table indexed by the pixel coordinates in the stationary camera. The pan and tilt angles needed for maneuvering the dynamic camera to focus on objects that appeared at intermediate pixels in the stationary camera were obtained by interpolation. At run time, the centering maneuver of the dynamic camera was accomplished by a simple table-look-up process, based on the locations of the subject in the stationary camera and the pre-recorded pan-and-tilt maneuvers.
Compared to the state-of-the-art methods surveyed above, new techniques are disclosed in this invention for dealing with off-line camera calibration and on-line selective focus-of-attention, as further described below:
In terms of off-line camera calibration:
In terms of on-line selective focus-of-attention:
Generally speaking, all camera-calibration algorithms model the image formation process as a sequence of transformations plus a projection. This sequence of operations brings a 3D coordinate Pworld, specified in some global reference frame, to a 2D coordinate Preal, specified in some camera coordinate frame (both in homogeneous coordinates). The particular model implemented in a preferred embodiment is described below.
The process can be decomposed into three stages:
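As a sketch, assuming the standard decomposition implied by the matrix names of Eq. 1 below, the three stages map world coordinates to camera coordinates (the extrinsic pose), project onto the ideal image plane, and convert to real pixel coordinates (the intrinsics):

$$P_{\text{real}} \;=\; \underbrace{M_{\text{real}\leftarrow\text{ideal}}(f)}_{\text{(3) intrinsics}}\;\underbrace{M_{\text{ideal}\leftarrow\text{camera}}}_{\text{(2) projection}}\;\underbrace{M_{\text{camera}\leftarrow\text{world}}}_{\text{(1) extrinsic pose}}\;P_{\text{world}}$$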
If enough known calibration points are used, all these parameters can be solved for. Many public-domain packages and free software tools, such as OpenCV, have routines for stationary camera calibration.
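For illustration, a minimal stationary-camera calibration sketch using OpenCV's standard checkerboard routines is given below; the board geometry and image file names are hypothetical placeholders.

```python
import cv2
import numpy as np

# Checkerboard geometry (assumed): 9x6 inner corners, 25 mm squares.
CORNERS = (9, 6)
SQUARE_MM = 25.0

# 3D coordinates of the corners in the board's own frame (z = 0 plane).
board = np.zeros((CORNERS[0] * CORNERS[1], 3), np.float32)
board[:, :2] = np.mgrid[0:CORNERS[0], 0:CORNERS[1]].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points = [], []
for path in ["view0.png", "view1.png", "view2.png"]:  # pattern at several depths
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, CORNERS)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(board)
        img_points.append(corners)

# Solve for intrinsics (focal length, center offset, aspect ratio via fx/fy)
# and per-view extrinsics (pose of the board relative to the camera).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS:", rms, "\nintrinsic matrix:\n", K)
```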
Dynamic Cameras: Calibrating a pan-tilt-zoom (PTZ) camera is more difficult, as there are many variable DOFs, and the choice of a certain DOF, e.g., zoom, affects the others.
The pan and tilt DOFs correspond to rotations, specified by the location of the rotation axis, the axis direction, and the angle of rotation. Furthermore, the ordering of the pan and tilt operations is important because rotation matrices do not commute except for small rotation angles. A camera implemented in a surveillance system can follow either of two designs: pan-tilt (pan before tilt) and tilt-pan (tilt before pan). Both designs are widely used, and the calibration formulation described below is applicable to both.
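To make the ordering point concrete, using the axis notation of Eq. 1 below and with $[n]_\times$ denoting the cross-product matrix of an axis direction $n$:

$$R_{n_t}(\varphi)\,R_{n_p}(\theta)\;\neq\;R_{n_p}(\theta)\,R_{n_t}(\varphi)\ \text{in general, but}\quad R_n(\epsilon)\approx I+\epsilon\,[n]_\times\;\Longrightarrow\;R_{n_t}(\varphi)\,R_{n_p}(\theta)\;\approx\;I+\varphi\,[n_t]_\times+\theta\,[n_p]_\times,$$

which is symmetric in the two rotations, so the order becomes immaterial for small angles.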
Some simplifications can make the calibration problem slightly easier, but at the expense of a less accurate solution. The simplifications are (1) collocation of the optical center on the axes of pan and tilt, (2) parallelism of the pan and tilt axes with the height (y) and width (x) dimensions of the CCD, and (3) the assumption that the requested and realized angles of rotation match, so that the angle of rotation does not require calibration. For example, Davis and Chen assume that (3) is true and calibrate only the location and orientation of the axes relative to the optical center. In contrast, a general formulation is adopted here that makes none of the above simplifications. The formulations presented in this invention show that such simplifications are unnecessary, and that assuming a general configuration does not unduly increase the complexity of the solution.
The equation that relates a 3D world coordinate and a 2D camera coordinate for a pan-tilt PTZ camera is (See Davis and Chen):
$$P_{\text{real}} = M_{\text{real}\leftarrow\text{ideal}}(f)\, M_{\text{ideal}\leftarrow\text{camera}}\, T_t^{-1}(f)\, R_{n_t}(\varphi)\, T_t(f)\, T_p^{-1}(f)\, R_{n_p}(\theta)\, T_p(f)\, M_{\text{camera}\leftarrow\text{world}}\, P_{\text{world}} \qquad \text{(Eq. 1)}$$
where θ denotes the pan angle and φ denotes the tilt angle, while np and nt denote the orientations of the pan and tilt axes, respectively. To execute the pan and tilt DOFs, a translation (Tp and Tt) from the optical center to the respective center of rotation is executed first, followed by a rotation around the respective axis, and then followed by a translation back to the optical center for the ensuing projection.¹ The parameters Tp and Tt are expressed as functions of the camera zoom, because zooming moves the optical center and alters the distances between the optical center and the rotation axes.
¹ Mathematically speaking, only the components of Tp and Tt that are perpendicular to np and nt can be determined. The components parallel to np and nt are not affected by the rotation, and hence cancel out in the back-and-forth translations.
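As a minimal sketch of this translate-rotate-translate-back pattern, the axis positions and directions below are hypothetical values of the kind calibration would supply as functions of zoom:

```python
import numpy as np

def rotation_about_axis(t, n, angle):
    """4x4 homogeneous transform T^-1 R_n(angle) T: translate from the
    optical center to the rotation center (t), rotate about the axis
    direction n, and translate back, as described above."""
    t = np.asarray(t, float)
    n = np.asarray(n, float)
    n /= np.linalg.norm(n)
    # Rodrigues' formula for the 3x3 rotation about direction n.
    K = np.array([[0.0, -n[2], n[1]],
                  [n[2], 0.0, -n[0]],
                  [-n[1], n[0], 0.0]])
    R3 = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    T, R, Tinv = np.eye(4), np.eye(4), np.eye(4)
    T[:3, 3] = t          # optical center -> rotation center
    R[:3, :3] = R3
    Tinv[:3, 3] = -t      # rotation center -> optical center
    return Tinv @ R @ T

# Pan is applied first (rightmost in Eq. 1), then tilt; for a tilt-pan
# camera the two blocks would simply be swapped.
pan  = rotation_about_axis([0.00, 0.00, 0.02], [0, 1, 0], np.deg2rad(10))
tilt = rotation_about_axis([0.00, 0.01, 0.00], [1, 0, 0], np.deg2rad(5))
M = tilt @ pan
```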
For tilt-pan PTZ cameras, the order of the pan and tilt operations in Eq. 1 is reversed. This is because the first platform, tilt, moves the whole upper assembly (the pan platform and the camera) as a unit, thus maintaining the position of the optical center relative to the axis of pan. This allows the calibration of Tp as a function of camera zoom only.
To calibrate a PTZ camera, two steps are needed: (1) calibrating the rotation angles, that is, determining whether the realized angles of rotation (θrealized and φrealized) are close to the requested ones (θrequested and φrequested), and (2) calibrating the location and orientation of the rotation axes.
The calibration procedure comprises two nested loops.
In general, this calibration procedure should be carried out multiple times with different θrequested settings. The axis of rotation and the center of rotation should be obtained by averaging over multiple calibration trials. The relationship between the requested angle of rotation and the executed angle of rotation, i.e., θrealized=ƒ(θrequested), can be interpolated from multiple trials using a suitable interpolation function ƒ (e.g., a linear, quadratic, or sigmoid function).
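A minimal sketch of this fitting step, assuming requested/realized angle pairs have been collected over several trials (the angle values are hypothetical; a quadratic model is used here, though a linear or sigmoid fit could be substituted as noted above):

```python
import numpy as np

# Hypothetical calibration measurements (degrees): requested pan angles
# vs. the angles actually realized by the platform, averaged over trials.
requested = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
realized  = np.array([4.8,  9.7, 14.6, 19.4, 24.3, 29.1])

# Fit theta_realized = f(theta_requested) with a quadratic model.
f = np.poly1d(np.polyfit(requested, realized, deg=2))

# At run time the fitted map predicts what a given request will produce.
print("predicted realized angle for a 17 degree request:", f(17.0))
```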
On-Line Selective Focus-of-Attention
Once a potential suspect (e.g., person/vehicle) has been identified in a stationary camera, the next step is often to relay discriminative visual traits of the suspect (RGB and texture statistics, position and trajectory, etc.) from the stationary camera to a dynamic camera. The dynamic camera then uses its pan, tilt, and zoom capabilities for a closer scrutiny.
To accomplish the selective focus-of-attention feat, it is required to (1) identify the suspect in the field-of-view of the dynamic camera, and (2) manipulate the camera's pan, tilt, and zoom mechanisms to continuously center upon and present a suitably sized image of the subject.
The first requirement is often treated as a correspondence problem, solved by matching regions in the stationary and dynamic cameras based on similarity of color and texture traits, congruency of motion trajectory, and affirmation of geometrical epipolar constraints. As there are established techniques available to solve this problem for the surveillance application at hand, for the sake of simplicity and clarity, the details are not described here.
As to the second requirement, there is in fact a trivial solution if the optical center of the PTZ camera is located on the axes of pan and tilt, and if the axes are aligned with the width and height of the CCD.
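In that special case, the required pan and tilt angles follow in closed form from the subject's image offset, the focal length ƒ, and the pixel scale factors ku and kv; this is the same closed form used by the naïve baseline in the experiments reported below:

$$\theta=\tan^{-1}\!\left(\frac{x-x_{\text{center}}}{k_u f}\right),\qquad \varphi=\tan^{-1}\!\left(\frac{y'-y_{\text{center}}}{k_v f}\right)$$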
In reality, however, the optical center is often not located on the rotation axes, as illustrated in the accompanying figure.
In more detail, if it is assumed that the optical center and the centers of pan and tilt are collocated, and that the axes align with the CCD as in the accompanying figure, the rotation angles can be computed directly from the image coordinates without knowledge of the subject's depth. When these assumptions do not hold, the correct rotation angles depend on the subject's depth, and the centering error of the simple solution grows as the subject approaches the camera.²
² While 3 m may sound short, one has to remember that dynamic cameras need to operate at high zoom settings for close scrutiny. At a high zoom setting, the effective depth of the object can and does become very small.
It might seem that the centering problem could be solved by either (1) adopting a mechanical design that ensures collocation of the optical center on the rotation axes or, failing that, (2) inferring the depth of the subject to compute the rotation angle correctly. However, both solutions turn out to be infeasible for the following reasons:
Instead, the present invention formulates this selective, purposeful focus-of-attention problem as one of visual servo, using the control loop described below.
As mentioned, the stationary cameras perform visual analysis to identify the current state (RGB, texture, position, and velocity) of the suspicious persons/vehicles. A similar analysis is performed by the dynamic cameras under the guidance of the stationary camera. Image features of the subjects (e.g., the position and size of a car license plate or of the face of a person) are computed and then serve as the input to the servo algorithm (the real signals). The real signals are compared with the reference signals, which specify the desired position (e.g., at the center of the image plane) and size (e.g., covering 80% of the image plane) of the image features. Deviation between the real and reference signals generates an error signal that is used to compute a camera control signal (i.e., desired changes in the pan, tilt, and zoom DOFs). Executing these recommended changes to the camera's DOFs trains and zooms the camera so as to minimize the discrepancy between the reference and real signals (i.e., to center the subject at a good size). Finally, as there is no control over the movements of the surveillance subjects, such movements must be treated as external disturbance (noise) in the system. This loop of video analysis, feature extraction, feature comparison, and camera control (servo) is then repeated for the next time frame. This is a natural formulation given the dynamic and incremental nature of the problem.
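A schematic of one iteration of this loop in Python: the camera, feature-extractor, and Jacobian interfaces are hypothetical stand-ins, and the proportional zoom law is an assumption; only the loop structure follows the description above.

```python
import numpy as np

def servo_step(camera, extract_features, jacobian, ref_pos, ref_size=0.8,
               k_zoom=0.5):
    """One iteration of the focus-of-attention servo loop described above.
    `camera` (with grab()/state()/command()), `extract_features`, and
    `jacobian` are hypothetical interfaces; the proportional zoom gain
    k_zoom is an assumption, not taken from the text."""
    frame = camera.grab()
    pos, size = extract_features(frame)            # real signals
    err_pos = np.asarray(ref_pos, float) - pos     # position error signal
    err_size = ref_size - size                     # size error signal
    # Camera control signal: pan/tilt changes from the linearized image
    # motion model (Eq. 3 below), plus a simple proportional zoom law.
    d_pan, d_tilt = np.linalg.solve(jacobian(camera.state()), err_pos)
    camera.command(d_pan, d_tilt, k_zoom * err_size)
```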
The video analysis and feature extraction processes are not described here because many standard video analysis, tracking, and localization algorithms can accomplish them. Instead, this section discusses in detail how to generate the camera control signals, assuming that features have already been extracted from the stationary and dynamic cameras.
Visual servo is based on Eq. 1, which relates the image coordinate to the world coordinate for PTZ cameras. Assume that samples are taken at the video frame rate (30 frames/second) and that at a particular instant the tracked object is observed at a certain location in the dynamic camera. Then, the questions addressed here are: (1) how do the image coordinates of the tracked object change with small changes in the camera's control variables, and (2) what changes in those variables will bring the object to the center of the image?
One can expect, from a cursory examination of Eq. 1, that the relationship between the image coordinates and the camera's DOFs is fairly complicated and highly nonlinear. Hence, a closed-form solution to the above two questions is unlikely. Instead, the formulations are linearized by rearranging terms in Eq. 1 and taking the partial derivatives of the resulting expressions with respect to the control variables ƒ, θ, and φ:
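In its two-DOF centering form, as implied by the discussion that follows, the resulting linear relation (Eq. 3) can be sketched as:

$$\begin{bmatrix} dx_{\text{real}} \\ dy_{\text{real}} \end{bmatrix} \;=\; J \begin{bmatrix} d\theta \\ d\varphi \end{bmatrix} \qquad \text{(Eq. 3)}$$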
The expression of J is somewhat complicated and is not presented here to save space; however, it is a simple mathematical exercise to derive the exact expression. The expression in Eq. 3 answers the first question posed above. The answer to the second question is then obvious: the formulations of this invention substitute [xcenter−x, ycenter−y]T for [dxreal, dyreal]T in Eq. 3, because that is the desired centering movement. However, as Eq. 3 represents a linearized version of the original nonlinear problem (i.e., its first-order Taylor series expansion), iterations are needed to converge to the true solution. The need for iterations does not present a problem, since the computation is efficient and convergence is fast even with the simple Newton's method. In the experiments discussed below, convergence is always achieved within four iterations with ~1/10,000 of a pixel precision. Two final points are worth mentioning:
It is easy to verify that the Jacobian Jcenter is well conditioned and invertible using an intuitive argument. The two columns of the Jacobian represent the instantaneous image velocities of the tracked point due to a change in the pan (θ) and tilt (φ) angles, respectively. As long as the instantaneous velocities are not collinear, Jcenter has independent columns and is therefore invertible. It is well known that degeneracy can occur only if a “gimbal lock” condition arises that removes one DOF. For pan-tilt cameras, this occurs only when the camera is pointing straight up. In that case, the pan DOF reduces to a self-rotation of the camera body, which can make some image points move in a way similar to that under a tilt maneuver. This condition rarely occurs; in fact, it is not even possible for Sony PTZ cameras, because the limited range of tilt does not allow the camera to point straight up.
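As an illustration of the iteration, a minimal Newton-step sketch is given below; the `project` and `jacobian` callables are hypothetical stand-ins for the analytic expressions derived from Eq. 1 and Eq. 3.

```python
import numpy as np

def centering_angles(project, jacobian, theta, phi, target, iters=4, tol=1e-4):
    """Solve for the pan/tilt that center the subject by iterating the
    linearized model: d = J^{-1} ([x_center - x, y_center - y]^T).
    `project(theta, phi)` maps pan/tilt to the subject's image coordinates
    (via Eq. 1); `jacobian(theta, phi)` returns the 2x2 J of Eq. 3."""
    target = np.asarray(target, float)
    for _ in range(iters):
        residual = target - project(theta, phi)
        if np.linalg.norm(residual) < tol:
            break
        d_theta, d_phi = np.linalg.solve(jacobian(theta, phi), residual)
        theta, phi = theta + d_theta, phi + d_phi
    return theta, phi
```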
This section describes the results of both off-line calibration and on-line selective focus-of-attention. For the off-line calibration, the traditional method is adopted: constructing a planar checkerboard pattern and then placing the pattern at different depths before the camera to supply 3D calibration landmarks. As discussed above, while Davis and Chen advocate a different method of generating virtual 3D landmarks by moving an LED around in the environment, that method is found here to be less accurate.
The argument used by Davis and Chen to support the virtual landmark approach is the need for a large working space to fully calibrate the pan and tilt DOFs. While this is true, there are different ways to obtain large angular ranges. Because θ≈r/d, a large angular range can be achieved by either (1) placing a small calibration stand (small r) nearby (small d) or (2) using dispersed landmarks (large r) placed far away (large d). While Davis and Chen advocate the latter, the former approach is adopted here.
The reason is that, to calibrate Tp and Tt accurately, it is desirable that their effects be as pronounced as possible and easily observable in image coordinates. This makes a near-field approach better than a far-field approach, as seen in the accompanying figures.
Off-line calibration: results are summarized in the accompanying figures.
On-line focus-of-attention: experiments are conducted using both synthesized and real data. For synthesized data, the accuracy of the algorithm disclosed in this invention is compared with that of a naïve centering algorithm. The naïve algorithm makes the following assumptions: (1) the optical center is collocated on the axes of pan and tilt, and (2) the pan DOF affects only the x coordinates, whereas the tilt DOF affects only the y coordinates. While those assumptions are not generally valid, algorithms making them can and do serve as good baselines, because they are very easy to implement and they give reasonable approximations for far-field applications.
In more detail, the naïve algorithm works as follows. Assume that a tracked object appears in the dynamic camera at location pideal = [xideal/zideal, yideal/zideal, 1]T as defined in Eq. 2. The pan rotation about the y axis is applied as p′ideal = Ry(θ)pideal, and the tilt rotation about the x axis as p″ideal = Rx(φ)p′ideal, where θ = tan−1((x−xcenter)/(kuƒ)) and φ = tan−1((y′−ycenter)/(kvƒ)). To make the simulation realistic, the parameters of Sony PTZ cameras are used.
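A direct transcription of this baseline as a sketch; the pixel scales ku and kv, focal length ƒ, and image center below are placeholder values standing in for the Sony camera parameters:

```python
import numpy as np

def Ry(a):
    # Pan rotation; sign convention chosen so that a positive pan angle
    # drives a point with positive x offset toward the image center.
    return np.array([[np.cos(a), 0.0, -np.sin(a)],
                     [0.0, 1.0, 0.0],
                     [np.sin(a), 0.0, np.cos(a)]])

def Rx(a):
    # Tilt rotation about the x axis (standard right-handed convention).
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, np.cos(a), -np.sin(a)],
                     [0.0, np.sin(a), np.cos(a)]])

def naive_center(p_ideal, x_center, y_center, ku, kv, f):
    """Naive baseline: pan from the x offset, then tilt from the updated
    y offset, assuming the optical center lies on both rotation axes."""
    theta = np.arctan2(p_ideal[0] - x_center, ku * f)
    p1 = Ry(theta) @ p_ideal
    p1 = p1 / p1[2]                        # renormalize the homogeneous point
    phi = np.arctan2(p1[1] - y_center, kv * f)
    p2 = Rx(phi) @ p1
    return theta, phi, p2 / p2[2]

# Placeholder parameters (stand-ins for the Sony values); the object starts
# off-center at (0.2, -0.1) in ideal image coordinates and ends near (0, 0).
theta, phi, p = naive_center(np.array([0.2, -0.1, 1.0]),
                             0.0, 0.0, ku=125.0, kv=125.0, f=0.008)
```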
Another source of error in the naïve method involves the possible misalignment of the pan/tilt axes. While the naïve method assumes that the axes are perfectly aligned with the CCD, in reality some deviation can and should be expected. The disclosed method incorporates this deviation information into its equations and thus avoids this additional source of error, which otherwise leads to inaccurate centering.
The experiments further test the performance of the centering algorithm on real data. A person stands in front of the camera at an arbitrary position in the image frame. The centering algorithm then centers on the tip of the person's nose, showing that centering is achieved using the method disclosed in this invention. The center of the screen is marked by dotted white lines and the nose tip by a white circle, both before and after centering. The centering error decreases as the object gets farther away from the camera; however, the centering results are good even when the object (the face) gets as close as 50 cm to the camera.
According to the above descriptions, this invention further discloses an alternate embodiment of a video surveillance camera that includes a global large-field-of-view surveillance lens and a dynamic selective-focus-of-attention surveillance lens. The video surveillance camera further includes an embedded controller for controlling the video surveillance camera to implement a cooperative and hierarchical control process for operating the global large-field-of-view surveillance lens and the dynamic selective-focus-of-attention surveillance lens. In a preferred embodiment, the video surveillance camera is mounted on a movable platform. In another preferred embodiment, the video surveillance camera has the flexibility of multiple degrees of freedom (DOFs). In another preferred embodiment, the controller is embodied in the camera as an application specific integrated circuit (ASIC) processor.
Although the present invention has been described in terms of the presently preferred embodiment, it is to be understood that such disclosure is not to be interpreted as limiting. Various alterations and modifications will no doubt become apparent to those skilled in the art after reading the above disclosure. Accordingly, it is intended that the appended claims be interpreted as covering all alterations and modifications that fall within the true spirit and scope of the invention.
This application is a Formal Application and claims priority to pending U.S. patent application entitled “VIDEO SURVEILLANCE USING STATIONARY-DYNAMIC CAMERA ASSEMBLIES FOR WIDE-AREA VIDEO SURVEILLANCE AND ALLOW FOR SELECTIVE FOCUS-OF-ATTENTION” filed on Dec. 4, 2004 and accorded Ser. No. 60/633,166 by the same Applicant of this Application, the benefit of its filing date being hereby claimed under Title 35 of the United States Code.