This invention relates generally to surveillance systems, and more particularly to querying and visualizing surveillance data.
Surveillance and sensor systems are used to make environments safer and more efficient. Typically, surveillance systems detect events in signals acquired from an environment. The events can be due to people, vehicles, or changes in the environment itself. The signals can be complex, for example, visual (video) and acoustic signals, or the signals can be simple, such as those from heat sensors and motion detectors.
The detecting can be done in real-time as the events occur, or off-line after the events have occurred. The off-line processing requires means for storing, searching, and retrieving recorded events. It is desired to automate the processing of surveillance data.
A number of systems for analyzing surveillance videos are known: Stauffer et al., “Learning patterns of activity using real-time tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):747-757, 2000; Ivanov et al., “Recognition of visual activities and interactions by stochastic parsing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):852-872, 2000; Johnson et al., “Learning the distribution of object trajectories for event recognition,” Image and Vision Computing, 14(8), 1996; Minnen et al., “Expectation grammars: Leveraging high-level expectations for activity recognition,” Workshop on Event Mining, Event Detection, and Recognition in Video, Computer Vision and Pattern Recognition, volume 2, page 626, IEEE, 2003; Cutler et al., “Real-time periodic motion detection, analysis and applications,” Conference on Computer Vision and Pattern Recognition, pages 326-331, Fort Collins, USA, IEEE, 1999; and Moeslund et al., “A survey of computer vision based human motion capture,” Computer Vision and Image Understanding, 81:231-268, 2001.
Several systems use gestural input to improve the usability of computer systems: R. A. Bolt, “‘Put-that-there’: Voice and gesture at the graphics interface,” Computer Graphics Proceedings, SIGGRAPH 1980, 14(3):262-270, July 1980; C. Maggioni, “GestureComputer: New ways of operating a computer,” SIEMENS AG Central Research and Development, 1994; and D. McNeill, Hand and Mind: What Gestures Reveal about Thought, The University of Chicago Press, 1992.
The embodiments of the invention provide a system and method for detecting unusual events in an environment, and for searching surveillance data using a global context of the environment. The system includes a network of heterogeneous sensors, including motion detectors and video cameras. The system also includes a surveillance database for storing the surveillance data. A user specifies queries that take advantage of a spatial context of the environment.
Specifically, a method for querying a surveillance database stores videos and events acquired by cameras and detectors in an environment. Each event includes a time at which the event was detected. The videos are indexed according to the events. A query specifies a spatial and temporal context. The database is searched for events that match the spatial and temporal context of the query, and only segments of the videos that correlate with the matching events are displayed.
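By way of illustration, the following is a minimal sketch of such a query in Python. The record layout, sensor identifiers, and times are hypothetical and serve only to make the claimed steps concrete: events carry a sensor identifier and a detection time, a query supplies a spatial context (a set of sensors) and a temporal context (a time interval), and only the matching events are used to select video for display.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Event:
    sensor_id: str   # which detector or camera signaled
    time: float      # time (seconds) at which the event was detected

def query(events: List[Event], sensor_ids: Set[str], t0: float, t1: float) -> List[Event]:
    """Return the events matching a spatial context (a set of sensors)
    and a temporal context (the interval [t0, t1])."""
    return [e for e in events if e.sensor_id in sensor_ids and t0 <= e.time <= t1]

# Hypothetical usage: find events near two detectors between t=100 s and t=160 s,
# then display only the video segments that overlap the matching events.
events = [Event("detector-7", 112.0), Event("detector-9", 114.5), Event("camera-2", 300.0)]
for e in query(events, {"detector-7", "detector-9"}, 100.0, 160.0):
    print(f"play video around t={e.time:.1f} s near sensor {e.sensor_id}")
```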
System
The processor is conventional and includes memory, buses, and I/O interfaces. The processor performs the query method 111 according to an embodiment of the invention. The surveillance database 130 stores surveillance data, e.g., video and sensor data streams 131, and plans 220 of an environment 105 where the surveillance data are collected.
An input device 140, e.g., a mouse or a touch-sensitive surface, can be used to specify a spatial query 141. Results 121 of the query 141 are displayed on the display device 120.
Sensors
The sensor data 131 are acquired by a network of heterogeneous sensors 129. The sensors 129 can include video cameras and detectors. Other types of sensors as known in the art can also be included. Because of the relative cost of the cameras and the detectors, the number of detectors may be substantially larger than the number of cameras; i.e., the cameras are sparse and the detectors are dense in the environment. For example, one area viewed by one camera can include dozens of detectors. In a large building, there could be hundreds of cameras, but many thousands of detectors. Even though the number of detectors can be large relative to the number of cameras, the amount of data (events and times) acquired by the detectors is minuscule compared with the video data. Therefore, the embodiments of the invention leverage the event data to rapidly locate video segments of potential interest.
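Because the event data are so compact, they can be held in a simple in-memory index and searched before any video is touched. The sketch below is one possible realization, with hypothetical sensor names: one time-sorted list of event times per detector, so that all events in a time range are found by binary search.

```python
import bisect
from collections import defaultdict
from typing import Dict, List

# One time-sorted list of event times per detector; the whole index for
# thousands of detectors is tiny compared with even a minute of video.
index: Dict[str, List[float]] = defaultdict(list)

def record_event(sensor_id: str, time: float) -> None:
    """Insert an event time, keeping the detector's track sorted."""
    bisect.insort(index[sensor_id], time)

def events_between(sensor_id: str, t0: float, t1: float) -> List[float]:
    """Binary-search the detector's track for all events in [t0, t1]."""
    times = index[sensor_id]
    return times[bisect.bisect_left(times, t0):bisect.bisect_right(times, t1)]

record_event("detector-12", 31.0)
record_event("detector-12", 95.5)
print(events_between("detector-12", 30.0, 60.0))  # -> [31.0]
```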
The plan 220 can show the locations of the sensors. A particular subset of sensors can be selected by the user with the input device 140, or by indicating a general area on the floor plan.
Cameras and Detectors
The set of sensors in the system includes regular surveillance video cameras and various detectors, implemented in either hardware or software. Usually, the cameras continuously acquire videos of areas of the environment. Typically, the cameras do not respond to activities in their field of view, but simply record images of the monitored environment. It should be noted that the videos can be analyzed using conventional computer vision techniques, either in real-time or after the videos are acquired. The computer vision techniques can include object detection, object tracking, object recognition, face detection, and face recognition. For example, the system can determine whether a person entered a particular area in the environment, and record this as a time-stamped event in the database.
Other detectors, e.g., motion detectors and similar devices, can be either active or passive, as long as they signal discrete time-stamped events. For example, a proximity detector signals in response to a person moving near the detector at a particular instant in time.
Queries 141 on the database 130 differ from conventional queries on typical multimedia databases in that the surveillance data share a spatial and temporal context. We leverage this shared context explicitly in the visualization of the query results 121, as well as in the user interface used to input the queries.
Display Interface
As shown in the figure, the event timeline 230 shows the events in a “player piano roll” format, with time running from left to right. A current time is marked by a vertical line 221. The events for the various detectors are arranged along the vertical axis. Each rectangle 122 represents one event: the vertical position indicates which sensor was active, and the horizontal position and extent indicate when and for how long. We call the horizontal arrangement for a particular sensor an event track, outlined by a rectangular block 125 only for the purpose of this description.
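As a concrete illustration, such a piano-roll timeline can be rendered with matplotlib as sketched below; the sensor names, event times, and current-time position are invented for the example.

```python
import matplotlib.pyplot as plt

# Hypothetical event tracks: sensor -> list of (start time, duration) pairs.
tracks = {
    "detector-1": [(2.0, 1.5), (10.0, 0.5)],
    "detector-2": [(4.0, 2.0)],
    "camera-1": [(0.0, 12.0)],
}

fig, ax = plt.subplots()
for row, (sensor, spans) in enumerate(tracks.items()):
    # One horizontal event track per sensor; each rectangle is one event 122.
    ax.broken_barh(spans, (row - 0.4, 0.8))
ax.axvline(x=6.0, color="red")  # vertical line 221 marking the current time
ax.set_yticks(range(len(tracks)))
ax.set_yticklabels(list(tracks))
ax.set_xlabel("time (seconds)")
plt.show()
```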
The visualization uses a common highlighting scheme. The activation zones 133 can be highlighted with color on the floor plan 220. Sensors that correspond to the activation zones are indicated on the event timeline by horizontal bars 123 rendered in the same color. A video can be played that corresponds to the events at a particular time and in a particular area of the environment.
After events have been located in the database 130, the events can be displayed either on the background of the complete timeline, or on a compressed timeline. The event timeline can be further compressed by removing the tracks of all sensors that are not related to the query.
Selection and Queries
A simple query requests all the video segments that include any type of motion. Generally, this query returns too much information. A better query specifies an activation zone 133 on the floor plan 220. The zone can be indicated with the mouse 140 or, if a touch-sensitive screen is used, by touching the plan 220 at the appropriate location(s).
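One simple realization of an activation zone is a rectangle drawn on the floor plan; the sketch below, with hypothetical floor-plan coordinates in meters, selects the sensors inside the zone, whose events then form the spatial context of the query.

```python
from typing import Dict, Set, Tuple

def sensors_in_zone(positions: Dict[str, Tuple[float, float]],
                    zone: Tuple[float, float, float, float]) -> Set[str]:
    """Return the sensors whose floor-plan coordinates fall inside a
    rectangular activation zone (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = zone
    return {sid for sid, (x, y) in positions.items()
            if x0 <= x <= x1 and y0 <= y <= y1}

# Hypothetical sensor positions on the floor plan, in meters.
positions = {"detector-1": (2.0, 3.0), "detector-2": (8.0, 1.0), "camera-1": (5.0, 5.0)}
print(sensors_in_zone(positions, (0.0, 0.0, 6.0, 6.0)))  # detector-1 and camera-1
```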
A still better query specifies context constraints in the form of a path 134 and an event timing sequence. The system automatically joins these context constraints with the surveillance data, and the results are appropriately refined for display. Because the system has access to the database of events, the system can analyze the event data for statistics, such as inter-arrival times.
Paths
According to one embodiment, the detected events can be linked in space and time to form a path and an event timing sequence. For example, a person walking down a hallway causes a linear subset of the detectors mounted in the ceiling to signal events serially at predictable time intervals that are consistent with walking. If the detectors are spaced apart by about five meters, the detectors signal events serially at times separated by about two to three seconds. In this event timing sequence, the events are well separated. The event timing sequence caused by a running person can also be easily distinguished, in that spatially adjacent detectors signal events at almost the same time.
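A minimal sketch of this distinction follows, assuming detectors spaced about five meters apart along the hallway; the numeric thresholds are illustrative assumptions taken from the example above, not measured values.

```python
from typing import List

def classify_gait(event_times: List[float]) -> str:
    """Classify a serially ordered event timing sequence from adjacent
    ceiling detectors (assumed roughly 5 m apart). Thresholds are
    illustrative: 2-3 s gaps for walking, sub-second gaps for running."""
    gaps = [t2 - t1 for t1, t2 in zip(event_times, event_times[1:])]
    if all(2.0 <= g <= 3.0 for g in gaps):
        return "walking"   # well-separated events, consistent with walking
    if all(g < 1.0 for g in gaps):
        return "running"   # adjacent detectors signal at almost the same time
    return "other"

print(classify_gait([0.0, 2.4, 4.9, 7.2]))  # -> walking
print(classify_gait([0.0, 0.6, 1.1]))       # -> running
```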
The amount of data associated with sensor events is substantially smaller than the amount of data associated with videos. In addition, the events and their times can be efficiently organized in a data structure. If the times in the videos and the times of the events are correlated in the database, then it is possible to search the database with a spatio-temporal query to quickly locate video segments that correspond to unusual events in the environment.
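For instance, once a spatio-temporal query has returned matching event times, they can be converted into a short list of video segments to retrieve. The sketch below is one such mapping, with an assumed padding of a few seconds around each event; overlapping windows are merged so only a small fraction of the stored video is fetched.

```python
from typing import List

def video_segments(event_times: List[float], pad: float = 5.0) -> List[List[float]]:
    """Turn matching event times into padded, merged video intervals."""
    segments: List[List[float]] = []
    for t in sorted(event_times):
        start, end = t - pad, t + pad
        if segments and start <= segments[-1][1]:
            segments[-1][1] = max(segments[-1][1], end)  # merge overlapping windows
        else:
            segments.append([start, end])
    return segments

print(video_segments([100.0, 104.0, 300.0]))  # -> [[95.0, 109.0], [295.0, 305.0]]
```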
Similarly, video segments can be used to search the database, where events of interest can include a particular feature observation in the camera view. For instance, we can search for the trajectories that a particular person traversed in a monitored area. This can be done by detecting and identifying faces in the videos. If such face data and discrete event data are stored in the database, then all detected faces can be presented to the user. The user can select a particular face, and the system can use the temporal and spatial information associated with that face to search the database and determine where in the monitored area that person has been.
A less constrained query identifies a sequence as a valid result if the second detector signals within one second of the first detector and the third detector signals within two to three seconds of the first detector; because both timings are measured from the first detector, the third detector's validity does not depend on the signaling of the second detector.
The system provides various levels of search constraints, e.g., level 0, level 1, level 2, and so on, that can be assigned to the query.
A strict query only searches for events that exactly match the query, while a less constrained query admits variations. For example, a query specifies that sensors 1-2-3-4 should signal in order. Level 0 finds all event chains where sensors 1-2-3-4 signaled. Level 1 additionally finds the sequences 1-3-4 and 1-2-4, where the timings of the sensors that did signal satisfy the constraints. Level 2 then allows any two sensors to be inactive, and thus finds all instances of sensors 1-4 where the timings of sensor 1 and sensor 4 satisfy the constraints. As the level number increases, there are more and more search results for a given query.
For any query involving N sensors, N levels of constraints are generally available.
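One possible realization of these constraint levels is sketched below, with hypothetical sensor names and timing windows. A level-k match requires the first and last sensors of the query to signal, tolerates up to k silent sensors in between, and still enforces the timing window, measured from the first sensor, of every sensor that did signal.

```python
from typing import Dict, List, Tuple

def match_with_level(observed: Dict[str, float], query: List[str],
                     windows: Dict[str, Tuple[float, float]], level: int) -> bool:
    """Check whether observed signal times satisfy the query at the given
    constraint level. `observed` maps sensor -> signal time (silent sensors
    are absent), `query` lists the sensors in required order, and `windows`
    maps each sensor after the first to an allowed (min, max) delay from
    the first sensor. All names and windows are illustrative."""
    first, last = query[0], query[-1]
    if first not in observed or last not in observed:
        return False                      # the end points must always signal
    if sum(s not in observed for s in query[1:-1]) > level:
        return False                      # too many silent sensors for this level
    t0 = observed[first]
    for s in query[1:]:
        if s in observed:
            lo, hi = windows[s]
            if not (lo <= observed[s] - t0 <= hi):
                return False              # a sensor that did signal must be on time
    return True

# Level 1 admits the chain 1-3-4 even though sensor 2 never signaled.
windows = {"s2": (0.0, 1.0), "s3": (1.0, 3.0), "s4": (2.0, 4.0)}
print(match_with_level({"s1": 0.0, "s3": 2.0, "s4": 3.5},
                       ["s1", "s2", "s3", "s4"], windows, level=1))  # -> True
```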
Effect of the Invention
The system and method as described above can locate events that are not fully detected by any one sensor, be it a camera or a particular motion detector. This enables a user of the system to treat all of the sensors in an environment as one ‘global’ sensor, instead of as a collection of independent sensors.
For example, it is desired to locate events that are consistent with an unauthorized intrusion. A large amount of the available video can be eliminated by rejecting video segments that are correlated with sensor event sequences inconsistent with the intrusion, and only presenting the user with video segments that are consistent with the intrusion.
It is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.