This disclosure relates to surveillance systems. More specifically, the disclosure relates to a video-based surveillance system that fuses information from multiple surveillance sensors.
Video surveillance is critical in many circumstances. One problem with video surveillance is that videos are labor-intensive to monitor manually. Video monitoring can be automated using intelligent video surveillance systems. Based on user-defined rules or policies, intelligent video surveillance systems can automatically identify potential threats by detecting, tracking, and analyzing targets in a scene. However, these systems do not remember past targets, especially when the targets appear to act normally. Thus, such systems cannot detect threats that can only be inferred. For example, a facility may use multiple surveillance cameras that automatically provide an alert after identifying a suspicious target. The alert may be issued when the cameras identify some target (e.g., a human, bicycle, or vehicle) loitering around the building for more than fifteen minutes. However, such a system may not issue an alert when a target approaches the site several times in a day.
The present disclosure provides systems and methods for a surveillance system. The surveillance system includes multiple sensors. The surveillance system is operable to track a target in an environment using the sensors. The surveillance system is also operable to extract information from images of the target provided by the sensors. The surveillance system is further operable to determine confidences corresponding to the information extracted from images of the target. The confidences include at least one confidence corresponding to at least one primitive event. Additionally, the surveillance system is operable to determine grounded formulae by instantiating predefined rules using the confidences. Further, the surveillance system is operable to infer a complex event corresponding to the target using the grounded formulae. Moreover, the surveillance system is operable to provide an output describing the complex event.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate the present teachings and, together with the description, serve to explain the principles of the disclosure.
It should be noted that some details of the figures have been simplified and are drawn to facilitate understanding of the present teachings, rather than to maintain strict structural accuracy, detail, and scale.
This disclosure relates to surveillance systems. More specifically, the disclosure relates to video-based surveillance systems that fuse information from multiple surveillance sensors. Surveillance systems in accordance with aspects of the present disclosure automatically extract information from a network of sensors and make human-like inferences. Such high-level cognitive reasoning entails determining complex events (e.g., a person entering a building using one door and exiting from a different door) by fusing information in the form of symbolic observations, domain knowledge of various real-world entities and their attributes, and interactions between them.
In accordance with aspects of the invention, a complex event is determined to have likely occurred based only on other observed events and not based on a direct observation of the complex event itself. In embodiments, a complex event can be an event determined to have occurred based only on circumstantial evidence. For example, if a person enters a building with a package (e.g., a bag) and exits the building without the package, it may be inferred that the person left the package in the building.
Complex events are difficult to determine due to the variety of ways in which different parts of such events can be observed. A surveillance system in accordance with the present disclosure infers events in real-world conditions and, therefore, requires efficient representation of the interplay between the constituent entities and events, while taking into account uncertainty and ambiguity of the observations. Further, decision making for such a surveillance system is a complex task because such decisions involve analyzing information having different levels of abstraction from disparate sources and with different levels of certainty (e.g., probabilistic confidence), merging the information by weighting some data sources more than others, and arriving at a conclusion by exploring all possible alternatives. Further, uncertainty must be dealt with due to a lack of effective visual processing tools, incomplete domain knowledge, lack of uniformity and constancy in the data, and faulty sensors. For example, target appearance frequently changes over time and across different sensors, and data representations may not be compatible due to differences in the characteristics, levels of granularity, and semantics encoded in the data.
Surveillance systems in accordance with aspects of the present disclosure include a Markov logic-based decision system that recognizes complex events in videos acquired from a network of sensors. In embodiments, the sensors can have overlapping and/or non-overlapping fields of view. Additionally, in embodiments, the sensors can be calibrated or non-calibrated. Markov logic networks provide mathematically sound and robust techniques for representing and fusing the data at multiple levels of abstraction, and across multiple modalities, to perform the complex task of decision making. By employing Markov logic networks, embodiments of the disclosed surveillance system can merge information about entities tracked by the sensors (e.g., humans, vehicles, bags, and scene elements) using a multi-level inference process to identify complex events. Further, the Markov logic networks provide a framework for overcoming any semantic gaps between the low-level visual processing of raw data obtained from disparate sensors and the desired high-level symbolic information for making decisions based on the complex events occurring in a scene.
Markov logic networks in accordance with aspects of the present disclosure use probabilistic first order predicate logic (FOPL) formulas representing the decomposition of real-world events into visual concepts, interactions among the real-world entities, and contextual relations between visual entities and the scene elements. Notably, while the first order predicate logic formulas may be true in the real world, they are not always true. In surveillance environments, it is very difficult to come up with non-trivial formulas that are always true, and such formulas capture only a fraction of the relevant knowledge. For example, while the rule that “pigs do not fly” may always be true, such a rule has little relevance to surveilling an office building and, even if it were relevant, would not encompass all of the other events that might be encountered around an office building. Thus, despite its expressiveness, such pure first order predicate logic has limited applicability to practical problems of drawing inferences. Therefore, in accordance with aspects of the present disclosure, the Markov logic network defines complex events and object assertions by hard rules that are always true and soft rules that are usually true. The combination of hard rules and soft rules encompasses all events relevant to a particular set of threats for which a surveillance system monitors in a particular environment. For example, the hard rules and soft rules disclosed herein can encompass all events related to monitoring for suspicious packages being left by individuals at an office building.
In accordance with aspects of the present disclosure, the uncertainty as to the rules is represented by associating each first order predicate logic (FOPL) formula with a weight reflecting its uncertainty (e.g., a probabilistic confidence representing how strong a constraint is). That is, the higher the weight, the greater the difference in probability between truth states of occurrence of an event or observation of an object that satisfies the formula and one that does not, provided that other variables stay equal. In general, a rule for detecting a complex action entails all of its parts, and each part provides (soft) evidence for the actual occurrence of the complex action. Therefore, in accordance with aspects of the present disclosure, even if some parts of a complex action are not seen, it is still possible to detect the complex event across multiple sensors using the Markov logic network inference.
Markov logic networks allow for flexible rule definitions with existential quantifiers over sets of entities, and therefore provide expressive power for encoding the domain knowledge. The Markov logic networks in accordance with aspects of the present disclosure model uncertainty at multiple levels of inference, and propagate the uncertainty bottom-up for more accurate and/or effective high-level decision making with regard to complex events. Additionally, surveillance systems in accordance with the present disclosure scale the Markov logic networks to infer more complex activities involving a network of visual sensors under increased uncertainty due to inaccurate target associations across sensors. Further, surveillance systems in accordance with the present disclosure apply rule weight learning for fusing information acquired from multiple sensors (target track association) and enhance visual concept extraction techniques using distance metric learning.
Additionally, Markov logic networks allow multiple knowledge bases to be combined into a compact probabilistic model by assigning weights to the formulas, and are supported by a large range of learning and inference algorithms. Not only the weights, but also the rules can be learned from the data set using inductive logic programming (ILP). Because exact inference is intractable, Gibbs sampling (a Markov chain Monte Carlo (MCMC) process) can be used for performing approximate inference. The rules form a template for constructing the Markov logic networks from evidence. Evidence is in the form of grounded predicates obtained by instantiating variables using all possible observed confidences. The truth assignment for each of the predicates of the Markov Random Field defines a possible world x. The probability distribution over the possible worlds W, defined as the joint distribution over the nodes of the corresponding Markov Random Field network, is the product of potentials associated with the cliques of the Markov network:
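Equation (1) is not reproduced in this text. For reference, the clique factorization described in the preceding sentence is conventionally written as shown here, where Z (the partition function), φ_k (the potential of the kth clique), and x_{k} (the state of the variables in that clique) are notation assumed for this sketch rather than taken from the original:

```latex
P(X = x) = \frac{1}{Z} \prod_{k} \phi_k\left(x_{\{k\}}\right) \tag{1}
```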
The weight w_k associated with the kth formula can be assigned manually or learned. This can be reformulated as:
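Equation (2) is not reproduced in this text. In the conventional log-linear form of a Markov logic network, the distribution is written with one weighted count per formula. Here n_k(x), the number of true groundings of the kth formula in world x, and the partition function Z are assumed notation:

```latex
P(X = x) = \frac{1}{Z} \exp\left( \sum_{k} w_k \, n_k(x) \right) \tag{2}
```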
Equations (1) and (2) indicate that if the kth rule with weight w_k is satisfied for a given set of confidences and grounded atoms, the corresponding world is exp(w_k) times more probable than when the kth rule is not satisfied.
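The exp(w_k) relationship can be checked numerically on a toy network. The following is a minimal sketch, not the disclosed implementation: the atom names, formulas, and weights are hypothetical, and brute-force enumeration is used only because the example has four possible worlds.

```python
import itertools
import math

# Hypothetical ground atoms for a single target (names are illustrative only).
ATOMS = ("enters_with_bag", "exits_without_bag")

# Hypothetical weighted ground formulas: (weight, truth function over a world).
FORMULAS = [
    # enters_with_bag => exits_without_bag
    (1.5, lambda w: (not w["enters_with_bag"]) or w["exits_without_bag"]),
    (0.5, lambda w: w["enters_with_bag"]),
]


def weight_sum(world):
    """Sum of the weights of the formulas that the world satisfies."""
    return sum(wt for wt, formula in FORMULAS if formula(world))


def all_worlds():
    """Yield every truth assignment over ATOMS."""
    for values in itertools.product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))


def probability(world):
    """Normalized probability of a world under the log-linear model."""
    z = sum(math.exp(weight_sum(w)) for w in all_worlds())
    return math.exp(weight_sum(world)) / z


# Two worlds that differ only in whether the first formula is satisfied.
satisfied = {"enters_with_bag": True, "exits_without_bag": True}
violated = {"enters_with_bag": True, "exits_without_bag": False}
ratio = probability(satisfied) / probability(violated)
```

Because the two worlds agree on every other formula, `ratio` equals exp(1.5), the exponential of the weight of the formula on which they differ.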
For detecting the occurrence of an activity, embodiments disclosed herein query the Markov logic network using the corresponding predicate. Given a set of evidence predicates x = e, hidden predicates u, and query predicates y, inference involves evaluating the maximum a posteriori (MAP) distribution over the query predicates y conditioned on the evidence predicates x and marginalizing out the hidden nodes u as P(y|x):
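The expression for P(y|x) is not reproduced in this text. As an illustration only, the query can be read as summing the unnormalized world weights over the hidden atoms while holding the evidence fixed. The sketch below does this by exhaustive enumeration, which is feasible only for very small networks and stands in for the sampling-based inference mentioned earlier. The atom names, formulas, and weights are hypothetical.

```python
import itertools
import math

# Hypothetical ground atoms: x_* is evidence, u_* is hidden, y_* is the query.
ATOMS = ("x_carry_in", "u_dropped_inside", "y_left_package")

# Hypothetical weighted ground formulas over a world (dict of atom -> bool).
FORMULAS = [
    # Carrying a package in makes dropping it inside more likely.
    (2.0, lambda w: (not w["x_carry_in"]) or w["u_dropped_inside"]),
    # Dropping inside and leaving the package tend to agree.
    (3.0, lambda w: w["u_dropped_inside"] == w["y_left_package"]),
]


def weight_sum(world):
    """Sum of the weights of the formulas that the world satisfies."""
    return sum(wt for wt, formula in FORMULAS if formula(world))


def query_marginal(query, evidence):
    """P(query=True | evidence), summing over all atoms not fixed by evidence."""
    free = [a for a in ATOMS if a not in evidence]
    numerator = 0.0
    denominator = 0.0
    for values in itertools.product([False, True], repeat=len(free)):
        world = {**evidence, **dict(zip(free, values))}
        p = math.exp(weight_sum(world))
        denominator += p
        if world[query]:
            numerator += p
    return numerator / denominator


p_with_evidence = query_marginal("y_left_package", {"x_carry_in": True})
p_without = query_marginal("y_left_package", {"x_carry_in": False})
```

With the evidence atom true, the query probability rises above one half. With it false, the first formula is satisfied in every world and the query is left at exactly one half.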
Markov logic networks support both generative and discriminative weight learning. Generative learning involves maximizing the log of the likelihood function to estimate the weights of the rules. The gradient computation uses the partition function Z. Even for reasonably sized domains, optimizing the log-likelihood is intractable because it involves counting the number of groundings n_i(x) in which the ith formula is true. Therefore, instead of optimizing the likelihood, generative learning in existing implementations uses the pseudo-log-likelihood (PLL). The difference between PLL and log-likelihood is that, instead of using the chain rule to factorize the joint distribution over all nodes, embodiments disclosed herein use the Markov blanket to factorize the joint distribution into conditionals. The advantage of doing this is that predicates that do not appear in the same formula as a node can be ignored. Thus, embodiments disclosed herein scale inference to support multiple activities and longer videos, which can greatly increase the speed of inference. Discriminative learning, on the other hand, maximizes the conditional log-likelihood (CLL) of the queried atoms given the observed atoms. The set of queried atoms needs to be specified for discriminative learning. All the atoms are partitioned into observed atoms X and queried atoms Y. CLL is easier to optimize than the combined log-likelihood function of generative learning because the evidence constrains the query atoms to far fewer possible states. Note that CLL and PLL optimization are equivalent when the evidence predicates include the entire Markov blanket of the query atoms. A number of gradient-based optimization techniques (e.g., voted perceptron, contrastive divergence, diagonal Newton method, and scaled conjugate gradient) can be used for minimizing the negative CLL. Learning weights by optimizing the CLL gives more accurate estimates of the weights than PLL optimization.
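The difference between the two generative objectives can be illustrated on a toy network. This sketch assumes a hypothetical three-atom network with made-up weights. It computes the log-likelihood, which needs the partition function Z over all worlds, and the pseudo-log-likelihood, which needs only each atom's conditional probability given the rest of the world.

```python
import itertools
import math

# Hypothetical ground atoms and weighted formulas (illustrative only).
ATOMS = ("a", "b", "c")
FORMULAS = [
    (1.0, lambda w: (not w["a"]) or w["b"]),  # a => b
    (0.7, lambda w: (not w["b"]) or w["c"]),  # b => c
]


def weight_sum(world):
    """Sum of the weights of the formulas that the world satisfies."""
    return sum(wt for wt, formula in FORMULAS if formula(world))


def log_likelihood(world):
    """log P(world); requires summing over all 2^n worlds to obtain Z."""
    log_z = math.log(sum(
        math.exp(weight_sum(dict(zip(ATOMS, values))))
        for values in itertools.product([False, True], repeat=len(ATOMS))
    ))
    return weight_sum(world) - log_z


def pseudo_log_likelihood(world):
    """Sum over atoms of log P(atom = observed value | all other atoms)."""
    total = 0.0
    s_observed = weight_sum(world)
    for atom in ATOMS:
        flipped = {**world, atom: not world[atom]}
        s_flipped = weight_sum(flipped)
        # Only two states of this atom need to be compared; Z is not needed.
        total += s_observed - math.log(math.exp(s_observed) + math.exp(s_flipped))
    return total


observed = {"a": True, "b": True, "c": True}
ll = log_likelihood(observed)
pll = pseudo_log_likelihood(observed)
```

The pseudo-log-likelihood loop touches n flipped worlds instead of 2^n worlds, which is the source of the scaling benefit described above. A fuller implementation would evaluate only the formulas in each atom's Markov blanket rather than calling `weight_sum` on the whole world.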
In accordance with aspects of the present disclosure, the surveillance system 25 visually monitors the spatial and temporal domains of the environment 10 around the building 20. Spatially, the monitoring area from the fields of view of the individual sensors 15 may be expanded to the whole environment 10 by fusing the information gathered by the sensors 15. Temporally, the surveillance system 25 can track the targets 30, 35 for long periods of time, even when the targets 30, 35 are temporarily outside of a field of view of one of the sensors 15. For example, if target 30 is in a field of view of sensor 15-2 and enters building 20 via door 22 and exits back into the field of view of sensor 15-2 after several minutes, the surveillance system 25 can recognize that it is the same target that was tracked previously. Thus, the surveillance system 25 disclosed herein can identify events as suspicious when the sensors 15 track the target 30 following a path indicated by the dashed line 45. In this example situation, the target 30 performs the complex behavior of carrying the package 31 when entering door 22 of the building 20 and subsequently reappearing as target 30′ without the package when exiting door 24. After identifying the event of target 30 leaving the package 31 in the building 20, the surveillance system 25 can semantically label segments of the video including the suspicious events and/or issue an alert to an operator.
In accordance with aspects of the present disclosure, the surveillance system 25 includes hardware and software that perform the processes and functions described herein. In particular, the surveillance system 25 includes a computing device 130, an input/output (I/O) device 133, and a storage system 135. The I/O device 133 can include any device that enables an individual to interact with the computing device 130 (e.g., a user interface) and/or any device that enables the computing device 130 to communicate with one or more other computing devices using any type of communications link. The I/O device 133 can be, for example, a handheld device, PDA, smartphone, touchscreen display, handset, keyboard, etc.
The storage system 135 can comprise a computer-readable, non-volatile hardware storage device that stores information and program instructions. For example, the storage system 135 can be one or more flash drives and/or hard disk drives. In accordance with aspects of the present disclosure, the storage system 135 includes a database of learned models 136 and a knowledge base 138. In accordance with aspects of the present disclosure, learned models 136 is a database or other dataset of information including domain knowledge of an environment under surveillance (e.g., environment 10) and objects that may appear in the environment (e.g., buildings, people, vehicles, and packages). In embodiments, learned models 136 associate information of entities and events in the environment with spatial and temporal information. Thus, functional modules (e.g., program and/or application modules), such as those disclosed herein, can use the information stored in the learned models 136 for detecting, tracking, identifying, and classifying objects, entities, and/or events in the environment.
In accordance with aspects of the present disclosure, the knowledge base 138 includes hard and soft rules modeling spatial and temporal interactions between various entities and the temporal structure of various complex events. The hard and soft rules can be first order predicate logic (FOPL) formulas of a Markov logic network, such as those previously described herein.
In embodiments, the computing device 130 includes one or more processors 139, one or more memory devices 141 (e.g., RAM and ROM), one or more I/O interfaces 143, and one or more network interfaces 144. The memory device 141 can include a local memory (e.g., a random access memory and a cache memory) employed during execution of program instructions. Additionally, the computing device 130 includes at least one communication channel (e.g., a data bus) by which it communicates with the I/O device 133, the storage system 135, and the device selector 137. The processor 139 executes computer program instructions (e.g., an operating system and/or application programs), which can be stored in the memory device 141 and/or storage system 135.
Moreover, the processor 139 can execute computer program instructions of a visual processing module 151, an inference module 153, and a scene analysis module 155. In accordance with aspects of the present disclosure, the visual processing module 151 processes information obtained from the sensors 15 to detect, track, and classify objects in the environment using information included in the learned models 136. In embodiments, the visual processing module 151 extracts visual concepts by determining values for confidences that represent space-time (i.e., position and time) locations of the objects in an environment, elements in the environment, entity classes, and primitive events. The inference module 153 fuses information of targets detected in multiple sensors using different entity similarity scores and spatial-temporal constraints, with the fusion parameters (weights) learned discriminatively using a Markov logic network framework from a few labeled exemplars. Further, the inference module 153 uses the confidences determined by the visual processing module 151 to ground (a.k.a., instantiate) variables in rules of the knowledge base 138. The rules with the grounded variables are referred to herein as grounded predicates. Using the grounded predicates, the inference module 153 can construct a Markov logic network 160 and infer complex events by fusing the heterogeneous information (e.g., text description, radar signal) generated using information obtained from the sensors 15. The scene analysis module 155 provides outputs using the Markov logic network 160. For example, the scene analysis module 155 can execute queries, label portions of the images associated with inferred events, and output tracking result information.
It is noted that the computing device 130 can comprise any general purpose computing article of manufacture capable of executing computer program instructions installed thereon (e.g., a personal computer, server, etc.). However, the computing device 130 is only representative of various possible equivalent computing devices that can perform the processes described herein. To this extent, in embodiments, the functionality provided by the computing device 130 can be any combination of general and/or specific purpose hardware and/or computer program instructions. In each embodiment, the program instructions and hardware can be created using standard programming and engineering techniques, respectively.
In accordance with aspects of the present disclosure, the visual processing module 151 monitors sensors (e.g., sensors 15) to extract visual concepts and to track targets across the different fields of view of the sensors. The visual processing module 151 processes videos and extracts visual concepts in the form of confidences, which denote times and locations of the entities detected in the scene, scene elements, entity classes, and primitive events directly inferred from the visual tracks of the entities. The extraction can include and/or reference information in the learned models 136, such as time and space proximity relationships, object appearance representations, scene elements, rules and proofs of actions that targets can perform, etc. For example, the learned models 136 can identify the horizon line and/or ground plane in the field of view of each of the sensors 15. Thus, based on learned models 136, the visual processing module 151 can identify some objects in the environment as being on the ground, and other objects as being in the sky. Additionally, the learned models 136 can identify objects such as entrance points (e.g., doors 22, 24) of a building (e.g., building 20) in the field of view of each of the sensors 15. Thus, the visual processing module 151 can identify some objects as appearing or disappearing at an entrance point. Further, learned models 136 can include information used to identify objects (e.g., individuals, cars, packages) and events (e.g., moving, stopping, and disappearing) that can occur in the environment. Moreover, learned models 136 can include basic rules that can be used when identifying the objects or events. For example, a rule can be “human tracks are more likely to be on a ground plane,” which can assist in the identification of an object as a human, rather than a different object flying above the horizon line.
The confidences can be used to ground (e.g., instantiate) the variables in the first-order predicate logic formulae of Markov logic network 160.
In embodiments, the visual processing includes detection, tracking, and classification of human and vehicle targets, and attribute extraction (e.g., carrying a package 31). Targets can be localized in the scene using background subtraction and tracked in a 2D image sequence using Kalman filtering. Targets are classified as human or vehicle based on their aspect ratio. Vehicles are further classified into sedans, SUVs, and pick-up trucks using 3D vehicle fitting. The primitive events (a.k.a., atomic events) about target dynamics (moving or stationary) are generated from the target tracks. For each event, the visual processing module 151 generates confidences for the time interval and pixel location of the target in the 2D image (or the location on the map if a homography is available). Furthermore, the visual processing module 151 learns discriminative deformable part-based classifiers to compute a probability score for whether a human target is carrying a package. The classification score is fused across the track by taking the average of the top K confident scores (based on absolute values) and is calibrated to a probability score using logistic regression.
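The track-level score fusion described above can be sketched as follows. This is a minimal illustration under stated assumptions: the per-frame scores are made up, and the logistic slope and intercept are hypothetical placeholders for parameters that would be fitted by logistic regression on labeled tracks.

```python
import math


def fuse_track_scores(scores, k):
    """Average the k per-frame scores with the largest absolute value."""
    if not scores:
        raise ValueError("scores must be non-empty")
    top = sorted(scores, key=abs, reverse=True)[:k]
    return sum(top) / len(top)


def calibrate(score, slope, intercept):
    """Map a raw fused score to a probability with a logistic function."""
    return 1.0 / (1.0 + math.exp(-(slope * score + intercept)))


# Hypothetical per-frame classifier scores for one human track.
frame_scores = [0.2, -0.1, 1.4, 0.9, -1.2, 0.05]
fused = fuse_track_scores(frame_scores, k=3)

# Hypothetical calibration parameters (assumed, not from the disclosure).
p_carrying = calibrate(fused, slope=2.0, intercept=0.0)
```

Selecting by absolute value keeps confident negative scores as well as confident positive ones, so a few strongly negative frames can pull the fused score down.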
In accordance with aspects of the present disclosure, the knowledge base 138 includes hard and soft rules for modeling spatial and temporal interactions between various entities and the temporal structure of various complex events. The hard rules are assertions that should be strictly satisfied for an associated complex event to be identified. Violation of hard rules sets the probability of the complex event to zero. For example, a hard rule can be “cars do not fly.” Soft rules, by contrast, allow uncertainty and exceptions. Violation of soft rules will make the complex event less probable but not impossible. For example, a soft rule can be, “pedestrians on foot do not exceed a velocity of 10 miles per hour.” Thus, the rules can be used to determine that a fast moving object on the ground is a vehicle, rather than a person.
The rules in the knowledge base 138 can be used to construct the Markov logic network 160. For every set of confidences (detected visual entities and atomic events) determined by the visual processing module 151, the first-order predicate logic rules involving the corresponding variables are instantiated to form the Markov logic network 160. As discussed previously, the Markov logic network 160 can be comprised of nodes and edges, wherein the nodes comprise the grounded predicates. An edge exists between two nodes if the predicates appear in a formula. From the Markov logic network 160, MAP inference can be run to infer probabilities of query nodes after conditioning them with observed nodes and marginalizing out the hidden nodes. Targets detected by multiple sensors are associated across those sensors using appearance, shape, and spatial-temporal cues. The homography is estimated by manually labeling correspondences between the image and a ground map. The coordinated activities include, for example, dropping a bag in a building and stealing a bag from a building.
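The node-and-edge structure described above can be sketched with a small graph builder. The ground predicate strings are hypothetical examples, not rules from the knowledge base 138, and each ground formula is reduced to the set of predicates it mentions.

```python
from itertools import combinations

# Hypothetical ground formulas, each listed as the ground predicates it mentions.
ground_formulas = [
    ("enter(T1,door22)", "carry(T1,bag)", "dropBag(T1)"),
    ("exit(T1,door24)", "dropBag(T1)"),
]


def build_network(formulas):
    """Return (nodes, edges) for the ground network.

    There is one node per ground predicate and one undirected edge per pair
    of predicates that co-occur in at least one ground formula. Each edge is
    stored as a sorted tuple so that (a, b) and (b, a) are not duplicated.
    """
    nodes = set()
    edges = set()
    for predicates in formulas:
        nodes.update(predicates)
        for a, b in combinations(sorted(set(predicates)), 2):
            edges.add((a, b))
    return nodes, edges


nodes, edges = build_network(ground_formulas)
```

In this example `dropBag(T1)` is linked to all three other predicates, while `carry(T1,bag)` and `exit(T1,door24)` share no formula and therefore no edge.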
In embodiments, the scene analysis module 155 can automatically determine labels for basic events and complex events in the environment using relationships and probabilities defined by the Markov logic network. For example, the scene analysis module 155 can label segments of video including suspicious events identified using one or more of the complex events and issue to a user an alert including the segments of the video.
At 410, the visual processing module 151 extracts the visual concepts to determine contextual relations between the elements and targets within a monitored environment (e.g., environment 10), which provide useful information about an activity occurring in the environment. The surveillance system 25 (e.g., using sensors 15) can track a particular target by segmenting images from sensors into multiple zones based, for example, on events indicating the appearance of the target in each zone. In embodiments, the visual processing module 151 categorizes the segmented images into categories. For example, there can be three categories including sky, vertical, and horizontal. In accordance with aspects of the present disclosure, the visual processing module 151 associates objects with semantic labels. Further, the semantic scene labels can then be used to improve target tracking across sensors by enforcing spatial constraints on the targets. An example constraint may be that a human can only appear in an image entry region. In accordance with aspects of the present disclosure, the visual processing module 151 automatically infers a probability map of the entry or exit regions (e.g., doors 24, 26) of the environment by formulating the following rules:
At 415, the targets detected in multiple sensors by the visual processing module 151 are fused in the Markov logic network 425 using different entity similarity scores and spatial-temporal constraints, with the fusion parameters (weights) learned discriminatively using the Markov logic network framework from a few labeled exemplars. To fuse the targets, the visual processing module 151 performs entity similarity relation modeling, which associates entities and events observed from data acquired from diverse and disparate sources. Challenges to a robust target appearance similarity measure across different sensors include substantial variations resulting from changes in sensor settings (white balance, focus, and aperture), illumination and viewing conditions, drastic changes in the pose and shape of the targets, and noise due to partial occlusions, cluttered backgrounds, and the presence of similar entities in the vicinity of the target. Invariance to some of these changes (such as illumination conditions) can be achieved using distance metric learning that learns a transformation in the feature space such that image features corresponding to the same object are closer to each other.
In embodiments, the inference module 153 performs similarity modeling using metric learning. The inference module 153 can employ metric learning approaches based on Relevant Component Analysis (RCA) to enhance the similarity relation between the same entities when viewed under different imaging conditions. RCA identifies and downscales global unwanted variability within the data belonging to the same class of objects. The method transforms the feature space using a linear transformation by assigning large weights only to the relevant dimensions of the features and de-emphasizing those parts of the descriptor that are most influenced by the variability in the sensor data. For a set of N data points {x_ji} belonging to K semantic classes, with n_j data points in class j, RCA first centers each data point belonging to a class to a common reference frame by subtracting the in-class mean m_j (thus removing inter-class variability). It then reduces the intra-class variability by computing a whitening transformation of the in-class covariance matrix as:
wherein the whitening transform of the matrix, W = C^(−1/2), is used as the linear transformation of the feature subspace such that features corresponding to the same object are closer to each other.
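The centering and whitening steps can be sketched with NumPy. The covariance expression is not reproduced in this text, so the sketch assumes the usual pooled within-class form, C = (1/N) Σ_j Σ_i (x_ji − m_j)(x_ji − m_j)^T. The synthetic data, the function name, and the eigenvalue floor are likewise assumptions made for illustration.

```python
import numpy as np


def rca_whitening(features, labels):
    """Return W = C^(-1/2), with C the pooled within-class covariance."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    centered = np.empty_like(features)
    for cls in np.unique(labels):
        mask = labels == cls
        # Subtract the in-class mean m_j from every point of class j.
        centered[mask] = features[mask] - features[mask].mean(axis=0)
    cov = centered.T @ centered / len(features)
    # Symmetric inverse square root via eigendecomposition. The small floor
    # (an assumption of this sketch) guards against near-zero eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.maximum(eigvals, 1e-12)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T


# Synthetic example: three classes whose within-class noise is much larger
# along the first feature dimension than along the second.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 50)
means = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
features = means[labels] + rng.normal(size=(150, 2)) * np.array([3.0, 0.3])

W = rca_whitening(features, labels)
transformed = features @ W.T
```

After the transformation, the pooled within-class covariance of `transformed` is the identity, so the noisy first dimension no longer dominates Euclidean distances between samples of the same object.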
At 420, in accordance with aspects of the present disclosure, the inference module 153 infers associations between the trajectories of the tracked targets across multiple sensors. In embodiments, the inferences are determined using a Markov logic network 425, which performs data association and handles the problem of long-term occlusion across multiple sensors, while maintaining multiple hypotheses for associations. The soft evidence of association is output as a predicate, e.g., equalTarget( . . . ), with a similarity score recalibrated to a probability value, and is used in high-level inference of activities. In accordance with aspects of the present disclosure, the inference module 153 first learns weights for the rules of the Markov logic network 425 that govern the fusion of spatial, temporal, and appearance similarity scores to determine equality of two entities observed in two different sensors. Using a subset of videos with labeled target associations, the Markov logic network 425 is discriminatively trained.
Tracklets extracted from Kalman filtering are used to perform target associations. The set of tracklets across multiple sensors is represented as X = {x_i}, where a tracklet x_i is defined as:
x_i = f(c_i, t_i^s, t_i^e, l_i, s_i, o_i, a_i)
where c_i is the sensor ID, t_i^s is the start time, t_i^e is the end time, l_i is the location in the image or the map, o_i is the class of the entity (human or vehicle), s_i is the measured Euclidean 3D size of the entity (only used for vehicles), and a_i is the appearance model of the target entity. The Markov logic network rules for fusing multiple cues for the global data association problem are:
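The fusion rules themselves are not reproduced in this text. The tracklet tuple defined above can, however, be mirrored as a small record type. The field names, units, and validation checks in this sketch are assumptions chosen for illustration, and only the comments map back to the symbols in the definition.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Tracklet:
    """One tracklet x_i; field names are illustrative, not from the disclosure."""

    sensor_id: str                  # c_i
    start_time: float               # t_i^s, assumed to be in seconds
    end_time: float                 # t_i^e, assumed to be in seconds
    location: Tuple[float, float]   # l_i, image or map coordinates
    size_3d: Optional[float]        # s_i, only used for vehicles
    entity_class: str               # o_i, "human" or "vehicle"
    appearance: Sequence[float]     # a_i, appearance descriptor

    def __post_init__(self):
        if self.end_time < self.start_time:
            raise ValueError("end_time must not precede start_time")
        if self.entity_class not in ("human", "vehicle"):
            raise ValueError("entity_class must be 'human' or 'vehicle'")

    @property
    def duration(self) -> float:
        """Length of the tracklet in the same units as the timestamps."""
        return self.end_time - self.start_time


example = Tracklet("15-2", 10.0, 42.5, (320.0, 240.0), None, "human", (0.1, 0.7, 0.2))
```

Making the record immutable keeps tracklets safe to share between association hypotheses, which is a design choice of this sketch rather than a stated requirement.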
In accordance with aspects of the present disclosure, the inference module 153 models temporal difference between the end and start time of a target across a pair of cameras using Gaussian distribution:
temporallyClose(t_i^{A,e}, t_j^{B,s}) = N(f(t_i^{A,e}, t_j^{B,s}); m_t, σ_t^2)
For the non-overlapping sensors, f(t_i^e, t_j^s) computes this temporal difference. If two cameras are near each other and there is no traffic signal between them, the variance tends to be smaller and the temporal score contributes substantially to the similarity measurement. However, when two cameras are farther away from each other or there are traffic signals in between, this similarity score contributes less to the overall similarity measure because the distribution is widely spread due to the large variance.
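The effect of the variance on the temporal cue can be sketched as follows. The function names, transit times, and variances are hypothetical, and f is taken here to be the plain difference between the start time in the second camera and the end time in the first.

```python
import math


def gaussian_pdf(value, mean, variance):
    """Density of a normal distribution N(value; mean, variance)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return math.exp(-((value - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


def temporally_close(end_time_a, start_time_b, mean_gap, gap_variance):
    """Score the transit time between leaving camera A and entering camera B."""
    gap = start_time_b - end_time_a  # assumed form of f(t_e, t_s)
    return gaussian_pdf(gap, mean_gap, gap_variance)


# Hypothetical nearby camera pair: typical transit 12 s, small variance.
near_on_time = temporally_close(100.0, 112.0, mean_gap=12.0, gap_variance=4.0)
near_late = temporally_close(100.0, 120.0, mean_gap=12.0, gap_variance=4.0)

# Hypothetical distant camera pair: typical transit 60 s, large variance.
far_on_time = temporally_close(100.0, 160.0, mean_gap=60.0, gap_variance=400.0)
far_late = temporally_close(100.0, 168.0, mean_gap=60.0, gap_variance=400.0)
```

An eight-second delay almost eliminates the score for the nearby pair but barely changes it for the distant pair, so the temporal cue discriminates strongly only when the variance is small.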
Further, in accordance with aspects of the present disclosure, the inference module 153 measures the spatial distance between objects in the two cameras at the enter/exit regions of the scene. For a road with multiple lanes, each lane can be an enter/exit area. The inference module 153 applies Markov logic network 425 inference to directly classify image segments into enter/exit areas, as discussed previously. The spatial probability is defined as:
spatiallyClose(liA, ljB)=N(dist(g(liA), g(ljB)); ml, σl2)
Enter/exit areas of a scene are located mostly near the boundary of the image or at the entrance of a building. Function g is the homography transform to project image locations lB and lA to map. Two targets detected in two cameras are only associated if they lie in the corresponding enter/exit areas.
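As a minimal sketch of the spatial cue, the following Python example projects two image locations onto a common map with per-camera homographies and scores their map distance with a Gaussian. The 3x3 matrices, the pixel coordinates, and the values of $m_l$ and $\sigma_l$ are hypothetical and chosen only to make the example easy to follow.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Homography = Sequence[Sequence[float]]  # 3x3 matrix, row-major


def apply_homography(h: Homography, point: Point) -> Point:
    """Project an image point onto the map plane (the function g)."""
    u, v = point
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return (x / w, y / w)


def map_distance(loc_a: Point, loc_b: Point,
                 h_a: Homography, h_b: Homography) -> float:
    """Euclidean distance between the two projected locations."""
    xa, ya = apply_homography(h_a, loc_a)
    xb, yb = apply_homography(h_b, loc_b)
    return math.hypot(xa - xb, ya - yb)


def spatially_close(loc_a: Point, loc_b: Point,
                    h_a: Homography, h_b: Homography,
                    mean_dist: float, std_dist: float) -> float:
    """Gaussian density of the map distance between two detections."""
    z = (map_distance(loc_a, loc_b, h_a, h_b) - mean_dist) / std_dist
    return math.exp(-0.5 * z * z) / (std_dist * math.sqrt(2.0 * math.pi))


# Hypothetical calibration: 0.1 map units per pixel for both cameras,
# with camera B offset by 50 map units along x.
H_A = ((0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 1.0))
H_B = ((0.1, 0.0, 50.0), (0.0, 0.1, 0.0), (0.0, 0.0, 1.0))

exit_a = (400.0, 200.0)   # exit location in camera A (pixels)
entry_b = (10.0, 200.0)   # entry location in camera B (pixels)
score = spatially_close(exit_a, entry_b, H_A, H_B,
                        mean_dist=10.0, std_dist=5.0)
```

In practice the homographies would come from camera calibration, and the comparison would be restricted to detections that lie in corresponding enter/exit areas.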
Moreover, in accordance with aspects of the present disclosure, the inference module 153 computes a size similarity score for vehicle targets, in which a 3D vehicle shape model is converted to the silhouette of the target. The probability is computed as:
$\mathrm{similarSize}(s_i^A, s_j^B) = \mathcal{N}(\lVert s_i^A - s_j^B \rVert;\, m_s, \sigma_s^2)$
In accordance with aspects of the present disclosure, the inference module 153 also determines a classification similarity:
$\mathrm{similarClass}(o_j^A, o_j^B)$
More specifically, the inference module 153 characterizes the empirical probability of classifying a target for each of the visual sensors, as classification accuracy depends on the camera intrinsics and calibration accuracy. The empirical probability is computed from the class confusion matrix for each sensor $A$, where each matrix element $R^{C_A}_{i,j}$ represents the probability $P(o_j^A \mid c_i)$ of classifying object $j$ to class $i$. For computing the classification similarity, a higher weight is assigned to the camera with higher classification accuracy. The joint classification probability of the same object observed from sensors $A$ and $B$ is:
where $o_j^A$ and $o_j^B$ are the observed classes and $c_k$ is the ground truth. Assuming that classification in each sensor is conditionally independent given the object class, the similarity measure can be computed as:
where $P(o_j^A \mid c_k)$ and $P(o_j^B \mid c_k)$ can be computed from the confusion matrix, and $P(c_k)$ can be either set to uniform or estimated as the marginal probability from the confusion matrix.
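The expressions referred to above are not reproduced in this text. Under the stated conditional independence assumption, one form consistent with the surrounding definitions is $P(o_j^A, o_j^B) = \sum_k P(o_j^A \mid c_k)\, P(o_j^B \mid c_k)\, P(c_k)$, and the Python sketch below evaluates that sum. The function name, the matrix layout (rows indexed by ground-truth class, columns by observed class, each row summing to one), and the example matrices are assumptions made for illustration.

```python
from typing import Optional, Sequence


def similar_class(obs_a: int, obs_b: int,
                  conf_a: Sequence[Sequence[float]],
                  conf_b: Sequence[Sequence[float]],
                  prior: Optional[Sequence[float]] = None) -> float:
    """Joint probability of the two observed classes, marginalized over truth.

    conf_x[k][o] is assumed to hold P(observed class o | true class k)
    for sensor x. A uniform prior P(c_k) is used when none is supplied.
    """
    num_classes = len(conf_a)
    if prior is None:
        prior = [1.0 / num_classes] * num_classes
    return sum(conf_a[k][obs_a] * conf_b[k][obs_b] * prior[k]
               for k in range(num_classes))


# Hypothetical confusion matrices; class 0 is human, class 1 is vehicle.
CONF_A = [[0.9, 0.1],
          [0.2, 0.8]]   # sensor A is the more accurate classifier
CONF_B = [[0.7, 0.3],
          [0.4, 0.6]]

both_human = similar_class(0, 0, CONF_A, CONF_B)
mismatch = similar_class(0, 1, CONF_A, CONF_B)
```

Agreeing observations receive a higher score than conflicting ones, and a sensor with a sharper confusion matrix influences the result more strongly, which is one way to realize the higher weight given to the more accurate camera.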
In accordance with aspects of the present disclosure, the inference module 153 further determines an appearance similarity for vehicles and humans. Since vehicles exhibit significant variation in shape due to viewpoint changes, shape-based descriptors did not improve matching scores. A covariance descriptor based only on color gave sufficiently accurate matching results for vehicles across sensors. Humans exhibit significant variation in appearance compared to vehicles and often have noisier localization due to moving too close to each other, carrying an accessory, and casting large shadows on the ground. For matching humans, however, unique compositional parts provide strongly discriminative cues. Embodiments disclosed herein compute similarity scores between target images by matching densely sampled patches within a constrained search neighborhood (longer horizontally and shorter vertically). The matching score is boosted by the saliency score $S$, which characterizes how discriminative a patch is based on its similarity to other reference patches. A patch exhibiting larger variance for the $K$ nearest neighbor reference patches is given a higher saliency score $S(x)$. In addition to the saliency, the similarity score factors in a relevance-based weighting scheme to down-weight patches that are predominantly due to background clutter. Relevant component analysis (RCA) can be used to obtain such a relevance score $R(x)$ from a set of training examples. The similarity $\mathrm{Sim}(x^p, x^q)$ measured between the two images, $x^p$ and $x^q$, is computed as:
where $x^p_{m,n}$ denotes the $(m, n)$ patch from the image $x^p$, p is the normalization confidence, and the denominator term penalizes a large difference in the saliency scores of two patches. RCA uses only positive similarity constraints to learn a global metric space such that intra-class variability is minimized. Patches corresponding to the highest variability are due to background clutter and are automatically down-weighted during matching. The relevance score for a patch is computed as the absolute sum of the vector coefficients corresponding to that patch in the first column vector of the transformation matrix. Appearance similarity between targets is used to generate soft evidence predicates similarAppearance($a_i^A$, $a_j^B$) for associating target $i$ in camera $A$ with target $j$ in camera $B$.
Table 1 below shows event predicates representing various sub-events that are used as inputs for high-level analysis and detecting a complex event across multiple sensors.
In accordance with aspects of the present disclosure, the scene analysis module 155 performs probabilistic fusion for detecting complex events based on predefined rules. Markov logic networks 425 allow principled data fusion from multiple sensors while taking into account errors and uncertainties, and can achieve more accurate inference than the same analysis performed on individual sensors. The information extracted from different sensors differs in its representation and encoded semantics, and therefore should be fused at multiple levels of granularity. Low-level information fusion combines primitive events and local entity interactions in a sensor to infer sub-events. Higher-level inference for detecting complex events progressively uses the more meaningful information generated by low-level inference to make decisions. Uncertainties may be introduced at any stage due to missed or false detections of targets and atomic events, target tracking and association across cameras, and target attribute extraction. To this end, the inference module 153 generates predicates with an associated probability (soft evidence). The soft evidence thus enables propagation of uncertainty from the lowest level of visual processing to high-level decision making.
In accordance with aspects of the present disclosure, the visual processing module 151 models and recognizes events in images. The inference module 153 generates groundings at fixed time intervals by detecting and tracking the targets in the images. The generated information includes sensor IDs, target IDs, zone IDs and types (for semantic scene labeling tasks), target class types, location, and time. A spatial location is a constant pair Loc_X_Y, expressed either as image pixel coordinates or as a geographic location (e.g., latitude and longitude) on the ground map obtained using an image-to-map homography. The time is represented as an instant, Time_T, or as an interval using a starting and an ending time, TimeInt_S_E. In embodiments, the visual processing module 151 detects three classes of targets in the scene: vehicles, humans, and bags. Image zones are categorized into one of three geometric classes. The grounded atoms are instantiated predicates and represent either a target attribute or a primitive event the target is performing. The ground predicates include: (a) zone classifications zoneClass(Z1, ZType); (b) the zone where a target appears, appearI(A1, Z1), or disappears, disappearI(A1, Z1); (c) target classification class(A1, AType); (d) primitive events appear(A1, Loc, Time), disappear(A1, Loc, Time), move(A1, LocS, LocE, TimeInt), and stationary(A1, Loc, TimeInt); and (e) whether a target is carrying a bag, carryBag(A1). The grounded predicates and constants generated by the visual processing module are used to generate the Markov network.
The scene analysis module 155 determines complex events by querying for the corresponding unobserved predicates, running the inference using a fast Gibbs sampler, and estimating their probabilities. These predicates involve both unknown hidden predicates that are marginalized out during inference and the queried predicates. Example predicates, along with their descriptions, are listed in Table 1. The inference module 153 applies Markov logic network 160 inference to detect two different complex activities that are composed of sub-events listed in Table 1:
Complex activities are spread across a network of four sensors and involve interactions between multiple targets, a bag, and the environment. For each of the activities, the scene analysis module 155 identifies a set of sub-events that are detected in each sensor (denoted by sensorXEvents( . . . )). The rules of Markov logic network 160 for detecting sub-events for the complex event bagStealEvent( . . . ) in sensor C1 can be:
The predicate sensorType( . . . ) enforces the hard constraint that only confidences generated from sensor C1 are used for inference of the query predicate. Each of the sub-events is detected using the Markov logic network inference engine associated with each sensor, and the result predicates are fed into higher-level Markov logic networks, along with the associated probabilities, for inferring the complex event. The rule formulation of the bagStealEvent( . . . ) activity can be as follows:
A first-order predicate logic (FOPL) rule can be used for detecting a generic complex event involving multiple targets and target association across multiple sensors. For each sensor, a predicate is defined for events occurring in that sensor. The targets in that sensor are associated with targets in the other sensors using the target association Markov logic network 425 (which infers the equalTarget( . . . ) predicate). The predicate afterInt(Int1, Int2) is true if the time interval Int1 occurs before Int2.
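To make the role of the per-sensor event predicates and equalTarget( . . . ) concrete, the following Python sketch grounds a single toy rule over four Boolean atoms and computes the marginal probability of the complex event by brute-force enumeration rather than by Gibbs sampling. The rule (a weighted biconditional between the complex event and the conjunction of its three supporting atoms), its weight, and the evidence probabilities are all assumptions made for this example; they are not the rules of the disclosure.

```python
import itertools
import math


def log_odds(p: float) -> float:
    """Weight of a unit clause carrying soft evidence with probability p."""
    return math.log(p / (1.0 - p))


def complex_event_marginal(p_s1: float, p_s2: float, p_eq: float,
                           rule_weight: float = 4.0) -> float:
    """P(complexEvent) in a four-atom toy Markov logic network.

    Atoms: s1 and s2 (sub-events seen in two sensors), eq (equalTarget),
    and ce (the complex event). Each world has unnormalized mass
    exp(sum of the weights of its satisfied formulas).
    """
    evidence_weights = (log_odds(p_s1), log_odds(p_s2), log_odds(p_eq))
    total_mass = 0.0
    event_mass = 0.0
    for s1, s2, eq, ce in itertools.product((False, True), repeat=4):
        # Soft evidence: a unit clause per observed atom.
        score = sum(w for w, atom in zip(evidence_weights, (s1, s2, eq))
                    if atom)
        # Toy rule (assumed): ce <=> (s1 and s2 and eq).
        if ce == (s1 and s2 and eq):
            score += rule_weight
        mass = math.exp(score)
        total_mass += mass
        if ce:
            event_mass += mass
    return event_mass / total_mass


strong = complex_event_marginal(0.9, 0.9, 0.9)   # confident association
weak = complex_event_marginal(0.9, 0.9, 0.2)     # doubtful association
```

Lowering the confidence of the target association alone pulls the probability of the complex event down, which illustrates how uncertainty in cross-sensor association propagates to the high-level decision. Enumeration is only feasible for a handful of atoms, which is why sampling is used for realistic networks.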
Inference in Markov logic networks is a hard problem, with no simple polynomial-time algorithm for exactly counting the number of true cliques (representing instantiated formulas) in the network of grounded predicates. The number of nodes in a Markov logic network grows exponentially with the number of rules (e.g., instances and formulas) in the knowledge base. Since all the confidences are used to instantiate all the variables of the same type in all the predicates used in the rules, predicates with high arity cause a combinatorial explosion in the number of possible cliques formed after the grounding step. Similarly, long rules cause high-order dependencies in the relations and larger cliques in the Markov logic network.
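The effect of arity can be illustrated with a simple count: when every variable is instantiated with every constant of its type, the number of candidate ground atoms of a predicate is the product of the domain sizes of its arguments. The sketch below uses hypothetical domain sizes and argument signatures.

```python
from math import prod
from typing import Dict, Sequence


def count_groundings(arg_types: Sequence[str],
                     domain_sizes: Dict[str, int]) -> int:
    """Number of candidate ground atoms for one predicate signature."""
    return prod(domain_sizes[t] for t in arg_types)


# Hypothetical domains: 10 targets, 50 coordinate values, 100 time stamps.
DOMAINS = {"target": 10, "coord": 50, "time": 100}

class_atoms = count_groundings(["target"], DOMAINS)
appear_atoms = count_groundings(
    ["target", "coord", "coord", "time"], DOMAINS)
move_atoms = count_groundings(
    ["target", "coord", "coord", "time", "coord", "coord", "time"], DOMAINS)
```

With these sizes, a unary predicate has 10 candidate atoms, a four-argument predicate has 2,500,000, and a seven-argument predicate has 625,000,000,000, before any formula combines several predicates into a clique.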
The Markov logic network can provide bottom-up grounding by employing a relational database management system (RDBMS) as a backend tool for storage and query. The rules in the Markov logic networks are written to minimize combinatorial explosion during inference. Conditions placed as the last component of either the antecedent or the consequent can be used to restrict the range of confidences used for grounding a formula. Using hard constraints also improves the tractability of inference, as an interpretation of the world that violates a hard constraint has zero probability and can be readily eliminated during bottom-up grounding. Using multiple smaller rules instead of one long rule also improves the grounding by forming smaller cliques and fewer nodes in the network. Embodiments disclosed herein further reduce the arity of the predicates by combining multiple dimensions of the spatial location (X-Y coordinates) and of the time interval (start and end time) into one unit. This greatly improves the grounding and inference steps. For example, the arity of the predicate move(A, LocX1, LocY1, Time1, LocX2, LocY2, Time2) is reduced to move(A, LocX1Y1, LocX2Y2, IntTime1Time2).
Scalable Hierarchical Inference in Markov logic networks: Inference in Markov logic networks for sensor activities can be significantly improved if, instead of generating a single Markov logic network for all the activities, embodiments explicitly partition the Markov logic network into multiple activity-specific networks, each containing only the predicate nodes that appear in the formulas of that activity. This restriction effectively considers only the Markov blanket (MB) of a predicate node for computing the expected number of true groundings, and has been widely used as an alternative to exact computation.
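One way to see why merging dimensions helps is to compare the constants that each representation makes available for grounding. In the hypothetical sketch below, keeping X and Y as separate variables lets every observed X value pair with every observed Y value, whereas a merged Loc_X_Y constant exists only for coordinate pairs that were actually observed. The observation list and the helper name are illustrative assumptions.

```python
from typing import Iterable, Set, Tuple


def merged_location_constants(
        observations: Iterable[Tuple[int, int]]) -> Set[str]:
    """Build one Loc_X_Y constant per distinct observed (x, y) pair."""
    return {f"Loc_{x}_{y}" for x, y in observations}


# Hypothetical tracker output: observed image locations of targets.
OBSERVED = [(10, 20), (10, 25), (30, 20), (30, 25), (50, 60), (10, 20)]

distinct_x = {x for x, _ in OBSERVED}
distinct_y = {y for _, y in OBSERVED}

# Separate dimensions: any x constant may combine with any y constant.
separate_candidates = len(distinct_x) * len(distinct_y)

# Merged dimension: only pairs that were actually seen become constants.
merged_candidates = len(merged_location_constants(OBSERVED))
```

Here the separate representation admits nine location combinations per argument position while the merged one admits five, and the gap widens quickly when a predicate such as move( . . . ) carries two locations and a time interval.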
From an implementation perspective, this is equivalent to having a separate Markov logic network inference engine for each activity and employing a hierarchical inference in which the semantic information extracted at each level of abstraction is propagated from the lowest visual processing level to the sub-event detection Markov logic network engine, and finally to the high-level complex event processing module. Moreover, since the primitive events and various sub-events (as listed in Table 1) depend only on temporally local interactions between the targets, a long temporal sequence is divided into multiple overlapping smaller sequences for analyzing long videos, and the Markov logic network engine is run within each of these sequences independently. Finally, the query result predicates from each temporal window are merged using a high-level Markov logic network engine for inferring long-term events extending across multiple such windows. A significant advantage of this approach is that it supports soft evidence, which allows uncertainties to be propagated through the spatial and temporal fusion process. Result predicates from low-level Markov logic networks are incorporated as rules with weights computed as the log odds of the predicate probability, ln(p/(1-p)). This allows the grounding and inference in the Markov logic networks to be partitioned in order to scale to larger problems.
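The two mechanical steps described above, splitting a long sequence into overlapping windows and converting a result probability into a rule weight, can be sketched as follows. The window length, the overlap, and the clamping constant eps (used to keep the logarithm finite at probabilities of exactly 0 or 1) are assumptions made for this example.

```python
import math
from typing import List, Tuple


def log_odds_weight(p: float, eps: float = 1e-6) -> float:
    """Rule weight ln(p / (1 - p)) for a predicate with probability p."""
    p = min(max(p, eps), 1.0 - eps)  # assumed clamp; avoids log(0)
    return math.log(p / (1.0 - p))


def overlapping_windows(start: int, end: int,
                        length: int, overlap: int) -> List[Tuple[int, int]]:
    """Split [start, end) into windows of `length` sharing `overlap` units."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    step = length - overlap
    windows = []
    t = start
    while t < end:
        windows.append((t, min(t + length, end)))
        if t + length >= end:
            break  # the last window already reaches the end
        t += step
    return windows


# A 100-second sequence cut into 40-second windows overlapping by 10 seconds.
windows = overlapping_windows(0, 100, length=40, overlap=10)

# A sub-event detected with probability 0.9 in one window becomes a
# positively weighted rule for the high-level engine.
weight = log_odds_weight(0.9)
```

A probability of 0.5 maps to a weight of zero, so an uninformative low-level result neither supports nor contradicts the high-level rule, while probabilities above or below 0.5 yield positive or negative weights respectively.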
The flow diagram in
At 505, the process 500 tracks one or more targets (e.g., target 30 and/or 35) detected in the environment using multiple sensors (e.g., sensors 15). For example, the surveillance system can control the sensors to periodically or continually obtain images of the tracked target as it moves through the different fields of view of the sensors. Further, the surveillance system can identify a human target holding a package (e.g., target 30 with package 31) that moves in and out of the field of view of one or more of the cameras. The identification and tracking of the targets can be performed as described previously herein.
At 509, the process 500 (e.g., using visual processing module 151) extracts target information and spatial-temporal interaction information of the targets tracked at 505 as probabilistic confidences, as previously described herein. In embodiments, extracting information includes determining the position of the targets, classifying the targets, and extracting attributes of the targets. For example, the process 500 can determine spatial and temporal information of a target in the environment, classify the target as a person (e.g., target 30), and determine, as an attribute, that the person is holding a package (e.g., package 31). As previously described herein, the process 500 can reference information in learned models 136 for classifying the target and identifying its attributes.
At 513, the process 500 constructs Markov logic networks (e.g., Markov logic networks 160 and 425) from grounded formulae determined by instantiating rules from a knowledge base (e.g., knowledge base 138) using each of the confidences determined at 509, as previously described herein. At 519, the process 500 (e.g., using scene analysis module 155) determines a probability of occurrence of a complex event for an individual sensor based on the Markov logic network constructed at 513, as previously described herein. For example, an event of a person leaving the package in the building can be determined based on a combination of events, including the person entering the building with a package and the person exiting the building without the package.
At 521, the process (e.g., using the inference module 153) fuses the trajectory of the target across more than one of the sensors. As previously discussed herein, a single target may be tracked individually by multiple cameras. In accordance with aspects of the invention, the tracking information is analyzed to identify the same target in each of the cameras and to fuse their respective information. For example, the process may use RCA. In some embodiments, where the target disappears and reappears at one or more entrances of the building, the process may use a Markov logic network (e.g., Markov logic network 425) to predict the duration of time between the disappearance and the reappearance of the target.
At 525, the process 500 (e.g., using scene analysis module 155) determines a probability of occurrence of a complex event for multiple sensors based on the Markov logic network constructed at 513, as previously described herein. At 529, the process 500 provides an output corresponding to one or more of the complex events inferred at 525. For example, based on a predetermined set of complex events inferred from the Markov logic network, the process (e.g., using the scene analysis module) may retrieve images identified with the complex event and provide them
While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
This application claims benefit of prior provisional Application No. 61/973,226, filed Apr. 1, 2014, the entire disclosure of which is incorporated herein by reference.
Number | Date | Country
---|---|---
61973226 | Mar 2014 | US