Currently, video surveillance systems are configured to trigger alarms when, for example, someone is detected moving inside a restricted area. A restricted area is an observed area under the surveillance of the video surveillance system. When an alarm is triggered, a notification or alert is sent to security personnel (for example, security operating center (SOC) personnel).
Video/image processing techniques that are traditionally used in video surveillance systems are capable of detecting movements in a video. Often, video/image processing techniques are highly sensitive to movement in order to ensure that every movement that could possibly pose a threat in the observed area is detected and triggers an alarm. However, this high sensitivity causes many false alarms to be triggered. For example, using video/image processing techniques that are highly sensitive to movement may cause a video surveillance system to trigger a true alarm when an intruder is crawling through grass but may also cause the video surveillance system to trigger a false alarm when it detects moving trees, raindrops on the camera lens, moving animals, and the like. False alarms may account for up to 90 percent of the alarms triggered by a video surveillance system.
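The sensitivity trade-off described above can be illustrated with a minimal frame-differencing motion detector of the kind traditionally used in video surveillance. This is a hedged sketch only; the function name and threshold values are illustrative assumptions, not part of any particular system.

```python
import numpy as np

def motion_detected(prev_frame: np.ndarray, curr_frame: np.ndarray,
                    pixel_threshold: int = 25,
                    area_threshold: float = 0.01) -> bool:
    """Return True when enough pixels changed between two grayscale frames.

    Illustrative only: thresholds here are assumptions. Lowering them
    raises sensitivity, which catches more real intrusions but also
    triggers on trees, rain, and animals (false alarms).
    """
    # Signed difference, widened to avoid uint8 wraparound.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    changed = diff > pixel_threshold          # per-pixel change mask
    return changed.mean() > area_threshold    # fraction of pixels that changed
```

Lowering either threshold makes the detector more sensitive, which is exactly what drives the high false-alarm rates noted above.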
Today, there are surveillance systems which utilize deep neural nets (DNNs) to perform object detection in images. However, there are several disadvantages to using DNN object detection alone to determine whether an alert should be generated in a video surveillance system. One disadvantage is that intruders can easily fool video surveillance systems relying solely on object detection methods to trigger alarms. For example, if a video surveillance system is configured to trigger an alarm when a DNN determines that a person is in a restricted area, the video surveillance system may fail to trigger an alarm when a person covers themselves with a sheet of cardboard and moves in the restricted area.
Another disadvantage is that even state-of-the-art DNNs may fail to determine that an object belongs to a class of objects (for example, people or vehicles) due to training data that is not representative of one or more domains or one or more camera positions. A DNN may also fail to determine that an object belongs to a class of objects due to the quality of an image the DNN is analyzing (for example, an image taken in low light, a gray image, a thermal image, a noisy image, an image with a low number of pixels, an image with low resolution, and the like). These challenges may elevate the potential for false detections and misdetections.
Additionally, in traditional video surveillance systems certain areas monitored by a camera included in the video surveillance system may be what are referred to as “non-sensitive” areas where objects of interest may already be present (for example, parking lots where vehicles are parked). In non-sensitive areas it is desirable to have the video surveillance system generate an alert only when a detected object of interest is moving. In other words, the video surveillance system should not generate an alert when a stationary object of interest is detected in a non-sensitive area of an image. Other areas monitored by the camera may be “sensitive” areas. If an object of interest is present in a sensitive area, it is desirable that the video surveillance system generate an alert, whether or not the object of interest is moving. A DNN which is trained to identify static, non-moving objects may be sufficient to use to generate an alert for sensitive areas but will generate many false alarms for non-sensitive areas. To avoid generating false alarms for non-sensitive areas, some instances described herein utilize a temporal DNN which is trained to identify moving objects and not detect static objects, and some instances only analyze a video captured by the camera when movement is detected in the video. In some instances, a video is a video clip including one or more frames or images.
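The area-dependent alert rule described above can be sketched as a small decision function. The function name and its boolean inputs are assumptions made for illustration; they are not identifiers from the described system.

```python
def should_alert(in_sensitive_area: bool, object_is_moving: bool) -> bool:
    """Area-dependent alerting rule (illustrative sketch).

    Sensitive areas: alert on any detected object of interest,
    stationary or moving. Non-sensitive areas (e.g., parking lots):
    alert only when the object of interest is moving.
    """
    if in_sensitive_area:
        return True           # moving or stationary: always alert
    return object_is_moving   # non-sensitive: alert only on movement
```

This is why a DNN that detects static objects suffices for sensitive areas but produces false alarms for non-sensitive ones, motivating the temporal (motion-trained) DNN described above.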
Aspects, features, and embodiments described herein provide, among other things, a system and a method for reducing false alarms in a video surveillance system. Reducing false alarms can have the unintended consequence of suppressing true alarms. However, the aspects, features, and embodiments described herein also reduce the unintentional suppression of true alarms. To reduce false alarms in a video surveillance system, one aspect combines image-based object classification of a moving object using artificial intelligence (for example, DNNs) with features associated with the moving object. The features are determined using metadata from a video and describe aspects of an object's movement that are relevant to determining whether the movement of the object is associated with human activity.
One example provides a video surveillance system for reducing false alarms. The video surveillance system includes a camera and an electronic processor. The electronic processor is configured to, when a moving object is detected in a video captured by the camera, perform object detection on the video to determine a score associated with a class of objects, wherein the score represents a likelihood that the moving object detected in the video is associated with a class of objects, using metadata associated with the video, determine a feature associated with the moving object detected in the video, and, using a machine learning algorithm, analyze the score associated with the class of objects and the feature associated with the moving object detected in the video, to determine whether the moving object detected in the video is a false alarm or a true alarm. The electronic processor is also configured to, when the moving object detected in the video is a true alarm, generate an alert.
Another example provides a method for reducing false alarms in a video surveillance system. The method includes, when a moving object is detected in a video captured by a camera, performing object detection on the video to determine a score associated with a class of objects, wherein the score represents a likelihood that the moving object detected in the video is associated with a class of objects, using metadata associated with the video, determining a feature associated with the moving object detected in the video, and, using a machine learning algorithm, analyzing the score associated with the class of objects and the feature associated with the moving object detected in the video, to determine whether the moving object detected in the video is a false alarm or a true alarm. The method also includes, when the moving object detected in the video is a true alarm, generating an alert.
Other aspects, features, and embodiments will become apparent by consideration of the detailed description and accompanying drawings.
Before any aspects, features, or embodiments are explained in detail, it is to be understood that this disclosure is not intended to be limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the following drawings. Aspects, features, and embodiments are capable of other configurations and of being practiced or of being carried out in various ways.
A plurality of hardware- and software-based devices, as well as a plurality of different structural components, may be used to implement various embodiments. In addition, embodiments may include hardware, software, and electronic components or modules that, for purposes of discussion, may be illustrated and described as if the majority of the components were implemented solely in hardware. However, one of ordinary skill in the art, and based on a reading of this detailed description, would recognize that, in at least one embodiment, the electronic-based aspects of the invention may be implemented in software (for example, stored on a non-transitory computer-readable medium) executable by one or more processors. For example, “control units” and “controllers” described in the specification can include one or more electronic processors, one or more memory modules including non-transitory computer-readable medium, one or more communication interfaces, one or more application specific integrated circuits (ASICs), and various connections (for example, a system bus) connecting the various components.
The memory 125 includes object identification software 135, behavioral feature determination software 140, and machine learning fusion software 145. In some instances, the object identification software 135, behavioral feature determination software 140, and machine learning fusion software 145 include computer executable instructions which, when executed by the electronic processor 130, cause the electronic processor 130 to perform the functionality described herein. It should be understood that functionality described herein as being performed when multiple software components are executed by the electronic processor 130 may be performed by a single software component executed by the electronic processor 130. For example, the behavioral feature determination software 140 and machine learning fusion software 145 may be combined into a single software component. It should be understood that the memory 125 may include components in addition to those illustrated in
At block 215, the electronic processor 130, using metadata associated with the video, determines a feature associated with the moving object detected in the video. In some instances, the metadata includes timestamped positions of the moving object, bounding boxes around the moving object, a trajectory of the moving object, and the like. In some instances, using metadata associated with the video, the electronic processor 130 determines a plurality of features associated with the moving object detected in the video. The features determined by the electronic processor 130 include, for example, at least one selected from the group consisting of (a) a displacement of the moving object over time, (b) a change in bounding box height associated with the moving object over time, (c) an average directional change of the moving object over time, (d) from a starting position of the moving object, a standard deviation of directional change of the moving object over time, (e) an average distance traveled by the moving object over time, (f) a standard deviation of a distance traveled by the moving object over time, (g) a difference in bounding box width between frames, (h) a difference in bounding box height between frames, (i) a mean absolute percentage error associated with fitting a line to direction values associated with the moving object over time, (j) a mean absolute percentage error associated with fitting a line to position values associated with the moving object over time, and (k) a mean absolute percentage error associated with fitting a line to distance values associated with the moving object over time. In other instances, the electronic processor 130 may determine fewer, additional, or different features than those described above.
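A few of the listed features can be sketched as simple computations over the metadata. The sketch below assumes the metadata is available as per-frame object centers and bounding-box sizes with at least three tracked frames; the input format and function name are assumptions for illustration, not the system's actual interface.

```python
import math

def trajectory_features(centers, boxes):
    """Compute a subset of the listed metadata features (illustrative).

    centers: [(x, y), ...] per-frame object centers (at least 3 frames)
    boxes:   [(w, h), ...] per-frame bounding-box width/height
    """
    # Per-step distances between consecutive positions.
    steps = [math.dist(a, b) for a, b in zip(centers, centers[1:])]
    mean_step = sum(steps) / len(steps)
    std_step = (sum((s - mean_step) ** 2 for s in steps) / len(steps)) ** 0.5
    # Heading of each step, and the change in heading between steps.
    headings = [math.atan2(b[1] - a[1], b[0] - a[0])
                for a, b in zip(centers, centers[1:])]
    turns = [abs(h2 - h1) for h1, h2 in zip(headings, headings[1:])]
    return {
        "displacement": math.dist(centers[0], centers[-1]),      # feature (a)
        "height_change": boxes[-1][1] - boxes[0][1],             # feature (b)
        "mean_turn": sum(turns) / len(turns) if turns else 0.0,  # feature (c)
        "mean_distance": mean_step,                              # feature (e)
        "std_distance": std_step,                                # feature (f)
    }
```

For a straight, steady track, the turn and distance deviations are near zero, which later helps distinguish purposeful movement from jitter such as foliage or rain.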
Using the features determined by the electronic processor 130, insights regarding the movement of the moving object may be determined, including (a) whether the movement of the moving object is smooth and continuous, (b) whether the size of the moving object is consistent when the moving object is not moving toward or away from the camera 105, (c) a linear transformation of bounding-box width and height between frames when the moving object is moving toward or away from the camera 105, (d) whether the moving object is moving in a consistent direction, and (e) whether the movement of the moving object is purposeful. These insights are indicative of whether the movement is associated with human activity and, consequently, with a true alarm (for example, a vehicle or a person). For example, if the moving object remains the same size (the bounding box width and height stay the same) while seemingly moving towards the camera 105, the moving object is likely a false alarm (for example, a bug crawling on the lens or a raindrop trickling down the lens).
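The lens-artifact example above can be expressed as a small heuristic: an object that appears to approach the camera while its bounding box stays a constant size is likely something on the lens itself. The function name and the tolerance value are illustrative assumptions only.

```python
def likely_lens_artifact(heights, apparent_approach: bool,
                         tolerance: float = 0.05) -> bool:
    """Flag a probable false alarm (illustrative sketch).

    heights: bounding-box heights of the tracked object over time.
    apparent_approach: whether the track suggests movement toward the camera.

    A real object approaching the camera should grow in the image;
    an "approaching" object of constant size suggests a bug or raindrop
    on the lens. The 5% tolerance is an assumed value.
    """
    growth = (max(heights) - min(heights)) / max(heights)
    return apparent_approach and growth < tolerance
```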
At block 220, the electronic processor 130 uses a machine learning algorithm to analyze the score associated with the class of objects and the feature associated with the moving object detected in the video to determine whether the moving object detected in the video is a false alarm or a true alarm. In some instances, the electronic processor 130 may also use the machine learning algorithm to analyze the positions associated with the moving object over time as determined using the DNN, the bounding boxes associated with the moving object over time as determined using the DNN, or both. In some instances, the electronic processor 130 may generate an overlap score and additionally analyze the overlap score with the machine learning algorithm. An overlap score may be determined based on the similarity between the positions associated with the moving object over time as determined using the DNN, the bounding boxes associated with the moving object over time as determined using the DNN, or both, and the metadata-based feature determined at block 215. In some instances, the machine learning algorithm is a random forest classifier. A random forest is a model ensemble approach to classification, regression analysis, and other types of problems, which builds multiple decision trees over randomly sampled training datasets. A random forest classifier performs well even when presented with outliers and noise. A random forest also provides an estimated error associated with the classification, an estimation of the strength or accuracy of individual trees, correlations between trees, and an estimate of the importance associated with each of a plurality of variables.
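The fusion step at block 220 can be sketched with a random forest over the DNN class score plus metadata features, here using scikit-learn. The feature choice, training rows, and labels below are synthetic and purely illustrative; they are not the system's actual training data.

```python
from sklearn.ensemble import RandomForestClassifier

# Each row: [DNN person-score, displacement, mean turn angle].
# Labels: 1 = true alarm, 0 = false alarm. Synthetic illustrative data.
X_train = [
    [0.9, 12.0, 0.1],   # high score, long smooth track: person walking
    [0.8,  9.0, 0.2],
    [0.2,  0.5, 2.8],   # low score, erratic jitter: rain or foliage
    [0.1,  0.3, 3.0],
]
y_train = [1, 1, 0, 0]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

def is_true_alarm(score: float, displacement: float, mean_turn: float) -> bool:
    """Fuse the DNN score with metadata features (illustrative sketch)."""
    return bool(clf.predict([[score, displacement, mean_turn]])[0])
```

The ensemble of randomly trained trees is what gives the random forest its robustness to the outliers and noise mentioned above, and `clf.feature_importances_` exposes the per-variable importance estimates.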
At block 225, the electronic processor 130 generates an alert when the moving object detected in the video is a true alarm. In some cases, the electronic processor 130 sends the generated alert to an output device (for example, the display device 120, a speaker, an LED light, a combination of the foregoing, or the like). In some instances, the electronic processor 130 sends the video along with the generated alert to the display device 120. The display device 120 displays the video along with the generated alert to allow security personnel to review the video. Based on their analysis of the video, the security personnel may provide feedback to the electronic processor 130 regarding the generated alert via the input device 115. For example, if the electronic processor 130 generated an alert based on a video that shows a tumbleweed blowing in the wind, security personnel may provide the feedback that the video contains a false alarm. Based on the feedback received, the electronic processor 130 adjusts or retrains the DNN executed at block 210, the machine learning algorithm executed at block 220, or both. The retraining process is represented by the dashed lines in
In some instances, to minimize the number of true alarms that the electronic processor 130 fails to generate an alert for, the electronic processor 130 operates in a “high-alert mode” for a predetermined amount of time after a camera in the video surveillance system 100 captures a moving object that the electronic processor 130 determines to be a true alarm. For example, in some instances, the electronic processor 130 generates a second alert when a second moving object or the moving object is detected in a second video captured by the second camera 110 within a predetermined amount of time after the camera 105 captured the video and the moving object detected in the video is a true alarm. In other words, when the electronic processor 130 determines that a moving object captured by the camera 105 in a video surveillance system 100 is a true alarm and generates an alert, if another camera (for example, the second camera 110) in the video surveillance system 100 captures a moving object (for example, the moving object captured by the camera 105 or a different moving object) within a predetermined amount of time (for example, 30 minutes after the moving object is detected in the video captured by the camera 105), the electronic processor 130 generates an alert without performing the method 200 to determine whether the moving object captured by the second camera 110 is a false alarm. In another example, when a second moving object or the moving object is detected in a second video captured by the camera 105 within a predetermined amount of time after the moving object is detected in the video and the moving object detected in the video is a true alarm, the electronic processor 130 generates a second alert.
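The timing rule behind the high-alert mode can be sketched as a small state holder: once a true alarm is confirmed, any detection within the window alerts immediately, bypassing the false-alarm analysis. The class name, attribute names, and 30-minute default are assumptions for illustration.

```python
class HighAlertMode:
    """Illustrative sketch of the high-alert window described above."""

    def __init__(self, window_seconds: float = 30 * 60):
        self.window = window_seconds      # assumed 30-minute example window
        self.last_true_alarm = None       # timestamp of last confirmed true alarm

    def record_true_alarm(self, now: float) -> None:
        """Called when a moving object is confirmed as a true alarm."""
        self.last_true_alarm = now

    def bypass_analysis(self, now: float) -> bool:
        """While the window is active, any new detection (on this camera or
        another camera in the system) alerts immediately, skipping the
        false-alarm determination."""
        return (self.last_true_alarm is not None
                and now - self.last_true_alarm <= self.window)
```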
The one or more scores generated at block 505 and the features determined at block 510 are input to block 515. At block 515, the scores and the features are analyzed with the machine learning algorithm (by the electronic processor 130 executing the machine learning fusion software 145) to determine whether the moving object is a true alarm or a false alarm. As described above, when the electronic processor 130 determines that the moving object is a true alarm, the electronic processor 130 generates an alert and sends the alert to the display device 120. The display device 120 may display a visual representation of the alert, a video that caused the alert to be generated, or both via a software application user interface (UI) (represented by block 520 in
Thus, the aspects, features, and embodiments described herein provide, among other things, a video surveillance system and a method for reducing false alarms in a video surveillance system. Various features and advantages of the embodiments are set forth in the following claims.