The present invention relates to clustering methods in general, and in particular to anomaly detection within context-aware data.
The present invention is in the field of solutions for internet of things (IoT) device providers, and for IoT analytic platform providers. The invention provides a generic capability to detect relevant events, reduce false-alerts and configure the detection parameters automatically based on training data only, taking away the tremendous costs of sensor-specific analytic configurations. The invention therefore enables market differentiation and increases productivity during deployment and maintenance of event detection systems.
Anomaly detection in observed data is performed by training or developing models of normality; anomalies are then detected by observing deviations of the tested data from the normality models.
In the case of vehicle traffic anomaly detection, normal traffic can change from minute to minute; accordingly, at least 2,900 models with 20 parameters each have to be trained for such a process, even with no significant context-dependent behavior. The data collection process requires at least six weeks of collecting data samples for training the normality models. The training process, for training the 2,900 models, requires 5 MBytes of parameter data per sensor to be kept in memory in order to perform real-time anomaly detection. When the above measurement data is extended with context variables, the amount of required training data and the required memory grow exponentially, along with the resource consumption (memory and processing time) for training the models and for real-time detection. Further, this would require a large amount of training data to be collected in order to cover the context space with sufficient data points.
The obvious way to deal with context-aware detection is to carefully design a context partitioning for each anomaly detection use-case, so that the models' count remains reasonable. To do that, knowledge about the observed system needs to be gained through domain expertise or by investigating a significant volume of annotated measurement data, in order to identify which context parameters should be considered and at what granularity. For example, the insight must be contributed that Saturday and Sunday can be treated the same for traffic incident detection. Of course, this particular insight varies depending on where the sensor is deployed; for example, different countries have different weekend days (e.g., Friday and Saturday in the Middle East). Another example is the influence of the weather, which might depend on the type of road; therefore, for some sensors the weather condition should be incorporated, and for some it can be left out.
Context dependent anomaly detection has been solved in the prior art using either manual methods or adaptive context partitioning methods, as described in the following.
According to the manual context partitioning method, separate models are maintained for the different contexts, and any available context-agnostic models are used to model the measurements of a specific context. The context subspaces are defined manually for every use-case, for example by incorporating knowledge about weekend and weekday behavior, or by using a very large volume of training data. Examples of the manual context partitioning method are disclosed in Ihler et al., Adaptive event detection with time-varying Poisson processes, KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 207-216, ACM New York, N.Y., USA ©2006, and in Cobb et al., U.S. Pat. No. 8,167,430.
According to the conditional probability distribution learning method, the observed measurements are modelled as being generated by a conditional random distribution, with the context parameters as the condition space. Conditional probabilities can be learned through estimation of a total probability distribution; however, this is rarely practical, because the huge volume of training data required is seldom available. An alternative method is Bayesian networks, as disclosed in Chapman et al., U.S. Pat. No. 8,682,571, and Downs et al., U.S. Pat. No. 7,899,611. The structure of such networks can be defined manually or by learning methods. However, these methods are only well-defined for discrete variables. As anomaly detection is usually performed on continuous measurement data, such methods cannot be directly applied.
According to the function estimation method the observed measurements are modelled as being generated by a deterministic function. For example, this can be done through decision tree learning as disclosed in Chapman et al. U.S. Pat. No. 8,682,571 and in Downs et al. U.S. Pat. No. 7,899,611, or through neural networks or look-up tables as disclosed in Burgess, Two Dimensional Time-Series for Anomaly Detection and Regulation in Adaptive Systems, lecture notes in computer science, volume 2506, 2002, pp 169-180.
In practice, however, neither observed systems nor sensors exhibit deterministic behavior; measurement noise and the system's variational behavior are prominent in practical anomaly detection problems, and therefore function estimation methods can neither be learned reliably for, nor represent, such systems.
Clustering methods are widely used for unsupervised categorization of multi-dimensional data, for example to identify customer segments in customer relationship management data. Vector quantization is an application used for clustering, for example for lossy video and image compression, where the measurement data is represented by respective cluster centers. Gupta et al., Context-aware time series anomaly detection for complex systems, Workshop Notes, 2nd Workshop on Data Mining for Service and Maintenance, Austin, Tex., May 4, 2013, pp. 14-22, discloses clustering context variables for context-aware anomaly detection. Gupta et al. map extracted context variables for further partitioning of the data according to time series.
Accordingly, there is still an unanswered, long-felt need for a method and system that would efficiently use the context information of the measured data for accurate anomaly detection, and that would require smaller training sets and a shorter training process.
It is one object of the present invention to disclose a method for detecting anomalies in monitored data having a plurality of data-segments partitioned into context-related initial-subspaces, the method comprising:
It is another object of the present invention to disclose the method as defined above, wherein the data is continuous measurement-data collected from at least one sensor; and wherein the plurality of data-segments are feature-vectors extracted from a plurality of sections of the data.
It is another object of the present invention to disclose the method as defined above, further comprising extracting the plurality of the feature-vectors from the plurality of sections.
It is another object of the present invention to disclose the method as defined above, wherein the extracting is performed by a method selected from the group consisting of: principal component analysis (PCA), independent component analysis, minimum noise fraction, random forest embedding, non-negative matrix factorization, and any combination thereof.
It is another object of the present invention to disclose the method as defined above, wherein each of the plurality of data-segments is labeled with at least one context-label; and wherein the method further comprises partitioning the plurality of data-segments into the context-related initial-subspaces, responsive to a predetermined similarity in the at least one context-label.
It is another object of the present invention to disclose the method as defined above, further comprising selecting the at least one context-label from the group consisting of: days of the week, midweek- or weekend-days, time of the day, light- or dark-hours, holidays, public events, weather conditions, visibility, temperature, locations, measuring scenarios, population, and any combination thereof.
It is another object of the present invention to disclose the method as defined above, wherein the data is vehicle traffic measured data.
It is another object of the present invention to disclose the method as defined above, further comprising clustering the feature-vectors into the feature-clusters, using an unsupervised clustering-method.
It is another object of the present invention to disclose the method as defined above, wherein at least one of the following holds true:
It is another object of the present invention to disclose the method as defined above, wherein at least one of the following holds true:
It is another object of the present invention to disclose the method as defined above, wherein the training further comprises defining at least one additional feature-cluster associated with the data-segments of at least one of the initial-subspaces, responsive to a failure of the at least one of the initial-subspaces to comply with the fit-criterion.
It is another object of the present invention to disclose the method as defined above, further comprising repeating the training and the concatenating, responsive to the defining of the at least one additional feature-cluster.
It is another object of the present invention to disclose the method as defined above, further comprising selecting the fit-criterion from the group consisting of: frequency threshold, average deviation threshold, statistical properties deviation threshold, dedicated metrics, Silhouette coefficients, and any combination thereof.
It is another object of the present invention to disclose the method as defined above, wherein the pinpointing and the triggering are in real-time.
It is another object of the present invention to disclose the method as defined above, wherein at least one of the following holds true:
It is another object of the present invention to disclose the method as defined above, further comprising selecting the trigger-criterion from the group consisting of:
It is another object of the present invention to disclose a computer system for detection of anomalies in monitored data having a plurality of data-segments partitioned into context-related initial-subspaces, the detection according to method steps comprising:
It is still an object of the present invention to disclose a non-transitory computer readable medium (CRM) that, when loaded into a memory of a computing device and executed by at least one processor of the computing device, is configured to execute the steps of a computer-implemented method for detecting anomalies in monitored data having a plurality of data-segments partitioned into context-related initial-subspaces, the steps comprising:
It is lastly an object of the present invention to disclose the CRM as defined above, wherein at least one of the following holds true:
The subject matter disclosed may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
For simplicity and clarity of illustration, elements shown are not necessarily drawn to scale, and the dimensions of some elements may be exaggerated relative to other elements. In addition, reference numerals may be repeated to indicate corresponding or analogous elements.
The following description is provided, alongside all chapters of the present invention, so as to enable any person skilled in the art to make use of the invention, and sets forth the best modes contemplated by the inventor of carrying out this invention. Various modifications, however, will remain apparent to those skilled in the art, since the generic principles of the present invention have been defined specifically to provide a method and a system for detecting anomalies in monitored data having a plurality of data-segments partitioned into initial-subspaces, according to context-labels of the data-segments.
The present invention provides a new method for detecting anomalies in monitored data having a plurality of data-segments partitioned into context-related initial-subspaces, the method comprising:
The present invention further provides a new computer system for detection of anomalies in monitored data having a plurality of data-segments partitioned into context-related initial-subspaces, the detection according to method steps comprising:
The present invention further provides a new non-transitory computer readable medium (CRM) that, when loaded into a memory of a computing device and executed by at least one processor of the computing device, is configured to execute the steps of a computer-implemented method for detecting anomalies in monitored data having a plurality of data-segments partitioned into context-related initial-subspaces, the steps comprising:
Unless specifically stated otherwise, as apparent from the following discussions, throughout the specification discussions utilizing terms such as “processing”, “computing”, “storing”, “calculating”, “determining”, “evaluating”, “measuring”, “providing”, “transferring”, “outputting”, “inputting”, or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing device, that manipulates and/or transforms data represented as physical, such as electronic, quantities within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices.
The term “pinpoint” (or any form thereof), used herein is to be commonly understood as any of: find, locate, identify, indicate, determine, detect, notice, discover, recognize, diagnose, spot, investigate and trace.
The term “cluster” (or any form thereof), used herein refers to the task of grouping a set of objects (or as used herein a set of data-vectors) according to their features and/or characteristics in such a way that objects in the same group (called a cluster) are more similar in nature to each other than to those in other groups (clusters).
The term “context” (or any form thereof), used herein refers to the group of conditions that exist where and when the data was or is collected.
The term “anomaly” (or any form thereof), used herein is to be commonly understood as any of: irregularity, abnormality, difference, divergence and deviation.
According to various embodiments of the present invention, a system and a method are disclosed that are configured to find clusters in the measurement data and to establish a mapping between the measurements' context subspaces and the data's clusters, in order to detect anomalies in the measured data.
Typically, anomaly detection is performed by learning models of normality and detecting deviations of new observations from the learned or trained models. The observed systems often behave differently depending on context, like time of day, weather and public holidays. For example, for traffic anomaly detection, the traffic flow parameters may depend on: days of the week, mid-week or weekend days, time of the day, light or dark hours, holidays, special events, weather conditions, road condition, visibility, temperature, locations and measuring scenarios. These context parameters should be incorporated into the anomaly detecting model in order to avoid false-alerts and to maintain detection sensitivity. It is known in the art that when introducing additional context variables, the amount of training data and the required memory grow exponentially.
The common way to deal with models for context-aware data is to carefully design a context partitioning for each anomaly detection use-case, so that the models' count remains reasonable. To do that, knowledge about the observed system needs to be gained through domain expertise or by investigation of a significant volume of annotated measurement data, in order to identify which context parameters need to be considered and at what granularity. For example, in vehicle traffic, knowledge must be available that Saturday and Sunday can be treated the same for traffic incident detection; however, this particular insight may vary depending on where the sensor is deployed. Different countries have different weekend days (e.g., Friday and Saturday in the Middle East). Another example is the influence of the weather condition, which may depend on the type of road; therefore, for the measurements of some sensors, context regarding the weather condition should be incorporated, and for some it can be left out. According to embodiments of the present invention, the disclosed system and method incorporate the context information with automatic optimization methods for the context's space, without the need for human supervision or annotated training data.
Anomaly detection is performed by learning models of normality and detecting deviations of new observations from the learned models. Typically, the data space is spanned from the measurements of at least one sensor providing a stream of data. The data is then collected at varying or constant measurement intervals and stored in a database. The sensor's measurement can be a single value in time, represented by a single variable, or a set of values, represented as a measurement vector. The training data is then extracted from the database, at regular intervals (e.g., once a day), to learn the normality model, using statistical methods like minimum covariance determinant (MCD), regression methods, or clustering methods; or classification methods like support vector machines (SVM) or one-class SVM. For real-time anomaly detection, new incoming sensor measurements are tested against the learned model in order to calculate the magnitude of the deviation of the tested data from the model's mapped clusters.
According to some embodiments of the present invention, the magnitude of the deviation is further manipulated to define an anomaly index. The anomaly index and the actual deviation from the normal distribution are further used to decide if an anomaly event is raised. The anomaly event is then presented to the user or used for triggering automatic actions; for example, if a traffic accident is detected, triggering an alert to the relevant authorities and redirecting the traffic.
The measurement data usually contains measuring noise. The observed system can be better described via selected features that are extracted from the measurement vector. According to some embodiments of the present invention, a step of feature extraction is used to remove noise and extract relevant features.
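By way of a non-limiting illustrative sketch of such a feature extraction step (the data dimensions, sample counts and component count below are assumptions for illustration only, not part of the claimed method), principal component analysis (PCA) may reduce noisy measurement vectors to a few de-noised features:

```python
import numpy as np

# Hypothetical noisy measurement vectors: 200 samples, 8 raw dimensions.
rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 8))
raw += 0.05 * rng.normal(size=raw.shape)   # simulated sensor noise

# PCA via singular value decomposition of the mean-centered data.
centered = raw - raw.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Keep only the 3 strongest principal components as the feature-vector.
features = centered @ vt[:3].T
```

Each row of `features` would serve as one feature-vector in the subsequent clustering and model-training steps.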
Context information often has strong influence on the behavior of an observed system; in traffic flow for example: time of day, weather, holidays, sport event and such. An anomaly detection system, as described in the above and in
Partitioning categorical information is achieved by assigning a context subspace for each category. Continuous information, like timestamps, has to be discretized using a uniform discretization. Multiple context variables can be combined through concatenation or generalization; for example, a partitioning that takes into account day of week and time of day. The following context subspaces can be defined, as shown in Table 1, considering the day of the week and the time at one-minute resolution.
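As a minimal illustrative sketch of such a combined partitioning (the key layout is an assumption for illustration, not the claimed partitioning itself), each timestamp can be mapped to one of 7 x 1440 = 10,080 (day-of-week, minute-of-day) context subspaces:

```python
from datetime import datetime

def context_key(ts: datetime) -> tuple:
    # One context subspace per (day-of-week, minute-of-day) pair:
    # 7 days x 1440 minutes = 10,080 initial-subspaces.
    return (ts.weekday(), ts.hour * 60 + ts.minute)

# Example: a Saturday at 08:30 maps to day index 5, minute 510.
key = context_key(datetime(2024, 1, 6, 8, 30))
```

This illustrates the limitation discussed below: even two context variables at fine granularity already yield over ten thousand subspaces, each requiring its own model and training data.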
Using this approach, the context dependency can be modelled very accurately, however there are limitations:
The obvious way to deal with the above mentioned limitations is to carefully design the partitioning for each anomaly detection use-case. To do that, knowledge about the observed system needs to be gained through domain expertise or by investigating a significant volume of annotated measurement data, in order to identify which context parameters should be considered and at what granularity.
The general approach described above can be applied, according to a non-limiting example, to traffic anomaly detection. According to some embodiments of the present invention, the measuring sensors may include: license plate recognition (LPR) sensors, video analytics and magnetic loop detectors. The characteristic features extracted from the raw data can include: average speed, total vehicle volume, speed difference between the different lanes and vehicle volume difference between the different lanes. The data, according to this example, is acquired and stored once a minute. Weekend days and weekdays have to be treated separately, and different times of day are partitioned according to one-minute intervals.
According to some embodiments of the present invention, a minimum covariance determinant method (MCD) is used to model the distribution of the data inside a context subspace.
According to some embodiments of the present invention, in order to reduce false anomaly-alerts, a persistence check is applied to make sure that the abnormal state persists at least for two minutes until an anomaly-detection is triggered.
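The persistence check described above can be sketched as follows (the two-minute window matches the text; the flag sequence is a hypothetical example):

```python
def persistent_alarm(flags, min_minutes: int = 2):
    """Raise an alarm only after the abnormal flag persists
    for at least min_minutes consecutive samples."""
    run = 0
    for i, abnormal in enumerate(flags):
        run = run + 1 if abnormal else 0
        if run >= min_minutes:
            return i          # sample index at which the alarm is raised
    return None               # no persistent anomaly observed

# One flag per one-minute measurement; isolated spikes are suppressed.
alarm_at = persistent_alarm([False, True, False, True, True, True])
```

Here the isolated abnormal minute at index 1 is ignored, and the alarm is only raised once two consecutive abnormal minutes have been observed.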
According to some embodiments of the present invention, the deviation vector, which is the difference of a measurement from the mean vector of its corresponding model, can be used to distinguish different types of traffic anomalies, for example traffic jam and partial road-blocks, by applying simple rules on the deviation vector; like for example speed difference thresholds.
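A minimal sketch of such rule-based classification of the deviation vector follows (the threshold values and the two-component deviation layout are hypothetical assumptions for illustration, not prescribed by the invention):

```python
def classify_deviation(dev, speed_drop=-20.0, volume_drop=-100.0):
    """Apply simple threshold rules to a (speed, volume) deviation
    vector, i.e. the difference of a measurement from the model mean."""
    speed_dev, volume_dev = dev
    if speed_dev < speed_drop and volume_dev < volume_drop:
        # Both speed and volume collapse: lanes are likely blocked.
        return "partial road-block"
    if speed_dev < speed_drop:
        # Speed collapses while volume stays near normal: congestion.
        return "traffic jam"
    return "normal"

label = classify_deviation((-35.0, -150.0))
```

The same pattern extends to per-lane speed-difference thresholds mentioned in the text.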
In order to overcome the above mentioned limitations of static or hand-crafted context partitioning, the present invention discloses an adaptive method to determine efficient context-aware partitions, which incorporates the features of the actual measurement data. According to a preferred embodiment, the method spans a map between clusters of the measurement data and initial-subspaces of the initial context-aware partitions; the initial-subspaces are based solely on the context-labels.
Further mapping is conducted by observing common distributions or clusters in the measurement's data and concatenating the initial-subspaces that share similar data distributions or similar clusters into common clusters-subspaces. In so doing, the initial context-aware subspaces are concatenated into fewer cluster-subspaces. Accordingly the amount of data available for the models' training is increased and the required memory and number of models are reduced, without the use of any manual optimization or configuration.
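The concatenation step above can be sketched as follows (a simplified illustration under stated assumptions: the subspace data, the pretrained cluster centers, and the majority-vote assignment are all hypothetical stand-ins for the trained clustering):

```python
import numpy as np

# Hypothetical feature-vectors of six initial-subspaces A..F;
# A..C follow one traffic pattern, D..F another.
rng = np.random.default_rng(2)
subspaces = {name: rng.normal(10.0, 1.0, size=(50, 2)) for name in "ABC"}
subspaces.update({name: rng.normal(40.0, 1.0, size=(50, 2)) for name in "DEF"})

# Pretrained cluster centers of the measurement data (assumed given).
centers = np.array([[10.0, 10.0], [40.0, 40.0]])

def dominant_cluster(vectors: np.ndarray) -> int:
    # Assign each feature-vector to its nearest center; majority wins.
    dist = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
    return int(np.bincount(dist.argmin(axis=1)).argmax())

# Concatenate initial-subspaces sharing a dominant cluster
# into common cluster-subspaces.
merged = {}
for name, vectors in subspaces.items():
    merged.setdefault(dominant_cluster(vectors), []).append(name)
```

Six initial-subspaces are thereby reduced to two cluster-subspaces, so each merged model trains on three times the data while the model count shrinks, as described above.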
According to one embodiment of the invention the mapping method is implemented as follows:
According to another embodiment of the invention, the mapping is implemented as follows:
Reference is now made to
Reference is now made to
Reference is now made to figures
Specifically,
According to another embodiment of the invention, the initial-subspace D (514), which could not be associated with any of the data's clusters (531-533), may be considered as having a redundant context-label, which should be ignored; the data-segments or feature-vectors of that initial-subspace (514) should then be spread among, and related to, any of the other initial-subspaces (511-513, 515-516).
According to an embodiment of the invention, the fit-criterion is a predetermined threshold on the average deviation of the feature-vectors of an initial-subspace from the center of the examined feature-cluster.
According to another embodiment of the invention, the fit-criterion is a predetermined threshold for the difference between the statistical properties (e.g. standard deviation, covariance matrix) of all related feature-vectors assigned to a specific feature-cluster and the statistical properties of the feature-vectors of the particular examined initial-subspace.
According to another embodiment of the invention, the fit-criterion is chosen as a dedicated metric. The dedicated metrics can be derived purely from empiric methods (e.g., the elbow method), which typically require human interpretation and can sometimes be ambiguous; from fully automated ones (for example, approaches based on the Bayesian Information Criterion for clustering), which typically require a lot of data; as well as from methods that fall between the two extremes, such as Silhouette coefficients and diagrams. An example of a dedicated goodness-of-fit metric is the Silhouette coefficient, although other metrics may also be employed.
Specifically, Silhouette coefficients measure the cohesion of each (potentially new) point of a cluster to the others, as well as the separation from the most nearby cluster. When used to examine if a new point “p” should be assigned to a particular cluster “C” the method is as follows:
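The check described above can be sketched as follows (an illustrative implementation of the standard Silhouette coefficient for a single point; the example clusters and the acceptance threshold would be use-case specific assumptions):

```python
import numpy as np

def silhouette_of_point(p, cluster_c, nearest_cluster):
    """Silhouette coefficient of point p with respect to cluster C:
    a measures cohesion (mean distance to members of C),
    b measures separation (mean distance to the most nearby cluster)."""
    a = float(np.mean(np.linalg.norm(cluster_c - p, axis=1)))
    b = float(np.mean(np.linalg.norm(nearest_cluster - p, axis=1)))
    # Result lies in [-1, 1]; values near 1 mean p fits C well.
    return (b - a) / max(a, b)

c = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])     # candidate cluster C
other = np.array([[10.0, 10.0], [11.0, 10.0]])         # most nearby cluster
score = silhouette_of_point(np.array([0.5, 0.5]), c, other)
```

A point "p" would then be assigned to cluster "C" only if its coefficient exceeds a chosen threshold.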
To demonstrate the advantages of the embodiments of the present invention, experimental detection results on simulated datasets are presented. Each dataset simulates a daily recurring process as is common in traffic monitoring, with several steady-state switches during the day, e.g., low traffic at nighttime, and morning/evening rush-hours. Measurements were taken at one-minute intervals, with four feature measurement dimensions (four different sensors) and with different daily patterns, including weekend and weekday patterns. White Gaussian noise of −20 dB relative to the measurement level was added to simulate sensor noise. Eighty anomalies, each of twenty minutes' duration, were introduced by adding a constant vector to the normal feature-vector. The magnitude of the anomaly vector is 12 dB above the additive noise level.
A comparison is provided between: model computation time, size of the trained model (measured in memory Bytes) and detection accuracy (demonstrated by F-Measure) of three prior art hand-crafted partitioning configurations versus the currently disclosed adaptive partitioning method.
The three prior art demonstrated methods are:
The currently disclosed adaptive partitioning method is demonstrated using 1 min data-segments, with the context labels being the time of the day (TOD), and where the clustering method is K-means, with K=150 clusters; noted as "Auto (150 Cluster)" or as "Adaptive (150 Clusters)". The anomalies were detected for all four methods using the MCD anomaly detection method.
The number of clusters influences the resolution of the normality model and the number of cluster-subspaces created. It can therefore be used to control the maximum amount of memory used. To further automate the selection of the number of clusters, clustering methods that automatically decide on the number of clusters based on the data can be applied, for example DBSCAN.
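To illustrate how such a method discovers the cluster count from the data alone, the following is a deliberately minimal DBSCAN sketch (a simplified O(n²) implementation for illustration; the synthetic two-blob data and the eps/min_pts values are assumptions, and a production system would use an optimized library implementation):

```python
import numpy as np

def dbscan(points, eps=1.0, min_pts=3):
    """Minimal DBSCAN: the number of clusters emerges from the data."""
    n = len(points)
    labels = [-1] * n          # -1 marks unvisited / noise points
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    cluster = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        neighbors = [j for j in range(n) if dist[i, j] <= eps]
        if len(neighbors) < min_pts:
            continue           # not a core point; leave as noise for now
        labels[i] = cluster    # start a new cluster and grow it
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster
            j_neigh = [k for k in range(n) if dist[j, k] <= eps]
            if len(j_neigh) >= min_pts:
                queue.extend(j_neigh)   # j is also a core point
        cluster += 1
    return labels, cluster

# Two well-separated synthetic blobs: the algorithm should find 2 clusters
# without K being specified in advance.
rng = np.random.default_rng(3)
data = np.vstack([rng.normal(0.0, 0.2, (40, 2)),
                  rng.normal(5.0, 0.2, (40, 2))])
labels, n_clusters = dbscan(data, eps=0.5, min_pts=4)
```

No cluster count is supplied up front; the data density alone determines how many cluster-subspaces, and hence how much model memory, the system ends up with.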
Possible Extension: Multi-Pass Clustering for Dealing with Multimodal Data
The approach described above is well suited for data that can be modeled properly using a unimodal distribution. For measurement data that has multiple modes, the data in the same context subspace will very likely be assigned to multiple clusters, and efficient grouping of subspaces will not be possible. In this case, a second pass of clustering on the cluster assignment distribution of each subspace is proposed (the distribution is shown in FIGS. 5A and 5B). This way, context initial-subspaces, like the one labeled with the letter D, which are significantly assigned to multiple cluster centers and therefore are not grouped according to the goodness-of-fit criteria, can be grouped based on a goodness-of-fit on the second-pass clustering.
Envisioned Embodiments include:
Further applications and industries that would require anomaly detection and can benefit from context-aware variables may include, but are not limited to: power plants, power grids, manufacturing plants, monitoring electricity consumption, monitoring water consumption, security methods, online/cloud security methods, demand of different commercial goods (books, movies, furniture) and more. The form of the context-aware variables can be: time series, structured-text, semi structured-text and unstructured-text. The present invention further lowers the hardware requirements to run anomaly detection on an edge device, which usually has low memory capacity.
It is understood that various other modifications will be readily apparent to those skilled in the art without departing from the scope and spirit of the invention. Accordingly, it is not intended that the scope of the claims appended hereto be limited to the description set forth herein, but rather that the claims be construed as encompassing all the features of the patentable novelty that reside in the present invention, including all features that would be treated as equivalents thereof by those skilled in the art to which this invention pertains.