Enterprises utilize computer systems having a variety of components. For example, these conventional computer systems can include one or more servers and one or more storage devices interconnected by one or more communication devices, such as switches or routers. The servers can be configured to execute one or more virtual machines (VMs) during operation where each VM can be configured to execute or run one or more applications or workloads.
In certain cases, the computer systems can generate a large amount of data relating to various aspects of the infrastructure. For example, the computer systems can generate latency data related to the operation of associated VMs, storage devices, and communication devices. In turn, the computer system can provide the data in real time to a host device for storage and/or processing.
As provided above, during operation the host device can receive real-time data from the computer system and can retain and/or process the data. In order to identify particular patterns or trends of behavior of the computer system, the host device can be configured to utilize an unsupervised machine learning function, such as a clustering function, to define a data training set. Further, the host device can utilize the data training set to derive the patterns of behavior of an environment in order to detect anomalous behavior or predict the future behavior of the computer system. For example, the host device can be configured to obtain the data that characterizes the workload and to define it as a training set that is later classified, or clustered, to derive the learned behavioral patterns of attributes of the computer system. The host device can also be configured to compare the learned behavioral pattern of the data training set to data elements of the received data to detect anomalous data elements, which are indicative of anomalous behavior within the computer system.
In the process of developing the training set, as a result of the clustering and re-clustering of the data elements over time, the host device executing the unsupervised machine learning function can generate a relatively large amount of random variation in the clusters. This can be particularly true when the data elements received from the computer system, as used for the training set, exhibit a high degree of variability.
For example,
As is indicated, application of the unsupervised machine learning function results in clusters having a wide range of variation. Anomalousness, however, is a function of the variability in the data, which is, in turn, reflected in the random variability among the thresholds. Accordingly, the resulting anomaly analysis and detection can give rise to unquantified uncertainty with respect to anomalous behavior detection within the computer system.
By contrast to conventional anomaly detection mechanisms, embodiments of the present innovation relate to an apparatus and method of introducing probability and uncertainty via order statistics to unsupervised data classification via clustering. In one arrangement, a host device is configured to limit variability and provide a level of certainty to an unsupervised machine learning paradigm utilized on data received from a computer infrastructure. For example, the host device can be configured to first execute a clustering function on a set of data elements received from a computer infrastructure over multiple iterations, such as for a total of ten iterations. Because of the inherent variation in the data element set, the host device can generate ten distinct sets of clusters. The host device can be further configured to then divide the resulting clusters among time slices and to find the maximum and minimum value thresholds for each time slice. The host device can be further configured to then apply order statistics to the thresholds of each time slice and to assign a probability level to each time slice. Quantification of the threshold variability provides a probabilistic framework which underlies anomaly detection.
Embodiments of the innovation enable the host device to quantify the uncertainty in the data training set. Specifically, the host device can be configured to stabilize the clustering of a data training set and to provide the measurement of the uncertainty or variation associated with the data training set. As a result, the host device can introduce probability estimation for various additional components associated with the computer infrastructure, such as anomaly detection, root cause selection, and/or issue severity ratings.
One embodiment of the innovation relates to, in a host device, a method for stabilizing a data training set. The method can comprise generating, by the host device, a data training set based upon a set of data elements received from a computer infrastructure; applying, by the host device, multiple iterations of a clustering function to the data training set to generate a set of clusters; dividing, by the host device, the set of clusters resulting from the multiple iterations of the clustering function into multiple time intervals; for each time interval of the multiple time intervals, deriving, by the host device, a maximum threshold and a minimum threshold for each cluster of the set of clusters included in the time interval; and applying, by the host device, an order statistic function to the maximum thresholds and the minimum thresholds for each time interval.
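The claimed steps can be illustrated with a brief sketch. All function names here are hypothetical, and a plain one-dimensional K-means stands in for the clustering function; the specification does not mandate a particular algorithm:

```python
import random
import statistics

def kmeans_1d(values, k, seed, rounds=20):
    """Plain 1-D K-means (Lloyd's algorithm). The seed fixes the random
    initial centers, which are the source of run-to-run variation."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    assign = [0] * len(values)
    for _ in range(rounds):
        assign = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centers[c] = statistics.mean(members)
    return assign

def stabilize_training_set(samples, n_iter=10, n_intervals=4, k=2):
    """samples: list of (time_index, value) pairs.
    Runs the clustering n_iter times, splits the time axis into
    n_intervals slices, collects each iteration's maximum and minimum
    threshold per slice, and returns the thresholds sorted -- the
    order-statistic step."""
    span = max(t for t, _ in samples) + 1
    width = span / n_intervals
    values = [v for _, v in samples]
    max_thr = [[] for _ in range(n_intervals)]
    min_thr = [[] for _ in range(n_intervals)]
    for it in range(n_iter):
        assign = kmeans_1d(values, k, seed=it)
        # treat the most populated cluster as the "normal" behavior
        normal = max(range(k), key=assign.count)
        for i in range(n_intervals):
            in_slice = [v for (t, v), a in zip(samples, assign)
                        if a == normal and i * width <= t < (i + 1) * width]
            if in_slice:
                max_thr[i].append(max(in_slice))
                min_thr[i].append(min(in_slice))
    return [sorted(m) for m in max_thr], [sorted(m) for m in min_thr]
```

Because each iteration seeds the clustering differently, the per-interval thresholds vary across iterations; sorting them within each time interval is the order-statistic step to which probability levels can then be attached.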
The foregoing and other objects, features and advantages will be apparent from the following description of particular embodiments of the innovation, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of various embodiments of the innovation.
Embodiments of the present innovation relate to an apparatus and method of introducing probability and uncertainty via order statistics to unsupervised data classification via clustering. In one arrangement, a host device is configured to limit variability and provide a level of certainty to an unsupervised machine learning paradigm utilized on data received from a computer infrastructure. For example, the host device can be configured to first execute a clustering function on a set of data elements received from a computer infrastructure over multiple iterations, such as for a total of ten iterations. Because of the inherent variation in the data element set, the host device can generate ten distinct sets of clusters. The host device can be configured to then divide the resulting clusters among time slices and to find the maximum and minimum value thresholds for each time slice. The host device can be configured to then apply order statistics to the thresholds of each time slice and to assign a probability level to each time slice. Quantification of the threshold variability provides a probabilistic framework which underlies anomaly detection as well as other functions that can be derived from behavioral analysis, such as forecasting of future behavior.
Each server device 14 can include a controller or compute hardware 20, such as a memory and processor. For example, server device 14-1 includes controller 20-1 while server device 14-N includes controller 20-N. Each controller 20 can be configured to execute one or more virtual machines 22 with each virtual machine (VM) 22 being further configured to execute or run one or more applications or workloads 23. For example, controller 20-1 can execute a first virtual machine 22-1 which is configured to execute a first set of workloads 23-1 and a second virtual machine 22-2 which is configured to execute a second set of workloads 23-2. Each compute hardware element 20, storage device element 18, network communication device element 16, and application 23 relates to an attribute of the computer infrastructure 11.
In one arrangement, the host device 25 is configured as a computerized device having a controller 26, such as a memory and a processor. The host device 25 is disposed in electrical communication with the computer infrastructure 11 and with a display 51. The host device 25 is configured to receive, via a communications port (not shown), a set of data elements 24 from at least one computer environment resource 12 of the computer infrastructure 11 where each data element 28 of the set of data elements 24 relates to an attribute of the computer environment resources 12. For example, each data element 28 can relate to the compute level (compute attributes), the network level (network attributes), the storage level (storage attributes) and/or the application or workload level (application attributes) of the computer environment resources 12. Also, each data element 28 can include additional information relating to the computer infrastructure 11, such as events, statistics, and the configuration of the computer infrastructure 11. As a result, the host device 25 can receive data elements 28 that relate to the controller configuration and utilization of the server devices 14 (i.e., compute attribute), the virtual machine activity in each of the server devices 14 (i.e., application attribute) and the current state and historical data associated with the computer infrastructure 11.
Each data element 28 of the set of data elements 24 can be configured in a variety of ways. In one arrangement, each data element 28 can include object data that can identify a related attribute of the originating computer environment resource 12. For example, the object data can identify the data element 28 as being associated with a compute attribute, storage attribute, network attribute, or application attribute of a corresponding computer environment resource 12. In one arrangement, each data element 28 can include statistical data that can specify a behavior associated with the computer environment resource 12.
In one arrangement, the host device 25 can include a machine learning analytics framework or engine 27 configured to receive each data element 28 from the computer infrastructure 11, such as via a streaming API, and to automate analysis of the data elements 28 during operation. For example, as will be described below, when executing the machine learning analytics engine 27, the host device 25 is configured to transform, store, and analyze the data elements 28 over time. Based upon the receipt of the data elements 28, the host device 25 can provide continuous analysis of the computer infrastructure 11 in order to identify anomalies associated with attributes of the computer infrastructure 11 on a substantially continuous basis. Further, the host device 25 can perform other functions based upon the receipt of the data elements 28. These functions can include, but are not limited to, forecasting of future behaviors and operational issues associated with the computer infrastructure 11.
The controller 26 of the host device 25 can be configured to store an application of the machine learning analytics engine 27. For example, the machine learning analytics engine application installs on the controller 26 from a computer program product 32. In some arrangements, the computer program product 32 is available in a standard off-the-shelf form such as a shrink wrap package (e.g., CD-ROMs, diskettes, tapes, etc.). In other arrangements, the computer program product 32 is available in a different form, such as downloadable online media. When performed on the controller 26 of the host device 25, the machine learning analytics engine application causes the host device 25 to perform the classification, or clustering, stabilization on a data training set and to detect operational uncertainty. As a result of the classification and detection, the host device can provide an output 52 to a user via a graphical user interface 50 as provided by the display 51.
During operation, the host device 25 is configured to collect data elements 28, such as latency information (e.g., input/output (IO) latency, input/output operations per second (IOPS) latency, etc.) regarding the computer environment resources 12 of the computer infrastructure 11. For example, the host device 25 is configured to poll the computer environment resources 12, such as via private API calls, to obtain data elements 28 relating to latency within the computer infrastructure 11.
In one arrangement, as the host device 25 receives the data elements 28, the host device 25 is configured to direct the data elements 28 to a uniformity or normalization function 34 to normalize the data elements 28. For example, any number of the computer environment resources 12 can provide the data elements 28 to the host device 25 in a proprietary format. In such a case, the normalization function 34 of the host device 25 is configured to normalize the data elements 28 to a standard, non-proprietary format.
In another case, as the host device 25 receives the data elements 28 over time, the data elements 28 can be presented with a variety of time scales. For example, for data elements 28 received from multiple network devices 16 of the computer infrastructure 11, the latency of the devices 16 can be presented in seconds (s) or milliseconds (ms). In this example, the normalization function 34 of the host device 25 is configured to format the data elements 28 to a common time scale. As will be described below, normalization of the data elements 28 for application of a clustering function provides equal scale for all data elements 28 and a balanced impact on a distance metric utilized by the clustering function (e.g., a Euclidean distance metric). Moreover, in practice, normalization of the data elements 28 tends to produce clusters that appear to be roughly spherical, a generally desirable trait for cluster-based analysis.
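As a sketch of this normalization step, latencies can be converted to a common unit and then standardized so that each attribute carries equal weight in a Euclidean distance computation. The element layout (a dict with 'latency' and 'unit' fields) is an assumption for illustration, not the specification's format:

```python
import statistics

def normalize_latencies(data_elements):
    """Convert mixed-unit latency readings to a common time scale, then
    scale them to zero mean and unit variance so every attribute has a
    balanced impact on a Euclidean distance metric."""
    seconds = [d['latency'] / 1000.0 if d['unit'] == 'ms' else d['latency']
               for d in data_elements]
    mu = statistics.mean(seconds)
    sigma = statistics.pstdev(seconds)
    return [(s - mu) / sigma if sigma else 0.0 for s in seconds]
```

After this step all latency values are on the same dimensionless scale, so no single device's unit choice dominates the clustering distance.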
Next, the host device 25 is configured to develop a data training set 36 for use in anomalous behavior detection. In one arrangement, the host device 25 is configured to store normalized data elements 30 as part of the data training set 36 which can then be used by the host device 25 to detect the anomalous behavior within the computer infrastructure 11. For example, the host device 25 can include, as part of data training set 36, normalized latency data elements 30 having per object (i.e., datastore) sampling, such as a 5-minute average interval, normalized to each day of the week as an index (e.g., Sunday 0:00 is 0, Monday 0:00 is 300 . . . 0-2100 for a week, Monday-Sunday, for the 5 minute averaged data). As such, the data training set 36 can include data collected over a timeframe of a day, week, or month. Further, the host device 25 can be configured to update the data training set 36 at regular intervals, such as during daily intervals. For example, the data training set 36 can further contain 10,000 samples per object (~1 month's worth of performance data) which can be refreshed on a daily basis.
In one arrangement, after collecting a given volume of normalized data elements 30 as part of the data training set 36 (e.g., normalized data elements 30 collected over a period of seven days), the host device 25 is configured to stabilize various characteristics of the data training set 36 for use in anomaly detection. For example, an anomaly is an event that is considered out of the ordinary (e.g., an outlier) based on the continuing analysis of data with reference to the historical or data training set 36 and based on the application of the principles of machine learning.
In one arrangement, in stabilizing the characteristics of the data training set 36, the host device 25 is configured to apply multiple iterations of a classification function 38 to the data training set 36. For example, the host device 25 includes a classification function 38 which, when applied to the normalized latency data elements 30 (i.e., an attribute of the resources of the computer infrastructure 11) of the data training set 36, is configured to define at least one group of the data elements 30 (i.e., data element groups).
While the classification function 38 can be configured in a variety of ways, in one arrangement, the classification function 38 is configured as an unsupervised machine learning function, such as a clustering function 40, that defines the data element groups as clusters. Clustering is the task of grouping a set of objects in such a way that objects in the same group, called a cluster, are more similar to each other than to the objects in other groups or clusters. Clustering is a common technique of machine learning data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, and bioinformatics. The grouping of objects into clusters can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find them efficiently. Known clustering algorithms include hierarchical clustering, centroid-based clustering (e.g., K-means clustering), distribution-based clustering, and density-based clustering.
In one arrangement, during each application of the clustering function 40 to the data training set 36, the host device 25 separates the information of the data training set 36 into sets of clusters. For example,
By applying the clustering function 40 to the data training set 36, the host device 25 can derive learned behaviors of the various attributes of the computer infrastructure 11. However, variability of the data training set 36 can result in variability in the clusters generated following application of the clustering function 40. For example, application of the clustering function 40 to the data training set 36 in a first iteration can result in the generation of a first set of clusters which identify computer infrastructure attributes having some common similarity. However, application of the clustering function 40 to the data training set 36 in subsequent iterations typically generates slightly, or even substantially, different clustering results. That is, application of the clustering function 40 to the data training set 36 in a second iteration can result in the generation of a second set of clusters that are different from the first set of clusters and the application of the clustering function 40 to the data training set 36 in a third iteration can result in the generation of a third set of clusters that are different from the first set of clusters and from the second set of clusters. This can lead to instability of the model of the learned behavior of the computer infrastructure attributes.
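The run-to-run instability described above is easy to reproduce with a minimal one-dimensional K-means, used here as an illustrative stand-in for the clustering function 40: different random initializations over the same data can converge to different cluster solutions.

```python
import random
import statistics

def kmeans_centers(values, k, seed, rounds=15):
    """1-D Lloyd's K-means returning the sorted final centers; the seed
    picks the initial centers, so separate runs over identical data can
    converge to different solutions."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(rounds):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda c: abs(v - centers[c]))].append(v)
        centers = [statistics.mean(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sorted(centers)

# three natural groupings, forced into k=2 clusters
data = [1.0, 1.1, 1.2, 5.0, 5.1, 9.0, 9.2, 9.3]
solutions = {tuple(round(c, 2) for c in kmeans_centers(data, k=2, seed=s))
             for s in range(20)}
# running with many seeds typically yields more than one distinct solution
```

Each distinct tuple in `solutions` is a different learned model of the same data, which is precisely the instability the stabilization procedure is designed to quantify.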
In order to develop a set of stabilized characteristics from the data training set 36, the host device 25 is configured to apply the clustering function 40 to the data training set 36 over multiple iterations and to derive the learned behavior of the computer infrastructure based upon the results of the iterative application of the clustering function 40.
In one arrangement, with reference to
Next, the host device 25 is configured to derive the learned behavior from the sets of clusters generated from the data training set 36. In one arrangement, with reference to
During operation, with continued reference to
Next, the host device 25 is configured to detect the maximum and minimum threshold for each cluster of each clustering function iteration associated with each time interval 110. For example, with reference to
Next, with reference to
Taking the second time interval 110-2 of
In one arrangement, the host device 25 can estimate or identify the relative variability among the ordered thresholds 120 and can identify probability distributions for the order statistics during the process of anomaly detection.
For example,
When identifying or calculating the probability distributions, the host device 25 can be configured to leverage quantiles, such as a collection of non-parametric statistics that allow the host device to estimate the relative variability among sample thresholds 120. For example, as shown in
As indicated in
Based on (3) and (4) above, the host device 25 can be configured to utilize the quantiles to estimate the probability that a data point was truly anomalous and/or to qualify the severity of the anomaly for the purposes of creating new issues or updating existing issues, as well as to aggregate anomaly severities for characterization of issue severity.
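One concrete way to attach probability levels to the ordered thresholds, shown here as an illustrative convention rather than the specification's mandated formula, is the plug-in estimate from order statistics: the i-th smallest of n sampled thresholds is taken as an estimate of the i/(n+1) quantile of the threshold distribution.

```python
def threshold_quantiles(thresholds):
    """Pair each ordered threshold with its plug-in probability level:
    the i-th smallest of n values estimates the i/(n+1) quantile."""
    n = len(thresholds)
    return [(t, i / (n + 1)) for i, t in enumerate(sorted(thresholds), 1)]
```

With ten thresholds per time slice, this assigns levels of roughly 0.09 through 0.91, giving each ordered threshold a non-parametric probability interpretation without assuming any particular threshold distribution.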
By associating a probability value to each of the ordered thresholds, the host device 25 is configured to measure uncertainty with respect to data points located within each time interval 110. It is noted that probability and uncertainty are not necessarily synonymous—uncertainty is a property of a given probability estimate relating to precision, and is dependent upon the amount of data used to compute the probability estimate. However, probability can be interpreted in the following way: “What is the probability that a threshold generated at random by the K-means clustering algorithm 40 will identify a data point as an anomaly?” In other words, “How certain is the host device 25 that this point is anomalous?”
In one arrangement, as part of an anomaly detection process, the host device 25 is configured to identify the ordered thresholds 120 and determine, for a particular data point investigated as being anomalous, the number of thresholds that the investigated data point has crossed or exceeded. Once the host device 25 has identified a given threshold, the host device 25 can be configured to divide the rank of the highest maximum ordered threshold reached by the total number of thresholds in order to derive the probability that the investigated data point is truly anomalous. Further, the host device 25 can be configured to utilize that derived probability to report the probability of each data point as an anomaly, as well as to control reporting by accepting only anomalies having a sufficiently high probability (such as 0.9).
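The threshold-counting computation described above can be sketched as follows (function names are illustrative):

```python
def anomaly_probability(value, max_thresholds):
    """Fraction of the iteration-derived maximum thresholds that the
    value exceeds; read as the probability that a threshold generated
    at random by the clustering would flag the value as anomalous."""
    crossed = sum(value > t for t in max_thresholds)
    return crossed / len(max_thresholds)

def is_reportable_anomaly(value, max_thresholds, level=0.9):
    """Report the value as an anomaly only when the derived probability
    meets the configured confidence level (e.g., 0.9)."""
    return anomaly_probability(value, max_thresholds) >= level
```

For example, with ten ordered maximum thresholds, a data point exceeding nine of them is assigned probability 0.9 and would be reported at the 0.9 confidence level, while a point exceeding only four of them (probability 0.4) would be suppressed.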
For example, assume the case where the host device 25 is configured with 90% probability, such that the host device 25 is 90% confident of its outcome. Further assume the case where the host device 25 has identified a data element disposed within a probability distribution of the ordered thresholds. As shown in
In one arrangement, with reference to
With such a configuration, the host device 25 is configured to stabilize the data training set 36 to substantially reflect real data received from the computer infrastructure 11. This configuration of the host device 25 enables the quantification of the uncertainty/variation in the data training set 36. Specifically, the host device 25 is configured to stabilize the clustering of a data training set 36 and to allow the measurement of the uncertainty associated with the data training set. As a result, the host device 25 can support probability estimation for various additional components associated with the computer infrastructure 11, such as anomaly detection, root cause selection, and/or issue severity ratings.
As provided above, the host device 25 is configured to develop a data training set 36 for use in anomalous behavior detection. Such description is by way of example only. In one arrangement, the host device 25 is configured to develop the data training set 36 for performance of other functions including, but not limited to, forecasting of future behaviors and problems in the computer infrastructure 11.
While various embodiments of the innovation have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the innovation as defined by the appended claims.
This patent application claims the benefit of U.S. Provisional Application No. 62/561,404, filed on Sep. 21, 2017, entitled, “Apparatus and Method of Introducing Probability and Uncertainty Via Order Statistics to Unsupervised Data Classification Via Clustering,” the contents and teachings of which are hereby incorporated by reference in their entirety.
Number | Date | Country
---|---|---
62561404 | Sep 2017 | US