A large amount of data can be produced or received in an environment, such as a network environment that includes many machines (e.g. computers, storage devices, communication nodes, etc.), or other types of environments. As examples, data can be acquired by sensors or collected by applications. Other types of data can include financial data, health-related data, sales data, human resources data, and so forth.
Some implementations of the present disclosure are described with respect to the following figures.
Activity occurring within an environment can give rise to events. An environment can include a collection of machines and/or program code, where the machines can include computers, storage devices, communication nodes, and so forth. Events that can occur within a network environment can include receipt of data packets that contain corresponding addresses and/or ports, monitored measurements of specific operations (such as metrics relating to usage of processing resources, storage resources, communication resources, and so forth), or other events. Although reference is made to activity of a network environment in some examples, it is noted that techniques or mechanisms according to the present disclosure can be applied to other types of events in other environments, where such events can relate to financial events, health-related events, human resources events, sales events, and so forth.
Generally, an event can be generated in response to occurrence of a respective activity. An event can be represented as a data point (also referred to as a data record).
Each data point can include multiple dimensions (also referred to as attributes), where an attribute can refer to a feature or characteristic of an event represented by the data point. More specifically, each data point can include a respective collection of values for the multiple attributes. In the context of a network environment, examples of attributes of an event include a network address attribute (e.g. a source network address and/or a destination network address), a network subnet attribute (e.g. an identifier of a subnet), a port attribute (e.g. source port number and/or destination port number), and so forth. Data points that include a relatively large number of attributes (dimensions) can be considered to be part of a high-dimensional data set.
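As an illustration only, a data point for a network event might be modeled as a record with one field per attribute. The class name, field names, and sample values below are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class NetworkEvent:
    """One data point: a collection of values, one per attribute (dimension).

    The field names are hypothetical examples of network attributes.
    """
    source_address: str
    destination_address: str
    subnet: str
    source_port: int
    destination_port: int


# A single event represented as a data point (sample values are made up).
event = NetworkEvent("10.0.0.5", "10.0.1.9", "10.0.1.0/24", 49152, 443)

# The values of the data point, in dimension order.
values = astuple(event)
num_dimensions = len(values)
```

A real high-dimensional data set would carry many more attributes than the five shown in this sketch.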
Finding patterns (such as patterns relating to failure or fault, unauthorized access, or other issues) in data points representing respective events can be difficult when there is a very large number of data points. For example, some patterns can indicate an attack on a network environment by hackers, or can indicate other security issues. Other patterns can indicate other issues that may have to be addressed.
For example, to identify security attack patterns in a high-dimensional data set collected for a network environment, analysts can use scatter plots for identifying patterns associated with security attacks. A scatter plot includes graphical elements representing data points, where positions of the data points in the scatter plot depend on values of a first attribute corresponding to an x axis of the scatter plot, and values of a second attribute corresponding to a y axis. In some examples, the first attribute can be time, while the second attribute can include a value of a port (e.g. destination port) that is being accessed.
If ports are scanned (accessed) sequentially by security attacks, the security attacks can be manifested as a visible diagonal pattern in the scatter plot. If the ports are accessed in randomized order, however, the port scans may not be visible in the scatter plot.
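The difference between the two cases can be sketched with synthetic data. In the sketch below, each event is a hypothetical (time, destination port) pair, and a simple monotonicity check stands in for the diagonal that an analyst would see in the scatter plot. The scrambled sequence is a deterministic stand-in for a randomized scan order.

```python
def is_sequential_scan(events):
    """Return True if destination ports strictly increase over time.

    events: iterable of (time, destination_port) pairs.
    A strictly increasing port sequence would appear as a diagonal
    when plotted with time on the x axis and port on the y axis.
    """
    ordered = sorted(events, key=lambda e: e[0])
    ports = [port for _, port in ordered]
    return all(a < b for a, b in zip(ports, ports[1:]))


# Sequential scan: port 1000 at time 0, port 1001 at time 1, and so on.
sequential = [(t, 1000 + t) for t in range(20)]

# Same 20 ports visited in a scrambled order (7 and 20 are coprime,
# so (7 * t) % 20 visits every offset exactly once).
scrambled = [(t, 1000 + (7 * t) % 20) for t in range(20)]

print(is_sequential_scan(sequential))
print(is_sequential_scan(scrambled))
```

Both sequences touch the same set of ports, yet only the first produces a pattern that a simple visual or monotonicity check can pick up.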
In accordance with some implementations according to the present disclosure, techniques or mechanisms are provided to allow users to identify patterns associated with issues of interest to the users, such as occurrence of security attacks in a network environment, or other issues in other environments. More specifically, techniques or mechanisms are provided to allow users to identify similar patterns within a visualization of data points. Identifying similar patterns can be performed by a user selecting a group of data points that may be indicative of an issue of interest to the user. Based on the selected group of data points, cohorts of data points can be identified, and the similarities of the cohorts of data points to the user-selected group of data points can be indicated. A cohort of data points can refer to a collection of data points that has been identified as having a respective similarity to the user-selected group of data points.
The identification of similar patterns can be based on the combination of weighted distance computations (to compute weighted distances between data points) and density-based grouping of data points. A weighted distance can be used to compare each data point to a user-selected group of data points at a dimensional level. A weighted distance can refer to a measure of how close events are to each other, where the measure is calculated using weights assigned to respective dimensions of the events. Density-based grouping (to determine a density distribution) can be used to place events (data points) in different cohorts based on a specified threshold (which can be user-specified). Density-based grouping can refer to a process of identifying multiple cohorts of data points, in which data points that are close to each other (that have small weighted distances) are collected together into cohorts; each cohort is a dense group of data points.
Details regarding the computations of weighted distances and density-based grouping are discussed further below.
As shown in the example of
A distance (or more specifically, a weighted distance) between data point A and the user-selected group 102 of data points is determined (as represented by 202). The process of determining distances between a respective data point and the user-selected group 102 of data points can be repeated for multiple further data points, such as those included in the plot 100.
Weighted distances are computed based on respective weights assigned to dimensions of a further data point and dimensions of the data points in the user-selected group 102. In other words, a specific weight is assigned to each dimension of the data points, where the weights assigned to different dimensions can be different. The weights are assigned based on user selection, for example. In the example of
The weighted distance between data points is based on performing binary comparisons between the data points, where the binary comparisons are based on respective weights assigned to the dimensions. Since the computation of the weighted distance between data points has to be able to handle categorical data (as well as numerical data), techniques or mechanisms according to some implementations of the present disclosure perform the binary comparisons rather than computations of Euclidean distances between data points. Categorical data is data that does not have numerical values but rather has values in different categories. An example of categorical data is location data, where a location can be identified by a city name (the category). Thus, the categorical values of the location dimension (which is a categorical dimension) can include Los Angeles, San Francisco, Palo Alto, and so forth.
The binary comparison of two data points is illustrated by Table 1 below.
In the example above, it is assumed that each of data points A and B has three dimensions (dimension 1, dimension 2, dimension 3). For data point A, the values of dimensions 1, 2, and 3 are W, X, and Z, respectively. For data point B, the values of dimensions 1, 2, and 3 are W, Y, and Z, respectively.
A string comparison per dimension is performed between data points A and B. For dimension 1, both data points A and B share the same value W; as a result, the similarity is high, and thus, the string comparison for dimension 1 outputs a binary value of 0. The same is also true for dimension 3, where data points A and B both share the same value Z. As a result, the distance between data points A and B along dimension 3 is also assigned the binary value 0. However, for dimension 2, data points A and B do not have the same value, and thus, the distance between data points A and B along dimension 2 is assigned the binary value 1. The foregoing comparisons of the data points along respective dimensions are referred to collectively as binary comparisons, since the outputs produced by the comparisons include a collection of binary values indicating similarity or dissimilarity along respective different dimensions. In other examples, high similarity can be represented with the binary value 1, while low similarity (or dissimilarity) can be represented with the binary value 0.
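The per-dimension comparison can be sketched in a few lines. This is a minimal illustration, assuming each data point is given as a tuple of values and that values are compared as strings; the function name `binary_compare` is hypothetical.

```python
def binary_compare(a, b):
    """Compare two data points dimension by dimension.

    Returns a list with 0 where the values match (high similarity)
    and 1 where they differ, following the convention used in the
    example with data points A and B.
    """
    if len(a) != len(b):
        raise ValueError("data points must have the same number of dimensions")
    # Values are compared as strings so that categorical and numerical
    # dimensions are handled the same way.
    return [0 if str(x) == str(y) else 1 for x, y in zip(a, b)]


point_a = ("W", "X", "Z")
point_b = ("W", "Y", "Z")
print(binary_compare(point_a, point_b))  # [0, 1, 0]
```

Swapping the 0 and 1 in the comprehension gives the alternative convention in which 1 denotes high similarity.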
More specifically, to compute the similarity value between two data points A and B, the computation iterates through all dimensions, starting at i=1 (the first dimension) and ending at the number of dimensions dim. The computation uses Iverson brackets [ ] to compare the i-th dimension of the data points A and B to each other. The result, either 0 or 1, is then multiplied by the weight w(i) at position i. To build the average (i.e. the weighted distance between data points A and B), the computation sums the foregoing weighted values and divides by the number of dimensions (dim), as specified in the following equation:

$$\mathrm{sim}(A, B) = \frac{1}{dim} \sum_{i=1}^{dim} w(i)\,[A_i \neq B_i]$$
The weighted distance between data points A and B is represented as sim(A, B) above.
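The computation of sim(A, B) described above can be sketched as follows. This is a minimal illustration under stated assumptions: data points are tuples of values, the comparison yields 1 for differing values and 0 for matching values, and the weights in the usage example are made up.

```python
def weighted_distance(a, b, weights):
    """Weighted distance sim(A, B) between two data points.

    For each dimension i, the binary comparison (1 if the values
    differ, 0 if they match) is multiplied by the weight w(i); the
    weighted values are summed and divided by the number of dimensions.
    """
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("a, b, and weights must have the same length")
    dim = len(weights)
    if dim == 0:
        raise ValueError("at least one dimension is required")
    total = sum(w * (1 if str(x) != str(y) else 0)
                for x, y, w in zip(a, b, weights))
    return total / dim


# Hypothetical weights: dimension 2 counts twice as much as the others.
weights = [1.0, 2.0, 1.0]
d = weighted_distance(("W", "X", "Z"), ("W", "Y", "Z"), weights)
print(d)
```

Only dimension 2 differs in this example, so the result is the weight of dimension 2 divided by the three dimensions. Raising the weight of a dimension makes a mismatch along it contribute more to the distance.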
Note that when determining the weighted distance between a further data point (e.g. a data point A, B, or C in
The multiple sim(A, Cj) values are averaged to produce an aggregate weighted distance between the further data point and the data points in the user-selected group. In other examples, instead of averaging the multiple sim(A, Cj) values, a different aggregation can be performed, such as a sum or other aggregate.
The aggregate weighted distance represents the similarity between the further data point and the user-selected group of data points. The aggregate weighted distance WD can be used as a similarity value for indicating similarity between a further data point and the user-selected group of data points. In other examples, a similarity value can be derived from the aggregate weighted distance.
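The aggregation over the user-selected group can be sketched as below. The helper names, the sample group, and the equal weights are assumptions made for illustration; the `aggregate` parameter shows how a sum could replace the average.

```python
from statistics import mean


def weighted_distance(a, b, weights):
    """Weighted distance between two data points (1 per mismatch)."""
    dim = len(weights)
    return sum(w * (1 if str(x) != str(y) else 0)
               for x, y, w in zip(a, b, weights)) / dim


def aggregate_weighted_distance(point, group, weights, aggregate=mean):
    """Aggregate the distances from one data point to every data
    point in the user-selected group (average by default)."""
    distances = [weighted_distance(point, member, weights) for member in group]
    return aggregate(distances)


# Hypothetical user-selected group and further data point.
selected_group = [("W", "X", "Z"), ("W", "Y", "Z")]
further_point = ("W", "Y", "Q")
weights = [1.0, 1.0, 1.0]

wd_avg = aggregate_weighted_distance(further_point, selected_group, weights)
wd_sum = aggregate_weighted_distance(further_point, selected_group, weights,
                                     aggregate=sum)
print(wd_avg, wd_sum)
```

The further point differs from the first group member in two dimensions and from the second in one, so the individual distances are 2/3 and 1/3 before aggregation.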
Based on the determined aggregate weighted distances of further data points to the user-selected group 102 of data points, multiple cohorts 302, 304, 306, and 308 of data points can be identified, as shown in
A threshold t (which can be user-specified or specified by another entity) can be provided for identifying the cohorts. The threshold t defines the maximum distance between further data points within a particular cohort. In other words, the aggregate weighted distance between any two data points within the particular cohort does not exceed t. Data points that have aggregate weighted distances greater than t are placed in separate cohorts, as shown in
The process computes (at 404) weighted distances (more specifically, the aggregate weighted distances discussed above) between further data points (e.g. data points A, B, C, etc. in
The further data points can be sorted according to their respective similarity values, to produce a sorted list of further data points.
Next, the process of
In some examples, the density-based grouping performed at 406 can involve iterating through the further data points of the sorted list. For any two further data points whose similarity value is less than the threshold t, the two further data points can be grouped into a corresponding cohort. However, if the similarity value between any two data points exceeds the threshold t, then a cut is defined, and the two data points are provided in different cohorts.
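One way to read the grouping step is sketched below. It assumes that the comparison is made between adjacent data points in the sorted list and that a cut is placed wherever their similarity values differ by more than t; this interpretation, the function name, and the sample values are assumptions, not details confirmed by the description.

```python
def group_into_cohorts(similarities, t):
    """Split data points into cohorts by cutting a sorted list.

    similarities: dict mapping a data point id to its similarity value
                  (aggregate weighted distance to the selected group).
    t: threshold; a gap larger than t between adjacent values in the
       sorted list starts a new cohort (a gap equal to t does not,
       which is an arbitrary choice for this sketch).
    """
    ordered = sorted(similarities.items(), key=lambda item: item[1])
    cohorts = []
    previous = None
    for point_id, value in ordered:
        if previous is None or value - previous > t:
            cohorts.append([])  # define a cut: start a new cohort
        cohorts[-1].append(point_id)
        previous = value
    return cohorts


# Made-up similarity values for five further data points.
similarities = {"A": 0.05, "B": 0.10, "C": 0.40, "D": 0.45, "E": 0.90}
cohorts = group_into_cohorts(similarities, t=0.2)
print(cohorts)  # [['A', 'B'], ['C', 'D'], ['E']]
```

Lowering t produces more, tighter cohorts; raising it merges neighboring cohorts.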
A graphical visualization including graphical elements (e.g. circles or dots) representing the user-selected group of data points and the cohorts of data points is generated (at 408). In the ensuing discussion, graphical elements are referred to as “pixels,” where each pixel represents a respective data point. In the graphical visualization, each cohort is represented using pixels assigned a common visual indicator (e.g. fill pattern or color). The different cohorts can be detected by a user based on the assigned common visual indicators; in other words, a first cohort can be detected based on a first common visual indicator assigned to a group of pixels, a second cohort can be detected based on a second common visual indicator assigned to a group of pixels, and so forth. In some implementations, the graphical visualization represents a temporal plot (such as that depicted in
In the example of
In
Once the cohorts are identified, a common visual indicator (same fill pattern or same color) is assigned to the pixel representing each data point of a given cohort. These common visual indicators are assigned to the pixels shown in
The identified cohorts and their respective assigned visual indicators can be mapped back to a graph that depicts a scatter plot of data points along a destination port dimension and a time dimension, as shown in
User selection of one of the control elements 806, 808, 810, 812, and 814 causes a graphical visualization to be generated that depicts just the data points in the respective cohort associated with the selected control element.
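The assignment of a common visual indicator per cohort, and the filtering that a control element triggers, can be sketched as follows. The color palette, function names, and cohort contents are hypothetical; an actual implementation would render pixels rather than return lists.

```python
# Hypothetical palette; each cohort gets one common visual indicator.
PALETTE = ["red", "orange", "green", "blue", "purple"]


def assign_indicators(cohorts, palette=PALETTE):
    """Map every data point id to the color shared by its cohort."""
    return {point_id: palette[index % len(palette)]
            for index, cohort in enumerate(cohorts)
            for point_id in cohort}


def points_for_selected_cohort(cohorts, selected_index):
    """Return just the data points of the cohort whose control
    element was selected."""
    return list(cohorts[selected_index])


cohorts = [["A", "B"], ["C", "D"], ["E"]]
indicators = assign_indicators(cohorts)
visible = points_for_selected_cohort(cohorts, selected_index=1)
print(indicators)
print(visible)
```

Because the palette index wraps around, more cohorts than colors would reuse indicators; a real visualization would need enough distinct indicators for the cohorts shown.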
Based on the results depicted in the temporal plot 602 of
The identified cohorts and respective assigned visual indicators can be mapped to a graph 1002, as shown in
Flexibility can be provided to a user in the form of the ability to iterate through different results by changing the weights assigned to dimensions of data points, and by selecting different cohorts of data points to which other data points are compared.
Visual analytic techniques are provided to allow users to find, show, and save patterns in data points. Finding can be accomplished by selecting a user-selected group of data points and initiating the computation of weighted distances and performance of density-based grouping. Once a pattern is detected, the results can be shown in the various visualizations discussed above, and also saved.
In some implementations, a user can merge, delete, or display patterns. For example, control elements (such as those shown in
The processor(s) 1102 can be coupled to a non-transitory machine-readable or computer-readable storage medium (or storage media) 1104. The storage medium (storage media) 1104 can store various machine-readable instructions, including weighted distance computation instructions 1106 (to compute weighted distances as discussed above), density-based grouping instructions 1108 (to perform density-based grouping as discussed above), and visualization instructions 1110 (to generate various visualizations). The weighted distance computation instructions 1106 compute weighted distances such as according to task 404 in
The storage medium (or storage media) 1104 can include one or multiple different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; optical media such as compact disks (CDs) or digital video disks (DVDs); or other types of storage devices. Note that the instructions discussed above can be provided on one computer-readable or machine-readable storage medium, or alternatively, can be provided on multiple computer-readable or machine-readable storage media distributed in a large system having possibly plural nodes. Such computer-readable or machine-readable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components. The storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.
In the foregoing description, numerous details are set forth to provide an understanding of the subject disclosed herein. However, implementations may be practiced without some of these details. Other implementations may include modifications and variations from the details discussed above. It is intended that the appended claims cover such modifications and variations.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2015/021015 | 3/17/2015 | WO | 00 |