In recent years, advances in data collection have led to very large data scales. For example, an event log, which is a typical high-dimensional dataset, can have more than a hundred dimensions. An event log system extracts event log data from networking and computing devices, stores the log data as a large relational table on a server or a cluster of servers, and provides a query facility for log analysis. However, massive data leads to expensive query processing times, which limits many types of applications and makes effective data analysis difficult to achieve. Applications such as data mining, decision support, and analysis may require a short query response time, while many other applications may find approximate answers adequate for providing insights about the data, at least in a preliminary phase of analysis.
The accompanying drawings illustrate various examples of various aspects of the present disclosure. It will be appreciated that the illustrated element boundaries (e.g., boxes, groups of boxes, or other shapes) in the figures represent one example of the boundaries. It will be appreciated that in some examples one element may be designed as multiple elements or that multiple elements may be designed as one element. In some examples, an element shown as an internal component of another element may be implemented as an external component and vice versa.
Systems and methods for generating a hint for a query on a dataset are disclosed. As used herein, a “dataset” can include rows and columns, wherein each row is also known as a record and each column contains the various values of an attribute. The dataset can be static or dynamic, such as an event log dataset into which event log data streams continuously. As used herein, a query can consist of attributes, their values, and operators such as AND, OR, NOT and so on. Without loss of generality, a query can be in the format of “Ai1=vi1 Ops Ai2=vi2 Ops . . . ”, wherein Aij is an attribute of a dataset, vij is the value of Aij and Ops represents an operator. If a user has input a new query of “Aq1=Vq1 and Aq2=Vq2 and . . . and Aqk=Vqk” and has also specified another attribute Aqk+1, an example query hint presents to the user an estimated data distribution on attribute Aqk+1 that satisfies the k previously given conditions on the attributes Aqi.
The following are two examples of query hints:
In the first example, a query hint presents to the user the distinct values and their distribution (e.g., Sun 20%; IBM 10%; HP 5% . . . ) for the specified attribute “DeviceVendor” while meeting the condition “OS=‘Linux’”. The user can complete “DeviceVendor=” by choosing a value from the hint list. The above process continues if the user inputs a third attribute, “DeviceEventClassId”, as shown in the second example.
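To make such a hint concrete, the following is a minimal sketch, not the disclosed implementation (which estimates the distribution from a model rather than scanning the data): it derives a hint for a target attribute under given conditions from rows represented as Python dictionaries, with all names being illustrative.

```python
from collections import Counter

def hint_distribution(rows, conditions, target):
    """Relative frequency of each distinct value of `target` among the
    rows that satisfy every attribute=value condition."""
    matching = [r for r in rows
                if all(r.get(a) == v for a, v in conditions.items())]
    counts = Counter(r[target] for r in matching if target in r)
    total = sum(counts.values())
    return {val: n / total for val, n in counts.most_common()} if total else {}

# Hint for "DeviceVendor" under the condition OS='Linux':
rows = [{"OS": "Linux", "DeviceVendor": "Sun"},
        {"OS": "Linux", "DeviceVendor": "Sun"},
        {"OS": "Linux", "DeviceVendor": "IBM"},
        {"OS": "Windows", "DeviceVendor": "HP"}]
print(hint_distribution(rows, {"OS": "Linux"}, "DeviceVendor"))
# {'Sun': 0.666..., 'IBM': 0.333...}
```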
An example of the present disclosure can leverage statistics techniques and graphical models to infer a probabilistic distribution on any specified attribute in a dataset and present it to the user as a hint. Examples of the present disclosure may generate query hints to help users identify interesting content in a large dataset, such as event log data, and thus focus their explorations quickly and effectively without consuming significant amounts of valuable system resources. Further, query hints generated by examples of the present disclosure may help users compose their queries, allowing them to gain a preliminary understanding of the dataset and then determine whether they would like to spend more time and resources to execute the query to completion.
In the following, certain examples of the present disclosure are described in detail with reference to the drawings.
Referring now to
The system 100 may include a server 102, and one or more client computers 104, in communication over a network 106. As illustrated in
The network 106 may be a local area network (LAN), a wide area network (WAN), or another network configuration. The network 106 may include routers, switches, modems, or any other kind of interface device used for interconnection. Several client computers 104 may connect to the server 102 through the network 106. The client computers 104 may be similarly structured as the server 102.
The server 102 may have other units operatively coupled to the processor 108 through the bus 110. These units may include tangible, machine-readable storage media, such as storage 122. The storage 122 may include any combinations of hard drives, read-only memory (ROM), random access memory (RAM), RAM drives, flash drives, optical drives, cache memory, and the like. Storage 122 may include a model constructing unit 124, a statistics computing unit 126 and a hint generating unit 128.
The model constructing unit 124 may be used to construct a model which represents relevance between attributes in a dataset on which a query is to be run, based on historical data associated with the dataset. The historical data can include query historical data (i.e., queries submitted by users previously) and dataset historical data (i.e., data that is collected before a new query is submitted). The statistics computing unit 126 may be used to compute statistics about attributes according to the model constructed by the model constructing unit 124. The hint generating unit 128 may be used to generate, in response to a query input by a user, a hint for said query based on the model constructed by the model constructing unit 124 and the statistics computed by the statistics computing unit 126.
With reference to
According to an example of the present disclosure, the model constructed at block 201 can be a junction tree. A node of the junction tree represents a set of attributes that are determined to be relevant and an edge of the junction tree represents a common attribute between two nodes connected by said edge.
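As an illustrative sketch of this structure (the representation is an assumption, not taken from the disclosure), nodes can be modeled as sets of relevant attributes and each edge labeled with the attributes its two endpoints share:

```python
# Nodes are sets of relevant attributes; each edge is labeled with the
# attribute(s) shared by the two nodes it connects.
nodes = [frozenset({"A", "B"}), frozenset({"B", "C"}), frozenset({"C", "D"})]
edges = [(nodes[0], nodes[1], nodes[0] & nodes[1]),   # shared: {'B'}
         (nodes[1], nodes[2], nodes[1] & nodes[2])]   # shared: {'C'}
```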
With reference to
Several strategies can be considered for selecting a sub-graph from an undirected connected graph: (1) maximum spanning tree, i.e., a spanning tree of the weighted graph having the maximum total weight; (2) minimum diameter of the final sub-graph, i.e., a spanning tree having the minimum diameter; and (3) minimum estimation error computed using the previous actual query workload, i.e., a spanning tree achieving the minimum estimation error for previous queries (a sketch of the first strategy is given below). This selection can be an offline computation. Then, at block 3003, the sub-graph is converted into a junction tree. During the conversion, the nodes in the sub-graph are first clustered into groups. This involves choosing a node ordering of the sub-graph and using node elimination to identify a set of clusters, wherein each cluster includes attributes that are deemed to be relevant in the dataset according to, for example, their mutual information. Then, a junction tree as shown in
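For the first strategy, a maximum spanning tree can be computed with a generic Kruskal-style pass over the edges in descending weight order. The sketch below assumes pairwise relevance scores (e.g., mutual information) as edge weights and is not the disclosed code:

```python
def maximum_spanning_tree(num_nodes, edges):
    """Kruskal's algorithm run on edges sorted heaviest-first, keeping
    each edge that does not close a cycle (tracked with union-find)."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:               # no cycle is formed: keep this edge
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# Attributes 0..3 with pairwise relevance (e.g., mutual information) weights:
edges = [(0.9, 0, 1), (0.7, 1, 2), (0.4, 0, 2), (0.8, 2, 3)]
print(maximum_spanning_tree(4, edges))
# [(0.9, 0, 1), (0.8, 2, 3), (0.7, 1, 2)]
```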
With reference to
According to an example of the present disclosure, as new log data flows into the dataset continuously, computing the statistics of the attributes can be performed in a real-time manner.
In an example, computing a histogram for an attribute can include using at least one of a top-k distinct value histogram and a top-k prefix histogram, chosen based on the data type in the dataset, as shown in block 5021. A top-k distinct value histogram refers to a histogram that keeps the k most frequent distinct values and their corresponding frequencies, and a top-k distinct value refers to a value that is one of the k most frequent distinct values. A top-k prefix histogram can be used if there are a large number of distinct values with similar frequencies, so that it is very difficult to keep all the values given the efficiency requirement and the space budget. In a top-k prefix histogram, all the string values of an attribute can be represented in a tree, where a value is represented as a labeled path from the root of the tree to a leaf node, and every non-leaf node represents a prefix of all the strings in the subtrees, or children, of that non-leaf node. In a top-k prefix histogram, a weight can be assigned to every node (both leaf nodes and non-leaf nodes), and the top-k prefixes are then identified in the tree, as described in detail below.
For an attribute with a few distinct values, a top-k distinct value histogram can be computed. In this histogram construction, an empty bucket with k entries can be built in advance, and the distinct values and their frequencies are then inserted into the bucket as the data arrives. After the bucket is full, if a new incoming value equals one entry of the bucket, the corresponding frequency is incremented by 1. If the new value does not equal any entry, then, to save storage space, a hashing operation can be performed on the new value to obtain a hash item. If the hash item is new, it is put into a hash table as a new item. Otherwise, if the hash item already exists in the hash table, the count of the existing item is incremented by 1. The frequency of the existing item is then compared with the minimum frequency in the bucket. If the frequency of the existing item is greater, the minimum-frequency distinct value in the bucket and this existing item in the hash table are swapped (see the sketch following this paragraph).
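A minimal sketch of this bucket-plus-hash-table scheme follows. For simplicity, the hash table is a Python dict keyed by the value itself; the eviction details are likewise illustrative:

```python
class TopKHistogram:
    """Top-k distinct value histogram: a k-entry bucket of the most
    frequent values, plus an overflow hash table for the rest."""

    def __init__(self, k):
        self.k = k
        self.bucket = {}    # value -> frequency, at most k entries
        self.overflow = {}  # hash table for values outside the bucket

    def add(self, value):
        if value in self.bucket:
            self.bucket[value] += 1
        elif len(self.bucket) < self.k:
            self.bucket[value] = 1  # bucket not yet full
        else:
            self.overflow[value] = self.overflow.get(value, 0) + 1
            # Swap the value into the bucket if it is now more frequent
            # than the least frequent bucket entry.
            min_val = min(self.bucket, key=self.bucket.get)
            if self.overflow[value] > self.bucket[min_val]:
                self.bucket[value] = self.overflow.pop(value)
                self.overflow[min_val] = self.bucket.pop(min_val)

h = TopKHistogram(k=2)
for v in ["a", "a", "b", "c", "c", "c"]:
    h.add(v)
print(h.bucket)  # {'a': 2, 'c': 3}
```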
If there are too many distinct values for an attribute, or if there is not enough space to store the frequency table, a top-k prefix method can be used to construct a 1D histogram for the attribute. As mentioned above, in the top-k prefix histogram a weight is assigned to every node. For example, the weight of a node b can be defined by the following equation:
Wherein Ii is a child string covered by node b; freq(Ii) is the frequency with which Ii appears in the dataset; pb and pi are the lengths of node b and node Ii, respectively; and n is the number of children of node b. The nodes with the greatest weights can then be selected in the tree as the top-k prefixes.
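Since the weight equation is not reproduced above, the following sketch builds the prefix tree and selects the k heaviest nodes under a caller-supplied weight function. The placeholder weight merely combines the quantities named above (child frequencies and prefix lengths) in an assumed way; it is not the disclosed formula:

```python
class PrefixNode:
    def __init__(self, prefix):
        self.prefix = prefix   # string on the path from the root here
        self.freq = 0          # frequency of values ending exactly here
        self.children = {}     # next character -> PrefixNode

def build_prefix_tree(value_freqs):
    """Insert each (string value, frequency) pair as a root-to-leaf path."""
    root = PrefixNode("")
    for value, freq in value_freqs.items():
        node = root
        for ch in value:
            node = node.children.setdefault(ch, PrefixNode(node.prefix + ch))
        node.freq += freq
    return root

def covered_freq(node):
    """Total frequency of all values covered by (at or below) this node."""
    total, stack = 0, [node]
    while stack:
        n = stack.pop()
        total += n.freq
        stack.extend(n.children.values())
    return total

def weight(node):
    # Placeholder weight (NOT the disclosed equation): covered frequency
    # scaled by prefix length, favoring long, frequently shared prefixes.
    return covered_freq(node) * len(node.prefix)

def top_k_prefixes(root, k, weight_fn):
    """Enumerate every node in the tree and keep the k heaviest."""
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(node.children.values())
    return sorted(nodes, key=weight_fn, reverse=True)[:k]

root = build_prefix_tree({"error": 5, "errno": 3, "event": 2})
print([n.prefix for n in top_k_prefixes(root, 3, weight)])
# ['error', 'err', 'erro'] under the placeholder weight
```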
In an example, since the number of 2D, 3D, and higher-dimensional histograms has been greatly reduced in the junction tree building phase, computing a joint histogram for each set of attributes in a node of the junction tree can include enumerating the distinct sets of attribute values and computing their respective occurrence frequencies, as shown in block 5022.
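Such a joint histogram can be sketched as a simple frequency count over tuples of attribute values (illustrative, not the disclosed code):

```python
from collections import Counter

def joint_histogram(rows, attrs):
    """Count occurrences of each distinct combination of values
    of the given attribute set."""
    return Counter(tuple(r[a] for a in attrs) for r in rows)

rows = [{"A": 1, "B": "x"}, {"A": 1, "B": "x"}, {"A": 2, "B": "y"}]
print(joint_histogram(rows, ["A", "B"]))
# Counter({(1, 'x'): 2, (2, 'y'): 1})
```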
With reference to
At block 602, a minimum sub-tree that covers all the obtained attributes may be selected from the junction tree constructed, for example, at block 201. For example,
At block 603, a hint is generated for the target attribute based on the sub-tree selected at block 602 and the statistics computed for example at block 202.
With reference to
The following is an example algorithm for calculating the distribution of the target attribute.
The principle for computing the distribution in the above algorithm is that, for a junction tree (A,B)-B-(B,C) with (A,B) and (B,C) being nodes and B being the edge connecting them, the joint probability distribution satisfies P(A,B,C)=P(A,B)*P(B,C)/P(B), from which P(A,C) is obtained by summing out B.
In this algorithm, the function computDist has three arguments: root is the junction tree node where the propagation starts; atts is the set of attribute-value pairs that have been assigned values, together with the attribute being searched for; and keep is the attribute set whose joint distribution is desired. In addition, dist represents a distribution; mstnodes are the attributes contained in a node; project is a function that performs a projection operation based on atts and keep; and product is a function that computes H*h divided by the distribution of S. The algorithm is recursive. If the root contains all the attributes, the projection result is returned based on the keep attributes and the filter conditions in atts. Otherwise, for each child of the root node, the atts and keep arguments are first calculated, and the joint distribution of the attributes in keep is then determined. Finally, the results of the nodes are multiplied together according to the properties of the junction tree.
For example, suppose that the attributes in a dataset are A, B, C and D, the query input by a user is “A=a” and D, and the sub-tree selected from the junction tree is (C,D)-C-(B,C)-B-(A,B). Then, at the first level of the recursion, root=(C,D), atts=[A D], and keep=D. At the second level, root=(B,C), atts=[A C], and keep=[C D]. At the third and final level, root=(A,B), atts=[A B], and keep=[B C D].
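Because the algorithm listing itself is not reproduced here, the following is a minimal sketch of the recursion as described, applied to this example. Distributions are Python dictionaries mapping frozensets of (attribute, value) pairs to probabilities; the signature of comput_dist and the helper names follow the description loosely but are illustrative, not the disclosed code:

```python
from collections import defaultdict

# A "distribution" maps a frozenset of (attribute, value) pairs to a probability.

def project(factor, keep):
    """Projection: sum the distribution down to the attributes in `keep`."""
    out = defaultdict(float)
    for asg, p in factor.items():
        out[frozenset((a, v) for a, v in asg if a in keep)] += p
    return dict(out)

def select(factor, cond):
    """Filter: keep only entries whose values match the fixed conditions."""
    return {asg: p for asg, p in factor.items()
            if all(dict(asg).get(a) == v for a, v in cond.items())}

def product(f1, f2, sep, sep_marg):
    """Junction tree property: P(X U Y) = P(X) * P(Y) / P(S)."""
    out = defaultdict(float)
    for a1, p1 in f1.items():
        d1 = dict(a1)
        for a2, p2 in f2.items():
            d2 = dict(a2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                s = frozenset((a, v) for a, v in d2.items() if a in sep)
                out[a1 | a2] += p1 * p2 / sep_marg[s]
    return dict(out)

def comput_dist(root, atts, keep, children, tables):
    """Recursive propagation from `root` over the selected sub-tree."""
    factor = select(tables[root], {a: v for a, v in atts.items() if a in root})
    for child, sep in children.get(root, []):
        child_keep = (keep & child) | sep          # what the parent will need
        child_factor = comput_dist(child, atts, child_keep, children, tables)
        sep_marg = project(tables[child], sep)     # unconditioned P(separator)
        factor = product(factor, child_factor, sep, sep_marg)
    return project(factor, keep)

# The example above: sub-tree (C,D)-C-(B,C)-B-(A,B), query "A=a", target D.
CD, BC, AB = frozenset("CD"), frozenset("BC"), frozenset("AB")
children = {CD: [(BC, {"C"})], BC: [(AB, {"B"})]}
tables = {  # illustrative joint histograms with consistent marginals
    AB: {frozenset({("A", "a"), ("B", 0)}): 0.3,
         frozenset({("A", "a"), ("B", 1)}): 0.2,
         frozenset({("A", "b"), ("B", 0)}): 0.1,
         frozenset({("A", "b"), ("B", 1)}): 0.4},
    BC: {frozenset({("B", 0), ("C", 0)}): 0.25,
         frozenset({("B", 0), ("C", 1)}): 0.15,
         frozenset({("B", 1), ("C", 0)}): 0.30,
         frozenset({("B", 1), ("C", 1)}): 0.30},
    CD: {frozenset({("C", 0), ("D", 0)}): 0.50,
         frozenset({("C", 0), ("D", 1)}): 0.05,
         frozenset({("C", 1), ("D", 0)}): 0.15,
         frozenset({("C", 1), ("D", 1)}): 0.30},
}
dist = comput_dist(CD, {"A": "a"}, {"D"}, children, tables)
total = sum(dist.values())
print({dict(k)["D"]: round(p / total, 3) for k, p in dist.items()})
# {0: 0.664, 1: 0.336} -- the estimated distribution of D given A=a
```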
The above algorithm is just one example of computing a distribution for an attribute; other methods can be conceived in light of the present disclosure.
With reference to
The non-transitory, computer-readable medium 1000 may correspond to any typical storage device that stores computer-implemented instructions, such as programming code or the like. For example, the non-transitory, computer-readable medium 1000 may include one or more of a non-volatile memory, a volatile memory, and/or one or more storage devices. Examples of non-volatile memory include, but are not limited to, electrically erasable programmable read only memory (EEPROM) and read only memory (ROM). Examples of volatile memory include, but are not limited to, static random access memory (SRAM), and dynamic random access memory (DRAM). Examples of storage devices include, but are not limited to, hard disks, compact disc drives, digital versatile disc drives, and flash memory devices.
A processor 1001 generally retrieves and executes the computer-implemented instructions stored in the non-transitory, computer-readable medium 1000 for generating a hint for a query. At block 1002, a model constructing module may construct a model which represents relevance between attributes in a dataset, based on historical data associated with the dataset. At block 1003, a statistics computing module may compute statistics about said attributes according to the model constructed at block 1002. At block 1004, in response to a query input by a user, a hint generating module may generate a hint for said query based on the model and the statistics.
Examples of the present disclosure can provide a practical and user-friendly query hint implementation, which helps users organize their queries and quickly gain the desired insights into real data. Advantageously, query hints can be applied in many circumstances, e.g., event log management applications such as security management and IT troubleshooting, as well as user behavior analysis. Examples of the present disclosure can also support high-dimensional data estimation by combining data-driven and query-driven methods for the automatic discovery of correlations. Based on the attribute correlations, a simple and scalable graphical model can be used for probability distribution propagation. Examples of the present disclosure also achieve high space and time efficiency.
From the above description of the implementations, the above examples can be implemented by hardware, software, firmware, or a combination thereof. For example, the various methods, processes, modules, and functional units described herein may be implemented by a processor (the term processor is to be interpreted broadly to include a CPU, processing unit, ASIC, logic unit, programmable gate array, etc.). The processes, methods, and functional units may all be performed by a single processor or be split between several processors. They may be implemented as machine-readable instructions executable by one or more processors. Further, the teachings herein may be implemented in the form of a software product. The computer software product is stored in a storage medium and comprises a plurality of instructions for making a computer device (which can be a personal computer, a server, a network device, etc.) implement the methods recited in the examples of the present disclosure.
The figures are only illustrations of examples, and the modules or procedures shown in the figures are not necessarily essential for implementing the present disclosure. Moreover, the sequence numbers of the above examples are only for description and do not indicate that one example is superior to another.
Those skilled in the art can understand that the modules in the device in an example can be arranged in that device as described in the example, or can alternatively be located in one or more devices different from the one in the example. The modules in the aforesaid examples can be combined into one module or further divided into a plurality of sub-modules.
Filing Document: PCT/CN2013/000106
Filing Date: 1/31/2013
Country: WO
Kind: 00