With the advancing of data collection and data management, data scale has become very large. The massive amounts of data available may lead to expensive query processing times. While some applications may desire to keep a short query response time, such as data mining, decision support and analysis, in some other applications, an approximate answer may be adequate to provide insights about the data.
The accompanying drawings illustrate various examples of various aspects of the present disclosure. It will be appreciated that the illustrated element boundaries (e.g., boxes, groups of boxes, or other shapes) in the figures represent one example of the boundaries. It will be appreciated that in some examples one element may be designed as multiple elements or that multiple elements may be designed as one element. In some examples, an element shown as an internal component of another element may be implemented as an external component and vice versa.
Systems and methods for generating an approximate answer for a query on a database are disclosed. As used herein, a database refers to a structured collection of data which can be organized in various ways. Without loss of generality and as used below, a database can be consisted of rows and columns, wherein each row represents a record in the database and each column represents a set of values for an attribute. As used herein, a query refers to an operation used to search in the database for records and/or attributes that satisfy certain conditions or obtain statistics about these records and/or attributes. An example of the systems and methods disclosed herein can divide a query into multiple sub queries and obtain approximate answers for these sub queries, which then can be combined to get an approximate answer for the query. Examples of the systems and methods disclosed herein can provide an accurate approximation for query answering in a short response time and can also support complex queries.
In the following, certain examples according to the present disclosure are described in detail with reference to the drawings.
Referring to
The system 100 may include a server 102, and one or more client computers 104, in communication over a network 106. As illustrated in
The network 106 may be a local area network (LAN), a wide area network (WAN), or another network configuration. The network 106 may include routers, switches, modems, or any other kind of interface device used for interconnection. The network 106 may connect to several client computers 104. Through the network 106, several client computers 104 may connect to the server 102. The client computers 104 may be similarly structured as the server 102. The network can also connect to a database 130. The database 130 can be any type of database and can also be located in the server 102. The database 130 can hold any kind of data, including, but not limited to, an event log, which is one of the commonly used high dimensional data and may have more than a hundred dimensions.
For example, event logs can be processed and analyzed for purposes such as security management, IT trouble shooting or user behavior analysis. When a user wants to analyze events matching specific criteria, the user may need to create a query to search for events from an event log database. The query can be as simple as a term to match, such as “login” or an IP address; or it can be more complex, such as events that include multiple IP addresses and ports and occur in specific time ranges from devices that belong to a particular device group. The user can specify a set of conditions in a query expression that are used to select or reject an event log.
As an example, a user can specify multiple conditions in a query expression with operators connecting these conditions. For example, a query name=“failed login” AND message!=“success” searches for event logs with a “name” field set to “failed login” and a message field not set to “success”. Various operators can be supported between field conditions, including, but not limited to, string operators such as ‘!=’, ‘=’, ‘>’, ‘<’, ‘<=’, ‘>=’, ‘BETWEEN’, ‘IN’, ‘STARTSWITH’, ‘ENDSWITH’ and ‘CONTAINS’, numeric/timestamp operators such as ‘!=’, ‘=’, ‘>’, ‘<’, ‘<=’, ‘>=’, ‘BETWEEN’, SQL operators such as ‘IS’, Boolean operators such as ‘AND’, ‘OR’, ‘NOT’ and list operator such as ‘IN’.
For sake of convenience, suppose that a query q is to be performed on a large data set, e.g., a high dimensional table R, wherein the table R is composed of rows (i.e. records) and columns (i.e. attributes), as described above. The query q can be expressed using SQL as follows:
select Ax, count (*)
from R
where AF
group by Ax
wherein, count indicates the number of records with Ax being a specific value that are in the table R and AF is the filtering condition with the following recursive definition using Backus Normal Form or BackusNaur Form (BNF):
<AF>::=Ai<Ω>vi
<AF>::=<AF><OP><AF>
<OP>::=AND|OR|NOT
<Ω>::=>|=|>=|<|<=|BETWEEN|CONTAINS|STARTSWITH|ENDSWITH|IN|NOT IN|IS NULL|NOT NULL
As is appreciated, a BNF specification is a set of derivation rules, written as <symbol>::=_expression_, wherein <symbol> is a nonterminal, and the _expression_ consists of one or more sequences of symbols; more sequences are separated by the vertical bar, ‘|’, indicating a choice, the whole being a possible substitution for the symbol on the left. Symbols that never appear on a left side are terminals. On the other hand, symbols that appear on a left side are non-terminals and are always enclosed between the pair < >. The ‘::=’ means that the symbol on the left must be replaced with the expression on the right.
Although different operators may have different semantics for different data types, the processing approach will be similar. Without loss of generality, a query q on a database can be expressed by a general form of q=<AF AND Ax=?>, as described below.
Continuing with
With reference to
For example, a query (A1Ωv1 OR A2Ωv2) AND (A3Ωv3 AND NOT A4Ωv4) can be converted to:
(A1Ωv1 AND A3Ωv3 AND NOT A4Ωv4) OR (A2Ωv2 AND A3Ωv3 AND NOT A4Ωv4), wherein “(A1Ωv1 AND A3Ωv3 AND NOT A4Ωv4)” and “(A2Ωv2 AND A3Ωv3 AND NOT A4Ωv4)” are the converted sub-queries.
At block 202, an approximate answer is generated for each of the sub queries. According to an example of the present disclosure, for a sub-query, an approximate answer is generated by utilizing either sampling technique or a top-k histogram associated with the database. For instance, given a sub-query q, samples of the database can be used to answer this sub-query and the result is denoted as process (S,q), wherein S represents a set of samples used to answer the query q. Please be noted that any sampling technique can be used herein and the result of the sub-query can be scaled up based on the sampling ratio and bounded by the total number of records in the database.
A top-k histogram can be built on some predefined column combinations in a database.
Continuing with
Wherein, sqi represents ith sub-query and F( ) represents an approximate answer. In each component, such as F(sqiΛsqj), the attribute-value constraint pairs are connected through “AND” or “AND NOT” operator. An attribute and value pair can be connected using “=” “!=”, “>”, “>=”, “<”, “<=”, “BETWEEN” “CONTAINS”, “STARTSWITH”, “ENDSWITH”.
With reference to
By way of example and not limitation, processing approaches for some operators according to methods described above are described below. For convenience, operators which have similar processing approach are grouped together.
For AND operator, a query has the following form: q=<Ai=vi, AND . . . , AND Aj=vj AND Ax=?>
If there exists a histogram h as shown in
On the other hand, if no attribute value exists in the top-k frequent values, then only the frequency of record y can be modified according to the statistical information about the rest non-top-k values in the histogram, as follows:
For operator OR, a query can be one of the following two forms: q=<AF OR Ax=?> and q=<subquery OR Ai=vi) AND Ax=?>.
For the former case, results of sub-queries sq1=<AF> and sq2=<Ax=?> are calculated. These results are then unioned and grouped and aggregated based on Ax. For the latter case, the query is equivalent to q=<(subquery AND Ax=?) OR (Ai=vi AND Ax=?)>. The query processing is similar to the former case: calculate the result of sub-query sq1=<subquery AND Ax=?>; calculate the result of sub-query sq2=<Ai=vj AND Ax=?>; union the result of sq1 and sq2, and then group and aggregate based on Ax.
In both cases, if record y1 in the result of sq1 and record y2 in the result of sq2 have the same attribute value, then the lower bound frequency of this record is max(ylower, zlower) and the upper bound frequency of this record is max(yupper, zupper), wherein yupper and glower are the upper bound and lower bound of the frequency of record y1 respectively, Zupper and Zlower are the upper bound and lower bound of the frequency of record y2 respectively and max ( ) gets the maximum value of two values.
For operators NOT and !=, a query is in the following form:
q=<subquery NOT Aj=vj AND Ax=?>
It is equivalent to:
q=<subquery NOT Aj!=vj AND Ax=?>
If there exist histograms h and h′ that can cover all the attributes of q and q′=<subquery AND Ax=?> respectively, then a result Y=process(h′, q′) can be obtained first by using the top-k frequent items in h′; and then, for each y in Y, y.frequency-process (h,q|Aj=y.Aj), which is a record, is put into the answer set. If y and z have bound as yupper, ylower, zupper and zlower respectively, yupper=yupper−zlower and ylower=ylower−zupper are returned.
Otherwise, if there does not exist histograms h and h′ that can cover all the attributes of q and q′, and if there exists a histogram h′ that can cover all the attributes of q′=<subquery AND Ax=?>, then a result Y=process(h′, q′) can be obtained by using the top-k frequent items in h′; a result Z=process(S,q) can be obtained by using samples and for each y in Y, y.frequency-Z(Aj=y.Aj) is put into the answer set.
However, if there does not exist a histogram h′ that can cover all the attributes of q′, then the answer process(S,q) is returned using samples directly.
For operators >=, <, <=, BETWEEN, CONTAINS, STARTSWITH, and ENDSWITH, a query is in the following form:
The query processing can first get X=process(S,q) using samples; then for each sub-query sq of q in the form of <Aj=vj, AND . . . AND Ai=vi>, if there exists a histogram h that covers all the attributes of sq, m=process(h,sq) can be obtained; then if there is any x in X, and x.frequency >m, set x.frequency=m. X will be returned as the final result.
For operator IN, a query is in the form of:
It is equivalent to:
The query processing can first compute the result of each component and then sum the results of each component together, as follows:
For operator NOT IN, a query is in the form of:
It is equivalent to:
The query process will be similar to operator AND, and will not be described herein.
For operator IS NULL/IS NOT NULL, a query is in the form of:
NULL can be considered as a special value, and the query is equivalent to:
The query processing is similar to operators = and !=, and will not be described herein.
As described above, examples of the present disclosure for providing an approximate answer for a query can generate more accurate approximation and also can support a variety of complex operators and complex queries.
With reference to
The non-transitory, computer-readable medium 500 may correspond to any typical storage device that stores computer-implemented instructions, such as programming code or the like. For example, the non-transitory, computer-readable medium 500 may include one or more of a non-volatile memory, a volatile memory, and/or one or more storage devices. Examples of non-volatile memory include, but are not limited to, electrically erasable programmable read only memory (EEPROM) and read only memory (ROM). Examples of volatile memory include, but are not limited to, static random access memory (SRAM), and dynamic random access memory (DRAM). Examples of storage devices include, but are not limited to, hard disks, compact disc drives, digital versatile disc drives, and flash memory devices.
A processor 501 generally retrieves and executes the computer-implemented instructions stored in the non-transitory, computer-readable medium 500 for obtaining an approximate answer for a query on a database. At block 502, a converting module may convert said query into a set of sub queries with a standard form. At block 503, a sub-query processing module may generate an approximate answer for each of the sub queries. At block 504, a combining module may combine approximate answers for the sub queries to obtain an approximate answer for the query.
From the above depiction of the implementation mode, the above examples can be implemented by hardware, software or firmware or a combination thereof. For example the various methods, processes, modules and functional units described herein may be implemented by a processor (the term processor is to be interpreted broadly to include a CPU, processing unit, ASIC, logic unit, or programmable gate array etc.) The processes, methods and functional units may all be performed by a single processor or split between several processors. They may be implemented as machine readable instructions executable by one or more processors. Further the teachings herein may be implemented in the form of a software product. The computer software product is stored in a storage medium and comprises a plurality of instructions for making a computer device (which can be a personal computer, a server or a network device, etc.) implement the method recited in the examples of the present disclosure.
The figures are only illustrations of an example, wherein the modules or procedure shown in the figures are not necessarily essential for implementing the present disclosure. Moreover, the sequence numbers of the above examples are only for description, and do not indicate an example is more superior to another.
Those skilled in the art can understand that the modules in the device in the example can be arranged in the device in the example as described in the example, or can be alternatively located in one or more devices different from that in the example. The modules in the aforesaid example can be combined into one module or further divided into a plurality of sub-modules.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2013/000107 | 1/31/2013 | WO | 00 |