The present disclosure generally relates to data processing methods and systems.
Since the invention of the computer, the processing power of computer systems has continued to improve, more or less following the famous Moore's Law. With ever-increasing computing power, more and more data-intensive applications are being developed. It is now not uncommon to see a database that stores many billions of records, running into petabytes of data storage. Often, it is desirable to analyze such data quickly to obtain useful information.
The present disclosure generally relates to data processing. Bloom filters are used to process data at high speed. A Bloom filter can be used to quickly determine the count, maximum, minimum, and/or other properties of a data array. The data array can have multiple rows, where each row of data comprises three or more attributes. Other types of data arrays (such as rows of a table or rows from streams) can be processed as well.
As the size of datasets for analysis explodes, finding summaries quickly in a data array or a data cube assumes increased importance. These summaries are often computed in the form of aggregates. Aggregations summarize datasets through the group-by, grouping-set, and roll-up operations. The grouping operations can be computed either at an online analytical processing (OLAP) layer or at a database management system (DBMS) layer of an analytics engine. The construction of a data cube, or even a partial cube, of these aggregates is an expensive operation, and most existing work concentrates on exact computation; where quick approximate results are required, the standard techniques are not sufficient. For example, an executive who wants a quick idea of the performance of a certain product is primarily interested in a ballpark number to make sure that there are no surprises. In fact, in many situations quick answers, even if approximate, are preferred over precise answers that take a longer time. It is to be appreciated that the present disclosure provides simple techniques for fast approximate computation of these aggregates. Existing techniques do not provide a single body of work that can quickly provide approximate values of the maximum (MAX), minimum (MIN), and count (COUNT) functions for data arrays, in a data cube or a partial cube.
As an example, for a given dataset with three attributes C1, C2, and C3 per row of data, it is often desirable to perform a grouping-set online analytical processing (OLAP) operation and output the value of an aggregate function for every group. Aggregate functions such as COUNT(*), MAX, and MIN can be computed using a Bloom filter.
In database parlance, each distinct value of an input represents a group. A group-by operation collapses the input into groups and computes a corresponding aggregate function; the grouping-set operation and the roll-up operation use the group-by operation as a basis for computing groups. By using Bloom-filter-based algorithms to approximately compute the groups of an input with the corresponding aggregates, aggregate functions and the grouping-set operation can be performed quickly.
It is to be appreciated that a Bloom-filter-based algorithm, as described in the present disclosure, is well suited for computing the many distinct values of a population, and compares favorably to traditional hash-based or sort-based implementations. Algorithms for computing a group-by operation are presented using an annotated Bloom filter. With the annotated Bloom filter, updating the aggregate value of a group is quick when an input value is given. The present disclosure presents an algorithm (Bloom-Aggregate) that computes the aggregates in the annotated Bloom filter. Due to false positives, the grouping aggregates are not 100% accurate, but the number of false positives is low for a well-designed Bloom filter.
The error rate becomes unacceptable when the number of 1's in the Bloom filter exceeds a threshold (say 30%-50%). A second algorithm (Bloom-Groups) takes this into account to reduce the error rate. When the threshold is exceeded, the Bloom-Groups algorithm stores partial groups on disk and reinitializes the Bloom filter (that is, resets the bitmap and counters or accumulators to zero), and the remaining input is consumed to construct further partial groups. To get the final answer, these partial groups are then fed to a traditional hash-based grouping implementation, but with fewer inputs than before.
Developed by Burton Howard Bloom in 1970, Bloom filters are widely used probabilistic data structures that can be used to test whether an element is a member of a set. For example, let there be an m-bit vector where each bit is initially set to zero; this vector is used as a Bloom filter. Besides the Bloom filter, we store an auxiliary structure with m counters that can store integers instead of just 0's and 1's to compute/store the aggregate value for all groups. Then, let there be n input values to encode in the filter. Assume that we have k hash functions. Each hash function hi takes a column value v as input and outputs a number in {1 . . . m}. We set the hi(v)th bit in the vector to one for each i ranging from 1 to k. Whenever the hi(v)th bit is accessed, we increment the corresponding counter (or accumulator) in the auxiliary structure. We take a function of the k counters (or the k accumulators) as our aggregate for a given group. For example, for the COUNT aggregate, the function could be the minimum value of the k counters.
Bloom filters are used to answer membership queries. To test whether a value v is already present in the filter, we apply the k hash functions to the value. If any of the k corresponding bits in the filter is zero, then the value v does not exist in the filter (that is, it is not a member).
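The membership test described above can be sketched as follows. This is an illustrative sketch only: the class name, the use of a salted SHA-256 digest to derive the k hash functions, and the chosen parameters are assumptions for illustration, not the disclosure's implementation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for membership tests (illustrative sketch)."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m  # the m-bit vector, initially all zero

    def _indices(self, v: str):
        # Derive k indices by salting a cryptographic hash; any family
        # of k independent hash functions h_1..h_k would do.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{v}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, v: str):
        for idx in self._indices(v):
            self.bits[idx] = 1

    def might_contain(self, v: str) -> bool:
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[idx] for idx in self._indices(v))
```

A value that was added always tests positive; a value that was never added tests negative unless all k of its bits happen to have been set by other values (a false positive).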
There is a small but non-zero probability that the value does not exist but we conclude that it does, because all k bits are set to one. This false positive probability p is computed by
p = (1 − e^(−kn/m))^k (Equation 1)
For a desired set of values p, n, and k, we can calculate the number of bits needed (m) using the formula:
m = −kn / ln(1 − p^(1/k)) (Equation 2)
If we set n to one million column values, p to 0.001, and k=5, then the m needed is about 16.48 megabits. If we increase p to 0.01, then m is about 9.39 megabits. The more memory we have, the more accurate the Bloom filter.
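The sizing calculation of Equation 2 can be checked directly. The short sketch below reproduces the two quoted figures, interpreting them as megabits (2^20 bits); the function name is chosen for illustration.

```python
import math

def bloom_bits(n: int, p: float, k: int) -> float:
    """Number of bits m required for n values at false-positive
    probability p with k hash functions (Equation 2)."""
    return -k * n / math.log(1 - p ** (1 / k))

n, k = 1_000_000, 5
m1 = bloom_bits(n, 0.001, k) / 2**20  # ~16.48 megabits
m2 = bloom_bits(n, 0.01, k) / 2**20   # ~9.39 megabits
```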
It is to be appreciated that one advantage of Bloom filters, despite possible false positives, is that they can be executed very quickly compared to many other types of methods, and the present disclosure describes methods for using a Bloom filter to compute the COUNT, MAX, and MIN functions.
For the COUNT, MAX, and MIN functions, Bloom filters are used in conjunction with hash functions, and Bloom filter performance may depend on the availability of fast hash functions. As an example, an array S of 255 random values is constructed. The random values can be obtained from www.random.org, or can be generated by a pseudo-random number generator, with the restriction that each generated value is in the range {0 . . . 255} inclusive, so that each value requires at most one byte of storage space. Such an array can be used to implement a hash function that takes a string of characters as input and returns a 64-bit unsigned integer as output.
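One plausible realization of such a hash function, sketched here as an assumption since the disclosure's own pseudo code is not reproduced, is a Pearson-style hash driven by the random table S. The mixing constant, the salting scheme, and the fixed seed are illustrative choices.

```python
import random

random.seed(42)  # fixed seed for reproducibility; a real system could
# instead fill S with values from www.random.org as the text suggests
S = [random.randint(0, 255) for _ in range(255)]  # 255 one-byte random values

def bloom_hash(s: str, salt: int = 0) -> int:
    """Pearson-style hash of a string to a 64-bit unsigned integer.

    The `salt` parameter lets the k hash functions h_1..h_k needed by
    the Bloom filter be derived from this single routine.
    """
    h = salt
    for ch in s.encode("utf-8"):
        # Mix each input byte through the random substitution table S,
        # keeping the state within 64 bits.
        h = (h * 31 + S[(h ^ ch) % len(S)]) & 0xFFFFFFFFFFFFFFFF
    return h
```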
To use a Bloom filter for these functions, the Bloom filter needs to be generated.
For example, to determine an aggregate using the Bloom filter, the following steps are performed:
1. Consume an input file and set the Bloom filter.
2. Maintain an auxiliary structure (e.g., counters or accumulators) parallel to the Bloom filter bit vector to maintain state for each group. A simple counter suffices for the COUNT(*) aggregation: if we use k hash functions for the Bloom filter, we increment the k counters a given group hashes to. For a MAX aggregate function, we keep 2-byte state information per bit, so k such objects are used per group. In the case of a MAX or MIN aggregate, we take the mode of the k values to be the aggregate value.
To provide an example, consider computing an approximate aggregation using the COUNT(*) function. Assume that m is 14 bits and that there are three hash functions (i.e., k is 3). Now assume that input "25" comes in and gets hashed to indices 13, 8, and 4. The corresponding counters are incremented and the corresponding bits in the bitmap are set to one. Next, input "92" comes in and gets mapped to indices 8, 4, and 3 by the three hash functions. If the next input is again "25", the Bloom filter updates as counters 13, 8, and 4 are incremented. The aggregate value for group "25" is given by the minimum of the counter values at 13, 8, and 4. In this case, counters 4 and 8 have the value 3 (i.e., incremented by both inputs "25" and "92"), and counter 13 has the value 2. The value 2 is the minimum among the three counters. For the input v=25 (i.e., querying how many "25" inputs there are), the count aggregate is 2, as there were 2 input values of 25.
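The annotated Bloom filter from the steps and example above can be sketched as follows. The class name, parameters, and salted-SHA-256 hash family are illustrative assumptions; the min-of-k-counters rule is the one described in the text.

```python
import hashlib

class BloomAggregateCount:
    """Annotated Bloom filter for approximate COUNT(*) per group:
    a bitmap plus a parallel array of counters (illustrative sketch)."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m
        self.counters = [0] * m  # auxiliary structure, one counter per bit

    def _indices(self, v: str):
        # k hash functions derived from a salted digest (an assumption).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{v}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, v: str):
        for idx in self._indices(v):
            self.bits[idx] = 1
            self.counters[idx] += 1

    def count(self, v: str) -> int:
        # Minimum of the k counters: never undercounts, and overcounts
        # only when hash collisions (false positives) inflate all k.
        return min(self.counters[idx] for idx in self._indices(v))
```

Feeding the inputs "25", "92", "25" from the example yields a count estimate of at least 2 for group "25", exactly 2 when none of its counters collide with another group's.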
It is to be appreciated that methods for determining aggregates using a Bloom filter can provide significant speed advantages over other types of methods (e.g., hash-based operations). The accuracy of the aggregate or COUNT function can be close to 100%. As an example, a group-by operation is performed on 20 million strings, with a high number of groups chosen for the experiment, since false positives go up as the number of groups goes up. A join operation is performed between the Bloom filter output of groups and aggregate values and a hash-map implementation of the same, and the aggregate values from the two implementations are matched. The accuracy of the Bloom filter is 98% in this case. In this example, 40 million bits are used for m, and k is 5. The mismatched aggregate values are higher than they should be, due to false positives. No partial groups are written to disk for the Bloom-filter-based algorithm. The result is summarized in Table 1 below:
A Bloom filter can be used to determine a maximum value as well. To use a Bloom filter for the MAX operation, a Bloom filter for the aggregate maximum may be needed, where the MAX operation determines the maximum value u for a group value v.
As an example, one operation is to compute the maximum value of an attribute for a given group. Members of the group are stored in column C, and the goal is to compute the maximum value of column D in the group. A state object is used for each bit to store the corresponding maximum. For k=3, a maximum is stored in each of the 3 objects. If an object is already in use, the object is overwritten with the current maximum. The maximum for a given group is determined by the mode of the 3 objects. If there is no mode, the majority value is used; if there is no majority value, one value from the k objects is selected.
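The MAX variant above can be sketched by replacing the counters with per-bit state objects that hold a running maximum. Class and column names, the hash family, and the tie-breaking behavior of `Counter.most_common` (which covers the "no mode" fallback) are illustrative assumptions.

```python
import hashlib
from collections import Counter

class BloomAggregateMax:
    """Bloom filter annotated with per-bit state objects holding a
    running maximum (illustrative sketch)."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.slots = [None] * m  # state object per bit: current maximum

    def _indices(self, group: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{group}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, group: str, value: int):
        for idx in self._indices(group):
            cur = self.slots[idx]
            # Overwrite each state object with the running maximum.
            self.slots[idx] = value if cur is None else max(cur, value)

    def max(self, group: str) -> int:
        values = [self.slots[idx] for idx in self._indices(group)]
        # Take the mode of the k values; on ties, most_common falls
        # back to the first value encountered.
        return Counter(values).most_common(1)[0][0]
```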
Again, determining the maximum (or MAX function) using a Bloom filter can provide various advantages over other techniques. For example, in comparison to using a hash table, the Bloom-filter-based MAX function can provide significant speed (i.e., faster) and memory (i.e., smaller) advantages, as demonstrated in a test case with 723 groups for an input size of 25 million rows, each row having 3 attributes. The first two columns are used as grouping columns, and the MAX function is computed for the third column. The parameter m is 12 million bits. The hash-based algorithm stores the maximum aggregate for each group, along with the group, on disk, and the Bloom aggregate algorithm uses this file from disk to measure the accuracy of the maximum function. The result and parameters are shown in Table 2 below:
The Bloom filter can be used to determine a minimum (or MIN function) as well. To use a Bloom filter for the MIN operation, a Bloom filter for the aggregate minimum may be needed, where the MIN operation determines the minimum value u for a group value v.
The aggregate functions provided with Bloom filters can be used together with stored groups. Groups of data can be stored on disk so that partial groups can be computed for larger inputs. Storing groups of data on disk is sometimes needed because Bloom filter bitmaps become less reliable as the number of 1's in the bitmap increases, which results in an increase in false positives. To overcome this, the bitmap is reinitialized. Before re-initialization, the partial groups are stored on disk. A partial group refers to a group that is not fully collapsed, so group values are repeated.
First, a set S1 is built. The set S1 contains almost all distinct values (each distinct value being a group), built by applying a Bloom filter to an input of column values and adding a value to S1 when it is not already a member. Periodically, the groups are written to disk when the number of groups exceeds a specific value (for example, 2 million). When all input is consumed, or when the partial groups need to be stored (i.e., the Bloom filter needs to be reinitialized), each member and its corresponding aggregate value are written to a file.
Since the Bloom filter is re-initialized when the density (i.e., the percentage of 1's in the bitmap) exceeds a predetermined threshold (e.g., 30%-50%), it is not necessary to have all m counters at any given time. In practice, it may be enough to store 30%-50% of the m counters. To increase accuracy, a higher number of bits per input value can be allocated (e.g., by increasing the m/n ratio), which may result in an increase in storage requirements and a decrease in calculation speed.
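The spill-and-reinitialize behavior of the Bloom-Groups approach can be sketched for the COUNT(*) case as follows. This is an assumption-laden sketch: an in-memory list stands in for the on-disk spill file, the hash family is a salted digest, and a plain dict plays the role of the final hash-based grouping implementation.

```python
import hashlib

class BloomGroupsCount:
    """Bloom-Groups sketch for COUNT(*): spill partial groups and
    reinitialize the filter when bitmap density exceeds a threshold
    (illustrative; `partials` stands in for the on-disk file)."""

    def __init__(self, m: int, k: int, density_threshold: float = 0.3):
        self.m, self.k = m, k
        self.threshold = density_threshold
        self.partials = []          # stands in for the on-disk spill file
        self._reset()

    def _reset(self):
        self.bits = [0] * self.m
        self.counters = [0] * self.m
        self.groups = set()         # distinct values seen this round (set S1)

    def _indices(self, v: str):
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{v}".encode()).hexdigest()
            yield int(d, 16) % self.m

    def add(self, v: str):
        idxs = list(self._indices(v))
        if not all(self.bits[i] for i in idxs):
            self.groups.add(v)      # not yet a member: a new group
        for i in idxs:
            self.bits[i] = 1
            self.counters[i] += 1
        # Reinitialize once the bitmap density exceeds the threshold.
        if sum(self.bits) / self.m > self.threshold:
            self._spill()

    def _spill(self):
        # Write out each group with its partial aggregate, then reset.
        for g in self.groups:
            self.partials.append(
                (g, min(self.counters[i] for i in self._indices(g))))
        self._reset()

    def finish(self):
        self._spill()
        # Traditional hash-based grouping over the (fewer) partial rows.
        out = {}
        for g, c in self.partials:
            out[g] = out.get(g, 0) + c
        return out
```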
A Bloom filter can also be used for grouping-set computation in a partial cube. Suppose a grouping-set operation over the groupings (C1, C2) and (C1, C3) is to be performed. The computation starts by splitting each row (C1, C2, C3) into two different rows, (C1, C2) and (C1, C3), and feeding the two split rows to two Bloom operators. A Bloom operator refers to a Bloom filter that is used to perform one or more operations, such as aggregate count, aggregate maximum, and/or aggregate minimum. Each Bloom operator performs a group-by operation: one Bloom operator performs a group-by on columns C1 and C2, while the other performs a group-by on columns C1 and C3. The outputs from the two Bloom operators are then combined by computing their union. Each Bloom operator consumes input and writes out full groups or partial groups, depending on whether its Bloom filter is reinitialized. If partial groups are output, they are fed to a hash-table-based grouping implementation to finish the group-by operation.
Using Bloom filters, other operations are possible in addition to count, maximum, and minimum. For example, by using Bloom filters that split the input into n streams, various operations can be performed on n-dimensional hypercubes.
It is to be appreciated that the methods and processes described above can be implemented using various types of computing systems, and the algorithms can be stored on various types of computer-readable media.
While the above is a full description of the specific embodiments, various modifications, alternative constructions and equivalents may be used. Therefore, the above description and illustrations should not be taken as limiting the scope of the present invention which is defined by the appended claims.