Set intersection is a very frequent operation in information retrieval, database operations and data mining. For example, in an Internet search for a document containing some term 1 and some term 2, the set of document identifiers containing term 1 is intersected with the set of document identifiers containing term 2 to find the resulting set of documents having both terms.
Any technology that speeds up the set intersection process in such technologies is highly desirable. For example, the latency with respect to the time taken to return Internet search results is a significant aspect of the user experience. Indeed, if query processing takes too long before the user receives a response, even on the order of hundreds of milliseconds longer than expected, users tend to become consciously or subconsciously annoyed, leading to fewer search queries being issued and higher rates of query abandonment.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a fast set intersection technology by which sets of elements to be intersected are maintained as partitioned subsets (small groups) in data structures, along with representative values (e.g., hash signatures) representing those subsets, in which the result of a mathematical operation (e.g., bitwise-AND) on the representative values indicates whether an intersection of range-overlapping subsets is empty. If so, the intersection operation on those subsets may be skipped, with intersection operations performed only on overlapping subsets that may have one or more intersecting elements.
In one aspect, an offline pre-processing stage is performed to partition the sets of ordered elements into the subsets, and to compute the representative value (one or more hash signatures) for each subset. In an online intersection stage, the subsets from each set to intersect are selected, and any subset of one set that overlaps with a subset of another set is evaluated for possible intersection, e.g., by bitwise-AND-ing their respective hash signatures to determine whether the result is zero (any intersection will be empty) or non-zero (there may be one or more intersecting elements). Only when there is a possibility of non-empty results is the intersection performed.
In one aspect, a plurality of independent hash signatures (e.g., three, obtained from different hash functions) is maintained for each subset. If any one mathematical combination of a hash signature with a corresponding (i.e., same hash function) hash signature of another subset indicates that an intersection operation, if performed, will be empty, the intersection need not be performed.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards a fast and efficient set intersection mechanism based upon algorithms and data structures. In general, in an offline pre-processing stage, sets are ordered, partitioned into subsets (smaller groups), and the smaller groups from one set numerically aligned with one or more of the smaller groups from the other set or sets. Each smaller group is represented by a value, such as provided by computing one or more hash values corresponding to the groups' elements.
In an online set intersection stage, a mathematical operation (e.g., a bitwise-AND) is performed on the representative (e.g., hash) values of any two aligned groups to determine whether those groups possibly intersect. Only if there is a possible intersection is an intersection performed on the small groups.
While the examples herein are directed towards information retrieval such as web search examples, e.g., intersecting sets of document identifiers, it should be understood that any of the examples herein are non-limiting, and other technologies (e.g., database and data mining) may benefit from the technology described herein. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and data processing in general.
By way of example, the sets to be intersected may comprise lists of document identifiers, e.g., one set containing all of the document identifiers containing the term “Microsoft” and the other set containing all of the document identifiers containing the term “Office.” As can be readily appreciated, such lists may be extremely large at the web scale where billions of documents may be referenced.
Because the intersection results are typically so much smaller than the sizes of the original large sets, most of the small group intersections are empty. Described herein is efficiently and rapidly detecting those empty group intersections so that the online set intersection only needs to be performed on groups where an intersection may result in a non-empty result set. Note that the partitioning and other operations (e.g., hash computations) are performed in an offline pre-processing operation, and thus do not take any processing time during online set intersection processing.
Because of the offline pre-processing, the various sub-group elements and their representative (e.g., hash) values need to be maintained in storage for online access. As described below, a data structure encodes these data compactly, and allows the fast set intersection process/mechanism 108 to detect, in a constant number of operations (i.e., almost instantly) whether any two subsets have an empty intersection result. Only in the relatively infrequent event that the two subsets may not have an empty intersection result does the intersection operation need to be performed.
To this end, in addition to the values for each subset, a representative value such as a hash signature (or signatures) for the subset is maintained, as generally represented in
When set intersection does need to take place in online processing, a logical bitwise-AND of the stored signatures for the aligned subsets efficiently detects whether there is any possibility of a non-empty subset intersection result, i.e., whether the result of the AND operation is non-zero. As can be readily appreciated, an AND operation and a comparison against zero are among the fastest operations performed by computing devices. Note that a hash collision may cause a false positive (whereby the intersection operation is performed only to find that the intersection result is empty). However, whenever the AND operation results in zero (which occurs frequently in information retrieval, for example), the intersection is certain to be empty.
As will be understood, described hereinafter are various ways to partition the sets into the subsets (small groups) to facilitate efficient data storage and online processing. In addition, described is determining which of the small groups to intersect, and how to compute the intersection of two small groups as described below.
Consider a collection of N sets S={L1, . . . , LN}, where Li is a subset of Σ and Σ is the universe of elements in the sets; let ni=|Li| be the size of set Li. When referring to sets, inf(Li) and sup(Li) represent the minimum and maximum elements of a set Li, respectively. The elements in a set are ordered. The size (number of bits) of a word on the target processor is denoted by w. Pr[E] denotes the probability of an event E and E[X] denotes the expectation of a random variable X. Also, [w] denotes the set {1, . . . , w}.
A general task is to design data structures such that the intersection of arbitrarily many sets can be computed efficiently. As described above, there is a pre-processing stage that reorganizes each set and attaches additional index data structures, and an online processing stage that uses the pre-processed data structures to compute the intersections. An intersection query is specified via a collection of k sets L1, L2, . . . , Lk (to simplify the notation, the subscripts 1, 2, . . . , k are used to refer to the sets in a query). The general goal is to efficiently compute the intersections L1∩L2∩ . . . ∩Lk. Note that pre-processing is typical of the known techniques used for set intersections in practice. The pre-processing stage is time/space-efficient.
One concept described herein is that the intersection of two sets in a small universe can be computed very efficiently. More particularly, if sets are subsets of {1, 2, . . . , w}, they can be encoded as single machine-words and their intersection computed using a bitwise-AND. Another concept is that for the data distribution seen in text corpora, the size of an intersection is typically much smaller than the size of the smallest set being intersected (in this case, an O(|L1∩L2|) algorithm is better than an O(|L1|+|L2|) algorithm).
These concepts are leveraged by partitioning each set into smaller groups Lij, which are intersected separately. In the pre-processing stage, each small group is mapped into a small universe [w]={1, 2, . . . , w} using a universal hash function h, and the image h(Lij) is encoded with a machine-word. Then, in the online processing stage, to compute the intersection of two small groups L1p and L2q, a bitwise-AND operation is used to compute H=h(L1p)∩h(L2q).
The “small” intersection sizes seen in practice imply that a large fraction of pairs of the small groups with overlapping ranges have an empty intersection. Thus, by using the word-representations of H to detect these groups quickly, a significant amount of unnecessary computation is skipped, resulting in significant speedup.
The resulting algorithmic framework is illustrated in
One way to intersect sets is via fixed-width partitions, e.g., eight elements per group. Consider a scenario when there are only two sets L1 and L2 in the intersection query. In a pre-processing stage, L1 and L2 are sorted, and partitioned into groups of equal size $\sqrt{w}$ (except possibly the last groups; note that w is the word width as described above):
$$L_1^1, L_1^2, \ldots, L_1^{\lceil n_1/\sqrt{w} \rceil}, \quad \text{and} \quad L_2^1, L_2^2, \ldots, L_2^{\lceil n_2/\sqrt{w} \rceil}$$
In the online processing stage, the small groups are scanned in order, and the intersection L1p∩L2q of each pair of overlapping groups is computed; the union of all these intersections is L1∩L2 (Algorithm 1):
If the ranges of L1p and L2q overlap, implying that it is possible that L1p∩L2q≠Ø, then L1p∩L2q is computed (line 8) in some iteration. Because each group is scanned once, lines 2-10 are repeated for $O((n_1+n_2)/\sqrt{w})$ iterations.
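The scan over range-overlapping groups described above can be sketched as follows. This is a minimal illustration and not the listing of Algorithm 1 itself: the function names `partition_fixed` and `intersect_fixed` are hypothetical, the default group size of 8 assumes w=64, and a plain Python set intersection stands in for the small-group intersection routine.

```python
def partition_fixed(sorted_items, size):
    """Split a sorted list into consecutive groups of `size` elements."""
    return [sorted_items[i:i + size] for i in range(0, len(sorted_items), size)]


def intersect_fixed(l1, l2, size=8):
    """Intersect two lists of distinct values by visiting only group pairs
    whose value ranges overlap (size=8 assumes a 64-bit word)."""
    g1 = partition_fixed(sorted(l1), size)
    g2 = partition_fixed(sorted(l2), size)
    p = q = 0
    result = []
    while p < len(g1) and q < len(g2):
        a, b = g1[p], g2[q]
        if a[-1] < b[0]:        # group a lies entirely below group b
            p += 1
        elif b[-1] < a[0]:      # group b lies entirely below group a
            q += 1
        else:
            # Ranges overlap: a stand-in for the small-group intersection.
            result.extend(sorted(set(a) & set(b)))
            # Advance whichever group ends first (both on a tie).
            if a[-1] < b[-1]:
                p += 1
            elif b[-1] < a[-1]:
                q += 1
            else:
                p += 1
                q += 1
    return result
```

Each group is visited a bounded number of times, so the number of loop iterations grows with the total number of groups rather than with the number of elements.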
Turning to computing L1p ∩L2q efficiently based upon pre-processing, each group L1p or L2q is mapped into a small universe for fast intersection. Single-word representations are leveraged to store and manipulate sets from a small universe.
With respect to single-word representation of sets, a set A ⊆ [w]={1, 2, . . . , w} is represented using a single machine-word of width w by setting the y-th bit to 1 if and only if y∈A. This is referred to as the word representation w(A) of A. For two sets A and B, the bitwise-AND w(A)∧w(B) (computed in O(1) time) is the word representation of A∩B. Given a word representation w(A), the elements of A can be retrieved in linear time O(|A|). Hereinafter, if A ⊆ [w], A denotes both a set and its word representation.
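As a small illustration of the word representation, the following sketch encodes subsets of {1, . . . , w} as integer bitmasks and intersects them with a single AND. The helper names `to_word` and `from_word` are hypothetical, and a Python integer is assumed to stand in for a machine-word.

```python
def to_word(elements, w=64):
    """Encode a subset of {1, ..., w} as a bitmask: bit y-1 is set iff y is in the set."""
    word = 0
    for y in elements:
        if not 1 <= y <= w:
            raise ValueError(f"{y} is outside the universe 1..{w}")
        word |= 1 << (y - 1)
    return word


def from_word(word):
    """Recover the elements in increasing order, one loop step per set bit."""
    out = []
    while word:
        low = word & -word            # isolate the lowest set bit
        out.append(low.bit_length())  # bit position, 1-based
        word ^= low                   # clear that bit
    return out


a = to_word({1, 2, 4, 9})
b = to_word({1, 3, 5, 9})
common = from_word(a & b)  # one AND yields the representation of the intersection
```

Here `common` holds the elements shared by the two example sets, namely 1 and 9.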
In the pre-processing stage, the elements in a set Li are sorted as $\{x_i^1, x_i^2, \ldots, x_i^{n_i}\}$, and Li is partitioned as follows:
$$L_i^1 = \{x_i^1, \ldots, x_i^{\sqrt{w}}\}, \quad L_i^2 = \{x_i^{\sqrt{w}+1}, \ldots, x_i^{2\sqrt{w}}\} \qquad (1)$$
$$L_i^j = \{x_i^{(j-1)\sqrt{w}+1}, x_i^{(j-1)\sqrt{w}+2}, \ldots, x_i^{j\sqrt{w}}\} \qquad (2)$$
For each small group Lij, the word representation of its image is computed under a universal hash function h: Σ→[w], i.e., h(Lij)={h(x)|x∈Lij}. In addition, for each position y∈[w] and each small group Lij, an inverted mapping is also maintained, h−1(y,Lij)={x|x∈Lij and h(x)=y}, i.e., for each y∈[w], the elements in Lij with hash value y are stored in a data structure supporting ordered access, e.g., a sorted list. The sort order for these elements is identical across all h−1(y,Lij); this way, these short lists may be intersected using a simple linear merge.
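The pre-processing of one small group can be sketched as below. The multiplicative hash `h` and the name `preprocess_group` are assumptions made for illustration only; any universal hash function into {1, . . . , w} could be substituted.

```python
W = 64  # assumed word width


def h(x, w=W):
    """Hypothetical hash into {1, ..., w}; a stand-in for a universal hash function."""
    return (x * 2654435761 % (1 << 32)) % w + 1


def preprocess_group(group, w=W):
    """Return (word, inverted) for one small group.

    word     -- bitmask with bit y-1 set iff some element hashes to y
    inverted -- dict mapping each hash value y to the sorted elements with h(x) == y
    """
    word = 0
    inverted = {}
    for x in sorted(group):  # sorted input keeps every inverted list sorted
        y = h(x, w)
        word |= 1 << (y - 1)
        inverted.setdefault(y, []).append(x)
    return word, inverted
```

Because every inverted list is built from the same sorted order, two lists for the same hash value can later be merged linearly.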
By way of example,
Via a hash mechanism 334 (of the fast set intersection mechanism 108), the process pre-computes h(L11)={1, 2, 4, 9}, h(L12)={0, 11}, h(L21)={1, 3, 5, 9}, h(L22)={0, 6, 11}, h(L23)={1, 2}. The inverted mappings (not shown) are also pre-processed, h−1(y,Lip)'s: for example, h−1(0, L12)={1016}, h−1(11, L12)={1016, 1032}, h−1(0,L22)={1027, 1043}, and h−1(11,L22)={1011}.
Turning to the online processing stage, one algorithm used to intersect two lists is shown in Algorithm 1. Because the elements in each set are sorted, Algorithm 1 ensures that the intersection of two small groups L1p, L2q is computed (line 8) only if their ranges overlap. This is represented in
To compute the intersection of two small groups L1p∩L2q efficiently, IntersectSmall (Algorithm 2) is provided, which first computes H=h(L1p)∩h(L2q) using a bitwise-AND. Then for each (1-bit) y∈H, Algorithm 2 intersects the corresponding inverted mappings using the simple linear merge algorithm:
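A minimal sketch in the spirit of IntersectSmall follows; it is not the original listing. The toy hash `x % 16`, the 0-based bit numbering, and the helper names `build`, `linear_merge` and `intersect_small` are all assumptions for illustration.

```python
def build(group, w=16):
    """Pre-process a group with a toy hash h(x) = x % w (an assumption)."""
    word, inverted = 0, {}
    for x in sorted(group):
        y = x % w
        word |= 1 << y
        inverted.setdefault(y, []).append(x)
    return word, inverted


def linear_merge(a, b):
    """Intersect two sorted lists in O(len(a) + len(b)) steps."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_small(word1, inv1, word2, inv2):
    """AND the word representations, then merge only the matching hash buckets."""
    H = word1 & word2
    result = []
    while H:
        low = H & -H              # lowest 1-bit of H
        y = low.bit_length() - 1  # hash value shared by both groups
        result.extend(linear_merge(inv1[y], inv2[y]))
        H ^= low
    return sorted(result)
```

When `H` is zero the loop body never runs, which is the fast path for the common case of an empty intersection.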
By way of example of computing the intersection of small groups in online processing, to compute L1∩L2, the process needs to compute L11∩L21, L12∩L22, and L12∩L23 (the pairs with overlapping ranges as represented in
Note that the word representations and inverted mappings are pre-computed, and the word representations are intersected using one operation. Thus the running time of IntersectSmall is bounded by the number of pairs of elements, one from L1p and one from L2q, that are mapped to the same hash value. This number can be shown to be approximately equal (in expectation) to the intersection size, giving an expected running time of $O((n_1+n_2)/\sqrt{w} + r)$ for computing L1∩L2, where $r = |L_1 \cap L_2|$.
To achieve a better bound, the group sizes may be optimized to $s_1^* = \sqrt{w n_1/n_2}$ and $s_2^* = \sqrt{w n_2/n_1}$, respectively, whereby L1∩L2 can be computed in expected $O(\sqrt{n_1 n_2/w} + r)$ time.
To achieve the better bound $O(\sqrt{n_1 n_2/w} + r)$, multiple “resolutions” of the partitioning of a set Li are needed. This is because, as described above, the optimal group size $s_1^* = \sqrt{w n_1/n_2}$ of the set L1 also depends on the size n2 of the set L2 to be intersected with L1. For this purpose, a set Li is partitioned into small groups of size 2, 4, . . . , $2^j$ and so forth.
To compute L1∩L2 for the given two sets, suppose $s_i^*$ is the optimal group size of Li; the actual group size selected is $s_i^{**} = 2^t$ such that $s_i^* \le s_i^{**} \le 2 s_i^*$, obtaining the same bound. A properly-designed multi-resolution data structure consumes only O(ni) space for Li, as described below.
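The choice of group sizes can be illustrated numerically. In this sketch the function names are hypothetical, and the rounding rule (smallest power of two not below the optimum) is one way to satisfy the stated constraint, assuming the optimum is at least 1.

```python
import math


def optimal_group_sizes(n1, n2, w=64):
    """Group sizes sqrt(w*n1/n2) and sqrt(w*n2/n1) for sets of sizes n1 and n2."""
    return math.sqrt(w * n1 / n2), math.sqrt(w * n2 / n1)


def power_of_two_size(s_opt):
    """Smallest power of two 2**t with s_opt <= 2**t (never below 1).

    For s_opt >= 1 the result also satisfies 2**t <= 2 * s_opt.
    """
    t = max(0, math.ceil(math.log2(s_opt)))
    return 2 ** t


s1, s2 = optimal_group_sizes(1000, 16000)  # a small set against a 16x larger one
sizes = (power_of_two_size(s1), power_of_two_size(s2))
```

With these example sizes the smaller set uses groups of 2 elements and the larger set groups of 32, so both sets are split into the same number of groups (500 each).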
There are limitations to fixed-width partitions, including that the approach is difficult to extend to more than two sets, because the partitioning scheme is not well-aligned for more than two sets. For three sets, for example, there may be more than $O((n_1+n_2+n_3)/\sqrt{w})$ triples of small groups that intersect. A different partitioning scheme that addresses this issue and is extendable to k>2 sets, namely intersection via randomized partitions, is described below.
In general, instead of fixed-size partitions, a hash function g is used to partition each set into small groups, using the most significant bits of g(x) to group an element x∈Σ. This reduces the number of combinations of small groups to intersect, providing bounds similar to those described above for computing intersections of more than two sets.
In a pre-processing stage, let g be a universal hash function g: Σ→{0,1}^w mapping an element to a bit-string (or binary number). Note that g_t(x) denotes the t most significant bits of g(x). For two bit-strings z1 and z2, z1 is a t1-prefix of z2 if and only if z1 is identical to the highest t1 bits in z2; e.g., 1010 is a 4-prefix of 101011.
When pre-processing a set Li, it is partitioned into groups Liz such that Liz={x|x∈Li and g_t(x)=z}. As before, the word representation of the image of each Liz is computed under another hash function h: Σ→[w], along with the inverted mappings for each group.
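The randomized partitioning can be sketched as follows. The 32-bit multiplicative hash `g` is an assumption standing in for a universal hash function, and `g_t` and `partition_by_prefix` are hypothetical helper names.

```python
from collections import defaultdict

W_BITS = 32  # assumed width of the hash output


def g(x):
    """Hypothetical hash to a W_BITS-bit number (stand-in for a universal hash)."""
    return (x * 2654435761) % (1 << W_BITS)


def g_t(x, t):
    """The t most significant bits of g(x), i.e. the group identifier z."""
    return g(x) >> (W_BITS - t)


def partition_by_prefix(elements, t):
    """Group elements by the t-bit prefix of g(x); each group is ordered by g(x)."""
    groups = defaultdict(list)
    for x in sorted(elements, key=g):
        groups[g_t(x, t)].append(x)
    return dict(groups)
```

A useful property of this scheme is that a coarser partition is always a merge of finer ones: the group of x under t-1 bits is identified by `g_t(x, t) >> 1`.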
The online processing stage is similar to the algorithm described above, that is, to compute the intersection of two sets L1 and L2, the intersections of some pairs of overlapping small groups are computed, and the union of these intersections taken. In general, suppose L1 is partitioned using gt
One improvement of Algorithm 3 compared to Algorithm 1 is that Algorithm 1 needs to compute L1p∩L2q whenever the ranges of L1p and L2q overlap. In contrast, L1z
Based on the choices of the parameters t1 and t2, L1 and L2 may be partitioned into the same number of small groups or into small groups of the (approximately) identical sizes.
To extend the process to more than two sets, that is, to compute the intersection of k sets L1, . . . , Lk where ni=|Li| and n1≤ . . . ≤nk, each Li is partitioned into groups Liz using gt
The process then proceeds as in Algorithm 4:
As can be seen, Algorithm 4 is almost identical to Algorithm 3, with a difference being that Algorithm 4 picks the group identifiers zi to be the ti-prefix of zk, such that the process only intersects groups that share a prefix of size at least ti, and no combination of such groups is repeated. Also, the IntersectSmall algorithm (Algorithm 2) is extended to k groups; the process first computes the intersection (bitwise-AND) of hash images (their word-representations) of the k groups and, if the result is not zero, for each 1-bit, performs a simple linear merge over the k corresponding inverted mappings.
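The prefix alignment across k sets can be sketched as below. This is an illustration under stated assumptions rather than the listing of Algorithm 4: the hash `g`, the function names, and the requirement that the prefix lengths `ts` be non-decreasing are assumptions, and plain Python set intersection stands in for the k-way IntersectSmall.

```python
W_BITS = 32  # assumed width of the hash output


def g(x):
    """Hypothetical hash to a W_BITS-bit number."""
    return (x * 2654435761) % (1 << W_BITS)


def partition(elements, t):
    """Map each t-bit prefix z of g(x) to the list of elements in that group."""
    groups = {}
    for x in elements:
        groups.setdefault(g(x) >> (W_BITS - t), []).append(x)
    return groups


def intersect_k(sets, ts):
    """Intersect k sets; sets[i] is partitioned with ts[i] prefix bits.

    Assumes ts is non-decreasing, so every group identifier of the last set
    has a well-defined ts[i]-prefix in each earlier set.
    """
    parts = [partition(s, t) for s, t in zip(sets, ts)]
    t_k = ts[-1]
    result = []
    for z_k, last_group in parts[-1].items():
        candidate = set(last_group)
        for part, t in zip(parts[:-1], ts[:-1]):
            z_i = z_k >> (t_k - t)  # the t-prefix of z_k
            candidate &= set(part.get(z_i, ()))
            if not candidate:
                break               # empty already; skip the remaining sets
        result.extend(candidate)
    return sorted(result)
```

Each group of the last set is combined with exactly one group from every other set, so no combination of groups is visited twice.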
Turning to a multi-resolution data structure represented in
For simplicity, suppose Σ={0,1}^w and choose g to be a random permutation of Σ. Note that as used herein, universal hash functions and random permutations are interchangeable. To pre-process Li, the elements x∈Li are ordered according to g(x). Then any small group Liz in the partition induced by g_t (for any t) forms a consecutive interval in Li.
With respect to word representations of hash mappings, for each small group Liz, the word representation h(Liz) is pre-computed and stored. Note that the total number of small groups across all resolutions is O(ni), so storing them uses O(ni) space.
For inverted mappings, the elements in h−1(y, Liz) need to be accessed, in order, for each y∈[w]. Explicitly storing these mappings consumes prohibitive space, and thus the inverted mappings are stored implicitly. To this end, because each group Liz corresponds to an interval in Li, its starting and ending positions are stored, denoted by left(Liz) and right(Liz). These allow determining whether a value x belongs to Liz. To enable ordered access to the inverted mappings, for each x∈Li, next(x) is defined to be the “next” element x′ to the right of x such that h(x′)=h(x) (i.e., the one with minimum g(x′)>g(x)). Then, for each Liz and each y∈[w], the data structure stores the position first(y, Liz) of the first element x″ in Liz such that h(x″)=y.
To access the elements in h−1(y, Liz) in order, the process starts from the element at first(y,Liz), and follows the pointers next(x), until passing the right boundary right(Liz). In this way, the elements in the inverted mapping are retrieved in the order of g(x) which is needed by IntersectSmall. For all groups of different sizes, the total space for storing the h(Liz)'s, left(Liz)'s, right(Liz)'s, and next(x)'s is O(ni).
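The implicit inverted mappings can be sketched with position-based pointers. In this illustration the input list is assumed to be already ordered by g(x), the toy hash is `x % 4`, and the function names are hypothetical.

```python
def build_next(ordered, h):
    """next pointers: for each position, the next position to the right whose
    element has the same hash value (len(ordered) if there is none)."""
    end = len(ordered)
    nxt = [end] * end
    last_seen = {}
    for pos in range(end - 1, -1, -1):
        y = h(ordered[pos])
        nxt[pos] = last_seen.get(y, end)
        last_seen[y] = pos
    return nxt


def build_first(ordered, h, left, right):
    """first(y, group): first position in [left, right] with hash value y."""
    first = {}
    for pos in range(left, right + 1):
        first.setdefault(h(ordered[pos]), pos)
    return first


def inverted_mapping(ordered, nxt, first, right, y):
    """Walk first -> next -> next ... until passing the group's right boundary."""
    out = []
    pos = first.get(y, len(ordered))
    while pos <= right:
        out.append(ordered[pos])
        pos = nxt[pos]
    return out


toy_h = lambda x: x % 4              # toy hash, an assumption
data = [7, 3, 12, 19, 4, 28, 11]     # assumed to be ordered by g(x) already
nxt = build_next(data, toy_h)
first = build_first(data, toy_h, 1, 4)  # group occupying positions 1..4
```

The `nxt` array is shared by groups of every resolution; only the small `first` tables and interval boundaries are kept per group.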
While the above algorithms suffice, a more practical version is described herein, which in general is simpler, uses significantly less memory, has more straightforward data structures and is faster in practice. A difference is that for each small group Liz, only the elements in Liz and their representative images under multiple (m>1) hash functions are stored. Note that inverted mappings are not maintained, as the process instead uses a simple scan over a short block of data. Also, the process uses only a single grouping for each set Li. Having multiple word representations of hash images for each small group allows detecting empty intersections of small groups with higher probability.
In a pre-processing stage, each set Li is partitioned into groups Liz's using a hash function gt
which depends only on the size of Li. Thus for each set Li, pre-processing with a single partitioning suffices, saving significant memory. For each group, word representations of images are computed under m (independent/different) universal hash functions h1, . . . , hm: Σ→[w]. Note that in practice, only a small value of m suffices, e.g., m=3.
In the online processing stage, the algorithm for computing ∩i Li (Algorithm 5) is generally the same as Algorithm 4, except that when needed, ∩iLiz
Algorithm 5 is generally efficient because the chance of a false positive intersection resulting from a hash collision is already small, and becomes significantly smaller with multiple hash functions, each of which must have a hash collision for there to be a false positive. Thus, most empty intersections can be skipped using the test in line 3.
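The multiple-signature test can be sketched as follows. The three hash parameter pairs, the modulus, and the function names are assumptions for illustration; the small-group intersection is done by a simple linear scan over the two sorted groups, as the practical variant stores no inverted mappings.

```python
W = 64  # assumed word width
# Hypothetical (a, b) parameters for m = 3 different hash functions.
HASH_PARAMS = [(2654435761, 0), (40503, 17), (2246822519, 101)]
MODULUS = 4294967291  # a large modulus below 2**32, chosen for illustration


def signatures(group, w=W):
    """One word per hash function; a bit is set iff some element hashes to it."""
    sigs = []
    for a, b in HASH_PARAMS:
        word = 0
        for x in group:
            word |= 1 << (((a * x + b) % MODULUS) % w)
        sigs.append(word)
    return sigs


def may_intersect(sigs1, sigs2):
    """False as soon as any pair of corresponding signatures ANDs to zero."""
    return all(s1 & s2 for s1, s2 in zip(sigs1, sigs2))


def intersect_groups(g1, sigs1, g2, sigs2):
    """Skip provably empty pairs; otherwise scan the two sorted groups."""
    if not may_intersect(sigs1, sigs2):
        return []
    a, b = sorted(g1), sorted(g2)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```

A shared element sets the same bit in every pair of corresponding signatures, so the filter never rejects a pair that truly intersects; it can only let through some pairs that turn out to be empty.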
As represented in
Turning to another aspect, namely intersecting small and large sets, a simple algorithm may be used to handle asymmetric intersections, i.e., two sets L1 and L2 with significantly differing sizes, e.g., a 100 times size difference (in this example L2 is the larger set). The algorithm works by focusing on the partitioning induced by g_t: Σ→{0,1}^t, where t=⌈log n1⌉ for both of them. To compute L1∩L2, the process computes L1z∩L2z for all z∈{0,1}^t and takes the union of them. To compute L1z∩L2z, the process iterates over each x∈L1z, and performs a binary search for x in L2z. In other words, the process selects an element from the smaller group, and uses a binary search to determine if there is an intersection with an element in the larger group.
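The asymmetric case can be sketched as below. The hash `g`, the base-2 logarithm, and the function names are assumptions for illustration; the binary search uses `bisect_left` from the Python standard library.

```python
import math
from bisect import bisect_left

W_BITS = 32  # assumed width of the hash output


def g(x):
    """Hypothetical hash to a W_BITS-bit number."""
    return (x * 2654435761) % (1 << W_BITS)


def partition(elements, t):
    """Group by the t-bit prefix of g(x); each group is kept sorted by value."""
    groups = {}
    for x in sorted(elements):
        groups.setdefault(g(x) >> (W_BITS - t), []).append(x)
    return groups


def intersect_asymmetric(small, large):
    """Intersect a small set with a much larger one, group by group."""
    if not small:
        return []
    t = math.ceil(math.log2(len(small)))  # base 2 is an assumption
    small_groups = partition(small, t)
    large_groups = partition(large, t)
    result = []
    for z, s_group in small_groups.items():
        l_group = large_groups.get(z, [])
        for x in s_group:
            i = bisect_left(l_group, x)   # binary search for x in the larger group
            if i < len(l_group) and l_group[i] == x:
                result.append(x)
    return sorted(result)
```

With t chosen from the smaller set, its groups hold about one element each, so the work is roughly one binary search per element of the smaller set.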
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 610 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 610 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 610. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 630 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 631 and random access memory (RAM) 632. A basic input/output system 633 (BIOS), containing the basic routines that help to transfer information between elements within computer 610, such as during start-up, is typically stored in ROM 631. RAM 632 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 620. By way of example, and not limitation,
The computer 610 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 610 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 680. The remote computer 680 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 610, although only a memory storage device 681 has been illustrated in
When used in a LAN networking environment, the computer 610 is connected to the LAN 671 through a network interface or adapter 670. When used in a WAN networking environment, the computer 610 typically includes a modem 672 or other means for establishing communications over the WAN 673, such as the Internet. The modem 672, which may be internal or external, may be connected to the system bus 621 via the user input interface 660 or other appropriate mechanism. A wireless networking component such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 610, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 699 (e.g., for auxiliary display of content) may be connected via the user interface 660 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 699 may be connected to the modem 672 and/or network interface 670 to allow communication between these systems while the main processing unit 620 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.