1. Field of the Invention
Aspects of the present invention relate generally to processing events, and more specifically to generating a particular data structure to increase the efficiency of said processing.
2. Description of Related Art
The publish/subscribe (“pub/sub”) paradigm, in which a large population of users expresses long-term interests (“subscriptions”) over streams of “published events,” has gained immense popularity in recent years, due at least in part to the increasing volumes of dynamic information available over the World Wide Web such as, for example, stock quotes and news reports. A pub/sub engine typically matches an incoming event to a subset of standing subscriptions. For example, streams of event messages originating at one or more “publishers” may be matched with the interests of one or more pre-registered “subscribers.” However, conventional methodologies rely on a simple binary notion of matching that assumes that each event either matches a subscription or does not, and many emerging applications require a more sophisticated notion of matching, where only the “best” matching subscriptions are of interest.
Thus, it is desirable to provide an efficient way to generate an index structure amenable to top-k stabbing queries.
In light of the foregoing, it is a general object of the present invention to provide an efficient method for creating an index structure to store scored intervals corresponding to subscriptions, which index structure is amenable to top-k stabbing queries.
Detailed descriptions of one or more embodiments of the invention follow, examples of which may be graphically illustrated in the drawings. Each example and embodiment is provided by way of explanation of the invention, and is not meant as a limitation of the invention. For example, features described as part of one embodiment may be utilized with another embodiment to yield still a further embodiment. It is intended that the present invention include these and other modifications and variations.
Aspects of the present invention are described below in the context of providing an efficient way of representing scored intervals such that they may be retrieved in response to a stabbing query.
Publish/subscribe (pub/sub) systems are designed to efficiently match incoming events (e.g., stock quotes) against a set of subscriptions (e.g., trader profiles specifying quotes of interest). However, current pub/sub systems support only binary matching (i.e., either it matches or it does not); for example, a stock quote will either match or not match a trader profile. This simple notion of matching is inadequate for many applications where only the “best” matching subscriptions are of interest.
For example, in targeted Web advertising, an incoming user (“event”) may match several different advertiser-specified user profiles (“subscriptions”), but given the limited advertising real-estate, it is desired to quickly discover only the best (e.g., most relevant, etc.) ads to display. As a more specific example, consider a mortgage vendor who wishes to show an ad tailored to users between 20 and 35 years of age, with credit scores between 400 and 500, and who have visited a real-estate web site at least three times in the past month. Such a goal can be modeled as a pub/sub problem, where the stream of incoming users corresponds to events (e.g., a user with age=25, credit score=441, and real estate web site visit count=6), and the advertiser specifications are subscriptions (e.g., 20≦age≦35 and 400≦credit score≦500 and real estate count≧3). However, unlike traditional pub/sub systems, it is not desired to retrieve all the subscriptions (ads) that correspond to a given event (user), because only a small number of ads can be shown on the web page. Rather, it is desired to retrieve the “best” subscriptions based on some criteria such as the most targeted ads, the most profitable ads, the most underserved ads, etc.
Online job sites provide another good example. Such sites generally allow job seekers to register profiles, and job posters to specify job seeker profiles in which they are interested. For instance, a job seeker may register a profile for nursing jobs that pay $50/hour and require 25-hours/week; and a job poster may express an interest in nurses who are willing to work between 20 and 30 hours/week for $45-60/hour. Thus, when a job seeker visits the site, she can be presented with jobs that match her profile. This can again be modeled as a pub/sub problem, where the events are job seekers (e.g., job type=nursing, hourly rate=$50 and hours/week=25) and the subscriptions are job poster interests (e.g., job type=nursing, 45≦hourly rate≦60, and 20≦hours/week≦30). However, as in the targeted advertising case, it is likely that all the jobs that match a user profile cannot be shown because of the web page's limited real estate. Therefore, it is again desired to retrieve only the best jobs for a given user based on criteria such as the monetary value to the job poster, fairness of exposure across job postings, etc.
Throughout this disclosure, subscriptions correspond to interval ranges (e.g., age in [25, 35] and salary>$50,000), and are hereafter referred to as such. In addition, each interval has a score, and the goal is to quickly recover the top-scoring matching subscriptions. Unfortunately, adapting existing index structures to solve this problem results in either an unacceptable space overhead or significant performance degradation, and thus new index structures are needed.
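By way of illustration only, the modeling of subscriptions as interval ranges may be sketched as follows. The attribute names, the dictionary representation, and the function name matches are assumptions made for this sketch and do not appear in the disclosure; an open-ended condition such as "at least three visits" is modeled as a range with an infinite upper bound.

```python
import math


def matches(event, subscription):
    """Return True if every constrained attribute of the event lies in its range."""
    return all(
        attr in event and lo <= event[attr] <= hi
        for attr, (lo, hi) in subscription.items()
    )


# Hypothetical attribute names modeled on the mortgage advertising example.
mortgage_ad = {
    "age": (20, 35),
    "credit_score": (400, 500),
    "real_estate_visits": (3, math.inf),
}
user = {"age": 25, "credit_score": 441, "real_estate_visits": 6}

print(matches(user, mortgage_ad))
```

This shows only the binary notion of matching; selecting the best-scoring matches is the subject of the remainder of the description.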
As is known in the art, there are many existing interval index structures, including the R-tree, which are designed to support interval stabbing queries (i.e., queries that return the set of all intervals that are stabbed by a given query point). However, it is an object of the present invention to support top-k interval stabbing queries (i.e., queries that return the top-k scoring intervals that are stabbed by a query point), and such existing index structures are either time- or space-inefficient for this type of application.
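The difference between the two query types can be made concrete with a brute-force baseline. The sketch below is not an index structure and is not the method of the invention; it simply scans every interval, which is the post-processing approach the disclosure seeks to avoid. The names ScoredInterval, stab, and top_k_stab are assumptions introduced for illustration, and intervals are assumed to be closed.

```python
import heapq
from typing import NamedTuple


class ScoredInterval(NamedTuple):
    left: float
    right: float
    score: float


def stab(intervals, q):
    """Plain stabbing query: every interval that contains the point q."""
    return [iv for iv in intervals if iv.left <= q <= iv.right]


def top_k_stab(intervals, q, k):
    """Top-k stabbing query: the k highest-scoring intervals that contain q."""
    return heapq.nlargest(k, stab(intervals, q), key=lambda iv: iv.score)


ads = [
    ScoredInterval(20, 35, 0.9),
    ScoredInterval(30, 40, 0.5),
    ScoredInterval(10, 25, 0.7),
    ScoredInterval(50, 60, 1.0),
]
print(top_k_stab(ads, 22, 1))
```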
Given the goal of producing the top-k matching subscriptions (as opposed to returning all matching subscriptions and then performing some post-processing to get the top-k results), the main technical challenge is devising efficient scored interval indices. Existing interval index structures such as interval trees, segment trees and (1-dimensional) R-trees are not directly applicable to the problem because they do not produce results in score order, though they can be adapted to produce such results, as described in related U.S. Ser. No. 11/932,928.
In fact, the present invention may be implemented as a particular R-tree, which relies on an intelligent pre-processing of the underlying scored interval set before indexing it. Before describing the present invention, some context regarding the prior art is provided. Generally, the input used for the remainder of this disclosure comprises a collection of n intervals Γ, where each interval I_i ∈ Γ is a pair of left/right endpoints (I_i = [x_i^l, x_i^r], i = 1, . . . , n).
Conventionally, R-trees have been used for indexing hyperrectangles in order to efficiently search for all rectangles that overlap with a query rectangle. In a single dimension, intervals “overlap” a query point q if and only if they are stabbed by q. Hence, R-trees can be used to solve the problem at hand. Generally, an R-tree groups intervals into partitions of size ≦ b, where b is the branching factor. Various heuristics can be used for grouping intervals, including minimizing the size of the bounding interval for a group, minimizing bounding interval overlap between groups, grouping intervals by their start or end points, etc.
Each group of intervals is stored in a leaf node of the R-tree, and the leaf node is associated with an extent interval which is the minimum bounding interval of the intervals in the leaf node. For example, suppose [l_i^g, r_i^g], i = 1, . . . , b, are the intervals in a leaf node g; then I_g = [l_g, r_g], where l_g = min_i l_i^g and r_g = max_i r_i^g, is the minimum bounding interval. The R-tree is constructed recursively on these minimum bounding intervals, and a child pointer is added from the entry corresponding to interval I_g to the leaf node g. In order to answer a stabbing query q, child pointers may be continually chased (starting from the root node) as long as q is in the extent interval of each intermediate node. When a leaf node is reached, the set of intervals that contain q is returned.
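A minimal sketch of such a one-dimensional R-tree follows, offered under stated assumptions rather than as a definitive implementation. It assumes closed intervals given as (left, right) pairs, uses grouping by left endpoint as the grouping heuristic (one of several mentioned above), and builds the tree bottom-up in bulk. The names Node, make_node, build_rtree, and stab are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    left: float      # left end of the extent (minimum bounding) interval
    right: float     # right end of the extent interval
    entries: list    # (left, right) intervals in a leaf, child Nodes otherwise
    is_leaf: bool


def make_node(entries, is_leaf):
    """Wrap up to b entries in a node whose extent is their minimum bounding interval."""
    if is_leaf:
        bounds = entries
    else:
        bounds = [(child.left, child.right) for child in entries]
    return Node(min(l for l, _ in bounds), max(r for _, r in bounds), entries, is_leaf)


def build_rtree(intervals, b):
    """Bulk-build a 1-D R-tree with branching factor b (grouping by left endpoint)."""
    if b < 2 or not intervals:
        raise ValueError("need b >= 2 and at least one interval")
    items = sorted(intervals)
    level = [make_node(items[i:i + b], True) for i in range(0, len(items), b)]
    while len(level) > 1:
        level = [make_node(level[i:i + b], False) for i in range(0, len(level), b)]
    return level[0]


def stab(node, q):
    """Return all intervals containing q, descending only into extents that contain q."""
    if not (node.left <= q <= node.right):
        return []
    if node.is_leaf:
        return [iv for iv in node.entries if iv[0] <= q <= iv[1]]
    hits = []
    for child in node.entries:
        hits.extend(stab(child, q))
    return hits


root = build_rtree([(5, 8), (1, 3), (10, 12), (2, 6), (7, 9)], b=2)
print(stab(root, 5.5))
```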
R-trees have the flexibility to group intervals together based on certain criteria, and in order to answer top-k stabbing queries, it is natural to group intervals by their scores so that the top scored intervals are grouped together, the next lower scored intervals are grouped together, and so on. In other words, a scored R-tree orders intervals in decreasing order of their scores and picks consecutive blocks of size b to form the leaf node groups. Recursively, if (g1, . . . ,gk) is the sequence of internal nodes at any level of the R-tree (in that order), then every interval in the subtree of g1 has a score at least as large as that of every interval in the subtree of g2, and so on for each subsequent pair of nodes. Starting from the root node of a scored R-tree, a top-k stabbing query q may be answered by, at each internal node, scanning each entry from left to right and recursing on its child node only if its extent interval contains the query point q. At a leaf node, the intervals are scanned from left to right and an interval is recorded if it is stabbed by q. Each recursive call returns when either all entries in the node have been processed or k intervals have been recorded.
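The scored R-tree and its query procedure described above may be sketched as follows. This is an illustrative sketch only: intervals are assumed to be closed (left, right, score) triples, the tree is bulk-built bottom-up, and the names SNode, build_scored_rtree, and top_k are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class SNode:
    left: float
    right: float
    entries: list    # (left, right, score) triples in a leaf, child SNodes otherwise
    is_leaf: bool


def _node(entries, is_leaf):
    if is_leaf:
        lo = min(e[0] for e in entries)
        hi = max(e[1] for e in entries)
    else:
        lo = min(e.left for e in entries)
        hi = max(e.right for e in entries)
    return SNode(lo, hi, entries, is_leaf)


def build_scored_rtree(intervals, b):
    """Sort by decreasing score, then cut consecutive blocks of size b at each level."""
    if b < 2 or not intervals:
        raise ValueError("need b >= 2 and at least one interval")
    ordered = sorted(intervals, key=lambda iv: iv[2], reverse=True)
    level = [_node(ordered[i:i + b], True) for i in range(0, len(ordered), b)]
    while len(level) > 1:
        level = [_node(level[i:i + b], False) for i in range(0, len(level), b)]
    return level[0]


def top_k(node, q, k, out=None):
    """Scan entries left to right; stop as soon as k stabbed intervals are recorded."""
    if out is None:
        out = []
    if len(out) >= k or not (node.left <= q <= node.right):
        return out
    for entry in node.entries:
        if len(out) >= k:
            break
        if node.is_leaf:
            if entry[0] <= q <= entry[1]:
                out.append(entry)
        else:
            top_k(entry, q, k, out)
    return out


tree = build_scored_rtree(
    [(0, 10, 5.0), (4, 6, 9.0), (20, 30, 8.0), (5, 15, 7.0), (1, 2, 6.0), (3, 8, 1.0)],
    b=2,
)
print(top_k(tree, 5, 2))
```

Because leaves and their entries are visited in decreasing score order, the recorded intervals come out in decreasing score order and the scan can stop early.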
As just discussed, the intervals in a scored R-tree are sorted by their scores, and the R-tree is built on top of these scored intervals. For many distributions, this approach will produce a large number of “holes,” leading to poor performance, but by rearranging the intervals in a certain manner, most holes can be avoided and query times reduced.
Such an approach to building the scored R-tree is a principle of the present invention, which stems from the following insight. Suppose that I1 and I2 are intervals to be indexed. Suppose further that the score of I1 is greater than the score of I2, and that no interval has a score between the score of I1 and the score of I2. If I1 and I2 intersect, then any R-tree indexing them must place I1 before I2. However, if I1 and I2 do not intersect, they are free to be placed in either order, since no query point can stab both intervals (i.e., their relative ordering is immaterial).
To build a scored R-tree that takes into account the property just described, a constraint graph may be defined for the intervals, which captures the allowable arrangements of intervals. Given an interval set and a constraint graph, the optimal arrangement for a scored R-Tree may be found.
To understand the concept of the constraint graph, consider the set Γ of n input intervals, each with an associated score, and let G̃(Γ) be the directed graph (V,Ẽ), where V and Ẽ are as follows: the set V consists of n nodes, one for each interval I ∈ Γ. The node associated with I is referred to by node(I). An edge is included in Ẽ from node(I1) to node(I2) if and only if I1 ∩ I2 ≠ ∅ and score(I1)>score(I2). This approach is further illustrated by
In another embodiment, and in an effort to avoid some extraneous “transitive” edges, a couple of other steps may be taken when constructing the constraint graph. First, graph G=(V,E) may be defined to have the same vertex set as G̃. Second, E may be defined as follows. If I1,I2 ∈ Γ with score(I1)>score(I2), then E contains an edge from node(I1) to node(I2) if and only if (a) I1 ∩ I2 ≠ ∅; and (b) there exists a point q ∈ I1 ∩ I2 such that, for all I ∈ Γ with score(I1)>score(I)>score(I2), the point q ∉ I. It will be appreciated that such a graph contains only a subset of the edges in G̃, and that if there is an edge from node(I1) to node(I2) in Ẽ, then there is a path from node(I1) to node(I2) in E.
It can thus be said that an arrangement of the scored intervals in Γ respects G(Γ) if for all scored intervals I1,I2 ∈ Γ such that there is an edge from node(I1) to node(I2), the scored interval I1 comes before I2 in the arrangement. Because edges in G̃(Γ) always map to paths in G(Γ), an arrangement respects G(Γ) if and only if it respects G̃(Γ).
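The two graph definitions and the notion of an arrangement respecting a graph can be checked directly on small inputs. The brute-force sketch below is for illustration only and is not the construction method of the disclosure. It assumes closed intervals over the real line given as (left, right, score) triples, identifies nodes by list index, and uses the hypothetical names full_graph, reduced_graph, has_uncovered_point, and respects.

```python
def intersects(i1, i2):
    return max(i1[0], i2[0]) <= min(i1[1], i2[1])


def full_graph(intervals):
    """Edge u -> v iff the intervals intersect and score(u) > score(v)."""
    return {
        (u, v)
        for u, i1 in enumerate(intervals)
        for v, i2 in enumerate(intervals)
        if i1[2] > i2[2] and intersects(i1, i2)
    }


def has_uncovered_point(a, b, blockers):
    """True if some real point of [a, b] lies in none of the closed blockers."""
    reach = None  # right end of the covered prefix of [a, b], once one exists
    for l, r in sorted(blockers):
        if r < a or l > b:
            continue  # this blocker misses [a, b] entirely
        if reach is None:
            if l > a:
                return True  # the point a itself is uncovered
            reach = r
        elif l > reach:
            return True  # gap strictly between reach and l
        else:
            reach = max(reach, r)
        if reach >= b:
            return False
    return True


def reduced_graph(intervals):
    """Keep an edge only if condition (b) holds: some common point is not
    covered by any interval whose score lies strictly in between."""
    edges = set()
    for u, v in full_graph(intervals):
        i1, i2 = intervals[u], intervals[v]
        a, b = max(i1[0], i2[0]), min(i1[1], i2[1])
        between = [(l, r) for l, r, s in intervals if i2[2] < s < i1[2]]
        if has_uncovered_point(a, b, between):
            edges.add((u, v))
    return edges


def respects(order, edges):
    """True if every edge's source appears before its target in the arrangement."""
    pos = {idx: p for p, idx in enumerate(order)}
    return all(pos[u] < pos[v] for u, v in edges)


demo = [(0, 10, 3.0), (2, 8, 2.0), (4, 6, 1.0), (20, 30, 2.5)]
print(sorted(full_graph(demo)), sorted(reduced_graph(demo)))
```

In this example the edge from interval 0 to interval 2 is dropped as transitive, and the isolated interval 3 may be placed anywhere in a respecting arrangement.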
In another embodiment, the construction of the constraint graph may make use of an additional concept, “visible blocks,” which is explained below. Given a subset K ⊆ Γ of scored intervals, let an endpoint p be visible with respect to K if (a) there is some interval I ∈ K for which p is an endpoint; and (b) there is no other interval J ∈ K with score(I)>score(J) and p ∈ J. In an effort to better explain the concept of visible blocks, it may be helpful to consider again the example intervals shown in
The endpoints that are visible with respect to K break the real line into intervals, and these intervals are the “visible blocks,” hereinafter referred to as visBlks(K), wherein the set visBlks(∅) contains only the interval (−∞,∞). For each block B ∈ visBlks(K), an interval I ∈ K is said to be associated with B if I is the lowest-scoring interval in K such that B ⊆ I.
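The definitions of visible endpoints, visible blocks, and associated intervals can be illustrated with a direct, quadratic-time computation. This is a sketch under assumptions: intervals are closed (left, right, score) triples, a block is represented by the pair of consecutive visible endpoints that bound it, and the names visible_endpoints, vis_blocks, and associated are introduced here for illustration.

```python
import math


def visible_endpoints(K):
    """Endpoints of intervals in K not covered by any lower-scoring interval in K."""
    pts = set()
    for l, r, s in K:
        for p in (l, r):
            hidden = any(s2 < s and l2 <= p <= r2 for l2, r2, s2 in K)
            if not hidden:
                pts.add(p)
    return sorted(pts)


def vis_blocks(K):
    """Blocks of the real line cut out by the visible endpoints, as (lo, hi) pairs."""
    pts = sorted(set(visible_endpoints(K)) | {-math.inf, math.inf})
    return list(zip(pts, pts[1:]))


def associated(block, K):
    """Lowest-scoring interval in K that contains the whole block, or None."""
    lo, hi = block
    covering = [iv for iv in K if iv[0] <= lo and hi <= iv[1]]
    return min(covering, key=lambda iv: iv[2]) if covering else None


whole_line = (-math.inf, math.inf, math.inf)
K = [whole_line, (0, 10, 3.0), (2, 8, 2.0), (1, 5, 1.0)]
print(vis_blocks(K))
print(associated((5, 8), K))
```

In this example the endpoint 2 is hidden because the lower-scoring interval [1, 5] covers it, so the blocks are bounded by 0, 1, 5, 8, and 10.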
Referring again to
For convenience, assume that Γ, the set of scored intervals, contains the interval (−∞,∞) with score ∞, so that every visible block will have an associated interval. At block 500, the intervals in Γ are sorted in decreasing order of their scores, say I1,I2, . . . . At block 510, K and the constraint graph G(Γ) are initialized: K←{I1}, and G(Γ) gets a node for each interval, with no edges yet between them. For each interval Ii other than I1 (block 520), it is determined whether there are blocks left to process among the visible blocks in visBlks(Ki-1) that intersect Ii, as illustrated at block 530, where Ki-1 denotes the set K after the first i−1 intervals have been added. If an unprocessed block B remains, I is defined to be the interval associated with B, as shown at block 540. Once the association between block B and I has been made, an edge is added to the constraint graph G(Γ) from node(I) to node(Ii), as illustrated at block 550. After this edge has been added, control returns to block 530, which checks whether there are any more blocks B to process; if so, blocks 540 and 550 are again invoked; if not, block 560 is reached and Ii is added to K. Control then returns to block 520, which determines whether there are intervals left to process and, if so, cedes control to block 530, which carries on as described above. Once all of the intervals in Γ have been processed (block 520), the constraint graph is returned, as illustrated at block 570.
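The procedure of blocks 500 through 570 may be sketched as follows, by way of example and not limitation. The sketch recomputes the visible endpoints from scratch for each interval, which is simple but not efficient. It assumes closed intervals with finite endpoints and distinct scores, and it treats each visible endpoint as a degenerate block of its own so that intervals touching at a single point are handled; that boundary convention is an assumption, as are the function names.

```python
import math


def visible_endpoints(gamma, K):
    """Endpoints of intervals indexed by K not covered by a lower-scoring member."""
    pts = set()
    for i in K:
        l, r, s = gamma[i]
        for p in (l, r):
            if not any(gamma[j][2] < s and gamma[j][0] <= p <= gamma[j][1] for j in K):
                pts.add(p)
    return sorted(pts)


def associated(gamma, K, lo, hi):
    """Index of the lowest-scoring interval in K containing [lo, hi]."""
    covering = [j for j in K if gamma[j][0] <= lo and hi <= gamma[j][1]]
    return min(covering, key=lambda j: gamma[j][2])  # index 0 always covers


def build_constraint_graph(intervals):
    """Return (gamma, edges); gamma[0] is the added (-inf, inf) interval."""
    whole_line = (-math.inf, math.inf, math.inf)
    # Block 500: sort by decreasing score (the infinite-score interval comes first).
    gamma = [whole_line] + sorted(intervals, key=lambda iv: iv[2], reverse=True)
    # Block 510: K starts with the first interval; the graph has no edges yet.
    K, edges = [0], set()
    for i in range(1, len(gamma)):                       # block 520
        x, y, _ = gamma[i]
        pts = visible_endpoints(gamma, K)
        for lo, hi in zip(pts, pts[1:]):                 # block 530
            if lo < y and x < hi:                        # open block meets [x, y]
                edges.add((associated(gamma, K, lo, hi), i))   # blocks 540, 550
        for p in pts:                                    # degenerate point blocks
            if x <= p <= y:
                edges.add((associated(gamma, K, p, p), i))
        K.append(i)                                      # block 560
    return gamma, edges                                  # block 570


gamma, edges = build_constraint_graph(
    [(0, 10, 3.0), (2, 8, 2.0), (2, 3, 1.0), (20, 30, 2.5)]
)
print(sorted(edges))
```

Edges are reported as pairs of indices into the sorted list gamma. In the example, no edge is produced from [0, 10] to [2, 3], because [2, 8] covers every point they share.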
In an embodiment, the set of visible endpoints with respect to K, sorted by value, may be maintained during construction of the constraint graph (using, for example, a tree). Given interval Ii, let x be its left endpoint and y its right endpoint. To maintain the list of visible endpoints when interval Ii is added to K (block 560), x and y are inserted and all previously visible endpoints that lie between x and y are removed.
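The endpoint maintenance step described above may be sketched with a sorted list standing in for the tree; the choice of a list with binary search, the function name add_interval, and the inclusive treatment of endpoints equal to x or y are assumptions of this sketch.

```python
from bisect import bisect_left, bisect_right


def add_interval(visible, x, y):
    """Update the sorted list of visible endpoints, in place, when [x, y] joins K.

    Previously visible endpoints lying in [x, y] are removed, and x and y
    themselves become visible.
    """
    lo = bisect_left(visible, x)    # first endpoint >= x
    hi = bisect_right(visible, y)   # one past the last endpoint <= y
    visible[lo:hi] = [x, y] if x != y else [x]


visible = [float("-inf"), 0, 2, 8, 10, float("inf")]
add_interval(visible, 1, 5)
print(visible)
```

A balanced tree would avoid the linear-time slice replacement that a Python list incurs; the list is used here only to keep the example short.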
Once a constraint graph has been generated, intervals can be grouped together in terms of their spatial proximity by exploiting the partial-ordering constraints specified in the constraint graph.
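The disclosure does not spell out a particular grouping procedure at this point, so the following is only one hypothetical heuristic and is not taken from the disclosure: a topological ordering of the constraint graph in which, among the intervals whose predecessors have all been placed, the one with the smallest left endpoint is placed next. The function name arrange and the tie-breaking rule are assumptions; intervals are (left, right, score) triples and edges are pairs of list indices.

```python
import heapq


def arrange(intervals, edges):
    """Return an ordering of interval indices that respects every edge (u, v),
    preferring the ready interval with the smallest left endpoint."""
    n = len(intervals)
    indegree = [0] * n
    successors = [[] for _ in range(n)]
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = [(intervals[i][0], i) for i in range(n) if indegree[i] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, u = heapq.heappop(ready)
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                heapq.heappush(ready, (intervals[v][0], v))
    if len(order) != n:
        raise ValueError("constraint graph contains a cycle")
    return order


demo = [(0, 10, 3.0), (2, 8, 2.0), (4, 6, 1.0), (20, 30, 2.5)]
print(arrange(demo, {(0, 1), (1, 2)}))
```

In this example a pure score ordering would interleave the distant interval [20, 30] between [0, 10] and [2, 8], whereas the ordering above keeps the three overlapping intervals adjacent while still respecting the graph.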
The sequence and numbering of blocks depicted in
Several features and aspects of the present invention have been illustrated and described in detail with reference to particular embodiments by way of example only, and not by way of limitation. Those of skill in the art will appreciate that alternative implementations and various modifications to the disclosed embodiments are within the scope and contemplation of the present disclosure. Therefore, it is intended that the invention be considered as limited only by the scope of the appended claims.
This application is related to previously-filed U.S. patent application Ser. No. 11/932,928, filed Oct. 31, 2007, entitled SYSTEM AND/OR METHOD FOR PROCESSING EVENTS.