The present invention is directed towards automatic management of networked publisher-subscriber relationships used in online advertising, based on validity and reachability characteristics.
The marketing of products and services over the internet through advertisements is big business. Advertising over the internet seeks to reach individuals within a target set having very specific target predicates (e.g. male, age 40-48, graduate of Stanford, living in California or New York, etc). This targeting of very specific demographics is in significant contrast to print and television advertisements, which are generally capable only of reaching an audience within some broad, general demographics (e.g. living in the vicinity of Los Angeles, or living in the vicinity of New York City, etc).
Advertisers have long relied on advertising agents to manage the advertiser's campaigns, including reach and spend. Moreover an agent may itself use other agents, and any agent may place orders with ad networks, and an ad network may participate with others via an advertising exchange. In the context of internet advertising where an advertiser seeks to manage advertising spend, the task of the agent (or agents) can become very complex very quickly, possibly involving tens, hundreds, even thousands of entities (e.g. web publishers, other agents, advertising networks, etc) interconnected via relationships (e.g. business relationships, delivery contract terms, etc).
Thus, a solution for efficiently matching an advertiser's target demographics to a highly specific event raised by an Internet publisher is needed. In an exemplary advertising exchange, an advertiser may have relationships with multiple agencies, and an agency may have relationships with multiple publishers. Similar to the case of other commercial exchanges, the operation of the advertising exchange seeks to correlate sellers with buyers, even in the case that a seller and/or buyer is represented by an intermediary such as an agent. Thus a networked advertising exchange seeks to correlate relationships between buyers (e.g. advertisers), sellers (e.g. publishers), and intermediaries (e.g. agents).
Other automated features and advantages of the present invention will be apparent from the accompanying drawings and from the detailed description that follows below.
Systems, methods and techniques are disclosed for automatic management of networked publisher-subscriber relationships in an advertising server network. The method comprises steps for constructing a directed graph representation comprising at least one publisher node (e.g. an Internet property), at least one subscriber node (e.g. an Internet advertiser), at least one intermediary node (e.g. an Internet advertising agent), and at least one edge (e.g. an advertising target predicate) wherein any one of the edges is directly associated with at least one target predicate. The directed graph representation is used in conjunction with an inverted index for retrieving a valid node list comprising only nodes having at least one target predicate that matches at least one event predicate. The event predicate (as well as any target predicate) is any arbitrarily complex Boolean expression, and is used in retrieving and producing a result node list comprising only nodes that concurrently match the event predicate with an advertising target predicate and are reachable. Systems may include techniques for skipping certain retrievals such that the process for producing the result node list does not evaluate a valid node from the valid node list when the valid node is unreachable. Techniques are provided for labeling nodes of the directed graph representation, including labeling of graphs that contain cyclic subgraphs (e.g. using a two-part labeling scheme for condensed directed graph representations).
The novel features of the invention are set forth in the appended claims. However, for purpose of explanation, several embodiments of the invention are set forth in the following figures.
A networked advertising exchange seeks to correlate relationships between buyers (e.g. subscribers), sellers (e.g. publishers), and agents (e.g. intermediaries). In the context of Internet advertising, such relationships, inter-relationships, reciprocal relationships, etc. may be complex. In order to aid in the management of such relationships, an advertising exchange connects publishers to advertisers through advertising networks. Advertising networks enable publishers to reach a wider set of advertisers. Every time a publisher web page is visited, an advertising opportunity arises. At that time, an event from the publisher is generated indicating the event predicates for the opportunity. Such event predicates can include information about the page (such as the page content and its main topics), information about the available advertising slots (number of ads in the page and their maximum dimensions in pixels), and information about the user (such as user attributes and geographic location). Also, each ad network and advertiser in the system may specify target attributes, constraining the types of opportunities they are interested in. For instance, an ad network may be interested only in traffic from sports and finance pages with users older than 30.
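By way of illustration only, the ad network example above can be sketched in Python. The field names (topics, user_age, user_geo) and the function sports_finance_over_30 are hypothetical and are not part of the disclosure; the sketch also assumes that "sports and finance pages" means pages carrying either topic.

```python
from dataclasses import dataclass, field


@dataclass
class Opportunity:
    """Event predicates for one impression opportunity (hypothetical fields)."""
    topics: set = field(default_factory=set)  # main topics of the page
    user_age: int = 0                         # a user attribute
    user_geo: str = ""                        # geographic location of the user


def sports_finance_over_30(event: Opportunity) -> bool:
    """Target constraint of the example ad network (assumed reading:
    the page has a sports OR finance topic, AND the user is older than 30)."""
    return bool(event.topics & {"sports", "finance"}) and event.user_age > 30


# Two hypothetical opportunities: only the first satisfies the constraint.
visit = Opportunity(topics={"sports", "news"}, user_age=42, user_geo="CA")
young = Opportunity(topics={"sports"}, user_age=25, user_geo="NY")
matches = [sports_finance_over_30(visit), sports_finance_over_30(young)]
print(matches)  # [True, False]
```

In this sketch a target constraint is an ordinary Boolean function of the event, which is one simple way to accommodate arbitrarily complex Boolean expressions.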
Within the context of systems for online advertising, an advertiser seeks to present the advertiser's advertisement or message within content such as an online publication (e.g. Yahoo Autos) that is relevant to a particular internet user. For example, a manufacturer of hybrid motor vehicles (e.g. Ford) might establish an advertising campaign that attempts to place the manufacturer's advertisement on the same page as a Yahoo.com/autos search results page resulting from a search using the keyword “hybrid”. Matching an advertisement to a page to be presented to a particular internet user is facilitated by a network of publishers (e.g. Yahoo!) coordinated with a network of subscribers (e.g. advertisers and/or their brokers). Various relationships within such a network of networks may be represented by a graph, where each node on a graph is either a publisher (e.g. an Internet publisher such as Yahoo!), or an advertiser (e.g. a company such as Ford), or an intermediary (e.g. a broker such as Satchi & Satchi), and where a node is connected to another node via an edge indicating a relationship (e.g. a business relationship, a contract, a revenue sharing agreement, a payment promissory, etc). The occurrence of an opportunity to present to a particular user an advertisement or message on a publisher's page (i.e. an impression opportunity) may be considered an impression opportunity event. At the occurrence of such an impression opportunity event, any/all of the advertisers or intermediaries might wish to be notified of the existence of the event. In some cases, an advertiser might be selective, and wish to be notified of the existence of an event only under certain circumstances (e.g. the internet user is in the age group 24-25 and the internet user has a credit rating within some range).
The single appearance of an advertisement on a web page is known as an online advertisement impression. Each request for a web page by a user via the internet represents an impression opportunity to display an advertisement in some portion of the web page (e.g. a “slot” or “spot”) to the individual internet user. Often, there may be significant competition among advertisers for a particular impression opportunity, i.e. to be the one to provide that advertisement impression to the individual internet user.
To participate in this competition, some advertisers define one or more campaigns, including a subscription (i.e. authorization) to bid on certain impression opportunities (e.g. authorization to bid in an auction) in the hope of winning the competition. An advertiser may specify desired targeting criteria (e.g. target predicates) in the subscription definition, which targeting criteria may include a keyword, multiple keywords, key phrases, or other targeting criteria. For example, an advertiser or agent (i.e. subscriber) may wish to present advertising messages to users who visit a particular web page from a particular publisher (e.g. Yahoo! Sports).
In modern internet advertising systems, competition for showing an advertiser's message in an impression is often resolved by an auction, and the winning bidder's advertisement(s) and/or message(s) are shown in the available spaces within the impression. Indeed online advertising and marketing campaigns often rely, at least partially, on an auction process where any number of subscribers book contracts to authorize highest bids corresponding to targeting characteristics (e.g. a search keyword, a set of keywords, bid phrases, or various target predicates). Considering that (1) the actual existence of a web page impression opportunity event suited for displaying an advertisement is not known until the user clicks on a link pointing to the subject web page, (2) the entire auction/bidding process for selecting advertisements corresponding to notified/winning subscribers must complete before the web page is actually displayed, and (3) there may be many subscribers to a particular property/demographic, it then becomes clear that the identification of subscribers (and notification as to the event) should be carried out automatically.
Therefore, multiple competing advertisers might elect to bid in a market via an exchange auction engine server 107 in order to win the most prominent spot, or an advertiser might enter into a contract (e.g. with the internet property, or with an advertising agency, or with an advertising network, etc) to purchase the desired spots for some time duration (e.g. all top spots in all impressions of the web page empirestate.com/hotels for all of 2010). Such an arrangement, and variants as used herein, is termed a contract.
In embodiments of the system 100, components of the additional content server perform processing such that, given an advertisement opportunity (e.g. an impression opportunity profile predicate), processing determines which (if any) contract(s) match the advertisement opportunity. In some embodiments, the system 100 might host a variety of modules to serve management and control operations (e.g. objective optimization module 110, forecasting module 111, data gathering and statistics module 112, storage of advertisements module 113, automated bidding management module 114, admission control and pricing module 115, campaign generation module 116, a publisher-subscriber relationship module 117, etc) pertinent to contract matching and delivery methods. In particular, the modules, network links, algorithms, and data structures embodied within the system 100 might be specialized so as to perform a particular function or group of functions reliably while observing capacity and performance requirements. For example, an additional content server 108, possibly in conjunction with a publisher-subscriber relationship module 117 might be employed to perform automatic management of networked publisher-subscriber relationships within an advertising exchange having buyers, sellers, and agents.
Agencies as discussed herein include real companies with real people making decisions and taking action on behalf of the agency's clients. Agencies can enter into business deals with other entities. Using the techniques described herein, an agency's business deals (i.e. contracts) can be represented as data items to be shared among the entities involved in a given transaction. Further, agencies seek and establish contracts with other entities on the advertising exchange. As used within the context of the embodiments of the invention herein, these contracts allow agencies to act as a proxy on behalf of their customers. Embodiments of the invention herein provide for representing an agency as an entity on the advertising exchange, and thus, as an entity-on-exchange, the agency may participate with the advertising exchange (i.e. perform transactions through or with other advertising exchange seat-holders).
Other embodiments provide for agencies to perform regular publishing and subscribing activities on or through the advertising exchange within the limits of permissions granted to the agency specifically for the purpose of performing such activities.
So, with the above definitions, and for the purposes of understanding the disclosure herein, an ad delivery transaction on the advertising exchange can be represented on a directed graph such as is shown in
Now, for any ad delivery transaction, there may be zero, one, or more hops, and as introduced above, each hop has a buyer and a seller and may also involve an intermediary (e.g. an agency). Accordingly, a hop represents a transactional relationship between a buyer and a seller, even if not the original buyer and original seller. Such relationships may include a link, and possibly also a deal. Collectively these relationships may be represented on/in the directed graph representations.
Agencies are entities on the advertising exchange that perform activities on behalf of their customers. These activities include actions to:
As are described in exemplary embodiments, an agency may operate as a reseller, under which model an agency gets billed by its supplier(s), and in turn bills its customers for delivery. In the reverse sense of a reseller, an agency gets paid by its customer, and in turn pays its supplier. Such transactions may be recorded at each occurrence of an ad delivery, and may be summarized in a periodic statement, which statement may include detailed information of any number of transactions, or groups of transactions, or invoices.
Also, as are described in further exemplary embodiments, an agency may operate as a pure agency, under which model an agency does not get billed by its supplier(s); instead the pure agency's clients transact directly with the supplier. In this scenario, the pure agency receives remuneration via an agency fee (e.g. broker fee).
In various cases, the agency fee is processed as a separate transaction. Also, in various cases, including both agency as reseller and also agency as pure agency, revenue sharing may be processed as a separate transaction.
Agencies may want to cooperate with other agencies and may wish to establish interrelationships with them or, more generally, may wish to establish interrelationships and/or engage in transactions with other entities (i.e. beyond just agencies), and may thus wish to become seat-holders on an advertising exchange.
An advertising exchange can be formed comprising any group of entities involved in the trading/matching of advertising placement opportunities, and advertising to fill such placement opportunities. Inasmuch as an agency performs actions on behalf of other entities on the exchange, various instruments are used in the provision of agency services. For example, agency-contracts, or links:
In some aspects, the relationship between a publisher and an advertiser or intermediaries is akin to the relationship between a print media publisher and a print media subscriber, where the subscriber wishes only to receive certain specific publications from the publisher (e.g. only the Sunday morning edition of the publisher's daily newspaper). Systems exhibiting such publisher/subscriber relationships may be termed publisher-subscriber systems.
Disclosed herein are a new class of publisher-subscriber systems (termed networked publisher-subscriber systems) and techniques for automatic management of networked publisher-subscriber relationships. In the embodiments disclosed herein, publishers and subscribers are connected through a network of intermediary nodes in a computer-readable graph.
Now, applying the concepts of a publisher-subscriber system, the advertising exchange is responsible for notifying all subscribers to a particular type of opportunity event of the existence of a particular opportunity event instance of the subscribed-to type. A valid subscriber includes advertisers for which there is a contract (or other description) of a willingness to bid on a given ad opportunity (e.g. an ad opportunity with an event predicate matching contractual target predicates or other specifications). Moreover, a “valid” advertiser must be “reachable” via at least one valid path from the publisher that originated the opportunity (i.e. a direct relationship as shown in
Of course, a publisher-subscriber relationship module 117 might implement algorithms for efficient query evaluation that work for any directed graph network. As the number of nodes within a publisher-subscriber system increases, and as the specificity of the relationship (e.g. target predicates) of the subscriber to the publisher increases, operators of publisher-subscriber systems seek techniques to efficiently match event predicates to a set of subscribers that are interested (by virtue of their corresponding target predicates) in these event predicates. In general, when an event is generated, an efficient publisher-subscriber system might quickly identify all matches.
Referring again to
In one embodiment, one or more internet advertising networks connect publishers to advertisers (possibly through an advertising exchange clearinghouse 255). For example, and as shown in
Further describing the computer-readable graph of
An advertising impression opportunity arises at such a time when a publisher's web page is visited (see web page visit event 420) by an internet user 418. Using the systems and methods described herein, at that time, a publisher (or proxy for a publisher) may construct an event predicate message (see operation 421). As shown, an event from the publisher is generated (see event predicate message 422) indicating the event predicates for the opportunity. Such event predicates can include information about the page (such as the page content and its main topics), information about the available advertising slots (number of ads in the page and their maximum dimensions in pixels), and information about the user (such as user demographics and geographic location). The event predicate message 422 may be formatted for receiving the event predicate message at a server (e.g. content server). As previously described, each advertiser in the system may specify target predicates constraining the types of opportunities in which they are interested (and which attributes may be carried by any one or more advertising network nodes). For instance, an ad network may specialize only in trading in traffic related to sports and finance pages with users older than 30 (as is the case for Intermediary1 in
The advertising exchange is then responsible for notifying all valid advertisers for the given ad opportunity. Subscribers may then be notified (see message 428). Valid advertisers have at least one valid path from the publisher that originated the opportunity, meaning that the path exists and that each node in the path satisfies its targeting constraints. In the example of
Continuing the discussion of
Of course the described protocol is only one example of uses of an index, and a graph representation in conjunction with the algorithms. The notions herein described are also useful in other contexts, in particular for implementing a networked publisher-subscriber system in a social network.
In social networks, users are connected to each other forming a connection graph (similar to the aforementioned directed graph). Consider a situation where every user subscribes to and produces a stream of “interesting tidbits”. Such tidbits could be events (say music shows, theater shows, etc), news, books of interest, and so on. A user can choose to incorporate in their tidbits a collection of tidbits produced by other users in the network, but with some restrictions. For instance, a user may be only interested in tidbits related to theater shows. The operation of a networked publisher-subscriber system in this context needs to add to the user's collection all the tidbits that have a valid path from the tidbit publisher to the user and that satisfy the user's interest restrictions. The “status update” feature in Facebook can be viewed as a simplified version of the tidbit idea. In such a Facebook example, the status updates are delivered only to the immediate ‘friends’ of a user (i.e. only to users that are one hop away from the publishing user); users have limited control over which updates are determined as being in their interest and who should receive their updates. Using other social networking models such as Twitter, intermediate services can act as content dissemination nodes accumulating and redistributing tidbits (e.g. tweets) to interested subscribers.
Now, returning to disclosure of automatic management of networked publisher-subscriber relationships and applying the concepts of a publisher-subscriber system, the advertising exchange is responsible for notifying all reachable subscribers of the existence of a matching opportunity. One possible solution for this problem is to merely identify all subscribers for a particular event, and then to post-filter the results, discarding subscribers that do not have valid paths leading to them. This solution can be greatly improved by keeping track of node reachability while using an index to evaluate the target predicates. Given that the target predicates may include hundreds or thousands or more specific attributes to be evaluated, the computing complexity increases quickly as the number of subscribers to an event increases, thus a solution for efficiently matching a subscriber to a highly specific event (one specific event from among many millions of similar events) is needed.
One such solution uses an index structure that efficiently evaluates the target predicates, returning only subscribers to the event that satisfy the following:
A targeted interest, where the subscriber has a contract that matches the opportunity, and
Reachability, where there is at least one valid path from the publisher to the subscriber (possibly direct, or possibly involving one or more intermediaries).
In other words, in the setting of an advertising network exchange, a candidate subscriber is only a true subscriber if the subscriber has indeed expressed an interest in delivering an advertisement to the specific targeted opportunity, and also, the candidate subscriber has established some mechanism (e.g. contract with the publisher or a contract with one or more intermediaries) for data exchange pertaining to the specific targeted opportunity.
To verify reachability, the algorithms disclosed below use efficient access to the graph structure. In some cases, the graph can be stored in main memory. It is also possible in some cases to keep track of two sets of nodes during query evaluation. Specifically, the two sets of nodes are:
Reachable nodes, which are the nodes that are reachable from the publisher through at least one valid path, and
Valid nodes, which are the nodes for which their target predicates satisfy at least one given event predicate.
Some embodiments use an “online” breadth-first search (BFS) from the publisher node to compute the reachable set using the nodes returned by the index as input. Every node returned by the index is valid with respect to its target predicates and, therefore, it is part of the valid set (by definition). Certain aspects of efficiency rely on the fact that the nodes that should be returned as valid and reachable subscribers are the nodes in the intersection of the reachable node set and the valid node set, i.e. the valid nodes that have at least one valid path leading to them.
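The relationship between the two sets can be illustrated with a minimal Python sketch. It is an offline simplification, not the "online" search itself: the valid set is assumed to have been retrieved from the index in full beforehand, and the function name, the adjacency mapping and the example graph are all hypothetical.

```python
from collections import deque


def reachable_and_valid(children, start, valid):
    """Breadth-first search from the publisher `start` that only expands
    through valid nodes; returns the valid nodes having a valid path."""
    result = set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            # A child is reported (and expanded) only if it is itself valid.
            if child in valid and child not in seen:
                seen.add(child)
                result.add(child)
                queue.append(child)
    return result


# Hypothetical network: node 0 is the publisher.
children = {0: [1, 2], 1: [3], 2: [4], 3: [], 4: [5]}
valid = {1, 3, 4, 5}          # nodes whose target predicates match the event
print(reachable_and_valid(children, 0, valid))  # {1, 3}
```

Nodes 4 and 5 are valid but are reachable only through node 2, which is not valid, so they are excluded from the result.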
In exemplary embodiments, the structure of the graph is known a priori and the known structure of the graph can be exploited to speed up evaluation by skipping over nodes that are unreachable (see Algorithm 1, Algorithm 2, Algorithm 3, Algorithm 4).
Now further describing the embodiment of
Algorithms for Evaluation of Valid and Reachable Subscribers using Graph Representations of the Network
The paragraphs presented below formalize the problem mathematically, introduce algorithms for use on directed acyclic graphs (DAGs), and further develop algorithms for use on any input graph, acyclic or not. For directed acyclic graphs, a topological sort order of the graph helps to decide which nodes are unreachable (see Generalized Query Evaluation Algorithm for DAGs, presented below) without having to retrieve them from the index. In the case of general directed graphs with cycles (i.e. containing at least one cyclic subgraph), a condensation of the graph is formed by mapping each strongly connected component (SCC) into a single condensed node; the resulting condensed DAG is then used to avoid retrieving from the index nodes that belong to unreachable SCCs.
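The condensation step can be sketched as follows. The sketch uses Tarjan's strongly connected components algorithm, which is one well-known choice and is an assumption here (no particular SCC algorithm is named above); the function name condense and the example graph are hypothetical, and the two-part labeling scheme is not shown.

```python
def condense(nodes, children):
    """Map every strongly connected component to a single condensed node.

    Returns (comp_of, dag): comp_of maps node -> component ID, and dag maps
    component ID -> set of successor component IDs (always acyclic)."""
    index, low, comp_of = {}, {}, {}
    stack, on_stack = [], set()
    n_comps = 0

    def visit(v):  # recursive for brevity; deep graphs would need iteration
        nonlocal n_comps
        index[v] = low[v] = len(index)      # discovery order of v
        stack.append(v)
        on_stack.add(v)
        for w in children.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:              # v is the root of a component
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp_of[w] = n_comps
                if w == v:
                    break
            n_comps += 1

    for v in nodes:
        if v not in index:
            visit(v)

    # Keep only edges that cross component boundaries.
    dag = {c: set() for c in range(n_comps)}
    for v in nodes:
        for w in children.get(v, ()):
            if comp_of[v] != comp_of[w]:
                dag[comp_of[v]].add(comp_of[w])
    return comp_of, dag


# Hypothetical graph in which nodes 1 and 2 form a cycle.
comp_of, dag = condense([0, 1, 2, 3], {0: [1], 1: [2], 2: [1, 3], 3: []})
print(comp_of[1] == comp_of[2], len(dag))  # True 3
```

Because the condensed graph is acyclic, an unreachable condensed node implies that every original node mapped to it is unreachable as well.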
Herein is discussed the algorithm for the special case of DAGs, showing how the graph structure allows for evaluation speed-up using skipping in the index. Subsequent sections describe modifications to the algorithms for use on any directed graph.
The problem of query evaluation in networked publisher-subscriber systems consists of identifying the set of valid nodes in a network graph G, which are the subscribers to be notified for the event. Queries in this context are defined using two components:
1. A start node s, representing the publisher, and
2. A set Q of labels representing the event.
A network may be modeled by a directed graph G=(N,E), with each node n ∈ N having an associated set of labels Ln corresponding to its target predicates. With respect to a matching function match(Q,Ln), a directed path P is defined to be valid for Q if P is a path in G and the set of labels Ln associated to every node n in P is valid for Q. The output of the system is defined as the set of nodes in G reachable from s via valid paths for Q. In this formalization of the problem the target predicates are placed on nodes. If, in another formalism, the target predicates were placed over an edge, the target predicates could be, for instance, mapped onto its destination node.
Generalized Query Evaluation Algorithm for DAGs
The function match(Q,Ln) might be defined specifically for each application. For example, match(Q,Ln) could be defined with semantics as a “superset”, meaning that the set of labels Ln must be a superset of the labels in Q, which definition would represent AND queries as used in information retrieval systems. That is, every query label must be present in the qualifying documents. Alternatively, the function match(Q,Ln) might be defined with semantics as a “subset”, meaning that the target predicates specified for each node must be a subset of the event attributes (e.g. when a subscriber is interested in sports pages only and the event identifies a page as belonging to both the sports and news categories).
Consider the nodes and labels in Table 1. For query labels Q={A, B, C}, if the semantics is “superset”, only nodes 2 and 3 would be valid. On the other hand, if the semantics is “subset”, then only nodes 2, 5 and 6 would be considered valid.
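The two semantics can be illustrated with a short Python sketch. The label sets below are invented for illustration and are not the contents of Table 1; they were merely chosen so that the outcome agrees with the valid nodes stated above. The function names are likewise hypothetical.

```python
def match_superset(query, labels):
    """'superset' semantics: the node's labels contain every query label."""
    return labels >= query


def match_subset(query, labels):
    """'subset' semantics: every target label of the node is in the event."""
    return labels <= query


# Hypothetical label sets Ln, keyed by node ID (NOT the actual Table 1).
LABELS = {
    1: {"A", "D"},
    2: {"A", "B", "C"},
    3: {"A", "B", "C", "D"},
    4: {"D"},
    5: {"A"},
    6: {"B", "C"},
}
Q = {"A", "B", "C"}

valid_superset = [n for n in sorted(LABELS) if match_superset(Q, LABELS[n])]
valid_subset = [n for n in sorted(LABELS) if match_subset(Q, LABELS[n])]
print(valid_superset)  # [2, 3]
print(valid_subset)    # [2, 5, 6]
```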
For purposes of the development of the algorithms below, it is reasonable to abstract away the details of the match(Q,Ln) function, and instead assume that:
(a) Each node has a unique node id, and
(b) There is an underlying index that returns matching nodes in order of their IDs.
The index engine 520 implements a getNextEntity(Q,n) function call which returns the next matching node with node ID of at least n. Considering the example from Table 1, getNextEntity(Q,3) would return 5 when the match(•,•) semantics is defined as a subset.
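The contract of getNextEntity(Q,n) can be sketched with a toy in-memory stand-in for the index engine. The class SortedIndex, its linear scan and the label sets are assumptions made for illustration (a production inverted index would not scan every node); the label sets are hypothetical and are not taken from Table 1.

```python
from bisect import bisect_left


class SortedIndex:
    """Toy stand-in for an index engine: answers queries in node ID order."""

    def __init__(self, labels, match):
        self.labels = labels          # node ID -> set of labels Ln
        self.ids = sorted(labels)     # node IDs in ascending order
        self.match = match            # match(Q, Ln) -> bool

    def get_next_entity(self, query, n):
        """Return the smallest matching node ID >= n, or None if none is left."""
        for node_id in self.ids[bisect_left(self.ids, n):]:
            if self.match(query, self.labels[node_id]):
                return node_id
        return None


def match_subset(query, labels):
    """'subset' semantics: every target label of the node is in the event."""
    return labels <= query


# Hypothetical label sets, chosen so that nodes 2, 5 and 6 match under "subset".
LABELS = {1: {"A", "D"}, 2: {"A", "B", "C"}, 3: {"A", "B", "C", "D"},
          4: {"D"}, 5: {"A"}, 6: {"B", "C"}}
index = SortedIndex(LABELS, match_subset)
print(index.get_next_entity({"A", "B", "C"}, 3))  # 5
```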
Given such an index engine 520, one possible algorithm is to first retrieve all of the matching nodes, and then compute the subset reachable from s in the graph induced by them. In the following subsections are presented algorithms for the evaluator engine 510 of
Observe the following notation and the formalization of previously introduced concepts (in one special case, the graph G is a DAG):
Function toposort assigns node IDs in the order of a topological sort of G. This maintains the invariant that for any node n, its children v ∈ Cn come later in the node ID order. Function evaluate (see Algorithm 1) begins by adding the children of the start node s to the reachable set NR (line 1). It then retrieves the first valid node with node ID greater than s from the index (line 3). If the retrieved node n is already in the reachable set, then it is both reachable and valid and is added to the result set (line 5). Since its children are then also reachable, they are added to the reachable set (line 6). The search then resumes, using the index to retrieve the next valid node with node ID of at least n+1. At the end of processing, the nodes in the result set are returned (line 10).
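The steps just described can be sketched in Python as follows. This is a reading of the prose above rather than a reproduction of Algorithm 1: the helper make_index (a closure standing in for the index engine with the query already fixed), the function name evaluate_dag and the example DAG are hypothetical.

```python
from bisect import bisect_left


def make_index(valid_ids):
    """Hypothetical index stub: returns the next valid node ID >= n, or None."""
    ids = sorted(valid_ids)

    def get_next_entity(n):
        pos = bisect_left(ids, n)
        return ids[pos] if pos < len(ids) else None

    return get_next_entity


def evaluate_dag(children, s, get_next_entity):
    """Query evaluation on a DAG whose node IDs follow a topological sort."""
    reachable = set(children.get(s, ()))   # children of the start node
    result = set()
    n = get_next_entity(s + 1)             # first valid node after s
    while n is not None:
        if n in reachable:                 # valid (from index) and reachable
            result.add(n)
            reachable.update(children.get(n, ()))
        n = get_next_entity(n + 1)         # next valid node
    return result


# Hypothetical DAG with publisher 0; nodes 2 and 3 hang off invalid node 1.
children = {0: [1, 5], 1: [2, 3], 5: [6, 7], 6: [8]}
get_next = make_index([2, 3, 5, 6, 8])
print(evaluate_dag(children, 0, get_next))  # {8, 5, 6}
```

In this hypothetical graph, nodes 2 and 3 are retrieved and discarded because their only parent (node 1) is never returned by the index, whereas nodes 5, 6 and 8 are each reachable at the time they are retrieved.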
Table 2 shows the valid, reachable, and result sets after each valid node is returned by the index engine 520. When nodes 2 and 3 are returned by the index engine 520, they are simply discarded since they are not reachable. When node 5 is returned, it is known to be reachable, and therefore, is added to the result set along with its children. A similar scenario is shown for nodes 6 and 8.
The table shows the state of NV, NR and NR ∩ NV after each valid node is returned by the index.
To prove the algorithm's correctness, observe the following important invariant:
Invariant 1: For any node n, let Pn={v ∈ N,(v,n) ∈ E} denote the set of parents of n. Then for any n ∈ NR ∩ NV there exists one node v ∈ Pn such that v ∈ NR ∩ NV.
Proof: Assume the contrary, and let n be a node such that none of the nodes v ∈ Pn is present in the result set. Then n cannot be reached from s using only valid nodes, because none of its parents is both valid and reachable.
Theorem 1: The algorithm of Algorithm 1 is correct.
Proof By sorting the nodes in order of the topological sort, it is concluded that at the time node n is examined, all of its parents already have been examined by the algorithm. Node n can be added to the reachable set if and only if one of the nodes v ∈ Pn was added to the result set. Therefore, n is added to the result set only if one of its parents is valid and reachable.
It is possible to speed up the DAG algorithm further by skipping in the underlying index. The following two lemmas show how to skip to the minimum element in the reachable set that is at least as big as the current node ID returned by the index.
Lemma 1: Let m be the minimum node id in NR. Then no node with an id of less than m can ever be added to the result set.
Proof Consider a node k whose ID is less than m. Then when processing node k, it is known that it is not in the reachable set; therefore the reachable.contains(k) statement will fail.
Lemma 2: When processing node n, let m be the minimum id in NR that is at least as big as n. Then no node with an id of less than m can ever be added to the result set.
Proof: Suppose by contradiction that some node with an ID less than m should be added to the result set, and let k be such a node with the smallest ID. Clearly k must be a valid node; furthermore, one of its parents, v ∈ Pk, must be both valid and reachable. When v is processed, Cv is added to the reachable set. Therefore, since k ∈ Cv, it could not be skipped during the course of the algorithm.
The algorithm shown in Algorithm 2 (see below) implements the skipping for retrieval when G is a DAG. The changes from Algorithm 1 are in line 2, which sets the next node to be retrieved by the index to the minimum node ID in the reachable set, and in line 8, which asks the index to resume searching for valid nodes from the minimum node ID in the reachable set that is greater than n.
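The two changes can be sketched in Python as shown below. The sketch is an interpretation of the description above, not the listing of Algorithm 2: it keeps the reachable IDs in a min-heap (an implementation choice assumed here) so that the minimum reachable ID greater than n is cheap to find, and the helper make_index, the retrieval log and the example DAG are hypothetical.

```python
import heapq
from bisect import bisect_left


def make_index(valid_ids, log):
    """Hypothetical index stub that also records every node it returns."""
    ids = sorted(valid_ids)

    def get_next_entity(n):
        pos = bisect_left(ids, n)
        if pos == len(ids):
            return None
        log.append(ids[pos])
        return ids[pos]

    return get_next_entity


def evaluate_dag_skipping(children, s, get_next_entity):
    """DAG evaluation that only asks the index for IDs that may be reachable."""
    reachable = set(children.get(s, ()))
    heap = list(reachable)                 # min-heap over reachable node IDs
    heapq.heapify(heap)
    result = set()
    n = s
    while True:
        while heap and heap[0] <= n:       # drop IDs already passed
            heapq.heappop(heap)
        if not heap:                       # nothing reachable remains
            return result
        n = get_next_entity(heap[0])       # skip to the minimum reachable ID
        if n is None:
            return result
        if n in reachable:
            result.add(n)
            for c in children.get(n, ()):
                if c not in reachable:
                    reachable.add(c)
                    heapq.heappush(heap, c)


children = {0: [1, 5], 1: [2, 3], 5: [6, 7], 6: [8]}
retrieved = []
get_next = make_index([2, 3, 5, 6, 8], retrieved)
print(evaluate_dag_skipping(children, 0, get_next), retrieved)
# {8, 5, 6} [2, 5, 6, 8]
```

In this hypothetical run, valid node 3 is never retrieved from the index, because after node 2 is returned the smallest reachable ID greater than 2 is 5.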
Consider again the example from
A crucial invariant in the case of DAGs was that when processing a node n, all of its parents had already been processed, so it was already known whether n was reachable. This is not the case in general graphs that contain cycles, since no topological sort on the nodes exists (graphs with cycles contain mutually-referencing nodes). Therefore, in addition to maintaining the reachable set, a query evaluation algorithm for general graphs explicitly maintains the valid set NV, since when a node n is returned by the index, it is not yet known whether n is reachable. See Algorithm 3.
In this version of the algorithm, no assumption is made about the node ID assignments, and therefore all valid nodes from the index, starting from node ID 0 (line 2), must be retrieved. Once a node n is returned by the index, evaluate adds it to the valid set (line 4). It then checks if n is reachable (line 5). If n belongs to the reachable set, it is known to be both reachable and valid and the auxiliary function updatePath is used to update the status of n and its descendant nodes.
Function updatePath starts by adding n to the result set (line 1). Then it updates the status of n's children, since it is now known that they have at least one valid path leading to them through node n. This is done in lines 2-12. The status of a child node c is modified only if it is not already in the result set (line 4). This check guarantees that function updatePath is called exactly once for each node in the result set. If c already belongs to the valid set (i.e. c was already returned by the index), then it is known to be both valid and reachable. Thus, its status is updated through a recursive call to updatePath (line 6). If c does not belong to the valid set, it is simply added to the reachable set (line 9).
When nodes 1, 2, 5, and 6 are returned by the index engine, they are not in the reachable set, so they are added to the valid set. When the index engine returns node 8, which is reachable, it is added to the valid set and updatePath is called, which adds 8 to the result set and its children 0 and 1 to the reachable set. Since node 1 is already valid, updatePath is called recursively and node 1 is added to the result set as well.
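The interplay between evaluate and updatePath can be sketched as follows. This is a minimal illustration of the description above, not the listing of Algorithm 3: the names `evaluate_general`, `update_path`, `children` and `seeds` (the initially reachable nodes) are hypothetical, and the index is modeled as a plain iterable of valid node IDs in whatever order it returns them.

```python
def evaluate_general(valid_ids, children, seeds):
    """Return the valid and reachable nodes of a graph that may have cycles.

    valid_ids: valid node IDs in the order the index returns them.
    children:  dict mapping a node ID to the IDs of its children.
    seeds:     hypothetical set of initially reachable node IDs.
    """
    valid = set()            # NV: nodes already returned by the index
    reachable = set(seeds)   # NR: nodes with a valid path leading to them
    result = set()

    def update_path(n):
        result.add(n)
        for c in children.get(n, ()):
            if c in result:
                continue          # already handled; keeps one call per node
            if c in valid:
                update_path(c)    # c is valid and now known to be reachable
            else:
                reachable.add(c)  # remember c until the index returns it

    for n in valid_ids:
        valid.add(n)
        if n in reachable:
            update_path(n)
    return result
```

On a small made-up graph with edges 8→0, 8→1, 1→2, 2→1 and 5→6, where 8 is the only seed and the index returns 1, 2, 5, 6, 8 in that order, the call yields {1, 2, 8}: nodes 1 and 2 are resolved retroactively once 8 arrives, while 5 and 6 remain unreachable. The recursion mirrors the description; a deep graph would call for an explicit stack in Python.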
The table shows the state of NV, NR and NR ∩ NV after each valid node is returned by the index engine.
Lemma 3: The query evaluation algorithm returns node n in a result if and only if n is valid and reachable.
Proof For n to be added to the result set, it must be returned by the index and therefore valid. Furthermore, since only the children of result nodes are added to the set of reachable nodes NR, one of its parents was a result node, therefore n must be reachable as well.
To prove the converse, assume by contradiction that the lemma is false, and let V be the set of valid and reachable nodes that are not returned by the algorithm. There exists some node n ∈ V such that one of its parents v ∈ Pn is returned by the algorithm (otherwise none of the nodes in V could be reached from s). If v was added to the result set before n is processed, then n already appears in NR when it is processed and is therefore added to the result set. Otherwise, n is added to the valid set NV; when v is later added to the result set, n is marked reachable and added to the result set as well. Therefore no such n can exist.
In the case of DAGs, the numbering of the nodes allowed the algorithm to conclude that some of the valid nodes cannot be reachable, and thus to skip in the underlying index. At first glance, this is not possible in the case of general graphs: absent a full ordering on the nodes, a node cannot be skipped simply because it is not currently in the reachable set. To maintain the skipping property, first decompose the graph into strongly connected components (SCCs). Recall that contracting each SCC into a single node yields a graph (called the condensation of G) that is a DAG. It is therefore possible to combine the skipping aspect of the DAG algorithm (Algorithm 2) with the recursive evaluation component of the general algorithm (Algorithm 3) to enable skipping in the case of a general graph G.
As is readily understood by those skilled in the art, it is possible to decompose the graph (generalized graph G) into the SCCs, resulting in the condensation of generalized graph G, before building the index. In one embodiment, node IDs have two parts: (a) the SCC ID and (b) the ID of the node within the SCC. After decomposing the graph into SCCs, IDs are assigned to the nodes (including nodes that are SCCs) in topological sort order. Then, inside each SCC, IDs are assigned in arbitrary order.
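One way to produce such two-part IDs is sketched below. The choice of Tarjan's algorithm is an assumption made for illustration (the text does not name a decomposition method), as are the function name `assign_two_part_ids` and the convention of numbering both parts from 1. Tarjan's algorithm emits SCCs in reverse topological order of the condensation, so reversing its output gives SCC IDs in topological sort order.

```python
def assign_two_part_ids(nodes, children):
    """Map each node to (scc_id, local_id), SCC IDs in topological order."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []  # filled in reverse topological order of the condensation

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in children.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of an SCC
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(component)

    for v in nodes:
        if v not in index:
            visit(v)

    ids = {}
    for scc_id, component in enumerate(reversed(sccs), start=1):
        # Inside an SCC the order is arbitrary.
        for local_id, v in enumerate(component, start=1):
            ids[v] = (scc_id, local_id)
    return ids
```

For edges a→b, b→c, c→b and c→d, nodes b and c share one SCC ID, which lies strictly between the SCC IDs of a and d.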
The full algorithm for dealing with a directed graph with two-part node IDs is given in Algorithm 4. Note the use of variable reachableSCCs to store just the component IDs from the nodes in NR. The main changes from Algorithm 3 are in lines 6 and 16, where the step sets the variable skip to the minimum SCC ID in the reachable set. Also in line 16, the step makes sure the component is greater than the current component, denoted by scc. For simplicity, assume that setting skip to a given component comp will cause the index to return the next valid node with an ID greater than comp.0. Another change is to only add a node to the valid set if it belongs to a reachable component (line 8).
To reason about the skipping behavior, observe the following simple consequence of the labeling scheme.
Invariant 2: For any two nodes v, w ∈ N if there exists a path from v to w in G, then either v and w lie in the same SCC, or the SCC id of v is strictly smaller than the SCC id of w.
The invariant allows skipping unreachable SCCs in the general graph in the same manner as skipping unreachable nodes in DAGs (see Algorithm 2). To ensure correctness, Lemmas 4 and 5 below state the analogues of Lemmas 1 and 2.
Lemma 4: Let cm.nm be the minimum node id in NR. Then no node with an id of less than cm.0 can ever be added to the result set.
Lemma 5: When processing node c.n, let cm.nm be the minimum id in NR that is at least as big as c.n. Then no node with an id less than cm.0 can ever be added to the result set.
Table 4 shows a run of the evaluate algorithm with skipping enabled. The example is the same example as in Table 3 but the graph is annotated using the two-part node ID assignment scheme. The algorithm proceeds as before, keeping a set of valid and reachable nodes, as well as the reachable SCCs. When evaluating node c.n=2.1 it is noted that the minimum reachable SCC has index=4, therefore set skip to 4.0. This allows skipping over nodes 3.1 and 3.2, which would otherwise be retrieved by the index. Otherwise stated, the evaluate algorithm with skipping enabled includes skipping index retrievals based on the next minimum reachable condensed node. Another point is that although node 2.1 is valid, the algorithm does not add it to the valid set NV since at the point that it is processed it is already known that it is not reachable.
The example of processing using Algorithm 4 proceeds after the graph G is decomposed into strongly connected components (SCCs). The column labeled SCCs is the set of reachable SCCs. After processing node 2.1 the next reachable SCC is 4, therefore the algorithm sets skip to 4.0 and nodes 3.1 and 3.2 are skipped during the processing.
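The combination of SCC-level skipping and recursive evaluation can be sketched as follows. This is an illustration under assumptions, not the listing of Algorithm 4: two-part IDs are modeled as `(scc, local)` tuples with local IDs starting at 1, `SortedIndex` is a hypothetical stand-in for the index, `seeds` is a hypothetical set of initially reachable nodes, and the graph in the usage note is made up rather than taken from Table 4.

```python
import bisect
import heapq


class SortedIndex:
    """Hypothetical stand-in for the index over (scc, local) IDs."""

    def __init__(self, valid_ids):
        self._ids = sorted(valid_ids)
        self.retrieved = []  # every ID handed out, to make skipping visible

    def next_valid(self, at_least):
        i = bisect.bisect_left(self._ids, at_least)
        if i == len(self._ids):
            return None
        self.retrieved.append(self._ids[i])
        return self._ids[i]


def evaluate_with_scc_skipping(index, children, seeds):
    valid, reachable, result = set(), set(), set()
    reachable_sccs, scc_heap = set(), []

    def mark_reachable(node):
        reachable.add(node)
        if node[0] not in reachable_sccs:
            reachable_sccs.add(node[0])
            heapq.heappush(scc_heap, node[0])

    def update_path(node):
        result.add(node)
        for child in children.get(node, ()):
            if child in result:
                continue
            if child in valid:
                update_path(child)
            else:
                mark_reachable(child)

    for seed in seeds:
        mark_reachable(seed)

    pos = None  # last ID returned by the index
    while True:
        # Forget reachable SCCs that lie before the current position.
        while pos is not None and scc_heap and scc_heap[0] < pos[0]:
            heapq.heappop(scc_heap)
        if not scc_heap:
            break
        target = (scc_heap[0], 0)  # skip to the minimum reachable SCC
        if pos is not None and target <= pos:
            target = (pos[0], pos[1] + 1)  # keep scanning the current SCC
        node = index.next_valid(target)
        if node is None:
            break
        pos = node
        if node[0] in reachable_sccs:  # ignore nodes of unreachable SCCs
            valid.add(node)
            if node in reachable:
                update_path(node)
    return result
```

Take edges 1.1→4.2, 4.2→4.1, 4.1→4.2, 4.1→5.1, 2.1→3.1, 3.1→3.2 and 3.2→3.1, with 1.1 as the only seed and every node valid. After the index returns 2.1, whose SCC is unreachable, the sketch jumps to 4.0, so nodes 3.1 and 3.2 are never retrieved; node 4.1 is added to the result retroactively once 4.2 is processed.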
The algorithms herein use an index engine 520 for evaluating the targeting constraints, and rely on the graph engine 530 for checking node reachability. The inverted index 521 and the directed graph representation 531 might be built offline, possibly using an index constructor engine 580 and a graph constructor engine 570. Such data structures might be labeled as (a) currently available, and (b) currently under construction. Alternating retrievals between these two data structures implements a technique for handling updates in the system. The inverted index 521 and the directed graph representation 531 might be used by an index engine 520 and a graph engine 530 during query processing by an evaluator engine 510.
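The idea of alternating between a currently available structure and one under construction can be sketched as a simple two-slot holder. This is only an illustration of the general technique; the class name `DoubleBuffered` and its methods are hypothetical, and a real system would additionally need synchronization between the builder and concurrent queries.

```python
class DoubleBuffered:
    """Hold two slots: one served to queries, one being rebuilt offline."""

    def __init__(self, initial):
        self._slots = [initial, None]
        self._active = 0  # slot labeled "currently available"

    def current(self):
        """Return the structure that queries should read from."""
        return self._slots[self._active]

    def publish(self, rebuilt):
        """Install a freshly built structure and make it the active one."""
        standby = 1 - self._active  # slot labeled "under construction"
        self._slots[standby] = rebuilt
        self._active = standby
```

A query reads `current()` for its whole duration, while a constructor builds a replacement and calls `publish()` when it is complete.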
As shown in
The valid node list 523, reachable node list 533, and result node list 511 are query processing data structures that are reinitialized for each query. The directed graph representation 531 can be updated in-place. Each index structure handles updates in a manner dependent on the implemented data structure. Some inverted indexes, for instance, may use a “tail” index to contain the entities added or updated since the last index build.
In some cases, depending on the index structure used to evaluate targeting, it may be sub-optimal to enforce a topological sort order for node and SCC IDs in the presence of updates. In such an instance, the generic version of the algorithm (Algorithm 3), which does not make any assumption about node and SCC ID ordering, may be employed.
Any node of the network 1200 may comprise a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof capable of performing the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g. a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration, etc).
In alternative embodiments, a node may comprise a machine in the form of a virtual machine (VM), a virtual server, a virtual client, a virtual desktop, a virtual volume, a network router, a network switch, a network bridge, a personal digital assistant (PDA), a cellular telephone, a web appliance, or any machine capable of executing a sequence of instructions that specify actions to be taken by that machine. Any node of the network may communicate cooperatively with another node on the network. In some embodiments, any node of the network may communicate cooperatively with every other node of the network. Further, any node or group of nodes on the network may comprise one or more computer systems (e.g. a client computer system, a server computer system) and/or may comprise one or more embedded computer systems, a massively parallel computer system, and/or a cloud computer system.
The computer system 1250 includes a processor 1208 (e.g. a processor core, a microprocessor, a computing device, etc), a main memory 1210 and a static memory 1212, which communicate with each other via a bus 1214. The machine 1250 may further include a display unit 1216 that may comprise a touch-screen, or a liquid crystal display (LCD), or a light emitting diode (LED) display, or a cathode ray tube (CRT). As shown, the computer system 1250 also includes a human input/output (I/O) device 1218 (e.g. a keyboard, an alphanumeric keypad, etc), a pointing device 1220 (e.g. a mouse, a touch screen, etc), a drive unit 1222 (e.g. a disk drive unit, a CD/DVD drive, a tangible computer readable removable media drive, an SSD storage device, etc), a signal generation device 1228 (e.g. a speaker, an audio output, etc), and a network interface device 1230 (e.g. an Ethernet interface, a wired network interface, a wireless network interface, a propagated signal interface, etc).
The drive unit 1222 includes a machine-readable medium 1224 on which is stored a set of instructions (e.g. software, firmware, middleware, etc) 1226 embodying any one, or all, of the methodologies described above. The set of instructions 1226 is also shown to reside, completely or at least partially, within the main memory 1210 and/or within the processor 1208. The set of instructions 1226 may further be transmitted or received via the network interface device 1230 over the network bus 1214.
It is to be understood that embodiments of this invention may be used as, or to support, a set of instructions executed upon some form of processing core (such as the CPU of a computer) or otherwise implemented or realized upon or within a machine- or computer-readable medium. A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g. a computer). For example, a machine-readable medium includes read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical or acoustical or any other type of media suitable for storing information.