In commercial enterprises, a wide variety of business decisions need to be made on a regular basis. In an example of a store stocking a large collection of items, management needs to decide what to put on sale, how to design coupons, how to place merchandise on shelves in order to maximize the profit, etc. Analysis of past transaction data stored in data sets is a commonly used approach in order to improve the quality of such decisions. Transaction data is mined to obtain information that can be used in future decisions. However, the mining of data from these data sets has proved difficult. One method of mining data from data sets is through the use of association rules, which in general are rules used to discover interesting relations between variables in large data sets.
Association rules have been well studied for discovering regularities between items in relational data sets, for example in promotional pricing and product placements. There has also been recent interest in studying associations between entities in social networks. Such associations are useful in social media marketing. Prior work on association rules for social networks and resource description framework (RDF) knowledge bases resorts to mining conventional rules and Horn rules (as conjunctive binary predicates) over tuples with extracted attributes from social graphs. However, such conventional work does not exploit graph patterns.
There is a need for efficiently and accurately identifying graph pattern association rules (GPARs) in social media marketing, community structure analysis, social recommendation, knowledge extraction and link prediction. Such rules, however, depart from association rules for item sets, and introduce several challenges. These challenges include: (1) conventional support and confidence metrics no longer work for GPARs; (2) mining algorithms for traditional rules and frequent graph patterns cannot be used to discover practical diversified GPARs; and (3) a major application of GPARs is to identify potential customers in social graphs. This is costly, in that graph pattern matching by subgraph isomorphism is intractable. Worse still, real-life social graphs are often big, e.g., Facebook has 13.1 billion nodes and 1 trillion links.
In one embodiment, the present technology relates to a method of identifying graph pattern association rules (GPARs) having a confidence above a predetermined threshold in a social network, the graph including a plurality of designated nodes and a plurality of association edges between the designated nodes, comprising: identifying a first data element that corresponds to a first node of interest; identifying at least a second data element that is a common data element to the first node of interest and to a second node of interest; identifying a first subgraph including the first node of interest and a second subgraph including the second node of interest, wherein the first and second subgraphs include the at least second data element identifying relationships among the first and the second nodes of interest, wherein the first and second subgraphs specify conditions as topological constraints represented by one or more edges formed from the relationships among the first and second nodes of interest; determining one or more graph pattern association rules (GPARs) for the first and second subgraphs; and using the first and second nodes of interest and the GPARs to identify one or more consequents among the second node of interest and the data elements.
In another embodiment, the present technology relates to a method of parallel mining of a set M of graph pattern association rules in a graph of a social network, the graph including a plurality of nodes and a plurality of association edges between nodes, the method comprising: dividing the graph into a plurality of fragments F; using a plurality of processors comprising a coordinator processor and a plurality of worker processors, processing each fragment F in parallel in each of the plurality of worker processors to identify candidate graph pattern association rules for the set M, a candidate graph pattern association rule, R(x, y), being defined as Q(x, y)⇒q(x, y), where Q(x, y) is a graph pattern in which x and y are two designated nodes, and q(x, y) is an edge labeled q from x to y, on which the same search conditions as in Q are imposed; verifying candidate graph pattern association rules as having at least a predefined confidence threshold; and transmitting the verified candidate graph pattern association rules to the coordinator processor to update the set M.
In a further embodiment, the present technology relates to a system for identifying entities in a graph of a social network, the graph including a plurality of nodes and a plurality of association edges between nodes, graph pattern association rules, R(x, y), being defined for the graph, R(x, y) being defined as Q(x, y)⇒q(x, y), where Q(x, y) is a graph pattern in which x and y are two designated nodes, and q(x, y) is an edge labeled q from x to y, on which the same search conditions as in Q are imposed, the system comprising: a plurality of processors, the plurality of processors comprising a coordinator processor and a plurality of worker processors, the plurality of processors configured to: divide the graph into a plurality of fragments Fi; process each fragment Fi in parallel in each of the plurality of worker processors Si to identify local matches in Fi; assemble the local matches in Fi from the plurality of worker processors Si into a match set; process each fragment Fi in parallel in each of the plurality of worker processors Si to determine a confidence value, conf(R, G), for each of the plurality of graph pattern association rules, where the confidence value defines how likely q(x, y) holds when x and y satisfy the constraints of Q(x, y) for each local fragment Fi; remove local matches from the match set where the local matches have a graph pattern association rule with a confidence value less than a predefined threshold; and output the graph pattern association rules and matches of the graph pattern association rules that are not removed in said step of removing local matches from the match set where the local matches have a graph pattern association rule with a confidence value less than a predefined threshold.
In a further embodiment, the present technology relates to a non-transitory computer-readable medium storing computer instructions for parallel mining of a set M of graph pattern association rules in a graph of a social network, the graph including a plurality of nodes and a plurality of association edges between nodes, that when executed by one or more processors, cause the one or more processors to perform the steps of: identifying a first data element that corresponds to a first node of interest; identifying at least a second data element that is a common data element to the first node of interest and to a second node of interest; identifying a first subgraph including the first node of interest and a second subgraph including the second node of interest, wherein the first and second subgraphs include the at least second data element identifying relationships among the first and the second nodes of interest, wherein the first and second subgraphs specify conditions as topological constraints represented by one or more edges formed from the relationships among the first and second nodes of interest; determining one or more graph pattern association rules (GPARs) for the first and second subgraphs; and using the first and second nodes of interest and the GPARs to identify one or more consequents among the second node of interest and the data elements, wherein the one or more consequents include a consequent between the second node of interest and the first data element.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the Background.
The present technology will now be explained with reference to the figures, which in general relate to graph pattern association rules (GPARs) used, for example, in social media marketing. GPARs differ from conventional rules for item sets in both syntax and semantics. A GPAR defines its antecedent as a graph pattern, which specifies associations between entities in a social graph, and explores social links, influence and recommendations. It enforces conditions via both value bindings and topological constraints by subgraph isomorphism.
Graph patterns in general may be graphical mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices, or nodes, which are connected by edges. Stated another way, a graph is an ordered pair G=(V, E) comprising a set V of vertices or nodes together with a set E of edges between the nodes.
The first node P1 and/or the second node P2 are connected to nodes D1-D5 by edges. Nodes D1-D5 are data elements describing some object, feature, state or place of interest to P1 and/or P2. For example, the data elements can represent physical locations, such as a nation, city, region, and so forth. The data elements can represent stores, products, or brands, and so forth. The data elements can represent a location lived in or visited by the corresponding person of the node of interest. The data elements can be used to determine common preferences, experiences, travels, visits, and so forth between the persons represented by the nodes of interest. As a consequence, comparison of various subgraphs can be used to determine and predict future actions by persons represented in a graph such as a social network. In this example, the first node of interest P1 is connected to data elements D1-D4, while the second node of interest P2 is connected to data elements D1-D2 and D4-D5. Thus, as a consequence, comparison of the subgraphs of nodes P1 and P2 can be used to determine and predict future actions by P1 and/or P2.
Referring again to
In this example, the first node of interest P1 includes a relationship/edge with a first data element D3. The first node of interest P1 further includes relationships/edges with second data elements D1-D2 and D4. In this example, the second node of interest P2 does not include a relationship/edge with the first data element D3. The second node of interest P2 shares common relationships/edges with the second data elements D1-D2 and D4. The second node of interest P2 in this example further includes a relationship/edge with a third data element D5 that is not in common with the first node of interest P1.
Using GPARs as explained below, a consequent can be determined, with the consequent in this example including a relationship being inferred or predicted between the second node of interest P2 and the first data element D3. This is shown by a dashed line in
In step 305, GPARs are determined for the two or more subgraphs. GPARs are explained below, but in general operate to identify relationships between nodes of interest and data items inferred from other nodes of interest and the data items. In step 306, using the GPARs determined in step 305, the consequent relationship between the second node of interest and the second data element is identified.
Topological support and confidence metrics are defined for GPARs as explained below. Support is defined in terms of distinct “potential customers,” and a confidence metric is defined for GPARs to incorporate a local closed world assumption. This enables the present technology to cope with incomplete social graphs, and to identify interesting GPARs with correlated antecedent and consequent. Generally, in logic systems, the consequent is the second half of a hypothetical proposition while the antecedent precedes and may be the cause of the consequent.
In accordance with the present technology, a graph is defined as G=(V, E, L), where (1) V is a finite set of nodes; (2) E⊂V×V is a set of edges, in which (υ, υ′) denotes an edge from node υ to υ′; (3) each node υ in V carries L(υ), indicating its label or content as found in social networks and property graphs. Each edge e also carries L(e), indicating its label or content as found in social networks and property graphs.
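By way of a non-limiting illustration, the graph G=(V, E, L) defined above may be held in memory as two label maps, one for nodes and one for edges. The following Python sketch is an assumption made for illustration only; the class name LabeledGraph, its methods, and the sample node names are hypothetical and are not part of the algorithms described herein.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """G = (V, E, L): nodes, directed edges, and a label on each of them."""
    node_labels: dict = field(default_factory=dict)  # node id -> L(v)
    edge_labels: dict = field(default_factory=dict)  # (v, v') -> L(e)

    def add_node(self, v, label):
        self.node_labels[v] = label

    def add_edge(self, v, w, label):
        # E is a subset of V x V, so both endpoints must already be nodes.
        if v not in self.node_labels or w not in self.node_labels:
            raise KeyError("both endpoints must be nodes of the graph")
        self.edge_labels[(v, w)] = label

    def size(self):
        # Number of nodes plus number of edges.
        return len(self.node_labels) + len(self.edge_labels)


# Hypothetical two-node graph: a customer who lives in a city.
g = LabeledGraph()
g.add_node("cust1", "cust")
g.add_node("paris", "city")
g.add_edge("cust1", "paris", "lives in")
```

Keying edges by the ordered pair (v, v′) mirrors the definition of E as a set of node pairs, so at most one labeled edge is kept per ordered pair in this sketch.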
A pattern query is a graph (Vp, Ep, ƒ, C), in which Vp and Ep are the set of pattern nodes and edges, respectively. Each node up in Vp has a label ƒ(up) specifying a search condition, e.g., city. Each edge ep in Ep also has a label ƒ(ep) specifying a search condition, e.g., lives in, likes, etc. For succinct representation, a node up can be labeled with an integer C(up)=k, indicating k copies of up with the same label and associated links in the common neighborhood.
Graph pattern matching may be accomplished using two definitions of subgraphs. (1) A graph G′=(V′, E′, L′) is a subgraph of G=(V, E, L), denoted by G′⊂G, if V′⊂V, E′⊂E, and moreover, for each edge eεE′, L′ (e)=L(e), and for each υεV′, L′ (υ)=L(υ). (2) G′ is a subgraph induced by a set V′ of nodes if G′⊂G and E′ consists of all those edges in G whose endpoints are both in V′.
Subgraph isomorphism may be adopted for pattern matching. A match of pattern Q in graph G is a bijective function h from the nodes of Q to the nodes of a subgraph G′ of G such that (a) for each node uεVp, ƒ(u)=L(h(u)), and (b) (u, u′) is an edge in Q if and only if (h(u), h(u′)) is an edge in G′, and ƒ(u, u′)=L(h(u), h(u′)). It can be said that G′ matches Q.
The set of all matches of Q in G may be denoted by Q(G). For each pattern node u, Q(u, G) may be used to denote the set of all matches of u in Q(G), i.e., Q(u, G) consists of nodes υ in G such that there exists a function h under which a subgraph G′εQ(G) is isomorphic to Q, υεG′ and h(u)=υ.
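The matching semantics above can be sketched with a brute-force enumeration. The sketch below is illustrative only and assumes that G′ is taken to be the image of the pattern, so a match is an injective, label-preserving mapping under which every pattern edge lands on an equally labeled edge of G. The function names matches and node_matches and the sample data are hypothetical, and the exhaustive search is practical only for very small graphs.

```python
from itertools import permutations


def matches(q_nodes, q_edges, g_nodes, g_edges):
    """Enumerate mappings h from pattern nodes to distinct graph nodes.

    q_nodes / g_nodes: dict node -> label.
    q_edges / g_edges: dict (u, v) -> label.
    """
    pattern = list(q_nodes)
    found = []
    for image in permutations(g_nodes, len(pattern)):
        h = dict(zip(pattern, image))
        # (a) node labels must agree
        if any(q_nodes[u] != g_nodes[h[u]] for u in pattern):
            continue
        # (b) every pattern edge must map to a graph edge with the same label
        if all(g_edges.get((h[u], h[v])) == lab for (u, v), lab in q_edges.items()):
            found.append(h)
    return found


def node_matches(u, q_nodes, q_edges, g_nodes, g_edges):
    """Q(u, G): distinct graph nodes that pattern node u is mapped to."""
    return {h[u] for h in matches(q_nodes, q_edges, g_nodes, g_edges)}


# Hypothetical data: pattern x -like-> y over two customers and two restaurants.
q_nodes = {"x": "cust", "y": "restaurant"}
q_edges = {("x", "y"): "like"}
g_nodes = {"c1": "cust", "c2": "cust", "r1": "restaurant", "r2": "restaurant"}
g_edges = {("c1", "r1"): "like", ("c1", "r2"): "like", ("c2", "r1"): "like"}

all_matches = matches(q_nodes, q_edges, g_nodes, g_edges)
x_matches = node_matches("x", q_nodes, q_edges, g_nodes, g_edges)
```

In this toy graph the pattern has three matches but only two distinct matches of x, which is the distinction the text draws between Q(G) and Q(u, G).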
The antecedent of the rule can be represented as a graph pattern Q1 (with solid edges) shown in
As opposed to rules for item sets, association rules for social graphs may target social groups with multiple entities. For example,
The association rule shown by the social graph of
Association rules with graph patterns conveniently extend data dependencies such as conditional functional dependencies (CFDs) in the context of social networks.
Applications of association rules are not limited to marketing activities. They also help detect scams.
For pattern Q5 of
A pattern Q′=(V′p, E′p, ƒ′, C′) is said to be subsumed by another pattern Q=(Vp, Ep, ƒ, C), denoted by Q′⊑Q, if (V′p, E′p) is a subgraph of (Vp, Ep), and functions ƒ′ and C′ are restrictions of ƒ and C to V′p, respectively. If Q′⊑Q, then for any graph G′ that matches Q, there exists a subgraph G″ of G′ such that G″ matches Q′.
The following notations may be used. (1) For a pattern Q and a node x in Q, the radius of Q at x, denoted by r(Q, x), is the longest distance from x to all nodes in Q when Q is treated as an undirected graph. (2) Pattern Q is connected if for each pair of nodes in Q, there exists an undirected path in Q between them. (3) For a node υx in a graph G and a positive integer r, Nr(υx) denotes the set of all nodes in G within radius r of υx. (4) The size |G| of G is |V|+|E|, the number of nodes and edges in G. (5) Node υ′ is a descendant of υ if there is a directed path from υ to υ′ in G.
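Notations (1) and (3) above both reduce to a breadth-first search over the undirected view of the graph. The following sketch is illustrative; it assumes that "within radius r" in item (3) uses the same undirected distance as item (1), and the helper names are hypothetical.

```python
from collections import deque


def undirected_distances(edges, source):
    """Breadth-first distances from source, ignoring edge direction."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def radius(pattern_edges, x):
    """r(Q, x): longest undirected distance from x to a node of a connected Q."""
    return max(undirected_distances(pattern_edges, x).values())


def neighborhood(graph_edges, v, r):
    """N_r(v): all nodes within undirected distance r of v."""
    return {w for w, d in undirected_distances(graph_edges, v).items() if d <= r}


# Hypothetical pattern: w -> x -> y -> z
edges = [("x", "y"), ("y", "z"), ("w", "x")]
```

For the sample edges, z is two hops from x, so the radius at x is 2, while the 1-neighborhood of x contains x, y and w.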
Using the above framework, graph pattern association rules, or GPARs, may be defined. A GPAR R(x, y) is defined as Q(x, y)⇒q(x, y), where Q(x, y) is a graph pattern in which x and y are two designated nodes, and q(x, y) is an edge labeled q from x to y, on which the same search conditions as in Q are imposed. Q and q are referred to as the antecedent and consequent of R, respectively.
A rule may be formulated that for all nodes υx and υy in a (social) graph G, if there exists a match hεQ(G) such that h(x)=υx and h(y)=υy (i.e., υx and υy match the designated nodes x and y in Q, respectively), then the consequent q(υx, υy) will likely hold. Intuitively, υx is a potential customer of υy. R(x, y) may be modeled as a graph pattern PR, by extending Q with a (dotted) edge q(x, y). Pattern PR may be referred to as R when it is clear from the context. q(x, y) may be treated as pattern Pq, and q(x, G) as the set of matches of x in G by Pq. Practical and nontrivial GPARs may be considered by requiring that (1) PR is connected; (2) Q is nonempty, i.e., it has at least one edge; and (3) q(x, y) does not appear in Q.
The association rule described above with respect to
The association rule described above with respect to
In embodiments, the consequent of GPAR may be defined with a single predicate q(x, y). Conditional functional dependencies can also be represented by GPARs (see Q3 of
Support and confidence may further be defined for GPARs. The support of a graph pattern Q in a graph G, denoted by supp(Q, G), indicates how often Q is applicable. As with association rules for item sets, the support measure should be anti-monotonic, i.e., for patterns Q and Q′, if Q′⊑Q, then in any graph G, supp(Q′, G)≧supp(Q, G).
One might define supp(Q, G) as the number ∥Q(G)∥ of matches of Q in G. However, this conventional notion is not anti-monotonic. For example, consider pattern Q′ with a single node labeled cust, and Q with a single edge like (cust, French restaurant). When posed on G1, ∥Q(G)∥=18>∥Q′(G)∥=6 (since French restaurant3 denotes 3 nodes labeled French restaurant), although Q′⊑Q.
To cope with this, support of the designated node x of Q may be defined as ∥Q(x, G)∥, i.e., the number of distinct matches of x in Q(G). The support of Q in G may be defined as
supp(Q,G)=∥Q(x,G)∥ (1)
One can verify that this support measure is anti-monotonic. For a GPAR R(x, y): Q(x, y)⇒q(x, y), supp(R, G) may be defined as:
supp(R,G)=∥PR(x,G)∥ (2)
by treating R as pattern PR(x, y) with designated nodes x, y.
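The difference between counting matches and counting distinct matches of the designated node x can be checked on a small hypothetical graph. The sketch below assumes six customers who each like the same three French restaurants, which reproduces the counts 18 and 6 quoted above; the actual contents of G1 are not assumed, and all names are illustrative.

```python
# Hypothetical graph: 6 customers, each liking the same 3 French restaurants.
customers = [f"cust{i}" for i in range(1, 7)]
restaurants = [f"french{j}" for j in range(1, 4)]
like_edges = [(c, r) for c in customers for r in restaurants]

# Raw match counts: Q' is the single node "cust"; Q is the single edge
# like(cust, French restaurant). Q' is subsumed by Q, yet Q has more matches.
matches_q_prime = len(customers)
matches_q = len(like_edges)

# Support per equation (1): distinct matches of the designated node x.
supp_q_prime = len(set(customers))
supp_q = len({c for c, _ in like_edges})
```

With raw match counts the larger pattern Q appears more frequent than its sub-pattern Q′, whereas counting distinct matches of x gives supp(Q′, G)≧supp(Q, G) as anti-monotonicity requires.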
Referring again to
Referring now to confidence, confidence may be used to find how likely q(x, y) holds when x and y satisfy the constraints of Q(x, y). The confidence of R(x, y) in G may be denoted as conf(R, G). In general, confidence is based in part on the number of pattern matching isomorphic subgraph association edges for the two or more designated nodes, where more pattern matching isomorphic subgraph association edges correlate to a higher confidence level. In embodiments, following conventional association rules, confidence of a GPAR may be defined as:
conf(R,G)=supp(R,G)/supp(Q,G)
That is, every match of x in Q but not in R is considered a negative example for R. However, this standard confidence is blind to the distinction between “negative” and “unknown”. This is particularly problematic when G is incomplete.
Referring back to pattern Q2 in
The closed world assumption may not hold for social networks. To distinguish “unknown” cases from true negative for GPAR mining in incomplete social networks, the local closed world assumption may be adopted, as commonly used in mining incomplete knowledge bases. The following notations may be used for local closed world assumption (LCWA), given a predicate q(x, y).
(1) supp(q, G)=∥Pq(x, G)∥, the number of matches of x;
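(2) supp(q̄, G), the number of nodes u in G that satisfy the search condition of x in Pq and have at least one edge labeled q, but are not in Pq(x, G); and
(3) supp(Qq̄, G), the number of nodes that match x in Q and are counted in supp(q̄, G).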
Given an (incomplete) social network G and a predicate q(x, y), the local closed world assumption (LCWA) distinguishes the following three cases for a node u.
(1) “positive” case, if uεPq(x, G);
(2) “negative” case, for every u counted in supp(q̄, G); and
(3) “unknown” case, for every u that satisfies the search condition of x but has no edge labeled as q.
That is, G is assumed “locally complete”. Therefore, G either gives all correct local information of u in connection with predicate q, or knows nothing about q at node u (hence unknown cases).
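The three cases can be sketched as a simple classification over the candidate nodes for x. This is an illustrative sketch only. It assumes that a "negative" node is one that has at least one edge labeled q but none leading to a node satisfying the condition on y; the function name classify_lcwa, its parameters, and the sample data are hypothetical.

```python
def classify_lcwa(candidates, q_targets, satisfies_y):
    """Split candidates for x into positive / negative / unknown cases.

    candidates: nodes satisfying the search condition of x.
    q_targets: dict node -> set of nodes reached via an edge labeled q.
    satisfies_y: predicate telling whether a target satisfies the condition on y.
    """
    cases = {"positive": set(), "negative": set(), "unknown": set()}
    for u in candidates:
        targets = q_targets.get(u, set())
        if not targets:
            # No edge labeled q at all: G knows nothing about q at u.
            cases["unknown"].add(u)
        elif any(satisfies_y(t) for t in targets):
            # u matches x in the pattern P_q.
            cases["positive"].add(u)
        else:
            # u has q edges, but none to a node satisfying the condition on y.
            cases["negative"].add(u)
    return cases


# Hypothetical data for q = like(x, French restaurant).
likes = {"c1": {"french1"}, "c2": {"asian1"}}
cases = classify_lcwa(["c1", "c2", "c3"], likes, lambda t: t.startswith("french"))
```

Here c3 has no like edge at all and is therefore treated as unknown rather than as a counterexample.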
Based on LCWA, conf(R, G) may be defined by revising the Bayes Factor (BF) of association rules as described for example in S. Lallich, O. Teytaud, and E. Prudhomme, “Association rule interestingness: Measure and statistical validation,” In Quality measures in data mining, pages 251-275. 2007. This may be done as:
conf(R,G)=(supp(R,G)*supp(q̄,G))/(supp(Qq̄,G)*supp(q,G))
Intuitively, conf(R, G) measures the product of completeness and discriminant. A GPAR R(x, y) has a better completeness if, for more matches of x identified in Q(x, y), there are also matches of x in R(x, y), and is more discriminant if the matches of x in Q(x, y) are less likely to also be counted in supp(Qq̄, G).
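As a hedged numerical sketch, the revised confidence can be computed from four support counts. The block assumes that the revised Bayes Factor takes the form supp(R, G)*supp(q̄, G)/(supp(Qq̄, G)*supp(q, G)), where q̄ denotes the negative cases under LCWA; the function name, the parameter names, and the sample counts are hypothetical.

```python
import math


def revised_confidence(supp_r, supp_q, supp_q_bar, supp_qq_bar):
    """Assumed form: supp(R) * supp(q_bar) / (supp(Q q_bar) * supp(q)).

    supp_r: matches of x in the rule pattern P_R.
    supp_q: matches of x in the consequent pattern P_q.
    supp_q_bar: negative cases for q under LCWA.
    supp_qq_bar: negative cases that also match x in the antecedent Q.
    """
    denominator = supp_qq_bar * supp_q
    if denominator == 0:
        # Trivial case: the ratio is unbounded.
        return math.inf
    return (supp_r * supp_q_bar) / denominator


# Hypothetical counts, chosen only for illustration.
example = revised_confidence(supp_r=3, supp_q=5, supp_q_bar=1, supp_qq_bar=1)
```

The first factor supp(R)/supp(q) corresponds to completeness and the second factor supp(q̄)/supp(Qq̄) to discriminant, so the product grows when the rule covers more positive cases and fewer negative ones.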
Referring to GPAR R2 and Q2(x, G) described above with respect to
It can be seen that supp(R2, G)=1 (match v1), supp(
There are other alternatives to define support and confidence for GPARs. (1) Following minimum image-based support (B. Bringmann and S. Nijssen, “What is frequent in a single graph?” In PAKDD, 2008), supp(R, G) can be defined as the maximum number of matches for x in non-overlap matches (i.e., no shared nodes and edges) of R. However, this excludes potential customers from matches that share even a single node (e.g., only one of the three matches cust1-cust3 of
under LCWA. However, this only considers the “coverage” of R instead of its interestingness in terms of completeness and discriminant.
Two trivial cases are noted when conf(R, G)=∞: (1) supp(Q
The following section describes how to discover useful GPARs. GPARs for a particular event q(x, y) are of interest. However, this often generates an excessive number of rules, which often pertain to the same or similar people. This motivates the study of a diversified mining problem, to discover GPARs that are both interesting and diverse.
To formalize the problem, an objective function diff(,) is first defined to measure the difference of GPARs. Given two GPARs R1 and R2, diff(R1, R2) is defined as:
diff(R1,R2)=1−∥PR1(x,G)∩PR2(x,G)∥/∥PR1(x,G)∪PR2(x,G)∥
in terms of the Jaccard distance of their match set (as social groups). Such diversification has been adopted to battle against over-concentration in social recommender systems when the items recommended are too “homogeneous”. See for example, S. Amer-Yahia, L. V. Lakshmanan, S. Vassilvitskii, and C. Yu, “Battling predictability and overconcentration in recommender systems,” IEEE Data Eng. Bull., 32(4), 2009.
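The Jaccard distance between the match sets of two GPARs can be sketched as follows. The function name diff follows the text; returning 0.0 for two empty match sets is a convention assumed here, and the sample match sets are hypothetical.

```python
def diff(matches_r1, matches_r2):
    """Jaccard distance between the sets of nodes matching x in two GPARs."""
    a, b = set(matches_r1), set(matches_r2)
    union = a | b
    if not union:
        return 0.0  # assumed convention when both match sets are empty
    return 1.0 - len(a & b) / len(union)


# Hypothetical match sets (social groups).
disjoint = diff({"cust1", "cust2", "cust3"}, {"cust6"})
overlapping = diff({"cust1", "cust2"}, {"cust2", "cust3"})
```

Two rules that identify disjoint customer groups are at distance 1, and two rules that identify the same group are at distance 0.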
Given a set Lk of k GPARs that pertain to the same predicate q(x, y), the objective function F(Lk) may be defined again by following the practice of social recommender systems (as disclosed in S. Gollapudi and A. Sharma, “An axiomatic approach for result diversification,” In WWW, 2009):
This, known as max-sum diversification, aims to strike a balance between interestingness (measured by revised Bayes Factor) and diversity (by distance diff(,)) with a parameter λ controlled by users. Taking nontrivial GPARs (discussed above) with conf(R, G)ε[0, supp(R, G)*supp(
since there are k(k−1)/2 numbers for the difference sum, while only k numbers for the confidence sum.
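For illustration, a max-sum objective of this kind can be evaluated as sketched below. The block assumes the commonly used form (1−λ)·Σconf/N + (2λ/(k−1))·Σdiff, which may differ from the exact formula intended; the function name objective and the normalization constant norm=5.0 are hypothetical. The confidence and distance values for R5 to R8 are those quoted in the worked example later in this description.

```python
from itertools import combinations


def objective(rules, conf, diff, lam, norm=1.0):
    """Assumed max-sum form: (1-lam)*sum(conf)/norm + 2*lam/(k-1)*sum(diff)."""
    k = len(rules)
    conf_term = (1 - lam) * sum(conf[r] for r in rules) / norm
    if k < 2:
        return conf_term  # no pairs, so no diversity term
    # k(k-1)/2 pairwise distances, scaled by 2/(k-1) to balance the k confidences
    diff_sum = sum(diff[frozenset(p)] for p in combinations(rules, 2))
    return conf_term + (2 * lam / (k - 1)) * diff_sum


conf = {"R5": 0.8, "R6": 0.4, "R7": 0.6, "R8": 0.2}
diff = {frozenset(("R5", "R6")): 0.8, frozenset(("R7", "R8")): 1.0}

# norm=5.0 is a hypothetical normalization constant used only in this sketch.
f_56 = objective(["R5", "R6"], conf, diff, lam=0.5, norm=5.0)
f_78 = objective(["R7", "R8"], conf, diff, lam=0.5, norm=5.0)
```

Under these assumptions the pair {R7, R8} scores higher than {R5, R6}: its lower total confidence is outweighed by its fully disjoint match sets.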
For λ=0.5, a top-2 diversified set of these GPARs is {R7, R8} with
(similarly for {R1, R8}). Indeed, R7 and R8 find two disjoint customer groups sharing interests in French restaurant and Asian restaurant, respectively, with their friends.
Based on the objective function, the diversified GPAR mining problem (DMP) is stated as follows.
Input: A graph G, a predicate q(x, y), a support bound σ and positive integers k and d.
Output: A set Lk of k nontrivial GPARs pertaining to q(x, y) such that (a) F(Lk) is maximized; and (b) for each GPAR RεLk, supp(R, G)≧σ and r(PR, x)≦d.
DMP is a bi-criteria optimization problem to discover GPARs for a particular event q(x, y) with high support, bounded radius, and balanced confidence and diversity. In practice, users can freely specify q(x, y) of interest, while proper parameters (e.g., support, confidence, diversity) can be estimated from query logs or recommended by domain experts.
The diversified GPAR mining problem is nontrivial. Consider the decision problem of deciding whether there exists a set Lk of k GPARs with F(Lk)≧B for a given bound B. By reduction from the dispersion problem, this DMP decision problem is NP-hard (Theorem 1).
It is possible to follow a “discover and diversify” approach that (1) first finds all GPARs pertaining to q(x, y) by frequent graph pattern mining, and then (2) selects top-k GPARs via result diversification. However, this is costly: (a) an excessive number of GPARs are generated; and (b) for all GPARs R generated, it has to compute conf(R, G) and their pairwise distances, and moreover, pick a top-k set based on F( ); the latter is an intractable process itself.
It can be done more efficiently, with accuracy guarantees, as set forth in Theorem 2:
Theorem 2: There exists a parallel algorithm for DMP that finds a set Lk of top-k diversified GPARs such that (a) Lk has approximation ratio 2, and (b) Lk is discovered in d rounds by using n processors, and each round takes at most t(|G|/n, k, |Σ|) time, where Σ is the set of GPARs R(x, y) such that supp(R, G)≧σ and r(PR, x)≦d.
Here t(|G|/n, k, |Σ|) is a function that takes |G|/n, k and |Σ| as parameters, rather than the size |G| of the entire G.
As a proof, an algorithm is provided, denoted as DMine and shown in Table 1 below and described with respect to the flowchart of
Algorithm DMine works as follows.
(1) It divides G into n−1 fragments (F1, . . . , Fn−1) such that (a) for each “candidate” vx that satisfies the search condition on x in q(x, y), its d-neighbor Gd(vx), i.e., the subgraph of G induced by Nd(vx), is in some fragment; and (b) the fragments have roughly even size. This is feasible since 98% of real-life patterns have radius 1, 1.8% have radius 2, and the average node degree is 14.3 in social graphs. Thus, Gd(vx) is typically small compared with fragment size.
Fragment Fi is stored at worker Si, for iε[1, n−1].
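The fragmentation step can be sketched as follows. This is a simplified illustration, not the partitioning strategy of DMine: candidates are dealt to workers round-robin, no size balancing is attempted, and nodes shared by several d-neighbors are simply replicated. The function names and the sample graph are hypothetical, and the parameter workers plays the role of n−1.

```python
from collections import deque


def d_neighborhood(adj, v, d):
    """Nodes within undirected distance d of v."""
    seen = {v}
    frontier = deque([(v, 0)])
    while frontier:
        u, dist = frontier.popleft()
        if dist == d:
            continue  # do not expand beyond d hops
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                frontier.append((w, dist + 1))
    return seen


def fragment(edges, candidates, d, workers):
    """Give each worker the subgraph induced by its candidates' d-neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    node_sets = [set() for _ in range(workers)]
    owned = [[] for _ in range(workers)]
    for i, vx in enumerate(candidates):
        node_sets[i % workers] |= d_neighborhood(adj, vx, d)
        owned[i % workers].append(vx)
    fragments = []
    for nodes, cands in zip(node_sets, owned):
        induced = [(u, v) for u, v in edges if u in nodes and v in nodes]
        fragments.append({"candidates": cands, "nodes": nodes, "edges": induced})
    return fragments


# Hypothetical graph with three candidate customers and two workers.
edges = [("c1", "r1"), ("c2", "r1"), ("c2", "r2"), ("c3", "r3")]
frags = fragment(edges, ["c1", "c2", "c3"], d=1, workers=2)
```

Because every candidate carries its whole d-neighbor into exactly one fragment, a worker can decide locally whether that candidate matches x for any rule of radius at most d.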
(2) DMine discovers GPARs in parallel by following bulk synchronous processing, in d rounds. The coordinator Sc maintains a list Lk of diversified top-k GPARs, initially empty. In each round, (a) Sc posts a set M of GPARs to all workers, initially q(x, y) only; (b) each worker Si generates GPARs locally at Fi in parallel, by extending those in M with new edges if possible; (c) these GPARs are collected and assembled by Sc in the barrier synchronization phase; moreover, Sc incrementally updates Lk: it filters GPARs that have low support or cannot make top-k as early as possible, and prepares a set M of GPARs for expansion in the next round.
As opposed to the “discover and diversify” method, (a) DMine combines diversifying into discovering to terminate the expansion of non-promising rules early, rather than conducting diversification after discovery; and (b) it incrementally computes top-k diversified matches, rather than recomputing the diversification function F( ) from scratch.
Algorithm DMine maintains the following: (a) at the coordinator Sc, a set Lk to store top k GPARs, and a set Σ to keep track of generated GPARs; and (b) at each worker Si, a set Ci of candidates vx for x at Fi.
In each round, coordinator Sc and workers Si communicate via messages. (1) Each worker Si generates a set Mi of messages. Each message is a triple <R, conf, flag>, where (a) R is a GPAR generated at Si, (b) conf includes, e.g., supp(R(x, y), Fi) and supp(Q
In step 1102, DMine initializes Lk and Σ as empty, and M as {q(x, y)} (line 1). For r from 1 to d (step 1104), it improves Lk by incorporating GPARs of radius r (lines 2-11), following a levelwise approach. In each round, it invokes localMine with M at all workers (line 4). Details are described below.
Parallel GPARs generation (line 13 of the DMine algorithm, step 1108 of the flowchart of
In round r, upon receiving M from Sc, localMine does the following. For each GPAR R(x, y): Q(x, y)⇒q(x, y) in M, and each center node υx, it expands Q by including at least one new edge that is at hop r from υx, for all such edges.
Message construction (lines 14-15 of the DMine algorithm, step 1218 of
Message assembling (lines 4-7 of the DMine algorithm). Upon receiving Mi from each Si, coordinator Sc does the following. (1) It groups automorphic GPARs from all Mi. (2) For each group of mi=<R, confi, flagi> that refers to the same (automorphic) R, it assembles conf(R) into a single m=<R, conf(R, G), flag>, where (a)
and (b) flag is the disjunction of all flagi, for iε[1, n−1]. This suffices since by the partitioning of graph G, nodes accounted for local support in Fi are disjoint from those in Fj if i≠j; hence conf(R) can be directly assembled from local conf from Fi. Similarly, supp(R, G)=Σiε[1, n−1] supp(R, Fi). For each GPAR R, if supp(R, G)≧σ, it is added to ΔE and Σ.
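The assembling of supports and flags at the coordinator can be sketched as below. The sketch is a simplification: each message carries only a local support count and a flag, and rule_key stands for a hypothetical canonical identifier shared by automorphic GPARs after grouping. The function name assemble and the sample messages are illustrative.

```python
def assemble(messages_per_worker, sigma):
    """Combine per-worker (rule_key, local_supp, flag) triples.

    Returns rule_key -> (global support, extendable flag) for every rule
    whose global support reaches the bound sigma.
    """
    total, extendable = {}, {}
    for messages in messages_per_worker:
        for rule_key, local_supp, flag in messages:
            # Local supports count disjoint candidates, so they can be summed.
            total[rule_key] = total.get(rule_key, 0) + local_supp
            # The global flag is the disjunction of the local flags.
            extendable[rule_key] = extendable.get(rule_key, False) or flag
    return {r: (total[r], extendable[r]) for r in total if total[r] >= sigma}


# Hypothetical messages from two workers.
m1 = [("R5", 3, True), ("R6", 1, False)]
m2 = [("R5", 1, False), ("R6", 1, True), ("R9", 1, True)]
delta = assemble([m1, m2], sigma=2)
```

Summation is valid here only because the candidate nodes counted at different fragments are disjoint, as noted above.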
Incremental diversification (lines 8-9 of the DMine algorithm). Next, in step 1110, DMine incrementally updates Lk by invoking procedure incDiv. It uses a max priority Queue of size ⌈k/2⌉, where (1) each element in Queue is a pair of GPARs, and (2) all GPAR pairs in Queue are pairwise disjoint. In round r, starting from Queue of top-k diversified GPARs with radius at most r−1, DMine improves Queue by incorporating pairs of GPARs from ΔE, with radius r. (1) If Queue contains fewer than ⌈k/2⌉ GPAR pairs, incDiv iteratively selects two distinct GPARs R and R′ from ΔE that maximize a revised diversification function:
and inserts (R, R′) into Queue, until |Queue|=⌈k/2⌉. It bookkeeps each pair (R, R′) and F′ (R, R′). (2) If |Queue|=⌈k/2⌉, for each new GPAR RεΔE (not in any pair of Queue) and R′εΣ, it incrementally computes and adds a new pair (R, R′)εΔE×Σ that maximizes F′ (R, R′) to Queue. This ensures that a pair (R1, R2) with minimum F′(R1, R2) is replaced by (R, R′), if F′ (R1, R2)<F′ (R, R′).
After all GPAR pairs are processed, incDiv inserts R and R′ into Lk, for each GPAR pair (R, R′)εQueue.
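The pair-based selection can be illustrated with a batch version of the greedy strategy. The sketch below is not the incremental procedure incDiv: it recomputes the best remaining pair from scratch in each step, handles only even k, and takes the pair score as a caller-supplied function standing in for F′. The function name greedy_pairs and the sample scores are hypothetical.

```python
from itertools import combinations


def greedy_pairs(rules, score, k):
    """Pick k // 2 pairwise-disjoint pairs, best-scoring pair first."""
    remaining = set(rules)
    chosen = []
    while len(chosen) < k // 2 and len(remaining) >= 2:
        # sorted() keeps the enumeration order, and hence tie-breaking, stable
        best = max(combinations(sorted(remaining), 2), key=lambda p: score(*p))
        chosen.append(best)
        remaining -= set(best)  # keep the selected pairs disjoint
    return chosen


# Hypothetical pair scores; unlisted pairs default to 0.5.
scores = {("R5", "R6"): 0.92, ("R7", "R8"): 1.08}


def score(a, b):
    return scores.get((a, b), 0.5)


top2 = greedy_pairs(["R5", "R6", "R7", "R8"], score, k=2)
top4 = greedy_pairs(["R5", "R6", "R7", "R8"], score, k=4)
```

The members of the chosen pairs together form the selected set of k rules, mirroring the final step in which R and R′ are inserted into Lk for each pair in Queue.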
Message generation at Sc (lines 10-11 of the DMine algorithm). DMine next selects promising GPARs for further parallel extension at the workers (step 1112). These include RεΔE that satisfy two conditions: (1) supp(R, G)≧σ, since by the anti-monotonic property of support, if supp(R, G)<σ, then any extension of R cannot have support no less than σ; and (2) R is “Extendable”, i.e., flag=true in <R, conf, flag>. It includes such R in M, and posts M to all workers in the next round.
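The two selection conditions can be sketched as a simple filter over the newly assembled GPARs. The function name next_round, the dictionary layout, and the sample values are hypothetical and serve only to illustrate the conditions stated above.

```python
def next_round(delta, sigma):
    """Select GPARs to post to the workers for further extension.

    delta: dict rule -> (global support, extendable flag).
    A rule is kept only if its support reaches sigma and it is extendable.
    """
    return [r for r, (supp, flag) in delta.items() if supp >= sigma and flag]


# Hypothetical assembled rules.
delta = {"R5": (4, True), "R6": (2, True), "R9": (1, True), "R10": (3, False)}
m = next_round(delta, sigma=2)
```

Dropping a rule whose support is below σ is safe because support is anti-monotonic, so no extension of that rule can reach the bound.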
As an example, suppose that graph G1 in
(1) Coordinator Sc sends q to all workers, and computes supp(q, G1)=5 (cust1-cust4, cust6), supp(
(2) In round 1, R5 (among others) is generated at S1 from 1-hop neighbors of cust1-cust3, which are matches in q(x, G1)(
(3) Coordinator Sc assembles M1 and M2, and builds ΔE including {R5, R6}. It computes conf(R5)=0.8, conf(R6)=0.4, diff(R5, R6)=0.8. It updates Lk={R5, R6}, with
It includes R5 and R6 in message M (the table above), and posts it to S1 and S2.
(4) In round 2, R5 is extended to R7 and R1 at S1 and S2, and R6 to R8 at S2 (
(5) Given these, coordinator Sc assembles the messages and computes conf(R7)=0.6, conf(R8)=0.2 and diff(R7, R8)=1. DMine computes
Hence, it replaces (R5, R6) with (R7, R8) and updates Lk to be {R7, R8}. As R7 and R8 are marked as “not extendable” at radius 2 (since d=2), DMine returns {R7, R8} as top-2 diversified GPARs (step 1114), in total 2 rounds.
By maintaining additional information, DMine reduces the sizes of Σ, M and Mi. The idea is to test whether an upper bound of marginal benefit for any GPAR pairs can improve the minimum F′-value of Lk.
In each round r, incDiv filters non-promising GPARs from Σ and ΔE that cannot make top-k even after new GPARs are discovered. It keeps track of (1) a value F′m=min F′ (R1, R2) for all pairs (R1, R2) in Lk, (2) for each GPAR Rj in ΔE, an estimated maximum confidence Uconf+(Rj, G) for all the possible GPARs extended from Rj, and (3) conf(R, G) for each GPAR R in Σ. Here Uconf+(Rj, G) is estimated as follows. (a) Each Si computes Usuppi(Rj, Fi) as the number of matches of x in Rj(x, Fi) that connect to a center node in Fi at hop r+1 (r≦d−1). (b) Then Uconf+(Rj) is assembled at Sc as
Denote the maximum Uconf+(Rj, G) for Rj ∈ ΔE as max Uconf+(ΔE), and the maximum conf(R, G) for R ∈ Σ as max conf(Σ). Then incDiv reduces Σ and M based on the reduction rules below.
Lemma 3 (reduction rules): (1) A GPAR R ∈ Σ cannot contribute to Lk if
(2) Extending a GPAR Rj ∈ ΔE does not contribute to Lk if either (a) Rj is not extendable, or (b)
For the correctness of the rules, observe the following. (1) For each R ∈ Σ, conf(R)+max Uconf+(ΔE)+1 is an upper bound for its maximum possible increment to the F′-value of Lk; similarly for any Rj from ΔE. (2) If a GPAR R does not contribute to Lk, then no GPAR extended from R contributes to Lk. Indeed, (a) the upper bounds Uconf(R), Usuppi(R), and Uconf+(R) are anti-monotonic with respect to any R′ expanded from R, and (b) max Uconf+(ΔE) and max conf(Σ) are monotonically decreasing, while F′m is monotonically increasing, as the rounds proceed. Hence R can be safely removed from Σ, ΔE or M. Note that removing GPARs from Σ helps reduce ΔE (through a smaller max conf(Σ)), and vice versa. DMine repeatedly applies the rules until no GPARs can be reduced from Σ and ΔE.
To reduce redundant GPARs, DMine checks whether GPARs in ΔE are automorphic at coordinator Sc (line 6) and locally at each Si (localMine). It is costly to conduct pairwise automorphism tests on all GPARs in ΔE, since it is equivalent to graph isomorphism.
To reduce the cost, bisimulation may be used as disclosed in A. Dovier, C. Piazza, and A. Policriti, “A fast bisimulation algorithm,” In CAV, pages 79-90, 2001. A graph pattern PR
Lemma 4: If graph pattern PR
Hence, for a pair R1 and R2 of GPARs, DMine first checks whether PR
DMine detects trivial GPARs R(x, y): Q(x, y) ⇒ q(x, y) at Sc as follows: (1) if supp(q, G) is 0, it returns Ø to indicate that no interesting GPARs exist; and (2) if an extension leads to supp(Q
DMine returns a set Lk of k diversified GPARs with approximation ratio 2 (line 12), for the following reasons. (1) Parallel generation of GPARs finds all candidate GPARs within radius d. This is due to the data locality of subgraph isomorphism: for any node υx in G, υx ∈ PR(x, G) if and only if υx ∈ PR(x, Gd(υx)) for any GPAR R of radius at most d at x. That is, whether υx matches x via R is determined by checking the d-neighbor of υx locally at a fragment Fi. (2) Procedure incDiv updates Lk following the greedy strategy disclosed in S. Gollapudi and A. Sharma, “An axiomatic approach for result diversification,” In WWW, 2009, with approximation ratio 2. This is verified by an approximation-preserving reduction to the max-sum dispersion problem, which maximizes the sum of pairwise distances for a set of data points and has approximation ratio 2. The reduction maps each GPAR to a data point, and sets the distance between two GPARs R and R′ as F′(R, R′).
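By way of non-limiting illustration, one well-known greedy heuristic for max-sum dispersion repeatedly selects the remaining pair of points with the largest distance. The sketch below shows that batch heuristic only; it is not the incremental procedure incDiv, and the function name `greedy_dispersion`, the toy rule names and the toy distance table (standing in for F′) are assumptions made for this example.

```python
from itertools import combinations


def greedy_dispersion(points, dist, k):
    """Pick k points by repeatedly taking the farthest remaining pair."""
    remaining = list(points)
    chosen = []
    # Take whole pairs while at least two more points are still needed.
    while len(chosen) + 1 < k and len(remaining) >= 2:
        a, b = max(combinations(remaining, 2), key=lambda p: dist(p[0], p[1]))
        chosen += [a, b]
        remaining.remove(a)
        remaining.remove(b)
    # If k is odd, fill the last slot with any remaining point.
    if len(chosen) < k and remaining:
        chosen.append(remaining[0])
    return chosen


# Toy symmetric distances standing in for F'(R, R'); values are made up.
table = {
    frozenset(("R1", "R2")): 1.0, frozenset(("R1", "R3")): 2.5,
    frozenset(("R1", "R4")): 1.2, frozenset(("R2", "R3")): 0.7,
    frozenset(("R2", "R4")): 3.0, frozenset(("R3", "R4")): 0.9,
}


def toy_dist(a, b):
    return table[frozenset((a, b))]


top2 = greedy_dispersion(["R1", "R2", "R3", "R4"], toy_dist, 2)
print(top2)
```

With the toy table, the farthest pair is selected first, which mirrors the mapping of each GPAR to a data point with pairwise distance F′.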
For time complexity, observe that in each round, the cost consists of (a) the local parallel generation time T1 of candidate GPARs, determined by |Fi|, M and Mi; and (b) the total assembling and incremental maintenance cost T2 of Lk at Sc, dominated by |Σ|, k and |Mi|. Message reduction (by applying Lemma 3) takes O(d|E|) time in total, where each round takes a linear scan of ΔE and Σ to identify redundant GPARs. Note that Σi∈[1,n−1]|Mi|≦|ΔE|, |M|≦|Σ|, and |Fi| is roughly |G|/n by the disclosed partitioning strategy. Hence T1 and T2 are functions of |G|/n, k and |Σ|. This completes the proof of Theorem 2.
Algorithm DMine can be easily adapted to at least the following two cases. (1) When a set of predicates instead of a single q(x, y) is given, it groups the predicates and iteratively mines GPARs for each distinct q(x, y). (2) When no specific q(x, y) is given, it first collects a set of predicates of interest (e.g., most frequent edges, or edges with a user-specified label q), and then mines GPARs for the predicate set as in (1).
The following sections describe how to identify potential customers with GPARs, first describing the Entity Identification Problem. Consider a set Σ of GPARs pertaining to the same q(x, y), i.e., their consequents are the same event q(x, y). The set of entities identified by Σ in a (social) graph G with confidence η, denoted by Σ(x, G, η), may be defined as follows:
{υx | υx ∈ Q(x, G), Q(x, y) ⇒ q(x, y) ∈ Σ, conf(R, G)≧η} (3)
Under the Entity Identification Problem (EIP):
Input: A set Σ of GPARs pertaining to the same q(x, y), a confidence bound η>0, and a graph G.
Output: Σ(x, G, η).
The EIP is to find potential customers x of y in G identified by at least one GPAR in Σ, with confidence of at least η.
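By way of non-limiting illustration, once the matches and confidences are known, the output Σ(x, G, η) of equation (3) is a filtered union. In the sketch below, the function name `identify_entities`, the rule names and the sample match sets and confidence values are hypothetical; the expensive part (computing the matches by subgraph isomorphism) is assumed to have been done elsewhere.

```python
def identify_entities(rules, eta):
    """Union of antecedent matches over all GPARs with confidence >= eta.

    `rules` maps a rule name to a pair (matches of x under Q, conf(R, G)).
    """
    identified = set()
    for matches, conf in rules.values():
        if conf >= eta:
            identified |= matches
    return identified


# Made-up inputs for illustration only.
rules = {
    "Ra": ({"cust1", "cust2", "cust3"}, 0.8),
    "Rb": ({"cust3", "cust5"}, 0.3),
}
print(sorted(identify_entities(rules, eta=0.5)))
```

Raising η can only shrink the result, since fewer GPARs pass the confidence test.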
The decision problem of EIP is to determine, given Σ, G and η, whether Σ(x, G, η)≠Ø. It is equivalent to deciding whether there exists a GPAR R ∈ Σ such that conf(R, G)≧η. The problem is nontrivial, as it embeds the subgraph isomorphism problem, which is NP-hard.
Theorem 5: The decision problem for EIP is NP-hard, even when Σ consists of a single GPAR.
One way to compute Σ(x, G, η) is as follows. For each R(x, y): Q(x, y) ⇒ q(x, y) in Σ, (a) enumerate all matches of Q
To characterize the effectiveness of parallelization, parallel scalability may be formalized following C. P. Kruskal, L. Rudolph, and M. Snir, “A complexity theory of efficient parallel algorithms,” TCS, 71(1), 1990. Consider a problem A posed on a graph G. The worst-case running time of a sequential algorithm for solving A on G may be denoted by t(|A|, |G|). For a parallel algorithm, the time taken to solve A on G using n processors may be denoted by T(|A|, |G|, n). Here, it is assumed that n ≪ |G|, i.e., the number of processors is much smaller than the size of the graph; this typically holds in practice since G has billions of nodes and edges, much larger than n.
The algorithm is said to be parallel scalable if
T(|A|, |G|, n) = O(t(|A|, |G|)/n) + (n|A|)^O(1) (4)
That is, the parallel algorithm achieves a polynomial reduction in sequential running time, plus a “bookkeeping” cost O((n|A|)^l) for a constant l that is independent of |G|.
If the algorithm is parallel scalable, then for a given G, it guarantees that the more processors are used, the less time it takes to solve A on G. It allows big graphs to be processed by adding processors when needed. If an algorithm is not parallel scalable, there may not be a reasonable response time no matter how many processors are used. Problem A is said to be parallel scalable if there exists a parallel scalable algorithm for it.
Theorem 6: EIP is parallel scalable. As a proof, a parallel algorithm may be outlined for EIP, denoted by Matchc. Given Σ, G=(V, E, L), η and a positive integer n, it computes Σ(x, G, η) by using n processors. Note that Matchc is exact: it computes precisely Σ(x, G, η).
To present Matchc, the following notations may be used. (a) d is used to denote the maximum radius of R(x, y) at node x, for all GPARs R in Σ. (b) For a node υx ∈ V, Gd(υx) is the d-neighbor of υx in G (described above). (c) The set of all candidates υx of x, i.e., nodes in G that satisfy the search condition of x in q(x, y), is denoted by L.
Matchc capitalizes on the data locality of subgraph isomorphism (as discussed above). The Matchc algorithm will now be described with reference to the flowchart of
(1) Partitioning. It divides G into n fragments (F1, . . . , Fn) (step 1320) in the same way as algorithm DMine (described above), such that the fragments Fi have roughly even size, and Gd(υx) is contained in one Fi for each υx ∈ L. This is done in parallel. In particular, Gd(υx) can be constructed in parallel by a revised BFS (breadth-first search) restricted to d hops from υx. The match set Σ is initialized (step 1324), and each fragment Fi is assigned to a processor Si for i ∈ [1, n].
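By way of non-limiting illustration, the d-neighbor of a candidate may be collected with a breadth-first search that stops at depth d, and each d-neighbor may then be placed whole into one fragment. In the sketch below, the names `d_neighbor` and `partition`, the undirected adjacency-list input, and the rule of assigning each neighborhood to the currently smallest fragment are assumptions made for this example; the partitioning strategy actually used by DMine is not reproduced here, and nodes outside every d-neighbor are ignored.

```python
from collections import deque


def d_neighbor(adj, source, d):
    """Return the set of nodes within d hops of `source` (undirected adj)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == d:
            continue  # do not expand beyond radius d
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


def partition(adj, candidates, d, n):
    """Place each candidate's d-neighbor entirely into one of n fragments."""
    fragments = [set() for _ in range(n)]
    for v in candidates:
        ball = d_neighbor(adj, v, d)
        min(fragments, key=len).update(ball)  # keep fragment sizes balanced
    return fragments


# Toy path graph a - b - c - d - e - f.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
       "d": ["c", "e"], "e": ["d", "f"], "f": ["e"]}
fragments = partition(adj, ["a", "f"], d=1, n=2)
print(fragments)
```

Because every d-neighbor is stored whole, a worker holding the fragment can later test the candidate without fetching data from other fragments.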
(2) Matching. All processors Si compute local matches in Fi in parallel (step 1328). For each candidate υx ∈ L that resides in Fi, and for each GPAR R(x, y): Q(x, y) ⇒ q(x, y) in Σ, Si checks whether υx is in PR(x, Gd(υx)), Pq(x, Gd(υx)) and Pq(x, Gd(υx)), and whether υx has an outlink labeled q.
(3) Assembling. Compute conf(R, G) for each R in Σ by assembling the partial results of (2) above (step 1330). This is also done in parallel: first partition L into n fragments; then each processor operates on a fragment and computes partial support (step 1334). These partial results are then collected to compute conf(R, G). In step 1336, any υx for which no GPAR R satisfies υx ∈ PR(x, G) and conf(R, G)≧η is removed. Finally, step 1340 outputs those υx for which there exists a GPAR R such that υx ∈ PR(x, G) and conf(R, G)≧η.
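By way of non-limiting illustration, the assembling of partial supports may be sketched as follows. Each worker counts, per GPAR, the candidates in its own fragment that match, and the counts are summed; since each candidate resides in exactly one fragment, no candidate is counted twice. The names `local_counts` and `assemble`, the caller-supplied `is_match` predicate and the toy lookup table are assumptions made for this example, and threads merely stand in for the processors Si. Computing conf(R, G) from the assembled supports is left to the confidence metric defined earlier.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def local_counts(candidates, rules, is_match):
    """Partial support of each rule over the candidates of one fragment."""
    return Counter(r for v in candidates for r in rules if is_match(r, v))


def assemble(candidate_fragments, rules, is_match):
    """Sum the partial supports computed independently per fragment."""
    total = Counter()
    with ThreadPoolExecutor() as pool:
        partials = pool.map(
            lambda frag: local_counts(frag, rules, is_match),
            candidate_fragments,
        )
        for partial in partials:
            total.update(partial)
    return total


# Toy stand-in for local subgraph matching: a lookup table of known matches.
known = {"Ra": {"cust1", "cust2", "cust3"}, "Rb": {"cust2", "cust6"}}


def toy_is_match(rule, node):
    return node in known[rule]


supports = assemble(
    [["cust1", "cust2"], ["cust3", "cust4", "cust6"]],
    ["Ra", "Rb"],
    toy_is_match,
)
print(dict(supports))
```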
To show that Matchc is parallel scalable, the following is noted. (1) Step 1 is in O(|L||Gdm|/n) time, since BFS is in O(|Gdm|) time, where Gdm is the largest d-neighbor for all υx ∈ L. (2) Step 2 takes O(t(|Gdm|, |Σ|)|L|/n) time, where t(|Gdm|, |Σ|) is the worst-case sequential time for processing a candidate υx. (3) Step 3 takes O(|L||Σ|/n) time. (4) By |L|≦|V|, steps 1 and 2 take much less time than t(|G|, |Σ|), since t(·, ·) is an exponential function by Theorem 5, unless P=NP. (5) In practice, t(|Gdm|, |Σ|)|L| ≪ t(|G|, |Σ|) since t(·, ·) is exponential and Gdm is much smaller than G. Indeed, (a) in the real world, graph patterns in GPARs are typically small, and hence so is the radius d; as discussed above, Gd(υx) is thus often small. Putting these together, the parallel cost T(|G|, |Σ|, n)<O(t(|G|, |Σ|)/n), and better still, the larger n is, the smaller T(|G|, |Σ|, n) is.
Algorithm DMine (discussed above) takes t(|A|/n, k) time and is parallel scalable if the problem size |A| is measured as |G|+|Q|+|Σ| [29]. Indeed, if one wants all candidate GPARs R with supp(R, G)≧σ, then |Σ| is the size of the output, and |Σ| is not large (due to small d and large σ).
Certain optimization strategies may be employed to optimize Matchc. Algorithm Matchc just aims to show the parallel scalability of EIP. Its cost is dominated by step 2 for matching via subgraph isomorphism. To reduce the cost, algorithm Match may be developed that improves Matchc by incorporating the following optimization techniques. To simplify the discussion, a single GPAR R(x, y): Q(x, y) ⇒ q(x, y) may be taken as the starting point.
For each candidate υx ∈ L that resides in fragment Fi, a check is performed to determine whether there exists a match Gx of PR in which υx matches x. When one Gx is verified as a match of PR, υx is included in PR(x, Fi), without enumerating all matches of PR at υx, and the process may be terminated. This is done locally at Fi: by the partitioning strategy, Gd(υx) is contained in Fi.
To identify Gx at υx, Match starts with the pair (x, υx) as a partial match m, and iteratively grows m with new pairs (u, υ) for u ∈ PR and υ ∈ Gd(υx) in a guided search until a complete match is identified, i.e., m covers all the nodes in PR. A complete m induces a subgraph Gx. It is in PTIME to verify whether m is an isomorphism from PR to Gx.
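By way of non-limiting illustration, growing a partial match with early termination may be sketched as a backtracking search that returns as soon as one complete match is found. The function name `find_match`, the representation of patterns and graphs as label dictionaries and sets of directed edges, and the toy data are assumptions made for this example. The sketch tries candidates in arbitrary order (no sketch-based guidance) and only requires that every pattern edge be preserved by an injective, label-respecting mapping.

```python
def find_match(p_nodes, p_edges, p_label, g_edges, g_label, x, vx):
    """Return one mapping pattern node -> graph node with x -> vx, or None."""
    order = [x] + [u for u in p_nodes if u != x]

    def consistent(m):
        # Every pattern edge whose endpoints are both mapped must exist in G.
        return all((m[a], m[b]) in g_edges
                   for a, b in p_edges if a in m and b in m)

    def extend(m):
        if len(m) == len(order):
            return dict(m)  # first complete match: stop without enumerating
        u = order[len(m)]
        used = set(m.values())
        for v in g_label:
            if v in used or g_label[v] != p_label[u]:
                continue
            m[u] = v
            if consistent(m):
                found = extend(m)
                if found is not None:
                    return found
            del m[u]  # backtrack and try the next candidate
        return None

    if p_label[x] != g_label.get(vx):
        return None
    return extend({x: vx})


# Toy pattern: x -> x2 -> r, where x and x2 are customers, r a restaurant.
p_nodes = ["x", "x2", "r"]
p_edges = {("x", "x2"), ("x2", "r")}
p_label = {"x": "cust", "x2": "cust", "r": "restaurant"}
# Toy graph restricted to a small neighborhood.
g_label = {"cust1": "cust", "cust2": "cust", "cust3": "cust",
           "rest1": "restaurant"}
g_edges = {("cust1", "cust2"), ("cust2", "rest1"), ("cust3", "rest1")}

print(find_match(p_nodes, p_edges, p_label, g_edges, g_label, "x", "cust1"))
print(find_match(p_nodes, p_edges, p_label, g_edges, g_label, "x", "cust3"))
```

In the toy data, cust1 admits a complete match while cust3 does not, so only cust1 would be added to the local match set.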
To grow m, Match performs guided search based on a k-hop neighborhood sketch. For each node υ in G, a k-hop sketch K(υ) is a list {(1, D1), . . . , (k, Dk)}, where Di denotes the distribution of the node labels and their frequencies at hop i of υ. Given a pair (u, υ) newly added to m and a pattern edge (u, u′) in Q, Match picks “the best neighbor” υ′ of υ such that the pair (u′, υ′) has a high possibility of making a match. This is decided by assigning a score ƒ(u′, υ′) as Σi∈[1,k](Di−D′i), where D′i ∈ K(u′), Di ∈ K(υ′), and Di−D′i is the total frequency difference for each label in Di. In fact, (1) υ′ does not match u′ if for some i, Di−D′i<0; and (2) the larger the difference is, the more likely υ′ matches u′. If (u′, υ′) does not lead to a complete m, Match backtracks and picks υ″ with the next best score ƒ(u′, υ″).
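By way of non-limiting illustration, a k-hop sketch and the resulting score may be computed as follows. The names `k_hop_sketch` and `score`, the undirected adjacency-list input and the toy labels are assumptions made for this example. The pruning test is read here as: a graph node is rejected when, at some hop, it has fewer nodes of some label than the pattern node requires; this reading is an assumption of the sketch.

```python
from collections import Counter


def k_hop_sketch(adj, label, v, k):
    """Return [D1, ..., Dk]; Di counts the labels of nodes exactly i hops from v."""
    seen = {v}
    frontier = [v]
    sketch = []
    for _ in range(k):
        nxt = []
        for u in frontier:
            for w in adj.get(u, ()):
                if w not in seen:
                    seen.add(w)
                    nxt.append(w)
        sketch.append(Counter(label[w] for w in nxt))
        frontier = nxt
    return sketch


def score(pattern_sketch, graph_sketch):
    """Total label-frequency surplus, or None if the graph node is pruned."""
    total = 0
    for d_pat, d_graph in zip(pattern_sketch, graph_sketch):
        for lab, need in d_pat.items():
            if d_graph[lab] < need:
                return None  # too few nodes with this label at this hop
        total += sum(d_graph.values()) - sum(d_pat.values())
    return total


# Toy graph: v is adjacent to a and b; a is adjacent to c.
adj = {"v": ["a", "b"], "a": ["v", "c"], "b": ["v"], "c": ["a"]}
label = {"v": "cust", "a": "cust", "b": "restaurant", "c": "city"}

graph_sketch = k_hop_sketch(adj, label, "v", 2)
needs_one = [Counter(restaurant=1), Counter()]
needs_two = [Counter(restaurant=2), Counter()]
print(score(needs_one, graph_sketch), score(needs_two, graph_sketch))
```

Candidates with a score of None would be skipped, and the rest would be tried in decreasing order of score before backtracking.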
As an example, referring to GPAR R1 of
Given R1 and G1 of
Given PR
Given a set Σ of GPARs, Match revises step (2) of Matchc by checking whether υx matches x via guided search and early termination; it reduces redundant computation for multiple GPARs by extracting common sub-patterns of GPARs in Σ. It remains parallel scalable following the same complexity analysis for Matchc.
The computing environment 1400 may include computer readable media. Computer readable media can be any available tangible media that can be accessed by the computing environment 1400 and includes both volatile and nonvolatile media, removable and non-removable media. Computer readable media does not include transitory, modulated or other transmitted data signals that are not contained in a tangible medium. The system memory 1404 includes computer readable media in the form of volatile and/or nonvolatile memory such as ROM 1410 and RAM 1412. RAM 1412 may contain an operating system 1413 for the computing environment 1400. RAM 1412 may also store one or more application programs 1414. The computer readable media may also include storage media 1406, such as hard drives, optical drives and flash drives.
The computing environment 1400 may include a variety of interfaces for the input and output of data and information. Input interface 1416 may receive data from different sources including touch (in the case of a touch sensitive screen), a mouse 1424 and/or keyboard 1422. A video interface 1430 may be provided for interfacing with a touchscreen 1431 and/or monitor 1432. A peripheral interface 1436 may be provided for supporting peripheral devices, including for example a printer 1438.
The computing environment 1400 may operate in a networked environment via a network interface 1440 using logical connections to one or more remote computers 1444, 1446. The logical connection to computer 1444 may be a local area network (LAN) 1448, and the logical connection to computer 1446 may be via the Internet 1450. Other types of networked connections are possible, including broadband communications as described above. It is understood that the above description of computing environment 1400 is by way of example only, and may include a wide variety of other components in addition to or instead of those described above.
The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.