The present invention generally relates to the testing of reachability between nodes in a graph and related problems.
Numerals presented herebelow in brackets—[ ]—are keyed to the list of references found towards the close of the present disclosure.
Testing the reachability between nodes in a graph is a well-known problem with many important applications, including knowledge representation, program analysis, and, more recently, inference over biological and ontology databases as well as XML query processing. Generally, as discussed below, various approaches have been proposed to encode graph reachability information using node labeling schemes, but most existing schemes only work well for specific types of graphs.
One may consider a directed graph G=(V,E). Graph reachability is the following decision problem: Given two nodes u and v in G, is there a path from u to v? If the answer is yes, one can say that u can reach v, or
Graph reachability has been a well-known problem with many traditional applications, e.g., testing concept subsumption in knowledge representation systems; and reasoning about inheritance in compiler design for object-oriented programming languages. Recently, the interest in graph reachability work has been rekindled by new applications of graph-structured databases. For example, several well-known projects in bioinformatics model data such as protein interactions, metabolic pathways and gene regulatory networks as directed graphs. A general example of such a representation is shown in
The well-known single-source shortest path algorithm can be used to answer reachability queries. However, the algorithm has a high complexity of O(|E|) per query, making it infeasible for efficient query processing. At the other extreme, one can precompute and store the transitive closure of the graph. Reachability queries can then be answered with constant-time matrix lookups. However, the space requirement is O(|V|^2), making this approach infeasible for large graphs.
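By way of a non-limiting illustration, the two extremes just described may be sketched as follows. The sketch substitutes a plain breadth-first search for the shortest path algorithm, treats every node as reaching itself, and uses hypothetical names (reachable_bfs, closure_table, adj) that do not appear in the present disclosure.

```python
from collections import deque

def reachable_bfs(adj, u, v):
    """Answer one query by traversal: O(|E|) time, no precomputation."""
    seen = {u}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        if node == v:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def closure_table(adj, nodes):
    """Precompute every answer: constant-time lookups, quadratic space."""
    return {u: {v for v in nodes if reachable_bfs(adj, u, v)} for u in nodes}

# Assumed toy graph: d -> a -> b -> c
adj = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
table = closure_table(adj, list(adj))
```

A lookup such as "c" in table["a"] then replaces a traversal, at the cost of storing one entry per reachable pair.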
If one only considers reachability in trees (or forests), interval labeling is a desirable solution that takes linear space and supports reachability queries in constant time. It labels each node u in the tree by an interval [start(u), end(u)]. The labels can be assigned with a depth-first traversal of the tree, using a counter that is incremented whenever the traversal enters or leaves a node; start(u) and end(u) are assigned the value of the counter when the traversal enters and leaves u, respectively. It is not difficult to see that interval labeling has the following property: Given two tree nodes u and v, u can reach v if and only if the interval of u contains the interval of v, i.e., start(u) ≤ start(v) and end(v) ≤ end(u).
Thus, reachability can be verified in constant time. Unfortunately, this approach is not directly applicable to graphs.
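A minimal sketch of interval labeling on a forest is given below for illustration only. It assumes that reachability corresponds to interval containment (start(u) ≤ start(v) and end(v) ≤ end(u)); the names interval_labels, tree_reaches and children are hypothetical.

```python
def interval_labels(children, roots):
    """Assign start/end counters to every node with one depth-first traversal."""
    start, end = {}, {}
    counter = 0
    for root in roots:
        # Explicit stack of (node, leaving) pairs avoids deep recursion.
        stack = [(root, False)]
        while stack:
            node, leaving = stack.pop()
            counter += 1  # incremented on every enter and every leave
            if leaving:
                end[node] = counter
            else:
                start[node] = counter
                stack.append((node, True))
                for child in reversed(children.get(node, ())):
                    stack.append((child, False))
    return start, end

def tree_reaches(start, end, u, v):
    """Constant-time test: the interval of u contains the interval of v."""
    return start[u] <= start[v] and end[v] <= end[u]

# Assumed toy tree: r has children a and b; a has child c.
children = {"r": ["a", "b"], "a": ["c"]}
start, end = interval_labels(children, ["r"])
```

On this tree the root receives the interval [1, 8], and the query from b to c fails because the interval of b does not contain that of c.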
Labeling a general graph to support efficient reachability queries is a difficult problem. It has been shown that there exist graphs for which any reachability labeling scheme would require O(|V|×|E|^(1/2)) space. Still, a variety of labeling schemes have been proposed, and they are surveyed herebelow. Briefly, the two most relevant and popular schemes are the interval-based approach by Agrawal et al. [1] and the 2-hop approach by Cohen et al. [2]. The interval-based approach extends the basic interval labeling to work on DAGs, and is effective on graphs that mostly resemble trees or forests. However, the performance degrades when the graph contains many non-tree edges. The 2-hop approach identifies subgraphs where one set of nodes connects to another set of nodes via a “hop” node; between these two sets, reachability relationships can be encoded compactly. This approach is thus optimized for graphs that contain many good “hop” nodes, i.e., nodes that connect two large sets of other nodes. However, the approach is less efficient for graphs with other types of substructures, e.g., long, branchless paths or one-way bipartite graphs.
Overall, there is currently no single approach that works well for all types of graphs. A need has thus been recognized in connection with providing a labeling scheme that is robust for a larger variety of graphs. It is noted that each existing approach to reachability labeling exploits certain substructural features in graphs; a need has thus also been recognized in connection with combining the strengths of different approaches to achieve generality in a labeling scheme.
There is broadly contemplated herein, in accordance with at least one presently preferred embodiment of the present invention, a hierarchical approach to reachability labeling that may be referred to as HLSS (Hierarchical Labeling of Sub-Structures). A graph often contains different types of substructures whose reachability information is easier to encode with different labeling techniques. HLSS extracts such substructures and applies efficient labeling techniques suitable to each of them.
At least one embodiment of the present invention preferably involves a two-phase labeling algorithm, which implements HLSS. The first phase identifies and encodes strongly connected components as well as tree substructures. The second phase encodes the remaining reachability relationships by compressing dense rectangular submatrices in the transitive closure matrix. This hierarchical approach handles different types of graphs well, while existing approaches fall prey to graphs with substructures they are not designed to handle.
Preferably employed in accordance with at least one embodiment of the present invention is a 2-approximation algorithm to find dense submatrices. The method is ambiguity tolerant, that is, it allows false positives to encourage larger submatrices, which can be encoded more efficiently; meanwhile, it considers the cost of filtering out false positives to balance this benefit.
In summary, one aspect of the invention provides a method of providing reachability labeling for graphs, the method comprising the steps of: providing an input graph having at least one substructure associated therewith; and labeling the at least one substructure with reachability information in a manner optimally configured for the at least one substructure.
Another aspect of the invention provides an apparatus for providing reachability labeling for graphs, the apparatus comprising: an arrangement for providing an input graph having at least one substructure associated therewith; and an arrangement for labeling the at least one substructure with reachability information in a manner optimally configured for the at least one substructure.
Furthermore, an additional aspect of the invention provides a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for providing reachability labeling for graphs, the method comprising the steps of: providing an input graph having at least one substructure associated therewith; and labeling the at least one substructure with reachability information in a manner optimally configured for the at least one substructure.
For a better understanding of the present invention, together with other and further features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying drawings, and the scope of the invention will be pointed out in the appended claims.
To further the present discussion, one may recast the problem of graph reachability labeling as a problem of finding a compact representation for a transitive closure matrix. From this viewpoint, and for the purposes of providing a basis of comparison, two highly popular conventional approaches are discussed herebelow—namely, interval-based and 2-hop—followed by a brief discussion of other related work.
In an interval-based approach, nodes are labeled by intervals, whose containment relationships encode ancestor-descendant relationships among nodes in a tree. In the transitive closure matrix, each directed path in the graph corresponds to a reordered submatrix with ones in the upper triangle and zeros in the lower triangle (see
Although originally proposed for trees, the interval-based approach was extended by Agrawal et al. [1] to DAGs. Each node u is assigned a set of non-overlapping intervals L(u); u can reach v if every interval in L(v) is contained in some interval in L(u). Labeling is done by first finding a spanning forest and assigning interval labels for nodes in the forest. Next, to capture reachability relationships through non-spanning-forest edges, one adds additional intervals to labels in reverse topological order of the DAG; specifically, if (u,v) is an edge not in the spanning forest, then all intervals in L(v) are added to L(u) (as well as to the labels of all nodes that can reach u).
Consider the graph in
With more complicated graphs, the size of a label can become linear in the graph size. For example, if d and f have many non-spanning-forest descendants, their labels will become larger, which will in turn cause ancestors of b and c to have larger labels. Since reachability queries involve checking containment for all intervals in a label, large labels can seriously impact query performance.
The 2-hop approach, on the other hand, was proposed as an alternative to the interval-based approach by Cohen et al. [2]. For each node u, let Cin(u) denote the set of nodes that can reach u, and let Cout(u) denote the set of nodes that can be reached by u. The key observation of this approach is that every node in Cin(u) can reach every node in Cout(u). For example, in
iff the out-label of u intersects with the in-label of v. Thus, reachability relationships from nodes in Cin(x) to nodes in Cout(x) can be encoded succinctly by adding x to the out-label of every node in Cin(x), and the in-label of every node in Cout(x).
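For illustration, the labeling contributed by a single hop node x may be sketched as follows. The sketch assumes that Cin(x) and Cout(x) both include x itself, and it encodes only those relationships that pass through x; the helper names (ancestors_and_descendants, add_hop, two_hop_reaches) are hypothetical.

```python
from collections import defaultdict

def ancestors_and_descendants(adj, x):
    """Return (Cin(x), Cout(x)), both assumed here to include x itself."""
    def collect(graph):
        seen, stack = {x}, [x]
        while stack:
            node = stack.pop()
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen
    reverse = defaultdict(list)
    for src, dsts in adj.items():
        for dst in dsts:
            reverse[dst].append(src)
    return collect(reverse), collect(adj)

def add_hop(adj, x, out_label, in_label):
    """Add x to the out-label of Cin(x) and to the in-label of Cout(x)."""
    c_in, c_out = ancestors_and_descendants(adj, x)
    for node in c_in:
        out_label[node].add(x)
    for node in c_out:
        in_label[node].add(x)

def two_hop_reaches(out_label, in_label, u, v):
    """u reaches v through some hop iff the two labels intersect."""
    return bool(out_label[u] & in_label[v])

# Assumed toy graph: a and b point to x; x points to c and d.
adj = {"a": ["x"], "b": ["x"], "x": ["c", "d"], "c": [], "d": []}
out_label, in_label = defaultdict(set), defaultdict(set)
add_hop(adj, "x", out_label, in_label)
```

Here a single symbol stored in five labels stands for all reachability pairs that pass through x, which is the compactness that the approach relies on.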
From the viewpoint of compressing the transitive closure matrix, the 2-hop approach seeks to compress the submatrix induced by x consisting of all ones; its columns correspond to the nodes in Cin(x) and its rows correspond to the nodes in Cout(x), as illustrated in
where k is the number of ones in the submatrix that have been previously encoded.
On the other hand, it is clear from the matrix compression viewpoint that the 2-hop approach may miss many submatrices that are high-quality candidates for compression, because the approach only considers submatrices induced by hop nodes. For example, in
As shown hereabove, existing approaches to labeling graph reachability each have their respective strengths and weaknesses, and are each most effective on specific types of graphs. For example, the interval-based approach works best on tree-like graphs, while the 2-hop approach works best on graphs with many well-connected hop nodes. By combining the power of these approaches and other optimizations, there is proposed herein HLSS, a hierarchical labeling scheme that can work well for graphs with different characteristics.
HLSS preferably assigns labels in two phases, each focusing on exploiting different characteristics of the input graph G. The first phase, tree-reachability reduction, begins with a preprocessing step that identifies each strongly connected component of the graph, collapses the component into one representative node, and uses this node to label others in the component. Then, one preferably identifies tree structures within G and assigns interval labels to nodes based on these tree structures. Containment of interval labels implies reachability through tree paths. This phase also computes a remainder graph Gr that captures any remaining reachability information not encoded by interval labels. Specifically, a node can also reach another node through portals in Gr. Each node is preferably labeled by its portals to facilitate reachability checking.
The second phase, remainder graph-reachability encoding, aims at compressing the reachability information in the remainder graph Gr produced by the first phase. One may preferably do so by assigning additional labels to portals so that reachability among them can be checked efficiently by comparing their labels. Presented herein are several techniques for assigning such labels, including an enhanced version of the 2-hop approach as well as techniques inspired by data mining, linear algebra, and graph algorithms.
To summarize, these two phases will create a label of a 4-level hierarchy for each node u:
The disclosure now turns to a description of how the two phases assign these labels. Further below is a discussion on how to answer reachability queries using these labels.
Preferably, there are identified strongly connected components in G that contain more than one node. If two nodes u and v belong to the same strongly connected component, they are indistinguishable from each other as far as reachability is concerned. For any node w, if u can reach w, then v can reach w; similarly, if w can reach u, then w can reach v.
Hence, for the purpose of computing reachability, one can collapse each strongly connected component of G into one single representative node that retains all edges coming in and out of the strongly connected component. Subsequently, only this representative node is dealt with; nodes strongly connected to it receive it as their strongly connected component label (ls). One can find all strongly connected components of G in O(|V|+|E|) time, using Tarjan's algorithm. By replacing all strongly connected components with their representative nodes, there is obtained a result graph G′ with no cycles.
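The collapsing step may be sketched as follows, purely by way of example. The sketch uses a recursive form of Tarjan's algorithm (so very deep graphs may need an iterative variant), picks the root of each component as its representative, and uses the hypothetical names tarjan_scc and condense.

```python
def tarjan_scc(adj):
    """Map every node to the representative of its strongly connected component."""
    index, low, on_stack = {}, {}, set()
    stack, rep = [], {}
    counter = [0]

    def visit(node):
        index[node] = low[node] = counter[0]
        counter[0] += 1
        stack.append(node)
        on_stack.add(node)
        for nxt in adj.get(node, ()):
            if nxt not in index:
                visit(nxt)
                low[node] = min(low[node], low[nxt])
            elif nxt in on_stack:
                low[node] = min(low[node], index[nxt])
        if low[node] == index[node]:
            # node is the root of a component; pop all of its members.
            while True:
                member = stack.pop()
                on_stack.discard(member)
                rep[member] = node
                if member == node:
                    break

    for node in adj:
        if node not in index:
            visit(node)
    return rep

def condense(adj):
    """Build the acyclic graph over representatives, keeping cross-component edges."""
    rep = tarjan_scc(adj)
    dag = {r: set() for r in set(rep.values())}
    for src, dsts in adj.items():
        for dst in dsts:
            if rep[src] != rep[dst]:
                dag[rep[src]].add(rep[dst])
    return rep, dag

# Assumed toy graph: a -> b -> c -> a forms a cycle, and c -> d leaves it.
adj = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
rep, dag = condense(adj)
```

In this example rep plays the role of the strongly connected component label, and dag corresponds to the cycle-free result graph.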
Next, a spanning forest T of G′ is preferably identified, and interval labels li(u) and li(u) are assigned to each node u to capture all reachability relationships in T. The identification of T and the assignment of interval labels can both be done easily in time linear in the size of G′.
From G′, there is preferably found a remainder graph Gr, which embodies reachability not covered by the spanning forest in G′. The remainder graph Gr is usually much smaller than G′. Nodes of Gr are a subset of nodes in G′, which one may call portals. One may preferably label each node u of G′ with two portals: an in-portal lpin(u) and an out-portal lpout(u). To support efficient reachability checking, portal labels and Gr must have the following property:
(P1) For any two nodes u, v ∈ G′,
In other words, unless u can reach v via tree edges in the spanning forest, the only way for u to reach v is by going through the out-portal of u and the in-portal of v.
To this end, one may preferably define lpout(u), lpin(u), and Gr as follows:
Definition 1 (Portals and Remainder Graph) Given a spanning forest T of G′, a node u ∈ G′ is exposed if there exists an edge (u, v) (or (v, u)) in G′ such that u is not v's ancestor (or descendant, respectively) in T.
The definition of portals assumes that ancestor and descendant relationships are reflexive, i.e., a node is an ancestor and descendant of itself. Note that an out-portal is not necessarily exposed; a non-exposed node can be an out-portal if it is the lowest common ancestor of some exposed nodes.
The following theorem ensures that the reduction of G′ into Gr given by Definition 1 preserves all remaining reachability information.
Theorem 1 The portal labels and remainder graph defined by Definition 1 have property (P1).
In
Finally, it is noted in the following theorem that the size of the remainder graph is linear in the number of “non-tree” edges, i.e., those that are not in T or implied by T. Therefore, in practice, the remainder graph can be much smaller than the original graph. In particular, if the input graph is a tree or forest, the remainder graph would be empty and all portal labels would be unassigned, and the scheme basically degenerates into the interval-based approach.
Theorem 2 Let Ent denote the set of edges in G′ that are not from a node to its descendant in T. The remainder graph Gr has fewer than 4|Ent| nodes and 5|Ent| edges.
After extracting reachability relationships in the spanning forest, one can now turn to the problem of encoding remaining reachability relationships in the remainder graph Gr. As outlined below, an important objective in this phase is to assign the remainder labels Lrin(u) and Lrout(u) for each node u in Gr to help check reachability among nodes in Gr. Let Tr denote the transitive closure matrix of Gr. The general idea is to compress the content of Tr into remainder labels, in a way that allows any Tr entry to be recovered efficiently.
While many compression algorithms can be applied to Tr (e.g., Blocked Huffman coding or LZW), most of them do not support efficient recovery of individual entries. A presently preferred approach is to identify a dense submatrix of Tr with mostly ones, and encode it compactly in remainder labels of the nodes associated with rows and columns of the submatrix. This process is repeated until unencoded ones in Tr are sparse enough to be stored efficiently in a sparse matrix.
By way of encoding a dense submatrix, suppose R and C are two non-empty sets of nodes in the remainder graph Gr. Let Tr(R,C) denote the submatrix of Tr spanned by R and C, i.e., the submatrix whose rows and columns correspond to nodes in R and C, respectively. This submatrix captures reachability relationships from R nodes to C nodes. To encode these reachability relationships, one may pick a unique symbol s; one may preferably then add s to Lrout(u) for each node u ∈ R, and also to Lrin(v) for each node v ∈ C. In addition, for each entry (u, v) of Tr(R, C) with value zero, one adds the pair (u, v) to a zero-exception set ε0. Clearly, u can reach v if Lrout(u) ∩ Lrin(v) ≠ Ø and (u, v) ∉ ε0.
Intuitively, adding a common symbol to |R|+|C| remainder labels has the effect of remembering Tr(R, C) as a submatrix of all ones. Any zero in Tr(R, C) needs to be remembered in ε0 as an exception. Subsequently, one no longer needs to store entries of Tr(R, C) in Tr.
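The encoding of one submatrix and the corresponding lookup may be sketched as follows. This is an illustrative sketch only; the nested-dictionary representation of Tr and the names encode_submatrix, remainder_reaches, r_out, r_in and eps0 are assumptions made for the example.

```python
from collections import defaultdict

def encode_submatrix(tr, rows, cols, symbol, r_out, r_in, zero_exceptions):
    """Encode Tr(R, C) with one shared symbol plus a zero-exception set."""
    for u in rows:
        r_out[u].add(symbol)
    for v in cols:
        r_in[v].add(symbol)
    # Every zero inside the submatrix must be remembered as an exception.
    for u in rows:
        for v in cols:
            if tr[u][v] == 0:
                zero_exceptions.add((u, v))

def remainder_reaches(r_out, r_in, zero_exceptions, u, v):
    """Labels intersect and the pair is not a recorded zero."""
    return bool(r_out[u] & r_in[v]) and (u, v) not in zero_exceptions

# Assumed toy submatrix: rows p, q and columns s, t, with a single zero at (q, t).
tr = {
    "p": {"s": 1, "t": 1},
    "q": {"s": 1, "t": 0},
}
r_out, r_in, eps0 = defaultdict(set), defaultdict(set), set()
encode_submatrix(tr, ["p", "q"], ["s", "t"], "s1", r_out, r_in, eps0)
```

Four label entries and one exception thus replace the four matrix entries; the saving grows as the submatrix becomes larger and denser.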
The amount of space used in remainder labels to encode Tr(R, C) is (|R|+|C|)×size(s), where size(s) denotes the size of symbol s in bits; in addition, the amount of space used in the zero-exception set is n0(Tr(R, C))×size(e0), where nx(•) counts the number of entries with value x in a matrix, and size(e0) denotes the size of an entry in ε0. One can quantify the quality of encoding by the encoding density of the submatrix Tr(R, C), defined as the ratio between the number of ones in the submatrix and the amount of space used in encoding the submatrix:
density(Tr(R, C)) = n1(Tr(R, C)) / ((|R|+|C|)×size(s) + n0(Tr(R, C))×size(e0))   (1)
The higher the encoding density of a submatrix (or in short, the denser the submatrix), the better it is to apply encoding. The overall remainder graph-reachability encoding algorithm, to be presented herebelow, greedily identifies a dense (if not the densest) submatrix of Tr, encodes it, marks its entries as encoded (using a value other than one or zero), and repeats the process. Thus, in the general case that parts of Tr have already been encoded, Equation (1) defines the encoding density to be the ratio between the number of unencoded ones and the amount of additional space used in encoding (zeros covered by previously encoded parts are already remembered in ε0 and thus do not require additional space).
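The ratio described above may be computed as in the following sketch, which assumes that the matrix is a list of lists, that size(s) is positive, and that previously encoded entries carry a hypothetical marker value ENCODED so that they count neither as ones nor as zeros.

```python
ENCODED = 2  # hypothetical marker for entries covered by an earlier encoding

def encoding_density(a, rows, cols, size_s, size_e0):
    """Unencoded ones divided by the additional space needed to encode a(rows, cols)."""
    ones = sum(1 for r in rows for c in cols if a[r][c] == 1)
    zeros = sum(1 for r in rows for c in cols if a[r][c] == 0)
    space = (len(rows) + len(cols)) * size_s + zeros * size_e0
    return ones / space

# Assumed toy matrices; the second has its top-left block already encoded.
fresh = [[1, 1, 0],
         [1, 1, 1],
         [0, 0, 1]]
partly_encoded = [[ENCODED, ENCODED, 0],
                  [ENCODED, ENCODED, 1],
                  [0, 0, 1]]
d_block = encoding_density(fresh, [0, 1], [0, 1], 1, 1)
d_wide = encoding_density(fresh, [0, 1], [0, 1, 2], 1, 1)
d_again = encoding_density(partly_encoded, [0, 1], [0, 1, 2], 1, 1)
```

With unit sizes, the all-ones 2×2 block has density 4/4, extending it by the third column gives 5/6, and once the block has been encoded the same extension is worth only 1/6.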
The 2-hop approach also uses a notion of encoding density, which is essentially a restricted case of the definition above. As discussed hereabove, the main restriction is that the 2-hop approach only considers submatrices induced by single nodes. The submatrix of Tr induced by node u is Tr(Cin(u),Cout(u)), where, it is to be recalled, Cin(u) is the set of nodes that can reach u and Cout(u) is the set of nodes that can be reached by u. Note that all entries in this submatrix are ones, so n1(Tr(Cin(u), Cout(u)))=|Cin(u)|×|Cout(u)| and n0(Tr(Cin(u), Cout(u)))=0. Hence, the definition of encoding density reduces (up to a constant factor) to the one used by the 2-hop approach:
(|Cin(u)|×|Cout(u)| − k) / (|Cin(u)| + |Cout(u)|),
where k is the number of ones in the submatrix that have been previously encoded.
An important step of a remainder graph-reachability encoding algorithm involves identifying a submatrix of Tr to encode, preferably the densest one. Before algorithms are presented for finding such submatrices, below is a formal definition of the general problem.
Definition 2 (Densest submatrix problem) The densest submatrix problem (DSM) is defined as follows: Given a binary matrix A and non-negative parameters size(s) and size(e0), find a subset R of rows and a subset C of columns from A that maximize density(A(R,C)), the encoding density of the submatrix spanned by R and C.
Theorem 3 Under the plausible assumption that 3-SAT does not have a subexponential time algorithm, DSM is hard to approximate within a factor of 2(log n) δ−1 for some δ>0. Here, n denotes the total number of rows and columns in the input matrix to the DSM problem.
This result implies that, for the general DSM problem, it may be fruitless to go after the optimal solution. Thus, one turns to heuristics that consider a restricted solution space or special instances of the DSM problem for which efficient approximation is possible.
The first algorithm, FINDDSM_2APPROX (
FINDDSM_2APPROX can be made an efficient O(n^2) algorithm, where n denotes the total number of rows and columns in A. The main loop runs at most n times. Without any optimization, each iteration of the loop would take O(n^2) time, most of which is spent on counting ones. However, for each row (or column) to be removed, one can remember the count of its ones, and update this count whenever a column (or row, respectively) is removed. This maintenance only takes O(n) time per iteration, and the cost of finding the row or column with the least number of ones is reduced to O(n) per iteration. Thus, one has reduced the overall complexity of FINDDSM_2APPROX to O(n^2). For simplicity of presentation, this optimization is not shown in the algorithm of
Another bit of good news is that, for the instance of DSM with size(e0)=0, FINDDSM_2APPROX turns out to be a 2-approximation algorithm, i.e., it returns a submatrix whose encoding density is within a factor of two of that of the densest one. Note that this instance of DSM is not at all unreasonable: Setting size(e0)=0 in Equation (1) does not imply that the cost of the zero-exception list is ignored, because the numerator, n1(A(R, C)), still favors submatrices with more ones.
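A greedy peeling procedure of the kind suggested by the complexity discussion above may be sketched as follows. The sketch is offered under stated assumptions rather than as the disclosed algorithm: it assumes that the main loop repeatedly removes the row or column with the fewest ones, that the densest intermediate submatrix is returned, that the input contains only zeros and ones, and that neither R nor C may become empty. The name find_dsm_greedy is hypothetical.

```python
def find_dsm_greedy(a, size_s=1.0, size_e0=0.0):
    """Peel off the sparsest row or column; return (density, rows, cols) of the best step."""
    n_rows, n_cols = len(a), len(a[0])
    rows, cols = set(range(n_rows)), set(range(n_cols))
    # Per-row and per-column counts of ones, maintained incrementally.
    row_ones = [sum(a[r]) for r in range(n_rows)]
    col_ones = [sum(a[r][c] for r in range(n_rows)) for c in range(n_cols)]
    ones = sum(row_ones)

    def density():
        zeros = len(rows) * len(cols) - ones
        return ones / ((len(rows) + len(cols)) * size_s + zeros * size_e0)

    best = (density(), set(rows), set(cols))
    while len(rows) > 1 or len(cols) > 1:
        candidates = []
        if len(rows) > 1:
            r = min(rows, key=lambda i: (row_ones[i], i))
            candidates.append((row_ones[r], "row", r))
        if len(cols) > 1:
            c = min(cols, key=lambda j: (col_ones[j], j))
            candidates.append((col_ones[c], "col", c))
        count, kind, idx = min(candidates)
        if kind == "row":
            rows.remove(idx)
            for j in cols:
                col_ones[j] -= a[idx][j]
        else:
            cols.remove(idx)
            for i in rows:
                row_ones[i] -= a[i][idx]
        ones -= count
        current = density()
        if current > best[0]:
            best = (current, set(rows), set(cols))
    return best

# Assumed toy matrix with a 3x2 block of ones in the upper left.
a = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1]]
density, rows, cols = find_dsm_greedy(a)
```

With size(e0)=0 the sketch settles on rows {0, 1, 2} and columns {0, 1}, i.e., six ones encoded with five label entries.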
Theorem 4 FINDDSM_2APPROX is a 2-approximation algorithm for finding the densest submatrix A(R, C) with encoding density defined as
Recall that the 2-hop approach considers only submatrices induced by single nodes, and picks the densest submatrix among them. The approach does not consider any submatrix containing zeroes, or any submatrix corresponding to a bipartite subgraph such as the one illustrated in
Like the other two algorithms presented just hereabove, FINDDSM_EXT2HOP takes O(n^2) time. The 2-hop loop requires only O(n^2) time, because density calculation is simplified by the fact that all submatrices considered by the 2-hop approach contain no zeros. The second loop for extending the result submatrix runs at most n times, and each iteration can be optimized to run in O(n) time using two optimizations that one has applied earlier to the other two algorithms in this section. First, for each remaining row (or column), one can record the count of ones that it can add to the current submatrix, and update this count whenever a column (or row, respectively) is added; this optimization brings down the cost of finding the row and column with the most ones to O(n). Second, one can record the numbers of ones and zeros in the submatrix and update the two counts in each iteration; this optimization brings down the cost of density calculation to O(n). Overall, the complexity of FINDDSM_EXT2HOP becomes O(n^2).
Also proposed herein is an algorithm based on the singular value decomposition. A matrix A can be decomposed as A = UΣV^T, where U and V are orthogonal matrices and Σ is a diagonal matrix. The singular value σi is the ith diagonal entry of Σ, while the columns ui of U and vi of V are the corresponding singular vectors. A well-known result is that the best rank-k approximation of A in the least-squares sense is given by Ak = σ1u1v1^T + ... + σkukvk^T, i.e., the sum of the terms for the k largest singular values.
Consider the rank-1 approximation A1 = σ1u1v1^T. Intuitively, the components of u1 and v1 with the most significant values should span the submatrix of A1 with the most significant values. For the present problem, this submatrix of A1 should correspond to a dense submatrix of A, because A1 approximates A, a matrix with ones and zeros.
Therefore, there is proposed here a greedy algorithm FINDDSM_SVD based on the above intuition. Unfortunately, computing u1 and v1 requires O(n^3) time, which makes the approach less desirable computationally.
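The intuition may be illustrated with the following sketch, which is not the FINDDSM_SVD algorithm itself. It assumes a non-zero input matrix, swaps in plain power iteration for a full singular value decomposition in order to approximate u1 and v1, and selects rows and columns whose components reach a hypothetical fraction (one half) of the largest component; all function names are hypothetical.

```python
import math

def top_singular_vectors(a, iterations=100):
    """Approximate the leading singular vectors u1, v1 by power iteration."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    u = [0.0] * n_rows
    for _ in range(iterations):
        # u <- A v, normalized
        u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
        # v <- A^T u, normalized
        v = [sum(a[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return u, v

def significant_indices(vec, fraction=0.5):
    """Indices whose magnitude is at least `fraction` of the largest magnitude."""
    cutoff = fraction * max(abs(x) for x in vec)
    return {i for i, x in enumerate(vec) if abs(x) >= cutoff}

# Assumed toy matrix with a 3x2 block of ones in the upper left.
a = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1]]
u1, v1 = top_singular_vectors(a)
dense_rows = significant_indices(u1)
dense_cols = significant_indices(v1)
```

On this matrix the significant components of u1 and v1 pick out the rows and columns of the dense upper-left block; the threshold is a tuning choice and would normally be guided by the encoding density.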
Next to be presented is ENCODE (
Let m denote the number of nodes in Gr (i.e., the portals). The running time of ENCODE is O(m^3) (or O(m^4) if one uses FINDDSM_SVD). Computation of the transitive closure matrix can be done easily in O(m^3), using a simplified version of the Floyd-Warshall algorithm. One can control the main loop so that it executes at most O(m) times. First, note that FINDDSM_EXT2HOP can be run O(m) times before triggering the break condition of the main loop, because each run uses up one hop node. For FINDDSM_2APPROX and FINDDSM_SVD, one can run them O(m) times and then simply switch to running FINDDSM_EXT2HOP subsequently, which results in at most O(m) iterations of the main loop overall. The cost of each iteration is dominated by the cost of finding a dense submatrix, which takes O(m^2) time for FINDDSM_2APPROX and FINDDSM_EXT2HOP (or O(m^3) for FINDDSM_SVD), as discussed herein.
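The transitive closure computation mentioned above may be sketched with Warshall's boolean variant of the Floyd-Warshall algorithm, as follows. The function name transitive_closure is hypothetical, and the sketch leaves the diagonal at zero unless a node lies on a cycle.

```python
def transitive_closure(nodes, edges):
    """Return the 0/1 transitive closure matrix of a directed graph in O(m^3) time."""
    idx = {node: i for i, node in enumerate(nodes)}
    m = len(nodes)
    tr = [[0] * m for _ in range(m)]
    for u, v in edges:
        tr[idx[u]][idx[v]] = 1
    # Allow node k as an intermediate stop, one k at a time.
    for k in range(m):
        for i in range(m):
            if tr[i][k]:
                for j in range(m):
                    if tr[k][j]:
                        tr[i][j] = 1
    return tr

# Assumed toy remainder graph: p -> q -> r
tr = transitive_closure(["p", "q", "r"], [("p", "q"), ("q", "r")])
```

The resulting matrix has a one at (p, r) in addition to the two original edges.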
Given two nodes u and v, testing whether u can reach v is straightforward. If either one has a strongly connected component label, the node in the label is checked instead. One may preferably first check their interval labels. If the answer is not affirmative, one may preferably look up their portal labels and check reachability between u's out-portal and v's in-portal, which involves testing whether their remainder labels intersect, and whether the pair belongs to the zero- and one-exception sets (implemented as hash tables). All steps take constant time except testing whether two remainder labels intersect, which can be done in time linear in the lengths of these labels.
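The query procedure may be sketched as follows. The sketch is illustrative only: the dictionary layout of the labels is hypothetical, the interval test assumes containment, the one-exception set is taken to hold the sparse ones that were never encoded by a symbol, and a pair whose out-portal and in-portal coincide is assumed to be reachable.

```python
def reaches(u, v, labels):
    """Answer a reachability query from the hierarchical labels."""
    # Nodes in a collapsed component are replaced by their representative.
    u = labels["scc"].get(u, u)
    v = labels["scc"].get(v, v)
    if u == v:
        return True
    # Level 1: interval containment covers reachability along tree paths.
    start, end = labels["start"], labels["end"]
    if start[u] <= start[v] and end[v] <= end[u]:
        return True
    # Level 2: otherwise go through u's out-portal and v's in-portal.
    p = labels["out_portal"].get(u)
    q = labels["in_portal"].get(v)
    if p is None or q is None:
        return False
    if p == q or (p, q) in labels["one_exceptions"]:
        return True
    # Level 3: remainder labels intersect and the pair is not a recorded zero.
    shared = labels["r_out"].get(p, set()) & labels["r_in"].get(q, set())
    return bool(shared) and (p, q) not in labels["zero_exceptions"]

# Assumed toy labels: trees a -> c and b -> d, one non-tree edge c -> b,
# and a node c2 that was collapsed into c.
labels = {
    "scc": {"c2": "c"},
    "start": {"a": 1, "c": 2, "b": 5, "d": 6},
    "end": {"a": 4, "c": 3, "b": 8, "d": 7},
    "out_portal": {"a": "c", "c": "c"},
    "in_portal": {"b": "b", "d": "b"},
    "r_out": {"c": {"s1"}},
    "r_in": {"b": {"s1"}},
    "zero_exceptions": set(),
    "one_exceptions": set(),
}
```

For instance, the query from a to d fails the interval test but succeeds through the portals c and b, whose remainder labels share the symbol s1.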
By way of brief recapitulation, many labeling schemes have been proposed in the past, but most of them are optimized to exploit particular types of substructures in graphs and do not work well on other substructures. Proposed herein is a hierarchical approach that combines the strengths of existing approaches by labeling different types of substructures differently.
It is to be understood that the present invention, in accordance with at least one presently preferred embodiment, includes an arrangement for providing an input graph having at least one substructure associated therewith, and an arrangement for labeling the at least one substructure with reachability information. Together, these elements may be implemented on at least one general-purpose computer running suitable software programs. These may also be implemented on at least one Integrated Circuit or part of at least one Integrated Circuit. Thus, it is to be understood that the invention may be implemented in hardware, software, or a combination of both.
If not otherwise stated herein, it is to be assumed that all patents, patent applications, patent publications and other publications (including web-based publications) mentioned and cited herein are hereby fully incorporated by reference herein as if set forth in their entirety herein.
Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention.