A network may include a plurality of nodes and connections forming a network graph representing data and/or information of the network. For example, applications for network graphs may include social networks, computer networks, computer vision, large scale integrations, relational databases, evolutionary biology and the like. Network graph queries are used for extracting information from and/or about, or sending information to, one or more nodes of the network graph. For large networks, answering network graph queries in near real time with known querying systems and methods is typically difficult because the query processing time depends on the size of the network graph.
A system for performing network graph queries on a network graph may comprise a preprocessing module configured for generating a data structure from the network graph, and a query module configured for receiving a network query for a query set of nodes of the network graph and for generating a query response to the network query. The data structure may include a plurality of landmark nodes for each node of the network graph, a plurality of landmark distances connecting each node to its respective landmark nodes, a plurality of important nodes that is a subset of the nodes of the network graph and a plurality of paths connecting each important node to each other important node. The query response may be generated by constructing a weighted graph based on the data structure and the network query.
According to an embodiment, the weighted graph may be a gray-black graph constructed using the data structure and the network query.
According to an embodiment, the gray-black graph may include gray edges representing distances based on the landmark distances and black edges representing placeholders.
According to an embodiment, the query module may generate the query response by determining a plurality of forest components in the gray-black graph by deleting one or more of the black edges of the gray-black graph and determining a set of least-cost hook paths for connecting the plurality of forest components using the set of important nodes of the data structure.
According to an embodiment, a computer-implemented method for processing a network graph having a plurality of nodes interconnected by a plurality of edges may comprise generating, using a processor and based on the network graph, a data structure for representing a plurality of landmark nodes for each node of the network graph, a plurality of landmark distances connecting each node to its respective landmark nodes, a plurality of important nodes that is a subset of the nodes of the network graph and a plurality of paths connecting each important node to each other important node. The computer-implemented method may also comprise receiving a network query for a query set of nodes of the network graph and generating, using the processor, a query response to the network query. The query response may be generated by constructing a weighted graph based on the data structure and the network query.
According to an embodiment, the weighted graph of the computer-implemented method may be a gray-black graph including gray edges representing distances based on the landmark distances and black edges representing placeholders.
According to an embodiment, the computer-implemented method may further comprise computing, using the processor, a Minimum Spanning Tree for the gray-black graph, determining a plurality of forest components by deleting one or more of the black edges of the gray-black graph, determining a set of least-cost hook paths for connecting the plurality of forest components using the set of important nodes of the data structure, and generating the query response based on the plurality of forest components and the set of least cost hook paths.
According to an embodiment, the query response may be generated using a Steiner Tree format, Cheapest Tour format, or Minimum Spanning Tree format.
According to an embodiment, a system for performing network graph queries on a network graph may comprise a preprocessing module configured for generating and dynamically maintaining a data structure representing a Minimum Spanning Tree for the network graph, and a query module configured for generating a query response to a network query by outputting the current Minimum Spanning Tree for the network graph. The data structure may comprise a plurality of substructures, each substructure comprising a set of connected components representing at least a portion of the network graph, and a set of edges forming a spanning forest for the set of connected components of the substructure.
According to an embodiment, the preprocessing module may store the set of edges forming the spanning forest of the set of connected components of each substructure of the plurality of substructures of the network graph in a plurality of subforests each of which is arranged in a Euler tree structure.
According to an embodiment, the Euler tree structure may be based on edge levels defining subforests of the spanning forest.
According to an embodiment, the data structure may also comprise a top tree storing the highest level subforest from each substructure, with the top tree of the highest substructure forming an approximate Minimum Spanning Tree for the network graph.
According to an embodiment, the preprocessing module may generate the approximate Minimum Spanning Tree by rounding a weight associated with one or more edges of the network graph.
According to an embodiment, the preprocessing module may dynamically maintain the data structure by adding and deleting edges connecting nodes in the dynamic Minimum Spanning Tree to compensate for changes in the portion of the network graph.
According to an embodiment, a computer-implemented method for processing a network graph having a plurality of nodes interconnected by a plurality of edges may comprise generating, using a processor and based on the network graph, a data structure representing a Minimum Spanning Tree for the network graph, receiving a network query for the network graph, and generating, using the processor, a query response to the network query. The data structure may comprise a plurality of substructures, each substructure comprising a set of connected components representing at least a portion of the network graph and a set of edges forming a spanning forest for the set of connected components of the substructure. The query response may be generated by outputting the current Minimum Spanning Tree represented by the data structure.
According to an embodiment, the computer-implemented method may further comprise dynamically updating the data structure in a memory based on updates to one or more connections between nodes of the network graph.
According to an embodiment, dynamically updating the data structure may further comprise updating the Minimum Spanning Tree for the network graph by adding or deleting one or more edges of the Minimum Spanning Tree based on updates to the one or more connections of the network graph.
According to an embodiment, the computer-implemented method may further comprise storing the set of edges forming the spanning forest of the set of connected components of each substructure of the plurality of substructures of the network graph in a plurality of subforests, each of which is arranged in a Euler tree structure, and adding or deleting one or more edges of the Minimum Spanning Tree based on updates to the one or more connections of the network graph by respectively adding or deleting one or more edges connecting two nodes of one or more substructures in the Euler tree structures.
According to an embodiment, the highest level subforest from each substructure may be stored as a top tree in the data structure, with the top tree of the highest substructure forming an approximate Minimum Spanning Tree for the network graph.
According to an embodiment, adding a new edge connecting two nodes in the Minimum Spanning Tree may comprise identifying if a substructure of the current Minimum Spanning Tree includes both nodes of the new edge in the same connected component, determining if the identified substructure is higher than a substructure of the current Minimum Spanning Tree to which the new edge is being added, and replacing the existing edge with the new edge in the plurality of substructures if the identified substructure is higher than the substructure of the current Minimum Spanning Tree to which the new edge is being added.
According to an embodiment, deleting an existing edge connecting two nodes in the Minimum Spanning Tree may comprise finding a replacement edge in the lowest substructure of the network graph connecting the two connected components in which the two nodes of the existing edge belong, deleting the existing edge from one or more substructures of the plurality of substructures, and inserting the replacement edge in the one or more substructures of the plurality of substructures.
These and other embodiments will become apparent in light of the following detailed description herein, with reference to the accompanying drawings.
Before the various embodiments are described in further detail, it is to be understood that the invention is not limited to the particular embodiments described. It will be understood by one of ordinary skill in the art that the systems and methods described herein may be adapted and modified as is appropriate for the application being addressed and that the systems and methods described herein may be employed in other suitable applications, and that such other additions and modifications will not depart from the scope thereof.
In the drawings, like reference numerals refer to like features of the systems and methods of the present application. Accordingly, although certain descriptions may refer only to certain Figures and reference numerals, it should be understood that such descriptions might be equally applicable to like reference numerals in other Figures.
Referring to
The computerized system 10 includes the necessary electronics, software, memory, storage, databases, firmware, logic/state machines, microprocessors, communication links, and any other input/output interfaces to perform the functions described herein and/or to achieve the results described herein. For example, the computerized system 10 may include one or more processors 22 and memory 24, which may include system memory, including random access memory (RAM) and read-only memory (ROM). Suitable computer program code may be provided to the computerized system 10 for executing numerous functions, including those discussed in connection with the preprocessing module 14 and query module 16. For example, in embodiments, the preprocessing module 14 and query module 16 may be stored in memory 24 of the computerized system 10 and may be executed by the processor 22, as should be understood by those skilled in the art.
The one or more processors 22 may include one or more conventional microprocessors and one or more supplementary co-processors such as math co-processors or the like. The one or more processors 22 may communicate with other networks and/or devices such as servers, other processors, computers, smart phones, cellular telephones, tablets and the like and may receive queries 11 therefrom, as should be understood by those skilled in the art.
The one or more processors 22 may be in communication with memory 24, which may comprise an appropriate combination of magnetic, optical and/or semiconductor memory, and may include, for example, RAM, ROM, flash drive, an optical disc such as a compact disc and/or a hard disk or drive. The one or more processors 22 and the memory 24 may be, for example, located entirely within a single computer or other device; or connected to each other by a communication medium, such as a USB port, serial port cable, a coaxial cable, an Ethernet type cable, a telephone line, a radio frequency transceiver or other similar wireless or wired medium or combination of the foregoing.
The memory 24 may store a variety of data and any other information required by and/or generated by the preprocessing module 14 and query module 16, an operating system, and/or one or more other programs (e.g., computer program code and/or a computer program product) adapted to direct the preprocessing module 14 and query module 16 to perform according to the various embodiments discussed herein. The preprocessing module 14, query module 16 and/or other programs discussed herein may be stored, for example, in a compressed, an uncompiled and/or an encrypted format, and may include computer program code executable by the one or more processors 22. The instructions of the computer program code may be read into a main memory of the one or more processors 22 from the memory 24 or a computer-readable medium other than the memory 24. While execution of sequences of instructions in the program causes the one or more processors 22 to perform the process steps described herein, hard-wired circuitry may be used in place of, or in combination with, software instructions for implementation of the processes of the present invention. Thus, embodiments of the present invention are not limited to any specific combination of hardware and software.
The methods and programs discussed herein may also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices or the like. Programs may also be implemented in software for execution by various types of computer processors. A program of executable code may, for instance, comprise one or more physical or logical blocks of computer instructions, which may, for instance, be organized as an object, procedure, process or function. Nevertheless, the executables of an identified program need not be physically located together, but may comprise separate instructions stored in different locations which, when joined logically together, comprise the program and achieve the stated purpose for the programs such as providing localization activity recognition. In an embodiment, an application of executable code may be a compilation of many instructions, and may even be distributed over several different code partitions or segments, among different programs, and across several devices.
The term “computer-readable medium” as used herein refers to any medium that provides or participates in providing instructions and/or data to the one or more processors of the computerized system 10 (or any other processor of a device described herein) for execution. Such a medium may take many forms, including but not limited to, non-volatile media and volatile media. Non-volatile media include, for example, optical, magnetic, or opto-magnetic disks, such as memory. Volatile media include dynamic random access memory (DRAM), which typically constitutes the main memory. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM or EEPROM (electronically erasable programmable read-only memory), a FLASH-EEPROM, any other memory chip or cartridge, or any other medium from which a computer can read.
Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the one or more processors (or any other processor of a device described herein) for execution. For example, the instructions may initially be stored on a magnetic disk of a remote computer (not shown). The remote computer can load the instructions into its dynamic memory and send the instructions over an Ethernet connection, cable line, telephone line using a modem, wirelessly or over another suitable connection. A communications device local to a computing device can receive the data on the respective communications line and place the data on a system bus for the one or more processors. The system bus carries the data to the main memory, from which the one or more processors 22 retrieve and execute the instructions. The instructions received by main memory may optionally be stored in memory 24 either before or after execution by the one or more processors 22. In addition, instructions may be received via a communication port as electrical, electromagnetic or optical signals, which are exemplary forms of wireless communications or data streams that carry various types of information.
In operation, the preprocessing module 14 of the computerized system 10 preprocesses the network graph 12 in order to answer network graph queries 11. Referring to
At step 28, the preprocessing module 14 constructs distance oracles on the network graph 12. Various methods for distance oracle construction are known in the art, all of which may be implemented by the preprocessing module 14. For example, the distance oracles may be constructed using the method described in the article by Mikkel Thorup and Uri Zwick, Approximate Distance Oracles (STOC, pages 183-192, 2001), which is hereby incorporated by reference in its entirety (hereinafter the “TZ method”). Using the TZ method, in one embodiment the preprocessing module 14 randomly and recursively samples nodes 18 from the network graph 12 and constructs a series of randomized node subsets At−1 ⊂ At−2 ⊂ At−3 ⊂ . . . ⊂ A1 ⊂ A0=V, where V is the set of all nodes 18 within the network graph 12 and t is a tradeoff factor that is greater than or equal to one.
The preprocessing module 14 then constructs, for every node v of the set V of all nodes 18, a set of landmark nodes Bv and computes and stores, as the distance oracles, distances from the node v to each landmark node in Bv. For example, the following exemplary computer pseudo-code may be implemented in the preprocessing module 14 for constructing the set of landmark nodes Bv and the distance oracles:
where n=|V|; and
At step 30, the preprocessing module 14 then defines a set of important nodes Aimp as:
Aimp=Ar ∪ Ar+1 ∪ . . . ∪ At−1=Ar
where a ceiling for r is set to
wherein the symbol [value] stands for the ceiling value.
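By way of illustration only, the sampling of the nested subsets and the selection of the nearest landmark in a given subset may be sketched as follows. This is a minimal sketch under stated assumptions, not the pseudo-code of the preprocessing module 14: the function names `sample_hierarchy` and `nearest_landmarks`, the adjacency-list input format, and the sampling probability n^(−1/t) are assumptions made for the example, and the level r is simply passed in by the caller. Because the subsets are nested, the set of important nodes Aimp corresponds to `levels[r]`.

```python
import heapq
import random


def sample_hierarchy(nodes, t, rng):
    """Return nested subsets levels[0] >= levels[1] >= ... >= levels[t-1].

    levels[0] is the full node set V; each later level keeps every node
    of the previous level independently with probability n**(-1/t)
    (an assumed sampling rate for this sketch).
    """
    n = len(nodes)
    p = n ** (-1.0 / t)
    levels = [set(nodes)]
    for _ in range(1, t):
        levels.append({v for v in levels[-1] if rng.random() < p})
    return levels


def nearest_landmarks(adj, sources):
    """Multi-source Dijkstra.

    adj maps a node to a list of (neighbor, weight) pairs.
    Returns {node: (closest node in `sources`, distance to it)} for
    every node reachable from `sources`.
    """
    best = {}
    heap = [(0.0, s, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, v, src = heapq.heappop(heap)
        if v in best:
            continue  # already settled with a shorter distance
        best[v] = (src, d)
        for u, w in adj[v]:
            if u not in best:
                heapq.heappush(heap, (d + w, u, src))
    return best
```

Calling `nearest_landmarks(adj, levels[r])` would then give, for each node, a hook node in Aimp together with the length of its hook path. A sampled level may be empty for small inputs, which a fuller implementation would need to handle.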
Referring to
Referring back to
where:
The preprocessing module 14 may store the data structure D in memory 24, shown in
Referring to
At step 38, the query module 16 determines a type of query response 13 for answering the network query 11. The type of query response 13 may depend upon information requested about the query set S in the network graph query 11 and may include determining a minimum spanning tree (MST), Steiner tree (ST), cheapest tour (CT) or any similar tree structure or response, as should be understood by those skilled in the art. For example, if the network graph query 11 is a simple distance query requesting the shortest path between a pair of given nodes 18, the query module 16 may determine the query response 13 for satisfying the network query 11 as the CT of the shortest path between the pair of nodes 18. If the network query 11 includes users of a network to whom multicast data is to be sent, the query module 16 may determine that the ST interconnecting the nodes of the query set S is the query response 13 for satisfying the network query 11.
In order to determine a MST, ST, or CT, at step 40, the query module 16 constructs a gray-black graph GB using the data structure D stored by the preprocessing module 14, shown in
iterations. For example, the following exemplary computer pseudo-code may be implemented in the query module 16 for performing the oscillating calculation:
If the calculation terminates within r iterations, there is a 2r−1 approximate distance between the pair of nodes u,v denoted by dalg(u,v) and the query module 16 colors (or designates) the edge 20 between the pair of nodes u,v as a gray edge and sets a weight w(u,v) for the edge 20 between the pair of nodes u,v to dalg(u,v). Alternatively, if the calculation does not terminate within r iterations, the query module 16 colors (or designates) the edge 20 between the pair of nodes u,v as a black edge and sets the weight w(u,v) for the edge 20 between the pair of nodes u,v to two times a maximum of the hook edges mu and mv of u and v, respectively, where the hook edges mu and mv are the paths connecting the nodes u and v to their landmark nodes sr(u) and sr(v), respectively, within the set of nodes Ar=Aimp. The gray-black graph GB may thus be constructed as a set of gray and black edges, where the gray edges may be considered as the real edges having weights within a t−1 factor of the actual distance between corresponding nodes 18 in the original network graph 12 and the black edges may be considered as the placeholders to be further processed by the query module 16 as described below. The following exemplary computer pseudo-code may be implemented in the query module 16 for constructing the gray-black graph:
where:
Once the query module 16 has constructed the gray-black graph GB, at step 42, the query module 16 uses the gray-black graph GB and distances dalg(u,v) stored therein, along with the data structure D comprising the distances between every pair of nodes in the set of important nodes Aimp to generate the query response 13 (e.g., based on a computed MST, CT, or ST as appropriate). To generate the query response 13, the query module 16 may first compute a minimum spanning tree (MST) T on the gray-black graph GB at step 44. Various methods for computing MSTs are known in the art, all of which may be implemented by query module 16. At step 46, the query module 16 then deletes the black edges from the MST T in the gray-black graph GB since only the gray edges are considered to be real edges, as discussed above. The deletion of the black edges results in a forest Fgr having components C1,C2, . . . , Cl comprising nodes 18, shown in
In order to provide the query response 13 based on the computed MST, the query module 16 determines a set R of least cost hook path nodes for connecting the components C1,C2, . . . , Cl to the set of important nodes Aimp. Specifically, for each component Ci, the query module selects a representative node wi with the shortest path to a hook node in the set of important nodes Aimp. The set of representative nodes wi for all components C1,C2, . . . , Cl is the set R of least cost hook path nodes and the corresponding set of hook nodes in Aimp is H(R). The paths between the nodes wi of the set R of least cost hook path nodes and the respective hook nodes in the set of hook nodes H(R) form the hook path set HP(R).
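The sequence described above (computing an MST on the gray-black graph, deleting its black edges to obtain the forest components, and choosing one representative node per component) may be sketched as follows. This is an illustrative sketch only: the edge tuple layout `(weight, u, v, color)`, the helper names `DisjointSet`, `gray_forest` and `pick_representatives`, and the `hook_dist` mapping from each node to its hook node and hook path length are assumptions for the example, and Kruskal's algorithm is used simply as one of the known MST methods.

```python
class DisjointSet:
    """Union-find over a fixed collection of hashable items."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def gray_forest(nodes, edges):
    """Return (gray MST edges, forest components).

    edges is a list of (weight, u, v, color) with color "gray" or "black".
    """
    mst_sets = DisjointSet(nodes)
    # Kruskal: scan edges by increasing weight, keep those joining two trees.
    mst = [e for e in sorted(edges, key=lambda e: e[0])
           if mst_sets.union(e[1], e[2])]
    # Black edges are placeholders, so only gray tree edges are kept.
    gray = [e for e in mst if e[3] == "gray"]
    comp_sets = DisjointSet(nodes)
    for _, u, v, _ in gray:
        comp_sets.union(u, v)
    groups = {}
    for v in nodes:
        groups.setdefault(comp_sets.find(v), []).append(v)
    return gray, list(groups.values())


def pick_representatives(components, hook_dist):
    """For each component, pick the node with the shortest hook path.

    hook_dist[v] = (hook node in Aimp, length of the hook path from v).
    """
    return [min(comp, key=lambda v: hook_dist[v][1]) for comp in components]
```

The representatives returned by `pick_representatives` play the role of the set R, and the hook nodes looked up for them in `hook_dist` play the role of H(R).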
At step 50, the query module 16 is able to compute the query response 13 from the forest Fgr and the set R of least cost hook path nodes since all of the nodes of the set of hook nodes H(R) are within the set of important nodes Aimp stored in the data structure D and since the distances between each pair of nodes in the set of important nodes Aimp are also stored in the data structure D. As should be understood by those skilled in the art, the query response 13 constructed by the query module 16 at step 50 will depend on the type of query response required for answering the network query 11, such as a ST or a CT.
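As a hedged illustration of why the stored pairwise distances suffice, the hook nodes may be joined using only lookups into those distances, and the result combined with the forest and the hook paths. The sketch below uses Prim's algorithm over the complete graph on the hook nodes as a simple stand-in for the tree computed on H(R); it is not presented as the procedure of the query module 16. The names `connect_hooks` and `assemble_response`, and the `dist` table keyed by `frozenset` pairs, are assumptions for the example.

```python
def connect_hooks(hooks, dist):
    """Span the hook nodes using stored pairwise distances (Prim).

    dist maps frozenset({a, b}) to the stored distance between the
    important nodes a and b. Returns a list of (a, b, weight) edges.
    """
    hooks = list(dict.fromkeys(hooks))  # drop duplicate hooks, keep order
    if not hooks:
        return []
    in_tree = {hooks[0]}
    connectors = []
    while len(in_tree) < len(hooks):
        # Cheapest stored distance from the tree to a hook outside it.
        w, a, b = min(
            (dist[frozenset((a, b))], a, b)
            for a in in_tree
            for b in hooks
            if b not in in_tree
        )
        connectors.append((a, b, w))
        in_tree.add(b)
    return connectors


def assemble_response(forest_edges, hook_paths, connectors):
    """Union of the forest, the hook paths and the hook connectors."""
    return list(forest_edges) + list(hook_paths) + list(connectors)
```

Two components may share the same hook node, which is why duplicates are removed before the hooks are connected.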
For example, when returning a ST as the query response 13, the query module may compute a ST, denoted by T̂, on the set of hook nodes H(R) and return the query response 13 as Talg=Fgr∪T̂∪HP(R), namely, the combination of the forest Fgr, the ST T̂ on the set of hook nodes H(R) and the distances in the hook path set HP(R). The following exemplary pseudo-code instructions may be implemented as computer code in the query module 16 for generating ST query responses 13:
where:
C(MST(G[S]))≧2C(OST(S))−w(e);
At step 50, when returning a CT as the query response 13, the query module may compute an approximate CT, denoted by Ĉ on the nodes of the set of hook nodes H(R) and then return the query response 13 as Calg=Ĉ∪HP(R)2∪Fgr2, where, for any given subgraph H, H2 is the subgraph obtained by duplicating the edges of H. The following exemplary pseudo-code instructions may be implemented as computer code in the query module 16 for generating CT query responses 13:
where Christofides calculation includes the following steps:
At step 52, the query module 16 returns the query response 13 that answers the network query 11. Thus, the computerized system 10, shown in
Referring to
Within each substructure Ti, the computerized system 10, shown in
The computerized system 10 dynamically maintains connected components in the approximate MST T by mapping the problem of computing the approximate MST T to the problem of finding connected components in the set of forest components. Referring to
If the edge e does not need to be added to the approximate MST T, at step 56, the new edge e is added to the data structure 53, shown in
Alternatively, if the computerized system 10 determines that the new edge e should be added to the approximate MST T at step 55, the computerized system 10 determines if the insertion of new edge e requires the removal of an existing edge f connecting the nodes u and v of weight w(f)>w(e) at step 57. If insertion of the edge e does not require removal of existing edge f, the computerized system 10 adds the new edge e to the data structure 53, shown in
If, at step 57, the computerized system 10 determines that insertion of the new edge e connecting the nodes u and v requires removal of existing edge f from the approximate MST T of the approximate graph G, at step 60, the computerized system 10 deletes the existing edge f from the data structure 53, shown in
At step 62, the computerized system 10 may then add the new edge e as a tree edge in the data structure 53, shown in
At step 64, the computerized system 10 may then reinsert the edge f into the data structure 53, shown in
If, at step 54, a new edge has not been added, the computerized system 10 then considers whether edge e connecting two nodes u and v and having weight w(e)=(1+ε)r−1 has been deleted from network graph 12, shown in
Alternatively, if the edge e is a tree edge in the approximate MST T, at step 72, the computerized system 10 finds a replacement existing edge f of weight w(f)≧w(e) to add to the approximate MST T. The existing edge f for replacing edge e may be found, for example, by applying the replacement procedure of the HLT method at every construction of substructures Ti where i≧r, for increasing values of i until the existing edge f is found by the computerized system 10. Finding the replacement edge f for edge e does not impact the Top Trees TTi since, although some edges may change levels, all edges in Fi remain included in forest Fi0, at level 0, and the Top Trees TTi are only maintained for these edges. When selecting the replacement edge f, for a particular substructure Ti the only relevant edges are the non-tree edges of weights w=(1+ε)i−1, since all lower weight edges would have been considered earlier when selecting the edge e for the approximate MST T. Thus, the computerized system 10 may find the replacement edge f, if such an edge exists, in, for example, substructure T. At step 74, the computerized system 10 then deletes the edge e from all constructions of substructures Ti where i≧r, including the Top Trees TTi. For example, the computerized system 10 may delete edge e using the delete procedure of the HLT method discussed in connection with step 60. At step 76, the computerized system 10 inserts the replacement edge f, if such an edge exists, as a tree edge of the forest Fi in all constructions of substructures Ti where i≧s. For example, the computerized system may insert the replacement edge f using the insertion procedure of the HLT method discussed in connection with step 62. The level of replacement edge f within each substructure Ti remains unchanged. The computerized system 10 may also insert the replacement edge f in all Top Trees TTi for all i≧max {s,2}. 
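The effect of deleting a tree edge and reconnecting the tree with a replacement edge may be illustrated with the following simplified sketch. It deliberately does not implement the HLT method or the substructures Ti: it rebuilds one side of the cut by a linear-time traversal and scans all non-tree edges, which is far slower than the procedure described above but yields the same kind of replacement edge. The function name `delete_tree_edge` and the `(weight, u, v)` edge layout are assumptions for the example.

```python
def delete_tree_edge(tree_edges, non_tree_edges, removed):
    """Delete `removed` from the spanning tree and try to reconnect it.

    Edges are (weight, u, v) tuples. Both lists are updated in place.
    Returns the replacement edge, or None if the tree stays split.
    """
    tree_edges.remove(removed)
    _, u, _ = removed

    # Collect the nodes still reachable from u through tree edges.
    adj = {}
    for _, a, b in tree_edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    side, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for y in adj.get(x, []):
            if y not in side:
                side.add(y)
                stack.append(y)

    # A replacement must have exactly one endpoint on u's side of the cut.
    crossing = [e for e in non_tree_edges if (e[1] in side) != (e[2] in side)]
    if not crossing:
        return None
    f = min(crossing, key=lambda e: e[0])  # lightest crossing edge
    non_tree_edges.remove(f)
    tree_edges.append(f)
    return f
```

When the tree being maintained is a minimum spanning tree, the lightest crossing edge has weight at least that of the deleted edge, consistent with the condition w(f)≧w(e) stated above.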
Thus, the computerized system 10 may advantageously update the approximate graph G through edge deletion of both tree and non-tree edges of the approximate MST T.
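The edge weights of the form (1+ε)^(r−1) used in the steps above result from rounding each edge weight to a power of (1+ε). A minimal sketch of one such rounding is given below; it assumes that weights are normalized so that the smallest weight is at least 1 and that weights are rounded up, neither of which is stated in the description, and the function name `weight_class` is hypothetical.

```python
def weight_class(w, eps):
    """Return (r, rounded) with rounded = (1 + eps) ** (r - 1) >= w.

    r is the smallest index r >= 1 whose weight class covers w.
    Assumes w >= 1 and eps > 0.
    """
    r, bound = 1, 1.0
    while bound < w:
        bound *= 1.0 + eps  # move up to the next weight class
        r += 1
    return r, bound
```

With this rounding, each weight grows by a factor of at most (1+ε), so a minimum spanning tree computed on the rounded weights costs at most (1+ε) times the optimum on the original weights.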
The computerized system 10 may dynamically maintain the approximate MST T by continuing to add and delete edges, as necessary, according to steps 54 through 76, while continuing to maintain the invariant Fi⊂Fi+1 for all 1≦i≦k−1. The computerized system 10 advantageously improves maintenance of an approximate MST T on a fully dynamic network graph 12 by accommodating edge additions and deletions in the approximate graph G and in the approximate MST T of the network graph 12. For example, by maintaining a (1+ε) approximate MST (for an arbitrarily small constant ε>0) rather than the optimal MST, the computerized system 10 may provide an amortized running time of O(log^3 n) per operation as compared to known amortized running times that are O(log^4 n) per operation. This improvement is achieved by jointly maintaining connected components at log n different sets of edge weights and by quickly identifying and removing heavy edges in the cycle formed after edge insertion according to the method shown in
As discussed above, the Top Trees TTi are adapted to handle path queries and may maintain additional information used by the computerized system 10 for this purpose. For example, in addition to maintaining dynamic forests under the edge insertion and deletion operations, as discussed above, the Top Trees may also support an Expose operation in O(log n) amortized time that, for any two different vertices u and v that are within the same forest Fi in the approximate MST T, returns a cluster of the Top Tree TTi for the operation Expose(u,v) within which the path from u to v in the approximate MST T is contained. This provides the computerized system 10 with constant time access to path information maintained in the Top Tree TTi for the u to v path of the approximate MST T. The Top Tree TTi maintains a pointer p(C)=e on the path from u to v in the approximate MST T, where:
Each path cluster C in the Top Tree TTi is associated with at most two special vertices of the graph called the boundary nodes and may be used by the computerized system 10 to maintain path values for these nodes. Updates to the Top Tree TTi may be implemented by the computerized system 10 as a sequence of two basic operations on the clusters C called Merge and Split that allow the computerized system 10 to maintain the path cluster information P(C).
For example, C=Merge(A,B) returns a new cluster C with children A and B by combining Top Tree components TA and TB in a Top Tree with root C. The computerized system 10 sets p(C)=null if either C is not a path cluster or both p(A)=p(B)=null. Otherwise, the computerized system 10 sets p(C)=e, where e is the edge pointed to by either the non-null pointer p(A) or p(B).
For the operation Split(C), the computerized system 10 splits a root cluster C of Top Tree T, having children A and B, into two Top Tree components TA and TB and deletes C. For the Split operation, the computerized system 10 does not need to change the pointers of the child clusters.
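The pointer rules for Merge and Split described above may be sketched as follows. This is a simplified illustration, not a full Top Tree: the `Cluster` class, its field names, and the `is_path` flag supplied by the caller are assumptions for the example, and a real Top Tree would derive from its own structure whether a cluster is a path cluster.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Edge = Tuple[str, str, float]  # (u, v, weight), an assumed representation


@dataclass
class Cluster:
    is_path: bool
    pointer: Optional[Edge] = None      # p(C): an edge on the cluster path, or None
    left: Optional["Cluster"] = None    # child A
    right: Optional["Cluster"] = None   # child B


def merge(a: Cluster, b: Cluster, is_path: bool) -> Cluster:
    """Create a parent cluster C of a and b and set p(C)."""
    if not is_path or (a.pointer is None and b.pointer is None):
        pointer = None
    else:
        # Take whichever child pointer is non-null.
        pointer = a.pointer if a.pointer is not None else b.pointer
    return Cluster(is_path=is_path, pointer=pointer, left=a, right=b)


def split(c: Cluster) -> Tuple[Cluster, Cluster]:
    """Detach and return the children of c; their pointers are untouched."""
    a, b = c.left, c.right
    c.left = c.right = None
    return a, b
```

Both functions perform a fixed number of field reads and writes, which matches the constant-time cost attributed to these operations.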
Both the Merge and Split operations take constant time and, therefore, all operations for dynamically maintaining the approximate MST T, including dynamically maintaining the Top Tree TT, under edge insertion and deletion and querying for an edge of weight (1+ε)i−1 on the path from nodes u to v can be performed by the computerized system 10 in O(logn) amortized time. Additionally, by dynamically maintaining the approximate MST T, the computerized system 10 may avoid having to compute the MST for a particular set of nodes 18, shown in
The computerized system 10, shown in
The computerized system 10, shown in
For example, the computerized system 10, shown in
The computerized system 10, shown in
Although this invention has been shown and described with respect to the detailed embodiments thereof, it will be understood by those skilled in the art that various changes in form and detail thereof may be made without departing from the spirit and the scope of the invention.