The invention relates generally to computer systems, and more particularly to an improved system and method for finding connected components in a large-scale graph.
Many models have been proposed to explain the structure and dynamics of social networks. However, most of these models are based on simulated graphs or on graphs that are small compared to real-world graphs of significant size. Furthermore, the interaction between users in many online applications may be modeled by a large-scale graph, for instance in order to determine a social network of online users. Such a graph may model on the order of a billion interactions between hundreds of thousands of users. Large graphs such as the web graph may be described as scale-free, in which the degree of nodes is independent of the size of the graph. See for example Albert-Laszlo Barabasi and Reka Albert, Emergence of Scaling in Random Networks, Science, 286:509, 1999.
Computing the connected components in such a large graph is a nontrivial task. In an undirected graph, the set of connected components is the set of maximal connected subgraphs of the graph. Each vertex in a component is connected via a path of edges to all other vertices in the component. In the case of undirected graphs, polynomial time algorithms exist. However, methods such as depth first search or finding eigenvectors assume that the graph can be held in memory, and they become impractical when the graph is too large for the set of vertices and edges to fit into memory on a single machine.
What is needed is a way to efficiently find the connected components of a graph that is too large to fit the set of vertices and edges into memory on a single machine. Such a system and method should be capable of finding the connected components without traversing the edges in the graph and should be capable of finding the connected components in a constant number of passes over the data.
The present invention provides a system and method for finding connected components in a large-scale graph. In a map-reduce framework for computing weakly connected components of a large-scale graph, one or more mappers may be operably coupled to one or more reducers. A mapper may receive a collection of edges for unique vertices, find connected components for subgraphs represented by the collection of edges, and output sets of edges for each vertex representing connected components of subgraphs. A mapper may include a subgraph union-find component that finds a maximal set of connected components for subgraphs by executing a union-find algorithm for a collection of edges. A reducer may receive sets of edges for vertices output by the mapper that represent connected components of subgraphs, find connected components for the graph by merging subgraphs of connected components, and output sets of edges for vertices representing connected components of the large-scale graph. The reducer may include a graph union-find component that finds a maximal set of connected components for a graph by executing a union-find algorithm for a collection of edges for vertices of subgraphs.
In an embodiment to compute weakly connected components of a large-scale graph, subsets of a collection of edges for unique vertices may be distributed to several mappers. Connected components of subgraphs represented by each subset of edges may be computed. Then the sets of edges for connected components of subgraphs may be sorted by vertex. In an embodiment, the sets of edges representing connected components of subgraphs may be distributed to one or more reducers to find maximal sets of weakly connected components of the large-scale graph. The sorted sets of edges for each vertex representing the maximal sets of connected components for subgraphs may be merged by a reducer to identify maximal sets of connected components of a graph, and the maximal sets of connected components of a graph may be output.
The present invention may be used by many applications for finding connected components in a large-scale graph. In applications such as social network analysis, computing the set of connected components identifies which users are reachable within the social network from a given user. By providing a map-reduce framework for computing weakly connected components of a large-scale graph, the present invention may be scalable for social network applications involving millions of users with billions of communications. Connected components may be computed in parallel across multiple machines on extremely large graphs.
Other advantages will become apparent from the following detailed description when taken in conjunction with the drawings, in which:
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer system 100 may include a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer system 100 and includes both volatile and nonvolatile media. For example, computer-readable media may include volatile and nonvolatile computer storage media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer system 100. Communication media may include computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For instance, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
The system memory 104 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 106 and random access memory (RAM) 110. A basic input/output system 108 (BIOS), containing the basic routines that help to transfer information between elements within computer system 100, such as during start-up, is typically stored in ROM 106. Additionally, RAM 110 may contain operating system 112, application programs 114, other executable code 116 and program data 118. RAM 110 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by CPU 102.
The computer system 100 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, discussed above and illustrated in
The computer system 100 may operate in a networked environment using a network 136 to one or more remote computers, such as a remote computer 146. The remote computer 146 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer system 100. The network 136 depicted in
The present invention is generally directed towards a system and method for finding connected components in a large-scale graph. A map-reduce framework may be provided for computing weakly connected components of a large-scale graph using mappers and reducers. A mapper may receive a collection of edges for unique vertices, find connected components for subgraphs represented by the collection of edges, and output sets of edges for each vertex representing connected components of subgraphs. A reducer may receive sets of edges for vertices output by the mapper that represent connected components of subgraphs, find connected components for the graph by merging subgraphs of connected components, and output sets of edges for vertices representing connected components of the large-scale graph. Connected components within a set of edges may be computed by executing a union-find algorithm over every edge to partition the set of vertices into disjoint subsets of connected components.
As will be seen, by providing a map-reduce framework for computing weakly connected components of a large-scale graph, the present invention may be scalable for social network applications involving millions of users with billions of communications. Connected components may be computed in parallel across multiple machines on extremely large graphs. As will be understood, the various block diagrams, flow charts and scenarios described herein are only examples, and there are many other scenarios to which the present invention will apply.
Turning to
In various embodiments, one or more mapper servers 202 may be operably coupled to one or more reducer servers 218 by a network 216. The mapper server 202 and the reducer server 218 may each be a computer such as computer system 100 of
The mapper server 202 may include a mapper 204 that receives a collection of edges for unique vertices, finds connected components for subgraphs represented by the collection of edges, and outputs sets of edges for each vertex representing connected components of subgraphs. The mapper 204 may include a subgraph union-find component 206 that finds a maximal set of connected components for subgraphs by executing a union-find algorithm for a collection of edges. Each of these components may be any type of executable software code that may execute on a computer such as computer system 100 of
The reducer server 218 may include functionality for receiving sets of edges for vertices that represent connected components of subgraphs, finding the connected components of a graph, and outputting the graph of connected components. The reducer server 218 may be operably coupled to a computer storage medium such as reducer storage 226 that may store a graph of one or more connected components 228 that include vertices 230 connected by edges 232. The reducer server 218 may include a reducer 220 that receives sets of edges for vertices that represent connected components of subgraphs, finds connected components for the graph by merging subgraphs of connected components, and outputs sets of edges for vertices representing connected components of a graph. The reducer 220 may include a graph union-find component 224 that finds a maximal set of connected components for a graph by executing a union-find algorithm for a collection of edges for vertices of subgraphs. The reducer 220 and graph union-find component 224 may be any type of executable software code that may execute on a computer such as computer system 100 of
There are many applications that may use the present invention to find connected components in a large-scale graph. For instance, the present invention may be used to determine a social network of online users. Consider for example an instant messaging application that allows users to exchange text, voice, and data between peers. Each message may translate to an HTTP request, similar to accessing a web page. Assuming that there is an exchange of messages between two users, a social network of instant messaging users may be represented by an undirected graph of connected components. Such a graph may model on the order of a billion communications between hundreds of thousands of users.
In particular, such a social network may be represented by a graph, G=(V,E), of weakly connected components. A weakly connected component (WCC) is a maximal subgraph of a directed graph such that for every pair of vertices (v,v′) in the subgraph, there is an undirected path from v to v′. From a perspective of sets, the set of WCCs partition the set of vertices into disjoint subsets.
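The definition above can be illustrated with a minimal in-memory sketch. It assumes a small directed edge list that fits on one machine; the function name weakly_connected_components and the sample edges are illustrative only and are not part of the described system. Edge direction is ignored, and a disjoint-set structure partitions the vertices into disjoint subsets:

```python
from collections import defaultdict


def weakly_connected_components(edges):
    """Partition the vertices of a directed edge list into WCCs.

    Direction is ignored: an edge (v, w) simply places v and w in the
    same disjoint subset.
    """
    parent = {}

    def find(v):
        # Register unseen vertices as their own root.
        parent.setdefault(v, v)
        # Walk up to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for v, w in edges:
        root_v, root_w = find(v), find(w)
        if root_v != root_w:
            parent[root_w] = root_v

    # Collect every vertex under its root to obtain the partition.
    groups = defaultdict(set)
    for v in parent:
        groups[find(v)].add(v)
    return sorted(sorted(g) for g in groups.values())


# "c" -> "b" and "a" -> "b" point in different directions, yet all three
# vertices share one weakly connected component.
edges = [("a", "b"), ("c", "b"), ("d", "e")]
print(weakly_connected_components(edges))  # [['a', 'b', 'c'], ['d', 'e']]
```

Every vertex appears in exactly one subset, which is the partition property stated above.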
A map-reduce framework may be implemented for finding weakly connected components. In an implementation of a single map-reduce task, there may be a map phase and a reduce phase. In general, the map phase may receive an edge set denoted by (v,v′) in an unspecified order and may find the connected components within the edge set. The map phase may output the resulting connected components to the reduce phase. The reduce phase may receive the connected components grouped by vertex so that the connected components that include the same vertex are presented contiguously to a single reducer for finding the maximal set of weakly connected components of the graph.
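The grouping between the two phases can be sketched as follows. This is a simplified, single-process illustration, assuming that each mapper's output is a list of (child, parent) vertex pairs; the function name group_by_child and the sample mapper outputs are hypothetical. In a real map-reduce runtime, the framework's shuffle and sort would perform this step.

```python
from itertools import groupby
from operator import itemgetter


def group_by_child(mapper_outputs):
    """Merge per-mapper (child, parent) pairs and group them by child.

    Sorting places all pairs for the same child vertex next to each
    other, so one reducer sees them contiguously.
    """
    pairs = sorted(p for output in mapper_outputs for p in output)
    return [
        (child, [parent for _, parent in group])
        for child, group in groupby(pairs, key=itemgetter(0))
    ]


mapper_outputs = [
    [(1, 1), (2, 1), (3, 1)],  # hypothetical output of one mapper
    [(3, 3), (4, 3)],          # hypothetical output of another mapper
]
for child, parents in group_by_child(mapper_outputs):
    print(child, parents)
```

In this example vertex 3 arrives with the parents [1, 3], which is the situation the reducer has to reconcile.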
In particular, an implementation may distribute the edge set (v,v′) ∈ E to m mappers, where each mapper m_i operates on some subset E_i ⊂ E such that ∪_i E_i = E. Each mapper may find the connected components within the set of edges given to it by executing a union-find algorithm over every edge in the subset. For more details about the union-find algorithm, see for example H. Kaplan, N. Shafrir, and R. Tarjan, Union-Find with Deletions, In Proceedings 13th Symposium on Discrete Algorithms (SODA), pages 19-28, 2002. The resulting WCCs on each mapper may be defined by child-parent pairs of vertices, {(v_x,p_x) | x ∈ v_i}, such that all child vertices, v_x, with the same parent vertex, p_x, belong to the same WCC. A single reducer may then operate on the child-parent pairs of vertices, (v_x,p_x), sorting the pairs by child vertex value and resolving any conflict in which a child vertex is assigned to multiple parent vertices. Such a conflict can occur if one mapper assigns a child vertex v to a parent p and another mapper assigns the same child vertex to a different parent p′≠p. The conflicting parent vertices are resolved by running a union-find algorithm over the set of conflicting parent and child vertices. The parents of the parent vertices (grandparents) resulting from execution of the union-find algorithm denote the merged WCCs, which may be output as grandparent-parent-child triples (p′,p,v) of vertices. Thus, two vertices v and v′ belong to the same WCC denoted by p′ if there exist triples (p′,·,v) and (p′,·,v′).
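The mapper side of this description can be sketched as below. This is a minimal single-process sketch under stated assumptions: the helper names find, union and map_phase are hypothetical, the round-robin split of E into m subsets is only one possible distribution, and choosing the smaller vertex id as the parent is an arbitrary tie-breaking rule that the description does not prescribe.

```python
def find(parent, v):
    """Return the root of v, compressing the path along the way."""
    parent.setdefault(v, v)
    root = v
    while parent[root] != root:
        root = parent[root]
    while parent[v] != root:
        next_v = parent[v]
        parent[v] = root
        v = next_v
    return root


def union(parent, a, b):
    """Merge the sets of a and b (assumption: the smaller root id wins)."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[max(root_a, root_b)] = min(root_a, root_b)


def map_phase(edge_subset):
    """Run union-find over one mapper's edges; emit (child, parent) pairs."""
    parent = {}
    for v, w in edge_subset:
        union(parent, v, w)
    return sorted((v, find(parent, v)) for v in list(parent))


edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
m = 2
subsets = [edges[i::m] for i in range(m)]  # subsets whose union is E
for i, subset in enumerate(subsets):
    print(i, map_phase(subset))
# 0 [(1, 1), (2, 1), (3, 3), (4, 3)]
# 1 [(2, 2), (3, 2), (5, 5), (6, 5)]
```

Here vertex 2 is assigned parent 1 by the first mapper and parent 2 by the second, and vertex 3 is assigned parents 3 and 2. These are the conflicts that the reducer resolves.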
The overall process of finding connected components in a large-scale graph may be represented by
At step 308, the sets of edges for each vertex representing the maximal sets of connected components for subgraphs may be sorted by child vertex value. The sorted sets of edges for each vertex may then be sent at step 310 to one or more reducers to find a graph of maximal sets of connected components. In an embodiment, a reducer may execute on the same computer as one or more mappers. In various embodiments, a reducer may execute on one or more reducer servers. At step 312, sorted sets of edges for each vertex representing the maximal sets of connected components for subgraphs may be merged to identify maximal sets of connected components of a graph. At step 314, the maximal sets of connected components of a graph may be output as grandparent-parent-child triples (p′,p,v) of vertices.
At step 506, a set of edges for a vertex represented by a child-parent pair of vertices that represent the connected components for subgraphs may be obtained from the sets of edges for sorted vertices. It may be determined at step 508 whether the vertex is a duplicate of a vertex previously obtained from the sets of edges for sorted vertices. If not, then the set of edges for the vertex may be output at step 512. Otherwise, it may be determined at step 510 whether the parent vertices of the vertex are the same. If so, then the set of edges for the vertex may be output at step 512 as a grandparent-parent-child triple, (p′,p,v). Otherwise, a union-find algorithm may be executed on the set of edges for each parent vertex and its child vertices at step 514 to find the maximal sets of connected components for the set of edges for each parent vertex and its child vertices. The maximal sets of connected components for the set of edges for each parent vertex and its child vertices may then be output at step 516. In an embodiment, the set of edges for a triple of a grandparent vertex, a parent vertex and a child vertex, (p′,p,v), that represent a maximal set of a connected component may be output for each connected component of the graph. At step 518, it may be determined whether the last set of edges for a vertex from the sets of edges for sorted vertices has been processed. If not, then processing may continue at step 506 where the set of edges for the next vertex may be obtained from the sets of edges for sorted vertices. Otherwise, if the last set of edges for a vertex from the sets of edges for sorted vertices has been processed, then processing may be finished for computing the connected components of a large-scale graph from the connected components of subgraphs in a map-reduce framework. 
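The reducer logic can be sketched as follows, under stated assumptions. The function name reduce_phase is hypothetical, and the input is assumed to be an in-memory list of (child, parent) pairs already sorted by child. As a simplification, the sketch resolves all conflicts in a first pass and emits triples in a second pass, rather than interleaving output with conflict resolution as the numbered steps do. Taking the smaller id as grandparent is an arbitrary choice.

```python
def reduce_phase(sorted_pairs):
    """Resolve parent conflicts; emit (grandparent, parent, child) triples.

    sorted_pairs is a list of (child, parent) pairs sorted by child, so
    repeated occurrences of a child vertex are adjacent.
    """
    forest = {}  # union-find forest over parent vertices

    def find(p):
        forest.setdefault(p, p)
        while forest[p] != p:
            forest[p] = forest[forest[p]]
            p = forest[p]
        return p

    prev = None
    for child, parent in sorted_pairs:
        find(parent)  # register every parent in the forest
        # A repeated child with a different parent is a conflict:
        # merge the two parents' sets.
        if prev is not None and child == prev[0] and parent != prev[1]:
            a, b = find(prev[1]), find(parent)
            if a != b:
                forest[max(a, b)] = min(a, b)
        prev = (child, parent)

    # The root of each parent is its grandparent, naming the merged WCC.
    return sorted({(find(p), p, c) for c, p in sorted_pairs})


pairs = [(1, 1), (2, 1), (2, 2), (3, 2), (3, 3), (4, 3), (5, 5), (6, 5)]
for triple in reduce_phase(pairs):
    print(triple)
```

For this input, vertices 1 through 4 all receive grandparent 1 and vertices 5 and 6 receive grandparent 5, so two vertices are in the same component exactly when their triples share a grandparent.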
In an embodiment where there may be several reducer servers for computing the connected components of a large-scale graph from the connected components of subgraphs, the output of each of the reducers may be sent to a single reducer to resolve conflicts where a child vertex belongs to multiple parent vertices for computing the connected components of a large-scale graph.
Thus the present invention may compute connected components in parallel across multiple machines for a graph too large to fit the set of vertices and edges into memory on a single machine. Importantly, the system and method may find the connected components without traversing the edges in the graph. The system and method are accordingly scalable and maintain a constant number of passes through the input data. Thus, social network analysis applications involving millions of users with billions of communications may use the present invention to compute the set of connected components to identify which users are reachable within the social network from a given user.
As can be seen from the foregoing detailed description, the present invention provides an improved system and method for finding connected components in a large-scale graph. A map-reduce framework may be implemented for finding weakly connected components by distributing subsets of a collection of edges for unique vertices to several mappers to compute the connected components of subgraphs represented by each subset of edges. Then the sets of edges for connected components of subgraphs may be sorted by vertex. The sets of edges representing connected components of subgraphs may be distributed to one or more reducers to find maximal sets of weakly connected components of the large-scale graph. Advantageously, connected components may be computed in parallel across multiple machines on extremely large graphs in a constant number of passes through the input data. As a result, the system and method provide significant advantages and benefits needed in contemporary computing, and more particularly in online applications that analyze communications between users.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.