This application claims priority under 35 U.S.C. § 119 from Chinese Patent Application No. 200710110101.4, filed on Jun. 15, 2007, the entire contents of which are incorporated herein by reference.
1. Field of the Invention
The present invention relates to graph coarsening, and more particularly to a system and method for coarsening a graph so as to discover a community rapidly and accurately.
2. Description of the Related Art
In the real world, many kinds of data, such as social networks (e.g., networks in the banking, financial service, insurance, and health care industries), life science networks (e.g., protein interaction networks), and computer networks (e.g., the World Wide Web and the Internet), can be modeled as graphs. Furthermore, most of these graphs display community structure, i.e., groups of vertices within which connections are denser and between which connections are sparser. Discovering these communities is therefore useful for understanding and analyzing such networks. Some social networks are large and of unknown structure, so that grasping their group information is beyond human capability; for example, the personal telecommunications records maintained by a telecommunications carrier may constitute a telecommunications network. By way of community detection, the actual functional groups can be predicted by computer. The groups so found can be used to analyze their features and the associations between them, and to customize particular policies for them regarding sales, advertising and marketing. The significance of data mining lies in such analysis and prediction.
To better understand the relationship between the network and the community, an example regarding the computer network is given below. For a network containing a plurality of web pages, each web page can be regarded as a vertex, and the hyperlinks between the pages as edges. By partitioning the web pages in the network, the authority communities within the network can be found. Authority communities within the network refer to collections of web pages with identical or similar contents, which can be used to help users browse and search their desired information, so that the process can be efficient and convenient.
With the development of information technology, many researchers have developed solutions for discovering communities in networks. The Modularity Q solution proposed in 2004 is considered an important means for evaluating the structural quality of communities. For details on the Modularity Q solution, see M. E. J. Newman and M. Girvan, Finding and Evaluating Community Structure in Networks, Physical Review E, 2004. Newman also employs the Modularity Q solution to evaluate the quality of the communities discovered by various betweenness-based methods. However, these methods are time consuming and are limited to processing graphs with fewer than 10,000 vertices. The heuristic algorithms in the Modularity Q solution (such as greedy algorithms) produce low-quality partitions, and thus cannot always yield a good partitioning for various graphs.
Thereafter, a few spectral approaches were proposed to improve the quality of the detected communities (for example, see S. White and P. Smyth, A Spectral Clustering Approach to Finding Communities in Graphs, Proceedings of the SIAM International Conference on Data Mining, Newport Beach, 2005, and M. E. J. Newman, Modularity and Community Structure in Networks, PNAS. 0601602103, 2006). However, the large-scale matrix computations and lower-order approximations used in these approaches are extremely space- and time-consuming. Although they are more efficient than the Modularity Q solution, the bottleneck on large graphs remains unsolved.
In light of the above, a scalable system and method is proposed, which coarsens a graph using the multilevel paradigm, wherein the coarsened graphs can be easily refined into high quality communities.
According to a first aspect of the invention, there is provided a method for coarsening a graph, the graph including a plurality of vertices each having a respective position in the graph, the method including the steps of: selecting a vertex from the plurality of vertices; calculating a merge modularity gain between the selected vertex and its adjacent vertices, wherein the adjacent vertices are a function of the position of the selected vertex in the graph; calculating mathematically a similarity between the selected vertex and its adjacent vertices; determining mathematically, based on the calculated merge modularity gain and similarity, whether the selected vertex can be merged with one of its adjacent vertices; and performing the merge when the merge is determined to be possible.
According to a second aspect of the invention, there is provided a system for coarsening a graph, the graph including a plurality of vertices, the system consisting of: initial coarsening means for calculating, for a selected vertex, the merge modularity gain between the selected vertex and its adjacent vertices; and bias adjusting means for calculating the similarity between the selected vertex and its adjacent vertices; wherein it is determined, based on the calculated merge modularity gain and similarity, whether the selected vertex can be merged with one of its adjacent vertices, and the merge is performed when the merge is determined to be possible.
In the present invention, by introducing modularity into the multilevel paradigm, the graph is first coarsened stage by stage based on the modularity, and similarity is then used to avoid coarsening vertices that lie on the boundaries between different communities. As a consequence, the graph can be coarsened quickly and accurately by using modularity and similarity, and the clusters of vertices can then be refined during the uncoarsening process.
The invention further provides a storage medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to carry out a method for coarsening a graph, the graph including a plurality of vertices each having a respective position in the graph, the method consisting:
selecting a vertex from the plurality of vertices;
calculating a merge modularity gain between the selected vertex and its adjacent vertices, wherein the adjacent vertices are a function of the position of the selected vertex in the graph;
calculating mathematically a similarity between the selected vertex and its adjacent vertices;
determining mathematically, based on the calculated merge modularity gain and similarity, whether the selected vertex can be merged with one of its adjacent vertices; and
performing the merge when merge is determined possible.
The present invention and its embodiments will be more fully understood by reference to the Drawings and the Detailed Description of the Preferred Embodiments that follow.
The foregoing and other objects, aspects, and advantages will be better understood from the following non-limiting detailed description of preferred embodiments of the invention with reference to the drawings that include the following:
Referring now to the drawing, an example for network community detection (i.e., graph coarsening) using the invention is described.
For the network of
The left side ellipse of
The right side ellipse of
Below, system 300 for graph coarsening according to the invention is described with reference to
System 300 being a recursive system resides in the following aspects. On one hand, for a current graph (e.g., the original graph G0 in
The term “vertex” is used herein. It is noted that in the original graph one “vertex” includes only itself, whereas in a coarsened intermediate-level graph one “vertex” may include one or more vertices of the original graph. Likewise, an edge in the coarsened graph may consist of a plurality of edges of the original graph. As a result, a vertex in the coarsened graph may also be referred to as a “cluster”.
According to a preferred embodiment of the invention, the merge modularity gain and similarity of the graph or the subgraph are calculated by using the Modularity Q formula. The Modularity Q formula is a function for calculating the community intensity of the network, which is an index for measuring whether a community is good or bad. However, it is appreciated that the implementation of the invention does not rely on the use of the Modularity Q formula; any algorithm that can calculate the community intensity and then obtain the merge modularity gain and similarity of the vertices in the network can be applied in the invention.
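By way of a non-limiting illustration, the relationship between original vertices and coarsened clusters may be sketched as follows. The function name `contract`, the edge-list input, and the three returned dictionaries are hypothetical choices made for this sketch and are not taken from the described system; the toy edge list is likewise invented.

```python
from collections import defaultdict

def contract(edges, cluster_of):
    """Collapse an undirected edge list into a coarsened graph.

    edges: iterable of (u, v) pairs in the original graph.
    cluster_of: dict mapping each original vertex to its cluster id.
    Returns (members, inner, between):
      members[c]      -> set of original vertices inside cluster c
      inner[c]        -> number of original edges with both ends in c
      between[(a, b)] -> number of original edges joining clusters a < b
    """
    members = defaultdict(set)
    for v, c in cluster_of.items():
        members[c].add(v)
    inner = defaultdict(int)
    between = defaultdict(int)
    for u, v in edges:
        a, b = cluster_of[u], cluster_of[v]
        if a == b:
            inner[a] += 1          # edge absorbed into a single cluster
        else:
            between[(min(a, b), max(a, b))] += 1  # coarsened edge gains weight
    return dict(members), dict(inner), dict(between)

# Hypothetical toy data, not the graph of the drawings.
edges = [(1, 5), (1, 7), (1, 8), (5, 7), (2, 3), (2, 5)]
cluster_of = {1: "a", 5: "a", 7: "a", 8: "b", 2: "c", 3: "c"}
members, inner, between = contract(edges, cluster_of)
```

In this sketch a coarsened edge simply counts the original edges it stands for, which is one plausible way to retain the information needed for later modularity calculations.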
The preferred embodiments of the invention will be described in connection with reference to
The method of
The method of
According to a preferred embodiment of the invention, the merge modularity gain is calculated based on Modularity Q formula. Modularity Q formula is a basic function for calculating the modularity within a graph or subgraph, as shown in formula (1) below.
wherein,
Q: modularity of the vertex visit[i] and its adjacent vertices;
Aij: the adjacency matrix of the graph;
C(i): the partition in which vertex i is located;
d(i): the degree of vertex i (i.e., the number of edges connected to vertex i);
Dc: the sum of the degrees of all vertices in the partition c;
Ec: the number of edges in partition c; and
1 when vertices i and j belong to the same partition; otherwise 0.
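Formula (1) is not reproduced in this text. For orientation only, the modularity of Newman and Girvan is commonly written in the following form using the symbols defined above. The symbol m (the total number of edges in the graph) and the notation δ(C(i), C(j)) for the indicator described in the last item are assumptions of this sketch, and the form used in the preferred embodiment may differ.

```latex
Q \;=\; \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{d(i)\,d(j)}{2m}\right]
        \delta\bigl(C(i),\,C(j)\bigr)
  \;=\; \sum_{c}\left[\frac{E_c}{m} - \left(\frac{D_c}{2m}\right)^{2}\right]
```

The second equality follows because each edge inside partition c is counted twice in the double sum over A_ij, and because the products d(i)d(j) over all pairs of vertices in partition c sum to D_c squared.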
Based on the above Modularity Q formula, the modularity gain generated during the vertex combining process can be calculated. According to a preferred embodiment of the invention, this is calculated by using formula (2) below.
ΔQC = QC − QA − QB (2)
wherein,
QA: Q of vertex visit[i] (vertex 1 in the example);
QB: Q of the adjacent vertex (vertices 5, 7, and 8 in the example);
QC: Q of the vertex obtained by merging vertex visit[i] and its adjacent vertex;
ΔQC: merge modularity gain of vertex visit[i] and its adjacent vertices.
For example, for vertex 1, when using ΔQC=QC−QA−QB to calculate its modularity gain with vertex 5, C represents the graph consisting of vertices 1 and 5, A represents the graph consisting of vertex 1, and B represents the graph consisting of vertex 5. The calculated ΔQC is indicative of the merge modularity gain of vertices 1 and 5. The same process can be used to calculate the merge modularity gain of vertex 1 with vertex 7 and with vertex 8.
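The calculation of ΔQC may be sketched as follows, assuming that the Q of a single partition takes the commonly used form E_c/m − (D_c/2m)², where m is the total number of edges in the graph. This form, the function names, and the numbers in the usage lines are assumptions made for illustration and are not values from the drawings.

```python
def cluster_q(inner_edges, degree_sum, m):
    """Contribution of one cluster to modularity: E_c/m - (D_c/(2m))**2."""
    return inner_edges / m - (degree_sum / (2 * m)) ** 2

def merge_gain(e_a, d_a, e_b, d_b, e_ab, m):
    """Delta Q_C = Q_C - Q_A - Q_B for merging clusters A and B.

    e_a, e_b: edges inside A and inside B; d_a, d_b: their degree sums;
    e_ab: edges running between A and B; m: total edges in the graph.
    """
    q_a = cluster_q(e_a, d_a, m)
    q_b = cluster_q(e_b, d_b, m)
    # The merged cluster C keeps all inner edges and absorbs the connecting ones.
    q_c = cluster_q(e_a + e_b + e_ab, d_a + d_b, m)
    return q_c - q_a - q_b

# Hypothetical numbers: two single vertices of degree 3 and 4 joined by one
# edge, in a graph with 20 edges in total.
gain = merge_gain(0, 3, 0, 4, 1, 20)
```

Under the assumed form, the difference simplifies algebraically to e_AB/m − D_A·D_B/(2m²), so only the connecting edges and the two degree sums are needed to evaluate a candidate merge.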
As shown in
Then the method proceeds to step 520, to determine if the biggest merge modularity gain of vertex 1 is greater than 0. If “YES”, the method proceeds to step 530, otherwise to step 525, so as to mark the vertex as visited.
As shown in
According to a preferred embodiment of the invention, the similarity is also calculated by using formula (2) above; only the meanings assigned to QA, QB, QC and ΔQC differ from those used when calculating the modularity gain.
To take the selected vertex 1 as an example, its adjacent vertices are vertices 5, 7 and 8. When ΔQC=QC−QA−QB is used to calculate the similarity of vertex 5 to vertex 1 and its other adjacent vertices, C represents the graph consisting of vertices 1, 5, 7 and 8, A represents the graph consisting of vertices 1, 7 and 8, and B represents the graph consisting of vertex 5. The calculated ΔQC is then indicative of the similarity of vertices 1 and 5. Likewise, the similarity of vertex 1 with vertex 7 and with vertex 8 can be calculated.
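The similarity calculation described above may be sketched as follows, again assuming that the Q of a vertex set takes the form E_c/m − (D_c/2m)² with degrees measured in the whole graph. The function names and the toy graph are hypothetical; only the neighbourhood of vertex 1 (vertices 5, 7 and 8) mirrors the example in the text, and the remaining edges are invented.

```python
def subset_q(adj, subset, m):
    """E_c/m - (D_c/(2m))**2 for the vertices in `subset`."""
    subset = set(subset)
    degree_sum = sum(len(adj[v]) for v in subset)
    # Each inner edge is seen from both endpoints, hence the division by 2.
    inner = sum(1 for v in subset for w in adj[v] if w in subset) // 2
    return inner / m - (degree_sum / (2 * m)) ** 2

def similarity(adj, v, u, m):
    """Delta Q_C with C = v plus all its neighbours, B = {u}, A = C minus u."""
    c = {v} | set(adj[v])
    a = c - {u}
    return subset_q(adj, c, m) - subset_q(adj, a, m) - subset_q(adj, {u}, m)

# Hypothetical toy graph stored as an adjacency dict of sets.
adj = {
    1: {5, 7, 8},
    5: {1, 7, 2},
    7: {1, 5},
    8: {1, 2},
    2: {5, 8, 3},
    3: {2},
}
m = sum(len(n) for n in adj.values()) // 2   # total number of edges
sims = {u: similarity(adj, 1, u, m) for u in sorted(adj[1])}
best = max(sims, key=sims.get)               # neighbour most similar to vertex 1
```

In this toy graph, vertex 8 has only one edge into the neighbourhood of vertex 1 and therefore receives a lower similarity than vertices 5 and 7, which is the kind of signal the bias adjustment relies on.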
Then, it is determined if vertex u is the same vertex as vertex v, that is, if the vertex with the biggest merge modularity gain and the vertex with the biggest similarity are the same vertex.
If “YES”, the method goes to step 540, to merge vertices u and v, and mark them as visited, then the method enters step 545. However, as shown in
With step 510, then vertex with random order 2 (that is, vertex 2) is visited. Steps 515 to 535 are repeated for vertex 2. The merge modularity gain calculated for vertex 2 with its adjacent vertices 3, 5 and 8 in step 515 are 0.063, 0.052, 0.031, respectively, wherein vertex 3 has the biggest merge modularity gain (as shown in
Then, the method of the invention returns to step 510, to determine if all vertices in the graph have been visited. If “NO” in step 510, the above process is repeated for the next vertex. The above process is performed repeatedly until all vertices in the graph have been visited (i.e., the determination in step 510 is “YES”).
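One visiting pass over the vertices may be sketched as follows. This is a simplified illustration under several assumptions that are not stated in the text: the graph is an unweighted adjacency dict of sets, the gain and similarity use the closed form e_AB/m − D_A·D_B/(2m²) of ΔQC, only neighbours not yet visited are considered as merge candidates, and a vertex whose biggest-gain neighbour and biggest-similarity neighbour differ is simply left unmerged. The function name `coarsen_pass` and the `seed` parameter are hypothetical.

```python
import random

def coarsen_pass(adj, seed=0):
    """One visiting pass; returns the list of merged vertex pairs."""
    m = sum(len(n) for n in adj.values()) // 2
    deg = {v: len(n) for v, n in adj.items()}

    def gain(v, u):
        # Merge modularity gain of two adjacent single vertices.
        return 1 / m - deg[v] * deg[u] / (2 * m * m)

    def sim(v, u):
        # Similarity: gain of joining u to the rest of v's neighbourhood.
        rest = ({v} | adj[v]) - {u}
        links = len(adj[u] & rest)
        return links / m - deg[u] * sum(deg[w] for w in rest) / (2 * m * m)

    order = sorted(adj)
    random.Random(seed).shuffle(order)          # random visiting order
    visited, merged = set(), []
    for v in order:
        if v in visited:
            continue
        visited.add(v)
        # Assumption: only neighbours not yet visited are merge candidates.
        cands = sorted(adj[v] - visited)
        if not cands:
            continue
        u = max(cands, key=lambda w: gain(v, w))    # biggest merge gain
        if gain(v, u) <= 0:
            continue                                # no positive gain: no merge
        best_sim = max(cands, key=lambda w: sim(v, w))
        if u == best_sim:                           # same vertex: merge
            merged.append((v, u))
            visited.add(u)
        # Assumption: on a mismatch the vertex is simply left unmerged.
    return merged

# Hypothetical toy graph, not the graph of the drawings.
adj = {1: {5, 7, 8}, 5: {1, 7, 2}, 7: {1, 5}, 8: {1, 2}, 2: {5, 8, 3}, 3: {2}}
pairs = coarsen_pass(adj, seed=0)
```

Because both merged vertices are marked as visited, each vertex takes part in at most one merge per pass, so the pass shrinks the graph gradually rather than collapsing a whole community at once.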
After having visited all vertices, the method of
If the determination of step 555 is “NO” (that is, the graph can be further coarsened), the method returns to step 510: the current coarsened graph is recursively input to initial coarsening means 310, the vertices in the coarsened graph are randomly ordered, and the above initial coarsening and bias adjusting processes are repeated.
For the example of
Then, the method of the invention ends in step 565.
In the invention, the graph is coarsened based on the modularity and similarity of the vertices. In the proposed method, the adjacent vertex having the biggest merge modularity gain with a randomly chosen vertex is identified first (i.e., each vertex in the graph is visited in random order, and the selected vertex is to be combined with the adjacent vertex or cluster having the locally maximum merge modularity gain). Then, the random order is adjusted by using the similarity when merging the vertices (i.e., the order of those vertices that might be located on the edge of a community is adjusted via similarity). The method can thereby avoid the low community quality attributable to visiting in random order. Through recursive coarsening, a set of coarsened graphs is output when it is no longer possible to increase the modularity by merging any cluster or vertex. Such coarsened graphs can then be refined into high quality communities.
As compared with existing community detection algorithms, the present invention can process networks with a larger number of vertices and edges, and can discover the communities within a network quickly and accurately.
Bar lines 701, 707, 713, 716, 718 and 719 correspond to the runtime bar values using the present invention. Bar lines 702, 708, 714 and 717 correspond to the runtime bar values using PNAS 2006 (Power Method). Bar lines 703, 709, and 715 correspond to the runtime bar values using PNAS 2006 (CLaPack). Bar lines 704 and 710 correspond to the runtime bar values using SDM 2005 (Spec-1). Bar lines 705 and 711 correspond to the runtime bar values using SDM 2005 (Spec-2). Bar lines 706 and 712 correspond to the runtime bar values using PR.E. 2004.
It should be noted, in
As can be seen from
Those skilled in the art would appreciate that the embodiments of the invention can be provided in the form of a method, a system or a computer program product. Therefore, the invention may take the form of an all-hardware embodiment, an all-software embodiment or a combined software and hardware embodiment such as, but not limited to, a commercially available general-purpose computer or a laptop. A typical combination of hardware and software comprises a general-purpose computer system with a computer program which, when loaded and executed, controls the computer system to execute the above method.
The present invention may be embedded in a computer program product that incorporates all the features enabling the method described herein to be implemented. The computer program product is contained in one or more computer readable storage media (including but not limited to a disk memory, CD-ROM, optical memory, etc.) that have computer readable program code stored therein.
The present invention has been described with reference to the flowchart and/or block diagram of the method, system and computer program product according to the invention. Each block in the flowchart and/or block diagram, and any combination of the blocks in the flowchart and/or block diagram, can be implemented by computer program instructions. These computer program instructions may be provided to a general-purpose computer, a special-purpose computer, an embedded processor or the processors of other programmable data processing equipment to produce a machine, such that the instructions, when executed through the computer or the processors of the other programmable data processing equipment, create means for achieving the functions specified in one or more blocks in the flowchart and/or block diagram.
These computer program instructions may be stored in a computer readable memory of one or more computers that can instruct the computers or other programmable data processing equipment to operate in a particular way, such that the instructions stored in the computer readable memory produce a manufactured product that comprises instruction means for achieving the functions specified in one or more blocks in the flowchart and/or block diagram. A storage medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus is also one of the many possible means to carry out a method for coarsening a graph.
These computer program instructions may be loaded into one or more computers or other programmable data processing equipment, such that a series of operation steps is executed in the computers or other programmable data processing equipment, to thereby generate a computer-implemented process in each such equipment, so that the instructions executed in the equipment provide for the steps specified in one or more blocks in the flowchart and/or block diagram.
The above has described the principle of the invention in conjunction with the preferred embodiments of the invention, which, however, are illustrative and cannot be construed as limiting the invention. Various changes and variations may be made to the invention by those skilled in the art without departing from the spirit and scope of the invention as defined in the accompanying claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 200710110101.4 | Jun 2007 | CN | national |