This application claims foreign priority to P.R. China Patent application 201110076719.X filed 29 Mar. 2011, the complete disclosure of which is expressly incorporated herein by reference in its entirety for all purposes.
The present invention generally relates to the information processing technology field, and in particular, to a computer processing method and system for network data.
Nowadays, as information technology, and especially network technology, develops, information is transferred between information nodes, so that a large amount of network data reflecting the relations between information nodes exists on the network. For such large-scale network data there are many technical analysis requirements, i.e., how to find the relationships between these information nodes, for example, detecting nodes having abnormal behavior in the network, filtering junk e-mails, and so on.
However, when processing large-scale network data including many nodes, for example when the number of nodes in the network data to be processed reaches 10^5 or more, the existing technology proves inadequate, or even unusable.
Thus, it is desirable to provide a computer processing method and system for network data.
One aspect of the invention provides a computer processing method for network data, comprising: receiving network data; filtering a node with a degree higher than a predefined threshold in the network data; storing the filtered node and its neighborhood relationship; clustering the filtered network data to obtain primary group(s); and obtaining a final group based on the filtered node and its neighborhood relationship and the primary group(s).
Another aspect of the invention provides a computer system for processing network data, comprising: a receiving means, configured to receive network data; a filtering means, configured to filter a node with a degree higher than a predefined threshold in the network data; a storing means, configured to store the filtered node and its neighborhood relationship; a clustering means, configured to cluster the filtered network data to obtain primary group(s); and a final grouping means, configured to obtain a final group based on the filtered node and its neighborhood relationship and the primary group(s).
The computer processing method and system provided by the invention can accelerate network data processing and are applicable to large-scale network data, for which the processing time for clustering will be greatly reduced. The invention can also be parallelized, which facilitates its use in common embodiments.
The features and advantages of the embodiments of the invention will be particularly explained with reference to the appended drawings. If possible, the same or like reference number denotes the same or like component in the drawings and the description. In the drawings:
Below, the exemplary embodiments of the invention will be described in detail with reference to the drawings in which the embodiments of the invention are illustrated, and like reference numbers always indicate the same element. It should be understood that the invention is not limited to the disclosed exemplary embodiments. It should also be understood that not every feature of the method and apparatus is necessary for implementing the invention to be protected by any claim. In addition, throughout the disclosure, when a process or method is displayed or described, the steps of the method can be executed in any order or simultaneously, unless it is clear from the context that one step depends on another previously executed step. In addition, there may be a significant time interval between the steps.
Generally, the extent of association between nodes in network data is referred to as a degree by a person skilled in the art. For example, if a node V1 is associated with 5 other nodes, the node V1 can be considered to have a degree of 5 in the network data. If each node in the network data is considered as a point, and lines are drawn between nodes which are associated, a graph (also referred to interchangeably as a map) is formed. Embodiments of the invention are applicable to both directed network data and undirected network data. The inventor particularly noted during study and practice that, in large-scale network data, the associations between nodes are usually not uniform: some nodes are tightly associated with many other nodes, but most of the nodes are associated with only a few nodes. The invention is based on this natural non-uniformity.
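As a hedged illustration of the degree notion described above, the following minimal Python sketch computes the degree of each node from an undirected edge list. The function name `node_degrees` and the sample edges are assumptions made for illustration only and do not appear in the disclosure.

```python
from collections import defaultdict

def node_degrees(edges):
    """Return {node: degree} for an undirected edge list (hypothetical helper)."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # The degree of a node is the number of distinct nodes it is associated with.
    return {node: len(adj) for node, adj in neighbors.items()}

# Illustrative data: V1 is associated with 5 other nodes.
edges = [("V1", "V2"), ("V1", "V3"), ("V1", "V4"), ("V1", "V5"), ("V1", "V6")]
degrees = node_degrees(edges)
print(degrees["V1"])  # 5
```

In this toy graph V1 plays the role of a tightly associated node, while every other node has a degree of 1.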
In step 203, a node with a degree higher than a predefined threshold in the network data is filtered. The person skilled in the art can set a different predefined threshold according to the particular dataset, and the predefined threshold can be an absolute value of the degree. Alternatively, a certain percentage of the nodes can be filtered. In particular, the degree distribution of all the nodes in the network data is statistically calculated, and preferably, the degrees of all the nodes are sorted in ascending or descending order. The degree of any node from a certain percentage range (preferably, the first 5.5%-1%) of the nodes with the highest degrees is then selected as the predefined threshold.
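The percentage-based threshold selection of step 203 can be sketched as follows. This is only one possible reading: the sketch takes the degree of the last node inside the top fraction as the threshold, whereas the text allows the degree of any node in that range. The names `degree_threshold` and `filter_high_degree`, the fraction 0.2 and the sample degrees are illustrative assumptions.

```python
import math

def degree_threshold(degrees, top_fraction):
    """Return the degree of the last node inside the top `top_fraction` of
    nodes when ranked by descending degree (an assumed selection rule)."""
    ranked = sorted(degrees.values(), reverse=True)
    count = max(1, math.ceil(len(ranked) * top_fraction))
    return ranked[count - 1]

def filter_high_degree(degrees, threshold):
    """Return the nodes whose degree is strictly higher than the threshold."""
    return {node for node, degree in degrees.items() if degree > threshold}

# Illustrative degree table for ten hypothetical nodes n0..n9.
degrees = dict(zip([f"n{i}" for i in range(10)], [9, 7, 3, 2, 2, 2, 1, 1, 1, 1]))
threshold = degree_threshold(degrees, top_fraction=0.2)
filtered = filter_high_degree(degrees, threshold)
print(threshold, filtered)  # 7 {'n0'}
```

Because filtering uses a strict comparison, the node whose degree defines the threshold stays in the network data; only nodes with a higher degree are filtered.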
In step 205, the filtered node and its neighborhood relationship are stored. In this step, the neighborhood relationship is represented by a set of nodes adjacent to the filtered node. For example, a node V16 is adjacent to nodes V15, V18, V19, V17 and V12, the node V16 is filtered, and the node V16 and its neighborhood relationship V15, V18, V19, V17 and V12, can be stored. The storage manner can include storing them in a memory or storing them in a non-volatile memory medium.
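A minimal sketch of step 205, assuming the network data is held as an in-memory adjacency dictionary: the filtered node is removed from the graph and kept, together with its set of adjacent nodes, in a separate dictionary. The function name `filter_and_store`, the threshold of 4 and the edges among the remaining nodes are assumptions for illustration; only the neighborhood of V16 is taken from the example above.

```python
def filter_and_store(adjacency, threshold):
    """Split the graph into (filtered network data, stored neighborhoods)."""
    # Stored part: each filtered node with the set of nodes adjacent to it.
    stored = {n: set(nbrs) for n, nbrs in adjacency.items() if len(nbrs) > threshold}
    # Remaining part: drop the filtered nodes and every line that touches them.
    remaining = {
        n: {m for m in nbrs if m not in stored}
        for n, nbrs in adjacency.items()
        if n not in stored
    }
    return remaining, stored

adjacency = {
    "V16": {"V15", "V18", "V19", "V17", "V12"},
    "V15": {"V16", "V12"},
    "V12": {"V16", "V15"},
    "V17": {"V16", "V18"},
    "V18": {"V16", "V17", "V19"},
    "V19": {"V16", "V18"},
}
remaining, stored = filter_and_store(adjacency, threshold=4)
print(sorted(stored["V16"]))  # ['V12', 'V15', 'V17', 'V18', 'V19']
```

The `stored` dictionary could equally be written to a non-volatile medium; the in-memory form is used here only to keep the sketch short.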
In step 207, the filtered network data is clustered to obtain primary group(s). In this step, the network data, which is represented by the nodes and the lines, is clustered into groups. The person skilled in the art can select any suitable clustering algorithm according to the particular data to obtain the primary group(s). For example, for the community discovery, the methods as proposed in reference document [1], or reference document [2], Fábio Protti, Felipe M. G. Franca, Jayme Luiz Szwarcfiter, On Computing All Maximal Cliques Distributedly, Proceedings of the 4th International Symposium on Solving Irregularly Structured Problems in Parallel, 1997 (expressly incorporated herein by reference in its entirety for all purposes), can be used.
In step 209, a final group is obtained based on the filtered node and its neighborhood relationship and the primary group(s). In this step, the primary group(s) associated with the filtered node are determined based on the neighborhood relationship of the filtered node, and it is then further determined whether the filtered node belongs to one or more of those primary group(s), to finally obtain the final group.
In step 303, it is determined whether the filtered node belongs to the primary group(s). Preferably, an average degree of the nodes in the primary group(s) is calculated, in which the average degree is the sum of the degrees of all the nodes in the primary group(s) divided by the number of nodes in the primary group(s). An actual association degree of the filtered node with respect to the nodes in the primary group(s) is also calculated, in which the actual association degree is the number of lines between the filtered node and the nodes in the primary group(s). It is then determined whether the actual association degree is larger than the average degree, and in response to determining that the actual association degree is larger than the average degree, it is determined that the filtered node belongs to the primary group(s). Of course, the person skilled in the art may conceive other embodiments for determining whether the filtered node belongs to the primary group(s) based on the application.
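Since step 207 leaves the choice of clustering algorithm open, the following sketch uses connected components as a deliberately simple stand-in. It is not the method of reference document [1] or [2]; it only shows the shape of the input (filtered network data) and of the output (primary groups). The function name and the sample graph are assumptions.

```python
from collections import deque

def connected_components(adjacency):
    """Group nodes by connected component using breadth-first search."""
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), set()
        while queue:
            node = queue.popleft()
            group.add(node)
            for nbr in adjacency[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        groups.append(group)
    return groups

# Hypothetical filtered network data (the high-degree node is already removed).
filtered_graph = {
    "V12": {"V15"}, "V15": {"V12"},
    "V17": {"V18"}, "V18": {"V17", "V19"}, "V19": {"V18"},
}
primary_groups = connected_components(filtered_graph)
print(len(primary_groups))  # 2
```

Any other clustering routine with the same input and output shape could be substituted without affecting the later merging step.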
In step 305, in response to determining that the filtered node belongs to the primary group(s), the filtered node is merged into the primary group(s).
In step 307, it is judged whether all the filtered nodes have been processed, and if any filtered node has not yet been processed, the steps 303-305 are executed again.
In step 309, in response to all the filtered nodes having been merged into their corresponding primary group(s), the primary group(s) are regarded as the final group(s).
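Steps 303 to 309 can be sketched as the loop below. The sketch assumes that the average degree of a primary group is measured on the filtered network data, which the text does not state explicitly, and that a filtered node is mapped only to groups containing at least one of its neighbors. The function name, the hub node "H" and the two groups are hypothetical.

```python
def merge_filtered_nodes(primary_groups, filtered_adjacency, stored):
    """Merge each stored (filtered) node into every primary group it belongs to."""
    final_groups = [set(group) for group in primary_groups]
    for node, neighborhood in stored.items():            # step 307: visit every filtered node
        for group, final in zip(primary_groups, final_groups):
            links = len(neighborhood & group)             # actual association degree
            if links == 0:
                continue                                  # no neighbor in this group
            average = sum(len(filtered_adjacency[n]) for n in group) / len(group)
            if links > average:                           # step 303: membership test
                final.add(node)                           # step 305: merge
    return final_groups                                   # step 309: final groups

filtered_adjacency = {
    "a1": {"a2", "a3"}, "a2": {"a1", "a3"}, "a3": {"a1", "a2"},
    "b1": {"b3"}, "b2": {"b3"}, "b3": {"b1", "b2"},
}
primary_groups = [{"a1", "a2", "a3"}, {"b1", "b2", "b3"}]
stored = {"H": {"a1", "b1", "b2"}}
final_groups = merge_filtered_nodes(primary_groups, filtered_adjacency, stored)
print("H" in final_groups[0], "H" in final_groups[1])  # False True
```

Here H has one line to the first group (average degree 2) and two lines to the second group (average degree 4/3), so it is merged only into the second group.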
1) calculating a predefined threshold for filtering: statistically calculating the degree of each node, ordering the degrees, and using the first 1% of them to determine the predefined threshold for filtering, the predefined threshold of the graph (map) being 5;
2) discovering that the degree of the node V16 in the graph (map) is larger than 5 (the degree of V16 being 6), and thus saving the node V16 and its neighborhood relationship {V15, V18, V19, V17, V12 and V17};
3) performing community discovery on all the nodes except the node V16, by using the method as described in the reference document [2], whose basic concept is as follows: in each round of iteration, the similarities between every two points within two hops (jumps) of each other are determined; two points which are similar but are not connected by a line are connected with a line, and two points which are not similar but are connected by a line are disconnected; when the variation of the network topology is less than a certain threshold, the iteration ends, and otherwise the iteration proceeds to the next round. Only a simple description of the method of the reference document [2] is given here, and the details can be found in the reference document itself. The network as shown in
4) using the results stored in 2), according to the neighborhood of V16, it is found that the above 3 primary groups G1, G2 and G3 all include nodes adjacent to V16, so the node V16 could belong to any of the three primary groups G1, G2 and G3; and
5) calculating the average degrees of G1, G2, G3 respectively. The average degrees of G1, G2, G3 are 1.5, 1.6 and 0.7, while the actual association degrees of the node V16 with G1, G2 and G3 are 1, 3 and 2 respectively. Since the actual association degrees of the node V16 with G2 and G3 are larger than the average degrees of G2 and G3, it is determined that the node V16 will be merged into G2 and G3, to form the final group result as shown in
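The comparison in item 5) can be checked with a few lines of Python. The numbers are taken from the example above; the variable names are assumptions for illustration.

```python
# Average degree of each primary group and the actual association degree
# of the node V16 with each group, as stated in the worked example.
average_degrees = {"G1": 1.5, "G2": 1.6, "G3": 0.7}
association_degrees = {"G1": 1, "G2": 3, "G3": 2}

# V16 is merged into every group where its association degree exceeds
# the average degree of that group.
merged_into = [
    group for group in average_degrees
    if association_degrees[group] > average_degrees[group]
]
print(merged_into)  # ['G2', 'G3']
```

G1 is excluded because 1 is not larger than 1.5, while 3 > 1.6 and 2 > 0.7 hold for G2 and G3.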
Each particular embodiment of the invention is applicable to various implementation platforms, such as network data clustering processing realized on a single machine, or network data clustering processing realized on a parallel computing platform such as MapReduce or MPI.
To realize the community discovery, the basic data structure of the network in MapReduce is a “two hop adjacency list,” i.e., each row uses a node as the key, and uses as the value the adjacency table of that node together with the adjacency table of each node in that adjacency table; meanwhile, the similarities of the node with respect to all the nodes in the two hop adjacency list should be stored in the value, and a certain value field is reserved for storing information such as marks and so on. For example, the two hop adjacency list of a node A is A-C (A, B, D), B (A, C), in which the one-hop (one-jump) neighbors of A are B and C, the one-hop (one-jump) neighbors of B are A and C, and the one-hop (one-jump) neighbors of C are A, B and D. Such a data structure facilitates realization of the main clustering method as described in the reference document [1].
During a preprocessing stage, by one MapReduce job, the nodes with degrees larger than a designated threshold are marked (the degree computation is easily realized by one Map task, since each node stores an adjacency table and the degree is the number of members in the adjacency table), and the marked data is used as the input to a “filter” and a “large degree node collector.”
During the main algorithm stage, the input is the set of two hop adjacency lists (two-jump adjacency matrix) of the nodes output by the filter, i.e., the nodes with degrees less than the designated threshold. According to the main clustering method in the reference document [1], several rounds of iterations are performed to update the topology; each round of iteration uses a similarity calculator to obtain the similarities between nodes, and uses a topology updater to update the topology; when the topology variation is less than a designated variation threshold, the iteration ends, and the main algorithm in the reference document [1] is completed.
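The “two hop adjacency list” row for the node A in the example above can be sketched in plain Python as follows. The similarity values and the reserved mark field are omitted, and the function name `two_hop_adjacency` is an assumption; the adjacency of D is inferred only from the fact that D is a neighbor of C.

```python
def two_hop_adjacency(adjacency, node):
    """Map each one-hop neighbor of `node` to that neighbor's own adjacency list."""
    return {nbr: sorted(adjacency[nbr]) for nbr in sorted(adjacency[node])}

# One-hop neighbors from the example: A -> B, C; B -> A, C; C -> A, B, D.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
row = two_hop_adjacency(adjacency, "A")
print(row)  # {'B': ['A', 'C'], 'C': ['A', 'B', 'D']}
```

With A as the key, the value holds both the one-hop neighbors of A (the dictionary keys) and the nodes reachable in two hops (the lists), which is the information a similarity calculation over two hops needs.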
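The marking performed in the preprocessing stage can be sketched in plain Python, without any real MapReduce framework API. The function `mark_map`, the record layout, the threshold of 4 and the sample records are illustrative assumptions.

```python
def mark_map(node, adjacency_list, threshold):
    """Map task sketch: the degree is the size of the node's adjacency table."""
    degree = len(adjacency_list)
    return node, {"adjacency": adjacency_list, "marked": degree > threshold}

# Hypothetical input rows: node -> adjacency table.
records = {
    "V16": ["V12", "V15", "V17", "V18", "V19"],
    "V15": ["V12", "V16"],
    "V12": ["V15", "V16"],
}
threshold = 4  # assumed value for illustration
marked = dict(mark_map(node, adj, threshold) for node, adj in records.items())

# Marked rows go to the "large degree node collector"; the rest go to the "filter".
large_degree_nodes = sorted(n for n, rec in marked.items() if rec["marked"])
filter_input = sorted(n for n, rec in marked.items() if not rec["marked"])
print(large_degree_nodes)  # ['V16']
print(filter_input)        # ['V12', 'V15']
```

Each row is processed independently of the others, which is what allows the marking to run as a single Map task.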
During a post-processing stage, after the main algorithm is completed, a Connected Component Calculator is called to obtain the community corresponding to each node. In this regard, reference is made to X-RIME: Hadoop based large scale social network analysis, project available from SourceForge, expressly incorporated herein by reference in its entirety for all purposes, and in particular to the Weakly Connected Component implemented in X-RIME. A “group degree calculator” is then called to calculate the average degree of each group. The input key of the “group degree calculator” is the node and the input value is the group number; the output key is the group and the output value is the average degree of the group together with the set of nodes it includes. Both the output (output 1) of the group degree calculator and the output (output 2) of the “large degree node collector” are used as the input of a “group selector,” and the output of the “group selector” is the potential group(s) of the filtered node. During a Map stage, the “group selector” sends a {group, filtered node} key-value pair message for each neighbor of the filtered node according to the adjacency table of the filtered node. For example, if a node V has neighbors V1, V2, V3, V4 and V5, where V1 and V2 are grouped into g1 and V3, V4 and V5 are grouped into g2, the “group selector” sends two <g1, V> messages to a reducer with g1 as a key and three <g2, V> messages to a reducer with g2 as a key. The number of messages corresponding to V received in each group thus indicates the number of neighbors of the node in that group, and this number is recorded as a label L. Further, a group clustering device may use the label L and the previously calculated group average degree to determine whether V really belongs to this group, and thus to obtain the final group result.
It should be understood that the above embodiments have been discussed with respect to a large-scale network, but embodiments of the invention are also applicable to a network of normal scale, with a corresponding gain. The person skilled in the art may also extend the method of the invention to other physical network data (such as sensor network(s) and so on) according to his or her professional knowledge, and adaptively modify various embodiments of the invention based on his or her knowledge in the art; such extensions and modifications are likewise possible.
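The message counting performed by the “group selector” can be sketched in plain Python as a map function that emits one (group, filtered node) pair per neighbor and a reduce function that counts them. The function names and the `group_of` lookup table are assumptions; no real MapReduce framework API is used.

```python
from collections import Counter

def group_selector_map(filtered_node, neighbors, group_of):
    """Emit one (group, filtered_node) pair per neighbor that has a group."""
    for nbr in neighbors:
        if nbr in group_of:
            yield group_of[nbr], filtered_node

def group_selector_reduce(pairs):
    """Count the messages per (group, node); each count is the label L."""
    return Counter(pairs)

# Data from the example: V1, V2 are in g1 and V3, V4, V5 are in g2.
group_of = {"V1": "g1", "V2": "g1", "V3": "g2", "V4": "g2", "V5": "g2"}
pairs = list(group_selector_map("V", ["V1", "V2", "V3", "V4", "V5"], group_of))
labels = group_selector_reduce(pairs)
print(labels[("g1", "V")], labels[("g2", "V")])  # 2 3
```

Each label L can then be compared with the average degree of the corresponding group to decide whether V is merged into it.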
Preferably, the final grouping means 809 includes: a mapping means, configured to, based on the stored neighborhood relationship, establish a mapping between the filtered node and the primary group(s); a judging means, configured to determine whether the filtered node belongs to the primary group(s); and a merging means, configured to, in response to determining the filtered node belongs to the primary group(s), merge the filtered node into the primary group(s).
Preferably, the final grouping means 809 further includes: a final group determining means, configured to, in response to merging all the filtered nodes into their corresponding primary group(s), regard the primary group(s) as the final group.
Preferably, the computer system 800 further comprises: a new grouping means, configured to cluster subnetwork data composed by the filtered nodes to form a new group; and an incorporating means, configured to incorporate the new group into the final group.
Preferably, the computer system 800 further comprises: a statistically-calculating means, configured to statistically calculate degree distribution of all the nodes in the network data; and a predefined threshold determining means, configured to select a degree of any node from a certain percentage range (preferably, the first 5.5%-1%) of nodes with high degrees in all the nodes, as the predefined threshold.
Preferably, the neighborhood relationship is represented by a set of nodes adjacent to the filtered node.
Preferably, the mapping means includes: a primary group determining means, configured to determine the primary group(s) including at least one node in the neighborhood relationship of the filtered node; and an associating means, configured to associate the filtered node with the determined primary group(s).
Preferably, the judging means includes: an average degree calculating means, configured to calculate an average degree of the nodes in the primary group(s); an actual association degree calculating means, configured to calculate an actual association degree of the filtered node with the nodes in the primary group(s); a comparing means, configured to determine whether the actual association degree is larger than the average degree; and a determining means, configured to, in response to determining that the actual association degree is larger than the average degree, determine that the filtered node belongs to the primary group(s).
Preferably, the computer system 800 is configured on a MapReduce computing platform.
The function of each component in
Although the computer system described in
The invention can also be realized as a computer program product used by the computer system in
In view of the discussion of
Although the invention is described with reference to the preferred embodiments of the invention, it will be obvious to the person skilled in the art that, without departing from the spirit and scope of the invention defined by the appended claims, various modifications in form and detail can be made to the invention.
Number | Date | Country | Kind |
---|---|---|---|
2011 1 0076719 | Mar 2011 | CN | national |
Number | Name | Date | Kind |
---|---|---|---|
7466663 | Young et al. | Dec 2008 | B2 |
7818272 | Mishra | Oct 2010 | B1 |
20050021531 | Wen et al. | Jan 2005 | A1 |
20090315890 | Modani | Dec 2009 | A1 |
20100022752 | Young et al. | Jan 2010 | A1 |
20100063973 | Cao et al. | Mar 2010 | A1 |
20100309206 | Xie et al. | Dec 2010 | A1 |
20100313205 | Shao | Dec 2010 | A1 |
20120143882 | Zheng et al. | Jun 2012 | A1 |
Number | Date | Country |
---|---|---|
101661482 | Mar 2010 | CN |
101944045 | Jan 2011 | CN |
Entry |
---|
Purnamrita Sarkar, “Fast Nearest-neighbor Search in Disk-resident Graphs,” Feb. 5, 2010, CMU-ML-10-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. |
Shawndra Hill et al., “Social Network Signatures: A Framework for Re-Identification in Networked Data and Experimental Results,” Computational Aspects of Social Networks, 2009. |
“X-RIME: Hadoop based large scale social network analysis,” downloaded from http://xrime.sourceforge.net on Mar. 26, 2012. |
Y. Zhang et al., “Parallel Community Detection on Large Networks with Propinquity Dynamics,” Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, p. 997-1005, Association for Computing Machinery (ACM). |
F. Protti et al., “On Computing All Maximal Cliques Distributedly,” Proceedings of the 4th International Symposium on Solving Irregularly Structured Problems in Parallel, 1997, p. 37-48, Lecture Notes in Computer Science (LNCS) vol. 1253, Springer. |
M. E. J. Newman & M. Girvan, “Finding and Evaluating Community Structure in Networks,” Physical Review E (PRE), Feb. 2004, vol. 69, iss. 2, p. 026113(15), American Physical Society (APS). |
M. E. J. Newman, “Finding Community Structure in Networks Using the Eigenvectors of Matrices,” Physical Review E (PRE), Sep. 2006, vol. 74, iss. 3, p. 036104(19), American Physical Society (APS). |
B. Yang et al., “Complex Network Clustering Algorithms,” Journal of Software, Jan. 2009, vol. 20, iss.1, p. 54-66, Institute of Software Chinese Academy of Sciences (ISCAS). |
Number | Date | Country | |
---|---|---|---|
20120284384 A1 | Nov 2012 | US |