A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
One or more implementations relate generally to machine learning, and more specifically to a method for performing cluster functions on large graphs to categorize the vertices of the graphs into subsets.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.
As social networks have gained in popularity, maintaining and processing the social network graph information using graph algorithms has become an essential source for discovering potential features of the graph. In general, a graph is a mathematical structure comprising an ordered pair G=(V, E), where V is the set of vertices or nodes that represent objects, and the elements in set E are edges or lines that represent relationships among different objects. Many real world problems, such as social networks and traffic networks, can be abstracted into graph problems. The great increase in the size and scope of social networks and other similar applications has made it virtually impossible to process huge graphs on a single machine in a “real-time” level of execution.
Distributed computing techniques have been applied to graph computations in order to more efficiently process graph data. One example is Map-Reduce, which is a distributed computing model introduced by Google® that processes large data sets on clusters of computers in parallel using the principles of map and reduce functions commonly used in functional programming. Although many real world problems can be modeled using Map-Reduce, there are still many that cannot be represented very well using this framework. Furthermore, the Map-Reduce model has certain weaknesses that limit its effectiveness with regard to certain important applications, such as cloud computing and social network environments. For example, Map-Reduce cannot share information among different slave machines when running map or reduce functions, and not all graph-based algorithms can be mapped onto Map-Reduce. In addition, for certain graph-related problems that can be solved by Map-Reduce, the solutions may not be optimum for certain applications (e.g., cloud computing). Increased scalability is another key concern in the development and application of graph processing systems.
What is needed is an effective and efficient way to decompose and reformulate the density-based clustering problem so that it can be solved efficiently on Map-Reduce platforms. Concurrent with this objective is the need to provide a scalable algorithm that will perform faster when there are more machines in a Map-Reduce machine cluster; perform faster merging operations, since with results being calculated on multiple machines, the speed of merging these results is critical; maintain low network traffic by ensuring that the number of messages generated is not high; maintain good load balance by ensuring that all machines in a cluster have similar workloads; and maintain result accuracy.
In an embodiment and by way of example, there are provided mechanisms and methods for decomposing and reformulating the density-based clustering problem so that it can be solved efficiently on Map-Reduce platforms. Embodiments are directed to a density-based clustering algorithm that decomposes and reformulates the DBSCAN algorithm to facilitate its performance on the Map-Reduce model. Present methods of implementing DBSCAN process nodes in a graph on a one-by-one basis based on branches of the graph. The DBSCAN algorithm is reformulated into a connectivity problem using a density filter method. The density-based clustering algorithm uses message passing and edge adding to increase the speed of result merging, and it also uses message mining techniques to further decrease the number of iterations. The algorithm is scalable, and can be accelerated by using more machines in a distributed computer network implementing the Map-Reduce program. An active or halt state in the cluster and edge cutting operations reduce a large amount of network traffic. Good load balance between different machines of the network can be achieved by a splitting function, and results are generally accurate compared to results obtained through straight application of the DBSCAN algorithm.
Any of the above embodiments may be used alone or together with one another in any combination. The one or more implementations encompassed within this specification may also include embodiments that are only partially mentioned or alluded to or are not mentioned or alluded to at all in this brief summary or in the abstract. Although various embodiments may have been motivated by various deficiencies with the prior art, which may be discussed or alluded to in one or more places in the specification, the embodiments do not necessarily address any of these deficiencies. In other words, different embodiments may address different deficiencies that may be discussed in the specification. Some embodiments may only partially address some deficiencies or just one deficiency that may be discussed in the specification, and some embodiments may not address any of these deficiencies.
In the following drawings like reference numbers are used to refer to like elements. Although the following figures depict various examples, the one or more implementations are not limited to the examples depicted in the figures.
Systems and methods are described for implementing the Map-Reduce framework through a density-based algorithm to solve large-scale graph problems. Aspects of the one or more embodiments described herein may be implemented on one or more computers executing software instructions. The computers may be networked in a client-server arrangement or similar distributed computer network, and one or more of the networked computers may execute application programs that require periodic testing to ensure continuing functionality.
Map-Reduce is a distributed computing model that processes large data sets on clusters of computers in parallel. It provides a programming model in which users can specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
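By way of illustration only, the map and reduce roles described above can be sketched on a single machine in Python. The helper `run_map_reduce`, the functions `degree_map` and `degree_reduce`, and the sample edge list are hypothetical names chosen for this sketch and are not part of any Map-Reduce platform; a real platform would distribute the two phases across machines.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Single-process stand-in for a Map-Reduce job (illustrative only)."""
    # Map phase: each input key/value pair yields intermediate key/value pairs.
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce phase: merge all intermediate values that share the same key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def degree_map(edge_id, edge):
    """Emit one count for each endpoint of an undirected edge."""
    u, v = edge
    yield u, 1
    yield v, 1


def degree_reduce(vertex, counts):
    """Sum the counts collected for one vertex to obtain its degree."""
    return sum(counts)


# Hypothetical input: edge id -> (endpoint, endpoint).
edges = [(0, ("a", "b")), (1, ("a", "c")), (2, ("b", "c")), (3, ("c", "d"))]
degrees = run_map_reduce(edges, degree_map, degree_reduce)
```

Here the intermediate key is a vertex identifier, so the reduce function sees every count emitted for that vertex and returns its degree.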
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a data clustering algorithm that can be performed on graphs. It implements a density-based clustering algorithm by finding a number of clusters starting from the estimated density distribution of corresponding nodes (vertices). DBSCAN is an efficient algorithm when clustering data on a single machine. However, present implementations of graph processing using DBSCAN are limited to processing nodes one at a time on a single machine, and calculating the density of each node along a branch of nodes that are of the same density. This severely limits the efficiency and processing speed of systems dealing with large graphs and prevents real-time processing of these graphs.
Embodiments are directed to a system that implements DBSCAN using Map-Reduce as a model that can accelerate the clustering calculation of large graphs using a similar density-based model. The original graph is partitioned into a number of partial clustered graphs. Each partition is processed on a respective machine in a networked system so that multiple partitions of the original graph are processed in parallel. The sub-clusters comprising each of the partial clustered graphs are then merged using a message-based mechanism that reduces unnecessary processing of redundant edges or node connections.
In an embodiment, the network 100 of
Each node or “machine” of network 100 includes a respective execution component for the density-based algorithm using Map-Reduce, such that the original large graph is partitioned into a number of smaller partitions and each partition is processed using a single machine of the network. The number and size of partitions can be selected based on the number of machines in the network, and can be determined based on a random or rule-based selection process. It should be noted that the system 100 is intended to be an example of a multi-processor or multi-machine network that processes large graphs using the density-based algorithm using Map-Reduce, and that any appropriate network topology may be used depending on the constraints and requirements of the graph or application being processed.
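As a minimal sketch of a rule-based selection process, the following Python example assigns vertices to partitions in round-robin order so that the number of partitions equals the number of machines. The function name `partition_vertices`, the adjacency-dictionary representation, and the round-robin rule itself are assumptions made for illustration; the description above does not prescribe a particular rule.

```python
def partition_vertices(adjacency, num_machines):
    """Split an adjacency dictionary into one partition per machine.

    Assumed rule: vertices are sorted and dealt out round-robin, so that
    partition sizes differ by at most one vertex.
    """
    partitions = [dict() for _ in range(num_machines)]
    for index, vertex in enumerate(sorted(adjacency)):
        partitions[index % num_machines][vertex] = adjacency[vertex]
    return partitions


# Hypothetical path graph a-b-c-d-e split across two machines.
adjacency = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
partitions = partition_vertices(adjacency, 2)
```

Each partition keeps the full neighbor set of its vertices, so edges that cross partitions remain visible to the machine that owns either endpoint.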
As shown in step 204, the data clustering process is then implemented using a DBSCAN algorithm that is divided into two parts: a DBF function 206 and a PCD function 208. This task split makes it more efficient to execute the DBSCAN algorithm on Map-Reduce platforms. The original DBSCAN algorithm applies density constraints and performs expansion at the same time. In an embodiment, the density constraints are applied in the DBF function, step 206, and the expansion operation is performed in the PCD function, step 208, based on the result of the DBF filter operation.
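The separation of density filtering from expansion can be sketched as two independent Python functions. This is an illustrative sketch only: the names `density_filter` and `expand_clusters`, the `min_pts` parameter, and the use of a neighbor-count threshold as the density constraint (borrowed from classical DBSCAN) are assumptions and are not taken from the DBF and PCD functions themselves.

```python
def density_filter(adjacency, min_pts):
    """Filter step: keep nodes with at least min_pts neighbors (assumed criterion)."""
    return {node for node, neighbors in adjacency.items() if len(neighbors) >= min_pts}


def expand_clusters(adjacency, core_nodes):
    """Expansion step: connect core nodes and attach their non-core neighbors.

    Returns a dict mapping each clustered node to the label of its cluster.
    Nodes that are absent from the result are treated as noise.
    """
    cluster_of = {}
    for start in sorted(core_nodes):
        if start in cluster_of:
            continue
        cluster_of[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in cluster_of:
                    cluster_of[neighbor] = start
                    # Only core nodes continue the expansion.
                    if neighbor in core_nodes:
                        stack.append(neighbor)
    return cluster_of


# Hypothetical graph: a dense triangle a-b-c, a border node d, and a sparse pair e-f.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": {"f"},
    "f": {"e"},
}
core_nodes = density_filter(adjacency, min_pts=2)
cluster_of = expand_clusters(adjacency, core_nodes)
```

Because the filter runs first and independently of the traversal, the expansion reduces to a connectivity computation over nodes that have already passed the density test.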
For the embodiment of
After the DBF process 300 of
As shown in
The result of the PCD process of
As mentioned above, the process utilizes a message-based mechanism to merge the sub-clusters. This mechanism allows each sub-cluster to send a message to the other sub-clusters indicating that it needs to be merged. Each sub-cluster waiting to be merged is thus messaged and merged with other sub-clusters. The messaging structure allows sub-clusters to know each of their neighbors and indirectly couple or “introduce” neighboring sub-clusters to their other neighbors. This messaging system allows the comprehensive and effective merging of sub-clusters generated by many different parallel processes executed on dedicated machines in the network.
As shown in
Prior to processing received messages the first time, unnecessary edges may be removed to speed processing.
The procedure of building edges will be described with reference to
For every first round of a merge, when a cluster without a core node receives a message with a cluster ID, if this cluster belongs to itself or its belonging value is larger than this ID, it takes this ID as its belonging value and sets its own state to active.
For every second round of a merge, when a cluster with core nodes receives a message carrying a cluster ID and a tag indicating that the ID is an edge node, it adds this ID to its edge node set.
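The two receive rules above can be sketched as follows. The `Cluster` data class, its field names, and the handler names `on_first_round_message` and `on_second_round_message` are hypothetical and introduced only for this sketch; a cluster is assumed to "belong to itself" when its belonging value equals its own ID.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    cluster_id: int
    belonging: int          # assumed to start equal to cluster_id
    has_core: bool
    active: bool = False
    edge_nodes: set = field(default_factory=set)


def on_first_round_message(cluster, received_id):
    """First-round rule for a cluster without a core node.

    Returns True when the belonging value was updated.
    """
    if cluster.has_core:
        return False
    belongs_to_itself = cluster.belonging == cluster.cluster_id
    if belongs_to_itself or cluster.belonging > received_id:
        cluster.belonging = received_id
        cluster.active = True
        return True
    return False


def on_second_round_message(cluster, received_id, is_edge_node):
    """Second-round rule for a cluster with core nodes."""
    if cluster.has_core and is_edge_node:
        cluster.edge_nodes.add(received_id)


# Hypothetical clusters used to exercise both rules.
border = Cluster(cluster_id=7, belonging=7, has_core=False)
first_update = on_first_round_message(border, 9)   # belongs to itself: accepts 9
second_update = on_first_round_message(border, 4)  # 9 > 4: accepts 4
border.active = False
third_update = on_first_round_message(border, 6)   # 4 < 6: ignored

core = Cluster(cluster_id=3, belonging=3, has_core=True)
on_second_round_message(core, 12, is_edge_node=True)
on_second_round_message(core, 15, is_edge_node=False)
```

A cluster that ignores a message stays in its current state, which is consistent with using an active or halt state to limit the number of messages sent in later rounds.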
With reference to process 700 of
After process 700 of
In an embodiment, a message mining technique is used to accelerate the merging process. The message mining technique is based on the following observation: when one cluster in the graph partition changes its belonging value, other clusters that previously had the same belonging value can also be changed. For example, if node A sent messages to nodes E and F, and node B sent messages to nodes F and G, then A, B, E, F, and G are in the same cluster (A, B, E, F, and G are clusters with core nodes).
Message mining is performed by grouping a bipartite graph wherein the left part of the graph consists of the IDs in the content of messages and the belonging values in clusters, and the right part of the graph consists of the “To” values in the messages and the clusters' IDs. Grouping of this bipartite graph can be done by traversal of the graph. After grouping, there are no connections between groups. For each group, the process then makes the edges fully connected between the left and right parts, and converts the graph back into messages. These new messages are taken as input to the merge procedure. All of the original messages are included in the new ones, and the extra information can be used to reduce the merge procedure to log(P) iterations.
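The grouping step can be sketched in Python as a traversal of the bipartite graph followed by full connection of each group. The function name `mine_messages` and the representation of a message as a `(content_id, to_id)` pair are assumptions for this sketch; belonging values and cluster IDs could be supplied as additional pairs of the same form.

```python
from collections import defaultdict


def mine_messages(messages):
    """Group a bipartite message graph and fully connect each group.

    messages: iterable of (content_id, to_id) pairs (assumed representation).
    Returns a set of (content_id, to_id) pairs that contains every original
    message plus the pairs implied by each connected group.
    """
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for content_id, to_id in messages:
        left_to_right[content_id].add(to_id)
        right_to_left[to_id].add(content_id)

    seen_left = set()
    mined = set()
    for start in list(left_to_right):
        if start in seen_left:
            continue
        # Traverse one connected group, alternating between the two sides.
        group_left, group_right = {start}, set()
        seen_left.add(start)
        stack = [("left", start)]
        while stack:
            side, node = stack.pop()
            if side == "left":
                for right in left_to_right[node]:
                    if right not in group_right:
                        group_right.add(right)
                        stack.append(("right", right))
            else:
                for left in right_to_left[node]:
                    if left not in group_left:
                        group_left.add(left)
                        seen_left.add(left)
                        stack.append(("left", left))
        # Fully connect the group and convert it back into messages.
        mined.update((left, right) for left in group_left for right in group_right)
    return mined


# The example from the text: A sent to E and F, B sent to F and G.
original = [("A", "E"), ("A", "F"), ("B", "F"), ("B", "G")]
mined = mine_messages(original)
```

In this example the shared recipient F links A and B into one group, so the mined output adds the messages (A, G) and (B, E) to the four original ones.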
As described herein, the density-based clustering algorithm is an improvement over a straight implementation of DBSCAN for processing large graphs. It decomposes and reformulates the original DBSCAN algorithm so that it can be performed on a Map-Reduce model using many machines in a networked architecture. It uses message passing and edge adding to reduce result merging to log(N) iterations. It also uses message mining to further decrease the number of iterations to log(P). The density-based algorithm using Map-Reduce is scalable, and can be accelerated by using more machines in the network. The active or halt state in the cluster and edge cutting reduce large amounts of network traffic. Optimum load balance among different machines in a network can be achieved by a splitting function.
The DBF and PCD processes together can be used to cluster data partially. The various combinations of message sending, edge cutting, merging, and message mining operations can be used to merge and cluster different kinds of graph-based data. Although embodiments have been described in relation to a design based on a Map-Reduce model, it should be noted that implementation is not limited only to the Map-Reduce model. Embodiments of the density-based model using the DBF and PCD processes with any or all of the message sending, edge cutting, merging, and message mining operations can be migrated to other distributed or multi-core platforms as well.
Aspects of the system 100 may be implemented in an appropriate computer network environment for processing large graph data, including a cloud computing environment in which certain network resources are virtualized and made accessible to the individual nodes through secure distributed computing techniques. The Map-Reduce system described herein can be implemented in an Internet-based client-server network system. The network 100 may comprise any desired number of individual machines, including one or more routers (not shown) that serve to buffer and route the data transmitted among the computers. Network 100 may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
According to one embodiment, each machine of network 100 is operator configurable using applications, such as a web browser, including computer code run using a central processing unit. A computer program product embodiment includes a machine-readable storage medium (media) having instructions stored thereon/in, which can be used to program a computer to perform any of the processes of the embodiments described herein. Computer code for executing embodiments may be downloaded and stored on a hard disk, but the entire program code, or portions thereof, may also be stored in any other volatile or non-volatile memory medium or device as is well known, such as a ROM or RAM, or provided on any media capable of storing program code, such as any type of rotating media including floppy disks, optical discs, digital versatile disk (DVD), compact disk (CD), microdrive, and magneto-optical disks, and magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data. Additionally, the entire program code, or portions thereof, may be transmitted and downloaded from a software source over a transmission medium, e.g., over the Internet, or from another server, as is well known, or transmitted over any other conventional network connection as is well known (e.g., extranet, VPN, LAN, etc.) using any communication medium and protocols (e.g., TCP/IP, HTTP, HTTPS, Ethernet, etc.) as are well known. It will also be appreciated that computer code for implementing embodiments can be implemented in any programming language that can be executed on a client system and/or server or server system such as, for example, C, C++, HTML, any other markup language, Java™, JavaScript, ActiveX, any other scripting language, such as VBScript, and many other programming languages as are well known may be used. (Java™ is a trademark of Sun Microsystems, Inc.).
It should also be noted that the various functions disclosed herein may be described using any number of combinations of hardware, firmware, and/or as data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic or semiconductor storage media.
For the purpose of the present description, a data object is any type of distinguishable data or information, such as an image, video, sound, text, or other type of data. A data object may include multiple types of distinguishable data, such as an image combined with descriptive text, and it may also comprise a dynamic signal such as a time varying signal. A data object as used herein is to be interpreted broadly to include stored representations of data including for example, digitally stored representations of source information. A data set is a collection of data objects, and may comprise a collection of images, or a plurality of text pages or documents. A user is utilized generically herein to refer to a human operator, a software agent, process, or device that is capable of executing a process or control.
Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in a sense of “including, but not limited to.” Words using the singular or plural number also include the plural or singular number respectively. Additionally, the words “herein,” “hereunder,” “above,” “below,” and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word “or” is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list and any combination of the items in the list.
While one or more implementations have been described by way of example and in terms of the specific embodiments, it is to be understood that one or more implementations are not limited to the disclosed embodiments. To the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.
This application claims the benefit of U.S. Provisional Patent Application 61/509,847 entitled A DENSITY-BASED ALGORITHM FOR DISCOVERING CLUSTERS IN LARGE GRAPHS WITH NOISE USING MAP-REDUCE, by Nan Gong and Jari Koister, filed Jul. 20, 2011, the entire contents of which are incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
20130024479 A1 | Jan 2013 | US |
Number | Date | Country | |
---|---|---|---|
61509847 | Jul 2011 | US |