The present invention relates generally to peer discovery in a peer-to-peer network. More particularly, the present invention relates to discovery of peers for a grid or cluster-based computing network.
Many software applications require the combined resources of a number of computers, which are connected together through standard and well-known networking techniques, such as TCP/IP networking software running on the computers and on the hubs, routers, and gateways that interconnect the computers. In particular, grid or cluster-based computing makes use of a network of interconnected computers to provide additional computing resources necessary to solve complex problems.
Many grid or cluster-based computing networks rely upon a peer-to-peer (p2p) configuration. In such a p2p network, each node must know how to locate every other node participating in the network. This can be done by statically provisioning each node. However, a statically defined network is incapable of accepting new participants without requiring the re-configuration of existing nodes. The reconfiguration of nodes to accept new participants and delete unavailable participants is a time consuming process, but without such reconfiguration, a static network is incapable of using all the available resources. For these reasons, adaptive peer-to-peer networks with dynamic discovery of peers are considered preferable.
Unfortunately, adaptive configuration of a grid or cluster using existing TCP/IP software is often complex. Technologies such as AppleTalk™ and Rendezvous™ by Apple Computer, Inc. attempt to solve the complexity problem, but introduce new problems. AppleTalk devices cannot communicate with the far more common TCP/IP connected Ethernet devices in use today. Rendezvous is based on TCP/IP, but scalability and efficiency problems remain.
Other peer-to-peer networks rely upon a node joining the network knowing, a priori, at least one other node in the network. From the connection to one node in the network, connections to other nodes can be formed, and contact with all nodes in the network can be achieved by relying upon connected nodes to pass messages to all other nodes that they are connected to. By assigning each message a time-to-live value based on the number of hops, a high statistical probability of distributing a message through the entire network can be achieved. Networks of this variety are ad hoc in nature and, though they benefit from a planned structure, typically generate higher levels of network traffic as each node in the network continually passes messages. At any given node, a message is usually received from more than one connected node, which generates large volumes of unnecessary traffic. Peer-to-peer networks having this configuration are considered inefficient, largely because no one node on the network maintains a list of all other network nodes.
It is, therefore, desirable to provide a peer-to-peer discovery protocol that is adaptive, efficient and easily scalable.
It is an object of the present invention to obviate or mitigate at least one disadvantage of previous peer discovery methods and mechanisms.
In a first aspect of the present invention, there is provided a peer discovery method for determining a prime in a peer-to-peer network having at least one system. The method comprises transmitting a voting message including a voting token; initializing a timer; listening on a predetermined port for voting tokens transmitted by another system in the network; and entering a prime mode upon expiry of the timer if no superior voting token is received.
In embodiments of the first aspect of the present invention, the method can include the further step of entering a vanilla peer mode when a superior voting token is received. In another embodiment, the step of transmitting a voting message includes transmitting a voting token containing a voting number, and the step of entering a prime mode includes determining that no received voting token has a higher voting number. In another embodiment, the step of transmitting includes multicasting the voting token to all nodes on a subnet. In a further embodiment, the step of entering a prime mode includes transmitting an assertion of prime status to all nodes from whom a voting token is received, and creating a list of all nodes from whom a voting token is received. In another embodiment, the method includes the step of entering a vanilla peer mode upon receipt of an assertion of prime status from another node. In a further embodiment, the step of listening includes listening for tokens associated with a grid identifier associated with the node. In yet a further embodiment, the method includes the step of requesting an update from all peers from whom a voting token has been received upon entering the prime mode.
In a second aspect of the present invention, there is provided a system for carrying out the methods of the present invention.
Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.
Embodiments of the present invention will now be described, by way of example only, with reference to the attached Figures, wherein:
The architecture of the present invention allows the computers in the p2p network to discover each other and automatically set up a processing network, distribute work, and recover from failure. After installation of the peer, no administration, configuration or ongoing management is necessary. Generally, the present invention provides a method and system for peer discovery in a peer-to-peer (p2p) network. The p2p discovery mechanisms are preferably based on the ubiquitous TCP/IP and UDP/IP standards, but can use other networking protocols without departing from the scope of the present invention. Furthermore, the mechanisms preferably require no global network configuration to add or remove peers from the network. The solution is preferably scalable to a large number of processors through careful combination of both multicast and connection-based messaging, together with an automated peer voting mechanism which elects a “prime” peer at runtime. The Prime peer is preferably responsible for providing a list of available peers whenever a new p2p network-enabled application is launched. The mechanisms are preferably robust to failures of an individual machine, through detection of peer failure, routing around failed peers, and if necessary, repetition of the peer voting algorithm.
The peer-to-peer network described herein makes use of a series of peers. To allow discovery of peers, each peer has a mechanism for allowing it to announce to the other peers that it is present. One of the peers is selected as a prime. The designated prime stores a list of peers in the network. When a peer in the network needs a listing of other peers, a request can be generated and sent to the prime. Thus, each non-prime peer, also referred to as a vanilla peer, need only store the address of the prime in the network, while the prime stores a listing of all peers. Nodes remain on the prime's node list until they have been found to be dead. If the prime attempts to contact a node that is on the list but is inactive, the prime can determine the node to be dead and then remove the node from the network. If a vanilla peer requests a listing of nodes, and attempts to contact nodes on the network that are inactive, the inactivity of a node can be reported to the prime for removal from the peer list. In a presently preferred embodiment, any peer in the network can become the prime node.
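The bookkeeping performed by the prime can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the claimed implementation: the class name `PeerRegistry`, its method names, and the use of plain address strings to identify peers are all hypothetical.

```python
class PeerRegistry:
    """Hypothetical peer list kept by the prime (all names are illustrative)."""

    def __init__(self):
        self._peers = set()

    def add(self, address):
        # A peer that announces itself is added to the list.
        self._peers.add(address)

    def report_dead(self, address):
        # Called when the prime, or a vanilla peer, finds the node inactive.
        # Reporting an unknown address is harmless.
        self._peers.discard(address)

    def listing(self):
        # Returned to any peer that requests the list of peers.
        return sorted(self._peers)


registry = PeerRegistry()
for addr in ("10.0.0.2", "10.0.0.3", "10.0.0.4"):
    registry.add(addr)
registry.report_dead("10.0.0.3")
print(registry.listing())  # ['10.0.0.2', '10.0.0.4']
```

A vanilla peer in this sketch would hold only the address of the prime and call `listing()` on demand, which matches the division of state described above.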
To allow for redundancy, a backup prime can be designated when the network is created, so that if the prime fails, the backup prime can assume the responsibilities of the prime. In an alternate embodiment, a mechanism to select a new prime can be implemented to select a new prime upon the failure of the prime.
In the p2p network described herein, it is preferable that all nodes know how to locate every other node participating in the network. A peer service discovery (PSD) protocol provides a mechanism by which peers in a network can dynamically discover each other. This protocol preferably provides each peer in the network with access to a list of all other known peers. Active connections need not be maintained between all peers using this protocol, as each peer can access the peer listing. Table 1 provides a list of the terminology used throughout.
The PSD protocol of the present invention is preferably positioned on top of the transport layer (e.g. TCP or UDP) in the session layer. The PSD implementation involves the use of vanilla peers (VPs) on each node participating in the p2p network. VPs preferably run as daemons and are the participants in the PSD protocol. VPs also preferably serve as the execution environment for parallel applications that have been assigned to them.
Peers in the network can assume roles for controlling a given service. The roles include both the Prime and Backup Prime as described in the above table. Primes are responsible for management of data associated with a service. Backup Primes take over the role of Prime if P′ goes down, or may have some control functions delegated to them by the Prime. In each p2p network there can be an arbitrary number of backups.
Services are specific areas of distributed functionality. For instance, the task of determining all of the peers (end-points) in the network is an example of a service. Similarly, the task of determining the availability of any given node to do work is another service. The following tasks, some of which may be distributed, can be included in a PSD protocol of the present invention: End-point Discovery; Peer Availability; Data Check-pointing; Load Balancing; Results Collection; and Operations, Administration, and Maintenance (OAM).
Prior to a discussion of a messaging protocol, a brief description of how the network is dynamically formed will be presented. When a peer is initialized, a prior configuration preferably determines a network port over which all communications are sent. The peer then transmits, using the configured port, a message to all nodes on the same subnet. This first message is considered as a voting message. After transmitting the voting message the peer listens for a reply. If there is already a network, then the network peers disregard the message, and the network prime records the address of the new peer, and asserts ownership of the prime status. If other peers are initializing at the same time, each peer will transmit a voting message. Each peer will receive the voting message transmitted by the other peers. If a received voting message is deemed to be stronger than the transmitted voting message, the peer enters its vanilla peer mode and records the address of the prime. Until a stronger voting message is received, the peer creates a list of the peers that it has received messages from. Based on the value of the voting message, a peer is selected to be the prime, and will have a list of all peers in the network. If, at a later time, another peer joins the network, it will transmit a voting message. In response, the prime will assert its ownership of the prime status and add the new peer to the peer list. If, at some time, the prime is unreachable, another node in the network can multicast a reset message to restart the voting process. In an alternate embodiment, each peer records the address of both the prime and a backup prime, based on the strength of the voting message, and if the prime fails, the backup prime continues in its place.
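The case in which several peers initialize at the same time can be illustrated with a small in-memory simulation. The sketch below makes several assumptions that are not stated in the description: votes are distinct integers, every voting message reaches every other peer, and the names `Peer`, `on_vote`, `on_timer_expiry` and `simulate` are hypothetical. No network traffic is involved.

```python
class Peer:
    def __init__(self, address, vote):
        self.address = address
        self.vote = vote            # strength of this peer's own voting message
        self.strongest = vote       # strongest vote seen so far
        self.mode = "VOTING"
        self.prime_address = None   # recorded once a stronger peer is heard
        self.heard_from = []        # peers heard while still a candidate

    def on_vote(self, sender, value):
        if value > self.strongest:
            # A stronger voting message: become a vanilla peer and
            # remember the strongest sender as the presumed prime.
            self.strongest = value
            self.mode = "VANILLA"
            self.prime_address = sender
        elif self.mode == "VOTING":
            # Still a candidate: keep building the peer list.
            self.heard_from.append(sender)

    def on_timer_expiry(self):
        if self.mode == "VOTING":
            self.mode = "PRIME"


def simulate(votes):
    """Deliver every peer's vote to every other peer, then expire all timers."""
    peers = {addr: Peer(addr, vote) for addr, vote in votes.items()}
    for sender in peers.values():
        for receiver in peers.values():
            if receiver is not sender:
                receiver.on_vote(sender.address, sender.vote)
    for peer in peers.values():
        peer.on_timer_expiry()
    return peers


peers = simulate({"10.0.0.2": 10, "10.0.0.3": 42, "10.0.0.4": 7})
for peer in peers.values():
    print(peer.address, peer.mode, peer.prime_address)
```

In this run the peer holding the vote 42 ends in the PRIME mode with the other two addresses in `heard_from`, and both other peers end in the VANILLA mode with that peer recorded as prime.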
The PSD protocol uses both multicast and point-to-point messaging. In a presently preferred embodiment, the PSD protocol uses five messages: Vote; AssertPrime; PeerUpdate; Complete; and Reset. Note that the use of XML-style constructs in the following description is for illustrative purposes only; no part of the present invention is dependent on the use of XML as the PSD transport protocol.
The Vote message is preferably a multicast message, where each VP provides its VP token to the other VPs in the network. Each VP token has at least one attribute that permits it to be compared and ranked against other VP tokens. For purposes of illustration, the tokens used in the following examples are numbers with a value attribute that can be compared. The Vote message is preferably multicast so that all nodes in the p2p network will receive the message. One skilled in the art will appreciate that the message is preferably sent to a port used exclusively for PSD, and may be used exclusively for the voting messaging. A node preferably sends the Vote message when it is first initialized (i.e. when it comes online). For example, a node engaged in discovering peers for a particular service can transmit: <vote service=“SERVICE” rand=“NUMBER”/>. One skilled in the art will appreciate that the random number can be replaced by a deterministic value, such as the system's media access control (MAC) address, or a number determined in accordance with a set of features particular to a computer. Deterministic values can also be combined with a random number. The use of weighted tokens permits nodes with particular properties to be given an advantage in the selection of Prime.
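For illustration only, the XML-style Vote message can be built and parsed with Python's standard `xml.etree.ElementTree` module. The helper names `make_vote` and `parse_vote`, the service name `endpoint`, and the choice of a 32-bit random number are assumptions of this sketch; the description does not fix a wire format or a number range.

```python
import random
import xml.etree.ElementTree as ET


def make_vote(service, rand):
    # Build <vote service="..." rand="..."/> as a text string.
    elem = ET.Element("vote", {"service": service, "rand": str(rand)})
    return ET.tostring(elem, encoding="unicode")


def parse_vote(text):
    # Return (service, rand) or raise ValueError for any other message.
    elem = ET.fromstring(text)
    if elem.tag != "vote":
        raise ValueError(f"not a vote message: {elem.tag!r}")
    return elem.attrib["service"], int(elem.attrib["rand"])


# The 32-bit range is an arbitrary choice for this example.
message = make_vote("endpoint", random.getrandbits(32))
service, rand = parse_vote(message)
print(service, rand)
```

Because the token is parsed back into an integer, two received tokens can be ranked with an ordinary comparison, which is all the voting step requires.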
The AssertPrime message is preferably a multicast message where a single node on the network asserts to the other nodes in the p2p network that it is the Prime. The value of the token attribute, such as the numeric value of the random number, in the Vote message is used to resolve collisions between two or more VPs, each claiming to be the prime. For example, the message can be formatted as follows: <assertprime service=“SERVICE” rand=“NUMBER”/>.
The PeerUpdate message can be either multicast to the entire p2p network, or it can be unicast to a single node, such as the Prime P′. The PeerUpdate message preferably indicates the state of a VP for a given service. For end-pointing, the available states include UP and DOWN, whereas for availability the states include BUSY and IDLE. For example, the message can be formatted as follows: <peerupdate service=“SERVICE” state=“STATE”/>.
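A sender can check that a state is valid for a service before emitting a PeerUpdate message. The following sketch assumes hypothetical service names `endpoint` and `availability` for the two services mentioned above; only the four state names come from the description.

```python
from xml.sax.saxutils import quoteattr

# Service names are placeholders; the state names follow the description.
VALID_STATES = {
    "endpoint": {"UP", "DOWN"},
    "availability": {"BUSY", "IDLE"},
}


def make_peer_update(service, state):
    # Reject states that do not belong to the named service.
    if state not in VALID_STATES.get(service, set()):
        raise ValueError(f"state {state!r} is not valid for service {service!r}")
    # quoteattr() adds the surrounding quotes and escapes the value.
    return f"<peerupdate service={quoteattr(service)} state={quoteattr(state)}/>"


print(make_peer_update("availability", "IDLE"))
# <peerupdate service="availability" state="IDLE"/>
```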
The Reset message is preferably a multicast message to all nodes in the p2p network, which is sent when an application attempts to establish a connection to a service Prime and fails. The node that detects the failure preferably transmits the message as a multicast to the other nodes in the network to notify the rest of the network that a node is down. This message preferably causes the other VPs to restart the election process for the role of service P′ to replace the unreachable node. For example, the message can be formatted as <reset service=“SERVICE”/>.
From the perspective of a node initializing and joining the network, the flowchart of
If in step 136, no message is received, the timer is examined in step 144. If the timer has not expired, the node continues to step 134 and listens for another response. If the timer has expired, as determined by step 144, the peer determines that it is prime and runs as prime in step 146. At this point, the peer is the prime peer and issues an assert prime message to all nodes in the network. The peers in the network respond to the assert prime message by providing peer update information from which a complete peer list, or a peer map is generated. At this point a “complete” message can be transmitted from the prime to all peers in the network to indicate that the discovery operation is complete and that a valid peer map is available.
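The listen-and-timeout loop can be sketched as follows. The step numbers in the comments refer to the flowchart steps named above. The function name `await_election`, the `receive` callable (which waits up to the given number of seconds and returns a received voting number or `None`), and the helper `scripted` used to drive the example are assumptions of this sketch; a real node would read from a socket instead. The handling of a superior token follows the summary, in which the node enters the vanilla peer mode.

```python
import time


def await_election(own_vote, receive, timeout):
    deadline = time.monotonic() + timeout      # the timer is initialized
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:                     # step 144: timer has expired
            return "PRIME"                     # step 146: run as prime
        vote = receive(remaining)              # steps 134 and 136: listen
        if vote is not None and vote > own_vote:
            return "VANILLA"                   # a superior token was received


def scripted(votes):
    """Test double: replay the given votes, then stay silent."""
    pending = list(votes)

    def receive(wait):
        if pending:
            return pending.pop(0)
        time.sleep(wait)                       # nothing arrives before the deadline
        return None

    return receive


print(await_election(50, scripted([10, 99]), timeout=0.05))  # VANILLA
print(await_election(50, scripted([10, 20]), timeout=0.05))  # PRIME
```

Using a monotonic clock for the deadline keeps the timer unaffected by adjustments to the system time.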
After P′ has been selected, it can appoint various VPs in the network to serve as controllers for the various services (and their backups). P′ preferably selects appointees bottom up in the Peer Map. Selection in a bottom-up fashion from the peer map is presently preferred as it is likely that worker peers will be enlisted from the list in a top-down manner; thus, the bottom-up selection method will minimize the intersection wherever possible.
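The effect of bottom-up appointment can be shown with an ordered list standing in for the Peer Map. The function names `appoint_controllers` and `enlist_workers`, and the representation of the Peer Map as a Python list of addresses, are assumptions of this sketch.

```python
def appoint_controllers(peer_map, services):
    # Walk the peer map from the bottom, one appointee per service.
    if len(services) > len(peer_map):
        raise ValueError("more services than peers")
    return dict(zip(services, reversed(peer_map)))


def enlist_workers(peer_map, count):
    # Workers are taken from the top of the same list.
    return peer_map[:count]


peer_map = ["p1", "p2", "p3", "p4", "p5", "p6"]
controllers = appoint_controllers(peer_map, ["discovery", "availability"])
workers = enlist_workers(peer_map, 3)
print(controllers)  # {'discovery': 'p6', 'availability': 'p5'}
print(workers)      # ['p1', 'p2', 'p3']
```

With six peers, two services and three workers, the two selections do not overlap; they begin to intersect only when the number of workers plus the number of controllers exceeds the size of the map.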
Preferably, all peers must know every service controller. Thus, any peer wishing to participate in (or use) a given service can access the service controller. Services are therefore carefully selected as they may increase per peer state information and messaging. In one embodiment, the prime maintains a list of all service controllers, allowing any VP to contact the prime to determine the peer that is providing a particular service.
If a service controller goes down, either one of its backups or an application master will eventually discover the loss and inform the Prime Peer. The Prime can then select a new VP to fill the role and redistribute the service assignment information.
The service assignment can be done by maintaining an active connection to each service controller candidate. The Candidate can accept the appointment or reject it based on implementation specific criteria. If the appointment is accepted, the Prime broadcasts the appointment to the rest of the network. Otherwise a new peer is selected for the appointment.
When the first node in the network is initialized, its process can be described with reference to the flowchart of
When another node enters the network, its process can be described with reference to the flowchart of
From the perspective of the prime, when a new VP is initialized, the Prime receives a vote message from the newly arriving VP. Because P′ has already been selected, the new node receives a notification (the AssertPrime message) that a Prime exists and its vote is of no consequence.
It is preferable that each element in the p2p network be both robust and capable of error recovery. To that end, PSD preferably attempts to manage the fluid nature of nodes in a network while minimizing recoverability overhead and messaging.
PSD preferably provides accommodation for at least the following three situations: where two peers with equally high voting numbers assert primeship simultaneously; where the prime in a network goes down and new elections are required; and where VP nodes are lost.
In the simultaneous prime assertion scenario two nodes simultaneously arrive on an otherwise empty network. This can also occur as a result of a reset message being broadcast or multicast to all nodes in the network. In one embodiment, the prime is selected between two equally ranked nodes on the basis of a property that must be unique, such as an IP address or a MAC address. In another embodiment, each node transmits as part of its voting token, a secondary vote. This secondary vote is ignored unless a collision of equal votes is received. The statistical probability of two collisions is considered to be sufficiently small that it will not happen if the prime voting is based on random seeds.
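The first tie-breaking embodiment can be sketched by ranking candidates on a pair: the voting number first, then a unique property. The sketch uses the IPv4 address as the unique property and lets the numerically higher address win; that direction, and the names `rank` and `pick_prime`, are assumptions, since the description only requires that the property be unique.

```python
import ipaddress


def rank(address, vote):
    # Tuples compare element by element, so the address only matters
    # when two voting numbers are equal.
    return (vote, int(ipaddress.ip_address(address)))


def pick_prime(candidates):
    """candidates: iterable of (address, vote) pairs; returns the winner's address."""
    return max(candidates, key=lambda c: rank(c[0], c[1]))[0]


print(pick_prime([("10.0.0.5", 9), ("10.0.0.7", 9), ("10.0.0.9", 3)]))  # 10.0.0.7
```

The second embodiment fits the same shape: replacing the address with a secondary vote in the tuple gives a secondary value that is ignored unless the primary votes collide.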
In the situation where the Prime node is lost, any new application master, which is by default at least a VP, attempting to run on the network will attempt to connect to the node it thinks is Prime. This connection will typically be made to obtain a peer map. If the prime has failed, the VP will determine that the prime is unavailable, and can either make use of a backup prime that has been previously designated and provided to all VPs, or the VP can send out a multicast RESET message. The RESET message causes all peers in the network to change back to the VOTING state, and elections ensue to obtain a new Prime, as described with reference to the earlier figures.
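The two recovery options can be combined in one routine that tries the prime, then a designated backup if one exists, and otherwise produces a Reset message for multicasting. The function name `locate_peer_map`, the `connect` callable (assumed to return a peer map or raise `ConnectionError`), and the tuple return convention are hypothetical.

```python
def locate_peer_map(prime, backup, connect, service):
    # Try the prime first, then the backup prime if one was designated.
    for address in (prime, backup):
        if address is None:
            continue
        try:
            return ("PEER_MAP", connect(address))
        except ConnectionError:
            continue
    # Neither answered: the caller should multicast this Reset message,
    # which returns all peers to the VOTING state.
    return ("MULTICAST", f'<reset service="{service}"/>')


def fake_connect(address):
    # Stand-in for a real connection attempt, for demonstration only.
    if address == "10.0.0.3":
        raise ConnectionError(address)
    return ["10.0.0.2", "10.0.0.4"]


print(locate_peer_map("10.0.0.3", "10.0.0.4", fake_connect, "endpoint"))
print(locate_peer_map("10.0.0.3", None, fake_connect, "endpoint"))
```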
If a worker VP in the network goes down, it will not be detected until such time that a master application attempts to employ that node to do work. The failure to obtain a connection to the node can be reported to the prime. The Prime can then update the Global Peer Map. This information can be provided to any application masters in the system, either directly from the VPs or the prime can compile the list and provide it to each application master.
One skilled in the art will appreciate that there are a number of solutions for sending multicast messages across a wide area network, such as the internet. These solutions provide for simplified peer discovery outside of a single subnet. Multicast message forwarding may be performed by routers in the network and would be configured by the administrator of the p2p network environment. Multiple subnets typically require that each subnet has its own discovery prime (P′). All discoveries of other peers, for the P′ peer map, are isolated to the local subnet. All discovery multicast messages are within the local link multicast address range (224.0.0.1-224.0.0.255, 224.0.0.252-224.0.0.255).
If any of the masters on a particular subnet requires more peers than are available on its local subnet, a multicast message can be sent, using an inter-subnet multicast address (224.0.1.178-224.0.1.255), to all P′ on all other subnets in the local enterprise. Upon receiving the multicast message, the P′ would forward its local subnet peer map of available peers. This scenario allows for a plurality of p2p networks, each with its own prime, to interact through messaging between primes.
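A node can tell the two kinds of discovery traffic apart by checking which range a destination address falls in. The sketch below uses Python's standard `ipaddress` module and the two ranges quoted in the description (224.0.0.1 to 224.0.0.255 for the local subnet and 224.0.1.178 to 224.0.1.255 between subnets). The function name `discovery_scope` and the returned labels are hypothetical, and only IPv4 addresses are handled.

```python
import ipaddress

# Range endpoints as quoted in the description.
LOCAL_LINK = (ipaddress.ip_address("224.0.0.1"), ipaddress.ip_address("224.0.0.255"))
INTER_SUBNET = (ipaddress.ip_address("224.0.1.178"), ipaddress.ip_address("224.0.1.255"))


def discovery_scope(address):
    ip = ipaddress.ip_address(address)
    if LOCAL_LINK[0] <= ip <= LOCAL_LINK[1]:
        return "local-subnet"
    if INTER_SUBNET[0] <= ip <= INTER_SUBNET[1]:
        return "inter-subnet"
    return "other"


print(discovery_scope("224.0.0.251"))  # local-subnet
print(discovery_scope("224.0.1.200"))  # inter-subnet
print(discovery_scope("239.1.1.1"))    # other
```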
If the master is successful in contacting the remote peer, and subsequently assigns it work, the peer can send an inter-subnet multicast message that it is busy doing work and that no attempt should be made until such time as it has completed its work and become available once again. Notification of availability would, again, be sent through inter-subnet multicast.
With the above described network, a peer-to-peer distributed computing architecture can be built. Because any peer in the network can obtain a listing of the other peers, any node in the network can submit a job. When a node wishes to submit a job to a number of peers, it contacts the prime and requests a peer map. The peer is typically provided with a listing of peers on its subnet to assist in reducing network traffic. If the job requires more peers than are presently available, either due to the peers being used for other jobs or due to the network being small, the submitting node can request a map of nodes on other subnets. The reply to this request preferably includes a map of where the other peers can be found. The submitting node can then add the peers in the other subnets to the job, and contact the remote peers. Communications across the subnets are preferably done using unicast transmissions. Once a peer has been contacted and provided a job slice, it is considered part of the logical grid used for a single job.
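The local-first selection of peers for a job can be sketched as follows. The function name `gather_peers`, the `request_remote_maps` callable (assumed to return a dictionary mapping a subnet label to a list of available peers), and the example addresses are hypothetical. The remote request is made only when the local subnet cannot satisfy the job, which reflects the stated aim of reducing network traffic.

```python
def gather_peers(needed, local_peers, request_remote_maps):
    # Prefer peers on the submitting node's own subnet.
    chosen = list(local_peers[:needed])
    if len(chosen) < needed:
        # Only now ask for peer maps from other subnets.
        for peers in request_remote_maps().values():
            chosen.extend(peers[: needed - len(chosen)])
            if len(chosen) >= needed:
                break
    return chosen


def remote_maps():
    # Stand-in for the reply to an inter-subnet request.
    return {"subnet-b": ["10.0.1.2", "10.0.1.3"], "subnet-c": ["10.0.2.2"]}


print(gather_peers(2, ["10.0.0.2", "10.0.0.3", "10.0.0.4"], remote_maps))
# ['10.0.0.2', '10.0.0.3']
print(gather_peers(4, ["10.0.0.2", "10.0.0.3"], remote_maps))
# ['10.0.0.2', '10.0.0.3', '10.0.1.2', '10.0.1.3']
```

If fewer peers exist than the job needs, the function returns as many as it found and leaves the decision to the caller.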
If the situation arises that a peer on a different subnet cannot be reached, the master, or job submitting node, will preferably not send a multicast message to other masters on other subnets to inform them of a possible downed node. It would, however, preferably inform other masters on its local subnet. The reason for not informing the masters on other subnets is that there are too many single points of failure between the master and the remote peer, and the peer could be unreachable to the master, but still active. As a result, informing masters outside of the original subnet may not be accurate. Therefore, to reduce the inter-subnet traffic, other masters are preferably not notified. Any discovery of it having gone down will be made by the other masters themselves in due course.
If a subnet containing work peers goes down, the subnet will be treated just as if a single peer had gone down. Therefore, all recovery procedures will be identical to the above discussion of a peer on another subnet going down.
It is preferable for a distributed p2p network used for grid or cluster computing to provide features such as load balancing. Load balancing may be provided at a higher layer in the protocol stack such as the Peer Map data structure, which maintains information such as Processor Type (Intel x86, Sparc, PPC, etc), Processor Speed (in MHz), Processor Count (for SMP support), Disc Speed (for I/O intensive jobs), Memory Size, and Memory Access Speed. Such information can be stored to allow querying by applications. These metrics are preferably collected when the VP starts and advertised via the PEERUPDATE message sent at startup time.
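One possible shape for a Peer Map entry carrying these metrics, together with a simple query helper, is sketched below. The field names, the units chosen for disc speed, memory size and memory access speed, and the helper `query` are all assumptions; the description lists the metrics but does not define a record layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerInfo:
    address: str
    processor_type: str      # e.g. "x86", "Sparc", "PPC"
    processor_mhz: int       # Processor Speed, in MHz as in the description
    processor_count: int     # for SMP support
    disc_mb_per_s: float     # unit is an assumption
    memory_mb: int           # unit is an assumption
    memory_mhz: int          # unit is an assumption


def query(peer_map, **minimums):
    """Return the peers whose numeric fields meet every given minimum."""
    return [
        peer for peer in peer_map
        if all(getattr(peer, name) >= value for name, value in minimums.items())
    ]


peer_map = [
    PeerInfo("10.0.0.2", "x86", 2400, 2, 60.0, 2048, 400),
    PeerInfo("10.0.0.3", "PPC", 1800, 1, 45.0, 1024, 333),
    PeerInfo("10.0.0.4", "Sparc", 1200, 4, 80.0, 8192, 266),
]
print([p.address for p in query(peer_map, processor_count=2, memory_mb=2048)])
# ['10.0.0.2', '10.0.0.4']
```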
It is preferable that all aspects of PSD be hidden from an application programmer behind programming constructs allowing access to the collective computing resources of the network.
As discussed above, a grid can span a number of subnets. Similarly, a number of grids can co-exist on the same subnet. During the peer discovery process, peers make use of broadcast and multicast transmissions that are directed to specified ports. If there are a number of grids on the same subnet, each node will receive traffic for the other grids during the discovery process. Additionally, when a reset message is sent to the nodes in a grid, it will be received by all the nodes in the subnet. To allow for the management of multiple grids on a single subnet, each grid is preferably assigned a grid identifier. The grid identifier can be combined with a shared secret, such as a password. When a node is configured to have a grid identifier and password, it can then include the grid identifier and information associated with the shared secret in the messages it transmits. When a node receives grid traffic, the node can compare the grid identifier and the shared secret information to the information it has been configured with. This use of grid identifier and shared secret allows a node to differentiate between traffic intended for its grid and traffic intended for other grids. One skilled in the art will appreciate that the use of a shared secret affords a degree of security to these transmissions, and can be used in conjunction with a common encryption or digital signature technology. Thus, a node can digitally sign network messages so that they can be verified as originating from a node in a given grid. If encryption is used, then a transmitting node can be assured that only nodes in the same grid can determine the content of the message. Other applications of this functionality will be apparent to those skilled in the art.
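One way to attach "information associated with the shared secret" to a message is a keyed hash. The sketch below uses HMAC-SHA256 from Python's standard library, which is a message authentication code rather than a digital signature in the public-key sense; it is offered as one possible technique, not as the mechanism of the description. The wire layout (`grid|digest|payload`), the function names, and the restriction that the grid identifier contain no `|` character are assumptions.

```python
import hashlib
import hmac


def _digest(grid_id, secret, payload):
    data = f"{grid_id}|{payload}".encode("utf-8")
    return hmac.new(secret, data, hashlib.sha256).hexdigest()


def tag_message(grid_id, secret, payload):
    # Prefix the payload with the grid identifier and a keyed digest.
    return f"{grid_id}|{_digest(grid_id, secret, payload)}|{payload}"


def accept_message(grid_id, secret, wire):
    """Return the payload if the message belongs to this grid, else None."""
    parts = wire.split("|", 2)
    if len(parts) != 3:
        return None
    their_grid, digest, payload = parts
    if their_grid != grid_id:
        return None                       # traffic for another grid
    expected = _digest(grid_id, secret, payload)
    # Constant-time comparison of the two digests.
    if hmac.compare_digest(digest.encode("utf-8"), expected.encode("utf-8")):
        return payload
    return None


wire = tag_message("grid-a", b"s3cret", '<reset service="endpoint"/>')
print(accept_message("grid-a", b"s3cret", wire))   # the original payload
print(accept_message("grid-b", b"s3cret", wire))   # None
print(accept_message("grid-a", b"wrong", wire))    # None
```

A node configured for a different grid, or with a different secret, discards the message, which is the filtering behaviour described above.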
One skilled in the art will appreciate that nodes of the network described above, can be implemented using standard computing hardware programmed according to the methods of the present invention. Such systems would typically have an input for receiving messages from other nodes, an output for transmitting messages to other nodes, and a state machine for generating messages for transmission and acting upon the received messages as described above.
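The state machine mentioned above can be written as a small transition table. The state names VOTING, PRIME and VANILLA follow the modes used in the description; the event names, and the rule that a prime steps down on a superior prime assertion, are assumptions made to keep the sketch concrete.

```python
# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    ("VOTING", "superior_vote"): "VANILLA",
    ("VOTING", "assert_prime"): "VANILLA",
    ("VOTING", "timer_expired"): "PRIME",
    ("PRIME", "superior_assert_prime"): "VANILLA",
}


def step(state, event):
    # A Reset message returns every peer to the VOTING state.
    if event == "reset":
        return "VOTING"
    # Events with no entry leave the state unchanged.
    return TRANSITIONS.get((state, event), state)


state = "VOTING"
for event in ("timer_expired", "reset", "superior_vote"):
    state = step(state, event)
    print(event, "->", state)
```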
The above-described embodiments of the present invention are intended to be examples only. Alterations, modifications and variations may be effected to the particular embodiments by those of skill in the art without departing from the scope of the invention, which is defined solely by the claims appended hereto.
This application claims the benefit of U.S. Application No. 60/539,350, filed Jan. 28, 2004, the contents of which are expressly incorporated herein by reference.
Number | Date | Country
---|---|---
60539350 | Jan 2004 | US