Non-blocking commit protocol systems and methods

Description

BACKGROUND

1. Field of the Invention

This invention relates to systems and methods for maintaining atomicity and reducing blocking in distributed systems.

2. Description of the Related Art

For a transaction to be atomic, a system either executes all of the operations in the transaction to completion or none of the operations. Atomicity allows multiple operations to be linked so that the final outcome of the overall transaction is known. System failures can prevent atomicity. For example, a device or communication failure in a distributed system executing a transaction can cause some of the parties participating in the transaction to execute the transaction to completion while other parties abort the transaction. This puts the parties in different states and can corrupt system information if the parties cannot roll back to a stable condition consistent with a known state before the transaction was initiated.

In a distributed system, an atomic commit protocol (ACP) resolves transactions between a number of different parties involved in the transaction. The ACP ensures that all parties to the transaction agree on a final outcome by either committing to the transaction or aborting the transaction. Several such protocols are described below.

I. Deterministic Atomic Commit Protocol

A plurality of nodes may participate in a transaction and then send messages to each other to indicate that they are each prepared to commit the transaction. Once a particular participant receives “prepared” messages from all other participating nodes, the participant commits to the transaction and sends a “committed” message to the other participating nodes. If the participant receives an “abort” message from another participating node, the participant also aborts. Thus, the protocol in this example is deterministic in that the outcome of the transaction is causally determined when the participating nodes are prepared to commit. The transaction eventually commits when all participants successfully send “prepared” messages to the other participants. Each participating node uses this rule to decide for itself how to resolve the transaction.

However, failure of a participant can block the transaction until the participant recovers. If, for example, the participant prepares for the transaction but crashes before sending any “prepared” message, and all other participants send “prepared” messages, the transaction is blocked while the functioning participants wait to determine whether or not the failed participant prepared or aborted the transaction. Further, the functioning participants do not know whether or not the failed participant committed to the transaction after receiving their “prepared” messages. Thus, the functioning participants block the transaction until the failed participant recovers. The transaction may block for an indeterminate amount of time, which may be forever in the case of a permanent failure.
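The decision rule and the blocking case described above can be sketched as follows. This is a minimal illustration under stated assumptions, not an implementation taken from this disclosure: the names `Vote` and `decide` are hypothetical, and `None` stands for a peer whose message has not yet arrived.

```python
from enum import Enum
from typing import Dict, Optional


class Vote(Enum):
    PREPARED = "prepared"
    ABORT = "abort"


def decide(peer_votes: Dict[str, Optional[Vote]]) -> Optional[str]:
    """Return "commit", "abort", or None when the node must keep waiting.

    peer_votes maps every other participating node to the message received
    from it, or to None if no message has arrived (hypothetical encoding).
    """
    votes = list(peer_votes.values())
    if any(v is Vote.ABORT for v in votes):
        return "abort"    # one "abort" message settles the outcome
    if all(v is Vote.PREPARED for v in votes):
        return "commit"   # every other node reported "prepared"
    return None           # a silent peer leaves the transaction blocked
```

A node that has heard nothing from a crashed peer gets `None` from `decide`, which corresponds to the indeterminate blocking described above.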

II. Two-Phase Commit Protocol

Some ACPs are non-deterministic and use a coordinator to manage the ACP and reduce blocking when a participating node fails. For example, in a conventional two-phase commit protocol the participants send “prepared” messages or “abort” messages to the coordinator rather than to each other. In a first phase, the coordinator decides whether to commit or abort the transaction. If the coordinator receives “prepared” messages from all participants, the coordinator decides to commit the transaction. If the coordinator receives an “abort” message from at least one participant, the coordinator decides to abort the transaction. In a second phase, the coordinator logs its decision and sends messages to the participating nodes to notify them of the decision. The participants can then take appropriate action.

Since the coordinator makes a unilateral decision, failure of a single participant will not block the transaction. If a participant fails or loses communication with the coordinator before sending a “prepared” or “abort” message, the coordinator unilaterally decides to abort after a predetermined amount of time. However, the two-phase commit protocol can still block the transaction under certain circumstances. For example, if the coordinator fails and all participants send “prepared” messages, the participants will block until the coordinator recovers and resolves the protocol.
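The coordinator's first-phase decision in a conventional two-phase commit, including the timeout-driven abort, might be sketched as below. The function name `coordinator_decision`, the string encoding of the messages, and the `timed_out` flag are assumptions made for illustration only.

```python
from typing import Dict, Optional


def coordinator_decision(votes: Dict[str, Optional[str]],
                         timed_out: bool) -> Optional[str]:
    """Return "commit", "abort", or None if the coordinator is still waiting.

    votes maps each participant to "prepared", "abort", or None (no message
    received yet). timed_out is True once the predetermined wait has elapsed.
    """
    received = list(votes.values())
    if any(v == "abort" for v in received):
        return "abort"                    # at least one participant aborted
    if all(v == "prepared" for v in received):
        return "commit"                   # every participant is prepared
    # Some participant is silent: abort unilaterally once the timer expires.
    return "abort" if timed_out else None
```

The sketch covers only the coordinator's side. It does not model the blocking case in which the coordinator itself fails after all participants have sent “prepared” messages.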

III. Three-Phase Commit Protocol

Conventional three-phase commit protocols attempt to solve the blocking problem of the two-phase commit protocol by adding an extra phase in which a preliminary decision of whether to commit or abort the transaction is communicated to the participating nodes. If the coordinator fails, the participating nodes select one of the participants to be a new coordinator that resumes the protocol. When the failed coordinator recovers, it does so as a participant and no longer acts in the role of the coordinator. However, in many applications it is not practical to implement the conventional three-phase commit protocol. Further, the three-phase commit protocol may block if multiple participants fail or if there is a communication failure.

SUMMARY

The systems and methods described herein provide single-failure non-blocking commitment and double-failure non-blocking commitment protocols.

In one embodiment, a distributed system is provided, where the distributed system is configured to resolve a transaction among a set of parties within the distributed system. The distributed system may include a plurality of participants configured to permit communication among the plurality of participants and to resolve a transaction; a coordinator configured to communicate with the plurality of participants to resolve the transaction; wherein the plurality of participants are configured to determine whether to commit the transaction based on messages from the coordinator, and if not, to determine among the plurality of participants whether to commit the transaction.

In an additional embodiment, a method is provided for resolving a transaction among a set of nodes. The method may include determining whether communication with a coordinator node is available; if communication with the coordinator node is available, receiving messages from the coordinator node indicating whether to commit or abort a transaction; and if communication with the coordinator node is not available, receiving messages from other nodes involved in the transaction indicating whether to commit or abort the transaction.

In an additional embodiment, a distributed system is provided to resolve a transaction among a set of parties within a distributed system. The distributed system may include a set of participant nodes configured to permit communication among the plurality of nodes and to resolve a transaction among a set of parties from the plurality of nodes; an initiator located on a first node configured to communicate with the plurality of participant nodes; a coordinator located on the first node; and wherein the initiator is further configured to receive a start command to start the transaction, add participant nodes to the set of participant nodes after the start of the transaction to form an updated set of participant nodes, and send a message to the coordinator, the message configured to indicate that the initiator is prepared to commit the transaction and to indicate that the participant nodes in the updated set of participant nodes are to be included in the transaction.

In a further embodiment, a method is provided for resolving a transaction among a set of parties within a distributed system. The method may include receiving a command to start a transaction; receiving a first set of participant nodes to be included in the transaction; receiving additional participant nodes to be included in the transaction; adding the additional participant nodes to the first set of participant nodes; receiving a command to commit the transaction; and sending a message to a coordinator node to prepare for the transaction, the message including the updated first set of participant nodes.

For purposes of summarizing the invention, certain aspects, advantages and novel features of the invention have been described herein. It is to be understood that not necessarily all such advantages may be achieved in accordance with any particular embodiment of the invention. Thus, the invention may be embodied or carried out in a manner that achieves or optimizes one advantage or group of advantages as taught herein without necessarily achieving other advantages as may be taught or suggested herein.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an exemplary timing chart of a transaction between an initiator, two participants, a shared participant, and a coordinator using a more-modified two-phase commit protocol.

FIG. 2 illustrates an exemplary state diagram of a coordinator for the more-modified two-phase commit protocol.

FIG. 3 illustrates an exemplary state diagram of a participant for the more-modified two-phase commit protocol.

FIG. 4 illustrates an exemplary timing chart of a transaction between an initiator, two participants, a shared participant, and a coordinator using a two-phase commit version 2 protocol.

FIG. 5 illustrates an exemplary state diagram of an initiator for the two-phase commit version 2 protocol.

FIG. 6 illustrates an exemplary state diagram of a coordinator for the two-phase commit version 2 protocol.

FIG. 7 illustrates an exemplary state diagram of a participant for the two-phase commit version 2 protocol.

FIG. 8 illustrates an exemplary state diagram of a shared participant for the two-phase commit version 2 protocol.

FIG. 9 illustrates an exemplary timing chart of a transaction between an initiator, two participants, a coordinator, and a distributor using a 2.5 phase commit protocol.

FIG. 10 illustrates an exemplary state diagram of a coordinator for the 2.5 phase commit protocol.

FIG. 11 illustrates an exemplary state diagram of a distributor for the 2.5 phase commit protocol.

FIG. 12A illustrates an exemplary state diagram of a participant for the 2.5 phase commit protocol.

FIG. 12B illustrates an exemplary state diagram of a participant for the 2.5 phase commit protocol.

FIG. 12C illustrates an exemplary state diagram of a participant for the 2.5 phase commit protocol.

FIG. 12D illustrates an exemplary state diagram of a participant for the 2.5 phase commit protocol.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

I. Overview

Systems and methods for providing atomic non-blocking commit protocols will now be described. These protocols may be used for a variety of transactions that involve two or more parties, where the parties include at least one initiator and one or more participants. For example, these protocols may be used in distributed file systems as described in U.S. patent application Ser. No. 10/007,003 entitled “Systems and Methods for Providing a Distributed File System Utilizing Metadata to Track Information About Data Stored Throughout the System,” filed Nov. 9, 2001 which claims priority to Application No. 60/309,803 filed Aug. 3, 2001, U.S. patent application Ser. No. 10/281,467 entitled “Systems and Methods for Providing A Distributed File System Incorporating a Virtual Hot Spare,” filed Oct. 25, 2002, and U.S. patent application Ser. No. 10/714,326 entitled “Systems And Methods For Restriping Files In A Distributed File System,” filed Nov. 14, 2003, which claims priority to Application No. 60/426,464, filed Nov. 14, 2002, all of which are hereby incorporated by reference herein in their entirety.

A. The Initiator

The initiator has several responsibilities. In one embodiment, the initiator is responsible for starting transactions, assigning work items to participants for execution on the transactions, and deciding when to request a commit or abort for a transaction. In the examples discussed herein, the initiator sends “prepare” messages to all of the participants when the initiator wants to commit a transaction and “abort” messages when the initiator wants to abort a transaction. In addition, the initiator receives “aborted” messages and “committed” messages from the participants indicating whether the participants have completed the transaction. Typically, the initiator is allowed to abort a transaction, by sending an “abort” message to the participants, at any point up until the initiator has sent “prepare” messages to the participants. Once the initiator has sent all of the “prepare” messages, the transaction is out of the initiator's hands.

In some embodiments, the initiator controls message synchronization. For example, the initiator may mediate the distribution of the “abort” messages to guarantee that the “start” messages have been processed on all participants before they receive an “abort” or “aborted” message. As another example, the initiator may wait to collect responses to the “start” messages from one or more participants before sending the “prepare” messages.

In a distributed file system, for example, the initiator may start a transaction to write or restripe data blocks across a plurality of nodes corresponding to the participants. The initiator then sends requests to the participants to read data blocks, allocate space for data blocks, write data blocks, calculate parity data, store parity data, send messages to another participant, combinations of the foregoing, or the like.

B. The Participants

The participants' responsibilities include executing transactions, receiving messages from the initiator, and sending messages to the initiator indicating whether the transaction was completed by sending “aborted” or “committed” messages. For example, if a particular participant has an error while performing the transaction, becomes disconnected from the initiator, or receives an “abort” message from the initiator, the participant aborts the transaction and sends an “aborted” message to the initiator. If the participant commits the transaction, it sends a “committed” message to the initiator.

In one embodiment, the participants are located on separate nodes from one another. However, in some embodiments, a participant can share a node with another party. Moreover, in some embodiments, the participants have durable logs that they use to store requested transaction procedures and protocol states. As discussed in detail below, if a failure causes a particular participant to restart, the log is consulted to determine the last state of the participant. The information in the log can also be provided to other participants.

C. Communication

In one embodiment, the parties involved in the transaction are interconnected through a bidirectional communication link. The link between two or more parties may be up or down. If the link is down, the messages are dropped. If the link is up, the messages are received in the order they are sent. In one embodiment, the link comprises a “keep-alive” mechanism that quickly detects when nodes or other network components fail. The parties are notified when a link goes up or down. When a link goes down between two parties, for example, both parties are notified before it comes back up. In one embodiment, the link comprises a TCP connection. In one embodiment, the link could also include an SDP connection over Infiniband, a wireless network, a wired network, a serial connection, IP over FibreChannel, proprietary communication links, connection based datagrams or streams, and/or connection based protocols.

D. Failures

Any party, including participants and initiators, is said to fail when it stops executing. The failed party may, however, be able to reboot or otherwise restart. Once the failure is resolved by restarting, the party may resume participation in the transaction. A party is also said to fail when one or more of its communication links with other parties go down. This failure is over once the communication links are back up.

In the following description, reference is made to the accompanying drawings, which form a part hereof, and which show, by way of illustration, specific embodiments or processes in which the invention may be practiced. Where possible, the same reference numbers are used throughout the drawings to refer to the same or like components. In some instances, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. The present disclosure, however, may be practiced without the specific details or with certain alternative equivalent components and methods to those described herein. In other instances, well-known components and methods have not been described in detail so as not to unnecessarily obscure aspects of the present disclosure.

II. Modified Two-Phase Commit Protocol

Improvements to the two-phase commit protocol include converting back to the deterministic approach described above when the coordinator fails or otherwise disconnects from all of the participants. Initially, the participants send “prepared” messages to the coordinator and expect to receive a “commit” or “abort” message from the coordinator. As participants are disconnected from the coordinator, they send “prepared” messages to the other participants. Once a particular participant is disconnected from the coordinator, it no longer accepts “commit” or “abort” messages from the coordinator. When the particular participant receives “prepared” messages from all the other participants, it commits the transaction and sends a “committed” message to the other participants. If a participant fails, rather than the coordinator, the coordinator aborts the transaction and notifies all other participants. The participant that failed is notified of the outcome when it recovers from the failure.
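The fallback from coordinator-driven to peer-driven resolution can be illustrated with a small sketch of a single prepared participant. The class `M2PCParticipant`, its method names, and the `outbox` list used to record outgoing messages are all hypothetical; the sketch assumes the participant has already sent its own “prepared” message to the coordinator.

```python
from typing import Iterable, List, Optional, Set, Tuple


class M2PCParticipant:
    """Hypothetical sketch of one prepared participant in the modified protocol."""

    def __init__(self, name: str, peers: Iterable[str]) -> None:
        self.name = name
        self.peers: Set[str] = set(peers)         # the other participants
        self.prepared_from: Set[str] = set()      # peers known to be prepared
        self.coordinator_connected = True
        self.outcome: Optional[str] = None        # "committed" or "aborted"
        self.outbox: List[Tuple[str, str]] = []   # (recipient, message)

    def _broadcast(self, message: str) -> None:
        for peer in sorted(self.peers):
            self.outbox.append((peer, message))

    def on_coordinator_message(self, message: str) -> None:
        # After a disconnect, coordinator messages are no longer accepted.
        if not self.coordinator_connected or self.outcome is not None:
            return
        self.outcome = "committed" if message == "commit" else "aborted"

    def on_coordinator_disconnect(self) -> None:
        self.coordinator_connected = False
        if self.outcome is not None:
            return
        self._broadcast("prepared")               # fall back to the peers
        self._maybe_commit()

    def on_peer_prepared(self, peer: str) -> None:
        self.prepared_from.add(peer)
        self._maybe_commit()

    def _maybe_commit(self) -> None:
        # Decide among the participants only after losing the coordinator.
        if (not self.coordinator_connected and self.outcome is None
                and self.prepared_from == self.peers):
            self.outcome = "committed"
            self._broadcast("committed")
```

The participant records “prepared” messages from its peers even while the coordinator is reachable, but acts on them only after the disconnect, which keeps it from deciding independently while the coordinator can still resolve the transaction.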

When the coordinator is not located on the same node as a participant, this improvement to the two-phase commit protocol is non-blocking (at least for single-failures). However, blocking occurs when the coordinator shares a node with a participant. For example, the shared node may fail before the other participants receive an abort or commit from the coordinator. After losing communication with the coordinator, the other participants block until the shared node recovers from the failure and the participant thereon sends a message to indicate how it resolved the transaction. In many applications, a separate node for the coordinator is not available. Therefore, undesirable blocking may occur.

III. More-Modified Two-Phase Commit Protocol

As discussed above, the modified two-phase commit protocol (“M2PC”) provided a single-failure non-blocking behavior, but required the coordinator to reside on a separate node from all other parties, thereby limiting the ability to implement the modified two-phase commit protocol. Discussed below is a “more-modified” two-phase commit protocol (“MM2PC”) that allows the coordinator to reside on the same node as a participant (for example, a shared participant), such that if that node fails, the non-shared participants can determine the state of the shared participant and deterministically resolve the outcome of the transaction.

The MM2PC protocol is similar to the M2PC protocol in that it utilizes a coordinator c to collect “committed” and “aborted” messages from the participants and to alert the participants as to the transaction's status. The M2PC and MM2PC protocols include a “first prepared” state in which the participants expect the coordinator to resolve the transaction, for example state Pc as discussed below, and a “second prepared” state for situations in which the coordinator becomes disconnected from one or more of the participants, for example state Pp as discussed below. The participants collect information from the other participants in case the coordinator becomes disconnected from one or more participants. The participants transition to the second prepared state when they have lost their connection to the coordinator. Once in the second prepared state, a participant then determines its status based on status messages from other participants instead of the coordinator. In the MM2PC protocol, however, the coordinator does not send a “commit” message to the shared participant. Instead, the shared participant receives “committed” messages from the other participants. Since the remote participants notify the shared participant of the transaction's outcome, the remote participants can resolve the transaction even if they become disconnected from the coordinator.

The MM2PC protocol may also include “collection” states that allow a participant to verify that the participant has received either “aborted” messages from all of the other participants or “committed” messages from all of the other participants. This verification allows the participant to be sure that the other participants are aware of the status of the transaction before the participant clears its log of status information regarding the transaction.

A. MM2PC Exemplary Timing Chart

FIG. 1 illustrates an exemplary timing chart according to one embodiment of a commit protocol 100 for a transaction involving an initiator 110 (shown as “i”), a first participant 112 (shown as “p₁”), a second participant 114 (shown as “p₂”), a shared participant 116 (shown as “p_s”), and a coordinator 118 (shown as “c”). The exemplary shared participant 116 and the coordinator 118 reside on the same node.

The initiator 110 is configured to start the transaction by sending “start” messages (not shown) to the coordinator 118 and the participants p₁, p₂, p_s. In one embodiment, the initiator 110 collects responses to the “start” messages before requesting the participants p₁, p₂, and p_s to commit the transaction. To request commitment to the transaction, the initiator 110 sends “prepare” messages 120 (three shown) to the first participant 112, the second participant 114, and the shared participant 116.

The first participant 112, the second participant 114, and the shared participant 116 each log their respective “prepare” message 120 and each determine whether they are prepared to commit the transaction. If the first participant 112 can commit the transaction, the first participant 112 sends a “prepared” message 122 to the coordinator 118. If the second participant 114 can commit the transaction, the second participant 114 sends a “prepared” message 122 to the coordinator 118. If the shared participant 116 can commit the transaction, the shared participant 116 sends a “prepared” message 122 to the coordinator 118. If the coordinator receives a “prepared” message 122 from the first participant 112, the second participant 114, and the shared participant 116, the coordinator 118 sends “commit” messages 124 (two shown) to the first participant 112 and the second participant 114. The coordinator 118 does not send a “commit” message to the shared participant 116.

After receiving the “commit” messages 124 from the coordinator 118, the first participant 112 and the second participant 114 each log the “commits” and each send “committed” messages 126 (six shown) to each other, to the shared participant 116, and to the initiator 110. For example, the first participant 112 would send a “committed” message 126 to the second participant 114, the shared participant 116, and the initiator 110. Upon receiving a “committed” message 126 from either the first participant 112 or the second participant 114, the shared participant 116 commits the transaction, logs the received “committed” message 126 and sends “committed” messages 128 (three shown) to the initiator 110, the first participant 112, and the second participant 114. The first participant 112, the second participant 114, and the shared participant 116 can then clean their respective logs and the commit protocol 100 ends.
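The failure-free exchange described above can be enumerated with a short sketch, which is a convenient way to check the message counts shown in FIG. 1. The function `failure_free_flow` and the `(sender, kind, recipient)` tuple encoding are assumptions for illustration. The sketch lists messages in one plausible order and does not model timing or logging.

```python
from collections import Counter
from typing import List, Tuple

Message = Tuple[str, str, str]  # (sender, kind, recipient)


def failure_free_flow(remote: List[str], shared: str = "p_s",
                      initiator: str = "i",
                      coordinator: str = "c") -> List[Message]:
    """List the messages of one failure-free run (hypothetical encoding)."""
    participants = remote + [shared]
    msgs: List[Message] = []
    # The initiator asks every participant to prepare.
    msgs += [(initiator, "prepare", p) for p in participants]
    # Every participant reports "prepared" to the coordinator.
    msgs += [(p, "prepared", coordinator) for p in participants]
    # The coordinator sends "commit" to the remote participants only.
    msgs += [(coordinator, "commit", p) for p in remote]
    # Each remote participant tells the other participants and the initiator.
    for p in remote:
        msgs += [(p, "committed", q)
                 for q in participants + [initiator] if q != p]
    # The shared participant commits and notifies the remote participants
    # and the initiator.
    msgs += [(shared, "committed", q) for q in remote + [initiator]]
    return msgs


flow = failure_free_flow(["p1", "p2"])
counts = Counter(kind for _, kind, _ in flow)
```

For two remote participants, `counts` holds three “prepare”, three “prepared”, two “commit”, and nine “committed” messages (six from the remote participants and three from the shared participant), matching the counts given for messages 120, 122, 124, 126, and 128.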

The exemplary timing chart shown in FIG. 1 illustrates the commit protocol 100 when no failures occur. If the coordinator 118 has not disconnected from the first participant 112 or the second participant 114, the coordinator 118 determines whether to commit or abort the transaction. The coordinator 118 commits the transaction when all of the “prepared” messages 122 are received from each of the participants 112, 114, 116.

As discussed in detail below, if the node with the coordinator 118 and the shared participant 116 fails, the first participant 112 and the second participant 114 are still able to resolve the transaction. If the coordinator 118 sent the “commit” messages 124 before failing, the participants p₁ and p₂ (112, 114) commit the transaction since they know that the shared participant 116 successfully prepared. However, if the coordinator 118 did not send the “commit” messages 124 before failing, the participants p₁ and p₂ (112, 114) abort the transaction since they do not know whether the shared participant 116 successfully prepared. When the shared participant 116 reconnects, the participants p₁ and p₂ (112, 114) inform the shared participant 116 of their decision.

So long as at least one of the first participant 112 and the second participant 114 is connected to the coordinator 118, the connected participant can still receive a “commit” or “abort” message from the coordinator 118. To avoid ending up in different states, the first participant 112 and the second participant 114 only decide whether to commit or abort if they have both been disconnected from the coordinator 118. Further, once disconnected from the coordinator 118, the first participant 112 or the second participant 114 no longer accepts “commit” or “abort” messages from the coordinator 118. Since the participants 112, 114, 116 do not look to the coordinator 118 after a failure, the coordinator 118 does not log the “prepared” messages 122 received from the participants 112, 114, 116 and does not clean its log at the end of the commit protocol 100.

FIGS. 2 and 3 illustrate state diagrams according to one embodiment of an MM2PC protocol. Parties in a transaction using the exemplary MM2PC protocol include a coordinator c, a shared participant p_s on the same node as the coordinator c, one or more remote participants p selected from the set defined by {p₁, p₂, p_s}, and an initiator i.

B. Coordinator States

FIG. 2 illustrates a state diagram having an initial state I and a final state F of the coordinator c during execution of the MM2PC protocol. The coordinator c can be in a state “s_c” defined by:

s_c ∈ {(I, S) | S ⊂ P} ∪ {F},

wherein P is a set of participants defined by P={p₁, p₂, . . . , p_n, p_s}. In addition, the participant p_x represents any one of the participants in the set P={p₁, p₂, . . . , p_n, p_s}. For example, in the MM2PC protocol 100 shown in FIG. 1, P={p₁, p₂, p_s}. In one embodiment, the variable S is a proper subset of the participants P and represents the participants in P for which the coordinator c has received “prepared” messages. The coordinator c remains in the initial state I until S=P. The coordinator c then transitions to the final state F. As discussed below, the coordinator c can also transition from the initial state I to the final state F if the initiator i or a participant in P aborts the transaction or if any of the participants in P disconnect before sending a “prepared” message. In this embodiment, because the shared participant p_s and the coordinator c are located on the same node, it may be assumed that the shared participant p_s and the coordinator c will not become disconnected. Thus, in the exemplary embodiment, the coordinator c would not become disconnected from the shared participant p_s, and thus would only transition from the initial state I to the final state F if the initiator i or a participant in P, not including the shared participant p_s, disconnected before sending a “prepared” message.

In the initial state I, the coordinator c receives messages from the initiator i and the participants in P. The coordinator c may receive a “prepared” message from any of the participants p_x (for example, prepared(p_x)). If the coordinator c receives a “prepared” message from any of the participants p_x, the coordinator c adds the participant p_x to the set of known prepared participants S (for example, S=S∪{p_x}).

Upon receiving “prepared” messages from all of the participants in the set P (for example, S=P), the coordinator c sends a “commit” message to the participants in P except for the shared participant p_s (for example, commit(P\{p_s})) and changes from the initial state I to the final state F. As noted above, in this embodiment, the coordinator c does not send “commit” messages to the shared participant p_s. If the coordinator c receives an “aborted” message from any of the participants p_x or the initiator i (for example, aborted(p_x, i)), or if the coordinator c detects that any of the participants p_x are disconnected (for example, disconnect(p_x)), the coordinator c sends an “abort” message to the participants in P except for the shared participant p_s (for example, abort(P\{p_s})). As noted above, in this embodiment, the coordinator c does not send “abort” messages to the shared participant p_s. The coordinator c then changes from the initial state I to the final state F and no longer participates in the transaction.

The following exemplary pseudocode further describes the coordinator's execution of the MM2PC protocol:

function abort(S):
    send abort to S \ {p_s}
    set state to F

function commit(S):
    send commit to S \ {p_s}
    set state to F

in state (I, S):
    on disconnect from p ∉ S:
        abort(P)
    on aborted from (p_x, i):
        abort(P)
    on prepared from (p_x):
        if S ∪ {p_x} ≠ P:
            set state to (I, S ∪ {p_x})
        else:
            commit(P)

on start:
    set state to (I, Ø)
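For readers who prefer an executable form, the coordinator pseudocode can be transcribed into a small state machine. This is a sketch under stated assumptions rather than a definitive implementation: the class `MM2PCCoordinator`, its method names, and the `sent` list that records outgoing messages are hypothetical, and message transport and logging are omitted.

```python
from typing import Iterable, List, Optional, Set, Tuple


class MM2PCCoordinator:
    """Hypothetical transcription of the coordinator states (I, S) and F."""

    def __init__(self, participants: Iterable[str], shared: str) -> None:
        self.P: Set[str] = set(participants)      # all participants, incl. p_s
        self.shared = shared                      # p_s, on the coordinator's node
        self.state: Optional[str] = None          # None until "start" arrives
        self.S: Set[str] = set()                  # known prepared participants
        self.sent: List[Tuple[str, str]] = []     # (recipient, message)

    def on_start(self) -> None:
        self.state, self.S = "I", set()           # set state to (I, empty set)

    def _finish(self, message: str) -> None:
        # Send to P \ {p_s}: the shared participant is never messaged.
        for p in sorted(self.P - {self.shared}):
            self.sent.append((p, message))
        self.state = "F"

    def on_disconnect(self, p: str) -> None:
        # Only a disconnect from a participant outside S aborts.
        if self.state == "I" and p not in self.S:
            self._finish("abort")

    def on_aborted(self, sender: str) -> None:
        # sender may be a participant p_x or the initiator i.
        if self.state == "I":
            self._finish("abort")

    def on_prepared(self, p: str) -> None:
        if self.state != "I":
            return
        self.S.add(p)
        if self.S == self.P:                      # all participants prepared
            self._finish("commit")
```

As in the pseudocode, a disconnect from a participant that has already sent its “prepared” message leaves the coordinator in state I, and no message is ever addressed to the shared participant.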

C. Participant States

FIG. 3 illustrates a state diagram of any of the participants p_x during execution of the MM2PC protocol. In the following description of FIG. 3, reference to a “participant p_x” refers to any of the participants p_x (for example, p₁, p₂, . . . , p_n, p_s) including the shared participant p_s. The participants p₁, p₂, . . . , p_n or the shared participant p_s may also be referred to separately. In the MM2PC protocol, the participants in P resolve the transaction if the coordinator c fails. The participant p_x is configured to communicate with the coordinator c, the initiator i and one or more other participants p′ selected from P. The other participant p′ may be, for example, the shared participant p_s when the state diagram shown in FIG. 3 corresponds to the participant p_n. As another example, if there are three participants, p₁, p₂, p₃ and a shared participant p_s, if FIG. 3 corresponds to p₁ then p′ may be p₂, p₃, or p_s; if FIG. 3 corresponds to p_s then p′ may be p₁, p₂, or p₃.

The state diagram illustrated in FIG. 3 includes an initial state I, a first prepared state Pc, a second prepared state Pp, an aborted state A, a committed state C, and a final state F. In the first prepared state Pc, the participant p_x expects to receive an “abort” or “commit” message from the coordinator c. In the second prepared state Pp, the participants in P decide amongst themselves how to resolve the transaction. The participant p_x can be in a state “s_px” defined by:

s_px ∈ {(r, S) | r ∈ {I, Pc, Pp, A, C}; S ⊂ P} ∪ {F}

wherein the variable S is a proper subset of the participants P and represents the participants in P for which the participant p_x has received “prepared” messages. The participant p_x remains in one of the states of which the variable r is a member until S=P. The participant p_x then transitions to the final state F. As discussed below, the participant p_x transitions to the final state F after performing a garbage collection procedure.

A detailed discussion of each participant state is set forth below.

1. Garbage Collection and Restart

The participant p_x records messages sent or received during the MM2PC protocol in a log. The participant p_x can provide the information in the log to another participant p′ that may not have received one or more messages sent, for example, when the other participant p′ was disconnected. The participant p_x can also use the information when the participant p_x restarts after a disconnect or failure to determine the outcome of the transaction.

The aborted state A and the committed state C are garbage collection states. In these states, the participant p_xhas already committed or aborted the transaction. However, the participant p_xwaits until the other participants in P complete the transaction before clearing its log. If the participant p_xaborts the transaction, it includes itself in a set of known aborted participants A′ (for example, A′={p_x}); A′ represents a subset of the participants in P for which the participant p_xhas received “aborted” messages. If the participant p_xcommits, it includes itself in a set of known committed participants C′ (for example, C′={p_x}); C′ represents a subset of the participants in P for which the participant p_xhas received “committed” messages.

As mentioned above, the participant p_xkeeps a log that it can use when another participant p′ reconnects or when the participant p_xrestarts after a disconnect or failure. On restart, the participant p_xno longer accepts messages from the coordinator c. If the last entry in the log was “start,” the participant p_xdid not receive a “prepare” message from the initiator and can abort the transaction. If the last entry in the log was “prepare,” the participant p_xchecks to see if it has received “prepared” messages from all of the participants in P except for the shared participants p_s(S=P\{p_s}). If S=P\{p_s}, the participant p_xaborts the transaction. If S≠P\{p_s}, the participant p_xenters or remains in the second prepared state Pp, which is discussed in detail below.

If the last entry in the log was “abort,” the participant p_xdetermines whether it has received “aborted” messages from all of the other participants in P (A′=P). If A′≠P, the participant p_xenters the abort state A. If A′=P, the participant p_xclears its log and enters the final state F. If the last entry in the log is “commit,” the participant p_xdetermines whether it has received “committed” messages from all of the other participants in P (C′=P). If C′≠P, the participant p_xenters the committed state C. If C′=P, the participant p_xclears its log and enters the final state F.

The following exemplary pseudocode illustrates one embodiment of garbage collection and restart for the participant p_x:

The functions abort_count( ) and commit_count( ) respectively check A′ and C′ against the participants in P. The function forget( ) clears the log at the end of the transaction so it can be used for subsequent transactions. The abort( ) function sends an “aborted” message to the other participants in P, the initiator i, and the coordinator c. The commit( ) function sends a “committed” message to the other participants in P and the initiator i. The participant p_x does not send the “committed” message to the coordinator c because the coordinator c either told the participant p_x to commit or the participants in P decided to commit when the coordinator c was no longer involved in the transaction. Further details about the aborted state A and the committed state C are discussed below.
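The restart rules described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the pseudocode of the embodiment: the function name restart_state, its parameters, the string log entries, and the returned state names are all hypothetical, and message sending and log clearing are left out.

```python
def restart_state(last_entry, me, P, p_s, S, A_known, C_known):
    """Return the state name a participant resumes in after a restart.

    Hypothetical signature: P is the participant set, p_s the shared
    participant, and S, A_known, C_known the sets of participants known
    to be prepared, aborted, and committed (S, A', C' in the text).
    """

    def collect(state, known):
        # Garbage collection: forget the transaction (state F) once every
        # participant in P is known to have reached the same outcome.
        return "F" if known == P else state

    if last_entry == "start":
        # No "prepare" was logged, so aborting is safe.
        return collect("A", A_known | {me})
    if last_entry == "prepare":
        if S == P - {p_s}:
            # Every remote participant is prepared, but the shared
            # participant's vote is unknown, so abort.
            return collect("A", A_known | {me})
        return "Pp"
    if last_entry == "abort":
        return collect("A", A_known)
    if last_entry == "commit":
        return collect("C", C_known)
    raise ValueError(f"unexpected log entry: {last_entry!r}")
```

For example, with P = {"p1", "p2", "ps"} and a last log entry of “prepare,” the sketch returns "Pp" while only p1 is known to be prepared, and "A" once both p1 and p2 are.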

2. The States

a. The Initial State I

As illustrated in FIG. 3, in the initial state I, the participant p_x receives a “prepare” message from the initiator i (for example, prepare(i)). If the participant p_x has an error such that it cannot perform the transaction, the participant p_x aborts the transaction. The participant p_x may also abort the transaction if it detects a disconnect from the initiator i or the coordinator c (for example, disconnect(i, c)) or if it receives an “aborted” message from another participant p′ (for example, aborted(p′)). If the participant p_x receives the “aborted” message from another participant p′, it adds itself and the other participant p′ to the set of known aborted participants A′ (for example, A′={p_x, p′}). Further, the participant p_x aborts if it receives an “abort” message from the coordinator c (for example, abort(c)). It should also be noted that the shared participant p_s cannot disconnect from the coordinator c since they are on the same node.

If the participant p_x aborts, it sends an “aborted” message to the participants in P, the coordinator c and the initiator i (for example, aborted(P, c, i)), and enters the aborted state A. If, on the other hand, the participant p_x can commit the transaction after receiving the “prepare” message from the initiator i, it sends a “prepared” message to the coordinator c (for example, prepared(c)) and enters the first prepared state Pc.

While in the initial state I, the participant p_x may also receive a “prepared” message from another participant (for example, prepared(p′)). As discussed below, if the participant p_x later enters the second prepared state Pp, it will need to know that the other participant p′ is also in the second prepared state Pp. Thus, upon receiving a “prepared” message from the other participant p′, the participant p_x adds the other participant p′ to the subset S (for example, S=S∪{p′}).

The following exemplary pseudocode illustrates one embodiment of the participant p_x in the initial state I:

in state (I, S):
    on disconnect from i or c:
        abort({p_x})
    on abort from c:
        abort({p_x})
    on aborted from p′:
        abort({p_x, p′})
    on prepared from p′:
        set state to (I, S ∪ {p′})
    on prepare from i:
        if error:
            abort({p_x})
        else:
            log(prepare)
            send prepared to c
            set state to (Pc, S)

b. The First Prepared State Pc

In the first prepared state Pc, the participant p_x expects to receive a “commit” or “abort” message from the coordinator c. As discussed above, in some embodiments, the shared participant p_s may ignore commands from the coordinator c. If the participant p_x receives a “commit” message from the coordinator c (for example, commit(c)), the participant p_x commits the transaction and sends a “committed” message to the other participants in P, the coordinator c, and the initiator i (for example, committed(P, c, i)). The participant p_x then enters the committed state C. If the participant p_x receives an “abort” message from the coordinator c (for example, abort(c)), the participant p_x aborts the transaction and sends an “aborted” message to the other participants in P, the coordinator c, and the initiator i (for example, aborted(P, c, i)). The participant p_x then enters the aborted state A.

While in the first prepared state Pc, the participant p_x may receive a “committed” or “aborted” message from another participant p′ (for example, committed(p′) or aborted(p′)). In response to receiving a “committed” message from another participant p′, the participant p_x adds itself and the other participant p′ to the set of known committed participants C′ (for example, C′={p_x, p′}), sends a “committed” message to the other participants in P, the coordinator c, and the initiator i (for example, committed(P, c, i)), and transitions to the committed state C. In response to receiving an “aborted” message from another participant p′, the participant p_x aborts the transaction, adds itself and the other participant p′ to the set of known aborted participants A′ (for example, A′={p_x, p′}), sends an “aborted” message to the other participants in P, the coordinator c, and the initiator i (for example, aborted(P, c, i)), and enters the aborted state A.

The participant p_x may also receive a “prepared” message from another participant p′ while in the first prepared state Pc. Upon receiving the “prepared” message from another participant p′, the participant p_x adds the other participant p′ to the set of known prepared participants S (for example, S=S∪{p′}). The participant p_x may also detect a disconnect from the coordinator c (for example, disconnect(c)). As discussed above, the shared participant p_s does not disconnect from the coordinator c since it resides on the same node. Upon determining that the coordinator c is disconnected, the participant p_x sends a “prepared” message to the other participants in P (for example, prepared(P)) and enters the second prepared state Pp.

The following exemplary pseudocode illustrates one embodiment of the participant p_x in the first prepared state Pc:

in state (Pc, S):
    on disconnect from c:
        send prepared to P \ {p_x}
        prepare_p_count(S ∪ {p_x})
    on abort from c:
        abort({p_x})
    on aborted from p′:
        abort({p_x, p′})
    on commit from c:
        commit({p_x})
    on committed from p′:
        commit({p_x, p′})
    on prepared from p′:
        set state to (Pc, S ∪ {p′})

The definitions for the functions abort, commit, and prepare_p_count are discussed above in section I with respect to “The Initial State I.”

c. The Second Prepared State Pp

In the second prepared state Pp, the participants in P decide amongst themselves how to resolve the transaction. As discussed above, the shared participant p_s does not enter the second prepared state Pp because it cannot disconnect from the coordinator c.

The participant p_x cannot decide to commit once all of the participants in P (except for the shared participant p_s) enter the second prepared state Pp because the participant p_x does not know whether the shared participant p_s successfully prepared. However, if the participant p_x receives a “committed” message from another participant p′ (for example, committed(p′)), the participant p_x commits since receiving the “committed” message from the other participant p′ indicates that the other participant p′ received a “commit” message from the coordinator c and also committed. The participant p_x then adds itself and the other participant p′ to the set of known committed participants C′, sends a “committed” message to the other participants in P, the coordinator c, and the initiator i (for example, committed(P, c, i)), and transitions to the committed state C.

While in the second prepared state Pp, the participant p_x may receive an “aborted” message from another participant p′ (for example, aborted(p′)). In response, the participant p_x adds itself and the other participant p′ to the set of known aborted participants A′, sends an “aborted” message to the other participants in P, the coordinator c, and the initiator i (for example, aborted(P, c, i)), and transitions to the aborted state A.

The participant p_x may also receive a “prepared” message from another participant p′ while in the second prepared state Pp. Upon receiving the “prepared” message from another participant p′, the participant p_x adds the other participant p′ to the set of known prepared participants S (for example, S=S∪{p′}). If S=P\{p_s}, the participant p_x aborts the transaction since all of the participants in P except for the shared participant p_s have disconnected from the coordinator c but do not know whether the shared participant p_s is prepared to commit the transaction.

If another participant p′ connects to the participant p_x (for example, connect(p′)) while the participant p_x is in the second prepared state Pp, the participant p_x sends a “prepared” message to the other participant p′ (for example, prepared(p′)). This informs the other participant p′ of the state of the participant p_x if, for example, the other participant p′ did not receive one or more messages while it was disconnected.

The following exemplary pseudocode illustrates one embodiment of the participant p_x in the second prepared state Pp:

in state (Pp, S):
    on connect to p′:
        send prepared to p′
    on aborted from p′:
        abort({p_x, p′})
    on committed from p′:
        commit({p_x, p′})
    on prepared from p′:
        prepare_p_count(S ∪ {p′})
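The event handling of the second prepared state Pp can also be sketched as a pure function in Python. This is a hedged illustration only: the name step_pp, the event strings, and the placeholder names "c" and "i" for the coordinator and the initiator are assumptions, and the behavior attributed to prepare_p_count is inferred from the prose above (abort once S=P\{p_s}, otherwise remain in Pp). The garbage collection check that may move an aborted or committed participant straight to the final state F is omitted.

```python
def step_pp(me, P, p_s, S, kind, sender):
    """Handle one event for a participant in state (Pp, S).

    Returns (state, tracked_set, outgoing), where outgoing is a list of
    (message, recipients) pairs. All names are illustrative assumptions.
    """
    everyone = (P - {me}) | {"c", "i"}
    if kind == "connect":
        # Tell a (re)connected peer that this participant is prepared.
        return "Pp", S, [("prepared", {sender})]
    if kind == "aborted":
        return "A", {me, sender}, [("aborted", everyone)]
    if kind == "committed":
        # A peer can only have committed on the coordinator's instruction.
        return "C", {me, sender}, [("committed", everyone)]
    if kind == "prepared":
        S = S | {sender}
        if S == P - {p_s}:
            # All remote participants lost the coordinator and nobody
            # knows the shared participant's vote, so abort.
            return "A", {me}, [("aborted", everyone)]
        return "Pp", S, []
    raise ValueError(f"unexpected event: {kind!r}")
```

Keeping the handler free of side effects makes each transition easy to check in isolation; a real embodiment would also log each message before acting on it.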

d. The Committed State C

As discussed above, the committed state C is a garbage collection state wherein the participant p_x handles information stored in a log during its execution of the MM2PC protocol. The participant p_x waits until the other participants in P complete the transaction before clearing its log so that it can provide the information in the log to another participant p′ that may not have received one or more messages sent, for example, when the other participant p′ was disconnected.

In the committed state C, the participant p_x may receive a “committed” message from another participant p′ (for example, committed(p′)). In response, the participant p_x adds the other participant p′ to the set of known committed participants C′ (for example, C′=C′∪{p′}). Once all the participants in P have committed (for example, C′=P), the participant p_x clears its log (for example, clean log) and transitions to the final state F.

When the participant p_x detects the connection or reconnection with another participant p′ (for example, connect(p′)), the participant p_x notifies the other participant p′ that it is committed to ensure that the earlier “committed” message was not missed. Again, the participant p_x waits in the committed state C until C′=P. However, if the participant p_x did not receive a “committed” message from the other participant p′ when it was disconnected, and if the other participant p′ did receive the earlier “committed” message from the participant p_x such that it is finished with the transaction, the participant p_x does not know whether the other participant p′ committed. To avoid a large number of messages being sent between the participants in P, the participants in P are not required to respond to “committed” messages. Thus, the other participant p′ will not send another “committed” message to the participant p_x. Therefore, the participant p_x will block as it remains in the committed state C.

To avoid this blocking, the participant p_x sends a “committed′” message to the other participant p′ (for example, committed′(p′)) in response to connect(p′). The committed′ message indicates to the receiver of the message that the sender does not know if the receiver has resolved the transaction. If the other participant p′ is in the committed state C or the final state F, it will return the committed(p′) message to the participant p_x. Thus, in the committed state C, the participant p_x can add the other participant p′ to the variable C′. Likewise, if the participant p_x receives the committed′(p′) message from another participant p′ while in the committed state C or the final state F, the participant p_x will respond by sending the committed(p′) message to the other participant p′. In the committed state C, the participant p_x also adds the other participant p′ to the variable C′.

The following exemplary pseudocode illustrates one embodiment of the participant p_x in the committed state C:

in state (C, C′):
    on connect to p′:
        send committed′ to p′
    on committed from p′:
        commit_count(C′ ∪ {p′})
    on committed′ from p′:
        send committed to p′
        commit_count(C′ ∪ {p′})
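The committed′ exchange that prevents blocking can be sketched in Python as two small handlers, one for a participant still in the committed state C and one for a participant that has already reached the final state F. The function names, event strings, and return shapes are assumptions made for illustration, and the body of commit_count is taken to be the C′=P check described above.

```python
def on_event_in_C(P, C_known, kind, sender):
    """Handle one event in state (C, C'). Returns (state, C', replies)."""
    replies = []
    if kind == "connect":
        # Ask the reconnected peer to confirm, since its earlier
        # "committed" message may have been missed.
        return "C", C_known, [("committed'", sender)]
    if kind == "committed'":
        replies.append(("committed", sender))
    if kind in ("committed", "committed'"):
        C_known = C_known | {sender}
        # commit_count: clear the log once everyone is known committed.
        state = "F" if C_known == P else "C"
        return state, C_known, replies
    raise ValueError(f"unexpected event: {kind!r}")


def on_event_in_F(kind, sender):
    """A participant with a cleared log still answers committed'."""
    return [("committed", sender)] if kind == "committed'" else []
```

In this sketch a participant p1 that reconnects to p2 sends committed′; p2, already in F, answers with committed, and p1 can then complete garbage collection instead of waiting indefinitely.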

e. The Aborted State A

As discussed above, the aborted state A is also a garbage collection state wherein the participant p_x handles information stored in a log during its execution of the MM2PC protocol. In the aborted state A, the participant p_x may receive an “aborted” message from another participant p′ (for example, aborted(p′)). In response, the participant p_x adds the other participant p′ to the set of known aborted participants A′ (for example, A′=A′∪{p′}). Once all the participants in P have aborted (for example, A′=P), the participant p_x clears its log (for example, clean log) and transitions to the final state F.

When the participant p_x detects the connection or reconnection with another participant p′ (for example, connect(p′)), the participant p_x notifies the other participant p′ that it has aborted to ensure that the earlier “aborted” message was not missed. Again, the participant p_x waits in the aborted state A until A′=P. To avoid the blocking problem discussed above in relation to the committed state C, the participant p_x sends an “aborted′” message to the other participant p′ (for example, aborted′(p′)) in response to connect(p′). The aborted′ message indicates to the receiver of the message that the sender does not know if the receiver has resolved the transaction. If the other participant p′ is in the aborted state A or the final state F, it will return the aborted(p′) message to the participant p_x. Thus, in the aborted state A, the participant p_x can add the other participant p′ to the variable A′. Likewise, if the participant p_x receives the aborted′(p′) message from another participant p′ while in the aborted state A or the final state F, the participant p_x will respond by sending the aborted(p′) message to the other participant p′. In the aborted state A, the participant p_x also adds the other participant p′ to the variable A′.

The following exemplary pseudocode illustrates one embodiment of the participant p_x in the aborted state A:

in state (A, A′):
    on connect to p′:
        send aborted′ to p′
    on aborted from p′:
        abort_count(A′ ∪ {p′})
    on aborted′ from p′:
        send aborted to p′
        abort_count(A′ ∪ {p′})

III. Two-Phase Commit Version 2 Protocol

While the MM2PC protocol allows the coordinator to reside on the same node as a participant, the MM2PC protocol does not address set-up phase transactions and may involve a large number of clean up messages. The two-phase commit version 2 protocol (“2PCV2”) addresses set-up phase transactions, allows for the late addition of participants, and reduces clean up messages. The 2PCV2 protocol includes an initiator i, a coordinator c, as well as a set of participants {p₁, p₂, . . . p_n}. The initiator i and the coordinator c reside on the same node such that they never get disconnected from each other. In addition, one of the participants may also reside on the same node as the initiator and the coordinator. The participant, if any, that resides on the same node as the initiator and the coordinator is referred to herein as the shared participant p_s. The remote participants notify the shared participant of the transaction's outcome, thereby allowing the remote participants to resolve the transaction if they become disconnected from the coordinator. In addition, the initiator receives “committed” messages from the participants rather than from the coordinator.

A. 2PCV2 Exemplary Timing Chart

FIG. 4 illustrates an exemplary timing chart according to one embodiment of a 2PCV2 commit protocol 400 for a transaction involving an initiator 410 (shown as “i”), a first participant 412 (shown as “p₁”), a second participant 414 (shown as “p₂”), a shared participant 416 (shown as “p_s”) and a coordinator 418 (shown as “c”). As discussed above, the initiator 410 and the coordinator 418 are on the same node. In the example shown in FIG. 4, the shared participant 416 is also on the same node as the initiator 410 and the coordinator 418. The first participant 412 and the second participant 414 are located on remote nodes.

During the transaction, the initiator 410 adds the first participant 412, the second participant 414, and the shared participant 416 to the transaction. As it does so, the initiator 410 sends start messages 419 (three shown) to the first participant 412, the second participant 414, and the shared participant 416. When the initiator 410 is ready to try to commit the transaction, the initiator sends “prepare” messages 420 (four shown) to the coordinator 418, the first participant 412, the second participant 414, and the shared participant 416. In one embodiment, the coordinator 418 is configured to return a response 420a to the “prepare” message 420. Since the initiator 410 and the coordinator 418 are on the same node, the coordinator 418 receives the “prepare” message 420 before the remote participants 412, 414.

The first participant 412, the second participant 414, and the shared participant 416 respectively log the “prepare” messages 420 and determine whether they are prepared to commit the transaction. If they can commit the transaction, the first participant 412, the second participant 414, and the shared participant 416 each send a “prepared” message 422 (three shown) to the coordinator 418. If the coordinator 418 receives all of the “prepared” messages 422, the coordinator 418 sends “commit” messages 424 (two shown) to the first participant 412 and the second participant 414. The coordinator 418 does not send a “commit” message 424 to the shared participant 416.

After receiving the “commit” messages 424 from the coordinator 418, the first participant 412 and the second participant 414 each log their respective “commits” and send “committed” messages 426 to the shared participant 416. Thus, the shared participant 416 learns of the transaction's outcome from the other participants 412, 414. After committing to the transaction, the first participant 412, the second participant 414 and the shared participant 416 send “committed” messages 428 (three shown) to the initiator 410. For garbage collection purposes, the initiator 410 responds by sending “committed” messages 430 to the first participant 412, the second participant 414, and the shared participant 416. After receiving the “committed” message 430 from the initiator 410, the first participant 412, the second participant 414, and the shared participant 416 clear their respective logs and the commit protocol 400 ends.

The exemplary timing chart shown in FIG. 4 illustrates the commit protocol 400 when no failures occur. Since the remote participants 412, 414 notify the shared participant 416 of the transaction's outcome, the remote participants 412, 414 can resolve the transaction if they both become disconnected from the coordinator 418.

FIGS. 5-8 illustrate state diagrams according to one embodiment of a 2PCV2 protocol. As stated above, parties in a transaction using the 2PCV2 protocol include an initiator i, a coordinator c on the same node as the initiator i, and one or more remote participants p selected from the set defined by {p₁, p₂, . . . , p_n}. The parties may also include a shared participant p_s on the same node as the initiator i and the coordinator c.

B. Initiator States

FIG. 5 illustrates a state diagram for the initiator i having an unknown state U, an initial state I, a prepare state P_i, an aborted state A, and a committed state C.

1. Unknown State U

The initiator i begins and ends the transaction in the unknown state U. Upon receiving a start command (for example, start( )) from a user, the initiator transitions to the initial state I.

2. Initial State I

While the initiator i is in the initial state I, the transaction is being performed. In one embodiment, the initiator i is configured to manage the transaction among nodes by sending transaction commands to and receiving responses from the nodes involved in the transaction. For example, in a transaction to stripe a file across a plurality of nodes in a distributed file system, the distributed system determines the nodes in which it will save data blocks. For each node selected to participate in the transaction, the distributed system sends a message to the initiator i to include the node as a participant p in the transaction (for example, add_participant(p)). In response to the add_participant(p) message, the initiator i adds the participant p to the set of participants P (for example, P=P∪{p}) and sends a start command to the participant p (for example, start(p)).

While the initiator i is in the initial state I, the user may send an “abort” command (for example, abort( )) or a “commit” command (for example, commit( )) to the initiator i. If the initiator i receives an “abort” command from the user, the initiator i sends an “aborted” message to the participants in P (for example, aborted(P)) and transitions to the aborted state A. If the initiator i receives a “commit” command (for example, commit( )) before the user adds any participants to the transaction (for example, P=Ø), the initiator i returns true to the user (for example, return(true)) and transitions back to the unknown state U.

If the user has added participants to the transaction (for example, P≠Ø), the initiator i sends a “prepare” message to the coordinator c (for example, prepare(c)) and a “prepare” message to the participants in P (for example, prepare(P)), and transitions to the prepare state P_i. The prepare(c) and prepare(P) messages include a final set of participants in the set of participants P. In some embodiments, the prepare(c) message is configured to be received by the coordinator c before the prepare(P) messages are sent. Thus, in one embodiment the prepare(c) message can be implemented as a function call rather than a message.

3. Prepare State P_i

In the prepare state P_i, the initiator i waits to receive an “aborted” or “committed” message from any one of the participants in P (for example, aborted(p) or committed(p), respectively). If the initiator i receives an “aborted” message from the participant p, the initiator i removes the participant p from the set of known participants P (for example, P=P\{p}) and transitions to the aborted state A. If the initiator i receives a “committed” message from a participant p, the initiator i removes the participant p from the set of known participants P (P=P\{p}), adds the participant p to the set of committed participants C′ (for example, C′={p}), and transitions to the committed state C. As discussed below, the initiator i tracks which participants in P have committed or aborted by removing the participant p from the set of known participants P when an “aborted” or “committed” message is received.

If the initiator i becomes disconnected from the participant p (for example, disconnect(p)), the initiator i removes the participant p from the set of known participants P (for example, P=P\{p}). As discussed below, the disconnected participant p will resolve the transaction without receiving further information from the initiator i. Thus, the initiator i can ignore the disconnected participant p. However, if the initiator i becomes disconnected from all of the participants (for example, P=Ø), the initiator i transitions to the unknown state U and reboots.

4. The Aborted State A and the Committed State C

In the aborted state A, the initiator i removes participants from the set of participants P when it receives “aborted” messages from the participants or detects that the participants have become disconnected. When P=Ø, the initiator i returns false to the user (for example, return(false)) and transitions to the unknown state U.

In the committed state C, the initiator i removes participants from the set of participants P when it receives “committed” messages from the participants or detects that the participants have become disconnected. When the initiator i receives a “committed” message from a participant p, it also adds the corresponding participant p to the set of known committed participants C′ (for example, C′=C′∪{p}). When P=Ø, the initiator i sends “committed” messages to the participants in the set of known committed participants C′ (for example, committed(C′)), returns true to the user (for example, return(true)), and transitions back to the unknown state U. As discussed below, the “committed” message from the initiator i is used for garbage collection.

The following exemplary pseudocode illustrates one embodiment of the initiator i:

in state U:
    on start( ):
        set state to (I, Ø)

in state (I, P):
    on add_participant(p):
        set state to (I, P ∪ {p})
        send start to p
    on abort( ):
        send aborted to P
        set state to (A, P)
    on commit( ):
        if P = Ø:
            set state to U
            return(true)
        else:
            send prepare(P) to c
            send prepare(P) to P
            set state to (P_i, P)

in state (P_i, P):
    if P = Ø:
        set state to U
        reboot( )
    on disconnect from p ∈ P:
        set state to (P_i, P \ {p})
    on aborted from p ∈ P:
        set state to (A, P \ {p})
    on committed from p ∈ P:
        set state to (C, P \ {p}, {p})

in state (A, P):
    if P = Ø:
        set state to U
        return(false)
    on disconnect from p ∈ P:
        set state to (A, P \ {p})
    on aborted from p ∈ P:
        set state to (A, P \ {p})

in state (C, P, C′):
    if P = Ø:
        set state to U
        send committed(C′) to C′
        return(true)
    on disconnect from p ∈ P:
        set state to (C, P \ {p}, C′)
    on committed from p ∈ P:
        set state to (C, P \ {p}, C′ ∪ {p})
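For readers who prefer executable code, the initiator described above can be sketched as a small Python class. This is one possible reading, not the embodiment itself: the class name, the send(message, recipients) callback, the placeholder name "c" for the coordinator, and the result and rebooted attributes are all assumptions, and the participant set carried by the “prepare” messages is omitted for brevity.

```python
class Initiator:
    """Sketch of the 2PCV2 initiator; all names are illustrative."""

    # Events each waiting state reacts to; anything else is ignored.
    HANDLED = {
        "P_i": {"disconnect", "aborted", "committed"},
        "A": {"disconnect", "aborted"},
        "C": {"disconnect", "committed"},
    }

    def __init__(self, send):
        self.send = send          # hypothetical transport callback
        self.state = "U"
        self.P = set()            # participants still being tracked
        self.C = set()            # participants known committed (C')
        self.result = None        # True or False once resolved
        self.rebooted = False

    def start(self):
        assert self.state == "U"
        self.state, self.P, self.C = "I", set(), set()
        self.result, self.rebooted = None, False

    def add_participant(self, p):
        assert self.state == "I"
        self.P.add(p)
        self.send("start", {p})

    def abort(self):
        assert self.state == "I"
        self.send("aborted", set(self.P))
        self.state = "A"
        self._finish_if_empty()

    def commit(self):
        assert self.state == "I"
        if not self.P:
            self.state, self.result = "U", True
            return
        self.send("prepare", {"c"})        # coordinator hears it first
        self.send("prepare", set(self.P))
        self.state = "P_i"

    def on_event(self, kind, p):
        """kind is 'disconnect', 'aborted', or 'committed' from p."""
        if p not in self.P or kind not in self.HANDLED.get(self.state, ()):
            return
        self.P.discard(p)
        if kind == "aborted":
            self.state = "A"
        elif kind == "committed":
            self.state = "C"
            self.C.add(p)
        self._finish_if_empty()

    def _finish_if_empty(self):
        if self.P:
            return
        if self.state == "P_i":
            self.rebooted = True           # lost every participant
        elif self.state == "A":
            self.result = False
        elif self.state == "C":
            self.send("committed", set(self.C))  # garbage collection
            self.result = True
        self.state = "U"
```

A caller would construct the object with a transport callback, call start( ), add participants, call commit( ), and then feed participant events to on_event( ) until the state returns to U.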

C. Coordinator States

FIG. 6 is a state diagram illustrating an unknown state U and a prepare state P_j of the coordinator c. The coordinator c begins and ends the transaction in the unknown state U. In the unknown state U, the coordinator c waits to receive a “prepare” message from the initiator i (for example, prepare(i)). The “prepare” message informs the coordinator c that the initiator i has started the transaction. If the coordinator c is connected to all of the participants in P when it receives the prepare(i) message, the coordinator c resets the set of known prepared participants S (for example, S=Ø) and transitions to the prepare state P_j.

If, on the other hand, the coordinator c is disconnected from one or more of the participants when it receives the prepare(i) message, the coordinator c remains in the unknown state U. Thus, the coordinator c quickly aborts the transaction by not transitioning to the prepare state P_j when at least one of the participants p is disconnected. When the other participants in P send “prepared” messages to the coordinator c (for example, prepared(p)), the coordinator c responds with an “aborted” message (for example, aborted(p)).

In the prepare state P_j, the coordinator c tracks the participants that are prepared to commit the transaction. When the coordinator c receives a “prepared” message from a participant p (for example, prepared(p)), the coordinator c adds the participant p to the set of known prepared participants S (for example, S=S∪{p}). Once all of the participants in P have prepared (for example, S=P), the coordinator sends a “commit” message to the participants in P except for the shared participant p_s (for example, commit(P\{p_s})) and transitions back to the unknown state U. As discussed below, the shared participant p_s receives the outcome of the transaction from the other participants. As also discussed below, the participants that receive the commit(P\{p_s}) message may end up ignoring it and aborting the transaction instead. Thus, the initiator i receives “committed” messages from the participants in P rather than from the coordinator c.

While in the prepare state P_j, the coordinator c may detect a disconnect from one of the participants in P (for example, disconnect(p)) or the coordinator c may receive an “aborted” message from one of the participants in P (for example, aborted(p)). In response, the coordinator c sends an “aborted” message to the prepared participants in S (for example, aborted(S)) and transitions back to the unknown state U.

The following exemplary pseudocode illustrates one embodiment of the coordinator c:

in state U:
    on prepared from p:
        send aborted to p
    on prepare(P) from i:
        if connected to all in P:
            set state to (P_j, Ø, P)
        else:
            leave state as U

in state (P_j, S, P):
    if S = P:
        send commit to P \ {p_s}
        set state to U
    on disconnect or aborted from p ∈ P:
        send aborted to S
        set state to U
    on prepared from p ∈ P:
        set state to (P_j, S ∪ {p}, P)
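The coordinator behavior can likewise be sketched in Python. As before, this is a hedged illustration under assumptions: the class name, the send(message, recipients) callback, the connected(p) predicate, and the shared argument naming the shared participant p_s are hypothetical and not part of the described embodiment.

```python
class Coordinator:
    """Sketch of the 2PCV2 coordinator; all names are illustrative."""

    def __init__(self, send, connected, shared=None):
        self.send = send            # hypothetical transport callback
        self.connected = connected  # hypothetical: connected(p) -> bool
        self.shared = shared        # the shared participant p_s, if any
        self.state = "U"
        self.S = set()              # participants known to be prepared
        self.P = set()              # participants in the transaction

    def on_prepare(self, P):
        # Enter P_j only if every participant is reachable. Otherwise
        # stay in U, so that later "prepared" messages are answered
        # with "aborted".
        if self.state == "U" and all(self.connected(p) for p in P):
            self.state, self.S, self.P = "P_j", set(), set(P)

    def on_prepared(self, p):
        if self.state == "U":
            self.send("aborted", {p})
            return
        if p not in self.P:
            return
        self.S.add(p)
        if self.S == self.P:
            # The shared participant learns the outcome from its peers,
            # so it is excluded from the "commit" recipients.
            self.send("commit", self.P - {self.shared})
            self.state = "U"

    def on_disconnect_or_aborted(self, p):
        if self.state == "P_j" and p in self.P:
            self.send("aborted", set(self.S))
            self.state = "U"
```

Because the coordinator keeps no state once it returns to U, a participant that reports “prepared” after the decision simply receives “aborted” in this sketch, which mirrors the pseudocode for the unknown state U.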

D. Remote Participant States

FIG. 7 is a state diagram illustrating an unknown state U, an initial state I, a first prepared state Pc, a second prepared state Pp, a first committed state Ci, and a second committed state Cp of the remote participant p.

1. The Unknown State U

The participant p is in the unknown state U before and after the transaction. In the unknown state U, the participant p may receive a “committed′” message from another participant p′ (for example, committed′(p′)). The “committed′” message from another participant p′ indicates that the other participant p′ has committed to the transaction but is waiting to find out the status of the participant p before cleaning its log. Since the participant p is already in the unknown state with a clean log, it determines that it committed to the transaction and sends a “committed” message to the other participant p′ (for example, committed(p′)).

In the unknown state U, the participant p may receive a “prepared” message from another participant p′ (for example, prepared(p′)). As discussed in detail below, the participant p would not have cleaned its log and transitioned to the unknown state U unless it had received “committed” messages from all of the participants. However, the “prepared” message from the other participant p′ indicates that the other participant p′ has not committed the transaction. Thus, the participant p determines that the transaction was aborted and sends an “aborted” message to the other participant p′ (for example, aborted(p′)).

In one embodiment, the participant p receives a start message from the initiator i (for example, start(i)) to signal the beginning of the transaction. In response, the participant p transitions to the initial state I. In other embodiments, the initiator i does not send a start message to the participant p. Instead, the participant p transitions to the initial state I when it receives any message referencing the transaction. In such embodiments, messages in the transaction are no longer delivered to the participant p once the transaction is aborted to prevent the participant p from starting the aborted transaction.

2. The Initial State I

In the initial state I, the participant p performs the operations associated with the transaction. In one embodiment, the initiator i sends one or more requests to the participant p to perform tasks for the transaction. In a distributed file system, for example, the initiator i may send requests to the participant p to read data blocks, allocate space for data blocks, write data blocks, calculate parity data, store parity data, send messages to another participant, combinations of the foregoing, or the like. If the participant p has an error while performing the transaction, becomes disconnected from the initiator i (for example, disconnect(i)), or receives an “aborted” message from the initiator i (for example, aborted(i)), the participant p aborts the transaction and sends an “aborted” message to the coordinator c, the initiator i, and the participants in the set of known prepared participants S from which it has received “prepared” messages (for example, aborted(c, i, S)). The participant p then transitions back to the unknown state U.

While in the initial state I, the participant p may receive a “prepared” message from another participant p′ (for example, prepared(p′)). For example, another participant p′ may send the prepared(p′) message to the participants in P if it received a “prepare” message from the initiator i and then disconnected from the coordinator c. In response to receiving the prepared(p′) message, the participant p adds the other participant p′ to the set of known prepared participants S (for example, S=S∪{p′}) for use in the second prepared state Pp.

As discussed above, the initiator i can add participants to P as the transaction is being executed. After the participants in P have performed the operations associated with the transaction, the initiator i sends a “prepare” message (for example, prepare(i)) to the participants in P. The prepare(i) message includes the final set of participants in P. If the participant p has not transitioned back to the unknown state U, the participant p responds to the prepare(i) message by logging the prepare, sending a “prepared” message to the coordinator c (for example, prepared(c)) and transitioning to the first prepared state Pc.

Although not shown, in other embodiments the participant p only transitions to the first prepared state Pc from the initial state I if S=Ø. In such embodiments, if S≠Ø, the participant p may transition directly to the second prepared state Pp.

3. The First Prepared State Pc

In the first prepared state Pc, the participant p awaits the outcome of the transaction. The coordinator c may notify the participant p of the outcome by sending a “commit” or “aborted” message to the participant p (for example, commit(c) or aborted(c)). In response to commit(c), the participant p sends a “committed” message to the initiator i and to the shared participant p_s (for example, committed(i, p_s)). Thus, as discussed in detail below, the shared participant p_s is notified of the outcome of the transaction. The participant p then transitions to the first committed state Ci. In response to the “aborted” message from the coordinator, the participant p sends an “aborted” message to the coordinator c, the initiator i, and the participants in S. The participant p then transitions back to the unknown state U.

Rather than receiving notice of the transaction's outcome from the coordinator c, another participant p′ may notify the participant p of the outcome by sending a committed′(p′) message. In response to the committed′(p′) message, the participant p adds the other participant p′ to the set of known committed participants C′ (for example, C′={p′}), sends a “committed” message to the initiator i, the shared participant p_s, and the other participant p′ (for example, committed(i, p_s, p′)), and transitions to the first committed state Ci.

In the first prepared state Pc, the participant p may receive a prepared(p′) message from another participant p′. In response, the participant p adds the other participant p′ to the set of known prepared participants S (for example, S=S∪{p′}), allowing the participant p to track the participants from which it has received a “prepared” message if it transitions to the second prepared state Pp. In the first prepared state Pc, the participant p may detect that it has become disconnected from the coordinator c (for example, disconnect(c)). In response, the participant p sends a “prepared” message to all participants in P (for example, prepared(P)) and transitions to the second prepared state Pp.

4. The Second Prepared State Pp

In one embodiment, the second prepared state Pp is used when the participant p loses its connection with the coordinator c. As noted above, only the participants can notify the shared participant p_s of the outcome of the transaction. Participants in the second prepared state Pp are not committed. Thus, once the participant p knows that all the participants in P except for the shared participant p_s are in the second prepared state Pp (for example, S=P\{p_s}), the participant p knows that the shared participant p_s is not committed. The participant p can then abort the transaction, send an “aborted” message to the coordinator c and the initiator i (for example, aborted(c, i)), and transition back to the unknown state U. Thus, once all of the non-shared participants in P are disconnected from the coordinator c, the non-shared participants resolve the transaction by aborting without further instructions from the initiator i or coordinator c.

In the second prepared state Pp, the participant p may receive a “committed′” or “aborted” message from another participant p′ (for example, committed′(p′) or aborted(p′)). In response to receiving the committed′(p′) message, the participant p adds the other participant p′ to the set of known committed participants C′ (for example, C′={p′}) and sends a “committed” message to the initiator i and the other participant p′ (for example, committed(i, p′)). To find out which of the other participants in P have also committed, the participant p sends a committed′ message to the remaining participants in P (for example, committed′(P\{p′})). The participant p then transitions to the second committed state Cp.

Sending a “prepared” message is how the participant p asks for the outcome of the transaction. If the other participant p′ has aborted the transaction, the other participant p′ sends the aborted(p′) message to the participant p in response to a “prepared” message (not shown) from the participant p. In response to receiving the aborted(p′) message, the participant p aborts the transaction, sends an “aborted” message to the coordinator c and the initiator i (for example, aborted(c, i)), and transitions to the unknown state U.

In the second prepared state Pp, the participant p may detect a connection to another participant p′ (for example, connect(p′)). In response, the participant p sends a “prepared” message to the other participant p′ (for example, prepared(p′)). When the participant p and the other participant p′ connect, the other participant p′ also sends a “prepared” message to the participant p. When the participant p receives a “prepared” message from another participant p′ other than the shared participant p_s (for example, prepared(p′∈P\{p_s})), the participant p adds the other participant p′ to the set of known prepared participants S (for example, S=S∪{p′}).

5. Garbage Collection and Restart

The participant p records messages sent or received during the commit protocol in a log. The participant p can provide the information in the log to another participant p′ that may not have received one or more messages sent, for example, when the other participant p′ was disconnected. The participant p can also use the information when the participant p restarts after a disconnect or failure to determine the outcome of the transaction.

In one embodiment, when the participant p restarts after a failure, the participant p checks its log for a prepare block, a done block, a commit block, or a combination of the foregoing. If the log does not have a prepare block, the participant p restarts in the unknown state U. The participant p also restarts in the unknown state U if the log has a done block. If the log has a prepare block, but no commit block or done block, the participant p restarts in the second prepared state Pp. If the log has a prepare block and a commit block, but no done block, the participant p restarts in the second committed state Cp.
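The restart rules above amount to a small decision function. The following Python sketch is one possible reading; the function name `restart_state` and the representation of the log as a collection of block-type strings are assumptions made for illustration only.

```python
def restart_state(log_blocks):
    """Return the state a participant restarts in, given its log's block types.

    `log_blocks` is any iterable of strings drawn from "prepare", "commit",
    and "done" (a hypothetical encoding of the log, not from the disclosure).
    """
    blocks = set(log_blocks)
    if "prepare" not in blocks or "done" in blocks:
        return "U"   # never prepared, or the transaction is fully finished
    if "commit" in blocks:
        return "Cp"  # committed, but the log has not been cleaned yet
    return "Pp"      # prepared, outcome still unknown
```

A participant restarting in Pp or Cp then learns the outcome, or confirms it, from the other participants as described for those states.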

The first committed state Ci and the second committed state Cp are garbage collection states. In these states, the participant p has already committed the transaction. However, the participant p waits to clear its log until it is sure that the information stored therein will not be needed. The set of known committed participants C′ includes the participants that the participant p knows have also committed the transaction. When the participant p receives a committed′ message from another participant p′ (for example, committed′(p′)), the participant p adds the other participant p′ to the set of known committed participants C′ (for example, C′=C′∪{p′}) and sends a “committed” message to the other participant p′ (for example, committed(p′)).

In the first committed state Ci, the participant p waits to receive a “committed” message from the initiator i that includes a set T of participants that the initiator i knows have committed (for example, committed(i)(T)). In response to receiving the committed(i)(T) message, the participant p adds the participants in T to C′ (for example, C′=C′∪T). If C′∪T is not all of the participants in P, the participant sends a committed′ message to query the participants it does not know have committed (for example, committed′(P\C′)). The participant p also sends the committed′(P\C′) message if it detects a disconnect from the initiator i. The participant p then transitions to the second committed state Cp.

In the second committed state Cp, the participant p may receive a “committed” message from another participant p′ (for example, committed(p′)). In response, the participant p adds the other participant p′ to the set of known committed participants C′ (for example, C′=C′∪{p′}). The participant p may also detect a connection to another participant p′ that is not included in the set of known committed participants C′ (for example, connect(p′∈P\C′)). In response, the participant p queries whether the other participant p′ has committed by sending it a committed′ message (for example, committed′(p′)). When C′=P, the participant p can clean its log and transition to the unknown state U.
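The bookkeeping for the set C′ in the second committed state Cp can be sketched as follows. The class `CommitTracker` and its method names are hypothetical. Whether the participant p counts itself in C′ is not spelled out in this passage, so the sketch simply compares the two sets as given.

```python
class CommitTracker:
    """Tracks C′, the participants known to have committed (illustrative only)."""

    def __init__(self, participants, known=()):
        self.P = set(participants)
        self.C = set(known) & self.P

    def on_committed(self, p):
        # "committed" or committed′ received from p: record p as committed.
        if p in self.P:
            self.C.add(p)

    def to_query(self, connected):
        # Peers to send committed′ to: reachable and not yet known committed.
        return (self.P - self.C) & set(connected)

    def can_clean_log(self):
        # The log may be cleaned once every participant in P is in C′.
        return self.C == self.P
```

In this model the participant would log "done" and return to the unknown state U as soon as `can_clean_log()` becomes true.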

The following exemplary pseudocode illustrates one embodiment of the participant p:

function abort(S):

log abort

send aborted to S ∪ {i, c}

set state to U

function commit_i(C′, P):

log commit

send committed to {i, p_s}

set state to (Ci, C′, P)

function commit_p(C′, P):

send committed′ to P \ C′

set state to (Cp, C′, P)

in state U:

on committed′ from p′:
send committed to p′

on prepared from p′:
send aborted to p′

on start from i:
set state to (I, Ø)

in state (I, S):

on disconnect from i:
abort(S)

on local failure:
abort(S)

on aborted from i:
abort(S)

on prepared from p′:
set state to (I, S ∪ {p′})

on prepare(P) from i:
log prepare

send prepared to c

set state to (Pc, S, P)

in state (Pc, S, P):

on disconnect from c:
send prepared to P

set state to (Pp, S, P)

on aborted from c:
abort(S)

on commit from c:
commit_i(Ø, P)

on committed′ from p′ ∈ P:
commit_i({p′}, P)

send committed to p′

on prepared from p′ ∈ P:
set state to (Pc, S ∪ {p′}, P)

in state (Pp, S, P):

if S = P \ {p_s}:
abort(Ø)

on connect to p′ ∈ P:
send prepared to p′

on aborted from p′ ∈ P:
abort(Ø)

on committed′ from p′ ∈ P:
log commit

commit_p({p′}, P)

send committed to {i, p′}

on prepared from p′ ∈ P \ {p_s}:
set state to (Pp, S ∪ {p′}, P)

in state (Ci, C′, P):

on committed(T) from i:
commit_p(C′ ∪ T, P)

on disconnect from i:
commit_p(C′, P)

on committed′ from p′ ∈ P:
set state to (Ci, C′ ∪ {p′}, P)

send committed to p′

in state (Cp, C′, P):

if C′ = P:
log done

set state to U

on connect to p′ ∈ P \ C′:
send committed′ to p′

on committed from p′ ∈ P:
set state to (Cp, C′ ∪ {p′}, P)

on committed′ from p′ ∈ P:
set state to (Cp, C′ ∪ {p′}, P)

send committed to p′

E. The Shared Participant

As discussed above, the transaction may include a shared participant p_s. The shared participant p_s is on the same node as the coordinator c and the initiator i. The coordinator c does not send a “commit” message to the shared participant p_s. Instead, the other participants in P inform the shared participant p_s that the transaction is committed.

FIG. 8 is a state diagram illustrating the states of the shared participant p_s according to one embodiment. The shared participant p_s operates similarly to a remote participant p and can be in the unknown state U, the initial state I, the second prepared state Pp, the first committed state Ci, and the second committed state Cp as discussed above in relation to FIG. 7. However, since the shared participant p_s does not receive a “commit” message from the coordinator c, the shared participant p_s does not enter the first prepared state Pc. Rather, when the shared participant p_s receives the prepare(i) message from the initiator i, the shared participant p_s transitions directly to the second prepared state Pp.

Since the shared participant p_s does not enter the first prepared state Pc, the shared participant p_s transitions to the first committed state Ci directly from the second prepared state Pp. Thus, upon receiving the committed′(p′) message while in the second prepared state Pp, the shared participant p_s transitions to the first committed state Ci. In the second prepared state Pp, the shared participant p_s may learn of the outcome of the transaction by receiving a “committed” message from another participant p′ (for example, committed(p′)). In response, the shared participant p_s adds the other participant p′ to the set of known committed participants C′ (for example, C′={p′}), sends a “committed” message to the initiator i, and transitions to the first committed state Ci.

Like the remote participant p, upon detecting connect(p′), the shared participant p_s asks the other participant p′ for the outcome of the transaction by sending a prepared(p′) message. As discussed above, if the other participant p′ has not resolved the transaction, it will ignore the prepared(p′) message from the shared participant p_s. If the other participant p′ has aborted the transaction, it will send aborted(p′) to the shared participant p_s.

An artisan will recognize from the disclosure herein that there are other differences between the shared participant p_s and the remote participant p discussed above. For example, since the shared participant p_s is on the same node as the coordinator c and the initiator i, the shared participant p_s will not detect disconnect(i). Thus, the shared participant p_s does not respond to disconnect(i) in, for example, the unknown state U or the first committed state Ci.

In one embodiment, the shared participant p_s restarts as discussed above in relation to the remote participant p.

The following exemplary pseudocode illustrates one embodiment of the shared participant p_s:

function abort( ):

log abort

send aborted to {i, c}

set state to U

function commit_i(C′, P):

log commit

send committed to i

set state to (Ci, C′, P)

function commit_p(C′, P):

send committed′ to P \ C′

set state to (Cp, C′, P)

in state U:

on committed′ from p′:
send committed to p′

on prepared from p′:
send aborted to p′

on start from i:
set state to (I, Ø)

in state (I, S):

on local failure:
abort( )

on aborted from i:
abort( )

on prepared from p′:
set state to (I, S ∪ {p′})

on prepare(P) from i:
log prepare

send prepared to c

set state to (Pp, S, P)

in state (Pp, S, P):

if S = P \ {p_s}:
abort( )

on connect to p′ ∈ P:
send prepared to p′

on aborted from p′ ∈ P:
abort( )

on committed from p′ ∈ P:
commit_i({p′}, P)

on committed′ from p′ ∈ P:
commit_i({p′}, P)

send committed to p′

on prepared from p′ ∈ P:
set state to (Pp, S ∪ {p′}, P)

in state (Ci, C′, P):

on committed(T) from i:
commit_p(C′ ∪ T, P)

on committed′ from p′ ∈ P:
set state to (Ci, C′ ∪ {p′}, P)

send committed to p′

in state (Cp, C′, P):

if C′ = P:
log done

set state to U

on connect to p′ ∈ P:
send committed′ to p′

on committed from p′ ∈ P:
set state to (Cp, C′ ∪ {p′}, P)

on committed′ from p′ ∈ P:
set state to (Cp, C′ ∪ {p′}, P)

send committed to p′

IV. 2.5-Phase Commit Protocol

While the MM2PC and the 2PCV2 protocols provide single-failure non-blocking commitment protocols, it may be useful to provide for double-failure tolerance. The 2.5 Phase Commit (“2.5PC”) protocol provides a double-failure non-blocking atomic commitment protocol. The 2.5PC protocol includes an initiator i, a coordinator c, a distributor d, as well as a set of participants P={p₁, p₂, . . . , p_n}. In the 2.5PC protocol, each party is located on a different node from the other parties. It is recognized, however, that the 2.5PC protocol may be implemented such that two parties share a node (for example, the coordinator c shares a machine with one participant p₂), but such implementations would only provide single-failure tolerance.

A. 2.5PC Protocol Exemplary Timing Chart

FIG. 9 illustrates an exemplary timing chart according to one embodiment of a 2.5PC protocol 900 for a transaction involving an initiator 910 (shown as “i”), a first participant 912 (shown as “p₁”), a second participant 914 (shown as “p₂”), a coordinator 916 (shown as “c”), and a distributor 918 (shown as “d”). The coordinator 916 and the distributor 918 are on separate nodes. If the coordinator 916 and the distributor 918 do not share a node with the first participant 912 or the second participant 914, then the commit protocol 900 allows for double-failure non-blocking.

The initiator 910 sends “prepare” messages 920 (two shown) to the first participant 912 and the second participant 914. The first participant 912 and the second participant 914 log their respective “prepare” messages 920 and determine whether they are prepared to commit the transaction. If the first participant 912 can commit the transaction, the first participant 912 sends a “prepared” message 922 to the coordinator 916. If the second participant 914 can commit the transaction, the second participant 914 sends a “prepared” message 922 to the coordinator 916. If the coordinator receives both of the “prepared” messages 922, the coordinator 916 sends a “commit” message 924 to the distributor 918.

If the coordinator 916 and one of the participants 912, 914 were to fail (for example, a double-failure) after the coordinator 916 sends the “commit” message 924, the distributor 918 knows the coordinator's 916 decision and can resolve the transaction. Thus, the protocol 900 is double-failure non-blocking. In response to the “commit” message 924 from the coordinator 916, the distributor 918 sends “commit” messages 926 (three shown) to the first participant 912, the second participant 914, and the coordinator 916.

After receiving the “commit” messages 926 from the distributor 918, the first participant 912 and the second participant 914 respectively log the “commits” and send “committed” messages 928 (six shown) to each other, to the coordinator 916, and to the initiator 910. Upon receiving a “committed” message 928, the first participant 912 and the second participant 914 clear their respective logs and the 2.5PC protocol 900 ends.
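The failure-free message flow of FIG. 9 can be tallied with a short sketch. The function name `failure_free_messages` and the tuple layout `(message, sender, recipient)` are illustrative assumptions, and the sketch only enumerates messages; it does not model logging or failures.

```python
def failure_free_messages(participants, initiator="i", coordinator="c",
                          distributor="d"):
    """List the messages of a failure-free 2.5PC run as (msg, sender, recipient)."""
    participants = list(participants)
    msgs = []
    # The initiator asks every participant to prepare.
    for p in participants:
        msgs.append(("prepare", initiator, p))
    # Each participant reports "prepared" to the coordinator.
    for p in participants:
        msgs.append(("prepared", p, coordinator))
    # The coordinator tells the distributor to commit.
    msgs.append(("commit", coordinator, distributor))
    # The distributor fans "commit" out to the participants and the coordinator.
    for r in participants + [coordinator]:
        msgs.append(("commit", distributor, r))
    # Each participant announces "committed" to its peers, c, and i.
    for p in participants:
        peers = [q for q in participants if q != p]
        for r in peers + [coordinator, initiator]:
            msgs.append(("committed", p, r))
    return msgs
```

For two participants this yields two “prepare”, two “prepared”, one plus three “commit”, and six “committed” messages, matching the counts shown in the timing chart.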

The exemplary timing chart shown in FIG. 9 illustrates the 2.5PC protocol 900 when no failures occur. However, if one or more of the participants 912, 914 fails or disconnects, the coordinator 916 aborts the transaction and informs the distributor 918. The distributor 918 then informs the remaining participants 912, 914.

If the coordinator 916 fails or disconnects before informing the distributor 918 of its decision, the distributor 918 aborts because it does not know if all the participants 912, 914 prepared successfully. However, the coordinator 916 can also send “abort” or “commit” messages to the participants 912, 914. Therefore, as discussed in detail below, when the coordinator 916 is disconnected from the distributor 918, the participants 912, 914 decide whether to accept “commit” or “abort” messages from the coordinator 916 or the distributor 918. If the participants 912, 914 decide to accept the decision of the distributor 918, the distributor sends an “abort” message to the participants 912, 914.

If the coordinator 916 loses its connection with the distributor 918 before sending the “commit” message 924, the coordinator 916 aborts. Since the distributor also aborts, the coordinator 916 sends “abort” messages to the participants 912, 914 without waiting for the participants to decide whether to accept the decision of the coordinator 916.

If, on the other hand, the coordinator 916 loses its connection to the distributor 918 after sending the “commit” message 924, the coordinator 916 is still committed. However, the coordinator 916 does not know whether the distributor 918 received the “commit” message 924. If the distributor 918 did receive the “commit” message 924, it may have sent the “commit” messages 926 to one or more of the participants 912, 914. If the distributor 918 did not receive the “commit” message 924, the distributor 918 may abort the transaction when the participants 912, 914 decide to accept the distributor's 918 decision. Thus, the coordinator 916 waits for the participants 912, 914 to decide to accept its decision before committing the transaction.

The participants 912, 914 vote to determine whether to accept the decision (for example, commit or abort) of the coordinator 916 or the distributor 918. For example, if the coordinator 916 receives a majority of the votes, it will send its decision to the participants 912, 914. If, on the other hand, the distributor 918 receives the majority of votes, it will send its decision to the participants 912, 914. The participants 912, 914 will vote for the coordinator 916 if they lose their respective connections to the distributor 918. The participants 912, 914 will vote for the distributor 918 if they lose their respective connections with the coordinator 916. Otherwise, the participants 912, 914 will vote for the first party (for example, either the coordinator 916 or the distributor 918) to ask for their votes. In one embodiment, only the distributor 918 asks for votes to avoid a split vote.
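The voting rule described above can be written as a small function. This is a sketch under stated assumptions: the name `choose_vote`, the boolean connection flags, and the `first_asker` argument are hypothetical, and a return value of `None` stands for "no vote cast".

```python
def choose_vote(connected_to_c, connected_to_d, first_asker=None):
    """Return "c", "d", or None for one participant's vote.

    first_asker is "c" or "d" if one of them has already asked for this
    participant's vote, else None (an assumed encoding for illustration).
    """
    if connected_to_c and not connected_to_d:
        return "c"          # lost the distributor: vote for the coordinator
    if connected_to_d and not connected_to_c:
        return "d"          # lost the coordinator: vote for the distributor
    if connected_to_c and connected_to_d:
        return first_asker  # both reachable: vote for whoever asks first
    return None             # lost both: participants resolve among themselves
```

When only the distributor solicits votes, `first_asker` can only ever be "d", which is how that embodiment avoids a split vote.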

If one or more of the participants 912, 914 are disconnected from the coordinator 916, the distributor 918, or both, neither the coordinator 916 nor the distributor 918 may receive the majority of the votes. Thus, the participants 912, 914 send their respective votes to both the coordinator 916 and the distributor 918. When either the coordinator 916 or the distributor 918 realizes that it cannot receive the majority of votes, it bows out of the election and notifies the participants 912, 914.

If both the participants 912, 914 lose their connections with both the coordinator 916 and the distributor 918, the participants 912, 914 deterministically resolve the transaction among themselves as discussed above.

FIGS. 10-12D illustrate state diagrams according to one embodiment of a 2.5PC protocol. Parties in a transaction using the 2.5PC protocol include a coordinator c, a distributor d, one or more participants p selected from the set defined by {p₁, p₂, . . . , p_n}, and an initiator i.

B. Coordinator States

FIG. 10 is a state diagram illustrating an initial state I, a commit state C and a final state F of the coordinator c during execution of the commit protocol. The coordinator c can be in a state “s_c” defined by:

s_c ∈ {(I, S) | S ⊂ P}
∪ {(C, S_for, S_against) | S_for, S_against ⊂ P; S_for ∩ S_against = Ø}
∪ {F}

wherein P is a set of participants defined by P={p₁, p₂, . . . , p_n}. The variable S is a proper subset of the participants in P for which the coordinator c has received “prepared” messages. In the commit state C, the coordinator c keeps two mutually exclusive proper subsets S_for and S_against of the participants in P. The variable S_for includes participants that vote for the coordinator c and the variable S_against includes participants that vote for the distributor d.

1. The Initial State I

As illustrated in FIG. 10, the coordinator c starts in the initial state I. In the initial state I, the coordinator c may receive a “prepared” message from one of the participants p (for example, prepared(p)). In response, the coordinator c adds the participant p to the set of known prepared participants S (for example, S=S∪{p}). Once S=P, the coordinator c sends a “commit” message to the distributor d (for example, commit(d)) and transitions to the commit state C.

While in the initial state I, the coordinator c may detect a disconnect from one of the participants p (for example, disconnect(p)), or may receive an “abort” message from the initiator i (for example, abort(i)), an “aborted” message from the participant p (for example, aborted(p)), or a “pledged” message from one of the participants p (for example, pledged(p)). In response, the coordinator c aborts the transaction and sends an “abort” message to the participants in P and the distributor d (for example, abort(P, d)). The coordinator c then transitions to the final state F.

The “pledged” message from one of the participants p may be a vote from the participant p for the coordinator c or the distributor d. Either way, the coordinator c knows that the “pledged” message is in response to a “pledge” message (discussed below) from the distributor d in the event of a failure. Thus, the coordinator c aborts.

2. The Commit State C

In the commit state C, the coordinator c expects the transaction to be committed but waits in the commit state C in case the distributor d fails and the participants in P need the coordinator c to resolve the transaction. While in the commit state C, the coordinator c may receive the “pledged” message from one of the participants p. As discussed above, the coordinator c adds the participant p to the set of participants voting for the coordinator S_for if the participant p pledges its vote to the coordinator c (for example, pledged(p)(c)). Once |S_for|>⌊|P|/2⌋, the coordinator c commits the transaction and sends a “commit” message to the participants in P and the distributor d (for example, commit(P, d)). The coordinator c then transitions to the final state F.

If the participant p pledges its vote to the distributor d (for example, pledged(p)(d)), the coordinator c adds the participant p to the set of participants voting for the distributor S_against. The coordinator c may also detect that it has disconnected from one of the participants (for example, disconnect(p)). If the participant p is not in S_for or S_against, in one embodiment, the coordinator c adds the participant p to S_against. If |S_against|≥⌈|P|/2⌉, the coordinator c revokes its participation in the election and notifies the participants in P (for example, revoke(P)). The coordinator c then transitions to the final state F.
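The two election thresholds for the coordinator c in the commit state C can be checked with a short sketch. The function name `coordinator_election_step` and its string return values are assumptions for illustration; the thresholds follow the prose above (a strict majority for the coordinator commits, at least half against it revokes).

```python
def coordinator_election_step(num_participants, s_for, s_against):
    """Return "commit", "revoke", or None for the coordinator in state C.

    s_for and s_against are disjoint sets of participants that voted for the
    coordinator and for the distributor (or disconnected), respectively.
    """
    floor_half = num_participants // 2
    ceil_half = (num_participants + 1) // 2  # integer ceiling of n / 2
    if len(s_for) > floor_half:
        return "commit"   # strict majority pledged to the coordinator
    if len(s_against) >= ceil_half:
        return "revoke"   # the coordinator can no longer win a majority
    return None           # keep waiting for more votes
```

With four participants, for example, three votes for the coordinator commit, while two votes against are already enough to make the coordinator revoke.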

In the commit state C, the coordinator c may receive the “aborted” message or a “committed” message from one of the participants p (for example, aborted(p) or committed(p)). In response to the “aborted” message, the coordinator c aborts the transaction, sends the “abort” message to all of the participants in P and the distributor d (for example, abort(P, d)) and transitions to the final state F. In response to the “committed” message, the coordinator c commits the transaction, sends a “commit” message to the participants in P and the distributor d (for example, commit(P, d)), and transitions to the final state F.

The following exemplary pseudocode illustrates one embodiment of the coordinator c:

function abort( ):

send abort to P ∪ {d}

set state to F

function commit( ):

send commit to P ∪ {d}

set state to F

function revoke( ):

send revoke to P

set state to F

in state (I, S):

on disconnect from p ∈ P:
abort( )

on pledged(c) from p ∈ P:
abort( )

on pledged(d) from p ∈ P:
abort( )

on abort from i:
abort( )

on aborted from p ∈ P:
abort( )

on prepared from p ∈ P:

if S ∪ {p} = P:
send commit to d

set state to (C, Ø, Ø)

else:
set state to (I, S ∪ {p})

in state (C, S_for, S_against):

on disconnect from p ∈ P \ S_for:
set state to (C, S_for, S_against ∪ {p})

on pledged(c) from p ∈ P:
set state to (C, S_for ∪ {p}, S_against)

on pledged(d) from p ∈ P:
set state to (C, S_for, S_against ∪ {p})

if |S_for| > ⌊|P|/2⌋:
commit( )

if |S_against| ≥ ⌈|P|/2⌉:
revoke( )

on aborted from p ∈ P:
abort( )

on committed from p ∈ P:
commit( )

on start:
set state to (I, Ø)

It is recognized that not all error cases are shown in the above pseudocode. In the embodiments discussed above, non-handled messages are ignored. For example, the above pseudocode does not address a failure of the connection between the coordinator c and the distributor d. If the connection goes down, the distributor d starts seeking pledges and the coordinator c starts receiving “pledged” messages or “aborted” messages from the participants p (for example, pledged(p) or aborted(p)). Further, the above pseudocode does not have a restart procedure for the coordinator c. If the coordinator c fails, the participants ignore it. When the coordinator c restarts, it has no knowledge of the transaction, which does not affect the participants. If the coordinator c then aborts, it does not inform the distributor d; the distributor d is instead notified of the abort by the participants.

C. Distributor States

FIG. 11 is a state diagram illustrating an initial state I, an abort state A, and a final state F of the distributor d during execution of the commit protocol. The distributor d can be in a state “s_d” defined by:

s_d ∈ {(r, S_for, S_against) | r ∈ {I, A}; S_for, S_against ⊂ P; S_for ∩ S_against = Ø} ∪ {F}

wherein the distributor d adds participants that vote for the distributor d to the set of participants voting for the distributor S_for and adds participants that vote for the coordinator c to the set of participants voting for the coordinator S_against.

1. The Initial State I

The distributor d starts in the initial state I where it can detect a disconnect from a participant p (for example, disconnect(p)) or receive “pledged” messages from the participant p for the coordinator c or the distributor d (for example, pledged(p)(c) or pledged(p)(d)). In response, the distributor d adds the participant p to the corresponding set S_for or S_against, as described above.

If the distributor d detects a disconnect from the coordinator c (for example, disconnect(c)) while in the initial state I, the distributor d checks whether the number of votes for the coordinator c is less than a majority and then requests votes from the participants in P that have not yet voted by sending them a “pledge” message (for example, pledge(P\(S_for∪S_against))). The distributor d then transitions to the abort state A where it tries to abort the transaction.

If the distributor d receives an “abort” message from the initiator i or the coordinator c (for example, abort(i, c)) or an “aborted” message from one of the participants p (for example, aborted(p)), the distributor d aborts the transaction. The distributor d then sends an “abort” message to the participants in P (for example, abort(P)) and transitions to the final state F. If, on the other hand, the distributor d receives a “commit” message from the coordinator c (for example, commit(c)) or a “committed” message from one of the participants p (for example, committed(p)), the distributor d commits the transaction. The distributor d then sends a “commit” message to the participants in P (for example, commit(P)) and transitions to the final state F.

2. The Abort State A

In the abort state A, the distributor d tries to get enough votes to abort the transaction. Upon detecting a disconnection from one of the participants p that has not voted for the distributor (for example, disconnect(p ∈ P\S_for)), the distributor d adds the participant p to the set of participants voting for the coordinator S_against. The distributor d may also receive pledged messages from the participant p for the coordinator c or the distributor d (for example, pledged(p)(c) or pledged(p)(d)). In response, the distributor d adds the participant p to the corresponding set S_for or S_against, as described above. Once |S_against| ≥ ⌈|P|/2⌉, the distributor d revokes its participation in the election and notifies the participants in P (for example, revoke(P)). The distributor d then transitions to the final state F. Once |S_for| > ⌊|P|/2⌋, the distributor d aborts the transaction and sends an “abort” message to the participants (for example, abort(P)). The distributor d then transitions to the final state F.
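To make the two thresholds concrete, the following Python sketch evaluates them for several sizes of P. The helper names can_abort and must_revoke are hypothetical. The loop checks by enumeration that, once every participant has voted, exactly one of the two conditions holds, so under that assumption the distributor either aborts or revokes but never both.

```python
import math


def can_abort(votes_for, n):
    # |S_for| > floor(|P|/2): a strict majority has pledged to the distributor.
    return votes_for > n // 2


def must_revoke(votes_against, n):
    # |S_against| >= ceil(|P|/2): the distributor can no longer reach a majority.
    return votes_against >= math.ceil(n / 2)


for n in range(1, 8):
    for votes_for in range(n + 1):
        votes_against = n - votes_for   # assume every participant has voted
        assert can_abort(votes_for, n) != must_revoke(votes_against, n)
```

For example, with four participants a 2-2 split satisfies the revoke condition, while a 3-1 split in favor of the distributor satisfies the abort condition.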

If the distributor d receives the “aborted” message from one of the participants p (for example, aborted(p)) while in the abort state A, the distributor d aborts the transaction, sends the “abort” message to all of the participants (for example, abort(P)) and transitions to the final state F. If the distributor d receives the “committed” message from one of the participants p (for example, committed(p)) while in the abort state A, the distributor d commits the transaction, sends the “commit” message to all of the participants (for example, commit(P)) and transitions to the final state F. Like the coordinator c, the distributor d does not have a restart procedure. If the distributor d fails, the participants in P will ignore it and continue with the commit protocol.

The following exemplary pseudocode illustrates one embodiment of the distributor d:

function abort( ):
    send abort to P
    set state to F

function commit( ):
    send commit to P
    set state to F

function revoke( ):
    send revoke to P
    set state to F

in state (I, S_for, S_against):
    on disconnect from c:
        if |S_against| < ⌈|P|/2⌉:
            send pledge to P \ (S_for ∪ S_against)
        set state to (A, S_for, S_against)
    on disconnect from p ∈ P:
        set state to (I, S_for, S_against ∪ {p})
    on pledged(c) from p ∈ P:
        set state to (I, S_for, S_against ∪ {p})
    on pledged(d) from p ∈ P:
        set state to (I, S_for ∪ {p}, S_against)
    on abort from i or c:
        abort( )
    on aborted from p ∈ P:
        abort( )
    on commit from c:
        commit( )
    on committed from p ∈ P:
        commit( )

in state (A, S_for, S_against):
    on disconnect from p ∈ P \ S_for:
        set state to (A, S_for, S_against ∪ {p})
    on pledged(c) from p ∈ P:
        set state to (A, S_for, S_against ∪ {p})
    on pledged(d) from p ∈ P:
        set state to (A, S_for ∪ {p}, S_against)
    if |S_for| > ⌊|P|/2⌋:
        abort( )
    if |S_against| ≥ ⌈|P|/2⌉:
        revoke( )
    on aborted from p ∈ P:
        abort( )
    on committed from p ∈ P:
        commit( )

on start:
    set state to (I, Ø, Ø)
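The distributor pseudocode can be exercised with a small Python model. This is a sketch under stated assumptions, not the described implementation: the class name Distributor and its method names are hypothetical, network sends are only recorded in a list, only the first vote from each participant is counted, and a participant disconnect is treated as a vote for the coordinator by calling on_vote with for_distributor=False.

```python
import math


class Distributor:
    """Illustrative model of the distributor states I, A, and F."""

    def __init__(self, participants):
        self.P = frozenset(participants)
        self.state = "I"
        self.s_for = set()
        self.s_against = set()
        self.sent = []   # (message, recipients) pairs standing in for the network

    def _finish(self, message):
        # abort( ), commit( ), and revoke( ) all notify P and move to F.
        self.sent.append((message, self.P))
        self.state = "F"

    def _check_votes(self):
        # Only the abort state A acts on the vote counts.
        if self.state != "A":
            return
        n = len(self.P)
        if len(self.s_for) > n // 2:
            self._finish("abort")
        elif len(self.s_against) >= math.ceil(n / 2):
            self._finish("revoke")

    def on_disconnect_coordinator(self):
        if self.state != "I":
            return
        if len(self.s_against) < math.ceil(len(self.P) / 2):
            undecided = self.P - (self.s_for | self.s_against)
            self.sent.append(("pledge", frozenset(undecided)))
        self.state = "A"
        self._check_votes()

    def on_vote(self, p, for_distributor):
        """Handle pledged(d) (True), or pledged(c) or a participant disconnect (False)."""
        if self.state == "F" or p in self.s_for or p in self.s_against:
            return   # assumption: only the first vote per participant counts
        (self.s_for if for_distributor else self.s_against).add(p)
        self._check_votes()

    def on_decision(self, message):
        """Handle an abort/aborted ("abort") or commit/committed ("commit") notice."""
        if self.state != "F":
            self._finish(message)
```

For instance, with three participants, calling on_disconnect_coordinator() and then on_vote("p1", True) and on_vote("p2", True) leaves the model in state F with an "abort" message recorded for P.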

D. Participant States

FIGS. 12A-12D are state diagrams illustrating an initial state I, a first prepared state Pcd, a second prepared state Pc, a third prepared state Pd, a fourth prepared state Pp, an aborted state A, a committed state C and a final state F for a participant p during execution of the commit protocol. The participant p can be in a state “s_p” defined by:

s_p ∈ {(r, S) | r ∈ {I, Pcd, Pc, Pd, Pp}; S ⊂ P} ∪ {(A, A′) | A′ ⊂ P} ∪ {(C, C′) | C′ ⊂ P} ∪ {F}

wherein P is a set of participants defined by P={p₁, p₂, ..., p_n}. The variable S is a subset of the participants in P from which the participant p has received “prepared” messages. As discussed below, participants in S are in the fourth prepared state Pp.

In the first prepared state Pcd, the participant p has not pledged its vote to the coordinator c or the distributor d, but is prepared and listening to the coordinator or the distributor. In the second prepared state Pc, the participant p has pledged its vote to the coordinator c and is prepared and listening to the coordinator. In the third prepared state Pd, the participant p has pledged its vote to the distributor d and is prepared and listening to the distributor. The participant p transitions to the fourth prepared state Pp from the second prepared state Pc or the third prepared state Pd when it decides to resolve the transaction deterministically without further input from the coordinator c or the distributor d, showing that it is prepared and listening to the other participants.

1. The Initial State I

As illustrated in FIG. 12B, the participant p begins the transaction in the initial state I where it waits for a “prepare” message from the initiator i (for example, prepare(i)). Upon receiving the “prepare” message from the initiator (for example, prepare(i)), the participant p sends a “prepared” message to the coordinator c (for example, prepared(c)) and transitions to the first prepared state Pcd to await an “abort” or “commit” message. If the participant p receives a “prepared” message from another participant p′ (for example, prepared(p′)), the participant p adds the other participant p′ to the set of known prepared participants S (for example, S=S∪{p′}).

In the initial state I, the participant p may receive an “abort” message from the initiator i, the coordinator c, or the distributor d (for example, abort(i, c, d)). The participant p may also receive an “aborted” message from another participant p′ (for example, aborted(p′)) or a “pledge” message from the distributor d (for example, pledge(d)). The “pledge” message from the distributor indicates that the distributor d has lost its connection with the coordinator c. In response to receiving the “abort” message from the initiator, the coordinator, or the distributor (for example, abort(i, c, d)), the “aborted” message from one of the other participants p′ (for example, aborted(p′)), or the “pledge” message from the distributor d (for example, pledge(d)), the participant aborts the transaction and transitions to the aborted state A. Upon aborting the transaction, the participant p sends an “aborted” message to the participants in P, the initiator i, the coordinator c, and the distributor d (for example, aborted(P, i, c, d)).

In the initial state I, the participant p may have an error wherein it cannot commit the transaction (for example, error), or it may detect a disconnect from the initiator i, the coordinator c, or the distributor d (for example, disconnect(i, c, d)). In response, the participant p aborts the transaction, sends the “aborted” message to all of the participants in P, the initiator i, the coordinator c, and the distributor d (for example, aborted(P, i, c, d)) and transitions to the aborted state A.

2. The First Prepared State Pcd

As illustrated in FIG. 12D, in the first prepared state Pcd, the participant p has not pledged its vote to the coordinator c or the distributor d. If the participant p detects a disconnect from the coordinator c (for example, disconnect(c)) or receives a “revoke” message from the coordinator c (for example, revoke(c)), the participant p pledges its vote to the distributor d, sends a “pledged” message for the distributor d to the distributor (for example, pledged(d)(d)), and transitions to the third prepared state Pd. If the participant p receives a “pledge” message from the distributor d (for example, pledge(d)), the participant p pledges its vote to the distributor d, sends a “pledged” message for the distributor d to the coordinator c and the distributor d (for example, pledged(c, d)(d)), and transitions to the third prepared state Pd.

If, while in the first prepared state Pcd, the participant p detects a disconnect from the distributor d (for example, disconnect(d)) or receives a “revoke” message from the distributor d (for example, revoke(d)), the participant p pledges its vote to the coordinator c. The participant p then sends a “pledged” message for the coordinator c (for example, pledged(c)(c)) to the coordinator c, and transitions to the second prepared state Pc.

In the first prepared state Pcd, the participant may receive a “commit” message from the coordinator c or the distributor d (for example, commit(c, d)) or a “committed” message from another participant p′ (for example, committed(p′)). In response, the participant p commits the transaction and sends a “committed” message to the participants in P, the initiator i, the coordinator c, and the distributor d (for example, committed(P, i, c, d)). The participant p then transitions to the committed state C.

In the first prepared state Pcd, the participant p may also receive an “abort” message from the coordinator c or the distributor d (for example, abort(c, d)), or the “aborted” message from another participant p′ (for example, aborted(p′)). In response, the participant p aborts the transaction, sends the “aborted” message to the participants in P, the initiator i, the coordinator c, and the distributor d (for example, aborted(P, i, c, d)), and transitions to the aborted state A.

3. The Second Prepared State Pc and the Third Prepared State Pd

As illustrated in FIG. 12D, in the second prepared state Pc, the participant p has pledged its vote to the coordinator c. In the third prepared state Pd, the participant p has pledged its vote to the distributor d. In the second prepared state Pc or the third prepared state Pd, the participant p may receive the “commit” message from the coordinator c or the distributor d (for example, commit(c, d)) or the “committed” message from another participant p′ (for example, committed(p′)). In response, the participant p commits the transaction, sends the “committed” message to the participants in P, the initiator i, the coordinator c, and the distributor d (for example, committed(P, i, c, d)) and transitions to the committed state C.

In the second prepared state Pc or the third prepared state Pd, the participant p may also receive the “abort” message from the coordinator c or the distributor d (for example, abort(c, d)) or the “aborted” message from another participant p′ (for example, aborted(p′)). In response, the participant p aborts the transaction, sends the “aborted” message to the participants in P, the initiator i, the coordinator c, and the distributor d (for example, aborted(P, i, c, d)) and transitions to the aborted state A.

In the second prepared state Pc, the participant p may detect a disconnect from the coordinator c (for example, disconnect(c)) or receive the “revoke” message from the coordinator c (for example, revoke(c)). In response, the participant p sends a “prepared” message to the participants in P (for example, prepared(P)) and transitions to the fourth prepared state Pp.

In the third prepared state Pd, the participant p may detect a disconnect from the distributor (for example, disconnect(d)) or receive the “revoke” message from the distributor (for example, revoke(d)). In response, the participant p sends the “prepared” message to the participants in P (for example, prepared(P)) and transitions to the fourth prepared state Pp.
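The pledging transitions described for the states Pcd, Pc, and Pd can be summarized as a lookup table. The following Python sketch is illustrative only: the table name PLEDGE_TRANSITIONS, the function step, and the string encoding of events are hypothetical, and commit and abort handling is deliberately left out.

```python
# (state, event) -> (message sent, recipients, next state)
PLEDGE_TRANSITIONS = {
    ("Pcd", "disconnect(c)"): ("pledged(d)", ("d",), "Pd"),
    ("Pcd", "revoke(c)"):     ("pledged(d)", ("d",), "Pd"),
    ("Pcd", "pledge(d)"):     ("pledged(d)", ("c", "d"), "Pd"),
    ("Pcd", "disconnect(d)"): ("pledged(c)", ("c",), "Pc"),
    ("Pcd", "revoke(d)"):     ("pledged(c)", ("c",), "Pc"),
    ("Pc", "disconnect(c)"):  ("prepared", ("P",), "Pp"),
    ("Pc", "revoke(c)"):      ("prepared", ("P",), "Pp"),
    ("Pd", "disconnect(d)"):  ("prepared", ("P",), "Pp"),
    ("Pd", "revoke(d)"):      ("prepared", ("P",), "Pp"),
}


def step(state, event):
    """Return (message, recipients, next_state), or None if the event is not a pledging event."""
    return PLEDGE_TRANSITIONS.get((state, event))
```

Following the table, a participant in Pcd that loses the coordinator moves to Pd, and if it then loses the distributor as well it sends “prepared” to the participants in P and moves to Pp.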

4. The Fourth Prepared State Pp

As illustrated in FIG. 12D, the participant p transitions to the fourth prepared state Pp from the second prepared state Pc or the third prepared state Pd when it decides to resolve the transaction deterministically without further input from the coordinator c or the distributor d. As illustrated in FIG. 12C, in the fourth prepared state Pp, the participant p waits for all of the other participants in P to enter the fourth prepared state Pp (for example, S=P) before committing the transaction. After committing, the participant p sends the “committed” message to the participants in P, the initiator i, the coordinator c, and the distributor d (for example, committed(P, i, c, d)) and transitions to the committed state C.

When the participant p receives the “prepared” message from another participant p′ (for example, prepared(p′)), the participant p adds the other participant p′ to the set of known prepared participants S. When the participant p detects a connection to another participant p′ (for example, connect(p′)), it sends the “prepared” message to the other participant p′ (for example, prepared(p′)) in case the other participant p′ did not receive the “prepared” message when it was disconnected.

In the fourth prepared state Pp, the participant p may receive the “aborted” message from another participant p′ (for example, aborted(p′)). In response, the participant p adds the other participant p′ to a set of known aborted participants A′ (for example, A′=A′∪{p′} or A′={p′}) and sends the “aborted” message to the participants in P, the initiator i, the coordinator c, and the distributor d (for example, aborted(P, i, c, d)). The participant p then transitions to the aborted state A.

The participant p may also receive the “committed” message from another participant p′ (for example, committed(p′)) while in the fourth prepared state Pp. In response, the participant p commits the transaction and adds the other participant p′ to a set of known committed participants C′ (for example, C′=C′∪{p′} or C′={p′}). The participant p then sends the “committed” message to the participants in P, the initiator i, the coordinator c, and the distributor d (for example, committed(P, i, c, d)) and transitions to the committed state C.
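The behavior in the fourth prepared state Pp amounts to collecting “prepared” messages until S=P unless an outcome arrives first. A minimal Python sketch follows; the function name resolve_in_pp and the (kind, sender) event encoding are assumptions made for illustration, and sending of messages is omitted.

```python
def resolve_in_pp(participants, prepared, events):
    """Process (kind, sender) events for a participant in state Pp.

    participants: the set P
    prepared: the set S of participants already known to be prepared
              (including the participant itself)
    Returns "committed", "aborted", or None if the participant is still waiting.
    """
    P = frozenset(participants)
    S = set(prepared)
    if S == P:
        return "committed"
    for kind, sender in events:
        if kind == "aborted":
            return "aborted"        # another participant already aborted
        if kind == "committed":
            return "committed"      # another participant already committed
        if kind == "prepared":
            S.add(sender)
            if S == P:
                return "committed"  # every participant is known to be prepared
    return None
```

A return value of None corresponds to the participant remaining in Pp, for example while one other participant is still disconnected.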

5. The Committed State C

The committed state C and the aborted state A are garbage collection states wherein the participant p handles information stored in a log during its execution of the commit protocol. As illustrated in FIG. 12C, the participant p waits until the other participants in P complete the transaction before clearing its log so that it can provide the information in the log to another participant p′ that may not have received one or more messages sent, for example, when the other participant p′ was disconnected.

In the committed state C, the participant p may receive the “committed” message from another participant p′ (for example, committed(p′)). In response, the participant p adds the other participant p′ to the set of known committed participants C′ (for example, C′=C′∪{p′}). Once all the participants in P have committed (for example, C′=P), the participant p cleans its log and transitions to the final state F.

When the participant p detects a connection to another participant p′ (for example, connect(p′)), the participant p sends a “committed′” message to the other participant p′ (for example, committed′(p′)). Again, the participant p waits in the committed state C until C′=P.

6. The Aborted State A

As discussed above, the aborted state A is also a garbage collection state wherein the participant p handles information stored in a log during its execution of the commit protocol. As illustrated in FIG. 12C, in the aborted state A, the participant p may receive the “aborted” message from another participant p′ (for example, aborted(p′)). In response, the participant p adds the other participant p′ to the set of known aborted participants A′ (for example, A′=A′∪{p′}). Once all the participants in P have aborted (for example, A′=P), the participant p cleans its log and transitions to the final state F.

When the participant p detects a connect to another participant p′ (for example, connect(p′)), the participant p sends an “aborted′” message to the other participant p′ (for example, aborted′(p′)). Again, the participant p waits in the aborted state A until A′=P.

7. The Final State F

The participant p ends the transaction in the final state F. As illustrated in FIG. 12C, in the final state F, the participant p may receive the “aborted′” message from another participant p′ (for example, aborted′(p′)). In response, the participant sends the “aborted” message to the other participant p′ (for example, aborted(p′)). The participant p may also receive the “committed′” message from another participant p′ (for example, committed′(p′)). In response, the participant p sends the “committed” message to the other participant p′ (for example, committed(p′)).
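The garbage collection states C and A and the final state F can be sketched together in Python. The class below is a hypothetical illustration: the name ResolvedParticipant and the message encoding (a trailing apostrophe marks the primed messages) are assumptions, and a message that contradicts the participant's own outcome is simply ignored because the description does not cover that case.

```python
class ResolvedParticipant:
    """Illustrative model of a participant after it has committed or aborted."""

    def __init__(self, name, participants, outcome):
        self.P = frozenset(participants)
        self.outcome = outcome      # "committed" or "aborted"
        self.known = {name}         # C' or A': participants known to share the outcome
        self.log = [outcome]
        self.state = "C" if outcome == "committed" else "A"
        self._count()

    def _count(self):
        # Mirrors commit_count/abort_count: forget once every participant is accounted for.
        if self.known == self.P:
            self.log.clear()
            self.state = "F"

    def on_message(self, kind, sender):
        """Handle committed, aborted, committed', or aborted' from another participant.

        Returns the reply to send back to the sender, or None.
        """
        primed = kind.endswith("'")
        base = kind[:-1] if primed else kind
        if self.state != "F":
            if base != self.outcome:
                return None         # not covered by the description; ignored here
            self.known.add(sender)
            self._count()
        # Primed messages are answered with the unprimed form, including in state F.
        return base if primed else None
```

In state F the log has already been cleaned, so the reply is derived from the incoming primed message itself rather than from stored information.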

The following exemplary pseudocode illustrates one embodiment of the participant p:

function forget( ):
    clean log
    set state to F

function abort_count(A′):
    if A′ ≠ P:
        set state to (A, A′)
    else:
        forget( )

function commit_count(C′):
    if C′ ≠ P:
        set state to (C, C′)
    else:
        forget( )

function abort(A′):
    log(abort)
    send aborted to (P ∪ {i, c, d}) \ {p}
    abort_count(A′)

function commit(C′):
    log(commit)
    send committed to (P ∪ {i, c, d}) \ {p}
    commit_count(C′)

function pledge_c(S_tell, S_prepared):
    send pledged(c) to S_tell
    set state to (Pc, S_prepared)

function pledge_d(S_tell, S_prepared):
    send pledged(d) to S_tell
    set state to (Pd, S_prepared)

function prepare_p(S):
    send prepared to P \ {p}
    set state to (Pp, S)

in state (I, S):
    on disconnect from i, c, or d:
        abort({p})
    on pledge from d:
        abort({p})
    on abort from i, c, or d:
        abort({p})
    on aborted from p′:
        abort({p, p′})
    on prepared from p′:
        set state to (I, S ∪ {p′})
    on prepare from i:
        if error:
            abort({p})
        else:
            log(prepare)
            send prepared to c
            set state to (Pcd, S)

in state (Pc, S), (Pd, S), or (Pcd, S):
    on abort from c or d:
        abort({p})
    on aborted from p′:
        abort({p, p′})
    on commit from c or d:
        commit({p})
    on committed from p′:
        commit({p, p′})

in state (Pcd, S):
    on disconnect from d:
        pledge_c({c}, S)
    on revoke from d:
        pledge_c({c}, S)
    on disconnect from c:
        pledge_d({d}, S)
    on revoke from c:
        pledge_d({d}, S)
    on pledge from d:
        pledge_d({c, d}, S)
    on prepared from p′:
        set state to (Pcd, S ∪ {p′})

in state (Pc, S):
    on disconnect from c:
        prepare_p(S ∪ {p})
    on revoke from c:
        prepare_p(S ∪ {p})
    on prepared from p′:
        set state to (Pc, S ∪ {p′})

in state (Pd, S):
    on disconnect from d:
        prepare_p(S ∪ {p})
    on revoke from d:
        prepare_p(S ∪ {p})
    on prepared from p′:
        set state to (Pd, S ∪ {p′})

in state (Pp, S):
    on connect to p′:
        send prepared to p′
    on aborted from p′:
        abort({p, p′})
    on committed from p′:
        commit({p, p′})
    on prepared from p′:
        set state to (Pp, S ∪ {p′})
    if S = P:
        commit({p})

in state (C, C′):
    on connect to p′:
        send committed′ to p′
    on committed from p′:
        commit_count(C′ ∪ {p′})
    on committed′ from p′:
        send committed to p′
        commit_count(C′ ∪ {p′})

in state (A, A′):
    on connect to p′:
        send aborted′ to p′
    on aborted from p′:
        abort_count(A′ ∪ {p′})
    on aborted′ from p′:
        send aborted to p′
        abort_count(A′ ∪ {p′})

in state F:
    on aborted′ from p′:
        send aborted to p′
    on committed′ from p′:
        send committed to p′

on start:
    set state to (I, Ø)

on restart:
    if last log was start:
        abort({p})
    if last log was prepare:
        set state to (Pp, {p})
    if last log was abort:
        abort_count({p})
    if last log was commit:
        commit_count({p})
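The restart rules at the end of the pseudocode map the last log record to the state in which the participant resumes. The following Python sketch restates that mapping; the function name restart_state is hypothetical, messages sent during recovery are omitted, and the sketch assumes P contains more than one participant so that the count functions do not immediately forget.

```python
def restart_state(last_log, p):
    """Return (state, known_set) for participant p after a restart.

    last_log is the most recent log record: "start", "prepare", "abort", or "commit".
    """
    if last_log == "start":
        return ("A", {p})    # never prepared, so the participant aborts
    if last_log == "prepare":
        return ("Pp", {p})   # already prepared: resolve with the other participants
    if last_log == "abort":
        return ("A", {p})    # resume collecting known aborted participants
    if last_log == "commit":
        return ("C", {p})    # resume collecting known committed participants
    raise ValueError("unknown log record: " + repr(last_log))
```

A participant that restarts after logging prepare goes directly to Pp with S={p} because it can no longer know whether the coordinator or the distributor already resolved the transaction.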

The 2.5PC protocol is double-failure non-blocking if there are at least three participants and the coordinator c and the distributor d are on different nodes from each other and from all participants. For example, if both the coordinator c and the distributor d fail after all the participants in P prepare, the participants will all go to the fourth prepared state Pp and resolve the transaction themselves. If, rather than crashing, the coordinator c and distributor d lose some of their network connections, including the connection between themselves, they may both realize that they cannot get enough pledges to resolve the transaction and will send revoke messages to the participants in P. This will result in all the participants in P moving to the fourth prepared state Pp and resolving the transaction.

As another example, if the coordinator c and the participant p both fail, the distributor d will start gathering pledges. If there are at least three participants in P, there will be at least two non-failed participants p′. Thus, the distributor d will be able to get a majority of the votes. The distributor d will then abort the transaction. If the coordinator c and the failed participant p have not crashed, but are merely on the other side of a network split, for example, the coordinator c will fail to gather enough pledges to commit the transaction and will transition to its final state F. The participant p will receive the result when it reconnects to one of the other participants p′.
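The requirement of at least three participants in this example follows from the distributor's abort threshold |S_for| > ⌊|P|/2⌋. The short Python check below, whose function name is hypothetical, tests whether the participants that remain reachable after one participant fails can still exceed that threshold.

```python
def survivors_exceed_half(n, failed=1):
    """True if the n - failed reachable participants exceed floor(n / 2)."""
    return (n - failed) > n // 2


for n in range(2, 7):
    print(n, survivors_exceed_half(n))
```

With two participants the single survivor does not exceed ⌊2/2⌋ = 1, so the distributor could not gather a majority; with three or more participants the survivors always do.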

As another example, a failure of both the distributor d and the participant p will cause all the other participants in P to disconnect from the distributor d. This will result in a majority of pledges to the coordinator c. The coordinator c will then commit the transaction. If the distributor d and the participant p are instead on the other side of a network split, they may or may not commit the transaction. If the distributor d received the original “commit” message from the coordinator c before the link went down, it will commit. However, if the distributor d did not receive the commit, the distributor d will start gathering pledges. Once it discovers that it can only get one pledge, it will revoke and transition to its final state F. The participant p will resolve the transaction when it reconnects to another participant p′.

While certain embodiments of the inventions have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the inventions.

Claims

1. A distributed system configured to resolve an atomic transaction comprising multiple transactions among a set of parties within the distributed system, the distributed system comprising: a plurality of participants, each of the plurality of participants residing on a node of a computer system and in communication with each of the other of the plurality of participants; and a coordinator residing on a node of the computer system and in communication with each of the plurality of participants, wherein at least one of the plurality of participants resides on the same node of the computer system as the coordinator; wherein the coordinator is configured to: receive one or more prepared messages from one or more of the plurality of participants, each prepared message indicating whether the sending participant can commit a transaction of the atomic transaction; decide to either commit or abort the atomic transaction based on the received prepared messages; and send the decision to either commit or abort the atomic transaction to two or more of the plurality of participants; wherein each of the plurality of participants is configured to: receive a prepare message, the prepare message indicating that the participant should commit a transaction of the atomic transaction; determine whether the participant can commit the transaction; send a prepared message to the coordinator, the prepared message indicating whether the participant can commit the transaction; receive decision messages from two or more of the plurality of participants which are different than the coordinator, each decision message indicating that the sending participant has either committed or aborted a transaction of the atomic transaction; if communication with the coordinator is available, receive the decision to either commit or abort the atomic transaction from the coordinator; if the decision is received from the coordinator, decide to either commit or abort the transaction based on the received decision; if communication with the coordinator is not available and the decision was not received from the coordinator, decide to either commit or abort the transaction based on the received decision messages from the two or more of the plurality of participants which are different than the coordinator; and send the participant's decision to either commit or abort the transaction to each of the other of the plurality of participants with which communication is available.
2. The distributed system of claim 1, wherein the coordinator is further configured to decide to: commit the atomic transaction if prepared messages have been received from each of the plurality of participants and each received prepared message indicates that the sending participant can commit a transaction of the atomic transaction; and abort the atomic transaction if at least one prepared message has been received from one of the plurality of participants and the received prepared message indicates that the sending participant cannot commit a transaction of the atomic transaction.
3. The distributed system of claim 1, wherein each of the plurality of participants is further configured to decide to: commit the transaction if a decision to commit the atomic transaction is received from the coordinator; and abort the transaction if a decision to abort the atomic transaction is received from the coordinator.
4. The distributed system of claim 2, wherein the coordinator is further configured to abort the transaction if communication between the coordinator and one of the plurality of participants is unavailable.
5. The distributed system of claim 1, wherein each of the plurality of participants is further configured to commit the transaction if all of the received one or more decision messages indicate that the sending participant has committed a transaction of the atomic transaction.
6. The distributed system of claim 1, wherein each of the plurality of participants is further configured to abort the transaction if one of the received one or more decision messages indicate that the sending participant has aborted a transaction of the atomic transaction.
7. The distributed system of claim 1, further comprising a shared participant residing on the same node of the computer system as the coordinator and in communication with each of the plurality of participants, the shared participant configured to: receive a prepare message from an initiator, the prepare message indicating that the shared participant should commit a transaction of the atomic transaction; determine whether the shared participant can commit the transaction; send a prepared message to the coordinator, the prepared message indicating whether the shared participant can commit the transaction; receive one or more decision messages from one or more of the plurality of participants, each decision message indicating that the sending participant has either committed or aborted a transaction of the atomic transaction; and decide to either commit or abort the transaction based on the received one or more decision messages.
8. The method of claim 1, wherein each of the plurality of participants is further configured to: determine that communication with one of the plurality of participants was previously unavailable and has become available; and send the participant's decision to either commit or abort the transaction to the previously unavailable one of the plurality of participants.
9. The method of claim 8, wherein said determination that communication with one of the plurality of participants was previously unavailable and has become available is based on a message received from said previously unavailable one of the plurality of participants.
10. The method of claim 1, wherein each of the plurality of participants is further configured to: receive a request message from one or more of the plurality of participants, the request message requesting the participant's decision to either commit or abort the transaction; and send the participant's decision to either commit or abort the transaction to one or more of the other of the plurality of participant from which a request message was received.
11. A method of resolving an atomic transaction comprising multiple transactions among a plurality of participants, each participant residing on a node of a distributed computer system, the method comprising: sending, by the one or more computer processors of each of the plurality of participants, a prepared message to a coordinator, the coordinator residing on the same node as at least one of the plurality of participants, the prepared message indicating that the sending participant is prepared to commit a transaction of the atomic transaction; receiving, by each of a first subset of two or more of the plurality of participants which are different than the coordinator, a decision from the coordinator to either commit or abort the atomic transaction; determining, by each of a second subset of one or more of the plurality of participants, that communication with the coordinator is not available and that no decision from the coordinator was received; deciding, by the one or more computer processors of each of the first subset of the plurality of participants, to either commit or abort the transaction; sending, by the one or more computer processors of two or more of the first subset of the plurality of participants, the decision by the sending participant to either commit or abort the transaction to each of the second subset of the plurality of participants with which communication is available; receiving, by each of the second subset of the plurality of participants, a decision to either commit or abort the transaction from two or more of the first subset of the plurality of participants; and deciding, by the one or more computer processors of each of the second subset of the plurality of participants, to either commit or abort the transaction based on the received decisions from the two or more of the first subset of the plurality of participants.
12. The method of claim 11, further comprising receiving, by each of the plurality of participants, a prepare message from an initiator, the prepare message indicating that the receiving participant should prepare for the transaction.
13. The method of claim 11, wherein receiving a decision from the coordinator comprises receiving a commit message from the coordinator node to commit the transaction.
14. The method of claim 11, wherein receiving a decision from the coordinator comprises receiving an abort message from the coordinator node to abort the transaction.

REFERENCE TO RELATED APPLICATIONS

The present application claims priority benefit under 35 U.S.C. §119(e) from U.S. Provisional Application No. 60/623,843, filed Oct. 29, 2004 entitled “Non-Blocking Commit Protocol Systems and Methods.” The present application also hereby incorporates by reference herein the foregoing application in its entirety. The present application relates to U.S. application Ser. No. 11/262,314, titled “Message Batching with Checkpoints Systems and Methods”, filed on Oct. 28, 2005, which claims priority to U.S. Provisional Application No. 60/623,848, filed Oct. 29, 2004 entitled “Message Batching with Checkpoints Systems and Methods,” and U.S. Provisional Application No. 60/628,528, filed Nov. 15, 2004 entitled “Message Batching with Checkpoints Systems and Methods;” and U.S. application Ser. No. 11/262,308, titled “Distributed System with Asynchronous Execution Systems and Methods,” filed on Oct. 28, 2005, which claims priority to U.S. Provisional Application No. 60/623,846, filed Oct. 29, 2004 entitled “Distributed System with Asynchronous Execution Systems and Methods,” and U.S. Provisional Application No. 60/628,527, filed Nov. 15, 2004 entitled “Distributed System with Asynchronous Execution Systems and Methods.” The present application hereby incorporates by reference herein all of the foregoing applications in their entirety.

Related Publications (1)

	Number	Date	Country
	20060095438 A1	May 2006	US

Provisional Applications (1)

	Number	Date	Country
	60623843	Oct 2004	US
