EFFECTIVE USE OF A HARDWARE BARRIER SYNCHRONIZATION REGISTER FOR PROTOCOL SYNCHRONIZATION

Description

BRIEF DESCRIPTION OF THE DRAWINGS

The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of practice, together with the further objects and advantages thereof, may best be understood by reference to the following description taken in connection with the accompanying drawings in which:

FIG. 1 is a block diagram providing an abstract view of threads from a star configuration entering a synchronization barrier;

FIG. 2 is a block diagram providing an abstract view of threads from a tree configuration entering a synchronization barrier; and

FIG. 3 is a block diagram illustrating the situation in which a process includes multiple threads.

DETAILED DESCRIPTION

Although one thread may belong to multiple groups, it can only perform barrier operations on one group at a time, according to the definition of barrier. Therefore, at any given time, it is only necessary to keep one state per thread, which leads to a solution of dividing the BSR evenly for all potential participating threads. To handle multiple group memberships, one thread can use its BSR entry to share a unique identifier for the group on which it is currently performing barrier synchronization.

Suppose each thread has a logical identifier “t” and that it is allocated m bits in the BSR. For example, if m=8, each thread gets one BSR byte. The expression “BSR[t]” is used to represent the share of the BSR that thread t has. Thus, the size of BSR[t] is m bits.
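The division of the BSR into equal m-bit shares can be sketched as follows. This is only an illustration: the 64-bit register width and the helper names share_mask, bsr_max_threads, bsr_get and bsr_set are assumptions made for the example and do not come from the specification.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the BSR is modeled as a single 64-bit word. */
#define BSR_BITS 64u

/* Mask with the low m bits set (1 <= m <= 64). */
static uint64_t share_mask(unsigned m) {
    return (m >= 64u) ? UINT64_MAX : ((UINT64_C(1) << m) - 1u);
}

/* Number of threads that fit when each thread is allocated m bits. */
static unsigned bsr_max_threads(unsigned m) {
    assert(m >= 1u && m <= BSR_BITS);
    return BSR_BITS / m;
}

/* Read BSR[t], the m-bit share of thread t. */
static uint64_t bsr_get(uint64_t bsr, unsigned t, unsigned m) {
    assert(t < bsr_max_threads(m));
    return (bsr >> (t * m)) & share_mask(m);
}

/* Return a copy of the register with BSR[t] replaced by value. */
static uint64_t bsr_set(uint64_t bsr, unsigned t, unsigned m, uint64_t value) {
    assert(t < bsr_max_threads(m));
    uint64_t mask = share_mask(m) << (t * m);
    return (bsr & ~mask) | ((value << (t * m)) & mask);
}
```

With m=8, for instance, bsr_set(0, 2, 8, 0xAB) places the value in the third byte of the modeled register, and bsr_get reads it back from the same share.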

In accordance with the second embodiment of the present invention, a method of BSR use for barrier synchronization is based on a tree topology of the group members. When a group is created, the members are arranged in a logical tree. The method is based on storing a group identifier in thread states and releasing threads with multiple stores. Barrier synchronization for such a group is performed as follows.

TABLE I

Method 2: Tree Structure

Barrier_2(group_id)

{

// wait for all my children to enter the barrier

foreach (thread t and t is my child) {

while (BSR[t] != group_id);

}

if (this_thread == the root of this group) {

off-node synchronization when necessary;

} else {

// indicate that all my children and I have

// entered the barrier

BSR[this_thread] = group_id;

// wait for my parent to release me from the

// barrier

while (BSR[this_thread] != 0);

}

// release all my children from the barrier

foreach (thread t and t is my child) {

BSR[t] = 0;

}

}

The method above is able to handle 2^m−1 groups because each BSR[t] has m bits and the value “0” is used to indicate that a thread is not in a barrier. To complete a barrier on a group of size n, a total of 2(n−1) stores to the BSR are employed: n−1 to indicate that all non-root threads are in the barrier, and n−1 more stores to release the non-root threads from the barrier.
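The store count of 2(n−1) can be checked with a small single-threaded simulation of the tree method. The sketch below is not the hardware protocol itself: it assumes a seven-thread binary tree like that of FIG. 2, models each BSR share as one byte (m=8), and replaces each spinning thread by a state machine that is polled round-robin. The names step, children_hold, release_children and run_tree_barrier are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { N = 7 };                 /* threads 0..6; children of i are 2i+1 and 2i+2 */
enum phase { WAIT_CHILDREN, WAIT_RELEASE, DONE };

static uint8_t bsr[N];          /* simulated BSR shares, m = 8 (assumption) */
static int stores;              /* number of BSR stores issued */

static void bsr_store(int t, uint8_t v) { bsr[t] = v; stores++; }

/* True when every child of `self` holds value v in its share. */
static int children_hold(int self, uint8_t v) {
    for (int c = 2 * self + 1; c <= 2 * self + 2; c++)
        if (c < N && bsr[c] != v) return 0;
    return 1;
}

static void release_children(int self) {
    for (int c = 2 * self + 1; c <= 2 * self + 2; c++)
        if (c < N) bsr_store(c, 0);
}

/* One polling attempt by thread `self`; returns unchanged if it must keep waiting. */
static void step(int self, enum phase *ph, uint8_t group_id) {
    if (*ph == WAIT_CHILDREN) {
        if (!children_hold(self, group_id)) return;
        if (self == 0) {                /* root: off-node sync would go here */
            release_children(self);
            *ph = DONE;
        } else {
            bsr_store(self, group_id);  /* my children and I have entered */
            *ph = WAIT_RELEASE;
        }
    } else if (*ph == WAIT_RELEASE && bsr[self] == 0) {
        release_children(self);         /* parent released me; pass it on */
        *ph = DONE;
    }
}

/* Runs one barrier on the whole tree and returns the number of BSR stores. */
static int run_tree_barrier(uint8_t group_id) {
    enum phase ph[N];
    int done = 0;
    assert(group_id != 0);              /* 0 means "not in a barrier" */
    stores = 0;
    for (int t = 0; t < N; t++) { ph[t] = WAIT_CHILDREN; bsr[t] = 0; }
    for (int round = 0; done < N; round++) {
        assert(round < 4 * N);          /* guard against a stuck simulation */
        done = 0;
        for (int t = 0; t < N; t++) {
            step(t, &ph[t], group_id);
            done += (ph[t] == DONE);
        }
    }
    return stores;
}
```

For n=7 the simulation issues six stores of the group identifier by the non-root threads and six releasing stores of zero, which matches 2(n−1)=12.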

FIG. 2 provides an abstract view of the processing that takes place when the threads are structured in tree fashion, with root thread 300 possessing child threads 310 and 311. In turn, child thread 310 possesses its own set of child (grandchild, if you will) threads 320 and 321. Likewise, child thread 311 possesses its own set of child (grandchild, if you will) threads 322 and 323. Each of these threads is associated with a portion of BSR 100. Each of these threads is also capable of storing a group identifier (or an all zero field) into its associated BSR portion.

Since a store to the BSR incurs an expensive broadcast from the storing CPU to all other CPUs, it is sometimes desirable to reduce the total number of stores to improve performance. Accordingly, another method (Method 1 in Table II) is employed for use with star topology groupings of threads. This method involves storing a group identifier and sequence number as the thread state and releasing threads with a single store.

TABLE II

Method 1: Star Structure

Barrier_1(group_id)

{

// flip barrier sequence number for this group

seq_no[group_id] = !seq_no[group_id];

if (this_thread == the center of this group) {

// wait for all other members to enter the

// barrier

foreach (other thread t in this group) {

while (BSR[t] != (group_id, seq_no));

}

off-node synchronization when necessary;

// release all other members from the barrier

BSR[this_thread] = (group_id, seq_no) ;

} else {

// indicate that I have entered the barrier

BSR[this_thread] = (group_id, seq_no);

// wait for the center to release me from the

// barrier

while (BSR[center] != (group_id, seq_no));

}

}

The method specified above is a special case of a 1-level tree algorithm, optimized so that a single store to the center task's BSR share (allotted portion) releases all members from the barrier. The one-bit sequence number distinguishes between consecutive barriers on the same group; because of this one bit, only 2^(m-1)−1 groups are supported. The value 0 is reserved for BSR initialization. For a group of size n, the number of BSR stores is reduced to n, but polling for the members to enter the barrier is serialized at the center process for the star grouping.
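The role of the one-bit sequence number can be illustrated with a single-threaded simulation of the star method. This is a sketch under stated assumptions, not the protocol as implemented in hardware: shares are one byte (m=8), the pair (group_id, seq_no) is packed as group_id shifted left by one bit with the sequence number in the low bit, and the spinning threads are replaced by state machines polled round-robin. The packing layout and the names pack, step and run_star_barrier are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { N = 5, CENTER = 0 };     /* one center thread and four other members */
enum phase { ENTER, WAIT_RELEASE, DONE };

static uint8_t bsr[N];          /* simulated BSR shares, m = 8 (assumption) */
static int stores;              /* number of BSR stores issued */

/* Pack (group_id, seq_no) into one share; group ids 1..2^(m-1)-1 = 1..127. */
static uint8_t pack(uint8_t group_id, unsigned seq_no) {
    assert(group_id >= 1 && group_id <= 127);
    return (uint8_t)((group_id << 1) | (seq_no & 1u));
}

static void bsr_store(int t, uint8_t v) { bsr[t] = v; stores++; }

/* One polling attempt by thread `self`; returns unchanged if it must keep waiting. */
static void step(int self, enum phase *ph, uint8_t tag) {
    if (*ph == DONE) return;
    if (self == CENTER) {
        for (int t = 0; t < N; t++)
            if (t != CENTER && bsr[t] != tag) return;  /* someone not in yet */
        bsr_store(CENTER, tag);      /* a single store releases every member */
        *ph = DONE;
    } else if (*ph == ENTER) {
        bsr_store(self, tag);        /* indicate that I have entered */
        *ph = WAIT_RELEASE;
    } else if (bsr[CENTER] == tag) {
        *ph = DONE;                  /* center has released me */
    }
}

/* Flips the sequence number, runs one barrier, returns the number of stores. */
static int run_star_barrier(uint8_t group_id, unsigned *seq_no) {
    enum phase ph[N];
    int done = 0;
    *seq_no = !*seq_no;
    uint8_t tag = pack(group_id, *seq_no);
    stores = 0;
    for (int t = 0; t < N; t++) ph[t] = ENTER;
    for (int round = 0; done < N; round++) {
        assert(round < 4 * N);       /* guard against a stuck simulation */
        done = 0;
        for (int t = 0; t < N; t++) {
            step(t, &ph[t], tag);
            done += (ph[t] == DONE);
        }
    }
    return stores;
}
```

The simulation deliberately leaves the shares untouched between barriers. On a second barrier for the same group, the stale values from the first barrier differ from the new tag only in the sequence bit, and that bit is what keeps the members from being released prematurely.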

FIG. 1 provides an abstract view of the processing that takes place in a situation in which the threads are structured in star fashion. The threads enter the barrier independently. The root or parent thread 200 is responsible for controlling the exit of all of the child threads 210 from the barrier. On entry into the barrier, non-root threads are able to set a flag indicating that they are “in.” The root thread 200 is responsible for polling the other threads through BSR 100 and it is thus capable of setting an “out” flag.

Using BSR for MPI Barrier Synchronization

For single-threaded MPI applications, only one thread of an MPI process is allowed to call MPI functions. The above two methods are easily applied to MPI barrier synchronization by mapping MPI communicators to groups. The complication of using BSR for MPI barrier synchronization arises from multi-threaded cases where the participants of an MPI barrier are MPI processes and an MPI process can issue a barrier call from any of its threads.

Complications from using the BSR for the MPI Barrier arise for two reasons:

- (1) MPI allows barriers on all communicators, and there can be many communicators; and,
- (2) a multithreaded MPI task can issue a barrier on a communicator from any thread.

To handle multi-threaded cases, one has to know the maximum number of threads that an MPI process can have and assign m bits to each thread. When an MPI process is waiting in a barrier for other processes to enter the same barrier, it is necessary for the waiting thread to poll all the BSR shares assigned to another process, because any thread in the other process can issue a matching barrier call.

Suppose one MPI process “p” has a maximum of T threads that make MPI barrier calls and each thread is logically numbered as t where t ranges from 0 to T−1. To solve this problem, an array “BSR[p][t]” is used to represent the m BSR bits that are allocated to thread t of process p. When an MPI communicator is created, a tree is built from the participating processes. The tree method above is modified as follows to handle the multi-threaded MPI barrier situation.

TABLE III

Method 3: Synchronizing multi-threaded processes

Barrier_3(group_id)

{

foreach (process p and p is my child) {

// wait until any of p's threads enters the

// barrier

wait for any thread t of process p such that

BSR[p][t] == group_id;

// remember which thread has entered the

// barrier

thread_in_barrier[p] = t;

}

if (this_process == the root of this group) {

off-node synchronization when necessary;

} else {

// indicate that my children and I have

// entered the barrier

BSR[this_process][this_thread] = group_id;

// wait for my parent to release me from the

// barrier

while (BSR[this_process][this_thread] != 0);

}

// release the participating threads of my

// children from the barrier

foreach (process p and p is my child) {

BSR[p][thread_in_barrier[p]] = 0;

}

}

This modified algorithm is still able to handle 2^m−1 groups and it still requires 2(n−1) BSR stores to complete a barrier on a group of size n. To reduce the overhead in polling the states of multiple threads, advantage is taken of the multi-byte load capability of the BSR. This situation is illustrated in FIG. 3, in which process 400 is seen to have four threads, process 401 three threads, and process 402 two threads. Method 2 is a special case of Method 3 with only one thread per process.
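Polling all of a child process's shares with one multi-byte load might look like the following sketch. It assumes one-byte shares (m=8) and T=8 threads per process, so that the shares of one process fit in a single 64-bit load; the array layout and the function name thread_in_barrier are illustrative only, and the memcpy merely stands in for the hardware's multi-byte BSR load.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { P = 3, T = 8 };          /* assumed: 3 processes, 8 one-byte shares each */

static uint8_t bsr[P][T];       /* simulated BSR[p][t] */

_Static_assert(T == sizeof(uint64_t), "one process must fit in one 64-bit load");

/*
 * Returns the index t of a thread of process p whose share holds group_id,
 * or -1 if no thread of p has entered the barrier yet.
 */
static int thread_in_barrier(int p, uint8_t group_id) {
    uint64_t word;
    uint8_t snap[T];
    assert(p >= 0 && p < P && group_id != 0);
    memcpy(&word, bsr[p], sizeof word);   /* models one multi-byte BSR load */
    memcpy(snap, &word, sizeof snap);     /* examine the snapshot locally */
    for (int t = 0; t < T; t++)
        if (snap[t] == group_id) return t;
    return -1;
}
```

A parent would call thread_in_barrier repeatedly until it returns a non-negative index, then record that index so the matching share can later be cleared to release the child.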

This solution suggests that the protocol know the maximum number of MPI threads in a task, which is specified by the user, for example through the specification of an environment variable.
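Reading the user-specified maximum from an environment variable could be done along these lines. The variable name BSR_MAX_THREADS, the upper bound of 1024 and the fallback of one thread are all hypothetical choices for this sketch; the specification does not name a variable or a limit.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical variable name; the specification does not name one. */
#define MAX_THREADS_ENV "BSR_MAX_THREADS"

/* Parse a positive thread count; return `fallback` on missing or bad input. */
static int parse_max_threads(const char *text, int fallback) {
    char *end;
    long v;
    assert(fallback >= 1);
    if (text == NULL || *text == '\0') return fallback;
    errno = 0;
    v = strtol(text, &end, 10);
    /* The upper bound of 1024 is an arbitrary sanity limit for the sketch. */
    if (errno != 0 || *end != '\0' || v < 1 || v > 1024) return fallback;
    return (int)v;
}

/* Maximum number of MPI threads per process, defaulting to single-threaded. */
static int max_threads_per_process(void) {
    return parse_max_threads(getenv(MAX_THREADS_ENV), 1);
}
```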

While the invention has been described in detail herein in accordance with certain preferred embodiments thereof, many modifications and changes therein may be effected by those skilled in the art. Accordingly, it is intended by the appended claims to cover all such modifications and changes as fall within the true spirit and scope of the invention.

Claims

1. A method for synchronizing threads in a data processing system, said threads having with a single parent thread and at least one child thread, said method comprising the steps of: upon reaching a point of synchronization by said at least one child thread, storing, within a portion of memory allocated to synchronization, a group identifier for said thread and toggling a portion of said allocated memory representing a sequence number; upon reaching a point of synchronization by said parent thread, polling said portion of allocated memory to determine that each said child thread has reached said point of synchronization; and storing within said allocated memory an indicia indicating release of any said child thread.
2. The method of claim 1 in which said sequence number is one bit in length.
3. The method of claim 1 in which said allocated memory is a barrier synchronization register accessible to a plurality of CPU's on a data processing node.
4. A method for synchronizing threads in a data processing system, said threads being organized in a tree structure with a single root thread and at least two child threads, said method comprising the steps of: upon reaching a point of synchronization by said root thread, storing, within a portion of memory allocated to synchronization, a group identifier for said root thread and polling said root thread's children to determine whether said root thread's child threads have also reached said point of synchronization and if so, storing instead an indicator signifying release of the root thread; upon reaching a point of synchronization for each said child thread, storing, within a portion of memory allocated to synchronization, a group identifier for said child thread and polling said child thread's children to determine whether said child thread's children threads have also reached said point of synchronization and if so, storing instead an indicator signifying release of the thread; and upon release of said root thread, storing an indicator signifying release of said child threads.
5. The method of claim 2 in which said indicator signifying release is an all zero bit field in said allocated memory.
6. The method of claim 4 in which said allocated memory is a barrier synchronization register accessible to a plurality of CPU's on a data processing node.
7. A method for synchronizing processes in a data processing system, said processes being organized in a tree structure with at least one process including multiple threads, said method comprising the steps of: upon reaching a point of synchronization by the root process of said tree, storing, within a portion of memory allocated to synchronization, a group identifier for said root process and polling said process's children to determine whether said process's child threads have also reached said point of synchronization and if so, storing instead an indicator signifying release of the root thread; upon reaching a point of synchronization by a child process of said tree, storing, within a portion of memory allocated to synchronization, a group identifier for said child process and polling said child process's children to determine whether said child process's child threads have also reached said point of synchronization and if so, storing an indicator signifying release of the child thread, said child process also polling said allocated memory assigned to its child processes; and upon release of said root process, storing an indicator signifying release of said child processes.
8. The method of claim 7 in which said indicator signifying release is an all zero bit field in said allocated memory.
9. The method of claim 7 in which said allocated memory is a barrier synchronization register accessible to a plurality of CPU's on a data processing node.

Government Interests

This invention was made with Government support under Agreement No. NBCH3039004 awarded by DARPA. The Government has certain rights in the invention.
