So that the manner in which the above recited features, advantages and objects of the present invention are attained and can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments thereof which are illustrated in the appended drawings.
It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
Embodiments of the present invention provide techniques for protecting critical sections of code being executed in a lightweight kernel environment. These techniques operate very quickly and avoid the overhead associated with a full kernel mode implementation of a network layer, while also allowing network interrupts to be processed without corrupting shared memory state. Thus, embodiments of the invention are suited for use in large, parallel computing systems, such as the Blue Gene® system developed by IBM®.
In one embodiment, a system call may be used to disable interrupts upon entry to a routine configured to process an event associated with the interrupt. For example, a user application may poll network hardware using an advance( ) routine, without waiting for an interrupt to be delivered. When the advance( ) routine is executed, the system call may be used to disable the delivery of interrupts entirely. If the user application calls the advance( ) routine, then delivering an interrupt is not only unnecessary (as the advance( ) routine is configured to clear the state indicated by the interrupt), but depending on timing, processing an interrupt could easily corrupt network state. At the same time, because the network hardware preserves interrupt state and will continually deliver the interrupt until the condition that caused the interrupt is cleared, an interrupt not cleared while in the critical section will be redelivered after the critical section is exited and interrupts are re-enabled.
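For purposes of illustration only, the following sketch shows how an advance( ) routine might bracket its critical section with such a system call. The names sys_disable_interrupts( ), sys_enable_interrupts( ), network_has_pending_event( ), and process_network_event( ) are hypothetical placeholders and are not part of any particular kernel or library interface.

```c
/* Illustrative sketch only; the system call and network routines below are
 * hypothetical stand-ins, not an actual kernel or library interface. */
#include <stdbool.h>

static void sys_disable_interrupts(void) { /* stand-in for the real system call */ }
static void sys_enable_interrupts(void)  { /* stand-in for the real system call */ }
static bool network_has_pending_event(void) { return false; }  /* stand-in */
static void process_network_event(void)     { }                /* stand-in */

/* Poll the network without waiting for an interrupt to be delivered. */
void advance(void)
{
    /* Enter the critical section: no interrupts are delivered from here on. */
    sys_disable_interrupts();

    /* Clear any condition that would otherwise have raised an interrupt. */
    while (network_has_pending_event())
        process_network_event();

    /* Exit the critical section.  Because the network hardware preserves
     * interrupt state, any condition left uncleared is redelivered as an
     * interrupt once interrupts are re-enabled. */
    sys_enable_interrupts();
}
```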
In some cases, however, the use of a system call may incur an unacceptable performance penalty, particularly for critical sections that do not otherwise invoke system calls. For example, the overhead of a system call each time a libc function (e.g., malloc( )) is invoked may be too high. Instead of invoking one system call at the start of such functions to disable interrupts and another on the way out to re-enable them, an alternative embodiment invokes a fast user-space function to set a flag in memory indicating that interrupts should not be processed, and also provides a mechanism to defer processing of the interrupt. Both of these embodiments are described in greater detail below.
Additionally, embodiments of the invention are described herein with respect to the Blue Gene massively parallel architecture developed by IBM. Embodiments of the invention are advantageous for massively parallel computer systems that include thousands of processing nodes, such as a Blue Gene system. However, embodiments of the invention may be adapted for use by a variety of parallel systems that employ CPUs running lightweight kernels and that are configured for interrupt driven communications. For example, embodiments of the invention may be readily adapted for use in distributed architectures such as clusters or grids where processing is carried out by compute nodes running lightweight kernels.
In the following, reference is made to embodiments of the invention. However, it should be understood that the invention is not limited to specific described embodiments. Instead, any combination of the following features and elements, whether related to different embodiments or not, is contemplated to implement and practice the invention. Furthermore, in various embodiments the invention provides numerous advantages over the prior art. However, although embodiments of the invention may achieve advantages over other possible solutions and/or over the prior art, whether or not a particular advantage is achieved by a given embodiment is not limiting of the invention. Thus, the following aspects, features, embodiments and advantages are merely illustrative and are not considered elements or limitations of the appended claims except where explicitly recited in a claim(s). Likewise, reference to “the invention” shall not be construed as a generalization of any inventive subject matter disclosed herein and shall not be considered to be an element or limitation of the appended claims except where explicitly recited in a claim(s).
One embodiment of the invention is implemented as a program product for use with a computer system. The program(s) of the program product defines functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable media. Illustrative computer-readable media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM or DVD-ROM disks readable by a CD- or DVD-ROM drive) on which information is permanently stored; (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive) on which alterable information is stored. Other media include communications media through which information is conveyed to a computer, such as through a computer or telephone network, including wireless communications networks. The latter embodiment specifically includes transmitting information to/from the Internet and other networks. Such computer-readable media, when carrying computer-readable instructions that direct the functions of the present invention, represent embodiments of the present invention.
In general, the routines executed to implement the embodiments of the invention may be part of an operating system or a specific application, component, program, module, object, or sequence of instructions. The computer program of the present invention typically is comprised of a multitude of instructions that will be translated by the native computer into a machine-readable format and hence into executable instructions. Also, programs are comprised of variables and data structures that either reside locally to the program or are found in memory or on storage devices. In addition, various programs described hereinafter may be identified based upon the application for which they are implemented in a specific embodiment of the invention. However, it should be appreciated that any particular program nomenclature that follows is used merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature.
As shown, the system 100 includes a collection of compute nodes 110 and a collection of input/output (I/O) nodes 112. The compute nodes 110 provide the computational power of the computer system 100. Each compute node 110 may include one or more central processing units (CPUs). Additionally, each compute node 110 may include a memory store used to store program instructions and data sets (i.e., work units) on which the program instructions are performed. In a fully configured Blue Gene/L system, for example, 65,536 compute nodes 110 run user applications, and the ASIC for each compute node includes two PowerPC® CPUs (the Blue Gene/P architecture includes four CPUs per node).
Many data communication network architectures are used for message passing among nodes in a parallel computer system 100. Compute nodes 110 may be organized in a network as a torus, for example. Also, compute nodes 110 may be organized as a tree. A torus network connects the nodes in a three-dimensional mesh with wrap-around links. Every node is connected to its six neighbors through the torus network, and each node is addressed by an <x, y, z> coordinate. In a tree network, nodes are typically connected as a binary tree: each node has a parent and two children. Additionally, a parallel system may employ network communication channels for multiple architectures. For example, in a system using both a torus and a tree network, the two networks may be implemented independently of one another, with separate routing circuits, separate physical links, and separate message buffers.
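By way of a hypothetical example, the following fragment computes the six torus neighbors of a node addressed by an <x, y, z> coordinate using wrap-around arithmetic; the dimension sizes used here are arbitrary and do not correspond to any particular system.

```c
/* Illustrative sketch only: the six torus neighbors of node <x, y, z>.
 * DX, DY, DZ are arbitrary example dimension sizes. */
#include <stdio.h>

#define DX 8
#define DY 8
#define DZ 8

/* Wrap a coordinate into [0, dim) so that +/-1 steps wrap around the torus. */
static int wrap(int c, int dim) { return (c % dim + dim) % dim; }

int main(void)
{
    int x = 0, y = 3, z = 7;   /* example node address */

    printf("+x neighbor: <%d,%d,%d>\n", wrap(x + 1, DX), y, z);
    printf("-x neighbor: <%d,%d,%d>\n", wrap(x - 1, DX), y, z);
    printf("+y neighbor: <%d,%d,%d>\n", x, wrap(y + 1, DY), z);
    printf("-y neighbor: <%d,%d,%d>\n", x, wrap(y - 1, DY), z);
    printf("+z neighbor: <%d,%d,%d>\n", x, y, wrap(z + 1, DZ));
    printf("-z neighbor: <%d,%d,%d>\n", x, y, wrap(z - 1, DZ));
    return 0;
}
```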
I/O nodes 112 provide a physical interface between the compute nodes 110 and file servers 130, front end nodes 120 and service nodes 140. Communication may take place over a network 150. Additionally, compute nodes 110 may be configured to pass messages over a point-to-point network. In a Blue Gene/L system, for example, 1,024 I/O nodes 112 each manage communications for a group of 64 compute nodes 110. The I/O nodes 112 provide access to the file servers 130, as well as socket connections to processes in other systems. When a process on a compute node 110 performs an I/O operation (e.g., a read from or write to a file), the operation is forwarded to the I/O node 112 managing that compute node 110. The managing I/O node 112 then performs the operation on the file system and returns the result to the requesting compute node 110. In a Blue Gene/L system, the I/O nodes 112 include the same ASIC as the compute nodes 110, with added external memory and an Ethernet connection.
Additionally, I/O nodes 112 may be configured to perform process authentication and authorization, job accounting, and debugging. By assigning these functions to I/O nodes 112, a lightweight kernel running on each compute node 110 may be greatly simplified as each compute node 110 is only required to communicate with a few I/O nodes 112. The front end nodes 120 store compilers, linkers, loaders and other applications used to interact with the system 100. Typically, users access front end nodes 120, submit programs for compiling, and submit jobs to the service node 140.
The service node 140 may include a system database and a collection of administrative tools provided by the system 100. Typically, the service node 140 includes a computing system configured to handle scheduling and loading of software programs and data on compute nodes 110. In one embodiment, the service node 140 may be configured to assemble a group of compute nodes 110 (referred to as a block), and dispatch a job to a block for execution.
The compute node operating system is a simple, single-user, and lightweight compute node kernel 365, which may provide a single, static, virtual address space to one user application 350 and a user level communications library 355 that provides access to networks 330-345. Known examples of parallel communications library 355 include the ‘Message Passing Interface’ (‘MPI’) library and the ‘Parallel Virtual Machine’ (‘PVM’) library.
In one embodiment, parallel communications library 355 includes routines used for both efficient deferred interrupt handling and fast interrupt disabling and processing by compute node 110, when the node is executing critical section code included in application 350. Additionally, the communications library 355 may define a state structure 360 used to determine whether user application 350 is in a critical section of code, whether interrupts have been disabled, or whether interrupts have been deferred, for a given critical section.
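One possible, purely illustrative layout for state structure 360 is sketched below; the field names are assumptions and do not reflect any actual implementation of communications library 355.

```c
/* Hypothetical sketch of state structure 360; field names are illustrative
 * only and are not taken from any actual implementation. */
#include <stdint.h>

typedef void (*deferred_fn_t)(void);  /* handler to run once the critical section ends */

struct critical_section_state {
    volatile uint32_t in_critical_section;   /* user application 350 is inside a critical section */
    volatile uint32_t interrupts_disabled;   /* interrupts disabled via the system call embodiment */
    volatile uint32_t interrupt_deferred;    /* an interrupt arrived and its processing was deferred */
    deferred_fn_t     deferred_handler;      /* function to invoke for the deferred interrupt */
};
```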
Typically, user application program 350 and parallel communications library 355 are executed using a single thread of execution on compute node 110. Because the thread is entitled to access to all resources of node 110, the tasks to be performed by lightweight kernel 365 are fewer and less complex than those of a kernel running an operating system on a computer with many threads running simultaneously. Kernel 365 may, therefore, be quite lightweight when compared to operating system kernels used for general purpose computers. Operating system kernels that may usefully be improved, simplified, or otherwise modified for use in a compute node 110 include versions of the UNIX®, Linux®, IBM's AIX® and i5/OS® operating systems, and others, as will occur to those of skill in the art.
As shown, point-to-point adapter 340 couples compute node 110 to other compute nodes in parallel system 100. In a Blue Gene/L system, for example, the compute nodes 110 are connected using a point-to-point network configured as a three-dimensional torus. Accordingly, point-to-point adapter 340 provides data communications in six directions on three communications axes, x, y, and z, through six bidirectional links: +x and −x, +y and −y, +z and −z. Point-to-point adapter 340 allows application 350 to communicate with applications running on other compute nodes by passing a message that hops from node to node until reaching its destination. While a number of message passing models exist, the Message Passing Interface (MPI) has emerged as the currently dominant one. Many applications have been ported to, or developed for, the MPI model, making it useful for a Blue Gene system.
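By way of illustration only, the following minimal MPI program passes a single integer from one rank to another; the routing of the underlying message across point-to-point links is handled by the communications library and hardware, not by the application.

```c
/* Minimal MPI point-to-point example; run with at least two ranks,
 * e.g., mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```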
Collective operations adapter 345 couples compute node 110 to a network suited for collective message passing operations. Collective operations adapter 345 provides data communications through three bidirectional links: two to children nodes and one to a parent node.
In one embodiment, torus network 400 supports cut-through routing, which enables packets to transit a compute node 110 without any software intervention until a message reaches its destination. In addition, adaptive routing may be used to increase network performance, even under heavy load. Adaptive routing allows packets to follow any minimal path to the final destination, dynamically “choosing” less congested routes. Another property integrated in the torus network is the ability to multicast along any dimension, enabling low-latency broadcast algorithms.
Additionally, the user space function setting the shared memory flag 505 may register a function, i.e., deferred function 520, to invoke once the user application exits the critical section. In the event that different types of interrupts are available, user application 350 may register a table of functions, one for each type of interrupt that might be deferred while user application 350 is inside a critical section. Reference counter 510 may be used to track how “deep” within multiple critical sections a user application might be at any given point of execution. That is, one critical section may include calls to another function with its own critical section. Thus, the critical section “lock” created by shared memory flag 505 may be “locked” multiple times.
In the event an interrupt is delivered while shared memory flag 505 is active, processing of the interrupt is deferred until all critical sections have completed executing, and pending flag 515 may be set to record that the interrupt arrived. When user application 350 exits the outermost critical section (i.e., when reference counter 510 returns to zero), the pending flag 515 may be checked; if it is set, the deferred function 520 may be invoked to begin the deferred processing of the interrupt delivered while user application 350 was inside a critical section.
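For purposes of illustration only, the following sketch shows one way the shared memory flag 505, reference counter 510, pending flag 515, and deferred function 520 might cooperate; the routine names and layout are hypothetical and do not represent the actual interface of parallel communications library 355.

```c
/* Illustrative sketch of the deferred-interrupt mechanism; names are
 * hypothetical stand-ins for library routines. */
#include <stdint.h>
#include <stddef.h>

typedef void (*deferred_fn_t)(void);

static volatile uint32_t shared_memory_flag;   /* element 505: critical section active */
static volatile uint32_t reference_counter;    /* element 510: nesting depth */
static volatile uint32_t pending_flag;         /* element 515: interrupt deferred while in a critical section */
static deferred_fn_t     deferred_function;    /* element 520: registered deferred handler */

/* Fast user-space entry: no system call, just a flag and a nesting count.
 * The handler to run for any deferred interrupt is registered here. */
void enter_critical_section(deferred_fn_t fn)
{
    deferred_function  = fn;
    shared_memory_flag = 1;
    reference_counter++;
}

/* Called from the interrupt path: defer instead of processing immediately. */
void network_interrupt_handler(void)
{
    if (shared_memory_flag) {
        pending_flag = 1;          /* remember that work is pending */
        return;                    /* do not touch shared network state now */
    }
    /* ... normal, immediate interrupt processing would occur here ... */
}

/* On exit from the outermost critical section, run any deferred work. */
void exit_critical_section(void)
{
    if (--reference_counter == 0) {
        shared_memory_flag = 0;
        if (pending_flag && deferred_function != NULL) {
            pending_flag = 0;
            deferred_function();   /* process the interrupt that was deferred */
        }
    }
}
```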
Advantageously, as described above, embodiments of the invention provide techniques for protecting critical sections of code being executed in a lightweight kernel environment. These techniques operate very quickly and avoid the overhead associated with a full kernel mode implementation of a network layer, while also allowing network interrupts to be processed without corrupting shared memory state.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.