The present invention generally relates to a parallel computing system. More particularly, the present invention relates to adding a plurality of floating point numbers in the parallel computing system.
IEEE 754 describes floating point number arithmetic. Kahan, “IEEE Standard 754 for Binary Floating-Point Arithmetic,” May 31, 1996, UC Berkeley Lecture Notes on the Status of IEEE 754, wholly incorporated by reference as if set forth herein, describes IEEE Standard 754 in detail.
According to IEEE Standard 754, to perform floating point number arithmetic, some or all floating point numbers are converted to binary numbers. However, the floating point number arithmetic does not need to follow IEEE or any particular standard. Table 1 illustrates IEEE single precision floating point format.
The “sign” bit (S) indicates whether a floating point number is positive (S=0) or negative (S=1). For example, if the sign bit is 0, the floating point number is positive. The “exponent” field (E) represents a power of two. For example, if a binary number is $10001.001001_2 = 1.0001001001_2 \times 2^4$, then E becomes $127 + 4 = 131_{10} = 1000\,0011_2$. The “mantissa” field (M) represents the fractional part of a floating point number.
For example, to add $2.5_{10}$ and $4.75_{10}$, $2.5_{10}$ is converted to 0x40200000 (in hexadecimal format) as follows:
Although this example is based on single precision floating point numbers, the mechanism used in this example can be extended to double precision floating point numbers. A double precision floating point number is represented by 64 bits, i.e., 1 bit for the sign bit, 11 bits for the exponent field and 52 bits for the mantissa field.
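By way of non-limiting illustration, the single precision encoding described above may be checked with the following Python sketch. The helper name float32_fields is hypothetical and is not part of the disclosure; the sketch assumes that the struct format 'f' maps to IEEE 754 single precision, as it does in standard Python.

```python
import struct

def float32_fields(x):
    """Return (raw bits, sign, biased exponent, 23-bit mantissa) of x as a float32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8-bit exponent field, bias 127
    mantissa = bits & 0x7FFFFF       # 23-bit fractional part
    return bits, sign, exponent, mantissa

for value in (2.5, 4.75, 2.5 + 4.75):
    bits, s, e, m = float32_fields(value)
    print(f"{value}: {bits:#010x} S={s} E={e} M={m:023b}")
```

For 2.5, the sketch reports the bit pattern 0x40200000, with S=0 and E=128 (i.e., 127+1).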
Traditionally, in a parallel computing system, floating point number additions in multiple computing node operations, e.g., via messaging, are done in part, e.g., by software. At each network hop, the additions require a processor to first receive multiple network packets associated with multiple messages involved in a reduction operation. Then, the processor adds up floating point numbers included in the packets, and finally puts the results back into the network for processing at the next network hop. An example of a reduction operation is finding a summation of a plurality of floating point numbers contributed (i.e., provided) from a plurality of computing nodes. This software approach has large overhead, and cannot utilize a high network bandwidth (e.g., 2 GB/s) of the parallel computing system.
Therefore, it is desirable to perform the floating point number additions in a collective logic device to reduce the overhead and/or to fully utilize the network bandwidth.
The present invention describes a system, method and computer program product for implementing a hardware logic device in a parallel computing system for adding floating point numbers.
In one embodiment, there is provided a method for adding a plurality of first floating point numbers in a parallel computing system, the parallel computing system including a plurality of computing nodes, a computing node including at least one processor and at least one memory device, the method comprising:
receiving a plurality of the first floating point numbers from the computing nodes or network links;
converting the first floating point numbers to integer numbers;
adding the integer numbers and generating a summation of the integer numbers; and
converting the summation to a second floating point number,
wherein the receiving, the converting the first floating point numbers, the adding, the generating and the converting the summation are done in one pass, the one pass indicating that the computing nodes send inputs only once to a collective logic device and receive outputs only once from the collective logic device.
In one embodiment, there is provided a parallel computing system for adding a plurality of first floating point numbers, the system comprising:
a plurality of computing nodes, a computing node including at least one processor and at least one memory device; and
a collective logic device including:
In a further embodiment, the collective logic device further includes:
a floating point number exponent max unit for determining a maximum exponent of the first floating point numbers.
In a further embodiment, the collective logic device is implemented in a network.
In a further embodiment, the second floating point number is reproducible.
The accompanying drawings are included to provide a further understanding of the present invention, and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention. In the drawings,
In one embodiment, the present disclosure illustrates performing floating point number additions in hardware, for example, to reduce the overhead and/or to fully utilize the network bandwidth.
In one embodiment, the back-end floating point logic device 240 includes, without limitation, at least one shift register for performing a normalization and/or shifting operation (e.g., a left shift, a right shift, etc.). In one embodiment, the collective logic device 260 further includes an arbiter device 250. The arbiter device is described in detail below in conjunction with
In a further embodiment, the collective logic device 260 is embedded and/or implemented in a 5-Dimensional torus network.
At step 120, the ALU tree 230 adds the integer numbers and generates a summation of the integer values. Then, the ALU tree 230 provides the summation to the back-end floating point logic device 240. At step 130, the back-end logic device 240 converts the summation to a floating point number (“second floating point number”), e.g., by performing left shifting and/or right shifting according to the maximum exponent and/or the summation. The second floating point number is an output of adding the inputs 200. This second floating point number is reproducible. In other words, upon receiving the same inputs, the collective logic device 260 produces the same output(s). The outputs do not depend on an order of the inputs. Since an addition of integer numbers (converted from the floating point numbers) does not generate a different output based on an order of the addition, the collective logic device 260 generates the same output(s) upon receiving the same inputs regardless of an order of the received inputs.
In one embodiment, the collective logic device 260 performs the method steps 100-130 in one pass. One pass means that the computing nodes send the inputs 200 only once to the collective logic device 260 and/or receive the output(s) only once from the collective logic device 260.
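By way of non-limiting illustration, the order independence of integer addition, in contrast to ordinary floating point addition, may be observed with the following Python sketch. The input values are chosen only for illustration and do not come from the disclosure.

```python
import itertools

# Floating point addition rounds after each step, so the order of the
# operands can change the result.
float_inputs = [1e16, 1.0, -1e16, 1.0]
float_sums = {sum(p) for p in itertools.permutations(float_inputs)}

# Integer addition is exact, so every ordering yields the same sum.
int_inputs = [10**16, 1, -10**16, 1]
int_sums = {sum(p) for p in itertools.permutations(int_inputs)}

print("distinct floating point sums:", sorted(float_sums))
print("distinct integer sums:", sorted(int_sums))
```

The floating point inputs yield more than one distinct sum across orderings, whereas the integer inputs yield a single sum.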
In a further embodiment, in each computing node, besides at least 10 bidirectional links for the 5D torus network 400, there is also at least one dedicated I/O link that is connected to at least one I/O node. Both the I/O link and the bidirectional links are inputs to the collective logic device 260. In one embodiment, the collective logic device 260 has at least 12 inputs. One or more of the inputs may come from a local computing node(s). In another embodiment, the collective logic device 260 has at most 12 inputs. One or more of the inputs may come from a local computing node(s).
In a further embodiment, at least one computing node defines a plurality of collective class maps to select a set of inputs for a class. A class map defines a set of input and output links for a class. A class represents an index into the class map on at least one computing node and is specified, e.g., by at least one packet.
In another embodiment, the collective logic device 260 performs the method steps 100-130 in at least two passes, i.e., the computing nodes send (intermediate) inputs at least twice to the collective logic device 260 and/or receive (intermediate) outputs at least twice from the collective logic device 260. For example, in the first pass, the collective logic device 260 obtains the maximum exponent of the first floating point numbers. Then, the collective logic device normalizes the first floating point numbers and converts them to integer numbers. In the second pass, the collective logic device 260 adds the integer numbers and generates a summation of the integer numbers. Then, the collective logic device 260 converts the summation to a floating point number called the second floating point number. When the collective logic device 260 operates based on at least two passes, its latency may be at least twice as large as a latency based on the one pass described above.
In one embodiment, the collective logic device 260 performing method steps in
The following describes an exemplary floating point number addition according to one exemplary embodiment. Suppose that the collective logic device 260 receives two floating point numbers $A = 2^1 \times 1.5_{10} = 3_{10}$ and $B = 2^3 \times 1.25_{10} = 10_{10}$ as inputs. The collective logic device 260 adds the number A and the number B as follows:
I. (corresponding to Step 105 in
II. (corresponding to Step 110 in
Thus, when the number A is converted to an integer number, it becomes 0x0180000000000000. When the number B is converted, it becomes 0x0500000000000000. Note that the integer numbers comprise only the mantissa field. Also note that the most significant bit of the number B is two binary digits to the left (larger) than the most significant bit of the number A. This is exactly the difference between the two exponents (1 and 3).
III. (corresponding to Step 120 in
IV. (corresponding to Step 130 in
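By way of non-limiting illustration, the worked example for A=3 and B=10 may be reproduced with the following Python sketch. The function names fields and to_aligned_int are hypothetical. The sketch handles only positive, normalized double precision inputs and assumes the 64-bit layout described elsewhere herein (5 reserved bits, the hidden bit, the 52-bit mantissa field, then 6 zero bits).

```python
import struct

HIDDEN_BIT_POS = 58  # bit index (from the least significant bit) of the hidden bit

def fields(x):
    """Split a double into (sign, biased exponent, 52-bit mantissa field)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def to_aligned_int(x, emax):
    """Align a positive, normalized double to the shared maximum exponent."""
    _, e, frac = fields(x)
    mantissa = (1 << 52) | frac            # restore the hidden leading 1
    return (mantissa << 6) >> (emax - e)   # append 6 zeros, then align

a, b = 3.0, 10.0
emax = max(fields(a)[1], fields(b)[1])     # maximum (biased) exponent
ia, ib = to_aligned_int(a, emax), to_aligned_int(b, emax)
total = ia + ib                            # plain integer addition
# The hidden-bit position carries the weight 2**(emax - 1023).
result = total * 2.0 ** (emax - 1023 - HIDDEN_BIT_POS)
print(f"A -> {ia:#018x}, B -> {ib:#018x}, sum -> {total:#018x} = {result}")
```

The integers printed for A and B are 0x0180000000000000 and 0x0500000000000000, and the integer sum converts back to 13.0.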
In one embodiment, the collective logic device 260 performs logical operations including, without limitation, logical AND, logical OR, logical XOR, etc. The collective logic device 260 also performs integer operations including, without limitation, an unsigned and signed integer addition, min and max with an operand size from 32 bits to 4096 bits in units of $(32 \times 2^n)$ bits, where n is a non-negative integer number. The collective logic device 260 further performs floating point operations including, without limitation, a 64-bit floating point addition, min (i.e., finding a minimum floating point number among inputs) and max (i.e., finding a maximum floating point number among inputs). In one embodiment, the collective logic device 260 performs floating point operations at a peak network link bandwidth of the network.
In one embodiment, the collective logic device 260 performs a floating point addition as follows: First, some or all inputs are compared and the maximum exponent is obtained. Then, the mantissa field of each input is shifted according to the difference of its exponent and the maximum exponent. This shifting of each input results in a 64-bit integer number which is then passed through the integer ALU tree 230 for doing an integer addition. A result of this integer addition is then converted back to a floating point number, e.g., by the back-end logic device 240.
Once input requests have been chosen by an arbiter, those input requests are sent to appropriate senders (and/or the reception FIFO) 330 and/or 350. Once some or all of the senders grant permission, the main arbiter 325 relays this grant to a particular sub-arbiter which has won and to each receiver (e.g., an injection FIFO 300 and/or 305). The main arbiter 325 also drives correct configuration bits to the collective logic device 260. The receivers will then provide their input data through the collective logic device 260 and an output of the collective logic device 260 is forwarded to appropriate sender(s).
Integer Operations
In one embodiment, the ALU tree 230 is built with multiple levels of combining blocks. A combining block performs, at least, an unsigned 32-bit addition and/or 32-bit comparison. In a further embodiment, the ALU tree 230 receives control signals for a sign (i.e., plus or minus), an overflow, and/or a floating point operation control. In one embodiment, the ADD tree 230 receives at least two 32-bit integer inputs and at least one carry-in bit, and generates a 32-bit output and a carry-out bit. A block performing a comparison and/or selection receives at least two 32-bit integer inputs, and then selects one input depending on the control signals. In another embodiment, the ALU tree 230 operates with 64-bit integer inputs/outputs, 128-bit integer inputs/outputs, 256-bit integer inputs/outputs, etc.
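By way of non-limiting illustration, a combining block of the kind described above may be modeled in software as follows. The function names add32, add64_in_words and select32 are hypothetical, and the sketch models only the arithmetic behavior (two 32-bit inputs, a carry-in, a 32-bit output and a carry-out), not the hardware structure or its control signals.

```python
MASK32 = 0xFFFFFFFF

def add32(a, b, carry_in=0):
    """Unsigned 32-bit addition: returns (32-bit sum, carry-out bit)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32

def add64_in_words(x, y):
    """Add two 64-bit integers as two 32-bit words, least significant word first."""
    lo, carry = add32(x & MASK32, y & MASK32)
    hi, carry = add32(x >> 32, y >> 32, carry)   # carry-out feeds the next word
    return (hi << 32) | lo, carry

def select32(a, b, pick_max):
    """Comparison/selection: choose one of two unsigned 32-bit inputs."""
    return max(a, b) if pick_max else min(a, b)

print(add64_in_words(0x00000000FFFFFFFF, 1))
print(add64_in_words(0xFFFFFFFFFFFFFFFF, 1))
```

Chaining the carry-out of one 32-bit addition into the carry-in of the next is one way that wider operands (64 bits, 128 bits, etc.) can be handled by 32-bit blocks.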
Floating Point Operations
In one embodiment, the collective logic device 260 performs 64-bit double precision floating point operations. In one embodiment, at most 12 (e.g., 10 network links+1 I/O link+1 local computing node) floating point numbers can be combined, i.e., added. In an alternative embodiment, at least 12 floating point numbers are added.
A 64-bit floating point number format is illustrated in Table 2.
In IEEE double precision floating point number format, there is a sign bit indicating whether a floating point number is positive or negative. The exponent field is 11 bits. The mantissa field is 52 bits.
In one embodiment, Table 3 illustrates a numerical value of a floating point number according to an exponent field value and a mantissa field value:
If the exponent field is 2047 and the mantissa field is 0, a corresponding floating point number is plus or minus Infinity. If the exponent field is 2047 and the mantissa field is not 0, a corresponding floating point number is NaN (Not a Number). If the exponent field is between 1 and $2046_{10}$, a corresponding floating point number is $(-1)^S \times 1.M \times 2^{E-1023}$. If the exponent field is 0 and the mantissa field is 0, a corresponding floating point number is 0. If the exponent field is 0 and the mantissa field is not 0, a corresponding floating point number is $(-1)^S \times 0.M \times 2^{-1022}$. In one embodiment, the collective logic device 260 normalizes a floating point number according to Table 3. For example, if S is 0, E is $2_{10} = 10_2$ and M is the 52-bit binary value 1000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000, a corresponding floating point number is normalized to $1.1000 \ldots 0000_2 \times 2^{2-1023}$.
In one embodiment, an addition of (+)infinity and (+)infinity generates (+)infinity, i.e., (+)Infinity+(+)Infinity=(+)Infinity. An addition of (−)infinity and (−)infinity generates (−) infinity, i.e., (−)Infinity+(−)Infinity=(−)Infinity. An addition of (+)infinity and (−)infinity generates NaN, i.e., (+)Infinity+(−)Infinity=NaN. Min or Max operation for (+)infinity and (+)infinity generates (+)infinity, i.e., MIN/MAX (+Infinity, +Infinity)=(+)infinity. Min or Max operation for (−)infinity and (−)infinity generates (−)infinity, i.e., MIN/MAX (−Infinity, −Infinity)=(−)infinity.
In one embodiment, the collective logic device 260 does not distinguish between different NaNs. A NaN newly generated by the collective logic device 260 may have the most significant fraction bit (the most significant mantissa bit) set, to indicate NaN.
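By way of non-limiting illustration, the Infinity and NaN rules for addition described above may be sketched in Python as follows. The function names classify and special_sum are hypothetical. The sketch also assumes, consistent with IEEE 754 but not stated explicitly above, that a NaN input produces a NaN result.

```python
import math

def classify(x):
    """Return control bits (is_nan, is_pos_inf, is_neg_inf) for one input."""
    return math.isnan(x), x == math.inf, x == -math.inf

def special_sum(inputs):
    """Return the special result of adding the inputs, or None for an ordinary sum."""
    flags = [classify(x) for x in inputs]
    any_nan = any(f[0] for f in flags)
    any_pos_inf = any(f[1] for f in flags)
    any_neg_inf = any(f[2] for f in flags)
    if any_nan or (any_pos_inf and any_neg_inf):
        return math.nan          # (+)Infinity + (-)Infinity = NaN
    if any_pos_inf:
        return math.inf          # (+)Infinity + (+)Infinity = (+)Infinity
    if any_neg_inf:
        return -math.inf         # (-)Infinity + (-)Infinity = (-)Infinity
    return None                  # no special value among the inputs

print(special_sum([math.inf, math.inf]))
print(special_sum([math.inf, -math.inf]))
print(special_sum([1.0, 2.0]))
```

Combining one flag per special condition in this way does not depend on the order of the inputs.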
Floating Point (FP) Min and Max
In one embodiment, an operand size in FP Min and Max operations is 64 bits. In another embodiment, an operand size in FP Min and Max operations is larger than 64 bits. The operand passes through the collective logic device 260 without any shifting and/or normalization, which reduces overhead (e.g., the number of clock cycles needed to perform the FP Min and/or Max operations). The following describes the FP Min and Max operations according to one embodiment. Let “I” be an integer representation (i.e., integer number) of the bit pattern of the 63 bits other than the sign bit. Given two floating point numbers A and B,
if (Sign(A)=0 and Sign(B)=0, i.e., both positive) then
if (I(A)>I(B)), then A>B.
(If both A and B are positive numbers and if A's integer representation is larger than B's integer representation, A is larger than B.)
if (Sign(A)=0, and Sign(B)=1), then A>B.
(If A is a positive number and B is a negative number, A is larger than B.)
if (Sign(A)=1 and Sign(B)=1, i.e., both negative) then
if (I(A)>I(B)), then A<B.
(If both A and B are negative numbers and if A's integer representation is larger than B's integer representation (i.e., |A|>|B|), A is smaller than B.)
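By way of non-limiting illustration, the comparison rules above may be sketched in Python as follows. The function names sign_and_magnitude, fp_greater, fp_max and fp_min are hypothetical. The sketch does not handle NaN inputs and, following the rules as written, treats +0 as larger than −0.

```python
import struct

def sign_and_magnitude(x):
    """Return (sign bit, integer value I of the remaining 63 bits) of a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, bits & ((1 << 63) - 1)

def fp_greater(a, b):
    """Decide a > b using only sign bits and integer comparisons."""
    sa, ia = sign_and_magnitude(a)
    sb, ib = sign_and_magnitude(b)
    if sa == 0 and sb == 0:      # both positive: larger I means larger value
        return ia > ib
    if sa == 0 and sb == 1:      # positive versus negative
        return True
    if sa == 1 and sb == 0:      # negative versus positive
        return False
    return ia < ib               # both negative: larger I means smaller value

def fp_max(a, b):
    return a if fp_greater(a, b) else b

def fp_min(a, b):
    return b if fp_greater(a, b) else a

print(fp_max(2.5, 1.0), fp_max(-1.0, -2.5), fp_min(-1.0, 3.0))
```

Because only integer comparisons are involved, no shifting or normalization of the operands is needed.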
In one embodiment, operands are 64-bit double precision floating point numbers. In one embodiment, the operands are 32-bit floating point numbers, 128-bit floating point numbers, 256-bit floating point numbers, etc. There is no reordering on injection FIFOs 300-305 and/or reception FIFOs 330-335.
In one embodiment, when a first half of the 64-bit floating point number is received, the exponent field of the floating point number is sent to the FP exponent max unit 220 to get the maximum exponent for some or all of the floating point numbers contributing to an addition of these floating point numbers. The maximum exponent is then used to convert each 64-bit floating point number to a 64-bit integer number. The mantissa of each floating point number has a precision of 53 bits, in the form of 1.x for regular numbers, and 0.x for denormalized numbers. The converted integer numbers reserve the 5 most significant bits, i.e., 1 bit for a sign bit and 4 bits for guarding against overflow with up to 12 numbers being added together. The 53-bit mantissa is converted into a 64-bit number in the following way. The leftmost 5 bits are zeros. The next bit is one if the floating point number is normalized and zero if the floating point number is denormalized. Next, the 52-bit mantissa field is appended and then 6 zeroes are appended. Finally, the 64-bit number is right-shifted by Emax−E, where Emax is the maximum exponent and E is a current exponent value of the 59-bit number. E is never greater than Emax, and so Emax−E is zero or positive. After this conversion, if the sign bit retained from the 64-bit floating point number is set, then the shifted number (“N”) is converted to 2's complement format (“N_new”), e.g., by N_new=(not N)+1, where “not N” may be implemented by a bitwise inverter. A resulting number (e.g., N_new or N) is then sent to the ALU tree 230 with the least significant 32-bit word first. In a further embodiment, there are additional control bits to identify special conditions. In one embodiment, each control bit is binary. For example, if the NaN bit is 0, then the number is not a NaN, and if it is 1, then the number is a NaN. There are control bits for +Infinity and −Infinity as well.
The resulting numbers are added as signed integers with operand sizes of 64 bits, with consideration of the control bits for Infinity and NaN. A result of the addition is renormalized to a regular floating point format: (1) if a sign bit is set (i.e., a negative sum), convert the result back from 2's complement format using, e.g., K_new=not (K−1), where K_new is the converted result and K is the result before the converting; (2) then, right or left shift K or K_new until the left-most ‘1’ bit of the final integer sum (i.e., an integer output of the ALU tree 230) is in the 12th bit position from the left of the integer sum. This ‘1’ will be a “hidden” bit in the second floating point number (i.e., a final output of adding the floating point numbers). If the second floating point number is a denormalized number, shift right the second floating point number until the left-most ‘1’ is in the 13th position, and then shift to the right again, e.g., by the value of the maximum exponent. The resultant exponent is calculated as Emax+the amount it was right-shifted−6, for normalized floating point results. For denormalized floating point results, the exponent is set to the value according to the IEEE specification. A result of this renormalization is then sent on with most significant 64-bit word to computing nodes as a final result of the floating point addition.
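By way of non-limiting illustration, the conversion of a 64-bit floating point number to a 64-bit signed integer may be sketched in Python as follows. The function name to_aligned_int64 is hypothetical. The sketch assumes the layout of 5 reserved bits, one hidden bit, the 52-bit mantissa field and 6 zero bits, and assumes, following IEEE 754 rather than any statement above, that a denormalized number (exponent field 0) is aligned as if its exponent field were 1. Infinity and NaN inputs are not handled.

```python
import struct

MASK64 = (1 << 64) - 1

def to_aligned_int64(x, emax):
    """Convert double x to a 64-bit two's complement integer aligned to emax.

    emax is the maximum biased exponent field among all inputs.
    """
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    e = (bits >> 52) & 0x7FF
    frac = bits & ((1 << 52) - 1)
    hidden = 1 if e != 0 else 0              # 1.x if normalized, 0.x if denormalized
    word = ((hidden << 52) | frac) << 6      # 5 zeros, hidden bit, mantissa, 6 zeros
    # Assumption: exponent field 0 uses the same scale as exponent field 1.
    word >>= max(emax, 1) - max(e, 1)        # right shift by Emax - E
    if sign:
        word = (~word + 1) & MASK64          # N_new = (not N) + 1
    return word

emax = 1026  # biased exponent field of 10.0 (1023 + 3)
print(f"{to_aligned_int64(3.0, emax):#018x}")
print(f"{to_aligned_int64(-3.0, emax):#018x}")
print(f"{(to_aligned_int64(10.0, emax) + to_aligned_int64(-3.0, emax)) & MASK64:#018x}")
```

With Emax taken from 10.0, the value 3.0 converts to 0x0180000000000000 and −3.0 converts to its 2's complement, 0xfe80000000000000.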
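By way of non-limiting illustration, the renormalization of a 64-bit integer sum back to a double precision floating point number may be sketched in Python as follows. The function name from_aligned_sum is hypothetical. The sketch assumes that bits shifted out to the right are simply truncated (no rounding mode is stated above), and it handles only normalized results; denormalized and overflowing results raise an error.

```python
import struct

MASK64 = (1 << 64) - 1

def from_aligned_sum(total, emax):
    """Convert a 64-bit two's complement sum, aligned to emax, into a double."""
    total &= MASK64
    sign = total >> 63
    if sign:
        total = ~(total - 1) & MASK64        # K_new = not (K - 1)
    if total == 0:
        return 0.0
    pos = 64 - total.bit_length() + 1        # position of left-most 1, from the left
    right_shift = 12 - pos                   # move that 1 to the 12th position
    if right_shift >= 0:
        mant = total >> right_shift          # assumption: truncate shifted-out bits
    else:
        mant = total << -right_shift
    exponent = emax + right_shift - 6        # Emax + amount right-shifted - 6
    if not 1 <= exponent <= 2046:
        raise ValueError("result is not a normalized double (not handled here)")
    frac = mant & ((1 << 52) - 1)            # drop the hidden bit
    bits = (sign << 63) | (exponent << 52) | frac
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

print(from_aligned_sum(0x0680000000000000, 1026))
print(from_aligned_sum(0xFC80000000000000, 1026))
```

For the integer sum 0x0680000000000000 with a maximum biased exponent of 1026, the left-most 1 is in the 6th position, so the sum is right-shifted by 6 and the exponent field remains 1026, giving 13.0.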
Although the embodiments of the present invention have been described in detail, it should be understood that various changes and substitutions can be made therein without departing from spirit and scope of the inventions as defined by the appended claims. Variations described for the present invention can be realized in any combination desirable for each particular application. Thus particular limitations, and/or embodiment enhancements described herein, which may have particular advantages to a particular application need not be used for all applications. Also, not all limitations need be implemented in methods, systems and/or apparatus including one or more concepts of the present invention.
The present invention can be realized in hardware, software, or a combination of hardware and software. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and run, controls the computer system such that it carries out the methods described herein. The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which—when loaded in a computer system—is able to carry out these methods.
Computer program means or computer program in the present context include any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after conversion to another language, code or notation, and/or reproduction in a different material form.
Thus the invention includes an article of manufacture which comprises a computer usable medium having computer readable program code means embodied therein for causing a function described above. The computer readable program code means in the article of manufacture comprises computer readable program code means for causing a computer to effect the steps of a method of this invention. Similarly, the present invention may be implemented as a computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing a function described above. The computer readable program code means in the computer program product comprises computer readable program code means for causing a computer to effect one or more functions of this invention. Furthermore, the present invention may be implemented as a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for causing one or more functions of this invention.
The present invention may be implemented as a computer readable medium (e.g., a compact disc, a magnetic disk, a hard disk, an optical disk, solid state drive, digital versatile disc) embodying program computer instructions (e.g., C, C++, Java, Assembly languages, .Net, Binary code) run by a processor (e.g., Intel® Core™, IBM® PowerPC®) for causing a computer to perform method steps of this invention. The present invention may include a method of deploying a computer program product including a program of instructions in a computer readable medium for one or more functions of this invention, wherein, when the program of instructions is run by a processor, the computer program product performs the one or more functions of this invention. The present invention may also include a computer program product for one or more functions of this invention. The computer program product includes a storage medium (e.g., a disk, optical disc, memory device, solid-state drive, etc.) readable by a processing circuit and storing instructions run by the processing circuit for performing one or more functions of this invention.
It is noted that the foregoing has outlined some of the more pertinent objects and embodiments of the present invention. This invention may be used for many applications. Thus, although the description is made for particular arrangements and methods, the intent and concept of the invention is suitable and applicable to other arrangements and applications. It will be clear to those skilled in the art that modifications to the disclosed embodiments can be effected without departing from the spirit and scope of the invention. The described embodiments ought to be construed to be merely illustrative of some of the more prominent features and applications of the invention. Other beneficial results can be realized by applying the disclosed invention in a different manner or modifying the invention in ways known to those familiar with the art.
The present invention is related to the following commonly-owned, co-pending United States patent applications filed on even date herewith, the entire contents and disclosure of each of which is expressly incorporated by reference herein as if fully set forth herein. U.S. patent application Serial No. (YOR920090171US1 (24255)), for “USING DMA FOR COPYING PERFORMANCE COUNTER DATA TO MEMORY”; U.S. patent application Serial No. (YOR920090169US1 (24259)) for “HARDWARE SUPPORT FOR COLLECTING PERFORMANCE COUNTERS DIRECTLY TO MEMORY”; U.S. patent application Serial No. (YOR920090168US1 (24260)) for “HARDWARE ENABLED PERFORMANCE COUNTERS WITH SUPPORT FOR OPERATING SYSTEM CONTEXT SWITCHING”; U.S. patent application Serial No. (YOR920090473US1 (24595)), for “HARDWARE SUPPORT FOR SOFTWARE CONTROLLED FAST RECONFIGURATION OF PERFORMANCE COUNTERS”; U.S. patent application Serial No. (YOR920090474US1 (24596)), for “HARDWARE SUPPORT FOR SOFTWARE CONTROLLED FAST MULTIPLEXING OF PERFORMANCE COUNTERS”; U.S. patent application Serial No. (YOR920090533US1 (24682)), for “CONDITIONAL LOAD AND STORE IN A SHARED CACHE”; U.S. patent application Serial No. (YOR920090532US1 (24683)), for “DISTRIBUTED PERFORMANCE COUNTERS”; U.S. patent application Serial No. (YOR920090529US1 (24685)), for “LOCAL ROLLBACK FOR FAULT-TOLERANCE IN PARALLEL COMPUTING SYSTEMS”; U.S. patent application Serial No. (YOR920090530US1 (24686)), for “PROCESSOR WAKE ON PIN”; U.S. patent application Serial No. (YOR920090526US1 (24687)), for “PRECAST THERMAL INTERFACE ADHESIVE FOR EASY AND REPEATED, SEPARATION AND REMATING”; U.S. patent application Serial No. (YOR920090527US1 (24688), for “ZONE ROUTING IN A TORUS NETWORK”; U.S. patent application Serial No. (YOR920090531US1 (24689)), for “PROCESSOR WAKEUP UNIT”; U.S. patent application Serial No. (YOR920090535US1 (24690)), for “TLB EXCLUSION RANGE”; U.S. patent application Serial No. 
(YOR920090536US1 (24691)), for “DISTRIBUTED TRACE USING CENTRAL PERFORMANCE COUNTER MEMORY”; U.S. patent application Serial No. (YOR920090538US1 (24692)), for “PARTIAL CACHE LINE SPECULATION SUPPORT”; U.S. patent application Serial No. (YOR920090539US1 (24693)), for “ORDERING OF GUARDED AND UNGUARDED STORES FOR NO-SYNC I/O”; U.S. patent application Serial No. (YOR920090540US1 (24694)), for “DISTRIBUTED PARALLEL MESSAGING FOR MULTIPROCESSOR SYSTEMS”; U.S. patent application Serial No. (YOR920090541US1 (24695)), for “SUPPORT FOR NON-LOCKING PARALLEL RECEPTION OF PACKETS BELONGING TO THE SAME MESSAGE”; U.S. patent application Serial No. (YOR920090560US1 (24714)), for “OPCODE COUNTING FOR PERFORMANCE MEASUREMENT”; U.S. patent application Serial No. (YOR920090579US1 (24731)), for “A MULTI-PETASCALE HIGHLY EFFICIENT PARALLEL SUPERCOMPUTER”; U.S. patent application Serial No. (YOR920090581US1 (24732)), for “CACHE DIRECTORY LOOK-UP REUSE”; U.S. patent application Serial No. (YOR920090582US1 (24733)), for “MEMORY SPECULATION IN A MULTI LEVEL CACHE SYSTEM”; U.S. patent application Serial No. (YOR920090583US1 (24738)), for “METHOD AND APPARATUS FOR CONTROLLING MEMORY SPECULATION BY LOWER LEVEL CACHE”; U.S. patent application Serial No. (YOR920090584US1 (24739)), for “MINIMAL FIRST LEVEL CACHE SUPPORT FOR MEMORY SPECULATION MANAGED BY LOWER LEVEL CACHE”; U.S. patent application Serial No. (YOR920090585US1 (24740)), for “PHYSICAL ADDRESS ALIASING TO SUPPORT MULTI-VERSIONING IN A SPECULATION-UNAWARE CACHE”; U.S. patent application Serial No. (YOR920090587US1 (24746)), for “LIST BASED PREFETCH”; U.S. patent application Serial No. (YOR920090590US1 (24747)), for “PROGRAMMABLE STREAM PREFETCH WITH RESOURCE OPTIMIZATION”; U.S. patent application Serial No. (YOR920090595US1 (24757)), for “FLASH MEMORY FOR CHECKPOINT STORAGE”; U.S. patent application Serial No. (YOR920090596US1 (24759)), for “NETWORK SUPPORT FOR SYSTEM INITIATED CHECKPOINTS”; U.S. patent application Serial No. 
(YOR920090597US1 (24760)), for “TWO DIFFERENT PREFETCH COMPLEMENTARY ENGINES OPERATING SIMULTANEOUSLY”; U.S. patent application Serial No. (YOR920090598US1 (24761)), for “DEADLOCK-FREE CLASS ROUTES FOR COLLECTIVE COMMUNICATIONS EMBEDDED IN A MULTI-DIMENSIONAL TORUS NETWORK”; U.S. patent application Serial No. (YOR920090631US1 (24799)), for “IMPROVING RELIABILITY AND PERFORMANCE OF A SYSTEM-ON-A-CHIP BY PREDICTIVE WEAR-OUT BASED ACTIVATION OF FUNCTIONAL COMPONENTS”; U.S. patent application Serial No. (YOR920090632US1 (24800)), for “A SYSTEM AND METHOD FOR IMPROVING THE EFFICIENCY OF STATIC CORE TURN OFF IN SYSTEM ON CHIP (SoC) WITH VARIATION”; U.S. patent application Serial No. (YOR920090633US1 (24801)), for “IMPLEMENTING ASYNCHRONOUS COLLECTIVE OPERATIONS IN A MULTI-NODE PROCESSING SYSTEM”; U.S. patent application Serial No. (YOR920090586US1 (24861)), for “MULTIFUNCTIONING CACHE”; U.S. patent application Serial No. (YOR920090645US1 (24873)) for “I/O ROUTING IN A MULTIDIMENSIONAL TORUS NETWORK”; U.S. patent application Serial No. (YOR920090646US1 (24874)) for ARBITRATION IN CROSSBAR FOR LOW LATENCY; U.S. patent application Serial No. (YOR920090647US1 (24875)) for EAGER PROTOCOL ON A CACHE PIPELINE DATAFLOW; U.S. patent application Serial No. (YOR920090648US1 (24876)) for EMBEDDED GLOBAL BARRIER AND COLLECTIVE IN A TORUS NETWORK; U.S. patent application Serial No. (YOR920090649US1 (24877)) for GLOBAL SYNCHRONIZATION OF PARALLEL PROCESSORS USING CLOCK PULSE WIDTH MODULATION; U.S. patent application Serial No. (YOR920090650US1 (24878)) for IMPLEMENTATION OF MSYNC; U.S. patent application Serial No. (YOR920090651US1 (24879)) for NON-STANDARD FLAVORS OF MSYNC; U.S. patent application Serial No. (YOR920090652US1 (24881)) for HEAP/STACK GUARD PAGES USING A WAKEUP UNIT; U.S. patent application Serial No. (YOR920100002US1 (24882)) for MECHANISM OF SUPPORTING SUB-COMMUNICATOR COLLECTIVES WITH O(64) COUNTERS AS OPPOSED TO ONE COUNTER FOR EACH SUB-COMMUNICATOR; and U.S. 
patent application Serial No. (YOR920100001US1 (24883)) for REPRODUCIBILITY IN BGQ.
This invention was made with Government support under Contract No. B554331 awarded by the Department of Energy. The Government has certain rights in this invention.