1. Technical Field
The present invention relates generally to circuits for parallel computation. Specifically, the present invention provides a mesh topology for computing time- and wire-length-optimal cyclic segmented parallel prefix operations.
2. Description of the Related Art
Parallel prefix circuits have evolved as a generalization of efficient algorithms for binary arithmetic. Ladner and Fischer introduced parallel prefix computations as a class of parallel algorithms. See Ladner, R. E. et al. “Parallel Prefix Computation,” J. of the ACM, 27(4):831-838, October 1980. See also Pippenger, N. “The Complexity of Computations by Networks,” IBM J. of Research and Development, 31(2):235-243, March 1987; Blelloch, G. E. “Scans as Primitive Parallel Operations,” IEEE Trans. on Computers, C-38(11):1526-1538, November 1989; Blelloch, G. E. “Prefix Sums and their Applications,” Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa. 15213, November 1990; Leighton, F. T. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, 1992; and Cormen, Leiserson, and Rivest. Introduction to Algorithms, MIT Press 1990. Parallel prefix circuits were implemented in Thinking Machines' CM-5 supercomputer. See Leiserson, C. E. et al. “The Network Architecture of the Connection Machine CM-5,” J. of Parallel and Distributed Computing, 33(2):145-158, March 1996; U.S. Pat. No. 5,333,268 (DOUGLAS et al.) 1994-07. The Ultrascalar processor is based on the observation that cyclic segmented parallel prefix circuits can implement all the tasks of a typical superscalar processor, including register renaming, wake-up, scheduling, committing, etc., in an orderly, principled fashion. See Henry, D. S. et al. “Cyclic Segmented Prefix Circuits,” Ultrascalar Memo 1, Yale University, November 1998; Henry, D. S. et al. “The Ultrascalar Processor—An Asymptotically Scalable Superscalar Microarchitecture,” in 20th Anniversary Conference on Advanced Research in VLSI, pp. 256-278, Atlanta, Ga., March 1999; Henry, D. S. et al. “Circuits for Wide-Window Superscalar Processors,” in 27th Int'l Symposium on Computer Architecture, pp. 236-247, Vancouver, BC, June 2000; U.S. Pat. No. 6,609,189 (KUSZMAUL et al.) 2003-08. Parallel prefix circuits have also been applied to load/store disambiguation, although under the name scan circuit. See U.S. Pat. No. 6,038,657 (FAVOR et al.) 2000-03.
Much of the appeal of parallel prefix computations stems from the fact that they can be implemented as a tree structure in VLSI with logarithmic complexity. Traditionally, complexity theory accounts for the number of nodes in the tree-structured circuit rather than the length of the wires. With increasing clock speeds, however, the lengths of the wires begin to dominate the critical path length.
What is needed, therefore, is a circuit topology for computing a cyclic segmented parallel prefix operation that is time-optimal as well as being optimal in terms of wire lengths and propagation delays. The present invention provides a solution to these and other problems, and offers other advantages over previous solutions.
Accordingly, the present invention provides new parallel prefix circuits for computing a cyclic segmented prefix operation with a mesh topology. In one embodiment of the present invention, the elements (prefix nodes) of the mesh are arranged in row-major order. Values are accumulated toward the center of the mesh and partial results are propagated outward from the center of the mesh to complete the cyclic segmented prefix operation. This embodiment has been shown to be time-optimal. In another embodiment of the present invention, the prefix nodes are arranged such that the prefix node corresponding to the last element in the array is located at the center of the array. This alternative embodiment is also time-optimal, but it is optimal in terms of wire-lengths (and therefore propagation delays) as well.
The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.
The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings, wherein:
FIG. 14 is a diagram of a cyclic segmented prefix sum over 16 elements.
The following is intended to provide a detailed description of an example of the invention and should not be taken to be limiting of the invention itself. Rather, any number of variations may fall within the scope of the invention, which is defined in the claims following the description.
1. Prefix Computations
In this section, we discuss a circuit family called prefix computations. A prefix computation is defined on an input sequence $a=[a_0,a_1,\ldots,a_{n-1}]$, an output sequence $b=[b_0,b_1,\ldots,b_{n-1}]$, and a binary, associative operator $\otimes$, such that
$b_0 = a_0,$
$b_k = a_0 \otimes a_1 \otimes \cdots \otimes a_k$ for $k = 1, 2, \ldots, n-1$. (1)
As an example of a prefix computation, consider addition as the operator. Addition is a binary operator, and it is associative, that is, (a+b)+c=a+(b+c). Recall that associativity implies that the parentheses can be dropped, because the order in which the additions are performed is immaterial, at least from a mathematical perspective. Given the input sequence a=[2,7,1,1,3,5,2,4], we compute the elements of output sequence b as follows:
b0=a0=2
b1=a0+a1=2+7=9
b2=a0+a1+a2=2+7+1=10
b3=a0+a1+a2+a3=2+7+1+1=11
b4=a0+a1+a2+a3+a4=2+7+1+1+3=14
b5=a0+a1+a2+a3+a4+a5=2+7+1+1+3+5=19
b6=a0+a1+a2+a3+a4+a5+a6=2+7+1+1+3+5+2=21
b7=a0+a1+a2+a3+a4+a5+a6+a7=2+7+1+1+3+5+2+4=25
Thus, the prefix computation results in output sequence b=[2,9,10,11,14,19,21,25]. It is easy to see that we can formulate the prefix computation with addition as the operator by means of the sum
$b_k = \sum_{i=0}^{k} a_i$ for $k = 0, 1, \ldots, n-1$.
Furthermore, we can formulate the prefix computation by means of the recurrence
$b_0 = a_0,$
$b_k = b_{k-1} + a_k$ for $k = 1, 2, \ldots, n-1$, (2)
due to associativity.
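The recurrence in Equation 2 can be illustrated with a short sequential sketch. The function name `prefix` and the use of Python are illustrative assumptions and not part of the circuits described herein; the sketch merely reproduces the worked example above for an arbitrary associative operator.

```python
from operator import add

def prefix(a, op):
    """Prefix computation via the recurrence b[0] = a[0], b[k] = op(b[k-1], a[k])."""
    b = []
    for k, x in enumerate(a):
        # The first output copies the first input; every later output
        # combines the previous output with the current input.
        b.append(x if k == 0 else op(b[k - 1], x))
    return b

a = [2, 7, 1, 1, 3, 5, 2, 4]
print(prefix(a, add))  # [2, 9, 10, 11, 14, 19, 21, 25]
```

Because only associativity is assumed, the same function also works for operators that do not commute, such as string concatenation.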
A variation of the prefix computation is the segmented prefix computation. Analogously to the prefix computation, it is defined on an input sequence $a=[a_0,a_1,\ldots,a_{n-1}]$, an output sequence $b=[b_0,b_1,\ldots,b_{n-1}]$, and a binary, associative operator $\otimes$. However, a segmented prefix computation has an additional input sequence, the segment sequence $s=[s_0,s_1,\ldots,s_{n-1}]$, whose elements, the segment bits, are in the domain {0,1}, and $s_0=1$. The segment bits partition the input sequence into segments that begin where a segment bit is 1, and continue as long as the segment bits are 0.
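We can describe the segmented prefix sum by means of the recurrence:
$b_k = a_k$ if $s_k = 1$,
$b_k = b_{k-1} + a_k$ if $s_k = 0$, for $k = 0, 1, \ldots, n-1$. (3)
The segmented prefix sum of the example input sequence a=[2,7,1,1,3,5,2,4] with segment sequence s=[1,0,1,0,0,1,1,0] is computed as follows: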
s0=1: b0=a0=2
s1=0: b1=b0+a1=2+7=9
s2=1: b2=a2=1
s3=0: b3=b2+a3=1+1=2
s4=0: b4=b3+a4=2+3=5
s5=1: b5=a5=5
s6=1: b6=a6=2
s7=0: b7=b6+a7=2+4=6.
The resulting output sequence is b=[2,9,1,2,5,5,2,6]. Note that the individual segments behave like independent prefix sums as described by Equation 2.
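The segmented prefix computation can be sketched sequentially as follows. This is a minimal illustration rather than the circuit itself; the function name `segmented_prefix` is hypothetical, and the sketch assumes, as in the text, that the first segment bit equals 1.

```python
from operator import add

def segmented_prefix(a, s, op):
    """Segmented prefix: restart at every index whose segment bit is 1."""
    if a and s[0] != 1:
        raise ValueError("a non-cyclic segmented prefix requires s[0] == 1")
    b = []
    for k in range(len(a)):
        if s[k] == 1:
            b.append(a[k])                 # a new segment begins here
        else:
            b.append(op(b[k - 1], a[k]))   # continue the current segment
    return b

a = [2, 7, 1, 1, 3, 5, 2, 4]
s = [1, 0, 1, 0, 0, 1, 1, 0]
print(segmented_prefix(a, s, add))  # [2, 9, 1, 2, 5, 5, 2, 6]
```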
The segmented prefix computation is straightforward to implement as a circuit.
Finally, we extend the segmented prefix computation into a cyclic segmented prefix computation by wrapping around output value bn−1 and feeding it into position 0. Furthermore, we relax the constraint s0=1 and allow s0∈{0,1}.
The following recurrence formalizes the cyclic segmented prefix computation for an associative binary operator $\otimes$ using modular index arithmetic to express the wrap-around:
$b_k = a_k$ if $s_k = 1$,
$b_k = b_{(k-1) \bmod n} \otimes a_k$ if $s_k = 0$, for $k = 0, 1, \ldots, n-1$. (4)
This recurrence has a solution if there exists at least one index $k'$ for which segment bit $s_{k'}=1$.
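As a concrete example, consider the segmented prefix sum of the preceding example with segment bit s0 changed to 0, that is, with segment sequence s=[0,0,1,0,0,1,1,0]: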
s0=0: b0=b7+a0=6+2=8
s1=0: b1=b0+a1=8+7=15
s2=1: b2=a2=1
s3=0: b3=b2+a3=1+1=2
s4=0: b4=b3+a4=2+3=5
s5=1: b5=a5=5
s6=1: b6=a6=2
s7=0: b7=b6+a7=2+4=6.
The resulting output sequence is b=[8,15,1,2,5,5,2,6]. Note that we cannot compute b0 unless we know the value of b7. The recurrence forces us to unroll the recursion until we visit an index with a segment bit of value 1, in this case wrapping around to s6.
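The unrolling of the recursion can be sketched sequentially by starting the evaluation at an index whose segment bit is 1 and proceeding around the cycle. The function name `cyclic_segmented_prefix` and the choice of the first such index as the starting point are illustrative assumptions; any index with segment bit 1 would serve equally well.

```python
from operator import add

def cyclic_segmented_prefix(a, s, op):
    """Cyclic segmented prefix with wrap-around from position n-1 to position 0."""
    n = len(a)
    if 1 not in s:
        raise ValueError("at least one segment bit must be 1")
    start = s.index(1)          # a position that does not depend on its predecessor
    b = [None] * n
    for offset in range(n):
        k = (start + offset) % n
        if s[k] == 1:
            b[k] = a[k]
        else:
            # (k - 1) % n expresses the wrap-around by modular index arithmetic.
            b[k] = op(b[(k - 1) % n], a[k])
    return b

a = [2, 7, 1, 1, 3, 5, 2, 4]
s = [0, 0, 1, 0, 0, 1, 1, 0]
print(cyclic_segmented_prefix(a, s, add))  # [8, 15, 1, 2, 5, 5, 2, 6]
```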
Additional information regarding cyclic segmented prefix circuits as they are known in the art can be found in U.S. Pat. No. 6,609,189 (KUSZMAUL et al.) Aug. 19, 2003, which is incorporated herein by reference.
2. Prefix Computations for a 2-Dimensional Mesh
We present prefix computations on a 2-dimensional mesh for each of the variations discussed in Section 1. We analyze the performance of these circuits, focusing our attention on signal propagation delays through wires. We lump the operator delays into the Landau notation. Note that our prefix circuits are purely combinational. Therefore, the bounds on execution time resulting from our analysis are also bounds for the critical path length of these circuits.
2.1. Prefix Sum on 2-Dimensional Mesh
We begin our study of prefix computations on a 2-dimensional mesh with a concrete example, a prefix sum. We assume that the mesh has n prefix nodes and, therefore, a side length of $\sqrt{n}$.
It should be noted that, in this context, we use the term “prefix node” to denote the (segmented) prefix operators associated with position i of the prefix circuit. For instance, the circuit shown in
An obvious lower bound for the execution time of the prefix sum is $\Omega(\sqrt{n})$, because input value $a_0$ must travel from the top-left prefix node to the bottom-right prefix node. If we assume one time unit for the propagation delay of a signal between neighboring prefix nodes, the execution time T must be $T \geq 2(\sqrt{n}-1)$. The execution time of our prefix computation meets the asymptotic lower bound, as we discuss in the following.
The prefix computation proceeds in three phases: row prefix, column prefix, and row update.
Straightforward analysis of our prefix sum reveals that the execution time is $T = 3\,\Theta(\sqrt{n})$, because each of the three phases requires time $\Theta(\sqrt{n})$ to propagate the values through an entire row or column. Thus, our prefix sum meets the asymptotic lower bound.
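The three-phase organization can be simulated in software as a rough sketch. The details of each phase are assumptions made for illustration: the elements are taken to be stored in row-major order, the column prefix is taken to run down the rightmost column, and the row update is taken to combine the accumulated value of all preceding rows into the remaining nodes of each row. The function name `mesh_prefix` is hypothetical.

```python
from operator import add

def mesh_prefix(a, side, op):
    """Simulate a prefix computation on a side x side mesh in row-major order."""
    assert len(a) == side * side
    grid = [list(a[r * side:(r + 1) * side]) for r in range(side)]
    last = side - 1

    # Phase 1 (row prefix): each row computes its own prefix, left to right.
    for r in range(side):
        for c in range(1, side):
            grid[r][c] = op(grid[r][c - 1], grid[r][c])

    # Phase 2 (column prefix): the rightmost column accumulates the row
    # totals from top to bottom.
    for r in range(1, side):
        grid[r][last] = op(grid[r - 1][last], grid[r][last])

    # Phase 3 (row update): the accumulated value of all preceding rows is
    # combined, as left operand, into the remaining nodes of each row.
    for r in range(1, side):
        carry = grid[r - 1][last]
        for c in range(last):
            grid[r][c] = op(carry, grid[r][c])

    return [x for row in grid for x in row]

print(mesh_prefix(list(range(1, 17)), 4, add))
```

The carried value is always applied as the left operand, so the sketch preserves the order of the operands and does not rely on commutativity. Each phase sweeps one row or column of length side, which corresponds to the three sweeps counted in the analysis above.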
A few comments are in order. First, the column prefix and row update phases can be overlapped to reduce the critical path length by a constant amount of time. For example, the prefix nodes in the rightmost column of
Second, the prefix sum technique illustrated in
We show one possible implementation of the prefix sum as a combinational circuit with 2-dimensional mesh topology in
2.2. Segmented Prefix Sum on 2-Dimensional Mesh
The prefix computation in Section 2.1 may be deceptively simple. At first glance, we may suspect that we violate the assumption that the operator of a prefix computation is not necessarily commutative. Using addition as the operator has hidden this issue from the discussion, because addition is commutative. In a segmented prefix circuit, we must pay attention to the fact that the operator intentionally does not commute, since segmentation does not commute.
Let us reexamine the segmented operation by introducing the following notation, borrowing from Cormen, Leiserson, and Rivest. Introduction to Algorithms, MIT Press 1990, p. 726, Exercise 30-1. For an associative binary operator $\otimes$, we introduce the segmented operator $\tilde{\otimes}$ for pairs $(s_x,x)$, where $x$ is an input value and $s_x$ is the associated segment bit, such that:
$(s_x,x) \tilde{\otimes} (s_y,y) = (s_x, x \otimes y)$ if $s_y = 0$,
$(s_x,x) \tilde{\otimes} (s_y,y) = (1, y)$ if $s_y = 1$. (5)
Furthermore, output value $b_y$ is the second element of the result pair, that is
$b_y = x \otimes y$ if $s_y = 0$, and $b_y = y$ if $s_y = 1$.
Operator $\tilde{\otimes}$ is associative, as can be proved by perfect induction.
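The associativity claim can be checked exhaustively for a small domain. The sketch below assumes that the segmented operator returns the logical-or of the two segment bits as the result bit and, as the result value, either the right value alone (when the right segment bit is 1) or the combination of both values (when it is 0). String concatenation stands in for an associative operator that does not commute; the helper name `segmented` is hypothetical.

```python
from itertools import product

def segmented(op):
    """Lift an associative operator op to pairs (segment bit, value)."""
    def seg_op(p, q):
        (sx, x), (sy, y) = p, q
        # A set right-hand segment bit discards everything to its left.
        return (sx | sy, y if sy == 1 else op(x, y))
    return seg_op

seg_concat = segmented(lambda x, y: x + y)

# Exhaustive check over all segment-bit combinations and a few sample values.
pairs = [(s, v) for s in (0, 1) for v in ("a", "b")]
for p, q, r in product(pairs, repeat=3):
    assert seg_concat(seg_concat(p, q), r) == seg_concat(p, seg_concat(q, r))

# The segmented operator does not commute, even for these small inputs.
print(seg_concat((0, "a"), (1, "b")))  # (1, 'b')
print(seg_concat((1, "b"), (0, "a")))  # (1, 'ba')
```

The check covers every combination of the three segment bits, which is the part of the argument that proceeds by cases; the sample values only exercise the underlying operator.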
The key consequence from the associativity of operator $\tilde{\otimes}$ is that we can rearrange the order of evaluation of the segmented prefix computations arbitrarily, just as for operator $\otimes$. Let us gain confidence in this result by reorganizing the segmented prefix sum in
The row prefix, here in segmented form, leaves the elements in the leftmost column untouched. It yields the following partial sums:
The complete set of prefix sums is shown below.
Let us examine the definition in Equation 5 in more detail. First of all, we observe that we can express the segment bit of expression $(s_x,x) \tilde{\otimes} (s_y,y)$ as $s_x \vee s_y$, where $\vee$ denotes the logical-or operation. The proof of this fact is straightforward by perfect induction. Furthermore, we may use explicit segment bits to represent intermediate results of a segmented computation. For example, consider the segmented computation:
$(s_v,v) = (s_x,x) \tilde{\otimes} (s_y,y) \tilde{\otimes} (s_z,z) \tilde{\otimes} (s_u,u)$ (6)
Note that Equation 6 is not a prefix computation, but merely a segmented computation. We may express the linear, left-to-right evaluation of the segmented computation, cf.
$(s_v,v) = ((((s_x,x) \tilde{\otimes} (s_y,y)) \tilde{\otimes} (s_z,z)) \tilde{\otimes} (s_u,u))$.
We can separate the computation of segment bit $s_v$ from the computation of the output bits, and obtain:
$s_v = (((s_x \vee s_y) \vee s_z) \vee s_u)$.
Since the logical-or operation is not only associative but also commutative, we can drop the parentheses in this expression, and reorder the evaluation arbitrarily.
The preceding insight is useful for reorganizing the segmented computation by exploiting associativity. For example, we may choose to evaluate Equation 6 as follows:
$(s_v,v) = ((s_x,x) \tilde{\otimes} (s_y,y)) \tilde{\otimes} ((s_z,z) \tilde{\otimes} (s_u,u))$.
Here, the segment bit associated with the left expression is $s_x \vee s_y$, the segment bit of the right expression is $s_z \vee s_u$, and the segment bit associated with the result is $s_x \vee s_y \vee s_z \vee s_u$.
As a second example, we may choose to evaluate the expression in Equation 6 from right to left:
$(s_v,v) = ((s_x,x) \tilde{\otimes} ((s_y,y) \tilde{\otimes} ((s_z,z) \tilde{\otimes} (s_u,u))))$.
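The three evaluation orders of Equation 6 can be compared numerically. The sketch below assumes segmented addition on pairs (segment bit, value), with the logical-or of the segment bits as the result bit; the operand values are arbitrary and chosen only for illustration.

```python
def seg_add(p, q):
    """Segmented addition on pairs (segment bit, value)."""
    (sx, x), (sy, y) = p, q
    return (sx | sy, y if sy == 1 else x + y)

# Four arbitrary operand pairs standing in for (sx,x), (sy,y), (sz,z), (su,u).
X, Y, Z, U = (1, 2), (0, 7), (1, 1), (0, 1)

left_to_right = seg_add(seg_add(seg_add(X, Y), Z), U)
balanced = seg_add(seg_add(X, Y), seg_add(Z, U))
right_to_left = seg_add(X, seg_add(Y, seg_add(Z, U)))

print(left_to_right, balanced, right_to_left)  # (1, 2) (1, 2) (1, 2)
```

All three parenthesizations agree, while the order of the operands themselves is never changed.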
By now, we have at our disposal all the knowledge necessary to comprehend the segmented prefix circuit with a mesh topology shown in
Accounting for wire lengths as dominating effect, the segmented prefix circuit in
2.3. Cyclic Segmented Prefix Sum on 2-Dimensional Mesh
We begin our introduction of cyclic segmented prefix circuits with a concrete example. Consider the segmented prefix sum of
The simplest strategy to implement a cyclic segmented prefix computation is to perform two phases of non-cyclic segmented prefix computations. In the first phase, we compute the non-cyclic segmented prefix, assuming that s0=1. In the second phase, we include the cyclic wrap-around, assert the desired value of s0, and redo the segmented prefix computation. Formally, we express this strategy for a cyclic segmented prefix computation as:
Due to segmentation, correctness is guaranteed as long as at least one segment bit assumes value 1. From an operational perspective, we must arrange the order of evaluation such that the result (sn−1,bn−1)|s
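The two-phase strategy can be sketched sequentially as follows. The function names are hypothetical, and the segmented operator is assumed to return the logical-or of the segment bits together with either the right value (right segment bit 1) or the combined value (right segment bit 0). Phase 1 forces the first segment bit to 1 and keeps only the final pair; phase 2 feeds that pair into position 0 and repeats the scan with the original segment bits.

```python
from operator import add

def make_segmented(op):
    """Lift op to pairs (segment bit, value)."""
    def seg_op(p, q):
        (sx, x), (sy, y) = p, q
        return (sx | sy, y if sy == 1 else op(x, y))
    return seg_op

def seg_scan(pairs, seg_op, seed=None):
    """Left-to-right segmented scan; an optional seed pair enters before position 0."""
    out, acc = [], seed
    for p in pairs:
        acc = p if acc is None else seg_op(acc, p)
        out.append(acc)
    return out

def cyclic_two_phase(a, s, op):
    if 1 not in s:
        raise ValueError("at least one segment bit must be 1")
    seg_op = make_segmented(op)
    pairs = list(zip(s, a))
    # Phase 1: non-cyclic scan with the first segment bit forced to 1.
    wrap = seg_scan([(1, a[0])] + pairs[1:], seg_op)[-1]
    # Phase 2: redo the scan with the original bits, seeded by the wrap-around pair.
    return [value for _, value in seg_scan(pairs, seg_op, seed=wrap)]

a = [2, 7, 1, 1, 3, 5, 2, 4]
s = [0, 0, 1, 0, 0, 1, 1, 0]
print(cyclic_two_phase(a, s, add))  # [8, 15, 1, 2, 5, 5, 2, 6]
```

The sketch rejects an all-zero segment sequence, because the wrap-around pair from phase 1 is only meaningful when some segment bit cuts the cycle.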
We account for the cyclic wrap-around by extending our segmented prefix computation with a fourth phase:
Reverse Column Prefix:
We illustrate the cyclic segmented prefix using
Phase 2, the segmented column prefix, does not change the values in the rightmost column. Next, we include the reverse column prefix as phase 3, using pair $\tilde{\otimes}_0^{15}=(1,6)$ as prefix. As a result, only the top-right pair changes, because $(1,6) \tilde{+} (0,11) = (1,17)$. The segmented prefix sums after the reverse column prefix are as follows.
Finally, we perform phase 4, the modified segmented row update, which covers all rows, including the top row, which uses the bottom-right element $\tilde{\otimes}_0^{15}=(1,6)$ as prefix. The complete set of prefix sums is shown below.
As educational examples, we discuss the two extreme cases of cyclic segmented prefix sums illustrated in
Now, let us determine the critical path lengths of the two prefix sums in
3. Optimizations
We introduce two optimizations to the above-described prefix circuits: (1) minimizing the constant factor of the critical path length and number of operators by rearranging the communication pattern in the mesh, and (2) pipelining of combinational circuits to increase throughput.
3.1. Cutting Constant Factors
Our goal is to optimize the prefix circuits presented in Section 2 so as to obtain the minimum possible execution time with the minimal number of operators. We begin with the simple prefix circuit, and evolve the circuit progressively into a near-optimal implementation of the cyclic segmented prefix circuit on a mesh.
Recall the prefix circuit in
The three phases shown in
Row Prefix:
Analysis of
respectively. Thus, for each of these phases, we gain a factor of two over the original version discussed in Section 2.1. The column prefix takes time $\Theta(\sqrt{n})$, as before. The total time of the optimized prefix computation is therefore $T = 2\,\Theta(\sqrt{n})$. Not only is this optimized version 50% faster than our original version with $T = 3\,\Theta(\sqrt{n})$, it is also the absolute minimum time, because it coincides with the critical path length for communicating from the top-left prefix node 0 of the mesh to the bottom-right prefix node n−1.
We can apply similar rearrangements of the data movements to cyclic segmented prefix circuits. Recall that our circuit design presented in Section 2.3 has a critical path length of $4\,\Theta(\sqrt{n})$, and requires $2n+\sqrt{n}-2$ operators. Before presenting two alternative designs, we emphasize the following lower bounds on time and the number of segmented operators:
Lower Time Bound:
In the following, we present two cyclic segmented prefix circuits with 2-dimensional mesh topology. The first circuit is time optimal, yet uses more operators than necessary.
Our time-optimal circuit for a cyclic segmented prefix computation has a critical path length of $2\,\Theta(\sqrt{n})$, matching the lower time bound.
(assuming that columns are numbered starting with zero; in terms of conventional ordinals (i.e., first, second, third, etc.), we could say that column k is the “(k+1)th” column). Phases 1 and 2a require
time steps each, counting the number of communications between neighboring prefix nodes that also represent wire delays. Next, we distribute value $\tilde{\otimes}_0^{n-1}$ to all prefix nodes within time $\Theta(\sqrt{n})$ to meet the lower time bound.
Concurrently with phase 2a, we complete the row prefix by extending the partial row prefix from the left half of the mesh, computed in phase 1, to the right half of the mesh. The completion of the row prefix (phase 2b) is shown in
time steps, just like phase 2a. During phases 3 and 4 (
time steps, respectively, yielding a total of $2\,\Theta(\sqrt{n})$ time steps for the entire cyclic segmented prefix computation.
We can reduce the number of operators in the circuit of
where 0≦x<n, and either
We may also introduce a ki and kj, which are defined like k, but which allow us to choose to adopt the floor or ceiling (i.e.,
for the rows and columns independently.
This renumbering technique allows us to approach the lower bound on the number of operators while retaining the lower bound on the critical path length. In particular, due to the new prefix node arrangement we can complete the row prefix in phase 2b, because this arrangement produces the row prefixes of the right half without the (0.5n ) operators that are dedicated to this purpose in the circuit of
3.2. Pipelining
As the problem size n of our prefix circuits grows, the critical path length may become intolerably large. We may wish to apply pipelining to increase throughput in circuits with large propagation delays.
Consider a concrete example in which the delays of long wires dominate operator delays, where we want to insert pipeline registers between each of the neighboring prefix nodes of the mesh.
The pipelined prefix circuit enables us to interleave multiple prefix computations. When the number of prefix computations is larger than or equal to the pipeline depth, the prefix circuit is fully utilized, and produces one prefix computation per clock cycle.
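The utilization claim can be illustrated with a simple occupancy model. The model is an assumption made for illustration only: one prefix computation enters the pipeline per clock cycle, and each computation occupies exactly one stage per cycle for `depth` consecutive cycles. The function names are hypothetical.

```python
def busy_stages(depth, jobs, cycle):
    """Number of pipeline stages occupied in a given cycle.

    Job j enters at cycle j and occupies stage (cycle - j) while 0 <= cycle - j < depth.
    """
    return sum(1 for j in range(jobs) if 0 <= cycle - j < depth)

def peak_utilization(depth, jobs):
    """Largest fraction of stages that are busy at the same time."""
    if jobs == 0:
        return 0.0
    total_cycles = depth + jobs - 1   # cycles until the last job leaves the pipeline
    return max(busy_stages(depth, jobs, t) for t in range(total_cycles)) / depth

print(peak_utilization(6, 3))  # 0.5, fewer jobs than stages
print(peak_utilization(6, 6))  # 1.0, as many jobs as stages
print(peak_utilization(6, 9))  # 1.0
```

Under this model, every stage is busy in some cycle exactly when the number of interleaved computations is at least the pipeline depth, and one result leaves the pipeline per cycle from then on.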
Pipelining the segmented and the cyclic segmented prefix circuits proceeds analogously to that in
As an alternative to the hardware-based implementation described above, the present invention may be implemented in the form of software for execution on a parallel computer. That is, an alternative embodiment of the present invention may be implemented in the form of a set of instructions or microcode or other functional descriptive material that may, for example, be resident in the memory (volatile or non-volatile) of a computer. Until required by the computer, the set of instructions may be stored in another computer memory, for example, in a hard disk drive, or in a removable memory such as an optical disk (for eventual use in a CD ROM) or floppy disk (for eventual use in a floppy disk drive), or downloaded via the Internet or other computer network. Thus, the present invention may be implemented as a computer program product for use in a computer. In addition, although the various methods described are conveniently implemented in a general purpose computer selectively activated or reconfigured by software, one of ordinary skill in the art would also recognize that such methods may be carried out in hardware, in firmware, or in more specialized apparatus constructed to perform the required method steps. Functional descriptive material is information that imparts functionality to a machine. Functional descriptive material includes, but is not limited to, computer programs, instructions, rules, facts, definitions of computable functions, objects, and data structures.
While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that, based upon the teachings herein, changes and modifications may be made without departing from this invention and its broader aspects. Therefore, the appended claims are to encompass within their scope all such changes and modifications as are within the true spirit and scope of this invention.
Furthermore, it is to be understood that the invention is solely defined by the appended claims. It will be understood by those with skill in the art that if a specific number of an introduced claim element is intended, such intent will be explicitly recited in the claim, and in the absence of such recitation no such limitation is present. For non-limiting example, as an aid to understanding, the following appended claims contain usage of the introductory phrases “at least one” and “one or more” to introduce claim elements. However, the use of such phrases should not be construed to imply that the introduction of a claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an;” the same holds true for the use in the claims of definite articles.
This invention was made with Government support under DARPA, NBCH3039004. THE GOVERNMENT HAS CERTAIN RIGHTS IN THIS INVENTION.