This application is directed, in general, to circuit simulation and, more specifically, to a system and method for simulating integrated circuit (IC) performance on a many-core processor.
SPICE (Simulation Program with Integrated Circuit Emphasis) is a computer program written about 40 years ago, significantly enhanced over the intervening years (SPICE1, SPICE2 and SPICE3 so far) and now widely available in open-source form and in several proprietary commercial variants. SPICE is fundamentally designed to simulate the operation of an IC by evaluating a model of it. Consequently, the IC can be tested and verified without being fabricated.
To evaluate an IC model, SPICE constructs a matrix A and a right-hand-side vector b for use in various (e.g., Newton-Raphson) numerical analyses. After constructing A and b, SPICE then iteratively (1) evaluates the devices in the IC and (2) updates A and b accordingly.
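By way of illustration only, this conventional interleaved flow can be sketched as follows. The Device structure, the linearize() stand-in and the dense storage of A are hypothetical simplifications, not part of any particular SPICE implementation.

```cuda
// Hypothetical sketch of the conventional SPICE inner loop: device evaluation
// and the updates to A and b are interleaved in a single sequential pass.
#include <vector>

struct Device {
    int row, col;   // position this device stamps into A
    double g;       // linearized conductance
    double i;       // equivalent current contribution
};

// Stand-in for a real device model evaluation (BSIM, diode, ...).
static void linearize(Device& d, double v) {
    d.g = 2.0 * v;  // derivative of a made-up quadratic device
    d.i = v * v;
}

void newton_iteration(std::vector<std::vector<double>>& A, std::vector<double>& b,
                      std::vector<Device>& devices, const std::vector<double>& v) {
    for (Device& d : devices) {
        linearize(d, v[d.col]);    // (1) evaluate the device
        A[d.row][d.col] += d.g;    // (2) immediately update A ...
        b[d.row] -= d.i;           //     ... and b
    }
}
```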
SPICE is enormously popular among IC developers and is expected to continue being so for the foreseeable future. It is expected that SPICE will become more accurate and encompass a growing variety of devices and fabrication technologies as time progresses. SPICE also benefits from executing on processors that have become faster, and memories that have become larger, over time.
One aspect provides a SPICE model evaluation module executable on a many-core processor. In one embodiment, the module includes: (1) a setup module operable to generate topology matrices T1 and T2, (2) a device evaluation/update module associated with the setup module and operable to generate and update source elements SA for a matrix A and Sb for a right-hand-side vector b and (3) a generation module associated with the device evaluation/update module and operable to generate A using T1 and SA and further generate b using T2 and Sb.
Another aspect provides a method of simulating IC performance on a many-core processor. In one embodiment, the method includes: (1) generating respective topology matrices for a matrix A and a right-hand-side vector b, (2) generating source elements for the A and the b, (3) repeatedly updating the source elements and (4) generating A and b using the respective topology matrices and the source elements.
Yet another aspect provides a system executable on a many-core processor. In one embodiment, the system includes: (1) an input including a netlist, (2) a SPICE model evaluation module including: (2a) a setup module operable to generate topology matrices T1 and T2, (2b) a device evaluation/update module associated with the setup module and operable to generate and update source elements SA for a matrix A and Sb for a right-hand-side vector b and (2c) a generation module associated with the device evaluation/update module and operable to generate A using T1 and SA and further generate b using T2 and Sb and (3) an output.
As described above, SPICE constructs a matrix A and a right-hand-side vector b for use in various (e.g., Newton-Raphson) numerical analyses. It is realized herein that carrying out SPICE on a many-core processor, such as one constructed according to Intel's MIC Architecture or Nvidia's Kepler-grade graphics processing unit (GPU) architecture, has the potential to increase SPICE's speed, because functions that can be carried out efficiently in parallel, such as matrix-vector multiplication, may be employed to advantage.
However, two significant issues arise in attempting to carry out SPICE on a many-core processor: one is reproducibility; the other is performance. Conventional SPICE interleaves the iterative evaluating of devices and updating of A. This is quite acceptable for a single-core processor, because sequential code always reproduces the same A. Further, the performance of SPICE on a single-core processor depends on the size of the cache memory, which is typically small relative to the working set of the SPICE analysis.
However, it is realized herein that the interleaving of device evaluation and matrix updating performed by conventional SPICE is unacceptable on a many-core processor. More specifically, it is realized that, because a many-core processor can evaluate many devices and update many entries of A at the same time, the only way to guarantee that A is correct is to use atomic operations. Unfortunately, not only are atomic operations computationally expensive due to their fundamentally serial nature, they also fail to make A reproducible. This fact will be established below.
It is realized herein that the model evaluation carried out in SPICE can be divided into three parts. A first part of the division involves constructing two topology matrices, T1 and T2. A second part of the division involves evaluating devices and obtaining vectors SA and Sb, which are eventually used to generate A and b. A third part of the division involves generating the matrix A and the vector b by matrix-vector multiplication: A=T1*SA and b=T2*Sb. T1 and T2 are almost always sparse, so the matrix-vector multiplication typically takes the form of a sparse matrix-vector multiplication.
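A minimal host-side sketch of the three-part division follows. The dense storage of T1 and T2 and the function names are illustrative assumptions; in practice T1 and T2 would be held in a compressed sparse format, as discussed further below.

```cuda
#include <vector>

// Part 1: topology matrices, built once from the netlist (entries 0, 1 or -1).
std::vector<std::vector<double>> T1, T2;

// Part 2: device evaluation refreshes the source vectors at every iteration.
std::vector<double> SA, Sb;

// Part 3: A and b are generated purely by (sparse) matrix-vector products.
static std::vector<double> matvec(const std::vector<std::vector<double>>& T,
                                  const std::vector<double>& s) {
    std::vector<double> out(T.size(), 0.0);
    for (size_t i = 0; i < T.size(); ++i)
        for (size_t j = 0; j < s.size(); ++j)
            out[i] += T[i][j] * s[j];   // one owner per output entry: no atomics
    return out;
}

void generate_A_and_b(std::vector<double>& A_nonzeros, std::vector<double>& b) {
    A_nonzeros = matvec(T1, SA);        // non-zero elements of A = T1 * SA
    b          = matvec(T2, Sb);        // b = T2 * Sb
}
```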
Several advantages result from this division of SPICE's model evaluation. First, once the orientations of the currents in the circuit (per Kirchhoff's current law) and the order of the devices are determined, the topology matrices need never be changed during the remainder of the evaluation. Thus, in various embodiments, T1 and T2 are generated only once. In some embodiments, T1 and T2 are generated in a central processing unit (CPU) and provided to a GPU, which carries out at least some of the subsequent evaluation, as sketched below. As those skilled in the pertinent art are aware, many modern SPICE implementations allow for a variable topology, meaning that the IC topology and resulting topology matrix can change before simulation begins. However, in such implementations, the simulation starts only after the topology has been “frozen,” after which the IC topology and topology matrix do not change.
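For instance, when T1 and T2 are generated on the CPU, a one-time transfer to the GPU might look like the following sketch. The CSR arrays and the function name are assumptions; cudaMalloc and cudaMemcpy are the standard CUDA runtime calls.

```cuda
#include <cuda_runtime.h>

// One-time upload of a topology matrix (CSR form) from the CPU to the GPU.
// The matrix never changes afterwards, so this copy sits outside the time loop.
void upload_topology(const int* row_ptr, const int* col_idx, const signed char* val,
                     int num_rows, int nnz,
                     int** d_row_ptr, int** d_col_idx, signed char** d_val) {
    cudaMalloc((void**)d_row_ptr, (num_rows + 1) * sizeof(int));
    cudaMalloc((void**)d_col_idx, nnz * sizeof(int));
    cudaMalloc((void**)d_val,     nnz * sizeof(signed char));  // entries are only +1/-1
    cudaMemcpy(*d_row_ptr, row_ptr, (num_rows + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(*d_col_idx, col_idx, nnz * sizeof(int),            cudaMemcpyHostToDevice);
    cudaMemcpy(*d_val,     val,     nnz * sizeof(signed char),    cudaMemcpyHostToDevice);
}
```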
Second, the embodiments of the topology matrix illustrated herein are space-efficient, because they represent only current orientation. As a result, the value of each nonzero entry is either 1 or −1, and each value can be compressed to a single bit.
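One possible packing scheme, given as an assumption rather than a prescribed format, stores a 1 bit for +1 and a 0 bit for −1; the positions of the nonzeros are already carried by the sparse-matrix index arrays.

```cuda
#include <cstdint>
#include <vector>

// Pack the +1/-1 values of a topology matrix into one bit per nonzero.
// Bit k is 1 for +1 and 0 for -1.
std::vector<uint32_t> pack_signs(const std::vector<int8_t>& vals) {
    std::vector<uint32_t> bits((vals.size() + 31) / 32, 0u);
    for (size_t k = 0; k < vals.size(); ++k)
        if (vals[k] > 0)
            bits[k / 32] |= (1u << (k % 32));
    return bits;
}

// Recover the k-th value (+1.0 or -1.0) during the matrix-vector product.
inline double unpack_sign(const std::vector<uint32_t>& bits, size_t k) {
    return ((bits[k / 32] >> (k % 32)) & 1u) ? 1.0 : -1.0;
}
```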
Third, the matrix-vector multiplication carried out in the third part of the division can be performed without atomic operations on a many-core processor, and A, the resulting matrix, is guaranteed to be reproducible. It is realized that matrix-vector multiplication is memory-bound, and thus performance decreases as the size of A increases. However, some embodiments compress A using a conventional or later-developed matrix compression technique. Compression of A allows performance to approach, and perhaps reach, optimal levels.
Fourth, the performance of matrix-vector multiplication on a many-core processor is stable. Fifth, the evaluation of devices that takes place during the second part of the division can be relatively fast, because no atomic operations need be involved.
Topology matrices are known in linear algebra. They recast linear operations as matrix-vector operations, which are formal, readable, understandable and amenable to numerical computation. However, topology matrices have not conventionally been employed in the context of SPICE. Accordingly, topology matrices and their use in SPICE simulation will now be described more particularly.
A topology matrix is defined as follows: given a set of source elements S, a set of target elements T, and a set of operations O={f:S→T:f is linear}, a matrix-vector multiplication can represent O.
For example, if S={x1, x2, x3}, T={y1, y2, y3} and O={y1+=x1, y1+=2*x2, y3+=x3}, the matrix-vector multiplication is:

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} \mathrel{+}= \begin{bmatrix} 1 & 2 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}.$$
The above matrix-vector multiplication may be formed from O, because any scalar linear function can be represented by a dot-product. For example, f1 is “y1+=x1,” and x1 can be represented by x1*1+x2*0+x3*0, or its matrix notation:

$$x_1 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}.$$
Sometimes it is better to use a matrix notation; sometimes it is not. A second example will now be set forth in which a matrix is employed to reform a Laplacian operator. In this example, the reformation is trivial. However, in the context of SPICE, it is far from trivial.
In a second example, a standard three-point discretization of a one-dimensional Laplacian equation

$$-\frac{d^2 u}{d x^2} = f$$

is described by the following operator form (with the constant mesh factor omitted):

$$y_i \mathrel{+}= 2x_i - x_{i-1} - x_{i+1}, \qquad i = 1, \ldots, n.$$

Alternatively, a matrix-vector multiplication may represent the above operations:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \mathrel{+}= \begin{bmatrix} 2 & -1 & & \\ -1 & 2 & -1 & \\ & \ddots & \ddots & \ddots \\ & & -1 & 2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$
This is a model problem in scientific computation. In this case, an explicit matrix is not required to perform matrix-vector multiplication with good performance, because parallel computing can readily avoid atomic operations.
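A sketch of why atomics are avoidable here: with one thread per output element, each y[i] has exactly one writer. The kernel name and indexing convention below are illustrative.

```cuda
// Three-point Laplacian stencil, one thread per output element.
// Each y[i] is written by exactly one thread, so no atomic operations are
// needed and the result is bitwise reproducible.
__global__ void laplacian_1d(const double* x, double* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        y[i] = 2.0 * x[i] - x[i - 1] - x[i + 1];
}
```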
Now the focus will shift to circuit simulation. In this case, atomic operations cannot be easily bypassed, so the teachings herein are important.
SPICE simulation involves the solution of a non-linear system of equations F(V,I)=0, where V is a vector of node voltages, and I is a vector of extra current branches in a given circuit. F is a vector function, and each component of F corresponds to a rule from Modified Nodal Analysis (MNA). V and I are functions of time t. Given initial values V(t=0) and I(t=0), V(t) and I(t) can be computed for a given time sequence t=0, t1, t2, . . . .
Since circuit simulation requires a non-linear system to be solved at multiple time steps, a Taylor expansion or Newton-Raphson linearization process is typically adopted to solve a succession of linear systems to extract V(tk) and I(tk) from previous values V(tk−1) and I(tk−1). Thus, writing ΔV = V(tk) − V(tk−1) and ΔI = I(tk) − I(tk−1), the formula can be simplified to:

$$F\big(V(t_{k-1}), I(t_{k-1})\big) + DF\big(V(t_{k-1}), I(t_{k-1})\big)\begin{bmatrix}\Delta V \\ \Delta I\end{bmatrix} + \text{h.o.t.} = 0,$$
where h.o.t. stands for “higher-order terms,” which are neglected during the simulation. “D” is a differential operator with respect to V and I. A=DF(V(tk−1),I(tk−1)) is a square, sparse matrix. A is also assumed to be non-singular, since a singular A typically arises from a malformed netlist and represents a nonfunctional circuit.
Rearranging the above terms, SPICE solves the following linear system at each iteration:

$$A\begin{bmatrix}\Delta V \\ \Delta I\end{bmatrix} = -F\big(V(t_{k-1}), I(t_{k-1})\big),$$

which has the standard form Ax=b.
Because the well-known Kirchhoff's current law requires an orientation for each current, each component of F can be regarded as a linear combination of several device models, e.g.:

$$F_j(V, I) = \sum_k \alpha_k f_{kj}(V, I),$$
where αk ∈ {1,−1} denotes the orientation of the current in a current branch, and fkj(V,I) is a (linear or nonlinear) function that may include the extra current branches.
In one example of a circuit having a MOSFET, the linear system resulting from Newton-Raphson linearization is:
where scalar functions f10, f31, f20 and f32 describe linear or non-linear relationships between current branches and voltage nodes. VS and VDD are known values for the IC.
To solve the above linear system Ax=b, A needs to be updated with Df10, Df31, Df20 and Df32, and b needs to be updated with f10, f31, f20 and f32.
Before describing the difference between the topology matrix method disclosed herein and the conventional technique in this example, some notation should be introduced. First, A is represented by its non-zero pattern:

$$A = \begin{bmatrix} a_{11} & 0 & a_{13} & a_{14} & 0 \\ 0 & a_{22} & a_{23} & 0 & 0 \\ a_{31} & a_{32} & a_{33} & 0 & a_{35} \\ a_{41} & 0 & 0 & 0 & 0 \\ 0 & 0 & a_{53} & 0 & 0 \end{bmatrix}.$$
In one embodiment, non-zero elements are stored in row-major order. In another embodiment, non-zero elements are stored in column-major order. The vectors x and b are represented as follows:
The conventional technique for updating A is to write corresponding values to nonzero elements a11, a31, a41, a22, a32, a13, a23, a33, a53, a14 and a35 in some sequential order. If the original sequential program is trivially transformed into its parallel counterpart (“parallelized”), several issues arise:
First, atomic operations are necessary. As an example, a22+=−Df20−Df32 employs two threads: one thread performs a22=a22−Df20, and the other thread performs a22=a22−Df32. The two threads run in parallel, but only one thread can update a22 at any instant in time. Thus, an atomic operation is needed to avoid a race condition. An atomic operation carries out a read-modify-write sequence as a single, uninterruptible unit. Atomic operations are relatively slow.
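The situation can be sketched with a hypothetical two-thread kernel (atomicAdd on double values assumes a GPU with native support, i.e., compute capability 6.0 or later). Without the atomic, the two read-modify-write sequences could interleave and one contribution could be lost.

```cuda
// Two threads both accumulate a contribution into the same entry a22.
__global__ void stamp_a22(double* a22, double Df20, double Df32) {
    if (threadIdx.x == 0) atomicAdd(a22, -Df20);   // thread 0: a22 -= Df20
    if (threadIdx.x == 1) atomicAdd(a22, -Df32);   // thread 1: a22 -= Df32
}
```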
Second, the result cannot be reproduced. Again taking a22+=−Df20−Df32 as an example, two results are possible, depending upon which thread updates a22 first: a22−Df32−Df20 or a22−Df20−Df32. The two results are theoretically equivalent; however, if a22 is nonzero, rounding error may cause them to differ in practice. Those skilled in the pertinent art will recall that rounding error is a consequence of finite-precision computation. In other words, if a, b and c are floating-point numbers, (a+b)+c may not equal (a+c)+b.
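A two-line host-side illustration of the rounding-error point (the specific values are chosen only to make the effect visible in double precision):

```cuda
#include <cstdio>

int main() {
    double a = 1e16, b = 1.0, c = -1e16;
    std::printf("%.1f\n", (a + b) + c);  // prints 0.0: the 1.0 is absorbed by a
    std::printf("%.1f\n", (a + c) + b);  // prints 1.0
    return 0;
}
```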
Third, performance may not be stable. The performance depends on the pattern of memory accesses to Df10, Df31, Df20, Df32, a11, a31, a41, a22, a32, a13, a23, a33, a53, a14 and a35. That pattern depends on the order in which the parallel threads happen to execute, and the threads cannot be expected to run in the same order every time.
Fourth, performance is likely to be poor, because the required atomic updates are known to hamper performance.
Turning to the topology matrix method disclosed herein, a topology matrix T1 is generated for A, and a topology matrix T2 is generated for b. To generate the topology matrix T1, a set of source elements S={Df10,Df20,Df31,Df32,1}, a set of target elements T={a11,a31,a41,a22,a32,a13,a23,a33,a53,a14,a35} and a set of operations
are defined. The relation T=T1*S can then be represented by:
Similarly, the topology matrix T2 for b is:
Step 1: The setup module 121 is operable to generate topology matrices T1 and T2.
Step 2: The device evaluation/update module 122 is operable to generate source elements SA={Df10,Df20,Df31,Df32,1} for A and Sb={f10,f20,f31,f32,VS,VDD} for b.
Step 3: The generation module 123 is operable to perform sparse matrix-vector multiplication (e.g., csrmv) as follows:
csc(A)=T1*SA and b=T2*Sb,
where csc(A) denotes the non-zero elements of A in column-major order. In the example, csc(A)={a11,a31,a41,a22,a32,a13,a23,a33,a53,a14,a35}.
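A minimal hand-written CSR matrix-vector kernel of the kind the csrmv name denotes is sketched below (an illustrative kernel, not the vendor library routine; the CSR array names are assumptions). One thread owns each output row and accumulates in a fixed order, so no atomics are needed and the result is reproducible. Launched with the CSR form of T1 and s=SA it produces csc(A); with T2 and Sb it produces b.

```cuda
// y = T * s in CSR form, one thread per row of T.
// val[] may hold the +1/-1 topology entries (possibly unpacked from bits).
__global__ void csr_spmv(const int* row_ptr, const int* col_idx, const double* val,
                         const double* s, double* y, int num_rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double acc = 0.0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            acc += val[k] * s[col_idx[k]];   // fixed summation order per row
        y[row] = acc;
    }
}
```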
The SPICE model evaluation module 120 produces output 130, which may take the form of logs.
In contrast to the conventional technique described above, atomic operations are unnecessary in the matrix-vector multiplications that generate A and b. Results are reproducible. Performance is stable, assuming a deterministic algorithm is employed to perform the matrix-vector multiplication. While matrix-vector multiplication is memory-bound, a topology matrix is relatively memory-efficient because it represents only the orientation of the currents. Consequently, only one bit is required to represent each value, allowing performance to approach the memory-bandwidth limit (the “speed of light” of the processor).
Finally, the topology matrix is particularly advantageous in the context of SPICE, because it is a one-time cost. Topology matrices can be generated at the outset, before the simulation begins. During the simulation, only the source elements SA={Df10,Df20,Df31,Df32,1} for A and Sb={f10,f20,f31,f32,VS,VDD} for b are updated (recalculated) by the SPICE model evaluation process.
Those skilled in the art to which this application relates will appreciate that other and further additions, deletions, substitutions and modifications may be made to the described embodiments.