The present invention is related to semiconductor transistor level simulation techniques, particularly improvements to reduce the simulation computation time by parallel processing and utilizing numerical techniques with improved convergence.
With the ever shrinking feature sizes and growing demand for high performance and low power from electronic circuits, accurate simulation of large systems of circuits is necessary. SPICE has long been considered the gold standard for circuit simulation accuracy, but the biggest drawback of traditional SPICE tools is their limited capacity and prohibitively long simulation time for most practical circuits. The SPICE transient simulation algorithm involves repeatedly solving a linear form of a modified nodal equation matrix for the circuit in such a way that the circuit node voltages converge to a steady state value at each time step in the simulation. The performance limitation of SPICE is directly related to its method for solving these nodal equation matrices. This has led to improvements in circuit simulation beyond the traditional SPICE modeling.
Feldman et al. describe the use of symmetric positive definite (SPD) matrix manipulations to generate transfer functions for systems of passive L, R and C elements in U.S. Pat. No. 6,041,170, granted Mar. 21, 2000, and further describe LU factorization applied to SPD matrices as a way to perform non-linear analysis of circuit systems in U.S. Pat. No. 6,182,270, granted Jan. 30, 2001. Still further improvements may be made by decomposing the SPD matrices and performing the LU factorization in parallel across multiple processors, as described by Nakanishi in U.S. Pat. No. 6,907,513, granted Jun. 14, 2005. While Nakanishi does not describe applying these techniques to circuit simulation, Hachiya does, in combination with the Newton iteration method, in U.S. Pat. No. 6,144,932, granted Nov. 7, 2000. In addition to parallel processing of LU factorization, Hachiya further describes clustering the devices into sub-circuits, balanced to minimize the parallel processing time of all sub-circuits.
While the above techniques improve the processing time of circuit simulation, accuracy is also important. For example, the clocks within most ICs are the most timing-critical portion of the design and therefore require special processing, as pointed out by Burks et al. in U.S. Pat. No. 6,014,510, granted Jan. 1, 2000, and Srinivansan et al. in U.S. Pat. No. 6,851,095, granted Feb. 1, 2005. Unlike Kanamoto et al. in U.S. Pat. No. 6,442,740, however, they limit their discussion to non-circuit simulation of clock structures. Kanamoto et al. also describe the need to map the passive elements of the power and ground structure in order to reduce the computational complexity of the clock structures during circuit simulation.
This disclosure builds on the cited prior art to further improve the execution time of circuit simulation of large systems of transistors and passive components, while maintaining waveform accuracy through a series of techniques. For example, in addition to extracting the clock structure for more exact timing analysis, its typical tree-like structure lends itself to partitioning for parallel processing. Similarly, most IC designs are made up of numerous instances of cells and macros, many of which are identically structured and may be hierarchically preprocessed to reduce the simulation time. Also, because LU decomposition and iterative methods are guaranteed to converge for SPD matrices, this disclosure presents a technique for partitioning the system into sub-systems with SPD matrices and well-behaved non-SPD matrices, as opposed to the min-cut or structural clustering described in the prior art.
Furthermore, recognizing that matrix solvers such as LU decomposition, Cholesky's method, Algebraic Multi-Grid (AMG), and the Generalized Minimal Residual method (GmRes) each have their own strengths and weaknesses, this disclosure presents techniques for selecting between parallel and serial versions of multiple solvers within a two-stage Newton-Raphson iteration method to maximize simulation performance by minimizing non-convergence conditions, while bounding the numerical errors.
Embodiments of the invention will now be described in conjunction with the drawings, in which:
FIGS. 3a, 3b and 3c are diagrams of a circuit being partitioned into sub-circuits,
FIGS. 5a and 5b are diagrams of matrix partitioning for parallel processing, and
Reference is now made to
In one embodiment of the present invention, the partitioning for parallel execution may be tuned to fit the limitations of both the number of slave processors and the resources that reside with each processor.
Reference is now made to
In another embodiment of the present invention, clock structures may be partitioned along branches of their tree structure duplicating the root and sufficient portions of the rest of the tree such that each branch may be separately simulated in parallel with all the other branches.
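By way of non-limiting illustration, the branch-wise partitioning described above may be sketched as follows. The dictionary representation of the tree, the node names, and the function `partition_clock_tree` are hypothetical and chosen only for this example. The sketch duplicates only the root into each partition; how much more of the tree must be duplicated to model loading from the other branches is left open here.

```python
def partition_clock_tree(tree, root):
    """Split a clock tree into one partition per branch leaving the root.

    tree maps each driver to the list of nodes it drives (hypothetical layout).
    Every partition starts with a duplicated copy of the root.
    """
    partitions = []
    for child in tree.get(root, []):
        nodes = [root]          # the root is duplicated into every partition
        stack = [child]
        while stack:
            node = stack.pop()
            nodes.append(node)
            # reversed() keeps left-to-right order when popping from the end
            stack.extend(reversed(tree.get(node, [])))
        partitions.append(nodes)
    return partitions


# Hypothetical four-flop clock tree with two branches at the root.
clock_tree = {
    "clk_root": ["buf_a", "buf_b"],
    "buf_a": ["ff1", "ff2"],
    "buf_b": ["buf_c"],
    "buf_c": ["ff3", "ff4"],
}
parts = partition_clock_tree(clock_tree, "clk_root")
for part in parts:
    print(part)
```

Each resulting list could then be handed to a separate processor, since no node other than the duplicated root is shared between partitions.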
Reference is now made to
Reference is now made to
Following the generation of the sub-circuits, the connectors between each sub-circuit are appended with a voltage/current regulator circuit for iteratively applying the intermediate results to and from the adjacent sub-circuits.
In yet another embodiment of the present invention, a method is provided for partitioning the circuit system into sub-circuits that are either composed entirely of passive elements or composed of elements with clear paths to power and ground, for the purpose of creating well-behaved matrix models to be used in parallel circuit simulation. The entire system may be partitioned into groups of one or more sub-circuits, such that each group may be simulated in parallel with all other groups.
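By way of non-limiting illustration, the extraction of purely passive sub-circuits may be sketched with a union-find structure over the netlist. The tuple layout `(name, kind, node_a, node_b)`, the supply node names, and the two-terminal treatment of the transistor are simplifying assumptions made for this example only. Supply nodes are not used as connectors, so unrelated passive networks are not merged through ground.

```python
PASSIVE = {"R", "L", "C"}     # element kinds treated as passive (assumption)
SUPPLY = {"gnd", "vdd"}       # supply nodes do not join sub-circuits (assumption)


def passive_subcircuits(elements):
    """Group passive elements that share a non-supply node."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    passive = [e for e in elements if e[1] in PASSIVE]
    for name, _kind, node_a, node_b in passive:
        for node in (node_a, node_b):
            if node not in SUPPLY:
                union(("elem", name), ("node", node))

    groups = {}
    for name, _kind, _a, _b in passive:
        groups.setdefault(find(("elem", name)), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Two RC networks separated by an active device M1.
netlist = [
    ("R1", "R", "n1", "n2"),
    ("C1", "C", "n2", "gnd"),
    ("M1", "M", "n2", "n3"),
    ("R2", "R", "n3", "n4"),
    ("C2", "C", "n4", "gnd"),
]
groups = passive_subcircuits(netlist)
print(groups)
```

The active device is deliberately left out of every passive group; in this sketch it would be assigned to a separate sub-circuit of elements with paths to power and ground.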
It should be noted here that the grouping of sub-circuits may be chosen to minimize both inter-processor communication and overall processing time when performing the parallel simulation, and should be chosen to best fit the configuration and resources of the slave processors. Furthermore, some resulting sub-circuits, such as the power and ground structures, may be coupled with other timing-critical sub-circuits, such as the branches of a clock tree, as described above. Such combinations ensure proper treatment of the self-induced power and ground noise when modeling the resulting sub-circuit.
Even after the sub-circuit partitioning described above is performed, one or more of the resulting matrices created for the sub-circuits may be large enough to require further partitioning. In such cases it may be necessary to further partition the matrices themselves.
Reference is now made to
So, in yet another embodiment of the present invention, sparse matrix reordering techniques may be employed to organize the matrices for row partitioning to minimize the inter-processor communication needed while processing each of the row partitions.
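By way of non-limiting illustration, the effect of reordering on row partitioning may be sketched as follows. The disclosure does not name a particular reordering technique; the Cuthill-McKee ordering is used here only as one well-known example, and the adjacency-set representation and all function names are hypothetical. The helper `cross_block_entries` counts the off-diagonal nonzeros whose row and column fall into different row blocks, which serves as a rough proxy for inter-processor communication.

```python
from collections import deque


def cuthill_mckee(adj):
    """Breadth-first ordering that visits low-degree neighbours first.

    adj maps each row index to the set of columns with off-diagonal
    nonzeros; the pattern is assumed symmetric.
    """
    def by_degree(n):
        return (len(adj[n]), n)

    order, seen = [], set()
    for start in sorted(adj, key=by_degree):   # one pass per component
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in sorted(adj[node] - seen, key=by_degree):
                seen.add(nb)
                queue.append(nb)
    return order


def bandwidth(adj, order):
    pos = {n: i for i, n in enumerate(order)}
    return max((abs(pos[a] - pos[b]) for a in adj for b in adj[a]), default=0)


def cross_block_entries(adj, order, block_size):
    """Count nonzeros coupling rows assigned to different row blocks."""
    pos = {n: i for i, n in enumerate(order)}
    return sum(1 for a in adj for b in adj[a]
               if pos[a] // block_size != pos[b] // block_size)


# A six-node chain whose nodes were numbered in a scrambled order.
edges = [(0, 3), (3, 1), (1, 4), (4, 2), (2, 5)]
adj = {n: set() for n in range(6)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

natural = list(range(6))
reordered = cuthill_mckee(adj)
print(bandwidth(adj, natural), cross_block_entries(adj, natural, 3))
print(bandwidth(adj, reordered), cross_block_entries(adj, reordered, 3))
```

In this small example the reordering gathers coupled rows into the same block, so fewer entries couple the two row blocks than under the original numbering.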
A number of LU decomposition matrix solvers exist, including KLU, Cholesky decomposition, and Block Triangular. They all advantageously perform direct inversions of the matrix being solved, but are generally limited in how large a sub-circuit they can handle and require positive definite matrices in order to find a solution. The sub-circuits composed of passive elements convert into SPD matrices and as such are good candidates for decomposition solvers, if they are small enough to be processed. On the other hand, iterative solvers such as GmRes and AMG can handle larger matrices and are not limited to SPD matrices, but do not always converge rapidly on a solution, particularly if the solution is a large incremental step from the current state of the simulation. Furthermore, both types of matrix solvers may be implemented in either serial or parallel form, with varying degrees of resulting improvement in execution time.
When using the techniques previously described in this disclosure, the sub-circuits and blocks of rows may vary in size. As such, when choosing a method, the type of matrix, the size of the row blocks and the degree of transient change in input voltage must all be taken into consideration. For example, while LU decomposition is more appropriate for linear networks and DC analysis, the power and ground sub-circuits are typically too large for such methods and therefore must be solved with GmRes or AMG techniques. On the other hand, it may be appropriate to use a decomposition technique as a precursor to an iterative technique when the transients are large, since the iterative solvers converge more rapidly when they are close to the actual solution.
In yet another embodiment of the present invention, the selection of a particular solver from parallel, serial, direct decomposition and iterative solvers, may vary both with the type of sub-circuit and with the type of simulation stimulus.
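By way of non-limiting illustration, such a selection may be sketched as a simple policy function. The size threshold, the returned labels, and the function names are hypothetical; the policy merely mirrors the considerations stated above (matrix type, matrix size, and size of the transient) and is not a definitive selection rule. A Cholesky factorization attempt is used as the SPD test, and it is only attempted for matrices small enough for a direct solver.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L L^T, or None if a is not SPD."""
    n = len(a)
    for i in range(n):
        for j in range(i):
            if abs(a[i][j] - a[j][i]) > 1e-12:
                return None                      # not symmetric
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return None                  # not positive definite
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def choose_solver(a, max_direct_size, large_transient):
    """Hypothetical policy: pick a solver label for one sub-circuit matrix."""
    if len(a) <= max_direct_size and cholesky(a) is not None:
        return "direct"
    if large_transient:
        # a direct pass supplies a starting point for the iterative solver
        return "direct_then_iterative"
    return "iterative"


# Conductance matrix of a small passive network (SPD) and an indefinite matrix.
g_passive = [[2.0, -1.0], [-1.0, 2.0]]
m_indefinite = [[1.0, 2.0], [2.0, 1.0]]

print(choose_solver(g_passive, 10, False))
print(choose_solver(m_indefinite, 10, False))
print(choose_solver(m_indefinite, 10, True))
```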
Reference is now made to
So, in yet another embodiment of the present invention, multiple parallel slave processes are spawned from a master process to solve both portions of the network of circuits and portions of the matrices created to solve other portions of the network of circuits, which separately communicate their intermediate partial results to the other slave processes until voltage/current stability in the entire network is reached.
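By way of non-limiting illustration, the exchange of intermediate results between slave processes may be sketched on a chain of five 1-ohm resistors between a 1 V source and ground, split into two sub-circuits of two nodes each. The circuit, the closed-form sub-circuit solutions, and all names are assumptions made for this example, and Python threads merely stand in for the slave processes. In each sweep both sub-circuits are solved concurrently using the neighbour's interface voltage from the previous sweep, until the voltages stop changing.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_left(v3):
    """Solve nodes n1, n2 given the neighbour's interface voltage v3.

    Nodal equations: 2*v1 - v2 = 1 (source side), -v1 + 2*v2 = v3.
    """
    v2 = (1.0 + 2.0 * v3) / 3.0
    v1 = (1.0 + v2) / 2.0
    return v1, v2


def solve_right(v2):
    """Solve nodes n3, n4 given the neighbour's interface voltage v2.

    Nodal equations: 2*v3 - v4 = v2, -v3 + 2*v4 = 0 (ground side).
    """
    v3 = 2.0 * v2 / 3.0
    v4 = v3 / 2.0
    return v3, v4


def relax(tol=1e-10, max_sweeps=200):
    """Iterate both sub-circuits until the node voltages are stable."""
    v = [0.0, 0.0, 0.0, 0.0]                     # v1..v4 initial guess
    with ThreadPoolExecutor(max_workers=2) as pool:
        for sweep in range(1, max_sweeps + 1):
            left = pool.submit(solve_left, v[2])   # uses old v3
            right = pool.submit(solve_right, v[1])  # uses old v2
            new_v = list(left.result() + right.result())
            change = max(abs(a - b) for a, b in zip(new_v, v))
            v = new_v
            if change < tol:
                return v, sweep
    raise RuntimeError("interface voltages did not settle")


voltages, sweeps = relax()
print(voltages, sweeps)
```

For this chain the exact node voltages are 0.8, 0.6, 0.4 and 0.2 V, which the sweeps approach geometrically; larger interface steps or stronger coupling would require more sweeps.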
Still, stability between the partitioned sub-circuits may require a large number of first-stage iterations when dealing with large sub-circuits and/or large voltage/current changes on the sub-circuit interfaces. In general, an iterative solver such as AMG or GmRes works well when the initial conditions are near its final state, but may not converge if the voltage/current steps are too large. Such is the case at the initial DC condition, or when high frequency transients are simulated. In these cases, as a variant of the two-stage Newton-Raphson's method, before the next time iteration is invoked, the large voltage/current steps are broken into multiple smaller incremental steps, which are successively applied to the portions of the network that are using iterative solvers.
Therefore, in yet another embodiment of the present invention, a modified two-stage Newton-Raphson's method is employed to perform circuit simulation, where the method includes a first stage that iterates through the multiple components of the network until voltage/current stability is reached and a second stage that iterates through increments of time to complete the simulation, and may include an intermediate step between the first and second stages that increments through large voltage/current steps for portions of the network which may otherwise be unstable.
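By way of non-limiting illustration, the intermediate incremental stepping may be sketched on a single node fed through a resistor and loaded by a diode. A scalar Newton iteration stands in here for the iterative solvers discussed above, and the component values, the 0.5 V increment, and all names are assumptions made for this example. Applying the full source step at once makes the first Newton update overshoot so far that the diode exponential overflows, whereas applying the same step in small increments, each started from the previous solution, converges.

```python
import math

R, I_S, V_T = 1000.0, 1e-14, 0.02585   # assumed resistor and diode parameters


def newton_diode(v_source, v_start, tol=1e-12, max_iter=200):
    """Solve (v - v_source)/R + I_S*(exp(v/V_T) - 1) = 0 for the node voltage."""
    v = v_start
    for _ in range(max_iter):
        i_d = I_S * (math.exp(v / V_T) - 1.0)
        f = (v - v_source) / R + i_d
        df = 1.0 / R + (i_d + I_S) / V_T     # derivative of f with respect to v
        dv = -f / df
        v += dv
        if abs(dv) < tol:
            return v
    raise RuntimeError("Newton iteration did not converge")


def stepped_solve(v_old, v_new, v_start, max_step=0.5):
    """Apply the source change in increments no larger than max_step."""
    n = max(1, math.ceil(abs(v_new - v_old) / max_step))
    v = v_start
    for k in range(1, n + 1):
        v_source = v_old + (v_new - v_old) * k / n
        v = newton_diode(v_source, v)        # warm start from previous result
    return v


try:
    newton_diode(100.0, 0.0)                 # whole step applied at once
    single_step_ok = True
except OverflowError:
    single_step_ok = False                   # exp() overflowed after overshoot

v_final = stepped_solve(0.0, 100.0, 0.0)
print(single_step_ok, v_final)
```

The 100 V step is exaggerated to make the failure obvious; the same warm-start idea applies to smaller steps where an iterative solver would merely converge slowly rather than fail.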
It is contemplated that the techniques in the embodiments described herein are not limited to any specific matrix inversion technique. Furthermore, the above techniques may be used in part or in whole depending on the configuration of the IC system to which they are applied. It is further contemplated that one or all of the techniques described herein may be applied to a wide variety of systems of computers and IC structures when suitably modified by one well versed in the state of the art.