1. Field of the Art
The disclosure herein relates generally to the field of integrated circuit design and more specifically to the automated layout design of semiconductor chips.
2. Description of the Related Art
In an Electronic Design Automation (EDA) system for hierarchical integrated circuit (IC) design, the Physical Hierarchy Generation (PHG) step is responsible for partitioning the input netlist into a set of two or more hierarchical modules, which can be referred to as soft macros. The PHG problem is the first step in any top-down hierarchical design planning system, and therefore all subsequent steps depend on the quality of the PHG solution.
In general, large integrated circuits are often designed hierarchically, as opposed to the alternative flat design flow. There are several reasons for this including enabling (1) a “divide and conquer” approach for design teams to manage size and complexity; (2) a distributed design, in which self-contained pieces of a design are given to multiple engineering teams to be designed in parallel; (3) a convenient reuse of soft macros that may be used again in a different design, or repeated multiple times in the same design; and (4) EDA tools, which have a finite capacity based on available memory and runtime, to operate on manageable sized pieces of the design.
In an EDA physical design system, however, hierarchy introduces an extra level of complexity over flat design. For example, soft macros must be floorplanned, i.e., each must be assigned a shape and then placed such that it does not overlap the other soft and hard macros. Leaf cells (standard cells and hard macros) are constrained to be placed within those artificial boundaries, possibly causing them to be moved from their optimal "flat" locations, increasing signal delay. Signal routes between soft macros are similarly constrained to cross the soft macro boundaries only at pre-defined pin locations, which may also cause the routes to deviate from their optimal shortest paths. Register-to-register paths that cross the boundaries must be budgeted such that the arrival times at the soft macro boundaries are fixed; incorrect budgets may lead to unsolvable interconnect optimization problems.
Hierarchical design planning choices can have a large impact on the quality of a design's interconnect performance. Increased signal delays, especially on large global signals between soft macros, can result from increased net lengths or increased routing congestion if floorplanning, pin assignment, or budgeting quality are poor. Increases in net length and/or congestion also can result in increased signal integrity issues, for example, crosstalk delay and noise violations, I-R drop violations, and ringing due to inductance effects. Increased wiring densities can lead to manufacturability problems due to higher defect rates and sub-wavelength lithography effects.
To address the increasing relevance of global interconnect in the design of integrated circuits at nanometer-scale technology nodes, an interconnect-centric design methodology was proposed based on a three phase flow: (1) interconnect planning, (2) interconnect synthesis, (3) interconnect layout. In other words, interconnect cost must be addressed directly in every step of the design process. PHG is an important component of the initial interconnect planning step in this methodology, a component on which all downstream steps depend.
The input specification for a design (typically a Register Transfer Level description or netlist described in a Hardware Description Language) usually is described hierarchically as well. Hierarchy in the HDL description, which is called the logical hierarchy, permits the logic designers to benefit from a divide and conquer approach as well. The logical hierarchy, however, may be quite different from the physical hierarchy, which is the hierarchy ultimately used by the back-end physical design tools. Note that the physical design “back end” tools typically handle floorplanning, power planning, physical synthesis, placement, and routing tasks.
There are several reasons for this difference between logical and physical hierarchy. First, the logical hierarchy is specified for the convenience of the logic design team, while the physical hierarchy is based on the capacity of the EDA software and the feasibility of the resulting physical design task. These goals may be very different. Second, the logical hierarchy is typically much deeper than the physical hierarchy. Each additional level of physical hierarchy increases the complexity of the physical design process, and hence there are typically only one or two levels of physical hierarchy.
Third, blocks in the physical hierarchy are typically much larger than in the logical hierarchy. The flat design capacity of modern EDA software tools is quite high, and the complexity of the physical design task increases with the number of blocks, so blocks in the physical hierarchy are typically made as large as possible. Fourth, the logic design team often has little visibility into the physical design process or requirements. Thus the logical hierarchy, if used directly, might result in an extremely sub-optimal physical design. For example, all memories might have been grouped together and given to one memory design specialist. However in the physical hierarchy the memories should each be distributed into the blocks that access them. Another common example involves test logic. BIST (Built-in Self-Test) and Scan logic is often synthesized into a single hierarchical block. However in the physical design this test logic must be distributed over the floorplan or, again, long wiring delays and congestion might occur.
One way to view the PHG problem is to specify it as the problem of finding a mapping from the logical hierarchy into a physical hierarchy which is optimal with respect to the back end physical design task. Physical hierarchy generation may be viewed as a special case of the classical k-way netlist partitioning problem. However, it is different in a number of significant ways, and therefore requires a new approach and new algorithms. First, logical hierarchy needs to be followed as closely as possible, optionally even disallowing non-sibling cell grouping. Second, classical k-way partitioning algorithms usually consider k to be fixed, and it typically must be an integer power of two. In the PHG problem k is usually not pre-specified and may be any integer. Furthermore, it is not obvious a-priori what values of k may be optimal or even feasible.
Third, classical netlist partitioning seeks to optimize a simple cost function, usually the hypernet cut or maximum subdomain degree. While those figures of merit do correlate with physical parameters such as routing length and congestion, they are only indirect measures and not robust enough for an interconnect-centric flow. A novel cost function is used which measures the “affinity” of sets of cells for each other in a virtually-flat placement. Since this placement has been optimized for wire length, global routing congestion, timing etc., grouping together cells with high mutual affinity will have the effect of minimizing the disturbance on the flat placement and maintaining its optimality.
The PHG problem has been discussed previously in the industry. These discussions include a system for unified multi-level k-way partitioning, floorplanning and retiming. It uses a placement-based delay model to improve partition quality, but the placement is performed top-down on the cluster hierarchy, not virtually-flat as in one proposed embodiment. Their system requires k to be a power of 2, and makes no effort to follow the original logical hierarchy. Another describes a multilevel k-way partitioning system that exploits the logical hierarchy as a “hint” during partitioning to achieve higher quality results. They use the Rent exponent to determine which logical hierarchy modules to preserve, and use those modules as constraints during clustering. However, k must be a power of 2, and only cut-size cost (not placement or routing cost) is considered. Yet another describes a system for physical hierarchy generation based on multilevel clustering and simulated-annealing placement-based refinement, with embedded global routing to estimate and minimize congestion. The coarse placement is performed top-down and does not follow the logical hierarchy.
Formally, the PHG problem is defined as a set assignment problem that maps the logical hierarchy into the physical hierarchy. Given as inputs are a circuit netlist, the original logical hierarchy, and a set of constraints. The output is the physical hierarchy.
The netlist is specified as an undirected hypergraph G=(V, E), where V is the set of vertices representing the leaf cells (standard cells, macros, I/O pads, etc.), and E is the set of undirected hyperedges (sometimes abbreviated to edges) representing the interconnect nets; each hyperedge e ∈ E connects a subset of the vertices, e ⊆ V. Ev ⊆ E is defined as the set of edges incident on vertex v. High fanout nets, such as the clock net, are typically ignored. Vertices and edges may each have a real number weight, wv ∈ ℝ and we ∈ ℝ, respectively.
The input logical hierarchy L is a recursively defined set of subsets of V. Hierarchy L consists of one or more levels Li, 1 ≤ i ≤ n, each consisting of a set of disjoint subsets of V that collectively cover V: Li = {Li,1, Li,2, . . . , Li,j, . . . , Li,ni}, in which Li,j ⊆ V for all 1 ≤ j ≤ ni and ∪j=1..ni Li,j = V.
The physical hierarchy P is defined similarly. The PHG problem is to find a mapping M which maps L into P, such that the solution is optimal with respect to some cost function, and such that the solution meets the constraints. One embodiment of the proposed process only supports a single level of physical hierarchy, but in general there is no such requirement.
The quality of the mapping M is defined by a cost function ƒ which can be any function of G, L, and P. The most common k-way partitioning cost function for a given level of the physical hierarchy Pi is to minimize the sum of the cut set costs of all Pi,j. An edge ek is defined as an external edge with respect to partition Pi,j if ek ∩ Pi,j = Ø. Similarly, edge ek is defined as an internal edge with respect to Pi,j if ek ∩ Pi,j = ek. Otherwise ek is called a cut edge. The cut set Ecut(Pi,j) ⊆ E is the set of edges in G that are cut edges with respect to Pi,j. The cut set cost of a partition Pi,j is ƒcut(Pi,j) = Σe∈Ecut(Pi,j) we.
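The following minimal Python sketch illustrates the definitions above (weighted vertices and hyperedges, and the classification of an edge as internal, external, or cut with respect to a partition, from which ƒcut is accumulated). The class and function names are illustrative assumptions, not the disclosed implementation.

import itertools

class Hypergraph:
    def __init__(self):
        self.vertex_weight = {}          # v -> w_v
        self.edges = []                  # list of (frozenset of vertices, w_e)

    def add_vertex(self, v, weight=1.0):
        self.vertex_weight[v] = weight

    def add_edge(self, vertices, weight=1.0):
        self.edges.append((frozenset(vertices), weight))

def classify(edge_vertices, partition):
    """Return 'external', 'internal', or 'cut' for one hyperedge."""
    inside = edge_vertices & partition
    if not inside:
        return "external"
    if inside == edge_vertices:
        return "internal"
    return "cut"

def cut_cost(graph, partition):
    """f_cut(P_i,j): sum of weights of edges cut by the partition."""
    return sum(w for e, w in graph.edges if classify(e, partition) == "cut")

# Usage: a 4-cell netlist split into two partitions; net {a, c} is cut.
g = Hypergraph()
for v in "abcd":
    g.add_vertex(v)
g.add_edge({"a", "b"})
g.add_edge({"c", "d"})
g.add_edge({"a", "c"})
print(cut_cost(g, {"a", "b"}))   # -> 1.0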
As already described, geometric cost functions such as cut size do not have high fidelity with respect to the real physical metrics that are of interest: routability, delay, signal integrity, manufacturability, etc. Also, it is obviously desirable to maintain as much of the structure of the original logical hierarchy as possible. This goal could be addressed in the cost function, but instead is achieved intrinsically in the setup of the partitioning problem. The atomic objects which are considered for partitioning are not individual standard cells and macros, but rather are modules in the logical hierarchy which already demonstrate good placement affinity.
In addition to a cost function, a set of constraints on the solution is also required. Without a constraint on the number of required partitions, or upper and lower bounds on the partition sizes, for example, the optimal solution consists of a single cluster of all cells in G. (That degenerate solution has a cut of zero, equivalent to a flat instance of the design.) Many other constraints are possible. One author solved an instance of the partitioning problem for FPGAs subject to component resource capacity constraints.
Another common requirement is support for repeated blocks (RBs), sometimes also called multiply instantiated blocks (MIBs). This requirement is most easily expressed as a constraint. If an instance of an RB in the logical hierarchy becomes a partition in the physical hierarchy, then all instances of that RB must also become partitions. Furthermore, all such partitions must be identical. Other cells (such as small clusters or glue logic cells) may only be merged into an RB partition if identical instances can be merged into all instances of the RB.
Another common requirement is support for multiple power domains. A power domain is a set of leaf cells sharing a common power supply. Different power domains may use different voltages to achieve different power/performance tradeoffs. Alternately they may use the same voltage, but with different power gating control circuitry that switches off power to the cells when they are not in use. Splitting a power domain into two partitions is not desirable because of the extra overhead required to distribute the power supply voltage to each partition, and to duplicate associated level shifting cells and/or power gating logic to each partition. In the context of the PHG problem, one could treat the power domains as constraints, preventing cells in different power domains from being clustered together. Or one could consider the domains with a term in the cost function that would minimize the "power domain cut set" (the number of partition boundaries that split a given domain into different partitions).
Yet another common requirement is support for multiple clock domains. A clock domain is a set of leaf cell latches or flip flops that share a particular clock distribution network. Different clock domains may operate at different clock frequencies or duty cycles, for example, or they may be different versions of a common clock that are gated to switch off the clock to portions of the circuit that are not in use during a particular clock cycle. Splitting a clock domain into two partitions is not desirable because of the extra overhead required to route the clock network to each partition, or to duplicate the clock gating logic in each partition. As with power domains, clock domains may be considered either as hard constraints during the PHG problem, or as an additional term in the cost function that minimizes the “clock domain cut set”.
Therefore, the problem addressed by this disclosure includes partitioning that keeps logical and physical hierarchy as similar as possible. One embodiment also removes restrictions on the allowable number for k in the case of k-way partitioning and allows k to adapt to the needs of the design rather than simply be pre-defined. One embodiment further factors in a specialized cost function based on the result of virtually-flat placement. Other embodiments add restrictions based on repeated blocks, multiple power domains, or multiple clock domains in the selection of the blocks or components that compose the partitioning.
The described embodiments provide systems and methods for generation of a physical hierarchy. In one embodiment, a virtually-flat placement of a logically hierarchical design having a plurality of cells is received. A placement affinity metric is calculated in response to receiving the virtually-flat placement. In one embodiment a plurality of cells is coarsened by clustering cells in the logical hierarchical design using the calculated placement affinity metric. In another embodiment, initial partitions of clustered cells are refined by selecting at least one cluster to move between the partitions using the placement affinity metric.
In one embodiment, virtually-flat mixed-mode placement comprises simultaneous global placement of standard cells and macros, ignoring the logical hierarchy. The placement is optimized to minimize wire length and congestion. Hard macro legalization is optional. The placement affinity metric, based on the mutual affinity of one cell, or cluster of cells, for another in the virtually-flat mixed-mode placement, is utilized in the optimization cost function.
An embodiment of a method also includes pre-clustering. This includes processing the logical hierarchy in a top-down levelized order to locate and pre-cluster logical hierarchy cells with high placement affinity. An embodiment including graph coarsening comprises a method that performs a bottom-up clustering to reduce the size of the hypergraph, using the best choice clustering heuristic and a lazy update scheme for neighbor cost updates. A method also may include initial partition generation. For example, using a simplified netlist produced by graph coarsening, the method creates an initial k-way partitioning of reasonable quality that meets the constraints. Further, graph uncoarsening and refinement performs top-down declustering, using an iterative refinement process at each level to improve the initial partition from initial partition generation. Finally, there may be multi-phase refinement in which steps for graph coarsening, initial partition generation, and graph uncoarsening and refinement may occur zero or more times until partitioning converges.
The process described may also be embodied as instructions that can be stored within a computer readable storage medium (e.g., a memory or disk) and can be executed by a processor.
The features and advantages described herein are not all inclusive, and, in particular, many additional features and advantages will be apparent to one skilled in the art in view of the drawings, specifications, and claims. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes and may not have been selected to circumscribe the claimed invention.
The teachings of the disclosure herein can be readily understood by considering the following detailed description in conjunction with the accompanying drawings. Like reference numerals are used for like elements in the accompanying drawings.
FIGS. 4A and 4B are schematic diagrams illustrating one embodiment of affinity cost for low and high affinity clusters.
FIGS. 5A-5C are schematic diagrams illustrating examples of placement affinity.
The figures depict embodiments for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the disclosure herein.
Methods (and systems) for generation of a physical hierarchy based on placement are described.
A. Virtually-Flat Placement
Referring to
In one embodiment, the PHG process receives a virtually-flat placement of a logically hierarchical design and calculates a placement affinity metric for use in the partitioning phase. Global placers are extremely good at optimizing wire length over many different connectivity scales, and the Manhattan distance between two cells (or sets of cells) may be used as a fairly reliable indication of their degree of connectivity. During partitioning, one can view the placement affinity as a tie-breaker: selecting between two possible clusterings with equal cut-size reduction, one embodiment will choose to cluster the groups with higher placement affinity. Placement affinity is further described below.
B. Pre-Clustering
Referring next to step 120 in
One embodiment preserves the logical hierarchy intrinsically through pre-clustering of leaf cells based on their logical hierarchy relationships. In one embodiment, leaf cells are pre-clustered in a top-down order. In a top-down process, processing begins at the highest level of the logical hierarchy and proceeds downward, successively processing smaller and smaller cells. Starting at the top level, the process recursively de-clusters cells in the logical hierarchy until it reaches a set of cells that satisfy the user-supplied maximum-cell-count threshold constraint. In addition, it measures the mutual placement affinity of each cell's leaf cells, which is defined in greater detail later. If the affinity of a cell is below an empirically derived threshold, the process automatically de-clusters that cell and tests the cells in the next level of the logical hierarchy. These pre-clustered logical hierarchy modules, along with any glue logic leaf cells instantiated by the de-clustered hierarchy modules, become the initial set of vertices in the partitioning hypergraph. While the described embodiment uses an empirically derived affinity threshold, it should be noted that other possible embodiments include fixed thresholds or thresholds derived adaptively by examining the affinity of a cell's children or grandchildren.
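A minimal, self-contained Python sketch of this top-down de-clustering recursion is given below. The affinity test here is a simplified stand-in (inverse of the cells' bounding-box area); the metric Mpl defined later would be substituted in practice. The Module class, parameter names, and threshold values are illustrative assumptions, not the disclosed implementation.

from dataclasses import dataclass, field

@dataclass
class Module:
    name: str
    leaf_positions: list = field(default_factory=list)   # (x, y) of own glue cells
    children: list = field(default_factory=list)

    def all_positions(self):
        pts = list(self.leaf_positions)
        for c in self.children:
            pts += c.all_positions()
        return pts

    def cell_count(self):
        return len(self.all_positions())

def affinity(points):
    """Simplified affinity stand-in: higher when cells are placed close together."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    area = max(1e-9, (max(xs) - min(xs)) * (max(ys) - min(ys)))
    return 1.0 / area

def pre_cluster(module, max_cells, min_affinity, out):
    """Recursively de-cluster modules that are too large or too spread out."""
    small_enough = module.cell_count() <= max_cells
    tight_enough = affinity(module.all_positions()) >= min_affinity
    if small_enough and tight_enough:
        out.append(module)                      # keep as one pre-cluster
    else:
        for child in module.children:
            pre_cluster(child, max_cells, min_affinity, out)
        # glue cells of a de-clustered module become individual vertices
        out.extend(Module(f"{module.name}/glue{i}", [p])
                   for i, p in enumerate(module.leaf_positions))

# Usage: a two-level hierarchy; "top" is de-clustered, its children are kept.
top = Module("top", [(50, 50)], [
    Module("cpu", [(0, 0), (1, 1), (2, 1)]),
    Module("mem", [(90, 90), (91, 92)]),
])
vertices = []
pre_cluster(top, max_cells=4, min_affinity=1e-4, out=vertices)
print([m.name for m in vertices])   # -> ['cpu', 'mem', 'top/glue0']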
C. Graph Coarsening
Turning next to step 130 in
In one embodiment, the graph coarsening step 130 comprises coarsening a plurality of cells by clustering cells using a placement affinity metric. The placement affinity metric will be described below. In one embodiment, graph coarsening comprises creating a bottom-up clustering of cells. In a bottom-up process, processing begins at the lowest level (for example, the leaf cells and pre-clustered logical hierarchy cells obtained from pre-clustering) and proceeds upwards, successively merging pairs of smaller clusters to form new larger clusters.
This hierarchically defined sequence of successively coarse sub-graphs encodes connectivity relationships in the graph at successively larger length scales. The first iteration merges vertices with direct connections. The second iteration merges vertices connected through one common vertex, etc. The uncoarsening and refinement stage will later make use of this information to improve the partition as each level is unclustered in reverse order, optimizing the partition cut at each of those different length scales. This is the key idea behind the efficacy of using steps 130 through 160.
It is noted that examples of coarsening approaches include edge coarsening (EC), hyperedge coarsening (HEC) and first-choice coarsening (FCC) schemes. A particular embodiment uses a scheme referred to in the literature as best-choice clustering (BCC). The BCC process is discussed further below.
When two vertices va and vb are merged, the graph G is modified as follows. Vertices va and vb are removed and a new vertex va∪b is added with weight wva∪b = wva + wvb, and the hyperedges formerly incident on va and vb are reconnected to va∪b.
The coarsening schemes operate on pairs of vertices (EC, FCC, BCC) or on sets of hyperedge sinks (HEC). Thus, the process defines how many coarsening operations are to be performed before defining a new coarsening "level" and creating a new reduced graph instance. It is noted that each coarsening level is used to define an iteration in the uncoarsening and refinement step. For example, it has been observed that a balance between quality and runtime may be achieved when the size of the successive graphs is reduced by a factor of 1.5-1.8.
1. Graph Coarsening: Best Choice Clustering (BCC)
Best Choice Clustering uses a priority queue to track the globally best merge choice encountered from among all of the possibilities. This Best Choice Clustering uses a cost function to compute a clustering score Sa∪b for all pairs of connected vertices va and vb. A record is maintained for each vertex referencing its neighbor with the highest score. These records are placed into a priority queue (PQ), sorted by score, so that the clustering choice with the globally highest score can be obtained in O(1) time. The selected vertices are merged into a larger vertex va∪b, and the process is repeated until a certain stopping criterion is met.
After the vertex va∪b is formed, its best neighbor must be found, and a new PQ record must be created and inserted into the queue. In addition, the existing entries in the PQ must be searched for references to va and vb. Vertices that were previously neighbors of va and vb are now neighbors of va∪b. Their new best-choice must be found, and their records must be re-inserted into the PQ as well.
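The following short Python sketch illustrates the BCC mechanics described above: each vertex keeps a record of its best-scoring neighbor in a priority queue, the globally best record is popped, the pair is merged, and records for the affected neighbors are recomputed. The clustering score here is a toy placeholder; the coarsening score of equation 47 would replace it. All names are illustrative assumptions, not the disclosed code.

import heapq
import itertools

def bcc(neighbors, score, stop_at):
    """neighbors: dict vertex -> set of adjacent vertices.
    score(a, b): clustering score for merging a and b (higher is better).
    stop_at: stop when this many vertices remain."""
    live = set(neighbors)
    counter = itertools.count()
    pq = []

    def push(v):
        if neighbors[v]:
            best = max(neighbors[v], key=lambda u: score(v, u))
            # heapq is a min-heap, so the score is negated
            heapq.heappush(pq, (-score(v, best), next(counter), v, best))

    for v in neighbors:
        push(v)

    while pq and len(live) > stop_at:
        _, _, a, b = heapq.heappop(pq)
        if a not in live or b not in live:
            continue                       # record refers to an already merged vertex
        ab = a + b                         # merged cluster name
        live -= {a, b}
        live.add(ab)
        neighbors[ab] = (neighbors[a] | neighbors[b]) - {a, b}
        for n in neighbors[ab]:            # re-link former neighbors of a and b
            neighbors[n] = (neighbors[n] - {a, b}) | {ab}
            push(n)                        # (the lazy-update scheme below defers this)
        push(ab)
    return live

# Usage with a toy score that prefers merging vertices whose names start with
# nearby letters (a pure placeholder for the real coarsening score).
adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"}, "d": {"c"}}
print(sorted(bcc(adj, score=lambda x, y: -abs(ord(x[0]) - ord(y[0])), stop_at=2)))
# -> ['ab', 'cd']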
a. Graph Coarsening Score
This section describes a cost function score used during the coarsening phase of the multilevel partitioning process. The score is a multi-variable cost function with two or more terms. The first term reflects the number of pins eliminated by the merge, normalized by the maximum possible gain. The second term is a new metric based on a measurement of the placement affinity of the cells in a virtually-flat placement. The placement affinity describes how closely the cells of a virtually-flat placement are located to one another.
(i) Pin Reduction
As described above, Ev ⊆ E is the set of hyperedges incident on vertex v. Also defined is WEv = Σe∈Ev we, the total weight of the hyperedges incident on v. The pin-reduction score Spin of equation 1 reflects the weighted number of pins eliminated when two clusters are merged, divided by the maximum possible number that could be eliminated.
The denominator normalizes the function so that it is independent of cluster size. Otherwise the partitioner would favor the merge of large cell clusters over small cell clusters, as more pins would likely disappear. It also serves to scale the function such that it can be effectively combined with the placement affinity term as described below.
After normalization this metric is a unitless number between zero and one. When the merge eliminates no pins the score is zero, and when it eliminates the maximum possible number of pins the score is one.
(ii) Placement Affinity
The placement affinity term, in one embodiment represented by Mpl, in the coarsening score is used to guide the partitioning decisions based on the virtually-flat mixed-mode placement results. In one embodiment, the placement affinity metric quantifies the relative proximity of cells to each other in a cluster as a result of forming the cluster during coarsening. The placement, which has been optimized for wire length and congestion, provides useful information about the complex connectivity relationships between cells and clusters of cells. If two cell clusters are placed close to one another then it is likely that they communicate with one another. If all cells in a cluster are placed close to one another then it is likely that they have high relative connectivity and should remain clustered. Conversely, if the cells in a cluster are scattered across the entire surface of the chip, it is likely that they should be de-clustered in the physical hierarchy.
Given a vertex v ∈ V in G which represents a cluster C of two or more cells, C = {c1, c2, . . . , cn}, the placement affinity of the cells is quantitatively measured. One simple way of doing this is to use the maximum enclosing bounding box over all cells in the cluster, bbmax(C). The computational complexity to calculate bbmax(C) is O(n), where n = |C|, since the cells must be iterated over once. It is noted that this metric may be strongly impacted by "outliers", cells that are pulled far from the center of mass.
Another possibility is to think of the cell placement as a probability distribution function over the x and y placement axis. The cells will have a center of mass described by the mean μ of the cell's coordinates in x and y. One can also measure the standard deviation σ of the placement in x and y. The standard deviation is a measure of how “spread out” the cells are in the placement, and is defined as the root mean squared (RMS) of the deviation of each cell from the mean. The standard deviation has the same units as the data being measured, in this case units of distance. It can be thought of as the average distance of the cells from the mean.
If a rectangle Rσ(C) is drawn centered at the mean (μx, μy), with width σx and height σy (i.e., with corners at (μx − σx/2, μy − σy/2) and (μx + σx/2, μy + σy/2)), it provides a good measure of the placement affinity of the cells in the set. The area of the rectangle is proportional to the average distance of the cells from the mean. Because the standard deviation is much less sensitive to outliers than the bbmax(C) function, the standard-deviation-based technique may be more tolerant of small placement abnormalities. As described below, the computational complexity of the standard deviation metric is also O(n).
A review of the definitions of the mean and standard deviation functions is now provided. A more computationally efficient formulation of the standard deviation expression is given and then derived equations for the mean and standard deviation of a rectangular region and of sets of such regions are described.
The arithmetic mean μp of a population p = {p1, p2, . . . , pn}, where pi ∈ ℝ for all i = 1 . . . n, is defined as μp = (1/n) Σi=1..n pi.
For convenience, μ(p²) is also defined as the arithmetic mean of the squared values, μ(p²) = (1/n) Σi=1..n pi².
The standard deviation σp of population p is defined as
σp = √( (1/n) Σi=1..n (pi − μp)² )  (5)
It is easily shown that equation 5 can be re-written in a more convenient form, as shown below. Theorem 1 gives an alternative formulation of the standard deviation:
σp = √( μ(p²) − (μp)² )  (6)
When computing the standard deviation, equation 6 has an advantage over equation 5, in that it allows single-pass computation of σp. To calculate σp using equation 5 requires one pass to compute μp and a second pass to sum the (pi − μp)² values. Using equation 6, the values of μp and μ(p²) can both be accumulated in a single pass over the data.
Another useful property of the standard deviation is shown below in theorem 2. The proof, based on the fact that summation is distributive, is straightforward. Theorem 2 gives the mean and standard deviation of the union of two populations p and q, with sizes np and nq:
μp∪q = (np·μp + nq·μq) / (np + nq)  (15)
μ((p∪q)²) = (np·μ(p²) + nq·μ(q²)) / (np + nq)  (16)
σp∪q = √( μ((p∪q)²) − (μp∪q)² )  (17)
Equations 15-17 demonstrate that, once the mean has been computed for populations p and q, the mean and standard deviation for the combined population p ∪ q can be computed in constant time. If one caches μp, μ(p²), and the population size np for each cluster, the combined statistics can be maintained incrementally as clusters are merged, without re-visiting the underlying cells.
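The small runnable Python illustration below shows the two properties just described: single-pass computation of the mean and standard deviation (equation 6) and constant-time merging of two populations from their cached statistics (equations 15-17 as reconstructed above). The class and method names are illustrative assumptions, not the disclosed implementation.

from math import sqrt

class Stats:
    """Cached n, mean, and mean-of-squares for a population of reals."""
    def __init__(self, values=()):
        self.n = len(values)
        self.mean = sum(values) / self.n if self.n else 0.0
        self.mean_sq = sum(v * v for v in values) / self.n if self.n else 0.0

    @property
    def std(self):                       # equation 6: sqrt(E[x^2] - E[x]^2)
        return sqrt(max(0.0, self.mean_sq - self.mean ** 2))

    def merged(self, other):             # equations 15-17: O(1) union
        out = Stats()
        out.n = self.n + other.n
        out.mean = (self.n * self.mean + other.n * other.mean) / out.n
        out.mean_sq = (self.n * self.mean_sq + other.n * other.mean_sq) / out.n
        return out

# Usage: merging cached statistics matches recomputing from scratch.
p, q = [1.0, 2.0, 3.0], [10.0, 12.0]
merged = Stats(p).merged(Stats(q))
direct = Stats(p + q)
print(round(merged.std, 6) == round(direct.std, 6))   # -> True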
Equation 6 shows how to calculate the standard deviation for a finite "population" of real numbers. Theorem 3 is used to relate this to a placement of standard cells and macros, which are boxes with finite width and height rather than zero-dimensional points. Theorem 3 gives the mean and standard deviation in x and y of all points in a rectangle R defined by the closed interval [xl, xr] on the x axis and the closed interval [yb, yt] on the y axis (equations 18-21):
μx = (xl + xr)/2,  μ(x²) = (xl² + xl·xr + xr²)/3,  σx = (xr − xl)/√12,
and similarly for y over [yb, yt].
It is also easy to derive an analog of equations 15-17, which are defined over discrete populations of real numbers, for use with continuous bounded functions. This is shown below in theorem 4; the proof, using equation 24, is straightforward. Theorem 4 gives the mean and standard deviation in x and y of the union of two rectangles R1 and R2 with areas AR1 and AR2: the means and means-of-squares combine as in equations 15-17, weighted by the areas AR1 and AR2 rather than by the population sizes, for example μx(R1 ∪ R2) = (AR1·μx(R1) + AR2·μx(R2)) / (AR1 + AR2), with σx and σy then obtained from equation 6.
Theorem 4 shows how to compute the mean and standard deviation values, with respect to either the x or y axis, over the area of a rectangle R. To analyze the placement affinity for a set of two or more standard cells or macro cells C = {c1, c2, . . . , cn}, equations 18-21 are used to compute the per-cell values μx, μ(x²), μy, and μ(y²) of each cell's bounding rectangle, and theorem 4 is used to accumulate them, area-weighted, over all cells in C.
All that remains is to show how σCx and σCy, accumulated this way over all cells in C, relate to the area of an equivalent bounding box; this is established in corollary 1 below.
Equation 41 of corollary 1 shows that if the standard deviations in x and y of all points in a rectangle R are used as the x and y dimensions of a new rectangle Rσ, then Rσ will always have an area of 1/12 of the area of the original rectangle. This is independent of the size of the original rectangle.
This property demonstrates that the standard deviation metric does not have a bias for large groups of cells over small groups of cells, or vice versa. Conversely, Equation 42 shows that the area of rectangle R is always 12 times the area of Rσ. The area of a single cell will always be 12× the standard deviation product of its bounding box.
In one embodiment, an ideal bounding box is defined to be the bounding box of the best possible placement of the cells. The observed bounding box of the set of cells, on the other hand, is measured by computing (using equations 18-21 and 35-40) twelve times the product of the cumulative standard deviations in x and y, given their actual placement in the floorplan.
A placement affinity metric, Mpl, is defined as the ratio of the areas of the observed bounding box and the ideal bounding box, as shown below:
Mpl(C) = Areaobserved(C) / Areaideal(C) = 12·σCx·σCy / Σci∈C A(ci),  (45)
where A(ci) is the area of cell ci.
Note that if the cells are placed in a minimum-area circle, the horizontal and vertical standard deviation values and ideal area will actually be smaller than the lower bound obtained from the ideal square bounding box. An analytical expression for the standard deviation over a circle could be developed, but since the lower bound is only being used as a scaling factor, it would make little difference.
Also note that a set of cells placed with zero whitespace, as in the ideal lower bound, would in most cases result in an un-routable design. Global cell placers typically spread the cells out with a non-zero amount of white space, either at a constant user-defined utilization value, or with dynamically controlled local routability estimates, in a process called whitespace management. Utilization can be defined as a real number between 0.0 and 1.0, indicating the fraction of the placement area occupied by cells; the remainder is left as "white space" between the cells. It may also be specified as a percentage between 0% and 100%.
The metric given in equation 45 is a unitless number≧1.0, which has the value 1.0 when the cells are placed in their minimum possible rectangular bounding box and increases as the cells are spread farther apart. It has a very loose upper bound, achieved when two cells are placed in opposite corners of the floorplan.
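The following runnable Python sketch computes Mpl along the lines of equation 45 as reconstructed above: the "observed" area is twelve times the product of the cumulative standard deviations in x and y of the placed cell rectangles (accumulated area-weighted per theorems 3 and 4), and the "ideal" area is the total cell area packed with zero whitespace. The Cell type and function names are illustrative assumptions, not the disclosed implementation.

from math import sqrt
from dataclasses import dataclass

@dataclass
class Cell:
    x: float      # lower-left corner
    y: float
    w: float      # width
    h: float      # height

def _axis_stats(lo, hi):
    """Mean and mean-of-squares of a uniform distribution on [lo, hi]."""
    mean = (lo + hi) / 2.0
    mean_sq = (lo * lo + lo * hi + hi * hi) / 3.0
    return mean, mean_sq

def placement_affinity(cells):
    """M_pl(C) = 12 * sigma_x * sigma_y / total cell area."""
    area = mx = mx2 = my = my2 = 0.0
    for c in cells:                       # area-weighted accumulation
        a = c.w * c.h
        cmx, cmx2 = _axis_stats(c.x, c.x + c.w)
        cmy, cmy2 = _axis_stats(c.y, c.y + c.h)
        mx, mx2 = mx + a * cmx, mx2 + a * cmx2
        my, my2 = my + a * cmy, my2 + a * cmy2
        area += a
    mx, mx2, my, my2 = mx / area, mx2 / area, my / area, my2 / area
    sigma_x = sqrt(max(0.0, mx2 - mx * mx))
    sigma_y = sqrt(max(0.0, my2 - my * my))
    return 12.0 * sigma_x * sigma_y / area

# Usage: two unit cells side by side (tight) vs. far apart (loose).
tight = [Cell(0, 0, 1, 1), Cell(1, 0, 1, 1)]
loose = [Cell(0, 0, 1, 1), Cell(99, 99, 1, 1)]
print(round(placement_affinity(tight), 2))   # -> 1.0 for an abutted placement
print(round(placement_affinity(loose), 1))   # -> much larger than 1.0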
Equation 45 can be used directly to compare the absolute placement affinities of two different sets of cells, as required in the pre-clustering phase of the process described above. Or it can be used to compute the Best Choice Clustering score, as required in the coarsening phase described above as follows.
When two sets of one or more cells C1 and C2 are clustered into a larger set C1 ∪ C2, the placement affinity of the merged set may be better or worse than the placement affinities of the individual sets. The placement score Spl is defined as the change in affinity due to the merge:
Spl(C1 ∪ C2) = Mpl(C1) + Mpl(C2) − Mpl(C1 ∪ C2)  (46)
This metric may be a unitless number that has the value zero when Mpl(C1)+Mpl(C2)=Mpl(C1 ∪ C2), i.e., there may be no benefit or penalty due to clustering. Spl(C1 ∪ C2) is negative when Mpl(C1 ∪ C2)>Mpl(C1)+Mpl(C2) (i.e. the placement affinity of the union is worse than the individual clusters), and vice versa. However, unlike the pin-reduction score Spin from equation 1, it has only very loose lower and upper bounds. This is because Mpl has only a very loose upper bound.
(iii) Final Normalized Coarsening Score
In order to choose which sets of cells C1 and C2 to cluster, a coarsening score Scoarsening(C1 ∪ C2) is computed as follows:
Scoarsening(C1 ∪ C2)=ωpin×Spin(C1 ∪ C2)+ωpl×Spl(C1 ∪ C2) (47)
This is a linear combination of the pin reduction term from equation 1 and the placement affinity term from equation 46. The multipliers ωpin and ωpl are user supplied weights that can be used to tune the relative importance of pin reduction vs. placement affinity. Because the scores have been normalized, and are of approximately the same scale, the default values of these terms are set to be equal ωpin=ωpl, giving both terms approximately equal influence. Additional terms can easily be added to this cost function, for example, a penalty for cluster size (for size balancing), timing, timing slack, placement aspect ratio, and macro area vs. standard cell area ratio.
Note that in some embodiments, the latter two, aspect ratio and macro versus standard cell area, may not be well optimized during coarsening. However, it is the aspect ratio and cell area ratio of the final partition that may be of interest in such embodiments. In particular, their values may not be monotonic during successive clustering phases, and therefore their values during early clustering phases may not be good predictors of their final values. In one embodiment, the weights of those terms are therefore increased with each coarsening iteration, or the terms are optimized only during the uncoarsening and refinement phase.
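The short Python sketch below restates equations 46 and 47 as reconstructed above: the clustering placement score Spl is the change in affinity caused by a merge, and the overall coarsening score is a weighted sum of the pin-reduction and placement terms. The Mpl and Spin inputs are assumed to come from the calculations described earlier; the numbers are purely illustrative.

def placement_score(mpl_c1, mpl_c2, mpl_union):
    """S_pl (eq. 46): positive when merging improves placement affinity."""
    return mpl_c1 + mpl_c2 - mpl_union

def coarsening_score(s_pin, s_pl, w_pin=1.0, w_pl=1.0):
    """S_coarsening (eq. 47): linear combination of the two normalized terms."""
    return w_pin * s_pin + w_pl * s_pl

# Usage: a merge that absorbs some pins and tightens the placement slightly.
s_pl = placement_score(mpl_c1=1.4, mpl_c2=1.6, mpl_union=2.7)   # -> 0.3
print(coarsening_score(s_pin=0.25, s_pl=s_pl))                  # -> 0.55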
2. Graph Coarsening: Lazy Update Heuristic (LU)
In one embodiment, all of the best-neighbor re-calculations required by BCC can be quite computationally expensive, especially when clusters are large and have many pins and thus many neighbors. This problem may be addressed with a technique referred to as lazy-update (LU). Rather than re-evaluating the PQ records that refer to va and vb, one embodiment simply marks them stale. When a stale record appears at the top of the PQ it is re-evaluated and re-inserted into the PQ. Clearly, if the re-evaluated cost is higher, optimality has not suffered—the record is inserted back into the PQ and the real optimal choice is selected. When the record's cost is lower, the results are different—the stale record is lower in the PQ than it should be, and therefore does not appear at the top of the PQ when it should. It is noted that in one embodiment there may be an expectation that most of the time the new cost increases as the vertex is forced to choose its next-best neighbor.
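The following minimal runnable Python sketch illustrates the lazy-update idea: instead of re-scoring every affected priority-queue record after a merge, records are only marked stale; a stale record is re-evaluated and re-inserted only when it reaches the top of the queue. The LazyPQ class and its interface are illustrative assumptions, not the disclosed code.

import heapq

class LazyPQ:
    def __init__(self, score_fn):
        self.score_fn = score_fn      # recomputes the current score of an item
        self.heap = []                # (-score, item); min-heap on negated score
        self.stale = set()

    def push(self, item):
        heapq.heappush(self.heap, (-self.score_fn(item), item))

    def mark_stale(self, item):
        self.stale.add(item)          # O(1); no heap surgery needed

    def pop_best(self):
        while self.heap:
            neg_score, item = heapq.heappop(self.heap)
            if item in self.stale:    # re-evaluate only when it reaches the top
                self.stale.discard(item)
                self.push(item)
                continue
            return item, -neg_score
        return None, None

# Usage: scores live in a dict so they can change after a "merge".
scores = {"merge_ab": 5.0, "merge_cd": 4.0}
pq = LazyPQ(lambda k: scores[k])
pq.push("merge_ab")
pq.push("merge_cd")
scores["merge_ab"] = 1.0              # merge_ab became less attractive...
pq.mark_stale("merge_ab")             # ...so its record is only marked stale
print(pq.pop_best())                  # -> ('merge_cd', 4.0)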
D. Initial Partition Generation
Referring back to
Because the PHG problem begins with a relatively small number of pre-clustered modules, one embodiment adopts a different 2-phase coarsening approach. For example, in the first phase it limits coarsening to the glue logic leaf cells, seeking to cluster them together or assign them to one of the pre-clustered modules. In the second phase it further performs a relatively small number of additional coarsening iterations to directly achieve the initial k-way physical hierarchy partition.
In one embodiment, coarsening may stop at any time when the vertices are between the user-supplied minimum and maximum cell count constraints. After a vertex reaches its minimum cell count it uses the placement-affinity heuristic from pre-clustering to decide whether to continue coarsening. Successive merges are accepted under two conditions: (1) if the new placement affinity is better than the old, or (2) if the user has specified a hard constraint on the number of partitions, that constraint has not yet been met, and all other partitions have also reached their minimum cell count constraints.
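A small Python sketch of this merge-acceptance rule follows. It assumes affinity is measured by Mpl (smaller means a tighter placement); the parameter names and the exact ordering of the tests are illustrative assumptions, not the disclosed implementation.

def accept_merge(old_mpl, new_mpl, cells_after_merge, min_cells, max_cells,
                 partitions_now, target_partitions, all_others_at_min):
    if cells_after_merge > max_cells:
        return False                       # never exceed the upper size bound
    if cells_after_merge <= min_cells:
        return True                        # below the minimum, keep coarsening
    if new_mpl < old_mpl:
        return True                        # (1) placement affinity improves
    # (2) a hard partition-count constraint is not yet met and every other
    #     partition has already reached its minimum cell count
    return (target_partitions is not None
            and partitions_now > target_partitions
            and all_others_at_min)

# Usage: the merge tightens the placement (M_pl drops), so it is accepted.
print(accept_merge(old_mpl=1.8, new_mpl=1.5, cells_after_merge=400,
                   min_cells=300, max_cells=1000, partitions_now=6,
                   target_partitions=None, all_others_at_min=False))  # -> True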
E. Graph Uncoarsening and Refinement
Continuing with step 150 in
1. Partition Refinement Cost Function
In this section the cost function score used during the uncoarsening and refinement stage of the multilevel partitioning process is discussed. As described above, during each uncoarsening step, an FM style k-way partition refinement process is executed on the new uncoarsened hypergraph in an attempt to improve the quality of the current partitioning.
At each iteration of the refinement an unlocked vertex, e.g., referenced as the base vertex, is selected and moved from one partition to another. In addition, the cost function is updated and the base vertex is locked. This process continues until all vertices have been moved, and then a partitioning is selected from the iteration with the best cost.
As in the clustering score described above, the refinement cost is a multi-variable cost function with two or more terms. The first term is the traditional cost function of the FM algorithm, reflecting the reduction in the global cut set. The second term is a new metric based on a measurement of the mutual affinity of the cells in a virtually-flat placement.
a. Cut Set Reduction
A cut set is defined to be the set of edges that cross between two or more partitions. In step 150 the cut set cost function ƒcut(Pi) of a k-way physical hierarchy partitioning Pi is defined as the sum, over each partition Pi,j ∈ Pi, of its cut set cost, in which each cut edge contributes its weight we:
ƒcut(Pi,j) = Σe∈Ecut(Pi,j) we  (48)
ƒcut(Pi) = Σj ƒcut(Pi,j)  (49)
The traditional score used during an FM iteration is simply the change in cut set cost resulting from the move of the base vertex vbase from partition Pi,a to partition Pi,b.
ƒcut(Pi,b ∪ vbase)−ƒcut(Pi,a ∪ vbase)−ƒcut(Pi,b)+ƒcut(Pi,a) (50)
In the PHG system this cut set reduction score is adopted as the first term in the overall partition refinement score, except that it is normalized by dividing by its upper bound, the sum of the weights of all edges in G.
This normalization makes the score into a unitless number between zero and one that is more easily combined with the second placement affinity term.
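A self-contained Python sketch of this normalized cut-set term follows: the change in cut cost when a base vertex moves from partition a to partition b, divided by the total edge weight of G (its upper bound). The sign convention used here (positive when the cut is reduced) and all names are assumptions made for illustration; equation 51 in the original gives the precise form.

def cut_cost(edges, part):
    """Sum of weights of hyperedges that are cut by the given partition."""
    return sum(w for verts, w in edges
               if 0 < len(verts & part) < len(verts))

def normalized_cut_gain(edges, parts, v, src, dst):
    total = sum(w for _, w in edges)
    before = cut_cost(edges, parts[src]) + cut_cost(edges, parts[dst])
    after = cut_cost(edges, parts[src] - {v}) + cut_cost(edges, parts[dst] | {v})
    return (before - after) / total       # positive means the cut is reduced

# Usage: moving 'c' next to 'a' and 'b' internalizes its two nets to them
# while exposing only the single net to 'd'.
edges = [(frozenset("ab"), 1.0), (frozenset("ac"), 1.0),
         (frozenset("bc"), 1.0), (frozenset("cd"), 1.0)]
parts = {0: {"a", "b"}, 1: {"c", "d"}}
print(normalized_cut_gain(edges, parts, "c", src=1, dst=0))   # -> 0.5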
b. Placement Affinity
The placement affinity term, summarized in one embodiment by equation 54, in the partition refinement score is defined similarly to the cut set reduction term. It is the change in the sum of the placement affinities of the partitions Pi,j in partitioning Pi when the base cell is moved from partition Pi,a to partition Pi,b. From equation 45, the mutual placement affinity of a cluster of cells C represented by a vertex v is defined as the ratio of its observed and ideal bounding box areas (equation 52).
The change in placement affinity resulting from the move of the base vertex vbase from partition Pi,a to partition Pi,b would then be as follows
Mpl(Pi,b ∪ vbase)−Mpl(Pi,a ∪ vbase)−Mpl(Pi,b)+Mpl(Pi,a) (53)
This change in affinity is adopted as the second term in the overall cost function, except that it is normalized by dividing by its upper bound, the total placement affinity of all cells in the original netlist, represented by all vertices in the current hypergraph (equation 54).
As above, this normalization makes the placement affinity score into a unitless number between zero and one.
c. Final Normalized Refinement Score
In order to choose the base cell vbase from the current set of unlocked vertices, a refinement score Srefinement(vbase) is computed as follows
Srefinement(vbase)=ωcut×Scut(vbase)+ωpl×Spl(vbase) (55)
This is a linear combination of the cut set reduction term from equation 51 and the placement affinity term from equation 54. The multipliers ωcut and ωpl are user supplied weights that can be used to tune the relative importance of cut set reduction vs. placement affinity. Because the scores have been normalized, and are of approximately the same scale, the default values of these terms are set to be equal ωcut=ωpl, giving both terms approximately equal influence. Further, additional terms can easily be added to this cost function. These additional terms may include, for example, penalty for cluster size (for size balancing), timing, timing slack, placement aspect ratio, and/or macro area versus standard cell area ratio.
It is noted that the latter two terms, aspect ratio and macro versus standard cell area, also may have implications from a physical perspective. When aspect ratios deviate far from unity, soft macros can become difficult to route, suffering high horizontal or vertical routing congestion. Soft macros with a relatively large area devoted to hard macros can be difficult to floorplan, as the macros must be packed with little whitespace, and again are prone to routing congestion problems.
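The compact, self-contained Python sketch below illustrates the FM-style refinement pass described above: at each iteration the unlocked vertex with the best move is relocated and locked, and after every vertex has moved once the best partitioning seen is kept. The move-selection criterion here is a simple cut-edge count stand-in; the combined refinement score of equation 55 would replace it. All names are illustrative assumptions, not the disclosed implementation.

import copy

def cut_edges(edges, parts):
    labels = {v: k for k, p in parts.items() for v in p}
    return sum(1 for e in edges if len({labels[v] for v in e}) > 1)

def fm_pass(edges, parts):
    best = copy.deepcopy(parts)
    best_cost = cut_edges(edges, parts)
    unlocked = {v for p in parts.values() for v in p}
    while unlocked:
        candidates = []
        for v in unlocked:
            src = next(k for k, p in parts.items() if v in p)
            for dst in parts:
                if dst != src and len(parts[src]) > 1:   # keep partitions non-empty
                    candidates.append((v, src, dst))
        if not candidates:
            break
        def cost_after(move):                            # trial move, then undo
            v, src, dst = move
            parts[src].remove(v); parts[dst].add(v)
            c = cut_edges(edges, parts)
            parts[dst].remove(v); parts[src].add(v)
            return c
        v, src, dst = min(candidates, key=cost_after)    # best (lowest-cut) move
        parts[src].remove(v); parts[dst].add(v)          # apply and lock the vertex
        unlocked.discard(v)
        cost = cut_edges(edges, parts)
        if cost < best_cost:
            best, best_cost = copy.deepcopy(parts), cost
    return best, best_cost

# Usage: the pass recovers the partitioning {a, b, c} | {d} with a cut of 1.
edges = [frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("cd")]
parts = {0: {"a", "b"}, 1: {"c", "d"}}
print(fm_pass(edges, parts))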
Additional constraints may further be considered in the partitioning and refinement stages. For example, repeated blocks (RBs), sometimes also called multiply instantiated blocks (MIBs) may be used to constrain partitioning. According to one embodiment, if an instance of an RB in the logical hierarchy becomes a partition in the physical hierarchy, then all instances of that RB also become partitions. Other cells (such as small clusters or glue logic cells) may only be merged into an RB partition if identical instances can be merged into all instances of the RB.
In another embodiment, partitioning is constrained by the power domains. A power domain is a set of leaf cells sharing a common power supply. Different power domains may use different voltages to achieve different power/performance tradeoffs. Alternately they may use the same voltage, but with different power gating control circuitry that switches off power to the cells when they are not in use. Splitting a power domain into two partitions may not be desirable because of the extra overhead required to distribute the power supply voltage to each partition, and to duplicate associated level shifting cells and/or power gating logic to each partition. Thus, in one embodiment the power domains are treated as constraints in the partitioning problem. In this embodiment, cells in different power domains are constrained from being clustered together. Alternatively, the cost function can be modified to include a power domain term that would minimize the "power domain cut set" (the number of partition boundaries that split a given domain into different partitions).
In yet another embodiment, partitioning is constrained by the clock domains of the cells. A clock domain is a set of leaf cell latches or flip flops that share a particular clock distribution network. Different clock domains may operate at different clock frequencies or duty cycles, for example, or they may be different versions of a common clock that are gated to switch off the clock to portions of the circuit that are not in use during a particular clock cycle. In some instances, splitting a clock domain into two partitions may be undesirable because of the extra overhead of routing the clock network to each partition, or duplicating the clock gating logic in each partition. Thus, in one embodiment, clock domains may be considered as hard constraints during partitioning and refinement. Alternatively, an additional clock domain term can be added to the cost function that minimizes the "clock domain cut set".
F. Multi-Phase Refinement
Turning next to the dotted line branch 160 in
G. Alternative Embodiments
As previously described, there may be many alternative embodiments for the steps described in
Further Illustrations
Turning now to
A physical hierarchy 265 is again represented as having multiple levels of hierarchy as illustrated by submodules 275 and 285. The logical hierarchy has three levels while the physical hierarchy has only one. All cells shaded in grey (and therefore all cells below them in the hierarchy) are grouped together in the physical hierarchy, as are the un-shaded cells.
Note that the leaf cells, such as leaf cells 201, 202, and 203, are not constrained to exist only at the bottom of the logical hierarchy. Any level, including the top level, may contain leaf cells. Leaf cells at intermediate levels of hierarchy are often called glue logic. They may represent small amounts of control logic shared by the blocks below, test logic added for BIST or boundary scan, clock generation or gating logic, etc. In this example, all leaf cells are grouped within the physical hierarchy blocks, such as leaf cells 251 and 252, leaving no glue logic at the top. Although this may be required in a fully abutted floorplan, it generally is optional.
Re-organizing the logical hierarchy into the physical hierarchy can be quite disruptive to the design. Grouping together cells which are siblings of each other, for example, cells 1 and 2 in
Next, FIGS. 4A-B illustrate examples of cells with a high degree of affinity and cells with a low degree of affinity. These figures help illustrate the placement affinity metric discussions provided earlier.
Referring next to FIGS. 5A-C, these illustrate affinity examples before and after cluster merging. In
The order in which the steps of the methods are performed is purely illustrative in nature. The steps can be performed in any order or in parallel, unless otherwise indicated by the present disclosure. The methods described herein may be performed in hardware, firmware, software, or any combination thereof operating on a single computer or multiple computers of any type. Software (or computer program product) embodying the described systems and methods may comprise computer instructions in any form (e.g., source code, object code, interpreted code, etc.) stored in any computer-readable storage medium (e.g., a ROM, a RAM, a solid state media, a magnetic media, a compact disc, a DVD, etc.). The instructions are executable by a processor (or processing system). In addition, the software may be in the form of an electrical data signal embodied in a carrier wave propagating on a conductive medium or in the form of light pulses that propagate through an optical fiber.
While particular embodiments have been shown and described, it will be apparent to those skilled in the art that changes and modifications may be made without departing from this disclosure in its broader aspect and, therefore, the appended claims are to encompass within their scope all such changes and modifications, as fall within the true spirit of this disclosure.
In the above description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosure. It will be apparent, however, to one skilled in the art that the described embodiments can be practiced without these specific details. In other instances, structures and devices are shown in diagram form in order to avoid obscuring the embodiments.
Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
Some portions of the detailed description are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
It will be understood by those skilled in the relevant art that the above-described implementations are merely exemplary, and many changes can be made without departing from the true spirit and scope of the disclosure. Therefore, it is intended by the appended claims to cover all such changes and modifications that come within the true spirit and scope of this disclosure.
This application claims a benefit, and priority, under 35 USC §119(e) to U.S. Provisional Patent Application No. 60/791,980, titled “Placement-Driven Physical-Hierarchy Generation”, filed Apr. 14, 2006, the contents of which are herein incorporated by reference.