This invention relates to, and the following discussion assumes skill in, the design of integrated circuit chip floorplan, layout or topography.
In the design of very high performance integrated circuits, designers have to deal with multiple clock frequencies in the GHz frequency domain. Very often there is a primary frequency that drives most of the design and secondary frequencies used to drive selected parts of the design, such as for example an I/O interface. There are several design strategies to distribute these clock signals to their destinations. In one strategy the primary clock signal is distributed through a global clock distribution network to reach all the surface area of the chip, using a two-stage distribution network.
Other clock signals can be distributed in a similar fashion. However, if these signals are only used in specific areas of the design, such global distribution would be a waste of design resources. Furthermore, if these signals are sub-frequencies of the main clock signal, it is important to keep them linked to facilitate synchronization between the signals. One design technique for creating and distributing signals to drive portions of a design at different clock frequencies is to send a control signal from the clock source synchronized with the main clock. At the destination this signal is combined with the main clock signal, for example in a frequency divider, to create the desired frequency. Synchronization between destinations at different locations on the chip is achieved by ensuring that the control signal reaches each destination at the same time, independent of the location of the destinations. One design technique for ensuring that control signals are synchronized with the main clock signal is to use latches in the distribution of the control signals, the latches being controlled by the main clock signal.
The latch structures are known as Latch Distribution Trees (LDTs). In the present disclosure, new design approaches for LDTs are presented that allow design by construction of trees while reducing the load impact of such trees on the main clock distribution network. Furthermore, new approaches are introduced to distribute the load of LDTs on the main clock distribution network to help balance clock skew. LDTs are sometimes identified as plats, and plats may be consolidated into macros drawn from a library of chip floorplan designs.
The method of this invention is distinguished by the analysis of plat load impact and movement of plats within a Sector Grid to balance clock load within the sector in order to help balance clock skew. A design methodology and algorithms are presented such that the total load on the clock distribution network is reduced by clustering plats. The clustering is combined with a movement of plats within each clock sector area to reduce clock skew. The movement and clustering of plats is such that the timing constraints of each plat are preserved. The new techniques are described hereinafter in terms of reducing and balancing the load inside each clock sector, although the techniques could also be applied to balancing load between clock sectors.
Some of the purposes of the invention having been stated, others will appear as the description proceeds, when taken in connection with the accompanying drawings, in which:
While the present invention will be described more fully hereinafter with reference to the accompanying drawings, in which a preferred embodiment of the present invention is shown, it is to be understood at the outset of the description which follows that persons of skill in the appropriate arts may modify the invention here described while still achieving the favorable results of the invention. Accordingly, the description which follows is to be understood as being a broad, teaching disclosure directed to persons of skill in the appropriate arts, and not as limiting upon the present invention.
In a two-stage clock signal distribution network, the first stage distributes the signal from the source, usually at the center of the chip, to intermediate pre-defined locations called sectors. These sectors form a grid overlapping the chip surface as illustrated in
To design distribution networks for multi-GHz frequencies, several design parameters are very carefully controlled. The main design parameter is clock skew, defined as the delay difference between two clock pins, the delay being the signal latency from the phase locked loop 12 (PLL) to the clock pin. There are two types of clock skew, early mode and late mode clock skew. Early mode clock skew is particularly important because an early mode failure will cause a chip to malfunction. Late mode is also important because a higher skew between two critical clock pins will reduce chip performance. The design of a clock distribution network for very low skew (in the single picosecond (ps) range) is very dependent on maximum load, or the number of devices driven by the network, and the actual location of such devices, or load distribution.
Up to the sector buffer level (first stage of the network ending at the input of each sector buffer) the design is independent of the clock load pin location. As such, the design of this network minimizes early and late clock skew by attempting to equalize delay and buffer input slew at each buffer stage in the clock distribution network starting from the PLL.
The second stage of the network (starting with a Sector Buffer) is load dependent. The load within each clock sector varies because of the location of clock pins, the number of clock pins, and the capacitance value of each clock pin. To control the clock skew within a sector and between clock sectors, several design techniques are used. In one technique, the wiring structure used within a sector (H-Tree driving a grid) is designed for very low skew at the H-Tree leaves. This is accomplished with buffer and wire sizing. However, the actual skew depends on how evenly the load is distributed among the leaves of the H-Tree. Another design technique is to minimize the capacitance load per sector (clock pin load). Furthermore, distributing the capacitance load within the sector will help the optimization of wires and buffer(s) in the sector.
As described above control signals are distributed to their destination using LDTs. One property of such trees is that the number of latches between source and destinations is the same, independent of the location of each destination connected to the tree. Such latches are known as plats for this disclosure. An example of a plat LDT design structure is illustrated in
The location of plats 14 is determined by the placement of the macros they connect to, the source of the signal (usually common for all reference signals) and the timing requirements defined by the main clock signal. The distance between two plats is the maximum distance a signal can travel for the duration of a single clock cycle. This distance depends on the fanout of the net connecting them, the wire layers, and any buffering strategy used to stretch the distance between plats as illustrated in
For macros close to the signal source the required plats are placed close to each other while for macros further away from the source the plats are placed further apart including the use of buffer trees in between to space them out. The possibility and flexibility of placement or movement of plats creates an opportunity to reduce and/or balance the clock load within each clock sector where plats are available as described here.
Plats add load to the clock distribution network because the clock pin of each plat is driven by the main clock distribution network. In one design example with a 4.5 GHz clock signal and more than 60 control signals, the average load increased 10% per clock sector, although some sectors may see an increase as high as 60%. One technique to reduce this load is to cluster neighboring plats into a single macro such that the clock pins of the clustered plats are driven by a buffer and only the input capacitance of the buffer is exposed to the clock grid. For example, four plats 14 can be clustered within a macro 15, reducing their load on the grid to about ¼ of its unclustered value, as illustrated in
The number of plats per sector area can vary anywhere from zero to hundreds of plats, significantly increasing the load of the sector. In the same design example, one sector area had more than 200 plats, which contributed almost 1 pF of load to a total sector load of about 4 pF. After clustering, the number of plat pins was reduced to a little over 50 pins, contributing only 0.25 pF of load to the clock distribution network.
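The load figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch only; the function name `clustered_clock_load`, the cluster factor of four, and the per-pin capacitance derived from the rounded figures (about 200 plats, about 1 pF) are assumptions made for illustration, as is the premise that each cluster macro exposes one buffer input with the capacitance of a single plat clock pin.

```python
import math


def clustered_clock_load(num_plats, pin_cap_pf, cluster_factor):
    """Return (pins exposed to the grid, their total load in pF) after clustering.

    Assumption of this sketch: every cluster macro exposes a single buffer
    input whose capacitance equals that of one plat clock pin.
    """
    num_macros = math.ceil(num_plats / cluster_factor)
    return num_macros, num_macros * pin_cap_pf


# Rounded figures from the example above: about 200 plats adding about 1 pF.
NUM_PLATS = 200
PIN_CAP_PF = 1.0 / NUM_PLATS  # roughly 5 fF per plat clock pin (derived, not stated)

pins, load_pf = clustered_clock_load(NUM_PLATS, PIN_CAP_PF, cluster_factor=4)
print(f"{pins} pins, {load_pf:.2f} pF")
```

Under these assumptions the result is in line with the pin count and load reported for the clustered sector.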
The concept of clustering plats into a macro to reduce the load on the main clock tree was used in a previous design. In that design, LDTs were analyzed for plat proximity and a designer manually selected which plats should be clustered into a single macro. Afterwards, a timing analysis was performed on the design and any timing failures in the LDTs were manually fixed by moving plats, re-designing the buffer trees in between plats, or re-clustering plats. The main goal was to meet timing. Since the insertion of LDTs was performed late in the design cycle, there was a negative impact on the clock distribution network as the clock skew increased. To overcome the impact on clock skew, designers clustered latches in plat macros. The clustering was performed manually by inspection and the objective was to reduce the load while meeting timing.
In the current disclosure, the process of clustering is computational, automatic and designed to meet timing requirements. Unique to this disclosure is the analysis of plat load impact and placement of plats within a Sector Grid to balance clock load within the sector in order to help balance clock skew. A design methodology and algorithms are presented such that the total load on the clock distribution network is reduced by clustering plats. The clustering is combined with movement of plats within each clock sector area to reduce clock skew. The movement and clustering of plats is such that the timing constraints of each plat are preserved. The new techniques are described in terms of reducing and balancing the load inside each clock sector, although the techniques could also be applied to balancing load between clock sectors.
The optimization starts by analyzing each clock sector area of the chip. It is first determined if a sector satisfies the minimum requirements for plat movement and clustering that lead to load and clock skew balancing. If conditions are satisfied, the algorithms for plat movement and clustering are applied. The optimization flow applied to a chip where the clock distribution network is comprised of a grid of sector buffers (see
For a given chip the procedure is applied sequentially to each sector buffer area. Before performing any movement and clustering, the procedure determines if the sector area has plats, if they can move, and if there is a load balancing/reduction due to clustering.
The movement algorithm is straightforward. This process starts by calculating the load deviation of each quadrant within the sector to get the quadrant with the most plat load to distribute (110 in
As illustrated in
For the purposes of optimization each sector is divided in four quadrants Q1, Q2, Q3, Q4 as seen in
For the purposes of this work, the load within a sector is divided into at least two components, the load due to clock pins of plats Cplat and the load due to clock pins of macros and/or units Cclk. The load on each quadrant is
CLQi=ΣCclk+ΣCplat (1)
The average load(106 in
Cavg=(ΣCLQi)÷#Quads (2)
The deviation(108 in
CDi=CLQi−Cavg (3)
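Equations (1) to (3) can be illustrated with a short calculation, taking the average load as the sum of the quadrant loads divided by the number of quadrants. This is a sketch under stated assumptions: the function name `quadrant_deviations`, the data layout, and all capacitance values (given here in fF) are hypothetical and are not taken from the design example.

```python
def quadrant_deviations(quadrants):
    """quadrants: name -> {"cclk": [...], "cplat": [...]} pin capacitances."""
    # Equation (1): load of each quadrant.
    loads = {q: sum(c["cclk"]) + sum(c["cplat"]) for q, c in quadrants.items()}
    # Equation (2): average load over the quadrants.
    c_avg = sum(loads.values()) / len(loads)
    # Equation (3): deviation of each quadrant from the average.
    deviations = {q: load - c_avg for q, load in loads.items()}
    return loads, c_avg, deviations


# Hypothetical sector, capacitances in fF.
sector = {
    "Q1": {"cclk": [600, 400], "cplat": [50] * 8},
    "Q2": {"cclk": [800], "cplat": []},
    "Q3": {"cclk": [900], "cplat": [50, 50]},
    "Q4": {"cclk": [700], "cplat": [50, 50]},
}
loads, c_avg, deviations = quadrant_deviations(sector)
for name in sector:
    role = "donor" if deviations[name] > 0 else "receiver" if deviations[name] < 0 else "balanced"
    print(name, loads[name], deviations[name], role)
```

In this hypothetical sector Q1 has a positive deviation and Q2 and Q4 have negative deviations, so Q1 would be a potential donor and Q2 and Q4 potential receivers.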
The requirements to determine if plat movement and clustering will help load and clock skew balancing within a clock sector area are fourfold. First, a sector must have plats within it. Second, there must be a load imbalance between the quadrants. This is determined by the results of equation (3). If the deviation is negative, the quadrant is a potential receiver of load, while if it is positive the quadrant is a potential donor of load. Third, the deviation of a donor quadrant must exceed a minimum value set by the input load of a plat, CDi>2Cplat. Observe that unless the load is equally balanced amongst the quadrants there is always at least one donor quadrant and one receiver quadrant. Fourth, the clustering must help balance the load. For each quadrant, find the minimum possible plat capacitance by applying the maximum possible cluster factor. This factor, known as Max_cluster_factor, is set by the largest plat macro available in the design library. The minimum plat load per quadrant is
Cplat_min=(#plats÷Max_cluster_factor)×Cpin (4)
Applying (4) into (1) to (3) determines which quadrants, if any, are still donors. If, after clustering, the result of equation (3) is positive for at least one quadrant and it satisfies the minimum requirements, load balancing by plat movement can start.
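The check described above, in which equation (4) is substituted into equations (1) to (3), can be sketched as follows. The function names, the data layout, and the capacitance values (in fF) are hypothetical. Rounding the number of macros up to a whole number is an assumption of this sketch; equation (4) as written does not round.

```python
import math


def min_plat_load(num_plats, max_cluster_factor, c_pin):
    # Equation (4); rounding up to whole macros is an assumption of this sketch.
    return math.ceil(num_plats / max_cluster_factor) * c_pin


def donors_after_clustering(quadrants, max_cluster_factor, c_pin, c_plat):
    """quadrants: name -> (total non-plat clock load, number of plats)."""
    loads = {
        q: c_clk + min_plat_load(n, max_cluster_factor, c_pin)
        for q, (c_clk, n) in quadrants.items()
    }
    c_avg = sum(loads.values()) / len(loads)
    # A quadrant remains a donor if its deviation still exceeds 2*Cplat.
    return [q for q, load in loads.items() if load - c_avg > 2 * c_plat]


# Hypothetical sector, capacitances in fF.
sector = {"Q1": (1000, 16), "Q2": (800, 0), "Q3": (900, 4), "Q4": (700, 4)}
donors = donors_after_clustering(sector, max_cluster_factor=4, c_pin=50, c_plat=50)
print(donors)
```

If the returned list is empty, clustering alone already balances the sector to within the threshold and no plat movement would be attempted.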
The algorithm for sector plat movement and load balancing is presented below. The algorithm starts by determining if the load deviation within a sector is positive (2). That triggers the selection of a quadrant and the movement of plats (3, 3.1). The step to get the quadrant with maximum load may determine that more than one quadrant has similar load. If this happens, the algorithm resolves the conflict by selecting the quadrant with the smallest plat count. By selecting such a quadrant, the algorithm increases the possibilities of plat clustering on the quadrants to which the plats eventually move. If the plat count is the same, the algorithm just chooses the first quadrant of the selected list.
For each plat within a quadrant, the maximum distance it can move without violating the timing constraints (3.2.1) is calculated. The details of this calculation are explained herein. Plats with a positive move value, max_move(pj)>0, become candidates for movement. The algorithm picks the plat with the largest value of max_move (3.2.2) and moves the plat to the quadrant with the lowest total load without creating a timing violation (3.2.3). In other words, the distance of movement is less than the max_move(pj) value. The moved plat is tagged to prevent it from moving to other quadrants or returning to its original quadrant (3.2.4). After the move, the CDi of each quadrant is updated (3.2.5). The algorithm stays with quadrant Qi while the quadrant still has the largest load and CDi>0, indicating that plats can still move out of the quadrant.
The step to pick the maximum movement (3.2.2) may also return more than one plat with the same value of max_move(pj). In that case, the plat chosen to move is the one that, once moved to a receiver quadrant, undergoes the smallest displacement. This displacement is calculated as the distance between the location of the plat and the center of the receiver quadrant. The plat with the smallest value is the one chosen to move. If two plats have the same displacement value, the first one in the list is chosen.
Once all possible plats have been moved from a quadrant, the algorithm evaluates whether there are any more donor quadrants (3.4); if there are none, the flow moves to a new sector. Several conditions are tested. First, the deviation must be greater than zero, ensuring that there is load imbalance between the quadrants. Second, to prevent oscillation of plat movement between two quadrants, a donor quadrant is compared against all the other receptor quadrants. The movement should continue if the donor less a plat still has more load than a receptor plus a plat. Lastly, a donor must still have plats that have not yet moved.
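The movement procedure described in the preceding paragraphs can be sketched in simplified form. This is an illustration under stated assumptions, not the algorithm listing itself: the function `balance_sector`, its arguments, and the example values (loads in fF) are hypothetical, every plat is assumed to present the same clock pin load, and the timing-safe distance of each plat is supplied as an input. The oscillation check is applied only against the least loaded quadrant, and a donor with no movable plats is not revisited.

```python
def balance_sector(loads, plat_quadrant, c_plat, max_move, dist_to):
    """Greedy plat redistribution inside one sector (simplified sketch).

    loads         -- quadrant name -> total clock load
    plat_quadrant -- plat name -> quadrant currently holding it
    c_plat        -- clock pin load of one plat
    max_move      -- plat name -> largest timing-safe move distance
    dist_to       -- function (plat, quadrant) -> distance of that move
    """
    loads, plat_quadrant = dict(loads), dict(plat_quadrant)
    moved, exhausted = set(), set()

    def plat_count(quadrant):
        return sum(1 for q in plat_quadrant.values() if q == quadrant)

    while True:
        c_avg = sum(loads.values()) / len(loads)
        receiver = min(loads, key=loads.get)  # quadrant with the lowest load
        donors = [
            q for q in loads
            if q not in exhausted
            and loads[q] - c_avg > 0  # positive deviation
            and loads[q] - c_plat > loads[receiver] + c_plat  # no oscillation
        ]
        if not donors:
            return loads, plat_quadrant
        # Largest load first; ties go to the quadrant with the fewest plats.
        donor = max(donors, key=lambda q: (loads[q], -plat_count(q)))
        candidates = [
            p for p, q in plat_quadrant.items()
            if q == donor and p not in moved and dist_to(p, receiver) < max_move[p]
        ]
        if not candidates:
            exhausted.add(donor)  # nothing left here that can legally move
            continue
        # Largest max_move first; ties go to the smallest displacement.
        plat = max(candidates, key=lambda p: (max_move[p], -dist_to(p, receiver)))
        plat_quadrant[plat] = receiver
        moved.add(plat)  # tagged so that it cannot move again
        loads[donor] -= c_plat
        loads[receiver] += c_plat


# Hypothetical sector: eight plats of 50 fF each, all starting in Q1.
loads_ff = {"Q1": 1400, "Q2": 800, "Q3": 1000, "Q4": 800}
placement = {f"p{i}": "Q1" for i in range(8)}
limits = {p: 10.0 for p in placement}
new_loads, new_placement = balance_sector(
    loads_ff, placement, 50, limits, lambda plat, quadrant: 5.0
)
print(new_loads)
```

The loop stops when no quadrant can give up a plat without ending up lighter than the quadrant that receives it, which is the oscillation condition described above.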
The delay between two connecting plats cannot exceed the clock cycle driving them less any setup and hold values. This delay can be translated into a Manhattan distance (orthogonal wires connecting two points) given the parasitic properties (RLC values) of the wires connecting them. The distances can be extended if trees of buffers/repeaters are used to boost the signal between the plats. The maximum Manhattan distance is obtained with a plat with a single fanout connection and it decreases as the fanout increases. The maximum allowed location of the receiver plat is at the edges of a diamond centered at the driver plat, as illustrated in
To meet the timing constraints, any plat (directly or through buffer trees) must share the intersection of two or three diamonds, as shown in
The displacement, max_move(pj) is a value determined from the intersection box 18 in
Once the plats have been re-distributed among the quadrants, Clustering(109 in
The clustering algorithm is designed to satisfy three objectives. First, ensure that all the plats clustered in a macro still satisfy the timing requirements between plats. Second, guarantee that the chosen macro can be placed in a legal position. Lastly, minimize the number of used macros.
Clustering optimization begins by processing each clock sector grid individually, and repeating the cluster algorithm for each clock sector that contains plats to be clustered. For each latch a timing based diamond is created as shown previously. For a given latch its diamond is checked against the diamonds of all the other latches to get the common intersection area amongst all overlapping diamonds. This intersection area is the place where at least one cluster can be formed. To ensure legal placement of any cluster macro, the intersection area is overlapped against the chip placement blockage. If the intersection area is not empty then the latches can be clustered. The result of diamond overlap and overlap with placement blockage is that each latch in the clock sector gets assigned the number of overlapped diamonds.
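The diamond overlap test can be sketched with a standard coordinate change. Under the substitution u = x + y, v = x − y, a Manhattan-distance diamond of radius r becomes an axis-aligned square of half-width r, so the common intersection of several diamonds becomes a simple interval intersection. The function names and coordinates below are hypothetical, and the overlap with chip placement blockage is not modelled.

```python
def diamond_to_uv(x, y, r):
    """Map a diamond |dx| + |dy| <= r centered at (x, y) to a (u, v) box."""
    u, v = x + y, x - y
    return (u - r, u + r, v - r, v + r)


def common_region(diamonds):
    """Return the (u_lo, u_hi, v_lo, v_hi) box shared by all diamonds, or None."""
    boxes = [diamond_to_uv(*d) for d in diamonds]
    u_lo = max(b[0] for b in boxes)
    u_hi = min(b[1] for b in boxes)
    v_lo = max(b[2] for b in boxes)
    v_hi = min(b[3] for b in boxes)
    if u_lo > u_hi or v_lo > v_hi:
        return None  # the diamonds have no common point
    return (u_lo, u_hi, v_lo, v_hi)


def uv_to_xy(u, v):
    """Inverse of the (u, v) = (x + y, x - y) substitution."""
    return ((u + v) / 2, (u - v) / 2)


# Three hypothetical latches, each given as (x, y, timing radius).
region = common_region([(0, 0, 4), (4, 0, 4), (2, 2, 4)])
center = uv_to_xy((region[0] + region[1]) / 2, (region[2] + region[3]) / 2)
print(region, center)
```

Any point inside the returned region, such as its center, lies within the timing diamond of every latch considered, so it is a candidate location for a cluster macro before blockage is taken into account.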
In the next step the latch with highest overlap number is picked along with all the latches it legally overlaps. These are replaced by a latch macro which is legally placed in the intersection region. Furthermore, the netlist is updated by adding the latch macro, placing it, connecting all the correct signals to the latch macro, and deleting the intersected plats. The process is repeated for other latches in the clock sector while it is possible to cluster plats. The clustering criterion is controlled by defining which plat macros can be used for clustering. The algorithm will choose which plat macro to use to minimize the total load within the sector.
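The selection step can be sketched as a greedy loop. This is a simplified illustration under stated assumptions: the function `cluster_plats`, the data layout, and the coordinates are hypothetical; diamonds are handled as boxes in rotated coordinates (u = x + y, v = x − y); and placement blockage, netlist updates, and the choice among several macro sizes are omitted. A plat that ends up alone in its group would simply remain unclustered.

```python
def uv_box(x, y, r):
    """Diamond of radius r at (x, y) as a box in u = x + y, v = x - y."""
    u, v = x + y, x - y
    return (u - r, u + r, v - r, v + r)


def meet(a, b):
    """Intersection of two (u_lo, u_hi, v_lo, v_hi) boxes, or None if empty."""
    box = (max(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), min(a[3], b[3]))
    return box if box[0] <= box[1] and box[2] <= box[3] else None


def cluster_plats(plats, max_cluster_factor):
    """plats: name -> (x, y, timing radius). Returns [(members, shared box)]."""
    boxes = {p: uv_box(*d) for p, d in plats.items()}
    remaining = list(boxes)
    clusters = []
    while remaining:
        # Number of other remaining diamonds that each diamond overlaps.
        overlaps = {
            p: sum(1 for q in remaining
                   if q != p and meet(boxes[p], boxes[q]) is not None)
            for p in remaining
        }
        seed = max(remaining, key=overlaps.get)  # highest overlap number
        members, region = [seed], boxes[seed]
        for q in remaining:
            if q == seed or len(members) == max_cluster_factor:
                continue
            shared = meet(region, boxes[q])
            if shared is not None:  # q still fits in the common region
                members.append(q)
                region = shared
        clusters.append((members, region))
        remaining = [p for p in remaining if p not in members]
    return clusters


# Four hypothetical plats; "d" is far from the other three.
example = {"a": (0, 0, 2), "b": (1, 0, 2), "c": (0, 1, 2), "d": (20, 20, 2)}
clusters = cluster_plats(example, max_cluster_factor=4)
print([members for members, _ in clusters])
```

Growing each group only while the shared region stays non-empty keeps every clustered plat within its timing diamond, which corresponds to the first objective of the clustering algorithm.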
The final step is to re-run clock simulation to verify that clock skew is still within the defined margins. Likewise, because the LDTs are timed at the clock frequency, a timing analysis is run to ensure that all the plat paths are within the timing budgets.
As one example, one or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media, indicated at 90 in
In the drawings and specifications there has been set forth a preferred embodiment of the invention and, although specific terms are used, the description thus given uses terminology in a generic and descriptive sense only and not for purposes of limitation.
Number | Date | Country |
---|---|---|
20090210840 A1 | Aug 2009 | US |