The present invention relates to the field of integrated circuit design tools and more particularly relates to an apparatus for and method of estimating the quality of clock gating solutions for integrated circuit designs.
Clock gating is a well known technique used to reduce the power consumption of digital hardware circuits. It is often employed as one of several power saving techniques typically applied to synchronous circuits used in large microprocessors and other complex circuits. To save power, clock gating solutions add additional logic to a circuit to modify the functionality of the clock input of a flip-flop or latch, thereby disabling portions of the circuitry where flip-flops or latches do not change state.
Although asynchronous circuits by definition do not employ a ‘clock’ signal, the term ‘perfect clock gating’ is used to show that some clock gating techniques are approximations of the data-dependent behavior exhibited by asynchronous circuitry and that as the granularity of the clock gating employed in a synchronous circuit approaches zero, the power consumption of that circuit approaches that of an asynchronous circuit.
Minimizing switching activity through clock gating is one of the mainstream methods of low-power design. Clock gating can be fine-grained, in which a given clock gating function gates the clock of a small number of flip-flops or latches, or it can be coarse-grained, in which large areas of the integrated circuit chip are turned on and off at the same time. When performed manually, fine-grained clock gating is typically a very labor-intensive process because almost every flip-flop or latch in the design must be considered separately. Furthermore, manual fine-grained clock gating has a low return on investment because the benefits of clock gating a small group of flip-flops or latches are limited. On the other hand, fine-grained clock gating is relatively easy to automate. In addition, there are numerous opportunities, because almost every flip-flop or latch in a design is a candidate for a clock gating solution that minimizes switching activity.
In contrast, coarse-grained clock gating is an architectural-level decision, which is relatively easy to perform manually and can yield a large return on investment for minimal effort. Coarse-grained clock gating, however, is difficult to automate, as it requires some kind of an architectural level understanding of the design. In addition, the number of opportunities for coarse-grained clock gating is relatively small, since there are fewer blocks or units than there are individual flip-flops or latches.
A problem associated with clock gating is that it may impose even tighter setup time requirements. This is because the signals feeding the additional logic placed on the clock signal must arrive sooner in order to ensure that the resultant clock signal arrives before the data.
Another problem is that the additional gates required to implement the clock gating may end up using more in leakage power than is saved in dynamic power through clock gating.
As an example of these problems, consider the example prior art original circuit design shown in
While the clock gated design of circuit 20 is functionally equivalent to the original ungated design of circuit 10 (
There is thus a need for a hardware development tool mechanism that is able to distinguish and select good clock gating solutions from bad ones, especially in regard to the issues of leakage power and timing. The tool should be able to analyze the fine-grained clock gating opportunities found for a design wherein flip-flops or latches are grouped into gating groups that share the same clock gating function and thus can share a clock buffer. In addition, the mechanism should be capable of estimating the quality of candidate clock gating solutions by filtering out any proposed clock gating solutions that require undue overhead.
The present invention is an apparatus for and method of estimating the quality of candidate clock gating solutions. The quality estimation mechanism of the present invention operates on candidate clock gating solutions that are generated using any suitable means. An example of a clock gating technique suitable for use with the present invention is taught in U.S. application Ser. No. 11/295,936, entitled “Clock Gating Through Data Independent Logic,” cited supra. Other known clock gating techniques may also be used without departing from the scope of the invention.
Regardless of the actual technique used, clock gating tools in general are operative to search for clock gating opportunities in a digital circuit design. The result of typical clock gating tools is a plurality of candidate clock gating solutions. A clock gating tool may be standalone, or may be embedded in another tool such as a synthesis or a layout tool. The quality estimation mechanism of the present invention is operative to filter these candidate clock gating solutions. Optionally, the filtered results are reported to a user or simply discarded by the tool. The mechanism is operative to filter the proposed solutions in order to take into account leakage power as well as timing constraints.
The quality estimation mechanism of the invention can optionally be embedded in the clock gating tool itself or accessed as a standalone application. If embedded, the resultant hardware development tool is operative to determine clock gating opportunities in a digital logic design. The tool is able to clock gate any single flip-flop or latch that can be functionally clock gated in addition to grouping flip-flops or latches into gating groups that share the same clock gating function and thus can share a clock buffer. Proposed candidate solutions are filtered using user-supplied input parameters, thereby eliminating solutions that require undue overhead. This helps to ensure that timing constraints are met and that increased leakage will not eat up the power saved by clock gating.
It is noted that the mechanism of the invention is capable of operating at a relatively early stage in the design cycle. The mechanism operates on clock gating solutions that are generated at a stage in the design wherein the exact logic design is not finalized. The functionality is known but the circuit has not yet been optimized, thus exact timing information or power usage is not available. Thus, the mechanism functions as a reliable predictor of whether a candidate clock gating solution is a good solution or not without requiring complex heavy analyses that would normally be applied to the final circuit design.
Alternatively, the mechanism of the invention could be used at a late stage of the design cycle. In this case, exact timing information and power usage can be calculated, but the invention can be used to filter out obviously bad solutions and thus save processing time.
In operation, a metric called the intersection coefficient is determined for a candidate clock gating solution. The intersection coefficient is defined as the number of signals shared by both the data logic portion and clock enable logic portions of a proposed clock gating solution. It has been determined experimentally that this intersection coefficient can predict the quality of the solution with very high reliability.
Note that some aspects of the invention described herein may be constructed as software objects that are executed in embedded devices as firmware, software objects that are executed as part of a software application on either an embedded or non-embedded computer system such as a digital signal processor (DSP), microcomputer, minicomputer, microprocessor, etc. running a real-time operating system such as WinCE, Symbian, OSE, Embedded LINUX, etc. or non-real time operating system such as Windows, UNIX, LINUX, etc., or as soft core realized HDL circuits embodied in an Application Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA), or as functionally equivalent discrete hardware components.
There is thus provided in accordance with the invention, a method of filtering a plurality of candidate clock gating solutions, each candidate clock gating solution incorporating data logic and clock enable logic, the method comprising the steps of, for each candidate clock gating solution, determining a number of input signals shared by the data logic and the clock enable logic of the candidate clock gating solution and considering only clock gating solutions having a number of shared inputs less than or equal to a predetermined threshold.
There is also provided in accordance with the invention, a method of estimating the quality of a plurality of clock gating solutions, the method comprising the steps of determining an intersection coefficient for each candidate clock gating solution, comparing each intersection coefficient against a predetermined threshold and, if the intersection coefficient is less than or equal to the threshold, adding the corresponding candidate clock gating solution to a set of acceptable candidate clock gating solutions.
There is further provided in accordance with the invention, a computer program product comprising a computer usable medium having computer usable program code for estimating the quality of a plurality of candidate clock gating solutions, the computer program product including, computer usable program code for determining an intersection coefficient value of each candidate clock gating solution and computer usable program code for eliminating from consideration candidate clock gating solutions having an intersection coefficient value greater than a predetermined threshold.
The invention is herein described, by way of example only, with reference to the accompanying drawings, wherein:
The following notation is used throughout this document.
The present invention is an apparatus for and method of estimating the quality of candidate clock gating solutions. The quality estimation mechanism of the present invention operates on candidate clock gating solutions that are generated using any suitable means. An example of a clock gating technique suitable for use with the present invention is taught in U.S. application Ser. No. 11/295,936, entitled “Clock Gating Through Data Independent Logic,” cited supra. Other known clock gating techniques may also be used without departing from the scope of the invention.
Regardless of the actual technique used, clock gating tools in general are operative to search for clock gating opportunities in a digital circuit design. The result of typical clock gating tools is a plurality of candidate clock gating solutions. A clock gating tool may be standalone, or may be embedded in another tool such as a synthesis or a layout tool. The quality estimation mechanism of the present invention is operative to filter these candidate clock gating solutions. Optionally, the filtered results are reported to a user or simply discarded by the tool. The mechanism is operative to filter the proposed solutions in order to take into account leakage power as well as timing constraints.
The quality estimation mechanism of the invention can optionally be embedded in the clock gating tool itself or accessed as a standalone application. If embedded, the resultant hardware development tool is operative to determine clock gating opportunities in a digital logic design. The tool is able to clock gate any single flip-flop or latch that can be functionally clock gated in addition to grouping flip-flops or latches into gating groups that share the same clock gating function and thus can share a clock buffer. Proposed candidate solutions are filtered using user-supplied input parameters, thereby eliminating solutions that require undue overhead. This helps to ensure that timing constraints are met and that increased leakage will not eat up the power saved by clock gating.
Some portions of the detailed descriptions which follow are presented in terms of procedures, logic blocks, processing, steps, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, logic block, process, etc., is generally conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps require physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, bytes, words, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind that all of the above and similar terms are to be associated with the appropriate physical quantities they represent and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as ‘processing,’ ‘computing,’ ‘calculating,’ ‘determining,’ ‘displaying’ or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
The invention can take the form of an entirely hardware embodiment, an entirely software/firmware embodiment or an embodiment containing both hardware and software/firmware elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
A block diagram illustrating an example computer processing system adapted to implement the quality estimation mechanism of the present invention is shown in
The computer system is connected to one or more external networks such as a LAN or WAN 246 via communication lines connected to the system via data I/O communications interface 244 (e.g., network interface card or NIC). The network adapters 244 coupled to the system enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters. The system also comprises magnetic or semiconductor based storage device 242 for storing application programs and data. The system comprises a computer readable storage medium that may include any suitable memory means, including but not limited to, magnetic storage, optical storage, semiconductor volatile or non-volatile memory, biological memory devices, or any other memory storage device.
Software adapted to implement the quality estimation mechanism is adapted to reside on a computer readable medium, such as a magnetic disk within a disk drive unit. Alternatively, the computer readable medium may comprise a floppy disk, removable hard disk, Flash memory 236, EEROM based memory, bubble memory storage, ROM storage, distribution media, intermediate storage media, execution memory of a computer, and any other medium or device capable of storing for later reading by a computer a computer program implementing the method of this invention. The software adapted to implement the quality estimation mechanism of the present invention may also reside, in whole or in part, in the static or dynamic main memories or in firmware within the processor of the computer system (i.e. within microcontroller, microprocessor or microcomputer internal memory).
Other digital computer system configurations can also be employed to implement the quality estimation mechanism of the present invention, and to the extent that a particular system configuration is capable of implementing the system and methods of this invention, it is equivalent to the representative digital computer system of
Once they are programmed to perform particular functions pursuant to instructions from program software that implements the system and methods of this invention, such digital computer systems in effect become special purpose computers particular to the method of this invention. The techniques necessary for this are well-known to those skilled in the art of computer systems.
It is noted that computer programs implementing the system and methods of this invention will commonly be distributed to users on a distribution medium such as floppy disk or CD-ROM or may be downloaded over a network such as the Internet using FTP, HTTP, or other suitable protocols. From there, they will often be copied to a hard disk or a similar intermediate storage medium. When the programs are to be run, they will be loaded either from their distribution medium or their intermediate storage medium into the execution memory of the computer, configuring the computer to act in accordance with the method of this invention. All these operations are well-known to those skilled in the art of computer systems.
As stated supra, the quality estimation mechanism of the invention is operative to filter the candidate clock gating solutions generated by a clock gating tool. An example of a good clock gating solution that might be proposed by a prior art tool is described herein. An example of an original design before processing by a clock gating tool is shown in
The circuit after processing by a clock gating tool is shown in
It is noted that for simplicity's sake, throughout this document, clock gating is represented graphically in the figures by showing an and-gate driving the clock input of a flip-flop. It is appreciated, however, that the mechanism is operative to recognize and process clock gating performed by other means, for instance using specialized clock buffers or other modifications to the clock tree. Furthermore, it is noted that while throughout this document clock gating is shown to be applied to flip-flops, it is appreciated that the mechanism is operative to recognize and process latches, latch pairs such as seen in two-phase design styles, including latch pairs with intervening logic, and any other memory element that may be clock gated.
The clock gating tool of U.S. application Ser. No. 11/295,936, cited supra, for example, is operative to search the digital circuit for opportunities to eliminate feedback loops. A feedback loop includes, inter alia, the case where the data output of a flip-flop or L1-L2 latch pair feeds into the data input of the same flip-flop or L1-L2 latch pair. Three examples of feedback loops are shown in
The clock gating method depends on Theorems 1 and 2 below. We use x0, x1, . . . to denote variables and a0, a1, . . . to denote constants. The theorems and their proofs depend on the fact that if we have a function f(x0, x1, . . . , xn, q), and we fix the values of the variables xi, the result is a function f′(q) of q alone. Note that there are only four such functions: f0(q)≡0; f1(q)≡1; f2(q)≡q; and f3(q)≡¬q.
Theorem 1: Let f(x0, x1, . . . , xn, q) be a function. Then there exist functions g(x0, x1, . . . , xn) and h(x0, x1, . . . , xn) such that
Theorem 2: Let f(x0, x1, . . . , xn, q) be a function. Then there exist functions g1(x0, x1, . . . , xn), g2(x0, x1, . . . , xn) and h(x0, x1, . . . , xn) such that
The functions g and g1 can be constructed by building the function f|q=0=f|q=1. The function g2 can be constructed by building the function
The function h can be constructed by building (f if g else undefined). The condition a0, a1, . . . , an such that f(a0, a1, . . . , an, q)≡q can be tested by comparing the function
to the function f2(q)=q. This provides a practical method to perform clock gating automatically.
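The constructions stated above can be illustrated with a small brute-force sketch. It is not the tool's implementation: it simply enumerates every assignment of the variables, classifies the residual function of q as one of the four possibilities, and evaluates g (where f|q=0 equals f|q=1) and h (f where g holds, undefined otherwise). The function names and the load-enable example register are assumptions introduced for illustration only.

```python
from itertools import product

# The four possible residual functions of q, keyed by (f|q=0, f|q=1).
RESIDUALS = {(0, 0): "0", (1, 1): "1", (0, 1): "q", (1, 0): "not q"}

def classify_residuals(f, n):
    """Map each assignment of the n variables to the residual function of q."""
    result = {}
    for xs in product((0, 1), repeat=n):
        # Evaluate both cofactors of f with respect to q.
        key = (int(f(*xs, 0)), int(f(*xs, 1)))
        result[xs] = RESIDUALS[key]
    return result

def g_condition(f, xs):
    """g (and g1) as stated in the text: true where f|q=0 equals f|q=1."""
    return f(*xs, 0) == f(*xs, 1)

def h_value(f, xs):
    """h as stated in the text: f where g holds, undefined (None) elsewhere."""
    return int(f(*xs, 0)) if g_condition(f, xs) else None

def has_hold_assignment(f, n):
    """True if some assignment a0..an makes f(a0, ..., an, q) identical to q."""
    return "q" in classify_residuals(f, n).values()

# Hypothetical next-state function: load d when 'load' is set, else hold q.
def next_state(load, d, q):
    return d if load else q

table = classify_residuals(next_state, 2)
```

For this hypothetical register, every assignment with load=0 yields the residual function q, and every assignment with load=1 yields a constant, so g coincides with the load signal and h with d.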
An example of Theorem 1 is illustrated in
A gated example of the application of Theorem 1 of the present invention is shown in
If ∃a0, a1, . . . an such that f (a0, a1, . . . , an, q)≡q, then clock gating can be performed. The feedback loop, however, cannot be eliminated. An example of this is shown in
A gated example of the application of Theorem 2 of the present invention incorporating a feedback loop is shown in
The clock gating method described supra is able to eliminate the feedback loop in 100% of the cases in which it is theoretically possible to do so. Furthermore, it simplifies the logic in all other cases in which a feedback loop is present and there exists at least one assignment of the variables x0 through xn such that f (a0, a1, . . . , an, q)≡q. Not every theoretical solution arrived at by a clock gating tool is useful in practice. A solution that adds too much logic might end up wasting more in leakage power than it saves by clock gating. In addition, many theoretical solutions will not obey timing constraints. And finally, a gating function not applicable to a large enough number of flip-flops or latches will waste expensive clock buffers.
The problem of wasting clock buffers can be solved by allowing the user to specify the size of the minimum Gating Group, referred to as the S4G (size for group) parameter. Only gating functions that can be used to gate at least the specified number of flip-flops or latches are allowed; the rest are discarded.
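As a minimal sketch of the S4G filter, assuming each flip-flop or latch has already been associated with a label identifying its gating function, the grouping and discarding step might look as follows. The dictionary layout, the function name and the signal names are hypothetical.

```python
from collections import defaultdict

def filter_gating_groups(gating_function_of, s4g):
    """Group memory elements by gating function; keep groups of size >= s4g."""
    groups = defaultdict(list)
    for element, gating_fn in gating_function_of.items():
        groups[gating_fn].append(element)
    # Gating functions shared by too few elements would waste a clock buffer.
    return {fn: members for fn, members in groups.items() if len(members) >= s4g}

# Hypothetical candidates: three flip-flops share en_a, one uses en_b.
candidates = {"ff0": "en_a", "ff1": "en_a", "ff2": "en_a", "ff3": "en_b"}
kept = filter_gating_groups(candidates, s4g=2)
```

With S4G set to 2 in this example, the group gated by en_a is kept and the single flip-flop gated by en_b is discarded.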
The issues of leakage power and timing, however, are more complex.
One approach is to perform power simulations and static timing analyses within the development tool. Doing so, however, would add a great deal of complication to the tool and would greatly increase run times. Furthermore, the exact timing and power usage depend on the technology mapping and optimizations to be performed later by synthesis and/or other tools; thus, sometimes only an estimate is possible.
Instead, the mechanism of the present invention utilizes a heuristic approach. To control leakage power, the mechanism uses heuristics to limit the solutions to those that require fewer logic gates to implement than the original, ungated design. In this manner, it is guaranteed that the mechanism of the invention does not waste more in leakage than it gains through clock gating.
This is achieved by using what is called the Intersection Coefficient (IC) which is defined as the number of input signals shared by the data logic and clock enable logic portions of a proposed clock gating solution. It has been determined experimentally that the intersection coefficient can predict the quality of the solution with very high reliability. For example, in
Stated mathematically, assume that for some flip-flop F there exists both a new function d′ for the data input to the flip-flop and a gate function en for the flip-flop. Let Sd′ be the set of signals affecting d′ and let Sen be the set of signals affecting en. Thus, if S equals the intersection of Sd′ and Sen, then the intersection coefficient (IC) is the size of the set S. Note that IC is a non-negative integer; it is zero when S is the empty set.
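The definition above translates directly into a set intersection. The following sketch assumes the support sets Sd′ and Sen are already available as collections of signal names; the function name and the signal names are hypothetical.

```python
def intersection_coefficient(data_signals, enable_signals):
    """IC = |Sd' ∩ Sen|: the number of signals affecting both d' and en."""
    return len(set(data_signals) & set(enable_signals))

# Hypothetical support sets for one candidate solution.
ic_shared = intersection_coefficient({"a", "b", "c"}, {"c", "d"})
ic_disjoint = intersection_coefficient({"a", "b"}, {"c", "d"})
```

Here the first candidate shares only the signal c between its data logic and its clock enable logic, so its IC is 1, while the second shares nothing and has an IC of 0.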
Note that for a particular circuit, there may be multiple clock gating solutions. The IC is a function of the particular clock gating solution, rather than a function of the circuit. For instance, consider the circuit of
Using a specified limit on the IC, referred to as IC_LIMIT, it is possible to divide a set of candidate clock gating logical solutions into two groups in accordance with the value of each solution's IC: (1) a satisfactory or acceptable group wherein IC<=IC_LIMIT and (2) an unsatisfactory or unacceptable group wherein IC>IC_LIMIT. The unacceptable group comprises candidate logical solutions which are not likely to satisfy timing and/or power usage requirements. A key benefit of the mechanism of the invention is that it enables very fast estimation of the quality of candidate clock gating solutions without using time-expensive synthesis tools, static timing analysis tools, layout tools or power estimation tools.
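The division into the two groups can be sketched as follows, assuming each candidate carries the sets of signals affecting its data logic and its clock enable logic. The Candidate record and its field names are illustrative assumptions, not structures taken from the tool.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    """Hypothetical record for one candidate clock gating solution."""
    name: str
    data_signals: frozenset
    enable_signals: frozenset

def partition_by_ic(candidates, ic_limit):
    """Split candidates into (acceptable, unacceptable) by IC <= ic_limit."""
    acceptable, unacceptable = [], []
    for cand in candidates:
        ic = len(cand.data_signals & cand.enable_signals)
        (acceptable if ic <= ic_limit else unacceptable).append(cand)
    return acceptable, unacceptable

candidates = [
    Candidate("c0", frozenset("ab"), frozenset("cd")),     # IC = 0
    Candidate("c1", frozenset("abc"), frozenset("cd")),    # IC = 1
    Candidate("c2", frozenset("abcd"), frozenset("bcd")),  # IC = 3
]
good, bad = partition_by_ic(candidates, ic_limit=1)
```

Only set intersections are computed, which reflects why the estimate is fast compared with running synthesis or static timing analysis on every candidate.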
Another example of an original design before clock gating is shown
An example of the design after application of a clock gating tool is shown in
In operation, an IC parameter is supplied by the user and the mechanism uses this parameter as a threshold against which the measured IC value of each candidate solution is compared. A solution is considered acceptable only if its IC value is less than or equal to the threshold. The inventors have found experimentally that the value of the IC parameter allows good control over the quality of the result, both with respect to timing as well as with respect to reducing the number of gates (and thus leakage power).
A block diagram illustrating an example implementation of the quality estimation mechanism of the present invention is shown in
A flow diagram illustrating the intersection coefficient method of the present invention is shown in
Note that the mechanism can be adapted to either generate each candidate solution and perform the IC comparison sequentially or to generate all the candidate solutions and then sequentially filter each against the threshold.
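The two orderings can be sketched with a generator for the interleaved case and a list for the batch case. This is an illustration only: candidates are assumed to be (data signals, enable signals) pairs, and all names are hypothetical.

```python
def ic_of(candidate):
    """IC of a candidate given as a (data_signals, enable_signals) pair."""
    data_signals, enable_signals = candidate
    return len(set(data_signals) & set(enable_signals))

def filter_interleaved(candidate_source, ic_limit):
    """Check each candidate as soon as it is generated (lazy, one at a time)."""
    for cand in candidate_source:
        if ic_of(cand) <= ic_limit:
            yield cand

def filter_batch(candidate_source, ic_limit):
    """Generate all candidates first, then filter them one after another."""
    all_candidates = list(candidate_source)
    return [cand for cand in all_candidates if ic_of(cand) <= ic_limit]

def generate_candidates():
    """Stand-in for a clock gating tool that emits candidates one by one."""
    yield ({"a", "b"}, {"c"})
    yield ({"a", "b", "c"}, {"a", "b", "c"})
    yield ({"a", "b"}, {"b"})

interleaved = list(filter_interleaved(generate_candidates(), ic_limit=1))
batch = filter_batch(generate_candidates(), ic_limit=1)
```

Both orderings accept the same candidates; the interleaved form avoids holding every candidate in memory at once.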
Note that a candidate solution with IC=0 is a good (and very likely the best) solution because the signals affecting the flip-flop data input and the gated clock signal are separated. Thus, the size of the design has likely been reduced by several logical gates.
For an IC of 1, experiments conducted by the inventors have shown that the size of the design usually does not increase. If the IC value is greater than 1, the estimated quality of the changes in logic depends on the particular features of the design. Nevertheless, restricting the maximum admissible value of IC noticeably facilitates the filtering of unacceptable changes in logic.
Table 1 below shows the effect of various values of IC on a single design comprising 1126 flip-flops, 338 of which can potentially be gated. The original design has a critical slack of 4.5 for a clock period of 40 ns and comprises 2689 logical gates. The table demonstrates the effect of restricting the maximum admissible value of IC on the quality of the solution. The table was generated using a design that allows tracking how the solution deteriorates as the IC limit increases. Moreover, there is a “red line” beyond which the solution becomes unsatisfactory. As the value of IC grows, the percentage of the flip-flops or latches in the design that can be gated grows as well. At high values of IC, however, timing is negatively impacted and there is an increase rather than a decrease in the number of gates. Negative numbers for critical slack and number of gates (columns 3 and 4 of Table 1) represent a worse result than the original, while positive numbers represent an improvement. The pattern shown in Table 1 is consistent across many designs that have been experimented with, and based on these results, the IC threshold parameter is by default set to IC=1.
The data presented in Table 1 illustrates that controlling the value of the IC threshold allows synthesis process characteristics such as critical slack and the number of gates to be regulated. If it is desired not to worsen the critical slack, a value of IC=11 is the best choice, while if it is desired not to increase the number of gates, the best result can be achieved by setting this limit at IC=2.
Depending on the implementation of the invention, the IC_LIMIT parameter may be configured as an input parameter by the user, fixed by the software/firmware/hardware mechanism or configured dynamically in accordance with one or more metrics measured during processing of the candidate solutions.
As shown above, the IC value provides some control over the timing as well as the size of the generated logic. In addition, further heuristics are used to limit the amount of logic on the clock enable. The DPT parameter is a rough measure of the depth of the logic when implemented with two-input and-gates and or-gates. For example, the VHDL expression a and b and c and d can be implemented with two levels of logic as in
As opposed to leakage power, which can be controlled completely through the IC parameter, neither the IC nor the DPT parameter guarantees that timing constraints can be met. They do, however, enable the filtering out of those solutions that are clearly problematic. The designer would then use her/his judgment in implementing the remaining advice provided by the development tool of the invention while always having the option to cancel the clock gating later in the design cycle if timing constraints cannot be met.
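One way to picture the DPT measure is as the depth of a balanced tree of two-input gates. The sketch below is an assumption about how such a rough depth could be estimated, not the tool's actual computation; the nested-tuple expression format and the function names are hypothetical.

```python
def balanced_levels(num_inputs):
    """Levels of two-input gates in a balanced tree over num_inputs signals."""
    if num_inputs < 1:
        raise ValueError("at least one input is required")
    # Equals ceil(log2(num_inputs)), computed with integer arithmetic.
    return (num_inputs - 1).bit_length()

def estimate_depth(expr):
    """Rough depth of an expression given as a signal name or (op, operand, ...)."""
    if isinstance(expr, str):
        return 0  # a bare signal contributes no gate levels
    _op, *operands = expr
    # Assumption: each n-ary and/or is decomposed into a balanced tree.
    return max(estimate_depth(o) for o in operands) + balanced_levels(len(operands))

four_input_and = ("and", "a", "b", "c", "d")
nested = ("or", ("and", "a", "b", "c"), "d")
depth_flat = estimate_depth(four_input_and)
depth_nested = estimate_depth(nested)
```

For the four-input and expression mentioned above, the estimate is two levels, matching the text; a chained implementation of the same expression would instead need three levels.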
As an illustration of an example embodiment of the development tool of the invention, a representative portion of an actual advice file, output of the development tool, is provided in
It is noted that for simplicity the advice is provided as if an and-gate is to be used to gate the clock. Depending on circumstances, the designer may use a clock buffer instead. Thus, signal GALERT NET 002327 is the ‘and’ of the gating function given by GA CLK EN GALERT NET 002326 and the original clock given by ALNC1.SH CNT DATAQ.Z.$4(0). In practice, the designer using the results provided in the advice file takes the clock gating function from Line 5.
Lines 9 through 25 show the L1 latches of the L1-L2 latch pairs for which this gating function is applicable. The 17 L1-L2 latch pairs shown are composed of two sets of related signals: ten bits of ALNC1.SH CNT DATAQ and seven bits of ALNC1.SH CNT DATAQ.
As a further example, the mechanism was applied to several different circuits, the results of which are presented below in Table 2. The values of the intersection coefficient (IC), gating group (S4G) and depth (DPT) parameters are shown in columns 2 through 4. Column 5 shows the number of L1-L2 pairs in the block and columns 6-7 show the number and percentage of those pairs that were candidates for clock gating. A clock gating candidate is an L1-L2 pair that can be clock gated according to the method described supra. Columns 8-9 show the number and percentage of the total L1-L2 pairs that were solved (i.e. remained after filtering by the IC, S4G and DPT parameters).
The results range from almost negligible for L15 ARB WRAP to more than a quarter of the latch pairs gated for MCGSCFG KMAC. The wide range of results is due to the inherent differences among the various blocks and the varying amount of effort that was put into manual clock gating before the mechanism of the invention was run.
In alternative embodiments, the methods of the present invention may be applicable to implementations of the invention in integrated circuits, field programmable gate arrays (FPGAs), chip sets or application specific integrated circuits (ASICs), DSP circuits, wireless implementations and other communication system products.
It is intended that the appended claims cover all such features and advantages of the invention that fall within the spirit and scope of the present invention. As numerous modifications and changes will readily occur to those skilled in the art, it is intended that the invention not be limited to the limited number of embodiments described herein. Accordingly, it will be appreciated that all suitable variations, modifications and equivalents may be resorted to, falling within the spirit and scope of the present invention.
The present invention is related to U.S. application Ser. No. 11/295,936, filed Dec. 7, 2005, entitled “Clock Gating Through Data Independent Logic,” incorporated herein by reference in its entirety.