It has been shown that asynchronous circuits can improve the throughput of a circuit and can be more robust to process variability and environmental changes. This can potentially allow designers to use asynchronous circuits in ASIC design flows. The omission of the clock network, together with the fact that asynchronous circuits can be active only when they are performing useful functions, can inherently contribute to the reduction of switching activity and hence to power savings. These benefits, however, come at the expense of incorporating handshaking signals, completion detection trees, distributed controllers, and timing assumptions. The extra overhead might lead to a circuit with more area and higher power consumption than a synchronous implementation.
Therefore, designers of low power asynchronous circuits typically endeavor to carefully avoid intensive overhead to be able to compete with the equivalent synchronous implementation.
Because of the more complicated structure of asynchronous circuits, they have not been adopted by commercial computer-aided design (“CAD”) tool developer companies as much as synchronous circuits have been. Thus, a circuit designer does not have a wide range of options when it comes to design automation of asynchronous circuits.
This has motivated many asynchronous designers to exploit synchronous CAD tools for synthesizing asynchronous circuits. There are multiple instances in the literature in which designers have tried to use a familiar synchronous design flow for asynchronous design and fill the gaps with rather simple ad hoc algorithms in order to build up an asynchronous circuit design flow. Often, the original legacy circuit is described at the synchronous register transfer level (“RTL”) as a netlist, i.e., a description of the interconnection of the primitive circuit elements of an electronic design. Netlists usually convey connectivity information and at a basic level provide nothing more than instances, nets, and perhaps some attributes.
Various approaches exist for starting with a synchronous netlist to produce an asynchronous netlist. The following are significant examples of such approaches:
A de-synchronization approach has been used, as described by J. Cortadella, et al. “Desynchronization: Synthesis of Asynchronous Circuits From Synchronous Specifications,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on. Volume 25, Issue 10, pp. 1904-1921 (October 2006). In this method, each flip-flop is converted into two latches: an odd and an even latch. The clock tree is then replaced by a set of handshaking signals. Asynchronous local controllers are added to the netlist to enable the latches and control the flow of data so that the flow of data in the asynchronous netlist is equivalent to the flow of data in the synchronous netlist.
A phased logic approach is described in D. H. Linder, et al. “Phased logic: supporting the synchronous design paradigm with delay-insensitive circuitry,” Mississippi State Univ., IEEE Transactions, vol. 45, issue 9, pp. 1031-1044 (September 1996). In this method the modules in the synchronous netlist are replaced by equivalent phased logic modules. In phased logic, each signal is encoded with two Level Encoded Dual Rail (“LEDR”) signals. After the original conversion, the liveness and safeness problems are analyzed and extra buffers and token-buffers are added if necessary. Although some FPGA implementations of this technique have been reported, in general custom LEDR library development is needed.
A null convention logic approach is described in Karl M. Fant, et al. “NULL Convention Logic” (Theseus Logic, Inc.), available at http://www.cs.ucsc.edu/~sbrandt/papers/NCL2.pdf. This method starts from conventional HDL, which is synthesized into an intermediate library called 3NCL. This library is still a single-rail library, but with the addition of an extra possible value (the NULL value) for all wires. This preserves single-rail simulation and design capabilities while emulating the final dual-rail gates. The final library is a full dual-rail library. Next, a second run of synthesis is performed to translate the 3NCL gates into 2NCL gates, which are the true dual-rail gates that will be used for the physical design process. In order to assure DI behavior, only a limited variety of gates is used (2-input NAND, NOR, XOR).
Another approach is described in A. Smirnov, et al. “Synthesizing Asynchronous Micropipelines with Design Compiler,” Proc. SNUG Boston 2006: Synopsys User Group, Sep. 18-19, 2006, Boston, USA. In this method, a synchronous circuit described at the RTL level is implemented as an asynchronous micropipeline. Synthesis can be targeted at a wide range of micropipeline protocols and implementations through a standard-cell library approach. Primary target applications include high-throughput, low-power designs using domino-like low-latency cells.
A dataflow graph approach is described in International Patent Application No. PCT/US2007/067618 (Publication No. WO/2007/127914) and entitled “Systems And Methods For Performing Automated Conversion Of Representations Of Synchronous Circuit Designs To And From Representations Of Asynchronous Circuit Designs” having Applicant Achronix Semiconductor Corp. and inventor R. Manohar. In this method, a synchronous netlist containing combinational logic, latches, and flip flops with multiple clock domains and enable signals is converted to an asynchronous circuit using the notion of a dataflow graph. This method eliminates the gating through substitution of a MUX transformation and uses the gating information to make the output of the state-holding element a conditional signal. In such a method, if the state-holding element in the synchronous circuit is gated, either the gating is eliminated using a MUX, or the previous token is generated using an asynchronous register module. Hence, the computational modules will be activated and consume a token whose value is the same as the previous token.
Another approach is described in U.S. Provisional Patent Application Ser. No. 61/047,714, filed 24 Apr. 2008 and entitled “Clustering and Fanout Optimizations of Asynchronous Circuits” to G. Dimou (and assigned to the assignee of the present disclosure), the entire contents of which are incorporated herein by reference.
For such an approach, a synchronous netlist of combinational gates and flip-flops can be converted to asynchronous templates, such as a pre-charged half-buffer (“PCHB”), e.g., as described in “Pipelined Asynchronous Circuits” by Lines, Andrew Matthew (1998), Technical Report, California Institute of Technology, [CaltechCSTR:1998.cs-tr-95-21]. In such an approach, the netlist is first clustered into groups of gates that can use a shared controller, subject to a given cycle time constraint. The cluster size is limited by the number of inputs and outputs. After clustering, the tool tries to optimize the throughput of the circuit through slack matching and to minimize the area.
Aspects and embodiments of the present disclosure can provide asynchronous techniques for RTL design to provide asynchronous RTL designs that are comparable or equivalent to given synchronous RTL designs while achieving lower power consumption, faster throughput, or both. Embodiments of the present disclosure accept a synchronous RTL netlist with clock gating elements as an input and output an asynchronous power-optimized netlist, described at a high level of description, that can be implemented using a wide range of asynchronous templates.
Exemplary embodiments of the present disclosure provide methods for conversion of a synchronous netlist, e.g., of combinational modules, flip flops (or latches), and clock gating modules, to a netlist of asynchronous modules. The processes (including algorithms) described herein can operate to bundle multiple modules in an enable domain, so that they are activated only if the incoming enable token to the enable domain has an UPDATE value. Further, the modules can be clustered inside an enable domain, so that each cluster has a separate controller. The objective function of bundling and clustering can function to minimize power consumption with respect to a given cycle time.
It should be understood that while certain embodiments/aspects are described herein, other embodiments/aspects according to the present disclosure will become readily apparent to those skilled in the art from the following detailed description, wherein exemplary embodiments are shown and described by way of illustration.
The techniques and algorithms are capable of other and different embodiments, and details of such are capable of modification in various other respects. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.
Aspects and embodiments of the present disclosure may be more fully understood from the following description when read together with the accompanying drawings, which are to be regarded as illustrative in nature, and not as limiting. The drawings are not necessarily to scale, emphasis instead being placed on the principles of the disclosure. In the drawings:
While certain embodiments are depicted in the drawings, one skilled in the art will appreciate that the embodiments depicted are illustrative and that variations of those shown, as well as other embodiments described herein, may be envisioned and practiced within the scope of the present disclosure.
As described previously, the present disclosure provides for methods, including specific algorithms, for conversion of synchronous netlists, e.g., of combinational modules, flip flops (or latches), and clock gating modules, to netlists of asynchronous modules. Techniques including algorithms described herein can utilize token filters and token latches, and can function to bundle multiple modules in an enable domain, so that they are activated only if an incoming enable token to the enable domain has an UPDATE value. The modules can be clustered inside an enable domain, so that each cluster has a separate controller. The objective function of bundling and clustering can minimize power consumption with respect to a given cycle time.
Some alternate approaches (e.g., as described previously) do not start from a synchronous netlist where clock-gating modules are present. For such alternate approaches, if the state-holding element in the synchronous circuit is gated, either the gating is eliminated using a MUX, or the previous token is generated using an asynchronous register module. Hence, the computational modules will be activated and consume a token whose value is the same as the previous token. For embodiments of the present disclosure, in contrast, no token is sent in the equivalent asynchronous circuit to the computational modules in the fan-out cone of a gated state-holding element. If the state-holding element is gated, the incoming token is filtered out using a token filter module. Hence, the computational modules are not activated. In this way, embodiments of the present disclosure can avoid activating computational modules with a token that has the previous value.
While some techniques for handling clock gating circuitry are presented in alternate approaches (e.g., as described previously) where conditional split and join modules are used to bypass the disabled part of the circuit and to avoid deadlock and starvation, techniques/algorithms of the present disclosure differ from those alternate approaches in the sense that for each combinational gate a notion of an enable set is introduced, e.g. as shown in
Moreover, algorithms are described herein that can be used to combine enable domains and instantiate fewer boundary modules (the TF and TL modules described in the Token Filter and Token Latch sections that follow). Further optimization is described for enable tokens, which can be qualified with the previous fan-in enable domains, as explained in the subsequent section “Further Optimization of Enable Tokens.” Accordingly, if the fan-in enable domains of an enable domain are not producing a new token, the enable domain does not get activated. Clustering methods/algorithms of the present disclosure can include the ability to cluster a synchronous netlist with clock gating modules. In addition, such methods/algorithms can explicitly define the objective function of the conversion to be power consumption.
The output of embodiments of the present disclosure can be in the form of a netlist or hardware description language of asynchronous modules, e.g., described in high level VerilogCSP language. Therefore, any asynchronous template that is able to implement such a netlist or hardware description language, e.g., VerilogCSP descriptions, can be used as the low-level implementation. Such an output can be used as an input to circuit design and/or simulation software, firmware, and/or hardware, including apparatus and/or systems suitable for application specific integrated circuit (“ASIC”) design and/or manufacturing, including chip circuit layout and fabrication/lithography.
Moreover, embodiments of the present disclosure can be utilized, implemented with, or stored in computer-readable storage media, including commercially available storage media including but not limited to CDs, DVDs, hard drives, flash memory, tape media (both optical and magnetic), and the like. It will be appreciated that embodiments of the present disclosure are not limited to specific types of signal/instruction storage media and will have increased utility as new types of storage media are developed. It should be appreciated that algorithms/methods according to the present disclosure can function or run on one or more suitable computer systems, e.g., those with suitable memory, processing, and/or I/O (e.g., display) functionality. It will be appreciated that embodiments of the present disclosure are not necessarily limited to specific types of computer systems and can have increased utility as new types of computer systems are developed.
This section provides an explanation and description of the mathematical models/algorithms used for exemplary embodiments. Two novel asynchronous modules are presented for reducing token flow in a circuit, and hence saving power. Additionally, a definition is given for a novel notion called enable domains.
Conditional Token Flow Regulator Modules
In order to regulate and minimize the flow of tokens, two modules, described in a suitable language or script, e.g., VerilogCSP, are introduced: a token filter and a token latch. These modules can use conditional communication actions, as is explained in further detail in the following sections.
Token Filter:
An example 100 of token filter module is shown in
Token Latch:
An example 300 of a token latch module according to the present disclosure is shown in
Module 300 is referred to as a Token Latch because, if the en value is 'UPDATE, it operates like a transparent latch and lets an input token pass through to the output channel.
On the other hand, when the en value is 'NOUPDATE, the module 300 operates like an opaque latch and sends the previously stored value to the output channel.
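The token latch behavior just described can be illustrated with a small behavioral model. This is only a sketch in Python rather than VerilogCSP; the class name, the method names, and the string encoding of the enable values are hypothetical, and the model assumes that no input token is consumed when the enable value is NOUPDATE, which the text above does not state explicitly.

```python
UPDATE, NOUPDATE = "UPDATE", "NOUPDATE"


class TokenLatch:
    """Behavioral sketch of a token latch (hypothetical model, not VerilogCSP)."""

    def __init__(self, initial=None):
        self.stored = initial  # value re-sent while the latch is opaque

    def fire(self, en, input_tokens):
        """Consume one enable token and produce one output token.

        input_tokens is an iterator over the tokens waiting on the input channel.
        """
        if en == UPDATE:
            # Transparent: pass the next input token through and remember it.
            self.stored = next(input_tokens)
        # Opaque (NOUPDATE): assumed not to consume an input token;
        # the previously stored value is sent again.
        return self.stored


def run_token_latch(enables, inputs, initial=None):
    """Drive a latch with a sequence of enable tokens and collect its outputs."""
    latch = TokenLatch(initial)
    source = iter(inputs)
    return [latch.fire(en, source) for en in enables]
```

For example, driving the model with the enable sequence UPDATE, NOUPDATE, NOUPDATE, UPDATE and the input tokens 5 and 9 yields the output tokens 5, 5, 5, 9.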
Input Synchronous Graph
For exemplary embodiments, a given input synchronous circuit can be mapped to a directed graph as follows:
$G_1 = (V_s, E_s)$
$V_s = PI \cup PO \cup C_s \cup S_s \cup G_s \cup CLK$
$E_s = D_s \cup EN_I \cup EN_O \cup CLKNET$
$A : EN_I \to [0, +\infty]$
$PW : C_s \cup S_s \to [0, +\infty]$
Where PI = primary inputs; PO = primary outputs; $C_s$ = combinational gates; $S_s$ = sequential gates; $G_s$ = clock gating elements; CLK = clock network drivers; A = activity factor; PW = switching power of the gate in watts; $D_s = \{(u,v) \mid u, v \in C_s \cup S_s \cup PI \cup PO\}$, the edges between sequential gates, combinational gates, primary inputs, and primary outputs; $EN_I = \{(u,v) \mid v \in G_s\}$, the incoming edges to clock gating elements; $EN_O = \{(u,v) \mid u \in G_s\}$, the outgoing edges from clock gating elements; and $CLKNET = \{(u,v) \mid (u \in CLK \cup G_s) \wedge (v \in CLK \cup G_s)\}$, the edges in the clock network.
For the preceding graph, the following further definitions can be given:
1. Path from u to v, $p_{u,v}$: define $p_{u,v}$ to be a path between vertices u and v where:
$p_{u,v} \subset 2^{V}$
$\forall i < |p_{u,v}| : (p_{u,v}[i], p_{u,v}[i+1]) \in E_s$
Thus, $p_{u,v}$ is a tuple, and $p_{u,v}[k]$ is the kth element of the tuple.
2. Set of all paths $P_s$: $P_s = \{p_{u,v} \mid u, v \in V_s\}$
3. Sequential Fan In (SFI):
a. for a combinational gate $c \in C_s$, SFI(c) can be defined as follows:
SFI(c)={si⊂Ss|(∃ps
b. for a sequential gate $s \in S_s$, SFI(s) is defined as follows:
$SFI(s) = \{s\}$
4. Enable Set (ES):
a. For a sequential gate s, the enable set is defined as:
$ES(s) = \{e \in EN_I \mid (e = (v,g)) \wedge ((g,s) \in EN_O) \wedge (v \in V_s) \wedge (g \in G_s)\}$
b. For a combinational gate c, the enable set is defined as:
5. Always Enable Set (AES): If for a vertex $c \in C_s$, ES(c) is empty, ES(c) is called Always Enable Set, or AES for short.
6. Enable Domain (ED): For a node $v \in C_s \cup S_s$, Enable Domain of v, ED(v) is defined as:
7. Always Enable Domain (AED): For a set of nodes $c_i \in C_s$, Always Enable Domain is defined as:
Output Asynchronous Graph
To convert a given synchronous graph to a new graph G2(Va, Ea) consisting of asynchronous modules, G2 can be defined as follows:
$V_a = PI \cup PO \cup C_a \cup S_a \cup TF \cup TL$
$E_a = D_a \cup EN_a$
Where PI=Primary Input; PO=Primary Output; Ca=Asynchronous Computational Modules; Sa=TokBuf Modules; TF=Token Filter modules, e.g., as described in
Similar to the synchronous graph, on the asynchronous graph G2, the following definitions can be made:
1. Path from u to v, $p_{u,v}$: $p_{u,v}$ can be defined to be a path between vertices u and v where:
$p_{u,v} \subset 2^{V}$
$\forall i < |p_{u,v}| : (p_{u,v}[i], p_{u,v}[i+1]) \in E_a$
Thus, $p_{u,v}$ is a tuple, and $p_{u,v}[k]$ is the kth element of the tuple.
2. Set of all paths $P_a$: $P_a = \{p_{u,v} \mid u, v \in V_a\}$
3. Fan-in: for a vertex $v \in C_a \cup S_a \cup TF \cup TL \cup PO$, fan-in, FI(v), is defined as the number of incoming edges to v.
4. Fan-out: for a vertex $v \in C_a \cup S_a \cup TF \cup TL \cup PI$, fan-out, FO(v), is defined as the number of outgoing edges from v.
5. Token Filter Fan In:
a. For a vertex $u \in C_a \cup S_a \cup TF \cup PO$, TFFI(u) can be defined as follows:
TFFI(u)={tfi⊂TF|(∃ptf
b. for a Token Filter gate $tf \in TF$, TFFI(tf) can be defined as follows:
$TFFI(tf) = \{tf\}$
6. Token Latch Fan Out:
a. For a vertex $u \in C_a \cup S_a \cup TL \cup PI$, TLFO(u) can be defined as follows:
TLFO(u)={tli⊂TL|(∃pu,tl
b. for a Token Latch gate $tl \in TL$, TLFO(tl) can be defined as follows: $TLFO(tl) = \{tl\}$
7. Enable Set:
a. For a vertex $t \in TF \cup TL$, the enable set is defined as:
$ES(t) = \{e \in EN_a \mid e = (v,t)\}$
b. For a vertex $c \in C_a \cup S_a \cup PI \cup PO$, the enable set can be defined as:
$ES(c) = \{e_i \mid (e_i \in ES(t_i)) \wedge (t_i \in TFFI(c))\}$
8. Always Enable Set (AES): If for a vertex $v \in V_a$, ES(v) is empty, ES(v) is called Always Enable Set, or AES for short.
9. Enable Domain: For a node $v \in V_a$, Enable Domain of v, ED(v) is defined as:
10. Always Enable Domain (AED): For a set of nodes $v_i \in V_a$, Always Enable Domain is defined as:
11. Activity factor: For an enable domain ed, activity factor A(ed) is defined as follows:
12. Power Per Token:
a. For a module $v \in C_s \cup S_s$, Power Per Token (PPT) is defined as:
b. For a module $t \in TF \cup TL$, Power Per Token (PPT) is defined as:
High level description of modules: a module $v \in C_s \cup S_s$ can be modeled, e.g., using a high level description in VerilogCSP, an example of which 600 is shown in
Forward Latency: for a module $v \in C_a \cup S_a$, Forward Latency (FL) is the time from when the module starts receiving a new token from the input and calculating the output value until it starts sending the resulting token to the output. This value is a function of the number of logic levels in the low-level implementation of the module.
Backward Latency: for a module $v \in C_a \cup S_a$, Backward Latency (BL) is the time from when the module starts sending until the time the module finishes communication actions on both channels L and R, so that it can start the next communication actions on them. Backward latency is a function of the number of logic levels and of the fan-in and fan-out of the module.
Local Cycle Time: for a module $v \in C_a \cup S_a$, Local Cycle Time (LCT) is the time it takes to complete communication actions on both the L and R channels, plus the time for computation of the value of the output token. The following can consequently be written: $LCT(v) = FL(v) + BL(v)$.
Algorithms
In this section, an explanation is provided of how to generate the graph G2, defined previously, from graph G1.
Converting the Synchronous Graph to Asynchronous Graph
The conversion of G1 to G2 (e.g., in
In this algorithm, e.g., 800, first the clock network is removed from G1, and enable sets and enable domains are specified. Then, each node from G1 is copied to G2. An edge is copied when the enable domains of its two adjacent nodes are the same. Whenever an enable domain boundary is crossed, the function InstantiateAndConnectTLandTF instantiates a TF and a TL module between the enable domains. Primary inputs and primary outputs are treated in a special way: from a PI vertex to a non-PO vertex in different enable domains, only a TF module is instantiated. From a non-PI vertex to a PO vertex, only a TL module is instantiated. From a PI to a PO node, the edge from G1 is copied to G2 without any modification.
Since not all enable sets were present in the original synchronous graph, the algorithm instantiates extra logic to create them. Extra enable sets are unions of original enable signals. Hence, the extra logic is the logical OR of enable tokens in the asynchronous graph. This is done in the function InstantiateExtraEnableSetLogic.
After instantiating all nodes, adding extra TL and TF modules, and adding extra enable set logic, the algorithm connects all enable signals to TF and TL modules by calling the function ConnectEnableSignalsToTLandTFModules.
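The edge-copying step described above can be sketched as follows. This is a minimal Python illustration, not the algorithm 800 itself: the function name `convert_data_edges`, the dictionary encoding of enable sets, and the module naming are hypothetical. It further assumes that the TL closes the source domain and the TF opens the destination domain (so a crossing edge becomes u, TL, TF, v), that the PO case is also conditioned on the enable domains differing, and that PI and PO vertices carry an empty enable set.

```python
from itertools import count


def convert_data_edges(edges, enable_set, primary_inputs, primary_outputs):
    """Copy the data edges of G1 into G2, inserting TL/TF boundary modules.

    edges      -- list of (u, v) data edges of G1 (clock network already removed)
    enable_set -- dict mapping each node to a frozenset of enable signals
                  (hypothetical encoding of the node's enable domain)
    Returns the new edge list and a dict mapping each inserted boundary
    module to the enable set assumed to control it.
    """
    ids = count()
    new_edges = []
    boundary_enables = {}
    for u, v in edges:
        from_pi = u in primary_inputs
        to_po = v in primary_outputs
        if (from_pi and to_po) or enable_set[u] == enable_set[v]:
            new_edges.append((u, v))  # same domain, or PI to PO: copy unchanged
            continue
        chain = [u]
        if not from_pi:
            tl = f"TL{next(ids)}"  # token latch at the exit of the source domain
            boundary_enables[tl] = enable_set[u]
            chain.append(tl)
        if not to_po:
            tf = f"TF{next(ids)}"  # token filter at the entry of the destination domain
            boundary_enables[tf] = enable_set[v]
            chain.append(tf)
        chain.append(v)
        new_edges.extend(zip(chain, chain[1:]))
    return new_edges, boundary_enables
```

Under these assumptions, a PI feeding a gated node receives only a TF, a gated node feeding a PO receives only a TL, and an edge between two different gated domains receives both.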
One can consider an equivalent asynchronous circuit for the converted asynchronous graph. The computational nodes can be replaced with function modules equivalent to the function modules in the synchronous graph (same truth table), and the sequential gates are replaced with TokBuf modules (TB). An example 1200 of such a circuit is shown in
Merging Enable Domains
In the previous section, a greedy algorithm was described for identifying enable domains. It is possible to combine enable domains by merging them. Merging enable domains can lead to some power savings since the number of boundary cells is reduced. In addition, this can facilitate a reduction in the number of controllers, since the clustering algorithm that assigns a controller to computational blocks has the opportunity to share controllers between the merged enable domains.
To have a better understanding of such trade-offs, a power metric is introduced for each enable domain, as is described in the following section.
Power Consumption of Enable Domains
For each enable domain C, e.g., as shown by graph 1400 in
$P_C = \alpha_C P_{C,\mathrm{Active}} + (1 - \alpha_C) P_{C,\mathrm{Gated}}$
$P_{C,\mathrm{Active}} = P_{C,\mathrm{Boundary}} + P_{C,\mathrm{Computation}} + P_{C,\mathrm{Ctrl}}$
$P_{C,\mathrm{Gated}} = P_{C,\mathrm{Boundary}}$
$P_C = P_{C,\mathrm{Boundary}} + \alpha_C (P_{C,\mathrm{Computation}} + P_{C,\mathrm{Ctrl}})$
Where $P_{C,\mathrm{Active}}$ represents the power consumption while the modules in the enable domain are active and $P_{C,\mathrm{Gated}}$ represents the power consumption while the modules are not active. $P_{C,\mathrm{Boundary}}$ represents the power consumption (PPT) of the boundary cells (TF and TL). $P_{C,\mathrm{Computation}}$ represents the power consumption (PPT) of the computational modules. Finally, $P_{C,\mathrm{Ctrl}}$ represents the power consumption of the controller modules that may be needed for implementing this enable domain.
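The step from the first three relations to the last one can be written out explicitly. The derivation below only substitutes the active and gated terms into the weighted sum; the numeric values in the final line are hypothetical and chosen purely for illustration.

```latex
\begin{aligned}
P_C &= \alpha_C \left(P_{C,\mathrm{Boundary}} + P_{C,\mathrm{Computation}} + P_{C,\mathrm{Ctrl}}\right)
      + (1 - \alpha_C)\, P_{C,\mathrm{Boundary}} \\
    &= P_{C,\mathrm{Boundary}} + \alpha_C \left(P_{C,\mathrm{Computation}} + P_{C,\mathrm{Ctrl}}\right) \\
% Hypothetical values: P_Boundary = 2, P_Computation = 10, P_Ctrl = 3, alpha_C = 0.2
    &= 2 + 0.2\,(10 + 3) = 4.6
\end{aligned}
```

The boundary cells therefore contribute regardless of the activity factor, while the computation and controller terms scale with it, which is why reducing the number of boundary cells by merging can pay off.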
In order to decide whether to merge two enable domains, the power metrics before and after merging should be calculated and compared. The following example shows such a comparison.
The total power metric, PBefore, can be calculated as follows:
Where, PB represents the power metric for boundary cells (assuming they are all equal). PF
Now, if two enable domains are merged into EM, as shown in
PAfter=8PB+αM·(PF
Where, PF
So, a calculation for Min(PAfter, PBefore) can be made to find out if the merge pays off or not. In order to calculate Min(PAfter, PBefore), estimates can be made of αM and PCtrl
For the activity factor, the following can be written:
For the controller modules in the combined enable domain, an estimate can be:
Max(PCtrl
Now, for the purpose of this example, it can be assumed that E1⊂E2 and (without loss of generality) αE
Previously:
PBefore=10PB+αE
Now, one can compare PBefore and PAfter to accept or reject the merge.
In general, for two enable domains E1 and E2, with activity factors αE
PBefore=(b1+b2)PB+αE
After merging, the following can be obtained:
PAfter=(b1+b2−b12)PB+αM·(PF
As shown and described for previous examples there can be many cases that the intersections of enable domains are not empty; therefore, there might be a chance to save power by merging them together. Accordingly, algorithms/modules/processes of the present disclosure can use estimates for αM and PCtrl
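The before/after comparison can be sketched in Python by applying the per-domain power metric (boundary power plus activity-weighted computation and controller power) to the two separate domains and to the merged domain, whose boundary count is b1+b2−b12. The names `EnableDomain`, `domain_power`, and `compare_merge` are hypothetical, all boundary cells are assumed to have equal power, and the merged activity factor and merged controller power are left as caller-supplied estimates rather than computed here.

```python
from dataclasses import dataclass


@dataclass
class EnableDomain:
    boundary_cells: int   # number of TF/TL cells on the domain boundary
    activity: float       # activity factor of the domain
    p_computation: float  # PPT of the computational modules
    p_ctrl: float         # power of the controller modules


def domain_power(domain, p_boundary_cell):
    """Power metric: boundary power plus activity-weighted computation and control."""
    return (domain.boundary_cells * p_boundary_cell
            + domain.activity * (domain.p_computation + domain.p_ctrl))


def compare_merge(e1, e2, b12, alpha_merged, p_ctrl_merged, p_boundary_cell):
    """Return (p_before, p_after, accept) for merging e1 and e2.

    b12 is the number of boundary cells eliminated by the merge;
    alpha_merged and p_ctrl_merged are estimates supplied by the caller.
    """
    p_before = domain_power(e1, p_boundary_cell) + domain_power(e2, p_boundary_cell)
    merged = EnableDomain(
        boundary_cells=e1.boundary_cells + e2.boundary_cells - b12,
        activity=alpha_merged,
        p_computation=e1.p_computation + e2.p_computation,
        p_ctrl=p_ctrl_merged,
    )
    p_after = domain_power(merged, p_boundary_cell)
    return p_before, p_after, p_after < p_before
```

With illustrative values such as `compare_merge(EnableDomain(4, 0.25, 8.0, 2.0), EnableDomain(6, 0.5, 4.0, 2.0), b12=2, alpha_merged=0.5, p_ctrl_merged=2.0, p_boundary_cell=1.0)`, the metric drops from 15.5 to 15.0, so the merge would be accepted; with no eliminated boundary cells it would not.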
Merging Enable Domains: Algorithm
Based on the power metric defined in the previous section, one can define the problem as follows:
Given a graph G2(Va, Ea), activity factor A, Power Per Token PPT, as defined previously, a set of Enable Domains S, find the best possible merging of enable domains to optimize power.
Function BreadthFirstCalculatePowBeforePowAfter traverses the graph in breadth-first search order and, whenever it crosses an enable domain boundary, calculates the power before and after the merge of the two enable domains involved. The function CalculatePowBeforePowAfter calculates the power metric for two enable domains before and after the merge. Extensions that use this cost function in conjunction with simulated annealing, a genetic algorithm, or another look-ahead algorithm can be utilized within the scope of the present disclosure. Alternatives with classical statistical pattern recognition and/or a neural network and/or other heuristics are also possible.
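A breadth-first traversal of this kind can be sketched as follows. This Python sketch is an illustration under stated assumptions, not the disclosed function: the name `merge_enable_domains`, the adjacency-dictionary encoding, and the callback `power_before_after` (standing in for CalculatePowBeforePowAfter) are hypothetical, and the rule of accepting a merge immediately whenever the estimated power after merging is lower is an assumed greedy policy.

```python
from collections import deque


def merge_enable_domains(successors, domain_of, start_nodes, power_before_after):
    """Greedy breadth-first merging of enable domains (illustrative sketch).

    successors         -- dict: node -> list of fan-out nodes
    domain_of          -- dict: node -> enable domain label (updated in place)
    start_nodes        -- nodes from which the traversal starts (e.g., primary inputs)
    power_before_after -- callback (d1, d2) -> (p_before, p_after)
    """
    visited = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        u = queue.popleft()
        for v in successors.get(u, []):
            d_u, d_v = domain_of[u], domain_of[v]
            if d_u != d_v:  # the edge crosses an enable domain boundary
                p_before, p_after = power_before_after(d_u, d_v)
                if p_after < p_before:
                    # Accept the merge: relabel every node of d_v into d_u.
                    for node, label in list(domain_of.items()):
                        if label == d_v:
                            domain_of[node] = d_u
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return domain_of
```

Because each decision is made locally and never revisited, this policy can miss better global groupings, which is where the look-ahead extensions mentioned above would apply.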
With continued reference to
Further Optimization of Enable Tokens
Consider the following example having three enable domains, EN1, EN2, and EN3. There is a channel between EN1 and EN2, where tokens from EN1 and EN2 are consumed by EN3, as shown by the example 1800 in
Let en1, en2, and en3 be the enable tokens of EN1, EN2, and EN3, respectively, and assume they are independent of each other. The enable token of EN3 can be further optimized by qualifying en3 with the OR of en1 and en2, i.e., EN3 should be disabled when both EN1 and EN2 are disabled. Therefore, the new en3 value can be calculated as follows:
$en3_{new} = en3 \wedge (en1 \vee en2)$
The algorithm adds the extra logic necessary to calculate en3new when the power saved justifies the extra overhead.
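The qualification of an enable token by its fan-in enable domains can be expressed compactly. In the Python sketch below, the function name `qualify_enable` is hypothetical and enable tokens are modeled as booleans, with True standing for UPDATE and False for NOUPDATE.

```python
def qualify_enable(en, fanin_enables):
    """Return the qualified enable: en AND (OR of the fan-in domain enables).

    en            -- original enable token of the domain (True means UPDATE)
    fanin_enables -- enable tokens of the fan-in enable domains
    """
    return bool(en) and any(fanin_enables)
```

For the three-domain example, `qualify_enable(en3, [en1, en2])` is True only when en3 is UPDATE and at least one of en1 and en2 is UPDATE, so the third domain stays idle when neither of its fan-in domains produces a new token.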
Such optimizations can be similar to Stability Condition Analysis in synchronous circuits, discussed in R. Fraer, et al. “A new paradigm for synthesis and propagation of clock gating conditions,” Design Automation Conference, 2008. DAC 2008. 45th ACM/IEEE, pp. 658-663 (June 2008), the entire contents of which are incorporated herein by reference. Embodiments of the present disclosure extend the same idea to the asynchronous realm and at a more coarse grain, i.e., enable domains as opposed to pipeline stages.
Once the enable domains are specified, one can use existing clustering algorithms (e.g., such as described in previously referenced U.S. Provisional Patent Application Ser. No. 61/047,714, filed 24 Apr. 2008 and entitled “Clustering and Fanout Optimizations of Asynchronous Circuits” to G. Dimou and/or as described in C. Wong, et al. “High-level synthesis of asynchronous systems by data-driven decomposition,” DAC 2003; the entire contents of both of which are incorporated herein by reference) to cluster modules in each enable domain to share controllers. The clustering algorithm used can optimize power consumption constrained to a given cycle time.
Once the enable domain optimization is complete, clusters within a region can be combined via existing clustering algorithms to trade off control logic overhead and achievable performance (see e.g., U.S. Provisional Patent Application Ser. No. 61/047,714). In particular, after clustering, the final netlist can be slack matched (adding clusters where necessary) using several known techniques to balance the asynchronous pipelines and achieve the desired performance (see e.g., P. A. Beerel, et al. “Slack matching asynchronous designs,” IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC'06), 2006, the entire contents of which are incorporated herein by reference).
Once the enable domain optimization and clustering is done, the modules described in VerilogCSP format can be implemented individually using any template that can implement the semantics of VerilogCSP. This means that the approach is general and is applicable to design styles that range from single-rail bundled-data implementations to QDI design-styles that use 1-of-N encoding to single-track implementations that use single-track handshaking as well as mixtures of these styles such as Sun's GasP implementation, e.g., as described in Ivan Sutherland and Scott Fairbanks, “GasP: A Minimal FIFO Control” Proceedings of the Seventh International Symposium on Advanced Research in Asynchronous Circuits and Systems, Salt Lake City, Utah, USA. 11-14 Mar. 2001. pp. 46-53. (IEEE 2001), the entire contents of which are incorporated herein by reference. In fact, different clusters can be implemented with different design styles assuming the handshaking interfaces between the clusters are compatible.
In addition, although the VerilogCSP description of the components implicitly models a full-buffer, half-buffer implementations will work equally well as long as the subsequent slack matching takes the specific performance characteristics of the half-buffer implementation into account.
Global Versus Local Evaluation of Enable Signal
Alternatively, it is possible for the NOUPDATE value to propagate locally through the entire domains and for the OR of enable domains to be computed at the boundary of enable domains. As described in the following section, an implementation of this alternative is the gated multi-level domino (“GMLD”) template, where the NOUPDATE value is captured in a dual-rail control signal between clusters.
Gated Multi-Level Domino (GMLD) Template
Exemplary embodiments of the present disclosure can be implemented for/with gated multi-level domino (“GMLD”) templates. A GMLD template is a gated version of the multi-level domino (“MLD”) template. For a GMLD template, the data path is largely unchanged, and the prime difference lies in the control path. GMLD seeks to exploit the availability of the enable pin on EDFFs in a synchronous circuit. This enable signal is used to disable affected GMLD stages, causing them not to evaluate if the data inputs do not change. This effect reduces dynamic switching power, and potentially can reduce the forward latency to a constant value. GMLD introduces an important distinction to the token-flow model of asynchronous computation: two varieties of tokens. One kind of token is a control token, which represents data flow without a re-evaluation of the data elements. The other is a data token, which is equivalent to a traditional asynchronous token. Control tokens preserve liveness and safeness of an asynchronous system, allowing GMLD stages to fire in correct sequence. The fundamental difference is that a control token always skips the evaluation phase of the data logic, whereas a data token always requires the evaluation of the data logic. Examples of such are described in previously noted and co-owned U.S. Provisional Patent Application Ser. No. 61/043,988, filed Apr. 10, 2008 and entitled “Gated Multi-Level Domino Template”, the entire contents of which are incorporated herein by reference.
For GMLD templates of exemplary embodiments of the present disclosure, the value of the dual-rail control signal is updated with additional gating logic at the boundary of clusters, rather than computed centrally. This is feasible because in each cluster, the control is always active but the datapath is only activated when new input data arrives. The control logic of the GMLD template adds the extra logic necessary to perform the optimizations discussed previously, i.e., it qualifies the enable token of the next stage with its own enable token.
Also, since GMLD uses dynamic logic gates, which can hold state, the use of explicit TL modules at the end of enable domains can be avoided. Instead, in the next computation cycle, the last domino stage of a GMLD stage holds the previous token value. The last domino stage gets precharged only if a new token has come in and the previous value is not needed anymore.
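The distinction between control tokens and data tokens, together with the state-holding last domino stage, can be illustrated with a purely behavioral Python model. The class `GmldStage`, its method names, and the string encoding of the token kinds are hypothetical; the sketch abstracts away handshaking, precharge timing, and the dual-rail control encoding entirely.

```python
class GmldStage:
    """Behavioral sketch of a gated stage (hypothetical model, not a circuit)."""

    def __init__(self, logic, initial=None):
        self.logic = logic    # function computing the stage output from its inputs
        self.held = initial   # value held by the last domino stage
        self.evaluations = 0  # counts how often the data logic actually evaluated

    def fire(self, kind, inputs=()):
        """Fire once on a 'data' token (evaluate) or a 'control' token (skip)."""
        if kind == "data":
            # New data arrived: the old value is no longer needed, so re-evaluate.
            self.held = self.logic(*inputs)
            self.evaluations += 1
        # A control token skips evaluation; the held value is presented again.
        return self.held
```

In this model the stage fires on every token, which preserves sequencing, but the evaluation counter increases only for data tokens, which is the source of the dynamic power saving described above.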
The GMLD template represents a specific embodiment in which the combinational logic is implemented with domino logic; however, other pre-charged and un-precharged logic, including single-rail logic, can also be used. The GMLD template is described in the form of a signal-transition graph (STG), for which many implementations, including ones with less concurrency, are feasible and known to a typical engineer trained in the art.
One skilled in the art will appreciate that embodiments and/or portions of embodiments of the present disclosure can be implemented in/with computer-readable storage media (e.g., hardware, software, firmware, or any combinations of such), and can be distributed and/or practiced over one or more networks. Steps or operations (or portions of such) as described herein, including processing functions to derive, learn, or calculate formulas and/or mathematical models utilized and/or produced by the embodiments of the present disclosure, can be processed by one or more suitable processors, e.g., central processing units (“CPUs”) implementing suitable code/instructions in any suitable language (machine dependent or machine independent).
While certain embodiments have been described herein, it will be understood by one skilled in the art that the techniques (methods, systems, and/or algorithms) of the present disclosure may be embodied in other specific forms without departing from the spirit thereof. Accordingly, the embodiments described herein, and as claimed in the attached claims, are to be considered in all respects as illustrative of the present disclosure and not restrictive.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 61/043,988, filed Apr. 10, 2008 and entitled “Gated Multi-Level Domino Template”, the entire contents of which are incorporated herein by reference.
Number | Name | Date | Kind |
---|---|---|---|
6658635 | Tanimoto | Dec 2003 | B1 |
6735743 | McElvain | May 2004 | B1 |
7120883 | van Antwerpen et al. | Oct 2006 | B1 |
7689955 | van Antwerpen et al. | Mar 2010 | B1 |
7694266 | Sankaralingam | Apr 2010 | B1 |
20060120189 | Beerel et al. | Jun 2006 | A1 |
20060190851 | Karaki et al. | Aug 2006 | A1 |
20070198238 | Hidvegi et al. | Aug 2007 | A1 |
20070253275 | Ja et al. | Nov 2007 | A1 |
20070256038 | Manohar | Nov 2007 | A1 |
20090271747 | Tanaka | Oct 2009 | A1 |
Number | Date | Country |
---|---|---|
WO 2007127914 | Nov 2007 | WO |
WO 2008078740 | Jul 2008 | WO |
Number | Date | Country |
---|---|---|
20090288058 A1 | Nov 2009 | US |
Number | Date | Country |
---|---|---|
61043988 | Apr 2008 | US |