The present description relates generally to flip-flops, and more particularly, but not exclusively, to improved latency/area/power flip-flops for high-speed CPU applications.
The state-of-the-art flow of designing Integrated Circuits (e.g., micro-chips) may include specifying the functionality of the chip in a standard hardware description language such as Verilog, synthesizing/mapping the circuit description into basic gates of a standard cell library using design compiler CAD tools (e.g., Synopsys' Design Compiler), placing and routing the gate netlist using IC compiler CAD tools (e.g., Synopsys' IC Compiler), and finally verifying proper connectivity (e.g., by using layout versus schematic (LVS) software) and functionality of the circuit. While these steps may be important for the final quality of the integrated circuit, for most of the steps, the achievable quality of implementation may be design dependent. For example, good Verilog code specifying a circuit A may not make an independent circuit B any better. However, an adequate standard cell library may improve all designs that use that standard cell library. In other words, the quality of the standard cell library used in designing a chip may have a far-reaching influence on the quality of the chip.
With the advent of technology scaling, higher and higher levels of integration may have become possible due to the shrinking device sizes. At the same time, the technology scaling may have provided not only an area scaling but also a delay scaling. According to Moore's Law, chips were doubling their speed every 18 months. While Moore's Law has been applicable for more than 20 years, the technology has reached a point where process scaling may no longer deliver the expected speed increases. This is mainly due to the fact that certain device parameters may have reached atomic scales. This trend can be clearly seen as the technology moves from 28 nm to 20 nm feature size. Similar trends are also foreseen by silicon vendors projecting not only for their current offerings of 20 nm but also for the future 14 nm technologies. As one of the consequences of this speed saturation due to technology scaling, designers may need to work harder at each stage of the design flow to squeeze out the last remaining circuit performance. In other words, even small speed improvements may come at significantly higher design effort than in the past. In particular, it may be more important than ever to have the best standard cell library possible, as this is one of those key ingredients that may influence many design efforts.
Certain features of the subject technology are set forth in the appended claims. However, for purpose of explanation, several embodiments of the subject technology are set forth in the following figures.
The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology may be practiced. The appended drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, it will be clear and apparent to those skilled in the art that the subject technology is not limited to the specific details set forth herein and may be practiced using one or more implementations. In one or more instances, well-known structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.
The master cell 120 may include an inverter 122 cross-coupled with an inverter 124 through a clock pass-gate 126. The master cell 120 may receive the input data D or the test data TI and may latch and provide at an input node 131 of the slave cell 130, an inverted replica of the input data D or the test data TI, upon a transition of the clock signal CLK to a logical high state (hereinafter “high”). The slave cell 130 may include a clock pass-gate 132 and an inverter 134 that is cross-coupled to an inverter 136 through a clock pass-gate 138. The slave cell 130 may receive the inverted replica of the input data D or the test data TI and may latch and provide at an output node Q of the slave cell 130, the input data D or the test data TI, upon the transition of the clock signal CLK to high.
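The latching behavior described above can be illustrated with a minimal behavioral sketch in Python. It is not a circuit-level model: it assumes the master cell is transparent while CLK is low and the slave cell is transparent while CLK is high, it ignores the master/slave timing overlap and the input multiplexer, and the class and method names are hypothetical.

```python
class MasterSlaveFlipFlop:
    """Level-sensitive sketch of master cell 120 and slave cell 130."""

    def __init__(self):
        self.node_131 = 1  # inverted replica held by the master (arbitrary power-up value)
        self.q = 0         # output node Q of the slave
        
    def step(self, clk, data):
        """Apply one clock level (0 or 1) together with the selected input (D or TI)."""
        if clk == 0:
            # Assumed: master is transparent while CLK is low and tracks
            # the inverted input at node 131; the slave holds Q.
            self.node_131 = 1 - data
        else:
            # Master holds its value; the slave passes the re-inverted
            # value to Q, so Q equals the input sampled before the rise.
            self.q = 1 - self.node_131
        return self.q


ff = MasterSlaveFlipFlop()
ff.step(0, 1)          # CLK low: master tracks D=1 (node 131 becomes 0)
q_rise = ff.step(1, 1) # CLK rises: Q takes the sampled value
q_hold = ff.step(1, 0) # D changes while CLK is high: Q is unaffected
```

In this sketch a change of the input while CLK is high does not reach Q until the next low-then-high sequence of the clock.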
The pass-gates 112, 114, 132 and the clock pass-gates 126 and 138 may be substantially similar and may be implemented in CMOS. The pass-gates 126, 132, and 138 may be controlled by the CLK signal and a CLKB signal, which is an inverted replica of the CLK signal. The inverters 122, 124, 134, and 136 may be substantially similar and may be implemented in CMOS. The clock generator circuit 140 may be implemented by a NAND-gate 142 and an inverter 144 and may provide the TIEN and TIENB signals based on the TE signal and the CLKB signal. The clock generator circuit 150 may be implemented by a NOR-gate 152 and an inverter 154 and may provide the DEN and DENB signals based on the TE signal and the CLK signal. In the pass-gate multiplexer 110, the data input D may be selected when the TE signal is at a logical low state (hereinafter “low”). This input then may be sampled on the rising edge of the CLK signal, producing an output (e.g., an output Q of the flip-flop) on the output node Q of the slave cell 130. The output node Q may be maintained stable until a new clock signal arrives and a possible new value is written into the flip-flop 100A. When the flip-flop 100A is in a scan-mode, the TE signal is high and the selected input is TI. This signal then follows the same timing path, producing an output on the output node Q. For normal operation, a low TE signal may be of interest. This mode may be the one that determines the minimum latency of the flip-flop, and ultimately the chip's maximum operating frequency.
The low latency of the flip-flop 100A may result from deletion of a pass-gate (e.g., similar to 132) from master cell 120, which is present in conventional scan flip-flops. The deletion of the pass-gate from master cell 120 is made possible by the design of the clock generator circuits 140 and 150, which allows combining the functionality of the deleted pass-gate with the pass-gate multiplexer 110. The TE and CLK/CLKB signals are combined to provide encoded select signals (e.g., DEN, DENB, TIEN, TIENB) for the pass-gate multiplexer 110. The deletion of the pass-gate from the master cell 120 not only reduces the latency but may also save on the area and power consumption of the flip-flop. This is significant in view of the fact that flip-flops, in particular scan-able flip-flops, may represent about 30-40% of the logic area of many chips. At the same time, for high-speed applications such as Arm/MIPS CPU designs, the latency of the flip-flops (e.g., a setup time+a clock-to-Q time) may represent up to 20% of the cycle time. Therefore, the improved latency and area and power saving by the disclosed flip-flops may result in significant improvement in the latency, area and power consumption of the chips using the subject flip-flops.
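The select-signal encoding performed by the clock generator circuits 140 and 150 can be checked with a short truth-table sketch. The sketch assumes that each pass-gate conducts when its P device sees a 0 and its N device sees a 1, that DENB is the output of NOR-gate 152 and DEN the output of inverter 154, and, by analogy, that TIEN is the output of NAND-gate 142 and TIENB the output of inverter 144. The function names are hypothetical.

```python
def inv(a):
    return int(not a)

def nand(a, b):
    return int(not (a and b))

def nor(a, b):
    return int(not (a or b))

def select_signals(te, clk):
    """Encoded select signals for pass-gate multiplexer 110 (assumed polarities)."""
    clkb = inv(clk)
    tien = nand(te, clkb)  # assumed: NAND-gate 142 output, drives the P device of 114
    tienb = inv(tien)      # inverter 144, drives the N device of 114
    denb = nor(te, clk)    # assumed: NOR-gate 152 output, drives the N device of 112
    den = inv(denb)        # inverter 154, drives the P device of 112
    return {"DEN": den, "DENB": denb, "TIEN": tien, "TIENB": tienb}

def pass_gate_open(p_gate, n_gate):
    """A CMOS pass-gate conducts when the P gate is 0 and the N gate is 1."""
    return p_gate == 0 and n_gate == 1

def selected_input(te, clk):
    """Return 'D', 'TI', or None (both pass-gates closed) for a TE/CLK level."""
    s = select_signals(te, clk)
    d_open = pass_gate_open(s["DEN"], s["DENB"])
    ti_open = pass_gate_open(s["TIEN"], s["TIENB"])
    if d_open and ti_open:
        raise ValueError("both multiplexer inputs selected")
    return "D" if d_open else "TI" if ti_open else None

for te in (0, 1):
    for clk in (0, 1):
        print(f"TE={te} CLK={clk} -> {selected_input(te, clk)}")
```

Under these assumptions the multiplexer passes D only when TE and CLK are both low, passes TI only when TE is high and CLK is low, and is closed whenever CLK is high, which is how the multiplexer can take over the role of the deleted master pass-gate.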
Another benefit of elimination of the pass-gate from the master cell 120 is that in the flip-flop 100B there is a timing overlap between the master cell 120 and the slave cell 130 that allows a reduced set-up time as the data input D can feed-through directly to the output node Q of the flip-flop. The amount of this overlap may be determined by the arrival of signals DENB/DEN to the pass-gate 112. It is known that N-type gates drive 0 signals well, while P-type gates drive 1 signals well. For example, a proper fully-restoring CMOS gate has a P-transistor pull-up (not an N-type) to drive the output to full 1 level (e.g., supply voltage VDD) and an N-transistor pull-down (not a P-type) to drive the output to a full 0 level (e.g., ground potential GND). Thus, when pass-gate 112 is opening, a 0 is driven mostly through the N transistor controlled by DENB signal and a 1 is driven through the P transistor controlled by the DEN signal. However, because of the inversion delay of DEN (see clock generator circuit 150), signal DENB always arrives early to the pass-gate 112, resulting in lesser master/slave timing overlap for the case when D=0 is written into the flip-flop. At the same time, when D=1 is written to the flip-flop, the late arrival of DEN may allow more timing overlap (which benefits latency).
To make the point clearer, a comparison can be made between the cases when a D=1 and a D=0 are written to the flip-flop 100A (e.g., no longer being driven through the pass-gate 112) for the improved flip-flop 100A versus an existing version. For this, we may compare the rise of the signal CLK to the rise of the DEN signal (controlling D=1 being written) and the fall of the CLKB signal to the fall of the DENB signal (controlling D=0 being written). For D=1, the clock signal CLK arrives two logic stages earlier than the DEN signal (e.g., NOR-gate 152+inverter 154). This way, the writing of D=1 benefits from the master/slave timing overlap. On the other hand, for D=0, the only delay difference between the CLKB signal and the DENB signal may be due to the type of gate being used (e.g., NAND-gate versus a NOR-gate such as 152), with no delay due to logic depth. Therefore, the writing of D=0 may not benefit as much from the slave/master timing overlap. As a result, writing a 0 to the flip-flop 100A may be substantially slower than writing the corresponding 1. This then may manifest itself on the critical path of the flip-flop and adversely affect the timing efficiency of the flip-flop 100A. A further improvement in the flip-flop clock generator circuit 150, as described below, can totally resolve this issue.
Note that this change now delays the controlling signal for writing a D=0 by two logic stage delays (e.g., 154 and 162) compared to the case of flip-flop 100A, and makes it comparable to writing of a D=1. This rebalancing of the overlap window may speed up writing D=0 as well. An implementation of the flip-flop 100B and the associated clock generation circuits 140 and 160 in layout was characterized and used to synthesize and place and route a large block. The results showed that indeed, flip-flop 100B is superior in speed to the flip-flop 100A, which in turn is significantly faster than existing scan flip-flops.
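The stage-count argument for the overlap window can be made concrete with a unit-delay sketch. It assumes every gate contributes one equal stage of delay, that CLKB is produced from CLK by a single inverter, and that in flip-flop 100B the N device of pass-gate 112 is controlled by a signal (called DENBB here) obtained by passing DEN through one further inverter 162. These are modeling assumptions for illustration, not characterized delays.

```python
# Logic stages between the CLK edge and each control signal (unit-delay model).
PATHS = {
    "CLK": [],
    "CLKB": ["clock inverter (assumed)"],
    "DENB": ["NOR-gate 152"],
    "DEN": ["NOR-gate 152", "inverter 154"],
    "DENBB": ["NOR-gate 152", "inverter 154", "inverter 162 (assumed)"],
}

def stages(signal):
    """Number of logic stages between the CLK edge and the given signal."""
    return len(PATHS[signal])

def overlap_window(variant):
    """Master/slave overlap, in logic stages, for writing D=1 and D=0."""
    n_control = "DENB" if variant == "100A" else "DENBB"
    return {
        # D=1 is driven through the P device of pass-gate 112 (controlled by DEN),
        # compared against the rise of CLK.
        "D=1": stages("DEN") - stages("CLK"),
        # D=0 is driven through the N device, compared against the fall of CLKB.
        "D=0": stages(n_control) - stages("CLKB"),
    }

print("100A:", overlap_window("100A"))
print("100B:", overlap_window("100B"))
```

In this simplified model the D=0 window of flip-flop 100A is zero stages while its D=1 window is two stages; adding the two extra stages in flip-flop 100B brings the D=0 window up to the same two stages.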
The non-pass-gate circuit 210 includes a non-pass-gate multiplexer 215 and an inverter 220. The non-pass-gate multiplexer 215 includes P-transistors (e.g., PMOS) T1-T4 and N-transistors (e.g., NMOS) T5-T8. The transistors T1-T2 and T5-T6 can control test input TI and the transistors T3-T4 and T7-T8 can control data input D. For example, P-transistors T3-T4 can pull a signal at node 212 to a logical high state when both the DEN signal and the input data D are at a logical low state, and N-transistors T7-T8 can pull the signal at node 212 to a logical low state when both the DENBB signal and the input data D are at a logical high state. The inverter 220 can be pushed through the circuit to the output of the scan flip-flop as described below. This may help in generating higher-drive strength flip-flop cell variants efficiently.
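The pull-up and pull-down conditions at node 212 can be sketched as two inverting tri-state branches whose outputs are tied together. The sketch reads the data branch as a P stack (T3-T4) enabled by DEN low and an N stack (T7-T8) enabled by DENBB high. The control signals of the test branch are not named in the passage above, so TIEN and TIENB are used here purely as assumed names, and the function names are hypothetical.

```python
def inverting_branch(data, p_enable, n_enable):
    """One branch of multiplexer 215: returns 1, 0, or None (not driving)."""
    if p_enable == 0 and data == 0:
        return 1      # both series P-transistors conduct: pull node 212 high
    if n_enable == 1 and data == 1:
        return 0      # both series N-transistors conduct: pull node 212 low
    return None       # neither stack conducts

def node_212(d, ti, den, denbb, tien, tienb):
    """Resolve node 212 from the data branch and the (assumed) test branch."""
    drivers = [
        inverting_branch(d, den, denbb),     # T3-T4 and T7-T8
        inverting_branch(ti, tien, tienb),   # T1-T2 and T5-T6 (assumed controls)
    ]
    values = {v for v in drivers if v is not None}
    if len(values) > 1:
        raise ValueError("contention at node 212")
    return values.pop() if values else None

# Data branch enabled (DEN=0, DENBB=1), test branch disabled (TIEN=1, TIENB=0).
print(node_212(d=1, ti=0, den=0, denbb=1, tien=1, tienb=0))
print(node_212(d=0, ti=0, den=0, denbb=1, tien=1, tienb=0))
```

With the data branch enabled, node 212 carries the inverse of D, which is consistent with the multiplexer being followed by the inverter 220.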
In the layout 5000, the clock generator element 540 is implemented in double height so that the width of the clock generator element 540 does not need to be matched to that of the data elements 522, 524, 526, and 528, resulting in a more compact layout. At the same time, the layout design may share a common power supply rail VDD that can eliminate launch-to-capture voltage variations, a phenomenon that can occur for randomly placed flip-flops operating on independent VDD rails. Also, the close proximity of these circuits may eliminate global variability, something that may deteriorate the speed of randomly placed flip-flops. For larger clusters, the data element pairs may be added alternating between the left and right of the presented structure (e.g., layout 500C), keeping the design as symmetric as possible with reference to the clock generator element 540. This may ensure close to equal-length clock wires, which can further reduce variability and mismatch.
Besides the area saving of the flip-flop cluster, the other essential factor in the usefulness of the disclosed design is the amount of “state coverage” these flip-flops provide in an actual implementation. The term “state coverage” may be defined as the percentage of the clustered flip-flops that are being picked up by the synthesis/P&R tools. The described family of flip-flop clusters was tried on various circuits, and it was confirmed experimentally that the “state coverage” is about 80% and may reduce to approximately 65% at the highest speed (e.g., due to requirement of larger and more diverse drive strength at higher speeds). This may result in about 10% area and leakage power savings at block level. This experimental result can be anticipated via the following hand calculation. With a given original area of 1, after applying the flip-flop cluster cells the new area is reduced to 0.65 (logic cells that are not scan flip-flops)+0.35 (scan flip-flops)*(0.2 (not covered)+0.8 (state coverage)*0.7 (average area reduction))=0.916, which shows about 8% area reduction compared to the base case. The hand-calculated result is close to the experimentally observed 10%.
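The hand calculation can be reproduced with a few lines of Python. The sketch interprets the 0.7 factor as the ratio of clustered to unclustered flip-flop area and applies it only to the covered flip-flops, which is the reading that yields the stated 0.916. The function and parameter names are hypothetical.

```python
def block_area_after_clustering(ff_fraction=0.35,
                                state_coverage=0.8,
                                cluster_area_ratio=0.7):
    """Normalized block area (original area = 1) after using cluster cells."""
    other_logic = 1.0 - ff_fraction                  # cells that are not scan flip-flops
    uncovered = ff_fraction * (1.0 - state_coverage) # flip-flops left unclustered
    covered = ff_fraction * state_coverage * cluster_area_ratio
    return other_logic + uncovered + covered

area = block_area_after_clustering()
print(f"new area = {area:.3f}, reduction = {(1 - area) * 100:.1f}%")
```

Calling the same function with state_coverage=0.65 gives a normalized area of roughly 0.93, illustrating how the saving shrinks as the coverage drops at the highest speeds.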
At operation block 820, the master cell may be formed by cross-coupling a first inverter (e.g., 122 of
At operation block 830, the slave cell may be formed by coupling a second clock pass-gate (e.g., 132 of
At operation block 840, control signals (e.g., DEN, DENB, TIEN, and TIENB of
At operation block 920, each inverting data cell (e.g., 460 of
At operation block 930, a plurality of non-inverting data cells (e.g., 460 of
At operation block 950, the pass-gate multiplexer may be configured to selectively allow passage of one of the input data or the test data to an output node of the pass-gate multiplexer. At operation block 960, the clock generator cell may be configured to generate control signals to control operation of the pass-gate multiplexer.
Those of skill in the art would appreciate that the various illustrative blocks, modules, elements, components, and methods described herein may be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, and methods have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application. Various components and blocks may be arranged differently (e.g., arranged in a different order, or partitioned in a different way) all without departing from the scope of the subject technology.
As used herein, the phrase “at least one of” preceding a series of items, with the term “and” or “or” to separate any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item). The phrase “at least one of” does not require selection of at least one of each item listed; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items, and/or at least one of each of the items. By way of example, the phrases “at least one of A, B, and C” or “at least one of A, B, or C” each refer to only A, only B, or only C; any combination of A, B, and C; and/or at least one of each of A, B, and C.
A phrase such as “an aspect” does not imply that such aspect is essential to the subject technology or that such aspect applies to all configurations of the subject technology. A disclosure relating to an aspect may apply to all configurations, or one or more configurations. An aspect may provide one or more examples of the disclosure. A phrase such as an “aspect” may refer to one or more aspects and vice versa. A phrase such as an “embodiment” does not imply that such embodiment is essential to the subject technology or that such embodiment applies to all configurations of the subject technology. A disclosure relating to an embodiment may apply to all embodiments, or one or more embodiments. An embodiment may provide one or more examples of the disclosure. A phrase such as an “embodiment” may refer to one or more embodiments and vice versa. A phrase such as a “configuration” does not imply that such configuration is essential to the subject technology or that such configuration applies to all configurations of the subject technology. A disclosure relating to a configuration may apply to all configurations, or one or more configurations. A configuration may provide one or more examples of the disclosure. A phrase such as a “configuration” may refer to one or more configurations and vice versa.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” or as an “example” is not necessarily to be construed as preferred or advantageous over other embodiments. Furthermore, to the extent that the term “include,” “have,” or the like is used in the description or the claims, such term is intended to be inclusive in a manner similar to the term “comprise” as “comprise” is interpreted when employed as a transitional word in a claim.
All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. §112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.”
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more. Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. Headings and subheadings, if any, are used for convenience only and do not limit the subject disclosure.