The present invention relates to the harvesting of electrostatic energy from transient on-chip data to improve the energy efficiency of digital CMOS circuit operation.
In recent years, deep neural networks (DNNs) have become the solution of choice for many AI applications that implement machine learning methods, including computer vision, speech recognition and robotics. While these neural networks deliver sufficient accuracy, that accuracy comes at the cost of high computational complexity, and the associated power drain limits the deployment of deep learning on mobile devices with limited energy budgets. Smartphones, for example, cannot run object classification with AlexNet in real time for more than an hour [1]. Network latency, bandwidth and availability constraints could require battery-powered or ambient-powered IoT devices on the edge not only to sense and act without communicating with the cloud but also to take on more computationally intensive tasks such as learning or training a neural network. Neural networks for a myriad of IoT devices [2] can easily result in enormous model sizes that become computationally burdensome to the devices' energy resources, demanding energy budgets that exceed what batteries and conventional energy harvesting methods can provide. Even where power is abundantly available, as in a data center supporting AI workloads where GPU accelerators consume as much as 400 W [3], the cost of electricity and the performance limits imposed by heat removal efficiency can be improved by lowering the switching (or dynamic) energy consumption of digital CMOS circuits.
In many applications requiring high-speed CMOS circuit operation, precharged dynamic circuit techniques are preferred. These circuits typically precharge output nodes to the supply voltage during a precharge phase every clock cycle and conditionally discharge some of them, depending on the inputs, during the evaluation phase. These techniques are energy inefficient since all of the charge discarded to the reference ground potential during the evaluation phase must be resupplied during the precharge phase of the next clock cycle. High peak currents can also produce large di/dt noise, causing voltage bumps in the power rails with associated risks to signal integrity and reliability in high-performance CMOS components.
Dynamic logic circuits that recycle some of the charge were proposed to improve the energy efficiency of circuit operation [4-6]. These circuit techniques precharge complementary outputs to half VDD by charge sharing from the previous evaluation state, enabling at most a 50% reduction in energy. Such schemes are relevant only when complementary signal pairs are used to implement complex logic functions. Moreover, much of the charge recycling benefit is lost, and performance is degraded as well, due to (i) high overheads in device count; (ii) the requirement for complementary inputs and as many as 2-3 clock and enable inputs to each logic gate, with their associated routing, performance and power overheads; (iii) the use of cross-coupled inverters as output drivers, which increases the uncertainty of gate metrics in the presence of parameter variations and the offsets they develop; and (iv) degraded gate overdrive during the evaluation phase, since only a half-VDD gate-source voltage is precharged onto the output nodes and onto the input nodes of the output drivers for charge recycling operation. In one comparison [4] with static CMOS full adders, the charge-recycling full adder has a power-delay product, and hence a total energy, nearly 10% higher than that of static CMOS. Finally, neural network energy consumption is dominated by the movement of data across the memory hierarchy and the chip [7-8], not by dissipation from computation.
On-chip small voltage swing signaling schemes [9] have attempted charge recycling by stacking components (such as logic and clocking circuits) with predictable data switching activities in two adjacent voltage domains, using simple push-pull regulators to balance current between the two domains and maintain the voltage at their interface. This approach could deliver at most a quadratic reduction in power. The inefficiency introduced by voltage regulation is eliminated if the current between the domains is matched. An approach that stacks voltage domains without regulators between them has been reported [10], using a balanced charge recycling bus in which differing data activity between two links is compensated by periodically swapping data between them so that switching activity along the bus is exactly matched. These schemes, however, are difficult to implement and also require the circuits in the domains to be powered at reduced operating voltages.
Charge recycling techniques have been reported in which the flow of electric charge from the supply rail (VDD) to ground is traced through more than one circuit or use across multiple voltage domains [11]. However, there is no energy advantage from recycling the charge through multiple voltage domains, since it costs as much energy to raise the charge to the highest voltage domain as it does, cumulatively, to supply each of the stacked domains operating independently. The energy advantage of stacking voltage domains lies only in removing the inefficiencies of on-chip voltage regulation from VDD down to the much lower voltages at which these domains would be powered to benefit from quadratic reductions in their switching power. If the current between domains is not matched, the energy overhead consumed by regulators attempting to hold the domain interface at a fixed voltage could diminish the quadratic energy improvements from operating each domain at a reduced voltage.
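The energy equality stated above can be made explicit with a short worked example. The symbols N (number of stacked domains) and Q (charge used once by each domain per operation) are introduced here purely for illustration, and the supplies feeding the independent domains are assumed ideal and lossless.

```latex
\begin{align*}
E_{\text{stacked}} &= Q\,V_{DD}
  && \text{(charge $Q$ drawn once from $V_{DD}$ and passed through all $N$ domains)} \\
E_{\text{independent}} &= \sum_{k=1}^{N} Q\,\frac{V_{DD}}{N} = Q\,V_{DD}
  && \text{(each domain draws $Q$ from its own ideal $V_{DD}/N$ supply)}
\end{align*}
```

Under these assumptions the two totals are identical, so any saving from stacking can come only from avoiding regulator losses.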
Non-resonant approaches to returning/recycling energy stored on load capacitance include the use of an inductor to discharge the load capacitor of a clock network to the power grid instead of to ground [12]. However, the overheads of inductors and decoupling capacitors, integration with clock gating (and its accompanying overheads), and applicability limited to large clock load capacitances (as seen in a clock mesh) make this approach impractical and difficult to implement.
Smaller voltage transitions for each logic operation using ‘recycled charge’ also come with the disadvantages of smaller margins and lower performance. In multiple instances, these make implementations impractical. For example, in [13-14], a voltage smaller than VDD is applied across a bit line (BL) pair during an SRAM Write operation to lower the energy dissipated per Write. By sharing/recycling charge across a set of BL pairs, Writes are attempted with smaller voltage swings on the BL instead of the full rail-to-rail BL swings of a conventional SRAM Write. For small-geometry devices, however, it becomes harder to write to the bitcell [15] even with the full supply voltage across a bit line pair, owing to the increasing electrical variability of small-geometry bitcell transistors. The full CMOS transmission gates needed to move charge between columns also come at a significant cost in area, control and performance.
Adiabatic switching in reversible logic circuits moves charge from the power supply to a load capacitance using slow, constant-current charging, ideally without energy dissipation [16-17]. It enables the recycling of energy, reducing the total energy drawn from the power supply, by reversing the current source using non-standard AC or pulsed power supplies with time-varying voltage or current. In sharp contrast to conventional CMOS circuit operation, charge and energy are not discarded after being used only once; the pulsed or sinusoidal power supplies are designed to retrieve the energy fed back to them. Three problem areas limit the realization of practical low-power operation of CMOS chips using adiabatic or reversible logic techniques: (1) the energy-efficient design of the combined power supply and clock generator; (2) the logical overhead needed to support reversible logic functions [16]; and (3) industry's preference for the alternative of scaling operating voltages with feature size while improving performance, which offers conceptual simplicity and a high payback in lower power dissipation.
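The benefit of slow constant-current charging can be seen from the standard first-order estimate below. The symbols R (resistance of the charging path), C (load capacitance) and T (charging time) are assumptions introduced for this sketch, which holds for T much longer than RC.

```latex
\begin{align*}
I &= \frac{C\,V_{DD}}{T}
  && \text{(constant current that delivers charge $C\,V_{DD}$ in time $T$)} \\
E_{\text{diss}} &= I^{2} R\,T = \frac{RC}{T}\,C\,V_{DD}^{2}
  && \text{(energy dissipated in $R$ during the ramp)} \\
E_{\text{conv}} &= \tfrac{1}{2}\,C\,V_{DD}^{2}
  && \text{(dissipation when charging abruptly from a fixed $V_{DD}$ rail)}
\end{align*}
```

The dissipated energy falls in proportion to RC/T as the ramp is lengthened, which is why this style of charging trades speed for energy.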
Conventional CMOS operation as illustrated by an inverter driving a capacitive load C and which draws energy equal to CVDD
In the proposed invention, an inverter driving the same capacitive load C as the above conventional CMOS inverter, draws energy equal to CVDD
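$$\int I_{VDD}(t)\,V_{DD}\,dt = \int_{V_{SS}}^{V_{DD}} C_{out}\,V_{DD}\,dV_{out} = C_{out}V_{DD}^{2} \tag{1}$$
$$\int I_{VDD}(t)\,V_{out}(t)\,dt = \int_{V_{SS}}^{V_{DD}} C_{out}\,V_{out}\,dV_{out} = \tfrac{1}{2}C_{out}V_{DD}^{2} \tag{2}$$
$$\int I_{VSS}(t)\,V_{out}(t)\,dt = \int_{V_{DD}}^{V_{SS}} C_{out}\,V_{out}\,dV_{out} = \tfrac{1}{2}C_{out}V_{DD}^{2} \tag{3}$$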
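The three energy integrals can be checked numerically with a minimal sketch. It assumes VSS = 0, models each conducting transistor as a fixed resistance `r_on` (a simplification not taken from the text), and reports energy magnitudes. The function names and parameter values are hypothetical.

```python
def charging_energies(c_out, v_dd, r_on, steps=100_000, n_tau=20.0):
    """Forward-Euler integration of a 0 -> VDD output transition.

    Returns (energy drawn from the VDD rail, energy stored on c_out),
    i.e. the left-hand sides of Eq. (1) and Eq. (2).
    """
    dt = n_tau * r_on * c_out / steps    # simulate n_tau RC time constants
    v_out = 0.0
    e_supply = 0.0
    e_stored = 0.0
    for _ in range(steps):
        i_vdd = (v_dd - v_out) / r_on    # current from the VDD rail
        e_supply += i_vdd * v_dd * dt    # integrand of Eq. (1)
        e_stored += i_vdd * v_out * dt   # integrand of Eq. (2)
        v_out += i_vdd * dt / c_out
    return e_supply, e_stored


def discharging_energy(c_out, v_dd, r_on, steps=100_000, n_tau=20.0):
    """Forward-Euler integration of a VDD -> 0 output transition.

    Returns the magnitude of the energy discarded to ground, Eq. (3).
    """
    dt = n_tau * r_on * c_out / steps
    v_out = v_dd
    e_discarded = 0.0
    for _ in range(steps):
        i_vss = v_out / r_on             # current into the ground rail
        e_discarded += i_vss * v_out * dt
        v_out -= i_vss * dt / c_out
    return e_discarded


if __name__ == "__main__":
    c, v, r = 10e-15, 1.0, 1e3           # assumed: 10 fF load, 1 V supply, 1 kOhm
    e_sup, e_sto = charging_energies(c, v, r)
    e_dis = discharging_energy(c, v, r)
    print(e_sup / (c * v * v), e_sto / (c * v * v), e_dis / (c * v * v))
```

In this model the results do not depend on the value chosen for `r_on`: half of the energy drawn from the rail is lost while charging, and the stored half is discarded on the following discharge.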
200 is an illustration of the time dependent voltage waveforms of the output node OUT (102 in
The waveform of current flow 206 into the inverter from the power rail at voltage VDD (106 in
In
The NAND gate 302 in this schematic generates an active low pulse at its output node 306 whose leading edge is triggered by a 0→1 transition at the input 308 and whose trailing edge is triggered by a 1→0 transition at the output node 310 loaded with a total capacitance Cout 312.
The leading edge of this active low pulse turns on PFET P2 314 which drives charge from the output node at logic ‘1’ and voltage VDD to be harvested on the common grid/node V2 316 (typically at a voltage between VSS and VDD and preferably at a voltage comparable to or lower than the logic threshold of the NAND gate 302).
The leading edge of the active low pulse at the output of the NAND gate 306, when delayed and inverted to drive the gate input 318 of NFET N1 326, turns on NFET N1 326 to begin discharging the output 310 to VSS as the output voltage at node OUT 310 approaches V2. Note that a design requirement on the logic threshold voltage of the NAND gate 302 is that it be higher than the voltage to which node V2 would typically be raised by harvested charge, or which it would reach in dynamic equilibrium when the rates of charge transfer to and from the common grid/node are balanced. Thus node OUT 310, while being discharged toward V2 through PFET P2 314, can trip the NAND 302 to produce the trailing low→high transition of the active low pulse at the output of the NAND gate 306, turning off P2 314.
The NAND 302 would also trip when the N-channel FET N1 326 begins conducting, that is, after the leading edge of the active low pulse output from the NAND has been delayed and inverted by the inverter 304, whose output turns on N1 326.
The output continues to be discharged toward VSS, the reference ground terminal 322, while N1 320 is turned on. The trailing edge of the active high pulse driving the gate input terminal of the N-channel FET N1 326 turns this FET, N1 320, off. A small-geometry keeper HVT NFET 328 then holds the output at VSS. Its gate input is driven by the inverter input 308, its source terminal is connected to the reference ground voltage rail 322 at voltage VSS=0V, and its drain terminal is connected to OUT 310.
The trailing edge of the active low pulse at the output of the NAND 306 is triggered by the transition at the output node from VDD toward V2, since the logic threshold of the NAND 302 is higher than the voltage to which node V2 316 is typically charged by harvested charge. The trailing edge is thus triggered by this feedback from OUT 310 to the input of the NAND 302.
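The discharge sequence described above can be illustrated with a simplified behavioral sketch. It is not a circuit-level model: P2 and N1 are treated as fixed resistances, the harvest node V2 is held at a constant voltage, the NAND is reduced to an ideal threshold comparison, and N1 is turned on after a fixed delay. All names and parameter values are hypothetical.

```python
def harvested_fraction(v_dd=1.0, v2=0.4, v_th=0.6, c_out=10e-15,
                       r_p2=1e3, r_n1=1e3, t_delay=5e-12,
                       dt=1e-15, t_end=200e-12):
    """Forward-Euler model of one 1 -> 0 transition at node OUT.

    Returns (fraction of the total charge c_out * v_dd delivered to the
    harvest node V2, final voltage at OUT).
    """
    v_out = v_dd
    q_harvest = 0.0
    p2_on = True                   # leading edge of the NAND pulse turns P2 on
    t = 0.0
    while t < t_end:
        # OUT falling below the NAND logic threshold ends the pulse: P2 off.
        if p2_on and v_out <= v_th:
            p2_on = False
        # Delayed, inverted pulse turns N1 on (assumed to stay on until t_end).
        n1_on = t >= t_delay
        i_p2 = (v_out - v2) / r_p2 if p2_on else 0.0   # charge to V2
        i_n1 = v_out / r_n1 if n1_on else 0.0          # charge to VSS
        q_harvest += i_p2 * dt
        v_out -= (i_p2 + i_n1) * dt / c_out
        t += dt
    return q_harvest / (c_out * v_dd), v_out


if __name__ == "__main__":
    frac, v_final = harvested_fraction()
    print(f"harvested fraction = {frac:.3f}, final OUT = {v_final:.2e} V")
```

In this model the harvested fraction cannot exceed (VDD − v_th)/VDD, because P2 conducts only while OUT is above the NAND threshold. Turning N1 on earlier (a shorter `t_delay`) diverts more of that charge to ground in exchange for a faster transition.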
The proposed circuit (1) maintains rail-to-rail operation, (2) drives practically the same waveforms at its output as a conventional inverter, and (3) harvests about 25%-40% of the total charge it discharges from its output 310 to the harvest grid node V2 316, instead of discharging all of that charge to the reference ground supply rail 322. The primary overhead in area is consumed by the PFET P2 in
The NAND gate 302 and the delay element 304 can be optimized to maximize the energy harvested at the grid/node from the output node of the inverter, according to the voltage at which the harvested charge is typically held when using the proposed inverter. The closer the voltage of the harvested charge at V2 316 is to VDD, the higher the optimal logic threshold voltage of the NAND gate 302 should be (to avoid reverse current flow from the harvest grid node to the output node of the inverter), and the shorter the delay of the delay element 304 needs to be to minimize the delay overhead of accomplishing the same 1→0 transition at the output of the inverter. This optimization is especially useful when operating at low, near-threshold voltages.
400 is an illustration of the time dependent voltage waveforms of the output node OUT (310 in
The waveform of current flow 406 into the inverter from the VDD power rail (324 in
Note that the voltage waveform at the output node 310 in
The switching energy consumed by logic gates with low fanout (<4) is typically small. Gates driving a high fanout (>10) and/or long wires consume more energy and are the best candidates for the proposed scheme, which harvests charge from these large loads as they are discharged.
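As an illustration of this selection rule, the sketch below filters a gate list by fanout or wire load. The `Gate` record, the function name and the 50 fF wire-capacitance cutoff are hypothetical; only the fanout threshold of 10 comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    name: str
    fanout: int
    wire_cap_ff: float   # estimated wire load in fF (hypothetical field)


def harvest_candidates(gates, min_fanout=10, min_wire_cap_ff=50.0):
    """Return names of gates whose load is large enough to justify harvesting.

    A gate qualifies if its fanout exceeds min_fanout or its estimated wire
    capacitance reaches min_wire_cap_ff (an assumed stand-in for "long wires").
    """
    return [g.name for g in gates
            if g.fanout > min_fanout or g.wire_cap_ff >= min_wire_cap_ff]


if __name__ == "__main__":
    netlist = [
        Gate("u_nand_local", fanout=2, wire_cap_ff=1.5),
        Gate("u_clk_buf", fanout=32, wire_cap_ff=20.0),
        Gate("u_bus_drv", fanout=3, wire_cap_ff=120.0),
    ]
    print(harvest_candidates(netlist))
```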
The transistor count increases in the proposed schematic shown in
Although illustrative embodiments of the present invention have been described herein, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope of the invention.
[4] B. S. Kong, J. S. Choi, S. J. Lee, and K. Lee, “Charge Recycling Differential Logic (CRDL) for Low Power Application,” IEEE JSSC, vol. 31, no. 9, 1996.
[5] S. Y. Cheo, G. A. Rigby, and G. R. Hellestrand, “Half-Rail Differential Logic,” ISSCC Dig. Tech. Papers, February 1997, pp. 420-42
[6] J. Lee, J. Park, B. Song, and W. Kim, “Split-level Precharge Differential Logic: A New Type of High-Speed Charge Recycling Differential Logic,” IEEE JSSC, vol. 36, no. 8, August 2001, pp. 1276-1280.
[7] Tien-Ju Yang et al., “Designing Energy Efficient Convolutional Neural Networks using Energy-aware Pruning,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Y.-H. Chen et al., “Understanding the Limitations of Existing Energy-Efficient Design Approaches for Deep Neural Networks,” SYSML'18, February 2018, Stanford, Calif., USA.
[9] Y. Liu et al., “A 0.1pJ/b 5-10 Gb/s Charge-Recycling Stacked Low-Power I/O for On-Chip Signaling in 45 nm CMOS SOI,” ISSCC Dig. Tech. Papers, pp. 400-401, 2013.
[10] J. Wilson et al., “A 6.5-to-23.3fJ/b/mm Balanced Charge-Recycling Bus in 16 nm FinFET CMOS at 1.7-to-2.6 Gb/s/wire with Clock Forwarding and Low-Crosstalk Contraflow Wiring,” ISSCC Dig. Tech. Papers, pp. 156-157, 2016.
[11] S. Rajapandian et al., “Energy-Efficient Low-Voltage Operation of Digital CMOS Circuits through Charge-Recycling,” 2004 Symp. on VLSI Circuits, pp. 330-331.
[12] M. Alimadadi, S. Sheikhaei, G. Lemieux, S. Mirabbasi, W. Dunford, and P. Palmer, “A 4 GHz Non-Resonant Clock Driver with Inductor-Assisted Energy Return to Power Grid,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, pp. 2099-2108, August 2010.
[13] K. Sim, H. Mahmoodi, and K. Roy, “A Low-Power SRAM Using Bit-Line Charge-Recycling,” IEEE Journal of Solid-State Circuits, vol. 43, no. 2, February 2008.
[14] B. D. Yang, “A Low-Power SRAM Using Bit-Line Charge-Recycling for Read and Write Operations,” IEEE Journal of Solid-State Circuits, vol. 45, no. 10, October 2010.
[15] A. Bhavnagarwala et al., “Fluctuation Limits and Scaling Opportunities for CMOS SRAM Cells,” Tech. Dig. IEDM 2005, pp. 675-678, December 2005.
[16] W. C. Athas et al., “A Low-Power Microprocessor Based on Resonant Energy,” IEEE Journal of Solid-State Circuits, vol. 32, no. 11, pp. 1693-1701, November 1997.
[17] L. Svensson, “Adiabatic Switching,” Chapter 6, Low Power Digital CMOS Design, Kluwer Academic Publishers, 1995.
63/090,169 & 63/139,744
Number | Date | Country
---|---|---
63139744 | Jan 2021 | US
63090169 | Oct 2020 | US