The present invention relates generally to three-dimensional integrated circuit technology (3D chip technology). More specifically, the present invention relates to 3D stacked chip systems and to voltage control and regulation in a 3D stacked multichip system.
Vertically stacked integrated circuits, in which multiple dies are interconnected vertically by through-silicon vias (TSVs), are one kind of 3D integrated circuit technology and provide vertical stacking of two or more dies with a dense, high-speed interface. The global wire length is reduced by a factor of the square root of the number of layers used, which improves performance and reduces interconnect power. Thus, three-dimensional integration is a promising way to provide a dense, high-speed communication interface that achieves high performance with low transmission power.
Often, a 3D stacked chip is powered by a power delivery system constituted of two parts, commonly known as off-chip paths and on-chip networks. The off-chip paths refer to the power delivery path from voltage source and/or package substrate to a chip. The on-chip network refers to the R(L)C network inside a chip, which usually comprises parasitic resistance, inductance in the delivery path and decoupling capacitance for eliminating transient voltage noise. A simplified circuit model of such a power delivery network to a 3D stacked multichip package is shown in
Despite the promising features of rapid data transfer across layers, low transmission power and high device density, 3D integration techniques also face many challenges, one of which is power supply noise. Because multiple dies are stacked vertically, 3D chips have a higher load than 2D chips of the same size, which leads to larger voltage droop due to the parasitic impedance of the power delivery network (PDN) and current fluctuation of circuits, and thus damages power integrity. Power integrity issues may cause timing failures, thereby degrading system reliability.
In 3D integration, the dies of different layers are connected vertically by TSVs, and the connection length between dies is usually around 0.1% to 1% of the connection length for 2D dies. Such short connections enable close voltage interaction between layers. However, the extremely short die-to-die distance also leads to strong voltage interference in the vertical direction. The short connections can aggravate the problem of thread resonance and make voltage droop even worse than that of 2D chips. Meanwhile, computing techniques such as Single-Program-Multiple-Data (SPMD) techniques in multithreaded applications may stimulate destructive interference (core resonances) among threads and exacerbate voltage droops.
To tackle the above-mentioned problems in a 3D stacked multichip system, a conventional solution is to allocate sufficient voltage margin for the worst-case voltage droop. However, such a solution incurs significant cost, especially with the decreasing transistor size and increasing layer count of future 3D chips. Prior work has also focused on the impact of physical design and the floorplan on voltage droop in a 3D PDN and observed that increasing decoupling capacitance or TSV density alleviates voltage droops. Nevertheless, it would cost too much to place enough TSVs and decoupling capacitance on a chip to overcome the power integrity issue. Moreover, decoupling capacitors must be placed next to the active circuits to reduce voltage noise effectively. Hence, a static solution may be neither efficient nor flexible, because the status of the circuits changes dynamically.
Thus, in order to overcome the drawbacks in prior art, aspects of the present invention provide the following technical solutions.
In an example embodiment embodying a first aspect, a multichip system is provided. The multichip system comprises a plurality of dies stacked vertically and electrically coupled together. Each of the plurality of dies comprises one or more cores and further comprises: at least one voltage violation sensing unit, the at least one voltage violation sensing unit being connected with the one or more cores of each die and being configured to independently sense voltage violation in each core of each die; and at least one frequency tuning unit, the at least one frequency tuning unit being configured to tune the frequency of each core of each die and being connected with the at least one voltage violation sensing unit.
In a second example embodiment embodying a second aspect, a layer-control method for a 3D stacked chip system is provided. The 3D stacked chip system comprises a plurality of dies stacked vertically, each of the plurality of dies comprising one or more cores. The method is performed independently for each die in the 3D stacked chip, and comprises: (a) sensing whether there is voltage violation in the one or more cores of a die by means of at least one voltage violation sensing unit connected with each core of the die; (b) if yes, tuning the frequency of the die by means of a frequency tuning unit connected with the voltage violation sensing unit; and (c) if no, continuing the step (a).
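As a minimal sketch of steps (a) to (c), the following Python model runs one control iteration per die. The class `Die`, the function `control_step`, the voltage threshold, the 50 MHz step and the frequency floor are all hypothetical illustration values; the text only states that the frequency is tuned, so lowering it on a violation is an assumption here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Die:
    """One layer of the stack with its own frequency domain (hypothetical model)."""
    name: str
    freq_mhz: float
    step_mhz: float = 50.0       # assumed tuning step, not taken from the text
    min_freq_mhz: float = 500.0  # assumed lower limit of the tuning range


def control_step(die: Die, core_voltages: List[float], v_threshold: float) -> bool:
    """Run steps (a)-(c) once for a single die.

    Returns True if a violation was sensed and the frequency was tuned.
    """
    # (a) sense: any core whose voltage falls below the threshold is a violation
    violation = any(v < v_threshold for v in core_voltages)
    if violation:
        # (b) tune only this die's frequency (assumed here: step it down)
        die.freq_mhz = max(die.min_freq_mhz, die.freq_mhz - die.step_mhz)
    # (c) no violation: nothing changes and the caller simply samples again
    return violation


# Each die is handled independently of the others.
stack = [Die("bottom", 2000.0), Die("top", 2000.0)]
samples = {"bottom": [0.98, 0.97], "top": [0.99, 0.91]}
for die in stack:
    control_step(die, samples[die.name], v_threshold=0.95)
    print(die.name, die.freq_mhz)
```

In this sketch only the die with a core below the threshold changes frequency, which is the point of performing the method independently for each die.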
In a third example embodiment embodying a third aspect, a method for scheduling threads in a 3D stacked chip is provided. The method comprises the steps of: (a) estimating intrinsic droop intensity of a plurality of threads from one or more applications; (b) sorting the threads in descending order in terms of the intrinsic droop intensity and enqueuing them into a queue; (c) selecting the thread at the head of the queue and placing it in an available core of the bottommost available die; and (d) checking if the queue is empty, and repeating step (c) until the queue is empty.
In a fourth example embodiment embodying a fourth aspect, a system for scheduling threads in a 3D stacked chip is provided. The system comprises means for estimating intrinsic droop intensity of a plurality of threads from one or more applications; means for sorting the threads in descending order in terms of the intrinsic droop intensity and enqueuing them into a queue; means for selecting the thread at the head of the queue and placing it in an available core of the bottommost available die; and means for checking if the queue is empty.
Other aspects and embodiments are described in more detail below.
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements and in which:
The present invention will now be described in detail with reference to a few embodiments thereof as illustrated in the accompanying drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without some or all of these specific details. In other instances, well known process steps and/or structures are not described in detail in order to not unnecessarily obscure the present invention.
The present invention finds that voltage droop is asymmetrically distributed in both space and in time in 3D chips. And, the amplitude of voltage droop varies with different execution phases and the worst-case voltage droop is much more serious than the average case but rarely occurs. Taking Conocean and Waternsq threads as an example,
Thread diversity can mitigate horizontal interference. As shown in
In 3D stacked chips, in addition to horizontal interaction, there is vertical interaction among threads in dies located in different layers, which also affects voltage droop. For example,
The corresponding voltage droops of scenarios (a) and (b) are shown in
Scenario (b), (c), (d) and (e) are compared to illustrate the influence of violent thread location on voltage droop. The strategy in (c) in
As we can observe in
Thus, from
The above implies that voltage margin is distributed unevenly in 3D chips during the execution of one or more applications. Allocating the worst-case margin for the entire chip would waste power. Thus, to resolve such issues, the present invention provides a new hardware design for 3D chips to avoid wasting voltage margin. Generally, the new hardware design equips the 3D chip with a power delivery system with multiple frequency domains such that every layer of the 3D chip can work at an individual frequency and be controlled separately.
As an exemplary embodiment,
In the illustrative embodiment shown in
In the exemplary example of
In the system shown in
To solve voltage violation issues, in the multichip system of
In addition, through its connection with all DPLLs, the performance monitor 640 can periodically supervise the average working frequency of the multichip system to maintain performance. When the frequency exceeds an upper threshold, it indicates that voltage margin is over-provisioned for that period of time. When the frequency is lower than a lower bound, it indicates that the supplied voltage is insufficient and needs to be raised. The voltage regulator then starts to tune the corresponding voltage supplied to the 3D chip. The tuning resolution can be set in advance, for example to 6.25 mV. It takes several microseconds to complete the tuning process.
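The decision made by the performance monitor can be sketched as follows. This is only an illustration: the function name `next_supply_uv`, the thresholds and the supply values are hypothetical, and lowering the voltage when the upper threshold is exceeded is an inference from the statement that the margin is then over-provisioned. Voltages are kept as integer microvolts so that the 6.25 mV step stays exact.

```python
from statistics import mean
from typing import Sequence

STEP_UV = 6250  # 6.25 mV tuning resolution, expressed in microvolts


def next_supply_uv(dpll_freqs_mhz: Sequence[float], vdd_uv: int,
                   f_lower_mhz: float, f_upper_mhz: float,
                   step_uv: int = STEP_UV) -> int:
    """Return the supply voltage for the next period.

    The average of the per-layer DPLL frequencies is compared with the
    two thresholds; the voltage moves by one tuning step at most.
    """
    avg = mean(dpll_freqs_mhz)
    if avg > f_upper_mhz:
        # Margin is over-provisioned: assumed reaction is one step down.
        return vdd_uv - step_uv
    if avg < f_lower_mhz:
        # Supplied voltage is insufficient: raise it by one step.
        return vdd_uv + step_uv
    return vdd_uv  # within the band, leave the regulator alone


print(next_supply_uv([2100, 2150], 1_000_000, 1800, 2050))
print(next_supply_uv([1700, 1750], 1_000_000, 1800, 2050))
```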
The present invention also provides a method for scheduling threads in a 3D stacked chip. As shown in
The present invention finds that there are three guidelines for thread scheduling: (1) arranging the voltage-violent threads in the lower layers (i.e., the layers close to the power delivery system), because otherwise they will induce serious voltage droop in the vertical chip stack; (2) clustering the threads with close intrinsic droop intensity (IDI) into one layer to mitigate vertical interference and minimize the asymmetric voltage margin of cores in one layer. As mentioned above, resonance in the vertical direction can cause larger voltage droop than resonance in the horizontal direction, so it is reasonable to put similar threads in the same die instead of the same vertical stack. In addition, the worst voltage droop of a core varies with the aggressiveness of its thread. Because the whole layer shares one frequency-monitoring and actuation system, the timing margin needs to tolerate the worst thread, which wastes margin on the mild cores. This strategy smooths the intra-layer gaps among cores; and (3) in the same die, placing threads of different applications in the same neighborhood, which helps to reduce local voltage resonance induced by similar pipeline activities.
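Guideline (2) can be illustrated with a small worked example. The sketch below assumes, purely for illustration, that the margin wasted on a core is the difference between the worst IDI in its layer and its own IDI; the function `margin_waste` and the IDI values are hypothetical and not taken from the specification.

```python
from typing import List


def margin_waste(layers: List[List[int]]) -> int:
    """Sum, over all cores, of (worst IDI in the layer - IDI of the core).

    Each inner list holds the IDI of the threads placed on one layer;
    the layer's margin is assumed to be set by its worst thread.
    """
    return sum(max(layer) - idi for layer in layers for idi in layer)


# Bottom layer listed first. Both placements use the same four threads.
clustered = [[9, 8], [2, 1]]    # similar IDI share a layer
interleaved = [[9, 1], [8, 2]]  # violent and mild threads mixed per layer

print(margin_waste(clustered), margin_waste(interleaved))
```

Under this assumed metric the clustered placement wastes far less margin than the interleaved one, which is the intuition behind grouping threads with close IDI in one layer.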
An exemplary embodiment for the method for scheduling threads in a 3D stacked chip of the present invention may comprise the following steps as shown in
In addition, in the step 720, if multiple threads have the same IDI, a round-robin algorithm may be employed to choose threads from different applications for alleviating horizontal interference.
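The scheduling steps, including the round-robin tie-break, can be sketched in Python as follows. The names `Thread`, `build_queue` and `schedule` are hypothetical, the IDI values are assumed to come from an estimator that is not shown, and ties are detected by exact equality of IDI, which is a simplification.

```python
from collections import defaultdict, deque
from itertools import groupby
from typing import Deque, Dict, List, NamedTuple


class Thread(NamedTuple):
    app: str
    tid: int
    idi: float  # estimated intrinsic droop intensity


def build_queue(threads: List[Thread]) -> Deque[Thread]:
    """Sort by IDI in descending order; break ties round-robin across apps."""
    queue: Deque[Thread] = deque()
    ordered = sorted(threads, key=lambda t: t.idi, reverse=True)
    for _, group in groupby(ordered, key=lambda t: t.idi):
        per_app: Dict[str, Deque[Thread]] = defaultdict(deque)
        for t in group:
            per_app[t.app].append(t)
        while per_app:
            # One thread from each application in turn.
            for app in list(per_app):
                queue.append(per_app[app].popleft())
                if not per_app[app]:
                    del per_app[app]
    return queue


def schedule(threads: List[Thread], cores_per_die: List[int]) -> Dict[int, List[Thread]]:
    """Place threads die by die; cores_per_die[0] is the bottommost die."""
    queue = build_queue(threads)
    placement: Dict[int, List[Thread]] = {d: [] for d in range(len(cores_per_die))}
    die = 0
    while queue:  # repeat until the queue is empty
        # Advance to the bottommost die that still has a free core.
        while die < len(cores_per_die) and len(placement[die]) >= cores_per_die[die]:
            die += 1
        if die == len(cores_per_die):
            raise ValueError("more threads than available cores")
        placement[die].append(queue.popleft())
    return placement


threads = [Thread("A", 0, 5), Thread("A", 1, 5), Thread("B", 0, 5), Thread("B", 1, 1)]
for die, placed in schedule(threads, [2, 2]).items():
    print(die, [(t.app, t.tid, t.idi) for t in placed])
```

Because the queue is sorted in descending order and the dies are filled from the bottom, the most voltage-violent threads end up closest to the power delivery system, and threads with similar IDI tend to share a layer.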
Regarding step 710, voltage droop is strongly associated with pipeline activities, so micro-architectural event information captured by performance counters can be used, in a manner known in the art, to estimate the IDI of a thread and thereby predict whether the thread is voltage-violent or voltage-mild. Statistical learning can be used to correlate performance counter inputs (such as branch-misprediction intensity, cache miss intensity and TLB miss intensity) with thread voltage droop intensity. Because these runtime statistics are nonlinearly correlated, a regression tree is well suited to modeling such a relationship. The model training process can be conducted off-line. During the training period, performance counter information and the corresponding droop intensity of threads are gathered as the training set to generate a regression tree. The droop intensity can be measured by online sensors such as a Critical Path Monitor (CPM). To avoid interference, during the training stage only one thread may be running in a power domain at a time. When the regression tree has been trained to a stable state, it can be built into target chips to predict thread voltage features.
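To make the regression-tree idea concrete, the following pure-Python sketch fits a very small tree by minimizing the squared error of each split. It is an illustration only: `fit_tree` and `predict` are hypothetical helpers, the three counter features follow the examples named above, and the training numbers are invented for the demonstration rather than measured.

```python
from statistics import mean


def _sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(rows, targets, max_depth=2):
    """Fit a small regression tree and return it as a nested dict."""
    if max_depth == 0 or len(set(targets)) == 1:
        return {"leaf": mean(targets)}
    best = None  # (cost, feature index, threshold)
    for f in range(len(rows[0])):
        # Every distinct value except the largest is a candidate threshold,
        # which guarantees that both sides of the split are non-empty.
        for thr in sorted({r[f] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            cost = _sse(left) + _sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, thr)
    if best is None:  # all rows identical, nothing to split on
        return {"leaf": mean(targets)}
    _, f, thr = best
    left = [(r, t) for r, t in zip(rows, targets) if r[f] <= thr]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] > thr]
    return {
        "feature": f,
        "threshold": thr,
        "left": fit_tree([r for r, _ in left], [t for _, t in left], max_depth - 1),
        "right": fit_tree([r for r, _ in right], [t for _, t in right], max_depth - 1),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's mean droop."""
    while "leaf" not in tree:
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]


# Invented off-line training set: (branch mispredictions, cache misses,
# TLB misses) per thousand instructions, and the measured droop intensity.
COUNTERS = [(1, 2, 0), (2, 3, 0), (8, 20, 1), (9, 25, 2)]
DROOP = [10, 12, 40, 44]

tree = fit_tree(COUNTERS, DROOP, max_depth=1)
print(predict(tree, (1.5, 2, 0)), predict(tree, (10, 30, 3)))
```

A production model would be trained on far more samples and deeper trees; the sketch only shows how counter readings map to a predicted droop intensity once the tree has been trained.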
In addition, a 3D chip can comprise both core and cache or memory, in which one or multiple cache layers are stacked on top of the core layer. For example,
Before migrating the data to a remote cache bank or allocating a voltage-mild thread to the core to mitigate or avoid the core-cache resonance, it is necessary to monitor the cache behavior. A regression model can be used to correlate the access intensity of a cache bank with its voltage droop intensity. The training stage of the regression model is also conducted off-line. During the prediction stage, a monitor takes the access count of each cache bank as input to predict the voltage droop intensity. It is therefore necessary to add cache bank access counters to the core die or the cache die to record accesses for cache emergency identification. If the monitor is implemented in the core, the address of each read/write request is converted to a cache bank ID, and the corresponding counter is incremented upon each cache access. If the monitor is implemented on the cache side, it can be embedded in the reading/writing circuitry of the banks, and the access counter of the relevant bank is incremented upon each data access. Then, if the cache behavior indicates that there is a resonance between the core die and the cache die in the 3D chip, the voltage-violent thread and the cache bank are separated, either by migrating the data to a remote cache bank or by allocating a voltage-mild thread to the core.
In one or more examples herein, the thread scheduling method described above may be implemented through software. Computer-readable codes for realizing the functions of the steps of the thread scheduling method can be stored in a computer-readable medium. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (ASICs), programmable logic devices and ROM and RAM devices. The computer-readable codes can be executed by one or more processing units.
By using the layer-independent control system of the present invention, voltage violations can be reduced on average by up to, for example, 40%. In addition, the thread scheduling method of the present invention can mitigate voltage droop in every layer and reduce the voltage droop in a 3D chip, for example by 13% as shown in
It will be apparent to those skilled in the art that the embodiments described above are only illustrative and cannot be deemed a limitation of the present invention, and that various modifications and variations can be made to the embodiments described herein without departing from the spirit and scope of the claimed subject matter. Thus, it is intended that the specification covers modifications and variations of the various embodiments described herein, provided such modifications and variations come within the scope of the appended claims and their equivalents.
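A core-side monitor of this kind can be sketched as follows. The class `BankAccessMonitor`, the line-interleaved address-to-bank mapping, the 64-byte line size and the linear form of the regression model are all assumptions made for illustration; the text only requires that an address be converted to a bank ID and that access counts feed a regression model trained off-line.

```python
from typing import List


class BankAccessMonitor:
    """Per-bank access counters with a simple droop predictor (hypothetical)."""

    def __init__(self, num_banks: int, line_bytes: int = 64) -> None:
        self.num_banks = num_banks
        self.line_bytes = line_bytes  # assumed cache line size
        self.counts = [0] * num_banks

    def bank_id(self, address: int) -> int:
        # Assumed mapping: consecutive cache lines go to consecutive banks.
        return (address // self.line_bytes) % self.num_banks

    def record(self, address: int) -> None:
        """Increment the counter of the bank addressed by a read or write."""
        self.counts[self.bank_id(address)] += 1

    def predicted_droop(self, slope: float, intercept: float) -> List[float]:
        # Assumed linear model; slope and intercept come from off-line training.
        return [slope * c + intercept for c in self.counts]

    def hot_banks(self, slope: float, intercept: float, limit: float) -> List[int]:
        """Banks whose predicted droop intensity exceeds the limit."""
        droop = self.predicted_droop(slope, intercept)
        return [bank for bank, d in enumerate(droop) if d > limit]

    def reset(self) -> None:
        """Clear the counters at the start of a new monitoring period."""
        self.counts = [0] * self.num_banks


monitor = BankAccessMonitor(num_banks=4)
for addr in (0, 64, 64, 256, 320):
    monitor.record(addr)
print(monitor.counts, monitor.hot_banks(slope=10, intercept=0, limit=25))
```

A bank reported by `hot_banks` would be a candidate for data migration to a remote bank, or the core above it a candidate for receiving a voltage-mild thread.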
Number | Date | Country | Kind |
---|---|---|---|
2013 1 0659511 | Dec 2013 | CN | national |
This application is a divisional of non-provisional application Ser. No. 14/144,920, filed on Dec. 31, 2013, and titled VOLTAGE DROOP MITIGATION IN 3D CHIP SYSTEM, which claims priority to Chinese Application Serial No. 201310659511.X, filed on Dec. 9, 2013, and titled VOLTAGE DROOP MITIGATION IN 3D CHIP SYSTEM, both of which are incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
20170153916 A1 | Jun 2017 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 14144920 | Dec 2013 | US |
Child | 15428536 | | US |