The present embodiments relate generally to computing and more particularly to a solution for clock delivery and distribution to an entire waferscale system composed of many chiplets.
Waferscale processor systems can provide the large number of cores and the large amount of interconnect bandwidth that are required by today's highly parallel workloads. One approach to building waferscale systems is to use a chiplet-based architecture where pre-tested chiplets are integrated on a passive high-bandwidth interconnect substrate technology such as silicon interconnect fabric or integrated fan-out system-on-wafer. These technologies allow heterogeneous integration, where chiplets with different functionalities (e.g., processor, memory), as well as chiplets built in disparate technologies (e.g., CMOS and DRAM), can be tightly integrated for significant performance and cost benefits. However, designing large scale systems using these technologies is challenging. One of the most important challenges that needs to be addressed is how to reliably deliver and distribute clocks to the chiplets in the system.
It is against this technological backdrop that the present Applicant sought a technological solution to these and other technological issues rooted in this technology.
The present embodiments provide a solution for clock delivery and distribution to an entire waferscale system composed of many chiplets. A clock distribution scheme according to embodiments is also fault tolerant, i.e., the clock distribution network can avoid faulty chiplets on the substrate and reliably distribute the clock to all functional chiplets which are accessible by the network.
These and other aspects and features of the present embodiments will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments in conjunction with the accompanying figures, wherein:
The present embodiments will now be described in detail with reference to the drawings, which are provided as illustrative examples of the embodiments so as to enable those skilled in the art to practice the embodiments and alternatives apparent to those skilled in the art. Notably, the figures and examples below are not meant to limit the scope of the present embodiments to a single embodiment, but other embodiments are possible by way of interchange of some or all of the described or illustrated elements. Moreover, where certain elements of the present embodiments can be partially or fully implemented using known components, only those portions of such known components that are necessary for an understanding of the present embodiments will be described, and detailed descriptions of other portions of such known components will be omitted so as not to obscure the present embodiments. Embodiments described as being implemented in software should not be limited thereto, but can include embodiments implemented in hardware, or combinations of software and hardware, and vice-versa, as will be apparent to those skilled in the art, unless otherwise specified herein. In the present specification, an embodiment showing a singular component should not be considered limiting; rather, the present disclosure is intended to encompass other embodiments including a plurality of the same component, and vice-versa, unless explicitly stated otherwise herein. Moreover, applicants do not intend for any term in the specification or claims to be ascribed an uncommon or special meaning unless explicitly set forth as such. Further, the present embodiments encompass present and future known equivalents to the known components referred to herein by way of illustration.
The proliferation of highly parallel workloads such as graph processing, data analytics, and machine learning is driving the demand for massively parallel high-performance systems with a large number of processing cores, extensive memory capacity, and high memory bandwidth (Workload Analysis of Blue Waters, https://arxiv.org/ftp/arxiv/papers/1703/1703.00924.pdf, accessed Nov. 23, 2020; K. Shirahata et al., “A Scalable Implementation of a MapReduce-based Graph Processing Algorithm for Large-Scale Heterogeneous Supercomputers,” 13th International Symposium on Cluster, Cloud, and Grid Computing, 2013). Often these workloads are run on systems composed of many discrete packaged processors connected using conventional off-package communication links. These off-package links have inferior bandwidth and energy efficiency compared to their on-chip counterparts and have been scaling poorly compared to silicon scaling (S. Pal et al., “A Case for Packageless Processors,” IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018). As a result, the overhead of inter-package communication has been growing at an alarming pace.
Waferscale integration can alleviate this communication bottleneck by tightly interconnecting a large number of processor, memory and/or networking chips on a large wafer. Multiple recent works have shown that waferscale processing can provide very large performance and energy efficiency benefits (Kamil Rocki et al., “Fast Stencil-Code Computation on a Wafer-Scale Processor,” 2020, arXiv: 2010.03660; S. Pal et al. “Architecting Waferscale Processors-A GPU Case Study,” IEEE International Symposium on High Performance Computer Architecture, 2019) compared to conventional systems. One approach to building waferscale systems is to integrate pre-tested known-good un-packaged bare dies/dielets (referred to herein as chiplets) on a waferscale interconnect substrate. Silicon interconnect fabric (Si-IF) and integrated fanout system-on-wafer (InFO-SoW) (S. R. Chun et al., “InFO_SoW (System-on-Wafer) for High Performance Computing,” 2020 IEEE 70th Electronic Components and Technology Conference (ECTC), 2020, pp. 1-6) are two example technologies which allow for tightly integrating many chiplets on a high-density interconnect substrate (A. A. Bajwa et al., “Demonstration of a Heterogeneously Integrated System-on-Wafer (SoW) Assembly,” 68th ECTC, 2018). Also, in a chiplet-based waferscale system, the chiplets can be manufactured in heterogeneous technologies and can potentially provide better cost-performance trade-offs. For example, TBs of memory capacity at 100s of TBps alongside PFLOPs of compute throughput can be obtained, which is suitable for big-data workloads in HPC and ML/AI.
Waferscale system design, however, has its unique set of challenges which encompass a wide range of topics from the underlying integration technology to circuit design and hardware architecture. One of the major challenges is to reliably supply clock to all the chiplets on the waferscale substrate.
According to certain aspects, the present embodiments provide a reliable clock generation and distribution mechanism for waferscale systems and/or other large chiplet based systems such as those including large interposers. In one solution, embodiments enable one or multiple master clocks from external oscillator source(s) to be provided to a subset of the chiplets on the wafer. The slower master clock or the faster clock generated by the PLLs in these chiplets can then be distributed across the wafer using a clock distribution network (ClkDN). The ClkDN is architected such that faulty blocks or nodes on the wafer can be avoided while ensuring all the useful non-faulty blocks get a working clock. Moreover, forwarding clocks over a large area means that the clock signals would traverse a large amount of combinational circuitry. Because of pull-up and pull-down strength mismatch in logic gates, duty cycle distortion can accrue and eventually lead to a poor clock signal, or it may even lead to complete disappearance of the clock signal. An embodiment solves this problem among others by inverting the clock between every node on the ClkDN and ensuring that one edge of the clock never accrues too much distortion.
The conventional way of clock delivery would be to distribute a slow master clock (running at a few tens of MHz) across the wafer using a passive ClkDN built on the waferscale substrate. Each chiplet would then use PLL circuitry to generate a faster clock (e.g., 100 MHz to 1 GHz or more). However, there are two challenges in such a scheme.
First, the parasitic capacitance and inductance of a passive ClkDN, which spans an area of up to 70,000 mm2 and can have hundreds to thousands of sinks, can be very large (>2000 pF and >600 nH). So, the clock distribution can only be done at sub-MHz frequency, often at <100 kHz. Getting a good crystal oscillator which can drive such a large capacitive load while ensuring absolute jitter performance of sub-100 picoseconds is challenging.
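The sub-100 kHz bound above can be illustrated with a back-of-envelope RC calculation. The capacitance is the >2000 pF figure given above; the effective driver resistance used here is an illustrative assumption, not a value from this disclosure:

```python
import math

# Back-of-envelope bandwidth estimate for a passive waferscale ClkDN.
# C_net comes from the text (>2000 pF); R_drive is an assumed effective
# source/driver resistance chosen only for illustration.
C_net = 2000e-12   # parasitic capacitance of the passive ClkDN, farads
R_drive = 2000.0   # assumed effective drive resistance, ohms

tau = R_drive * C_net             # RC time constant: 4 microseconds
f_3db = 1 / (2 * math.pi * tau)   # single-pole bandwidth of the net

print(f"tau  = {tau * 1e6:.1f} us")      # 4.0 us
print(f"f3dB = {f_3db / 1e3:.1f} kHz")   # ~39.8 kHz, i.e. well under 100 kHz
```

Even this rough model lands in the tens-of-kHz range, consistent with the statement that distribution over such a net is limited to well under 100 kHz.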
Second, the PLL circuitry used to generate the on-chip fast clock would require a stable reference voltage for reliable operation. This can be an issue for such large systems, where the noise in the power distribution network can be >10%. Moreover, supplying a clean and stable analog voltage to all the chiplets may often not be possible. Also, there can be cases where power is delivered only at the edge of the wafer; since the voltage regulation in the chiplets away from the edge may not be perfect, the regulated voltage could fluctuate by a large amount. As a result, a stable fast clock can only be generated near the edge of the wafer, where the chiplets can access nearby off-chip decoupling capacitors.
A schematic of an example waferscale processor system for implementing aspects of the present embodiments is shown in
The present Applicant recognizes, among other things, that in a system such as that shown in
Therefore, embodiments of this disclosure provide the master clock 108 to the chiplets 106-E at the edge of the wafer 100. As will be described in more detail below, a fast clock will be generated in one of the edge chiplets 106-E and then this clock is forwarded throughout the chiplet array using forwarding circuitry 104 built inside every chiplet 106. This strategy can ensure that a clean and stable fast clock can be generated by the edge chiplets having better voltage stability. The forwarding circuitry then ensures that the generated fast clock can be distributed reliably across the entire waferscale substrate.
Although
Next described is an example clock selection and forwarding circuitry 104 according to embodiments, a schematic of which is shown in
In embodiments, the clock selection and forwarding circuitry 104 is included in every chiplet 106 in a system such as that shown in
During boot-up, the selector 206 is configured to default to using the software-controlled JTAG clock (e.g., using bootup circuitry in each chiplet 106). Using JTAG, embodiments then initiate the clock setup phase. In this phase, one or multiple edge chiplets are selected or identified (e.g., using bootup circuitry in each chiplet 106) and configured using selector 208 to generate a faster clock (e.g., using PLL 204) from the slower system clock that is provided from an off-the-wafer crystal oscillator source (e.g., master_clk, running at a few tens of MHz). The generated faster clocks (e.g., running at 100 MHz to 1 GHz or more) from the edge chiplets 106-E are forwarded to their neighboring chiplets. The non-edge chiplets 106 (as determined using bootup circuitry in each chiplet 106, for example) are then configured for the auto-clock selection phase. In this phase, selectors 208 and 210 select the forwarded clock which starts toggling and is the first to reach a pre-defined toggle count (the default in one implementation). Once a forwarded clock is selected, the clock setup phase for that chiplet terminates and the selected clock is forwarded to its neighboring tiles. This ensures that no live-lock scenarios occur in the clock forwarding process.
One potential issue with such a clock forwarding scheme is that the fast clock can accrue duty cycle distortion because of pull-up/pull-down imbalance in the buffers, inverters, forwarding unit components and inter-chiplet I/O drivers (Kaijian Shi, Synopsys, “Clock Distribution and Balancing Methodology For Large and Complex ASIC Designs,” accessed Nov. 23, 2020). As the clock traverses multiple chiplets in the array, this duty cycle distortion can potentially kill the clock; e.g., a 5% distortion per tile could kill the clock within just 10 tiles. In order to avoid this issue, embodiments forward an inverted version of the clock. This ensures that the distortion is alternated between the clock cycle halves. Moreover, these and other embodiments also include a duty cycle correction (DCC) unit 212 (Yi-Ming Wang and Jinn-Shyan Wang, “An all-digital 50% duty-cycle corrector,” IEEE International Symposium on Circuits and Systems, 2004), which can correct any residual distortion. On the other hand, the half-cycle phase delay and any jitter introduced are not a concern, since the inter-chiplet data communication would use asynchronous FIFOs and clock domain crossing cells.
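The effect of per-hop inversion can be illustrated with a simple numerical model using the 5%-per-tile figure mentioned above; the model itself (a fixed additive stretch of the high phase per hop) is an illustrative simplification:

```python
# Toy model of duty cycle distortion across the forwarding chain. Each
# hop is assumed to stretch the high phase by a fixed 5% of the period
# (the figure used in the text); real distortion mechanisms are more
# complex, so this is only an illustration of the inversion idea.
DISTORTION_PER_HOP = 0.05

def forward(duty, hops, invert_each_hop):
    """Propagate a clock of the given duty cycle through `hops` tiles."""
    for _ in range(hops):
        duty = min(1.0, duty + DISTORTION_PER_HOP)  # pull-up/pull-down mismatch
        if invert_each_hop:
            duty = 1.0 - duty  # inversion swaps which half-cycle is stretched
    return duty

plain = forward(0.5, 10, invert_each_hop=False)
inverted = forward(0.5, 10, invert_each_hop=True)
print(plain)     # 1.0  -> high phase fills the whole period: the clock is gone
print(inverted)  # stays near 50%: distortion alternates and largely cancels
```

Without inversion the distortion accumulates monotonically and the clock disappears within 10 tiles, matching the observation above; with inversion the duty cycle oscillates in a narrow band around 50%, and a DCC unit need only remove the small residual.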
Next described is an example fault tolerance scheme in the Clk distribution network (ClkDN) according to embodiments.
Faulty chiplets can potentially disrupt the clock forwarding mechanism. A clock generation and forwarding scheme according to embodiments, however, has resilience built in. Because any chiplet at the edge can generate a faster clock, there isn't a single point of failure in clock generation. Moreover, because every non-edge chiplet receives clocks from all four directions, if at least one of the four neighboring chiplets is not faulty, then the clock can reach that chiplet and be further forwarded. By induction, it can be shown that the generated fast clock can reach every non-faulty chiplet on the wafer, unless all the neighboring chiplets of a specific chiplet are faulty.
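The reachability argument above can be sketched as a flood fill over a grid of chiplets; the grid size and fault patterns below are illustrative assumptions:

```python
from collections import deque

# Sketch of the fault-tolerance property: model the wafer as a grid in
# which each chiplet accepts a forwarded clock from any of its four
# neighbors, and flood the clock outward from the (clock-generating)
# edge chiplets. Grid dimensions and fault sets are illustrative.
def clocked_chiplets(rows, cols, faulty):
    """Return the set of non-faulty chiplets the forwarded clock reaches."""
    edge = {(r, c) for r in range(rows) for c in range(cols)
            if (r in (0, rows - 1) or c in (0, cols - 1)) and (r, c) not in faulty}
    reached, frontier = set(edge), deque(edge)
    while frontier:
        r, c = frontier.popleft()
        for n in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = n
            if (0 <= nr < rows and 0 <= nc < cols
                    and n not in faulty and n not in reached):
                reached.add(n)       # neighbor picks up the forwarded clock
                frontier.append(n)
    return reached

# A single faulty chiplet does not block the clock: every other chiplet
# on a 5x5 grid still receives it from some non-faulty direction.
got_clock = clocked_chiplets(5, 5, faulty={(2, 2)})
print(len(got_clock))  # 24: all 25 chiplets except the faulty one
```

Only a chiplet whose four neighbors are all faulty is cut off, which mirrors the induction argument: for example, `clocked_chiplets(5, 5, faulty={(1, 2), (3, 2), (2, 1), (2, 3)})` leaves the (non-faulty) center chiplet unreached.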
It should be noted that although
In 402 (e.g., during boot-up), each chiplet is configured to default to using the software-controlled JTAG clock. Using JTAG, embodiments then initiate the clock setup phase. In 404, one or multiple edge chiplets 106-E are selected and configured to generate a faster clock from the slower system clock that is provided from an off-the-wafer crystal oscillator source (e.g., master_clk). In 406, the generated faster clocks from the edge chiplets 106-E are forwarded to their neighboring chiplets. In 408, the non-edge chiplets 106 are then configured to select the forwarded clock which starts toggling and is the first to reach a pre-defined toggle count (the default in one implementation). In 410, once a forwarded clock is selected, the clock setup phase for that chiplet terminates and the selected clock is forwarded to its neighboring tiles.
The herein described subject matter sometimes illustrates different components contained within, or connected with, different other components. It is to be understood that such depicted architectures are illustrative, and that in fact many other architectures can be implemented which achieve the same functionality. In a conceptual sense, any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality, and any two components capable of being so associated can also be viewed as being “operably coupleable,” to each other to achieve the desired functionality. Specific examples of operably coupleable include but are not limited to physically mateable and/or physically interacting components and/or wirelessly interactable and/or wirelessly interacting components and/or logically interacting and/or logically interactable components.
With respect to the use of plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity.
It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims) are generally intended as “open” terms (e.g., the term “including” should be interpreted as “including but not limited to,” the term “having” should be interpreted as “having at least,” the term “includes” should be interpreted as “includes but is not limited to,” etc.).
Although the figures and description may illustrate a specific order of method steps, the order of such steps may differ from what is depicted and described, unless specified differently above. Also, two or more steps may be performed concurrently or with partial concurrence, unless specified differently above. Such variation may depend, for example, on the software and hardware systems chosen and on designer choice. All such variations are within the scope of the disclosure. Likewise, software implementations of the described methods could be accomplished with standard programming techniques with rule-based logic and other logic to accomplish the various connection steps, processing steps, comparison steps, and decision steps.
It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation, no such intent is present. For example, as an aid to understanding, the following appended claims may contain usage of the introductory phrases “at least one” and “one or more” to introduce claim recitations. However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim recitation to inventions containing only one such recitation, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an” (e.g., “a” and/or “an” should typically be interpreted to mean “at least one” or “one or more”); the same holds true for the use of definite articles used to introduce claim recitations. In addition, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should typically be interpreted to mean at least the recited number (e.g., the bare recitation of “two recitations,” without other modifiers, typically means at least two recitations, or two or more recitations).
Furthermore, in those instances where a convention analogous to “at least one of A, B, and C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, and C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). In those instances where a convention analogous to “at least one of A, B, or C, etc.” is used, in general, such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, or C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). It will be further understood by those within the art that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”
Further, unless otherwise noted, the use of the words “approximate,” “about,” “around,” “substantially,” etc., mean plus or minus ten percent.
Although the present embodiments have been particularly described with reference to preferred examples thereof, it should be readily apparent to those of ordinary skill in the art that changes and modifications in the form and details may be made without departing from the spirit and scope of the present disclosure. It is intended that the appended claims encompass such changes and modifications.
The present application claims priority to U.S. Provisional Patent Application No. 63/246,731 filed Sep. 21, 2021, the contents of which are incorporated herein by reference in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/044142 | 9/20/2022 | WO |
Number | Date | Country | |
---|---|---|---|
63246731 | Sep 2021 | US |