Both planar transistors (devices) and non-planar transistors are fabricated for use in integrated circuits within semiconductor chips. A variety of choices exist for placing processing circuitry in system packaging to integrate multiple types of integrated circuits. Some examples are a system-on-a-chip (SoC), multi-chip modules (MCMs), and a system-in-package (SiP). Mobile devices, desktop systems, and servers use these packages. Regardless of the choice of system packaging, power consumption of modern integrated circuits has become an increasing design issue with each generation of semiconductor chips.
As power consumption increases, more costly cooling systems such as larger fans and heat sinks are utilized to remove excess heat and prevent failure of the integrated circuit. However, cooling systems increase system costs. The power dissipation constraint of the integrated circuit is not only an issue for portable computers and mobile communication devices, but also for high-performance desktop computers and server computers. Power management circuitry assigns operating parameters to different partitions of an integrated circuit. The operating parameters include at least an operating power supply voltage and an operating clock frequency.
Although a partition can have no computational tasks to perform during a particular time period while an application is running, the power management circuitry is unable to assign a sleep state to the partition due to occasional maintenance tasks targeting the partition. Recent integrated circuits include multiple replicated functional blocks in the partition to increase throughput. Each functional block includes one or more sub-blocks for data processing, one or more levels of cache, and an interface to communicate with local memory. In one example, when a video graphics application is executed by the integrated circuit, a partition that includes multiple functional blocks responsible for rendering video frame data has no further computational tasks to perform when the image presented on a display device has no updates. The image remains unchanged during a pause of the application, during a wait time for user input information, or during another condition that does not require updates to the image even though the application is still running. However, the power management circuitry is unable to assign a sleep state to the multiple functional blocks due to periodic refresh operations that request data to be retrieved from the multiple functional blocks and sent to the display device.
In view of the above, methods and mechanisms for efficiently managing power consumption of multiple, replicated functional blocks of an integrated circuit are desired.
While the invention is susceptible to various modifications and alternative forms, specific implementations are shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the drawings and the detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modifications, equivalents and alternatives falling within the scope of the present invention as defined by the appended claims.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, one having ordinary skill in the art should recognize that the invention might be practiced without these specific details. In some instances, well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring the present invention. Further, it will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements are exaggerated relative to other elements.
Apparatuses and methods for efficiently managing power consumption among multiple, replicated functional blocks of an integrated circuit are contemplated. In the case of using a multi-chip module (MCM), one or more of multiple, replicated chiplets are connected to separate power rails, and therefore, can use separate power domains. The multiple, replicated chiplets are provided from a silicon wafer separate from silicon wafers of other functional blocks used in the MCM. In the case of using a system-on-chip (SoC), one or more of the functional blocks are connected to separate power rails, and therefore, can use separate power domains. The multiple, replicated functional blocks are provided from a same silicon wafer that provides other functional blocks used in the SoC. Therefore, the techniques and steps described in the upcoming description directed to power management of multiple, replicated chiplets placed in an MCM are also applicable to power management of multiple, replicated functional blocks placed in an SoC.
In various implementations, an integrated circuit includes multiple, replicated functional blocks that use separate power domains. The multiple functional blocks store data of a given type in an interleaved manner among the multiple functional blocks. In an implementation, the data of the given type is video frame data of a frame buffer that has been rendered by the multiple functional blocks. A low-performance mode indicates a static screen of the display device connected to the display controller. When control circuitry detects the low-performance mode, the control circuitry sends commands to the multiple functional blocks specifying that data of the given type is to be stored in a contiguous manner in one or more of the caches of the multiple functional blocks and the memories connected to the multiple functional blocks. Subsequently, the control circuitry transitions the memories to a sleep state and transitions one or more functional blocks to a sleep state.
In another implementation, the control circuitry powers down one or more of the functional blocks while maintaining the one or more corresponding caches in the sleep state. For example, the control circuitry turns off one or more power supply reference levels to portions of the corresponding one or more functional blocks. However, the caches of these corresponding one or more functional blocks maintain connection to a power supply reference level with a voltage magnitude associated with the sleep state. The control circuitry sends control signals to power switches that disconnect, from a physical voltage plane, the one or more power supply reference levels used by portions other than the caches of the corresponding one or more functional blocks. The functional blocks process requests targeting the data of the given type using the particular functional block that is currently in an active state. The control circuitry rotates among the functional blocks with a single functional block being in the active state and servicing requests based on which data of the given type is targeted by the requests.
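For purposes of illustration only, the following sketch models this overall policy in Python. The helper routines (store_contiguously, sleep_memories, power_gate_all_but, owner_of) are hypothetical stand-ins for the commands, power-switch controls, and routing mechanisms described above; only the control flow is meant to be conveyed, not a definitive implementation of the control circuitry.

```python
# Illustrative model of the power-management policy described above. The
# placeholder routines at the bottom stand in for the commands, power
# switches, and DRAM controls of the control circuitry.


class PowerPolicySketch:
    def __init__(self, num_blocks: int, portions_per_block: int) -> None:
        self.num_blocks = num_blocks
        self.portions_per_block = portions_per_block
        self.active_block = 0          # index of the single block kept active

    def enter_low_performance_mode(self) -> None:
        """Steps taken when a low-performance mode (static screen) is detected."""
        self.store_contiguously()      # commands to the blocks or a DMA engine
        self.sleep_memories()          # local memories enter a sleep state
        self.power_gate_all_but(0)     # caches retained; other portions powered off
        self.active_block = 0

    def service_request(self, portion: int) -> int:
        """Route a request to the block holding the targeted portion, rotating
        the single active block whenever the owner changes."""
        owner = self.owner_of(portion)
        if owner != self.active_block:
            self.power_gate_all_but(owner)   # wake the next block, sleep the rest
            self.active_block = owner
        return owner

    def owner_of(self, portion: int) -> int:
        # Contiguous arrangement: the first run of portions maps to block 0,
        # the next run to block 1, and so on (one-based portion numbers).
        return (portion - 1) // self.portions_per_block

    # Placeholders for programming power switches, DRAM states, and commands.
    def store_contiguously(self) -> None: ...
    def sleep_memories(self) -> None: ...
    def power_gate_all_but(self, block: int) -> None: ...


if __name__ == "__main__":
    policy = PowerPolicySketch(num_blocks=4, portions_per_block=4)
    policy.enter_low_performance_mode()
    print([policy.service_request(p) for p in (1, 5, 9, 13)])  # rotates 0 -> 1 -> 2 -> 3
```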
Turning now to
As used herein, a “chiplet” is also referred to as a “functional block” or an “intellectual property block” (or IP block). A “chiplet,” however, is a semiconductor die (or die) fabricated separately from other dies and then interconnected with those other dies in a single integrated circuit in system packaging known as a multi-chip module (MCM). A chiplet is therefore one type of functional block, whereas a functional block can also be a block fabricated together with other functional blocks on a larger semiconductor die such as a system-on-chip (SoC). In other words, chiplets are a subset of the types of functional blocks. A chiplet is fabricated as multiple copies by itself on a silicon wafer, rather than with other functional blocks on a larger semiconductor die such as an SoC. For example, a first silicon wafer (or first wafer) is fabricated with multiple copies of a first chiplet, and this first wafer is diced using laser cutting techniques to separate the multiple copies of the first chiplet.
A second silicon wafer (or second wafer) is fabricated with multiple copies of a second chiplet, and this second wafer is diced using laser cutting techniques to separate the multiple copies of the second chiplet. The first chiplet provides functionality different from the functionality of the second chiplet. One or more copies of the first chiplet are placed in an integrated circuit, and one or more copies of the second chiplet are placed in the integrated circuit. The first chiplet and the second chiplet are interconnected to one another within a corresponding MCM. Such a process replaces a process that fabricates a third silicon wafer (or third wafer) with multiple copies of a single, monolithic semiconductor die that includes the functionality of the first chiplet and the second chiplet as integrated functional blocks within the single, monolithic semiconductor die.
Process yield of single, monolithic dies on a silicon wafer is lower than process yield of smaller chiplets on a separate silicon wafer. In addition, a semiconductor process can be adapted for the particular type of chiplet being fabricated. With single, monolithic dies, each die on the wafer is formed with the same fabrication process. However, it is possible that an interface functional block does not require the process parameters of a semiconductor manufacturer's expensive process that provides the fastest devices and smallest geometric dimensions, which are beneficial for a high-throughput processing unit on the die. With separate chiplets, designers can add or remove chiplets for particular integrated circuits to readily create products for a variety of performance categories. In contrast, an entirely new silicon wafer must be fabricated for a different product when single, monolithic dies are used.
The following description describes power management of multiple, replicated chiplets placed in an MCM where the multiple, replicated chiplets are provided from a silicon wafer separate from silicon wafers of other functional blocks used in the MCM. However, the following description is also applicable to the power management of multiple, replicated functional blocks located on an SoC where the multiple, replicated functional blocks are provided from a same silicon wafer that provides other functional blocks used in the SoC. In the case of using an MCM, one or more of the chiplets are connected to separate power rails, and therefore, can use separate power domains. Similarly, in the case of using an SoC, one or more of the functional blocks are connected to separate power rails, and therefore, can use separate power domains.
Although not shown for ease of illustration, each of the chiplets 110, 120, 130 and 140 also includes one or more sub-blocks that provide a variety of functionalities. These sub-blocks utilize transistors. As used herein, a “transistor” is also referred to as a “semiconductor device” or a “device.” The chiplets 110, 120, 130 and 140 use p-type metal oxide semiconductor (PMOS) field effect transistors (FETs) (or pfets) in addition to n-type metal oxide semiconductor (NMOS) FETs (or nfets). In some implementations, the devices (or transistors) in the integrated circuit 100 are planar devices. In other implementations, the devices (or transistors) in the integrated circuit 100 are non-planar devices. Examples of non-planar transistors are tri-gate transistors, fin field effect transistors (FinFETs), and gate all around (GAA) transistors. In some implementations, the chiplets 110, 120, 130 and 140 include one or more three-dimensional integrated circuits (3D ICs). A 3D IC includes two or more layers of active electronic components integrated vertically and/or horizontally into a single circuit. In one implementation, interposer-based integration is used whereby the 3D IC is placed next to a central processing unit (CPU) that includes one or more general-purpose processor cores. Alternatively, a 3D IC is stacked directly on top of another IC.
As shown, each of the chiplets 110, 120, 130 and 140 and each of the memories 114, 124, 134 and 144 stores a copy of one or more portions of data of a given type. Each of the memories 114, 124, 134 and 144 is one of a variety of types of dynamic random-access memory (DRAM). In an implementation, the data of the given type is video frame data stored in a frame buffer implemented by the memories 114, 124, 134 and 144. The portions of data of the given type are shown as numbered boxes where the number is used to identify the portion. In some implementations, each portion of data is a contiguous portion compared to a previous portion of a larger data set (such as a video frame buffer) where the previous portion has a number identifying it that is one less than the number identifying the current portion. For example, portion “2” is a next contiguous portion following portion “1.” In an implementation, each portion has a same size such as the size of a page of DRAM or another size. In other implementations, one or more portions have a different size.
In the illustrated implementation, the memory 114 stores a copy of the portion “1,” the portion “5,” the portion “9” and the portion “13.” The memory 124 stores a copy of the portion “2,” the portion “6,” the portion “10” and the portion “14.” The memory 134 stores a copy of the portion “3,” the portion “7,” the portion “11” and the portion “15.” The memory 144 stores a copy of the portion “4,” the portion “8,” the portion “12” and the portion “16.” The caches 112, 122, 132 and 142 store a copy of data that is stored in a corresponding one of the memories 114, 124, 134 and 144. For example, the cache 112 stores a copy of the portion “1,” the portion “5,” the portion “9” and the portion “13.” In an implementation, each of the caches 112, 122, 132 and 142 is a last-level cache of a corresponding one of the chiplets 110, 120, 130 and 140, and supports a writeback cache policy. In another implementation, the caches 112, 122, 132 and 142 support a writethrough cache policy.
In an implementation, the chiplets 110, 120, 130 and 140 process tasks of a video graphics workload such as rendering video frame data for a display device (not shown). The data of the given type is video frame data of a frame buffer that has been rendered by the chiplets 110, 120, 130 and 140. This data of the given type is sent from the chiplets 110, 120, 130 and 140 to a display controller and then to the display device. In some implementations, the chiplets 110, 120, 130 and 140 store data of the given type in an interleaved manner amongst themselves as shown in the illustrated implementation. For example, a first portion of the data of the given type (portion “1”) is stored in a first chiplet (chiplet 110), and a second portion (portion “2”) different from the first portion (portion “1”) of the data of the given type is stored in a second chiplet (chiplet 120). A third portion (portion “3”) different from the first portion and the second portion of the data of the given type is stored in a third chiplet (chiplet 130), and so on. When the last chiplet (chiplet 140) of the multiple chiplets has a portion (portion “4”) of the data of the given type stored in it, a next portion (portion “5”) of the data of the given type is stored in the first chiplet (chiplet 110). Data storage of the data of the given type continues in this manner.
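The interleaved placement described above reduces to a simple index calculation. The following sketch assumes four chiplets and one-based portion numbers matching the illustrated implementation, and shows the interleaved mapping alongside the contiguous mapping used later in the low-performance mode; it is illustrative only.

```python
# Mapping of a one-based portion number to a chiplet identifier for the
# interleaved arrangement described above (portion "1" on chiplet 110,
# "2" on 120, "3" on 130, "4" on 140, "5" on 110 again, and so on), shown
# next to the contiguous arrangement used later in the low-performance mode.
# Chiplet identifiers and the portion count are those of the illustration.

CHIPLETS = (110, 120, 130, 140)
PORTIONS_PER_CHIPLET = 4          # sixteen portions spread over four chiplets


def interleaved_owner(portion: int) -> int:
    """Round-robin placement: consecutive portions land on consecutive chiplets."""
    return CHIPLETS[(portion - 1) % len(CHIPLETS)]


def contiguous_owner(portion: int) -> int:
    """Blocked placement: each chiplet holds a run of consecutive portions."""
    return CHIPLETS[(portion - 1) // PORTIONS_PER_CHIPLET]


if __name__ == "__main__":
    print([interleaved_owner(p) for p in range(1, 17)])  # 110, 120, 130, 140, 110, ...
    print([contiguous_owner(p) for p in range(1, 17)])   # 110, 110, 110, 110, 120, ...
```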
The chiplets 110, 120, 130 and 140 store the portions “1” to “16” in the interleaved manner in the memories 114, 124, 134 and 144 in order to hide overhead latency (penalty) of the memory devices used to implement the memories 114, 124, 134 and 144. For example, each of the steps of opening a page in DRAM, storing the targeted page in a row buffer, accessing the row buffer, and closing the page includes appreciable latency or penalty. In an implementation, power management circuitry (not shown) either determines or assigns a low-performance mode to a computing system using the chiplets 110, 120, 130 and 140, or receives an indication of the low-performance mode. For example, a video graphics application no longer updates frame data to be viewed on the display device. It is possible that the video graphics application is paused or is waiting for further user input, and during the wait time, the scene or image is not updated on the display device. Therefore, the video processing subsystem of the computing system that utilizes the chiplets 110, 120, 130 and 140 enters an idle state (or an idle condition) although the video graphics application has not stopped being executed.
When control circuitry, such as the power management circuitry (or other control circuitry), determines the integrated circuit 100 is in a low-performance mode or assigns the integrated circuit 100 to transition to a low-performance mode, the control circuitry sends commands or indications to either the chiplets 110, 120, 130 and 140, or a direct memory access (DMA) engine. These commands or indications specify storing data of the given type in a contiguous manner in the memories 114, 124, 134 and 144, rather than in an interleaved manner among the memories 114, 124, 134 and 144. Therefore, the control circuitry causes the chiplets 110, 120, 130 and 140 to store data in a contiguous manner, responsive to a mode of operation such as the low-performance mode. The low-performance mode is a mode of operation associated with a low-power power domain of multiple power domains. Each of the power domains includes at least operating parameters such as an operating power supply voltage and an operating clock frequency. Each of the power domains also includes control signals for enabling and disabling connections to clock generating circuitry and a power supply reference. In various implementations, each of the chiplets 110, 120, 130 and 140 utilizes a separate power rail and can be set to a separate power domain.
In some implementations, when the integrated circuit 100 enters the low-performance mode, the DMA engine or other unit does not provide updated data of the given type to the chiplets 110, 120, 130 and 140. Rather, the data of the given type currently stored among the memories 114, 124, 134 and 144 is the last video frame to be received until the low-performance mode ends. However, in an implementation, the DMA engine or other unit provides a last frame to the chiplets 110, 120, 130 and 140, but this last frame is not updated data; rather, it is a copy of the frame already stored among the memories 114, 124, 134 and 144. In such an implementation, the portion “17” is a copy of portion “1,” the portion “18” is a copy of portion “2,” the portion “19” is a copy of the portion “3,” and so on. In another implementation, when the integrated circuit 100 enters the low-performance mode, the DMA engine or other unit does not provide any additional data of the given type to the chiplets 110, 120, 130 and 140. Further details of this implementation are provided in the description of the integrated circuit 300 (of
In the low-performance mode of the integrated circuit 100, when data of the given type is retrieved from the DMA engine or other unit, and sent to the integrated circuit 100, the data of the given type is now stored in a contiguous manner in the memories 114, 124, 134 and 144, rather than in an interleaved manner among the memories 114, 124, 134 and 144. For example, the data includes the portions “17” to “32.” The memory 114 stores portions “17,” “18,” “19” and “20” in a contiguous manner. Memories 124, 134 and 144 store portions in a contiguous manner as well. Subsequently, the power management circuitry transitions each of the memories 114, 124, 134 and 144 to a sleep state. In an implementation, the sleep state is a relatively low-power state (of the available power states). In various implementations, the sleep state is a minimum power consumption state in which power is not turned off. When the memories 114, 124, 134 and 144 utilize DRAM, the memories 114, 124, 134 and 144 are volatile memories. In some implementations, the sleep state is a component idle state with the lowest available voltage magnitude of any of one or more component idle states. A memory of the memories 114, 124, 134 and 144 has power consumption reduced, but this memory also retains sufficient configuration information (or context information) to return to the active state without restarting the operating system.
In another implementation, the sleep state is a component idle state with a voltage magnitude lower than a voltage magnitude provided by the active state, but higher than the lowest available voltage magnitude of any of one or more component idle states. In an implementation, in the sleep state, the power management circuitry additionally turns off the power supply reference level to portions of the one or more corresponding chiplets of the chiplets 110, 120, 130 and 140, and these portions do not include the caches 112, 122, 132 and 142. For example, the power management circuitry sends control signals to power switches that disconnect, from a physical voltage plane, the power supply reference level used by the portions of the one or more corresponding chiplets of the chiplets 110, 120, 130 and 140. The sleep state and one or more active states can be associated with one or more power-performance states (P-states) that indicate a respective power domain managed by the power management circuitry. The sleep state and one or more active states can be associated with one or more states of the Advanced Configuration and Power Interface (ACPI) standard. States of another standard are also possible and contemplated.
Initially, the power management circuitry does not transition the chiplet 110 to a non-active state, but maintains the chiplet 110 in one of multiple active states. In an implementation, the power management circuitry transitions each of the chiplets 120, 130 and 140 to a non-active state. For example, the power management circuitry powers down portions of the chiplets 120, 130 and 140 other than the caches 122, 132 and 142. The power management circuitry removes the power supply reference level from the portions of the chiplets 120, 130 and 140 other than the caches 122, 132 and 142. The power management circuitry maintains connection to a power supply reference level for the caches 122, 132 and 142, but with a voltage magnitude associated with the sleep state. In some implementations, the voltage magnitude provided by the power supply reference level of the sleep state is based on reducing leakage current of devices (transistors) within the caches 122, 132 and 142. With the caches 122, 132 and 142 still connected to a voltage plane and maintaining a power supply reference level, the integrated circuit 100 implements the caches 122, 132 and 142 as persistent memory, or non-volatile memory.
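One way to picture the per-chiplet power control described in the preceding paragraphs is the following sketch, in which the field names (core_rail_on, cache_rail_mv) and the voltage values are hypothetical; the sketch only illustrates disconnecting the non-cache portions from the voltage plane while holding the cache rail at a retention voltage.

```python
# Hypothetical per-chiplet power-domain model: the cache sits on a rail that
# can be held at a low retention voltage while the remaining portions of the
# chiplet are disconnected from the physical voltage plane by power switches.
# Field names, method names, and voltages are illustrative, not actual
# register interfaces.

from dataclasses import dataclass


@dataclass
class ChipletPowerDomain:
    name: str
    core_rail_on: bool = True        # power switches for the non-cache portions
    cache_rail_mv: int = 750         # cache power supply reference, in millivolts

    ACTIVE_MV = 750                  # nominal active-state voltage (assumed)
    RETENTION_MV = 550               # sleep-state retention voltage (assumed)

    def enter_sleep(self) -> None:
        """Power down everything except the cache, which is kept at a
        retention voltage so its contents behave like persistent memory."""
        self.core_rail_on = False            # open power switches: disconnect
        self.cache_rail_mv = self.RETENTION_MV

    def enter_active(self) -> None:
        """Reconnect the non-cache portions and restore the cache rail."""
        self.core_rail_on = True             # close power switches: reconnect
        self.cache_rail_mv = self.ACTIVE_MV


if __name__ == "__main__":
    domains = {n: ChipletPowerDomain(str(n)) for n in (110, 120, 130, 140)}
    for n in (120, 130, 140):                # chiplet 110 stays active initially
        domains[n].enter_sleep()
    print(domains[120])
```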
During the idle state of the video subsystem, the chiplet 110, which stores data of the given type (portions “17” to “20”), processes any generated requests targeting the data of the given type. For example, despite not requesting new frame data to be rendered, the display device of the computing system still performs refresh operations. In this case, the data of the given type (portions “17” to “20”) are a subset of the entire rendered data of the last frame (portions “17” to “32”) to be processed before the transition to the idle state indicating a static screen of the display device.
To perform the refresh operations, the display device requests data of the given type (portions “17” to “32”) from the chiplets 110, 120, 130 and 140. After accessing portion “20” from chiplet 110, the power management circuitry transitions the chiplet 120 from the sleep state to the active state, and transitions the chiplet 110 from the active state to the sleep state. For example, the power management circuitry (or other circuitry) reconnects the power supply reference level to portions of the chiplet 120 by sending control signals to power switches that reconnect, to a physical voltage plane, the power supply reference level used by these portions. The power management circuitry (or other circuitry) transitions the cache 122 from the sleep state to the active state. Additionally, the power management circuitry (or other circuitry) disconnects the power supply reference level from the portions of the chiplet 110 other than the cache 112 by sending control signals to power switches that disconnect, from a physical voltage plane, the power supply reference level used by these portions. The power management circuitry (or other circuitry) transitions the cache 112 from the active state to the sleep state. With the cache 112 still connected to a voltage plane and maintaining a power supply reference level, the integrated circuit 100 implements the cache 112 as persistent memory, or non-volatile memory. Therefore, a single chiplet is in the active state while the remaining chiplets are in the sleep state, or otherwise powered down to reduce power consumption. Similarly, after accessing portions “21” to “24” from chiplet 120, the power management circuitry transitions the chiplet 130 from the sleep state to the active state, and transitions the chiplet 120 from the active state to the sleep state using similar steps.
Further, after accessing portions “25” to “28” from chiplet 130, the power management circuitry transitions the chiplet 140 from the sleep state to the active state, and transitions the chiplet 130 from the active state to the sleep state using the steps described in the above description directed toward chiplets 110 and 120. Continuing, after accessing portions “29” to “32” from chiplet 140, the power management circuitry transitions the chiplet 110 from the sleep state to the active state, and transitions the chiplet 140 from the active state to the sleep state. These steps are repeated during the video refresh operations. Therefore, a single chiplet is in the active state while the remaining chiplets are in the sleep state, or otherwise powered down to reduce power consumption. While still supporting the refresh operations, the integrated circuit 100 reduces power consumption by maintaining a single chiplet of chiplets 110, 120, 130 and 140 in the active state while the remaining chiplets of chiplets 110, 120, 130 and 140 are in the sleep state. Additionally, each of the memories 114, 124, 134 and 144 is in the sleep state.
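The rotation performed during the refresh operations can be illustrated with the following sketch, which scans the contiguously stored portions “17” to “32” and hands off the single active state whenever the scan crosses a chiplet boundary. The wake and sleep routines are hypothetical placeholders for the power-switch sequences described above.

```python
# Illustration of rotating a single active chiplet while a display refresh
# scans portions "17" through "32" stored contiguously, four portions per
# chiplet. The wake() and sleep() routines stand in for the power-switch and
# cache-state sequences described above and merely record the hand-offs here.

CHIPLETS = (110, 120, 130, 140)
PORTIONS_PER_CHIPLET = 4
FIRST_PORTION = 17

transitions = []


def wake(chiplet: int) -> None:
    transitions.append(f"wake {chiplet}")


def sleep(chiplet: int) -> None:
    transitions.append(f"sleep {chiplet}")


def refresh_scan(active: int) -> int:
    """Serve one full refresh pass and return the chiplet left active."""
    for portion in range(FIRST_PORTION, FIRST_PORTION + 16):
        owner = CHIPLETS[(portion - FIRST_PORTION) // PORTIONS_PER_CHIPLET]
        if owner != active:
            wake(owner)      # bring the next chiplet to the active state first
            sleep(active)    # then transition the previous one to the sleep state
            active = owner
        # ...read the portion from the active chiplet's cache and forward it
        # toward the display controller...
    return active


if __name__ == "__main__":
    active = CHIPLETS[0]              # chiplet 110 is active initially
    active = refresh_scan(active)     # wake 120/sleep 110, wake 130/sleep 120, ...
    active = refresh_scan(active)     # the next pass begins by waking 110 again
    print(transitions)
```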
Referring to
The control circuitry sends commands or indications to either the chiplets 110, 120, 130 and 140, or a direct memory access (DMA) engine. These commands or indications specify storing data of the given type in an interleaved manner among the memories 114, 124, 134 and 144, rather than in a contiguous manner in the memories 114, 124, 134 and 144. It is noted that these commands used when the integrated circuit 200 is in a high-performance mode perform the reverse storage rearrangement (from the contiguous manner to the interleaved manner) of the rearrangement (from the interleaved manner to the contiguous manner) used when the integrated circuit 100 (of
Turning now to
As a result of the above commands or indications, the chiplet 120 sends the copy of portion “2” to be stored in cache 112 of chiplet 110. The copy of portion “2” stored in memory 124 remains in memory 124. The chiplet 110 does not store a copy of portion “2” in memory 114. Regardless of supporting a writethrough cache policy or a writeback cache policy, the caches 112, 122, 132 and 142 do not send updates to the memories 114, 124, 134 and 144 when transferring data between the caches 112, 122, 132 and 142. In a similar manner, the chiplet 130 sends the copy of portion “3” to be stored in cache 112 of chiplet 110. The copy of portion “3” stored in memory 134 remains in memory 134. The chiplet 110 does not store a copy of portion “3” in memory 114.
Additionally, the chiplet 130 sends the copy of portion “7” to be stored in cache 122 of chiplet 120. The copy of portion “7” stored in memory 134 remains in memory 134. The chiplet 120 does not store a copy of portion “7” in memory 124. The chiplet 110 sends the copy of portion “13” to be stored in cache 142 of chiplet 140. The copy of portion “13” stored in memory 114 remains in memory 114. The chiplet 140 does not store a copy of portion “13” in memory 144. Other data transfers are performed in this manner among the chiplets 110, 120, 130 and 140 until the portions “1” to “16” are stored in a contiguous manner in the caches 112-142. It is noted that in a case where cache 112 does not have sufficient temporary data storage to simultaneously store portions “1,” “2,” “5,” “9,” and “13,” the cache 112 overwrites portion “5” with portion “2,” and later chiplet 110 sends a copy of portion “5” from memory 114 to cache 122 of chiplet 120, rather than from cache 112. Other chiplets perform similar steps when their respective caches do not include sufficient temporary data storage.
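The cache-to-cache redistribution, including the fallback to local memory when a cache lacks room, can be sketched as follows. The dictionaries stand in for cache and memory contents, a capacity of four portions per cache is assumed for this example, and the logic is only one possible illustration of the transfers described above.

```python
# Sketch of rearranging portions "1" to "16" from the interleaved layout in
# the caches into a contiguous layout, transferring between caches where
# possible and falling back to the (still interleaved) local memory when a
# source cache has already evicted the needed portion. Capacities and data
# structures are illustrative only.

CHIPLETS = (110, 120, 130, 140)
CACHE_CAPACITY = 4  # portions per cache in this example

# Initial interleaved contents: cache of chiplet 110 holds 1, 5, 9, 13, etc.
caches = {c: {p for p in range(1, 17) if (p - 1) % 4 == i}
          for i, c in enumerate(CHIPLETS)}
memories = {c: set(caches[c]) for c in CHIPLETS}   # memories mirror the caches


def contiguous_target(portion: int) -> int:
    return CHIPLETS[(portion - 1) // CACHE_CAPACITY]


def rearrange_contiguous() -> None:
    for portion in range(1, 17):
        dst = contiguous_target(portion)
        if portion in caches[dst]:
            continue                                  # already in place
        # Prefer a cache-to-cache transfer; fall back to the (interleaved)
        # local memory if the source cache has already dropped the portion.
        src = next((c for c in CHIPLETS if portion in caches[c]), None)
        from_cache = src is not None
        if not from_cache:
            src = next(c for c in CHIPLETS if portion in memories[c])
        if len(caches[dst]) >= CACHE_CAPACITY:
            # Evict a portion that belongs elsewhere; a copy remains in memory.
            victim = next(p for p in caches[dst] if contiguous_target(p) != dst)
            caches[dst].discard(victim)
        caches[dst].add(portion)
        if from_cache:
            caches[src].discard(portion)              # moved between caches, not copied


if __name__ == "__main__":
    rearrange_contiguous()
    print({c: sorted(caches[c]) for c in CHIPLETS})
    # {110: [1, 2, 3, 4], 120: [5, 6, 7, 8], 130: [9, 10, 11, 12], 140: [13, 14, 15, 16]}
```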
Following the data transfers, the power management circuitry transitions each of the memories 114, 124, 134 and 144 to a sleep state. In addition, the power management circuitry transitions one or more of the chiplets 120, 130 and 140 to a sleep state, or otherwise powers them down to reduce power consumption. The power management circuitry does not transition the chiplet 110 to the sleep state, but maintains the chiplet 110 in one of multiple active states. In an implementation, the power management circuitry transitions each of the chiplets 120, 130 and 140 to the sleep state, or otherwise powers them down to reduce power consumption. For example, the power management circuitry (or other circuitry) performs the steps described in the above description directed toward the integrated circuit 100. During the idle state of the video subsystem, the chiplet 110, which stores data of the given type (portions “1” to “4”), processes any generated requests targeting the data of the given type. For example, despite not requesting new frame data to be rendered, the display device of the computing system still performs refresh operations. In this case, the data of the given type (portions “1” to “4”) are a subset of the entire rendered data of the last frame (portions “1” to “16”) to be processed before the transition to the idle state indicating a static screen of the display device.
To perform the refresh operations, the display device requests data of the given type (portions “1” to “16”) from the chiplets 110, 120, 130 and 140. After accessing portion “4” from chiplet 110, the power management circuitry transitions the chiplet 120 from the sleep state to the active state, and transitions the chiplet 110 from the active state to the sleep state. For example, the power management circuitry (or other circuitry) reconnects the power supply reference level to the portions of the chiplet 120 other than the cache 122 by sending control signals to power switches that reconnect, to a physical voltage plane, the power supply reference level used by these portions. The power management circuitry (or other circuitry) transitions the cache 122 from the sleep state to the active state. Additionally, the power management circuitry (or other circuitry) disconnects the power supply reference level from the portions of the chiplet 110 other than the cache 112 by sending control signals to power switches that disconnect, from a physical voltage plane, the power supply reference level used by these portions. The power management circuitry (or other circuitry) transitions the cache 112 from the active state to the sleep state. Therefore, a single chiplet is in the active state while the remaining chiplets are in the sleep state. Similarly, after accessing portions “5” to “8” from chiplet 120, the power management circuitry transitions the chiplet 130 from the sleep state to the active state, and transitions the chiplet 120 from the active state to the sleep state using similar steps as the transitions for chiplets 110 and 120.
Further, after accessing portions “9” to “12” from chiplet 130, the power management circuitry transitions the chiplet 140 from the sleep state to the active state, and transitions the chiplet 130 from the active state to the sleep state. Continuing, after accessing portions “13” to “16” from chiplet 140, the power management circuitry transitions the chiplet 110 from the sleep state to the active state, and transitions the chiplet 140 from the active state to the sleep state. These steps are repeated during the video refresh operations. Therefore, a single chiplet is in the active state while the remaining chiplets are in the sleep state, or otherwise powered down to reduce power consumption. While still supporting the refresh operations, the integrated circuit 300 reduces power consumption by maintaining a single chiplet of chiplets 110, 120, 130 and 140 in the active state while the remaining chiplets of chiplets 110, 120, 130 and 140 are in the sleep state. Additionally, each of the memories 114, 124, 134 and 144 is in the sleep state.
Referring to
The control circuitry sends commands or indications to either the chiplets 110, 120, 130 and 140, or a direct memory access (DMA) engine. These commands or indications specify maintaining storage of data of the given type in an interleaved manner in the memories 114, 124, 134 and 144, and additionally, also storing data of the given type in an interleaved manner in the caches 112, 122, 132 and 142 of the chiplets 110, 120, 130 and 140. The power management circuitry transitions each of the memories 114, 124, 134 and 144 from the sleep state to an active state. The power management circuitry transitions each of the chiplets 110, 120, 130 and 140 that is in the sleep state to an active state. As a result of the received commands or indications, each of the caches 112-142 invalidates its contents, and later fetches the corresponding portions of portions “1” to “16” from memories 114-144. For example, after invalidating its contents, the cache 122 of the chiplet 120 fetches portions “2,” “6,” “10,” and “14” from the memory 124. The other caches 112, 132 and 142 perform similar steps.
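Returning to the interleaved arrangement can be sketched as the reverse sequence: wake the memories and the sleeping chiplets, invalidate the caches, and refetch the interleaved portions from each chiplet's local memory. The routine names below are hypothetical, and the data structures match the earlier illustrative sketches.

```python
# Sketch of leaving the low-performance mode: memories and chiplets return to
# an active state, each cache invalidates its contents, and then refetches its
# interleaved share of portions "1" to "16" from its own local memory.
# Names and structures are illustrative stand-ins, not hardware interfaces.

CHIPLETS = (110, 120, 130, 140)


def wake_memory(chiplet: int) -> None: ...      # exit the DRAM sleep state
def wake_chiplet(chiplet: int) -> None: ...     # reconnect power, restore clocks


def resume_high_performance(caches: dict, memories: dict) -> None:
    for c in CHIPLETS:
        wake_memory(c)
        wake_chiplet(c)
        caches[c].clear()                        # invalidate cached contents
        caches[c].update(memories[c])            # refetch interleaved portions


if __name__ == "__main__":
    memories = {c: {p for p in range(1, 17) if (p - 1) % 4 == i}
                for i, c in enumerate(CHIPLETS)}
    caches = {c: set(range(4 * i + 1, 4 * i + 5))    # contiguous, from the idle period
              for i, c in enumerate(CHIPLETS)}
    resume_high_performance(caches, memories)
    print({c: sorted(caches[c]) for c in CHIPLETS})  # interleaved again: 110 -> [1, 5, 9, 13]
```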
Referring to
In some implementations, the functionality of the apparatus 500 is included as components on a single die such as a single integrated circuit. In an implementation, the functionality of the apparatus 500 is included as one die of multiple dies on a multi-chip module (MCM). In various implementations, the apparatus 500 is used in a desktop, a portable computer, a mobile device, a server, a peripheral device, or other. The apparatus 500 is also capable of communicating with a variety of other external circuitry such as one or more of a digital signal processor (DSP), a variety of application specific integrated circuits (ASICs), a multimedia engine, and so forth.
The hardware, such as circuitry, of each of the blocks 514A and 516A provides a variety of functionalities. In some implementations, one or more of the blocks 514A and 516A include a relatively wide single-instruction-multiple-data (SIMD) microarchitecture. For example, one or more of the blocks 514A and 516A is used as a dedicated GPU (or dGPU), a dedicated video graphics chip or chipset, or other. In some implementations, one or more of the blocks 514A and 516A render video frame data that is later sent to the display controller 550. In an implementation, the cache 520A is a last-level cache of a cache memory subsystem hierarchy. The cache 520A can support a writeback policy, or the cache 520A can support a writethrough policy. The chiplet 510A uses the local memory controllers 522A and 526A to transfer data with the local memory 530A via the communication channels 524A and 528A.
The local memory 530A includes the memory devices 532A and 534A. In some implementations, each of the memory devices 532A and 534A is one of a variety of types of synchronous dynamic random-access memory (SDRAM) specifically designed for applications requiring both high memory data bandwidth and high memory data rates. In other implementations, each of the memory devices 532A and 534A is another type of DRAM. In various implementations, each of the communication channels 524A and 528A is a point-to-point (P2P) communication channel. A point-to-point communication channel is a dedicated communication channel between a single source and a single destination. Therefore, the point-to-point communication channel transfers data only between the single source and the single destination. The address information, command information, response data, payload data, header information, and other types of information are transferred on metal traces or wires that are accessible by only the single source and the single destination. In an implementation, the local memory controllers 522A and 526A support one of a variety of types of a Graphics Double Data Rate (GDDR) communication protocol.
It is noted that although the communication channels 524A and 528A use the term “communication channel,” each of the communication channels 524A and 528A is capable of transferring data across multiple memory channels supported by a corresponding memory device. For example, a single memory channel of a particular memory device can include 60 or more individual signals with 32 of the signals dedicated for the response data or payload data. A memory controller or interface of the memory device can support multiple memory channels. Each of these memory channels is included within any of the communication channels 524A and 528A.
The interface 512A includes circuitry that allows the chiplet 510A to communicate with external integrated circuits such as at least the components 540-570 shown. One or more of a communication bus, a point-to-point channel, a communication fabric, or other are used to transfer data and commands between the chiplets 510A-510B and at least the components 540-570. The network interface 570 supports a communication protocol for communication with one of a variety of types of a network. The DMA circuit 560 supports memory mapping and a communication protocol used to communicate with one of a variety of types of system memory. The display controller 550 receives rendered video frame data from the chiplets 510A-510B and prepares this data to present an image on a corresponding display device. Each of the chiplets 510A-510B and each of the caches 520A-520B is assigned a respective power domain by the power manager 540. The power domain includes at least operating parameters such as an operating power supply voltage and an operating clock frequency. Each of the power domains also includes control signals for enabling and disabling connections to clock generating circuitry and a power supply reference. Therefore, the caches 520A-520B are implemented as persistent memory similar to the caches 112, 122, 132 and 142 (of
In some implementations, the hardware, such as circuitry, of the power manager 540 determines when tasks of a workload enter an idle state. In other implementations, the power manager 540 receives an indication of the idle state. The idle state can indicate a static screen of the display device connected to the display controller 550. For example, a video graphics application no longer updates frame data to be viewed on the display device. It is possible that the video graphics application is paused or is waiting for further user input, and during the wait time, the scene or image is not updated on the display device. Therefore, the video processing subsystem of the computing system, such as the chiplets 510A-510B, enters an idle state although the video graphics application has not stopped being executed. The power manager 540 sends operating parameters and data storage commands 542 to one or more of the DMA circuit 560 and the chiplets 510A-510B. For example, the power manager 540 performs the steps described regarding the description of power management by the integrated circuits 100-400 (of
Turning now to
Field 616 stores an indication of whether dynamic identification or static allocation is being used for storing data of a given type in the multiple chiplets. Field 618 stores an indication specifying whether a cache of the corresponding chiplet is storing data of the given type in a contiguous manner or an interleaved manner. Field 620 stores a value indicating whether a memory of the corresponding chiplet, such as DRAM used as local memory, is storing data of the given type in a contiguous manner or an interleaved manner. Field 622 stores a current value indicating the most-recent P-state or power domain for the corresponding chiplet.
The control circuitry 630 receives usage measurements and indications 624, which represent activity levels of the chiplets and power consumption measurements or parameters used to determine recent power consumption values of the chiplets. The power-performance state (P-state) selector 632 selects the next operating parameters to use for the chiplets. The data storage arrangement allocator 634 (or allocator 634) includes circuitry that determines whether the caches and the memories store data of the given type in a contiguous manner or an interleaved manner.
In some implementations, the data of the given type is video frame data. Based on one or more of the expected size of the video frame data, the sizes of the last-level caches of the chiplets, expected performance degradation when accessing data in a contiguous manner from the memories, any quality of service (QOS) values associated with the video graphics application, values stored in the table 610, and so on, the allocator 634 determines whether the caches and the memories store data of the given type in a contiguous manner or an interleaved manner. One or more components of the power manager 600 use values stored in the configuration and status registers (CSRs) 636. The CSRs 636 store the above examples of values used by the allocator 634. In some implementations, one or more of the components of power manager 600 and corresponding functionality is provided in another external circuit, rather than provided here in power manager 600.
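A simplified model of a table entry and of the allocator's decision is sketched below. The field names mirror fields 616 through 622, while the threshold comparison against aggregate cache capacity is only a hypothetical example of the criteria listed above (frame size, cache sizes, performance degradation, QoS values); the actual selection logic of the allocator 634 is not limited to this sketch.

```python
# Illustrative model of a per-chiplet table entry (fields 616-622) and a
# simple allocator decision. The comparison against cache capacity is only an
# example of the listed criteria; real selection logic is not specified by
# this sketch.

from dataclasses import dataclass


@dataclass
class ChipletEntry:
    dynamic_allocation: bool      # field 616: dynamic identification vs. static
    cache_contiguous: bool        # field 618: cache layout (contiguous/interleaved)
    memory_contiguous: bool       # field 620: local DRAM layout
    p_state: int                  # field 622: most recent P-state / power domain


def choose_layout(frame_bytes: int, llc_bytes_per_chiplet: int,
                  num_chiplets: int, static_screen: bool) -> str:
    """Return 'contiguous' when the idle (static-screen) frame fits in the
    last-level caches; otherwise keep the interleaved arrangement."""
    total_llc = llc_bytes_per_chiplet * num_chiplets
    if static_screen and frame_bytes <= total_llc:
        return "contiguous"
    return "interleaved"


if __name__ == "__main__":
    entry = ChipletEntry(dynamic_allocation=True, cache_contiguous=False,
                         memory_contiguous=False, p_state=0)
    layout = choose_layout(frame_bytes=8 << 20, llc_bytes_per_chiplet=4 << 20,
                           num_chiplets=4, static_screen=True)
    entry.cache_contiguous = (layout == "contiguous")
    print(entry)
```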
Turning now to
Die-stacking technology is a fabrication process that enables the physical stacking of multiple separate pieces of silicon (integrated chips) together in a same package with high-bandwidth and low-latency interconnects. In some implementations, the dies are stacked side by side on a silicon interposer, or vertically, directly on top of one another. One configuration for the SiP is to stack one or more semiconductor dies (or dies) next to and/or on top of a processing unit such as processing unit 710. In an implementation, the SiP 700 includes the processing unit 710 and the modules 740A-740B. Module 740A includes the chiplet 720A and the chiplets 722A-722B. In various implementations, the chiplets 720A and 722A-722B are multiple three-dimensional (3D) semiconductor dies. Although a particular number of chiplets are shown, any number of chiplets is used as stacked 3D dies in other implementations.
The chiplet 720A is fabricated on a corresponding silicon wafer that is later diced to provide the chiplet 720A. Each of the chiplets 722A-722B is fabricated on a silicon wafer different from the silicon wafer used to provide the chiplet 720A and separate from the silicon wafer used to provide the processing unit 710. In some implementations, the chiplets 722A-722B include circuitry that renders video frame data that is later sent to a display controller (not shown). Each of the chiplets 722A-722B has a separate power rail, which allows one or more of the chiplets 722A-722B to be placed in a sleep state by the power manager 712 during an idle state that indicates a static screen of a display device. In some implementations, the module 740B is a replication of the module 740A. In various implementations, the power manager 712 has the functionality of the power manager 540 (of
Each of the modules 740A-740B communicates with the processing unit 710 through horizontal low-latency interconnect 730. In various implementations, the processing unit 710 is a general-purpose central processing unit, a graphics processing unit (GPU), an accelerated processing unit (APU), a field programmable gate array (FPGA), or other data processing device. The in-package horizontal low-latency interconnect 730 provides reduced lengths of interconnect signals versus long off-chip interconnects when a SiP is not used. The in-package horizontal low-latency interconnect 730 uses particular signals and protocols as if the chips, such as the processing unit 710 and the modules 740A-740B, were mounted in separate packages on a circuit board. In some implementations, the SiP 700 additionally includes backside vias or through-bulk silicon vias 732 that reach to package external connections 734. The package external connections 734 are used for input/output (I/O) signals and power signals.
In various implementations, multiple device layers are stacked on top of one another with direct vertical interconnects 736 tunneling through them. In various implementations, the vertical interconnects 736 are multiple through silicon vias grouped together to form through silicon buses (TSBs). The TSBs are used as a vertical electrical connection traversing through a silicon wafer. The TSBs are an alternative interconnect to wire-bond and flip chips. The size and density of the vertical interconnects 736 that can tunnel between the different device layers varies based on the underlying technology used to fabricate the 3D ICs. As shown, some of the vertical interconnects 736 do not traverse through each of the modules 740A-740B. Therefore, in some implementations, the processing unit 710 does not have a direct connection to one or more dies such as die 722D in the illustrated implementation. In that case, the routing of information relies on the other dies of the SiP 700.
For methods 800 and 900 (of
In the case of using an MCM, one or more of the chiplets are connected to separate power rails, and therefore, can use separate power domains. Similarly, in the case of using an SoC, one or more of the functional blocks are connected to separate power rails, and therefore, can use separate power domains. Therefore, the techniques and steps described earlier and described in the upcoming description directed to power management of multiple, replicated chiplets placed in an MCM are also applicable to power management of multiple, replicated functional blocks placed in an SoC where separate power rails are used for separate replicated functional blocks.
Referring to
Hardware, such as circuitry, of multiple chiplets of an integrated circuit processes tasks of a workload using assigned operating parameters (block 802). In various implementations, a power manager assigns a respective power domain to each of the chiplets. Each of the power domains includes at least operating parameters such as an operating power supply voltage and an operating clock frequency. Each of the power domains also includes control signals for enabling and disabling connections to clock generating circuitry and a power supply reference. In an implementation, the multiple chiplets process tasks of a video graphics workload such as rendering video frame data for a display device. The data of a given type is video frame data of a frame buffer that has been rendered by the multiple chiplets. This data of the given type is sent from the multiple chiplets to a display device.
The multiple chiplets store data of the given type in an interleaved manner among each of the multiple chiplets (block 804). For example, a first portion of the data of the given type is stored in a first chiplet, and a second portion different from the first portion of the data of the given type is stored in a second chiplet. A third portion different from the first portion and the second portion of the data of the given type is stored in a third chiplet, and so on. When the last chiplet of the multiple chiplets has a portion of the data of the given type stored in it, a next portion of the data of the given type is stored in the first chiplet. Data storage of the data of the given type continues in this manner. In some implementations, each portion is a contiguous portion compared to a previous portion, and each portion has a same size. In other implementations, one or more portions have a different size, and one or more portions is not a contiguous portion compared to a previous portion.
In various implementations, the multiple chiplets store data in a local memory device such as one of a variety of types of DRAM. In addition, the multiple chiplets store a copy of the data of the given type in a corresponding cache. In an implementation, the cache is a last-level cache of a cache memory subsystem hierarchy. The cache can support a writeback policy, or the cache can support a writethrough policy. In some implementations, the power manager or other control circuitry determines when tasks of a workload cause the integrated circuit to transition to a low-performance mode. In other implementations, the power manager or other control circuitry receives an indication of the low-performance mode. The low-performance mode can indicate a static screen of the display device. For example, a video graphics application no longer updates frame data to be viewed on the display device. It is possible that the video graphics application is paused or is waiting for further user input, and during the wait time, the scene or picture is not updated on the display device. Therefore, the video processing subsystem of the computing system that includes the multiple chiplets enters an idle state although the video graphics application has not stopped being executed.
If the control circuitry determines a transition to the low-performance mode has not yet occurred (“no” branch of the conditional branch 806), then control flow of method 800 returns to block 802 where the multiple chiplets of the integrated circuit process tasks of the workload using assigned operating parameters. However, if the control circuitry determines a transition to the low-performance mode has occurred (“yes” branch of the conditional branch 806), then the control circuitry sends commands to the multiple chiplets to transfer data of the given type between their caches (and possibly from their memories in cases where caches do not have sufficient temporary data storage for data transfers) until data of the given type is stored in a contiguous manner in the caches of the chiplets (block 808).
The commands indicate that the memories connected to the chiplets maintain storage of data of the given type in an interleaved manner (block 810). Subsequently, the control circuitry transitions the memories to a sleep state (block 812). The control circuitry sends commands or indications to the chiplets specifying maintaining operating parameters of an active state for a given chiplet of the multiple chiplets (block 814). The control circuitry transitions each of the multiple chiplets except the given chiplet to a sleep state, or otherwise powers them down to reduce power consumption (block 816). For example, the control circuitry powers down portions of the multiple chiplets other than the caches. The control circuitry removes the power supply reference level from these portions while maintaining connection to a power supply reference level for the caches of these multiple chiplets except the given chiplet, but with a voltage magnitude associated with the sleep state. In some implementations, the voltage magnitude provided by the power supply reference level of the sleep state is based on reducing leakage current of devices (transistors) within the caches. During the low-performance mode, the chiplets process requests targeting the data of the given type using the given chiplet (block 818). The control circuitry rotates among the multiple chiplets to have a single chiplet in the active state and service requests based on which data of the given type is targeted by the requests (block 820).
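For reference, the blocks of method 800 can be lined up as the following sketch, in which each step is annotated with its block number. The control-object methods are hypothetical placeholders for the hardware actions described above, not actual interfaces.

```python
# Condensed sketch of one pass through method 800. Each step is annotated with
# the block number used in the description above; the control-object methods
# are hypothetical placeholders that simply record the step names here.

class _ControlStub:
    """Minimal stand-in so the sketch can run; it only logs the steps."""
    def __init__(self):
        self.log = []

    def __getattr__(self, name):
        return lambda *args, **kwargs: self.log.append(name)


def method_800_pass(chiplets, control, low_performance: bool) -> None:
    control.process_tasks(chiplets)                    # block 802
    control.store_interleaved(chiplets)                # block 804
    if not low_performance:                            # conditional block 806, "no" branch
        return                                         # the next pass resumes at block 802
    control.rearrange_caches_contiguous(chiplets)      # block 808
    control.keep_memories_interleaved(chiplets)        # block 810
    control.sleep_memories(chiplets)                   # block 812
    given = chiplets[0]
    control.keep_active(given)                         # block 814
    control.sleep_all_except(chiplets, given)          # block 816
    control.serve_requests_from(given)                 # block 818
    control.rotate_active_by_target(chiplets)          # block 820


if __name__ == "__main__":
    control = _ControlStub()
    method_800_pass([110, 120, 130, 140], control, low_performance=True)
    print(control.log)
```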
Turning now to
It is noted that one or more of the above-described implementations include software. In such implementations, the program instructions that implement the methods and/or mechanisms are conveyed or stored on a computer readable medium. Numerous types of media which are configured to store program instructions are available and include hard disks, floppy disks, CD-ROM, DVD, flash memory, Programmable ROMs (PROM), random access memory (RAM), and various other forms of volatile or non-volatile storage. Generally speaking, a computer accessible storage medium includes any storage media accessible by a computer during use to provide instructions and/or data to the computer. For example, a computer accessible storage medium includes storage media such as magnetic or optical media, e.g., disk (fixed or removable), tape, CD-ROM, or DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, or Blu-Ray. Storage media further includes volatile or non-volatile memory media such as RAM (e.g. synchronous dynamic RAM (SDRAM), double data rate (DDR, DDR2, DDR3, etc.) SDRAM, low-power DDR (LPDDR2, etc.) SDRAM, Rambus DRAM (RDRAM), static RAM (SRAM), etc.), ROM, Flash memory, non-volatile memory (e.g., Flash memory) accessible via a peripheral interface such as the Universal Serial Bus (USB) interface, etc. Storage media includes microelectromechanical systems (MEMS), as well as storage media accessible via a communication medium such as a network and/or a wireless link.
Additionally, in various implementations, program instructions include behavioral-level descriptions or register-transfer level (RTL) descriptions of the hardware functionality in a high level programming language such as C, a hardware design language (HDL) such as Verilog or VHDL, or a database format such as GDS II stream format (GDSII). In some cases the description is read by a synthesis tool, which synthesizes the description to produce a netlist including a list of gates from a synthesis library. The netlist includes a set of gates, which also represent the functionality of the hardware including the system. The netlist is then placed and routed to produce a data set describing geometric shapes to be applied to masks. The masks are then used in various semiconductor fabrication steps to produce a semiconductor circuit or circuits corresponding to the system. Alternatively, the instructions on the computer accessible storage medium are the netlist (with or without the synthesis library) or the data set, as desired. Additionally, the instructions are utilized for purposes of emulation by a hardware-based emulator from such vendors as Cadence®, EVE®, and Mentor Graphics®.
Although the implementations above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.