This application relates generally to techniques for measuring on-chip temperature and applications thereof.
In modern complex processors, unexpected thermal events such as localized hotspots can occur when a small area of the processor is continuously active when executing a given set of instructions. The resulting power density increases the temperature of the chip and causes a hotspot to form in the processor. These hotspots may cause spatial thermal gradients that affect the performance and lifetime of the chip.
Most modern processor designs use temperature-based estimates at a coarse-grained resolution to determine how much power is being dissipated through heat. Current mechanisms for estimating and mitigating excessive power loss through heat dissipation are reactive measures and are predominantly sensor-based. Analog sensors, such as diodes, are placed on the die and their output is used as a proxy for on-chip temperature.
These measurements, however, tend to vary based on the ambient temperature. For example, an identical chip with identical sensors and based on identical input may give different temperature estimates in a cold climate than it would in a warm climate. As a result, the response mechanisms utilized to mitigate power loss will be different depending on the geographic location of the chip. Therefore, the overall performance of an identical chip processing identical data will differ depending on the geographic location of the chip.
Other techniques employed in the past convert power dissipation and prior temperature to temperature estimates. Power estimates may be made based on measures of activity. The resulting power estimates may then be converted to temperature with knowledge of the prior temperature in a given region and modeling of heat dissipation laterally in the silicon and vertically through the thermal interface material, package lid, heat sink, and so on. These techniques estimate the temperature at a coarse resolution and are not suitable for predicting the development of hotspots.
Another existing method for determining on-die temperature is through thermal sensors built into the silicon. While these techniques may be used to implement thermal management systems into hardware, they cannot be applied to modeling thermal events for product development and planning. Also, when implemented in hardware, the accuracy of the thermal sensors is adversely affected by process variations and may be difficult to calibrate.
Current scheduling techniques in modern processors typically focus on processing workloads as fast as possible without any consideration of on-die power density or temperature. Issues involving power management and efficiency are expected to increase in quantity and complexity. Power density (i.e., the amount of power over a set area of the die) and the amount of power of a localized component are expected to be more diverse in the future.
Therefore, it is desirable to develop more accurate methods of predicting on-die temperature and to incorporate accurate predictions into the scheduling of work across the die.
A method and apparatus are disclosed for estimating temperature and scheduling workload on an integrated circuit (IC). When an instruction is executed on the IC, an activity level and temperature are measured. A relationship between the activity level and the temperature is determined, allowing the temperature to be estimated from the activity level. The activity level of the IC is monitored and is input to a scheduler, which estimates the temperature of the IC based on the activity level. The scheduler distributes work taking into account the temperature of various regions of the IC and may include distributing work to the region of the IC that has the lowest estimated temperature or relatively lower estimated temperature (e.g., lower than the average IC or IC region temperature). When the utilization level of one or more regions of the IC is high, the scheduler is configured to reduce the clock speed or reduce the voltage of the one or more regions of the IC, or flag the one or more regions as being unavailable for additional workload.
A method of measuring estimated temperature on an integrated circuit (IC) having a plurality of regions includes executing an instruction on the IC; measuring an activity level and a temperature of each of the plurality of regions; and determining the relationship between the measured temperature and the activity level.
An integrated circuit includes a plurality of regions selectable for processing activity, a plurality of activity monitors, and a scheduler. The plurality of activity monitors are configured to monitor an activity level of each of the plurality of regions, wherein the activity level is proportional to an estimated temperature. The scheduler is configured to distribute instructions to the plurality of regions, wherein the scheduler distributes instructions to regions based on the activity level.
An integrated circuit includes a scheduler configured to distribute instructions to a plurality of regions, wherein the scheduler distributes instructions to a region based on a temperature of each of the plurality of regions.
A method of scheduling instructions in an IC includes monitoring an activity level of a plurality of regions on the IC and computing an estimated temperature for each of the plurality of regions based on the activity level.
A non-transitory computer-readable storage medium storing a set of instructions for execution by one or more processors to facilitate manufacture of an integrated circuit device, the integrated circuit device including a plurality of processing units, a plurality of activity monitors, a scheduler, and a logic circuit. The plurality of activity monitors are configured to monitor an activity level of each of the plurality of processing units, wherein the activity level is proportional to an estimated temperature. The scheduler is configured to distribute instructions to the plurality of processing units, wherein the scheduler distributes instructions to a processing unit with a lowest estimated temperature based on the activity level. The logic circuit is configured to determine a utilization level of at least one of the plurality of processing units.
A more detailed understanding may be had from the following description, given by way of example in conjunction with the accompanying drawings, wherein:
Accurate on-die temperature estimation at high spatial resolution is important for achieving better performance, performance per watt of power, power management, energy efficiency, and reliability. Accurate temperature estimation allows for identification and mitigation of localized hotspots, thermal gradients, and transient thermal variations. Accurate temperature estimation may also be used to reduce system cost by reducing package and cooling solution costs. In addition, accurate temperature estimation may improve the long-term product reliability by minimizing electro-migration, a function of current and temperature, as well as package and thermal interface material (TIM) reliability.
In addition, it is often desirable to obtain a theoretical maximum temperature for all devices of a given product line. This theoretical maximum temperature represents a worst case scenario for a particular device, as determined by worst case process variations. The theoretical maximum temperature may also represent a worst case environmental scenario for the device. This theoretical maximum temperature provides the outer bounds of expected scenarios. Thermal management is performed with this input to obtain deterministic worst case behavior across a whole line of products, which is often an important market requirement.
The processor 102 may include a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core may be a CPU or a GPU. The memory 104 may be located on the same die as the processor 102, or may be located separately from the processor 102. The memory 104 may include a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
The storage 106 may include a fixed or removable storage, for example, a hard disk drive, a solid state drive, an optical disk, or a flash drive. The input devices 108 may include a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 110 may include a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
The input driver 112 communicates with the processor 102 and the input devices 108, and permits the processor 102 to receive input from the input devices 108. The output driver 114 communicates with the processor 102 and the output devices 110, and permits the processor 102 to send output to the output devices 110. It is noted that the input driver 112 and the output driver 114 are optional components, and that the device 100 will operate in the same manner if the input driver 112 and the output driver 114 are not present.
A die may be subdivided into M×N regions as illustrated in
In an embodiment shown in
T=f(A, Tp) (Equation 1)
Where T=estimated temperature, A=block activity, and Tp=prior temperature.
The relationship between the activity levels and the measured temperature is determined using Equation 1 (step 306). The process is repeated for each of a known set of impulses corresponding to various typical workloads that may be expected (step 308). A set of linear equations may be solved to map the activity of a block to the temperature profile. The solver used may be a simple solver, such as one offered in a commercial spreadsheet package.
The solution to the set of linear equations is a set of thermal coefficients that maps the activity, as measured by performance counters (i.e., activity monitors), to an estimated temperature of the block. By using the thermal coefficients as multipliers to unit activity, the on-die temperature for a given input generating a given on-die activity level may be predicted.
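The mapping described above can be sketched as a small linear-system solve. The block counts, activity values, and measured temperature rises below are illustrative assumptions, not data from the disclosure; the solver is a minimal stand-in for the "simple solver" mentioned above.

```python
# Hypothetical sketch: solve the set of linear equations that maps
# per-block activity (performance-counter readings) to the measured
# temperature rise, yielding one thermal coefficient per block.

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    A: list of equal-length rows (square system); b: right-hand side."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Three workload impulses (rows) exercising three blocks (columns),
# and the temperature rise measured for each impulse.
activity = [[1.0, 0.0, 0.0],
            [1.0, 1.0, 0.0],
            [1.0, 1.0, 1.0]]
temp_rise = [4.0, 7.0, 9.0]
thermal_coeffs = solve_linear(activity, temp_rise)  # degrees per unit activity
```

With the coefficients in hand, multiplying a new activity sample by them predicts that block's temperature rise, as the preceding paragraph describes.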
In practice, the activity of adjacent blocks also adds to or subtracts from the temperature of the block due to lateral heat transfer. Heat is also eliminated through the fan-sink. Hence, Equation 1 may be modified as follows.
T=f(A, Aadj, Asink, Tp) (Equation 2)
Where T=estimated temperature, A=block activity, Aadj=adjacent block activity, Asink=heat sink activity, and Tp=prior temperature.
Equation 2 may be used to calculate the relationship in step 306 of
Based on the relationship between the activity level, the temperature, and the current, a temperature and/or current map may be generated for the entire die (step 410). As discussed above, the number of regions may be selected to provide a desired granular resolution for the temperature estimates. The method may be refined through correlation against thermal images or sensor readings (step 412).
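The per-region map of step 410 can be illustrated with a short sketch. The grid size, coefficient value, and ambient temperature are assumptions for illustration only; a real implementation would use the fitted thermal coefficients and the adjacency terms of Equation 2.

```python
# Illustrative sketch: build an M x N estimated-temperature map from
# monitored per-region activity, using a single fitted coefficient
# (degrees per unit activity) and an assumed ambient baseline.

def temperature_map(activity_grid, coeff, t_ambient):
    """activity_grid: M x N activity counts from the activity monitors.
    Returns an M x N grid of estimated temperatures in degrees C."""
    return [[t_ambient + coeff * a for a in row] for row in activity_grid]

activity = [[10, 50, 10],
            [10, 90, 10]]  # elevated activity marks a potential hotspot
temps = temperature_map(activity, coeff=0.5, t_ambient=40.0)
hottest = max(max(row) for row in temps)  # 40 + 0.5 * 90 = 85.0
```

Increasing M and N refines the granularity of the map, at the cost of monitoring more regions.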
Temperature estimation maps are used for thermal analysis and thermal management of hot spots and thermal gradients on the die. The temperature data may also be used for package and cooling design, and as an input to logic timing analysis. For reliability studies, the current and temperature data are used to determine FIT rates due to electro-migration. The mean time to failure (MTTF) of the on-chip interconnects is based on Black's equation: it depends on the current density in the block and is exponentially dependent on the temperature. Determining the MTTF may be useful in large data center environments to track the availability of individual nodes.
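The MTTF dependence noted above can be sketched with Black's equation, MTTF = A * J^(-n) * exp(Ea / (k * T)). The constants chosen below (scale factor A, current exponent n, activation energy Ea) are illustrative assumptions, not measured process values.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def black_mttf(j, t_kelvin, a=1.0e3, n=2.0, ea=0.7):
    """Black's equation for electromigration mean time to failure.
    j: current density (A/cm^2); t_kelvin: absolute temperature;
    a, n, ea: assumed process constant, current exponent, and
    activation energy (eV)."""
    return a * j ** (-n) * math.exp(ea / (BOLTZMANN_EV * t_kelvin))

# At fixed current density, a hotter block has a shorter expected life:
mttf_90c = black_mttf(1.0e5, 273.15 + 90.0)
mttf_110c = black_mttf(1.0e5, 273.15 + 110.0)
ratio = mttf_90c / mttf_110c  # > 1: lifetime shrinks as temperature rises
```

This is why the current-plus-temperature map, rather than temperature alone, drives the FIT-rate analysis described above.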
Another embodiment includes pro-active instruction and data scheduling procedures to maintain uniform power density in both time and space in all units on the die. As the on-die activity level rises, so does the on-die temperature. As the temperature increases, the amount of power dissipated through heat increases, reducing system performance and product lifetime.
Unlike reactive mechanisms that impact system performance, such as frequency throttling, the disclosed scheduling procedures distribute work in a way to minimize hotspots. Reduced hot spots and thermal cycling allows for lower cost cooling solutions and better chip reliability. In addition, by minimizing the thermal gradients and maintaining a lower die temperature, the impacts of temperature may be minimized and power leakage may be reduced, contributing to better performance and power efficiency.
In the following set of exemplary embodiments, a die may include multiple processing units. When one processing unit begins to operate at a higher power density, work may be scheduled to another processing unit such that all processing units are being utilized at an optimum level, yet are not overworked so as to generate localized hotspots.
When all processing units are utilized at maximum utilization, performance may be negatively affected. By taking temperature estimates into account when scheduling work across the multiple processing units, the on-die temperatures may be kept at a lower level than with current schedulers. Due to the direct relationship between temperature and power leakage, the die will dissipate less power. Thus, it is possible to extract more performance out of the same system by carefully managing the temperature.
As previously described, the scheduler is actively distributing work to the processors with lower temperatures (e.g., the lowest temperature). Because the activity level is directly related to the temperature level, it is likely that a processing unit operating at or near full utilization is running at a high temperature. Since the scheduler is already distributing work to the processing units that have lower temperatures, it is likely that all the processing units on the die are operating at or near full utilization and are at a high temperature. In this situation, an additional mechanism is required to actively reduce the on-die temperature. To accomplish this, the scheduler also tracks the overall temperature of the entire die.
A determination is made whether one of the processing units on the die is at or near full utilization (step 606). If the processing unit is operating at or near full utilization, one of three actions may result. A first option is that the scheduler reduces the clock speed of the processing unit for a predetermined amount of time (step 608). This has the effect of reducing the speed at which work is processed by that unit, thereby cooling the unit. A second option is that the scheduler reduces the voltage of the processing unit for a predetermined amount of time (step 610). This has the effect of reducing the amount of power consumed by the processor, thus allowing the unit to cool. A third option is that the scheduler flags the processing unit as unavailable for additional instructions for a predetermined amount of time (step 612). This has the effect of preventing work from being scheduled to the processing unit, thus allowing the unit to cool. Once the scheduler chooses one or more of these actions, the procedure restarts.
Such a scheme creates an automatic regulatory scheduling system that reduces the rate at which on-die temperature increases. For example, if unit 5 has the lowest temperature, then the next instruction is scheduled for unit 5. When the activity on unit 5 increases such that it is operating at a higher power relative to the other units, unit 5 will automatically not receive any more work. This scheme, thus, regulates temperature across the entire die by identifying the processing unit with the lowest temperature and scheduling data to it.
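The regulatory scheme above can be sketched as a simple dispatch policy. The unit count, temperatures, and utilization threshold below are illustrative assumptions; the disclosure does not fix particular values.

```python
# Hedged sketch of the scheduling policy: dispatch each new instruction
# to the unit with the lowest estimated temperature, and treat a unit at
# or near full utilization as unavailable (steps 606 and 612).

def pick_unit(est_temp, utilization, util_limit=0.95):
    """est_temp, utilization: per-unit lists. Returns the index of the
    coolest available unit, or None when every unit is near full
    utilization (the point at which clock or voltage reduction applies)."""
    available = [i for i, u in enumerate(utilization) if u < util_limit]
    if not available:
        return None  # all units busy: reduce clock speed or voltage instead
    return min(available, key=lambda i: est_temp[i])

est_temp = [72.0, 65.0, 80.0, 65.0]
utilization = [0.50, 0.97, 0.60, 0.40]
# Unit 1 is coolest but near full utilization, so unit 3 is chosen.
chosen = pick_unit(est_temp, utilization)
```

A unit that accumulates work in this way naturally stops receiving instructions once its estimated temperature rises above its neighbors', which is the self-regulating behavior described above.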
As an example, all units on a die are operating at a high utilization, and the temperature of the die has reached 110° C. The power leakage is about 30% higher than if the die was operating at 90° C. Current technology only allows for throttling the clock speed and the voltage of the die as a whole, which causes loss of performance. The scheme described above offers a more proactive mechanism for accomplishing the same result, but changes the clock speed or the voltage at locations closer to the processing units. Thus, the duration for which processing is slowed is reduced.
Although features and elements are described above in particular combinations, each feature or element may be used alone without the other features and elements or in various combinations with or without other features and elements. The apparatus described herein may be manufactured by using a computer program, software, or firmware incorporated in a computer-readable storage medium for execution by a general purpose computer or a processor. Examples of computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).
Embodiments of the present invention may be represented as instructions and data stored in a non-transitory computer-readable storage medium. For example, aspects of the present invention may be implemented using Verilog, which is a hardware description language (HDL). When processed, Verilog data instructions may generate other intermediary data (e.g., netlists, GDS data, or the like) that may be used to perform a manufacturing process implemented in a semiconductor fabrication facility. The manufacturing process may be adapted to manufacture semiconductor devices (e.g., processors) that embody various aspects of the present invention.
Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, a graphics processing unit (GPU), a DSP core, a controller, a microcontroller, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), any other type of integrated circuit (IC), and/or a state machine, or combinations thereof.
Processors may be any one of a variety of processors such as a central processing unit (CPU) or a graphics processing unit (GPU). For instance, they may be x86 microprocessors that implement the x86 64-bit instruction set architecture and are used in desktops, laptops, servers, and superscalar computers, or they may be Advanced RISC (Reduced Instruction Set Computer) Machines (ARM) processors that are used in mobile phones or digital media players. Other embodiments of the processors are contemplated, such as Digital Signal Processors (DSP) that are particularly useful in the processing and implementation of algorithms related to digital signals, such as voice data and communication signals, and microcontrollers that are useful in consumer applications, such as printers and copy machines. Although the embodiments may be illustrated with one processor, embodiments with any other number of processors are consistent with those described.
Related Application Data: U.S. Provisional Application No. 61580490, filed Dec 2011.