The present invention relates to multi-core processors and, more particularly, to prioritization of clock rates in multi-core processors to achieve improved instruction throughput.
Over the past number of years continual improvement of microprocessor performance has been achieved through continued increases in clock rates associated with microprocessors. However, recently the improvement has slowed to a fraction of what has occurred in the past. Modern microprocessor designers are now achieving additional performance by increasing the number of microprocessor cores placed on a single semiconductor die. These multi-core processors enable a plurality of operations to be performed in parallel, thereby increasing instruction throughput, i.e., the total number of instructions executed per unit of time.
A noted disadvantage of multi-core processors is that with the addition of each core, the total power consumed by the processor increases. This results in generation of additional heat that must be dissipated, etc. Modern processors have a power envelope associated therewith based on the physical limitations of heat dissipation, etc. of the physical processors and packaging of the processors. Running a processor over the power envelope may cause physical damage to the processor and/or the cores contained therein.
Certain processors, such as those designed for laptop computers, include active power management that lowers the total power consumed by lowering the operational frequency of the processor. This may occur when, e.g., a laptop is placed in a standby mode. Processors designed for servers or other non-laptop applications typically have not been concerned about operating on battery power; however, the total power consumed (or heat generated) is now reaching a point where allowing power consumption to increase is no longer feasible due to physical constraints of the processor and/or processor packaging.
Generally, in a multi-core system, running all cores at full speed results in a power consumption of nPmax watts, where n is the number of cores in the processor and Pmax is the maximum power consumed by a single core. However, the processor's power budget is such that only αnPmax watts is feasible, where α represents the fraction of total power that may be consumed due to physical limitations of the semiconductor die and/or packaging.
Typically, processor cores operate using a fixed allocation of power consumption among the cores. However, a noted disadvantage of such a fixed allocation technique is that the overall system throughput, as measured by instructions performed per unit time, is suboptimal as will be shown herein. Assume that the frequency of each of the cores may be varied on some multiple of the clock cycle to a spectrum of frequencies (f0, f1, . . . , fmax). The power dissipation of the core is proportional to the square of the chosen frequency. As will be appreciated by one skilled in the art, the selected clock rate for a core during a particular time interval determines the core's instruction rate during that time interval.
Without loss of generality, assume that each core is capable of operating at one billion instructions per second (1 BIPS). Let the vector s={si,0<i≦n,0≦si} be the set of instruction service rates for each core. Furthermore, let the power for these n cores be defined as follows:
$$\sum_{i=1}^{n} c_i s_i^2 = \alpha n P_{\max} \qquad (1)$$
where ci is a constant for core i. The constant ci may represent architectural differences for a particular core. For example, one core on a processor may comprise a floating point unit, which consumes more power per instruction than, e.g., a simple arithmetic unit. As such, the power cost ci of that core may differ from that of other cores of the processor. To simplify modeling, assume that power varies with the square of the frequency and that the frequency determines the maximum instruction rate.
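The quadratic power model described here can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function names `total_power` and `within_budget` are hypothetical, and the budget is taken to be α·n·Pmax with Pmax defaulting to 1.

```python
def total_power(service_rates, cost_constants=None):
    """Total power under the assumed quadratic model: sum of c_i * s_i**2."""
    if cost_constants is None:
        # Assumption: identical cores, so every c_i is 1.
        cost_constants = [1.0] * len(service_rates)
    return sum(c * s ** 2 for c, s in zip(cost_constants, service_rates))


def within_budget(service_rates, alpha, p_max=1.0, cost_constants=None):
    """Check the allocation against the power budget alpha * n * p_max."""
    budget = alpha * len(service_rates) * p_max
    # Small tolerance so that allocations sitting exactly on the budget pass.
    return total_power(service_rates, cost_constants) <= budget + 1e-12


# Four identical cores at full speed (1 BIPS each) with half the power allowed.
print(within_budget([1.0, 1.0, 1.0, 1.0], alpha=0.5))
# The same cores at half speed draw a quarter of the power each.
print(within_budget([0.5, 0.5, 0.5, 0.5], alpha=0.5))
```

A core with a larger ci, such as one containing a floating point unit, simply contributes more power for the same service rate.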
To maintain overall operations within the power envelope of αnPmax, a processor designer could evenly distribute the processing capability across all cores. For simplicity, assuming ci=1, then Equation (1) becomes:
$$n s_i^2 = \alpha n P_{\max}$$
which reduces to:
$$s_i^2 = \alpha P_{\max}$$
Thus, if all cores utilize a fixed allocation, then each core can be allocated a service rate of $s_i = \sqrt{\alpha P_{\max}}$. To simplify this further for comparison purposes, let $P_{\max} = 1$, so:
$$s_i = \sqrt{\alpha} \qquad (2)$$
This indicates that under the fixed allocation scheme, when power is reduced by a fraction $1-\alpha$, the core service rates are reduced by a fraction $1-\sqrt{\alpha}$.
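The relationship between the power fraction and the fixed per-core service rate can be tabulated with a short script. This is a sketch under the simplifications stated above (ci = 1 and Pmax = 1 by default); the function name `fixed_service_rate` is hypothetical.

```python
import math


def fixed_service_rate(alpha, p_max=1.0):
    """Per-core service rate when the power budget is split evenly (c_i = 1)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return math.sqrt(alpha * p_max)


# Compare the power reduction (1 - alpha) with the resulting rate reduction.
for alpha in (1.0, 0.75, 0.5, 0.25):
    rate = fixed_service_rate(alpha)
    print(f"alpha={alpha:.2f}  rate={rate:.3f}  "
          f"power cut={1 - alpha:.2f}  rate cut={1 - rate:.3f}")
```

For instance, cutting power to one quarter (α = 0.25) only halves the service rate, because the rate scales with the square root of the power.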
Let the vector a={ai,0<i≦n,0≦ai} represent a set of requested instruction arrival rates for each core by an applied workload during a given interval. A noted disadvantage is that the requested instruction arrival rates may vary considerably and may exceed the fixed service rates during certain time intervals. Thus, the system throughput is suboptimal.
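The shortfall can be illustrated numerically. The sketch below assumes, for illustration only, that a core completes at most its service rate in an interval, so that throughput is the sum of min(ai, si) over the cores; the workload figures and the function name `throughput` are hypothetical.

```python
import math


def throughput(arrival_rates, service_rates):
    """Instructions completed per unit time, assuming each core is capped
    at its service rate (an illustrative assumption)."""
    return sum(min(a, s) for a, s in zip(arrival_rates, service_rates))


alpha = 0.5
arrivals = [1.0, 0.2, 0.1, 0.1]              # hypothetical requested rates, in BIPS
fixed = [math.sqrt(alpha)] * len(arrivals)   # even split of the power budget

demand = sum(arrivals)
served = throughput(arrivals, fixed)
print(f"requested={demand:.3f} BIPS  served={served:.3f} BIPS")
```

In this example the busy core is held to about 0.707 BIPS while the three lightly loaded cores leave most of their allocation unused.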
The present invention overcomes the disadvantages of the prior art by providing a system and method for prioritization of clock rates in a multi-core processor. Instruction arrival rates are measured during a time interval Ti−1 to Ti by a monitoring module either internal to the processor or operatively interconnected with the processor. Using the measured instruction arrival rates, the monitoring module calculates an optimal instruction service rate for each core of the processor. For processors that support continuous frequency changes for cores, each core is then set to its optimal service rate. For processors that only support a discrete set of service rates, the optimal rates are mapped to the closest supported rates and the cores are set to those rates. This procedure is then repeated for each time interval. By setting time intervals at an appropriate level, e.g., 1 millisecond, the present invention may approximate optimal instruction rate allocations among the cores, thereby improving system throughput.
The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numerals indicate identical or functionally similar elements:
The present invention provides a system and method for prioritization of clock rates in a multi-core processor. Illustratively, instruction arrival rates are measured during a time interval Ti−1 to Ti by a monitoring module associated with the processor. Using the measured instruction arrival rates, the monitoring module calculates an optimal instruction service rate for each core of the processor. This optimal instruction service rate is then used to dynamically modify the allocation of service rates among the cores, thereby increasing overall instruction throughput.
A. Multi-Core System Architecture
The network interface 120 comprises mechanical, electrical and signaling circuitry needed to connect the system to other systems over a network. The storage interface 125 coordinates with the operating system executing on the system to store and retrieve information requested on any type of attached array of writable storage device media such as video tape, optical, DVD, magnetic tape, bubble memory, electronic random access memory, micro-electromechanical and any other similar media adapted to store information, including data and/or parity information.
The multi-core processor 150 illustratively includes a plurality of cores 155A-D. It should be noted that any number of cores may be utilized in a single processor and any number of processors may be utilized in a single computer 100. As such, the description of four cores 155A-D in a single processor 150 should be taken as exemplary only. In accordance with an illustrative embodiment of the present invention, a monitoring module 157 is included within processor 150. Alternatively, the monitoring module 157 may be external to the processor, e.g., interconnected to the system bus 105. The monitoring module 157 monitors instruction arrival rates for each core 155. That is, the monitoring module 157 may identify the total number of instructions executed by a processor core 155 during a predefined quantum of time.
Furthermore, the monitoring module 157 may modify service rates for each of the cores to optimize instruction rate throughput in accordance with an illustrative embodiment of the present invention, as described further below. Illustratively, the monitoring module may comprise the external circuitry necessary to monitor instruction arrival rates to each of the cores of the processor and to modify the service rates of each core. The monitoring module may utilize various features of the processor in obtaining instruction arrival rates and/or setting service rates, e.g., the processor may include functionality to enable external monitoring of instruction arrival rates. Alternatively, many cores include functionality to count retired instructions during a predefined time period. In processors using such cores, the functionality of the monitoring module may be implemented directly in each core. As such, the description of the monitoring module as a separate module internal or external to the processor should be taken as exemplary only.
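One way a monitoring module might turn retired-instruction counts into per-core rates is sketched below. The function name `arrival_rates_from_counters`, the counter width, and the sample values are assumptions for illustration; the text does not specify how the counters are read or how wide they are.

```python
def arrival_rates_from_counters(prev_counts, curr_counts, interval_seconds,
                                counter_bits=64):
    """Estimate per-core instruction rates from two counter snapshots.

    prev_counts / curr_counts: one retired-instruction count per core, taken
    at the start and end of the interval. counter_bits is a hypothetical
    counter width used to tolerate a single wraparound.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    modulus = 1 << counter_bits
    return [((curr - prev) % modulus) / interval_seconds
            for prev, curr in zip(prev_counts, curr_counts)]


# Two cores sampled one millisecond apart (hypothetical counter values).
rates = arrival_rates_from_counters([0, 100], [1_000_000, 500_100], 0.001)
print([f"{r / 1e9:.2f} BIPS" for r in rates])
```

The modulo step means a counter that wraps once between snapshots still yields the correct difference.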
B. Optimizing Core Frequencies
The present invention provides a system and method for prioritization of clock rates in a multi-core processor. Instruction arrival rates are measured during a time interval Ti−1 to Ti by a monitoring module, either internal to the processor or operatively interconnected with the processor. Using the measured instruction arrival rates, the monitoring module calculates an optimal instruction service rate for each core of the processor. For processors that support continuous frequency changes for cores, each core is then set to its optimal service rate. For processors that only support a discrete set of service rates, the optimal rates are mapped to the closest supported rates and the cores are set to those rates. This procedure is then repeated for each time interval. By setting time intervals at an appropriate level, e.g., 1 millisecond, the present invention may approximate optimal instruction rate allocations among the cores, thereby improving system throughput.
More generally, the present invention provides a technique for optimizing frequency allocations among cores of a multi-core processor. As used herein, n is the number of cores on a single processor. Similarly, a represents a vector of requested instruction arrival rates, while α represents the fraction of maximum power that is within the appropriate power budget for the processor. Note that the value of α may vary depending upon, e.g., environmental concerns, architectural designs of a processor, etc. According to an illustrative embodiment of the present invention, the monitoring module determines a vector s of instruction service rates that are utilized to achieve increased instruction throughput among all of the cores of a processor.
The utilization of the cores is given by the vector u=a/s wherein each element is less than or equal to one. That is, the utilization of the cores (u) equals the instruction arrival rate divided by the optimal instruction service rate among each of the cores. As noted above, the clock rate (frequency) for a given core during a time interval determines the core's instruction rate during that time interval. The present invention is directed to a technique to maximize the utilization of the cores subject to the power constraint in Equation (1). More generally, the present invention utilizes an estimated instruction rate to set the clock rate for a next time interval. To that end, an illustrative objective function H is given by:
wherein y is a Lagrange multiplier.
Differentiating this with respect to sk for k=1, 2, . . . , n results in:
Setting this to zero and re-arranging results in:
Summing this over all k=1, 2, . . . , n and using Equation (1) generates:
Substituting this back into Equation (3) results in:
Assuming that ck=1, i.e., each core is equivalent to each other core on a single processor, the interpretation is that for optimal allocation the square of the instruction service rates should be assigned by apportioning a fraction
of the power budget to core k.
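The fraction itself is not reproduced in this text. As a hedged reading only, suppose the fraction is each core's share of the total measured arrival rate, and that the power budget is αnPmax with ck = 1. Under that assumption the allocation uses exactly the available budget, as the following check shows:

```latex
% Assumption: the apportioned fraction is a_k / \sum_j a_j (not confirmed by the text).
s_k^2 = \frac{a_k}{\sum_{j=1}^{n} a_j}\,\alpha n P_{\max}
\quad\Longrightarrow\quad
\sum_{k=1}^{n} s_k^2
  = \frac{\sum_{k=1}^{n} a_k}{\sum_{j=1}^{n} a_j}\,\alpha n P_{\max}
  = \alpha n P_{\max}
```

When all arrival rates are equal, each share is 1/n and the expression collapses to the fixed allocation in which the square of each service rate equals αPmax.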
This gives the optimal service rates for a set of cores to maximize the throughput subject to the power constraint. However, it assumes that the instruction arrival rate for each core is known. In practice, the arrival rates of the cores are not known a priori, but arrival rates can be estimated from past history. For example, by measuring the arrival rates in the interval Ti−1 to Ti using, e.g., monitoring module 157, it is possible to predict the arrival rates in time period Ti to Ti+1. Typically, arrival rates are correlated between successive time periods if the interval is made small enough, e.g., approximately one millisecond. Assuming that the overhead of changing core speeds in a processor is small enough, the monitoring module may effectuate changes thousands of times a second (e.g., as part of a clock interrupt) to enable updates to power allocations among the cores of a processor. In practice, chip vendors will likely implement a discrete set of frequencies rather than the continuous spectrum that Equation (5) suggests. Optimal service rates (frequencies) may be computed and then mapped to the nearest discrete frequency that is supported.
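A minimal sketch of the computation follows. It assumes, as an illustration rather than a statement of the claimed formula, that each core receives a share of the power budget αnPmax proportional to its measured arrival rate, and that the rates measured in the last interval serve directly as the prediction for the next one. The function name `optimal_service_rates` is hypothetical.

```python
import math


def optimal_service_rates(arrival_rates, alpha, p_max=1.0, cost_constants=None):
    """Service rates s_k with c_k * s_k**2 = share_k * alpha * n * p_max.

    Assumption: share_k = a_k / sum(a), i.e. power is apportioned in
    proportion to the arrival rates measured in the previous interval.
    """
    n = len(arrival_rates)
    if cost_constants is None:
        cost_constants = [1.0] * n   # identical cores
    budget = alpha * n * p_max
    total = sum(arrival_rates)
    if total == 0:
        # No measured demand: fall back to an even split of the budget.
        return [math.sqrt(budget / (n * c)) for c in cost_constants]
    return [math.sqrt((a / total) * budget / c)
            for a, c in zip(arrival_rates, cost_constants)]


measured = [1.0, 0.2, 0.1, 0.1]   # hypothetical rates from interval T(i-1)..T(i)
rates = optimal_service_rates(measured, alpha=0.5)
print([round(r, 3) for r in rates])
print(round(sum(r ** 2 for r in rates), 6))   # total power used
```

This sketch does not cap a rate at the core's maximum frequency; a real implementation would presumably clamp each rate and redistribute any unused budget.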
Utilizing the measured instruction arrival rates, the monitoring module 157 then, in step 215, calculates the optimal service rates for the cores. Illustratively, this is accomplished by assigning
of the overall power to each core based on the measured arrival rate ak of each core. Because chip manufacturers typically may not support continuous frequency ranges among the cores, the monitoring module maps the calculated optimal service rates to the nearest supported rates for the cores in step 220. Thus, for example, if the optimally calculated service rate is 1.2 billion instructions per second (BIPS) for a particular core, and the core only supports 1 BIPS or 1.5 BIPS, the monitoring module will map the particular core to 1 BIPS.
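The mapping step can be sketched as a nearest-value lookup. The function name `nearest_supported_rate` is hypothetical, and the supported rates are the two values from the example above.

```python
def nearest_supported_rate(rate, supported_rates):
    """Return the supported rate closest to the calculated optimal rate.

    On a tie, min() keeps the first candidate, so listing the supported
    rates in ascending order favors the lower (lower-power) rate.
    """
    return min(supported_rates, key=lambda r: abs(r - rate))


print(nearest_supported_rate(1.2, [1.0, 1.5]))  # prints 1.0
```

Rounding to the nearest rate can round upward, so an implementation that must never exceed the power envelope might instead round down or re-check the total power after mapping.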
Then, in step 225, the monitoring module sets the cores to the optimal service rates (or the nearest supported rates). Thus, during the next time interval (Ti to Ti+1), the various processor cores execute at the optimal instruction service rates (or the nearest supported rates), which enables improved overall processor performance while maintaining power consumption within the power envelope. The procedure then completes in step 230. As will be appreciated by one skilled in the art, procedure 200 is continuously repeated by the monitoring module and/or processor so that during each time period, e.g., every millisecond, each core is operating at the optimal service rate.
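One pass of the procedure (measure, calculate, map, set) can be sketched as a single function. Everything named here is hypothetical: `measure_arrival_rates` and `set_core_rate` stand in for whatever hardware interface a monitoring module would use, and the proportional split of the budget αnPmax (with ck = 1) is an assumption for illustration.

```python
import math


def run_interval(measure_arrival_rates, set_core_rate, alpha,
                 supported_rates, p_max=1.0):
    """One iteration: measure arrival rates, compute rates, map, and apply."""
    arrivals = measure_arrival_rates()            # measure for the last interval
    n = len(arrivals)
    budget = alpha * n * p_max
    total = sum(arrivals)
    chosen = []
    for core, a in enumerate(arrivals):
        # Assumed rule: power share proportional to measured arrival rate.
        share = a / total if total > 0 else 1.0 / n
        ideal = math.sqrt(share * budget)         # calculate the ideal rate
        rate = min(supported_rates, key=lambda r: abs(r - ideal))  # map
        set_core_rate(core, rate)                 # apply to the core
        chosen.append(rate)
    return chosen


applied = {}
chosen = run_interval(lambda: [1.0, 0.2, 0.1, 0.1],   # stand-in measurement
                      applied.__setitem__,            # stand-in for hardware
                      alpha=0.5,
                      supported_rates=[0.25, 0.5, 0.75, 1.0])
print(chosen)
```

In a running system this function would be invoked once per time interval, e.g., from a periodic timer.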
In accordance with alternative embodiments of the present invention, the monitoring module 157 may collect historical information regarding instruction arrival rates when certain types of processes are executing on processor 150. For example, the monitoring module 157 may collect such historical data for analysis of various instruction arrival rates based on types of processes executing. In such alternative embodiments, when a process is initialized via, e.g., a task switch from another type of process, the monitoring module 157 may preconfigure the processor 150 using historical arrival rates associated with the type of process to be executed. This preconfiguration may improve initial throughput during task switching until appropriate samples may be taken once the task switch has been effectuated.
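The historical preconfiguration described above might be organized as a per-process-type table of average arrival rates. The class name `ArrivalHistory`, the use of a simple running mean, and the process-type labels are all assumptions; the text does not say how the history is stored or summarized.

```python
from collections import defaultdict


class ArrivalHistory:
    """Keeps a running mean of per-core arrival rates for each process type."""

    def __init__(self):
        self._sums = {}                    # process type -> per-core rate sums
        self._counts = defaultdict(int)    # process type -> number of samples

    def record(self, process_type, arrival_rates):
        """Add one interval's measured arrival rates for this process type."""
        sums = self._sums.setdefault(process_type, [0.0] * len(arrival_rates))
        for i, a in enumerate(arrival_rates):
            sums[i] += a
        self._counts[process_type] += 1

    def expected_rates(self, process_type):
        """Mean arrival rates seen so far, or None if the type is unknown."""
        count = self._counts.get(process_type, 0)
        if count == 0:
            return None
        return [total / count for total in self._sums[process_type]]


history = ArrivalHistory()
history.record("database", [1.0, 0.5])   # hypothetical samples, in BIPS
history.record("database", [0.5, 0.5])
print(history.expected_rates("database"))
print(history.expected_rates("backup"))
```

On a task switch, the expected rates for the incoming process type could seed the service-rate calculation until fresh samples are available; an unknown type returns None so the caller can fall back to a default allocation.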
The foregoing description has been directed to specific embodiments of this invention. It will be apparent, however, that other variations and modifications may be made to the described embodiments, with the attainment of some or all of their advantages. For instance, it is expressly contemplated that the teachings of this invention can be implemented as software, including a computer-readable medium having program instructions executing on a computer, hardware, firmware, or a combination thereof. Accordingly, this description is to be taken only by way of example and not to otherwise limit the scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
Number | Date | Country |
---|---|---|
20080263384 A1 | Oct 2008 | US |