Power and thermal management are becoming more challenging than ever before in all segments of computer-based systems. While in the server domain the cost of electricity drives the need for low power systems, in mobile systems battery life and thermal limitations make these issues relevant. Optimizing a system for maximum performance at minimum power consumption is usually done using the operating system (OS) to control hardware elements. Most modern OSs use the Advanced Configuration and Power Interface (ACPI) standard, e.g., Rev. 3.0b, published Oct. 10, 2006, for optimizing the system in these areas. An ACPI implementation allows a core to be in different power-saving states (also termed low power or idle states), generally referred to as C1 to Cn states. Similar package C-states exist for package-level power savings.
When a core is active, it runs at a so-called C0 state, and when the core is idle, it may be placed in a core low power state, a so-called core non-zero C-state. The core C1 state represents the low power state that has the least power savings but can be switched on and off almost immediately, while an extended deep-low power state (e.g., C3) represents a power state where the static power consumption is negligible, but the time to enter into this state and respond to activity (i.e., back to C0) is quite long.
Package non-zero C-states enable power consumption at lower levels than the package active state (i.e., C0 state). Server workloads rarely keep all cores in the same package busy, but even if only one core is active the whole package (including all idle cores) must stay in a high-power C0 state. Since package non-zero C-state entry/exit latency is relatively long (e.g., on the order of 100 to 200 microseconds (μs)), the transient periods in which all cores are idle are usually too short to make that state worthwhile; otherwise a performance loss will occur. Thus the OS is unable to take advantage of a package's lower power state benefits, resulting in a package always running at a higher power state than needed.
Embodiments may reschedule/delay tasks so that the idle time of all cores of a package can be aligned and extended. In this way, more opportunities exist for using greater low power states, i.e., deeper non-zero package C-states. Embodiments may operate at relatively fine granularity, e.g., every 500 microseconds (μs), so that latency-sensitive workload performance is not degraded. In contrast, a conventional OS scheduler simply leaves all tasks' timing as set.
In various embodiments, a predetermined interval, e.g., 500 μs, may be set and during each interval break-event processing may be delayed to make cores in the same package idle together and busy together. Further, the busy times can be stitched to be continuous (i.e., not separated by short idles) so that the idle duration can be extended to accommodate a package deep non-zero C-state's long entry/exit latencies. As described below, a prediction for future core utilization, i.e., for the next operation interval, may be generated. Then, real-time task rescheduling may be performed to enable greater power savings. Note that the C-states described herein are for an example processor such as an advanced Intel® Architecture 32 (IA-32) processor available from Intel Corporation, Santa Clara, Calif., although embodiments can equally be used with other processors. Shown in Table 3 below is an example designation of package C-states available in one embodiment. However, understand that the scope of the present invention is not limited in this regard.
Using a conventional scheduling algorithm, for an example workload at a 15% system load level, the package is in the C0 state for 70% of the time (whereas the theoretical best case for power savings would be 15%). For the remaining 30% of the time, during which the package could enter a non-zero package C-state, a large portion of the package idle periods last less than 500 μs, which typically does not justify the entry/exit transition energy cost of a deep package low power state. Assume the following power consumption levels for an example processor in the package C0, C1, and C3 states:
Power(C0)=130 Watts (W)
Power(C1)=28 W
Power(C3)=18 W
The power consumed in this example using a conventional scheduling policy is:
Power(C0)*70%+Power(C1)*30%*52%+Power(C3)*30%*48%=97.96 W [EQ. 1]
In contrast, embodiments may provide deeper low power states for longer time periods. For example, in comparison to the above calculation, a processor can be scheduled in accordance with an embodiment of the present invention such that it is in the active state for only 20% of the time and in a deeper low power state (e.g., package C3) for 80% of the time. In this case, the processor consumes:
Power(C0)*20%+Power(C3)*80%=40.4 W [EQ. 2]
leading to a theoretical upper limit of power saving of 57.56 W (or 58.8%) using an embodiment of the present invention.
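The arithmetic of EQ. 1 and EQ. 2 can be checked with a short sketch. The power levels and residency fractions are the example values given above; the names `POWER_W` and `average_power` are hypothetical and chosen only for illustration.

```python
# Example package power levels from the text, in watts.
POWER_W = {"C0": 130.0, "C1": 28.0, "C3": 18.0}


def average_power(residency):
    """Return the time-weighted average power in watts.

    `residency` maps a package C-state name to the fraction of time
    spent in that state; the fractions are expected to sum to 1.
    """
    total = sum(residency.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("residency fractions must sum to 1")
    return sum(POWER_W[state] * frac for state, frac in residency.items())


# EQ. 1: conventional scheduling (70% C0, idle time split 52%/48% between C1 and C3).
conventional = average_power({"C0": 0.70, "C1": 0.30 * 0.52, "C3": 0.30 * 0.48})

# EQ. 2: rescheduled case (20% C0, 80% C3).
rescheduled = average_power({"C0": 0.20, "C3": 0.80})

saving = conventional - rescheduled
saving_pct = 100.0 * saving / conventional

print(f"conventional: {conventional:.2f} W")
print(f"rescheduled:  {rescheduled:.2f} W")
print(f"saving:       {saving:.2f} W ({saving_pct:.1f}%)")
```

The sketch reproduces the figures stated above: 97.96 W, 40.4 W, and a saving of 57.56 W (about 58.8%).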
Referring now to
U=delta of unhalted core reference clockticks/delta of TimeStamp Counter [EQ. 3]
where the unhalted clock ticks are clock ticks occurring when the core is active and timestamp counter is a timestamp of total processor cycles during the monitored interval. The monitored data from monitor 20 may be provided to a predictor 30.
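As a minimal sketch of EQ. 3, the utilization over a monitored interval can be computed from two samples of each counter. The function name, its parameters, and the sample counter values are hypothetical; how the counters are actually read is platform-specific and not shown.

```python
def core_utilization(unhalted_start, unhalted_end, tsc_start, tsc_end):
    """Return core utilization U as a fraction in [0, 1] per EQ. 3.

    U = delta(unhalted reference clock ticks) / delta(timestamp counter),
    where both deltas are taken over the same monitored interval.
    """
    delta_tsc = tsc_end - tsc_start
    if delta_tsc <= 0:
        raise ValueError("timestamp counter must advance over the interval")
    delta_unhalted = unhalted_end - unhalted_start
    return delta_unhalted / delta_tsc


# Hypothetical samples: 150 unhalted ticks out of 1000 total ticks.
u = core_utilization(unhalted_start=1000, unhalted_end=1150,
                     tsc_start=50000, tsc_end=51000)
print(f"U = {u:.0%}")
```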
Predictor 30 may be used to predict future core utilization. In one embodiment, a Kalman filter algorithm prediction of computational complexity O(n) may be performed, where n is number of cores in the same package, such that the prediction can be done in real time. For each prediction interval (e.g., 500 μs), predictor 30 may provide information to an OS scheduler 40 regarding its predictions of the core utilization, which may be on a utilization percentage basis.
In various embodiments, OS scheduler 40 may perform embodiments of the present invention to enable transitioning of the processor package to a lower power state. For example, using embodiments of the present invention, tasks, interrupts and break events to be scheduled on the various cores 55 may be delayed to enable a longer and more continuous idle period in which package 50 can be placed in a deeper low power state. Furthermore, package 50 may remain in this selected low power state for a longer time duration. At the conclusion of this extended idle period, which may be referred to herein as a delay period, the various cores of package 50 may be activated to perform any pending tasks, interrupts or other break events that may have been buffered during the delay period.
Thus based on the predicted core utilizations for the next interval, all incoming break events and tasks can be delayed for the duration of the delay period, referred to herein as a time T* (and which may vary in each operation interval). After that time, all break events and tasks will be serviced.
Since the OS's periodic timer interrupt is also delayed, on each core a watchdog timer 58 may be set with an initial value T*. Note that while in some embodiments timer 58 may be present in each core 55, in other implementations only a single package timer may be present. When this timer expires, it will create a non-maskable interrupt and wake the corresponding core 55. Cores 55 may then service tasks and break events. In various embodiments buffers 45a and 45b may keep all interrupts received before the watchdog timer expires. Note that while shown as being coupled between OS scheduler 40 and package 50, such buffers may be associated with various system agents coupled to package 50, such as chipsets, input/output (I/O) devices, peripherals and so forth.
When all cores are idle after rescheduling of tasks, the low-power non-zero package C-state may be entered by processor hardware logic. While shown with this particular implementation in the embodiment of
Referring now to
Specifically, as shown in
Different predictions may be made in different embodiments. In one embodiment, a Kalman filter model (KFM) may be used to generate the predictions. A KFM models a partially observed stochastic process with linear dynamics and linear observations, both subject to Gaussian noise. It is an efficient recursive filter that estimates the state of a dynamic system from a series of incomplete and noisy measurements. Based on a KFM, the CPU package activity as set forth in a number of predetermined patterns associated with idle-busy states of the package's cores (e.g., a percentage of a number of predetermined idle-busy patterns) is considered the observations of a real-valued stochastic process discretized in the time domain, denoted by y1:t=(y1 . . . yt). The hidden state of the process, x1:t=(x1 . . . xt), is also represented as a vector of real numbers. The linear stochastic difference equation in the KFM is:
x(t) = A x(t−1) + w(t−1), p(w) ~ N(0, Q), x(0) ~ N(x1|0, V1|0) [EQ. 4]
And the measurement equation is:
y(t) = C x(t) + v(t), p(v) ~ N(0, R) [EQ. 5]
The n×n transition matrix A in the difference Equation 4 relates the state at the previous t−1 time step to the state at the current step t, in the absence of either a driving function or process noise. Here n is the number of hidden states. In our task, m=n is the number of possible CPU activity states. x1|0, V1|0 are the initial mean and variance of the state, Q is the system covariance for the transition dynamic noises, and R is the observation covariance for the observation noises. The transition and observation functions are the same for all time, and the model is said to be time-invariant or homogeneous.
Using KFM, future values can be predicted given all the observations up to the present time. However, we are generally unsure about the future, and thus a best guess is computed, as well as a confidence level. Hence a probability distribution over the possible future observations is computed, denoted by P(Yt+h=y|y1:t), where h>0 is the horizon, i.e., how far into the future to predict.
Given the sequence of observed values (y1 . . . yt), predicting the new observation value means computing P(Yt+h=y|y1:t) for some horizon h>0 into the future. Equation 6 is the computation of a prediction about the future observations by marginalizing out the prediction of the future hidden state.
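Equation 6 itself does not appear in the text as reproduced here. For orientation only, the textbook form of marginalizing out the future hidden state, written with the quantities named in the surrounding paragraphs, is shown below. It is offered as an assumption about the general shape of the expression, not as the original equation.

```latex
P(Y_{t+h}=y \mid y_{1:t})
  = \int P(Y_{t+h}=y \mid X_{t+h}=x)\; P(X_{t+h}=x \mid y_{1:t})\; dx
```

In this form the first factor in the integrand comes from the measurement model of EQ. 5, and the second factor is the predicted distribution of the hidden state at horizon h.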
In the right part of the Equation, we compute P(Xt+h=x|y1:t) by the algorithm of the fixed-lag smoothing, i.e., P(Xt−L=x|y1:t), L>0, L is the lag. So before diving into the details of the algorithm, a fixed-lag smoothing in KFM is first introduced.
A fixed-lag Kalman smoother (FLKS) is an approach to perform retrospective data assimilation. It estimates the state of the past, given all the evidence up to the current time, i.e., P(Xt−L=x|y1:t), L>0, where L is the lag, e.g., we might want to figure out whether a pipe broke L minutes ago given the current sensor readings. This is traditionally called “fixed-lag smoothing”, although the term “hindsight” might be more appropriate. In the offline case, this is called (fixed-interval) smoothing; this corresponds to computing P(XT−L=x|y1:T), T≧L≧1.
In the prediction algorithm, there are h more forward and backward passes. The computation of the passes is similar to that in the smoothing process. The only difference is that in the prediction step the initial value of the new observation is null, which means y1:T+h=[y1:T ynull1 . . . ynullh]. The prediction algorithm estimates the values y1:T+h=[y1:T yT+1 . . . yT+h] by performing retrospective data assimilation on all the evidence up to the current time plus the null placeholders, y1:T+h=[ŷ1:T ynull1 . . . ynullh]. In practice, we consider using the previous steps as the prior data, for example, if h=1, then yT+1=(yT−1+yT)/2 rather than yT+1=null.
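The practical seeding rule mentioned above can be sketched as follows. The text gives the rule only for h=1; applying the same two-point average repeatedly for h>1 is an assumption made here for illustration, and the function name is hypothetical.

```python
def seed_future_observations(ys, h):
    """Extend the observation history with h placeholder values.

    Each placeholder is the mean of the two preceding values, so for h=1
    this gives y(T+1) = (y(T-1) + y(T)) / 2 as in the text. Repeating the
    rule for h > 1 is an assumption, not something the text specifies.
    """
    if len(ys) < 2:
        raise ValueError("at least two observations are needed")
    if h < 0:
        raise ValueError("horizon h must be non-negative")
    extended = list(ys)
    for _ in range(h):
        extended.append((extended[-2] + extended[-1]) / 2.0)
    return extended


# Hypothetical utilization history: 10% then 20%; seed one step ahead.
print(seed_future_observations([0.10, 0.20], h=1))
```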
Table 1 shows the pseudo code of the prediction algorithm.
In Table 1, Fwd and Back are abstract operators. For each Fwd (forward pass) operation of the first loop (for t=1:T), we first compute the inferred mean and variance by x(t|t−1)=A x(t−1|t−1) and V(t|t−1)=A V(t−1|t−1) A′+Q; then compute the error in the inference (the innovation), the variance of the error, the Kalman gain matrix, and the conditional log-likelihood of this observation by err(t)=y(t)−C x(t|t−1), S(t)=C V(t|t−1) C′+R, K(t)=V(t|t−1) C′ S(t)^−1, and L(t)=log(N(err(t); 0, S(t))), respectively; finally we update the estimates of the mean and variance by x(t|t)=x(t|t−1)+K(t) err(t) and V(t|t)=V(t|t−1)−K(t) S(t) K(t)′.
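The forward pass described above can be sketched for the scalar case, i.e., assuming a single hidden state so that A, C, Q and R are plain numbers and the matrix inverse becomes a division. This simplification, the function name, and the parameter names are assumptions made for illustration; Table 1 is not reproduced here.

```python
import math


def kalman_forward(ys, a, c, q, r, x_init, v_init):
    """Scalar forward (Fwd) pass over the observations ys.

    x_init and v_init play the role of x(1|0) and V(1|0). Returns the
    filtered means x(t|t), the filtered variances V(t|t), and the sum of
    the per-step conditional log-likelihoods.
    """
    means, variances = [], []
    loglik = 0.0
    x_pred, v_pred = x_init, v_init
    for y in ys:
        err = y - c * x_pred            # innovation err(t)
        s = c * v_pred * c + r          # innovation variance S(t)
        k = v_pred * c / s              # Kalman gain K(t)
        # log N(err; 0, s) for a scalar Gaussian
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + err * err / s)
        x_filt = x_pred + k * err       # x(t|t)
        v_filt = v_pred - k * s * k     # V(t|t)
        means.append(x_filt)
        variances.append(v_filt)
        # Inference for the next step: x(t+1|t) and V(t+1|t)
        x_pred = a * x_filt
        v_pred = a * v_filt * a + q
    return means, variances, loglik


# Hypothetical inputs: a static state (a=1, q=0) observed twice with unit noise.
means, variances, loglik = kalman_forward([1.0, 1.0], a=1.0, c=1.0, q=0.0,
                                          r=1.0, x_init=0.0, v_init=1.0)
print(means, variances)
```

With these inputs the filtered mean moves from the prior of 0 toward the observed value of 1, and the variance shrinks at each step.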
For each Back (backward pass) operation of the second loop (for t=T−1:−1:1), at first we compute the inference quantities by x(t+1|t)=A x(t|t) and V(t+1|t)=A V(t|t) A′+Q; then compute the smoother gain matrix by J(t)=V(t|t) A′ V(t+1|t)^−1; finally we compute the estimates of the mean, variance, and cross variance by x(t|T)=x(t|t)+J(t)(x(t+1|T)−x(t+1|t)), V(t|T)=V(t|t)+J(t)(V(t+1|T)−V(t+1|t))J(t)′, and V(t−1,t|T)=J(t−1) V(t|T), respectively, which are known as the Rauch-Tung-Striebel (RTS) equations.
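The backward pass can likewise be sketched for the scalar case. The function below takes already-filtered means and variances as input and applies the RTS mean and variance updates; the cross variance is omitted for brevity. The scalar simplification, the function name, and the sample inputs are assumptions made for illustration.

```python
def rts_backward(filt_means, filt_vars, a, q):
    """Scalar backward (Back) pass using the RTS equations.

    filt_means[t] and filt_vars[t] are the filtered x(t|t) and V(t|t).
    Returns the smoothed means x(t|T) and variances V(t|T).
    """
    if len(filt_means) != len(filt_vars):
        raise ValueError("means and variances must have the same length")
    smooth_means = list(filt_means)
    smooth_vars = list(filt_vars)
    # The last entry is already x(T|T), V(T|T); walk backward from T-1.
    for t in range(len(filt_means) - 2, -1, -1):
        x_pred = a * filt_means[t]              # x(t+1|t)
        v_pred = a * filt_vars[t] * a + q       # V(t+1|t)
        j = filt_vars[t] * a / v_pred           # smoother gain J(t)
        smooth_means[t] = filt_means[t] + j * (smooth_means[t + 1] - x_pred)
        smooth_vars[t] = filt_vars[t] + j * (smooth_vars[t + 1] - v_pred) * j
    return smooth_means, smooth_vars


# Hypothetical filtered values for a static state (a=1, q=0).
sm, sv = rts_backward([0.5, 2.0 / 3.0], [0.5, 1.0 / 3.0], a=1.0, q=0.0)
print(sm, sv)
```

For a static state with no process noise, the smoothed estimate at every earlier step equals the final filtered estimate, which is what this example shows.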
The computation as set forth in Table 1 can be complicated, e.g., there are matrix inversions in the T+1 step loop when computing the Kalman gain matrix in the Fwd operator and the smoother gain matrix in the Back operator. The computational complexity is O(TN^3), where T is the number of history observations and N is the number of activity states, because for a general N×N matrix, computing the matrix inverse by Gaussian elimination has O(N^3) complexity. However, in various embodiments the algorithm implementation can be simplified.
As shown in
Referring back to
Referring still to
At the conclusion of the idle period, the active period is initiated and thus the processor package may be controlled to be in a package active power state (block 140), such as the package C0 state, although the scope of the present invention is not limited in this regard. While shown with this particular implementation in the embodiment of
Referring now to
As shown in
Next, the core with the highest predicted utilization rate may be identified and a maximum utilization rate, U_max, may be set equal to this predicted value (block 320). Furthermore, a delay period, T*, may be set based on this predicted maximum utilization rate. For example, for a next operation interval (NOI), T* may be set equal to NOI×(100%-U_max), where U_max is expressed as a percentage.
Referring still to
This low power state may thus remain in effect until the package watchdog timer expires (block 360). At this time, the package may transition from its low power state to the active C0 state. Then on each core, any break events that have been buffered may be fetched (block 370). For example, during the delay period in which the package is in an idle state, one or more devices coupled to the package may have buffered break events destined to the package. Thus, upon waking of the individual cores, these break events may be fetched. Accordingly, at block 380, the break events may be serviced in their original timing. That is, the buffered break events may be serviced in the order in which they were buffered (e.g., on a first in first out basis). After servicing of any break events, any tasks scheduled for the core may then be performed according to their original timing. That is, after handling the break events any tasks scheduled to each core may be processed in the order of their original scheduling. While shown with this particular implementation in the embodiment of
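The buffering and first-in-first-out servicing described above can be sketched as a small simulation. Break events that arrive before the delay period ends are held and then serviced in arrival order when the watchdog timer expires; later events are serviced on arrival. Service durations are ignored, and the function name and event names are hypothetical.

```python
from collections import deque


def service_order(events, delay_us):
    """Return (service_time_us, name) pairs in the order events are serviced.

    `events` is a list of (arrival_us, name) pairs within one operation
    interval. Events arriving before `delay_us` are buffered and serviced
    first-in-first-out at `delay_us`; the rest are serviced on arrival.
    """
    buffered = deque()
    late = []
    for arrival_us, name in sorted(events, key=lambda e: e[0]):
        if arrival_us < delay_us:
            buffered.append((arrival_us, name))   # held during the delay period
        else:
            late.append((arrival_us, name))       # package is already active
    serviced = []
    while buffered:
        _, name = buffered.popleft()              # FIFO: oldest event first
        serviced.append((delay_us, name))
    serviced.extend(late)
    return serviced


# Hypothetical break events within a 500 us interval and a 425 us delay period.
events = [(200, "timer"), (10, "nic_irq"), (450, "disk_irq")]
for when, name in service_order(events, delay_us=425):
    print(when, name)
```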
In contrast to embodiments such as described above, using a conventional OS scheduler, tasks and break events are serviced immediately, causing cores and the package to frequently enter and exit idle states. Since short idle periods do not permit use of long-latency, low-power deep package C-states, only the package C1 state can be used. Also, with a conventional OS scheduler, the cores' busy times are not overlapped, causing a package to remain in C0 while only a single core is busy. Assume that, using conventional scheduling, of a 500 μs period the total time spent in the package C0 state is 125 μs and the total time in the package C1 state is 375 μs. Instead, using an embodiment of the present invention that predicts a first core's utilization of 10% and a second core's utilization of 15%, the maximum core utilization (U_max) is 15%. Accordingly, the determined delay period T* may be set as follows:
T*=interval time×(100%−U_max) [EQ 7]
where U_max is expressed as a percentage (i.e., 15% is expressed as 15). In this case, T*=500*(100%-15%)=425 μs, and the package in the coming 500 μs will get 425 μs of continuous idle time, which can enable a deeper low power state such as the package C3 state. After the 425 μs watchdog timer expires, the package returns to the C0 state and each core will process all tasks and break events in their original timing. Example processor power specifications and power consumption using conventional scheduling and in accordance with one embodiment of the present invention, for an example 15% system load level, are shown in Table 2.
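The delay-period calculation of EQ. 7 can be sketched as follows, using the two predicted utilizations from the example above. The function name and parameter names are hypothetical.

```python
def delay_period_us(interval_us, predicted_utilizations_pct):
    """Return the delay period T* in microseconds per EQ. 7.

    T* = interval time x (100% - U_max), where U_max is the highest
    predicted core utilization, expressed as a percentage (15% -> 15).
    """
    if not predicted_utilizations_pct:
        raise ValueError("at least one core prediction is required")
    u_max = max(predicted_utilizations_pct)
    if not 0 <= u_max <= 100:
        raise ValueError("utilization must be between 0 and 100 percent")
    return interval_us * (100 - u_max) / 100


# Two cores predicted at 10% and 15% utilization over a 500 us interval.
t_star = delay_period_us(500, [10, 15])
print(f"T* = {t_star:.0f} us")
```

Taking the maximum rather than the average leaves enough active time in the interval for the busiest core to complete its predicted work.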
Referring now to
In contrast, with reference to schedule B, scheduled in accordance with an embodiment of the present invention, a delay period corresponding to time duration 430 is provided at a beginning of the operation interval. Because of the extended duration of this delay period 430, which may be 425 μs, the package may be placed in a deeper low power state, e.g., a package C3 state (or even deeper C-state), thus enabling greatly enhanced power savings, on the order of approximately 60% more than that of schedule A. At the conclusion of this delay period 430, the package is placed into the active state and thus for the remaining duration of the operation interval, the package is in the package C0 state for time duration 440. While not shown in
For purposes of example, Table 3 below shows package C-states and their descriptions, along with the estimated power consumption in these states, with reference to an example processor having a thermal design power (TDP) of 130 watts (W). Of course it is to be understood that this is an example only, and embodiments are not limited in this regard.
While described herein as OS-based scheduling, embodiments are not limited in this regard. That is, in other implementations, software, firmware or hardware may be adapted on a package basis or at another location within a system, such as a power management unit (PMU) to enable dynamic rescheduling of tasks within an operating interval such that extended, continuous idle periods within each operating interval may be realized, enhancing the ability to enter into extended and deeper low power states on a package basis. Furthermore, while shown in
Embodiments may be implemented in many different system types. Referring now to
Still referring to
Furthermore, chipset 590 includes an interface 592 to couple chipset 590 with a high performance graphics engine 538, by a P-P interconnect 539. In turn, chipset 590 may be coupled to a first bus 516 via an interface 596. As shown in
Embodiments may be implemented in code and may be stored on a storage medium having stored thereon instructions which can be used to program a system to perform the instructions. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of the present invention.