The present invention is related to the cooling of electronics, such as computer equipment in a data center. More specifically, this invention is related to methods and apparatus for optimizing the energy efficiency of cooling such electronic equipment when the temperature dependence of the electronics' power dissipation is considered in conjunction with the power dissipated in cooling equipment such as a chiller. In particular, when a chiller is used in conjunction with a free-cooling heat exchanger, the invention provides apparatus and methods for choosing the temperature of coolant provided to the electronics that minimizes the overall power consumption of the system.
In a manner similar to an electric light bulb that gives off heat when it is powered on, electronics such as computer equipment dissipate electrical power as heat. In many cases, to prevent the electronics from overheating, that heat must be removed by a cooling system, typically using a cooling fluid such as air or water that is passed through the electronics. The current invention applies to all types of cooling fluids and many types of electronic equipment, but for specificity, consider the example of a liquid-cooled computer. Referring to the traditional system 100 shown in
First, the power consumption of the electronics (or aggregate machine-room heat load)
P0≡PA+PB+ . . . +PX (1)
is transferred from the computers to a liquid coolant 106, which enters each computer at a cold temperature T0, and flows under pressure created by pump 108 in a closed loop of liquid-coolant pipes. The pump 108 creates an additional power consumption (or heat load) P1.
Second, in chiller 110, the combined heat load P0+P1 is transferred from the liquid coolant 106 to a refrigerant 112, which flows in a closed loop of refrigerant pipes, wherein evaporation of the refrigerant 112 occurs during its absorption of heat load P0+P1, and compression of the refrigerant occurs in a compressor 114. The compressor 114 creates an additional heat load α(P0+P1) that is proportional to the incident heat load P0+P1. The proportionality factor α is a characteristic of the compressor 114. The temperature T0 entering computers 102 is maintained by feedback circuitry comprising a temperature-measurement device 116 and a processing device 118 that compares the measured value T0 to a user-defined set-point temperature (T0)Set, and commands the chiller 110 to drive the difference T0−(T0)Set to zero by modulation of the compressor 114.
Third, heat load (1+α)(P0+P1) is transferred from the refrigerant 112 to condenser water 120, which flows under pressure created by pump 122 in a closed loop of condenser-water pipes, with the transfer of heat (1+α)(P0+P1) causing condensation of the refrigerant 112. The pump 122 creates an additional heat load P2.
Fourth, by means of a cooling tower 124, which typically contains an air mover 126 that produces an additional heat load P3, heat load (1+α)(P0+P1)+P2+P3 is transferred from the condenser water 120 to the outside air.
The “useful” electrical power consumed in the machine room 104 by computers 102 is P0, whereas the “overhead” power consumed by the cooling equipment in system 100 is, by inspection of
PCool(100)≡P1+α(P0+P1)+P2+P3. (2)
Thus, the total electrical power consumed by system 100 is
PTotal(100)=P0+PCool(100). (3)
In equations (2) and (3), superscript “(100)” indicates that the symbols apply to system 100.
The dominant term on the right-hand side of equation (2) is the compressor power α(P0+P1), where the chiller-overhead fraction α is typically 0.10 to 0.20. Consequently, to save some or all of the compressor power α(P0+P1), the concept of “free cooling” has been developed in prior art, as described, for example, in “Free Cooling Using Water Economizers”, by Susanna Hanson and Jeanne Harshaw, TRANE Engineers Newsletter, Volume 37-3, September 2008, which is incorporated herein in its entirety by reference. It is worth noting that in the United States, the overhead fraction α is often expressed in kW/ton, where a ton of cooling is 3.517 kW. Thus, for example, 0.53 kW/ton corresponds to the dimensionless value α=0.53/3.517=0.15.
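The unit conversion described above can be illustrated with a short sketch. The function name and constant name are hypothetical and chosen only for illustration; the conversion factor of 3.517 kW per ton is the one stated in the text.

```python
# Illustrative sketch: convert a chiller rating in kW/ton to the
# dimensionless overhead fraction alpha used in equation (2).
KW_PER_TON = 3.517  # one ton of cooling expressed in kW (value from the text)


def overhead_fraction(kw_per_ton: float) -> float:
    """Return alpha = (electrical kW consumed) / (thermal kW removed)."""
    return kw_per_ton / KW_PER_TON


alpha = overhead_fraction(0.53)
print(f"alpha = {alpha:.2f}")
```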
Referring to
β≡Fraction of heat load P0+P1 that coolant 206 rejects to free-cooling water 232 via the free-cooling heat exchanger 230. (4)
Free-cooling water 232 flows under pressure created by a pump 234 in a closed loop of pipes. The pump 234 creates an additional heat load P6. By means of a cooling tower 238, which typically contains an air mover 240 that produces additional heat load P7, a heat load β(P0+P1)+P6+P7 is transferred from the free-cooling water 232 to outside air.
System 200 is further distinguished from system 100 by the addition of feedback circuitry comprising a device 242 to measure temperature T3 and to communicate this measurement to processing device 218. System 200 is further distinguished by the addition of electrical feedback from processing device 218 to pump 234 to enable speed modulation of the pump in cases where T3<T0, so that the liquid coolant does not become too cold, and also to enable powering off the pump in cases where T3>T1, to prevent undesired heating of the liquid coolant as it passes through the free-cooling heat exchanger 230.
Because of the heat-load rejection β(P0+P1) from liquid coolant 206 by means of heat exchanger 230, the amount of remaining heat to be rejected to the refrigerant 212 in system 200 is (1−β)(P0+P1). Thus, the incident heat load on the compressor 214 in system 200 is a factor of 1−β smaller than that on the compressor 114 in system 100. Because we assume the compressors 114 and 214 to be otherwise identical, the power consumption of compressor 214 in system 200 is likewise reduced by the factor 1−β compared to compressor 114 in system 100. That is, system 200's compressor 214 consumes only (1−β)α(P0+P1).
In comparing systems 100 and 200 to assess how much power is saved by “free cooling”, the useful electrical power P0 consumed in the machine room 204 of system 200 is naturally assumed to be the same as that consumed in the machine room 104 of system 100. This assumption merely states that, to make a fair comparison, the computers 202 in system 200 are identical to the computers 102 in system 100, and are performing identical computations.
The “overhead” power consumed by cooling equipment in system 200 is
PCool(200)≡P1+(1−β)α(P0+P1)+P4+P5+P6+P7. (5)
Comparing equations (2) and (5) yields the power-saving advantage of system 200 over system 100:
ΔP≡PCool(100)−PCool(200)=βα(P0+P1)+(P2+P3)−(P4+P5+P6+P7) (6)
The power consumed by pumps (P1, P4, P6) and air movers (P5, P7) is typically small compared to that consumed by the refrigerant-loop compressor, so the first term on the right-hand side of equation (6), βα(P0+P1), is the dominant term. Moreover, if the pumps 122, 222, and 234 as well as the cooling-tower air movers 126, 226, and 240 are controlled so that power consumed is proportional to incident heat load, then
P4=(1−β)P2; P6=βP2 (7)
and
P5=(1−β)P3; P7=βP3. (8)
Assuming this type of control, by combination of equations (7) and (8),
P4+P5+P6+P7=P2+P3, (9)
whence, substituting equation (9) into equation (6), the second and third terms on the right-hand side of equation (6) cancel, and the power saving from free cooling becomes simply
ΔP≡PCool(100)−PCool(200)=βα(P0+P1). (10)
Equation (10) implies that to maximize the amount of saved power ΔP, β=1 is desired, whereby the entire incident heat load P0+P1 on free-cooling heat exchanger 230 is rejected thereby. In such a scenario, the chiller 210, pump 222, and cooling tower 224 may be turned off. In fact, if β=1 can be achieved at all times, then the chiller 210, pump 222, and cooling tower 224 are superfluous and need not be purchased. This is the most aggressive objective of the free-cooling paradigm.
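The bookkeeping of equations (2) and (5) through (10) can be checked numerically with a minimal sketch such as the following. The function names and the numerical values are hypothetical; the proportional-control relations are those of equations (7) and (8).

```python
# Illustrative check that, under the proportional control of equations (7)
# and (8), the saving of equation (6) reduces to equation (10).

def cooling_power_100(p0, p1, p2, p3, alpha):
    """Overhead power of system 100, equation (2)."""
    return p1 + alpha * (p0 + p1) + p2 + p3


def cooling_power_200(p0, p1, p2, p3, alpha, beta):
    """Overhead power of system 200, equation (5), with P4..P7 from (7), (8)."""
    p4, p6 = (1.0 - beta) * p2, beta * p2   # condenser-side and free-cooling pumps
    p5, p7 = (1.0 - beta) * p3, beta * p3   # cooling-tower air movers
    return p1 + (1.0 - beta) * alpha * (p0 + p1) + p4 + p5 + p6 + p7


# Hypothetical example values, in kW.
p0, p1, p2, p3 = 1000.0, 20.0, 15.0, 10.0
alpha, beta = 0.15, 0.6

saving = cooling_power_100(p0, p1, p2, p3, alpha) - cooling_power_200(
    p0, p1, p2, p3, alpha, beta
)
print(saving, beta * alpha * (p0 + p1))  # both sides of equation (10)
```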
Still referring to
Consequently, if the chiller 210 is turned off or absent, then β=1, so, according to equation (11), T2=T0. But T2 is weather dependent, because the temperature T3 of water returning from the cooling tower 238 depends on the wet-bulb temperature TWB of ambient outside air, which depends on geographical location and season. The wet bulb temperature TWB currently never exceeds 31° C. anywhere on earth, but may be higher in the future due to climate change, as reported by Steven C. Sherwood and Matthew Huber in “An adaptability limit to climate change due to heat stress”, Proceedings of the National Academy of Sciences, May 2010, (0913352107), which is included herein in its entirety by reference. Temperature T3 exceeds TWB by an amount ΔTCT, sometimes called the “cooling-tower approach temperature”, which is a function of several variables (see the aforementioned TRANE Engineers Newsletter) but typically in the range of 1 to 5° C. Thus,
T3=TWB+ΔTCT. (12)
Moreover, temperature T2 exceeds T3 by an amount ΔTHX, sometimes called the “heat-exchanger approach temperature”, which is typically in the range of 1 to 2° C. Thus,
T2=TWB+ΔTCT+ΔTHX. (13)
Consequently, to achieve year-round free cooling anywhere on earth under current climate conditions, the water-inlet temperature T0 to computers 202 may need to be as high as
(T0)max≡(TWB)max+(ΔTCT)max+(ΔTHX)max=31+5+2=38° C. (14)
In contrast,
(T0)min≡16° C. (15)
is just warm enough to avoid condensation of air-borne moisture for “Class 1” machine-room conditions, as specified by the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) in “Thermal Guidelines for Data Processing Environments, 2nd edition”, ISBN 978-1-933742-46-5, which is included herein in its entirety by reference. Specifically, ASHRAE's “Recommended” Class 1 envelope in psychrometric space is bounded above by a 15° C. dew-point, which implies, with a 1° C. margin of safety, that T0=16° C. is the minimum safe temperature at which air-borne water is guaranteed not to condense inside the computers 202.
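Equations (13) through (15) can be illustrated with a brief sketch. The function names and the example weather values are hypothetical; the relation T2=TWB+ΔTCT+ΔTHX is equation (13), and the limits are those of equations (14) and (15).

```python
# Illustrative sketch of equations (13)-(15): coolant temperature reachable
# by free cooling alone, and whether a chiller must make up the difference.

T0_MIN = 16.0  # deg C, condensation limit of equation (15)


def free_cooled_temperature(t_wb, dt_ct, dt_hx):
    """Equation (13): wet-bulb temperature plus the two approach temperatures."""
    return t_wb + dt_ct + dt_hx


def chiller_needed(t0_set, t_wb, dt_ct, dt_hx):
    """True when free cooling alone cannot reach the set-point t0_set."""
    return free_cooled_temperature(t_wb, dt_ct, dt_hx) > t0_set


# Worst case of equation (14): 31 + 5 + 2 deg C.
t0_max = free_cooled_temperature(31.0, 5.0, 2.0)
print(t0_max)

# Hypothetical mild day: wet bulb 18 deg C, approaches 3 and 1.5 deg C.
print(chiller_needed(20.0, 18.0, 3.0, 1.5))  # set-point below the free-cooled T2
print(chiller_needed(25.0, 18.0, 3.0, 1.5))  # set-point above the free-cooled T2
```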
Less aggressive than equation (14), a more typical free-cooling option is to choose a moderate set-point value of T0, and to retain the chiller 210, pump 222, and cooling tower 224 so that, when weather precludes the free-cooled temperature T2 from reaching the desired set-point T2=(T0)Set, the chiller can make up the difference. Even in such systems, with a chiller present as in system 200, users of computers and other electronic equipment urge manufacturers to design their equipment to allow high inlet temperature T0 so that, despite adverse weather conditions, the fraction β of the heat load removed by the heat exchanger 230 remains high, and thus the power-saving ΔP, given by equation (10), remains large. This strategy says that maximizing the amount of free cooling is always better—even if it means higher water inlet temperature T0 to the computers 202.
The current invention calls this strategy into question, purely on the basis of power savings. It explains why maximizing the amount of free cooling, regardless of T0, does not necessarily save energy, but may actually waste energy. Based on this insight, the invention provides, for a system like system 200, an innovative method and apparatus to determine the value of T0 that actually provides the greatest conservation of energy. Depending on the weather-related temperature T3 and the computational state C, the power-optimal solution may be for the chiller 310 to provide some or even all of the cooling, despite the presence of the free-cooling heat exchanger 330.
For computers and other electronic equipment, power dissipation depends on temperature. In particular, for CMOS (Complementary Metal-Oxide-Semiconductor) circuits, power is an increasing function of junction temperature Tj of CMOS transistors, due to a component of leakage power called sub-threshold leakage, which depends exponentially on temperature, as described, for example, in Leakage in Nanometer CMOS Technologies, edited by Siva G. Narendra and Anantha Chandrakasan, published by Springer, 2010, ISBN-13 978-1-4419-3826-8, which is included herein in its entirety by reference. As a result, power dissipation in CMOS circuits is an increasing function of the inlet temperature T0 of the coolant, because increasing T0 by ΔT also increases Tj directly by ΔT, and even slightly more, because the additional sub-threshold leakage power attending the higher Tj increases the cooling load, leading to a small additional temperature burden through the cooling path.
Combining the understanding that power dissipation depends on temperature with the foregoing discussion relating to
According to this invention, the above dichotomy suggests optimization: by striking a balance between the needs of the electronics and the capabilities of the cooling system, it is possible to find an optimum temperature T*0 that minimizes the aggregate power consumption of the electronic equipment and the cooling equipment combined. The invention specifies methods and apparatus to choose, for a given system, the optimum value T*0 of the set-point temperature (T0)Set, despite the fact that this optimum is a function of the weather as well as the computational work-load of the computers 202. Further, the invention shows by means of a mathematical model that, for certain conditions, the optimum temperature T*0 is likely to be lower than what is suggested by prior-art that focuses solely on PCool, but ignores the temperature dependence of P0.
takes on several values, thereby showing how the downshift boundary depends on
This invention exploits the fact that the power consumption of computers increases when the temperature T0 of the cooling fluid increases, whereas the power consumption of the associated cooling equipment decreases as T0 increases, because larger T0 allows for more “free cooling”. This dichotomy suggests that, for given conditions, an optimum, power-minimizing value of T0 exists, denoted T0*, a value that will minimize overall energy consumption. The invention specifies how this optimum may be found as a function of weather conditions and the computational state of the computers being cooled.
Referring to
The total power consumption PTotal for system 300 is the sum of two terms, as follows,
PTotal(T0,T3,C)=P0(T0,C)+PCool(T0,T3,C), (16)
where the parenthetical lists in equation (16) denote parameters upon which the various components of power depend, and the following definitions apply:
PCool≡P1+(1−β)α(P0+P1)+P4+P5+P6+P7
In the latter equality, TWB is the wet-bulb temperature at cooling tower 338, and ΔTCT is the cooling-tower “approach temperature”, as discussed in connection with equation (12). Thus T3 is dependent on weather.
The optimal value T*0 of the coolant inlet temperature T0 is the value that minimizes the total power PTotal in equation (16). Thus, T*0 is a function of the weather-related temperature T3 and the computational state C. With the goal of determining this functional dependence T*0(T3, C), system 300 comprises several elements missing from the prior-art system 200.
First, system 300 comprises a storage device 344 in which measured and computed data may be stored and retrieved by processing device 318.
Second, system 300 comprises a power-measurement device 346 that measures the aggregate electrical power P0 being consumed by all the computers 302 in machine room 304.
Third, system 300 comprises one or more power-measurement devices 348 that collectively measure the aggregate electrical power PCool consumed by all the cooling equipment in system 300:
PCool≡P1+(1−β)α(P0+P1)+P4+P5+P6+P7. (17)
If measuring all terms of PCool is too expensive, a possible (though inexact) alternative is to measure chiller power only, PCool≈(1−β)α(P0+P1), because chiller power is typically the dominant term on the right-hand side of equation (5). In either case, the power measurements are saved in the storage device 344.
Fourth, referring to
Lines (402) and (404) of calibration algorithm 400 theoretically execute a loop on T3, starting at a minimum value (T3)min and proceeding stepwise to a maximum value (T3)max. The loop variable corresponding to T3 is denoted i. This representation of the loop on T3 as a neat, orderly progression from (T3)min to (T3)max in equal steps is theoretical rather than actual. In reality, because T3 depends on weather, which cannot be algorithmically dictated, iterations of this loop may actually occur in random order, and at weather-dependent values of T3, over an extended time after system 300 is first installed: when the weather changes to produce a value of T3 substantially different from values for which algorithm 400 has already been run, an additional iteration of this loop is executed. It may take some time (a year or longer) for T3 to run its slowly-varying, seasonal course from extreme to extreme, so iterations of the “for loop” represented by line 402 of algorithm 400 may actually be executed sporadically over an extended period. Thus, in a practical implementation of algorithm 400, line 404 will cause sensor 342 to read temperature T3, and the remainder of the iteration of the “for i” loop will proceed if and only if a similar value of T3 has never been encountered in previous iterations of the loop. Such a strategy prevents needless, redundant experimentation.
Lines (406) and (408) of calibration algorithm 400 execute a loop on “C”, the variable representing the computational state of the computers 302 in machine room 304. The precise definition of the “computational state” Cj depends on the types of computers 302 contained in machine-room 304, as well as the types of computational tasks assigned to them. For the embodiment of the invention that utilizes calibration algorithm 400, it is assumed that:
Line (408) of calibration algorithm 400 directs the computers 302 to execute the computational task designated by Cj. For most computers, one of these computational tasks, denoted C0, will be “do nothing”, in which the computers are powered on but idle. Computers often spend considerable time in the idle state. For CMOS-based computers in the idle state, sub-threshold leakage, which is exponentially dependent on absolute temperature, is the dominant form of leakage, and may comprise a major portion of power dissipation. Consequently, for the idle state, providing a low coolant-inlet temperature T0 is likely to save considerably more power in the computers 302 than it costs in the cooling equipment, even if weather forces the chiller 310 to be used to achieve the lower T0. For other computational tasks in CMOS-based computers, where temperature-independent transistor-switching power may dominate P0, the trade-off between P0 and PCool is complex, and minimization of PTotal is best done experimentally; for example, in the manner prescribed by calibration algorithm 400.
Line (410) of calibration algorithm 400 initializes, to the value (T0)min specified by equation (15), each element of a two-dimensional array T*0(i, j) that will ultimately hold the optimum values of T0.
Line (412) of calibration algorithm 400 initializes the variable P*, which is defined mathematically as
where
Line (414) of calibration algorithm 400 executes a loop on the set-point coolant-inlet temperature (T0)set, from a minimum of (T0)min to a maximum of (T0)max-allowed, in increments of ΔT0, where (T0)min is given by equation (15), (T0)max-allowed is the maximum coolant-inlet temperature allowed by computers 302, and ΔT0 is chosen at will, but is perhaps 0.5 or 1.0° C.
Line (416) of calibration algorithm 400 pauses program execution while the system comes to thermal equilibrium, such that the measured temperature (T0)meas is within a tolerance ε of the set-point value (T0)set.
Line (418) of calibration algorithm 400 performs a first reading process to obtain the time-averaged power
Line (420) of calibration algorithm 400 performs a second reading process to obtain the time-averaged power
is computed. Time averaging is expected to be less important for PCool(t) than for P0(t) because PCool(t) is typically slowly varying. Nevertheless, the averaging process (20) makes the measurement more robust.
Line (422) of calibration algorithm 400 sums the time-average computer power consumption
Line (424) of calibration algorithm 400 tests, for the current values of T3 and C, namely T3[i] and C[j], whether the power consumption
The net result of calibration algorithm (400) is the two-dimensional array T0*[i][j], which is stored as a look-up table in the storage device 344, and which specifies the optimal coolant temperature that should be used for various combinations of weather conditions and computational tasks. The index i refers to the value of T3 stored in the ith element of the one-dimensional array T3[i], which is also stored in storage device 344. Likewise the index j refers to the jth element of the one-dimensional array of computational states C[j], which is also stored in storage device 344 in a way that can be usefully identified later. For example, C[0] may hold the string “idle”, indicating for later usage that C[0] refers to the idle state of the computers 302.
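The innermost search of calibration algorithm 400, corresponding to the loop on (T0)set for one fixed pair of T3[i] and C[j], may be sketched as follows. This is only an illustration under stated assumptions: the callables apply_setpoint, read_p0, and read_pcool are hypothetical stand-ins for commanding the chiller, waiting for thermal equilibrium, and reading the time-averaged powers; the simulated plant at the bottom is invented purely so that the sketch can run, and is not a model taken from this specification. The sketch initializes the running minimum to infinity rather than to 9E99, which serves the same purpose.

```python
import math

# Illustrative sketch of the set-point sweep for one (T3[i], C[j]) pair.

def optimal_setpoint(setpoints, apply_setpoint, read_p0, read_pcool):
    """Return (T0*, P*) minimizing measured total power over the set-points."""
    best_t0, best_power = None, float("inf")
    for t0 in setpoints:
        apply_setpoint(t0)                 # hypothetical: command and settle
        total = read_p0() + read_pcool()   # time-averaged P0 plus PCool
        if total < best_power:             # keep the best set-point seen so far
            best_power, best_t0 = total, t0
    return best_t0, best_power


def setpoint_grid(t_min, t_max, step):
    """Evenly spaced set-points from t_min to t_max inclusive."""
    n = int(round((t_max - t_min) / step))
    return [t_min + k * step for k in range(n + 1)]


# --- Simulated plant (hypothetical, for demonstration only) ---
state = {"t0": 16.0}

def apply_setpoint(t0):
    state["t0"] = t0

def read_p0():
    # Computer power rising exponentially with coolant temperature.
    return 1000.0 * (0.8 + 0.2 * math.exp(0.03 * (state["t0"] - 16.0)))

def read_pcool():
    # Chiller power falling linearly to zero 15 deg C above the minimum.
    return 150.0 * max(0.0, 1.0 - (state["t0"] - 16.0) / 15.0)


best_t0, best_power = optimal_setpoint(
    setpoint_grid(16.0, 36.0, 1.0), apply_setpoint, read_p0, read_pcool
)
print(best_t0, round(best_power, 1))
```

In a full calibration, the returned value would be stored in the element T0*[i][j] of the look-up table.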
After calibration algorithm 400 has been run to characterize system 300, during subsequent operation of the machine room 304, an operational algorithm 450 is executed by processing device 318. The operational algorithm 450 comprises the following steps:
It should be noted that equation (21) has the correct limiting behavior:
if T3=T3[i]:T*0(T3,Cj)=T*0[i][j]
if T3=T3[i+1]:T*0(T3,Cj)=T*0[i+1][j] (22)
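A look-up consistent with the limiting behavior of equation (22) may be sketched as follows. The sketch assumes linear interpolation in T3 between adjacent table entries, and assumes clamping to the end entries when T3 falls outside the calibrated range; both are assumptions made for illustration, as are the function and argument names.

```python
from bisect import bisect_right

# Illustrative look-up of the optimal set-point for a measured T3 and a
# fixed computational state C[j], by linear interpolation (an assumption).

def interpolate_setpoint(t3, t3_grid, t0_star_column):
    """t3_grid: ascending calibrated T3 values; t0_star_column[i] = T0*[i][j]."""
    if t3 <= t3_grid[0]:
        return t0_star_column[0]       # below calibrated range: clamp (assumed)
    if t3 >= t3_grid[-1]:
        return t0_star_column[-1]      # above calibrated range: clamp (assumed)
    i = bisect_right(t3_grid, t3) - 1  # index with t3_grid[i] <= t3 < t3_grid[i+1]
    w = (t3 - t3_grid[i]) / (t3_grid[i + 1] - t3_grid[i])
    return (1.0 - w) * t0_star_column[i] + w * t0_star_column[i + 1]


# Hypothetical table for one computational state.
t3_grid = [10.0, 15.0, 20.0]
t0_star = [16.0, 20.0, 28.0]
print(interpolate_setpoint(15.0, t3_grid, t0_star))   # reproduces a table entry
print(interpolate_setpoint(17.5, t3_grid, t0_star))   # midway between entries
```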
Referring now to
Rather, system 500 merely assumes that during operation of the computers 302 in machine room 304, the time-averaged powers
Line (610) of operational algorithm 600, analogous to line (410) of calibration algorithm (400), initializes the variable T*0 to the value (T0)min. T*0 will ultimately hold the optimum value of T0 for current weather conditions and computational status of computers 502.
Line (612) of operational algorithm 600, analogous to line (412) of calibration algorithm (400), initializes the variable P*, which is defined mathematically as
That is, over the range of values of coolant-inlet temperature T0 to be tested, P* is the minimum total electrical power consumed under current conditions. Initialization of P* to the huge number 9E99, a common programming artifice, ensures that, in the subsequent search for minimum power, the initialized value exceeds the first actual value, for reasons described presently.
Line (614) of operational algorithm 600, analogous to line (414) of calibration algorithm 400, executes a loop on the set-point coolant-inlet temperature (T0)set, from a minimum of (T0)min to a maximum of (T0)max-allowed, in increments of ΔT0, where (T0)min is given by equation (15), (T0)max-allowed is the maximum coolant-inlet temperature allowed by computers 502, and ΔT0 is chosen at will, but is perhaps 0.5 or 1.0° C.
Line (616) of operational algorithm 600, analogous to line (416) of calibration algorithm 400, pauses program execution while the system thermally stabilizes, such that the measured temperature (T0)meas is within a tolerance ε of the set-point value (T0)set.
Line (618) of operational algorithm 600, analogous to line (418) of calibration algorithm 400, performs a first reading process to obtain the time-averaged power
Line (620) of operational algorithm 600, analogous to line (420) of calibration algorithm 400, performs a second reading process to obtain the time-averaged power
Line (622) of operational algorithm 600, analogous to line (422) of calibration algorithm 400, sums the time-average computer power consumption
Line (624) of operational algorithm 600, analogous to line (424) of calibration algorithm 400, tests whether the power consumption
Line (632) of operational algorithm 600 closes the loop on (T0)set. At this point in the algorithm, T*0 holds the optimal value of (T0)set; that is, the value found to produce minimum total power
Line (634) of operational algorithm 600 assigns the optimal value T*0 as the set-point value (T0)Set to be used during subsequent operation. Because coolant temperature T0 entering computers 502 is controlled to (T0)Set, it will thus be controlled to the optimal value T*0.
The optimal coolant temperature T*0 will be valid as long as weather conditions and the computational conditions of the computers 502 remain similar to those that existed during execution of algorithm 600. Recognizing that these conditions, particularly computational conditions, may frequently change, lines (610) through (634) of the operational algorithm 600 must be re-executed from time to time, to check whether the current set-point temperature (T0)Set is still optimal. Two strategies may be pursued:
(a) Repetition of lines (610) through (634) of algorithm 600 may be done periodically, at regular intervals. This option is illustrated conceptually by the pseudo-code shown in
(b) Repetition of lines (610) through (634) of algorithm 600 may be triggered by a change in ongoing measurements of
Many features of the two embodiments discussed in the foregoing are illuminated by a mathematical model that captures the essence of the optimization involved in choosing a value of coolant temperature T0 for computers such as 302 and 502 in
Referring to
P0=Pref{(1−λ)+λexp[κ(T−Tref)]}, (24)
or, equivalently,
In equations (24) and (25), for a given computational state, Pref, λ, and κ are positive constants, T is a representative junction temperature, and Tref is an arbitrary reference temperature at which P0=Pref when T=Tref; that is, P0 would equal Pref if the junction temperature T of all transistors in system 500 were held at the reference temperature Tref. For example,
Tref=(T0)min≡16° C. (26)
is a useful choice, because then P0=Pref when T=(T0)min≡16° C., where the significance of (T0)min≡16° C. is discussed above in connection with equation (15).
In equation (24), λ is the parameter that controls the fraction of computer power that is exponentially dependent on temperature. This fraction is λ at T=Tref, but the fraction is greater than λ when T>Tref due to the exponential growth. Specifically, for CMOS-based electronics, the second term in equation (24), Prefλexp[κ(T−Tref)], represents the power consumed by CMOS-transistor sub-threshold leakage, whereas the first term in equation (24), Pref (1−λ), represents the sum of all types of power not dependent on temperature, including CMOS-transistor switching power as well as CMOS-transistor leakage components caused by tunneling currents, such as gate leakage.
Consequently, the fraction λ is a function of computational state. For example, λ is relatively large when the transistor is off, because most transistor power in the off state is sub-threshold leakage power. However, λ becomes considerably smaller when the transistor is active, because active power and gate leakage then dominate.
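The behavior of equation (24) can be illustrated with a minimal sketch. The function names and the parameter values are hypothetical; the formula is equation (24), with Tref=16° C. as in equation (26).

```python
import math

# Illustrative evaluation of equation (24): computer power as a function of
# a representative junction temperature T.

def computer_power(t, p_ref, lam, kappa, t_ref=16.0):
    """P0 = Pref * {(1 - lambda) + lambda * exp[kappa * (T - Tref)]}."""
    return p_ref * ((1.0 - lam) + lam * math.exp(kappa * (t - t_ref)))


def leakage_fraction(t, lam, kappa, t_ref=16.0):
    """Share of P0 that is temperature-dependent (sub-threshold leakage)."""
    leak = lam * math.exp(kappa * (t - t_ref))
    return leak / ((1.0 - lam) + leak)


# Hypothetical parameters: Pref = 1000 W, lambda = 0.3, kappa = 0.02 per deg C.
print(computer_power(16.0, 1000.0, 0.3, 0.02))  # equals Pref at T = Tref
print(leakage_fraction(16.0, 0.3, 0.02))        # equals lambda at T = Tref
print(leakage_fraction(60.0, 0.3, 0.02))        # exceeds lambda when T > Tref
```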
Equation (25) may be written in the mathematically equivalent form
in which the temperature difference T−Tref is decomposed into (T−T0)+(T0−Tref).
The temperature decomposition in equation (27) is useful because the temperature difference T−T0 may be written as the product of two factors: first, the power
of one of the CMOS packages 802, and second, the thermal resistance from the coolant flowing in cold head 808 to a typical transistor on the CMOS package 802. That is,
Thermal resistance is a function only of the geometry and the thermal conductivities of the components comprising the thermal path from coolant to transistor junction.
Substituting equation (28) into equation (27) yields
Equation (30) is a nonlinear algebraic equation in the unknown ratio
an equation that may readily be solved iteratively by starting with the initial guess
on the right-hand side of equation (30), thereby to compute an improved estimate of
on the left-hand side of equation (30), this improved estimate to be inserted into the right-hand side of equation (30) to compute an even better estimate, and so on, until two successive estimates differ by a negligible amount.
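The iterative solution described above may be sketched as a fixed-point iteration. The sketch assumes that equation (30) has the form obtained by substituting T−T0 = R·P0/N into equation (24), where R is the coolant-to-junction thermal resistance of one package and N is the number of packages; this form, the symbols R and N, the initial guess of 1 for the ratio P0/Pref, and all numerical values are assumptions made for illustration.

```python
import math

# Illustrative fixed-point iteration for the ratio x = P0 / Pref, assuming
#   x = (1 - lam) + lam * exp[kappa * (R * Pref * x / N + T0 - Tref)].

def solve_power_ratio(t0, lam, kappa, r_th, p_ref, n, t_ref=16.0,
                      tol=1e-10, max_iter=200):
    x = 1.0  # assumed initial guess: P0 = Pref
    for _ in range(max_iter):
        junction_rise = r_th * p_ref * x / n            # T - T0 for one package
        x_new = (1.0 - lam) + lam * math.exp(kappa * (junction_rise + t0 - t_ref))
        if abs(x_new - x) < tol:                        # successive estimates agree
            return x_new
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical parameters: 10 packages, 0.1 deg C/W each, Pref = 2000 W.
ratio_cold = solve_power_ratio(16.0, 0.3, 0.02, 0.1, 2000.0, 10)
ratio_warm = solve_power_ratio(30.0, 0.3, 0.02, 0.1, 2000.0, 10)
print(ratio_cold, ratio_warm)
```

For these hypothetical parameters the iteration contracts quickly; for a strongly temperature-dependent system (large λ, κ, or thermal resistance) convergence is not guaranteed, which is why the sketch raises an error after a fixed number of iterations.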
In many cases, the set of N heat-producing devices is arranged in series along the coolant flow path. In such cases, the inlet coolant temperature T0 is truly relevant only for the first heat-producing device in the series; devices further downstream see warmer coolant because the coolant has absorbed heat from upstream devices. Specifically, coolant at the downstream-most device has been heated by N−1 previous devices, so the average device sees warming from ½(N−1) previous devices. Consequently, coolant temperature for the average heat-producing device, denoted
where ρ=coolant density (1000 kg/m3 for water), c=coolant specific heat (4180 J/kg-C for water), and V=coolant flow rate. In analogy to equation (28), let be the thermal resistance of the path from coolant to transistor junction as computed using the averaged coolant temperature
Averaged versions of coolant temperature and thermal resistance may then be used in the following modified form of equation (30):
Substituting (31) into (33) finally yields
Like equation (30), equation (34) is a nonlinear algebraic equation in the unknown ratio
an equation that may readily be solved by iteration as explained above in connection with equation (30).
Equation (34) is used instead of equation (30) whenever heat-producing devices are arranged in series, as in the experimental example considered below.
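The averaging argument for devices arranged in series can be illustrated as follows. The sketch assumes that each of the N devices dissipates an equal share P0/N, so that each raises the coolant temperature by (P0/N)/(ρcV), and that the average device therefore sees coolant warmed by ½(N−1) such increments; the function name and the example flow rate are hypothetical.

```python
# Illustrative average coolant temperature for N heat-producing devices in
# series along the coolant path, assuming equal power per device.

def average_coolant_temperature(t0, p0, n, rho=1000.0, c=4180.0, v=1.0e-4):
    """t0 in deg C, p0 in W, rho in kg/m^3, c in J/(kg*C), v in m^3/s."""
    rise_per_device = (p0 / n) / (rho * c * v)   # warming caused by one device
    return t0 + 0.5 * (n - 1) * rise_per_device  # average over the series


# Hypothetical example: three devices of 418 W each in water at 0.1 L/s,
# so each device warms the coolant by 1 deg C.
print(average_coolant_temperature(20.0, 3 * 418.0, 3))
```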
For simplicity, consider only the largest component of PCool; namely, the power consumed by compressor 514. Then, according to
PCool≈(1−β)α(P0+P1), (35)
where β is the fraction of power P0+P1 rejected by coolant 506 to free-cooling water 532 via the “free-cooling” heat exchanger 530. As will be shown, β is a function of T0.
To simplify further, assume P1≪P0; that is, assume that the computers 502 consume much more power than the pump 508 that circulates the coolant 506. This is a very reasonable assumption for most systems of the type represented by 500. Thus
PCool≈(1−β)αP0. (36)
The range of coolant-inlet temperature T0 over which PTotal must be considered is typically limited to
(T0)min≦T0≦(T0)max-allowed. (37)
The lower limit (T0)min≡16° C. is imposed by condensation issues, as discussed previously in connection with equation (15). The upper limit (T0)max-allowed is imposed by manufacturers of computers 502 due to the cooling requirements thereof, because junction temperatures in the electronics 502 rise at least as much as T0 rises—in fact slightly more due to the additional leakage power caused by higher temperature—and thus the electronics may be damaged if T0 is too high.
In general, in the range given by equation (37), for fixed values of all other parameters, there will be an optimum value of T0, denoted T*0, where PTotal(T0) is minimized. This optimum is often at one of the two limits (T0)min or (T0)max-allowed. For example, if PTotal(T0) is a monotonically increasing function in the range given by equation (37), then T*0=(T0)min. Conversely, if PTotal(T0) is a monotonically decreasing function in the range given by equation (37), then T*0=(T0)max-allowed. If PTotal(T0) first increases then decreases in the range given by equation (37), then T*0 is either (T0)min or (T0)max-allowed. Only if PTotal(T0) first decreases then increases in the range given by equation (37) will there be a local minimum within the range, but even then, this local minimum is not necessarily the global minimum of total power PTotal(T0), as illustrated by example below.
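The distinction between a local interior minimum and the global minimum can be illustrated with a simple grid search over the range of equation (37). The function name is hypothetical, and the total-power curve below is an invented toy shape, not a model from this specification; it is chosen only so that an interior local minimum coexists with a lower value at the boundary (T0)min.

```python
# Illustrative grid search for the global minimum of PTotal(T0) on
# [(T0)min, (T0)max-allowed], including both end points.

def global_minimum_on_grid(p_total, t_min, t_max, step):
    n = int(round((t_max - t_min) / step))
    grid = [t_min + k * step for k in range(n + 1)]
    return min(grid, key=p_total)


def toy_total_power(t0):
    """Hypothetical curve: rises from 16 deg C, then dips to a local minimum."""
    if t0 < 22.0:
        return 100.0 + 3.0 * (t0 - 16.0)       # increasing near (T0)min
    return 105.0 + 0.8 * (t0 - 26.0) ** 2       # local minimum at 26 deg C


t_star = global_minimum_on_grid(toy_total_power, 16.0, 36.0, 1.0)
print(t_star, toy_total_power(t_star), toy_total_power(26.0))
```

Because the search evaluates every grid point, including the two limits, it is not misled by the interior dip at 26° C. in this toy example.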
To find the optimum coolant temperature T*0, it is necessary to derive the functional dependence of β on T0 for system 500 under the approximations expressed by equation (36). Let
ρ≡Density of coolant 506 [kg/m3]
c≡Heat capacity of coolant 506 [J/kg-° C.]
V≡Volumetric flow rate of coolant 506 [m3/s]
UA≡Figure of merit for the free-cooling heat exchanger 530; (38)
which says that the fraction β of the heat load P0+P1 removed by free-cooling heat exchanger 530 is equal to the ratio of the temperature drop across the free-cooling heat exchanger 530, T1−T2, to the total temperature drop T1−T0 across the free-cooling heat exchanger 530 and the chiller 510 combined. In equation (39), the unknown temperature T1 may be expressed in terms of other variables by writing the energy balance across the computers 502,
P0≈ρcV(T1−T0), (40)
and the unknown temperature T2 may be expressed likewise by writing the free-cooling heat-exchanger's heat-transfer equation,
βP0≈UA(T2−T3) (41)
where T3 is the sum of wet-bulb temperature TWB and cooling-tower approach temperature ΔTCT as specified by equation (12). In both equations (40) and (41), as previously, the small pump power P1 is neglected in comparison with the large computer power P0, as indicated by the approximately-equal signs. Solving for T1 in equation (40) and T2 in equation (41) yields
T1=T0+P0/(ρcV); (42)
T2=T3+βP0/UA. (43)
Substituting equations (42) and (43) into equation (39) yields
β=[T0+P0/(ρcV)−T3−βP0/UA]/[P0/(ρcV)]. (44)
Solving for β in equation (44) yields
β=[1−ρcV(T3−T0)/P0]/[1+ρcV/UA], (45)
or, equivalently,
Two exceptional cases occur. First, some combinations of parameters in equation (46) yield β<0. Physically this means that the free-cooling heat exchanger 530 can accomplish no cooling of the incoming hot-side coolant 506 at temperature T1, because the incoming cold-side temperature T3=TWB+ΔTCT is larger than T1. In fact, β<0 implies that the free-cooling heat exchanger would warm the coolant, which would never be allowed; in such a situation, the free-cooling heat exchanger would instead be bypassed, producing β=0.
Second, other combinations of parameters in equation (46) yield β>1. Physically this means that the free-cooling heat exchanger 530 is more than adequate to reduce the incoming hot-side coolant 506 at temperature T1 to the required temperature T0. In fact, to achieve the required coolant temperature T0 with β>1 would involve reheating the coolant after it exits the free-cooling heat exchanger. This would be a waste of energy without reason, and would never be implemented. Instead, the flow of free-cooling water 532 would be modulated to limit the cooling achieved by the free-cooling heat exchanger, as may be necessary to prevent the coolant temperature T0 from becoming less than (T0)min≡16° C. Thus, the actual value of β in such cases is β=1. Because of these exceptions, equation (46) must be re-written as follows:
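The two exceptional cases amount to clamping the unconstrained free-cooling fraction to the interval from 0 to 1. A minimal sketch follows, assuming the unconstrained value is obtained by eliminating T1 and T2 from the relations β=(T1−T2)/(T1−T0), P0≈ρcV(T1−T0) and βP0≈UA(T2−T3). The function name and argument names are illustrative, and the grouping of terms is one algebraically convenient form, not necessarily the form used in equations (46) through (48).

```python
def free_cooling_fraction(p0, t0, t3, rho_c_v, ua):
    """Fraction of the heat load removed by the free-cooling heat exchanger.

    p0       computer power [W]
    t0       coolant temperature supplied to the computers [deg C]
    t3       cold-side inlet temperature of the heat exchanger [deg C]
    rho_c_v  coolant heat-capacity rate, density*heat capacity*flow [W/deg C]
    ua       heat-exchanger figure of merit [W/deg C]
    """
    # Unconstrained value from the energy balance and heat-transfer relation.
    raw = (1.0 - rho_c_v * (t3 - t0) / p0) / (1.0 + rho_c_v / ua)
    # raw < 0: the exchanger would warm the coolant, so it is bypassed (0).
    # raw > 1: the exchanger is more than adequate, so its flow is throttled (1).
    return min(1.0, max(0.0, raw))


# Illustrative inputs only: 1000 W load, 1 deg C warmer cold-side inlet.
print(free_cooling_fraction(p0=1000.0, t0=20.0, t3=21.0, rho_c_v=229.0, ua=458.0))
```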
The total power PTotal of system 500 is found, with the help of equation (36), to be
PTotal≡P0+PCool=[1+(1−β)α]P0, (49)
Normalizing by the power Pref, which is the value of power P0 that would occur if all chips were cooled to temperature Tref, gives
PTotal/Pref=[1+(1−β)α](P0/Pref). (50)
The computation occurs in three steps. First,
is found by solving the nonlinear algebraic equation (34) for the given set of the parameters λ, κ, , Pref, T0, N,
and Tref. Second, β is computed from equations (47) and (48) for the given set of the parameters
Third, the two results are combined, in equation (50), to produce
The impetus for this invention is that, as coolant-inlet temperature T0 is increased, the two factors on the right-hand side of equation (50) oppose each other, suggesting that an optimum value of T0 exists. Specifically, equations (47) and (48) show that, as the coolant temperature T0 supplied to computers 502 is increased, the fraction β of the heat load P0 that can be rejected to the “free cooling” heat exchanger 530 either remains the same (when β=0 or β=1) or increases (when 0<β<1), and consequently the fraction 1−β that must be rejected to the chiller 510 either remains the same or decreases, thereby reducing the power consumption of compressor 514. This power saving is precisely why prior-art schemes focus on increasing T0. However, these prior-art schemes ignore the fact that the power consumption P0 of the computers 502 increases as T0 increases, such that the total power of system 500, given by equation (50), may actually rise as T0 increases. That is, depending on the various parameters, as T0 increases, the exponential growth of P0 in equation (34) may overwhelm the diminution of [1+(1−β)α] in equation (50). Consequently, increasing T0 may actually increase overall power consumption, despite “free-cooling”.
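The opposition of the two factors can be made concrete with a simplified sketch. It is not equation (34): instead of treating the chips in series, it assumes a single lumped junction temperature T=T0+θP0, where θ is a hypothetical lumped coefficient (here the per-chip thermal resistance divided by the number of chips, plus half the coolant temperature rise per watt). The computer power is assumed to follow P0=Pref[(1−λ)+λexp(κ(T−Tref))], and the free-cooling fraction is the clamped energy-balance expression. All function names and default values are illustrative.

```python
import math


def computer_power(t0, lam, p_ref, kappa, t_ref, theta, tol=1e-9, max_iter=200):
    """Solve P0 = Pref*[(1-lam) + lam*exp(kappa*(t0 + theta*P0 - Tref))] by fixed-point iteration."""
    p0 = p_ref
    for _ in range(max_iter):
        t_junction = t0 + theta * p0  # lumped junction temperature (assumption)
        p_new = p_ref * ((1.0 - lam) + lam * math.exp(kappa * (t_junction - t_ref)))
        if abs(p_new - p0) < tol:
            return p_new
        p0 = p_new
    raise RuntimeError("fixed-point iteration did not converge")


def total_power(t0, t3, lam, p_ref, kappa=0.01789, t_ref=16.0,
                theta=0.0262, rho_c_v=229.0, ua=458.0, alpha=0.15):
    """Return (PTotal, P0, beta) for one coolant temperature t0 and cold-side temperature t3."""
    p0 = computer_power(t0, lam, p_ref, kappa, t_ref, theta)
    raw = (1.0 - rho_c_v * (t3 - t0) / p0) / (1.0 + rho_c_v / ua)
    beta = min(1.0, max(0.0, raw))  # bypass below 0, throttle above 1
    return (1.0 + (1.0 - beta) * alpha) * p0, p0, beta


# Sweep T0 for one illustrative operating point (lam=0.5, T3=21 deg C).
for t0 in (16.0, 20.0, 24.0, 28.0, 32.0):
    p_total, p0, beta = total_power(t0, 21.0, 0.5, 320.47)
    print(f"T0={t0:4.1f}  P0={p0:7.2f} W  beta={beta:5.3f}  PTotal={p_total:7.2f} W")
```

In the sweep, P0 grows with T0 in every row while the chiller factor 1+(1−β)α shrinks only where β lies strictly between 0 and 1, which is the trade-off described above.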
Several of the parameters in equation (34) may be determined experimentally for a particular set of computers 502. To illustrate this process, experiments have been conducted on an ensemble of 32 compute chips used in a water-cooled supercomputer called Blue Gene/Q, which is currently being developed by International Business Machines. Thus, in this experiment, the computers 502 are represented by the 32 Blue Gene/Q compute chips, which are arranged in series along the coolant flow path.
Consequently, equation (34) with N=32 is used to obtain
as a function of the inlet water temperature T0. Power P0 is the aggregated power of the 32 chips. In this experiment, it is important to understand which parameters in equation (34) depend on the computer hardware only, and which depend both on hardware and on the computational state implied by the software being executed on the hardware.
First, the exponential parameter κ is a function of CMOS technology, so it is fixed for given hardware. In particular, parameter κ is smaller for more advanced generations of CMOS than it is for older generations, as shown in
Second, the average thermal resistance of the path from a transistor junction to the coolant is also determined by hardware; namely, by the geometries and thermal conductivities of materials in the thermal path. The value of this thermal resistance appropriate for the Blue Gene/Q system is derived below from experimental measurements.
Third, reference temperature Tref is chosen arbitrarily, so it is the same for all hardware and all computational states. Tref=16° C. is chosen herein, for reasons explained in connection with equation (26).
Fourth, parameter λ is a function of computational state as well as hardware, because, referring to equation (24), different computational states exhibit different fractions of power consumed by temperature-dependent leakage at T=Tref. Values of λ appropriate for various computational states of the Blue Gene/Q system are derived below from experimental measurements.
Fifth, reference power Pref is a function of both computational state and hardware, since each combination thereof leads to a different amount of dissipated power Pref at the reference temperature Tref. Values of Pref appropriate for various computational states of the Blue Gene/Q compute chips are derived below from experimental measurements.
Referring to
To fit data 1002 to the mathematical model herein, equation (24) may be written as
P0=C1+C2exp[κ(T−Tref)]. (51)
Rather than performing a three-parameter nonlinear regression to choose values of C1, C2 and κ in equation (51) that best fit dataset 1002, it is simpler to note that equation (51) may be written as
y=C2exp(κx), (54)
where
x≡T−Tref; y≡P0−C1. (55)
Taking the natural log of both sides of equation (54) yields a two-parameter linear regression for the unknown regression constants C2 and κ. A series of such linear regressions may be easily performed for various assumed values of C1. For each regression, the best-fit values of C2 and κ, as well as the correlation coefficient R², are recorded. The value of C1 producing the highest correlation R² is then assumed to provide the best fit to the three-parameter model (51). Specifically, referring again to
C1=75.0 W; C2=151.3 W; κ=0.01789. (56)
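The scan over assumed values of C1 can be sketched as follows. For each candidate C1 the code fits ln(P0−C1)=ln(C2)+κ(T−Tref) by ordinary least squares and keeps the candidate with the highest R². The function name, the candidate grid, and the synthetic data in the usage example are assumptions for illustration; the measured dataset 1002 is not reproduced here.

```python
import math


def fit_offset_exponential(temps_c, powers_w, t_ref_c, c1_candidates):
    """Scan assumed offsets C1 and return the best fit of P0 = C1 + C2*exp(kappa*(T - Tref))."""
    xs = [t - t_ref_c for t in temps_c]
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    best = None
    for c1 in c1_candidates:
        shifted = [p - c1 for p in powers_w]
        if min(shifted) <= 0.0:
            continue  # the logarithm requires P0 - C1 > 0
        ln_y = [math.log(y) for y in shifted]
        mean_y = sum(ln_y) / n
        sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(xs, ln_y))
        syy = sum((v - mean_y) ** 2 for v in ln_y)
        kappa = sxy / sxx                        # slope of the log-linear fit
        c2 = math.exp(mean_y - kappa * mean_x)   # exp(intercept)
        r2 = (sxy * sxy) / (sxx * syy)           # squared correlation
        if best is None or r2 > best["r2"]:
            best = {"c1": c1, "c2": c2, "kappa": kappa, "r2": r2}
    return best


# Synthetic, noise-free data generated from the constants in equation (56).
temps = [20.0 + 5.0 * i for i in range(11)]
powers = [75.0 + 151.3 * math.exp(0.01789 * (t - 16.0)) for t in temps]
print(fit_offset_exponential(temps, powers, 16.0, [5.0 * i for i in range(31)]))
```

With real measurements the R² peak is broader than with noise-free data, so the candidate grid for C1 may need to be refined around the best value.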
Substituting equations (56) into the first of equations (53) yields the reference power
Pref=226.3 W, (57)
which is the value of P0 that would occur if all chips were cooled to the reference temperature Tref.
Substituting equations (56) into the second of equations (53) yields
λ=0.6686. (58)
That is, referring to equation (24), in the Un-Initialized state, 66.9% of the power at T=Tref is temperature-dependent, sub-threshold leakage power. The remaining 33.1% is gate-leakage power, because there is no active power in this state.
In equations (56), the experimentally deduced value of the important parameter κ, namely,
κ=0.01789, (59)
is similar to the textbook value shown on
f(T)≡Fraction of power that is sub-threshold leakage power, as a function of junction temperature T. (60)
According to the two terms in equation (24),
f(T)=λexp[κ(T−Tref)]/{(1−λ)+λexp[κ(T−Tref)]}. (61)
For example, for the Un-Initialized computational state for which equation (58) applies,
f(30° C.)=0.722; f(40° C.)=0.756. (62)
That is, at a junction temperature of 30° C., sub-threshold leakage in the Un-Initialized state consumes 72.2% of total power; at 40° C., it consumes 75.6%.
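These fractions can be checked numerically with a short sketch. It assumes the two-term power model P0=Pref[(1−λ)+λexp(κ(T−Tref))], so that the leakage fraction is the temperature-dependent term divided by the sum of both terms; the function name is illustrative.

```python
import math


def leakage_fraction(t_junction_c, lam, kappa=0.01789, t_ref_c=16.0):
    """Fraction of total power that is sub-threshold leakage at junction temperature T."""
    growth = lam * math.exp(kappa * (t_junction_c - t_ref_c))  # temperature-dependent term
    return growth / ((1.0 - lam) + growth)


# Un-Initialized state, lam = 0.6686 from equation (58).
for t in (30.0, 40.0):
    print(f"f({t:.0f} C) = {leakage_fraction(t, 0.6686):.3f}")
```

At T=Tref the function returns λ itself, which matches the interpretation of λ as the leakage fraction at the reference temperature.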
Referring to
Two additional experiments with the system of 32 Blue Gene/Q chips, summarized in
V=3.3 liter/min=5.48E−5 m3/s, (63)
so that a steady thermal state is reached. The coolant is water, so
Cases C and E were run at a low value of inlet coolant temperature T0, whereas Cases D and F were run at a somewhat higher value of coolant temperature T0. As for Cases A and B, each value of junction temperature T tabulated for Cases C through F represents a measured, ensemble average of 32 junction temperatures reported by the 32 chips used in the experiment. Each value of P0 represents the measured, aggregate power consumed by the 32 chips. The ensemble-average coolant temperature T0 is calculated for each case according to equation (31), using N=32 and the values in equations (63) and (64).
Thermal impedance of the path from coolant to transistor junction, as depicted in
In principle, this result applies to all Cases, because it is a thermal characterization of the physical path from coolant to junction, which does not change with computational state. In reality, somewhat different results are obtained for other cases: 0.700 [° C./W] for Case C; 0.722 [° C./W] for Case D; 0.791 [° C./W] for Case E. The result for Case F, given in equation (65), is chosen for subsequent computations because the temperature difference in the numerator of (65) is the largest of all the Cases, and thus is believed to be the most accurate.
Consider now the “Idle” computational state represented by Cases C and D in
Equations (66) and (67) represent two equations in the two unknown parameters Pref and λ. Dividing equation (67) by equation (66) to eliminate Pref yields
Clearing fractions and solving for λ yields
From
P0C=935.7 W; P0D=978.2 W; TC=38.1° C.; TD=48.5° C. (70)
In light of equations (26) and (59), substituting the values from equations (70) into equation (69) yields
λ=0.1613. (71)
This means that, in the “Idle” computational state, if the transistor junctions of the Blue Gene/Q chips are all held at the reference temperature Tref=16° C., 16.1% of the power P0 is temperature-dependent, sub-threshold leakage power. At a junction temperature T higher than Tref, the fraction f of power consumed by sub-threshold leakage is higher. For example, applying equation (61) for the junction temperatures TC and TD measured experimentally in the “Idle” state, the power fraction f is
That is, at the operating temperatures found for Cases C and D, sub-threshold leakage is 22.2% and 25.6% of total power, respectively.
The reference power Pref for the “Idle” computational state may be found by substituting the values from equations (70) and (71) into equation (66):
Similar substitution into equation (67) yields
As expected, the two results agree: the computed value of λ ensures it.
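The two-point procedure for the “Idle” state can be sketched numerically. The code assumes the power model P0=Pref[(1−λ)+λexp(κ(T−Tref))] evaluated at two measured operating points, takes the ratio to eliminate Pref, and solves for λ; the closed form used is one algebraic rearrangement and may differ in appearance from equation (69). Function names are illustrative.

```python
import math


def solve_lambda(p_a, t_a, p_b, t_b, kappa=0.01789, t_ref_c=16.0):
    """Solve for lam from two (power, junction temperature) measurements."""
    e_a = math.exp(kappa * (t_a - t_ref_c))
    e_b = math.exp(kappa * (t_b - t_ref_c))
    r = p_b / p_a  # the power ratio eliminates Pref
    return (r - 1.0) / ((e_b - 1.0) - r * (e_a - 1.0))


def reference_power(p0, t_junction_c, lam, kappa=0.01789, t_ref_c=16.0):
    """Back out Pref from one measurement once lam is known."""
    return p0 / ((1.0 - lam) + lam * math.exp(kappa * (t_junction_c - t_ref_c)))


# "Idle" state, Cases C and D, values from equations (70).
lam_idle = solve_lambda(935.7, 38.1, 978.2, 48.5)
print(f"lambda = {lam_idle:.4f}")
print(f"Pref from Case C = {reference_power(935.7, 38.1, lam_idle):.1f} W")
print(f"Pref from Case D = {reference_power(978.2, 48.5, lam_idle):.1f} W")
```

The two Pref values coincide because λ was chosen precisely so that both measurements satisfy the same model.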
Consider now the “Full Power” computational state represented by Cases E and F on
From
P0E=1558.1W; P0F=1610.9W; TE=56.8° C.; TF=67.1° C. (77)
In light of equations (26) and (59), substituting the values from equations (77) into equation (69) yields, for the “Full Power” computational state
λ=0.08839. (78)
This means that, in the “Full Power” computational state, if the transistor junctions of the Blue Gene/Q chips are all held at the reference temperature Tref=16° C., 8.8% of the power P0 is temperature-dependent, sub-threshold leakage power. At a junction temperature T higher than Tref, the fraction f of power consumed by sub-threshold leakage will be higher. Using definition (61), for Case E, the fraction f is
Similarly, for Case F, the fraction f is
That is, in the “Full Power” state, at the operating temperatures found for Cases E and F, sub-threshold leakage power is 16.7% and 19.5% of total power, respectively.
The reference power Pref for the “Full Power” computational state is computed as described above for the “Idle” state. For Case E:
As expected, the two results agree: the computed value of λ ensures it.
Referring to
yields the curves labeled “λ=0.6686”, “λ=0.1613”, and “λ=0.08839” on
Recall that, by definition, Pref is the value of P0 that would occur if the temperature T of all transistors were held at (T0)min=16° C. Consequently, it should be no surprise that
occurs at some value of inlet coolant temperature T0 lower than 16° C.; specifically, at the value of T0 corresponding to a value of chip-averaged coolant temperature
To simplify the interpretation of plots such as
(P0)16° C.≡P0@T0=16° C. (83)
That is, (P0)16° C. is the consumed power for the ensemble of chips when the coolant inlet temperature at the first chip in the series is 16° C. For each curve on
The experiments described above show that the parameters λ, Pref depend on computational state. However, Pref appears to be related to λ, as shown on
Pref=169.42λ^(−0.91957) [W]. (84)
The above results characterizing computer power P0 may now be used to study PTotal=P0+PCool, where equation (50) for PTotal may be renormalized as
The computation occurs in four steps. First,
as a function of T0, denoted
is found by repeatedly solving the nonlinear algebraic equation (34) for the given set of the parameters
Second, each result for
is normalized by the result at T0=16° C., denoted
to produce
Third, the factor [1+(1−β)α] in equation (85) is computed from equations (47) and (48) for the given set of the parameters
Fourth, the results for
and [1+(1−β)α] are combined, in equation (85), to produce
Numerical values of the several parameters shall now be adopted for use in
κ=0.01789; average thermal resistance=0.769[° C./W]; Tref=16[° C.]; N=32; ρcV=229[W/° C.]. (89)
Combining equations (84) and (89) yields the functional relationship between λ and the parameter grouping
that appears in equations (34) and (46):
Assume that the free-cooling heat exchanger, sized to handle the small amount of power represented by the 32 Blue Gene/Q compute chips, has the figure of merit
UA=458[W/° C.], (91)
which is tantamount to assuming the dimensionless parameter
Assume that the chiller consumes 15% of the heat that it transfers, so
α=0.15. (93)
As a result of the numerical assumptions above,
vs. T0 may be plotted, as a function of just two free parameters, λ and T3, because all other parameters are known or assumed from equations (89) through (93). Specifically, the numerical values of κ, , N, and ρcV are known for the given system of 32 Blue Gene/Q compute chips, the numerical values of Tref,
and α are assumed, and the numerical values of Pref and
depend only on λ, according to equations (84) and (90).
The two free parameters λ and T3 correspond to the two primary variables of the system 500; namely, the computational state, as represented by λ, and the weather, as represented by T3, inasmuch as T3 is related to wet-bulb temperature, according to equation (12).
To illustrate the computation process described above, consider the case
λ=0.5; T3=21° C.; T0=20° C. (94)
Substituting (94) into equation (84) produces
Pref=169.42(0.5)^(−0.91957)=320.47 W. (95)
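The arithmetic in equation (95) is a power law with a negative exponent, which can be confirmed with a one-line function. The function name is illustrative; the coefficient and exponent are those of equation (84).

```python
def reference_power_from_lambda(lam):
    """Empirical fit of equation (84): Pref in watts as a power law in lam."""
    return 169.42 * lam ** (-0.91957)


print(round(reference_power_from_lambda(0.5), 2))  # 320.47
```

Because the exponent is negative, states with a smaller leakage fraction λ (more active power) have a larger reference power, consistent with the ordering of the measured computational states.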
Substituting values from equations (89), (94) and (95) into equation (34) yields, after several iterations,
Repeating the latter procedure with T0=16° C. produces
Substituting equations (96) and (97) into equation (87) yields
Substituting equations (89), (92), (94), (95), and (96) into equations (47) and (48) yields
β={circumflex over (β)}=0.244. (99)
Substituting (93), (98), and (99) into (85) yields
Of the three free parameters in (94), only λ affects (P0)16° C. As shown on
(P0)16° C.=182.83λ^(−0.92338) [W]. (101)
Equation (101) is helpful to interpret physically subsequent plots where PTotal is normalized by (P0)16° C..
Repeating the calculations represented by (94) through (100) many times, for various values of the three parameters in (94), yields
is plotted vs. T0, λ, and T3. That is,
(T0)min≦T0≦(T0)max, (102)
where (T0)min≡16° C. and (T0)max≡39° C. as discussed in connection with equations (14) and (15).
In general, each curve on
Regime I: β=0 (No free cooling)
Regime II: 0<β<1 (Mixed free cooling and chiller cooling)
Regime III: β=1 (100% free cooling). (103)
For example, for the curve λ=0.9 on
On
For example, referring to the aforesaid curve λ=0.9 on
(T0)Local Max=25.9° C. and
(T0)Local Min=27.6° C.
In general, PTotal always increases with T0 in Regimes I and III, due to the exponential dependence of P0 on
Two features of
This is a natural consequence of assumption (93), α=0.15, because, according to equation (85), in Regime I where β=0,
That is, at T0=16° C., the fully utilized chiller consumes 15% of P0, so total power is 1.15P0.
It is clear that, as T3 increases beyond a certain point, free cooling ceases to produce lower total power PTotal compared to conventional cooling at T0=16° C. Specifically,
The value in equation (110) appears on
Referring now to
whether this minimum be the local minimum defined by equation (105), or whether it be the “edge minimum” at 16° C., as discussed in the previous paragraph. For each value of λ, the optimal value T*0 is 16° C. for low values of T3, then rises linearly with T3, and then suddenly drops again to T*0=16° C. The reason for the sudden drop is explained by the interplay between the aforementioned local minimum and the edge minimum at T*0=16° C. For example, consider the curves for λ=0.4 on
denoted Point B, is slightly below the edge minimum
whereas the local minimum on
at T0=31.1° C., denoted Point C, is slightly above the edge minimum. It is clear that at some “critical” value of T3 between T3=27° C. and T3=30° C., the value of the local minimum will be exactly
at an abscissa value of T0=(T0)Local Min. At this “critical” value of T3, the value of T0 that minimizes
will, as a function of T3, suddenly jump from the local minimum at T*0=(T0)Local Min to the edge minimum at T*0=16° C. In the example with λ=0.4, this sudden shift is reflected in
The quantitative results shown in FIGS. 16 through 27—in particular the location of the downshift-boundary curve in FIG. 27—depend on the set of parameters given by equations (89) through (93), to which all of the
and α, as tabulated in
and α respectively. Numerical values written in bold-face type in
and α is varied in
increases, the downshift boundary moves to the left. Physically, this corresponds to the fact that
increases as UA decreases, which makes sense: UA is the figure of merit for the free-cooling heat exchanger, so lower UA implies a less effective free-cooling system, which impedes its usefulness at high T3, thereby pushing the downshift boundary leftward.
In summary, the mathematical model presented above as equations (24) through (50), as well as the numerical examples of this model presented as
The mathematical model above shows that the optimum, energy-minimizing coolant temperature T*0 is either (T0)min or (T0)Local Min. The former temperature, (T0)min, is defined by equation (26), or more generally as slightly higher than the machine room's ambient dew-point temperature. The latter temperature, (T0)Local Min, is defined by equation (105). This mathematical result greatly simplifies the calibration algorithm 400 in
(T0)min=TDewPoint+1° C. (112)
For given weather conditions, the value of (T0)Local Min is provided by the system 300 itself: it is the temperature at which coolant 306 is returned to machine room 304 if the free-cooling heat exchanger 330 is used while the chiller 310 is turned off or bypassed.
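Because only two candidate set-points need to be compared, the selection logic can be sketched in a few lines. This is an illustrative sketch, not the calibration algorithm 400 itself: the function name, the callback `measure_total_power` (assumed to return the measured total power after the system settles at a given set-point), and the check against the maximum allowed temperature are all assumptions.

```python
def choose_setpoint(dew_point_c, free_cooling_return_c, t_max_allowed_c, measure_total_power):
    """Pick the coolant set-point with the lower measured total power.

    dew_point_c            machine-room dew-point temperature [deg C]
    free_cooling_return_c  coolant temperature delivered by free cooling alone,
                           with the chiller off or bypassed [deg C]
    t_max_allowed_c        upper limit imposed by the computer manufacturer [deg C]
    measure_total_power    hypothetical callback: set-point [deg C] -> total power
    """
    t_min = dew_point_c + 1.0  # equation (112)
    candidates = [t_min]
    # The free-cooling temperature is a candidate only if it lies inside the allowed range.
    if t_min < free_cooling_return_c <= t_max_allowed_c:
        candidates.append(free_cooling_return_c)
    # min() returns the first candidate on a tie, so t_min wins ties.
    return min(candidates, key=measure_total_power)


# Stand-in power readings for illustration only.
readings = {17.0: 1.15, 27.6: 1.12}
print(choose_setpoint(16.0, 27.6, 39.0, lambda t: readings[t]))
```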
Consequently, referring to
Likewise, referring to
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a system, apparatus, or device running an instruction.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a system, apparatus, or device running an instruction.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may run entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which run via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which run on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more operable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be run substantially concurrently, or the blocks may sometimes be run in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While the present invention has been particularly shown and described with respect to preferred embodiments thereof, it will be understood by those skilled in the art that changes in forms and details may be made without departing from the spirit and scope of the present application. It is therefore intended that the present invention not be limited to the exact forms and details described and illustrated herein, but falls within the scope of the appended claims.
This invention was made with Government support under Contract No. B554331 awarded by the Department of Energy. The Government has certain rights in this invention.