With data center computing environments dissipating increasing amounts of heat, system managers are increasingly interested in controlling the costs of cooling a data center while sufficiently cooling the data center's electronic hardware.
As observed in
In the case of the first cooling mechanism (air-cooled), a fan 104 blows ambient air 105 through fins 106 that extend from the base 107 of a chip package cooling unit 103. Here, the base 107 of the cooling unit 103 acts as a thermal mass that draws heat from the semiconductor chip(s) that operate within the chip package 102 that the base 107 is thermally coupled to. The heat is transferred from the base 107 to the fins 106, which radiate the heat into the ambient air. The air flow 105 through the fins 106 removes the heat from the cooling unit 103, which, in turn, removes heat from the chips within the package 102. Thus, in the case of the first mechanism, the cooling unit 103 acts as a traditional heat sink.
In the case of the second mechanism (liquid cooling), liquid flows through the base 107 of the cooling unit 103, which acts as a cold plate. Here, heat generated by the semiconductor chip(s) is drawn into the base 107 of the cooling unit 103 while liquid flows through one or more fluidic conduits that are formed within the base 107 of the cooling unit (the liquid is pumped by pump 108). The liquid is warmed from the heat and then flows out of the base 107, thereby removing heat from the cooling unit 103. The warmed liquid is then channeled to a heat exchanger 109 (“hex”). The heat exchanger 109, e.g., passes the warmed fluid over fins that are exposed to the ambient air, which reduces the temperature of the fluid. The cooled fluid is then channeled back to the base 107 and the process repeats.
In the case of the third mechanism (chilled liquid cooling), the same approach as liquid cooling above is applied except that the fluid runs through both the heat exchanger 109 and a chilling unit 110. Here, the heat exchanger 109 can theoretically only lower the temperature of the liquid to that of the ambient temperature. That is, the heat exchanger 109 is essentially a passive device that relies on the ambient as a heat sink.
By contrast, in the case of chilled liquid cooling, energy is applied to an active cooling unit 110 (a chiller, chilling unit, etc.) that can reduce the temperature of the warmed fluid beneath ambient temperature.
For ease of explanation
The three different cooling mechanisms effect three different “coarse grain” adjustments that can be used to effect various tradeoffs between cooling performance and cooling cost. Specifically, air cooling represents the least performance but the lowest cost, chilled liquid cooling represents the highest performance but the highest cost, and liquid cooling falls in between these two extremes.
More specifically, among the three mechanisms, heat is removed from the system least effectively with air cooling, but, the cost to run the fans 104 with air cooling is less than the cost of running the pump 108 with liquid cooling or the cost of running the pump 108 and the chiller 110 with chilled liquid cooling. By contrast, heat is removed from the system most effectively with chilled liquid cooling, but the cost of running the pump 108 and the chiller 110 exceeds the cost of running just the pump 108 with liquid cooling or the fans 104 with air cooling. With respect to liquid cooling, heat is removed more effectively than air cooling but less effectively than chilled liquid cooling, while the cost of running the pump 108 is more expensive than the cost of running the fans 104 but less expensive than the cost of running both the pump 108 and chiller 110.
Notably, however, each different cooling mechanism can have its own range of performance/cost tradeoff. Specifically, if the speed of the fans is adjusted up/down, then the performance/cost of the air cooling 201 is likewise adjusted up/down. Similar adjustments can be made by adjusting the pumping action (pump speed) during liquid cooling 202 and adjusting the chiller's temperature setting during chilled liquid cooling 203.
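For purely illustrative purposes, the interplay between the coarse-grain mechanism selection and the fine-grain adjustment within a mechanism can be sketched as follows (the mechanism names, the normalized setting range, and the step size are hypothetical and not part of any particular embodiment):

```python
from enum import Enum

class Mechanism(Enum):
    AIR = 1             # fans only: lowest performance, lowest cost
    LIQUID = 2          # pump + heat exchanger
    CHILLED_LIQUID = 3  # pump + chiller: highest performance, highest cost

class RackCooling:
    """Per-rack cooling state: a coarse-grain mechanism plus a fine-grain
    setting (fan speed, pump speed, or chiller temperature set point)."""

    def __init__(self):
        self.mechanism = Mechanism.AIR
        self.setting = 0.0  # normalized 0.0 (min) .. 1.0 (max) within the mechanism

    def increase(self, step=0.25):
        """Raise cooling performance: first exhaust the fine-grain range of
        the current mechanism, then step up to the next mechanism."""
        if self.setting + step <= 1.0:
            self.setting += step
        elif self.mechanism != Mechanism.CHILLED_LIQUID:
            self.mechanism = Mechanism(self.mechanism.value + 1)
            self.setting = 0.0
        # else: already at the highest performance/cost state

    def decrease(self, step=0.25):
        """Lower cooling cost: the reverse of increase()."""
        if self.setting - step >= 0.0:
            self.setting -= step
        elif self.mechanism != Mechanism.AIR:
            self.mechanism = Mechanism(self.mechanism.value - 1)
            self.setting = 1.0
```

In this sketch, raising performance past the top of one mechanism's fine-grain range steps the rack up to the next coarse-grain mechanism, mirroring the air, then liquid, then chilled liquid progression described above.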
For ease of discussion, the simple example of
Here, each box 301 can represent an electronic hardware component at any of multiple possible granularities. For example, a single box 301 can represent any of: 1) a semiconductor chip package; 2) multiple semiconductor chip packages coupled to a same cooling apparatus (e.g., as observed in
For ease of discussion, much of the remaining discussion will be directed to an example where each box represents a rack of systems as described in 4) above where the rack contains, e.g., CPU sleds installed therein. Here, each rack has an associated number of fans, pumps and chillers to effect multi-mechanism cooling of the various systems that are plugged into the rack.
With each rack being capable of multi-mechanism cooling at rack granularity, the data center 300 of
Here,
In this example,
Then, as observed in
Then, after the workday's midpoint (T5), the above described process occurs in reverse where: 1) the cooling configuration
In the case of current workload information, for each individual rack, the cooling system controller can receive various metrics from the rack's real time workload monitor 616 that describe the rack's current operational state (e.g., temperature, processor activity (e.g., instructions per second, processor frequency), memory activity (e.g., memory access reads and/or writes per second), power supply current draw, jobs being processed by the rack's constituent systems, etc.). Based on these metrics the cooling system controller 611 can assess the cooling needs of the individual racks and adjust each rack's cooling system, accordingly.
As observed in
Here, each job causes the rack that processes the job to perform some amount of work (“processing activity”) which, in turn, causes the rack's electronic hardware to emit some additional heat that the rack's cooling system will have to remove. As the number of jobs that a rack receives grows, the cooling demands will necessitate a higher performance/cost setting for the rack's cooling.
According to a first dispatching process, observed in
Here, the first rack 701_1 that is assigned all new jobs steadily increases its need for cooling as jobs are assigned to it. However, at least in the early stages when the data center is under light total workload, the one rack remains in the lower performance/cost region (air cooled) and merely increases its fan speed to maintain sufficient cooling while the remaining racks 701_2 through 701_N hardly consume any power at all and require essentially no cooling.
Referring to
The dispatching process then continues in this manner until, as observed in
As the rate of new jobs being received by the data center continues to increase, as observed in
The process then continues as before until all racks are being cooled in the medium performance/cost state. If the rate at which jobs are being sent to the data center continues to increase, the dispatcher 613 can then begin sending all new jobs to the first rack 701_1, which triggers the first rack's cooling system to operate in the higher performance/cost state (chilled liquid cooling). If the rate at which jobs are being sent to the data center continues to increase, the dispatcher 613 then begins sending all new jobs to a next rack until the respective cooling systems of all racks are operating in the highest performance/cost state.
Note that the jobs described above correspond to the number of active jobs at a moment in time. That is, when a job is completed it reduces the number of active jobs assigned to the rack that processed the job. Thus, the above described dispatching process can weigh the rate at which jobs are being sent to the data center to determine per-job rack assignments. As such, the different predetermined levels described just above can correspond to a certain amount of processing capacity within the rack (e.g., X number of CPU cores are active and/or operate beneath some higher performance state).
When the rate of incoming jobs exceeds the processing capacity of a rack that can remain properly cooled within the lowest cooling performance/cost state (e.g., level 702), the dispatcher 613 begins sending new jobs to the next rack, and so on.
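As a hypothetical sketch of this dispatching process (the bare active-job count here stands in for the rack processing-capacity measure described above; an actual implementation would derive capacity from processor activity and thermal headroom):

```python
def dispatch(job, racks, capacity_at_lowest_state):
    """Assign a new job to the first rack whose active-job count is below
    the capacity it can sustain while remaining in the lowest cooling
    performance/cost state; otherwise spill over to the next rack."""
    for rack in racks:
        if rack["active_jobs"] < capacity_at_lowest_state:
            rack["active_jobs"] += 1
            return rack["id"]
    # all racks are at capacity in the lowest state: place the job on the
    # least-loaded rack, which may push that rack to a higher cooling state
    rack = min(racks, key=lambda r: r["active_jobs"])
    rack["active_jobs"] += 1
    return rack["id"]
```

Completed jobs would decrement a rack's active-job count, so the dispatcher effectively tracks active jobs rather than a running total of all jobs ever dispatched.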
The above example assumes for simplicity that all jobs induce approximately a same amount of rack processing activity. In environments where different jobs can induce different amounts of rack processing activity, each job can be assigned (e.g., by the dispatcher 613 or some other intelligence) a metric that corresponds to the amount of processing activity the job will entail on its assigned rack, the amount of power such processing activity corresponds to, and/or the amount of heat that such power and/or processing activity will add to the rack. Thus, for instance, higher performance jobs will be assigned a higher metric (e.g., “5”) than the metric that is assigned to lower performance jobs (e.g., “3”).
The dispatcher 613 then adds the metrics that are assigned to a same rack to determine the processing activity that has been assigned to the rack. If the processing activity reaches the predetermined processing capacity of a rack within a lowest cooling performance/cost state (e.g., level 702), the dispatcher 613 will begin dispatching new jobs to a next rack, and so on.
This dispatching process not only applies to environments where the applications that are executing the jobs are the same but also where the applications that execute the various jobs can be different. Here, a job that is executed by a higher performance application can be assigned a higher metric while another job that is executed by a lower performance application can be assigned a lower metric.
A higher performance application can be characterized by an application that consumes more CPU cores/processes/threads, more accelerator invocations, requires a larger memory footprint, and/or uses memory more extensively (greater number of memory accesses), while a lower performance application can be characterized by an application that consumes lesser amounts of these resources. Thus, the assignment of a metric to a job can include, at least in part, a component that weighs the performance level of the application that is to execute the job.
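One possible way to fold an application's performance level into the metric that is assigned to a job is sketched below (the profile field names and the weights are assumptions for illustration only):

```python
def job_metric(profile):
    """Compute a dispatch metric from an application's resource profile:
    more CPU cores/processes/threads, more accelerator invocations, a
    larger memory footprint, and heavier memory traffic all raise the
    metric. The field names and weights are hypothetical."""
    return (1.0 * profile["cpu_cores"]
            + 2.0 * profile["accelerator_invocations_per_s"] / 1000.0
            + 0.5 * profile["memory_footprint_gb"] / 16.0
            + 0.5 * profile["memory_accesses_per_s"] / 1e6)
```

A higher performance application thus yields a higher metric for its jobs than a lower performance application, consistent with the “5” versus “3” example above.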
The above discussions have so far assumed that the metrics that are assigned to jobs are static values that do not change over the runtime of the job. In an embodiment, the static metric corresponds to a worst case cooling need for the application (the job is assumed to consume a maximum power for the job and dissipate a maximum amount of heat for the job).
In a more complex approach, the metric that is assigned to a job can dynamically change as the application's processing needs, power consumption and/or heat dissipation change.
Here, according to a first approach, when a job is first dispatched by the dispatcher 613, an average or typical metric for the job is assigned to the job and the job is dispatched to a rack based on the average/typical metric. Additionally, referring to
Referring to
Such learning can be used to adjust the metric that is assigned to a job and/or add a “delta” to the metric that informs the dispatcher 913 of how much (and in what direction) the job's processing/power needs are expected to change over the course of the job's runtime.
For example, if a job has an initial default metric of 3 and a delta of +2, the dispatcher 913 can assume the job will eventually exhibit a processing/power need that corresponds to a 5 and assign the job to a rack accordingly. For example, based on the total of all metrics of all jobs assigned to a particular rack, if the increase from 3 to 5 could cause the rack to trigger into a next higher cooling system performance/cost state, the dispatcher 913 could decide to assign the job to a next rack whose total of all metrics is well beneath the rack's cooling state trigger point. Although this example is directed to a job (a particular session executing on a particular application), it can be extended to applications and/or containers (an application or container is assigned to a next rack based on the application's/container's metric and delta).
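A minimal sketch of this delta-aware placement decision follows (the trigger-point value and the rack/job record layouts are assumptions for illustration):

```python
def assign_with_delta(job, racks, trigger_point):
    """Place a job using its initial metric plus its learned delta: if
    metric + delta would push a rack past its cooling-state trigger
    point, prefer a rack with more headroom."""
    projected = job["metric"] + job.get("delta", 0)  # e.g., 3 + 2 = 5
    for rack in racks:
        if rack["total_metric"] + projected <= trigger_point:
            rack["total_metric"] += projected
            return rack["id"]
    # no rack has headroom: fall back to the least-loaded rack
    rack = min(racks, key=lambda r: r["total_metric"])
    rack["total_metric"] += projected
    return rack["id"]
```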
Alternatively or in combination, the dispatcher 913 can dynamically reassign a job to another rack as the job's processing needs dynamically change. For example, consider a situation where a first rack is already operating in a highest cooling performance/cost state and has budget to take on another high performance job, while, a particular job is executing on another, second rack that is operating with a lower cooling performance/cost state. If the processing activity of the second rack is near the trigger point at which the second rack switches over to a next higher cooling performance/state, and if the processing needs of the job increases and/or is expected to increase in the near future, the dispatcher 913 can move the job from the second rack to the first rack to avoid the switchover of the second rack's cooling state.
In another scenario, a first rack is operating in a higher performance/cost cooling state but is near the point at which it could drop down to a lower performance/cost cooling state. Meanwhile, a second rack is operating in a lower performance/cost cooling state and has plenty of budget to take on more jobs without causing a switchover to a higher performance/cost cooling state. If the processing activity of a job executing on the first rack increases and/or is expected to increase in the near future, the dispatcher 913 can move the job from the first rack to the second rack. The removal of the job from the first rack causes the first rack to drop to a lower cooling performance/cost state rather than remain in the higher cooling performance/cost state because of the job's expected increased activity.
In another scenario, a first rack is operating in a highest performance/cost cooling state because most of its jobs are high performance jobs and second and third racks are operating in the lowest performance/cost cooling state because most of their jobs are low performance jobs. In this case, the dispatcher 913 can move high performance jobs from the first rack to the second and third racks, and, move jobs from the second and third racks to the first rack. So doing could cause the first rack to drop to a lower cooling performance/cost state while keeping the second and third racks to remain in the lowest cooling performance/cost state (e.g., if their current activity is well below their trigger point to a next higher cooling performance/cost state).
Thus, the dispatcher 913 can dynamically move jobs based on observed and/or learned/predicted changes in job processing activity to, e.g., keep the overall cooling costs of the data center at a minimum. Alternatively or in combination, a deployment controller (described in more detail below) can move a job based on dynamic workload changes as described just above. The deployment controller can move the job directly or indirectly (e.g., by moving a container that the job executes within).
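The migration decisions in the scenarios above share a common shape, sketched below (a simplified illustration; an actual policy would also weigh the cost of moving the job or its container):

```python
def consider_move(job, src, dst, trigger_point):
    """Return True when a job should migrate from its source rack to a
    destination rack: the job's expected activity growth (its delta)
    would push the source past its cooling-state trigger point, while
    the destination can absorb the job's full projected metric without
    crossing its own trigger point."""
    projected = job["metric"] + job.get("delta", 0)
    src_would_trigger = src["total_metric"] + job.get("delta", 0) > trigger_point
    dst_has_headroom = dst["total_metric"] + projected <= trigger_point
    return src_would_trigger and dst_has_headroom
```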
As observed in
Such predicted changes can be static (e.g., a predicted job activity profile is maintained for the job before job runtime and is relied upon by the dispatcher over the course of the job's runtime). Alternatively or in combination, any predicted changes in job activity can be based on a job's current state and/or, e.g., recent activity. Thus, a real time monitoring system 916 that observes the processing activity of the jobs in real time can be used to feed forward a job's current/recent activity to the dispatcher 913, which, in turn, can influence the job's predicted future activity (e.g., as an input parameter to a predicted job activity function and/or model that was provided to the dispatcher 913 by the workload prediction function 915).
The information from the real time monitoring system 916 can also be used directly by the dispatcher 913. In this case, for example, a sudden detected increase in a job's activity (e.g., in the absence of any predicted increase) can be used by the dispatcher 913 to, e.g., move the job to another rack.
Thus, as described above, both predictions of changes in job activity and real-time observations of changes in job activity can be used by the dispatcher 913 to move jobs that are in process to different racks, e.g., as per the scenarios described just above.
Although the dynamic movement scenarios described just above were directed to jobs specifically, the same concept can be extended to applications and/or containers (an application or container is moved based on dynamic changes in the application's or container's processing needs). In this case, referring to
Note that the data center of
Here, usage/activity intensity can affect component reliability. For example, electrical components (e.g., processors, memory modules, solid state drives (SSDs), etc.) having higher average input command/instruction rates, applied clock frequencies, temperatures, supply voltages, etc. will wear-out (fail) before same kinds of electrical components having lower average input instruction/command rates/clock frequencies/temperatures/voltages. Likewise, cooling supply components that are being used more frequently (e.g., pumps, valves, lubricating oils, etc.), subjected to higher pressures (e.g., hoses, gaskets, etc.), and/or removing larger amounts of heat over time will wear-out before same kinds of cooling components that are subjected to less usage/pressures/heat. Additionally, certain kinds of cooling components can wear out over time irrespective of usage (e.g., cooling fluid, thermal interface paste, etc.).
The failure prediction function 1101 receives the actual usage information from the real time monitoring function 616/916/1016 and applies the information to wear-out models that the prediction function maintains for the data center's individual electrical components 1102 and cooling components 1103. Before any one of these components is predicted to fail, the failure prediction function 1101 sends a communication to a maintenance controller 1104 that alerts the cooling controller 611/911/1011, dispatcher 613/913 and deployment manager 1013 of the upcoming expected time of failure and/or a time window over which replacement is recommended.
The maintenance controller 1104 maintains a maintenance schedule for the data center's electrical and cooling system components that schedules replacements of the components based on the information from the failure prediction function 1101. The maintenance controller 1104 communicates the component replacement schedules to: 1) the cooling controller 611/911/1011 so the component's associated cooling system can be shut down for the replacement activity; 2) the dispatcher 613/913 so that the dispatcher will stop sending new jobs to electronic systems/components that are out of service during the replacement activity; 3) the deployment manager 1013 so that the software processes that are impacted by the replacement activity (e.g., jobs, virtual machines, containers) can be parked or moved to other electronic processing resources.
So doing allows for smooth shut down of isolated platform resources during component replacement without crashing jobs/applications/containers that were relying on the resources before the replacement. In various embodiments, the maintenance schedule for any/all components defines a window of time in which the component, and any associated resources that are electrically and/or mechanically coupled to the component, will be down (unavailable) because of the component's replacement.
For example, if a processor is to be replaced, the processor's cooling system will also be brought down during the processor replacement (the processor's cooling apparatus must be removed in order to remove the processor). If the cooling system is designed such that the cooling fluid that flows through the thermal mass of the processor being replaced also flows through the respective thermal masses of other electrical components, e.g., other processors within the processor's same sled, then, the maintenance schedule schedules the other electrical components to also be down during replacement of the processor. The deployment controller can then park and/or move jobs from the group of processors before the processor is replaced so that crashes are avoided.
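For illustration, the cooperation between the failure prediction function 1101 and the maintenance controller 1104 can be sketched as follows (the wear-out model interface, the lead time, and the replacement-window width are hypothetical):

```python
def schedule_replacements(components, wear_models, now, notify):
    """Apply each component's usage history to a wear-out model to
    predict its failure time; for components nearing failure, record a
    replacement window and alert the cooling controller, dispatcher and
    deployment manager so the affected resources can be shut down
    gracefully before the replacement activity."""
    schedule = []
    for comp in components:
        predicted_fail = wear_models[comp["type"]](comp["usage_history"])
        if predicted_fail - now < comp.get("lead_time", 72.0):  # hours
            window = (predicted_fail - 24.0, predicted_fail)  # replace before failure
            schedule.append((comp["id"], window))
            for target in ("cooling_controller", "dispatcher", "deployment_manager"):
                notify(target, comp["id"], window)
    return schedule
```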
Although embodiments above have focused on three different cooling mechanisms (air, liquid and chilled liquid), the above teachings can be applied to more than three different cooling mechanisms and/or other cooling mechanisms than those described above (e.g., immersion cooling).
Networked based computer services, such as those provided by cloud services and/or large enterprise data centers, commonly execute application software programs for remote clients. Here, the application software programs typically execute a specific (e.g., “business”) end-function (e.g., customer servicing, purchasing, supply-chain management, email, etc.).
Remote clients invoke/use these applications through temporary network sessions/connections that are established by the data center between the clients and the applications. A recent trend is to strip down the functionality of at least some of the applications into finer grained, atomic functions (“micro-services”) that are called by client programs as needed. Micro-services typically strive to charge clients/customers based on their actual usage (function call invocations) of the micro-service application.
In order to support the network sessions and/or the applications' functionality, however, certain underlying computationally intensive and/or trafficking intensive functions (“infrastructure” functions) are performed.
Examples of infrastructure functions include encryption/decryption for secure network connections, compression/decompression for smaller footprint data storage and/or network communications, virtual networking between clients and applications and/or between applications, packet processing, ingress/egress queuing of the networking traffic between clients and applications and/or between applications, ingress/egress queueing of the command/response traffic between the applications and mass storage devices, error checking (including checksum calculations to ensure data integrity), distributed computing remote memory access functions, etc.
Traditionally, these infrastructure functions have been performed by the CPU units “beneath” their end-function applications. However, the intensity of the infrastructure functions has begun to affect the ability of the CPUs to perform their end-function applications in a timely manner relative to the expectations of the clients, and/or, perform their end-functions in a power efficient manner relative to the expectations of data center operators. Moreover, the CPUs, which are typically complex instruction set computer (CISC) processors, are better utilized executing the processes of a wide variety of different application software programs than the more mundane and/or more focused infrastructure processes.
As such, as observed in
As observed in
Notably, each pool 1201, 1202, 1203 has an IPU 1207_1, 1207_2, 1207_3 on its front end or network side. Here, each IPU 1207 performs pre-configured infrastructure functions on the inbound (request) packets it receives from the network 1204 before delivering the requests to its respective pool's end function (e.g., executing software in the case of the CPU pool 1201, memory in the case of memory pool 1202 and storage in the case of mass storage pool 1203). As the end functions send certain communications into the network 1204, the IPU 1207 performs pre-configured infrastructure functions on the outbound communications before transmitting them into the network 1204.
Depending on implementation, one or more CPU pools 1201, memory pools 1202, mass storage pools 1203 and network 1204 can exist within a single chassis, e.g., as a traditional rack mounted computing system (e.g., server computer). In a disaggregated computing system implementation, one or more CPU pools 1201, memory pools 1202, and mass storage pools 1203 are, e.g., separate rack mountable units (e.g., rack mountable CPU units, rack mountable memory units (M), rack mountable mass storage units (S)). Although not depicted in
In various embodiments, the software platform on which the applications 1205 are executed includes a virtual machine monitor (VMM), or hypervisor, that instantiates multiple virtual machines (VMs). Operating system (OS) instances respectively execute on the VMs and the applications execute on the OS instances. Alternatively or combined, container engines (e.g., Kubernetes container engines) respectively execute on the OS instances. The container engines provide virtualized OS instances and containers respectively execute on the virtualized OS instances. The containers provide isolated execution environments for suites of applications, which can include applications for micro-services.
Notably, each pool can be viewed as its own respective data center that the improvements of
Moreover, e.g., in the case of a large pool, there can be many IPUs that service a single pool. Here, the electronic systems 601, 901, 1001 correspond to the pool's IPUs that process packets that are directed to/from the pool as described above.
With respect to
With respect to
The teachings above can also be applied to traditional data centers, e.g., where the racks contain individual servers installed therein and the dispatcher dispatches jobs, e.g., to the applications that run within the servers.
Embodiments of the invention may include various processes as set forth above. The processes may be embodied in program code (e.g., machine-executable instructions). The program code, when processed, causes a general-purpose or special-purpose processor to perform the program code's processes. Alternatively, these processes may be performed by specific/custom hardware components that contain hard wired interconnected logic circuitry (e.g., application specific integrated circuit (ASIC) logic circuitry) or programmable logic circuitry (e.g., field programmable gate array (FPGA) logic circuitry, programmable logic device (PLD) logic circuitry) for performing the processes, or by any combination of program code and logic circuitry.
Elements of the present invention may also be provided as a machine-readable medium for storing the program code. The machine-readable medium can include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, magneto-optical disks, FLASH memory, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, or other types of media/machine-readable media suitable for storing electronic instructions.
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.