The disclosure relates generally to closed loop physical resource control for multiple processor installations.
Large server systems (often called server farms) and some other applications employ a large number of processors. Multiple physical resources and environmental constraints affect the operation of these server farms, including power supply and power management, thermal management and limitations, fan and cooling management, and potential acoustic limits. Usually these physical resources, such as power supplies and fans, are significantly over-designed. Typically, power supplies and fans are allocated to supply each processor running at some high fraction of a peak load. In addition, some redundancy is added so that, in the event that one power supply module or fan fails, enough power or cooling capacity exists to keep the system running. Thus, on the one hand, there is a desire to have maximum computing performance available; on the other hand, there are limits, due to heat generation and supply of power, to what can actually be made available. Temperature, power, and performance are always interconnected. Typically, a larger-than-usually-needed supply sits ready to provide the power needed by the CPUs, and thus runs most of the time at a low-utilization, inefficient operating point. A certain amount of power headroom also needs to be available to maintain regulation during instantaneous increases in demand. Additionally, power supplies need to be over-sized to respond to surge demands, often associated with system power-on, when many devices are powering up simultaneously.
Thus, it is desirable to provide a system and method for closed loop physical resource control in large, multiple-processor installations, and it is to this end that the disclosure is directed. The benefit of this control is relaxation of the design requirements on subsystems surrounding the processor. For example, if the processor communicates that it needs maximum instantaneous inrush current, the power supply can activate another output phase so that it can deliver the needed inrush current. After the current level averages out from the inrush peak, the power supply can deactivate the extra output phases in order to run at peak efficiency. In another example, when the processor predicts an approaching peak workload, it can communicate to the cooling subsystem its need for extra cooling to bring itself lower in its temperature range before the peak workload arrives. Likewise, if the system fans are running below their optimal speed to meet acoustic requirements, detection of the departure of datacenter personnel (e.g., through badge readers) can allow the system to run the fans beyond the acoustic limit to some degree. Additionally, upon detection of certain external power limit conditions, such as, but not limited to, a brownout or engaged battery backup, CPU throttling can be implemented immediately in order to maximize available operational time, either to perform at reduced capacity or to effect a hibernation state.
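As a rough, non-limiting sketch of the inrush-current exchange just described — the class, method names, and amps-per-phase figure are all hypothetical, not from this disclosure — a supply controller might bring extra output phases online on request and shed them once demand settles:

```python
import math

class PowerSupply:
    """Hypothetical multi-phase supply that enables extra output phases on demand."""

    AMPS_PER_PHASE = 25.0  # deliverable current per phase (assumed figure)

    def __init__(self, max_phases=4):
        self.max_phases = max_phases
        self.phases_active = 1  # one phase keeps light loads at an efficient point

    def request_inrush(self, amps):
        # Processor signals an approaching inrush; bring enough phases online.
        needed = math.ceil(amps / self.AMPS_PER_PHASE)
        self.phases_active = min(self.max_phases, max(self.phases_active, needed))

    def settle(self, average_amps):
        # Current has averaged out below the peak; shed phases for efficiency.
        needed = max(1, math.ceil(average_amps / self.AMPS_PER_PHASE))
        self.phases_active = min(self.phases_active, needed)

supply = PowerSupply()
supply.request_inrush(90.0)  # ~90 A predicted at power-on -> 4 phases active
supply.settle(30.0)          # demand settles near 30 A    -> back to 2 phases
```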
FIG. 1a illustrates an exemplary hierarchy of server processors, boards, shelves, and racks within a data center.
FIG. 1b illustrates an exemplary hierarchy of power supplies and regulators across a server rack.
FIG. 1c illustrates an exemplary hierarchy of cooling and fans across a server rack.
FIG. 1d illustrates an exemplary communication fabric topology across server nodes.
What is needed is a system and method to manage the supply of power and cooling to large sets of processors or processor cores in an efficient, closed-loop manner, such that, rather than the system supplying power and cooling that may or may not be used, a processor requests power and cooling based on the computing task at hand; the request is sent to a central resource manager, then to the power supply system, and the power is thus made available. Further needed is bidirectional communication among the CPU(s), the central resource managers, and the power supplies, in which a supply can state that it has a certain limit, so that rather than giving each processor its desired amount of power, the system may give a processor an allocation prorated across tasks. Additionally needed is a method of prioritization that may be used to reallocate power and cooling among processors, so that the allocation does not have to be a linear cut across the board, and that allows the resources (power supplies, fans) not only to limit, but also to switch units on and off so that individual units can stay within their most efficient operating ranges.
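The request/grant flow described above might look like the following minimal sketch, assuming a single shared capacity pool and per-processor priority weights (the function and its parameters are illustrative, not a defined interface): when total requests exceed the pool, grants are prorated by priority rather than cut linearly across the board.

```python
def allocate(pool, requests, priority):
    """Grant capacity from a shared pool, prorating by priority under contention.

    requests: {node: units requested}; priority: {node: positive weight}.
    """
    if sum(requests.values()) <= pool:
        return dict(requests)  # no contention: grant everything as asked
    # Contention: share the pool in proportion to priority-weighted demand.
    weighted = {n: requests[n] * priority[n] for n in requests}
    scale = pool / sum(weighted.values())
    # A fuller version would redistribute slack freed by the min() caps.
    return {n: min(requests[n], weighted[n] * scale) for n in requests}

grants = allocate(
    pool=450.0,
    requests={"lb": 100.0, "s1": 200.0, "s2": 200.0},
    priority={"lb": 2.0, "s1": 1.0, "s2": 1.0},
)
print(grants)  # {'lb': 100.0, 's1': 150.0, 's2': 150.0}
```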
The example resources discussed below in this disclosure are power, cooling, processors, and acoustics. However, there are many other resource types, such as individual voltage levels to minimize power usage within a circuit design, processor frequency, hard drive power states, system memory bus speeds, networking speeds, air inlet temperature, power factor correction circuits within the power supply, and active heat sinks; these resource types can also benefit from CRM functions by relaxing the performance expected of them by today's CPUs. In addition, the resource control technology described below for use in servers and data centers may also be used in other technologies and fields. For example, it may be used in solar farms for storage and recovery of surplus power, where the utility grid or a residential “load” is the targeted application; those other uses and industries are within the scope of this disclosure.
Some of the leading processor architectures have a thermal management mode that can force the processor to a lower power state; however, none at present imposes a similar power reduction dynamically based on the available power resources of the system, as they assume that sufficient power is always available. Likewise, none at present allows the fan speed to increase beyond an acoustic limit for a short duration to handle peak loads, or for longer durations when humans are not present.
Fan speed and its effect on acoustic limits is a good example of a resource that can be over-allocated. Typically, server subsystems are designed in parallel, each one having extra capacity that is later limited. For example, acoustic testing may place a fan speed limitation at 80% of the fan's maximum. Since acoustic limits are derived from human factor studies rather than imposed by a regulatory body, exceeding the acoustic limit by using the fan speed range between 80% and 100% may be acceptable in some cases. For example, in a datacenter environment, acoustic noise is additive across many systems, so it may be permissible for a specific system to go beyond its acoustic limit without grossly affecting the overall noise level. Often there are particular critical systems, such as a task load balancer, that may experience a heavier workload in order to break up and transfer tasks to the downstream servers in its network. This load balancer could be allowed to exceed its acoustic limit, knowing that the downstream servers can compensate by limiting their resources.
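Since the acoustic limit is a human-factors constraint rather than a hardware one, the relaxation described above reduces to a one-line policy; this tiny sketch assumes a badge-reader occupancy signal and invented duty-cycle caps:

```python
def fan_speed_cap(humans_present: bool) -> float:
    """Return the allowed fan duty cycle: the 80% acoustic cap applies only
    while datacenter personnel are present (e.g., per badge-reader state)."""
    ACOUSTIC_CAP = 0.80   # limit from acoustic testing in the example above
    HARDWARE_CAP = 1.00   # true fan maximum
    return ACOUSTIC_CAP if humans_present else HARDWARE_CAP
```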
Like acoustics, network bandwidth, cooling air intake, and many other resources may also be over-allocated to the load balancer. Continuing with the above example to depict a tradeoff between processors, let the load balancer processor run above its acoustic limit and at its true maximum processing performance. Two rack-level resources then need to be managed: rack-level power and room temperature. Typically, a server rack is designed with a fixed maximum power capacity, such as 8 kW (kilowatts). Often this limitation restricts the number of servers that can be installed in the rack; it is common to fill a 42U rack to only 50% of its capacity because each server is allowed to run at its maximum power level. When the load balancer processor is allowed to run at maximum, the total rack power limit may be violated unless there is a mechanism to restrict the power usage of the other servers in the rack. A Central Resource Manager (CRM) can provide this function by requiring each processor to request an allocation before using it. Likewise, while the load balancer exhausts extra heat, other processors in the rack can be commanded to generate less heat in order to control room temperature.
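To make the rack-level power arithmetic concrete (the 8 kW cap comes from the example above; the server count and wattages are invented for illustration): once the load balancer is granted its true maximum, the CRM can only hand the remaining budget to the other servers.

```python
RACK_CAP_W = 8000         # fixed rack capacity from the example (8 kW)
LB_MAX_W = 800            # load balancer granted its true maximum (assumed)
N_OTHER_SERVERS = 18      # remaining servers in the rack (assumed)

remaining_budget_w = RACK_CAP_W - LB_MAX_W
per_server_cap_w = remaining_budget_w / N_OTHER_SERVERS
print(per_server_cap_w)   # 400.0 W apiece instead of an unconstrained maximum
```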
Each processor typically can run in a number of power states, including low power states where no processing occurs and states where a variable amount of execution can occur (for example, by varying the maximum frequency of the core and often the voltage supplied to the device), often known as DVFS (Dynamic Voltage and Frequency Scaling). This latter mechanism is commonly controlled by monitoring the local loading of the node and, if the load is low, decreasing the frequency/voltage of the CPU. The reverse is also often the case: if loading is high, the frequency/voltage can be increased. Additionally, some systems implement power capping, where CPU DVFS or power-off can be used to maintain a power cap for a node. Predictive mechanisms also exist where queued transactions are monitored, and if the queue is short or long the voltage and frequency can be altered appropriately. Finally, in some cases a computational load (particularly in cloud workloads, where threads are shared across multiple cores of multiple processors) is shared among several functionally identical processors. In this case it is possible to power down (or move into a lower power state) one or more of those servers if the loading is not heavy.
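By way of a non-limiting sketch, such a DVFS mechanism may be modeled as a governor that maps measured utilization (or queue depth, for the predictive variant) to one of a few frequency/voltage operating points; the table values and thresholds below are illustrative, not a real part's specification.

```python
# (frequency MHz, core voltage V) operating points, lowest first -- illustrative.
P_STATES = [(800, 0.80), (1200, 0.90), (1600, 1.00), (2000, 1.10)]

def choose_p_state(utilization, queue_depth):
    """Pick an operating point from load; deeper queues bias upward (predictive)."""
    level = 0
    if utilization > 0.30 or queue_depth > 2:
        level = 1
    if utilization > 0.60 or queue_depth > 8:
        level = 2
    if utilization > 0.85 or queue_depth > 32:
        level = 3
    return P_STATES[level]

freq, volts = choose_p_state(utilization=0.7, queue_depth=5)  # -> (1600, 1.00)
```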
Currently there is no connection between the power supplied to the processors and the power states of each processor. Power supplies are provisioned so that each processor can run at maximum performance (or close to it), and the redundancy supplied is sufficient to maintain this level even if one power supply has failed (in effect, double the maximum expected supply is provided). In part, this is done because there is no way of limiting or influencing the power state of each processor based on the available supply.
Often this is also the case for fan and cooling designs, where fans may be over-provisioned, often with both extra fans and extra cooling capacity per fan. Because temperatures change relatively slowly, they can be monitored and the cooling capacity adjusted (e.g., fans sped up or slowed down). Regardless of the capacity currently in use, enough capacity must still be installed to cool the entire system with every processor at peak performance (including an allowance for any capacity that might be powered down through failure or maintenance).
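Because temperatures move slowly, a simple proportional loop is enough to illustrate the monitoring-and-adjustment just described; the setpoint, gain, and duty limits are invented values:

```python
def fan_duty(temp_c, setpoint_c=60.0, gain=0.05, floor=0.2, ceiling=1.0):
    """Proportional fan control: duty rises with temperature above the setpoint."""
    duty = floor + gain * (temp_c - setpoint_c)
    return max(floor, min(ceiling, duty))

# Slow loop: temperatures change over seconds, so periodic polling is adequate.
for t in (55.0, 65.0, 75.0):
    print(t, fan_duty(t))   # 0.2, 0.45, 0.95
```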
In effect, the capacity allocated in both cases must be higher than absolutely necessary, because there is no way to modulate demand when capacity limits are approached. This limitation also makes it difficult to install short-term peak-clipping capacity that can be used to relieve sudden high load requirements, as there is no way of reducing the load of the system when it is approaching the limits of that peak store. As an example, batteries or any other means of storing an energy reserve could be included in the power supply system to provide extra power during peaks; however, as they approach exhaustion the load would need to be scaled down. In some cases, cooling temperatures could simply be allowed to rise for a short period.
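As a sketch of such a reserve under stated assumptions (a fixed store of watt-seconds; hypothetical class and method names), the reserve covers demand above the supply limit and, as it nears exhaustion, returns the reduced level the system must scale down to:

```python
class EnergyReserve:
    """Battery/flywheel-style reserve that clips short demand peaks."""

    def __init__(self, capacity_ws=50_000.0):
        self.remaining_ws = capacity_ws

    def cover(self, demand_w, supply_limit_w, dt_s):
        """Return the load level that can actually be served this interval."""
        excess = max(0.0, demand_w - supply_limit_w)
        draw_ws = excess * dt_s
        if draw_ws <= self.remaining_ws:
            self.remaining_ws -= draw_ws
            return demand_w              # peak fully clipped by the reserve
        # Reserve nearly exhausted: serve what supply plus the remainder allows.
        served_w = supply_limit_w + self.remaining_ws / dt_s
        self.remaining_ws = 0.0
        return served_w                  # the load must now be shed to this level
```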
Given closed loop physical resource management, it is possible to avoid over-designing the power and cooling server subsystems, which has a number of key benefits.
FIG. 1a illustrates that physical resources within a data center have an inherent hierarchy. These hierarchies may take different forms, but a common server structural data center hierarchy is shown in FIG. 1a.
FIG. 1b shows that power supply and regulation can also be thought of hierarchically through the data center. Individual processors require one or more voltage/power feeds 101a-p. These feeds may be either static or dynamically adjustable. These feeds may also have power gating associated with them to allow software to enable or disable the feeds to the processors. A server board may have power supplies/regulators that feed all or some of the processors on the board, and individual processors may have their own associated regulators. Shelves may have power supplies or regulators at the shelf level. Racks may have one or more power supplies or regulators feeding the servers across the rack.
FIG. 1c illustrates that fans and cooling can also be thought of hierarchically through the data center. Processors may have fans associated with the individual processors, often integrated into the heat sink of the processor. Boards may have one or more fans for air movement across the board. Shelves may have one or more fans for air movement across the shelf. Racks may have one or more fans throughout the rack.
FIG. 1d illustrates that individual server nodes may also have a structural topology. Common topologies include meshes or tree-like organizations.
The exemplary data structure, which shows a single record 201t summing usages and utilizations across processors into a single sum, is a simple approach intended to aid understanding of the overall strategy. More refined implementations will contain data structures that encode the server hardware topologies illustrated in FIGS. 1a-1d.
In more sophisticated systems, usage, request, and utilization sums would be computed at each node of the aggregation hierarchies. As an example, power usage, request, and utilization sums would be computed in a tree fashion at each node of the tree illustrated in FIG. 1d.
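A minimal sketch of this hierarchical aggregation, assuming a node tree such as the one in FIG. 1d (the Node class and its fields are hypothetical): each interior node's totals are simply the sums over its children, and utilization aggregates the same way.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Rack/shelf/board/processor node; leaves carry measured values."""
    usage: float = 0.0      # units currently consumed (leaf measurement)
    request: float = 0.0    # units currently requested (leaf measurement)
    children: list = field(default_factory=list)

def aggregate(node):
    """Sum usage and requests bottom-up; returns (usage, request) per subtree."""
    u, r = node.usage, node.request
    for child in node.children:
        cu, cr = aggregate(child)
        u += cu
        r += cr
    node.usage, node.request = u, r   # cache totals at each interior node
    return u, r

board = Node(children=[Node(usage=4.0, request=6.0), Node(usage=3.5, request=5.0)])
rack = Node(children=[board])
print(aggregate(rack))   # (7.5, 11.0)
```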
In the current system as described in the discussions of
In some cases several of the nodes in a system may require greater performance (based on loading). The individual power managers request capacity, and it is granted by the central resource manager (CRM) (for example, 50 nodes each request 5 units of extra capacity, allowing full execution). If other nodes request the same capacity, the CRM can similarly grant the requests (assuming that the peak loads do not align, or it may over-allocate its capacity). The CRM implements the process shown in
In the event of a power supply failure, the CRM detects the failure. The system may have an energy reserve. The energy reserve may be a backup battery or any other suitable store, including but not limited to mechanical storage (flywheels, pressure tanks, etc.) or electronic storage (all types of capacitors, inductors, etc.), that is capable of supplying power for a deterministic duration at peak load, so the CRM has adequate time to reduce the capacity to the new limit of 450 units (it actually has double that time if the battery can be fully drained, because part of the load may be supplied by the single functioning power supply). The CRM signals each power controller in each processor that it must reduce its usage quickly. This operation takes a certain amount of time, as typically the scheduler needs to react to the lower frequency of the system; however, it should be achievable within the 100 ms. After this point each processor runs at a lower capacity, which implies slower throughput of the system (each processor has 4.5 units of capacity, which is enough for minimum throughput).
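To make the failure scenario concrete, the following sketch applies the example's numbers (a surviving limit of 450 units and, as implied by the 4.5-unit result, 100 processors); the proportional scaling policy is one illustrative choice, not the only one.

```python
def shed_to_limit(grants, new_limit):
    """Scale every node's grant down proportionally to fit the surviving supply."""
    total = sum(grants.values())
    if total <= new_limit:
        return dict(grants)
    scale = new_limit / total
    return {node: g * scale for node, g in grants.items()}

# 100 nodes at 9 units each = 900 units; one supply fails, leaving 450 units.
grants = {f"node{i}": 9.0 for i in range(100)}
capped = shed_to_limit(grants, new_limit=450.0)
print(capped["node0"])   # 4.5 units -- enough for minimum throughput
```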
Further adjustment of the system can be done by the CRM reclaiming capacity more slowly from some processors (for example, moving them to power-down states) and using this spare capacity to increase performance in nodes that are suffering large backlogs. In addition, in an aggressive case, some of the energy reserve can be allocated for short periods to allow peak clipping (the processor requests increased capacity and is granted it, but only for a few seconds).
A similar mechanism can be used to allocate cooling capacity (although the longer time constants make the mechanism easier).
A less aggressive system can allocate more total power and have more capacity after a failure, while a more aggressive system can allocate less total power and not allow all processors to run at full power even while redundancy is still active. More complex redundancy arrangements (e.g., N+1) can also be considered. The key is that capacity is allocated to different processors from a central pool and the individual processors must coordinate their use.
For a system where the individual processors are smaller and have better low power modes (i.e., bigger differences between high and low power states), this approach is even more applicable.
Communication to the CRM can be done by any mechanism. The requirement is that it must be quick, so that the failure-case time constant can be met, at least for most of the nodes. It is likely that Ethernet packets or messages to board controllers are sufficient.
Additionally, when the CRM is making allocations of resources to processors, the encoded processor communication topologies illustrated in FIG. 1d can be taken into account.
It is clear that many modifications and variations of this embodiment may be made by one skilled in the art without departing from the spirit of the novel art of this disclosure. These modifications and variations do not depart from the broader spirit and scope of the disclosure, and the examples cited here are to be regarded in an illustrative rather than a restrictive sense.
This patent application claims the benefit under 35 USC 119(e) and priority under 35 USC 120 to U.S. Provisional Patent Application Ser. No. 61/245,592 filed on Sep. 24, 2009 and entitled “System and Method for Closed Loop Power Supply Control in Large, Multiple-Processor Installations”, the entirety of which is incorporated herein by reference.