A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
This invention relates to transmission of digital information over data networks. More particularly, this invention relates to power management in switched data networks.
Various methods are known in the art for reducing the power consumption of a communication link or network by reducing unneeded data capacity. For example, U.S. Pat. No. 6,791,942, whose disclosure is incorporated herein by reference, describes a method for reducing the power consumption of a communications interface between a network and a processor. The method monitors data traffic on both sides of the interface. Upon detecting a predetermined period of no data traffic on both sides, the method disables an auto-negotiation mode of the interface and forces the interface to operate at its lowest speed.
As another example, U.S. Pat. No. 7,584,375, whose disclosure is incorporated herein by reference, describes a distributed power management system for a bus architecture or similar communications network. The system supports multiple low power states and defines entry and exit procedures for maximizing energy savings and communication speed.
Chiaraviglio et al. analyze another sort of approach in “Reducing Power Consumption in Backbone Networks,” Proceedings of the 2009 IEEE International Conference on Communications (ICC 2009, Dresden, Germany, June, 2009), which is incorporated herein by reference. The authors propose an approach in which certain network nodes and links are switched off while still guaranteeing full connectivity and maximum link utilization, based on heuristic algorithms. They report simulation results showing that it is possible to reduce the number of links and nodes currently used by up to 30% and 50%, respectively, during off-peak hours while offering the same service quality.
Commonly assigned U.S. Pat. No. 8,570,865, which is herein incorporated by reference, describes power management in a fat-tree network. Responsively to an estimated characteristic, a subset of spine switches in the highest level of the network is selected, according to a predetermined selection order, to be active in carrying the communication traffic. In each of the levels of the spine switches below the highest level, the spine switches to be active are selected based on the selected spine switches in a next-higher level. The network is operated so as to convey the traffic between leaf switches via active spine switches, while the spine switches that are not selected remain inactive.
Current fabric switches have a predetermined number of internal links. Conventionally, once the fabric power-budget is set, the number of active links is never changed. Thus, the throughput of the system is bounded by the max-flow min-cut theorem, which follows from the well-known Ford-Fulkerson method for computing the maximum flow in a network. In practice, the traffic flow corresponding to the max-cut in the fabric is almost never achieved, since network traffic is not evenly distributed. For example, a link that is currently inactive is sometimes needed in order to gain a better temporal max-cut.
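By way of illustration, the max-flow min-cut relationship can be computed with the Ford-Fulkerson method (the Edmonds-Karp variant is sketched below); the topology and capacities are hypothetical and are not taken from the disclosure:

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along BFS shortest paths; the result equals the min cut."""
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    total = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = [-1] * n
        parent[source] = source
        q = deque([source])
        while q and parent[sink] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and capacity[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[sink] == -1:
            return total  # no augmenting path remains: flow equals the min cut
        # Find the bottleneck along the path, then augment.
        bottleneck = float("inf")
        v = sink
        while v != source:
            u = parent[v]
            bottleneck = min(bottleneck, capacity[u][v] - flow[u][v])
            v = u
        v = sink
        while v != source:
            u = parent[v]
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
            v = u
        total += bottleneck

# Hypothetical 4-node fabric: node 0 (ingress leaf) to node 3 (egress leaf).
cap = [[0, 10, 10, 0],
       [0, 0, 2, 8],
       [0, 0, 0, 9],
       [0, 0, 0, 0]]
```

Here the minimum cut is the pair of edges entering node 3 (capacity 8 + 9 = 17), so no configuration of these links can carry more than 17 units between the two end nodes.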
According to disclosed embodiments of the invention, a fine-grained method of power control within a maximal power usage is achieved by dynamically managing the bandwidth carried by internal links in the fabric. A bandwidth manager executes a dynamic feature called “width-reduction”. This feature enables a link to operate at different bandwidths. By limiting the bandwidth of a link, the bandwidth manager effectively throttles the power consumed by that link. From time to time the bandwidth manager decides which links should be active, and at which bandwidths. By employing width reduction it is possible to obtain a higher throughput for a given power level than by maintaining a static bandwidth assignment.
There is provided according to embodiments of the invention a method for communication, which is carried out iteratively at allocation intervals in a fabric of interconnected network switches having ingress ports and egress ports, a plurality of lanes for carrying data between the egress port of one of the switches and the ingress port of another of the switches, and queues for data awaiting transmission via the egress ports. The method includes determining current queue byte sizes of the queues of the switches, assigning respective bandwidths to the switches according to the current queue byte sizes thereof, and, responsively to the assigned respective bandwidths, disabling a portion of the lanes of the switches to maintain a power consumption of the fabric below a predefined limit.
According to one aspect of the method, an aggregate of the assigned respective bandwidths complies with bandwidth requirements of leaf nodes of the fabric.
According to a further aspect of the method, the aggregate of the assigned respective bandwidths does not exceed throughput requirements of leaf nodes of the fabric.
According to yet another aspect of the method, in assigning the respective bandwidths, larger bandwidths are assigned to switches that have long queue byte sizes than to switches that have short queue byte sizes.
According to still another aspect of the method, in disabling the lanes, fewer lanes are disabled in switches that have long queue byte sizes than in switches that have short queue byte sizes.
According to an additional aspect of the method, uplinks through the fabric have a different bandwidth than downlinks through the fabric.
There is further provided according to embodiments of the invention an apparatus, including a fabric of interconnected network switches, a bandwidth manager connected to the switches, and ingress ports and egress ports in the switches. The ports provide a plurality of lanes for carrying data between the egress port of one of the switches and the ingress port of another of the switches. A memory in the switches stores queues for data awaiting transmission via the egress ports. The bandwidth manager is operative, iteratively at allocation intervals, to determine current queue byte sizes of the queues of the switches, assign respective bandwidths to the switches according to the current queue byte sizes thereof, and, responsively to the assigned respective bandwidths, disable a portion of the lanes of the switches to maintain a power consumption of the fabric below a predefined limit.
According to another aspect of the apparatus, the egress ports comprise a plurality of serializers that are commonly served by one of the queues.
For a better understanding of the present invention, reference is made to the detailed description of the invention, by way of example, which is to be read in conjunction with the following drawings, wherein like elements are given like reference numerals, and wherein:
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the various principles of the present invention. It will be apparent to one skilled in the art, however, that not all these details are necessarily always needed for practicing the present invention. In this instance, well-known circuits, control logic, and the details of computer program instructions for conventional algorithms and processes have not been shown in detail in order not to obscure the general concepts unnecessarily.
Documents incorporated by reference herein are to be considered an integral part of the application except that, to the extent that any terms are defined in these incorporated documents in a manner that conflicts with definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.
A “switch fabric” or “fabric” refers to a network topology in which network nodes interconnect via one or more network switches (such as crossbar switches), typically through many ports. The interconnections are configurable such that data is transmitted from one node to another node via designated ports. A common application for a switch fabric is a high performance backplane.
A “fabric facing link” is a network link in a fabric that is configured for transmission to or from one network element to another network element in the fabric.
Reference is now made to
A bandwidth manager 29 controls aspects of the operation of switches 26, 28, such as routing of messages through network 20, performing any necessary arbitration, and remapping of inputs to outputs. Routing issues typically relate to the volume of the traffic and the bandwidth required to carry the traffic, which may include either the aggregate bandwidth or the specific bandwidth required between various pairs of computing nodes (or both aggregate and specific bandwidth requirements). Additionally or alternatively, other characteristics may be based, for example, on the current traffic level, traffic categories, quality of service requirements, and/or on scheduling of computing jobs to be carried out by computing nodes that are connected to the network. Specifically, for the purposes of embodiments of the present invention, the bandwidth manager 29 is concerned with selection of the switches and the control of links between the switches for purposes of power management, as explained in further detail hereinbelow.
The bandwidth manager 29 may be implemented as a dedicated processor, with memory and suitable interfaces, for carrying out the functions that are described herein in a centralized fashion. This processor may reside in one (or more) of computing nodes 22, or it may reside in a dedicated management unit. In some embodiments, communication between the bandwidth manager 29 and the switches 26, 28 may be carried out through an out-of-band channel and does not significantly impact the bandwidth of the fabric nor that of individual links.
Alternatively or additionally, although bandwidth manager 29 is shown in.
Reference is now made to
In one configuration, spine node 34 is set to connect with leaf node 44, while in other configurations the connection between spine node 34 and leaf node 44 is broken or blocked, and a new connection formed (or unblocked) between spine node 34 and any of the other leaf nodes 40, 42, 46.
Reference is now made to
Each of the lanes 54 is connected to a respective SERDES 60, which can be operational or non-operational, independently of the other SERDES 60 in the port. Each SERDES 60 can be individually controlled directly or indirectly by command signals from the bandwidth manager 48 (
Cumulative activity of the switch 50 during a time interval may be recorded by a performance counter 62, whose contents are accessible to the bandwidth manager.
Continuing to refer to
The bandwidth manager 48 knows the state of all fabric-facing links, and knows the state of the queues 56 as well.
The bandwidth manager 48 assigns bandwidth for each fabric-facing link using a grading algorithm such that the fabric power-budget is not violated. Each switch responds to the bandwidth assignment by implementing its width-reduction features.
The links are configured such that a temporary max-cut of the fabric, which is computed according to current traffic, is maximized. For example, in
The actual flow through the links 49 is the lesser of the flow requirement and the max-cut:
flow[x−y] = min{max-cut[x−y], requirement[x−y]}.
The term “requirement” refers to a temporal requirement, i.e., the latency of the transit of the packet from x to y. The goodput through the fabric is the sum of all the flows through the links of the leaf nodes 40-46. The bandwidth manager 48 attempts to maximize goodput by reducing max-cut[x−y] as much as possible, provided that requirement[x−y] does not exceed max-cut[x−y].
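By way of a numerical illustration of this rule (the link entries and values below are hypothetical):

```python
def goodput(links):
    """Sum over links of min(max-cut, requirement) for each x-y pair."""
    return sum(min(l["max_cut"], l["requirement"]) for l in links)

links = [
    {"max_cut": 40, "requirement": 25},  # capacity exceeds demand: flow is 25
    {"max_cut": 20, "requirement": 30},  # demand throttled to the cut: flow is 20
]
# goodput(links) -> 45
```

In the first link the max-cut could be reduced from 40 toward 25 with no loss of goodput, which is the power-saving opportunity the bandwidth manager exploits.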
The risk of local switch buffer overflow is minimized (as measured, for example, by packet drop counts). In general, the bandwidth manager 48 attempts to estimate the bandwidth requirement (requirement[x−y]) for the fabric by sorting the queues of the switches according to the space used. A switch with high buffer usage (hence, low free space) is relatively likely to drop packets. Such a switch should be allocated a relatively high amount of output bandwidth.
A link in the fabric connecting one of the spine nodes 32, 34, 36, 38 with one of the leaf nodes 40, 42, 44, 46 that has a non-zero transmit queue (TQ) size can initially transmit at the full link bandwidth, regardless of the size of the queue in bytes. However, a link with a relatively long queue (large byte size) can sustain full-bandwidth transmission for a longer period than a link with a shorter queue (small byte size). Therefore, a link with a long queue deserves a relatively larger bandwidth allocation and has relatively few of its lanes disabled. This strategy minimizes unused operational bandwidth and reduces packet drop, thereby approximating a fabric operating at full bandwidth.
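This allocation strategy might be sketched as follows; the proportional split is an assumption, since the disclosure specifies only that longer queues receive larger allocations:

```python
def assign_bandwidth(queue_bytes, total_bw_gbps):
    """Split a fixed bandwidth pool across links in proportion to queue size."""
    total = sum(queue_bytes.values())
    if total == 0:
        # No backlog anywhere: spread the pool evenly.
        share = total_bw_gbps / len(queue_bytes)
        return {link: share for link in queue_bytes}
    return {link: total_bw_gbps * q / total for link, q in queue_bytes.items()}

# A link with triple the backlog receives triple the share of a 100 Gb/s pool.
alloc = assign_bandwidth({"s1": 3000, "s2": 1000}, 100.0)
# alloc["s1"] -> 75.0, alloc["s2"] -> 25.0
```

Any monotone grading function would preserve the required ordering; the proportional rule is merely the simplest choice for illustration.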
Each switch periodically reports its status and alerts to the bandwidth manager 48.
Reference is now made to
The process iterates in a loop. In step 64 the status of each link in the fabric is obtained by the bandwidth manager. In some embodiments the bandwidth manager may query the links over a dedicated channel. Alternatively, the links may be programmed to report their status to the bandwidth manager automatically. The status of the ingress and egress queues is also obtained; this information may comprise the lengths of the queues and the categories of traffic. Cumulative activity during a time interval may be obtained from performance counters in the switches. The pseudocode of Listing 1 illustrates one way of determining queue length in a fabric, incorporating a low-pass filter to eliminate random “noise” in the queue length measurement.
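By way of illustration, such a low-pass-filtered queue measurement could take the form of an exponential moving average (the smoothing form and the alpha parameter are assumptions and may differ from Listing 1):

```python
def filtered_queue_lengths(samples, alpha=0.25):
    """Exponential moving average damps random noise in queue-byte readings."""
    smoothed = None
    out = []
    for q in samples:
        # First sample seeds the filter; later samples blend in with weight alpha.
        smoothed = q if smoothed is None else alpha * q + (1 - alpha) * smoothed
        out.append(smoothed)
    return out

# A one-sample spike is damped rather than immediately driving a reallocation.
```

For instance, a momentary jump from 100 to 500 bytes registers as 200 with alpha = 0.25, so a single noisy reading does not trigger a bandwidth reassignment.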
The fabric power consumption is measured in step 70 by suitable power metering devices. Alternatively, once the bandwidth is known, the fabric power consumption can be calculated from the number of active links and the queue. Normally the process of
Next, at step 72 user-determined bandwidth requirements for the fabric during a current epoch are evaluated in relation to the computing jobs. In one approach to bandwidth assignment, the bandwidth manager may use network power conservation as a factor in deciding when to run each computing job. In general, the manager will have a list of jobs and their expected running times. Some of the jobs may have specific time periods (epochs) when they should run, while others are more flexible. As a rule of thumb, to reduce overall power consumption, the manager may prefer to run as many jobs as possible at the same time. On the other hand, the manager may consider the relation between the estimated traffic load and the maximal capabilities of a given set of spine switches, and if running a given job at a certain time will lead to an increase in the required number of active spine switches, the manager may choose to schedule the job at a different time. Further details of this approach are disclosed in commonly assigned U.S. Pat. No. 8,570,865, whose disclosure is herein incorporated by reference.
Next, at step 74, based on the assessment of step 72, respective bandwidths are assigned to switches in the fabric based on a sort order of the lengths of the egress queues of the switches, as described above.
Next, at step 76, based on the respective bandwidth assignments in step 74, logic circuitry in each link determines the number of lanes for its ports that are to be active, and enables or disables its lanes accordingly. For example, if a 40 Gb/s link in the example of
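Assuming, for illustration, a 4-lane link in which each lane carries one quarter of the full rate (the per-lane split is an assumption), the active-lane computation of step 76 might be sketched as:

```python
import math

def active_lanes(assigned_gbps, full_rate_gbps=40.0, num_lanes=4):
    """Smallest number of lanes whose combined rate covers the assignment."""
    per_lane = full_rate_gbps / num_lanes  # 10 Gb/s per lane in this sketch
    needed = math.ceil(assigned_gbps / per_lane)
    return max(0, min(num_lanes, needed))

# A 25 Gb/s assignment keeps three lanes up and disables the fourth.
```

Disabling a lane idles its serializer, so the power saved scales with the number of lanes taken down rather than with the exact bandwidth reduction.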
The objective of steps 74, 76 is to disable as many lanes as possible without exceeding a threshold of data loss or packet drop, but remaining within the power budget. This enables the fabric to operate at minimal power while maintaining a required quality of service. Steps 74, 76 can be performed using the procedure in Listing 2, which represents a simulation. The power budget of the fabric is considered as fixed. The fabric must not violate the budget, even when there is a high packet drop count or poor quality of service.
In Listing 2, the variable NumTQs corresponds to the number of links in a simulated system. TargetBWPercentOf100 is a simulation parameter that describes the amount of traffic entering the fabric. A value of 75% bandwidth was used in the simulation. It should be noted that when 100% bandwidth is used for the parameter TargetBWPercentOf100, no bandwidth reduction can be accomplished, because all internal-facing links in the fabric are utilized.
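A minimal sketch of such a grading simulation, reusing the NumTQs and TargetBWPercentOf100 parameter names (the power figures, queue model, and greedy grading rule below are assumptions), might be:

```python
NumTQs = 8                 # number of simulated fabric links
TargetBWPercentOf100 = 75  # offered traffic, as a percent of full bandwidth
LANES_PER_LINK = 4
LANE_POWER_W = 0.5         # assumed per-lane power draw
POWER_BUDGET_W = 12.0      # assumed fixed fabric power budget

def allocate(queue_bytes):
    """Grade links by backlog and grant lanes greedily within the power budget."""
    lanes = {link: 0 for link in queue_bytes}
    budget = POWER_BUDGET_W
    while budget >= LANE_POWER_W:
        # Links that still have backlog and spare lanes compete for the grant.
        candidates = [l for l in lanes
                      if lanes[l] < LANES_PER_LINK and queue_bytes[l] > 0]
        if not candidates:
            break
        # The neediest link is the one with the most backlog per granted lane.
        neediest = max(candidates, key=lambda l: queue_bytes[l] / (lanes[l] + 1))
        lanes[neediest] += 1
        budget -= LANE_POWER_W
    return lanes

# Backlogs scaled by the 75% offered load; longer queues earn more lanes.
queues = {link: (link + 1) * 1000 * TargetBWPercentOf100 // 100
          for link in range(NumTQs)}
lanes = allocate(queues)
```

Within the fixed 12 W budget this grants 24 of the 32 lanes, with the longest queues keeping all of their lanes; at 100% offered load no lane could be spared, consistent with the observation above about TargetBWPercentOf100.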
The following examples are simulations of a fabric operation in which bandwidth is allocated in accordance with embodiments of the invention.
Reference is now made to
The effect of bandwidth allocation frequency is most pronounced under higher traffic conditions. The packet drop is significantly higher when a 30 μs interval is used (line 78) than when the allocation interval is shortened to 10 μs (line 80).
It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof that are not in the prior art, which would occur to persons skilled in the art upon reading the foregoing description.