The present disclosure relates to communication links, and in particular, to multilane communication links.
In data center environments, rack units may house many server devices, such as blade servers. Each server device may be configured to host one or more physical or virtual host devices. The servers in the rack units are connected to switch devices such as Top of Rack (ToR) switch devices. The switches, in turn, are connected to other switches via a spine switch or spine fabric. Data in a communication session may be exchanged between host devices (physical and/or virtual) in the same or different rack units. For example, packets of data in the session may be sent from a host device in one rack unit to a host device in another rack unit using network or fabric links. Fabric networks provide cross-connections between multiple fabric links on the same linecard or across multiple linecards. Each fabric link may include multiple fabric lanes.
In a fabric, significant power is consumed by the serial input/output devices used to communicate over the fabric links. For example, consider a fabric with 28 fabric links, with each link consisting of 8 fabric lanes. If each lane includes a serializer/deserializer that consumes 240 mW of power, approximately 53.8 W of power (28 × 8 × 240 mW = 53.76 W) is consumed by a single fabric. Even during periods in which a fabric link is not fully utilized, each serializer/deserializer still consumes peak power.
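The arithmetic in the example above can be checked with a short sketch. The function name and the `active_lanes_per_link` parameter are hypothetical, and the sketch assumes, for illustration only, that a powered-down serializer/deserializer draws no power.

```python
# Worked arithmetic for the example fabric: 28 links, 8 lanes per link,
# 240 mW per serializer/deserializer (figures taken from the text above).
FABRIC_LINKS = 28
LANES_PER_LINK = 8
SERDES_POWER_MW = 240


def fabric_serdes_power_w(links, lanes_per_link, serdes_mw, active_lanes_per_link=None):
    """Return total serdes power in watts.

    active_lanes_per_link is a hypothetical parameter: it assumes that every
    link keeps the same number of lanes powered and that a powered-down
    serdes consumes nothing.
    """
    active = lanes_per_link if active_lanes_per_link is None else active_lanes_per_link
    if not 0 <= active <= lanes_per_link:
        raise ValueError("active lanes must be between 0 and lanes_per_link")
    return links * active * serdes_mw / 1000.0


full_power_w = fabric_serdes_power_w(FABRIC_LINKS, LANES_PER_LINK, SERDES_POWER_MW)
half_lanes_w = fabric_serdes_power_w(FABRIC_LINKS, LANES_PER_LINK, SERDES_POWER_MW, 4)
print(f"all lanes active: {full_power_w:.2f} W")
print(f"four lanes per link active: {half_lanes_w:.2f} W")
```

Under these assumptions, halving the number of active lanes per link halves the serdes power of the fabric.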
Overview
Generally, techniques are presented herein to manage traffic flow across a plurality of communication lanes between two devices, which allows a multilane communication link operating at lower capacities to conserve power by powering-down one or more of its communication lanes. Additionally, the multilane communication link may power-up one or more of the communication lanes when high performance is needed.
Traffic is sent between a first device and a second device over a plurality of active communication lanes of a communication link. A number of the active communication lanes of the communication link is altered. Thereafter, traffic is sent over the altered number of active communication lanes.
In order to alter the number of active communication lanes, traffic between the first device and the second device across a plurality of communication lanes of a communication link is stalled. A first device side of the communication lane is powered-down on at least one of the plurality of communication lanes to deactivate the at least one communication lane. A power-down request is sent from the first device to the second device to power-down a second device side of the at least one communication lane. Traffic is resumed between the first device and the second device over the altered number of active communication lanes.
Presented herein is a method to save power on a communication link, e.g., a fabric link, by turning some of the lanes off while traffic is still actively flowing on other lanes and, when needed, powering up more lanes. This is achieved by a link level protocol over the fabric link. The device at one of the two ends of a fabric link is designated as “Master” and the other as “Target.” Only the Master device initiates the power saving protocol, which avoids potential deadlocks.
The protocol involves an exchange of a series of control messages before lanes can be added or removed. Adding (powering up) and removing (powering down) lanes requires the scrambler/descrambler on the transmit/receive devices to work in synchronization to avoid errors. This synchronization is achieved through “Markers.” Markers play an important role in ensuring the reliability of the power-down/power-up process, as described hereinafter.
Referring to
Included in first linecard 110 and second linecard 120 are application specific integrated circuits (“ASICs”) 185a-d and 195a-d, respectively. Also included in linecard 110 is switched fabric 187, which includes multilane fabric links 188a-h and allows intercommunication between ASICs 185a-d through xbars 160a and 160b. Similarly, switched fabric 197 includes multilane fabric links 198a-h and allows intercommunication between ASICs 195a-d through xbars 180a and 180b. As with fabric 100, xbars 160a-b and 180a-b may be configured to dynamically scale the number of active lanes in multilane fabric links 188a-h and 198a-h, respectively.
Turning now
When data is sent from xbar 160a to xbar 170a, data received from input/output port 240 is sent to encoder 245. Specifically, control logic 220 may first split that data stream into separate streams for each of fabric lanes 210a-h that will be active during the data transfer. For example, if all of fabric lanes 210a-h will be active during the transfer, control logic 220 will split the input data into eight separate streams. If, on the other hand, only fabric lanes 210a and b will be active during the data transfer, control logic 220 will split the input data into two separate data streams.
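The splitting performed by control logic 220 can be sketched as follows. The text does not say how data is distributed across the active lanes, so the round-robin striping here is an assumption, and the function names are hypothetical.

```python
def split_across_lanes(words, active_lanes):
    """Distribute words over the active lanes in round-robin order (assumed policy).

    Returns a dict mapping each active lane name to its own stream of words.
    """
    if not active_lanes:
        raise ValueError("at least one lane must be active")
    streams = {lane: [] for lane in active_lanes}
    for index, word in enumerate(words):
        streams[active_lanes[index % len(active_lanes)]].append(word)
    return streams


def merge_from_lanes(streams, active_lanes):
    """Inverse of split_across_lanes: rebuild the original word order."""
    lane_count = len(active_lanes)
    total = sum(len(streams[lane]) for lane in active_lanes)
    return [streams[active_lanes[i % lane_count]][i // lane_count] for i in range(total)]


words = list(range(10))
two_lanes = split_across_lanes(words, ["210a", "210b"])
eight_lanes = split_across_lanes(words, [f"210{c}" for c in "abcdefgh"])
```

With two active lanes the input is divided into two streams, and with all eight lanes active it is divided into eight, matching the two cases described above.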
Encoder 245, upon receiving the data streams from control logic 220, will encode the data according to the transfer protocol used to send data across fabric lanes 210a-h. According to one example, encoder 245 encodes the data into 64-bit blocks, each preceded by an additional 2 bits which may be used to identify the type of data sent in the 64 bits of the block, yielding a 66-bit code word. Encoder 245 may comprise eight separate data stream encoders 247a-h, one for each of fabric link lanes 210a-h.
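A minimal sketch of such a code word follows. The header values (01 for data, 10 for control) follow the common 64b/66b convention and are an assumption; the text does not state which values encoder 245 uses, and the function names are hypothetical.

```python
# Assumed 2-bit header values, following the usual 64b/66b convention.
DATA_HEADER = 0b01
CONTROL_HEADER = 0b10
PAYLOAD_MASK = (1 << 64) - 1


def encode_66(payload, is_control=False):
    """Prefix a 64-bit payload with a 2-bit type header, giving a 66-bit code word."""
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload must fit in 64 bits")
    header = CONTROL_HEADER if is_control else DATA_HEADER
    return (header << 64) | payload


def decode_66(code_word):
    """Split a 66-bit code word into (payload, is_control)."""
    header = code_word >> 64
    if header not in (DATA_HEADER, CONTROL_HEADER):
        raise ValueError("invalid header")
    return code_word & PAYLOAD_MASK, header == CONTROL_HEADER


data_word = encode_66(0xDEADBEEF)
control_word = encode_66(0x0707070707070707, is_control=True)
```

Because the two valid headers each contain one 0 and one 1, every code word in this scheme is guaranteed at least one bit transition, which is one reason this header convention is widely used.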
Each 64/66 bit code word is then sent from encoder 245 to scrambler 249. Scrambler 249 may modify the code words generated by encoder 245 to ensure a balanced data stream. Specifically, scrambler 249 may modify the code words to ensure that the number of “0”s sent over each fabric lane 210a-h is approximately equal to the number of “1”s sent over each fabric lane 210a-h, thereby ensuring a direct current (“DC”) balanced data stream.
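The scrambling step can be illustrated with a self-synchronizing scrambler. The polynomial below, 1 + x^39 + x^58, is the one commonly paired with 64b/66b coding and is an assumption here; the text does not specify which scrambler 249 implements. Such a scrambler makes long runs of identical bits statistically unlikely rather than strictly impossible.

```python
STATE_BITS = 58
STATE_MASK = (1 << STATE_BITS) - 1


def scramble(bits, state=0):
    """Scramble a list of 0/1 bits with the assumed polynomial 1 + x^39 + x^58.

    state holds the last 58 scrambled bits; bit 0 is the most recent one.
    Returns (scrambled_bits, final_state).
    """
    out = []
    for bit in bits:
        scrambled = bit ^ ((state >> 38) & 1) ^ ((state >> 57) & 1)
        out.append(scrambled)
        state = ((state << 1) | scrambled) & STATE_MASK
    return out, state


def descramble(bits, state=0):
    """Reverse scramble(); the state is fed with the received (scrambled) bits."""
    out = []
    for scrambled in bits:
        out.append(scrambled ^ ((state >> 38) & 1) ^ ((state >> 57) & 1))
        state = ((state << 1) | scrambled) & STATE_MASK
    return out, state


payload = [(i * 7 + i // 3) % 2 for i in range(300)]
scrambled_bits, _ = scramble(payload, state=0x3FF)
recovered, _ = descramble(scrambled_bits, state=0x3FF)
# A descrambler that starts from the wrong state recovers after 58 bits.
late_start, _ = descramble(scrambled_bits, state=0)
```

The self-synchronizing property shown by `late_start` is specific to this assumed scrambler type. The protocol described herein instead relies on markers to keep the scrambler and descrambler aligned when the set of active lanes changes.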
The scrambled and balanced code words are sent from scrambler 249 to mark insertion logic 251. In mark insertion logic 251, markers are included in the data stream that allow the sending xbar, such as xbar 160a, to remain in alignment with the receiving xbar, such as receiving xbar 170a. Similar to encoder 245, mark insertion logic 251 may include a separate mark inserter 253a-h for each of fabric lanes 210a-h. After mark insertion, the encoded and scrambled data is sent to serializer/deserializer (“serdes”) macro 255. Serdes macro 255 contains a separate serdes 257a-h for each of fabric lanes 210a-h. The serdes 257a-h serialize the encoded and scrambled data for transfer to xbar 170a over the fabric lanes 210a-h that are currently active.
Upon receiving data from xbar 170a over fabric lanes 210a-h, xbar 160a effectively reverses the process described above for sending data. The data is received on whichever of fabric lanes 210a-h are presently active. The serdes macro 255 receives the serialized data, and the serdes 257a-h deserialize the data received from their respective fabric lanes. Once deserialized, the data is sent to deskew logic 259 where the markers inserted by the sending xbar 170a are used to ensure that xbar 160a is in correct alignment with xbar 170a.
With alignment ensured, the markers are removed from the encoded data, and the 64/66 bit code words are sent to descrambler 261. Descrambler 261 reverses the DC-balancing performed in the scrambler of the sending xbar, such as xbar 170a. The unbalanced code words are subsequently sent to decoder 263 where the data is decoded. As with encoder 245, decoder 263 may include eight separate data stream decoders 265a-h, one for each of fabric lanes 210a-h. The decoded data is then sent to control logic 220 for use by a device, such as first linecard 110 of
Because serdes 257a-h use a significant amount of power even when not actually sending or receiving traffic, by powering-down a portion of serdes 257a-h that are not necessary to maintain sufficient communication performance, significant power savings may be achieved. Accordingly, control logic 220 is also configured to power-up and power-down one or more fabric lanes 210a-h, as well as their respective serdes 257a-h. Xbar 160a may serve as the master device, with control logic 220 initiating the procedure used to dynamically scale the number of active lanes in the multilane fabric link 130a, with receiving xbar 170a serving as the target device, responding to the process initiated by sending xbar 160a. Example processes for powering-up and powering-down fabric lanes 210a-h are described below with reference to
Turning now to
Specifically,
With traffic stalled at the first device, a first device side of at least one of the plurality of communication lanes is powered-down to deactivate the at least one of the plurality of communication lanes in step 320. A lane is said to be deactivated or inactive when it is powered-down. In step 330, a power-down request is sent from the first device to the second device in order to power-down the second device side of the communication lane. Finally, in step 340, traffic is resumed between the first device and the second device over the altered number of active communication lanes.
Turning to
In step 420, traffic across a plurality of communication lanes between the first device and the second device is stalled. In step 430, a second device side of the at least one of the communication lanes is powered-down to deactivate the at least one of the plurality of communication lanes. Finally, in step 440, traffic between the first device and the second device is restarted on the active communication lanes of the communication link.
While
Prior to the sending of any messages in
Having stalled traffic between master 510 and target 520 on the master device side of the communication link, at 530 a power-down request 532 is sent from the master 510 to the target 520. The power-down request 532 may comprise a specific code word or series of bits that the target device 520 will recognize, not as link traffic, but as a power-down request 532. In addition to sending power-down request 532, master device 510 may start a timer T1 which will measure the duration until a response is received from the target 520. If timer T1 reaches a predetermined value, the master 510 may send another power-down request message or abort the power-down process.
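The role of timer T1 can be sketched as a bounded retry loop. The callables `send_request` and `wait_for_ack`, and the `max_attempts` limit, are hypothetical stand-ins for the link hardware and its configuration; the text only states that the master may resend the request or abort when T1 expires.

```python
def request_with_timeout(send_request, wait_for_ack, t1_seconds, max_attempts):
    """Send a power-down request, resending each time timer T1 expires.

    send_request() transmits the request; wait_for_ack(timeout) returns True
    if an acknowledgement arrived before the timeout. Both are hypothetical.
    Returns the attempt number that succeeded, or None if the process aborts.
    """
    for attempt in range(1, max_attempts + 1):
        send_request()
        if wait_for_ack(t1_seconds):
            return attempt
    return None  # abort: the caller returns to its previous state of operation


# Simulated link in which the first two requests go unanswered.
sent = []
answers = iter([False, False, True])
result = request_with_timeout(
    send_request=lambda: sent.append("power-down request 532"),
    wait_for_ack=lambda timeout: next(answers),
    t1_seconds=0.001,
    max_attempts=5,
)
```

The same pattern applies to the other timers of the protocol, each of which guards one step of the exchange with a retry-or-abort decision.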
At 534 the power-down request message 532 is received at target device 520. In response to receiving power-down request message 532, target device 520 stalls traffic from target device 520 to master device 510. The traffic from the target device 520 to the master device 510 may also be stalled at a packet boundary to maintain the continuity of the data. Having stalled the traffic, target device 520 sends power-down acknowledgement message 536 at 538.
Power-down acknowledgement 536 is received at the master device 510 at 540, and the power-down acknowledgment serves as an indication to master 510 that target 520 received the power-down request. Power-down acknowledgement 536 may also serve as an indication that communications from target device 520 (except those necessary for the power-down process) have been stalled.
Master 510 may further check to ensure that no messages are received from or sent to target 520 subsequent to receiving power-down acknowledgement message 536. Upon receipt of power-down acknowledgment 536, master 510 sends expected power-down mark (message) 542. The expected power-down mark 542 is an indication to target device 520 that the power-down process is proceeding, and to expect power-down marks 544a-c. Both expected power-down mark 542 and power-down marks 544a-c may be included in the communications by mark insertion logic 251 of
After sending expected power-down mark 542, master device 510 may wait a period of time T2 before continuing with the power-down procedure. Time T2 may serve to ensure any traffic that may have been delayed is received before any lanes of the communication link are rendered inactive. The master device 510 may also simply wait time period T2 to ensure that target device 520 has sufficient time to receive expected power-down mark 542.
At the conclusion of time T2, master device 510 sends power-down mark 544a at 546. Master device 510 may also send additional power-down marks, such as marks 544b and 544c. By sending multiple power-down marks, the master 510 increases the reliability of the process, as the target device only needs to receive a single power-down mark to continue the power-down process.
The power-down marks 544a-c may be embodied as unscrambled, direct current balanced, predefined code words. The master device 510 sends power-down marks 544a-c at a periodic, programmable interval. Power-down marks 544a-c may include a countdown value so that the target 520 knows how many more power-down marks will be received before power-down happens. For example, as shown in
Having sent all power-down marks 544a-c, master 510 reconfigures the master side of the multilane fabric link for operation with fewer communication lanes. For example, if a scrambler is used by master 510 to ensure sufficient transitions in the data transmitted over the plurality of communication lanes, the scrambler will be reconfigured to no longer include the lanes that will be powered-down when dividing the data. Also, after sending the power-down marks 544a-c, master device 510 may start a timer T3. If a predetermined time passes without receiving a response from target 520, master 510 may abort the power-down process and return to its previous state of operation, or resend the power-down marks.
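The countdown carried by the power-down marks can be sketched as follows. The encoding is an assumption: each mark carries the number of marks still to come, so the last mark carries 0. The dictionary layout and function names are hypothetical.

```python
def make_power_down_marks(count):
    """Build `count` marks whose countdown ends at 0 on the last mark (assumed encoding)."""
    if count < 1:
        raise ValueError("at least one mark is required")
    return [{"type": "power-down mark", "remaining": n} for n in range(count - 1, -1, -1)]


def reconfigure_slot(slot_received, mark):
    """Interval at which the last mark is due, computed from any single mark.

    slot_received is the index of the periodic interval in which the mark arrived.
    """
    return slot_received + mark["remaining"]


marks = make_power_down_marks(3)
# Whichever mark survives transmission, the receiver derives the same slot.
slots = [reconfigure_slot(slot, mark) for slot, mark in enumerate(marks)]
```

Because every mark yields the same reconfiguration point, the receiver needs only one of the marks to reconfigure at the same time as the sender.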
Due to the countdown of the power-down marks 544a-c, target device 520 will reconfigure the target side of the multilane fabric link to operate correctly once one or more of the plurality of communication lanes are powered-down at the same time that master 510 reconfigures to operate without the powered-down lanes. While
When reconfiguring target 520, if a descrambler is used to descramble the data received over the plurality of communication lanes, the descrambler will be reconfigured to no longer descramble data from the lanes to be powered-down. Target device 520 will send expected power-down message 548 which is an indication that power-down marks 550a-c will be subsequently sent to master 510. Power-down marks 550a-c are then sent. Target device 520 may start a timer T4. If a predetermined time passes without receiving a response from master 510, target 520 may abort the power-down process and return to its previous state of operation or resend the power-down marks.
Upon receipt of at least one of power-down marks 550a-c, master 510 sends power-down complete message 554. Once this message is sent, traffic is resumed from master device 510, and the serdes on the selected lanes are powered-down one lane at a time. Master 510 can send power-down complete message 554 at the appropriate time, even if only one of power-down marks 550a-c is received, similar to the process described above with regard to power-down marks 544a-c. Power-down marks 550a-c may have count values similar to power-down marks 544a-c. Upon receipt of power-down complete message 554, target device 520 also resumes traffic, and also powers down the serdes of the selected lanes one at a time.
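The order of the power-down exchange described above can be summarized in a small simulation. It models only the success path: timers, retries, and lost messages are omitted. The enum and function names are hypothetical, and the number of marks per side is a parameter rather than a value fixed by the text.

```python
from enum import Enum


class Msg(Enum):
    PD_REQUEST = "power-down request 532"
    PD_ACK = "power-down acknowledgement 536"
    EXPECT_MARK = "expected power-down mark"
    PD_MARK = "power-down mark"
    PD_COMPLETE = "power-down complete 554"


def power_down_exchange(active_lanes, lanes_to_drop, marks_per_side=3):
    """Return (message_log, remaining_lanes) for a successful power-down.

    The log lists (sender, message) pairs in the order described in the text.
    """
    remaining = [lane for lane in active_lanes if lane not in lanes_to_drop]
    if not remaining:
        raise ValueError("at least one lane must stay active to carry traffic")
    if marks_per_side < 1:
        raise ValueError("each side must send at least one mark")
    log = [
        ("master", Msg.PD_REQUEST),   # sent after master stalls its traffic
        ("target", Msg.PD_ACK),       # sent after target stalls its traffic
        ("master", Msg.EXPECT_MARK),
    ]
    log += [("master", Msg.PD_MARK)] * marks_per_side
    log.append(("target", Msg.EXPECT_MARK))
    log += [("target", Msg.PD_MARK)] * marks_per_side
    log.append(("master", Msg.PD_COMPLETE))  # traffic resumes on remaining lanes
    return log, remaining


log, remaining = power_down_exchange(list("abcdefgh"), {"g", "h"})
```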
Powering-down the lanes may involve depowering a serdes for each of the communication lanes that is being powered-down. Because serdes use power even when not actually sending traffic, by powering-down the serdes that are not necessary to maintain sufficient communication performance, significant power savings may be achieved. As depicted in
According to the example of
With reference now made to
For example, master device 610 may have previously depowered lanes according to the process of
At some point during the communications between master 610 and target 620, a determination is made that one or more of the inactive communication lanes between master 610 and target 620 should be powered-up. This determination may come from either the master 610 or the target 620, or from a third device not illustrated in
The powering-up of the master side of the communication lanes may comprise powering-up a previously unpowered serdes on the master side of the communication lane. According to other examples, which will be described in more detail with reference to
Upon receiving powering-up request 632, target 620 will begin powering-up the target side of the previously inactive lanes. In one example, the communication lanes are powered-up before traffic is stalled in order to shorten the period of time during which traffic is not being sent between master 610 and target 620. Once all of the lanes to be powered-up have been powered-up and synchronized with the master side of the communication lanes, target device 620 sends block lock status message 634. Block lock status message 634 is an indication to master 610 that all of the previously inactive lanes have been powered-up, that the master side of the lanes is synchronized with the target side of the lanes, and that traffic between the master 610 and target 620 may now be stalled.
Upon receiving block lock status message 634, master 610 stalls incoming traffic, and sends a stop traffic request (Req) 636 to target 620. Master 610 may also start a timer T6. If timer T6 exceeds a predetermined length of time without master 610 having received a response from target 620, the powering-up process may be aborted or the stop traffic request may be resent.
Upon receiving stop traffic request 636, target 620 may stall incoming traffic. When master 610 and target 620 stall their incoming traffic, they may do so at a packet boundary to ensure the continuity of the traffic data. Once target 620 has stalled its incoming traffic, stop traffic acknowledgment (Ack) 638 is sent from target 620 to master 610.
Upon receipt of stop traffic acknowledgement 638, master 610 sends deskew request message 640 which is a message to target 620 indicating that target 620 should begin realigning for traffic transmission over all powered-up lanes, including the recently powered-up lanes. Master 610 may start timer T7 having a predetermined time duration, in order to wait to see if target 620 completes its realignment process. If the predetermined period of time T7 is reached without receiving an indication from target 620 that it has completed its realignment, master 610 may retry initiating the realignment process with target 620, or abort the powering-up process.
Master 610 begins realigning itself to enable sending traffic over the recently powered-up lanes. For example, a scrambler may be reconfigured to direct current balance the data sent across all of the powered-up link lanes, including the recently powered-up lanes. Accordingly, master 610 may enable the scrambler for use with all communication lanes, and depower the scrambler used when fewer than all the lanes are in use. Upon receiving deskew request 640, target 620 begins realigning for transmission over all of the powered-up lanes, including the recently powered-up lanes. The realignment may comprise realigning a single scrambler to operate over all of the powered-up lanes, or enabling a scrambler which operates when all lanes are powered-up. When the realignment process has begun at target 620, target 620 sends deskew acknowledgement 642.
During the realignment process, master 610 may send power-up marks 644a-c and target 620 may send power-up marks 646a-c. As with power-down marks 544a-c and 550a-c described above in connection with
Upon receipt of power-up marks 646a-c, master 610 sends power-up complete message 648 and once again sends traffic over the communication link, now over all of the powered-up lanes. Similarly, upon receipt of power-up complete message 648, target 620 begins sending and receiving traffic over all of the powered-up lanes of the communication link.
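The ordering of the power-up exchange can be expressed as a table and checked against a message trace. The message labels are hypothetical shorthand for the numbered messages above. Placing the master's marks before the target's marks is an assumption, since the text only states that both are sent during realignment.

```python
# Expected order of (sender, message) pairs for a successful power-up.
POWER_UP_SEQUENCE = [
    ("master", "POWER_UP_REQUEST"),    # 632
    ("target", "BLOCK_LOCK_STATUS"),   # 634
    ("master", "STOP_TRAFFIC_REQ"),    # 636
    ("target", "STOP_TRAFFIC_ACK"),    # 638
    ("master", "DESKEW_REQ"),          # 640
    ("target", "DESKEW_ACK"),          # 642
    ("master", "POWER_UP_MARK"),       # 644a-c
    ("target", "POWER_UP_MARK"),       # 646a-c
    ("master", "POWER_UP_COMPLETE"),   # 648
]


def first_violation(trace):
    """Return the index of the first out-of-order message, or None if none.

    Repeated marks from the same sender are accepted, since several marks
    may be sent for reliability. An incomplete but ordered trace passes.
    """
    step = 0
    for index, message in enumerate(trace):
        if step < len(POWER_UP_SEQUENCE) and message == POWER_UP_SEQUENCE[step]:
            step += 1
        elif step > 0 and message == POWER_UP_SEQUENCE[step - 1] and message[1] == "POWER_UP_MARK":
            continue  # an additional mark of the kind just seen
        else:
            return index
    return None


with_repeats = (
    POWER_UP_SEQUENCE[:7]
    + [("master", "POWER_UP_MARK")] * 2
    + POWER_UP_SEQUENCE[7:]
)
# Traffic must not be stopped before the target reports block lock.
too_early = [POWER_UP_SEQUENCE[0], POWER_UP_SEQUENCE[2]]
```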
As indicated above, the serdes on both the master side and the target side of a communication lane may be completely powered-up and powered-down, or may be alternated between a full-power mode and a low-power sleep mode. Turning to
At the start of the message exchange depicted by ladder diagram 700, both master 710 and target 720 operate in a full-power active state 730a and 730b, respectively. When master 710 powers-down the master side of the communication lane, for example, as described above in reference to
At the expiration of timer T8, the master-side of the communication lane enters a low-power active state 734a. In the low-power active state 734a, master 710 sends alert messages 736 to target 720. Alert messages 736 place the target-side of the communication lane into an alert state 734b so that the target-side is prepared to receive messages that will place it in a full-power state, if necessary. If no such messages are received, the target-side of the communication lane returns to quiet state 732b. Similarly, if master 710 does not initiate powering-up of the communication lane, the master-side of the communication lane returns to quiet state 732a. The master 710 and target 720 repeat this process until the communication lane is to be powered-up. An analogous process may also take place over the target 720 to master 710 link.
At 738, a powering-up of the communication lane is initiated. The powering-up of the communication lane may cut short timer T8, placing the master-side of the communication lane in state 734a earlier than otherwise would have been the case. According to other examples, the powering-up process will simply wait until timer T8 expires. Master 710 sends alert messages 736 as it normally would to place the target-side of the communication lane into alert state 734b. After sending messages 736, the master-side of the communication lane enters wake-up mode 740 during which it sends messages 742. Messages 742 may comprise specific code words or series of characters which indicate to target 720 that the target-side of the communication lane should enter an active state. Similar to the message sent to initiate the low-power sleep mode, the wake-up process may be initiated by the master 710 transmitting unscrambled LPI characters (0707070707070707) to the target 720. After sending messages 742, the master-side of the communication lane returns to full-power active mode 730a. Similarly, the target-side of the communication lane returns to full-power active mode 730b. Because the serdes for the master-side of the communication lane and the target-side of the communication lane were not fully powered-down, the transition to the full-power modes 730a and 730b takes place more quickly than it would if the serdes were fully depowered and de-synchronized.
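The cycle of states on the master side can be sketched as a small state machine. The state names map to reference numerals 730a, 732a, 734a, and 740; the event names are hypothetical labels for the conditions described above.

```python
from enum import Enum


class LaneState(Enum):
    ACTIVE = "full-power active (730a)"
    QUIET = "quiet (732a)"
    ALERT = "alert messages being sent (734a)"
    WAKE = "wake-up (740)"


# (current state, event) -> next state. Event names are illustrative only.
TRANSITIONS = {
    (LaneState.ACTIVE, "power_down"): LaneState.QUIET,
    (LaneState.QUIET, "t8_expired"): LaneState.ALERT,
    (LaneState.QUIET, "power_up"): LaneState.ALERT,    # power-up may cut T8 short
    (LaneState.ALERT, "no_power_up"): LaneState.QUIET,
    (LaneState.ALERT, "power_up"): LaneState.WAKE,
    (LaneState.WAKE, "wake_messages_sent"): LaneState.ACTIVE,
}


def next_state(state, event):
    """Return the state after `event`; events with no entry leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state=LaneState.ACTIVE):
    """Apply a sequence of events and return every state visited, in order."""
    visited = [state]
    for event in events:
        state = next_state(state, event)
        visited.append(state)
    return visited


history = run([
    "power_down", "t8_expired", "no_power_up",   # one idle quiet/alert cycle
    "power_up", "power_up", "wake_messages_sent",
])
```

In this sketch the lane passes through one idle quiet/alert cycle before a power-up request moves it through the alert and wake-up states back to full power.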
In summary, the foregoing presents techniques to save power on a fabric link by turning some of the lanes off while traffic is still actively flowing on other lanes and, when needed, powering up more lanes. This is achieved by a link level protocol over the fabric link. There are numerous advantages of these techniques. The fabric link runs at the expected bandwidth rather than over speed, thus saving power. Different power modes can be chosen depending on the needed power saving/response time. There is no traffic loss during powering up/down. The power-down feature can be used to keep the fabric link active even if some of the serdes links are bad or not working. The power-down feature can also be used to allow for programming in serdes on lanes that are down, while the fabric link is still active.
The above description is intended by way of example only.