The present invention is generally related to routing packets in a switch fabric, such as PLX Technology's “ExpressFabric”.
Peripheral Component Interconnect Express (commonly described as PCI Express or PCIe) provides a compelling foundation for a high performance, low latency converged fabric. It has near-universal connectivity with silicon building blocks, and offers a system cost and power envelope that other fabric choices cannot achieve. PCIe has been extended by PLX Technology, Inc. to serve as a scalable converged rack level “ExpressFabric.”
However, the PCIe standard provides no means to route over multiple paths, or to handle congestion while doing so. That is, conventional PCIe supports only a tree-structured fabric. There are no known solutions in the prior art that extend PCIe to multiple paths. Additionally, in a PCIe environment, there is also a need to support shared input/output (I/O) and host-to-host messaging.
In a manifestation of the invention, a method of providing unordered path routing in a multi-path PCIe switch fabric is provided. A set of route choices for unordered traffic from the local (current) switch towards the final destination is provided via a current hop destination indexed look up table (CH-DLUT). A set of route choices applicable at the next hop, for each of those possible current hop unordered route choices, is stored in a next hop destination indexed look up table (NH-DLUT). Port congestion on a local level is measured and communicated internally in the local switch via a congestion feedback interconnect. Congestion indication for the local switch comprises low priority congestion information and medium priority congestion information. A congestion feedback interconnect, in this manifestation a ring structure (other interconnect structures such as a bus could also be used), is used to communicate congestion feedback information within a chip, wherein only fabric ports send congestion information of the local level and an applicable next hop level to the congestion feedback ring. The congestion state is saved in local congestion vectors in every module in which routing is performed.
The Unordered Route Choice Mask Vectors, which represent the fault free route choices for unordered traffic that lead to the destination corresponding to the table index, are stored in the current hop destination look up table (CH-DLUT). From the combination of the fault free route choices of paths to a destination, the local congestion information for the destination, the priority level of the packet, and round-robin state information, an uncongested path is selected to route the unordered packet. The congestion information is used to mask out route choices for which congestion is indicated. If a single choice survives this masking process, that choice is selected. If multiple route choices remain after this masking process, or if congestion is indicated for all route choices, then the final route choice selection is made by a round robin process. In the former case, the round robin is among the surviving choices. In the latter case, the round robin is among the original set of choices.
In another manifestation of the invention, a method of providing unordered path routing in a multi-path PCIe switch fabric is provided. A set of route choices for unordered traffic from the local (current) switch towards the final destination is provided via a current hop destination indexed look up table (CH-DLUT). Port congestion on a local level is measured and communicated within the local switch via a congestion feedback interconnect, and the congestion state is saved in local congestion mask vectors in every port. At fabric ports, the local congestion state is communicated to the neighboring switches via data link layer packets (DLLPs) and then communicated within each neighboring switch via a congestion feedback interconnect. At each module in which routing is performed, the next hop congestion state is saved in a set of next hop congestion vectors, with one such vector for each current hop unordered route choice. Congestion indication for both the local and the next hop switch comprises low priority congestion information and medium priority congestion information. A congestion feedback interconnect, in this manifestation a ring structure (other interconnect structures such as a bus could also be used), is used to communicate congestion feedback information within a chip, wherein only fabric ports send congestion information of the local level and an applicable next hop level to the congestion feedback ring. For each current hop unordered route choice, there is a set of next hop choices that lead to the destination if the associated current hop route choice is taken. These next hop masked choice vectors are saved in a next hop destination look up table (NH-DLUT). The next hop masked choice vectors are used in conjunction with Port_for_Choice tables to construct next hop masked port vectors. These vectors are in turn used to select the next hop congestion information that is associated with the destination of the packet being routed. From the combination of the choices of paths to a destination, the local congestion information for the destination, the next hop congestion information for the destination, the priority level of the packet, and round-robin state information, an uncongested path is selected to route the unordered packet. The congestion information is used to mask out route choices for which congestion is indicated. If a single choice survives this masking process, that choice is selected. If multiple route choices remain after this masking process, or if congestion is indicated for all route choices, then the final route choice selection is made by a round robin process. In the former case, the round robin is among the surviving choices. In the latter case, the round robin is among the original set of choices. In one manifestation of the invention, this tie breaking is done by a simple round robin selection mechanism that is independent of the packet's destination. In another manifestation of the invention, separate round robin information is maintained and used in this process for each destination edge switch.
In another manifestation of the invention, a system is provided. A switch fabric is provided including at least three PLX ExpressFabric switches and a management system, wherein each switch comprises a plurality of ports, some of which are fabric ports connected to other switches, and each switch includes a congestion feedback interconnect that collects congestion information only from fabric ports, wherein the congestion information provides port congestion on a local level and port congestion on an applicable next hop level. Congestion is indicated for a port when the total depth of all of its egress queues exceeds a configurable threshold. An egress scheduler and router is provided that combines the destination independent local congestion mask vector with the destination specific next hop congestion port vector, created using the NH-DLUT and the Port_for_Choice tables, to produce a vector that indicates route choices for which congestion is indicated. It applies this vector to the destination specific masked choice vector from the CH-DLUT to exclude route choices for which congestion is indicated. If multiple route choices remain after this masking process, or if congestion is indicated for all route choices, then the final route choice selection is made by a round robin process, where the round robin may be either destination agnostic or per destination edge switch. In the final routing step, the surviving route choice is mapped to a fabric egress port via a choice to port look up table.
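The following C sketch summarizes the principal data structures implied by this summary: the CH-DLUT with an unordered route choice mask and ordered egress port choices per destination, the NH-DLUT with one byte per current hop route choice, and the per-priority local and next hop congestion vectors. The type and field names, and any widths beyond those stated above, are illustrative assumptions rather than register-level definitions.

```c
#include <stdint.h>

#define NUM_CHOICES       12   /* unordered route choices per switch stage */
#define DLUT_ENTRIES     512   /* 256 Destination BUS + 256 Destination Domain indices */

/* Hypothetical CH-DLUT entry: a 12-bit unordered route choice mask
 * (1 = masked out) plus four 4-bit ordered egress port choices. */
typedef struct {
    uint16_t unordered_choice_mask;
    uint8_t  ordered_choice[4];
} ch_dlut_entry_t;

/* Hypothetical NH-DLUT entry: one byte per current hop route choice,
 * holding a 2-bit Port_for_Choice table select and a 6-bit next hop
 * choice vector. */
typedef struct {
    uint8_t nh_choice[NUM_CHOICES];
} nh_dlut_entry_t;

/* Hypothetical congestion state kept in each routing module: local
 * congestion per choice (destination independent) and next hop
 * congestion per current hop choice, each split into low and medium
 * priority. High priority traffic ignores congestion. */
typedef struct {
    uint16_t local_low;
    uint16_t local_med;
    uint32_t next_hop_low[NUM_CHOICES];   /* one bit per next hop port */
    uint32_t next_hop_med[NUM_CHOICES];
} congestion_state_t;
```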
These and other features of the present invention will be described in more detail below in the detailed description of the invention and in conjunction with the following figures.
A switch fabric may be used to connect multiple hosts. A PCIe switch implements a fabric-wide Global ID, GID, that is used for routing between and among hosts and endpoints connected to edge ports of the fabric or embedded within it and means to convert between conventional PCIe address based routing used at the edge ports of the fabric and Global ID based routing used within it. GID based routing is the basis for additional functions not found in standard PCIe switches such as support for host to host communications using ID-routed messages, support for multi-host shared I/O, support for routing over multiple/redundant paths, and improved security and scalability of host to host communications compared to non-transparent bridging.
A commercial embodiment of the switch fabric described in U.S. patent application Ser. No. 13/660,791 (and the other patent applications and patents incorporated by reference) was developed by PLX Technology, Inc. and is known as ExpressFabric™. An exemplary switch architecture developed by PLX Technology, Inc. to support ExpressFabric™ is the Capella 2 switch architecture, aspects of which are also described in the patent applications and patents incorporated by reference. The edges of the ExpressFabric™ are called nodes, where a node may be a path to a server (a host port) or a path to an endpoint (a downstream port). ExpressFabric™ host-to-host messaging uses ID-routed PCIe Vendor Defined Messages together with routing mechanisms that allow non-blocking fat tree (and diverse other topology) fabrics to be created that contain multiple paths between host nodes.
One aspect of embodiments of the present invention is that unlike standard point-to-point PCIe, multi-path routing is supported in the switch fabric to handle ordered and unordered routing, as well as load balancing. Embodiments of the present invention include a route table that identifies multiple paths to each destination ID together with the means for choosing among the different paths that tend to balance the loads across them, preserve producer/consumer ordering, and/or steer the subset of traffic that is free of ordering constraints onto relatively uncongested paths.
Traffic sent between nodes using ExpressFabric can be generally categorized as either ordered traffic, where two subsequent packets must stay in relative order with respect to each other, or unordered traffic, where two subsequent packets can arrive in any order. In a complex system with multiple hosts and multiple endpoints, some paths may be congested. If the congested paths can be determined, unordered traffic can be routed to avoid the congestion and thereby increase overall fabric performance. Before congestion develops, unordered traffic can be load balanced across multiple paths to avoid congestion.
Embodiments of the present invention are now discussed in the context of a switch fabric implementation.
Each switch 105 may include host ports 110, fabric ports 115, an upstream port 118, and downstream port(s) 120. The individual host ports 110 each lead eventually to a host root complex such as a server 130. In the ExpressFabric switch, a host port gives a host access to host to host functions such as a Network function for DMA and a Tunneled Window Connection for programmed IO. In this example, a shared endpoint 125 is coupled to the downstream port and includes physical functions (PFs) and Virtual Functions (VFs). Individual servers 130 may be coupled to individual host ports. The fabric is scalable in that additional switches can be coupled together via the fabric ports. While two switches are illustrated, it will be understood that an arbitrary number may be coupled together as part of the switch fabric, symbolized by the cloud in
A Management Central Processor Unit (MCPU) 140 is responsible for fabric and I/O management and must include an associated memory having management software (not shown). In one optional embodiment, a semiconductor chip implementation uses a separate control plane 150 and provides an x1 port for this use. Multiple options exist for fabric, control plane, and MCPU redundancy and fail over, including incorporating the MCPU into the switch silicon. The Capella 2 switch supports arbitrary fabric topologies with redundant paths and can implement fabrics that scale from two switch chips and two nodes to hundreds of switches and thousands of nodes.
In one embodiment, inter-processor communications are supported by RDMA-NIC emulating DMA controllers at every host port and by a Tunneled Window Connection (TWC) mechanism that implements a connection oriented model for ID-routed PIO access among hosts. The RDMA-NIC can send ordered and unordered traffic across the fabric. The TWC can send only ordered traffic across the fabric.
A Global Space in the switch fabric is defined. The hosts communicate by exchanging ID routed Vendor Defined Messages in a Global Space after configuration by MCPU software.
In one embodiment, the fabric ports 115 are PCIe downstream switch ports enhanced with fabric routing, load balancing, and congestion avoidance mechanisms that allow full advantage to be taken of redundant paths through the fabric and thus allow high performance multi-stage fabrics to be created.
In one embodiment, a unique feature of fabric ports is that their control registers don't appear in PCIe Configuration Space. This renders them invisible to BIOS and OS boot mechanisms that understand neither redundant paths nor congestion issues and allows the management software to configure and manage the fabric.
In one embodiment, Capella 2's host-to-host messaging protocol includes transmission of a work request message to a destination DMA VF by a source DMA VF, the execution of the requested work by that DMA VF, and then the return of a completion message to the source DMA VF with optional, moderated notification to the recipient as well. These messages appear on the wire as ID routed Vendor Defined Messages (VDMs). Message pull-protocol read requests that target the memory of a remote host are also sent as ID-routed VDMs. Since these are routed by ID rather than by address, the message and the read request created from it at the destination host can contain addresses in the destination's address domain. When a read request VDM reaches the target host port, it is changed to a standard read request and forwarded into the target host's space without address translation.
A primary benefit of ID routing is its easy extension to multiple PCIe bus number spaces by the addition of a Vendor Defined End-to-End Prefix containing source and destination bus number “Domain” ID fields as well as the destination BUS number in the destination Domain. Domain boundaries naturally align with packaging boundaries. Systems can be built wherein each rack, or each chassis within a rack, is a separate Domain with fully non-blocking connectivity between Domains.
Using ID routing for message engine transfers simplifies the address space, address mapping and address decoding logic, and enforcement of the producer/consumer ordering rules. The ExpressFabric™ Global ID is analogous to an Ethernet MAC address and, at least for purposes of tunneling Ethernet through the fabric, the fabric performs similarly to a Layer 2 Ethernet switch.
The ability to differentiate message engine traffic from other traffic allows use of relaxed ordering rules for message engine data transfers. This results in higher performance in scaled out fabrics. In particular, work request messages are considered strongly ordered while prefixed reads and their completions are unordered with respect to these or other writes. Host-to-host read requests and completion traffic can be spread over the redundant paths of a scaled out fabric to make best use of available redundant paths.
1.3 Push vs. Pull Messaging
In one embodiment, a Capella 2 switch pushes short messages that fit within the supported descriptor size of 128 B, or can be sent by a small number of such short messages sent in sequence, and pulls longer messages.
In push mode, these unsolicited messages are written asynchronously to their destinations, potentially creating congestion there when multiple sources target the same destination. Pull mode message engines avoid congestion by pushing only relatively short pull request messages that are completed by the destination DMA returning a read request for the message data to be transferred. Using pull mode, the sender of a message can avoid congestion due to multiple targets pulling messages from its memory simultaneously by limiting the number of outstanding message pull requests it allows. A target can avoid congestion at its local host's ingress port by limiting the number of outstanding pull protocol remote read requests. In a Capella 2 switch, both outstanding DMA work requests and DMA pull protocol remote read requests are managed algorithmically so as to avoid congestion.
Pull mode has the further advantage that the bulk of host-to-host traffic is in the form of read completions. Host-to-host completions are unordered with respect to other traffic and thus can be freely spread across the redundant paths of a multiple stage fabric.
Referring again to
In the preferred embodiment of the invention, the port types are:
Every PCIe function of every node (edge host or downstream port of the fabric) has a unique Global ID that is composed of {domain, bus, function}. The Global ID domain and bus numbers are used to index the routing tables. A packet whose destination is in the same domain as its source uses the bus to route. A packet whose destination is in a different domain uses the domain to route at some point or points along its path.
Each host port 110 consumes a Global BUS number. At each host port, DMA VFs use FUN 0 . . . NumVFs-1. X16 host ports get 64 DMA VFs ranging from 0 . . . 63. X8 host ports get 32 DMA VFs ranging from 0 . . . 31. X4 host ports get 16 DMA VFs ranging from 0 . . . 15.
The Global RID of traffic initiated by a requester in the RC connected to a host port is obtained via a TWC Local-Global RID-LUT. Each RID-LUT entry maps an arbitrary local domain RID to a Global FUN at the Global BUS of the host port. The mapping and number of RID LUT entries depends on the host port width as follows:
The leading most significant 1's in the FUN indicate a non-DMA requester. One or more leading 0's in the FUN at a host's Global BUS indicate that the FUN is a DMA VF.
Endpoints, shared or unshared, may be connected at fabric edge ports with the Downstream Port attribute. Their FUNs (e.g. PFs and VFs) use a Global BUS between SEC and SUB of the downstream port's virtual bridge. At 2013's SRIOV VF densities, endpoints typically require a single BUS. ExpressFabric™ architecture and routing mechanisms fully support future devices that require multiple Busses to be allocated at downstream ports.
For simplicity in translating IDs, fabric management software configures the system so that except when the host doesn't support ARI, the Local FUN of each endpoint VF is identical to its Global FUN. In translating between any Local Space and Global Space, it's only necessary to translate the BUS number. Both Local to Global and Global to Local Bus Number Translation tables are provisioned at each host port and managed by the MCPU.
If ARI isn't supported, then Local FUN[2:0]==Global FUN[2:0] and Local FUN[7:3]==5'b00000.
2.3 Navigating through Global Space
In one embodiment, ExpressFabric™ uses standard PCIe routing mechanisms augmented to support redundant paths through a multiple stage fabric.
In one embodiment, ID routing is used almost exclusively within Global Space by hosts and endpoints, while address routing is sometimes used in packets initiated by or targeting the MCPU. At fabric edges, CAM data structures provide a Destination BUS appropriate to either the destination address or Requester ID in the packet. The Destination BUS, along with Source and Destination Domains, is put in a Routing Prefix prepended to the packet, which, using the now attached prefix, is then ID routed through the fabric. At the destination fabric edge switch port, the prefix is removed exposing a standard PCIe TLP containing, in the case of a memory request, an address in the address space of the destination. This can be viewed as ID routed tunneling.
Routing a packet that contains a destination ID either natively or in a prefix starts with an attempt to decode an egress port using the standard PCIe ID routing mechanism. If there is only a single path through the fabric to the Destination BUS, this attempt will succeed and the TLP will be forwarded out the port within whose SEC-SUB range the Destination BUS of the ID hits. If there are multiple paths to the Destination BUS, then fabric configuration will be such that the attempted standard route fails. For ordered packets, the current hop destination lookup table (CH-DLUT) Route Lookup mechanism described below will then select a single route choice. For unordered packets, the CH-DLUT route lookup will return a number of alternate route choices. Fault and congestion avoidance logic will then select one of the alternatives. Choices are masked out if they lead to a fault, or to a congestion hot spot, or to prevent a loop from being formed in certain fabric topologies. In one implementation, a set of mask filters is used to perform the masking. Selection among the remaining, unmasked choices is via a “round robin” algorithm.
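As a minimal, self-contained C sketch of the decision order just described, assuming a simple array of SEC/SUB ranges per port; the names and structure are illustrative, not the silicon implementation:

```c
#include <stdint.h>

/* Standard PCIe SEC-SUB decode is tried first; a "null" range
 * (SEC > SUB) never hits, which is how fabric ports force the
 * CH-DLUT fallback for multi-path destinations. */
typedef struct { uint8_t sec, sub; } port_range_t;

int standard_id_decode(const port_range_t ranges[], int num_ports, uint8_t dest_bus)
{
    for (int p = 0; p < num_ports; p++)
        if (ranges[p].sec <= dest_bus && dest_bus <= ranges[p].sub)
            return p;      /* single path: standard route succeeds */
    return -1;             /* fall back to the CH-DLUT route lookup */
}
```

When no port claims the Destination BUS, the CH-DLUT mechanisms described below take over: a single route choice for ordered packets, or a masked set of choices plus congestion filtering and round robin selection for unordered packets.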
The CH-DLUT route lookup is used when the PCIe standard active port decode (as opposed to subtractive route) doesn't hit. The active route (SEC-SUB decode) for fabric crosslinks is topology specific. For example, for all ports leading towards the root of a fat tree fabric, the SEC/SUB ranges of the fabric ports are null, forcing all traffic to the root of the fabric to use the DLUT Route Lookup. Each fabric crosslink of a mesh topology would decode a specific BUS number or Domain number range. With some exceptions, TLPs are ID-routed through Global Space using a PCIe Vendor Defined End-to-End Prefix. Completions and some messages (e.g. ID routed Vendor Defined Messages) are natively ID routed and require the addition of this prefix only when source and destination are in different Domains. Since the MCPU is at the upstream port of Global Space, TLPs may route to it using the default (subtractive) upstream route of PCIe, without use of a prefix. In the current embodiment, there are no means to add a routing prefix to TLPs at the ingress from the MCPU, requiring the use of address routing for its memory space requests. PCIe standard address and ID route mechanisms are maintained throughout the fabric to support the MCPU.
With some exceptions, PCIe message TLPs ingress at host and downstream ports are encapsulated and redirected to the MCPU in the same way as are Configuration Space requests. Some ID routed messages are routed directly by translation of their local space destination ID to the equivalent Global Space destination ID.
Support is provided to extend the ID space to multiple Domains. In one embodiment, an ID routing prefix is used to convert an address routed packet to an ID routed packet. An exemplary ExpressFabric™ Routing prefix is illustrated in
A Vendor (PLX) Defined End-to-End Routing Prefix is added to memory space requests at the edges of the fabric. The method used depends on the type of port at which the packet enters the fabric and its destination:
At host ports:
At downstream ports:
The Address trap and TWC-H TLUT are data structures used to look up a destination ID based on the address in the packet being routed. ID traps associate the Requester ID in the packet with a destination ID:
In one embodiment, the Routing Prefix is a single DW placed in front of a TLP header. Its first byte identifies the DW as an end-to-end vendor defined prefix rather than the first DW of a standard PCIe TLP header. The second byte is the Source Domain. The third byte is the Destination Domain. The fourth byte is the Destination BUS. Packets that contain a Routing Prefix are routed exclusively by the contents of the prefix.
Legal values for the first byte of the prefix are 9Eh or 9Fh, and are configured via a memory mapped configuration register.
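A minimal sketch of assembling the prefix described above follows, assuming the four bytes are packed most significant byte first into one DWORD; the byte packing order and the function name are assumptions:

```c
#include <stdint.h>

/* Build the single-DW Routing Prefix: byte 0 is the configured
 * vendor defined end-to-end prefix type (0x9E or 0x9F), byte 1 the
 * Source Domain, byte 2 the Destination Domain, byte 3 the
 * Destination BUS. */
static inline uint32_t build_routing_prefix(uint8_t prefix_type,  /* 0x9E or 0x9F */
                                            uint8_t src_domain,
                                            uint8_t dst_domain,
                                            uint8_t dst_bus)
{
    return ((uint32_t)prefix_type << 24) |
           ((uint32_t)src_domain  << 16) |
           ((uint32_t)dst_domain  <<  8) |
            (uint32_t)dst_bus;
}
```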
Routing traps are exceptions to standard PCIe routing. In forwarding a packet, the routing logic processes these traps in the order listed below, with the highest priority trap checked first. If a trap hits, then the packet is forwarded as defined by the trap. If a trap doesn't hit, then the next lower priority trap is checked. If none of the traps hit, then standard PCIe routing is used.
The multicast trap is the highest priority trap and is used to support address based multicast as defined in the PCIe specification. This specification defines a Multicast BAR which serves as the multicast trap. If the address in an address routed packet hits in an enabled Multicast BAR, then the packet is forwarded as defined in the PCIe specification for a multicast hit.
Each address trap is an entry in a ternary CAM, as illustrated in
The following outputs are available from each address trap:
A CAM Code determines how/where the packet is forwarded, as follows:
If sending to the DMAC, then the 8 bit Destination BUS and Domain fields are repurposed as:
Hardware uses this information along with the CAM code (forward or reverse mapping of functions) to arrive at the targeted DMA function register for routing, while minimizing the number of address traps needed to support multiple DMA functions.
The T-CAM used to implement the address traps appears as several arrays in the per-station global endpoint BAR0 memory mapped register space. The arrays are:
An exemplary array implementation is illustrated in the table below.
ID traps are used to provide upstream routes from endpoints to the hosts with which they are associated. ID traps are processed in parallel with address traps at downstream ports. If both hit, the address trap takes priority.
Each ID trap functions as a CAM entry. The Requester ID of a host-bound packet is associated into the ID trap data structure and the Global Space BUS of the host to which the endpoint (VF) is assigned is returned. This BUS is used as the Destination BUS in a Routing Prefix added to the packet. For support of cross Domain I/O sharing, the ID Trap is augmented to return both a Destination BUS and a Destination Domain for use in the ID routing prefix.
In an embodiment, ID traps are implemented as a two-stage table lookup. Table size is such that all FUNs on at least 31 global busses can be mapped to host ports.
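The two-stage lookup can be pictured with the following hedged C sketch; the table sizes, the invalid-entry encoding, and the split of the stages between BUS and FUN indexing are assumptions for illustration, not the register-level implementation:

```c
#include <stdint.h>

/* Hypothetical two-stage ID trap lookup: stage 1 is indexed by the
 * Requester BUS and returns a base into a second table indexed by
 * the Requester FUN; the result is the Global BUS (and, for cross
 * Domain I/O sharing, the Domain) of the owning host. */
typedef struct { uint8_t dest_bus; uint8_t dest_domain; } id_trap_result_t;

uint16_t         stage1[256];       /* Requester BUS -> base index, 0xFFFF = no trap */
id_trap_result_t stage2[32 * 256];  /* base + Requester FUN -> host Global BUS/Domain */

int id_trap_lookup(uint8_t req_bus, uint8_t req_fun, id_trap_result_t *out)
{
    uint16_t base = stage1[req_bus];
    if (base == 0xFFFF)
        return -1;                  /* no ID trap hit: other routing applies */
    *out = stage2[base + req_fun];
    return 0;
}
```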
The table below illustrates address generation for 2nd stage ID trap lookup.
The ID traps are implemented in the Upstream Route Table that appears in the register space of the switch as the three arrays in the per station GEP BAR0 memory mapped register space. The three arrays shown in the table below correspond to the two stage lookup process with FUN0 override described above.
The table below illustrates an Upstream Route Table Containing ID Traps.
A 512 entry CH-DLUT stores four 4-bit egress port choices for each of 256 Destination BUSes and 256 Destination Domains. The number of choices stored at each entry of the DLUT is limited to four in our first generation product to reduce cost. Four choices is the practical minimum; six choices correspond to the six possible directions of travel in a 3D Torus, and eight choices would be useful in a fabric with eight redundant paths. Where there are more redundant paths than choices in the CH-DLUT output, all paths can still be used by using different sets of choices in different instances of the CH-DLUT in each switch and each module of each switch.
Since the Choice Mask or masked choice vector has 12 bits, the number of redundant paths is limited to 12 in this initial silicon, which has 24 ports. A 24 port switch is suitable for use in CLOS networks with 12 redundant paths. In future products with higher port counts, a corresponding increase in the width of the Choice Mask entries will be made.
Route by BUS is true when (Switch Domain==Destination Domain) or when routing by Domain is disabled by the ingress port attribute. Therefore, if the packet is not yet in its Destination Domain, the route lookup is done using the Destination Domain rather than the Destination BUS as the D-LUT index, unless this is prohibited by the ingress port attribute.
In one embodiment, the CH-DLUT lookup provides four egress port choices that are configured to correspond to alternate paths through the fabric for the destination. DMA WR VDMs include a PATH field for selecting among these choices. For shared I/O packets, which don't include a PATH field or when use of PATH is disabled, selection among those four choices is made based upon which port the packet being routed entered the switch. The ingress port is associated with a source port and allows a different path to be taken to any destination for different sources or groups of sources.
The primary components of the CH-DLUT are two arrays in the per station BAR0 memory mapped register space of the GEP shown in the table below.
Table 3 illustrates CH-DLUT Arrays in Register Space
For host-to-host messaging Vendor Defined Messages (VDMs), if use of PATH is enabled, then PATH can be used in either of two ways:
Note that if use of PATH isn't enabled, if PATH==0, or if the packet doesn't include a PATH, then the low 2 bits of the ingress port number are used to select among the four Choices provided by the CH-DLUT.
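A C sketch of this ordered-route selection follows, assuming the CH-DLUT is indexed with Destination BUS entries in the lower half and Destination Domain entries in the upper half, and that the low 2 bits of PATH select the choice; the index split and all names are illustrative assumptions:

```c
#include <stdint.h>

/* Select one of the four ordered-route egress port choices returned
 * by the CH-DLUT. The index is the Destination BUS when already in
 * the Destination Domain (or when routing by Domain is disabled),
 * otherwise the Destination Domain. */
uint8_t ordered_egress_port(const uint8_t dlut_choice[512][4],
                            uint8_t switch_domain, uint8_t dest_domain,
                            uint8_t dest_bus, uint8_t ingress_port,
                            int route_by_domain_enabled,
                            int path_enabled, int path_present, uint8_t path)
{
    int route_by_bus = (switch_domain == dest_domain) || !route_by_domain_enabled;
    uint16_t index   = route_by_bus ? dest_bus : (uint16_t)(256 + dest_domain);

    /* PATH selects a choice when enabled, present, and non-zero;
     * otherwise the low two bits of the ingress port number are used. */
    uint8_t sel = (path_enabled && path_present && path != 0)
                      ? (uint8_t)(path & 0x3)
                      : (uint8_t)(ingress_port & 0x3);
    return dlut_choice[index][sel];   /* 4-bit egress port choice */
}
```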
In one embodiment, DMA driver software is configurable to use appropriate values of PATH in host to host messaging VDMs based on the fabric topology. PATH is intended for routing optimization in HPC where a single, fabric-aware application is running in distributed fashion on every compute node of the fabric.
In one embodiment, a separate array (not shown in
The CH-DLUT Route Lookup described in the previous subsection is used only for ordered traffic. Ordered traffic consists of all host <-> I/O device traffic plus the Work Request VDM and some TxCQ VDMs of the host to host messaging protocol. For unordered traffic, we take advantage of the ability to choose among redundant paths without regard to ordering. Traffic that is considered unordered is limited to types for which the recipients can tolerate out of order delivery or for which re-ordering is implemented at the destination node. In one embodiment, unordered traffic types include only:
Choices among alternate paths for unordered TLPs are made to balance the loading on fabric links and to avoid congestion signaled by both local and next hop congestion feedback mechanisms. In the absence of congestion feedback, each source follows a round robin distribution of its unordered packets over the set of alternate egress paths that are valid for the destination.
The CH-DLUT includes an Unordered Route Choice Mask for each destination BUS and Domain. In one embodiment, choices are masked from consideration by the Unordered Route Choice Mask vector output from the DLUT for the following reasons:
It is also helpful, in grid-like fabrics where the switch hop between the home Domain and the Destination Domain may be made at any of multiple switch stages along the path to the destination, to process the route by Domain route Choices concurrently with the route by BUS Choices and to defer routing by Domain at some fabric stages for unordered traffic if congestion is indicated for its route Choices but not for the route by BUS route Choices. This deferral of route by Domain due to congestion feedback would be allowed for the first switch to switch hop of a path and would not be allowed if the route by Domain step is the last switch to switch hop required.
The Unordered Route Choice Mask Table shown below is part of the DLUT and appears in the per-chip BAR0 memory mapped register space of the GEP.
In a fat tree fabric, the unordered route mechanism is used on the hops leading toward the root (central switch rank) of the fabric. Route decisions on these hops are destination agnostic. Fabrics with up to 12 choices at each stage are supported. During the initial fabric configuration, the Unordered Route Choice Mask entries of the CH-DLUTs are configured to mask out invalid choices. For example, if building a fabric with equal bisection bandwidth at each stage and with x8 links from a 97 lane Capella 2 switch, there will be 6 choices at each switch stage leading towards the central rank. All the Unordered Route Choice Mask entries in all the fabric D-LUTs will be configured with an initial, fault-free value of 12'hFC0 to mask out choices 6 and up.
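For instance, the fault-free initialization and a later fault update could look like the following sketch; the array and macro names are illustrative:

```c
#include <stdint.h>

#define CHOICE_MASK_SIX_VALID  0xFC0u   /* 12'hFC0: mask out choices 6..11 */

/* Fault-free configuration for a fabric with six valid choices per
 * stage: every CH-DLUT Unordered Route Choice Mask entry masks the
 * non-existent choices. */
void init_unordered_masks(uint16_t choice_mask[512])
{
    for (int i = 0; i < 512; i++)
        choice_mask[i] = CHOICE_MASK_SIX_VALID;
}

/* If a fault later removes a choice for a given destination index,
 * its bit is additionally asserted (1 = do not use). */
void mask_faulted_choice(uint16_t choice_mask[512], int dest_index, int choice)
{
    choice_mask[dest_index] |= (uint16_t)(1u << choice);
}
```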
Separate masks are used to exclude congested local ports or congested next hop ports from the round robin distribution of unordered packets over redundant paths. A congested local port is masked out independent of destination. Masking of congested next hop ports is a function of destination. Next hop congestion is signaled using a DLLP with encoding as RESERVED as a Backwards Explicit Congestion Notification (BECN). BECNs are broadcast to all ports one hop backwards towards the edge of the fabric. Each BECN includes a bit vector indicating congested downstream ports of the switch generating the BECN. The BECN receivers use lookup tables to map each congested next hop port indication to the current stage route choice that would lead to it.
The routing of an unordered packet is a four step process:
For the unordered route, the CH-DLUT stores a 12-bit Unordered Route Choice Mask Vector for each potential destination Bus and destination Domain. The implicit assumption in the definition is that each of the choices in the vector is valid unless masked. The starting point for configuration is to assert all the bits corresponding to choices that don't exist in the topology. If a fault arises during operation, additional bits may be asserted to mask off choices affected by the fault. For example, a 3×3 array Clos network made with PEX9797 has only 3 valid choices, corresponding to the fabric ports that lead to the three central rank switches in the array. To be clear: zero bits in the vector indicate that the associated ports are valid choices.
The NH-DLUT is a 512×96 array. For each possible destination Bus and Destination Domain, it returns 12 bytes of information. Each byte is associated with the same numbered bit of the Unordered Route Choice Mask Vector. Each byte is structured as a 2-bit pointer to one of four “Port of Choice” tables followed by a 6-bit “Choice” vector. The “Port of Choice” tables map bits in the vector to ports on the next hop switch. Next hop route choices are stored at index values 256-511 in the NH-DLUT for destination Busses in the current Domain and at index values 0-255 for remote Domain destinations.
The “Port of Choice” tables return the ports on the next hop switch that lead to the destination if the associated current hop route choice is selected. It's those ports for which the congestion state is needed. It can be seen that this supports fabrics in which up to 6 next hop ports lead to the destination. The topology analysis in the next subsection shows that this is more than sufficient.
The “Port of Choice” tables are used to transform NH DLUT output from a next hop masked choice vector to a next hop masked port vector.
The next hop masked port vector aligns bit by bit with the next hop congestion vectors. These vectors are in effect ANDed bit by bit with the congestion vectors so that, in the resulting bit vector, the only asserted bits are those corresponding to next hop ports that lead to the destination and for which congestion is indicated.
In order to do this, the “Port of Choice” tables and the Choice vectors themselves must be configured consistently with the fabric topology and the congestion vectors. The congestion vector bits are in port order; i.e. bit zero of the vector corresponds to port zero, etc. Since there is only one set of four Port of Choice tables but as many as 12 next hop switches from which congestion feedback is received, all the next hop switches must use the same numbered port to get to the same destination switch of a Clos network or to the equivalent next hop destination of a deeper fat tree or mesh network. For example, if port 0 of one central rank switch of a Clos network leads to destination switch 0, then the fabric must be wired so that port 0 leads to destination switch 0 on all switches in the central rank. This is a fabric wiring constraint; to the extent it is not followed, the next hop congestion feedback becomes unusable.
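A hedged C sketch of the NH-DLUT byte decode and Port_for_Choice translation follows; the placement of the 2-bit table select in the upper bits of the byte, and the table contents, are assumptions:

```c
#include <stdint.h>

/* Turn one NH-DLUT byte into a next hop masked port vector: the
 * 2-bit selector picks one of four Port_for_Choice tables, and each
 * asserted bit of the 6-bit choice vector is translated to a next
 * hop port number and set in a 24-bit port vector that aligns with
 * the next hop congestion vectors. */
uint32_t nh_masked_port_vector(uint8_t nh_dlut_byte,
                               const uint8_t port_for_choice[4][6]) /* 5-bit port numbers */
{
    uint8_t  table_sel   = (uint8_t)((nh_dlut_byte >> 6) & 0x3);
    uint8_t  choice_bits = (uint8_t)(nh_dlut_byte & 0x3F);
    uint32_t port_vector = 0;

    for (int c = 0; c < 6; c++)
        if (choice_bits & (1u << c))
            port_vector |= 1u << port_for_choice[table_sel][c];

    return port_vector;   /* bit N set = next hop port N leads to the destination */
}
```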
This NH DLUT route structure supports all fabric topologies with up to 24 next hop route choices in which only a single next hop route choice leads to the destination and some fabric topologies in which multiple next hop route choices lead to the destination.
The CH DLUT supports fabrics with up to 12 current hop route choices and up to 24 next hop route choices. Support for 12 first hop route choices and 24 2nd hop route choices is consistent with C2's maximum of 24 fabric ports and the desire to support fat tree topologies.
The fabric topology determines how many first and second hop route choices lead to the destination:
Improved support for topologies with multiple next hop route choices can be realized by implementing options to interpret the NH DLUT output differently:
A copy of the congestion information is maintained in every “station” module of the switch as the information is needed at single clock latency for routing decisions. The information is stored in discrete flip-flops organized as a set of Next Hop Congestion Vectors for each fabric port of the current switch, as shown in
The final congestion vector is generated using these rules:
In the above, a round robin policy was specified for use breaking ties in the complete absence of congestion indications and when congestion is indicated for all route choices. The simplest round robin policy is to send packets to each route choice in order, independent of what flow, if any, it might be a part of. This is what has been implemented in Capella 2.
It was shown earlier that for several topologies of interest, our BECN doesn't make all congestion along all complete paths through the fabric visible at the source edge node where the initial routing decision is made. Furthermore, reactive congestion management mechanisms are limited in their effectiveness by delays in the congestion sensing and feedback paths. For fabrics with more than 3 stages and for improved performance on 3 stage fabrics, a proactive congestion management mechanism is desirable.
Deeper fabrics are likely better served with a feed forward mechanism rather than a feedback mechanism because the delay in the feedback loop may approach or exceed the amount of congestion buffering available if the BECNs were sent back all the way to the source edge switches. It is well known that a round robin per flow current hop routing policy that rounds over multiple first hop route choices will balance the fabric link loading at the next hop stages. Depending on the burstiness of the traffic, switch queues may fill before balance occurs. Thus even with round robin per flow, congestion feedback remains necessary.
Given the limited goal of load balancing paths at the next switch stage, the round robin per flow policy can be simplified to what is essentially round robin per destination edge switch. Each stream from any input visible to the management logic (in each switch “station”) to each destination is treated as a separate flow. This is the coarsest grained possible flow definition and will thus require the least time for loads to balance. It also requires the least state storage.
Implementing this policy with the flexibility to adapt to different switch port configuration and fabric topologies can be done with a two stage lookup of the flow state, as illustrated in
Round robin per destination edge switch differs from the simple round robin policy described earlier only in that a separate round robin state is maintained for each destination edge switch. Note that the Destination Switch LUT and Prior Choice Array are together quite small compared to the CH and NH DLUTs.
The next unordered packet in a flow (i.e. to a specific destination edge switch) is routed to the next Choice in the Current Hop Unordered Route Choice vector after the one listed in the flow state table. As noted earlier, if all such Choices are congested, or if more than one is uncongested, the next choice after the most recent choice taken toward that destination is taken, scanning in increasing bit order on the choice vector.
After each such route, the choice just taken is written to the destination's entry in the Prior Choice Array.
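A minimal sketch of this per-destination round robin, assuming a Destination Switch LUT indexed by Destination BUS and a Prior Choice Array indexed by destination edge switch (names and sizes are assumptions):

```c
#include <stdint.h>

/* Take the next unmasked, uncongested choice after the one last used
 * toward this destination edge switch, then record it as the new
 * prior choice for that switch. */
int rr_per_dest_switch(uint8_t dest_bus,
                       uint16_t good_choices,            /* 12 bits, 1 = usable */
                       const uint8_t dest_switch_lut[256],
                       uint8_t prior_choice[])           /* one entry per dest switch */
{
    uint8_t sw   = dest_switch_lut[dest_bus];
    uint8_t last = prior_choice[sw];

    for (int i = 1; i <= 12; i++) {                      /* scan with wraparound */
        int c = (last + i) % 12;
        if (good_choices & (1u << c)) {
            prior_choice[sw] = (uint8_t)c;               /* remember choice just taken */
            return c;
        }
    }
    return -1;                                           /* no usable choice configured */
}
```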
Tie breaking via round robin per destination edge switch is proposed as an improvement for the next generation fabric switch. This was rejected initially as being too complicated but, as should be evident, the next hop congestion feedback that we ended up implementing is considerably more complicated. In retrospect, the two methods complement each other with each compensating for the shortcomings of the other. Adding round robin per destination edge switch at this point is only a marginal increase in cost and complexity.
Fabric ports indicate congestion when their fabric egress queue depth is above a configurable threshold. Fabric ports have separate egress queues for high, medium, and low priority traffic. Congestion is never indicated for high priority traffic; only for low and medium priority traffic.
Fabric port congestion is broadcast internally from the fabric ports to all the ports in the switch using the congestion ring bus, with an indication for each {port, priority}, where priority can be medium or low. When a {port, priority} signals XOFF on the congestion ring bus, edge ingress ports are advised not to forward unordered traffic to that port, if possible. If, for example, all fabric ports are congested, it may not be possible to avoid forwarding to a congested port.
Hardware converts the portX local congestion feedback to a local congestion bit vector per priority level, one vector for medium priority and one vector for low priority. High priority traffic ignores congestion feedback because by virtue of its being high priority, it bypasses traffic in lower priority traffic classes, thus avoiding the congestion. These vectors are used as choice masks in the unordered route selection logic, as described earlier.
For example, if local congestion feedback from portX, which is reached via choices 1 and 5, has XOFF set for low priority, then bits [1] and [5] of low_local_congestion would be set. If later local congestion feedback from portY, which is reached via choice 2, has XOFF clear for low priority, then bit [2] of low_local_congestion would be cleared.
If all valid (legal) choices are locally congested, i.e. the vector is all 1s, the local congestion filter applied to the legal_choices is set to all 0s, since the packet must be routed somewhere.
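The vector maintenance and the all-congested fallback can be sketched as follows; the choices_for_port mapping and the function names are assumptions used only for illustration:

```c
#include <stdint.h>

/* Update the per-priority local congestion vectors from congestion
 * ring feedback: the choice bits that map to the reporting port are
 * set on XOFF and cleared on XON. */
void update_local_congestion(uint16_t *low_vec, uint16_t *med_vec,
                             const uint16_t choices_for_port[24],
                             uint8_t port, int low_xoff, int med_xoff)
{
    uint16_t bits = choices_for_port[port];
    if (low_xoff) *low_vec |= bits; else *low_vec &= (uint16_t)~bits;
    if (med_xoff) *med_vec |= bits; else *med_vec &= (uint16_t)~bits;
}

/* If every legal choice is locally congested, ignore local
 * congestion (treat the filter as all zeros) so the packet can
 * still be routed somewhere. */
uint16_t effective_local_filter(uint16_t congest_vec, uint16_t legal_choices)
{
    return ((congest_vec & legal_choices) == legal_choices) ? 0 : congest_vec;
}
```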
In one embodiment, any one station can target any of the six stations on a chip. Put another way, there is a fan-in factor of six stations to any one port in a station. A simple count of traffic sent to one port from another port cannot know what other ports in other stations sent to that port and so may be off by a factor of six. Because of this, one embodiment relies on the underlying round robin distribution method augmented by local congestion feedback to balance the traffic and avoid hotspots.
The hazard of having multiple stations send to the same port at the same time is avoided using the local congestion feedback. Queue depth reflects congestion instantaneously and can be fed back to all ports within the Inter-station Bus delay. In the case of a large transient burst targeting one queue, that Queue depth threshold will trigger congestion feedback which allows that queue time to drain. If the queue does not drain quickly, it will remain XOFF until it finally does drain.
Each source station should have a different choice_to_port map so that, as hardware sequentially goes through the choices in its round robin distribution process, the next port is different for each station. For example, consider x16 ports with three stations 0, 1, 2 feeding into three choices that point to ports 12, 16, 20. If port 12 is congested, each station will cross the choice that points to port 12 off of its legal choices (by setting a choice_congested[priority] bit). It is desirable to avoid having all stations then send to the same next choice, i.e. port 16. If some stations send to port 16 and some to port 20, then the transient congestion has a chance to be spread out more evenly. The method to do this is purely software programming of the choice to port vectors. Station 0 may have choices 1, 2, 3 be 12, 16, 20 while station 1 has choices 1, 2, 3 be 12, 20, 16, and station 2 has choices 1, 2, 3 be 20, 12, 16.
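The example above corresponds to choice_to_port tables programmed roughly as in the following sketch; choice 0 is shown as an unused placeholder and the table shape is an assumption:

```c
#include <stdint.h>

/* Per-station choice_to_port programming for the example above:
 * each station rotates the mapping so that, when a congested choice
 * is skipped, the stations do not all fall back to the same port. */
static const uint8_t choice_to_port[3][4] = {
    /* choice:       0(unused)  1    2    3  */
    /* station 0 */ { 0,        12,  16,  20 },
    /* station 1 */ { 0,        12,  20,  16 },
    /* station 2 */ { 0,        20,  12,  16 },
};

static inline uint8_t egress_port_for(int station, int choice)
{
    return choice_to_port[station][choice];
}
```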
A 512 B completion packet, which is the common remote read completion size and should be a large percentage of the unordered traffic, will take 134 ns to sink on an x4, 67 ns on x8, and 34.5 ns on x16. If we can spray the traffic to a minimum of 3× different x4 ports, then as long as we get feedback within 100 ns or so, the feedback will be as accurate as a count from this one station and much more accurate if many other stations targeted that same port in the same time period.
For a switch from which a single port leads to the destination, congestion feedback sent one hop backwards from that port to where multiple paths to the same destination may exist, can allow the congestion to be avoided. From the point of view of where the choice is made, this is next hop congestion feedback.
For example, in a three stage Fat Tree, CLOS network, the middle switch may have one port congested heading to an edge switch. Next hop congestion feedback will tell the other edge switches to avoid this one center switch for any traffic heading to the one congested port.
For a non-fat tree, the next hop congestion can help find a better path. The congestion thresholds would have to be set higher, as there is blocking and so congestion will often develop. But for the traffic pattern where there is a route solution that is not congested, the next hop congestion avoidance ought to help find it.
Hardware will use the same congestion reporting ring as local feedback, such that the congested ports can send their state to all other ports on the same switch. A center switch could have 24 ports, so feedback for all 24 ports is needed.
If the egress queue depth exceeds TOFF ns, then an XOFF status will be sent. If the queue drops back to TON ns or less, then an XON status will be sent. These times reflect the time required to drain the associated queue at the link bandwidth.
When TON<TOFF, hysteresis in the sending of BECNs results. However, at the receiver of the BECN, the XOFF state remains asserted for a fixed amount of time and then is de-asserted. This “auto XON” eliminates the need to send a BECN when a queue depth drops below TON and allows the TOFF threshold to be set somewhat below the round trip delay between adjacent switches.
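A sketch of the transmitter-side threshold behavior follows, with the queue depth expressed in nanoseconds of drain time as described above; the structure and names are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

/* XOFF is signaled when the queue depth exceeds TOFF and XON when it
 * falls back to TON or less; TON < TOFF provides hysteresis. */
typedef struct {
    uint32_t ton_ns;     /* XON threshold  */
    uint32_t toff_ns;    /* XOFF threshold */
    bool     xoff;       /* currently signaling XOFF */
} egress_congestion_t;

/* Returns true when the XOFF/XON state changed, i.e. when a BECN or
 * congestion ring update should be generated. */
bool update_xoff_state(egress_congestion_t *s, uint32_t queue_depth_ns)
{
    if (!s->xoff && queue_depth_ns > s->toff_ns) { s->xoff = true;  return true; }
    if ( s->xoff && queue_depth_ns <= s->ton_ns) { s->xoff = false; return true; }
    return false;
}
```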
For fabrics with more than three stages, next hop congestion feedback may be useful at multiple stages. For example, in a five stage Fat Tree, it can also be used at the first stage to get feedback from the small set of away-from-center choices at the second stage. Thus, the decision as to whether or not to use next hop congestion feedback is both topology and fabric stage dependent.
A PCIe DLLP with encoding as Reserved is used as a BECN to send next hop congestion feedback between switches. Every port that forwards traffic away from the central rank of a fat tree fabric will send a BECN if the next hop port stays in the XOFF state. It is undesirable to trigger it too often.
BECN protocol uses the auto_XON method described earlier. A BECN is sent only if at least one port in the bit vector is indicating XOFF. XOFF status for a port is cleared automatically after a configured time delay by the receiver of a BECN. If a received BECN indicates XON, for a port that had sent an XOFF in the past which has not yet timed out, the XOFF for that port is cleared.
The BECN information needs to be stored by the receiver. The receiver will send updates to the other ports in its switch via the internal congestion feedback ring whenever a next hop port's XON/XOFF state changes.
Like all DLLPs, the Vendor Defined DLLPs are lossy. If a BECN DLLP is lost, then the congestion avoidance indicator will be missed for the time period. As long as congestion persists, BECNs will be periodically sent.
Any port that receives a DLLP with new BECN information will need to save that information in its own XOFF vector. The BECN receiver is responsible to track changes in XOFF and broadcast the latest XOFF information to other ports on the switch. The congestion feedback ring is used with BECN next hop information riding along with the local congestion.
Since the BECN rides on a DLLP which is lossy, a BECN may not arrive. Or, if the next hop congestion has disappeared, a BECN may not even be sent. The BECN receiver must take care of ‘auto XON’ to allow for either of these cases.
One important requirement is that a receiver not turn a next hop back to XON if it should stay off. Lost DLLPs are so rare as to not be a concern. However, DLLPs can be stalled behind a TLP, and they often are. The BECN receiver must tolerate a range of Tspread +/- Jitter, where Tspread is the inverse of the transmitter's BECN rate and Jitter is the delay due to TLPs between BECNs.
Upon receipt of a BECN for a particular priority level, a counter will be set to Tspread+Jitter. If the counter gets to 0 before another BECN of any type is received, then all XOFF of that priority are cleared. The absence of a BECN implies that all congestion has cleared at the transmitter. The counter measures the worst case time for a BECN to have been received if it was in fact sent.
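One hedged way to model the receiver-side counter is shown below; per-priority state is assumed, and the handling of the XOFF bit vector on BECN receipt is an interpretation of the description above rather than the defined behavior:

```c
#include <stdint.h>

/* Receiver-side auto-XON: receipt of a BECN for a priority level
 * reloads a countdown of Tspread + Jitter; if the countdown reaches
 * zero before another BECN arrives, every XOFF of that priority is
 * cleared, since a congested transmitter would have sent another
 * BECN by then. */
typedef struct {
    uint32_t xoff_ports;      /* bit per next hop port currently XOFF */
    uint32_t countdown_ns;    /* time left before auto XON */
    uint32_t reload_ns;       /* Tspread + Jitter */
} becn_rx_state_t;

void becn_received(becn_rx_state_t *s, uint32_t xoff_bits)
{
    s->xoff_ports   = xoff_bits;    /* latest XOFF/XON picture from the BECN */
    s->countdown_ns = s->reload_ns;
}

void becn_timer_tick(becn_rx_state_t *s, uint32_t elapsed_ns)
{
    if (s->countdown_ns <= elapsed_ns) {
        s->xoff_ports   = 0;        /* auto XON: no BECN implies congestion cleared */
        s->countdown_ns = 0;
    } else {
        s->countdown_ns -= elapsed_ns;
    }
}
```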
The BECN receiver also sits on the on chip congestion ring. Each time slot it gets on the ring, it will send out any state change information before sending out no-change. The BECN receiver must track state change since the last time the on chip congestion ring was updated. It sends the next hop medium and low priority congestion information for half the next hop ports per slot. The state change could be XOFF to XON or XON to XOFF. If there were two state changes or more, that is fine—record it as a state change and report the current value.
The ports on the current switch that receive the BECN feedback via the internal switch broadcast will mark a bit in an array as ‘off.’ The array needs to be 12 choices × 24 ports.
A RAM of size 512×12 is needed to store the current hop fault vector, where the first 256 entries are for route by BUS and the remaining 256 are for route by Domain. A RAM of size 512×96 (12×8) is needed to store the next hop fault vectors, where 8 bits are provided for each fabric port.
Sw-00 ingress station last sent an unordered medium priority TLP to Sw-10, so Sw-11 is the next unordered choice. The choices are set up as 1 to Sw-10, 2 to Sw-11, and 3 to Sw-12.
Case1: The TLP is an ordered TLP. D-LUT[DB] tells us to use choice1. Regardless of congestion feedback, a decision to route to choice1 leads to Sw-11 and even worse congestion.
Case2: The TLP is an unordered TLP. D-LUT[DB] shows that all 3 choices 1,2, and 3 are unmasked but 4-12 are masked off. Normally we would want to route to Sw-11 as that is the next switch to spray unordered medium traffic to. However, a check on NextHop[DB] shows that choice2's next hop port would lead to congestion. Furthermore choice3 has local congestion. This leaves one ‘good choice’, choice1. The decision is then made to route to Sw-10 and update the last picked to be Sw-10.
Case3: A new medium priority unordered TLP arrives and targets Sw-04 destination bus DC. D-LUT[DC] shows all 3 choices are unmasked. Normally we want to route to Sw-11 as that is the next switch to spray unordered traffic to. NextHop[DC] shows that choice2's next hop port is not congested, choice2 locally is not congested, and so we route to Sw-11 and update the last routed state to be Sw-11.
The final step in routing is to translate the route choice to an egress port number. The choice is essentially a logical port. The choice is used to index the table below to translate the choice to a physical port number. Separate such tables exist for each station of the switch and may be encoded differently to provide a more even spreading of the traffic.
In ExpressFabric™, it is necessary to implement flow control of DMA WR VDMs in order to avoid the deadlock that would occur if a DMA WR VDM that could not be executed or forwarded blocked a switch queue. When no WR flow control credits are available at an egress port, no DMA WR VDMs may be forwarded. In this case, other packets bypass the stalled DMA WR VDMs using a bypass queue. It is the credit flow control plus the bypass queue mechanism that together allow this deadlock to be avoided.
In one embodiment, a Vendor Defined DLLP is used to implement a credit based flow control system that mimics standard PCIe credit based flow control.
To facilitate fabric management, a mechanism is implemented that allows the management software to discover and/or verify fabric connections. A switch port is uniquely identified by the {Domain ID, Switch ID, Port Number} tuple, a 24-bit value. Every switch sends this value over every fabric link to its link partner in two parts during initialization of the work request credit flow control system, using the DLLP formats defined in
For a fat tree with multiple choices to the root of the fat tree, the design goal is to use all routes. Unordered traffic should be able to route around persistent ordered traffic streams, such as those caused by shared I/O or by ordered host to host traffic using a single path.
For a fat tree with multiple choices, one link may be degraded. The design goal is to recognize the weaker link and route around it. If a healthy fabric has 6× bandwidth using three healthy paths, and one path then drops from 2× to 1×, the resulting fabric should run at 5× bandwidth worst case. If software can lower the injection rate that uses the weak link to ⅚ of nominal, no congestion should develop in the fabric, allowing other flows to run at 11/2=5.5× assuming a uniform traffic load using a different TxQ for each destination.
Blocking topologies will likely often have congestion. A 2D or 3D torus can benefit from local congestion avoidance to try a different path, if there is more than one choice. Next hop BECN on a non-fat tree is possible only if a ‘BECN enable’ control can be defined.
The design goal is for hardware to be able to make a good choice to avoid congestion using a set of legal paths. The choice need not be the best.
To even be considered a choice, there must be no faults anywhere on the path to the destination, i.e. the path must be valid. One must rule out use of a choice where the port selected on the first hop through a 3 stage fabric would cause the packet to encounter a fault on its second hop. A choice_mask or fault vector programmed in the CH-DLUT, and a next hop choice mask (next hop masked choice vector) or fault vector programmed in the NH-DLUT, for every possible destination bus or domain will give the legal paths (paths that are not masked).
After the choice_mask, the best choice would be the one that has little other traffic. Congestion feedback from the same switch egress and the next switch egress will help indicate which choices have heavy traffic and should be avoided, assuming another choice has less heavy traffic. Clearly if unordered traffic hits congestion, latency will go up. Not as clearly, unordered traffic hitting ordered congestion may cause throughput to drop unless unordered traffic can be routed around the congestion.
Putting it together, all valid choices (those not masked) will be filtered against a same switch congestion vector and a next hop congestion vector. The remaining choices are all good choices. A choice equation follows:
good_choices=!masked_choice & !adj_local_congestion & !adj_next_hop_congestion
selected_choice=state_machine (last choice[priority], good_choices)
Looking at the equations, the masked choice term is easy enough to understand: if the choice does not lead to the destination or should not be used, it will be masked. Masking may be due to a fault or due to a topology consideration where the path should not be used. The existence of a masked choice is a function of destination and thus requires a look up (D-LUT output).
The congestion filters each have two adjustments. First there is a priority adjustment. The TLP's TC is used to determine which priority class the TLP belongs to. High priority traffic is never considered congested, but medium and low priority traffic can be. If low priority traffic is congested on a path but medium priority is not, medium priority traffic can still make low latency progress on that path.
If medium priority traffic is congested, then theoretically low priority could make progress since it uses a different queue. However, practically we do not want low priority traffic to pile up on a congested medium priority path, so we will avoid it. For example, if shared I/O ordered traffic on medium priority takes up all the bandwidth, low priority host-to-host traffic should use an alternate path if such a path exists. This avoidance is handled by hardware counting only medium + high traffic for medium congestion threshold checks, but counting high, medium, and low traffic for low priority congestion threshold checks. The same threshold is used for both medium and low priority, so if medium priority is congested, then low priority is also congested. However, low priority can be congested without medium priority being congested.
The second adjustment is needed because one choice must always be made even if everything is congested. If the congestion vector mapped for all un-masked choices is all 1s, then it is treated as if it were all 0s (i.e. no congestion).
The combination of priority and ignoring all-1s results in the adjusted congestion filter, either adj_local_congestion or adj_next_hop_congestion. For example, logic to determine adj_local_congestion is as follows (similar logic applies for adj_next_hop_congestion):
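The exact logic is implementation specific; the following is a minimal C sketch consistent with the description above, assuming 12-bit choice vectors and illustrative names (prio_t, local_congestion_low, local_congestion_med are not the actual register names):

#include <stdint.h>

typedef enum { PRIO_LOW, PRIO_MED, PRIO_HIGH } prio_t;

#define CHOICES 0x0FFFu    /* 12 possible unordered route choices */

/* masked_choice: 1 = choice masked out; *_congestion: 1 = choice congested. */
static uint16_t adj_local_congestion(prio_t prio,
                                     uint16_t masked_choice,
                                     uint16_t local_congestion_low,
                                     uint16_t local_congestion_med)
{
    uint16_t unmasked = (uint16_t)(~masked_choice) & CHOICES;
    uint16_t cong;

    if (prio == PRIO_HIGH)
        return 0;                            /* high priority is never considered congested */
    cong = (prio == PRIO_MED) ? local_congestion_med : local_congestion_low;

    if ((cong & unmasked) == unmasked)
        return 0;                            /* all un-masked choices congested: ignore it  */
    return (uint16_t)(cong & CHOICES);
}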
A choice is selected based on the most recent choice for the given priority level and the choices available. In the absence of congestion feedback, the unordered packet is routed based purely on round robin arbitration among all possible choices. A state machine will track the most recent choice for high, medium, and low priority TLPs separately.
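A minimal sketch of the selection step follows, reusing prio_t and CHOICES from the sketch above; the rotate-and-pick loop and the last_choice array are illustrative stand-ins for the hardware round robin state machine:

/* good_choices = !masked_choice & !adj_local_congestion & !adj_next_hop_congestion,
   then the next set bit after the per-priority last choice is picked. */
static int select_choice(uint16_t masked_choice,
                         uint16_t adj_local,
                         uint16_t adj_next_hop,
                         int last_choice[3],   /* indexed by prio_t, one entry per priority */
                         prio_t prio)
{
    uint16_t unmasked = (uint16_t)(~masked_choice) & CHOICES;
    uint16_t good = unmasked & (uint16_t)(~adj_local) & (uint16_t)(~adj_next_hop);

    if (good == 0)
        good = unmasked;                     /* everything congested: round robin over all legal choices */
    for (int i = 1; i <= 12; i++) {
        int c = (last_choice[prio] + i) % 12;
        if (good & (1u << c)) {
            last_choice[prio] = c;           /* separate round robin state per priority */
            return c;
        }
    }
    return -1;                               /* no legal choice programmed for this destination */
}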
The next sub-sections go into the mechanisms behind the equation: where masked_choice, local congestion, and next hop congestion come from.
Chicken bit options should be available to turn off either local_congestion or next_hop_congestion independently.
The unordered route choice mask vector or fault vector is held in the CH-DLUT, which is indexed by either destination bus or destination domain. There are at most 12 unordered choices. Software will program a 1 in the choice mask vector for any choice to avoid for the destination bus (if the same domain) or the domain (if a different domain).
For a fat tree, all choices are equal. If there are only 3 or 6 choices, and not 12, then only 3 or 6 are programmed. The remaining choices are turned off by labeling them as masked choices.
For other topologies, pruning can be applied with the choice mask vector. For example, a 3D torus can have up to 6 choices. Only 1, 2, or 3 will likely head closer to the target—the other choices can be pruned by setting a choice mask bit on them.
For the Argo box, it may be desirable to route traffic between the lower two switches only using the 2×16 links between the switches, and not take a detour through the top switch. This can be accomplished by programming a choice mask on the path to the top switch for those destinations on the other bottom switch.
The egress scheduler is responsible for initiating all congestion feedback. It does so by determining its egress queue fill depth, or fill level, in nanoseconds.
The egress logic will add to the queue depth any time a new TLP arrives on the source queue. If the resolution is 16 B and a header is defined to take 2 units, a 512 B CplD will therefore count as 2+512/16=34 units. A 124 B payload VDM-WR with a prefix will count as 2+128/16=10 units.
The egress logic will subtract from the queue depth any time a TLP is scheduled. The same units are used.
The units will then be scaled according to the egress port bandwidth. An x16 gen3 can consume 2 units per clock, whereas an x1 gen1 can only consume 1 unit in 64 clocks. The ultimate job of the egress scheduler is to determine if the Q-depth in ns is more than a programmable threshold Toff or Ton.
The same thresholds can be used for both low and medium priority. Low priority q-depth count should include low+medium+high priority TLPs (all of them). Medium priority q-depth should not include low priority TLPs, only medium and high priority. It is possible that a low priority threshold is reached but not a medium priority threshold. It should not be possible for a medium threshold to be reached but not a low priority threshold.
A port is considered locally congested if its egress queue has Toff or greater queue fill depth. Hysteresis will be applied so that a port stays off for a while before it turns back on; port will stay off until queue drops to Ton. Queue depth is measured in ns and the count for new TLPs should automatically scale as the link changes width or speed.
The output of the queue depth logic should be a low priority Xoff and a medium priority Xoff per port.
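A minimal sketch of this per-port, per-priority comparison, assuming the queue depth has already been scaled to nanoseconds; the structure and field names are illustrative:

#include <stdbool.h>
#include <stdint.h>

/* Per-port, per-priority Xoff state with Toff/Ton hysteresis. */
typedef struct {
    uint32_t toff_ns;    /* assert Xoff at or above this queue depth   */
    uint32_t ton_ns;     /* deassert Xoff at or below this queue depth */
    bool     xoff;       /* current congestion state                   */
} cong_state_t;

static void update_xoff(cong_state_t *s, uint32_t qdepth_ns)
{
    if (!s->xoff && qdepth_ns >= s->toff_ns)
        s->xoff = true;                      /* port becomes locally congested          */
    else if (s->xoff && qdepth_ns <= s->ton_ns)
        s->xoff = false;                     /* stays off until the queue drains to Ton */
}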
Management software should know to program local congestion values for Ton and Toff to be smaller than for next hop congestion. Hardware doesn't care; it will just use the value programmed for that port.
It will be very valuable to see a count of the number of clocks during which the queue depth ranged between a min and max value. Software can sample the count every 1 sec quite easily, so the count should not saturate even if it counts every clock for 1 s, which is 500M 2 ns clocks. A 32b counter is needed.
The debug would look at just one q-depth for a station.
Each station will track congestion to all ports on that same switch as well as to ports in the next hop. An internal station to station ring is used to send feedback between ports on the same switch. The congestion feedback ring protocol will have the following structure:
All fabric ports will report on the congestion ring in a fixed sequential order. First, station 0 will send out a start pulse, which has local port=5'b11111 and valid=1. This starts the reporting sequence. Every station can use the receipt of the start pulse as a start/reset to sync up when it will send information on the congestion feedback ring (muxing its value onto the ring). Each station is provided 4 slots in the update cycle. The slot for each station is programmable, with the default values as follows:
The order in which a port puts its congestion information on the ring is decided by the slot number programmed by software. By default the sequence is Station 0→Station 1→Station 2→ . . . →Station 5.
A fabric port pointing to the center will send both local and BECN next hop congestion information on the congestion ring. Only fabric ports participate in the congestion ring feedback. EEPROM or management software will program, per station, the slots used on the ring. Up to 24 ports could use the ring, but if only 3 fabric ports are active then only 3 slots will be programmed, reducing the latency to get access to the ring.
A port will determine its slot offset from the start strobe based on the 4 registers in the station.
P0_ring_slot[5b]
P1_ring_slot[5b]
P2_ring_slot[5b]
P3_ring_slot[5b]
An x8 fabric port would use 2 slots, either 0-1 or 2-3, depending on the port location. An x16 would use all 4 slots. An x4 would use the correct 1 slot. The start strobe uses slot 0. So if a port is programmed to ring_slot=1, it would follow the start strobe. If programmed to ring_slot=10, it would follow 10 clocks after the start strobe.
The BECN next hop information can cover low and medium priority for up to 24 ports. If we serially reported each of those changes, the effect would be dreadfully slow. Instead of reporting one bit at a time, we will report multiple ports at once using a bit vector similar to the BECN: 2×12b will give the Xon/Xoff state of 12 ports for medium and low priority, and another 1b will tell which half the ports are in: bottom half or top half.
A fabric port pointing away from center will not have received next hop information, so it will send all 0s on the Next Hop fields. Only local congestion fields will be non-0. This local congestion information is actually the basis used to send next hop congestion on other fabric ports pointing away from the center on the same switch! Basically a port uses its own threshold logic to report local congestion on the ring and it uses BECN received data to report next hop congestion on the ring. No BECN received means no next hop data to report.
A non-fabric port will not send any congestion information on the ring. Instead, it can send the same data as the previous clock, except setting valid to 0, to reduce power.
All stations will monitor the on chip congestion ring.
The local congestion feedback is saved in two places.
First, it is saved in a local congestion bit vector. The reported local port is matched against the choice-to-port array. Any match (more than one choice may point to the same port) results in a 1 being set in the 12b congestion_apply vector. The reported congestion data is then applied, using congestion_apply as a mask, to the local_congestion vector for either medium or low priority as follows:
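A minimal sketch of this update, assuming a 12-entry choice-to-port array per station; the function and argument names are illustrative:

#include <stdbool.h>
#include <stdint.h>

/* Update the per-priority local congestion vectors when a port reports on the ring. */
static void apply_local_feedback(uint16_t *local_congestion_low,
                                 uint16_t *local_congestion_med,
                                 const uint8_t choice_to_port[12],
                                 uint8_t reported_port,
                                 bool xoff_low, bool xoff_med)
{
    uint16_t congestion_apply = 0;

    for (int c = 0; c < 12; c++)
        if (choice_to_port[c] == reported_port)   /* associative match, may hit more than one choice */
            congestion_apply |= (uint16_t)(1u << c);

    if (xoff_low)  *local_congestion_low |= congestion_apply;
    else           *local_congestion_low &= (uint16_t)(~congestion_apply);
    if (xoff_med)  *local_congestion_med |= congestion_apply;
    else           *local_congestion_med &= (uint16_t)(~congestion_apply);
}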
This is how the congestion vector is stored: we have a total of 600 bits, where 300 bits are used for each of the medium and low priorities. Within each 300 bits, 12 bits hold the local congestion information and 24 bits for each of the 12 fabric port choices hold the next hop information, which in total makes 12×24=288 bits. Using the following formula, we derive the final 12 bit vector and choose one of the fabric ports based on the last selection:
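A minimal sketch of one way the next hop term of this final 12-bit vector could be derived from the storage described above, with nh_xoff[] holding the 24-bit per-choice next hop state and nh_port[] standing in for the NH-DLUT and Port_for_Choice lookup of the next hop egress port for the packet's destination (all names are illustrative, not the actual register interface):

#include <stdint.h>

/* nh_xoff[c]: 24-bit Xoff state of the next hop switch reached via choice c.
   nh_port[c]: next hop egress port selected for this destination via choice c.
   unmasked_choices: 1 = choice leads to the destination. */
static uint16_t next_hop_congestion(const uint32_t nh_xoff[12],
                                    const uint8_t nh_port[12],
                                    uint16_t unmasked_choices)
{
    uint16_t cong = 0;

    for (int c = 0; c < 12; c++) {
        if (!(unmasked_choices & (1u << c)))
            continue;
        if (nh_xoff[c] & (1u << nh_port[c]))      /* that choice's next hop egress is Xoff */
            cong |= (uint16_t)(1u << c);
    }
    if ((cong & unmasked_choices) == unmasked_choices)
        cong = 0;                                 /* all congested: treat as no congestion */
    return cong;
}

The local term and the final round robin selection then follow the adj_local_congestion and select_choice sketches given earlier.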
For a case where a next hop is using both domain and source to route packets, such as a Top of Rack switch with local domain connections and inter-domain connections, there should only be 1 choice for the local domain connections and so Next Hop congestion feedback will not do anything. While the next hop feedback is accurate for choice x next_hop_port, the NH_LUT index may not be. To avoid any confusion, a NH_LUT_domain bit will tell hardware to only read the NH_LUT for cases where the TLP targets a different domain if 1, else the NH_LUT will be read only for cases where a TLP targets the same domain.
It may be useful to see the congestion state via inline debug. Each of the recorded states should be available to debug. These include:
Low priority congestion
Medium priority congestion
Total above: 26 sets of ~24b (or less)
The typical debug min/max comparison isn't much good when looking for a particular bit value. Useful feedback would be to count any non-0 state for any one selection of the above. More useful would be the ability to select a particular bit or set of bits in the bit vector and count if any matching Xoff bit is set (say, track if any of 4 ports are congested).
If software has a 5b select (to pick the counter) and a 24b vector to match against, then any time any of the match bits is one for that vector, the count would increase. A 32b count is used with auto-wrap so software does not need to clear the count.
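A minimal sketch of such a match counter, with the 5b vector selection assumed to be already resolved; names are illustrative:

#include <stdint.h>

static uint32_t debug_count;                      /* 32-bit count, wraps automatically */

/* vec: the selected recorded congestion state; match: 24-bit mask of ports of interest. */
static void debug_sample(uint32_t vec, uint32_t match)
{
    if (vec & match & 0x00FFFFFFu)                /* any matching Xoff bit is set this clock */
        debug_count++;
}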
Only fabric ports will give a local congestion response. The management port (for C2 it cannot be a fabric port, but perhaps later it can), host port, or downstream port need never give this feedback. The direction of the fabric port affects how a port reports the congestion, but does not affect the threshold comparison.
Local congestion feedback from portX that says “Xoff” will tell the entire switch to avoid portX for any unordered choice. Each station will look up (associative) portX in the choice to port table to determine which choice(s) target portX.
Software may program 1, 2, or more choices to go to the same portX, which effectively gives portX a weighted choice compared to other choices. Or software may be avoiding a fault and so program two choices to the same port while the fault is active, but have those two choices go to different ports once the fault is fixed.
Hardware will convert the portX local congestion feedback to a local congestion bit vector per priority level, one vector for medium and one vector for low. High priority traffic does not use congestion feedback.
For example, if a local congestion feedback from portX uses choice 1 and 5 and has Xoff set for low priority, then bits[1] and [5] of low_local_congestion would be set. If a later local congestion from portY has Xoff clear for low priority, and portY uses choice 2, then bit[2] of low_local_congest would be cleared.
If *all* legal choices are locally congested, i.e. all 1s, the local congestion filter applied to the legal_choices is set to all 0s since we have to route the packet somewhere.
You may wonder, why not use a count for each choice? Any one station can target any of the 6 stations on a chip. Put another way, there is a fan-in factor of 6 stations to any 1 port in a station. A simple count of traffic sent to one port cannot ever know what other stations sent and so may be off by a factor of 6. Since a count costs a read-modify-write to the RAM and it has dubious accuracy, rather than using a count, hardware will spray the traffic to all possible local ports equally and rely on the local congestion feedback to balance the traffic and avoid hotspots.
There is still a hazard to avoid: namely, avoid having N stations sending to the same port at the same time. Qdepth reflects congestion instantaneously and can be fed back to all ports within the Interstation Bus delay. Qdepth has no memory of what was sent in the past. In the case of a large transient burst targeting one queue, that Qdepth threshold would trigger congestion feedback which should allow that queue time to drain. If the queue does not drain quickly, it will remain Xoff until it finally does drain.
Each source station should have a different choice-to-port map so that, as hardware sequentially goes through the choices, the next port is different for each station. For example, consider x16 ports with 3 stations 0, 1, 2 feeding into 3 choices that point to ports 12, 16, 20. If port 12 is congested, each station will cross the choice that points to port 12 off of its legal choices (by setting a choice_congested[priority] bit). What we want to avoid is having all stations then send to the same next choice, i.e. port 16. If some stations send to port 16 and some to port 20, then the transient congestion has a chance to be spread out more evenly. The method to do this is purely software programming of the choice-to-port vectors. Station 0 may have choices 1, 2, 3 be 12, 16, 20, while station 1 has choices 1, 2, 3 be 12, 20, 16, and station 2 has choices 1, 2, 3 be 20, 12, 16.
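A minimal configuration sketch of the skewed choice-to-port maps from this example (the array name is illustrative; the port numbers are those given above, for choices 1-3 of each station):

#include <stdint.h>

/* Skewed per-station choice-to-port maps (choices 1-3 mapped to ports 12, 16, 20). */
static const uint8_t choice_to_port_map[3][3] = {
    { 12, 16, 20 },    /* station 0 */
    { 12, 20, 16 },    /* station 1 */
    { 20, 12, 16 },    /* station 2 */
};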
A 512 B CplD, which is the common remote read completion size and should be a large percentage of the unordered traffic, will take 134 ns to sink on an x4, 67 ns on an x8, and 34.5 ns on an x16. If we can spray the traffic to a minimum of 3 different x4 ports, then as long as we get feedback within 100 ns or so, the feedback will be as accurate as a count from this one station, and much more accurate if many other stations targeted that same port in the same time period.
For a switch that has no choice of which port to route to, congestion feedback from that one port is helpful if sent to a prior hop back where there was a choice. From the point of view of where the choice is made, this is next hop congestion feedback.
For example, in a Fat Tree the middle switch may have one port congested heading to an edge switch. Next hop congestion feedback will tell the other edge switches to avoid this one center switch for any traffic heading to the one congested port.
In a 5-stage Fat Tree, using rank0 on the edge, rank1 next, and rank2 in the middle, there is an opportunity for next hop feedback from the rank2 to the rank1 switch as well as from the rank1 to the rank0 switch. The rank1 to rank0 feedback gets complicated. Next hop feedback can certainly be applied for any away-from-center port on the rank1 switch, because there is only one port that is the target for a particular destination. But if there are multiple rank1 to rank2 ports that 'subtractive decode', the final destination could be reached by using any of them and we have no way to apply the next hop congestion for all cases. What we can do is record the congestion correctly, but we would only be able to use congestion for one of the choices, as we use NH_LUT[destination] to pick the next hop port for any one choice. Since the rank1 switch is seeing local congestion in this case, it should be trying to balance the traffic to other choices. If there are 3 choices in the rank1 switch, then ⅓ of the time the rank0 switch will help the rank1 switch avoid the congestion.
For a non-fat tree, the next hop congestion can help find a better path. The congestion thresholds would have to be set higher, as there is blocking and so congestion will develop. But for the traffic pattern where there is a solution that does not congest, the next hop congestion avoidance ought to help find it. Similar to the 5-stage fat tree, where the rank1 feedback cannot all be used by the rank0 switch, for a 3D torus the next hop feedback only applies for the one port given by the NH-LUT[destination] choice.
Hardware will use the same congestion reporting ring as local feedback, such that the congested ports can send their state to all other ports on the same switch. A center switch could have 24 ports, so feedback for all 24 ports is needed. [The x1 port would not be considered as it should not have significant unordered traffic]
If the egress queue exceeds Toff ns, then an Xoff status will be sent. If the queue drops back to Ton ns or less, then an Xon status will be sent.
Because the feedback must travel across a link, perhaps waiting behind a max length (512 B) packet, the next hop congestion feedback must turn back on before all traffic can drain. An x4 port can send 512+24 in 134 ns. A switch in-to-out latency is around 160 ns. So an Xoff to Xon could take 300 ns to get to the port making a choice to send a packet, which then would take another ˜200 ns to get the TLP to the next hop. Therefore, Xon threshold must be at least 500 ns of queue. Xoff would represent significant congestion, perhaps a queue of 750 ns to 1000 ns.
Next hop congestion feedback applies to more than just 1 hop from the center. For a 5-stage fat tree, it can also be used at the 1st stage to get feedback from the small set of away-from-center choices at the 2nd stage.
Next hop congestion feedback will use a BECN to send information between switches. Every away from center port will send a BECN if the next hop port stays in Xoff state. We don't want to trigger it too often.
BECN stands for Backwards Early Congestion Notification. It is a concept adapted from Advanced Switching.
Next hop congestion feedback is communicated between switches using a BECN carried in a Vendor Defined DLLP with a Reserved encoding type. Every away-from-center fabric port will send a BECN if the next hop port stays in the Xoff state.
The above VD-DLLP is sent if any of the ports has Xoff set. This DLLP is treated as a high priority DLLP. The two BECNs are sent in a burst if both low and medium priorities are congested at the same time.
The first time any one port's threshold triggers Xoff for a chip, a BECN will be scheduled immediately for that priority. From that point, subsequent BECNs will be scheduled periodically as long as at least one of the ports remains Xoff. The periodicity of the Xoff DLLP is controlled by the following programmable register:
The Xoff update period should be programmed so that it does not hog the bus and create a deadlock. For example, on an x1 Gen1 link, if the update period is 20 ns then a DLLP is scheduled every 20 ns, yet it takes 24 ns to send the two DLLPs for low and medium priority; no TLP would ever be scheduled, the congestion would never clear, and a deadlock would result, since DLLPs are scheduled periodically as long as there is congestion. Whenever the timer counts down to 0, each qualified port in a station will save the active quartile 4b state (up to 4 copies), and then attempt to schedule a burst of BECNs. The Xoff vector for the BECN is simply the corresponding low and medium BECN state saved in the station. Each active quartile will have one BECN sent until there are no more active quartiles to send. The transmission of BECNs is enabled by the Congestion management control register.
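A minimal sketch of this scheduling, treating the four BECN sets as the {low, medium} × {bottom 12 ports, top 12 ports} combinations; send_becn_dllp() is a hypothetical transmit hook and all other names are illustrative:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical transmit hook: send one BECN DLLP carrying a 12-port Xoff vector. */
static void send_becn_dllp(bool medium, bool top_half, uint16_t xoff12)
{
    (void)medium; (void)top_half; (void)xoff12;   /* placeholder */
}

typedef struct {
    uint32_t timer;          /* counts down in clocks                            */
    uint32_t period_clks;    /* programmed Xoff update period                    */
    bool     armed;          /* true once the first Xoff sent an immediate BECN  */
} becn_sched_t;

static void becn_tick(becn_sched_t *s, uint32_t xoff_low24, uint32_t xoff_med24)
{
    bool any_xoff = ((xoff_low24 | xoff_med24) & 0x00FFFFFFu) != 0;

    if (!any_xoff) {                       /* nothing congested: stop the periodic BECNs */
        s->armed = false;
        return;
    }
    if (!s->armed) {                       /* first Xoff: schedule a BECN immediately */
        s->armed = true;
        s->timer = 0;
    }
    if (s->timer == 0) {
        for (int prio = 0; prio < 2; prio++) {                /* 0 = low, 1 = medium */
            uint32_t v = prio ? xoff_med24 : xoff_low24;
            if (v & 0x000FFFu)
                send_becn_dllp(prio != 0, false, (uint16_t)(v & 0xFFFu));          /* bottom 12 ports */
            if (v & 0xFFF000u)
                send_becn_dllp(prio != 0, true,  (uint16_t)((v >> 12) & 0xFFFu));  /* top 12 ports    */
        }
        s->timer = s->period_clks;
    } else {
        s->timer--;
    }
}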
New BECNs will be sent as frequently as some programmable spread period "Tspread" per priority (2 values). There is jitter J on the receive side. J can be bounded by the time to send an MPS TLP plus a few DLLPs. The time between received BECNs would be (Tspread−J)<=time<=(Tspread+J).
For the common case of avoiding a constant ordered flow, there is no hurry to get back to using that congested path. There is little harm in over stalling a congested flow—the link worst case would be out of data for a short time. Long term, throughput will be maintained as, even if all paths are congested, the packet will be sent to one of the non masked choices.
BECN_low_threshold is compared against (low+medium+high) count
BECN_medium_threshold is compared against (medium+high) count
Could medium be Xoff and low not? From the thresholds it could be; what to do in that case remains open. Low has some guaranteed bandwidth, so low could make progress if medium is congested.
The BECN information needs to be stored by the receiver. The receiver will update the other ports in its switch via the internal congestion feedback ring.
These are the same bits carried by the feedback ring, and the 24×2 flops should hold the information on the Tx side of the link.
Like all DLLPs, the Vendor Defined DLLPs are lossy. If a BECN DLLP is lost, then the congestion avoidance indicator will be missed for the time period. As long as congestion persists, BECNs will be periodically sent.
A port that may transmit a BECN is by definition an 'away from center' fabric port. BECNs only need to be sent if at least one port has congestion for either medium or low priority.
The first time any one port's threshold triggers Xoff for a chip, a BECN will be scheduled immediately. From that point, subsequent BECNs will be scheduled periodically as long as at least one port remains Xoff. The period should match the time to send a 512 B CplD on the wire, such that a BECN 'burst' is sent after each 512 B CplD. A BECN burst can be 1, 2, 3, or 4 BECN DLLPs (costing 8 B to 32 B on the wire). A BECN DLLP is only sent if at least one of the bits in its Xoff vector is set to one.
An x16 port can send 532 B in 33.25 ns, an x8 in 66.5 ns, and an x4 in 133 ns. If each of the 4 BECNs can be coalesced (separately), then BECN bursts can be scheduled at a maximum rate of one burst every 30 ns, and if there is a TLP already in flight, the BECN will wait. x16 will get a BECN burst every 30 ns, x8 every 60 ns, and x4 every 120 ns. The worst case spread of two BECNs is therefore (time to send 1 MPS TLP+BECN period).
Any port that receives a DLLP with new BECN information will need to save that information in its own Xoff vector. The BECN receiver is responsible to track changes in Xoff and broadcast the latest Xoff information to other ports on the switch. The congestion feedback ring is used with BECN next hop information riding along with the local congestion.
Since the BECN rides on a DLLP which is lossy, a BECN may not arrive. Or, if the next hop congestion has disappeared, a BECN may not even be sent. The BECN receiver must take care of ‘auto Xon’ to allow for either of these cases.
The most important thing is for a receiver not to turn a next hop Xon if it should stay off. Lost DLLPs are so rare as to not be a concern. However, DLLPs can be stalled behind a TLP, and they often are. The BECN receiver must tolerate a Tspread+/−Jitter range, where Tspread is the transmitter BECN rate and Jitter is the delay due to TLPs between BECNs.
Upon receipt of a BECN a counter will be set to Tspread+Jitter. Since the BECN VD-DLLPs should arrive in a burst, a single timer can cover all 4 BECN sets. If the counter gets to 0 before another BECN of any type is received, then all Xoff are cleared. The BECN receiver also sits on the on chip congestion ring. Each time slot it gets on the ring, it will send out information for 12 ports for both medium and low priority queue. The BECN receiver must track which port has had a state change since the last time the on chip congestion ring was updated. The state change could be Xoff to Xon or Xon to Xoff. If there were two state changes or more, that is fine—record it as a state change and report the current value.
More than one path may exist from a source to destination in the fabric. For example in the 3×3 fabric shown in
Note: the logic described below exists independently for both medium and low priority.
The 2-stage path information is saved in the local and next hop Destination LUTs respectively. The Local DLUT is indexed by destination bus (if the domain of the TLP is the current domain) or by domain number (if it is not).
The fault vector or masked choice gives the list of fabric ports to which the unordered TLP may be routed. The masked choice is a 12 bit vector where each bit, when cleared, represents a valid path for the TLP. The port mapping of each bit in the masked choice vector is located in the GEP_MM_STN map starting at offset 1000h.
For example, if the masked choice vector is 12'hFFC and the ports for choices 0 and 1 at offset 1000h are 4 and 5 respectively, then ports 4 and 5 are the two possible choices for the current unordered TLP.
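A minimal sketch of this expansion; the function name and array layout are illustrative:

#include <stdint.h>

/* Expand a masked choice vector into the candidate egress ports using the per-station
   port-of-choice table at offset 1000h. Returns the number of candidates. */
static int candidate_ports(uint16_t masked_choice,            /* bit clear = valid choice */
                           const uint8_t port_of_choice[12],
                           uint8_t ports_out[12])
{
    int n = 0;

    for (int c = 0; c < 12; c++)
        if (!(masked_choice & (1u << c)))
            ports_out[n++] = port_of_choice[c];
    return n;
}

/* With masked_choice = 0xFFC and port_of_choice[0..1] = {4, 5}, this returns 2 and
   ports_out[] = {4, 5}, matching the example above. */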
Similarly, the next hop path for the current TLP is stored in the Next Hop Destination LUT, which is addressed by the destination bus in the current unordered TLP. If two headers arrive on a single clock, then only the TLP on beat 1 will be considered for unordered routing, to keep the number of Next Hop DLUT RAM instances to 1. If, for a particular destination bus, all the next hop paths from a given fabric port are faulty, then software should also remove that fabric port from the current hop DLUT for that destination bus.
Each Next Hop DLUT entry has 8 bits for each fabric port (96 bits total), where the 2 MSBs select which of the 4 port-to-choice vector tables the remaining 6 bits map into. In this way we can selectively cover 24 ports.
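A minimal sketch of decoding one such 8-bit field and expanding it through the selected Port_for_Choice table into a 24-bit next hop port vector; the names are illustrative, and the polarity of the low 6 bits is assumed here to be 'set = choice selected' (it may instead follow the masked-choice convention where a cleared bit is valid):

#include <stdint.h>

/* Decode one 8-bit NH-DLUT field and expand it through the selected Port_for_Choice
   table (4 tables of 6 five-bit port numbers) into a 24-bit next hop port vector. */
static uint32_t nh_masked_ports(uint8_t entry,
                                const uint8_t port_for_choice[4][6])
{
    uint8_t table_sel = (uint8_t)(entry >> 6);    /* [7:6]: which of the 4 tables  */
    uint8_t nh_choices = (uint8_t)(entry & 0x3F); /* [5:0]: next hop choice bits   */
    uint32_t port_vec = 0;

    for (int c = 0; c < 6; c++)
        if (nh_choices & (1u << c))               /* assumed polarity: set = selected */
            port_vec |= 1u << port_for_choice[table_sel][c];
    return port_vec;
}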
The following would be the format of the choice vector:
So we need 120 flops per fabric port for the port-of-choice mapping in the NH-LUT. The following register implements the Next Hop Port to Choice mapping.
The Port of Choice registers for fabric ports 1-11 exist in the address range 1080h-10DCh, in sequence.
The following example, illustrated in
The source is S0 1104 and the three destinations are D0 1108, D1 1112, and D2 1116.
Now when software programs the NH-LUT (Next Hop Look Up Table) entry for the D0 1108 (destination bus D0) index, the 8 bit entry would be
So when the choice-to-port conversion is done, it will indicate that all four ports exist as choices for destination bus D0.
Similarly, for D1 1112 the choice vector would be
For D2 1116 the vector can be [7:0]=00_101111
This will refer to the Port of Choice table for fabric port 0, Choice 0 (registers 1060h-1064h), which has port 16 in the 4th entry.
The arbiter chooses each path in round robin fashion to balance the traffic. Sometimes some of the paths might be congested (have higher latency) because they might be carrying ordered traffic. Hence, a good choice would be to send the TLP on a path which is not congested. The arbiter makes its decision using the last path selected along with the congestion information. Each station keeps track of the congestion of all the fabric ports in the switch along with the next hop port congestion information.
The congestion information within the chip is communicated using a congestion feedback ring, which is described in the next section. For the center switch we can have all 24 ports as fabric ports. To save the congestion information we will need 24 bits (local congestion information) + 12×24 = 288 bits (next hop congestion information) = 312 bits.
The congestion information is saved in each station in the following format, as shown in
The next hop congestion information is communicated using a Vendor Defined DLLP, which is described in a coming section.
The final congestion vector is derived by the following logic:
If (all local choices are congested and all next hop choices are congested), the congestion filters are ignored and the selection falls back to round robin among all legal (un-masked) choices.
Else, only a choice which is not congested at both levels is considered.
The local domain bus is mapped from 256-511 in the next hop DLUT and remote domain is mapped from 0-255.
The congestion information between the stations is exchanged on the congestion ring, as shown in
Each port reports its congestion information along with the next hop congestion information of the switch it is connected to. If there is no change in Xoff information since the last time a station updated its information on the bus, then it puts the same data as last time in its slot. The congestion information is named Xoff and represents congestion when it is set. The congestion information is separate for low and medium priority packets. The next hop congestion information is reported by a Vendor Defined DLLP with a Reserved encoding type. The following table specifies the fields used on the congestion bus:
Each station gets slot numbers per update cycle, which are programmed by software using the following register:
To meet timing, a number of pipeline stages might be added, which adds additional latency to the bus. The update on the congestion ring starts with a start pulse, where Station 0 puts the local port number as 5'b11111 and the valid field (bit 33) as 1. Slot 0, out of the total number of slots, is reserved for the start pulse and should not be assigned to any station; in other words, the slot assignment starts from slot 1. After the start pulse, the station which is assigned slot 1 puts its congestion information on the ring, followed by slot 2 and so on. Station 0 sends the start pulse again once the maximum number of slots has been placed on the bus. Each station maintains a local counter which is synchronized by the arrival of the start pulse.
Each port maintains a counter to keep track of the number of DWs in the egress queue. This count ultimately determines the latency for a newly scheduled packet to be put on the wire, which depends upon the physical bandwidth of the port. The counter has 4 DW (4 double word, or 16 byte) granularity; it is incremented when the scheduler puts a TLP on the queues and decremented by the number of DWs scheduled by the scheduler. This counter is used to decide the congestion status of a port and is maintained individually for each of the medium and low priority queues. The management software is responsible for programming the Xmax/Xmin thresholds. The port is congested, or Xoffed, if the count crosses the Xmax threshold, and not congested, or Xoned, if the count is below the Xmin threshold. The low priority counter is incremented if any low, medium, or high priority TLP is scheduled, and likewise for decrement. The medium priority counter is incremented if any medium or high priority TLP is scheduled. For every header the counter is incremented/decremented by 2 instead of 1, as this accounts for the overhead associated with every TLP. If the payload is less than 1 unit then the counter will not be incremented or decremented for the payload. The station based threshold register is shown below.
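A minimal sketch of this counter, assuming payload sizes are supplied in DWs; the Xmax/Xmin fields and function names are illustrative:

#include <stdbool.h>
#include <stdint.h>

/* Egress queue depth counter in 4 DW (16 B) units, one instance per priority queue. */
typedef struct {
    uint32_t count;    /* queue depth in 4 DW units */
    uint32_t xmax;     /* Xoff threshold            */
    uint32_t xmin;     /* Xon threshold             */
    bool     xoff;
} q_counter_t;

/* Every header counts as 2 units; a payload below one unit adds nothing. */
static uint32_t tlp_units(uint32_t payload_dw) { return 2 + payload_dw / 4; }

static void tlp_enqueued(q_counter_t *q, uint32_t payload_dw)
{
    q->count += tlp_units(payload_dw);
    if (q->count >= q->xmax)
        q->xoff = true;                           /* port is congested (Xoffed)    */
}

static void tlp_scheduled(q_counter_t *q, uint32_t payload_dw)
{
    uint32_t u = tlp_units(payload_dw);

    q->count = (q->count > u) ? q->count - u : 0;
    if (q->count <= q->xmin)
        q->xoff = false;                          /* port is not congested (Xoned) */
}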
Since the feedback must travel across a link, perhaps waiting behind a max length (512 B) packet, the next hop congestion feedback must turn back on before all traffic can drain. An x4 port can send 512+24 in 134 ns. A switch in-to-out latency is around 160 ns. So an Xoff to Xon could take 300 ns to get to the port making a choice to send a packet, which would then take another ~200 ns to get the TLP to the next hop.
Next hop congestion feedback is communicated using a BECN, which is a PCIe DLLP with a Reserved encoding type, to send information between switches. Every away-from-center fabric port will send a BECN if the next hop port stays in the Xoff state.
The above VD-DLLP is sent if any of the ports has Xoff set. This DLLP is treated as a high priority DLLP. The two BECNs are sent in a burst if both low and medium priorities are congested at the same time.
The first time any one port's threshold triggers Xoff for a chip, a BECN will be scheduled immediately for that priority. From that point, subsequent BECNs will be scheduled periodically as long as at least one of the ports remains Xoff. The periodicity of the Xoff DLLP is controlled by the following programmable register:
The Xoff update period should be programmed so that it does not hog the bus and create a deadlock. For example, on an x1 Gen1 link, if the update period is 20 ns then a DLLP is scheduled every 20 ns, yet it takes 24 ns to send the two DLLPs for low and medium priority; no TLP would ever be scheduled, the congestion would never clear, and a deadlock would result, since DLLPs are scheduled periodically as long as there is congestion. Whenever the timer counts down to 0, each qualified port in a station will save the active quartile 4b state (up to 4 copies), and then attempt to schedule a burst of BECNs. The Xoff vector for the BECN is simply the corresponding low and medium BECN state saved in the station. Each active quartile will have one BECN sent until there are no more active quartiles to send. The transmission of BECNs is enabled by the Congestion management control register.
Any port that receives a DLLP with new BECN information will need to save that information in its own Xoff vector. The BECN receiver is responsible for tracking changes in Xoff and broadcasting the latest Xoff information to the other ports on the switch. Each fabric port maintains a 24 bit next hop congestion vector. The congestion feedback ring is used, with the BECN next hop information riding along with the local congestion. A port only publishes on the congestion feedback ring the Xoff information which has changed since its last time slot.
The Xoff is not sent by the transmitter if the congestion has disappeared. Sometimes the DLLP might even be lost because of the lossy medium. Hence an auto Xon feature is implemented in the receiver. The receiver maintains a timer and a counter to implement this auto Xon feature. The timer, which is programmable and is one per fabric port, keeps track of when the next Xoff DLLP should arrive. A 2 bit counter is maintained per next hop port; it is incremented when the corresponding Xoff bit is set in an incoming BECN and decremented when the previously described timer expires. When the count reaches 0 the port state is changed to Xon.
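A minimal sketch of this auto Xon mechanism, with a saturating 2-bit counter per next hop port and a per-fabric-port timer; names are illustrative:

#include <stdint.h>

/* Auto Xon on the BECN receiver: a programmable timer per fabric port and a
   saturating 2-bit counter per next hop port. */
typedef struct {
    uint32_t timer;            /* counts down; reloaded whenever a BECN arrives              */
    uint32_t expected_period;  /* programmed time by which the next Xoff DLLP should arrive  */
    uint8_t  cnt[24];          /* 2-bit counter per next hop port                            */
    uint32_t xoff;             /* 24-bit next hop Xoff vector                                */
} becn_rx_t;

static void becn_received(becn_rx_t *rx, uint32_t xoff_bits24)
{
    rx->timer = rx->expected_period;
    for (int p = 0; p < 24; p++)
        if (xoff_bits24 & (1u << p)) {
            if (rx->cnt[p] < 3)            /* saturate at the 2-bit maximum */
                rx->cnt[p]++;
            rx->xoff |= 1u << p;
        }
}

static void becn_timer_tick(becn_rx_t *rx)
{
    if (rx->timer == 0 || --rx->timer != 0)
        return;                            /* timer not running or not yet expired  */
    for (int p = 0; p < 24; p++)           /* expiry: decay each counter            */
        if (rx->cnt[p] > 0 && --rx->cnt[p] == 0)
            rx->xoff &= ~(1u << p);        /* auto Xon when the count reaches 0     */
    rx->timer = rx->expected_period;       /* keep decaying if no further BECNs     */
}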
Finally, the Xoff congestion information is communicated to the TIC, which has the format shown in
While a specific example of a PCIe fabric has been discussed in detail, more generally, the present invention may be extended to apply to any switch that includes multiple paths some of which may suffer congestion. Thus, the present invention has potential application for other switch fabrics beyond those using PCIe.
While the invention has been described in conjunction with specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. The present invention may be practiced without some or all of these specific details. In addition, well known features may not have been described in detail to avoid unnecessarily obscuring the invention. In accordance with the present invention, the components, process steps, and/or data structures may be implemented using various types of operating systems, programming languages, computing platforms, computer programs, and/or general purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used without departing from the scope and spirit of the inventive concepts disclosed herein. The present invention may also be tangibly embodied as a set of computer instructions stored on a computer readable medium, such as a memory device.
This application incorporates by reference, in their entirety and for all purposes herein, the following U.S. patents and pending applications: Ser. No. 14/231,079, filed Mar. 31, 2014, entitled, “MULTI-PATH ID ROUTING IN A PCIE EXPRESS FABRIC ENVIRONMENT.”