High performance computing (HPC) clusters, cloud computing datacenters, and other large-scale computing networks may communicate over a high-speed input/output fabric such as an InfiniBand™ fabric. The InfiniBand™ architecture may transfer data using switched, point-to-point channels between endnodes. In the InfiniBand™ architecture, an endnode may be identified within a subnet using a 16-bit local identifier (LID). Routing in InfiniBand™ networks is distributed, based on forwarding tables stored in each switch. The forwarding table of an InfiniBand™ switch may store a single destination port per destination LID. Therefore, routing in InfiniBand™ may be static and deterministic.
Congestion in network communications may occur when demand for a network link exceeds available bandwidth or other network resources. In InfiniBand™ networks, a congestion control agent may monitor for network congestion and communicate with network hosts to reduce data injection rates for network traffic causing congestion.
The concepts described herein are illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. Where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements.
While the concepts of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described herein in detail. It should be understood, however, that there is no intent to limit the concepts of the present disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives consistent with the present disclosure and the appended claims.
References in the specification to “one embodiment,” “an embodiment,” “an illustrative embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may or may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described. Additionally, it should be appreciated that items included in a list in the form of “at least one of A, B, and C” can mean (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C). Similarly, items listed in the form of “at least one of A, B, or C” can mean (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C).
The disclosed embodiments may be implemented, in some cases, in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. A machine-readable storage medium may be embodied as any storage device, mechanism, or other physical structure for storing or transmitting information in a form readable by a machine (e.g., a volatile or non-volatile memory, a media disc, or other media device).
In the drawings, some structural or method features may be shown in specific arrangements and/or orderings. However, it should be appreciated that such specific arrangements and/or orderings may not be required. Rather, in some embodiments, such features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of a structural or method feature in a particular figure is not meant to imply that such feature is required in all embodiments and, in some embodiments, may not be included or may be combined with other features.
Referring now to
Each managed network device 102 may be embodied as any network device capable of forwarding or controlling fabric traffic, such as a managed switch. The illustrative managed network device 102 includes a number of fabric ports 120, a switch logic 122, and a management logic 124. Each fabric port 120 may be connected to a fabric link 106, which in turn may be connected to a remote device such as a computing node 104 or another managed network device 102. The illustrative managed network device 102 includes three fabric ports 120a through 120c; however, in other embodiments the managed network device 102 may include additional or fewer ports 120 to support a different number of fabric links 106.
The switch logic 122 may be embodied as any hardware, firmware, software, or combination thereof configured to forward data packets received on the ports 120 to appropriate destination ports 120. For example, the switch logic 122 may be embodied as a shared memory switch or a crossbar switch, and may include a scheduler, packet processing pipeline, linear forwarding tables, port group forwarding tables, port group tables, and/or any other switching logic. In some embodiments, the switch logic 122 may be embodied as one or more application-specific integrated circuits (ASICs).
The management logic 124 may be embodied as any control circuit, microprocessor, or other logic block that may be used to configure and control the managed network device 102. For example, the management logic 124 may initialize the managed network device 102 and its components, control the configuration of the managed network device 102 and its components, provide a testing interface to the managed network device 102, or provide other management functions. The management logic 124 may be configured by changing the values of a number of data tables including a port group forwarding table and/or a port group table. The fabric manager 108 may communicate with the management logic 124 using an in-band management interface by transmitting specially formatted management datagrams (MADs) over the fabric links 106. Additionally or alternatively, the management logic 124 may communicate with the fabric manager 108 over a management interface such as one or more PCI Express host interfaces, a test interface, or one or more low-speed interfaces such as an I2C interface, a JTAG interface, an SPI interface, an MDIO interface, an LED interface, or a GPIO interface.
Each computing node 104 may be embodied as any type of computation or computer device capable of performing the functions described herein, including, without limitation, a computer, a server, a rack-mounted server, a blade server, a network appliance, a web appliance, a multiprocessor system, a distributed computing system, a processor-based system, a mobile computing device, and/or a consumer electronic device. As shown in
The processor 140 may be embodied as any type of processor capable of performing the functions described herein. For example, the processor 140 may be embodied as a single or multi-core processor(s), digital signal processor, microcontroller, or other processor or processing/controlling circuit. The processor 140 further includes a host fabric interface 142. The host fabric interface 142 may be embodied as any communication interface, such as a network interface controller, communication circuit, device, or collection thereof, capable of enabling communications between the processor 140 and other remote computing nodes 104 and/or other remote devices over the fabric links 106. The host fabric interface 142 may be configured to use any one or more communication technology and associated protocols (e.g., the Intel® Omni-Path Architecture) to effect such communication. Although illustrated as including a single processor 140, it should be understood that each computing node 104 may include multiple processors 140, and each processor 140 may include an integrated host fabric interface 142.
Similarly, the memory 146 may be embodied as any type of volatile or non-volatile memory or data storage capable of performing the functions described herein. In operation, the memory 146 may store various data and software used during operation of the computing node 104 such as operating systems, applications, programs, libraries, and drivers. The memory 146 is communicatively coupled to the processor 140 via the I/O subsystem 144, which may be embodied as circuitry and/or components to facilitate input/output operations with the processor 140, the memory 146, and other components of the computing node 104. For example, the I/O subsystem 144 may be embodied as, or otherwise include, memory controller hubs, input/output control hubs, firmware devices, communication links (e.g., point-to-point links, bus links, wires, cables, light guides, printed circuit board traces, etc.) and/or other components and subsystems to facilitate the input/output operations. In some embodiments, the I/O subsystem 144 may form a portion of a system-on-a-chip (SoC) and be incorporated, along with the processor 140, the memory 146, and other components of the computing node 104, on a single integrated circuit chip. The data storage device 148 may be embodied as any type of device or devices configured for short-term or long-term storage of data such as, for example, memory devices and circuits, memory cards, hard disk drives, solid-state drives, or other data storage devices.
The communication circuitry 150 of the computing node 104 may be embodied as any communication interface, such as a communication circuit, device, or collection thereof, capable of enabling communications between the computing node 104 and one or more remote computing nodes 104, managed network devices 102, switches, remote hosts, or other devices. The communication circuitry 150 may be configured to use any one or more communication technology (e.g., wired or wireless communications) and associated protocols (e.g., Intel® Omni-Path Architecture, InfiniBand®, Ethernet, Bluetooth®, Wi-Fi®, WiMAX, etc.) to effect such communication. In particular, the communication circuitry 150 includes a port 152 that connects to a fabric link 106. Although illustrated as including a single port 152, in some embodiments each computing node 104 may include multiple ports 152.
Each of the fabric links 106 may be embodied as any point-to-point communication link capable of connecting two ports 120, 152 of the system 100. For example, a fabric link 106 may connect a port 152 of a computing node 104 with a port 120 of a managed network device 102, may connect two ports 120 of two managed network devices 102, and so on. Each fabric link 106 allows communications in both directions. Each fabric link 106 may be embodied as a serial data communication link such as a copper cable, copper backplane, fiber optic cable, or silicon photonics link, and may include multiple communication lanes (e.g., four lanes) to increase total bandwidth. Each fabric link 106 may signal data at a wire speed such as 12.5 Gb/s or 25.78125 Gb/s.
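As a rough worked example of how multiple lanes increase total bandwidth, the following sketch multiplies a per-lane wire speed by the lane count. The function name and the default of four lanes are illustrative assumptions; the result is a raw signaling rate only, and the usable data rate after line encoding and protocol overhead, which this description does not specify, would be lower.

```python
def aggregate_signal_rate_gbps(lane_rate_gbps: float, lanes: int = 4) -> float:
    """Raw signaling rate of a multi-lane link, before any encoding overhead.

    Hypothetical helper for illustration; not part of any fabric API.
    """
    if lanes < 1:
        raise ValueError("a link needs at least one lane")
    return lane_rate_gbps * lanes


# The two wire speeds named in the text, each over four lanes.
slow_link = aggregate_signal_rate_gbps(12.5)      # 50.0 Gb/s raw
fast_link = aggregate_signal_rate_gbps(25.78125)  # 103.125 Gb/s raw
```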
The fabric manager 108 is configured to initialize and otherwise manage the managed network devices 102, computing nodes 104, and other hosts, gateways, and/or other devices of the system 100. The fabric manager 108 may be embodied as any type of server computing device, network device, or collection of devices, capable of performing the functions described herein. In some embodiments, the system 100 may include multiple fabric managers 108 of which a primary fabric manager 108 may be selected. As such, the fabric manager 108 may be embodied as a single server computing device or a collection of servers and associated devices. Accordingly, although the fabric manager 108 is illustrated in
Referring now to
The packet ingress module 202 is configured to receive and process data packets from the ports 120. In particular, the packet ingress module 202 is configured to extract a destination local identifier (DLID) from a received data packet. The DLID may be embodied as a binary value having a configurable length (e.g., 32, 24, 20, or 16 bits wide, or any other appropriate width). The DLID identifies the destination end point (e.g., a destination computing node 104) of the data packet.
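The configurable DLID width can be pictured with a minimal masking sketch. It assumes, purely for illustration, that the DLID occupies the low-order bits of a hypothetical integer header field; the actual packet header layout is not given here, and `extract_dlid` is not a name from any real library.

```python
def extract_dlid(header_field: int, width: int = 16) -> int:
    """Keep the low-order `width` bits of a header field as the DLID.

    The field position is an assumption made for this sketch only.
    """
    if width < 1:
        raise ValueError("DLID width must be positive")
    return header_field & ((1 << width) - 1)


# The same raw field read at three of the widths mentioned in the text.
dlid16 = extract_dlid(0x12345678, 16)  # 0x5678
dlid20 = extract_dlid(0x12345678, 20)  # 0x45678
dlid32 = extract_dlid(0x12345678, 32)  # 0x12345678
```

A wider DLID enlarges the address space, and with it the size of any table indexed directly by DLID.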
The static route module 204 is configured to determine a statically routed destination port 120 of the managed network device 102 as a function of the DLID. The static route module 204 may, for example, look up the destination port 120 in a forwarding table using the DLID. The static route module 204 may be configured to forward the data packet to the statically routed destination port 120 if that destination port 120 is not congested.
The congestion monitoring module 206 is configured to determine whether the statically routed destination port 120 is congested. The congestion monitoring module 206 may use any appropriate congestion metric or other monitoring technique to determine whether the destination port 120 is congested. In some embodiments, the congestion monitoring module 206 may determine whether a particular subdivision of fabric link 106 for the destination port 120 is congested (e.g., a particular virtual lane, service channel, or associated service level).
The adaptive route module 208 is configured to determine a port group based on the DLID. Each port group identifies two or more ports 120 of the managed network device 102. Port groups may overlap, and each DLID is associated with exactly one port group. The adaptive route module 208 is further configured to dynamically select a destination port 120 of the port group when the statically routed destination port 120 is congested and then forward the data packet to the dynamically selected destination port 120. The adaptive route module 208 may use any one or more strategies for selecting the destination port 120 (e.g., random selection, greedy/least-loaded selection, and/or greedy random selection).
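The three selection strategies named above might be sketched as follows. This is an illustration under stated assumptions, not the device's implementation: the `load` mapping stands in for whatever per-port load metric a device uses (for example, queue depth), and the function and strategy names are hypothetical.

```python
import random
from typing import Mapping, Optional, Sequence


def select_port(ports: Sequence[int],
                load: Mapping[int, int],
                strategy: str = "greedy_random",
                rng: Optional[random.Random] = None) -> int:
    """Pick one port from a port group using the named strategy."""
    if not ports:
        raise ValueError("port group is empty")
    if rng is None:
        rng = random.Random()
    if strategy == "random":
        # Random selection: any port of the group, ignoring load.
        return rng.choice(list(ports))
    least = min(load[p] for p in ports)
    least_loaded = [p for p in ports if load[p] == least]
    if strategy == "greedy":
        # Greedy selection: the first least-loaded port.
        return least_loaded[0]
    if strategy == "greedy_random":
        # Greedy random selection: random choice among least-loaded ports.
        return rng.choice(least_loaded)
    raise ValueError(f"unknown strategy: {strategy}")


example_load = {1: 5, 2: 0, 3: 0}
greedy_choice = select_port([1, 2, 3], example_load, "greedy")  # port 2
```

Greedy random selection breaks ties among equally loaded ports at random, which may help avoid steering every rerouted flow to the same port.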
The management module 210 is configured to manage the configuration of the managed network device 102. The management module 210 may store or otherwise manage one or more configuration registers, data tables, or other management information that may be used to configure the managed network device 102. For example, in some embodiments, the management module 210 may manage a linear forwarding table, a multicast forwarding table, a port group forwarding table, and/or a port group table. The management module 210 may be configured to receive commands, data, and other management information from the fabric manager 108.
Referring now to
In block 304, the managed network device 102 determines the statically routed destination port 120 based on the DLID. The statically routed destination port 120 may be a predetermined destination port 120 of the managed network device 102 that has been associated with the DLID. The managed network device 102 may look up the statically routed destination port 120 in one or more data tables. Those data tables may be configured or otherwise maintained by the fabric manager 108. In some embodiments, in block 306 the managed network device 102 may look up the destination port 120 in a linear forwarding table. The managed network device 102 may use the DLID as an index into the linear forwarding table and retrieve a port number or other data identifying the destination port 120.
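The linear forwarding table lookup can be illustrated with a short sketch in which the table is a plain list indexed by DLID. The representation, the function name, and the use of `None` for an unprogrammed entry are assumptions made for illustration only.

```python
from typing import List, Optional


def lookup_static_port(lft: List[Optional[int]], dlid: int) -> Optional[int]:
    """Return the port stored at index `dlid`, or None if absent."""
    if 0 <= dlid < len(lft):
        return lft[dlid]
    return None


# A tiny table: DLID 3 routes to port 1, DLID 5 routes to port 2.
example_lft: List[Optional[int]] = [None] * 8
example_lft[3] = 1
example_lft[5] = 2
```

Because the DLID is used directly as the index, the lookup is a single array access, at the cost of a table sized to the DLID address space.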
In block 308, the managed network device 102 determines whether the statically routed destination port 120 is congested. The destination port 120 may be congested if the offered load on that port 120 exceeds the ejection rate of the receiver on the other side of the fabric link 106 (e.g., the receiving managed network device 102 or computing node 104). The managed network device 102 may use any monitoring technique to determine whether the destination port 120 is congested. For example, the managed network device 102 may use a congestion control agent, monitor for congestion notices received from remote devices, analyze flow control data, or perform any other appropriate monitoring. The managed network device 102 may determine whether the destination port 120 is congested on a per virtual lane basis, per service channel basis, per service level basis, or based on any other logical or physical subdivision of the fabric link 106. In some embodiments, in block 310 the managed network device 102 analyzes available flow control credits at the receiver and pending flow control credits to be transmitted by the managed network device 102. If flow control credits are not available at the receiver or pending flow control credits of the managed network device 102 are increasing, then the destination port 120 may be congested. In some embodiments, in block 312 the managed network device 102 may analyze a congestion log for congestion marking events. In some embodiments, in response to detecting congestion, the managed network device 102 may send a Forward Explicit Congestion Notification (FECN) to the receiver, for example, by setting an FECN bit on data packets exiting the managed network device 102. When marking a data packet with the FECN bit, the managed network device 102 may also record that marking event in the congestion log.
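The credit-based check of block 310 might look like the following sketch. It assumes that "increasing" means strictly increasing across a short window of consecutive samples of pending credits; that interpretation, the sampling window, and the function name are illustrative assumptions rather than details given in this description.

```python
from typing import Sequence


def is_congested(receiver_credits: int,
                 pending_credit_samples: Sequence[int]) -> bool:
    """Flag congestion from flow control credit state.

    Congested if the receiver has no credits available, or if the
    sampled pending credits rise at every step of the window.
    """
    if receiver_credits <= 0:
        return True
    samples = list(pending_credit_samples)
    if len(samples) < 2:
        return False
    return all(later > earlier
               for earlier, later in zip(samples, samples[1:]))
```

The same check could be applied per virtual lane or per service level by keeping a separate credit count and sample window for each subdivision of the link.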
In block 314, the managed network device 102 determines whether the statically routed destination port 120 is congested. If not, the method 300 branches ahead to block 332, described below. If the statically routed destination port 120 is congested, the method 300 advances to block 316.
In block 316, the managed network device 102 determines a destination port group based on the DLID. The destination port group may be identified as a collection of any two or more destination ports 120 of the managed network device 102. Destination port groups may overlap, meaning that a port 120 may be included in more than one port group. As further described below, each port group may map to one or more DLIDs, and each DLID is associated with exactly one port group. The fabric manager 108 may discover routes through the fabric and then configure the port groups and port group mappings accordingly. When there is only one possible path through the fabric for a particular DLID (e.g., a single destination port 120), that DLID may be assigned to an undefined port group (e.g., an empty set, null value, zero value, etc.).
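One way to picture these mappings is a two-step lookup: a DLID-indexed table yields a port group identifier, and a second table yields a bit mask of member ports, in the manner of the port group forwarding table and port group table named elsewhere in this description. In the sketch below, the dictionary representation, the choice of 0 as the undefined group, and the convention that bit i of the mask corresponds to port i are all assumptions made for illustration.

```python
from typing import Dict, List

UNDEFINED_GROUP = 0  # assumed marker for "only one path, no port group"


def ports_in_mask(mask: int, num_ports: int = 256) -> List[int]:
    """List the port numbers whose bit is set in a port group mask."""
    return [p for p in range(num_ports) if (mask >> p) & 1]


def candidate_ports(dlid: int,
                    pgft: Dict[int, int],
                    pgt: Dict[int, int]) -> List[int]:
    """Resolve a DLID to the ports of its (single) port group."""
    group_id = pgft.get(dlid, UNDEFINED_GROUP)
    if group_id == UNDEFINED_GROUP:
        return []
    return ports_in_mask(pgt[group_id])


# Groups 1 and 2 overlap on port 2; DLIDs 10 and 12 share group 1.
example_pgt = {1: 0b0110, 2: 0b1100}
example_pgft = {10: 1, 11: 2, 12: 1}
```

Here each DLID maps to exactly one group, several DLIDs may share a group, and a DLID with no alternative path resolves to an empty candidate list.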
In some embodiments, in block 318, the managed network device 102 may look up the port group in a port group forwarding table. For example, the managed network device 102 may index the port group forwarding table using the DLID to identify the unique port group identifier of the destination port group. The port group forwarding table may have a similar structure to the linear forwarding table, and may be accessed or otherwise maintained by the fabric manager 108 similarly to the linear forwarding table. Referring now to
Referring back to
Referring back to
In block 330, the managed network device 102 updates the static routing information with the dynamic destination port 120. The managed network device 102 may, for example, replace the entry for the statically routed destination port 120 in the linear forwarding table with the dynamically determined destination port 120.
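The overall decision flow, including the table update of block 330, can be sketched as below. The congestion test, the port group lookup, and the selection strategy are passed in as hypothetical callables so the sketch stays self-contained. Falling back to the static port when no port group is defined is an assumption of this sketch, as is representing the linear forwarding table as a dictionary.

```python
from typing import Callable, Dict, List


def route_packet(dlid: int,
                 lft: Dict[int, int],
                 is_congested: Callable[[int], bool],
                 candidates: Callable[[int], List[int]],
                 choose: Callable[[List[int]], int]) -> int:
    """Return the egress port for a packet, rerouting under congestion."""
    static_port = lft[dlid]
    if not is_congested(static_port):
        return static_port            # uncongested: keep the static route
    group = candidates(dlid)
    if not group:
        return static_port            # assumed fallback: no alternative path
    dynamic_port = choose(group)
    lft[dlid] = dynamic_port          # replace the static entry (block 330)
    return dynamic_port


example_table = {7: 1}
first = route_packet(7, example_table, lambda p: False,
                     lambda d: [1, 2], min)
second = route_packet(7, example_table, lambda p: p == 1,
                      lambda d: [1, 2], lambda g: g[-1])
```

Because the table entry is overwritten, later packets for the same DLID follow the newly selected port until congestion is detected on it in turn.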
In block 332, the managed network device 102 forwards the data packet to the destination port 120. The managed network device 102 may, for example, forward the data packet to the destination port 120 described in the linear forwarding table. As described above, the destination port 120 described in the linear forwarding table may be the statically routed destination port 120 determined as described above in block 304 if that port 120 is not congested, or the dynamically determined destination port 120 as described above in connection with block 320. After forwarding the data packet to the destination port 120, the method 300 loops back to block 302 to continue processing data packets. Although the method 300 of
Illustrative examples of the technologies disclosed herein are provided below. An embodiment of the technologies may include any one or more, and any combination of, the examples described below.
Example 1 includes a network device for data packet forwarding, the network device comprising a packet ingress module to extract a destination local identifier (DLID) from a data packet; a static route module to determine a statically routed destination port of the network device as a function of the DLID; a congestion monitoring module to determine whether the statically routed destination port is congested; and an adaptive route module to determine a port group as a function of the DLID in response to a determination that the statically routed destination port is congested, wherein the port group identifies two or more ports of the network device; select a dynamic destination port of the port group in response to the determination that the statically routed destination port is congested; and forward the data packet to the dynamic destination port in response to the determination that the statically routed destination port is congested.
Example 2 includes the subject matter of Example 1, and wherein the DLID comprises a binary value that is 32, 24, 20, or 16 bits long.
Example 3 includes the subject matter of any of Examples 1 and 2, and wherein to determine the statically routed destination port comprises to index a linear forwarding table with the DLID to determine the statically routed destination port.
Example 4 includes the subject matter of any of Examples 1-3, and wherein to determine whether the statically routed destination port is congested comprises to analyze available flow control credits associated with the destination port.
Example 5 includes the subject matter of any of Examples 1-4, and wherein to determine whether the statically routed destination port is congested comprises to analyze a congestion log associated with the destination port.
Example 6 includes the subject matter of any of Examples 1-5, and wherein to determine the port group as a function of the DLID comprises to determine a port group identifier, wherein the port group identifier includes an integer value between 1 and 255, inclusive.
Example 7 includes the subject matter of any of Examples 1-6, and wherein to determine the port group as a function of the DLID comprises to index a port group forwarding table with the DLID to determine a port group identifier.
Example 8 includes the subject matter of any of Examples 1-7, and wherein to select the dynamic destination port of the port group comprises to index a port group table with the port group identifier to determine a port group mask, wherein the port group mask is indicative of a plurality of valid destination ports for the DLID; and select the dynamic destination port from the plurality of valid destination ports of the port group mask.
Example 9 includes the subject matter of any of Examples 1-8, and wherein the port group mask comprises a binary value that includes 256 bits, and wherein each bit of the port group mask is associated with a corresponding port of the network device.
Example 10 includes the subject matter of any of Examples 1-9, and wherein to select the dynamic destination port from the plurality of valid destination ports comprises to randomly select the dynamic destination port from the plurality of valid destination ports.
Example 11 includes the subject matter of any of Examples 1-10, and wherein to select the dynamic destination port from the plurality of valid destination ports comprises to select a least-loaded destination port of the plurality of valid destination ports as the dynamic destination port.
Example 12 includes the subject matter of any of Examples 1-11, and wherein to select the dynamic destination port from the plurality of valid destination ports comprises to randomly select the dynamic destination port from a plurality of least-loaded destination ports of the plurality of valid destination ports.
Example 13 includes the subject matter of any of Examples 1-12, and wherein the static route module is further to forward the data packet to the statically routed destination port in response to a determination that the statically routed destination port is not congested.
Example 14 includes a method for adaptive data packet routing, the method comprising extracting, by a network device, a destination local identifier (DLID) from a data packet; determining, by the network device, a statically routed destination port of the network device as a function of the DLID; determining, by the network device, whether the statically routed destination port is congested; determining, by the network device, a port group as a function of the DLID in response to determining the statically routed destination port is congested, wherein the port group identifies two or more ports of the network device; selecting, by the network device, a dynamic destination port of the port group in response to determining the statically routed destination port is congested; and forwarding, by the network device, the data packet to the dynamic destination port in response to determining the statically routed destination port is congested.
Example 15 includes the subject matter of Example 14, and wherein the DLID comprises a binary value that is 32, 24, 20, or 16 bits long.
Example 16 includes the subject matter of any of Examples 14 and 15, and wherein determining the statically routed destination port comprises indexing a linear forwarding table with the DLID to determine the statically routed destination port.
Example 17 includes the subject matter of any of Examples 14-16, and wherein determining whether the statically routed destination port is congested comprises analyzing available flow control credits associated with the destination port.
Example 18 includes the subject matter of any of Examples 14-17, and wherein determining whether the statically routed destination port is congested comprises analyzing a congestion log associated with the destination port.
Example 19 includes the subject matter of any of Examples 14-18, and wherein determining the port group as a function of the DLID comprises determining a port group identifier, wherein the port group identifier includes an integer value between 1 and 255, inclusive.
Example 20 includes the subject matter of any of Examples 14-19, and wherein determining the port group as a function of the DLID comprises indexing a port group forwarding table with the DLID to determine a port group identifier.
Example 21 includes the subject matter of any of Examples 14-20, and wherein selecting the dynamic destination port of the port group comprises indexing a port group table with the port group identifier to determine a port group mask, wherein the port group mask is indicative of a plurality of valid destination ports for the DLID; and selecting the dynamic destination port from the plurality of valid destination ports of the port group mask.
Example 22 includes the subject matter of any of Examples 14-21, and wherein the port group mask comprises a binary value including 256 bits, and wherein each bit of the port group mask is associated with a corresponding port of the network device.
Example 23 includes the subject matter of any of Examples 14-22, and wherein selecting the dynamic destination port from the plurality of valid destination ports comprises randomly selecting the dynamic destination port from the plurality of valid destination ports.
Example 24 includes the subject matter of any of Examples 14-23, and wherein selecting the dynamic destination port from the plurality of valid destination ports comprises selecting a least-loaded destination port of the plurality of valid destination ports as the dynamic destination port.
Example 25 includes the subject matter of any of Examples 14-24, and wherein selecting the dynamic destination port from the plurality of valid destination ports comprises randomly selecting the dynamic destination port from a plurality of least-loaded destination ports of the plurality of valid destination ports.
Example 26 includes the subject matter of any of Examples 14-25, and further comprising forwarding, by the network device, the data packet to the statically routed destination port in response to determining the statically routed destination port is not congested.
Example 27 includes a computing device comprising a processor; and a memory having stored therein a plurality of instructions that when executed by the processor cause the computing device to perform the method of any of Examples 14-26.
Example 28 includes one or more machine readable storage media comprising a plurality of instructions stored thereon that in response to being executed result in a computing device performing the method of any of Examples 14-26.
Example 29 includes a computing device comprising means for performing the method of any of Examples 14-26.
Example 30 includes a network device for data packet forwarding, the network device comprising means for extracting a destination local identifier (DLID) from a data packet; means for determining a statically routed destination port of the network device as a function of the DLID; means for determining whether the statically routed destination port is congested; means for determining a port group as a function of the DLID in response to determining the statically routed destination port is congested, wherein the port group identifies two or more ports of the network device; means for selecting a dynamic destination port of the port group in response to determining the statically routed destination port is congested; and means for forwarding the data packet to the dynamic destination port in response to determining the statically routed destination port is congested.
Example 31 includes the subject matter of Example 30, and wherein the DLID comprises a binary value that is 32, 24, 20, or 16 bits long.
Example 32 includes the subject matter of any of Examples 30 and 31, and wherein the means for determining the statically routed destination port comprises means for indexing a linear forwarding table with the DLID to determine the statically routed destination port.
Example 33 includes the subject matter of any of Examples 30-32, and wherein the means for determining whether the statically routed destination port is congested comprises means for analyzing available flow control credits associated with the destination port.
Example 34 includes the subject matter of any of Examples 30-33, and wherein the means for determining whether the statically routed destination port is congested comprises means for analyzing a congestion log associated with the destination port.
Example 35 includes the subject matter of any of Examples 30-34, and wherein the means for determining the port group as a function of the DLID comprises means for determining a port group identifier, wherein the port group identifier includes an integer value between 1 and 255, inclusive.
Example 36 includes the subject matter of any of Examples 30-35, and wherein the means for determining the port group as a function of the DLID comprises means for indexing a port group forwarding table with the DLID to determine a port group identifier.
Example 37 includes the subject matter of any of Examples 30-36, and wherein the means for selecting the dynamic destination port of the port group comprises means for indexing a port group table with the port group identifier to determine a port group mask, wherein the port group mask is indicative of a plurality of valid destination ports for the DLID; and means for selecting the dynamic destination port from the plurality of valid destination ports of the port group mask.
Example 38 includes the subject matter of any of Examples 30-37, and wherein the port group mask comprises a binary value including 256 bits, and wherein each bit of the port group mask is associated with a corresponding port of the network device.
Example 39 includes the subject matter of any of Examples 30-38, and wherein the means for selecting the dynamic destination port from the plurality of valid destination ports comprises means for randomly selecting the dynamic destination port from the plurality of valid destination ports.
Example 40 includes the subject matter of any of Examples 30-39, and wherein the means for selecting the dynamic destination port from the plurality of valid destination ports comprises means for selecting a least-loaded destination port of the plurality of valid destination ports as the dynamic destination port.
Example 41 includes the subject matter of any of Examples 30-40, and wherein the means for selecting the dynamic destination port from the plurality of valid destination ports comprises means for randomly selecting the dynamic destination port from a plurality of least-loaded destination ports of the plurality of valid destination ports.
Example 42 includes the subject matter of any of Examples 30-41, and further comprising means for forwarding the data packet to the statically routed destination port in response to determining the statically routed destination port is not congested.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US14/72461 | 12/27/2014 | WO | 00 |