Data centers support business, government and consumer users, among others, by storing and providing access to data. Typically, hundreds or thousands of servers are connected to one another to efficiently store and retrieve data. The servers are connected to a network such as the internet to provide data to user devices such as cell phones, laptops and personal computers or to other client devices. The servers are also connected to one another to exchange data. Routers allow packets of data to be communicated between the servers or other computing devices. A router typically can select from among multiple available links when communicating a packet.
In one embodiment, a router comprises a non-transitory memory storage comprising instructions and a routing table. The router also comprises one or more processors in communication with the memory, wherein the one or more processors execute the instructions to: receive a packet comprising an IP address prefix; access the routing table to identify a subset of a plurality of next hops, the subset is cross referenced by the IP address prefix, each of the plurality of next hops is connected to the router; selecting a next hop of the subset as a selected next hop based on an equal cost multiple path process, wherein at least two next hops in the subset are selected with different aggregate weights; and transmit the packet via the selected next hop.
In another embodiment, a computer-implemented method for routing data comprises, with one or more processors: assigning a weight to each next hop of a plurality of next hops connected to a router; based on the assigned weights, for each IP address prefix set of a plurality of IP address prefix sets, determining a subset of the next hops of the plurality of next hops which are to be cross referenced to the IP address prefix set; and providing instructions to the router for populating a routing table, wherein the instructions identify (a) for each next hop of the plurality of next hops, one or more IP address prefix sets which will cross reference the next hop, and (b) for at least one next hop of the plurality of next hops, one or more IP address prefix sets which will not cross reference the next hop.
In another embodiment, a non-transitory computer-readable medium storing computer instructions for routing data, that when executed by one or more processors, cause the one or more processors to perform the steps of: preparing announce instructions and withdraw instructions for a router which cause the router to provide weighted multipath load balancing using an equal cost multiple path process; and transmitting the announce instructions and the withdraw instructions to the router for populating a routing table.
In another embodiment, a computer-implemented method for routing data, comprises, with one or more processors: receiving instructions from a controller at a router, the instructions identify (a) for each next hop of a plurality of next hops of the router, one or more IP address prefix sets which will cross reference the next hop, and (b) for at least one next hop of the plurality of next hops, one or more IP address prefix sets which will not cross reference the next hop; translating the instructions to provide a weight for each next hop; and populating a routing table with the weights cross referenced to the next hops.
Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying figures for which like references indicate elements.
The disclosure relates to devices and techniques for enabling a router to provide weighted multipath load balancing using an equal cost multiple path process.
Networks such as those in a data center are highly connected such that there are multiple paths which connect servers or other computing devices. The task of a router is to select one of multiple available next hops to route a packet which is received at the router. Typically, the router examines a destination address of the packet and identifies a number of next hop computing devices which can be used to transmit the packet to its destination. This is referred to as multipath routing since multiple paths are available to select from. In selecting a next hop, the router can consider a cost of each path. These paths can have equal or non-equal costs and the router can select the lowest cost path.
The cost of a path can be based on various metrics such as number of hops to the final destination (fewer hops are preferable), bandwidth (paths which have higher data transfer rates are preferable), delay (a smaller delay is preferable, where the delay is the amount of time it takes to communicate a packet to the final destination), reliability (a higher reliability is preferable, where the reliability is based on reported problems such as link failures, interface errors, and lost packets), and load (where a less congested path is preferable).
To provide low latency communications, multipath is critical to load balance over links in data center networks. In one approach, equal cost multipath (ECMP) routing is used in data center networks to distribute the network load between different servers. It is efficient in the aggregate. ECMP is a routing strategy where next hop packet forwarding to a single destination can occur over multiple best paths which tie for top place in routing metric calculations. Multi-path routing can be used with most routing protocols, because it is a per-hop decision limited to a single router. It can substantially increase bandwidth by load-balancing traffic over multiple paths.
ECMP can be used with routing protocols such as the Border Gateway Protocol (BGP). BGP is used to exchange routing information for the Internet and has become the de-facto control plane protocol for data center networks as well. However, ECMP as used with BGP does not allow certain paths to be favored or disfavored, e.g., weighted, without the use of proprietary extensions to BGP. Moreover, other approaches which may provide weight information can consume a substantial portion of the limited memory space in the router. In some cases, the memory space is provided using a Static Random Access Memory (SRAM) or Ternary Content Addressable Memory (TCAM) table.
Techniques provided herein address the above and other issues. In one aspect, weighted ECMP (WECMP) is proposed to adaptively adjust the preference of next hops based on the network dynamics. Weighted path selection is useful, e.g., for traffic engineering, link failure resiliency and preplanned maintenance or updates. For example, the preference of a path may change due to reliability issues, data drop off or congestion on the path.
A routing technique provides a routing table which results in next hops being selected according to desired aggregate weights at a router, while still using an equal cost multipath selection process at the router. The aggregate weight represents the weight with which a next hop is selected over a time period which involves multiple next hops selections. The routing table is configured to cross reference an IP address prefix set (representing a set of destination nodes) to a number of next hops which can be all, or fewer than all, available next hops. This occurs in each row of the table for a different IP address prefix set. See
For example, the weight assigned to each next hop may be based on a number of rows in the routing table which cross reference an IP address prefix set to the next hop, a number of next hops in a row, and associated traffic estimates of the IP address prefix set.
The appropriate configuration of a routing table may be determined by an external controller based on the current network metrics. The controller transmits instructions to the router, such as announce and withdraw messages made according to the BGP protocol, to configure the routing table. If the network metrics change, the controller can re-configure the routing table. Moreover, a single controller may communicate with multiple routers to configure their routing tables.
Moreover, the weighted multipath techniques can be deployed without any extension of the BGP protocol and without significant consumption of memory resources in the router.
The techniques described herein may be used on any type of switching device, whether embodied in hardware and/or software. A router is discussed as one example of a switching device.
The network interface allows the controller to communicate with the data center routers. The network interface may communicate using the Ethernet standard, for example. Ethernet is a local area network (LAN) technology which describes how networked devices can format data for transmission to other network devices. Ethernet transmissions use a packet which includes a header and a payload. The header includes information such as a Media Access Control (MAC) destination address and source address, in addition to error-checking data. The network interface may be a network interface controller card comprising ports. Communication cables, such as CATS cables, may plug into the physical ports.
The memory 172 may be non-volatile storage for software (code) which is loaded into the working memory 176 and executed by the processor 171 to perform the functions described herein. The working memory may be a volatile memory, e.g., RAM.
The memory 184 may be non-volatile storage for software (code) which is loaded into the working memory 182 and executed by the processor 180 to perform the functions described herein. The working memory may be a volatile memory, e.g., RAM.
In
Various computing devices may utilize all of the components shown, or only a subset of the components, and levels of integration may vary from device to device. Furthermore, a device may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc. The data source may comprise a hardware storage device which stores samples of data which are to be processed. The processors may be any type of electronic data processor such as a CPU. The input/output devices may include network interfaces, storage interfaces, monitors, keyboards, pointing devices and the like. A working memory may store code, e.g., instructions which are executed by a processor to carry out the functions described herein. The code may be stored in a non-volatile memory and loaded into the working memory.
The memory/storage devices may comprise any type of system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a solid state drive, hard disk drive, a magnetic disk drive, or an optical disk drive. The memory devices may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs. The memory devices may include non-transitory, hardware memory devices.
In this configuration, the controller cannot configure the routers with assigned weights for different links/next hops without the disadvantages mentioned previously, including use of proprietary BGP extensions or excessive memory consumption.
For example, the router R1 has outgoing links L1-L5. The controller may determine that L1 is becoming congested based on monitoring of an amount of traffic on that link. For example, the monitoring may indicate that an amount of traffic exceeds a threshold level. In response to the monitoring, the controller can decide to lower a weight assigned to the link, where the router selects a link with a probability which is equal to its weight. For example, the weight may be reduced from 0.2 to 0.1.
In one aspect, the controller inputs weights and traffic estimates, translates the weights, uses BGP announce and withdraw commands to override and modify routing behavior, and requires the routers to perform ECMP. In one embodiment, the controller does not calculate weights but instead uses weights that are already available from a router monitoring process. Or, it is possible for the controller to also calculate the desired weights. The controller may not calculate traffic estimates but instead use traffic estimates that are already available from a router monitoring process. Or, it is possible for the controller to also calculate the traffic estimates. The controller need not modify the BGP which is used on the routers, or require the routers to perform local computations or perform a weighted multipath selection process. The techniques described herein can therefore be easily be deployed with existing routers without imposing additional burdens on the routers. Moreover, the techniques can be deployed from a centralized controller which communicates with many routers.
The last row of the table notes that the traffic estimates sum to 1.0. Also, aggregate weights v1-v5 for all prefixes/destinations are achieved for the next hops 1-5, respectively, based on the binary weights. The aggregate weight of a next hop represents the weight with which the next hop is selected to route data over a period of time involving multiple next hop selections.
IP address prefixes are patterns which match the first n binary bits of an IP address. The standard syntax is to write the prefix bits that must match in dotted-quad format, followed by a slash and then the number of bits in the prefix. Any trailing bits, not part of the prefix, are written as zero. If an entire trailing byte is zero, it can be written explicitly, as in the example prefix “128.8.0.0/16,” or omitted, as in the example prefix “128.8/16.” Since only the first sixteen bits are significant (in this example), the remaining sixteen bits can be omitted. One or more destination nodes or computing devices can be associated with an IP address prefix.
Example routing tables provided herein discuss four IP address prefix sets referred to as prefix set 1, prefix set 2, prefix set 3 and prefix set 4. Each prefix set represents one or more IP address prefixes. The term “set” thus includes a unit set in the case of one IP address prefix in the set. A single IP address prefix represents a continuous set of IP addresses. The set also encompasses a group of multiple IP address prefixes. For example, one example of an IP address prefix is 128.10.0.0/16 and another example of an IP address prefix is 234.56.0.0/16. An example of a prefix set comprises both of these IP address prefixes. For instance, in
The binary weights are 1, 0, 0, 0 and 1 for next hops 1-5, respectively, with destinations of prefix set 1. The binary weights are 1, 1, 0, 1 and 1 for next hops 1-5, respectively, with destinations of prefix set 2. The binary weights are 0, 1, 1, 1, 0 for next hops 1-5, respectively, with destinations of prefix set 3. The binary weights are 1, 1, 1, 1 and 1 for next hops 1-5, respectively, with destinations of prefix set 4. In one approach, each weight of “1” in a row indicates the corresponding next hop can be used to communicate data to a destination of the prefix set identified in the row. Each weight of “0” in a row indicates the corresponding next hop cannot be used to communicate data to a destination of the prefix set identified in the row. For example, next hops 1 and 5 can be used for prefix set 1, next hops 1, 2, 4 and 5 can be used for prefix set 2, next hops 2, 3 and 4 can be used for prefix set 3, and next hops 1-5 can be used for prefix set 4.
Each row can include two or more 1's so that there are at least two next hops for failure resiliency. That is, if the link for one next hop fails, the router can communicate via the other next hop. A row can include all 1's as well as indicated for prefix set 4. It is also possible for different rows to include the same number of 1's but generally there will be different sets of binary weights in different rows.
Traffic estimates of 0.2, 0.3, 0.15 and 0.35 are also provided for prefix sets 1-4, respectively. As an example, 0.2 means 20% of the total traffic through the router is destined for the destinations in prefix set 1. 100% of the traffic destined for prefix set 1 is sent via next hops 1 and 5. Further, when the router uses ECMP, this traffic is divided equally among these next hops, e.g., 10% for next hop 1 and 10% for next hop 5. Therefore, 10% of the total traffic through the router is sent via next hop 1 and 10% is sent via next hop 5. Generally, when a row has more 1's, there is relatively less traffic for each next hop for destinations of the associated prefix set. That is, the traffic is spread out among relatively more next hops. A larger number of 1's in a row also corresponds to a smaller fractional weight for each selected next hop in the row. As mentioned, the traffic estimates can be based on observations of traffic at a router over time.
Thus, when the destination is associated with prefix set 1, 2, 3 or 4, the probability that the router selects a given one of the next hops associated with the prefix set is 0.5, 0.25, 0.33 or 0.2, respectively.
The last row of the table indicates that the traffic estimates sum to 1. Also, a fractional aggregate weight which is achieved based on the binary weights and the traffic estimates is indicated. The achieved weights of 0.245, 0.1945, 0.1195, 0.1945 and 0.245 are very close to the desired weights of 0.25, 0.2, 0.1, 0.2 and 0.25, respectively, of
Mathematical calculations can be used to set the binary weights which will achieve the desired aggregate weight for each next hop.
The table of
The algorithm can be optimally formulated using linear integer programming, as discussed further below. For failure resiliency, there will be at least two next hops for each prefix set. It is assumed that outgoing traffic is distributed to available (advertised) next hops homogenously (e.g., using ECMP). Once the set of prefixes for each next hop is calculated, these paths are injected into the routers using announce/withdraw BGP messages, for instance.
First, a set of partial advertisement configurations is listed. For example, if we want to advertise next hops 2 and 3 only (assuming next hops 1-5 are available), the configuration would be c=[0 0.5 0.5 0 0], whose total sum would be 1 and non-zero elements would be more than 1 and equal. If we want to advertise next hops 1, 4 and 5, then the configuration would be c=[0.333 0 0 0.333 0.333]. The full set of configuration are listed as parameters in set C which has a size of T. Then we create one variable for each configuration.
Announce and withdraw messages can similarly be provided for the remaining next hops. Note that it is possible that a withdraw message is not provided for a particular next hop if that next hops is cross referenced in each row of the routing table to a prefix set.
For example, if a packet is received with a destination address which indicates it is associated with prefix set 1, the router will select from among next hops 1 and 5 in routing the packet. The selection can be made among the advertised next hops using an ECMP process.
Each row of the table comprises identifiers of a subset of next hops among a plurality of next hops connected to a router. The subset can include all, or fewer than all, of the plurality of next hops. For example, the subset 804 includes all of the plurality of next hops, e.g., next hops 1-5. Further, the rows of the routing table can comprise overlapping and non-overlapping subsets of next hops. For example, the subset 801 of next hop 1 and 5 overlaps with the subset 802 of next hop 1, 2, 4 and 5 but not with the subset 803 of next hop 2, 3 and 4. Generally, at least two next hops are selected with different aggregate weights. It is also possible for all next hops to be selected with different aggregate weights.
At step 903, for a next hop of the router, the controller transmits instructions to the router identifying the next hop and IP address prefix sets which are to cross reference the next hop. See, e.g.,
Note that the assigned weights of the next hops can change over time so that the routing table can be periodically reconfigured. In the initial configuring of the table, the withdraw messages may not be required since identifiers of next hops are not being removed from rows of the routing table. In subsequent reconfiguring, the withdraw message can be used to remove an identifier of a next hop from a row of the routing table.
The techniques disclosed herein provide various benefits. For example, the techniques enable a weighted load balancing implementation using BGP without any extensions. The techniques overcome limitations of proprietary weighted load balancing processes and empower network designers to customize the process to meet their needs. Further, the techniques do not inflate the size of routing tables such as embodied by SRAM/TCAM tables.
A mapping algorithm converts fractional weights for each next-op for all prefixes into partial prefixes for each next hop. These partial advertisements can easily be implemented in BGP without any need of extensions. The techniques enable weighted multipath load balancing in non-extended BGP using a mapping algorithm and an enforcer module. The mapping algorithm maps desired pre-defined weights for each next hop into partial prefix advertisements for each next hop that will result in the aggregate traffic ratios for each next hop converging to the predefined weights. The enforcer module modifies the routing behavior of the routers to make them align with the predefined link weights. The techniques can expand the capabilities of networks such as in data centers in terms of providing weighted multipath load balancing functionality using legacy switches which do not support BGP extensions. The techniques enable use of a wider range of switches to implement weighted multipath load balancing, thereby reducing costs.
The translation can be performed as follows. A table such as in
Step 930 is the same as step 910 in
It is understood that the present invention may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the invention to those skilled in the art. Indeed, the invention is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the invention as defined by the appended claims. Furthermore, numerous specific details are set forth in order to provide a thorough understanding. However, it will be clear to those of ordinary skill in the art that the embodiments may be practiced without such specific details.
In accordance with various embodiments of the present disclosure, the methods described herein may be implemented using a hardware computer system that executes software programs. Further, in a non-limited embodiment, implementations can include distributed processing, component/object distributed processing, and parallel processing. Virtual computer system processing can be constructed to implement one or more of the methods or functionalities as described herein, and a processor described herein may be used to support a virtual processing environment.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a device, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The terminology used herein is for the purpose of describing particular aspects only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.
For purposes of this document, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
Number | Name | Date | Kind |
---|---|---|---|
8787400 | Barth et al. | Jul 2014 | B1 |
20120158976 | Van der Merwe | Jun 2012 | A1 |
20140369186 | Ernstrom | Dec 2014 | A1 |
20150124652 | Dharmapurikar | May 2015 | A1 |
20150312137 | Li | Oct 2015 | A1 |
20150326476 | Ye | Nov 2015 | A1 |
20160014025 | Wang | Jan 2016 | A1 |
20170063600 | Singh | Mar 2017 | A1 |
20170346727 | Perrett | Nov 2017 | A1 |
Entry |
---|
Avci, Serhat Nazim, et al., “Congestion Aware Priority Flow Control in Data Center Networks,” Futurewei Technologies, IFIP Networking, May 2016, 9 pages. |
Alizadeh, Mohammad, et al., “CONGA: Distributed Congestion-Aware Load Balancing for Datacenters,” SIGCOMM, Aug. 2014, 12 pages. |
Mohapatra, P., et al., “BGP Link Bandwidth Extended Community draft-ieft-idr-link-bandwidth-06.txt,” Cisco Systems, Jan. 22, 2013, 7 pages. |
Lapukhov, P.L., et al., “Using BGP for routing in large-scale data centers draft-lapukhov-bgp-routing-large-dc-04,” Arista Networks, Apr. 8, 2013, 25 pages. |
Ghorbani, Soudeh, et al., “Micro Load Balancing in Data Centers with DRILL,” HotNets-XIV, Nov. 16-17, 2015, 7 pages. |
Al-Fares, Mohammad, et al., “Hedera: Dynamic Flow Scheduling for Data Center Networks,” In Proc. of Networked Systems Design and Implementation (NSDI) Symposium, Mar. 2010, 15 pages. |
“BGP 4 Prefix Filter and Inbound Route Maps,” IP Routing BGP Configuration Guide, Cisco IOS XE Release 3SE (Catalyst 3850 Switches), Sep. 2016, 16 pages. |
Laurence, Shaly, et al., “SRAM Based Architecture for TCAM For Low Area and Less Power Consumption,” APRN Journal of Engineering and Applied Sciences, vol. 10, No. 17, Sep. 2015, 6 pages. |
“ECMP Load Balancing,” MPLS: Layer 3 VPNs Configuration Guide, Cisco IOS XE Release 3S (Cisco ASR 900 Series), Jul. 2016, 12 pages. |
“Multipath TCP,” Internet Engineering Task Force (IETF), Dec. 2016, 5 pages. |
He, Keqiang, et al., “Presto: Edge-based Load Balancing for Fast Datacenter Networks,” SIGCOMM, Aug. 2015, 14 pages. |
Zhou, Junlan, et al., “WCMP: Weighted Cost Multipathing for Improved Fairness in Data Centers,” EuroSys '14 Proceedings of the Ninth European Conference on Computer Systems, No. 5, Apr. 2014, 13 pages. |
Abouchaev, Adel, et al., “Brain-Slug: a BGP-Only SDN for Large-Scale Data-Centers,” Microsoft, Jun. 2013, 29 pages. |
Zhang, Junjie, et al., “Optimizing Network Performance using Weighted Multipath Routing,” 21st International Conference onComputer Communications and Networks (ICCCN), Jul. 2012, 7 pages. |
Number | Date | Country | |
---|---|---|---|
20180205634 A1 | Jul 2018 | US |