The present disclosure is directed to data networks and, more particularly, to data network load balancing and related methods, load balancers, and servers.
A Client is defined as any combination of hardware and software (e.g., including operating systems and client applications) that is capable of accessing services over a network connection.
A Server is defined as any combination of hardware and software (e.g., operating systems and server applications) that is capable of providing services to clients.
A Blade is defined herein as any combination of hardware and software (e.g., operating systems and client and/or server applications software) which is capable of acting not only as a server but also as a client. A Blade Server is an instance of a server on a blade, whereas a Blade Client is an instance of a client on a blade. A blade can both be a client and a server at the same time. The terms blade server and server may be used interchangeably herein.
A Blade ID is a unique value that identifies a blade among other blades.
A Load Balancer is a network device that receives requests/packets coming from clients and distributes the requests/packets among the blade servers.
Server-Side Load Balancing is a technology whereby service requests are distributed among a pool of blade servers in a relatively transparent manner. Server-side load balancing may introduce advantages such as scalability, increased performance, and/or increased availability (e.g., in the event of a failure or failures).
As shown in
An Outside Node (e.g., an outside server and/or client) is defined as a network node which is located outside the load balancing site. An outside node can be a client requesting a service from one of the blade servers, or an outside node can be an outside server which is serving a blade client inside the load balancing site.
As used herein, a data flow is defined as network traffic made up of data packets and transmitted between a client and a server that can be identified by a set of attributes. Sample attributes may include 5-tuple parameters (e.g., Src/Dest IP addresses, Protocol, Src/Dest TCP/UDP Port), Src/Dest (Source/Destination) MAC address, or any other set of bits in the data packets (e.g., PCP bits, VLAN IDs, etc.) of the data flow, or simply source and destination nodes of the network traffic. For example, over a certain link (e.g., from node a to node b) in a network, a packet passing through with a specific source IP address (e.g., IP1) is part of a flow identified by the source IP address over that link with the attributes (IP1, a, b). As another example, in an access network, traffic originated from a subscriber can also be considered as a flow where that flow can be identified as the traffic passing through a UNI/NNI/ANI port of an RG (Residential Gateway). Such subscriber flows in access and edge networks can also be identified by subscriber IP addresses. An upstream/downstream subscriber flow (e.g., a flow from the subscriber/network side to the network side/subscriber) may have the IP address of the subscriber as the source/destination IP address, respectively.
A flow ID is an ID or tag used to identify a flow. For example, the set of attributes used to identify a flow may be mapped to natural numbers to construct Flow IDs. Also, a Flow ID uniquely identifies a flow.
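By way of a non-limiting illustration, the mapping from a set of attributes to a natural number may be sketched as follows. The sketch assumes IPv4 addresses and packs the 5-tuple fields into disjoint bit ranges so that distinct 5-tuples yield distinct Flow IDs; the names FiveTuple and flow_id are hypothetical and are not taken from this disclosure.

```python
import ipaddress
from typing import NamedTuple


class FiveTuple(NamedTuple):
    """Hypothetical container for the 5-tuple attributes of a connection."""
    src_ip: str
    dst_ip: str
    protocol: int   # 8-bit IP protocol number (e.g., 6 for TCP, 17 for UDP)
    src_port: int   # 16-bit port
    dst_port: int   # 16-bit port


def flow_id(t: FiveTuple) -> int:
    """Pack the 5-tuple into one natural number (assumes IPv4 addresses).

    Each field occupies its own bit range, so the mapping is one-to-one.
    """
    src = int(ipaddress.IPv4Address(t.src_ip))
    dst = int(ipaddress.IPv4Address(t.dst_ip))
    return (src << 72) | (dst << 40) | (t.protocol << 32) | (t.src_port << 16) | t.dst_port
```

A coarser-grained Flow ID (e.g., based on the source IP address only) could be obtained by keeping only the corresponding field.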
An incoming Flow is network traffic that enters the Load Balancing site and that originated from outside the Load Balancing site. An incoming flow includes not only data traffic that is destined to the load balancing site to be terminated at the load balancing site but also data traffic that is to be forwarded by the load balancing site after corresponding processing. An Incoming Packet is a packet belonging to an incoming flow.
An Outgoing Flow is network traffic that is about to leave the load balancing site. An outgoing flow includes not only network traffic that is originated by the load balancing site (e.g., by the blade servers/clients) but also network traffic (originated by an outside node) that is forwarded by the load balancing site (after further processing at the load balancer and/or the blade servers) to another location. An Outgoing Packet is a packet belonging to an outgoing flow.
Granularity of a flow refers to the extent to which a larger (coarse-grained) flow is divided into smaller (finer-grained) sub-flows. For example, an aggregate flow passing through a link (from node a to node b) with multiple destination IP addresses may have a coarser granularity than a sub-flow passing through the same link with a certain destination IP address. The former flow can be referred to as a link flow and the latter flow can be referred to as a link, destination IP flow.
A flow can be made up of many flows. Accordingly, the Flow ID that can be derived from a packet of an arbitrary incoming flow at the load balancing site may be a random variable. A probability distribution of the Flow ID may depend on what and how the packet header fields are associated with the Flow ID. (The header fields of an incoming packet that include the Flow ID can also be random variables, and mapping between the header fields and the Flow ID may govern the probability distribution of the Flow ID of incoming packets). For example, assuming that the Flow ID simply depends on a respective source IP address, then the probability distribution of the Flow ID of incoming packets will depend on factors such as how a DHCP server allocates the IP addresses, demographic distribution in case of correlation between geography and IP addresses, etc.
A connection is an example of a flow that can be identified using 5-tuple parameters (Src/Dest IP address, Protocol, Src/Dest TCP/UDP Port). TCP (Transmission Control Protocol) or UDP (User Datagram Protocol) connections can be considered as an example. As used herein, Src means source, Dest means destination, PCP means Priority Code Point, UNI means User Network Interface, NNI means Network Network Interface, ANI means Application Network Interface, and HA means High Availability.
A Type-1 Flow is a type of flow for which it is possible to detect the start of the flow or the first data packet of the flow by considering only the bits in the first packet of the flow (without consulting other information). For example, an initial data packet of a connection can be identified using a SYN (synchronize) flag in TCP packets and an INIT (initiation) flag in SCTP (Stream Control Transmission Protocol) packets. Many connection-oriented protocols have a way of telling the server about the start of a new connection. As another example, a subscriber level flow can be identified by subscriber IP address (e.g., Source/Destination IP address of the upstream/downstream traffic). In such a case, a RADIUS start request or DHCP (Dynamic Host Configuration Protocol) request may indicate the start of the subscriber level flow. Because the Flow ID (identification) is based on the source IP address, a new flow for a subscriber can be detected by sensing the RADIUS packet or DHCP packet which is generated to establish the subscriber session.
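As a minimal sketch of Type-1 detection for TCP, the following example inspects only the flags byte of a TCP header and treats a packet with SYN set and ACK clear as the initial packet of a connection. The helper build_tcp_header is a hypothetical convenience used only to produce sample headers; the flag values and the byte offset follow the standard TCP header layout.

```python
import struct

TCP_FLAG_SYN = 0x02
TCP_FLAG_ACK = 0x10


def build_tcp_header(src_port, dst_port, flags):
    """Build a minimal 20-byte TCP header (no options) for illustration only."""
    data_offset = 5 << 4  # header length of five 32-bit words, in the upper nibble
    return struct.pack("!HHIIBBHHH", src_port, dst_port, 0, 0,
                       data_offset, flags, 65535, 0, 0)


def is_initial_tcp_packet(tcp_header):
    """Return True if the header has SYN set and ACK clear.

    Only bits of the packet itself are consulted; no flow table is needed.
    """
    if len(tcp_header) < 20:
        raise ValueError("TCP header must be at least 20 bytes")
    flags = tcp_header[13]  # byte 13 carries the control flags
    return bool(flags & TCP_FLAG_SYN) and not (flags & TCP_FLAG_ACK)
```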
A Type-2 Flow is a flow that may be defined arbitrarily such that it may be difficult to determine initial packets of a flow by considering only packet headers.
Load Balancer Traffic Management
When a client sends a request to a load balancing site, a load balancer of the load balancing site forwards the request to one of the available blade servers. Once the data flow is established, the load balancer is in charge of distributing subsequent data packets of the data flow to the appropriate blade server(s). In this case, the blade server may be the flow/connection end point where, for example, the corresponding TCP connection has an end point at the blade server.
In an alternative, one of the blade clients in the load balancing site may initiate a data flow to an outside node. In this latter case, the load balancer may still be responsible for forwarding all the response packets of the connection to the original blade client.
In addition, an outside client node can originate/initiate a connection which is destined to an outside node but which needs to traverse the load balancing site for further processing. As an example, subscriber management nodes and/or nodes/sites for deep packet inspection can be considered. In such scenarios, it is possible that certain flows may need to be associated with specific blade servers so that the processing can be performed consistently. In other words, it is possible that all the data packets of some flows may need to travel to the same blade server during the life time of the flow.
In summary, regardless of the origin of a data flow, the traffic of the data flow may need to be forwarded by the load balancer in a convenient fashion.
Flow Aware Server Load Balancing: Maintaining the Flow Stickiness
In flow level load balancing, the load balancer first allocates a new flow to a blade. That is, the initial data packet of an incoming data flow (e.g., SYN packet of a TCP connection, INIT packet of an SCTP connection) is forwarded to an available blade server according to a scheduling mechanism (e.g., weighted round robin, etc.). All of the subsequent data packets associated with the flow are then processed by the same blade. In other words, the flow ‘stickiness’ to a particular blade should be maintained by the load balancer.
Most transport protocols such as TCP and SCTP may require connection level load balancing such that data packets belonging to a same connection are handled by a same blade server. On the other hand, UDP can sometimes cope with packet level load balancing where each individual packet can be handled by a different blade server.
A requirement/goal of sending subsequent packets associated with a data flow to the previously assigned blade server may make load balancing more challenging. Such Load Balancing may be referred to as Flow Aware, Session-Aware, and/or Connection-Aware Load Balancing.
Load Balancers
As discussed in greater detail below, requirements/goals of load balancers may include: flexible, deterministic, and dynamic load distribution; hitless support of removal and addition of servers; simplicity; support for all traffic types; and/or Load Balancer HA.
An ideal load balancer may distribute the incoming traffic to the servers in a flexible, deterministic and dynamic manner. Determinism refers to the fact that the load on each server can be kept at a targeted/required level (throughout this specification, uniformity and deterministic load balancing are used interchangeably), whereas dynamicity refers to the fact that as load indicators over the servers change over time, the load balancer should be able to dynamically change the load distribution accordingly to keep the targeted/required load levels (e.g., a lifetime of each data flow/connection may be arbitrary so that load distributions may change).
Flexibility in load distribution refers to the granularity of load balancing such that the load balancer should support per flow load balancing where the flow can be defined in a flexible manner such as 5-tuple connection level (e.g., relatively fine granular load balancing) or only source IP flows (e.g., coarser granular load balancing). Flow-level load balancing refers to the fact that all data packets of the same data flow may need to be sent to the same server/blade during the lifetime of a connection (this may also be called flow-aware load balancing, which preserves the flow stickiness to the servers/blades).
Hitless Support of Removal and Addition of Servers: The load balancer should be able to support dynamicity of resources (e.g., servers) without service disruption to the existing flows. Dynamicity refers to planned/unplanned additions and/or removals of blades. That is, whenever the number of active blades changes, the load balancer should incorporate this change in a manner to reduce disruptions in existing data flows. (Support for dynamicity of resources may be a relatively trivial requirement as most server pools operate in a high availability configuration. Moreover, graceful shutdowns as well as rolling upgrades may require planned removal(s)/restart(s) of blades/servers).
The Load Balancer should be as simple as possible. Simplicity is a goal/requirement which, for example, may provide TTM (Time To Market) and/or cost effectiveness advantages.
As a goal, the load balancer should support all kinds of traffic/protocols (i.e., the load balancer may be Traffic-type agnostic).
Load Balancer HA: In cases where load balancer level redundancy is provided, it may be desirable that no state/data replication is required on the back up load balancer for each flow, and the switch over to back up should take place rapidly and/or almost immediately in the event that the primary load balancer fails.
Table-based flow level server load balancing is considered as a stateful mechanism such that the scheduling decision of each data flow is maintained as a state in the load balancer.
Table-based flow level load balancing is a stateful approach which uses a look-up table at the load balancer to record previous load balancing decisions so that subsequent packets of an existing data flow follow a same blade server assignment decision. Accordingly, the load balancer may need to keep the state of each active flow in the form of a Flow ID (e.g., 5 tuple parameters for connections) to Blade ID (e.g., IP address of the blade) mapping table. The first packet of the flow is scheduled/assigned to a blade server by the load balancer with respect to a scheduling algorithm.
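The table-based approach may be illustrated with the following minimal sketch, in which a dictionary serves as the Flow ID to Blade ID mapping table. Plain round robin is assumed as the scheduling algorithm purely for brevity, and the class and method names are hypothetical.

```python
from itertools import cycle


class TableLoadBalancer:
    """Stateful flow-level load balancer sketch (round robin scheduler assumed)."""

    def __init__(self, blade_ids):
        self._next_blade = cycle(blade_ids)
        self.flow_table = {}  # Flow ID -> Blade ID, one entry per active flow

    def forward(self, flow_id):
        """Return the Blade ID for a packet of the given flow."""
        blade_id = self.flow_table.get(flow_id)
        if blade_id is None:
            # First packet of a new flow: schedule it and record the decision.
            blade_id = next(self._next_blade)
            self.flow_table[flow_id] = blade_id
        return blade_id

    def end_flow(self, flow_id):
        """Drop the state of a finished flow so that the table does not grow forever."""
        self.flow_table.pop(flow_id, None)
```

Every forwarded packet costs one table look-up, and the table holds one entry per active flow, which is the source of the memory and processing overhead associated with stateful schemes.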
As shown in
Using load balancing operations of
Stateful Load Balancing with Scheduling Offload
As shown in
As a data packet arrives at the load balancer, the load balancer extracts the Flow ID and performs a table look-up operation. If the load balancer finds a match with the Flow ID in the table, the data packet is forwarded to the indicated Blade (having the Blade ID matching the Flow ID as stored in the mapping table). If the data packet is from a Type-1 data flow, then the load balancer can use the packet header to identify the data packet as an initial data packet of a new flow without using the mapping table. Otherwise, if no match is found for the Flow ID in the mapping table, the load balancer can determine that the data packet is an initial packet belonging to a new data flow.
The load balancer then either sends the new packet to the controller or otherwise communicates with the controller regarding this new data flow. Responsive to this communication from the load balancer, the controller instructs the load balancer to add a Flow ID to Blade ID mapping entry to the mapping table, and the controller forwards the data packet to the corresponding Blade. Hereafter, all data packets belonging to the data flow will be forwarded by the load balancer to the corresponding Blade because the table now has a mapping entry for the data flow.
Accordingly, the controller is responsible for communicating with the blade servers for their availability and load, and the controller performs the scheduling and updates the load balancer mapping table. The load balancer in this case may be a dumb device which only receives the commands from the controller to add/delete/modify the mapping entries in the mapping table and perform data packet forwarding according to the mapping table.
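The division of labor between the controller and the load balancer may be sketched as follows. The least-loaded selection rule and all names (Controller, ForwardingLoadBalancer, handle_new_flow, add_entry) are assumptions made for illustration; the sketch only shows that scheduling lives in the controller while the load balancer performs table look-ups and accepts table updates.

```python
class Controller:
    """Tracks blade load and makes the scheduling decisions (least-loaded assumed)."""

    def __init__(self, blade_loads):
        self.blade_loads = dict(blade_loads)  # Blade ID -> number of assigned flows

    def handle_new_flow(self, load_balancer, flow_id):
        blade_id = min(self.blade_loads, key=self.blade_loads.get)
        self.blade_loads[blade_id] += 1
        # Instruct the load balancer to install the mapping entry.
        load_balancer.add_entry(flow_id, blade_id)
        return blade_id


class ForwardingLoadBalancer:
    """Only looks up the mapping table and applies commands from the controller."""

    def __init__(self, controller):
        self.controller = controller
        self.mapping_table = {}  # Flow ID -> Blade ID

    def add_entry(self, flow_id, blade_id):
        self.mapping_table[flow_id] = blade_id

    def forward(self, flow_id):
        blade_id = self.mapping_table.get(flow_id)
        if blade_id is None:
            # No entry: hand the new flow to the controller.
            blade_id = self.controller.handle_new_flow(self, flow_id)
        return blade_id
```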
Stateless Load Balancing
For stateless load balancing algorithms/operations, there may be no need to maintain any sort of state. By considering only a packet header, a scheduling decision can be made whether it is the first packet of the flow or not. In other words, no state is kept with respect to the Flow and Blade IDs to make the scheduling decision of the later packets of a flow.
Static Mapping: Hash-Based Flow-Aware Server Load Balancing
A Hash-based approach is a stateless scheme that maps/hashes any Flow ID (e.g., 5 tuple parameters) to a Blade ID. In that respect, hash based scheduling maintains flow stickiness as the Flow ID to Blade ID mapping is static.
As an example, in a load balancing site with N (e.g., N=10) blade servers and a single load balancer, a hash function may take the last byte of the source IP address (e.g., 192.168.1.122) of a packet as an integer (e.g., 122) and take modulo N. The resulting number (e.g., 2) is in the range of 0 to N−1 and points to the blade server to which the packet is to be forwarded. In the event of a server failure, the hash function (i.e., in this case the modulo N function) needs to be updated.
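The modulo N example may be expressed as the short sketch below, which assumes a dotted-decimal IPv4 source address; the function name is hypothetical.

```python
def blade_for_packet(src_ip: str, num_blades: int) -> int:
    """Map a packet to a blade index using the last byte of its source IP address."""
    last_byte = int(src_ip.split(".")[-1])
    return last_byte % num_blades


print(blade_for_packet("192.168.1.122", 10))  # 122 mod 10 -> 2
print(blade_for_packet("192.168.1.122", 9))   # 122 mod 9  -> 5 after one blade is lost
```

The second call illustrates why the function must be updated after a failure and why such an update can move flows that were on healthy blades.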
In a more sophisticated example, a set of fields in the packet header is first hashed/mapped to a large number of M buckets such that M>>N (N being the number of servers with the number of buckets M being much greater than the number of servers N). Then, a second level mapping from the M buckets to the N servers is performed. This two-level hashing/mapping based mechanism may provide an ability to cope with server failures, such that in the event of a failure, only second level (i.e., M bucket to N Server) mapping needs to be updated without any change in the original hash function.
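A two-level mapping of this kind may be sketched as follows. CRC-32 is assumed as the first-level hash only because it is deterministic and available in the standard library; the class and method names are hypothetical. On a blade failure, only the buckets that pointed to the failed blade are reassigned, and the first-level hash is left unchanged.

```python
import zlib


class TwoLevelHashBalancer:
    """Flow key -> one of M buckets (static hash) -> one of N blades (table)."""

    def __init__(self, num_buckets, blade_ids):
        self.num_buckets = num_buckets
        self.blade_ids = list(blade_ids)
        # Initial second-level mapping: spread the buckets evenly over the blades.
        self.bucket_to_blade = [
            self.blade_ids[b % len(self.blade_ids)] for b in range(num_buckets)
        ]

    def bucket_for(self, flow_key: bytes) -> int:
        return zlib.crc32(flow_key) % self.num_buckets

    def blade_for(self, flow_key: bytes):
        return self.bucket_to_blade[self.bucket_for(flow_key)]

    def remove_blade(self, failed):
        """Reassign only the buckets of the failed blade to the surviving blades."""
        survivors = [b for b in self.blade_ids if b != failed]
        if not survivors:
            raise ValueError("cannot remove the last blade")
        self.blade_ids = survivors
        i = 0
        for bucket, blade in enumerate(self.bucket_to_blade):
            if blade == failed:
                self.bucket_to_blade[bucket] = survivors[i % len(survivors)]
                i += 1
```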
A good hashing function may generate substantially uniformly distributed outputs. Weighted uniformity can also be achieved using hash functions. In other words, weights can be assigned to each blade server with respect to its capacity, and the distribution of the traffic may be expected to assume a (weighted) substantially uniform distribution with respect to capacity over the blade servers.
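One way to approximate weighted uniformity, assuming a bucket table as the mapping mechanism, is to give each blade a number of buckets proportional to its weight. The sketch below is illustrative only; the function name and the rule for distributing the rounding remainder are assumptions.

```python
def weighted_bucket_table(num_buckets, weights):
    """Build a bucket -> blade list in which each blade's share follows its weight.

    weights: dict mapping Blade ID -> positive integer capacity weight.
    """
    total = sum(weights.values())
    table = []
    for blade_id, weight in weights.items():
        table.extend([blade_id] * (num_buckets * weight // total))
    # Integer division may leave a few buckets unassigned; hand them out in turn.
    blade_ids = list(weights)
    i = 0
    while len(table) < num_buckets:
        table.append(blade_ids[i % len(blade_ids)])
        i += 1
    return table
```

With a substantially uniform hash onto the buckets, a blade with three times the weight of another would be expected to receive about three times as many flows.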
Non Static Mapping Schemes (Per Packet Server Load Balancing)
In some cases, a flow may be comprised of a single packet, and/or per packet server load balancing may be required. Transport protocols such as UDP (User Datagram Protocol) may tolerate such per packet server load balancing. In such cases, the load balancer may uniformly select a blade server to schedule an incoming packet. By doing so, the Flow ID to Blade ID mapping is not necessarily maintained, meaning that the same Flow ID may not necessarily map to the same Blade ID each time the scheduling algorithm (e.g., Random Blade Selection, Round Robin, Weighted Round Robin, etc.) is executed.
As mentioned above, if a data flow consists of only a single data packet (which is both the first and the last packet of the flow), then even flow based stateful load balancing may become a non-static mapping scheme, because there is no need to keep a Flow ID to Blade ID table.
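Per packet scheduling may be illustrated with the following round robin sketch, in which the packet contents are deliberately ignored. The function names are hypothetical.

```python
from itertools import cycle


def make_round_robin(blade_ids):
    """Return a per packet scheduler that ignores the Flow ID entirely."""
    blades = cycle(blade_ids)

    def pick(_packet):
        return next(blades)

    return pick


pick = make_round_robin(["b0", "b1"])
# Four packets of the same flow alternate between the two blades.
print([pick("flowA") for _ in range(4)])
```

Because no Flow ID to Blade ID association is kept, successive packets of one flow land on different blades, which is acceptable only where flow stickiness is not needed.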
Protocol Specific Load Balancing (Stateless)
Some load balancing schemes have exploited the nature of protocol specific handshakes, acknowledgements, and/or other mechanisms to leverage flow aware load balancing in a stateless manner.
For protocols like GTP and SCTP, the information about the assigned blade can be embedded in the packet headers which can be used for flow stickiness as briefly explained below.
New Connection Assignment:
A new flow is identified by considering the packet header fields (e.g., an INIT flag of an SCTP packet). The new flow is then assigned to a blade server using a scheduling algorithm such as Round Robin, a hash-based method that may exploit the random nature of bits (if any) in the header, etc.
Maintaining Flow Stickiness:
For protocols like SCTP and GTP, information about the blade assigned to the first packet of a data flow can be embedded in the headers of the subsequent flow packets. For example, a V_tag field in SCTP packets can store information about the Blade ID of the assigned blade server. Once a flow has been identified as an existing one (e.g., by determining that the INIT flag is not set, as is the case for the subsequent packets of an SCTP connection), the information about the assigned blade server can be extracted from the packet header and the packets can then be forwarded to the correct blades.
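The idea of carrying the blade assignment in a header field may be sketched as follows for a 32-bit tag. The bit layout (upper 8 bits for the Blade ID, lower 24 bits random) is purely an assumption made for illustration and is not defined by SCTP, GTP, or this disclosure; the function names are likewise hypothetical.

```python
import secrets

TAG_BITS = 32
BLADE_BITS = 8                       # assumed: upper 8 bits carry the Blade ID
RANDOM_BITS = TAG_BITS - BLADE_BITS  # remaining 24 bits stay random


def make_tag(blade_id: int) -> int:
    """Create a 32-bit tag whose upper bits encode the assigned blade."""
    if not 0 <= blade_id < (1 << BLADE_BITS):
        raise ValueError("blade_id does not fit in the assumed field width")
    # Keep the random part non-zero so that the whole tag is never zero.
    random_part = secrets.randbits(RANDOM_BITS) or 1
    return (blade_id << RANDOM_BITS) | random_part


def blade_from_tag(tag: int) -> int:
    """Recover the Blade ID from a tag without consulting any flow table."""
    return (tag >> RANDOM_BITS) & ((1 << BLADE_BITS) - 1)
```

The load balancer in such a scheme needs only blade_from_tag for packets of existing flows, so no per flow state is kept.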
Other Flow Aware Load Balancing Techniques
In DNS (Domain Name System) based server load balancing mechanisms, each server/blade has a unique IP address which is known by the DNS servers. When a client/outside node initiates a connection, the corresponding DNS request is sent to the DNS servers, which choose one of the IP addresses of the blade servers and send the response back. The client/outside node then directs the connection/flow towards the specific blade server. In other words, the destination IP address of the packets of the flow belongs to the blade server in question. The load balancer in this case performs a route lookup (e.g., a FIB lookup or Forwarding Information Base lookup based on the destination IP address) for all the flows to be forwarded to the correct blade server. No scheduling may need to be performed at the load balancer. Accordingly, various client requests are load balanced over the blade servers.
Table-Based Server Load Balancing (Stateful)
Centralized Scheduling and Table Look Up at the Load Balancer
As discussed above, flow aware server side load balancing may require all the packets belonging to the same flow to be forwarded to the same blade server. For a table-based approach, a state for each flow (e.g., Flow ID to Blade ID mapping) existing on the blades may need to be maintained.
Table-based load balancing may be compatible with server load aware (dynamic) load balancing techniques. As an example, using weighted round robin scheduling in conjunction with table based server load balancing, weights can be changed for each blade server dynamically based on the load on each blade server. As the number of flows increases, however, the size of the table as well as the time it takes to search the table also increases. Also, the table search/lookup has to be performed for every incoming packet, which may significantly increase processing overhead on the load balancer as the size of the table increases. With this approach, the load balancer may become more vulnerable to resource exhaustion (both CPU and memory resource exhaustion) in conditions of high traffic load.
Moreover, for every new flow/connection, the load balancer may need to perform scheduling operations and update the table accordingly. As a rate of new connections/flows increases, scale problems may arise because there is a single processing entity.
In addition, in standard deployments of a load balancing system, multiple load balancers may be deployed in parallel to provide increased availability in the event of a load balancer failure(s). In such deployments, flow replication mechanisms may be deployed to provide failover for active flows on a failed load balancer. Flow replication mechanisms may require all active flow state information to be replicated and distributed among all participating load balancers (i.e., synchronizing the table providing the flow ID to Blade ID mappings for the participating load balancers). Synchronization among the load balancers for such session state information may introduce significant communication and processing overhead.
In addition, the time from when a new flow (e.g., the first packet of a session) is directed to one of the load balancers until the other load balancer is ready for failover of that session (called Peering Delay) can be very high in the event of a high incoming flow rate. Peering delay is a known issue for resilient load balancing, in that the state of flows with lifetimes less than or equal to the peering delay would not be replicated at the other load balancer.
In summary, stateful server side load balancing may suffer from resource exhaustion, and also from the memory and processing overhead, inefficiency, and other issues (e.g., peering delay) of standard state replication mechanisms.
Stateful Load Balancing with Scheduling Offload
This scheme may share disadvantages of the Stateful scheme discussed above because the load balancer keeps the state table (e.g., a Flow ID to Blade ID mapping table). This state table may become very large when handling large numbers of data flows, thereby increasing memory requirements and table lookup times on the load balancer.
In addition, the controller node may be responsible for scheduling and updating the mapping table on the load balancer and may thus have the same/similar scale issues as discussed above with respect to other load balancing implementations.
Stateless Load Balancing
Static Mapping: Hash-Based Server Load Balancing
Hash-based server load balancing may depend on the arbitrariness of the traffic parameters (e.g., Flow ID, 5-tuple parameters, etc.) to provide a desired/required (e.g., substantially uniform) load distribution. If the probability distribution of the Flow ID of incoming data packets is known a priori, then it may be possible to design a Flow ID to Blade ID mapping (stateless, e.g., Hash) with Flow IDs as keys, that can substantially guarantee a desired/required (e.g., substantially uniform) flow distribution across the blades over a sufficiently large period of time. In many cases, however, a probability distribution of the Flow ID may not be known in advance and may change over time. Another challenge is that even if the statistical characteristics of the Flow IDs can be estimated accurately, lifetimes of the connections/flows are generally arbitrary. Accordingly, loads on the servers may change over time even with a hash function aligned with the Flow ID pattern of the traffic. Any hash based scheme or static mapping may thus not guarantee uniformity at all times.
Also, hash-based server load balancing approaches may not sufficiently support load aware (e.g., dynamic, adaptive) load balancing. Considering dynamic traffic load characteristics (e.g., lifetimes of each connection and arrival rates), the techniques in question may result in asymmetric load balancing among the blade servers. Changing the weights on the fly with respect to the load on the blade servers may be a challenge in this approach, because with the new weights, existing flows can be reassigned to a new blade server, which may terminate those flows.
Similarly, adding/removing blades to/from the load balancing site dynamically may be complicated in hash-based server load balancing, because any change in one of the hashing function parameters (e.g., the number of blade servers to be load balanced) risks breaking the existing flow to blade server associations, which may terminate the existing flows.
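The sensitivity of a single-stage modulo hash to a change in the number of blades may be quantified with the following sketch, which assumes that each flow is assigned to blade (hash value mod N). The function name is hypothetical.

```python
def fraction_remapped(flow_hashes, old_n, new_n):
    """Fraction of flows whose blade index changes when N goes from old_n to new_n."""
    flow_hashes = list(flow_hashes)
    moved = sum(1 for h in flow_hashes if h % old_n != h % new_n)
    return moved / len(flow_hashes)


# Evenly spread hash values, growing the site from 10 to 11 blades.
print(fraction_remapped(range(10000), 10, 11))
```

For evenly spread hash values, a flow keeps its blade only when its hash leaves the same remainder modulo 10 and modulo 11, which happens for about one hash value in eleven, so roughly ten out of eleven existing flows would be moved by this single change.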
To be more precise, removal of a blade may be easier than addition of a blade. As discussed above with respect to static mapping, when a blade is removed, the flows mapped to the blade can be re-mapped to the other active blades while keeping the mapping on the active blades intact. Hence, there may be no disruption in flows existing on previously active blades (assuming the hash bucket size is much larger than the number of servers, otherwise the uniformity of the load balancing may become an issue).
When a blade is added, however, some of the flows mapped to the previously active blades should be re-mapped to the added blade (for uniformity/resource utilization), which may cause disruption of existing flows. Otherwise, the load balancer may need to identify which connections were established before and which after the blade/server addition, which may require state keeping in the load balancer.
If a backup load balancer is used for purposes of redundancy and/or resiliency, there is no need for flow state (e.g., table entries for Flow IDs and server IDs) replication between the active and standby load balancers. The only thing to be synchronized between active and backup load balancers is the hash function itself.
Non Static Mapping Schemes (Per Packet Server Load Balancing)
As discussed above, with non-static mapping schemes a load balancer does not keep a flow ID to blade/server ID mapping table because load balancing decisions (i.e., scheduling decisions) are made per packet.
A disadvantage may be that these schemes alone cannot be used for load balancing with flow awareness. For example, if 5-tuple parameter connection level load balancing is required and per packet server load balancing is performed, all the connections may eventually be terminated because the packets of a single connection may end up with several blades/servers, only one of which has the connection state information.
For flow aware scheduling, these schemes may be used in conjunction with stateful schemes (e.g., Table based), which may have other disadvantages as discussed above.
Protocol Specific Load Balancing
Protocol specific load balancing techniques may have a disadvantage of being non-generic as these techniques may only be applied to specific protocols such as SCTP and/or GTP. For example, such techniques may not apply to IP traffic.
Other Techniques
As discussed above, different load balancing techniques are provided for different data applications. In general, however, each of these load balancing techniques may only be suited to specific respective applications and may have limited scope for usage in other applications. A summary of characteristics of stateful, stateless static, stateless per packet, and stateless protocol specific load balancing schemes is provided in the table of
Hash Based Implementation of the Load Balancer
Architecture of the Load Balancer
For every incoming packet, the load balancer first computes the hash of the packet header (e.g., a hash of a Flow ID included in the packet header) to obtain a corresponding Bucket ID. The load balancer then maps the computed Bucket ID to a Blade ID using a look-up over the B2B (Bucket to Blade) Mapping Table. The load balancer then forwards the packet to the corresponding blade.
The first stage hash is a static mapping from Flow IDs to Buckets or Bucket IDs. Also, Bucket-to-Blade mapping can be considered static over a period of time. Therefore, this scheme may have an ability to maintain flow-level granularity. Determinism and dynamicity may be provided by modifying the B2B mapping table. In fact, it can be shown that a reasonably good algorithm to map Buckets to Blades may allow this scheme to have improved uniformity relative to a one-stage static hash from Flow IDs to Blades.
The load balancer of
Load balancing schemes of
In a scenario when a new blade is added, for example, a “Blade n+1” may be added to the original system of n blades illustrated in
Similarly, a random blade K may be removed abruptly or as part of a pre-planned downtime. In this situation, all Bucket IDs that were mapped to Blade K will now be remapped to some other Blade. The packets from existing flows/sessions that were being forwarded to Blade K will now be forwarded to the other blade and may thus be subsequently dropped (thereby disrupting the flow stickiness).
Buckets-to-Blades (B2B) mapping may also be changed to provide better load balancing (uniformity). This situation is not unlikely because a Bucket ID is simply a hash of the Flow ID of the packet, the distribution of which is unknown and may be sufficiently arbitrary that it causes uneven loads and/or numbers of connections to each bucket. In such a scenario, when a bucket is remapped from an initial blade ID to a new blade ID, all the existing flows which were destined towards the original blade will now be directed to the new blade and may therefore be disrupted.
In summary, the current implementations of load balancing may cause flow disruptions when B2B (Bucket to Blade) mapping is changed. Moreover, hash based load balancing may not sufficiently support hitless addition/removal of blades and/or remapping of blades. Stated in other words, existing connections through a bucket may be affected/lost when a mapping of a bucket is changed from one server/blade to another.
It may therefore be an object to address at least some of the above mentioned disadvantages and/or to improve network performance. According to some embodiments, for example, loss of existing data flows may be reduced during load balancing when a mapping of a bucket changes from one server/blade to another.
According to some embodiments, methods may be provided to forward data packets to a plurality of servers with each server being identified by a respective server identification (ID). A mapping table may be defined including a plurality of bucket identifications (IDs) identifying a respective plurality of buckets. More particularly, the mapping table may map a first of the plurality of bucket IDs to a first of the server IDs for a first of the plurality of servers as a current server ID for the first bucket ID, the mapping table may map a second of the plurality of bucket IDs to a second of the server IDs for a second of the plurality of servers as a current server ID for the second bucket ID, and the mapping table may map the first bucket ID to a third of the plurality of server IDs for a third of the plurality of servers as an old server ID for the first bucket ID. A data packet of a data flow may be received with the data packet including information for the data flow, and a bucket ID may be computed for the data packet as a function of the information for the data flow with the bucket ID for the data flow being computed as the first bucket ID. Responsive to computing the first bucket ID as the bucket ID for the data flow and responsive to the mapping table mapping the first bucket ID to the first server ID as the current server ID and to the third server ID as the old server ID, the data packet may be transmitted to the first server and/or to the third server using the first server ID and/or using the third server ID from the mapping table.
When a bucket is in a transient state (e.g., transitioning from an old server to a current server to support load balancing, server replacement, etc.), the mapping table may thus include IDs for the current and old servers associated with the transient state bucket to support new data flows and previously existing data flows.
Transmitting the data packet to the first server and/or to the third server may include transmitting the data packet as a unicast to the first server using the first server ID responsive to the mapping table mapping the first bucket ID to the first server ID as the current server ID and to the third server ID as the old server ID and responsive to the data packet being an initial data packet for the data flow.
Transmitting the data packet to the first server and/or to the third server may include transmitting the data packet as a multicast to the first and third servers using the first and third server IDs responsive to the mapping table mapping the first bucket ID to the first server ID as the current server ID and to the third server ID as the old server ID and responsive to the data packet being a non-initial data packet for the data flow.
Transmitting the first data packet to the first server and/or to the third server may include transmitting the data packet as a multicast to the first and third servers using the first and third server IDs responsive to the mapping table mapping the first bucket ID to the first server ID as the current server ID and to the third server ID as the old server ID.
After mapping the first data packet using the mapping table, the mapping table may be revised to remove the mapping of the first bucket ID to the third server ID as an old server ID for the first bucket ID. Moreover, the data packet may be a first data packet and the data flow may be a first data flow. After revising the mapping table, a second data packet of a second data flow may be received with the second data packet including information for the second data flow. A bucket ID may be computed for the second data packet as a function of the information for the second data flow with the bucket ID for the second data flow being computed as the first bucket ID. Responsive to computing the first bucket ID as the bucket ID for the second data flow and responsive to the mapping table mapping the first bucket ID to the first server ID as the current server ID, the second data packet may be transmitted as a unicast to the first server using the first server ID from the mapping table.
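As a minimal sketch of the mapping table described above, the following example records a current server ID and an optional old server ID per bucket, and shows how removing the old server ID changes forwarding from a multicast to a unicast. The names `BucketEntry`, `targets`, and `clear_old`, and the representation of an absent old server ID as `None`, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class BucketEntry:
    current: int                # current server ID for the bucket
    old: Optional[int] = None   # old server ID; None when none is recorded


def targets(table: Dict[int, BucketEntry], bucket_id: int) -> List[int]:
    """Return the server IDs to which a packet of this bucket is transmitted."""
    entry = table[bucket_id]
    if entry.old is None:
        return [entry.current]           # unicast to the current server
    return [entry.current, entry.old]    # multicast to current and old servers


def clear_old(table: Dict[int, BucketEntry], bucket_id: int) -> None:
    """Revise the table to remove the old server ID mapping for the bucket."""
    table[bucket_id].old = None
```

For example, with a first bucket mapped to server 1 as current and server 3 as old, `targets` would return both servers until `clear_old` is called, after which only server 1 would be returned.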
Computing the bucket ID for the data packet may include performing a hash function on the information for the data flow. Moreover, the information for the data flow may include a data flow identification (ID) for the data flow, and performing the hash function may include performing the hash function on the data flow ID.
A number of the plurality of bucket IDs in the mapping table may be greater than a number of the servers and server IDs, and the mapping table may provide a same server ID for more than one of the bucket IDs.
According to some other embodiments, a load balancer may be configured to forward data packets to a plurality of servers with each server being identified by a respective server identification (ID). The load balancer may include a network interface configured to receive data packets from an outside network, a server interface configured to forward data packets to the servers, memory configured to store a mapping table including a plurality of bucket identifications (IDs) identifying a respective plurality of buckets, and a processor coupled to the network interface, the server interface, and the memory. More particularly, the mapping table may map a first of the plurality of bucket IDs to a first of the server IDs for a first of the plurality of servers as a current server ID for the first bucket ID, the mapping table may map a second of the plurality of bucket IDs to a second of the server IDs for a second of the plurality of servers as a current server ID for the second bucket ID, and the mapping table may map the first bucket ID to a third of the plurality of server IDs for a third of the plurality of servers as an old server ID for the first bucket ID. The processor may be configured to receive a data packet of a data flow through the network interface wherein the data packet includes information for the data flow, and to compute a bucket ID for the data packet as a function of the information for the data flow, with the bucket ID for the data flow being computed as the first bucket ID. In addition, the processor may be configured to transmit the data packet through the server interface to the first server and/or to the third server using the first server ID and/or using the third server ID from the mapping table responsive to computing the first bucket ID as the bucket ID for the data flow and responsive to the mapping table mapping the first bucket ID to the first server ID as the current server ID and to the third server ID as the old server ID.
According to still other embodiments, a method of processing data packets at a server may include defining a server flow table for the server with the server flow table including data flow identifications for data flows being processed by the server, and receiving a data packet of a data flow at the server with the data packet including information for the data flow. Responsive to the data packet being an initial data packet of the data flow for the server, a data flow identification of the data flow may be added to the server flow table, and the data packet may be processed at the server.
The data packet may include information identifying the data packet as an initial data packet of the data flow, adding the data flow identification may include adding the data flow identification of the data flow to the server flow table responsive to the information identifying the data packet as the initial data packet of the data flow, and processing the data packet at the server may include processing the data packet at the server responsive to the information identifying the data packet as the initial data packet of the data flow.
The data packet may be a first data packet and the data flow may be a first data flow. In addition, a second data packet of a second data flow may be received at the server with the second data packet including information identifying the second data packet as a non-initial data packet of the second data flow. Responsive to the information identifying the second data packet as a non-initial data packet of the second data flow and responsive to a data flow identification for the second data packet being included in the server flow table for the server, the second data packet may be processed at the server.
The data packet may be a first data packet and the data flow may be a first data flow. In addition, a second data packet of a second data flow may be received at the server with the second data packet including information identifying the second data packet as a non-initial data packet of the second data flow. Responsive to the information identifying the second data packet as a non-initial data packet of the second data flow and responsive to a data flow identification for the second data packet being excluded from the server flow table for the server, the second data packet may be dropped at the server.
In addition, a list of bucket identifications (IDs) may be defined for buckets that map to the server. Adding the data flow identification to the server flow table may include adding the data flow identification to the server flow table responsive to the data packet being a multicast data packet addressed to the server and to another server and responsive to the data flow identification for the data packet mapping to a bucket ID included in the list of bucket IDs for buckets that map to the server. Processing the data packet may include processing the data packet at the server responsive to the data packet being a multicast data packet addressed to the server and to another server and responsive to the data flow identification for the data packet mapping to a bucket ID included in the list of bucket IDs for buckets that map to the server.
Adding the data flow identification to the server flow table may include adding the data flow identification to the server flow table responsive to the data packet being a unicast data packet addressed to the server and responsive to the data flow identification being excluded from the server flow table. Processing the data packet may include processing the data packet at the server responsive to the data packet being a unicast data packet addressed to the server and responsive to the data flow identification being excluded from the server flow table.
A list of bucket identifications may be defined for the server, and responsive to the data packet being a unicast data packet addressed to the server and responsive to the data flow identification being excluded from the server flow table, a bucket ID for the data packet may be computed as a function of the information for the data flow.
The data packet may include information identifying the data packet as an initial data packet of the data flow. Adding the data flow identification to the server flow table may include adding the data flow identification to the server flow table responsive to the data packet being a unicast data packet addressed to the server and responsive to the information identifying the data packet as the initial data packet of the data flow. Processing the data packet may include processing the data packet at the server responsive to the data packet being a unicast data packet addressed to the server and responsive to the information identifying the data packet as the initial data packet of the data flow.
A list of bucket identifications (IDs) may be defined for the server, and responsive to the data packet being a unicast data packet addressed to the server and responsive to the information identifying the data packet as the initial data packet of the data flow, a bucket ID for the data packet may be computed as a function of the information for the data flow.
The information for the data flow may include an identification of the data flow, and computing the bucket ID may include computing the bucket ID as a function of the identification of the data flow.
The data packet may be a first data packet and the data flow may be a first data flow. A second data packet of a second data flow may be received at the server with the second data packet including information for the second data flow. Responsive to the information for the second data flow identifying the second data packet as an initial data packet of the second data flow and responsive to the second data packet being a multicast data packet addressed to the server and to another server, a bucket ID for the second data packet may be computed as a function of the information for the second data flow. Responsive to the bucket ID for the second data packet being excluded from the list of bucket IDs for the server, the second data packet may be dropped.
The data packet may be a first data packet and the data flow may be a first data flow. In addition, a second data packet of a second data flow may be received at the server with the second data packet including information for the second data flow, and responsive to the information for the second data flow including a data flow identification matching one of the data flow identifications of the server flow table, the second data packet may be processed at the server.
A third data packet of a third data flow may be received at the server wherein the third data packet includes information for the third data flow, and responsive to the information identifying the third data packet as a non-initial data packet of the third data flow and responsive to the information for the third data flow including a data flow identification that does not match one of the data flow identifications of the server flow table, the third data packet may be dropped at the server.
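The server-side screening described in the preceding paragraphs may be sketched, under stated assumptions, as a single decision function. This sketch follows the variant in which a unicast packet is added to the server flow table only when it is identified as an initial data packet. The function name `screen_packet`, its parameters, and the `bucket_of` callback standing in for the hash function are hypothetical.

```python
from typing import Callable, Set


def screen_packet(
    flow_id: str,
    is_init: bool,
    is_multicast: bool,
    my_flows: Set[str],
    my_buckets: Set[int],
    bucket_of: Callable[[str], int],
) -> bool:
    """Return True to process the packet at this server, False to drop it."""
    # A flow already in the server flow table is processed here.
    if flow_id in my_flows:
        return True
    # A non-initial packet of an unknown flow is dropped.
    if not is_init:
        return False
    # An initial multicast packet is accepted only if its bucket maps to this server.
    if is_multicast and bucket_of(flow_id) not in my_buckets:
        return False
    # Otherwise the new flow is recorded and the packet is processed.
    my_flows.add(flow_id)
    return True
```

Because each server applies the same test independently, a multicast packet sent to two servers during a transient state would be processed by at most one of them, provided the lists of bucket IDs held by the two servers do not overlap for that bucket.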
According to still other embodiments, a server may be configured to process data packets. The server may include a load balancer interface configured to receive data packets from a load balancer, a memory configured to store a server flow table for the server with the server flow table including data flow identifications for data flows being processed by the server, and a processor coupled to the load balancer interface and to the memory. The processor may be configured to receive a data packet of a data flow through the load balancer interface with the data packet including information for the data flow, to add a data flow identification of the data flow to the server flow table in memory responsive to the data packet being an initial data packet of the data flow for the server, and to process the data packet responsive to the data packet being an initial data packet of the data flow for the server.
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate certain non-limiting embodiment(s) of inventive concepts. In the drawings:
a is a flow chart illustrating load balancer operations for data flows using redirect based approaches according to some embodiments, and
Embodiments of present inventive concepts will now be described more fully hereinafter with reference to the accompanying drawings, in which examples of embodiments of inventive concepts are shown. Inventive concepts may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. It should also be noted that these embodiments are not mutually exclusive. Components from one embodiment may be tacitly assumed to be present/used in one or more other embodiments. According to embodiments disclosed herein, a blade may be interpreted/implemented as a server and/or a server may be interpreted/implemented as a blade.
As shown in
According to some embodiments, hitless dynamic behavior (e.g., hitless addition and/or removal of blades/servers, hitless changes in the load distribution, etc.) and/or reduced hit dynamic behavior in hash based load balancing architectures may be provided while maintaining flow stickiness. In some embodiments, load balancing approaches may include: (1) Multicast/Broadcast Based Distributed Approaches; (2) Transient Table Based Approaches; and/or (3) HTTP Redirect based Approaches.
In multicast/broadcast based distributed approaches, packets in each bucket may be multicast to both original and new target blades for that bucket while Bucket to Blade (B2B) mapping changes dynamically. These approaches may maintain flow stickiness and hitless (or reduced hit) support for addition/removal/remapping of blades for both type 1 and type 2 flows. Load balancing operations may run in a distributed fashion (i.e., partially on the load balancer and partially on the blades themselves). Related load balancing operations are discussed, for example, in U.S. patent application Ser. No. 13/464,608 entitled “Two Level Packet Distribution With Stateless First Level Packet Distribution To A Group Of Servers And Stateful Second Level Packet Distribution To A Server Within The Group” filed May 4, 2012, the disclosure of which is hereby incorporated herein in its entirety by reference.
In transient tables based approaches, load balancing operations may be handled on the load balancer without burdening blades with additional operations to support load balancing. Moreover, load balancer LB provides unicast transmissions of packets thereby saving bandwidth. More particularly, transient tables are temporarily maintained in memory 807 at load balancer LB while changing bucket to blade mappings. Transient tables based approaches, however, may only support type 1 flows.
HTTP redirect based approaches are based on a concept of HTTP (Hypertext Transfer Protocol) redirect within an application layer. Each blade uses HTTP redirect to point the incoming flows to their new destination blade when the Bucket-to-Blade (B2B) mapping is modified. In this mechanism, there is no multicast of packets and/or there are no additional tables to be maintained. HTTP redirect based approaches may support both type-1 and type-2 flows, but may work only for HTTP traffic.
Multicast/broadcast based distributed approaches, transient table based approaches, and HTTP redirect based approaches are discussed in greater detail below.
Modifying a Bucket-to-Blade (B2B) Mapping Table
According to some embodiments, a bucket to blade (B2B) mapping table is maintained in memory 807 of load balancer LB, and the B2B mapping table includes a first column for bucket IDs, a second column for blade IDs (also referred to as current blade IDs), and a third column for old blade IDs. When a server/blade is added to, removed from, or reassigned within the plurality of servers/blades of
Addition of a Blade
Modification of a B2B mapping table responsive to adding a blade is illustrated in
By adding a new Blade (e.g., Blade 4) to the plurality of blades, additional capacity may be added to the system, and buckets originally mapped to previously existing blades may be remapped to the new blade to provide load balancing. Accordingly, data traffic to previously existing blades may be reduced while providing traffic for the newly added blade. In addition, the old blade ID column may be used according to some embodiments to support data flows from buckets 2 and 3 that began before and continue after the remapping.
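By way of illustration only, remapping buckets to a newly added blade while recording the previous blade in the old blade ID column may be sketched as follows. The table layout as a dictionary of tuples, the function names `remap_bucket` and `is_transient`, and the initial blade assignments used in the usage note are assumptions for this sketch and are not taken from the figures.

```python
from typing import Dict, Optional, Tuple

# bucket ID -> (current blade ID, old blade ID or None when the field is empty)
B2BTable = Dict[int, Tuple[int, Optional[int]]]


def remap_bucket(table: B2BTable, bucket_id: int, new_blade: int) -> None:
    """Point a bucket at a new blade and record the previous blade as old.

    This sketch keeps a single old blade per bucket (a multicast group of 2).
    """
    current, _ = table[bucket_id]
    if current != new_blade:
        table[bucket_id] = (new_blade, current)


def is_transient(table: B2BTable, bucket_id: int) -> bool:
    """A bucket is in a transient state when its old blade ID field is non-empty."""
    return table[bucket_id][1] is not None
```

For instance, starting from a hypothetical table in which buckets 1, 2, and 3 map to blades 1, 2, and 3 respectively, calling `remap_bucket` for buckets 2 and 3 with blade 4 would leave bucket 1 unchanged and would record blades 2 and 3 as the old blades for buckets 2 and 3.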
A bucket is defined to be in steady-state when the Old Blade ID field corresponding to that bucket is empty. In
A bucket is defined to be in a transient state if the Old Blade ID field corresponding to that bucket identifies an old blade ID (e.g., the old blade ID field is non-empty). Buckets 2 and 3 of
Removal of a Blade
Modification of a B2B mapping table responsive to removing a blade is illustrated in
By removing a Blade (e.g., Blade 3) from the plurality of blades, capacity may be reduced, and buckets originally mapped to the blade that is removed may be remapped to a remaining blade to provide load balancing. Accordingly, data traffic to previously existing blades may be increased. In addition, the old blade ID column may be used according to some embodiments to support data flows from Bucket 1 that began before and continue after the remapping.
Reallocation of Buckets to Blades
Modification of a B2B mapping table responsive to rescheduling data flows is illustrated in
Transient Multicast/Broadcast Based Distributed Approach (Approach I)
A Multicast/Broadcast Based Distributed Approach may enable hitless (or reduced hit) addition, removal, and/or reallocation of blades while maintaining flow stickiness. In this approach, packets that belong to buckets in steady state may be unicast from load balancer LB to the respective blades (as identified by the current blade IDs for the respective steady state bucket IDs), while packets that belong to buckets in transient state may be multicast/broadcast to both the current and old blades for those buckets. In this case, additional operations may be performed on each blade to determine whether each received packet is to be processed or dropped. This sharing of operations between load balancer and servers is referred to as a distributed approach, and this approach may work for both type-1 and type-2 flows.
While discussion of transient multicast/broadcast operations is provided for a multicast group of 2 for the sake of conciseness, embodiments of inventive concepts may be implemented with larger multicast groups. Transient Multicast/Broadcast based distributed approaches, for example, may be generalized to larger multicast groups as discussed below in the section entitled “Extended Operations For Multiple Cascaded Transients”. Similarly, operations disclosed herein are not limited to multicast. These operations may be generalized using, for example, VLAN based broadcast, as briefly discussed below in the section entitled “VLAN Based Broadcast Implementation Alternative”.
For Type-1 Flows
As discussed above, type-1 data flows are those data flows for which it is possible to detect the start of the data flow or the first data packet of the data flow by considering only bits in the first data packet of the data flow (i.e., without consulting any other data/information). In this section, a multicast based distributed approach is presented for type-1 flows, with type-1 flows being the ones that are most commonly encountered. This approach may be broken into two parts: data plane operations; and control plane operations.
Data plane operations may primarily be used to handle how incoming data packets are forwarded to and received by the blades assuming an instance of a B2B Mapping Table at any given point in time. Data plane operations may include operations running on both the load balancer and the blades. Control plane operations may primarily be used to handle maintenance and modification of the load balancer table.
Data Plane Operations for Type-1 Flows
In this approach, a two-stage ‘distributed’ mechanism may be followed. The first stage includes the screening of packets at the load balancer LB to make an appropriate forwarding decision. The second stage includes the screening of received packets at the blade.
Operations at the load balancer are discussed as follows. A B2B mapping table is maintained in memory 807 at load balancer LB. For every incoming data packet, load balancer processor 801 obtains the Bucket ID using the hash function. The bucket ID may be computed as a hash of element(s) of the packet header (e.g., a hash of the Flow ID included in the packet header). If the bucket is in steady state, the packet is unicast to the blade (identified in the current blade ID column of the B2B mapping table) corresponding to the bucket. If the bucket is in transient state (with both current and old blades identified for the bucket) and the data packet is an initial data packet of the data flow (as indicated by an INIT identifier), the packet is unicast to the current blade ID for the transient state bucket. If the bucket is in transient state and the data packet is not an initial data packet of the data flow, the packet may be multicast to both the current blade and the old blade as indicated in the blade ID and old blade ID columns of the B2B mapping table.
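The three forwarding cases above may be sketched as follows, assuming the bucket lookup has already produced the current and old blade IDs for the packet. The function name `forward_type1` and the string labels for the transmission mode are hypothetical and used only in this sketch.

```python
from typing import List, Optional, Tuple


def forward_type1(
    current_blade: int,
    old_blade: Optional[int],
    is_init: bool,
) -> Tuple[str, List[int]]:
    """Return the transmission mode and destination blade IDs for a packet."""
    # Steady state: the old blade ID field is empty, so unicast to the current blade.
    if old_blade is None:
        return ("unicast", [current_blade])
    # Transient state, initial packet: a new flow starts on the current blade only.
    if is_init:
        return ("unicast", [current_blade])
    # Transient state, non-initial packet: the owning blade is unknown to the
    # load balancer, so the packet goes to both the current and old blades.
    return ("multicast", [current_blade, old_blade])
```

The design choice illustrated here is that the load balancer keeps no per-flow state: only non-initial packets of transient buckets are duplicated, and the receiving blades decide which copy to process.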
By way of example, a first data packet may be received at block 1201 from the network through network interface 805, and processor 801 may perform the hash function on a flow ID from a header of the first data packet at block 1203 to obtain a bucket ID corresponding to the flow ID. If the hash function outputs bucket ID 2, processor 801 consults the B2B mapping table (e.g., as shown in
A second data packet may be received at block 1201 from the network through network interface 805, and processor 801 may perform the hash function on a flow ID from a header of the second data packet at block 1203 to obtain a bucket ID corresponding to the flow ID. If the hash function outputs bucket ID 1, processor 801 consults the B2B mapping table (e.g., as shown in
In embodiments of
If a received non-INIT data packet is unicast from load balancer LB at block 1211 for a steady state bucket, the packet is being sent only to the blade that is to process the non-INIT data packet, and the receiving blade should thus process this packet because the receiving blade should have the data flow for this non-INIT data packet in its ‘my-flows-table’. As shown in
If the data packet is not an initial data packet (referred to as a non-initial data packet) at block 1303 (e.g., the non-initial data packet does not include an INIT flag), processor 701 determines if the non-initial data packet belongs to a data flow being handled by the blade. More particularly, processor 701 compares a data flow ID of the non-initial data packet (e.g., included in a header of the packet) with data flow IDs from the ‘my-flows-table’ of
Alternate Data Plane Operations for Type-1 Flows
In this section, another two-stage ‘distributed’ data plane algorithm is described. These alternate operations may provide a relatively simplified load balancer, such that the load balancer merely forwards the packets based on the buckets they are hashed to. In this approach, there is no assumption as to whether the load balancer is capable of identifying start-of-the-flow (INIT) packets. Even though some operations of this approach may be common with respect to operations discussed above with respect to
At the load balancer LB, a bucket to blade mapping table is maintained, and for every incoming packet, the Bucket ID is obtained using the hash function (e.g., by computing a hash of an element of the packet header such as the Flow ID). If the bucket is in steady state, the packet is unicast to the blade corresponding to the bucket (i.e., the current blade from the B2B table for the bucket). If the bucket is in transient state, the load balancer multicasts the packet to both the current Blade ID and the Old Blade ID identified in the B2B mapping table.
As shown in
The potentially reduced complexity of load balancer operations of
Each blade may maintain both a list of flow IDs for data flows being processed by that blade (also referred to as a server flow table) as well as a list of buckets that the blade is serving (also referred to as a buckets table), and both lists may be provided in a table referred to as a ‘my-flows-and-buckets-table’. A more detailed architecture of this table may be significant from a control plane perspective, but for purposes of data plane operations, each blade may include a list of flow IDs for data flows being processed by the blade and a list of buckets that the blade is serving.
If the load balancer unicasts a data packet (so that the data packet includes a unicast indicator) to only one blade in accordance with operations of
If the data packet is a unicast data packet at block 1503 and an initial data packet for a new data flow (e.g., as indicated by an INIT flag) at block 1505, processor 701 may perform the hash function on the flow ID of the data packet to compute a bucket ID used to process the data packet at the load balancer at block 1507. Processor 701 may then add the flow ID for the new flow to the my-flows table at block 1511, and process the data packet at block 1517. By adding the flow ID to the table of
If the data packet is a multicast data packet at block 1503 and an initial data packet for a new data flow (e.g., as indicated by an INIT flag) at block 1519, processor 701 may perform the hash function on the flow ID of the data packet to compute a bucket ID used to process the data packet at the load balancer at block 1521. If the resulting bucket ID is included in the table of
In embodiments of
While not shown in
If the data packet is an initial data packet for a new data flow (e.g., as indicated by an INIT flag) at block 1519, processor 701 may perform the hash function on the flow ID of the data packet to compute a bucket ID used to process the data packet at the load balancer at block 1521. If the resulting bucket ID is included in the table of
In embodiments of
While not shown in
Control Plane Operations for Type-1 Flows
In this section, control plane embodiments are discussed. There are multiple viable ways in which a control plane can be implemented depending on the use case and various embodiments are disclosed herein.
As discussed above, a bucket enters the transient state from the steady state when the Blade ID corresponding to the bucket is modified and the original Blade ID is recorded in the Old Blade ID column. Control plane operations may be used to decide when the bucket can return to the steady state after having entered the transient state. When a signal is received from the control plane indicating that the bucket should return to the steady state (from the current transient state), the Old Blade ID field corresponding to the bucket is erased. The control plane may thus decide when the Old Blade ID is no longer needed.
For example, an Old Blade ID may no longer be needed when all flows belonging to the bucket that are mapped to the Old Blade ID have ended. In other words, the Old Blade ID for bucket ‘x’ may no longer be needed when the number of connections on and/or data flows to the old blade that correspond to bucket ‘x’ goes to zero. This criterion, however, may be unnecessarily strict. For example, a few connections may be active for a relatively long time, connections may be inactive but kept open for a relatively long time, or FIN/FINACKs (end of data flow indicators) may go missing. Under such conditions, a bucket may be in the transient state for an unnecessarily long time, resulting in a loss of bandwidth due to unnecessary multicast transmissions and/or additional processing on blades. Therefore, a mechanism that provides a reasonable criterion to conclude that the Old Blade ID for the bucket is no longer needed may be desired.
As discussed above, a more detailed architecture of the My-flows-table may be useful from the control plane perspective.
Information from the consolidated flows table of
Another criterion may be based on a net bit-rate corresponding to the bucket being less than a threshold. A low net bit-rate may be a good reason to drop the existing flows corresponding to the bucket and release the bucket to a steady state.
Still another criterion may be based on a timer that is used to determine periods of inactivity. If the last data packet from any data flow for a bucket was received a relatively long time ago, data flows from that bucket may be inactive. If the flows corresponding to the bucket have been inactive for a sufficiently long period of time, that bucket may be returned to the steady state without significant loss of data.
One or more of the above referenced criteria may be used alone or in combination to determine when a bucket should be returned from the transient state to the steady state. Additionally, for type-1 flows, a Consolidated Flows Table may also be created and maintained at the load balancer LB for each blade. The load balancer can keep track of numbers of connections for each blade by incrementing or decrementing a respective connection counter whenever an outgoing INIT/INITACK or FIN/FINACK packet is generated for the blade. It may also be possible to keep track of bucket net bit-rates and/or last packets received in a similar manner for each blade.
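A minimal sketch of how such criteria might be combined is given below, assuming that per-bucket statistics for the old blade are available to the control plane. The record `OldBladeStats`, the function `may_return_to_steady`, the choice to combine the criteria with a logical OR, and all threshold values are hypothetical and are offered only as one possible illustration.

```python
from dataclasses import dataclass


@dataclass
class OldBladeStats:
    connections: int         # open connections on the old blade for the bucket
    bit_rate_bps: float      # net bit-rate of the bucket's flows on the old blade
    last_packet_time: float  # time (seconds) the last packet for the bucket was seen


def may_return_to_steady(
    stats: OldBladeStats,
    now: float,
    min_bit_rate_bps: float = 1000.0,  # hypothetical threshold
    idle_timeout_s: float = 60.0,      # hypothetical threshold
) -> bool:
    """Return True if the Old Blade ID for the bucket may be erased."""
    # Criterion 1: no connections remain on the old blade for this bucket.
    if stats.connections <= 0:
        return True
    # Criterion 2: the net bit-rate has fallen below a threshold.
    if stats.bit_rate_bps < min_bit_rate_bps:
        return True
    # Criterion 3: the bucket's flows have been inactive for long enough.
    if now - stats.last_packet_time >= idle_timeout_s:
        return True
    return False
```

In such a sketch, the connection count could be maintained by incrementing a counter on each INIT/INITACK and decrementing it on each FIN/FINACK, consistent with the counter described above.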
For Type 2 Flows
As discussed above, type-2 flows are data flows which may be substantially arbitrary such that initial packets of a type-2 data flow may be difficult to identify when considering packet header information. In this section, a multicast based distributed approach is discussed for type-2 data flows. The approach may include two parts: data plane operations; and control plane operations. Data plane operations may primarily handle how incoming data packets are forwarded to and received by the blades assuming an instance of a B2B Mapping Table at any given point in time. Data plane operations may include operations running on both the load balancer and the blades. Control plane operations may be used to maintain and modify the B2B Mapping Table residing on load balancer memory 807.
Data Plane Operations for Type-2 Flows
In this approach, a two-stage ‘distributed’ mechanism may be used (similar to that discussed above with respect to type-1 flows). The first stage includes screening data packets at the load balancer to make an appropriate forwarding decision. The second stage includes screening of received packets at the blade. Because the load balancer may be unable to identify start-of-the-flow/INIT data packets for type-2 data flows, the load balancer forwarding decisions for type-2 data flows may be based on whether the bucket to which a data packet maps is in steady state or transient state (without considering whether the data packet is an initial or subsequent data packet of its data flow).
Load balancer LB operations may include maintaining a B2B mapping table at load balancer memory 807. For each incoming data packet, load balancer processor 801 may obtain the Bucket ID using the hash function. As discussed above, the Bucket ID is computed as a hash of an element of the packet header such as the Flow ID. If the bucket is in steady state, the data packet is unicast to the blade corresponding to the bucket (i.e., the current blade identified for the bucket ID in the B2B mapping table). If the bucket is in transient state, the data packet is multicast to both current Blade ID and Old Blade ID (i.e., the current blade and the old blade identified for the bucket ID in the B2B mapping table).
According to some embodiments, the multicast/unicast indicator may be provided as the destination address of the data packet. If the data packet is transmitted as a multicast to both the current and old blades of Bucket 1 (in transient state) at block 1409, for example, load balancer processor 801 may include a multicast destination address (with addresses/IDs of both blade 4 and blade 7) in the header of the data packet. If the data packet is transmitted as a unicast to only the current blade of Bucket 2 (in steady state) at block 1411, for example, load balancer processor 801 may include a unicast destination address (with address/ID of only blade 3) in the header of the data packet. The receiving blade or blades can thus determine whether a received data packet was transmitted from the load balancer as a unicast or multicast transmission based on the destination address or addresses included in the data packet.
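By way of a non-limiting illustration, the load balancer forwarding decision for type-2 flows might be sketched as follows. The use of CRC32 as the hash function, the bucket count, and the names `B2BEntry`, `bucket_id`, and `destinations` are assumptions made for this sketch only.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

NUM_BUCKETS = 8  # illustrative value; a deployment would use far more buckets


@dataclass
class B2BEntry:
    """One row of a hypothetical B2B mapping table."""
    current_blade: int
    old_blade: Optional[int] = None  # populated only while the bucket is transient


def bucket_id(flow_id: str) -> int:
    """Hash the flow ID to a bucket ID (CRC32 is an assumed stand-in hash)."""
    return zlib.crc32(flow_id.encode("utf-8")) % NUM_BUCKETS


def destinations(b2b: Dict[int, B2BEntry], flow_id: str) -> Tuple[int, ...]:
    """Return the blade(s) that should receive a packet of the given flow."""
    entry = b2b[bucket_id(flow_id)]
    if entry.old_blade is None:
        return (entry.current_blade,)                 # steady state: unicast
    return (entry.current_blade, entry.old_blade)     # transient state: multicast
```

A single destination corresponds to a unicast transmission, and two destinations correspond to a multicast transmission to the current and old blades.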
At the Blades
A Multicast Partner of a blade for a given data packet/flow is defined as the other blade (in a multicast group) to which a data packet/flow is being multicast (forwarded). Stated in other words, a multicast group for a transient state bucket is defined to include the current and old blades for the transient state bucket, and each blade of the multicast group is defined as a multicast partner of the other blade(s) of the multicast group. In operations discussed above, a data packet may be multicast to at most two blades (the current and old blade for the corresponding transient state bucket). According to such embodiments, a blade will have at most one multicast partner for any given data packet. By way of example, for operations of FIGS. 17A and 17B, Blade 7 is the multicast partner of Blade 4, and Blade 4 is the multicast partner of Blade 7 for any data packets/flows that hash to Group 1 in the illustrated transient state.
A blade can determine its multicast partner for a given packet by considering the destination multicast group address that the data packet is sent to. As discussed above, the header of each data packet may include a unicast/multicast destination address allowing a receiving blade to determine whether the data packet was transmitted as a unicast or multicast transmission, and also allowing the receiving blade to identify a multicast partner(s) for any multicast data packets. Each receiving blade can thus maintain a mapping between active multicast group addresses and constituent blades. Each receiving blade can also determine its multicast partner(s) by hashing the flow ID of the packet to obtain Bucket ID and then looking it up in the B2B Mapping Table to determine which other (old or current) Blade ID it is being grouped with. The load balancer may also encode the multicast partner in one of the header fields. In summary, regardless of the implementation, a blade may be able to identify its multicast partner.
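By way of a non-limiting illustration, a receiving blade might resolve its multicast partner(s) from the destination address as sketched below. The table layout, the example addresses, and the function name `multicast_partners` are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical mapping from active multicast group address to member Blade IDs.
GroupTable = Dict[str, Tuple[int, ...]]


def multicast_partners(groups: GroupTable, dest_addr: str,
                       my_blade_id: int) -> Tuple[int, ...]:
    """Return the other blades of the group addressed by dest_addr.

    An address that is not a known multicast group is treated as a unicast
    address, in which case the blade has no partner for the packet.
    """
    members = groups.get(dest_addr)
    if members is None:
        return ()
    return tuple(b for b in members if b != my_blade_id)
```

With a group containing Blade 4 and Blade 7, Blade 4 would obtain Blade 7 as its partner and vice versa.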
Operations running on the blades may be summarized as follows with reference to
Each blade maintains a list of data flow IDs being processed by that blade and a list of buckets being served by that blade, referred to as a ‘my-flows-table’ as shown in
At block 1800, processor 701 defines/revises data flow and bucket lists of
If the blade receives a packet as a unicast (e.g., only one destination blade address for the blade is included in the packet header) at block 1803, the data packet is intended for that blade for processing, and the blade processes the packet. If the data packet is an initial data packet for a new data flow (i.e., the flow address is not already included in the list of data flows of the my-flows table) at block 1805, processor 701: performs the hash function on the flow ID at block 1807 to determine the bucket ID for the data flow; adds the flow ID to the list of data flows at block 1811, and processes the data packet at block 1817. If the data packet is a subsequent data packet for an existing flow for the blade (i.e., the flow address is already included in the list of data flows of the my-flows-and-buckets-table) at block 1805, processor 701 may process the data packet at block 1817 without updating list of
If a data packet is received as a multicast at block 1803, processor 701 determines if the flow ID is present in the my-flows-and-buckets table of
If a data packet is received as a multicast at block 1803 and its data flow ID is not found in the list of data flows for the blade of
If the data packet is received as a multicast data packet at block 1803 and its data flow ID is not included in either of the lists of data flows of
While
In embodiments of
While not shown in
At block 1800, processor 701 defines/revises data flow and bucket lists of
If the data flow ID is not found in the list of data flows for the blade of
If the data flow ID is not included in either of the lists of data flows of
In embodiments of
While not shown in
While
Control Plane Operations for Type-2 Flows
Control plane operations for type-2 flows may be similar to and/or the same as those discussed above with respect to
Extended Algorithm for Multiple Cascaded Transients
As discussed above, load balancer and blade operations have been discussed with respect to examples with one transient per bucket (i.e., the blade corresponding to the bucket changes from A to B, and the bucket is not reassigned again until it reaches steady state), which may be handled with a multicast group of size at most 2.
It is possible, however, that a bucket may be reassigned multiple times in a short period of time (before it reaches steady state after the first of the multiple reassignments). For example, a certain bucket ‘x’ may be assigned to Blade A and then reassigned to Blade B. While the bucket ‘x’ is still in the transient state, it may be reassigned to Blade C. In such a scenario, an extended mechanism may be used as discussed in greater detail below.
To address this issue, a multicast group having a size greater than two may be used. In the example noted above (where bucket ‘x’ was initially assigned to Blade A, then reassigned to Blade B, and then reassigned to Blade C while the bucket ‘x’ is still in the transient state with Blade A as the old blade), processor 801 can store both Blade A and Blade B in the list of Old Blade IDs corresponding to the Bucket x. Blades A, B, and C may together form a multicast group having a size of three. Any data packet that belongs/maps to bucket ‘x’ can thus be multicast to a multicast group including Blade A, Blade B, and Blade C. A number of blades in the multicast group is thus 3. Respective processes running on each of the three blades (i.e., A, B, and C) in the multicast group will thus govern whether the data packet is to be dropped or processed by respective blades of the multicast group.
Briefly, multicast based load balancing mechanisms set forth above may be generic and may be extended to cases of multiple cascaded transients. If a number of cascaded transients for a particular bucket is T, then the multicast group for that bucket may include T+1 blades. Issues regarding these extended operations and workarounds relating thereto are discussed below.
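By way of a non-limiting illustration, a B2B mapping entry that tracks a list of Old Blade IDs across cascaded transients might be sketched as follows. The class and method names are hypothetical, and the handling of a reassignment back to a blade already in the old list is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BucketEntry:
    """Hypothetical B2B entry holding one current blade and a list of old blades."""
    current_blade: str
    old_blades: List[str] = field(default_factory=list)

    def reassign(self, new_blade: str) -> None:
        """Record one more transient: the current blade joins the old list."""
        if new_blade == self.current_blade:
            return
        if new_blade in self.old_blades:
            # Assumption: avoid listing the same blade twice in the group.
            self.old_blades.remove(new_blade)
        self.old_blades.append(self.current_blade)
        self.current_blade = new_blade

    def multicast_group(self) -> Tuple[str, ...]:
        """After T cascaded transients the group holds T+1 blades."""
        return (self.current_blade, *self.old_blades)

    def settle(self) -> None:
        """Return to steady state by clearing all Old Blade IDs."""
        self.old_blades.clear()
```

For bucket ‘x’ assigned to Blade A and then reassigned to Blade B and Blade C (T = 2), the group returned by `multicast_group` contains three blades.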
Virtual Machine Based Cloud Infrastructure
In a virtual machine (VM) application, a server defined as discussed above with respect to
The load balancing logic (e.g., as discussed with respect to
VLAN Based Broadcast Implementation Alternative
Embodiments forwarding packets to multiple servers are not limited to multicast group mechanisms discussed above. For example, VLAN (Virtual Local Area Network) based broadcasts may be used. If a certain bucket ‘x’ was initially assigned to Blade A and then reassigned to Blade B, then load balancer processor 801 forwards data packets (or only non-INIT data packets, depending on the method) that correspond to bucket ‘x’ to both blades A and B. This can be implemented by considering both the blades to be in one VLAN. The data packet is then forwarded to that VLAN, and the VLAN provides that the packet is broadcast to the individual blades.
Advantages of Multicast Based Distributed Approaches
Multicast based distributed approaches discussed above may support load balancing with reduced hit or hitless addition and/or removal of blades. Accordingly, data flow disruption may be reduced when a new blade is added or when a blade is removed.
Even though mapping between data Flow IDs and Bucket IDs is static, the mapping between Bucket IDs and Blades can be changed dynamically to provide better uniformity (e.g., load balancing). Dynamic mapping between bucket IDs and blade IDs may be applicable when a blade is added or removed. In addition, dynamic mapping between bucket IDs and blade IDs may provide a mechanism allowing the load balancer to maintain a uniformity index (e.g., a Jain's fairness index) over time as loads of different blades change.
If the uniformity index drops below a certain threshold, processor 801 may call for a reassignment of one or more buckets to different blades. The buckets selected for reassignment may depend on parameters such as number of flows corresponding to the bucket(s), bit-rate corresponding to the bucket(s), current load on the blades, blades on downtime, and control parameters (e.g., when the last packet from the bucket was received). An exact mechanism governing this choice, however, may be beyond the scope of this disclosure. Approaches disclosed herein may enable a load balancer to provide a relatively smooth reassignment of buckets to blades with reduced loss of data flows when such a reassignment takes place.
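By way of a non-limiting illustration, Jain's fairness index for blade loads x_1, ..., x_n is (x_1 + ... + x_n)^2 / (n * (x_1^2 + ... + x_n^2)), which ranges from 1/n (all load on one blade) to 1 (perfectly uniform load). A threshold check might be sketched as follows, where the threshold value of 0.9, the treatment of an all-idle system, and the function names are assumptions of this sketch.

```python
from typing import Sequence


def jain_fairness_index(loads: Sequence[float]) -> float:
    """Compute Jain's fairness index over per-blade loads."""
    n = len(loads)
    if n == 0:
        raise ValueError("at least one blade load is required")
    sum_of_squares = sum(x * x for x in loads)
    if sum_of_squares == 0:
        # Assumption: an entirely idle set of blades is treated as uniform.
        return 1.0
    total = sum(loads)
    return (total * total) / (n * sum_of_squares)


def needs_reassignment(loads: Sequence[float], threshold: float = 0.9) -> bool:
    """Return True when the uniformity index drops below the threshold."""
    return jain_fairness_index(loads) < threshold
```

Which buckets to move once the check returns True is a separate decision that this sketch does not address.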
Multicast Based Distributed Approaches disclosed herein may provide flow awareness. Each blade, for example, may decide whether to process or drop a received data packet depending on whether it belongs to a list of data flows that are being processed by the respective blade.
Multicast Based Distributed Approaches disclosed herein may support dynamicity of resources (e.g., blades), by enabling resources (e.g., blades) to be added and/or removed with reduced disruption to existing data flows.
In multicast based distributed approaches disclosed herein, a relatively low complexity load balancer may unicast or multicast the data packets based on whether the bucket is in the steady state or in the transient state. The load balancer architecture may additionally include a hash module/function, a fixed length B2B (Bucket-to-Blades) mapping table, and an O(1) table lookup.
Multicast based distributed approaches disclosed herein may support different types of data traffic and/or flows. While approaches have been discussed with respect to type-1 and type-2 data flows, other data types may be supported.
A B2B mapping table may have a fixed size and is not expected to change frequently. Therefore, a backup load balancer may not need to perform computationally intensive tasks (e.g., copying and syncing large tables or flow-level information) in real time, thereby providing a relatively low complexity failover mechanism for the load balancer. It may suffice for the backup load balancer to know the hash module/function and to sync up with the active load balancer B2B mapping table when the B2B mapping table changes.
Issues for Multicast Based Distributed Approaches
The relatively low complexity load balancer may generate a greater load on the blades. Rather than simply processing the received packets, a blade may be required to first perform operations to determine whether or not to process the received packet. Increased processing and/or memory requirements may thus be imposed on blades, potentially resulting in latency issues in some scenarios. Because multicast/broadcast transmissions only occur for buckets in the transient state, this additional burden may be reduced/limited.
Broadcast/multicast transmission of data packets to multiple servers/blades may reduce backplane bandwidth available for other traffic because repetitive information (data packets) is forwarded over multiple links (i.e., between the load balancer and multiple blades) when different incoming packets could be forwarded over the links. Because the multicast/broadcast transmissions only occur for buckets in the transient state, this additional burden may be reduced/limited.
Handling of multiple cascaded transients (e.g., T cascaded transients) for a same bucket was discussed above using a multicast group of size T+1. Bandwidth loss, however, may be proportional to a number of servers in the multicast group to which the packet is multicast/broadcast. For example, the packets may be broadcast to all the servers, potentially consuming bandwidth that could otherwise be used to transmit other information. Moreover, operations running on blades for type-2 flows may need access to the list of partner flows. In a multiple transient situation, if there are K blades in the multicast group, each blade may have K−1 partners. Therefore, at any point of time, each blade in transient may need to synchronize flow tables with the other K−1 blades in the group.
The blades, however, may not need to synchronize all their flows. The blades may only need to synchronize those flows corresponding to the bucket in the transient state which resulted in formation of the multicast group. As an example, even if there are one million existing flows, if 64K buckets are maintained, assuming reasonable uniformity, only tables of size about 100 rows (flows) per blade may need to be synchronized (as opposed to the one million flows).
To reduce/avoid issues of synchronizing large numbers of servers and/or to reduce/avoid waste of bandwidth, a number of simultaneous transitions allowed may be reduced/limited. In other words, a maximum number of blades in one multicast group may be reduced/limited to a certain upper limit. In some embodiments discussed above, that limit was set to two blades.
Transient Table Based Approach (Approach II)
Transient Table Based Approaches may enable reduced hit and/or hitless addition, removal, and/or reallocation of blades while increasing/maintaining flow stickiness. In this approach, a “Transient Table” is maintained that includes ‘new’ connections for every bucket in the transient state.
Transient table based approaches use unicast transmissions of each packet to the blade that will process that packet. Accordingly, multicast transmissions of packets to multiple blades may be reduced/avoided. Therefore, load balancing operations presented according to this approach may run solely at/on the load balancer, and additional load balancing operations running on blades may not be required. Accordingly, each blade may process the packets received at that blade without determining whether a given packet is intended for that blade.
Transient table based approaches, however, may be better suited for type-1 flows, and transient table based approaches may not be applicable to type-2 flows. As discussed above, type-1 flows are those flows for which it is possible to detect the start of a data flow or a first data packet of a data flow by only considering bits in the first data packet of the data flow, without considering any other information.
Transient table based approaches according to some embodiments disclosed herein may be broken into data plane operations and control plane operations. Data plane operations may be used primarily to handle how incoming data packets are forwarded to and received by the blades assuming an instance of a B2B Mapping Table at any given point in time.
Control plane operations may be used primarily to handle maintenance and/or modification of the B2B Mapping Table residing on/at the load balancer.
Data Plane Operations
As discussed above, operations of transient table based approaches may run on/at the load balancer without requiring additional operational overhead at the blades. Moreover, load balancer LB maintains an additional table(s) in memory 807 called a ‘Transient Table’ (also referred to as a ‘Current Flows Table For Transient Bucket’) for each bucket in the transient state.
Transient Table For Transient Bucket
A Transient Table for bucket ‘x’ includes a list of all flows corresponding to the bucket ‘x’ that are initiated while the bucket ‘x’ is in the transient state. Once a bucket that was in the transient state returns to the steady state, the Transient Table for the bucket may be cleared (e.g., erased, deleted, disregarded, etc.).
Bucket ‘x’ enters the transient state whenever the blade ID corresponding to bucket ‘x’ changes, for example, from blade A to blade B. During such a change, Blade B is the new current blade, and Blade A is recorded as the Old Blade ID in the B2B mapping table as discussed above. At the initiation of the transition from blade A to blade B, all existing data flows through bucket ‘x’ are being served by Blade A, and any data packets received for these existing data flows should be forwarded to the old blade (i.e., Blade A). Any data flows that are initiated after this transition to the transient state and during the transient state are recorded in the “Transient Table” for the bucket ‘x’. Data packets of the data flows recorded in the “Transient Table” for bucket ‘x’ are to be forwarded to the new/current blade (i.e., Blade B).
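By way of a non-limiting illustration, the forwarding rule described above for a single bucket might be sketched as follows. The `Bucket` class, the `is_init` flag (standing in for detection of an INIT data packet of a type-1 flow), and the function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Bucket:
    """Hypothetical B2B row together with its Transient Table."""
    current_blade: str
    old_blade: Optional[str] = None                  # set only in transient state
    transient_flows: Set[str] = field(default_factory=set)

    @property
    def is_transient(self) -> bool:
        return self.old_blade is not None


def forward(bucket: Bucket, flow_id: str, is_init: bool) -> str:
    """Return the single blade to which the packet is unicast."""
    if not bucket.is_transient:
        return bucket.current_blade              # steady state
    if is_init:
        bucket.transient_flows.add(flow_id)      # new flow started while transient
        return bucket.current_blade
    if flow_id in bucket.transient_flows:
        return bucket.current_blade              # flow recorded in Transient Table
    return bucket.old_blade                      # pre-existing flow stays on old blade


def return_to_steady_state(bucket: Bucket) -> None:
    """Clear the Old Blade ID and the Transient Table for the bucket."""
    bucket.old_blade = None
    bucket.transient_flows.clear()
```

Every packet has exactly one destination in this sketch, so no screening is needed at the blades.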
Operations of load balancer processor 801 are illustrated in the flow chart of
If the bucket identified by the bucket ID of block 2003 is in the transient state at block 2005 (e.g., Bucket 1 from the B2B mapping table of
For each non-initial data packet for an existing data flow that is received for a bucket that is in transient state (e.g., Bucket 1) at blocks 2005 and 2007, processor 801 determines if the flow ID of the data packet matches one of the flow IDs in the transient table of
Examples of operations of
As discussed above with respect to embodiments of
Operations of load balancer processor 801 are illustrated in the flow chart of
If bucket is in steady state at block 2005 and the bucket has not been designated for movement to the transient state at block 2029, processor 801 unicasts the data packet to the current blade (e.g., blade 3) for the current bucket (e.g., Bucket 2) at block 2035. If the bucket is in steady state at block 2005 and the bucket has been designated for movement to the transient state at block 2029, processor 801 records the flow ID for the data flow in the existing transient table of
If the bucket identified by the bucket ID of block 2003 is in the transient state at block 2005 (e.g., Bucket 1 from the B2B mapping table of
Control Plane
In this section, implementations of control planes using Transient Table Based Approaches are discussed according to some embodiments. As discussed above with respect to Multicast Based Distributed Approaches, a control plane can be implemented in different ways depending on the use case. Unlike Multicast Based Distributed Approaches, however, control planes for transient table based approaches may be implemented on/at only the load balancer, and additional processing on/at the blades may be reduced/avoided.
As discussed above, a bucket enters the transient state from the steady state when the Blade ID corresponding to the bucket is modified and the original Blade ID is recorded in the Old Blade ID column for the transient state bucket. Once a bucket has entered the transient state, control plane operations may decide when the bucket can return to the steady state from the transient state. When the signal is received from the control plane indicating that the bucket is ready to return to the steady state from the transient state, the Old Blade ID field corresponding to the bucket is cleared/erased. The control plane may thus decide when it is reasonable to conclude that the Old Blade ID is no longer needed.
In some embodiments discussed above with respect to
Because transient table based approaches are used with type-1 data flows, load balancer LB can identify starts and ends of data flows by considering the initial (INIT/INITACK) or final (FIN/FINACK) data packets arriving for the data flows. Accordingly, it may be possible to maintain a Load Balancer Control Table in memory 807 as shown, for example, in
Load balancer processor 801 can thus keep track of numbers of data flows (connections) to each of the current and old blades mapped to a transient bucket by incrementing or decrementing the connection counter whenever it detects an outgoing initial (INIT/INITACK) data packet or final (FIN/FINACK) data packet for a data flow to one of the blades. Processor 801 may also keep track of a net bit-rate for each bucket and a time that a last packet was received for each bucket in a similar manner. Information included in the load balancer control table may thus be used by processor 801 to determine when a data flow to an old blade of a transient state bucket is no longer significant so that the transient state bucket can be returned to steady state (thereby terminating any remaining data flows to the old blade). Criteria that may be used to determine when a bucket can be returned to the steady state using information of a load balancer control table are discussed in greater detail below.
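By way of a non-limiting illustration, the bookkeeping for one row of a load balancer control table might be sketched as follows. The field names, the fixed measurement window used for the net bit-rate, and the function names are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class ControlRow:
    """Hypothetical control table row for one transient state bucket."""
    old_blade_connections: int = 0
    current_blade_connections: int = 0
    bits_in_window: int = 0          # bits seen in the current measurement window
    last_packet_time_s: float = 0.0  # arrival time of the most recent packet


def on_packet(row: ControlRow, to_old_blade: bool, is_init: bool,
              is_fin: bool, size_bits: int, now_s: float) -> None:
    """Update counters when a packet for the bucket is forwarded to a blade."""
    delta = 1 if is_init else (-1 if is_fin else 0)
    if to_old_blade:
        row.old_blade_connections = max(0, row.old_blade_connections + delta)
    else:
        row.current_blade_connections = max(0, row.current_blade_connections + delta)
    row.bits_in_window += size_bits
    row.last_packet_time_s = now_s


def net_bit_rate(row: ControlRow, window_s: float) -> float:
    """Net bit-rate for the bucket over an assumed window of window_s seconds."""
    return row.bits_in_window / window_s
```

Clamping the counters at zero is a defensive choice in this sketch for cases in which a final (FIN/FINACK) data packet is seen without a matching initial data packet.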
When a number of data flows being serviced by the Old Blade drops below a threshold, processor 801 may return the transient state bucket to the steady state. As discussed above, the number of flows serviced by an old blade dropping to zero may be an unnecessarily stringent criterion, and a number of data flows for an old blade reaching a near-zero number may be sufficient to return the transient state bucket to the steady state.
When a bucket is returned to steady state with a data flow to the old blade still active, the data flow to the old blade may be terminated because the current blade may be unable to service the data flow that was initiated with another blade. Accordingly, care may be taken to provide that a significant data flow is not lost even if a total number of data flows for an old blade falls below a threshold. Bucket 1 in the Load Balancer Control Table of
Considering only net-bit rates may result in unwanted loss of data flows. Bucket 3 of
For example, processor 801 may consider a time elapsed since a last data packet to the old blade before returning a transient state bucket to the steady state. Processor 801, for example, may require some minimum period of time to pass after a last data packet to an old blade before returning the bucket to the steady state. If the last data packet received from any flow from the transient state bucket to the old blade was a sufficiently long period of time ago (e.g., exceeding a threshold), processor 801 may return the bucket to the steady state. Stated in other words, if the flows from the transient state bucket to the old blade have been inactive for a sufficiently long period of time, any remaining data flows to the old blade may be dropped.
Any of the criteria discussed above (based on information from the table of
Multiple Transients
Transient table based approaches may also be able to handle multiple transients. For example, a bucket may be reassigned multiple times in a short period of time. A certain bucket ‘x’ may be assigned to Blade A and then reassigned to Blade B. While the bucket ‘x’ is still in the transient state, it may again be reassigned to Blade C, a situation referred to as three layers of transitions (A, B, and C). Blade C thus corresponds to the current blade while Blade A would still be designated as the Old Blade ID. In such a situation, the transient stateful table may be expected to have flows for both blades B and C. Once the control plane decides that all the flows associated with the old blade (i.e., Blade A) have been terminated gracefully, the stateless entry in the bucket to blade table can be switched to blade C, and all the related stateful table entries of blade C can be erased. However, there may still be stateful entries in the stateful table for blade B, and such entries may need to be cleaned up as the lifetimes of the corresponding connections end. The above example can be extended to even greater numbers of transients by considering blade B as a set of blades instead of a single blade, where the transients become A, B1, B2, . . . Bn, C where B={B1, B2 . . . , Bn}.
As discussed above, Transient table based approaches may thus provide reduced hit and/or hitless addition and/or removal of blades. Accordingly, disruptions of flows may be reduced/eliminated when a new blade is added and/or when an old blade is removed.
Although mappings between Flow IDs and Bucket IDs may be relatively static, the mapping between Bucket IDs and Blades can be changed dynamically to provide better uniformity and/or load balancing. This dynamic mapping of bucket IDs to blades may be applicable when a blade is added or removed, but dynamic bucket to blade mapping is not restricted to these two use cases. For example, a mechanism may be provided where load balancer processor 801 maintains a uniformity index (e.g., a Jain's fairness index) at all times. If the uniformity index drops below a certain threshold, processor 801 may call for a reassignment of one or more buckets to other blades. The bucket(s) selected for reassignment may depend on various parameters, such as numbers of data flows corresponding to the buckets, bit-rates corresponding to the buckets, current loads on the blades, blades on downtime, and control parameters (e.g., when last packets from buckets were received). Approaches described herein may enable relatively smooth reassignments of buckets to blades with reduced loss of data flows when reassignments occur.
Transient table based approaches may provide/enhance flow awareness, because the load balancer temporarily maintains a list of flows for the old blade to provide flow stickiness while a bucket is in the transient state. Transient table based approaches may support dynamicity of resources (e.g., blades) by enabling addition/removal of resources (e.g., blades) with reduced disruption of the previously existing flows. In transient table based approaches, all load balancing functionality may reside on/at the load balancer so that additional processing at the blades (other than processing received packets) may be reduced. Transient table based approaches may not be limited to any particular type of traffic or flow provided that initial data packets of the data flows can be identified.
By reducing additional processing at the blades, additional processing may be required at the load balancer. The load balancer, for example, may need to store an additional table (Transient Table) for every bucket in the transient state. A number of rows in the Transient Table may be equal to the number of new data flows initiated for the bucket during the transient state. In a high traffic situation, the number of rows of a transient table may be relatively high. Assuming a total number of buckets is on the order of 64K (e.g., in current Smart Services Router Line Card implementations), only the states of the data flows for the buckets in transient state may need to be maintained. In practice, a total number of flow IDs that may need to be maintained in a transient table is expected to be relatively low, and a bucket is not expected to be in the transient state for long.
As discussed above, transient table based approaches may require that the load balancer identify the start of each data flow (e.g., using initial or INIT data packets). Therefore, transient table based approaches may be difficult to implement for type-2 flows.
In case of multiple transients, the temporary usage of memory 807 for transient tables may increase. As discussed above, for example, once all the data flows for Blade A are finished, the destination on the bucket ‘x’ can be switched to blade C and the transient stateful entries for blade B may still exist on the stateful table until all such flows are finished/terminated.
Alternative Approaches to Modifying B2B Mapping Tables
Before discussing third approaches of the present disclosure, a modified B2B mapping table is discussed. This modified B2B mapping table will be used in HTTP Redirect approaches discussed below. Note that underlying mechanisms may partially resemble embodiments of B2B tables discussed above, and such underlying mechanisms may be repeated below for the sake of clarity.
When a Blade ID corresponding to a certain bucket changes from Blade A to Blade B, the new Blade ID (i.e., Blade B) is recorded in an additional column called a New Blade ID column. The entry in the new Blade ID (i.e. Blade B) column is moved to the (current) Blade ID column when signaled by the control plane. This may typically happen when the original Blade ID (i.e. Blade A) is no longer needed.
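By way of a non-limiting illustration, a row of the modified B2B mapping table with a New Blade ID column might be sketched as follows, treating a non-empty New Blade ID field as the transient state marker. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModifiedB2BRow:
    """Hypothetical row with a (current) Blade ID and a New Blade ID column."""
    blade_id: str
    new_blade_id: Optional[str] = None

    @property
    def is_transient(self) -> bool:
        # Transient state while the New Blade ID field is non-empty.
        return self.new_blade_id is not None

    def schedule_move(self, new_blade: str) -> None:
        """Record the new blade without yet changing the current blade."""
        self.new_blade_id = new_blade

    def promote(self) -> None:
        """On a control plane signal, move the New Blade ID to the Blade ID column."""
        if self.new_blade_id is not None:
            self.blade_id = self.new_blade_id
            self.new_blade_id = None
```

Until `promote` is called, the original blade remains in the (current) Blade ID column of the row.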
Use Case 1—Addition of a Blade
Considering the situation of
In embodiments of
A bucket is defined to be in transient state if the New Blade ID field corresponding to that group is non-empty. Buckets 2 and 3 in
Use Case 2—Removal of a Blade
Considering the situation of
Use Case 3—Reallocation of Buckets to Blades
Considering the situation of
HTTP Redirect based Approach (Approach III)
In HTTP Redirect based approaches, concepts of HTTP redirect are used for every bucket in the transient state. HTTP Redirect based approaches can also be used for gradual load correction rather than taking a whole blade into congestion collapse.
In HTTP redirect based approaches, the Load balancing site (including load balancer LB and blades/servers S1-Sn of
HTTP redirect based approaches may be organized into two parts: data plane operations and control plane operations. Data plane operations may primarily handle how incoming data packets are forwarded to and received by the blades/servers assuming an instance of the B2B Mapping Table at any given point in time. Control plane operations may handle maintenance and/or modification of the B2B Mapping Table residing on/at the load balancer. As discussed herein, a same primary IP address covers all of the blades/servers and a same stand-by IP address covers all of the blades/servers. A destination IP based router may be sufficient with HTTP redirect from the blades/servers, with a control plane orchestrating the overall load distribution.
Data Plane Operations
As mentioned earlier, load balancing operations may run both at the load balancer and at the blades/servers. According to HTTP redirect based approaches, each blade/server also has access to the B2B mapping table that may reside on/at the load balancer.
When all blades/servers are operating in steady state, outside devices (e.g., clients C1-Cm) may transmit data packets to the system using the primary IP address. Upon receipt of data packets addressed to the primary IP address, the load balancer performs the hash function using the data flow ID of the data packet to generate a bucket ID for the data flow to which the data packet belongs, the load balancer uses the B2B mapping table to map the bucket ID to a respective current blade ID, and the load balancer forwards the data packet to the blade/server indicated by the current blade ID corresponding to the bucket ID. The blade can then process data packets received in this manner as discussed in greater detail below.
At the Blade
Operations performed at a blade/server may be used to decide whether to process or drop a data packet received from the load balancer as discussed in greater detail below with respect to
When a data packet is received through LB interface 703 of blade/server S at block 2601, processor 701 determines at block 2603 if the data packet belongs to an existing data flow being processed by the server. This determination may be made with reference to the my-flows-table by determining if a flow ID of the data packet matches any flow IDs included in the my-flows-table. If the data packet belongs to an existing data flow being processed by the server at block 2603 (as indicated by the my-flows-table of
If the data packet does not belong to an existing data flow being processed by the server at block 2603, the data packet is for a new data flow, and processor 701 should decide whether to accept or reject the data flow. For a data packet for a new data flow (not an existing data flow), processor 701 performs the hash function using the data flow ID of the data packet to determine a bucket ID to which the data flow is mapped at block 2607. At block 2609, processor 701 determines if the bucket ID is in steady state or in transient state with reference to the B2B mapping table of
If the bucket ID is in steady state (e.g., bucket ID 1, 2, or 3 of
If the bucket ID is in transient state (e.g., bucket ID 4 of
If the server is identified as the current blade for the bucket ID (e.g., current blade 1 for bucket 4 from the B2B table of
The HTTP redirect is transmitted to the client device (e.g., client C1, C2, C3, etc.) outside of the load balancing system that originally generated the data packet that triggered the HTTP redirect. On receipt of the HTTP redirect (including the stand-by IP address), the client retransmits the data packet addressed to the stand-by IP address. Operations of the load balancer will now be discussed in greater detail with respect to
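The blade-side decision for an incoming data packet may be sketched in Python as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the CRC32 hash, the bucket count, the stand-by address value, and the names blade_decide and Action are hypothetical. The sketch also assumes that a blade which is neither the current nor the new blade for a bucket drops the packet.

```python
import zlib
from enum import Enum
from typing import Dict, Optional, Set, Tuple

NUM_BUCKETS = 8            # assumed bucket count
STANDBY_IP = "192.0.2.2"   # assumed stand-by address (documentation range)

# bucket ID -> (current blade ID, new blade ID or None when in steady state)
B2B = Dict[int, Tuple[str, Optional[str]]]


class Action(Enum):
    PROCESS = "process"
    REDIRECT = "http_redirect"
    DROP = "drop"


def bucket_of(flow_id: str) -> int:
    """Static Flow ID -> Bucket ID hash (CRC32 is an assumption)."""
    return zlib.crc32(flow_id.encode("utf-8")) % NUM_BUCKETS


def blade_decide(my_id: str, flow_id: str, my_flows: Set[str],
                 b2b: B2B) -> Tuple[Action, Optional[str]]:
    """Return the action and, for a redirect, the address to redirect to."""
    if flow_id in my_flows:                 # existing flow: keep serving it
        return Action.PROCESS, None
    current, new = b2b[bucket_of(flow_id)]  # new flow: look up its bucket
    if new is None:                         # steady state
        if my_id == current:
            my_flows.add(flow_id)
            return Action.PROCESS, None
        return Action.DROP, None            # assumption: not our bucket
    if my_id == new:                        # transient: new blade accepts
        my_flows.add(flow_id)
        return Action.PROCESS, None
    if my_id == current:                    # transient: original blade redirects
        return Action.REDIRECT, STANDBY_IP
    return Action.DROP, None                # assumption: unrelated blade
```

Existing flows stay on the original blade because the my-flows lookup precedes any bucket-state check, so only new flows of a transient bucket are redirected.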
At the Load Balancer
In general, the load balancer considers the packet header of a data packet and determines where to send the packet.
Data packets addressed to the stand-by IP address are transmitted to the new server/blade identified by the new Blade ID corresponding to the bucket that the packet flow belongs to. Operations at a load balancer may be summarized as follows:
A B2B (Buckets-to-Blade) mapping table is stored at the load balancer. This B2B table may be provided as discussed above with respect to
For every incoming data packet, the load balancer computes the hash of the Flow ID (or equivalent) to obtain the Bucket ID;
If the bucket is in the steady-state, the packet is forwarded to the Current Blade ID;
If the packet belongs to a bucket in transient state and is sent to the Stand-by IP address, the load balancer forwards the packet to the New Blade ID corresponding to the Bucket ID; and
If the packet belongs to a bucket in transient state and is sent to the Primary IP address, the load balancer forwards the packets to the (current) Blade ID corresponding to the Bucket ID.
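The load balancer operations summarized above may be sketched in Python as follows. The function name select_blade, the CRC32 hash, the bucket count, and the two address values are assumptions made for illustration only.

```python
import zlib
from typing import Dict, Optional, Tuple

NUM_BUCKETS = 8            # assumed bucket count
PRIMARY_IP = "192.0.2.1"   # assumed primary address (documentation range)
STANDBY_IP = "192.0.2.2"   # assumed stand-by address

# bucket ID -> (current blade ID, new blade ID or None when in steady state)
B2B = Dict[int, Tuple[str, Optional[str]]]


def bucket_of(flow_id: str) -> int:
    """Hash of the Flow ID to a Bucket ID (CRC32 is an assumption)."""
    return zlib.crc32(flow_id.encode("utf-8")) % NUM_BUCKETS


def select_blade(flow_id: str, dst_ip: str, b2b: B2B) -> str:
    """Pick the blade that should receive a packet."""
    current, new = b2b[bucket_of(flow_id)]
    if new is None:            # steady state: always the current blade
        return current
    if dst_ip == STANDBY_IP:   # transient, stand-by address: new blade
        return new
    return current             # transient, primary address: current blade
```

No per-flow state is kept at the load balancer in this sketch. The destination address alone distinguishes redirected new flows from existing flows.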
Load balancer operations are discussed in greater detail below with respect to the flow chart of
If the bucket identified by the bucket ID of block 2703 is in the transient state at block 2705 (e.g., bucket 4 from the B2B mapping table of
In embodiments of
Moreover, this approach may not be limited to a single transition. For example, consider a mapping in which Bucket ‘x’ changes from Blade A to Blade B and then changes again from Blade B to Blade C. In this case with two transitions, two a priori advertized stand-by IP addresses may be used. Server/blade A will respond with the HTTP redirect option with the first stand-by IP address for any data packets of new flows that correspond to Bucket ‘x’. The load balancer will forward any packets from Bucket ‘x’ that are destined to the first stand-by IP address to server/blade B, whose ID will be stored in New Blade ID 1. At the same time, server/blade B will respond with the HTTP redirect option with the second stand-by IP address for any data packets of new flows from Bucket ‘x’. The load balancer will forward any packets from Bucket ‘x’ that are destined to the second stand-by IP address to server/blade C, whose ID will be stored in New Blade ID 2. Server/blade C will process the new flows belonging to Bucket ‘x’. Note that it is also possible that server/blade A sends the HTTP redirect option with the second stand-by IP address directly. These details may depend on corresponding control plane implementations. In summary, this approach can be generalized to any number of transitions lower than the number of available stand-by IP addresses. Moreover, this scenario may be different from each server/blade having its own IP address because of the resulting flexibility and dynamicity. In fact, it may be beneficial when a number of advertized IP addresses is smaller as compared to a number of servers/blades.
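The two-transition example may be sketched in Python as follows, assuming that the k-th stand-by IP address selects the blade stored in New Blade ID k. The address values and the name select_blade_multi are hypothetical.

```python
from typing import List

PRIMARY_IP = "192.0.2.1"                   # assumed primary address
STANDBY_IPS = ["192.0.2.2", "192.0.2.3"]   # assumed first and second stand-by addresses


def select_blade_multi(dst_ip: str, current: str, new_blades: List[str]) -> str:
    """new_blades[k] holds New Blade ID k+1; an empty list means steady state."""
    if dst_ip in STANDBY_IPS:
        k = STANDBY_IPS.index(dst_ip)
        if k < len(new_blades):
            return new_blades[k]
    return current   # primary address, or no matching New Blade ID entry
```

For Bucket ‘x’ with current blade A, New Blade ID 1 set to B, and New Blade ID 2 set to C, packets to the primary address reach A, packets to the first stand-by address reach B, and packets to the second stand-by address reach C.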
Control Plane Algorithm
In this section, control plane implementations are described. There are multiple viable ways in which the control plane may be implemented depending on the use case.
As discussed above with respect to
In an ideal scenario, the original Blade ID is no longer needed when all existing flows belonging to the bucket that were mapped to the original Blade ID have ended. In other words, the original Blade ID for bucket ‘x’ is no longer needed when the number of connections on the original blade that correspond to bucket ‘x’ goes to zero. A requirement that all data flows to the original server/blade have ended, however, may be an unnecessarily strict criterion. A few connections may be active for a very long time, connections may be inactive but kept open for a long time, and/or FIN/FINACKs can go missing. Under such scenarios, a bucket may be in the transient state for a relatively long time, resulting in continued HTTP redirect processing and/or suboptimal routing for periods of time that may be longer than desired. Accordingly, mechanisms that provide reasonable criteria to conclude that an original Blade ID for the bucket is no longer needed may be desired.
Suitable implementations of the my-flows table may assist the control plane in making these decisions. A sample implementation of the my-flows table on every blade is discussed above with respect to
A criterion and/or criteria may thus be defined for a server/blade to reasonably conclude that the flow of data packets from the bucket is no longer significant. One or more of the following heuristics, for example, may be used:
A number of data flows corresponding to the bucket drops below a threshold. As discussed above, a number of flows dropping to zero may be an unnecessarily stringent criterion. A number of data flows reaching a near-zero number, however, might be good enough. Care should be taken, however, when employing this heuristic. For Bucket 2 in the consolidated flows table of
A net bit-rate corresponding to the bucket dropping below a threshold may be a good reason to drop the existing flows corresponding to the bucket and release the bucket to a steady state.
A last packet received from any data flow to the bucket was a sufficiently long period of time ago. If the data flows corresponding to the bucket have been inactive for a sufficiently long period of time, the bucket may be released to the steady state without significant impact on performance.
Any combinations of the above and/or any other heuristics may also be used.
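One possible combination of the heuristics above may be sketched in Python as follows. The threshold values, the field names, and the specific rule (release when the bucket has been idle long enough, or when both the flow count and the net bit-rate are low) are assumptions chosen for illustration. Other combinations may be equally valid.

```python
from dataclasses import dataclass


@dataclass
class BucketStats:
    """Per-bucket statistics on the original blade (names are illustrative)."""
    active_flows: int       # number of data flows mapped to the bucket
    bit_rate_bps: float     # net bit-rate of those flows
    last_packet_ts: float   # time the last packet was received, in seconds


def may_release(stats: BucketStats, now: float,
                max_flows: int = 2,
                min_bit_rate_bps: float = 1000.0,
                idle_secs: float = 300.0) -> bool:
    """Decide whether the bucket may be released to the steady state."""
    few_flows = stats.active_flows <= max_flows
    low_rate = stats.bit_rate_bps < min_bit_rate_bps
    idle = (now - stats.last_packet_ts) >= idle_secs
    # Flow count alone is not trusted: a few flows may still carry
    # significant traffic, so it is paired with the bit-rate check.
    return idle or (few_flows and low_rate)
```

Pairing the flow-count test with the bit-rate test reflects the caution that a near-zero number of flows does not by itself imply insignificant traffic.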
Appropriate control plane operations may be selected based on the use case. Whenever one or more (or combinations) of these criteria are met, the control plane may instruct the load balancer to replace original Blade ID for the bucket with the corresponding New Blade IDs and thereby release the bucket to the steady-state.
Additionally, a consolidated flows table can be created and maintained on/at the load balancer. The load balancer can keep track of a number of connections by incrementing or decrementing a connection counter(s) whenever it detects an outgoing SYN or FIN packet originating from one of the blades. Net bit-rates and/or last packets received of a bucket may be tracked in a similar manner. Operations/logic used to determine when to return a bucket to steady state are discussed above with respect to
HTTP Redirect based approaches may provide reduced hit addition and/or removal of servers/blades.
HTTP redirect based approaches may provide increased uniformity.
Despite the fact that mapping between Flow IDs and Bucket IDs is static, the mapping between Bucket IDs and servers/blades can be changed dynamically to provide increased uniformity. This dynamic mapping may be applicable when a server/blade is added or removed, but dynamic mapping is not restricted to these use cases. For example, a mechanism may be provided in which the load balancer maintains a uniformity index (e.g., a Jain's fairness index) at all times. If the uniformity index drops below a threshold, a reassignment of buckets to servers/blades may be initiated. Which buckets to reassign may depend on various parameters such as the number of flows corresponding to the bucket, the bit-rate corresponding to the bucket, the current load on the blades, blades on downtime, and control parameters (such as when the last packet from the bucket was received). Approaches disclosed herein may enable a relatively smooth reassignment of buckets to servers/blades with reduced loss of data flows when such a reassignment takes place.
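As an illustration of the uniformity index, Jain's fairness index for per-blade loads x_1..x_n is (x_1 + ... + x_n)^2 / (n * (x_1^2 + ... + x_n^2)), which equals 1 when all loads are equal and 1/n when a single blade carries all of the load. The Python sketch below computes the index and compares it with a threshold. The threshold value, the treatment of an all-zero load vector, and the function names are assumptions.

```python
from typing import Sequence


def jain_index(loads: Sequence[float]) -> float:
    """Jain's fairness index of the per-blade loads."""
    n = len(loads)
    sq_sum = sum(x * x for x in loads)
    if n == 0 or sq_sum == 0:
        return 1.0   # assumption: no load at all is treated as uniform
    total = sum(loads)
    return (total * total) / (n * sq_sum)


def needs_rebalance(loads: Sequence[float], threshold: float = 0.8) -> bool:
    """True when the uniformity index has dropped below the threshold."""
    return jain_index(loads) < threshold
```

For example, four blades with loads of 10 each give an index of 1, while loads of 40, 0, 0, and 0 give an index of 0.25 and would trigger a reassignment under the assumed threshold.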
HTTP redirect based approaches may provide increased flow awareness, and/or HTTP redirect based approaches may support dynamicity of resources (servers/blades) by enabling addition/removal of resources (servers/blades) with reduced disruption of existing data flows.
Moreover, HTTP redirect based approaches may be implemented with relatively low complexity. With HTTP redirect based approaches, each data packet is unicast to only one server/blade. Accordingly, waste of backplane bandwidth may be reduced by reducing/eliminating multicasts of data packets to multiple servers/blades. Moreover, HTTP redirect based approaches may be implemented at the servers/blades without additional processing/operations other than responding to new data flows with an HTTP redirect option. Moreover, additional memory requirements at the load balancer and/or servers/blades may be reduced because additional flow tables may be reduced.
HTTP Redirect based approaches may work only for HTTP data traffic. Therefore, HTTP redirect based approaches may be of significant use in applications involving only HTTP traffic, but may not work for other types of application layer traffic. Because HTTP runs over TCP, HTTP redirect based approaches may only work for Type 1 data flows.
HTTP redirect is an application layer redirection method. Because the load balancer does not maintain a list of ongoing flows, every new flow from the bucket in transient state is first forwarded to the original blade which in turn forwards it to the appropriate server/blade via HTTP redirect. If the number of transitions (i.e., changes in the B2B table) and/or the number of flows are too large, significant overhead may occur at the load balancer site.
HTTP redirect based approaches may require use of multiple IP addresses for the load balancer site. HTTP redirect based approaches discussed above may provide reduced hit support for one bucket-to-blade transition at a time using one additional IP address. If the blade corresponding to a bucket changes multiple times, additional IP addresses may be needed to support the additional transitions. Each of these IP addresses may need to be maintained at all times and advertized to the external network, thereby increasing cost. A maximum allowed number of transitions per bucket may be limited to the number of stand-by IP addresses chosen.
HTTP redirect based approaches may result in multiple cascaded transients, and increased complexity of the control plane and/or an increased number of IP addresses may be needed to provide reduced hit support for data flows in the system.
Summary of Approaches
As discussed above, three different approaches may enable reduced hit addition and/or removal of servers/blades as well as reassignment of buckets to servers/blades.
Transient Multicast based Distributed Approaches may be based on the multicast of data packets that belong to a bucket in transient state. These approaches may provide a relatively low complexity load balancer without requiring significant/huge tables, but efficiency of backplane bandwidth usage may be reduced and/or additional processing on the servers/blades may be required.
Transient Table based Approaches may be based on temporarily storing data flow IDs corresponding to a bucket in transient state. These approaches may provide increased efficiency of backplane bandwidth use without requiring additional processing on the servers/blades. Additional storage and/or computation on/at the load balancer, however, may result.
HTTP Redirect based Approaches may be based on HTTP redirect of unwanted new flows. HTTP redirect based approaches may provide increased efficiency of backplane bandwidth use without requiring additional table storage. HTTP redirect based approaches, however, may work only for HTTP traffic.
Sample Implementations and Embodiments
As discussed below, embodiments of load balancing frameworks disclosed herein may be adapted to different scenarios/applications/platforms.
Implementation on a Multi-Application/Service Router (MASR) Platform (e.g., Ericsson's Smart Services Router SSR)
A Multi-Application/Service Router (MASR) Platform is a Next Generation Router aimed at providing a relatively flexible, customizable, and high throughput platform for operators. An MASR platform may provide support for a number of applications, such as, B-RAS, L2/3 PE, Multi Service Proxy, Service Aware Support Node, GGSN+MPG (SGW, PGW of LTE/4G), etc.
An MASR architecture illustrated in
As shown in
SC Traffic Steering may follow a hash based implementation of a load balancer as discussed above, for example, with respect to
Embodiments for CDN (Content Delivery Network) on MASR
CDN is a significant feature of MASR that may provide subscribers access to HTTP content without having to query an original host server for every subscriber. The MASR may have multiple SC cards dedicated to CDN and there may be a need for a load balancing mechanism within those CDN SCs. Approaches of embodiments disclosed herein may be applied to carry out load balancing between different CDN SCs. More specifically, the fact that CDN traffic is HTTP only may be used. Therefore, highly advantageous HTTP Redirect based approaches may be applied to perform load balancing for CDN traffic.
Embodiments for Multi-Application on MASR and Service Chaining
Multiple Applications can be collocated at the same MASR chassis, causing certain qualifying traffic to travel to multiple applications/services within the same MASR. Thus an inter-SC load balancing may become necessary. Methods according to embodiments discussed above may cover all service chaining use cases, such that load balancing functionality may reside not only on/at the load balancer (e.g., an LC of an MASR) but also on the servers/blades as discussed above with respect to
Load balancing algorithms presented according to some embodiments of inventive concepts may have significant use cases for multi-application on MASR. Customers for CDN, TIC (Transparent Internet Caching), SASN (Service Aware Support Node), etc., are asking for server load balancing that supports hitless ISSU (In Service Software Upgrade) and hitless in-service addition and removal of servers/SCs. Customers are interested in flexible, uniform load balancing that provides overall protection. At the same time, methods employed should desirably have low complexity, low cost, and/or low TTM (Time to Market). As discussed above, these may be advantages of some embodiments disclosed herein with respect to mechanisms used on MASR. While stateful methods for server load balancing, when implemented on a line card, may be expected to have performance issues with large state tables, embodiments disclosed herein may have reduced complexity, reduced costs, and/or increased performance. In addition, features such as energy efficient server load balancing can be performed by setting some of the servers in a sleep mode more gracefully (e.g., hitless or with reduced hits) when the load is not high.
A traffic flow may need to visit multiple servers at the same load balancing site, for example, if there are multiple services in the same load balancing site, each with multiple servers. In this case, when the traffic goes out from one of the service/server cards, the traffic may need to be load balanced again over the next service cards. In this respect, individual servers may also have a load balancer inside to perform similar load balancing as that done at the load balancer of the load balancing site.
Embodiments on Policy Based Forwarding (PBF/ACL) and Software Defined Networking Rule (SDN) Based Mechanisms.
SDN refers to separation of the control plane and the data plane where the data plane includes only a set of forwarding rules instructed by the central control plane. ACL/PBF also similarly has a control plane which sets up the set of simple forwarding rules.
In ACL/PBF, stateless hash based server load balancing may be provided using policy based forwarding (PBF) and/or Access Control List (ACL) and/or Software Defined Networking (SDN). Basically, legacy ACL/PBF/SDN data plane rules match a certain set of bits of the flow IDs (e.g., Source IP, Destination IP, Source/Destination Port, etc.) and map them statically to the servers/blades.
Transient Multicast/Broadcast Based Distributed Methods Via ACL/PBF/SDN
Assuming a set of SDN/ACL/PBF rules are set via an intelligent control plane realizing a stateless hash based server load balancing, each slice/bucket in the B2B table may be realized using an ACL/PBF/SDN rule. The action associated with each rule can be switched (e.g., by the intelligent control plane) between unicast and multicast forwarding action (e.g., sending the matching packet to a single or to multiple destinations/servers/blades) depending on whether the rule/bucket is in steady or transient state respectively. As used herein, SDN means Software Defined Networking, ACL means Access Control List, and PBF means Policy Based Forwarding.
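The switching of a rule between unicast and multicast forwarding actions may be sketched in Python as follows. The Rule structure and the functions enter_transient and leave_transient are hypothetical and do not correspond to any particular ACL/PBF/SDN product interface. The match on a bucket ID stands in for whatever set of flow ID bits a real rule would match.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    """One forwarding rule realizing one bucket of the B2B table."""
    match_bucket: int       # stands in for the matched flow ID bits
    out_blades: List[str]   # one destination: unicast; several: multicast

    @property
    def action(self) -> str:
        return "multicast" if len(self.out_blades) > 1 else "unicast"


def enter_transient(rule: Rule, new_blade: str) -> None:
    """Control plane: add the new blade so matching packets are multicast."""
    if new_blade not in rule.out_blades:
        rule.out_blades.append(new_blade)


def leave_transient(rule: Rule, new_blade: str) -> None:
    """Control plane: return to unicast toward the new blade only."""
    rule.out_blades = [new_blade]


rule = Rule(match_bucket=4, out_blades=["blade1"])
enter_transient(rule, "blade2")
action_during_transient = rule.action
leave_transient(rule, "blade2")
```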
As an example of a load balancer and/or line card implementation, in
Transient Table Based Approach Via ACL/PBF/SDN
HTTP Redirect Method Via ACL/PBF/SDN
As can be seen from the previous two embodiments, HTTP redirect operations discussed above may be realized in an SDN/ACL/PBF environment.
Adapting to Elephant and Mice Flow Model
A significant volume of traffic in many networks (including the Internet) can be attributed to a relatively small number of data flows, known as Elephant flows. Other flows, which are relatively large in number, may each consume relatively little bandwidth and are known as Mice flows. For example, at least one study has shown that in a traffic trace, about 0.02% of all flows contributed more than 59.3% of the total traffic volume. See, Tatsuya et al., “Identifying elephant flows through periodically sampled packets,” Proceedings of the 4th ACM SIGCOMM conference on Internet Measurement (IMC 2004), NY, NY, USA, 115-120. Some embodiments disclosed herein may adapt a load balancer framework to an Elephant and Mice flow model case. A hybrid model, for example, may combine multicast based distributed approaches and transient table based approaches.
In such a hybrid model, transient table based operations may be performed for elephant flows while multicast based distributed approach operations may be performed for the mice flows. Elephant flows are relatively low in number but high in bandwidth. Accordingly, it may be easier to maintain elephant data flows in the transient table while relatively expensive to multicast them to multiple servers. Mice flows are relatively high in number. Accordingly, it may be relatively expensive to maintain a list of mice flows in the transient table but reasonable to multicast mice flows to multiple servers since they do not consume significant bandwidth. Note that this hybrid model is discussed here with respect to type 1 flows, and that concepts of a transient table may not work for type-2 flows. Details of the hybrid method are discussed below.
Operations of the hybrid method at the load balancer will now be discussed.
In the Transient Table based Approach, the load balancer maintains a table of new data flows for each bucket in the transient state (i.e., data flows that are created after the bucket enters the transient state). In this hybrid model, however, the load balancer will only maintain a list of elephant flows that are created after the bucket enters the transient state.
When a data packet arrives, the load balancer performs the hash function to determine the bucket for the data packet. If the data packet corresponds to a bucket in steady-state or is an INIT data packet of a new data flow, the packet is forwarded to the corresponding current blade. If the data packet is a non-initial data packet that corresponds to a bucket in the transient state, the load balancer checks whether the flow to which the packet belongs is in the list of elephant flows corresponding to that bucket. If the non-initial data packet is part of a data flow included in the list of elephant data flows, the load balancer forwards the data packet to the current blade. If the non-initial data packet is not part of a data flow included in the list of elephant flows, the load balancer assumes the data packet belongs to a mice flow and multicasts the data packet to both current and old blades.
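The hybrid forwarding decision at the load balancer may be sketched in Python as follows. The names hybrid_forward and bucket_of, the CRC32 hash, and the bucket count are assumptions. The table layout (current blade, old blade or None in steady state) follows the terminology of the preceding paragraph.

```python
import zlib
from typing import Dict, List, Optional, Set, Tuple

NUM_BUCKETS = 8   # assumed bucket count


def bucket_of(flow_id: str) -> int:
    """Hash of the Flow ID to a Bucket ID (CRC32 is an assumption)."""
    return zlib.crc32(flow_id.encode("utf-8")) % NUM_BUCKETS


def hybrid_forward(flow_id: str, is_init: bool,
                   b2b: Dict[int, Tuple[str, Optional[str]]],
                   elephants: Dict[int, Set[str]]) -> List[str]:
    """Return the blades that receive the packet (one entry means unicast)."""
    bucket = bucket_of(flow_id)
    current, old = b2b[bucket]
    if old is None or is_init:                    # steady state or INIT packet
        return [current]
    if flow_id in elephants.get(bucket, set()):   # listed elephant flow
        return [current]
    return [current, old]                         # presumed mice flow: multicast
```

Only elephant flows created after the bucket enters the transient state occupy table space in this sketch. Every other non-initial packet of a transient bucket is multicast.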
Operations of the hybrid method at the server(s)/blade(s) will now be discussed. As a packet arrives at a server/blade, the server/blade processes the data packet if the packet is unicast to the blade, and if the data packet is an INIT data packet of a new data flow, the server/blade records it in the “my-flows table”. If the data packet is received as a multicast, the server/blade checks if the packet belongs to its “my-flows table”. If yes, then the server/blade processes the packet, and if not, the server/blade drops the packet.
In addition, the servers/blades may also try to estimate whether a flow is an Elephant flow or a Mice flow. Identification of Elephant flows may already be an active research area and there may exist mechanisms by which such estimations may be performed. See, Tatsuya et al., “Identifying elephant flows through periodically sampled packets,” Proceedings of the 4th ACM SIGCOMM conference on Internet Measurement (IMC 2004), NY, NY, USA, 115-120; and Yi Lu, et al., “ElephantTrap: A low cost device for identifying large flows,” High-Performance Interconnects, Symposium on, pp. 99-108, 15th Annual IEEE Symposium on High-Performance Interconnects (HOTI 2007), 2007. Once a server/blade classifies a data flow as an Elephant flow, the server/blade instructs the Load Balancer to add the Flow ID and its Blade ID to the Elephant Flow table corresponding to the transient state group. Detection of elephant flows may thus occur at the blades/servers.
The assumption here is that when a packet arrives at the load balancer, if the data packet does not belong to any flow in the elephant flow table of the load balancer, then the data packet is automatically considered to belong to a mice flow, and the load balancer multicasts the data packet. The servers/blades then decide whether to process or drop the data packet based on operations described above, and elephant/mice detection is also performed at the server/blade that accepted the flow. If the flow is detected as a mice flow, there is no need for further action. However, if the flow is detected as an elephant flow, then the blade/server in question updates the elephant flow table, and from that time on, the load balancer switches to unicasting the packets belonging to that flow.
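The blade-side behavior in the hybrid method may be sketched in Python as follows. The elephant classification by a cumulative byte threshold is a simplifying assumption made here and is not the estimation technique of the cited references. The class name HybridBlade and the report_elephant callback, which stands in for the instruction sent to the load balancer, are hypothetical.

```python
from typing import Callable, Dict, Set


class HybridBlade:
    def __init__(self, blade_id: str,
                 report_elephant: Callable[[str, str], None],
                 elephant_bytes: int = 1_000_000) -> None:
        self.blade_id = blade_id
        self.report_elephant = report_elephant   # stands in for the LB update
        self.elephant_bytes = elephant_bytes     # assumed byte threshold
        self.my_flows: Set[str] = set()          # the "my-flows table"
        self.bytes_seen: Dict[str, int] = {}
        self.reported: Set[str] = set()

    def on_packet(self, flow_id: str, size: int,
                  is_init: bool, multicast: bool) -> bool:
        """Return True if the packet is processed, False if it is dropped."""
        if multicast and flow_id not in self.my_flows:
            return False                         # multicast copy for another blade
        if is_init:
            self.my_flows.add(flow_id)           # record the new flow
        total = self.bytes_seen.get(flow_id, 0) + size
        self.bytes_seen[flow_id] = total
        if total >= self.elephant_bytes and flow_id not in self.reported:
            self.reported.add(flow_id)           # report each elephant flow once
            self.report_elephant(flow_id, self.blade_id)
        return True


elephant_table: Dict[str, str] = {}   # Flow ID -> Blade ID, kept at the load balancer


def add_elephant(flow_id: str, blade_id: str) -> None:
    elephant_table[flow_id] = blade_id


blade = HybridBlade("B", add_elephant, elephant_bytes=3000)
blade.on_packet("f1", 1500, is_init=True, multicast=False)
blade.on_packet("f1", 1500, is_init=False, multicast=True)
dropped = not blade.on_packet("f2", 1500, is_init=False, multicast=True)
```

Once a flow appears in elephant_table, the load balancer in this sketch would stop multicasting its packets and unicast them instead.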
There may be advantages of this hybrid method. Since elephant flows are not multicast to multiple servers, more efficient bandwidth utilization may be provided. Only data packets of mice flows (which correspond to a relatively small fraction of the load) are sent using multicast transmissions. Therefore, this hybrid strategy may save bandwidth. Similarly, all flows are not maintained in the transient table. Only elephant flows (which are relatively small in number) are maintained in the transient table. In other words, only a relatively small amount of information is saved in the transient table on/at the load balancer. In essence, this hybrid method may combine positive elements from multicast based distributed approaches and from transient table based approaches.
In the above-description of various embodiments of the present inventive concepts, it is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of inventive concepts. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present inventive concepts belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense expressly so defined herein.
When an element is referred to as being “connected”, “coupled”, “responsive”, or variants thereof to another element, it can be directly connected, coupled, or responsive to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected”, “directly coupled”, “directly responsive”, or variants thereof to another element, there are no intervening elements present. Like numbers refer to like elements throughout. Furthermore, “coupled”, “connected”, “responsive”, or variants thereof as used herein may include wirelessly coupled, connected, or responsive. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Well-known functions or constructions may not be described in detail for brevity and/or clarity. The term “and/or” includes any and all combinations of one or more of the associated listed items.
As used herein, the terms “comprise”, “comprising”, “comprises”, “include”, “including”, “includes”, “have”, “has”, “having”, or variants thereof are open-ended, and include one or more stated features, integers, elements, steps, components or functions but do not preclude the presence or addition of one or more other features, integers, elements, steps, components, functions or groups thereof. Furthermore, as used herein, the common abbreviation “e.g.”, which derives from the Latin phrase “exempli gratia,” may be used to introduce or specify a general example or examples of a previously mentioned item, and is not intended to be limiting of such item. The common abbreviation “i.e.”, which derives from the Latin phrase “id est,” may be used to specify a particular item from a more general recitation.
It will be understood that although the terms first, second, third, etc. may be used herein to describe various elements/operations, these elements/operations should not be limited by these terms. These terms are only used to distinguish one element/operation from another element/operation. Thus a first element/operation in some embodiments could be termed a second element/operation in other embodiments without departing from the teachings of present inventive concepts. The same reference numerals or the same reference designators denote the same or similar elements throughout the specification.
Example embodiments are described herein with reference to block diagrams and/or flowchart illustrations of computer-implemented methods, apparatus (systems and/or devices) and/or computer program products. It is understood that a block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions that are performed by one or more computer circuits. These computer program instructions may be provided to a processor circuit of a general purpose computer circuit, special purpose computer circuit, and/or other programmable data processing circuit to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, transform and control transistors, values stored in memory locations, and other hardware components within such circuitry to implement the functions/acts specified in the block diagrams and/or flowchart block or blocks, and thereby create means (functionality) and/or structure for implementing the functions/acts specified in the block diagrams and/or flowchart block(s).
These computer program instructions may also be stored in a tangible computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions which implement the functions/acts specified in the block diagrams and/or flowchart block or blocks.
A tangible, non-transitory computer-readable medium may include an electronic, magnetic, optical, electromagnetic, or semiconductor data storage system, apparatus, or device. More specific examples of the computer-readable medium would include the following: a portable computer diskette, a random access memory (RAM) circuit, a read-only memory (ROM) circuit, an erasable programmable read-only memory (EPROM or Flash memory) circuit, a portable compact disc read-only memory (CD-ROM), and a portable digital video disc read-only memory (DVD/Blu-ray).
The computer program instructions may also be loaded onto a computer and/or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer and/or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks. Accordingly, embodiments of present inventive concepts may be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.) that runs on a processor such as a digital signal processor, which may collectively be referred to as “circuitry,” “a module” or variants thereof.
It should also be noted that in some alternate implementations, the functions/acts noted in the blocks may occur out of the order noted in the flowcharts. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Moreover, the functionality of a given block of the flowcharts and/or block diagrams may be separated into multiple blocks and/or the functionality of two or more blocks of the flowcharts and/or block diagrams may be at least partially integrated. Other blocks may also be added/inserted between the blocks that are illustrated. Finally, although some of the diagrams include arrows on communication paths to show a primary direction of communication, it is to be understood that communication may occur in the opposite direction to the depicted arrows.
Many different embodiments have been disclosed herein, in connection with the above description and the drawings. It will be understood that it would be unduly repetitious and obfuscating to literally describe and illustrate every combination and subcombination of these embodiments. Accordingly, the present specification, including the drawings, shall be construed to constitute a complete written description of various example combinations and subcombinations of embodiments and of the manner and process of making and using them, and shall support claims to any such combination or subcombination.
Many variations and modifications can be made to the embodiments without substantially departing from the principles of present inventive concepts. All such variations and modifications are intended to be included herein within the scope of present inventive concepts. Accordingly, the above disclosed subject matter is to be considered illustrative, and not restrictive, and the appended claims are intended to cover all such modifications, enhancements, and other embodiments, which fall within the spirit and scope of present inventive concepts. Thus, to the maximum extent allowed by law, the scope of present inventive concepts is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing detailed description. Any reference numbers in the claims are provided only to identify examples of elements and/or operations from embodiments of the figures/specification without limiting the claims to any particular elements, operations, and/or embodiments of any such reference numbers.