As packets arrive at network nodes, a technology referred to as network load balancing (NLB) distributes the load between the nodes. In one solution, each packet is received at each node, but only one of the nodes handles the packet based on attributes of the TCP connection (protocol, source IP address, source port, destination IP address, destination port) corresponding to that packet. In other words, each packet deterministically maps to a bucket based on its attributes, and each node is configured (via a process called convergence) to handle only a subset of the available set of buckets.
Thus, each incoming packet from the network clients is received by each node, and each node independently figures out (hashes) whether that packet maps to a bucket that the node owns as a result of the most recent convergence. If so, that node accepts and processes the packet, otherwise it drops the packet, knowing that the packet will instead be accepted by a different node, specifically the node that was last assigned the bucket. In this manner, once the convergence process is complete and buckets are distributed, a node can independently decide whether to accept each incoming packet without needing to query other nodes.
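By way of a non-limiting illustration, the per-packet decision just described may be sketched as hashing the connection attributes to a bucket number and comparing it against the node's owned bucket set; the hash function, bucket count, and names below are assumptions for illustration only, not the actual algorithm of any particular NLB implementation.

```python
import hashlib

NUM_BUCKETS = 60  # illustrative; the actual bucket count is implementation-specific

def bucket_for(proto, src_ip, src_port, dst_ip, dst_port, num_buckets=NUM_BUCKETS):
    """Deterministically map a connection's attributes to a bucket number."""
    key = f"{proto}|{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % num_buckets

class Node:
    def __init__(self, owned_buckets):
        self.owned_buckets = set(owned_buckets)  # assigned during convergence

    def handles(self, proto, src_ip, src_port, dst_ip, dst_port):
        """True if this node accepts the packet; False means drop (another node owns the bucket)."""
        return bucket_for(proto, src_ip, src_port, dst_ip, dst_port) in self.owned_buckets

# Every node receives the same packet, but exactly one node's bucket set matches.
node_a = Node(owned_buckets=range(0, 30))
node_b = Node(owned_buckets=range(30, 60))
packet = ("TCP", "10.0.0.7", 51234, "192.0.2.10", 80)
assert node_a.handles(*packet) != node_b.handles(*packet)
```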
If the cluster configuration changes (for instance, nodes are added or removed) then the convergence process runs again and buckets are redistributed among nodes. In the event that a bucket associated with a connection has moved following the convergence, current technology ensures that any previously established TCP connection continues to be processed by the same node, even if it no longer owns the bucket.
However, a new TCP connection from an existing client is accepted by whichever node currently owns the associated bucket. This is a problem for some applications and services, which require that the same node handle all connections from the same client, regardless of cluster configuration changes.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which a network load balancing system (node cluster) maintains affinity with a client so as to handle packets from the same client connection, including new connections, regardless of any cluster configuration changes. For example, each node maintains a list of the clients that are to remain (have affinity, or “stickiness”) with that node, including following bucket redistribution resulting from convergence. That affinity list is propagated to one or more other nodes, such as for building an exception list by which the node that owns the bucket otherwise responsible for that client's packets knows that the client is an exception.
In one aspect, upon receiving a packet, a node determines whether to drop or accept the packet based on data (e.g., an IP address) provided in the packet. This may include determining whether the packet maps to a bucket owned by the node, and if so, accepting the packet unless the data indicates an exception. If the packet does not map to the bucket set, it is still accepted if the data indicates that the node has affinity with the client. Affinity may expire (e.g., if the client does not have any connections to the node for a configured period of time), whereby the node having affinity with that client releases it and notifies the node owning the bucket for that client that the client is no longer to be treated as an exception.
In one aspect, a convergence process is performed among a plurality of nodes in a network load balanced cluster of nodes when a node configuration changes. Convergence includes distributing buckets (that is, bucket ownership) among nodes, and communicating information that allows nodes to maintain affinity with clients. In one implementation, the information comprises affinity lists, by which bucket owners build exception lists for their bucket or buckets.
Upon exiting convergence, the nodes enter a packet handling state, including receiving packets at each node of the cluster, and at each node, accepting the packet when that node has affinity, or when the packet maps to a bucket owned by that node and no other node has affinity.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards a network load balancing system (node cluster) in which the same node handles connections from the same client, including new connections, regardless of any cluster configuration changes and the resulting convergence. Each node maintains a list of the clients that are to remain (have affinity, or “stickiness”) with that node as connections from those clients are established. During convergence, as buckets are redistributed, nodes send their affinity (stickiness) lists to the new bucket owners, where they become exception lists comprising the clients from which the bucket owner is not to accept packets. For existing clients, nodes first consult the affinity and exception lists rather than the bucket assignments, so that client-node affinity is preserved. As a result, once a node has started handling TCP connections from a client, that same node continues to handle connections from that client, at least for a configured period of time, even if it loses ownership of the bucket associated with connections from that client.
While many of the examples described herein are directed towards affinity lists and exception lists, it is understood that these are only examples of one mechanism for maintaining client-node affinity. For example, one alternative for maintaining client-node affinity as described herein is based on using relatively small buckets, and only moving empty buckets during convergence.
As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and networking in general.
Turning to
Because of convergence (described below) in which one or more nodes are added and/or removed, buckets may be redistributed across the nodes. For example, in
However, new connections from a previously connected client that are established after convergence previously did not remain with the same node, causing problems with certain applications and services that require that the same node handle all connections from the same client. Described herein is one mechanism that uses the existing bucket distribution as a basis to implement client affinity in a manner that solves this problem.
In one implementation, client affinity is provided by having each node maintain and refer to an affinity list as represented in
In one implementation, each element of the list comprises an IP address. As part of the convergence process, after the nodes determine the redistributed bucket ownership, they send each other the portions of their affinity lists that correspond to buckets they no longer own. Each new owner of a bucket collects the affinity lists from the other nodes for the buckets it owns, and builds up an exception list for those buckets, as represented in
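The exchange described above may be sketched roughly as follows; the data structures and helper names (bucket_of, new_owner_of, and so forth) are hypothetical, shown only to illustrate how affinity-list entries for relinquished buckets end up in the new owner's exception list.

```python
def split_affinity_by_new_owner(affinity_list, bucket_of, new_owner_of, my_node_id):
    """Group this node's sticky clients by the node that now owns their bucket.

    affinity_list : iterable of client IP addresses this node has affinity with
    bucket_of     : callable mapping a client IP address to its bucket number
    new_owner_of  : callable mapping a bucket number to its post-convergence owner
    """
    outgoing = {}  # new_owner_id -> {bucket: [client_ip, ...]}
    for client_ip in affinity_list:
        bucket = bucket_of(client_ip)
        owner = new_owner_of(bucket)
        if owner != my_node_id:  # the bucket moved away; its new owner must be told
            outgoing.setdefault(owner, {}).setdefault(bucket, []).append(client_ip)
    return outgoing

def merge_into_exception_list(exception_list, received):
    """On the new bucket owner, fold a received affinity message into the exception list."""
    for bucket, clients in received.items():
        exception_list.setdefault(bucket, set()).update(clients)
    return exception_list
```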
Once the nodes have completed exchanging the affinity list information and convergence completes, the cluster starts accepting packets (load) using the new distribution. At each node, any packet corresponding to a connection that does not map to a bucket in that node's bucket set is dropped unless the client is in the affinity list; any packet that does map to an owned bucket is accepted unless the client is in the exception list.
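The acceptance rule of the preceding paragraph can be summarized in a small, hypothetical sketch (owned_buckets, affinity_list, exception_list and bucket_of_client are illustrative names, not part of any actual implementation):

```python
def should_accept(node, src_ip, bucket_of_client):
    """Per-packet decision combining bucket ownership, affinity and exceptions.

    node.owned_buckets  : set of bucket numbers assigned at convergence
    node.affinity_list  : set of client IPs that must stay with this node
    node.exception_list : per-bucket sets of client IPs that another node keeps
    bucket_of_client    : callable mapping a client IP address to its bucket number
    """
    bucket = bucket_of_client(src_ip)
    if bucket in node.owned_buckets:
        # The owner accepts unless some other node still has affinity with this client.
        return src_ip not in node.exception_list.get(bucket, set())
    # Not the owner: accept only if this node has affinity with the client.
    return src_ip in node.affinity_list
```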
If a configurable period of time (e.g., fifteen minutes) elapses after the last connection from this client ends, with no new connections from this client established, then the client is removed from the affinity list. In other words, affinity ends if there are no connections from this client for the given interval of time. A message reporting the affinity change is sent to the current bucket owner so that the current bucket owner can remove the client from its exception list. From this point on, if the client initiates a new connection, it is accepted by the bucket owner, and the bucket owner establishes affinity with the client.
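A minimal sketch of this expiration behavior, assuming an idle_since map recording when each client's last connection ended and a notify_bucket_owner helper that reliably delivers the release message, might look as follows:

```python
import time

AFFINITY_TIMEOUT = 15 * 60  # seconds; fifteen minutes is the example value above

def expire_idle_affinities(node, now=None):
    """Release affinity for clients idle longer than the configured interval.

    node.idle_since maps client IP -> time the client's last connection ended;
    node.notify_bucket_owner is assumed to reliably tell the current bucket
    owner to remove the client from its exception list.
    """
    now = time.time() if now is None else now
    for client_ip, idle_start in list(node.idle_since.items()):
        if now - idle_start >= AFFINITY_TIMEOUT:
            node.affinity_list.discard(client_ip)   # affinity ends
            del node.idle_since[client_ip]
            node.notify_bucket_owner(client_ip)     # owner removes the exception
```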
If this node does not own the bucket, then at step 308 the node checks its affinity list. If the node does not have affinity, the node drops the packet at step 314. For UDP (or any session-less protocol), each packet may be treated as a SYN and a FIN at the same time, whereby if the node has affinity, the node updates its timer (time-to-leave) for this client at step 310 (note that this is not needed for TCP). Step 322 represents the node accepting the packet.
Returning to step 306, if this node owns the bucket, at step 312 the node looks in its exception list to determine whether another node has affinity with this client. If so, the packet is dropped as represented by step 314. If the client is not in the exception list, then at step 322 the node accepts the packet. Note that if the node has not seen this client before, then via steps 316 and 320 the node adds it to its affinity list (although for TCP this may be deferred until a later notification, in case the connection establishment fails; this is not deferred for UDP or other session-less protocols). Deferral may be extended to any other protocols that have a notion of session, such as IPSEC, which provides IP datagrams; certain attributes in the IPSEC header are used to detect session start and are treated similarly to the TCP SYN. Session start and end notifications arrive from an IPSEC service.
If this node does not own the bucket, then at step 408 the node checks its affinity list. If the node does not have affinity with this client connection, the node drops the packet at step 416. If the node has affinity, the node accepts the packet at step 418.
Note that as generally represented at step 410, for TCP, the node keeps a count of the active connections for each client; when that count reaches zero, the node starts timing how long the client has had no connections. If the client does not establish a new connection within a (configurable) threshold time, the node notifies the node that owns the bucket to take over the client (that is, to remove the client from its exception list), and removes the client from its own affinity list, as generally represented by steps 501-503 of
By way of example, if the node does not observe new UDP packets from this client for the configured period of time and does not have any TCP connections from it, the node removes the client from the affinity list. Alternatively, each UDP (or other session-less protocol) packet may be treated as session start and end at the same time, resetting the client's affinity timeout.
Returning to step 406, if this node owns the bucket, at step 412 the node looks in its exception list to determine whether another node has affinity with this client. If another node has affinity, the packet is dropped as represented by step 416. Otherwise, at step 418 the node accepts the packet.
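The per-client connection counting and the session-less timeout reset described above might be sketched as follows; conn_count, idle_since, and the event names are illustrative assumptions:

```python
def on_tcp_connection_event(node, client_ip, event, now):
    """Maintain the per-client active TCP connection count (names are hypothetical)."""
    count = node.conn_count.get(client_ip, 0)
    if event == "established":
        node.conn_count[client_ip] = count + 1
        node.idle_since.pop(client_ip, None)   # client is active; stop the idle timer
    elif event == "closed" and count > 0:
        node.conn_count[client_ip] = count - 1
        if node.conn_count[client_ip] == 0:
            node.idle_since[client_ip] = now   # start timing the idle interval

def on_sessionless_packet(node, client_ip, now):
    """Treat a UDP (or other session-less) packet as session start and end, resetting the timeout."""
    if client_ip in node.affinity_list:
        node.idle_since[client_ip] = now
```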
In one alternative for tracking client affinity, rather than maintaining the list of IP addresses, compressed information may be maintained. One way to compress the data is to use only the network part of the IP address; this may be optionally turned on when an administrator configures a cluster with network affinity turned on.
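For example, a hypothetical sketch of such compression could key the affinity and exception lists by the network part of the address (the /24 prefix length below is an assumption, not a prescribed value):

```python
import ipaddress

def affinity_key(client_ip, network_affinity=False, prefix_len=24):
    """Key used in the affinity/exception lists.

    With network affinity turned on, only the network part of the address is
    kept, so all clients behind the same network share a single entry; the
    /24 prefix length is an illustrative assumption.
    """
    if not network_affinity:
        return client_ip
    network = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    return str(network.network_address)

# affinity_key("192.0.2.77", network_affinity=True) returns "192.0.2.0"
```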
In another alternative, affinity may be maintained by only allowing empty buckets to change ownership during convergence. This is not particularly practical for a bucket count on the order of sixty, but is feasible for larger numbers of buckets, such as sub-buckets. For example, each bucket may be subdivided into sub-buckets (e.g., each bucket corresponds to 1,000 sub-buckets), with clients mapped to sub-buckets, using sub-buckets instead of the affinity lists. Other nodes build exception lists based on the sub-bucket states. Only one node may own a sub-bucket at any time, and a sub-bucket is not moved from a node until there have been no active connections that map to it (the sub-bucket is empty) for the configured period of time.
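A rough sketch of the empty-sub-bucket rule, with hypothetical per-sub-bucket connection counters and idle timestamps, is shown below:

```python
def movable_sub_buckets(node, now, idle_threshold):
    """Sub-buckets this node could relinquish at convergence (illustrative sketch).

    node.sub_bucket_conns counts active connections per sub-bucket, and
    node.sub_bucket_idle_since records when that count last reached zero;
    a sub-bucket may move only after it has been empty for idle_threshold seconds.
    """
    movable = set()
    for sub_bucket in node.owned_sub_buckets:
        if node.sub_bucket_conns.get(sub_bucket, 0) == 0:
            idle_since = node.sub_bucket_idle_since.get(sub_bucket)
            if idle_since is not None and now - idle_since >= idle_threshold:
                movable.add(sub_bucket)
    return movable
```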
Turning to various aspects related to convergence, to maintain client affinity, each node keeps an affinity list of the IP addresses to which it has affinity. When buckets move, the new owner of a bucket needs to know that another node has affinity with a client on that bucket. The convergence process is therefore extended to circulate client affinity lists among nodes. In one implementation, once the nodes know the new owner of a bucket, those nodes communicate their corresponding affinity lists to the new owner of the bucket, which collects this information to build its exception list.
In general,
During Phase 1 (block 660 of
During Phase 2, nodes use a reliable network protocol to send out their affinity lists to other nodes, such as represented by step 706 of
Once all the information is successfully sent and acknowledged, the convergence has reached a stable state 666. At this time, the nodes build up (or have built up at step 708) their respective exception lists, each of which tells that node which connections (from which IP addresses) it should not accept even though it owns the bucket or buckets to which those IP addresses map. At this time, the leader moves from the Phase 2 stable state to the normal operating state, and the other nodes follow and start accepting load.
In general, the communication between nodes is done using heartbeats that are sent non-reliably. Each node sends a heartbeat with its current state, including the state of the buckets it owns. Losing a heartbeat does not affect the cluster because the next heartbeat delivers the same information.
However, the affinity lists (and the release-affinity messages sent when affinity times out) need to be communicated to the new bucket owner in a reliable way. To this end, a reliable message delivery protocol is used during convergence, as well as on connection time-outs, to ensure that each message is delivered to its destination. No response is needed other than an acknowledgement to confirm that the message was delivered. A suitable protocol does not need to be routable because the nodes are on the same LAN, which also provides a low-error and low-latency environment.
As mentioned above, only the current owner of a bucket needs to build an exception list for that bucket. Similarly, other nodes only need to keep their own affinity lists. For efficiency, data may be grouped in the messages such that each message contains information for only one bucket, with only the current bucket owner being a destination for these messages. This applies when nodes send affinity lists for the buckets they do not own to the current bucket owner during convergence, and when, in normal load handling operation, a node releases a client to the current bucket owner.
One format uses a <port rule, bucket number> pair as the way to identify the recipient of a message. At any given moment only one node owns a bucket, and thus this address is not ambiguous. An extended heartbeat may be used as a frame-level protocol, whereby all packets are broadcast, allowing the multiplexing of data for different destinations into one packet. Each node scans the packet and reads only the sections that are destined for that node's buckets.
The <port rule, bucket number> pair is not sufficient for acknowledging message delivery, because multiple nodes might send their affinity lists for the same pair. Additional source information, such as a <Node Id, Message Id> pair, suffices. A generation identifier may be added to the message identifier, e.g., maintained separately at each node and increased when convergence restarts, to ensure that messages from any previous generation (convergence) are dropped, increasing tolerance to network delays.
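A hypothetical message layout reflecting this addressing and acknowledgement scheme might look as follows (field names are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class AffinityMessage:
    """One section of an extended heartbeat (field names are assumptions).

    (port_rule, bucket) addresses the current bucket owner; (node_id, message_id,
    generation) identifies the message for acknowledgement, and the generation
    lets receivers drop sections left over from a previous convergence.
    """
    port_rule: int
    bucket: int
    node_id: int              # sending node
    message_id: int
    generation: int
    clients: Tuple[str, ...]  # affinity-list entries (client IP addresses)

@dataclass(frozen=True)
class Confirmation:
    node_id: int              # node that sent the original message
    message_id: int
    generation: int
```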
By way of an example communication sequence, Node 1 (N1) sends an extended heartbeat packet that contains affinity lists for bucket 0 (owned by Node 2) and bucket 2 (owned by Node 3), e.g.:
<Port rule 1; Bucket 0><N1, 1>[data]<Port rule 1; Bucket 2><N1, 2>[data]
Node 2 and Node 3 receive the packet and extract the part that is relevant for the buckets they own:
Node 2 sends a confirmation for the part it has received: <N1, 1>
Node 3 sends a confirmation for the part it has received: <N1, 2>
Again note that the confirmations do not need to include the <Port rule, Bucket Id> pair, because the <Node Id, Message Id> pair sufficiently identifies the recipient. Further, if the underlying protocol is broadcast, then confirmations can be multiplexed in a single packet, with each node extracting only the portion that is relevant to it. Moreover, messages and confirmations may be multiplexed in a single packet, distinguished by a section type (message or confirmation), with each destination node extracting the data destined for it from the packet.
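As a sketch of the receive path under the assumptions above (sections represented as simple dictionaries), each node might scan a broadcast packet, keep only the sections for buckets it owns, and build the confirmations to multiplex into its reply:

```python
def process_extended_heartbeat(node, sections, current_generation):
    """Scan a broadcast packet and handle only the sections addressed to this node.

    Each section is assumed to be a dict with keys 'port_rule', 'bucket',
    'node_id', 'message_id', 'generation' and 'clients'; the return value is the
    list of (node_id, message_id) confirmations to multiplex into the reply.
    """
    confirmations = []
    for section in sections:
        if section["generation"] != current_generation:
            continue                          # stale section from a prior convergence
        if section["bucket"] not in node.owned_buckets:
            continue                          # addressed to some other bucket's owner
        node.exception_list.setdefault(section["bucket"], set()).update(section["clients"])
        confirmations.append((section["node_id"], section["message_id"]))
    return confirmations
```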
Exemplary Operating Environment
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 910 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 910 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 910. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
The system memory 930 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 931 and random access memory (RAM) 932. A basic input/output system 933 (BIOS), containing the basic routines that help to transfer information between elements within computer 910, such as during start-up, is typically stored in ROM 931. RAM 932 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 920. By way of example, and not limitation,
The computer 910 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 910 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 980. The remote computer 980 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 910, although only a memory storage device 981 has been illustrated in
When used in a LAN networking environment, the computer 910 is connected to the LAN 971 through a network interface or adapter 970. When used in a WAN networking environment, the computer 910 typically includes a modem 972 or other means for establishing communications over the WAN 973, such as the Internet. The modem 972, which may be internal or external, may be connected to the system bus 921 via the user input interface 960 or other appropriate mechanism. A wireless networking component 974 such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 910, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 999 (e.g., for auxiliary display of content) may be connected via the user interface 960 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 999 may be connected to the modem 972 and/or network interface 970 to allow communication between these systems while the main processing unit 920 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.