Computer networks are typically composed of a number of network switches which connect a group of computers together. Ideally, computer networks pass messages between computers quickly and reliably. Additionally, it can be desirable that a computer network be self-configuring and self-healing. In Ethernet switching networks, a spanning tree algorithm is often used to automatically generate a viable network topology. However, there are several challenges when implementing Ethernet switching networks within large datacenters and computer clusters. One challenge relates to instances where the network switches do not have the necessary information to deliver a message to its destination. In this case, the network switches broadcast the message throughout the entire network, resulting in a message flood. The message flood eventually results in the delivery of the message to the desired end station, but produces a large volume of network traffic. In large scale networks, where the number of end stations is large, the likelihood and magnitude of message floods increase dramatically.
The accompanying drawings illustrate various embodiments of the principles described herein and are a part of the specification. The illustrated embodiments are merely examples and do not limit the scope of the claims.
Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
The network switches which make up a computer network incorporate forwarding tables which contain forwarding information required to route messages to their intended destination. Forwarding tables use caches that are based on techniques which hash the end station address to implement destination lookup tables. After a forwarding route is cached, switches forward data only on the port that sends data toward its intended destination.
However, the forwarding tables are typically stored on random access memory (RAM) with limited memory capacity, which may prevent the network switch from retaining a complete set of forwarding data. With limited size RAM, hash collisions result in hash table misses. Hash table misses cause flood type broadcasting within the network that decreases network performance. Existing systems do not fully utilize the capabilities of neighboring switches to limit the propagation of flooding to relevant portions of the network when forwarding cache misses result in a broadcast.
Additionally, the network switches depend on the proper forwarding information being propagated through the network. If the destination of an incoming message can be matched with proper routing information contained within the forwarding table, the switch forwards the message along the proper route. However, if there is no routing information that corresponds to the desired destination, the switch broadcasts the message to the entire network. This creates a message flood which propagates through the entire computer network, eventually reaching the desired end station. Particularly in large computing networks, this message flood can consume a large portion of the network capacity, resulting in decreased performance and/or the requirement to build a more expensive network that has far greater capacity than would otherwise be required.
This specification describes networking techniques which reduce propagation of message floods while still allowing the message to reach the desired end station. In particular, the specification describes techniques that improve the ability of neighboring switches to mitigate broadcast penalties without the requirement for hardware changes or upgrades. This allows networks to incorporate smaller forwarding caches while providing an equivalent level of performance.
Existing forwarding techniques suffer from two inadequacies. First, since a forwarding cache in one switch uses the same hashing function as forwarding caches in neighboring switches, cache collisions produced in one switch may be replicated in neighboring switches. In particular, the broadcasting action within one switch may cause cache misses and broadcasting to neighboring switches. This may cause cache missing to propagate from switch to switch throughout a computer network.
According to one illustrative embodiment, distinct hash functions can be implemented within each switch. With this technique, even when a hash collision occurs within a forwarding cache in one switch, it is unlikely that the same hash collision occurs in a neighboring switch. This can improve the neighboring switches' ability to block unnecessary broadcast traffic. By introducing the concept of a distinct hash within each switch, broadcast traffic and wasted network bandwidth are reduced.
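The effect of distinct hash functions can be illustrated with a minimal sketch. It assumes each switch mixes a hypothetical per-switch salt into a keyed hash; the specification does not prescribe how the hash functions are made distinct, so the names `hash_index`, `colliding_pairs`, `TABLE_SIZE` and the salt values are illustrative assumptions only.

```python
import hashlib

TABLE_SIZE = 256  # assumed number of hash locations in each forwarding cache


def hash_index(mac: str, switch_salt: bytes, table_size: int = TABLE_SIZE) -> int:
    """Map a MAC address to a table index using a per-switch salt (hypothetical scheme)."""
    digest = hashlib.blake2b(mac.encode("ascii"), key=switch_salt, digest_size=8).digest()
    return int.from_bytes(digest, "big") % table_size


def colliding_pairs(macs, switch_salt: bytes):
    """Return the set of MAC address pairs that share a hash location in one switch."""
    buckets = {}
    for mac in macs:
        buckets.setdefault(hash_index(mac, switch_salt), []).append(mac)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


# Two neighboring switches with different salts (salt values are arbitrary examples).
macs = [f"02:00:00:00:{i // 256:02x}:{i % 256:02x}" for i in range(200)]
pairs_switch_4 = colliding_pairs(macs, b"switch-4")
pairs_switch_5 = colliding_pairs(macs, b"switch-5")
shared = pairs_switch_4 & pairs_switch_5  # conflicts that both switches would exhibit
```

With identical salts the two sets of conflicting pairs are the same, so a conflict in one switch is replicated in its neighbor. With different salts, a pair that conflicts in one switch conflicts in the other only with probability of roughly 1/`TABLE_SIZE`, which is why `shared` is expected to be much smaller than either set.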
Limiting the scope for broadcast traffic also reduces the number of unnecessary forwarding entries that must be maintained within switch caches that are not directly on the communication path. A second inadequacy is that, in many situations, it is difficult for a switch that is a neighbor to a switch that is missing its cache to learn the forwarding direction for missing end station addresses. The specification describes a method to detect that cache missing is occurring and to instruct neighboring switches as to the location of the missing end station in order to eliminate unnecessary propagation of broadcast traffic.
Additionally or alternatively, a method for neighboring switches to learn the forwarding direction for missing end station addresses can be implemented. First, a network switch detects conditions that are symptomatic of cache missing. In this situation, selective broadcasting is intentionally performed to deposit forwarding entries in the caches of switches that are in the neighborhood of the missing switch. This again improves the ability of neighboring switches to limit the effects of broadcast traffic. With this invention, networks can be constructed using simpler hashing functions and smaller forwarding table RAM while reducing the volume of unnecessary broadcast traffic.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present systems and methods. It will be apparent, however, to one skilled in the art that the present apparatus, systems and methods may be practiced without these specific details. Reference in the specification to “an embodiment,” “an example” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment or example is included in at least that one embodiment, but not necessarily in other embodiments. The various instances of the phrase “in one embodiment” or similar phrases in various places in the specification are not necessarily all referring to the same embodiment.
Computer network topology and management are important to maximize the performance of the computer network, reduce costs, increase flexibility and provide the desired stability. Early in the development of computer networks, a number of problems had to be overcome. One of those problems was messages being trapped in an endless loop as a result of a minor change to the network topology, such as adding a link or an end station. The trapped message would be repeatedly passed between various network components in a closed cycle that never allowed the message to reach the intended destination. This could generate enormous volumes of useless traffic, often making a network unusable.
A spanning tree algorithm was developed to eliminate potential cycles within a network. The spanning tree algorithm identifies a set of links that spans the network and allows each end station to communicate with every other end station. Redundant links were blocked to prevent loops which could give rise to a cycle. After a spanning tree is identified throughout an entire network, each switch within the network can use a very simple forwarding procedure. When a message is sent from any end station A to any end station B, each switch forwards an incoming message on all active (spanning tree) ports except the port on which the message arrived. This process is called flooding and can be performed with no routing information except information needed to define the active ports. This simple procedure guarantees correct network transmission. Every message sent from an end station A traverses the entire spanning tree and is guaranteed to arrive at end station B, where it is received when B recognizes its target address. Other end stations drop the message addressed to end station B because it is not addressed to them. The use of the spanning tree prevents endless message transmission. When a message reaches the end of the spanning tree, no further message transmission occurs.
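The flooding procedure can be sketched in a few lines. This is a minimal illustration, assuming the spanning tree is given as a dictionary that maps each switch to the neighbors reachable through its active ports; the function name `flood` and the five-switch tree are hypothetical.

```python
from collections import deque


def flood(active_ports, start):
    """Flood a message from `start` over spanning-tree links.

    Each switch forwards on every active port except the one the message
    arrived on. Returns the switches visited (in order) and the number of
    link transmissions. No loop detection is performed: termination relies
    entirely on `active_ports` describing a tree.
    """
    visited_order = []
    transmissions = 0
    queue = deque([(start, None)])  # (switch, neighbor the message arrived from)
    while queue:
        switch, arrived_from = queue.popleft()
        visited_order.append(switch)
        for neighbor in active_ports[switch]:
            if neighbor != arrived_from:
                transmissions += 1
                queue.append((neighbor, switch))
    return visited_order, transmissions


# Hypothetical spanning tree with links 1-2, 2-3, 2-4 and 4-5.
tree = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2, 5], 5: [4]}
order, sent = flood(tree, start=1)  # order == [1, 2, 3, 4, 5], sent == 4
```

On a tree with n switches the message reaches every switch and crosses each of the n - 1 links exactly once, which is why flooding is always correct but costly. If the links contained a cycle, the same loop would never terminate, which is the endless-loop failure the spanning tree is meant to prevent.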
For large networks, this broadcast-based procedure can be very inefficient.
Adaptive forwarding has been developed to enhance communications efficiency using forwarding tables that learn the proper route to each destination. Messages contain MAC (Media Access Control) addresses that uniquely identify all end stations within an Ethernet network. Each message has a source MAC address and a destination MAC address. The source indicates the origin end station, and the destination indicates the target end station. Whenever a message is received on a link with source address X, a forwarding table entry is created so that all subsequent messages destined for X are forwarded only on this link. For example, after a first message is sent from end station B (205) with source address B, a forwarding entry to B is created within switch 1 (215). Subsequent messages sent into switch 1 (e.g. from end station A) with destination address B traverse only link 1 (230) and link 2 (235). This procedure is used to adaptively create forwarding tables throughout large networks to reduce network traffic. This adaptive forwarding procedure requires that the switches efficiently implement hardware based lookup tables. Lookup hardware reads the input destination MAC address, which consists of either 48 or 64 bits of address information, depending on the addressing standard. The lookup result, if successful, identifies the unique forwarding port to the addressed end station. If a forwarding entry for the input MAC address is not found, then the message is forwarded on all active links except the link on which the message arrived.
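The learning behavior described above can be sketched as follows. This is an idealized illustration that uses an unbounded dictionary in place of a hardware lookup table; the class name `LearningSwitch`, its method `handle`, and the port numbers are assumptions made for the example.

```python
class LearningSwitch:
    """Idealized adaptive-forwarding switch with an unbounded forwarding table."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.table = {}  # source MAC address -> port on which it was last seen

    def handle(self, src, dst, in_port):
        """Learn the source address, then return the list of output ports."""
        self.table[src] = in_port          # messages for `src` now use this port
        out_port = self.table.get(dst)
        if out_port is None:
            # No forwarding entry: flood on all ports except the arrival port.
            return sorted(self.ports - {in_port})
        return [out_port]


switch = LearningSwitch(ports=[1, 2, 3])
first = switch.handle(src="B", dst="A", in_port=2)   # A unknown: flooded on [1, 3]
second = switch.handle(src="A", dst="B", in_port=1)  # B learned: sent only on [2]
third = switch.handle(src="B", dst="A", in_port=2)   # A learned: sent only on [1]
```

Because the dictionary never evicts entries, this sketch never misses after an address has been learned. The limited-capacity hash tables discussed in the remainder of the description are what make misses, and therefore floods, recur.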
Efficient hash mapping approaches have been developed to implement hardware lookup for adaptive forwarding.
During destination address lookup, two potential forwarding instructions result. The tag fields are then compared by the tag compare modules (325, 330). If the tag field for one of those forwarding instructions exactly matches the input destination MAC address, then the result field from the matching instruction can be used to forward data.
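A two-way lookup of this kind can be sketched in software. The sketch assumes two equally sized arrays stand in for the two RAMs and that each entry is a (tag, result) pair; the XOR-folding `hash_address` function is a placeholder chosen for the example, not the hash used by any particular switch.

```python
from typing import List, Optional, Tuple

Entry = Optional[Tuple[int, int]]  # (tag = full MAC address, result = forwarding port)


def hash_address(mac: int, num_rows: int) -> int:
    """Placeholder hash: XOR-fold a 48-bit MAC address into a row index."""
    folded = (mac ^ (mac >> 16) ^ (mac >> 32)) & 0xFFFF
    return folded % num_rows


def lookup(ram0: List[Entry], ram1: List[Entry], dst_mac: int) -> Optional[int]:
    """Read one entry from each RAM and return the port whose tag matches, if any."""
    row = hash_address(dst_mac, len(ram0))
    for entry in (ram0[row], ram1[row]):       # the two candidate instructions
        if entry is not None and entry[0] == dst_mac:  # tag compare
            return entry[1]
    return None  # forwarding table lookup miss


ram0: List[Entry] = [None] * 8
ram1: List[Entry] = [None] * 8
ram0[hash_address(0x0B, 8)] = (0x0B, 2)   # address 0x0B is reachable on port 2
port_b = lookup(ram0, ram1, 0x0B)         # 2
port_c = lookup(ram0, ram1, 0x13)         # None: same row as 0x0B, but tag differs
```

Storing the full address as the tag is what lets the switch distinguish two addresses that hash to the same row; without the tag compare, address 0x13 would be forwarded on the port learned for 0x0B.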
Whenever a message enters a switch, both its source address and the destination address are processed. The destination address is processed to determine the correct forwarding port. The source address is processed to determine whether a new entry should be placed in the forwarding table. When a source address is processed, the lookup table is queried to see whether that source address is already in the table. If no entry for the source address lies in the table, then there are no current instructions on how to reach the end station having that source address. In this case, a new entry can be added into the table. If either entry is empty, then the value for that forwarding entry is set with tag field equal to the source address and result field equal to the port on which the message arrived into the switch. For this switch, subsequent messages sent to that source address will be sent only on the correct forwarding port. If the address is already in the table and the correct forwarding port is indicated, no further action is needed. If the address is already in the table and an incorrect forwarding port is indicated, then the entry is overwritten with correct forwarding instructions.
As new entries are entered into the table, a replacement strategy is needed. When a message arrives from an end station having a given source address, the lookup process may determine that both entries are nonempty and do not match the newly arriving message. In this case, the new entry may displace a randomly selected entry from the two-way set. Thus, replacement “flips a coin” and decides which entry is to be replaced with the new entry. There are occasions when multiple, frequently used destinations happen to hash to the same hash address. For this two-way set associative scheme, only two distinct forwarding instructions can be held at the same hash address location within each of the two RAMs. If there is a third common communication to the same hash address, at least one of these communications will repeatedly fail to identify forwarding instructions. This is called a forwarding table lookup miss. In this case, data is flooded or forwarded on all spanning tree ports except for the port on which the message arrived.
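The source-address update and the random replacement strategy can be sketched together. This is a minimal illustration, assuming the caller has already computed the hash row and that two lists stand in for the two RAMs; the function name `learn` and its return strings are hypothetical.

```python
import random


def learn(ram0, ram1, row, src_mac, in_port, rng=random):
    """Record that `src_mac` is reachable via `in_port` in a two-way set.

    Returns "unchanged", "updated", "filled" or "replaced" to show which
    case applied.
    """
    ways = (ram0, ram1)
    # Case 1: the address is already present; correct the port if it is stale.
    for ram in ways:
        entry = ram[row]
        if entry is not None and entry[0] == src_mac:
            if entry[1] == in_port:
                return "unchanged"
            ram[row] = (src_mac, in_port)
            return "updated"
    # Case 2: use an empty entry if either way has one.
    for ram in ways:
        if ram[row] is None:
            ram[row] = (src_mac, in_port)
            return "filled"
    # Case 3: both entries hold other addresses; "flip a coin" to pick a victim.
    victim = rng.choice(ways)
    victim[row] = (src_mac, in_port)
    return "replaced"


ram0, ram1 = [None] * 8, [None] * 8
outcomes = [learn(ram0, ram1, 3, mac, port) for mac, port in (("A", 1), ("B", 2), ("C", 3))]
# outcomes == ["filled", "filled", "replaced"]: the third address evicts A or B.
```

After the third insertion only two of the three addresses remain at that row, so traffic to the evicted address misses and is flooded. If all three are in frequent use, this eviction and the resulting flood repeat indefinitely.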
According to one illustrative embodiment, several changes can be made to the architecture described above which may reduce cost and improve the performance of the computer network. A reduction in forwarding efficiency occurs when multiple destination addresses produce the same hash address. For example, in a one-way set associative table, only a single forwarding entry can reside at each hash location. When multiple forwarding addresses hash to the same location, forwarding misses will cause some incoming messages to be flooded.
At least two features can be introduced into the network architecture which reduce propagation of message floods within the network. The purpose of these features is to assist neighboring switches in halting the propagation of broadcast floods throughout a larger network and to reduce the total forwarding table space needed within the network. Reducing table space requirements again reduces the number of flooding actions within the network as each required entry can replace another needed entry. To simplify examples, we assume switches use a one-way associative hash table to implement adaptive forwarding.
Assume that, after the network is initialized, the very first communication is from end station A (405) to end station B (410). Since no switch within the entire network has a forwarding entry for end station B (410), the message is broadcast throughout the entire spanning tree and the adaptive forwarding procedure places an entry for end station A (405) in every switch. Now, all messages sent to end station A (405) traverse the proper communication path. For example, messages sent from either end station B (410) to end station A (405) or from end station C (415) to end station A (405) never traverse switch 5 (440). Consequently, switch 5 (440), and more remote switches, may never discover forwarding entries for end station B (410) or end station C (415). As a result, misses at switch 4 (435) for flows from end station A (405) to end station B (410) or from end station A (405) to end station C (415) propagate throughout large regions of the network that have no knowledge of the location of end station B (410) or end station C (415).
This problem can be alleviated by ensuring that switch 5 (440) becomes aware of the location of end stations for which a miss is repeatedly occurring. If switch 5 (440) has a forwarding entry for end station B (410) and switch 5 (440) receives a message for end station B (410) from switch 4 (435), then that message can be dropped because it has arrived on an input link that is also the correct route to the destination. If switch 5 (440) can learn the needed end station location information, switch 5 (440) can provide a barrier that limits unnecessary message propagation due to forwarding table misses in switch 4 (435).
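The barrier behavior amounts to one extra check in the forwarding decision. The sketch below assumes a dictionary stands in for the forwarding table of switch 5 and that port 1 is its link to switch 4; the function name `forward_decision` and the port numbers are hypothetical.

```python
def forward_decision(table, ports, dst, in_port):
    """Return the output ports for a message addressed to `dst`.

    An empty list means the message is dropped.
    """
    out_port = table.get(dst)
    if out_port is None:
        # Miss: flood on every port except the arrival port.
        return sorted(port for port in ports if port != in_port)
    if out_port == in_port:
        # The message arrived on the link that leads to `dst`, so it is a
        # stray copy of a flood from the upstream switch: drop it.
        return []
    return [out_port]


switch5_ports = [1, 2, 3]  # port 1 is assumed to connect to switch 4
without_entry = forward_decision({}, switch5_ports, "B", in_port=1)      # [2, 3]
with_entry = forward_decision({"B": 1}, switch5_ports, "B", in_port=1)   # []
```

Without an entry for B, switch 5 passes the flood onward to the rest of the network. Once it holds an entry for B that points back toward switch 4, the same message is dropped at switch 5, so only the single link between the two switches carries the unnecessary copy.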
One approach to propagating needed information uses logic in each switch to detect whenever a new forwarding entry is entered. For example, a message from end station B (410) to end station A (405) may cause a new forwarding entry for B to be added in switch 4 (435). This indicates that a message sent to end station B (410) would have missed just prior to this addition, and thus, misses to end station B (410) are likely to happen in the future whenever this new entry is replaced. When the message from end station B (410) to end station A (405) is processed, and the B entry is added, this message is artificially flooded even though the lookup entry for end station A (405) lies in the forwarding table. As a result, on a subsequent miss from end station A (405) to end station B (410) at switch 4 (435), switch 5 (440) blocks flooding that might otherwise propagate throughout the network. While the link connecting switch 4 (435) to switch 5 (440) is flooded, flooding does not propagate past switch 5 (440).
This artificial flooding action to teach the network the location of end stations need not be performed on every new entry insertion as that might waste undue link bandwidth. The artificial flooding action may be caused with some low probability each time a new entry is added. For example, when a message from end station B (410) to end station A (405) is processed and a replacement of the B forwarding entry occurs, the switch can flood with some low probability p (e.g. p=0.01). This allows switch 5 (440) to eventually (after about 100 replacements of the destination at switch 4 (435)) learn the location of an end station that is repeatedly missing at switch 4 (435). The forwarding probability can be adjusted to produce the desired forwarding frequency. For example, a low forwarding probability could be used where there is a large number of communication flows and an inadequate hash table size such that the forwarding process can miss frequently. This can reduce the overall network traffic and minimize the expense of broadcasting messages over a long distance when this missing occurs. By informing a neighboring switch of the location of the conflicting end stations, the neighboring switch can be enabled to automatically act as a barrier to limit the flooding of messages to the remainder of the computer network.
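Probabilistic artificial flooding can be sketched as a small extension of the forwarding decision. This is an illustration under stated assumptions: a dictionary stands in for the hash-based forwarding table, a "new entry" is any insertion or correction of the source entry, and the function name `output_ports` and the port numbers are hypothetical.

```python
import random


def output_ports(table, ports, src, dst, in_port, p, rng):
    """Return the ports a message is sent on, with artificial flooding.

    When processing the message adds a new source entry, the message is
    flooded with probability `p` even if the destination lookup hits.
    """
    new_entry = table.get(src) != in_port   # insertion or correction of the source entry
    table[src] = in_port
    flood_ports = sorted(port for port in ports if port != in_port)
    out_port = table.get(dst)
    if out_port is None:
        return flood_ports                  # ordinary miss: flood
    if new_entry and rng.random() < p:
        return flood_ports                  # artificial flood to teach neighbors
    return [] if out_port == in_port else [out_port]


rng = random.Random(0)
ports = [1, 2, 3]  # assumed: A is reached via port 1, B arrives on port 2
always = output_ports({"A": 1}, ports, "B", "A", in_port=2, p=1.0, rng=rng)  # [1, 3]
never = output_ports({"A": 1}, ports, "B", "A", in_port=2, p=0.0, rng=rng)   # [1]
```

Because each insertion floods independently with probability p, the number of insertions until the first artificial flood follows a geometric distribution with mean 1/p. For p = 0.01 this gives the figure of about 100 replacements quoted above.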
A significant problem remains to be solved. If identical hash functions are used within all the switches, the switches will all exhibit the same conflicts. For example, when switch 4 (435) repeatedly misses as traffic is alternately sent to end stations B and C (410, 415), then switch 5 (440) has the same conflict and may again propagate misses to its neighbors. In our example, forwarding entries for B and C cannot be simultaneously held within switch 4 (435) or switch 5 (440). Since switch 5 (440) is of identical construction and uses the same hash function, switch 5 (440) also cannot simultaneously hold entries for destinations of end stations B and C (410, 415).
This problem is rectified by the illustrative associative lookup method (500) shown in
Returning to the example illustrated in
Previously, when a repeated miss occurs at a switch, that miss might also be repeated at a neighboring switch. Under some conditions, misses can propagate throughout the entire fabric. These misses also systematically flood the network with potentially useless forwarding entries. In our example, conflicting flows from end station A to end station B and end station A to end station C cause repeated flooding that inserts forwarding entries for end station A throughout the network, potentially displacing useful entries even where not needed. With this improved architecture, misses still occur within a switch, but neighboring switches limit costly effects of flooding by acting as a barrier that reduces wasted bandwidth and wasted forwarding table space.
In sum, by propagating missed forwarding information throughout the network, neighboring switches learn pertinent information about switches which may have recurring forwarding table misses. Introducing variations in the calculation of the hash function at each switch ensures that there is a very low likelihood that neighboring switches will exhibit the same forwarding table miss. The neighboring switches can then act as a barrier to prevent unnecessary flooding into other areas of the computer network after a forwarding table miss. By applying these principles, the overall efficiency of a computer network can be improved without replacing switches or increasing the forwarding look up table RAM in each switch.
The preceding description has been presented only to illustrate and describe embodiments and examples of the principles described. This description is not intended to be exhaustive or to limit these principles to any precise form disclosed. Many modifications and variations are possible in light of the above teaching.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US09/30768 | 1/12/2009 | WO | 00 | 7/1/2011 |