Field
The present disclosure relates to network design. More specifically, the present disclosure relates to a method for constructing a scalable switching system that facilitates automatic configuration.
Related Art
The relentless growth of the Internet has brought with it an insatiable demand for bandwidth. As a result, equipment vendors race to build larger, faster, and more versatile switches to move traffic. However, the size of a switch cannot grow infinitely. It is limited by physical space, power consumption, and design complexity, to name a few factors. More importantly, because an overly large system often does not provide economy of scale due to its complexity, simply increasing the size and throughput of a switch may prove economically unviable due to the increased per-port cost.
One way to increase the throughput of a switch system is to use switch stacking. In switch stacking, multiple smaller-scale, identical switches are interconnected in a special pattern to form a larger logical switch. However, switch stacking requires careful configuration of the ports and inter-switch links. The amount of required manual configuration becomes prohibitively complex and tedious when the stack reaches a certain size, which precludes switch stacking from being a practical option in building a large-scale switching system. Furthermore, a system based on stacked switches often has topology limitations which restrict the scalability of the system due to fabric bandwidth considerations.
One embodiment of the present invention provides a switch system. The switch includes one or more ports on the switch configured to transmit packets encapsulated based on a first protocol. The switch further includes a control mechanism. During operation, the control mechanism forms a logical switch based on a second protocol, receives an automatically assigned identifier for the logical switch without requiring manual configuration of the identifier, and joins a virtual cluster switch.
In a variation on this embodiment, the virtual cluster switch comprises one or more physical switches which are allowed to be coupled in an arbitrary topology. Furthermore, the virtual cluster switch appears to be one single switch.
In a further variation, the first protocol is a Transparent Interconnection of Lots of Links (TRILL) protocol, and the packets are encapsulated in TRILL headers.
In a variation on this embodiment, the logical switch formed by the control mechanism is a logical Fibre Channel (FC) switch.
In a further variation, the identifier assigned to the logical switch is an FC switch domain ID.
In a variation on this embodiment, the control mechanism is further configured to maintain a copy of configuration information for the virtual cluster switch.
In a further variation on this embodiment, the configuration information for the virtual cluster switch comprises a number of logical switch identifiers assigned to the physical switches in the virtual cluster switch.
In a variation on this embodiment, the switch includes a media access control (MAC) learning mechanism which is configured to learn a source MAC address and a corresponding VLAN identifier of an ingress packet associated with a port and communicate a learned MAC address, a corresponding VLAN identifier, and the corresponding port information to a name service.
One embodiment of the present invention provides a switching system that includes a plurality of switches configured to transport packets using a first protocol. Each switch includes a control mechanism. The plurality of switches are allowed to be coupled in an arbitrary topology. Furthermore, the control mechanism automatically configures the respective switch within the switching system based on a second protocol without requiring manual configuration, and the switching system appears externally as a single switch.
In a variation on this embodiment, a respective switch in the switching system receives an automatically configured identifier associated with a logical switch formed on the respective switch.
In a further variation, the logical switch is a logical FC switch. In addition, the identifier is an FC switch domain ID.
In a further variation, the packets are transported between switches based on a TRILL protocol. The respective switch is assigned a TRILL RBridge identifier that corresponds to the FC switch domain ID.
In a variation on this embodiment, a respective switch maintains a copy of configuration information of all the switches in the switching system.
In a variation on this embodiment, the switching system includes a name service which maintains records of MAC addresses and VLAN information learned by a respective switch.
The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.
In embodiments of the present invention, the problem of building a versatile, cost-effective, and scalable switching system is solved by running a control plane with automatic configuration capabilities (such as the Fibre Channel control plane) over a conventional transport protocol, thereby allowing a number of switches to be inter-connected to form a single, scalable logical switch without requiring burdensome manual configuration. As a result, one can form a large-scale logical switch (referred to as a “virtual cluster switch” or VCS herein) using a number of smaller physical switches. The automatic configuration capability provided by the control plane running on each physical switch allows any number of switches to be connected in an arbitrary topology without requiring tedious manual configuration of the ports and links. This feature makes it possible to use many smaller, inexpensive switches to construct a large cluster switch, which can be viewed as a single logical switch externally.
It should be noted that a virtual cluster switch is not the same as conventional switch stacking. In switch stacking, multiple switches are interconnected at a common location (often within the same rack), based on a particular topology, and manually configured in a particular way. These stacked switches typically share a common address, e.g., IP address, so they can be addressed as a single switch externally. Furthermore, switch stacking requires a significant amount of manual configuration of the ports and inter-switch links. The need for manual configuration prohibits switch stacking from being a viable option in building a large-scale switching system. The topology restriction imposed by switch stacking also limits the number of switches that can be stacked. This is because it is very difficult, if not impossible, to design a stack topology that allows the overall switch bandwidth to scale adequately with the number of switch units.
In contrast, a VCS can include an arbitrary number of switches with individual addresses, can be based on an arbitrary topology, and does not require extensive manual configuration. The switches can reside in the same location, or be distributed over different locations. These features overcome the inherent limitations of switch stacking and make it possible to build a large “switch farm” which can be treated as a single, logical switch. Due to the automatic configuration capabilities of the VCS, an individual physical switch can dynamically join or leave the VCS without disrupting services to the rest of the network.
Furthermore, the automatic and dynamic configurability of VCS allows a network operator to build its switching system in a distributed and “pay-as-you-grow” fashion without sacrificing scalability. The VCS's ability to respond to changing network conditions makes it an ideal solution in a virtual computing environment, where network loads often change with time.
Although this disclosure is presented using examples based on the Transparent Interconnection of Lots of Links (TRILL) as the transport protocol and the Fibre Channel (FC) fabric protocol as the control-plane protocol, embodiments of the present invention are not limited to TRILL networks, or networks defined in a particular Open System Interconnection Reference Model (OSI reference model) layer. For example, a VCS can also be implemented with switches running multi-protocol label switching (MPLS) protocols for the transport. In addition, the terms “RBridge” and “switch” are used interchangeably in this disclosure. The use of the term “RBridge” does not limit embodiments of the present invention to TRILL networks only. The TRILL protocol is described in IETF draft “RBridges: Base Protocol Specification,” available at http://tools.ietf.org/html/draft-ietf-trill-rbridge-protocol, which is incorporated by reference herein.
The terms “virtual cluster switch,” “virtual cluster switching,” and “VCS” refer to a group of interconnected physical switches operating as a single logical switch. The control plane for these physical switches provides the ability to automatically configure a given physical switch, so that when it joins the VCS, little or no manual configuration is required. VCS is not limited to a specific product family from a particular vendor. In addition, “VCS” is not the only term that can be used to name the switching system described herein. Other terms, such as “Ethernet fabric,” “Ethernet fabric switch,” “fabric switch,” “cluster switch,” “Ethernet mesh switch,” and “mesh switch” can also be used to describe the same switching system. Hence, in some embodiments, these terms and “VCS” can be used interchangeably.
The term “RBridge” refers to routing bridges, which are bridges implementing the TRILL protocol as described in IETF draft “RBridges: Base Protocol Specification.” Embodiments of the present invention are not limited to the application among RBridges. Other types of switches, routers, and forwarders can also be used.
The terms “frame” or “packet” refer to a group of bits that can be transported together across a network. “Frame” should not be interpreted as limiting embodiments of the present invention to layer-2 networks. “Packet” should not be interpreted as limiting embodiments of the present invention to layer-3 networks. “Frame” or “packet” can be replaced by other terminologies referring to a group of bits, such as “cell” or “datagram.”
A physical switch may dedicate a number of ports for external use (i.e., to be coupled to end hosts or other switches external to the VCS) and other ports for inter-switch connection. To an external device, VCS 100 appears to be one switch, and any port from any of the physical switches is considered one port on the VCS. For example, port groups 110 and 112 are both VCS external ports and can be treated equally as if they were ports on a common physical switch, although switches 105 and 107 may reside in two different locations.
The physical switches can reside at a common location, such as a data center or central office, or be distributed in different locations. Hence, it is possible to construct a large-scale centralized switching system using many smaller, inexpensive switches housed in one or more chassis at the same location. It is also possible to have the physical switches placed at different locations, thus creating a logical switch that can be accessed from multiple locations. The topology used to interconnect the physical switches can also be versatile. VCS 100 is based on a mesh topology. In further embodiments, a VCS can be based on a ring, tree, or other types of topologies.
In one embodiment, the protocol architecture of a VCS is based on elements from the standard IEEE 802.1Q Ethernet bridge, which is emulated over a transport based on the Fibre Channel Framing and Signaling-2 (FC-FS-2) standard. The resulting switch is capable of transparently switching frames from an ingress Ethernet port from one of the edge switches to an egress Ethernet port on a different edge switch through the VCS.
Because of its automatic configuration capability, a VCS can be dynamically expanded as the network demand increases. In addition, one can build a large-scale switch using many smaller physical switches without the burden of manual configuration. For example, it is possible to build a high-throughput fully non-blocking switch using a number of smaller switches. This ability to use small switches to build a large non-blocking switch significantly reduces the cost associated with switch complexity.
The forwarding of the Ethernet frame between ingress switch 202 and egress switch 204 is performed via inter-switch ports 208 and 210. The frame transported between the two inter-switch ports is encapsulated in an outer MAC header and a TRILL header, in accordance with the TRILL standard. The protocol stack associated with a given inter-switch port includes the following (from bottom up): MAC layer, TRILL layer, FC-FS-2 layer, FC E-Port layer, and FC link services (FC-LS) layer. The FC-LS layer is responsible for maintaining the connectivity information of a physical switch's neighbor, and populating an FC routing information base (RIB) 222. This operation is similar to what is done in an FC switch fabric. The FC-LS protocol is also responsible for handling joining and departure of a physical switch in VCS 200. The operation of the FC-LS layer is specified in the FC-LS standard, which is available at http://www.t11.org/ftp/t11/member/fc/ls/06-393v5.pdf, the disclosure of which is incorporated herein in its entirety.
During operation, when FDB 214 returns the egress switch 204 corresponding to the destination MAC address of the ingress Ethernet frame, the destination egress switch's identifier is passed to a path selector 218. Path selector 218 performs a fabric shortest-path first (FSPF)-based route lookup in conjunction with RIB 222, and identifies the next-hop switch within VCS 200. In other words, the routing is performed by the FC portion of the protocol stack, similar to what is done in an FC switch fabric.
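As an illustration only, the next-hop lookup performed by path selector 218 can be sketched as a lowest-cost path search over the link information held in the RIB. The sketch below uses Dijkstra's algorithm over a small hypothetical link map; the function name `next_hop`, the `rib` layout, and the link costs are assumptions made for this example and are not part of the FSPF specification.

```python
import heapq

def next_hop(links, source, destination):
    """Return the first hop on a lowest-cost path from source to destination.

    links maps a switch domain ID to {neighbor domain ID: link cost}.
    Returns None when the destination is unreachable or equals the source.
    """
    # Each heap entry is (path cost, current switch, first hop taken from source).
    heap = [(cost, nbr, nbr) for nbr, cost in links.get(source, {}).items()]
    heapq.heapify(heap)
    visited = {source}
    while heap:
        cost, node, first = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == destination:
            return first
        for nbr, link_cost in links.get(node, {}).items():
            if nbr not in visited:
                heapq.heappush(heap, (cost + link_cost, nbr, first))
    return None

# Hypothetical three-switch fabric keyed by FC switch domain ID.
rib = {
    1: {2: 1, 3: 4},
    2: {1: 1, 3: 1},
    3: {1: 4, 2: 1},
}
print(next_hop(rib, 1, 3))
```

In this hypothetical fabric, the direct link from switch 1 to switch 3 costs more than the two-hop path through switch 2, so the lookup selects switch 2 as the next hop.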
Also included in each physical switch are an address manager 216 and a fabric controller 220. Address manager 216 is responsible for configuring the address of a physical switch when the switch first joins the VCS. For example, when switch 202 first joins VCS 200, address manager 216 can negotiate a new FC switch domain ID, which is subsequently used to identify the switch within VCS 200. Fabric controller 220 is responsible for managing and configuring the logical FC switch fabric formed on the control plane of VCS 200.
One way to understand the protocol architecture of VCS is to view the VCS as an FC switch fabric with an Ethernet/TRILL transport. Each physical switch, from an external point of view, appears to be a TRILL RBridge. However, the switch's control plane implements the FC switch fabric software. In other words, embodiments of the present invention facilitate the construction of an “Ethernet switch fabric” running on FC control software. This unique combination provides the VCS with automatic configuration capability and allows it to provide the ubiquitous Ethernet services in a very scalable fashion.
For example, RBridge 412 is coupled with hosts 420 and 422 via 10GE ports. RBridge 414 is coupled to a host 426 via a 10GE port. These RBridges have TRILL-based inter-switch ports for connection with other TRILL RBridges in VCS 400. Similarly, RBridge 416 is coupled to host 428 and an external Ethernet switch 430, which is coupled to an external network that includes a host 424. In addition, network equipment can also be coupled directly to any of the physical switches in VCS 400. As illustrated here, TRILL RBridge 408 is coupled to a data storage 417, and TRILL RBridge 410 is coupled to a data storage 418.
Although the physical switches within VCS 400 are labeled as “TRILL RBridges,” they are different from the conventional TRILL RBridge in the sense that they are controlled by the FC switch fabric control plane. In other words, the assignment of switch addresses, link discovery and maintenance, topology convergence, routing, and forwarding can be handled by the corresponding FC protocols. Particularly, each TRILL RBridge's switch ID or nickname is mapped from the corresponding FC switch domain ID, which can be automatically assigned when a switch joins VCS 400 (which is logically similar to an FC switch fabric).
Note that TRILL is only used as a transport between the switches within VCS 400. This is because TRILL can readily accommodate native Ethernet frames. Also, the TRILL standards provide a ready-to-use forwarding mechanism that can be used in any routed network with arbitrary topology (although the actual routing in VCS is done by the FC switch fabric protocols). Embodiments of the present invention should not be limited to using only TRILL as the transport. Other protocols (such as multi-protocol label switching (MPLS) or Internet Protocol (IP)), either public or proprietary, can also be used for the transport.
In one embodiment, a VCS is created by instantiating a logical FC switch in the control plane of each switch. After the logical FC switch is created, a virtual generic port (denoted as G_Port) is created for each Ethernet port on the RBridge. A G_Port assumes the normal G_Port behavior from the FC switch perspective. However, in this case, since the physical links are based on Ethernet, the specific transition from a G_Port to either an FC F_Port or E_Port is determined by the underlying link and physical layer protocols. For example, if the physical Ethernet port is connected to an external device which lacks VCS capabilities, the corresponding G_Port will be turned into an F_Port. On the other hand, if the physical Ethernet port is connected to a switch with VCS capabilities and it is confirmed that the switch on the other side is part of a VCS, then the G_Port will be turned into an E_Port.
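The G_Port transition described above can be sketched as a small decision function. This is a minimal sketch, not a definitive implementation: the names `PortType` and `resolve_g_port` are hypothetical, and keeping the port as a G_Port when the peer is VCS-capable but not yet confirmed as part of a VCS is an assumption, since the description does not specify that case.

```python
from enum import Enum

class PortType(Enum):
    G_PORT = "G_Port"
    F_PORT = "F_Port"
    E_PORT = "E_Port"

def resolve_g_port(peer_is_vcs_capable, peer_confirmed_in_vcs):
    """Decide what a logical G_Port becomes, based on the attached device."""
    if not peer_is_vcs_capable:
        # External device without VCS capabilities: edge-facing port.
        return PortType.F_PORT
    if peer_confirmed_in_vcs:
        # VCS-capable switch confirmed to be part of a VCS: inter-switch port.
        return PortType.E_PORT
    # Assumption: the unconfirmed case is left as a G_Port.
    return PortType.G_PORT

print(resolve_g_port(False, False).value)
print(resolve_g_port(True, True).value)
```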
Similarly, RBridge 416 contains a virtual, logical FC switch 512. Corresponding to the physical Ethernet ports coupled to host 428 and external switch 430, logical FC switch 512 has a logical F_Port coupled to host 428, and a logical FL_Port coupled to switch 430. In addition, a logical N_Port 510 is created for host 428, and a logical NL_Port 508 is created for switch 430. Note that the logical FL_Port is created because that port is coupled to a switch (switch 430), instead of a regular host, and therefore logical FC switch 512 assumes an arbitrated loop topology leading to switch 430. Logical NL_Port 508 is created based on the same reasoning to represent a corresponding NL_Port on switch 430. On the VCS side, logical FC switch 512 has two logical E_Ports, which are to be coupled with other logical FC switches in the logical FC switch fabric in the VCS.
In the example illustrated in
The physical edge ports 522 and 524 are mapped to logical F_Ports 532 and 534, respectively. In addition, physical fabric ports 526 and 528 are mapped to logical E_Ports 536 and 538, respectively. Initially, when logical FC switch 521 is created (for example, during the boot-up sequence), logical FC switch 521 only has four G_Ports which correspond to the four physical ports. These G_Ports are subsequently mapped to F_Ports or E_Ports, depending on the devices coupled to the physical ports.
Neighbor discovery is the first step in VCS formation between two VCS-capable switches. It is assumed that the verification of VCS capability can be carried out by a handshake process between two neighbor switches when the link is first brought up.
In general, a VCS presents itself as one unified switch composed of multiple member switches. Hence, the creation and configuration of VCS is of critical importance. The VCS configuration is based on a distributed database, which is replicated and distributed over all switches. In other words, each VCS member switch maintains a copy of the VCS configuration database, and any change to the database is propagated to all the member switches. As a result, a network administrator can configure any part of the VCS from anywhere by, for example, executing a command line interface (CLI) command from any member switch.
In one embodiment, a VCS configuration database includes a global configuration table (GT) of the VCS and a list of switch description tables (STs), each of which describes a VCS member switch. In its simplest form, a member switch can have a VCS configuration database that includes a global table and one switch description table, e.g., [<GT><ST>]. A VCS with multiple switches will have a configuration database that has a single global table and multiple switch description tables, e.g., [<GT><ST0><ST1> . . . <STn−1>]. The number n corresponds to the number of member switches in the VCS. In one embodiment, the GT can include at least the following information: the VCS ID, number of nodes in the VCS, a list of VLANs supported by the VCS, a list of all the switches (e.g., list of FC switch domain IDs for all active switches) in the VCS, and the FC switch domain ID of the principal switch (as in a logical FC switch fabric). A switch description table can include at least the following information: the IN_VCS flag, indication whether the switch is a principal switch in the logical FC switch fabric, the FC switch domain ID for the switch, the FC world-wide name (WWN) for the corresponding logical FC switch; the mapped ID of the switch, and optionally the IP address of the switch. Note that, as explained below, a switch can have an IP address dynamically assigned to it when it joins the VCS. The IP address can be derived from and have a one-to-one mapping to the FC switch domain ID assigned to the switch.
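For illustration, the [<GT><ST0> . . . <STn−1>] layout can be modeled with a few data classes. The class and field names below are hypothetical. The sketch also derives the node count and the list of active FC switch domain IDs from the switch description tables instead of storing them in the global table, which is a simplification made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SwitchTable:
    """One switch description table (ST)."""
    in_vcs: bool
    is_principal: bool
    fc_domain_id: int
    wwn: str
    mapped_id: int
    ip_address: Optional[str] = None  # optional, per the description

@dataclass
class GlobalTable:
    """Global configuration table (GT)."""
    vcs_id: int
    vlans: List[int]
    principal_domain_id: int

@dataclass
class VcsConfigDatabase:
    global_table: GlobalTable
    switch_tables: List[SwitchTable] = field(default_factory=list)

    @property
    def node_count(self):
        return len(self.switch_tables)

    @property
    def active_domain_ids(self):
        return [st.fc_domain_id for st in self.switch_tables if st.in_vcs]

# Hypothetical two-switch VCS; the WWN strings are placeholders.
db = VcsConfigDatabase(
    GlobalTable(vcs_id=1, vlans=[10, 20], principal_domain_id=1),
    [
        SwitchTable(True, True, 1, "wwn-placeholder-0", mapped_id=0),
        SwitchTable(True, False, 2, "wwn-placeholder-1", mapped_id=1),
    ],
)
print(db.node_count, db.active_domain_ids)
```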
In addition, each switch's global configuration database is associated with a transaction ID. The transaction ID specifies the latest transaction (e.g., update or change) applied to the global configuration database. The transaction IDs of the global configuration databases in two switches can be compared to determine which database has the most current information (i.e., the database with the more current transaction ID is more up-to-date). In one embodiment, the transaction ID is the switch's serial number plus a sequential transaction number. This configuration can unambiguously resolve which switch has the latest configuration.
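A transaction-ID comparison could look like the following sketch. The description does not define the exact ordering rule, so the sketch makes two assumptions: the sequential transaction number decides which database is newer, and two IDs with the same sequence number but different serial numbers are reported as diverged. The names `TransactionId` and `compare_transaction_ids` are hypothetical.

```python
from typing import NamedTuple

class TransactionId(NamedTuple):
    serial: str    # serial number of the switch that made the change
    sequence: int  # sequential transaction number

def compare_transaction_ids(local, remote):
    """Classify the local database relative to the remote one."""
    if local == remote:
        return "in_sync"
    if local.sequence > remote.sequence:
        return "local_newer"
    if local.sequence < remote.sequence:
        return "remote_newer"
    # Assumption: same sequence number from different switches.
    return "diverged"

print(compare_transaction_ids(TransactionId("SN-A", 7), TransactionId("SN-A", 7)))
print(compare_transaction_ids(TransactionId("SN-A", 7), TransactionId("SN-B", 9)))
```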
As illustrated in
In one embodiment, each switch also has a VCS-mapped ID (denoted as “mappedID”), which is a switch index within the VCS. This mapped ID is unique and persistent within the VCS. That is, when a switch joins the VCS for the first time, the VCS assigns a mapped ID to the switch. This mapped ID persists with the switch, even if the switch leaves the VCS. When the switch joins the VCS again at a later time, the same mapped ID is used by the VCS to retrieve previous configuration information for the switch. This feature can reduce the amount of configuration overhead in VCS. Also, the persistent mapped ID allows the VCS to “recognize” a previously configured member switch when it re-joins the VCS, since a dynamically assigned FC fabric domain ID would change each time the member switch joins and is configured by the VCS.
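The persistence of the mapped ID across re-joins can be sketched as a small registry. This is a minimal sketch under stated assumptions: the registry is keyed by the switch MAC address and hands out monotonically incrementing IDs, and the class name `MappedIdRegistry` is hypothetical.

```python
class MappedIdRegistry:
    """Assigns a mapped ID once per switch and returns the same ID on re-join."""

    def __init__(self):
        self._by_mac = {}
        self._next_id = 0

    def join(self, mac):
        # A first-time switch gets the next free index; a returning
        # switch is recognized and keeps its earlier mapped ID.
        if mac not in self._by_mac:
            self._by_mac[mac] = self._next_id
            self._next_id += 1
        return self._by_mac[mac]

registry = MappedIdRegistry()
first = registry.join("00:11:22:33:44:55")
other = registry.join("00:11:22:33:44:66")
again = registry.join("00:11:22:33:44:55")  # re-join after leaving
print(first, other, again)
```

Because the mapped ID is stable, it can serve as the key for retrieving the switch's earlier configuration even when the dynamically assigned FC switch domain ID differs between joins.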
Default switch configuration table 604 has an entry for the mappedID that points to the corresponding ST in VCS configuration database 600. Note that only VCS configuration database 600 is replicated and distributed to all switches in the VCS. Default switch configuration table 604 is local to a particular member switch.
The “IN_VCS” value in default switch configuration table 604 indicates whether the member switch is part of a VCS. A switch is considered to be “in a VCS” when it is assigned one of the FC switch domains by the FC switch fabric with two or more switch domains. If a switch is part of an FC switch fabric that has only one switch domain, i.e., its own switch domain, then the switch is considered to be “not in a VCS.”
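The IN_VCS rule above reduces to a check on the number of switch domains in the fabric. The helper name `is_in_vcs` and its arguments are hypothetical and serve only to illustrate the rule.

```python
def is_in_vcs(own_domain_id, fabric_domain_ids):
    """A switch is in a VCS when its fabric holds two or more switch domains."""
    domains = set(fabric_domain_ids)
    return own_domain_id in domains and len(domains) >= 2

print(is_in_vcs(1, [1]))        # fabric contains only the switch's own domain
print(is_in_vcs(1, [1, 2, 3]))  # fabric with several switch domains
```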
When a switch is first connected to a VCS, the logical FC switch fabric formation process allocates a new switch domain ID to the joining switch. In one embodiment, only the switches directly connected to the new switch participate in the VCS join operation.
Note that in the case where the global configuration database of a joining switch is current and in sync with the global configuration database of the VCS based on a comparison of the transaction IDs of the two databases (e.g., when a member switch is temporarily disconnected from the VCS and re-connected shortly afterward), a trivial merge is performed. That is, the joining switch can be connected to the VCS, and no change or update to the global VCS configuration database is required.
When a switch joins a VCS (see the description below), the VCS assigns an FC switch domain ID to the logical FC switch formed within the joining switch. The joining switch can be automatically assigned an IP address (which in one embodiment can be internal to the VCS) corresponding to the FC switch domain ID. For example, the joining switch's IP address can be 127.10.10.FC_domain_ID. In addition, each port on the joining switch can be automatically assigned a MAC address which is unique to that port, on that switch. For instance, a port on the joining switch can be assigned a MAC address of OUI:FC_domain_ID:0:0, where OUI is the Organizationally Unique Identifier for the port, and FC_domain_ID is the switch domain ID assigned to the logical FC switch. The IP and MAC address assignment process can be part of the software within the switch which controls the FC switch formation and the join/merge process.
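The address derivation in the example above can be sketched as follows. The function names are hypothetical, the OUI value is a placeholder, and the range check only ensures that the domain ID fits in a single octet. The sketch reproduces the OUI:FC_domain_ID:0:0 pattern exactly as given; how the trailing octets would distinguish individual ports is not specified in the description and is not modeled here.

```python
def switch_ip(fc_domain_id):
    """Derive the switch IP address 127.10.10.<FC_domain_ID>."""
    if not 0 <= fc_domain_id <= 255:
        raise ValueError("domain ID must fit in one octet")
    return f"127.10.10.{fc_domain_id}"

def port_mac(oui, fc_domain_id):
    """Build a MAC address of the form OUI:FC_domain_ID:0:0."""
    if len(oui) != 3:
        raise ValueError("an OUI is three octets long")
    if not 0 <= fc_domain_id <= 255:
        raise ValueError("domain ID must fit in one octet")
    octets = bytes(oui) + bytes([fc_domain_id, 0, 0])
    return ":".join(f"{b:02x}" for b in octets)

PLACEHOLDER_OUI = bytes([0x00, 0x11, 0x22])  # hypothetical OUI for the example
print(switch_ip(5))
print(port_mac(PLACEHOLDER_OUI, 5))
```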
The automatically assigned switch IP address and port MAC address for a given VCS member switch allow any port on the switch to be remotely configured from anywhere within the VCS. For example, a configuration command of a given member switch can be issued from a host connected to any member switch in the VCS and identifies the switch to be configured by its IP address and optionally one or more port MAC addresses. Such configuration command might include information on VLAN configuration, quality-of-service (QoS) configuration, and/or access-control configuration, each of which can be port-specific. In one embodiment, the change to a switch's configuration is tentatively transmitted to the switch. After the switch confirms and validates the change, a commit-change command is transmitted to all the member switches in the VCS, so the global configuration database can be updated throughout the VCS. In a further embodiment, the change is tentatively transmitted to all the member switches in the VCS, and the commit-change command is only sent out after all the switches confirm and validate the tentative change.
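The first embodiment above, in which the target switch validates a tentative change before a commit-change command reaches every member, can be sketched as below. The `MemberSwitch` class, the VLAN-based validation rule, and the `apply_change` helper are assumptions made for illustration; an actual switch would validate against its own policy.

```python
class MemberSwitch:
    def __init__(self, name, allowed_vlans):
        self.name = name
        self.allowed_vlans = set(allowed_vlans)
        self.committed = {}  # this switch's copy of the committed changes

    def validate(self, change):
        # Hypothetical check: the target accepts only VLANs it allows.
        return change["vlan"] in self.allowed_vlans

    def commit(self, change):
        self.committed[change["port"]] = change["vlan"]

def apply_change(target, members, change):
    """Send the change tentatively to the target, then commit everywhere."""
    if not target.validate(change):
        return False  # the tentative change is dropped, nothing is committed
    for member in members:
        member.commit(change)
    return True

sw_a = MemberSwitch("A", [10, 20])
sw_b = MemberSwitch("B", [10])
accepted = apply_change(sw_a, [sw_a, sw_b], {"port": "p1", "vlan": 20})
rejected = apply_change(sw_a, [sw_a, sw_b], {"port": "p2", "vlan": 30})
print(accepted, rejected, sw_b.committed)
```

The further embodiment would differ only in the validation step, which would require every member switch, and not just the target, to confirm the tentative change before the commit.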
When a switch joins the VCS via a link, both neighbors on each end of the link present to the other switch a VCS four-tuple of <Prior VCS_ID, SWITCH_MAC, mappedID, IN_VCS> from a prior incarnation, if any. Otherwise, the switch presents to the counterpart a default tuple. If the VCS_ID value was not set from a prior join operation, a VCS_ID value of −1 is used. In addition, if a switch's IN_VCS flag is set to 0, it sends out its interface configuration to the neighboring switch. In the example in
After the above PRE-INVITE operation, a driver switch for the join process is selected. By default, if a switch's IN_VCS value is 1 and the other switch's IN_VCS value is 0, the switch with IN_VCS=1 is selected as the driver switch. If both switches have their IN_VCS values as 1, then nothing happens, i.e., the PRE-INVITE operation would not lead to an INVITE operation. If both switches have their IN_VCS values as 0, then one of the switches is elected to be the driver switch (for example, the switch with a lower FC switch domain ID value). The driver switch's IN_VCS value is then set to 1, and the driver switch drives the join process.
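The driver selection rules above can be sketched as a single function. The `Neighbor` record and the function name `select_driver` are hypothetical. The lower-domain-ID election follows the example given in the text, and resolving equal domain IDs in favor of the first argument is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Neighbor:
    name: str
    in_vcs: int
    domain_id: int

def select_driver(a: Neighbor, b: Neighbor) -> Optional[Neighbor]:
    """Return the driver switch for the join, or None if no INVITE follows."""
    if a.in_vcs == 1 and b.in_vcs == 1:
        return None  # both already in a VCS: nothing happens
    if a.in_vcs == 1:
        return a
    if b.in_vcs == 1:
        return b
    # Both switches have IN_VCS = 0: elect the lower FC switch domain ID.
    driver = a if a.domain_id <= b.domain_id else b
    driver.in_vcs = 1
    return driver

member = Neighbor("member", 1, 3)
joiner = Neighbor("joiner", 0, 1)
print(select_driver(member, joiner).name)
```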
After switch 702 is selected as the driver switch, switch 702 then attempts to reserve a slot in the VCS configuration database corresponding to the mappedID value in switch 704's PRE-INVITE information. Next, switch 702 searches the VCS configuration database for switch 704's MAC address in any mappedID slot. If such a slot is found, switch 702 copies all information from the identified slot into the reserved slot. Otherwise, switch 702 copies the information received during the PRE-INVITE from switch 704 into the VCS configuration database. The updated VCS configuration database is then propagated to all the switches in the VCS as a prepare operation in the database (note that the update is not committed to the database yet).
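The slot reservation and copy step performed by the driver switch can be sketched as a function that returns a tentative (prepared but uncommitted) copy of the database. The dictionary layout, the function name `prepare_join`, and the handling of a slot already held by a different switch (the reservation fails and `None` is returned) are assumptions made for illustration.

```python
import copy

def prepare_join(database, pre_invite):
    """Build a tentative database with a slot reserved for the joining switch.

    database maps mappedID -> {"mac": ..., "config": ...}.
    pre_invite carries the joining switch's "mapped_id", "mac", and "config".
    """
    slot_id = pre_invite["mapped_id"]
    mac = pre_invite["mac"]
    occupant = database.get(slot_id)
    if occupant is not None and occupant["mac"] != mac:
        return None  # assumption: the reservation attempt fails

    tentative = copy.deepcopy(database)
    # Look for the joining switch's MAC address in any mappedID slot.
    match = next((rec for rec in database.values() if rec["mac"] == mac), None)
    if match is not None:
        tentative[slot_id] = copy.deepcopy(match)
    else:
        tentative[slot_id] = {"mac": mac, "config": copy.deepcopy(pre_invite["config"])}
    return tentative

db = {0: {"mac": "aa", "config": {"vlans": [10]}}}
prepared = prepare_join(db, {"mapped_id": 1, "mac": "bb", "config": {"vlans": [20]}})
print(sorted(prepared), 1 in db)
```

The original database is left untouched, which mirrors the note that the update is propagated as a prepare operation and is not yet committed.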
Subsequently, the prepare operation may or may not result in configuration conflicts, which may be flagged as warnings or fatal errors. Such conflicts can include inconsistencies between the joining switch's local configuration or policy setting and the VCS configuration. For example, a conflict arises when the joining switch is manually configured to allow packets with a particular VLAN value to pass through, whereas the VCS does not allow this VLAN value to enter the switch fabric from this particular RBridge (for example, when this VLAN value is reserved for other purposes). In one embodiment, the prepare operation is handled locally and/or remotely in concert with other VCS member switches. If there is an un-resolvable conflict, switch 702 sends out a PRE-INVITE-FAILED message to switch 704. Otherwise, switch 702 generates an INVITE message with the VCS's merged view of the switch (i.e., the updated VCS configuration database).
Upon receiving the INVITE message, switch 704 either accepts or rejects the INVITE. The INVITE can be rejected if the configuration in the INVITE is in conflict with what switch 704 can accept. If the INVITE is acceptable, switch 704 sends back an INVITE-ACCEPT message in response. The INVITE-ACCEPT message then triggers a final database commit throughout all member switches in the VCS. In other words, the updated VCS configuration database is committed, replicated, and distributed to all the switches in the VCS.
If more than one switch in a VCS has connectivity to the new joining switch, all these neighboring member switches may send PRE-INVITE to the new joining switch. The joining switch can send out only one PRE-INVITE to a randomly selected neighboring member switch to complete the join process. Various use cases of the join process are described below. In the following description, a “joining switch” refers to a switch attempting to join a VCS. A “neighboring VCS member switch” or “neighboring member switch” refers to a VCS member switch to which the joining switch is connected.
VCS Pre-Provisioned to Accept a Switch.
A VCS is pre-configured (e.g., the global configuration database) with the MAC address of a joining switch with an optionally pre-allocated mapped ID for the joining switch. The joining switch may be allowed to carry any value in the VCS_ID field of its existing configuration. The neighboring VCS member switch can assign an FC switch domain ID and the proper VCS_ID to the joining switch in the INVITE message. In one embodiment, the joining switch may be pre-provisioned to join an existing VCS (e.g., with the parameters in the default switch configuration table, such as mappedID, VCS_ID, and IN_VCS, populated with values corresponding to the VCS). If the pre-provisioned parameters do not guarantee a slot with the same mappedID in the global configuration database when the switch joins the VCS, the switch can revert to the default joining procedure described below.
Default Switch Joins a VCS.
A default switch is one that has no records of any previous joining with a VCS. A switch can become a default switch if it is forced into a factory default state. A joining default switch can present its initial configuration information (for example, its interface configuration details) to a neighboring VCS member switch. In one embodiment, a slot in the VCS configuration database is selected based on a monotonically incrementing number, which is used as the mapped ID for the joining switch. The corresponding FC switch domain ID allocated to the joining switch and the joining switch's MAC address are updated accordingly in this slot. The neighboring VCS member switch then initiates a prepare transaction, which propagates to all VCS member switches and requires an explicit validation of the joining switch's configuration information from each VCS member switch. If the prepare transaction fails, a PRE-INVITE-FAILED message is sent to the joining switch and the joining process is aborted.
The neighboring VCS member switch then tentatively updates the reserved slot in the global configuration database with the allocated FC switch domain ID and the joining switch's MAC address (operation 726). Next, the neighboring VCS member switch transmits the joining switch's tentative configuration to all member switches in the VCS (operation 728), and determines whether the joining switch's configuration information is confirmed and validated by all VCS member switches (operation 730). If the joining switch's configuration is confirmed, the neighboring member switch then commits the changes to the global configuration database and completes the join process (operation 732). Otherwise, the join process is aborted and the tentative changes to the global configuration database are discarded (operation 734).
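The tentative-update, validate, and commit-or-abort sequence above can be sketched as follows. This is a minimal illustration and not the disclosed implementation: the names `Slot`, `GlobalConfigDB`, and `join_default_switch` are hypothetical, each member switch's validation is modeled as a plain callback, and the starting mapped ID of 1 is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Slot:
    """One entry of the global configuration database (hypothetical layout)."""
    mapped_id: int
    domain_id: int
    mac: str


@dataclass
class GlobalConfigDB:
    slots: Dict[int, Slot] = field(default_factory=dict)
    next_mapped_id: int = 1  # assumed starting value; only monotonicity matters


def join_default_switch(
    db: GlobalConfigDB,
    mac: str,
    domain_id: int,
    validators: List[Callable[[Slot], bool]],
) -> Optional[int]:
    """Return the committed mapped ID, or None if the join is aborted."""
    # Select a slot using a monotonically incrementing number.
    mapped_id = db.next_mapped_id
    db.next_mapped_id += 1

    # Tentative update: held outside the committed table until validated.
    tentative = Slot(mapped_id, domain_id, mac)

    # Every member switch must confirm the tentative configuration.
    if all(validate(tentative) for validate in validators):
        db.slots[mapped_id] = tentative  # commit
        return mapped_id
    return None  # abort: the tentative change is discarded
```

In this sketch a single rejecting member switch aborts the join and leaves the committed slots untouched, mirroring operations 730 through 734.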
Switch Re-Joins a Previously Joined VCS.
If for some reason a switch is joining a VCS to which the switch previously belonged (for example, due to a link failure), the FC switch domain ID that is re-allocated to the joining switch will most likely be the same. When such a switch joins the VCS, the neighboring VCS member switch first checks whether the joining switch's VCS_ID is the same as the existing VCS_ID on the member switch. If the two VCS_ID values are the same, the neighboring member switch attempts to find a slot in the global configuration database with the same mappedID value which was received from the joining switch during the tuple-exchange process. If such a slot in the global database is available, the slot is reserved for the joining switch. In addition, the global configuration database is searched for a match to the joining switch's MAC address. If a match is found in another slot, the configuration information from that slot is copied to the reserved slot. Subsequently, the join process continues as described in
Subsequently, the neighboring member switch determines whether the global configuration database contains a slot with the same MAC address as the joining switch (operation 748). If there is such a slot, which means that the global configuration database contains a slot which has been used previously for the same joining switch's configuration information, such information is copied from the identified slot to the reserved slot (operation 750). Otherwise, the neighboring member switch proceeds to complete the join process as illustrated in
Switch Joins Another VCS.
This use case occurs when a switch is disconnected from one VCS and then connected to a different VCS without being reset to the default state. This scenario can also occur when a switch is connected to a VCS while it is participating in another VCS. In such cases, there will be a VCS_ID mismatch in the join process. In addition, the IN_VCS field in the joining switch's configuration table might or might not be set. If the IN_VCS field is not set, which means that the joining switch is not currently participating in a VCS, the join process can assign the switch a new VCS_ID corresponding to the VCS the switch is joining. In one embodiment, if the IN_VCS field is set in the joining switch's configuration, which means that the joining switch is currently participating in a different VCS, the join process is disallowed. Optionally, the joining switch can complete the joining process after being set to the default state.
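The decision described above depends on only two inputs: whether the VCS_ID values match, and whether the joining switch's IN_VCS field is set. A minimal sketch, in which the enum `JoinDecision` and the function `decide_join` are hypothetical names introduced for illustration:

```python
from enum import Enum


class JoinDecision(Enum):
    PROCEED = "proceed"                      # VCS_ID values already match
    ASSIGN_NEW_VCS_ID = "assign_new_vcs_id"  # mismatch, switch not in a VCS
    DISALLOW = "disallow"                    # mismatch, switch in another VCS


def decide_join(joining_vcs_id: int, member_vcs_id: int, in_vcs: bool) -> JoinDecision:
    """Classify a join attempt from the joining switch's VCS_ID and IN_VCS."""
    if joining_vcs_id == member_vcs_id:
        return JoinDecision.PROCEED
    if not in_vcs:
        # The switch is not currently participating in any VCS, so it can be
        # given the VCS_ID of the VCS it is joining.
        return JoinDecision.ASSIGN_NEW_VCS_ID
    # The switch is still participating in a different VCS.
    return JoinDecision.DISALLOW
```

A disallowed switch can still join later once it has been set to the default state, at which point the default joining procedure applies.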
Initial Joining of Two Switches which are Both not in a VCS.
When two switches are connected together and neither of them is in a VCS, an election process can be used to let one of them be the driving switch in the VCS formation process. In one embodiment, the switch with the lower FC switch domain ID has its IN_VCS field set to “1” and drives the join process.
Joining of Two VCSs.
In one embodiment, two VCSs are allowed to merge together. Similar to the FC switch fabric formation process, the logical FC switches in both VCSs would select a new principal FC switch. This newly selected principal FC switch then re-assigns FC switch domain IDs to all the member switches. After the FC switch domain IDs are assigned, a “fabric up” message which is broadcast to all the member switches starts the VCS join process.
During the join process, the principal FC switch's IN_VCS field is set to “1,” whereas all other member switches' IN_VCS fields are set to “0.” Subsequently, each member switch can join the VCS (which initially only contains the switch with the principal FC switch) using the “switch joins another VCS” procedure described above.
Removal of a Switch from VCS.
When a switch is removed from a VCS, its neighboring member switch typically receives a “domain-unreachable” notification at its logical FC switch. Upon receiving this notification, the neighboring member switch disables this switch from the global VCS configuration database and propagates this change to all other member switches. Optionally, the neighboring member switch does not clear the slot previously used by the removed switch in the global configuration database. This way, if the departure of the switch is only temporary, the same slot in the configuration database can still be used when the switch re-joins the VCS.
If the VCS is temporarily disjoint due to a link failure, the logical FC infrastructure in the member switches can detect the disconnection of the switch(es) and issue a number of “domain-unreachable” notifications. When the disjoint switch is reconnected to the VCS, a comparison between the switch's configuration information and the corresponding slot information in the global VCS configuration database allows the switch to be added to the VCS using the same slot (i.e., the slot with the same mappedID) in the global configuration database.
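The removal and re-join behavior can be illustrated with a small sketch in which a slot is disabled but retained on a “domain-unreachable” notification and is re-enabled when the same switch returns. The `Slot` layout and both function names are hypothetical, and the comparison on reconnection is reduced to a MAC address check for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Slot:
    mapped_id: int
    mac: str
    enabled: bool = True


def on_domain_unreachable(db: Dict[int, Slot], mapped_id: int) -> None:
    """Disable the departed switch but keep its slot for a possible re-join."""
    slot = db.get(mapped_id)
    if slot is not None:
        slot.enabled = False


def on_reconnect(db: Dict[int, Slot], mapped_id: int, mac: str) -> Optional[Slot]:
    """Re-use the slot with the same mappedID if the stored information matches."""
    slot = db.get(mapped_id)
    if slot is not None and slot.mac == mac:
        slot.enabled = True
        return slot
    return None  # no matching slot; another join procedure would apply
```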
General Operation.
If the system determines that it is already part of a VCS (i.e., its IN_VCS=1) (operation 764), the system then further determines whether there is an existing slot in the global configuration database with the same mappedID as the joining switch (operation 774). If such a slot exists, the system then sends the INVITE to the joining switch (operation 775) and determines whether there is any un-resolved conflict between the configuration information stored in this slot and the information provided by the joining switch (operation 780). If so, the system revokes the INVITE (operation 782). Otherwise, the system updates the global configuration database with the joining switch's configuration information and propagates the update to all other member switches (operation 784).
If there is no slot in the global configuration database with the same mappedID as the joining switch (operation 774), the system allocates an interim slot in the global configuration database (operation 776), and sends an INVITE to the joining switch (operation 778). After receiving an INVITE acceptance from the joining switch (operation 779), the system then updates the global configuration database (operation 784) and completes the join process.
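The two branches of operation 774 can be sketched as a single function. This is an illustrative sketch only: `handle_joining_switch` and `send_invite` are hypothetical names, a "conflict" is assumed to mean that a key present in both the stored slot and the offered configuration carries different values, and releasing the interim slot when the INVITE is not accepted is an assumption, since that case is not spelled out above.

```python
from typing import Callable, Dict

Config = Dict[str, str]


def handle_joining_switch(
    db: Dict[int, Config],
    mapped_id: int,
    offered: Config,
    send_invite: Callable[[int], bool],
) -> str:
    """Member-switch side of the join; send_invite returns True on acceptance."""
    if mapped_id in db:
        # A slot with the same mappedID exists: invite, then check conflicts.
        send_invite(mapped_id)
        stored = db[mapped_id]
        conflict = any(k in offered and offered[k] != v for k, v in stored.items())
        if conflict:
            return "invite revoked"
        db[mapped_id] = {**stored, **offered}  # update the global database
        return "joined"

    # No matching slot: allocate an interim slot, then invite.
    db[mapped_id] = {}
    if send_invite(mapped_id):
        db[mapped_id] = dict(offered)
        return "joined"
    del db[mapped_id]  # assumption: an unaccepted INVITE frees the interim slot
    return "invite not accepted"
```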
In one embodiment, each VCS switch unit performs source MAC address learning, similar to what an Ethernet bridge does. Each {MAC address, VLAN} tuple learned on a physical port on a VCS switch unit is registered into the local Fibre Channel Name Server (FC-NS) via a logical Nx_Port interface corresponding to that physical port. This registration binds the address learned to the specific interface identified by the Nx_Port. Each FC-NS instance on each VCS switch unit coordinates and distributes all locally learned {MAC address, VLAN} tuples with every other FC-NS instance in the fabric. This feature allows the dissemination of locally learned {MAC addresses, VLAN} information to every switch in the VCS. In one embodiment, the learned MAC addresses are aged locally by individual switches.
If the FC-NS returns a valid result, the switch forwards the frame to the identified N_Port or NL_Port (operation 808). Otherwise, the switch floods the frame on the TRILL multicast tree as well as on all the N_Ports and NL_Ports that participate in that VLAN (operation 810). This flood/broadcast operation is similar to the broadcast process in a conventional TRILL RBridge, wherein all the physical switches in the VCS will receive and process this frame, and learn the source address corresponding to the ingress RBridge. In addition, each receiving switch floods the frame to its local ports that participate in the frame's VLAN (operation 812). Note that the above operations are based on the presumption that there is a one-to-one mapping between a switch's TRILL identifier (or nickname) and its FC switch domain ID. There is also a one-to-one mapping between a physical Ethernet port on a switch and the corresponding logical FC port.
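The lookup-then-flood decision can be sketched as below. The name server is modeled as a plain dictionary keyed by {MAC address, VLAN}, and the function name `forward` is hypothetical. Excluding the ingress port from the flood set is an assumption based on ordinary Ethernet bridging and is not stated above.

```python
from typing import Dict, List, Set, Tuple

# (destination MAC, VLAN) -> local port identifier
NameServer = Dict[Tuple[str, int], str]


def forward(
    ns: NameServer,
    vlan_ports: Dict[int, Set[str]],
    dst_mac: str,
    vlan: int,
    ingress_port: str,
) -> Tuple[List[str], bool]:
    """Return (local egress ports, whether to flood on the TRILL multicast tree)."""
    port = ns.get((dst_mac, vlan))
    if port is not None:
        # Valid name-server result: forward to the identified port only.
        return [port], False
    # Unknown destination: flood on the multicast tree and on every local
    # port that participates in the frame's VLAN.
    local = sorted(vlan_ports.get(vlan, set()) - {ingress_port})
    return local, True
```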
Upon receiving frame 935, switch 938 determines that it is the destination RBridge based on frame 935's TRILL header. Correspondingly, switch 938 strips frame 935 of its outer Ethernet header and TRILL header, and inspects the destination MAC address of its inner Ethernet header. Switch 938 then performs a MAC address lookup and determines the correct output port leading to host 940. Subsequently, the original Ethernet frame 933 is transmitted to host 940.
As described above, the logical FC switches within the physical VCS member switches may send control frames to one another (for example, to update the VCS global configuration database or to notify other switches of the learned MAC addresses). In one embodiment, such control frames can be FC control frames encapsulated in a TRILL header and an outer Ethernet header. For example, if the logical FC switch in switch 944 is in communication with the logical FC switch in switch 938, switch 944 can send a TRILL-encapsulated FC control frame 942 to switch 946. Switch 946 can forward frame 942 just like a regular data frame, since switch 946 is not concerned with the payload in frame 942.
Today's server virtualization infrastructure (e.g. a Hypervisor, also called virtual machine monitor) associates a server side (e.g. Hypervisor or adapter) Virtual Ethernet Bridge (VEB) port profile to each Ethernet MAC address used by a virtual machine (VM) to access the network through a VEB port. Examples of the VEB's port profile attributes include: the types of frames allowed on the port (e.g. all frames, only frames tagged with certain VLAN values, or untagged frames), the VLAN identifiers that are allowed to be used, and rate limiting attributes (e.g. port or access-control based rate limits). In today's server virtualization infrastructure, if the VM migrates from one physical server to another, the VEB's port profile migrates with it. In other words, today's server virtualization infrastructure provides automated port profile migration of the server's VEB port(s) that are associated with a VM.
However, in existing technologies, there remains a gap between the access and Quality of Service (QoS) controls supported in external layer 2 switches and server virtualization infrastructure. That is, external layer 2 switches have more advanced controls compared to server VEB implementations. Although server virtualization infrastructure is continually adding these controls, this gap is expected to remain. Some environments prefer the more advanced controls provided by external network switches. An example of such an environment is a multi-tier data center that has several types of applications, each with differing advanced network controls, running over the same layer-2 network. In this type of environment the network administrator often prefers the use of advanced access controls available in external switches.
Today's layer-2 networks do not provide a mechanism for automatically migrating switch access and traffic controls associated with an end-point device (e.g., a VM), when that device migrates from one switch to another. The migration may be physical, such as an Operating System image (application, middleware, Operating System and associated state) that is running on one physical system and is migrated to another system. The migration may also be virtual, such as an Operating System image (OS image) that is running over a Hypervisor on one system and is migrated to run over a Hypervisor on another system.
Embodiments of the present invention provide a mechanism for automatically migrating port profiles resident in a switch and associated with an OS image to a port on a second switch, when that OS image migrates from one physical end-host system to another end-host system, which is attached to the second switch.
The switches within VCS 1000 which are coupled externally to the physical end-host systems form a logical VCS access layer (VAL) 1010. The automatic migration of port profiles (AMPP) is implemented in VAL 1010. During operation, various port profiles, which are often tailored to different requirements of the VMs, are created and distributed to all the member switches in VCS 1000. As described in detail below, when the packets generated by a VM are detected by an ingress member switch of VCS 1000, the VM's source MAC address is recognized and used to identify the corresponding port profile, which is then applied to the appropriate ingress switch port. When a VM moves from one physical server to another, the MAC-address detection mechanism can quickly identify the new physical switch port to which the VM is coupled, and apply the same port profile to the new port.
During operation, one or more port profiles can be created to specify a number of requirements/restrictions/limitations that should be enforced at a VCS switch port corresponding to one or more VMs. For example, a port profile for VM 1102 (which can be identified by the virtual MAC address of VP 1110) can be created and distributed to every member switch of VCS 1100. When VM 1102 sends its first Ethernet frame to the network, switch 1120 would learn this source MAC address. Upon learning VP 1110's MAC address, switch 1120 then searches its port profile database and identifies the matching port profile. Subsequently, the identified port profile is applied to the port on switch 1120 which is coupled to system 1116. In addition, the same port profile is applied to the port where the matching MAC address is the destination MAC address of a frame. This way, the same network parameters are enforced at both ingress and egress ports of the VCS. Note that the port profile might include “soft” parameters. In other words, the requirements and limitations in the port profile may be specific to certain MAC addresses, and may not be “hard” limitations on the physical parameters of the switch port, since traffic from/to multiple VMs is handled by the same physical switch port.
In one embodiment, VCS 1100 provides a mechanism that distributes all the port profiles and the port-profile-to-MAC mapping information to all the member switches. The port profiles can be created using a command line interface (CLI) or other network management software. In addition, upon migration of a VM (such as a VMware VMotion), the target switch port in the VCS can automatically activate the correct port profile configuration.
In one embodiment, a port profile can operate as a self-contained configuration container. In other words, if a port profile is applied to a new switch without any additional configuration, the port profile should be sufficient to set the switch's global and local (interface level) configuration and allow the switch to start carrying traffic.
A VLAN configuration profile within a port profile can define:
A QoS configuration profile within a port profile can define:
An FCoE configuration profile within a port profile defines the attributes needed for the port to support FCoE, which can include:
A security configuration profile within a port profile defines the security rules needed for the server port. However, the security rules can be different at different ports, so some of the locally configured ACLs can be allowed to override conflicting rules from a port profile. A typical security profile can contain the following attributes:
In one embodiment, each port profile can have one or more MAC addresses associated with it.
It is possible to group a set of VMs in the network by associating them with one port profile. This group can be used to dictate forwarding between the VMs.
Subsequently, the system distributes the port profile and the corresponding VM MAC address to every member switch throughout the VCS fabric (operation 1408). When a VM is initiated or migrated, the system then detects a matching virtual MAC address from the received ingress packets (operation 1410). Based on the learned MAC address, the system then activates the corresponding port profile on the switch port (operation 1412).
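The distribute, detect, and activate sequence can be sketched with a per-switch object that holds a replicated MAC-to-profile mapping. The class `AmppSwitch`, its method names, and the profile name used in the usage note are hypothetical; deactivation on the old port after a migration is not modeled because it is not described above.

```python
from typing import Dict, Optional, Set


class AmppSwitch:
    """A member switch holding a copy of the distributed MAC-to-profile map."""

    def __init__(self, mac_to_profile: Dict[str, str]) -> None:
        # Every member switch receives the same mapping (operation 1408).
        self.mac_to_profile = dict(mac_to_profile)
        # port identifier -> names of the port profiles active on that port
        self.active: Dict[str, Set[str]] = {}

    def learn_source_mac(self, port: str, mac: str) -> Optional[str]:
        """Activate the profile matching a learned source MAC (operations 1410-1412)."""
        profile = self.mac_to_profile.get(mac)
        if profile is not None:
            self.active.setdefault(port, set()).add(profile)
        return profile
```

Because two switches built from the same mapping resolve a given MAC address to the same profile, a VM whose first frame appears on a new switch after migration gets, for example, the profile `"web-tier"` activated on its new port without manual configuration.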
Exemplary VCS Member Switch with AMPP
During operation, packet processor 1502 extracts the source and destination MAC addresses of incoming frames, and attaches proper Ethernet or TRILL headers to outgoing frames. Virtual FC switch management module 1504 maintains the state of logical FC switch 1505, which is used to join other VCS switches using the FC switch fabric protocols. VCS configuration database 1506 maintains the configuration state of every switch within the VCS. TRILL header generation module 1508 is responsible for generating proper TRILL headers for frames that are to be transmitted to other VCS member switches.
Upon learning a new MAC address from a VM, AMPP management module 1507 identifies the port profile corresponding to the learned MAC, and applies the identified port profile. This dynamic port profile application process can respond to VM migration within a very short period of time, thereby facilitating automatic port configuration in a virtual computing environment.
Today's server virtualization infrastructure (e.g. a Hypervisor, also called virtual machine monitor) typically provides one or more virtual switches (also called virtual Ethernet bridges, VEBs) within a physical server. Each virtual switch serves a number of virtual machines. When a number of such servers connect to a VCS, the number of communication sessions among the virtual machines can be quite large. In such a network environment, when a network link or port fails, the failure would typically disrupt the reachability to one or more virtual machines. This disruption can affect the communication sessions of some of the virtual machines. In conventional networks, such reachability disruption only triggers a topology change and/or MAC address learning update in the network, and the source virtual machines are not notified about these updates. Correspondingly, with conventional technologies, there is no way for a Hypervisor to re-configure the connectivity of the virtual machines absent some signaling from the network via proprietary protocols.
Embodiments of the present invention facilitate advanced link tracking by monitoring any reachability disruption in the network and notifying the affected hypervisor. In response, the hypervisor can re-configure the connectivity of the virtual machines under its control to bypass the failed link or port. In one embodiment, this advanced link tracking function can be carried out in a logical VCS access layer.
Also coupled to VCS 1600 is physical server 1617, which has a configuration similar to that of server 1618. Server 1617 includes four VMs, 1632, 1634, 1636, and 1638. These four VMs are each dual-homed with virtual switches 1642 and 1644, which are provided by hypervisor 1641. Virtual switch 1642 is coupled to VCS member switch 1620 via a NIC 1643, and virtual switch 1644 is coupled to VCS member switch 1626 via a NIC 1645. During normal operation, VMs 1632 and 1634 communicate with VCS 1600 via virtual switch 1642 and NIC 1643, and VMs 1636 and 1638 communicate with VCS 1600 via virtual switch 1644 and NIC 1645.
Assume that VMs 1622 and 1624 are in communication with VMs 1636 and 1638. Since VMs 1636 and 1638 are configured by hypervisor 1641 to use virtual switch 1644 and NIC 1645, the traffic between VMs 1622 and 1624 and VMs 1636 and 1638 is normally carried by VCS member switch 1626. Now, assume the link between switches 1620 and 1626 fails. As a result, VMs 1636 and 1638 can no longer be reached via NIC 1645. In embodiments of the present invention, this reachability update information is not only reflected in the VCS topology update (which is handled by the routing protocol within VCS 1600), but also communicated to hypervisor 1601 via NIC 1603. This update can allow hypervisor 1601 to quickly re-configure VMs 1622 and 1624, so that these two VMs use virtual switch 1604 and NIC 1605 to access VCS 1600. This way, the traffic from VMs 1622 and 1624 can still reach VMs 1636 and 1638 via switch 1624, switch 1620, NIC 1643, and virtual switch 1642. The new data path bypasses the failed link between switches 1620 and 1626. This re-configuration can take place shortly after the link failure is detected, thereby facilitating fast recovery at the source VMs.
Exemplary VCS Member Switch with Advanced Link Tracking
During operation, packet processor 1902 extracts the source and destination MAC addresses of incoming frames, and attaches proper Ethernet or TRILL headers to outgoing frames. Virtual FC switch management module 1904 maintains the state of logical FC switch 1905, which is used to join other VCS switches using the FC switch fabric protocols. VCS configuration database 1906 maintains the configuration state of every switch within the VCS. TRILL header generation module 1908 is responsible for generating proper TRILL headers for frames that are to be transmitted to other VCS member switches.
Upon learning about disrupted reachability in the VCS, advanced link tracking module 1907 identifies the port(s) which are affected by the disruption, and notifies the hypervisor of the disruption. This notification can allow the hypervisor to expedite the re-configuration of the affected VMs and minimize service disruption. Furthermore, advanced link tracking module 1907 also monitors the health of all the links corresponding to ports 1901. Upon detection of any link or port failure, advanced link tracking module 1907 can notify other switches in the VCS of the link state change and any reachability disruption.
Today's server virtualization infrastructure (e.g. a Hypervisor, also called virtual machine monitor) typically provides one or more virtual switches (also called virtual Ethernet bridges, VEBs) within a physical server. Each virtual switch serves a number of virtual machines. When a number of such servers connect to a VCS, the number of communication sessions among the virtual machines can be quite large. Often, different virtual machines belong to different customers, and different customers usually require dedicated and isolated network resources for security and quality of service purposes.
In one embodiment, each customer's virtual machines are placed in the same group based on the MAC address and VLAN tag of their associated virtual ports. The grouping information is replicated and stored at every VCS member switch. During operation, when a frame arrives at a VCS member switch, the member switch can enforce the grouping policy by looking up the group ID corresponding to the frame's source MAC address and VLAN tag. If the frame's destination MAC address combined with the VLAN tag belongs to the same group, then the frame is forwarded to the proper next-hop switch. Otherwise, the frame is dropped. This configuration ensures that each customer's virtual machines only communicate among themselves, thereby creating virtual partitions within a VCS.
Also coupled to VCS 2000 is physical server 2017, which has a configuration similar to that of server 2018. Server 2017 includes four VMs, 2032, 2034, 2036, and 2038. These four VMs are each dual-homed with virtual switches 2042 and 2044, which are provided by hypervisor 2041. Virtual switch 2042 is coupled to VCS member switch 2020 via a NIC 2043, and virtual switch 2044 is coupled to VCS member switch 2026 via a NIC 2045. During normal operation, VMs 2032 and 2034 communicate with VCS 2000 via virtual switch 2042 and NIC 2043, and VMs 2036 and 2038 communicate with VCS 2000 via virtual switch 2044 and NIC 2045.
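The group lookup and forwarding check can be sketched as follows. The grouping table is modeled as a dictionary keyed by {MAC address, VLAN tag}, and `should_forward` is a hypothetical name. Dropping frames whose source or destination has no group entry is an assumption; the text above only covers the case where both lookups succeed.

```python
from typing import Dict, Tuple

# (MAC address, VLAN tag) -> group ID, replicated at every member switch
GroupTable = Dict[Tuple[str, int], int]


def should_forward(groups: GroupTable, src_mac: str, dst_mac: str, vlan: int) -> bool:
    """Forward only if source and destination belong to the same group."""
    src_group = groups.get((src_mac, vlan))
    dst_group = groups.get((dst_mac, vlan))
    # Assumption: an address with no group entry never matches.
    return src_group is not None and src_group == dst_group
```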
Assume that VMs 2024, 2028, 2032, and 2036 belong to the same customer (as indicated by the shading in each drawing element in
During operation, a VM administrator 2124 configures VMs 2108, 2112, 2124, and 2128 with the same VLAN tag corresponding to their common customer. Correspondingly, the MAC addresses of the virtual ports for these machines, together with the designated VLAN tag, are assigned the same group ID. VMs 2106, 2110, 2122, and 2126 and their respective VLAN tags are assigned a different group ID. This grouping information is then communicated to VCS 2100 and stored at every member switch.
When the frames are processed and forwarded within VCS 2100, frames with one group ID are prohibited from being forwarded to a MAC address-VLAN combination that corresponds to a different group ID. Thus, effectively, the virtual port grouping creates two virtual networks 2120 and 2121 within VCS 2100.
Exemplary VCS Member Switch with Virtual Port Grouping
During operation, packet processor 2302 extracts the source and destination MAC addresses of incoming frames, and attaches proper Ethernet or TRILL headers to outgoing frames. Virtual FC switch management module 2304 maintains the state of logical FC switch 2305, which is used to join other VCS switches using the FC switch fabric protocols. VCS configuration database 2306 maintains the configuration state of every switch within the VCS. TRILL header generation module 2308 is responsible for generating proper TRILL headers for frames that are to be transmitted to other VCS member switches. Virtual port grouping module 2307 is responsible for maintaining a copy of the grouping information and enforcing the grouping policies. As described in conjunction with
VCS allows an interconnected fabric of RBridges to function as a single logical switch. The VCS name services facilitate fast distribution of run-time network state changes, including newly learned MAC addresses (which is referred to as “Ethernet name service” or “Ethernet NS” in this disclosure) and multi-chassis trunk (MCT) port state updates (which is referred to as “MCT name service” or “MCT NS” in this disclosure). More details on MCT are provided in U.S. patent application Ser. No. 12/725,249, (attorney docket number BRCD-112-0439US), entitled “REDUNDANT HOST CONNECTION IN A ROUTED NETWORK,” by inventors Somesh Gupta, Anoop Ghanwani, Phanidhar Koganti, and Shunjia Yu, filed 16 Mar. 2010, the disclosure of which is incorporated by reference herein.
The Ethernet NS provides the ability to distribute various information across the VCS. The MAC information learned at one member switch is distributed to all other member switches, which facilitates fast MAC moves (for example, during migration of virtual machines) and global MAC learning. In some embodiments, layer-2 multicast information, which can be a multicast MAC address with corresponding switch/port identifiers and VLAN tag, can be distributed to facilitate efficient VCS-wide multicast. Optionally, Ethernet NS provides a distribution mechanism and does not maintain a central storage of the MAC-related knowledge base. In other words, the Ethernet NS knowledge database is replicated and stored distributively among all the VCS member switches.
Each member switch maintains a database of all the MAC addresses learned throughout the VCS. This database can be used to minimize the amount of flooding (a default behavior of an Ethernet switch when a frame's destination MAC address is not recognized). Ethernet NS also provides VCS-wide distribution of multicast MAC-to-RBridge/Port mapping information which can be obtained by Internet Group Management Protocol (IGMP) snooping. (Details about IGMP and IGMP snooping can be found at IETF RFC 3376 available at http://tools.ietf.org/html/rfc3376 and IETF RFC 4541 available at http://tools.ietf.org/html/rfc4541.) Ethernet NS distributes this information to all RBridges, thereby allowing the VCS to behave as a single switch. By tracking and forwarding IGMP join and leave information, the Ethernet NS can efficiently track the multicast MAC information and maintain an accurate layer-2 multicast group.
One of the requirements of presenting a VCS as a single switch is to support connection of trunked links from external hosts to different RBridges within the VCS fabric. Such trunking which involves connection to different RBridges is referred to as multi-chassis trunking (MCT). Conceptually, support within the VCS fabric for routing to an MCT destination is achieved by presenting each MCT group (i.e., each trunk) as a virtual RBridge. In some embodiments, the virtual RBridge is not assigned a domain ID and thus does not utilize FSPF for routing setup. Instead, a primary RBridge hosting the MCT distributes the virtual RBridge ID and the corresponding link state updates to the VCS fabric. The primary RBridge is responsible for learning a new MAC via an MCT and distributing the new MAC information to the VCS.
When an RBridge joins the VCS, it requests a dump of the local NS database from the remote RBridge. It does not respond to individual updates from the remote RBridge until the database dump has been received. After the database is in sync between two RBridges, individual changes are detected locally and pushed remotely. If a local database receives a domain-unreachable notification, it is responsible for removing all records for that remote domain and performing any local notification that this removal implies.
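The synchronization rules above (dump first, then incremental updates, and a full purge on a domain-unreachable notification) can be sketched as a small state holder. The class `NameServiceDB` and its method names are hypothetical, and "not responding" to early updates is modeled here as ignoring them, which is an interpretation rather than a stated behavior.

```python
from typing import Dict, Set


class NameServiceDB:
    """Local copy of NS records, organized per remote domain."""

    def __init__(self) -> None:
        self.records: Dict[int, Dict[str, str]] = {}  # domain -> {MAC: port}
        self.synced: Set[int] = set()                 # domains whose dump arrived

    def receive_dump(self, domain: int, records: Dict[str, str]) -> None:
        """Install the full database dump received from a remote RBridge."""
        self.records[domain] = dict(records)
        self.synced.add(domain)

    def receive_update(self, domain: int, mac: str, port: str) -> bool:
        """Apply an individual update; ignored until the dump has been received."""
        if domain not in self.synced:
            return False
        self.records[domain][mac] = port
        return True

    def domain_unreachable(self, domain: int) -> None:
        """Remove every record that belongs to the unreachable remote domain."""
        self.records.pop(domain, None)
        self.synced.discard(domain)
```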
Also shown in the example in
When one of the links (i.e., either the link between switch 2406 and end host 2412, or the link between switch 2408 and end host 2412) fails, as part of the MCT NS, in one embodiment, primary RBridge 2406 is responsible for updating the rest of the VCS 2400 that host 2412's MAC address is no longer associated with virtual RBridge 2410. Instead, the MAC address of host 2412 is now associated with the switch to which host 2412 remains connected. In a further embodiment, it can be the responsibility of the switch that remains connected to host 2412 to distribute the updated MAC address association to the rest of VCS 2400.
If the frame is determined to be from a regular end host (i.e., not a dual-homed host), the switch assembles an Ethernet NS update frame with the MAC address, the local physical switch identifier (as opposed to the virtual RBridge ID), the identifier of the port on which the frame is received, and the frame's VLAN tag (operation 2607). The switch then distributes the Ethernet NS update frame to all the member switches in the VCS (operation 2608).
If the MCT end host is still connected to the local switch, the switch then assembles an NS update frame with the end host's MAC address, the local switch's identifier (e.g., the physical RBridge ID of the local switch), the identifier of the port through which the end host is connected, and the proper VLAN tag (operation 2706). The switch then distributes the NS update frame to all member switches in the VCS (operation 2708).
During operation, packet processor 2802 extracts the source and destination MAC addresses of incoming frames, and attaches proper Ethernet or TRILL headers to outgoing frames. Virtual FC switch management module 2804 maintains the state of logical FC switch 2805, which is used to join other VCS switches using the FC switch fabric protocols. VCS configuration database 2806 maintains the configuration state of every switch within the VCS. TRILL header generation module 2808 is responsible for generating proper TRILL headers for frames that are to be transmitted to other VCS member switches. Based on the extracted MAC addresses of incoming frames, NS management module 2807 distributes the NS update frames to the rest of the VCS. NS management module 2807 also maintains a copy of NS database 2809. NS database 2809 stores all the learned MAC address information from every member switch in the VCS.
Because VCS uses the Fabric Shortest Path First (FSPF) routing protocol and the TRILL transport, embodiments of the present invention can facilitate equal-cost or nearly-equal-cost multi-pathing for load balancing purposes.
In one embodiment, the traffic splitting can be based on any field in a frame header. For example, the traffic can be split based on VLAN tags. The traffic can also be split based on layer-4 port numbers or application types.
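One common way to split traffic on header fields is to hash the chosen fields and use the result to select among the equal-cost paths. The following sketch assumes that approach; the function name `pick_path`, the field names, and the use of CRC-32 are illustrative assumptions rather than details of the disclosed embodiments.

```python
import zlib


def pick_path(paths, header, fields=("vlan",)):
    """Map a frame to one of several equal-cost paths by hashing header fields.

    paths  : non-empty list of candidate paths (any objects)
    header : dict of parsed header fields, e.g. {"vlan": 10, "l4_dport": 80}
    fields : which header fields take part in the split
    """
    if not paths:
        raise ValueError("no candidate paths")
    key = "|".join(str(header[f]) for f in fields).encode()
    # A deterministic hash keeps every frame of a flow on the same path.
    return paths[zlib.crc32(key) % len(paths)]
```

Splitting on `("vlan",)` keeps all frames of a VLAN on one path, while adding a layer-4 port to `fields` spreads flows within a VLAN across paths.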
In some embodiments of the present invention, the VCS can achieve loss-less packet transport by implementing a number of flow control mechanisms. In one embodiment, the VCS member switches implement a priority-based flow control (PFC)-like mechanism to ensure a loss-less packet transport between neighboring switches. With priority-based flow control, traffic class is identified by the VLAN tag priority values. Priority-based flow control is intended to eliminate frame loss due to congestion. This is achieved by a mechanism similar to the IEEE 802.3x PAUSE, but operating on individual priorities. This mechanism enables support for higher layer protocols that are highly loss sensitive while not affecting the operation of traditional LAN protocols utilizing other priorities. Details of PFC can be found in the IEEE 802.1Qbb standard, available at http://www.ieee802.org/1/pages/802.1bb.html, which is incorporated by reference herein.
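The per-priority pause behavior can be sketched as follows. This is a simplified model, not the IEEE 802.1Qbb state machine: the class name `PfcQueueMonitor` and the queue-depth thresholds `xoff` and `xon` are assumptions introduced for illustration.

```python
class PfcQueueMonitor:
    """Receiver-side view of one link with one ingress queue per priority (0-7)."""

    def __init__(self, xoff=8, xon=4):
        self.xoff, self.xon = xoff, xon  # hypothetical depth thresholds
        self.depth = [0] * 8
        self.paused = [False] * 8

    def enqueue(self, priority):
        """Account for an arriving frame; ask the neighbor to pause this
        priority once its queue reaches the xoff threshold."""
        self.depth[priority] += 1
        if not self.paused[priority] and self.depth[priority] >= self.xoff:
            self.paused[priority] = True
            return ("PAUSE", priority)
        return None

    def dequeue(self, priority):
        """Account for a drained frame; let the neighbor resume this
        priority once its queue falls to the xon threshold."""
        if self.depth[priority] == 0:
            return None
        self.depth[priority] -= 1
        if self.paused[priority] and self.depth[priority] <= self.xon:
            self.paused[priority] = False
            return ("RESUME", priority)
        return None
```

Because the state is kept per priority, congestion in one traffic class pauses only that class, and frames at other priorities continue to flow.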
In further embodiments, the VCS member switches also employ enhanced transmission selection (ETS) mechanisms, which support allocation of bandwidth amongst different traffic classes. When the offered load in a traffic class does not use its allocated bandwidth, enhanced transmission selection will allow other traffic classes to use the available bandwidth. The bandwidth-allocation priorities can coexist with strict priorities. The VCS can prioritize traffic to provide different service characteristics to traffic classes. It is possible to share bandwidth between priorities carrying bursty loads rather than servicing them with strict priority while allowing strict priority for time-sensitive and management traffic requiring minimum latency. Also, when traffic at a given priority level does not use its allocation, it is possible to allow other priorities to use that bandwidth. Details of ETS can be found in the IEEE 802.1Qaz standard, available at http://www.ieee802.org/1/pages/802.1az.html, which is incorporated by reference herein.
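The bandwidth-sharing behavior described above can be approximated with a weighted water-filling calculation. The sketch below is an illustration under stated assumptions: the function name `ets_allocate` is hypothetical, shares are positive weights, and strict-priority classes are not modeled.

```python
def ets_allocate(link_bw, shares, offered):
    """Divide link_bw among traffic classes.

    shares  : class -> positive weight (e.g. percent of the link)
    offered : class -> offered load, in the same unit as link_bw
    Bandwidth a class does not use is redistributed to the classes that
    still have demand, in proportion to their shares.
    """
    alloc = {c: 0.0 for c in shares}
    active = {c for c in shares if offered.get(c, 0) > 0}
    remaining = float(link_bw)
    while active and remaining > 1e-9:
        total = sum(shares[c] for c in active)
        satisfied, spent = set(), 0.0
        for c in active:
            fair = remaining * shares[c] / total
            give = min(fair, offered[c] - alloc[c])
            alloc[c] += give
            spent += give
            if alloc[c] >= offered[c] - 1e-9:
                satisfied.add(c)
        remaining -= spent
        if not satisfied:
            break  # every active class consumed its full fair share
        active -= satisfied
    return alloc
```

With a 10-unit link, shares of 50/30/20 for classes `lan`/`san`/`ipc`, and offered loads of 2/10/10, the `lan` class takes only 2 units and the 3 unused units are split between `san` and `ipc` in a 30:20 ratio.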
In some embodiments, the VCS member switches can also implement congestion notification mechanisms to facilitate source-oriented flow control.
In one embodiment, the trunked links between two neighboring VCS member switches can be further trunked to form a hybrid trunk. This way, the amount of link state maintenance traffic between the two switches can be minimized.
During operation, packet processor 3202 extracts the source and destination MAC addresses of incoming frames, and attaches proper Ethernet or TRILL headers to outgoing frames. Virtual FC switch management module 3204 maintains the state of logical FC switch 3205, which is used to join other VCS switches using the FC switch fabric protocols. VCS configuration database 3206 maintains the configuration state of every switch within the VCS. TRILL header generation module 3208 is responsible for generating proper TRILL headers for frames that are to be transmitted to other VCS member switches. Traffic management module 3207 facilitates the aforementioned traffic management functions, such as multi-pathing, flow control, congestion notification, and hybrid trunking.
VCS with Converged Network Extension
One goal of the embodiments of the present invention is to extend a VCS and TRILL network across data centers and meet the scalability requirements needed by the deployments. A converged network extension (CNE) device can be implemented in a two-box solution, wherein one box is capable of L2/L3/FCoE switching and is part of the VCS, and the other facilitates the WAN tunneling to transport Ethernet and/or FC traffic over WAN. The CNE device can also be implemented in a one-box solution, wherein a single piece of network equipment combines the functions of L2/L3/FCoE switching and WAN tunneling.
As described above, VCS as a layer-2 switch uses TRILL for its inter-switch connectivity and delivers the notion of a single logical layer-2 switch. This single logical layer-2 switch delivers a transparent LAN service. All the edge ports of VCS support standard protocols and features like Link Aggregation Control Protocol (LACP), Link Layer Discovery Protocol (LLDP), VLANs, MAC learning, etc. VCS achieves a distributed MAC address database using Ethernet Name Service (eNS) and attempts to avoid flooding as much as possible. VCS also provides various intelligent services, such as virtual link aggregation group (vLAG), advance port profile management (APPM), End-to-End FCoE, Edge-Loop-Detection, etc.
In embodiments of the present invention, for the purpose of cross-data-center communication, each data center is represented as a single logical RBridge. This logical RBridge can be assigned a virtual RBridge ID or use the RBridge ID of the CNE device that performs the WAN tunneling.
Similarly, data center 3346 includes a VCS 3342, which in turn includes a member switch 3332. Member switch 3332 is coupled to a host 3341, which includes VMs 3334 and 3336, both of which are coupled to virtual switches 3338 and 3340. Also included in VCS 3342 is a CNE device 3330. CNE device 3330 is coupled to member switch 3332 via an Ethernet (TRILL) link and an FC link. CNE device 3330 is also coupled to target storage device 3322 and a clone of target storage device 3320.
During operation, assume that VM 3302 needs to move from host 3301 to host 3341. Note that this movement was previously not possible, because virtual machines are visible only within the same layer-2 network domain. Once the layer-2 network domain is terminated by a layer-3 device, such as gateway router 3324, all the identifying information for a particular virtual machine (which is carried in layer-2 headers) is lost. However, in embodiments of the present invention, because the CNE device extends the layer-2 domain from VCS 3316 to VCS 3342, the movement of VM 3302 from data center 3344 to data center 3346 is now possible.
When forwarding TRILL frames from data center 3344 to data center 3346, CNE device 3318 modifies the egress TRILL frames' header so that the destination RBridge identifier is the RBridge identifier assigned to data center 3346. CNE device 3318 then uses the FCIP tunnel to deliver these TRILL frames to CNE device 3330, which in turn forwards these TRILL frames to their respective layer-2 destinations.
VCS uses the FC control plane to automatically form a fabric and assign RBridge identifiers to each member switch. In one embodiment, the CNE architecture keeps the TRILL and SAN fabrics separate between data centers. From a TRILL point of view, each VCS (which corresponds to a respective data center) is represented as a single virtual RBridge. In addition, the CNE device can be coupled to a VCS member switch with both a TRILL link and an FC link. The CNE device can join the VCS via a TRILL link. However, since the CNE device keeps the TRILL VCS fabric and SAN (FC) fabric separate, the FC link between the CNE device and the member switch is configured for FC multi-fabric.
As illustrated in
In one embodiment, each data center's VCS includes a node designated as the ROOT RBridge for multicast purposes. During the initial setup, the CNE devices in the VCSs exchange each VCS's ROOT RBridge identifier. In addition, the CNE devices also exchange each data center's RBridge identifier. Note that this RBridge identifier represents the entire data center. Information related to data-center RBridge identifiers is distributed as a static route to all the nodes in the local VCS.
Assume that host A needs to send multicast traffic to host Z, and that host A already has the knowledge of host Z's MAC address. During operation, host A assembles an Ethernet frame 3502, which has host Z's MAC address (denoted as MAC-Z) as its destination address (DA), and host A's MAC address (denoted as MAC-A) as its source address (SA). Based on frame 3502, member switch RB1 assembles a TRILL frame 3503, whose TRILL header 3506 includes the RBridge identifier of data center DC-1's root RBridge (denoted as “DC1-ROOT”) as the destination RBridge, and RB1 as the source RBridge. (That is, within DC-1, the multicast traffic is distributed on the local multicast tree.) The outer Ethernet header 3504 of frame 3503 has CNE device RB4's MAC address (denoted as MAC-RB4) as the DA, and member switch RB1's MAC address (denoted as MAC-RB1) as the SA.
When frame 3503 reaches CNE device RB4, it further modifies the frame's TRILL header to produce frame 3505. CNE device RB4 replaces the destination RBridge identifier in the TRILL header 3510 with data center DC-2's root RBridge identifier DC2-ROOT. The source RBridge identifier is changed to data center DC-1's virtual RBridge identifier, DC1-RB (which allows data center DC-2 to learn data center DC-1's RBridge identifier). Outer Ethernet header 3508 has the core router's MAC address (MAC-RTR) as its DA, and CNE device RB4's MAC address (MAC-DC-1) as its SA.
Frame 3505 is subsequently transported across the IP WAN in an FCIP tunnel and reaches CNE device RB6. Correspondingly, CNE device RB6 updates the header to produce frame 3507. Frame 3507's TRILL header 3514 remains the same as that of frame 3505. The outer Ethernet header 3512 now has member switch RB5's MAC address, MAC-RB5, as its DA, and CNE device RB6's MAC address, MAC-RB6, as its SA. Once frame 3507 reaches member switch RB5, the TRILL header is removed, and the inner Ethernet frame is delivered to host Z.
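The header rewriting along the multicast path (frames 3503, 3505, and 3507) can be traced with a small sketch. The `Frame` tuple and the two helper functions are hypothetical; the string labels are the ones used in the description above, and the inner Ethernet frame is omitted because it is carried unchanged.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    outer_da: str   # outer Ethernet destination address
    outer_sa: str   # outer Ethernet source address
    trill_dst: str  # TRILL destination RBridge
    trill_src: str  # TRILL source RBridge


def cne_to_wan(frame, remote_root, local_dc_rb, router_mac, cne_wan_mac):
    """Sending CNE device: retarget the frame at the remote multicast root and
    present the local data center as the TRILL source."""
    return Frame(router_mac, cne_wan_mac, remote_root, local_dc_rb)


def cne_from_wan(frame, next_hop_mac, cne_mac):
    """Receiving CNE device: only the outer Ethernet header changes."""
    return frame._replace(outer_da=next_hop_mac, outer_sa=cne_mac)


frame_3503 = Frame("MAC-RB4", "MAC-RB1", "DC1-ROOT", "RB1")
frame_3505 = cne_to_wan(frame_3503, "DC2-ROOT", "DC1-RB", "MAC-RTR", "MAC-DC-1")
frame_3507 = cne_from_wan(frame_3505, "MAC-RB5", "MAC-RB6")
```

In this trace the TRILL fields of `frame_3507` equal those of `frame_3505`, which mirrors the statement that TRILL header 3514 is unchanged at CNE device RB6.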
In various embodiments, a CNE device can be configured to allow or disallow unknown unicast, broadcast (e.g., ARP), or multicast (e.g., IGMP snooped) traffic to cross data center boundaries. By having these options, one can limit the amount of BUM traffic across data centers. Note that all TRILL encapsulated BUM traffic between data centers can be sent with the remote data center's root RBridge identifier. This translation is done at the terminating point of the FCIP tunnel.
Additional mechanisms can be implemented to minimize BUM traffic across data centers. For instance, the TRILL ports between the CNE device and any VCS member switch can be configured to not participate in any of the VLAN MGIDs. In addition, the eNS on both VCSs can be configured to synchronize their learned MAC address database to minimize traffic with unknown MAC DA. (Note that in one embodiment, before the learned MAC address databases are synchronized in different VCSs, frames with unknown MAC DAs are flooded within the local data center only.)
To further minimize BUM traffic, broadcast traffic such as ARP traffic can be reduced by snooping ARP responses to build ARP databases on VCS member switches. The learned ARP databases are then exchanged and synchronized across different data centers using eNS. Proxy-based ARP is used to respond to all known ARP requests in a VCS. Furthermore, multicast traffic across data centers can be reduced by distributing the multicast group membership across data centers by sharing the IGMP snooping information via eNS.
The process of forwarding unicast traffic between data centers is described as follows. During the FCIP tunnel formation, the logical RBridge identifiers representing data centers are exchanged. When a TRILL frame arrives at the entry node of the FCIP tunnel, wherein the TRILL destination RBridge is set as the RBridge identifier of the remote data center, the source RBridge in the TRILL header is translated to the logical RBridge identifier assigned to the local data center. When the frame exits the FCIP tunnel, the destination RBridge field in the TRILL header is set as the local (i.e., the destination) data center's virtual RBridge identifier. The MAC DA and VLAN ID in the inner Ethernet header are then used to look up the corresponding destination RBridge (i.e., the RBridge identifier of the member switch to which the destination host is attached), and the destination RBridge field in the TRILL header is updated accordingly.
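The two translations in the unicast procedure can be sketched as a pair of functions. The names `TrillFrame`, `tunnel_ingress`, and `tunnel_egress` are hypothetical, and returning `None` for frames outside the described case is an assumption of this sketch; handling of such frames is not covered here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrillFrame:
    src_rb: str      # TRILL source RBridge
    dst_rb: str      # TRILL destination RBridge
    inner_da: str    # MAC DA of the inner Ethernet frame
    inner_vlan: int  # VLAN ID of the inner Ethernet frame


def tunnel_ingress(frame, local_dc_rb, remote_dc_rb):
    """Entry node: translate the source RBridge to the local data center's
    logical RBridge identifier for frames addressed to the remote data center."""
    if frame.dst_rb != remote_dc_rb:
        return None  # not destined for the remote data center
    return replace(frame, src_rb=local_dc_rb)


def tunnel_egress(frame, mac_table):
    """Exit node: use (MAC DA, VLAN ID) to find the member switch to which the
    destination host is attached and update the destination RBridge field."""
    member_rb = mac_table.get((frame.inner_da, frame.inner_vlan))
    if member_rb is None:
        return None  # unknown destination; not handled in this sketch
    return replace(frame, dst_rb=member_rb)
```

For a frame from RB1 addressed to DC2-RB, `tunnel_ingress` changes the source to DC1-RB, and `tunnel_egress` with a table mapping `("MAC-Z", 10)` to RB5 changes the destination to RB5.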
In the destination data center, based on an ingress frame, all the VCS member switches learn the mapping between the MAC SA (in the inner Ethernet header of the frame) and the TRILL source RBridge (which is the virtual RBridge identifier assigned to the source data center). This allows future egress frames destined to that MAC address to be sent to the right remote data center. Note that since the RBridge identifier assigned to a given data center does not correspond to a physical RBridge, in one embodiment, a static route is used to map a remote data-center RBridge identifier to the local CNE device.
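The learning step and the static route can be pictured with the following sketch. The class `MemberSwitch` and its methods are hypothetical names; the sketch assumes that a learned RBridge identifier without a static route is a physical RBridge in the local VCS and is reached directly.

```python
class MemberSwitch:
    """Toy forwarding state of one VCS member switch (illustrative only)."""

    def __init__(self, static_routes):
        # remote data-center RBridge identifier -> local CNE device RBridge
        self.static_routes = dict(static_routes)
        # (MAC, VLAN) -> TRILL source RBridge seen on ingress frames
        self.mac_table = {}

    def learn(self, inner_sa, vlan, trill_src_rb):
        """Record which RBridge a MAC SA sits behind."""
        self.mac_table[(inner_sa, vlan)] = trill_src_rb

    def lookup(self, mac_da, vlan):
        """Return (TRILL destination RBridge, RBridge to forward through),
        or None when the destination has not been learned."""
        dst_rb = self.mac_table.get((mac_da, vlan))
        if dst_rb is None:
            return None
        return dst_rb, self.static_routes.get(dst_rb, dst_rb)
```

A switch in DC-2 that learns MAC-A behind DC1-RB, and holds the static route DC1-RB to RB6, would address later frames for MAC-A to DC1-RB and forward them through CNE device RB6.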
When frame 3503 reaches CNE device RB4, it further modifies the frame's TRILL header to produce frame 3505. CNE device RB4 replaces the source RBridge identifier in the TRILL header 3511 with data center DC-1's virtual RBridge identifier DC1-RB (which allows data center DC-2 to learn data center DC-1's RBridge identifier). Outer Ethernet header 3508 has the core router's MAC address (MAC-RTR) as its DA, and CNE device RB4's MAC address (MAC-DC-1) as its SA.
Frame 3505 is subsequently transported across the IP WAN in an FCIP tunnel and reaches CNE device RB6. Correspondingly, CNE device RB6 updates the header to produce frame 3507. Frame 3507's TRILL header 3515 has an updated destination RBridge identifier, which is RB5, the VCS member switch in DC-2 that couples to host Z. The outer Ethernet header 3512 now has member switch RB5's MAC address, MAC-RB5, as its DA, and CNE device RB6's MAC address, MAC-RB6, as its SA. Once frame 3507 reaches member switch RB5, the TRILL header is removed, and the inner Ethernet frame is delivered to host Z.
Flooding across data centers of frames with unknown MAC DAs is one way for the data centers to learn the MAC addresses in another data center. All unknown SAs are learned as MACs behind an RBridge, and the CNE device is no exception. In one embodiment, eNS can be used to distribute the learned MAC address database, which reduces the amount of flooding across data centers.
In order to optimize flushes, even though MAC addresses are learned behind RBridges, the actual VCS edge port associated with a MAC address is present in the eNS MAC updates. However, the edge port IDs might no longer be unique across data-centers. To resolve this problem, all eNS updates across data centers will qualify the MAC entry with the data-center's RBridge identifier. This configuration allows propagation of port flushes across data centers.
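A minimal sketch of why the qualification matters follows. The class `EnsMacTable` and its methods are hypothetical; the point illustrated is only that keying entries by (data-center RBridge identifier, edge port ID) lets a port flush from one data center leave a same-numbered port in another data center untouched.

```python
class EnsMacTable:
    """Toy store of eNS MAC entries received from several data centers."""

    def __init__(self):
        # (data-center RBridge identifier, edge port ID) -> set of MACs
        self.by_port = {}

    def apply_update(self, dc_rbridge, port_id, mac):
        """Record a MAC entry qualified by its data-center RBridge identifier."""
        self.by_port.setdefault((dc_rbridge, port_id), set()).add(mac)

    def flush_port(self, dc_rbridge, port_id):
        """Drop and return every MAC learned on one port of one data center."""
        return self.by_port.pop((dc_rbridge, port_id), set())
```

Without the data-center qualifier, a flush of port 7 in one data center would be indistinguishable from a flush of port 7 in another.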
In the architecture described herein, VCSs in different data-centers do not join each other; hence the distributed configurations are kept separate. However, in order to allow virtual machines to move across data-centers, there will be some configuration data that needs to be synchronized across data-centers. In one embodiment, a special module (in either software or hardware) is created for CNE purposes. This module is configured to retrieve the configuration information needed to facilitate moving of virtual machines across data centers and it is synchronized between two or more VCSs.
In one embodiment, the learned MAC address databases are distributed across data centers. Edge port state change notifications (SCNs) are also distributed across data centers. When a physical RBridge is going down, the SCN is converted to multiple port SCNs on the inter-data-center FCIP link.
In order to protect the inter-data-center connectivity, a VCS can form a vLAG between two or more CNE devices. In this model, the vLAG RBridge identifier is used as the data-center RBridge identifier. The FCIP control plane is configured to be aware of this arrangement and exchange the vLAG RBridge identifiers in such cases.
Existing routing models assume every inter-switch link (ISL) can be a path to a reachable domain. However, if an FCIP-enabled VCS node joins the VCS via both FC and Ethernet ISLs, extra care needs to be taken to ensure that only FCoE storage traffic uses the FC links while the Ethernet ISLs are used for the LAN traffic. This can be achieved by allowing only FCoE VLAN traffic to use the FC links, while all non-FCoE traffic is allowed to use only the Ethernet ISLs.
Coupled to packet processor 3706 is an address management module 3708, which is responsible for maintaining the virtual data-center RBridge identifier, learning address information from remote data centers, and updating the source RBridge field for outgoing frames to a remote data center. Also included in CNE device 3700 is an FCIP management module 3710, which maintains the FCIP tunnel and is responsible for exchanging configuration information via the FCIP tunnel, and a TRILL management module 3712, which manages the TRILL connection with the local VCS.
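The VLAN-based link restriction can be expressed as a simple filter. The function name `eligible_isls`, the link labels, and the VLAN number in the usage note are hypothetical and serve only to illustrate the rule stated above.

```python
def eligible_isls(vlan, isls, fcoe_vlans):
    """Return the ISLs a frame on the given VLAN may use.

    isls       : list of (link name, kind) pairs, kind is "fc" or "ethernet"
    fcoe_vlans : set of VLAN IDs that carry FCoE storage traffic
    FCoE VLAN traffic may use only FC links; all other traffic may use
    only Ethernet ISLs.
    """
    wanted = "fc" if vlan in fcoe_vlans else "ethernet"
    return [name for name, kind in isls if kind == wanted]
```

With links `[("isl-fc-1", "fc"), ("isl-eth-1", "ethernet")]` and `fcoe_vlans={1002}`, a frame on VLAN 1002 is limited to `isl-fc-1`, and a frame on any other VLAN is limited to `isl-eth-1`.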
The methods and processes described herein can be embodied as code and/or data, which can be stored in a computer-readable non-transitory storage medium. When a computer system reads and executes the code and/or data stored on the computer-readable non-transitory storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the medium.
The methods and processes described herein can be executed by and/or included in hardware modules or apparatus. These modules or apparatus may include, but are not limited to, an application-specific integrated circuit (ASIC) chip, a field-programmable gate array (FPGA), a dedicated or shared processor that executes a particular software module or a piece of code at a particular time, and/or other programmable-logic devices now known or later developed. When the hardware modules or apparatus are activated, they perform the methods and processes included within them.
The foregoing descriptions of embodiments of the present invention have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit this disclosure. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. The scope of the present invention is defined by the appended claims.
This application is a national-phase application based on PCT international application PCT/US2011/034917, which claims the benefit of U.S. Provisional Patent Applications: 61/330,678, filed 3 May 2010; 61/333,666, filed 11 May 2010; 61/334,945, filed 14 May 2010; 61/345,953, filed 18 May 2010; 61/349,603, filed 28 May 2010; 61/350,838, filed 2 Jun. 2010; 61/352,264, filed 7 Jun. 2010; 61/352,255, filed 7 Jun. 2010; 61/352,819, filed 8 Jun. 2010; 61/352,776, filed 8 Jun. 2010; 61/380,807, filed 8 Sep. 2010; 61/380,803, filed 8 Sep. 2010; 61/380,819, filed 8 Sep. 2010; 61/380,818, filed 8 Sep. 2010; and U.S. Non-provisional Patent Applications: Ser. No. 13/042,259, filed 7 Mar. 2011; Ser. No. 13/087,239, filed 14 Apr. 2011; Ser. No. 13/092,752, filed 22 Apr. 2011; Ser. No. 13/092,460, filed 22 Apr. 2011; Ser. No. 13/092,724, filed 22 Apr. 2011; Ser. No. 13/092,877, filed 22 Apr. 2011; Ser. No. 13/092,701, filed 22 Apr. 2011; Ser. No. 13/092,580, filed 22 Apr. 2011; Ser. No. 13/098,360, filed 29 Apr. 2011; and Ser. No. 13/098,490, filed 2 May 2011; the disclosures of which are incorporated by reference in their entirety herein.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US11/34917 | 5/3/2011 | WO | 00 | 10/26/2016 |
Number | Date | Country | |
---|---|---|---|
61330678 | May 2010 | US | |
61333666 | May 2010 | US | |
61334945 | May 2010 | US | |
61345953 | May 2010 | US | |
61349603 | May 2010 | US | |
61350838 | Jun 2010 | US | |
61352255 | Jun 2010 | US | |
61352264 | Jun 2010 | US | |
61352776 | Jun 2010 | US | |
61352819 | Jun 2010 | US | |
61380818 | Sep 2010 | US | |
61380819 | Sep 2010 | US | |
61380803 | Sep 2010 | US | |
61380807 | Sep 2010 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 13042259 | Mar 2011 | US |
Child | 14912585 | | US |
Parent | 13087239 | Apr 2011 | US |
Child | 13042259 | | US |
Parent | 13092580 | Apr 2011 | US |
Child | 13087239 | | US |
Parent | 13092701 | Apr 2011 | US |
Child | 13092580 | | US |
Parent | 13092877 | Apr 2011 | US |
Child | 13092701 | | US |
Parent | 13092724 | Apr 2011 | US |
Child | 13092877 | | US |
Parent | 13092460 | Apr 2011 | US |
Child | 13092724 | | US |
Parent | 13092752 | Apr 2011 | US |
Child | 13092460 | | US |
Parent | 13098360 | Apr 2011 | US |
Child | 13092752 | | US |
Parent | 13098490 | May 2011 | US |
Child | 13098360 | | US |