This application relates to computer networking and more particularly to providing a data center Ethernet switch fabric.
A data center is a facility used to house computer systems and associated storage and networking components. For interconnecting the computer systems and storage components, an Ethernet switch fabric is often used. Connecting Ethernet switches in a fat-tree topology and managing them as Local Area Networks (LANs) with spanning tree protocol (STP) or as Internet Protocol (IP) subnets with routing protocols has been a typical practice. However, there are shortcomings associated with that practice. For example, the switching paths among end-stations are static; therefore, the network is susceptible to congestion that cannot be alleviated and is unable to accommodate the mobility of virtual machines (VMs), which may be dynamically spawned or moved. Also, a hosting data center may need to support tens of thousands of tenants and to set up traffic forwarding boundaries among the tenants. If a Virtual LAN (VLAN) is used to confine the traffic of each tenant, a layer 2 switching network is limited to supporting 4094 tenants.
There has been recent development in network virtualization technology that attempts to scale tenancy beyond the 4094-tenant limit. One example is Virtual Extensible LAN (VxLAN), which encapsulates Ethernet frames within UDP (User Datagram Protocol) packets. However, one weakness of that approach is that it cannot manage congestion in the underlying switch fabric.
Software defined networking (SDN) is an approach to building a computer network that separates and abstracts elements of the networking systems. SDN decouples the system that makes decisions about where traffic is sent (i.e., the control plane or the controller) from the system that forwards traffic to the selected destination (i.e., the data plane). OpenFlow is a communications protocol that enables the control plane to access and configure the data plane. Recently, commodity OpenFlow Ethernet switches have become available on the market. Those switches are relatively low-cost, but they also have severe limitations in terms of the number of forwarding rules they support. In principle, an OpenFlow device offers the ability to control traffic by flows in a data center switch fabric, an ability that can be utilized to alleviate congestion or to address VM mobility issues. The severe limitations of those switches greatly discount that ability, however, because the number of forwarding rules that can be programmed on those switches is relatively small, e.g., in the thousands.
In this invention, we disclose a system, method, and computer program product for using commodity switches to provide an Ethernet switch fabric for a multi-tenant data center, taking into account the limitations of the commodity switches.
We disclose herein a system, method, and computer program product for providing an Ethernet switch fabric for a multi-tenant data center. An objective of the invention is to enable a hosting data center to support no fewer than tens of thousands of tenants. Another objective is to support dynamic traffic engineering within the switch fabric so as to address network congestion conditions and dynamic VM deployment. Yet another objective is to provide a switch fabric constructed with commodity switches that may support only a small number of forwarding rules for traffic engineering.
The system comprises a number of interconnected Ethernet switches, a number of virtual switches running on host machines, and a controller. An Ethernet switch is considered an edge switch when it owns an edge port. An edge port is a switch port that is connected to a host machine. A host machine runs a virtual switch and one or more VMs. The virtual switch provides connectivity among the VMs on the host machine and one or more edge ports. The controller is a computer program that implements the method of this invention. The controller assigns MAC addresses to the VMs and programs the Ethernet switches and the virtual switches.
The method comprises two key steps. One step assigns a unique, location-based MAC address to each VM that is spawned on a host machine. The MAC address assigned to a VM comprises a set of bits that identifies the location of the VM with respect to the switch fabric. Another step programs the Ethernet switches and the virtual switches in the switch fabric to forward a unicast packet destined to the MAC address by at least one bit of the set of bits of the MAC address. Furthermore, the virtual switch is programmed to discard all packets from the VM whose source MAC address is not the MAC address assigned to the VM. The virtual switch is also programmed to discard unicast packets from the VM that are not destined to other members of the tenant group of the VM. A broadcast packet from the VM is handled by converting it into one or more unicast packets, replacing the destination MAC address of the broadcast packet with the MAC addresses of the other members of the tenant group of the VM.
By assigning structured, location-based MAC addresses to VMs, the switch fabric enables traffic engineering with a relatively small number of forwarding rules programmed on the Ethernet switches. Also, the VMs of the various tenant groups are not separated by VLANs in the present invention; their traffic is constrained by forwarding rules. Because the forwarding rules do not rely on VLAN identifiers, there can be more than 4094 tenant groups.
The present disclosure will be understood more fully from the detailed description that follows and from the accompanying drawings, which however, should not be taken to limit the disclosed subject matter to the specific embodiments shown, but are for explanation and understanding only.
FIGS. 2a-2d illustrate some exemplary implementations of a MAC address assigned to a VM.
FIGS. 3a-3h illustrate some exemplary implementations of forwarding rules on the switch fabric.
FIGS. 4a-4d illustrate some exemplary implementations of using location-based MAC addresses.
We disclose herein a system, method, and computer program product for providing an Ethernet switch fabric for multi-tenant data center. The system comprises a number of interconnected Ethernet switches, a number of virtual switches running on host machines, and a controller. The controller is a computer program that implements the method of this invention. The controller assigns location-based MAC addresses to the VMs and programs the Ethernet switches and the virtual switches to forward traffic by the location-based MAC addresses.
In a typical data center network, a VM is uniquely identified by an identifier, such as a UUID (Universally Unique Identifier), generated by a data center orchestration tool. The data center orchestration tool also manages the tenant group membership of the VM and the attributes of the VM. A tenant who uses the VMs of his tenant group may have control over the IP address assignments, or even the VLAN assignments, of the VMs in his virtual private network. The tenant may or may not have control over the MAC address assignments of his VMs. To the tenant, an IP address identifies a VM. The tenant is not given knowledge of the location of the VM; the data center orchestration tool has that knowledge.
In the present invention, the controller assigns a MAC address to a VM in such a way that the MAC address embeds the location of the VM, so that the controller can program the switch fabric to forward traffic to the VM by MAC address. In other words, the forwarding decisions are independent of IP addresses or any other networking parameter over which the tenant has control; the IP datagram encapsulated inside an Ethernet frame is opaque to the switch fabric. Even though the switch fabric comprises Ethernet switches, it no longer functions as a standard Ethernet network that runs spanning tree protocol, performs MAC address learning, and forwards by destination MAC address and VLAN identifier. The switch fabric in the present invention forwards a packet using the destination MAC address, but not the full destination MAC address; it uses only the bits of the destination MAC address that provide location information, thereby reducing the number of forwarding rules to be programmed on the switch fabric.
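As a non-limiting illustration, the following Python sketch shows one possible way to pack location information into a locally administered MAC address. The 02:00:F0 prefix and the field widths (one octet each for a tier 1 location identifier, an edge-port identifier, and a VM identifier) are assumptions made for this example only, not a definitive encoding.

```python
# Illustrative sketch only: one possible packing of location information into a
# locally administered MAC address. Prefix and field widths are assumptions.

def make_location_based_mac(location_id, port_id, vm_id):
    """Build a MAC address whose low-order octets encode the VM's location."""
    assert 0 <= location_id < 256 and 0 <= port_id < 256 and 0 <= vm_id < 256
    octets = [0x02, 0x00, 0xF0, location_id, port_id, vm_id]
    return ":".join("%02X" % o for o in octets)

def location_bits(mac):
    """Extract only the location-identifying octet used by fabric forwarding rules."""
    return int(mac.split(":")[3], 16)

if __name__ == "__main__":
    mac = make_location_based_mac(location_id=0x01, port_id=0x01, vm_id=0x01)
    print(mac)                 # 02:00:F0:01:01:01
    print(location_bits(mac))  # 1
```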
There can be various embodiments of how the location of the VM is embedded into a MAC address. In one embodiment, as in
In another embodiment, as in
Yet in another embodiment, as in
Yet in another embodiment, as in
There can be various embodiments of forwarding rules in the switch fabric to forward multi-tenant data center traffic by location-based MAC addresses.
The forwarding rules are ordered by their priorities. The smaller the rule number, the higher the priority of execution. In
The forwarding rules on virtual switches are constructed to match the ingress VNI, the source MAC address (SMAC), and the destination MAC address (DMAC) of the packets from VMs for the following reasons. Firstly, a virtual switch is to discard a packet from a VM to another VM that is not of the same tenant group. Secondly, a virtual switch is to discard a packet from a VM whose source MAC address does not match the MAC address assigned to the VM. That prevents a tenant from forging MAC addresses to spoof other tenants.
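As a non-limiting illustration, the following Python sketch models such virtual-switch rules with plain data structures and shows how a forged source MAC or an out-of-group destination falls through to a drop rule. The rule representation, the VNI value, and the MAC addresses are hypothetical; an actual deployment would program equivalent matches via OpenFlow.

```python
# Illustrative sketch of virtual-switch rules matching ingress VNI, SMAC, DMAC.
WILDCARD = None  # field not matched

rules = [
    # (rule number, ingress_vni, smac, dmac, action); smaller number = higher priority
    (10, 1, "02:00:F0:01:01:01", "02:00:F0:02:01:01", "output:uplink"),  # allowed peer
    (10, 1, "02:00:F0:01:01:01", "02:00:F0:03:02:02", "output:uplink"),  # allowed peer
    (90, 1, WILDCARD,            WILDCARD,            "drop"),           # default: drop
]

def lookup(ingress_vni, smac, dmac):
    """Return the action of the highest-priority (smallest-numbered) matching rule."""
    for prio, vni, r_smac, r_dmac, action in sorted(rules, key=lambda r: r[0]):
        if vni not in (WILDCARD, ingress_vni):
            continue
        if r_smac not in (WILDCARD, smac):
            continue
        if r_dmac not in (WILDCARD, dmac):
            continue
        return action
    return "drop"

print(lookup(1, "02:00:F0:01:01:01", "02:00:F0:02:01:01"))  # output:uplink
print(lookup(1, "de:ad:be:ef:00:01", "02:00:F0:02:01:01"))  # drop (forged SMAC)
```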
Broadcast packets from VMs are handled specially. A broadcast packet should be forwarded to all tenant group members other than the sender. An entity aware of the tenant group membership may convert the broadcast packet into unicast packets by replacing the destination MAC address (DMAC) with the MAC addresses assigned to the other tenant group members. In one embodiment, the controller does the broadcast packet conversion, having the virtual switch capture the broadcast packet via an OpenFlow session and then injecting the corresponding unicast packets into the switch fabric via OpenFlow sessions. In another embodiment, a special gateway does the broadcast packet conversion, having the virtual switch forward the broadcast packet to the special gateway by replacing the destination MAC address of the broadcast packet with a special MAC address of the special gateway. The special gateway is attached to the switch fabric, and there are forwarding rules on the Ethernet switches for the special MAC address. For example, rule 98 forwards a broadcast packet to a special gateway whose MAC address is D. In yet another embodiment, the virtual switch does the broadcast packet conversion, with the controller informing the virtual switch of the tenant group membership information.
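As a non-limiting illustration, the following Python sketch shows the broadcast-to-unicast conversion described above. The packet and group-membership representations are hypothetical.

```python
# Illustrative sketch: replicate a broadcast packet as unicasts to the other
# tenant group members, rewriting the destination MAC of each copy.
BROADCAST = "ff:ff:ff:ff:ff:ff"

def broadcast_to_unicasts(packet, tenant_members):
    """Convert one broadcast packet into unicast copies for the other members."""
    assert packet["dmac"].lower() == BROADCAST
    sender = packet["smac"]
    unicasts = []
    for member_mac in tenant_members:
        if member_mac == sender:
            continue  # do not send the packet back to its sender
        copy = dict(packet)
        copy["dmac"] = member_mac
        unicasts.append(copy)
    return unicasts

group = ["02:00:F0:01:01:01", "02:00:F0:02:01:01", "02:00:F0:03:02:02"]
pkt = {"smac": "02:00:F0:01:01:01", "dmac": BROADCAST, "payload": b"hello"}
for p in broadcast_to_unicasts(pkt, group):
    print(p["dmac"])
```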
An ARP (Address Resolution Protocol) request from a VM needs a response so that the IP stack of the VM can send unicast packets to a target tenant group member. An entity aware of the tenant group membership needs to generate the response. In one embodiment, the controller generates the ARP response, having the virtual switch capture an ARP request via an OpenFlow session and injecting the ARP response via the same OpenFlow session. In another embodiment, a special gateway generates the ARP response, having the virtual switch forward the broadcast ARP request packet to the special gateway by replacing the destination MAC address of the broadcast ARP request packet with a special MAC address of the special gateway. For example, rule 98 forwards a broadcast packet to a special gateway whose MAC address is D. In yet another embodiment, an ARP request is treated as a typical broadcast packet and is converted into multiple unicast ARP request packets to all other tenant group members; the tenant group member that has the IP address queried by the ARP request then responds to the sender directly. In yet another embodiment, the virtual switch generates the ARP response to the VM, with the controller informing the virtual switch of the tenant group membership information.
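As a non-limiting illustration, the following Python sketch shows how an entity that knows the tenant group membership (the controller, a special gateway, or the virtual switch) could answer an ARP request on behalf of the target VM. The Ethernet and ARP framing follows the standard wire format; the IP-to-MAC mapping is hypothetical.

```python
# Illustrative sketch: build an ARP reply frame for a tenant group member.
import struct

ip_to_mac = {"10.0.0.2": "02:00:F0:02:01:01"}  # hypothetical membership data

def mac_bytes(mac):
    return bytes(int(b, 16) for b in mac.split(":"))

def ip_bytes(ip):
    return bytes(int(b) for b in ip.split("."))

def build_arp_reply(requester_mac, requester_ip, target_ip):
    """Build an Ethernet frame carrying an ARP reply that resolves target_ip."""
    target_mac = ip_to_mac[target_ip]
    eth = mac_bytes(requester_mac) + mac_bytes(target_mac) + struct.pack("!H", 0x0806)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)          # opcode 2 = ARP reply
    arp += mac_bytes(target_mac) + ip_bytes(target_ip)        # sender hw/proto addresses
    arp += mac_bytes(requester_mac) + ip_bytes(requester_ip)  # target hw/proto addresses
    return eth + arp

frame = build_arp_reply("02:00:F0:01:01:01", "10.0.0.1", "10.0.0.2")
print(len(frame), frame.hex())
```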
On an edge switch, the forwarding rules do not need to match the VNI identifier of the destination MAC address of a packet. A packet whose location identifier of the destination MAC address matches the location identifier assigned to the edge switch should be forwarded to an edge port, further according to the port identifier of the destination MAC address. A packet whose location identifier of the destination MAC address does not match the location identifier assigned to the edge switch should be forwarded to a non-edge port that can lead to the edge switch associated with the location identifier of the destination MAC address. For example, edge switch 102 is assigned location identifier A, and edge switch 103 is assigned location identifier B. On edge switch 102, a packet whose location identifier of the destination MAC address matches B is forwarded to port 2, which can lead to edge switch 103 through spine switch 101.
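As a non-limiting illustration, the following Python sketch generates such edge-switch rules: high-priority rules match the location octet and port octet for locally attached VMs, and a low-priority rule sends everything else toward the spine. The location values, port numbers, and rule representation are hypothetical.

```python
# Illustrative sketch: derive edge-switch forwarding rules from location-based MACs.

def edge_switch_rules(my_location_id, edge_port_map, uplink_port):
    """
    my_location_id: location identifier assigned to this edge switch.
    edge_port_map:  {port-identifier octet: physical edge port} for local VMs.
    uplink_port:    non-edge port leading toward the spine for all other locations.
    """
    rules = []
    # Local traffic: match the location octet, then the port-identifier octet.
    for port_id, phys_port in edge_port_map.items():
        match = {"dmac_location": my_location_id, "dmac_port": port_id}
        rules.append((10, match, "output:%d" % phys_port))
    # Everything else (non-matching location octet) goes to the uplink.
    rules.append((90, {"dmac_location": "any"}, "output:%d" % uplink_port))
    return rules

# Hypothetical edge switch with location identifier 0x01, VMs behind edge
# ports 3 and 4, and port 2 leading to a spine switch.
for rule in edge_switch_rules(0x01, {0x01: 3, 0x02: 4}, uplink_port=2):
    print(rule)
```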
There is no location identifier assigned to a spine switch in the case of MAC address embodiments of
FIG. 3b illustrates another embodiment of forwarding rules compatible with MAC address embodiment of
FIG. 3c illustrates yet another embodiment of forwarding rules compatible with MAC address embodiment of
For example, in
FIG. 3d illustrates a forwarding rules embodiment compatible with the MAC address embodiment of
FIG. 3e illustrates a forwarding rules embodiment for a network topology with redundant links. It is advantageous to use link aggregation (LAG) to bundle the physical ports. In other words, an edge port can be a logical edge port, mapped to a LAG. We can then take advantage of the quick link failure detection and link failover capabilities of typical Ethernet switches. The upstream ports of an edge switch can be bundled as one LAG. The downstream ports of a spine switch can be bundled as one or more LAGs, each LAG leading to one edge switch. For example, ports 2-5 on edge switch 142 are bundled as LAG 1. Ports 1-2 on spine switch 140 are bundled as LAG 3, and ports 3-4 on spine switch 140 are bundled as LAG 4. LAG 3 leads to edge switch 142 and LAG 4 leads to edge switch 143. The forwarding rules can specify a LAG as the output port. It is up to the LAG traffic distribution algorithm on an Ethernet switch to select one physical port out of the LAG to send out a packet. Typically, the LAG traffic distribution algorithm uses a hash of the packet header fields to select the physical port.
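As a non-limiting illustration, the following Python sketch shows a hash-based LAG distribution function of the kind described above. The choice of header fields and the hash function are assumptions for this example; real switches implement distribution in hardware with their own algorithms.

```python
# Illustrative sketch: pick one physical member port of a LAG from a flow hash.
import zlib

def lag_select_port(lag_members, smac, dmac, ethertype):
    """Deterministically map a flow onto one member port of the LAG."""
    key = ("%s|%s|%04x" % (smac, dmac, ethertype)).encode()
    return lag_members[zlib.crc32(key) % len(lag_members)]

lag1 = [2, 3, 4, 5]  # e.g., the bundled upstream ports of an edge switch
print(lag_select_port(lag1, "02:00:F0:01:01:01", "02:00:F0:02:01:01", 0x0800))
```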
FIG. 3f illustrates a forwarding rules embodiment for dynamic load-balancing or congestion alleviation. For example, when the controller detects congestion on LAG 2, the controller can program new forwarding rules to divert some packet flows to specific ports. On edge switch 153, rule 50 is one such forwarding rule; it directs File Transfer Protocol (FTP) packets to port 3. Rule 51 is another; it directs traffic destined to a specific VM out on port 2, overriding the LAG load distribution algorithm for LAG 2.
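As a non-limiting illustration, the following Python sketch shows how a more specific, smaller-numbered rule could be installed to divert a flow away from a congested LAG, in the spirit of rules 50 and 51 above. The rule representation and match fields are the same hypothetical ones used in the earlier sketches.

```python
# Illustrative sketch: install higher-priority rules that override a LAG rule.

rules = [
    (60, {"dmac_location": 0x02}, "output:lag2"),   # normal path via LAG 2
]

def divert_flow(rules, match, out_port, priority):
    """Add a more specific rule that takes precedence over the LAG rule."""
    rules.append((priority, match, "output:%d" % out_port))
    rules.sort(key=lambda r: r[0])  # smaller rule number executes first
    return rules

# Divert FTP control traffic (TCP port 21) to port 3, and traffic to one
# specific VM out port 2, overriding LAG 2's distribution algorithm.
divert_flow(rules, {"ip_proto": 6, "tcp_dst": 21}, out_port=3, priority=50)
divert_flow(rules, {"dmac": "02:00:F0:02:01:01"}, out_port=2, priority=51)
for r in rules:
    print(r)
```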
The controller can update the forwarding rules on the switch fabric dynamically. Some forwarding rules previously programmed may need to be relocated to make room for new forwarding rules so as to maintain proper rule execution priorities.
FIG. 3g illustrates a forwarding rules embodiment for a switch fabric spread over two geographical partitions separated by an IP network 165. The network topology and configurations are similar to
FIG. 3h illustrates a forwarding rules embodiment for a network topology with redundant links from a host machine to two edge switches. The host machine 174 has two NICs, and VNI 3 of virtual switch 176 is mapped to the two NICs. The two NICs are bonded as LAG 9. NIC bonding offers the benefits of quick link failure detection and link failover. As LAG 9 spans edge switch 172 and edge switch 173, edge switches 172 and 173 are to share tier 1 location identifier C. In other words, edge switches 172 and 173 form a switch aggregation from the viewpoint of the location of VM 1 and VM 2. In this case, the tier 1 location identifier indicates the location of a logical Ethernet switch with respect to the data center fabric. The logical Ethernet switch is mapped to the switch aggregation. However, edge switch 173 is the only edge switch connected to host machine 175, so edge switch 173 is also associated with tier 1 location identifier B. Spine switches 170 and 171 are programmed accordingly to accommodate the tier 1 location identifier C of the logical Ethernet switch and the tier 1 location identifier B of edge switch 173. Ports 1-4 of spine switch 170 can form one LAG, and ports 1-4 of spine switch 171 can form one LAG.
The controller may update the forwarding rules on the switch fabric in response to network topology changes such as link status change and insertion or failure of Ethernet switches. In response to failure of an Ethernet switch, there can be various embodiments. In one embodiment, using an aggregation of Ethernet switches as in
There can be various embodiments of how location-based MAC addresses can be associated with VMs.
FIG. 4b illustrates another embodiment of using a location-based MAC address. VM 211 uses a globally unique MAC address or a MAC address of the tenant's choice on its network interface. The source MAC address of all packets from VM 211 is 00:00:0C:11:11:11. Virtual switch 212 maintains a mapping between the MAC address of VM 211 and the location-based MAC address assigned to VM 211. VM 211 is not aware of the location-based MAC address 02:00:F0:01:01:01. Virtual switch 212 replaces the source MAC address of a packet from VM 211 with VM 211's assigned location-based MAC address. For example, packet 213 is converted to packet 215. Virtual switch 212 also replaces the destination MAC address of a packet destined to VM 211 with VM 211's actual MAC address. For example, packet 216 is converted to packet 214.
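As a non-limiting illustration, the following Python sketch models the MAC rewriting performed by virtual switch 212: the VM keeps its own MAC address while the fabric only ever sees the assigned location-based MAC address. The mapping table and packet representation are hypothetical; the MAC values mirror those used in the example above.

```python
# Illustrative sketch: virtual-switch SMAC/DMAC rewriting between a VM's own
# MAC address and its assigned location-based MAC address.

vm_mac_to_location_mac = {"00:00:0C:11:11:11": "02:00:F0:01:01:01"}
location_mac_to_vm_mac = {v: k for k, v in vm_mac_to_location_mac.items()}

def rewrite_outbound(packet):
    """Packet from the VM toward the fabric: replace SMAC with the location-based MAC."""
    packet = dict(packet)
    packet["smac"] = vm_mac_to_location_mac[packet["smac"]]
    return packet

def rewrite_inbound(packet):
    """Packet from the fabric toward the VM: restore the VM's actual MAC as DMAC."""
    packet = dict(packet)
    packet["dmac"] = location_mac_to_vm_mac[packet["dmac"]]
    return packet

pkt_213 = {"smac": "00:00:0C:11:11:11", "dmac": "02:00:F0:02:01:01"}
print(rewrite_outbound(pkt_213))   # corresponds to packet 215
pkt_216 = {"smac": "02:00:F0:02:01:01", "dmac": "02:00:F0:01:01:01"}
print(rewrite_inbound(pkt_216))    # corresponds to packet 214
```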
FIG. 4c illustrates yet another embodiment of using a location-based MAC address. VM 221 uses a globally unique MAC address and is not aware of any location-based MAC address. Even the ARP response to VM 221's ARP request provides a globally unique MAC address for the target. That can be achieved by forwarding ARP requests to tenant group members and allowing tenant group members to reply to ARP requests. As a result, packet 223 and packet 224 do not contain any location-based source or destination MAC addresses. Virtual switch 222 maintains mappings between the globally unique MAC addresses and the location-based MAC addresses assigned to the VMs. Virtual switch 222 encapsulates a packet from VM 221 in a Provider Backbone Bridges (PBB) encapsulation. For example, packet 223 is encapsulated in packet 225. The outer destination and source MAC addresses of packet 225 are the location-based MAC addresses of the target VM and VM 221, respectively. Similarly, virtual switch 222 decapsulates packet 226 and forwards packet 224 to VM 221. Other encapsulation methods, such as MPLS and GRE, can be used.
FIG. 4d illustrates an embodiment of using location-based MAC addresses for a data center switch fabric partitioned into multiple partitions separated by an IP network. The traffic from one partition to another partition needs to be tunneled through the IP network. There can be various embodiments of tunneling the traffic. One embodiment is to forward packets destined to another partition to a special gateway and let the special gateway encapsulate the packets using MPLS. Another embodiment is to have a virtual switch encapsulate the packets from its connected VMs to another partition using MPLS. The embodiment in
In step 305, the controller ensures that there is an OpenFlow session to each of the Ethernet switches and virtual switches in the switch fabric. Here we assume that the data center orchestration tool has configured the Ethernet switches and virtual switches to accept OpenFlow sessions. If there is no existing session to an Ethernet switch or a virtual switch, the controller establishes one. The controller may have network connectivity to the Ethernet switches via their management Ethernet interfaces, which are distinct from the switch ports.
In step 307, the controller discovers the network topology via Link Layer Discovery Protocol (LLDP). The controller injects LLDP packets into each of the Ethernet switches and virtual switches via the corresponding OpenFlow session. The LLDP packets are to be sent out on each port of the switch entity. Some or all of those LLDP packets are to be received by the peering switch entities. The controller captures those received LLDP packets via OpenFlow sessions and thereby deduces the network topology. The connectivity between VMs and their virtual switches is obtained from the data center orchestration tool.
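As a non-limiting illustration, the following Python sketch shows the bookkeeping behind such discovery: the controller records the switch and port on which each LLDP probe was injected, and each probe captured elsewhere reveals one link. The identifiers and data structures are hypothetical.

```python
# Illustrative sketch: deduce fabric adjacencies from injected and captured LLDP probes.

injected = {}   # probe_id -> (switch, port) the probe was sent out of
links = set()   # discovered (switch_a, port_a, switch_b, port_b) adjacencies

def inject_probe(probe_id, switch, port):
    """Record that an LLDP probe was sent out of the given switch port."""
    injected[probe_id] = (switch, port)

def on_probe_received(probe_id, switch, port):
    """An LLDP probe captured via OpenFlow on another switch implies a link."""
    src_switch, src_port = injected[probe_id]
    links.add((src_switch, src_port, switch, port))

inject_probe("p1", "edge102", 2)
on_probe_received("p1", "spine101", 1)   # edge102 port 2 <-> spine101 port 1
print(links)
```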
In step 309, the controller detects whether there is any addition, removal, or migration of VMs. The controller may obtain related information from the data center orchestration tool.
In step 311, the controller assigns a location-based MAC address to an added or migrated VM. The location is determined with respect to the network topology. When the MAC address embodiment requires that the VM use the location-based MAC address directly, the controller informs the data center orchestration tool about the assignment, and the data center orchestration tool is to configure the location-based MAC address on the VM.
In step 313, the controller programs forwarding rules onto the Ethernet switches using the location-based MAC addresses. Various implementations of the forwarding rules are illustrated in
In step 315, the controller programs forwarding rules onto the virtual switches using the location-based MAC addresses. Various implementations of the forwarding rules are illustrated in
In step 317, the controller checks whether there is any broadcast packet from a VM captured and received via an OpenFlow session. The check applies only if the controller has programmed the virtual switches, in step 315, to forward broadcast packets from VMs to the controller via the OpenFlow sessions. Otherwise, a special gateway or the virtual switches are expected to handle broadcast packets and ARP requests captured on the virtual switches.
In step 319, the controller differentiates an ARP request from other broadcast packets. In step 321, the controller converts a broadcast packet into unicast packets by replacing the destination MAC address of the broadcast packet with the MAC addresses of the other tenant group members. In step 323, the controller provides the MAC address of the tenant group member requested by the VM that sent the ARP request.
The information about Ethernet switches is passed from the North-bound API module 401 to the Ethernet switch management module 402. The information about the virtual switches is passed to the virtual switch management module 403. The information about VMs is passed to VM management module 404. The Ethernet switch management module 402 and the virtual switch management module 403 maintain the OpenFlow sessions, as in step 305 of
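As a non-limiting illustration, the following Python sketch outlines how the modules described above might be wired together. The class and method names are hypothetical placeholders for the North-bound API module 401, the Ethernet switch management module 402, the virtual switch management module 403, and the VM management module 404.

```python
# Illustrative skeleton of the controller modules; names are hypothetical.

class EthernetSwitchManager:              # module 402: OpenFlow sessions to Ethernet switches
    def sync(self, switches):
        for sw in switches:
            print("ensuring OpenFlow session to Ethernet switch", sw)

class VirtualSwitchManager:               # module 403: OpenFlow sessions to virtual switches
    def sync(self, vswitches):
        for vs in vswitches:
            print("ensuring OpenFlow session to virtual switch", vs)

class VMManager:                          # module 404: VM locations and MAC assignment
    def sync(self, vms):
        for vm in vms:
            print("tracking VM", vm)

class NorthboundAPI:                      # module 401: receives orchestration tool data
    def __init__(self, eth_mgr, vsw_mgr, vm_mgr):
        self.eth_mgr, self.vsw_mgr, self.vm_mgr = eth_mgr, vsw_mgr, vm_mgr

    def on_orchestrator_update(self, switches, virtual_switches, vms):
        # Pass each category of information to the corresponding module.
        self.eth_mgr.sync(switches)
        self.vsw_mgr.sync(virtual_switches)
        self.vm_mgr.sync(vms)

api = NorthboundAPI(EthernetSwitchManager(), VirtualSwitchManager(), VMManager())
api.on_orchestrator_update(["edge102", "edge103"], ["vswitch212"], ["vm211"])
```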
The present invention is also applicable to a data center network that comprises non-virtualized physical machines. In that case, the forwarding rules that would be applied to virtual switches are applied to the edge switches.
The embodiments described above are illustrative examples and it should not be construed that the present invention is limited to these particular embodiments. Thus, various changes and modifications may be effected by one skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.