The present invention is in the field of multi-function network switches and switched networking fabrics.
A method and apparatus allow the interconnection of host systems using the switching capability of a multi-port intelligent network interface card. Distributed or centralized control plane implementations are possible, including software-defined network control.
Network topologies typically consist of “sub-networks” of end-stations/host systems connected through layer 2 switches, which are in turn interconnected by layer 3 switches or (edge) routers into larger networks. Edge routers or gateways may bridge between different network technologies such as Ethernet and SDH.
The inventors have realized that a L2 to L7 network may be implemented using a multi-port switch building block consisting of a multi-port intelligent network interface card (NIC), without necessitating a dedicated switch. It is, for example, possible to build a 16-node switched interconnect from 16 4-port building blocks. Such a multi-port L2 to L7 switch may be used as a storage/compute cluster interconnect, where it replaces one or more external switches. The multi-port switch may utilize routing protocols such as IS-IS or OSPF for packet delivery. It may further support ACL, TCP proxy (L4 switching), iSCSI switching (L7 switching), and multi-function gateway capability, translating between different protocols such as iSCSI/FCoE/FC/SAS, etc.
For example, a multi-port switch building block may have four 1GE/10GE ports, a PCIe interface bus, an integrated control processor (uP) and an additional 1GE NCSI management interface.
Alternatively, a centralized control model may be used to collect network state from the participating nodes, and determine appropriate routes for traffic between all pairs of nodes, e.g. by running a shortest path first (SPF) algorithm (which may support multi-path routing), as well as a suitable algorithm for multicast/broadcast packet delivery. A central control point may utilize a protocol such as OpenFlow to exchange state with and configure the building block nodes.
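As an illustration only, a central controller's route computation can be sketched with a standard Dijkstra shortest-path-first pass over the collected link state; the link-list format and function name here are hypothetical assumptions, not part of the invention:

```python
import heapq

def spf_next_hops(links, src):
    """Dijkstra SPF over an undirected link list.

    links: iterable of (node_a, node_b, cost) tuples gathered from the
    participating nodes. Returns {dest: next_hop}, i.e. for each
    destination, the neighbor to which `src` should forward traffic.
    """
    adj = {}
    for a, b, cost in links:
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))
    dist = {src: 0}
    next_hop = {}
    heap = [(0, src, None)]  # (distance, node, first hop out of src)
    while heap:
        d, node, hop = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale entry
        if hop is not None:
            next_hop[node] = hop
        for nbr, cost in adj.get(node, []):
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                # when leaving src, the first hop is the neighbor itself
                heapq.heappush(heap, (nd, nbr, nbr if node == src else hop))
    return next_hop
```

The returned next-hop map is what a central control point would translate into per-node filter/forwarding rules, e.g. via OpenFlow.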
As a result of operating a routing protocol, or based on central control determination, the packet processing functionality of the building block is configured to switch ingress and egress packets to the appropriate port.
In some examples, the switch is configured for rewriting of DA, SA, and VLAN; and in some examples, the switch is also configured for rewriting of FIP, LIP, FP, LP or other fields of the network stack headers. Relevant TCAM functionality is described, for example, in U.S. Pat. No. 7,760,733 and U.S. Pat. No. 7,616,563, each of which is incorporated by reference herein in its entirety.
An MPS TCAM entry can optionally contain a list of VLANs that are allowed for the particular DA, and an entry can also contain information on which port(s) packets with the particular DA are allowed to arrive on. The multicast list is described in the next subsection and is used to accomplish multicast and broadcast switching action.
The LE tuple can optionally contain IP header fields (L3 information), e.g. LIP (Local IP address) and/or FIP (Foreign IP address), Protocol Number, and TOS/DSCP value; TCP/UDP header fields (L4 information), e.g. LP (Local Port number) and/or FP (Foreign Port number); and parts of the payload, e.g. TCP payload and/or UDP payload and/or other payload (L7 information). When the tuple contains only DA index information, the switching is pure L2 switching; when it additionally contains L3 information, the switching is L3 switching, i.e. routing; and when it contains parts of the payload, the switching is L7 switching. The TCAM can contain don't-care information that enables simultaneous L2, L3, L4, and L7 switching for different Ethernet frames. The FCB actions can also include dropping a packet matching the filter entry, and the switch may therefore implement Access Control List (ACL) and firewall functionality for incoming packets.
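As a minimal software sketch of such ternary matching (not the actual TCAM hardware), each entry can be modeled as a value/mask pair in which clear mask bits are don't cares; the key layout below (DA index in the upper bits, LIP in the lower 32) is an assumption chosen purely for illustration:

```python
class TernaryEntry:
    """One entry: match when (key & mask) == (value & mask)."""
    def __init__(self, value, mask, action):
        self.value, self.mask, self.action = value, mask, action

    def matches(self, key):
        return (key & self.mask) == (self.value & self.mask)

def tcam_lookup(entries, key):
    """Return the action of the first (highest-priority) matching entry."""
    for e in entries:
        if e.matches(key):
            return e.action
    return None

# Hypothetical key layout: DA index in bits 32..47, LIP in bits 0..31.
def make_key(da_index, lip=0):
    return (da_index << 32) | lip

# An L3 rule (DA index and LIP both matched) is placed ahead of a pure
# L2 rule (LIP bits don't care), so the more specific rule wins; both
# can coexist in the same table for different frames.
rules = [
    TernaryEntry(make_key(5, 0x0A000001), (0xFFFF << 32) | 0xFFFFFFFF, "fwd_port3"),
    TernaryEntry(make_key(5), 0xFFFF << 32, "fwd_port2"),
]
```

A frame with DA index 5 and LIP 10.0.0.1 hits the L3 rule; any other frame with DA index 5 falls through to the L2 rule.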
The FCB Action can also include re-writing of header fields such as DA, SA (or swapping of the DA and SA), and VLAN, as well as removing the VLAN tag or inserting a VLAN (or other) tag. The FCB/TCB can also support offload of TCP packets and full TCP proxying (payload transfer) between two TCP connections.
In addition, the μP may be useful as the destination for control plane packets, for example for a configuration protocol to configure each of the MPS TCAM, LE TCAM/hash and FCB in a fabric of “S” switch building blocks. For this purpose, the MPS TCAM, the LE table, and the FCB can be initialized with an entry that switches all incoming Ethernet frames with a particular DA-μP to the local μP. That μP in turn can send a configuration packet to each of the 4 nearest neighbors, etc.
The multi-port switch building block may implement multicast and broadcast by using the mcast-list bit-mask in the MPS TCAM. In the example above, a frame with destination address DA1 is replicated for each bit set in the mcast-list bit-mask, and each frame copy is sent through the LE table and FCB with its associated replication index; this in turn enables different switching and re-writing actions for each of the copies of the frame created by the multicast. Examples of this are described later, where it is shown how the multi-port switch building block implements flooding, i.e. the sending of a copy of a frame to each of the output ports.
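The replication step can be sketched as follows; the frame and port representations are illustrative assumptions, not the hardware data path:

```python
def replicate(frame, mcast_mask, nports=4):
    """Make one copy of the frame per set bit in the mcast-list bit-mask.

    Each copy is tagged with its replication index (the bit position),
    so that downstream LE/FCB processing can apply a different switching
    or rewriting action to each copy.
    """
    return [(idx, dict(frame))
            for idx in range(nports)
            if mcast_mask & (1 << idx)]
```

For example, a mask of 0b1011 yields three copies with replication indices 0, 1, and 3, each of which can then be independently rewritten and switched.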
The following table summarizes some example features of the 12-node cluster:
The 8-port butterfly switch may be implemented as follows. The multi-port switch building blocks have 4×10 Gbps Ethernet ports and the capability to switch between each of the 4 ports at line rate, and each S has a 512-entry Ethernet Address (DA) MPS TCAM and 2K LE ternary filter/forwarding rules with up to 1M standard lookup rules. Each DA TCAM entry has a list of allowable arrival ports and a list of up to 16 allowed VLAN tags associated with the DA.
At a high level, the switch operates by associating each DA MPS TCAM entry with a filtering/forwarding rule; in combination, the DA TCAM and the filtering/forwarding rule are used to allow an Ethernet frame with a particular DA to arrive on a particular S port and to forward/switch an incoming Ethernet frame to an output port. It is also possible to specify on which ports packets with DA=DA1 are allowed to arrive; for S[0,0] this includes all ports except port 3, i.e. ports 0, 1, and 2 are legal arrival ports, and the DA TCAM can be programmed to drop any frames that arrive on other ports, i.e. port 3.
The example design uses 12 switch elements, which could be pruned to 8 because the rightmost column S[*,2] can be replaced by direct connections. The purpose of showing the complete diagram is to give an idea of how the design may be scaled to a larger number of ports. It is also noted that there are multiple paths available between some ports, e.g. it is possible to switch from SW[0,0] to SW[1,0] through either SW[0,1] or SW[1,1].
The following TCAM table example describes how the MPS TCAM, the LE table, and the FCB are configured for the switching of packets with destination and source addresses A1 and A2 between ports 1 and 4.
The filter table shows how the filter/forwarding rules may be set up in the different SW instances to forward the frames correctly to the different ports of the SW instance.
We now describe examples of flooding in the SW 4-port switch building block. Flooding a packet that arrives on one of the 4 ports of the switch building block involves sending a copy of the packet to the other 3 ports of the building block. The hardware multicasting capability of the SW switch building block can be used to flood the other ports when the DA lookup doesn't produce a hit, i.e. when the forwarding rule for an Ethernet address is not known.
Flooding uses a default flooding entry, placed in the last entry of the DA MPS TCAM with don't-care values for the address, replicating the packet to 4 VIs (Virtual Interfaces) and using 16 LE filter/forwarding rules to flood the packet to the ports other than the one it arrived on. The packet can also optionally be flooded to the uP and the PCIe bus.
The table shows an example configuration where the first 4 (optionally 6) VI are used to flood a frame.
The example uses the DROP rule to drop the inbound frame because the MPS TCAM does not use the port number as part of the lookup key, but instead looks up the allowed ports for a DA after producing a hit in the TCAM.
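As a rough sketch of such a configuration (the table layout and rule names are assumptions for illustration), the 16 LE rules can be modeled as a 4×4 table over (VI, arrival port), with the DROP rule suppressing the copy aimed back at the arrival port:

```python
def flood_table(nports=4):
    """Build the 16 (VI, arrival-port) flood rules for a 4-port block.

    The MPS TCAM's default don't-care entry replicates a missed frame to
    one VI per port; the LE rule for (vi, port) then forwards the copy
    out port `vi`, except that the copy aimed back at the arrival port
    is dropped, since the MPS TCAM does not key on the port number.
    """
    return {(vi, port): "DROP" if vi == port else f"fwd_port{vi}"
            for vi in range(nports)
            for port in range(nports)}
```

Of the 16 rules, exactly 4 are DROP rules (one per port), so a frame arriving on any port is copied out the other 3 ports only.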
There are cases where not all the Ethernet ports of a building block instance are in use: a port may not be connected/active, or a port may not forward any packets between a pair of ports. In these cases there are no flooding entries between two such ports.
A switch can learn forwarding rules through several mechanisms, for example:
For example, an OSPF- or IS-IS-like protocol can be implemented to exchange the addresses (replacing IP addresses with MAC addresses) of the participating nodes, and to construct the topology of the network using flooding of link state advertisement (LSA) messages. Once the topology is discovered, a multi-path Shortest Path First algorithm such as Dijkstra's can be used to determine routes to all the nodes in the network. The computed routes are then used to program a filter rule per destination node.
Multicast and broadcast delivery trees can be determined by running a similar algorithm per multicast group. The computed trees are then used to program a filter rule per multicast address at each node.
Furthermore, in order to provide connectivity outside of the grid, gateway nodes with connectivity to the “outside world” can be identified during the network discovery phase. Default filter rules can then be programmed to send traffic to unknown addresses towards these nodes, which in turn program their default rules to send out the unmatched packets on the ports connected to the outside.
We now discuss a process for L2 learning. An L2 learning process learns SA=SA0 from a packet received on port=i: it looks up SA0 in the MPS TCAM and, if the lookup doesn't produce a hit, it learns to forward DA=SA0 to port=i. The SW switch building block can be enhanced to look up SA and DA separately and perform two actions, i.e. learn the SA if necessary and in addition forward on the DA.
The SW learns an SA0 when a lookup does not produce a hit, and the SW floods a DA0 when it does not produce a hit in the MPS TCAM.
The learning process is the following: a frame (DA=DA0, SA=SA0) arrives on port=i. If the SA0 lookup in the DA TCAM produces a hit only on the don't-care (flood) entry, then SA0 is not yet known, and an entry is programmed that forwards frames with DA=SA0 to port=i.
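The learning steps above can be sketched in software as follows; the dict-based table standing in for the MPS TCAM and the frame representation are illustrative assumptions:

```python
def l2_learn_and_forward(tcam, frame, arrival_port, nports=4):
    """Two-step SA/DA lookup: learn the SA, then forward on the DA.

    tcam maps MAC address -> output port. An SA absent from the table
    (i.e. one that would hit only the don't-care flood entry) is
    learned against the arrival port; a DA miss is flooded to the
    other ports of the building block.
    """
    sa, da = frame["sa"], frame["da"]
    if sa not in tcam:
        tcam[sa] = arrival_port  # learn: forward DA=SA to this port
    if da in tcam:
        return [tcam[da]]        # known DA: single output port
    return [p for p in range(nports) if p != arrival_port]  # flood
```

After a first frame from A1 to A2 is flooded, a reply from A2 to A1 is delivered directly to A1's learned port, and both addresses end up in the table.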
Number | Name | Date | Kind |
---|---|---|---|
5790553 | Deaton, Jr. | Aug 1998 | A |
6578086 | Regan | Jun 2003 | B1 |
6757725 | Frantz | Jun 2004 | B1 |
7760733 | Eiriksson et al. | Jul 2010 | B1 |
8054832 | Shukla et al. | Nov 2011 | B1 |
20030200315 | Goldenberg et al. | Oct 2003 | A1 |
20040172485 | Naghshineh et al. | Sep 2004 | A1 |
20070036178 | Hares | Feb 2007 | A1 |
20080133709 | Aloni | Jun 2008 | A1 |
20100020814 | Thyni | Jan 2010 | A1 |
20100214950 | Vobbilisetty | Aug 2010 | A1 |
20100312941 | Aloni et al. | Dec 2010 | A1 |
20110087774 | Pope et al. | Apr 2011 | A1 |
20110090915 | Droux et al. | Apr 2011 | A1 |
20110149966 | Pope et al. | Jun 2011 | A1 |
20110299543 | Diab et al. | Dec 2011 | A1 |
20120016970 | Shah | Jan 2012 | A1 |
20120117366 | Luo | May 2012 | A1 |
20120151004 | Pope | Jun 2012 | A1 |
20120155256 | Pope et al. | Jun 2012 | A1 |
20120170585 | Mehra | Jul 2012 | A1 |
20130022045 | An | Jan 2013 | A1 |
20130080567 | Pope | Mar 2013 | A1 |
20130195111 | Allan | Aug 2013 | A1 |
20140310405 | Pope et al. | Oct 2014 | A1 |
Entry |
---|
http://www.ieee802.org/1/files/public/docs2009/new-hudson-vepa-seminar-20090514d.pdf; C. Hudson and P. Congdon, Edge Virtual Bridging with VEB and VEPA, May 14, 2009. |
U.S. Appl. No. 13/330,513, filed Dec. 19, 2011. |
Office Action in U.S. Appl. No. 13/330,513, dated Sep. 12, 2013. |
Final Office Action in U.S. Appl. No. 13/330,513, dated Apr. 11, 2014. |
Advisory Action in U.S. Appl. No. 13/330,513, dated Jul. 8, 2014. |
Office Action in U.S. Appl. No. 13/330,513, dated Apr. 8, 2015. |
Office Action in U.S. Appl. No. 13/330,513, dated Oct. 14, 2015. |