The present technology pertains to data networks, and more specifically pertains to automatically detecting loops within a multicast tree in order to repair them, thus ensuring loop-free multicast data delivery during network convergence.
In a proposed multicast implementation for Virtual Extensible Local Area Network (“VXLAN”) networks, multiple Forwarding Tag (FTAG) multicast trees are constructed from the dense bipartite graph of fabric nodes and edges. Each such multicast tree (henceforth called an “FTAG tree”) is used to forward tenant multicast packets, which are encapsulated in VXLAN and distributed to various fabric edge switches (ToRs). Multiple trees are created for load balancing purposes. An external controller decides a suitable root node for each FTAG instance and distributes this information to all the member switches of a fabric. The FTAG trees are created in a distributed manner, where each node independently decides (through an algorithm) which local links should be included in a given instance of an FTAG tree. During periods of network convergence (which might be due to a link failure or an FTAG root failure), nodes may have disparate views of the network, and a loop may consequently be created in the FTAG tree construction. Loops are problematic for a multicast tree because they cause duplicate packets to be delivered to the tenant end nodes. Because forwarding latency is very low, even a loop that persists only transiently can result in a large number of duplicate packets being sent to the tenant end hosts.
In order to describe the manner in which the above-recited and other advantages and features of the disclosure can be obtained, a more particular description of the principles briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only exemplary embodiments of the disclosure and are not therefore to be considered to be limiting of its scope, the principles herein are described and explained with additional specificity and detail through the use of the accompanying drawings in which:
The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology can be practiced. The appended drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a more thorough understanding of the subject technology. However, it will be clear and apparent that the subject technology is not limited to the specific details set forth herein and may be practiced without these details. In some instances, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.
The disclosed technology addresses the need in the art for automatically detecting one or more loops within a multicast tree and repairing the loops, thus ensuring loop-free multicast data delivery.
Overview
In one aspect of the disclosure, a method for detecting the presence of a loop in a multicast tree is provided. The method includes calculating a multicast tree radius for a first multicast tree. The multicast tree radius represents a maximum number of hops from a root node to a furthest edge node in the first multicast tree. The method further includes forwarding, by the root node, a first packet to each edge node within the first multicast tree, the first packet having a time-to-live (TTL) value equal to twice the first multicast tree radius, receiving, at the root node, a copy of the forwarded first packet, and determining an existence of a loop in the first multicast tree based at least upon receiving the copy of the forwarded first packet.
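The radius and probe TTL used in this method can be illustrated with a minimal Python sketch, assuming the root node has the FTAG tree available as an undirected adjacency list. The function names `multicast_tree_radius` and `probe_ttl`, and the example topology `FTAG_TREE`, are hypothetical and serve only as illustration.

```python
from __future__ import annotations

from collections import deque


def multicast_tree_radius(adjacency: dict[str, list[str]], root: str) -> int:
    """Maximum hop count from the root to any node of the tree (breadth-first search).

    In a tree the furthest node from the root is always a leaf, so this equals
    the number of hops to the furthest edge node.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in depth:  # first visit gives the shortest hop count
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return max(depth.values())


def probe_ttl(radius: int) -> int:
    """TTL placed on the loop-detection probe: twice the tree radius."""
    return 2 * radius


# Hypothetical FTAG tree rooted at spine1 (undirected adjacency list).
FTAG_TREE = {
    "spine1": ["tor1", "tor2"],
    "tor1": ["spine1"],
    "tor2": ["spine1", "spine2"],
    "spine2": ["tor2", "tor3", "tor4"],
    "tor3": ["spine2"],
    "tor4": ["spine2"],
}

radius = multicast_tree_radius(FTAG_TREE, "spine1")
print(radius, probe_ttl(radius))  # 3 6
```

Here tor3 and tor4 are three hops from spine1, so the radius is 3 and the probe would carry a TTL of 6.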
In another aspect of the disclosure, a system for detecting the presence of a loop in a multicast tree is provided. The system includes at least one edge node and a root node in communication with the at least one edge node, the root node and the at least one edge node forming a first multicast tree. The root node is configured to calculate a multicast tree radius for the first multicast tree, the multicast tree radius representing a maximum number of hops from the root node to a furthest edge node in the first multicast tree, forward a first packet to each edge node within the first multicast tree, the first packet having a time-to-live (TTL) value equal to twice the first multicast tree radius, receive a copy of the forwarded first packet, and determine an existence of a loop in the first multicast tree based at least upon receiving the copy of the forwarded first packet.
Yet another aspect of the disclosure provides a non-transitory computer-readable storage medium having stored therein instructions which, when executed by a processor, cause the processor to perform a series of operations. These operations include calculating a multicast tree radius for a first multicast tree, where the multicast tree radius represents a maximum number of hops from a root node to a furthest edge node in the first multicast tree, forwarding a first packet to each edge node within the first multicast tree, the first packet having a time-to-live (TTL) value equal to twice the first multicast tree radius, receiving a copy of the forwarded first packet, and determining an existence of a loop in the first multicast tree based at least upon receiving the copy of the forwarded first packet.
The present disclosure describes systems, methods, and non-transitory computer-readable storage media for detecting and repairing loops in multicast trees. In one implementation, the present disclosure provides a VXLAN or alternate overlay solution in a multicast tree built in an infra network, where the multicast tree transmits tenant traffic among hosts via switches or nodes. In a multicast implementation, multiple forwarding tag (FTAG) multicast trees are constructed from the dense bipartite graph of fabric nodes and switches. Each multicast tree (also referred to as an “FTAG tree”) is used to forward a tenant multicast data packet, which is encapsulated inside a VXLAN packet; this VXLAN packet is forwarded for distribution to various fabric edge nodes or switches (also referred to as ToRs). The edge nodes or switches decapsulate the VXLAN packet and forward the tenant multicast traffic to tenant recipient hosts. Multiple trees can exist in the network and are created for load balancing purposes. An external controller decides a suitable root node for each FTAG instance and distributes the information to all member switches of a given fabric. The FTAG trees are created in a distributed manner, where each node independently decides, via an algorithm, which local links should be included in a given instance of an FTAG tree.
During periods of network convergence, which may be due to a link failure or an FTAG root node failure, the nodes may have disparate views of the network, which can cause a loop to be created in the FTAG tree construction. Any tenant multicast traffic forwarded using an FTAG tree with a loop will result in multiple duplicate packets being delivered to end nodes, which is highly undesirable. The present disclosure applies to all types of multicast environments. One type of multicast environment is where tenant multicast traffic is encapsulated within a VXLAN outer packet. The outer packet's destination Internet Protocol (“IP”) address is a multicast address that is derived from the inner packet's multicast address and maps to an infra-Virtual Routing and Forwarding (“VRF”) multicast address. The outer packet is distributed over the infra-VRF to the various ToR switches where the inner multicast group's interested receivers are present, and the egress ToRs decapsulate and forward the packet to the interested receivers.
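The text states only that the outer (GIPo) address is derived from the inner group address; it does not say how. As a purely hypothetical illustration, an implementation could hash the inner (GIPi) address into a pool of infra multicast addresses. The pool `239.255.0.0/16`, the CRC-32 hash, and the name `derive_gipo` are assumptions, not the disclosed mapping.

```python
import ipaddress
import zlib


def derive_gipo(gipi: str, gipo_pool: str = "239.255.0.0/16") -> str:
    """Map a tenant multicast group (GIPi) to an outer infra group (GIPo).

    Hypothetical scheme: hash the 4-byte inner group address and use the
    result as an offset into a pool of infra-VRF multicast addresses.
    """
    inner = ipaddress.IPv4Address(gipi)
    if not inner.is_multicast:
        raise ValueError(f"{gipi} is not a multicast address")
    pool = ipaddress.IPv4Network(gipo_pool)
    offset = zlib.crc32(inner.packed) % pool.num_addresses
    return str(pool.network_address + offset)


print(derive_gipo("225.1.1.1"))  # a deterministic address inside 239.255.0.0/16
```

Because the mapping is deterministic, every ingress ToR derives the same outer group for a given tenant group; many-to-one collisions are possible and would simply share an outer group.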
The distribution tree is based on an FTAG tree constructed in the overlay/infra space. Each tree has an FTAG root and the tree can be bi-directional. As a multicast source sends a packet, the ingress ToR connected to this source encapsulates the inner tenant multicast packet in a VXLAN packet and forwards the packet over all the branches on the chosen FTAG tree. The packet will travel towards the root of the FTAG tree and then be redistributed to other branches of the tree until it reaches the intended edge switch/ToR. The leaf ToRs decapsulate and forward the packet to receivers that they are in communication with.
During times of network failure, the tree may contain one or more loops. In these instances, a packet forwarded over a tree that has intermediate loops is replicated, and duplicate copies are delivered to the end hosts.
The interfaces 168 are typically provided as interface cards (sometimes referred to as “line cards”). Generally, they control the sending and receiving of data packets over the network and sometimes support other peripherals used with the router 110. Among the interfaces that may be provided are Ethernet interfaces, frame relay interfaces, cable interfaces, Digital Subscriber Line (“DSL”) interfaces, token ring interfaces, and the like. In addition, various very high-speed interfaces may be provided, such as fast token ring interfaces, wireless interfaces, Ethernet interfaces, Gigabit Ethernet interfaces, Asynchronous Transfer Mode (“ATM”) interfaces, High Speed Serial Interfaces (“HSSI”), Packet-Over-SONET (“POS”) interfaces, Fiber Distributed Data Interfaces (“FDDI”), and the like. Generally, these interfaces may include ports appropriate for communication with the appropriate media. In some cases, they may also include an independent processor and, in some instances, volatile random access memory (RAM). The independent processors may control communications-intensive tasks such as packet switching, media control, and management. By providing separate processors for the communications-intensive tasks, these interfaces allow the master microprocessor 162 to efficiently perform routing computations, network diagnostics, security functions, etc.
Although the system shown in
Regardless of the network device's configuration, it may employ one or more memories or memory modules (including memory 161) configured to store program instructions for the general-purpose network operations and mechanisms for roaming, route optimization and routing functions described herein. The program instructions may control the operation of an operating system and/or one or more applications, for example. The memory or memories may also be configured to store tables such as mobility binding, registration, and association tables, etc.
The communications interface 240 can generally govern and manage the user input and system output. There is no restriction on operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.
Storage device 230 is a non-volatile memory and can be a hard disk or other types of computer readable media which can store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, random access memories (RAMs) 225, read only memory (ROM) 220, and hybrids thereof.
The storage device 230 can include software modules 232, 234, 236 for controlling the processor 210. Other hardware or software modules are contemplated. The storage device 230 can be connected to the system bus 205. In one aspect, a hardware module that performs a particular function can include the software component stored in a computer-readable medium in connection with the necessary hardware components, such as the processor 210, bus 205, display 235, and so forth, to carry out the function.
In one example of the methodology described herein, Virtual Extensible Local Area Network (“VXLAN”) is utilized as the infrastructure layer's encapsulation protocol. However, the use of VXLAN is exemplary only, and the methodology can be implemented using any encapsulation technology such as, for example, Transparent Interconnection of Lots of Links (TRILL). In VXLAN, the user's data traffic is injected into the VXLAN network from an ingress switch, which encapsulates the user's data traffic within a VXLAN packet with the outer UDP source port set to a value based on the inner packet's header information. This dynamic setting of the UDP source port allows packets to follow alternate Equal Cost Multi-Paths (ECMPs) within the VXLAN infra-network. At the egress switch (the boundary of the VXLAN network), the packet is decapsulated and the inner packet (the user data packet) is forwarded out.
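The source-port behavior can be sketched as follows. RFC 7348 recommends computing the outer UDP source port from a hash of inner-packet header fields and drawing it from the dynamic/private port range 49152 to 65535; the choice of CRC-32 over the inner 5-tuple and the function name `vxlan_source_port` are illustrative assumptions.

```python
import zlib

# Dynamic/private port range recommended for the VXLAN outer UDP source port (RFC 7348).
PORT_RANGE_START = 49152
PORT_RANGE_SIZE = 65536 - PORT_RANGE_START  # 16384 ports


def vxlan_source_port(src_ip: str, dst_ip: str, proto: int,
                      src_port: int, dst_port: int) -> int:
    """Derive the outer UDP source port from the inner packet's 5-tuple.

    Packets of one inner flow always get the same port (so they stay on one
    ECMP path and are not reordered), while different flows spread across paths.
    The CRC-32 hash is an illustrative assumption.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return PORT_RANGE_START + zlib.crc32(key) % PORT_RANGE_SIZE


port = vxlan_source_port("10.0.0.5", "225.1.1.1", 17, 5000, 6000)
print(port)  # same value on every call for this flow
```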
In one example of the present disclosure, an automatic loop detection methodology is employed. This methodology can be implemented by the CPU 162 of network device 110 shown in
An infra-space multicast group address (GIPo) is reserved for each FTAG instance. After detecting a transient condition in the network, root node 302 can initiate a loop detection algorithm. In the loop detection algorithm, root node 302 sends a crafted data packet, whose outer multicast address is the reserved GIPo address and whose time-to-live (“TTL”) value is set to twice the FTAG tree radius, to each node 304 of the FTAG tree. The VXLAN-encapsulated packet, with its destination address set to the GIPo, is forwarded across the chosen FTAG tree. Root node 302 first forwards the packet to the immediate ToRs that are part of the FTAG tree. In turn, each subsequent switch that receives this packet forwards it on all branches of the tree except the link on which the packet was received. If there are no loops in the FTAG tree, this packet follows the branches of the tree and ends up at the edge ToRs, where the inner packet is dropped and not forwarded to the tenant servers: because the packet uses a reserved GIPo, there are no corresponding GIPi (tenant multicast group) addresses to which it maps. However, if there is a loop, the packet is replicated at the point where the loop starts, and a copy of the packet eventually travels back toward root node 302. Upon receiving this duplicate packet, root node 302 can determine that the FTAG tree has at least one cycle, i.e., that a loop exists. Any node 304 that observes the packet's outer TTL drop to 0 reports this to root node 302.
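The probe's behavior can be checked with a small simulation. This is a sketch under simplifying assumptions, not switch forwarding logic: each node's local view of which links belong to the FTAG tree is given as a set of neighbors (views may disagree during convergence), a receiving node accepts the probe on any link, the TTL is decremented once per hop, and the root absorbs any copy that reaches it. `flood_probe` and both topologies are hypothetical.

```python
from __future__ import annotations

from collections import deque


def flood_probe(ftag_links: dict[str, set[str]], root: str, ttl: int) -> tuple[int, set[str]]:
    """Simulate the loop-detection probe.

    ftag_links maps each node to the neighbors that *this node* currently
    considers part of the FTAG tree. Returns (number of copies that came back
    to the root, nodes that received the probe with TTL 0).
    """
    copies_at_root = 0
    ttl_expired_at: set[str] = set()
    # Pending deliveries: (receiving node, sending node, TTL on arrival).
    pending = deque((nbr, root, ttl - 1) for nbr in ftag_links[root])
    while pending:
        node, came_from, arrival_ttl = pending.popleft()
        if node == root:
            copies_at_root += 1        # a copy returned: the tree contains a cycle
            continue
        if arrival_ttl == 0:
            ttl_expired_at.add(node)   # this node would report TTL=0 to the root
            continue
        for nbr in ftag_links[node]:
            if nbr != came_from:       # forward on every tree link except the ingress one
                pending.append((nbr, node, arrival_ttl - 1))
    return copies_at_root, ttl_expired_at


# Loop-free tree rooted at spine1 (radius 3, so the probe TTL is 6).
TREE = {
    "spine1": {"tor1", "tor2"},
    "tor1": {"spine1"},
    "tor2": {"spine1", "spine2"},
    "spine2": {"tor2", "tor3", "tor4"},
    "tor3": {"spine2"},
    "tor4": {"spine2"},
}
# During convergence tor1 and spine2 both also install the tor1-spine2 link.
LOOPED = {**TREE, "tor1": {"spine1", "spine2"}, "spine2": {"tor1", "tor2", "tor3", "tor4"}}

print(flood_probe(TREE, "spine1", 6))    # (0, set())
print(flood_probe(LOOPED, "spine1", 6))  # (2, set())
```

In the looped case one copy returns through tor1 and one through tor2. A TTL=0 report arises when a cycle exhausts the TTL before a copy gets back to the root; this small example does not exercise that path.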
Thus, when root node 302 receives a message that an edge node 304 in the multicast tree detected the packet's TTL=0, root node 302 recognizes that a loop exists in the tree because, otherwise, edge node 304 should not observe the outer TTL value drop to 0. Thus, root node 302 can determine the presence of a loop by receiving a copy of the packet it forwarded out to the nodes 304 of the multicast tree, and it can also determine the presence and location of a loop if one of the nodes 304 reports back to root node 302 that the node 304 observed the outer packet's TTL to be 0.
The outer header's TTL is set to twice the FTAG tree radius to ensure that any looped packet reaches root node 302 before being discarded by one of the edge nodes 304 for TTL reasons. For example, referring to
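For the loop-free case, the effect of this choice can be checked directly. Assuming the TTL is decremented once per hop, with $R \ge 1$ the tree radius and $d(v)$ the hop count from root node 302 to a node $v$, every node of a loop-free tree receives the probe with a TTL of at least $R$:

```latex
\[
\mathrm{TTL}(v) \;=\; 2R - d(v) \;\ge\; 2R - R \;=\; R \;\ge\; 1,
\qquad 1 \le d(v) \le R .
\]
```

Hence no node in a loop-free tree can observe an outer TTL of 0, which is why a TTL=0 report is treated as evidence of a loop.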
In one example of the present disclosure, multiple loops can be detected on different multicast trees. For example, root node 302 can initiate multiple loop detection algorithms by injecting a packet into each tree. In one example, this can occur simultaneously. To distinguish loop detection on multiple FTAG instances, root node 302 can give each packet a distinct signature.
Loop detection logic can also be used to advertise whether an FTAG tree is usable. For example, upon observing a transient condition through the loop detection procedure described above, root node 302 can send a special Link State Packet (LSP) with a Link State Advertisement (LSA) indicating that a particular FTAG is not usable until a certain threshold amount of time has elapsed, to ensure that there is no loop, and then advertise in a new LSA that the FTAG is usable. It should be noted that instead of a special link state packet, any other form of notification indicating the usability of an FTAG can be used in a given deployment. In this fashion, root node 302 of an FTAG tree can automate FTAG loop detection and loop avoidance logic.
In one example, after one or more loops in a multicast tree have been detected, as described above, an automatic loop repair implementation can be employed. This methodology can be implemented, for example, by the CPU 162 of network device 110 shown in
When a special/reserved GIPo packet is discarded by an FTAG tree node because the packet's TTL reached 0, this indicates that a loop exists in the FTAG tree and that the node 304 that detected the TTL=0 is part of the loop. Because root node 302 set the packet's initial TTL to twice the FTAG tree radius, an edge node 304 should not observe an outer TTL of 0 unless there is a loop. The node 304 that detects the packet's TTL=0 can examine its local ports and/or links that are part of the FTAG tree and then cut off the designated ports. In one example, a designated port is the port through which a downstream neighbor believes it has the shortest path to root node 302. For example, referring to
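One way to keep concurrent probes apart is to stamp each with a random per-probe signature and remember which FTAG it belongs to. The `Probe` fields, the `ProbeTracker` class, the 8-byte random signature, and the example GIPo addresses are hypothetical; the text says only that the signatures are kept distinct.

```python
from __future__ import annotations

import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    ftag_id: int
    gipo: str       # reserved infra multicast group for this FTAG instance
    ttl: int        # twice the FTAG tree radius
    signature: str  # random tag that distinguishes concurrent probes


class ProbeTracker:
    """Root-side bookkeeping for probes launched on several FTAG trees at once."""

    def __init__(self) -> None:
        self._outstanding: dict[str, int] = {}  # signature -> FTAG id

    def launch(self, ftag_id: int, gipo: str, radius: int) -> Probe:
        signature = secrets.token_hex(8)
        while signature in self._outstanding:  # guard against an (unlikely) collision
            signature = secrets.token_hex(8)
        self._outstanding[signature] = ftag_id
        return Probe(ftag_id, gipo, 2 * radius, signature)

    def ftag_for_returned_copy(self, signature: str) -> int | None:
        """Which FTAG tree a returned copy belongs to (None if unknown)."""
        return self._outstanding.get(signature)


tracker = ProbeTracker()
p1 = tracker.launch(ftag_id=1, gipo="239.255.255.1", radius=3)
p2 = tracker.launch(ftag_id=2, gipo="239.255.255.2", radius=4)
print(tracker.ftag_for_returned_copy(p2.signature))  # 2
```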
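The advertise-unusable, then advertise-usable sequence can be sketched as a small hold-down state machine on the root. The class name, the dict-shaped advertisements standing in for LSAs, the explicit `now` timestamps, and restarting the hold-down when a further loop is detected are all assumptions made for illustration.

```python
from __future__ import annotations


class FtagUsability:
    """Hypothetical root-side hold-down for advertising FTAG usability."""

    def __init__(self, hold_down_s: float) -> None:
        self.hold_down_s = hold_down_s
        self._unusable_since: dict[int, float] = {}  # FTAG id -> time of last bad event

    def on_transient(self, ftag_id: int, now: float) -> dict:
        """Withdraw the FTAG and return the 'not usable' advertisement."""
        self._unusable_since[ftag_id] = now
        return {"ftag": ftag_id, "usable": False}

    def on_loop_detected(self, ftag_id: int, now: float) -> None:
        """Assumption: a new loop report restarts the hold-down timer."""
        self._unusable_since[ftag_id] = now

    def poll(self, now: float) -> list[dict]:
        """Return 'usable' advertisements for FTAGs that stayed loop-free long enough."""
        ready = [f for f, since in self._unusable_since.items()
                 if now - since >= self.hold_down_s]
        for ftag_id in ready:
            del self._unusable_since[ftag_id]
        return [{"ftag": f, "usable": True} for f in ready]


usability = FtagUsability(hold_down_s=5.0)
print(usability.on_transient(1, now=0.0))  # {'ftag': 1, 'usable': False}
usability.on_loop_detected(1, now=3.0)
print(usability.poll(now=7.0))             # []
print(usability.poll(now=8.0))             # [{'ftag': 1, 'usable': True}]
```

Passing `now` explicitly keeps the sketch deterministic; a real device would read a monotonic clock.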
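The port-pruning step can be sketched from the point of view of the node that observed TTL=0. The sketch assumes each node knows, for every neighbor on an FTAG port, which node that neighbor currently treats as its next hop toward the root; the `FtagNode` fields and port names are hypothetical, and the later reconvergence is not modeled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FtagNode:
    name: str
    ftag_ports: dict[str, str]          # local port -> neighbor reached on that port
    neighbor_upstream: dict[str, str]   # neighbor -> node it uses toward the root
    pruned: set[str] = field(default_factory=set)

    def designated_ports(self) -> set[str]:
        """Ports whose neighbor believes its shortest path to the root runs through us."""
        return {port for port, nbr in self.ftag_ports.items()
                if self.neighbor_upstream.get(nbr) == self.name}

    def on_probe_ttl_expired(self) -> set[str]:
        """Cut the designated ports out of the FTAG tree to break the loop."""
        cut = self.designated_ports()
        for port in cut:
            del self.ftag_ports[port]
        self.pruned |= cut
        return cut


node = FtagNode("spine3",
                ftag_ports={"eth1": "tor2", "eth2": "tor3"},
                neighbor_upstream={"tor2": "spine2", "tor3": "spine3"})
print(node.on_probe_ttl_expired())  # {'eth2'}
print(node.ftag_ports)              # {'eth1': 'tor2'}
```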
At step 406, it is determined whether a copy of the packet has been returned by an edge node 304 to root node 302. If root node 302 receives a copy of the packet, a loop is confirmed to exist somewhere in the tree, at step 408. At step 410, it is determined whether root node 302 receives a message from one of the edge nodes 304 in the multicast tree indicating that a particular edge node 304 detected the packet's TTL=0. If this message is received, root node 302 recognizes that a loop exists and also knows the location of the loop, which is determined by the location of the node 304 that detected the forwarded packet's TTL reaching 0. If no copy of the packet is returned to root node 302 and no edge node 304 reports a TTL value of 0, root node 302 recognizes that there is no loop in the multicast tree.
At step 510, root node 302 determines whether it receives a copy of the packet that it sent out; there could be multiple copies if root node 302 sent out multiple packets for multiple loop detection algorithms. If a copy of the packet that root node 302 forwarded is received back, root node 302 recognizes that a loop exists in the multicast tree, at step 512. At step 514, if root node 302 receives a message from a node 304 indicating that the node detected the packet's TTL=0, root node 302 recognizes that a loop exists in the multicast tree and, specifically, that the node 304 that sent the message is part of the loop. Referring to
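The decision made at steps 406 through 410 can be summarized as a small function. `LoopVerdict` and `evaluate_probe` are hypothetical names; the two inputs are the signals described above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LoopVerdict:
    loop_exists: bool
    loop_member: str | None  # node that reported TTL=0, if any


def evaluate_probe(copy_returned: bool, ttl_zero_reporter: str | None) -> LoopVerdict:
    if ttl_zero_reporter is not None:
        # Step 410: a TTL=0 report both confirms a loop and names a node on it.
        return LoopVerdict(True, ttl_zero_reporter)
    if copy_returned:
        # Steps 406/408: a returned copy confirms a loop, location unknown.
        return LoopVerdict(True, None)
    # Neither signal: the tree is treated as loop-free.
    return LoopVerdict(False, None)


print(evaluate_probe(copy_returned=True, ttl_zero_reporter=None))
```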
In one example, if root node 302 has determined that a loop exists, it sends out an LSP packet with a link state advertisement (“LSA”) that informs the other nodes in the network that a loop exists on the detected FTAG, at step 518. For example, upon seeing a transient condition, root node 302 can send a special LSP packet with an LSA indicating a particular FTAG is not useful until a certain threshold of time has elapsed. Once the loop is repaired, root node 302 can advertise that the FTAG is again usable in a new LSA.
If root node 302 receives a message from one of the nodes 304 in the network indicating that the node has received a packet with a TTL value of 0, that node 304 can, at step 520, isolate the links or ports that are part of the FTAG tree and cut off the designated ports from the FTAG tree. This breaks the loop without impairing reachability. Once the downstream node 304 has an updated view of the network, it reconverges on what should be its root port toward root node 302, at step 522, restoring the FTAG tree to its intended final form.
For clarity of explanation, in some instances the present technology may be presented as including individual functional blocks including functional blocks comprising devices, device components, steps or routines in a method embodied in software, or combinations of hardware and software.
In some embodiments the computer-readable storage devices, mediums, and memories can include a cable or wireless signal containing a bit stream and the like. However, when mentioned, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.
Methods according to the above-described examples can be implemented using computer-executable instructions that are stored or otherwise available from computer readable media. Such instructions can comprise, for example, instructions and data which cause or otherwise configure a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Portions of computer resources used can be accessible over a network. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, or source code. Examples of computer-readable media that may be used to store instructions, information used, and/or information created during methods according to described examples include magnetic or optical disks, flash memory, USB devices provided with non-volatile memory, networked storage devices, and so on.
Devices implementing methods according to these disclosures can comprise hardware, firmware and/or software, and can take any of a variety of form factors. Typical examples of such form factors include laptops, smart phones, small form factor personal computers, personal digital assistants, and so on. Functionality described herein also can be embodied in peripherals or add-in cards. Such functionality can also be implemented on a circuit board among different chips or different processes executing in a single device, by way of further example.
The instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are means for providing the functions described in these disclosures.
Although a variety of examples and other information was used to explain aspects within the scope of the appended claims, no limitation of the claims should be implied based on particular features or arrangements in such examples, as one of ordinary skill would be able to use these examples to derive a wide variety of implementations. Further and although some subject matter may have been described in language specific to examples of structural features and/or method steps, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to these described features or acts. For example, such functionality can be distributed differently or performed in components other than those identified herein. Rather, the described features and steps are disclosed as examples of components of systems and methods within the scope of the appended claims.
This application claims priority to U.S. Provisional Patent Application No. 61/900,359, filed on Nov. 5, 2013, which is expressly incorporated by reference herein in its entirety.
Number | Date | Country |
---|---|---|
20150124587 A1 | May 2015 | US |
Number | Date | Country |
---|---|---|
61900359 | Nov 2013 | US |