1. Field
The present disclosure relates to network management. More specifically, the present disclosure relates to fault detection and management in a communication network.
2. Related Art
Telecom service providers (SPs) often provide services to enterprise customers having multiple physical locations. For example, an SP can provide virtual leased line (VLL) services to a business customer to enable high-speed, reliable connectivity service between two separate sites of the customer. Conventionally, on the physical layer, the SP network is based on the synchronous optical networking (SONET) standard, and the edge devices are equipped with SONET equipment to provide SONET circuit(s) between customer edge (CE) points, which belong to the customer's own network. The provision of SONET circuits allows a local CE port to detect a failure in the provider network or in the corresponding remote CE port in a timely manner.
However, the price of optical equipment is high, and service providers are increasingly moving away from SONET solutions to Metro Ethernet solutions. Unlike SONET, in a packet-switched network, such as a multiprotocol label switching (MPLS) network or an Ethernet network, if two endpoints are not directly coupled (for example, if they are located on opposite sides of the provider's network), link-level connectivity status is not exchanged between their respective ports. Hence, if a remote CE port goes down, the local CE port stays alive and continues to forward traffic to the remote port. This can lead to significant traffic loss and extended network down time.
One embodiment of the present invention provides a fault-management system. During operation, the system identifies a failure at a remote location associated with a communication service. The system then suspends the local port used for that communication service, thereby allowing the failure to be detected by a device coupled to the local port. This significantly reduces network down time for the customer. In addition, since the customer's network is aware of the remote fault, it can take steps to re-route traffic through another network if such a backup network has been provisioned.
In a variation on this embodiment, suspending the local port includes placing the local port in a special down state and maintaining state information for the local port.
In a variation on this embodiment, identifying the failure comprises processing a message generated by a remote switch indicating the failure.
In a further variation, the message is a connectivity fault management message.
In a variation on this embodiment, the system detects a recovery from the failure and resumes operation on the suspended local port, thereby allowing the device coupled to the local port to resume transmission.
In a variation on this embodiment, the system detects a local failure. The system then generates a message indicating the local failure, and transmits the message to a remote switch, thereby allowing the remote switch to suspend a port on the remote switch.
In a variation on this embodiment, the communication service includes at least one of: a virtual local area network (VLAN) service; a virtual private LAN service (VPLS); a virtual private network (VPN) service; and a virtual leased line (VLL) service.
In the figures, like reference numerals refer to the same figure elements.
The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.
In embodiments of the present invention, the problem of fast failure notification between two customer edge (CE) devices coupled via a packet-switched provider network is solved by allowing two provider edge (PE) devices in the packet-switched network to exchange continuity check messages (CCMs) in the event of failure. Once a local PE device receives a CCM indicating a failure of a remote port or link, the local PE device suspends a corresponding local PE port. Consequently, the CE device coupled to the suspended local PE port can take proper actions, such as protection switching, to recover from the remote failure. When the remote port or link recovers, the local PE port can be brought back up accordingly. This significantly reduces network down time for the customer. In addition, because the customer's network is aware of the remote failure, the customer network can re-route its traffic through another network if such a backup network has been provisioned.
In this disclosure, the term “switch” refers to a network switching device capable of forwarding packets. A switch can be an MPLS router, an Ethernet switch, or any other type of switch that performs packet switching. Furthermore, in this disclosure, “router” and “switch” are used interchangeably.
The term “provider edge” or “PE” refers to a device located at the edge of a provider's network and which offers ingress to and egress from the provider network. The term “customer edge” or “CE” refers to a device located at the edge of a customer's network. A PE device is typically coupled to a CE device. When two CE devices communicate via a provider's network, each CE device is coupled to a respective PE device.
Various services, including but not limited to: virtual local area network (VLAN), virtual private LAN service (VPLS), virtual private network (VPN), and virtual leased line (VLL), can be provided to allow customer network 104 to communicate with customer network 106. To avoid traffic loss, it is desirable for the local CE port running a service to be aware of the health of the corresponding port at the remote side. For example, when a remote CE port for the VLL service fails while the corresponding local CE port stays alive, if the local CE port is unaware of the failure of the remote CE port, the local CE port will continue to forward VLL traffic to the remote CE port. This leads to traffic loss and increases the time required for network convergence. To avoid this situation, service providers need to provide their customers with a solution for physical layer emulation between two endpoints. Note that “physical layer emulation” as described herein refers to the scenario where the remote port status is reflected at the local network device.
In a conventional SONET-based provider network, such failure notification between two CE endpoints can be easily achieved because optical equipment at the PE routers of the provider's network can map the CE endpoints to specific wavelength channel(s) between these PE routers. However, this solution is expensive due to the high price of SONET equipment. Embodiments of the present invention provide a solution that allows MPLS, Ethernet, or other packet-switching-based providers to offer the same physical layer emulation with fast failure notification as the SONET providers.
In one embodiment, physical layer emulation across the packet-switched provider's network is achieved by extending an OAM (Operations, Administration, and Maintenance) solution, such as Connectivity Fault Management (CFM) defined in IEEE Standard 802.1ag, available at http://www.ieee802.org/1/pages/802.1ag.html, which is incorporated by reference herein.
CFM allows service providers to manage each customer service instance, or Ethernet Virtual Connection (EVC), individually. In other words, CFM provides the ability to monitor the health of an end-to-end service delivered to customers, as opposed to just links or individual bridges. In embodiments of the present invention, PE devices operate as maintenance endpoints (MEPs) and issue continuity check messages (CCMs) periodically. This allows MEPs to detect loss of service connectivity amongst themselves. However, a challenge remains: a failure of an EVC or a remote port does not translate into a link status event at the local CE device. Embodiments of the present invention solve this problem by enabling the MEPs to continue issuing CCMs even after the EVC has failed due to the failure of a port or link on the CE device. A modified CCM is sent from a remote MEP (i.e., a remote PE device) to the local MEP (i.e., a local PE device), notifying the local MEP of a port failure on the CE device coupled to the remote MEP. Once the local MEP receives the modified CCM, the local MEP temporarily suspends a local port associated with the CCM session. Note that this solution can provide a very short reaction time (on the order of milliseconds) for discovering remote port failures. In one embodiment, the reaction time for discovering a remote port failure is determined by the interval between the periodically sent CCMs, which can be less than 1 second. In a further embodiment, the reaction time is approximately 3.3 ms.
The subsequent flags field is defined separately for each OpCode. For CCM, the flags field is split into three parts: a Remote Defect Indication (RDI) field, a reserved field, and a CCM interval field. The most-significant bit of the flags field is the RDI field. The following four bits are the reserved field. The least-significant three bits of the flags field constitute the CCM interval field, which specifies the transmission interval of the CCMs. For example, if the transmission interval is 3.3 ms, the CCM interval field is set to 1.
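For illustration only, the following Python sketch decodes a flags octet according to the layout just described; the interval-code table follows IEEE 802.1ag, and the helper names are illustrative, not part of any standard API.

```python
# Minimal sketch: decoding the flags octet of a CCM per the layout
# described above. Bit 8 (MSB) = RDI, bits 7-4 = reserved,
# bits 3-1 = CCM interval.

# Interval codes to transmission periods (per IEEE 802.1ag);
# code 1 corresponds to the 3.3 ms interval mentioned above.
CCM_INTERVALS = {
    1: "3.3ms", 2: "10ms", 3: "100ms", 4: "1s",
    5: "10s", 6: "1min", 7: "10min",
}

def decode_ccm_flags(flags: int) -> dict:
    """Split a CCM flags octet into RDI, reserved, and interval fields."""
    rdi = bool(flags & 0x80)            # most-significant bit
    reserved = (flags >> 3) & 0x0F      # next four bits
    interval_code = flags & 0x07        # least-significant three bits
    return {
        "rdi": rdi,
        "reserved": reserved,
        "interval": CCM_INTERVALS.get(interval_code, "invalid"),
    }

# Example: RDI set, interval code 1 (3.3 ms)
assert decode_ccm_flags(0x81) == {"rdi": True, "reserved": 0, "interval": "3.3ms"}
```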
The first TLV offset field of the common CFM header specifies the offset, starting from the first octet following the first TLV offset field, up to the first TLV in the CFM PDU. The value of the offset varies for different OpCodes. The first TLV offset field in a CCM is transmitted as 70.
In one embodiment, one of the TLVs can be used to indicate the status of the interface on which the MEP transmitting the CCM is configured (which is not necessarily the interface on which it resides) or the next lower interface as defined in IETF RFC 2863 (available at http://tools.ietf.org/html/rfc2863, which is incorporated by reference herein). This TLV can be referred to as an Interface Status TLV.
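The following sketch shows how such an Interface Status TLV might be encoded and parsed. It assumes the TLV type value 4 assigned to the Interface Status TLV in IEEE 802.1ag and a one-octet value carrying the RFC 2863 ifOperStatus (1 for up, 2 for down), which is how the CCMs in this disclosure signal remote port state; the helper names are illustrative.

```python
import struct

# Sketch of an Interface Status TLV. Assumption: TLV Type 4 identifies
# the Interface Status TLV (per IEEE 802.1ag), and the one-octet value
# carries the RFC 2863 ifOperStatus (1 = up, 2 = down).

IF_STATUS_TLV_TYPE = 4
IF_STATUS_UP = 1
IF_STATUS_DOWN = 2

def build_interface_status_tlv(oper_status: int) -> bytes:
    """Encode Type (1 octet), Length (2 octets), Value (1 octet)."""
    return struct.pack("!BHB", IF_STATUS_TLV_TYPE, 1, oper_status)

def parse_interface_status_tlv(data: bytes) -> int:
    tlv_type, length, value = struct.unpack("!BHB", data[:4])
    if tlv_type != IF_STATUS_TLV_TYPE or length != 1:
        raise ValueError("not an Interface Status TLV")
    return value

# A failure-report CCM would carry value 2 ("down"):
assert parse_interface_status_tlv(build_interface_status_tlv(IF_STATUS_DOWN)) == 2
```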
A number of CCM-specific fields (not shown) follow the common CFM header in the CCM PDU.
The status of a remote port can propagate across a provider network (even if the provider network includes multiple networks maintained by different administrative organizations) using CCMs transmitted between two MEPs. Once a local MEP is notified of a remote port failure, a corresponding local PE port coupled to the CE equipment is suspended, which allows the CE equipment to be notified of the failure and to avoid significant traffic loss. In one embodiment, an external network-management system can be used to facilitate the process of bringing down the local port and maintaining the status of all ports.
During operation, local PE device 312 and remote PE device 316 function as MEPs for service provider network 302, and periodically exchange CCMs (shown by the dashed lines), which provide a means to detect connectivity failures. In addition, PE devices 312 and 316 can detect any port failure on the coupled CE devices (or link failures), and report the port-down information to network-management server 308. For example, if there is a port failure on CE device 314 (for example, a port running the VLL service for customer network 306 fails), PE device 316 notifies network-management server 308 of this port failure. To prevent significant traffic loss (e.g., to prevent a port on CE device 310 from forwarding traffic to the failed port on CE device 314), network-management server 308 maps this failure, based on user configuration, to a corresponding port on PE device 312, and triggers an event (such as a "VLL port down" event) on PE device 312 to temporarily suspend that port. This operation allows CE device 310 to detect the failure and divert the traffic to an alternative path by using protection switching. In addition, network-management server 308 stores all port states and appropriate event transitions. Note that this held-down port is maintained in a special state that is different from other "port down" states, because the port itself is actually functioning. As soon as the failed port on the remote end recovers, network-management server 308 can bring up the held-down port to resume traffic, thus significantly reducing the network recovery time.
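The following Python sketch illustrates the control flow of such a network-management server. The class, the port map, and the PE-client interface are hypothetical stand-ins for whatever configuration store and control channel an actual deployment would use; only the flow mirrors the description above.

```python
# Hypothetical sketch of the server-mediated relay described above.
# suspend_port/resume_port are assumed methods on a PE control client.

class NetworkManagementServer:
    def __init__(self, port_map, pe_clients):
        # port_map: (pe_id, port) of a failed end -> (pe_id, port) to
        # suspend, populated from user configuration as described above.
        self.port_map = port_map
        self.pe_clients = pe_clients          # pe_id -> control connection
        self.port_states = {}                 # persisted port states

    def on_port_down(self, pe_id, port):
        peer_pe, peer_port = self.port_map[(pe_id, port)]
        # Trigger a "VLL port down" event on the peer PE device so its
        # CE-facing port is suspended, not administratively downed.
        self.pe_clients[peer_pe].suspend_port(peer_port)
        self.port_states[(peer_pe, peer_port)] = "remote-fault-down"

    def on_port_up(self, pe_id, port):
        peer_pe, peer_port = self.port_map[(pe_id, port)]
        self.pe_clients[peer_pe].resume_port(peer_port)
        self.port_states[(peer_pe, peer_port)] = "up"
```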
The solution described above relies on an external network-management server to detect and relay remote failures. Alternatively, the PE devices themselves can implement this fault-relay function without an external server.
The operation of network 400 is similar to that of network 300, except that, without an external network-management server, the PE devices are responsible for bringing down a local port in response to a remote port failure. During operation, local PE device 410 and remote PE device 414 periodically exchange CCMs. When a PE device detects a failure (which can be a CE port failure, a link failure, or a PE port failure) at one end of a service instance, the PE device sends a CCM to a corresponding PE device at the other end of the service instance, notifying the corresponding PE device of the failure. The corresponding PE device maps the failure to a local PE port coupled to a customer port associated with the service instance at this end, and brings down the mapped local PE port to prevent the coupled customer port from forwarding traffic to the failed port.
Local PE device 500 receives the port-failure-report CCM, either with its RDI bit set or with its Interface Status TLV value set to "2" (operation 512), and maps the failure to a local PE port facing the CE equipment and associated with the service (operation 514). Subsequently, local PE device 500 brings down the local PE port and maintains its port status (operation 516). In one embodiment, the local PE port is kept in a special "down" state, which is different from other "down" states, such as one caused by a local equipment failure. The link is still brought down in the special "down" state, just as in other down states. Local PE device 500 continues to send regular CCMs to remote PE device 502 (operation 518).
In some cases, CCMs may fail to reach an MEP. For example, a unidirectional path failure may occur between a local MEP and a remote MEP, so that CCMs from the local MEP do not reach the remote MEP. The remote MEP, which fails to receive regular CCMs from the local MEP, can detect the CCM failure, and in response, send failure-report CCMs with the RDI bit set or with the Interface Status TLV value set to "2" to the local MEP. In addition, the remote MEP brings down a coupled port associated with the CCM session by placing the coupled port in a special "down" state. The local MEP, in response to receiving the failure-report CCMs, also brings down a local port associated with the CCM session by placing the local port in a special "down" state. Although the CCM failure occurs in one direction (from the local MEP to the remote MEP), ports at both ends are put into the special "down" state.
While the ports are down, the remote MEP continues to send failure-report CCMs to the local MEP. The local MEP also attempts to send regular CCMs. Once the CCM path between the two MEPs recovers, the remote MEP starts to receive the regular CCMs sent by the local MEP. In response to receiving CCMs with a cleared RDI bit or with the Interface Status TLV value set to "1," the remote MEP brings up the coupled port that was in the special "down" state. In addition, the remote MEP generates interface-up CCMs by clearing the RDI bit or by setting the Interface Status TLV value to "1," and sends these interface-up CCMs to the local MEP. In response to receiving these interface-up CCMs, the local MEP brings up the corresponding port on its end, and normal communication between the local port and the remote port resumes.
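The following sketch summarizes this behavior as a small port-state handler. The state names are illustrative, but the transitions follow the CCM-driven logic described above, with the special remote-fault "down" state tracked separately from a local failure so that fast recovery remains possible.

```python
from enum import Enum

# Sketch of the port-state handling described above. A port suspended
# because of a remote fault is tracked separately from a locally failed
# port, so it can be restored as soon as an interface-up CCM arrives.

class PortState(Enum):
    UP = "up"
    LOCAL_DOWN = "local-down"                 # local equipment failure
    REMOTE_FAULT_DOWN = "remote-fault-down"   # special "down" state

def handle_ccm(port_state: PortState, rdi: bool, if_status: int) -> PortState:
    remote_faulty = rdi or if_status == 2
    if remote_faulty and port_state == PortState.UP:
        return PortState.REMOTE_FAULT_DOWN    # suspend, keep state info
    if not remote_faulty and port_state == PortState.REMOTE_FAULT_DOWN:
        return PortState.UP                   # remote end recovered
    return port_state                         # LOCAL_DOWN unaffected by CCMs

assert handle_ccm(PortState.UP, rdi=True, if_status=1) is PortState.REMOTE_FAULT_DOWN
assert handle_ccm(PortState.REMOTE_FAULT_DOWN, rdi=False, if_status=1) is PortState.UP
```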
FSM 600 also includes a number of events, where certain events trigger a transition between states. The following is a list of events in FSM 600:
E1: ENDPOINT_ADD
E2: ENDPOINT_DELETE
E3: PEER_ADD
E4: PEER_DELETE
E5: CONFIG_COMPLETE
E6: VC_PARAM_UPDATE
E7: INSTANCE_DELETE
E8: NO_ROUTER_MPLS
E9: ENDPOINT_UP
E10: TUNNEL_UP
E11: LDP_SESSION_UP
E12: PW_UP
E13: ENDPOINT_DOWN
E14: TUNNEL_DOWN
E15: LDP_SESSION_DOWN
E16: PW_DOWN
E17: VC_WITHDRAW_DONE
E18: VC_BIND_FAILED
E19: VC_WITHDRAW_FAILED
E20: LINK_RELAY_LOCAL_DOWN
E21: LINK_RELAY_REMOTE_DOWN
E22: LINK_RELAY_LOCAL_UP
E23: LINK_RELAY_REMOTE_UP
The state transitions triggered by these events are described below.
While in PW-down state 608, if E2, E4, E13, E14 (TUNNEL_DOWN), or E6 (VC_PARAM_UPDATE) occurs, a VC withdrawal command is issued, and FSM 600 transitions to wait-VC-withdraw-done state 616. The withdraw-next-state will be: configuration-incomplete state 602 if E2 or E4 occurs, local-port-down state 604 if E13 occurs, or tunnel-down state 606 if E14 or E6 occurs. If E12 (PW_UP) occurs, FSM 600 moves from PW-down state 608 to operational state 614, where the PW is fully operational.
While in operational state 614, if E2, E4, E13, E14, or E6 occurs, a VC withdrawal command is issued, and FSM 600 transitions to wait-VC-withdraw-done state 616. The withdraw-next-state will be: configuration-incomplete state 602 if E2 or E4 occurs, local-port-down state 604 if E13 occurs, or tunnel-down state 606 if E14 or E6 occurs. If E15 (LDP_SESSION_DOWN) or E16 (PW_DOWN) occurs, FSM 600 moves from operational state 614 to PW-down state 608, and no withdrawal is issued.
The aforementioned state transitions do not include state transitions associated with link state relay. Compared with a regular FSM that does not implement link relay, FSM 600 includes two link-relay states (relay-local-link-down state 610 and relay-remote-link-down state 612). When a local link (a local port coupled to the MEP) goes down (E20), FSM 600 moves from operational state 614 to relay-local-link-down state 610, and all tunnel label and VC label information remains intact. The MEP sends a failure-report message to a remote MEP indicating this endpoint-link-down event by setting the RDI bit of the CCMs or by setting the Interface Status TLV value to "2." Note that the service between the two MEPs remains operationally active to allow transmission of CCMs. When the remote MEP receives a failure-report CCM, the FSM running on the remote MEP moves from operational state 614 to relay-remote-link-down state 612. The remote MEP also brings down the endpoint interface. The VC/tunnel label information remains intact so that the CCM packets can still flow to the local MEP.
When the local link comes up (E22), the local MEP sends interface-up CCMs to the remote MEP, indicating the link-up state by clearing the RDI bit of the CCMs or by setting the Interface Status TLV value to "1." At the local MEP, FSM 600 moves from relay-local-link-down state 610 back to operational state 614. The local MEP also reprograms its content-addressable memory (CAM) or parameter RAM (PRAM) so that the endpoint traffic can flow through using the link-relay PW.
The remote MEP receives the interface-up CCMs that indicate the endpoint-link-up event (E23), and its own FSM moves from relay-remote-link-down state 612 to operational state 614. This eventually reprograms the remote MEP's content-addressable memory (CAM) or parameter RAM (PRAM) so that endpoint traffic can resume quickly.
Since the CCMs reach the other end even before the endpoint is enabled on the local MEP, the endpoint on the remote MEP can be brought up quickly enough that traffic from the endpoint coupled to the local MEP using the link relay can be forwarded to the endpoint coupled to the remote MEP. This failover can be achieved on a time scale of milliseconds.
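For illustration, the link-relay portion of FSM 600 can be expressed as a transition table, as in the following sketch. Only the transitions described above are included, and the mapping of events E20-E23 onto these transitions reflects the description in the preceding paragraphs; the full FSM has additional states and events.

```python
# Partial transition table for the link-relay portion of FSM 600.
# State and event names follow the list above; everything else is
# omitted for brevity.

LINK_RELAY_TRANSITIONS = {
    # (current state, event) -> next state
    ("operational",            "E20_LINK_RELAY_LOCAL_DOWN"):  "relay-local-link-down",
    ("operational",            "E21_LINK_RELAY_REMOTE_DOWN"): "relay-remote-link-down",
    ("relay-local-link-down",  "E22_LINK_RELAY_LOCAL_UP"):    "operational",
    ("relay-remote-link-down", "E23_LINK_RELAY_REMOTE_UP"):   "operational",
}

def next_state(state: str, event: str) -> str:
    """Return the next state, or stay put if the event is not relayed."""
    return LINK_RELAY_TRANSITIONS.get((state, event), state)

assert next_state("operational", "E20_LINK_RELAY_LOCAL_DOWN") == "relay-local-link-down"
assert next_state("relay-local-link-down", "E22_LINK_RELAY_LOCAL_UP") == "operational"
```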
Fault-detection mechanism 702 is configured to detect faults in a port coupled to PE device 700. CCM-generation mechanism 704 is configured to generate CCMs. During normal operation, CCM-generation mechanism 704 generates regular CCMs, indicating that no fault has been detected. When fault-detection mechanism 702 detects a local failure, CCM-generation mechanism 704 generates failure-report CCMs with their RDI bit set or with their Interface Status TLV value set to "2," indicating a failure at this end. CCM-TX/RX mechanism 706 is configured to periodically transmit and receive CCMs to and from a remote PE device.
When CCM-TX/RX mechanism 706 receives a CCM, it sends the received CCM to CCM-processing mechanism 708, which is configured to process the received CCM by examining the RDI bit or the value field of the Interface Status TLV. If CCM-processing mechanism 708 determines that the RDI bit of an incoming CCM is set or the Interface Status TLV value is set to "2" (down), it notifies port-management mechanism 710, which in response brings down a corresponding coupled local port to prevent it from forwarding traffic to the failed remote port. The coupled port is now placed in a special "down" state to enable subsequent fast recovery. In addition, port states and event transitions associated with the local port are maintained in memory 712. Note that while the coupled port is in the special "down" state, CCM-TX/RX mechanism 706 continues to transmit regular CCMs to the remote PE device. Subsequently, if CCM-processing mechanism 708 determines that the RDI bit of a newly received CCM is cleared or the Interface Status TLV value is set to "1" (up), it notifies port-management mechanism 710, which in response brings up the port that was held in the special "down" state.
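The transmit-side behavior of CCM-generation mechanism 704 can be sketched as follows. The helper names are hypothetical, but the flow matches the description above: regular CCMs during normal operation, and failure-report CCMs (RDI set, Interface Status TLV value 2) after a local fault is detected.

```python
# TX-side sketch of CCM generation, using the flags layout described
# earlier (RDI in the MSB, CCM interval in the low three bits).

def build_ccm_flags(rdi: bool, interval_code: int = 1) -> int:
    """Compose the flags octet; interval code 1 = 3.3 ms transmissions."""
    return (0x80 if rdi else 0x00) | (interval_code & 0x07)

def next_ccm(local_fault_detected: bool) -> dict:
    """Return the fields of the next CCM to transmit (simplified)."""
    if local_fault_detected:
        # Failure-report CCM: RDI set, Interface Status TLV value 2 (down).
        return {"flags": build_ccm_flags(rdi=True), "interface_status": 2}
    # Regular CCM: RDI clear, Interface Status TLV value 1 (up).
    return {"flags": build_ccm_flags(rdi=False), "interface_status": 1}

assert next_ccm(True)["interface_status"] == 2
assert next_ccm(False)["flags"] == 0x01
```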
Embodiments of the present invention thus allow a packet-switched network to provide physical layer emulation capability to its customers. Compared with the SONET solution, the present solution is more cost-effective. The solution builds on the existing Ethernet CFM standard, which uses CFM messages to detect and report connectivity failures. Unlike conventional fault-management mechanisms, embodiments of the present invention link a number of "physical actions" to CFM events. Note that these physical actions (including bringing down a port in response to a remote port failure and bringing the port back up when the remote port recovers) are not defined in the CFM standard.
The methods and processes described herein can be embodied as code and/or data, which can be stored in a computer-readable non-transitory storage medium. When a computer system reads and executes the code and/or data stored on the computer-readable non-transitory storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the medium.
The methods and processes described herein can be executed by and/or included in hardware modules or apparatus. These modules or apparatus may include, but are not limited to, an application-specific integrated circuit (ASIC) chip, a field-programmable gate array (FPGA), a dedicated or shared processor that executes a particular software module or a piece of code at a particular time, and/or other programmable-logic devices now known or later developed. When the hardware modules or apparatus are activated, they perform the methods and processes included within them.
The foregoing descriptions of embodiments of the present invention have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit this disclosure. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. The scope of the present invention is defined by the appended claims.
This application claims the benefit of U.S. Provisional Application No. 61/392,400, Attorney Docket Number BRCD-3067.0.1.US.PSP, entitled "LINK STATE RELAY FOR PHYSICAL LAYER EMULATION," by inventors Srinivas S. Hanabe, Jitendra Verma, and Eswara S. P. Chinthalapati, filed 12 Oct. 2010, the disclosure of which is incorporated by reference herein.