1. Field
The present embodiments relate to remote device probing for failure detection.
2. Description of the Related Art
A server may include multiple network adaptors to provide redundant communication paths to a network, where each adaptor is connected to a different switch providing a separate communication path to the network. A device driver in the server may manage the adaptors as a team and perform load balancing operations when transmitting data to the network. If the device driver detects that one adaptor has failed, then the device driver may perform a failover so that only the surviving adaptor is used, and subsequently perform a failback to an adaptor once that adaptor has recovered from the failed state.
Referring now to the drawings in which like reference numbers represent corresponding parts throughout:
In the following description, reference is made to the accompanying drawings which form a part hereof and which illustrate several embodiments. It is understood that other embodiments may be utilized and structural and operational changes may be made without departing from the scope of the embodiments.
Each adaptor 28a, 28b connects to a separate switch 38a, 38b, where the switches may comprise blades 4a, 4b . . . 4n or printed circuit boards in the same chassis 2 that includes the server blade 20, or may comprise switches in a separate chassis.
In certain embodiments, the fault tolerant module 32 queries the switches 38a, 38b using the Simple Network Management Protocol (SNMP). For instance, in certain embodiments, the switch processor 44 may operate as an SNMP agent and include a Management Information Base (MIB) providing information on the switch 38. The fault tolerant module 32, operating as an SNMP manager, may look up the link status of the switch external ports 42a, 42b, 42c, 42d by issuing SNMP get requests for the “ifOperStatus” MIB object, whose object identifier (OID) is 1.3.6.1.2.1.2.2.1.8 and which provides the current operational state of an interface. The returned state indicates whether the interface can pass operational packets; for example, up(1) indicates that it can, while down(2) and testing(3) indicate that it cannot. In additional embodiments, the fault tolerant module 32 may use additional or alternative communication protocols and commands to determine the state of the external ports in the switch. SNMP and the MIB-II objects it queries are described in the publications “Management Information Base for Network Management of TCP/IP-based Internets: MIB-II”, Network Working Group, RFC 1213 (March 1991) and “A Simple Network Management Protocol (SNMP)”, Network Working Group, RFC 1157 (May 1990).
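As an illustration of how a returned ifOperStatus value might be interpreted, the following is a minimal sketch. It assumes the SNMP transport (the actual get request) is handled elsewhere and only shows how the per-port instance OID is formed (the column OID followed by the interface's ifIndex) and how the returned integer might be mapped to an operational or non-operational port. The helper names are hypothetical; the enumeration values come from RFC 1213 (values 1 to 3) and the later Interfaces Group MIB, RFC 2863 (values 4 to 7).

```python
from enum import IntEnum
from typing import Iterable

# ifOperStatus column of the MIB-II ifTable (RFC 1213).
IF_OPER_STATUS_OID = "1.3.6.1.2.1.2.2.1.8"


class IfOperStatus(IntEnum):
    """ifOperStatus values: RFC 1213 defines 1-3, RFC 2863 adds 4-7."""
    UP = 1
    DOWN = 2
    TESTING = 3
    UNKNOWN = 4
    DORMANT = 5
    NOT_PRESENT = 6
    LOWER_LAYER_DOWN = 7


def oper_status_oid(if_index: int) -> str:
    """Instance OID for one port: the column OID followed by its ifIndex."""
    if if_index < 1:
        raise ValueError("ifIndex values start at 1")
    return f"{IF_OPER_STATUS_OID}.{if_index}"


def port_is_operational(raw_value: int) -> bool:
    """Treat only up(1) as able to pass operational packets (an assumption)."""
    try:
        return IfOperStatus(raw_value) is IfOperStatus.UP
    except ValueError:
        # Values outside the defined enumeration count as non-operational.
        return False


def switch_is_operational(port_values: Iterable[int]) -> bool:
    """A switch is usable if at least one external port reports up(1)."""
    return any(port_is_operational(v) for v in port_values)
```

Treating dormant(5) or unknown(4) as non-operational is a conservative design choice made for this sketch; an implementation could reasonably classify those states differently.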
Further, if two adaptors are connected to the same switch, then the fault tolerant module 32 need only query the status of the external ports on that switch 38a, 38b once, on behalf of one of the adaptors 28a, 28b. In certain embodiments, however, each adaptor is connected to a different switch to provide redundant paths to the network.
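The single-query optimization for adaptors that share a switch can be sketched as follows. This is a minimal illustration, assuming a hypothetical adaptor-to-switch mapping and a caller-supplied `query_switch` function that returns whether the switch has an operational external port; neither name comes from the described embodiments.

```python
from typing import Callable, Dict


def probe_switches(
    adaptor_to_switch: Dict[str, str],
    query_switch: Callable[[str], bool],
) -> Dict[str, bool]:
    """Query each distinct switch exactly once; return switch -> operational.

    Adaptors connected to the same switch share the single query result,
    so the switch is probed on behalf of only one of them.
    """
    results: Dict[str, bool] = {}
    for switch in adaptor_to_switch.values():
        if switch not in results:
            results[switch] = query_switch(switch)
    return results
```

For example, with adaptors labeled "28a" and "28b" both attached to switch "38a" and a third adaptor attached to "38b", `query_switch` is called once for "38a" and once for "38b".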
If (at block 106) there are no operational external ports in one switch 38a, 38b, then the fault tolerant module 32 indicates (at block 108) not to transmit data to the adaptor 28a, 28b connected to the non-operational switch 38a, 38b. When the adaptor 28a, 28b is in the non-operational state, a failover may occur if that adaptor is indicated as the primary adaptor for all traffic. However, if (at block 110) a switch 38a, 38b that was previously indicated as non-operational is determined from the at least one query to have at least one operational external port 42a, 42b, 42c, 42d, then the fault tolerant module 32 indicates to transmit data to the adaptor 28a, 28b connected to that switch. The status of the external ports is then updated (at block 112) to the status determined from the at least one query.
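The per-switch decision described above might be sketched as below. This is a minimal illustration only: `SwitchEntry` is a hypothetical record standing in for one entry of the switch status, and the returned strings ("stop", "resume", "unchanged") are invented labels for the indications made at blocks 108 and 110. The comments map only loosely onto the described blocks.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SwitchEntry:
    """Hypothetical record of the last known state of one switch."""
    adaptor: str
    operational: bool = True
    port_status: Dict[int, bool] = field(default_factory=dict)


def process_probe(entry: SwitchEntry, port_status: Dict[int, bool]) -> str:
    """Apply one query result (external port -> operational) to an entry."""
    was_operational = entry.operational
    # Blocks 106/110: the switch is usable if any external port is up.
    entry.operational = any(port_status.values())
    # Block 112: record the port status determined from the query.
    entry.port_status = dict(port_status)
    if was_operational and not entry.operational:
        return "stop"    # block 108: do not transmit via entry.adaptor
    if not was_operational and entry.operational:
        return "resume"  # block 110: transmit via entry.adaptor again
    return "unchanged"
```

Returning a transition rather than a bare flag lets the caller distinguish a newly failed switch (a failover candidate) from a newly recovered one (a failback candidate).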
In further embodiments, in response to determining from the at least one query that a switch is non-operational (block 108), a failover occurs from the non-operational switch to the operational switch; and when a switch previously indicated as non-operational is determined to have at least one operational external port (block 110), a failback is performed to that switch.
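One way to express this failover and failback behavior is a stateless selection over an ordered preference list, sketched below under the assumption that the first entry is the primary adaptor. The function name and its parameters are hypothetical and not taken from the described embodiments.

```python
from typing import Dict, List, Optional


def select_active_adaptor(
    preference: List[str],
    adaptor_switch: Dict[str, str],
    switch_operational: Dict[str, bool],
) -> Optional[str]:
    """Return the most preferred adaptor whose switch is operational.

    Failover happens when the primary's switch is marked non-operational;
    failback happens automatically once that switch is marked operational
    again, because the primary is always checked first.
    """
    for adaptor in preference:
        if switch_operational.get(adaptor_switch[adaptor], False):
            return adaptor
    return None  # no adaptor has a path to the external network
```

Because the choice is recomputed from the current switch status on every call, no separate failback step is needed; a real driver might add a hold-down delay before failing back to avoid flapping, which is an assumption beyond the described embodiments.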
With the described embodiments, the fault tolerant module 32 avoids sending packets to a functioning adaptor that is connected to a switch having no operational links to the external network. In described embodiments, the fault tolerant module 32 maintains a switch map 34 providing information on the status of each switch; the switch map 34 is consulted when selecting an adaptor on which to transmit packets, so that packets are transmitted only through adaptors connected to functioning switches. In alternative embodiments, the adaptor device drivers may update the switch map 34.
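To show how a switch map might constrain load balancing across a team, here is a minimal sketch. It assumes round-robin distribution, which is only one possible load balancing policy, and the `TeamTransmitter` class, its `switch_map` dictionary, and its methods are hypothetical illustrations rather than the structure of switch map 34.

```python
from itertools import count
from typing import Dict, List


class TeamTransmitter:
    """Round-robin over adaptors whose switches the map marks as functioning."""

    def __init__(self, adaptor_switch: Dict[str, str]) -> None:
        self.adaptor_switch = adaptor_switch
        # Hypothetical switch map: switch -> functioning (True) or not.
        self.switch_map: Dict[str, bool] = {s: True for s in adaptor_switch.values()}
        self._counter = count()

    def eligible(self) -> List[str]:
        """Adaptors connected to switches currently marked as functioning."""
        return [a for a, s in self.adaptor_switch.items() if self.switch_map[s]]

    def pick_adaptor(self) -> str:
        """Choose the adaptor for the next packet."""
        candidates = self.eligible()
        if not candidates:
            raise RuntimeError("no adaptor is connected to a functioning switch")
        return candidates[next(self._counter) % len(candidates)]
```

When the prober marks a switch as non-operational in `switch_map`, the adaptor attached to it simply drops out of the rotation, even though the adaptor itself is still functioning.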
The described embodiments may be implemented as a method, apparatus or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term “article of manufacture” as used herein refers to code or logic implemented in hardware logic (e.g., an integrated circuit chip, Programmable Gate Array (PGA), Application Specific Integrated Circuit (ASIC), etc.) or a computer readable medium, such as magnetic storage medium (e.g., hard disk drives, floppy disks, tape, etc.), optical storage (CD-ROMs, optical disks, etc.), volatile and non-volatile memory devices (e.g., EEPROMs, ROMs, PROMs, RAMs, DRAMs, SRAMs, firmware, programmable logic, etc.). Code in the computer readable medium is accessed and executed by a processor. The code in which preferred embodiments are implemented may further be accessible through a transmission medium or from a file server over a network. In such cases, the article of manufacture in which the code is implemented may comprise a transmission medium, such as a network transmission line, wireless transmission media, signals propagating through space, radio waves, infrared signals, etc. Thus, the “article of manufacture” may comprise the medium in which the code is embodied. Additionally, the “article of manufacture” may comprise a combination of hardware and software components in which the code is embodied, processed, and executed. Of course, those skilled in the art will recognize that many modifications may be made to this configuration without departing from the scope of the embodiments, and that the article of manufacture may comprise any information bearing medium known in the art.
The described operations may be performed by circuitry, where “circuitry” refers to either hardware or software or a combination thereof. The circuitry for performing the operations of the described embodiments may comprise a hardware device, such as an integrated circuit chip, Programmable Gate Array (PGA), Application Specific Integrated Circuit (ASIC), etc. The circuitry may also comprise a processor component, such as an integrated circuit, and code in a computer readable medium, such as memory, wherein the code is executed by the processor to perform the operations of the described embodiments.
In the described embodiments, the server and switches comprise blades in a single chassis, where the switches provide connections to an external network. In alternative embodiments, the server and switches may be in separate chassis or boxes and connect through a direct line or over a network.
In described embodiments, the probing operations to determine the switch status are performed by the fault tolerant module. In alternative embodiments, the probing operations may be performed by the adaptor device drivers or by a program module external to the fault tolerant module.
In described embodiments, the adaptors are connected to switches. In additional embodiments, the switches may comprise other routers or packet forwarding devices known in the art, such as an expander.
The illustrated operations of
In blade server embodiments, the adaptors 28a, 28b may be implemented on the same printed circuit board, i.e., the motherboard, that includes the server components. In additional embodiments, the adaptors 28a, 28b may be implemented on an expansion card that is mounted on the server 20 motherboard or backplane.
The foregoing description of various embodiments has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the embodiments to the precise form disclosed. Many modifications and variations are possible in light of the above teaching.