The present application describes systems and techniques relating to interconnecting components within computing and network devices.
Traditional data processing devices have used buses to interconnect internal components. Typically, a shared-bus architecture provides a common bus over which internal system components can communicate (e.g., the Peripheral Component Interconnect (PCI) bus). Such shared buses typically allow multiple components (e.g., input/output (I/O) ports of components within the device) to be plugged into the bus as needed, and some type of arbitration process is used to handle contention for the shared bus.
Other system interconnect architectures include shared memory designs and point-to-point switching fabrics. In a shared memory architecture, data is placed in the shared memory when it arrives at a port within the device, and this data is then moved to an appropriate output port within the device. In a switch fabric design, multiple input ports and multiple output ports within a device are connected by a matrix of switching points that provide any-to-any, point-to-point links among the ports. Links may be set up on-the-fly for the duration of a data exchange between components within the computing device, and multiple links may be active at once.
PCI Express is a serialized I/O interconnect standard developed to meet the increasing bandwidth needs of data processing machines. PCI Express was designed to be fully compatible with the widely used PCI local bus standard. Various extensions to the PCI standard have been developed to support higher bandwidths and faster clock speeds. With its high-speed and scalable serial architecture, PCI Express may be an attractive option for use with or as a possible replacement for PCI in computer systems. The PCI Express architecture is described in the PCI Express Base Specification, Revision 1.1, published Mar. 28, 2005, which is available through the PCI-SIG (PCI Special Interest Group) (www-pcisig-com).
Advanced Switching Interconnect (ASI) is a switch fabric technology, which is an extension to the PCI Express architecture. ASI utilizes a packet-based transaction layer protocol that operates over the PCI Express physical and data link layers. The ASI architecture provides a number of features common to multi-host, peer-to-peer communication devices such as blade servers, clusters, storage arrays, telecom routers, and switches. These features include support for flexible topologies, packet routing, congestion management (e.g., credit-based flow control), fabric redundancy, and fail-over mechanisms. The ASI architecture is described in the Advanced Switching Core Architecture Specification, Revision 1.0 (the “ASI Specification”) (December 2003), which is available through the ASI-SIG (Advanced Switching Interconnect SIG) (www-asi-sig-org).
Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features and advantages may be apparent from the description and drawings, and from the claims.
The endpoints 130 reside on the edge of the switch fabric and represent data ingress and egress points for the switch fabric. The endpoints 130 may encapsulate and/or translate packets entering and exiting the switch fabric and may be viewed as bridges between the switch fabric and other interfaces. Moreover, the endpoints 130 may be various components within the data processing machine or interfaces to such components, such as integrated circuit chips, adapters, memories, or other electronic devices.
For example, a component attached to the switch fabric may include a network interface device 132. The network interface device 132 may be a network interface card (NIC) or an integrated network device (e.g., a network adapter built into a main circuit board, such as a motherboard, of the machine 100). In general, the network interface device 132 connects the machine 100 with a network 150, which may be a land-based computer network or a mobile device network, thus providing communications access to one or more networked machines 160.
The data processing machine 100 may be a computing device, a network device, or both. For example, the machine 100 may be a mobile phone, a network router or switch, or a server system. The machine 100 may be a modular machine, including high-performance backplane interconnects and modular elements, such as a network server having interchangeable cards or blade servers. Moreover, each of the endpoint components 130 may itself be a computing device, such as a server.
The machine 100 includes a fabric management component 140, which is configured to cause a primary fabric manager 142 to trigger an initiating secondary fabric manager 144 to obtain an initial fabric topology, and further configured to synchronize fabric management information between the primary fabric manager 142 and the secondary fabric manager 144. The primary and secondary fabric managers 142, 144 may be subcomponents of the fabric management component 140, or the primary and secondary fabric managers 142, 144 may be separate components. For example, the primary and secondary fabric managers 142, 144 can be separate software products that run on two respective privileged devices in the switch fabric, or the primary and secondary fabric managers 142, 144 can be different functionality implemented within a single software product that runs on two privileged devices in the switch fabric.
The primary fabric manager 142 oversees the switch fabric. The secondary fabric manager 144 acts as a hot backup to the primary fabric manager 142 (although the secondary fabric manager 144 need not be dedicated to fabric management operations and may concurrently provide other functionality). Fabric management information, such as multicast information and peer-to-peer connections information, is synchronized between the primary fabric manager 142 and the secondary fabric manager 144. If the secondary fabric manager 144 determines that the primary fabric manager 142 has failed, fabric management operations failover to the secondary fabric manager 144, which continues running the switch fabric with minimal interruption, thus providing a high availability feature for the switch fabric.
The switch fabric may conform to an Advanced Switching Interconnect (ASI) specification defined by an ASI Special Interest Group (SIG).
ASI uses a path-defined routing methodology in which the source of a packet provides all information used by a switch (or switches) to route the packet to the desired destination.
A path may be defined by the turn pool 402, turn pointer 404, and direction flag 406 in the path header.
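For illustration, these path-routing fields might be modeled as follows (the field names, widths, and packing are assumptions for this sketch; the authoritative layout is defined in the ASI Specification):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of the path-routing fields described above; the
 * field widths and packing are assumptions, not the layout mandated by
 * the ASI Specification. */
struct asi_path {
    uint32_t turn_pool;    /* packed per-switch turn values supplied by the source */
    uint8_t  turn_pointer; /* index of the next turn value to consume */
    bool     direction;    /* direction flag (forward or backward traversal) */
};

Each switch along the route can use the turn pointer to select its turn value from the turn pool and then forward the packet, consistent with the path-defined routing described above.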
The PI field 306 in the ASI path header 302 identifies the Protocol Interface (PI) associated with the encapsulated packet.
PIs represent fabric management and application-level interfaces to the switch fabric. Table 1 provides a list of PIs currently supported by the ASI Specification.
PIs 0-7 are reserved for various fabric management tasks, and PIs 8-126 are application-level interfaces. The protocol interfaces may be used to tunnel or encapsulate native PCI Express, as well as various other protocols, e.g., Ethernet, Fibre Channel, ATM (Asynchronous Transfer Mode), InfiniBand®, and SLS (Simple Load Store). A feature of an ASI switch fabric is that a mixture of protocols may be simultaneously tunneled through a single, universal switch fabric, which can be used by next generation modular applications such as media gateways, broadband access routers, and blade servers.
The ASI architecture supports the implementation of an ASI Configuration Space in each ASI device in the network. The ASI Configuration Space is a storage area that includes fields to specify device characteristics as well as fields used to control the ASI device. The information is presented in the form of capability structures and other storage structures, such as tables and a set of registers. Table 2 provides a set of capability structures (ASI native capability structures) that are defined by the ASI Specification.
Legend:
O = Optional normative
R = Required
R w/OE = Required with optional normative elements
N/A = Not applicable
The information stored in the ASI native capability structures may be accessed through PI-4 packets, which are used for device management.
In some implementations of a switched fabric, the ASI devices on the fabric may be restricted to read-only access of another ASI device's ASI native capability structures, with the exception of one or more ASI end nodes which have been elected as fabric managers. A fabric manager election process may be initiated by a variety of hardware and/or software mechanisms to elect one or more fabric managers for the switched fabric network. A fabric manager is an ASI endpoint that, in one sense, owns all of the ASI devices, including itself, in the switch fabric. When both a primary fabric manager and a secondary fabric manager are elected, the secondary fabric manager may declare ownership of the ASI devices in the network upon a failure of the primary fabric manager.
Once a fabric manager declares ownership, that fabric manager has privileged access to its ASI devices' ASI native capability structures. In other words, the fabric manager has read and write access to the ASI native capability structures of all of the ASI devices in the network, while the other ASI devices are restricted to read-only access, unless granted write permission by the fabric manager.
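This access model may be sketched as follows (a minimal illustration only; the structures and the handle_config_write helper are assumptions rather than anything defined by the ASI Specification, and real PI-4 packet handling would involve additional fields and checks):

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device-side view of a configuration-space write request
 * carried in a PI-4 packet. Names and fields are illustrative only. */
struct config_write_req {
    uint32_t requester_id; /* fabric identity of the requesting endpoint */
    uint32_t offset;       /* offset into the ASI Configuration Space */
    uint32_t value;        /* value to write */
};

struct asi_device_state {
    bool     owned;           /* true once a fabric manager owns this device */
    uint32_t owner_id;        /* fabric manager that declared ownership */
    uint32_t write_permitted; /* non-owner granted write access; 0 = no grant */
};

/* Enforce the access rule described above: the owning fabric manager has
 * read/write access, while other devices are read-only unless granted
 * write permission. */
int handle_config_write(struct asi_device_state *dev,
                        const struct config_write_req *req,
                        uint32_t *config_space)
{
    bool is_owner   = dev->owned && req->requester_id == dev->owner_id;
    bool is_granted = dev->write_permitted != 0 &&
                      req->requester_id == dev->write_permitted;

    if (!is_owner && !is_granted)
        return -EACCES;   /* read-only access for everyone else */

    config_space[req->offset / sizeof(uint32_t)] = req->value;
    return 0;
}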
Once PFM discovery/configuration is completed, there is at least one endpoint in the fabric capable of being the SFM, and the arbitration process for ST[Y] ownership is run at 505. As used here, X is the index of the ST owned by the PFM, and Y is the index of the ST owned by the SFM. The PFM checks whether the ST[Y] Own Field has been set in the PFM at 510. Once the ST[Y] Own Field has been set in the PFM, a synchronization message may be sent from the PFM to the SFM at 515. This message (e.g., PFM_SynchRequest(uint STIndex)), as well as the other messages described below, may form a set of synchronization protocol primitives, such as detailed below in Table 3 for ASI.
The SFM receives the synchronization message from the PFM and checks whether the ST[X] Own Field has been set in the SFM at 520. In rare cases, due to propagation delays, the PFM has not yet claimed ownership of the SFM (i.e., the ST[X] Own Field is not yet set in the SFM), but the SFM receives the synchronization message. In this case, error handling operations are performed at 525. This may involve simply waiting for some time and then checking the ST[X] Own Field register again. By the time the SFM receives the synchronization message, the PFM has been elected, but the ST[X] Own Field register of the SFM may not yet have been set (again, due to propagation delay).
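One way to handle this case is a bounded wait-and-recheck, sketched below (read_st_own_field and the retry parameters are hypothetical placeholders for an implementation's register access; they are not defined by the ASI Specification):

#include <stdbool.h>
#include <unistd.h>

/* Hypothetical accessor: reads the ST[X] Own Field register of this SFM.
 * A real implementation would read the local capability structure. */
extern bool read_st_own_field(unsigned st_index);

/* Wait for the PFM's ownership claim to propagate before proceeding with
 * synchronization, as described above. Returns true once the field is set. */
bool wait_for_ownership(unsigned st_index, unsigned max_retries,
                        unsigned delay_us)
{
    for (unsigned i = 0; i < max_retries; i++) {
        if (read_st_own_field(st_index))
            return true;      /* PFM ownership observed */
        usleep(delay_us);     /* allow time for propagation */
    }
    return false;             /* escalate to further error handling */
}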
Once it is established that both PFM and SFM components have been elected in the fabric and the SFM is running and ready to synchronize, the SFM sends a synchronization request reply (e.g., SFM_SynchRequestReply (SFM_READY)) at 530. After receiving this message, the PFM sends a message to the SFM at 535 instructing the SFM to obtain the fabric topology (e.g., PFM_SendMsg (GET_TOPOLOGY)). Thus, the SFM waits until instructed by the PFM before obtaining the fabric topology from its own perspective.
The SFM obtains its fabric topology at 540. In doing so, the SFM can take advantage of various parameters previously set in the fabric by the PFM. The SFM may use various techniques to obtain the fabric topology. For example, the SFM may use the techniques described in U.S. patent application Ser. No. 10/816,253, filed Mar. 31, 2004, and entitled “ADVANCED SWITCHING FABRIC DISCOVERY PROTOCOL”, which is hereby incorporated by reference.
Once the SFM obtains the fabric topology, the SFM sends a reply message (e.g., SFM_SendMsgReply (SUCCESS)) at 545. Then, the PFM instructs the SFM to synchronize any table and graph elements as needed (note that there may not be any tables yet at this point in the processing, but any tables already set up should be synchronized during the initializing process). For each table and graph element to be synchronized at 550, a synchronization message is sent to the SFM at 555. For example, the PFM_SynchTable ( ) and PFM_SynchGraph ( ) primitives may be used as needed. Each message may identify a specific data structure to be updated and may include a minimum amount of data needed to effect the update. For example, if a table entry needs to be synchronized, the message may include only the data necessary to update that specific table entry within a larger table.
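For instance, an incremental synchronization message may be represented as follows (a sketch only; the table identifiers, field names, and sizes are illustrative assumptions, the point being that each message names one data structure and carries only the data needed for the update):

#include <stdint.h>
#include <string.h>

/* Illustrative incremental synchronization message: identifies one table
 * and one entry, and carries only the bytes needed to update that entry. */
enum sync_table_id {
    SYNC_TABLE_CONNECTION,
    SYNC_TABLE_P2P_CONN,
    SYNC_TABLE_MC_INGRESS_PORT_PATH,
    SYNC_TABLE_MC_EGRESS_PORT_PATH,
};

struct synch_table_msg {
    uint16_t table_id;     /* which table to update (enum sync_table_id) */
    uint32_t entry_index;  /* which entry within that table */
    uint16_t payload_len;  /* number of valid bytes in payload */
    uint8_t  payload[64];  /* only the data needed to update that entry */
};

/* Build a PFM_SynchTable-style message for a single changed table entry. */
static void build_table_sync(struct synch_table_msg *msg, uint16_t table_id,
                             uint32_t entry_index, const void *entry,
                             uint16_t entry_len)
{
    if (entry_len > sizeof(msg->payload))
        entry_len = sizeof(msg->payload);  /* sketch: real code would reject */
    msg->table_id = table_id;
    msg->entry_index = entry_index;
    msg->payload_len = entry_len;
    memcpy(msg->payload, entry, entry_len);
}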
Various data structures may be used by the PFM and SFM. In general, one or more connection-related data structures and one or more fabric-graph-related data structures may be used. In the example from Table 3 above, multiple tables are used to store the connection-related data. The HASH table is used to facilitate fabric topology searching. For example, a hash function may be used along with a hash key to locate a device in the fabric topology. Moreover, using a hash key composed of device address plus port number (turn pool plus turn pointer in ASI) with this hash table can reduce hash table collisions and improve searching operations.
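As a sketch (the key width and hash function below are illustrative assumptions; what matters is that the key combines the device address with the port number):

#include <stdint.h>

#define HASH_BUCKETS 256  /* illustrative table size */

/* Compose a hash key from the device address and port number
 * (turn pool plus turn pointer in ASI), as described above. */
static uint64_t make_hash_key(uint32_t turn_pool, uint8_t turn_pointer)
{
    return ((uint64_t)turn_pool << 8) | turn_pointer;
}

/* Simple multiplicative hash over the composed key; any reasonable
 * hash function could be used here. */
static unsigned hash_bucket(uint64_t key)
{
    return (unsigned)((key * 0x9E3779B97F4A7C15ULL) >> 56) % HASH_BUCKETS;
}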
The CONNECTION table from Table 3 above indicates how devices are connected to each other in the switch fabric by detailing the total set of connections in the fabric (multiple links between two devices being treated as one connection for those two devices). The SPANN_TREE table includes information concerning the spanning tree. The P2P_CONN table includes information about the peer-to-peer connections in the fabric (e.g., a list of which devices are involved in peer-to-peer communications). The TPV table is used by third-party vendors. The MC_INGRESS_PORT_PATH and MC_EGRESS_PORT_PATH tables include information about which ingress and egress ports of switches are enabled or disabled for multicast (MC) operations.
The MCG_GROUP table includes information about multicast groups in the switch fabric. The MCG_NODE table includes information about the group (e.g., a newly created group), such as group number, number of members, etc. The MCG_MEMBER table includes information about a member (e.g., a new member that joins a particular multicast group), such as the device serial number, status (listener or writer), etc. These data structures are referred to as tables in the broadest sense, and thus need not be stored as a contiguous block of memory.
For example, the data structures used for the MCG_MEMBER and MCG_NODE tables can be as follows:
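For example (the field names and types below are illustrative assumptions consistent with the description that follows; an implementation's actual layout may differ), the structures might be defined in C as:

#include <stdint.h>

/* Member of a multicast group: one instance per member, linked together
 * to form the group's member list. */
struct MCGMember {
    uint64_t device_serial_number; /* identifies the member device */
    uint8_t  status;               /* listener or writer */
    struct MCGMember *next;        /* next member in the group's list */
};

/* One multicast group: created when a new group is created, and holding
 * a pointer to the linked list of its members. */
struct MCGNode {
    uint32_t group_number;         /* multicast group number */
    uint32_t member_count;         /* number of members in the group */
    struct MCGMember *members;     /* head of the linked list of members */
};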
The MCGNode data structure stores information on a multicast group. A new instance of the MCGNode structure may be created when a new multicast group is created, and this structure then contains a pointer to a linked list of member structures. The MCGMember data structure stores information on a member in a multicast group. A new instance of the MCGMember structure may be created when a new member joins a particular multicast group, and this structure is used to form the linked list of member structures.
For the fabric-graph-related data structure(s), in the example from Table 3 above, the EP component specifies the endpoints in the switch fabric. The SW component specifies the switches in the switch fabric. The LINK component specifies the links in the switch fabric. All three of these, endpoints, switches and links, may be hot-added or hot-removed (hot-swapped) in the switch fabric.
The P2P_PATH component describes the actual paths being used in the fabric by the devices doing peer-to-peer communications. The MC_PATH component keeps track of the multicast paths for a given group. The fabric-graph-related data structure(s) can be a single large data structure that represents the whole fabric graph, where the nodes of that structure are themselves made up of additional data structures. The structure may have a common set of fields and, within the structure, a union of data that is unique to endpoints or to switches.
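One way to realize this common-set-plus-union arrangement is sketched below (the field names and the specific per-type data are illustrative assumptions):

#include <stdint.h>

enum fabric_node_type { NODE_ENDPOINT, NODE_SWITCH };

/* Data unique to endpoints (illustrative fields). */
struct ep_data {
    uint8_t  port_count;      /* ports on the endpoint */
    uint32_t capabilities;    /* e.g., fabric-manager capable */
};

/* Data unique to switches (illustrative fields). */
struct sw_data {
    uint8_t  port_count;      /* ports on the switch */
    uint32_t mc_port_mask;    /* ports enabled for multicast */
};

/* A node in the fabric graph: a common set of fields shared by all
 * devices, plus a union holding data unique to endpoints or switches. */
struct fabric_node {
    uint32_t device_addr;           /* address of the device in the fabric */
    enum fabric_node_type type;     /* endpoint or switch */
    struct fabric_node **neighbors; /* adjacent nodes reached via LINKs */
    unsigned neighbor_count;
    union {
        struct ep_data ep;          /* valid when type == NODE_ENDPOINT */
        struct sw_data sw;          /* valid when type == NODE_SWITCH */
    } u;
};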
As the SFM receives each synchronization message from the PFM, the SFM updates its tables and fabric graph as appropriate at 560 and sends a synchronization reply (e.g., using the SFM_SynchTableReply ( ) and SFM_SynchGraphReply ( ) primitives) at 565. After the PFM's data structures have been synchronized with the SFM's data structures, the PFM sends one or more messages to complete the synchronization at 570. For example, the PFM may send a PFM_SendMsg (SYNCH_COMPLETE) message and a PFM_SendMsg (SET_HB_TIMER, HB-Period) message.
Various approaches to determining the heartbeat period (HB-Period) are described further below, but in general, the SFM sets a heartbeat timer at 575 in response to one or more messages from the PFM. Then, the SFM may send a reply message (e.g., SFM_SendMsgReply (SUCCESS)) at 580 to indicate that the initializing process has finished.
The SFM checks at 620 whether the SFM has received a heartbeat message from the PFM before expiration of the SFM's timer. If so, the SFM's timer is reset at 625. This may involve setting the timer based on time information included in the heartbeat message from the PFM. If the SFM timer expires without the SFM receiving a heartbeat message, the SFM initiates a failure detection and recovery protocol at 630.
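A simplified sketch of this secondary-side monitoring follows (the timer and messaging primitives are hypothetical placeholders for an implementation's facilities):

#include <stdbool.h>

/* Hypothetical primitives standing in for the fabric manager's
 * messaging and timing facilities. */
extern bool heartbeat_received_within(double seconds); /* blocks up to 'seconds' */
extern void initiate_failover(void);                    /* failure detection/recovery */

/* SFM-side heartbeat monitoring: the timer is effectively reset on every
 * heartbeat; if it expires without one, failure detection and recovery
 * are initiated, as described above. */
void sfm_monitor_pfm(double hb_timeout_seconds)
{
    for (;;) {
        if (heartbeat_received_within(hb_timeout_seconds))
            continue;          /* heartbeat arrived before expiration */
        initiate_failover();   /* no heartbeat: begin recovery protocol */
        break;
    }
}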
Once the SFM detects that the PFM has failed, the SFM may take the PFM's role and continue running the fabric without interruption. The new PFM then selects a Fabric Manager capable device, if one exists in the fabric, to become the fabric's new SFM. On the other hand, if the PFM detects that the SFM has failed, the PFM selects a Fabric Manager capable device, if one exists in the fabric, to take the role of the failed SFM while the same PFM continues running the fabric.
This protocol allows the Fabric Manager devices to detect not only failure of the other, but also isolation from the fabric due to some intermediate hardware failure. The PFM may detect that the SFM is unable to monitor the PFM, or that the PFM is unable to monitor the SFM, or both, due to one or more failed paths that the PFM and SFM were using to monitor each other, rather than due to a failure of the PFM or SFM itself. In this case, the PFM may notify the SFM, through one or more other paths, that the PFM is still alive and that the SFM should not take recovery or fail-over actions. Meanwhile, the PFM recovers from the path failure by moving the fabric to a steady and consistent state and reestablishing valid path(s) through which the fabric managers can continue monitoring each other.
As the SFM receives each synchronization message from the PFM, the SFM updates its tables and fabric graph as appropriate at 715 and sends a synchronization reply (e.g., using the SFM_SynchTableReply ( ) and SFM_SynchGraphReply ( ) primitives) at 720.
Synchronizing the fabric management information may involve synchronizing multicast information and peer-to-peer connections information from the primary fabric manager to the secondary fabric manager, such as described above. Synchronizing the fabric management information may involve sending incremental update messages, specific to the changes in the fabric management information in the primary fabric manager, from the primary fabric manager to the secondary fabric manager. Moreover, sending incremental update messages can involve sending a first type of message to synchronize one or more connection-related data structures and sending a second type of message to synchronize one or more fabric-graph-related data structures.
As a result of this synchronization, if the primary fabric manager fails, the secondary fabric manager is ready to take over running the switch fabric. The primary fabric manager actively sends the critical fabric data to the secondary fabric manager, which is thus ready to handle failover. This can result in improved reliability for the switch fabric, providing a high availability feature for system interconnect.
The timeout period may be determined according to a diameter of the switch fabric, a per-link message delay, and a path-repair delay. The diameter of the switch fabric corresponds to the length of the longest path between two devices in the fabric. The per-link message delay may be an average delay per link, and may thus be combined with the fabric diameter to determine a maximum expected time for a message to go from the primary to the secondary fabric manager. The path-repair delay may be the average or expected time needed for the primary fabric manager to repair a broken path (e.g., identify that a link is broken and route around that link to repair the path).
The parameters used to set the timeout period may be measured once the switch fabric is up and running. The primary and secondary fabric managers may communicate with each other about these parameters in order to negotiate the timeout period. Moreover, the timeout period used by the primary fabric manager in sending out heartbeat messages need not be the same length of time as the timeout period used by the secondary fabric manager in deciding when the primary fabric manager has failed. In general, the secondary fabric manager should give the primary fabric manager a little more time to send out a heartbeat message. If a link along the path between the primary and secondary fabric managers goes down, resulting in a heartbeat message being lost, the secondary fabric manager may use a timeout period that allows the primary fabric manager the opportunity to identify the broken link, repair the broken path, and resend the heartbeat message. The timeout period may be set using an empirical approach, an analytical approach, or both, based on fabric size, expected traffic, delays, application types, etc.
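As an illustrative computation only (this particular combination of the factors described above is an assumption, not a formula mandated by the text):

/* Estimate heartbeat timeout periods from the fabric parameters described
 * above; the exact combination shown here is an illustrative assumption. */
double pfm_heartbeat_period(unsigned fabric_diameter,
                            double per_link_delay,
                            double path_repair_delay)
{
    /* Worst-case one-way message time across the fabric, plus time for
     * the PFM to repair a broken path and resend a heartbeat. */
    return fabric_diameter * per_link_delay + path_repair_delay;
}

/* The SFM allows the PFM some extra margin before declaring failure. */
double sfm_failure_timeout(double pfm_period, double margin)
{
    return pfm_period + margin;
}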
The systems and techniques presented herein, including all of the functional operations described in this specification, may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them, such as the structural means disclosed in this specification and structural equivalents thereof. Apparatus according to the described systems and techniques may be implemented as one or more data processing devices communicatively coupled by a switch fabric within a chassis (e.g., multiple motherboards, each having a chipset supporting ASI, plugged into a backplane). Apparatus according to the described systems and techniques may be implemented in a software product (e.g., a computer program product) tangibly embodied in a machine-readable medium (e.g., a magnetic-based storage disk) for execution by a machine (e.g., a programmable processor, network processor, system component); and the processing operations may be performed by a programmable processor executing a program of instructions to perform the described functions by operating on input data and generating output.
The described systems and techniques may be implemented advantageously in one or more software programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. Each software program can be implemented in a high-level procedural or object-oriented programming language, or in assembly or machine language if desired; and in any case, the language can be a compiled or interpreted language. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, a processor will receive instructions and data from a read-only memory, a random access memory and/or a machine-readable signal (e.g., a digital signal received through a network connection).
Generally, a computer will include one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks, magneto-optical disks, and optical disks. Storage devices suitable for tangibly embodying software program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM (electrically programmable read-only memory), EEPROM (electrically erasable programmable read-only memory), and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and optical disks, such as CD-ROM disks. Any of the foregoing may be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).
The present systems and techniques have been described in terms of particular embodiments. Other embodiments are within the scope of the following claims. For example, the operations described may be performed in a different order and still achieve desirable results.