This invention relates to packet processing in switched fabric networks.
PCI (Peripheral Component Interconnect) Express is a serialized I/O interconnect standard developed to meet the increasing bandwidth needs of the next generation of computer systems. PCI Express was designed to be fully compatible with the widely used PCI local bus standard. PCI is beginning to hit the limits of its capabilities, and while extensions to the PCI standard have been developed to support higher bandwidths and faster clock speeds, these extensions may be insufficient to meet the rapidly increasing bandwidth demands of PCs in the near future. With its high-speed and scalable serial architecture, PCI Express may be an attractive option for use with or as a possible replacement for PCI in computer systems. The PCI Special Interest Group (PCI-SIG) manages PCI specifications (e.g., PCI Express Base Specification 1.0a) as open industry standards, and provides the specifications to its members.
Advanced Switching (AS) is a technology which is based on the PCI Express architecture, and which enables standardization of various backplane architectures. AS utilizes a packet-based transaction layer protocol that operates over the PCI Express physical and data link layers. The AS architecture provides a number of features common to multi-host, peer-to-peer communication devices such as blade servers, clusters, storage arrays, telecom routers, and switches. These features include support for flexible topologies, packet routing, congestion management (e.g., credit-based flow control), fabric redundancy, and fail-over mechanisms. The Advanced Switching Interconnect Special Interest Group (ASI-SIG) is a collaborative trade organization chartered with providing a switching fabric interconnect standard, specifications of which it provides to its members.
Each switch element 102 and end point 104 has an Advanced Switching (AS) interface that is part of the AS architecture defined by the “Advanced Switching Core Architecture Specification” (e.g., Revision 1.0, December 2003, available from the Advanced Switching Interconnect-SIG at www.asi-sig.org). The AS architecture utilizes a packet-based transaction layer protocol that operates over the PCI Express physical and data link layers 202, 204, as shown in FIG. 2.
In some implementations, the end points 104 include a packet building module that has the ability to assemble different portions (e.g., header or payload fields) of a packet (e.g., an AS transaction layer packet) in parallel. Different header fields may include information that can be determined independently for each of a series of packets being sent into the switched fabric network 100. For example, a header field indicating a route for a packet can be determined independently from a header field indicating a priority of the packet. Therefore, separate computing resources (e.g., dedicated circuitry and/or dedicated processor execution threads) can be dedicated to determining these header fields.
The fields can be concurrently constructed and stored in separate memory locations. This enables the dedicated computing resources to determine and store data for a second packet without waiting for all of the fields of a first packet to be constructed and stored. A packet assembly module can then assemble a packet after all of the fields are available. An exemplary packet building module is described in more detail below.
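For illustration, a minimal C sketch of one way such a segmented memory structure and its independent field builders might be organized follows; all names, sizes, and field widths here are hypothetical and are not taken from the AS specification.

    #include <stdint.h>
    #include <string.h>

    #define RING_DEPTH 16  /* hypothetical number of in-flight packet slots */

    /* Each portion of a packet is built into its own memory area, indexed
     * by the packet's slot, so the builders never contend for storage. */
    struct segmented_memory {
        uint8_t  route_hdr[RING_DEPTH][8];   /* AS route header bytes */
        uint8_t  pi_hdr[RING_DEPTH][16];     /* PI-specific header bytes */
        uint8_t  payload[RING_DEPTH][256];   /* payload data */
        uint16_t payload_len[RING_DEPTH];
        uint8_t  route_done[RING_DEPTH];     /* set when a portion is stored */
        uint8_t  pi_done[RING_DEPTH];
        uint8_t  payload_done[RING_DEPTH];
    };

    /* Each builder runs independently; having stored its portion for slot n,
     * it can immediately begin its portion for slot n+1 without waiting for
     * the other portions of packet n to be completed. */
    void store_route_header(struct segmented_memory *m, int slot,
                            const uint8_t *bytes, size_t len)
    {
        memcpy(m->route_hdr[slot], bytes, len);
        m->route_done[slot] = 1;  /* in hardware this could be a status bit */
    }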
AS uses a path-defined routing methodology in which the source of a packet provides all information required by a switch (or switches) to route the packet to the desired destination.
A path may be defined by the turn pool 402, turn pointer 404, and direction flag 406 in the AS header 302, as shown in FIG. 4.
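For illustration, the path fields might be rendered in C as the packed structure below; the widths shown (a 31-bit turn pool, a 5-bit turn pointer, and a 1-bit direction flag) and the fixed 4-bit turn width are assumptions made for the sketch, and the specification is authoritative for the actual layout and turn-pointer semantics.

    /* Hypothetical packed view of the path fields in an AS route header. */
    struct as_path {
        unsigned int turn_pool : 31;  /* concatenated per-switch turn values */
        unsigned int direction : 1;   /* traversal direction of the turn pool */
        unsigned int turn_ptr  : 5;   /* position of the current hop's turn */
    };

    /* A switch extracts its own turn value and advances the pointer for the
     * next hop. Simplified: assumes every switch consumes a 4-bit turn; in
     * practice the turn width depends on the switch's port count. */
    unsigned int next_turn(struct as_path *p)
    {
        unsigned int turn = (p->turn_pool >> p->turn_ptr) & 0xF;
        p->turn_ptr += 4;
        return turn;
    }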
The PI field 302B in the AS header 302 determines the format of the encapsulated packet in the payload field 304. The PI field 302B is inserted by the end point 104 that originates the AS packet and is used by the end point that terminates the packet to correctly interpret the packet contents. The separation of routing information from the remainder of the packet enables an AS fabric to tunnel packets of any protocol.
The PI field 302B includes a PI number that represents one of a variety of possible fabric management and/or application-level interfaces to the switched fabric network 100. Table 1 provides a list of PI numbers currently supported by the AS Specification.
PI numbers 0-7 are used for various fabric management tasks, and PI numbers 8-126 are application-level interfaces. As shown in Table 1, PI number 8 (or equivalently “PI-8”) is used to tunnel or encapsulate a native PCI Express packet. Other PI numbers may be used to tunnel various other protocols, e.g., Ethernet, Fibre Channel, ATM (Asynchronous Transfer Mode), InfiniBand®, and SLS (Simple Load Store). An advantage of an AS switch fabric is that a mixture of protocols may be tunneled simultaneously through a single, universal switch fabric, a powerful and desirable feature for next-generation modular applications such as media gateways, broadband access routers, and blade servers.
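For illustration, a receiving end point's use of the PI number to interpret the payload can be sketched in C as a simple classification; the PI ranges follow the text above, and the enum and function names are hypothetical.

    #include <stdint.h>

    enum pi_class {
        PI_FABRIC_MGMT,    /* PI 0-7: fabric management tasks */
        PI_PCIE_TUNNELED,  /* PI-8: encapsulated native PCI Express packet */
        PI_APPLICATION     /* PI 9-126: other tunneled protocols, e.g., Ethernet */
    };

    enum pi_class classify_pi(uint8_t pi)
    {
        if (pi <= 7)
            return PI_FABRIC_MGMT;
        if (pi == 8)
            return PI_PCIE_TUNNELED;
        return PI_APPLICATION;
    }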
The AS architecture supports the establishment of direct endpoint-to-endpoint logical paths through the switch fabric known as Virtual Channels (VCs). This enables a single switched fabric network to service multiple, independent logical interconnects simultaneously, each VC interconnecting AS end points for control, management and data. Each VC provides its own queue so that blocking in one VC does not cause blocking in another. Each VC may have independent packet ordering requirements, and therefore each VC can be scheduled without dependencies on the other VCs.
The AS architecture defines three VC types: Bypass Capable Unicast (BVC); Ordered-Only Unicast (OVC); and Multicast (MVC). BVCs have bypass capability, which may be necessary for deadlock-free tunneling of some protocols, typically load/store protocols. OVCs are single-queue unicast VCs, which are suitable for message-oriented “push” traffic. MVCs are single-queue VCs for multicast “push” traffic.
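For illustration, the three VC types and their queue organization, as described above, might be rendered in C as follows; the structure is a hypothetical software analogue of what would typically be hardware queues.

    enum vc_type { VC_BVC, VC_OVC, VC_MVC };

    struct packet;  /* opaque packet handle */
    struct queue { struct packet *head, *tail; };

    struct virtual_channel {
        enum vc_type type;
        struct queue ordered_q;  /* present for every VC type */
        struct queue bypass_q;   /* used only by BVCs, allowing some traffic
                                    to bypass blocked ordered traffic */
    };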
The AS architecture provides a number of congestion management techniques, one of which is a credit-based flow control technique that ensures that packets are not lost due to congestion. Link partners (e.g., an end point 104 and a switch element 102, or two switch elements 102) in the network exchange flow control credit information to guarantee that the receiving end of a link has the capacity to accept packets. Flow control credits are computed on a per-VC basis by the receiving end of the link and communicated to the transmitting end of the link. Typically, a packet is transmitted only when there are enough credits available for the particular VC to carry it. Upon sending a packet, the transmitting end of the link debits its available credit account by an amount of flow control credits that reflects the packet size. As the receiving end of the link processes the received packet (e.g., forwards the packet to an end point 104), space is made available on the corresponding VC, and flow control credits are returned to the transmitting end of the link, which adds them to its credit account.
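For illustration, the per-VC credit accounting described above can be sketched in C; the credit granularity and all names here are hypothetical.

    #define NUM_VCS 8  /* hypothetical number of virtual channels */

    struct credit_state {
        int available[NUM_VCS];  /* credits the transmitter may spend per VC */
    };

    /* Transmitting end: send only if the VC has enough credits, then debit
     * the credit account by an amount reflecting the packet size. */
    int try_send(struct credit_state *cs, int vc, int pkt_credits)
    {
        if (cs->available[vc] < pkt_credits)
            return 0;  /* not enough credits; hold the packet */
        cs->available[vc] -= pkt_credits;
        /* ... transmit the packet on this VC ... */
        return 1;
    }

    /* Called when the receiving end has processed a packet and returned
     * credits; the transmitting end adds them back to its account. */
    void credits_returned(struct credit_state *cs, int vc, int credits)
    {
        cs->available[vc] += credits;
    }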
The egress module 500 includes a packet building module 512 that assembles AS transaction layer packets to be sent into the switched fabric network 100.
Any of a variety of types of data sources can provide the portions of a packet to be stored in one of the memory areas. In at least some implementations, each data source is able to operate independently from the other data sources. A data source can be implemented using software (e.g., as a dedicated execution thread in a processor), using hardware (e.g., as dedicated circuitry), or using both software and hardware. The example in FIG. 6 includes three data sources: a header engine 608 that builds the AS route header, a multi-source header engine 610 that builds the PI header, and a payload engine 612 that fetches payload data.
In one implementation, the multi-source header engine 610 includes multiple customized PI modules, each of which handles a particular PI or group of PIs.
Each data source is able to operate in parallel with and independently of the other data sources. For example, while the header engine 608 is building the AS route header, the multi-source header engine 610 can be building the PI header, and the payload engine 612 can be fetching payload data concurrently. Once a particular data source is done with its current work, it can start building the corresponding portion of the next packet even though the current packet might not be completely built yet.
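For illustration, a software analogue of this concurrency using one dedicated thread per data source is sketched below; in a hardware implementation each source could instead be dedicated circuitry, and the engine bodies here are placeholders.

    #include <pthread.h>
    #include <stdio.h>

    static void *route_engine(void *arg)
    {
        (void)arg;
        puts("building AS route headers");  /* placeholder work */
        return NULL;
    }

    static void *pi_engine(void *arg)
    {
        (void)arg;
        puts("building PI headers");        /* placeholder work */
        return NULL;
    }

    static void *payload_engine(void *arg)
    {
        (void)arg;
        puts("fetching payload data");      /* placeholder work */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[3];
        void *(*engines[3])(void *) = { route_engine, pi_engine, payload_engine };

        /* The three data sources run concurrently; none waits on the others. */
        for (int i = 0; i < 3; i++)
            pthread_create(&t[i], NULL, engines[i], NULL);
        for (int i = 0; i < 3; i++)
            pthread_join(t[i], NULL);
        return 0;
    }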
The packet building module 512 includes a combiner 620 that assembles the portions of a packet stored in the segmented memory structure 600. The combiner 620 monitors the status of packets in the segmented memory structure 600, and after all portions of a packet are stored in their respective memory areas, the combiner 620 assembles them into a complete packet that is ready to be sent out (e.g., to the AS link layer module 502).
In one implementation, the segmented memory structure 600 includes two indicators associated with each memory area that are used by the combiner 620 to monitor the status of packets. For example, the first memory area 602 is associated with a first indicator 614A that is used to reserve space for a portion of a particular packet in the memory area 602 when the corresponding (one or more) portions of the packet are being built and stored in the other memory areas. A second indicator 614B is used to notify the combiner 620 that a particular portion of a packet is built and stored.
These two indicators can be used to build various kinds of AS packets. For example, if a packet includes only an AS route header and a PI header and no payload, then space is reserved for that packet only in the first memory area 602 and the second memory area 604 of the segmented memory structure 600. After the AS route header portion and PI header portion are built and stored, the corresponding indicators notify the combiner 620 to assemble that packet. Since space was reserved only for the AS route header portion and PI header portion, the combiner knows that for that particular packet there is no need to wait for payload data to be fetched and stored.
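For illustration, the two-indicator scheme can be sketched in C as follows; the reservation indicator tells the combiner which portions to expect, and the completion indicator tells it when each expected portion has been stored. All names here are hypothetical.

    #include <stdbool.h>

    #define NUM_AREAS 3  /* route header, PI header, payload */

    struct area_status {
        bool reserved;  /* first indicator: space reserved for this packet */
        bool done;      /* second indicator: portion built and stored */
    };

    /* A packet is ready for assembly when every area reserved for it has
     * been filled; areas that were never reserved (e.g., no payload) are
     * simply not waited on. */
    bool packet_ready(const struct area_status s[NUM_AREAS])
    {
        for (int i = 0; i < NUM_AREAS; i++) {
            if (s[i].reserved && !s[i].done)
                return false;  /* still waiting on an expected portion */
        }
        return true;  /* the combiner may assemble and send the packet */
    }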
The techniques described in this specification can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The techniques can be implemented as a computer program product, i.e., a computer program tangibly embodied in an information carrier, e.g., in a machine-readable storage device or in a propagated signal, for execution by, or to control the operation of, data processing apparatus, e.g., a programmable processor, a computer, or multiple computers. A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
Processes described herein can be performed by one or more programmable processors executing a computer program to perform functions described herein by operating on input data and generating output. Processes can also be performed by, and techniques can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
The techniques can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of these techniques, or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The invention has been described in terms of particular embodiments. Other embodiments are within the scope of the following claims. For example, the steps of the invention can be performed in a different order and still achieve desirable results.