PCI (Peripheral Component Interconnect) Express is a serialized I/O interconnect standard developed to meet the increasing bandwidth needs of the next generation of computer systems. PCI Express was designed to be fully compatible with the widely used PCI local bus standard. PCI is beginning to hit the limits of its capabilities, and while extensions to the PCI standard have been developed to support higher bandwidths and faster clock speeds, these extensions may be insufficient to meet the rapidly increasing bandwidth demands of PCs in the near future. With its high-speed and scalable serial architecture, PCI Express may be an attractive option for use with or as a possible replacement for PCI in computer systems. The PCI Express architecture is described in the PCI Express Base Architecture Specification, Revision 1.0a (Initial release Apr. 15, 2003), which is available through the PCI-SIG (PCI-Special Interest Group) (http://www.pcisig.com).
Advanced Switching (AS) is an extension to the PCI Express architecture. AS utilizes a packet-based transaction layer protocol that operates over the PCI Express physical and data link layers. The AS architecture provides a number of features common to multi-host, peer-to-peer communication devices such as blade servers, clusters, storage arrays, telecom routers, and switches. These features include support for flexible topologies, packet routing, congestion management (e.g., credit-based flow control), fabric redundancy, and fail-over mechanisms. The AS architecture is described in the Advanced Switching Core Architecture Specification, Revision 1.0 (the “AS Specification”) (December 2003), which is available through the ASI-SIG (Advanced Switching Interconnect-SIG) (http://www.asi-sig.org).
The network 100 may have an Advanced Switching (AS) architecture. AS utilizes a packet-based transaction layer protocol that operates over the PCI Express physical and data link layers 202, 204, as shown in
AS uses a path-defined routing methodology in which the source of a packet provides the information required by a switch (or switches) to route the packet to the desired destination.
A path may be defined by the turn pool 402, turn pointer 404, and direction flag 406 in the route header, as shown in
The PI field 306 in the AS route header 302 (
PIs represent fabric management and application-level interfaces to the switched fabric network 100. Table 1 provides a list of PIs currently supported by the AS Specification.
PIs 0-7 are reserved for various fabric management tasks, and PIs 8-126 are application-level interfaces. As shown in Table 1, PI8 is used to tunnel or encapsulate native PCI Express. Other PIs may be used to tunnel various other protocols, e.g., Ethernet, Fibre Channel, ATM (Asynchronous Transfer Mode), InfiniBand®, and SLS (Simple Load Store). An advantage of an AS switch fabric is that a mixture of protocols may be simultaneously tunneled through a single, universal switch fabric making it a powerful and desirable feature for next generation modular applications such as media gateways, broadband access routers, and blade servers.
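By way of illustration only, the PI number ranges stated above may be expressed as a small lookup. The function name `classify_pi` is hypothetical, and values outside the stated ranges are rejected because the text does not describe them.

```python
def classify_pi(pi: int) -> str:
    """Classify a PI number using only the ranges stated in the text."""
    if 0 <= pi <= 7:
        return "fabric management"
    if 8 <= pi <= 126:
        return "application-level"
    # The text does not describe PI values outside 0-126, so nothing is assumed.
    raise ValueError(f"PI {pi} is outside the ranges described in the text")


if __name__ == "__main__":
    for pi in (4, 5, 8):
        print(pi, classify_pi(pi))
```

For example, PI-4 and PI-5 fall in the fabric management range, while PI8, which tunnels native PCI Express, falls in the application-level range.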
The AS architecture supports the implementation of an AS Configuration Space in each AS device in the network. The AS Configuration Space is a storage area that includes fields to specify device characteristics as well as fields used to control the AS device. The information is presented in the form of capability structures and other storage structures, such as tables and a set of registers. The information stored in the capability structures may be accessed through PI-4 packets, which are used for device management.
A fabric manager election process may be initiated by a variety of either hardware or software mechanisms to elect one or more fabric managers for the switched fabric network. A fabric manager is an AS endpoint that “owns” all of the AS devices, including itself, in the network. If multiple fabric managers, e.g., a primary fabric manager and a secondary fabric manager, are elected, then each fabric manager may own a subset of the AS devices in the network. Alternatively, the secondary fabric manager may declare ownership of the AS devices in the network upon a failure of the primary fabric manager, e.g., resulting from a fabric redundancy and fail-over mechanism.
Once a fabric manager declares ownership, it has privileged access to its AS devices' capability structures. In other words, the fabric manager has read and write access to the capability structures of all of the AS devices in the network, while the other AS devices may be restricted to read-only access, unless granted write permission by the fabric manager.
According to the PCI Express Link Layer definition, a link between two AS devices is in one of three states: down (DL_Inactive), i.e., no transmission or reception of packets of any type; fully active (DL_Active), i.e., fully operational and capable of transmitting and receiving packets of any type; or in the process of being initialized (DL_Init).
AS architecture adds to PCI Express' definition of this state machine by introducing a new data-link layer state, DL_Protected, which becomes an intermediate state between the DL_Init and DL_Active states. The DL_Protected link state may be used to provide an intermediate degree of communication capability and serves to enhance an AS fabric's robustness and HA (High Availability) readiness.
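By way of illustration only, the four link states may be modeled as an enumeration. The linear bring-up order below is a simplifying assumption drawn from the statement that DL_Protected lies between DL_Init and DL_Active; the actual state machine defines further transitions (e.g., back to DL_Inactive) that are not modeled here. The names `LinkState` and `next_bring_up_state` are hypothetical.

```python
from enum import Enum


class LinkState(Enum):
    DL_INACTIVE = "DL_Inactive"    # link down: no packets of any type
    DL_INIT = "DL_Init"            # link is being initialized
    DL_PROTECTED = "DL_Protected"  # AS addition: intermediate communication capability
    DL_ACTIVE = "DL_Active"        # fully operational


# Assumed linear bring-up order; only the DL_Init -> DL_Protected -> DL_Active
# portion is stated in the text.
BRING_UP_ORDER = [
    LinkState.DL_INACTIVE,
    LinkState.DL_INIT,
    LinkState.DL_PROTECTED,
    LinkState.DL_ACTIVE,
]


def next_bring_up_state(state):
    """Return the next state on the way to DL_Active, or None if already there."""
    index = BRING_UP_ORDER.index(state)
    if index + 1 < len(BRING_UP_ORDER):
        return BRING_UP_ORDER[index + 1]
    return None
```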
The AS architecture supports the establishment of direct endpoint-to-endpoint logical paths known as Virtual Channels (VCs). This enables a single switched fabric network to service multiple, independent logical interconnects simultaneously, each VC interconnecting AS end nodes for control, management, and data. Each VC provides its own queue so that blocking in one VC does not cause blocking in another. Since each VC has independent packet ordering requirements, each VC may be scheduled without dependencies on the other VCs.
The AS architecture defines three VC types: bypass capable unicast (BVC); ordered-only unicast (OVC); and multicast (MVC). BVCs have two queues—an ordered-only queue and a bypass capable queue. The bypass capable queue provides BVCs bypass capability, which may be necessary for deadlock free tunneling of protocols. OVCs are single queue unicast VCs, which may be suitable for message oriented “push” traffic. MVCs are single queue VCs for multicast “push” traffic.
To preserve packet ordering, ordered-only packets and bypass capable packets may not pass previously enqueued ordered-only packets, and bypass capable packets may not pass previously enqueued bypass capable packets. To prevent the potential for deadlock, ordered-only packets may pass previously enqueued bypass capable packets that, due to lack of flow control credit, block their forward progress.
Bypass capable packets that have been bypassed by ordered-only packets, e.g., have been moved from the head of the ordered-only queue into the bypass capable queue, have, by definition, already satisfied the BVC's ordering requirements. The following rule ensures that packets which have been previously bypassed are treated fairly, so that their flows are not exposed to potential starvation. All bypassed packets within the bypass capable queue must be the next packets moved out of the VC whenever there are sufficient bypass queue flow control credits to move them. This may continue until either there are insufficient bypass queue flow control credits to propagate other pending, previously bypassed packets or all bypassed packets have been propagated. Only after one of these conditions becomes true can packets from the head of the ordered queue be propagated. This rule ensures that bypass capable packets that have already incurred their ordering delay are able to make forward progress as soon as possible.
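By way of illustration only, the ordering and fairness rules for a BVC may be sketched with a simplified software model. The class `BypassCapableVC`, its method names, and the per-packet credit counters are assumptions made for this sketch: credits are counted in whole packets, and ordered-only and bypass capable packets are assumed to draw from separate credit pools.

```python
from collections import deque


class BypassCapableVC:
    """Simplified model of a BVC with an ordered queue and a bypass queue."""

    def __init__(self, ordered_credits=0, bypass_credits=0):
        self.ordered_queue = deque()  # (name, bypass_capable) in arrival order
        self.bypass_queue = deque()   # names of packets that have been bypassed
        self.ordered_credits = ordered_credits
        self.bypass_credits = bypass_credits

    def enqueue(self, name, bypass_capable):
        self.ordered_queue.append((name, bypass_capable))

    def send_next(self):
        """Return the name of the next packet to leave the VC, or None."""
        # Fairness rule: previously bypassed packets go first whenever
        # bypass credits are available.
        if self.bypass_queue and self.bypass_credits > 0:
            self.bypass_credits -= 1
            return self.bypass_queue.popleft()
        while self.ordered_queue:
            name, bypass_capable = self.ordered_queue[0]
            if not bypass_capable:
                if self.ordered_credits == 0:
                    return None  # nothing may pass a blocked ordered-only packet
                self.ordered_credits -= 1
                self.ordered_queue.popleft()
                return name
            if self.bypass_credits > 0:
                # Credits are available, so the bypass queue is empty here.
                self.bypass_credits -= 1
                self.ordered_queue.popleft()
                return name
            # Blocked bypass capable packet: move it aside so that
            # ordered-only packets behind it can make progress.
            self.ordered_queue.popleft()
            self.bypass_queue.append(name)
        return None
```

In this model, if a bypass capable packet B1 is enqueued ahead of ordered-only packets O1 and O2 while no bypass credits are available, O1 leaves first; if a bypass credit then arrives, B1 leaves before O2.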
When the fabric is powered up, link partners in the fabric may negotiate the largest common number of VCs of each VC type. During link training, the largest common sets of VCs of each VC type supported by both link partners may be initialized and activated.
During link training, surplus BVCs may be transformed into OVCs. A BVC can operate as an OVC by not utilizing its bypass capability, e.g., its bypass queue and associated logic. For example, if link partner A supports three BVCs and one OVC and link partner B supports one BVC and two OVCs, the agreed upon number of VCs would be one BVC and two OVCs, with one of link partner A's BVCs being transformed into an OVC.
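By way of illustration only, the example above may be reproduced with a short calculation. The function `negotiate_vcs` is hypothetical; it assumes that the common BVC count is the smaller of the two partners' BVC counts, that each partner's surplus BVCs may serve as OVCs, and that native OVCs are used before any BVC is transformed. The AS Specification's actual negotiation procedure may differ.

```python
def negotiate_vcs(a_bvcs, a_ovcs, b_bvcs, b_ovcs):
    """Return (bvcs, ovcs, a_converted, b_converted) for two link partners."""
    bvcs = min(a_bvcs, b_bvcs)
    # Surplus BVCs on either side can operate as OVCs.
    a_ovc_capacity = a_ovcs + (a_bvcs - bvcs)
    b_ovc_capacity = b_ovcs + (b_bvcs - bvcs)
    ovcs = min(a_ovc_capacity, b_ovc_capacity)
    # Assumption: a partner converts a BVC only after its native OVCs are used.
    a_converted = max(0, ovcs - a_ovcs)
    b_converted = max(0, ovcs - b_ovcs)
    return bvcs, ovcs, a_converted, b_converted


if __name__ == "__main__":
    # Link partner A: three BVCs, one OVC. Link partner B: one BVC, two OVCs.
    print(negotiate_vcs(3, 1, 1, 2))
```

With the inputs from the example, the result is one BVC and two OVCs, with one of link partner A's BVCs operating as an OVC.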
AS packets may be assigned to one of eight possible traffic classes (TCs), e.g., TC0, TC1, . . . , TC7. The AS device ports may map a received packet to one of the port's active VCs of a given type (e.g., OVC, BVC, or MVC). One or multiple TC assignments may be mapped to the same VC depending on the number of VCs of the type that are active between link partners, and any given TC must be mapped to a single VC of the appropriate VC type within an AS port. TC-to-VC mapping is a function of the number of VCs that are active between link partners.
PI-5 Packet Generator
The AS architecture uses events as a notification mechanism. When a particular condition is detected in the fabric, an event may be sent to an agent responsible for handling that particular condition. Events may be arranged into event classes, and each event class may be identified using a class code. Depending on the event class, a class may further be divided into sub-classes.
AS devices may use PI-5 packets to report events. Endpoints must support the termination of PI-5 packets. If an endpoint receives a PI-5 packet, the endpoint need not be able to process the packet and may legally and silently discard any PI-5 packets the endpoint receives, e.g., if it is unable to process the packets or has been configured to discard them. According to the AS Specification, endpoints must support the generation of PI-5 packets, and switches must generate PI-5 packets.
An AS device may include an event dispatch unit to receive events and generate the PI-5 packets. PI-5 packets may be directed to an event handler designated by a path stored at the AS device generating the event.
The capability structure access block 504 may use the event class/subclass code to access the Event capability structure in the AS device. The Event capability structure may include an Event Table, which may include at least one entry for each class of events the AS device is capable of generating.
The capability structure access block 504 may read information relating to the particular event from the entry in the Event Table corresponding to that event (block 604). The information in the Event Table entry for an event may indicate how the event should be handled. The capability structure access block 504 may decide between the following three options: block the event (block 606); handle the event locally (block 608); or generate a PI-5 packet to be transmitted to an agent over the AS fabric (block 610). Event entries indicating a PI-5 packet should be generated may include a destination for the packet and may also include information defining the event to the agent receiving the event. This information may be software generated and application specific.
If the capability structure access block 504 determines a PI-5 packet should be generated, the packet generator uses event data from the originating agent and event processing data from the capability structure access block to generate a PI-5 packet. PI-5 packets may contain a number of dwords (32-bit data words), e.g., two or six for short and long formats, respectively, in addition to the AS Route Header.
The TC-to-VC mapping module may map the generated PI-5 packet to a particular VC (block 612). The event dispatch unit may then send a request to the transmit queue resource in the AS transaction layer for transmission to the destination agent via the AS fabric (block 614).
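By way of illustration only, the three-way decision made from an Event Table entry (blocks 604 through 610) may be sketched as follows. The dictionary layout of the Event Table, the `Handling` enumeration, and the fields of the returned packet description are assumptions made for this sketch and do not reflect the PI-5 packet format defined in the AS Specification.

```python
from enum import Enum


class Handling(Enum):
    BLOCK = "block"        # block 606
    LOCAL = "local"        # block 608
    SEND_PI5 = "send_pi5"  # block 610


def dispatch_event(event_table, class_code, event_data):
    """Decide how to handle an event based on its Event Table entry."""
    entry = event_table[class_code]  # block 604: read the entry for this event
    handling = entry["handling"]
    if handling is Handling.BLOCK:
        return None
    if handling is Handling.LOCAL:
        return {"handled_locally": True, "event": event_data}
    # Hypothetical packet description: destination path and any
    # software-defined information come from the Event Table entry.
    return {
        "pi": 5,
        "path": entry["path"],
        "info": entry.get("info"),
        "event": event_data,
    }
```

A packet description returned by this sketch would then be mapped to a VC and queued for transmission, as described for blocks 612 and 614.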
Packet Arbitration for a VC
The event dispatch unit and other PI requesting agents may arbitrate for the transmit resources (e.g., transmit queue(s)) of a particular VC before the packets are sent out to the AS fabric. In an embodiment, a packet arbiter may provide low latency and fast data access for multiple PI requesting agents arbitrating for the transmit resources of a particular VC.
As shown in
The PI requesting agents may have a uniform interface with the arbiter 700, in this case, requesting agents for PI4, PI5, PI00, PIE (a generic engine for building PIs), and PI8. The arbiter interface may be expanded to incorporate additional vendor specific PIs or future ASI-SIG defined PIs.
The control unit 706 may select a requesting agent based on the arbitration scheme (block 804) and the type of packet for the request (block 806).
When the arbiter 700 asserts the trdy signal back to the PI requesting agent (block 808), the PI requesting agent must start transferring the packet (block 810). The information collected by the packet arbiter may include the dword enables, start of packet indication, end of packet indication, and the packet data. The control unit places the packet information on the correct bus interface based on the identified packet type (block 812). The packet may then be placed in the appropriate queue by a queue controller (block 814), e.g., state machine 708.
As shown in
The packet arbitration scheme described above may also be implemented for OVCs and MVCs. In this case, only the ordered-only queue and ordered-only states are utilized.
Packet Arbitration for Multiple VCs
As shown in
The packet arbiter 1200 may perform certain duties of a fabric manager by regulating packet traffic in order to allow high priority (TC7) packets to be transmitted first. Since TC7 packets can pass through any type of VC, the packet arbiter may also handle a second level of arbitration between multiple TC7 packets. These decisions may be made within one clock cycle, thereby reducing latency in the transmit path.
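By way of illustration only, the two levels of arbitration described above may be sketched in software. The function `select_request` is hypothetical, and the round-robin tie-break used at the second level is an assumption, since the text does not name the arbitration scheme. A hardware arbiter would make this decision combinationally rather than by iterating.

```python
def select_request(pending, num_agents, last_granted=-1):
    """Pick the next requesting agent.

    pending maps an agent index to the traffic class (0-7) of its request.
    TC7 requests are served first; ties are broken round-robin, starting
    with the agent after the one granted last (assumed scheme).
    """
    if not pending:
        return None
    tc7_agents = {agent for agent, tc in pending.items() if tc == 7}
    candidates = tc7_agents if tc7_agents else set(pending)
    for offset in range(1, num_agents + 1):
        agent = (last_granted + offset) % num_agents
        if agent in candidates:
            return agent
    return None
```

For example, with requests pending from agents 0 (TC3), 1 (TC7), and 2 (TC7) and agent 1 granted last, agent 2 is selected next.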
In the AS architecture, the TC7 traffic class is reserved for high priority traffic. TC7 packets may be mapped exclusively to a dedicated VC corresponding to the highest numbered active VC of the specified VC type. The packet arbiter may dynamically identify the VC number which corresponds to TC7 for each VC type. Since each BVC is capable of being converted into an OVC, this feature may allow the packet arbiter to handle variable BVC and OVC combinations without using additional hardware, further reducing latency.
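By way of illustration only, identifying the VC that carries TC7 for each VC type reduces to taking the highest numbered active VC of that type. The function `tc7_vc_by_type` and the per-type VC numbering used in the input are assumptions made for this sketch.

```python
def tc7_vc_by_type(active_vcs):
    """Map each VC type to the VC number that carries TC7.

    active_vcs maps a VC type name to the list of active VC numbers of
    that type; types with no active VCs are omitted from the result.
    """
    return {
        vc_type: max(numbers)
        for vc_type, numbers in active_vcs.items()
        if numbers
    }


if __name__ == "__main__":
    # One active BVC and two active OVCs, e.g., after a surplus BVC
    # has been transformed into an OVC during link training.
    print(tc7_vc_by_type({"BVC": [0], "OVC": [0, 1], "MVC": []}))
```

Because the result depends only on the set of active VCs, the same logic applies whatever combination of BVCs and OVCs is active after link training.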
A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, blocks in the flowchart may be skipped or performed out of order and still produce desirable results. Accordingly, other embodiments are within the scope of the following claims.