The present application is related to computer networks. More specifically, the present application is related to a system and method for a scalable switch fabric for use in computer networks.
Current storage area networks (“SANs”) are designed to carry block storage traffic over predominantly Fibre Channel (“FC”) standard medium and protocols. FC SANs are local networks that are equivalent to many common types of local area networks (“LANs”) used in standard data communications networks. Expansion of SANs is limited in that conventional FC SANs cannot be implemented over geographically distant locations. Conventional FC architecture is not suitable for most wide area networks (“WANs”) or metropolitan area network configurations. While TCP/IP and Ethernet may be used to implement block storage protocols over a WAN/LAN, these two protocols are not efficient for block storage applications. Accordingly, current SANs are limited to a single geographic location.
There exist several proposals for moving block storage traffic over SANs built on other networking medium and protocol technologies such as Gigabit Ethernet, ATM/SONET, Infiniband, and the like. Presently, to bridge or interconnect storage data traffic from SANs using one medium/protocol type to another SAN using an incompatible protocol/medium type requires devices and software that perform the necessary protocol/medium translations. These translation devices, hereinafter referred to as “translation bridges,” make the necessary translations between incompatible protocol/mediums in order to serve the host computers/servers and storage target devices (the “clients”). Interconnecting heterogeneous SANs that may be easily scaled upward using these translation bridges is very difficult because the translation bridges usually become the bottleneck in data transfer speed as the clients (servers and/or storage devices) grow in number. In addition, in a mixed protocol environment, as the number of different protocols increases, the complexity of the software installed on the translation bridges increases, which further impacts performance.
Other limitations of the size of SANs, in terms of storage capacity, are cost and manpower. In order to expand the storage capacity of a SAN, storage devices such as disk drives, controllers, Fibre Channel switches and hubs, and other hardware must be purchased, interconnected and made functionally operable together. Another major, if not primary, expense is the cost of managing a SAN. SAN management requires a great deal of manpower for maintenance and planning. For example, as storage capacity grows, issues such as determining server access to storage devices, backup strategy, data replication, data recovery, and other considerations become more complex.
It is desirable that next generation storage network switch systems may have ingress and egress ports that support different protocols and network media so that different types of host computer/servers and storage target devices may be attached directly to the switch system and start communicating with each other without translation overhead. In order to communicate between any two ports, the source and destination ports must be identifiable in both the source and destination protocol. For example, to send a message or frame from a FC port to a Gigabit Ethernet port, the destination port needs to appear as a FC port to the connected FC source, and the source port needs to appear as a Gigabit Ethernet port to the destination port.
What is needed is a storage network switching device that performs a multiplicity of functions and has a multiplicity of port types to allow it to connect to a variety of network types (e.g., FC, Gigabit Ethernet, etc.) in a SAN and/or LAN and/or WAN environment.
The invention overcomes the above-identified problems as well as other shortcomings and deficiencies of existing technologies by providing a storage network device that performs a multiplicity of functions and has a multiplicity of port types to allow it to connect to a variety of network types (e.g., FC, Gigabit Ethernet, etc.). A primary function of the invention is to act as a storage network switch where frames are switched from port to port. However, because of its architecture, the present invention has the ability to perform many additional functions that take advantage of its high performance, highly scalable, and highly programmable infrastructure. The switch architecture of the present invention is comprised of: 1) a Switch Fabric Subsystem (“SFS”); 2) input/output subsystems (“IOS”); 3) Application Subsystems (“AS”); and 4) System Control Subsystems (“SCS”).
The SFS is a protocol agnostic cell or packet switching infrastructure that provides the high performance and highly scalable interconnections between the IOSs and ASs. It provides primary data paths for network traffic being moved by the switch. The IOSs provide the actual port connectivity to the external network devices that use the switch to communicate with other external network devices. The IOSs are part of the data path and are responsible for making the high performance, low level decoding of ingress frames from the external ports; and switching/routing, identifying the destination IOS for the frame, and queuing the frame for transmission through the switch fabric. The IOSs process packets at the very lowest protocol levels (Data Link and Network Layer of the Open System Interconnect (“OSI”) Reference Model) where fast switching and routing decisions can be made. The ASs provide the platform for higher levels of processing of frames and data streams in the switch system. The ASs have more advanced programmability and functionality than the IOSs, but rely on the control and data information provided by the IOSs to maintain high performance packet throughput. Typical applications that can run on the ASs are caching, storage virtualization, file serving, and high level protocol conversion. The SCSs provide the overall management of the storage network switch. Most of the low level switching and routing protocol functions are executed on the SCSs. In addition, management access functions such as the standard simple network management protocol (“SNMP”) agent, web server, Telnet server, and the direct command line interface reside on the SCSs. The hardware and software executing on the SCSs are responsible for managing the other subsystems in the Storage Network Switch (“SNS”) 100.
The present invention is directed to a storage network switch, comprising: a SFS; an IOS coupled to the SFS; an AS coupled to the SFS; and a SCS coupled to the SFS, said IOS and the AS.
The present invention is also directed to a scalable SFS for computer networks, said system comprising: at least one IOS that is coupled to at least one computer network; a SFS that is coupled to at least one IOS; an AS coupled to the SFS; and a SCS that is coupled to the SFS, said IOS and the AS.
The present invention is also directed to a method for processing information on a storage network switch having a switch fabric subsystem, an input-output subsystem coupled to said switch fabric subsystem, an application subsystem coupled to said switch fabric subsystem, and a system control subsystem coupled to said switch fabric subsystem, said input-output subsystem and said application subsystem. The first step is to receive a signal from a host by an ingress line card of the input-output subsystem. Next, a destination address of a destination device to which the signal is to be sent is discerned by the network processor. Then, if the ingress line card can forward the signal to an egress line card of the input-output subsystem, the signal is sent to an egress line card that is coupled to the destination device (e.g., a mass storage device). Outbound signals (e.g., from a storage device to a host) work the same way, only in reverse. It should be noted that the destination address need not be for a particular device. For example, the destination address can be for a device attached to the particular line card, or it may be for a device attached to a different portion of the network.
If the network processor of the ingress line card cannot process the signal, then the signal is sent to a line card processor of the ingress line card. If the line card processor of the ingress line card can process the signal, it does so and sends the processed signal back to the network processor. If the line card processor of the ingress line card cannot process the signal, then the line card processor forwards the signal to a system card for processing.
A technical advantage of the present invention is the distributed mechanism for processing inbound signals. Each network processor has its own table look-up mechanism for handling transactions. If the network processor's own look-up table is stale or incomplete, then the processing is handed off to the line card processor of the ingress line card. The line card processor itself can have a more extensive look-up table than the individual network processors on the same line card. System cards have yet more resources (and more extensive look-up tables) than the individual line cards and can thus handle almost all transactions. However, the majority of signals can be processed by the network processors individually, without the aid of other devices or processes. This design reduces the resource/latency problems associated with centralized control of the handling process because the network processors need not incur resource contention with other processes/devices when performing routine actions.
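By way of illustration only, the following C-language sketch suggests how such a tiered look-up might be expressed in software: the network processor consults its own small table first, falls back to the line card processor, and finally to a system card. All identifiers, table contents, and sizes are assumptions made for the sketch and do not correspond to any particular hardware component described herein.

    #include <stddef.h>

    struct route { int valid; int egress_port; };

    /* Small per-network-processor table: the common case is resolved here,
     * without contention with other processes or devices. */
    static struct entry { unsigned addr; int port; } np_table[4] = {
        { 0x0A01, 3 }, { 0x0A02, 7 }, { 0x0B10, 1 }, { 0x0B11, 5 },
    };

    static struct route np_lookup(unsigned addr)
    {
        for (size_t i = 0; i < 4; i++)
            if (np_table[i].addr == addr)
                return (struct route){ 1, np_table[i].port };
        return (struct route){ 0, -1 };
    }

    /* Fallback paths; in a real system these would consult progressively more
     * extensive tables on the line card processor and the system card. */
    static struct route lcp_lookup(unsigned addr)         { (void)addr; return (struct route){ 0, -1 }; }
    static struct route system_card_lookup(unsigned addr) { (void)addr; return (struct route){ 1, 0 };  }

    int resolve_egress_port(unsigned dest_addr)
    {
        struct route r = np_lookup(dest_addr);            /* network processor table first */
        if (!r.valid) r = lcp_lookup(dest_addr);          /* then the line card processor  */
        if (!r.valid) r = system_card_lookup(dest_addr);  /* then the system card          */
        return r.valid ? r.egress_port : -1;              /* -1: destination unknown       */
    }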
Another technical advantage of the present invention is that updates to the look-up tables of the network processors and line card processors can be made from a centralized data source and can be made on a periodic or on an as-needed basis. For example, if a system card is receiving repeated requests for specific handling instances, it can issue updates to the look-up tables of the network processors and line card processors for the various line cards (both ingress and/or egress line cards). This decentralized design contributes significantly to the enhanced scalability and performance of the present invention.
A more complete understanding of the present disclosure and advantages thereof may be acquired by referring to the following description taken in conjunction with the accompanying drawings wherein:
a is a schematic block diagram of an exemplary embodiment of a portion of an AS of the present invention;
FIGS. 16a-16f are block diagrams of various embodiments of the control path of the present invention;
FIGS. 17a-17d are block diagrams further illustrating various details of the embodiments of the control paths according to
FIGS. 18a and 18b are flow charts illustrating the various embodiments of the control path and the data path according to the method of the present invention;
The present invention may be susceptible to various modifications and alternative forms. Specific exemplary embodiments of the present invention are shown by way of example in the drawings and are described herein in detail. It should be understood, however, that the description set forth herein of specific exemplary embodiments is not intended to limit the present invention to the particular forms disclosed. Rather, all modifications, alternatives, and equivalents falling within the spirit and scope of the invention as defined by the appended claims are intended to be covered.
The present invention is directed to a storage network device that performs a multiplicity of functions and has a multiplicity of port types to allow it to connect to a variety of network types (e.g., FC, Gigabit Ethernet, etc.). A primary function of the invention is to act as a storage network switch wherein frames are switched from port to port. However, because of its architecture, the present invention has the ability to perform many additional functions that take advantage of its high performance, highly scalable, and highly programmable infrastructure. The present invention is also useful for peer-to-peer networks as a simple switch where all attached devices are direct end devices, because the functions provided by the application blade of the architecture provide significant functionality even in the simple peer-to-peer network. Specifically, the cache feature alone provides a tremendous competitive advantage over other switch implementations.
For purposes of this disclosure, an application blade can be any electronic device that is able to perform one or more functions. For example, the application blade may be a peripheral card that is connected to a server or other device that is coupled to the switch of the present invention. Other examples of application blades include: remote computing devices that are coupled to the present invention by a network connection; or software processes running virtually on a single or multiprocessing system and/or single or multithreading processor; or electronic appliances with specific functionality; or the like.
The following description of the exemplary embodiments of the present invention contains a number of technical terms using abbreviations and/or acronyms which are defined herein and used hereinafter:
Referring now to the drawings, the details of an exemplary specific embodiment of the invention are schematically illustrated. Like elements in the drawings will be represented by like numbers, and similar elements will be represented by like numbers with a different lower case letter suffix.
The SFS 102 is designed to provide very high levels of performance and scalability. Switch fabrics are built using a series of switch fabric devices which may be highly integrated semiconductor devices in an integrated circuit die or chip. For highly scalable systems, crossbar switch devices are typically used, but other types (such as shared memory devices) can be used as well.
Each ingress port 204 and egress port 206 of a switch fabric device 202 is typically implemented as a high-speed differential signal port. Some switch fabric devices 202 may have a differential transmitter and receiver integrated into the device and drive/receive these differential signals directly. Some switch fabric devices 202 may use a parallel 5-bit or 10-bit interface and an external serializer/deserializer (“SERDES”) to convert the parallel signals to differential signals. The raw transfer rate of these ports varies, but the state of the art today is 1.25 Gb/s to 3.125 Gb/s. Since one ingress port 204 and one egress port 206 are usually paired together, transceivers are used that have a receive (ingress) differential pair of signals and an output (egress) differential pair of signals. This differential signal technology may be the same technology typically used for Gigabit Ethernet, FC, or other high speed physical communications interfaces presently used in industry.
In addition to the data path ingress port(s) 204 and the data path egress port(s) 206, each switch fabric device 202 typically has one or more Switch Fabric Path Control (“SFPC”) ports 208 that may be used to configure the switch fabric device 202. The SCSs 108 typically are used to drive the control ports if they exist. The configuration logic for the switch fabric device 202 may comprise enabling and disabling ports, reading error or statistics registers from the switch fabric device 202, and setting parameters for the behavior of the switch fabric device 202 upon receiving specific frame or cell types from the ingress ports 204 (e.g., multicast or broadcast configuration). Some of the switch fabric devices 202 may allow the control function to occur in-band via the ingress/egress port pairs (204, 206) as well by specifying certain specific control messages that may be sent and that are not typically forwarded by the switch fabric device 202 in normal operation. These control messages may be processed internally by the switch fabric device 202, after which a response message by the switch fabric device 202 may be sent via the egress port 206 paired with the ingress port 204 from which the control message was received.
Each switch fabric device 202 has a multiplicity of ingress ports 204 where frames or cells are received from an ingress IOSs 104 and/or ASs 106. In addition, these ingress ports 204 can also receive routing information from the IOSs 104 and/or ASs 106 which is used by the switch fabric devices 202 for dynamically allocating ingress to egress switch paths/resources internal to the switch fabric devices 202. For example, when an IOS 104 needs to forward a frame or cell to another subsystem (IOS 104 or AS 106), it must send a request to the attached switch fabric device(s) specifying the desired egress port 206 to which it needs to send the frame or cell. The switch fabric device 202 may check to see if the requested egress port 206 is currently busy. Once the requested egress port 206 is available for use, the switch fabric device 202 may grant the request by sending a grant message back to the requesting IOS 104 via the egress port 206 that is paired with the ingress port 204 from which it received the original request message. Upon receiving the grant message, the IOS 104 may send the frame or cell to the switch fabric device 202 via the ingress port 204 in the switch fabric device 202 and the switch fabric device 202 may then forward the frame or cell to the reserved egress port 206. After the frame or cell has been forwarded, the egress port 206 is released for use by the next requesting ingress port 204.
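By way of illustration only, the following C-language sketch suggests the grant decision that such a request/grant exchange implies on the switch fabric device side; the port count and function names are assumptions for the sketch.

    #define NUM_EGRESS_PORTS 32

    static int egress_busy[NUM_EGRESS_PORTS];    /* nonzero while a path is reserved */

    /* Returns 1 and reserves the ingress-to-egress path if the requested egress
     * port is free; returns 0 so that the requesting IOS must wait and retry. */
    int handle_path_request(int egress_port)
    {
        if (egress_port < 0 || egress_port >= NUM_EGRESS_PORTS)
            return 0;
        if (egress_busy[egress_port])
            return 0;                            /* requested egress port is busy     */
        egress_busy[egress_port] = 1;            /* reserve the path; grant is sent   */
        return 1;
    }

    /* Called after the granted frame or cell has been forwarded out of the
     * egress port, releasing it for the next requesting ingress port. */
    void release_path(int egress_port)
    {
        egress_busy[egress_port] = 0;
    }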
Typically, switch fabric devices 202 are designed to be scalable so that small switch configurations can be built using one or more integrated circuits comprising a plurality of switch fabric devices 202. As additional port count, bandwidth or capacity is needed, several switch fabric devices 202 may be used in parallel to add capacity and redundancy to the overall SFS 102.
The IOS 104 is designed to provide the physical port connectivity to the Storage Network Switch 100. The IOS 104 provides the high performance data path 110 between the external switch ports and the SFS 102. The IOS 104 preferably performs a multiplexing/demultiplexing or port aggregation function for the Storage Network Switch 100. In other words, the external switch ports may be lower bandwidth ports than the internal interface to the SFS 102. For example, typical external port speeds are 1 Gb/s for Gigabit Ethernet or FC. A 2 Gb/s version of FC is also available. Typical bandwidths available for the switch interface to the SFS 102 are 12.5 to 25 Gb/s or greater. Therefore, an IOS 104 typically implements a plurality of external I/O ports 540 and aggregates the ingress traffic and de-multiplexes the egress traffic between them and the high bandwidth SFI 304 to the SFS 102.
Referring to
The receive picoprocessors 632 and transmit picoprocessors 634 within the NP 512 may have special instructions and coprocessors included to help provide high performance parsing and classification of the frames or cells being transmitted or received on the external I/O ports 540 that are coupled to the NPs 512. When a frame is received by an external port 540 of the NP 512, it is preferably processed by one of a number of picoprocessors dedicated to processing incoming frames on that port. There may be, for example, four receive picoprocessors 632 allocated to handling receive frames from a particular external port 540. As each frame is received from the network attached to that I/O port 540, the NP 512 may assign the processing of that frame to one of the allocated receive picoprocessors 632 in a round robin fashion. Likewise, when a frame is forwarded to the NP 512 from the SFS 102 to be transmitted out one of its external I/O ports 540, it may receive the frame from the fabric through a SFC 642, wherein the frame may be processed in a round robin fashion by one of the transmit picoprocessors 634 allocated for handling transmit traffic for that external I/O port 540. By allowing multiple picoprocessors to process frames for a single I/O port 540, the picoprocessors 632, 634 have more time to classify a frame and to make the proper routing decisions.
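By way of illustration only, the round robin assignment described above might be expressed as the following C-language sketch; the count of four picoprocessors per port follows the example in the text, and the names are assumptions.

    #define PICO_PER_PORT 4

    struct port_dispatch { int next; };              /* per-port round robin cursor */

    /* Returns the index (0..PICO_PER_PORT-1) of the picoprocessor that should
     * handle the next frame received on (or transmitted out of) this port. */
    int pick_picoprocessor(struct port_dispatch *p)
    {
        int idx = p->next;
        p->next = (p->next + 1) % PICO_PER_PORT;     /* advance for the next frame */
        return idx;
    }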
The receive picoprocessors 632 may extract various fields in the received frames and make use of the TLC(s) 636 in the NP 512 to use the extracted fields to look up the route information for that received frame. For example, the Ethernet address or FC address of an incoming frame can be extracted and sent to a TLC 636 to search a route table for the destination (egress) port identifier to which to send the frame. The receive picoprocessors 632 may then use the results of the table look-up to send the frame to the QMC 638 for queuing either to the SFC 642 for forwarding through the switch fabric in the SFS 102, or to the XCP 644 if it is a control frame that needs to be processed by a particular process in the IOS 104. This low level frame parsing and table look-up processing enables the receive picoprocessors 632 to perform fairly sophisticated frame route processing while maintaining the ability to process frames as they are received at the fastest possible line rates. Not only can the receive/transmit picoprocessors 632, 634 route frames based on address fields found in the frame headers, but they can also make routing decisions on any of the data in the frame. This allows the software programmed in the picoprocessors 632, 634 to direct and control sophisticated frame routing based on frame types, contents of the frame, or control fields in the frame. In addition, the picoprocessors 632, 634 are also able to modify frames as they are sent or received from the external I/O ports 540 in order to attach the appropriate route information into the frame or to add control information as the frame is going to the SCS 108 or an AS 106.
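By way of illustration only, the following C-language sketch suggests the receive-path decision described above, in which the extracted destination address is looked up and the frame is queued either toward the switch fabric or toward the control processor; the address range, table layout, and names are assumptions for the sketch.

    enum frame_dest { TO_FABRIC, TO_CONTROL };

    struct rx_decision { enum frame_dest dest; int egress_port; };

    static const int route_table[16] = {             /* tiny stand-in for table look-up memory */
        3, 7, 1, 5, 2, 0, 4, 6, 3, 7, 1, 5, 2, 0, 4, 6
    };

    struct rx_decision classify_frame(unsigned dest_addr)
    {
        struct rx_decision d;
        if (dest_addr >= 0xFFFC00u) {                /* illustrative reserved/control range */
            d.dest = TO_CONTROL;                     /* control frame: queue toward the XCP */
            d.egress_port = -1;
        } else {
            d.dest = TO_FABRIC;                      /* data frame: queue toward the SFC    */
            d.egress_port = route_table[dest_addr & 0xFu];
        }
        return d;
    }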
The transmit picoprocessors 634 accept frames forwarded from the SFC 642 that were sent to the NP 512 from the SFS 102. The transmit picoprocessors 634 are preferably allocated in groups of four to handle sending frames out a particular egress I/O port 540. The transmit picoprocessors 634 may process the frames received from the SFC 642 in a round robin fashion similar to the manner in which the receive picoprocessors 632 handle ingress frames from the external I/O ports 540. The transmit picoprocessors 634 may perform Cyclic Redundancy Check (“CRC”) checksum calculations on the frames and modify other fields necessary for the proper transmission of the frame to the associated external I/O port 540. Finally, when the transmit picoprocessors 634 have completed their processing of the egress frame, they may queue the frame to the I/O port interface logic 656 for actual transmission on the external I/O port 540.
The TLC 636, as mentioned herein, may be a common coprocessor that is shared by the picoprocessors 632, 634 within the NP 512. The picoprocessors 632, 634 send look-up requests to the TLC 636 whenever they have information that needs to be looked up in a table maintained in table look-up memory (not illustrated). The TLC 636 does not manage the tables in memory, it simply performs the look-up operation itself. The XCP 644 or the IOS processor 518 may initialize, update, and/or maintain the tables stored in table look-up memory. The table look-up memory is typically very fast memory attached to the NP 512 via a high performance bus dedicated to connecting the look-up memory to the NP 512.
The BMC 640 is a common coprocessor that manages the buffer memory (not illustrated) used to temporarily store frame or cell data as the NP 512 processes the ingress and egress frames. As the picoprocessors 632, 634 process ingress frames received from the external I/O ports 540, the BMC 640 takes the frames and stores them into buffer memory. After the frames have been received, and the receive picoprocessors 632 have completed the frame routing look-ups, the receive picoprocessors 632 create a descriptor that points to the buffer memory used to receive the frame and pass it to the QMC 638 to be queued to the appropriate coprocessor for forwarding to the XCP 644, or to the SFC 642. The software running on the picoprocessors 632, 634 allocates the buffer space used by the BMC 640 for receiving the frame/cell data. After frame reception is complete, the software then allocates the queue descriptor from the QMC 638, and initializes the descriptor with the buffer memory location, the destination coprocessor information, and the final destination information (like final I/O port location). Then the picoprocessor 632 or 634 queues it back to the QMC 638 to indicate the data is ready to be processed by the next coprocessor (e.g., the XCP 644 or the SFC 642). Likewise, when a frame is forwarded to the NP 512 for transmission to one of the external I/O ports 540, the BMC 640 may move the frame data into buffer memory while the SFC 642 determines to which port to send the frame. After the frame has been completely moved into buffer memory, the SFC 642 may build a descriptor and queue it to the QMC 638 to pass it to one of the transmit picoprocessors 634 for transmission of the frame through an egress I/O port 540.
The QMC 638 is a common coprocessor that is used to manage a set of descriptors in a dedicated descriptor memory attached to the NP 512. The descriptors provide information about the contents and length of the cells or frames in the buffer memory of the NP 512. The QMC 638 may manage all queues of the NP 512 in its attached queue memory. The picoprocessors and coprocessors of the NP may allocate and initialize the descriptors. After determining the proper destination device within the NP 512 (i.e., XCP 644, SFC 642, picoprocessors 634, etc.) the picoprocessor or coprocessor may send the descriptor to the QMC 638 indicating the target queue to place the descriptor. Likewise, when a picoprocessor or coprocessor is ready to process a frame from a queue, it may send a request to the QMC 638 to fetch the next descriptor that belongs to its queue. When the use of the descriptor is complete, the owning picoprocessor or coprocessor may send a request to the QMC 638 to free the descriptor for later reuse.
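By way of illustration only, the descriptor life cycle described above (allocate, initialize, queue, and later free for reuse) might be sketched in C as follows; the field names, pool size, and layout are assumptions and are not intended to describe the actual descriptor memory.

    #include <string.h>

    struct descriptor {
        unsigned buf_addr;       /* location of the frame/cell data in buffer memory   */
        unsigned length;         /* length of the data held in the buffer              */
        int      target_queue;   /* destination coprocessor queue (e.g., SFC or XCP)   */
        int      dest_port;      /* final destination I/O port                         */
        int      in_use;
    };

    #define NUM_DESC 64
    static struct descriptor desc_pool[NUM_DESC];

    /* Allocate a free descriptor; the caller initializes it and sends it to the
     * queue management coprocessor, naming the target queue. */
    struct descriptor *desc_alloc(void)
    {
        for (int i = 0; i < NUM_DESC; i++)
            if (!desc_pool[i].in_use) {
                memset(&desc_pool[i], 0, sizeof desc_pool[i]);
                desc_pool[i].in_use = 1;
                return &desc_pool[i];
            }
        return NULL;             /* no free descriptors available */
    }

    /* Free a descriptor once its owner has finished processing the frame. */
    void desc_free(struct descriptor *d)
    {
        d->in_use = 0;           /* returned to the pool for later reuse */
    }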
The SFC 642 may be a common coprocessor that is responsible for sending and receiving frames or cells to/from the fabric interface of the NP 512. Typically, the switch fabric bus interface 654 is a high bandwidth bus, e.g., UTOPIA-3, CSIX, SPI-4, etc. The switch fabric bus interface 654 is used to send frames or cells to other NPs 512 in the system via the SFS 102. For frame based protocols like FC or Ethernet, the SFC 642 can segment the frames into cells for transmission to the switch fabric if the switch fabric devices 202 require cell based operations. Some switch fabric devices allow full frame based operation and in these cases, the SFC 642 merely manages the data flow to and from the switch fabric and the internal buffer memory (not illustrated). Many switch fabric devices are designed to switch cells much more efficiently than entire frames; therefore, most NPs 512 support frame to cell segmentation and reassembly. Likewise, the SFC 642 may receive cells from the switch fabric bus interface (“SFBI”) 654 and reassemble them into a frame as it moves the frame data into the buffer memory. To perform this cell segmentation and reassembly process, the SFC 642 builds cells that have a cell header and a cell payload (data). The SFC 642 creates the header based on information provided to it by the picoprocessor or coprocessor that built the descriptor that was queued to the SFC 642. The cell header may contain an indication if the cell is the start of a frame, middle of a frame, or the last cell in a frame. In addition, the cell header may contain information regarding the target IOS 104, NP 512, and I/O port 540 to which the frame is to be forwarded.
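By way of illustration only, the segmentation of a frame into header-plus-payload cells might be sketched in C as follows; the 64-byte payload size and the header fields shown are assumptions consistent with the description above rather than an actual cell format.

    #include <string.h>

    #define CELL_PAYLOAD 64

    struct cell {
        unsigned int  sof : 1;       /* start of frame                      */
        unsigned int  eof : 1;       /* last cell of the frame              */
        unsigned char dest_ios;      /* target IOS                          */
        unsigned char dest_np;       /* target NP on that IOS               */
        unsigned char dest_port;     /* target external I/O port            */
        unsigned char payload[CELL_PAYLOAD];
    };

    /* Segments one frame into cells; emit() stands in for queuing each cell to
     * the switch fabric bus interface. */
    void segment_frame(const unsigned char *frame, unsigned len,
                       unsigned char ios, unsigned char np, unsigned char port,
                       void (*emit)(const struct cell *))
    {
        for (unsigned off = 0; off < len; off += CELL_PAYLOAD) {
            struct cell c = { 0 };
            unsigned chunk = (len - off < CELL_PAYLOAD) ? (len - off) : CELL_PAYLOAD;
            c.sof = (off == 0);
            c.eof = (off + chunk >= len);
            c.dest_ios = ios;  c.dest_np = np;  c.dest_port = port;
            memcpy(c.payload, frame + off, chunk);
            emit(&c);                /* cell header plus (possibly short) payload */
        }
    }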
Preferably, the SFC 642 has the ability to support flow control mechanisms in cooperation with the attached switch fabric devices, as well as to support flow control mechanisms between itself and other NPs 512 attached to the SFS 102 (either on an associated IOS 104 or another IOS 104 installed in the storage network switch). Local flow control signals are employed between the SFC 642 and the PAB 514 in the exemplary embodiment of the present invention. These local flow control signals can be specific hardware signals on the bus (e.g., UTOPIA-3), or they may be a bit in every cell header transmitted and received from an attached device, e.g., over the CSIX bus 530. In the latter case, where flow control information is passed in the cell headers, flow control information can still be passed even if no data is transmitted across the bus. If the devices have no frame or cell data to send, they will send an idle cell that indicates no data; however, the control information in the cell header is still valid. In this manner, flow control mechanisms work even if no data frames or cells are being sent on the bus. Whenever the SFC 642 detects that it does not have enough buffer memory to continue to receive frames/cells from the PAB 514, it may assert its local flow control signal to the PAB 514. The PAB 514 detects the flow control signal and then stops sending cells to the SFC 642 until enough buffers have been made available in the buffer memory to continue to receive frames/cells from the PAB 514. Likewise, if the PAB 514 does not have enough buffer space to accept frames/cells from the SFC 642, it may assert its flow control signal indicating a congestion condition and the SFC 642 may discontinue sending frames/cells to the PAB 514 until the PAB 514 releases the flow control signal.
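By way of illustration only, the cell-header form of flow control described above might be sketched in C as follows; the bit layout and names are assumptions, and an idle cell is modeled simply as a header with no payload.

    struct fc_cell_hdr {
        unsigned int idle      : 1;   /* 1 = idle cell, no data, header still valid   */
        unsigned int congested : 1;   /* 1 = sender's receive buffers are nearly full */
    };

    struct link_state { int peer_congested; int local_buffers_low; };

    /* Build the header for the next cell to transmit; an idle cell is sent when
     * no frame or cell data is queued so that flow control state still flows. */
    struct fc_cell_hdr next_tx_header(const struct link_state *ls, int have_data)
    {
        struct fc_cell_hdr h = { 0 };
        h.idle = !have_data;
        h.congested = ls->local_buffers_low ? 1 : 0;   /* ask the peer to pause */
        return h;
    }

    /* On receive, record the peer's congestion indication; data cells are not
     * sent while peer_congested remains set. */
    void note_rx_header(struct link_state *ls, struct fc_cell_hdr h)
    {
        ls->peer_congested = h.congested;
    }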
The XCP 644 is a common coprocessor that is responsible for handling generic device functions like initialization, management, and control. Typically, the XCP 644 is the first processor initialized in the NP 512 and is responsible for initializing the remaining coprocessors (e.g., 636, 638, 640 and 642) and picoprocessors (e.g., 632 and 634) in the NP 512. In addition, the XCP 644 may provide the interface function to external control processors in the Storage Network Switch 100. The physical interface between the XCP 644 and external processors is preferably a PCI bus 650; however, other bus interfaces are contemplated and within the scope of the present invention. The external IOS Processor 518 may communicate through the PCI bus 650 to send commands and control frames to the XCP 644 for transmission through the external I/O ports 540 in the NPs 512. Likewise, control frames received by the external I/O ports 540 are forwarded to the XCP 644 for relay to an IOS Processor 518 via the PCI bus 650. The message/frame passing mechanism may preferably be implemented as a bus master type interface, or it may be implemented as a slave interface function in the XCP 644. The IOS Processor 518 and the XCP 644 may also work together to manage the routing tables that are maintained in the TLC's 636 memory associated with the XCP 644. As the routing protocols in the IOS Processor 518 determine the latest and best routing paths through the network, the updated routing table entries are sent to the NP's 512 XCP 644 (via the PCI bus 650) for placement into the table look-up memory (not illustrated).
The XCP 644 may also control the port state of each port controlled by the picoprocessors 632, 634. When the exemplary embodiment of the present invention wants to shut down or disable ports, an IOS Processor 518 may send commands to the XCP 644 (via the PCI bus 650) to disable, control, change modes, or enable the I/O ports 540 as necessary. The XCP 644 communicates the control information to the picoprocessors 632, 634 that control the ports 540 or may directly control the port states thereof. In addition, the XCP 644 is preferably responsible for maintaining statistics about the overall operation of the NPs 512. The NP's 512 statistics may include, but are not limited to, transmission and reception statistics, error counts, mode change logs, control frame statistics, and current state information. The XCP 644 may report these statistics to the IOS Processor 518 for reporting to external network management systems.
Referring to
If the PAB 514 detects that its own CSIX bus 530 receive buffers are getting full, it may assert a congestion indication to the SFI 304. Likewise, if the PAB 514 detects that its own receive buffers for one of the UTOPIA-3 busses 532 are almost full, it may assert a congestion condition to the appropriate NP 512. When the buffers drain sufficiently, the PAB 514 may deassert its congestion indication to the appropriate bus (530 or 532) in order to continue to receive cells or frames from the devices attached to the other end of the respective bus.
The SFI 304 provides a communication path between the high bandwidth CSIX bus 530 and the SFS 102. The SFI 304 may interface to the switch fabric via a multiplicity of high speed serial link pairs (one transmit and one receive link per pair). For smaller switch fabric configurations, several of these link pairs may be connected to each switch fabric device to increase throughput by load balancing the traffic to the switch fabric devices. In larger switch fabric configurations, one link pair preferably may be connected to each switch fabric device. In this manner, by increasing the number of switch fabric devices in the SFS 102 that are operating in parallel, one can increase the number of IOS 104/SFIs 304 that may be attached to the SFS 102. This expansion is illustrated in
The SFI 304 receives cells or frames from the CSIX bus 530 and queues them for transmission to the switch fabric. Once a cell or frame has been queued, the SFI 304 may request a path through the SFS 102 by either using a dedicated control bus to the SFS 102 or, preferably, by sending a special request cell via one of the serial links to the SFS 102. The SFS 102 may acknowledge the request when it has set up a path for the requesting SFI 304 through the switch fabric to the target IOS 104 or AS 106 (and the corresponding destination SFI 304 on that IOS 104 or AS 106).
Likewise, the SFI 304 receives frames or cells through any of its serial links 110 from the SFS 102. The SFI 304 may receive cells or frames simultaneously from all attached serial links 542. One of the primary functions of the SFI 304 is to perform buffering and synchronization of the received cells if cells are used to pass data. This buffering and synchronization operation is necessary in order to ensure that cells do not arrive and get forwarded on the CSIX bus 530 out of order from their original transmission order (if the cells are part of the same frame received from the same IOS). These operations can rely on information in the cell headers to provide unique identification (“ID”) to ensure proper ordering, or the system can rely on strict timing relationships between the serial links (e.g., maximum link lengths, maximum signal skew between links, etc.) and buffer cells in constant time intervals. Once cell order is assured, the SFI 304 queues the cells for transmission to the CSIX bus 530.
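By way of illustration only, the cell-header identification approach to ordering might be expressed as the following C-language sketch, which forwards a cell only when it is the next expected cell from its source; the sequence-number field and source count are assumptions for the sketch.

    #define MAX_SOURCES 16

    struct reorder_state { unsigned expected_seq[MAX_SOURCES]; };

    /* Returns 1 if the cell is the next expected cell from its source and may be
     * forwarded to the CSIX bus immediately; returns 0 if it must be buffered
     * until the earlier cells from that source have been forwarded. */
    int cell_in_order(struct reorder_state *rs, unsigned src, unsigned seq)
    {
        if (src >= MAX_SOURCES)
            return 0;
        if (seq != rs->expected_seq[src])
            return 0;                          /* arrived early: hold in buffer       */
        rs->expected_seq[src]++;               /* next expected cell from this source */
        return 1;
    }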
Another important function of the SFI 304 is the detection of link errors and balancing the load of cells or frames transmitted to the SFS 102. If a link fault is detected by the SFI 304, it disables transmissions to that link and does not send arbitration requests, cells, or frames on that link until it detects that the link state is restored. In addition, the SFI 304 can provide basic quality of service guarantees by queuing and transmitting cells or frames based on priorities. The priorities can be requested by the NPs 512 by placing the requested priority in the cell headers, or by basing the priority on what external I/O ports 540 or NPs 512 the cells or frames are sent from or destined to. The SFI 304 implements a multiplicity of buffers, either internally or externally via an external memory. This memory is organized into a set of queues of various priorities that can be used to queue ingress as well as egress data traffic through the SFI 304.
The AS, generally represented by the element numeral 106, provides for the processing of upper layer protocols and applications that the IOS 104 may not be able to handle. The AS 106 has a high performance interface to the switch fabric that implements high performance general purpose processors to provide high levels of functionality for the Storage Network Switch 100. The AS 106 works with the NPs 512 associated with the IOS 104 to help accelerate the processing of frames. The NPs 512 may perform sophisticated frame parsing and routing look-ups. When the NPs 512 determine that a frame needs higher levels of processing, they can either send the frames directly to the AS 106, or they can append additional control information into the frames to help the processor(s) in the AS 106. This technique improves the processing speed of the AS 106 by allowing it to avoid some of the time-consuming low level processing that the NPs 512 have already performed for the AS 106. Therefore, this exemplary embodiment allows the AS 106 to be optimized for upper layer protocol and application processing.
In addition, there may be several different methods of physically and mechanically packaging the application logic section 721 and the IOS logic section 722 in the embodiments of the invention. One method may place both the application logic section 721 and the IOS logic section 722 on the same printed circuit board/assembly as one physically integrated unit. In this case, the I/O ports of each section may be connected together via printed circuit board traces. Another exemplary embodiment of this invention might package the application logic section 721 on a single printed circuit board/assembly and use an existing IOS 104, on a separate printed circuit board/assembly, for the IOS logic section. In this case, the I/O ports on the two sections may be connected together via physical copper or optical cabling.
The remainder of the application logic section 721 is similar to logic illustrated in
The exemplary embodiment of the invention depicted in
A preferred exemplary embodiment of the AS 106a does not require NPs because the FC controller and NP combination may be replaced by a custom FIOC 818 that directly attaches the data bus 532 to the PAB 514 which is coupled to the switch fabric 112 through the SFI 304 and data bus 530, as illustrated in
The PAB 514 of
The SFI 304 of
The FC Controller 718 may be an off-the-shelf device that is used to provide FC I/O services on an I/O bus (e.g., PCI). Typically, these devices may implement the FC-0 (physical), FC-1 (transmission protocol), FC-2 (signaling protocols), FC-3 (services), and FC-4 (upper layer protocols) protocols required to implement an end node in a FC network (as defined by the ANSI T11 series of standards). The basic functionality of most FC controllers 718 is to provide a SCSI over FC function that allows SCSI formatted block storage access to be performed over a FC network. The FC controller 718 may typically operate in initiator mode, target mode, or both modes of operation. An initiator is a node that initiates a storage access command like READ, WRITE, or REWIND and expects the operation to be carried out by a target on the storage network. A target usually contains some form of storage (disk, tape, RAM disk, etc.) and accepts storage access commands and performs the operation on its local storage media. A target may respond to the initiator with the status of the storage access command.
In the exemplary embodiment of the AS 106 depicted in
The Gigabit Ethernet Controller 718 may be an off-the-shelf device that is used to provide Ethernet I/O services on an I/O bus (e.g., PCI). Typically, these devices provide the physical and Media Access Control (“MAC”) protocols as defined by the IEEE 802 series of standards. The basic functionality of a Gigabit Ethernet controller 718 is quite simple: it merely provides a set of transmit and receive frame services for sending and receiving Ethernet frames to and from an Ethernet network. These frames are generated and received via the I/O bus, e.g., PCI bus and the like.
The FIOC 818, illustrated in
The FIOC 818 may also have access to the routing tables used in routing frames and cells properly though the switch fabric. Normally, on an IOS, the IOS processor 518 accepts route table updates from the SCS 108 and loads these updates into the NP's table look-up memory. In the FIOC 818, the application processor accepts the route table updates from the SCS 108 and loads them into the proper locations in the FIOC 818 so that it knows how to forward frames/cells to destination nodes on the FC or Ethernet networks.
The ANB 712 is commonly called a North Bridge and is a combination application memory controller and I/O bus bridge. The ANB 712 device preferably provides: 1) memory interface, memory control, and memory access functions for memory requests from the application processor as well as from the devices on the I/O buses (e.g., Gigabit Ethernet, FC, and FIOC); 2) a processor interface to connect to one or more application or general purpose processors; and 3) one or more I/O bus interfaces to allow the processors to access the I/O devices as well as to allow I/O devices access to the memory attached to the memory interface. The memory attached to the memory interface of the ANB 712 may be the main memory 720 of the application processors 710. When the application processors 710 are initialized, they can initialize the ANB's 712 configuration and initialize the main memory 720. In addition, the APs 710 can access the I/O devices on the I/O busses through the ANB 712. When the system is initializing, software executing on the APs 710 may initialize the I/O devices to be in such a state as to allow the I/O devices to perform I/O operations at the request of the AP(s) 710. The ANB 712 provides high bandwidth access to memory for the I/O devices attached to the I/O bus(ses). Typically, in high performance AP architectures, the ANB 712 provides access to the I/O bus(ses) via one or more HSBs 715. The conversion of the HSB 715 signals to the I/O bus 714 signals is performed by the HSB/PCI Bridge 716. The software executing on the APs 710 manages the use of the memory to allow ordered data accesses and communication of data through memory based data structures utilized by both the AP(s) 710 and the I/O devices.
The ANB 712 device characteristics preferably are high memory capacity (4 GB or greater), high memory bandwidth (greater than 4 GB/s), and high I/O bandwidth across the I/O busses (greater than 4 GB/s aggregate). An advantage of this ANB 712 architecture is the memory capacity and the bandwidth available for moving data into and out of memory from both the application processor(s) and the I/O devices on the I/O busses.
The HSB/PCI Bridge 716 is a device that provides connectivity between one of the HSBs 715 from the ANB 712 and one or more I/O busses, most typically PCI bus 714, although other types of busses may be used. In a preferred exemplary embodiment shown in
The Application or General Purpose Processor(s) 710 execute software that provides the high level services needed in the storage network switch. These processors are preferably off-the-shelf high-performance CISC or RISC processors such as those used in servers and workstations. Preferably the AS 106 partitions the frame level processing from the high level service software in the system. Frames received on the external ports of the IOS 104 are parsed, and a determination is made at the low level protocol layers whether or not to forward the frame to an appropriate AS 106. A good deal of the information that is parsed in the frame can be inserted into the frames so that once they are forwarded to the AS 106, the application processors can skip the low level processing and immediately process the high level protocols, thus drastically improving performance relative to application appliances or servers that are external to a switch. An advantage of the present invention is that the application appliance may be directly integrated into the storage network switch, which results in tight functional coupling with the NPs in the IOS 104.
The AP(s) 710, ANB 712, HSBs 715, HSB/PCI Bridge 716, I/O busses 714, and I/O controllers (FC and Gigabit Ethernet) 718 may be used to develop the AS 106 in a manner that allows high performance through tight integration of the low level protocol processing done in the IOS 104 with the high level processing done by the AS 106 specific components. The AS 106 specific components may provide the same functionality as an external appliance, workstation, or server. The AP(s) 710 may execute on a variety of operating systems including Linux, embedded Windows, manufactured by the Microsoft Corporation of Redmond, Wash., and off-the-shelf real-time operating systems like Wind River's VxWorks and the like.
The SCS 108 (see
The SCS 108 may be implemented using a pair of System Control Cards (“SCC”) 900 that operate in an active/standby pair.
A feature of the present invention is that in the event of a software or hardware failure on the active SCC 900, the software/hardware may initiate a fail-over event in the fault tolerant logic of the standby SCC 900, such that the active SCC 900 immediately releases active control and the standby SCC 900 immediately assumes the active status. In this event, the software in the newly controlling SCC 900 may activate all its tasks, processes and interfaces in order to begin processing management requests and to continue the operational control of the switch system.
The System Control Processor (“SCP”) 910 may be an off-the-shelf general purpose microprocessor that provides enough CPU computing power and data throughput to perform the system management and operation control of the storage network switch 100 of the present invention.
Referring to
The SCC 900 may use 100 Mb/s Ethernet controllers 916 for two different functions in the storage network switch: 1) a number of Ethernet controllers 916 may be used for attaching directly to external ports on the SCC 900 in order to provide network-based management access 929 to the storage network switch 100; and 2) a number of Ethernet controllers 916 may be used to communicate with the other SCCs 900, the various IOS 104 in the switch, and the AS 106 in the switch.
The Management Ethernet Controllers 928 may be used by all users to manage the SNS via standard management protocols such as TCP/IP, SNMP, TELNET, and TFTP. Users may use network management applications or tools to access, monitor, and control the SNS through these management Ethernet ports.
IPC Ethernet Controllers 916 may be used for internal interprocessor communications. The IPC network is a switched 100 Mb/s Ethernet network that is used to pass control information among the various subsystems in the SNS. All IOS, AS, and SCC have an IPC Ethernet port to communicate control and status information throughout the system. In a preferred exemplary embodiment of this invention, there are two IPC Ethernet controllers 916, one that connects directly to the IPC network implemented via the local Ethernet Switch 920, and one that is connected to the Ethernet Switch 920 in the other SCC 900 via an inter-SCC IPC link 926. Likewise, the other SCC 900 has an IPC Ethernet connection to its own Ethernet Switch 920 and another IPC Ethernet controller 916 that is connected to the local SCC's 900 Ethernet Switch 920 via an inter-SCC link 926. In this way, each SCC 900 has two redundant communications paths for interprocessor communication to the other SCC 900.
The IPC Ethernet Switch 920 implements an internal 100 Mb/s Ethernet network that is used by the various subsystems in the SNS to communicate status and control information. There are redundant IPC networks implemented in the storage network switch because each SCC 900 implements an IPC Ethernet Switch 920 device. Every SCC 900, IOS 104, and AS 106 have IPC Ethernet connections to both IPC Ethernet networks via Ethernet links 922, 926 to each SCC's 900 IPC Ethernet Switch 920. Typically, each subsystem may normally communicate through the IPC Ethernet Switch 920 on the active SCC 900. If a subsystem can no longer communicate with other subsystems, it may attempt communications through the IPC Ethernet Switch 920 on the standby SCC 900.
The RCL 918 on the SCC 900 performs the arbitration between the SCCs 900 to determine which may achieve active status. This arbitration is carried out via redundancy control signals 924 that are connected between the two SCCs 900. The logic is designed such that only one of the SCCs 900 can be active at any given time. The RCL 918 has both software inputs and hardware inputs to determine if the local SCC 900 may try to arbitrate for active status. The software inputs are registers that allow the software to indicate that it is not ready to achieve active controller status. If these registers are set to “not ready”, then the arbitration logic will always select the other SCC 900 to win the arbitration. The hardware inputs allow the hardware to indicate hardware faults which may also prevent the RCL 918 from arbitrating for active status. In addition to these functions, the RCL 918 also sends its current active/standby status via the redundancy control signals 924 so that the other SCC 900, the IOS 104 and AS 106 can see the current active standby status of both SCCs 900.
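By way of illustration only, the decision implied by these inputs might be sketched in C as follows; the structure members simply mirror the software registers, hardware fault input, and peer status signal described above, and the names are assumptions.

    struct scc_inputs {
        int sw_ready;        /* software register: ready to assume active status */
        int hw_fault;        /* hardware input: fault detected on this SCC       */
        int peer_is_active;  /* other SCC's status via the redundancy signals    */
    };

    /* Returns 1 if this SCC may arbitrate for (and win) active status. */
    int may_go_active(const struct scc_inputs *s)
    {
        if (!s->sw_ready || s->hw_fault)
            return 0;        /* arbitration selects the other SCC                */
        if (s->peer_is_active)
            return 0;        /* only one SCC can be active at any given time     */
        return 1;
    }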
This section describes the flow of data through the storage network switch 100 for a variety of scenarios.
Operation of an exemplary embodiment of the Storage Network Switch 100 is explained more fully herein.
Referring now to
In a preferred exemplary embodiment of the present invention, the ingress line card 1042 has one or more NPs 1052. Each of the NPs 1052 has an XCP 1055 and an FCP 1054. Preferably, each of the NPs 1052 also has two ports 1050 and 1053 through which data flows in or out. The ingress line card 1042 has a bridge 1056 that is coupled to the network processors 1052. The NPs 1052 are also coupled to a line card processor 1090 via data bus 1060. The data bus 1060 need not be any particular kind of data bus. For example, a PCI bus can be used in this capacity. However, any kind of data bus having sufficient bandwidth and latency is acceptable for the present invention. The line cards 1042 also preferably have one or more Ethernet controllers 1080, which are used to pass control information between the line card 1042 and the SCS. Although Ethernet is preferably used with this aspect of the present invention, other network protocols could be used with equal effect in the present invention. Finally, the ingress line card 1042 is preferably equipped with a SFI 1058 which interfaces with components of the SFS 1044.
The SFS 1044 contains one or more switch fabric devices 202 which interface with one or more line cards (either ingress and/or egress line cards).
The egress line card 1046 is preferably identical to the ingress line card 1042. As with the ingress line card 1042, the egress line card 1046 has one or more NPs 1072. Each of the NPs 1072 has an XCP 1075 and a FCP 1074. The NPs 1072 are coupled to the egress line card processor 1091 via data bus 1061, as illustrated in
According to an exemplary embodiment of the method of the present invention, in step 1001, a frame from the network arrives on an ingress port 1050 of an NP 1052. In step 1002, the NP 1052 simultaneously moves the frame into a buffer in the NP's 1052 buffer memory while the NP 1052 parses the header of the frame, wherein: the NP 1052 first decodes addresses in the frame header; performs a check, such as a CRC, to detect data transmission errors; performs a table look-up (“TLU”) to determine the destination port (dest_port); creates a descriptor data structure that describes where the frame is in buffer memory and to what destination NP 1052 and egress port number the frame is to be sent; embeds the dest_port into the frame descriptor that it created; and queues the frame descriptor for transmission to the FCP 1054 within the NP 1052.
In step 1003, a FCP 1054 on the NP 1052 segments the frame into, for example, but not limited to, 80-byte cells (64-byte payload and 16-byte cell header) and adds the destination port information (“dest_port”) to every cell header it generates. The dest_port field is used by the switch fabric hardware and the destination egress NP 1072 to route the frames to the appropriate egress switch port 1070. In step 1004, the NP 1052 forwards the cells to the bridge 1056. In step 1005, the bridge 1056 performs multiplexing and buffering of cells from the multiplicity of NPs 1052 attached to it so that a steady stream of cells flows to the switch fabric devices 202c. In step 1006, a single-rate, nominally 10-Gb or greater stream of cells may be sent from the bridge 1056 to the SFI 1058. In step 1007, the SFI 1058 receives the frame and performs port selection and de-multiplexing. For each cell, the SFI 1058 reads the dest_port information from the cell header and determines which one of the serial links to send the cell through. In step 1008, an arbitration request is sent from the SFI 1058 on a line card 1042 to the switch fabric device 202 on the serial link that was selected for that cell's transmission to the SFS 1044, as illustrated in
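By way of illustration only, the per-cell selection of a serial link toward the switch fabric (step 1007) might be sketched in C as follows; the link count and the simple modulo mapping are assumptions, the point being that the arbitration request and the cell itself use the same selected link.

    #define NUM_SERIAL_LINKS 8

    static int link_up[NUM_SERIAL_LINKS] = { 1, 1, 1, 1, 1, 1, 1, 1 };

    /* Returns the serial link index to use for this cell, skipping links that
     * are down; returns -1 if no link is currently available. */
    int select_serial_link(unsigned dest_port)
    {
        for (int tries = 0; tries < NUM_SERIAL_LINKS; tries++) {
            int link = (int)((dest_port + (unsigned)tries) % NUM_SERIAL_LINKS);
            if (link_up[link])
                return link;     /* arbitration request and cell both use this link */
        }
        return -1;
    }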
In step 1009, the switch fabric device 202 determines the correct egress path for the cell through itself. When the switch fabric device 202 has determined that the egress path is free for use, step 1010 is taken where the arbitration response is sent, via the serial link from which the arbitration request was made, back to the SFI 1058 on the ingress line card 1042. In step 1011, the SFI 1058 on the line card 1042 queues the actual cell for transmission to the switch fabric device 202. In step 1012, a cell or frame is transmitted from the SFI 1058 on the line card 1042 to the switch fabric device 202 via the same serial link in which the arbitration request was made for that cell. In step 1013, a preferably first-in first-out (“FIFO”) buffer receives the cell internal to the switch fabric device 202, and the switch fabric device 202 immediately forwards the cell to the destination egress port 1070 that was determined during the arbitration process. It will be understood by those skilled in the art that queues other than FIFO queues could also be used with equal effect by the present invention.
The switch fabric device 202 forwards the cell to the destination egress port 1070 through a series of steps beginning with step 1014, in which the cell is transmitted from the switch fabric device 202 to the SFI 1078 on the line card 1046 that corresponds to the destination egress port (dest_port) 1070. In step 1015, the SFI 1078 on the destination line card 1046 accepts the cell and checks the validity of the cell. In step 1016, the SFI 1078 performs multiplexing of the serial channels into one CSIX stream. In step 1017, the cells are transmitted from the SFI 1078 to the bridge 1076 via, for example, the CSIX bus. In step 1018 the bridge 1076 reads the destination in the cell headers and performs de-multiplexing for transmission to the proper egress NP 1072.
In step 1019, the cell is transmitted from the bridge 1076 to the egress NP 1072. In step 1020, the FCP 1074 within the egress NP 1072 reassembles the cells into the original frame in the NP's 1072 internal (or external) buffer memory. If this is the first cell of a frame, the FCP 1074 within the NP 1072 allocates a frame descriptor to keep track of the location of the frame in the NP 1072 buffer memory. When the start-of-frame cell is received in step 1021, the FCP 1074 determines to which of the NP's 1072 two I/O ports 1070 to send the frame. In step 1022, when the FCP 1074 within the NP 1072 receives the end-of-frame cell, the FCP 1074 queues the frame for transmission to the appropriate destination egress I/O port 1070. Then in step 1023, the NP 1072 transmits the frame out of the egress port 1070.
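By way of illustration only, the egress-side reassembly of cells back into a frame (steps 1020 through 1022) might be sketched in C as follows; the maximum frame size and names are assumptions for the sketch.

    #include <string.h>

    #define MAX_FRAME 2176

    struct reassembly {
        unsigned char frame[MAX_FRAME];
        unsigned      length;
        int           active;           /* a frame descriptor has been allocated */
    };

    /* Returns 1 when the last cell has been received and the completed frame may
     * be queued for transmission out of the destination egress I/O port. */
    int reassemble_cell(struct reassembly *r, const unsigned char *payload,
                        unsigned len, int sof, int eof)
    {
        if (sof) {                       /* first cell: start a new frame     */
            r->length = 0;
            r->active = 1;
        }
        if (!r->active || r->length + len > MAX_FRAME)
            return 0;                    /* unexpected or oversize cell: drop */
        memcpy(r->frame + r->length, payload, len);
        r->length += len;
        if (eof) {                       /* last cell: frame is complete      */
            r->active = 0;
            return 1;
        }
        return 0;
    }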
Referring now to
In step 1103, the frame is processed by the XCP 1055 of the NP 1052, wherein the XCP 1055 controls the PCI interface 1060 and the XCP 1055 performs additional frame validation, such as: if the frame is not valid or is merely dropped, a negative acknowledgement response frame is formatted and queued for transmission to the port 1050 from which the original frame was received; and, if the frame is valid, it is queued for transmission to the LCP 1090. In step 1104, the XCP 1055 moves the frame data to the LCP 1090 over the data bus 1060 and into the LCP's local memory. In step 1105, the NP 1052 interrupts the LCP 1090 to inform it that the frame is in the LCP's 1090 local memory. In step 1106, the FC driver software of the LCP 1090 performs more detailed parsing on the frame. The remaining steps of
Referring now to
In step 1106d, the FC driver 1202 on the LCP 1090 formats the acknowledgement frame in local memory. In step 1106e, the FC driver 1202 on the LCP 1090 interrupts the NP 1052 (see
Referring back to
Referring now to
Referring back to
In step 1116, the LCP 1090 interrupts the NP 1052 to send the frame. In step 1117, the NP 1052 sets up the direct memory access (“DMA”) queue (or alternative queue) to initiate a PCI transfer of the frame into memory on the NP 1052. In step 1118, the XCP 1185 receives the frame, wherein the XCP 1185: performs the final formatting of the frame; determines which physical port to use for transmitting the frame; and queues the frame for transmission out of the physical port 1190. In step 1119, the frame is transmitted out of the NP physical port 1190 to its destination.
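As an illustration of the hand-off in steps 1116 through 1118, a simplified DMA descriptor ring of the kind the NP might use to pull the frame out of LCP memory across the PCI bus is sketched below. The descriptor fields and the ring size are assumptions made for the example and do not reflect the actual NP hardware interface.

    #include <stdint.h>

    /* Hypothetical DMA descriptor used by the NP to pull a control frame out
     * of LCP local memory across the PCI bus (step 1117).  Field names are
     * illustrative, not the actual NP hardware layout. */
    typedef struct {
        uint64_t lcp_phys_addr;    /* PCI address of the frame in LCP memory */
        uint32_t length;           /* frame length in bytes                  */
        uint16_t egress_port;      /* filled in by the XCP in step 1118      */
        uint16_t flags;            /* e.g. bit 0 = descriptor valid          */
    } dma_desc_t;

    #define DMA_RING_SIZE 64

    typedef struct {
        dma_desc_t ring[DMA_RING_SIZE];
        unsigned head;             /* next slot the driver fills             */
        unsigned tail;             /* next slot the DMA engine consumes      */
    } dma_ring_t;

    /* Steps 1116 and 1117: enqueue a frame for PCI transfer into NP memory.
     * Returns 0 on success, -1 if the ring is full. */
    int dma_queue_frame(dma_ring_t *r, uint64_t lcp_phys_addr, uint32_t length)
    {
        unsigned next = (r->head + 1) % DMA_RING_SIZE;
        if (next == r->tail)
            return -1;                         /* ring full: retry later */

        r->ring[r->head].lcp_phys_addr = lcp_phys_addr;
        r->ring[r->head].length        = length;
        r->ring[r->head].flags         = 0x1;  /* mark valid for the DMA engine */
        r->head = next;
        return 0;
    }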
Referring now to
In step 1205 the NP 1072 on the line card 1046 forwards the request(s) to the target(s) 1412. In step 1206, the target 1412 processes the request(s), formats one or more response frames, Response B 1438, and sends them back to the switch port and line card from which the Request B 1434 was originally sent. In step 1207, the NP 1072 on the line card 1046 receives the Response B 1438 and forwards it/them to the application card 1410. If the application card 1410 needs to send more requests to the target(s) 1412 in order to complete the original Request A 1432, then steps 1204 through 1207 are repeated until enough information has been received to complete the original Request A 1432.
In step 1208, the application card 1410 formats one or more response frames, Response A 1436, to respond to the original request sent by the host 1402 and forwards it/them to the line card 1042, NP 1052, and port 1050 that the original request was received on. In step 1209, the NP 1052 on the line card 1042 forwards the response frame(s), Response A 1436, to the host 1402 that originally sent the Request A 1432.
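The request/response proxying of steps 1204 through 1209 may be modeled, in greatly simplified form, by the following C sketch. The chunk size, the byte-counting completion test, and the helper routine are assumptions made only to illustrate the loop that repeats steps 1204 through 1207 until enough information has been gathered to answer the original Request A.

    #include <stdio.h>
    #include <stddef.h>

    /* Toy model of the proxy flow in steps 1204 through 1209.  A Request A
     * from the host is satisfied by issuing as many smaller Request B
     * operations to the target as needed; the sizes are invented for the
     * example only. */
    typedef struct {
        size_t bytes_requested;    /* total data the host asked for (Request A)   */
        size_t bytes_gathered;     /* data collected so far via Response B frames */
    } request_a_t;

    #define REQUEST_B_CHUNK 2048   /* example per-Request-B transfer size */

    /* Stand-in for sending Request B to the target through the egress line
     * card and receiving Response B (steps 1205 through 1207). */
    static size_t issue_request_b(size_t bytes_wanted)
    {
        return bytes_wanted < REQUEST_B_CHUNK ? bytes_wanted : REQUEST_B_CHUNK;
    }

    /* Loop until enough information has been received to complete the
     * original Request A, then format and return Response A (steps 1208-1209). */
    static void service_request_a(request_a_t *req)
    {
        while (req->bytes_gathered < req->bytes_requested) {
            size_t remaining = req->bytes_requested - req->bytes_gathered;
            req->bytes_gathered += issue_request_b(remaining);
        }
        printf("Response A: %zu bytes returned to the host\n", req->bytes_gathered);
    }

    int main(void)
    {
        request_a_t req = { 5000, 0 };   /* example Request A of 5000 bytes */
        service_request_a(&req);
        return 0;
    }

In this toy example, a 5000-byte Request A is satisfied by three Request B operations before Response A is returned to the host.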
Referring now to
In step 1303, the FCP 1054 of the NP 1052 on the ingress line card 1042 segments the frame into, for example but not limited to, 80-byte (64-byte payload and 16-byte cell header) cells and adds the destination port information (dest_port) to every cell header it generates. The dest_port field may be used by the SFS 102 hardware and the egress NP 1572 to route the frames to the appropriate destination egress port 1570. In step 1304, the NP 1052 on the ingress line card 1042 forwards the cells to the bridge 1056. In step 1305, the bridge 1056 performs multiplexing and buffering of the cells so that a steady stream of cells flows to the SFI 1058.
In step 1306, a single-rate, 10-Gb stream of cells may be sent from the bridge 1056 to the SFI 1058 on the ingress line card 1042. In step 1307, the SFI 1058 receives the frame and performs port selection and de-multiplexing. For each cell, the SFI 1058 reads the dest_port information from the cell header and determines through which set of serial links to send the cell. In step 1308, an arbitration request is sent from the SFI 1058 on the ingress line card 1042 to the switch fabric device 202 on the Switch Fabric Card 1406. The SFI 1058 sends the destination port information to the switch fabric device 202 to find a path to the egress line card 1046. In step 1309, the switch fabric device 202 determines the correct egress path for the cell through the switch fabric. In step 1310, after the switch fabric device 202 determines that the egress serial link to the egress line card 1046 is free, the switch fabric device 202 sends an arbitration response back to the SFI 1058 on the ingress line card 1042. In step 1311, the SFI 1058 on the ingress line card 1042 receives the arbitration response (sent from the switch fabric device 202) and queues the actual cell for transmission to the switch fabric device 202 on the Switch Fabric Card 1406. In step 1312, the cell is transmitted from the SFI 1058 on the ingress line card 1042 to the switch fabric device 202 on the Switch Fabric Card 1406, as illustrated in
In step 1313, a FIFO buffer internal to the switch fabric device 202 on the Switch Fabric Card 1406 receives the cell, and the switch fabric device 202 immediately forwards the cell to the egress serial link that it had reserved for the cell during the arbitration process. In step 1314, the cell is transmitted from the switch fabric device 202 to the SFI 1078 on the egress line card 1046 with the destination egress port (dest_port). In step 1315, the SFI 1078 on the destination egress line card 1046 accepts the cell and checks the validity of the cell. In step 1316, the SFI 1078 performs multiplexing of, for example but not limited to, 24 serial channels into one CSIX stream. In step 1317, a frame may be transmitted from the SFI 1078 to the bridge 1076 on the egress line card 1046.
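A toy model of the arbitration handshake of steps 1308 through 1313 is given below. The number of egress serial links, the mapping from dest_port to a link, and the busy-flag bookkeeping are assumptions made for the example; the actual switch fabric device performs this arbitration in hardware.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_EGRESS_LINKS 24    /* example number of serial links per device */

    /* Toy model of the switch fabric device's per-link arbitration state. */
    typedef struct {
        bool egress_link_busy[NUM_EGRESS_LINKS];
    } switch_fabric_dev_t;

    /* Hypothetical mapping from dest_port to the egress serial link that
     * reaches the egress line card; a direct modulo mapping is assumed. */
    static unsigned egress_link_for(uint16_t dest_port)
    {
        return dest_port % NUM_EGRESS_LINKS;
    }

    /* Arbitration request (step 1308): returns true, standing in for the
     * arbitration response of step 1310, only when the egress path is free,
     * reserving that path for the cell. */
    bool arbitrate(switch_fabric_dev_t *sf, uint16_t dest_port)
    {
        unsigned link = egress_link_for(dest_port);
        if (sf->egress_link_busy[link])
            return false;                      /* SFI must hold the cell and retry */
        sf->egress_link_busy[link] = true;     /* path reserved for this cell      */
        return true;
    }

    /* Steps 1312 through 1314: once granted, the cell is transmitted, passes
     * through the internal FIFO, and the reserved egress link is released. */
    void transmit_cell(switch_fabric_dev_t *sf, uint16_t dest_port)
    {
        unsigned link = egress_link_for(dest_port);
        /* ... the cell moves through the FIFO to the egress SFI here ... */
        sf->egress_link_busy[link] = false;
    }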
In step 1318, the bridge 1076 of the egress line card 1046 reads the destination information in the cell header and performs de-multiplexing for transmission to the proper egress NP 1572. In step 1319, the cell is transmitted from the bridge 1076 to the egress NP 1572. In step 1320, the FP 1574 within the egress NP 1572 reassembles the cells into the original frame in buffer memory that is attached to the NP 1572. If this is the first cell of a frame, the FP 1574 within the NP 1572 allocates a frame descriptor to keep track of the location of the frame in buffer memory that is attached to the NP 1572.
In step 1321, when the start of frame cell is received, the NP 1572 determines to which of its two ports to send the frame by reading the destination information in the cell header. In step 1322, when the FP 1574 within the NP 1572 receives the end of frame cell, it queues and transmits the frame out of the egress port 1570 of the egress line card 1046 to the Input/Output Processor (“IOP”) 1502 on the application card 1410. In step 1323, the IOP 1502 on the application card 1410 receives a frame, wherein the IOP 1502: validates the frame; and queues the frame for transmission to the AP's memory 1504 via the data bus 1506. The data bus 1506 of the application card 1410 can be any type of data bus having suitable speed and bandwidth, but is preferably a PCI bus or the like. In step 1324, the frame is transmitted to the AP's memory 1504 via the PCI bus 1506 and the North Bridge 1508. In step 1325, the IOP 1502 on the application card 1410 sends a frame reception interrupt to the AP 1510.
In step 1326, the AP 1510 processes the frame using the following logic flow:
a) if the frame is a storage request frame and it can be processed locally, then the AP 1510 processes the frame by:
b) if the frame is a storage request frame and cannot be processed locally, then the AP 1510 processes the frame by:
In step 1327, the IOP 1502 moves the response frame(s) from the memory 1504 of the application card 1410 across the data bus 1506 via the North Bridge/Memory Controller 1508. In step 1328, the NP 1572 on the egress line card 1046 receives one or more response frames from the IOP 1502 of the application card 1410. The NP 1572 then processes the frame just as it would in the normal data flow discussed above, with the ultimate purpose of sending the response frame(s) back to the port on the storage network switch from which the original storage request frame was received.
The operation of the present invention can best be illustrated first by several simplified block diagrams, and then with the aid of several flow charts. The method of the present invention encompasses three critical paths: the data path; the control path; and the application path. An optional network management system path may be added in order to facilitate configuration, monitoring, and control of the present invention. Finally, the distributed nature of the processing that occurs during the performance of the method of the present invention will also be discussed in order to illustrate the unique characteristics and scalability of the present invention.
The basic system of the present invention is illustrated in
According to this illustrative example, the host 1602 is coupled to the first line card 1604. The first line card 1604 itself is coupled to the system card 1606 and the second line card 1608, as illustrated in
The first and second line cards 1604 and 1608 both have two components. Specifically, the first line card 1604 includes one or more NPs 1702 that are coupled with a LCP 1704 as illustrated in
a illustrates an exemplary embodiment of the method of the present invention, namely the data path and the control path. The method 1800 begins, generally, at step 1802 when a statement or signal, typically in the form of a packet, is received by an NP (1702 in
If the result of step 1808 is positive, then the packet is forwarded from the NP (1702 of
If the packet cannot be handled by the LCP 1704 (i.e., the result of step 1816 is no), then in step 1820, the packet is passed to the SCC 1606 for additional processing. The LCP 1704 completes its processing at this time and allows the SCC to process the packet. In step 1824, the processed packet is returned to the original sender, which is NP 1702 in this example illustrated in
It should be noted that the destination address need not be for a particular device. For example, the destination address can be for a device attached to the particular line card, or it may be for a device attached to a different portion of the network. The system of the present invention facilitates multiple paths for signals to travel from one device to another. Any one of the multiple paths may be used for transmission of the signal. Some paths, however, provide greater system performance, depending upon the type of signal involved, and the ability of the various devices along those multiple paths to process and/or forward the signal.
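The escalation performed by method 1800, in which the NP handles the packet when its look-up in step 1808 succeeds, the LCP handles it when it can (step 1816), and the SCC handles it otherwise (step 1820), may be illustrated by the following runnable sketch. The predicates used to classify packets are placeholders invented for the example and do not describe the actual look-up or classification logic.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint16_t dest_addr; uint8_t type; } packet_t;

    /* Placeholder predicates: in the switch these correspond to the NP route
     * table look-up (step 1808) and the LCP's classification (step 1816). */
    static bool np_lookup_succeeds(const packet_t *p) { return p->dest_addr < 0x100; }
    static bool lcp_can_handle(const packet_t *p)     { return p->type != 0xFF; }

    static void handle_packet(const packet_t *p)
    {
        if (np_lookup_succeeds(p))          /* fast path: forwarded by the NP    */
            printf("NP forwards packet via the switch fabric\n");
        else if (lcp_can_handle(p))         /* control path: line card processor */
            printf("LCP processes packet\n");
        else                                /* exception path: system card (SCC) */
            printf("Packet passed to the SCC for additional processing\n");
    }

    int main(void)
    {
        packet_t data = { 0x0042, 0x00 };   /* resolved by the NP look-up */
        packet_t ctrl = { 0x1234, 0x01 };   /* needs the LCP              */
        packet_t odd  = { 0x1234, 0xFF };   /* escalated to the SCC       */
        handle_packet(&data);
        handle_packet(&ctrl);
        handle_packet(&odd);
        return 0;
    }

Running this sketch prints one line per packet, showing the level at which each packet is handled.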
Another example embodiment of the basic system of the present invention is illustrated in
According to this illustrative example of
The first, second, and third line cards 1604, 1608, and 1614 all have two components. Specifically, the first line card 1604 includes one or more NPs 1702 that are coupled with a LCP 1704 as illustrated in
The application blade 1616 has two components: one or more IOPs 1714 that are coupled with one or more APs 1716 as illustrated in
Another example embodiment of the basic system of the present invention is illustrated in
b illustrates the primary method of the present invention, namely the data path and the control path method. The method 1800 begins generally at step 1802 when a statement or signal, typically in the form of a packet, is received by an NP (1702 in
If the result of step 1808 is positive, then the packet is forwarded from the NP (1702 of
If the packet cannot be handled by the LCP 1704 (i.e., the result of step 1816 is no), then in step 1820, the packet is passed to the SCC 1606 for additional processing. The LCP 1704 completes its processing at this time and allows the SCC to process the packet. The method ends generally at step 1826.
It should be noted that the egress path of the packet (the portion of the total path where the packet is about to be sent to its final destination) is illustrated in
d illustrates yet another alternate embodiment of the present invention that is similar to the embodiment of
e illustrates yet another alternate embodiment of the basic system of the present invention. In this alternate embodiment, the application blade 1616 and the host 1602 are coupled to the first (ingress) line card 1604. As in some of the previous examples, the first line card 1604 is coupled to the switch fabric 1612 and to the system card 1606. The second (egress) line card 1608 is coupled to the switch fabric 1612 and to the system card 1606. Thus, in this alternate embodiment, the signal can be sent, for example, by either the application blade 1616 or the host 1602 directly through the first and second line cards 1604 and 1608 via the switch fabric 1612 to the receiving device, such as storage device 1610. The application blade 1616 is also coupled to the system card 1606, to allow the application blade 1616 to be controlled and monitored by the system card 1606.
Yet another alternate embodiment of the basic system of the present invention is illustrated in
d provides yet another illustration of the alternate embodiment of the present invention shown in
If the incoming packet is a response packet (i.e., the result of step 2110 is no), then in step 2120, the SCP 1184 (see
Steps 1816 through 1824 in
The application path portion of the method of the present invention is composed of four major methods: the application path control method, the application ingress processing method, the application command processing method, and the application response processing method.
The application control process is illustrated in
The LCP processing method upon receipt of the “route table update” message is encapsulated in steps 2314 and 2316. In step 2314, the LCP 1704, 1708, or 1712 receives the “route table update” message or signal from the SC 1606 and extracts the route table entries that need to be added, deleted or replaced in the routing table. In step 2316, the LCP 1704, 1708, or 1712 writes or transfers the route table into the route table memory of the NP 1702, 1706, or 1710 respectively.
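A minimal sketch of the route table update of steps 2314 and 2316 follows. The message entry layout, the operation codes, the table size, and the hash are assumptions made for the example; the actual format of the “route table update” message and of the NP route table memory is not specified here.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative form of one "route table update" entry handled in steps
     * 2314 and 2316; the layout and operation codes are assumptions. */
    typedef enum { RT_ADD, RT_DELETE, RT_REPLACE } rt_op_t;

    typedef struct {
        rt_op_t  op;
        uint32_t dest_addr;        /* address being routed                  */
        uint16_t egress_port;      /* switch port that reaches that address */
    } rt_entry_t;

    #define RT_TABLE_SIZE 4096
    #define RT_INVALID_PORT 0xFFFF

    /* The NP route table memory, modeled here as a direct-mapped array
     * indexed by a hash of the destination address. */
    static uint16_t np_route_table[RT_TABLE_SIZE];

    static size_t rt_index(uint32_t dest_addr)
    {
        return (dest_addr * 2654435761u) % RT_TABLE_SIZE;   /* simple hash */
    }

    /* Step 2316: apply the extracted entries to the NP route table memory. */
    void lcp_apply_route_update(const rt_entry_t *entries, size_t count)
    {
        for (size_t i = 0; i < count; i++) {
            size_t idx = rt_index(entries[i].dest_addr);
            switch (entries[i].op) {
            case RT_ADD:
            case RT_REPLACE:
                np_route_table[idx] = entries[i].egress_port;
                break;
            case RT_DELETE:
                np_route_table[idx] = RT_INVALID_PORT;   /* mark the slot free */
                break;
            }
        }
    }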
The AP processing method upon receipt of the “configuration parameters update” message is encapsulated in steps 2318 and 2320. In step 2318, the AP 1716 receives the “configuration parameters update” message or signal from the SC 1606 and extracts the application configuration parameters that need to be added, deleted or replaced in the application configuration files and memory-based data structures. In step 2320, the AP 1716 writes the configuration files and memory-based data structures to enable the updated configuration of the application software executing on the AP 1716. The method of
The application ingress processing method is encapsulated in steps 1810 and 1812 of
The application command processing method is illustrated in
The application response processing method is illustrated in
The highly distributed nature of the present invention, such as the widespread dissemination of look-up addresses, among other information, among the large numbers of LCPs and NPs, is a key factor in the scalability of the present invention. Moreover, the widespread use of multiple processors running in parallel is a key factor for high performance. The combination of distributed processing and near-massive parallel processing provides the unexpectedly high performance of the present invention.
The two types of operations that are best moved to NPs in order to take full advantage of their distributed processing nature are header field look-ups and header field modifications. There are many types of look-up operations that can take advantage of the look-up accelerators provided in most NPs. For example, most NPs support direct table look-ups, hash tree look-ups, and binary tree look-ups. More sophisticated NPs offer full string pattern matching capabilities for very complex, high-performance look-up operations. Example packet header fields that are used in look-up operations include the packet destination address field, packet source address field, packet type fields, packet priority fields, packet tag fields, SCSI command type field, SCSI logical unit number field, and logical block address field. The destination address field is used to determine the physical port in the SNS to which the packet should be forwarded; the look-up operation performs a direct-map or hash table look-up to find the identifier of the egress port. Both the destination and source address fields are typically used to hash into zone information or security information for the ingress and egress devices represented by the addresses. If the zone or security information indicates that the two ports may communicate, the packet is forwarded; otherwise, the packet is dropped or an error reply can be generated. In addition, the destination address look-up operation may yield a result indicating that the packet header must be modified in order to support the next hop when the packet is forwarded. The packet type and packet priority fields are used to determine how to route the packet (i.e., in which queue to place the packet). The SCSI command type, logical unit number, and logical block addresses are used to look up whether the packet needs to be redirected to the application blade for higher levels of processing, or to determine whether the fields need to be modified. If the fields are to be modified, then the result of the look-up operation yields the new fields with which to replace the old fields in the packet.
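The destination-address look-up and the zone check described above may be sketched as follows. The table sizes, the hash function, and the bitmap encoding of zone membership are assumptions made for the example only.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    #define PORT_TABLE_SIZE 4096
    #define INVALID_PORT    0xFFFF

    /* Example look-up tables populated by the control path:
     * destination address -> egress port, and address -> zone bitmap. */
    static uint16_t egress_port_table[PORT_TABLE_SIZE];
    static uint32_t zone_of_addr[PORT_TABLE_SIZE];

    static size_t addr_index(uint32_t addr)
    {
        return (addr * 2654435761u) % PORT_TABLE_SIZE;   /* simple hash */
    }

    /* Destination-address look-up: returns the egress port identifier, or
     * INVALID_PORT when the address is unknown and the packet must be
     * escalated off the fast path. */
    uint16_t lookup_egress_port(uint32_t dest_addr)
    {
        return egress_port_table[addr_index(dest_addr)];
    }

    /* Zone check: the packet is forwarded only if the source and destination
     * devices share at least one zone; otherwise it is dropped or an error
     * reply is generated, as described above. */
    bool zone_permits(uint32_t src_addr, uint32_t dest_addr)
    {
        return (zone_of_addr[addr_index(src_addr)] &
                zone_of_addr[addr_index(dest_addr)]) != 0;
    }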
Header field modification is an attribute of NPs that allows wire-rate modifications to the packet header fields so that high-level processors, such as those on the application blade, do not have to be involved in the processing of the packet. Simple types of modifications include replacing address fields (converting virtual addresses to physical addresses), address swapping (used for IP router type functionality), and upper layer protocol field modifications (changing SCSI commands, SCSI logical unit numbers, and SCSI logical block numbers). Allowing these types of modifications in the data path in the NP software provides two major benefits: 1) it allows more sophisticated processing of the data stream without having to send the packets to an application blade, and 2) it allows a sophisticated level of preprocessing of the packet stream to help off-load the processing of the packets by the application blade if the packets must still be sent there for final processing.
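A corresponding sketch of a wire-rate header modification follows, showing a virtual-to-physical address replacement together with SCSI logical unit number and logical block address rewriting. The header layout and the remap structure are assumptions made for the example; they stand in for the result produced by the look-up operation described above.

    #include <stdint.h>

    /* Hypothetical packet header fields relevant to the rewrite. */
    typedef struct {
        uint32_t dest_addr;        /* virtual target address before rewrite */
        uint32_t src_addr;
        uint8_t  scsi_lun;         /* SCSI logical unit number              */
        uint64_t scsi_lba;         /* SCSI logical block address            */
    } pkt_header_t;

    /* Hypothetical result of the look-up operation. */
    typedef struct {
        uint32_t phys_dest_addr;   /* physical device address                 */
        uint8_t  phys_lun;         /* physical logical unit number            */
        uint64_t lba_offset;       /* offset of the virtual volume on the LUN */
    } remap_result_t;

    /* Apply the look-up result to the packet header in place, so the packet
     * can be forwarded without involving the application blade. */
    void rewrite_header(pkt_header_t *h, const remap_result_t *r)
    {
        h->dest_addr = r->phys_dest_addr;      /* virtual to physical address */
        h->scsi_lun  = r->phys_lun;            /* virtual to physical LUN     */
        h->scsi_lba += r->lba_offset;          /* relocate the block address  */
    }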
The present invention may also be fitted with a Network Management System (“NMS”) path. The NMS path is used to configure and control the switch of the present invention, or to modify the behavior of the switch of the present invention. The NMS path is also useful for monitoring traffic along the switch, for example, for law enforcement requirements.
The invention, therefore, is well adapted to carry out the objects and to attain the ends and advantages mentioned, as well as others inherent therein. While the invention has been depicted, described, and is defined by reference to exemplary embodiments of the invention, such references do not imply a limitation on the invention, and no such limitation is to be inferred. The invention is capable of considerable modification, alternation, and equivalents in form and function, as will occur to those ordinarily skilled in the pertinent arts and having the benefit of this disclosure. The depicted and described exemplary embodiments of the invention are exemplary only, and are not exhaustive of the scope of the invention. Consequently, the invention is intended to be limited only by the spirit and scope of the appended claims, giving full cognizance to equivalents in all respects.
This application is related to U.S. patent application Ser. No. ______ [attorney docket number 069099.0103/client reference 105-02] entitled “Scalable Switch Fabric System and Apparatus for Computer Networks” by [name inventors], which is being filed contemporaneously with the present application and which is incorporated herein by reference in its entirety for all purposes. This application is also related to previously filed and pending U.S. patent application Ser. No. 09/738,960, entitled “Caching System and Method for a Network Storage System” by Lin-Sheng Chiou, Mike Witkowski, Hawkins Yao, Cheh-Suei Yang, and Sompong Paul Olarig, which was filed on Dec. 14, 2000 and which is incorporated herein by reference in its entirety for all purposes; U.S. patent application Ser. No. 10/015,047 [attorney docket number 069099.0102/B2] entitled “System, Apparatus and Method for Address Forwarding for a Computer Network” by Hawkins Yao, Cheh-Suei Yang, Richard Gunlock, Michael L. Witkowski, and Sompong Paul Olarig, which was filed on Oct. 26, 2001 and which is incorporated herein by reference in its entirety for all purposes; U.S. patent application Ser. No. 10/039,190 [attorney docket number 069099.0105/B5] entitled “Network Processor Interface System” by Sompong Paul Olarig, Mark Lyndon Oelke, and John E. Jenne, which was filed on Dec. 31, 2001, and which is incorporated herein by reference in its entirety for all purposes; U.S. patent application Ser. No. 10/039,189 [attorney docket number 069099.0106/B6-A] entitled “XON/XOFF Flow Control for Computer Network” by Hawkins Yao, John E. Jenne, and Mark Lyndon Oelke, which was filed on Dec. 31, 2001, and which is incorporated herein by reference in its entirety for all purposes; U.S. patent application Ser. No. 10/039,184 [attorney docket number 069099.0107/B6-B] entitled “Buffer to Buffer Flow Control for Computer Network” by John E. Jenne, Mark Lyndon Oelke and Sompong Paul Olarig, which was filed on Dec. 31, 2001, and which is incorporated herein by reference in its entirety for all purposes; U.S. patent application Ser. No. 10/117,418 [attorney docket number 069099.0109/(client reference 115-02)], entitled “System and Method for Linking a Plurality of Network Switches,” by Ram Ganesan Iyer, Hawkins Yao and Michael Witkowski, which was filed Apr. 5, 2002 and which is incorporated herein by reference in its entirety for all purposes; U.S. patent application Ser. No. 10/117,040 [attorney docket number 069099.0111/(client reference 135-02)], entitled “System and Method for Expansion of Computer Network Switching System Without Disruption Thereof,” by Mark Lyndon Oelke, John E. Jenne, Sompong Paul Olarig, Gary Benedict Kotzur and Matthew John Schumacher, which was filed Apr. 5, 2002 and which is incorporated herein by reference in its entirety for all purposes; U.S. patent application Ser. No. 10/117,266 [attorney docket number 069099.0112/(client reference 220-02)], entitled “System and Method for Guaranteed Link Layer Flow Control,” by Hani Ajus and Chung Dai, which was filed Apr. 5, 2002 and which is incorporated herein by reference in its entirety for all purposes; U.S. patent application Ser. No. 10/117,638 [attorney docket number 069099.0113/(client reference 145-02)], entitled “Fibre Channel Implementation Using Network Processors,” by Hawkins Yao, Richard Gunlock and Po-Wei Tan, which was filed Apr. 5, 2002 and which is incorporated herein by reference in its entirety for all purposes; U.S. patent application Ser. No. 
10/117,290 [attorney docket number 069099.0114/(client reference 230-02)], entitled “Method and System for Reduced Distributed Event Handling in a Network Environment,” by Ruotao Huang and Ram Ganesan Iyer, which was filed Apr. 5, 2002 and which is incorporated herein by reference in its entirety for all purposes; U.S. patent application Ser. No. ______ [attorney docket number 069099.0115/(client reference 225-02)], entitled “System and Method for Allocating Unique Zone Membership,” by Walter Bramhall and Ruotao Huang, which was filed Apr. 15, 2002 and which is incorporated herein by reference in its entirety for all purposes; and U.S. patent application Ser. No. ______ [attorney docket number 069099.0108/(client reference 140-02)], entitled “System and Method for Load Sharing Computer Network Switch,” by Mark Lyndon Oelke, John E. Jenne and Sompong Paul Olarig, which was filed Apr. 22, 2002 and which is incorporated herein by reference in its entirety for all purposes.