This application relates to the following co-pending application: NETWORK PROTOCOL ENGINE, attorney docket number 42.P14732, filed on ______ by Sriram R. Vangal et al.
This application includes an appendix, Appendix A, of micro-code instructions. The authors retain applicable copyright rights in this material.
1. Field
The disclosure relates to a network protocol processor.
2. Description of the Related Art
Networks enable computers and other electronic devices to exchange data such as e-mail messages, web pages, audio data, video data, and so forth. Before transmission across a network, data is typically distributed across a collection of packets. A receiver can reassemble the data back into its original form after receiving the packets.
In addition to the data (“payload”) being sent, a packet also includes “header” information. A network protocol can define the information stored in the header, the packet's structure, and how processes should handle the packet.
Different network protocols handle different aspects of network communication. Many network communication models organize these protocols into different layers. For example, models such as the Transmission Control Protocol/Internet Protocol (TCP/IP) model (TCP—Internet Engineering Task Force (IETF) Request for Comments (RFC) 793, published September 1981; IP IETF RFC 791, published September 1981) and the Open Systems Interconnection (OSI) model define a “physical layer” that handles bit-level transmission over physical media; a “link layer” that handles the low-level details of providing reliable data communication over physical connections; a “network layer”, such as the Internet Protocol, that can handle tasks involved in finding a path through a network that connects a source and destination; and a “transport layer” that can coordinate communication between source and destination devices while insulating “application layer” programs from the complexity of network communication.
A different network communication model, the Asynchronous Transfer Mode (ATM) model, is used in ATM networks. The ATM model also defines a physical layer, but defines ATM and ATM Adaptation Layer (AAL) layers in place of the network, transport, and application layers of the TCP/IP and OSI models.
Generally, to send data over the network, different headers are generated for the different communication layers. For example, in TCP/IP, a transport layer process generates a transport layer packet (sometimes referred to as a “segment”) by adding a transport layer header to a set of data provided by an application; a network layer process then generates a network layer packet (e.g., an IP packet) by adding a network layer header to the transport layer packet; a link layer process then generates a link layer packet (also known as a “frame”) by adding a link layer header to the network packet; and so on. This process is known as encapsulation. By analogy, the process of encapsulation is much like stuffing a series of envelopes inside one another.
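By way of a hedged illustration only (the header structures and field layouts below are simplified assumptions, not the full protocol header definitions), the following C sketch shows how encapsulation amounts to prepending one header per layer in front of the application data:

#include <stdint.h>
#include <string.h>

struct tcp_hdr { uint16_t src_port, dst_port; uint32_t seq, ack; /* remaining fields omitted */ };
struct ip_hdr  { uint8_t  ver_ihl, tos; uint16_t total_len; uint32_t src, dst; /* remaining fields omitted */ };
struct eth_hdr { uint8_t  dst[6], src[6]; uint16_t ethertype; };

/* Build a frame by prepending one header per layer in front of the payload. */
size_t encapsulate(uint8_t *frame, const struct eth_hdr *eth,
                   const struct ip_hdr *ip, const struct tcp_hdr *tcp,
                   const uint8_t *payload, size_t payload_len)
{
    size_t off = 0;
    memcpy(frame + off, eth, sizeof *eth); off += sizeof *eth;   /* link layer ("frame") */
    memcpy(frame + off, ip,  sizeof *ip);  off += sizeof *ip;    /* network layer (IP packet) */
    memcpy(frame + off, tcp, sizeof *tcp); off += sizeof *tcp;   /* transport layer ("segment") */
    memcpy(frame + off, payload, payload_len);                   /* application data */
    return off + payload_len;                                    /* total frame length */
}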
After the packet(s) travel across the network, the receiver can de-encapsulate the packet(s) (e.g., “unstuff” the envelopes). For example, the receiver's link layer process can verify the received frame and pass the enclosed network layer packet to the network layer process. The network layer process can use the network header to verify proper delivery of the packet and pass the enclosed transport segment to the transport layer process. Finally, the transport layer process can process the transport packet based on the transport header and pass the resulting data to an application.
As described above, both senders and receivers have quite a bit of processing to do to handle packets. Additionally, network connection speeds continue to increase rapidly. For example, network connections capable of carrying 10-gigabits per second and faster may soon become commonplace. This increase in network connection speeds poses an important design issue for devices offering such connections. That is, at such speeds, a device may easily become overwhelmed with a deluge of network traffic. An overwhelmed device may become the site of a network “traffic jam” as packets await processing, or the device may even drop packets, causing further communication problems between devices.
Transmission Control Protocol (TCP) is a connection-oriented reliable protocol accounting for over 80% of network traffic. Today TCP processing is performed almost exclusively through software. Several studies have shown that even state-of-the-art servers are forced to completely dedicate their Central Processing Units (CPUs) to TCP processing when bandwidths exceed a few Gbps. At 10 Gbps, there are 14.8M minimum-size Ethernet packets arriving every second, with a new packet arriving every 67.2 ns. The term “Ethernet” is a reference to a standard for transmission of data packets maintained by the Institute of Electrical and Electronics Engineers (IEEE) and one version of the Ethernet standard is IEEE std. 802.3, published Mar. 8, 2002. Allowing a few nanoseconds for overhead, wire-speed TCP processing requires several hundred instructions to be executed approximately every 50 ns. Given that a majority of TCP traffic is composed of small packets, this is an overwhelming burden on the CPU. A generally accepted rule of thumb for network processing is that 1 GHz CPU processing frequency is required for a 1 Gbps Ethernet link. For smaller packet sizes on saturated links, this requirement is often much higher. Ethernet bandwidth is slated to increase at a much faster rate than the processing power of leading edge microprocessors. Therefore, general purpose MIPS may not be able to provide the required computing power in coming generations. Even with the advent of GHz processor speeds, there is a need for a dedicated TCP offload engine (TOE) in order to support high bandwidths of 10 Gbps and beyond.
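As a rough check of the figures quoted above (assuming the standard 64-byte minimum frame plus 8 bytes of preamble and a 12-byte inter-frame gap), a short calculation reproduces the packet rate and inter-arrival time:

#include <stdio.h>

int main(void)
{
    const double link_bps   = 10e9;              /* 10 Gbps link */
    const double frame_bits = (64 + 8 + 12) * 8; /* 64 B frame + preamble + inter-frame gap = 672 bits */

    double pps           = link_bps / frame_bits; /* ~14.88 million packets per second */
    double ns_per_packet = 1e9 / pps;             /* ~67.2 ns between packet arrivals */

    printf("%.2f Mpps, %.1f ns per packet\n", pps / 1e6, ns_per_packet);
    return 0;
}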
Therefore, there is a need in the art for an improved network protocol processor.
Referring now to the drawings in which like reference numbers represent corresponding parts throughout:
In the following description, reference is made to the accompanying drawings which form a part hereof and which illustrate several embodiments. It is understood that other embodiments may be utilized and structural and operational changes may be made without departing from the scope of embodiments.
Many computing devices and other host devices feature processors (e.g., general purpose Central Processing Units (CPUs)) that handle a wide variety of tasks. Often these processors have the added responsibility of handling network traffic. The increases in network traffic and connection speeds have placed growing demands on host processor resources. To at least partially reduce the burden of network communication on a host processor, a system 106 may be provided that performs network protocol operations on behalf of the host(s).
In addition to conserving host processor resources by handling protocol operations, the system 106 may provide “wire-speed” processing, even for very fast connections such as 10-gigabit per second and 40-gigabit per second connections. In other words, the system 106 may, generally, complete processing of one packet before another arrives. By keeping pace with a high-speed connection, the system 106 can potentially avoid or reduce the cost and complexity associated with queuing large volumes of backlogged packets.
The sample system 106 shown includes an interface 108 for receiving data traveling between one or more hosts and a network 102. For out-going data, the system 106 interface 108 receives data from the host(s) and generates packets for network transmission, for example, via a PHY and medium access control (MAC) device (not shown) offering a network connection (e.g., an Ethernet or wireless connection). For received packets (e.g., packets received via the PHY and MAC), the system 106 interface 108 can deliver the results of packet processing to the host(s). For example, the system 106 may communicate with a host via a Small Computer System Interface (SCSI) (American National Standards Institute (ANSI) SCSI Controller Commands-2 (SCC-2) NCITS.318:1998) or Peripheral Component Interconnect (PCI) type bus (e.g., a PCI-X bus system) (PCI Special Interest Group, PCI Local Bus Specification, Rev 2.3, published March 2002).
In addition to the interface 108, the system 106 also includes processing logic 110 that implements protocol operations. Like the interface 108, the logic 110 may be designed using a wide variety of techniques. For example, the system 106 may be designed as a hard-wired ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), and/or as another combination of digital logic gates.
As shown, the logic 110 may also be implemented by a system 106 that includes a processor 122 (e.g., a micro-controller or micro-processor) and storage 126 (e.g., ROM (Read-Only Memory) or RAM (Random Access Memory)) for instructions that the processor 122 can execute to perform network protocol operations. The instruction-based system 106 offers a high degree of flexibility. For example, as a network protocol undergoes changes or is replaced, the system 106 can be updated by replacing the instructions instead of replacing the system 106 itself. For example, a host may update the system 106 by loading instructions into storage 126 from external FLASH memory or ROM on the motherboard, for instance, when the host boots.
Though
In greater detail, the system 106 shown includes an input sequencer 116 that parses a received packet's header(s) (e.g., the TCP and IP headers of a TCP/IP packet) and temporarily buffers the parsed data. The input sequencer 116 may also initiate storage of the packet's payload in host accessible memory (e.g., via DMA (Direct Memory Access)). As described below, the input sequencer 116 may be clocked at a rate corresponding to the speed of the network connection.
As described above, the system 106 stores context data for different network connections. To quickly retrieve context data from memory 112 for a given packet, the system 106 depicted includes a content-addressable memory 114 (CAM) that stores different connection identifiers (e.g., index numbers) for different connections as identified, for example, by a combination of a packet's IP source and destination addresses and source and destination ports. A CAM can quickly retrieve stored data based on content values much in the way a database can retrieve records based on a key. Thus, based on the packet data parsed by the input sequencer 116, the CAM 114 can quickly retrieve a connection identifier and feed this identifier to the context data memory 112. In turn, the connection context data corresponding to the identifier is transferred from the memory 112 to the working register 118 for use by the processor 122.
In the case that a packet represents the start of a new connection (e.g., a CAM 114 search for a connection fails), the working register 118 is initialized (e.g., set to the “LISTEN” state in TCP) and CAM 114 and context data entries are allocated for the connection, for example, using a Least Recently Used (LRU) algorithm or other allocation scheme.
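The following C fragment is a software model, offered only for illustration, of the lookup-or-allocate behavior described for the CAM 114; the table size, the naive LRU scan, and the function names are assumptions rather than the hardware design:

#include <stdint.h>
#include <string.h>

#define CAM_ENTRIES 64                        /* table size is an assumption */

struct conn_key  { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; };
struct cam_entry { struct conn_key key; int valid; unsigned last_used; };

static struct cam_entry cam[CAM_ENTRIES];
static unsigned tick;

/* Return a connection identifier (an index into the context memory 112);
 * on a miss, allocate a least-recently-used entry for the new connection. */
int cam_lookup_or_alloc(const struct conn_key *k, int *is_new)
{
    int lru = 0;
    for (int i = 0; i < CAM_ENTRIES; i++) {
        if (cam[i].valid && memcmp(&cam[i].key, k, sizeof *k) == 0) {
            cam[i].last_used = ++tick;
            *is_new = 0;
            return i;                         /* hit: existing connection */
        }
        if (!cam[i].valid || cam[i].last_used < cam[lru].last_used)
            lru = i;                          /* remember the LRU / free slot */
    }
    cam[lru].key = *k;                        /* miss: allocate an entry for the new connection */
    cam[lru].valid = 1;
    cam[lru].last_used = ++tick;
    *is_new = 1;                              /* caller initializes the context (e.g., LISTEN) */
    return lru;
}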
The number of data lines connecting different components of the system 106 may be chosen to permit data transfer between connected components 112-128 in a single clock cycle. For example, if the context data for a connection includes n-bits of data, the system 106 may be designed such that the connection data memory 112 may offer n-lines of data to the working register 118.
Thus, the sample embodiment shown uses at most three processing cycles to load the working register 118 with connection data: one cycle to query the CAM 114; one cycle to access the connection data 112; and one cycle to load the working register 118. This design can both conserve processing time and economize on power-consuming access to the memory structures 112, 114.
After retrieval of connection data for a packet, the system 106 can perform protocol operations for the packet, for example, by processor 122 execution of protocol implementation instructions stored in storage 126. The processor 122 may be programmed to “idle” when not in use to conserve power. After receiving a “wake” signal (e.g., from the input sequencer 116 when the connection context is retrieved or being retrieved), the processor 122 may determine the state of the current connection and identify the starting address of instructions for handling this state. The processor 122 then executes the instructions beginning at the starting address. Depending on the instructions, the processor 122 can alter context data (e.g., by altering working register 118), assemble a message in a send buffer 128 for subsequent network transmission, and/or may make processed packet data available to the host (not shown). Again, context data, potentially modified by the processor 122, is returned to the context data memory 112.
The instruction set also includes operations specifically tailored for use in implementing protocol operations with system 106 resources. These instructions include operations for clearing the CAM 114 of an entry for a connection (e.g., CAM1CLR) and for saving context data to the context data storage 112 (e.g., TCBWR). Other embodiments may also include instructions that read and write identifier information to the CAM 114 storing data associated with a connection (e.g., CAM1READ key→index and CAM1WRITE key→index) and an instruction that reads the context data (e.g., TCBRD index→destination). Alternately, these instructions may be implemented as hard-wired logic.
Though potentially lacking many instructions offered by traditional general purpose CPUs (e.g., processor 122 may not feature instructions for floating-point operations), the instruction set provides developers with easy access to system 106 resources tailored for network protocol implementation. A programmer may directly program protocol operations using the micro-code instructions. Alternately, the programmer may use a wide variety of code development tools (e.g., a compiler or assembler).
As described above, the system 106 instructions can implement operations for a wide variety of network protocols. For example, the system 106 may implement operations for a transport layer protocol such as TCP. A complete specification of TCP and optional extensions can be found in IETF RFCs 793, 1122, and 1323.
Briefly, TCP provides connection-oriented services to applications. That is, much like picking up a telephone and assuming the phone company makes everything work, TCP provides applications with simple primitives for establishing a connection (e.g., CONNECT and CLOSE) and transferring data (e.g., SEND and RECEIVE). TCP transparently handles communication issues such as data retransmission, congestion, and flow control.
To provide these services to applications, TCP operates on packets known as segments. A TCP segment includes a TCP header followed by one or more data bytes. A receiver can reassemble the data from received segments. Segments may not arrive at their destination in their proper order, if at all. For example, different segments may travel very different paths across a network. Thus, TCP assigns a sequence number to each data byte transmitted. Since every byte is sequenced, each byte can be acknowledged to confirm successful transmission. The acknowledgment mechanism is cumulative so that an acknowledgment of a particular sequence number indicates that bytes up to that sequence number have been successfully delivered.
The sequencing scheme provides TCP with a powerful tool for managing connections. For example, TCP can determine when a sender should retransmit a segment using a technique known as a “sliding window”. In the “sliding window” scheme, a sender starts a timer after transmitting a segment. Upon receipt, the receiver sends back an acknowledgment segment having an acknowledgement number equal to the next sequence number the receiver expects to receive. If the sender's timer expires before the acknowledgment of the transmitted bytes arrives, the sender transmits the segment again. The sequencing scheme also enables senders and receivers to dynamically negotiate a window size that regulates the amount of data sent to the receiver based on network performance and the capabilities of the sender and receiver.
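A simplified sketch of the cumulative-acknowledgment bookkeeping described above follows; the field and function names are illustrative and do not correspond to the micro-code of Appendix A:

#include <stdint.h>

struct send_state {
    uint32_t snd_una;        /* oldest unacknowledged sequence number */
    uint32_t snd_nxt;        /* next sequence number to be sent */
    uint32_t snd_wnd;        /* window most recently advertised by the receiver */
};

/* Process the acknowledgement number of an incoming segment. Returns the
 * number of newly acknowledged bytes, or -1 for a duplicate or out-of-range
 * ACK. Sequence arithmetic is modulo 2^32, so differences are compared signed. */
int32_t process_ack(struct send_state *s, uint32_t ack, uint32_t wnd)
{
    int32_t newly_acked = (int32_t)(ack - s->snd_una);
    int32_t in_flight   = (int32_t)(s->snd_nxt - s->snd_una);

    if (newly_acked <= 0 || newly_acked > in_flight)
        return -1;           /* duplicate, old, or acknowledges data never sent */

    s->snd_una = ack;        /* slide the window edge forward */
    s->snd_wnd = wnd;        /* adopt the newly advertised window size */
    return newly_acked;      /* these bytes need no retransmission */
}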
In addition to sequencing information, a TCP header includes a collection of flags that enable a sender and receiver to control a connection. These flags include a SYN (synchronize) bit, an ACK (acknowledgement) bit, a FIN (finish) bit, and a RST (reset) bit. A message including a SYN bit of “1” and an ACK bit of “0” (a SYN message) represents a request for a connection. A reply message including a SYN bit of “1” and an ACK bit of “1” (a SYN+ACK message) represents acceptance of the request. A message including a FIN bit of “1” indicates that the sender seeks to release the connection. Finally, a message with a RST bit of “1” identifies a connection that should be terminated due to problems (e.g., an invalid segment or connection request rejection).
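For illustration, a small C routine can classify a segment from these flag bits (the bit positions follow the RFC 793 flags byte; the enumeration names are assumptions):

#include <stdint.h>

#define TCP_FIN 0x01         /* flag bit positions per RFC 793 */
#define TCP_SYN 0x02
#define TCP_RST 0x04
#define TCP_ACK 0x10

enum seg_kind { SEG_SYN, SEG_SYN_ACK, SEG_FIN, SEG_RST, SEG_OTHER };

enum seg_kind classify_segment(uint8_t flags)
{
    if (flags & TCP_RST)
        return SEG_RST;                                   /* terminate the connection */
    if ((flags & (TCP_SYN | TCP_ACK)) == TCP_SYN)
        return SEG_SYN;                                   /* connection request */
    if ((flags & (TCP_SYN | TCP_ACK)) == (TCP_SYN | TCP_ACK))
        return SEG_SYN_ACK;                               /* request accepted */
    if (flags & TCP_FIN)
        return SEG_FIN;                                   /* sender seeks to release */
    return SEG_OTHER;                                     /* ordinary data or ACK segment */
}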
In the state diagram of
Again, the state diagram also manages the state of a TCP sender. The sender and receiver paths share many of the same states described above. However, the sender may also enter a SYN SENT state 146 after requesting a connection, a FIN WAIT 1 state 152 after requesting release of a connection, a FIN WAIT 2 state 156 after receiving an agreement from the receiver to release a connection, a CLOSING state 154 where both sender and receiver request release simultaneously, and a TIMED WAIT state 158 where previously transmitted connection segments expire.
The protocol instructions of the system 106 may implement many, if not all, of the TCP operations described above and in the RFCs. For example, the instructions may include procedures for option processing, window management, flow control, congestion control, ACK message generation and validation, data segmentation, special flag processing (e.g., setting and reading URGENT and PUSH flags), checksum computation, and so forth. The protocol instructions may also include other operations related to TCP such as security support, random number generation, RDMA (Remote Direct Memory Access) over TCP, and so forth.
In a system 106 configured to provide TCP operations, the context data may include 264-bits of information per connection including: 32-bits each for PUSH (identified by the micro-code label “TCB[pushseq]”), FIN (“TCB[finseq]”), and URGENT (“TCB[rupseq]”) sequence numbers, a next expected segment number (“TCB[rnext]”), a sequence number for the currently advertised window (“TCB[cwin]”), the last unacknowledged sequence number (“TCB[suna]”), and a sequence number for the next segment to be sent (“TCB[snext]”). The remaining bits store various TCB state flags (“TCB[flags]”), the TCP segment code (“TCB[code]”), the connection state (“TCB[tcbstate]”), and error flags (“TCB[error]”).
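Rendered as a C structure for clarity, the 264-bit context described above might look as follows; the widths chosen for the last four fields are assumptions made only so the totals add up, since the text specifies only that the “remaining bits” hold them:

#include <stdint.h>

struct tcb_context {
    uint32_t pushseq;   /* TCB[pushseq]  - PUSH sequence number */
    uint32_t finseq;    /* TCB[finseq]   - FIN sequence number */
    uint32_t rupseq;    /* TCB[rupseq]   - URGENT sequence number */
    uint32_t rnext;     /* TCB[rnext]    - next expected segment number */
    uint32_t cwin;      /* TCB[cwin]     - currently advertised window */
    uint32_t suna;      /* TCB[suna]     - last unacknowledged sequence number */
    uint32_t snext;     /* TCB[snext]    - next segment to be sent */
    uint16_t flags;     /* TCB[flags]    - TCB state flags (width assumed) */
    uint8_t  code;      /* TCB[code]     - TCP segment code (width assumed) */
    uint8_t  tcbstate;  /* TCB[tcbstate] - connection state (width assumed) */
    uint8_t  error;     /* TCB[error]    - error flags (width assumed) */
};                      /* 7 x 32 bits + 40 bits = 264 bits per connection */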
To illustrate programming for a system 106 configured to perform TCP operations, Appendix A features an example of source micro-code for a TCP receiver. Briefly, the routine TCPRST checks the TCP ACK bit, initializes the send buffer, and initializes the send message ACK number. The routine TCPACKIN processes incoming ACK messages and checks if the ACK is invalid or a duplicate. TCPACKOUT generates ACK messages in response to an incoming message based on received and expected sequence numbers. TCPSEQ determines the first and last sequence number of incoming data, computes the size of incoming data, and checks if the incoming sequence number is valid and lies within a receiving window. TCPINITCB initializes TCB fields in the working register. TCPINITWIN initializes the working register with window information. TCPSENDWIN computes the window length for inclusion in a send message. Finally, TCBDATAPROC checks incoming flags, processes “urgent”, “push” and “finish” flags, sets flags in response messages, and forwards data to an application or user.
Another operation performed by the system 106 may be packet reordering. For example, like many network protocols, TCP does not assume TCP packets (“segments”) arrive in order. To correctly reassemble packets, a receiver can keep track of the last sequence number received and await reception of the byte assigned the next sequence number. Packets arriving out-of-order can be buffered until the intervening bytes arrive. Once the awaited bytes arrive, the next bytes in the sequence can potentially be retrieved quickly from the buffered data.
For the purposes of illustration,
Briefly, when a packet arrives, a packet tracking sub-system determines whether the received packet is in-order. If not, the sub-system consults memory to identify a contiguous set of previously received out-of-order packets bordering the newly arrived packet and can modify the data stored in the memory to add the packet to the set. When a packet arrives in-order, the sub-system can access the memory to quickly identify a contiguous chain of previously received packets that follow the newly received packet.
In greater detail, as shown in
As shown, the tracking sub-system 500 includes content-addressable memory 510, 512 that stores information about received, out-of-order packets. Memory 510 stores the first sequence number of a contiguous chain of one or more out-of-order packets and the length of the chain. Thus, when a new packet arrives that ends where the pre-existing chain begins, the new packet can be added to the top of the pre-existing chain. Similarly, the memory 512 also stores the end (the last sequence number+1) of a contiguous packet chain of one or more packets and the length of the chain. Thus, when a new packet arrives that begins at the end of a previously existing chain, the new packet can be appended to the end of the previously existing chain to form an even larger chain of contiguous packets. To illustrate these operations,
As shown in
As shown in
As shown in
As shown in
The sample series shown in
Potentially, the received packet may border pre-existing packet chains on both sides. In other words, the newly received packet fills a hole between two chains. Since the process 520 checks both starting 532 and ending 536 borders of the received packet, a newly received packet may cause the process 520 to join two different chains together into a single monolithic chain.
As shown, if the received packet does not border a packet chain, the process 520 stores 540 data in content-addressable memory for a new packet chain that, at least initially, includes only the received packet.
If the received packet is in-order, the process 520 can query 526 the content-addressable memory to identify a bordering packet chain following the received packet. If such a chain exists, the process 520 can output the newly received packet to an application along with the data of other packets in the adjoining packet chain.
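The following C model, provided only as a sketch, mirrors the two-table tracking scheme described above: one lookup is keyed by the first sequence number of an out-of-order chain and the other by the sequence number just past its end; the fixed table size and linear searches stand in for the content-addressable memories:

#include <stdint.h>

#define MAX_CHAINS 32                                  /* table size is an assumption */

struct chain { uint32_t start, length; int valid; };   /* one contiguous out-of-order chain */
static struct chain chains[MAX_CHAINS];

static struct chain *find_by_start(uint32_t seq)       /* models memory 510: keyed by first seq */
{
    for (int i = 0; i < MAX_CHAINS; i++)
        if (chains[i].valid && chains[i].start == seq)
            return &chains[i];
    return 0;
}

static struct chain *find_by_end(uint32_t seq)         /* models memory 512: keyed by last seq + 1 */
{
    for (int i = 0; i < MAX_CHAINS; i++)
        if (chains[i].valid && chains[i].start + chains[i].length == seq)
            return &chains[i];
    return 0;
}

/* Record an out-of-order packet covering [seq, seq + len). */
void track_out_of_order(uint32_t seq, uint32_t len)
{
    struct chain *below = find_by_end(seq);            /* chain ending where this packet begins */
    struct chain *above = find_by_start(seq + len);    /* chain beginning where this packet ends */

    if (below && above) {                              /* packet fills a hole: join the two chains */
        below->length += len + above->length;
        above->valid = 0;
    } else if (below) {                                /* append to the end of an existing chain */
        below->length += len;
    } else if (above) {                                /* prepend to the top of an existing chain */
        above->start = seq;
        above->length += len;
    } else {                                           /* no bordering chain: start a new one */
        for (int i = 0; i < MAX_CHAINS; i++)
            if (!chains[i].valid) {
                chains[i] = (struct chain){ seq, len, 1 };
                break;
            }
    }
}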
This process may be implemented using a wide variety of hardware, firmware, and/or software. For example,
As shown, the embodiment operates on control signals for reading from the CAM(s) 560, 562 (CAMREAD), writing to the CAMs 560, 562 (CAMWRITE), and clearing a CAM 560, 562 entry (CAMCLR). As shown in
To implement the packet tracking approach described above, the sub-system 500 may feature its own independent controller that executes instructions implementing the scheme or may feature hard-wired logic. Alternately, a processor 122 (
Referring to
Instead, as shown in
As an example of a “dual-clock” system, for a system 106 having an interface 108 data width of 16-bits, to achieve 10 gigabits per second, the interface 108 should be clocked at a frequency of 625 MHz (e.g., [16-bits per cycle]×[625,000,000 cycles per second]=10,000,000,000 bits per second). Assuming a smallest packet of 64 bytes (e.g., a packet only having IP and TCP headers, frame check sequence, and hardware source and destination addresses), it may take the 16-bit/625 MHz interface 108 32 cycles to receive the packet bits. Potentially, an inter-packet gap may provide additional time before the next packet arrives. If a set of up to n instructions is used to process the packet and a different instruction can be executed each cycle, the processing block 110 may be clocked at a frequency of k·(625 MHz) where k=n-instructions/32-cycles. For implementation convenience, the value of k may be rounded up to an integer value or a power of two, though neither of these is a strict requirement.
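Plugging illustrative numbers into this example (the instruction count of 128 is an assumption, not a figure from the text), the processing-clock multiple k can be computed as follows:

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double f_interface = 625e6;                    /* Hz: 16 bits x 625 MHz = 10 Gbps */
    const int    packet_bits = 64 * 8;                   /* smallest 64-byte packet */
    const int    data_width  = 16;
    const int    cycles      = packet_bits / data_width; /* 32 interface cycles per packet */
    const int    n_instr     = 128;                      /* assumed worst-case instruction count */

    double k = ceil((double)n_instr / cycles);           /* rounded up to an integer, per the text */
    printf("k = %.0f, processing clock = %.0f MHz\n", k, k * f_interface / 1e6);
    return 0;
}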
Since components run by a faster clock generally consume greater power and generate more heat than the same components run by a slower clock, clocking the different components 108, 110 at different speeds according to their need can enable the system 106 to save power and stay cooler. This can both reduce the power requirements of the system 106 and can reduce the need for expensive cooling systems.
Power consumption and heat generation can be reduced even further. That is, the system 106 depicted in
Thus, instead of permanently tailoring the system 106 to handle difficult scenarios,
As shown in
The scaling logic 124 may be implemented in a wide variety of hardware and/or software schemes. For example,
[(packet size/data-width)/interface-clock-frequency]>=(interface-clock-cycles/interface-clock-frequency)+(maximum number of instructions/processing-clock-frequency).
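The inequality can be evaluated directly; the small helper below, with illustrative parameter names, simply restates it in C so that a candidate processing frequency can be tested against a given packet size:

#include <stdbool.h>

bool frequency_sufficient(double packet_bits, double data_width,
                          double f_interface,      /* interface clock frequency, Hz */
                          double overhead_cycles,  /* interface-clock-cycles term */
                          double max_instructions,
                          double f_processing)     /* candidate processing clock, Hz */
{
    double arrival_time    = (packet_bits / data_width) / f_interface;
    double processing_time = overhead_cycles / f_interface
                           + max_instructions / f_processing;
    return arrival_time >= processing_time;        /* the inequality stated above */
}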
While
The resulting clock signal can be routed to different components within the processing logic 110. However, not all components within the processing logic 110 and interface 108 blocks need to run at the same clock frequency. For example, in
Placing the scaling logic 124 physically near a frequency source can reduce power consumption. Further, adjusting the clock at a global clock distribution point both saves power and reduces the logic needed to provide clock distribution.
Again, a wide variety of embodiments may use one or more of the techniques described above. Additionally, the system 106 may appear in a variety of forms. For example, the system 106 may be designed as a single chip. Potentially, such a chip may be included in a chipset or on a motherboard. Further, the system 106 may be integrated into components such as a network adaptor, NIC (Network Interface Card), or MAC (medium access control) device. Potentially, techniques described herein may be integrated into a micro-processor.
A system 106 may also provide operations for more than one protocol. For example, a system 106 may offer operations for both network and transport layer protocols. The system 106 may be used to perform network operations for a wide variety of hosts such as storage switches and application servers.
Certain embodiments provide a network protocol processing system to implement protocol (e.g., TCP) input and output processing. The network protocol processing system is capable of processing packet receives and transmits at, for example, 10+Gbps Ethernet traffic at a client computer or a server computer. The network protocol processing system minimizes buffering and queuing by providing line-speed (e.g., TCP/IP) processing for packets (e.g., packets larger than 512 bytes). That is, the network protocol processing system is able to expedite the processing of inbound and outbound packets.
The network protocol processing system provides a programmable solution to allow for extensions or changes in the protocol (e.g., extensions to handle emerging protocols, such as Internet Small Computer Systems Interface (iSCSI) (IETF RFC 3347, published February 2003) or Remote Direct Memory Access (RDMA)). The network protocol processing system also provides a new instruction set. The network protocol processing system also uses multi-threading to effectively hide memory latency. Although examples herein may refer to TCP, embodiments are applicable to other protocols.
Certain embodiments provide a TCP Offload Engine (TOE) to offload some of the processing from the CPU.
In addition to high-speed protocol processing requirements, the efficient handling of Ethernet traffic involves addressing several issues at the system level, such as transfer of payload and management of CPU interrupts. Thus in certain embodiments, a high-speed processing engine is incorporated with a DMA controller and other hardware assist blocks, as well as system level optimizations.
An on-die cache 2112 (i.e., a type of storage area) (e.g., 1MB) stores TCP connection context, which provides temporal locality for connections (e.g., 2K connections), with additional contexts residing in host memory. The context is the portion of the transmission control block (TCB) that TCP maintains for each connection. Caching this context on-chip is useful for 10 Gbps performance. The cache 2112 size may be limited by physical area. Although the term cache may be used herein, embodiments are applicable to any type of storage area.
In addition, to avoid intermediate packet copies on receives and transmits, an integrated direct memory access (DMA) engine (shown logically as a transmit (TX) DMA 2164 and a receive (RX) DMA 2162) is provided. This enables a low latency transfer path and supports direct placement of data in application buffers without substantial intermediate buffering. The TX DMA 2164 transfers data from host memory to the transmit queue 2118 upon receiving a notification from the processing engine 2110 to perform the transfer. The RX DMA 2162 is capable of storing data from the header and data queue 2144 into host memory.
A central scheduler 2116 provides global control to the processing engine 2110 at a packet level granularity. In certain embodiments, a control store of the processing engine 2110 may be made cacheable. Caching code instructions allows code relevant to specific processing to be cached, with the remaining instructions residing in host memory, and allows for protocol code changes.
A network interface interacts with a transmit queue 2118 (TX queue) that buffers outbound packets and a header and data queue 2144 that buffers incoming packets. Three queues form a hardware mechanism to interface with the host CPU. The host interface interacts with the following three queues: an inbound doorbell queue (DBQ) 2130, an outbound completion queue (CQ) 2132, and an exception/event queue (EQ) 2134. Each queue 2130, 2132, 2134 may include a priority mechanism. The inbound doorbell queue (DBQ) 2130 initiates send (or receive) requests. An operating system may use the TOE driver layer 2300 to place doorbell descriptors in the DBQ 2130. The outbound completion queue (CQ) 2132 and the exception/event queue (EQ) 2134 communicate processed results and events back to the host. For example, a pass/fail indication may be stored in CQ 2132.
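As a purely illustrative data-structure sketch (descriptor fields, queue depth, and names are assumptions; the text specifies only the roles of the three queues), the host interface might be modeled as:

#include <stdint.h>

#define QDEPTH 256                                      /* queue depth is an assumption */

struct doorbell_desc   { uint64_t buf_addr; uint32_t len; uint8_t is_send; uint8_t prio; };
struct completion_desc { uint32_t conn_id; uint32_t bytes; uint8_t pass; };  /* e.g., pass/fail */
struct event_desc      { uint32_t conn_id; uint16_t event_code; };

struct dbq_ring { uint32_t head, tail; struct doorbell_desc   e[QDEPTH]; };
struct cq_ring  { uint32_t head, tail; struct completion_desc e[QDEPTH]; };
struct eq_ring  { uint32_t head, tail; struct event_desc      e[QDEPTH]; };

struct host_interface {
    struct dbq_ring dbq;  /* DBQ 2130: host posts doorbell descriptors to initiate requests */
    struct cq_ring  cq;   /* CQ 2132: processed results returned to the host */
    struct eq_ring  eq;   /* EQ 2134: exceptions and events, may raise a host interrupt */
};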
A timer unit 2140 provides hardware offload for four of seven frequently used timers associated with TCP processing. The system includes hardware assist for virtual to physical (V2P) 2142 address translation. A memory queue 2166 may also be included to queue data for the host interface.
The DMA engine supports 4 independent, concurrent channels and provides a low-latency/high throughput path to/from memory. The TOE constructs a list of descriptors (e.g., commands for read and write), programs the DMA engine, and initiates the DMA start operation. The DMA engine transfers data from source to destination as per the list. Upon completion of the commands, the DMA engine notifies the TOE, which updates the CQ 2132 to notify the host.
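A hedged sketch of this descriptor-list handshake appears below; the structure and function names are placeholders and do not describe the engine's actual programming interface:

#include <stdint.h>

enum dma_op { DMA_READ, DMA_WRITE };

struct dma_desc {                        /* one command in the descriptor list */
    enum dma_op op;
    uint64_t    src, dst;                /* source and destination addresses */
    uint32_t    len;
    struct dma_desc *next;
};

struct dma_channel {                     /* one of the four independent channels */
    struct dma_desc *head;
    void (*on_complete)(int channel_id); /* completion notification back to the TOE */
};

/* Program a channel with a descriptor list and start the transfer; when the
 * hardware finishes the list it invokes on_complete, and the TOE then updates
 * CQ 2132 to notify the host. */
void dma_start(struct dma_channel *ch, struct dma_desc *list,
               void (*done)(int channel_id))
{
    ch->head = list;
    ch->on_complete = done;
    /* the engine walks the list, moving data from source to destination per command */
}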
The functional units include arithmetic and logic units, shifters and comparators, which are optimized for high frequency operation. A large register set 2212 is also provided; in certain embodiments, the register set 2212 includes two 256B register arrays to store intermediate processing results. The scheduler 2116 (
In an effort to hide host and TCB memory latency and improve throughput, the processing engine 2110 is multithreaded. The processing engine 2110 includes a thread cache 2206, running at execution core speed, which allows intermediate system state to be saved and restored. The design also provides a high-bandwidth connection between the thread cache 2206 and the working register 2204, making possible very fast and parallel transfer of thread state between the working register 2204 and the thread cache 2206. Thread context switches may occur during both receives and transmits and when waiting on outstanding memory requests or on pending DMA transactions. Specific multi-threading details are described below.
The processing engine 2110 features a cacheable control store 2208 (
After a packet is processed, process results are updated to the working register 2204. Additionally, the cache 2112 and thread cache 2206 are updated with the results in the working register 2204.
In certain embodiments, the processing engine 2110 performs TCP input processing under programmed control at high speed. The execution core 2200 also programs the DMA control unit and queues the receive DMA requests. Payload data is transferred from internal receive buffers to pre-posted locations in host memory using DMA. This low latency DMA transfer is useful for high performance. Careful design allows the TCP processing to continue in parallel with the DMA operation. On completion of TCP processing, the context is updated with the processing results and written back to the cache 2112. The scheduler 2116 also updates CQ 2132 with the completion descriptors and EQ 2134 with the status of completion, which can generate a host CPU interrupt and/or an exception. In certain embodiments, TOE driver layer 2300 may coalesce the events and interrupts for efficient processing. This queuing mechanism enables events and interrupts to be coalesced for more efficient servicing by the CPU. The execution core 2200 also generates acknowledgement (ACK) headers as part of processing.
FIG. 24D illustrates operations 2460 for processing outbound packets in accordance with certain embodiments. Control begins at block 2462 with receipt of a packet from a host via DBQ 2130. In block 2464, descriptors are fetched into cache 2112 from host memory using pointers in DBQ 2130 to access the host memory. In block 2466, a lookup for the context is scheduled. In block 2468, when the lookup is complete, the context is loaded into the working register 2204 from host memory. In block 2470, packet processing continues with the processing engine 2110 and DMA controller (TX and RX DMAs 2164, 2162) performing processing in parallel. In block 2472, wrap up processing is performed (e.g., the CQ 2132 and EQ 2134 are updated).
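The sequence of blocks 2462-2472 can be summarized in pseudo-C; the helper functions are stubs standing in for the hardware units named in the text, and only the ordering (with the DMA proceeding in parallel with processing) is the point:

struct doorbell { unsigned long host_ptr; unsigned conn_id; };

static struct doorbell dbq_pop(void)              { struct doorbell d = {0, 0}; return d; }
static void fetch_descriptors(struct doorbell *d) { (void)d; }   /* 2464: via DBQ pointers */
static void schedule_context_lookup(unsigned c)   { (void)c; }   /* 2466 */
static void load_working_register(unsigned c)     { (void)c; }   /* 2468: context loaded */
static void tcp_output_processing(unsigned c)     { (void)c; }   /* 2470: engine processing */
static void start_tx_dma(unsigned c)              { (void)c; }   /* 2470: DMA runs in parallel */
static void update_cq_and_eq(unsigned c)          { (void)c; }   /* 2472: wrap up, notify host */

void process_outbound_packet(void)
{
    struct doorbell db = dbq_pop();    /* 2462: request received from the host via DBQ 2130 */
    fetch_descriptors(&db);
    schedule_context_lookup(db.conn_id);
    load_working_register(db.conn_id);
    tcp_output_processing(db.conn_id); /* proceeds in parallel with the transmit DMA */
    start_tx_dma(db.conn_id);
    update_cq_and_eq(db.conn_id);
}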
Scheduling a lookup against the local cache 2112 identifies the connection with the corresponding connection context being loaded into the execution core 2200 working register 2204 (
In
Certain embodiments provide a multi-threaded system that hides the latency of memory accesses and other hardware functions and, thus, expedites inbound and outbound packet processing, minimizing the need for buffering and queuing. Unlike conventional approaches to multi-threading, certain embodiments implement the multiple thread mechanism in hardware, including thread suspension, scheduling, and save/restore of thread state. This frees a programmer from the responsibility of maintaining and scheduling threads and removes the element of human error. The programming model is thus far simpler than the more common model of a programmer or compiler generating multithreaded code. Also, in certain embodiments, the save/restore of thread state and switching may be programmer controlled.
Additionally, code that runs on a single-threaded engine may run on the multi-threaded processing engine 2110, but with greater efficiency. The overhead penalty from switching between threads is kept minimal to achieve better throughput.
As can be seen in
Unlike conventional approaches, the scheduler 2116 controls the switching between different threads. A thread is associated with each network packet that is being processed, both incoming and outgoing. This differs from other approaches that associate threads with each task to be performed, irrespective of the packet. The scheduler 2116 spawns a thread when a packet belonging to a new connection needs to be processed. In certain embodiments, a second packet for that same connection may not be assigned a thread until the first packet is completely processed and the updated context has been written back to cache 2112. This is under the control of the scheduler 2116. When the processing of a packet in the execution core 2200 is stalled, the thread state is saved in the thread cache 2206, and the scheduler 2116 spawns a thread for a packet on a different connection. The scheduler 2116 may also wake up a thread for a previously suspended packet by restoring thread state and allowing the thread to run to completion. In this approach, the scheduler 2116 may also spawn special maintenance threads for global tasks (e.g., such as gathering statistics on Ethernet traffic).
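The per-packet scheduling policy described here can be sketched in C as follows; the data structures are illustrative, and in the described design this policy is implemented in hardware by the scheduler 2116:

#include <stdbool.h>

#define MAX_THREADS 8                          /* thread count is an assumption */

enum thread_state { T_FREE, T_RUNNING, T_SUSPENDED };

struct thread {
    enum thread_state state;
    unsigned conn_id;                          /* connection whose packet this thread owns */
    unsigned char saved_state[64];             /* thread-cache slot for saved register state */
};

static struct thread threads[MAX_THREADS];

static bool connection_busy(unsigned conn_id)
{
    for (int i = 0; i < MAX_THREADS; i++)
        if (threads[i].state != T_FREE && threads[i].conn_id == conn_id)
            return true;                       /* a packet on this connection is still in flight */
    return false;
}

/* Spawn a thread for a newly arrived packet; a second packet on the same
 * connection is deferred until the first completes and its updated context
 * is written back to the cache. */
int spawn_packet_thread(unsigned conn_id)
{
    if (connection_busy(conn_id))
        return -1;                             /* defer until the prior packet finishes */
    for (int i = 0; i < MAX_THREADS; i++)
        if (threads[i].state == T_FREE) {
            threads[i].state = T_RUNNING;
            threads[i].conn_id = conn_id;
            return i;
        }
    return -1;                                 /* no free thread context available */
}

void suspend_thread(int t) { threads[t].state = T_SUSPENDED; }  /* state saved to the thread cache */
void resume_thread(int t)  { threads[t].state = T_RUNNING; }    /* state restored; runs to completion */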
Thus, the network protocol processing system also provides a low power/high performance solution with better Million Instructions Per Second (MIPS)/Watt than a general purpose CPU. Thus, certain embodiments provide packet processing that demonstrates TCP termination for multi-gigabit Ethernet traffic. With certain embodiments, performance analysis shows promise for achieving line speed TCP termination at 10 Gbps duplex rates for packets larger than 289 bytes, which is more than twice the performance of a single threaded design. In certain embodiments, the network protocol processing system complies with the Request for Comments (RFC) 793 TCP processing protocol, maintained by the Internet Engineering Task Force (IETF).
Certain embodiments minimize intermediate copies of payload. Conventional systems use intermediate copies of data during both transmits and receives, which results in a performance bottleneck. In conventional systems, data to be transmitted is copied from the application buffer to a buffer in OS kernel space. It is then moved to buffers in the NIC before being sent out on the network. Similarly, data that is received has to be first stored in the NIC, then moved to kernel space and finally copied into the destination application buffers. On the other hand, embodiments pre-assign buffers for data that are expected to be received to facilitate efficient data transfer.
Certain embodiments mitigate the effect of memory accesses. Processing transmits and receives requires accessing context data for each connection that may be stored in host memory. Each memory access is an expensive operation, which can take up to 100 ns. Certain embodiments optimize the TCP stack to reduce the number of memory accesses to increase performance. At the same time, certain embodiments use techniques to hide memory latency.
Certain embodiments provide quick access to state information. The context data for each Ethernet connection may be of the order of several hundred bytes. Caching the context for active connections is provided. In certain embodiments, caching context for a small number of connections (burst mode operation) is provided and results in performance improvement. In certain embodiments, the cache size is made large enough to hold the allowable number of connections. Additionally, protocol processing may require frequent and repeated access to various fields of each context. Certain embodiments provide fast local registers to access these fields quickly and efficiently to reduce the time spent in protocol processing. In addition to context data, these registers can also be used to store intermediate results during processing.
Certain embodiments optimize instruction execution. In particular, certain embodiments reduce the number of instructions to be executed by optimizing the TCP stack to reduce the processing time per packet.
Certain embodiments streamline interfaces between the host, chipset, and NIC. This addresses a source of overhead in the communication interface between the host and the NIC that reduces host efficiency. For instance, an interrupt driven mechanism tends to overload the host and adversely impact other applications running on the host.
Certain embodiments provide hardware assist blocks for specific functions, such as hardware blocks for encryption/decryption, classification, and timers.
Certain embodiments provide a multi-threading architecture to effectively hide host memory latency with a controller being implemented in hardware. Certain embodiments provide a mechanism for high bandwidth transfer of context between the working register and the thread cache, allowing fast storage and retrieval of context data. Also, this prevents the processor from stalling and hides processing latency.
Intel is a registered trademark and/or common law mark of Intel Corporation in the United States and/or foreign countries.
The described techniques may be implemented as a method, apparatus or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term “article of manufacture” as used herein refers to code or logic implemented in hardware logic (e.g., an integrated circuit chip, Programmable Gate Array (PGA), Application Specific Integrated Circuit (ASIC), etc.) or a computer readable medium, such as magnetic storage medium (e.g., hard disk drives, floppy disks, tape, etc.), optical storage (CD-ROMs, optical disks, etc.), volatile and non-volatile memory devices (e.g., EEPROMs, ROMs, PROMs, RAMs, DRAMs, SRAMs, firmware, programmable logic, etc.). Code in the computer readable medium is accessed and executed by a processor. The code in which preferred embodiments are implemented may further be accessible through a transmission media or from a file server over a network. In such cases, the article of manufacture in which the code is implemented may comprise a transmission media, such as a network transmission line, wireless transmission media, signals propagating through space, radio waves, infrared signals, etc. Thus, the “article of manufacture” may comprise the medium in which the code is embodied. Additionally, the “article of manufacture” may comprise a combination of hardware and software components in which the code is embodied, processed, and executed. Of course, those skilled in the art recognize that many modifications may be made to this configuration without departing from the scope of embodiments, and that the article of manufacture may comprise any information bearing medium known in the art.
Although the term “queue” may be used to refer to data structures for certain embodiments, other embodiments may utilize other data structures. Although the term “cache” may be used to refer to storage areas for certain embodiments, other embodiments may utilize other storage areas.
The illustrated logic of
The foregoing description of various embodiments has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the embodiments to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the embodiments be limited not by this detailed description, but rather by the claims appended hereto. The above specification, examples and data provide a complete description of the manufacture and use of the composition of the embodiments. Since many embodiments can be made without departing from the spirit and scope of the embodiments, the embodiments reside in the claims hereinafter appended.