Networks enable computers and other devices to communicate. For example, networks can carry data representing video, audio, e-mail, and so forth. Typically, data sent across a network is divided into smaller messages known as packets. By analogy, a packet is much like an envelope you drop in a mailbox. A packet typically includes a “payload” and a “header”. The packet's “payload” is analogous to the letter inside the envelope. The packet's “header” is much like the information written on the envelope itself. The header can include information to help network devices handle the packet appropriately.
A number of network protocols cooperate to handle the complexity of network communication. For example, a protocol known as Transmission Control Protocol (TCP) provides “connection” services that enable remote applications to communicate. Behind the scenes, TCP handles a variety of communication issues such as data retransmission, adapting to network traffic congestion, and so forth.
To provide these services, TCP operates on packets known as segments. Generally, a TCP segment travels across a network within (“encapsulated” by) a larger packet such as an Internet Protocol (IP) datagram. Frequently, an IP datagram is further encapsulated by an even larger packet such as a link layer frame (e.g., an Ethernet frame). The payload of a TCP segment carries a portion of a stream of data sent across a network by an application. A receiver can restore the original stream of data by reassembling the received segments. To permit reassembly and acknowledgment (ACK) of received data back to the sender, TCP associates a sequence number with each payload byte.
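For illustration only, the sketch below (in C, with hypothetical field and function names not taken from the text) shows how a receiver might use the per-byte sequence numbering to decide whether a received segment extends the in-order stream and how the cumulative acknowledgment point advances.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical view of the fields a receiver consults when a segment arrives. */
struct tcp_segment {
    uint32_t seq;          /* sequence number of the first payload byte */
    uint32_t payload_len;  /* number of payload bytes in this segment   */
};

/* rcv_nxt is the sequence number of the next in-order byte the receiver expects. */
bool segment_is_in_order(const struct tcp_segment *seg, uint32_t rcv_nxt)
{
    return seg->seq == rcv_nxt;
}

/* After consuming an in-order segment, advance the expected sequence number.
 * The new value is also what the receiver reports in its cumulative ACK. */
uint32_t advance_rcv_nxt(const struct tcp_segment *seg, uint32_t rcv_nxt)
{
    return rcv_nxt + seg->payload_len;  /* unsigned arithmetic wraps modulo 2^32 */
}
```
Segments whose sequence numbers do not match the expected value can be held aside and reassembled once the intervening data arrives.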
Many computer systems and other devices feature host processors (e.g., general purpose Central Processing Units (CPUs)) that handle a wide variety of computing tasks. Often these tasks include handling network traffic such as TCP/IP connections. The increases in network traffic and connection speeds have placed growing demands on host processor resources. To at least partially alleviate this burden, some have developed TCP Off-load Engines (TOE) dedicated to off-loading TCP protocol operations from the host processor.
Faster network communication speeds have increased the burden of packet processing on host systems. In short, more packets need to be processed in less time. Fortunately, processor speeds have continued to increase, partially absorbing these increased demands. Improvements in the speed of memory, however, have generally failed to keep pace. Each memory access that occurs during packet processing represents a potential delay as the processor awaits completion of the memory operation. Many network protocol implementations access memory a number of times for each packet. For example, a typical TCP/IP implementation performs a number of memory operations for each received packet including copying payload data to an application buffer, looking up connection related data, and so forth.
This description illustrates a variety of techniques that can increase the packet processing speed of a system despite delays associated with memory accesses by enabling the processor to perform other operations while memory operations occur. These techniques may be implemented in a variety of environments such as the sample computer system shown in
As shown, the CPU 112 features an internal cache 108 that provides faster access to data than provided by memory 114. Typically, the cache 108 and memory 114 form an access hierarchy. That is, the cache 108 will attempt to respond to CPU 112 memory access requests using its small set of quickly accessible copies of memory 114 data. If the cache 108 does not store the requested data (a cache miss), the data will be retrieved from memory 114 and placed in the cache 108. Potentially, the cache 108 may victimize entries from the cache's 108 limited storage space to make room for new data.
In a variety of packet processing operations, cache misses occur at predictable junctures. For example, conventionally, a NIC transfers received packet data to memory and generates an interrupt notifying the CPU. When the CPU initially attempts to access the received data, a cache-miss occurs, temporarily stalling processing as the packet data is retrieved from memory.
In the example shown, the NIC 102 can cause direct placement of data in the CPU 112 cache 108 instead of merely storing the data in memory 114. When the CPU 112 attempts to access the data, a cache miss is less likely to occur and the ensuing memory 114 access delay can be avoided.
Direct cache access may vary in other implementations. For example, the NIC 102 may be configured to directly access the cache 108 instead of using controller 104 as an intermediate agent. Additionally, in a system featuring multiple CPUs 112 and/or multiple caches 108 (e.g., L1 and L2 caches), the direct cache access request may specify the target CPU and/or cache 108. For example, the target CPU and/or cache 108 may be determined based on protocol information within the packet (e.g., a TCP/IP tuple identifying a connection). Pushing data into the relatively large last-level caches can minimize premature victimization of cached data.
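A minimal sketch, assuming a toy hash function and a hypothetical cpu_count parameter, of how the connection tuple carried in a packet might be mapped to a target CPU (and hence a target cache) for a direct cache access request:

```c
#include <stdint.h>

/* Hypothetical TCP/IP 4-tuple identifying a connection. */
struct conn_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Toy hash for illustration; a real design might use a Toeplitz or CRC hash. */
static uint32_t tuple_hash(const struct conn_tuple *t)
{
    uint32_t h = t->src_ip ^ t->dst_ip;
    h ^= ((uint32_t)t->src_port << 16) | t->dst_port;
    h ^= h >> 16;
    return h;
}

/* Map the hash to one of cpu_count CPUs so that all packets of a given
 * connection are steered to the same CPU and cache. */
unsigned target_cpu(const struct conn_tuple *t, unsigned cpu_count)
{
    return tuple_hash(t) % cpu_count;
}
```
Steering all packets of a connection to the same cache keeps the connection's state (e.g., its TCB) warm in one place.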
Though
The technique shown in
As shown,
As shown in
Direct cache access and fetching can be combined in a variety of ways. For example, instead of pushing data into the cache as described above, the NIC 102 can write packet data to memory 114 and issue a fetch command to the CPU. This variation can achieve a similar cache hit frequency.
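As an illustration of the fetch variant, the sketch below uses the GCC/Clang __builtin_prefetch intrinsic as a stand-in for the fetch command issued to the CPU; the packet buffer layout is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for a packet the NIC has written to memory. */
struct rx_packet {
    const uint8_t *header;   /* address of the packet header in memory  */
    const uint8_t *payload;  /* address of the packet payload in memory */
    size_t payload_len;
};

/* Request that the header be brought into the cache ahead of use, then go on
 * with other work; by the time process_header() runs, the data may already be
 * cache-resident and the compulsory miss avoided. */
void prefetch_then_process(struct rx_packet *pkt,
                           void (*other_work)(void),
                           void (*process_header)(const uint8_t *))
{
    __builtin_prefetch(pkt->header, 0 /* read */, 3 /* keep in cache */);
    other_work();                 /* overlap useful work with the memory fetch */
    process_header(pkt->header);  /* likely hits in the cache now */
}
```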
In
Though CPU 112 generally executes instructions of one thread at a time, the CPU 112 can switch between the different threads, executing instructions of one thread and then another. This multi-threading can be used to mask the cost of memory operations. For example, if a thread yields after issuing a memory request, other threads can be executed while the memory operation proceeds. By the time execution of the original thread resumes, the memory operation may have completed.
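A minimal sketch of the fetch-then-yield pattern in a cooperative threading model; thread_yield() here is a stub standing in for whatever mechanism transfers control back to the scheduler, and the TCB state field is hypothetical.

```c
#include <stdint.h>

/* Stub: in a real light-weight threading scheme this would save the current
 * thread's context and transfer control to the scheduler. */
static void thread_yield(void) { }

/* Issue a fetch for the data, yield so other threads run while the memory
 * operation proceeds, then access the data, which is now more likely to hit
 * in the cache. */
uint32_t read_connection_state(const volatile uint32_t *tcb_state)
{
    __builtin_prefetch((const void *)tcb_state, 0, 3);  /* issue the fetch  */
    thread_yield();                                     /* let others run   */
    return *tcb_state;                                  /* likely cache hit */
}
```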
A system may handle the thread switching in a variety of ways. For example, switching may occur in response to a software instruction surrendering CPU 112 execution of the thread 126n. For example, in
A variety of context-switching mechanisms may be used in a multi-threading scheme. For example, a CPU 112 may include hardware that automatically copies/restores context data for different threads. Alternately, software may implement a “light-weight” threading scheme that does not require hardware support. That is, instead of relying on hardware to handle context save/restoring, software instructions can store/restore context data.
As shown in
A variety of software architectures may be used to implement multi-threading. For example, a thread yielding execution control may write its context to a cache and branch to an event handler that selects and transfers control to a different thread. Thread 126a scheduling may be performed in a variety of ways, for example, using a round-robin or priority based scheme. For instance, a scheduling thread may maintain a thread queue that appends recently “yielded” threads to the bottom of the queue. Potentially, a thread may be ineligible for execution until a pending memory operation completes.
While each thread 126a-126n has its own context, different threads may execute the same set of instructions. This allows a given set of operations to be “replicated” to the proper scale of execution. For instance, a thread may be replicated to handle received TCP/IP packets for one or more TCP/IP connections.
Thread activity can be controlled using “wake” and “sleep” scheduling operations. The wake operation adds a thread to a queue (e.g., a “RunQ”) of active threads while a sleep operation removes the thread from the queue. Potentially, the scheduling thread may fetch data to be accessed by a wakened thread.
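One way the RunQ mentioned above might look; the thread record and queue operations are hypothetical and are shown only to make the wake/sleep scheme concrete.

```c
#include <stddef.h>

/* Hypothetical light-weight thread record; context fields omitted. */
struct lw_thread {
    struct lw_thread *next;              /* link within the run queue */
    void (*entry)(struct lw_thread *);
};

/* Simple FIFO run queue: wake appends to the tail (recently yielded or
 * wakened threads run last), while the scheduler pops from the head. */
struct run_queue {
    struct lw_thread *head, *tail;
};

void wake(struct run_queue *q, struct lw_thread *t)   /* make thread runnable */
{
    t->next = NULL;
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
}

struct lw_thread *next_runnable(struct run_queue *q)   /* round-robin pick */
{
    struct lw_thread *t = q->head;
    if (t) {
        q->head = t->next;
        if (!q->head) q->tail = NULL;
    }
    return t;   /* NULL means no thread is currently runnable */
}
```
In this sketch the sleep operation is simply not re-inserting a thread into the queue until a later wake.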
The threads 126a-126n may use a variety of mechanisms to intercommunicate. For example, a thread handling TCP receive operations for a connection and a thread handling TCP transmit operations for the same connection may both vie for access to the connection's TCP Transmission Control Block (TCB). To address contention issues, a locking mechanism may be provided. For example, the event handler may maintain a queue for threads requesting access to resources locked by another thread. When a thread requests a lock on a given resource, the scheduler may save the thread's context data in the lock queue until the lock is released.
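A sketch of such a lock and its queue of waiters, again using hypothetical types; a thread that fails to acquire the lock is parked on the lock's queue and handed back to the run queue when the lock is released.

```c
#include <stddef.h>
#include <stdbool.h>

struct lw_thread {                  /* minimal thread record, as sketched above */
    struct lw_thread *next;
};

struct tcb_lock {                   /* hypothetical lock guarding one TCB       */
    struct lw_thread *owner;        /* NULL when the lock is free                */
    struct lw_thread *wait_head, *wait_tail;   /* threads parked on this lock   */
};

/* Try to take the lock; on contention the caller is parked on the lock's
 * queue and the scheduler should run some other thread. */
bool lock_acquire(struct tcb_lock *l, struct lw_thread *t)
{
    if (l->owner == NULL) {
        l->owner = t;
        return true;                /* caller may proceed */
    }
    t->next = NULL;                 /* park the thread until release */
    if (l->wait_tail) l->wait_tail->next = t; else l->wait_head = t;
    l->wait_tail = t;
    return false;
}

/* Release the lock and hand it to the first waiter, if any; the returned
 * thread should be given back to the run queue (a "wake"). */
struct lw_thread *lock_release(struct tcb_lock *l)
{
    struct lw_thread *next_owner = l->wait_head;
    if (next_owner) {
        l->wait_head = next_owner->next;
        if (!l->wait_head) l->wait_tail = NULL;
    }
    l->owner = next_owner;
    return next_owner;
}
```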
In addition to locking/unlocking, threads 126 may share a commonly accessible queue that the threads can push/pop data to/from. For example, a thread may perform operations on a set of packets and push the packets onto the queue for continued processing by a different thread.
Fetching and multi-threading can complement one another in a variety of packet processing operations. For example, a linked list may be navigated by fetching the next node in the list and yielding. Again, this can conserve processing cycles otherwise spent waiting for the next list element to be retrieved.
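A sketch of such a list traversal, under the same assumptions as above (a stub thread_yield() and GCC's __builtin_prefetch as the fetch mechanism), with hypothetical node contents:

```c
#include <stddef.h>

struct list_node {
    struct list_node *next;
    int key;
};

/* Stub standing in for the scheduler entry point described earlier. */
static void thread_yield(void) { }

/* Search the list for key, fetching each node's successor ahead of use and
 * yielding so other threads run while the next node is retrieved from memory. */
struct list_node *find(struct list_node *head, int key)
{
    for (struct list_node *n = head; n != NULL; n = n->next) {
        if (n->next) {
            __builtin_prefetch(n->next, 0, 3);  /* start fetching next node */
            thread_yield();                     /* overlap with other work  */
        }
        if (n->key == key)
            return n;
    }
    return NULL;
}
```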
As shown, direct cache access, fetching, and multi-threading can reduce the processing cost of memory operations by continuing processing while a memory operation proceeds. Potentially, these techniques may be used to speed copy operations that occur during packet processing (e.g., copying reassembled data to an application buffer). Conventionally, a copy operation proceeds under the explicit control of the CPU 112. That is, data is read from memory 114 into the CPU 112, then written back to memory 114 at a different location. Depending on the amount of data being copied, such as a packet with a large payload, this can tie up a significant number of processing cycles. To reduce the cost of a copy, packet data may be pushed into the cache or fetched before being written to its destination. Alternately,
The copy circuitry 122 may perform asynchronous, independent copying between a variety of source and target devices (e.g., to/from memory 114, NIC 102, and cache 108). For example,
To identify completion of the copy, the circuitry 122 can write completion status into a predefined memory location that can be polled by the CPU 112 or the circuitry 122 can generate a completion signal. Potentially, the circuitry 122 can handle multiple on-going copy operations simultaneously, for example, by pipelining copy operations.
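A sketch of what driving such copy circuitry might look like from software; the register layout and the location of the completion flag are hypothetical and serve only to illustrate the issue-then-poll flow.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical memory-mapped interface to the copy circuitry. */
struct copy_engine_regs {
    volatile uint64_t src;        /* source address                   */
    volatile uint64_t dst;        /* destination address              */
    volatile uint64_t len;        /* bytes to copy                    */
    volatile uint64_t status_ptr; /* where to write completion status */
    volatile uint64_t go;         /* writing 1 starts the copy        */
};

/* Start an asynchronous copy; the CPU is free to do other work afterwards. */
void copy_async(struct copy_engine_regs *eng,
                const void *src, void *dst, uint64_t len,
                volatile uint64_t *done_flag)
{
    *done_flag = 0;
    eng->src = (uint64_t)(uintptr_t)src;
    eng->dst = (uint64_t)(uintptr_t)dst;
    eng->len = len;
    eng->status_ptr = (uint64_t)(uintptr_t)done_flag;
    eng->go = 1;
}

/* Poll the predefined memory location the circuitry writes on completion. */
bool copy_done(const volatile uint64_t *done_flag)
{
    return *done_flag != 0;
}
```
A thread could issue copy_async(), yield, and check copy_done() after resuming, keeping the CPU busy while the copy proceeds.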
As shown in
The NIC 102 data transfers may occur via Direct Memory Access (DMA) to memory 114. To reduce “compulsory” cache misses, the NIC 102 may also (or alternately) initiate a direct cache access to store the packet's 130 descriptor and header in cache 108 in anticipation of imminent CPU 112 processing of the packet 130. As shown, the NIC 102 notifies the CPU 112 of the packet's 130 arrival by signaling an interrupt. Potentially, the NIC 102 may use an interrupt moderation scheme to notify the CPU 112 after arrival of multiple packets. Processing batches of multiple packets enables the CPU 112 to better control cache contents by fetching data for each packet in the batch before processing.
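A sketch of processing an interrupt-moderated batch: fetches are issued for every packet's header before any packet is processed, so each packet's data has time to arrive in the cache. The descriptor layout is hypothetical.

```c
#include <stddef.h>

/* Hypothetical per-packet descriptor produced by the NIC. */
struct pkt_desc {
    void *header;    /* address of the packet header in memory */
    void *payload;
    size_t len;
};

/* Issue fetches for the whole batch up front, then process each packet; by
 * the time packet i is processed its header is likely already cached. */
void process_batch(struct pkt_desc *batch, size_t count,
                   void (*process)(struct pkt_desc *))
{
    for (size_t i = 0; i < count; i++)
        __builtin_prefetch(batch[i].header, 0, 3);
    for (size_t i = 0; i < count; i++)
        process(&batch[i]);
}
```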
As shown in
The fast threads 158 consume enqueued packets in turn. After dequeueing a packet entry, a fast thread 158 performs a lookup of the TCB for a packet's connection. A wide variety of algorithms and data structures may be used to perform TCB lookups. For example,
To perform a lookup, the nodes in a row identified by a hash of the packet's tuple are searched until a node matching the packet's tuple is found. The referenced TCB block 140a-140n can then be retrieved. A TCB block 140a-140n can include a variety of TCP state data (e.g., connection state, window size, next expected byte, and so forth). A TCB block 140a-140n may include or reference other connection related data such as identification of out-of-order packets awaiting delivery, connection-specific queues (e.g., a queue of pending application read or write requests), and/or a list of connection-specific timer events.
Like many TCB lookup schemes, the scheme shown may require multiple memory operations to finally retrieve a TCB block 140a-140n. To alleviate the burden of TCB lookup, a system may incorporate techniques described above. For example, the NIC 102 may compute the TCP tuple hash upon receipt of a packet. Similarly, the event handler thread 162 may fetch data to speed the lookup. For example, the event handler 162 may fetch the table 142 row corresponding to a packet's hash value. Additionally, in the event that collisions are rare, a programmer may code the event handler 162 to fetch the TCB block 140a-140n associated with the first node of a row 142a-142n.
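A sketch of the hash-table lookup described above, with the event handler's fetch of the row folded in; the node layout, table size, and TCB contents are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

struct tcb;                      /* connection state block (fields omitted)  */

struct tcb_node {                /* one node in a hash-table row (chain)     */
    struct tcb_node *next;
    uint64_t tuple;              /* packed connection tuple for comparison   */
    struct tcb *tcb;
};

#define TCB_TABLE_ROWS 1024
struct tcb_node *tcb_table[TCB_TABLE_ROWS];   /* row heads indexed by hash   */

/* Called by the event handler before scheduling a fast thread: fetch the
 * first node of the row selected by the packet's hash.  A further fetch of
 * that node's TCB block could be issued once the node has arrived (e.g.,
 * after a yield), on the assumption that collisions are rare. */
void prefetch_tcb_row(uint32_t hash)
{
    struct tcb_node *head = tcb_table[hash % TCB_TABLE_ROWS];
    if (head)
        __builtin_prefetch(head, 0, 3);
}

/* Fast-thread lookup: walk the row until a node matching the tuple is found. */
struct tcb *tcb_lookup(uint32_t hash, uint64_t tuple)
{
    for (struct tcb_node *n = tcb_table[hash % TCB_TABLE_ROWS]; n; n = n->next)
        if (n->tuple == tuple)
            return n->tcb;
    return NULL;                 /* no connection state found for this tuple */
}
```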
A TCB lookup forms part of a variety of TCP operations. For example,
The thread 158 may then determine 174 whether an application has issued a pending request for received data. Such a request typically identifies a buffer to place the next sequence of data in the connection data stream. The sample scheme depicted can include the pending requests in a list anchored in the connection's TCB block. As shown, if a request is pending, the thread can copy the payload data from the buffer(s) 136 and notify 178 the application of the posted data. To perform this copy, the thread may initiate transfer using the asynchronous memory copy (see
As described above, the receive threads 158 interface with an application, for example, to notify the application of serviced receive requests.
As shown, the event handler thread 160 monitors the doorbell queue 188 and schedules processing of the received request by an application interface thread (AIFW) 164. The event handler thread 160 may also fetch data used by the application interface threads 164 such as TCB nodes/blocks. The application interface threads 164 dequeue the doorbell entries and perform interface operations in response to the request. In the case of receive requests, an interface thread 164 can check the connection's TCB for in-order data that has been received but not yet consumed. Alternately, the thread can add the request to a connection's list 144 of pending requests in the connection's TCB.
In the case of application transmit requests, the event handler thread 126 also enqueues 186 these requests for processing by application interface threads 164. Again, the event handler 126 may fetch data (e.g., the TCB or TCB related data) used by the interface threads 164.
As shown in
The transmit threads 162 perform operations to construct a TCP/IP packet and deliver the packet to the NIC 102. Delivery to the NIC 102 is made by allocating and sending a NIC descriptor to the NIC 102. The NIC descriptor can include the payload buffer address and an address of a constructed TCP/IP header. The NIC descriptors may be maintained in a pool of free descriptors. The pool shrinks as the transmit threads 162 allocate descriptors. After the NIC issues a completion notice, for example, by a direct cache access push by the NIC, the event handler 126 may replenish freed descriptors back into the pool.
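A sketch of a free-descriptor pool of the kind described; the descriptor contents and pool size are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical NIC transmit descriptor. */
struct nic_desc {
    uint64_t header_addr;    /* address of the constructed TCP/IP header */
    uint64_t payload_addr;   /* address of the payload buffer            */
    uint32_t payload_len;
};

#define DESC_POOL_SIZE 256

static struct nic_desc desc_pool[DESC_POOL_SIZE];
static struct nic_desc *free_list[DESC_POOL_SIZE];
static size_t free_top;                      /* number of free descriptors */

/* Fill the free list once at startup. */
void desc_pool_init(void)
{
    for (size_t i = 0; i < DESC_POOL_SIZE; i++)
        free_list[i] = &desc_pool[i];
    free_top = DESC_POOL_SIZE;
}

/* Transmit thread: take a descriptor from the pool (the pool shrinks). */
struct nic_desc *desc_alloc(void)
{
    return free_top ? free_list[--free_top] : NULL;
}

/* Event handler, on a NIC completion notice: return the descriptor. */
void desc_free(struct nic_desc *d)
{
    free_list[free_top++] = d;
}
```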
To construct a packet, a transmit thread 162 may fetch data indirectly referenced by the connection's TCB such as a header template, route cache data, and NIC data structures referenced by the route cache data. The thread 162 may yield after issuing the data fetches. After resuming, the thread 162 may proceed with TCP transmit operations such as flow control checks, segment size calculation, window management, and determination of header options. The thread may also fetch a NIC descriptor from the descriptor pool.
Potentially, the determined TCP segment size may be able to hold more data than requested by a given TxWQ entry. Thus, a transmit thread 162 may navigate through the list of pending TxWQ entries using fetch/yield to gather more data to include in the segment. This may continue until the segment is filled. After constructing the packet, the thread can initiate transfer of the packet's NIC descriptor, header, and payload to the NIC. The transmit thread 162 may also add an entry to the connection's list of outstanding transmit I/O requests and TCP unacknowledged bytes.
In addition to the fast transmit threads 162 shown, the sample implementation may also feature slow transmit threads (not shown) that handle less time critical messaging (e.g., connection setup).
The timer threads can be scheduled at regular intervals by the event handler to process the timer events. The timer threads may navigate the linked list of timers associated with a time bucket using fetch and/or fetch/yield techniques described above.
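A sketch of walking one time bucket's linked list of timer events with the fetch/yield pattern; the timer record is hypothetical and thread_yield() is again a stub for the scheduler entry point.

```c
#include <stdint.h>
#include <stddef.h>

/* Stub standing in for the scheduler entry point described earlier. */
static void thread_yield(void) { }

struct timer_event {
    struct timer_event *next;        /* next timer in this time bucket */
    uint64_t expires;                /* expiry time in ticks           */
    void (*fire)(struct timer_event *);
};

/* Timer thread body for one bucket: fetch the next timer while handling the
 * current one, yielding so other threads run during the memory access. */
void process_bucket(struct timer_event *bucket_head, uint64_t now)
{
    for (struct timer_event *t = bucket_head; t != NULL; t = t->next) {
        if (t->next) {
            __builtin_prefetch(t->next, 0, 3);
            thread_yield();
        }
        if (t->expires <= now)
            t->fire(t);
    }
}
```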
Again, while
The term circuitry as used herein includes hardwired circuitry, digital circuitry, analog circuitry, programmable circuitry, and so forth. The programmable circuitry may operate on executable instructions disposed on an article of manufacture (e.g., a volatile or non-volatile storage device).
Other embodiments are within the scope of the following claims.