The present invention relates to a hardware system for managing buffers for queues of pointers to stored network packets.
In traditional Network Interface Cards/Components, ingress and egress traffic is handled using dedicated queues of pointers. These pointers are the memory addresses where packets are stored when received from the network and before transmission to the network.
Software must continuously monitor that enough pointers (and related memory locations) are available for received packets, and also that pointers that are no longer needed after a packet has been transmitted are reused on the receive side. This task consumes resources and must be error free, otherwise memory leakage will appear, leading to system degradation. Such a mechanism is used in current devices.
Patent U.S. Pat. No. 6,904,040 titled “Packet Preprocessing Interface for Multiprocessor Network Handler”, assigned to International Business Machines Corporation and granted on 2005, Jun. 7, discloses a network handler using a DMA device to assign packets to network processors in accordance with a mapping function which classifies packets based on their content.
According to an aspect of the present invention, there is provided a network processor according to claim 1.
An advantage of this aspect is that the RQR and SQR hide most of the queue, buffer and cache management from the software. After initialization, software no longer needs to manage buffer pointers.
Another advantage is that when software runs over multiple cores and/or in multiple threads, multiple applications may run in parallel without having to manage packet memory, which is seen as a common resource.
Further advantages of the present invention will become clear to the skilled person upon examination of the drawings and detailed description. It is intended that any additional advantages be incorporated therein.
Embodiments of the present invention will now be described by way of example with reference to the accompanying drawings in which like references denote similar elements, and in which:
The SQR receives a send queue work element (215) (or SQWE) from the completion unit (210). The role of the completion unit comprises:
The dequeue module (255) will send to the queue manager (220) the dequeued send work element (225) (represented as a WQE in
When an enqueue pool (245) is full, the SQR will write (233) its content to memory (230) using the DMA Writer (235) and empty the enqueue pool (245). Furthermore, when a dequeue pool is empty, the SQR will refill it by reading (237) one or more SQWE from memory (230) using the DMA Reader (239) and copying them to the dequeue pool (250).
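The pool flush and refill behavior described above can be modeled in software. The following is a minimal behavioral sketch, not the patented hardware itself; the class and method names, and the pool size of 4, are illustrative assumptions chosen to match the burst sizes discussed later in this description.

```python
from collections import deque

POOL_SIZE = 4  # number of SQWEs moved per DMA burst (assumed)

class SendQueueReplenisher:
    """Software model of the SQR enqueue/dequeue pool logic."""

    def __init__(self):
        self.enqueue_pool = deque()   # latches holding SQWEs awaiting a DMA write
        self.dequeue_pool = deque()   # latches holding SQWEs recently DMA-read
        self.memory_queue = deque()   # FIFO send queue backing store in memory

    def enqueue(self, sqwe):
        """Accept an SQWE (e.g. from the completion unit)."""
        self.enqueue_pool.append(sqwe)
        if len(self.enqueue_pool) == POOL_SIZE:
            # Pool full: write its content to memory in one burst, then empty it.
            self.memory_queue.extend(self.enqueue_pool)
            self.enqueue_pool.clear()

    def dequeue(self):
        """Hand the next SQWE to the queue manager."""
        if not self.dequeue_pool:
            # Pool empty: refill it with a burst read from the send queue head.
            for _ in range(min(POOL_SIZE, len(self.memory_queue))):
                self.dequeue_pool.append(self.memory_queue.popleft())
        return self.dequeue_pool.popleft() if self.dequeue_pool else None
```

Because every structure in the model is FIFO, elements are dequeued in the order they were enqueued, mirroring the ordering guarantee discussed below.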
One dequeue pool (250) and one enqueue pool (245) are in general associated with one send queue in memory. Furthermore, there are in general one dequeue pool (250) and one enqueue pool (245) for each queue pair. Finally, the enqueue pool (245), the dequeue pool (250) and the associated send queue are in general first in first out (FIFO) queues. A main reason for this configuration is to ensure that the SQWEs are transmitted in the order in which they are enqueued by the completion unit (210). It is possible to choose different configurations for the enqueue pool (245), the dequeue pool (250) and the send queue (either not FIFO, or in different numbers), although such configurations would require further mechanisms to ensure packets are transmitted in order. Such implementations would not deviate from the teachings of the present invention.
In a preferred embodiment the SQWE is 16 bytes, and the virtual address (300) is 8 bytes.
The RQR receives a RQWE for enqueueing along with an identifier of the queue pair and of the receive queue in which the RQWE should be enqueued. This element (412) is received at initialization time from a software thread (410). After initialization a RQWE, along with a queue pair number and receive queue number (417), should in most cases be received from the queue manager (220), thus achieving automatic memory management by hardware. A case where a RQWE would be received from a software thread (410) after initialization is when the software decides to recycle the pointer itself.
Each enqueue pool (423) and dequeue pool (425) is associated with one receive queue stored in memory (430).
In case of a dequeue (443), a RQWE is removed from a dequeue pool (425) in the relevant queue pair (420) and is sent (455) to the completion unit (210) along with an identifier of the queue pair (420) and of the receive queue associated with the dequeue pool (425) from which the RQWE was pulled. The completion unit then forwards the element and the identifier to a software thread.
The SQR maintains a hardware managed send queue (620) by enqueueing SQWE to the tail (650) of the send queue and dequeueing SQWE from the head (660) of the send queue. It receives SQWE from the Completion Unit (210) and provides SQWE to the queue manager (220). It maintains a small cache of SQWE per queue pair waiting to be DMAed to memory and another small cache of SQWE that were recently DMAed from memory. If the send queue is empty, there is a path (640) whereby writing to and reading from memory can be bypassed, and SQWE are moved directly from the enqueue pool (600) to the dequeue pool (610).
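The bypass path (640) can be illustrated with a small software model. This is a hedged sketch with illustrative names and pool sizes, not the hardware implementation: when the in-memory send queue is empty and the dequeue pool has room, SQWEs move directly from producer to consumer without a memory round trip.

```python
from collections import deque

class SQRBypass:
    """Model of the SQR bypass: skip DMA when the send queue is empty."""

    POOL_SIZE = 4  # illustrative latch count per pool

    def __init__(self):
        self.enqueue_pool = deque()
        self.dequeue_pool = deque()
        self.send_queue = deque()   # in-memory queue; empty => bypass possible
        self.dma_writes = 0         # counts DMA write bursts actually performed

    def enqueue(self, sqwe):
        if not self.send_queue and len(self.dequeue_pool) < self.POOL_SIZE:
            # Bypass path: move the SQWE straight to the dequeue pool.
            self.dequeue_pool.append(sqwe)
        else:
            self.enqueue_pool.append(sqwe)
            if len(self.enqueue_pool) == self.POOL_SIZE:
                self.send_queue.extend(self.enqueue_pool)
                self.enqueue_pool.clear()
                self.dma_writes += 1

    def dequeue(self):
        if not self.dequeue_pool and self.send_queue:
            for _ in range(min(self.POOL_SIZE, len(self.send_queue))):
                self.dequeue_pool.append(self.send_queue.popleft())
        return self.dequeue_pool.popleft() if self.dequeue_pool else None
```

Under light load the send queue stays empty and no DMA traffic is generated at all; the memory-backed path is only exercised once the dequeue pool is saturated.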
In a preferred embodiment, the enqueue pool comprises a set of 3 latches for temporarily storing SQWE. When a 4th RQWE is received, the 3 SQWEs in the enqueue pool (600) and the received 4th SQWE are written to the tail of the send queue (620) stored in memory. The enqueue pool (600) could also comprise 4 latches.
In a preferred embodiment 4 SQWE of 16 bytes are written at the same time to memory using DMA write. This is optimal when a DMA allowing transfer of 64 bytes is used. Various numbers of SQWEs can be transferred simultaneously from and to memory based on the needs of a specific configuration.
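The burst sizes in the preferred embodiments follow directly from this arithmetic: the number of elements per burst is chosen so that one DMA transfer moves exactly one 64-byte cache line. The short check below illustrates the calculation; the function name is illustrative.

```python
DMA_LINE_BYTES = 64  # cache line moved per DMA transfer in the preferred embodiment

def elements_per_burst(element_bytes):
    """Number of work elements that exactly fill one DMA cache line."""
    assert DMA_LINE_BYTES % element_bytes == 0, "element size must divide the line"
    return DMA_LINE_BYTES // element_bytes

# 16-byte SQWEs on the send side: 4 per burst, as described above.
print(elements_per_burst(16))
# 8-byte RQWEs on the receive side: 8 per burst, matching the RQR embodiment below.
print(elements_per_burst(8))
```

The same reasoning covers the receive side, where 8 RQWEs of 8 bytes likewise fill one 64-byte transfer.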
In a preferred embodiment, the enqueue pool (600), the dequeue pool (610) and the send queue (620) are FIFO queues so that the order of SQWE as received from the completion unit (210) is maintained.
The number of elements (630) in the send queue (620) is determined at initialization time; however, mechanisms can be put in place to dynamically extend the size of the send queue (620).
The RQR maintains a hardware managed receive queue (720) by enqueueing RQWE to the tail (750) of the queue and dequeueing RQWE from the head (760) of the queue. It receives RQWE from the queue manager (220) and from software (410), for example via ICSWX coprocessor commands. It then provides the RQWE to the identified receive queue and queue pair. It maintains a small cache (710) of RQWE per queue pair that were recently DMAed from memory or given by SQM/ICS. When the cache becomes near empty, the RQR replenishes it by fetching (760) some RQWEs from memory to serve the next request. In a symmetric way, when the cache becomes near full, the RQR writes (750) some RQWEs from the cache into system memory to serve the next request from the queue manager or ICS. If the cache is neither near full nor near empty, RQWEs flow from providers to consumers (740) without going through system memory.
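The near-empty/near-full cache policy described above can be sketched as follows. This is a software model with assumed thresholds and names, not the hardware design; it only shows how bursts of RQWEs are spilled to or fetched from system memory around a small on-chip cache.

```python
from collections import deque

class RQRCache:
    """Model of the RQR per-queue-pair RQWE cache with spill/replenish bursts."""

    NEAR_EMPTY = 2    # assumed low-water mark
    NEAR_FULL = 14    # assumed high-water mark
    BURST = 8         # RQWEs moved per DMA burst (8 x 8 bytes = 64 bytes)

    def __init__(self):
        self.cache = deque()
        self.memory = deque()   # in-memory receive queue backing store

    def provide(self, rqwe):
        """RQWE arriving from the queue manager or from software."""
        if len(self.cache) >= self.NEAR_FULL:
            # Near full: spill a burst of cached RQWEs to system memory.
            for _ in range(self.BURST):
                self.memory.append(self.cache.popleft())
        self.cache.append(rqwe)

    def consume(self):
        """RQWE requested, e.g. to store an incoming packet."""
        if len(self.cache) <= self.NEAR_EMPTY and self.memory:
            # Near empty: replenish a burst from system memory.
            for _ in range(min(self.BURST, len(self.memory))):
                self.cache.append(self.memory.popleft())
        return self.cache.popleft() if self.cache else None
```

Note that this model may reorder RQWEs relative to their arrival order, which is acceptable on the receive side since, as stated below, the order of RQWE does not need to be maintained.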
In a preferred embodiment, the enqueue pool (700) comprises a set of 8 latches for temporarily storing RQWEs. When an 8th RQWE is enqueued, the 8 RQWEs in the enqueue pool (700) are written to the tail of the receive queue (720) stored in memory. The enqueue pool (700) could also comprise a different number of latches.
In a preferred embodiment 8 RQWE of 8 bytes are written at the same time to memory using DMA write. This is optimal when a DMA allowing transfer of 64 bytes is used. Various numbers of RQWEs can be transferred simultaneously from and to memory based on the needs of a specific configuration.
In a preferred embodiment, the enqueue pool (700), the dequeue pool (710) and the receive queue (720) can be FIFO queues or stacks (last in first out queues), as the order of RQWE does not need to be maintained.
The number of elements (730) in the receive queue (720) is determined at initialization time, however mechanisms can be put in place to dynamically extend the size of the receive queue (720).
Another embodiment comprises a method for adding specific hardware on both the receive and transmit sides that hides from the software most of the effort related to buffer and pointer management. At initialization, a set of pointers and buffers is provided by software, in a quantity large enough to support the expected traffic. A Send Queue Replenisher (SQR) and a Receive Queue Replenisher (RQR) hide RQ and SQ management from software. The RQR and SQR fully monitor the pointer queues and perform recirculation of pointers from the transmit side to the receive side.
The RQ/RQR is preloaded with a number of RQWE large enough to guarantee no depletion of the RQ until WQE may be received back from the SQ.
When a packet is received, a QP is selected by the hardware using the hash performed on defined packet header fields; the RQWE at the head of the RQR cache for the corresponding RQ is used.
The RQWE contains the address at which to store the packet content in memory; data transfer is fully handled by the hardware.
When the packet has been loaded into memory, a CQE is created by the hardware that contains the memory address used for storing the packet (RQWE) and miscellaneous data on the packet (size, Ethernet flags, errors, sequencing . . . ).
The CQE is scheduled by the hardware to an available thread. The elected thread processes the CQE.
The thread performs whatever processing is needed on the received packet to turn it into a packet ready for transmission.
The thread enqueues the SQWE in SQ/SQR.
When it reaches the head of the SQR cache, the packet is read by the hardware at the address indicated in the SQWE.
The packet is transmitted by the hardware using additional information contained in the SQWE.
If enabled in the SQWE, the address of the now-free memory location is recirculated by the hardware into the RQ as a RQWE.
Otherwise a CQE is generated by the hardware to indicate transmit completion to software; the WQE will have to be returned to RQ by software.
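The recirculation flow in the steps above can be summarized by a small software model: a buffer pointer travels from the RQ through packet reception, thread processing and transmission, and then back to the RQ. This is an illustrative sketch with assumed names, not the hardware itself; the CQE/hash steps are elided to focus on pointer recirculation.

```python
from collections import deque

class Recirculator:
    """Model of the RQ -> receive -> process -> SQ -> transmit -> RQ pointer loop."""

    def __init__(self, buffer_addrs):
        self.rq = deque(buffer_addrs)   # preloaded RQWEs (free buffer addresses)
        self.sq = deque()               # SQWEs awaiting transmission

    def receive_packet(self):
        """Hardware pulls the RQWE at the head and stores the packet there."""
        return self.rq.popleft()

    def thread_enqueue_send(self, addr, recirculate=True):
        """Thread posts an SQWE for the processed packet; the recirculate
        flag models the 'if enabled in SQWE' choice above."""
        self.sq.append((addr, recirculate))

    def transmit(self):
        """Hardware transmits and, if enabled, returns the pointer to the RQ."""
        addr, recirculate = self.sq.popleft()
        if recirculate:
            self.rq.append(addr)   # now-free address becomes an RQWE again
        return addr
```

With recirculation enabled, the pool of free buffers is conserved across any number of packet cycles; with it disabled, software must return the pointer to the RQ itself, as stated above.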
Another embodiment of the present invention handles all data movement tasks and all buffer management operations; threads no longer need to handle these necessary but time-consuming tasks. This greatly increases performance by delegating all data movement tasks to hardware. Buffer management operations are further improved by using hardware caches that hide most of the latency due to DMA access while maximizing DMA efficiency (for example by using a full cache line of 64B per transfer). Optionally the software can choose to fully use the hardware capabilities or only use part of them.
Number | Date | Country | Kind |
---|---|---|---|
10306465.5 | Dec 2010 | EP | regional |
The present application is a U.S. National Phase application which claims priority from International Application No. PCT/EP2011/073256, filed Dec. 19, 2011, which in turn claims priority from European Patent Application No. 10306465.5, filed Dec. 21, 2010, with the European Patent Office, the contents of both are herein incorporated by reference in their entirety.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2011/073256 | 12/19/2011 | WO | 00 | 5/30/2013 |