The present invention relates to computer systems, and in particular, but not exclusively, to requestor-responder systems.
For each message that a requestor device sends to a responder device over a network, the responder device needs to buffer the received message and process the received message. End-to-end credits may be utilized to prevent buffer overflow in the responder device by the responder device informing the requestor device about the buffer status using credits. In some systems, the responder device manages some pointers, such as producer index Pi (being the number of work queue elements (WQEs) in the receive work queue), consumer index Ci (being the number of WQEs consumed by received messages), and the message sequence number (MSN) of the last message for which receipt has been completed. The responder device provides the pointers to the requestor device so that the requestor device knows how many messages can be sent to the responder device without causing buffer overflow.
Each time the responder device consumes a WQE, the responder device increases the consumer index by 1. Software (e.g., on a host device) can increase the memory allocated to the responder device (e.g., by posting one or more additional WQEs to the responder device receive work queue) and the responder increases the producer index accordingly. The available credits are then given by the producer index less the consumer index. The requestor and responder devices keep track of the MSNs of sent and received messages, respectively. When a new message is received and consumes a WQE, the Ci is increased by 1 and the MSN is increased by 1. Upon safely receiving a message, the responder device responds to the requestor device with an acknowledgement of the received message and the available credits (based on the producer and consumer index) and the MSN of the last message for which receipt has been completed.
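By way of illustration only, the credit bookkeeping described above may be sketched as follows. The class name, method names, and the choice to return an acknowledgement as a tuple are assumptions made for this sketch and are not taken from any particular device.

```python
from dataclasses import dataclass


@dataclass
class BaselineResponder:
    """Hypothetical model of the producer/consumer index bookkeeping."""
    producer_index: int = 0  # Pi: WQEs posted to the receive work queue
    consumer_index: int = 0  # Ci: WQEs consumed by received messages
    msn: int = 0             # MSN of the last message whose receipt completed

    def post_wqes(self, count: int) -> None:
        # Software posts additional WQEs, so the producer index grows.
        self.producer_index += count

    def available_credits(self) -> int:
        # Available credits are the producer index less the consumer index.
        return self.producer_index - self.consumer_index

    def receive_message(self):
        """Consume one WQE and return (msn, credits), or None if dropped."""
        if self.available_credits() == 0:
            return None  # no WQE available, so the message is dropped
        self.consumer_index += 1
        self.msn += 1
        return (self.msn, self.available_credits())
```

In this sketch, a responder that has had five WQEs posted acknowledges the fifth message with zero credits, and drops any further message until software posts more WQEs.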
There is provided in accordance with an embodiment of the present disclosure, a system including a first network device, which includes a host interface to receive messages from a host device, a network interface to provide a connection to a second network device over a packet data network, and packet processing circuitry to prepare the messages for sending to the second network device over the packet data network, send a batch of the messages to the second network device without waiting for an acknowledgement receipt from the second network device after sending each of the messages in the batch before sending a next one of the messages in the batch, one of the messages in the batch having a maximum message sequence number (MSN), receive a given acknowledgement receipt from the second network device indicating that all the messages in the batch have been received by the second network device and including credit data indicating that there is no space in a receive work queue of the second network device for receiving an additional message, and send the additional message having an MSN greater than the maximum MSN to the second network device responsively to receiving the given acknowledgement receipt and based on the credit data indicating that there is no space in the receive work queue of the second network device for receiving the additional message.
Further in accordance with an embodiment of the present disclosure each of the messages in the batch of the messages consumes a work queue element in a receive work queue of the second network device.
Still further in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to receive a single acknowledgement receipt for all the messages in the batch.
Additionally in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to resend the additional message to the second network device intermittently.
Moreover in accordance with an embodiment of the present disclosure, the system includes the second network device including a network interface to provide a connection to the first network device over the packet data network, and packet processing circuitry to receive the messages from the first network device, process the messages received from the first network device, maintain a cyclic receive work queue including work queue elements (WQEs) that are consumed in order, track available ones of the WQEs, assign the messages to the work queue elements so that the WQEs are consumed by the messages in an order that the messages are received by the second network device, scatter the messages to a memory, and generate completion queue elements (CQEs) for the WQEs and add the CQEs to a completion queue, wherein the CQEs provide the respective locations in the memory where respective ones of the messages were scattered.
Further in accordance with an embodiment of the present disclosure the packet processing circuitry of the second network device is configured to process the messages of the batch out-of-order, the messages of the batch including a first message and a second message, the first message having a lower MSN than the second message, receive the second message before the first message, assign the second message to a first one of the WQEs, scatter the second message to the memory, receive the first message, assign the first message to a second one of the WQEs, and scatter the first message to the memory.
Still further in accordance with an embodiment of the present disclosure the packet processing circuitry of the second network device is configured to process the messages of the batch in order so that if one of the messages is received out-of-order according to the MSN, the out-of-order message is dropped.
There is also provided in accordance with another embodiment of the present disclosure a first network device, including a network interface to provide a connection to a second network device over a packet data network, and packet processing circuitry to receive a batch of messages from the second network device, assign the messages to work queue elements (WQEs), generate completion queue elements (CQEs) for the WQEs, and track consumed WQEs still waiting for CQEs, being ones of the WQEs consumed with ones of the messages and still waiting for CQEs to be generated for the consumed WQEs.
Additionally in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to maintain a cyclic receive work queue including the WQEs that are consumed in order, assign the messages to the work queue elements so that the WQEs are consumed by the messages in an order that the messages are received by the packet processing circuitry, scatter the messages to a memory, add the CQEs to a completion queue, wherein the CQEs provide respective locations in the memory where respective ones of the messages were scattered, track free WQEs which are not consumed with messages, compute credit data including available receive work queue credits based on the free WQEs plus the consumed WQEs still waiting for CQEs, and send the credit data to the second network device via the network interface.
Moreover, in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to generate an acknowledgement receipt including the credit data and indicating that at least one of the messages has been received by the packet processing circuitry, and send the acknowledgement receipt to the second network device via the network interface.
Further in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to generate the acknowledgement receipt to include a highest message sequence number (MSN) of the messages which have an associated one of the CQEs.
Still further in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to process the messages of the batch out-of-order, the messages of the batch including a first message and a second message, the first message having a lower message sequence number (MSN) than the second message, receive the second message before the first message, assign the second message to a first one of the WQEs, scatter the second message to the memory, receive the first message, assign the first message to a second one of the WQEs, and scatter the first message to the memory.
Additionally in accordance with an embodiment of the present disclosure the packet processing circuitry is configured to generate the CQEs for the WQEs according to an order of the MSNs of the messages that have consumed the WQEs.
Moreover, in accordance with an embodiment of the present disclosure the packet processing circuitry includes a first hardware counter configured to track the free WQEs, and a second hardware counter configured to track the consumed WQEs still waiting for CQEs.
There is also provided in accordance with still another embodiment of the present disclosure, a method, including receiving messages from a host device, providing a connection to a network device over a packet data network, preparing the messages for sending to the network device over the packet data network, sending a batch of the messages to the network device without waiting for an acknowledgement receipt from the network device after sending each of the messages in the batch before sending a next one of the messages in the batch, one of the messages in the batch having a maximum message sequence number (MSN), receiving a given acknowledgement receipt from the network device indicating that all the messages in the batch have been received by the network device and including credit data indicating that there is no space in a receive work queue of the network device for receiving an additional message, and sending the additional message having an MSN greater than the maximum MSN to the network device responsively to receiving the given acknowledgement receipt and based on the credit data indicating that there is no space in the receive work queue of the network device for receiving the additional message.
Further in accordance with an embodiment of the present disclosure each of the messages in the batch of the messages consumes a work queue element in a receive work queue of the network device.
Still further, in accordance with an embodiment of the present disclosure the receiving the given acknowledgement receipt includes receiving a single acknowledgement receipt for all the messages in the batch.
Additionally in accordance with an embodiment of the present disclosure, the method includes resending the additional message to the network device intermittently.
Moreover in accordance with an embodiment of the present disclosure, the method includes, by the network device receiving the messages, maintaining a cyclic receive work queue including work queue elements (WQEs) that are consumed in order, tracking available ones of the WQEs, assigning the messages to the work queue elements so that the WQEs are consumed by the messages in an order that the messages are received by the network device, scattering the messages to a memory, generating completion queue elements (CQEs) for the WQEs, and adding the CQEs to a completion queue, wherein the CQEs provide the respective locations in the memory where respective ones of the messages were scattered.
There is also provided in accordance with still another embodiment of the present disclosure a method, including providing a connection to a network device over a packet data network, receiving a batch of messages from the network device, assigning the messages to work queue elements (WQEs), generating completion queue elements (CQEs) for the WQEs, and tracking consumed WQEs still waiting for CQEs, being ones of the WQEs consumed with ones of the messages and still waiting for CQEs to be generated for the consumed WQEs.
Further in accordance with an embodiment of the present disclosure, the method includes maintaining a cyclic receive work queue including the WQEs that are consumed in order, assigning the messages to the work queue elements so that the WQEs are consumed by the messages in an order that the messages are received, scattering the messages to a memory, adding the CQEs to a completion queue, wherein the CQEs provide respective locations in the memory where respective ones of the messages were scattered, tracking free WQEs which are not consumed with messages, computing credit data including available receive work queue credits based on the free WQEs plus the consumed WQEs still waiting for CQEs, and sending the credit data to the network device.
Still further in accordance with an embodiment of the present disclosure, the method includes generating an acknowledgement receipt including the credit data and indicating that at least one of the messages has been received, and sending the acknowledgement receipt to the network device.
Additionally in accordance with an embodiment of the present disclosure the generating the acknowledgement receipt includes generating the acknowledgement receipt to include a highest message sequence number (MSN) of the messages which have an associated one of the CQEs.
Moreover in accordance with an embodiment of the present disclosure, the method includes processing the messages of the batch out-of-order, the messages of the batch including a first message and a second message, the first message having a lower message sequence number (MSN) than the second message, receiving the second message before the first message, assigning the second message to a first one of the WQEs, scattering the second message to the memory, receiving the first message, assigning the first message to a second one of the WQEs, and scattering the first message to the memory.
Further in accordance with an embodiment of the present disclosure the generating the CQEs includes generating the CQEs for the WQEs according to an order of the MSNs of the messages that have consumed the WQEs.
The present invention will be understood from the following detailed description, taken in conjunction with the drawings in which:
A problem may occur when using credit-based systems. For example, if a requestor device sends a number of messages, e.g., messages 11 to 15, to a responder device, the responder device may send an acknowledgement after processing message 15 and indicate that there are zero receive work queue credits available. In a system where credits are reported only in the acknowledgement receipts, the requestor device does not receive another message from the responder device when receive work queue credits are again available. One solution to the above problem is for the requestor device to periodically send an additional message, e.g., message 16, after sending messages 11 to 15, and the additional message may be dropped. If the additional message is received, and not dropped, the responder device responds with an acknowledgement receipt including an indication of available credits. In this manner, available credits become known to the requestor device.
The above solution may be helpful in systems where messages are received and processed in order. For example, message 11 is received and processed before message 12, and if message 12 is received before message 11, message 12 is dropped. In such a case, sending the additional message 16 does not lead to blocking other messages in the receive work queue (as explained in more detail below), as in a system where messages are received and processed in order, if message 16 is received prior to message 15 (or any message with a lower MSN), message 16 will be dropped.
However, in systems where messages may be received out-of-order in a cyclic queue (where when a message arrives it consumes the first available WQE in a cyclical manner), but completions are processed in MSN order (for example, message 12 may be received and added to the receive work queue before message 11, even though message 11 is still completed prior to message 12) a problem may occur as follows. The term “consume”, as used in the specification and claims, in all grammatical forms, may include associating a given message with a WQE to process receipt of the given message including scattering the given message to memory based on buffer pointers in the WQE. Once the WQE is consumed by the given message, that WQE cannot be used by another message as the buffer locations are being used by the given message. For example, if messages 11 to 16 are sent from the requestor device to the responder device, and there are 5 available receive work queue credits, if messages with MSN 12, 13, 14, 15 and 16 consume WQEs in the receive work queue, then when the message with MSN 11 arrives, it is dropped, as there is no room in the receive work queue for message with MSN 11. However, none of the other messages with MSN 12-16 can be completed until the message with MSN 11 is completed, resulting in a deadlock.
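The deadlock scenario described above may be reproduced with a small simulation, given here as a sketch only. The function name and its return format are assumptions for illustration. The sketch assumes that a completed WQE is not returned to the free pool until software reposts it, so completions do not free space during the simulation.

```python
def simulate_receive(arrival_order, free_wqes, next_msn):
    """Simulate out-of-order receipt with completions in MSN order.

    Returns (completed, waiting, dropped), each a list of MSNs.
    """
    waiting = set()   # MSNs that consumed a WQE but are not yet completed
    completed = []    # MSNs completed, in MSN order
    dropped = []      # MSNs dropped because no WQE was free
    for msn in arrival_order:
        if free_wqes == 0:
            dropped.append(msn)
            continue
        free_wqes -= 1        # the message consumes the next free WQE
        waiting.add(msn)
        # Complete every message that is now contiguous in MSN order.
        while next_msn in waiting:
            waiting.remove(next_msn)
            completed.append(next_msn)
            next_msn += 1
    return completed, sorted(waiting), dropped


# Messages 12 to 16 arrive first and take all five WQEs; message 11 is dropped.
print(simulate_receive([12, 13, 14, 15, 16, 11], free_wqes=5, next_msn=11))
```

With five free WQEs and arrival order 12, 13, 14, 15, 16, 11, nothing is completed, messages 12 to 16 remain waiting, and message 11 is dropped. If message 16 is withheld (arrival order 12, 13, 14, 15, 11), all five messages complete.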
Therefore, embodiments of the present invention solve at least some of the above drawbacks by sending the additional message (e.g., message with MSN 16) after receiving an acknowledgement receipt(s) indicating that all the prior messages (e.g., messages up to and including MSN 15) have been received. In this manner, the additional message cannot block the other messages from being processed by the responder device.
In some embodiments, the requestor device sends a batch of messages (e.g., messages with MSN 11-15) to the responder device without waiting for an acknowledgement receipt from the responder device after sending each message in the batch before sending a next message in the batch. The requestor device receives an acknowledgement receipt from the responder device indicating that all the messages in the batch (e.g., messages with MSN 11-15) have been received. The acknowledgement receipt includes credit data indicating that there is no space in the receive work queue of the responder device for receiving an additional message. The requestor device sends the additional message (e.g., message with MSN 16) to the responder device responsively to receiving the acknowledgement receipt even though the credit data indicates that there is no space in the receive work queue for receiving the additional message.
The additional message (e.g., message with MSN 16) is repeatedly sent to the responder device until the requestor device receives an acknowledgement receipt from the responder device that the additional message has been received. The acknowledgement receipt for the additional message may include credit data indicating that there is free space in the receive work queue. According to the latest credit data, the requestor device may send another batch of messages to the responder device.
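The requestor-side decision described above may be sketched as follows. This is a minimal illustration under stated assumptions: the class name, the action tuples, and the rule that a batch is recorded as sent when it is returned by `next_action` are all hypothetical and are not part of the described embodiments.

```python
class Requestor:
    """Hypothetical requestor state: decides between batch, probe, and wait."""

    def __init__(self, acked_msn, credits):
        self.acked_msn = acked_msn      # highest MSN acknowledged so far
        self.credits = credits          # credits from the latest acknowledgement
        self.highest_sent = acked_msn   # highest credit-backed MSN sent so far

    def on_ack(self, msn, credits):
        # Record the latest acknowledgement and its credit data.
        self.acked_msn = max(self.acked_msn, msn)
        self.credits = credits
        self.highest_sent = max(self.highest_sent, self.acked_msn)

    def next_action(self):
        limit = self.acked_msn + self.credits
        if self.highest_sent < limit:
            # Credits allow a batch; record it as sent.
            first = self.highest_sent + 1
            self.highest_sent = limit
            return ("send_batch", first, limit)
        if self.acked_msn == self.highest_sent:
            # Whole batch acknowledged and no credits left: send (or resend)
            # the additional message despite the zero-credit indication.
            return ("probe", self.highest_sent + 1)
        # Batch still outstanding: the additional message must not be sent yet.
        return ("wait",)
```

For example, starting from acknowledged MSN 10 with 5 credits, the sketch sends messages 11 to 15, waits until MSN 15 is acknowledged, then repeatedly returns a probe with MSN 16 until an acknowledgement for MSN 16 arrives with new credit data.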
In some systems, acknowledgement receipts are not received unless requested by the requestor device. Therefore, in some embodiments, the requestor device requests an acknowledgement receipt to be provided after the final message in the batch of messages is received by the responder device. It may be assumed that if an acknowledgement receipt for message with MSN X is received, then all the messages with MSN less than X have also been received, even if message with MSN X is received by the responder device prior to a message with MSN less than X.
In some embodiments, the above may also be implemented when messages are received in order so that behavior of the requestor device is consistent whether the responder device processes packets in order or out-of-order.
As previously mentioned, the responder device may receive messages out-of-order in a cyclic queue (where when a message arrives it will consume the first available WQE in a cyclical manner), but completions are processed in MSN order (for example, message 12 may be received and added to the receive work queue before message 11, even though message 11 is still completed prior to message 12).
Therefore, when an out-of-order message arrives, it consumes the first available WQE in the receive work queue and the responder device scatters the data of the received out-of-order message into memory. However, the responder device cannot process a completion for the out-of-order message until the other message(s) (with lower MSNs) arrive and are completed in order. Therefore, in addition to the receive work queue including free WQEs (i.e., WQEs which have not been consumed by received messages), and WQEs associated with completion queue elements (CQEs) (i.e., WQEs consumed with received messages and having associated CQEs), the receive work queue also includes WQEs which are consumed with received messages and still waiting for associated CQEs. This leads to a performance problem based on how available credits are computed and this could lead to the requestor device sending fewer packets than the responder device can handle. The term “WQE” or work queue element, as used in the specification and claims, may refer to a queue element (e.g., receive queue element or a send queue element). A receive queue WQE may include a buffer pointer indicating where to scatter in memory a message associated with, or to be associated with, the WQE. Once a given WQE is consumed by a given message, the given WQE is not used by another message as the given buffer location(s) assigned to the given WQE is now reserved for the given message.
The term “CQE” or completion queue element, as used in the specification and claims, may refer to a completion entity which is provided to host software to inform the host software that the given message was assigned to the given WQE and was scattered in memory based on buffer pointers in the given WQE. The host software completes handling of the given message and frees up the given buffer location(s) in memory. The given WQE may then be replaced by a new WQE by software in the receive work queue. The new WQE may reuse the given buffer location(s) of the given WQE.
For example, consider a receive work queue having eight WQEs. Messages 1 and 2 are in the queue, and have associated CQEs. Therefore, the MSN tracked by the responder device is equal to 2. The responder device sends an acknowledgement receipt to the requestor device indicating that the MSN is equal to 2 and free credits are equal to 6. In such a case, the requestor device may compute that messages may be sent up to MSN 8.
Then message 4 arrives, but it is not processed (e.g., as message 3 has not yet arrived). Therefore, message 4 consumes a WQE but does not have an associated CQE. If the responder device sends an acknowledgement receipt at this stage, it will include MSN equal to 2 and free credits equal to 5. In such a case, the requestor device may compute that messages may be sent up to MSN 7. So even though the MSN is the same, the free credits have been reduced by one, which erroneously informs the requestor device that one fewer packet can be transmitted than previously.
Embodiments of the present invention solve at least some of the above drawbacks by tracking the WQEs which are consumed with received messages and still waiting for associated CQEs. The responder device computes available credits based on free WQEs plus WQEs which are consumed with received messages and still waiting for associated CQEs. In the above example, the available credits after receiving message 4 will be equal to the amount of free WQEs (i.e., 5), plus the amount of WQEs which are consumed with received messages and still waiting for associated CQEs (i.e., 1) giving a total of 6 available credits. Therefore, the requestor device correctly computes that messages may be sent up to MSN 8 (i.e., MSN equal to 2 plus available credits of 6).
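The arithmetic of the above example may be checked with the following short sketch. The function names are assumptions introduced for illustration only.

```python
def free_only_credits(free_wqes):
    # Credits computed from free WQEs alone (the problematic computation).
    return free_wqes


def corrected_credits(free_wqes, waiting_for_cqe):
    # Credits based on free WQEs plus consumed WQEs still waiting for CQEs.
    return free_wqes + waiting_for_cqe


def highest_sendable_msn(acked_msn, credits):
    # The requestor may send messages up to the acknowledged MSN plus credits.
    return acked_msn + credits


# Eight WQEs; messages 1 and 2 completed; message 4 arrived before message 3.
free_wqes, waiting_for_cqe, acked_msn = 5, 1, 2
print(highest_sendable_msn(acked_msn, free_only_credits(free_wqes)))                    # 7
print(highest_sendable_msn(acked_msn, corrected_credits(free_wqes, waiting_for_cqe)))  # 8
```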
Reference is now made to
The host device 16 includes a processor 26 and a memory 28. The processor 26 prepares messages and provides the messages to the requestor network device 12, for example, via the memory 28. The requestor network device 12 includes a host interface 20, packet processing circuitry 22, and a network interface 24. The host interface 20 is configured to receive the messages from the host device 16 and provide the messages to the packet processing circuitry 22. The packet processing circuitry 22 processes the messages and is described in more detail with reference to
The responder network device 14 includes a host interface 30, packet processing circuitry 32, and a network interface 34. The network interface 34 is configured to provide a connection to the requestor network device 12 over the packet data network 40 and receive the messages from the requestor network device 12. The packet processing circuitry 32 is described in more detail with reference to
In practice, some or all of the functions of the packet processing circuitry 22 or packet processing circuitry 32 may be combined in a single physical component or, alternatively, implemented using multiple physical components. These physical components may comprise hard-wired or programmable devices, or a combination of the two. In some embodiments, at least some of the functions of the packet processing circuitry 22, 32 may be carried out by a programmable processor under the control of suitable software. This software may be downloaded to a device in electronic form, over a network, for example. Alternatively, or additionally, the software may be stored in tangible, non-transitory computer-readable storage media, such as optical, magnetic, or electronic memory.
Reference is now made to
The packet processing circuitry 22 of the requestor network device 12 is configured to receive a given acknowledgement receipt (e.g., a single acknowledgement receipt for all the messages in the batch of messages or more than one acknowledgement receipt for the messages) from the responder network device 14 indicating that all the messages in the batch have been received by the responder network device 14 and including credit data indicating that there is no space in a receive work queue of the responder network device 14 for receiving an additional message (block 208). The acknowledgement receipt received for the final message in the batch (e.g., message with MSN 15) indicates that all the messages in the batch have been received, even if one or more of the messages in the batch were received after the final message. The packet processing circuitry 22 of the requestor network device 12 is configured to send an additional message having an MSN (e.g., MSN 16) which is greater than the maximum MSN (e.g., MSN 15) to the responder network device 14 responsively to receiving the acknowledgement receipt and based on the credit data indicating that there is no space in the receive work queue of the responder network device 14 for receiving the additional message (block 210).
The packet processing circuitry 22 is configured to resend the additional message to the responder network device 14 intermittently (block 214) if an acknowledgement receipt was not yet received for the additional message. The packet processing circuitry 22 is configured to receive the acknowledgement receipt for the additional message with credit data and the latest MSN completed by the responder network device 14 (block 216).
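The intermittent resending of the additional message may be sketched as a bounded retry loop. This is an illustration only: the `send` and `poll_ack` callables stand in for a transport that is not described here, the attempt limit is an assumption, and a real device would typically space the attempts with a timer rather than a tight loop.

```python
def probe_until_acked(send, poll_ack, probe_msn, max_attempts):
    """Resend the additional message until it is acknowledged.

    send(msn) transmits the message; poll_ack() returns (msn, credits) or None.
    Both callables are hypothetical placeholders supplied by the caller.
    Returns (ack, attempts), where ack is None if no acknowledgement arrived.
    """
    for attempt in range(1, max_attempts + 1):
        send(probe_msn)
        ack = poll_ack()
        # An acknowledgement covering the probe MSN carries fresh credit data.
        if ack is not None and ack[0] >= probe_msn:
            return ack, attempt
    return None, max_attempts


# Usage with a fake responder that drops the first two probes.
sent = []
acks = iter([None, None, (16, 3)])
result = probe_until_acked(sent.append, lambda: next(acks), probe_msn=16, max_attempts=5)
print(result, sent)
```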
Reference is now made to
The packet processing circuitry 32 is configured to maintain a cyclic receive work queue including work queue elements (WQEs) that are consumed in order (block 310) as described in more detail with reference to
The packet processing circuitry 32 is configured to scatter the messages to the memory 38 of the host device 18 (block 314). The scattering is generally performed according to the order of the WQEs in the receive work queue. The packet processing circuitry 32 is configured to generate completion queue elements (CQEs) for the WQEs (block 316). The CQEs provide the respective locations in the memory 38 where respective messages were scattered. The packet processing circuitry 32 is configured to generate the CQEs for the WQEs according to an order of the MSNs of the messages that have consumed the WQEs, as described in more detail with reference to
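The interaction between WQE consumption in arrival order, scattering, and CQE generation in MSN order may be sketched as follows. The class, its attribute names, the dictionary standing in for host memory, and the buffer addresses are assumptions for illustration and do not describe actual circuitry.

```python
class ReceiveQueueModel:
    """Hypothetical model: WQEs consumed in arrival order, CQEs in MSN order."""

    def __init__(self, buffer_addrs, first_msn=1):
        self.buffer_addrs = list(buffer_addrs)  # one buffer pointer per WQE
        self.next_wqe = 0                       # cyclic index of next WQE to consume
        self.free = len(self.buffer_addrs)      # WQEs not yet consumed
        self.next_msn_to_complete = first_msn
        self.pending = {}                       # msn -> index of consumed WQE
        self.memory = {}                        # stand-in for host memory
        self.completion_queue = []              # (msn, buffer address) entries

    def receive(self, msn, payload):
        if self.free == 0:
            return False  # no WQE available, so the message is dropped
        idx = self.next_wqe
        self.next_wqe = (idx + 1) % len(self.buffer_addrs)
        self.free -= 1
        # Scatter the message to the buffer named by the consumed WQE.
        self.memory[self.buffer_addrs[idx]] = payload
        self.pending[msn] = idx
        # Generate CQEs strictly in MSN order.
        while self.next_msn_to_complete in self.pending:
            done = self.next_msn_to_complete
            wqe = self.pending.pop(done)
            self.completion_queue.append((done, self.buffer_addrs[wqe]))
            self.next_msn_to_complete += 1
        return True
```

In this sketch, if the message with MSN 2 arrives before the message with MSN 1, it is scattered to the buffer of the first WQE, yet its CQE is added only after the CQE for MSN 1, and each CQE records where its message was scattered.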
The packet processing circuitry 32 is configured to track available WQEs (block 320) as described in more detail with reference to
Reference is now made to
Reference is now made to
Reference is now made to
As previously mentioned, when an out-of-order message arrives, it consumes the first available WQE in the receive work queue 500 and the responder network device 14 scatters the data of the received out-of-order message into memory 38. However, the responder network device 14 cannot process a completion for the out-of-order message until the other message(s) (with lower MSNs) arrive and are completed in order. Therefore, in addition to the receive work queue 500 including free WQEs (arrow 702) (i.e., WQEs which have not been consumed by received messages), and WQEs associated with completion queue elements (CQEs) (arrow 704) (i.e., WQEs consumed with received messages and having associated CQEs), the receive work queue 500 also includes WQEs which are consumed with received messages and still waiting for associated CQEs (arrow 706). This leads to a performance problem based on how available credits are computed and could lead to the requestor device sending fewer packets than the responder device can handle.
For example, consider the receive work queue 500 which has eight WQEs (WQE1-8). Messages 1 and 2 are in the receive work queue 500, and have associated CQEs 502. Therefore, the MSN tracked by the responder network device 14 is equal to 2 (corresponding to message 2 with MSN 2). The responder network device 14 sends an acknowledgement receipt to the requestor network device 12 indicating that the MSN is equal to 2 and free credits are equal to 6. In such a case, the requestor network device 12 may compute that messages may be sent up to MSN 8.
Then message 4 arrives, but it is not processed as message 3 has not yet arrived. Therefore, message 4 consumes a WQE in the receive work queue 500 but does not have an associated CQE 502. If the responder network device 14 were to send an acknowledgement receipt at this stage based on the free WQEs only, it would include MSN equal to 2 and free credits equal to 5. In such a case, the requestor network device 12 may compute that messages may be sent up to MSN 7. So even though the MSN is the same, the free credits have been reduced by one, which would erroneously inform the requestor network device 12 that one fewer packet can be transmitted than previously.
Therefore, packet processing circuitry 32 is configured to track free WQEs which are not consumed with messages (block 802), and track consumed WQEs still waiting for CQEs being WQEs consumed with messages and still waiting for CQEs to be generated for the consumed WQEs (block 804). The packet processing circuitry 32 is configured to compute available receive work queue credits based on free WQEs plus WQEs still waiting for CQEs (block 806). In the above example, the available credits after receiving message 4 will be equal to the amount of free WQEs (i.e., 5), plus the amount of WQEs which are consumed with received messages and still waiting for associated CQEs (i.e., 1) giving a total of 6 available credits. Therefore, the requestor device correctly computes that messages may be sent up to MSN 8 (i.e., MSN equal to 2 plus available credits of 6).
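The two tracked quantities may be modeled as a pair of counters updated on each event, as in the following sketch. The class and method names are assumptions for illustration, and the sketch assumes that the tracked MSN advances by one each time a CQE is generated.

```python
class ReceiveQueueCounters:
    """Hypothetical model of the free-WQE and waiting-for-CQE counters."""

    def __init__(self, free_wqes, msn=0):
        self.free_wqes = free_wqes   # WQEs not consumed with messages
        self.waiting_for_cqe = 0     # consumed WQEs still waiting for CQEs
        self.msn = msn               # MSN of the last completed message

    def on_message_consumes_wqe(self):
        if self.free_wqes == 0:
            raise RuntimeError("no free WQE available")
        self.free_wqes -= 1
        self.waiting_for_cqe += 1

    def on_cqe_generated(self):
        self.waiting_for_cqe -= 1
        self.msn += 1

    def credit_data(self):
        # Credits are the free WQEs plus the consumed WQEs still waiting for CQEs.
        return self.msn, self.free_wqes + self.waiting_for_cqe


counters = ReceiveQueueCounters(free_wqes=8)
for _ in range(2):                    # messages 1 and 2 arrive and complete
    counters.on_message_consumes_wqe()
    counters.on_cqe_generated()
counters.on_message_consumes_wqe()    # message 4 arrives before message 3
print(counters.credit_data())         # (2, 6)
```

Under these assumptions, the sum of the reported MSN and the reported credits stays at 8 throughout the example, including after message 3 arrives and CQEs are generated for messages 3 and 4.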
Various features of the invention which are, for clarity, described in the contexts of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable sub-combination.
The embodiments described above are cited by way of example, and the present invention is not limited by what has been particularly shown and described hereinabove. Rather the scope of the invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.