The present invention relates to an intermediate apparatus, a communication method, and a program.
Remote direct memory access (RDMA) is a communication protocol for performing high-speed and high-reliability data transfer between communication terminals located at a distance. Since the RDMA directly performs memory access from the memory area of the transmission terminal to the memory area of the reception terminal, high-speed communication is possible. The RDMA has a credit-based flow control function, and performs retransmission control based on timeout and packet loss detection, so that highly reliable communication can be performed. The RDMA is also used as a transport system for host-to-device data communication and for device-to-device data communication between a solid state drive (SSD) and a graphics processing unit (GPU).
A communication model of RDMA will be described with reference to
In the prior art, when the network distance becomes long, the transfer performance of the RDMA deteriorates. In the RDMA, unless a work queue element (WQE) of the requester is completed and a vacancy is created in the send queue (SQ), the process cannot proceed to the next work request (WR). In the connection-oriented service type, an ACK must be received from the responder in order to complete the WQE of the requester. While WQEs that have not yet been completed remain in the SQ awaiting their ACKs, no new WQE can be loaded into the queue, and the transfer performance deteriorates.
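The effect of the round trip time on this bottleneck can be illustrated with a simplified model. The sketch below assumes that at most one SQ's worth of messages can be outstanding per round trip, so throughput is bounded by the outstanding bytes divided by the RTT. The function name and the numeric values (SQ depth, message size, RTTs) are hypothetical and chosen only for illustration.

```python
def max_throughput_bps(sq_depth: int, message_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput (bits per second) in a simplified model.

    Assumes every WQE stays in the SQ until its ACK arrives one RTT later,
    so at most sq_depth messages are in flight per round trip.
    """
    return sq_depth * message_bytes * 8 / rtt_seconds


# Hypothetical values: 128-entry SQ, 64 KiB messages.
short_haul = max_throughput_bps(128, 64 * 1024, 0.0001)  # 0.1 ms RTT
long_haul = max_throughput_bps(128, 64 * 1024, 0.02)     # 20 ms RTT

print(f"short haul bound: {short_haul / 1e9:.2f} Gbit/s")
print(f"long haul bound:  {long_haul / 1e9:.2f} Gbit/s")
```

In this model the bound falls in inverse proportion to the RTT, which is why completing WQEs earlier than the true ACK arrives can restore throughput over long distances.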
The present invention has been made in view of the above, and an object of the present invention is to realize high-bandwidth data transfer even on a network service having a large round trip time (RTT).
An intermediate device according to an aspect of the present invention is an intermediate device disposed between a first device and a second device for transferring data using remote direct memory access, the device including a transfer unit that transfers a packet between the first device and the second device, and a registration unit that extracts a combination of destination information of the first device and destination information of the second device from a packet transmitted and received when establishing a connection between the first device and the second device, and registers the combination in a destination table.
An intermediate device according to an aspect of the present invention includes a management unit that manages a message sequence number indicating a completion state of a request in the second device in a management table, in which the management unit registers destination information of the first device and an initialized message sequence number in the management table with the destination information of the second device as a key at the time of establishing a connection and transitions a message sequence number when a predetermined request is received from the first device, and a generation unit that acquires destination information of the first device and a message sequence number after transition from the management table, generates a pseudo-Response to the request, and returns the pseudo-Response to the first device.
A communication method according to an aspect of the present invention is a communication method of an intermediate device disposed between a first device and a second device that transfer data using remote direct memory access, the method including causing the intermediate device to transfer a packet between the first device and the second device, and extract a combination of destination information of the first device and destination information of the second device from a packet transmitted and received when establishing a connection between the first device and the second device, and register the combination in a destination table.
The communication method according to an aspect of the present invention includes causing the intermediate device to manage a message sequence number indicating a completion state of the request in the second device in a management table, register destination information of the first device and an initialized message sequence number in the management table with the destination information of the second device as a key at the time of establishing a connection and transition a message sequence number when a predetermined request is received from the first device, and acquire destination information of the first device and a message sequence number after transition from the management table, generate a pseudo-Response to the request, and return the pseudo-Response to the first device.
According to the present invention, high-bandwidth data transfer can be realized even on a network service having a large round trip time (RTT).
An embodiment of the present invention will be described below with reference to the drawings.
Referring to
The intermediate device 10A includes a transfer unit 11, a snooping unit 14, and a Queue Pair Number (QPN) table 15.
The transfer unit 11 transfers a request from the requester 30 to the responder 50, and transfers a response from the responder 50 to the requester 30. Requests and responses transmitted and received between the requester 30 and the responder 50 include REQ, REP, and RTU at the time of connection establishment, requests and responses at the time of data transfer, and DREQ and DREP at the time of connection release.
The snooping unit 14 intercepts a packet of RDMA Communication Management (RDMA-CM) transmitted and received between the requester 30 and the responder 50 in a connection establishment phase of RDMA-CM, and registers a bidirectional QPN entry having a pair of QPN of the requester 30 and QPN of the responder 50 in the QPN table 15.
The requester 30 and the responder 50 set a communication ID (CID) as an identifier for uniquely identifying communication in the connection. The CID does not change until the connection is discarded. The QPN of each of the requester 30 and the responder 50 is also uniquely identified in association with the CID of each of the requester 30 and the responder 50. The CID and the QPN are exchanged between the requester 30 and the responder 50 in a connection establishment phase, and a CID pair and a QPN pair are set.
The connection establishment phase consists of a 3-way handshake of ConnectRequest (REQ), ConnectReply (REP), and ReadyToUse (RTU).
The snooping unit 14 intercepts a packet of the RDMA-CM transmitted and received between the requester 30 and the responder 50 in a connection release phase of the RDMA-CM, and deletes a corresponding QPN entry from the QPN table 15.
A pair of CID and QPN set in a connection establishment phase is released in a connection release phase of the RDMA-CM. The connection release phase is composed of handshake of DisconnectRequest (DREQ) and DisconnectReply (DREP).
The QPN table 15 manages a pair of a local side (requester side) QPN and a remote side (responder side) QPN in association with each other using the CID as a key.
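As one possible illustration, the QPN table 15 might be modeled as a mapping from CID to a pair of QPNs. The sketch below is only an assumption about the layout; the names `QpnEntry`, `qpn_table`, and `peer_qpn`, as well as the CID and QPN values, are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class QpnEntry:
    """One QPN entry: the QPN on the side that owns the CID, and its peer."""
    local_qpn: int
    remote_qpn: Optional[int] = None  # unknown until the peer's QPN is seen


# The table is keyed by CID. A bidirectional pair uses two entries,
# one keyed by each side's CID (hypothetical values).
qpn_table: Dict[int, QpnEntry] = {
    0x1001: QpnEntry(local_qpn=0x11, remote_qpn=0x22),  # requester-side CID
    0x2002: QpnEntry(local_qpn=0x22, remote_qpn=0x11),  # responder-side CID
}


def peer_qpn(table: Dict[int, QpnEntry], cid: int) -> Optional[int]:
    """Return the peer QPN for a CID, or None if the CID is not registered."""
    entry = table.get(cid)
    return entry.remote_qpn if entry else None
```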
The intermediate device 10B includes the transfer unit 11.
Similar to the transfer unit 11 of the intermediate device 10A, the transfer unit 11 transfers the request from the requester 30 to the responder 50 and transfers the response from the responder 50 to the requester 30.
Next, an example of a flow of processing at the time of establishing the connection will be described with reference to the sequence diagram of
In Step S11, the requester 30 transmits the REQ to the responder 50. The REQ includes the Local CID and the Local QPN. The Local CID included in the REQ is an identifier for the requester 30 to identify the connection. The Local QPN included in the REQ is the QPN assigned to the QP of the requester 30.
When transferring the REQ to the responder 50, the intermediate device 10 creates a QPN entry using the Local CID included in the REQ as a key in the QPN table 15, and registers the Local QPN included in the REQ (QPN of the requester 30) in the Local QPN of the QPN entry.
When receiving the REQ, the responder 50 transmits the REP to the requester 30 in Step S12. The REP includes the Local CID, the Remote CID, and the Local QPN. The Local CID included in the REP is an identifier for the responder 50 to identify the connection. The Remote CID included in the REP is an identifier for the requester 30 to identify the connection, and is the same as the Local CID included in the REQ. The Local QPN included in the REP is the QPN assigned to the responder 50.
When transferring the REP to the requester 30, the intermediate device 10 retrieves the QPN entry using the Remote CID included in the REP as a key from the QPN table 15, and registers a Local QPN (QPN of the responder 50) included in the REP in the Remote QPN of the QPN entry. Further, the intermediate device 10 creates the QPN entry in the reverse direction using the Local CID included in the REP as a key in the QPN table 15. Specifically, the intermediate device 10 uses the Local CID included in the REP as a key, registers the QPN of the responder 50 in the Local QPN, and creates the QPN entry in which the QPN of the requester 30 is registered in the Remote QPN in the QPN table 15.
When the requester 30 receives the REP, the requester 30 transmits the RTU to the responder 50 in Step S13. The RTU includes the Local CID and the Remote CID.
By the above processing, a connection between the requester 30 and the responder 50 is established, and a bidirectional QPN entry in which the QPN of the requester 30 and the QPN of the responder 50 are paired is created in the QPN table 15. After the connection is established, data transfer is performed between the requester 30 and the responder 50. The intermediate device 10 transfers a packet between the requester 30 and the responder 50, and generates and transmits a pseudo-Response.
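The registration performed while the REQ and the REP are transferred can be sketched as follows. This is a minimal sketch, assuming the relevant header fields have already been parsed out of the packets; the function names `on_req` and `on_rep` and the dictionary layout are hypothetical.

```python
from typing import Dict, Optional

QpnTable = Dict[int, Dict[str, Optional[int]]]


def on_req(table: QpnTable, local_cid: int, local_qpn: int) -> None:
    """REQ seen: create an entry keyed by the requester's CID."""
    table[local_cid] = {"local_qpn": local_qpn, "remote_qpn": None}


def on_rep(table: QpnTable, local_cid: int, remote_cid: int, local_qpn: int) -> bool:
    """REP seen: complete the forward entry and create the reverse entry.

    Returns False when no REQ was observed for remote_cid.
    """
    forward = table.get(remote_cid)
    if forward is None:
        return False
    # The REP's Local QPN is the responder's QPN.
    forward["remote_qpn"] = local_qpn
    # Reverse-direction entry keyed by the responder's CID.
    table[local_cid] = {"local_qpn": local_qpn, "remote_qpn": forward["local_qpn"]}
    return True


# Usage with hypothetical CIDs and QPNs.
table: QpnTable = {}
on_req(table, local_cid=0x1001, local_qpn=0x11)
registered = on_rep(table, local_cid=0x2002, remote_cid=0x1001, local_qpn=0x22)
```

The RTU carries no QPN, so in this sketch it is simply transferred without touching the table.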
Next, an example of a flow of processing at the time of connection release will be described with reference to the sequence diagram of
In Step S31, the requester 30 transmits the DREQ to the responder 50. The DREQ includes the Local CID, the Remote CID, and the Remote QPN. The Local CID included in the DREQ is an identifier for the requester 30 to identify the connection. The Remote CID included in the DREQ is an identifier for the responder 50 to identify the connection. The Remote QPN included in the DREQ is the QPN assigned to the responder 50.
The intermediate device 10 transfers the DREQ to the responder 50.
When receiving the DREQ, the responder 50 transmits the DREP to the requester 30 in Step S32. The DREP includes the Local CID and the Remote CID. The Local CID included in the DREP is an identifier for the responder 50 to identify the connection. The Remote CID included in the DREP is an identifier for the requester 30 to identify the connection.
When transferring the DREP to the requester 30, the intermediate device 10 retrieves the QPN entry keyed by the Local CID included in the DREP and deletes it from the QPN table 15, and likewise retrieves the QPN entry keyed by the Remote CID included in the DREP and deletes it from the QPN table 15.
By the above processing, the connection between the requester 30 and the responder 50 is released, and the QPN entry corresponding to the connection is deleted from the QPN table 15.
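The deletion performed when the DREP is transferred can be sketched as below. This is an illustrative sketch only; the function name `on_drep` and the dictionary layout are hypothetical, and the CIDs are assumed to have been parsed from the DREP beforehand.

```python
from typing import Dict, Optional

QpnTable = Dict[int, Dict[str, Optional[int]]]


def on_drep(table: QpnTable, local_cid: int, remote_cid: int) -> int:
    """DREP seen: delete the entries keyed by both CIDs.

    Returns the number of entries actually removed, so a caller can
    notice a DREP for a connection that was never registered.
    """
    removed = 0
    for cid in (local_cid, remote_cid):
        if table.pop(cid, None) is not None:
            removed += 1
    return removed


# Usage with hypothetical values: a bidirectional pair is removed.
table: QpnTable = {
    0x1001: {"local_qpn": 0x11, "remote_qpn": 0x22},
    0x2002: {"local_qpn": 0x22, "remote_qpn": 0x11},
}
removed = on_drep(table, local_cid=0x2002, remote_cid=0x1001)
```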
Referring to
The intermediate device 10A includes the transfer unit 11, a generation unit 12, a tracing unit 16, and a Work Queue (WQ) table 17. The intermediate device 10A may include the snooping unit 14 and the QPN table 15 illustrated in
The transfer unit 11 transfers a request from the requester 30 to the responder 50, and transfers a response from the responder 50 to the requester 30.
The generation unit 12 picks up a request transmitted from the requester 30 and having a flag of Only or Last at the time of data transfer, generates a pseudo-Response to the request, and returns the generated pseudo-Response to the requester 30. When generating the pseudo-Response, the generation unit 12 uses the same PSN value as the request of Only or Last. Further, the generation unit 12 refers to the WQ table 17 to be described later to determine destination QPN and MSN to be included in the pseudo-Response.
When receiving the pseudo-Response, the requester 30 recognizes it as a response from the responder 50 and releases the WQEs whose SSN is less than or equal to the value of the MSN of the pseudo-Response.
The tracing unit 16 registers a WQ entry having the QPN and MSN of the requester 30 in the WQ table 17, which will be described later, using the QPN of the responder 50 as a key at the time of establishing the connection. At this time, the tracing unit 16 resets the MSN of the WQ entry to 0.
The tracing unit 16 may generate the WQ entry and register the WQ entry in the WQ table 17 with the generation of the QPN entry in the QPN table 15 as a trigger. At this time, the WQ entry having the Local QPN and the MSN is created using the Remote QPN of the QPN entry as a key.
The tracing unit 16 identifies a message unit based on header information of a request received from the requester 30, and transitions the value of the MSN. Specifically, when receiving a request with a flag of Only or Last, the tracing unit 16 transitions the value of the MSN of the corresponding WQ entry to simulate the state of the MSN of the responder 50.
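The MSN tracking described above might look like the following sketch. It assumes that "transitioning" the MSN means incrementing it by one and wrapping at 24 bits, matching the width of the MSN field in the InfiniBand acknowledge header; the embodiment itself does not spell this out. The names `Position`, `trace_request`, and the table layout are hypothetical.

```python
from enum import Enum
from typing import Dict, Optional


class Position(Enum):
    """Position of a request packet within its message (hypothetical enum)."""
    FIRST = "First"
    MIDDLE = "Middle"
    LAST = "Last"
    ONLY = "Only"


MSN_MODULUS = 1 << 24  # assumption: MSN is a 24-bit counter


def trace_request(wq_table: Dict[int, Dict[str, int]],
                  dst_qpn: int, position: Position) -> Optional[int]:
    """Update the simulated responder MSN for one observed request.

    Returns the MSN after processing, or None for an unknown destination QPN.
    """
    entry = wq_table.get(dst_qpn)
    if entry is None:
        return None
    # Only a packet that ends a message (Only or Last) completes a request.
    if position in (Position.LAST, Position.ONLY):
        entry["msn"] = (entry["msn"] + 1) % MSN_MODULUS
    return entry["msn"]


# Usage: entry keyed by the responder QPN, MSN reset to 0 at establishment.
wq_table = {0x22: {"src_qpn": 0x11, "msn": 0}}
after_first = trace_request(wq_table, 0x22, Position.FIRST)
after_last = trace_request(wq_table, 0x22, Position.LAST)
```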
The tracing unit 16 may delete the WQ entry of the WQ table 17 after the connection is released. For example, when the snooping unit 14 of
The WQ table 17 manages the QPN (src QPN) and the MSN of the requester 30 using the QPN (dst QPN) of the responder 50 as a key.
When generating the pseudo-Response to the request, the generation unit 12 retrieves the WQ entry having a key matching the QPN of the destination of the request and generates the pseudo-Response having the MSN value of the WQ entry. The src QPN of the WQ entry is set in the destination QPN of the pseudo-Response.
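The lookup performed by the generation unit 12 can be sketched as follows. The sketch only models the three fields discussed in the text (destination QPN, PSN, and MSN) and omits all other header fields; the names `PseudoResponse` and `generate_pseudo_response` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PseudoResponse:
    dest_qpn: int  # src QPN of the WQ entry, i.e. the requester's QPN
    psn: int       # same PSN as the Only/Last request
    msn: int       # MSN of the WQ entry after the transition


def generate_pseudo_response(wq_table: Dict[int, Dict[str, int]],
                             request_dest_qpn: int,
                             request_psn: int) -> Optional[PseudoResponse]:
    """Build a pseudo-Response for a request addressed to request_dest_qpn.

    Returns None when no WQ entry matches the request's destination QPN.
    """
    entry = wq_table.get(request_dest_qpn)
    if entry is None:
        return None
    return PseudoResponse(dest_qpn=entry["src_qpn"],
                          psn=request_psn,
                          msn=entry["msn"])


# Usage with hypothetical values: the MSN has already been advanced to 1.
wq_table = {0x22: {"src_qpn": 0x11, "msn": 1}}
response = generate_pseudo_response(wq_table, request_dest_qpn=0x22, request_psn=17)
```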
The intermediate device 10B includes the transfer unit 11 and a discard unit 13.
Similar to the transfer unit 11 of the intermediate device 10A, the transfer unit 11 transfers the request from the requester 30 to the responder 50 and transfers the response from the responder 50 to the requester 30.
The discard unit 13 discards a true response from the responder 50 to a request from the requester 30 at the time of data transfer. Thus, the duplicate reception of the response in the requester 30 can be prevented. The discard unit 13 may discard a message which may cause malfunction of the requester 30 among messages returned from the responder 50 to the requester 30.
The intermediate device 10A may include the discard unit 13, in which case the intermediate device 10B need not be disposed.
Next, an example of a flow of processing at the time of data transfer will be described with reference to the sequence diagram of
In Step S21, the requester 30 loads the WQE into the SQ and transmits a request to the responder 50. The request includes data to be transferred. The request is transferred to the responder 50 via the intermediate devices 10A and 10B.
In Step S22, the intermediate device 10A refers to the WQ table 17 to determine the destination QPN and generates the pseudo-Response to the request.
In Step S23, the intermediate device 10A returns the pseudo-Response to the requester 30. When receiving the pseudo-Response, the requester 30 releases the WQE in the SQ.
After that, when the requester 30 loads the next WQE into the SQ and transmits a request to the responder 50 (Step S26), the intermediate device 10A transfers the request, generates the pseudo-Response, and returns the pseudo-Response to the requester 30 (Steps S27 and S28).
On the other hand, when the reception of the request is successful, the responder 50 transmits a response to the requester 30 in Step S24.
In Step S25, the intermediate device 10B discards the received response.
Thereafter, each time the responder 50 receives a request, it returns a response, and the intermediate device 10B discards that response.
Next, an example of a flow of processing at the time of data transfer will be described with reference to the sequence diagram of
In Step S41 to Step S43, the requester 30 transmits a request for data transfer to the responder 50. It is assumed that any request is a request related to WQE associated with SSN=1. In the figure, r represents a request, and the number following r represents PSN. The request of r17 is a request in which a Last flag is set. The intermediate device 10 transfers the request to the responder 50, and the responder 50 receives the request.
When receiving the request of r17 in which the Last flag is set, the intermediate device 10 transitions the value of the MSN of the corresponding WQ entry in Step S44, generates the pseudo-Response having the value of the MSN after the transition, and transmits the pseudo-Response to the requester 30. In the figure, p-a represents a pseudo-Response, and the number following p-a represents the PSN. Numbers in parentheses represent the MSN. In the example illustrated in
When receiving the pseudo-Response, the requester 30 releases the WQE having the SSN up to the value of the MSN of the pseudo-Response. In the example illustrated in
On the other hand, when the reception of the request is successful, the responder 50 transmits a response to the requester 30 in Steps S45 to S47. In the figure, a represents a response, and the number following a represents the PSN. Numbers in parentheses represent the MSN. The responses to the requests r15 and r16 are a15 and a16, both with MSN=0. The response to the request r17 is a17, with MSN=1.
The intermediate device 10 discards the received response from the responder 50 without transferring the received response to the requester 30.
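The sequence above can be replayed as a small simulation. It is a sketch under several assumptions: the MSN transition is taken to be an increment by one, the flags of r15 and r16 are assumed to be First and Middle (the text only states that r17 carries the Last flag), and the QPN values and the function name `run_example` are hypothetical.

```python
def run_example():
    """Replay requests r15 to r17 through a simulated intermediate device."""
    responder_qpn, requester_qpn = 0x22, 0x11  # hypothetical QPNs
    wq_table = {responder_qpn: {"src_qpn": requester_qpn, "msn": 0}}

    # (PSN, flag); the flags of r15 and r16 are assumptions.
    requests = [(15, "First"), (16, "Middle"), (17, "Last")]
    pseudo_responses = []
    for psn, flag in requests:
        entry = wq_table[responder_qpn]
        if flag in ("Only", "Last"):
            entry["msn"] += 1  # simulate the responder's MSN
            pseudo_responses.append(
                {"dest_qpn": entry["src_qpn"], "psn": psn, "msn": entry["msn"]}
            )

    # True responses a15(0), a16(0), a17(1) are discarded, so the
    # requester only ever sees the pseudo-Responses.
    true_responses = [(15, 0), (16, 0), (17, 1)]
    discarded = len(true_responses)

    # The requester releases every WQE whose SSN is <= the received MSN.
    outstanding_ssn = [1]
    for response in pseudo_responses:
        outstanding_ssn = [s for s in outstanding_ssn if s > response["msn"]]
    return pseudo_responses, discarded, outstanding_ssn


pseudo_responses, discarded, outstanding_ssn = run_example()
```

Under these assumptions a single pseudo-Response with PSN 17 and MSN 1 is produced, which is enough for the requester to release the WQE with SSN=1 before any true response could have arrived.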
As described above, the intermediate device 10A of the present embodiment is the intermediate device 10A disposed between the requester 30 and the responder 50 for transferring data using the RDMA. The intermediate device 10A extracts a combination of the QPN of the requester 30 and the QPN of the responder 50 from a packet transmitted and received when establishing a connection between the requester 30 and the responder 50, and registers the combination in the QPN table 15. Thus, the intermediate device 10A can specify the return destination of the pseudo-Response at the time of establishing the connection.
The intermediate device 10A of the present embodiment manages, in the WQ table 17, the MSN representing a completion state of the request in the responder 50. At the time of establishing the connection, the intermediate device 10A registers the QPN of the requester 30 and the MSN initialized to 0 in the WQ table 17 using the QPN of the responder 50 as a key. When a request with a flag of Last or Only is received from the requester 30, the intermediate device 10A transitions the MSN, generates a pseudo-Response to the request from the QPN of the requester 30 and the MSN after the transition, and returns the pseudo-Response to the requester 30. Since the requester 30 releases the WQE of the SQ according to the pseudo-Response from the intermediate device 10A, even in a case where the RTT between the requester 30 and the responder 50 is large, the requester 30 can realize high-bandwidth data transfer without waiting for the response from the responder 50. Thus, it is possible to realize long-distance, high-speed operation of PUSH type data transfer from the requester 30 to the responder 50, in particular data transfer by RDMA Write.
Although the above description has been made with the configuration in which the intermediate devices 10A and 10B are installed between the requester 30 and the responder 50, the intermediate device 10A may be configured on the network interface card (NIC) of the device of the requester 30, and the intermediate device 10B may be configured on the NIC of the device of the responder 50, as illustrated in
As the intermediate devices 10A and 10B described above, it is possible to use a general-purpose computer system that includes a central processing unit (CPU) 901, a memory 902, a storage 903, a communication device 904, an input device 905, and an output device 906 as shown in, for example,
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2021/035305 | 9/27/2021 | WO |