In at least one aspect, the present invention relates to communication within a cluster of computer nodes.
A computer cluster is a group of closely interacting computer nodes operating in a manner so that they may be viewed as though they are a single computer. Typically, the component computer nodes are interconnected through fast local area networks. Internode cluster communication is typically accomplished through a protocol such as TCP/IP or UDP/IP running over an ethernet link, or a protocol such as uDAPL or IPoIB running over an Infiniband (“IB”) link. Computer clusters offer cost-effective performance improvements for many tasks as compared to using a single computer. However, for optimal performance, low latency cluster communication is an important feature of many multi-server computer systems. In particular, low latency is extremely desirable for horizontally scaled databases and for high performance computing (“HPC”) systems.
Although present day cluster technology works reasonably well, there are a number of opportunities for performance improvements regarding the utilized hardware and software. For example, ethernet does not support multiple hardware channels, so user processes have to go through software layers in the kernel to access the ethernet link. Kernel software performs the mux/demux between user processes and the hardware. Furthermore, ethernet is typically an unreliable communication link. The ethernet communication fabric is allowed to drop packets without informing the source node or the destination node. The overhead of performing the mux/demux in software (a trap to the operating system and multiple software layers) and the overhead of supporting reliability in software result in a significant negative impact on application performance.
Similarly, Infiniband (“IB”) offers several additional opportunities for improvement. IB defines several modes of operation, such as Reliable Connection, Reliable Datagram, Unreliable Connection and Unreliable Datagram. Each communication channel utilized in IB Reliable Datagrams requires the management of at least three different queues: commands are entered into send or receive work queues, while completion notification is realized through a separate completion queue. This asynchronous completion results in significant overhead. When a transfer has been completed, the completion ID is hashed to retrieve the context needed to service the completion. In IB, receive queue entries contain a pointer to the buffer instead of the buffer itself, resulting in buffer management overhead. Moreover, send and receive queues are tightly associated with each other, so implementations cannot support scenarios such as multiple send channels for one process paired with multiple receive channels for others, which is useful in some cases. Finally, reliable datagram is implemented as a reliable connection in hardware, and the hardware does the muxing and demuxing based on the end-to-end context provided by the user. Therefore, IB is not truly connectionless and results in a more complex implementation.
Remote Direct Memory Access (“RDMA”) is a data transfer technology that allows data to move directly from the memory of one computer into that of another without involving either computer's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters. The primary reason for using RDMA to transfer data is to avoid copies. The application buffer is provided to the remote node wishing to transfer data, and the remote node can do an RDMA write to or read from the buffer directly. Without RDMA, messages are transferred from the network interface device to kernel memory, and software then copies the messages into the application buffer. Several studies have shown that, when transferring large blocks over an interconnect, the dominant cost lies in performing copies at the sender and the receiver.
However, to perform RDMA, the buffers at the source and the destination need to be made accessible to the network device participating in the RDMA. This process, referred to herein as buffer registration, involves two steps. In the first step, the buffer in memory is pinned so that the operating system does not swap it out. In the second step, the physical address or an I/O virtual address (“I/O VA”) of the buffer is obtained and sent to the device so the device knows the location of the buffer.
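By way of a minimal sketch only, the two registration steps might look as follows on a Unix-like system; the /dev/rdma_dev node, the RDMA_MAP_BUFFER ioctl command and the rdma_map_req layout are hypothetical placeholders for whatever interface a particular device driver exposes:

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define RDMA_MAP_BUFFER 0xC0DE  /* hypothetical ioctl command */

    struct rdma_map_req {
        void     *vaddr;  /* user virtual address of the buffer */
        uint64_t  len;    /* buffer length in bytes */
        uint64_t  iova;   /* returned I/O virtual address */
    };

    /* Register a buffer for RDMA: pin it, then obtain its I/O VA. */
    int register_buffer(void *buf, size_t len, uint64_t *iova_out)
    {
        struct rdma_map_req req = { .vaddr = buf, .len = len, .iova = 0 };

        /* Step 1: pin the pages so the OS cannot swap them out. */
        if (mlock(buf, len) != 0)
            return -1;

        /* Step 2: hand the buffer to the device and get back an I/O VA. */
        int fd = open("/dev/rdma_dev", O_RDWR);  /* hypothetical device node */
        if (fd < 0)
            return -1;
        if (ioctl(fd, RDMA_MAP_BUFFER, &req) != 0) {
            close(fd);
            return -1;
        }
        close(fd);

        *iova_out = req.iova;
        return 0;
    }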
Buffer registration involves operating system operations and is expensive to perform. Accordingly, RDMA is not efficient for small buffers: the cost of setting up the buffers is higher than the cost of performing copies. Studies indicate that the crossover point where RDMA becomes more efficient than normal messaging is between 2 KB and 8 KB. It should also be appreciated that buffer registration needs to be performed just once on buffers used in normal messaging, since the same set of buffers is used repeatedly by the network device, with data being copied from device buffers to application buffers.
Two approaches are used to reduce the impact of buffer registration. The first approach is to register the entire memory of the application when the application is started. For large applications, this causes a significant fraction of physical memory to be locked down and unswappable. Furthermore, other applications are prevented from running efficiently on the server. The second approach is to cache registrations. This technique has been used in a few MPI implementations; MPI is a cluster communication API used primarily in HPC applications. In this approach, recently used registrations are saved in a cache. When the application tries to reuse a registration, the cache is checked, and if the registration is still available it is serviced from the cache.
Accordingly, there exists a need for improved methods and systems for connectionless internode cluster communication.
The present invention solves one or more problems of the prior art by providing, in at least one embodiment, a server interconnect system providing communication within a cluster of computer nodes. The server interconnect system for sending data includes a first server node and a second server node. Each server node is operable to send and receive data. The interconnect system also includes a first and a second interface unit. The first interface unit is in communication with the first server node and has one or more Remote Direct Memory Access (“RDMA”) doorbell registers. Similarly, the second interface unit is in communication with the second server node and has one or more RDMA doorbell registers. The system also includes a communication switch that is operable to receive and route data from the first or second server nodes using an RDMA read and/or an RDMA write when either of the first or second RDMA doorbell registers indicates that data is ready to be sent or received. Advantageously, the server interconnect system of the present embodiment is reliable and connectionless while supporting messaging between the nodes. The server interconnect system is reliable in the sense that packets are never dropped other than in catastrophic situations such as hardware failure. The server interconnect system is connectionless in the sense that the hardware treats each transfer independently, moving the specified data between the nodes using the queue or memory addresses given for that transfer. Moreover, there is no requirement to perform a handshake before communication starts or to maintain status information between pairs of communicating entities. The latency characteristics of the present embodiment are also found to be superior to those of prior art methods.
In another embodiment of the present invention, a method of sending a message from a source node to a target node via associated interface units and a communication switch is provided. The method of this embodiment implements an RDMA write by registering a source buffer that is the source of the data. Similarly, a target buffer that is the target of the data is also registered. An RDMA descriptor is created in system memory of the source node. The RDMA descriptor has a field that specifies the identification of the target node with which an RDMA transfer will be established, a field for the address of the source buffer, a field for the address of the target buffer, and an RDMA status field. The address of the RDMA descriptor is written to a set of first RDMA doorbell registers located within a source interface unit. An RDMA status register is set to indicate an RDMA transfer is pending. Next, the data to be transferred, the address of the target buffer and the target node identification are provided to the server communication switch, thereby initiating an RDMA transfer of the data to the target server node.
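Purely as an illustration, the RDMA descriptor of this embodiment could be represented in C as follows; the field names, field widths and the length field are assumptions, since only the four fields recited above are required:

    #include <stdint.h>

    /* Illustrative in-memory layout of the RDMA descriptor; field names
     * and widths are assumptions, not a normative register map. */
    struct rdma_descriptor {
        uint16_t target_node_id;  /* node with which the transfer is established */
        uint64_t src_addr;        /* I/O VA of the registered source buffer */
        uint64_t dst_addr;        /* I/O VA of the registered target buffer */
        uint32_t length;          /* bytes to transfer (assumed field) */
        uint32_t rdma_status;     /* pending / done / error */
    };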
In another embodiment of the present invention, a method of sending a message from a source node to a target node via associated interface units and a communication switch is provided. The method of this embodiment implements an RDMA read by registering a source buffer that is the source of the data. A source buffer identifier is sent to the target server node. A target buffer that is the target of the data is registered. An RDMA descriptor is created in system memory of the target node. The RDMA descriptor has a field for the identification of the target node with which an RDMA transfer will be established, a field for the address of the source buffer, a field for the address of the target buffer, and an RDMA status field. The address of the RDMA descriptor is written to one of a set of RDMA doorbell registers. An RDMA status register is set to indicate an RDMA transfer is pending. A request is sent to the source interface unit to transfer data from the source buffer. Finally, the data from the source buffer is sent to the target buffer.
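Sketched in the same illustrative terms, and reusing the rdma_descriptor sketch above, the read is initiated from the node that wants the data; start_rdma_read() and write_doorbell() are hypothetical helpers standing in for the register operations described later:

    /* Hypothetical helper: write the descriptor address into a doorbell. */
    void write_doorbell(struct rdma_descriptor *d);

    /* Initiation of an RDMA read on the reading node (illustrative only):
     * the remote node has already registered its buffer and sent us its
     * identifier, src_handle, in a message. */
    void start_rdma_read(struct rdma_descriptor *d, uint16_t remote_node,
                         uint64_t src_handle, uint64_t local_iova,
                         uint32_t len)
    {
        d->target_node_id = remote_node;  /* remote peer holding the data */
        d->src_addr       = src_handle;   /* remote registered source buffer */
        d->dst_addr       = local_iova;   /* local registered target buffer */
        d->length         = len;
        d->rdma_status    = 0;            /* cleared; hardware sets pending */
        write_doorbell(d);
    }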
Reference will now be made in detail to presently preferred compositions, embodiments and methods of the present invention, which constitute the best modes of practicing the invention presently known to the inventors. The Figures are not necessarily to scale. However, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. Therefore, specific details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for any aspect of the invention and/or as a representative basis for teaching one skilled in the art to variously employ the present invention.
It is also to be understood that this invention is not limited to the specific embodiments and methods described below, as specific components and/or conditions may, of course, vary. Furthermore, the terminology used herein is used only for the purpose of describing particular embodiments of the present invention and is not intended to be limiting in any way.
It must also be noted that, as used in the specification and the appended claims, the singular form “a,” “an,” and “the” comprise plural referents unless the context clearly indicates otherwise. For example, reference to a component in the singular is intended to comprise a plurality of components.
Throughout this application, where publications are referenced, the disclosures of these publications in their entireties are hereby incorporated by reference into this application to more fully describe the state of the art to which this invention pertains.
In an embodiment of the present invention, a server interconnect system for communication within a cluster of computer nodes is provided. In a variation of the present embodiment, the server interconnect system is used to connect multiple servers through a PCI-Express fabric.
With reference to
Still referring to
With reference to
Software writes the address of the descriptor into the RDMA doorbell register to initiate the RDMA. In one variation, RDMA send doorbell register 28ₙ includes the fields provided in Table 2. The sizes of these fields are only illustrative of an example of RDMA send doorbell register 28ₙ.
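As a sketch only, and assuming the doorbell register is mapped into the process's address space and packs the descriptor address with a valid bit (the exact Table 2 encoding is not reproduced here), ringing the doorbell reduces to a single MMIO store:

    #include <stdint.h>

    #define DSCR_VALID (1ULL << 63)  /* assumed position of the valid bit */

    /* Initiate an RDMA by writing the descriptor address into the
     * memory-mapped send doorbell register 28n. The volatile qualifier
     * keeps the compiler from reordering or eliding the MMIO store. */
    static inline void ring_rdma_doorbell(volatile uint64_t *doorbell,
                                          uint64_t dscr_addr)
    {
        *doorbell = dscr_addr | DSCR_VALID;
    }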
The set of message send registers also includes RDMA send status register 32ₙ. RDMA send status register 32ₙ is associated with doorbell register 28ₙ. Send status register 32ₙ contains the status of the message send initiated through a write into send doorbell register 28ₙ. In a variation, send status register 32ₙ includes at least one field as set forth in Table 2. The size of this field is only illustrative of an example of RDMA send status register 32ₙ.
In a variation of the present embodiment, each interface unit 22ₙ typically contains a large number of RDMA registers (on the order of 1000 or more). Each software process/thread on a server that wishes to RDMA data to another server is allocated an RDMA doorbell register and an associated RDMA status register.
With reference to
When hardware in the interface unit 22₁ sees a valid doorbell as indicated by the DSCR_VALID field, the corresponding RDMA status register 32₁ is set to the pending state as set forth in step f). In step g), hardware within interface unit 22₁ then performs a DMA read to get the contents of the descriptor from system memory of source server node 12₁. In step h), the hardware within interface unit 22₁ then reads the contents of the local buffer 18₁ from system memory on source server 12₁ using the RDMA descriptor and then sends the data, along with the target address and the target node identification, to server communication switch 26.
Server communication switch 26 routes the data to buffer 18₂ of target server node 12₂ as set forth in step i). Interface unit 22₂ at the target server 12₂ then performs a DMA write of the received data to the specified target address. An acknowledgment (“ack”) is then sent back to source server node 12₁. Once the source node 12₁ receives the ack, it updates the send status register to ‘done’ as shown in step j).
Software executing on the source node polls the RDMA status register. When it sees the status change from “pending” to “done” or “error,” it takes the required action. Optionally, software on the source node could instead wait for an interrupt when the RDMA completes. Typically, the executing software on the destination node has no knowledge of the RDMA operation, so the application has to define a protocol to inform the destination about the completion of an RDMA. Typically this is done through a message from the source node to the destination node with information on the RDMA operation that was just completed.
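A minimal polling loop on the source node might look like the following sketch; the status encodings are placeholder assumptions, since the actual register values are set forth in the tables above:

    #include <stdint.h>

    /* Assumed status encodings; the actual values are not reproduced here. */
    enum rdma_status { RDMA_PENDING = 1, RDMA_DONE = 2, RDMA_ERROR = 3 };

    /* Spin on the memory-mapped RDMA status register until the transfer
     * leaves the pending state; an interrupt could be used instead. */
    int wait_for_rdma(volatile uint32_t *status_reg)
    {
        uint32_t s;
        while ((s = *status_reg) == RDMA_PENDING)
            ;  /* busy-wait */
        return (s == RDMA_DONE) ? 0 : -1;
    }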
With reference to
When hardware on the interface unit 22₁ sees a valid doorbell, it sets the corresponding RDMA status register 32₁ to the pending state in step f). In step g), hardware within interface unit 22₁ then performs a DMA read to get the contents of the descriptor 34₁ from system memory. The hardware within interface unit 22₁ obtains the identifier for buffer 18₂ from the descriptor 34₁, and sends a request for the contents of the remote buffer 18₂ to server communication switch 26 in step h). In step i), server communication switch 26 routes the request to interface unit 22₂. Interface unit 22₂ performs a DMA read of the contents of buffer 18₂ and sends the data back to switch 26, which routes the data back to interface unit 22₁. In step j), interface unit 22₁ then performs a DMA write of the data into buffer 18₁. Once the DMA write is complete, interface unit 22₁ updates the send status register to ‘done’.
Server communication switch 26 routes the data to local buffer 18₁ as set forth in step f). Interface unit 22₁ at the server 12₁ performs a DMA read of the data at the specified target address. An acknowledgment (“ack”) is then sent back to source server node 12₁. Once the source node 12₁ receives the ack, it updates the send status register to ‘done’ as shown in step g).
When the size of the buffer to be transferred in the read and write RDMA communications set forth above is large, the transfer is split into multiple segments. Each segment is then transferred separately. The source server sets the status register when all segments have been successfully transferred. When errors occur, the target interface unit 22ₙ sends an error message back. Depending on the type of error, the source interface unit 22ₙ either does a retry (sends the data again), or discards the data and sets the RDMA_STATUS field to indicate the error. Communication is reliable in the absence of unrecoverable hardware failure.
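The segmentation can be pictured with the following sketch, in which SEG_SIZE and the two helpers are assumptions; a real interface unit would perform this in hardware and could pipeline the segments:

    #include <stdint.h>

    #define SEG_SIZE (64 * 1024)  /* assumed segment size in bytes */

    /* Hypothetical helpers: issue one segment and wait for its ack. */
    void issue_segment(uint64_t src, uint64_t dst, uint64_t len);
    int  wait_segment_ack(void);

    /* Transfer a large buffer as a series of segments; the status
     * register is set to done only after every segment is acked. */
    int rdma_transfer_large(uint64_t src, uint64_t dst, uint64_t len)
    {
        for (uint64_t off = 0; off < len; ) {
            uint64_t chunk = (len - off < SEG_SIZE) ? (len - off) : SEG_SIZE;
            issue_segment(src + off, dst + off, chunk);
            if (wait_segment_ack() != 0)
                return -1;  /* retry or set RDMA_STATUS to an error code */
            off += chunk;
        }
        return 0;
    }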
In another variation of the present invention, function calls in a software API are used for performing an RDMA. These calls can be folded into an existing API such as sockets or can be defined as a separate API. On each server 12ₙ there is a driver that attaches to the associated interface unit 22ₙ. The driver controls all RDMA registers on the interface unit 22ₙ and allocates them to user processes as needed. A user level library runs on top of the driver. This library is linked by an application that performs RDMA. The library converts RDMA API calls to interface unit 22ₙ register operations to perform RDMA operations as set forth in Table 4.
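Table 4 itself is not reproduced here, but based on the calls discussed below the user-level API might be declared along the following lines; the exact signatures are assumptions:

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t rdma_handle_t;  /* a handle is an I/O virtual address */

    /* Illustrative declarations for the RDMA API calls discussed below. */
    int           rdma_register(void *start, void *end);    /* "register" */
    int           rdma_deregister(void *start, void *end);  /* "deregister" */
    rdma_handle_t get_rdma_handle(void *buf, size_t size);
    void          free_rdma_handle(rdma_handle_t handle);
    int           rdma_write(rdma_handle_t remote, rdma_handle_t local,
                             size_t size);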
The application calls “register” with a start and end address for a contiguous region of memory. This indicates to the user library that the region of memory might participate in RDMA operations. The library records this information in an internal data structure. The application guarantees that the region of memory passed through the register call will not be freed until the application calls “deregister” for the same region of memory or exits.
The application calls “get_rdma_handle” with a buffer start address and a size. The buffer should be contained in a region of memory that was registered earlier. The user level library pins the buffer by performing the appropriate system call. An I/O virtual address is obtained for the buffer by performing another system call, which returns a handle (the I/O virtual address) for the buffer. The application is free to perform RDMA operations on the I/O virtual address at this point.
The library does not have to perform the pin and I/O virtual address operations when a handle for the buffer is found in the registration cache. The application calls “rdma_write” with a handle for a remote buffer and a handle for a local buffer. The library obtains an RDMA doorbell register and status register from the driver and maps them, creates an RDMA descriptor, and writes the descriptor address and size into the RDMA doorbell. It then polls the status register until the status indicates completion or error. In either case, it returns the appropriate code to the application.
Optionally, the application may just provide a local buffer address and size, and allow the library to create the local handle. Also optionally, the API may include an RDMA initialization call for the library to acquire and map the RDMA doorbell and status registers, which are then used on subsequent RDMA operations.
The application calls “free_rdma_handle” to indicate to the library that the buffer will no longer be used for RDMA operations. The library can at this point unpin the buffer and release the I/O virtual address if it so desires. It may also continue to have the buffer pinned and hold the I/O virtual address in a cache, to service a subsequent “get_rdma_handle” call on the same buffer.
The application calls “deregister” with a start and end address for a region of memory. This indicates to the library that the region of memory will no longer participate in RDMA operations, and the application is even allowed to deallocate the region of memory from its address space. At this point, the library has to delete any buffers that it holds in its cache that are contained in the region, i.e., unpin the buffers and release their I/O virtual addresses.
In a variation of the invention, the registration cache is implemented as a hash table. The key into the hash table is the page address of a buffer in the application's virtual address space, where page refers to the unit of granularity at which I/O virtual addresses are allocated (I/O page size is typically 8 KB).
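For example, with an 8 KB I/O page size the key could be derived as in this sketch; the table size is an arbitrary assumption:

    #include <stdint.h>

    #define IO_PAGE_SHIFT 13    /* 8 KB I/O page size */
    #define HASH_BUCKETS  1024  /* assumed number of hash buckets */

    /* Key the registration cache on the buffer's page virtual address. */
    static inline unsigned reg_cache_index(uintptr_t buf_vaddr)
    {
        uintptr_t page_vaddr = buf_vaddr >> IO_PAGE_SHIFT;
        return (unsigned)(page_vaddr % HASH_BUCKETS);
    }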
In another variation of the present embodiment, each entry of the registration cache typically contains the fields listed in Table 5.
An entry is added to the cache during a “get_rdma_handle” call. The following steps are performed as part of the call. The page virtual address of the buffer is obtained and used to index into the hash table. If a valid hash entry is found, the “Status” is set to “Active” and a handle is returned. If a valid entry is not found, system calls are executed to pin the page and obtain an I/O virtual address; a new hash entry is created and inserted into the table, and its “Status” is set to “Valid” and “Active,” with a handle being returned. When “free_rdma_handle” is called, the corresponding hash table entry is set to “Inactive.”
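Expressed as a sketch, with an entry layout patterned on the steps above (Table 5 is not reproduced here, and lookup_hash(), pin_and_map() and insert_hash() are hypothetical helpers):

    #include <stddef.h>
    #include <stdint.h>
    #include <time.h>

    typedef uint64_t rdma_handle_t;

    /* Illustrative registration-cache entry; field names are assumptions. */
    struct reg_entry {
        uintptr_t page_vaddr;     /* key: page virtual address of the buffer */
        uint64_t  iova;           /* handle: I/O virtual address */
        int       valid, active;  /* "Status" flags */
        time_t    last_used;      /* for threshold-based eviction */
        struct reg_entry *next;   /* hash chain */
    };

    struct reg_entry *lookup_hash(uintptr_t page_vaddr);    /* hypothetical */
    struct reg_entry *pin_and_map(void *buf, size_t size);  /* hypothetical */
    void              insert_hash(uintptr_t page_vaddr, struct reg_entry *e);

    rdma_handle_t get_rdma_handle(void *buf, size_t size)
    {
        uintptr_t page = (uintptr_t)buf >> 13;  /* 8 KB I/O page */
        struct reg_entry *e = lookup_hash(page);

        if (e && e->valid) {  /* hit: reuse the cached registration */
            e->active = 1;
            e->last_used = time(NULL);
            return e->iova;
        }
        /* Miss: pin the page and obtain an I/O VA via system calls, then
         * insert a new entry marked Valid and Active. */
        e = pin_and_map(buf, size);
        insert_hash(page, e);
        e->valid = e->active = 1;
        e->last_used = time(NULL);
        return e->iova;
    }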
The library keeps track of the total size of memory that is pinned at any point in time. Once the size of pinned memory crosses a user-settable threshold (defined as a fraction of total physical memory, e.g., ½ or ¾), the library walks through the entire hash table and frees all hash table entries whose “Status” is “Inactive” and whose last time of use was further back than another user-settable threshold (e.g., more than 1 hour ago). When “deregister” is called on a region, the library walks down the hash table and releases all entries that are contained in the region being deregistered.
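Reusing the reg_entry sketch above, the threshold-driven sweep might look as follows; the idle threshold and the unpin_and_release() helper are assumptions:

    #include <time.h>

    /* Hypothetical helper: unpin the pages and free the I/O VA. */
    void unpin_and_release(struct reg_entry *e);

    /* Walk every hash chain and evict entries that are Inactive and have
     * not been used within max_idle seconds (e.g., 3600 for one hour). */
    void trim_registration_cache(struct reg_entry **table, int nbuckets,
                                 time_t max_idle)
    {
        time_t now = time(NULL);

        for (int b = 0; b < nbuckets; b++) {
            struct reg_entry **pp = &table[b];
            while (*pp != NULL) {
                struct reg_entry *e = *pp;
                if (!e->active && (now - e->last_used) > max_idle) {
                    *pp = e->next;  /* unlink from the chain */
                    unpin_and_release(e);
                } else {
                    pp = &e->next;
                }
            }
        }
    }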
While embodiments of the invention have been illustrated and described, it is not intended that these embodiments illustrate and describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention.