Cluster database systems run on multiple host computers. A client can connect to any of the host computers and see a single database. Shared data cluster database systems provide coherent access from multiple host computers to a shared copy of data. Providing this coherent access to the same data across multiple host computers inherently involves performance compromises. For example, consider a scenario where a given database data is cached in the memory of two or more of the host computers in the cluster. A transaction running on a first host computer changes its copy of the given database data in memory and commits the transaction. At the next instant in time, another transaction starts on a second host computer, which reads the same given database data. For the cluster database system to function correctly, the second host computer must read the database data as updated by the first host computer.
Many existing approaches to ensuring such coherent access to shared data involve a messaging protocol. However, messaging protocols incur overhead both in the processor cycles needed to process the messages and in the communication bandwidth needed to send them. Some systems avoid messaging protocols through the use of specialized hardware that reduces or eliminates the need for messages. However, for systems without such specialized hardware, this approach is not possible.
According to one embodiment of the present invention, a coherency manager provides coherent access to shared data in a shared database system by: determining that remote direct memory access (RDMA) operations are supported in the shared database system; receiving a copy of updated database data from a first host computer in the shared database system through RDMA, the copy of the updated database data comprising updates to a given database data; storing the copy of the updated database data as a valid copy of the given database data in local memory; invalidating local copies of the given database data on other host computers in the shared database system through RDMA; receiving acknowledgements from the other host computers through RDMA that the local copies of the given database data have been invalidated; and sending an acknowledgement of receipt of the copy of the updated database data to the first host computer through RDMA.
In one embodiment, the coherency manager receives a request for the valid copy of the given database data from a second host computer in the shared database system through RDMA; retrieves the valid copy of the given database data from the local memory; and returns the valid copy of the given database data to the second host computer through RDMA.
In one embodiment, the coherency manager determines that RDMA operations are not supported in the shared database system; receives one or more messages comprising copies of a plurality of updated database data from a first host computer, where the copies of the plurality of updated database data comprise updates to a plurality of given database data; stores the copies of the plurality of updated database data as valid copies of the plurality of given database data in local memory; sends a single message to the other host computers invalidating local copies of the plurality of given database data on the other host computers; receives acknowledgement messages from the other host computers that the local copies of the plurality of given database data have been invalidated; and sends an acknowledgement message of receipt of the copies of the plurality of updated database data to the first host computer.
In one embodiment, a host computer updates a local copy of a given database data; determines a popularity of the given database data; in response to determining that the given database data is unpopular, sends only updated database data identifiers to a coherency manager through RDMA; and in response to determining that the given database data is popular, sends the updated database data identifiers and a copy of the updated database data to the coherency manager through RDMA.
System and computer program products corresponding to the above-summarized methods are also described herein.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java® (Java, and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both), Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
As illustrated, the process to ensure that Node 2 reads the latest copy of the page in transaction 2 requires numerous messages to be exchanged between Nodes 1, 2, and 3. The messages consume communication bandwidth and require central processing unit (CPU) cycles at each node to process the messages it receives. The volume of such messages can significantly increase overhead on the database system and degrade performance.
Embodiments of the present invention reduce the messages required to ensure coherent access to shared copies of database data through the use of a Coherency Manager.
Each host computer 202-205 is operatively coupled to a processor 206 and a computer readable medium 207. The computer readable medium 207 stores computer readable program code 208 for implementing the method of the present invention. The processor 206 executes the program code 208 to ensure coherent access to shared copies of database data across the host computers 202-205, according to the various embodiments of the present invention.
The Coherency Manager provides centralized page coherency management, and may reside on a distinct computer in the cluster or on a host computer which is also performing database processing, such as host computer 205. The Coherency Manager 205 provides database data coherency by leveraging standard remote direct memory access (RDMA) protocols, intelligently selecting between a force-at-commit protocol and an invalidate-at-commit protocol, and using a batch protocol for data invalidation when RDMA is not available, as described further below. RDMA is a direct memory access from the memory of one computer into that of another computer without involving either computer's operating system. RDMA allows for the transfer of data directly to or from the memories of two computers, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers do not require work to be done by the CPUs or caches.
If the local copy of the given database data is not valid, the host computer 202 sends a request to the Coherency Manager 205 for a valid copy of the given database data through RDMA (303). The Coherency Manager 205 receives the request for the valid copy of the given database data from the host computer 202 through RDMA (309), retrieves the valid copy of the given database data from its local memory (310), and returns the valid copy of the given database data to the host computer 202 through RDMA (311).
The host computer 202 receives the valid copy of the given database data from the Coherency Manager 205 and stores it as the local copy (304). If the transaction is to read the given database data (305), then the host computer 202 reads the valid local copy of the given database data (306) and commits the transaction (318). Otherwise, the host computer 202 updates the local copy of the given database data (307). The host computer 202 then sends a copy of the updated database data to the Coherency Manager 205 through RDMA (308). The Coherency Manager 205 receives the copy of the updated database data from the host computer 202 through RDMA (312), and stores the copy of the updated database data as the valid copy of the given database data in local memory (313). The Coherency Manager 205 then invalidates the local copies of the given database data on the other host computers 203-204 in the cluster database system that contain a copy through RDMA (314). When the Coherency Manager 205 receives acknowledgements from the other host computers 203-204 through RDMA that the local copies of the given database data have been invalidated (315), the Coherency Manager 205 sends an acknowledgement of receipt of the copy of the updated database data to the host computer 202 through RDMA (316). The host computer 202 receives the acknowledgement of receipt of the copy of the updated database data from the Coherency Manager 205 through RDMA (317), and in response, commits the transaction (318). This mechanism is referred to herein as a “force-at-commit” protocol. Once the transaction commits, any lock on the given database data owned by the host computer 202 is released.
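For illustration, the force-at-commit sequence on the Coherency Manager side (steps 312-316) can be sketched in Python. This is a minimal sketch under stated assumptions, not the disclosed implementation: the Host and CoherencyManager classes are hypothetical, and synchronous method calls stand in for one-sided RDMA operations.

```python
# Minimal sketch of the force-at-commit protocol, manager side (312-316).
# All names are hypothetical; method calls stand in for RDMA operations.

class Host:
    def __init__(self, name):
        self.name = name
        self.pages = {}  # local buffer: page_id -> (data, valid_flag)

    def rdma_invalidate(self, page_id):
        # Models step 314: a one-sided RDMA write clears the valid flag
        # on this host's local copy, then an acknowledgement is returned.
        data, _ = self.pages[page_id]
        self.pages[page_id] = (data, False)
        return "ack"

class CoherencyManager:
    def __init__(self, hosts):
        self.hosts = hosts        # the other host computers (e.g., 203-204)
        self.valid_pages = {}     # valid copies held in local memory

    def on_updated_page(self, updater, page_id, page_data):
        # Steps 312-313: receive the updated copy and store it as the
        # valid copy of the given database data in local memory.
        self.valid_pages[page_id] = page_data
        # Step 314: invalidate the local copies on every other host
        # that holds a copy of the page.
        holders = [h for h in self.hosts
                   if h is not updater and page_id in h.pages]
        # Step 315: collect acknowledgements of the invalidations.
        acks = [h.rdma_invalidate(page_id) for h in holders]
        assert all(a == "ack" for a in acks)
        # Step 316: acknowledge receipt so the updating host can commit
        # its transaction (steps 317-318).
        return "ack"
```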
When another host computer wishes to access the given database data during another transaction, steps 301-318 are repeated.
The force-at-commit protocol described above allows the Coherency Manager 205 to invalidate any copies of the database data that exist in the buffers of other host computers 203-204 before the transaction at the host computer 202 commits. The force-at-commit protocol further allows the Coherency Manager to maintain a copy of the updated database data, such that future requests for the database data from any host computer in the system can be efficiently provided directly from the Coherency Manager 205 without using a messaging protocol.
Assume that Node 2 starts transaction 2 and wants to read page A (301). Node 2 determines that the local copy of page A is invalid (302). Node 2 then sends a request to the Coherency Manager 205 for a valid copy of page A through RDMA, and receives the valid copy of page A from the Coherency Manager 205 through RDMA (303-304). Node 2 reads the valid copy of page A and commits the transaction (305-306, 318). Node 2 is thus assured to read the latest copy of page A. As can be seen by comparing this exchange with the messaging scenario described earlier, far fewer messages are required to ensure coherent access.
During the invalidation of step 314, the RDMA operations must fully complete with respect to the memory hierarchy of the host computers 203-204 before the Coherency Manager 205 acknowledges receipt in step 316. The RDMA protocol updates the memories at the host computers 203-204 but not the caches, such as the Level 2 caches of the CPUs. Thus, it is possible for an RDMA operation to invalidate a local copy of database data in memory but fail to invalidate a copy of the database data in cache. This would lead to incoherency of the data. To ensure that the RDMA operations fully complete with respect to the memory hierarchy of the host computers, the method of the present invention leverages existing characteristics of the RDMA protocol during the invalidation (314), as described below.
In one embodiment, the RDMA-write operations are immediately followed by RDMA-read operations of the same memory locations. In another embodiment, the RDMA-write operations are immediately followed by another set of RDMA-write operations of the same memory locations. Open RDMA protocols generally require that for the RDMA-read or RDMA-write operation to complete, any prior RDMA-write operations to the same location must have fully completed with respect to the memory coherency domain on the target computer. Thus, sending RDMA-read or RDMA-write operations to the same memory locations immediately after the RDMA-write operations ensures that no copies in the cache at the host computers 203-204 would erroneously remain valid.
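A minimal sketch of this technique follows, assuming hypothetical rdma_write and rdma_read helpers that model the ordering rule just described (a read of a location cannot complete until any prior write to that location is fully visible on the target):

```python
# Sketch of invalidation followed by a read-back of the same location.
# The rdma_write/rdma_read helpers are hypothetical stand-ins for a
# real RDMA transport layer.

INVALID = b"\x00"

def rdma_write(target_memory, addr, data):
    # Hypothetical one-sided write into the target host's memory.
    target_memory[addr] = data

def rdma_read(target_memory, addr):
    # Hypothetical one-sided read; open RDMA protocols require that any
    # prior RDMA-write to the same location be fully complete with
    # respect to the target's memory coherency domain before this
    # operation completes.
    return target_memory[addr]

def invalidate_with_readback(target_memory, page_addr):
    rdma_write(target_memory, page_addr, INVALID)   # mark the copy invalid
    readback = rdma_read(target_memory, page_addr)  # forces full completion
    # Once the read completes, no stale copy can remain valid in the
    # target host's caches.
    assert readback == INVALID
```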
Thus, once the acknowledgements that the second RDMA operations have completed are received from the other host computers 203-204, the Coherency Manager 205 is assured that the invalidation of the local copies of the given database data at the host computers 203-204 is complete throughout the entire memory hierarchy of the host computers 203-204.
Alternatively, some RDMA-capable adapters include a ‘delayed ack’ feature. The ‘delayed ack’ feature does not send an acknowledgement of an RDMA-write operation until the operation is fully complete. This ‘delayed ack’ feature can thus be leveraged to ensure that the invalidation of the local copies of the given database data are complete in the entire memory hierarchy in the host computers 203-204.
To optimize the method according to the present invention, several techniques can be used in conjunction with the RDMA operations described above. One technique is the parallel processing of the RDMA invalidations. In the parallel processing, for any given database data that requires invalidation, the Coherency Manager 205 first initiates all RDMA operations to the other host computers containing a local copy of the database data. Then, the Coherency Manager 205 waits for the acknowledgements from each host computer 203-204 that the RDMA has completed before proceeding. For example, when used in conjunction with the RDMA-write operation followed by the RDMA-read approach described above, both RDMA operations are initiated for all of the other host computers 203-204, then all of the acknowledgements of the RDMA operations are collected from the other host computers 203-204 before the Coherency Manager 205 proceeds.
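As a rough sketch of this parallel pattern, the following Python fragment initiates all invalidations before collecting any completion; the thread pool and invalidate_one helper are hypothetical stand-ins for posting asynchronous RDMA work requests:

```python
# Sketch of parallel invalidation: initiate every RDMA invalidation
# first, then gather all completions before proceeding. A thread pool
# stands in for asynchronous RDMA work-request posting.

from concurrent.futures import ThreadPoolExecutor

def invalidate_one(holder_memory, page_addr):
    # Stand-in for the RDMA-write + RDMA-read pair described above.
    holder_memory[page_addr] = b"\x00"
    return holder_memory[page_addr]

def invalidate_all(holder_memories, page_addr):
    if not holder_memories:
        return []
    with ThreadPoolExecutor(max_workers=len(holder_memories)) as pool:
        # Initiate the invalidations to all holders without waiting...
        futures = [pool.submit(invalidate_one, mem, page_addr)
                   for mem in holder_memories]
        # ...then collect every acknowledgement before the Coherency
        # Manager proceeds (step 316).
        return [f.result() for f in futures]
```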
In another technique, multi-casting is used in conjunction with the RDMA operations described above. Instead of sending separate, explicit RDMA operations to each host computer 203-204, the Coherency Manager 205 uses a single multi-cast RDMA operation to the host computers 203-204 with a copy of the database data to be invalidated. Thus, one multi-cast RDMA operation is used to accomplish invalidations on the host computers 203-204.
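A brief sketch of the multi-cast variant follows, with a hypothetical multicast_write helper; in a real fabric the operation would be posted once and fanned out by the network rather than looped in software:

```python
# Sketch of multi-cast invalidation: one multi-cast RDMA operation
# replaces N separate point-to-point operations. multicast_write is a
# hypothetical stand-in; the loop only models the fabric's fan-out.

def multicast_write(group_memories, page_addr, payload):
    for memory in group_memories:
        memory[page_addr] = payload

def invalidate_by_multicast(holder_memories, page_addr):
    # A single posted operation invalidates the page on every holder.
    multicast_write(holder_memories, page_addr, b"\x00")
```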
In another embodiment of the method of the present invention, a further optimization is the intelligent selection by the host computer 202 between the force-at-commit protocol described above and an “invalidate-at-commit” protocol. In the invalidate-at-commit protocol, the identifiers of the updated database data are sent to the Coherency Manager 205, but a copy of the updated database data itself is not. In this embodiment, the selection is based on the “popularity”, or frequency of accesses, of the given database data being updated. Database data that are frequently referenced by different host computers in the cluster are “popular” while database data that are infrequently referenced are “unpopular”. The sending of a copy of updated database data that are unpopular may waste communication bandwidth and memory. Such unpopular database data may not be requested by other host computers in the cluster before the data is removed from memory by the Coherency Manager 205 in order to make room for more recently updated data. Accordingly, for data that are determined to be “unpopular”, an embodiment of the present invention uses an invalidate-at-commit protocol.
With the invalidate-at-commit protocol, the Coherency Manager 205 is still able to invalidate the local copies of the given database data at the other host computers 203-204 using the updated database data identifiers but is not required to store a copy of the updated database data itself. When a host computer later requests a copy of the updated database data, the Coherency Manager 205 can request the valid copy from the host computer 202 that updated the database data and return the valid copy to the requesting host computer. For workloads involving random access to data, this can provide a significant savings in communication bandwidth costs.
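The invalidate-at-commit read path might be sketched as follows; the classes and the owner map are hypothetical illustrations of the idea that the manager records only page identifiers and fetches the valid copy on demand from the updating host:

```python
# Sketch of the invalidate-at-commit read path: the manager stores no
# page data, only which host owns the valid copy. All names are
# hypothetical.

class Host:
    def __init__(self, pages):
        self.pages = pages            # page_id -> page data

    def read_page(self, page_id):
        return self.pages[page_id]

class InvalidateAtCommitManager:
    def __init__(self):
        self.owner = {}               # page_id -> host with the valid copy

    def on_commit_identifiers(self, updater, page_ids):
        # Only identifiers arrive at commit; no page data is stored.
        for page_id in page_ids:
            self.owner[page_id] = updater

    def get_valid_copy(self, page_id):
        # A later request is satisfied by fetching the valid copy from
        # the host that performed the update.
        return self.owner[page_id].read_page(page_id)
```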
Various mechanisms can be used to determine the popularity of database data. One embodiment leverages the fact that database data in a host computer's local bufferpool are periodically written to disk. When a host computer updates a given database data, at commit time, the host computer determines if the database data was originally stored into the local bufferpool via a reading of the database data directly from disk. If so, this means that no other host computer in the cluster requested the database data between writings from the bufferpool to disk. Thus, the database data is determined to be “unpopular,” and the host computer uses the invalidate-at-commit protocol. If the host computer determines that the database data was originally stored into the local bufferpool via a reading of the database data from the Coherency Manager 205, then this means that there was at least one other host computer in the cluster that requested the database data between writings from the bufferpool to disk. Thus, the database data is determined to be “popular”, and the host computer uses the force-at-commit protocol. Other mechanisms for determining the popularity of database data may be used without departing from the spirit and scope of the present invention.
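The commit-time selection can be summarized in a short sketch; the Source enumeration and message shapes are hypothetical, standing in for whatever bookkeeping the bufferpool keeps about where a page came from:

```python
# Sketch of popularity-based protocol selection at commit time. A page
# read from disk is treated as unpopular; a page fetched from the
# Coherency Manager is treated as popular. Names are hypothetical.

from enum import Enum

class Source(Enum):
    DISK = 1                # page entered the bufferpool from disk
    COHERENCY_MANAGER = 2   # page entered from the Coherency Manager

def commit_message(page_id, page_data, source):
    if source is Source.DISK:
        # Unpopular: no other host requested the page between writings
        # to disk, so send identifiers only (invalidate-at-commit).
        return {"protocol": "invalidate-at-commit", "page_id": page_id}
    # Popular: another host requested the page, so ship the data too
    # (force-at-commit).
    return {"protocol": "force-at-commit",
            "page_id": page_id, "data": page_data}
```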
Some communications fabrics of cluster database systems do not support RDMA operations. On such fabrics, an embodiment of the present invention increases the efficiency of coherent data access by amortizing multiple separate invalidations for different database data into a single message. For example, Node 1 may execute and commit ten transactions updating twenty pages. Node 2 has all twenty pages buffered. Instead of sending twenty individual page invalidation messages, the Coherency Manager 205 sends a single message to Node 2 containing the identifiers for all twenty pages. When Node 2 receives and processes the message, Node 2 invalidates all twenty pages in its local buffer before replying to the Coherency Manager 205 with an acknowledgement. Thus, instead of expending CPU cycles to process twenty invalidation messages, Node 2 only expends CPU cycles to process one message.
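A sketch of this batching follows, assuming a hypothetical send_message transport and a simple dictionary as each node's local buffer:

```python
# Sketch of batched invalidation on a fabric without RDMA: the manager
# accumulates page identifiers per holder and sends one message carrying
# all of them. The transport callables are hypothetical.

from collections import defaultdict

class BatchingCoherencyManager:
    def __init__(self):
        self.pending = defaultdict(set)   # holder -> page ids to invalidate

    def queue_invalidation(self, holder, page_id):
        self.pending[holder].add(page_id)

    def flush(self, send_message):
        # One message per holder, e.g. all twenty page identifiers for
        # Node 2 at once, instead of twenty separate messages.
        for holder, page_ids in self.pending.items():
            send_message(holder, {"invalidate": sorted(page_ids)})
        self.pending.clear()

def on_invalidation_message(local_buffer, message, reply):
    # Receiving side: invalidate every listed page, then send one
    # acknowledgement back to the Coherency Manager.
    for page_id in message["invalidate"]:
        local_buffer.pop(page_id, None)
    reply("ack")
```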
Further efficiency can be realized when multi-cast is available. When a set of pages needs to be invalidated, and these pages are buffered in more than one host computer, multi-cast can be used by the Coherency Manager 205 to send a single invalidate message for all of the pages.