The present invention relates generally to computer systems, and particularly to methods and systems for building and operating distributed databases in computer systems.
Distributed shared-disk databases, i.e., database systems that use multiple storage devices, are used in many computer systems. One example of such a product is DB2®, produced by IBM Corporation (Armonk, New York). Additional details regarding DB2 products can be found at www-306.ibm.com/software/data/db2/. Another family of distributed shared-disk databases is produced by Oracle Corporation (Redwood Shores, Calif.). Additional details regarding Oracle database products can be found at www.oracle.com.
Several methods have been proposed for controlling the access of multiple transactions to shared storage devices. This sort of access is needed in distributed databases for maintaining data integrity and for recovering from node failures. For example, one such method is described by Mohan and Narang, in a paper entitled “Recovery and Coherency-Control Protocols for Fast Intersystem Page Transfer and Fine-Granularity Locking in a Shared Disks Transaction Environment,” Proceedings of the 17th International Conference on Very Large Data Bases, Barcelona, Spain, September 1991, pages 193-207, which is incorporated herein by reference. The authors describe schemes for fast page transfers between transaction system instances wherein all sharing instances read and modify the same data. Recovery and coherency control schemes are also described.
Distributed databases sometimes use centralized clustering services, also called “group services,” for synchronizing the data that is distributed across the system. Examples of such group services are described in the publication “RS/6000 Cluster Technology Group Services Programming Guide and Reference,” IBM reference SA22-7355-02, IBM International Technical Support Organization, December 2001, which is available at www-1.ibm.com/support/docview.wss?uid=pub1sa22735502. Another distributed computer system comprising group services is described by Hayden in a PhD thesis entitled “The Ensemble System,” Computer Science Department Technical Report TR98-1662, Cornell University, Ithaca, N.Y., January 1998, which is incorporated herein by reference. The author describes a general-purpose group communication system called “Ensemble,” which can be used in constructing reliable distributed applications.
Accessing data by multiple users and performing fault recovery in databases are typically handled using locking and logging mechanisms. For example, the IBM DB2 database family uses a method called ARIES (Algorithms for Recovery and Isolation Exploiting Semantics). This method is described by Mohan et al. in a paper entitled “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging,” ACM Transactions on Database Systems, (17:1), March 1992, pages 94-162, which is incorporated herein by reference.
Some distributed database systems use object-disks (sometimes referred to as Object-based Storage Devices or OSDs) as building blocks. The Storage Network Industry Association (SNIA) handles the standardization of OSDs and their interfaces. Additional information regarding object-disks can be found at www.snia.org/tech_activities/workgroups/osd. Rodeh and Teperman describe a decentralized file system that uses locking and logging methods for accessing OSDs in a paper entitled “zFS—A Scalable Distributed File System Using Object-Disks,” 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), San Diego, Calif., April 2003, which is incorporated herein by reference.
As mentioned above, currently-available methods for managing distributed databases typically use clustering or group services. While such global services support the synchronization of data and failure recovery, they also suffer from several inherent disadvantages. For example, deploying group services typically requires an additional software layer with software components running on the computers in the network and dedicated messaging protocols between these software components. The amount of messaging traffic associated with group services grows rapidly with the size of the computer system, making it difficult to provide scalable solutions that are suitable for large clusters.
In response to the shortcomings of the prior art, disclosed embodiments of the present invention provide methods and systems for building and operating a database that is truly distributed, in the sense that synchronization and integrity of distributed data are maintained without reliance on centralized clustering or group services. As will be explained hereinbelow, distribution of the data integrity and access control functions is accomplished by a novel method of issuing device-served leases, or time-limited access permissions, that are granted by the nodes and storage devices of the computer system. The distribution of these leasing functions permits new storage devices and compute-nodes that are added to the system to take on their share of these functions, so that the computing and I/O load is spread throughout the system. The disclosed system configuration is thus highly scalable and robust in handling compute-node failures.
Some embodiments of the present invention provide novel methods for rolling-back of failed or aborted database transactions, as well as methods for recovering from various failure events in a distributed database.
Although features of the present invention are particularly suited for supporting database applications, the principles of the present invention are applicable in distributed storage systems generally, in support of distributed applications of other kinds.
There is therefore provided, in accordance with an embodiment of the present invention, a method for managing data in a computer system, including:
storing the data in a plurality of data structures;
receiving a transaction request for accessing the data in a specified data structure;
granting a time-limited lease on the specified data structure responsively to the transaction request; and
controlling an access to the specified data structure based on the lease until completion of the transaction request.
In an embodiment, the data structures are stored on object-disks, and granting the time-limited lease includes granting a major lease from one of the object-disks on which the specified data structure is stored to a compute-node handling the transaction request.
In another embodiment, granting the lease includes granting the lease to a first compute-node in the computer system and delegating the lease from the first compute-node to a second compute-node in the computer system.
In yet another embodiment, granting the lease includes granting a lease for accessing a storage device on which the specified data structure is stored, and wherein controlling the access includes issuing at least one lock for accessing data objects stored in the storage device. Additionally, the at least one lock is released upon expiration of the lease.
In still another embodiment, issuing the at least one lock includes appointing a compute-node in the computer system to serve as a lock manager for the storage device, wherein the lock manager issues the at least one lock.
In another embodiment, the at least one lock is maintained by a first compute-node in the computer system, and controlling the access includes restoring the at least one lock responsively to a failure in the first compute-node using a second compute-node in the computer system.
In yet another embodiment, controlling the access includes:
recording transaction entries in one or more log objects stored in one or more of the data structures;
accessing the data objects responsively to the transaction entries; and
marking the transaction entries and the respective data objects with log serial numbers (LSNs), so as to cause each data object to be marked with monotonically-increasing LSNs. Additionally, when the transaction request is not completed, the transaction request is rolled-back using the transaction entries recorded in the one or more log objects, so as to remove effects of the transaction request from the plurality of data structures.
In another embodiment, controlling the access includes, responsively to a failure of a first compute-node handling the transaction request, completing the transaction request by a second compute-node using the transaction entries recorded in the one or more log objects.
There is also provided, in accordance with an embodiment of the present invention, a computer system including:
a first plurality of storage devices, which are arranged to store data in data structures; and
a second plurality of compute-nodes, which are arranged to receive a transaction request for accessing the data in a specified data structure, and responsively to the transaction request, to request and receive a time-limited lease on the specified data structure, and to control an access to the specified data structure based on the lease until completion of the transaction request.
There is additionally provided, in accordance with an embodiment of the present invention, a computer software product including a computer-readable medium in which program instructions are stored, which instructions, when read by one or more computers, cause the one or more computers to store data in data structures, to receive a transaction request for accessing the data in a specified data structure, and responsively to the transaction request, to request and receive a time-limited lease on the specified data structure, and to control an access to the specified data structure based on the lease until completion of the transaction request.
The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:
Each object-disk 36 is a logical storage device, typically comprising a physical storage device, such as a disk, for storing objects (files), and an application programming interface (API) that communicates with other components of the computer system and enables creation, modification and deletion of objects. In other words, the object-disks and objects can be regarded as data structures, and the methods described herein may also be applied to other types of data structures. The clients, compute-nodes and object-disks are typically interconnected using a suitable high-speed data network 38. An object-disk is also referred to as an object-based storage device (OSD). Although the OSD model is advantageous in building distributed databases, the principles of the present invention may also be applied, mutatis mutandis, using storage devices of other kinds, such as conventional disks or NAS (Network Attached Storage) devices.
The configuration of system 20 and the methods described below were particularly developed to support large-scale computer systems, on the order of hundreds of nodes or more. As will be apparent to those skilled in the art, eliminating clustering and group services is particularly beneficial in large-scale computer systems. Nevertheless, the system configuration described below is highly scalable by nature and may be used for any number of clients, compute-nodes and object-disks.
The clients and compute-nodes, as well as components of the object-disks, may be implemented using general-purpose computers, which are programmed in software to carry out the functions described herein. The software may be downloaded to the computers in electronic form, over a network, for example, or it may alternatively be supplied to the computers on tangible media, such as CD-ROM. The clients, compute-nodes and object-disks may comprise standalone units, or they may alternatively be integrated with other computing functions of computer system 20. Alternatively, functions carried out by the clients, compute-nodes and object-disks may be distributed differently among components of the computer system.
Device-served leases are fundamental building blocks of the methods described hereinbelow. A lease is a lock on a resource, such as an object-disk or an individual page, having a predetermined expiration period. A lease can be viewed as a “virtual token” that grants a compute-node exclusive permission to access a resource for a limited time period. A typical expiry period used by the inventor is on the order of 30 seconds. A compute-node that wishes to maintain its access permission must periodically renew its lease. If the lease-holder does not renew the lease (due to compute-node failure, for example), the OSD automatically becomes available to other compute-nodes for a major-lease (i.e., a lease on the entire OSD, rather than on a specific object), without requiring further communication among the nodes.
Leases are useful in environments in which compute-nodes may fail. When a compute-node holding a lease for a particular resource fails, another compute-node may gain access to the resource after the lease expires, without the need for any additional exchange of information or synchronization. If the failed compute-node recovers, it will have to re-obtain the lease in order to access the resource again. Using leases thus enables multiple users to access a resource without any centralized clustering or group services. The term “device-served” emphasizes the fact that the leases are issued and managed by the resources themselves and not by any centralized service.
Each OSD 36 supports a single exclusive lease, denoted “major-lease.” Only the holder of a valid (i.e., non-expired) major-lease has permission to access that particular OSD. Each OSD also maintains a record of the identity of the compute-node that currently holds its major-lease. If a compute-node requests access to an OSD, the OSD will provide the requesting node with the network address of the major-lease holder. Typically, three operations are defined for a compute-node with regard to the major-lease of an OSD: take, release and renew. Leases may also be delegated from one compute-node to another, as shown below.
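Purely by way of illustration, the following Python sketch models how a single OSD might serve its exclusive major-lease and support the take, renew and release operations described above. The class name OSDMajorLease, the returned dictionary and the 30-second default are assumptions introduced for this sketch, not a definitive implementation of any claimed embodiment.

```python
import time

class OSDMajorLease:
    """Illustrative model of the single exclusive major-lease served by one OSD.

    The OSD merely records which compute-node currently holds the lease and
    when it expires; it never contacts other nodes. Hypothetical sketch only.
    """

    def __init__(self, expiry_seconds=30.0):      # ~30 s, as suggested above
        self.expiry_seconds = expiry_seconds
        self.holder = None                        # network address of the holder
        self.expires_at = 0.0

    def _expired(self):
        return self.holder is None or time.time() >= self.expires_at

    def take(self, node_address):
        """Grant the major-lease if it is free, expired, or already ours."""
        if not self._expired() and self.holder != node_address:
            # Lease still valid elsewhere: refuse, but report the holder's
            # address so the requester can contact it (as described above).
            return {"granted": False, "holder": self.holder}
        self.holder = node_address
        self.expires_at = time.time() + self.expiry_seconds
        return {"granted": True, "holder": node_address}

    def renew(self, node_address):
        """Extend the lease; only the current, still-valid holder may renew."""
        if self.holder != node_address or self._expired():
            return False
        self.expires_at = time.time() + self.expiry_seconds
        return True

    def release(self, node_address):
        """Voluntarily give up the lease before it expires."""
        if self.holder == node_address:
            self.holder = None
            self.expires_at = 0.0
```

As the sketch suggests, a compute-node that stops renewing (for example, because it failed) simply lets the lease lapse; the next take() call by any other node then succeeds without inter-node messaging, consistent with the device-served model described above.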
A typical database transaction carried out by a compute-node comprises the modification of data on one or more pages belonging to one or more files (objects). The files or objects may be stored on a single OSD or distributed among several OSDs. Before accessing and modifying data in a particular page, a compute-node 28 should first obtain a lock on the required page, to avoid conflicts with other compute-nodes that may try to modify the same page at the same time. For this purpose, each OSD 36, denoted X, in system 20 supports a lock manager, denoted XLKM, which provides lock services for all pages stored on OSD X to all components of system 20. Lock manager XLKM may run on any compute-node in system 20. The lock manager typically operates by taking the major-lease for OSD X and continuously renewing it.
Subject to the major-lease, the compute-node subsequently takes, renews and releases locks on pages and other objects on the OSD, as required by the transaction it needs to perform, at a locking step 56. The locks enable the compute-node to modify the data and perform the transaction. In other words, both the major-lease and a specific lock on the target page or object are needed in order to perform a transaction on the target.
The lease given by the lock manager to the compute-node is thus different from the major-lease, as it protects the client-server protocol between the lock taker (the compute-node) and the lock manager. As with all leases, the compute-node should periodically renew its lease with the lock manager. As long as this lease is valid, all of the client's locks on pages and objects will be respected. If the compute-node does not renew its lease, the lock manager will assume the compute-node failed. When the lease has expired, the lock manager will notify any node that asks to access pages previously locked by the failed node that recovery needs to be performed (see the detailed description of recovery methods hereinbelow). A compute-node that was disconnected, for any reason, from the lock manager will not be able to re-connect until its lease has expired.
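The interaction between a lock manager XLKM and its lease-holding clients might be sketched roughly as follows. This is a minimal illustration only; the method names, the string return values and the data layout are assumptions rather than part of any claimed embodiment.

```python
import time

class LockManager:
    """Illustrative lock manager (XLKM) for one OSD.

    Page locks are granted only to compute-nodes holding a valid client lease
    with the lock manager; when a holder's lease lapses, its pages are flagged
    as needing recovery before they can be granted to anyone else.
    """

    def __init__(self, client_lease_seconds=30.0):
        self.client_lease_seconds = client_lease_seconds
        self.client_leases = {}      # node address -> lease expiry time
        self.page_locks = {}         # page id -> node address of lock holder
        self.needs_recovery = set()  # pages whose holder's lease has lapsed

    def _lease_valid(self, node):
        return time.time() < self.client_leases.get(node, 0.0)

    def renew_client_lease(self, node):
        """Take or renew the client lease that protects all of a node's locks."""
        self.client_leases[node] = time.time() + self.client_lease_seconds

    def lock_page(self, node, page):
        if not self._lease_valid(node):
            raise PermissionError("no valid client lease; renew the lease first")
        owner = self.page_locks.get(page)
        if owner is not None and owner != node:
            if self._lease_valid(owner):
                return "denied"      # another node validly holds the lock
            # The previous holder's lease lapsed: the requester is told that
            # recovery must be performed before the lock can be granted.
            self.needs_recovery.add(page)
            return "recovery-required"
        self.page_locks[page] = node
        return "granted"

    def release_page(self, node, page):
        if self.page_locks.get(page) == node:
            del self.page_locks[page]
```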
Compute-nodes that have obtained leases from a lock manager are then allowed direct access to the respective OSD. This provision enhances efficiency of storage access but assumes that the compute-nodes are non-malicious, i.e., that they will modify only pages that they have previously locked.
In practical implementations, the lock manager itself may also fail. Several methods may be used for maintaining and respecting the locks granted by a lock manager that has failed. In one embodiment, all granted locks may be recorded to disk (“hardened”) by setting up an object denoted Xlocks on OSD X, comprising a list of all locks currently granted by XLKM. Xlocks is updated whenever a lock is granted or released. Access to Xlocks is available only to XLKM, as it holds the major-lease for OSD X. Should the compute-node running XLKM fail, another compute-node will typically take the major-lease for OSD X and recover the locks from Xlocks.
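The “hardening” of granted locks to the Xlocks object might be expressed roughly as follows. The osd object with its write_object/read_object calls and the JSON serialization are assumptions made only for the purpose of this sketch.

```python
import json

XLOCKS_OBJECT = "Xlocks"    # name of the lock-list object stored on OSD X

def harden_locks(osd, page_locks):
    """Persist the full list of currently granted locks to the Xlocks object.

    Called whenever a lock is granted or released. Only XLKM, as holder of
    the major-lease for OSD X, is permitted to write this object. The `osd`
    argument stands for any client API offering write_object/read_object;
    that interface and the JSON encoding are assumptions of this sketch.
    """
    osd.write_object(XLOCKS_OBJECT, json.dumps(page_locks).encode("utf-8"))

def recover_locks(osd):
    """Reload the locks granted by a failed XLKM, for use by its replacement
    after it has taken the major-lease for OSD X."""
    raw = osd.read_object(XLOCKS_OBJECT)
    return json.loads(raw.decode("utf-8")) if raw else {}
```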
In a distributed database, deadlock situations may occur in spite of the locking methods used. For example, consider a scenario in which two compute-nodes labeled A and B simultaneously request locks on two pages labeled P1 and P2, but in reverse order. The result is that compute-node A will take a lock on P1 and will be denied access to P2, while compute-node B will take a lock on P2 but will be denied access to P1. Both transactions will be stuck, waiting endlessly to receive a lock on their respective second pages. Practical deadlock scenarios are typically more complex and may involve several compute-nodes.
The deadlock problem is particularly severe in distributed databases that have no global lock manager having complete knowledge of the locks that have been taken and requested across the system. Several methods are known in the art for resolving deadlock situations such as the scenario described above. Such methods typically involve identifying transactions that block each other and breaking the deadlock by aborting some of these transactions. Any suitable deadlock prevention method may be used in conjunction with the methods described herein.
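As one example of a suitable method (not prescribed by the present description), acquiring page locks in a single global order, here simply by sorted page identifier, prevents the circular wait illustrated above. The helper below is a hedged sketch that assumes the illustrative lock-manager interface given earlier.

```python
def acquire_in_order(lock_manager, node, pages):
    """Acquire page locks in one global order (sorted page identifiers) so
    that two transactions can never wait on each other in a cycle.

    Illustrative only; `lock_manager` is assumed to expose the lock_page /
    release_page calls of the sketch given earlier. If any lock cannot be
    granted, all locks taken so far are released rather than held while
    waiting, so no circular wait can form.
    """
    granted = []
    for page in sorted(pages):
        if lock_manager.lock_page(node, page) != "granted":
            for held in granted:
                lock_manager.release_page(node, held)
            return False
        granted.append(page)
    return True
```

Under this discipline, compute-nodes A and B in the example above would both request P1 before P2, so neither can hold one of the pages while waiting for the other.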
In the description above, pages are treated as the atomic unit for locking. A page (typically on the order of 8K bytes in size) may comprise multiple records, and database applications typically require record-level read/write locking. Therefore, in one embodiment, a compute-node that takes a lock on a page using the methods described hereinabove may provide finer locking granularity by locking individual records within this page for particular transactions running on this compute-node. Additional information regarding page locking methods may also be found in the paper by Rodeh and Teperman cited above.
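Record-level granularity within a page that is already locked by a compute-node might be tracked locally on that node, roughly as sketched below. The class and its no-upgrade policy are simplifying assumptions introduced for illustration only.

```python
class LocalRecordLocks:
    """Record-level read/write locks handed out locally by the compute-node
    that already holds the page lock (illustrative sketch; lock upgrades and
    queueing are not modeled)."""

    def __init__(self):
        self.readers = {}   # (page, record) -> set of transaction ids reading
        self.writer = {}    # (page, record) -> transaction id writing

    def lock_record(self, txn, page, record, write=False):
        key = (page, record)
        if write:
            # Refuse if anyone is reading or another transaction is writing.
            if self.readers.get(key) or self.writer.get(key, txn) != txn:
                return False
            self.writer[key] = txn
        else:
            # Refuse if another transaction is writing this record.
            if self.writer.get(key, txn) != txn:
                return False
            self.readers.setdefault(key, set()).add(txn)
        return True

    def release_transaction(self, txn):
        """Drop every record lock held by a transaction when it ends."""
        self.writer = {k: v for k, v in self.writer.items() if v != txn}
        for holders in self.readers.values():
            holders.discard(txn)
```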
A database is often required to perform rollback of a transaction, either because the transaction is aborted by the user or as part of recovery from a failure. To support rollback and recovery from failures, each compute-node 28 typically maintains a log object that records all database transactions. One logging technique that may be used for this purpose is Write-Ahead Logging (WAL). WAL means that each entry of a transaction is recorded in the log before being performed in the database itself. Once all entries of a particular transaction have been logged and performed, the transaction is committed to disk. This technique enables transactions to be recovered and “re-played” in the event of a failure.
Every log entry is typically stamped with a Log Sequence Number (LSN) provided by the log. The LSNs are assigned to pages in a monotonically increasing order. Each modified page in the database is stamped with the largest LSN of a log entry that modified it. The compute-node keeps track of the largest LSN entry that was committed to disk, and prevents pages having larger LSNs from being written to the disk. A method for synchronizing LSNs when multiple log objects are present is described below.
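A compact sketch of write-ahead logging with LSN stamping, as described above, might look as follows. The WALog and Page classes and their attributes are assumptions introduced for this illustration rather than a definitive implementation.

```python
class WALog:
    """Illustrative write-ahead log: every modification is appended to the log
    and stamped with a monotonically increasing LSN before the page itself is
    changed; the log also remembers the largest LSN already committed to disk."""

    def __init__(self):
        self.entries = []        # append-only list of (lsn, txn_id, page_id, redo)
        self.current_lsn = 0
        self.committed_lsn = 0   # largest LSN of a log entry known to be on disk

    def append(self, txn_id, page_id, redo_info):
        self.current_lsn += 1
        self.entries.append((self.current_lsn, txn_id, page_id, redo_info))
        return self.current_lsn

    def commit_up_to(self, lsn):
        self.committed_lsn = max(self.committed_lsn, lsn)

class Page:
    """A database page stamped with the largest LSN that modified it."""
    def __init__(self, page_id):
        self.page_id = page_id
        self.lsn = 0
        self.data = {}

def modify(log, txn_id, page, key, value):
    """Write-ahead rule: log the entry first, then apply it and stamp the page."""
    lsn = log.append(txn_id, page.page_id, (key, value))
    page.data[key] = value
    page.lsn = lsn
    return lsn

def may_write_page_to_disk(log, page):
    """Pages stamped with an LSN larger than the committed LSN must not yet
    be written to disk, as described above."""
    return page.lsn <= log.committed_lsn
```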
The WAL logging scheme described above is similar to the one used by ARIES (as described in the paper by Mohan et al. cited above). Other logging schemes are known in the art. The methods described below may also be used in conjunction with any other suitable logging scheme.
Before describing the methods in which database transactions are performed in system 20, certain aspects of log management will be demonstrated and explained in greater detail.
An example of a log object, comprising transaction entries 62 and compensation log records (CLRs) 66, is shown in the accompanying figure.
Having explained the principles of log management, the nominal transaction process in system 20 will now be explained and demonstrated.
Computer system 20 typically comprises multiple compute-nodes and therefore also comprises multiple respective log objects, one log object for each compute-node. As described above, each log object stamps modified pages with monotonically increasing LSNs (Log Sequence Numbers). To maintain data integrity and avoid erroneous recovery attempts, the LSNs assigned by different log objects should be mutually synchronized so that each page is stamped with a single LSN. Consecutive modifications to a certain page should be assigned monotonically increasing LSNs, even though they may be carried out by different compute-nodes and logged in different logs.
The following exemplary sequence of events demonstrates the potential errors that may occur in the absence of LSN synchronization between log objects: compute-node A takes the page lock on a page P, modifies the page, stamps it with the current LSN value of logA, which in this example is 10, writes the page to disk and releases the lock; compute-node B then takes the page lock on page P, modifies the page, stamps it with the current LSN value of logB, which in this example is only 6, and writes the page to disk.
Following this sequence of events, if compute-node A ever takes the page lock on page P again, and then fails and recovers, it will find an LSN value of 6 marking page P. Compute-node A will then redo the modification corresponding to LSN value 10, erroneously assuming that this modification was not yet written to disk. Since ARIES does not permit a log entry to be replayed twice, redoing this modification results in a data error.
In one embodiment, in order to maintain monotonically increasing LSNs, a compute-node modifying a page P first reads from page P the LSN that was previously assigned to it (denoted PLSN). The compute-node then sets the LSN of its log to the maximum of the current LSN of the compute-node and the PLSN extracted from page P. This method ensures that LSNs are always assigned in a monotonically increasing order. Alternatively, any other suitable LSN synchronization method may be used for this purpose, as will be apparent to those skilled in the art. Caution should be exercised when defining LSN synchronization methods, as some commercial database products encode additional information into the LSN, such as the location of the corresponding transaction entry in the log. In such cases the LSN format may be extended to contain the additional information.
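A minimal sketch of this synchronization rule is given below. The function name and the “+ 1” step used to draw the next LSN are assumptions made only for illustration.

```python
def next_lsn_for_page(log_current_lsn, page_lsn):
    """Return the LSN to stamp on the next modification of a page.

    Lifting the log's counter to at least the page's existing LSN (PLSN)
    guarantees that successive modifications of the page carry monotonically
    increasing LSNs even when they are logged by different compute-nodes.
    The '+ 1' used to draw the next value is an illustrative assumption.
    """
    return max(log_current_lsn, page_lsn) + 1

# Replaying the problematic sequence from the text: page P is already stamped
# with LSN 10 by compute-node A, while compute-node B's log has only reached 5.
assert next_lsn_for_page(log_current_lsn=5, page_lsn=10) == 11   # not 6
```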
Rolling back a database transaction (i.e., canceling the transaction and restoring the database to the exact state it was in before the transaction) is needed when a user decides to abort a transaction. Rollback may also be required in the event of a deadlock between two or more transactions, as described above.
For each modified page that participates in transaction T, compute-node A checks whether the page is cached in memory, at a cache checking step 92. If not, the corresponding page is read from disk at a disk retrieval step 94. The compute-node modifies the page at a modifying step 96 and writes the page to disk at a writing step 98. As mentioned above, the compute-node performs steps 92-98 for all pages that require modification in transaction T.
Finally, compute-node A releases all page locks and terminates the rollback procedure, at a lock releasing step 100. At this stage transaction T is fully rolled-back. Note that deadlocks cannot occur during rollback since compute-node A already holds all relevant page locks from the beginning of transaction T.
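Steps 92-100 of the rollback procedure might be expressed in code roughly as follows. The transaction, cache and disk interfaces (undo_entries, apply_undo, read_page, write_page and so on) are assumptions introduced for this sketch.

```python
def rollback_transaction(txn, page_cache, disk, page_locks):
    """Illustrative rollback of transaction T on compute-node A, corresponding
    to steps 92-100 described above.

    `txn.undo_entries` is assumed to list (page_id, undo_info) pairs in reverse
    order of modification, `disk` to offer read_page/write_page, and all page
    locks of the transaction to be held already (so no deadlock can occur).
    """
    for page_id, undo_info in txn.undo_entries:
        page = page_cache.get(page_id)        # cache checking (step 92)
        if page is None:
            page = disk.read_page(page_id)    # disk retrieval (step 94)
            page_cache[page_id] = page
        page.apply_undo(undo_info)            # modifying the page (step 96)
        disk.write_page(page)                 # writing the page (step 98)
    for page_id in txn.locked_pages:          # lock releasing (step 100)
        page_locks.release_page(txn.node, page_id)
```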
Several fault recovery scenarios are considered below for system 20:
In the event that a compute-node, denoted A, fails and later recovers, compute-node A should replay logA in order to restore the database to its state before the failure.
For each entry 62, denoted E, in logA, compute-node A takes a page lock for the corresponding page P modified by the transaction entry, at a page locking step 122. The compute-node then checks whether PLSN (the LSN of page P) is lower than ELSN (the LSN of transaction entry E), at an LSN checking step 124. If indeed PLSN < ELSN, the compute-node updates page P and also updates PLSN, at a page updating step 126. Otherwise, no page update is performed for this page. CLRs 66 are added to logA for each modification at a CLR adding step 128. Steps 122-128 are performed by compute-node A for each transaction entry 62 in logA. Finally, compute-node A releases all page locks at a lock releasing step 130.
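The replay loop of steps 122-130 might be sketched as follows. The entry attributes (page_id, lsn, redo_info), the apply_redo call and the lock-manager interface are assumptions for illustration, and the placement of the CLR write reflects one reading of “for each modification.”

```python
def replay_log(log_entries, lock_manager, node, disk, append_clr):
    """Illustrative replay of logA after compute-node A recovers, following
    steps 122-130 described above.

    Each entry is assumed to carry page_id, lsn and redo_info attributes;
    `append_clr` records a CLR 66 in logA.
    """
    locked = []
    for entry in log_entries:
        if lock_manager.lock_page(node, entry.page_id) != "granted":   # step 122
            raise RuntimeError("page lock unavailable; replay must wait")
        locked.append(entry.page_id)
        page = disk.read_page(entry.page_id)
        if page.lsn < entry.lsn:              # LSN checking (step 124)
            page.apply_redo(entry.redo_info)  # page updating (step 126)
            page.lsn = entry.lsn
            disk.write_page(page)
            append_clr(entry)                 # CLR adding (step 128)
    for page_id in locked:                    # lock releasing (step 130)
        lock_manager.release_page(node, page_id)
```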
As mentioned hereinabove, the lock-manager of a particular OSD grants locks to compute-nodes for pages that have previously been locked by a failed compute-node only after the failed-node lease expires. In one embodiment, after expiration of the lease on a given page, the lock manager gives the next node requesting a lock on the page the task of recovering the page, or even the entire transaction, prior to being granted the requested lock. The lock manager provides the requesting node with the name and location of the failed-node log object. This assignment of responsibility is required since there is no global service that is responsible for performing recovery.
In the event that a compute-node fails and does not recover, other compute-nodes may wait endlessly for its transactions to complete. This is a problem in a distributed environment, because nodes occasionally become disconnected, run slowly, or suffer from slow network connections. To solve this problem, compute-node A may replay the log of another compute-node B if the latter loses the lock for logB. Recovery of logB by node A is similar to the recovery performed by the owner node, as described above.
Following these methods ensures that once a transaction commit record is written to disk, the transaction will ultimately complete. Even if the initiating compute-node fails, all modified records remain locked. The next compute-node that attempts to access any of these records will be requested to perform recovery on behalf of the failed compute-node. Following recovery, the transaction will be replayed from the log of the initiating compute-node.
When a compute-node fails, any lock manager running on the failed compute-node will also fail. If the lock-manager for OSD X fails, it cannot be replaced until the OSD lease it took expires. Connections between clients and failed lock-managers are torn down, and thus lock holders (clients) become aware that lock-manager recovery is about to take place.
In one embodiment, after the major-lease for OSD X (held by the failed compute-node) expires, another compute-node (for example, the next compute-node that requires access to a page stored in OSD X) takes the major-lease for OSD X and creates a new local lock manager XLKM. The new XLKM recovers the set of granted locks from object Xlocks stored in OSD X. The new XLKM pessimistically assumes that all lock-holders have also failed and notifies all lock-requesters for previously locked pages that recovery is required.
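The fail-over sequence for a lock manager might be sketched as follows. The parameters (an object serving the major-lease and the lock list recovered from Xlocks) and the returned structure are assumptions made only for illustration.

```python
def fail_over_lock_manager(osd_major_lease, my_address, locks_from_xlocks):
    """Illustrative fail-over of the lock manager for OSD X.

    After the failed node's major-lease expires, another compute-node takes
    the lease, rebuilds the lock table from the Xlocks object, and
    pessimistically marks every previously locked page as needing recovery.
    """
    grant = osd_major_lease.take(my_address)      # succeeds only after expiry
    if not grant["granted"]:
        return None                               # another node got there first
    recovered_locks = dict(locks_from_xlocks)     # contents of object Xlocks
    needs_recovery = set(recovered_locks)         # assume all lock-holders failed
    return {"locks": recovered_locks, "needs_recovery": needs_recovery}
```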
Although the leasing, locking and logging methods described herein mainly address OSDs and pages, these methods may be implemented using other data structures, such as disks and individual records. It will thus be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.