1. Technical Field
This invention relates to main-memory transaction processing systems. More specifically, the present invention relates to a logging method and system for recovery of a main-memory database in a transaction processing system.
2. Description of Related Art
A transaction processing system must process transactions in such a manner that “consistency” and “durability” of data are maintained even in the event of a failure such as a system crash. Consistency of data is preserved when transactions are performed in an atomic and consistent manner so that the data initially in a consistent state is transformed into another consistent state. Durability of data is preserved when changes made to the data by “committed” transactions (transactions completed successfully) survive system failures. A database often refers to the data on which transactions are performed.
To achieve consistency and durability of a database, most transaction processing systems perform a process called logging, the process of recording updates in terms of log records in a log file. In case of a failure, these log records are used to undo the changes by incomplete transactions, thereby recovering the database into a consistent state before the failure. These log records are also used to redo the changes made by the committed transactions, thereby maintaining the durable database.
Recovering a consistent and durable database after a system crash using only the log records would require a huge volume of log data because all the log records that have been generated since the creation of the database must be saved. Therefore, a process called “checkpointing” is often used, in which the database is copied to a disk at regular intervals so that only the log records created since the last checkpoint need to be stored. In practice, each page in the database has a “dirty flag” indicating any modification by a transaction so that only the pages modified since the last checkpoint are copied to a disk.
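The dirty-flag idea can be sketched as follows. This is a minimal illustration only; the `Page` class, the `checkpoint` function, and the dictionary standing in for the disk backup are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    data: bytes
    dirty: bool = False  # set when a transaction modifies the page


def checkpoint(pages, backup):
    """Copy only the dirty pages into the backup and clear their flags."""
    copied = []
    for page_id, page in pages.items():
        if page.dirty:
            backup[page_id] = page.data
            page.dirty = False
            copied.append(page_id)
    return copied


pages = {0: Page(b"aaaa"), 1: Page(b"bbbb"), 2: Page(b"cccc")}
backup = {page_id: page.data for page_id, page in pages.items()}  # earlier full backup

# A transaction modifies page 1 after the earlier backup.
pages[1].data = b"BBBB"
pages[1].dirty = True

copied = checkpoint(pages, backup)
```

Only page 1 is written by this checkpoint; pages 0 and 2 are left untouched in the backup.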
FIGS. 3a and 3b are flow charts of the conventional two-pass log-play process. To restore the database to the most recent consistent state, the log records generated by all the committed transactions need to be played, but the log records generated by so-called “loser transactions” that were active at the time of the system crash have to be skipped. (A transaction is said to be a loser when there is a matching transaction start log record but no transaction end record.) For this purpose, all the log records encountered while scanning the log from the checkpointing start log record to the end of the log are played (307). Then, the changes made by the log records of the loser transactions are rolled back (308).
To identify loser transactions, a loser transaction table (LTT) is maintained, which has two fields, TID and Last LSN. This table is initialized with the active transactions recorded in the checkpointing start log record (301). When encountering a transaction start record (302), a matching entry is created in the LTT (305). When encountering a transaction end (either commit or abort) record (303), the matching entry is removed from the LTT (306). Otherwise (304), the LSN of the current log record is recorded in the Last LSN field of the matching LTT entry (307). When reaching the end of the log, the transactions that have matching entries still in the LTT are losers. The most recent record of a loser transaction can be located using the Last LSN field of the matching LTT entry, and other records of the transaction can be located by chasing the Last LSN field of accessed log records backward.
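The LTT maintenance described above can be sketched as follows. The tuple layout `(lsn, tid, kind)` and the function name `find_losers` are assumptions made for this illustration; a dictionary mapping TID to Last LSN stands in for the table.

```python
def find_losers(active_at_checkpoint, log):
    """Scan the log forward and return the LTT as {TID: Last LSN}."""
    # Initialize with the transactions active at the checkpointing start record.
    ltt = {tid: None for tid in active_at_checkpoint}
    for lsn, tid, kind in log:
        if kind == "start":
            ltt[tid] = None            # create a matching entry
        elif kind in ("commit", "abort"):
            ltt.pop(tid, None)         # transaction end: remove the entry
        else:
            ltt[tid] = lsn             # remember the most recent record
    return ltt


log = [
    (1, "T0", "update"),
    (2, "T1", "start"),
    (3, "T1", "update"),
    (4, "T2", "start"),
    (5, "T2", "update"),
    (6, "T1", "commit"),
    (7, "T0", "abort"),
    (8, "T2", "update"),
]
losers = find_losers(["T0"], log)
```

Here T2 has a start record but no end record, so it is the only entry left in the table when the end of the log is reached.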
When using the physical logging method of the conventional art, log records must be applied in the order of log-record creation during the LP process. That is, logs created earlier must be used first to redo some of the updates. The conventional logging method imposes the sequential ordering because the undo and redo operations are not commutative and associative. This sequential ordering requirement imposes a lot of constraints in the system design.
In a main-memory DBMS, for example, where the entire database is kept in main memory, disk access for logging acts as a bottleneck in the system performance. In order to reduce such a bottleneck, employment of multiple persistent log storage devices may be conceived to distribute the processing. The use of multiple persistent log storage devices, however, is not easily amenable to the conventional logging method because there is necessarily an overhead of merging log records in the order of creation during the step of LP.
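To illustrate the merging step mentioned above, the following minimal sketch assumes that each log record carries a global log sequence number (LSN) and that two log devices each hold an LSN-sorted partition of the log. The record layout and the variable names are illustrative assumptions.

```python
import heapq

# Hypothetical per-device logs: (lsn, payload) tuples, each sorted by LSN.
log_disk_a = [(1, "u1"), (4, "u4"), (5, "u5")]
log_disk_b = [(2, "u2"), (3, "u3"), (6, "u6")]

# Conventional log play needs a single stream in creation order, so the
# partitions are merged on LSN before any record can be applied.
merged = list(heapq.merge(log_disk_a, log_disk_b, key=lambda rec: rec[0]))
```

The merge serializes the records from all devices into one stream, which is the coordination cost the paragraph above refers to.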
Therefore, there is a need for an efficient logging system that is compatible with massively parallel operations in distributed processing.
It is an object of the present invention to provide an efficient logging scheme that can be used to recover a transaction processing system after a failure occurs.
It is another object of the present invention to provide a logging scheme where parallel operations are possible.
The foregoing objects and other objects are achieved in the present invention using a differential logging method that allows commutative and associative recovery operations. The method includes the steps of taking a before-image of the database in main memory before an update to the database; taking an after-image of the database after the update; generating a log by applying bit-wise exclusive-OR (XOR) between the before-image and the after-image; and performing either a redo or undo operation by applying XOR between one or more such logs and the database.
FIGS. 3a and 3b are flow charts of a two-pass log-play process used in the conventional recovery architecture during a restart process.
FIGS. 6a to 6e are an illustration comparing the differential logging scheme of the present invention with the physical logging scheme of the conventional art.
FIGS. 7a and 7b are flow charts of a one-pass log-play process used in the present invention to recover the database from a backup made with a consistent checkpointing scheme.
FIGS. 8a and 8b are flow charts of a fuzzy checkpointing process used in the present invention to make a backup without blocking other transactions.
FIGS. 10a and 10b are flow charts of a modified two-pass log-play process used in the present invention to recover the database from a backup made by the fuzzy checkpointing process.
FIGS. 11a and 11b are flow charts of a modified one-pass log-play process used in the present invention to recover the database from a backup made by the fuzzy checkpointing process.
An aspect of the present invention is based on generating differential log records using the bit-wise exclusive-OR (XOR) operations, and applying the differential log records in an arbitrary order to roll forward for the redo operation and roll back for the undo operation. That the execution order is independent from the order of log record creation is based on the following mathematical observations.
Definition of the Differential Log Record
If the t-th update operation, performed by a transaction, changes the value of a slot from $b_{t-1}$ to $b_t$ (a slot denotes a part of the database; it can be a field, a record, or any other database object), then the corresponding differential log record $\Delta_t$ is defined as $\Delta_t = b_t \oplus b_{t-1}$, where $\oplus$ denotes the bit-wise XOR operation, $b_{t-1}$ is the value of the slot before the t-th update occurs, and $b_t$ is the value of the slot after the t-th update occurs.
Theorem 1: Recoverability (for Redo and Undo) Using Differential Log Records
Assume that the value of a slot has changed from $b_0$ to $b_m$ through $m$ updates, and that each update $u_t$ ($t = 1, \ldots, m$) has generated the differential log record $\Delta_t$. Then, the final value $b_m$ can be recovered (for the redo operation) from the initial value $b_0$ and the differential log records. In the same way, the initial value $b_0$ can be recovered (for the undo operation) from the final value $b_m$ and the differential log records.
(Proof)
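XOR operations are commutative and associative. For any binary numbers $p$, $q$, and $r$,

$$p \oplus q = q \oplus p, \qquad (p \oplus q) \oplus r = p \oplus (q \oplus r), \qquad p \oplus p = 0, \qquad p \oplus 0 = p.$$

Since each differential log record is the XOR of the slot values before and after its update, applying all the records to the initial value gives, for the redo operation,

$$b_0 \oplus \Delta_1 \oplus \Delta_2 \oplus \cdots \oplus \Delta_m = b_0 \oplus (b_1 \oplus b_0) \oplus (b_2 \oplus b_1) \oplus \cdots \oplus (b_m \oplus b_{m-1}) = b_m,$$

because every value other than $b_m$ appears exactly twice and cancels. For the undo operation, in the same way,

$$b_m \oplus \Delta_m \oplus \Delta_{m-1} \oplus \cdots \oplus \Delta_1 = b_m \oplus (b_m \oplus b_{m-1}) \oplus \cdots \oplus (b_1 \oplus b_0) = b_0.$$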
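The recoverability property can be checked numerically. The following is a minimal sketch assuming a two-byte slot; the helper `xor_bytes` and the sample slot values are made up for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bit-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("images must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Hypothetical history of one slot: b0, b1, b2, b3.
values = [b"\x00\x11", b"\x0f\x11", b"\x0f\xa0", b"\x33\xa0"]

# One differential log record per update: after image XOR before image.
deltas = [xor_bytes(values[t], values[t - 1]) for t in range(1, len(values))]

# Redo: start from the initial value and apply every record.
redone = reduce(xor_bytes, deltas, values[0])
# Undo: start from the final value and apply the same records.
undone = reduce(xor_bytes, deltas, values[-1])
```

The same fold performs both redo and undo; only the starting value differs.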
Theorem 2: Order-independence of Redo and Undo Operations
Given the initial value $b_0$ of a slot and differential log records $\Delta_t$, where $t = 1, \ldots, m$, the final value $b_m$ can be recovered (for the redo operation) by applying the differential log records in an arbitrary order. In the same way, given the final value $b_m$, the initial value $b_0$ can be recovered (for the undo operation) by applying the differential log records in an arbitrary order.
(Proof)
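Assume that differential log records are applied in the order $\Delta_{(1)}, \Delta_{(2)}, \ldots, \Delta_{(m)}$, where $\Delta_{(t)}$ is selected in an arbitrary order from the set of sequentially generated differential log records $\{\Delta_1, \ldots, \Delta_{m-1}, \Delta_m\}$.
Then, the final value of the slot is

$$b_0 \oplus \Delta_{(1)} \oplus \Delta_{(2)} \oplus \cdots \oplus \Delta_{(m)} = b_0 \oplus \Delta_1 \oplus \Delta_2 \oplus \cdots \oplus \Delta_m = b_m,$$

where the first equality follows from the commutativity and associativity of XOR and the second from Theorem 1. The undo case follows in the same way, starting from $b_m$ and ending at $b_0$.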
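Order independence can be checked exhaustively for a small case. The sketch below uses made-up four-bit slot values and tries every permutation of the differential log records.

```python
from functools import reduce
from itertools import permutations
from operator import xor

# Hypothetical slot history b0..b3 (four-bit values).
b = [0b0011, 0b0101, 0b1100, 0b1001]
deltas = [b[t] ^ b[t - 1] for t in range(1, len(b))]

# Apply the records in every possible order, for redo and for undo.
redo_results = {reduce(xor, order, b[0]) for order in permutations(deltas)}
undo_results = {reduce(xor, order, b[-1]) for order in permutations(deltas)}
```

Each set collapses to a single value: every ordering of the redo reaches the final value, and every ordering of the undo reaches the initial value.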
Theorem 3: Order-independence Between Redo and Undo Operations
Assume that there are n differential log records to be redone, whose redo operations are denoted by Ri (i=1, . . . , n), and m differential log records to be undone, whose undo operations are denoted by Uj (j=1, . . . , m). Then, all the following execution sequences result in the same state of the database.
Sequence 1. (undo phase after redo phase) R1, R2, . . . Rn, U1, U2, . . . , Um
Sequence 2. (redo phase after undo phase) U1, U2, . . . , Um, R1, R2, . . . , Rn
Sequence 3. (redo and undo in a single phase) any permutation of {R1, R2, . . . Rn, U1, U2, . . . , Um} (for example, R1, U1, R2, R3, U2, . . . , Rn, . . . , Um)
(Proof)
The theorem results from the commutative and associative properties of XOR operations involved in the redo and undo operations.
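The argument can be spelled out as follows. This is a sketch; the symbols $\Delta^{R}_i$ and $\Delta^{U}_j$ (the differential log records behind $R_i$ and $U_j$) and $x$ (the slot value before recovery starts) are notation introduced here for illustration.

```latex
\begin{align*}
R_i(x) &= x \oplus \Delta^{R}_i, \qquad U_j(x) = x \oplus \Delta^{U}_j, \\
(\text{any permutation of } R_1, \ldots, R_n, U_1, \ldots, U_m)(x)
  &= x \oplus \bigoplus_{i=1}^{n} \Delta^{R}_i \oplus \bigoplus_{j=1}^{m} \Delta^{U}_j .
\end{align*}
```

Because every operation is an XOR with a fixed operand, the right-hand side does not depend on the permutation, so Sequences 1, 2, and 3 all produce the same state.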
An update log record, which is used to store changes in the database, comprises a log header and a log body.
The body of a log record stores the differential log information. Specifically, it stores the bit-wise exclusive-OR (XOR) result of the data image of the database before update (“before image”) and the data image of the database after update (“after image”). This differential logging scheme, which stores only the differentials, is distinguished from the conventional physical logging scheme where both the before image and the after image are stored. For example, if the before image is “0011” and the after image is “0101”, the present invention stores only the differential, namely, the XOR result of “0110”.
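A minimal sketch of building such a record is shown below. The header fields `tid` and `slot_id` and the names `UpdateLogRecord` and `make_update_log` are hypothetical placeholders for illustration; only the XOR body follows the description above.

```python
from dataclasses import dataclass


@dataclass
class UpdateLogRecord:
    # Header fields (hypothetical placeholders for this example).
    tid: int
    slot_id: int
    # Body: before image XOR after image.
    body: bytes


def make_update_log(tid, slot_id, before, after):
    if len(before) != len(after):
        raise ValueError("before and after images must have the same length")
    body = bytes(x ^ y for x, y in zip(before, after))
    return UpdateLogRecord(tid, slot_id, body)


before_image = bytes([0b0011])
after_image = bytes([0b0101])
record = make_update_log(tid=1, slot_id=7, before=before_image, after=after_image)

body_bits = format(record.body[0], "04b")
physical_body_size = len(before_image) + len(after_image)  # both images stored
differential_body_size = len(record.body)                  # only the XOR stored
```

For the example images, `body_bits` is "0110", and the differential body is half the size of a body that stores both images.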
FIGS. 6a to 6e show a comparison between the differential logging scheme of the present invention and the physical logging scheme of the conventional art. Since the operations in the conventional physical logging scheme are not commutative and associative, the redo operations, for example, must be performed in the sequential order of log record creation.
Consider a situation where a system crash occurs after processing three transactions, T1, T2, and T3, as illustrated in FIGS. 6a to 6e.
Upon the system crash, if redo operations were applied in the same sequence as the original sequence, a correct recovery would result for both logging schemes.
However, if redo operations were done in a different sequence from the original sequence, a correct recovery would not be possible in the conventional logging scheme. In contrast, the differential logging scheme of the present invention enables an accurate reconstruction regardless of the order of applying the log records.
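The contrast can be reproduced with a small experiment. The slot values below are made up for illustration and are not the values shown in the figures; physical redo is modeled as overwriting the slot with the stored after image.

```python
from functools import reduce

# Hypothetical history: T1, T2, T3 update the same slot in that order.
b0 = 0b0000
after_images = {"T1": 0b0011, "T2": 0b0101, "T3": 0b1100}

# Physical redo log: each record carries the after image.
physical_log = list(after_images.items())

# Differential log: each record carries before image XOR after image.
differential_log = []
before = b0
for tid, after in after_images.items():
    differential_log.append((tid, before ^ after))
    before = after


def redo_physical(start, log):
    value = start
    for _, after in log:
        value = after  # overwrite: the last record applied wins
    return value


def redo_differential(start, log):
    return reduce(lambda value, rec: value ^ rec[1], log, start)


reordered = [1, 2, 0]  # play T2, T3, T1 instead of T1, T2, T3
phys_in_order = redo_physical(b0, physical_log)
phys_reordered = redo_physical(b0, [physical_log[i] for i in reordered])
diff_in_order = redo_differential(b0, differential_log)
diff_reordered = redo_differential(b0, [differential_log[i] for i in reordered])
```

With the reordered play, the physical scheme ends on the after image of T1, while the differential scheme still reaches the value written by T3.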
FIGS. 7a and 7b are flow charts of a one-pass log-play process used in the present invention to recover the database from a backup made using a consistent checkpointing scheme. There are two categories of consistent checkpointing schemes: transaction consistent and action consistent. “Transaction consistent” means that no update transaction is in progress during checkpointing. In other words, a transaction-consistent checkpointing process can start only after all the on-going update transactions are completed, and no update transaction can start until the checkpointing process completes. “Action consistent” means that no update action is in progress during the checkpointing process. When using a consistent checkpointing scheme, the two-pass log-play process of
One benefit of the present invention is to make it possible to complete the log-play process by scanning the log only once. In the one-pass log-play process of the present invention, the log records are scanned in the opposite direction to the log record creation sequence, i.e., from the end of the log. When scanning the log backward, a transaction end log record is encountered before any other log records of the transaction. When an aborted transaction is encountered, there is no need to play the records of the transaction. In other words, once the committed transactions are identified, only the records of the committed transaction can be played, skipping the records of the aborted transactions.
Since nothing needs to be done for a loser transaction, it is treated the same as an aborted transaction. As mentioned above, in the conventional methods, redo operations must be performed for all the transactions in the sequence of log record creation. Redo operations could be skipped for a loser transaction, but since one cannot determine in advance whether a transaction is a loser transaction, redo operations are done even for those would-be loser transactions. Therefore, undo operations are needed for those loser transactions after they are identified.
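The backward scan can be sketched as follows, assuming a transaction-consistent backup so that only records of transactions committed after the checkpoint need to be played. The tuple layout of the log records and the function name `one_pass_log_play` are assumptions made for this illustration.

```python
def one_pass_log_play(backup, log):
    """Scan the log once, from its end, playing only committed transactions."""
    db = dict(backup)
    committed = set()
    for rec in reversed(log):
        kind, tid = rec[0], rec[1]
        if kind == "commit":
            committed.add(tid)          # seen before any other record of tid
        elif kind == "update" and tid in committed:
            _, _, slot, delta = rec
            db[slot] ^= delta           # redo by XOR; order does not matter
        # Records of aborted and loser transactions are skipped.
    return db


backup = {"x": 0b0011, "y": 0b0000, "z": 0b1111}
log = [
    ("start", "T1"),
    ("update", "T1", "x", 0b0110),
    ("start", "T2"),
    ("update", "T2", "z", 0b1000),
    ("update", "T1", "y", 0b0001),
    ("commit", "T1"),
    ("start", "T3"),
    ("update", "T3", "y", 0b0100),
    ("abort", "T3"),
    ("update", "T2", "z", 0b0011),   # T2 never ends: a loser
]
recovered = one_pass_log_play(backup, log)
```

Only the two records of T1 are applied; the records of the aborted T3 and the loser T2 are never played, so no undo pass is required.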
FIGS. 8a and 8b are flow charts of a fuzzy checkpointing process used in the present invention to make a backup without blocking other transactions. “Fuzzy checkpointing” means that an update transaction and the checkpointing process may proceed in parallel. Fuzzy checkpointing is often preferred to consistent checkpointing because it allows other transactions to proceed during the checkpointing period.
When using fuzzy checkpointing together with the differential logging scheme of the present invention, two synchronization problems must be dealt with for correct recovery of the database. First, although an update transaction and a checkpointing process may occur in parallel as long as the two apply to different pages, a database page should not be backed up during checkpointing while the same page is being updated by a transaction. Otherwise, a mixture of both the before and the after images is copied, making it difficult to correctly recover from a crash. To handle the first problem, the present invention provides a synchronization mechanism so that the process of backing up a page and the process of updating a page occur in a locked state (as an atomic unit) (808 and 811).
Second, a mechanism is needed to determine whether a backed-up page reflects the database after a log record creation or before a log record creation. If the backup was made after the log record was created, there is no need to play the log record because the backed-up page already reflects the change. Since the present invention uses XOR for redo and undo operations, applying a differential log record to a page that already contains the change would cancel the change rather than leave it in place, so a mechanism is necessary to determine whether to play a log record or not. To deal with the second problem, the present invention maintains in each page a field storing the most recent backup identifier (809) and copies it into log records.
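Both mechanisms can be sketched for a single page as follows. The comparison rule used in `recover_page` (play a record only when its copied backup identifier equals the identifier of the backup being restored) is one plausible reading of the description and is an assumption of this example, as are all class and function names; commit status is ignored to keep the sketch short.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Page:
    value: int
    backup_id: int = 0  # identifier of the most recent backup of this page
    lock: threading.Lock = field(default_factory=threading.Lock)


def update_page(page, new_value, log):
    with page.lock:  # update and log creation form one atomic unit
        delta = page.value ^ new_value
        page.value = new_value
        log.append({"delta": delta, "backup_id": page.backup_id})


def backup_page(page, checkpoint_id):
    with page.lock:  # never copy a page while it is being updated
        page.backup_id = checkpoint_id
        return {"value": page.value, "backup_id": checkpoint_id}


def recover_page(backup, log):
    value = backup["value"]
    for rec in log:
        # Assumed rule: a record stamped with the restored backup's identifier
        # was created after the page was copied, so it still has to be played.
        if rec["backup_id"] == backup["backup_id"]:
            value ^= rec["delta"]
    return value


page, log = Page(0b0011), []
update_page(page, 0b0101, log)   # before the page is backed up
backup = backup_page(page, checkpoint_id=1)
update_page(page, 0b1100, log)   # after the page is backed up
recovered = recover_page(backup, log)
naive = backup["value"] ^ log[0]["delta"] ^ log[1]["delta"]  # plays everything
```

Playing every record without the check XORs the first differential into a page that already contains it, which cancels that update and yields a wrong value.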
FIGS. 10a and 10b are flow charts of a modified two-pass log-play process used in the present invention to recover the database from a backup made by the fuzzy checkpointing process. The difference from the two-pass log-play process of
FIGS. 11a and 11b are flow charts of a modified one-pass log-play process used in the present invention to recover the database from a backup made by the fuzzy checkpointing process. The difference from the one-pass log-play process of
In this embodiment, each log buffer page has a counter. This counter is reset when the LR process (1704) fills the page with the data read from the persistent log storage device (1705). When an LP process finishes scanning a buffer page, it increments the counter of the page in a locked state. Then, when the counter has the same value as the number of LP processes, the buffer can be flushed.
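The counter protocol can be sketched with threads standing in for the LP processes. The class `LogBufferPage`, its method names, and the number of LP threads are assumptions made for this illustration.

```python
import threading


class LogBufferPage:
    def __init__(self, num_lp):
        self.num_lp = num_lp
        self.counter = 0
        self.records = []
        self._lock = threading.Lock()

    def fill(self, records):
        """Called by the LR side: load log data and reset the counter."""
        with self._lock:
            self.records = list(records)
            self.counter = 0

    def finish_scan(self):
        """Called by an LP side after scanning; True when the page can be flushed."""
        with self._lock:
            self.counter += 1
            return self.counter == self.num_lp


NUM_LP = 4
page = LogBufferPage(NUM_LP)
page.fill(["rec1", "rec2", "rec3"])
flush_signals = []


def lp_process():
    for _ in page.records:  # stand-in for playing each record
        pass
    flush_signals.append(page.finish_scan())


threads = [threading.Thread(target=lp_process) for _ in range(NUM_LP)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the increment and the comparison happen under the same lock, exactly one LP thread observes the counter reaching the number of LP threads.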
While the invention has been described with reference to preferred embodiments, it is not intended to be limited to those embodiments. It will be appreciated by those of ordinary skill in the art that many modifications can be made to the structure and form of the described embodiments without departing from the spirit and scope of this invention.
Number | Date | Country | Kind |
---|---|---|---|
2000-31166 | Jun 2000 | KR | national |
Number | Date | Country | |
---|---|---|---|
20020116404 A1 | Aug 2002 | US |