ACID properties (Atomicity, Consistency, Isolation, and Durability) are intrinsic to many database management systems (DBMS) such as Oracle and SQL Server. The atomicity and durability properties depend on logging transactions to durable storage. Prior solutions typically involve logging these transactions to disk drives. Prior database systems have used elaborate logging techniques to improve the reliability of RAM buffers and to implement the transaction semantics. Non-volatile storage has been used by many database systems to reduce the overhead of logging, but the non-volatile storage in these systems is typically disk storage directly associated with the target disk storage system. Therefore, to ensure atomicity and durability of data, a DBMS thread or process must wait until it receives an acknowledgement from the disk drive that the log write was completed. Since disk writes take milliseconds, this wait lengthens the response time of each transaction and adds latency to the system as a whole.
Disk Caching Disk (DCD) systems use a small NVRAM cache and a small cache-disk to form a two-level cache. Write data is first assembled in the small NVRAM cache and later logged into the cache-disk. Data in the cache-disk is destaged to the data disk during idle periods. The two-level hierarchical structure acts as a large non-volatile cache. While DCD provides good performance for low to medium traffic workloads, directly applying DCD to high I/O workloads may result in certain problems. DCD requires destaging, which involves reading ‘dirty’ data (e.g., data in the write cache that has not been destaged or written to disk) from the cache-disk and writing it to the data disk. The destaging process may become a performance bottleneck at high loads because the destaging read operations and the log write operations compete for the limited cache-disk bandwidth. Moreover, DCD reads are also slow because some data has to be read from the cache-disk.
One type of prior art system for caching data to be written to disk is shown in
The type of system illustrated in
Another prior art system for caching data to be written to disk is shown in
The
It should be noted that the ‘log’ in the type of system shown in
To avoid losing packets (and thereby reducing network reliability), intermediate data structures are saved into NVRAM 203, in the
What is needed is a method that reduces disk drive response time associated with writes to disk from a DBMS, while maintaining the properties of atomicity and durability.
A transaction logging system is provided for performing log writes in a database management system. The transaction logging system has an associated operating system and a target storage system to which are written log records representing complete database transactions. In one embodiment, the system includes non-volatile memory accessible by the database management system and directly addressable by the operating system. Each time a log record is written from the database management system to non-volatile memory, an acknowledgement is sent to the database management system, to allow a lock corresponding to the log record to be released. Log records are subsequently written from non-volatile memory to the target storage system.
In the present system, a DBMS (database management system) uses non-volatile RAM as a memory-mapped file (where I/O operations are performed via the operating system's file system) or as shared memory (where the DBMS performs raw I/O) for storing DBMS log records. Data residing in non-volatile memory locations is written to disk periodically to make room for new log entries. This can be done either by the operating system (through the memory-mapped file functionality) or, in the case of raw I/O through shared memory, by a separate DBMS thread or process.
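The memory-mapped-file variant described above can be illustrated with a minimal Python sketch. It assumes an ordinary file mapping as a stand-in for non-volatile RAM; the class name `MappedLog`, the length-prefixed record framing, and the 64-byte region size are hypothetical choices made only for illustration and are not taken from the described system.

```python
import mmap
import os
import struct
import tempfile


class MappedLog:
    """Append-only log held in a memory-mapped file (stand-in for NVRAM)."""

    def __init__(self, path, size=4096):
        # Create a fixed-size backing file and map it into the address space.
        self._fd = os.open(path, os.O_RDWR | os.O_CREAT)
        os.ftruncate(self._fd, size)
        self._map = mmap.mmap(self._fd, size)
        self._size = size
        self._offset = 0   # next free byte in the mapped region
        self._flushed = 0  # bytes already pushed to the backing file

    def append(self, record: bytes) -> bool:
        """Copy one length-prefixed record into the mapping.

        Returns True once the record is in the mapped region (the point at
        which the write could be acknowledged), or False if the region is full.
        """
        framed = struct.pack("<I", len(record)) + record
        if self._offset + len(framed) > self._size:
            return False
        self._map[self._offset:self._offset + len(framed)] = framed
        self._offset += len(framed)
        return True

    def flush(self) -> int:
        """Write mapped pages to the backing file; return bytes newly flushed."""
        self._map.flush()
        newly_flushed = self._offset - self._flushed
        self._flushed = self._offset
        return newly_flushed

    def close(self):
        self._map.close()
        os.close(self._fd)


# Usage: append one record, then flush it to the backing file.
path = os.path.join(tempfile.mkdtemp(), "dbms.log")
log = MappedLog(path, size=64)
acked = log.append(b"commit t1")
flushed = log.flush()
log.close()
```

In this sketch the caller invokes `flush` explicitly; with a real memory-mapped file the operating system may also write dirty pages back on its own schedule.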
Non-volatile memory 310 may be NVRAM (which may be RAM that is battery-backed-up, or FRAM [ferroelectric RAM], which does not require battery-back-up), or ‘solid-state disk’ memory built using, for example, MRAM (magnetic RAM) or ARS (atomic resolution storage), or another non-volatile storage device with a short access latency.
The term ‘non-volatile memory’ is used herein to refer to any type of non-rotating, low-latency non-volatile memory, including those types of non-rotating memory noted above, as distinguished from conventional disk memory involving rotating media. A log write to typical non-volatile memory takes a few hundred nanoseconds at most; in comparison, a log write to disk typically takes several milliseconds. On a busy system, a DBMS may issue thousands of log writes per second. The cumulative effect of writing these records to a closely coupled medium such as NVRAM is a substantial overall performance improvement, in two ways: response time is reduced for the log write, and lock residency time (i.e., the time during which the DBMS holds locks) is also reduced, which in turn reduces queuing delays.
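To give a sense of scale for the latency comparison above, the following Python sketch computes the cumulative log-write wait for one second of work. The specific figures (300 ns per non-volatile memory write, 5 ms per disk write, 2,000 log writes per second) are assumed values chosen to fall within the ranges stated above; they are not measurements.

```python
# All figures below are illustrative assumptions, not measured values.
WRITES_PER_SECOND = 2_000      # "thousands of log writes per second"
NVRAM_WRITE_NS = 300           # "a few hundred nanoseconds at most"
DISK_WRITE_NS = 5_000_000      # "several milliseconds" (5 ms)


def total_wait_ns(writes: int, per_write_ns: int) -> int:
    """Cumulative time spent waiting on log writes, summed over all threads."""
    return writes * per_write_ns


nvram_wait_ns = total_wait_ns(WRITES_PER_SECOND, NVRAM_WRITE_NS)
disk_wait_ns = total_wait_ns(WRITES_PER_SECOND, DISK_WRITE_NS)

print(f"non-volatile memory: {nvram_wait_ns / 1e6:.1f} ms of wait per second")
print(f"disk:                {disk_wait_ns / 1e9:.1f} s of wait per second")
```

Under these assumptions the disk case accumulates about 10 seconds of thread wait for every second of wall-clock time, against well under a millisecond for non-volatile memory. Because locks are held during that wait, the same ratio applies to lock residency time.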
There are two parts to any DBMS transaction: (1) the changes to the database itself, and (2) the creation of a corresponding log record. In the present system, the DBMS workflow is structured as a series of complete transactions. A ‘complete transaction’ implies both (1) and (2), above. Each DBMS transaction requires a corresponding log record 303 to be written to non-volatile memory 310. These transactions are atomic; either they fail and are cancelled, or they are committed in their entirety. Partial results are not allowed. This atomicity is maintained through a logging and commit protocol, which is well-known in the art.
The present system uses non-volatile memory 310 closely coupled to the DBMS primarily to reduce latency, although DBMS reliability is also improved. Non-volatile memory 310 is more reliable than disk drive storage, and more accessible in the sense that the non-volatile memory in the present system is part of the address space 312 of the operating system, rather than being accessed, for example, via an internal I/O bus, then via a PCI bus interface, and finally through a SCSI card and SCSI bus, where any one of these components can fail or become temporarily unavailable.
At step 410, an acknowledgement 307 is sent to the DBMS 305 from non-volatile memory 310 (as indicated by arrow 307) to communicate that the log record 303 was successfully written to non-volatile memory 310, i.e., to indicate completion of the log record write operation. This allows the current DBMS thread to release any latches or locks associated with the write operation, thus allowing the related application to continue execution. In the case of memory mapped files, the acknowledgement indicated by arrow 307 is generated by the O/S file system. In the case of shared memory, the acknowledgement comes from the O/S virtual memory system. The operating system call interface (not shown) typically provides this acknowledgement functionality.
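The relationship between the acknowledgement and lock release can be sketched as follows. This is a simplified Python illustration, not the described implementation: `NvLogStub` is a hypothetical in-memory stand-in for non-volatile memory 310, and a `threading.Lock` stands in for whatever latches or locks the DBMS holds.

```python
import threading


class NvLogStub:
    """Hypothetical stand-in for the non-volatile log region."""

    def __init__(self):
        self.records = []

    def write(self, record: bytes) -> bool:
        # Store the record and return an acknowledgement of the write.
        self.records.append(bytes(record))
        return True


def commit(lock: threading.Lock, log: NvLogStub, record: bytes) -> bool:
    """Hold the lock only until the log write has been acknowledged."""
    with lock:
        acked = log.write(record)
        if not acked:
            # A real DBMS would abort or retry here; this sketch just fails.
            raise IOError("log write was not acknowledged")
    # The lock is released here, before any later write to disk takes place.
    return acked


# Usage: the lock is free again as soon as commit() returns.
txn_lock = threading.Lock()
nv_log = NvLogStub()
result = commit(txn_lock, nv_log, b"commit t1")
```

The point of the sketch is the ordering: the lock is released once the record is in the non-volatile region, so the time the lock is held does not include the later disk write.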
At step 415, one or more log records 303 are written to I/O device driver 315, as indicated by arrow 311. Log records 303 may be written to disk 331 (via firmware 325, and any intervening hardware, such as device driver 315 and NIC 320) immediately after each acknowledgement 307. Immediately writing each log record 303 may slightly increase system reliability in the event, for example, of near-simultaneous failure of both DBMS and NVRAM battery back-up. Alternatively, multiple log records may be stored or queued in non-volatile memory 310 and periodically ‘batch-written’ to disk after a predetermined number of records are accumulated in non-volatile memory 310, or after a predetermined maximum period of time. ‘Batch-writing’ multiple log records to disk minimizes the amount of disk traffic and the pathlengths associated with each I/O operation.
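The batch-writing alternative, in which records are flushed after a predetermined count or a predetermined maximum period, can be sketched as below. The class name `BatchDestager`, its parameters, and the injectable `clock` are assumptions made for illustration; `sink` is a hypothetical callable representing the path to disk through the device driver.

```python
import time


class BatchDestager:
    """Queue log records and flush them by count or by age."""

    def __init__(self, sink, max_records=4, max_age_s=0.5, clock=time.monotonic):
        self._sink = sink                # callable that receives a list of records
        self._max_records = max_records  # flush once this many records are queued
        self._max_age_s = max_age_s      # flush once the oldest record is this old
        self._clock = clock
        self._pending = []
        self._oldest = None              # arrival time of the oldest queued record

    def add(self, record: bytes) -> None:
        """Queue one record, then flush if either threshold has been reached."""
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(record)
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Flush the queue if it is full or too old; return True if flushed."""
        if not self._pending:
            return False
        too_many = len(self._pending) >= self._max_records
        too_old = self._clock() - self._oldest >= self._max_age_s
        if too_many or too_old:
            self._sink(self._pending)
            self._pending = []
            self._oldest = None
            return True
        return False


# Usage with a list as the sink and a manually advanced clock.
batches = []
now = [0.0]
destager = BatchDestager(batches.append, max_records=3, max_age_s=1.0,
                         clock=lambda: now[0])
for rec in (b"a", b"b", b"c"):
    destager.add(rec)            # the third add reaches the count threshold
destager.add(b"d")
now[0] = 2.0                     # simulate the maximum period elapsing
flushed_by_age = destager.maybe_flush()
```

In this sketch the age check runs only when `add` or `maybe_flush` is called; a periodic timer or a separate thread would be needed to enforce the maximum period when no new records arrive.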
I/O device driver 315 comprises any driver software or firmware that is used to control interface card 320, which may be a NIC or other device suitable for communicating with storage system 330. Device driver 315 then writes the log record 303 to interface card 320, at step 420, as indicated by arrow 316. At step 425, interface card 320 sends the log record to the disk drive, where it is read by disk firmware 325. As indicated by arrow 321, the log record is sent from interface card 320 to disk firmware 325 via communications fabric 323, which may be a data bus, a local area network, or any other type of network. The log record 303 is then written to a physical disk (target disk) 331, at step 430, as indicated by arrow 326.
Note that the data flow (indicated by arrows 306, 311, 316, 321, 326) in
A comparison of system 500 with system 300 (shown in
At step 610, an acknowledgement is sent to the DBMS 305 from non-volatile memory 310 (as indicated by arrow 507) to communicate that the log record 303 was successfully written to non-volatile memory 310. This allows the current DBMS thread to release any latches or locks associated with the write, allowing forward progress of the related application.
At step 615, the log record 303 is written to device driver 315, as indicated by arrow 511. Device driver 315 comprises any driver software or firmware that is used to communicate with storage system 330. In one embodiment, log records 303 are written to disk 331 immediately after each acknowledgement 507. Alternatively, multiple log records are stored or queued in non-volatile memory 310 and periodically ‘batch-written’ to disk after a predetermined number of records are accumulated in non-volatile memory 310, or after a predetermined maximum period of time.
At step 625, device driver 315 writes the log record 303 to storage system 330, where it is read by disk firmware 325 (as indicated by arrow 521). The log record 303 is then written to a physical disk (target disk) 331, at step 630, as indicated by arrow 526.
Certain changes may be made in the above methods and systems without departing from the scope of that which is described herein. It is to be noted that all matter contained in the above description or shown in the accompanying drawings is to be interpreted as illustrative and not in a limiting sense. For example, the system shown in