The present invention relates to a database system and stand-alone storage array including a device driver and daemon that provide interrupted write protection for database data.
Within a database system, device and software failures, system resets, and power losses can result in unsuccessful write operations. During multi-sector writes, a data transfer can be interrupted by software failures, such as a node software “panic”, or by hardware failures, such as node processor or memory errors, adapter failures, cable failures, or loss of power to the system. This can result in partially written data, with some new sectors and some old sectors in the area that was to be written. This failure is known as an interrupted write, a subset of unsuccessful writes.
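By way of illustration only, the effect of an interrupted multi-sector write can be simulated as follows. The sector size, the function name `interrupted_write`, and the in-memory `disk` buffer are assumptions introduced for this sketch and do not form part of the described system.

```python
SECTOR_SIZE = 512  # bytes per sector; an assumed, typical value


def interrupted_write(disk: bytearray, start_sector: int, data: bytes,
                      sectors_before_failure: int) -> None:
    """Write `data` one sector at a time, stopping early to mimic a failure."""
    total_sectors = len(data) // SECTOR_SIZE
    for i in range(min(total_sectors, sectors_before_failure)):
        offset = (start_sector + i) * SECTOR_SIZE
        disk[offset:offset + SECTOR_SIZE] = data[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE]


# A four-sector region that initially holds old data ("O").
disk = bytearray(b"O" * (4 * SECTOR_SIZE))
new_data = b"N" * (4 * SECTOR_SIZE)

# The failure occurs after two of the four sectors have reached the disk.
interrupted_write(disk, 0, new_data, sectors_before_failure=2)

# One character per sector: new sectors followed by old sectors.
state = "".join(chr(disk[s * SECTOR_SIZE]) for s in range(4))
print(state)  # NNOO
```

The resulting region contains two new sectors followed by two old sectors, which is the mixed state described above.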
Teradata Corporation database systems have employed HDD and SSD disk array systems from vendors that incorporate interrupted write protection into their products. In these implementations, when data in the interrupted write area is read, a special pattern is returned instead of the data. The database system can detect this pattern and optimize its recovery. That is, it can distinguish the special case of an incomplete write from that of general corruption, which is what the interrupted write would have looked like had the actual data been returned instead of the special pattern. This interrupted write protection is not available when using commodity storage such as direct attached disks or software RAID. These direct attached disks or software RAID devices are generic products which usually do not include interrupted write detection.
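The distinction between an interrupted write and general corruption might be sketched as follows. The marker bytes in `INTERRUPTED_PATTERN` are hypothetical, since actual patterns are vendor-specific, and the use of a CRC-32 checksum to detect general corruption is an assumption made for this sketch rather than a detail taken from the description.

```python
import zlib

SECTOR_SIZE = 512  # assumed sector size in bytes

# Hypothetical marker; real disk arrays return vendor-specific patterns.
INTERRUPTED_PATTERN = b"\xde\xad" * (SECTOR_SIZE // 2)


def classify_sector(sector: bytes, expected_crc: int) -> str:
    """Classify a sector that has just been read back from storage."""
    if sector == INTERRUPTED_PATTERN:
        # The storage layer reported an incomplete write explicitly.
        return "interrupted_write"
    if zlib.crc32(sector) != expected_crc:
        # Without the pattern, a torn write would look like this case.
        return "corrupt"
    return "ok"


good = b"A" * SECTOR_SIZE
crc = zlib.crc32(good)
damaged = bytes([good[0] ^ 0x01]) + good[1:]  # flip a single bit

print(classify_sector(good, crc))
print(classify_sector(INTERRUPTED_PATTERN, crc))
print(classify_sector(damaged, crc))
```

Because the pattern check comes first, the recovery logic can take a cheaper path for an incomplete write than it would for unexplained corruption.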
A system and method for providing interrupted write protection to a stand-alone commodity storage array is described below.
In the following description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable one of ordinary skill in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that structural, logical, optical, and electrical changes may be made without departing from the scope of the present invention. The following description is, therefore, not to be taken in a limited sense, and the scope of the present invention is defined by the appended claims.
The technique for providing interrupted write protection to a commodity storage array disclosed herein has particular application, but is not limited, to large databases that might contain many millions or billions of records managed by a database system (“DBS”) 200, such as a Teradata Active Data Warehousing System available from Teradata Corporation.
For the case in which one or more virtual processors are running on a single physical processor, the single physical processor swaps between the set of N virtual processors.
For the case in which N virtual processors are running on an M-processor node, the node's operating system schedules the N virtual processors to run on its set of M physical processors. If there are 4 virtual processors and 4 physical processors, then typically each virtual processor would run on its own physical processor. If there are 8 virtual processors and 4 physical processors, the operating system would schedule the 8 virtual processors against the 4 physical processors, in which case swapping of the virtual processors would occur.
Each of the processing modules 2101 . . . N manages a portion of a database that is stored in a corresponding one of the data-storage facilities 2201 . . . N. Each of the data-storage facilities 2201 . . . N includes one or more disk drives. The DBS may include multiple nodes 2052 . . . P in addition to the illustrated node 2051, connected by extending the network 215.
The system stores data in one or more tables in the data-storage facilities 2201 . . . N. The rows 2251 . . . Z of the tables are stored across multiple data-storage facilities 2201 . . . N to ensure that the system workload is distributed evenly across the processing modules 2101 . . . N. A parsing engine 230 organizes the storage of data and the distribution of table rows 2251 . . . Z among the processing modules 2101 . . . N. The parsing engine 230 also coordinates the retrieval of data from the data-storage facilities 2201 . . . N in response to queries received from a user at a mainframe 235 or a client computer 240. The DBS 200 usually receives queries and commands to build tables in a standard format, such as SQL.
In one implementation, the rows 2251 . . . Z are distributed across the data-storage facilities 2201 . . . N by the parsing engine 230 in accordance with their primary index. The primary index defines the columns of the rows that are used for calculating a hash value. The function that produces the hash value from the values in the columns specified by the primary index is called the hash function. Some portion, possibly the entirety, of the hash value is designated a “hash bucket”. The hash buckets are assigned to data-storage facilities 2201 . . . N and associated processing modules 2101 . . . N by a hash bucket map. The characteristics of the columns chosen for the primary index determine how evenly the rows are distributed.
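The row distribution described above can be illustrated with a minimal sketch. The hash function (SHA-256 truncated to 32 bits), the bucket count, the number of data-storage facilities, and all function names are assumptions chosen for illustration and are not the hash function or hash bucket map used by the DBS.

```python
import hashlib

NUM_BUCKETS = 16     # assumed number of hash buckets (a power of two)
NUM_FACILITIES = 4   # assumed number of data-storage facilities


def hash_value(row: dict, primary_index: tuple) -> int:
    """Hash the values of the columns named by the primary index."""
    key = "|".join(str(row[col]) for col in primary_index).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def hash_bucket(value: int) -> int:
    """Designate a portion of the hash value (its low-order bits) as the bucket."""
    return value & (NUM_BUCKETS - 1)


# Hash bucket map: assigns each bucket to a data-storage facility.
bucket_map = {bucket: bucket % NUM_FACILITIES for bucket in range(NUM_BUCKETS)}


def facility_for(row: dict, primary_index: tuple) -> int:
    """Return the index of the facility that stores the given row."""
    return bucket_map[hash_bucket(hash_value(row, primary_index))]


rows = [{"order_id": n, "customer": f"c{n % 3}"} for n in range(8)]
for row in rows:
    print(row["order_id"], facility_for(row, ("order_id",)))
```

In this sketch, a primary index on a column with many distinct values, such as `order_id`, tends to spread rows across facilities, whereas a column with few distinct values, such as `customer`, can map rows to at most that many facilities.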
In one example system, the parsing engine 230 is made up of three components: a session control 300, a parser 305, and a dispatcher 310, as shown in
Once the session control 300 allows a session to begin, a user may submit a SQL request, which is routed to the parser 305. As illustrated in
The present invention is a design for an IWIL device driver that is inserted into the storage device stack—the chain of attached device objects that represent a device's storage device drivers—and a user-space process, an IWIL daemon, that handles recovery and remote requests. For example, on Linux, the IWIL driver could be a separate block device driver or a module that provides an interface to an existing driver, such as the Teradata Virtual Storage Extent Driver, depending on the requirements of the implementation.
When the application layers above determine that a write needs interrupted write protection, the write is routed through the IWIL driver. The IWIL driver and daemon perform the following actions, as illustrated in
Step 501: The driver generates an intent log entry representing the write.
Minimally, this entry contains:
Ideally, this is non-volatile memory in the node, such as a flash card, but may also be a locally attached solid state disk (SSD) or hard disk (HDD).
In parallel with the write to local non-volatile storage, the driver sends the intent log entry to at least one other node in the system via a network connection such as InfiniBand, Ethernet, or BYNET. This is done in case the original node does not come back after a crash and the application, e.g., a Teradata AMP, moves to another node.
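The logging of an intent entry to local non-volatile storage and, in parallel, to another node can be sketched as follows. The fields of `IntentLogEntry`, the in-memory `LogStore` class standing in for both the local log and the remote node, and the choice to wait for every acknowledgement before returning are all assumptions made for this sketch; they are not the entry format or protocol of the IWIL driver.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentLogEntry:
    # Hypothetical fields describing the region about to be written.
    device_id: str
    start_sector: int
    sector_count: int


class LogStore:
    """In-memory stand-in for a log on local NVRAM/SSD or on a peer node."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entries = []

    def append(self, entry: IntentLogEntry) -> str:
        self.entries.append(entry)
        return self.name  # acts as an acknowledgement


def log_intent(entry: IntentLogEntry, local: LogStore, peers: list) -> list:
    """Record the entry locally and on each peer in parallel; wait for all."""
    stores = [local, *peers]
    with ThreadPoolExecutor(max_workers=len(stores)) as pool:
        futures = [pool.submit(store.append, entry) for store in stores]
        return [future.result() for future in futures]


local = LogStore("local-nvram")
peer = LogStore("peer-node")
entry = IntentLogEntry("disk0", start_sector=2048, sector_count=8)
acks = log_intent(entry, local, [peer])
print(acks)  # ['local-nvram', 'peer-node']
```

Keeping a copy on the peer means that the entry remains available if the application moves to another node after a crash of the original node.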
Referring to
TPA node 601 includes existing software components TVSAEXT 607, the Teradata Virtual Storage Extent Driver; a Linux Block & SCSI driver 609; and Storage Interconnect 611. Linux Block & SCSI driver 609 and Storage Interconnect 611 are parts of the Linux operating system and are the drivers used to access disk storage. Generic interrupted write protection is provided by the IWIL driver 613. The arrow between IWIL driver 613 and TVSAEXT 607 is intended to show a call interface, where the TVSAEXT driver 607 tells IWIL driver 613 that a particular write needs interrupted write protection. For non-Teradata implementations, a block driver could be constructed to replace TVSAEXT driver 607 in
Non-volatile Random Access Memory (NVRAM) or Local SSD 615 provides physical storage for the intent log, either in non-volatile memory in the node or on a solid state drive (SSD) attached to the node. The intent log entries written to NVRAM or Local SSD 615 are copied to another node over a network, such as InfiniBand, so that the log entries are not lost if the node 601 crashes.
Storage Subsystem 603 and disks 605 represent a generic disk storage subsystem such as a disk array, a plurality of raw disks (JBOD, for Just a Bunch of Disks), or a software RAID. A software RAID is a mirroring or other RAID (redundant array of inexpensive disks) implementation in software that does not use a disk array controller.
The figures and specification illustrate and describe a method for providing interrupted write protection to a stand-alone commodity storage array utilized within a database system.
The foregoing description of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Accordingly, this invention is intended to embrace all alternatives, modifications, equivalents, and variations that fall within the spirit and broad scope of the attached claims.
This application claims priority under 35 U.S.C. §119(e) to the following co-pending and commonly-assigned patent application, which is incorporated herein by reference: Provisional Patent Application Ser. No. 61/922,544, entitled “TORN WRITE PROTECTION WITH GENERIC STORAGE,” filed on Dec. 31, 2013, by Gary Lee Boggs.
Number | Date | Country
---|---|---
61922544 | Dec 2013 | US