This invention is generally related to the field of data storage, and more particularly to checkpoint recovery in a secondary volume of a data storage system.
The use of electronic data storage is widespread in the industrialized world. Business and personal records that might previously have been stored in both paper and electronic form are now often stored only in electronic form. Various data protection techniques (e.g., file system checkpoints) have been conceived to enable recovery of data that would otherwise be lost when physical storage devices fail, become corrupted, or are inadvertently lost. However, these techniques must continually scale to support larger collections of data because of the adoption of networked storage and the relatively rapid increase in the amount of data being created. In view of this situation, it is desirable to have techniques for facilitating recovery from device failure and data corruption while preserving good performance for client access from a network.
In accordance with one embodiment of the invention, apparatus for facilitating recovery of a secondary volume, containing checkpoints, having data indicative of changes made to data on a primary volume includes: monitoring circuitry operable to identify a synchronization point where a secondary volume B-tree is known to be consistent; and an intent log indicative of inconsistency in the secondary volume B-tree since a most recently identified synchronization point.
In accordance with another embodiment of the invention, computer program code stored on computer readable media, and executable by a computer, for facilitating recovery of a secondary volume, containing checkpoints, having data indicative of changes made to data on a primary volume, includes: logic operable to identify a synchronization point where a secondary volume B-tree is known to be consistent; and logic operable to maintain an intent log indicative of inconsistency in the secondary volume B-tree since a most recently identified synchronization point.
In accordance with another embodiment of the invention, a method for facilitating recovery of a secondary volume B-tree having data indicative of changes made to data on a primary volume includes the steps of: in response to an indication that a first leaf node of the B-tree is ready to split, allocating a second leaf node to the B-tree; initiating splitting the first leaf node in memory into the first and second leaf nodes; writing an intent log including an image of the leaf nodes and parent node that will result from the split; and asynchronously writing the first and second leaf nodes and the parent node.
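The four steps of the method embodiment can be illustrated with a toy in-memory model. This is a minimal sketch under stated assumptions, not the patented implementation. The names `Node`, `SecondaryVolume`, `split_leaf`, `drain` and `crash_and_recover` are hypothetical. The split policy of moving the upper half of the keys into the new leaf is also an assumption. The sketch shows why the split survives a failure that occurs after the intent log is written but before the asynchronous node writes reach disk.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    keys: list                                     # leaf: block numbers; parent: separator keys
    children: list = field(default_factory=list)   # parent only: child node ids


class SecondaryVolume:
    """Hypothetical toy model of the on-disk B-tree, intent log, and write queue."""

    def __init__(self, nodes):
        self.nodes = {n.node_id: deepcopy(n) for n in nodes}  # durable node images
        self.intent_log = []   # each append models one contiguous write
        self.pending = []      # asynchronous node writes not yet durable

    def write_intent(self, images):
        self.intent_log.append(deepcopy(images))

    def write_async(self, node):
        self.pending.append(deepcopy(node))

    def drain(self):
        # Stands in for a paging daemon flushing dirty nodes in the background.
        for node in self.pending:
            self.nodes[node.node_id] = node
        self.pending.clear()

    def crash_and_recover(self):
        # Queued writes are lost in the failure; intent-log images complete the split.
        self.pending.clear()
        for images in self.intent_log:
            for node in images:
                self.nodes[node.node_id] = deepcopy(node)


def split_leaf(vol, memory, parent_id, leaf_id, new_id):
    leaf, parent = memory[leaf_id], memory[parent_id]
    # 1. Allocate a second leaf node.
    new_leaf = Node(new_id, [])
    memory[new_id] = new_leaf
    # 2. Split the first leaf in memory and link the new leaf into the parent.
    mid = len(leaf.keys) // 2
    leaf.keys, new_leaf.keys = leaf.keys[:mid], leaf.keys[mid:]
    pos = parent.children.index(leaf_id)
    parent.children.insert(pos + 1, new_id)
    parent.keys.insert(pos, new_leaf.keys[0])   # separator between leaf and new leaf
    # 3. Write the intent log with images of the nodes that result from the split.
    vol.write_intent([leaf, new_leaf, parent])
    # 4. Queue the leaf and parent writes asynchronously.
    for node in (leaf, new_leaf, parent):
        vol.write_async(node)


root = Node(0, [], [1])
leaf = Node(1, [10, 20, 30, 40])
vol = SecondaryVolume([root, leaf])
memory = {0: deepcopy(root), 1: deepcopy(leaf)}
split_leaf(vol, memory, parent_id=0, leaf_id=1, new_id=2)
vol.crash_and_recover()                          # failure before drain() ran
print(vol.nodes[0].children, vol.nodes[0].keys)  # → [1, 2] [30]
```

The intent-log write is the only synchronous step in this sketch. The larger scattered node writes stay off the critical path, which matches the asynchronous writing step of the method.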
One of the primary advantages of the invention is improved recovery time. For a relatively large checkpoint, the time needed to rebuild a consistent secondary volume B-tree can be significant in terms of both processor time and disk utilization. Because checkpoint recovery must be completed before user filesystems on the primary volume can be accessed, recovery time can adversely affect productivity. Synchronization points allow the recovery operation to start from a point in time closer to the reboot. Further, the intent log facilitates rapid recovery of the B-tree from the synchronization point when a split was in progress at the time of the event that led to the reboot. Modeling of one embodiment indicates that the invention could reduce recovery of a 1 TB checkpoint from about one hour to about sixty seconds, and potentially to under one second.
Referring to
It should be noted that both the primary volume (100) and the secondary volume (102) may be associated with multiple physical storage devices, and may be scaled in an m:n relationship. In one embodiment, the secondary volume (102) is associated with different physical storage devices than the primary volume (100). One secondary volume may be operable to support multiple primary volumes. However, for clarity, this description discusses a single primary volume.
Referring to
Referring now to
To facilitate recovery operations, intermediate syncpoints (302a-302j) are declared between checkpoints. For example, syncpoint (302e) occurs between checkpoint (300d) and checkpoint (300e). The syncpoints are logical locations on the secondary volume (102) where the B-tree is known to be in a consistent state. In particular, a syncpoint represents a state in which the B-tree is stable on disk. Between syncpoints, all interior nodes are in a consistent state, but leaf nodes may be inconsistent with respect to the changed blocks on the primary file system. During checkpoint recovery, the secondary volume is used to re-populate any missing blockmap entries in the leaves. An administrator may set the frequency of syncpoints in units of blocks, e.g., so that a syncpoint is taken every n blocks.
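The effect of an "every n blocks" syncpoint policy on recovery work can be shown with a small model. This is a hypothetical sketch, not the patented mechanism. The class `SyncpointTracker` and its fields are illustrative names. The sketch assumes that each saved block on the secondary volume can be mapped back to its primary block number during recovery. With those assumptions, recovery only has to re-populate blockmap entries for blocks saved after the last syncpoint.

```python
class SyncpointTracker:
    """Hypothetical model: declare a syncpoint every n blocks saved to the secondary volume."""

    def __init__(self, n):
        self.n = n
        self.saved = []          # (primary block, secondary offset), in write order; durable
        self.durable_map = {}    # leaf blockmap entries stable on disk as of the last syncpoint
        self.dirty_map = {}      # leaf updates since the last syncpoint (memory only)
        self.syncpoint = 0       # index into self.saved at the last syncpoint

    def save_block(self, primary_block):
        offset = len(self.saved)
        self.saved.append((primary_block, offset))
        self.dirty_map[primary_block] = offset
        if len(self.saved) - self.syncpoint >= self.n:
            self.declare_syncpoint()

    def declare_syncpoint(self):
        # Flush dirty leaves so that the B-tree is consistent on disk at this point.
        self.durable_map.update(self.dirty_map)
        self.dirty_map.clear()
        self.syncpoint = len(self.saved)

    def recover(self):
        # In-memory leaf updates are lost; rescan only blocks saved since the syncpoint.
        self.dirty_map.clear()
        rebuilt = dict(self.durable_map)
        for primary_block, offset in self.saved[self.syncpoint:]:
            rebuilt[primary_block] = offset
        return rebuilt, len(self.saved) - self.syncpoint


tracker = SyncpointTracker(n=4)
for block in range(100, 110):     # ten changed primary blocks
    tracker.save_block(block)
blockmap, rescanned = tracker.recover()
print(len(blockmap), rescanned)   # → 10 2
```

In this model, recovery work is bounded by n regardless of checkpoint size. A smaller n shortens recovery at the cost of more frequent leaf flushes.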
An intent log (112) on the secondary volume (102) further facilitates recovery operations. Before a B-tree split is performed, the leaves involved in the split and their parent nodes are written to the intent log (112), i.e., the entire images of the changed leaves and parent. The intent log write is relatively fast, and may comprise a single I/O operation to contiguous memory associated with the secondary volume (102). The movement of data between leaf nodes and the changes to the parent nodes that result from the split are performed asynchronously, typically to non-contiguous memory, in a relatively lengthy background process, e.g., by a paging daemon. If a failure occurs while the movement of data between the leaf nodes or the update of the parent node is incomplete, the intent log (112) is used to complete the split transaction. When a new syncpoint is declared, the intent log and dirty leaves are flushed.
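One way to lay out the intent log as a contiguous region written with a single I/O per split is sketched below, together with its replay and its truncation at a syncpoint. Everything here is a hypothetical illustration: the `IntentLog` and `replay` names, the length-prefixed JSON record format, and the CRC32 trailer used to ignore a record torn by a crash. The source does not specify any of these. Node images are simplified to lists of keys keyed by string node ids.

```python
import json
import zlib


class IntentLog:
    """Hypothetical contiguous intent-log region: [4-byte length][JSON images][4-byte CRC32]."""

    def __init__(self):
        self.region = bytearray()   # stands in for contiguous space on the secondary volume

    def append(self, images):
        payload = json.dumps(images).encode()
        record = len(payload).to_bytes(4, "big") + payload + zlib.crc32(payload).to_bytes(4, "big")
        self.region += record       # one sequential write per split

    def records(self):
        pos = 0
        while pos + 4 <= len(self.region):
            size = int.from_bytes(self.region[pos:pos + 4], "big")
            end = pos + 4 + size + 4
            if end > len(self.region):
                break               # record torn by a crash: stop replaying here
            payload = bytes(self.region[pos + 4:pos + 4 + size])
            if zlib.crc32(payload) != int.from_bytes(self.region[end - 4:end], "big"):
                break               # corrupted record: stop replaying here
            yield json.loads(payload)
            pos = end

    def truncate(self):
        # At a new syncpoint the dirty leaves are flushed, so logged images are no longer needed.
        self.region.clear()


def replay(log, disk_nodes):
    """Complete interrupted splits by copying every logged node image onto the disk map."""
    applied = 0
    for images in log.records():
        for node_id, keys in images.items():
            disk_nodes[node_id] = keys
            applied += 1
    return applied


log = IntentLog()
log.append({"leaf1": [10, 20], "leaf2": [30, 40], "parent0": [30]})
disk = {"leaf1": [10, 20, 30, 40], "parent0": []}   # the split never reached disk
print(replay(log, disk), disk["leaf2"])             # → 3 [30, 40]
```

Replaying full node images is idempotent in this sketch. Applying a record again after a repeated failure during recovery leaves the same result.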
Referring now to
Referring now to
While the invention is described through the above exemplary embodiments, it will be understood by those of ordinary skill in the art that modification to and variation of the illustrated embodiments may be made without departing from the inventive concepts herein disclosed. Moreover, while the preferred embodiments are described in connection with various illustrative structures, one skilled in the art will recognize that the system may be embodied using a variety of specific structures. Accordingly, the invention should not be viewed as limited except by the scope and spirit of the appended claims.