1. Technical Field
The present invention is directed to an improved method of performing backup operations in a computing system having a journaled or log file system. More specifically, the present invention provides a method and apparatus for ensuring the validity of the journaled file system metadata during a backup operation.
2. Description of Related Art
Virtually all computer applications (or programs) rely on storage. This storage can be used for both storing the computer code and for storing data manipulated by the code. The term “data” refers to any information, including formatting information, executable code and data for manipulation by an application program. Storage technology has developed in a variety of different directions. Accordingly, a wide variety of storage systems are available.
Because of the wide variety of storage systems that are available, it has become impractical for the person writing the computer application to also be responsible for detailed control over how data is stored on the storage system. For this and other reasons, application programs typically run on an operating system, e.g., Unix, Windows, MS DOS, Linux, and the many variations of each. Once again, however, the operating system may be used with a variety of storage systems. It would be highly inefficient to have to change the operating system, or the application programs, every time a change is made to physical storage. As a result, various layers of abstraction have evolved for viewing how data is actually stored in the storage system.
The application level 10 interfaces with the file system level 12. The file system level 12 is concerned with how files are stored on disks and how to make everything work efficiently and reliably. Thus, the file system level 12 may be responsible for storing directory structure, and for breaking up files into constituent data blocks for storage onto a physical storage system. For example, in most implementations of Unix, each file has an associated i-node. This i-node may contain accounting and protection information and, additionally, a set of pointers to data blocks.
Relatively early in the development of computer systems, disk drives became a fundamental device for storage. Accordingly, computer operating systems have been developed assuming that memory will rely on input/output (“I/O”) to a disk drive. The file system 12, therefore, may assume one or more “volumes” which correspond to a physical storage unit such as a disk drive, or any other unit of storage, with data stored in blocks on the disk drive. The demand for storage to be available for use by applications has greatly increased. As a result, a number of separate physical devices may be required to accommodate the total amount of storage required for a system. In addition, storage systems are often changed or reconfigured.
To insulate the operating system from any changes within the physical device storage system, some mechanism is often employed to flexibly map a standard (volume) view of physical storage onto an actual physical storage system. The logical volume manager (“LVM”) 14 of FIG. 1 can help achieve this function by mapping the file system view of data storage into an intermediate layer.
Finally, the actual storage reading and writing (and, potentially, additional mapping onto physical storage devices) occurs within the physical storage system level 16, as illustrated in
Problems occur in storage systems when there are failures, such as a system crash, that may cause the file system in the storage system to become corrupted. That is, when data is written to a storage system, the data is usually sent to a disk cache. The disk cache is a buffer allocated in the main memory and is intended to increase the speed of I/O operations. The problems occur when there is a system crash or failure before the data in the disk buffer has been written to the physical disk. In such a case, upon reboot of the system, the system will operate in an inconsistent manner since the system believes that the data has been written to the disk drive but, in actuality, the data has not been written to the disk drive due to the system failure. For example, a file may be deleted in the disk cache, but this deletion may not have actually been performed on the physical disk such that the file still remains on the physical disk.
It is because of such problems that databases and file systems have been provided with the ability to recover the system back to a consistent state. Although databases typically recover quickly, as the file system size grows, the recover time increases. For example, the fsck recover tool for ext2fs must scan through the entire disk partition in order to take the file system back to a consistent state. This time-consuming task creates a lack of availability for large servers with hundreds of gigabytes of disk space.
In order to provide the file system with a database-like recovery mechanism, journal or log file systems have been devised. The journal or log file systems make use of transactions. A transaction is a set of single operations that satisfy several properties. These properties are often referred to as the ACID properties of transactions. The acronym ACID stands for Atomicity, Consistency, Isolation and Durability. The most important one of these properties for purposes of the present invention is atomicity. Atomicity implies that all operations belonging to a single transaction are completed without errors or cancelled, producing no changes. This property, along with Isolation, makes the transaction appear to be non-partitionable.
The journal or log file system takes advantage of this atomicity by logging every single operation within the transaction into a log file or journal. The log entries for a transaction are comprised of metadata for the transaction. This metadata includes the control structures inside a file system, e.g., i-nodes, free block allocation maps, i-node maps, and the like.
After every single transaction, there must be a commit operation that makes the data in the disk buffer be written to physical disk storage. Therefore, if there is a system crash or other failure, the log may be traced back to the first commit statement, and the changes made by those transactions that did not commit entirely may be rolled back.
Even with the ability provided by journal or log file systems to recover a system to a consistent state, problems may still occur. One significant problem is the ability to obtain a valid device-level backup of multiple inter-related devices while the journal or log file system of those devices is active. For example, if a backup of a storage system is being performed, and the journal or log file system of the storage system is still active, then there may be transactions sent to the file system while the backup operation is being performed. As a result, the data in the backup copy of the storage system may be different from the data stored in the storage system. Thus, an inconsistency in the file system is created. Therefore, it would be beneficial to have a method and apparatus for ensuring valid journal file system metadata during a backup operation.
The present invention provides a method and apparatus for ensuring valid journaled file system metadata during a backup operation. With the method and apparatus of the present invention, mechanisms are provided for suspending write I/O operations to storage devices during a point-in-time backup operation. Once the point-in-time backup operation is completed, the suspended write I/O operations are released and may then be processed by the storage system.
In one exemplary embodiment, when a point-in-time operation is initiated, a point-in-time backup flag is set in a logical volume manager. In response to receiving a write I/O operation, the logical volume manager checks the status of the point-in-time backup flag to determine if a point-in-time backup operation is currently being performed. If the flag is not set, then the write I/O operation is logged in the file system log and the write I/O operation is performed in a normal manner. Once the write I/O operation is completed, the file system is informed of the completion and the log entry for the write I/O operation is used to update the file system metadata.
If the backup operation flag is set, then the metadata for the write I/O operation is stored in the file system log and the write I/O operation is stored in a hold queue of the logical volume manager. The write I/O is not permitted to be committed to the logical volume while it is stored in the hold queue. As a result, the write I/O operation cannot be completed until after the backup operation is complete and thus, the file system metadata is not modified to reflect the completion of the write I/O operation until after the backup operation is complete.
When the point-in-time backup operation is completed, the completion of the point-in-time backup operation is communicated to the logical volume manager which then resets the point-in-time backup flag. In addition, the suspended write I/O operations in the hold queue are released so that they may be performed on the logical volume. When the write I/O operation is committed by the storage device(s) of the logical volume, the storage devices respond to the logical volume manager that the write I/O has been completed and, as a result, the metadata stored in the file system log for the write I/O operation is used to update the file system metadata for the logical volume. In this way, write I/Os are permitted to be sent to the file system log but are held from being committed to the storage devices, in a hold queue of the logical volume manager, until after the backup operation is completed.
Thus, as a result of the operation of the present invention, the metadata associated with the point-in-time backup copy of the data in the storage system, the data in the storage system, and the file system metadata remain consistent during the backup operation. It is only after the point-in-time backup operation is complete that the write I/O operations received during the point-in-time backup operation are permitted to be committed to the logical volume, and as a result, the metadata of the file system is updated to reflect the writing of the data to the logical volume.
These and other features and advantages of the present invention will be described in, or will become apparent to those of ordinary skill in the art in view of, the following detailed description of the preferred embodiments.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
The present invention provides a mechanism for ensuring the integrity of a file system during a backup operation. More specifically, the present invention provides a mechanism for ensuring that the metadata of a file system is consistent with the state of the storage system and the metadata of the backup such that the journal or log file system may remain active during the backup operation. The mechanisms of the present invention may be employed in a distributed data processing system or in a stand-alone computing device. As such, in order to provide a context for the description of the operational components of the present invention, a brief description of a distributed data processing system and a stand-alone data processing system are provided hereafter with reference to
With reference now to the figures,
In the depicted example, server 204 is connected to network 202 along with storage unit 206. In addition, clients 208, 210, and 212 are connected to network 202. These clients 208, 210, and 212 may be, for example, personal computers or network computers. In the depicted example, server 204 provides data, such as boot files, operating system images, and applications to clients 208-212. Clients 208, 210, and 212 are clients to server 204. Network data processing system 200 may include additional servers, clients, and other devices not shown. In the depicted example, network data processing system 200 is the Internet with network 202 representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational and other computer systems that route data and messages. Of course, network data processing system 200 also may be implemented as a number of different types of networks, such as for example, an intranet, a local area network (LAN), or a wide area network (WAN).
Referring to
Peripheral component interconnect (PCI) bus bridge 314 connected to I/O bus 312 provides an interface to PCI local bus 316. A number of modems may be connected to PCI local bus 316. Typical PCI bus implementations will support four PCI expansion slots or add-in connectors. Communications links to clients 208-212 in
Additional PCI bus bridges 322 and 324 provide interfaces for additional PCI local buses 326 and 328, from which additional modems or network adapters may be supported. In this manner, data processing system 300 allows connections to multiple network computers. A memory-mapped graphics adapter 330 and hard disk 332 may also be connected to I/O bus 312 as depicted, either directly or indirectly.
Those of ordinary skill in the art will appreciate that the hardware depicted in
The data processing system depicted in
With reference now to
An operating system runs on processor 402 and is used to coordinate and provide control of various components within data processing system 400 in
Those of ordinary skill in the art will appreciate that the hardware in
As another example, data processing system 400 may be a stand-alone system configured to be bootable without relying on some type of network communication interfaces. As a further example, data processing system 400 may be a personal digital assistant (PDA) device, which is configured with ROM and/or flash ROM in order to provide non-volatile memory for storing operating system files and/or user-generated data.
The depicted example in
Within the storage system 500 is a host adapter 520 that is responsible for managing and translating read and write requests from the host computer which are based on the virtual disk structure, e.g., from the file system or logical volume manager level, into one or more requests corresponding to how data is stored on the actual physical storage devices 540A-540D of the storage system 500. The host adapter 520 can be implemented in any of a number of ways, including using a general purpose processor or a custom hardware implementation. In addition, multiple host adapters may be included to facilitate having additional I/O channels for the storage system 500.
The host adapter 520 communicates with the other components of the storage system 500 using bus 530. The bus 530 may be any suitable communication element, including use of SCSI, ESCON, and other bus protocols.
Access to the physical storage devices 540A-540D is controlled through the use of disk adapters 550A-550D. The disk adapter 550A-550D can also be implemented using a general purpose processor or custom hardware design. In the embodiment illustrated in
In the embodiment of
Any of a variety of mechanisms can be used to implement and manage the cache 560. An example of such a mechanism is included in U.S. Pat. No. 5,537,568, entitled “System for Dynamically Controlling Cache Manager Maintaining Cache Index and Controlling Sequential Data Access,” issued on Jul. 16, 1996.
Similarly, writes may be accomplished through the cache 560 using any of a variety of mechanisms and strategies. One mechanism for writing from the cache is to store the data to be written in the cache, and mark a “write pending” bit. When the write pending bit is encountered, the applicable data can be written to the disk. This technique is described generally in U.S. Pat. No. 5,341,493, entitled “Disk Storage System with Write Preservation During Power Failure,” issued on Aug. 23, 1994. Of course other known mechanisms may be used to manage the cache 560 without departing from the spirit and scope of the present invention.
The cache 560 may be divided into more than one area. For example, the cache 560 may include an area 560A for storing data being read or written from physical storage devices 540A-540D. The cache may further include a “mailbox” area 540B. The mailbox area 540B may be used to facilitate communications among the disk adapters 550A-550D and with the host adapter 520. For example, each disk adapter 550A-550D may have its own area within the mailbox 560B. Each of the disk adapters 550A-550D can post or read information from the applicable mailbox area 560B, to communicate status and other information.
A remote adapter 570 may also be attached to the bus 530 of the storage system 500. The remote adapter 570 may be employed for communication with remote data facilities (“RDF”), for example, connection to another storage device to maintain a mirror redundancy group. One form of RDF link and method of implementation is described in various publications available from EMC Corporation, including SYMMETRIX Remote Data Facility Product Manual, P/N 200-999-554, rev. B, June 1995. RDF embodiments are also described in U.S. Pat. No. 5,544,347 (Yanai) which is hereby incorporated herein by reference in its entirety. It should be appreciated, however, that the present invention is not limited to the use of RDF or to a system that employs SYMMETRIX disk arrays, and can be employed with any of numerous other types of storage systems.
A service processor 580 may be coupled to the bus 530 of the storage system 500. The service processor 580 may include a display, keyboard and other I/O devices to permit an operator to use the service processor 580 for configuring the components of the storage system 500 and for running or initiating diagnosis and maintenance facilities.
The storage system 500 of
When a file system and journal log reside on more than a single device, starting a point in time, or instant, copy of those devices can cause the log and the file system to be out of sync without having the benefits of the present invention. That is, without the present invention, the log can continue to be written to and the state of the file system can drift away from what the log indicates it is. This is also true of a log that is contained on multiple devices, as in the case of striping or simply a large log. The present invention provides a mechanism for keeping the log and the file system consistent even when the log is written across a plurality of devices.
The present invention solves this problem by including a mechanism either within or associated with the logical volume manager that suspends write operations to the log devices of a logical volume during a backup operation. With the present invention, write operations to the log devices of the logical volume are held at a logical volume manager device driver level for the duration of the backup of the log devices. Since no new write operations are allowed to be sent to the storage system for the log device, the “log device” being a logical storage device that may be comprised of a plurality of physical storage devices, new metadata transactions to the file system are also prevented during the same time period. This guarantees internal consistency between the file system and log devices while still allowing write access to blocks in files that already have committed allocations.
Thus, through the operation of the present invention, write operations are permitted to log devices that are not part of the current backup operation but are suspended to other log devices of the storage system that are subject to the backup operation. The write operations are permitted to be sent to the file system log, however, those that need to be suspended because they are directed to log devices that are subject to a backup operation, are held by the logical volume manager until after the backup operation is completed. As a result, write I/O operations are permitted to continue during a backup operation in such a manner that the consistency of the file system metadata is maintained.
In order to suspend write I/O operations during the backup operation, e.g., a point-in-time backup operation, the logical volume manager of the present invention stores any write I/O operations received during the backup operation in a hold queue, e.g., a linked list, within the logical volume manager. As a result, the write I/O operation is not performed on the physical storage devices of the logical volume until after the backup operation is complete. Once the backup operation is complete, the write I/O operations are released from their suspended state in the hold queue and are performed in the order they were received. Thus, during the backup operation, write I/O operations are permitted to be received in the file system log but are not permitted to be sent to the log devices.
As a consequence of the above operation, the metadata of the file system is not changed during the backup operation. Since the write I/O operations are not completed during the backup operation, the file system metadata is not updated to reflect these write I/O operations until after the backup operation is complete and the write I/O operations are performed on the storage system.
The present invention will be described in terms of hardware point-in-time backup operations which are also sometimes referred to as instant copy operations, snapshot copy operations, and the like. It should be appreciated that the present invention may be utilized with other backup operations other than point-in-time backup operations, instant copy operations, or snapshot copy operations, without departing from the spirit and scope of the present invention.
A point-in-time backup operation is a backup of the entire contents of a storage device or storage devices, e.g., disk or disks. Thus, it is important that the metadata stored in the file system log correspond to the data on the physical storage devices or else data corruption may occur. As a result, it is important to ensure that data is not written to the physical storage blocks containing the file system log during the point-in-time backup operation since such write I/O operations would cause the data on the physical storage devices to not match the metadata of the point-in-time backup copy.
The mechanisms of the present invention ensure that write I/O operations are not permitted to be sent to the physical storage system 650 blocks containing the file system metadata during a point-in-time backup operation by providing a point-in-time backup flag 625, on a per logical volume basis, additional functionality in the logical volume device driver 630, and hold queue 640 of the logical volume manager 620, as described in greater detail hereafter. Through the use of these new mechanisms and functionality, write I/O operations that are received during a point-in-time backup operation are suspended for the file system log logical volume until after the point-in-time backup operation is completed. In this way, no write operations may be performed in the physical storage system 650 associated with the log logical volume during the point-in-time backup operation and thus, the metadata associated with the data in the physical storage devices of the physical storage system 650 is ensured to be consistent with the data.
As shown in
In response to receiving a read copy operation indicating that a point-in-time backup operation has been initiated, the logical volume device driver 630 sets a point-in-time backup flag 625 indicating that a point-in-time backup operation is being performed. The read copy operation is then logged in the log 614 and the read is permitted to be sent to the physical storage system 650.
Other applications are permitted to submit I/O requests to the file system 610 during the point-in-time backup operation. With the present invention, if the submitted I/O request is a Read I/O request, then the I/O request is permitted to be processed in a normal fashion. If the submitted I/O request is a Write I/O request that will result in a change to the file system metadata and a write to the file system log, however, the Write I/O request is suspended by the mechanisms of the present invention until after the point-in-time backup operation is completed.
Thus, for example, the application 600 may send a Write I/O request to the file system 610 which then generates a log entry associated with the Write I/O request in the file system log 614 and provides the Write I/O request to the logical volume manager 620. The logical volume device driver 630 receives the Write I/O request and determines it to be a Write I/O request. As a result, the logical volume device driver 630 reads the value for the point-in-time backup flag 625 to determine if the flag has been set or not. If the flag has not been set, then the Write I/O request is processed in a normal fashion by writing an entry to the file system log 614 for the Write I/O request and sending the Write I/O request to the physical storage system 650 so that the physical storage devices may commit the Write I/O operation. Once the Write I/O operation has been committed, the log entry in the file system log 614 is used to update the file system metadata 616.
If however, the point-in-time backup flag 625 has been set, i.e. a point-in-time backup is in progress, then the Write I/O request is sent to the hold queue 640. The hold queue 640 may be, for example, a linked list of I/O requests that have been suspended and which are to be performed once their suspend state has been removed. Any I/O requests that are suspended by placing them in the hold queue 640 are not sent to the physical storage system 650 until the I/O request entries in the hold queue 640 are released.
Once the point-in-time backup operation is completed, the point-in-time backup application 605 sends a message, or alternatively, the file system 610 sends a message, to the logical volume manager 620 indicating the completion of the point-in-time backup operation. Upon the logical volume device driver 630 receiving a message from the point-in-time backup application 605, or the file system 610, that the backup operation is complete, the logical volume device driver 630 resets the point-in-time backup flag 625 and releases the I/O requests stored in the hold queue 640. These I/O requests are then submitted to the physical storage system 650 in the order that they were received so that they may be committed. Once the I/O requests are committed, the logical volume manager 620 is informed of the committing of the I/O requests, and the metadata in the log entry of the file system log 614 corresponding to the I/O request is used to update the file system metadata 616.
Thus, with the present invention, Write I/O operations to the physical storage devices of a storage system are suspended for the blocks containing the file system log logical volume for the duration of the backup operation. In this way, the metadata associated with the data stored on the physical storage devices is ensured to be consistent with the data. As a result, the log will match the file system so that the log can be replayed. This means that file system metadata 616 will match the file system log 614 and thus, the metadata 616 can be trusted after the log 614 is replayed. A file system can be considered corrupted and possibly lost if the state of its metadata 616 is not maintained. If the state is maintained then the backup has valid metadata 616 if it is not maintained then the backup is not valid. One goal, and one principal benefit, of the present invention, is to obtain a valid backup by ensuring the validity of the file system metadata.
Accordingly, blocks of the flowchart illustrations support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by special purpose hardware-based computer systems which perform the specified functions or steps, or by combinations of special purpose hardware and computer instructions.
The point-in-time backup operation is then performed on one or more of the physical storage devices (step 730) and a determination is made as to whether the point-in-time backup operation has completed (step 740). If not, the operation returns to step 730 and the point-in-time backup operation is continued. If the point-in-time backup operation has completed, then the point-in-time flag in the logical volume manager is reset (step 750) and the operation terminates.
A determination is made as to whether the point-in-time backup flag has been set (step 830). If not, the Write I/O operation is processed in a normal fashion by submitting it to the storage system to be committed (step 840). If the point-in-time backup flag is set, the Write I/O operation is sent to a hold queue in the logical volume manager (step 850). The operation then terminates.
The write I/O requests in the hold queue are then output to the storage system in the order that they were received (step 930) and the hold queue is reinitialized (step 940). The ordering of the write I/O operations may be maintained by the structure of the hold queue, for example. That is, a linked list may be utilized that maintains the ordering of received write I/O requests.
As each write I/O operation is committed by the storage system, a confirmation that the write I/O operation is committed is received (step 950). As a result, the entry in the file system log that corresponds to the write I/O is used to update the file system metadata (step 960). The operation then terminates.
Thus, the present invention provides a mechanism for ensuring the consistency of a journaled file system metadata with the data stored in an associated storage system during a backup operation being performed on the storage devices of the storage system. As a result, data corruption is less likely due to Write operations being performed during the backup operation.
While the present invention is described in terms of all write I/O operations to log devices being held by the logical volume manager during the backup operation, the present invention is not limited to such. Rather, in an alternative embodiment, the write I/O operations may be held only when they correspond to a designated range of blocks of data. Thus, even though the backup operation is current being performed, if the write I/O operation is to a block of data that is not within the designated range, such as to a different logical volume, the write I/O operation will not be held by the mechanisms of the present invention.
For example, when a backup operation is initiated, the blocks of data that are subject to the backup operation may be determined, i.e. the range of blocks of data that are being backed up are identified. Thereafter, if a write I/O operation is received, during the backup operation, that is directed to this range of blocks of data, the write I/O operation will be suspended in the manner previously described. However, any write I/O operations that are not directed to this range of blocks of data will be allowed to proceed in a normal fashion since they will not affect the metadata of the blocks of data that are being backed up. Thus, in this alternative embodiment, rather than suspending all write I/O operations during a backup operation, only those write I/O operations that affect the metadata of one or more blocks of data subject to the backup operation are suspended.
It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies equally regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media, such as a floppy disk, a hard disk drive, a RAM, CD-ROMs, DVD-ROMs, and transmission-type media, such as digital and analog communications links, wired or wireless communications links using transmission forms, such as, for example, radio frequency and light wave transmissions. The computer readable media may take the form of coded formats that are decoded for actual use in a particular data processing system.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.