1. Field of the Invention
The present invention is in the field of computer-generated (host or appliance) data backup, protection, and recovery, and pertains particularly to methods and apparatus for data mapping and optimizations for hierarchical persistent storage management, including fault-tolerant data protection, fine-grained system snapshots, and instant recovery.
2. Discussion of the State of the Art
In the field of protection and restoration of computer-generated data, it is important to protect computer systems, from individual personal computers (PCs) to robust enterprise server systems, from data loss and system downtime that may result from system or application failure. Enterprises and medium-sized businesses are especially vulnerable to loss of efficiency resulting from the lack of a secure data protection system or from a faulty or slow data protection and recovery system. Small businesses require reliable and automated (shadow) backup to compensate for a lack of experienced IT personnel and for unreliable backup practices.
Existing methods for protecting data written to various forms of storage devices include copying files to alternate or secondary storage devices. Another known method involves archiving data to storage tape. In some systems, “snapshots” of data are created periodically and then saved to a storage disk for later recovery if required. Some data storage, backup, and recovery systems are delivered as external data protection devices (appliances), meaning that they reside outside of the processing boundary of the host.
There are some problems and limitations with current methods for protecting system and host generated data. For example, magnetic tape used in tape-drive archival systems suffers from poor performance in both data write and data access. Archiving data to tape may slow system activity for extended periods of time. Writing data to tape is an inherently slow process. Data restoration from a tape drive is not reliable or practical in some cases. One reason for this is that data on tape resides in a format that must be converted before the mounting system recognizes the data.
One development that provides better performance than a tape archival system uses high capacity serial-advanced-technology attachment (SATA) disk arrays or disk arrays using other types of hard disks (like SCSI, FC or SAS). The vast majority of on-disk backups and advanced data protection solutions (like continuous data protection) use specialized, dedicated hardware appliances that manage all of the functionality. Although these appliances may provide some benefits over older tape-drive systems, the appliances and software included with them can be cost prohibitive for some smaller organizations.
One limiting factor in the data protection market is the speed at which systems can write data into protective storage. Systems often write their data to long-term storage devices such as a local disk drive or a networked storage device. Often this data is associated with one or more application programs and is located at random locations within the long-term storage device(s). Writing frequently to random storage locations on a disk storage device can be slow because of the seek time and rotational latency inherent in disk drive technology: for each write, the disk drive must physically move its read/write head and wait for the appropriate sector to come into position.
Data protection and backup appliances currently available handle data from several production servers and typically use SATA hard disks, which are much slower than SCSI hard disks. Improved performance can be achieved by adding additional disks; however, cost becomes a factor.
Data writing performance, especially in robust transaction systems, is critical to enterprise efficiency, and it is desired to be able to secure increasing amounts of data while continually improving writing speed. What is clearly needed, therefore, are methods and apparatus that enable continuous data protection (CDP) for computing systems while improving write performance and solving the problems inherent in current systems described above (including slow and unreliable data recovery).
In an embodiment of the invention a host-based system for continuous data protection and backup for a computing appliance is provided, comprising a central processing unit, an operating system, a long-term disk storage medium, and a persistent low-latency memory (PLLM). Writes intended for disk storage are made first to the PLLM and later, after being coalesced, are written to the disk storage medium on a per-snapshot basis.
In one embodiment of the system, periodic system-state snapshots are stored in the PLLM, enabling restoration of the host to the state of any prior snapshot stored in the PLLM, and then adjustment, via the record of writes made to memory between snapshots, to any desired state between the snapshot states.
In some embodiments the PLLM is non-volatile random access memory (NVRAM), Flash memory, magnetic RAM, a solid-state disk, or any other persistent low-latency memory device.
In another aspect of the invention a method for improving performance in a computerized appliance having a CPU and non-volatile disk storage is provided, comprising steps of: (a) providing a persistent low-latency memory (PLLM) coupled to a CPU and to the non-volatile disk storage; (b) storing a memory map of the non-volatile disk storage in the PLLM; (c) performing writes meant for the disk storage first to the PLLM; and (d) performing the same writes later from the PLLM to the non-volatile disk, but in a more sequential order determined by reference to the memory map of the disk storage.
In one embodiment of the method, steps are included for (e) storing periodic system-state snapshots in the PLLM; and (f) noting the sequence of writes to the PLLM for the time frames between snapshots, enabling restoration to any snapshot and to any state between snapshots.
In some embodiments of the method the PLLM is non-volatile random access memory (NVRAM), Flash memory, magnetic RAM, a solid-state disk, or any other persistent low-latency memory device.
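By way of illustration only, the following minimal Python sketch (all class and method names are hypothetical and not part of the invention) shows how steps (a) through (f) could fit together: writes land in a persistent low-latency buffer first, a record of the write sequence and snapshot markers is kept, and the buffered writes are later flushed to disk in an order guided by the disk map.

```python
# Hypothetical sketch of method steps (a)-(f); names and data structures are
# illustrative only and do not correspond to any actual product interface.

class FakeDisk:
    """Stand-in for the non-volatile disk storage of step (a)."""
    def __init__(self):
        self.blocks = {}

    def write(self, lba, data):
        self.blocks[lba] = data


class PLLMBuffer:
    def __init__(self, disk):
        self.disk = disk
        self.disk_map = {}        # step (b): memory map of the disk storage
        self.pending = {}         # step (c): writes are made first to the PLLM
        self.write_log = []       # step (f): ordered record of writes
        self.snapshot_marks = []  # step (e): periodic system-state snapshots

    def write(self, lba, data):
        self.pending[lba] = data
        self.write_log.append((lba, data))

    def take_snapshot(self):
        self.snapshot_marks.append(len(self.write_log))

    def flush(self):
        # Step (d): replay the buffered writes in ascending LBA order, which
        # approximates a sequential pass over the disk as guided by the map.
        for lba in sorted(self.pending):
            self.disk.write(lba, self.pending[lba])
            self.disk_map[lba] = "on disk"
        self.pending.clear()


buf = PLLMBuffer(FakeDisk())
for lba in (900, 12, 430):        # random-order writes arrive from the host
    buf.write(lba, b"block")
buf.take_snapshot()
buf.flush()                       # written back in the order 12, 430, 900
```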
Advanced Data Protection With Persistent Memory:
The inventor provides a computing system that can perform cost effective continuous data protection (CDP) and instant data recovery using a novel approach whereby a low latency persistent memory (PLLM or just PM) is provided to cache system snapshots during processing and to enable faster read and write access. The methods and apparatus of the invention are explained in enabling detail by the following examples according to various embodiments of the invention.
System 100 further includes an expansion bus adapter (EBA) 104 connected to SMC 102. EBA 104 provides CPU adaptation for an expansion bus. Common expansion bus configurations include Peripheral Component Interconnect (PCI) or variations thereof such as PCI-X and PCI-Express.
System 100 further includes a small computer system interface (SCSI)/redundant array of independent disks (RAID) controller 105, or optionally some other disk controller such as advanced technology attachment or a variant thereof. The exact type of controller will depend on the type of disk that computing system 100 uses for production storage (PS) 107. PS 107 may be a SCSI disk or variants thereof.
Controller 105 controls CPU access to PS 107 through expansion bus adapter 104. System 100 is provided with a persistent memory (PM) 106, which in one embodiment is a non-volatile random access memory (NVRAM). Persistent memory is defined within this specification as a memory type that retains data stored therein regardless of the state of the host system. Other types of persistent memory include Flash memory, of which many types are known and available to the inventor, magnetic RAM, and solid-state disk.
PM 106 may be described as having low latency, meaning that writing to the memory can be performed much faster than writing to a traditional hard disk. Likewise, reading from NVRAM or Flash memory may also be faster in most cases. In this particular example, PM 106 is connected to CPU 101 through a 64-bit expansion bus.
Unique to computing systems is the addition of PM 106 for use in data caching for the purpose of faster writing and for recording system activity via periodic snapshots of the system data. A snapshot is a computer-generated consistent image of data and system volumes as they were at the time the snapshot was created. For the purpose of this specification, a snapshot shall contain enough information such that if computing system 100 experiences an application failure, or even a complete system failure, the system may be restored to working order by rolling back to the last snapshot that occurred before the problem. Furthermore, several snapshots of a specific volume can be exposed to the system concurrently with the in-production volume for recovery purposes. Snapshots are writeable, meaning that an application or the file system can write into a snapshot without destroying it; this specifically allows application-consistent snapshots to be provided. Snapshots can be used for different purposes, such as recovery of specific files or of entire volumes. Writeable snapshots can also be used for test environments.
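As a purely illustrative model of the writeable-snapshot behaviour described above (all names are hypothetical), an exposed snapshot can accept writes into an overlay while the preserved base image is never modified:

```python
# Illustrative model of a writeable snapshot: writes land in an overlay, so the
# preserved base image is never altered. All names here are hypothetical.

class WriteableSnapshot:
    def __init__(self, base_image):
        self.base = dict(base_image)   # preserved snapshot content, never modified
        self.overlay = {}              # writes made through the exposed volume

    def read(self, block):
        return self.overlay.get(block, self.base.get(block))

    def write(self, block, data):
        self.overlay[block] = data     # the original snapshot stays intact


base = {0: b"boot", 1: b"data"}
snap = WriteableSnapshot(base)         # snapshot exposed as a writeable volume
snap.write(1, b"patched")
assert snap.read(1) == b"patched" and base[1] == b"data"
```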
System 200 is provided with a SATA RAID controller 201 that extends the capabilities of the system for data protection and automatic shadow backup of data. In this example, snapshots cached in persistent memory (PM) 106 are flushed at a certain age to a snapshot storage pool (SSP) 202. SSP 202 is typically a SATA disk or disks stacked in a RAID system. Other types of hard disks may be used in place of a SATA disk without departing from the spirit and scope of the present invention. Examples include, but are not limited to, advanced-technology attachment (ATA), of which variants exist in the form of serial ATA (SATA) and parallel ATA (PATA). The latter are very commonly used as hard disks for backing up data. In actual practice, the SSP may be maintained remotely from system 200, such as on a storage area network (SAN) accessible through a LAN or WAN, or within network attached storage (NAS). The SSP is an extension of PM that can keep snapshots for days and weeks. The SSP can also be used for production storage failover.
Once created, the SSP is constantly and automatically updated in the background. The SSP is an asynchronous, consistent-in-time image of the data and/or system volume(s).
The data is written into the persistent memory, using the memory as a sort of write cache, instead of writing the data directly to the production storage and, perhaps, replicating the writes for data protection purposes. The novel method of caching data uses an allocate-on-write technique instead of the traditional copy-on-write. Three major factors improve application performance: write-back mode (reporting write completion to the file system once data has been written into PM 106), write cancellation (keeping in PM only the latest version of the data), and write coalescing (coalescing contiguous and near-contiguous blocks of data for efficient writing into production storage). The initial data writes are addressed to random locations on disk 304. However, a utility within PM 106 organizes the cached data in the form of periodically taken snapshots 301. PM 106 contains short-term snapshots 301, each taken at a different time. Snapshots 301 are taken over time, so the oldest of snapshots 301 is eventually flushed into production storage 304 and snapshot storage pool 202.
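The following sketch (hypothetical names, in-memory stand-ins for PM 106 and the disks) illustrates the three factors named above: write-back acknowledgement, write cancellation, and coalescing of contiguous blocks before the flush to production storage.

```python
# Hypothetical illustration of the three factors named above: write-back
# acknowledgement, write cancellation, and coalescing of adjacent blocks.

class WriteCache:
    def __init__(self):
        self.cache = {}                           # lba -> latest data

    def write(self, lba, data):
        self.cache[lba] = data                    # cancellation: keep latest only
        return "completed"                        # write-back: ack once in PM

    def coalesce(self):
        """Group contiguous LBAs into single runs for efficient disk writes."""
        runs, run = [], []
        for lba in sorted(self.cache):
            if run and lba != run[-1][0] + 1:
                runs.append(run)
                run = []
            run.append((lba, self.cache[lba]))
        if run:
            runs.append(run)
        return runs


wc = WriteCache()
for lba in (7, 3, 4, 3, 5):                       # 3 is overwritten (cancelled)
    wc.write(lba, b"x%d" % lba)
print([[lba for lba, _ in run] for run in wc.coalesce()])   # [[3, 4, 5], [7]]
```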
Snapshots existing within PM 106 at any given time are considered short-term snapshots in that they are accessible from PM in the relatively short term (covering hours of system activity). Snapshots 302 illustrated within snapshot storage pool 202 are considered long-term snapshots because they are older and have aged out of the on-PM snapshots, covering days and weeks of system activity. All of snapshots 302 are eventually written to a full backup storage disk 303. A snapshot is generated arbitrarily as data is written; therefore, one snapshot may reflect much more recent activity than a previous snapshot.
One with skill in the art will recognize that NVRAM or other persistent memory may be provided economically in a size that accommodates many system snapshots before those snapshots are flushed from NVRAM (PM) 106 into PS 304 and SSP 202. Therefore, the host system has local access to multiple system snapshots, which expedites instant data recovery greatly.
The most recent snapshots are stored on low-latency persistent memory such as NVRAM or the other mentioned types of low-latency persistent memory devices. Older snapshots are stored on hard disks such as SATA disks. It is noted herein that common snapshot management is provided regardless of where the snapshots reside, whether on PM 106 or on SSP 202. The system of the invention offers storage redundancy; for example, if production storage has failed, a production server can switch immediately to the alternative storage pool and resume its normal operation.
Write Optimization:
The inventor provides a write optimization that includes intermediate caching of data destined for disk using the low-latency persistent memory as described above, together with utilities for organizing randomly addressed data into a sequential order and then mapping that data to sequential blocks on the disk. This method of writes does not by itself provide advanced data protection such as CDP; however, it is used within data protection appliances for random-write optimization. Such appliances mostly handle data writes and perform data reads only periodically and infrequently, for backup and data recovery. This unique write optimization technology is detailed below.
Note that writing random data sequentially causes data fragmentation that may impact read performance. However, read performance is not critical for data protection appliances.
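A minimal sketch of the write-optimization idea, under the assumption of a simple append-only layout (all names hypothetical): randomly addressed writes are placed at the next sequential location on disk, and a translation table kept in persistent memory records where each original address now lives, so later reads can be resolved transparently.

```python
# Illustrative sketch of the write optimization: randomly addressed writes are
# laid out sequentially on disk, and a translation table kept in persistent
# memory maps each original (logical) address to its sequential location.

class SequentialWriter:
    def __init__(self):
        self.next_physical = 0
        self.translation = {}            # logical lba -> physical location
        self.disk = {}                   # stand-in for the sequential disk region

    def write(self, logical_lba, data):
        self.disk[self.next_physical] = data
        self.translation[logical_lba] = self.next_physical
        self.next_physical += 1          # always append: no seek between writes

    def read(self, logical_lba):
        """Reads go through the translation table, transparently to the host."""
        return self.disk.get(self.translation[logical_lba])


sw = SequentialWriter()
for lba in (9123, 17, 5024):             # random logical addresses from the host
    sw.write(lba, b"payload")
print(sw.translation)                    # {9123: 0, 17: 1, 5024: 2}
```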
One with skill in the art will recognize that the methodology of mapping random writes into a substantially sequential order can also be performed in parallel within NVRAM 402. In this way one sequential write may be initiated before another is actually finished. This capability may be scalable to an extent according to the provided structures and data capacity within NVRAM 402. Likewise, parallel processing may be performed within NVRAM 402 whereby collected data over time is mapped, structured using separate DSAs and written in distributed fashion over a plurality of storage disks. There are many possible architectures.
PM 402 includes a coalescing engine 602 for gathering multiple random writes for sequential ordering. In a preferred embodiment, coalescing engine 602 creates one mapping table for every data set comprising or filling the data storage area (DSA). It is noted herein that the DSA may be pre-set in size and may have a minimum and maximum constraint on how much data it can hold before writing.
Furthermore, as time progresses and more data is written into long-term storage, more sets of data have been reorganized from non-sequential to sequential or near-sequential order. Therefore, different sets of reorganized data could contain data originally intended to be written to the same original address/location in destination storage. In such a case, only the last instance of data intended for the same original address contains current data. In this embodiment, the address translation tables resident in persistent memory may be adapted to identify the locations that contain current data. In this way, old data intended for the same original location may be discarded and the storage space reused. In this example, PM 402 has an added enhancement exemplified as a history tracking engine 603. Tracking engine 603 records the average frequency of data overwrites to a same address in memory, as just described.
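A hypothetical sketch of the history-tracking idea (names invented for illustration): each overwrite of a logical address is counted, and the physical location holding the superseded copy is marked as reclaimable so its space can be reused.

```python
# Hypothetical sketch of history-tracking behaviour: overwrite counts are kept
# per logical address, and superseded physical locations become reusable.

from collections import defaultdict

class HistoryTrackingWriter:
    def __init__(self):
        self.next_physical = 0
        self.disk = {}                         # physical location -> data
        self.translation = {}                  # logical -> current physical
        self.overwrites = defaultdict(int)     # logical -> overwrite count
        self.reclaimable = []                  # stale locations to be reused

    def write(self, logical_lba, data):
        if logical_lba in self.translation:
            self.overwrites[logical_lba] += 1              # update frequency
            self.reclaimable.append(self.translation[logical_lba])
        self.disk[self.next_physical] = data
        self.translation[logical_lba] = self.next_physical
        self.next_physical += 1


w = HistoryTrackingWriter()
for lba in (10, 20, 10, 10, 30, 20):
    w.write(lba, b"d")
print(dict(w.overwrites))   # {10: 2, 20: 1}
print(w.reclaimable)        # stale physical slots free for reuse: [0, 2, 1]
```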
In order to avoid fragmenting data, a special algorithm is provided for garbage collection. The algorithm (not illustrated here) is based on the history of data update frequency logged by engine 603, and it coalesces data with identical update frequencies. The additional address translation and the advanced "garbage collection" algorithm require storing additional information in the form of metadata within NVRAM 402. In this embodiment, each original write request results in several actual writes into long-term persistent storage (disks) as a "transaction series" of writes.
The advanced form of “garbage collection” begins by identifying blocks of data that are frequently overwritten over time. Those identified blocks of data are subsequently written together within the same sequential data set. Arranging the locality of data blocks that are most frequently overwritten as sequential blocks within the sequential data set increases the likelihood that overwritten data will appear in groups (sequential blocks) rather than in individual blocks. History of past access patterns will be used to predict future access patterns as the system runs.
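As an illustration of the grouping step only (the bucket size and function names are assumptions, not part of the disclosure), blocks with similar observed update frequencies can be collected into the same sequential data set during garbage collection:

```python
# Illustrative sketch of frequency-aware "garbage collection" grouping: blocks
# with similar update frequencies are rewritten together into the same
# sequential data set, so future overwrites tend to invalidate whole groups.

def group_by_update_frequency(update_counts, bucket_size=5):
    """update_counts: logical block -> observed overwrite count (from the
    history tracking engine). Returns lists of blocks to rewrite together."""
    buckets = {}
    for block, count in update_counts.items():
        buckets.setdefault(count // bucket_size, []).append(block)
    # Each bucket becomes one sequential data set during garbage collection.
    return [sorted(blocks) for _, blocks in sorted(buckets.items())]


history = {100: 1, 101: 2, 102: 14, 103: 12, 104: 0}
print(group_by_update_frequency(history))   # [[100, 101, 104], [102, 103]]
```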
It is noted herein that the address correlation method described herein may, in a preferred embodiment, be transparent to the host CPU. The CPU recognizes only the original random addresses for writes and reads, while the utilities of the invention residing within PM 402 perform the correlation.
Data Flow And Snapshot Management:
The inventor provides a detailed data flow explanation that results in advanced data protection, high availability, and data retention. These novel approaches employ a low-latency persistent memory, a local snapshot storage pool, and a remote snapshot storage pool that keep (uniformly accessed) snapshots. Each snapshot can be exposed as a volume to the operating system and used in read-write mode for production or for data recovery; however, the original snapshot is preserved. The methods and apparatus of the invention are explained in enabling detail by the following examples according to various embodiments of the invention.
In practice of the invention on computing system 800, CPU 802, aided by SW 801, sends write data to NVRAM 803 over logical bus 808. NVRAM 803, aided by SW 801, gathers write data in the form of consistent snapshots. The data is then written into local production storage. Data snapshots are created and maintained within NVRAM 803 to the extent allowed by an aging scheme. Multiple snapshots are created and are considered fine-grain snapshots while existing in NVRAM 803, covering hours of system activity. When a snapshot ages beyond NVRAM maintenance, it is flushed into production storage 805 through controller 804 over a logical path 806 labeled "flush".
Snapshots are available on demand from NVRAM 803 over a logical path 807. The application or file system may read directly from production storage disk 805 (through the controller) over a logical path 809a, or optionally directly from NVRAM 803 over a logical path 809b. In the latter case, NVRAM 803 functions as a read cache.
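A minimal sketch of this read path, assuming simple dictionary stand-ins for NVRAM 803 and production storage 805: reads are served from NVRAM when the requested blocks are still cached there and otherwise fall through to the disk.

```python
# Hypothetical read-path sketch matching the description above: reads are
# served from NVRAM when the requested blocks are still cached there, and
# otherwise fall through to the production storage disk.

class ReadPath:
    def __init__(self, nvram_cache, production_storage):
        self.nvram = nvram_cache              # dict: lba -> data (path 809b)
        self.disk = production_storage        # dict: lba -> data (path 809a)

    def read(self, lba):
        if lba in self.nvram:                 # NVRAM acting as a read cache
            return self.nvram[lba]
        return self.disk.get(lba)             # fall back to production storage


rp = ReadPath(nvram_cache={5: b"recent"},
              production_storage={5: b"old", 9: b"cold"})
print(rp.read(5))   # b'recent'  (served from NVRAM)
print(rp.read(9))   # b'cold'    (served from production storage)
```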
In user mode an application 903 is running, which could be an accounting or transaction application. The application communicates with a file system driver 909 included in a stack of storage drivers 906. The CDP driver 910 communicates with a CDP API driver 907 that provides an abstraction layer for communication with a variety of persistent memory devices such as NVRAM, Flash memory, magnetic RAM, solid-state disks, and others. The CDP API driver 907 communicates with a specific NVRAM driver 908, also included within drivers 905.
When application 903 writes data via file system driver 909, the file system driver issues block-level write requests. The CDP driver 910 intercepts them and redirects them into persistent memory that serves as a write cache. The CDP driver diverts each write to the persistent memory via CDP API driver 907 and the specific NVRAM driver 908.
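The interception can be pictured with the following hypothetical sketch (the class names are placeholders, not the actual driver interfaces): block-level write requests are caught by a CDP layer and diverted into persistent memory acting as a write cache, after which completion is acknowledged.

```python
# Illustrative model of the interception described above: block-level write
# requests issued by the file system are intercepted by a CDP layer and
# diverted to persistent memory acting as a write cache. Names are hypothetical.

class PersistentMemoryAPI:
    """Stand-in for the abstraction layer over NVRAM/Flash/MRAM devices."""
    def __init__(self):
        self.cached_writes = []

    def cache_write(self, lba, data):
        self.cached_writes.append((lba, data))


class CDPFilter:
    def __init__(self, pm_api):
        self.pm_api = pm_api

    def handle_block_write(self, lba, data):
        # Instead of passing the request down to the disk driver, divert it
        # to persistent memory and acknowledge completion (write-back mode).
        self.pm_api.cache_write(lba, data)
        return "acknowledged"


cdp = CDPFilter(PersistentMemoryAPI())
print(cdp.handle_block_write(4096, b"journal entry"))   # "acknowledged"
```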
System 1001 is provided to back up system 1000 in the case of a production storage failure. System 1001 can be local, remote, or both. As a failover system, system 1001 may be located at a different physical site than the actual production unit, as is common practice in the field of data security. Uniform access to all snapshots, whenever created and wherever located, is provided on demand. In some embodiments access to snapshots may be explicitly blocked for security purposes or to prevent modifications, etc. One further advantage of the snapshot "expose and play" technique is that snapshots do not have to be data-copied from a backup snapshot to the production volume. This functionality enables, in some embodiments, co-existence of many full snapshots for a single data volume along with that volume's current state. All snapshots are writeable. The approach enables unlimited attempts to locate the correct point in time to roll the current volume state to. Each rollback attempt is reversible.
Much as previously described, system 1000, aided by SW 801, may send writes to NVRAM 803. In this example, system 1000 flushes snapshots from NVRAM 803 to production storage 805 through disk controller 804 over logical path 1002 and, at the same time, flushes snapshot copies to an on-disk log 1003 within backup system 1001. The backup system can be local or remote. Log 1003 further extends those snapshots to backup storage 1004. Snapshots are available via logical path 807 from NVRAM 803, or from on-disk log 1003. Log 1003 enables recovery of a snapshot and subsequent playing of logged events to determine whether any additional changed data logged between snapshots should be included in a data recovery task, such as rolling back to an existing snapshot.
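A hedged sketch of this recovery use of the log (all structures hypothetical): restore the nearest earlier snapshot, then replay the logged writes that fall between that snapshot and the desired point in time.

```python
# Hypothetical sketch of recovering to a point between snapshots: restore the
# nearest earlier snapshot, then replay logged writes up to the chosen time.

def recover_to_point_in_time(snapshots, log, target_time):
    """snapshots: list of (time, {lba: data}); log: list of (time, lba, data)."""
    base_time, image = max(
        ((t, img) for t, img in snapshots if t <= target_time),
        key=lambda pair: pair[0],
    )
    volume = dict(image)
    for t, lba, data in log:
        if base_time < t <= target_time:      # replay only the in-between writes
            volume[lba] = data
    return volume


snaps = [(10, {1: b"a"}), (20, {1: b"a", 2: b"b"})]
log = [(12, 3, b"c"), (25, 1, b"z")]
print(recover_to_point_in_time(snaps, log, target_time=15))  # {1: b'a', 3: b'c'}
```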
Backup storage 1004 may be any kind of disk drive, including SATA or PATA. If a failure event happens to system 1000, then system 1000 may, in one embodiment, automatically fail over to system 1001, and backup storage 1004 may then be used in place of the production storage disk, containing all of the current data. When system 1000 is brought back online, a fail-back may be initiated. The fail-back process enables re-creation of the most current production storage image without interrupting continuing operation of system 1000. In actual practice backup storage has lower performance than production storage for most systems; therefore, performance during the failover period may be slightly reduced, resulting in slower transactions.
The uniqueness of controller 1101 over current RAID controllers is the addition of NVRAM 1103, including all of the capabilities that have already been described. In this case, system 1100 uses production storage 1102, which may be a RAID array accessible to the host through SATA controller 1106. In this case, NVRAM 1103 is directly visible to the CDP API driver (not illustrated) as another type of NVRAM device. It is a hybrid solution in which the invented software is integrated with PCI-pluggable RAID controllers.
Both described server systems share a single production storage 1204 that is SAN-connected and accessible to both servers, which are also LAN-connected. This is an example of a high-availability scenario that can be combined with any or all of the other examples described previously. In this case the primary production server 1201 is backed up by secondary production server 1202. Production storage 1204 could be network-attached or local storage. In case of failure of primary server 1201, failover mechanism 1207 transfers the NVRAM 1205 content via a failover communication path over LAN 1203 to secondary FM 1208, using the standard TCP/IP protocol or any other appropriate protocol such as InfiniBand.
Third-party data replication software (RSW) 1305 is provided to server 1301 for the purpose of replicating all write data. RSW 1305 may be configured to replicate system activity according to a file-level protocol or a block-level protocol. Replicated data is uploaded onto network 1303 via a replication path directly to data protection (DP) appliance 1302. Appliance 1302 has a connection to the network and has the port circuitry 1306 to receive the replicated data from server 1301. The replicated data is written to NVRAM 1307 or another persistent memory device such as Flash memory or magnetic RAM. Multiple system snapshots 1310 are created and temporarily maintained as short-term snapshots before being flushed, as previously described further above. Simultaneously, data can be replicated to a remote location (not shown in this figure).
In this example, appliance 1302 functions as a host system described earlier in this specification. DP appliance 1302 has a backup storage disk 1308 in which long term snapshots 1309 are stored on behalf of server 1301. In this case, snapshots are available to system 1301 on demand by requesting them over network 1303. In case of a failover condition where system 1301 fails, DP appliance 1302 may recreate the system data set of PS 1304 near-instantaneously from the long term and short term snapshots. Server 1301 may experience some down time while it rolls back to a successful operating state. Unlike the previous example of failover mechanisms, DP appliance 1302 may not assume server functionality as it may be simultaneously protecting multiple servers. However, in another embodiment, DP appliance 1302 may be configured with some added SW to function as a full backup to server 1301 if desired.
If at act 1402 it is determined that the failure is not due to software, but the failure is due to a storage failure at act 1403, then at act 1405 the server switches to backup storage and resumes application activity.
If at act 1402 it is determined that the failure is due to a software failure, then at act 1406 the system first attempts recovery without rolling back to a previous system snapshot by calling application specific utilities. At act 1407 the system determines if the recovery attempt is successful. If the attempt proved successful at act 1407, then the process ends at act 1408 without requiring rollback. If at act 1407 it is determined that the recovery attempt is not successful, then at act 1409, the server performs a rollback to a last good system snapshot. Once the system is mounted with the new data and settings then the application is resumed at act 1410.
The process then resolves back to a determination at act 1410 of whether or not recovery was successful. If so, the process ends and no further action is required. If not, the process resolves back to another rollback operation and application restart until success is achieved.
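The decision flow of acts 1402 through 1410 can be summarized in the following illustrative sketch; the callables passed in are hypothetical placeholders for the system actions named in the text, not actual interfaces.

```python
# Illustrative control flow for acts 1402-1410 above; every callable here is a
# hypothetical placeholder, not an actual product interface.

def recover(failure_is_software, storage_failed,
            switch_to_backup, app_specific_recovery,
            rollback_to_earlier_snapshot, resume_and_check):
    if not failure_is_software and storage_failed:
        switch_to_backup()                         # act 1405: use backup storage
        return True
    if failure_is_software:
        if app_specific_recovery():                # acts 1406-1407
            return True                            # act 1408: no rollback needed
        while rollback_to_earlier_snapshot():      # act 1409: last good snapshot
            if resume_and_check():                 # act 1410: resume application
                return True
    return False


# Example run with trivial stand-ins: the first rollback attempt succeeds.
attempts = iter([True, False])
print(recover(True, False,
              switch_to_backup=lambda: None,
              app_specific_recovery=lambda: False,
              rollback_to_earlier_snapshot=lambda: next(attempts, False),
              resume_and_check=lambda: True))      # True
```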
Writes 1501 occur in the data block with volume offset 1500 in the most recent snapshot. The data block with volume offset 1500 may also exist in one of the previous snapshots (in snapshot S(4) in this example); however, a new page will be allocated in NVRAM in order to preserve snapshot S(4). The data in the block with volume offset number 2000 in S(1) may be different from or the same as the data written in the same block represented in S(2), or in S(4). The only hard commonalities between data blocks having the same offset numbers are the page size and the location in the volume. When appropriate, the oldest snapshot will be flushed out of NVRAM 1500 and onto a storage disk. This may happen, in one embodiment, incrementally as each snapshot is created once NVRAM is at a preset capacity of snapshots. A history grouping of several snapshots 1503 may be aggregated and presented as a single snapshot. The number of snapshots available to be ordered for viewing may be a configurable parameter. There are many different options available for size configuration, partial snapshot view ordering, and so on. For example, a system may only require the portions of a snapshot that are specific to volumes used by a certain application. Application views, file system views, and raw data block views may be ordered depending on need.
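A minimal, purely illustrative model of this allocate-on-write behaviour (names invented): a write to a volume offset that already has a page in an older snapshot is given a new page in the most recent snapshot, so the older snapshot, such as S(4), is preserved, and the oldest snapshot can later be flushed to disk.

```python
# Hypothetical sketch of allocate-on-write across snapshots: a write to an
# offset already present in an older snapshot gets a fresh page in the newest
# snapshot, leaving the older snapshot untouched.

class SnapshotChain:
    def __init__(self):
        self.snapshots = [{}]                 # list of {volume_offset: page}

    def take_snapshot(self):
        self.snapshots.append({})             # new, most recent snapshot

    def write(self, volume_offset, page):
        # Allocate the page in the newest snapshot only; earlier snapshots
        # holding the same offset (e.g. S(4) in the text) are left untouched.
        self.snapshots[-1][volume_offset] = page

    def read(self, volume_offset):
        for snap in reversed(self.snapshots): # most recent version wins
            if volume_offset in snap:
                return snap[volume_offset]

    def flush_oldest(self, disk):
        """When NVRAM reaches its snapshot capacity, flush the oldest snapshot."""
        disk.update(self.snapshots.pop(0))


chain = SnapshotChain()
chain.write(1500, b"v1")
chain.take_snapshot()
chain.write(1500, b"v2")                      # new page; earlier snapshot intact
print(chain.read(1500), chain.snapshots[0][1500])   # b'v2' b'v1'
```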
“NVRAM” >to> “Fast” disk/FS >to> “Slow” disk/FS >to> “Remote disk/FS”
This hierarchical approach is complementary and can be used separately from, or together with, the enhanced retention method. It is clear that many modifications and variations of this embodiment may be made by one skilled in the art without departing from the spirit of the novel art of this disclosure.
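The hierarchy above can be sketched as a simple demotion loop; the tier names follow the chain shown, while the capacities are invented here purely for illustration.

```python
# Illustrative sketch of the hierarchical retention chain shown above; each
# tier capacity is a made-up placeholder, not a real configuration value.

TIERS = ["NVRAM", "fast disk/FS", "slow disk/FS", "remote disk/FS"]
CAPACITY = {"NVRAM": 4, "fast disk/FS": 16, "slow disk/FS": 64}   # snapshots kept

def demote(storage):
    """Move the oldest snapshots down the hierarchy when a tier is over capacity."""
    for tier, next_tier in zip(TIERS, TIERS[1:]):
        while len(storage[tier]) > CAPACITY.get(tier, float("inf")):
            oldest = storage[tier].pop(0)
            storage[next_tier].append(oldest)


storage = {tier: [] for tier in TIERS}
for snapshot_id in range(8):                  # eight snapshots taken over time
    storage["NVRAM"].append(snapshot_id)
    demote(storage)
print(storage)   # the oldest snapshots have aged out of NVRAM into the fast tier
```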
The methods and apparatus of the present invention may be implemented using some or all of the described components and in some or all or a combination of the described embodiments without departing from the spirit and scope of the present invention. In various aspects of the invention any one or combination of the following features may be implemented:
The present invention claims priority to U.S. provisional patent application Ser. No. 60/705,227, filed on Aug. 3, 2005, entitled "On-Host Continuous Data Protection, Recovery, Heterogeneous Snapshots, Backup and Analysis", and to U.S. provisional patent application Ser. No. 60/708,911, filed on Aug. 17, 2005, entitled "Write performance optimization implemented by using a fast persistent memory to reorganize non-sequential writes to sets of sequential writes". The listed disclosures are incorporated herein by reference.