Differencing in a data replication appliance

Information

  • Patent Application
  • 20060206542
  • Publication Number
    20060206542
  • Date Filed
    March 14, 2005
  • Date Published
    September 14, 2006
Abstract
Methods and apparatus are provided for copying data from a primary storage facility to a secondary storage facility which reduce the workload on the storage controller in the primary facility and minimize bandwidth usage. The primary storage facility includes a primary data replication appliance which transfers data to a secondary replication appliance. Updated data from a host is both stored through a storage controller in the primary facility and also received by the primary replication appliance. Logic in the primary replication appliance determines whether the immediately previous version of the data is in a buffer from a previous storage operation. If so, the current (updated) version of the data is compared with the previous version and the difference, such as calculated through a bit-wise exclusive-OR operation, is transferred to the secondary replication appliance. The process is reversed in the secondary replication appliance and the recreated updated version of the data stored in the secondary facility.
Description
TECHNICAL FIELD

The present invention relates generally to transferring data over a data link and, in particular, to applying a differencing algorithm to reduce data link bandwidth usage during data transfers.


BACKGROUND ART

Remote copying of data is an integral part of disaster recovery for protecting critical data from loss and providing continuous data availability. In a disaster recovery support system, data write updates to a primary or central data store are reproduced at a secondary, remote site. The remote site is typically located at a distance from the primary data store if protection from natural disasters is a concern, but may be adjacent to the primary site if equipment failure is the main concern. In the event of a failure at the primary data store, the remote site can take over all operations, including data write updates, with confidence that no data has been lost. Later, after repair, the primary data store can be restored to the condition of the remote site and can resume all operations, including data write operations.


During remote copying, typically same-sized blocks of data are sent from the primary data store to the remote data site. In this way, data write updates at the primary data store are reproduced at the remote site so as to permit reconstruction of the data, including reconstruction of the exact sequence of data write updates that took place at the primary data store. This reproducibility can be especially important, for example, in a banking system or other transaction log system. Thus, data write updates at the primary data store are collected and are periodically sent to the remote site in a remote copy operation.


The various types of remote copy can require enormous amounts of bandwidth over the data lines between the primary data store and the remote site controller. For example, if a primary data store controller can support 20,000 input/output (I/O) operations per second, and if 50% of these operations are write operations, then the controller can handle 10,000 write operations per second. If each write update involves 4 K bytes, then bandwidth of 40 MB per second is required between the primary controller and the remote site controller. This is a significant amount of bandwidth to provide, given currently available pricing for data lines. Even though asynchronous remote copy can speed up write updates, it does not decrease the amount of bandwidth required.
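By way of illustration only, the arithmetic of the example above can be sketched in Python. The function and constant names are hypothetical and are not part of the described system, and 4 K bytes is treated here as 4,000 bytes so that the result agrees with the 40 MB per second figure given in the text.

```python
# Figures taken from the example in the text. BYTES_PER_WRITE assumes
# 4 K bytes means 4,000 bytes (an assumption made to match 40 MB/s).
IO_OPS_PER_SECOND = 20_000
WRITE_FRACTION = 0.5
BYTES_PER_WRITE = 4_000


def required_bandwidth(io_ops_per_second, write_fraction, bytes_per_write):
    """Return the bytes per second needed to copy every write update in full."""
    writes_per_second = io_ops_per_second * write_fraction
    return writes_per_second * bytes_per_write


bandwidth = required_bandwidth(IO_OPS_PER_SECOND, WRITE_FRACTION, BYTES_PER_WRITE)
print(f"{bandwidth / 1_000_000:.0f} MB per second")
```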


One proposed system which addresses the issue of bandwidth usage is presented in U.S. Pat. No. 6,327,671 entitled DELTA COMPRESSED ASYNCHRONOUS REMOTE COPY and assigned to the assignee of the present application (“the '671 patent”), which patent is incorporated herein by reference in its entirety. As illustrated in FIG. 1, the system disclosed therein provides a remote copy operation that copies data write updates from a primary data store to a secondary data store by identifying which bytes have changed and sending only the changed bytes from the primary data store to the secondary site. A data operation such as an exclusive-OR (XOR) logic operation can be used to identify the changed bytes. Many data storage systems include XOR facilities as part of their normal configuration, including systems that implement the well-known RAID-type data storage. The XOR operation is used in the '671 patent on the write updated block of data to be copied. Data compression can then be used on the XOR data block to delete the unchanged bytes, and then only the changed bytes are sent to the remote site. This reduces the amount of data being sent between the primary data store and the remote site, and reduces the bandwidth required between the sites. In this way, the remote copy system is said to provide remote copying without requiring a great deal of expensive bandwidth.


SUMMARY OF THE INVENTION

The present invention provides methods and apparatus for copying data from a primary storage facility to a secondary storage facility which reduce the workload on the storage controller in the primary facility and minimize bandwidth usage. The primary storage facility includes a primary data replication appliance which transfers data to a secondary replication appliance. Updated data from a host is both stored through a storage controller in the primary facility and also received by the primary replication appliance. Logic in the primary replication appliance determines whether the immediately previous version of the data is in a buffer from a prior storage operation. If so, the current (updated) version of the data is compared with the previous version and the difference, such as calculated through a bit-wise exclusive-OR operation, is compressed and transferred to the secondary replication appliance. The process is reversed in the secondary replication appliance and the recreated updated version of the data stored in the secondary facility.


While the primary appliance is performing these operations, the storage controller may be performing other tasks, including allowing access by the host. Therefore, the operations of the present invention, being performed in the primary replication appliance, do not adversely affect the workload of the storage controller or of the host. Additionally, because the difference between the previous version of the data and the current version, representing only bits which have changed, is typically highly compressible, when it is compressed and transmitted to the secondary replication appliance, bandwidth usage is reduced.


Other features and advantages of the present invention should be apparent from the following description of the preferred embodiments, which illustrates, by way of example, the principles of the invention.




BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a block diagram of a prior art remote copy system;



FIG. 2 is a block diagram of a data replication system of the present invention;



FIG. 3 is a flowchart of the operational steps performed by the primary replication appliance of the present invention; and



FIG. 4 is a flowchart of the operational steps performed by the secondary replication appliance of the present invention.




DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT


FIG. 2 is a block diagram of a data replication system 200 of the present invention. The system 200 includes a primary data facility 210 and a secondary data facility 250. Both facilities 210 and 250 include a storage controller 212, 252 coupled to a data storage device or array 214, 254. Both facilities 210, 250 of the present invention further include a replication appliance 220, 260, interconnected to each other by way of appropriate interfaces 221, 261 directly or through a network, represented by the connection 202. The storage controller 212 and the replication appliance 220 of the primary data facility 210 are operatively coupled through respective interfaces 213, 223 to a host device 204.


The primary and secondary replication appliances 220, 260 both include: a processor 222, 262; a memory 224, 264 for storage of software instructions to be executed on the processor 222, 262; and a first-in-first-out (FIFO) buffer 226, 266 for temporary storage of data. The appliances 220, 260 also include disk storage 228, 268. Logic 230, 270 for performing the operations of the present invention may reside in the memory 224, 264, in firmware or in a separate processor.


Referring to the flowchart of FIG. 3, in operation the host 204 sends data to the primary storage controller 212 to be stored in the primary storage device 214 (step 300). The primary replication appliance 220 also receives the data (step 302), either simultaneously from the host 204 or forwarded by the primary controller 212, and ultimately sends a copy of the data over the connection 202 (step 306) for duplicate storage by the secondary replication appliance 260 (step 308). During the process, the primary appliance 220 temporarily stores the data in the buffer 226 (step 304). As other data is received from the host 204, it is also temporarily stored in the buffer 226. When an update to previously stored data is transmitted by the host 204 (step 310), it is both stored by the primary storage controller 212 (step 312) and received by the primary replication appliance 220 (step 314). Under the direction of the logic 230, the primary appliance 220 determines whether the immediately previous version of the data being updated is still in the buffer 226 (step 316). Depending on the size of the buffer 226 and the amount of time that has passed since the previous version of the data was received, the previous version of the data may not yet have been overwritten in the buffer 226 by other, more recent data. If the previous version of the data is no longer present, the logic 230 directs that the current version be transferred to the secondary replication appliance 260 (step 318).
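By way of illustration only, the decision made at step 316 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the described implementation: the class name PrimaryAppliance, the method handle_write, and the block_id used to match an update to its previous version are all hypothetical, the buffer capacity is counted in entries, and only the most recent version of each block is retained. An OrderedDict stands in for the FIFO buffer 226.

```python
from collections import OrderedDict


class PrimaryAppliance:
    """Hypothetical stand-in for the primary appliance and its buffer."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumed: number of blocks the buffer holds
        self.buffer = OrderedDict()   # block_id -> most recent version, oldest first

    def handle_write(self, block_id, current):
        """Return ("difference", bytes) or ("full", bytes) for transfer."""
        previous = self.buffer.pop(block_id, None)   # step 316: still buffered?
        if previous is not None and len(previous) == len(current):
            # Previous version present: send only the bit-wise XOR difference.
            payload = ("difference", bytes(p ^ c for p, c in zip(previous, current)))
        else:
            # Previous version aged out (or never seen): send the whole block.
            payload = ("full", current)
        self.buffer[block_id] = current              # step 304: buffer the new version
        while len(self.buffer) > self.capacity:
            self.buffer.popitem(last=False)          # oldest entry is displaced first
        return payload


appliance = PrimaryAppliance(capacity=2)
first = appliance.handle_write(7, b"AAAA")    # nothing buffered yet
second = appliance.handle_write(7, b"AABA")   # previous version still buffered
appliance.handle_write(8, b"CCCC")
appliance.handle_write(9, b"DDDD")            # displaces block 7 from the buffer
third = appliance.handle_write(7, b"AACA")    # previous version no longer present
```

In this sketch the second write yields a difference because the first version is still buffered, while the third write falls back to a full transfer because two intervening writes have displaced block 7.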


If the previous version of the data being updated is still present, the logic 230 directs that it be compared with the current (updated) version (step 320). Because of the easy availability and reversibility of an exclusive-OR (XOR) function, preferably a bit-wise XOR operation is performed on the two sets of data to calculate a “difference” (step 322). Other mathematical transformations, or sets of complementary transformations, may be substituted for the XOR operation. An appropriate other set of two transformations includes a first transformation T1 which is performed on the current version of the data and the immediately previous version of the data to generate a new version D. A second, complementary, transformation T2 is subsequently applied to the new version D and the immediately previous version to generate (recreate) the current version. A further property of the second transformation T2 is that, when it is applied to the new version D and the current version, the immediately previous version should be generated (recreated). When the bit-wise XOR operation is used, both transformations T1 and T2 are the same. As used herein, the term “difference” will refer to the new version D which is the result of any such transformation T1.
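By way of illustration only, the two properties required of the transformations T1 and T2 can be checked for the bit-wise XOR case with a short Python sketch. The helper name xor_blocks and the sample data are hypothetical, and the sketch assumes that both versions of the data have the same length.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bit-wise XOR of two equal-length blocks (serves as both T1 and T2)."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


previous = b"balance=0100"
current = b"balance=0175"

# T1: current version and previous version -> difference D
difference = xor_blocks(current, previous)

# T2 applied to D and the previous version recreates the current version.
recreated_current = xor_blocks(difference, previous)

# T2 applied to D and the current version recreates the previous version.
recreated_previous = xor_blocks(difference, current)
```

Only the last two bytes differ between the two versions, so every other byte of the difference is zero.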


Once generated or calculated, the difference is compressed (step 324) to reduce its size. Subsequently, the compressed difference is transferred to the secondary replication appliance 260 (step 328). While the primary appliance 220 is performing these operations, the primary storage controller 212 may be performing other tasks, including allowing access by the host 204. Therefore, the operations of the present invention, being performed in the replication appliance 220, do not adversely affect the workload of the primary storage controller 212 or of the host 204. Additionally, because the typically highly compressible difference between the previous version of the data and the current version, representing only bits which have changed, is compressed and transmitted to the secondary replication appliance 260, bandwidth usage is reduced.
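By way of illustration only, the effect of compressing the difference rather than the data itself can be sketched in Python. The description does not name a compression algorithm; zlib is used here purely as an assumption, and the block contents, block size and helper name xor_blocks are likewise hypothetical.

```python
import hashlib
import zlib


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bit-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Deterministic 4,096-byte block that looks random to a compressor
# (128 SHA-256 digests of 32 bytes each). Illustrative test data only.
previous = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(128))

# The update changes four bytes of the block.
updated = bytearray(previous)
for offset in range(100, 104):
    updated[offset] ^= 0xFF
current = bytes(updated)

difference = xor_blocks(current, previous)          # step 322
compressed_difference = zlib.compress(difference)   # step 324
compressed_current = zlib.compress(current)         # full block, for comparison

print(len(current), len(compressed_current), len(compressed_difference))
```

Because the difference consists almost entirely of zero bytes, it compresses far better than the updated block, which in this sketch contains data with little redundancy.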


At the direction of the logic 230, the difference may be encrypted to improve security before being transferred (step 326). If the previous version of the data was not still in the buffer 226 and the unprocessed current version is to be transferred to the secondary appliance 260, it too may be compressed, encrypted or both before being transferred.


Referring to the flowchart of FIG. 4, when the difference is received by the secondary replication appliance 260 (step 400), it is decrypted (step 402) if necessary and decompressed (step 404). The secondary appliance, under the direction of the logic 270, determines whether the previous version of the data is still in the buffer 266 (step 406). If so, the logic directs that the difference and the previous version of the data be compared (step 408), again preferably using a bit-wise XOR to reverse the previous process and recreate the current (updated) version of the data (step 410). The recreated current version of the data is then stored, through the secondary storage controller 252, in the secondary storage 254 (step 412).


If the previous version of the data is not still in the buffer 266 of the secondary replication appliance 260, it must be retrieved from the secondary storage 254 (step 414) before it can be compared with the difference (step 408) to recreate the current version (step 410) and be stored (step 412). If the secondary appliance employs a read-before-write function when storing data, the previous version of the data will first be read from the storage 254 and temporarily stored in a nonvolatile memory, which may include the buffer 266. Only after the previous version is hardened in this fashion will the recreated current version overwrite the previous version in the secondary storage 254. Thus, if the overwrite fails, the previous version is still available for recovery. Consequently, no extra step is required by the secondary appliance 260 to obtain the previous version of the data before recreating the current version.
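By way of illustration only, the receiving side of FIG. 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the class name SecondaryAppliance, the method apply_difference and the block_id are hypothetical, a dictionary stands in for the secondary storage 254, the buffer is unbounded, decryption and decompression (steps 402 and 404) are omitted, and the sketch assumes that each recreated version is placed in the buffer for use by the next update.

```python
class SecondaryAppliance:
    """Hypothetical stand-in for the secondary appliance."""

    def __init__(self, storage):
        self.storage = storage   # stands in for secondary storage: block_id -> data
        self.buffer = {}         # stands in for the secondary buffer (unbounded here)

    def apply_difference(self, block_id, difference):
        """Recreate and store the current version; report where the old one came from."""
        previous = self.buffer.get(block_id)        # step 406: still buffered?
        source = "buffer"
        if previous is None:
            previous = self.storage[block_id]       # step 414: read from storage
            source = "storage"
        # Steps 408 and 410: XOR the difference with the previous version.
        current = bytes(d ^ p for d, p in zip(difference, previous))
        self.storage[block_id] = current            # step 412: overwrite old version
        self.buffer[block_id] = current             # assumed: keep for the next update
        return current, source


storage = {7: b"AAAA"}
secondary = SecondaryAppliance(storage)
first = secondary.apply_difference(7, b"\x00\x00\x03\x00")
second = secondary.apply_difference(7, b"\x00\x00\x01\x00")
```

In this sketch the first difference is applied to a previous version read from storage, and the second is applied to the version left in the buffer by the first.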


In addition to the savings in bandwidth, the present invention beneficially permits the development and deployment of one logic to direct the process described herein, regardless of what type of storage controller the replication appliance is used with.


It is important to note that while the present invention has been described in the context of a fully functioning system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media such as a floppy disk, a hard disk drive, a RAM, and CD-ROMs and transmission-type media such as digital and analog communication links.


The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated. Moreover, although described above with respect to methods and systems, the need in the art may also be met with a computer program product containing instructions for copying updated data from a primary storage facility to a secondary storage facility or a method for deploying computing infrastructure comprising integrating computer readable code into a computing system for performing the same.

Claims
  • 1. A method for copying updated data from a primary storage facility to a secondary storage facility, comprising: receiving a current version of data in a primary data replication appliance from a host substantially simultaneously with receipt of the current version of the data in a primary storage controller; determining if an immediately previous version of the data is stored in a first buffer in the primary data replication appliance; if the immediately previous version of the data is stored in the first buffer: calculating a difference between the current version of the data and the immediately previous version of the data; and transferring the calculated difference from the primary data replication appliance to a secondary data replication appliance; and if the immediately previous version of the data is not stored in the first buffer, transferring the current version of the data to the secondary data replication appliance.
  • 2. The method of claim 1, further comprising: receiving the difference from the primary data replication appliance at the secondary data replication appliance; determining if an immediately previous version of the data is stored in a second buffer in the secondary data replication appliance; if the immediately previous version of the data is stored in the second buffer, reading the immediately previous version of the data from the second buffer; if the immediately previous version of the data is not stored in the second buffer, reading the immediately previous version of the data from storage at the secondary data replication appliance; recreating the current version of the data in the secondary data replication appliance from the immediately previous version of the data and the received difference; and storing the current version of the data at a secondary storage controller attached to the secondary data replication appliance.
  • 3. The method of claim 1, wherein calculating the difference between the current version of the data and the immediately previous version of the data comprises applying a bitwise exclusive-OR to the current version and the immediately previous version.
  • 4. The method of claim 1, further comprising: receiving the calculated difference from the primary data replication appliance at the secondary data replication appliance; reading the immediately previous version of the data from storage at the secondary data replication appliance; storing the immediately previous version of the data in a non-volatile memory in the secondary data replication appliance; calculating the current version of the data from the immediately previous version of the data and the received difference; and storing the current version of the data at a secondary storage controller attached to the secondary data replication appliance.
  • 5. The method of claim 4, wherein storing the current version of the data comprises overwriting the immediately previous version of the data.
  • 6. The method of claim 4, wherein: calculating the difference between the current version of the data and the immediately previous version of the data comprises applying a bitwise exclusive-OR to the current version of the data and the immediately previous version of the data; and calculating the current version of the data from the immediately previous version of the data and the received difference comprises applying a bitwise exclusive-OR to the difference and the immediately previous version of the data.
  • 7. The method of claim 1, further comprising compressing the difference before transferring the difference to the secondary data replication appliance.
  • 8. The method of claim 1, further comprising encrypting the difference before transferring the difference to the secondary data replication appliance.
  • 9. A primary data replication appliance, comprising: a first interface for receiving a current version of the data from a host device, the current version of the data being received substantially simultaneously in a primary storage controller; a first buffer for temporary storage of the current version of the data; means for determining if an immediately previous version of the current version of the data is stored in the first buffer; logic for calculating a difference between the current version of the data and the immediately previous version of the data; and a second interface for transferring the calculated difference to a secondary data replication appliance.
  • 10. The appliance of claim 9, wherein the calculating logic comprises logic for performing a bitwise exclusive-OR on the current version of the data and the immediately previous version of the data.
  • 11. The appliance of claim 9, further comprising means for compressing the difference before transferring the calculated difference to the secondary data replication appliance.
  • 12. The appliance of claim 9, further comprising means for encrypting the difference before transferring the calculated difference to the secondary data replication appliance.
  • 13. The appliance of claim 9, wherein the secondary data replication appliance comprises: means for receiving the calculated difference from the primary data replication appliance; a second buffer; means for reading the immediately previous version of the data from the second buffer if the immediately previous version of the data is stored in the second buffer; means for reading the immediately previous version of the data from a storage device if the immediately previous version of the data is not stored in the second buffer; logic for recreating the current version of the data from the transferred difference and the immediately previous version of the data; and means for storing the recreated current version of the data on the storage device.
  • 14. The appliance of claim 13, wherein: the calculating logic comprises bitwise exclusive-OR logic; and the recreating logic comprises bitwise exclusive-OR logic.
  • 15. The appliance of claim 9, wherein the secondary data replication appliance comprises: a third interface for receiving the calculated difference from the primary data replication appliance; a non-volatile memory; means for reading the immediately previous version of the data from a storage device and storing the previous version of the data in the non-volatile memory; logic for recreating the current version of the data from the transferred difference and the immediately previous version of the data; and storing the recreated current version of the data in the storage device.
  • 16. The appliance of claim 15, wherein the secondary data replication appliance further comprises logic for performing a read-before-write operation on the immediately previous version of the data stored on the storage device, whereby the immediately previous version of the data is overwritten by the recreated current version of the data.
  • 17. A system for copying data from a primary storage facility to a secondary storage facility, comprising: a primary data replication appliance, comprising: a first interface for receiving a current version of the data from a host device, the current version of the data being received substantially simultaneously in a primary storage controller; a first buffer for temporary storage of the current version of the data; means for determining if an immediately previous version of the current version of the data is stored in the first buffer; logic for calculating a difference between the current version of the data and the immediately previous version of the data if the immediately previous version of the current version of the data is stored in the first buffer; and a second interface; and a secondary data replication appliance, comprising: a third interface operatively coupled to the second interface of the primary data replication appliance to receive the calculated difference; a second buffer; means for reading the immediately previous version of the data from the second buffer if the immediately previous version of the data is stored in the second buffer; means for reading the immediately previous version of the data from a storage device if the immediately previous version of the data is not stored in the second buffer; logic for recreating the current version of the data from the transferred difference and the immediately previous version of the data; and means for storing the recreated current version of the data on the storage device.
  • 18. The system of claim 17, the primary data replication appliance further comprising means for compressing the difference before transferring the calculated difference to the secondary data replication appliance.
  • 19. The system of claim 17, the primary data replication appliance further comprising means for encrypting the difference before transferring the calculated difference to the secondary data replication appliance.
  • 20. The system of claim 18, wherein: the calculating logic comprises bitwise exclusive-OR logic; and the recreating logic comprises bitwise exclusive-OR logic.
  • 21. The system of claim 17, wherein the secondary data replication appliance further comprises: a non-volatile memory; means for storing the read previous version of the data in the non-volatile memory; and logic for performing a read-before-write operation on the immediately previous version of the data stored on the storage device, whereby the immediately previous version of the data is overwritten by the recreated version of the data.
  • 22. A computer program product of a computer readable medium usable with a programmable computer, the computer program product having computer-readable code embodied therein for copying updated data from a primary storage facility to a secondary storage facility, the computer-readable code comprising instructions for: receiving a current version of the data in a primary data replication appliance from a host substantially simultaneously with receipt of the current version of the data in a primary storage controller; determining if an immediately previous version of the data is stored in a first buffer in the primary data replication appliance; if the immediately previous version of the data is stored in the first buffer: calculating a difference between the current version of the data and the immediately previous version of the data; and transferring the calculated difference from the primary data replication appliance to a secondary data replication appliance; and if the immediately previous version of the data is not stored in the first buffer, transferring the current version of the data to the secondary data replication appliance.
  • 23. The computer program product of claim 22, further comprising instructions for: receiving the difference from the primary data replication appliance at the secondary data replication appliance; determining if an immediately previous version of the data is stored in a second buffer in the secondary data replication appliance; if the immediately previous version of the data is stored in the second buffer, reading the immediately previous version of the data from the second buffer; if the immediately previous version of the data is not stored in the second buffer, reading the immediately previous version of the data from storage at the secondary data replication appliance; recreating the current version of the data in the secondary data replication appliance from the immediately previous version of the data and the received difference; and storing the current version at a secondary storage controller attached to the secondary data replication appliance.
  • 24. The computer program product of claim 22, wherein the instructions for calculating the difference between the current version of the data and the immediately previous version of the data comprise instructions for applying a bitwise exclusive-OR to the current version of the data and the immediately previous version of the data.
  • 25. The computer program product of claim 22, further comprising instructions for: receiving the calculated difference from the primary data replication appliance at the secondary data replication appliance; reading the immediately previous version of the data from storage at the secondary data replication appliance; storing the immediately previous version of the data in a non-volatile memory in the secondary data replication appliance; calculating the current version of the data from the immediately previous version of the data and the received difference; and storing the current version of the data at the second facility.
  • 26. The computer program product of claim 25, wherein the instructions for storing the current version of the data comprise instructions for overwriting the immediately previous version of the data.
  • 27. The computer program product of claim 25, wherein: the instructions for calculating the difference between the current version of the data and the immediately previous version of the data comprise instructions for applying a bitwise exclusive-OR to the current version of the data and the immediately previous version of the data; and the instructions for calculating the current version of the data from the immediately previous version of the data and the received difference comprise instructions for applying a bitwise exclusive-OR to the difference and the immediately previous version of the data.
  • 28. The computer program product of claim 22, further comprising instructions for compressing the difference before transferring the difference to the secondary data replication appliance.
  • 29. The computer program product of claim 22, further comprising instructions for encrypting the difference before transferring the difference to the secondary data replication appliance.
  • 30. A method for deploying computing infrastructure, comprising integrating computer readable code into a computing system, wherein the code, in combination with the computing system, is capable of performing the following: receiving a current version of the data in a primary data replication appliance from a host substantially simultaneously with receipt of the current version of the data in a primary storage controller; determining if an immediately previous version of the data is stored in a first buffer in the primary data replication appliance; if the immediately previous version of the data is stored in the first buffer: calculating a difference between the current version of the data and the immediately previous version of the data; and transferring the calculated difference from the primary data replication appliance to a secondary data replication appliance; and if the immediately previous version of the data is not stored in the first buffer, transferring the current version of the data to the secondary data replication appliance.
  • 31. The method of claim 30, further comprising: receiving the difference from the primary data replication appliance at the secondary data replication appliance; determining if an immediately previous version of the data is stored in a second buffer in the secondary data replication appliance; if the immediately previous version of the data is stored in the second buffer, reading the immediately previous version of the data from the second buffer; if the immediately previous version of the data is not stored in the second buffer, reading the immediately previous version of the data from storage at the secondary data replication appliance; recreating the current version of the data in the secondary data replication appliance from the immediately previous version of the data and the received difference; and storing the current version of the data at a secondary storage controller attached to the secondary data replication appliance.
  • 32. The method of claim 30, wherein calculating the difference between the current version and the immediately previous version of the data comprises applying a bitwise exclusive-OR to the current version of the data and the immediately previous version of the data.
  • 33. The method of claim 30, further comprising: receiving the calculated difference from the primary data replication appliance at the secondary data replication appliance; reading the immediately previous version of the data from storage at the secondary data replication appliance; storing the immediately previous version of the data in a non-volatile memory in the secondary data replication appliance; calculating the current version of the data from the immediately previous version of the data and the received difference; and storing the current version of the data at the second facility.
  • 34. The method of claim 33, wherein storing the current version of the data comprises overwriting the immediately previous version of the data.
  • 35. The method of claim 33, wherein: calculating the difference between the current version of the data and the immediately previous version of the data comprises applying a bitwise exclusive-OR to the current version of the data and the immediately previous version of the data; and calculating the current version of the data from the immediately previous version of the data and the received difference comprises applying a bitwise exclusive-OR to the difference and the immediately previous version of the data.
  • 36. The method of claim 30, further comprising compressing the difference before transferring the difference to the secondary data replication appliance.
  • 37. The method of claim 30, further comprising encrypting the difference before transferring the difference to the secondary data replication appliance.