1. Field
The present disclosure relates to copy and/or data management operations in a computer network and, in particular, to systems and methods for performing data replication in a storage management system.
2. Description of the Related Art
Computers have become an integral part of business operations such that many banks, insurance companies, brokerage firms, financial service providers, and a variety of other businesses rely on computer networks to store, manipulate, and display information that is constantly subject to change. Oftentimes, the success or failure of an important transaction may turn on the availability of information that is both accurate and current. Accordingly, businesses worldwide recognize the commercial value of their data and seek reliable, cost-effective ways to protect the information stored on their computer networks.
Many approaches to protecting data involve creating a copy of the data, such as backing up and/or replicating data on one or more storage devices. Data shadowing and mirroring, or duplexing, provide for copying but can require substantial amounts of time, processing power and/or storage space, especially for large databases. Moreover, such storage management systems can have a significant adverse impact on the performance of the source or primary system.
To address these drawbacks, certain systems perform replication operations that copy less than an entire volume of data to a desired location. For example, differential replication operations are used to copy all files that have changed since a last full replication of the data. Moreover, incremental replication operations can be used to copy all files that have changed since the most recent full, differential or incremental replication. These techniques, however, can require a significant amount of processing power or network bandwidth, especially when dealing with changes to relatively large files or databases.
In certain embodiments, the present disclosure relates to a method for performing data replication. The method includes performing an assessment on first data stored on a first storage device and second data stored on a second storage device, where at least a portion of the second data was previously replicated from the first data. The assessment includes comparing one or more attributes of files in the first data with those of corresponding files in the second data, and identifying a file having at least one of the one or more attributes different in the first and second data. The method further includes comparing the size of the identified file with a selected threshold value. If the size of the identified file is less than or equal to the selected threshold value, the identified file is replicated from the first storage device to the second storage device. If the size of the identified file is greater than the selected threshold value: checksums are obtained for the identified file in the first data and its corresponding file in the second data; the checksums are compared; if the checksums are different, the identified file is replicated from the first storage device to the second storage device; and if the checksums are the same, the one or more different attributes of the identified file in the first data and the corresponding file in the second data are synchronized, and the identified file is not replicated.
In certain embodiments, the one or more attributes comprise one or more attributes obtainable from metadata. In certain embodiments, the one or more attributes obtainable from metadata comprise at least one attribute selected among file size, file creation time, file modification time, or file access time.
In certain embodiments, the selected threshold value is obtained based on one or more storage policies. In certain embodiments, the one or more storage policies comprise assignment of the selected threshold value based on one or more of type of communication network between the first and second systems, available network resource, or assigned priority.
In certain embodiments, the size of the identified file is selected based on a size of a data block, one or more of the data blocks constituting the identified file.
In certain embodiments, the obtaining of checksums comprises calculating checksums for each of one or more data blocks associated with the identified file and the corresponding file. In certain embodiments, the replicating the identified file comprises replicating only one or more data blocks of the identified file whose checksums are different from those of the corresponding file.
In certain embodiments, the present disclosure relates to a data replication system having a data storage system configured to store replication of at least a portion of data from a client system. The client system is capable of communicating with the data storage system to facilitate transfer of data therebetween. The system further includes a replication agent in communication with the client system and the data storage system and configured to obtain information about an identified file on the client system. The identified file has at least one metadata attribute that is different from that of an existing replicated copy of the identified file on the data storage system. The replication agent is further configured to: obtain a size of the identified file; compare the size of the identified file with a threshold value; if the size is less than or equal to the threshold value, replicate the identified file so as to replace or update the existing replicated copy of the identified file; and if the size is greater than the threshold value, (1) obtain and compare checksums of the identified file and the replicated file, and (2) replicate the identified file so as to replace or update the existing replicated copy of the identified file if the checksums are different.
In certain embodiments, the replication agent is further configured to reconcile the metadata difference between the identified and replicated files but not replicate the identified file if the checksums of the identified file and the replicated file are the same.
In certain embodiments, the threshold value is obtained based on one or more storage policies, a type of communication network between the client system and the data storage system, one or more network resources associated with the communication network, or a priority assigned to the replication agent.
In certain embodiments, the threshold value comprises 256 kilobytes, 2 megabytes, or another operating-system dependent value.
In certain embodiments, the system further includes a user interface configured to receive user input indicative of the threshold value.
In certain embodiments, the replication agent comprises a software application executable on the client system.
In certain embodiments, the present disclosure relates to a replication system having means for identifying a first file in a first system based on a comparison of one or more attributes of the first file and a second file on a second system, with the second file representing an existing replicated copy of the first file. The system further includes means for comparing the size of the identified file with a threshold value. The system further includes means for determining whether to replicate one or more blocks of the first file to the second file again based at least in part on the size comparison.
In certain embodiments, the means for determining includes means for obtaining and comparing an assessment of contents of the first and second files, and means for selectively replicating data blocks of the first file to the second file based on the assessment of the contents. In certain embodiments, the assessment of contents of the first and second files comprises a calculation of checksums of the first and second file.
For purposes of summarizing the disclosure, certain aspects, advantages and novel features of the inventions have been described herein. It is to be understood that not necessarily all such advantages may be achieved in accordance with any particular embodiment of the invention. Thus, the invention may be embodied or carried out in a manner that achieves or optimizes one advantage or group of advantages as taught herein without necessarily achieving other advantages as may be taught or suggested herein.
As disclosed herein, certain systems and methods are provided for data replication. In particular, embodiments of the invention are capable of performing replication of data from a source system to a destination system.
In the description herein, various features and examples are described in the context of data replication. It will be understood that such features and concepts can be applied to various forms of data storage and recovery systems. Accordingly, it will be understood that “replication” can include any processes or configurations where some data representative of a file in a source system is stored in or copied to a destination system, such that the source file can be restored from the representative data in the destination system. Such representative data can include, for example, a mirror-image data file, a backup format file, etc.
The features of the systems and methods will now be described with reference to the drawings summarized above. Throughout the drawings, reference numbers may be re-used to indicate correspondence between referenced elements. The drawings, associated descriptions, and specific implementation are provided to illustrate embodiments of the invention and not to limit the scope of the disclosure.
In addition, methods and functions described herein are not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state.
In certain embodiments, the replication agent 102 can be any computing device and/or software module that coordinates the transfer of data between the source 104 and destination 106 systems. In certain embodiments, the replication agent 102 can be a software application residing and/or executing on the source system 104, and configured to communicate with an application residing and/or executing on the destination system 106. The application on the destination system 106 can be configured to process data replicated from the source system 104 and provide information about such data to the replication agent 102.
In certain embodiments, the replication agent 102 does not necessarily need to reside and/or execute on the source system 104. Given appropriate information about data on the source system 104 and the destination system 106, the replication agent 102 can provide similar functionalities even when residing and/or executing elsewhere, such as on the destination system 106.
In certain embodiments, the source and destination systems 104 and 106 can be parts of different devices. In certain embodiments, the source system 104 and the destination system 106 can be part of the same computing device, where it may be desirable to replicate data from one system to another.
In certain embodiments, the source system 104 of
In certain embodiments, the source systems can include a stand-alone computing system such as a laptop computer 110. The example stand-alone computing system 110 can include a processor 114 configured to execute a number of software applications, including a replication agent 112. In the example system 110, data to be replicated can reside in one or more storage devices (not shown) inside of the computer's housing and/or connected to the laptop in a known manner.
In certain embodiments, the source systems can include a workstation system 120 having a processor 124 configured to execute a number of software applications, including a replication agent 122. In the example system 120, data to be replicated can reside in one or more storage devices 126 associated with the system 120.
In certain embodiments, replicated data can be structured and organized so as to facilitate easy retrieval if needed. For example, each client's stored data can be organized in a file structure 160 representative of the client system's file structure. Such data organization and coordination by the processor 142 can be achieved in known manners.
As is generally known, replication of a given client's data can begin with a full replication process. Subsequently, the stored data can be updated by replicating selected portions of the data, such as one or more data blocks. In certain embodiments, such selected replication can be based on some change in the data. For example, identification and replication of files can be based on readily obtainable attributes such as creation time, modification time, and access time. In many file systems, such attributes can be part of metadata associated with files. Such selected replication can reduce expenditure of computing and/or network resources by not replicating files that have not changed.
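By way of non-limiting illustration, the attribute-based identification described above can be sketched as follows. The sketch assumes that both the source file and its replicated copy are reachable through a local file system path; in practice, the destination attributes might instead be reported by the destination system. The function name and the attribute tuple are hypothetical and are not part of the disclosure. Creation time is omitted because it is not exposed portably by the call used here.

```python
import os

# Hypothetical attribute set: field names of os.stat_result.
# Size and modification time are used because they are available on all platforms.
ATTRIBUTES = ("st_size", "st_mtime_ns")


def differing_attributes(source_path, replica_path, attributes=ATTRIBUTES):
    """Return the names of the metadata attributes that differ between two files."""
    src = os.stat(source_path)
    dst = os.stat(replica_path)
    return [name for name in attributes
            if getattr(src, name) != getattr(dst, name)]
```

An empty result indicates that, as far as the chosen attributes can show, the file has not changed and need not be assessed further.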
In certain circumstances, a replication process can be made more efficient and reliable overall by performing an additional assessment of a given file that has undergone a first assessment (e.g., the foregoing assessment based on readily obtainable file attribute(s)). Examples of such circumstances are described herein in greater detail.
In block 174, a second assessment can be performed on the file identified in block 172. In certain embodiments, the second assessment can include determination of whether to obtain further information about the file. In situations where obtaining such information utilizes significant computing resources, the second assessment can reduce unnecessary expenditure of resources. For example, if the second assessment determines that the resource-consuming information is not needed or desired, expenditure of significant resources can be avoided.
Based on the second assessment of block 174, the process 170 can, in a decision block 176, determine whether the file should be replicated. If the answer is “Yes,” the process 170 can replicate the file in block 178. If the answer is “No,” the process 170 can determine that the file should not be replicated in block 179.
In a decision block 184, the process 180 can determine whether the attributes compared in block 182 are the same. If “Yes,” the process 180 can determine that the source file should not be replicated in block 196. If “No,” the process 180 in a decision block 186 can determine whether the file size (e.g., for the source file) is less than a selected threshold value. In certain embodiments, the threshold value can be selected based on balancing of computing and/or bandwidth resource expenditures associated with determination of a second set of one or more attributes (for the source file and the replicated file) and possible replication thereafter, versus direct replication regardless of the second set of attribute(s). For example, if a given file is relatively small, it may be more efficient overall to simply send the file than to subject the file to further assessment. In another example, if a given file is relatively large, it may be worthwhile to further determine whether to send the file before committing significant bandwidth resources. In certain embodiments, the threshold value can be based on a data block size such as 256 KB or 2 MB. In certain embodiments, a data block size that can be used as a threshold value can depend on the operating system in which the process 180 is being performed.
Thus, if the answer to the decision block 186 is “Yes,” the process 180 can bypass further assessment and replicate the source file in block 194. If the answer is “No,” the process 180 can obtain a second set of one or more attributes for the source and replicated files in block 188. In block 190, the second sets of attribute(s) for the source and replicated files can be compared.
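The balancing that underlies the threshold value of decision block 186 can be illustrated with a simple cost model. The model below is an assumption introduced for illustration only and is not taken from the disclosure: direct replication is taken to cost size / bandwidth, while checksumming first is taken to cost a fixed exchange overhead, plus size / hash rate, plus the transfer cost for the fraction of files that turn out to have changed. All parameter names are hypothetical.

```python
def break_even_size(bandwidth_bytes_per_s, hash_rate_bytes_per_s,
                    unchanged_fraction, fixed_overhead_s):
    """Return the file size in bytes above which checksumming first is
    expected to be cheaper than sending directly, or None if it never is.

    Assumed model (illustrative only):
      direct   = size / bandwidth
      checksum = fixed_overhead + size / hash_rate
                 + (1 - unchanged_fraction) * size / bandwidth
    """
    # Expected time saved per byte by checksumming instead of sending directly.
    saving_per_byte = (unchanged_fraction / bandwidth_bytes_per_s
                       - 1.0 / hash_rate_bytes_per_s)
    if saving_per_byte <= 0:
        return None  # hashing costs at least as much as it is expected to save
    return fixed_overhead_s / saving_per_byte
```

Under this model, a slower network or a larger share of files whose contents are unchanged lowers the break-even size, which is consistent with sending small files directly and assessing large files further.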
For the purpose of description, it will be understood that “checksum” (sometimes referred to as “hash sum”) can include any datum or data computed from a block of digital data to facilitate detection of errors that may be introduced during replication and/or storage. In the context of replication systems, integrity of data associated with a given block (e.g., a 256 KB block) of a replicated file can be checked by computing the checksum and comparing it with the checksum of the same data block of the source file. In the context of files, a given file can include one or more data blocks. Thus, checksums of the source and replicated files can be compared on a block-by-block basis. Non-limiting examples of checksums can include known algorithms such as rolling checksum (also sometimes referred to as rolling hash function) for block sizes between about 256 KB and 200 MB, and MD5 cryptographic hash function algorithm for larger block sizes.
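A block-by-block checksum computation of the kind described above can be sketched as follows. The sketch assumes, for simplicity, that MD5 is used for every block regardless of block size, whereas the description contemplates a rolling checksum for some block sizes. The function name and the default block size constant are hypothetical.

```python
import hashlib

BLOCK_SIZE = 256 * 1024  # 256 KB, one of the example block sizes in the text


def block_checksums(stream, block_size=BLOCK_SIZE):
    """Return a list of MD5 hex digests, one per block read from a binary stream."""
    digests = []
    while True:
        block = stream.read(block_size)
        if not block:  # end of stream
            break
        digests.append(hashlib.md5(block).hexdigest())
    return digests
```

The lists produced for the source file and the replicated file can then be compared element by element, so that a mismatch at a given position identifies the data block that differs.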
If the checksums do not match, there is a high likelihood that the data was altered. On the other hand, if the checksums match, it is highly likely that integrity of the data is maintained (e.g., by being substantially error-free). In certain replication situations, comparison of checksums in the foregoing manner can be sufficiently reliable so as to override differences in one or more of the first set of attributes. An example of such overriding feature is described herein in greater detail.
Based on the comparison in block 190, the process 180 can, in a decision block 192, determine whether the second sets of attribute(s) for the source and replicated files are the same. If “No,” the difference provides further confirmation of change and/or error in the file, and the source file can be replicated in block 194. If “Yes,” the process 180 can either decide to not replicate the source file (e.g., based on high reliability of checksum comparison), or replicate the source file (e.g., based on the change in one or more of the first set of attribute(s)). Whether to replicate or not replicate under such circumstances can be based on one or more factors, including, for example, balancing of likelihood of data integrity (confirmed by checksum comparison), versus replication based on the difference(s) of the first sets of attributes. In the example shown in
In certain embodiments, replication of the source file in block 194 can include sending of the entire source file from the source system to the destination system if the checksum comparison yields differences in one or more blocks of the source and replicated files. In other embodiments, replication of the source file in block 194 can include sending of only the block(s) having different checksum(s) between the source and replicated files.
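The second alternative, in which only the block(s) having different checksums are sent, can be sketched as follows. The sketch operates on in-memory byte strings rather than on files at two systems, uses MD5 for every block, and treats a copied block as a "sent" block; these simplifications, and the function names, are assumptions made for illustration.

```python
import hashlib


def changed_block_indices(source, replica, block_size):
    """Return indices of blocks whose checksums differ or that exist on one side only."""
    def digests(data):
        return [hashlib.md5(data[i:i + block_size]).digest()
                for i in range(0, len(data), block_size)]

    src, dst = digests(source), digests(replica)
    count = max(len(src), len(dst))
    return [i for i in range(count)
            if i >= len(src) or i >= len(dst) or src[i] != dst[i]]


def replicate_blocks(source, replica, block_size):
    """Return (updated replica, number of blocks sent), copying only differing blocks."""
    # Drop any replica data beyond the end of the source.
    updated = bytearray(replica[:len(source)])
    sent = 0
    for i in changed_block_indices(source, replica, block_size):
        start = i * block_size
        block = source[start:start + block_size]
        if block:  # empty when the replica had extra trailing blocks
            updated[start:start + len(block)] = block
            sent += 1
    return bytes(updated), sent
```

For a large file in which only a few blocks differ, the number of blocks sent can be much smaller than the number of blocks in the file, at the cost of computing checksums on both sides.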
In certain embodiments, such attributes can be obtained for files on the destination system in a survey performed periodically or as needed. Information representative of such attributes can be sent to the source system, and the replication agent can perform comparisons with similar information obtained on the source system so as to allow identification of files to be further assessed for replication purpose.
In a decision block 226, the process 220 can determine whether the attributes obtained in blocks 222 and 224 are the same for a given file. If “Yes,” the process 220 can determine that the file should not be replicated, and the source file is not sent (block 260). If “No,” the process 220 can obtain the size of the file (e.g., size of the source file) in block 230. In block 232, the file size can be compared with a threshold value obtained based on a selected policy. In certain embodiments, such policy can include a setting of threshold file size based on, for example, network type, available network bandwidth, loads placed on the source and/or destination systems, priorities assigned to replication processes, combinations of the same or the like. As described herein, a threshold value can be selected based on balancing of expenditure of various resources associated with checksum calculations for the source and replicated files versus direct replication without the checksum calculations.
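A policy-based selection of the threshold value can be sketched as follows. The 256 KB and 2 MB values come from the description; the network labels, the bandwidth cut-off, the priority labels, and the adjustment rules are hypothetical and merely illustrate how such a policy might be expressed.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class ReplicationContext:
    network_type: str         # hypothetical labels: "lan" or "wan"
    available_bandwidth: float  # fraction of link capacity currently free, 0.0 to 1.0
    priority: str             # hypothetical labels: "low", "normal" or "high"


def select_threshold(ctx):
    """Return a threshold file size in bytes for the given context."""
    # Fast local networks can afford to send larger files directly.
    threshold = 2 * MB if ctx.network_type == "lan" else 256 * KB
    # On a congested link, checksum more files before committing bandwidth.
    if ctx.available_bandwidth < 0.25:
        threshold = min(threshold, 256 * KB)
    # A high-priority replication skips checksums for more files.
    if ctx.priority == "high":
        threshold *= 2
    return threshold
```

The returned value would then be the threshold against which the file size is compared in block 232.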
In a decision block 234, the process 220 can determine whether the file size is greater than the threshold value. If the answer is “No,” the file can be sent (block 250) without further processing. If the answer is “Yes,” checksums for the source and replicated files can be obtained in block 240. In certain embodiments, information about the replicated file's checksum can be sent to the source system so as to allow comparison with the source file's checksum.
In block 242, checksums for the source and replicated files can be compared. In a decision block 244, the process 220 can determine whether the two checksums are the same. If the answer is “No,” the source file can be sent in block 250. In certain embodiments, if the answer is “Yes,” the process 220 can determine that the source file should not be sent despite difference(s) in the attributes obtained and compared in blocks 222, 224, and 226. In certain embodiments, such determination can be based on consideration of the likelihood of data integrity as provided by the checksum comparison. In block 270, the process 220 can synchronize the creation time, modification time, and access time attributes of the source and replicated files. For example, such attributes for the replicated file can be updated to match those of the source file. The source file is not sent (block 260).
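The decision flow of the process 220 can be summarized in the following sketch. The enumeration, the function name, and the use of a callable to defer the checksum computation are hypothetical; the block numbers in the comments refer to the blocks described above.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip"              # attributes match; file not sent (block 260)
    SEND = "send"              # source file sent (block 250)
    SYNC_ATTRIBUTES = "sync"   # attributes synchronized (block 270), file not sent


def decide(attributes_match, file_size, threshold, get_checksums):
    """Return the action for one file.

    get_checksums is called only when needed and must return a
    (source_checksum, replica_checksum) pair.
    """
    if attributes_match:                        # decision block 226
        return Action.SKIP
    if file_size <= threshold:                  # decision block 234 answers "No"
        return Action.SEND
    source_sum, replica_sum = get_checksums()   # block 240
    if source_sum != replica_sum:               # decision block 244
        return Action.SEND
    return Action.SYNC_ATTRIBUTES
```

Passing the checksum computation as a callable reflects the purpose of the size comparison: the comparatively expensive checksums are obtained only for files larger than the threshold value.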
As described in reference to
By way of example, the interface 320 can include one or more parameters 330 representative of a current status of the source system, network, and/or destination system. For example, network type, available bandwidth, assigned priority for the replication process, average CPU usage for the replication process, and average I/O usage for the replication process can be presented to the user. Additionally, the current value of the threshold can also be presented to the user.
In certain embodiments, the user interface 320 can include one or more recommendations 340 that can be effectuated by the user. For example, priority setting can be allowed to be changed by the user selecting the “Change” button 342.
In certain embodiments, the user interface 320 can include an option 350 that allows the user to keep 352 or change 354 the threshold value. For example, the change 354 can be to a new threshold value based on one or more parameters 330 as described herein.
In certain embodiments, determination and implementation of the threshold value can be configured to be substantially automatic, based on one or more storage policies and/or the replication agent's monitoring of the source system, network, and/or destination system. Thus, the example user interface 320 can include an option 360 that allows determination of threshold values based on various operating and resource parameters.
Systems and modules described herein may comprise software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein. Software and other modules may reside on servers, workstations, personal computers, computerized tablets, PDAs, and other devices suitable for the purposes described herein. Software and other modules may be accessible via local memory, via a network, via a browser, or via other means suitable for the purposes described herein. Data structures described herein may comprise computer files, variables, programming arrays, programming structures, or any electronic information storage schemes or methods, or any combinations thereof, suitable for the purposes described herein. User interface elements described herein may comprise elements from graphical user interfaces, command line interfaces, and other interfaces suitable for the purposes described herein.
Embodiments of the invention are also described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the acts specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the acts specified in the flowchart and/or block diagram block or blocks.
While certain embodiments of the inventions have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the disclosure. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the disclosure. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the disclosure.