A computer system may generate a large amount of data. Loss of such data may be detrimental to an entity using the computer system. To protect from such loss, a data backup system may store at least a portion of the computer system's data. If a failure of the computer system prevents retrieval of some portion of the data, it may be possible to retrieve the data from the data backup system.
The following detailed description references the drawings, wherein:
A data backup system may use several techniques to ensure data protection. One such technique is data deduplication, a technique that divides a sequence of input backup data into an ordered collection of non-overlapping chunks of data. In the deduplication process, when a duplicate of original data is found, a pointer can be established to the original data, rather than storing another copy (i.e. a duplicate) of the original data. By storing unique chunks of data, deduplication enables backup data to be stored compactly and cheaply by decreasing needed storage space.
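The chunk-and-pointer idea described above can be illustrated with a minimal sketch. It assumes fixed-size chunks and SHA-256 digests as pointers; neither choice comes from this description, and the names `deduplicate`, `store`, and `recipe` are hypothetical.

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into non-overlapping chunks and keep each unique chunk once.

    Assumption: fixed-size chunking and SHA-256 digests stand in for whatever
    chunking and indexing a real deduplication engine would use.
    """
    store = {}   # digest -> chunk bytes (unique chunks only)
    recipe = []  # ordered digests; each entry is a pointer to a stored chunk
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # a duplicate chunk is not stored again
        recipe.append(digest)
    return store, recipe

store, recipe = deduplicate(b"ABCDABCDABCDWXYZ")
print(len(recipe), "chunks referenced,", len(store), "chunks stored")
```

In this toy input, three of the four chunks are identical, so only two chunks need to be stored while the ordered recipe still records all four positions.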
A data backup system may also store backup data in a location that is remote to the data generation site (e.g. data from a storage client), requiring a transfer of the data using established communication protocols. For example, deduplicated data may be stored in disk arrays that emulate a tape library (known as a virtual tape library (VTL)). Because deduplicated data is more compact than the input backup data, the transfer time of deduplicated data may be less than the transfer time of the input backup data. Thus, deduplication may also decrease bandwidth demands.
When deduplicated data is stored, it is stored in a manner that allows the data to easily interface with the method that created the deduplicated data (for example, a deduplication appliance or software). This is because deduplicated data is dependent on the mechanism that created it (i.e. the data is format opaque). For example, in some situations, a restore of deduplicated data needs to be done with the same appliance and same version of the appliance that created the deduplicated data. Thus, storing the deduplicated data in a manner that links it to the original deduplication method allows the deduplicated data to be rehydrated (or restored) back to its original duplicated form when needed using the original deduplication method.
This generates an issue for data backup systems because data backup systems, in addition to storing deduplicated data, may also safeguard data that has already been deduplicated. For example, storage resources such as physical tape (e.g. Linear Tape-Open (LTO), etc.) or object-based repositories operated over networks (e.g. S3, SWIFT, etc.) may be used for archiving purposes and long-term storage of the information held in deduplicated data. This provides a tertiary level of protection in addition to the secondary level of protection provided by the deduplicated data. In some situations, archiving data to physical tapes may be used to help comply with government or industry regulations.
Storage resources used for archiving purposes, however, are not linked to the deduplication method that created the deduplicated data. Without the deduplication method, it is difficult to rehydrate the deduplicated data and extract information from the deduplicated data. This is an issue in archiving the deduplicated data in its deduplicated form because data held in archives may not be needed for long periods of time after the archive date and it may be difficult to ensure that the deduplication method that created the deduplicated data will be accessible or available at the time of restoration from an archive.
Thus, to archive deduplicated data in a useful form, the deduplicated data is often rehydrated before it is sent to the storage resource. But this rehydration process is time consuming. Additionally, storing the data in its original, duplicated form burdens the data backup system's bandwidth and storage space.
Examples described herein address these issues by providing a way to store both deduplicated data and a mechanism to rehydrate the deduplicated data. In some examples, a consistent-in-time data set of deduplicated data is generated and a rehydrating agent is generated. The consistent-in-time data set and an executable copy of the rehydrating agent are then sent to the storage resource for storage.
In some examples, a computing device is provided with a processor, instructions to generate a rehydrating agent, instructions to generate a consistent-in-time data set of deduplicated data, and instructions to send the consistent-in-time data set and an executable copy of the rehydrating agent to a storage resource. The instructions are executable by the processor.
In some examples, a system is provided with a deduplication engine, a policy engine, a data generation engine, an agent generation engine, and a transmit engine. The deduplication engine generates deduplicated data. The policy engine determines an occurrence of a trigger event. In response to the occurrence of the trigger event, the data generation engine generates a consistent-in-time data set of the deduplicated data. The agent generation engine, in response to the occurrence of the trigger event, generates an executable copy of a rehydrating agent. The transmit engine then sends the consistent-in-time data set and an executable copy of the rehydrating agent to a storage resource.
In some examples, a method is provided to generate a consistent-in-time data set of deduplicated data and generate a rehydrating agent. The method includes sending the consistent-in-time data set and an executable copy of the rehydrating agent to a storage resource that is remote from the deduplicated data.
Thus, examples described herein allow for deduplicated data to be stored for archival purposes without dependence on the original method that generated the deduplicated data. This frees up the data backup system's storage space and bandwidth resources, allowing the data backup system to use deduplicated data in tertiary levels of data protection.
Referring now to the figures,
Computing device 100 of
As used herein, “machine-readable storage medium” may include a storage drive (e.g., a hard drive), flash memory, Random Access Memory (RAM), any type of storage disc (e.g., a Compact Disc Read Only Memory (CD-ROM), any other type of compact disc, a DVD, etc.) and the like, or combination thereof. In some examples, a storage medium can correspond to a memory including a main memory, such as a Random Access Memory (RAM), where software may reside during runtime, and a secondary memory. The secondary memory can, for example, include a nonvolatile memory where a copy of software or other data, such as deduplicated data, is stored.
In the example of
Processing resource 120 may, for example, be in the form of a central processing unit (CPU), a semiconductor-based microprocessor, a digital signal processor (DSP) such as a digital image processing unit, other hardware devices or processing elements suitable to retrieve and execute instructions stored in a storage medium, or suitable combinations thereof. The processor can, for example, include single or multiple cores on a chip, multiple cores across multiple chips, multiple cores across multiple devices, or suitable combinations thereof. The processor can be functional to fetch, decode, and execute instructions 111, 112, and 113 as described herein.
In the example of
Instructions 111 may be executable by processing resource 120 such that computing device 100 is operative to generate a consistent-in-time data set of deduplicated data 131. As used herein, “consistent-in-time data set of deduplicated data” is a copy of deduplicated data 131 that includes original input data that has been processed to remove duplicate data. The consistent-in-time data set does not include original input data that exists in a buffer to be processed. The portion of deduplicated data 131 that includes original input data in a buffer to be processed is flushed before it may be considered part of the consistent-in-time data set. In other words, the consistent-in-time data set is a copy of data that has been deduplicated as of one single point in time. In some examples, the consistent-in-time data set is used to ensure that there are no inconsistencies within the deduplicated data that is captured by the consistent-in-time data set. In some examples, instructions 111 may include suspending a deduplication process of a deduplication engine (see
In some examples, the consistent-in-time data set of deduplicated data does not capture all of deduplicated data 131, but rather just a subset of deduplicated data 131. In these examples, instructions 111 may include instructions to determine an appropriate subset of deduplicated data 131 from which to generate the consistent-in-time data set. In some examples, these instructions may determine the appropriate subset of deduplicated data based (at least partially) on reference to a catalog that tracks the volumes of deduplicated data that have previously been sent to a storage medium. In some examples, these instructions may determine the appropriate subset of deduplicated data based (at least partially) on an age-rule of the age of the deduplicated data or a time-rule of how long it has been since the last archive. In some examples, these instructions may determine an appropriate subset of deduplicated data based (at least partially) on an input from a storage client selecting the specific data to archive. For example, the storage client may select a specific virtual cartridge or a specific file share or files within a file share to be archived. In some examples, a combination of at least one of these methods may be used to determine the appropriate subset.
The appropriate subset may include any other data that the subset is dependent on even though the other data would not be included in the subset on its own. The inclusion of this other data makes the subset independently coherent and consistent and free from reliance on external sources. For example, if the rule is to generate the subset from data generated during time period 3-4, the subset would include data generated during time period 3-4 but may also include data generated outside of time period 3-4 (e.g. during time period 1-2) if data generated in time 3-4 depends on the data generated outside of time period 3-4.
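The dependency rule above amounts to taking a transitive closure over whatever the selected data depends on. The following is a minimal sketch of that idea; the dependency map, the item names, and the function `with_dependencies` are hypothetical and only illustrate the time period 3-4 example.

```python
def with_dependencies(selected, depends_on):
    """Return the selected items plus everything they depend on, transitively.

    Assumption: dependencies are available as a mapping from an item to the
    items it relies on; how a real system tracks this is not specified here.
    """
    result = set()
    stack = list(selected)
    while stack:
        item = stack.pop()
        if item in result:
            continue
        result.add(item)
        stack.extend(depends_on.get(item, ()))
    return result

# Data from period 3-4 depends on a chunk first written during period 1-2.
depends_on = {
    "data_t4": ["data_t3"],
    "data_t3": ["data_t1"],
    "data_t1": [],
    "data_t2": [],
}
subset = with_dependencies({"data_t3", "data_t4"}, depends_on)
print(sorted(subset))
```

Here `data_t1` is pulled into the subset even though it falls outside the selected period, while `data_t2` is left out because nothing selected depends on it.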
In some examples, the instructions 111 to generate the consistent-in-time data set may be executed after a request for an archive is received. The request may be inputted by a user interfacing with a GUI on the storage client side (not shown).
Instructions 112 may be executable by processing resource 120 such that computing device 100 generates a rehydrating agent. As used herein, a “rehydrating agent” includes any mechanism, including suitable software application, appliance, virtual appliance, software agent, intelligent software agent, computer program, or the like that restores the consistent-in-time data set of deduplicated data to its original, duplicated (i.e. rehydrated or restored) form. Non-limiting examples of a rehydrating agent include a stand-alone executable file and a virtual appliance (including a virtual storage appliance (VSA) that runs on a virtual machine to consolidate the directly-attached storage capacity of different physical hosts to create a virtual storage pool). In this regard, a copy 132 of the rehydrating agent may be stored in memory 130 or in a remote memory that is in communication with processing resource 120. The portion of memory 130 where a copy 132 of rehydrating agent is stored may be non-volatile memory (e.g., secondary memory). In some examples, the generation of the rehydrating agent involves reading of copy 132 of the rehydrating agent from the non-volatile portion of memory 130. The rehydrating agent is then generated from the copy 132 of the rehydrating agent in the main memory (e.g., RAM).
The example of
As used herein, an “executable copy of the rehydrating agent” includes a copy of the generated rehydrating agent at a specific point in time. The executable copy allows the rehydrating agent to be generated as it existed when it was generated by instructions 112.
Storage resource 150 includes storage mediums and any storage service that relies on underlying storage mediums. Storage resource 150 may be different from machine-readable storage medium 110 and memory 130 in that storage resource 150 is not linked to the rehydrating agent. In some examples, storage resource 150 may include physical tapes, network attached storage (NAS), or object-based data repositories functioning over a communications network (e.g. SWIFT, S3, etc.). In some examples, storage resource 150 may be used in tertiary storage.
In some examples, computing device 100 may implement at least a portion of a data backup system. For example, instructions 111, 112, and 113 may be part of a larger set of instructions implementing functionalities of a backup system, and memory 130 may implement at least a portion of the storage of the backup system.
In the example of
In some examples, and as shown in
For example, in
In the example of
As discussed above, the rehydrating agent may be a VSA. In some examples, the copy 132 of the rehydrating agent may be different between the examples of
At 410 of method 400, processing resource 120 may execute instructions 112 to generate a rehydrating agent from a copy 132 of the rehydrating agent in memory 130 of computing device 100. At 420 of method 400, processing resource 120 may execute instructions 111 to generate a consistent-in-time data set of deduplicated data 131 stored in memory 130. At 430 of method 400, processing resource 120 may execute instructions 113 to send the consistent-in-time data set and an executable copy of the rehydrating agent to a storage resource. In some examples, and in example method 400, the storage resource is remote from the deduplicated data 131. In some examples, a remote storage resource includes a storage resource that is different from the storage medium storing the deduplicated data. Different may include, among other things, a difference in physical location or a difference in type.
In some examples, at 430, instructions 113 may include instructions to send the consistent-in-time data set and the executable copy of the rehydrating agent to more than one storage resource of the same type, for example, spanning across more than one physical tape or object, as described above in relation to
Although the flowchart of
Each of engines 301, 310, 320, 330, 340, and any other engines, may be any combination of hardware (e.g., a processor such as an integrated circuit or other circuitry) and software (e.g., machine or processor-executable instructions, commands, or code such as firmware, programming, or object code) to implement the functionalities of the respective engine. Such combinations of hardware and programming may be implemented in a number of different ways. A combination of hardware and software can include hardware only (i.e., a hardware element with no software elements), software hosted at hardware (e.g., software that is stored at a memory and executed or interpreted at a processor), or hardware and software hosted at hardware. Additionally, as used herein, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, the term “engine” is intended to mean at least one engine or a combination of engines. In some examples, system 300 may include additional engines.
Each engine of system 300 can include at least one machine-readable storage medium (for example, more than one) and at least one computer processor (for example, more than one). For example, software that provides the functionality of engines on system 300 can be stored on a memory of a computer to be executed by a processor of the computer. System 300 of
In some examples, and as shown in
Deduplication engine 301 is an engine of system 300 that includes a combination of hardware and software that allows system 300 to generate deduplicated data. Deduplication engine 301 may organize the original input data into non-overlapping chunks by using a pointer at sites where duplicate data is found, rather than storing another copy (i.e. a duplicate) of the original data. In some examples, deduplication engine 301 may include hardware in the form of a microprocessor on a single integrated circuit, related firmware, or other software for allowing the microprocessor to operatively communicate with other hardware of system 300. The discussion of deduplicated data 131 in relation to
Policy engine 310 is an engine of system 300 that includes a combination of hardware and software that allows system 300 to determine an occurrence of a trigger event. In some examples, a trigger event may include a storage client input signaling a request for an archive of deduplicated data. In some examples, a trigger event may include a determination that the deduplicated data has been stored in its deduplicated form for a specific time period. For example, a storage client may specify through a user interface that deduplicated data should be archived every 30 days. Thus, in those examples, a trigger event would include the passage of 30 days. In some examples, a trigger event may include a determination that the deduplicated data is of a specific type of data or from a specific origin.
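The trigger events described above can be sketched as a single predicate. This is an illustrative sketch only: the function name `trigger_occurred` and its parameters are hypothetical, and it covers just two of the trigger events mentioned (an explicit archive request and the passage of a configured number of days).

```python
from datetime import datetime, timedelta

def trigger_occurred(now: datetime,
                     last_archive: datetime,
                     interval_days: int = 30,
                     archive_requested: bool = False) -> bool:
    """Report whether an archive should start.

    Assumption: a trigger is either an explicit request from a storage client
    or the elapse of `interval_days` since the previous archive.
    """
    if archive_requested:
        return True
    return now - last_archive >= timedelta(days=interval_days)

start = datetime(2016, 1, 1)
print(trigger_occurred(start + timedelta(days=29), start))  # interval not reached
print(trigger_occurred(start + timedelta(days=30), start))  # interval reached
```

A trigger based on data type or origin, as also mentioned above, could be added as a further condition in the same predicate.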
In some examples, and as in the example of
Data generation engine 320 is a functional engine of system 300 that includes a combination of hardware and software that allows system 300 to generate a consistent-in-time data set of a store of deduplicated data present on system 300 or accessible to system 300 in response to an input from policy engine 310 that a trigger event has occurred. The discussion of consistent-in-time data set in relation to instructions 111 of computing device 100 is applicable to the consistent-in-time data set of data generation engine 320.
In some examples, data generation engine 320 allows system 300 to generate a consistent-in-time data set of a subset of the deduplicated data that has been generated by deduplication engine 301. In some examples, the appropriate subset of deduplicated data may be determined based (at least partially) on the boundaries of the trigger event discussed above in relation to policy engine 310. For example, if the rule in operation in policy engine 310 is that an archive is generated every 30 days, data generation engine 320 may generate different consistent-in-time data sets at day 30 and at day 60. For day 30, data generation engine 320 may generate a consistent-in-time data set of the entirety of deduplicated data, assuming that day 0 was the first day of deduplication. For day 60, however, data generation engine 320 may generate a consistent-in-time data set of a portion of deduplicated data, specifically from day 31 to day 60. The discussion of a subset of the deduplicated data in relation to instructions 111 of computing device 100 is applicable here.
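The day 30 and day 60 example can be sketched as a window selection over the days on which volumes were written. The volume names, the tuple layout, and the function `select_window` are hypothetical; handling of data that the window depends on, as discussed in relation to instructions 111, is deliberately omitted.

```python
def select_window(volumes, archive_day, last_archive_day=None):
    """Pick volumes written after the previous archive, up to the archive day.

    Assumption: each volume is a (name, day_written) pair, and the first
    archive covers everything from day 0 onward.
    """
    lower = -1 if last_archive_day is None else last_archive_day
    return [name for name, day in volumes if lower < day <= archive_day]

volumes = [("vol-a", 0), ("vol-b", 12), ("vol-c", 31), ("vol-d", 58)]
first = select_window(volumes, archive_day=30)
second = select_window(volumes, archive_day=60, last_archive_day=30)
print(first, second)
```

The first archive covers all data written through day 30, and the second covers only data written from day 31 to day 60.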
In some examples, and as in the example of
Agent generation engine 330 is a functional engine of system 300 that includes a combination of hardware and software that allows system 300 to generate a rehydrating agent. In some examples, this includes reading a copy of the rehydrating agent stored in secondary memory (e.g. hard disk), as described in relation to 132 in
Transmit engine 340 is a functional engine of system 300 that includes a combination of hardware and software that allows system 300 to transmit the consistent-in-time data set generated by data generation engine 320 and an executable copy of the rehydrating agent to a storage resource. The interface of transmit engine 340 with agent generation engine 330 and data generation engine 320 is represented by line 325 in
In some examples, transmit engine 340 may send the consistent-in-time data set and the executable copy of the rehydrating agent to a physical tape (see
In some examples, transmit engine 340 may send the consistent-in-time data set and the executable copy of the rehydrating agent to an object-based repository. In an object structure, each object may include the underlying data, some metadata, and a globally unique identifier. In those examples, transmit engine 340 may provide system 300 with the functionality of writing the consistent-in-time data set and the executable copy of the rehydrating agent as an object. The object exists in transient (primary) memory until it is sent to the repository. Additionally, transmit engine 340 may include a data connector engine. Data connector engine is a functional engine of system 300 that includes a combination of hardware and software that allows system 300 to convert an object from one object format to another object format. Data connector engine may be present when the storage medium being used is an object-based repository, like in the example methods of
At 501 of method 500, policy engine 310 of system 300 may determine the occurrence of a trigger event. The discussion above of a trigger event in relation to policy engine 310 is also applicable at 501. At 502, method 500 proceeds to 510 if there is a determination that a trigger event has occurred. In some examples, policy engine 310 may determine that a trigger event has occurred if it receives an input for a request for an archive, as discussed in relation to policy engine 310 above. If there is no determination that a trigger event has occurred, method 500 iterates back to 501 to determine an occurrence of a trigger event. At 510 of method 500, agent generation engine 330 generates a rehydrating agent. 510 of method 500 is similar to 410 of method 400 and the discussion above in relation to 410 is applicable here.
At 521, data generation engine 320 suspends a deduplication process of deduplication engine 301 of system 300. This suspension may be characterized as a quiescence operation of the deduplication engine 301. At 522, data generation engine 320 instructs deduplication engine 301 of system 300 to flush any input backup data stream pending in a buffer storage of deduplication engine 301. At 523, data generation engine 320 takes a snapshot of deduplication engine 301 and any associated data stores of deduplication engine 301, including deduplicated data stores, indices, meta-data, log files, OS files, etc. This snapshot generation at 523 allows deduplication engine 301 to resume its deduplication process, as data generation engine 320 may create the consistent-in-time data set from the snapshot.
At 524 of method 500, data generation engine 320 generates a consistent-in-time data set from the snapshot generated at 523. In some examples, the consistent-in-time data set is generated by reading the snapshot, allowing data generation engine 320 to create a data drive containing the deduplicated data as well as the indices, meta-data, log files, etc. In some examples, method 500 may not include 523 of taking a snapshot. Instead, method 500 may skip 523 and go to 524. In these examples, the consistent-in-time data set is not generated from reading the snapshot, but from copying the deduplicated data from the memory associated with deduplication engine 301.
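The suspend, flush, and snapshot sequence at 521-523 can be sketched with a toy in-memory engine. Everything in this sketch is an assumption made for illustration: the class `ToyDedupEngine`, its fields, and the use of a deep copy as a stand-in for a real snapshot mechanism.

```python
import copy

class ToyDedupEngine:
    """Hypothetical stand-in for a deduplication engine with an input buffer."""

    def __init__(self):
        self.buffer = []        # pending input that has not been deduplicated
        self.store = {}         # chunk -> first-seen position (deduplicated data)
        self.suspended = False

    def ingest(self, chunk):
        if self.suspended:
            raise RuntimeError("deduplication is suspended")
        self.buffer.append(chunk)

    def flush(self):
        # Process pending input so that nothing remains in the buffer.
        for chunk in self.buffer:
            self.store.setdefault(chunk, len(self.store))
        self.buffer.clear()

def consistent_snapshot(engine):
    engine.suspended = True                  # 521: quiesce the engine
    try:
        engine.flush()                       # 522: flush pending input
        return copy.deepcopy(engine.store)   # 523: point-in-time copy
    finally:
        engine.suspended = False             # engine may resume afterwards

engine = ToyDedupEngine()
for chunk in ("a", "b", "a"):
    engine.ingest(chunk)
snapshot = consistent_snapshot(engine)
engine.ingest("c")   # later input does not alter the snapshot
engine.flush()
print(snapshot, engine.store)
```

Because the snapshot is taken only after the buffer is flushed, it reflects a single point in time, and the engine can keep ingesting data while the data set is built from the snapshot.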
At 531 of method 500, transmit engine 340 may write an executable copy of the rehydrating agent generated at 510 by agent generation engine 330 to a physical tape. At 532, method 500 updates (i.e. commits) the executable copy of the rehydrating agent to the physical tape. In some examples, transmit engine 340 may control physical tapes and be an interface between the physical tapes and the other engines of system 300. Transmit engine 340 may write and commit (i.e. update) the executable copy of the rehydrating agent to the physical tape, as described above in relation to
At 533, transmit engine 340 may write the consistent-in-time data set generated by data generation engine 320 at 521-524 to a physical tape. At 534, transmit engine may update the physical tape with the consistent-in-time data set. In some examples, transmit engine may commit the write of the consistent-in-time data set and the executable copy of rehydrating agent over more than one physical tape, as discussed in relation to
Although the flowchart of
At 601, policy engine 310 determines an occurrence of a trigger event. This determination may be performed as described above in relation to 501 of method 500. At 602, policy engine 310 may trigger agent generation engine 330 to generate a rehydrating agent if policy engine 310 has determined that a trigger event has occurred. This may be performed as described above in relation to 502 of method 500. At 610, agent generation engine 330 of system 300 may generate a rehydrating agent. This may be performed as described above in relation to 510 of method 500.
At 621, 622, and 623, data generation engine 320 suspends a deduplication process of deduplication engine 301, flushes any pending input data stream, and takes a snapshot. 621, 622, and 623 may be performed as described above in relation to 521, 522, and 523 of method 500.
At 624, data generation engine 320 of system 300 generates a consistent-in-time data set from the snapshot generated at 623 using the rehydrating agent generated by agent generation engine 330. In some examples, this includes a reading of the snapshot and a writing of the consistent-in-time data with the rehydrating agent. What is generated from this is a consistent-in-time data set of the deduplicated data coupled to an executable copy of the rehydrating agent. In other words, the executable copy of the rehydrating agent contains with it the consistent-in-time data set of the deduplicated data. In some examples, the executable copy of the rehydrating agent may be an image of a virtual storage appliance (i.e. a copy of an appliance at a specific point in time). In this regard, the executable copy of the rehydrating agent may be thought of as a carrier allowing rehydration of the data coupled to the carrier, and the consistent-in-time data set may be thought of as the data.
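The carrier-and-data coupling can be illustrated, very loosely, by packing both into one archive. This sketch uses a tar archive purely as a stand-in for an appliance image; the member names, the placeholder byte strings, and the function `bundle` are hypothetical and do not reflect any actual image format.

```python
import io
import tarfile

def bundle(agent_bytes: bytes, data_set: dict) -> bytes:
    """Pack the agent and the data set files into a single in-memory archive.

    Assumption: `data_set` maps file names (index, chunk store, and so on)
    to their contents; a real appliance image would be built differently.
    """
    members = [("agent/rehydrate.bin", agent_bytes)]
    members += [("data/" + name, payload)
                for name, payload in sorted(data_set.items())]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

image = bundle(b"<agent placeholder>",
               {"index": b"<index placeholder>", "chunks": b"<chunks placeholder>"})
with tarfile.open(fileobj=io.BytesIO(image), mode="r") as tar:
    print(tar.getnames())
```

The point of the sketch is only that a single unit travels to the storage resource, so whoever later reads it receives the means of rehydration together with the data.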
At 631 of method 600, transmit engine 340 writes the consistent-in-time data set with the executable copy of the rehydrating agent generated in 624 to a physical tape. This is performed as described above in relation to 531 in method 500, the difference here being that 631 includes the consistent-in-time data set with the executable copy of the rehydrating agent. At 632 of method 600, transmit engine 340 updates the physical tape with the consistent-in-time data set and the executable copy of the rehydrating agent. This is performed as described above in relation to 532 of method 500, the difference here being that the update includes both the consistent-in-time data set and the executable copy of the rehydrating agent.
Although the flowchart of
701 and 702 of method 700 are similar to 601, 602 of method 600 and 501, 502 of method 500 and are performed in accordance with the descriptions above. Additionally, the discussion above in relation to 610, 510; 621, 521; 622, 522; 623, 523; and 524 is applicable to 710, 721, 722, 723, and 724, respectively. At 731 of method 700, an executable copy of the rehydrating agent that is generated at 710 is written by transmit engine 340. As discussed above in relation to
Because there are various types of object-based repositories (for example, SWIFT OpenSource, S3, etc.) with different formats, transmit engine 340 of system 300 may convert, at 732, the object written in 731 to a format that is compatible with the intended object-based repository. At 733 of method 700, transmit engine 340 transmits the object containing the executable copy of the rehydrating agent to the object-based repository. Transport protocol may include Hypertext Transfer Protocol (HTTP), SOAP, REST, etc. In some examples, if the object write in 731 is compatible with the intended object-based repository, transmit engine 340 skips 732 and goes directly to 733. The dashed lines in
At 734, transmit engine 340 writes the consistent-in-time data set generated in 724 as an object. At 735, transmit engine 340 of system 300 may convert the object written in 734 to a format that is compatible with the intended object-based repository. At 736 of method 700, transmit engine 340 transmits the object containing the consistent-in-time data set to the object-based repository. In some examples, if the object write in 734 is compatible with the intended object-based repository, transmit engine 340 skips 735 and goes directly to 736. The dashed lines in
Although the flowchart of
801 and 802 of method 800 are similar to 501, 502 of method 500; 601, 602 of method 600; and 701, 702 of method 700. Additionally, the discussion above in relation to 710, 610, 510; 721, 621, 521; 722, 622, 522; and 723, 623, 523 is applicable to 810, 821, 822, and 823, respectively.
At 824 of method 800, data generation engine 320 generates a consistent-in-time data set using the rehydrating agent and the snapshot. This is generated as described above in relation to 624 of
In examples where the executable copy of the rehydrating agent is sent together with the consistent-in-time data set to the storage resource, (see
In examples where the storage resource is a physical tape, the storage server or another computer can instruct a robotic arm to fetch the tape and place it in a drive or other reader mechanism. In examples where the storage resource is an object-based repository, the object can be read or recalled through an appropriate command (for example, HTTP GET over the SWIFT RESTful API).
The reading of the storage resource generates a rehydrating agent from the executable copy of the rehydrating agent. Because the consistent-in-time data set of deduplicated data is coupled with the executable copy of the rehydrating agent, the read rehydrates (or restores) the entirety of the consistent-in-time data set of deduplicated data. In this regard, it is envisioned that the restoration process of the consistent-in-time data set includes sending a command to an associated storage server to identify a requirement for the raw storage medium capacity of the deduplicated data, provisioning and portioning the needed logical unit numbers (LUN), and restoring the deduplicated data to the identified storage medium. In these examples, the deduplicated data that is rehydrated is the deduplicated data that was captured by the consistent-in-time data set. For example, the rehydrating agent may be a virtual storage appliance, that when opened, automatically rehydrates and writes the data to disk arrays, allowing the storage client access to the rehydrated data that came from the consistent-in-time data set.
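At its core, rehydration walks the ordered pointers and substitutes the stored chunk for each one. The sketch below assumes a chunk store keyed by short identifiers and an ordered recipe of those identifiers; the names `rehydrate`, `store`, and `recipe` are hypothetical, and provisioning of storage such as LUNs is outside its scope.

```python
def rehydrate(store: dict, recipe: list) -> bytes:
    """Rebuild the original byte stream from unique chunks and ordered pointers.

    Assumption: `store` maps a chunk identifier to chunk bytes and `recipe`
    lists identifiers in original order; real formats are engine specific.
    """
    missing = [ref for ref in recipe if ref not in store]
    if missing:
        raise KeyError(f"data set is not self-contained, missing: {missing}")
    return b"".join(store[ref] for ref in recipe)

store = {"c1": b"ABCD", "c2": b"WXYZ"}
recipe = ["c1", "c1", "c1", "c2"]
restored = rehydrate(store, recipe)
print(restored)
```

The check for missing chunks reflects why the archived data set needs to be independently coherent: a pointer to a chunk that was not archived cannot be resolved at restore time.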
In examples where the executable copy of the rehydrating agent is sent separately from the consistent-in-time data set (see
All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the elements of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or elements are mutually exclusive.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2016/026914 | 4/11/2016 | WO | 00 |