As the amount of data to be backed up continues to grow, more and more sophisticated approaches to backup are desired. These ever more sophisticated approaches seek to address the recovery time objective (RTO). One conventional backup application creates a new backup data set from fragments of previous backup data sets. The conventional backup application reads previously backed up data from a backup storage appliance onto a backup application media server. The previously backed up data may be read from different places including, for example, tapes, solid state devices, disks, or elsewhere. Conventionally, a synthetic backup data set is created from the previously backed up data that was read in, and then the data associated with the synthetic backup data set is processed to create new image metadata and written out to one or more backup storage appliances. This conventional approach is inefficient and resource intensive. Another conventional approach consolidates a set of incremental and/or differential backups to create a consolidated image that represents the entire source backup in a single image. Like other conventional approaches, this may be inefficient because previously backed up data is read and written. Additional inefficiencies associated with conventional approaches include additional network overhead (e.g., when previously backed up data is read/written across a network) and extra workloads for both a backup application and a backup storage appliance.
A synthetic backup is a backup that is created by collecting data from a previous backup(s) rather than from an original source. The backup is referred to as a “synthetic” backup because it is not a backup created from original data. A synthetic full backup does not actually transfer data from an original non-backed up source (e.g., client computer) to backup media. Conventional synthetic backup methods are inefficient because they read and process previously backed up data from a backup storage appliance(s) and then write the previously backed up data to a backup storage appliance(s).
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate various example apparatus, methods, and other example embodiments of various aspects of the invention. It will be appreciated that the illustrated element boundaries (e.g., boxes, groups of boxes, or other shapes) in the figures represent one example of the boundaries. One of ordinary skill in the art will appreciate that in some examples one element may be designed as multiple elements or that multiple elements may be designed as one element. In some examples, an element shown as an internal component of another element may be implemented as an external component and vice versa. Furthermore, elements may not be drawn to scale.
Example apparatus and methods concern synthetic backups. Example apparatus and methods construct a synthetic backup data set from information (e.g., metadata) associated with data (e.g., BLOB(s), portion(s) of BLOB(s), blocklet(s)) that have already been backed up. In one example, apparatus and methods use the information associated with a previous backup data set(s) already present on a backup storage appliance(s) to construct a synthetic backup data set “in place” without any movement (e.g., reading, writing) of previously backed up data. A backed up data set may be, for example, a copy of a live data set. The live data set may reside in a file system, on a server, or in association with some other entity. The backed up data may reside in a different location including, for example, on a backup medium or appliance (e.g., tape, disk).
Consider a trivial case where a new backup data set includes just a single member of a previously backed up data set. The previously backed up data set may include, for example, hundreds of BLOBs. Since the single member needed for the new backup is already present on a backup storage appliance, the new backup data set could just be described rather than reading in the single member and then writing the single member back out to a new, physical backup data set. The new backup data set could be synthesized from the existing backup data set by using just information for locating the previously stored data set. In this simple case, the synthetic backup could be stored as just location information for locating the single member from the previously stored data set. The information for locating the previously stored data set may be retrieved, for example, from metadata associated with the previously stored data.
Now consider a less trivial, but still straightforward case where a new backup data set is identical to a previously backed up data set. Conventional systems might read in the entire previously backed up data set and then write it back out and then create metadata for locating and using the new copy of the previously backed up data set. Example apparatus and methods would not be so inefficient. Instead, the new backup data set could be synthesized by creating metadata for the new backup data set. The metadata could include information for locating and using the previously backed up data set. The metadata could be retrieved, copied, or otherwise acquired from the metadata associated with the previously backed up data set. In this case, the synthetic backup could also be stored as just location information for locating the members in the previously stored data set. Other more complicated cases could be handled similarly.
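The two cases above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory catalog that maps a backup name to an ordered list of member references; the names MemberRef, catalog, and synthesize are illustrative only and do not describe any particular backup appliance.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class MemberRef:
    """Location information for one member already stored on an appliance (hypothetical layout)."""
    blob_id: str
    offset: int
    length: int


# Hypothetical metadata catalog: backup name -> ordered member references.
catalog: Dict[str, List[MemberRef]] = {
    "full_monday": [
        MemberRef("blob_A", 0, 4096),
        MemberRef("blob_B", 0, 8192),
        MemberRef("blob_C", 0, 2048),
    ]
}


def synthesize(source: str, new_name: str,
               wanted: Optional[Set[str]] = None) -> List[MemberRef]:
    """Describe a new backup by copying references only.

    No backed up data is read or written; only metadata is consulted.
    """
    picked = [ref for ref in catalog[source]
              if wanted is None or ref.blob_id in wanted]
    catalog[new_name] = picked
    return picked


# Trivial case: the new backup has a single member of the previous backup.
single = synthesize("full_monday", "synthetic_single", wanted={"blob_B"})

# Less trivial case: the new backup is identical to the previous backup.
clone = synthesize("full_monday", "synthetic_clone")
```

In both cases the synthetic backup is stored as just a list of location records, which is the property the two cases are meant to illustrate.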
Example apparatus and methods construct the synthetic backup data set based, at least in part, on information (e.g., metadata) associated with previously backed up data. The synthetic backup data set can be built “in place”, without reading all of the previously backed up data of which the backup image is composed. In one example, none of the previously backed up data will be read. In another example, at least one piece of the previously backed up data will be read. In one example, none of the previously backed up data will be written to a new location on a backup appliance. In another example, at least one piece of the previously backed up data will be written to a new location on a backup appliance.
Example apparatus and methods may be described using terminology familiar to one skilled in the art of data de-duplication. For example, figure one illustrates a “data stream.” A “data stream,” as used herein, refers to a contiguous sequence of bytes or characters or elements. A data stream may be of indeterminate but finite length. The first byte in a data stream is referred to as byte 0 (e.g., b0). The illustrated data stream includes bytes b0, b1, b2 . . . bn, where n is an integer and refers to the “n-th” byte.
In one example, “blocklets” are atoms of unique data that may be stored by a data de-duplication system.
Since BLOB P′ has portions of BLOBs I and J, in one example, portions of BLOBs I and J may be read by backup method 710 to facilitate creating metadata for BLOB P′. In one example, portions of BLOBs I and J may also be written to a backup appliance. In another example, it may not be necessary or desirable to read portions of BLOBs I and J. Additionally, even if portions of BLOBs I and J are read, it may not be necessary to write out portions of BLOBs I and J. Thus, instead of actually creating a new BLOB P′, backup method 710 may just store metadata about a portion of BLOB I and a portion of BLOB J. This metadata is represented by BLOB P′.
Since BLOB Q′ has portions of BLOBs A, K, and N, backup method 710 may read and/or write a portion(s) of one or more of the BLOBs A, K, and N to facilitate acquiring the metadata for BLOB Q′. However, as described above, it may not be necessary to read or write the portions of BLOBs A, K, or N. Thus, instead of actually creating a new BLOB Q′, backup method 710 may store metadata about a portion of BLOB A, a portion of BLOB K, and a portion of BLOB N. This metadata is represented by BLOB Q′.
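As a minimal sketch of this idea, a synthetic BLOB may be modeled as an ordered list of extents that point into BLOBs already on the appliance. The offsets, lengths, and BLOB contents below are invented for illustration, and the names Extent, stored_blobs, synthetic_blobs, and restore are hypothetical. The keys "P'" and "Q'" use an ASCII apostrophe in place of the prime mark.

```python
from typing import Dict, List, NamedTuple


class Extent(NamedTuple):
    """A portion of a previously stored BLOB (hypothetical record layout)."""
    blob: str
    offset: int
    length: int


# Stand-in for BLOBs already present on a backup appliance (made-up contents).
stored_blobs: Dict[str, bytes] = {
    "I": b"IIIIIIII",
    "J": b"JJJJJJJJ",
    "A": b"AAAAAAAA",
    "K": b"KKKKKKKK",
    "N": b"NNNNNNNN",
}

# The synthetic BLOBs exist only as metadata; no new BLOB data is written.
synthetic_blobs: Dict[str, List[Extent]] = {
    "P'": [Extent("I", 2, 4), Extent("J", 0, 3)],
    "Q'": [Extent("A", 0, 2), Extent("K", 4, 4), Extent("N", 6, 2)],
}


def restore(name: str) -> bytes:
    """Follow the extents at restore time; this is the only point data is read."""
    return b"".join(
        stored_blobs[e.blob][e.offset:e.offset + e.length]
        for e in synthetic_blobs[name]
    )
```

Creating the entries in synthetic_blobs touches no stored data. Data is read only when restore is called, and it is read from the original BLOBs.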
Some portions of the detailed descriptions that follow are presented in terms of algorithms and symbolic representations of operations on data bits within a memory. These algorithmic descriptions and representations are used by those skilled in the art to convey the substance of their work to others. An algorithm, here and generally, is conceived to be a sequence of operations that produce a result. The operations may include physical manipulations of physical quantities. Usually, though not necessarily, the physical quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a logic, and so on. The physical manipulations create a concrete, tangible, useful, real-world result.
It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, and so on. It should be borne in mind, however, that these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, it is appreciated that throughout the description, terms including processing, computing, determining, and so on, refer to actions and processes of a computer system, logic, processor, or similar electronic device that manipulates and transforms data represented as physical (electronic) quantities.
Example methods may be better appreciated with reference to flow diagrams. While, for purposes of simplicity of explanation, the illustrated methodologies are shown and described as a series of blocks, it is to be appreciated that the methodologies are not limited by the order of the blocks, as some blocks can occur in different orders and/or concurrently with other blocks from that shown and described. Moreover, fewer than all the illustrated blocks may be required to implement an example methodology. Blocks may be combined or separated into multiple components. Furthermore, additional and/or alternative methodologies can employ additional, not illustrated blocks.
Method 800 includes, at 820, instantiating second information associated with a synthetic backup data set to be created. In one example, the second information may be instantiated on a non-transitory computer-readable medium (e.g., memory, disk, solid state device). Instantiating the second information may include, for example, allocating memory to store computer data, initializing memory to store computer data, allocating a variable to store computer data, initializing a variable to store computer data, creating a database record to store computer data, initializing a database record, creating an object to store data, initializing an object, writing a record, writing to an object, and other actions.
Method 800 also includes, at 830, selectively manipulating the second information to create the synthetic backup data set. The manipulating is based, at least in part, on the first information. The manipulating may include, for example, copying values from the first information to the second information, deriving second information values from first information values, computing second information values from first information values, and other actions. In one example, a full backup data set may be created from previous full and incremental backup data sets.
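One possible way to picture the actions at 820 and 830 is sketched below, assuming the first information is a set of dictionaries that map a member path to a reference tuple. The tuple layout, the use of None to mark a member deleted by an incremental backup, and the function name synthesize_full are assumptions made for illustration and are not taken from the description.

```python
from typing import Dict, List, Optional, Tuple

Ref = Tuple[str, int, int]            # (BLOB identifier, offset, length), hypothetical
Metadata = Dict[str, Optional[Ref]]   # None marks a deleted member (assumption)


def synthesize_full(full: Metadata, incrementals: List[Metadata]) -> Dict[str, Ref]:
    """Build metadata for a synthetic full backup from a full and its incrementals."""
    # Block 820: instantiate the second information.
    second: Dict[str, Ref] = {}

    # Block 830: copy values from the first information into the second.
    for path, ref in full.items():
        if ref is not None:
            second[path] = ref
    for incremental in incrementals:      # applied oldest to newest
        for path, ref in incremental.items():
            if ref is None:
                second.pop(path, None)    # member removed since the full backup
            else:
                second[path] = ref        # newer reference replaces the older one
    return second


full = {"a.txt": ("blob_A", 0, 100), "b.txt": ("blob_A", 100, 50), "c.txt": ("blob_B", 0, 10)}
inc1 = {"b.txt": ("blob_C", 0, 60)}
inc2 = {"c.txt": None, "d.txt": ("blob_D", 0, 5)}

synthetic = synthesize_full(full, [inc1, inc2])
```

Only reference tuples are copied in this sketch; the BLOBs named by those tuples are never opened.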
In one example, the first information may be data about data, which may be referred to as metadata. Since the metadata is data about backed up data in a backup data set, in different examples the metadata may include a binary large object location, a binary large object size, a binary large object identifier (e.g., TAG), a binary large object order, a blocklet location, a blocklet size, a blocklet identifier, a blocklet order, or other information. A TAG for a BLOB may be, for example, a hash of the hashes of blocklets stored in the BLOB. Similarly, in one example, the second information may also be metadata about backed up data in a synthetic backup data set and may include a binary large object location, a binary large object size, a binary large object identifier (e.g., TAG), a binary large object order, a blocklet location, a blocklet size, a blocklet identifier, a blocklet order, or other information.
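The "hash of the hashes" form of a TAG can be sketched as follows. The description does not name a hash function or a blocklet size, so SHA-256 and fixed four-byte blocklets are assumptions chosen only to keep the example short; the function names are likewise hypothetical.

```python
import hashlib
from typing import List


def blocklet_hashes(blob: bytes, blocklet_size: int = 4) -> List[bytes]:
    """Hash each fixed-size blocklet of a BLOB (fixed-size blocklets are an assumption)."""
    return [
        hashlib.sha256(blob[i:i + blocklet_size]).digest()
        for i in range(0, len(blob), blocklet_size)
    ]


def blob_tag(hashes: List[bytes]) -> str:
    """Compute a TAG as a hash over the ordered blocklet hashes."""
    outer = hashlib.sha256()
    for h in hashes:
        outer.update(h)
    return outer.hexdigest()


tag_1 = blob_tag(blocklet_hashes(b"abcdefgh"))
tag_2 = blob_tag(blocklet_hashes(b"abcdefgh"))
tag_3 = blob_tag(blocklet_hashes(b"abcdefgX"))
```

Because the TAG in this sketch is computed from blocklet hashes alone, it can be derived from metadata that already records those hashes, without reading the blocklets themselves.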
Instantiating the second information at 820 and manipulating the second information at 830 facilitate logically creating the synthetic backup from one or more elements of the existing backup data set without physically reading data from the existing backup data set from the backup appliance. One skilled in the art of computer science understands the difference between logically creating a data set and physically creating a data set. In one example, method 800 logically creates the members of the synthetic backup data set without physically writing a backup data set to the backup appliance. Even though the synthetic backup data set is only logically created, metadata about the synthetic backup data set may be physically created to store the references (e.g., pointers, addresses, location information) that will be used to access physical data associated with the logical synthetic backup data set. In one embodiment, method 800 may include reading some data from a previously backed up data set. For example, when an extent starts or ends somewhere other than at a blocklet boundary, then a portion of the extent may be read in and written out. An extent may start or end, for example, partway through a blocklet, partway through a shared memory page, or partway through some other storage location. In these examples, a small amount of data corresponding to the portion of the extent may be read and written.
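The boundary case can be made concrete with a small sketch that, given an extent and a blocklet size, separates the blocklets that are wholly covered (and can simply be referenced) from the byte ranges that fall partway through a blocklet (and may need to be read and written). Fixed-size blocklets, byte offsets, and the name split_extent are assumptions for illustration.

```python
from typing import List, Tuple


def split_extent(start: int, length: int,
                 blocklet_size: int) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Return (indices of wholly covered blocklets, partial byte ranges as [begin, end))."""
    if length <= 0:
        return [], []
    end = start + length
    first_whole = -(-start // blocklet_size)   # ceiling division
    last_whole = end // blocklet_size          # exclusive upper index
    if first_whole >= last_whole:
        # The extent covers no complete blocklet; all of it is a partial range.
        return [], [(start, end)]
    whole = list(range(first_whole, last_whole))
    partial: List[Tuple[int, int]] = []
    if start < first_whole * blocklet_size:
        partial.append((start, first_whole * blocklet_size))   # leading fragment
    if last_whole * blocklet_size < end:
        partial.append((last_whole * blocklet_size, end))      # trailing fragment
    return whole, partial


aligned = split_extent(8, 8, 4)      # ([2, 3], [])
unaligned = split_extent(6, 13, 4)   # ([2, 3], [(6, 8), (16, 19)])
```

In the unaligned call, blocklets 2 and 3 can be referenced through metadata alone, while only the five bytes in the two partial ranges would have to be read and written.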
In one embodiment, method 800 may also include, at 840, providing the synthetic backup data set to entities including, but not limited to, a backup apparatus, a backup server, a backup appliance, a backup stream, and a backup process. Providing the synthetic backup data set may include, for example, publishing the second information to entities including, but not limited to, a backup apparatus, a backup server, a backup appliance, a server, a process, a data stream, and an object. Providing the synthetic backup data set may also include, for example, storing the second information, storing the second information in a pre-determined location, writing a database record, writing data to an object, writing data to a server, and other actions.
In one embodiment, method 800 may also include, at 850, providing the second information to one or more of, the backup apparatus, the backup server, the backup appliance, the backup stream, and the backup process.
The existing data may describe backed up data that is arranged in backed up data sets. In one example, the backed up data includes one or more BLOBs that store one or more blocklets. The BLOBs and the blocklets may have been produced, for example, by a data de-duplication apparatus or process. The existing data describes the backed up data and thus may include information about, for example, where the data is located, how big the data is, how the data is arranged, and other factors. In different examples, the existing data may include a binary large object location, a binary large object size, a binary large object identifier, a binary large object order, a blocklet location, a blocklet size, a blocklet identifier, and a blocklet order. The new data also describes backed up data and thus may include information including, but not limited to, the location of a binary large object, the size of a binary large object, an identifier (e.g., TAG) of a binary large object, an order in which binary large objects are arranged, the location of a blocklet, the size of a blocklet, an identifier (e.g., hash) of a blocklet, and an order in which blocklets are arranged.
Method 900 also includes, at 920, providing access to the new backup data set through the new data. In one example, providing access to the new backup data set through the new data is done without writing backed up data that is described by the new data. Providing access to the new backup data set may include, for example, storing the new data in a location accessible to a backup application, storing the new data in a location accessible to a backup appliance, writing the new data to a pre-determined location, writing a set of database records, writing data to an object, writing data to a server, and other actions.
While the figures illustrate various actions occurring in serial, it is to be appreciated that various actions illustrated in the figures could occur substantially in parallel. By way of illustration, a first process could process existing metadata, and a second process could process the new metadata created for a synthetic backup. While two processes are described, it is to be appreciated that a greater and/or lesser number of processes could be employed and that lightweight processes, regular processes, threads, and other approaches could be employed.
In one example, a method may be implemented as computer executable instructions. Thus, in one example, a computer-readable medium may store computer executable instructions that if executed by a machine (e.g., processor) cause the machine to perform methods described herein. While executable instructions associated with the described methods are described as being stored on a computer-readable medium, it is to be appreciated that executable instructions associated with other example methods described herein may also be stored on a computer-readable medium.
Apparatus 1000 includes a processor 1010, a memory 1020, a set 1040 of logics, and an interface 1030 to connect the processor 1010, the memory 1020, and the set 1040 of logics. In one embodiment, apparatus 1000 may be a special purpose computer that is created as a result of programming a general purpose computer. In another embodiment, apparatus 1000 may include special purpose circuits that are added to a general purpose computer to produce a special purpose computer.
In one embodiment, the set 1040 of logics includes a first logic 1042, a second logic 1044, and a third logic 1046. In one embodiment, the first logic 1042 is configured to process first metadata associated with an existing backup. In one example, the first logic 1042 may be configured to process the first metadata without reading the data in the existing backup to which the first metadata refers. Instead of reading the data in the existing backup to which the first metadata refers, just the first metadata may be accessed. In one embodiment, the second logic 1044 is configured to process second metadata associated with a synthetic backup. In one example, the second logic 1044 is configured to process the second metadata without writing the data to which the second metadata refers.
In different examples the first metadata may include a binary large object location, a binary large object size, a binary large object identifier, a binary large object order, a blocklet location, a blocklet size, a blocklet identifier, and a blocklet order. Similarly, in different examples, the second metadata may include a binary large object location, a binary large object size, a binary large object identifier, a binary large object order, a blocklet location, a blocklet size, a blocklet identifier, and a blocklet order.
In one embodiment, the third logic 1046 is configured to produce the synthetic backup by controlling the first logic 1042 to provide members of the first metadata sufficient to describe the synthetic backup. The third logic 1046 may also be configured to produce the synthetic backup by controlling the second logic 1044 to store in the second metadata information sufficient to describe the synthetic backup. In one example, the third logic 1046 is configured to receive a description of the contents of the synthetic backup. Once the third logic 1046 has the description of the contents of the synthetic backup, the third logic 1046 may control the first logic 1042 to locate members of the first metadata sufficient to provide information for describing members of the synthetic backup as controlled by the description of the contents of the synthetic backup. Similarly, once the third logic 1046 has the description of the contents of the synthetic backup, the third logic 1046 may then control the second logic 1044 to write sufficient data as controlled by the description of the contents of the synthetic backup.
In one example, the synthetic backup data set refers to one or more blocklets stored in one or more BLOBs. The one or more blocklets and the one or more BLOBs may have been stored in one or more previously created physical backup data sets. In one example, the data to which the first metadata refers may have been produced by a data de-duplication apparatus or process.
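The cooperation of the three logics may be pictured with the following minimal sketch, in which each logic is modeled as a small class. The class names, method names, and the tuple layout of a metadata entry are hypothetical, and software classes are only one of the forms a logic may take.

```python
from typing import Dict, List, Tuple

Ref = Tuple[str, int, int]   # (BLOB identifier, offset, length), hypothetical layout


class FirstLogic:
    """Processes first metadata for an existing backup without reading backed up data."""

    def __init__(self, first_metadata: Dict[str, Ref]) -> None:
        self._first_metadata = first_metadata

    def locate(self, member: str) -> Ref:
        return self._first_metadata[member]


class SecondLogic:
    """Processes second metadata for a synthetic backup without writing backed up data."""

    def __init__(self) -> None:
        self.second_metadata: Dict[str, Ref] = {}

    def store(self, member: str, ref: Ref) -> None:
        self.second_metadata[member] = ref


class ThirdLogic:
    """Produces the synthetic backup by controlling the first and second logics."""

    def __init__(self, first: FirstLogic, second: SecondLogic) -> None:
        self._first = first
        self._second = second

    def produce(self, description: List[str]) -> Dict[str, Ref]:
        # The description lists the members the synthetic backup should contain.
        for member in description:
            self._second.store(member, self._first.locate(member))
        return self._second.second_metadata


first = FirstLogic({"a.txt": ("blob_A", 0, 100), "b.txt": ("blob_B", 0, 50)})
third = ThirdLogic(first, SecondLogic())
synthetic_metadata = third.produce(["b.txt"])
```

The description passed to produce plays the role of the description of the contents of the synthetic backup, and the returned dictionary plays the role of the second metadata.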
The following includes definitions of selected terms employed herein. The definitions include various examples and/or forms of components that fall within the scope of a term and that may be used for implementation. The examples are not intended to be limiting. Both singular and plural forms of terms may be within the definitions.
References to “one embodiment”, “an embodiment”, “one example”, “an example”, and so on, indicate that the embodiment(s) or example(s) so described may include a particular feature, structure, characteristic, property, element, or limitation, but that not every embodiment or example necessarily includes that particular feature, structure, characteristic, property, element or limitation. Furthermore, repeated use of the phrase “in one embodiment” does not necessarily refer to the same embodiment, though it may.
“Computer-readable medium”, as used herein, refers to a medium that stores instructions and/or data. A computer-readable medium may take forms, including, but not limited to, non-volatile media, and volatile media. Non-volatile media may include, for example, optical disks, magnetic disks, and so on. Volatile media may include, for example, semiconductor memories, dynamic memory, and so on. Common forms of a computer-readable medium may include, but are not limited to, a floppy disk, a flexible disk, a hard disk, a magnetic tape, other magnetic medium, an ASIC, a CD, other optical medium, a RAM, a ROM, a memory chip or card, a memory stick, and other media from which a computer, a processor or other electronic device can read.
“Data store”, as used herein, refers to a physical and/or logical entity that can store data. A data store may be, for example, a database, a table, a file, a list, a queue, a heap, a memory, a register, and so on. In different examples, a data store may reside in one logical and/or physical entity and/or may be distributed between two or more logical and/or physical entities.
“Logic”, as used herein, includes but is not limited to hardware, firmware, software in execution on a machine, and/or combinations of each to perform a function(s) or an action(s), and/or to cause a function or action from another logic, method, and/or system. Logic may include a software controlled microprocessor, a discrete logic (e.g., ASIC), an analog circuit, a digital circuit, a programmed logic device, a memory device containing instructions, and so on. Logic may include one or more gates, combinations of gates, or other circuit components. Where multiple logical logics are described, it may be possible to incorporate the multiple logical logics into one physical logic. Similarly, where a single logical logic is described, it may be possible to distribute that single logical logic between multiple physical logics.
While example apparatus, methods, and computer-readable media have been illustrated by describing examples, and while the examples have been described in considerable detail, it is not the intention of the applicants to restrict or in any way limit the scope of the appended claims to such detail. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the systems, methods, and so on described herein. Therefore, the invention is not limited to the specific details, the representative apparatus, and illustrative examples shown and described. Thus, this application is intended to embrace alterations, modifications, and variations that fall within the scope of the appended claims.
To the extent that the term “includes” or “including” is employed in the detailed description or the claims, it is intended to be inclusive in a manner similar to the term “comprising” as that term is interpreted when employed as a transitional word in a claim.
To the extent that the term “or” is employed in the detailed description or claims (e.g., A or B) it is intended to mean “A or B or both”. When the applicants intend to indicate “only A or B but not both” then the term “only A or B but not both” will be employed. Thus, use of the term “or” herein is the inclusive, and not the exclusive use. See, Bryan A. Garner, A Dictionary of Modern Legal Usage 624 (2d. Ed. 1995).
To the extent that the phrase “one or more of, A, B, and C” is employed herein, (e.g., a data store configured to store one or more of, A, B, and C) it is intended to convey the set of possibilities A, B, C, AB, AC, BC, and/or ABC (e.g., the data store may store only A, only B, only C, A&B, A&C, B&C, and/or A&B&C). It is not intended to require one of A, one of B, and one of C. When the applicants intend to indicate “at least one of A, at least one of B, and at least one of C”, then the phrasing “at least one of A, at least one of B, and at least one of C” will be employed.