1. Technical Field
Aspects and embodiments relate to data generation, and more particularly to apparatus and methods for generating data with predetermined characteristics.
2. Discussion
Commercially available backup applications rely on a multi-level architecture to perform backup jobs. These backup applications have components to schedule jobs, merge multiple clients into one or more streams, manage media, and abstract the backup media (i.e., OST, tape or disk). These components are layered, much like an Operating System (OS) would layer device drivers for file systems. The characteristics of the data copied for backup are a product of this layering. For example, backup jobs (which are also referred to as policies) govern all aspects of the backup process and control of one or more clients. Clients copy data based on the backup job, which eventually provides data to one or more data backup systems for storage.
One such data backup system may be a virtual tape library, such as the SEPATON S2100-ES3, that integrates with third party backup solutions. Third party backup solutions interface with the virtual tape library as an ordinary tape drive system. Virtual tapes, much like real tapes, are written to sequentially. In order to reclaim space, storage system vendors often incorporate de-duplication processes into their product offerings to decrease the amount of required back-up media. One such method for identifying redundant data within back-up data streams is disclosed in U.S. application Ser. No. 12/877,719, entitled “SYSTEM AND METHOD FOR DATA DRIVEN DE-DUPLICATION” assigned to Sepaton, Inc. of Marlborough, Mass.
The ability to replicate data with the same variable characteristics of data generated from third party backup solutions is highly desirable. Conventional approaches utilize existing libraries to generate a single data stream (also known as a client). In some embodiments, by changing different parameters, different data qualities may be generated. These qualities include compressibility, starting seed, chunk size, amount of unique data from generation to generation, and the total size of the stream.
Aspects and examples disclosed herein relate to apparatus and processes for generating data having one or more predetermined characteristics. Some examples manifest an appreciation that conventional data generation techniques are constrained by the number of streams data may be generated to, and the granularity of the control over the data generated. For example, existing data generation techniques may generate a stream that is highly (100%) compressible or 100% random (non-compressible), with no variations in between. The ability to generate data closely resembling copied data that originated from one or more streams, utilizing third party backup solutions is highly desirable. Further, these examples manifest an appreciation that conventional data generation techniques do not have the ability to reproduce a previous generation of generated data, identically, based on one or more parameters. Thus, these examples manifest an appreciation of the limitations imposed by conventional data generation techniques.
For instance, some examples provide for a system configured to generate data having one or more predetermined characteristics. The system includes memory, at least one processor coupled to the memory, and at least one data stream component. The at least one data stream component is executed by the at least one processor and configured to read at least one first parameter descriptive of the one or more predetermined characteristics, identify a target sequence of data based on the at least one first parameter, execute a plurality of data generator components to generate one or more data chunks, and assemble the target sequence from the one or more data chunks into at least one data stream. The at least one first parameter descriptive of the one or more predetermined characteristics may include at least one of a compression ratio parameter, a multiplex degree parameter, a data change ratio parameter, and a total stream size parameter. In addition, each data generator component of the plurality of data generator components may be configured to write at least one variable sequence of random numbers to at least one data chunk of the one or more data chunks. Moreover, the plurality of data generators may write at least one variable sequence of random numbers, which includes a repeated random number of the same value, or a plurality of randomly generated numbers. The system may be further configured to assemble the target sequence by assembling a majority of the target sequence from data chunks generated by a first subset of the plurality of data generators and by assembling a minority of the target sequence from data chunks generated by a second subset of the plurality of data generators different from the first subset. In addition, the system may include the at least one data stream component that is configured to randomly select the first subset from the plurality of data generator components.
The system may also include a client job component executed by the at least one processor and configured to read at least one second parameter descriptive of the one or more predetermined characteristics, identify a first target sequence of streams based on the at least one second parameter, initiate a plurality of data stream components that generate a plurality of data streams, and assemble the first target sequence of streams from the plurality of data streams. In addition, the at least one second parameter descriptive of the one or more predetermined characteristics may be different during a subsequent execution of the client job component. Further, the system may be configured with each data stream of the plurality of data streams including data having characteristics different from others of the plurality of data streams. The system may further include another client job component executed by the at least one processor and configured to read the at least one second parameter descriptive of the one or more predetermined characteristics, identify a second target sequence of streams based on the at least one second parameter, initiate one or more data stream components that generate one or more data streams, and assemble the second target sequence of streams from the one or more data streams. Thus, the second target sequence of streams may be identical to the first target sequence of streams.
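By way of illustration only, the client job described above may be sketched as follows. The sketch assumes a round-robin interleave of chunks across streams and derives each stream's seed from a single job seed; the names `stream_chunks`, `client_job`, and `job_seed` are hypothetical and do not appear in the disclosure.

```python
import random


def stream_chunks(seed, chunk_count, chunk_size):
    """Generate the chunks of one data stream from its own seed."""
    rng = random.Random(seed)
    return [
        rng.getrandbits(8 * chunk_size).to_bytes(chunk_size, "little")
        for _ in range(chunk_count)
    ]


def client_job(job_seed, stream_count, chunk_count, chunk_size):
    """Return (stream_index, chunk) pairs interleaved across all streams.

    Round-robin ordering is an assumption of this sketch; it is one simple
    way to assemble a target sequence of streams.
    """
    seed_source = random.Random(job_seed)
    streams = [
        stream_chunks(seed_source.getrandbits(64), chunk_count, chunk_size)
        for _ in range(stream_count)
    ]
    sequence = []
    for position in range(chunk_count):
        for index, stream in enumerate(streams):
            sequence.append((index, stream[position]))
    return sequence
```

Under these assumptions, a second client job started with the same `job_seed` and sizes assembles a sequence identical to the first, while a different `job_seed` yields streams with different content.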
The system may be further configured to verify at least a portion of the target sequence, wherein the target sequence is stored in one or more generations of data stored on a hard drive of a data storage system.
According to another example, a method for generating data having one or more predetermined characteristics with at least one data stream component is provided. The method includes acts of reading at least one first parameter descriptive of the one or more predetermined characteristics, identifying, by the at least one data stream component, a target sequence of data based on the at least one first parameter, generating, by a plurality of data generator components, one or more data chunks, and assembling the target sequence from the one or more data chunks into the at least one data stream. In addition, the method may include the act of writing at least one variable sequence of random numbers to at least one data chunk of the one or more data chunks. The at least one variable sequence of random numbers may be one of a repeated random number of the same value and a plurality of randomly generated numbers.
The method may further include an act of assembling the target sequence which may include the act of assembling a composition of a majority of data chunks generated by a first subset of a plurality of data generators, and a minority of the target sequence from data chunks generated by a second subset of the plurality of data generators different from the first subset. The composition may include a randomly determined order from the first subset of a plurality of data generators and the second subset of the plurality of data generators.
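As a non-limiting sketch of such a composition, the following assumes two kinds of data generators, one writing a single random value repeated across a chunk and one writing a plurality of random values, combined in a randomly determined order. The 4 KB chunk size, the names `assemble` and `compressed_fraction`, and the use of zlib (DEFLATE) as a stand-in compressor are assumptions of this sketch only.

```python
import random
import zlib

CHUNK_SIZE = 4096  # bytes per chunk; an assumed size for this sketch


def repeated_value_chunk(rng):
    """A chunk holding one random byte value repeated (highly compressible)."""
    return bytes([rng.randrange(256)]) * CHUNK_SIZE


def random_chunk(rng):
    """A chunk of independently drawn pseudo-random bytes (non-compressible)."""
    return rng.getrandbits(8 * CHUNK_SIZE).to_bytes(CHUNK_SIZE, "little")


def assemble(seed, chunk_count, compressible_fraction):
    """Assemble a stream from both generator kinds in a random order."""
    rng = random.Random(seed)
    compressible_count = round(chunk_count * compressible_fraction)
    generators = [repeated_value_chunk] * compressible_count
    generators += [random_chunk] * (chunk_count - compressible_count)
    rng.shuffle(generators)  # randomly determined order of the two subsets
    return b"".join(generator(rng) for generator in generators)


def compressed_fraction(data):
    """Compressed size divided by original size, using zlib as a stand-in."""
    return len(zlib.compress(data)) / len(data)
```

In this sketch, raising `compressible_fraction` makes the repeated-value chunks the majority and lowers the value returned by `compressed_fraction`; lowering it has the opposite effect.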
The method may further include acts of reading at least one second parameter descriptive of the one or more predetermined characteristics, identifying a first target sequence of streams based on the at least one second parameter, initiating, by a client job, a plurality of data streams, and assembling, by the client job, the first target sequence of streams from the plurality of data streams. Each data stream of the plurality of data streams may include data having characteristics different from others of the plurality of data streams. The method may further include the acts of reading the at least one second parameter descriptive of the one or more predetermined characteristics, identifying a second target sequence of streams based on the at least one second parameter, and assembling the second target sequence of streams from the one or more data streams. Thus, the second target sequence of streams may be identical to the first target sequence of streams.
According to another example, a non-transitory computer readable medium storing computer readable instructions is provided. The computer readable medium stores computer readable instructions that, when executed by at least one processor, instruct the at least one processor to perform a method of generating data having one or more predetermined characteristics. This method includes the acts of reading at least one first parameter descriptive of the one or more predetermined characteristics, identifying a target sequence of data based on the at least one first parameter, generating, by a plurality of data generators, one or more data chunks, and assembling the target sequence from the one or more data chunks into at least one data stream. Further, the instructions for generating data having one or more predetermined characteristics may instruct the at least one processor to order the one or more data chunks in a pattern established in proportion to a ratio of a first subset of the plurality of data generators and a second subset of the plurality of data generators different from the first subset.
Still other aspects, examples, and advantages of these exemplary aspects and examples, are discussed in detail below. Moreover, it is to be understood that both the foregoing information and the following detailed description are merely illustrative examples of various aspects and embodiments, and are intended to provide an overview or framework for understanding the nature and character of the claimed aspects and embodiments. Any example or embodiment disclosed herein may be combined with any other example or embodiment. References to “an example,” “an embodiment,” “some examples,” “some embodiments,” “an alternate example,” “an alternate embodiment,” “various examples,” “various embodiments,” “one example,” “one embodiment,” “at least one example,” “at least one embodiment,” “this and other examples,” “this and other embodiments,” or the like are not necessarily mutually exclusive and are intended to indicate that a particular feature, structure, or characteristic described in connection with the example may be included in at least one example or embodiment. The appearances of such terms herein are not necessarily all referring to the same example or embodiment.
Various aspects of at least one example are discussed below with reference to the accompanying figures, which are not intended to be drawn to scale. The figures are included to provide an illustration and a further understanding of the various aspects and examples, and are incorporated in and constitute a part of this specification, but are not intended as a definition of the limits of the embodiments disclosed herein. The drawings, together with the remainder of the specification, serve to explain principles and operations of the described and claimed aspects and examples. In the figures, each identical or nearly identical component that is illustrated in various figures is represented by a like numeral. For purposes of clarity, not every component may be labeled in every figure. In the figures:
Some aspects and embodiments relate to apparatus and processes for generating data having one or more predetermined characteristics. For example, according to one embodiment, a data generation system is configured to read a plurality of data generation parameters. Based on the data generation parameters, one or more data stream components are initialized and executed by the data generation system. The one or more data stream components may generate data, using a plurality of data generators, in accordance with the predetermined characteristics targeted by the data generation parameters. The generated data may be a generation of data that simulates a daily full or incremental backup. Thus, subsequent generations of data may be generated, identical to the previous, if the same data generation parameters are used. In addition, subsequent generations of data may be generated, similar to the first, but with one or more changes based on changing certain parameters within the data generation parameters.
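The reproducibility described above may be illustrated by the following minimal sketch, which assumes that all randomness is drawn from a single seeded pseudo-random source. The `GenerationParams` fields and the `generate` and `digest` helpers are hypothetical names chosen for illustration.

```python
import hashlib
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationParams:
    seed: int         # starting seed for the pseudo-random source
    chunk_size: int   # bytes per chunk
    chunk_count: int  # chunks per generation


def generate(params):
    """Produce one generation of data fully determined by `params`."""
    rng = random.Random(params.seed)
    chunks = [
        rng.getrandbits(8 * params.chunk_size).to_bytes(params.chunk_size, "little")
        for _ in range(params.chunk_count)
    ]
    return b"".join(chunks)


def digest(data):
    """SHA-256 digest, used here only to compare two generations cheaply."""
    return hashlib.sha256(data).hexdigest()
```

Calling `generate` twice with equal parameters returns byte-identical data in this sketch, and changing any one field, such as `seed`, returns different data.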
The predetermined characteristics may represent data characteristics of a particular target data footprint. Such predetermined characteristics may include data with target compression ratios, target data change ratios, and granular size of data. To this end, embodiments of this disclosure demonstrate how data generation parameters enable fine-grain control over generated data to achieve a particular data footprint. For example, data generation parameters may target characteristics of a particular database type. In certain embodiments, this may be a relational database. A data footprint simulating a relational database, depending on a database vendor's specific implementation (and the data stored therein), may include a specific predetermined number of streams, a compression ratio and de-duplication ratio. In certain other embodiments, the data footprint may simulate a file system with widely varying characteristics. Data generation parameters are discussed below in further detail in regards to
Embodiments disclosed herein further include one or more data stream components having stream objects connected to one or more destination storage systems. These destination storage systems may be connected in a number of ways, such as logically, by sockets, and physically, through the use of Ethernet, IEEE 1394 (Firewire), Fiber Optics, IEEE 802.11 (Wifi), USB, Bluetooth, or any method for transmitting data between computer systems.
Also, in at least one embodiment disclosed herein, the data generation system is further configured to provide data verification parameters inline to a generated data stream as a constant value or string. Responsive to the availability of such values within a generated stream, the data generation system may verify data integrity before, during, or after certain processes (e.g., de-duplication or compression) of a storage system alter the generated data. In other embodiments, no verification values may be provided within the generated stream, and therefore, no verification may occur.
Certain embodiments disclosed herein also include providing feedback regarding progress of data generation to the user of the data generation system. Feedback may be in the form of a progress bar, or on-screen report. Such feedback may include the percent of completion of the current generation, overall generations, etc. Other such feedback may include reports indicating whether verification was successful. In addition, feedback may include any error that occurs, for example any exception/fault, or connectivity issue with the data streams.
It is to be appreciated that examples of the methods and apparatuses discussed herein are not limited in application to the details of construction and the arrangement of components set forth in the following description or illustrated in the accompanying drawings. The methods and apparatuses are capable of implementation in other examples and of being practiced or of being carried out in various ways. Examples of specific implementations are provided herein for illustrative purposes only and are not intended to be limiting. In particular, acts, elements and features discussed in connection with any one or more examples are not intended to be excluded from a similar role in any other examples.
Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. Any references to examples or elements or acts of the systems and methods herein referred to in the singular may also embrace examples including a plurality of these elements, and any references in plural to any example or element or act herein may also embrace examples including only a single element. References in the singular or plural form are not intended to limit the presently disclosed systems or methods, their components, acts, or elements. The use herein of “including,” “comprising,” “having,” “containing,” “involving,” and variations thereof is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. References to “or” may be construed as inclusive so that any terms described using “or” may indicate any of a single, more than one, and all of the described terms.
Furthermore, the data manipulated by examples disclosed herein may be organized into various data objects on one or more computer systems. These data objects may include any structure in which data may be stored. A non-limiting list of exemplary data objects includes bits, bytes, data files, data blocks, data directories and back-up data sets.
Various embodiments utilize one or more devices or computer systems to generate data having one or more predetermined characteristics.
As depicted in
Parallelism parameters 202 are a category of parameters generally directed to generation of data in a manner which is similar (temporally and spatially) to data copied by commercial backup processes. In one embodiment, these parallelism parameters specify the number of concurrent data originators (backup clients). Further parameters may be included that maintain a consistent timing, or randomly adjust certain delays in regards to maintaining the temporal relationship between parallel data originators. For example, a commercial backup application may copy data through parallel clients (using one or more streams). The copied data of these parallel clients would be sequenced, or “striped,” on a storage system. Striping is discussed in further detail below in regards to
One or more data characteristic parameters 204 may be contained within the data generation parameters object 102. In various embodiments, data characteristic parameters 204 may affect certain underlying characteristics of one or more generated data streams, over the course of several generations. In accordance with these embodiments, each data stream may have underlying characteristics which allows each stream to have unique qualities and characteristics different from others. In one embodiment, data characteristic parameters 204 may include a parameter which controls the variability of the underlying generated data. Variability may be controlled by several parameters which control the target compressibility (compression ratio) of the stream based on randomized data generation. Compressibility is discussed in further detail below in regards to
In one example embodiment, streams of data are delineated and constructed by a plurality of chunks. Chunks, as used herein, are defined as a block of data stored in physically or logically contiguous memory having a defined size. In some embodiments, chunks are a basic unit of generated data. In certain embodiments, chunks may be grouped together into a chunk group (or buffer) that may include a header and/or footer. It should be noted that a chunk group may contain as few as one chunk. In one embodiment, chunk size may be a parameter of the data characteristic parameters 204. It should also be noted that certain other parameters may be defined, such as the generation size parameter, which is discussed further below, which may also affect chunk size. In one embodiment, a parameter may be defined that determines the overall number of chunks to be generated, and thus, also defines the overall size of the generated stream. In another embodiment, a parameter of the data characteristic parameters 204 may define the target chunk group size and number of chunks to include in a chunk group. In yet other embodiments, chunk size, chunk group size, chunk group composition, and generation size are all controlled by separate parameters. Chunks are described in further detail below, in reference to
Certain exemplary embodiments include one or more generational parameters 206 within the data generation parameters object 102. In various embodiments, generational parameters 206 control certain aspects of data generation, such as controlling unique qualities and underlying (predetermined) characteristics of each generated stream. The predetermined characteristics may change from one generation to the next during data generation. In one embodiment, a generational parameter controls the number of generations to be created during data generation. Further embodiments may include additional parameters such as a parameter for controlling the size of each generation. Still further examples of additional parameters may include randomization of generation size and a simulated delay period between subsequent generations. It should be noted, as was described above, that certain parameters directed towards chunk and generation size may affect the resulting generation size, and vice versa. In these embodiments, parameters are utilized, when enabled, in a harmonious and logical combination to reach desired results.
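One way a sequence of generations with a target data change ratio might be produced is sketched below. The sketch assumes that each chunk is derived from its own seed and that a fixed fraction of chunk seeds is replaced between consecutive generations; the function names and the `change_ratio` parameter are illustrative assumptions rather than the disclosed implementation.

```python
import random


def chunk_bytes(chunk_seed, chunk_size):
    """Derive one chunk deterministically from its seed."""
    rng = random.Random(chunk_seed)
    return rng.getrandbits(8 * chunk_size).to_bytes(chunk_size, "little")


def generations(seed, chunk_count, chunk_size, generation_count, change_ratio):
    """Yield each generation as a list of chunks.

    Between consecutive generations, round(chunk_count * change_ratio)
    randomly chosen chunks receive a new seed; all others are unchanged.
    """
    rng = random.Random(seed)
    chunk_seeds = [rng.getrandbits(64) for _ in range(chunk_count)]
    changed_per_generation = round(chunk_count * change_ratio)
    for generation in range(generation_count):
        if generation > 0:
            for index in rng.sample(range(chunk_count), changed_per_generation):
                chunk_seeds[index] = rng.getrandbits(64)
        yield [chunk_bytes(s, chunk_size) for s in chunk_seeds]
```

With 100 chunks and a `change_ratio` of 0.1, for example, each generation in this sketch differs from its predecessor in 10 chunks, which approximates an incremental change of 10% from one generation to the next.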
In one embodiment, verification parameters 208 may be included in the data generation parameters object 102. In some embodiments, a header and/or footer may appear in chunk groups. In accordance with these embodiments, verification parameters may control the insertion of one or more values within the headers/footers. In one embodiment, one value of a parameter may indicate a particular method to use for verifying the contents of one or more chunks. One such method may be a cyclic redundancy check (CRC). Also, it should be noted that a checksum may be used to verify the contents of one or more chunks. For example, checksums such as sum (Unix) 8/16/24/32, Fletcher-4/8/16/32, and Adler-32 may be used. In certain other embodiments, any suitable non-cryptographic or cryptographic method for verifying the contents of one or more chunks may also be used. For example, some non-cryptographic functions include Pearson hashing, Fowler-Noll-Vo (FNV) hashing, Jenkins hash function, Java's hashCode(), and MurmurHash. Cryptographic methods for verifying the contents may be, for example, SHA-1/256/512, MD5 and FSB. The verification methods may be chosen based on target hardware and performance requirements. In one embodiment, a parameter may indicate that no verification should be performed. In still other embodiments, a parameter may control when and how verification is to occur with granularity. For example, a parameter may direct that verification should be performed during or after each generation, or at a chosen multiple of generations, or even at random. In another example, a parameter may limit verification to only the last generation. Still another example is a parameter that indicates some number of chunks of each generation to be verified. The number may be fewer than all of the chunks.
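For purposes of illustration, footer-based verification of a chunk group may be sketched as follows, assuming a CRC-32 value stored in a 4-byte big-endian footer. The footer layout and the names `seal_chunk_group` and `verify_chunk_group` are assumptions of this sketch; other embodiments could substitute Adler-32 (`zlib.adler32`) or a cryptographic digest.

```python
import struct
import zlib

FOOTER_SIZE = 4  # bytes; assumed footer layout: one unsigned 32-bit CRC


def seal_chunk_group(chunks):
    """Concatenate chunks and append a CRC-32 footer over the payload."""
    payload = b"".join(chunks)
    return payload + struct.pack(">I", zlib.crc32(payload))


def verify_chunk_group(group):
    """Recompute the CRC-32 of the payload and compare it with the footer."""
    payload, footer = group[:-FOOTER_SIZE], group[-FOOTER_SIZE:]
    (expected,) = struct.unpack(">I", footer)
    return zlib.crc32(payload) == expected
```

A chunk group sealed in this manner can be checked after it has been written to, and read back from, a storage system; a mismatch indicates that the stored payload no longer matches the generated data.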
Further parameters may indicate the method in which verification results should be provided to a user 110 (
As described above in reference to
With continued reference to
Referring now to
The random number generator 304 may be any pseudo random number generator (PRNG) that is capable of generating a long sequence of random numbers. The sequence of numbers is generally determined from a fixed number called a seed. A common PRNG is the traditional linear congruential generator. However, the period length of PRNGs, such as linear congruential generators, is most often limited to 2^32 or 2^64. The traditional PRNG may be sufficient to generate the quality of randomness needed. In certain other embodiments, the Mersenne twister algorithm may be implemented in the random number generator 302. In further embodiments, a linear feedback shift register PRNG may be implemented in the random number generator 302. In still further embodiments, the scalable parallel random number generator library (SPRNG) may be implemented in the random number generator 302.
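A linear congruential generator of the kind mentioned above may be sketched as follows. The multiplier and increment are the widely published Numerical Recipes constants and are used here only as an example; the class name is hypothetical, and nothing in this sketch is asserted to be the generator used by any particular embodiment.

```python
class LinearCongruentialGenerator:
    """Computes x[n+1] = (a * x[n] + c) mod m from a fixed seed."""

    MULTIPLIER = 1664525      # a (example constant)
    INCREMENT = 1013904223    # c (example constant)
    MODULUS = 2 ** 32         # m; the period cannot exceed this value

    def __init__(self, seed):
        self.state = seed % self.MODULUS

    def next_value(self):
        """Advance the state and return the next number in the sequence."""
        self.state = (self.MULTIPLIER * self.state + self.INCREMENT) % self.MODULUS
        return self.state
```

Two generators constructed with the same seed emit the same sequence, which is the property relied upon for reproducing a generation of data. As a point of comparison, the `random` module in the Python standard library is based on the Mersenne twister algorithm.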
Still referring to
Now referring to
According to one embodiment, having a compression group 4 KB in size allows for a data compression algorithm, such as Lempel-Ziv-Stac (LZS, or Stac compression), to use a sliding window compression algorithm to control resulting compression ratios in generated data. LZS is a common algorithm used by virtual tape systems and other storage systems to compress data. In certain embodiments, a consistent rate of compressibility may be used to control the overall ratio of compressible data in the stream by maintaining control over how many highly compressible compression groups are introduced into the data stream. To this end, introducing a continuous sequence of highly compressible compression groups to a data stream would yield a nearly 100% compression ratio. Likewise, introducing a series of compression groups wherein only 10% were compressible may yield nearly a 10% compression ratio for the data stream overall. Thus, the data characteristic parameters 306 (
Now referring back to
Now referring to
Components and parameters of the data generation system 104 have been discussed in various embodiments. These components, and related methods as described further below, may be implemented as specialized hardware or software components executing in one or more computer systems. There are many examples of computer systems that are currently in use. These examples include, among others, network appliances, personal computers, workstations, mainframes, networked clients, servers, media servers, application servers, database servers and web servers. Other examples of computer systems may include mobile computing devices, such as cellular phones and personal digital assistants, and network equipment, such as load balancers, routers and switches. Further, aspects may be located on a single computer system or may be distributed among a plurality of computer systems connected to one or more communications networks.
In addition, the various components and methods described herein may be executed from one or more storage systems. Referring to
The storage system includes back-up storage media 126 that may be, for example, one or more disk arrays, as discussed in more detail below. The back-up storage media 126 provide the actual storage space for back-up data from the host computer(s) 120. However, the storage system 170 may also include software and additional hardware that emulates a removable media storage system, such as a tape library, such that, to the back-up/restore application running on the host computer 120, it appears as though data is being backed-up onto conventional removable storage media. Thus, as illustrated in
According to one embodiment, the storage system may include a “logical metadata cache” 242 that stores metadata relating to user data that is backed-up from the host computer 120 onto the storage system 170. As used herein, the term “metadata” refers to data that represents information about user data and describes attributes of actual user data. A non-limiting exemplary list of metadata regarding data objects may include data object size, logical and/or physical location of the data object in primary storage, the creation date of the data object, the date of the last modification of the data object, the back-up policy name under which the data object was stored, an identifier, e.g. a name or watermark, of the data object and the data type of the data object, e.g. a software application associated with the data object. The logical metadata cache 242 represents a searchable collection of data that enables users and/or software applications to randomly locate back-up user files, compare user files with one another, and otherwise access and manipulate back-up user files. Two examples of software applications that may use the data stored in the logical metadata cache 242 include a synthetic full back-up application 240 and an end-user restore application 300 that are discussed more fully below. In addition, a de-duplication director, which is discussed in more detail below, may use metadata to provide scalable de-duplication services within a storage system.
As discussed above, the storage system 170 includes hardware and software that interface with the host computer 120 and the back-up storage media 126. Together, the hardware and software of embodiments of the invention may emulate a conventional tape library back-up system such that, from the point of view of the host computer 120, data appears to be backed-up onto tape, but is in fact backed-up onto another storage medium, such as, for example, a plurality of disk arrays.
Referring to
As shown in
In the illustrated example, the switching network 132 may include one or more Fibre Channel switches 128a, 128b. The storage system controller 122 includes a plurality of Fibre Channel port adapters 124b and 124c to couple the storage system controller to the Fibre Channel switches 128a, 128b. Via the Fibre Channel switches 128a, 128b, the storage system controller 122 allows data to be backed-up onto the back-up storage media 126. As illustrated in
In the example illustrated in
As discussed above, in one embodiment, the back-up storage media 126 may include one or more disk arrays. In one preferred embodiment, the back-up storage media 126 include a plurality of ATA or SATA disks. Such disks are “off the shelf” products and may be relatively inexpensive compared to conventional storage array products from manufacturers such as EMC, IBM, etc. Moreover, when one factors in the cost of removable media (e.g., tapes) and the fact that such media have a limited lifetime, such disks are comparable in cost to conventional tape-based back-up storage systems. In addition, such disks can read/write data substantially faster than can tapes. For example, over a single Fibre Channel connection, data can be backed-up onto a disk at a speed of at least about 150 MB/s, which translates to about 540 GB/hr, significantly faster (e.g., by an order of magnitude) than tape back-up speeds. In addition, several Fibre Channel connections may be implemented in parallel, thereby increasing the speed even further. In accordance with an embodiment of the present invention, back-up storage media may be organized to implement any one of a number of RAID (Redundant Array of Independent Disks) schemes. For example, in one embodiment, the back-up storage media may implement a RAID-5 implementation.
As discussed above, embodiments of the invention emulate a conventional tape library back-up system using disk arrays to replace tape cartridges as the physical back-up storage media, thereby providing a “virtual tape library.” Physical tape cartridges that would be present in a conventional tape library are replaced by what is referred to herein as “virtual cartridges.” It is to be appreciated that for the purposes of this disclosure, the term “virtual tape library” refers to an emulated tape library which may be implemented in software and/or physical hardware as, for example, one or more disk array(s). It is further to be appreciated that although this discussion refers primarily to emulated tapes, the storage system may also emulate other storage media, for example, a CD-ROM or DVD-ROM, and that the term “virtual cartridge” refers generally to emulated storage media, for example, an emulated tape or emulated CD. In one embodiment, the virtual cartridge in fact corresponds to one or more hard disks.
Therefore, in one embodiment, a software interface is provided to emulate the tape library such that, to the back-up/restore application, it appears that the data is being backed-up onto tape. However, the actual tape library is replaced by one or more disk arrays such that the data is in fact being backed-up onto these disk array(s). It is to be appreciated that other types of removable media storage systems may be emulated and the invention is not limited to the emulation of tape library storage systems. The following discussion will now explain various aspects, features and operation of the software included in the storage system 170.
It is to be appreciated that although the software may be described as being “included” in the storage system 170, and may be executed by the processor 127 of the storage system controller 122 (see
As discussed above, according to one embodiment, the host computer 120 may back-up data onto the back-up storage media 126 via the network link (e.g., a Fibre Channel link) 121 that couples the host computer 120 to the storage system 170. It is to be appreciated that although the following discussion will refer primarily to the back-up of data onto the emulated media, the principles apply also to restoring back-up data from the emulated media for verification and examination. The flow of data between the host computer 120 and the emulated media 134 may be controlled by the back-up/restore application, as discussed above. From the view point of the back-up/restore application, it may appear that the data is actually being backed-up onto a physical version of the emulated media.
As discussed above with reference to
In act 802, the data stream component 104 (
Referring to
Returning to
Referring to
Referring to
Moreover, it should be understood that each data stream component 104 may generate a data stream with unique predetermined characteristics in accordance with embodiments previously described herein. Other embodiments may simulate data generated by other aspects of computer systems to be backed up. For example, certain embodiments may generate data that would be representative of certain file systems. Still other embodiments may generate data representative of certain file types, such as multi-media files including video. It should be recognized that almost any type of data (from a variable number of clients) that has definable characteristics such as distinct pattern, randomness, variability, compressibility, de-dupability, etc., may be generated by the data generation system 100 in accordance with the various embodiments described above.
Having thus described several aspects of at least one example, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be part of this disclosure, and are intended to be within the scope of the embodiments disclosed herein. Accordingly, the foregoing description and drawings are by way of example only.