Administrators strive to efficiently manage file servers and file server resources while keeping networks protected from unauthorized users yet accessible to authorized users. The practice of storing files on servers rather than locally on users' computers has led to identical data being stored at multiple locations in the same system and even at multiple locations in the same server.
Deduplication is a technique for eliminating redundant data, improving storage utilization, and reducing network traffic. Storage-based data deduplication inspects large volumes of data and identifies entire files, or sections of files, that are identical, then reduces the number of instances of identical data. For example, an email system may contain 100 instances of the same one-megabyte file attachment. Each time the email system is backed up, each of the 100 instances of the attachment is stored, requiring 100 megabytes of storage space. With data deduplication, only one instance of the attachment is stored, thus saving 99 megabytes of storage space.
For a detailed description of exemplary embodiments of the invention, reference will now be made to the accompanying drawings.
As used herein, the term “chunk” refers to a continuous subset of a data stream.
As used herein, the term “segment” refers to a group of continuous chunks. Each segment has two boundaries, one at its beginning and one at its end.
As used herein, the term “hash” refers to an identification of a chunk that is created using a hash function.
As used herein, the term “block” refers to a division of a file or data stream that is interleaved with other files or data streams. For example, interleaved data may comprise 1a, 2a, 3a, 1b, 2b, 1c, 3b, 2c, where 1a is the first block of underlying stream one, 1b is the second block of underlying stream one, 2a is the first block of underlying stream two, etc. In some cases, the blocks may differ in length.
As used herein, the term “deduplicate” refers to the act of logically storing a chunk, segment, or other division of data in a storage system or at a storage node such that there is only one physical copy (or, in some cases, a few copies) of each unique chunk at the system or node. For example, deduplicating ABC, DBC and EBF (where each letter represents a unique chunk) against an initially-empty storage node results in only one physical copy of B but three logical copies. Specifically, if a chunk is deduplicated against a storage location and the chunk is not previously stored at the storage location, then the chunk is physically stored at the storage location. However, if the chunk is deduplicated against the storage location and the chunk is already stored at the storage location, then the chunk is not physically stored at the storage location again. In yet another example, if multiple chunks are deduplicated against the storage location and only some of the chunks are already stored at the storage location, then only the chunks not previously stored at the storage location are stored at the storage location during the deduplication.
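For illustration only, the example above might be realized by a content-addressed store that keys chunks by their hashes. The following Python sketch (the ChunkStore class and its method names are hypothetical, not part of the disclosed system) keeps one physical copy per unique chunk and returns a reference for every logical copy:

```python
import hashlib


class ChunkStore:
    """Toy chunk store: one physical copy per unique chunk, many logical references."""

    def __init__(self):
        self.physical = {}   # hash -> chunk bytes (the single physical copy)

    def deduplicate(self, chunk):
        """Logically store a chunk; physically store it only if it is new."""
        h = hashlib.sha1(chunk).hexdigest()
        if h not in self.physical:          # chunk not previously stored here
            self.physical[h] = chunk        # store the single physical copy
        return h                            # reference to the physical copy


store = ChunkStore()
logical_refs = []
for stream in (b"ABC", b"DBC", b"EBF"):        # each letter stands for a unique chunk
    logical_refs += [store.deduplicate(bytes([c])) for c in stream]
print(len(logical_refs), len(store.physical))  # 9 logical copies, 6 physical chunks
```

Running the sketch against the ABC, DBC, EBF example yields nine logical copies but only six physically stored chunks, with B stored once despite its three logical copies.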
The following discussion is directed to various embodiments of the invention. Although one or more of these embodiments may be preferred, the embodiments disclosed should not be interpreted, or otherwise used, as limiting the scope of the disclosure, including the claims. In addition, one skilled in the art will understand that the following description has broad application, and the discussion of any embodiment is meant only to be exemplary of that embodiment, and not intended to intimate that the scope of the disclosure, including the claims, is limited to that embodiment.
During chunk-based deduplication, unique chunks of data are each physically stored once no matter how many logical copies of them there may be. Subsequent chunks received may be compared to stored chunks, and if the comparison results in a match, the matching chunk is not physically stored again. Instead, the matching chunk may be replaced with a reference that points to the single physical copy of the chunk. Processes accessing the reference may be redirected to the single physical instance of the stored chunk. Using references in this way results in storage savings. Because identical chunks may occur many times throughout a system, the amount of data that must be stored in the system or transferred over the network is reduced. However, interleaved data is difficult to deduplicate efficiently.
Recovering the underlying source streams is difficult without understanding the format used to interleave the streams. Because different backup agents are made by different companies that interleave data in different ways, and because methods of interleaving change over time, it may not be cost-effective to produce a system that can un-interleave all interleaved data. It may therefore be useful for a system to be able to directly handle interleaved data.
During deduplication, hashes of the chunks may be created in real time on a front end, which communicates with one or more deduplication back ends, or on a client 199. For example, the front end 118 communicates with one or more back ends, which may be deduplication backend nodes 116, 120, 122. In various embodiments, front ends and back ends may also include other computing devices or systems. A chunk of data is a continuous subset of a data stream produced using a chunking algorithm that may be based on size or on logical file boundaries. Each chunk of data may be input to a hash function, which may be cryptographic, e.g., MD5 or SHA1.
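For instance, a purely size-based chunker paired with a cryptographic hash might look like the following sketch (the function names and the 4 KB chunk size are illustrative assumptions; the description also allows chunking on logical file boundaries):

```python
import hashlib


def chunk_stream(data, chunk_size=4096):
    """Split a byte stream into continuous, fixed-size chunks."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def hash_chunks(data, chunk_size=4096):
    """Yield (hash, chunk) pairs, using SHA-1 as the cryptographic hash."""
    for chunk in chunk_stream(data, chunk_size):
        yield hashlib.sha1(chunk).hexdigest(), chunk
```

A client 199 or the front-end node 118 could run hash_chunks over a backup stream and forward only the resulting hashes, consistent with the variation described below.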
Instead of chunks being compared for deduplication purposes, hashes of the chunks may be compared. Identical chunks produce the same hash if the same hashing algorithm is used. Thus, if the hashes of two chunks are equal and one chunk is already stored, the other chunk need not be physically stored again, which conserves storage space. When the hashes are equal, the underlying chunks themselves may be compared to verify that they are duplicates, or duplication may simply be assumed. Additionally, the system 100 may comprise one or more backend nodes 116, 120, 122. In at least one implementation, different backend nodes 116, 120, 122 do not usually store the same chunks. As such, storage space is conserved because identical chunks are not duplicated across backend nodes 116, 120, 122, but segments (groups of chunks) must be routed to the correct backend node 116, 120, 122 to be deduplicated effectively.
Comparing hashes of chunks can be performed more efficiently than comparing the chunks themselves, especially when indexes and filters are used. To aid in the comparison process, indexes 105 and/or filters 107 may be used to determine which chunks are stored in which storage locations 106 on the backend nodes 116, 120, 122. The indexes 105 and/or filters 107 may reside on the backend nodes 116, 120, 122 in at least one implementation. In other implementations, the indexes 105 and/or filters 107 may be distributed among the front-end nodes 118 and/or backend nodes 116, 120, 122 in any combination. Additionally, each backend node 116, 120, 122 may have separate indexes 105 and/or filters 107 because different data is stored on each backend node 116, 120, 122.
In some implementations, an index 105 comprises a data structure that maps hashes of chunks stored on that backend node to (possibly indirectly) the storage locations containing those chunks. This data structure may be a hash table. For a non-sparse index, an entry is created for every stored chunk. For a sparse index, an entry is created for only a limited fraction of the hashes of the chunks stored on that backend node. In at least one embodiment, the sparse index indexes only one out of every 64 chunks on average.
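One illustrative way to realize such a sparse index is to index only hashes whose leading bits are zero; sampling on 6 bits yields the 1-in-64 average rate mentioned above (the class and method names in this sketch are hypothetical):

```python
class SparseIndex:
    """Index only ~1 in 64 chunk hashes by sampling on the top 6 hash bits."""

    SAMPLE_BITS = 6                         # 2**6 == 64, so ~1/64 of hashes qualify

    def __init__(self):
        self.entries = {}                   # sampled hash -> storage location

    def is_sampled(self, hex_hash):
        """A hash is indexed only if its first 6 bits are zero."""
        return int(hex_hash[:2], 16) >> (8 - self.SAMPLE_BITS) == 0

    def add(self, hex_hash, location):
        if self.is_sampled(hex_hash):
            self.entries[hex_hash] = location

    def lookup(self, hex_hash):
        return self.entries.get(hex_hash)   # None if not sampled or not stored
```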
Filter 107 may be present and implemented as a Bloom filter in at least one embodiment. A Bloom filter is a space-efficient data structure for approximate set membership: a query may report that an element is in the set even though it was never inserted (a false positive), but it never misses an element that was inserted. The filter 107 may represent the set of hashes of the chunks stored at that backend node. A backend node in this implementation can thus determine quickly whether a given chunk could already be stored at that backend node by checking whether its hash is a member of its filter 107.
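A Bloom filter over chunk hashes could be sketched as follows (the bit-array size, the number of hash functions, and the double-hashing scheme are illustrative assumptions, not parameters taken from the description):

```python
import hashlib


class BloomFilter:
    """Approximate set of chunk hashes: false positives possible, no false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, chunk_hash):
        # Derive several bit positions from one SHA-1 digest (double hashing).
        digest = hashlib.sha1(chunk_hash.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, chunk_hash):
        for pos in self._positions(chunk_hash):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, chunk_hash):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(chunk_hash))
```

A backend node would insert the hash of each chunk it stores and answer membership queries with might_contain, accepting occasional false positives in exchange for the space savings.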
Which backend node to deduplicate a chunk against (i.e., which backend node to route a chunk to) is not determined on a per-chunk basis in at least one embodiment. Rather, routing is determined a segment (a continuous group of chunks) at a time. The input stream of data chunks may be partitioned into segments such that each data chunk belongs to exactly one segment.
In at least one embodiment, one or more clients 199 are backed up periodically by scheduled command. The virtual tape library (“VTL”) or network file system (“NFS”) protocol may be used to back up a client 199.
Alternatively, the chunking and hashing may be performed by the client 199, and only the hashes may be sent to the front-end node 118. Other variations are possible.
As described above, interleaved data may originate from different sources or streams. For example, different threads may multiplex data into a single file, resulting in interleaved data. Each hash corresponds to a chunk. In at least one embodiment, the number of hashes received corresponds to chunks with lengths totaling three times the length of an average segment. Although the system is discussed using interleaved data as an example, in at least one example non-interleaved data is handled similarly.
At 206, locations of previously stored copies of the data chunks are determined. In at least one example, a query for location information is made to the backend nodes 116, 120, 122, and the locations are received as results of the query. In one implementation, the front-end node 118 may broadcast the sequence of hashes to the backend nodes 116, 120, 122, each of which may then determine which of its locations 106 contain copies of the data chunks corresponding to the sent hashes and send the resulting location information back to the front-end node 118. In a single-node implementation, the determination may be made directly, without any need for communication between nodes.
For each data chunk, it may be determined which locations already contain copies of that data chunk. This determining may make use of heuristics. In some implementations, this determining may only be done for a subset of the data chunks.
The locations may be as general as a group or cluster of backend nodes or a particular backend node, or as specific as a chunk container (e.g., a file or disk portion that stores chunks) or other particular location on a specific backend node. Determining locations may comprise searching for one or more of the hashes in an index 105, such as a full chunk index or a sparse index, or in a set or filter 107, such as a Bloom filter. The determined locations may be a group of backend nodes 116, 120, 122, a particular backend node 116, 120, 122, chunk containers, stores, or storage nodes. For example, each backend node may return to the front-end node 118 a list of sets of chunk container identification numbers, each set pertaining to the corresponding hash/data chunk, with the chunk container identification numbers identifying the chunk containers stored at that backend node in which copies of that data chunk are stored. These lists can be combined on the front-end node 118 into a single list that gives, for each data chunk, the chunk container ID/backend number pairs identifying the chunk containers containing copies of that data chunk.
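As a sketch of how the front-end node 118 might combine the per-backend answers into such a single list (the query_locations interface is hypothetical; only the combining step follows the description above):

```python
def combine_locations(hashes, backends):
    """For each chunk hash, collect (backend_id, container_id) pairs.

    `backends` maps a backend id to an object whose (hypothetical)
    query_locations(hashes) returns, per hash, the set of chunk container
    IDs on that backend that hold a copy of the corresponding chunk.
    """
    locations = {h: set() for h in hashes}
    for backend_id, backend in backends.items():
        per_hash_containers = backend.query_locations(hashes)   # broadcast query
        for h, containers in zip(hashes, per_hash_containers):
            for container_id in containers:
                locations[h].add((backend_id, container_id))
    return locations   # hash -> set of (backend, container) pairs
```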
In another embodiment, the returned information identifies only which data chunks that backend node has copies of. Again, the information can be combined to produce a list giving, for each data chunk, the set of backend nodes containing copies of that data chunk.
In yet another embodiment that has only a single node, the determined information may consist simply of a list of sets of chunk container IDs because there is no need to distinguish between different backend nodes. As the skilled practitioner is aware, there are many different ways location information can be conveyed.
At 208, a breakpoint in the sequence of chunks is determined based at least in part on the determined locations. This breakpoint may be used to form a boundary of a segment of data chunks. For example, if no segments have yet been produced, then the first segment may be generated as the data chunks from the beginning of the sequence to the data chunk just before the determined breakpoint. Alternatively, if some segments have already been generated then the next segment generated may consist of the data chunks between the end of the last segment generated and the newly determined breakpoint.
Determining a breakpoint may comprise determining regions in the sequence of data chunks based in part on which data chunks have copies in the same determined locations, and then determining the breakpoint in the sequence of data chunks based on the regions. For example, the regions may be determined such that at least 90% of the data chunks in each region that have determined locations have previously stored copies in a single location. That is, for each region there is a location in which at least 90% of that region's data chunks with determined locations have previously stored copies. A breakpoint in the sequence of data chunks may then be determined based on the regions.
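The 90% test could be checked for a candidate region roughly as follows (a sketch; representing each chunk's determined locations as a set, and the function name, are assumptions):

```python
from collections import Counter


def predominant_location(chunk_location_sets, threshold=0.9):
    """Return a location holding copies of >= threshold of the located chunks.

    chunk_location_sets holds, for each data chunk in a candidate region,
    the set of locations determined for it (empty for new chunks).  Chunks
    with no determined location are ignored, per the rule above.  Returns
    None if no single location reaches the threshold; the 90% default
    follows the example in the text.
    """
    located = [locs for locs in chunk_location_sets if locs]
    if not located:
        return None
    counts = Counter(loc for locs in located for loc in locs)
    best_location, best_count = counts.most_common(1)[0]
    return best_location if best_count >= threshold * len(located) else None
```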
Hashes and chunks corresponding to the same or similar locations may be grouped. For example, the front-end node 118 may group the hashes and corresponding data chunks associated with one location into a segment, and may group the adjacent hashes and corresponding data chunks associated with a different location into another segment. As such, the breakpoint is determined to lie between the two segments.
The front-end node 118 may deduplicate the newly formed segment against one of the backend nodes as a whole. That is, the segment may be deduplicated only against data contained in one of the backend nodes and not against data contained in the other backend nodes. This is in contrast to, for example, the first half of a segment being deduplicated against one backend node and the second half of the segment being deduplicated against another backend node. In at least one embodiment, the data contained in a backend node may be in storage attached to the backend node, under the control of the backend node, or the primary responsibility of the backend node, rather than physically part of it.
The segment may be deduplicated only against data contained in one of a plurality of nodes. In one embodiment, the chosen backend node 116, 120, or 122 identifies the storage locations 106 against which the segment will be deduplicated.
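A routing step consistent with this behavior might be sketched as follows (the deduplicate_segment call on a backend node is a hypothetical interface; the point is only that the entire segment is sent to a single backend node chosen from the segment's associated location):

```python
def route_segment(segment_chunks, associated_location, backends, default_backend):
    """Send a whole segment to one backend node for deduplication.

    associated_location identifies the backend against whose data most of
    the segment's chunks were already found; if the segment contains only
    new data, fall back to a default (e.g., round-robin) choice.
    """
    target = backends.get(associated_location, default_backend)
    target.deduplicate_segment(segment_chunks)   # dedupe against that node only
    return target
```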
The system described above may be implemented on any particular machine or computer with sufficient processing power, memory resources, and throughput capability to handle the necessary workload placed upon the computer.
In various embodiments, the computer-readable storage device 388 comprises a non-transitory storage device such as volatile memory (e.g., RAM), non-volatile storage (e.g., Flash memory, hard disk drive, CD ROM, etc.), or combinations thereof. The computer-readable storage device 388 may comprise a computer or machine-readable medium storing software or instructions 384 executed by the processor(s) 382. One or more of the actions described herein are performed by the processor(s) 382 during execution of the instructions 384.
Shown below the chunks are a number of regions, R1 through R6. For example, region R1 comprises chunks 1 through 3 and region R2 comprises chunks 3 through 18. These regions (R1-R6) have been determined by finding the maximal continuous subsequences such that each subsequence has an associated location and every data chunk in that subsequence either has that location as one of its determined locations or has no determined location. For example, region R1's associated location is 5; one of its chunks (#2) has 5 as one of its determined locations and the other two chunks (#1 and #3) have no determined location. Similarly, R2's associated location is 1, R3's and R6's associated location is 2, R4's associated location is 4, and R5's associated location is 3.
Each of these regions is maximal because it cannot be extended in either direction by even one chunk without violating the example region generation rule. For example, chunk 4 cannot be added to region R1 because it has a determined location and none of its determined locations is 5. Each region represents a swath of data that resides in one location; thus a breakpoint in the middle of a region will likely cause loss of deduplication. Because new data (e.g., data chunks without locations) can be stored anywhere without risk of creating intermediate duplication, the new data effectively acts like a wildcard, allowing it to be part of any region, thus extending the region.
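The maximal-region rule just described might be implemented along these lines (a sketch; representing locations as per-chunk sets and skipping runs made up purely of new data are simplifying assumptions):

```python
def find_regions(chunk_locations):
    """Compute maximal regions from per-chunk location sets.

    chunk_locations[i] is the set of determined locations for chunk i
    (empty for new chunks with no previously stored copy).  For every
    location, a region is a maximal run of chunks in which each chunk
    either has that location among its determined locations or has no
    determined location at all (new data acts as a wildcard).  Runs
    containing no located chunk at all are skipped to keep the sketch simple.
    """
    n = len(chunk_locations)
    all_locations = set().union(*chunk_locations) if chunk_locations else set()
    regions = []                                  # (start, end_inclusive, location)
    for loc in all_locations:
        i = 0
        while i < n:
            if chunk_locations[i] and loc not in chunk_locations[i]:
                i += 1                            # chunk conflicts with this location
                continue
            start = i
            has_located_chunk = False
            while i < n and (not chunk_locations[i] or loc in chunk_locations[i]):
                has_located_chunk |= loc in chunk_locations[i]
                i += 1
            if has_located_chunk:
                regions.append((start, i - 1, loc))
    return sorted(regions, key=lambda r: (r[0], r[1]))
```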
There are many ways of determining regions. For example, regions need not be maximal but may be required to end with data chunks having determined locations. In another example, in order to deal with noise, regions may be allowed to incorporate a small number of data chunks whose determined locations do not include the region's primary location.
In another example, new data chunks may be handled differently. Instead of treating their locations as wildcards, able to belong to any region, they may be regarded as being located in both the determined location of the nearest chunk to the left with a determined location and the determined location of the nearest chunk to the right with a determined location. If the nearest chunk with a determined location is too far away (e.g., exceeds a distance threshold), then its determined locations may be ignored. Thus new data chunks too far away from old chunks may be regarded as having no location, and thus either incorporable in no region or incorporable only in special regions. Such a special region may be one that contains only similar new data chunks far away from old data chunks, in at least one example. In another example, new data chunks may be regarded as being in the determined locations of the nearest data chunk with a determined location.
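A sketch of the nearest-neighbour variant just described (the distance threshold and the union of the left and right neighbours' locations follow the text; the function itself is hypothetical):

```python
def locations_for_new_chunk(i, determined, max_distance):
    """Assign locations to a new (unlocated) chunk from its neighbours.

    determined[i] is the set of determined locations for chunk i (empty for
    new chunks).  Take the union of the determined locations of the nearest
    located chunk on each side, ignoring neighbours farther than
    max_distance chunks away; an empty result means "no location".
    """
    result = set()
    for j in range(i - 1, max(-1, i - 1 - max_distance), -1):   # scan left
        if determined[j]:
            result |= determined[j]
            break
    for j in range(i + 1, min(len(determined), i + 1 + max_distance)):  # scan right
        if determined[j]:
            result |= determined[j]
            break
    return result
```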
Because “breaking” (i.e., determining boundaries) the middle of a region is likely to cause duplication, it should be avoided if possible. Moreover, breaking in the middle of a larger region rather than a smaller region and breaking closer to the middle of a region will likely cause more duplication. As such, these scenarios should be minimized as well. By taking the regions into account, an efficient breakpoint may be determined based on the regions. Efficient breakpoints cause less duplication of stored data.
There are many ways of determining boundaries. One example involves focusing on preserving the largest regions, e.g., selecting the largest region and shortening the parts of the other regions that overlap it. Shortening here means making the smaller region just small enough that it does not overlap the largest region; this may require removing the smaller region entirely if it is completely contained in the largest region.
Potential breakpoints may lie just before the first chunk and after the last chunk of each of the three resulting regions.
Many variations of this implementation are possible. For example, instead of shortening regions, rules may comprise discarding maximal regions below a threshold size and prioritizing the resulting potential breakpoints by how large their associated regions are. Lower priority breakpoints might be used only if higher priority breakpoints fall outside the minimum and maximum segment size requirements.
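One concrete way to apply such prioritization (a sketch of the variant just described; the tie-breaking by region size and the return convention are assumptions):

```python
def choose_breakpoint(regions, segment_start, min_size, max_size, size_threshold):
    """Pick a breakpoint from region boundaries (illustrative variant).

    regions is a list of (start, end_inclusive, location) tuples.  Regions
    smaller than size_threshold chunks are discarded; the remaining region
    boundaries become candidate breakpoints, tried in order of decreasing
    region size, and the first candidate yielding a segment length within
    [min_size, max_size] wins.  Returns None if no candidate qualifies.
    """
    candidates = []
    for start, end, _loc in regions:
        size = end - start + 1
        if size < size_threshold:
            continue                          # ignore small, noisy regions
        for candidate in (start, end + 1):    # just before / just after the region
            candidates.append((size, candidate))
    for _size, candidate in sorted(candidates, reverse=True):
        if min_size <= candidate - segment_start <= max_size:
            return candidate
    return None
```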
In at least one example, two potential breakpoints are separated by new data not belonging to any region. In such a case, the breakpoint could be determined to be anywhere between the two potential breakpoints without affecting which regions get broken. In various examples, different rules would allow for selection of breakpoints in the middle between the regions or at one of the region ends.
The above discussion is meant to be illustrative of the principles and various embodiments of the present invention. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.