Computer networks can include storage systems that are used to store and retrieve data on behalf of computers on the network. In some storage systems, particularly large-scale storage systems (e.g., those employing distributed segmented file systems), it is common for certain items of data to be stored in multiple places in the storage system. For example, data duplication can occur when two or more files have some data in common, or where a particular set of data appears in multiple places within a given file. In another example, data duplication can occur if the storage system is used to back up data from several computers that have common files. Thus, storage systems can include the ability to “deduplicate” data, which is the ability to identify and remove duplicate data.
Some embodiments of the invention are described with respect to the following figures:
De-duplication in distributed file systems is described. In an embodiment, key classes are determined from a set of potential keys. The potential keys are those that could be used to represent file content in the file system. Control of the key classes is apportioned among index nodes of the file system. Nodes in the file system deduplicate data chunks of file content (e.g., portions of data content, as described below). During deduplication, the nodes generate keys calculated from the data chunks. The keys are distributed among the index nodes based on relations between the keys and the key classes controlled by the index nodes. Various embodiments are described below by referring to several examples.
A distributed file system can be scalable, in some cases massively scalable (e.g., hundreds of nodes and storage segments). Keeping track of individual elements of file content for purposes of deduplication in an environment having a large number of storage segments controlled by a large number of nodes can be challenging. Further, a distributed file system is designed to be capable of scaling up linearly by growing storage and processing capacities on demand. Example file systems described herein provide for deduplication capability that can scale along with the distributed file system. The knowledge of existing items of file content (e.g., keys calculated from data chunks) is decentralized and distributed over multiple index nodes, allowing the distributed knowledge to grow along with other parts of the file system with additional resources.
In a distributed file system, the number of distinct data chunks and associated keys can be very large. Multiple nodes in the system continuously generate new file data that has to be deduplicated. In example implementations described herein, the full set of potential keys that can represent data chunks of file content is divided deterministically into subsets of keys or “key classes.” Control of the key classes is distributed over multiple index nodes that communicate with nodes performing deduplication. As the number of unique keys calculated from data chunks increases, and/or as the number of nodes performing deduplication increases, the number of index nodes can be increased and control of the key classes redistributed to balance the indexing load. Example implementations may be understood with reference to the drawings below.
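As an illustration of redistributing control of key classes when index nodes are added, the following minimal sketch assigns key class identifiers to index nodes in round-robin order. The round-robin rule, the function names, and the node names are assumptions made for illustration only; the description does not prescribe a particular apportioning rule.

```python
from collections import Counter


def apportion(num_classes, index_nodes):
    """Assign each key class id (0..num_classes-1) to an index node.

    Round-robin assignment is an assumption of this sketch, not a rule
    taken from the description.
    """
    if not index_nodes:
        raise ValueError("at least one index node is required")
    return {cls: index_nodes[cls % len(index_nodes)] for cls in range(num_classes)}


def classes_per_node(distribution):
    """Count how many key classes each index node controls."""
    return Counter(distribution.values())


# Hypothetical node names: two index nodes, later grown to four.
before = apportion(8, ["index-1", "index-2"])
after = apportion(8, ["index-1", "index-2", "index-3", "index-4"])
```

In this sketch, doubling the number of index nodes halves the number of key classes each node controls. A practical scheme might additionally try to limit how many key classes change owners during redistribution.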
The file system 100 can serve clients 102. The clients 102 are sources and consumers of file data. The file data can include files, data streams, and like type data items capable of being stored in the file system 100. The clients 102 can be any type of device capable of sourcing and consuming file data (e.g., computers). The clients 102 communicate with the file system 100 over a network 105. The clients 102 and the file system 100 can exchange data over the network 105 using various protocols, such as network file system (NFS), server message block (SMB), hypertext transfer protocol (HTTP), file transfer protocol (FTP), or like type protocols. To store file data, the clients 102 send the file data to the file system 100.
The entry point nodes 104 manage storage and deduplication of the file data in the file system 100. The entry point nodes 104 provide an “entry” for file data into the file system 100. The entry point nodes 104 are generally referred to herein as deduplicating or deduplication nodes. The entry point nodes 104 can be implemented using at least one computer (e.g., server(s)). The entry point nodes 104 determine data chunks from the file data. A “data chunk” is a portion of the file data (e.g., a portion of a file or file stream). The entry point nodes 104 can divide the file data into data chunks using various techniques. In an example, the entry point nodes 104 can determine every N bytes in the file data to be a data chunk. In another example, the data chunks can be of different sizes. The entry point nodes 104 can use an algorithm to divide the file data on “natural” boundaries to form the data chunks (e.g., using a Rabin fingerprinting scheme to determine variable sized data chunks). The entry point nodes 104 also generate keys calculated from the data chunks. A “key” is a data item that represents a data chunk (e.g., a fingerprint for a data chunk). The entry point nodes 104 can generate keys for the data chunks using a mathematical function. In an example, the keys are generated using a hash function, such as MD5, SHA-1, SHA-256, SHA-512, or like type functions.
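The fixed-size chunking and key generation described above can be sketched as follows. This is a minimal illustration, assuming fixed N-byte chunks and SHA-256 keys; the function names are hypothetical, and variable-size (e.g., Rabin fingerprint based) chunking is not shown.

```python
import hashlib


def split_fixed(data, chunk_size):
    """Split data into consecutive chunks of chunk_size bytes.

    The final chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_key(chunk):
    """Key (fingerprint) for a data chunk: its SHA-256 digest."""
    return hashlib.sha256(chunk).digest()


def keys_for(data, chunk_size):
    """Keys for every fixed-size chunk of data, in order."""
    return [chunk_key(chunk) for chunk in split_fixed(data, chunk_size)]


# Two identical 8-byte chunks followed by a short trailing chunk.
sample = b"abcdefgh" * 2 + b"xy"
sample_chunks = split_fixed(sample, 8)
sample_keys = keys_for(sample, 8)
```

Identical chunks produce identical keys, which is what allows duplicates to be detected by comparing keys rather than chunk contents.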
To perform deduplication, the entry point nodes 104 obtain knowledge of which of the data chunks are duplicates (e.g., already stored by the storage subsystem 108). To obtain this knowledge, the entry point nodes 104 communicate with the index nodes 106. The entry point nodes 104 send indexing requests to the index nodes 106. The indexing requests include the keys representing the data chunks. The index nodes 106 respond to the entry point nodes 104 with indexing replies. The indexing replies can indicate which of the data chunks are duplicates, which of the data chunks are not yet stored in the storage subsystem 108, and/or which of the data chunks should not be deduplicated (reasons for not deduplicating are discussed below). Based on the indexing replies, the entry point nodes 104 send some of the data chunks and associated file metadata to the storage subsystem 108 for storage. For duplicate data chunks, the entry point nodes 104 can send only file metadata to the storage subsystem 108 (e.g., references to existing data chunks). In some examples, the entry point nodes 104 can send data chunks and associated file metadata to the storage subsystem 108 without performing deduplication. The entry point nodes 104 can decide not to deduplicate some data chunks based on indexing replies from the index nodes 106, or on information determined by the entry point nodes themselves. In an example, if the keys of two data chunks indicate that the data chunks are candidates for deduplication, the entry point nodes 104 can perform a full data compare of the data chunks to confirm that they are actually duplicates.
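The decision an entry point node makes from the indexing replies can be sketched as below. This is a simplified illustration: the reply format (a mapping from key to the location of an already stored chunk), the location strings, and the function name are assumptions, and the optional full data compare is omitted.

```python
def plan_writes(chunks, keys, duplicate_locations):
    """Decide, per chunk, whether to store the data or only a reference.

    duplicate_locations maps a key to the location of an already stored
    chunk, standing in for the contents of (hypothetical) indexing replies.
    """
    plan = []
    for chunk, key in zip(chunks, keys):
        location = duplicate_locations.get(key)
        if location is None:
            # Not yet stored: the chunk itself goes to the storage subsystem.
            plan.append(("store", chunk))
        else:
            # Duplicate: only metadata referencing the existing chunk is sent.
            plan.append(("reference", location))
    return plan


# Hypothetical reply: only the second key is already known.
example_plan = plan_writes(
    [b"aa", b"bb"],
    [b"k1", b"k2"],
    {b"k2": "segment-3/offset-4096"},
)
```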
The index nodes 106 control indexing of data chunks stored in the storage subsystem 108 based on keys. The index nodes 106 can be implemented using at least one computer (e.g., server(s)). The index nodes 106 maintain a key database storing relations based on keys. At least a portion of the key database can be stored by the storage subsystem 108. Thus, the index nodes 106 can communicate with the storage subsystem 108. In an example, a portion of the key database is also stored locally on the index nodes 106 (example shown below). The index nodes 106 receive indexing requests from the entry point nodes 104. The index nodes 106 obtain keys calculated for data chunks being deduplicated from the indexing requests. The index nodes 106 query the key database with the calculated keys, and generate indexing replies from the results.
The destination nodes 110 manage the storage nodes 112. The destination nodes 110 can be implemented using at least one computer (e.g., server(s)). The storage nodes 112 can be implemented using at least one non-volatile mass storage device, such as magnetic disks, solid-state devices, and the like. Groups of mass storage devices can be organized as redundant array of inexpensive disks (RAID) sets. The storage segments 113 are logical sections of storage within the storage nodes 112. At least one of the storage segments 113 can be implemented using multiple mass storage devices (e.g., in a RAID configuration for redundancy).
The storage segments 113 store data chunk files 114, metadata files 116, and index files 118. A particular storage segment can store data chunk files, metadata files, or index files, or any combination thereof. A data chunk file stores data chunks of file data. A metadata file stores file metadata. The file metadata can include pointers to data chunks, as well as other attributes (e.g., ownership, permissions, etc.). The index files 118 can store at least a portion of the key database managed by the index nodes 106 (e.g., an on-disk portion of the key database).
The destination nodes 110 communicate with the entry point nodes 104 and the index nodes 106. The destination nodes 110 provision and de-provision storage in the storage segments 113 for the data chunk files 114, the metadata files 116, and the index files 118. The destination nodes 110 communicate with the storage nodes 112 over links 120. The links 120 can include direct connections (e.g., direct-attached storage (DAS)), or connections through interconnect, such as fibre channel (FC), Internet small computer system interface (iSCSI), serial attached SCSI (SAS), or the like. The links 120 can include a combination of direct connections and connections through interconnect.
In an example, at least a portion of the entry point nodes 104, the index nodes 106, and the destination nodes 110 can be implemented using distinct computers communicating over a network 109. The nodes can communicate over the links 109 using various protocols. In an example, processes on the nodes can exchange information using remote procedure calls (RPCs). In an example, some nodes can be implemented on the same computer (e.g., an entry point node and a destination node). In such case, nodes can communicate over the links 109 using a direct procedural interface within the computer.
As noted above, the entry point nodes 104 generate keys calculated from data chunks of file content. The function used to generate the keys should have preimage resistance, second preimage resistance, and collision resistance. The keys can be generated using a hash function that produces message digests having a particular number of bits (e.g., the SHA-1 algorithm produces 160-bit message digests). Hence, there is a universe of potential keys that can be calculated for data chunks (e.g., SHA-1 allows 2^160 possible keys). In an example, the universe of potential keys is divided into subsets or classes of keys (“key classes”). Dividing a set of possible keys into deterministic subsets can be achieved by various methods. For example, assuming generation of keys from file content creates an even distribution of values, key classes can be identified by a particular number of bits (N bits) from a specified position in the message digest (e.g., N most significant bits, N least significant bits, N bits somewhere in the middle of the message digest whether contiguous or not, etc.). In such a scheme, the set of possible keys is divided into 2^N key classes.
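One way to identify a key class from the N most significant bits of a key can be sketched as follows. The function name is hypothetical, and the choice of the most significant bits is only one of the options mentioned above.

```python
import hashlib


def key_class(key, n_bits):
    """Return the key class id formed by the n_bits most significant bits.

    The result is an integer in the range 0 .. 2**n_bits - 1.
    """
    total_bits = len(key) * 8
    if not 0 < n_bits <= total_bits:
        raise ValueError("n_bits must be between 1 and the key length in bits")
    return int.from_bytes(key, "big") >> (total_bits - n_bits)


# A 160-bit SHA-1 key divided into 2**8 = 256 key classes.
digest = hashlib.sha1(b"example data chunk").digest()
class_id = key_class(digest, 8)
```

With N = 8 the class identifier is simply the first byte of the digest, so every key maps deterministically to exactly one of 256 key classes.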
In another example, key classes can be generated by identifying keys that are more likely to be generated from the file data (e.g., likely key classes). The key classes can be generated using a static analysis, heuristic analysis, or combination thereof. A static analysis can include analysis of file data related to known operating systems, applications, and the like to identify data chunks and consequent keys that are more likely to appear (e.g., expected keys calculated from expected file content). A heuristic analysis can be performed based on calculated keys for data chunks of file content over time to identify key classes that are most likely to appear during deduplication. An example heuristic can include identifying keys for well-known data patterns in the file data. In another example, key classes can be generated based on some Pareto of the data chunks under management (e.g., key classes can be formed such that k% of the keys belong to (100-k)% of the key classes, where k is between 50 and 100). In general, the universe of keys can be divided into some number of more likely key classes and at least one less likely class. In such a scheme, each key class may not represent the same number of keys (e.g., there may be some number of more likely key classes and then a single larger key class for the rest of the keys).
In yet another example, the key classes may not collectively represent the entire universe of potential keys. In such cases, key classes may be “representative key classes,” since not every key in the universe will fall into a class. For example, if the universe of potential keys can be divided into 2^N key classes using an N-bit identifier, then only a portion of such key classes may be selected as representative key classes. Heuristic analyses such as those described above may be performed to determine more likely key classes, with keys that are less likely not represented by a class. For example, if a Pareto analysis indicates that 80% of the keys belong to 20% of the key classes, only those 20% of key classes can be used as representative.
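A Pareto-style selection of representative key classes can be sketched as below. The sketch assumes that the class identifiers of previously calculated keys have been collected, and it picks the most frequent classes until a target percentage of the observed keys is covered. The function name and the greedy selection rule are assumptions for illustration.

```python
from collections import Counter


def representative_classes(observed_class_ids, coverage_percent):
    """Pick the most frequent key classes covering coverage_percent of keys.

    observed_class_ids holds one class id per observed key. Integer
    arithmetic is used for the threshold to avoid rounding surprises.
    """
    counts = Counter(observed_class_ids)
    total = sum(counts.values())
    chosen = []
    covered = 0
    for class_id, count in counts.most_common():
        if covered * 100 >= coverage_percent * total:
            break
        chosen.append(class_id)
        covered += count
    return chosen


# Hypothetical observations: ten keys spread over four key classes.
observed = [7] * 6 + [3] * 2 + [1, 5]
selected = representative_classes(observed, 80)
```

Here two of the four observed classes account for 80% of the keys, so only those two are selected as representative.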
In general, key classes are determined from the set of potential keys forming a “key class configuration.” Regardless of the key class configuration, control of the key classes is apportioned among the index nodes 106 (a “key class distribution”). Each of the index nodes 106 can control at least one of the key classes. The entry point nodes 104 maintain data indicative of the distribution of key class control among the index nodes 106 (“key class distribution data”). The entry point nodes 104 distribute indexing requests among the index nodes 106 based on relations between the keys and the key classes as determined from the key class distribution data. The entry point nodes 104 identify which of the index nodes 106 are to receive certain keys based on the key class distribution data that relates the index nodes 106 to key classes.
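Distribution of indexing requests according to key class distribution data can be sketched as follows. The sketch assumes key classes identified by the N most significant bits of each key and represents the key class distribution data as a dictionary from class identifier to a hypothetical index node name.

```python
from collections import defaultdict


def route_keys(keys, n_bits, distribution):
    """Group keys by the index node that controls each key's class.

    distribution maps a key class id to an index node name and stands in
    for the key class distribution data held by an entry point node.
    """
    requests = defaultdict(list)
    for key in keys:
        class_id = int.from_bytes(key, "big") >> (len(key) * 8 - n_bits)
        requests[distribution[class_id]].append(key)
    return dict(requests)


# Four key classes (N = 2) controlled by two hypothetical index nodes.
distribution = {0: "index-a", 1: "index-a", 2: "index-b", 3: "index-b"}
k0 = bytes([0x00, 0x01])  # top two bits 00 -> class 0
k1 = bytes([0x40, 0x02])  # top two bits 01 -> class 1
k3 = bytes([0xC0, 0x03])  # top two bits 11 -> class 3
routed = route_keys([k0, k1, k3], 2, distribution)
```

Each resulting list corresponds to the keys carried by one indexing request to one index node.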
In an example, the management node(s) 130 control the key class configuration and key class distribution in the file system 100. The management node(s) 130 can be implemented using at least one computer (e.g., server(s)). A user can employ the management node(s) 130 to establish a key class configuration and key class distribution. The management node(s) 130 can inform the index nodes 106 and/or the entry point nodes 104 of the key class distribution. In an example, the management node(s) 130 can collect heuristic data from nodes in the file system (e.g., the entry point nodes 104, the index nodes 106, and/or the destination nodes 110). The management node(s) 130 can use the heuristic data to generate at least one key class configuration over time (e.g., the key class configuration can change over time based on the heuristic data). The heuristic data can be generated using one or more of the heuristic analyses described above.
The index node 106-1 queries the key database 402 with the key(s) from the indexing request 404, and obtains query results. For those key(s) 406 not in the key database 402, the index node 106-1 can add such key(s) to the key database 402 along with respective proposed location(s) 408. The key(s) and respective proposed location(s) can be marked as provisional in the key database 402 until the associated data chunks are actually stored in the proposed locations. For each of the key(s) 406 in the key database 402, the query results can include a key record 410. The key record 410 can include a key value 412, a location 414, and a reference count 416. The reference count 416 indicates the number of times a particular data chunk associated with the key value 412 is referenced. The location 414 indicates where the data chunk associated with the key value 412 is stored in the storage subsystem 108. For each key in the key database 402, the index node 106-1 can update the reference count 416 and return the location 414 to the entry point node 104-1 in an indexing reply 418.
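The handling of a single key by an index node can be sketched as below. This is a minimal in-memory illustration: the class names, the location strings, the initial reference count of 1 for a new key, and the assumption that the proposed location arrives with the request are all hypothetical choices, not details taken from the description.

```python
from dataclasses import dataclass


@dataclass
class KeyRecord:
    location: str
    reference_count: int = 1   # assumption: a new key starts with one reference
    provisional: bool = True   # until the data chunk is actually stored


class KeyDatabase:
    """In-memory stand-in for the key database of one index node."""

    def __init__(self):
        self.records = {}

    def index(self, key, proposed_location):
        """Return (is_duplicate, location) for one key of an indexing request."""
        record = self.records.get(key)
        if record is None:
            # Unknown key: add it with its proposed location, marked provisional.
            self.records[key] = KeyRecord(proposed_location)
            return False, proposed_location
        # Known key: update the reference count and return the stored location.
        record.reference_count += 1
        return True, record.location

    def confirm_stored(self, key):
        """Clear the provisional mark once the data chunk has been stored."""
        self.records[key].provisional = False


db = KeyDatabase()
first = db.index(b"key-1", "segment-0/chunk-0")
db.confirm_stored(b"key-1")
second = db.index(b"key-1", "segment-1/chunk-9")
```

The second request for the same key is reported as a duplicate and receives the original location, so the entry point node can store a reference instead of the data chunk.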
In an example, the index node 106-2 can maintain a local database 516 of known representative keys within key class(es) managed by the index node 106-2 (known representative keys being representative keys stored in the key database 502). The index node 106-2 queries the local database 516 with the representative key 508 and obtains query results. If the representative key 508 is in the local database 516, the index node 106-2 queries the key database 502 with the representative key 508 to obtain query results. The query results can include at least one representative key record 518. Each of the representative key record(s) 518 can include a reference count 520 and a key group 522. The reference count 520 indicates how many times the key group 522 has been detected. The key group 522 includes a representative key value (RKV 524) and at least one non-representative key value (NRKV(s) 526). The key group 522 also includes a location 528 indicating where the data chunk associated with the representative key value 524 is stored, and location(s) 530 indicating where the data chunk(s) associated with the non-representative key value(s) 526 is/are stored.
The index node 106-2 attempts to match the key group 505 in the indexing request 504 with the key group 522 in one of the representative key record(s) 518. If a match is found, the index node 106-2 updates the corresponding reference count 520 and returns the location 528 and the location(s) 530 to the entry point node 104-2 in an indexing reply 532. If no match is found, the index node 106-2 attempts to add a representative key record 518 with the key group 505. In some examples, the key database 502 may have a limit on the number of representative key records that can be stored for each known representative key. If a new representative key record 518 cannot be added to the key database 502, then the index node 106-2 can indicate in the indexing reply 532 that the data chunks should be stored without deduplication. If the new representative key record 518 can be added to the key database 502, then the reference count 520 is incremented and the key group 505 and respective proposed locations 528 and 530 can be marked as provisional in the key database 502 until the associated data chunks are actually stored in the proposed locations.
If the representative key 508 is not in the local database 516, the index node 106-2 can add a representative key record 518 with the key group 505 to the key database 502. The index node 106-2 also updates the local database 516 with the representative key 508. The key group 505 and respective proposed locations 528 and 530 can be marked as provisional in the key database 502 until the associated data chunks are actually stored in the proposed locations.
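The matching of key groups against representative key records can be sketched as follows. For brevity, this illustration folds the local database and the key database into one in-memory dictionary, so the separate local lookup is not modeled. The class names, reply labels, and the per-key record limit value are assumptions.

```python
from dataclasses import dataclass


@dataclass
class GroupRecord:
    key_group: tuple           # (representative key, non-representative keys...)
    locations: tuple           # one location per key in key_group
    reference_count: int = 1   # assumption: first detection counts as one
    provisional: bool = True   # until the data chunks are actually stored


class RepresentativeIndex:
    """Simplified stand-in for the local database plus key database."""

    def __init__(self, max_records_per_key=2):
        self.max_records_per_key = max_records_per_key
        self.records = {}  # representative key -> list of GroupRecord

    def index(self, key_group, proposed_locations):
        """Return (status, locations) for one key group."""
        representative_key = key_group[0]
        candidates = self.records.setdefault(representative_key, [])
        for record in candidates:
            if record.key_group == key_group:
                record.reference_count += 1
                return "duplicate", record.locations
        if len(candidates) >= self.max_records_per_key:
            # Record limit reached: reply that the chunks bypass deduplication.
            return "store_without_dedup", None
        candidates.append(GroupRecord(key_group, proposed_locations))
        return "new", proposed_locations


idx = RepresentativeIndex(max_records_per_key=1)
group_1 = (b"R", b"a", b"b")
group_2 = (b"R", b"c")
reply_new = idx.index(group_1, ("s0", "s1", "s2"))
reply_dup = idx.index(group_1, ("x", "y", "z"))
reply_full = idx.index(group_2, ("p", "q"))
```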
In some examples, the entry point nodes 104 can select some data chunks to be stored in the storage subsystem 108 without performing indexing operations and hence without deduplication (“opportunistic deduplication”). This can remove the deduplication process from the write performance path and prevent indexing operations from negatively affecting efficiency of writes. The entry point nodes 104 can implement opportunistic deduplication using a policy based on various factors. In one example, the entry point nodes 104 can perform a heuristic analysis of the responsiveness of indexing replies from the index nodes 106 versus the responsiveness of the storage subsystem 108 storing data chunks. In another example, the entry point nodes 104 can track a ratio of newly seen to already known data chunks.
For example, some of the most attractive cases for deduplication are cloning of virtual machines. Such cloning originally creates complete duplicates of data. Later, as the virtual machines are actively used, the probability of seeing file data that could be deduplicated is lower. The entry point nodes 104 can learn, self-adjust, and eliminate deduplication attempts and associated penalties using opportunistic deduplication.
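A policy for opportunistic deduplication based on the ratio of newly seen to already known data chunks can be sketched as below. The class name, the threshold, and the minimum sample count are hypothetical, and the sketch covers only the ratio-based factor, not the responsiveness comparison.

```python
class OpportunisticPolicy:
    """Decide whether to keep sending indexing requests for new chunks."""

    def __init__(self, max_new_percent=90, min_samples=100):
        self.max_new_percent = max_new_percent  # assumed tuning parameter
        self.min_samples = min_samples          # assumed warm-up length
        self.new_chunks = 0
        self.known_chunks = 0

    def record(self, was_duplicate):
        """Record the outcome of one indexing reply."""
        if was_duplicate:
            self.known_chunks += 1
        else:
            self.new_chunks += 1

    def should_index(self):
        """Return True while deduplication attempts still appear worthwhile."""
        total = self.new_chunks + self.known_chunks
        if total < self.min_samples:
            return True  # not enough observations yet
        return self.new_chunks * 100 <= self.max_new_percent * total


policy = OpportunisticPolicy(max_new_percent=50, min_samples=4)
initial_decision = policy.should_index()
for _ in range(4):
    policy.record(was_duplicate=False)
later_decision = policy.should_index()
```

Once nearly all recent chunks are new, the policy stops requesting indexing and the chunks are written directly. A practical policy would likely keep sampling occasionally so that it can resume deduplication when duplicates reappear.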
As noted above, data chunks can be distributed through multiple storage segments 113. This allows sufficient throughput for placing new data in the storage subsystem 108. The entry point nodes 104 can decide which of the storage segments 113 should be used to store data chunks. In some examples, file data that includes data written to different files within a narrow time window can be placed into different storage segments 113. In some examples, entry point nodes 104 can distribute data chunks belonging to the same file or stream across several of the storage segments 113. Thus, the entry point nodes 104 can implement various RAID schemes by directing storage of data chunks across different storage segments 113. The destination nodes 110 can provide a service to the entry point nodes 104 that atomically pre-allocates space and increases the size of data chunk files.
In some examples, the destination nodes 110 can implement various tools 150 that maintain elements of the deduplicated environment. The tools can scale with the number of storage segments 113 and the number of key classes in the key class configuration. For example, the deduplication process performed by the entry point nodes 104 can be referred to as “in-line deduplication”, since the deduplication is performed as the file data is received. The destination nodes 110 can include an offline deduplication tool that scans the storage nodes 112 and performs further deduplication of selected files. The offline deduplication tool can also reevaluate and deduplicate data chunks that were left without deduplication through decisions by the entry point nodes 104 and/or the index nodes 106. The tools 150 can also include dcopy and dcmp utilities to efficiently copy and compare deduplicated files without moving or reading data. The tools 150 can include a replication tool for creating extra replicas of data chunk files, index files, and/or metadata files to increase availability and accessibility thereof. The tools 150 can include a tiering migration tool that can move data chunk files, index files, and metadata files to a specified set of storage segments. For example, index files can be moved to storage segments implemented using solid state mass storage devices for quicker access. Data chunk files that have not been accessed within a certain time period can be moved to storage segments implemented using spin-down disk devices. The tools 150 can include a garbage collector that removes empty data chunk files.
The IO interface 606 receives file data, communicates with a storage subsystem, and communicates with index nodes. The memory 608 stores key class distribution data 612. The key class distribution data 612 includes relations between index nodes and key classes. The key classes are determined from a set of potential keys used to represent file content.
In an example, the processor 602 implements a deduplicator 614 to provide the functions described below. The processor 602 can also implement an analyzer 615. The memory 608 can store code 616 that is executed by the processor 602 to implement the deduplicator 614 and/or analyzer 615. In some examples, the deduplicator 614 and/or analyzer 615 can be implemented as a dedicated circuit on the hardware peripheral(s) 610. For example, the hardware peripheral(s) 610 can include a programmable logic device (PLD), such as a field programmable gate array (FPGA), which can be programmed to implement the functions of the deduplicator 614 and/or analyzer 615.
The deduplicator 614 receives the file data from the IO interface 606. The deduplicator 614 determines data chunks from the file data, and generates keys calculated from the data chunks. The deduplicator 614 distributes (through the IO interface 606) the keys among the indexing nodes based on the key class distribution data 612. For example, the deduplicator 614 can match keys to key classes, and then identify index nodes that control the key classes from the key class distribution data 612. The deduplicator 614 deduplicates the data chunks for storage in the storage subsystem based on responses from the indexing nodes. For example, the indexing nodes can respond with which of the data chunks are already known and which are not known and should be stored. The deduplicator 614 can selectively send the data chunks to the storage subsystem based on the responses from the index nodes.
In some examples, the deduplicator 614 groups the keys into key groups. Each of the key groups includes a representative key that is a member of a key class. Key group(s) can also include at least one non-representative key that is not a member of a key class. The deduplicator 614 can send the key groups to the index nodes based on representative keys of the key groups and the key class distribution data 612. For example, the deduplicator 614 can match representative keys to key classes, and then identify index nodes that control the key classes from the key class distribution data 612.
In some examples, the deduplicator 614 implements opportunistic deduplication. The deduplicator 614 can select certain data chunks from the file data and send such data chunks to the storage subsystem to be stored without deduplication. Aspects of opportunistic deduplication are described above.
The analyzer 615 can collect statistics on the keys calculated from data chunks being deduplicated. The analyzer 615 can perform a heuristic analysis of the statistics to generate heuristic data. The heuristic data can be used to identify likely key classes that can form a key class configuration. Various heuristic analyses have been described above. The analyzer 615 can process the heuristic data itself. In another example, the analyzer 615 can send the heuristic data to other node(s) (e.g., the management node(s) 130 shown in
The IO interface 706 communicates with a storage subsystem that stores at least a portion of a key database. The IO interface 706 receives indexing requests from deduplicating nodes. The indexing requests can include calculated keys for data chunks being deduplicated. The calculated keys are members of a key class assigned to the node. The key class is one of a plurality of key classes determined from a set of potential keys.
In an example, the processor 702 implements an indexer 712 to provide the functions described below. The memory 708 can store code 714 that is executed by the processor 702 to implement the indexer 712. In some examples, the indexer 712 can be implemented as a dedicated circuit on the hardware peripheral(s) 710. For example, the hardware peripheral(s) 710 can include a programmable logic device (PLD), such as a field programmable gate array (FPGA), which can be programmed to implement the functions of the indexer 712.
The indexer 712 receives the indexing requests from the IO interface 706 and obtains the calculated keys. The indexer 712 queries the key database to obtain query results. The query results can include, for example, information indicative of whether calculated keys are known. The indexer 712 sends responses (through the IO interface 706) to the deduplicating nodes based on the query results to provide deduplication of the data chunks for storage in the storage system.
In an example, the calculated keys in the indexing request can be grouped into key groups. Each of the key groups includes a representative key that is a member of the key class assigned to the node. Key group(s) can also include at least one non-representative key that is not part of any of the key classes. The indexer 712 can obtain key records from the key database based on representative keys of the key groups. In an example, each of the key records can include values for each representative and non-representative key therein, and locations in the storage subsystem for data chunks associated with each representative and non-representative key therein. In an example, the storage subsystem stores a first portion of the key database, and the memory 708 stores a second portion of the key database (a “local database 716”). The local database 716 includes representative keys for data chunks stored by the storage subsystem.
De-duplication in distributed file systems has been described. The knowledge of existing items of file content (e.g., keys calculated from data chunks) is decentralized and distributed over multiple index nodes, allowing the distributed knowledge to grow along with other parts of the file system with additional resources. In example implementations, the full set of potential keys that can represent data chunks of file content is divided into key classes. The key classes can cover all of the universe of potential keys, or only a portion of such key universe. Control of the key classes is distributed over multiple index nodes that communicate with deduplicating nodes. As the number of unique keys calculated from data chunks increases, and/or as the number of nodes performing deduplication increases, the number of index nodes can be increased and control of the key classes redistributed to balance the indexing load. The deduplicating nodes can employ opportunistic deduplication by selectively storing some file content without deduplication to improve write performance.
The methods described above may be embodied in a computer-readable medium for configuring a computing system to execute the method. The computer readable medium can be distributed across multiple physical devices (e.g., computers). The computer readable media may include, for example and without limitation, any number of the following: magnetic storage media including disk and tape storage media; optical storage media such as compact disk media (e.g., CD-ROM, CD-R, etc.) and digital video disk storage media; holographic memory; nonvolatile memory storage media including semiconductor-based memory units such as FLASH memory, EEPROM, EPROM, ROM; ferromagnetic digital memories; volatile storage media including registers, buffers or caches, main memory, RAM, etc., just to name a few. Other new and various types of computer-readable media may be used to store machine readable code discussed herein.
In the foregoing description, numerous details are set forth to provide an understanding of the present invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these details. While the invention has been disclosed with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover such modifications and variations as fall within the true spirit and scope of the invention.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US2011/040316 | 6/14/2011 | WO | 00 | 11/14/2013 |