This application is related to U.S. patent application Ser. Nos. 14/322,813, 14/322,832, 14/684,086, 14/322,850, 14/322,855, 14/322,867, 14/322,868, 14/322,871, and 14/723,380, which are all hereby incorporated by reference.
The present invention relates generally to synchronization of metadata used to store data. More specifically, the present invention relates to compaction and synchronization of metadata for faster access within a data center.
In the field of data storage, enterprises have used a variety of techniques in order to store the data that their software applications use. At one point in time, each individual computer server within an enterprise running a particular software application (such as a database or e-mail application) would store data from that application in any number of attached local disks. Although this technique was relatively straightforward, it led to storage manageability problems in that the data was stored in many different places throughout the enterprise.
Currently, enterprises typically use a remote data center or data centers, each hosting a storage platform on which to store the data, files, etc., of the enterprise. A computer server of an enterprise may now host many different server applications, each application needing to store its data to the data center or centers. Accompanying this huge amount of data is the metadata that helps to describe the data including the location of the data, replication information, version numbers, etc. Because of the quantity of metadata, the techniques used to store it, and the need to access the metadata, numerous problems have arisen.
For example, because the metadata for a particular write operation or for a particular block of data may be stored on two or more metadata nodes, it is important that this metadata be synchronized between nodes, i.e., the metadata for a particular write should be the same on each metadata node. Various events can affect synchronization: a time delay between the storage of metadata on nodes or between data centers in a distributed storage system can allow an application to inadvertently read old metadata; a metadata node failure either means that metadata may not be read from that node or that the node will contain stale metadata when it comes back online; a disk failure of a metadata node also prevents reading of the correct metadata; a data center failure prevents reading of metadata from that center or means that a strong read of metadata cannot be performed; and, metadata may become corrupted in other manners.
Synchronization should be performed quickly, but many prior art techniques do not perform synchronization fast enough or are disadvantageous for other reasons. By way of example, the Dynamo database used by Amazon, Inc. uses a Merkle Tree to identify differences between two metadata sources. A Merkle Tree is built by hashing pieces of input metadata, then hashing the resulting hashes into higher-level hashes and so on recursively, until a single hash, the root of the Merkle Tree, is generated. Maintaining such a structure requires more meta-metadata, and keeping it updated as metadata is modified is computationally intensive.
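For illustration only, the Merkle Tree construction described above may be sketched as follows. This is a minimal sketch and not the Dynamo implementation: the function name merkle_root, the use of SHA-256, and the convention of duplicating the last hash on odd-sized levels are all assumptions chosen for brevity.

```python
import hashlib


def merkle_root(pieces):
    """Return the root hash of a Merkle tree built over a list of byte strings.

    Hypothetical helper: SHA-256 and the odd-level duplication rule are
    assumptions, not details taken from the text.
    """
    if not pieces:
        raise ValueError("at least one piece is required")
    # Leaf level: hash each piece of input metadata.
    level = [hashlib.sha256(p).digest() for p in pieces]
    # Hash pairs of hashes into higher-level hashes until one remains.
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed convention for odd-sized levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

Every level above the leaves is additional meta-metadata that must be recomputed whenever a leaf changes, which is the maintenance cost noted above.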
The Rsync tool is used to synchronize files between two servers. It uses less metadata than a Merkle Tree-based approach, but, because it is oblivious to the data that it is synchronizing, it cannot be used in a bidirectional manner. In other words, it cannot be used to merge changes introduced at the two ends of the synchronization. The Rsync tool requires one of the servers to have the final version of the file which is to be copied to the other server. The technique known as Remote Differential Compression (RDC) available from Microsoft Corporation minimizes the copying of data by using data not in the file being synchronized but already present at the destination. As in the Rsync tool, however, RDC synchronization is unidirectional.
Accordingly, a synchronization technique for metadata is desirable that is faster, uses less meta-metadata, less storage and less bandwidth, and allows for bidirectional synchronization.
To achieve the foregoing, and in accordance with the purpose of the present invention, a metadata synchronization technique is provided that includes the advantages discussed below.
The disclosed metadata synchronization technique provides for faster synchronization between metadata nodes thus improving the reliability of weak reads (reading metadata from only a single metadata node). The technique also synchronizes between metadata stored in files, providing for finer granularity, uses less meta-metadata to perform the synchronization, less storage and less bandwidth. The invention compares replicas of metadata and ensures that the replicas agree on a superset of the metadata contained therein without losing any metadata.
Unlike the prior art Dynamo database that uses meta-metadata just in order to synchronize the regular metadata for all levels of the Merkle Tree, the present invention only uses meta-metadata in what would be the first level of a Merkle Tree, therefore using a much smaller quantity of meta-metadata.
During storage and compaction of string-sorted tables (SSTs), a consistent file identification scheme is used across all metadata nodes. A fingerprint file is created for each SST that includes a start-length-hash value triple for each region of the SST. Replicas of SSTs are compared and synchronized by comparing their fingerprint files resulting in a faster and more efficient synchronization process.
In a first embodiment, metadata is synchronized between two metadata files using a fingerprint file for each metadata file. Any missing metadata is sent to the file missing the metadata.
In a second embodiment, metadata is stored on a computer node into a table on disk. The table is compacted with other tables on disk. A fingerprint value is calculated for the table and stored in the data storage platform.
In a third embodiment, metadata from a mutation is stored onto two metadata nodes and is flushed into tables. The tables are compared using their respective fingerprint files and missing metadata from one table is sent to the other table for compaction.
The invention, together with further advantages thereof, may best be understood by reference to the following description taken in conjunction with the accompanying drawings in which:
Computer nodes 30-40 are shown logically being grouped together, although they may be spread across data centers and may be in different geographic locations. A management console 40 used for provisioning virtual disks within the storage platform communicates with the platform over a link 44. Any number of remotely-located computer servers 50-52 each typically executes a hypervisor in order to host any number of virtual machines. Server computers 50-52 form what is typically referred to as a compute farm. As shown, these virtual machines may be implementing any of a variety of applications such as a database server, an e-mail server, etc., including applications from companies such as Oracle, Microsoft, etc. These applications write to and read data from the storage platform using a suitable storage protocol such as iSCSI or NFS, although each application may not be aware that data is being transferred over link 54 using a proprietary protocol.
Management console 40 is any suitable computer able to communicate over an Internet connection 44 with storage platform 20. When an administrator wishes to manage the storage platform (e.g., provisioning a virtual disk, snapshots, revert, clone, analyze metrics, determine health of cluster, etc.) he or she uses the management console to access the storage platform and is put in communication with a management console routine executing as part of metadata module 130 on any one of the computer nodes within the platform. The management console routine is typically a Web server application.
In order to provision a new virtual disk within storage platform 20 for a particular application running on a virtual machine, the virtual disk is first created and then attached to a particular virtual machine. In order to create a virtual disk, a user uses the management console to first select the size of the virtual disk (e.g., 100 GB), and then selects the individual policies that will apply to that virtual disk. For example, the user selects a replication factor, a data center aware policy and other policies concerning whether or not to compress the data, the type of disk storage, etc. Once the virtual disk has been created, it is then attached to a particular virtual machine within one of the computer servers 50-52 and the provisioning process is complete.
As mentioned above, each computer server may host any number of virtual machines, each executing a particular software application. Each virtual machine (or more specifically, the application executing on the virtual machine) is able to communicate with the storage platform using any of a variety of protocols. Each server 51 may also include a specialized controller virtual machine (CVM) that is specially adapted to handle communications with the virtual machines and communicates with the storage platform. The CVM also uses a memory cache on the computer server 51. In communication with computer 51 and with the CVM are any number of solid-state disks (or other similar memory).
Preferably, all information concerning a particular virtual disk attached to a CVM is organized into a virtual disk object and then stored into the memory cache of the CVM. A hash table is used to store these virtual disk objects and the key to find each object is the name of the virtual disk. Stored within this cache is the generation number, virtual disk information and the metadata nodes indicating on which nodes the metadata for this virtual disk is stored.
Although shown as three modules, each of the modules runs independently on each of the computer nodes within the platform 20. Also, associated with each module on each node is a memory cache 122, 132 and 142 that stores information used by that module; each module on each computer node may also use persistent storage on that node. A file (for example) may be stored on nodes 32, 34 and 36 (so-called data nodes), and the metadata for that file may be stored on three different nodes; nodes for a file that store metadata are referred to as “metadata nodes.” The data nodes and metadata nodes for a particular stored file may be the same or may be different. The modules communicate with each other via a modified version of Gossip over TCP, and work in concert to manage the storage platform.
A wide variety of types of metadata may exist within storage system 10 or within any other type of storage system. The metadata module on each computer node handles the storage of this metadata by placing it into persistent storage. In one embodiment, a hash function is used upon the virtual disk name in order to produce a hash value which is then used to select three metadata nodes within the platform for that virtual disk. For example, metadata for a block of data may be stored upon nodes 36, 30 and 40 within the platform (which might be different from the nodes where the actual block data of the virtual disk has been stored).
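By way of a hedged example, the selection of three metadata nodes from a hash of the virtual disk name might look like the sketch below. The text does not specify the hash function or the mapping from hash value to nodes; SHA-256, the function name select_metadata_nodes, and the choice of consecutive nodes in a ring are assumptions made only to show that the selection can be deterministic.

```python
import hashlib


def select_metadata_nodes(virtual_disk_name, nodes, replicas=3):
    """Deterministically pick `replicas` distinct nodes for a virtual disk.

    Hypothetical mapping: hash the name, then take consecutive nodes from
    the ordered node list starting at the hashed position.
    """
    if len(nodes) < replicas:
        raise ValueError("not enough nodes for the requested replica count")
    digest = hashlib.sha256(virtual_disk_name.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]
```

Because the result depends only on the name and the ordered node list, any component that knows both can compute the same three metadata nodes without coordination.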
Typically, metadata is generated during and after a write operation and pertains to a particular block of data that has been stored. Each item of metadata includes a key-value-timestamp triplet, the key being a string, the value being a number of bytes, and the timestamp being the physical or logical time when the write occurred. The metadata that may be synchronized by the present invention encompasses many types and includes: mutation information after write requests (where data blocks were written, success and failure, virtual disk name, etc.); statistical information (metrics concerning the storage space being used) and any kind of information needed to keep the health and operational efficiency of the system. Mutation metadata is discussed in more detail below.
As mentioned earlier, while the data associated with a particular write request may end up on three different data nodes, the metadata associated with that write request will be stored using the metadata modules 130 on the computer nodes, and these metadata nodes may be different from the nodes used for data storage.
For a particular virtual disk “Vi” 230 (metadata for other virtual disks may also be stored on this metadata node), write information is stored symbolically in columns 232, 234, etc., each column corresponding to a particular chunk of the virtual disk. In one embodiment, there will be a new column for a given chunk if the version is incremented and one writes again into the first chunk. In this fashion, older versions of data are never overwritten or lost, they are all saved within the storage platform for later reference if necessary.
Within a chunk column 232 are individual block columns 240, 242, etc., including the metadata for the individual blocks of that chunk that have been written to the virtual disk. For example, column 240 includes the block number “1,” the computer nodes (A, B, C) to which that block was written, whether or not the write was a success, and a timestamp. Column 242 includes similar metadata for the second block. Within column 232 there will be 64 individual block columns due to the size of the blocks and the size of the chunks. Column 234 will also include the same number of block columns, for example, block column 246 identifies the block number “66,” and the metadata earlier described. In this fashion, the metadata for particular virtual disk 230 is stored upon one of the computer nodes using its metadata module, and includes an identification of where each of its blocks are stored, a version, a timestamp, etc.
In step 304 the virtual machine that desires to write data into the storage platform sends a write request including the data to be written to a particular virtual disk (supplied to the application by the administrator earlier) via the CVM. As mentioned, a write request may originate with any of the virtual machines on one of computers 50-52 and may use any of a variety of storage protocols. The write request typically takes the form: write (offset, size, virtual disk name) The parameter “virtual disk name” is the name of the virtual disk originally selected during provisioning. The parameter “offset” is an offset within the virtual disk (i.e., a value from 0 up to the size of the virtual disk), and the parameter “size” is the size of the data to be written in bytes.
Next, in step 308 the controller virtual machine determines which containers to use for this request based upon the offset and size parameters. In step 312 the CVM queries a metadata node to determine on which computer nodes the container should be stored. Because the particular metadata nodes on which the metadata for the virtual disk is stored had been previously cached by the CVM, the CVM can easily select one of these metadata nodes to query.
In step 316 the CVM then sends the write request (in this case, simply the data itself to be written) to one of the data nodes returned in the previous step (e.g., data node E). The write request also includes an indication of the other two data nodes (B, D) to which the data should be written. The data node that receives the request then writes the data to its disk drives and then forwards the data to the other two nodes. Once each of these nodes writes the data to its disk drives, each of these nodes returns an acknowledgment back to the first data node that had originally received the request from the CVM. A timestamp is also sent with the write request.
In step 320 this first data node (e.g., E) acknowledges that the write has occurred to the CVM and returns the names of the data nodes (e.g., B, D and E) where the data was written. Preferably, write operations do not overwrite older versions of data. In this fashion, earlier versions of data in a virtual disk are available to be read.
A computer node such as node “A” to which a block of data is written will have the metadata for that write operation stored on at least two metadata nodes within the platform, and preferably three. This metadata will be stored using the metadata module executing upon each node and stored within persistent storage associated with each metadata module. The metadata nodes to be used for a write operation on a particular data node may be determined (for example) by using a hash function on the unique identifier of a data node in order to produce a hash value. This hash value is then used to identify three computer nodes within the platform that will be the metadata nodes for that particular data. Other techniques may also be used to identify the metadata nodes to be used after a write operation: the choice may depend upon the name of the virtual disk, the data nodes, data offset, type of metadata, or upon some other factor, as long as the process is deterministic.
Typically, the flow diagram of
Various types of metadata may be generated during a write operation. One type, for example, keeps track of which blocks were written to a particular virtual disk and a container, and to which data nodes, as well as a timestamp for the write operation. This metadata may be captured by a table for each container of the virtual disk where columns represent block numbers, a first row includes cells having a timestamp (or no value if a write to a particular data node failed), and a second row includes cells having the identifier of the data node to which the block was written. Another type of metadata keeps track of failed writes, i.e., which blocks were not written to particular data nodes when they should have been written. This type may also be represented in the table mentioned above. A third type of metadata (actually, meta-metadata) is generated in step 416 as described below. Preferably, a different merge tree will be used for each type of metadata as will be explained in more detail below.
In one particular embodiment to generate the first type of metadata, the CVM calculates the block identifiers (i.e., blocks 1, 2, 3) within the virtual disk where the data has been stored and then sends this information to the metadata nodes. As is known in the art, disks are typically divided up into blocks (usually blocks of 4K) and data is written to, and read from, disks using blocks. Because the CVM is aware of the offset for the write request, the CVM then knows the block identifier for the first block to be written for the current write request. And, because the size of the write request is also known, the CVM is then able to easily calculate onto which data nodes blocks of data were written, and the corresponding block identifiers for those blocks of data. In the current example, the CVM calculates the block identifiers for those blocks of data in the current write request which were written to nodes B, D and E. Even if a write request spans two different containers, by simple calculation using the container size, offset, and size of the write request, the CVM will be able to determine which block identifiers were written to the first container and which block identifiers were written to the second container.
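The calculation described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name blocks_by_container is hypothetical, block identifiers are numbered from 0 here for simplicity, and the container size is passed as a parameter because the text does not fix it. Only the 4K block size comes from the text.

```python
BLOCK_SIZE = 4096  # 4K blocks, as mentioned in the text


def blocks_by_container(offset, size, container_size):
    """Map each container index to the block identifiers a write touches.

    `offset` and `size` are in bytes. Zero-based block numbering and the
    container_size parameter are assumptions for this sketch.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if container_size % BLOCK_SIZE != 0:
        raise ValueError("container size must be a multiple of the block size")
    first = offset // BLOCK_SIZE
    last = (offset + size - 1) // BLOCK_SIZE  # last block touched, inclusive
    blocks_per_container = container_size // BLOCK_SIZE
    result = {}
    for block in range(first, last + 1):
        result.setdefault(block // blocks_per_container, []).append(block)
    return result
```

A write that starts in the last block of one container and continues into the next yields two entries, one per container, which corresponds to the spanning case described above.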
In step 404 the CVM sends metadata (e.g., key-value-timestamp triplets) from the previous write operation to the designated metadata nodes for the current virtual disk.
Next, in step 408 each metadata node stores in its memory the key-value-timestamp triples received from the CVM. Metadata is stored in memory blocks and each node allocates any number of memory blocks in memory to store this metadata and the blocks in memory may be filled consecutively as the triplets arrive or they may be filled in parallel. Each type of metadata will be stored in its own set of memory blocks.
Memory blocks are tagged with increasing unique numerical identifiers, starting from 0, for example. Preferably, memory block identifiers, which will later serve as SST and file identifiers, are generated locally, and not by the CVM. Under ideal conditions, metadata would reach metadata nodes in the same order and be stored to blocks having the same identifier, but asynchronism and failures can prevent such behavior. Differences are minimized by piggybacking current identifiers on communication between metadata nodes and by increasing local identifiers to match those from other nodes whenever possible. As a result, blocks with a given identifier may be missing from any given node and blocks with the same identifiers in different nodes may not contain the same metadata. These differences do not constitute an inconsistency in the metadata. If a new triplet corresponding to newer data is stored within a memory block that already contains a triplet having the same key, then the old triplet is overwritten.
As applications write data to virtual disks (or as other processes of an enterprise or processes internal to the storage platform write data) metadata will continue to be stored on the metadata nodes in accordance with
At some point in time, in step 412 a metadata node will find it necessary (or desirable) to flush one of its blocks of memory (where the metadata triples have been stored) to a string-sorted table (SST) on disk. Again, each type of metadata will have its own set of blocks in memory and these sets of blocks will be flushed to a corresponding set of SSTs on disk corresponding to each type of metadata.
As mentioned, an SST may be stored as a file on a hard disk that contains a sequence of metadata triplets sorted by key. An SST stored on disk will have the SST identifier equal to the memory block that held its data before the flush. Memory blocks may be flushed to disk consecutively as they fill, or, blocks may be filled in memory in parallel and may be flushed when they become full. The loop symbol on step 412 indicates that this step occurs continuously as the memory blocks in the memory of the metadata node become full. Typically, the SST identifiers of SSTs continually increase, even if the metadata node restarts, and are not reset back to “0.” The metadata is kept sorted in memory of a node and the sorted metadata is sequentially flushed to disk.
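A memory block that overwrites older triplets and flushes its contents sorted by key might be sketched as below. The class name MemoryBlock and its methods are hypothetical. Ignoring a late-arriving triplet whose timestamp is older than the one already held is an assumption of this sketch; the text only states that a newer triplet overwrites an older one with the same key.

```python
class MemoryBlock:
    """In-memory block of key-value-timestamp triplets (illustrative only)."""

    def __init__(self, identifier):
        self.identifier = identifier  # reused as the SST identifier on flush
        self._entries = {}  # key -> (value, timestamp)

    def put(self, key, value, timestamp):
        """Store a triplet, replacing any older triplet with the same key."""
        current = self._entries.get(key)
        # Assumption: a triplet older than the stored one is ignored.
        if current is None or timestamp >= current[1]:
            self._entries[key] = (value, timestamp)

    def flush(self):
        """Return the triplets sorted by key, as they would be written to an SST."""
        return [(k, v, ts) for k, (v, ts) in sorted(self._entries.items())]
```

In this sketch the SST produced by flush() would be stored under the same identifier as the block, matching the identifier scheme described above.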
When memory blocks are flushed to disk when full, it is possible that metadata from a particular write operation stored in an SST as a replica on one metadata node may not correspond exactly to another replica on a different metadata node. That is one of the reasons for synchronizing. One technique used to address this non-correspondence is the following. During the process of writing metadata, metadata nodes communicate with each other and embedded in this communication is the information about which memory block identifiers they are using. Thus, if metadata for a particular write operation is being written to a memory block X at node A, that same metadata may be written to a memory block Y at node B, but during such a write, A and B learn about each other's memory block usage and one node may skip some identifiers in order to catch up with the other node. For example, if metadata node A is writing to memory block 1, yet metadata node B is writing the same metadata to memory block 3, then node A may skip and write to block 3 instead of to block 1. This is especially important when a node comes online after being down for a long time. This skipping of identifiers will cause gaps in memory block numbering, which will be reflected in the SSTs, since they carry the same identifier as the memory block they originated from. Having gaps in the SST identifiers implies that the merge tree (discussed later) will be missing some leaves, but because the compaction is deterministic regarding the files that must be compacted together, this compaction is not affected by the gaps.
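The catch-up behavior described above can be illustrated with a small sketch. The class BlockIdAllocator and its method names are hypothetical; the only behavior taken from the text is that a node raises its current identifier to match a higher identifier learned from a peer, which leaves gaps in its own numbering.

```python
class BlockIdAllocator:
    """Tracks the memory block identifier a metadata node is writing to."""

    def __init__(self, current=0):
        self.current = current

    def observe(self, remote_id):
        """Handle a peer's current identifier piggybacked on a message.

        The local identifier only moves forward, so identifiers between
        the old and new values are skipped and never used locally.
        """
        if remote_id > self.current:
            self.current = remote_id

    def next_block(self):
        """Advance to a fresh identifier when the current block is flushed."""
        self.current += 1
        return self.current
```

With node A at identifier 1 and node B reporting identifier 3, a call to observe(3) on node A moves it to block 3, and block 2 is never created on node A.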
As each memory block is flushed to an SST on disk, in step 416 a unique fingerprint is calculated for each SST and this meta-metadata is stored in persistent storage of any node of the storage platform where it can be accessed by any of the metadata modules 130 of the storage platform. In one embodiment, a rolling hash (such as Rabin's Hashing by Polynomials) is used to establish a series of boundaries throughout each SST, thus creating any number of regions of the SST (each region being bounded by two consecutive boundaries).
For each of these regions 542-548 of the SST, a hash value is calculated (using any suitable hashing technique) based upon the contents of each region, and the resulting hash value is combined with the start location of that region and length of that region to create a start-length-hash value triple (SLH triple) for each region. The resulting fingerprint of each SST is then a set of all of these SLH triples. A consecutive list of these SLH triples for the SST file 530 is then placed into a fingerprint file 550. This resulting meta-metadata (file 550 of SLH triples) may be processed and stored as metadata according to steps 408, 412 and 420, except that step 416 will not be performed upon this meta-metadata. Of course, other techniques may also be used to process and store this meta-metadata. Also, a fingerprint may be calculated for an SST by hashing slices of a predetermined size of the SST. This approach, however, is prone to small changes in input leading to great changes in the fingerprint, which may lead to a synchronization mechanism that assumes the SSTs to be more different than they actually are.
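A fingerprint made of SLH triples might be computed as in the following sketch. It is not the disclosed implementation: a simple polynomial rolling hash stands in for Rabin's method, SHA-256 is used as the region hash, and the constants WINDOW, BASE, MOD and MASK as well as the function names are assumptions chosen for readability.

```python
import hashlib

WINDOW = 16          # bytes covered by the rolling window (assumed)
BASE = 257           # polynomial base (assumed)
MOD = (1 << 31) - 1  # modulus for the rolling hash (assumed)
MASK = 0x3F          # boundary when the low 6 bits of the hash are zero (assumed)


def region_boundaries(data):
    """Return the end offsets of content-defined regions of `data`."""
    if not data:
        return []
    boundaries = []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            boundaries.append(i + 1)
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))  # the last region ends at end of file
    return boundaries


def fingerprint(data):
    """Return the list of (start, length, hash) triples for an SST's bytes."""
    triples = []
    start = 0
    for end in region_boundaries(data):
        region = data[start:end]
        triples.append((start, len(region), hashlib.sha256(region).hexdigest()))
        start = end
    return triples
```

Because each boundary depends only on the bytes inside the window, an insertion early in the file tends to change only nearby regions, which is the advantage over fixed-size slices noted above.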
Next, in step 420, all of the SSTs for each particular type of metadata are compacted on each metadata node. The files resulting from compaction are also stored in persistent storage on the applicable metadata node, e.g., in one of the storage pools 204. In this example, each SST is referred to as a file although other forms of storage are possible. Due to natural use of the storage system, metadata written earlier in the lifetime of the system may be superseded by metadata written later, that is, with the same key but a later timestamp. Compaction occurs by iterating through the keys of a group of files (in this case, four files at a time), and, if any keys are the same for any triples, deleting those triples having older timestamps so that only the newest triple for a particular key remains. Hence, the stale metadata does not move on to the resulting file, and as the input files are deleted, the system is freed of the useless information. Thus, four files are compacted into one file; of course, groups of three files, five files, or more, may also be used.
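The compaction rule described above (keep only the newest triple per key) can be sketched as follows. The function name compact and the in-memory list representation of an SST are assumptions; an actual implementation would more likely stream-merge the already sorted files rather than load them into a dictionary.

```python
def compact(ssts):
    """Merge SSTs into one, keeping only the newest triple for each key.

    Each SST is a list of (key, value, timestamp) triples. The result is
    sorted by key, as an SST is.
    """
    newest = {}  # key -> (value, timestamp)
    for sst in ssts:
        for key, value, ts in sst:
            if key not in newest or ts > newest[key][1]:
                newest[key] = (value, ts)
    return [(k, v, ts) for k, (v, ts) in sorted(newest.items())]
```

For instance, compacting four files in which key "a" appears with timestamps 1 and 3 leaves a single triple for "a" carrying timestamp 3.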
Preferably, a file identification scheme has certain characteristics, for example, the scheme is consistent and the file identifiers are strictly increasing so that the compaction tree only grows to the right and upwards. Although it may not be known for a certainty that all files having the same identifier (file 1.1, for example, across metadata nodes) will contain the same metadata, there is a high chance of that being the case. Synchronization of two files will thus make them converge to the union of their contents.
Finally, in step 424 a unique fingerprint is calculated for each newly compacted file (files in Level 1 and above) as has been described in step 416. The list of SLH triples for each of these files is then stored as meta-metadata as has been described above. As this is a continuous process, a branch is shown to step 304 as the storage platform may continuously be writing to any of the data nodes.
Steps 404-424 are continuous in that as writes continuously occur within the storage platform metadata is continuously generated, this metadata is stored in memory blocks of a metadata node, the memory blocks are flushed to SSTs on disk, these SST files are compacted and the resulting files are continuously stored on disk. Accordingly, the example compaction tree shown in
The above technique for storage of metadata is a modified version of a log-structured merge tree technique. The technique has been modified by the way in which the files are chosen to be compacted together. Our approach chooses the files to compact (i.e., to join) based on their unique identifiers, in a deterministic way. Other approaches choose files based on their sizes, the range of keys they cover, or how “hot” those keys are. None of these approaches, however, is deterministic.
In this situation, deterministic means that when given multiple replicas and trying to compact sets of files with exactly the same names, even if having different contents, the resulting files will have exactly the same names. If one were to use attributes of the files other than their names, different replicas could choose differently the groups of files to compact.
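To make the notion of determinism concrete, the sketch below groups files purely by identifier. The (level, index) naming and the rule that the parent index is the child index divided by the fan-in are hypothetical; the text requires only that the grouping depend on file identifiers and not on file contents or sizes.

```python
FAN_IN = 4  # files compacted together, as in the four-file example


def parent_of(level, index):
    """Return the (level, index) name of the file a given file compacts into."""
    return (level + 1, index // FAN_IN)


def compaction_groups(level, present_ids):
    """Group the file indices present at `level` by the parent they produce.

    Missing identifiers (gaps) simply yield smaller groups; the name of
    the resulting file is unaffected.
    """
    groups = {}
    for index in sorted(present_ids):
        groups.setdefault(parent_of(level, index), []).append(index)
    return groups
```

Two replicas holding different subsets of Level 0 files still place any shared file into a parent with the same name, so the compacted files remain comparable across nodes.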
Synchronize Metadata
Beginning with step 604, each node within the storage platform that is a metadata node begins a process to iterate over each SST that it holds on disk that has not yet been synchronized. Various techniques may be used to indicate that an SST has already been or has recently been synchronized, such as by using flags, status tables, etc. Iteration over each SST may occur consecutively or in parallel. In this example, we assume that node C 34 is the metadata node that desires to synchronize its SSTs and that in this embodiment metadata is replicated on three different nodes within the storage platform. Of course, depending upon the implementation, metadata may be replicated on fewer or more nodes. Synchronization will be performed for all levels in the merge tree, including Level 0.
In step 608 the local fingerprint file created in step 416 or 424 for an SST of node C having a particular identifier (for example, file 1.1566 in
It should be noted that, given a set of metadata nodes used to replicate some metadata, the type of metadata, and an SST identifier (such as 1.1), there can be only one corresponding SST in each of such nodes. For example, in one embodiment of the system, an SST named as “4ee5f0-a5caf0-dd59a2-VirtualDiskBlockInfo-7.sst” corresponds to the metadata maintained by nodes identified as 4ee5f0, a5caf0, and dd59a2 and of type “VirtualDiskBlockInfo.” It is the eighth file of this kind, as denoted by the identifier “7.” The combination of all of this information is unique per node and is repeated on only three nodes within the entire system. Thus, steps 608 and 612 are able to find the three SSTs that are replicas (theoretically) of one another. And, one finds which two fingerprints to compare out of the dozens of possible files each having the same identifier.
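The naming convention in the example above can be parsed as in the following sketch. The helper parse_sst_name and the SSTName type are hypothetical, and the sketch assumes that neither the metadata type nor the identifier contains a hyphen.

```python
from typing import NamedTuple, Tuple


class SSTName(NamedTuple):
    nodes: Tuple[str, ...]  # identifiers of the nodes replicating the metadata
    metadata_type: str      # e.g. "VirtualDiskBlockInfo"
    identifier: str         # SST identifier, kept as text


def parse_sst_name(filename):
    """Split an SST file name into replica nodes, metadata type and identifier."""
    suffix = ".sst"
    if not filename.endswith(suffix):
        raise ValueError("not an SST file name: " + filename)
    parts = filename[: -len(suffix)].split("-")
    if len(parts) < 3:
        raise ValueError("unexpected SST file name: " + filename)
    *nodes, metadata_type, identifier = parts
    return SSTName(tuple(nodes), metadata_type, identifier)
```

Given the parsed fields, a node can ask each of the other listed nodes for the fingerprint file of the SST with the same type and identifier.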
Next, in step 616 the triples of the local fingerprint file are compared against the triples of the remote fingerprint files, one file at a time. The SLH triples of the local fingerprint file are compared against the SLH triples of a remote fingerprint file in order to determine if the remote fingerprint file includes any SLH triples that the local file does not have. This comparison may be performed in a variety of manners; in one specific embodiment, the comparison determines whether the remote fingerprint file includes any triples having a hash value that is not present in any triple in the local fingerprint file. If so, these triples are identified as missing fingerprint triples. For example, referring to
Next, in step 620 the local node, node C, requests the metadata from the missing regions from the metadata node holding SST 530 corresponding to the fingerprint file 550 having the missing fingerprint triples. The request may be made in many ways, for example, the request may simply include a list of the entire missing SLH triples, a list of the hash values that are missing, a list of the regions corresponding to the missing triples, a list of boundaries defining the missing triples, etc. If fingerprint triples are missing from more than one fingerprint file, then steps 620-632 are performed for each fingerprint file.
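The comparison of step 616, in the specific embodiment that compares hash values, might be sketched as follows; its result is one possible form of the request of step 620 (a list of the entire missing SLH triples). The function name missing_triples is hypothetical, and the start and length values in the usage note are illustrative except for the triple "11, 7, Y" taken from the example.

```python
def missing_triples(local_fp, remote_fp):
    """Return the remote SLH triples whose hash is absent from the local file.

    Each fingerprint is a list of (start, length, hash) triples. Start and
    length are deliberately ignored, since the same region may sit at
    different offsets in the two replicas.
    """
    local_hashes = {h for _, _, h in local_fp}
    return [triple for triple in remote_fp if triple[2] not in local_hashes]
```

For a local fingerprint [(0, 11, "A"), (11, 7, "B")] and a remote fingerprint [(0, 11, "X"), (11, 7, "Y"), (18, 5, "B")], the triples with hashes "X" and "Y" are reported as missing, while the region hashed as "B" is not requested.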
In step 624 the remote node holding SST 530 receives the request from the local node and retrieves the SST that is the subject of the request. The remote node then prepares a response, which will be a new SST that contains a sequence of the key-value-timestamp triples that correspond to the missing SLH triples identified in step 620. Continuing with the above example, if the missing SLH triples are the two that have the hash values “X” and “Y” in fingerprint file 550, it is determined that these missing triples correspond to regions 542 and 544. Next, for each region, any key-value-timestamp triple found entirely within that region is placed into the new SST file; for region 542 this includes the triple 532. In addition, any triple found partially within a region will also be included. Thus, the triple 552 will be added to the new SST file because it is partially in region 542. Similarly, regarding region 544, both triples 552 and 554 will also be added to the new SST file because parts of these triples appear in region 544, which corresponds to the missing SLH triple “11, 7, Y.” Note that it is possible (and is shown in this example) that a particular key-value-timestamp triple may be added to the new SST file twice (e.g., triple 552). Preferably, the remote node removes any such duplicates before it returns this new SST file to the requesting local node. Alternatively, the remote node filters out any duplicate triples before the new SST is written locally on the remote node, and thus before the new SST is sent to the local node.
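The selection of key-value-timestamp triples in step 624 can be illustrated with a short sketch. It assumes, for illustration only, that each triple occupies a known byte range within the SST and that a region is given as a start and a length; the `Entry` type and the function name are hypothetical. A triple is selected when its byte range overlaps a requested region in whole or in part, and duplicates are filtered out as they are found.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    offset: int      # assumed: position of the triple within the SST
    size: int        # assumed: number of bytes the triple occupies
    key: str
    value: str
    timestamp: int


def entries_for_regions(entries: List[Entry],
                        regions: List[Tuple[int, int]]) -> List[Entry]:
    """Collect every entry that lies fully or partially inside any region."""
    selected = []
    seen_offsets = set()  # an entry spanning two regions is kept only once
    for start, length in regions:
        end = start + length
        for entry in entries:
            overlaps = entry.offset < end and entry.offset + entry.size > start
            if overlaps and entry.offset not in seen_offsets:
                seen_offsets.add(entry.offset)
                selected.append(entry)
    return sorted(selected, key=lambda e: e.offset)


sst = [
    Entry(0, 5, "k1", "v1", 100),    # entirely inside the first region
    Entry(5, 8, "k2", "v2", 101),    # straddles both regions
    Entry(13, 5, "k3", "v3", 102),   # entirely inside the second region
    Entry(18, 4, "k4", "v4", 103),   # outside both regions
]
new_sst = entries_for_regions(sst, [(0, 11), (11, 7)])
```

Here the entry for `k2` overlaps both requested regions, as triple 552 does in the example, yet appears only once in the result.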
Next, in step 628, the local node compacts the local SST (the local SST which was chosen for iteration above in step 604) with the received new SST from the remote node to create a replacement SST. For example, if the file 1.1566 is the subject of the synchronization of the steps of
As mentioned above, the process of steps 616-628 occurs for all replicas of the local SST. For example, if the storage platform stores three replicas of metadata, then steps 616-628 will occur once for each of the other two remaining replicas (assuming that there are missing fingerprint triples in each of the other two replicas). Once all replicas have been compared to the local SST, this local SST is marked as “Synchronized.” After a period of time, or if certain conditions occur, a metadata node may decide to perform another iteration of synchronization on its SSTs. This synchronization may be periodic or may be instigated manually by the storage platform or by a particular metadata node.
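The loop over replicas, together with the compaction of step 628, might be sketched as below. This is not the described implementation: it assumes that compaction keeps, for each key, the value with the latest timestamp, and it represents an SST as a list of (key, value, timestamp) tuples. The function names are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, int]  # (key, value, timestamp)


def compact(local: List[Triple], received: List[Triple]) -> List[Triple]:
    """Merge two SSTs, keeping the newest value per key (an assumed policy)."""
    newest = {}
    for key, value, timestamp in list(local) + list(received):
        current = newest.get(key)
        if current is None or timestamp > current[1]:
            newest[key] = (value, timestamp)
    # SSTs are sorted by key, so the replacement SST is emitted in key order.
    return [(key, value, ts) for key, (value, ts) in sorted(newest.items())]


def synchronize(local: List[Triple],
                received_per_replica: List[List[Triple]]) -> Tuple[List[Triple], str]:
    """Compact the local SST with the new SST received from each replica in turn."""
    for received in received_per_replica:
        local = compact(local, received)
    return local, "Synchronized"


local_sst = [("a", "1", 10), ("b", "2", 10)]
from_replicas = [[("b", "3", 20)], [("c", "4", 5)]]
replacement, status = synchronize(local_sst, from_replicas)
```

With three replicas, `received_per_replica` holds at most two new SSTs, one from each of the other two nodes.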
Step 632 checks for missing SSTs or fingerprints. Because of the gaps in SST identifiers mentioned earlier, a local node A may not be able to retrieve the fingerprint from a remote node B because node B has neither the SST nor the fingerprint file. As a last step in the synchronization, node A sends node B a list of the files that it believes node B is missing. Upon receiving such a message, node B checks, in a parallel flow, whether the missing files might still be created locally (for example, by a flush that is yet to happen) or whether the files existed before but have already been compacted. If neither is the case and a file is indeed missing, node B requests all the regions of that missing file from node A. Node A then sends the SST which, upon being received by node B, will simply be incorporated as a regular SST.
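The decision that node B makes for each reported file can be sketched as follows. The sketch assumes that node B can tell which SST identifiers are still awaiting a flush and which have already been compacted away; how that information is tracked is not specified here, so the two sets and the function names are hypothetical.

```python
from typing import Iterable, List, Set


def classify_missing_file(file_id: str,
                          pending_flush: Set[str],
                          already_compacted: Set[str]) -> str:
    """Decide what node B does about a file that node A reports as missing."""
    if file_id in pending_flush:
        return "pending"      # will still be created locally by a later flush
    if file_id in already_compacted:
        return "compacted"    # existed earlier and was merged into another SST
    return "request"          # truly missing: ask node A for all of its regions


def files_to_request(reported: Iterable[str],
                     pending_flush: Set[str],
                     already_compacted: Set[str]) -> List[str]:
    """Return the reported files whose regions node B must request from node A."""
    return [
        file_id for file_id in reported
        if classify_missing_file(file_id, pending_flush, already_compacted) == "request"
    ]


wanted = files_to_request(["1.5", "1.6", "1.7"],
                          pending_flush={"1.7"},
                          already_compacted={"1.5"})
```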
In addition, the metadata nodes that hold the two replica SSTs will also perform the steps of
As mentioned above, the processes of
In order for an application to identify the SST from which it needs to read the metadata, it uses an SST attribute. In other words, if an application knows that metadata for a particular virtual disk is on nodes A, B, and C, it will be able to find the correct SST. Accordingly, each SST has an attribute specifying the smallest and largest keys it contains. If a read is for a key that falls within that range, then the SST will be read. The read is done efficiently by means of a companion index file for each SST. Other techniques may be used to choose more efficiently from which SST to read (for example, the use of a Bloom filter).
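The key-range check described above can be sketched in a few lines. This is an illustration only, assuming that keys are compared as strings and that the range is inclusive at both ends; the `SstRange` type and the example SST names are hypothetical.

```python
from typing import List, NamedTuple


class SstRange(NamedTuple):
    name: str
    smallest_key: str   # attribute: smallest key stored in the SST
    largest_key: str    # attribute: largest key stored in the SST


def candidate_ssts(ssts: List[SstRange], key: str) -> List[str]:
    """Return the names of the SSTs whose key range could contain the key."""
    return [s.name for s in ssts if s.smallest_key <= key <= s.largest_key]


ssts = [
    SstRange("sst-1", "apple", "mango"),
    SstRange("sst-2", "kiwi", "zebra"),
    SstRange("sst-3", "pear", "plum"),
]
to_read = candidate_ssts(ssts, "lemon")
```

A key inside the range is not guaranteed to be present in the SST, which is why a companion index file or a Bloom filter can further reduce the number of SSTs actually read.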
CPU 922 is also coupled to a variety of input/output devices such as display 904, keyboard 910, mouse 912 and speakers 930. In general, an input/output device may be any of: video displays, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, biometrics readers, or other computers. CPU 922 optionally may be coupled to another computer or telecommunications network using network interface 940. With such a network interface, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the above-described method steps. Furthermore, method embodiments of the present invention may execute solely upon CPU 922 or may execute over a network such as the Internet in conjunction with a remote CPU that shares a portion of the processing.
In addition, embodiments of the present invention further relate to computer storage products with a computer-readable medium that has computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs), and ROM and RAM devices. Examples of computer code include machine code, such as that produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter.
Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. Therefore, the described embodiments should be taken as illustrative and not restrictive, and the invention should not be limited to the details given herein but should be defined by the following claims and their full scope of equivalents.