The present invention relates to distributed object storage systems that support hierarchical user directories within their namespaces. The namespace itself is stored as a distributed object. When a new object is added or updated as a result of a put transaction, metadata relating to the object's name is eventually stored in a namespace manifest shard based on a partial key derived from the full name of the object. A snapshot can be taken of the namespace manifest at a specific moment in time to create a snapshot manifest. A clone manifest can be created from a snapshot manifest and thereafter can be updated in response to put operations. A clone manifest can be merged into a snapshot manifest or into the namespace manifest and set of current version links, thereby enabling users to modify objects in a distributed manner. Snapshots, clones, and the clone/modify/merge update pattern are known in the prior art as applied to hierarchically controlled storage systems. However, the present invention provides a system and method for implementing these useful features in a fully distributed storage cluster that has no central points of processing, and does so without requiring any form of distributed locking.
In traditional copy-on-write file systems, low-cost snapshots of a directory or an entire file system can be created by simply not deleting the root of the namespace when later versions are created. Examples of copy-on-write file systems include the ZFS file system developed by Sun Microsystems and the WAFL (Write Anywhere File Layout) file system developed by Network Appliance.
Non-copy-on-write file systems have to pause processing long enough to copy metadata from the directory metadata to form the snapshot metadata. Many of these systems will retain the payload data as long as it is referenced by metadata. For those systems, no bulk payload copying is required. Others will have to copy the object data as well as its metadata to create a snapshot.
However, these techniques all rely upon a central processing point to take the snapshot before proceeding to the next transaction. A fully distributed object cluster, such as the types of clusters disclosed in the Incorporated References, does not have any central points of processing. Lack of any central processing points allows an object cluster to scale to far larger sizes than any cluster with central processing points.
What is needed for such a system, however, is a new solution to enable taking snapshots and forking a cloned version of a tree that does not interfere with the highly distributed processing enabled by such a system.
One of the Incorporated References, U.S. patent application Ser. No. 14/820,471, filed on Aug. 6, 2015 and titled “Object Storage System with Local Transaction Logs, A Distributed Namespace, and Optimized Support for User Directories,” which is incorporated by reference herein, describes a technique used by the Nexenta Cloud Copy-on-Write (CCOW) Object Cluster that applies MapReduce techniques to build an eventually consistent namespace manifest distributed object that tracks all version manifests created within a hierarchical namespace. This is highly advantageous in that it avoids the bottlenecks associated with the relatively flat tenant/account and bucket/container methods common to other object clusters.
The present invention extends any method of collecting directory entries for an object cluster where the entries are write-once records that do not require updating when the referenced content is replicated or migrated to new locations. The Nexenta CCOW Object Cluster does this by referencing payload with the cryptographic hash of a chunk, and then locating that chunk within a multicast negotiating group determined by the cryptographic hash of either the chunk content or the object name. A CCOW namespace manifest distributed object automatically collects the version manifests created within a namespace. Snapshot manifests and clone manifests subset and/or extend this data for specific purposes.
Snapshot manifests allow creation of point-in-time subsets of a namespace manifest, thereby creating a “snapshot” of a distributed, moving system. While subject to the same eventual consistency delay as the namespace manifest itself, the “snapshot” can be “instantaneous” in that there is no risk of cataloging a set of inconsistent versions that reflect only an unpredictable subset of a compound transaction.
The challenge of taking a snapshot of a distributed system is that without a central point of processing, it is hard to catch the system at rest. In prior art systems, it becomes necessary to tell the entire cluster to cease initiating new action until after the “snapshot” is taken. This is not analogous to a “snapshot,” but is more akin to a Civil War era photograph where the subject of the photograph had to remain motionless long enough for the camera to gather enough light.
Following the photography analogy, a snapshot manifest is indeed a snapshot of the cluster taken in a single instant. However, like a snapshot taken with analog film, the photograph is not available until after it has been fully processed.
Another aspect of the present invention relates to support for the clone-modify-merge pattern traditionally used for updating software source repositories.
Source control systems (such as subversion (svn), mercurial and git) have a well-established procedure for modifying source files required to build a system. The user creates a branch of the repository, checks out a working directory from the branch, makes modifications on the branch, commits changes to the branch and finally submits the changes back to the mainline repository. For most development projects, there is an associated review process to approve merges pushed from branches.
This clone-modify-merge pattern is useful for most software development projects, but can also be used for operational and configuration data as well as to facilitate exclusive access to blocks or files without requiring a global lock manager.
The clone-modify-merge pattern is conventionally implemented by user-mode software using standard file-oriented APIs to access and modify the repository. Typically, there are multiple repositories, each associated with directly attached storage. Each repository is comprised of multiple files holding the metadata about the files visible to the user of the repository. This layered implementation provides for a stable and highly portable interface. But it is wasteful of raw IO capacity and disk space. It also relies on end-users refraining from directly manipulating the metadata encoding files themselves. For source code repositories these are generally not overriding concerns compared with stability and portability, but this may have more of an impact on using these tools for production data.
Source control systems have conventionally implemented this strategy above the file system, encoding repository metadata in additional files over local file systems. Older systems, such as CVS and subversion, use a central repository that checks out to and checks in from end user local file systems. Later systems have distributed repositories that push and pull to each other, while the user's working directory checks in and out of a local repository.
Both of these strategies implicitly assume the Direct-Attached-Storage (DAS) model where storage for a cluster is attached as small islands to specific servers. All synchronization between repositories involves actual network transfers between the repositories.
An object storage system that supported a clone-modify-merge pattern for updating content could apply deduplication across all storage, avoid unnecessary replication when pushing content from one repository to another, and use a common storage pool for the data under management no matter what state each piece was in. The conventional solution presumes separate DAS storage, which precludes sharing resources for identical content. Integrating content and then taking extra steps to hide it is inefficient. Having physically separate repositories undermines the benefits of cloud storage, makes the aggregate storage less robust, and wastes network bandwidth with repository-to-repository copies.
The present invention addresses both of these needs through the creation of “snapshot manifests” and “clone manifests.” A snapshot manifest is an object that collects directory entries for a selected set of version manifests and enables access through the snapshot manifest. The snapshot manifest can be built from information in an eventually consistent namespace manifest, allowing the ability to create point-in-time snapshots of subsets of the whole repository without requiring a central point of processing. It may also be built from any cached set of version manifests.
A clone manifest is a writable version of a snapshot manifest, which allows metadata about new uncommitted versions of objects to be efficiently segregated from the metadata describing committed objects. Conventional solutions rely on access controls and naming conventions to hide uncommitted data, but this is inefficient: they either merge the data first and then take extra steps to hide it from typical users, or they rely upon repositories being kept on physically separate servers.
The present invention uses snapshot manifests and clone manifests to implement many conventional storage features within a fully distributed object storage cluster.
Overview of Embodiments
Storage servers 150a, 150c, and 150g here are illustrated as exemplary storage servers, and it is to be understood that the description herein applies equally to the other storage servers such as storage servers 150b, 150c, . . . 150j (not shown in
Gateway 130 can access object manifest 205 for the namespace manifest 210. Object manifest 205 for namespace manifest 210 contains information for locating namespace manifest 210, which itself is an object stored in storage system 200. In this example, namespace manifest 210 is stored as an object comprising three shards, namespace manifest shards 210a, 210b, and 210c. This is representative only, and namespace manifest 210 can be stored as one or more shards. In this example, the object has been divided into three shards, which have been assigned to storage servers 150a, 150c, and 150g. Typically each shard is replicated to multiple servers as described for generic objects in the Incorporated References. These extra replicas have been omitted to simplify the diagram.
The role of the object manifest is to identify the shards of the namespace manifest. An implementation may do this either as an explicit manifest which enumerates the shards, or as a management plane configuration rule which describes the set of shards that are to exist for each managed namespace. An example of a management plane rule would dictate that the TenantX namespace was to spread evenly over 20 shards anchored on the name hash of “TenantX”.
In addition, each storage server maintains a local transaction log. For example, storage server 150a stores transaction log 220a, storage server 150c stores transaction log 220c, and storage server 150g stores transaction log 220g.
Namespace Manifest and Namespace Manifest Shards
With reference to
Each namespace manifest shard 210a, 210b, and 210c can comprise one or more entries, here shown as exemplary entries 301, 302, 311, 312, 321, and 322.
The use of multiple namespace manifest shards has numerous benefits. For example, if the system instead stored the entire contents of the namespace manifest on a single storage server, the resulting system would incur a major non-scalable performance bottleneck whenever numerous updates need to be made to the namespace manifest.
Hierarchical directories make it very difficult to support finding objects under the outermost directory. The number of possible entries for the topmost directory is so large that placing all of those entries on a single set of servers would inevitably create a processing bottleneck.
The present invention avoids this potential processing bottleneck by allowing the namespace manifest to be divided first in any end-user meaningful way, for example by running separate namespace manifests for each tenant, and then by sharding the content using a partial key. Embodiments of the present invention divide the total combined namespace of all stored object versions into separate namespaces. One typical strategy for such division is having one namespace, and therefore one namespace manifest, for each of the tenants that use the storage cluster.
Generally, division of the total namespace into separate namespaces is performed using configuration rules that are specific to embodiments. Each separate namespace manifest is then identified by the name prefix for the portion of the total namespace. The sum (that is, logical union) of separate non-overlapping namespaces will form the total namespace of all stored object versions. Similarly, controlling the namespace redundancy, including the number of namespace shards for each of the resulting separate namespace manifests, is also part of the storage cluster management configuration that is controlled by the corresponding management planes in the embodiments of the present invention.
Therefore, the namespace record derived from each name of each object 310 is sharded using the partial key hash of each record. In the preferred embodiment, the partial key is formed by a regular expression applied to the full key. However multiple alternate methods of extracting a partial key from the whole key should be obvious to those skilled in the art. In the preferred embodiment, the partial key may be constructed so that all records referencing the same object will have the same partial key and hence be assigned to the same shard. For example, under this design, if record 320a and record 320b pertain to a single object (e.g., “cat.jpg”), they will be assigned to the same shard, such as namespace manifest shard 210a.
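For illustration only, the following sketch shows one way a regular expression could derive a partial key and assign namespace records to shards; the regular expression, the 20-shard count, the object names, and the function names are hypothetical and are not drawn from the Incorporated References:

```python
import hashlib
import re

# Illustrative rule only: take everything up to and including the enclosing
# directory as the partial key, so every record about "cat.jpg" shares one key.
PARTIAL_KEY_RE = re.compile(r"^(.*/)[^/]*$")

def partial_key(full_key: str) -> str:
    m = PARTIAL_KEY_RE.match(full_key)
    return m.group(1) if m else full_key

def shard_for(full_key: str, shard_count: int) -> int:
    """Assign a record to a shard by hashing its partial key."""
    pk = partial_key(full_key)
    digest = hashlib.sha256(pk.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

# Two records describing different versions of the same object land on the same shard.
record_320a = {"name": "/TenantX/pets/cat.jpg", "uvid": "2015-08-06T10:00:00+gw1"}
record_320b = {"name": "/TenantX/pets/cat.jpg", "uvid": "2015-08-07T09:30:00+gw2"}
assert shard_for(record_320a["name"], 20) == shard_for(record_320b["name"], 20)
```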
The use of partial keys is further illustrated in
In
In
In
It is to be understood that partial keys 721, 722, and 723 are merely exemplary and that partial keys can be designed to correspond to any level within a directory hierarchy.
With reference now to
For example, if object 310 is named “/Tenant/A/B/C/d.docx,” the partial key could be “/Tenant/A/”, and the next directory entry would be “B/”. No value is stored for key 331.
Delayed Revisions to Namespace Manifest In Response to Put Transaction
With reference to
The updating illustrated in
Version Manifests and Chunk Manifests
With reference to
Each manifest, such as namespace manifest 210, version manifest 410a, and chunk manifest 420a, optionally comprises a salt (which guarantees the content of the manifest is unique) and an array of chunk references.
For version manifest 410a, the salt 610a comprises:
Version manifest 410a also comprises chunk references 620a for payload 630a. Each of the chunk references 620a is associated with one of the payload chunks 630a-1, . . . 630a-k. In the alternative, chunk reference 620a may specify chunk manifest 420a, which ultimately references payload chunks 630a-1, . . . 630a-k.
For chunk manifest 420a, the salt 620a comprises:
Chunk manifest 420a also comprises chunk references 620a for payload 630a. In the alternative, chunk manifest 420a may reference other chunk/content manifests, which in turn directly reference payload 630a or indirectly reference payload 630a through one or more other levels of chunk/content manifests. Each of the chunk references 620a is associated with one of the payload chunks 630a-1, . . . 630a-k.
Chunk references 620a may be indexed either by the logical offset and length, or by a hash shard of the key name (the key hash identifying token or KHIT). When indexed by logical offset and length, the chunk reference identifies an ascending non-overlapping offset within the object version. When indexed by hash shard, the reference supplies a base value and the number of bits that an actual hash of a desired key value must match for this chunk reference to be relevant. The chunk reference then includes either inline content or a content hash identifying token (CHIT) referencing either a sub-manifest or a payload chunk.
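For illustration only, the two indexing modes could be represented roughly as follows; the class and field names are hypothetical, and the use of high-order bits for the KHIT match is an assumption made for the example:

```python
from dataclasses import dataclass
from typing import Optional
import hashlib

@dataclass
class OffsetChunkRef:
    # Indexed by an ascending, non-overlapping logical offset within the object version.
    offset: int
    length: int
    chit: str              # content hash identifying token of the payload chunk

@dataclass
class KhitChunkRef:
    # Indexed by a hash shard of the key name.
    base: int              # base value the actual key hash must match
    match_bits: int        # number of (assumed high-order) bits that must match
    chit: Optional[str] = None      # references a sub-manifest or a payload chunk
    inline: Optional[bytes] = None  # or carries inline content instead

    def covers(self, key: str) -> bool:
        """True when this chunk reference is relevant for the given key."""
        h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
        shift = 64 - self.match_bits
        return (h >> shift) == (self.base >> shift)
```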
Namespace manifest 210 is a distributed versioned object that references version manifests, such as version manifest 410a, created within the namespace. Namespace manifest 210 can cover all objects in the cluster or can be maintained for any subset of the cluster. For example, in the preferred embodiments, the default configuration tracks a namespace manifest for each distinct tenant that uses the storage cluster.
Flexibility of Data Payloads within the Embodiments
The present embodiments generalize the concepts from the Incorporated References regarding version manifest 410a and chunk manifest 420a. Specifically, the present embodiments support layering of any form of data via manifests. The Incorporated References disclose layering only for chunk manifest 420a and the use of byte-array payloads. By contrast, the present embodiments support two additional forms of data beyond byte-array payloads:
The line-array and byte-array forms can be viewed as being key/value data as well. They have implicit keys that are not part of the payload. Being implicit, these keys are neither transferred nor fingerprinted. For line oriented payload, the implicit key is the line number. For byte-array payload, a record can be formed from any offset within the object and specified for any length up to the remaining length of the object version.
Further, version manifest 410a encodes both system and user metadata as key/value records.
This generalization of the manifest format allows the manifests for an object version to encode more key/value metadata than could possibly fit in a single chunk.
Hierarchical Directories
In these embodiments, each namespace manifest shard can store one or more directory entries, with each directory entry corresponding to the name of an object. The set of directory entries for each namespace manifest shard corresponds to what would have been a classic POSIX hierarchical directory. There are two typical strategies, iterative and inclusive, that may be employed; either one of these strategies may be configured as a system default in the embodiments.
In the iterative directory approach, a namespace manifest shard includes only the entries that would have been directly included in POSIX hierarchical directory. A sub-directory is mentioned by name, but the content under that sub-directory is not included here. Instead, the accessing process must iteratively find the entries for each named sub-directory.
The referencing directory is the partial key, ensuring that, unless there are too many records with that partial key, they will all be in the same shard. There are entries for each referencing directory combined with:
Gateway 130 (e.g., the Putget Broker) will need to search for non-current versions in the namespace manifest 210. In the Incorporated References, the Putget Broker would find the desired version by getting a version list for the object. The present embodiments improve upon that approach by optimizing for finding the current version and performing asynchronous updates of a common sharded namespace manifest 210 instead of performing synchronous updates of version lists for each object.
With this enhancement, the number of writes required before a put transaction can be acknowledged is reduced by one, as discussed above with reference to
Queries to find all objects “inside” of a hierarchical directory will also be optimized. This is generally a more common operation than listing non-current versions. Browsing current versions in the order implied by classic hierarchical directories is a relatively common operation. Some user access applications, such as Cyberduck, routinely collect information about the “current directory.”
Distributing Directory Information to the Namespace Manifest
A namespace manifest 210 is a system object containing directory entries that are automatically propagated by the object cluster as a result of creating or expunging version manifests. Unlike user objects there is only the current version of a namespace manifest. Snapshot Manifests can be created to retain any subset of a namespace manifest as a frozen version.
The ultimate objective of the namespace manifest 210 is to support a variety of lookup operations including finding non-current (not the most recent) versions of each object. Another lookup example includes listing of all or some objects that are conceptually within a given hierarchical naming scope, that is, in a given user directory and, optionally, its sub-directories. In the Incorporated References, this was accomplished by creating list objects to track the versions for each object and the list of all objects created within an outermost container. These methods are valid, but require new versions of the lists to be created before a put transaction is acknowledged. These additional writes increase the time required to complete each transaction.
The embodiment of
As each entry in a transaction log is processed, the changes to version manifests are generated as new edits for the namespace manifest 210.
The version manifest referenced in the transaction log is parsed as follows: The fully qualified object name found within the version manifest's metadata is parsed into a tenant name, one or more enclosing directories (typically based upon a configurable directory separator character, such as the ubiquitous forward slash (“/”) character), and a final relative name for the object.
Records are generated for each enclosing directory, referencing the immediate name enclosed within it, that is, the name of the next sub-directory or the final relative name. For the iterative option, this entry only specifies the relative name of the immediate sub-directory. For the inclusive option, the full version manifest relative to this directory is specified.
With the iterative option the namespace manifest records are comprised of:
With the inclusive option the namespace manifest records are comprised of:
A record is generated for the version manifest that fully identifies the tenant, the name within the context of the tenant and Unique Version ID (UVID) of the version manifest as found within the version manifest's metadata.
These records are accumulated for each namespace manifest shard 210a, 210b, 210c. The namespace manifest is sharded based on the key hash of the fully qualified name of the record's enclosing directory name. Note that the records generated for the hierarchy of enclosing directories for a typical object name will typically be dispatched to multiple shards.
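For illustration only, the parsing and fan-out described above might look roughly as follows under the iterative option; the function names, record fields, and 20-shard count are hypothetical:

```python
import hashlib

def split_name(fully_qualified: str, sep: str = "/"):
    """Parse a name such as '/Tenant/A/B/C/d.docx' into tenant, enclosing directories, relative name."""
    parts = fully_qualified.strip(sep).split(sep)
    return parts[0], parts[:-1], parts[-1]

def namespace_records(fully_qualified: str, uvid: str, vm_chit: str):
    """Yield (shard key, record) pairs: one per enclosing directory (iterative option)
    plus one record fully identifying the version manifest."""
    _tenant, dirs, relative = split_name(fully_qualified)
    for depth in range(1, len(dirs)):
        enclosing = "/" + "/".join(dirs[:depth]) + "/"
        yield enclosing, {"entry": dirs[depth] + "/"}    # name of the next sub-directory only
    innermost = "/" + "/".join(dirs) + "/"
    yield innermost, {"entry": relative, "uvid": uvid, "version_manifest": vm_chit}

def shard_of(shard_key: str, shard_count: int) -> int:
    """The namespace manifest is sharded on the key hash of the enclosing directory name."""
    digest = hashlib.sha256(shard_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

# Records for one object's enclosing hierarchy typically land on several different shards.
for key, rec in namespace_records("/Tenant/A/B/C/d.docx", "2015-08-06T10:00:00+gw1", "chit-0123"):
    print(shard_of(key, 20), key, rec)
```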
Once a batch has accumulated sufficient transactions and/or time it is multicast to the Negotiating Group that manages the specific namespace manifest shard.
At each receiving storage server the namespace manifest shard is updated to a new chunk by applying a merge/sort of the new directory entry records to be inserted/deleted and the existing chunk to create a new chunk. Note that an implementation is free to defer application of delta transactions until convenient or until there has been a request to get the shard.
In many cases the new record is redundant, especially for the enclosing hierarchy. If the chunk is unchanged then no further action is required. When there are new chunk contents then the index entry for the namespace manifest shard is updated with the new chunk's CHIT.
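For illustration only, a sketch of the merge/sort applied at a receiving storage server follows; the canonical JSON record encoding and the use of SHA-256 for the chunk CHIT are assumptions made for the example:

```python
import hashlib
import json

def apply_batch(existing_records: list, inserts: list, deletes: list):
    """Merge a batch of directory-entry edits into an existing shard chunk."""
    merged = {json.dumps(r, sort_keys=True): r for r in existing_records}
    for r in deletes:
        merged.pop(json.dumps(r, sort_keys=True), None)
    for r in inserts:
        merged[json.dumps(r, sort_keys=True)] = r   # redundant records collapse here
    new_records = [merged[k] for k in sorted(merged)]
    new_chunk = json.dumps(new_records).encode()
    new_chit = hashlib.sha256(new_chunk).hexdigest()
    return new_records, new_chit

old = [{"dir": "/Tenant/A/", "entry": "B/"}]
new, chit = apply_batch(old, inserts=[{"dir": "/Tenant/A/", "entry": "B/"}], deletes=[])
# The insert was redundant, so the record set (and hence the chunk content) is unchanged.
assert new == old
```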
Note that the root version manifest for a namespace manifest does not need to be centrally stored on any specific set of servers. Once a configuration object creates the sharding plan for a specific namespace manifest, the current version of each shard can be referenced without prior knowledge of its CHIT.
Further note that each namespace manifest shard may be stored by any subset of the selected Negotiating Group as long as there are at least a configured number of replicas. When a storage server accepts an update from a source it will be able to detect missing batches, and request that they be retransmitted.
Continuous Update Option
The preferred implementation does not automatically create a version manifest for each revision of a namespace manifest. All updates are distributed to the current version of the target namespace manifest shard. The current set of records, or any identifiable subset, may be copied to a different object to create a frozen enumeration of the namespace or a subset thereof. Conventional objects are updated in discrete transactions originated from a single gateway server, resulting in a single version manifest. The updates to a namespace manifest arise on an ongoing basis and are not naturally tied to any aggregate transaction. Therefore, use of an implicit version manifest is preferable, with the creation of a specifically identified (frozen-in-time) version manifest of the namespace deferred until it is specifically needed.
Processing of a Batch for a Split Negotiating Group
Because distribution of batches is asynchronous, it is possible to receive a batch for a Negotiating Group that has been split. The receiver must split the batch and distribute the half that no longer belongs to it to the new Negotiating Group.
Transaction Log KVTs
The locally stored Transaction Log KVTs should be understood to be part of a single distributed object with key-value tuples. Each Key-Value tuple has a key comprised of a timestamp and a Device ID. The Value is the Transaction Log Entry. Any two subsets of the Transaction Log KVTs may be merged to form a new equally valid subset of the full set of Transaction Log KVTs.
In many implementations the original KVT capturing Transaction Log Entries on a specific device may optimize storage of Transaction Log Entries by omitting the Device ID and/or compressing the timestamp. Such optimizations do not prevent the full logical Transaction Entry from being recovered before merging entries across devices.
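For illustration only, the merge property of Transaction Log KVTs can be sketched as follows; the tuple layout and entry format are hypothetical:

```python
def merge_logs(subset_a: dict, subset_b: dict) -> dict:
    """Merge two subsets of the distributed Transaction Log KVT.

    Keys are (timestamp, device_id) tuples; values are transaction log entries.
    Because each key is unique cluster-wide, a plain union is another valid
    subset of the full log and the merge is order-independent.
    """
    merged = dict(subset_a)
    merged.update(subset_b)
    return dict(sorted(merged.items()))

log_dev1 = {(1438848000.001, "dev-1"): "put /Tenant/A/B/C/d.docx"}
log_dev2 = {(1438848000.002, "dev-2"): "put /Tenant/A/B/C/e.docx"}
assert merge_logs(log_dev1, log_dev2) == merge_logs(log_dev2, log_dev1)
```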
Namespace Manifest Resharding
An implementation will find it desirable to allow the sharding of an existing Namespace to be refined by either splitting a namespace manifest shard into two or more namespace manifest shards, or by merging two or more namespace shards into one namespace manifest shard. It is desirable to split a shard when there is an excessive number of records assigned to it, while it is desirable to merge shards when one or more of them have too few records to justify continued separate existence.
When an explicit Version Manifest has been created for a Namespace Manifest, splitting a shard is accomplished as follows:
When operating without an explicit version manifest it is necessary to split all shards at once. This is done as follows and as shown in
While relatively rare, the total number of records in a sharded object may decrease, eventually reaching a new version which would merge two prior shards into a single shard for the new version. For example, shards 72 and 73 of 128 could be merged to a single shard, which would be 36 of 64.
The put request specifying the new shard would list both 72/128 and 73/128 as providing the pre-edit records for the new chunk. The targets holding 72/128 would create a new chunk encoding shard 36 of 64 by merging the retained records of 72/128, 73/128 and the new delta supplied in the transaction.
Because this put operation will require fetching the current content of 73/128, it will take longer than a typical put transaction. However, such merge transactions would be sufficiently rare that they would not have a significant impact on overall transaction performance.
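For illustration only, the shard-merge arithmetic from the example above can be sketched as follows; the request layout is hypothetical:

```python
def merged_shard(shard_index: int, old_count: int):
    """Halving the shard count pairs shards 2k and 2k+1 into shard k of old_count // 2."""
    return shard_index // 2, old_count // 2

assert merged_shard(72, 128) == (36, 64)
assert merged_shard(73, 128) == (36, 64)

# The put request for the new shard lists both source shards as pre-edit input:
put_request = {
    "target_shard": (36, 64),
    "source_shards": [(72, 128), (73, 128)],
    "delta_records": [],   # new edits supplied with the transaction
}
```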
The namespace manifest gets updated as a result of creating and expunging (deleting) version manifests. Those skilled in the art will recognize that the techniques and methods described herein apply to the put transaction that creates new version manifests as well as to the delete transaction that expunges version manifests. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention. These modifications may be made to the invention in light of the above detailed description.
Snapshots of the Namespace
With reference to
First, an exemplary snapshot initiator (shown as client 110a) issues command 1311 at time T to perform a snapshot of portion 1312 of namespace manifest 210 and to store snapshot object 1313 with object name 1315. Portion 1312 can comprise the entire namespace manifest 210, or portion 1312 can be a sub-set of namespace manifest 210. For example, portion 1312 can be expressed as one or more directory entries or as a specific enumeration of one or more objects. An example of command 1311 would be: SNAPSHOT /finance/brent/reports Financial_Reports. In this example, “SNAPSHOT” is command 1311, “/finance/brent/reports” is the identification of portion 1312, and “Financial_Reports” is object name 1315. The command may be implemented in one of many different formats, including binary, textual, command line, or HTTP/REST. (Step 1310).
Second, in response to command 1311, gateway 130 waits a time period K to allow pending transactions to be stored in namespace manifest 210. (Step 1320). Third, gateway 130 retrieves portion 1312 of namespace manifest 210. This step involves retrieving the namespace manifest shards that correspond to portion 1312. (Step 1330).
Fourth, in response to command 1311, gateway 130 retrieves all transaction logs 220 and identifies all pending transactions 1331 at time T. (Step 1330). These records cannot be used for the snapshot until all transactions that were initiated at or before Time T are represented in one or more Namespace Manifest shards. Thus, a snapshot at Time T cannot be created until time T+K, where K represents an implementation-dependent maximum propagation delay. The delay of time K allows all transactions that are pending in transaction logs (such as transaction logs 220a . . . 220g) to be stored in the appropriate namespace shards. While the records for the snapshot cannot be collected before this minimal delay, they will still represent a snapshot at time T. It should be understood that allowing for a maximum delay requires allowing for congested networks and busy servers, which may compromise prompt availability of snapshots. An alternative implementation could use a multicast synchronization, such as found in the MPI standards, to confirm that all transactions as of time T have been merged into the namespace manifest.
Fifth, gateway 130 generates snapshot object 1313. This step involves parsing the entries of each namespace manifest shard to identify the entries that relate to portion 1312 (which will be necessary if portion 1312 does not align completely with the contents of a namespace manifest shard), storing the namespace manifest shards or entries in memory, storing all pending transactions 1331 pending at time T from all transaction logs 220, and creating snapshot object 1313 with object name 1315 (Step 1340).
Finally, gateway 130 performs a put transaction of snapshot object 1313 to store it. This step uses the same procedure described previously as to the storage of an object. (Step 1350).
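For illustration only, the sequence of steps 1310 through 1350 might be sketched as follows; the input structures, field names, and the value of the propagation delay K are hypothetical:

```python
import time

def take_snapshot(portion: str, snapshot_name: str, namespace_shards: list,
                  transaction_logs: list, propagation_delay_k: float):
    """Sketch of handling a SNAPSHOT command issued at time T (steps 1310 through 1350)."""
    t = time.time()                                    # Step 1310: command received at time T
    time.sleep(propagation_delay_k)                    # Step 1320: wait for pending updates to land
    records = [r for shard in namespace_shards         # Step 1330: entries under the requested portion
               for r in shard if r["name"].startswith(portion)]
    pending = [e for log in transaction_logs           # pending entries initiated at or before time T
               for e in log if e["timestamp"] <= t and e["name"].startswith(portion)]
    snapshot = {"name": snapshot_name, "as_of": t,     # Step 1340: assemble the snapshot object
                "records": records + pending}
    return snapshot                                    # Step 1350: stored via an ordinary object put

# Hypothetical inputs standing in for namespace manifest shards and transaction logs 220.
shards = [[{"name": "/finance/brent/reports/q1.pdf", "uvid": "v1"}]]
logs = [[{"name": "/finance/brent/reports/q2.pdf", "timestamp": 0.0}]]
print(take_snapshot("/finance/brent/reports", "Financial_Reports", shards, logs, 0.0))
```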
With reference to
As can be seen in
A snapshot manifest (such as snapshot manifest 1313 or 1314) is a sharded object that is created by a MapReduce job which selects a subset of records from a namespace manifest (such as namespace manifest 210) or a portion thereof, or another version of a snapshot manifest. The MapReduce job which creates a version of a snapshot manifest is not required to execute instantaneously, but the extract created will represent a snapshot of a subset of a namespace manifest at a specific point in time or of a specific snapshot manifest version.
In
Record 1510 comprises name mapping 1520. Name mapping 1520 encodes information for any name that corresponds to a conventional hierarchical directory found in the subject of the snapshot, such as namespace manifest 210 or 210′ or a portion thereof. Name mapping 1520 specifies the mapping of a relative name to a fully qualified name. This may merely document the existence of a sub-directory, or may be used to link to another name, effectively creating a symbolic link in the distributed object cluster namespace.
Record 1510 further comprises version manifest identifier 1530. Version manifest identifier 1530 identifies the existence of a specific version manifest by specifying at least the following information: (1) Unique identifier 1531 for the record, unique identifier 1531 comprising the fully qualified name of the enclosing directory, the relative name of the object, and a unique identifier of the version of the object. In the preferred embodiment, unique identifier 1531 comprises a transactional timestamp concatenated with a unique identifier of the source of the transaction. (2) Content hash-identifying token (CHIT) 1532 of the version manifest. (3) A cache 1540 of records from the version manifest to optimize their retrieval. These records have a value cached from the version manifest and the key for that record, which identifies the version manifest and the key value within the version manifest.
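For illustration only, a record with the fields just described might be represented as follows; the class and field names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class VersionManifestId:
    enclosing_directory: str   # fully qualified name of the enclosing directory
    relative_name: str         # relative name of the object
    uvid: str                  # transactional timestamp concatenated with the transaction source
    chit: str                  # content hash identifying token of the version manifest

@dataclass
class SnapshotRecord:
    # Maps a relative name to a fully qualified name (may act as a symbolic link).
    name_mapping: Optional[Dict[str, str]] = None
    version_manifest: Optional[VersionManifestId] = None
    # Cache of (key within the version manifest) -> value, to optimize retrieval.
    cached: Dict[str, bytes] = field(default_factory=dict)
```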
In the preferred embodiment, exemplary record 1510 follows a rule that a simple unique key yields a value. However, as should be obvious to those skilled in the art, the same information can also be encoded in a hierarchical fashion. For example an XML encoding could have one layer to specify the relative object name with zero or more nested XML sub-structures to encode each version manifest, with fields within the version manifest XML encoding.
For example, directory entries could be encoded in a flat organization as:
Or the same directory entries could be encoded in an XML structure as:
Record 1510 optionally comprises chunk references 1550. In a flat encoding, the key is formed by concatenating the version manifest key with the chunk reference identifier. In a hierarchical encoding, the chunk reference records are included within the content of the version manifest record.
In the preferred embodiment, the following chunk reference types are supported:
The Partial Key Shard Chunk Reference is previously disclosed in the Incorporated References. Specific details are restated in this application because of their relevance. A Partial Key Shard Chunk Reference claims a subset of the potential namespace for referenced payload and specifies the CHIT of the current chunk for this shard. The current chunk may be either a Payload Chunk or a Manifest Chunk.
Partial Key Shard Chunk References are used with key/value data. A regular expression which must be included in the system metadata for the object governs mapping the full key to a partial key. The relevant cryptographic hash algorithm is then applied to the Partial Key to obtain the Partial Key hash.
Each Partial Key Shard Chunk Reference defines a shard of the aggregate key hash space, and assigns all keys to this shard by specifying:
A normal put operation will inherit the shards as defined for the referenced version, but will replace the referenced CHIT of the Manifest or Payload Chunk for this shard.
The Partial Key Shard Chunk Reference allows sets of related key/value records, for example all Snapshot Manifest records about a given Object, to be assigned to the same Shard. While this allows minor variations in the distribution of records across shards it reduces the number of shards that a transaction seeking all records matching a partial key must access.
In the unusual case that the Partial Key Chunk Reference selects more records than can be kept in a single Chunk, the referenced Manifest can use Full Key Shard Chunk References to sub-shard the records assigned to the partial-key specified shard.
The Full Key Shard Chunk Reference is previously disclosed in the Incorporated References. Specific details are restated in this application because of their relevance.
A Full Key Shard Chunk Reference is fully equivalent to the Partial Key Shard Chunk Reference except that the Key Hash is calculated on the record's full key. Full Key Shard Chunk References can be used to sub-shard a shard that has too many records for a single Payload Chunk to hold.
An object get may be specified to take place within the context of a specific version of a snapshot manifest. The object request will be satisfied against the version manifest enumerated within the snapshot manifest if possible, and then the object cluster as a whole if not (which would be required if the relevant portion of the namespace manifest was not part of the snapshot operation).
Rolling back to a snapshot manifest involves creating a current object version for each object within the snapshot manifest in the object cluster, where each new object version created:
In the distributed storage cluster of the embodiments described herein, it would be desirable to be able to create a snapshot of the namespace manifest or a portion thereof without halting all processing in the storage cluster, even in the situation where transactions are pending.
In
Under the embodiments previously discussed, the transactions from transaction logs 220e and 220i will be added to various namespace manifest shards, such as namespace manifest shard 210a, at some point in time. Because the snapshot is taken at time T, entries 301 and 302 are captured in snapshot manifest 1313, but metadata 801 also must be captured in snapshot manifest 1313. If we assume for this example that metadata 801 contains a change to Entry 301 (for example, indicating a new version of an object), then that change will be reflected in Record 1401 in snapshot manifest 1313, either by modifying the data before it is stored as Record 1401, or by updating Entry 301 in namespace manifest shard 210a before it is copied as Record 1401 in snapshot manifest 1313.
Clones of a Snapshot or of the Namespace or a Portion Thereof
The embodiments all support the creation and usage of a clone manifest. In
With reference to
In response to command 1211, gateway 130 retrieves portion 1212 of namespace manifest 210 or snapshot 1213 (step 1220). Gateway 130 then generates clone manifest 1214 (step 1230). Gateway 130 performs a put transaction of clone manifest 1214 (step 1240).
Clone Manifest Extension
The present invention requires an additional encoding within a clone manifest not found in a snapshot manifest. This encoding specifies zero or more delta chunk references that must be applied before this new version can be put to a snapshot manifest. In the preferred implementation an object specified with delta chunk references is only accessible through a clone manifest; it cannot be independently accessed using the object cluster directly. Putting to a snapshot manifest is functionally equivalent to pushing a local git repository to a master repository.
Each delta chunk reference encodes:
A delta chunk reference supplies content that is changed from the reference chunk. For sharded objects this is the existing payload chunk for the current shard. For objects within a clone manifest (that are not described in a shard chunk reference) the reference content is defined for the object version as a whole through a version manifest CHIT.
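For illustration only, a byte-array delta chunk reference and its application to a reference chunk might be sketched as follows; the field names, the byte-offset form of the delta, and the zero-fill of undefined content are assumptions made for the example:

```python
from dataclasses import dataclass

@dataclass
class DeltaChunkRef:
    reference_chit: str       # CHIT of the chunk (or version manifest) being modified
    offset: int               # where the changed bytes start within that chunk
    payload: bytes            # the changed content
    state: str = "modified"   # e.g. "modified" or "untracked" (never committed or pushed)

def apply_delta(reference_chunk: bytes, delta: DeltaChunkRef) -> bytes:
    """Overlay the delta payload on the reference chunk for a byte-array object."""
    out = bytearray(reference_chunk)
    end = delta.offset + len(delta.payload)
    if end > len(out):
        out.extend(b"\x00" * (end - len(out)))   # undefined content reads as zeroes
    out[delta.offset:end] = delta.payload
    return bytes(out)

base = b"hello world"
delta = DeltaChunkRef(reference_chit="chit-abc", offset=6, payload=b"there")
assert apply_delta(base, delta) == b"hello there"
```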
For each chunk reference type identified above there is an additional type to specify a Delta Chunk Reference to the same data. Additionally, the following chunk reference type must also be supported:
Clone Manifest Transactions
The following transactions must be supported to utilize a clone manifest:
Creating a new clone manifest is identical to creation of a snapshot manifest, but with the addition of a system metadata attribute indicating that it is a clone manifest and can therefore be a reference for further updates.
The source for initial records is a filtered subset of a namespace manifest or an existing version of a snapshot manifest. Because a clone manifest is a snapshot manifest, it can also be the source of initial records. The subset of records selected may be specified by any combination of the following:
An implementation may choose to accept the enumeration of specific version manifests in a format that is compatible with an existing command line program such as tar or git. Creating a clone manifest is the functional equivalent of creating a local repository with a git “clone” operation.
The created clone manifest will include metadata fields identifying:
Putting Modifications to a Clone Manifest
A Clone Manifest Put Transaction applies changes to a set of objects within the scope of an existing snapshot manifest or clone manifest to create a new version of a clone manifest. No “working directory” is created because the clone manifest encodes the contents of the working directory by marking the delta chunk references as being “untracked” or “modified”.
The transaction specifies:
For each modified object, an additional metadata field is kept with the clone manifest system metadata noting the original version manifest that was initially captured in the snapshot. Unlike a generic object put, a new version of a clone manifest does not become the current version by default. Only a commit operation can make a newly committed version the current version.
Putting modifications to a clone manifest is functionally equivalent to performing a git “add” operation on a local repository.
Editing Untracked Objects
Existing source control solutions such as Git and mercurial allow users to edit files in the working directory that will not be tracked by the revision control system. This is most frequently used to exclude files that are generated by a make operation, limiting revision tracking to source files. These are most often specified by wildcard masks, such as “*.o”. However the revision control system can ignore any name when it is configured to be “untracked”.
The present invention allows new objects to be created within the clone manifest that are in an “untracked” state. Untracked delta chunk references are never committed or pushed. When the clone manifest is finally abandoned, as explained in the next section, these changes will be lost. This is the same result as when untracked files are forgotten when the working directory is finally erased.
The object is created when it is first opened or created, and each write creates a new “untracked” delta chunk reference, potentially overwriting all or part of previous delta chunk references. Read operations referencing this payload will receive these bytes; read operations referencing undefined content will receive all zeroes or, for key/value records, an explicit “no such key” error indication.
Committing a Clone Manifest
Committing a clone manifest creates a new version of the clone manifest, or optionally of a snapshot manifest, with the following extensions to the already described method of creating a snapshot or clone manifest:
Committing a clone manifest without specifying a remote target is functionally equivalent to a git “commit” operation. Committing a clone manifest to another clone manifest or snapshot manifest is the equivalent of a git “push” operation to a bare repository.
Merging One or More Clone Manifests into the Main Tree
When the target is another clone manifest, or the mainline object store, it is necessary to reconcile edits already performed since the clone on the target with the accumulated edits in the clone.
When it is possible to do so without the same records or byte ranges being referenced on both sides, the merge will be applied automatically by applying the delta in the clone manifest from its original version (when it split from the base that it is being re-merged with) to the current versions of objects in the merge target. This can be done on a per-shard basis.
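For illustration only, the per-shard automatic merge decision might be sketched as follows for key/value records; deletions and byte-range conflicts are omitted, and all names are hypothetical:

```python
def auto_merge_shard(base: dict, target: dict, clone: dict):
    """Merge clone edits into the target shard when they touch disjoint records.

    base   - the shard's records at the version the clone split from
    target - the shard's current records in the merge target
    clone  - the shard's records in the clone manifest
    Returns the merged records, or None when the same record was edited on both
    sides and manual reconciliation is required.
    """
    clone_edits = {k: v for k, v in clone.items() if base.get(k) != v}
    target_edits = {k for k, v in target.items() if base.get(k) != v}
    if set(clone_edits) & target_edits:
        return None
    merged = dict(target)
    merged.update(clone_edits)
    return merged

base = {"a.txt": "v1", "b.txt": "v1"}
assert auto_merge_shard(base, {"a.txt": "v2", "b.txt": "v1"},
                        {"a.txt": "v1", "b.txt": "v3"}) == {"a.txt": "v2", "b.txt": "v3"}
```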
With reference to
With reference to
Usage of Clones within Distributed Storage System
The use of clones allows for an extremely versatile storage system with the capability for scalable distributed computing and storage.
Root system 2301 follows the architecture of
Similarly, clone system 2302 follows the architecture of
Here, clone system 2302 is exemplary, and it should be understood that any number of clone systems can co-exist with root system 2301.
The devices and connections of root system 2301 and clone system 2302 can overlap to any degree. For example, a particular client might be part of both systems, and the same storage servers and storage devices might be used in both systems.
Abandoning a Clone Manifest
A clone manifest can be abandoned by expunging the specific version, using the same approach used for expunging any object.
Implementing File Archives Using Clone Manifests
A clone manifest can be used to manage a set of named objects that have never been put as distinct objects to the main tree of the object storage system. These are pending edits for new objects created in a clone manifest. The user can get or put these objects using the clone manifest much as they could get or put a file to a .tar, .tgz or .zip archive.
Implementing Files or Volumes Over Objects Using Clone Manifests
One use of the present invention is to efficiently implement a file or block interface to logical volumes over an object store.
Typically, volumes are already under management plane control for a given storage domain where the management plane assigns the exclusive right to mount a volume for write to a single entity such as a virtual machine. In the present invention, this assignment may be to a library in the end instance itself or to a proxy acting on behalf of the end instance.
Files, by contrast, typically have an existing network access protocol such as NFS (Network File System) which has pre-existing rules for determining which instance of a file system has the right to update specific portions of the namespace. The file access daemon would apply standard procedures to obtain the necessary rights to modify portions of the namespace under existing protocols. The present invention innovates in how those edits are applied to object storage, not in any of the file sharing protocol exchanges over the network.
In either case, the agent creates a clone manifest of the reference version manifest or snapshot manifest, and then applies updates to the clone manifest. Use of the Block Shard Chunk Reference, discussed in the Incorporated References, can be useful when updating byte array objects with random partial writes.
Changes are only committed back to the default namespace 2480 when the user wants to make the accumulated changes visible to subsequent users of the file/volume access layer 2410. This would typically be done when committing before unmounting a volume or file system, but could be done at extra commit points chosen by the user as well.
Block Shard Chunk Reference
A Block Shard Chunk Reference defines a shard as being a specific range of bytes for the object, and then specifying the CHIT for the current version's Payload or Manifest Chunk for this shard.
Block Shards are useful for performing edits for byte ranges for open volumes or files using clone manifests. The put transaction can supply the specifically modified range, and have the targeted storage servers create a new Chunk which replaces the specified range and supply the new CHIT for the shard. This can be implemented using the foregoing embodiments and is a specific use case for those embodiments.
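For illustration only, mapping a random partial write to the block shards it touches might be sketched as follows, assuming equal-size block shards; the shard size and function name are hypothetical:

```python
def shards_for_write(offset: int, length: int, shard_size: int):
    """Assuming block shards cover fixed-size byte ranges, return the shard indexes a write touches."""
    first = offset // shard_size
    last = (offset + length - 1) // shard_size
    return list(range(first, last + 1))

# A 4 KiB random write into a volume with 1 MiB block shards usually touches one shard...
assert shards_for_write(3_145_728, 4096, 1_048_576) == [3]
# ...unless it straddles a shard boundary, in which case two shards get new chunks and new CHITs.
assert shards_for_write(1_046_528, 4096, 1_048_576) == [0, 1]
```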
This application builds upon the inventions of: U.S. patent application Ser. No. 14/258,791, filed on Apr. 22, 2014 and titled “SYSTEMS AND METHODS FOR SCALABLE OBJECT STORAGE”; U.S. patent application Ser. No. 14/258,791 is: a continuation of U.S. patent application Ser. No. 13/624,593, filed on Sep. 21, 2012, titled “SYSTEMS AND METHODS FOR SCALABLE OBJECT STORAGE,” and issued as U.S. Pat. No. 8,745,095; a U.S. patent application Ser. No. 13/209,342, filed on Aug. 12, 2011, titled “CLOUD STORAGE SYSTEM WITH DISTRIBUTED METADATA,” and issued as U.S. Pat. No. 8,533,231; U.S. patent application Ser. No. 13/415,742, filed on Mar. 8, 2012, titled “UNIFIED LOCAL STORAGE SUPPORTING FILE AND CLOUD OBJECT ACCESS” and issued as U.S. Pat. No. 8,849,759; U.S. patent application Ser. No. 14/095,839, which was filed on Dec. 3, 2013 and titled “SCALABLE TRANSPORT SYSTEM FOR MULTICAST REPLICATION”; U.S. patent application Ser. No. 14/095,843, which was filed on Dec. 3, 2013 and titled “SCALABLE TRANSPORT SYSTEM FOR MULTICAST REPLICATION”; U.S. patent application Ser. No. 14/095,848, which was filed on Dec. 3, 2013 and titled “SCALABLE TRANSPORT WITH CLIENT-CONSENSUS RENDEZVOUS”; U.S. patent application Ser. No. 14/095,855, which was filed on Dec. 3, 2013 and titled “SCALABLE TRANSPORT WITH CLUSTER-CONSENSUS RENDEZVOUS”; U.S. Patent Application No. 62/040,962, which was filed on Aug. 22, 2014 and titled “SYSTEMS AND METHODS FOR MULTICAST REPLICATION BASED ERASURE ENCODING;” U.S. Patent Application No. 62/098,727, which was filed on Dec. 31, 2014 and titled “CLOUD COPY ON WRITE (CCOW) STORAGE SYSTEM ENHANCED AND EXTENDED TO SUPPORT POSIX FILES, ERASURE ENCODING AND BIG DATA ANALYTICS”; and U.S. patent application Ser. No. 14/820,471, which was filed on Aug. 6, 2015 and titled “Object Storage System with Local Transaction Logs, A Distributed Namespace, and Optimized Support for User Directories.” All of the above-listed application and patents are incorporated by reference herein and referred to collectively as the “Incorporated References.”