The disclosure generally relates to the field of data storage systems, and more particularly to providing file system and object-based access to store, manage, and access data stored in an object-based storage system.
Network-based storage is commonly utilized for data backup, geographically distributed data accessibility, and other purposes. In a network storage environment, a storage server makes data available to clients by presenting or exporting to the clients one or more logical containers of data. There are various forms of network storage, including network attached storage (NAS) and storage area network (SAN). For NAS, a storage server services file-level requests from clients, whereas SAN storage servers service block-level requests. Some storage server systems support both file-level and block-level requests.
There are multiple mechanisms and protocols utilized to access data stored in a network storage system. For example, a Network File System (NFS) protocol or Common Internet File System (CIFS) protocol may be utilized to access a file over a network in a manner similar to how local storage is accessed. A client may also use an object protocol, such as the Hypertext Transfer Protocol (HTTP) or the Cloud Data Management Interface (CDMI), to access stored data over a local area network (LAN) or over a wide area network (WAN) such as the Internet.
Object-based storage (OBS) is a scalable system for storing and managing data objects without using hierarchical naming schemas. OBS systems integrate, or “ingest,” variable size data items as objects having unique ID keys into a flat name space structure. Object metadata is typically stored with the objects themselves rather than in a separate file system metadata structure. Objects are accessed and retrieved using key-based searching implemented via a web services interface such as one based on the Representational State Transfer (REST) architecture or the Simple Object Access Protocol (SOAP). This allows applications to directly access objects across a network using “get” and “put” commands without having to process more complex file system and/or block access commands.
Relatively direct application access to stored data is often beneficial since the application has a more detailed operation-specific perspective of the state of the data than an intermediary storage utility package would have. Direct access also provides increased application control of I/O responsiveness. However, direct OBS access is not possible for file system applications due to the substantial differences in access APIs, transaction protocols, and naming schemas. A NAS gateway may be utilized to provide OBS access to applications that use non-OBS compatible APIs and naming schemas. Such gateways may provide a translation layer that enables applications to access OBS without modification using, for example, NFS or CIFS. However, such gateways may interfere with native OBS access (e.g., S3 access) and, furthermore, may not provide the adjustable data access granularity and transaction responsiveness that are typical of file system protocols.
A system and method are disclosed for replicating object-based operations generated based on file system commands. In one aspect, an object storage backed file system cache includes a replication engine that selects, from an intent log, records for multiple transaction groups. Each of the records may associate an object-based operation with a transaction group identifier that is associated with a file system command from which the object-based operation was generated. The replication engine identifies transaction groups that each include at least one object-based operation associated with a same transaction group identifier and reads object data associated with at least one of the object-based operations. The replication engine determines operation dependencies among the transaction groups based on the object data and sequences the transaction groups for replication based on the determined operation dependencies.
This summary is a brief summary for the disclosure, and not a comprehensive summary. The purpose of this brief summary is to provide a compact explanation as a preview to the disclosure. This brief summary does not capture the entire disclosure or all aspects, and should not be used to limit claim scope.
Aspects of the disclosure may be better understood by referencing the accompanying drawings.
A file system includes the data structures and methods/functions used to organize file system objects, access file system objects, and maintain a hierarchical namespace of the file system. File system objects include directories and files. Since this disclosure relates to object-based storage (OBS) and objects in OBS, a file system object is referred to herein as a “file system entity” instead of a “file system object” to reduce overloading of the term “object.” An “object” refers to a data structure that conforms to one or more OBS protocols. Thus, an “inode object” in this disclosure is not the data structure that represents a file in a Unix® type of operating system.
This description also uses “command,” “operation,” and “request” in a manner to reduce overloading of these terms. Although these terms can be used as variants of a requested action, this description aligns the terms with the protocol and source domain of the requested action. The description uses “file system command” or “command” to refer to a requested action defined by a file system protocol and received from or sent to a file system client. The description uses “object-based operation” or “operation” to refer to a requested action defined by an object-based storage protocol and generated by an object storage backed file system. The description uses “object storage request” to refer to an action defined by a specific object-based storage protocol (e.g., S3) and received from or sent to an object-based storage system.
The disclosure describes a system and program flow that enable file system protocol access to OBS storage that is compatible with native OBS protocol access and that preserve self-consistent views of the storage configuration state. An OBS bridge includes an object storage backed file system (OSFS) that receives and processes file system commands. The OSFS includes command handlers or other logic to map the file system commands into object-based operations that employ a generic OBS protocol. The mapping may require generating one or more object-based operations corresponding to a single file system command, with the one or more object-based operations forming a file system transaction. To enable access to OBS objects by file system clients, the OSFS augments OBS object representations such that each object is represented by an inode object and an associated namespace object. The inode object contains a key by which it is referenced, object content (e.g., user data), and metadata. The namespace object contains namespace information including a file system name of the inode object and an association between the file system name and the associated inode object's key value. Organized in this manner within a distinct object, the namespace information enables file system access to the inode object while also enabling useful decoupling of the namespace object for namespace transactions such as may be requested by a file system client. The decoupling also enables native object-based storage applications to directly access inode objects.
The disclosure also describes methods and systems that bridge the I/O performance gap between file systems and OBS systems. For example, file systems are structured to enable relatively fast and efficient partial updates of files resulting in reduced latency. Traditional object stores process each object as a whole using object transfer protocols such as RESTful protocols. The disclosure describes an intermediate storage and processing feature referred to as an OSFS cache that provides data and storage state protection and leverages the aforementioned filename/object duality to improve I/O performance for file system clients.
OBS client 122 is connected relatively directly to object storage 120 over WAN 110. OBS client 122 may be, for example, a Cloud services client application that uses web services calls to access object-based storage items (i.e., objects). OBS client 122 may, for example, access objects within object storage 120 using direct calls based on a RESTful protocol. It should be noted that reference as a “client” is relative to the focus of the description, as either or both of OBS client 122 and file system client 102 may be a “server” if configured in a file sharing arrangement with other servers. Unlike OBS client 122, file system client 102 comprises a file system application, such as a database application that is supported by an underlying Unix® style file system. File system client 102 utilizes file system based networking protocols common in NAS architectures to access file system entities such as files and directories configured in a hierarchical manner. For example, file system client 102 may utilize the Network File System (NFS) or Common Internet File System (CIFS) protocol.
A NAS gateway 115 provides bridge and NAS server services by which file system client 102 can access and utilize object storage 120. NAS gateway 115 includes hardware and software processing features such as a virtual file system (VFS) switch 112 and an OBS bridge 118. VFS switch 112 establishes the protocols and persistent namespace coherency by which to receive file system commands from and send responses to file system client 102. OBS bridge 118 includes an object storage backed file system (OSFS) 114 and an associated OSFS cache 116. Together, OSFS 114 and OSFS cache 116 create and manage objects in object storage 120 to provide a hierarchical file system namespace 111 (“file system namespace”) to file system client 102. The example file system namespace 111 includes several file and directory entities distributed across three directory levels. The top-level root directory, root, contains child directories dir1 and dir2. Directory dir1 contains child directory dir3 and a file, file1. Directory dir3 contains files file2 and file3.
OSFS 114 processes file system commands in a manner that provides an intermediate OBS protocol interface for file system commands, and that simultaneously generates a file system namespace, such as file system namespace 111, to be utilized in OBS bridge transactions and persistently stored in backend object storage 120. To create the file system namespace, OSFS 114 generates a namespace object and a corresponding inode object for each file system entity (e.g., file or directory). To enable transaction protocol bridging, OSFS 114 generates related groups of object-based operations corresponding to each file system command and applies the dual object per file system entity structure.
File system commands, such as from file system client 102, are received by VFS switch 112 and forwarded to OSFS 114. VFS switch 112 may partially process the file system command and pass the result to the OSFS 114. For instance, VFS switch 112 may access its own directory cache and inode cache to resolve a name of a file system entity to an inode number corresponding to the file system entity indicated in the file system command. This information can be passed along with the file system command to OSFS 114.
OSFS 114 processes the file system command to generate one or more corresponding object-based operations. For example, OSFS 114 may include multiple file system command-specific handlers configured to generate a group of one or more object-based operations that together perform the file system command. In this manner, OSFS 114 transforms the received file system command into an object-centric file system transaction comprising multiple object-based operations. OSFS 114 determines a set of n object-based operations that implement the file system command using objects rather than file system entities. The object-based operations are defined methods or functions that conform to OBS semantics, for example specifying a key value parameter. OSFS 114 instantiates the object-based operations in accordance with the parameters of the file system command and any other information provided by the VFS switch 112. OSFS 114 forms the file system transaction with the object-based operation instances. OSFS 114 submits the transaction to OSFS cache 116 and may record the transaction into a transaction log (not depicted) which can be replayed if another node takes over for the node (e.g., virtual machine or physical machine) hosting OSFS 114.
To create a file system entity, such as in response to receiving a file system command specifying creation of a file or directory, OSFS 114 determines a new inode number for the file system entity. OSFS 114 may convert the inode number from an integer value to an ASCII value, which could be used as a parameter value in an object-based operation used to form the file system transaction. OSFS 114 instantiates a first object storage operation to create a first object with a first object key derived from the determined inode number of the file system entity and with metadata that indicates attributes of the file system entity. OSFS 114 instantiates a second object storage operation to create a second object with a second object key and with metadata that associates the second object key with the first object key. The second object key includes an inode number of a parent directory of the file system entity and also a name of the file system entity.
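The two-object creation described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `CreateObject` class, the `build_create_transaction` helper, and the `inode/` and `ns/` key layouts are hypothetical names chosen for the example; the disclosure specifies only that the first key is derived from the inode number and that the second key includes the parent directory's inode number and the entity name.

```python
from dataclasses import dataclass, field


@dataclass
class CreateObject:
    """One object-based operation; the key names the operation's target object."""
    key: str
    metadata: dict = field(default_factory=dict)


def build_create_transaction(inode_number, parent_inode_number, name, attributes):
    """Return the two operations that together create one file system entity."""
    # First object: the inode object. Its key is derived from the inode
    # number rendered as text (the "inode/" prefix is a hypothetical layout).
    inode_key = f"inode/{inode_number}"
    inode_op = CreateObject(key=inode_key, metadata=dict(attributes))

    # Second object: the namespace object. Its key combines the parent
    # directory's inode number with the entity name, and its metadata
    # associates that key with the inode object's key.
    namespace_key = f"ns/{parent_inode_number}/{name}"  # hypothetical layout
    namespace_op = CreateObject(key=namespace_key, metadata={"inode_key": inode_key})

    return [inode_op, namespace_op]


# Example: create file1 (inode 42) inside a directory whose inode number is 7.
ops = build_create_transaction(42, 7, "file1", {"type": "file", "mode": "0644"})
```

Because the namespace object only points at the inode object's key, a rename under this layout would touch namespace objects while leaving the inode object, and any native OBS access to it, unchanged.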
As shown in
OSFS cache 116 attempts to fulfill file system transactions received from OSFS 114 with locally stored data. If a transaction cannot be fulfilled with locally stored data, OSFS cache 116 forwards the object-based operation instances forming the transaction to an object storage adapter (OSA) 117. OSA 117 responds by generating object storage requests corresponding to the operations and which conform to a particular object storage protocol, such as S3.
In response to the requests, object storage 120 provides responses processed by OSA 117 and which propagate back through OBS bridge 118. More specifically, OSFS cache 116 generates a transaction response which is communicated to OSFS 114. OSFS 114 may update the transaction log to remove the transaction corresponding to the transaction response. OSFS 114 also generates a file system command response based on the transaction response, and passes the response back to file system client 102 via VFS switch 112.
In addition to providing file system namespace accessibility in a manner enabling native as well as bridge-enabled access, the described aspects provide namespace portability and concurrency for geo-distributed clients. Along with file data and its associated metadata, object storage 120 stores a persistent representation of the namespace via storage of the inode and namespace objects depicted in
Each of VMs 202 and 222 is configured to include hardware and software resources for implementing a NAS gateway/OBS bridge such as that described with reference to
An OBS bridge cluster, such as bridge cluster 205, may be created administratively such as by issuing a Create cluster command from a properly configured VM such as 202 or 222. Two nodes are depicted for the purpose of clarity, but other nodes may be added or removed from bridge cluster 205 such as by issuing or receiving Join or Leave commands administratively. Bridge cluster 205 may operate in a “data cluster” configuration in which each of nodes 204 and 224 may concurrently and independently query (e.g., read) object storage backed file system data within object storages 240 and 250. In the data cluster configuration, one of the nodes is configured as a Read/Write node with update access to create, write to, or otherwise modify namespace and inode objects 246 and 245. The Read/Write node may consequently have exclusive access to a transaction log 244 which provides a persistent view of in-flight transactions. Transaction log 244 may persist namespace only or namespace and data included within in-flight file system transactions. While being members of the same bridge cluster 205, there may be minimal direct interaction between the nodes 204 and 224 if the cluster is configured to provide managed, but substantially independent multi-client access to a given object storage container/bucket.
In an aspect, bridge node 204 may be configured as the Read/Write node and bridge node 224 as a Read-Only node. Each node has its own partially independent view of the state of the file system namespace via the transactions and objects recorded in its respective OSFS cache. Configured in this manner, bridge node 204 implements and is immediately aware of all pending namespace state changes while bridge node 224 is exposed to such changes via the backend storages 240 and 250 only after the changes are replicated by bridge node 204 from its OSFS cache 208. For example, in response to bridge node 204 receiving a file rename file system command, OSFS 206 will instantiate one or more object-based operations to form a file system transaction that implements the command in the object namespace. OSFS cache 208 will record the operations in an intent log (not depicted) within non-volatile storage 211 where the operations remain until an asynchronous writer service replicates the operations to object stores 240 and/or 250. Prior to replication of the file system transaction, bridge node 224 remains unaware of and unable to determine that the namespace change has occurred. This eventual consistency model is typical of shared object storage systems but not of shared file system storage in which locking or other concurrency mechanisms are used to ensure a consistent view of the hierarchical file system structure.
An OSFS cache is a subsystem of an OBS bridge that is operably configured between an OSFS and an OSA. Among the functions of the OSFS cache is to provide object-centric services to its OSFS client, enabling object-backed file system transactions to be processed with improved I/O performance compared with traditional object storage. The OSFS cache employs an intent log and an asynchronous writer (lazy writer) for propagating object-centric file system update transactions to the backend object store.
Operations forming transaction groups are submitted to a persistence layer 315, which comprises a database catalog 316 and an intent log writer 318. Persistence layer 315 maintains state information for the OSFS cache by mapping objects and object relationships onto their corresponding database entries. Intent log writer 318 identifies those transaction groups consisting of one or more object-based operations that update object storage (e.g., mkdir). Intent log writer 318 records and provides ordering/sequencing by which update-type transaction groups are to be replicated. Catalog 316 tracks all data and metadata within the OSFS cache, effectively serving as a key-based index. For example, service API 302 uses catalog 316 to determine if a query operation can be fulfilled locally, or must be fulfilled from backend object storage. Intent log writer 318 uses catalog 316 to locally store update transactions and corresponding operations and associated data, thus providing query access to the intent log data.
Intent log writer 318 is the mechanism through which update transaction groups are preserved to an intent log for eventual replication to backend object storage. When an update transaction group is submitted to the OSFS cache, intent log writer 318 persists the transaction group and its constituent operations within database 310 before the originating file system command is confirmed. In the case of a data Write operation, intent log writer 318 also persists the user data to extent storage 309 via extents reference table 308 before the file system command is confirmed. Central to the function of intent log writer 318 and the intent log that it generates is the notion of a file system transaction group (transaction group). A transaction group consists of one or more object-based operations that are processed atomically in an OSFS-specified order. In response to identifying the transaction group or one of the transaction group's operations as an update, intent log writer 318 executes a database transaction to record the transaction group and the components of each of the constituent operations in corresponding tables of database 310. The recorded transaction is replicated to object store at a future point. The intent log generated by intent log writer 318 persists the updates in the chronological order in which they were received from the OSFS. This enables the OSFS cache's write-back mechanism (depicted as asynchronous writer 322) to preserve the original insertion order as it replicates to backend object storage. In this manner, intent log writer 318 generates chronologically sequenced records of each update transaction group that has not yet been replicated to backend object storage.
Each record within the intent log is constructed to include two types of information: an object-based operation such as CreateObject, and the transaction group to which the operation belongs. An object-based operation includes a named object, or key, as the target of the operation. A transaction group describes a set of one or more operations that are to be processed as a single transaction work unit (i.e., processed atomically) when replicated to backend object storage. The records generated by intent log writer 318 are self-describing, including the operations and data to be written, thus enabling recovery of the data as well as the object storage state via replay of the operations. Intent log writer 318 uses catalog 316 to reference file data that may be stored in extent storage 309 that is managed by the local file system. For example, if an update operation includes user data (i.e., object data content), then the data may be committed (if not already committed) to extent storage 309.
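One way to picture such a record, and the grouping of records into transaction groups, is sketched below. The `IntentLogRecord` fields, the `group_by_transaction` helper, the `WriteObject` operation name, and the key strings are assumptions made for illustration; only the pairing of an operation with a transaction group identifier and the chronological ordering come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntentLogRecord:
    sequence: int                 # insertion order assigned when recorded
    tg_id: int                    # transaction group the operation belongs to
    operation: str                # e.g. "CreateObject"
    key: str                      # named object targeted by the operation
    data: Optional[bytes] = None  # user data, if the operation carries any


def group_by_transaction(records):
    """Group records by tg_id, keeping groups and operations in log order."""
    groups = {}  # dicts preserve insertion order in Python 3.7+
    for record in sorted(records, key=lambda r: r.sequence):
        groups.setdefault(record.tg_id, []).append(record)
    return groups


records = [
    IntentLogRecord(1, 100, "CreateObject", "inode/42"),
    IntentLogRecord(2, 100, "CreateObject", "ns/7/file1"),
    IntentLogRecord(3, 101, "WriteObject", "inode/42", b"hello"),
]
groups = group_by_transaction(records)
```

Since each record carries its operation, target key, and any data, replaying the groups in order would be enough to rebuild both the data and the object storage state, which is the self-describing property noted above.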
Database 310 includes several tables that, in conjunction with catalog 316, associatively store related data utilized for transaction persistence and replication. Among these are an update operations table 314 in which object-based operations are recorded and a transaction groups table 312 in which transaction group identifiers are recorded in association with corresponding operations stored in table 314. The depicted database tables further include an objects table 304, a metadata table 306, and an extents reference table 308. Objects table 304 stores objects including namespace objects that are identified in object-based operations. Metadata table 306 stores the file system metadata associated with inode objects. Extents reference table 308 includes pointers by which catalog 316 and intent log writer 318 can locate storage extents containing user data within the local file system. The records within the intent log may be formed from information contained in operations table 314 and transaction groups table 312 as well as information from one or more of objects table 304, metadata table 306, and extents reference table 308. The database tables may be used in various combinations in response to update or query (e.g., read) operation requests. For example, in response to an object metadata read request, catalog 316 would jointly reference objects table 304 and metadata table 306.
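The table relationships can be illustrated with a small in-memory SQLite schema. This is a sketch only: the disclosure does not name a database engine or a schema, so every table and column name below is hypothetical, and the extents reference table is omitted for brevity. The final query shows the joint reference to the objects and metadata tables for a metadata read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects  (obj_key TEXT PRIMARY KEY, kind TEXT NOT NULL);
    CREATE TABLE metadata (obj_key TEXT NOT NULL REFERENCES objects(obj_key),
                           name    TEXT NOT NULL,
                           value   TEXT NOT NULL);
    CREATE TABLE transaction_groups (tg_id INTEGER PRIMARY KEY);
    CREATE TABLE update_operations (
        seq     INTEGER PRIMARY KEY AUTOINCREMENT,  -- chronological order
        tg_id   INTEGER NOT NULL REFERENCES transaction_groups(tg_id),
        op      TEXT NOT NULL,
        obj_key TEXT NOT NULL);
""")

# Record one transaction group and its operation in a single database
# transaction, so the group is persisted entirely or not at all.
with conn:
    conn.execute("INSERT INTO objects VALUES ('inode/42', 'inode')")
    conn.executemany("INSERT INTO metadata VALUES ('inode/42', ?, ?)",
                     [("mode", "0644"), ("size", "1024")])
    conn.execute("INSERT INTO transaction_groups VALUES (1)")
    conn.execute("INSERT INTO update_operations (tg_id, op, obj_key) "
                 "VALUES (1, 'CreateObject', 'inode/42')")

# Metadata read: jointly reference the objects and metadata tables.
rows = conn.execute("""
    SELECT o.obj_key, m.name, m.value
    FROM objects AS o JOIN metadata AS m ON m.obj_key = o.obj_key
    WHERE o.obj_key = ?
    ORDER BY m.name
""", ("inode/42",)).fetchall()
```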
The depicted OSFS cache further includes a cache manager 317 that monitors the storage availability of the underlying storage device and provides corresponding cache management service such as garbage collection. Cache manager 317 interacts with a replication engine 320 during transaction replication by signaling a high pressure condition to replication engine 320 which responds by replicating at a higher rate to make more data within the OSFS cache available for eviction.
The OSFS cache further includes a replication engine 320 that comprises an asynchronous (async) writer 322 and a dependency agent 324. Replication engine 320 interacts with an OSA 328 to replicate (replay, commit) the intent log's contents to backend object storage. Replication is executed, in part, based on the insertion order in which intent log writer 318 received and recorded transactions. The order of replication may also be optimized depending on the nature of the operations constituting the transaction groups and dependencies, including namespace dependencies, between the transaction groups. Execution of replication engine 320 may generally comply with a periodic consistency point that may be administratively determined or may be dynamically adjusted based on operating conditions. In an aspect, the high-level sequence of replication engine 320 execution begins with async writer 322 reading a transaction group comprising one or more object-based operations from the intent log. Async writer 322 submits the object-based operations in a pre-specified transaction group order to OSA 328 and waits for a response from backend object storage. On confirmation of success, async writer 322 removes the transaction group and corresponding operations from the intent log. On indication that any of the operations failed, async writer 322 may log the failure to the intent log and does not remove the transaction group from the log. In this manner, once a transaction group has been recorded by intent log writer 318, it is removed only after it has been replicated to backend object storage.
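The basic, fully serialized replication pass can be sketched as a loop over the pending transaction groups. The `replicate_once` function, the dictionary-based log, and the `submit` callable standing in for the OSA are hypothetical. Stopping at the first failed group is an assumption made here to keep later groups from overtaking it; the description states only that a failed group is not removed from the log.

```python
def replicate_once(intent_log, submit):
    """Replicate pending transaction groups in insertion order.

    intent_log: dict mapping tg_id -> list of operations, oldest group first.
    submit:     stand-in for the OSA; returns True when the backend confirms.
    Returns the tg_id of the group that failed, or None if all were replicated.
    """
    for tg_id in list(intent_log):            # snapshot: the log is mutated below
        # Submit the group's operations in their recorded order; all()
        # stops submitting at the first operation that fails.
        if all(submit(op) for op in intent_log[tg_id]):
            del intent_log[tg_id]             # removed only after confirmation
        else:
            return tg_id                      # stays in the log for a later pass
    return None


log = {1: ["op-a"], 2: ["op-b", "op-c"], 3: ["op-d"]}
sent = []


def fake_osa(op):
    sent.append(op)
    return op != "op-c"                       # simulate one backend failure


stalled = replicate_once(log, fake_osa)
```

The sketch does not undo "op-b" when "op-c" fails; making a partially applied group atomic at the backend would need additional handling that is outside this illustration.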
Maintaining general chronological order is required to prevent file system namespace corruption. However, some modifications to the serialized sequencing of transaction group replication may improve I/O responsiveness and reduce network traffic levels while maintaining namespace integrity. In an aspect, async writer 322 interacts with dependency agent 324 to increase replication throughput by altering the otherwise serialized sequencing. Dependency agent 324 determines relationships, such as namespace dependencies, between transaction groups to determine whether and in what manner to modify the otherwise serially chronological sequencing of transaction group replication. For example, if dependency agent 324 detects that chronologically consecutive transaction groups, TGn and TGn+1, do not share a namespace dependency (i.e., are orthogonal), dependency agent 324 may provide both transaction groups for concurrent replication by async writer 322. As another example, if dependency agent 324 detects that multiple transaction groups are writes to the same inode object, the transaction groups may be coalesced into a single write operation to backend storage.
An asynchronous (async) writer 415 periodically, or in response to messages from a cache manager, commences a replication sequence that begins with async writer 415 reading a series of transaction groups 414 from intent log 402. The sequence of the depicted series of transaction groups 414 is determined by the order in which they were received and recorded by intent log 402. In combination, the recording by intent log 402 and subsequent replication by asynchronous writer 415 generally follow a FIFO queuing schema which enables a lagging but consistent file system namespace view for other bridge nodes that share the same object store bucket. While FIFO replication sequencing generally applies as the initial sequence schema, async writer 415 may inter-operate with a dependency agent 420 to modify the otherwise entirely serialized replication to improve performance. The depicted example may employ at least two replication sequence optimizations.
One sequence optimization may be utilized for transaction groups determined to apply to objects that are different (not the same inode object) and are contained in different parent directories. Such transaction groups and/or their underlying object-based operations may be considered mutually orthogonal. The other replication sequence optimization applies to transaction groups that comprise writes to the same inode object. To implement these optimizations, async writer 415 reads out the series of transaction groups 414 and may optimize the transaction groups for replication optimization. Async writer 415 then pushes the series of transaction groups to dependency agent 420. Dependency agent 420 identifies the transaction groups and their corresponding member operations to determine which, if any, of the replication sequence optimizations can be applied. After sending (pushing) the transaction groups, async writer 415 queries dependency agent 420 for transaction groups that are ready to be replicated to backend object storage via an OSA 416.
For orthogonality-based optimization, dependency agent 420 reads namespace object data for namespace objects identified in the object-based operations. Dependency agent 420 compares the namespace object data for operations contained within different transaction groups to determine, for instance, whether a dependency exists between one or more operations in one transaction group and one or more operations in another transaction group. In response to determining that a dependency exists between a pair of consecutively sequenced transaction groups (e.g., TG1 and TG2), dependency agent 420 stages the originally preceding group to remain sequenced for replication prior to replication of the originally subsequent group. If no dependencies are found to exist between TG1 and TG2, dependency agent 420 stages TG1 and TG2 to be replicated concurrently by async writer 415.
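The orthogonality test and the resulting staging can be sketched as follows. The `TransactionGroup` fields, the `orthogonal` and `stage_batches` helpers, and the key strings are hypothetical. The sketch treats two groups as orthogonal when they share neither an inode object nor a parent directory, and places a group in the current batch only if it is orthogonal to every group already in that batch, so dependent groups keep their original relative order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransactionGroup:
    tg_id: int
    inode_keys: frozenset    # inode objects targeted by the group's operations
    parent_dirs: frozenset   # parent directories named by its namespace objects


def orthogonal(a, b):
    """True when the groups share neither an inode object nor a parent directory."""
    return (a.inode_keys.isdisjoint(b.inode_keys)
            and a.parent_dirs.isdisjoint(b.parent_dirs))


def stage_batches(groups):
    """Split groups (in log order) into batches.

    Groups within a batch may be replicated concurrently; batches are
    replicated one after another.
    """
    batches = []
    for group in groups:
        if batches and all(orthogonal(group, other) for other in batches[-1]):
            batches[-1].append(group)
        else:
            batches.append([group])
    return batches


tg1 = TransactionGroup(1, frozenset({"inode/10"}), frozenset({"inode/2"}))
tg2 = TransactionGroup(2, frozenset({"inode/11"}), frozenset({"inode/3"}))
tg3 = TransactionGroup(3, frozenset({"inode/12"}), frozenset({"inode/2"}))
batches = stage_batches([tg1, tg2, tg3])
staged_ids = [[g.tg_id for g in batch] for batch in batches]
```

Here TG1 and TG2 are staged together, while TG3 shares a parent directory with TG1 and therefore waits for the next batch. A fuller dependency check would presumably also consider ancestor directories (for example, a rename higher in the path), which this sketch does not model.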
For multi-write coalescence optimization, dependency agent 420 reads inode object keys to identify transaction groups comprising write operations that identify the same target inode object. Dependency agent 420 coalesces all such transaction groups for which there are no sequentially intermediate transaction groups. For sets of one or more writes to the same inode object that have intervening transaction groups, dependency agent 420 determines whether namespace dependencies exist between the write(s) to the same inode object and the intervening transaction groups. For instance, if TG1, TG2, and TG4 each comprise a write operation to the same inode object, dependency agent 420 will coalesce the underlying write operations in TG1 and TG2 into a single write operation because they are sequenced consecutively (no intermediate transaction group). To determine whether TG4 can be coalesced with TG1 and TG2, dependency agent 420 determines whether the intermediate TG3 comprises operations that introduce a dependency with respect to TG4. Extending the example, assume TG1, TG2, and TG4 are each writes to the same file1 inode object which is contained within a dir1 object. In response to determining that TG3 contains a rename operation renaming dir1 to dir2, dependency agent 420 will not coalesce TG4 with TG1 and TG2 since doing so will cause a namespace inconsistency.
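The TG1 through TG4 example can be traced with a short sketch. The dictionary layout of each group, the `coalesce_writes` function, and the `renames_parent` predicate are hypothetical; the predicate models only the one dependency named in the example (a rename of the directory that contains the written file). How the coalesced writes are merged into a single backend write is not shown.

```python
def coalesce_writes(groups, depends):
    """Return runs of tg_ids; each run is replicated as one unit.

    groups:  transaction groups in log order, as dicts with 'tg_id' and 'op'
             (writes also carry 'inode' and 'parent').
    depends: depends(intervening_group, write_group) -> True when the
             intervening group introduces a dependency for later writes.
    """
    staged = []      # runs in replication order
    open_runs = {}   # inode key -> (index into staged, first write of the run)
    for group in groups:
        run = open_runs.get(group["inode"]) if group["op"] == "write" else None
        if run is not None:
            staged[run[0]].append(group["tg_id"])   # coalesce into the open run
            continue
        # This group is not merged, so it intervenes: close every run whose
        # later writes would depend on it.
        for inode, (_index, write) in list(open_runs.items()):
            if depends(group, write):
                del open_runs[inode]
        staged.append([group["tg_id"]])
        if group["op"] == "write":
            open_runs[group["inode"]] = (len(staged) - 1, group)
    return staged


def renames_parent(intervening, write):
    return intervening["op"] == "rename" and intervening["dir"] == write["parent"]


w = {"op": "write", "inode": "file1", "parent": "dir1"}
with_rename = coalesce_writes(
    [dict(w, tg_id=1), dict(w, tg_id=2),
     {"tg_id": 3, "op": "rename", "dir": "dir1"},
     dict(w, tg_id=4)],
    renames_parent)
without_rename = coalesce_writes(
    [dict(w, tg_id=1), dict(w, tg_id=2),
     {"tg_id": 3, "op": "write", "inode": "file9", "parent": "dir5"},
     dict(w, tg_id=4)],
    renames_parent)
```

With the rename in TG3, TG4 stays separate and the result is three runs; when TG3 is instead an unrelated write, TG4 joins TG1 and TG2.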
An OSFS cache, such as those previously described, receives the object-based operations within the transaction request (block 510). The OSFS cache processes the content of the request to determine the nature of the member operations. In the depicted example, the OSFS cache determines whether the transaction group as a whole or one or more of the member operations will result in a modification to the object store (i.e., whether the transaction group operations include update operations). The OSFS cache may determine whether the transaction request is an update request by reading one or more of the member operations. In another aspect, the OSFS cache may read a flag or another transaction request indicator such as may be encoded in the transaction group ID to determine whether the transaction request and/or any of its member operations will modify the object store.
In response to determining at block 512 that the request is a non-update request (e.g., a read), control passes to block 514 and the OSFS cache queries a database catalog or table to determine whether the request can be satisfied from locally stored data. In response to determining at block 516 that the requested data is not locally cached (i.e., a miss), control passes to block 520 with the OSFS forwarding the read request for retrieval from the backend object store. In response to detecting at block 516 that the requested data is locally cached (i.e., a hit), the OSFS cache determines at block 518 whether a time-to-live (TTL) period has expired for the requested data. In response to detecting that the TTL has expired, control passes to block 520 with the OSFS forwarding the read request to the backend object store. In response to detecting that the TTL has not expired, the OSFS cache returns the requested data from the local cache database to the OSFS at block 522.
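The read path of blocks 514 through 522 may be sketched as follows. This is an illustrative sketch only. The `ReadCache` class, the `backend_get` callable, and the injectable `clock` are hypothetical, and refreshing the local entry after a backend read is an assumption that the description above does not state.

```python
import time


class ReadCache:
    """Hypothetical TTL-checked read cache (illustrative names)."""

    def __init__(self, backend_get, ttl_seconds, clock=time.monotonic):
        self._backend_get = backend_get   # stands in for the backend object store
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}                # key -> (value, time stored)

    def read(self, key):
        entry = self._entries.get(key)            # block 514: query local catalog
        if entry is not None:                     # block 516: hit
            value, stored_at = entry
            if self._clock() - stored_at < self._ttl:   # block 518: TTL not expired
                return value                      # block 522: serve from cache
        value = self._backend_get(key)            # block 520: forward to backend
        # Assumption: the fetched data is cached again with a fresh timestamp.
        self._entries[key] = (value, self._clock())
        return value


now = [0.0]
backend_calls = []


def fake_backend(key):
    backend_calls.append(key)
    return "version-%d" % len(backend_calls)


cache = ReadCache(fake_backend, ttl_seconds=10, clock=lambda: now[0])
first = cache.read("obj")      # miss: forwarded to the backend
second = cache.read("obj")     # hit within TTL: served locally
now[0] = 11.0
third = cache.read("obj")      # hit with expired TTL: forwarded again
```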
Returning to block 512, in response to determining that the request is an update request (e.g., a write), the OSFS cache records the member operations in intent log records within the database at block 524. In an aspect, the OSFS cache records the member transactions in the order specified by the transaction request and/or in the order in which the operations were received in the single or multi-part transaction request. The order in which the operations are recorded may or may not be the same as the order in which the operations are eventually replicated. In an aspect in which the recording order does not determine replication order, the replication order may be determined based on a replication order encoded as part of the transaction request. The replication order is a serially sequential replication order that may be determinable by the OSFS cache following recordation of the operations. In addition to preserving the recording/replication order of the member operations, the OSFS cache records each of the operations in intent log records that associate the operations with the corresponding transaction group ID.
In an aspect in which the member operations are serially recorded within the intent log, at block 526 the OSFS cache follows storage of each operation with a determination of whether all of the member operations have been recorded. In response to determining that unrecorded operations remain, control passes back to block 510 with the OSFS cache receiving the next operation for recordation processing. In response to determining that all member operations have been recorded, the OSFS cache signals, by response message or otherwise, to the OSFS at block 528 that the requested transaction that was generated from a file system command has been completed. The OSFS may forward the completion message back to the file system client.
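The recording loop of blocks 524 through 528 may be sketched as follows, assuming an in-memory list in place of the database. The `IntentLog` class, its record layout, and the shape of the completion message are hypothetical.

```python
class IntentLog:
    """Hypothetical in-memory intent log (illustrative only)."""

    def __init__(self):
        self.records = []
        self._next_seq = 0

    def record_transaction(self, tg_id, operations):
        # Block 524: record each member operation in the order received and
        # associate it with the transaction group ID.
        for op in operations:
            self.records.append({"seq": self._next_seq, "tg_id": tg_id, "op": op})
            self._next_seq += 1
        # Blocks 526 and 528: once every member operation has been recorded,
        # signal completion back to the caller.
        return {"tg_id": tg_id, "status": "complete", "recorded": len(operations)}


intent_log = IntentLog()
reply = intent_log.record_transaction("TG1", ["create inode", "update dir1"])
intent_log.record_transaction("TG2", ["write file1"])
```

The monotonically increasing `seq` value is one possible way to preserve a serially sequential order across transaction groups. A replication order encoded in the transaction request could be stored alongside it instead.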
At block 612, the dependency agent compares the namespace object data of operations belonging to different transaction groups. The comparison may include determining whether one or more namespace keys contained in a first namespace object identified by a first operation match or bear another logical association with one or more namespace keys of a second namespace object identified by a second operation. For instance, consider a pair of transaction groups, TG1 and TG2, that were received and recorded by the intent log such that TG1 precedes TG2 in consecutive sequential order. Having read the namespace objects identified in member operations of both TG1 and TG2, the dependency agent cross compares the namespace object data between the groups to detect dependencies that would result in a file system namespace collision if TG1 and TG2 are not executed sequentially. Such file system namespace collisions may not impact the Read/Write bridge node hosting the async writer, but they may result in a corrupted file system namespace view for other nodes within the same cluster.
The dependency agent and async writer sequence or re-sequence the transaction groups based on whether dependencies were detected. In response to detecting a namespace dependency between consecutively sequenced transaction groups at block 614, the dependency agent and async writer maintain the same sequence order as was recorded in the intent log for replication of the respective transaction groups (block 618). In response to determining that no dependencies exist between the transaction groups, the otherwise consecutively sequenced groups are sent for replication concurrently (block 616).
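The sequencing decision of blocks 614, 616, and 618 may be sketched as a batching step. This sketch is a simplification. The `plan_batches` function is hypothetical, and a dependency is assumed to be any overlap of namespace keys. Each group is also compared against every member of the current batch, which is a conservative generalization of the pairwise comparison described above.

```python
def plan_batches(groups):
    """Group consecutive transaction groups into replication batches.

    `groups` is a list of (tg_id, namespace_keys) pairs in intent log order.
    Members of one batch may be replicated concurrently (block 616), while
    the batches themselves keep the recorded serial order (block 618).
    """
    batches = []
    for tg_id, keys in groups:
        independent = bool(batches) and all(
            keys.isdisjoint(other_keys) for _, other_keys in batches[-1]
        )
        if independent:
            batches[-1].append((tg_id, keys))   # no dependency: send together
        else:
            batches.append([(tg_id, keys)])     # dependency: preserve order
    return [[tg_id for tg_id, _ in batch] for batch in batches]


plan = plan_batches([
    ("TG1", {"dir1"}),
    ("TG2", {"dir2"}),
    ("TG3", {"dir1", "file3"}),   # shares dir1 with TG1
    ("TG4", {"dir5"}),
])
```

Here TG1 and TG2 fall into one concurrent batch, and TG3 begins a later batch because it shares the dir1 key with TG1.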
While the dependency agent continuously processes received transaction group records, the async writer may be configured to trigger replication sequences at consistency point (CP) intervals (block 620) and/or be configured to trigger replication based on operational conditions such as cache occupancy pressure (block 622). The CP interval may be coordinated with TTL periods set by other OBS bridge nodes, such as reader nodes. The async writer may be configured such that the CP is the maximum period that the async writer will wait before commencing the next replication sequence. In this manner, and as depicted with reference to blocks 620 and 622, the async writer monitors for CP expiration and in the meantime may commence a replication sequence if triggered by a cache occupancy message such as may be sent by a cache manager. In response to either trigger, control passes to block 624 with the async writer retrieving from the dependency agent transaction groups that have been sequenced (serially and/or optimally grouped or coalesced). The async writer sends the retrieved transaction groups in the determined sequence to the OSA (block 626).
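The two triggers of blocks 620 and 622 may be sketched with a timed wait. This is one possible arrangement, not the disclosed one. The `wait_for_replication_trigger` function is hypothetical, and a `threading.Event` is assumed to stand in for the cache occupancy message from a cache manager.

```python
import threading


def wait_for_replication_trigger(pressure_event, cp_interval_seconds):
    """Block until cache pressure is signaled or the CP interval elapses.

    The CP interval acts as the maximum wait before the next replication
    sequence begins. Returns a label naming which trigger fired.
    """
    # Event.wait returns True if the event was set before the timeout.
    if pressure_event.wait(timeout=cp_interval_seconds):
        pressure_event.clear()            # consume the occupancy signal
        return "cache_pressure"           # block 622
    return "consistency_point"            # block 620


pressure = threading.Event()
pressure.set()                            # cache manager reports pressure
early = wait_for_replication_trigger(pressure, cp_interval_seconds=5.0)
timed_out = wait_for_replication_trigger(pressure, cp_interval_seconds=0.01)
```

After either return value, a caller would proceed to retrieve the sequenced transaction groups (block 624) and send them on (block 626).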
In addition to or in place of the functions depicted and described with reference to blocks 610, 612, 614, 616, and 618, the dependency agent may optimize replication efficiency by coalescing write requests.
At block 706, for the remaining identified transaction groups, the dependency agent reads namespace objects identified in transaction groups that immediately precede one of the transaction groups that were identified at block 702 (TGn+1 in the example). The dependency agent also reads data for the namespace object(s) identified in the consecutively subsequent write transaction (TGn+2 in the example). At block 707, the dependency agent compares the namespace object data between the identified transaction group and one or more preceding and consecutively adjacent transaction groups that were not identified as comprising writes to the inode object. In response to detecting dependencies, the intent log serial order is maintained and the write operation is not coalesced with preceding writes to the same inode object (block 710). In response to determining that no dependencies exist between the write operation and all preceding and consecutively adjacent transaction groups, the write operation/transaction is coalesced into the preceding writes to the same inode object. For instance, and continuing with the preceding example, if no namespace dependencies are found between TGn+1 and TGn+2, the dependency agent will coalesce TGn with TGn+2, TGn+3, and TGn+4 to form a single write operation.
Variations
The flowcharts are provided to aid in understanding the illustrations and are not to be used to limit scope of the claims. The flowcharts depict example operations that can vary within the scope of the claims. Additional operations may be performed; fewer operations may be performed; the operations may be performed in parallel; and the operations may be performed in a different order. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by program code. The program code may be provided to a processor of a general purpose computer, special purpose computer, or other programmable machine or apparatus.
As will be appreciated, aspects of the disclosure may be embodied as a system, method or program code/instructions stored in one or more machine-readable media. Accordingly, aspects may take the form of hardware, software (including firmware, resident software, micro-code, etc.), or a combination of software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” The functionality provided as individual modules/units in the example illustrations can be organized differently in accordance with any one of platform (operating system and/or hardware), application ecosystem, interfaces, programmer preferences, programming language, administrator preferences, etc.
Any combination of one or more machine readable medium(s) may be utilized. The machine readable medium may be a machine readable signal medium or a machine readable storage medium. A machine readable storage medium may be, for example, but not limited to, a system, apparatus, or device, that employs any one of or combination of electronic, magnetic, optical, electromagnetic, infrared, or semiconductor technology to store program code. More specific examples (a non-exhaustive list) of the machine readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a machine readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. A machine readable storage medium is not a machine readable signal medium.
A machine readable signal medium may include a propagated data signal with machine readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A machine readable signal medium may be any machine readable medium that is not a machine readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a machine readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as the Java® programming language, C++ or the like; a dynamic programming language such as Python; a scripting language such as the Perl programming language or the PowerShell script language; and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on a stand-alone machine, may execute in a distributed manner across multiple machines, and may execute on one machine while providing results and/or accepting input on another machine.
The program code/instructions may also be stored in a machine readable medium that can direct a machine to function in a particular manner, such that the instructions stored in the machine readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
While the aspects of the disclosure are described with reference to various implementations and exploitations, it will be understood that these aspects are illustrative and that the scope of the claims is not limited to them. In general, techniques for an object storage backed file system that efficiently manipulates namespace as described herein may be implemented with facilities consistent with any hardware system or hardware systems. Many variations, modifications, additions, and improvements are possible.
Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the disclosure. In general, structures and functionality shown as separate components in the example configurations may be implemented as a combined structure or component. Similarly, structures and functionality shown as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the disclosure.