File systems provide an organized storage medium for files. Distributed file systems allow access to files from multiple nodes that communicate across a network (e.g., enterprise network).
Embodiments described herein provide for a scalable and reliable system for recording events relating to file system operations. Some embodiments include a system or method in which file system operations initiated on a node of a distributed file system environment are journaled asynchronously, and then subsequently stored for analysis. The types of analysis that can be performed based on the recorded events include, for example, compliance or auditing analysis pertaining to use of the distributed file system.
According to some embodiments, multiple file system events are detected on one or more nodes of a distributed file system. Each file system event corresponds to an operation that is to be performed on the file system. Each detected event is durably recorded as an entry within a journal for the node prior to either performing or completing the corresponding operation at the node. In some embodiments, a programmatic component that is external to the file system can process entries from the journal, and in response, the entries can be expired from the journal.
The term “durable” or variants thereof (e.g., “durably”) in the context of storing data or information means such data is stored in a manner that is resilient to computing failure and data loss over time. For example, durably recorded data can be stored on a non-volatile storage medium such as a disk drive for subsequent analysis or use.
One or more embodiments described herein provide that methods, techniques and actions performed by a computing device (e.g., node of a distributed file system) are performed programmatically, or as a computer-implemented method. Programmatically means through the use of code, or computer-executable instructions. A programmatically performed step may or may not be automatic.
With reference to
Node Description
In an embodiment, a monitoring component 120 is provided on node 110 to monitor for file system events. Each file system event can correspond to an intent event, where the node 110 is to perform a corresponding file system operation (e.g., file system modification). The file system events can represent file system operations such as read, write, or changes in permission. Additionally, the file system events identify relevant parameters for such modifications, such as file names, number of bytes read, user name and timestamps. The node 110 may include or otherwise utilize a journal 130, and the monitoring component 120 durably records different file system events in the journal 130 as journal entries 105. In one implementation, the entries 105 can correspond to metadata (rather than file content) that represent a corresponding operation. The journal 130 marks individual entries 105 as uncommitted until confirmation is received that the file contents of the operations represented by the entries 105 have been written to non-volatile storage (e.g., hard disk) within the file system 100. After confirmation is received, the journal 130 marks the entries 105 as being committed.
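As a non-limiting illustration, the uncommitted and committed marking described above can be sketched in Python as follows. The sketch assumes a hypothetical append-only journal file with one JSON record per line; the names Journal, record_intent, mark_committed and journaled_write are illustrative assumptions and are not taken from any embodiment. Each record is flushed and synced to storage before the call returns, which is one possible way to make an entry durable.

```python
import json
import os
import tempfile
import time


class Journal:
    """Hypothetical append-only journal; each line holds one JSON record."""

    def __init__(self, path):
        self.path = path
        self._next_id = 0

    def _append(self, record):
        # Append one record and force it to non-volatile storage before returning.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def record_intent(self, op, **params):
        """Durably record an uncommitted entry (metadata only) for an operation."""
        entry_id = self._next_id
        self._next_id += 1
        self._append({"id": entry_id, "state": "uncommitted", "op": op,
                      "ts": time.time(), "params": params})
        return entry_id

    def mark_committed(self, entry_id):
        """Record that the operation's file contents reached non-volatile storage."""
        self._append({"id": entry_id, "state": "committed"})

    def uncommitted(self):
        """Return the entries that have no matching commit record."""
        pending = {}
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                if rec["state"] == "uncommitted":
                    pending[rec["id"]] = rec
                else:
                    pending.pop(rec["id"], None)
        return list(pending.values())


def journaled_write(journal, target_path, data, user):
    # The intent is journaled first; the write itself follows.
    entry_id = journal.record_intent("write", file=target_path,
                                     size=len(data), user=user)
    with open(target_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Confirmation that the contents are on disk: mark the entry committed.
    journal.mark_committed(entry_id)
    return entry_id
```

In this sketch a commit is itself an appended record rather than an in-place update, which keeps the journal append-only; an implementation could equally mark entries in place.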
In a variation, the entries 105 can include file content data, and in the event of a failure (e.g., a power outage, system or software crash, network failure, etc.), the node 110 can utilize the entries marked as uncommitted to replay a sequence of file system operations that were in flight (or not written to disk) at the time the failure occurred.
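By way of example only, replay of uncommitted entries after a failure might resemble the following Python sketch. It assumes that each entry is a dictionary carrying a timestamp, a state, an operation name, a file name and, for writes, the file content; the dictionary named storage stands in for the file system. These structures and the function name replay_uncommitted are hypothetical.

```python
def replay_uncommitted(entries, storage):
    """Re-apply, in timestamp order, operations whose entries are uncommitted.

    `entries` is a list of dicts (hypothetical layout) and `storage` is a dict
    mapping file name to content, standing in for the file system.
    Returns the file names touched during replay.
    """
    replayed = []
    for entry in sorted(entries, key=lambda e: e["ts"]):
        if entry["state"] != "uncommitted":
            continue  # already written to disk before the failure
        if entry["op"] == "write":
            storage[entry["file"]] = entry["content"]
        elif entry["op"] == "delete":
            storage.pop(entry["file"], None)
        entry["state"] = "committed"
        replayed.append(entry["file"])
    return replayed
```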
According to embodiments, the entries 105 for the file system events are recorded asynchronously with, or independently of, performance of the corresponding operation. Thus, for example, the individual entries 105 can be recorded in the journal 130 before the operation that corresponds to the represented event is complete. At the same time, the entries 105 are durably stored, and their recording in the journal 130 signifies a commitment that the underlying operations represented by the individual entries 105 will be performed, even in the presence of file system, node or network failure.
According to embodiments, different types of events are recorded in the journal 130. In particular, the monitoring component 120 can include kernel level logic 122 which detects kernel level events 111. The kernel level event 111 can correspond to the intent to perform or the initiation of one or more kernel level operations 125 by node 110. Examples of kernel level operations 125 include delete, read, write, and rename, as well as some system wide operations. The kernel level events 111 can also identify the parameters that are relevant to the corresponding operation such as file name, number of bytes read, user etc., and time stamps (as described further below).
The monitoring component 120 can also include user level logic 124 that detects user level events 113, which can correspond to node 110 initiating one or more user level operations 127. In variations, the monitoring functionality can be implemented in part or in whole by (i) a kernel for file system 100, which can write out journal entries for kernel-level events, and (ii) user-level applications which write events using a user-level journaling mechanism. The user-level operations 127 can be programmatically generated, or initiated by user tagging or input. The user level events 113 can also identify the parameters that are relevant to the corresponding operation such as file name, user-defined tag, user name and time stamps (as described further below).
Each of the kernel and user level events 111, 113 is recorded as an entry 105 in the journal 130. The entries 105 for the different events may be sequenced in the journal 130. For example, the node 110 can maintain a clock 132 that is synchronized with, for example, clocks of other nodes that comprise the file system 100. In particular, embodiments provide that entries 105 generated from both kernel and user level events 111, 113 are interleaved and sequenced in the journal 130 based on timestamps provided from the clock 132.
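As one illustrative sketch, interleaving of kernel level and user level entries by timestamp could be expressed in Python as shown below. The Entry tuple and the interleave function are hypothetical names, and the sketch assumes that each input stream is already ordered by timestamp, as would be the case when each stream is stamped from the same clock.

```python
import heapq
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical journal entry: timestamp, event level, operation name."""
    ts: float
    level: str
    op: str


def interleave(kernel_entries, user_entries):
    """Merge two timestamp-ordered streams into one timestamp-ordered sequence."""
    return list(heapq.merge(kernel_entries, user_entries, key=lambda e: e.ts))
```

When two entries carry the same timestamp, heapq.merge emits the entry from the earlier argument first, so in this sketch kernel level entries precede user level entries on ties.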
In an embodiment, journal 130 is provided as an EXT3 file. The monitoring component 120 is programmed to generate entries that reflect the operation that is to occur (corresponding to the event), as well as to record from the clock 132 a timestamp for the journal entry 105. Other parameters (e.g., file name, file content, user, data size) that are relevant to the corresponding file system operation of the detected event are also identified and recorded as an entry 105 of journal 130.
In some embodiments, the particular operations that are deemed events and recorded in the journal 130 are specified by the administrator. Thus, for example, an administrator can modify the set of operations that are logged with the journal 130. Specific kernel level operations 125 can be pre-identified for logging using, for example, a kernel interface such as a UNIX FCNTL or similar system call. Similarly, user level operations 127 can be pre-identified for logging using kernel interface calls such as UNIX FCNTL or other similar system calls.
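For illustration, an administrator-adjustable set of logged operations might be modeled as in the following Python sketch. The sketch does not use FCNTL or any other kernel interface; the class LoggedOperations and the function filter_events are hypothetical stand-ins that only show the selection logic.

```python
class LoggedOperations:
    """Hypothetical administrator-controlled set of operations to journal."""

    def __init__(self, initial=()):
        self._ops = set(initial)

    def enable(self, op):
        self._ops.add(op)

    def disable(self, op):
        self._ops.discard(op)

    def should_log(self, op):
        return op in self._ops


def filter_events(policy, events):
    """Keep only the events whose operation the policy marks for logging."""
    return [e for e in events if policy.should_log(e["op"])]
```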
While an example of
According to one or more embodiments, an external system 175 (e.g., a database) can be provided individual entries 105 from the journal 130. The journal 130 can be synced, or otherwise coordinated, with the external system 175, so that journal entries 105 are expired from the journal 130 when those entries are accessed or processed by the external system 175. As examples, the external system 175 can correspond to a database (e.g., see database system 240 of
System Description
Among other benefits, an embodiment such as described with
As described with an embodiment of
In an embodiment, each node 210A, 210B, 210C includes a corresponding journal 220A, 220B, 220C in which respective entries 205A, 205B, 205C representing the file system operations are recorded. The entries 205A, 205B, 205C that are recorded in the respective journals 220A, 220B, 220C can correspond to metadata that represent a corresponding operation performed on the corresponding node 210A, 210B, 210C. In embodiments, each node 210A, 210B, 210C may mark the individual entries of the respective journals 220A, 220B, 220C as uncommitted until confirmation is received that the file contents of the file system operations represented by those entries have been written to, for example, the disk. Then each of the nodes 210A, 210B, 210C can mark their respective entries as being committed.
In some variations, data content journaling can also be used, so that the entries 205A, 205B, 205C specify data content and metadata. In the event of a failure, such as a power outage, the individual node 210A, 210B, 210C where the failure occurred can utilize the entries of the corresponding journal 220A, 220B, 220C which are marked as uncommitted to replay a sequence of file system operations that were in flight (or not written to disk) at the time the failure occurred.
According to embodiments, the entries of each journal 220A, 220B, 220C provided for each node are recorded asynchronously with that node's performance of the corresponding file system operation. Thus, the entries can be, for example, recorded in the corresponding journals 220A, 220B, 220C before the operation represented by that journal entry is complete. At the same time, each node 210A, 210B, 210C durably stores its entries in the corresponding journal 220A, 220B, 220C, and the entries can be aggregated or otherwise accessed by other components (e.g., aggregation component 230 and/or database system 240). In some variations, the aggregation of the entries representing the file system events 202 of the various nodes 210 provides an ability for the underlying operations represented by those entries to be available for analysis, even in the presence of some failures, such as file system, node or network failure. For example, the entries representing the file system events 202 can be stored in a database that can be queried, searched, and/or analyzed, to enable compliance or auditing operations to be performed in connection with use of the file system.
According to one or more embodiments, system 200 includes one or more aggregation components 230 and the database system 240 (or node of a distributed database system). In variations, other systems or components, such as an event viewer 242, can be implemented as an addition or alternative to the database system 240. In the example shown by
The aggregation component 230 can optionally operate to sequence the entries 205 from the various nodes 210. The sequencing of the entries 205 can be based on, for example, time stamps associated with the individual entries. As noted with, for example,
The aggregation component 230 provides the sequenced list of entries 232 for ingestion by the database system 240 (or by another component such as event viewer 242). For example, the database system 240 may import the sequenced entries 232, once the entries of the different nodes are aggregated and sequenced by the aggregation component 230. By using time stamps on each of the entries 205A, 205B, 205C, journal entry updates may be batched and then communicated to the database system 240 in any order. The timestamps on each of the journal entries can be used to determine which updates are kept if there are multiple entries for a single database record. As shown, system 200 can be implemented to reliably record journal entries 205A, 205B, 205C, reflecting kernel and user level events on the nodes 210A, 210B, 210C of the file system 250. For example, journal files 220A, 220B, 220C can be reliably maintained amongst the nodes 210A, 210B, 210C because each node is able to durably journal events with synchronized use of timestamps. Thus, the file system journals are reliably maintained even in the event of node failure resulting from, for example, a system crash, a software crash, or network failure.
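As a hedged example, sequencing of entries from several nodes and timestamp-based resolution of multiple updates to a single record might be sketched in Python as follows. The entry layout (a dictionary with ts, node, record and op keys) and the function names sequence and apply_batch are assumptions made for illustration. The rule used here, in which the entry with the newest timestamp is kept, is one possible policy consistent with the description above.

```python
import heapq


def sequence(node_journals):
    """Merge per-node entry lists (each ordered by ts) into one ordered list."""
    return list(heapq.merge(*node_journals, key=lambda e: e["ts"]))


def apply_batch(records, batch):
    """Apply a batch of entries to `records`; the newest timestamp wins.

    `records` maps a record key (here, a file path) to the entry kept for it.
    Because the timestamp decides the outcome, batches can arrive in any order.
    """
    for entry in batch:
        current = records.get(entry["record"])
        if current is None or entry["ts"] > current["ts"]:
            records[entry["record"]] = entry
    return records
```

Applying the same batches in a different order yields the same final records in this sketch, provided that no two entries for one record share a timestamp.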
Additionally, embodiments recognize that reliably maintaining records of journaling operations on each node further enhances the ability of the system 200 to scale. For example, each node 210A, 210B, 210C (or machine thereof) can store its own respective journal 220A, 220B, 220C. If a particular machine, for example, runs out of disk space or otherwise fails, then the auditable operations that occurred on that machine will result in errors, but other machines or nodes of the file system 250 will be unaffected. As another example, the failure of one node in implementing a multi-node auditable operation (e.g., rename, in which the operation is initiated on one node and completed on another node) can result in the operation not being completed on any of the nodes that are involved in the operation. The journal entries can potentially be aggregated or retrieved from one or both nodes involved in the operation in order to enable, for example, fault analysis to be performed to determine information about the cause or source of the error.
Embodiments further recognize that the reliability inherent in system 200 promotes various auditing or compliance operations. In particular, embodiments recognize that the reliable and durable manner in which journal entries 205 are recorded can be used to enable additional auditing or compliance functionality for a variety of purposes. In some embodiments, an operation interface 270 for database system 240 can operate to enable auditing or compliance operations 272, such as to (i) determine who has accessed a file, (ii) verify that correct retention or deletion events have taken place, (iii) verify correct setting of file security properties, (iv) enable compliance tracking for an archive, (v) provide change notification for a virus scanner, (vi) enable backups, including backup of applications, (vii) enable remote replication, and/or (viii) enable validation scanning, or other applications that would otherwise be required to scan the complete file system for file changes.
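As a non-limiting sketch of one such auditing operation, the following Python example loads journal entries into a database and determines who has accessed a given file. The table name events, its columns, and the functions load_entries and who_accessed are hypothetical; SQLite is used only because it is available in the Python standard library, and no particular database is implied by the embodiments.

```python
import sqlite3


def load_entries(conn, entries):
    """Store (ts, node, user_name, op, file_name) tuples in a hypothetical table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(ts REAL, node TEXT, user_name TEXT, op TEXT, file_name TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", entries)
    conn.commit()


def who_accessed(conn, file_name):
    """Return the distinct users with any recorded event on the given file."""
    rows = conn.execute(
        "SELECT DISTINCT user_name FROM events "
        "WHERE file_name = ? ORDER BY user_name",
        (file_name,),
    )
    return [row[0] for row in rows]
```

Because the query runs against the stored entries, the file system itself does not need to be scanned to answer it.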
The system 200 can be implemented to enable journals that record the various file system events to be synchronized with external systems, such as database system 240 or event viewer 242. In an embodiment, the entries 205A, 205B, 205C of the journals 220A, 220B, 220C can be expired when the journals are processed by the external system (e.g., imported or stored with the database system 240). Moreover, by storing the entries in, for example, the database system 240, embodiments enable operations such as indexing, parsing and searching to be performed, resulting in better analysis and understanding of the various file system operations.
Methodology
In an embodiment, file system events are monitored on individual nodes 210A, 210B, 210C of a distributed file system 200 (310). Each node 210A, 210B, 210C can detect kernel level events (312), which represent a kernel level operation performed on that node. Each node 210A, 210B, 210C may also be able to detect user level events (314). Furthermore, each of the kernel and user level events may include parameters and metadata associated with performance of the corresponding operation, such as file name, number of bytes affected, the time stamp and the user name.
Each node 210A, 210B, 210C records its detected events as entries in the corresponding journal 220A, 220B, 220C (320). Under an embodiment, each of the nodes 210A, 210B, 210C stores its own journal, so that failure of that node does not affect the journaling performed at other nodes. In this way, the entries of the journals 220A, 220B, 220C include metadata that identifies the various operations that are to take place or that are taking place. When a file system operation represented by an individual entry is complete, the node 210A, 210B, 210C marks the entry representing that operation as complete. In this way, each of the journals 220A, 220B, 220C records events that include file system operations that are in flight, or which are not yet initiated.
The entries of the journal files can be made available to an external component (330). For example, a component such as provided by aggregation component 230 can collect entries from the individual journals. The external component can sequence the events from different nodes, then import the sequenced journal entries for processing. For example, the aggregation component 230 can import the sequenced entries into the database system 240 of the file system 250. Some embodiments recognize that batch processing journal entries from different nodes 210A, 210B, 210C enhances the scalability of the system 200. To this end, each node 210A, 210B, 210C can implement functional callbacks with, for example, a centralized aggregation component 230 or other programmatic component. For example, the aggregation component 230 can sequence the entries and cause the entries to be stored in the database system 240. The use of functional callbacks can be in place of, for example, polling operations (which could alternatively be performed), to further enhance the scalability of the system 200.
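By way of illustration, the use of callbacks in place of polling might be sketched in Python as follows. The classes Node and Aggregator, the batch_size parameter and the method names are hypothetical. In this sketch each node pushes a batch of entries to every registered callback once the batch is full, so the aggregator does not need to ask nodes repeatedly whether new entries exist.

```python
class Aggregator:
    """Hypothetical centralized component that receives batches via callback."""

    def __init__(self):
        self.received = []

    def on_entries(self, node_id, entries):
        # Callback invoked by a node; store (ts, node, op) tuples for sequencing.
        self.received.extend((e["ts"], node_id, e["op"]) for e in entries)

    def sequenced(self):
        """Return all received entries ordered by timestamp."""
        return sorted(self.received)


class Node:
    """Hypothetical node that batches entries and pushes them to callbacks."""

    def __init__(self, node_id, batch_size):
        self.node_id = node_id
        self.batch_size = batch_size
        self.callbacks = []
        self.pending = []

    def register_callback(self, callback):
        self.callbacks.append(callback)

    def record(self, entry):
        self.pending.append(entry)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Deliver the pending batch to every registered callback."""
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        for callback in self.callbacks:
            callback(self.node_id, batch)
```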
According to embodiments, the entries of various journals 220A, 220B, 220C can be expired, in response to the programmatic component (e.g., database system 240) completing processing of those entries (340). For example, the entries of the journal files can be garbage collected when the entries are marked complete, coinciding with the entry being reliably stored off the node (e.g., within the database system 240). In variations, the journal entries may be retained until the database has been backed up or otherwise replicated to a different node. In this way, the journals 220A, 220B, 220C can provide a mechanism by which file system events are synchronized with external systems.
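As a simple sketch of the expiration step, the following Python function returns the entries that remain in a journal after the external component reports which entries it has processed. The function name expire_entries, the id field and the replicated flag are hypothetical; the flag models the variation in which entries are retained until the database has been backed up or replicated.

```python
def expire_entries(journal_entries, processed_ids, replicated=True):
    """Return the entries that must remain in the journal.

    Entries whose id appears in `processed_ids` are expired, but only once
    the external store is reported as backed up or replicated.
    """
    if not replicated:
        return list(journal_entries)  # retain everything until replication
    return [e for e in journal_entries if e["id"] not in processed_ids]
```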
Distributed Aggregation
While an embodiment of
More specifically, with reference to an embodiment of
In an embodiment, aggregation component 420 resides with the node 410 and directly communicates entries 405 of the journal 430 to the database component 434. In particular, journal entries 405 may be communicated as transaction updates 415 from the node to the database component 434. The transaction updates 415 may be processed by the database 434 in order of arrival and synchronously, before the transaction updates are returned as success or failure. In this way, the database system 430 can maintain data reflecting the various entries 405, and database resources can enable searching and analysis to be performed in connection with auditing or compliance type operations. At the same time, the corresponding journal entries 405 can be removed from the journal 430. Thus, for example, the database system 430 provides a record of the events that resulted in the generation of journal entries 405 at a given instance of time.
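For illustration only, node-side delivery of journal entries as synchronous transaction updates might be sketched in Python as follows. The function ship_entries and the apply_update callable are hypothetical; apply_update stands in for the database and is assumed to return True on success and False on failure. Entries are sent oldest first, and each entry is removed from the journal only after its update succeeds.

```python
from collections import deque


def ship_entries(journal, apply_update):
    """Send journal entries oldest first; remove each only after success.

    `journal` is a deque of entries in arrival order. `apply_update` is a
    synchronous callable that returns True on success and False on failure.
    Returns the number of entries delivered and removed.
    """
    shipped = 0
    while journal:
        entry = journal[0]
        if not apply_update(entry):
            break  # keep this entry and all later ones for a later attempt
        journal.popleft()
        shipped += 1
    return shipped
```

Stopping at the first failure preserves the order of arrival, since no later entry is delivered ahead of one that has not yet been stored.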
As an addition or alternative, a node such as described with an embodiment of
Hardware Diagram
Computer system 500 can include display 512, such as a cathode ray tube (CRT), an LCD monitor, or a television set, for displaying information to a user. An input device 515, including alphanumeric and other keys, is coupled to computer system 500 for communicating information and command selections to processor 505. Other non-limiting, illustrative examples of input device 515 include a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 505 and for controlling cursor movement on display 512. While only one input device 515 is depicted in
The computer system 500 may be operable to implement functionality described with a node of a distributed file system. Accordingly, computer system 500 may be operated to implement file system operations, including user and kernel level operations. In performing the operations, the computer system 500 records events 511 corresponding to the file system operations as entries 513 in a journal of the computer system 500. The entries 513 of the journal identify the file system operations in advance of those operations being performed, as well as parameters (e.g., metadata) associated with the individual operations. The computer system 500 can also execute instructions to communicate the journal entries 513 to a database system via, for example, callback operations. For example, in one implementation, the computer system 500 can communicate the entries 513 to an aggregation component of a database or database system. In some variations, the computer system 500 may also implement an aggregation component such as described with an embodiment of
The communication interface 518 can be used to communicate file system operations, such as described with embodiments of
Embodiments described herein are related to the use of computer system 500 for implementing the techniques described herein. According to one embodiment, those techniques are performed by computer system 500 in response to processor 505 executing one or more sequences of one or more instructions contained in memory 506. Such instructions may be read into memory 506 from another machine-readable medium, such as a storage device 510. Execution of the sequences of instructions contained in main memory 506 causes processor 505 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement embodiments described herein. Thus, embodiments described are not limited to any specific combination of hardware circuitry and software.
Although illustrative embodiments have been described in detail herein with reference to the accompanying drawings, variations to specific embodiments and details are encompassed by this disclosure. It is intended that the scope of embodiments described herein be defined by claims and their equivalents. Furthermore, it is contemplated that a particular feature described, either individually or as part of an embodiment, can be combined with other individually described features, or parts of other embodiments. Thus, absence of describing combinations should not preclude the inventor(s) from claiming rights to such combinations.