The present invention relates generally to data storage for computing devices, and more particularly, but not exclusively, to a transactional file system that supports partitioning of clients.
In a computing system, a file system is the mechanism by which the logical view of data storage is mapped to physical locations on a disk or other storage device. Computing systems are vulnerable to unpredictable failures, such as operating system crashes, hardware failures, and power interruptions. Such events may place a file system within the computing system in an inconsistent state, since tasks involving reading from and writing to files may be in progress when the event occurs and in-memory buffers might not have been written to disk. To preserve the integrity of stored data, file systems have traditionally been designed to write file metadata for use in restoring the file system to a consistent state following a reboot. In these traditional systems, however, a reboot is typically followed by a scan of an entire disk, which typically requires an undesirable length of time to complete. Significant delays in recovering the file system may be unacceptable in certain types of embedded systems, such as safety-critical or mission-critical systems, that require a fast startup or boot time.
Some file systems have been designed to speed up system recovery by maintaining a journal on the storage device that logs metadata and possibly also data relating to file system operations (or “transactions”), including file updates. When metadata is updated, all potentially inconsistent data is recorded in the journal. A set of updates to files does not take effect until a final “commit” of transactions is made from the journal to the storage device.
Transactional file systems as well as traditional file systems suffer from contention among client processes for computing resources, such as processor time and file cache buffers, associated with access to the file system. If a client for a file server makes a system call, such as opening a file for reading and writing, other clients or processes are generally delayed from using those resources at the same time. Mutual exclusion locks, semaphores and similar mechanisms are available to coordinate access to these resources, but in general they do not prevent an operation on behalf of one client, such as a read-only client, from interfering with an operation on behalf of another client, such as a read-write client.
For a better understanding of the present invention, reference will be made to the following detailed description, which is to be read in association with the accompanying drawings, wherein:
In the following detailed description, reference is made to the accompanying drawings, in which are shown exemplary but non-limiting and non-exhaustive embodiments of the invention. These embodiments are described in sufficient detail to enable those having skill in the art to practice the invention, and it is understood that other embodiments may be used, and other changes may be made, without departing from the spirit or scope of the invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the invention is defined only by the appended claims. In the accompanying drawings, like reference numerals refer to like parts throughout the various figures unless otherwise specified.
Overview of the Invention
The present invention is directed to a method and system for providing access to data on a storage device so that, for a given volume on the device, read-only clients and read-write clients are presented with separate but related views of a file system state. Clients with read-only access rights to a volume are provided a view of file system state that may be slightly older than that available to read-write clients. In an embodiment, file system clients are grouped into sets or partitions, each of which has one of the following access rights to each storage device volume: no access, read-only access, or read-write access. For each volume, there is at most one client partition that has read-write access, and there are zero or more read-only client partitions. This concept of partitioning of file system clients is not supported by traditional file systems.
In accordance with an embodiment of the invention, a file system provides the following non-interference property. For a given volume, the read-only client partitions do not interfere with each other and do not interfere with the read-write client partition. A client partition that has read-write access to a volume may delay other partitions that have read-only access to that volume. This non-interference property enables guaranteed levels of service to be provided to client partitions. Moreover, it prevents the illicit flow of information between client partitions (for example, by way of covert channels). These features are important in enhancing the security, safety and reliability of the system.
In one embodiment, the invention includes a transactional file system. For a given volume, access to file system blocks is provided by way of separate virtualization tree data structures for the read-only client partitions and for the read-write client partition. A reader tree, which is stored on a flash memory device, magnetic hard disk, or other nonvolatile or secondary storage device, represents a consistent (but older) file system state. A writer tree, which has a different root pointer from the reader tree and is partially stored in main memory, represents the state of in-progress file system transactions. Read-only client partitions are permitted to see the set of content blocks that are reachable by way of the reader tree. The read-write client partition performs reads and writes by accessing the set of content blocks that are reachable by way of the writer tree. When a content block is modified, that block and the blocks in the writer tree that recursively point to the content block are exchanged with unused journal blocks. When a set of file system transactions is committed, the root pointer of the writer tree is copied to the reader tree root pointer, and journal blocks are reclaimed.
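The relationship between the reader tree and the writer tree described above can be illustrated with a small in-memory model. The following is only a sketch under stated assumptions, not the implementation of the invention: the class name Volume, its method names, the use of integer block identifiers, and the representation of tree blocks as lists and content blocks as strings are all hypothetical. It also assumes that the blocks superseded by an exchange are the ones reclaimed as unused journal blocks at commit.

```python
class Volume:
    """Toy block store (hypothetical). Tree blocks are lists of child
    block ids; content blocks are strings."""

    def __init__(self, blocks, root, journal):
        self.blocks = dict(blocks)          # block id -> list of ids, or str
        self.reader_root = root             # consistent (but older) state
        self.writer_root = root             # state of in-progress transactions
        self.free_journal = list(journal)   # unused journal block ids
        self.exchanged = {}                 # new block id -> superseded block id

    def _read(self, root, path):
        block = root
        for index in path:
            block = self.blocks[block][index]
        return self.blocks[block]

    def read_as_reader(self, path):
        return self._read(self.reader_root, path)

    def read_as_writer(self, path):
        return self._read(self.writer_root, path)

    def _exchange(self, block_id):
        """Return a writable block id, taking an unused journal block
        unless this block was already exchanged since the last commit."""
        if block_id in self.exchanged:
            return block_id
        # Raises IndexError if the journal is exhausted; a real system
        # would commit before reaching that point.
        new_id = self.free_journal.pop()
        old = self.blocks[block_id]
        self.blocks[new_id] = list(old) if isinstance(old, list) else old
        self.exchanged[new_id] = block_id
        return new_id

    def write(self, path, data):
        # Exchange the root, then every block on the path to the content block.
        self.writer_root = self._exchange(self.writer_root)
        block = self.writer_root
        for index in path:
            child = self._exchange(self.blocks[block][index])
            self.blocks[block][index] = child
            block = child
        self.blocks[block] = data

    def commit(self):
        # Publish the writer's view by copying its root pointer.
        self.reader_root = self.writer_root
        # Assumption: superseded blocks become the unused journal blocks.
        for old_id in self.exchanged.values():
            del self.blocks[old_id]
            self.free_journal.append(old_id)
        self.exchanged.clear()


vol = Volume({0: [1, 2], 1: "alpha", 2: "beta"}, root=0, journal=[10, 11, 12, 13])
vol.write([0], "ALPHA")
print(vol.read_as_reader([0]), vol.read_as_writer([0]))
vol.commit()
print(vol.read_as_reader([0]))
```

In this model the reader never observes a partially updated tree, because the only step that changes what a reader can reach is the single root pointer copy in commit.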
Embodiments of the invention provide a single integrated mechanism which allows for transactional journaling, client partition non-interference as described above, and deterministic allocation and freeing of blocks with no additional overhead. A single integrated mechanism provides efficiency benefits and permits a relatively small file server image, which is particularly advantageous for memory-constrained embedded systems. The invention may be practiced in conjunction with a real-time operating system and with deeply embedded systems that are required to operate under significant constraints relating to memory and processor usage and power consumption, including those used in safety-critical and mission-critical applications. However, the invention is not thus limited. The invention is applicable to the implementation of database systems in addition to file systems.
Exemplary Operating Environment
Environment 100 includes a computing device 102. Device 102 may be a special-purpose or a general-purpose computing device, and may be situated or embedded within another device or apparatus. The features typically present in computing devices of various kinds are well-known and rudimentary to those skilled in the art and need not be depicted in detail or described at length here. Computing device 102 includes, among other components not specifically shown, a processor 104, a main memory 122, and one or more nonvolatile storage devices 106. Storage devices 106 may include, for example, a flash memory device, a magnetic hard disk, or the like. Programs and data may be stored in main memory 122, from which they can be accessed by processor 104. Such programs may include operating system 110, file server 112, read-write client 114, and read-only client 116. Operating system 110 may be a real-time operating system or another kind of operating system. File server 112 mediates access to files for read-write client 114 and read-only client 116. Files are part of file system 118, which comprises a logical view of data physically stored on storage devices 106 and which may be separate from or integrated with operating system 110. A part of a file system in accordance with the present invention may be stored in main memory 122, as is discussed further above and below.
Initial File System Image
As indicated in
In embodiments of the invention, the virtualization data structure is a virtualization tree, which may be implemented as the interior nodes of a balanced tree, such as a B+ tree, or as another kind of tree data structure or component of a tree data structure. On storage device 201, separate but related views of file system state for volume 206 are provided to client partitions having read-only access to the volume and the client partition that has read-write access to the volume by providing the root pointer for a reader virtualization tree and the root pointer for a writer virtualization tree, respectively. The root pointer for the reader virtualization tree is stored within volume headers 204, in the volume header for volume 206. The root pointer for the writer virtualization tree is stored in main memory. The file system state accessed by the read-write client partition represents the state of in-progress transactions. The file system state accessed by read-only client partitions represents, in general, a consistent but older view of the file system and accordingly may coincide with or diverge from the view of the file system seen by the read-write partition.
Client Partitioning
For each volume, such as each of volumes 206-208, there is at most one client partition with read-write access. Thus, as illustrated in
As noted, client partitions have different access rights for each volume. A client partition may, for example, have read-only access to one volume and read-write access to another volume. For example, read-only client partition 312, which has read-only access to volume 206, may be the same client partition as read-write client partition 320, which has read-write access to volume 208. Each partition is associated with separate memory and CPU resources.
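The per-volume access rights described above can be modeled as a table that maps each client partition to its rights on each volume. The following is a minimal sketch under stated assumptions: the names Access, validate and access_for, the partition labels "A" and "B", and the use of the reference numerals 206 and 208 as dictionary keys are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Access(Enum):
    NONE = "no access"
    READ_ONLY = "read-only"
    READ_WRITE = "read-write"


def validate(rights):
    """Check that each volume has at most one read-write partition.

    rights maps a partition name to a {volume: Access} dictionary.
    Returns a {volume: partition} dictionary of the read-write partitions.
    """
    writers = {}
    for partition, per_volume in rights.items():
        for volume, access in per_volume.items():
            if access is Access.READ_WRITE:
                if volume in writers:
                    raise ValueError(
                        f"volume {volume} already has read-write "
                        f"partition {writers[volume]}"
                    )
                writers[volume] = partition
    return writers


def access_for(rights, partition, volume):
    """Look up a partition's rights on a volume, defaulting to no access."""
    return rights.get(partition, {}).get(volume, Access.NONE)


# Partition "A" reads volume 206 and writes volume 208, as in the
# example above; partition "B" is the read-write partition for volume 206.
rights = {
    "A": {206: Access.READ_ONLY, 208: Access.READ_WRITE},
    "B": {206: Access.READ_WRITE},
}
print(validate(rights))
print(access_for(rights, "B", 208))
```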
Providing Access to File System
If the decision at block 602 is affirmative, the client belongs to the read-write partition for this volume. Process 600 then advances to block 606, where the client is provided access to the writer tree for the volume. Processing then advances to decision block 610, at which it is determined whether the operation is one that may modify one or more blocks. If not (for example, if the operation includes a read call or a lookup of files by name), process 600 flows to a return block and performs other actions. If, however, the operation is a modifying operation, processing advances to decision block 612, at which it is determined whether a commit threshold will be reached as a result of the current file system transaction. The commit threshold is reached, for example, if the journal will be full as a result of the operation.
If the commit threshold will not be reached, process 600 advances to block 614, where modified blocks and all virtualization tree blocks that recursively point to the modified blocks are exchanged with unused journal blocks, if they are not “dirty.” Blocks are exchanged if they have not already been exchanged since the previous commit. It is at this exchange step in block 614 that the reader and writer views of the file system state begin to diverge. Processing then returns to perform other actions.
If the decision at block 612 is affirmative, the commit threshold will be reached, and processing flows to block 616, at which the process waits for in-progress transactions to finish. Process 600 next steps to block 618, at which a commit of transactions occurs. Next, at block 620, journal blocks are reclaimed. Processing then steps to block 622, at which the root pointer for the writer tree is copied to the root pointer for the reader tree. In effect, the read-write client has caused the updates to the file system to be published to all read-only clients for the volume. Process 600 then branches to block 614 where, as noted above, modified blocks and all virtualization tree blocks that recursively point to the modified blocks are exchanged with unused journal blocks, if they are non-dirty. Processing then returns to perform other actions.
As an effect of the journaling process, the body of the virtualization tree (the physical location of the blocks that comprise the tree) moves around the storage device as file system operations occur.
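The control flow of process 600 for a single file system operation can be summarized as a function that lists the blocks visited. This is a sketch under stated assumptions rather than a definitive implementation: the function name, its parameters, and the bookkeeping of free journal blocks as a simple counter are hypothetical. It assumes that the commit threshold is the example given above (the journal would become full), that a commit frees the entire journal, and that a client outside the read-write partition is given the reader tree, a branch that the text above does not describe.

```python
def process_600(is_read_write, modifies, blocks_needed, journal_free, journal_size):
    """Return (steps, journal_free_after) for one file system operation.

    blocks_needed is the number of not-yet-dirty blocks (the modified
    blocks plus the tree blocks that recursively point to them) that
    the operation would exchange with unused journal blocks.
    """
    if blocks_needed >= journal_size:
        raise ValueError("operation cannot fit in the journal")

    steps = []
    if not is_read_write:                      # decision block 602 negative
        steps.append("reader tree access")     # assumed branch
        return steps, journal_free

    steps.append("606 writer tree access")
    if not modifies:                           # decision block 610 negative
        return steps, journal_free

    if blocks_needed >= journal_free:          # decision block 612 affirmative
        steps.append("616 wait for in-progress transactions")
        steps.append("618 commit")
        steps.append("620 reclaim journal blocks")
        steps.append("622 copy writer root pointer to reader root pointer")
        journal_free = journal_size            # assumed: whole journal freed

    steps.append("614 exchange blocks")
    journal_free -= blocks_needed
    return steps, journal_free


steps, free = process_600(True, True, blocks_needed=3, journal_free=3, journal_size=10)
for step in steps:
    print(step)
print("free journal blocks:", free)
```

Note that block 614 is reached on both branches of decision block 612, so the current operation always proceeds after any commit it triggers.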
Virtualization Tree
A virtualization tree is accessed by way of an associated root pointer. In one embodiment, each node in the virtualization tree is a block comprising an array of branch pointers. As shown in
The structure of the branch pointer makes possible a deterministic algorithm for allocating and freeing file system content blocks whose cost is constant with respect to the number of blocks accessed. This allows for desirable performance, since a principal factor in the performance of a file system is the number of times a physical read from or write to the storage device is necessary. Because a content block is modified essentially immediately after it is allocated or before it is freed, and all the tree blocks that recursively point to the modified block are treated as dirty along with the modified block, there is no additional cost to allocate or free a content block (in terms of the number of blocks dirtied and, in general, in terms of the number of blocks that are read from the storage device).
The above specification, examples, and data provide a complete description of the manufacture and use of the composition of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims hereinafter appended.