For security and simplicity, some operating systems are structured hierarchically and isolate higher-level processes from the underlying hardware. Those processes operating on higher levels of the hierarchy may access hardware (such as a processor, memory, storage, I/O, etc.) via an intermediary operating at a lower level of the hierarchy. One example of such an intermediary is an operating system kernel. The kernel may directly interface with the hardware and allow other processes at higher levels, such as the user level, to interface with the hardware by making calls (i.e., system calls) to an Application Programming Interface (API) of the kernel. The kernel executes the caller's request and returns any results to the higher-level process. In this way, the kernel may hide the complexities of the hardware from the higher-level processes and may prevent a crash of a process from compromising the entire operating system.
Certain examples are described in the following detailed description with reference to the drawings.
An operating system kernel may act as a conduit between higher-level processes and underlying hardware. For example, a user-level process may access (e.g., read, write, etc.) data stored on a set of storage devices by making a system call to the kernel. Upon receiving the system call, the kernel may access the requested data on the storage devices and return any result therefrom to the user-level process.
When a process hands off control to the kernel or another process to access data, the resulting context switch may have an associated delay. As technological improvements reduce the latency of the hardware devices being accessed, the impact of involving the kernel in the access becomes relatively greater. For example, a hard disk drive seek may have a latency on the order of 10 ms, and flash memory such as a solid-state disk drive may have a latency on the order of 100 μs. However, current and emerging non-volatile memories may have latencies of less than 100 ns. Such low device latencies make any latency associated with the context switch relatively larger. Accordingly, it has been determined that a significant, real world performance improvement may be obtained by reducing the number of system calls and context switches.
The benefit may be realized with any type of access of a storage device. For example, in order to access a segment of data, a computing system may also access various associated metadata. This metadata may include file system metadata used to convert file-level data indicators used by a process to block-level data indicators used by the storage devices. The storage devices themselves may not be aware of the file-to-block relationship but may store the file system metadata that maps the file-level identifiers to the respective block-level identifiers. Accordingly, accessing the data may include multiple accesses of data and metadata. Any of these accesses, whether data or metadata, may be improved by allowing user-level processes to retrieve the data or metadata directly rather than involving the kernel.
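By way of illustration only, the sketch below shows one hypothetical layout for such file system metadata; the structure and field names are assumptions for illustration and not a layout required by the present disclosure.

```c
/* Illustration only: one hypothetical layout for file system metadata that
 * maps a file-level identifier to block-level identifiers on the storage
 * devices. The names and fields are assumptions, not a required layout. */
#include <stdint.h>

struct extent {
    uint64_t file_offset;   /* byte offset within the file */
    uint64_t block_addr;    /* block-level address on the storage device */
    uint32_t block_count;   /* number of contiguous blocks in this extent */
};

struct file_map {
    uint64_t      file_id;      /* file-level identifier used by processes */
    uint32_t      extent_count; /* number of valid entries in extents[] */
    struct extent extents[16];  /* consulted to translate each file access */
};
```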
The present disclosure provides a system and technique for accessing a memory structure suitable for these applications and others. The memory structure may include metadata, such as file system metadata, data, and/or combinations thereof. In some examples, a kernel maps the memory structure and a set of locks associated with the memory structure into the address space of a user-level process. Mapping the memory structure allows the user-level process to directly access the memory structure without further involving the kernel. To enforce security, the kernel may map the memory structure in a read mode, based on a trust level of the user-level process, which allows direct reading but excludes direct writing to the memory structure. This may be appropriate for processes that are trusted to read but not write the file system 108; as used herein, the read mode does not permit direct writing.
In the example, the user-level process reads portions of the memory structure directly, without involving the kernel or any other process. Because the user-level process may not be the only process accessing the memory structure, after the read, the user-level process may use the locks to determine whether the portions were modified during the read. Thus, rather than obtaining a lock, which may entail a system call to the kernel, the user-level process may avoid a context switch latency by detecting a write on its own. If an intervening write occurred, the user-level process may repeat the read. In applications where writes are relatively infrequent, the penalty of repeating a read is outweighed by the reduced latency when direct reads successfully complete without intervening writes.
Many examples in the present disclosure are structured to reduce the number of system calls and context switches while still maintaining data integrity and security. By these mechanisms and others, the present disclosure provides substantial, real world improvements to the operation of a computing system, particularly in the manner in which processes access the storage devices of the system. The technique herein may greatly reduce the latency and overhead involved in accessing these devices.
These examples and others are described with reference to the following figures. Unless noted otherwise, the figures and their accompanying description are non-limiting, and no element is characteristic of any particular example. In that regard, features from one example may be freely incorporated into other examples without departing from the spirit and scope of the disclosure.
A computing environment for practicing the technique of the present disclosure is described below.
The computing environment 100 includes a storage aggregate 102 that in turn includes any number, type, and combination of non-transitory storage devices 104. Suitable storage devices 104 include Non-Volatile Memory (NVM), Solid State Drives (SSDs), Hard Disk Drives (HDDs), optical storage devices, tape drives, and/or any other suitable storage devices. The storage devices 104 may store data (e.g., data 106) and metadata used to access the data (e.g., file system 108, locks 110, etc.). The data and metadata may be recorded on the storage devices 104 at discrete block-level addresses and may be accessed via block-level instructions issued to the respective storage device 104. The data and metadata may additionally or alternatively be accessible via byte-level instructions such as loads and stores issued by a processing resource, such as the processing resource 802 of the computing system 800 described below.
The storage devices 104 may be grouped for redundancy and/or performance using Redundant Array of Independent/Inexpensive Disks (RAID) or other suitable groupings, and faster storage devices 104 may operate as caches for larger devices 104. Accordingly, the configuration of the storage devices 104 in the storage aggregate 102 may be complex. In order to keep track of the data and to correlate the various data identifiers, the computing environment 100 may maintain one or more file systems 108 that map virtual or physical block-level data identifiers used by the storage aggregate 102 to file-level data identifiers for use by the processes.
The file system 108 may be maintained, in part, by a kernel 112, a software component that interfaces with one or more hardware components such as the storage devices 104 of the storage aggregate 102. The kernel 112 directly interfaces with the hardware components by providing instructions to the hardware components and receiving responses and interrupts therefrom without any intervening software element. The kernel 112 may include an Application Programming Interface (API) 114 to allow other software components at other hierarchical levels (such as user-level processes 116) to interface with the hardware components.
In an example thereof, the kernel 112 receives a system call involving the storage aggregate 102 from a user-level process 116 at the API 114 as indicated by arrow 118. In the example, the system call contains a request to access (e.g., read, write, etc.) the contents of the storage aggregate 102 such as a file system 108. Because the user-level process 116 may reference data using file-level identifiers, the user-level process 116 may issue system calls to the kernel 112 to access file system metadata used to determine corresponding block-level identifiers. In response, the kernel 112 may query the file system 108, as indicated by arrow 120, to access the requested metadata. The kernel 112 may provide the metadata to the user-level process 116 or may store it for use in future data accesses on behalf of the user-level process 116.
Additionally or in the alternative, the kernel 112 may allow the user-level process 116 to access the contents of the storage devices 104 without any further intervention by the kernel or another process as indicated by arrow 122. Because there is processing overhead associated with the system call to the API 114, allowing the user-level process 116 to access the storage devices 104 directly and thus outside of the kernel 112 may reduce the latency of the operation. To determine which user-level processes 116 will be permitted to access the storage devices 104 directly and, if so, how (e.g., read-only or read/write), the kernel 112 may refer to a trusted process list 124 stored on the storage devices 104. Of course, other ways of determining whether a process may be trusted may be used.
Because multiple user-level processes 116 may attempt to access the data or metadata concurrently, the computing environment 100 may include one or more access control mechanisms. In an example, the storage aggregate 102 stores a set of locks 110 associated with other data and/or metadata. The user-level process 116 and the kernel 112 may use the locks 110 to arbitrate concurrent reads and writes as explained in detail below. In these examples and others, the computing environment 100 provides reduced transactional latency by allowing the user-level process 116 to directly access the storage devices 104, while protecting against access conflicts.
Various examples of a technique for directly accessing devices by user-level processes are described below.
Referring first to block 202 of method 200, the kernel 112 maps a memory structure 302 and an associated set of locks 110 into an address space 304 of a user-level process 116A. The memory structure 302 may include data, metadata such as a portion of the file system 108, and/or combinations thereof.
In mapping the memory structure 302, the kernel 112 may assign access permissions for the user-level process 116A. The kernel may assign any suitable combination of permissions and, in some examples, the kernel 112 determines that the user-level process 116A is trusted for direct reading yet untrusted for direct writing according to the trusted process list 124. In such an example, the kernel 112 may map the memory structure 302 in a read mode such that the user-level process 116A may read the memory structure 302 directly without being permitted to write to it directly. In some examples, despite the memory structure 302 being mapped in a read mode, the user-level process 116A may still write to the memory structure 302 by another mechanism such as using the kernel API 114. The kernel 112 may perform fine-grained security checking via that path (e.g., determining whether the user-level process 116A is trusted to write to the specific part of the file system 108 being modified).
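From the user-level side, such a read-mode mapping might resemble the following minimal sketch, which assumes a hypothetical kernel-exposed device file (here "/dev/fsmeta") backing the memory structure 302 and an assumed mapping length; neither is defined by the present disclosure.

```c
/* Minimal sketch: obtain a read-only mapping of the memory structure.
 * The path "/dev/fsmeta" and the length are hypothetical assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1 << 20;               /* assumed size of the structure */
    int fd = open("/dev/fsmeta", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* PROT_READ: direct reads succeed; a direct write faults, so any write
     * must go through the kernel API instead. */
    void *base = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... read file system metadata directly through `base` ... */

    munmap(base, len);
    close(fd);
    return 0;
}
```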
Referring to block 204 of method 200, the user-level process 116A directly reads a portion of the memory structure 302, such as a portion of the file system 108, without involving the kernel 112 or another process.
This particular user-level process 116A may not be the only process that accesses the memory structure 302. Accordingly, a set of locks 110 (that includes lock 110A) may be used for access control among the processes. Each lock 110 of the set may correspond to a portion of the memory structure 302, and lock 110A corresponds to the portion of the file system 108 read in block 204. Some examples of suitable locks 110 are described below.
One suitable lock 110 is a seqlock 400 that includes a version record 404 and a lock bit 402. In an example, the version record 404 and the lock bit 402 are grouped together as a single number with the lock bit 402 representing the least significant bit(s). When the corresponding portion of the memory structure is to be written, the kernel 112 or other entity atomically attempts to change the lock bit 402 from a zero to a one. If it cannot do so (e.g., because the lock bit 402 is already one and some other process has already acquired the lock in exclusive mode), it may retry until it succeeds. When the lock bit 402 is set (e.g., when the value of the seqlock 400 is odd), the seqlock 400 prevents other processes from writing to the portion until the seqlock 400 is released. When the write completes, the kernel or other entity increments the seqlock 400, which has the effect of incrementing the version record 404 and resetting the lock bit 402 (e.g., the value of the seqlock 400 becomes even).
The user-level process 116A may read a portion of a memory structure 302 associated with the seqlock 400 using optimistic concurrency control. To do this, the process 116A may first read the seqlock 400 state including the version record 404 and the lock bit 402. If the lock bit 402 is set, then the process 116A concludes a write may be in progress to the portion and it should wait until the lock bit 402 is reset. It may do this by spinning until the lock bit 402 is reset, rereading the seqlock 400 state each time it loops. Once the lock bit 402 is reset, the user-level process 116A reads the portion of the memory structure 302 and is prepared for the portion to be in an inconsistent state if a write occurs concurrently. After the read, the user-level process 116A reads the seqlock 400 state again and compares it to the state immediately preceding the read of the portion of the memory structure 302. If the two seqlock 400 states are the same (i.e., the version record 404 and lock bit 402 values are identical), the user-level process 116A concludes that no write occurred while it was reading the memory structure 302 and thus its read saw good data. Conversely, if the two seqlock 400 states are not the same, a write may have occurred concurrently and the user-level process 116A may have read bad data. In this case, the user-level process 116A may try reading optimistically again. It will be recognized that reading optimistically using a seqlock 400 only involves reading the seqlock state; it does not necessarily involve writing to the seqlock. Of course, other ways of implementing seqlocks 400 are both contemplated and provided for.
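A minimal sketch of such an optimistic read follows, assuming the version record 404 and lock bit 402 are packed into a single atomic word with the lock bit as the least-significant bit; the type and function names are hypothetical and shown for illustration only.

```c
/* Sketch of an optimistic seqlock read under the assumed layout above. */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    _Atomic uint64_t word;   /* version record in the upper bits, lock bit in bit 0 */
} seqlock_t;

/* Read `len` bytes of the portion guarded by `lock` into `dst`, repeating the
 * read whenever the seqlock state indicates an intervening write. */
static void seqlock_read(const seqlock_t *lock, void *dst,
                         const void *src, size_t len)
{
    uint64_t before, after;
    do {
        /* Spin while the lock bit is set (a write may be in progress). */
        do {
            before = atomic_load_explicit(&lock->word, memory_order_acquire);
        } while (before & 1u);

        memcpy(dst, src, len);   /* may observe inconsistent data if a write intervenes */

        atomic_thread_fence(memory_order_acquire);  /* order data reads before re-check */
        after = atomic_load_explicit(&lock->word, memory_order_relaxed);
    } while (before != after);   /* state changed: a write occurred; read again */
}
```

Note that, consistent with the description above, the reader only loads the seqlock state; it never writes to the lock itself.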
Another suitable lock 110 is a QSX mutex 500 that includes an exclusive lock bit 502, which a writing entity may set to acquire the lock in an exclusive mode. The QSX mutex 500 may also provide a shared mode where any number of processes may concurrently and safely read the portion. To avoid disrupting the reads, the shared mode may prevent writing of the corresponding portion. For this purpose, the QSX mutex 500 may include a shared record 504 that records the number of processes currently holding the lock in the shared mode. A safely-reading entity (as opposed to an entity reading optimistically) may atomically verify that the exclusive lock bit 502 is not set and increment the shared record 504 prior to reading the corresponding portion in order to acquire the lock in the shared mode, and may decrement the shared record 504 when the reading is complete. When the last safely-reading process has released the lock (e.g., when the shared record 504 is zero), the lock is fully released and may be acquired in exclusive mode for writing. In this way, the QSX mutex 500 provides separate safe-reading and writing lock states.
As with the seqlock 400, it is also possible to read optimistically (but unsafely) using the QSX mutex 500. The procedure is similar, except that the shared record 504 is not taken into account when determining whether the lock state has changed (i.e., whether a write occurred concurrently). There are thus two ways to read the associated portion using a QSX mutex 500: safely, by acquiring the lock in shared mode, and unsafely, via optimistic concurrency control. The former entails writing to the lock while the latter does not. The former, however, does not involve retrying the read and is less likely to be starved out by writers. "Unsafely" here refers to the fact that the data read on any given attempt using optimistic concurrency may be bad due to a concurrent write; the reader can detect this, however, so there is no actual danger of returning bad data to higher levels of the process.
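A minimal sketch of a shared-mode (safe-read) acquire follows, assuming the exclusive lock bit 502, the shared record 504, and a version field are packed into a single atomic word; the packing, constants, and function names are assumptions for illustration only.

```c
/* Sketch of a hypothetical QSX mutex word and shared-mode acquire/release. */
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint64_t word;  /* bit 0: exclusive bit; bits 1-16: shared count; rest: version */
} qsx_mutex_t;

#define QSX_X_BIT 1ull          /* exclusive lock bit */
#define QSX_S_ONE (1ull << 1)   /* increment representing one shared reader */

/* Safe read: in one atomic step, verify no writer holds the lock and count
 * the caller as a shared reader. */
static void qsx_lock_shared(qsx_mutex_t *m)
{
    for (;;) {
        uint64_t cur = atomic_load_explicit(&m->word, memory_order_relaxed);
        if (cur & QSX_X_BIT)
            continue;           /* a writer holds the lock; spin until released */
        if (atomic_compare_exchange_weak_explicit(&m->word, &cur, cur + QSX_S_ONE,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return;
    }
}

static void qsx_unlock_shared(qsx_mutex_t *m)
{
    atomic_fetch_sub_explicit(&m->word, QSX_S_ONE, memory_order_release);
}
```

Under this assumed layout, an optimistic read would compare the word before and after the data read while masking off the shared-count bits, since concurrent safe readers do not invalidate the data.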
Returning to block 206 of method 200, after reading the portion in block 204, the user-level process 116A directly reads a state of the lock 110A without involving the kernel 112 or another process.
Referring back to block 208 of method 200, the user-level process 116A detects, based on the state of the lock 110A, whether a write to the portion occurred during the read of block 204.
In this way, the user-level process 116A may determine whether the read may have been interrupted by an intervening write. While the user-level process 116A may also acquire a lock 110A for the memory structure 302 prior to the read to prevent the intervening write, acquiring a lock may entail a write to the lock 110A itself. If the lock 110A is mapped in a read mode, such a write may entail a system call to the kernel 112 and a context switch. However, latency associated with a system call may be avoided by reading the portion first and detecting the intervening write based on the lock 110A as described in blocks 206 and 208. Particularly in applications where reads are frequent and writes are infrequent, the penalty of an interrupted read is outweighed by the advantage of omitting the acquiring of the lock.
In these examples and others, the technique provides a mechanism for the user-level process 116A to directly access the memory structure 302. In some such examples, the user-level process 116A may do so without involving the kernel 112 or any other process once the kernel 112 maps the memory structure 302 to the user-level process 116A. In some such examples, the user-level process 116A is able to read the memory structure 302 without modifying locks 110 and without interrupting writes from other processes.
Further examples of the technique for directly accessing devices by user-level processes are described with reference to method 600.
Referring to block 602 of method 600, the kernel 112 maps the memory structure 302 and an associated set of locks 110 into an address space 304 of the user-level process 116A substantially as described in block 202 of method 200.
Referring to block 604, prior to reading a portion of the memory structure 302 (such as a portion of a file system 108 contained in the memory structure 302), the user-level process 116A may determine a first state of a lock 110A associated with the portion of the memory structure 302. The user-level process 116A may directly read the lock 110A to determine the first state without involving the kernel or another process. The lock 110A may take any suitable form, and in some examples, the lock 110A includes a seqlock 400 and/or a QSX mutex 500 substantially similar to those described above.
The first state of the lock 110A may indicate that the portion of the memory structure 302 is currently being written (based on a lock bit 402, an exclusive lock bit 502, and/or other element of the lock 110). If the user-level process 116A determines in block 606 that the portion is currently being written, the method returns to block 604 after a delay. If the user-level process 116A determines in block 606 that the portion is not currently being written, the method proceeds to block 608.
Referring to block 608, after determining the first state of the lock 110A, the user-level process 116A reads the portion of the memory structure 302 substantially as described in block 204 of method 200.
Referring to block 610, the user-level process 116A determines a second state of the lock 110A after the reading performed in block 608. The user-level process 116A may directly read the lock 110A to determine the second state without involving the kernel or another process. This may be performed substantially as described in block 206 of method 200.
Referring to block 612, the user-level process 116A determines from the first state and/or the second state of the lock 110A whether a write occurred during the reading of block 608. This may be performed substantially as described in block 208 of method 200.
If it is determined that an intervening write occurred, the method 600 continues to block 614 where the user-level process 116A determines how many times the read has been retried due to intervening writes and whether the number of retries exceeds a threshold. If the number of retries does not exceed the threshold, the method may return to block 604 after a delay.
In many applications, reads will greatly outnumber writes, and retries due to an intervening write will be infrequent. However, in the event that the number of retries exceeds a threshold, the user-level process 116A may request a lock for the portion of the memory structure 302 via the kernel 112. In some such examples, the method 600 proceeds from block 614 to block 616 where the user-level process 116A issues a system call to the kernel 112 via an API 114 requesting the lock in order to read the portion of the memory structure 302. When the lock is granted, the user-level process 116A directly reads the portion in block 618 substantially as described in block 608.
Likewise, if the comparison of block 612 determines that an intervening write did not occur during the read of block 608, the method 600 proceeds from block 612 to block 622 where the method 600 concludes.
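Using the hypothetical seqlock layout sketched earlier, the read path of blocks 604 through 618 might be expressed as follows; the retry threshold and the kernel_lock_read()/kernel_unlock_read() system-call wrappers are assumptions for illustration and not an API defined by the present disclosure.

```c
/* Sketch of blocks 604-618, reusing seqlock_t from the earlier sketch. */
#define MAX_OPTIMISTIC_RETRIES 8   /* hypothetical threshold checked in block 614 */

/* Hypothetical wrappers around system calls to the kernel API that acquire
 * and release a lock on the portion (block 616). */
extern void kernel_lock_read(const seqlock_t *lock);
extern void kernel_unlock_read(const seqlock_t *lock);

static void read_portion(const seqlock_t *lock, void *dst,
                         const void *src, size_t len)
{
    for (int retries = 0; retries < MAX_OPTIMISTIC_RETRIES; retries++) {
        uint64_t before;
        do {                                         /* blocks 604-606: wait out a writer */
            before = atomic_load_explicit(&lock->word, memory_order_acquire);
        } while (before & 1u);

        memcpy(dst, src, len);                       /* block 608: direct read */

        atomic_thread_fence(memory_order_acquire);
        uint64_t after = atomic_load_explicit(&lock->word, memory_order_relaxed);
        if (before == after)                         /* blocks 610-612: no intervening write */
            return;
    }

    /* Block 616: retries exceeded the threshold; request the lock via the kernel. */
    kernel_lock_read(lock);
    memcpy(dst, src, len);                           /* block 618: read under the lock */
    kernel_unlock_read(lock);
}
```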
Method 200 and method 600 allow a first user-level process 116A to directly read a portion of a memory structure 302, while detecting writes performed by other processes such as a second user-level process 116B. A method 700 for writing to the memory structure 302 by the second user-level process 116B is described below.
Referring first to block 702 of method 700, the kernel 112 maps the memory structure 302 and the associated lock 110A into an address space of the second user-level process 116B in a read/write mode, for example based on a trust level of the second user-level process 116B.
Referring to block 704, the second user-level process 116B acquires the lock 110A in an exclusive mode for writing to the portion of the memory structure 302. In some examples, the lock 110A includes a seqlock 400, and acquiring the lock 110A in an exclusive mode includes setting a lock bit 402 thereof. In some examples, the lock 110A includes a QSX mutex 500 and acquiring the lock 110A includes verifying from the shared record 504 and the exclusive lock bit 502 that no other process has acquired the lock in a shared mode or an exclusive mode and setting the exclusive lock bit 502. Because the lock 110A has been mapped into the address space of the second user-level process 116B using a read/write mode, the second user-level process 116B may acquire the lock by writing to it directly (possibly using a test and set instruction or atomic compare and swap instruction) without intervention by the kernel 112 or another entity.
Referring to block 706, the second user-level process 116B directly writes to the portion of the memory structure 302 associated with the lock 110A. Because the memory structure 302 has been mapped into the address space of the second user-level process 116B in a read/write mode, the second user-level process 116B may write to the memory structure 302 directly without intervention by the kernel 112 or another entity. Referring to block 708, the second user-level process 116B releases the lock 110A.
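Continuing the hypothetical seqlock layout from the earlier sketches, the write path of blocks 704 through 708 might look like the following; the compare-and-swap acquire and the increment-to-release are assumptions about one possible implementation, not a required one.

```c
/* Sketch of blocks 704-708, reusing seqlock_t from the earlier sketch. */
static void write_portion(seqlock_t *lock, void *dst, const void *src, size_t len)
{
    uint64_t cur;

    /* Block 704: acquire exclusively by atomically setting the lock bit
     * (the word becomes odd); retry while another writer holds the lock. */
    for (;;) {
        cur = atomic_load_explicit(&lock->word, memory_order_relaxed);
        if (cur & 1u)
            continue;                                /* another writer is active */
        if (atomic_compare_exchange_weak_explicit(&lock->word, &cur, cur + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            break;
    }

    memcpy(dst, src, len);                           /* block 706: direct write */

    /* Block 708: release by incrementing again, bumping the version record
     * and clearing the lock bit (the word becomes even). */
    atomic_store_explicit(&lock->word, cur + 2, memory_order_release);
}
```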
The processes of methods 200, 600, and/or 700 may be performed by any combination of hard-coded and programmable logic. In some examples, a processing resource utilizes instructions stored on a non-transitory computer-readable memory resource to perform at least some of these processes. Accordingly, examples of the present disclosure may take the form of a non-transitory computer-readable memory resource storing instructions that perform at least part of methods 200, 600, and/or 700.
The computing system 800 may include one or more processing resources 802 operable to perform any combination of the functions described above. The illustrated processing resource 802 may include any number and combination of Central Processing Units (CPUs), Graphics Processing Units (GPUs), microcontrollers, Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), and/or other processing resources.
To control the processing resource 802, the computing system 800 may include a non-transitory computer-readable memory resource 804 that is operable to store instructions for execution by the processing resource 802. The non-transitory computer-readable memory resource 804 may include any number of non-transitory memory devices including battery-backed RAM, SSDs, HDDs, optical media, and/or other memory devices suitable for storing instructions. The non-transitory computer-readable memory resource 804 may store instructions that cause the processing resource 802 to perform any process of any block of methods 200, 600, and/or 700, examples of which follow.
Referring to block 806, the non-transitory computer-readable memory resource 804 may store instructions that cause the processing resource 802 to map a memory structure 302 and a lock 110A associated with a portion of the memory structure 302 into an address space 304 of a user-level process 116A. The memory structure 302 and the lock 110A may be mapped in a read mode based on a trust level of the user-level process 116A. This may be performed substantially as described in block 202 of method 200.
Referring to block 808, the non-transitory computer-readable memory resource 804 may store instructions that cause the processing resource 802 to directly read the portion of the memory structure 302 by the user-level process 116A. This may be performed substantially as described in block 204 of method 200.
Referring to block 810, the non-transitory computer-readable memory resource 804 may store instructions that cause the processing resource 802 to directly read a first state of the lock 110A by the user-level process 116A after the portion is read. This may be performed substantially as described in block 206 of method 200.
Referring to block 812, the non-transitory computer-readable memory resource 804 may store instructions that cause the processing resource 802 to detect a write to the portion during the read of the portion based on the first state of the lock 110A. This may be performed substantially as described in block 208 of method 200.
Further examples are described below.
Referring to block 902, the non-transitory computer-readable memory resource 804 may store instructions that cause the processing resource 802 to directly read, by a user-level process 116A, a portion of a file system 108 that is mapped into an address space 304 of the user-level process 116A. The file system 108 may be mapped in a read mode based on a trust level of the user-level process 116A. This may be performed substantially as described in block 204 of method 200.
Referring to block 904, the non-transitory computer-readable memory resource 804 may store instructions that cause the processing resource 802 to determine a first state of a lock 110A associated with the portion of the file system 108 after the read of the portion of the file system 108 by the user-level process 116A. This may be performed substantially as described in block 206 of method 200.
Referring to block 906, the non-transitory computer-readable memory resource 804 may store instructions that cause the processing resource 802 to detect a write of the portion of the file system 108 during the read of the portion based on the determined first state of the lock 110A. This may be performed substantially as described in block 208 of method 200.
In the foregoing description, numerous details are set forth to provide an understanding of the subject disclosed herein. However, implementations may be practiced without some or all of these details. Other implementations may include modifications and variations from the details discussed above. It is intended that the appended claims cover such modifications and variations.