This disclosure is directed to file systems, particularly for use with key-value stores.
Key-value stores are commonly used as local storage engines for various infrastructural storage systems. A key-value store typically keeps its persistent data on disks through a file system. These file systems are general purpose, and include many features and structures that are not suited for the particular requirements of key-value stores.
This disclosure is directed to file systems, particularly for use with key-value stores.
By providing a specialized file system configured for use with key-value stores, tail latency and write amplification can be reduced. Further, storage costs can be reduced and the potential number of files and/or amount of data that can be lost in failures can be limited, improving system robustness. Additionally, the file system, by being co-designed with and specialized for the workload of key-value stores, can control metadata overhead and make this overhead more predictable. Further, the file system can have its operations, particularly its input/output (I/O) patterns, selected to be suitable for the storage devices used. Additionally, by having the file system operate in a user space instead of a kernel space, I/O scheduling can be set up so as to be friendly to flash memories, allowing use of NVMe solid state drives (SSDs) and/or ZNS SSDs, which offer zoned storage.
In an embodiment, a file system is configured to perform log-structured append-only write operations. The file system is configured for use with a key-value store, the file system comprising a file manager configured to maintain separation of data and logs.
In an embodiment, the file manager is further configured to determine whether a file is a general file or an instant file based on write characteristics of said file, and when the file is a general file, to store the general file in one or more pre-allocated data spaces. In an embodiment, the one or more pre-allocated data spaces are each approximately one megabyte (MB) in size.
In an embodiment, the file system further includes a metadata manager configured to perform compaction of a journal file, the journal file containing persistent metadata of the file system. In an embodiment, the compaction of the journal file is performed in a synchronous mode, wherein in the synchronous mode, a global flushing lock is maintained until a compacted journal and updates to a superblock are completed. In an embodiment, the compaction of the journal file is performed in an asynchronous mode, wherein in the asynchronous mode, a global flushing lock is released once in-memory metadata dump requests are submitted.
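As a non-limiting illustration of the two compaction modes described above, the following sketch contrasts a synchronous path, which holds a global flushing lock until the compacted journal and superblock update are complete, with an asynchronous path, which releases the lock once the in-memory metadata dump request has been submitted. The class name, method names, and use of a thread-pool executor are illustrative assumptions and are not drawn from the disclosed implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class MetadataManager:
    def __init__(self):
        self.flush_lock = threading.Lock()                 # global flushing lock
        self._executor = ThreadPoolExecutor(max_workers=1)

    def _dump_metadata(self):
        pass  # stand-in for writing a compacted journal from in-memory metadata

    def _update_superblock(self):
        pass  # stand-in for pointing the superblock at the compacted journal

    def compact_sync(self):
        # Synchronous mode: the lock is held across the entire compaction,
        # including the superblock update.
        with self.flush_lock:
            self._dump_metadata()
            self._update_superblock()

    def compact_async(self):
        # Asynchronous mode: the lock is released as soon as the in-memory
        # metadata dump request has been submitted.
        with self.flush_lock:
            future = self._executor.submit(self._dump_metadata)
        future.add_done_callback(lambda _: self._update_superblock())
        return future

if __name__ == "__main__":
    manager = MetadataManager()
    manager.compact_sync()
    manager.compact_async().result()
```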
In an embodiment, the file system further includes a task scheduler configured to assign a priority to each write task, the priority based on whether the data of the write task is general data or instant data.
In an embodiment, the file system includes a dedicated journaling thread configured to generate a journal, wherein the journal includes a most recent checkpoint from an operation log of the file system. In an embodiment, the journal includes one or more transactions, each of the one or more transactions including a universal unique identifier, a global sequence identifier, an encoded operation log entry, and a cyclic redundancy check.
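As a non-limiting illustration of such a transaction record, the following sketch encodes a universal unique identifier, a global sequence identifier, an encoded operation log entry, and a cyclic redundancy check into a single byte string; the field widths, byte order, and function names are illustrative assumptions only.

```python
# Sketch of a journal transaction record: UUID + global sequence identifier +
# encoded operation-log entry + CRC32 over the preceding bytes.
import struct
import uuid
import zlib

def encode_transaction(fs_uuid: uuid.UUID, global_seq: int, oplog_entry: bytes) -> bytes:
    body = (fs_uuid.bytes
            + struct.pack("<Q", global_seq)
            + struct.pack("<I", len(oplog_entry))
            + oplog_entry)
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return body + struct.pack("<I", crc)

def decode_transaction(buf: bytes):
    body, (crc,) = buf[:-4], struct.unpack("<I", buf[-4:])
    if zlib.crc32(body) & 0xFFFFFFFF != crc:
        raise ValueError("journal transaction failed CRC check")
    fs_uuid = uuid.UUID(bytes=body[:16])
    global_seq, length = struct.unpack("<QI", body[16:28])
    return fs_uuid, global_seq, body[28:28 + length]

if __name__ == "__main__":
    record = encode_transaction(uuid.uuid4(), 42, b"create /sst/000123.sst")
    print(decode_transaction(record))
```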
In an embodiment, a disk layout of the file system includes a superblock, the superblock including journal index node content, universal unique identifier data, version information, and block size. In an embodiment, the disk layout provides the superblock at a second 4 kilobyte block provided in a storage device.
In an embodiment, data storage of the file system includes one or more zoned solid state drives (ZNS SSDs).
In an embodiment, a system includes a key-value store and the file system as described herein.
In an embodiment, the file system is configured to update data based on the key-value store without updating metadata associated with said data.
In an embodiment, a system includes a key-value store and file system collaborator configured to provide an interface between the key-value store and the file system. In an embodiment, the key-value store and file system collaborator is configured to expose an application programming interface allowing the key-value store to mark priority of input/output requests to the file system. In an embodiment, the key-value store and file system collaborator is configured to coordinate compaction and/or garbage collection operations between the key-value store and the file system. In an embodiment, the key-value store and file system collaborator is configured to provide data from the key-value store to the file system to achieve sector-level redundancy for the file data of the key-value store in the file system. In an embodiment, the key-value store and file system collaborator is configured to perform sync operations for the key-value store and the file system without requiring syncing of metadata.
In an embodiment, the file system is an append-only file system. In an embodiment, the file system is configured to write said separate data and logs to persistent memory independently.
In an embodiment, a database system includes a key-value store, a file system configured to perform append operations, the file system configured for use with a key-value store, and a key-value store and file system collaborator configured to provide an interface between the key-value store and the file system. The file system comprises a file manager configured to maintain separation of data and logs. In an embodiment, the key-value store and file system collaborator is configured to expose an application programming interface allowing the key-value store to mark priority of input/output requests to the file system.
In an embodiment, a computer-readable medium contains instructions that, when executed by a processor, direct the establishment and operation of a file system, the file system configured to perform append operations, the file system configured for use with a key-value store, wherein the file system comprises a file manager configured to maintain separation of data and logs.
This disclosure is directed to file systems, particularly for use with key-value stores.
In the following detailed description, reference is made to the accompanying drawings, which form a part of the description. In the drawings, similar symbols typically identify similar components, unless context dictates otherwise. Furthermore, unless otherwise noted, the description of a successive drawing may reference features from any previous drawing to provide clearer context and a substantive explanation of the current example embodiment. Still, the example embodiments described in the detailed description, drawings, and claims are not intended to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described and recited herein, as well as illustrated in the drawings, may be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein.
Additionally, portions of the present disclosure may be described herein in terms of functional block components and various processing steps. It should be appreciated that such functional blocks may be realized by any number of firmware, software, and/or hardware components configured to perform the specified functions.
File system 100 can be designed specifically to support key-value store workloads, without the burden of supporting general-purpose file system workloads. File system 100 can be a user-space append-only file system designed to operate as the file system for a corresponding key-value store. In an embodiment, the storage device 112 used in the file system 100 can be a solid state drive (SSD), for example, an NVMe SSD. Device access layer 110 can be provided to implement file system 100 on the storage device 112. File system 100 can be configured to reduce the sync write latency typically encountered in key-value stores. File system 100 can include customized handling of metadata 104 and of making metadata persistent 106. For example, file system 100 can carry out sequential writes for write-ahead log (WAL) and SST files from the corresponding key-value system. File system 100 can further provide an ordered journaling mode, instead of dedicating a separate thread and thus requiring multiple metadata writes for journaling commit blocks. A dedicated journaling thread can lead to significant write amplification, especially for small data sizes (4 KB-32 KB), which can be avoided by the ordered journaling mode of file system 100. Reducing or avoiding write amplification can improve tail latency, particularly for sync writes of the write-ahead log.
In file system 100, the data and logs can be split. Further, the write model used in file system 100 can be log-structured append-only writing. File system 100 can use pre-allocated chunks of space for the storage of general data, such that sync writes can be performed by writing data without requiring a corresponding update to metadata. Further, data persistence for files and logs can be performed separately. Performing these tasks separately can reduce the number of metadata persistence operations associated with data write operations.
File system 100 can be configured to be an append-only file system, where random writes are not allowed. Random writes are much more expensive to support than appending writes (e.g., implementing atomic random writes typically requires data journaling in addition to metadata journaling, which effectively doubles the amount of data written), and can further be unfriendly to NAND flash media. On conventional SSDs, random writes could require extensive garbage collection overhead, and such operations cannot be performed on zoned (ZNS) SSDs. File system 100 can be configured to perform append operations, appending data to files. Since key-value stores do not require random writes, the file system 100 can be designed as append-only in an embodiment, for example by having the file system 100 not allow rename, delete, or modify operations. File system 100 can utilize ZNS SSDs, thereby reducing storage costs significantly. In some embodiments, file system 100 can also optionally be used with hard disk drives (HDDs), such as shingled magnetic recording (SMR) HDDs, serving as the storage.
File system 100 can keep file system metadata 104 in memory during normal operation. Storing metadata 104 in memory when possible reduces latency during reads and allows writes to require only a single I/O operation.
File system 100 can provide consistent high throughput and low (tail) latency for sequential writes and random/sequential reads. File system 100 can be suitable to support online services such as key-value stores and databases.
File system 100 can be configured to cater to both general files and instant files. General files can be characterized by low write intensity and by low latency in sequential and random reads. As one non-limiting example, key-value store SST files, which can be written in batches without the need for frequent disk flushing, would be treated as general files by file system 100. Instant files can be characterized by requiring fast, incremental synchronous writes while also requiring high sequential and random read performance. For example, write-ahead log (WAL) files in a key-value store require frequent disk flushing for instant durability, leading to their designation as instant files in the file system 100.
File system 100 can be accessed through dedicated client library 102. The client library 102 can provide API interfaces for creating, deleting, renaming, reading, and/or writing files.
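The following minimal, in-memory sketch illustrates what such a client-library surface could look like; the class name, method signatures, and dictionary-backed storage are illustrative assumptions rather than the actual interfaces of client library 102.

```python
# Minimal in-memory sketch of a client library offering create, delete,
# rename, read, and append (write) calls for an append-only file system.
class ClientLibrary:
    def __init__(self):
        self._files = {}          # path -> bytearray (stand-in for on-disk state)

    def create(self, path: str) -> None:
        self._files.setdefault(path, bytearray())

    def delete(self, path: str) -> None:
        self._files.pop(path, None)

    def rename(self, old: str, new: str) -> None:
        self._files[new] = self._files.pop(old)

    def append(self, path: str, data: bytes) -> int:
        buf = self._files[path]
        offset = len(buf)
        buf.extend(data)          # append-only: no in-place overwrite
        return offset

    def read(self, path: str, offset: int, length: int) -> bytes:
        return bytes(self._files[path][offset:offset + length])

if __name__ == "__main__":
    lib = ClientLibrary()
    lib.create("wal/000001.log")
    lib.append("wal/000001.log", b"put k1 v1\n")
    print(lib.read("wal/000001.log", 0, 64))
```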
File manager 202 is configured to carry out data placement so as to improve file system performance by reducing sync write overhead. The sync write overhead can be reduced through separation of data and logs, pre-allocating data space, and/or maintaining independence between data and log persistence. By segregating data and logs and adopting a log-structured append-only writing approach, the write model of file system 200 can be improved, thereby leading to enhanced performance. File manager 202 can improve efficiency for sync writes by incorporating pre-allocated data space, as only data persistence is required, eliminating the need for persisting metadata, and thereby reducing overhead. By providing separate data persistence for different files and global log persistence, these operations can be executed independently, thereby mitigating the extensive metadata persistence operations typically caused by single data write persistence in traditional file systems.
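As a non-limiting sketch of the pre-allocation idea, the example below reserves fixed-size chunks for a general file so that most appends land inside already-reserved space and only occasionally change file metadata; the 1 MB chunk size, class names, and return convention are illustrative assumptions.

```python
CHUNK_SIZE = 1 << 20  # assumed 1 MB pre-allocated unit

class Allocator:
    def __init__(self):
        self._next = 0

    def allocate(self) -> int:
        chunk_id = self._next
        self._next += 1
        return chunk_id

class GeneralFile:
    def __init__(self, allocator: Allocator):
        self._allocator = allocator
        self._chunks = []         # metadata: ids of pre-allocated chunks
        self._size = 0

    def append(self, data: bytes) -> bool:
        """Returns True only if new chunks had to be allocated (metadata changed)."""
        needed = (self._size + len(data) + CHUNK_SIZE - 1) // CHUNK_SIZE
        allocated_new = False
        while len(self._chunks) < needed:
            self._chunks.append(self._allocator.allocate())
            allocated_new = True
        self._size += len(data)   # data lands inside already-reserved space
        return allocated_new

if __name__ == "__main__":
    f = GeneralFile(Allocator())
    print(f.append(b"x" * 4096))  # True: first chunk reserved, metadata touched
    print(f.append(b"x" * 4096))  # False: fits in the reserved chunk, data-only write
```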
The file manager 202 can be configured to handle general files and instant files, as discussed above. The general files can be characterized by relatively low write intensity, and the instant files can be files that require fast, incremental synchronous writes. The pre-allocating of data space can include allocating chunks of discrete size for the storage of general files, allowing more efficient metadata to be used for such general files. The nature of files as general files or instant files can be determined by file manager 202. The nature of files as general files or instant files can be determined based on an API through which the files have been provided or accessed. The nature of files as general files or instant files can be determined based on a tag applied to the file.
Metadata manager 204 is configured to manage file system metadata in memory and on disk. Metadata manager 204 can be configured to carry out journal compaction to release disk space occupied by stale records and prepare the journal for later metadata updates. A non-limiting example of the journal compaction process carried out by metadata manager 204 is provided in
Task scheduler 206 is configured to manage data reading and writing processes. Task scheduler 206 can categorize writes by task type, such as the writing of instant data or general data, and schedule the data writes according to priorities based on the type of task. In an embodiment, metadata writes can be sent directly to be performed, without having to be scheduled by task scheduler 206.
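A non-limiting sketch of such priority-based scheduling is shown below, where instant-data write tasks are dispatched before general-data write tasks; the numeric priority values, class name, and queue discipline are illustrative assumptions.

```python
import heapq
import itertools

PRIORITY = {"instant": 0, "general": 1}   # lower value = dispatched first

class TaskScheduler:
    def __init__(self):
        self._queue = []
        self._ticket = itertools.count()  # preserves FIFO order within a priority

    def submit(self, task_type: str, task) -> None:
        heapq.heappush(self._queue, (PRIORITY[task_type], next(self._ticket), task))

    def dispatch(self) -> None:
        while self._queue:
            _, _, task = heapq.heappop(self._queue)
            task()

if __name__ == "__main__":
    sched = TaskScheduler()
    sched.submit("general", lambda: print("flush SST batch"))
    sched.submit("instant", lambda: print("sync WAL append"))
    sched.dispatch()   # the WAL append runs first despite being submitted second
```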
Space allocator 208 is configured to manage space within file system 200. In an embodiment, space allocator 208 is configured to use three layers (L0, L1, L2) to describe the spaces. Space allocator 208 can thereby accelerate the determination of the status of spaces as clear or set, without adding excessive overhead. A non-limiting example of the structure implemented by space allocator 208 is provided in
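Independently of that illustration, the following non-limiting sketch shows one way an L0/L1/L2 structure could be organized, where each higher layer records whether a group of lower-layer entries is completely set, so that searches can skip exhausted regions; the group size, class name, and scan order are illustrative assumptions.

```python
GROUP = 64   # allocation units per L1 bit, and L1 bits per L2 bit (assumed)

class SpaceAllocator:
    def __init__(self, total_units: int):
        self.l0 = [False] * total_units                          # True = in use
        self.l1 = [False] * ((total_units + GROUP - 1) // GROUP)  # group fully used?
        self.l2 = [False] * ((len(self.l1) + GROUP - 1) // GROUP) # region fully used?

    def _refresh(self, unit: int) -> None:
        g1 = unit // GROUP
        start = g1 * GROUP
        self.l1[g1] = all(self.l0[start:start + GROUP])
        g2 = g1 // GROUP
        start = g2 * GROUP
        self.l2[g2] = all(self.l1[start:start + GROUP])

    def allocate(self) -> int:
        for g2, full in enumerate(self.l2):
            if full:
                continue                                         # skip exhausted L2 region
            for g1 in range(g2 * GROUP, min((g2 + 1) * GROUP, len(self.l1))):
                if self.l1[g1]:
                    continue                                     # skip exhausted L1 group
                for unit in range(g1 * GROUP, min((g1 + 1) * GROUP, len(self.l0))):
                    if not self.l0[unit]:
                        self.l0[unit] = True
                        self._refresh(unit)
                        return unit
        raise RuntimeError("no free allocation units")

    def free(self, unit: int) -> None:
        self.l0[unit] = False
        self._refresh(unit)

if __name__ == "__main__":
    alloc = SpaceAllocator(total_units=4096)
    print(alloc.allocate(), alloc.allocate())   # 0 1
```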
Key-value-file system collaborator 210 is configured to interface with an associated key-value system to improve the operations of the key-value system and the file system 200 by coordinating operations in both systems. Collaborations between the key-value system and the file system that are facilitated by key-value-file system collaborator 210 can include, as non-limiting examples, exposing an API allowing the key-value store to mark priority of I/O requests, providing specialized file types to improve synchronization I/O, reducing or avoiding blocking resulting from compaction operations, coordinating garbage collection operations to reduce write amplification, providing sector-level redundancy for stored data to improve fault tolerance, and performing sync operations without requiring syncing of metadata.
The key-value and file system collaboration can enable a more efficient approach compared to conventional file sync and file data-sync implementations. When such conventional file sync and file data-sync approaches are used with an append-only file, two separate I/O operations are triggered: one for updating the data and another for updating the file size. This results in increased input/output overhead and Write Amplification Factor (WAF). The key-value and file system collaboration can allow a specialized file data sync system call that updates only the user data, without synchronizing the file size unless a new unit of general file storage is allocated for the file to accommodate the new user data. This can avoid unnecessary synchronization of file metadata in most cases. Accordingly, the file data sync implementation through the key-value and file system collaboration can avoid performing the extra I/O operation caused by traditional file sync and file data sync, effectively reducing the overall WAF by 1. This can provide improved performance and enhanced data consistency without compromising essential synchronization tasks.
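The following non-limiting sketch illustrates this decision: every data sync pays one data I/O, and the metadata (file size) I/O is paid only when an append has crossed into a newly allocated unit since the last sync; the unit size, counters, and names are illustrative assumptions.

```python
UNIT = 1 << 20  # assumed pre-allocated unit size (1 MB)

class AppendOnlyFile:
    def __init__(self):
        self.size = 0
        self.reserved = 0          # bytes covered by persisted allocation units
        self.data_ios = 0
        self.metadata_ios = 0

    def append(self, data: bytes) -> None:
        self.size += len(data)

    def file_data_sync(self) -> None:
        self.data_ios += 1                    # one I/O for the user data
        if self.size > self.reserved:         # crossed into a new allocation unit?
            self.reserved = ((self.size + UNIT - 1) // UNIT) * UNIT
            self.metadata_ios += 1            # only then pay the metadata I/O

if __name__ == "__main__":
    f = AppendOnlyFile()
    for _ in range(8):
        f.append(b"x" * 4096)                 # eight small appends within one unit
        f.file_data_sync()
    print(f.data_ios, f.metadata_ios)         # 8 data I/Os, only 1 metadata I/O
```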
In the file system, the metadata manager, such as metadata manager 204 described above and shown in
The metadata organized according to
In an embodiment, the file system can include an API for allowing the key-value store to set I/O priorities, for example as part of key-value-file system collaborator 210 as shown in
The specialized files allow the key-value stores to execute compaction tasks without causing any blockages or disruptions to the overall system performance. Traditional compaction processes can be time-consuming and often lead to system slowdowns, hindering overall performance. However, by employing the tailored file type, the key-value stores can more efficiently manage data organization and compaction in a non-blocking manner, preserving the system's responsiveness and optimal functioning.
In order to mitigate write amplification and optimize the garbage collection process, the file system can collaborate with the key-value stores to perform garbage collection efficiently. This collaboration can include sharing zone usage information of the file system with the key-value stores, enabling informed decisions regarding file compaction or garbage collection execution. Garbage collection scheduling can be improved by considering garbage rates at the entry level, rather than just the space level. This ensures the selection of blob files with the highest garbage space, leading to more efficient garbage collection operations. Further, this awareness can allow the garbage collection process to be made more flexible, diverse, concurrent, and aware of physical capacity. This can optimize space reclamation, improve concurrency at high water levels, and enhance overall garbage collection performance.
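As a non-limiting sketch, victim selection under such a policy could look like the following, where blob files report per-entry garbage statistics and the files with the highest garbage ratios are collected first; the data fields, threshold, and function name are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class BlobFileStats:
    name: str
    live_entries: int
    dead_entries: int
    dead_bytes: int

    @property
    def garbage_ratio(self) -> float:
        total = self.live_entries + self.dead_entries
        return self.dead_entries / total if total else 0.0

def pick_gc_victims(files, ratio_threshold=0.5, max_victims=2):
    # Consider only files whose entry-level garbage rate is high enough,
    # then prefer those that free the most space per unit of work.
    candidates = [f for f in files if f.garbage_ratio >= ratio_threshold]
    candidates.sort(key=lambda f: (f.garbage_ratio, f.dead_bytes), reverse=True)
    return candidates[:max_victims]

if __name__ == "__main__":
    files = [
        BlobFileStats("blob-001", live_entries=900, dead_entries=100, dead_bytes=1 << 20),
        BlobFileStats("blob-002", live_entries=200, dead_entries=800, dead_bytes=8 << 20),
        BlobFileStats("blob-003", live_entries=400, dead_entries=600, dead_bytes=3 << 20),
    ]
    print([f.name for f in pick_gc_victims(files)])   # ['blob-002', 'blob-003']
```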
Periodic compaction can be performed in the key-value stores to merge multiple layers of the same page from the log-structured merge tree into the last layer. This can reduce the amount of data handled during garbage collection, and enable upper layers to perform corresponding functions. Reduced garbage data can reduce read operations, thereby reducing read amplification. Additionally, “hot” and “cold” data can be separated through the collaboration of the file system and the key-value stores, thereby reducing write amplification issues arising from rewriting cold data when hot and cold data are mixed.
The sector-level fault-tolerance techniques improve the robustness and reliability of the data storage system. By incorporating sector-level redundancy in both the file system metadata and file data, the risk of data loss due to sector failures is substantially mitigated. Consequently, the proposed approach enhances the overall data recovery capabilities of the system, thereby contributing to its stability and long-term viability. The application of sector-level redundancy in the management of both file system metadata and file data allows safeguarding critical information, ultimately contributing to an enhanced user experience and increased system resilience.
Key-value and file system 1300 can be used for storage in cloud applications, for example to provide data persistence required by cloud services. Key-value system 1302 is configured to provide a key-value store, for example as part of a storage backend for cloud services. Non-limiting examples of cloud services using key-value systems 1302 include shopping, social media, metadata management, and the like. File system 1314 can be a dedicated user-level append-only file system configured to provide storage specialized to facilitate operation of key-value system 1302.
Log engine 1304 is configured to allow concurrent writing of multiple log files, thereby reducing the number of compaction and garbage collection operations. The logs written by log engine 1304 can be configured such that strong sorting is not required for handling of said logs. Log engine 1304 is configured to address throughput issues in log writes and increase recovery speed by reducing the sync write overhead of logs from multiple input/output (I/O) operations to a single I/O operation, aggregating writes using a lock-free queue to control latency and improve throughput, and/or providing asynchronous interfaces to enhance the thread model. Where key-value system 1302 and file system 1314 are integrated and collaborate, the log engine 1304 can be used to store a write-ahead log (WAL) having a predefined structure with a defined actual file size. The defined file size for the WAL can in turn require fewer I/O operations, thereby enhancing performance while mitigating potential tradeoffs regarding data consistency. The operation of a log engine 1304 is further detailed in
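A non-limiting sketch of the write-aggregation idea is shown below: concurrent synchronous appends are queued and flushed as a batch so that many records can share a single sync I/O. The disclosure describes a lock-free queue; for brevity this sketch substitutes Python's thread-safe queue.Queue, and all class and method names are illustrative assumptions.

```python
import queue
import threading

class LogEngine:
    def __init__(self):
        self._pending = queue.Queue()
        self._log = bytearray()            # stand-in for the on-disk log
        self.sync_ios = 0
        self._worker = threading.Thread(target=self._flush_loop, daemon=True)
        self._worker.start()

    def append_sync(self, record: bytes) -> None:
        done = threading.Event()
        self._pending.put((record, done))
        done.wait()                        # returns once the shared flush lands

    def _flush_loop(self) -> None:
        while True:
            batch = [self._pending.get()]  # block for the first record
            while True:
                try:
                    batch.append(self._pending.get_nowait())
                except queue.Empty:
                    break
            for record, _ in batch:
                self._log.extend(record)
            self.sync_ios += 1             # one sync I/O covers the whole batch
            for _, done in batch:
                done.set()

if __name__ == "__main__":
    engine = LogEngine()
    threads = [threading.Thread(target=engine.append_sync, args=(b"put k%d\n" % i,))
               for i in range(16)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("records flushed with", engine.sync_ios, "sync I/O(s)")
```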
Hash engine 1306 is configured to handle point queries within the key-value system 1302. In particular, hash engine 1306 is configured to reduce tail latency in point queries. The hash engine 1306 can separate data and index components and maintain the index in a cache memory, for example by compressing the index and/or caching partial data. The partial data can be selected using, for example, a least recently used strategy. The operation of hash engine 1306 is further detailed in
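The following non-limiting sketch illustrates the data/index separation and partial-data caching described above: an index of key to (offset, length) stays in memory, values live in an append-only value log, and a small least-recently-used cache holds recently read values; the class name, cache size, and storage stand-ins are illustrative assumptions.

```python
from collections import OrderedDict

class HashEngine:
    def __init__(self, cache_entries: int = 1024):
        self._data = bytearray()          # stand-in for the on-disk value log
        self._index = {}                  # key -> (offset, length), kept in memory
        self._cache = OrderedDict()       # LRU over recently read values
        self._cache_entries = cache_entries

    def put(self, key: str, value: bytes) -> None:
        offset = len(self._data)
        self._data.extend(value)
        self._index[key] = (offset, len(value))
        self._cache.pop(key, None)        # any cached copy is now stale

    def get(self, key: str) -> bytes:
        if key in self._cache:
            self._cache.move_to_end(key)  # refresh LRU position
            return self._cache[key]
        offset, length = self._index[key]                  # no I/O: index in memory
        value = bytes(self._data[offset:offset + length])  # one "data" read
        self._cache[key] = value
        if len(self._cache) > self._cache_entries:
            self._cache.popitem(last=False)                # evict least recently used
        return value

if __name__ == "__main__":
    engine = HashEngine(cache_entries=2)
    engine.put("user:1", b"alice")
    engine.put("user:2", b"bob")
    print(engine.get("user:1"), engine.get("user:1"))
```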
Sorting engine 1308 is configured to carry out range scan operations while reducing the write-amplification factor and/or read/write latency associated with such operations. Sorting engine 1308 can use a partitioned log-structured merge (LSM) tree. The classification of I/O flows and scheduling of tasks can further be carried out by sorting engine 1308. The operation of a sorting engine 1308 is further detailed in
Garbage collection manager 1310 is configured to carry out garbage collection and/or compaction operations in key-value and file system 1300. The garbage collection manager 1310 can be configured to reduce unnecessary data movement during garbage collection and/or compaction operations in key-value and file system 1300. The garbage collection manager 1310 can conduct garbage collection and/or compaction operations based on awareness regarding application-side data deletion, such as expiration of pages. Garbage collection and compaction carried out by garbage collection manager 1310 can be configured to arrange the data to support other modules such as sorting engine 1308. The garbage collection manager 1310 can coordinate preservation of data during the garbage collection and/or compaction operations. The operation of a garbage collection manager 1310 is further detailed in
Collaboration layer 1312 is configured to facilitate collaboration between key-value system 1302 and file system 1314. Collaboration layer 1312 can further facilitate efficient compaction and/or garbage collection operations in key-value system 1302 based on the collaboration between the key-value system 1302 and file system 1314. The collaboration can reduce write amplification issues arising from compaction and/or garbage collection operations. In an embodiment, the collaboration layer 1312 can expose zone usage information from the file system 1314 to the key-value system 1302.
File system 1314 can be configured to split data from logs and use log-structured append-only writing as the write model, where only append operations can be performed. In an embodiment, the file system can further provide pre-allocated data space where sync writes only occur for the persistence of data, and in an embodiment, do not need to make metadata persistent. In an embodiment, the data persistence for different files and global log persistence can be executed separately. These aspects of the file system can allow the file system to avoid some metadata persistence operations, such as those caused by single data write persistence operations.
The file system 1314 can be configured to support general files and instant files. Both general and instant files can be written sequentially, and both can be read either sequentially or randomly. General files can be optimized for consistently low latency in either sequential or random reads. General files can be used for writing data in batches that do not require flushing the data to disk after each write, such as SST files. The storage space for general files is allocated in large units, with a non-limiting example unit size being 1 MB. The large allocation unit can reduce metadata size for general files, such that the metadata of all general files can be kept in memory during normal file system operation. By keeping the metadata in memory, no read operation to general files would require further I/O for metadata access, regardless of the read offset. This can reduce read tail latency for general files. Instant files can be optimized for fast, incremental synchronous writes while having good sequential and random read performance near the tail. Instant files can be used for writing data that requires frequent flushing to disk for instant durability, such as write-ahead log files of the key-value system. The data and metadata of each individual write can be bundled together for instant files. The bundled data and metadata can be written to a journal file shared by all instant files. The bundling of data and writing to the journal file can improve the speed of incremental write and sync operations. This approach is structured to support sequential reads, but can have tradeoffs regarding random reads. Since instant files are expected to be mostly read sequentially, with random reads mostly concentrated near the tail, the most recently written data of each instant file that is actively being written can be cached to improve read performance.
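As a non-limiting sketch of the instant-file path described above, the example below bundles each append's metadata (file identifier, offset, length) with its data into a single record appended to a journal shared by all instant files, and keeps the most recently written bytes of each file in a small tail cache for reads near the tail; the record layout, cache size, and names are illustrative assumptions.

```python
import struct

TAIL_CACHE_BYTES = 64 * 1024   # assumed per-file tail cache size

class InstantFileJournal:
    def __init__(self):
        self._journal = bytearray()     # journal shared by all instant files
        self._sizes = {}                # file_id -> current size (metadata)
        self._tails = {}                # file_id -> bytearray of recent data

    def append(self, file_id: int, data: bytes) -> None:
        offset = self._sizes.get(file_id, 0)
        # Bundle metadata (file id, offset, length) with the data in one record,
        # so a single write persists both.
        record = struct.pack("<QQI", file_id, offset, len(data)) + data
        self._journal.extend(record)
        self._sizes[file_id] = offset + len(data)
        tail = self._tails.setdefault(file_id, bytearray())
        tail.extend(data)
        if len(tail) > TAIL_CACHE_BYTES:
            del tail[:len(tail) - TAIL_CACHE_BYTES]   # keep only the newest bytes

    def read_tail(self, file_id: int, length: int) -> bytes:
        tail = self._tails.get(file_id, bytearray())
        return bytes(tail[-length:])    # served from the tail cache, no journal scan

if __name__ == "__main__":
    j = InstantFileJournal()
    j.append(7, b"put k1 v1\n")
    j.append(7, b"put k2 v2\n")
    print(j.read_tail(7, 10))
```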
The file system 1314 can include a user-space I/O scheduler to assign I/O priority to different I/O types. The I/O scheduler will mark foreground I/O as high priority while background I/O will be marked as low priority. In addition, the key-value system 1302 can include a scheduler to schedule its background tasks in order to ensure that each I/O issued by the upper layer applications has a consistent I/O amplification. Through this co-design of I/O scheduling in both key-value system 1302 and file system 1314, the tail latency can be kept stable and low as both the I/O amplification and I/O latency are consistent. Moreover, reading general files from the file system requires no I/O for metadata, and use of large spaces for the general files can ensure that most read operations require a single I/O.
Kernel space 1316 can contain disks 1318. Disks 1318 can include one or more storage media, such as solid state drives (SSDs). In an embodiment, at least some of disks 1318 are zoned storage (ZNS) SSDs.
Applications 1320 can be any suitable applications utilizing the key-value and file system 1300, for example, online shopping, social media, metadata management applications, or the like. The applications 1320 can interface with key-value and file system 1300 through any suitable application programming interface (API). In an embodiment, the API can be specific to the particular type of file, for example by having the nature of the files as general files or instant files be determined by the API through which the file has been received.
Aspects:
Aspect 1. A file system, wherein the file system is configured to perform log-structured append-only write operations, the file system configured for use with a key-value store, the file system comprising a file manager configured to maintain separation of data and logs.
Aspect 2. The file system according to aspect 1, wherein the file manager is further configured to determine whether a file is a general file or an instant file based on write characteristics of said file, and when the file is a general file, to store the general file in one or more pre-allocated data spaces.
Aspect 3. The file system according to aspect 2, wherein the one or more pre-allocated data spaces are each approximately one megabyte (MB) in size.
Aspect 4. The file system according to any of aspects 1-3, further comprising a metadata manager configured to perform compaction of a journal file, the journal file containing persistent metadata of the file system.
Aspect 5. The file system according to aspect 4, wherein the compaction of the journal file is performed in a synchronous mode, wherein in the synchronous mode, a global flushing lock is maintained until a compacted journal and updates to a superblock are completed.
Aspect 6. The file system according to aspect 4, wherein the compaction of the journal file is performed in an asynchronous mode, wherein in the asynchronous mode, a global flushing lock is released once in-memory metadata dump requests are submitted.
Aspect 7. The file system according to any of aspects 1-6, further comprising a task scheduler configured to assign a priority to each write task, the priority based on whether the data of the write task is general data or instant data.
Aspect 8. The file system according to any of aspects 1-7, comprising a dedicated journaling thread configured to generate a journal, wherein the journal includes a most recent checkpoint from an operation log of the file system.
Aspect 9. The file system according to aspect 8, wherein the journal includes one or more transactions, each of the one or more transactions including a universal unique identifier, a global sequence identifier, an encoded operation log entry, and a cyclic redundancy check.
Aspect 10. The file system according to any of aspects 1-9, wherein a disk layout of the file system includes a superblock, the superblock including journal index node content, universal unique identifier data, version information, and block size.
Aspect 11. The file system according to aspect 10, wherein the disk layout provides the superblock at a second 4 kilobyte block provided in a storage device.
Aspect 13. The file system according to any of aspects 1-12, wherein data storage of the file system includes one or more zoned solid state drives (ZNS SSDs).
Aspect 14. A system comprising the key-value store and the file system according to any of aspects 1-13.
Aspect 15. The system according to aspect 14, wherein the file system is configured to update data based on the key-value store without updating metadata associated with said data.
Aspect 16. The system according to any of aspects 14-15, comprising a key-value store and file system collaborator configured to provide an interface between the key-value store and the file system.
Aspect 17. The system according to aspect 16, wherein the key-value store and file system collaborator is configured to expose an application programming interface allowing the key-value store to mark priority of input/output requests to the file system.
Aspect 18. The system according to any of aspects 16-17, wherein the key-value store and file system collaborator is configured to coordinate compaction and/or garbage collection operations between the key-value store and the file system.
Aspect 19. The system according to any of aspects 16-18, wherein the key-value store and file system collaborator is configured to provide data from the key-value store to the file system to achieve sector-level redundancy for the file data of the key-value store in the file system.
Aspect 20. The system according to any of aspects 16-19, wherein the key-value store and file system collaborator is configured to perform sync operations for the key-value store and the file system without requiring syncing of metadata.
Aspect 21. The system according to any of aspects 1-20, wherein the file system is an append-only file system.
Aspect 22. The system according to any of aspects 1-20, wherein the file system is configured to write said separate data and logs to persistent memory independently.
Aspect 23. A database system, comprising a key-value store, a file system configured to perform log-structured append-only write operations, the file system configured for use with a key-value store, and a key-value store and file system collaborator configured to provide an interface between the key-value store and the file system, wherein the file system comprises a file manager configured to maintain separation of data and logs.
Aspect 24. The database system according to aspect 23, wherein the key-value store and file system collaborator is configured to expose an application programming interface allowing the key-value store to mark priority of input/output requests to the file system.
Aspect 25. A computer-readable medium containing instructions that, when executed by a processor, direct the establishment and operation of a file system, the file system configured to perform log-structured append-only write operations, the file system configured for use with a key-value store, the file system comprising a file manager configured to maintain separation of data and logs.
The examples disclosed in this application are to be considered in all respects as illustrative and not limitative. The scope of the invention is indicated by the appended claims rather than by the foregoing description; and all changes which come within the meaning and range of equivalency of the claims are intended to be embraced therein.