LSM-tree has become the backbone of modern key-value stores and storage engines. It ingests key-value entries inserted by the application by first buffering them in memory. When the buffer fills up, it flushes these entries to storage (typically a flash-based SSD) as a sorted array referred to as a run. LSM-tree then compacts smaller runs into larger ones in order to (1) restrict the number of runs that a query has to search, and to (2) discard obsolete entries, for which newer versions with the same keys have been inserted. LSM-tree organizes these runs based on their ages and sizes across levels of exponentially increasing capacities.
LSM-tree is widely used across applications including OLTP, HTAP, social graphs, blockchain, and stream processing.
The compaction policy of an LSM-tree dictates which data to merge when. Existing work has rigorously studied how to tune the eagerness of a compaction policy to strike different trade-offs between the costs of reads, writes, and space. This paper focuses on an orthogonal yet crucial design dimension: compaction granularity. Existing compaction designs can broadly be lumped into two categories with respect to how they granulate compactions: Full Merge and Partial Merge. Each entails a particular shortcoming.
With Partial Merge, each run is partitioned into multiple small files of equal sizes. When a level reaches capacity, one file from within that level is selected and merged into files with overlapping key ranges at the next larger level. Partial merge is used by default in LevelDB and RocksDB. Its core problem is high write-amplification. The reason is twofold. Firstly, the files chosen to be merged do not have perfectly overlapping key ranges. Each compaction therefore superfluously rewrites some non-overlapping data. Secondly, concurrent compactions at different levels cause files with different lifespans to become physically interspersed within the SSD. This makes it hard for the SSD to perform efficient internal garbage-collection, especially as the data size increases relative to the available storage capacity.
We illustrate this problem in
With Full Merge, entire levels are merged all at once. Full merge is used in Cassandra, HBase, and Universal Compaction in RocksDB. Its core problem is that until a merge operation is finished, the files being merged cannot be disposed of. This means that compacting the LSM-tree’s largest level, which is exponentially larger than the rest, requires having twice as much storage capacity as data until the operation is finished. We illustrate this in
SUMMARY
The embodiments of the disclosure will be understood and appreciated more fully from the following detailed description, taken in conjunction with the drawings in which:
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.
The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings.
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
Because the illustrated embodiments of the present invention may for the most part, be implemented using electronic components and circuits known to those skilled in the art, details will not be explained in any greater extent than that considered necessary as illustrated above, for the understanding and appreciation of the underlying concepts of the present invention and in order not to obfuscate or distract from the teachings of the present invention.
Any reference in the specification to a method should be applied mutatis mutandis to a device or system capable of executing the method and/or to a non-transitory computer readable medium that stores instructions for executing the method.
Any reference in the specification to a system or device should be applied mutatis mutandis to a method that may be executed by the system, and/or may be applied mutatis mutandis to non-transitory computer readable medium that stores instructions executable by the system.
Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a device or system capable of executing instructions stored in the non-transitory computer readable medium and/or may be applied mutatis mutandis to a method for executing the instructions.
Any combination of any module or unit listed in any of the figures, any part of the specification and/or any claims may be provided.
The specification and/or drawings may refer to a processor. The processor may be a processing circuitry. The processing circuitry may be implemented as a central processing unit (CPU), and/or one or more other integrated circuits such as application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), full-custom integrated circuits, etc., or a combination of such integrated circuits.
Any combination of any steps of any method illustrated in the specification and/or drawings may be provided.
Any combination of any subject matter of any of the claims may be provided.
Any combination of systems, units, components, processors, and/or sensors illustrated in the specification and/or drawings may be provided.
Modern storage engines and key-value stores have come to rely on the log-structured merge-tree (LSM-tree) as their core data structure. LSM-tree operates by gradually sort-merging incoming application data across levels of exponentially increasing capacities in storage.
A crucial design dimension of LSM-tree is its compaction granularity. Some designs perform Full Merge, whereby entire levels get compacted at once. Others perform Partial Merge, whereby smaller groups of files with overlapping key ranges are compacted independently.
This paper shows that both strategies exhibit serious flaws. With Full Merge, space-amplification is exorbitant. The reason is that while compacting the LSM-tree’s largest level, there must be at least twice as much storage space as data to store both the original and new files until the compaction is finished. On the other hand, Partial Merge exhibits excessive write-amplification. The reason is twofold. (1) The files getting compacted typically do not have perfectly overlapping key ranges, and so some non-overlapping data is superfluously rewritten in each compaction. (2) Files with different lifetimes become interspersed within the SSD thus necessitating high overheads for SSD garbage-collection. We show that as the data size grows, these problems get worse.
We introduce Spooky, a new set of compaction granulation techniques to address these problems. Spooky partitions data at the largest level into equally sized files, and it partitions data at smaller levels based on the file boundaries at the largest level. This allows merging one small group of perfectly overlapping files at a time to limit space-amplification and compaction overheads.
At the same time, Spooky performs fewer though larger concurrent sequential writes and deletes to cheapen SSD garbage-collection.
We show empirically that Spooky achieves >2x lower space-amplification than Full Merge and >2x lower write-amplification than Partial Merge at the same time. Spooky therefore allows LSM-tree for the first time to utilize most of the available storage capacity while maintaining moderate write-amplification.
Spooky is a partitioned compaction for Key-Value Stores. Spooky partitions the LSM-tree’s largest level into equally-sized files, and it partitions a few of the subsequent largest levels based on the file boundaries at the largest level. This allows merging one group of perfectly overlapping files at a time to restrict both space-amplification and compaction overheads. At smaller levels, Spooky performs Full Merge to limit write-amplification yet without inflating space requirements as these levels are exponentially smaller. In addition, Spooky writes and deletes data sequentially within each level and across fewer levels at a time. As a result, fewer files become physically interspersed within the SSD to cheapen garbage-collection overheads.
Spooky is a meta-policy: it is orthogonal to design aspects such as compaction eagerness, key-value separation, etc. As such, it both complements and enhances LSM-tree variants such as tiering, leveling, lazy leveling, Wisckey, etc. Hence, Spooky is beneficial across the wide variety of applications/workloads that these myriad LSM-tree instances each optimize for.
Overall, our contributions are as follows.
We show that LSM-tree designs employing Full Merge waste over half of the available storage capacity due to excessive transient space-amplification.
We show that with Partial Merge, SSD garbage-collection overheads increase at an accelerating rate as storage utilization grows due to files with different lifespans becoming interspersed within the SSD. These overheads multiply with compaction overheads to cripple performance.
We introduce Spooky, a new compaction granulation policy that (1) partitions data into perfectly overlapping groups of files that can be merged using little extra space or superfluous work, and (2) issues SSD-friendly I/O patterns that cheapen SSD garbage-collection.
We show experimentally that Spooky achieves >2x better space-amp than Full Merge and >2x better write-amp than Partial Merge at the same time.
We show that Spooky’s reduced write-amp translates to direct improvements in throughput and latency for both updates and queries.
LSM-tree organizes data across L levels of exponentially increasing capacities. Level 0 is an in-memory buffer (aka memtable). All other levels are in storage. The capacity at Level i is T times larger than at Level i - 1. When the largest level reaches capacity, a new larger level is added. The number of levels is L ≈ log_T(N/B), where N is the data size and B is the buffer size.
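For concreteness, the level count follows directly from the geometric growth of level capacities; the numeric values below (1 TB of data, a 64 MB buffer, size ratio 5) are illustrative choices of ours, not figures from the experiments in this document.

```latex
L \;\approx\; \log_T\!\left(\tfrac{N}{B}\right), \qquad
\text{e.g., } N = 1\,\text{TB},\; B = 64\,\text{MB},\; T = 5:\quad
L \approx \log_5\!\left(2^{14}\right) \approx 6.
```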
For each insert, update, or delete request issued by the application, a data entry comprising a key and a value is put in the buffer (in case of a delete, the value is a tombstone [57]). When the buffer fills up, it gets flushed to Level 1 as a file sorted based on its entries’ keys. Whenever one of the levels in storage reaches capacity, some file from within it is merged into one of the next larger levels.
Whenever two entries with the same key are encountered during compaction, the older one is considered obsolete and discarded. Each file can be thought of as a static B-tree whose internal nodes are cached in memory. There is an in-memory Bloom filter for each file to allow point reads to skip accessing files that do not contain a given key. A point read searches the levels from smallest to largest to find the most recent version of an entry with a given key. A range read sort-merges entries within a specified key range across all levels to return the most recent version of each entry in the range to the user.
LSM-tree has to execute queries, writes, and merge operations concurrently and yet still correctly. In the original LSM-tree paper, each level is a mutable B-tree, and locks are held to transfer data from one B-tree to another when a level reaches capacity. To obviate locking bottlenecks, however, most modern LSM-trees employ multi-version concurrency control.
In RocksDB, for example, a new version object is created after each compaction/flush operation. This version object contains a list of all files active at the instant in time that the compaction/flush finished.
Point and range reads operate over files within the version object that was newest at the instant in time that they commenced to provide a consistent view of the data.
LSM-tree requires more storage space than the size of the raw data. The reason is twofold. First, obsolete entries take up space until compaction discards them. We refer to this as durable space-amplification.
Second, multi-version concurrency control makes it complex to dispose of a file before some compaction that spanned it has terminated. Disposing of files during compaction would require redirecting concurrent reads across different versions of the data and complicate recovery. Instead, all widely-used LSM-tree designs hold on to files until the compaction operating on them is finished. We refer to the temporary extra space needed during compaction to store the original and merged data at the same time as transient space-amplification.
We distinguish between the logical vs. physical size as the LSM-tree’s size before and after space-amplification (space-amp) is considered, respectively. The total space-amp is the factor by which the maximum physical data size is greater than the logical data size. We define it in Equation 1 as the sum of durable and transient space-amp plus one. Durable and transient space-amp are each defined here as a fraction of the logical data size. The inverse of total space-amp is storage utilization, the fraction of the storage device that can store user data. It is generally desirable to reach high storage utilization to take advantage of the available hardware.
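Equation 1 itself is not reproduced above; from the definition given (durable plus transient space-amp plus one, with storage utilization as its inverse), it presumably takes the form:

```latex
\text{space-amp}_{\text{total}}
  \;=\; 1 \;+\; \text{space-amp}_{\text{durable}} \;+\; \text{space-amp}_{\text{transient}},
\qquad
\text{storage utilization} \;=\; \frac{1}{\text{space-amp}_{\text{total}}}.
```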
The LSM-tree’s compactions cause each entry to be rewritten to storage multiple times. The average number of times an entry is physically rewritten is known as write-amplification (write-amp). It is generally desirable to keep write-amp low as it consumes storage bandwidth and lifetime.
In addition to compactions, there is another important source of write-amp for LSM-tree: SSD garbage-collection (GC). Modern flash-based SSDs lay out data internally in a sequential manner across erase units. As the system fills up, GC takes place to reclaim space for updates by (1) picking an erase unit with ideally little remaining live data, (2) migrating any live data to other erase units, and (3) erasing the target unit. GC contributes to write-amp by causing each write request from the application to be physically rewritten multiple times internally within the SSD.
As GC occurs opaquely within the SSD and has historically been difficult to measure, most work on LSM-tree to date has focused exclusively on optimizing write-amp due to compaction. This paper offers the insight that the total write-amp for an LSM-tree is the product of both sources of write-amp, as expressed in Equation 2. Both sources must therefore be co-optimized as they amplify each other:
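Equation 2 is referenced but not shown above; based on the description (the product of the two sources), it presumably reads:

```latex
\text{write-amp}_{\text{total}}
  \;=\; \text{write-amp}_{\text{compaction}} \;\times\; \text{write-amp}_{\text{GC}}.
```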
The compaction policy of an LSM-tree determines which data to merge when. The two mainstream policies are known as leveling and tiering, as illustrated in
With leveling, new data arriving at a level is immediately merged with whichever overlapping data already exists at that level. As a result, each level contains at most one sorted unit of data, also referred to as a run. Each run consists of one or more files.
With tiering, each level contains multiple runs. When a level reaches capacity, all runs within it are merged into a single run at the next larger level. Tiering is more write-optimized than leveling as each compaction spans less data. However, it is less read-efficient as each query has to search more runs per level. It is also less space efficient as it takes longer to identify and discard obsolete entries.
With both leveling and tiering, the size ratio T can be fine-tuned to control the trade-off between compaction overheads on one hand and query and space overheads on the other.
Leveling therefore offers good trade-offs in between. While we focus on leveling in this work for ease of exposition, we also apply Spooky to tiered and hybrid designs to demonstrate its broad applicability.
Durable space-amp exhibits a pathological worst-case. When a new level is added to the LSM-tree to accommodate more data, its capacity is set to be larger by a factor of T than the capacity at the previously largest level.
When this happens, the data size at the new largest level is far smaller than its capacity. As the now second largest level fills up with new data, it can come to contain as much data as the current data size at the largest level. In this case, durable space-amp may be two or greater, as illustrated in
As shown in
We leverage DCA throughout the paper.
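Equation 3 is referenced here and again later (Equations 6, 7, and 9) but is not reproduced; per the later statement that the term 1/(T-1) accounts for the worst-case durable space-amp, it presumably bounds durable space-amp under dynamic capacity adaptation as:

```latex
\text{space-amp}_{\text{durable}} \;\le\; \frac{1}{T-1},
```

intuitively because the levels below the largest one jointly hold at most about N_L/(T-1) bytes, which caps how much obsolete data can shadow entries at the largest level.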
The granularity of compaction controls how much data to merge in one compaction when a level reaches capacity. There are two mainstream approaches: Full vs. Partial Merge. Each has a distinct impact on the balance between write-amp and space-amp in the system.
With Full Merge, compaction is performed at the granularity of whole levels. Full merge is used in Cassandra, HBase, and Universal Compaction in RocksDB. Full merge lends itself to preemption, a technique used to reduce write-amp by predicting across how many levels the next merge operation will recurse and merging them all at once.
We leverage preemption in conjunction with Full Merge throughout the paper to optimize write-amp. The core problem with Full Merge is that while compacting the largest level, which contains most of the data, transient space-amp skyrockets as the original contents cannot be disposed of until the compaction is finished.
With Partial Merge, as used by default in LevelDB and RocksDB, each run is partitioned into multiple files (aka Sorted String Tables or SSTs). When Level i fills up, some file from Level i is picked and merged into the files with overlapping key ranges at Level i + 1. Different methods have been proposed for how to pick this file (e.g., randomly or round-robin). The best-known technique, coined ChooseBest, picks the file that overlaps with the least amount of data at the next larger level to minimize write-amp.
For example, in
This section analyzes write-amp and space-amp for Full vs. Partial Merge to formalize the problem. We assume the leveling merge policy for both baselines. We also assume uniformly random insertions to model the worst-case write-amp.
Modeling Compaction Write-Amp. With Full Merge, the i′th run arriving at a level after the level was last empty entails a write-amp of i to merge with the existing run at that level. After T - 1 runs arrive, a preemptive merge spanning this level takes place and empties it again. Hence, each level contributes on average (1 + 2 + ... + (T - 1))/(T - 1) = T/2 to write-amp, resulting in the overall write-amp in Equation 4.
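Equation 4 is not reproduced above; summing the per-level contribution just derived over the levels in storage gives, as an assumed form (the exact equation may include lower-order terms for preemption at the largest levels):

```latex
\text{write-amp}_{\text{full}}
  \;\approx\; \sum_{j=1}^{L} \frac{1}{T-1}\sum_{i=1}^{T-1} i
  \;=\; L \cdot \frac{T}{2}.
```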
With Partial Merge, a file picked using ChooseBest from a full Level i intersects with ≈ T/2 files’ worth of data on average at Level i + 1. The overlap among these files typically isn’t perfect, however, leading to additional overhead. For example, in
We verify our write-amp models using RocksDB. We use the default RocksDB compaction policy to represent Partial Merge, and we implemented a full preemptive leveled compaction policy within RocksDB. The size ratio T is set to 5, the buffer size B to 64 MB, and the rest of the implementation and configuration details are in Sections 5 and 6.
In
The reasons are that (1) it uses preemption to skip merging entries across nearly full levels, and (2) it avoids the problem of superfluous edge merging by compacting whole levels.
We now analyze the impact of SSD garbage-collection (GC) on write-amp. We run a large-scale experiment that fills up an initially empty LSM-tree with unique random insertions followed by random updates for several hours on a 960 GB SSD. We use the Linux nvme command to report the data volume that the operating system writes to the SSD vs. the data volume that the SSD physically writes to flash. This allows computing GC write-amp throughout the experiment. For both baselines, the size ratio is set to 5, the buffer size to 64 MB, DCA is turned on, and the rest of the setup is given in Section 6. For Full Merge,
Hence, the SSD is less stressed for space and so GC is less often invoked. For Partial Merge, the logical data size is 644 GB.
It is tempting to think that using larger files with Partial Merge would eliminate these GC overheads by causing larger units of data to be written and erased all at once across SSD erase units. We falsify this notion later in Section 6 by showing that even with large files, GC overheads remain high because concurrent partial compactions still cause files from different levels to mix physically.
For the same experiment in
The reason is that compactions into the LSM-tree’s largest level occasionally cause transient space-amp to skyrocket. In contrast, space-amp with Partial Merge is smooth because compactions occur at a finer granularity. Note that the physical data size for both baselines is similar in this experiment despite the fact that Partial Merge is able to store far more user data. Overall,
We provide space-amp models for Partial vs. Full merge in Equations 6 and 7 to enable an analytical comparison. The term 1/(T–1) in both equations accounts for the worst-case durable space-amp from Equation 3. Otherwise, space-amp with Full Merge is higher by an additive factor of one to account for the fact that its transient space-amp occasionally requires as much extra space as the logical data size. By contrast, transient space-amp with Partial Merge is assumed here to be negligible.
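Equations 6 and 7 are not shown above; from the description they presumably take the following forms, which with T = 5 give ≈ 1.25 (≈ 80% storage utilization) for Partial Merge versus ≈ 2.25 (≈ 44%) for Full Merge:

```latex
\text{space-amp}_{\text{partial}} \;\approx\; 1 + \frac{1}{T-1},
\qquad
\text{space-amp}_{\text{full}} \;\approx\; 2 + \frac{1}{T-1}.
```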
This is a well-known phenomenon with SSDs: as they fill up, erase blocks with little remaining live data become increasingly hard to find [60]. The outcome is that each additional byte of user data costs disproportionately more to store in terms of performance.
Full Merge exhibits exorbitant space-amp while Partial Merge exhibits skyrocketing write-amp as storage utilization increases. We have shown that it is not possible to fix these problems using tuning. First, write-amp due to compactions with Partial Merge cannot be reduced beyond the local minimum shown in
We introduce Spooky, a new method of granulating LSM-tree merge operations that eliminates the contention between write-amp and space-amp. As shown in
For ease of exposition, Section 4.1 describes a limited version of Spooky that performs partitioned merge only across the largest two levels. Section 4.2 generalizes Spooky to perform partitioned merge across more levels to enable better write/space trade-offs. Section 4.3 extends Spooky to accommodate skewed workloads. Sections 4.1 to 4.3 assume the leveling merge policy, while Section 4.4 extends Spooky to tiered and hybrid merge policies.
Two-Level Spooky (2L-Spooky) performs partitioned merge across the largest two levels of an LSM-tree as shown in
Level L (the largest level) is partitioned into files, each of which comprises at most N_L/T bytes, where N_L is the data size at Level L and T is the LSM-tree’s size ratio. This divides Level L into at least T files of approximately equal sizes. Level L - 1 (the second largest level) is also partitioned into files such that the key range of each file overlaps with at most one file at Level L. This allows merging one pair of overlapping files at a time across the largest two levels. At Levels 0 to L - 1, 2L-Spooky performs full preemptive merge.
Algorithm 1 first picks some full preemptive merge operation to perform along Levels 0 to L-1. Specifically, it chooses the smallest level q in the range 1 ≤ q ≤ L - 1 that wouldn’t reach capacity if we merged all data at smaller levels into it (Line 2). It then compacts all data at Levels 0 to q and places the resulting run at Level q (Lines 3-7). Any run at Levels 1 to L - 2 is stored as one file (Line 4).
A merge operation into Level L - 1 is coined a dividing merge. A run written by a dividing merge is partitioned such that each output file perfectly overlaps with at most one file at Level L (Lines 6-7).
When Level L - 1 reaches capacity (Line 8), 2L-Spooky triggers a partitioned merge. As shown in
After a partitioned merge, Algorithm 1 checks if Level L is now at capacity. If so, we add a new level (Lines 13-15). On the other hand, if many deletes took place and the largest level significantly shrank, we remove one level (Lines 16-17). If the number of levels changed, the run at the previously largest level is placed at the new largest level. We then perform dynamic capacity adaptation to restrict durable space-amp (Lines 18-19).
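The following is a minimal, byte-level sketch of this workflow in simplified C++. The types and helpers (Level, Tree, TwoLevelSpooky) are placeholders of our own, not the RocksDB implementation: keys are omitted, so files are modeled only by their sizes, and the evolve-the-tree step (Lines 13-19) is left as a comment.

```cpp
// Byte-level sketch of the 2L-Spooky workflow (Algorithm 1). Keys are omitted:
// each level is modeled only by the sizes of its files, so "merging" just moves
// bytes. Assumes a buffer plus at least two levels in storage (L >= 2).
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

using Bytes = uint64_t;

struct Level {
  std::vector<Bytes> files;  // sizes of the files that make up this level's run
  Bytes size() const { return std::accumulate(files.begin(), files.end(), Bytes{0}); }
};

struct Tree {
  std::vector<Level> levels;  // levels[0] = buffer, levels.back() = Level L
  Bytes T = 5;                // size ratio

  // Dynamic capacity adaptation: capacities derived from the current size of Level L.
  Bytes capacity(size_t i) const {
    Bytes c = levels.back().size();
    for (size_t j = levels.size() - 1; j > i; --j) c /= T;
    return c;
  }
};

// Invoked when the buffer (Level 0) fills up.
void TwoLevelSpooky(Tree& t) {
  const size_t L = t.levels.size() - 1;

  // Lines 2-7: full preemptive merge along Levels 0..L-1. Choose the smallest
  // target q that would not reach capacity if all smaller levels merged into it.
  size_t q = 1;
  Bytes run = t.levels[0].size();
  while (q < L - 1 && run + t.levels[q].size() > t.capacity(q)) {
    run += t.levels[q].size();
    ++q;
  }
  run += t.levels[q].size();
  for (size_t i = 0; i <= q; ++i) t.levels[i].files.clear();

  if (q < L - 1) {
    t.levels[q].files = {run};  // Levels 1..L-2 store each run as a single file
  } else {
    // Dividing merge into Level L-1: partition the output so that each file
    // overlaps at most one file at Level L (modeled here as equal-sized parts).
    size_t parts = std::max<size_t>(1, t.levels[L].files.size());
    t.levels[q].files.assign(parts, run / parts);
  }

  // Lines 8-12: partitioned merge. Once Level L-1 is at capacity, merge one
  // pair of perfectly overlapping files at a time into Level L.
  if (t.levels[L - 1].size() >= t.capacity(L - 1)) {
    auto& top = t.levels[L].files;
    auto& below = t.levels[L - 1].files;
    for (size_t i = 0; i < top.size() && i < below.size(); ++i) top[i] += below[i];
    below.clear();
    // Lines 13-19 (omitted here): add/remove a level if Level L grew or shrank,
    // re-partition Level L into >= T files of at most N_L/T bytes, re-apply DCA.
  }
}

int main() {
  Tree t;
  t.levels.resize(4);                             // buffer + Levels 1..3 (L = 3)
  t.levels[3].files.assign(5, Bytes{100} << 20);  // Level L: five 100 MB files
  t.levels[0].files = {Bytes{64} << 20};          // a full 64 MB buffer
  TwoLevelSpooky(t);
  return 0;
}
```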
The full preemptive merge operations at smaller levels achieve the modest write-amp of Full Merge across Levels 1 to L - 2. At Level L - 1, each entry is rewritten one extra time relative to pure Full Merge. The reason is that Level L - 1 has to reach capacity before a partitioned merge is triggered (i.e., there is no preemption at Level L - 1). At Level L, the absence of overlap across different pairs of files prevents superfluous rewriting of non-overlapping data and thus keeps write-amp on par with Full Merge.
Hence, 2L-Spooky increases write-amp by an additive factor of one relative to Full Merge, as stated in Equation 8.
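Equation 8 is referenced but not reproduced; given the reasoning above, it presumably states:

```latex
\text{write-amp}_{\text{2L-Spooky}} \;\approx\; \text{write-amp}_{\text{full}} \;+\; 1.
```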
The design of Spooky so far involves at most three files being concurrently written to storage: one due to the buffer flushing, one due to a full preemptive merge, and one due to a partitioned merge. Hence, at most three files can become physically interspersed within the underlying SSD.
In contrast, with Partial Merge, the number of files that can become interspersed in the SSD is unbounded. In addition, Spooky writes data to each level and later disposes of it in the same sequential order. Hence, data that is written sequentially within the SSD is also deleted sequentially. This relaxes SSD garbage-collection as large contiguous storage areas are cleared at the same time. Both of these design aspects help to reduce SSD garbage-collection. We analyze their impact empirically in Section 6.
A dividing merge and a partitioned merge never occur at the same time, yet each can be the bottleneck in terms of transient space-amp. Hence, transient space-amp for the system as a whole is lower bounded by the expression max(file_max, C_{L-1})/N_L. The term file_max denotes the maximum file size at Level L and controls transient space-amp for a partitioned merge. The term C_{L-1} denotes the capacity at Level L - 1 and controls transient space-amp for a dividing merge. Note that while it is possible to decrease file_max to lower transient space-amp for a partitioned merge, the overall transient space-amp would still be lower bounded by C_{L-1}, and so setting file_max to be lower than C_{L-1} is inconsequential. This explains our motivation for setting file_max to C_{L-1} = N_L/T (Line 10). The overall transient space-amp for 2L-Spooky is therefore 1/T.
By virtue of using dynamic capacity adaptation, Spooky’s durable space-amp is upper-bounded by Equation 3. We plug this expression along with Spooky’s transient space-amp of 1/T into Equation 1 to obtain Spooky’s overall worst-case space-amp in Equation 9.
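Equation 9 is not shown above; plugging the durable bound of 1/(T-1) and the transient bound of 1/T into the assumed form of Equation 1 gives (e.g., ≈ 1.45, or ≈ 69% storage utilization, at T = 5):

```latex
\text{space-amp}_{\text{2L-Spooky}} \;\le\; 1 + \frac{1}{T-1} + \frac{1}{T}.
```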
2L-Spooky significantly reduces write-amp relative to Partial Merge while almost matching Full Merge. Note that the figure ignores the impact of SSD garbage-collection and therefore understates the overall write-amp differences between the baselines.
In summary, while 2L-Spooky enables attractive new write/space cost balances, its write-amp is still higher than with Full Merge by one, and its storage utilization is lower than with Partial Merge by ≈ 10%. It therefore does not dominate either of these baselines and leaves something to be desired. We improve it further in Section 4.2.
In Section 4.1, we saw that the dividing merge operations into Level L - 1 create a lower bound of 1/T on transient space-amp. The reason is that Level L - 1 contains a fraction 1/T of the raw data, and it is rewritten from scratch during each dividing merge operation. In this section, we generalize Spooky to support dividing merge operations into smaller levels to overcome this lower bound.
Algorithm 2 is different from Algorithm 1 in that full preemptive merge operations only take place along levels 0 to X - 1 while dividing merge operations now take place into Level X (rather than into Level L - 1 as before). All else is the same as in Algorithm 1.
When Level X fills up, Spooky performs a partitioned merge operation along the largest L - X levels, one group of at most L - X perfectly overlapping files at a time. An important design decision in the generalized workflow is to combine the idea of preemption with partitioned merge to limit write-amp emanating from larger levels. Specifically, when Level X is full, Algorithm 2 picks the smallest level z in the range X + 1 ≤ z ≤ L that would not reach capacity if we merged all data within this range of levels into it (Line 9). Then, one group of overlapping files across Levels X to z is merged at a time into Level z. If the target level z is not the largest level, the resulting run is partitioned based on the file boundaries at the largest level (Lines 11-12) to facilitate future partitioned merge operations.
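A sketch of the preemption-target selection just described (Line 9), reusing the simplified Tree model from the earlier 2L-Spooky sketch; this reflects our reading of the text, not the actual RocksDB code.

```cpp
// Given that Level X is full, choose the smallest target level z in [X+1, L]
// that would not reach capacity if all data at Levels X..z were merged into it.
size_t PickPartitionedMergeTarget(const Tree& t, size_t X) {
  const size_t L = t.levels.size() - 1;
  Bytes incoming = t.levels[X].size();
  size_t z = X + 1;
  while (z < L && incoming + t.levels[z].size() > t.capacity(z)) {
    incoming += t.levels[z].size();
    ++z;
  }
  // If z < L, the merged output is re-partitioned on Level L's file boundaries
  // (Lines 11-12) so that future partitioned merges stay perfectly aligned.
  return z;
}
```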
In
At Levels 1 to X - 1, full preemptive merge operations keep write-amp on par with our Full Merge baseline. At Level X, each entry is rewritten one extra time on average relative to Full Merge as there is no preemption at this level. At Levels X + 1 to L, write-amp is the same as with Full Merge as we effectively perform full preemptive merge across groups of perfectly overlapping files. Hence, the overall write-amp so far is the same as before for 2L-Spooky in Equation 8. Interestingly, note that by setting X to Level 0, Spooky divides data within the buffer based on the file boundaries at Level L and thus performs partitioned preemptive merge across the whole LSM-tree. In this case, the additional write-amp overhead of Level X is removed. We did not implement this feature, yet it allows Spooky’s overall write-amp to be summarized in Equation 10.
A partitioned merge entails a transient space-amp of at most 1/T^(L-X), as Level L is partitioned into at least T^(L-X) files of approximately equal sizes. A dividing merge operation also entails a transient space-amp of 1/T^(L-X), as the capacity at this level is a fraction 1/T^(L-X) of the raw data size. The overall transient space-amp, which is the maximum of these two expressions, is therefore also 1/T^(L-X). By plugging this expression along with Equation 3 for durable space-amp into Equation 1, we obtain Spooky’s overall space-amp in Equation 11.
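Equations 10 and 11 are referenced but not reproduced; based on the discussion above, they presumably take the following forms, where the additive one in Equation 10 disappears when X is set to Level 0. With T = 5 and partitioned merge across the largest three levels (L - X = 2), Equation 11 gives ≈ 1.29, i.e., ≈ 77% storage utilization:

```latex
\text{write-amp}_{\text{Spooky}} \;\approx\; \text{write-amp}_{\text{full}} + 1
  \quad(\text{or } +\,0 \text{ when } X = 0),
\qquad
\text{space-amp}_{\text{Spooky}} \;\le\; 1 + \frac{1}{T-1} + \frac{1}{T^{\,L-X}}.
```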
Spooky does not alter the number of levels in the LSM-tree, which remains L ≈ logT (N/B). Hence, it does not affect queries, which in the worst-case have to access every level. Query performance is therefore orthogonal to this work.
Equation 11 indicates that by performing partitioned merge across more levels (i.e., by reducing X), transient space-amp with Spooky becomes negligible and approaches the space-amp of Partial Merge. At the same time, Spooky significantly reduces write-amp relative to Partial Merge.
Thus, Spooky dominates Partial Merge across the board. Relative to Full Merge, Spooky increases write-amp by a modest additive factor of one, and this factor can in fact also be saved by setting X to Level 0. At the same time, Spooky offers far lower space-amp than Full Merge and thus allows exploiting much more of the available storage capacity. Hence, Spooky offers dominating trade-offs relative to Full Merge as well.
So far, we have been analyzing Spooky under the assumption of uniformly random updates to reason about worst-case behavior and thus quality-of-service guarantees. Under skewed updates, which are the more common workload, Spooky has additional advantages. Spooky naturally optimizes for skewed update patterns by avoiding having to rewrite files at the largest level that do not overlap with newer updates. For instance, in
An additional possible optimization is to divide data at smaller levels into smaller files, just so that during a full preemptive merge, we can skip merging some files that do not have any other overlapping files. In
While we have focused so far on how to apply Spooky on top of the leveling merge policy, it is straightforward to apply Spooky on top of tiering and hybrid policies as well.
This section discusses Spooky’s implementation within RocksDB. Encapsulation. RocksDB contains an abstract compaction picker class (declared in compaction_picker.h). Its role is to implement the logic of which files to compact under which conditions and how to partition the output into new files. We implemented Spooky by inheriting from this class and implementing the logic of Algorithm 2. Our implementation is therefore encapsulated in one file. This highlights an advantage of Spooky from an engineering perspective, as it leaves all other system aspects (e.g., recovery, concurrency control, etc.) unchanged.
rLevels. We refer to levels in the RocksDB implementation as rLevels to prevent ambiguity with levels in our LSM-tree formalization introduced in Section 2.
rLevel 0. In RocksDB, rLevel 0 is the first rLevel in storage, and it is special: it is the only rLevel whose constituent files may overlap in terms of the keys they contain. When rLevel 0 has accrued α files flushed from the buffer, the compaction picker, and hence our Algorithm 2, is invoked. Once there are β files at rLevel 0, write throttling is turned on. When there are γ files at rLevel 0, the system stalls to allow ongoing compactions to finish. We tune these parameters to α = 4, β = 4, and γ = 9 throughout our experiments. Note that in effect, rLevel 0 can be seen as an extension of the buffer, and so it loosely corresponds to Level 0 in our LSM-tree formalization from Section 2. Flushing the buffer to rLevel 0 contributes an additive factor of one to write-amp, and so our implementation has a write-amp higher by one than the earlier write-amp models (in Eqs. 8 and 10).
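The three rLevel-0 thresholds correspond to standard RocksDB column-family options; the sketch below shows how such a configuration might look. The option names are RocksDB’s, while mapping them to α, β, and γ is our reading of the text rather than a configuration taken from the implementation.

```cpp
#include <rocksdb/options.h>

// Hypothetical tuning matching the thresholds described above.
rocksdb::Options MakeSpookyOptions() {
  rocksdb::Options opt;
  opt.write_buffer_size = 64 << 20;              // 64 MB memtable (the buffer)
  opt.level0_file_num_compaction_trigger = 4;    // alpha: invoke the compaction picker
  opt.level0_slowdown_writes_trigger = 4;        // beta: turn on write throttling
  opt.level0_stop_writes_trigger = 9;            // gamma: stall until compactions finish
  return opt;
}
```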
Level to rLevel Mappings. In RocksDB, all rLevels except rLevel 0 can only store one run (i.e., a non-overlapping collection of files). In order to support tiered and hybrid merge policies, whereby each level can contain multiple runs, we had to overcome this constraint. We did so by mapping each level in our LSM-tree formalization to one or more consecutive RocksDB rLevels. For example, in a tiered merge policy, Level 1 in our formalization corresponds to rLevels 1 to T, Level 2 to rLevels T + 1 to 2T, etc.
Assuming Tiered/Hybrid Merge Policies. Our implementation has a parameter G for the number of greedy levels, from largest to smallest, that employ the leveling merge policy. Hence, when G ≥ L, we have pure leveling; when G = 0, we have pure tiering; and when G = 1, we have lazy leveling. Thus, our implementation can assume different merge policies to strike different trade-offs for different application scenarios. The size ratio T can further be varied to fine-tune these trade-offs.
For our full merge baseline, we use our Spooky implementation yet with partitioned merge turned off. Hence, full preemptive merge is performed across all levels.
RocksDB’s default compaction policy is able to perform internal rLevel 0 compactions, whereby multiple files at rLevel 0 are compacted into a single file that gets placed back at rLevel 0. The goal is to prevent the system from stalling when rLevel 0 is full (i.e., has γ files) yet there is an ongoing compaction into rLevel 1 that must finish before we trigger a new compaction from rLevel 0 to rLevel 1. We also enable rLevel 0 compactions within our Spooky implementation to prevent stalling.
Specifically, whenever a full preemptive merge is taking place and we already have α or more files of approximately equal sizes, created consecutively, and not currently being merged, we compact these files into one file, which gets placed back at rLevel 0.
Our implementation follows RocksDB in that each compaction runs on one or more background threads. We use the sub-compaction feature of RocksDB to partition large compactions across multiple threads. Our design allows partitioned compactions, full preemptive compactions, and rLevel 0 compactions to run concurrently.
Hence, there can be at most three compactions running simultaneously, though each of these compactions may be further parallelized using sub-compactions.
We now evaluate Spooky against Full and Partial Merge. Platform. Our machine has an 11th Gen Intel i7-11700 CPU with sixteen 2.50 GHz cores. The memory hierarchy consists of 48 KB of L1 cache, 512 KB of L2 cache, 16 MB of L3 cache, and 64 GB of DDR memory. An Ubuntu 18.04.4 LTS operating system is installed on a 240 GB KIOXIA EXCERIA SATA SSD. The experiments run on a 960 GB Samsung NVMe SSD, model MZ1L2960HCJR-00A07, with the ext4 file system.
We use db_bench to run all experiments as it is the standard tool used to benchmark RocksDB at Meta and beyond. Every entry consists of a randomly generated 16 B key and a 512 B value. Unless otherwise mentioned, all baselines use the leveling merge policy, a size ratio of 5, and a memtable size of 64 MB. Bloom filters are enabled and assigned 10 bits per entry. Dynamic capacity adaptation is always applied for all baselines. The data block size is 4 KB. We use one application thread to issue inserts/updates, and we employ sixteen background threads to parallelize compactions.
We use the implementation from Section 5 to represent Spooky and Full Merge. For Spooky, we set the parameter X, the level into which Spooky performs dividing merge operations, to L - 2, the third largest level. For Partial Merge, we use the default compaction policy of RocksDB with a file size of 64 MB.
Prior to each experimental trial, we delete the database from the previous trial and reset the drive using the fstrim command. This allows the SSD to reclaim space internally. We then fill up the drive from scratch for the next trial. This methodology ensures that subsequent trials do not impact each other.
We run the du Linux command to monitor the physical database size every five seconds. We also run the nvme command every two minutes to report the SSD’s internal garbage-collection write-amp. We use RocksDB’s internal statistics to report the number of bytes flushed and compacted every two minutes to allow computing write-amp due to compactions. We use db_bench to report statistics on throughput and latency.
The left-most set of bars in
We continue the experiment by adding one application thread to issue queries in parallel to the thread issuing updates. We first issue point reads for two hours and then seeks (i.e., short scans) for two hours. Throughput for the querying thread is reported in
It is tempting to think that using larger files with Partial Merge would eliminate GC overheads by causing larger units of data to be written and erased all at once across SSD erase units. In
In
Lazy Leveling is competitive with Leveling under Spooky despite having more runs in the system, as the Bloom filters help skip most runs. In
We have seen that Spooky matches Partial Merge in terms of storage utilization while significantly beating it in terms of write-amp and query/update performance. Thus, Spooky dominates Partial Merge across the board. At the same time, Spooky is competitive with Full Merge while increasing storage utilization by ≈ 2x. Thus, Spooky is the first merge granulation approach to allow LSM-tree to maintain moderate write-amp as it approaches device capacity.
Key-Value Separation. Recent work proposes to store the values of data entries in an external log or in a separate tiered LSM-tree. This improves write-amp while sacrificing scan performance. Spooky can be combined with the LSM-tree component/s in such designs for better write-amp vs. space-amp balances.
In-Memory Optimizations. Various optimizations have been proposed for LSM-tree’s in-memory data structures, including adaptive or learned caching, learned fence pointers, tiered buffering, selective flushing, smarter Bloom filters or replacements thereof, and materialized indexes for scans. Spooky is fully complementary to such works as it only impacts the decision of which files to merge and how to partition the output.
Performance Stability. Recent work focuses on maintaining stable performance during compaction operations. Various prioritization, synchronization, deamortization, and throttling techniques have been proposed. Our Spooky implementation on RocksDB performs compaction on concurrent threads to avoid blocking the application, yet it could benefit from these more advanced techniques to more evenly schedule the overheads of large compaction operations in time.
Method 180 may include step 182 of performing preemptive full merge operations at first LSM tree levels.
Method 180 may also include step 184 of performing capacity triggered merge operations at second LSM tree levels while imposing one or more restrictions.
The second LSM tree levels include a largest LSM tree level (for example - the L’th level of
Files of the one or more other second LSM tree levels are aligned with files of the largest LSM tree level. Accordingly — a file of an other second LSM tree level overlaps up to a single file of the largest LSM tree level.
The one or more restrictions may limit a number of files that are concurrently written to a non-volatile memory (NVM) during an execution of steps (a) and (b). For example- there may be one file written during step 182 in concurrency with one file written during step 184 and there may be one more file concurrently written from a buffer to the first LSM tree level. Other limitations regarding the number of concurrently written files to the non-volatile memory (NVM) can be imposed. The NVM may be a SSD or other storage entity.
The one or more restrictions impose a writing pattern that includes sequentially writing files of a same LSM tree level to a non-volatile memory (NVM) during an execution of consecutive iterations of steps (a) and (b). A scheduler may schedule the merge operations to include sequential merge operations of a single file and then another sequence of merge operations to another LSM tree level.
The one or more restrictions impose a NVM erasure pattern that includes sequentially evacuating files of the same LSM tree level from the NVM.
The one or more restrictions limit a number of concurrently merged files of the largest LSM tree level and/or of any other second LSM tree level. For example — the limitation may allow merging a single largest LSM tree level file at a time. Other limitations may be imposed — for example — less strict limitations on merging may be imposed on smaller other second LSM tree levels. Yet, for another example, less strict limitations may be applied on the largest LSM tree level.
The files of the largest LSM tree level may be of a same size. This will reduce various penalties. Alternatively — at least two files of the largest LSM tree level may differ from each other by size — by any size difference or any size percentage.
Step 184 may include performing a dividing merge operation (see, for example — section 4.1) that includes merging a first file from a first LSM tree level with a second file of a second LSM tree level that differs from the largest LSM tree level to provide a merged file that belongs to the second LSM tree level.
Step 184 may include a partitioned merge operation that includes merging a file from an other second LSM tree level with a file of the largest LSM tree level to provide one or more merged files that belong to the largest LSM tree level. The size of the merged output may require splitting it into more than one file.
Step 184 may include a hybrid merge operation (see for example
Method 180 may include step 186 of evaluating a state of the largest LSM tree level and determining whether to add a new largest LSM tree level to the LSM tree, to elect another second LSM tree level as a new largest LSM tree level or to maintain the largest LSM tree level. See, for example section 4.1 under the title “evolving the tree”.
Step 184 may include refraining from re-writing a file of the largest LSM tree level, following one or more updates of the LSM tree, when the file of the largest LSM tree level does not overlap with the one or more updates. See, for example, section 4.3.
Method 180 may be applied regardless of the modes of the LSM tree levels. (See, for example section 4.4 and
Method 180 may include performing a preemptive partitioned merge operation to the largest LSM tree level. See, for example, section 4.2.
Existing LSM-tree compaction granulation approaches either waste most of the available storage capacity or exhibit a staggering write-amp that cripples performance. We introduce Spooky, the first compaction granulation approach to achieve high storage utilization and moderate write-amp at the same time.
While the foregoing written description of the invention enables one of ordinary skill to make and use what may be considered presently to be the best mode thereof, those of ordinary skill will understand and appreciate the existence of variations, combinations, and equivalents of the specific embodiment, method, and examples herein. The invention should therefore not be limited by the above described embodiment, method, and examples, but by all embodiments and methods within the scope and spirit of the invention as claimed.
In the foregoing specification, the invention has been described with reference to specific examples of embodiments of the invention. It will, however, be evident that various modifications and changes may be made therein without departing from the broader spirit and scope of the invention as set forth in the appended claims.
Those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements. Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures may be implemented which achieve the same functionality.
Any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality may be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality.
Furthermore, those skilled in the art will recognize that the boundaries between the above described operations are merely illustrative. Multiple operations may be combined into a single operation, a single operation may be distributed in additional operations, and operations may be executed at least partially overlapping in time. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.
Also for example, in one embodiment, the illustrated examples may be implemented as circuitry located on a single integrated circuit or within a same device. Alternatively, the examples may be implemented as any number of separate integrated circuits or separate devices interconnected with each other in a suitable manner.
However, other modifications, variations and alternatives are also possible. The specifications and drawings are, accordingly, to be regarded in an illustrative rather than in a restrictive sense.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word ‘comprising’ does not exclude the presence of other elements or steps than those listed in a claim. Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles. Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage.
While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.
It is appreciated that various features of the embodiments of the disclosure which are, for clarity, described in the contexts of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the embodiments of the disclosure which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable sub-combination.
It will be appreciated by persons skilled in the art that the embodiments of the disclosure are not limited by what has been particularly shown and described hereinabove. Rather the scope of the embodiments of the disclosure is defined by the appended claims and equivalents thereof.
This application claims priority from U.S. Provisional Patent Application Ser. No. 63/266,940, filed Jan. 19, 2022, which is incorporated herein in its entirety.