B-trees are frequently used in various storage or database systems as a method and structure of storing data. Such storage systems may include one or more physical or virtual devices, including conventional hard disk drives of a computing device, Storage Area Network (SAN) devices or Virtual Storage Area Network (vSAN) devices. B-trees feature a balancing tree structure where inserted data is sorted during insertion. B-trees typically include a set of nodes each containing one or more key-value pairs. A key is an identifier of data, and a value is either the data itself or a pointer to a location (e.g., in memory or on disk) of the data associated with the identifier. Accordingly, a B-tree may be sorted according to the keys of the key-value pairs, and data can be read/written from the tree based on the key associated with the data. Because B-trees contain sorted key-value pairs, a read operation such as a query (e.g., a request for the value associated with a particular key in a data structure) to a B-tree may be completed by traversing the B-tree relatively quickly to find the desired key-value pair based on the key of the query. Thus, B-trees can be used to store data in a way that improves performance relative to other data structures (such as arrays) for certain operations (e.g., a query to an un-sorted array may take longer than a query to a B-tree).
Bε-trees are a modification of B-trees and are similar in many respects to B-trees. Unlike a B-tree, however, each node of a Bε-tree, except the leaf nodes, includes, in addition to key-value pairs, a buffer. The buffers of a Bε-tree may store operations to be performed on the Bε-tree as messages. For example, a message may indicate a key and a write or update operation (e.g., set the value, add, subtract, multiply, delete, change, etc.) to perform on the value associated with the key in a key-value pair. Accordingly, a message may also be considered a type of key-value pair with the value being the operation to perform on a value of another key-value pair with the same key. The buffers may be used to store messages until a size limit is reached, at which point the messages may be flushed to child nodes and applied to key-value pairs in the child nodes (e.g., by performing a merge or compaction), which may include adding new nodes to the Bε-tree or balancing the Bε-tree by transferring nodes from one subtree to another. The buffers of Bε-trees allow write operations to be performed more quickly relative to standard B-trees, as write operations on existing key-value pairs or insertions of new key-value pairs may not traverse the entire Bε-tree to be applied immediately and may instead be placed in a buffer of any node associated with the key of the message, possibly near the root of the Bε-tree.
When a new message is added to a buffer, the existing contents of the buffer are read and then merged with the new message. The resulting content is then sorted and written back to the buffer. As such, although Bε-trees reduce write amplification in comparison to B-trees, the I/O operations associated with re-writing a buffer, when a new message is added to the buffer, introduce some write amplification. To address this problem, in some cases, an append-only method may be applied to Bε-trees, which involves writing data in all the nodes of the Bε-trees in a sequential manner. Bε-trees using the append-only method may be referred to as append-only Bε-trees. For example, as a result of using the append-only method, buffers are no longer re-written when messages are flushed down. Instead, messages are appended to the buffers in fragments or slots. A slot or a fragment refers to a sorted data structure, such as an array or a B-tree, that is created for storing a batch of one or more messages that is flushed down to a buffer at a certain time. In an append-only Bε-tree, the append-only method is also applied when data is written to the leaf nodes of the Bε-tree.
Although the append-only method reduces write amplification, querying append-only Bε-trees is less efficient and results in more overhead because messages and/or key-value pairs in the nodes are no longer always sorted. In order to enhance the efficiency and reduce the overhead associated with query operations performed on an append-only Bε-tree, one of a number of data structures or filters, such as a bloom filter, may be created, stored in random access memory (RAM), and used for optimizing query operations associated with each of the slots in nodes of the append-only Bε-tree. In some cases, however, the collective size of all filters associated with an append-only Bε-tree may be so large that the filters occupy a significant amount of RAM or, in some other cases, do not even fit in RAM.
Each node of B-tree 100 stores at least one key-value pair. For example, leaf node 130 stores the key-value pair corresponding to the key “55.” The leaf nodes in B-tree 100 each store a single key-value pair but an individual leaf node may store additional key-value pairs. For branch and root nodes of B-tree 100, key-value pairs may store values. Key-value pairs in branch and root nodes may also store pointers to child nodes, which can be used to locate a given key-value pair that is stored in a child node. For example, root node 110 includes two key-value pairs, “20” and “50”. These key-value pairs indicate that key-value pairs with keys less than “20” can be found by accessing branch node 120, key-value pairs with keys greater than “20” but less than “50” can be found by accessing branch node 122, and key-value pairs with keys greater than “50” can be found by accessing branch node 124. Key-value pairs in all nodes of B-tree 100 are sorted based on their keys. For example, a first key-value pair with a first key is stored prior to a second key-value pair with a second key, if the second key is larger than the first key. An example of this is shown in node 122 of B-tree 100, where the key-value pair with key 30 is stored prior to the key-value pair with key 40.
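The routing role of the keys in root node 110 can be illustrated with a minimal Python sketch. The function name route and the list layout are hypothetical and chosen only for illustration; the keys "20" and "50" and the node numerals come from the example above. The sketch assumes that a key equal to one of the root's keys is found in the root itself and therefore never needs routing.

```python
from bisect import bisect_right

# Keys of root node 110 and the child reached for each key range.
pivots = [20, 50]
children = ["branch node 120", "branch node 122", "branch node 124"]

def route(key):
    """Return the child to visit for `key` (hypothetical helper)."""
    # bisect_right counts the pivots that are <= key; that count is the
    # index of the child whose key range contains `key`.
    # Assumption: a key equal to a pivot is resolved in the root itself,
    # so ties are not expected here.
    return children[bisect_right(pivots, key)]
```

For example, route(30) selects branch node 122 because 30 lies between the two keys of the root.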
To maintain the sorted order of key-value pairs in a B-tree, new key-value pairs are inserted into the B-tree in a sorted manner as well. For example, in order for a new key-value pair to be inserted into a node of B-tree 100, all the existing key-value pairs in the node are first read (e.g., into random access memory (RAM)), then the new key-value pair is merged into the existing key-value pairs in a sorted manner, and finally a sorted set of key-value pairs is written back to the node. For example, to insert a key-value pair with key “32” into branch node 122, key-value pairs with keys “30” and “40” are first read from node 122. Subsequently, key-value pairs “30” “32” and “40” are merged and sorted, and then written back to node 122, such that after the insert operation, node 122 would include key-value pairs “30” “32” and “40,” in that order.
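The read, merge, and write-back sequence described above may be sketched in Python as follows. The dictionary node_store is a hypothetical stand-in for on-disk nodes, and insert_sorted is an illustrative name rather than part of any described embodiment. The sketch assumes the inserted key is not already present in the node.

```python
from bisect import insort

def insert_sorted(node_store, node_id, key, value):
    """Insert one key-value pair into a node while keeping it sorted."""
    pairs = list(node_store[node_id])  # read the existing pairs (e.g., into RAM)
    insort(pairs, (key, value))        # merge the new pair at its sorted position
    node_store[node_id] = pairs        # write the full sorted set back

# Node 122 of B-tree 100 with placeholder values.
store = {122: [(30, "v30"), (40, "v40")]}
insert_sorted(store, 122, 32, "v32")
```

After the call, node 122 holds keys 30, 32, and 40 in that order. Note that the whole node is re-written even though only one pair was added.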
Both B-trees and Bε-trees may be subdivided into subtrees. A subtree typically includes part of the complete set of nodes of a tree and includes a subtree root. For example, in B-tree 100, a subtree may be defined with branch node 120 as the subtree root, and include the child nodes of branch node 120.
Bε-tree 150 stores the same data as stored in B-tree 100, in a different format. Bε-tree 150 includes nodes 1-4. Nodes of Bε-tree 150 include, similarly to the nodes of B-tree 100, key-value pairs. For example, data section 164 of node 1 includes the key-value pairs of key “20” and key “50.” Like B-tree 100, these key-value pairs may store data as well as pointers to which child nodes to access in order to find other key-value pairs within Bε-tree 150.
In addition to the key-value pairs, non-leaf nodes of Bε-tree 150 also include a buffer that stores messages. Bε-tree 150 includes, as shown, 4 non-leaf nodes and 5 leaf nodes. Leaf nodes of a Bε-tree do not include a buffer or store messages, unlike other nodes of the Bε-tree.
A buffer of a Bε-tree is typically a data structure capable of ordered or sorted storage. For example, a buffer may be a binary tree structure such as a red-black tree or a B-tree, or an array or a set of nested arrays. Messages generally indicate a write operation to be performed on Bε-tree 150, such as inserting a key-value pair, deleting a key-value pair, or modifying a key-value pair. The message itself may be a key-value pair and have as its key, the key of the key-value pair on which the operation is to be performed, and as its value the operation to be performed on the key-value pair (e.g., add 3 to value, set value to 5, subtract 6 from value, delete key-value pair, insert key-value pair of key X and value Y, etc.). For example, an insert operation may be sent to Bε-tree 150 to add a new key-value pair. Such an insert operation may be added to the buffer of the root node as an insert message. The insert message includes the details of which key-value pair has been added. Messages may also include a timestamp indicating when a message was received at the buffer. In certain embodiments, messages may be ordered in the buffer, such as in an order of arrival into the buffer. In such embodiments, a timestamp may not be needed to indicate when the message was received at the buffer. Further, messages in the buffer may be ordered by key.
Buffer 162 of node 1 includes two insert messages, “insert(35)” and “insert(49)”. The two messages each include a key-value pair, with the key of the key-value pair expressed numerically. For example, insert(35) may correspond to a message with a key 35 and value to insert a key-value pair with key 35 and a corresponding value. At some time, such as when a buffer is filled, messages are flushed from the buffer down to child nodes. In this case, when buffer 162 is full, messages “insert(35)” and “insert(49)” are flushed to node 3. In another example, if buffer 162 included a message with a key that had a value higher than 50, the message would have been flushed to node 4. Also, if buffer 162 included a message with a key that had a value lower than 20, the message would have been flushed to node 2.
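A message of the kind described above may be modeled, purely for illustration, as a small record holding a key and an operation. In the following Python sketch the class name Message, the field names, and the three operation names are assumptions made for the example; timestamps are omitted.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    key: int
    op: str                        # "set", "add", or "delete" (illustrative names)
    operand: Optional[int] = None  # unused for "delete"

def apply_message(pairs, msg):
    """Apply one buffered message to a dict of key-value pairs."""
    if msg.op == "set":
        pairs[msg.key] = msg.operand
    elif msg.op == "add":
        # Assumption: adding to a missing key starts from 0.
        pairs[msg.key] = pairs.get(msg.key, 0) + msg.operand
    elif msg.op == "delete":
        pairs.pop(msg.key, None)
    else:
        raise ValueError(f"unknown operation: {msg.op}")

pairs = {}
apply_message(pairs, Message(key=7, op="set", operand=5))
apply_message(pairs, Message(key=7, op="add", operand=3))
```

Because each message carries its own key, messages can be ordered by key in a buffer in the same way as ordinary key-value pairs.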
Note that, in a Bε-tree, messages are written or flushed into buffers in a sorted manner. For example, to flush message “insert(35)” down to node 3's buffer, a read operation is first performed to read the entire contents of node 3's buffer, including messages “insert(25),” “insert(40),” and “insert(45).” Subsequently, message “insert(35)” is merged into messages “insert(25),” “insert(40),” and “insert(45)” in a sorted manner and, finally, the result is written back to the buffer of node 3. As such, after message “insert(35)” is flushed down to the buffer of node 3, the buffer would include messages “insert(25),” “insert(35),” “insert(40),” and “insert(45),” in that order. In some cases, child nodes may need to be created to accommodate flushing of messages. For example, flushing message “insert(35)” to node 3 may cause the buffer of node 3 to be full. If so, child nodes (e.g., 3 child nodes) may be added to node 3. These new nodes would store the key-value pairs specified in the messages stored in the buffer of node 3.
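The sorted flush described above may be sketched as follows. Messages are represented only by their keys, and flush_sorted is a hypothetical helper name. The sketch assumes the child buffer is already sorted, as stated above.

```python
import heapq

def flush_sorted(child_buffer, incoming):
    """Merge incoming message keys into an already sorted child buffer."""
    # The whole child buffer is read and re-written, which is the source
    # of the write amplification discussed in this description.
    return list(heapq.merge(child_buffer, sorted(incoming)))

node3_buffer = flush_sorted([25, 40, 45], [35])
```

Here node3_buffer holds 25, 35, 40, and 45, matching the order given for the buffer of node 3.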
Utilizing buffers in Bε-trees provides a number of advantages, including, for example, reducing write amplification associated with performing operations on the nodes of the Bε-tree. This is partly because key-value pairs are written in batches. For example, in Bε-tree 150, when messages “insert(25),” “insert(40),” and “insert(45)” in the buffer of node 3 are eventually written to leaf 3, only one re-write of leaf 3 is performed. In contrast, in a B-tree, because key-value pairs are written to leaf nodes one by one, writing three key-value pairs to a leaf node at different times would result in the leaf node being re-written three times, once for each key-value pair. For example, each time a new key-value pair is being inserted into a leaf in a B-tree, the contents of the leaf may be read, then the content is merged and sorted with the new key-value pair, and finally the result is written back to the leaf.
Although Bε-trees reduce write amplification in comparison to B-trees, the I/O operations associated with re-writing a buffer in a sorted manner introduce some new write amplification. As such, in certain aspects, an append-only method may be applied to Bε-trees, which involves writing messages to buffers in a sequential manner. As a result of using the append-only method, buffers are no longer re-written when messages are flushed down. Instead, messages are appended to the buffers in slots or fragments. A slot is a representation of a data structure, such as an array, that is created for storing a batch of one or more messages that is flushed down to a buffer at a certain time. When a subsequent batch of one or more messages is flushed down to the same buffer at a later time, the batch would be written in a separate slot. In an append-only Bε-tree, data is written to leaves of the tree using the same append-only method.
Each buffer of Bε-tree 250 may store a plurality of slots. When the collective size of the slots in each buffer exceeds the size of the buffer, all the slots in the buffer are read, merged and sorted, and then flushed down to one or more child nodes. For example, after messages “insert(35)” and “insert(49)” are added to buffer 262 and stored in slot 266, additional messages may be received at later points in time, resulting in additional slots (not shown) being created in buffer 262. Once the size of all the slots in buffer 262 exceeds a defined size, the slots, including slot 266, are read into RAM, merged and sorted, and then flushed down to one or more of child nodes 2, 3, and 4, etc.
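A buffer made of slots may be sketched as follows. The class AppendOnlyBuffer and its method names are hypothetical, messages are represented only by their keys, and buffer size is assumed to be measured as a message count rather than in bytes.

```python
import heapq

class AppendOnlyBuffer:
    def __init__(self, capacity):
        self.capacity = capacity  # assumed limit, counted in messages
        self.slots = []           # one sorted list per flushed batch

    def append_batch(self, batch):
        # A new slot is written; existing slots are never re-written.
        self.slots.append(sorted(batch))

    def is_over_capacity(self):
        return sum(len(slot) for slot in self.slots) > self.capacity

    def drain_sorted(self):
        """Read all slots, merge them into one sorted run, and empty the buffer."""
        merged = list(heapq.merge(*self.slots))
        self.slots = []
        return merged

buf = AppendOnlyBuffer(capacity=3)
buf.append_batch([49, 35])  # first batch, stored as one slot
buf.append_batch([22, 41])  # later batch, stored as a separate slot
```

Each slot is sorted internally, but the buffer as a whole is not, which is why a lookup may have to consult every slot.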
If additional messages are flushed down to buffer 272 at a later time, a new and third slot would be created for storing such messages. Eventually, when buffer 272 is full, messages of all the slots within buffer 272 may be read, merged and sorted, and written to leaf 3 in an append-only manner. As a result of such an operation, for example, a slot would be created within leaf 3 (not shown) that would include all the key-value pairs corresponding to the messages. As additional messages are applied to leaf 3, additional slots are created and written to leaf 3. Although the application of the append-only method to Bε-trees results in a reduction in write-amplification, querying append-only Bε-trees is less efficient and results in more overhead because messages and/or key-value pairs in the nodes are no longer always sorted in append-only Bε-trees.
As a result, for example, a query operation for a key in a certain buffer may potentially lead to searching all keys in all slots of the buffer. Similarly, a query operation for a key in a certain leaf may potentially lead to searching all keys in all slots of the leaf. As each slot in an append-only Bε-tree may contain a significant amount of information (e.g., gigabytes of data), performing query operations in such a manner is not efficient. As such, in order to enhance the efficiency and reduce the overhead associated with query operations performed on append-only Bε-trees, in certain aspects, one of a number of data structures or filters may be created for each of the slots and stored in memory (e.g., RAM) for use during query operations, as described below. A commonly used example of such filters is the bloom filter, although other examples may include quotient filters, cuckoo filters, or other compact data structures. A bloom filter is a probabilistic data structure that is extremely space efficient and is able to indicate whether a key is possibly in a corresponding slot or whether the key is definitely not in the corresponding slot. In other words, false positives are possible with a bloom filter but false negatives are not.
Although using bloom filters optimizes query operations associated with an append-only Bε-tree, storing the filters in memory may require a significant amount of memory space. In certain cases, bloom filters associated with a large append-only Bε-tree may be so large that storing them in memory is not even possible. Note that a significant portion of the collective size of an append-only Bε-tree's bloom filters corresponds to bloom filters associated with the leaves of the tree. This is because a Bε-tree stores exponentially more data in its leaves than its non-leaf nodes. As such, in certain examples, about 90% of the collective size of a Bε-tree's bloom filters corresponds to the size of the leaves' bloom filters. The large amount of memory space required for storing bloom filters of an append-only Bε-tree in RAM poses a technical problem.
Accordingly, the aspects described herein relate to a new type of Bε-tree (referred to as “the new Bε-tree”) and a method of storing data therein, which provide a technical solution to the technical problem described above. More specifically, for the new Bε-tree, no bloom filters are created for data that is stored in the leaves of the tree. In other words, bloom filters are only created for the non-leaf nodes. This technical solution significantly reduces the size and number of bloom filters as compared to the number of bloom filters used for an append-only Bε-tree. As such, storing bloom filters associated with the new Bε-tree occupies significantly less memory space. Eliminating the leaves' bloom filters, however, reduces the efficiency of query operations performed with respect to the leaves. In other words, because data is stored in the leaves in an append-only manner, querying for a key would result in searching through potentially unsorted data, which is less resource efficient than searching through sorted data. As such, in the new Bε-tree, data is stored in the leaves in a sorted manner. In other words, when messages are flushed down from a node to a leaf, for example, data within the leaf is read into memory, then data that is being flushed down is applied to the leaf in a sorted manner, and finally the result is written back to the leaf. Note that in certain aspects, the new Bε-tree may be stored in storage resources (e.g., disk, SSD, etc.) associated with a computing device (e.g., physical or virtual). Storage resources associated with the computing device include storage resources of the computing device (e.g., local storage resources) and/or storage resources connected to the computing device over a network.
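The behavior of a bloom filter may be illustrated with the following minimal Python sketch. The bit count, the number of hash functions, and the use of SHA-256 to derive bit positions are assumptions made for the example, not parameters taken from this description.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(key))

slot_filter = BloomFilter()
for key in (35, 49):
    slot_filter.add(key)
```

A key that was added always yields True, so false negatives cannot occur. A key that was never added usually yields False, but may yield True when its bit positions happen to collide with those of added keys.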
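Applying flushed messages to a leaf in a sorted manner may be sketched as follows. The function apply_to_leaf and the (operation, key, value) tuple layout are hypothetical, and only insert and delete messages are modeled.

```python
def apply_to_leaf(leaf_pairs, messages):
    """Return the new sorted contents of a leaf after applying messages."""
    contents = dict(leaf_pairs)          # read the leaf into memory
    for op, key, value in messages:      # apply messages in arrival order
        if op == "insert":
            contents[key] = value
        elif op == "delete":
            contents.pop(key, None)
    return sorted(contents.items())      # write back, sorted by key

leaf3 = apply_to_leaf(
    [(25, "a"), (45, "c")],
    [("insert", 40, "b"), ("delete", 25, None)],
)
```

Because the leaf stays sorted after every flush, a later lookup in the leaf does not need a per-slot filter to avoid scanning unsorted data.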
As shown, at block 510, an operation to perform on the Bε-tree is received. Apart from queries, operations performed on a tree typically include write operations, such as insertions, modifications, or deletions of data in the Bε-tree.
At block 520, the operation is stored as a message in the root node's buffer in an append-only manner. More specifically, the message may be part of a batch of one or more messages that is stored in a slot in the root node's buffer. When a subsequent batch of one or more messages is received at a later time, the subsequent batch is written in a separate slot of the root node's buffer. For each slot a filter, such as a bloom filter, is created to optimize query operations associated with the slot. Note that a current node for operations 500 is set as the root node.
At block 530, it is determined if a buffer of a current node of the Bε-tree is full. In certain aspects, the buffer of a node (e.g., root node or child node) is considered to be full when the buffer does not have enough storage space allocated for a new slot that comprises a newly received batch of one or more messages. If the current node buffer is full, operations 500 proceed to block 540. If the current node buffer is not full, operations 500 end.
At block 540, it is determined whether the message in a slot of the current node buffer is to be flushed to a non-leaf child node or a leaf. In one example, such a determination may be made on whether the current node has a non-leaf child node or not. If it is determined that the message is to be flushed down to a buffer of a non-leaf child node, the message is flushed down to that buffer and stored in an append-only manner, as described in relation to block 550. If it is determined that the message is to be applied to a leaf, the message is applied to the leaf in a sorted manner, as described in relation to block 560.
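The decisions made at blocks 520 through 560 of operations 500 may be sketched in Python as follows. The Node class, the write function, and the capacity measured as a message count are assumptions made for illustration. Only insert messages are modeled, filters are omitted, and the buffer is treated as full once its slots exceed the capacity after an append, which simplifies the fullness test described for block 530.

```python
from bisect import bisect_right

class Node:
    def __init__(self, pivots=None, children=None, capacity=4):
        self.pivots = pivots or []      # keys separating the children
        self.children = children or []  # no children means this node is a leaf
        self.slots = []                 # non-leaf only: append-only batches
        self.pairs = []                 # leaf only: sorted (key, value) pairs
        self.capacity = capacity        # assumed limit, counted in messages

def write(node, batch):
    """Store a batch of (key, value) insert messages at `node`."""
    if not node.children:
        # Block 560: read the leaf, apply the messages, write back sorted.
        contents = dict(node.pairs)
        contents.update(batch)
        node.pairs = sorted(contents.items())
        return
    # Blocks 520 and 550: append the batch as a new slot.
    node.slots.append(list(batch))
    # Block 530: stop unless the buffer is over its limit.
    if sum(len(slot) for slot in node.slots) <= node.capacity:
        return
    # Read every slot, then merge and sort by key. The sort is stable, so
    # arrival order is kept for repeated keys.
    pending = sorted((m for slot in node.slots for m in slot), key=lambda m: m[0])
    node.slots = []
    # Block 540: group messages by destination child and flush each group.
    groups = {}
    for key, value in pending:
        groups.setdefault(bisect_right(node.pivots, key), []).append((key, value))
    for index, group in groups.items():
        write(node.children[index], group)

leaf_a, leaf_b, leaf_c = Node(), Node(), Node()
root = Node(pivots=[20, 50], children=[leaf_a, leaf_b, leaf_c], capacity=2)
write(root, [(35, "a"), (49, "b")])  # fits in the root buffer as one slot
write(root, [(10, "c")])             # exceeds the limit and triggers a flush
```

In this two-level example the root's children are leaves, so the flush takes the sorted path of block 560. With a non-leaf child, the same call would append a slot to the child's buffer instead.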
At block 550, the message is flushed to a buffer of a non-leaf child node in an append-only manner. More specifically, all the slots in the current node buffer, including the slot that stores the message, are read into RAM. All the messages in the slots are then merged and sorted. Subsequently, the message, which is now one of a number of sorted messages, is flushed down to a buffer of a non-leaf child node in an append-only manner. An example of this was described above with reference to messages “insert(35)” and “insert(49)” in
At block 560, the message in the current node buffer is flushed to a leaf of the Bε-tree in a sorted manner. Flushing the message to the leaf node in a sorted manner includes reading the contents (e.g., key-value pairs) of the leaf into RAM, applying the message to the contents by performing the operation indicated by the message, and then merging, sorting, and writing back the contents to the leaf. The operation of block 560 may be illustrated with reference to node 3 of
In
Using the method described herein for storing data in a Bε-tree results in the new Bε-tree, which stores data in its leaves in a sorted manner, while data within its non-leaf nodes is stored in a sequential and append-only manner, potentially resulting in the non-leaf nodes comprising unsorted data. Also, each of the slots within buffers of the non-leaf nodes of the new Bε-tree has a filter that is stored in RAM for query operations. As such, when a query operation is received, the filters associated with the slots in the non-leaf nodes of the new Bε-tree are first examined. If one or more of the filters indicate that the key associated with the query operation is in one or more corresponding slots, then disk operations are performed to search the one or more slots. If none of the filters indicates that the key is in its corresponding slot, the leaves of the new Bε-tree are searched, using one or more search algorithms such as a sequential search algorithm. However, because the leaves of the new Bε-tree are sorted, query operations performed on the leaves are more efficient and result in less overhead. Note that although the aspects herein were described with reference to Bε-trees, other types of write-optimized key-value data structures that use fragmentation may also be optimized using the aspects described herein. Fragmentation refers to a technique of sequentially writing (e.g., append-only technique) data in nodes of a write-optimized key-value data structure in fragments or slots.
Database 610 may include any suitable non-volatile data store for organizing and storing data from the multiple data sources 620. For example, in some embodiments, database 610 may be implemented as software-defined storage such as VMware vSAN that clusters together server-attached hard disk drives and/or solid state drives (HDDs and/or SSDs), to create a flash-optimized, highly resilient shared datastore designed for virtual environments. In some embodiments, database 610 may be implemented as one or more storage devices, for example, one or more hard disk drives, flash memory modules, solid state disks, and optical disks (e.g., in a computing device, server, etc.). In some embodiments, database 610 may include a shared storage system having one or more storage arrays of any type such as a network-attached storage (NAS) or a block-based device over a storage area network (SAN). Database 610 may store data from one or more data sources 620 in a Bε-tree structure as discussed.
Each data source 620 may correspond to one or more physical devices (e.g., servers, computing devices, etc.) or virtual devices (e.g., virtual computing instances, containers, virtual machines (VMs), etc.). For example, a physical device may include hardware such as one or more central processing units, memory, storage, and physical network interface controllers (PNICs). A virtual device may be a device that represents a complete system with processors, memory, networking, storage, and/or BIOS, that runs on a physical device. For example, the physical device may execute a virtualization layer that abstracts processor, memory, storage, and/or networking resources of the physical device into one or more virtual devices. Each data source 620 may generate data that is loaded into database 610. Each data source 620 may request operations to be performed on the database 610, such as a range lookup as discussed.
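The query path described above may be sketched as follows. The function query is hypothetical. Each filter is modeled as any object supporting the in operator (a Python set stands in for a bloom filter here), each slot is modeled as a dictionary standing in for a disk read, and only insert messages are considered. The sketch uses a binary search on the sorted leaf in place of the sequential search mentioned above, which is one possible choice once the leaf is sorted.

```python
from bisect import bisect_left

def query(key, slots, leaf):
    """Look up `key` in buffered slots first, then in a sorted leaf.

    slots: list of (slot_filter, slot_messages) pairs, newest first.
    leaf:  sorted list of (key, value) pairs.
    """
    for slot_filter, slot_messages in slots:
        if key in slot_filter:          # filter says "possibly present"
            if key in slot_messages:    # stands in for reading the slot from disk
                return slot_messages[key]
            # Otherwise the filter gave a false positive; keep looking.
    # Binary search over the sorted leaf; (key,) sorts just before (key, value).
    i = bisect_left(leaf, (key,))
    if i < len(leaf) and leaf[i][0] == key:
        return leaf[i][1]
    return None

slots = [
    ({35, 49}, {35: "x", 49: "y"}),
    ({7, 99}, {7: "z"}),               # 99 models a false positive
]
leaf = [(25, "a"), (40, "b"), (99, "c")]
```

With these inputs, key 35 is answered from a slot, key 99 survives a false positive and is answered from the leaf, and key 1 is reported as absent.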
Each database management service 630 may be a process or application executing on one or more physical devices or virtual devices. In certain embodiments, a database management service 630 may execute on the same device as a data source 620. In certain embodiments, a database management service 630 may execute on a separate device from the data source 620. A database management service 630 may be an automatic service, or a manual service (e.g., directed by a human).
Each database management service 630 may be coupled (e.g., via a network, as running on the same device, etc.) to one or more data sources 620 and to the database 610 (e.g., via a network). Further, each database management service 630 may be configured to generate and perform operations on database 610. For example, a given data source 620 may send an operation to a database management service 630 to be performed on database 610. The operation may be one of a write, update (e.g., set the value, add, subtract, multiply, delete, change, etc.), or query operation. Database 610 may be implemented as a Bε-tree (the new Bε-tree) as discussed above. A database management service 630 may utilize the techniques and aspects described herein (e.g., operation 500 of
The various embodiments described herein may employ various computer-implemented operations involving data stored in computer systems. For example, these operations may require physical manipulation of physical quantities. Usually, though not necessarily, these quantities may take the form of electrical or magnetic signals, where they or representations of them are capable of being stored, transferred, combined, compared, or otherwise manipulated. Further, such manipulations are often referred to in terms, such as producing, identifying, determining, or comparing. Any operations described herein that form part of one or more embodiments of the invention may be useful machine operations. In addition, one or more embodiments of the invention also relate to a device or an apparatus for performing these operations. The apparatus may be specially constructed for specific required purposes, or it may be a general purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
The various embodiments described herein may be practiced with other computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
One or more embodiments of the present invention may be implemented as one or more computer programs or as one or more computer program modules embodied in one or more computer readable media. The term computer readable medium refers to any data storage device that can store data which can thereafter be input to a computer system. Computer readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer. Examples of a computer readable medium include a hard drive, network attached storage (NAS), read-only memory, random-access memory, a flash memory device, an NVMe device, a non-volatile memory device, a CD (Compact Disc) such as a CD-ROM, a CD-R, or a CD-RW, a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over a network coupled computer system so that the computer readable code is stored and executed in a distributed fashion.
Although one or more embodiments of the present invention have been described in some detail for clarity of understanding, it will be apparent that certain changes and modifications may be made within the scope of the claims. Accordingly, the described embodiments are to be considered as illustrative and not restrictive, and the scope of the claims is not to be limited to details given herein, but may be modified within the scope and equivalents of the claims. In the claims, elements and/or steps do not imply any particular order of operation, unless explicitly stated in the claims.
Virtualization systems in accordance with the various embodiments may be implemented as hosted embodiments, non-hosted embodiments, or as embodiments that tend to blur distinctions between the two; all are envisioned. Furthermore, various virtualization operations may be wholly or partially implemented in hardware. For example, a hardware implementation may employ a look-up table for modification of storage access requests to secure non-disk data.
Many variations, modifications, additions, and improvements are possible, regardless of the degree of virtualization. The virtualization software can therefore include components of a host, console, or guest operating system that performs virtualization functions. Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the appended claim(s).