Data storage systems are arrangements of hardware and software in which storage processors are coupled to arrays of non-volatile storage devices, such as magnetic disk drives, electronic flash drives, and/or optical drives. The storage processors, also referred to herein as “nodes,” service storage requests arriving from host machines (“hosts”), which specify blocks, files, and/or other data elements to be written, read, created, deleted, and so forth. Software running on the nodes manages incoming storage requests and performs various data processing tasks to organize and secure the data elements on the non-volatile storage devices.
A need often arises in storage systems to make small but frequent changes in data elements, such as certain kinds of metadata. Each change may affect only a small number of bytes, but writes to metadata are typically performed on a per-page basis, where a page may contain multiple kilobytes. Thus, writes of small metadata changes can cause significant write amplification, which can be detrimental to flash drives.
Various solutions have been developed for managing large numbers of small but frequent writes, so as to prevent excessive wear in flash drives and to promote write amortization. One solution provides numerous in-memory, sorted buckets that accumulate incremental changes in metadata pages. Once filled, a current set of in-memory buckets may be written to disk and a new set of in-memory buckets may be created, thus forming multiple generations of bucket sets corresponding to respective time ranges. Bucket sets may be queried to identify changes in a particular metadata page, and query results may be merged to construct an up-to-date version of that metadata page.
Some implementations use a transaction log in connection with the above-described buckets. The transaction log contains a time-ordered record of changes made in metadata pages. The metadata changes (deltas) are accumulated both in the in-memory buckets and in the time-ordered transaction log. A rebuild of metadata pages missing from cache will retrieve deltas from in-memory buckets. The transaction log is typically used only if a certain time range is missing from the buckets, e.g., in the event of node failure that may have occurred during that time range.
Unfortunately, the above-described arrangement of buckets and transaction logs is not well-suited to all types of metadata. For example, metadata pages typically have a fixed layout in which each page is divided into a preset number of fixed-size entries. To create a delta record in the transaction log that corresponds to a metadata change in a page, one can identify the location of the change merely by providing a logical identifier (LI) of the page and an entry identifier (EI) of the specific entry being changed within the page. The above-described arrangement breaks down, however, in cases where metadata changes do not fit into fixed-size regions provided for entries. For example, it has been proposed to use buckets and transaction logs for tracking changes in key-value data, but the keys and/or values of such key-value data can have variable length and do not always fit within the fixed-length spaces provided. What is needed, therefore, is a more flexible layout for metadata pages that allows for variable-length metadata while preserving the ability to translate between entries in bucket pages and delta records in the transaction log.
The above need is addressed at least in part by an improved technique for managing metadata of variable length. The technique includes responding to the creation or change in a metadata element by creating at least first and second entries within a metadata page at discontinuous locations. The first entry is located among a first set of regions having uniform length and includes a reference to the second entry, which is located among a second set of regions having variable length. In this manner, the metadata element that does not fit within a fixed-size space is accommodated by multiple discontinuous entries in respective sets of regions.
Advantageously, the improved technique accommodates metadata elements that do not fit within fixed-size spaces. The technique also enables consistency to be maintained between metadata pages and delta records.
Certain embodiments are directed to a method of managing metadata of variable length. The method includes providing a metadata element and creating a first entry for the metadata element in a metadata page, the first entry located within a first region of a first plurality of regions of the metadata page, the regions of the first plurality of regions having uniform length. The method further includes creating a second entry for the metadata element in the metadata page, the second entry located within a second region of a second plurality of regions of the metadata page, the regions of the second plurality of regions having variable length, wherein the first entry located within the first plurality of regions includes a reference to a location of the second entry within the second plurality of regions.
Other embodiments are directed to a computerized apparatus constructed and arranged to perform a method of managing metadata of variable length, such as the method described above. Still other embodiments are directed to a computer program product. The computer program product stores instructions which, when executed by control circuitry of a computerized apparatus, cause the computerized apparatus to perform a method of managing metadata of variable length, such as the method described above.
The foregoing summary is presented for illustrative purposes to assist the reader in readily grasping example features presented herein; however, this summary is not intended to set forth required elements or to limit embodiments hereof in any way. One should appreciate that the above-described features can be combined in any manner that makes technological sense, and that all such combinations are intended to be disclosed herein, regardless of whether such combinations are identified explicitly or not.
The foregoing and other features and advantages will be apparent from the following description of particular embodiments, as illustrated in the accompanying drawings, in which like reference characters refer to the same or similar parts throughout the different views.
Embodiments of the improved technique will now be described. One should appreciate that such embodiments are provided by way of example to illustrate certain features and principles but are not intended to be limiting.
An improved technique for managing metadata of variable length includes responding to the creation or change in a metadata element by creating at least first and second entries within a metadata page at discontinuous locations. The first entry is located among a first set of regions having uniform length and includes a reference to the second entry, which is located among a second set of regions having variable length. In this manner, the metadata element that does not fit within a fixed-size space is accommodated by multiple discontinuous entries in respective sets of regions.
The network 114 may be any type of network or combination of networks, such as a storage area network (SAN), a local area network (LAN), a wide area network (WAN), the Internet, and/or some other type of network or combination of networks, for example. In cases where hosts 110 are provided, such hosts 110 may connect to the node 120 using various technologies, such as Fibre Channel, iSCSI (Internet small computer system interface), NVMeOF (Nonvolatile Memory Express (NVMe) over Fabrics), NFS (network file system), and CIFS (common Internet file system), for example. As is known, Fibre Channel, iSCSI, and NVMeOF are block-based protocols, whereas NFS and CIFS are file-based protocols. The node 120 is configured to receive I/O requests 112 according to block-based and/or file-based protocols and to respond to such I/O requests 112 by reading or writing the storage 180.
The depiction of node 120a is intended to be representative of all nodes 120. As shown, node 120a includes one or more communication interfaces 122, a set of processors 124, and memory 130. The communication interfaces 122 include, for example, SCSI target adapters and/or network interface adapters for converting electronic and/or optical signals received over the network 114 to electronic form for use by the node 120a. The set of processors 124 includes one or more processing chips and/or assemblies, such as numerous multi-core CPUs (central processing units). The memory 130 includes both volatile memory, e.g., RAM (Random Access Memory), and non-volatile memory, such as one or more ROMs (Read-Only Memories), disk drives, solid state drives, and the like. The set of processors 124 and the memory 130 together form control circuitry, which is constructed and arranged to carry out various methods and functions as described herein. Also, the memory 130 includes a variety of software constructs realized in the form of executable instructions. When the executable instructions are run by the set of processors 124, the set of processors 124 is made to carry out the operations of the software constructs. Although certain software constructs are specifically shown and described, it is understood that the memory 130 typically includes many other software components, which are not shown, such as an operating system, various applications, processes, and daemons.
As further shown in
Cache 140 is configured to store cache pages 142, which may include cached versions of metadata pages persisted in storage 180. The cache pages 142 may include pages that store key-value metadata. Such pages may have recently been read from storage 180, for example.
The KVE 150 is configured to manage a key-value store, e.g., a NoSQL database of K-V pairs in which keys are associated with corresponding values. In one example, the KVE 150 stores metadata that keeps track of space accounting in the data storage system 116. Such space-accounting metadata may be updated frequently in small increments, such as every time a write or delete of data is performed. The KVE 150 may store any type of data or metadata, however, which may include system management information, for example.
Buckets 160 store various generations 162 of metadata changes made to key-value data. For example, generation 162a may indicate a current generation of metadata changes, whereas generations 162b, 162c, and so on, may indicate successively previous generations of metadata changes. Once the buckets 160 in the current generation 162a are filled, for example, the current generation 162a is relabeled as the immediately previous generation and a new current generation is created. In this manner, buckets 160 are kept to manageable sizes. In some examples, the current generation 162a is stored in volatile memory, such as cache 140, rather than in non-volatile memory, as shown.
Buckets 160 may be provided for respective metadata pages. For example, bucket LI-0 may correspond to a metadata page with logical index 0, bucket LI-1 may correspond to a metadata page with logical index 1, and so on, up to LI-N, which may correspond to a metadata page with logical index N. Here, the logical index (LI) uniquely identifies the metadata page.
Each bucket 160 may include any number of incremental page updates to a respective page. For example, bucket LI-0 may include multiple incremental updates of the page at logical index 0. Incremental updates to a page may be arranged within a bucket as respective nodes of a B-tree (not shown), for example.
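The generation handling described above can be sketched as follows. This is a minimal illustration, not the described system: the class name `BucketGenerations`, the `capacity` threshold, and the use of plain lists in place of B-tree nodes are all assumptions made for brevity.

```python
from collections import defaultdict


class BucketGenerations:
    """Toy model of per-page buckets grouped into generations."""

    def __init__(self, capacity: int):
        self.capacity = capacity                 # assumed fill threshold per generation
        self.generations = [defaultdict(list)]   # index 0 is the current generation
        self.filled = 0                          # updates held by the current generation

    def add(self, li: int, update: bytes) -> None:
        if self.filled == self.capacity:
            # The current generation becomes the immediately previous one,
            # and a new, empty current generation is created.
            self.generations.insert(0, defaultdict(list))
            self.filled = 0
        self.generations[0][li].append(update)
        self.filled += 1

    def updates_for(self, li: int) -> list:
        # Walk from the oldest generation to the current one so that
        # later updates follow earlier ones in the result.
        out = []
        for gen in reversed(self.generations):
            out.extend(gen.get(li, []))
        return out


buckets = BucketGenerations(capacity=2)
buckets.add(0, b"a")
buckets.add(1, b"b")
buckets.add(0, b"c")   # triggers creation of a new current generation
```

Here bucket LI-0 ends up with one update in the previous generation and one in the current generation, and `updates_for(0)` returns them in time order.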
In an example, a metadata page in a bucket 160 may be configured to store multiple K-V pairs managed by KVE 150. It is thus possible that different incremental updates of a metadata page contained in a bucket may correspond to different K-V pairs and/or to multiple updates of the same K-V pair or pairs.
The K-V log 170 is a persistent log that stores and organizes delta records 172 in a time-ordered manner. The delta records 172 store respective updates of metadata pages containing key-value data. For example, each delta record 172 corresponds to a respective instance of a page within a bucket 160. In some examples, multiple delta records 172 correspond to the same instance of a page, e.g., to multiple changes made in the same page.
In an example, delta records 172 are realized using respective sets of elements, such as tuples. For instance, each delta record 172 may be realized as a 4-tuple {LI; EI; T; P}, in which the elements are defined as follows:
In example operation, the hosts 110 issue I/O requests 112 to the data storage system 116. Node 120a receives the I/O requests 112 at the communication interfaces 122 and initiates further processing. For I/O requests 112 that specify changes that affect space accounting, such as writes and deletes, the KVE 150 may update one or more K-V pairs that track such changes. For example, the KVE 150 may identify a change 152 in a particular K-V pair. For the current example, we assume that the K-V pair being changed can be found in the page corresponding to LI-0. As indicated by arrow 154a, the KVE 150 may write the change to a page 142a in cache 140. As indicated by arrow 154b, the KVE 150 may also update the bucket for LI-0 in the current generation 162a of buckets, e.g., by creating a new item in bucket LI-0 that implements the change 152. As indicated by arrow 154c, the node 120a may create a delta record that corresponds to the new item added to bucket LI-0. Here, node 120a creates Delta Record 0, e.g., by forming a 4-tuple that identifies the change based on values of LI, EI, T, and P.
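As a minimal sketch, a delta record of the form {LI; EI; T; P} might be modeled as shown below. The class name `DeltaRecord`, the field names, and the helper `make_delta_record` are hypothetical; the sketch relies only on LI identifying the page, EI identifying the entry, T being a size, and P being a payload.

```python
from typing import NamedTuple


class DeltaRecord(NamedTuple):
    """Hypothetical model of the 4-tuple {LI; EI; T; P}."""
    li: int     # logical index of the metadata page
    ei: int     # entry index within that page
    t: int      # size of the payload, in bytes
    p: bytes    # payload: the bytes of the changed entry


def make_delta_record(li: int, ei: int, payload: bytes) -> DeltaRecord:
    # T is derived from the payload so the two cannot disagree.
    return DeltaRecord(li=li, ei=ei, t=len(payload), p=payload)


record = make_delta_record(li=0, ei=3, payload=b"\x01\x02\x03\x04")
```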
An example purpose of creating Delta Record 0 is to support recovery in the event that a cached version 142a of the page cannot be found. For example, if page 142a is no longer found in cache 140, then the node 120a may reconstruct page updates in bucket LI-0, e.g., as shown by arrow 156a. Thus, one should appreciate that proper management of K-V data involves both (1) translating a change in an instance of a metadata page to a corresponding delta record 172 and (2) translating a change indicated by a delta record 172 to a corresponding item in a bucket. Note that node 120a may also recover the K-V change 152 (arrow 156b) if the K-V change 152 is located in volatile memory.
Page 210b includes both a first plurality of regions 430 for the first domain 420 of entry indices and a second plurality of regions 432 for the second domain 422 of entry indices. Each region is configured to store a respective entry having a respective entry index. The first plurality of regions 430 includes regions 430-0 through 430-n, and the second plurality of regions 432 includes regions 432-0 through 432-n. The regions in the first plurality of regions 430 are uniform in length, like the regions 220 shown in
As indicated, the page 210b is configured with first and second domains 420 and 422 of entry indices. The entry indices in the first domain 420 are discontinuous with the entry indices in the second domain 422. For example, the entry indices in the first domain 420 in the first plurality of regions 430 range from 0 to n, whereas the entry indices in the second domain 422 in the second plurality of regions 432 range from C to C+n, where C is a constant integer. Each K-V pair in the page 210b may be stored using a first entry in the first domain 420 and a second entry in the second domain 422, where the first and second entry indices are separated by the constant C. In an example, entries in the second domain 422 are laid out in reverse order, with the lowest-index entry placed closest to the footer and the higher-index entries placed progressively farther away.
In an example, the constant C is selected to be large enough to ensure that there is a discontinuity between the entry indices in the first domain 420 and the entry indices in the second domain 422. For example, C may be assigned based on a length L of the page 210b, such as its length in bytes. Other determinants of C may include the expected sizes of entries, particularly their minimum sizes. A suitable value of C may be L/2, which is sufficient to maintain the desired gap between the two domains, assuming each entry is at least one byte in length.
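The relationship between the two domains of entry indices can be illustrated with a short sketch. The page length of 4096 bytes and the function names are assumptions chosen for illustration; only the choice C = L/2 comes from the example above.

```python
PAGE_LENGTH = 4096        # assumed page length L, in bytes
C = PAGE_LENGTH // 2      # example value C = L/2


def second_index(first_index: int) -> int:
    """Entry index of the second entry paired with a first-domain entry."""
    return first_index + C


def in_first_domain(ei: int) -> bool:
    """True for first-domain indices (0..n), False for second-domain indices (C..C+n)."""
    return ei < C
```

One way to read the choice of L/2: if every K-V pair consumes one entry in each domain and each entry occupies at least one byte, a page of length L can hold at most L/2 pairs, so the largest first-domain index stays below C.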
As further shown in
Given that storing K-V pairs having fixed-length keys and variable-length values entails the creation of both first and second entries in the page 210b, generating corresponding changes in the K-V log 170 involves creating two delta records 172, i.e., one delta record for each entry.
For example, at 510 the node 120a obtains the LI for the first delta record from the header of page 210b. At 512, node 120a sets the EI simply as “X”, i.e., the ordinal position of the Xth K-V pair in page 210b. At 514, the size T is set to the sum of sizes of KX and ValOffsetX. As regions in the first plurality of regions 430 are uniform in size, T may be the same for all entries in the first plurality of regions 430. At 516, node 120a obtains the payload P as KX+ValOffsetX (e.g., with “+” denoting concatenation). At 518, the desired tuple describing the first delta record is assembled based on the determined values of LI, EI, T, and P.
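Acts 510 through 518 might be sketched as follows. The 8-byte key length, the 2-byte little-endian offset encoding, and the function name are assumptions; the text does not specify field widths or byte order.

```python
import struct

KEY_SIZE = 8          # assumed fixed key length, in bytes
OFFSET_FMT = "<H"     # assumed 2-byte little-endian encoding of ValOffsetX


def first_delta_record_210b(li: int, x: int, key: bytes, val_offset: int):
    """Return the tuple (LI, EI, T, P) for the X-th fixed-length entry."""
    if len(key) != KEY_SIZE:
        raise ValueError("key must have the fixed length")
    payload = key + struct.pack(OFFSET_FMT, val_offset)   # P = KX + ValOffsetX
    size = KEY_SIZE + struct.calcsize(OFFSET_FMT)          # T, identical for every X
    return (li, x, size, payload)


rec = first_delta_record_210b(li=0, x=2, key=b"ABCDEFGH", val_offset=4000)
```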
The first and second delta records may then be written to the K-V log 170 as related delta records. For example, the first and second delta records may be written to the K-V log 170 adjacently, as part of a single transaction, or in some other way that identifies the two delta records as related. Given that the first delta record encodes the first entry, which includes a reference to the second entry, it is preferable to store the first and second delta records in an order that enables the first entry to be rebuilt prior to the second entry, so that the location of the second entry can be determined based on the reference contained in the first entry. Such writing generally entails writing the first delta record first and the second delta record second, although other arrangements are possible. One should appreciate that the acts of methods 500a and 500b may be carried out in any suitable order.
The first domain 620 of entry indices is provided within a first plurality of regions 630, and the second domain 622 of entry indices is provided within a second plurality of regions 632. One entry may be provided in each region.
As shown, each entry of the first domain 620 in the first plurality of regions 630 may include a key offset (KOffsetX) of a key of a K-V pair KXVX and a corresponding value (VX). The sizes of KOffsetX and VX may each be uniform and thus known without having to provide any extra metadata. The corresponding entry in the second domain 622 in the second plurality of regions 632 may include a key size (KSZX) of the key of KXVX and the key itself, KX. The key offset KOffsetX in the first entry points to key size KSZX in the second entry.
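The pairing of a fixed-length first entry with a variable-length second entry can be sketched as below. The field widths (2-byte KOffsetX, 4-byte VX, 1-byte KSZX), the omission of the page header and footer, and the function names are illustrative assumptions only.

```python
import struct

VALUE_SIZE = 4                          # assumed fixed value length
FIRST_FMT = f"<H{VALUE_SIZE}s"          # KOffsetX (2 bytes) followed by VX
FIRST_SIZE = struct.calcsize(FIRST_FMT)


def write_pair(page: bytearray, x: int, key: bytes, value: bytes, key_offset: int) -> None:
    # Second entry: KSZX followed by KX, at a variable-length location.
    page[key_offset] = len(key)
    page[key_offset + 1:key_offset + 1 + len(key)] = key
    # First entry: KOffsetX followed by VX, in the X-th uniform region.
    struct.pack_into(FIRST_FMT, page, x * FIRST_SIZE, key_offset, value)


def read_pair(page: bytearray, x: int):
    key_offset, value = struct.unpack_from(FIRST_FMT, page, x * FIRST_SIZE)
    key_size = page[key_offset]          # KOffsetX points at KSZX
    key = bytes(page[key_offset + 1:key_offset + 1 + key_size])
    return key, value


page = bytearray(64)
write_pair(page, 1, b"longer-key", b"\x00\x00\x00\x07", key_offset=40)
```

Reading entry 1 back with `read_pair(page, 1)` follows KOffsetX to KSZX and recovers the key without any size information in the first entry.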
Given that both the first and second entries are created in page 210b to accommodate K-V pairs having variable-length keys and fixed-length values, generating corresponding changes in the K-V log 170 involves creating two delta records 172, i.e., one delta record for each entry.
At 710, the node 120a obtains the LI for the first delta record from the header of page 210c. At 712, node 120a sets the EI simply as “X”, i.e., the ordinal position of the Xth K-V pair in page 210c. At 714, the size T is set to the sum of the sizes of KOffsetX and VX. As the regions in the first plurality of regions 630 are uniform in size, T may be the same for all entries in the first plurality of regions 630. At 716, node 120a obtains the payload P as KOffsetX+VX. At 718, the desired tuple describing the first delta record is assembled based on the determined values of LI, EI, T, and P.
The first domain 820 of entry indices is provided within a first plurality of regions 830, and the second domain 822 of entry indices is provided within a second plurality of regions 832. One entry may be provided in each region.
As shown, a first entry in the first domain 820 may include an offset (OffsetX) that points to a corresponding K-V pair KXVX stored in a second entry in the second domain 822. The second entry may include a size of the key, KSZX, the key itself, KX, a size of the corresponding value, VSZX, and the value itself, VX.
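A second entry of this form is self-describing, which the following sketch illustrates. The 2-byte little-endian size fields and the function names are assumptions; the text states only the order KSZX, KX, VSZX, VX.

```python
import struct


def encode_second_entry(key: bytes, value: bytes) -> bytes:
    """Lay out KSZX | KX | VSZX | VX, using assumed 2-byte size fields."""
    return (struct.pack("<H", len(key)) + key +
            struct.pack("<H", len(value)) + value)


def decode_second_entry(page: bytes, offset: int):
    """Follow OffsetX from the first entry and return (KX, VX)."""
    (ksz,) = struct.unpack_from("<H", page, offset)
    key = page[offset + 2:offset + 2 + ksz]
    (vsz,) = struct.unpack_from("<H", page, offset + 2 + ksz)
    start = offset + 4 + ksz
    return bytes(key), bytes(page[start:start + vsz])


entry = encode_second_entry(b"k1", b"value-one")
page = bytes(10) + entry + bytes(5)   # entry placed at offset 10 of a toy page
```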
Given the arrangement of page 210d, generating corresponding changes in the K-V log 170 involves creating two delta records 172, i.e., one delta record for the first entry and another delta record for the second entry.
At 910, the node 120a obtains the LI for the first delta record from the header of page 210d. At 912, node 120a sets the EI simply as “X”, i.e., the ordinal position of the Xth K-V pair in page 210d. At 914, the size T is set to the size of OffsetX. As regions in the first plurality of regions 830 are uniform in size, T may be the same for all entries in the first plurality of regions 830. At 916, node 120a obtains the payload P as OffsetX. At 918, the desired tuple describing the first delta record is assembled based on the determined values of LI, EI, T, and P.
The arrangement of page 210d may be regarded as a single-reference example, as the entries in the first domain 820 have only a single reference to the respective keys and values in the second domain 822. However, a double-reference approach may also be used.
The first domain 1020 of entry indices is provided within a first plurality of regions 1030, and the second domain 1022 of entry indices is provided within a second plurality of regions 1032. One entry may be provided per region.
As shown, a first entry in the first domain 1020 may include both a key offset (KOffX) and a value offset (VOffX), which point, respectively, to a key KX and a value VX found in a corresponding second entry in the second domain 1022. As separate offsets are provided for keys and values, no separate size information about keys or values is needed (sizes may be inferred from offsets).
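How sizes may be inferred from offsets depends on layout details that are not spelled out here. The sketch below assumes that each second entry stores the key immediately before the value, that second entries are packed contiguously, and that the entry with the lowest index ends where the footer begins. All three are assumptions made for illustration, as is the function name.

```python
def infer_sizes(first_entries, x: int, footer_start: int):
    """Return (key size, value size) for the X-th K-V pair.

    first_entries: list of (KOffX, VOffX) tuples in ordinal order.
    Assumes key-then-value layout and contiguous packing that starts
    at the footer and proceeds toward lower addresses.
    """
    k_off, v_off = first_entries[x]
    # The value ends where the previous (lower-index) second entry begins,
    # or at the footer for the entry with index 0.
    end = footer_start if x == 0 else first_entries[x - 1][0]
    return v_off - k_off, end - v_off


# Entry 0 holds a 3-byte key and 5-byte value; entry 1 holds 4 and 10 bytes.
entries = [(92, 95), (78, 82)]
```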
Given the arrangement of page 210e, generating changes in the K-V log 170 involves creating two delta records 172, i.e., one delta record for the first entry and another delta record for the second entry.
At 1110, node 120a obtains the LI for the first delta record from the header of page 210e. At 1112, node 120a sets the EI simply as “X”, i.e., the ordinal position of the Xth K-V pair in page 210e. At 1114, the size T is set to the sum of the sizes of KOffX and VOffX. As regions in the first plurality of regions 1030 are uniform in size, T may be the same for all entries in the first plurality of regions 1030. At 1116, node 120a obtains the payload P as KOffX+VOffX. At 1118, the desired tuple describing the first delta record is assembled based on the determined values of LI, EI, T, and P.
At 1210, a metadata page 210 for KXVX is created based on the LI specified in the first tuple {LI; EI; T; P}. For example, node 120a may locate the bucket 160 that contains recent versions of the metadata page corresponding to the LI of the first tuple and may create a new version of that metadata page within the bucket.
At 1220, node 120a accesses the entry index EI from the tuple and compares it with the constant C described above, where C may be half the length of the metadata page, for example. If EI is less than C, then the obtained EI is assumed to correspond to the first domain in which EI values represent ordinal positions of fixed-length entries in the page.
Operation then proceeds to 1230, whereupon the first entry is rebuilt by copying the payload P of the first tuple to the metadata page at the location of the entry index (EIX) specified by the EI of the tuple. Method 1200 then ends, as rebuilding of the first entry is complete.
The method 1200 is then repeated for rebuilding the second entry from the second tuple. At 1210 the same metadata page that was created above is accessed, and at 1220 a comparison is made between EI as read from the second tuple and the constant C. If EI is greater than or equal to C (as expected for entries in the second domain, whose indices range from C to C+n), operation proceeds to 1240, whereupon the second entry is rebuilt by copying the payload P to a location within the metadata page specified in the reference (such as the offset) contained in the first entry. The first entry may be easily located as EIX-C, which resides in the first domain where entries are laid out consecutively in fixed-length regions. Operation then ends, as both entries of the metadata page have been rebuilt.
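The rebuild flow can be sketched as a single function applied once per tuple. The sketch assumes a first entry that consists of one 2-byte little-endian offset, a page with no header, and C equal to half of a 4096-byte page; these values and the function name are illustrative assumptions.

```python
C = 2048           # assumed domain separator (half of a 4096-byte page)
FIRST_SIZE = 2     # assumed uniform first-entry size: a single 2-byte offset
HEADER_SIZE = 0    # header omitted for brevity (assumption)


def apply_delta(page: bytearray, record) -> None:
    """Rebuild one entry of the page from a tuple (LI, EI, T, P)."""
    li, ei, size, payload = record
    if size != len(payload):
        raise ValueError("T does not match the payload length")
    if ei < C:
        # First domain: EI is the ordinal position of a fixed-length entry.
        start = HEADER_SIZE + ei * FIRST_SIZE
    else:
        # Second domain: find the paired first entry at EI - C and
        # follow the offset stored in it.
        first = HEADER_SIZE + (ei - C) * FIRST_SIZE
        start = int.from_bytes(page[first:first + FIRST_SIZE], "little")
    page[start:start + size] = payload


page = bytearray(4096)
first_record = (0, 1, 2, (4000).to_bytes(2, "little"))
second_record = (0, 1 + C, 5, b"hello")
apply_delta(page, first_record)    # must run first: it supplies the offset
apply_delta(page, second_record)
```

The order of the two calls matters in this sketch, because the second record is placed using the offset that the first record restores.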
At 1310, a metadata element is provided. For example, the metadata element may be a key-value pair indicated by a key-value change 152 generated by KVE 150.
At 1320, a first entry (e.g., EI=X) is created for the metadata element in a metadata page 210. The first entry is located within a first region of a first plurality of regions (e.g., 430, 630, 830, or 1030) of the metadata page 210. The regions of the first plurality of regions have uniform length.
At 1330, a second entry (e.g., EI=X+C) is created for the metadata element in the metadata page 210. The second entry is located within a second region of a second plurality of regions (e.g., 432, 632, 832, or 1032) of the metadata page 210. The regions of the second plurality of regions have variable lengths. The first entry located within the first plurality of regions includes a reference (e.g., ValOffsetX, KOffsetX, OffsetX, KOffX, and/or VOffX) to a location of the second entry within the second plurality of regions.
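Acts 1320 and 1330 can be sketched with a toy page in which uniform first entries grow upward from the start of the page and variable-length second entries grow downward from its end. The class name `Page`, the 2-byte offset used as the reference, and the omission of a header and footer are assumptions made for illustration.

```python
class Page:
    """Toy metadata page: uniform regions grow up, variable regions grow down."""

    OFFSET_SIZE = 2   # assumed size of each uniform first entry

    def __init__(self, length: int = 4096):
        self.buf = bytearray(length)
        self.c = length // 2     # constant C separating the two index domains
        self.count = 0           # number of metadata elements stored so far
        self.tail = length       # start of the most recently added second entry

    def add(self, blob: bytes):
        """Store one element; return its entry indices (X, X + C)."""
        x = self.count
        first_start = x * self.OFFSET_SIZE
        first_end = first_start + self.OFFSET_SIZE
        start = self.tail - len(blob)
        if start < first_end:
            raise ValueError("page full")
        # Second entry: variable-length bytes placed below earlier entries.
        self.buf[start:self.tail] = blob
        # First entry: fixed-size reference to the second entry's location.
        self.buf[first_start:first_end] = start.to_bytes(self.OFFSET_SIZE, "little")
        self.count, self.tail = x + 1, start
        return x, x + self.c


p = Page(64)
a = p.add(b"abc")
b = p.add(b"defgh")
```

In this sketch the element with the lowest index sits closest to the end of the page, and each later element is placed progressively farther away.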
An improved technique has been described for managing metadata of variable length. The technique includes responding to the creation or change in a metadata element by creating at least first and second entries within a metadata page 210 at discontinuous locations. The first entry is located among a first set of regions having uniform length and includes a reference to the second entry, which is located among a second set of regions having variable length. In this manner, the metadata element that does not fit within a fixed-size space is accommodated by multiple discontinuous entries in respective sets of regions.
Having described certain embodiments, numerous alternative embodiments or variations can be made. For example, although the metadata elements are described above as key-value pairs, this is merely an example, as the same principles may apply to other types of metadata or data.
Further, embodiments have been described in which both keys and values of K-V pairs are written to metadata pages 210 and delta records 172. In some examples, however (such as those involving space accounting), it is necessary only to write values, as keys may remain unchanged. In such cases, an update can be achieved by writing only a single entry that contains a new value. Likewise, a page can be rebuilt just by restoring the page from a single delta record that contains the new value, rather than having to restore from two delta records. Such an arrangement further promotes write amortization.
Also, embodiments have been described in which entry indices are provided in two discontinuous domains within a metadata page 210. However, embodiments are not limited to two discontinuous domains. For instance,
Although embodiments have been described that involve one or more data storage systems, other embodiments may involve computers, including those not normally regarded as data storage systems. Such computers may include servers, such as those used in data centers and enterprises, as well as general purpose computers, personal computers, and numerous devices, such as smart phones, tablet computers, personal data assistants, and the like.
Further, although features have been shown and described with reference to particular embodiments hereof, such features may be included and hereby are included in any of the disclosed embodiments and their variants. Thus, it is understood that features disclosed in connection with any embodiment are included in any other embodiment.
Further still, the improvement or portions thereof may be embodied as a computer program product including one or more non-transient, computer-readable storage media, such as a magnetic disk, magnetic tape, compact disk, DVD, optical disk, flash drive, solid state drive, SD (Secure Digital) chip or device, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), and/or the like (shown by way of example as medium 1350 in
As used throughout this document, the words “comprising,” “including,” “containing,” and “having” are intended to set forth certain items, steps, elements, or aspects of something in an open-ended fashion. Also, as used herein and unless a specific statement is made to the contrary, the word “set” means one or more of something. This is the case regardless of whether the phrase “set of” is followed by a singular or plural object and regardless of whether it is conjugated with a singular or plural verb. Also, a “set of” elements can describe fewer than all elements present. Thus, there may be additional elements of the same kind that are not part of the set. Further, ordinal expressions, such as “first,” “second,” “third,” and so on, may be used as adjectives herein for identification purposes. Unless specifically indicated, these ordinal expressions are not intended to imply any ordering or sequence. Thus, for example, a “second” event may take place before or after a “first event,” or even if no first event ever occurs. In addition, an identification herein of a particular element, feature, or act as being a “first” such element, feature, or act should not be construed as requiring that there must also be a “second” or other such element, feature or act. Rather, the “first” item may be the only one. Also, and unless specifically stated to the contrary, “based on” is intended to be nonexclusive. Thus, “based on” should be interpreted as meaning “based at least in part on” unless specifically indicated otherwise. Although certain embodiments are disclosed herein, it is understood that these are provided by way of example only and should not be construed as limiting.
Those skilled in the art will therefore understand that various changes in form and detail may be made to the embodiments disclosed herein without departing from the scope of the following claims.
Number | Date | Country | Kind |
---|---|---|---|
RU2022114132 | May 2022 | RU | national |
Number | Date | Country |
---|---|---|
20230384943 A1 | Nov 2023 | US |