The present description relates to data storage systems, and more specifically, to a technique for the dynamic updating of weights used in distributed parity systems to more evenly distribute device selections for extent allocations.
A storage volume is a grouping of data of any arbitrary size that is presented to a user as a single, unitary storage area regardless of the number of storage devices the volume actually spans. Typically, a storage volume utilizes some form of data redundancy, such as by being provisioned from a redundant array of independent disks (RAID) or a disk pool (organized by a RAID type). Some storage systems utilize multiple storage volumes, for example of the same or different data redundancy levels.
Some storage systems utilize pseudorandom hashing algorithms in attempts to distribute data across distributed storage devices according to uniform probability distributions. In dynamic disk pools, however, this results in certain “hot spots” where some storage devices have more data extents allocated for data than other storage devices. The “hot spots” result in potentially large variances in utilization. This can result in imbalances in device usage, as well as bottlenecks (e.g., I/O bottlenecks) and underutilization of some of the storage devices in the pool. This in turn can reduce the quality of service of these systems.
The present disclosure is best understood from the following detailed description when read with the accompanying figures.
All examples and illustrative references are non-limiting and should not be used to limit the claims to specific implementations and embodiments described herein and their equivalents. For simplicity, reference numbers may be repeated between various examples. This repetition is for clarity only and does not dictate a relationship between the respective embodiments. Finally, in view of this disclosure, particular features described in relation to one aspect or embodiment may be applied to other disclosed aspects or embodiments of the disclosure, even though not specifically shown in the drawings or described in the text.
Various embodiments include systems, methods, and machine-readable media for improving the quality of service in dynamic disk pool (distributed parity) systems by ensuring a more evenly distributed layout of data extent allocations across storage devices. In an embodiment, whenever a data extent is to be allocated, a hashing function is called in order to select the storage device on which to allocate the data extent. The hashing function takes into consideration a weight associated with each storage device in the dynamic disk pool, so that storage devices with larger associated weights are more likely to be selected. Once a storage device is selected, the weight associated with that storage device is reduced by a pre-programmed incremental amount. Further, where a hierarchy is used, any nodes at higher hierarchal levels may also have weights whose values are a function of the storage device weights; those weights are recomputed as well. This reduces the probability that the selected storage device is selected at a subsequent time.
When a data extent is de-allocated, such as in response to a request to delete the data at the data extent or to de-allocate the data extent, the storage system takes the requested action. When the data extent is de-allocated, the weight associated with the affected storage device containing the now-de-allocated data extent is increased by an incremental amount. Further, where a hierarchy is used, any nodes at higher hierarchal levels may also have weights whose values are a function of the storage device weights; those weights are recomputed as well based on the change. This increases the probability that the storage device is selected at a subsequent time.
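The following is a minimal sketch in Python of the weight bookkeeping just described, under assumed names and values (Device, MAX_WEIGHT, MIN_WEIGHT, and a weighted pseudo-random pick standing in for the hashing function described later); it is illustrative only and not the specific implementation used by the storage controller.

```python
import random

MAX_WEIGHT = 0x10000   # initial weight of a device with nothing allocated
MIN_WEIGHT = 1         # floor keeps a full device selectable at low probability

class Device:
    def __init__(self, dev_id, total_extents):
        self.dev_id = dev_id
        self.weight = MAX_WEIGHT
        # per-extent adjustment, here tied to the device's total extent count
        self.extent_weight = max(1, MAX_WEIGHT // total_extents)

def select_device(devices, rng=random):
    # Weighted pick: higher-weight devices are proportionally more likely.
    return rng.choices(devices, weights=[d.weight for d in devices], k=1)[0]

def allocate_extent(devices):
    dev = select_device(devices)
    dev.weight = max(MIN_WEIGHT, dev.weight - dev.extent_weight)
    return dev

def deallocate_extent(dev):
    dev.weight = min(MAX_WEIGHT, dev.weight + dev.extent_weight)
```

Because a device's weight shrinks each time it provides an extent and grows when an extent is released, devices with more free capacity tend to be favored over time.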
The storage architecture 100 includes a storage system 102 in communication with a number of hosts 104. The storage system 102 is a system that processes data transactions on behalf of other computing systems including one or more hosts, exemplified by the hosts 104. The storage system 102 may receive data transactions (e.g., requests to write and/or read data) from one or more of the hosts 104, and take an action such as reading, writing, or otherwise accessing the requested data. For many exemplary transactions, the storage system 102 returns a response such as requested data and/or a status indicator to the requesting host 104. It is understood that for clarity and ease of explanation, only a single storage system 102 is illustrated, although any number of hosts 104 may be in communication with any number of storage systems 102.
While the storage system 102 and each of the hosts 104 are referred to as singular entities, a storage system 102 or host 104 may include any number of computing devices and may range from a single computing system to a system cluster of any size. Accordingly, each storage system 102 and host 104 includes at least one computing system, which in turn includes a processor such as a microcontroller or a central processing unit (CPU) operable to perform various computing instructions. The instructions may, when executed by the processor, cause the processor to perform various operations described herein with respect to the storage controllers 108.a, 108.b in the storage system 102 in connection with embodiments of the present disclosure. Instructions may also be referred to as code. The terms “instructions” and “code” may include any type of computer-readable statement(s). For example, the terms “instructions” and “code” may refer to one or more programs, routines, sub-routines, functions, procedures, etc. “Instructions” and “code” may include a single computer-readable statement or many computer-readable statements.
The processor may be, for example, a microprocessor, a microprocessor core, a microcontroller, an application-specific integrated circuit (ASIC), etc. The computing system may also include a memory device such as random access memory (RAM); a non-transitory computer-readable storage medium such as a magnetic hard disk drive (HDD), a solid-state drive (SSD), or an optical memory (e.g., CD-ROM, DVD, BD); a video controller such as a graphics processing unit (GPU); a network interface such as an Ethernet interface, a wireless interface (e.g., IEEE 802.11 or other suitable standard), or any other suitable wired or wireless communication interface; and/or a user I/O interface coupled to one or more user I/O devices such as a keyboard, mouse, pointing device, or touchscreen.
With respect to the storage system 102, the exemplary storage system 102 contains any number of storage devices 106 and responds to data transactions from one or more of the hosts 104 so that the storage devices 106 may appear to be directly connected (local) to the hosts 104. In various examples, the storage devices 106 include hard disk drives (HDDs), solid state drives (SSDs), optical drives, and/or any other suitable volatile or non-volatile data storage medium. In some embodiments, the storage devices 106 are relatively homogeneous (e.g., having the same manufacturer, model, and/or configuration). However, the storage system 102 may alternatively include a heterogeneous set of storage devices 106 that includes storage devices of different media types from different manufacturers with notably different performance.
The storage system 102 may group the storage devices 106 for speed and/or redundancy using a virtualization technique such as RAID or disk pooling (that may utilize a RAID level). The storage system 102 also includes one or more storage controllers 108.a, 108.b in communication with the storage devices 106 and any respective caches. The storage controllers 108.a, 108.b exercise low-level control over the storage devices 106 in order to execute (perform) data transactions on behalf of one or more of the hosts 104. The storage controllers 108.a, 108.b are illustrative only; more or fewer may be used in various embodiments. Having at least two storage controllers 108.a, 108.b may be useful, for example, for failover purposes in the event of equipment failure of either one. The storage system 102 may also be communicatively coupled to a user display for displaying diagnostic information, application output, and/or other suitable data.
In an embodiment, the storage system 102 may group the storage devices 106 using a dynamic disk pool (DDP) (or other declustered parity) virtualization technique. In a dynamic disk pool, volume data, protection information, and spare capacity are distributed across all of the storage devices included in the pool. As a result, all of the storage devices in the dynamic disk pool remain active, and spare capacity on any given storage device is available to all volumes existing in the dynamic disk pool. Each storage device in the disk pool is logically divided up into one or more data extents at various logical block addresses (LBAs) of the storage device. A data extent is assigned to a particular data stripe of a volume. An assigned data extent becomes a “data piece,” and each data stripe has a plurality of data pieces, for example sufficient for a desired amount of storage capacity for the volume and a desired amount of redundancy, e.g. RAID 0, RAID 1, RAID 10, RAID 5 or RAID 6 (to name some examples). As a result, each data stripe appears as a mini RAID volume, and each logical volume in the disk pool is typically composed of multiple data stripes.
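As an illustration only, the relationships among devices, extents, data pieces, and stripes described above might be modeled as follows; the class and field names are hypothetical and not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Extent:
    device_id: int
    start_lba: int          # logical block address where the extent begins
    allocated: bool = False

@dataclass
class DataStripe:
    # Each piece is an extent drawn from a different storage device,
    # e.g., several data pieces plus two parity pieces for a RAID 6-style stripe.
    pieces: List[Extent] = field(default_factory=list)

@dataclass
class PoolDevice:
    device_id: int
    extents: List[Extent] = field(default_factory=list)

    def free_extents(self) -> List[Extent]:
        return [e for e in self.extents if not e.allocated]
```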
In the present example, storage controllers 108.a and 108.b are arranged as an HA pair. Thus, when storage controller 108.a performs a write operation for a host 104, storage controller 108.a may also send a mirroring I/O operation to storage controller 108.b. Similarly, when storage controller 108.b performs a write operation, it may also send a mirroring I/O request to storage controller 108.a. Each of the storage controllers 108.a and 108.b has at least one processor executing logic to perform writing and migration techniques according to embodiments of the present disclosure.
Moreover, the storage system 102 is communicatively coupled to server 114. The server 114 includes at least one computing system, which in turn includes a processor, for example as discussed above. The computing system may also include a memory device such as one or more of those discussed above, a video controller, a network interface, and/or a user I/O interface coupled to one or more user I/O devices. The server 114 may include a general purpose computer or a special purpose computer and may be embodied, for instance, as a commodity server running a storage operating system. While the server 114 is referred to as a singular entity, the server 114 may include any number of computing devices and may range from a single computing system to a system cluster of any size. In an embodiment, the server 114 may also provide data transactions to the storage system 102. Further, the server 114 may be used to configure various aspects of the storage system 102, for example under the direction and input of a user. Some configuration aspects may include definition of RAID group(s), disk pool(s), and volume(s), to name just a few examples.
With respect to the hosts 104, a host 104 includes any computing resource that is operable to exchange data with a storage system 102 by providing (initiating) data transactions to the storage system 102. In an exemplary embodiment, a host 104 includes a host bus adapter (HBA) 110 in communication with a storage controller 108.a, 108.b of the storage system 102. The HBA 110 provides an interface for communicating with the storage controller 108.a, 108.b, and in that regard, may conform to any suitable hardware and/or software protocol. In various embodiments, the HBAs 110 include Serial Attached SCSI (SAS), iSCSI, InfiniBand, Fibre Channel, and/or Fibre Channel over Ethernet (FCoE) bus adapters. Other suitable protocols include SATA, eSATA, PATA, USB, and FireWire.
The HBAs 110 of the hosts 104 may be coupled to the storage system 102 by a network 112, for example a direct connection (e.g., a single wire or other point-to-point connection), a networked connection, or any combination thereof. Examples of suitable network architectures 112 include a Local Area Network (LAN), an Ethernet subnet, a PCI or PCIe subnet, a switched PCIe subnet, a Wide Area Network (WAN), a Metropolitan Area Network (MAN), the Internet, Fibre Channel, or the like. In many embodiments, a host 104 may have multiple communicative links with a single storage system 102 for redundancy. The multiple links may be provided by a single HBA 110 or multiple HBAs 110 within the hosts 104. In some embodiments, the multiple links operate in parallel to increase bandwidth.
To interact with (e.g., write, read, modify, etc.) remote data, a host HBA 110 sends one or more data transactions to the storage system 102. Data transactions are requests to write, read, or otherwise access data stored within a data storage device such as the storage system 102, and may contain fields that encode a command, data (e.g., information read or written by an application), metadata (e.g., information used by a storage system to store, retrieve, or otherwise manipulate the data such as a physical address, a logical address, a current location, data attributes, etc.), and/or any other relevant information. The storage system 102 executes the data transactions on behalf of the hosts 104 by writing, reading, or otherwise accessing data on the relevant storage devices 106. A storage system 102 may also execute data transactions based on applications running on the storage system 102 using the storage devices 106. For some data transactions, the storage system 102 formulates a response that may include requested data, status indicators, error messages, and/or other suitable data and provides the response to the provider of the transaction.
Data transactions are often categorized as either block-level or file-level. Block-level protocols designate data locations using an address within the aggregate of storage devices 106. Suitable addresses include physical addresses, which specify an exact location on a storage device, and virtual addresses, which remap the physical addresses so that a program can access an address space without concern for how it is distributed among underlying storage devices 106 of the aggregate. Exemplary block-level protocols include iSCSI, Fibre Channel, and Fibre Channel over Ethernet (FCoE). iSCSI is particularly well suited for embodiments where data transactions are received over a network that includes the Internet, a WAN, and/or a LAN. Fibre Channel and FCoE are well suited for embodiments where hosts 104 are coupled to the storage system 102 via a direct connection or via Fibre Channel switches. A Storage Area Network (SAN) device is a type of storage system 102 that responds to block-level transactions.
In contrast to block-level protocols, file-level protocols specify data locations by a file name. A file name is an identifier within a file system that can be used to uniquely identify corresponding memory addresses. File-level protocols rely on the storage system 102 to translate the file name into respective memory addresses. Exemplary file-level protocols include SMB/CIFS, SAMBA, and NFS. A Network Attached Storage (NAS) device is a type of storage system that responds to file-level transactions. As another example, embodiments of the present disclosure may utilize object-based storage, where data is managed as objects rather than as blocks or within file hierarchies. In such systems, writing an object is similar to writing a file in that, once written, the object is an accessible entity. Such systems expose an interface that enables other systems to read and write named objects, which may vary in size, and handle low-level block allocation internally (e.g., by the storage controllers 108.a, 108.b). It is understood that the scope of the present disclosure is not limited to block-level, file-level, or object-based protocols, and in many embodiments, the storage system 102 is responsive to a number of different memory transaction protocols.
An exemplary storage system 102 configured with a DDP is illustrated in FIG. 2.
Each storage device 202a-202f is logically divided up into a plurality of data extents 208. Of that plurality of data extents, each storage device 202a-202f includes a subset of data extents that has been allocated for use by one or more logical volumes, illustrated as data pieces 204 in FIG. 2.
Of these data pieces, at least one is reserved for redundancy (e.g., one piece as in RAID 5; as another example, a data stripe may have two data pieces/extents reserved for redundancy, as in RAID 6) and the others are used for data. It will be appreciated that the other data stripes may have a similar composition, but for simplicity of discussion they will not be discussed here. According to embodiments of the present disclosure, an algorithm may be used by one or both of the storage controllers 108.a, 108.b to determine which storage devices 202 to select to provide data extents 208 from among the plurality of storage devices 202 of which the disk pool is composed. After a round of selection of storage devices' data extents for a data stripe, a weight associated with each selected storage device may be modified by the respective storage controller 108 to reduce the likelihood of those storage devices being selected when the next stripe is created. As a result, embodiments of the present disclosure are able to more evenly distribute the layout of data extent allocations in one or more volumes created from the data extents.
Turning now to FIG. 3, the use of dynamic weights W in selecting storage devices 202 for data extent allocation according to embodiments of the present disclosure is illustrated.
In an embodiment, each weight W may be initialized with a default value. For example, the weight may be initialized with a maximum value available for the variable the storage controller 108 uses to track the weight. In embodiments where object-based storage is used, for example, a member variable for weight, W, may be set at a maximum value (e.g., 0x10000 in base 16, or 65,536 in base 10) when the associated object is instantiated, for example corresponding to a storage device 202. This maximum value may be used to represent a device that has not allocated any of its capacity (e.g., has not had any of its extents allocated for one or more data stripes in a DDP) yet.
Continuing with this example, another variable (referred to herein as “ExtentWeight”) may also be set that identifies how much the weight variable W may be reduced for a given storage device 202 when an extent is allocated from that device (or increased when an extent is de-allocated). In an embodiment, the value for ExtentWeight may be a value based on the total number of extents that the device supports. As an example, this may be determined by dividing the maximum value allocated for the variable W by the total number of extents on the given storage device, thus tying the amount that the weight W is reduced to the extents on the device itself. In another embodiment, the value for ExtentWeight may be set to be a uniform value that is the same in association with each storage device 202 in the DDP. This may give rise to a minimum theoretical weight W of 0 (though, to support a pseudo-random hash-based selection process, the minimum possible weight W may be limited to some value just above zero so that even a storage device 202 with all of its extents allocated may still show up for potential selection) and a maximum theoretical weight W equal to the initial (e.g., default) weight.
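A brief worked example may help, using assumed figures (a device exposing 4,096 extents) together with the maximum weight value mentioned above; the numbers are illustrative only.

```python
MAX_WEIGHT = 0x10000                         # 65,536
total_extents = 4096                         # assumed device size
extent_weight = MAX_WEIGHT // total_extents  # 16 per allocated extent

w = MAX_WEIGHT
w = max(1, w - extent_weight)                # one extent allocated -> 65,520
# After all 4,096 extents are allocated, w reaches the floor of 1, so a
# completely full device can still (rarely) appear in the selection.
```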
In an embodiment, the dynamic weighting may be toggled, i.e., turned on or off. Thus, according to embodiments of the present disclosure, the weights W associated with the selected devices are adjusted when data extents are allocated and/or de-allocated (decreased for allocations, increased for de-allocations), but until the dynamic weighting is turned on, the default value for the weight W may be returned whenever the weight is queried. In a further embodiment, the weight W for each storage device 202 may be influenced solely by the default value and any decrements and increments applied to it (in other words, treating all storage devices 202 as though they generally have the same overall capacity and not considering possible differences in the value set for ExtentWeight). In an alternative embodiment, in addition to dynamically adjusting the weight W based on allocation/de-allocation, the storage controller 108 may further set the weight W for each storage device 202 according to its relative capacity, so that different-sized storage devices 202 may have different weights W from each other before and during dynamic weight adjusting (or, alternatively, the different capacities may be taken into account with the size of ExtentWeight for each storage device 202).
As illustrated in FIG. 3, each of the storage devices 202a-202f in the DDP has a respective weight W202a-W202f associated with it, and the storage controller 108 uses these weights when a request 302 to create a data stripe is received and storage devices 202 are to be selected.
For example, in selecting storage devices 202 the storage controller 108 may utilize a logical map of the system, such as a cluster map, to represent what resources are available for data storage. For example, the cluster map may be a hierarchal map that logically represents the elements available for data storage within the distributed system (e.g., DDP), including for example data center locations, server cabinets, server shelves within cabinets, and storage devices 202 on specific shelves. These may be referred to as buckets which, depending upon their relationship with each other, may be nested in some manner. For example, the bucket for one or more storage devices 202 may be nested within a bucket representing a server shelf and/or server row, which also may be nested within a bucket representing a server cabinet. The storage controller 108 may maintain one or more placement rules that may be used to govern how one or more storage devices 202 are selected for creating a data stripe. Different placement rules may be maintained for different data redundancy types (e.g., RAID type) and/or hardware configurations.
According to embodiments of the present disclosure, in addition to each of the storage devices 202 having a respective dynamic weight W associated with it, the buckets where the storage devices 202 are nested may also have dynamic weights W associated with them. For example, a given bucket's weight W may be a sum of the dynamic weights W associated with the devices and/or other buckets contained within the given bucket. The storage controller 108 may use these bucket weights W to assist in an iterative selection process to first select particular buckets from those available, e.g. selecting those with higher relative weights than the others according to the relevant placement rule for the given redundancy type/hardware configuration. For each selection (e.g., at each layer in a nested hierarchy), the storage controller 108 may use a hashing function to assist in its selection. The hashing function may be, for example, a multi-input integer hash function. Other hash functions may also be used.
At each layer, the storage controller 108 may use the hash function with an input from the previous stage (e.g., at the first layer, an initial input such as a volume name for creation or a name of a data object for the system, etc.). The hash function may output a selection. For example, at a layer specifying buckets representing server cabinets, the output may be one or more server cabinets, within which the storage controller 108 may repeat the selection for the next bucket down, such as for selecting one or more rows, shelves, or actual storage devices. With this approach, the storage controller 108 may be able to manage where a given volume is distributed across the DDP so that target levels of redundancy and failure protection are maintained (e.g., so that data remains available if power is cut to a server cabinet, data center location, etc.). At each iteration, the weight W associated with the different buckets and/or storage devices influences the selected result(s); one way such a per-level, weight-biased selection could be implemented is sketched below.
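The following sketch shows one way a deterministic, weight-biased pick at a single level of the cluster map could work. The hash construction (SHA-256 over a composite key) and the function names are assumptions for illustration only and are not necessarily the multi-input integer hash function referenced above.

```python
import hashlib
import math

def hash_unit(key: str, child_id: str, attempt: int = 0) -> float:
    """Deterministic value in (0, 1] derived from multiple inputs."""
    digest = hashlib.sha256(f"{key}|{child_id}|{attempt}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2.0 ** 64

def pick_children(key: str, children: list, weights: dict, count: int) -> list:
    """Pick `count` distinct children of a bucket, biased by current weight.

    Each child gets a hash-derived exponential "clock" scaled by 1/weight;
    the smallest clocks win, so a child is chosen with probability roughly
    proportional to its weight, yet the result is repeatable for a given key.
    """
    scored = []
    for child in children:
        w = weights.get(child, 0)
        if w <= 0:
            continue
        clock = -math.log(hash_unit(key, str(child))) / w
        scored.append((clock, child))
    scored.sort()
    return [child for _, child in scored[:count]]
```

In a hierarchy, the same pick would simply be repeated at each level, using the children of the buckets chosen at the previous level as the next set of candidates.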
This iteration may continue until reaching the level of actual storage devices 202. This level is illustrated in FIG. 3.
Thus, in the example of FIG. 3, the storage controller 108 selects storage devices 202a, 202b, 202c, 202d, and 202f to provide the data extents for the requested data stripe, while storage device 202e is not selected.
With the selection of specific storage devices 202a, 202b, 202c, 202d, and 202f complete (and subsequent allocation), the storage controller 108 then modifies the weights W associated with each storage device 202 impacted by the selection. Thus, the storage controller 108 decreases 306 the weight W202a, decreases 308 the weight W202b, decreases 310 the weight W202c, decreases 312 the weight W202d, and decreases 316 the weight W202f corresponding to the selected storage devices 202a, 202b, 202c, 202d, and 202f. As noted above, the weight for each may be reduced by ExtentWeight which may be the same for each storage device or different, e.g. depending upon the total number of extents on each storage device 202. Since the storage device 202e was not selected in this round, there is no change 314 in the weight W202e.
In addition to dynamically adjusting the weights W for the storage devices 202 affected by the selection, the storage controller 108 also dynamically adjusts the weights of those elements of upper hierarchal levels (e.g. higher-level buckets) in which the selected storage devices 202a, 202b, 202c, 202d, and 202f are nested. This can be accomplished by recomputing the sum of weights found within the respective bucket, which may include both the storage devices 202 as well as other buckets. As another example, after the weights W have been adjusted for the selected storage devices 202, the storage controller 108 may recreate a complete distribution of all nodes in the cluster map. Should another data stripe be needed, e.g., when another request 302 is received, the process described above is repeated, taking into consideration the dynamically changed weights from the previous round of selection for the different levels of the hierarchy in the cluster map. Thus, subsequent hashing into the cluster map (which may also be referred to as a tree) produces a bias toward storage devices 202 with higher weights W (those devices which have more unallocated data extents than the others).
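One straightforward way to perform the bucket-weight recomputation described above is a recursive sum over the cluster map; the attribute names (children, weight) are assumed for illustration and are not part of the disclosure.

```python
def recompute_bucket_weights(bucket) -> int:
    """Recompute a bucket's weight as the sum of the weights of everything
    nested within it (child buckets and/or leaf storage devices)."""
    total = 0
    for child in bucket.children:
        if getattr(child, "children", None):   # nested bucket: recurse first
            total += recompute_bucket_weights(child)
        else:                                  # leaf storage device
            total += child.weight
    bucket.weight = total
    return total
```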
The mappings may be remembered so that subsequent accesses take less time computationally to reach the appropriate locations among the storage devices 202. A result of the above process is that the extent allocations for subsequent data objects are more evenly distributed among storage devices 202 by relying upon the dynamic weights W according to embodiments of the present disclosure.
Although the storage devices 202a-202f are illustrated together, one or more of the devices may be physically distant from one or more of the others. For example, all of the storage devices 202 may be in close proximity to each other, such as on the same rack, etc. As another example, some of the storage devices 202 may be distributed in different server cabinets and/or data center locations (as just two examples) as influenced by the placement rules specified for the redundancy type and/or hardware configuration.
Further, although the above example discusses the reduction of weights W associated with the selected storage devices 202, in an alternative embodiment the weights W associated with the non-selected storage devices 202 may instead be increased, for example by the ExtentWeight value (e.g., where the default weights are all initialized to a zero value or similar instead of a maximum value), while the weight W for the selected storage devices 202 remains the same during that round.
In the example illustrated in FIG. 4, the storage controller 108 receives a request to de-allocate data extents located on storage devices 202a, 202b, 202c, 202d, and 202e (for example, in response to a request to delete the data stored at those data extents) and performs the requested de-allocation.
With the requested action completed at the storage devices 202a, 202b, 202c, 202d, and 202e, the storage controller 108 then modifies the weights W associated with each storage device 202 impacted by the action (e.g., de-allocation). Thus, in embodiments where the weights W are initialized to a default maximum value, the storage controller 108 increases 406 the weight W202a, increases 408 the weight W202b, increases 410 the weight W202c, increases 412 the weight W202d, and increases 414 the weight W202e corresponding to the storage devices 202a, 202b, 202c, 202d, and 202e of this example. As noted above, the weight for each may be increased by ExtentWeight which may be the same for each storage device or different, e.g. depending upon the total number of extents on each storage device 202. Since the storage device 202f did not have an extent de-allocated, there is no change 416 in the weight W202f.
In addition to dynamically adjusting the weights W for the storage devices 202 affected by the de-allocation, the storage controller 108 also dynamically adjusts the weights of those elements of upper hierarchal levels (e.g. higher-level buckets) in which the affected storage devices 202a, 202b, 202c, 202d, and 202e are nested. This can be accomplished by recomputing the sum of weights found within the respective bucket, which may include both the storage devices 202 as well as other buckets. As another example, after the weights W have been adjusted for the affected storage devices 202, the storage controller 108 may recreate a complete distribution of all nodes in the cluster map.
The difference in results between use of the dynamic weight adjustment according to embodiments of the present disclosure and the lack of dynamic weight adjustment is demonstrated by the diagrams of FIG. 5.
Diagram 500 illustrates the result without dynamic weighting: although using the hashing function with the cluster map may operate to achieve an overall uniform distribution (e.g., with per-device utilization falling along a bell curve), it may result in locally uneven distributions of allocations in the different drawers (illustrated at around 95% capacity). This may result in performance differences between individual storage devices 202 (and, by implication, drawers, racks, rows, and/or cabinets, for example). The contrasting result with dynamic weighting according to embodiments of the present disclosure is also illustrated.
As a further benefit, in systems that are performance limited by drive spindles (e.g., random I/Os on hard disk drive storage devices), random DDP I/O may approximately match the random I/O performance of RAID 6 (as opposed to the drops in random read and random write performance that may occur when not utilizing dynamic weighting). Further, in systems that utilize solid state drives as storage devices, using the dynamic weighting may reduce the variation in wear leveling by keeping the data distribution more evenly balanced across the drive set (as opposed to the more uneven wear leveling that would occur with the allocation illustrated in diagram 500 of FIG. 5).
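The effect can be explored with a toy simulation; it is illustrative only, the numbers are arbitrary, a pseudo-random generator stands in for the hash-based selection, and the printed values are not measured results from the disclosed system.

```python
import random

def simulate(num_devices=60, extents_per_device=128, allocations=4000,
             dynamic=True, seed=7):
    """Allocate extents and return the spread (max - min) of per-device counts."""
    rng = random.Random(seed)
    max_w = 0x10000
    step = max_w // extents_per_device
    weights = [max_w] * num_devices
    counts = [0] * num_devices
    for _ in range(allocations):
        bias = weights if dynamic else [1] * num_devices
        dev = rng.choices(range(num_devices), weights=bias, k=1)[0]
        counts[dev] += 1
        if dynamic:
            weights[dev] = max(1, weights[dev] - step)  # decrement on allocation
    return max(counts) - min(counts)

print("spread, static selection: ", simulate(dynamic=False))
print("spread, dynamic weighting:", simulate(dynamic=True))
```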
At block 602, the storage controller 108 receives an instruction that affects at least one data extent allocation in at least one storage device 202. For example, the instruction may be to allocate a data extent (e.g., for volume creation or for a data I/O). As another example, the instruction may be to de-allocate a data extent.
At block 604, the storage controller 108 changes the data extent allocation based on the instruction received at block 602. For extent allocation, this includes allocating the one or more data extents according to the parameters of the request. For extent de-allocation, this includes de-allocation and release of the extent(s) back to an available pool for potential later use.
At block 606, the storage controller 108 updates the weight corresponding to the one or more storage devices 202 affected by the change in extent allocation. For example, where a data extent is allocated, the weight corresponding to the affected storage device 202 containing the data extent is decreased, such as by ExtentWeight as discussed above with respect to FIG. 3. Where a data extent is de-allocated, the weight corresponding to the affected storage device 202 is increased, for example by ExtentWeight as discussed above with respect to FIG. 4.
At block 608, the storage controller 108 re-computes the weights associated with the one or more storage nodes, such as the buckets discussed above with respect to FIG. 3, in which the affected storage devices 202 are nested.
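Blocks 602-608 can be summarized in a compact sketch; the instruction fields and device methods shown (kind, allocate, release, max_weight) are hypothetical names, and recompute_bucket_weights refers to the earlier recursive-sum sketch.

```python
def handle_extent_instruction(instruction, device, root_bucket):
    """Blocks 602-608 in brief: apply the allocation change, adjust the
    device's weight by ExtentWeight, and recompute the enclosing node weights."""
    if instruction.kind == "allocate":
        device.allocate(instruction.extent)                              # block 604
        device.weight = max(1, device.weight - device.extent_weight)     # block 606
    else:  # "deallocate"
        device.release(instruction.extent)                               # block 604
        device.weight = min(device.max_weight,
                            device.weight + device.extent_weight)        # block 606
    recompute_bucket_weights(root_bucket)                                # block 608
```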
The illustrated method 700 may be described with respect to several different phases, identified as phases A, B, C, and D in FIG. 7.
At block 702, the storage controller 108 receives a request to provision a volume in the storage system from available data extents in a distributed parity system, such as DDP.
At block 704, the storage controller 108 selects one or more storage devices 202 that have available data extents to create a data stripe for the requested volume. This selection is made, according to embodiments of the present disclosure, based on the present value of the corresponding weights for the storage devices 202. For example, the storage controller 108 calls a hashing function and, based on the weights associated with the devices, receives an ordered list of selected storage devices 202 from among those in the DDP (e.g., 10 devices from among a pool of hundreds or thousands).
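Using the pick_children sketch from above (with hypothetical device identifiers), the ordered-list selection at block 704 might look like the following.

```python
# Hypothetical pool of 500 devices, all still at their initial weight; ask for
# an ordered list of ten devices for the first stripe of a new volume.
device_ids = [f"dev-{i:03d}" for i in range(500)]
weights = {d: 0x10000 for d in device_ids}

stripe_devices = pick_children("volume-A:stripe-0", device_ids, weights, count=10)
print(stripe_devices)   # same key and weights -> same ordered result every time
```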
At block 706, after the selection and allocation of data extents on the selected storage devices 202, the storage controller 108 decreases the weights associated with the selected storage devices 202. For example, the decrease may be according to the value of ExtentWeight, or some other default or computed amount. The storage controller 108 may also re-compute the weights associated with the one or more storage nodes in which the selected storage devices 202 are nested.
At decision block 708, the storage controller 108 determines whether the last data stripe has been allocated for the volume requested at block 702. If not, then the method 700 returns to block 704 to repeat the selection, allocation, and weight adjusting process. If so, then the method 700 proceeds to block 710.
At block 710, which may occur during regular system I/O operation in phase B, the storage controller 108 may receive a write request from a host 104.
At block 712, the storage controller 108 responds to the write request by selecting one or more storage devices 202 on which to allocate data extents. This selection is made based on the present value of the weights associated with the storage devices 202 under consideration. This may be done in addition to, or as an alternative to, the volume provisioning already done in phase A. For example, where the volume is thinly provisioned at phase A, there may still be a need to allocate additional data extents to accommodate the incoming data.
At block 714, the storage controller 108 allocates the data extents on the selected storage devices from block 712.
At block 716, the storage controller 108 decreases the weights associated with the selected storage devices 202. For example, the decrease may be according to the value of ExtentWeight, or some other default or computed amount. The storage controller 108 may also re-compute the weights associated with the one or more storage nodes in which the selected storage devices 202 are nested.
At block 718, which may occur during phase C, the storage controller 108 receives a request to de-allocate one or more data extents. This may correspond to a request to delete data stored at those data extents, or to a request to delete a volume, or to a request to migrate data to other locations in the same or different volume/system.
At block 720, the storage controller 108 de-allocates the requested data extents on the affected storage devices 202.
At block 722, the storage controller 108 increases the weights corresponding to the affected storage devices 202 where the de-allocated data extents are located. This may be according to the value of ExtentWeight, as discussed above with respect to FIG. 4.
The method 700 then proceeds to decision block 724, part of phase D. At decision block 724, it is determined whether a storage device has failed. If not, then the method may return to any of phases A, B, and C to again allocate data extents for a new volume or for a data write, or to de-allocate data extents as requested.
If it is instead determined that a storage device 202 has failed, then the method 700 proceeds to block 726.
At block 726, as part of data reconstruction recovery efforts, the storage controller 108 detects the storage device failure and initiates data rebuilding of data that was stored on the now-failed storage device. In systems that rely on parity for redundancy, this includes recreating the stored data based on the parity information and other data pieces stored that relate to the affected data.
At block 728, the storage controller 108 selects one or more available (working) storage devices 202 on which to store the rebuilt data. This selection is made based on the present value of the weights associated with the storage devices 202 under consideration. The storage controller 108 then allocates the data extents on the selected storage devices 202.
At block 730, the storage controller 108 decreases the weights associated with the selected storage devices 202. For example, the decrease may be according to the value of ExtentWeight, or some other default or computed amount. The storage controller 108 may also re-compute the weights associated with the one or more storage nodes in which the selected storage devices 202 are nested.
As a result of the elements discussed above, a storage system's performance is improved by reducing the variance in capacity utilization between storage devices in a volume, improving quality of service through more evenly distributed data extent allocations. Further, random I/O performance is improved, as is wear leveling between devices.
The present embodiments can take the form of a hardware embodiment, a software embodiment, or an embodiment containing both hardware and software elements. In that regard, in some embodiments, the computing system is programmable and is programmed to execute processes including the processes of methods 600 and/or 700 discussed herein. Accordingly, it is understood that any operation of the computing system according to the aspects of the present disclosure may be implemented by the computing system using corresponding instructions stored on or in a non-transitory computer readable medium accessible by the processing system. For the purposes of this description, a tangible computer-usable or computer-readable medium can be any apparatus that can store the program for use by or in connection with the instruction execution system, apparatus, or device. The medium may include for example non-volatile memory including magnetic storage, solid-state storage, optical storage, cache memory, and Random Access Memory (RAM).
The foregoing outlines features of several embodiments so that those skilled in the art may better understand the aspects of the present disclosure. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.