1. Technical Field
The present disclosure relates to distributed tiered data storage systems, and specifically to optimizing and load-balancing user and application-generated I/O workloads in multi-tiered distributed-storage environments.
2. Description of the Background Art
Typical distributed storage systems (DSS) comprise multiple storage servers configured to provide increased capacity, input/output (I/O) performance (i.e. write/read performance), fault tolerance and improved data availability via multiple copies of data stored on different storage servers/tiers. The DSS servers used to store user and application data are often clustered and combined in one or more tiers of identical servers. Often, those identical servers are based on commodity hardware and local disk drives.
Storage provided by DSS is utilized by applications including filesystems, databases, and object storage systems. Each of those applications provides a certain application-specific service to upper layers and users (for instance, a filesystem provides and facilitates file storage and file management) while utilizing distributed block, file and/or object-level services provided by the underlying storage servers and tiers of servers of the DSS.
The present disclosure relates to heterogeneous distributed multi-tiered storage systems, methods and architectures. Conventionally, in a multi-tiered system with storage tiers denoted T1, T2, . . . Tn the first (often called “primary”) tier T1 directly faces users and applications and provides the best I/O performance, while the last tier Tn provides abundant capacity to store less (or least) critical and/or less frequently (or recently) accessed data, including long-term backups and archives.
In a given multi-tiered storage system (T1, . . . , Tn), for any two “neighboring” tiers Ti and Ti+1 (1<=i<n), I/O performance of Ti is typically better than I/O performance of its lower tier neighbor Ti+1. Typical performance metrics include maximum IOPS, throughput and average latency measured on a per-tier basis for a given application-generated or synthetic workload. Simultaneously, available capacity generally increases from Ti to Ti+1, or is expected to increase. It is also widely accepted in the industry that lower tiers are generally less expensive per terabyte of provided capacity.
Lower tiers of a multi-tiered DSS are typically used to provide for data availability, by storing additional copies of data or additional redundant erasure encoded slices (with XOR-based parity being a special case of the latter). Those redundant copies and slices are conventionally generated outside the main I/O processing path (and often, by a separate storage software). For instance, in the write-processing data path:
write(data) request by application=>N copies of data stored in the DSS
DSS will conventionally store all N copies on a single tier designated for the writing application (e.g., primary tier T1 in case of mission-critical business applications), while additional copies will be generated outside this data processing path. Alternatively, conventional DSS will, at best, provide redundancy via RAID levels or erasure encoded schemas implemented over multiple servers of the same storage tier.
Similarly, when reading data, a typical I/O processing sequence includes reading data from one or more servers of a given selected storage tier. Conventional distributed storage systems do not employ lower tiers to perform part of the normal inline (as opposed to offline, background, and separate from the main application-driven data path) I/O processing. Reading of extra copies stored on other storage tiers is typically executed offline and outside the normal (“fast path”) I/O processing logic, the corresponding (“slow path”) scenarios including: error processing, data recovery, storage/capacity rebalancing, as well as offline compression, encryption, erasure encoding, and deduplication.
The present disclosure provides methods that dynamically and optimally leverage all storage tiers to execute I/O operations, while at the same time conforming to user and application requirements. The disclosure presents a system and method to utilize heterogeneous storage tiers, both persistent and non-persistent, with their per-tier specific unavoidable limitations and the corresponding tradeoffs including for example: best I/O latency for limited capacity and a relatively high $/GB price, best sequential throughput vs. not so good random small-block IOPS, and so on. Further, in order to satisfy user and application requirements, the disclosure integrates implicitly or explicitly defined service-level agreements (SLA) directly into I/O datapath processing. Further, the disclosure provides for dynamic at-runtime adjustments in the I/O pipeline when processing I/O requests. Finally, the disclosure provides at-runtime adaptive combination of I/O performance and availability—the latter, via storing redundant copies and/or redundant coded slices of data on the lower tiers (when writing), and retrieving the data from one of the lower tiers (when reading).
One implementation relates to a method of writing data to a heterogeneous multi-tiered Distributed Storage System (tDSS). A class of storage tier for the first copy or the first subset of coded slices of data is selected using operating modes for the tiers, where the operating mode for a tier instance depends in part on statistical measures of operating parameters for that tier. Lower tiers are then selected to store additional replicas of data using operating modes for those lower tiers.
Another implementation relates to a method of reading data from a heterogeneous multi-tiered Distributed Storage System (tDSS). To execute the read, tiers that store a copy of the data are selected using operating modes for the tiers, where the operating mode for a tier instance depends in part on statistical measures of operating parameters for that tier.
Other implementations, aspects, and features are also disclosed.
As used herein, the terms “storage tiers” and “tiered storage” describe multi-server heterogeneous storage environments, whereby each tier consists of one or more storage servers, for example, storage servers 110, 112, 114, and 116, and provides for a specific set of requirements with respect to price, performance, capacity and function. For instance, a multi-tiered distributed storage system (tDSS) may include the following 4 tiers:
In another 4-tier implementation targeted specifically for the high-performance computing (HPC) space, tDSS tiers include:
Note that T1 and T3 in this implementation are non-persistent, backed-up (as far as user/application data is concerned) by their persistent lower-tier “neighbors” T2 and T4, respectively. This implementation does not trade the data persistency in rare cases (such as a sudden power-cycle with no UPS backup)—for I/O performance in all cases, as there are known techniques to provide a required level of durability, atomicity and integrity via, for instance, asynchronous and synchronous replication and data de-staging to lower tiers, and of course the already mentioned UPS.
Storage tier on top of DRAM, backed up by asynchronous de-staging to its neighboring lower tier, is part of a preferred implementation of the present disclosure, with a 4-tier example described above. Those skilled in the art will appreciate another novelty: the present disclosure does not restrict storage tiers to comprise storage servers entirely, with all their associated (server-own) CPU, memory and local storage resources.
Note that storing data in RAM should not be confused with the legacy art of data and metadata caching: the latter is probabilistic, with cache eviction controlled via least-recently-used, most-frequently-used and similar known algorithms and their adaptive combinations. This disclosure, in at least some of its implementations, employs storage server memory as a distributed storage tier, whereby the locations and properties of this storage (including its non-persistent nature) are described in the tDSS's own metadata controlled via the Metadata Service 108 (
Class of Storage
The goal of arranging storage in multiple service tiers is to organize storage servers with similar characteristics on a per-tier basis, and to optimally use those storage servers to meet business requirements. It is only fitting therefore to associate a specific Class of Storage (CoS) label, or a range thereof, with each tier.
Class of Storage (CoS) reflects, directly or indirectly, one or more measurable properties of the storage servers, such as for instance, the type of underlying storage media (e.g., SSD, SATA), type of server storage-disks interconnect (e.g., DAS over PCIe 3.0, iSCSI over 1 GE), performance of the storage server and its free capacity.
More broadly, CoS may include storage server vendor and model, capacity, locality (e.g., Data Center local or remote), I/O latency and I/O throughput under a variety of workloads, number of CPU cores, and size of RAM. For virtual servers, the CoS could include the type of the hypervisor and hypervisor-allocated resources. Ultimately, the CoS abstraction formalizes the de facto convention that, as in the above example, (primary) Tier 1 is higher than Tier 2, and Tier 4 is lower than Tier 3. This and similar references to the ordering of tiers elsewhere in the present application refer to the CoS enumeration and ordering as discussed above.
Class of Storage associated with (or assigned to) n storage tiers numbered from 1 to n is henceforth denoted as CoSi, where 1<=i<=n. For the purposes of this disclosure, it will be postulated that all storage servers in a given storage tier Ti share the same CoS properties, the same CoS label.
In production implementations, each storage tier will often (but not always) consist of co-located (same zone, same region, same network) and identical (hardware or virtualized-resources wise) storage servers, which would automatically make the associated CoS labels identical as well. Those skilled in the art will appreciate that this disclosure does not require (and does not rely upon) storage servers being physically identical. All servers of any given tDSS tier share the same CoS, and therefore are treated equally as far as the I/O processing mechanisms and methods disclosed herein are concerned.
Implementations of this disclosure include tDSS with both persistent and non-persistent (volatile) tiers. For instance, one implementation includes a 3-tier storage system whereby the tiers T1 and T2 are RAM based, and the tier T3 is SSD based. In the implementation, the corresponding CoS1 and CoS2 labels reflect the fact that T1 and T2 are not persistent, which in turn is used by the disclosed I/O processing logic to optimize performance, on the one hand, and provide availability of the user/application data, on the other.
Further, the present disclosure does not require all resources of a given physical or virtual storage server to be fully allocated for a given storage tier. For instance, a given physical or virtual (VM-based) storage server may provide its random access memory to one storage tier, while its directly attached disks to another storage tier. Those skilled in the art will appreciate that, via the CoS labels assigned to those tiers, I/O processing disclosed herein takes advantage of the storage tier's own I/O latency (superior for DDR-based memory compared to other types of media) as well as inter-tier I/O latency. In the tDSS, n tiers may be implemented using a smaller number k (k<n) of storage server models (types).
One implementation includes a 4-tier tDSS that is built using just two types of storage servers: type A (SSD-based, expensive, fast) and type B (SATA-based, inexpensive, slow). In this implementation, T1 effectively combines RAM of all type A servers, T2 combines SSDs of all type A servers, T3 and T4 combine respectively RAM and SATA drives of the type B storage servers. Notice that inter-tier I/O operations between tiers T1, T2 (and respectively T3, T4) in this implementation will have performance characteristics of local reads and writes (as they will be local reads and, respectively, writes).
Thus, it must be apparent to one of ordinary skill in the art that storage tiers referenced in this disclosure are in fact (logical) abstractions that may take many forms and be implemented in a variety of ways. In that sense, tDSS is an ordered set of logical tiers T1, . . . , Tn, whereby both the upper tiers (including T1) and the lower tiers (including Tn) collaborate to provide distributed redundant storage, with each tier Ti (1<=i<=n) having a certain Class of Storage label. Each CoS in turn reflects the tier's characteristics in terms of its persistence, underlying storage media, type of (server storage) interconnect, the tier's locality as far as clients and applications, and/or I/O performance under a variety of application-generated workloads.
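The logical nature of tiers described above can be illustrated with a minimal sketch. The names `ClassOfStorage`, `Tier`, and `build_tdss`, as well as the server IDs, are hypothetical and chosen for illustration only; the sketch assumes the two-server-type, four-tier arrangement discussed earlier and simply orders tiers by their CoS index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassOfStorage:
    index: int        # 1 = highest (primary) class, n = lowest
    persistent: bool  # False for volatile (e.g., RAM-based) tiers
    media: str        # e.g., "RAM", "SSD", "SATA"


@dataclass(frozen=True)
class Tier:
    name: str
    cos: ClassOfStorage
    servers: tuple    # IDs of servers contributing resources to this tier


def build_tdss(tiers):
    """Return the tiers ordered from the highest CoS (T1) to the lowest (Tn)."""
    return sorted(tiers, key=lambda t: t.cos.index)


# Hypothetical server IDs: two server types back four logical tiers.
type_a = ("a1", "a2")
type_b = ("b1", "b2", "b3")

tdss = build_tdss([
    Tier("T4", ClassOfStorage(4, True, "SATA"), type_b),
    Tier("T1", ClassOfStorage(1, False, "RAM"), type_a),
    Tier("T3", ClassOfStorage(3, False, "RAM"), type_b),
    Tier("T2", ClassOfStorage(2, True, "SSD"), type_a),
])
```

Note that T1 and T2 in this sketch reference the same servers, which is why I/O between them would behave like local reads and writes.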
In all cases, as per the present disclosure, tDSS tiers—volatile and persistent, fast and slow, local and remote, relatively small in size and counting petabytes of usable capacity—all tDSS tiers collaborate to provide optimal and configurable combination of I/O performance and data availability.
Application SLAs and I/O processing in the tDSS
Heterogeneous multi-Tiered Distributed Storage System (tDSS):
In summary, tDSS allows combining different storage hardware and software within a unified multi-tiered distributed storage system, to optimally and adaptively balance I/O performance (via upper storage tiers) and availability (via lower storage tiers). Application I/O requests, implicitly or explicitly associated with application or user level SLAs, are matched to the underlying storage tiers (in the order from coarse- to fine-grained control):
per user and/or per application;
per dataset;
per I/O request;
per copy-of-the-data; and finally
per plurality of erasure encoded slices.
Read and write processing is done according to the corresponding per-storage tier CoS labels as explained herein.
Service-level agreements (SLAs) can broadly be defined as the portion of storage resources (capacity, performance) and associated services (redundancy, availability in presence of failures) delivered to the user or application based on the pre-defined policies associated with the latter. There's currently no standard SLA definition in storage industry, as well as no de facto accepted standard on how to propagate an SLA from users and applications through storage protocols and management systems to storage arrays. Those skilled in the art will appreciate therefore, that user and application SLAs take many forms and are implemented in multiple different custom ways.
In one implementation, for instance, SLAs are simply numbered from the highest (1) to the lowest (n) where n is the number of storage tiers, which provides for an immediate mapping to the underlying n storage tiers T1, . . . Tn via the corresponding CoS labels CoS1, . . . CoSn.
Another implementation provides for SLAs formulated in terms of the end-to-end I/O latency that must be within a given range for the 99th percentile of I/O requests, with data availability withstanding a given set of exceptional events. In this implementation, the corresponding SLA=>CoS mapping takes into account detailed performance, capacity and the capability of the underlying storage servers, both provisioned (e.g., type of local storage) and runtime (e.g., current utilization of CPU, memory and local storage).
In the exemplary implementation, the SLA contains two parts: administrative or “static” and probabilistic or “dynamic”. The static part of the SLA, denoted henceforth as SLA-s, specifies tDSS storage resources and storage services that are “statically” required, that is, do not depend on runtime conditions. For instance, a storage administrator may want to “statically” require that application A always uses a storage tier that is based on SSDs in at least 5 copies in two different failure domains, while application B must store its content on rotating hard drives in at least 3 copies. Those SLA requirements do not necessarily need to be formulated as MUST-haves; some of them may be merely desirable (SHOULD or MAY) and, when not met, ignored in subsequent I/O processing. However, what is important is that SLA-s by definition does not depend on (and is not formulated in terms of) runtime load, utilization and/or performance of the tDSS or its tiers or its storage servers at any given point in time.
On the other hand, the dynamic part of the service-level agreement (denoted henceforth as SLA-d) provides for adaptive load balancing within the SLA-s defined static boundaries for a given user or application, and specifies parameters in terms of storage performance, latency, IOPS, throughput, and overall storage utilization. The latter changes constantly at runtime: SLA-d is used to control and influence this change. To give an example, SLA-d may include a “98th percentile” requirement on read and write latencies: for 98% of all I/O requests by a given application, latency must remain under 0.1 ms for reads and under 0.3 ms for writes.
Even though service-level agreements remain today with no widely accepted definition (albeit with numerous custom implementations throughout the storage industry), the two SLA parts outlined above, administrative (static) and probabilistic (dynamic, depending on the runtime conditions), do exist and do complement each other. The present disclosure provides a novel and consistent way to handle both parts independently of their custom parameters that are often tailored to specific applications and custom management policies. This, plus the capability to adaptively balance the I/O across all storage tiers, will allow system administrators and IT managers to reconcile tradeoffs (e.g., storage capacity vs. performance vs. cost), and at the same time quickly and optimally react to changing requirements as far as scale (new applications, additional users, additional tiers and/or storage servers) and user SLAs.
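The “98th percentile” example of SLA-d can be checked against collected latency samples roughly as follows. This is a sketch under stated assumptions: the function names `percentile` and `sla_d_met` are hypothetical, the nearest-rank percentile definition is one of several in common use, and latencies are assumed to be given in milliseconds.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100.

    Returns the smallest sample such that at least pct% of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("no samples collected")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def sla_d_met(read_ms, write_ms, pct=98, read_limit=0.1, write_limit=0.3):
    """True when the pct-th percentile latencies stay under both limits."""
    return (percentile(read_ms, pct) < read_limit
            and percentile(write_ms, pct) < write_limit)
```

For instance, 100 read samples with a single 0.5 ms outlier still satisfy the 0.1 ms limit at the 98th percentile, while three such outliers do not.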
The present invention provides extremely flexible and highly configurable system-defined enhancements for inline processing of user- and application-generated I/O workloads in multi-tiered distributed environments. The term “inline” is used here to clearly define the field (illustrated earlier on the
To optimize inline I/O processing, implementations of the present application take into account pre-defined policies (that in turn align to per-user, per-application service-level agreements) and tDSS (that is, its tiers and servers) parameters, including runtime space utilization, usage statistics, I/O performance and other measurable parameters. In an exemplary implementation, conforming with user and application SLAs (both the SLA-s and SLA-d parts) is achieved using COST( ) and MATCH( ) functions as described below.
Furthermore, in accordance with an embodiment of the invention, a “minimal CoS” label may be used. The minimal CoS label may be, optionally, assigned to each I/O request. The minimal CoS label may be used to prevent dispatch of the I/O request to any tier whose CoS is lower than the specified minimal CoS. (As stated above in the 4-tier examples, T1 would be considered the highest or primary tier, and T4 would be considered the lowest.) For instance, if a tier becomes overloaded, the data destined for that tier may be placed on a lower tier if the lower tier is at or above the minimal CoS. If a tier with the CoS at or above the minimal CoS cannot be found, the request is not dispatched, and the upper layer is notified that the I/O request cannot be carried out with the requested range of CoS at present. This technique of matching I/O requests to tDSS tiers may be referred to as a “best-effort matching” method.
In accordance with an implementation of the application, under the best-effort matching method, I/O requests labeled with an SLA label (and optionally, a minimal CoS label) that do not have their mapping to tDSS classes of storage configured, as well as I/O requests with no SLA label specified, may be assumed to be labeled with a default CoS label such that the best effort match may be applied. The default CoS label may be configured using, for example, a Storage Management System (SMS) to specify the default configuration while at the same time providing the capability to override the default configuration to further optimize tDSS resource utilization and, simultaneously, I/O performance. The default configuration may be overridden on a per application basis, per user basis, per stored object or file basis, or per I/O request basis.
As discussed above, tDSS may be organized as a collection of classes of storage, each including a collection of storage servers organized in one or more storage tiers. Each storage tier Tj is assigned CoSj, and all storage servers of the tier Tj share the same CoSj label. Specific data/metadata types may be mapped to specified classes of storage of the underlying storage servers. Such mapping and other required configuration may be done in a variety of conventional ways. For instance, a dedicated Storage Management System (SMS) may be used to allow system and storage administrators to configure all, or a subset of, the required configuration variables.
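The best-effort matching method can be sketched as a walk from the requested tier down to the minimal CoS. The function name `best_effort_match`, the `NoEligibleTier` exception, and the `is_overloaded` callback are assumptions made for illustration; CoS labels are modeled as integers where 1 is the highest class.

```python
class NoEligibleTier(Exception):
    """Raised so the upper layer learns the request cannot be dispatched."""


def best_effort_match(tiers, requested_cos, minimal_cos, is_overloaded):
    """Pick the highest non-overloaded tier between requested and minimal CoS.

    tiers         -- dict mapping CoS index (1 = highest) to a tier name
    requested_cos -- CoS index the request is destined for
    minimal_cos   -- lowest acceptable CoS index (numerically >= requested)
    is_overloaded -- callable taking a CoS index, returning True if overloaded
    """
    for cos in range(requested_cos, minimal_cos + 1):
        if cos in tiers and not is_overloaded(cos):
            return tiers[cos]
    raise NoEligibleTier(
        f"no tier available in CoS range {requested_cos}..{minimal_cos}"
    )
```

With tiers T1 through T4 and T1 overloaded, a request destined for CoS 1 with minimal CoS 2 would land on T2; if T2 were overloaded as well, the request would not be dispatched.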
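One possible way to resolve the CoS label for a request, including the default and its overrides, is sketched here. The precedence order (per request, then per object, per user, per application) is an assumption, since the text does not specify one; `DEFAULT_COS`, `resolve_cos`, and the dictionary-based request are likewise hypothetical.

```python
DEFAULT_COS = 3  # hypothetical system-wide default, e.g., set via the SMS


def resolve_cos(request, sla_to_cos, overrides, default_cos=DEFAULT_COS):
    """Return the CoS label to use for best-effort matching.

    request    -- dict with optional keys: "sla", "request_id", "object",
                  "user", "app"
    sla_to_cos -- configured SLA label => CoS label mapping
    overrides  -- dict keyed by (scope, identifier) => CoS label
    """
    sla = request.get("sla")
    if sla is not None and sla in sla_to_cos:
        return sla_to_cos[sla]
    # No usable SLA mapping: try overrides, most specific scope first
    # (assumed precedence), then fall back to the default CoS label.
    for scope in ("request_id", "object", "user", "app"):
        key = (scope, request.get(scope))
        if key in overrides:
            return overrides[key]
    return default_cos
```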
Note that the system may automatically assign default values to the threshold and weighting parameters, and therefore a storage management system (SMS) driven configuration may be optional. The SMS may also vary the parameters dynamically in order to achieve the desired ranges or component utilization. A particular implementation of the system described in the present application is not required to use all the parameters described above, or may use additional parameters that suitably describe the specific storage tiers and subsystems.
The present application discloses two new functions for the I/O subsystem for a tiered distributed storage system. These functions are: MATCH( ) function that selects storage tiers for performing I/O request (see
The MATCH( ) and COST( ) functions in combination serve the ultimate purpose of implementing user SLA explicitly or implicitly associated with the I/O, while at the same time optimizing I/O processing and data availability. Both MATCH( ) and COST( ) functions operate inline, as an integral part of the I/O processing datapath. For each I/O request MATCH( ) function:
Implementations of this application support a broad variety of both MATCH( ) and COST( ) functions, with multiple examples detailed further in this disclosure. MATCH( ) function will, for instance, filter out those storage tiers where per gigabyte cost of storage is higher than the one provided with a given instance of a service-level agreement. Within the at least two remaining tiers, the COST( ) function will answer the questions of the type: which of the two storage tiers to use for the SLA requiring 98th percentile of I/O latency to remain under 1 ms, given that storage servers from the first tier provide 100K IOPS and are currently 90% utilized, while servers from another tier provide 1200 IOPS and are currently 15% utilized (which also would typically indicate that at least some of the “other tier” servers have currently empty queues of outstanding I/Os and are therefore fully available).
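The two-step selection described above can be sketched as a price filter followed by a cost comparison. The cost model used here (inverse of spare IOPS) is only one hypothetical choice and is not taken from the disclosure; a model that also weighs queue depth could rank the tiers differently. The per-gigabyte prices and the names `match`, `cost`, and `select_tier` are illustrative assumptions.

```python
def match(tiers, max_price_per_gb):
    """Filter out tiers whose per-gigabyte price exceeds the SLA's limit."""
    return [t for t in tiers if t["price_per_gb"] <= max_price_per_gb]


def cost(tier):
    """Hypothetical cost model: the less spare IOPS, the higher the cost."""
    spare_iops = tier["max_iops"] * (1.0 - tier["utilization"])
    return float("inf") if spare_iops <= 0 else 1.0 / spare_iops


def select_tier(tiers, max_price_per_gb):
    """Return the name of the cheapest matching tier, or None if none match."""
    candidates = match(tiers, max_price_per_gb)
    if not candidates:
        return None
    return min(candidates, key=cost)["name"]


# IOPS and utilization follow the example in the text; prices are made up.
tiers = [
    {"name": "ram", "price_per_gb": 5.00, "max_iops": 1_000_000, "utilization": 0.10},
    {"name": "first", "price_per_gb": 0.50, "max_iops": 100_000, "utilization": 0.90},
    {"name": "other", "price_per_gb": 0.05, "max_iops": 1_200, "utilization": 0.15},
]
```

Under this particular model the first tier still wins at 90% utilization because its spare IOPS remain roughly ten times higher; once its utilization approaches saturation, the other tier becomes the cheaper choice.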
Similar to mapping of I/O requests to classes of storage with CoS labels in I/O requests, an implementation of the present application assigns the responsibility of maintaining a mapping of I/O request labels to additional or alternative processing stages to the upper layers of software mentioned earlier. This allows tiered distributed storage system to avoid maintaining extra states and to concentrate on carrying out the requested processing in the most efficient fashion.
Initially, data and its directly or indirectly associated SLA descriptor is received via tDSS 404 storage access point 406 (
In addition to the MATCH( ) function that narrows down the set of SLA-matching tiers and translates service-level agreements to (storage-level) CoS labels, the disclosure provides a COST( ) function that takes into account the utilization and performance statistics described herein, to optimize I/O processing at runtime. The COST( ) function computes a descriptor that includes “pipeline-modifiers”; the latter is then propagated to the read( ) and write( ) processing routines to optimize their respective I/O processing pipeline stages. This is exactly why, in the implementations, the COST( ) function is invoked even when the MATCH( ) function yields a single matching destination storage tier.
In an exemplary implementation, the COST( ) function takes the following arguments:
Further, as stated above, the COST( ) function computes and returns pipeline-modifiers, a name/value list of properties that, in combination, provide implementation-specific hints on executing the current I/O request against the selected (matching, computed by the MATCH( ) function) tiers and their respective storage servers. In its most reduced form, pipeline-modifiers is empty, whereby the COST( ) function simply returns a boolean true or false, indicating either a “green light” to execute the current I/O, or a prohibitive cost for the selected storage tier.
Further, each instance of computed pipeline-modifiers references the specific class of storage (further denoted as pipeline-modifiers->CoS) for which those pipeline modifying parameters were computed. For example, given two matching tDSS tiers one of which is SSD-based and the other HDD-based and given the fact that SSDs are typically order(s) of magnitude more expensive on a per-GB basis, pipeline-modifiers could look as follows:
pipeline_modifiers_ssd={CoS-ssd, use_compression=true};
pipeline_modifiers_hdd={CoS-hdd, use_compression=false};
In other words, the I/O pipeline will either include the data compression stage or not, depending on the targeted storage tier. The same of course applies to the rest of the I/O pipeline stages, where the corresponding at-runtime modifications are warranted based on the computed cost (the result of the COST( ) function, which in turn is based on the collected runtime statistics described herein) and conform to the user and application service-level agreements.
Those skilled in the art will note that, partly due to the open definition of service-level agreements, it is possible and sometimes may be feasible to require, for instance, inline compression via the static part of the SLA. The present disclosure does support the corresponding implementations: those are the cases where the set of COST-modifiable aspects of the specific I/O pipelines is narrowed down (to exclude, for instance, compression). In the common case, though, whether the data is compressed, deduplicated and how it is dispersed across storage servers must be irrelevant for the user as long as the system provides for the given business requirements, including I/O performance and data availability, at or below a given cost.
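A minimal sketch of how a write routine might consume the pipeline-modifiers shown above: the compression stage runs only when the modifier for the targeted class of storage requests it. The dictionary layout and the function name `write_pipeline` are assumptions for illustration, and zlib stands in for whatever compression an actual implementation would use.

```python
import zlib

# Hypothetical encoding of the two pipeline-modifiers descriptors.
PIPELINE_MODIFIERS = {
    "CoS-ssd": {"use_compression": True},
    "CoS-hdd": {"use_compression": False},
}


def write_pipeline(data: bytes, cos: str) -> bytes:
    """Apply the optional stages selected for this class of storage.

    Returns the bytes that would be handed to the tier's write routine.
    """
    modifiers = PIPELINE_MODIFIERS[cos]
    if modifiers.get("use_compression"):
        data = zlib.compress(data)  # compression stage included
    return data
```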
In the implementations, I/O pipeline modifiers (denoted as ‘pipeline-modifiers’ herein) is a descriptor that includes an associated per-tier CoS and a list of names and values that provide information directly into the class of storage specific read( ) and write( ) implementations with respect to data checksumming, compression, encryption, deduplication, distribution (dispersion), read caching and writeback caching. To give another example, I/O pipeline modifiers could include the following hint on executing the deduplication stage (that is, performing inline deduplication):
use_deduplication=CPU-utilization<threshold1
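The deduplication hint above can be evaluated at runtime along the following lines. This is a sketch only: the threshold value of 0.75, the content-hash keying via SHA-256, and the names `dedup_enabled` and `store_chunks` are assumptions, and the dictionary `store` stands in for a tier's storage.

```python
import hashlib


def dedup_enabled(cpu_utilization, threshold1=0.75):
    """Evaluate the hint: deduplicate inline only while the CPU has headroom."""
    return cpu_utilization < threshold1


def store_chunks(chunks, cpu_utilization, store):
    """Write chunks into store (a dict) and return the keys referencing them."""
    use_dedup = dedup_enabled(cpu_utilization)
    refs = []
    for chunk in chunks:
        if use_dedup:
            # Identical chunks share one content-derived key.
            key = hashlib.sha256(chunk).hexdigest()
        else:
            # Deduplication stage skipped: every chunk gets its own key.
            key = f"raw-{len(store)}"
        store.setdefault(key, chunk)
        refs.append(key)
    return refs
```

With three chunks of which two are identical, a lightly loaded server would store two entries, while a heavily loaded one would skip the stage and store all three.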
To compute pipeline-modifiers, the present disclosure builds upon existing art of monitoring, evaluating and balancing I/O performance. This includes both direct monitoring often done via an external storage management system (SMS), and feedback-based monitoring, whereby the storage servers themselves provide their respective load and utilization, either directly to the servers implementing storage access points (
Implementations of the present disclosure integrate MATCH( ) and COST( ) operation with the conventional I/O processing mechanisms for distributed storage.
In particular,
Next, the MATCH( ) function maps this I/O request to the upper tier T1 (407), and the subsequent COST( ) function computes the current pipeline-modifiers estimating the cost for T1 to execute this I/O request. Further, the remaining steps to store copies of user data are shown, whereby the first two copies are stored synchronously and the remaining two copies are stored asynchronously, which in turn is defined by the corresponding classes of storage (e.g., CoS1=>synchronous, CoS2=>synchronous, CoS3=>asynchronous, CoS4=>asynchronous) or the implementations of the corresponding per-CoS write1( ), write2( ), write3( ), write4( ) routines. In one implementation, a single common logic to write data onto a storage tier is cosmetically modified to execute the actual write operation in a separate thread or process, and immediately return to the caller upon triggering this thread (or process). The corresponding implementation is then named write3( ) and connected to the class of storage associated with tier T3, as illustrated on
Addresses of the stored copies of data are obtained in one of the conventional ways (those skilled in the art will notice that this step is typically executed by first reading and processing the corresponding metadata, via a DSS-specific implementation of Metadata Service). Those addresses (of the copies of data) will specify locations of each copy of data in the storage tiers T1, T2, T3, and T4, in terms of the server IDs (for the servers in the corresponding storage tier that store parts or all of the requested data), followed typically by within-server addresses, including local disks and logical block addresses (LBA) on those disks.
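The address of a stored copy, as described above, can be modeled as a small record, with a metadata lookup returning one record per copy. `CopyAddress` and the in-memory `MetadataIndex` are hypothetical stand-ins used for illustration; they are not the Metadata Service of any particular DSS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyAddress:
    tier: str       # storage tier holding this copy, e.g., "T2"
    server_id: str  # server within that tier
    disk: str       # local disk on that server
    lba: int        # logical block address on that disk


class MetadataIndex:
    """In-memory stand-in for a metadata lookup keyed by object ID."""

    def __init__(self):
        self._index = {}

    def record(self, object_id, address):
        self._index.setdefault(object_id, []).append(address)

    def locate(self, object_id):
        """Return the addresses of all known copies (empty list if none)."""
        return list(self._index.get(object_id, []))
```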
Next, for each stored copy of data, this implementation executes MATCH( ) function 508 to select the corresponding CoS tier(s), as well as the CoS-associated concrete reading mechanism(s). In an exemplary implementation, the MATCH( ) function takes the following arguments:
Further, the MATCH( ) function returns the list of tiers 509 filtered with respect to the provided SLA and its other arguments listed above. In the exemplary implementation the actual data-reading logic is optimized for the corresponding classes of storage. In accordance with the present disclosure, each class of storage can optionally provide a mapping to an alternative reading mechanism with respect to synchronicity (synchronous vs. asynchronous read), caching in the server's memory (don't cache, cache-and-hold, etc.), seek-optimizing I/O scheduling (or lack thereof in the case of non-rotating media), and other I/O pipeline variations known in the art.
Specifically in this case, the MATCH( ) function returns three lower tiers out of the four tDSS tiers, and their corresponding classes of storage; as a footnote, one typical use case for the primary tier T1 to get filtered out would be cost of storage on a per gigabyte basis. Further, the MATCH-ed class of storage labels are each associated with a per-CoS optimized reading method, as shown on
CoS2=>read2( ), CoS3=>read3( ), CoS4=>read4( )
Next, prior to executing the data read itself, the implementation executes the COST( ) function 511, to further narrow down the list of possible reading destinations based on the estimated “cost” of performing the reading operation. In an exemplary implementation, the COST( ) function takes the following arguments:
Based on this input, the COST( ) function selects the best matching tiers with respect to the collected runtime statistics; in addition it generates pipeline-modifiers to be further utilized by the specific read( ) implementations to adjust their I/O processing stages on the fly or in real-time. In the implementation illustrated by
COST(T4)>>COST(T2)
COST(T4)>>COST(T3)
Thus, the COST( ) function narrows down the MATCH-ing tiers to T2 and T3 (512) to execute the read request. Tier T2 is then read asynchronously via its CoS2-associated read2( ) routine (513), while T3 is read synchronously (514) via the read implementation logic denoted as read3( ).
Those skilled in the art will appreciate that seemingly redundant reading may be designated to a) perform data prefetch for those workloads that exhibit a good degree of spatial and/or temporal locality, and b) increase the probability of executing the I/O within the specified performance boundaries. Note also that in both cases, the respective read( ) implementations receive pipeline-modifiers descriptors computed by the COST( ) function, to further optimize their I/O processing at runtime. In certain implementations, the corresponding optimization includes selecting the least loaded servers that store the copies, in the presence of alternatives and assuming the remaining SLA-required operational parameters are within their prescribed boundaries (thresholds) as described further herein.
Finally, the first received (that is: good, validated) copy of data is returned to the requesting client, back via the storage access point.
In modern storage systems data redundancy and protection are often realized using erasure encoding techniques that have their roots in the Reed-Solomon codes developed in the 1960s. The corresponding art, prior to storing data on distributed stable storage, divides the original data into m slices (m>=1), transforms those m slices into (m+k) coded slices (k>=1), and then stores all (m+k) coded slices onto (m+k) storage servers, one coded slice per server. The art of erasure coding defines how to encode the data into the collection of (m+k) servers such that upon failure of any k servers the original data can be recovered from the remaining m servers.
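The (m+k) scheme described above can be illustrated with a deliberately simplified stand-in: byte-wise XOR parity, which corresponds to the special case k=1. This is not the Reed-Solomon or Cauchy Reed-Solomon transformation itself; it is only a minimal sketch of the "any m of (m+k) slices suffice" property, and the function names (split_into_slices, xor_slices, encode, recover) are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def split_into_slices(data: bytes, m: int) -> List[bytes]:
    """Divide data into m equal-size slices, zero-padding the tail."""
    size = -(-len(data) // m)  # ceiling division
    padded = data.ljust(size * m, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(m)]


def xor_slices(slices: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized slices."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*slices))


def encode(data: bytes, m: int) -> List[bytes]:
    """Return m data slices plus one parity slice (the k = 1 case)."""
    slices = split_into_slices(data, m)
    return slices + [xor_slices(slices)]


def recover(coded: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the single missing slice (marked None) from the survivors."""
    missing = [i for i, s in enumerate(coded) if s is None]
    if len(missing) > 1:
        raise ValueError("XOR parity tolerates only one lost slice")
    if missing:
        survivors = [s for s in coded if s is not None]
        coded = list(coded)
        # XOR of all (m+1) slices is zero, so XOR of the survivors is the lost one.
        coded[missing[0]] = xor_slices(survivors)
    return coded
```

With m=4, encode( ) yields five coded slices; losing any one of them (for example, one storage server) still allows the original data to be rebuilt. Tolerating k>1 failures requires a true Reed-Solomon style code.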
Next, instead of, as disclosed above, storing the 2nd and 3rd copies of the data on the corresponding storage tiers T2 and T3, this implementation performs a Cauchy Reed-Solomon (CRS) transformation 610: namely, for a pre-configured pair (m, k) the original data is transformed into (m+k) coded slices (often also called “slices” or “chunks”). Out of this plurality of (m+k) coded slices, an arbitrary set of m coded slices is sufficient to compute the original data, which also means that if all coded slices are stored on different storage servers, the system may tolerate a simultaneous failure of k servers.
Further, COST( ) function 611 is then invoked, to select m least loaded servers of the tier T2. In an exemplary implementation the COST( ) function takes the following arguments:
The first m coded slices are then stored on the selected m storage servers of tier T2 (612) using the CoS-associated write2( ) routine. Finally, the COST( ) function 613 is invoked again, this time to select the k least loaded servers of tier T3; the remaining (redundant) k coded slices are then written onto the selected T3 servers using the write3( ) method 614 associated with the corresponding CoS3 label (of T3).
For the cases where m>1 and k<m this implementation will provide a better data availability and better space utilization than the implementation illustrated in
For instance, instead of storing m slices on T2 and k remaining slices on T3, the implementation could store m+1 slices on T2 and k−1 slices on T3 (where k>1), thus increasing the fault tolerance of the T2 to withstand a loss of one T2 server, as far as original user data is concerned.
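The placement of coded slices onto the least loaded servers of T2 and T3 can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed COST( ) logic: each server's load is reduced to a single hypothetical number, and the names least_loaded and place_slices, as well as the t2_count parameter (m in the basic case, m+1 in the variation just described), are invented for the example.

```python
from typing import Dict, List, Tuple


def least_loaded(load_by_server: Dict[str, float], count: int) -> List[str]:
    """Pick `count` servers with the smallest load value."""
    if count > len(load_by_server):
        raise ValueError("not enough servers in the tier")
    return sorted(load_by_server, key=load_by_server.get)[:count]


def place_slices(
    coded: List[str],
    t2_load: Dict[str, float],
    t3_load: Dict[str, float],
    t2_count: int,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Map the first t2_count slices to T2 servers and the rest to T3 servers."""
    t2_servers = least_loaded(t2_load, t2_count)
    t3_servers = least_loaded(t3_load, len(coded) - t2_count)
    on_t2 = dict(zip(t2_servers, coded[:t2_count]))  # one slice per server
    on_t3 = dict(zip(t3_servers, coded[t2_count:]))
    return on_t2, on_t3
```

For (m, k) = (3, 2), calling place_slices with t2_count=3 reproduces the basic split, while t2_count=4 corresponds to storing m+1 slices on T2 and k−1 slices on T3.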
In all cases and for all variations, though, the MATCH( ) and COST( ) functions are used to match service-level agreements to the storage tiers on the one hand, and, on the other, to optimize the I/O processing based on user- or application-specific and storage-specific measurable parameters, including resource utilization and I/O performance.
In this and other implementations, numerous concrete details are set forth to provide a more thorough understanding of the invention. However, it will be apparent to one of ordinary skill in the art that the invention may be practiced without those specific details.
Per block 707, the plurality of (m+k) coded slices is subdivided further into a plurality of subsets of coded slices, to store each subset in one of the storage tiers. In accordance with an implementation of the application, the sum (or union) of all subsets includes each one of the (m+k) erasure coded slices at least once, possibly with repetitions (the simplest variation of the above is, of course, a single set of the original (m+k) coded slices). This step, as well as the encoding of block 706, is performed only if erasure encoding is configured; otherwise, the execution continues from block 708.
If, as previously determined through block 704, there is at least one full copy of data to be stored in the tDSS, the execution then continues from block 708; otherwise it proceeds to block 714. Further, blocks 709 through 713 constitute the loop 708, the set of instructions executed on a per copy of data basis. Block 709 performs the MATCH( ) function for each copy of data, with arguments of the function including:
i) SLA-s associated with the I/O request;
ii) a request type (read, write);
iii) a type of content to write (full copy, set of coded slices);
iv) number or index of the copy (1, 2, 3, . . . );
v) size of the data that must be stored
Based on this input, MATCH( ) function computes matching classes of storage—a subset of {CoS} defined for the tDSS tiers; if there are no matches, the execution skips to 714.
Further, for each i-th full copy (where 1<=i<=C, as per 708) and its matching classes of storage 711, COST( ) function is called. In the exemplary implementation, COST( ) function takes the following arguments:
At this point, what remains is to perform an actual write operation. As stated, the implementation enhances existing art, that is, the essential capability of conventional distributed storage systems to read and write data. Per block 713, the previously computed pipeline-modifiers-i (712) references the computed class of storage (notation pipeline-modifiers-i->CoS); the latter in turn is associated with a write( ) method optimized to write data onto the corresponding storage media. Finally in that sequence of references, the corresponding write( ) routine is invoked; notice that the pipeline-modifiers-i (712) is passed on to the write method as one of its arguments, to provide the writing logic with additional information that is used in the implementations to further optimize I/O pipeline as disclosed herein. Blocks 712 and 713 are iterated within the loop 711 until there are no more pairs (i-th copy of data, set of matching CoS).
Notation CoS->write( ) indicates the write( ) method that is tuned up specifically for its associated class of storage. In the exemplary implementation, for instance, writing to primary tier is done synchronously, while writing to lower tiers may be asynchronous (
Finally, steps 714 through 719 execute a very similar procedure of writing erasure encoded slices onto matching tDSS tiers. Blocks 715 through 719 are executed for each subset of coded slices within the 714 loop. Here again, given the user SLA, for each subset of coded slices (and its index j in the 714 sequence) we first compute matching classes. In presence of matches 716 the block 718 computes the optimal (with respect to the tiers' utilization and in accordance with the user SLA, as already described herein) pipeline-modifiers and its corresponding class of storage (pipeline-modifiers-j->CoS). Finally, the coded slices are written onto the computed storage tier 719 using the write( ) routine that is specifically tuned up for its associated class of storage. Blocks 718 and 719 are iterated within the nested loop 717 until there are no more pairs (j-th subset of coded slices, set of matching CoS).
In an exemplary implementation, the following performance statistics are tracked for each storage server: used and free space (statistics S1u and S1f, respectively); current and moving average of disk utilization (statistics S2c and S2a, respectively); current and moving average of CPU utilization (statistics S3c and S3a, respectively); current and moving average of the server's end-to-end read latency (statistics S4cr and S4ar, respectively); and finally, current and moving average of the server's end-to-end write latency (statistics S4cw and S4aw, respectively).
Statistics S1u and S1f are henceforth collectively denoted as S1*; the same convention holds for the remaining statistics described herein.
Statistics S1* and S4* are measured at the storage server level. S2* is averaged over the server's (directly or indirectly) attached disks if and only if the corresponding tier utilizes the server's persistent storage; otherwise, the exemplary implementation sets the S2* statistics to zero. (Those skilled in the art will recognize that the latter is due to the fact that, with modern DRAM technology, circumstances in which the memory itself becomes a bottleneck are extremely unlikely; on the other hand, the S1* statistics are important and are tracked for RAM-based tiers as well.)
Moving Averages
In an exemplary implementation, moving averages for server utilizations and latencies are computed as follows. Let X be the current moving average, and x be the value of the corresponding statistic measured during the most recent measurement cycle (a.k.a. epoch). Then the recomputed average X will be:
X=alpha*x+(1−alpha)*X,
where 0.1<alpha<1
In other words, the implementations continuously compute and adjust moving averages based on the most recent value of the corresponding statistics. The ‘alpha’ (above) reflects a bias to consider the most recent value more (or less) important than the accumulated history. In one implementation, the value of alpha is set to 0.6.
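The update rule above is an exponentially weighted moving average. A minimal sketch follows, using the alpha range and the 0.6 default stated in the text; the function name update_moving_average is hypothetical.

```python
def update_moving_average(current_avg: float, sample: float, alpha: float = 0.6) -> float:
    """Return alpha*x + (1 - alpha)*X for the newest sample x and prior average X."""
    # Range check mirrors the constraint given in the text: 0.1 < alpha < 1.
    if not 0.1 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0.1 < alpha < 1")
    return alpha * sample + (1.0 - alpha) * current_avg
```

For example, with a prior average of 10 and a new sample of 20, alpha=0.6 moves the average to 16, that is, most of the way toward the latest measurement.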
Further, all the collected statistics (above) are aggregated for storage tiers (or, equivalently, for the corresponding classes of storage) as known functions of the corresponding values of storage servers. For instance, S1u (used space) for a class of storage is the sum of all S1u counters of the storage servers that comprise the corresponding storage tier, whereas S2c (current disk utilization) is the average of the S2c values for the servers in the tier (note that the maximum function may also be a good choice in other implementations, depending on the storage tier organization and the optimization goals pursued by the system designer).
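The per-tier aggregation can be sketched as follows for the two statistics named in the example, S1u and S2c. The ServerStats record and the aggregate_tier function are hypothetical names, and the use_max switch models the alternative (maximum instead of average) mentioned above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class ServerStats:
    s1u: int    # used space, in bytes
    s2c: float  # current disk utilization, in percent


def aggregate_tier(servers: List[ServerStats], use_max: bool = False) -> Dict[str, float]:
    """Aggregate per-server statistics into per-tier (per-CoS) values."""
    combine = max if use_max else mean
    return {
        "S1u": sum(s.s1u for s in servers),       # used space adds up
        "S2c": combine(s.s2c for s in servers),   # utilization is averaged (or maxed)
    }
```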
In this exemplary implementation, the following thresholds are stored and used to implement the COST( ) function: high and low watermarks for the percentage of used space (parameters HS1 and LS1, respectively); high and low watermarks for disk utilization (parameters HS2 and LS2, respectively); high and low watermarks for CPU utilization (parameters HS3 and LS3, respectively); high and low watermarks for the end-to-end read latency (HS4r and LS4r); high and low watermarks for the end-to-end write latency (HS4w and LS4w).
Further, parameters include weights W1, W2, W3, W4r and W4w that may be used in the implementations to implement the COST( ) function. In one implementation, the COST function implements the following pseudo-coded sequence:
The rationale behind this particular implementation is as follows. For a statistic that is below its predefined low-watermark threshold, we assume its contribution to the aggregated cost to be zero. Otherwise, if the statistic falls into the corresponding low/high interval, we first normalize it as a percentage of this interval and add the result to the cost using its corresponding weight, one of W1 through W4w (above).
Notice that if the statistic measures above its configured high watermark, the COST function in this implementation returns, effectively, the maximum 64-bit value, which is further interpreted as an “infinite” (that is, “prohibitive”) cost of using the corresponding storage tier for this I/O request.
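The rationale above can be expressed as a short sketch. It follows only the three rules stated in the text (zero below the low watermark, weighted percentage inside the interval, “infinite” above the high watermark); the dictionary-based signature and the name cost are assumptions made for illustration, and each high watermark is assumed to be strictly greater than its low watermark.

```python
from typing import Dict

MAX_UINT64 = 2**64 - 1  # interpreted as "infinite" (prohibitive) cost


def cost(
    stats: Dict[str, float],
    low: Dict[str, float],
    high: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted watermark-based cost of using a tier, per the stated rationale."""
    total = 0.0
    for name, value in stats.items():
        if value > high[name]:
            return MAX_UINT64  # above the high watermark: tier is prohibited
        if value < low[name]:
            continue           # below the low watermark: contributes zero
        # Normalize as a percentage of the low/high interval, then apply the weight.
        percent = 100.0 * (value - low[name]) / (high[name] - low[name])
        total += weights[name] * percent
    return total
```

For instance, a used-space statistic halfway between its watermarks with weight 2 contributes 100 to the cost, while any single statistic above its high watermark makes the tier ineligible regardless of the others.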
There are multiple implementations of the COST( ) function over the S1* through S4*, and similar. In other implementations, the matching (that is, computed by the MATCH( ) function—see
In yet other implementations, the COST( ) function is implemented to specifically control ratios of I/Os routed between the tDSS tiers. For instance, in a two-tier configuration the 50/50% ratio would effectively translate as a COST( ) function returning “infinite” and 0 (zero) for those two tiers in a round-robin fashion. This approach immediately extends to any finite set of percentages (with a 100% sum) that would control utilization of the same number of tDSS tiers.
To illustrate it further, consider the S4 (latency) statistic, or more exactly its per-tier measured moving averages S4ar and S4aw for reads and writes, respectively. Following is a pseudo-coded example for two tiers, T1 and T2:
In the first line (#1 above) we initialize the ‘ratio’ variable that controls usage of the tiers T1 and T2 on a per I/O basis. For any given computed ratio, the percentage of I/Os that utilize tier T1 is calculated as follows:
percentage=100*(ratio−1)/ratio;
Thus, setting the initial value to 2 yields exactly 50/50% for the tiers T1 and T2.
The second line (#2 above) doubles the percentage of I/Os routed to T1 if and only if the T2 latency is at least 10 times greater than that of T1. Finally, line #3 adjusts the percentage of T1-utilizing I/Os back in favor of T2 if the latency of the latter falls below 2x that of T1.
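The ratio arithmetic can be made concrete with a small sketch. The percentage formula and the 10x and 2x latency thresholds come from the text; the adjustment step (plus or minus one, with a floor of 2) and the names t1_percentage, adjust_ratio and pick_tier are assumptions chosen for illustration and are not the pseudo-code referred to above.

```python
def t1_percentage(ratio: int) -> float:
    """Share of I/Os routed to T1 for a given ratio: 100*(ratio-1)/ratio."""
    return 100.0 * (ratio - 1) / ratio


def adjust_ratio(ratio: int, t1_latency: float, t2_latency: float,
                 step: int = 1, min_ratio: int = 2) -> int:
    """Shift I/Os toward T1 when T2 is slow, and back when T2 recovers."""
    if t2_latency >= 10 * t1_latency:
        return ratio + step                  # assumed step: favor T1 more
    if t2_latency < 2 * t1_latency:
        return max(min_ratio, ratio - step)  # assumed step: move back toward T2
    return ratio


def pick_tier(io_index: int, ratio: int) -> str:
    """Of every `ratio` consecutive I/Os, send ratio-1 to T1 and one to T2."""
    return "T2" if io_index % ratio == ratio - 1 else "T1"
```

With ratio=2 the I/Os alternate between T1 and T2 (50/50%); with ratio=4 three out of every four I/Os use T1, matching 100*(4−1)/4 = 75%.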
Those skilled in the art will notice that this ratio-based approach is immediately extensible to support:
Examples of the latter include well-known TCP congestion control and congestion-avoidance algorithms such as Reno, New Reno and Vegas. TCP Vegas, for instance, teaches to estimate expected I/O performance based on the measured latency:
expected-throughput=pending-workload/latency;
where the ‘pending-workload’ is the size of the queue (in bytes) at the storage access point (
For a 3-tier tDSS, for instance, a set of percentage values (p1, p2, p3) where p1+p2+p3=100% would correspond to the following possible pseudo-coded implementation of the COST( ) function—one of the several other possible implementations:
(In the code above, the {return Ti;} statement is simplified for brevity, to indicate “infinite” cost for tiers other than Ti.)
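One possible way to realize a percentage-driven COST( ) for any number of tiers is sketched here. It is only an illustration under assumptions: the function name ratio_costs, the use of a running I/O index, and the choice to serve each tier's share as a contiguous block within every window of 100 I/Os are all hypothetical; an interleaved or randomized schedule would satisfy the same percentages.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence

MAX_UINT64 = 2**64 - 1  # "infinite" cost


def ratio_costs(io_index: int, percentages: Sequence[int]) -> List[int]:
    """Return zero cost for the tier chosen for this I/O and infinite cost otherwise."""
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    slot = io_index % 100                    # position within the current window
    bounds = list(accumulate(percentages))   # e.g. (50, 30, 20) -> [50, 80, 100]
    chosen = bisect_right(bounds, slot)      # first tier whose bound exceeds the slot
    return [0 if i == chosen else MAX_UINT64 for i in range(len(percentages))]
```

Over any 100 consecutive I/Os, a (50, 30, 20) configuration selects T1 fifty times, T2 thirty times and T3 twenty times.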
Further, in the exemplary implementation, all the collected statistics, as well as the aggregated computed cost are included in a pipeline-modifiers; the latter is then passed on into the read( ) and write( ) implementations as illustrated in
Note that the system may automatically assign default values to the threshold and weighting parameters, and therefore a storage management system (SMS) driven configuration may be optional. The SMS may also vary the parameters dynamically in order to achieve the desired ranges or component utilization. A particular implementation of the system described in the present application is not required to use all the parameters described above, or may use additional parameters that suitably describe the specific storage tiers and subsystems.
Further, for any given tier returned by the COST( ) function, the resulting pipeline-modifiers include, as the name implies, parameters that define or hint at how to execute specific I/O pipeline stages, including checksumming, inline compression, inline encryption, inline deduplication, data distribution (dispersion), read caching and writeback caching. One of the examples above reflects a rather straightforward tradeoff for inline deduplication between CPU utilization (to compute cryptographically secure fingerprints for the deduplicated data, for instance), the size of the dedup index, and the available storage capacity. Similarly, for inline compression the formula must include the tradeoff between CPU and I/O subsystem utilizations, and whether this tradeoff is warranted by the achieved compression ratio, for instance:
use_compression=CPU-utilization<threshold1
In the implementations, this is further extended to include adaptive hysteresis (e.g., the formula above must return true a certain number of times in a row, to smooth out short-lived fluctuations) as well as the current and moving averages of I/O subsystem utilization (statistics S2c and S2a herein), at least for the tiers that are based on persistent storage. For DRAM and SSD based tiers (especially for DRAM) the incentive to compress and/or dedup data inline will typically be rather strong, which is exactly why the implementations implement the COST( ) function on a per class of (the MATCH-ing) storage basis: each instance of computed pipeline-modifiers references a specific class of storage (denoted herein as pipeline-modifiers->CoS), whereby the latter in turn references CoS-specific (CoS-optimized) read( ) and write( ) implementations.
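The hysteresis described above ("true a certain number of times in a row") can be sketched as a small stateful helper. The class name CompressionDecider and its parameters are hypothetical, and only the CPU-utilization condition is modeled; a fuller version would also consult the S2c and S2a statistics.

```python
class CompressionDecider:
    """Enable inline compression only after the CPU check passes N times in a row."""

    def __init__(self, cpu_threshold: float, required_streak: int) -> None:
        self.cpu_threshold = cpu_threshold
        self.required_streak = required_streak
        self.streak = 0

    def update(self, cpu_utilization: float) -> bool:
        """Feed the latest CPU utilization sample; return the current decision."""
        if cpu_utilization < self.cpu_threshold:
            self.streak += 1
        else:
            self.streak = 0  # a single busy sample resets the streak
        return self.streak >= self.required_streak
```

With a threshold of 70% and a required streak of 3, two quiet samples are not enough to turn compression on, and one busy sample turns it off again until three further quiet samples are observed.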
In general, the rationale to task the COST( ) function to optimize I/O pipelines is directly based on the fact that the COST( ) is already working with collected utilization and performance statistics S1* through S4* to select destination storage tier(s) as described herein. In the implementations, I/O pipeline optimizing algorithms use the same information that is already used to select the least “costly” storage tier.
Rest of the
i) request type (read, in this case);
ii) size of the data;
iii) SLA-s associated with the I/O request;
iv) Class of storage of this copy of data
Based on this input, MATCH( ) function computes classes of storage—a subset of classes of storage defined for the tDSS tiers. In the exemplary implementation, the MATCH( ) is considered to succeed if and only if those computed classes of storage contain the class of storage of the stored copy of data itself—the argument (iv) above. If this is not true (for instance, if MATCH( ) returns an empty set { }), the corresponding copy is not being used to read the data.
As a side note, the sequence outlined above provides for a designed-in capability to support any variety of service levels based on the same identical, stored and replicated content. For instance, given two copies of data stored on T1 and T2 respectively, a copy that is stored on the primary tier will be read and returned only at (or beyond) a given level of service (denoted as SLA-s in this disclosure).
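The read-side rule (a stored copy is eligible only if its own class of storage is among the classes returned by MATCH( )) can be sketched as follows. The service-level names and the ALLOWED_COS table are entirely hypothetical stand-ins for whatever SLA-to-CoS policy a real MATCH( ) would implement; only the membership test reflects the text.

```python
from typing import Dict, List, Set

# Hypothetical policy: which classes of storage each service level may read from.
ALLOWED_COS: Dict[str, Set[str]] = {
    "gold": {"CoS1", "CoS2"},
    "bronze": {"CoS2"},
}


def match(sla_level: str) -> Set[str]:
    """Stand-in for MATCH(): return the classes of storage permitted by the SLA."""
    return ALLOWED_COS.get(sla_level, set())


def readable_copies(copy_classes: List[str], sla_level: str) -> List[str]:
    """Keep only the copies whose own class of storage is in the MATCH-ed set."""
    matched = match(sla_level)
    return [cos for cos in copy_classes if cos in matched]
```

Given copies on T1 (CoS1) and T2 (CoS2), a "bronze" request in this sketch is served only from the T2 copy, while a "gold" request may use either; an empty MATCH( ) result leaves no readable copy.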
Next, per block 808 for each successfully MATCH-ed pair:
(copy of data, its class of storage tier that stores this copy)
block 809 gets executed, to compute the cost of reading this particular copy, and secondly, to fill-in pipeline-modifiers for the subsequent (this) CoS-specific read( ) operation:
(pipeline-M-j, cost-j)=COST(j, ‘read’, size, SLA-d, CoS-j)
In the exemplary implementation, COST( ) function takes the following arguments:
Block 811 aggregates the processing performed by block 809, in terms of the number of full copies to read: zero, one, or multiple. If the loop 808 (and its blocks 809 and 810) produces an empty result (which is possible, for instance, when the preceding MATCH( ) 806 fails for all stored copies), or if cost-j==MAX_UINT64 (that is, “infinite cost”) holds for all the computed costs, the read( ) operation effectively fails and the execution proceeds to block 816 (end), to either reschedule the read or fail it all the way to the user or application.
Otherwise, the results are first sorted in ascending order of computed costs (denoted as cost-j in block 809). If there is a single result, the execution proceeds to block 813, to ultimately read the data (identified by its API-specific ID) and return the read copy to the user or application via block 816. Finally, blocks 814 and 815 process multiple reading alternatives. In the exemplary implementation, the selection criteria include a configurable interval (in percentage points) within which the computed cost may differ from the minimal cost; all remaining entries in the sorted array (of costs, above) are effectively filtered out.
For the remaining entries (in the cost-sorted array), block 814 further determines the synchronicity of the corresponding subsequent read operations. In the exemplary implementations, the copy that has the minimal associated cost is read synchronously, while all the remaining copies are read asynchronously and in parallel (block 815). The latter warms up the caches on those other tiers (thus effectively minimizing the costs of subsequent reads) while simultaneously providing an additional guarantee that the read( ) is executed within SLA-defined boundaries even in the unlikely event of the first read( ) failure. The “price” of those duplicate asynchronous reads is mitigated by the capability to cancel them in flight if the corresponding results are not yet (or not yet fully) posted on the network connecting tDSS tiers and storage access point. Notice that block 815 executes the per-CoS defined read operation (denoted as pipeline-M-k->CoS->read( )) that is specifically tuned for the corresponding storage tier. Similarly to the write( ) processing described herein, the COST-computed set of pipeline-modifiers (block 809) is passed to the read( ) implementation itself, to further optimize and adapt its processing on the fly. Specifically, an exemplary implementation may skip decompression of a compressed copy if CPU utilization for the tier (statistics S3c and S3a in the pipeline-modifiers denoted as pipeline-M-k, block 815) is above the corresponding high watermark, delegating the decompression to the tDSS host that implements the storage access point.
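The selection logic of blocks 811 through 815 can be summarized in a short sketch. It assumes that the configurable interval is an absolute difference in cost units (the text says "percentage points"), and the function name plan_reads and its return shape are hypothetical; the actual read( ) calls and in-flight cancellation are omitted.

```python
from typing import Dict, List, Optional, Tuple

MAX_UINT64 = 2**64 - 1  # "infinite" cost


def plan_reads(cost_by_tier: Dict[str, float],
               interval: float) -> Tuple[Optional[str], List[str]]:
    """Return (tier to read synchronously, tiers to read asynchronously)."""
    # Drop tiers with infinite cost and sort the rest by ascending cost.
    candidates = sorted(
        (cost, tier) for tier, cost in cost_by_tier.items() if cost != MAX_UINT64
    )
    if not candidates:
        return None, []  # the read fails or is rescheduled
    min_cost = candidates[0][0]
    # Keep only entries whose cost is within the interval of the minimum.
    kept = [tier for cost, tier in candidates if cost - min_cost <= interval]
    return kept[0], kept[1:]
```

For costs of 10, 14 and 90 on T2, T3 and T4 and an interval of 5, the sketch reads T2 synchronously and T3 asynchronously, and filters out T4.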
In the
This application discloses various implementations of systems, methods, and devices that enable reading data from or writing data to distributed storage systems. The following enumerated implementations are exemplary of some of the implementations described in this application:
The term “portion of data set” refers to the data that user or application is writing to tDSS or reading from tDSS, contiguous segments of this data, and/or derivative data—that is, the data that is computed directly from the user data, e.g. parity segments or erasure coded slices. In the case of XOR-based (as in conventional RAIDs) parity or Reed-Solomon based erasure encoding, user or application data to be stored is broken into slices, further encoded with redundant data pieces, and stored across a set of different locations: disks, storage servers, or multi-server storage tiers. Hence, a portion of the data set refers to user/application data, segments of this data, coded slices of the data, and/or redundant (computed) coded slices or parity segments in any order, sequence, size, form, or arrangement—as per numerous examples and illustrations of the present disclosure.
It will be apparent to those of ordinary skill in the art that certain aspects involved in the operation of the implementations described herein may be embodied in a computer program product that includes a computer usable and/or readable medium.
The present patent application claims the benefit of U.S. Provisional Patent Application No. 62/022,354, entitled “Optimal Management of Copies in Tiered Distributed Storage Systems”, filed Jul. 9, 2014 by Alexander Aizman, the disclosure of which is hereby incorporated by reference in its entirety. The present patent application is related to U.S. patent application Ser. No. 13/904,935 entitled “Elastic I/O Processing Workflows in Heterogeneous Volumes”, filed May 29, 2013 by Alexander Aizman et al., the disclosure of which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
62022354 | Jul 2014 | US