Hardware breaks. Software has defects. Viruses propagate. Buildings catch fire. Power fails. People make mistakes. Although it is preferred that these events never occur, it is prudent to defend against them. The cost of data unavailability can be large. Many businesses estimate their outage costs as more than $250,000/hour. Others estimate outage costs at more than $1,000,000/hour. The price of data loss is even higher. Recent high-profile disasters have raised awareness of the need to plan for recovery or continuity.
A constant challenge is to construct dependable systems that protect data and the ability to access data. Data is an asset to substantially all businesses. Losing information due to various failures can bring a company to its knees—or put it out of business altogether. Such catastrophic outcomes can readily be prevented using various configurations of hardware and software; however, the design space of results, such as solutions, is surprisingly large, and there are a myriad of configuration choices. In addition, the various alternative design choices interact in complicated ways. Thus, results, such as solutions, are often over-engineered or under-engineered, and administrators may not understand the degree of dependability they provide.
It is difficult to combine and configure the building blocks for data protection to create well-engineered data dependability results, such as solutions. Over-engineered results, such as solutions, may incur excessive costs to defend against negligible risks. Under-engineered results, such as solutions, have their own costs in a disaster: crippling outages, loss of critical data, or unacceptably degraded service. Faced with these choices, designers often resort to ad hoc approaches to strike the right balance, guided by rules of thumb and their own limited or even irrelevant experience. When a designer resorts to an ad hoc approach, many potential results may not be considered. Therefore, the selected result may not be as appropriate, well-matched, and/or optimal as when more results are considered.
Embodiments of the present invention are illustrated by way of example and not limitation in the Figures of the accompanying drawings in which:
A system and method for selecting configuration results from a plurality of candidate configuration designs are described herein. In the following description, numerous specific details are set forth. The following description and the drawing figures illustrate aspects and embodiments of the invention sufficiently to enable those skilled in the art. Other embodiments may incorporate structural, logical, electrical, process, and other changes; e.g., functions described as software may be performed in hardware and vice versa. Examples merely typify possible variations, and are not limiting. Individual components and functions may be optional, and the sequence of operations may vary or run in parallel. Portions and features of some embodiments may be included in, substituted for, and/or added to those of others. The scope of the embodied subject matter encompasses the full ambit of the claims and substantially all available equivalents.
This description of the embodiments is divided into four sections. In the first section, an embodiment of a system-level overview is presented. In the second section, methods for using example embodiments are described. In the third section, an example implementation is described. In the fourth section, an embodiment of a hardware and operating environment is described.
This section provides a system level overview of example embodiments of the invention.
The configuration result system 100 also includes an outlay cost of design module 600 that is communicatively coupled to the candidate design generator 200. The candidate design generator 200 outputs a signal 201 that includes the candidate design parameters associated with a candidate design to the outlay cost of design module 600. The outlay cost of design module 600 includes costs associated with implementing and maintaining the candidate design. The outlay costs include capital costs for purchasing equipment, software, real estate, and the like. The outlay costs also include periodic costs, such as rental/leasing costs for equipment or real estate, power and air conditioning bills, salaries of system administrators, and insurance premiums to protect against data loss or outages. The outlay costs can be determined on a yearly basis, which accounts for the depreciation of capital costs such as those for purchasing hardware or software. If outlay costs are annualized, the weighting factors for various failure modes are also set for the probability of failure within a year. In other embodiments, the outlay costs and the weighting factors, represented by the signal 308, can be set for outage time periods such as the life of the particular candidate design. The outlay cost is output 608 to a total cost of a result module, such as a solution module 700. The total cost of solution module 700 receives as inputs the outlay cost 608 of the candidate design generated by the candidate design generator 200 and the value of the sum of the weighted total penalty costs for multiple failure modes 520. The total cost of solution module 700 sums the value of the sum of the weighted total penalty costs for multiple failure modes 520 and the total outlay cost 608 to yield the total cost of the candidate design 708.
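The cost roll-up described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of modules 600 and 700: the function names, the straight-line amortization of capital costs, and the sample figures are all hypothetical.

```python
def annualized_outlay(capital_cost, depreciation_years, periodic_cost_per_year):
    """Assumed model: straight-line amortization of capital plus recurring yearly costs."""
    return capital_cost / depreciation_years + periodic_cost_per_year


def total_cost(outlay_cost, weighted_penalty_costs):
    """Total cost (708) = outlay cost (608) + sum of weighted penalty costs (520)."""
    return outlay_cost + sum(weighted_penalty_costs)


# Hypothetical figures, in dollars per year.
outlay = annualized_outlay(capital_cost=300_000, depreciation_years=3,
                           periodic_cost_per_year=50_000)
cost = total_cost(outlay, weighted_penalty_costs=[12_000, 3_500])
print(outlay, cost)
```

Because the outlay here is annualized, the penalty terms would also need to be weighted by the probability of each failure within a year, as the text notes.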
For each candidate design, the total cost of the candidate design is stored and compared at a compare module 800. The compare module 800 stores the total cost of the candidate design 708 and an identifier for each of the candidate designs generated by the candidate design generator 200. In the compare module 800, the total costs of the candidate designs are compared. In one embodiment, several of the candidate designs and the related total costs are selected for presentation as a candidate solution configuration, such as a candidate design and/or a design configuration and/or a result configuration. In another embodiment, the candidate design generated that has the least total cost is selected and presented as the result configuration, such as the solution configuration. Also included is a feedback signal 801 that includes information regarding unselected candidate design configurations. The feedback signal 801 is input to the candidate design generator module 200 and the information is used in selecting future candidate designs. The information of the feedback signal 801 can be used to prune future candidate designs, thereby making the search for candidate designs more efficient when compared to an exhaustive search or other types of searches. As shown, alternative candidate designs are enumerated, completely evaluated and compared on the basis of total cost, and feedback is provided to the candidate design generator.
In another example embodiment, certain techniques, such as mathematical programming, provide a tighter coupling between the candidate generation, evaluation and comparison. These techniques permit elimination of candidate designs as a minimal cost solution based on a partial evaluation of a candidate design. As a result, feedback can be input to the candidate design generator 200 without having the complete evaluation of the candidate design. This process efficiently explores the space of candidate designs without enumerating and evaluating substantially all potential candidate designs. These approaches are merely two example embodiments. Other approaches for determining feedback 801 are also contemplated.
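The idea of eliminating a candidate on a partial evaluation can be illustrated with a simple bounding loop. This sketch is not mathematical programming and is not taken from the source; it assumes that every weighted penalty term is non-negative, so a candidate whose running cost already meets or exceeds the best complete cost can be abandoned early. The function name and tuple layout are hypothetical.

```python
def select_min_cost(candidates):
    """candidates: iterable of (name, outlay, [weighted penalty terms]).

    Returns (best_name, best_cost, penalty_terms_evaluated).
    Assumes every penalty term is non-negative.
    """
    best_name, best_cost = None, float("inf")
    evaluated = 0
    for name, outlay, penalties in candidates:
        running = outlay
        pruned = running >= best_cost
        for penalty in penalties:
            if pruned:
                break  # partial cost already rules this candidate out
            running += penalty
            evaluated += 1
            pruned = running >= best_cost
        if not pruned:
            best_name, best_cost = name, running
    return best_name, best_cost, evaluated


designs = [
    ("A", 100, [20, 30]),
    ("B", 160, [5, 5]),    # rejected on outlay alone
    ("C", 90, [70, 10]),   # rejected after one failure event
    ("D", 80, [10, 20]),
]
result = select_min_cost(designs)
print(result)
```

In this example only five of the eight penalty terms are evaluated before the least-cost design is identified.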
In another embodiment, the output of the total cost of solution module 700 is a parametric model description, including a set of designs, a set of failure events and a description of the total cost as a function of the design parameters and failure event parameters. For instance, this parametric model description can be provided in the form of a mathematical program, including a set of decision variables, a set of constraints and an objective function. In this embodiment, the candidate design generator 200, failure event generator 300, modeling module 400, workload characteristics input, penalty cost module 500, outlay cost module 600 and total cost module 700 determine the appropriate parameterization for the model description. In this embodiment, the functionality of the compare module 800 can be provided by a solver such as the CPLEX solver, available from ILOG, Inc., 1080 Linda Vista Avenue, Mountain View, Calif. USA. The product name is ILOG CPLEX.
As shown, each of the modules discussed above can be implemented in software, hardware or a combination of both hardware and software. Furthermore, each of the modules can be implemented as an instruction set on a microprocessor associated with a computer system or can be implemented as a set of instructions associated with any form of media, such as a set of instructions on a disk drive, a set of instructions on tape, a set of instructions transmitted over an Internet connection or the like.
This section describes methods performed by embodiments of the invention. In certain embodiments, the methods are performed by machine-readable media (e.g., software), while in other embodiments, the methods are performed by hardware or other logic (e.g., digital logic). In this section,
The method 200 includes providing workload characteristics as input, as depicted by reference numeral 210. In one example, the candidate design is sensitive to the characteristics of a storage workload. For example, many data protection schemes are sensitive to the average update rate: the volume of updates over a given interval, divided by the interval length. Techniques that accumulate modifications over an interval, such as incremental tape backup, are more sensitive to the workload's update rate. The update rate may include the number of substantially unique data items written over an interval. Longer accumulation intervals allow more time for overwrites, so they often have lower update rates. This rate can be modeled by a series of tuples of the form <interval duration, update rate>. Synchronous mirroring results or solutions are also sensitive to the short-term peak-to-average burstiness of writes, which can be 3-10× the long-term average update rate. Workload characteristics that do not affect the choice of dependability results, such as solutions, may be ignored.
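As a minimal sketch of how such workload characteristics might be represented, the following assumes a small table of <interval duration, update rate> tuples and a burst multiplier. All names and numbers are hypothetical. The lookup conservatively uses the longest tabulated interval that does not exceed the requested one, which can only overestimate the rate if longer intervals have lower rates.

```python
# Hypothetical workload: <interval duration (s), update rate (B/s)> tuples.
UPDATE_RATES = [(60, 800_000), (3600, 500_000), (86_400, 300_000)]
AVG_UPDATE_RATE = 1_000_000  # B/s, before overwrite absorption (assumed value)
BURST_MULTIPLIER = 5         # within the 3-10x range cited above (assumed value)


def unique_update_rate(interval_s, tuples=UPDATE_RATES):
    """Rate for the longest tabulated interval that does not exceed interval_s."""
    eligible = [rate for duration, rate in sorted(tuples) if duration <= interval_s]
    if not eligible:
        raise ValueError("interval shorter than any tabulated duration")
    return eligible[-1]


def peak_update_rate(avg_rate=AVG_UPDATE_RATE, burst=BURST_MULTIPLIER):
    """Short-term peak write rate that a synchronous mirror would have to keep up with."""
    return avg_rate * burst


print(unique_update_rate(7200), peak_update_rate())
```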
Workload characteristics can be measured from an existing system or estimated by a person or a tool from a repertoire of known workloads to characterize the workloads. Existing tools that can be used to characterize workloads include I/O tracing tools like HP-UX's MeasureWare midaemon™, and similar facilities provided by other operating systems. These traces can then be further mined to determine the workload attributes described above using tools such as HP's GlancePlus™ product.
The method 200 also includes applying user constraints to the design candidates, as depicted by reference numeral 211. In some example embodiments, the constraints are based on user-specifications. The actual mechanism can be implicitly or explicitly limited by the user, the system design, the specifications, or other constraints. Some of the explicit limitations include user specifications such as use of a certain type of software or hardware for a particular system. For example, a particular corporate user may specify use of certain brands of hardware or software for business reasons. Additionally, regulatory specifications for data retention and deletion may be explicitly expressed. Candidate designs may be applicable to any type of system including computer storage systems, databases, wide-area links, or the like.
In generating one or more candidate designs, a search for candidate designs may be conducted, as depicted by reference numeral 212. A multiplicity of techniques can be used to search for candidate designs. Examples include exhaustive searches, first fit (also known as a greedy) search, or best fit search. In some embodiments, randomized search algorithms, such as simulated annealing, genetic algorithms or taboo search, may be used. Also included are searches that employ mathematical programming techniques such as integer, linear or non-linear programming; a CPLEX solver or Excel solver can also be used. As mentioned above, some searches employ a feedback loop in order to prune certain branches of the search (see feedback signal 801 in
Once a candidate design is found, a listing of substantially all configuration parameters associated with the candidate design is generated, as depicted by reference numeral 214. The configuration parameters are a listing of substantially all resources specified for a design candidate, as well as a listing of the software configuration parameters describing the operation of the data protection techniques employed in the candidate design (e.g., the duration of a full backup window). The configuration parameters for a design candidate are output to the modeling module 400 and the outlay cost module 600, as depicted by reference numeral 220. In some embodiments, the configuration parameters are also provided with an identifier for the particular design candidate.
Failures may cause the loss of data (e.g., the primary copy or one or more secondary copies) and/or the loss of data accessibility. Failures may affect the primary site, one or more secondary sites or the whole system. Failures include hardware failures, environmental failures and disasters. Hardware failures include failures of disk drives, I/O controllers, memory, processors or power supplies. Environmental failures include power failures and any air conditioning or site shortcomings that could affect the system. Disasters include fires, floods, earthquakes, hurricanes and blizzards, or the like. Other failures that are considered include software failures, such as application failures, operating system failures, device firmware failures, virus infections and redundancy-technique failures. Still other failures considered in some embodiments are human errors, which include administrator errors, operator errors and end-user errors. After generating failures 310 and their probability 312 and then adjusting the importance level 314, the information or data is output from the failure event generator to the penalty cost module, as depicted by reference numeral 334. The actual signal from the failure event generator 300 to the penalty cost module is depicted by reference number 308 (shown in
In some embodiments, multiple failure events are handled for each candidate design, with each failure event resulting in its own outage time and data loss metrics. As will be described in more detail in
In some embodiments of the invention, various failure events may also be accompanied by the likelihood of that failure. This is used to weight the various failure scenarios, with respect to the other failures so that a total for a particular candidate design can be predicted. If all failure modes have been modeled for a particular candidate design, then the process within the modeling module 400 for the specific candidate design is ended.
Penalty rates can include direct penalties from lack of data integrity and data inaccessibility as well as indirect penalties such as lost consumer confidence and lost worker productivity. The penalty rates for outage time may be a function of whether the system is up or down, or they can be a function of the performance level that is achieved. The penalty costs can be linear or non-linear, continuous or discontinuous functions. In some embodiments, the functions can be monotonically increasing as the amount of data or outage time increases. In some embodiments, the penalty assessed may be a function of cyclic or seasonal behavior. For example, if data is lost during a peak acquisitional time, the penalty cost is higher in some embodiments. In addition, system outage times that occur during a peak time may also have higher initial penalty rates. In short, penalty cost rates can be distributed in many different ways, substantially all of which are contemplated by embodiments.
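A few of the penalty shapes mentioned above can be sketched as plain functions. These are illustrative assumptions only: the function names, the lump-sum threshold, and the peak-period multiplier do not come from the source.

```python
def linear_penalty(hours, rate_per_hour):
    """Continuous, linear penalty: cost grows in proportion to outage or loss time."""
    return rate_per_hour * hours


def stepped_penalty(hours, rate_per_hour, threshold_hours, lump_sum):
    """Discontinuous but monotonically increasing: a one-time lump sum
    is added once the duration exceeds threshold_hours."""
    extra = lump_sum if hours > threshold_hours else 0
    return rate_per_hour * hours + extra


def seasonal_penalty(base_penalty, in_peak_period, peak_multiplier=2.0):
    """Scale a penalty when the failure falls in a peak period (assumed multiplier)."""
    return base_penalty * (peak_multiplier if in_peak_period else 1.0)


print(stepped_penalty(5, 1000, threshold_hours=4, lump_sum=5000))
```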
According to
Mathematically the total penalty cost for design candidate d can be expressed as follows:
Here, the importanceLevel represents the weight for failure f, and the dataOutagePenaltyRate and dataLossPenaltyRate represent linear penalty rate functions. The last two terms may be functions of duration. Other user specifications, such as performability specifications that describe acceptable levels of degraded mode performance, can also be handled by the setting of penalty cost rates for outage time. In this case, the data outage penalty rate is a function of the level of achieved performance (e.g., 0%, 50% or 100% of normal mode performance). The penalty cost function can be linear or non-linear, continuous or discontinuous.
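One reading of the description above, assuming linear penalty rates, is that each failure event contributes its importance level multiplied by the sum of the outage penalty and the data loss penalty. The sketch below follows that reading; the dictionary keys, rates, and failure data are hypothetical, and the exact form of the source expression is not reproduced here.

```python
def total_penalty_cost(failures, outage_rate_per_h, loss_rate_per_h):
    """Sum over failure events f of
    importanceLevel_f * (outage penalty + data loss penalty), with linear rates."""
    return sum(
        f["importance_level"]
        * (outage_rate_per_h * f["outage_time_h"] + loss_rate_per_h * f["data_loss_h"])
        for f in failures
    )


# Hypothetical failure events for one design candidate d.
failures = [
    {"importance_level": 0.1, "outage_time_h": 4, "data_loss_h": 1},     # array failure
    {"importance_level": 0.01, "outage_time_h": 48, "data_loss_h": 24},  # site disaster
]
penalty = total_penalty_cost(failures, outage_rate_per_h=10_000, loss_rate_per_h=20_000)
print(round(penalty, 2))
```

Non-linear or performance-dependent penalties would replace the two multiplications with calls to the corresponding penalty functions.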
As mentioned in the discussion of
Next, a decision is made as to whether substantially all of the candidate designs generated by the candidate design generator have been considered, as depicted by decision box 814. If they have not, then a new candidate design is generated and evaluated with a plurality of failure events to determine its total cost. In other words, operations 802, 803, 804, 806, 808, 810, 812 and 813 are repeated. If substantially all of the candidate designs have been considered, the total costs of the stored candidate designs are compared, as depicted by reference numeral 816 and one or more of the top candidates are output, as depicted by reference numeral 818. In one embodiment, the selected result, such as the output solution 818, includes a design candidate having the minimum total cost of substantially all the candidate designs evaluated. In other embodiments of the invention, the output solution 818 includes a set of a plurality of the least costly candidate designs.
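The enumerate, evaluate, store and compare loop can be sketched as follows. This is a simplified illustration: the candidate dictionary, the cost function, and the design names are hypothetical, and the standard-library `heapq.nsmallest` stands in for the comparison step.

```python
import heapq


def top_candidates(candidates, evaluate_total_cost, k=1):
    """Evaluate every candidate, then return the k least costly as (cost, id) pairs."""
    scored = [(evaluate_total_cost(design), ident) for ident, design in candidates.items()]
    return heapq.nsmallest(k, scored)


# Hypothetical candidates with precomputed outlay and weighted penalty totals.
candidates = {
    "mirror-sync": {"outlay": 900, "penalty": 50},
    "tape-daily": {"outlay": 200, "penalty": 600},
    "mirror-async": {"outlay": 500, "penalty": 150},
}
best_two = top_candidates(candidates, lambda d: d["outlay"] + d["penalty"], k=2)
print(best_two)
```

With k=1 the result corresponds to the minimum-total-cost embodiment; a larger k corresponds to outputting a set of the least costly designs.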
The method 900 also includes selecting a second candidate design 926, and modeling the second candidate design to determine an amount of outage time associated with the second candidate design 928. The method 900 also includes modeling the second candidate design to determine an amount of data loss associated with the second candidate design 930. The outage time penalty cost for the candidate design is determined by applying a first penalty cost function for outage time 932, and the data loss penalty cost is determined by applying a second penalty cost function for data loss 934. The outage time penalty cost and data loss penalty cost are summed to determine a total penalty cost associated with the second candidate design 936. The total penalty cost is then scaled by applying a weighting factor to the associated failure event to determine a weighted penalty cost for a failure event 938. In some embodiments, the failure weighting factor is related to the probability of a failure event happening over the life of the candidate design, or a selected time period, such as a year or number of years. In other embodiments, the failure weighting factor may be 1.0. In one embodiment, the failure weighting factor is 1.0 where each failure event is as probable as any other failure event. The method 900 further includes determining the outlay cost associated with the second candidate design 940 and adding the outlay cost and the total penalty cost to determine a total cost of the second candidate design 942. The method 900 also includes selecting one of the first candidate design and the second candidate design based upon a comparison of the total cost of the first candidate design and the total cost of the second candidate design 944. In some embodiments, the method further includes the selection of one of the first candidate design and the second candidate design based upon the candidate design having the least total cost.
The configuration method 900 can be applied to any system, including but not limited to selecting a storage configuration, selecting a hardware configuration, or selecting a software configuration. In one embodiment, the penalty cost rates are associated with data dependability specifications of a system. In some embodiments, the penalty cost rates are obtained from the user. Generally, the user is the one who can accurately supply this information; however, estimates can be based on libraries of penalty rates for different industry segments.
The Mathematical Optimization Model
This section presents an example embodiment of the method by formulating in detail a mathematical programming optimization model of a dependability design tool. The formulation is a mixed integer program (MIP), and defines the decisions to be made, input data and their functional relationship with these decisions, and an objective function to optimize. Thus, the MIP formulation presented in this section encompasses the following sections:
To reduce the search space, the values of certain continuous parameters, such as backup intervals and the batch intervals for the batched asynchronous remote-mirroring scheme, are quantized. The following sets enumerate the quantized values.
m ∈ M = {sync, async, asyncB}: mirroring type, where sync means synchronous mirroring, async means write-order preserving asynchronous mirroring, and asyncB means batched asynchronous mirroring with write absorption.
w ∈ W = {1 min, 5 min, 1 hr, 4 hr, 12 hr, 24 hr, 48 hr}: time interval (window) type; used for data protection techniques that accumulate updates over an interval.
k ∈ K = {cycle0, cycle6, cycle13, cycle27}: backup cycle types, where cycle0 means only full backups are done, cycle6 means a full backup followed by 6 incremental backups, cycle13 means a full backup followed by 13 incremental backups, and cycle27 means a full backup followed by 27 incremental backups. (For a 24-hour window, the last three cycles correspond roughly to weekly, bi-weekly and monthly full backups, interspersed with incrementals.)
s ∈ S = {hot, unconfig, occupied, occUnconfig, none}: status type of spare resources, where hot means resources are ready to be used, unconfig means resources are to be configured, occupied means negotiations are performed to release resources from other tasks and the resources are to be scrubbed, occUnconfig means resource release is negotiated and resources are scrubbed and configured, and none means no spare resources are available.
f ∈ FailureScopes = {array, site}: failure scopes, where array means the failure of the array(s) storing the primary copy of the data set, and site means a site disaster at the primary copy site.
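To give a sense of the size of the quantized design space, the sets above can be enumerated directly. This sketch simply forms the Cartesian product of M, W, K and S; it is an illustration only, since the formulation applies each window type to specific techniques and not every combination is necessarily a meaningful design.

```python
from itertools import product

M = ("sync", "async", "asyncB")
W = ("1 min", "5 min", "1 hr", "4 hr", "12 hr", "24 hr", "48 hr")
K = ("cycle0", "cycle6", "cycle13", "cycle27")
S = ("hot", "unconfig", "occupied", "occUnconfig", "none")

# Every (mirroring type, window, backup cycle, spare status) combination.
designs = list(product(M, W, K, S))
print(len(designs))  # 3 * 7 * 4 * 5 = 420
```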
Parameters and Derived Parameters
Parameters represent the input data, and derived parameters are computed from the input data. Fundamental units are indicated for each parameter in parentheses (e.g., US dollars ($), bytes (B), seconds (s), or years (yr) for cost). These values may be expressed in alternate units (e.g., gigabytes (GB) or hours (hr)) for convenience.
Business Specifications Parameters
Business continuity practitioners currently use two metrics to capture data dependability objectives:
1. Recovery time. The Recovery Time Objective (RTO) specifies the maximum tolerable elapsed time between a failure and the point at which application service is restored. The RTO can range from seconds to days.
2. Data loss. Recovery may include reverting to some consistent point prior to the failure, resulting in the loss of updates after that point. The Recovery Point Objective (RPO) gives the maximum allowable time window for which recent updates may be lost. The RPO can range from zero (no loss is tolerable) to days or weeks.
This formulation quantifies the financial impacts of outages and data loss as penalty rates, expressed as $/hour. The loss penalty rate specifies the cost per hour of updates lost. The outage penalty rate specifies the cost per hour of service interruption. The design tool uses these rates to balance expected failure impacts against outlays to arrive at a cost-effective result, such as a cost-effective solution. The tool can be used with a combination of target RTO and RPO values plus penalty rates that are incurred for violations of these targets, or it can be run with the penalty rates alone, in which case it identifies the worst-case recovery time and data loss as a side effect of this design process.
targetRPO: target recovery point objective (s). This parameter may be set explicitly, to provide the traditional business continuity RPO, or set to zero, to permit the solver to identify the worst-case recent data loss as a side effect.
targetRTO: target recovery time objective (s). This parameter may be set explicitly, to provide the data recovery component of traditional business continuity RTO, or set to zero, to permit the solver to identify the worst-case recovery time as a side effect.
ploss: loss penalty rate for RPO violation ($/s).
punavail: outage penalty rate for RTO violation ($/s).
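A minimal sketch of how a penalty rate might be applied to a target violation follows, assuming the penalty accrues only for the time by which the target is exceeded. The function name and sample values are hypothetical. Setting the target to zero charges the full recovery time or loss window, which matches running the tool with penalty rates alone.

```python
def violation_penalty(actual_s, target_s, rate_per_s):
    """Penalty for the portion of actual_s that exceeds target_s (never negative)."""
    return rate_per_s * max(0, actual_s - target_s)


# Outage penalty: 2 h recovery against a 1 h targetRTO at $10/s (assumed values).
print(violation_penalty(actual_s=7200, target_s=3600, rate_per_s=10))
# With targetRTO = 0 the whole recovery time is penalized.
print(violation_penalty(actual_s=7200, target_s=0, rate_per_s=10))
```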
Workload Parameters
Data dependability designs are sensitive to characteristics of the storage workload. These characteristics can be measured from an existing system or estimated by a person or a tool from a repertoire of well-known workloads. Existing design processes use both approaches. Workload characteristics that do not affect the choice of data dependability results, such as solutions, are ignored, as existing performance provisioning tools can be used to address these design issues. Most data protection schemes are sensitive to the average update rate: the volume of updates over a given interval, divided by the interval length. Synchronous mirroring results, such as synchronous mirroring solutions, are also sensitive to the short-term peak-to-average burstiness of writes, which is typically 3-10× the long-term average write rate. Techniques that accumulate modifications over an interval (e.g., incremental tape backup) are more sensitive to the workload's unique update rate, that is, the update rate after earlier updates to rewritten data have been discarded. Longer accumulation intervals allow more time for overwrites, so they often have lower update rates. This rate is modeled by a series of tuples of the form <interval duration, update rate>.
wkldCapacity: workload data object capacity (B). This formulation assumes a single workload data object.
avgUpdateRate: average update rate; no rewrite absorption (B/s).
burstMultiplier: short-term burst update-rate multiplier.
<duration_w, UpdateRate_w>: update rate (B/s) over the given duration (s), after absorbing overwrites.
Derived Workload Parameters
Capacity_w = duration_w * UpdateRate_w: total size (capacity) of updates during the given duration (B).
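The derived capacity is a direct product of the two tuple fields. The snippet below works it through for two window types; the update rates are hypothetical sample values.

```python
# window type -> (duration in seconds, update rate in B/s after absorbing overwrites)
WINDOWS = {
    "1 hr": (3_600, 500_000),
    "24 hr": (86_400, 300_000),
}

# Capacity_w = duration_w * UpdateRate_w, in bytes.
capacity = {w: duration * rate for w, (duration, rate) in WINDOWS.items()}
print(capacity)
```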
Failure Parameters
Failures that people want to protect data against can be grouped into several threat categories, including:
This formulation focuses on data loss events for the primary copy, such as a primary site disaster or failure of a primary disk array. Data corruption and inaccessibility threats can be mapped into loss of the primary copy.
failureLikelihood_f: likelihood of failure scope f during a year (fraction in [0,1]).
Disk Array Parameters
One or more disk arrays store the primary copy of data, using RAID 10 (striped mirrors). Disk arrays are modeled as having an upper bound on capacity (bytes) and a rate at which data can be restored (bytes/s). This formulation considers complete failure of the primary array or site, assuming that disk arrays protect against internal single-component failures. For simplicity, it assumes the entire dataset is protected in the same way; in practice, different storage volumes may be handled differently.
The disk array cost model captures details such as the costs of the array chassis/enclosures, redundant front-end controllers (including caches), high-end array back-end controllers, and the disk drives and the trays in which they are mounted. It estimates the cost of floor space, power, cooling, and operations by using a fixed facilities cost plus a variable cost that scales with capacity. Substantially all equipment capital outlay costs are amortized over a depreciation period, which is assumed to be the lifetime of the equipment.
maxDisks: maximum number of disks in each array.
diskCapacity: capacity per disk drive (B).
arrayReloadBW: maximum disk array reload rate (B/s).
enclosureCost: outlay cost of disk array enclosure ($/enclosure).
diskCost: outlay cost of disk ($/disk).
fixedFacilitiesCost: fixed outlay cost for facilities ($/yr).
varFacilitiesCost: variable outlay cost for facilities ($/B/yr).
depreciationPeriod: period over which capital outlay costs are amortized (depreciated) (yr).
Derived Disk Array Parameters
arrayCapacity=maxDisks*diskCapacity: disk array capacity (B).
total number of disk arrays for the primary copy of the workload (assuming RAID 10 protection).
total number of disks for the primary copy of the workload (assuming RAID 10 protection).
primaryCost: total amortized outlay cost for the primary copy disk array storage and facilities ($/yr).
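The expressions for the array count, disk count and primaryCost are not reproduced above, so the following is only a plausible sketch, not the source's formulas. It assumes that RAID 10 doubles the raw capacity required, that capital costs are amortized straight-line over depreciationPeriod, and that the variable facilities cost scales with raw capacity. The names `num_arrays` and `num_disks` and all sample values are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def primary_copy(wkld_capacity, max_disks, disk_capacity, enclosure_cost, disk_cost,
                 fixed_facilities_cost, var_facilities_cost, depreciation_period):
    array_capacity = max_disks * disk_capacity      # arrayCapacity (B)
    raw_capacity = 2 * wkld_capacity                # assumption: RAID 10 mirrors every byte
    num_arrays = ceil_div(raw_capacity, array_capacity)
    num_disks = ceil_div(raw_capacity, disk_capacity)
    capital = num_arrays * enclosure_cost + num_disks * disk_cost
    primary_cost = (capital / depreciation_period   # amortized capital ($/yr)
                    + fixed_facilities_cost
                    + var_facilities_cost * raw_capacity)
    return num_arrays, num_disks, primary_cost


arrays, disks, cost_per_year = primary_copy(
    wkld_capacity=10_000_000_000_000,   # 10 TB
    max_disks=16, disk_capacity=500_000_000_000,
    enclosure_cost=100_000, disk_cost=1_000,
    fixed_facilities_cost=20_000, var_facilities_cost=1e-9,
    depreciation_period=5,
)
print(arrays, disks, round(cost_per_year))
```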
Remote Mirroring Parameters
Remote mirroring protects against loss of the primary by keeping an isolated copy on one or more disk arrays at a secondary site. The remote mirroring model includes a transfer rate (bytes/s) for the network links connecting the mirrors, a link cost ($/year), and an upper bound on the number of links that may be deployed. New link types may be incorporated by providing values for these parameters.
mirrorCacheCapacity: size of buffer used to smooth out async link IOs (B).
w(M) ∈ W(M) ⊂ W = {1 min, 5 min, 1 hr, 4 hr, 12 hr, 24 hr}: type of asynchronous batch window for coalescing writes.
linkBW: link bandwidth (B/s).
linksMax: upper bound on number of links.
linkCost: annual outlay cost for a network link ($/yr).
Derived Remote Mirroring Parameters:
The primary cost factors for remote mirroring systems are the costs of (1) the storage for the remote copy and (2) the number of network links used to match the write rate at the primary. The costs and penalties depend on the remote mirroring protocol:
Updates that have not been transferred to the secondary are at risk when the primary fails. The worst-case time window for data loss is given by the time it takes to fill or drain the write buffer, whose entire contents may be lost on a failure:
potential data loss for write-order preserving asynchronous mirroring under array failure (s).
The potential data loss is the size of two delayed batches (one accumulating and one in transit to the secondary), so the worst-case loss window is approximated as twice the batch interval:
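The worst-case loss windows described above can be sketched per mirroring type. This is a hedged illustration: the function name is hypothetical, synchronous mirroring is assumed to lose no acknowledged updates, and the asynchronous case assumes the buffer fills at the average update rate, which the source does not state explicitly.

```python
def worst_case_loss_window_s(mirror_type, mirror_cache_capacity=0,
                             avg_update_rate=1, batch_interval_s=0):
    """Worst-case window (s) of recent updates lost when the primary array fails."""
    if mirror_type == "sync":
        return 0.0  # assumption: every acknowledged write is already at the secondary
    if mirror_type == "async":
        # Time to fill the write buffer, whose entire contents may be lost.
        return mirror_cache_capacity / avg_update_rate
    if mirror_type == "asyncB":
        # One batch accumulating plus one in transit: twice the batch interval.
        return 2 * batch_interval_s
    raise ValueError(f"unknown mirroring type: {mirror_type}")


print(worst_case_loss_window_s("async", mirror_cache_capacity=1_000_000_000,
                               avg_update_rate=1_000_000))
print(worst_case_loss_window_s("asyncB", batch_interval_s=300))
```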
Tape Backup Parameters
Backups are modeled as occurring at fixed intervals ranging from 4 to 48 hours. Periodic full backups are optionally interspersed with cumulative incremental backups, which copy the data modified since the last full backup. For example, backup intervals of 24 hrs with incremental cycle counts of 6, 13 or 27 days correspond roughly to weekly, bi-weekly, or monthly full backups interspersed with daily incremental backups.
tapeCapacity: tape media capacity (B).
tapeDriveBW: tape drive rate (B/s).
tapeDrivesMax: upper bound on number of drives in tape library.
tapesMax: upper bound on number of tapes in library.
w(F) ∈ W(F) ⊂ W = {4 hr, 12 hr, 24 hr, 48 hr}: type of full backup window.
w(I) ∈ W(I) ⊂ W = {4 hr, 12 hr, 24 hr, 48 hr}: type of incremental backup window.
cycleCount_k ∈ {0, 6, 13, 27}: number of incrementals between full backups.
RTvault: time to retrieve tapes from offsite vault (s).
tapeLibraryCost: outlay cost for tape library, including chassis plus media slots ($/library).
tapeDriveCost: outlay cost for a tape drive ($/drive).
tapeCost: outlay cost for tape media cartridge ($/tape).
fixedVaultCost: fixed outlay for tape vault ($/yr).
vaultPerShipmentCost: outlay cost for a shipment to tape vault ($/shipment).
numVaultShipments: number of shipments to tape vault per year.
Derived Tape Backup Parameters
The backup process creates a read-only snapshot of the primary data, and then uses the snapshot as the source for the backup to tape (be it full or incremental). Snapshots may be taken using space-efficient copy-on-write techniques, or by isolating a local mirror and synchronizing it with the primary copy after the backup is complete. The disk space used for a space-efficient incremental snapshot is determined from the average update rate and the backup interval.
In an embodiment, each backup finishes before the next one starts, effectively defining a backup window equal to the interval duration. The tool provisions sufficient tape drives to complete each backup within its window, so the lower bound on the number of tape drives is:
\(\mathit{tapeDrivesMin}_{k,w(F),w(I)} = \max\bigl(\mathit{tapeDrivesMinFull}_{w(F)},\ \mathit{tapeDrivesMinIncr}_{k,w(F),w(I)}\bigr)\): lower bound on the number of tape drives used under policy [k, w(F), w(I)].
where the number of drives used for a full backup is:
lower bound on the number of drives used for a full backup.
and the number of tape drives used for the largest incremental backup is:
lower bound on the number of tape drives used for the largest incremental backup.
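As a rough illustration of the bound above, the following sketch sizes the drive count so that each backup finishes within its window. The function and argument names (`min_drives`, `tape_drives_min`, `full_size_bytes`, `largest_incr_bytes`) are hypothetical, and the per-backup bound (backup size divided by what one drive can stream in the window, rounded up) is an assumption, since the two per-backup expressions are not reproduced in this text.

```python
import math


def min_drives(size_bytes: float, window_s: float, tape_drive_bw: float) -> int:
    """Fewest drives that can stream size_bytes within window_s seconds.

    Assumes all drives run in parallel at the full tapeDriveBW rate (B/s).
    """
    if window_s <= 0 or tape_drive_bw <= 0:
        raise ValueError("window and drive bandwidth must be positive")
    return math.ceil(size_bytes / (window_s * tape_drive_bw))


def tape_drives_min(full_size_bytes, full_window_s,
                    largest_incr_bytes, incr_window_s,
                    tape_drive_bw, cycle_count):
    """Lower bound on drives for policy [k, w(F), w(I)]: max of both bounds."""
    full_bound = min_drives(full_size_bytes, full_window_s, tape_drive_bw)
    if cycle_count == 0:  # policy has no incremental backups
        return full_bound
    incr_bound = min_drives(largest_incr_bytes, incr_window_s, tape_drive_bw)
    return max(full_bound, incr_bound)
```

Taking the maximum means a policy whose largest incremental is bigger than what the full-backup drives can stream in an incremental window is provisioned for the incremental instead.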
Tapes are retained for a single full backup cycle, which includes the last full backup and substantially all subsequent incremental backups. Each full backup is written onto a new set of tapes rather than the tapes for the previous full backup, in case it fails to complete. When a full backup completes, the tapes for the previous full backup are sent to the vault, and the tapes at the vault are recycled back to the primary site. The tapes are kept at the primary site until this time in case they are needed quickly to respond to operator errors (e.g., rm *). Thus the total number of retained tapes is:
The number of tapes for substantially all incremental backups during a cycle is calculated by summing the number of tapes used for each one. It is assumed that each backup starts on a new tape. Taking into account the fact that the full backup interval may be larger than the incremental one:
number of tapes for substantially all incremental backups during a cycle (where \(\mathit{sizeOfIncr}_i = \mathit{Capacity}_{w(F)} + i \cdot \mathit{Capacity}_{w(I)}\)).
number of tapes at the largest incremental backup cycle.
number of tapes for an array's worth of the largest incremental backup.
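The tape counts for incremental backups can be sketched as follows, using the stated size of the i-th incremental and the assumption that each backup starts on a new tape (so each size is rounded up to whole tapes separately). The function names are hypothetical, and the index range i = 1..k is an assumption; the text gives only the per-backup size.

```python
import math


def size_of_incr(i, capacity_full_window, capacity_incr_window):
    """sizeOfIncr_i = Capacity_w(F) + i * Capacity_w(I), in bytes."""
    return capacity_full_window + i * capacity_incr_window


def tapes_for_incrementals(cycle_count, capacity_full_window,
                           capacity_incr_window, tape_capacity):
    """Tapes for all incrementals in a cycle; each backup starts on a new tape."""
    return sum(
        math.ceil(size_of_incr(i, capacity_full_window, capacity_incr_window)
                  / tape_capacity)
        for i in range(1, cycle_count + 1)  # assumed index range 1..k
    )


def tapes_for_largest_incremental(cycle_count, capacity_full_window,
                                  capacity_incr_window, tape_capacity):
    """Tapes for the last (largest) incremental of the cycle, 0 if none."""
    if cycle_count == 0:
        return 0
    largest = size_of_incr(cycle_count, capacity_full_window,
                           capacity_incr_window)
    return math.ceil(largest / tape_capacity)
```

Rounding each backup up separately is what makes the total larger than the summed data volume divided by the tape capacity.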
A primary array failure may destroy any backup in progress at the time of the failure, possibly losing substantially all updates from both the current (accumulating) backup interval and the previous (propagating) backup interval. Assuming full intervals are at least as long as incremental intervals, the worst-case data loss is the sum of the full and incremental backup intervals:
In the event of a primary site disaster, the worst-case data loss occurs if the site is destroyed just before the new full backup completes and the old full backup is shipped offsite. In this case, the data at the vault is out-of-date by twice the full backup cycle duration, plus the interval for the latest full backup:
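The two worst-case loss windows described above can be worked through numerically. This is a minimal sketch, not the formulation's own equations: the function names are hypothetical, and the cycle duration is assumed to be one full interval plus `cycle_count` incremental intervals, which matches the earlier example where 24-hour intervals with a cycle count of 6 give a weekly cycle.

```python
def cycle_duration_h(full_interval_h, incr_interval_h, cycle_count):
    """Assumed cycle length: one full backup plus k incremental backups."""
    return full_interval_h + cycle_count * incr_interval_h


def array_failure_loss_h(full_interval_h, incr_interval_h):
    """Worst case after an array failure: full plus incremental interval."""
    return full_interval_h + incr_interval_h


def site_disaster_loss_h(full_interval_h, incr_interval_h, cycle_count):
    """Worst case after a site disaster: two cycles plus one full interval."""
    cycle = cycle_duration_h(full_interval_h, incr_interval_h, cycle_count)
    return 2 * cycle + full_interval_h


# Weekly fulls with daily incrementals (24 hr windows, cycle count 6)
weekly_array_loss = array_failure_loss_h(24, 24)
weekly_site_loss = site_disaster_loss_h(24, 24, 6)
```

Under these assumptions the weekly policy can lose up to 48 hours of updates after an array failure and up to 360 hours (15 days) after a site disaster.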
Reconstruction can begin as soon as the secondary data copy is available and sufficient target disk arrays are ready. If standby resources are available, reconstruction can begin nearly immediately; otherwise, resources must be found or acquired, drained if they are in use for another purpose, (re)configured and/or (re)initialized (formatted). To minimize this delay, sites often keep standby equipment in various states of readiness. This formulation models a spectrum of spare resource options, as shown below. In substantially all cases, the model assumes that any spare resources are eventually replaced with new equipment, and factors this replacement cost out of the equations.
Spare resource options are modeled by the outlay cost of maintaining ready resources (either dedicated or shared) and the recovery time and corresponding financial penalties to provision those resources. One way of achieving access to spare resources is to rent access to a shared resource pool. Several companies offer such a service, which can be much cheaper than a dedicated backup site. The cost of shared resources is modeled by a fixed discount factor.
\(t_{identify}\): time to identify that spare resources are available (s).
\(t_{configure}\): time to configure spare resources (s).
\(t_{scrub}\): time to scrub spare resources (s).
\(t_{negotiate}\): time to negotiate spare resources (s).
spareCost: outlay cost of spare disk array storage and facilities ($/yr).
spareDiscount: discount factor for shared resources (fraction in [0, 1]).
Derived Spare Resources Parameters
\(RT_{hot} = t_{identify}\): recovery time from a hot spare resource (s).
\(RT_{unconfig} = t_{identify} + t_{configure}\): recovery time from an unconfigured spare resource (s).
\(RT_{occupied} = t_{identify} + t_{scrub} + t_{negotiate}\): recovery time from an occupied spare resource (s).
\(RT_{occUnconfig} = t_{identify} + t_{configure} + t_{scrub} + t_{negotiate}\): recovery time from an occupied and unconfigured spare resource (s).
\(RT_{none} = t_{order} + t_{configure} + t_{identify}\): recovery time from no spare resources (s).
\(O_{hot} = \mathit{spareCost}\): outlay cost of a hot spare resource ($/yr).
\(O_{unconfig} = \mathit{spareCost}\): outlay cost of an unconfigured spare resource ($/yr).
\(O_{occupied} = \mathit{spareCost} \cdot \mathit{spareDiscount}\): outlay cost of an occupied spare resource ($/yr).
\(O_{occUnconfig} = \mathit{spareCost} \cdot \mathit{spareDiscount}\): outlay cost of an occupied and unconfigured spare resource ($/yr).
\(O_{none} = 0\): outlay cost of no spare resources ($/yr).
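The derived spare resource parameters can be tabulated as in the sketch below, which evaluates the recovery time and annual outlay for each of the five options. The function name, the dictionary layout, and the numeric inputs in the example are illustrative assumptions; `t_order` appears in the recovery time for the no-spare option but is not defined in the parameter list above, so it is treated here as one more input.

```python
def spare_options(t_identify, t_configure, t_scrub, t_negotiate, t_order,
                  spare_cost, spare_discount):
    """Return {option: (recovery_time_s, outlay_per_year)} for each option."""
    shared_cost = spare_cost * spare_discount  # shared pool is discounted
    return {
        "hot": (t_identify, spare_cost),
        "unconfig": (t_identify + t_configure, spare_cost),
        "occupied": (t_identify + t_scrub + t_negotiate, shared_cost),
        "occUnconfig": (t_identify + t_configure + t_scrub + t_negotiate,
                        shared_cost),
        "none": (t_order + t_configure + t_identify, 0),
    }


# Illustrative inputs only: seconds for times, $/yr for cost.
options = spare_options(t_identify=60, t_configure=600, t_scrub=1800,
                        t_negotiate=3600, t_order=86400,
                        spare_cost=100000, spare_discount=0.25)
```

The table makes the trade-off explicit: options with lower outlay have longer recovery times, which the optimization weighs against outage penalties.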
Decision Variables and Derived (State) Variables
The mixed integer program declares variables for each decision it makes in determining a primary result, such as a primary solution. A set of binary decision variables represents the data protection alternatives and their base configurations. Each binary variable corresponds to a single protection alternative (e.g., mirroring or backup) and a specific set of discrete configuration parameters (e.g., “batched asynchronous mirroring with a write absorption interval of one minute”). Integer decision variables represent the number of bandwidth devices (e.g., network links or tape drives) for each alternative.
Mirroring Decision Variables
\(y_m\): number of links used for synchronous or asynchronous mirroring, \(m \in \{sync, async\}\).
\(y_{asyncB,w(M)}\): number of links used for batched asynchronous mirroring with write absorption for window type \(w(M) \in W(M)\).
When formulated in the straightforward fashion described above, the recovery time models have terms that depend inversely on the number of links or tape drives y, so the resulting optimization problem is non-linear. Although solvers exist for certain classes of non-linear optimization problems, they may take an unacceptably long time to find a result, or fail to find one. Linear solvers exploit theoretical results about the search space structure to solve significantly larger problems in seconds. To address this issue, the models are recast by introducing new variables z that stand for the inverse of the y terms (z = 1/y).
Because mirroring keeps a copy of the data constantly accessible, recovery can proceed via reconstructing the primary from the secondary across the network:
derived variable for the recovery time of synchronous and asynchronous mirroring under array failures (s).
derived variable for the recovery time of synchronous and asynchronous mirroring under site disasters (s).
derived variable for the recovery time of batched asynchronous mirroring with write absorption under array failures (s).
derived variable for the recovery time of batched asynchronous mirroring with write absorption under site disasters (s).
Backup Decision Variables
Recovery from tape is a three-phase process: first, if the tapes are stored at an offsite vault, they must be retrieved to the recovery site; second, the latest full backup is restored; and third, the latest subsequent incremental backup is restored. Vaults can be close to or far away from the target data recovery location. The largest capacity incremental backup is the last one of the cycle. To simplify the formulation, the models assume that substantially all the tape drives in each library operate in parallel during each phase and that data is spread evenly across the tapes and drives. Tape load time is ignored because it is typically less than 5% of the time to read the tape.
The worst-case recovery time is the time to retrieve the tapes from the offsite vault (in the case of a site disaster), plus the time to restore the last full and the last incremental backup of a cycle:
derived variable for recovery time under backup policy [k, w(F), w(I)] for array failures (s).
derived variable for recovery time under backup policy [k, w(F), w(I)] for site disasters (s).
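The three-phase recovery described above can be sketched as a simple calculation. This is an illustration under stated assumptions rather than the formulation's own expression: the function and argument names are hypothetical, all drives are assumed to stream in parallel at `tape_drive_bw`, tape load time is ignored as in the text, and the vault retrieval time is charged only for a site disaster.

```python
def tape_recovery_time_s(full_size_bytes, largest_incr_bytes,
                         num_drives, tape_drive_bw,
                         rt_vault_s, site_disaster):
    """Worst-case restore time: optional vault retrieval, then full + incremental."""
    if num_drives <= 0 or tape_drive_bw <= 0:
        raise ValueError("need at least one drive with positive bandwidth")
    aggregate_bw = num_drives * tape_drive_bw       # drives work in parallel
    restore_full = full_size_bytes / aggregate_bw   # phase 2
    restore_incr = largest_incr_bytes / aggregate_bw  # phase 3
    retrieval = rt_vault_s if site_disaster else 0  # phase 1, offsite only
    return retrieval + restore_full + restore_incr
```

Both restore terms are inversely proportional to the number of drives, which is the non-linearity that the z = 1/y substitution is meant to remove.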
Spare Resources Decision Variables
derived variable for recovery time of spare resources alternatives under both array failures and site disasters (s).
Penalty Decision Variables
The target recovery time and recovery point objectives (RTO and RPO) are considered soft constraints, because penalties are assessed when they are violated. The following decision variables determine the degree of violation.
\(v^{RTO}_{m,f}\): violation of target RTO under synchronous or asynchronous mirroring (\(m \in \{sync, async\}\)) for failure scope f (s).
\(v^{RTO}_{asyncB,w(M),f}\): violation of target RTO under asynchronous mirroring with write absorption for window type \(w(M) \in W(M)\) for failure scope f (s).
\(v^{RTO}_{k,w(F),w(I),f}\): violation of target RTO under backup policy [k, w(F), w(I)] for failure scope f (s).
\(v^{RPO}_{m,f}\): violation of target RPO under synchronous or asynchronous mirroring (\(m \in \{sync, async\}\)) for failure scope f (s).
\(v^{RPO}_{asyncB,w(M),f}\): violation of target RPO under asynchronous mirroring with write absorption for window type \(w(M) \in W(M)\) for failure scope f (s).
\(v^{RPO}_{k,w(F),w(I),f}\): violation of target RPO under backup policy [k, w(F), w(I)] for failure scope f (s).
Outlay Derived Variables
outlay cost for enabled mirroring alternative for primary array failure and site disaster under reconstruction ($/yr).
outlay cost for any enabled backup and vaulting alternative ($/yr). To reconstruct data after a site disaster, a tape library is used at the reconstruction site, in addition to the tape library used to create the backup at the primary site. The model assumes that tapes are replaced every year, to guard against media failures. Finally, this cost also includes a component for the disks to store snapshots or split mirrors to permit consistent backups.
outlay cost for spare resources ($/yr).
Penalty Derived Variables
Objective Function
The optimization's objective is to minimize overall annual business cost, defined as outlays plus expected penalties for primary copy failures:
\(\min\,\bigl(O_{array} + O_{mirror} + O_{backup} + O_{spare} + P_{unavail} + P_{loss}\bigr)\)
Constraints
1) Exactly one data protection alternative is chosen in an embodiment.
2) Exactly one spare resource alternative is chosen in an embodiment.
3) The number of bandwidth devices for a data protection alternative is either zero (if that alternative has not been chosen), or it is within the range specified by the upper and lower bounds calculated or specified for the alternative.
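\(\mathit{linksMin}_m \cdot x_m \le y_m \le \mathit{linksMax} \cdot x_m, \quad m \in \{sync, async\}\)
\(\mathit{linksMin}_{asyncB,w} \cdot x_{asyncB,w} \le y_{asyncB,w} \le \mathit{linksMax} \cdot x_{asyncB,w}, \quad w \in W(M)\)
\(\mathit{tapeDrivesMin}_{k,w(F),w(I)} \cdot x_{k,w(F),w(I)} \le y_{k,w(F),w(I)} \le \mathit{tapeDrivesMax} \cdot u_{k,w(F),w(I)}, \quad k \in K,\ w(F) \in W(F),\ w(I) \in W(I)\)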
4) The aggregate data protection technique bandwidth may not exceed the aggregate primary array reload rate.
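\(\mathit{linkBW} \cdot y_m \le \mathit{arrayReloadBW} \cdot \mathit{numDiskArrays}, \quad m \in \{sync, async\}\)
\(\mathit{linkBW} \cdot y_{asyncB,w} \le \mathit{arrayReloadBW} \cdot \mathit{numDiskArrays}, \quad w \in W(M)\)
\(\mathit{tapeDriveBW} \cdot y_{k,w(F),w(I)} \le \mathit{arrayReloadBW} \cdot \mathit{numDiskArrays}, \quad k \in K,\ w(F) \in W(F),\ w(I) \in W(I)\)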
5) The number of tape libraries is the larger of the number of libraries needed to hold the tapes and the number needed to hold the tape drives. This maximum can be linearized as follows:
uk,w(F),w(I)≧y
uk,w(F),w(I)≧(numTapes
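The two inequalities above are truncated in this text. As a hedged reading, the sketch below assumes each lower bound is a count divided by the corresponding per-library capacity (`tapeDrivesMax` drives and `tapesMax` tapes per library, as defined earlier), so that the smallest feasible integer library count is the maximum of the two rounded-up quotients. The function name is hypothetical.

```python
import math


def tape_libraries_needed(num_drives, num_tapes, tape_drives_max, tapes_max):
    """Smallest integer u with u >= drives/tapeDrivesMax and u >= tapes/tapesMax."""
    if tape_drives_max <= 0 or tapes_max <= 0:
        raise ValueError("per-library capacities must be positive")
    by_drives = math.ceil(num_drives / tape_drives_max)
    by_tapes = math.ceil(num_tapes / tapes_max)
    return max(by_drives, by_tapes)
```

Replacing the maximum with two separate "greater than or equal" constraints keeps the program linear; because library cost is minimized, the solver settles on the smallest u that satisfies both.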
6) The recovery time limit is satisfied.
The extra term on the right hand side of each inequality is intended to add a large number (constant C, e.g., five years) when that alternative is not chosen, permitting the penalty to be zero.
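\((\mathit{recoveryTime}_{m,f} + \mathit{recoveryTime}_{spares}) - v^{RTO}_{m,f} \le \mathit{targetRTO} + (1 - x_m) \cdot C, \quad m \in \{sync, async\},\ f \in \mathit{FailureScopes}\)
\((\mathit{recoveryTime}_{asyncB,w,f} + \mathit{recoveryTime}_{spares}) - v^{RTO}_{asyncB,w,f} \le \mathit{targetRTO} + (1 - x_{asyncB,w}) \cdot C, \quad w \in W(M),\ f \in \mathit{FailureScopes}\)
\((\mathit{recoveryTime}_{k,w(F),w(I),f} + \mathit{recoveryTime}_{spares}) - v^{RTO}_{k,w(F),w(I),f} \le \mathit{targetRTO} + (1 - x_{k,w(F),w(I)}) \cdot C, \quad k \in K,\ w(F) \in W(F),\ w(I) \in W(I),\ f \in \mathit{FailureScopes}\)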
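To see how the large constant C switches the recovery time constraint on and off, the sketch below computes the smallest non-negative violation v that satisfies the inequality for a single alternative. The function name and the sample numbers are illustrative assumptions; a solver would arrive at the same value because violations are penalized in the objective.

```python
HOUR_S = 3600
FIVE_YEARS_S = 5 * 365 * 24 * HOUR_S  # example value for the constant C


def min_rto_violation(recovery_time_s, spare_time_s, target_rto_s,
                      chosen, big_c_s=FIVE_YEARS_S):
    """Smallest v >= 0 with (RT + RT_spares) - v <= targetRTO + (1 - x) * C."""
    x = 1 if chosen else 0
    relaxed_target = target_rto_s + (1 - x) * big_c_s
    return max(0, recovery_time_s + spare_time_s - relaxed_target)


# Chosen alternative: 5 hr restore + 1 hr spare provisioning vs. 4 hr target
violation_if_chosen = min_rto_violation(5 * HOUR_S, 1 * HOUR_S, 4 * HOUR_S, True)
# Same alternative when it is not chosen: the constraint is slack
violation_if_unchosen = min_rto_violation(5 * HOUR_S, 1 * HOUR_S, 4 * HOUR_S, False)
```

When the alternative is not chosen, the right-hand side grows by C, so a zero violation satisfies the inequality and no penalty is charged for that alternative.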
7) The data loss limit is satisfied.
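\(\mathit{dataLoss}_{m,f} \cdot x_m - v^{RPO}_{m,f} \le \mathit{targetRPO}, \quad m \in \{sync, async\},\ f \in \mathit{FailureScopes}\)
\(\mathit{dataLoss}_{asyncB,w,f} \cdot x_{asyncB,w} - v^{RPO}_{asyncB,w,f} \le \mathit{targetRPO}, \quad w \in W(M),\ f \in \mathit{FailureScopes}\)
\(\mathit{dataLoss}_{k,w(F),w(I),f} \cdot x_{k,w(F),w(I)} - v^{RPO}_{k,w(F),w(I),f} \le \mathit{targetRPO}, \quad k \in K,\ w(F) \in W(F),\ w(I) \in W(I),\ f \in \mathit{FailureScopes}\)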
8) Linearization constraints for z=1/y.
The formulation introduces additional constraints on the optimization problem to ensure that the transformed equation is equivalent to the original. Since the transformation from y to z is convex, well-known techniques can be used to linearize the convex constraints in continuous variables. For example, by constraint 3), the variable \(y_{sync}\) is an integer in the interval \([\mathit{linksMin}_{sync}, \mathit{linksMax}]\). Consider a partition of this interval into n subintervals, where the partition points \(y^0_{sync} < y^1_{sync} < \dots < y^n_{sync}\) are integers in the interval \([\mathit{linksMin}_{sync}, \mathit{linksMax}]\). Then \(z^j_{sync} = 1/y^j_{sync}\) for \(j = 0, 1, \dots, n\). Express the y and z variables as a convex combination of the points that define the subintervals in the partition. Let \(\lambda^j_{sync} \ge 0\) be the non-negative continuous variable that represents a point in the partition, for \(j = 0, 1, \dots, n\). Thus,
\[ y_{sync} = \sum_{j=0}^{n} \lambda^j_{sync}\, y^j_{sync}, \qquad z_{sync} = \sum_{j=0}^{n} \lambda^j_{sync}\, z^j_{sync}, \qquad \sum_{j=0}^{n} \lambda^j_{sync} = 1. \]
An additional constraint is introduced to ensure that at most two consecutive lambdas can be non-zero. Thus the lambdas are defined as SOS2 variables. Most commercially available solvers enable SOS2 functionality.
Generalizing to substantially all data protection techniques:
where the dot (.) stands for m, [asyncB,w], or [k,w(F),w(I)]; \((y^n_{\cdot})_{n \in N}\) is a given partition of the interval where the variable is defined; and \(z^n_{\cdot} = 1/y^n_{\cdot}\).
8.3) \(\lambda^n_{\cdot}\) is a decision variable within the model, defined as an SOS2 variable, that enforces the piecewise linearization of z = 1/y.
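The effect of the SOS2 linearization can be checked numerically without a solver. The sketch below, with hypothetical function names and an arbitrary example partition, builds the weights that an SOS2 constraint permits for a given y (at most two consecutive non-zero lambdas summing to one) and evaluates the resulting piecewise-linear estimate of z = 1/y.

```python
from bisect import bisect_right


def sos2_weights(y, breakpoints):
    """Weights with at most two consecutive non-zeros, sum 1, sum(w*b) == y."""
    if len(breakpoints) < 2:
        raise ValueError("need at least two breakpoints")
    if not breakpoints[0] <= y <= breakpoints[-1]:
        raise ValueError("y lies outside the partitioned interval")
    # Index of the subinterval [b_j, b_{j+1}] that contains y.
    j = min(bisect_right(breakpoints, y) - 1, len(breakpoints) - 2)
    lo, hi = breakpoints[j], breakpoints[j + 1]
    t = (y - lo) / (hi - lo)
    weights = [0.0] * len(breakpoints)
    weights[j] = 1.0 - t
    weights[j + 1] = t
    return weights


def approx_inverse(y, breakpoints):
    """Piecewise-linear estimate of 1/y: the same weights applied to 1/b_j."""
    weights = sos2_weights(y, breakpoints)
    return sum(w / b for w, b in zip(weights, breakpoints))


partition = [1, 2, 4, 8]  # example integer breakpoints for y
```

At a breakpoint the estimate is exact. Between breakpoints it lies on the chord of the convex curve 1/y and therefore overestimates it (for y = 3 on this partition, 0.375 against 1/3), so a finer partition gives a tighter recovery time estimate at the cost of more lambda variables.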
This section provides an overview of the example hardware and the operating environment in which embodiments of the invention can be practiced.
The memory unit 1130 includes an operating system 1140, which includes an I/O scheduling policy manager 1132 and I/O schedulers 1134. The memory unit 1130 stores data and/or instructions, and may comprise any suitable memory, such as a dynamic random access memory (DRAM), for example. The computer system 1100 also includes IDE drive(s) 1108 and/or other suitable storage devices. A graphics controller 1104 controls the display of information on a display device 1106, according to embodiments of the invention.
The Input/Output controller hub (ICH) 1124 provides an interface to I/O devices or peripheral components for the computer system 1100. The ICH 1124 may comprise any suitable interface controller to provide for any suitable communication link to the processor(s) 1102, memory unit 1130 and/or to any suitable device or component in communication with the ICH 1124. For one embodiment of the invention, the ICH 1124 provides suitable arbitration and buffering for each interface.
For one embodiment of the invention, the ICH 1124 provides an interface to one or more suitable integrated drive electronics (IDE) drives 1108, such as a hard disk drive (HDD) or compact disc read-only memory (CD ROM) drive, or to suitable universal serial bus (USB) devices through one or more USB ports 1110. For one embodiment, the ICH 1124 also provides an interface to a keyboard 1112, a mouse 1114, a CD-ROM drive 1118, and one or more suitable devices through one or more firewire ports 1116. The ICH 1124 also provides a network interface 1120 through which the computer system 1100 can communicate with other computers and/or devices.
In one embodiment, the computer system 1100 includes a machine-readable medium that stores a set of instructions (e.g., software) embodying any one, or all, of the methodologies for dynamically loading object modules described herein. Furthermore, software can reside, completely or at least partially, within memory unit 1130 and/or within the processor(s) 1102.
Thus, a system, method, and machine-readable medium including instructions for Input/Output scheduling have been described. Although the present invention has been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the disclosed subject matter. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
This application claims priority to a provisional application entitled “Method of Automatically Designing Storage Systems to Achieve Data Dependability Goals,” filed Sep. 19, 2003, Ser. No. 60/504,230.
Number | Date | Country
---|---|---
60504230 | Sep 2003 | US