The present description is related to the field of performance tuning of data storage systems.
A disclosed method of tuning performance of a data storage system includes calculating an estimate of parallel fraction and speedup characteristic for a data storage application executed by the data storage system. The estimate is calculated using linear regression of values (1/N, 1/X_N) that are generated from trial runs of the data storage application processing a workload using respective different numbers N of CPU cores to obtain corresponding performance values X_N. The method further includes configuring the data storage system to execute the data storage application using a number of CPU cores based on the estimate of parallel fraction and speedup characteristic.
Advantages of the disclosed method include the use of an accurate linear estimation, which in general requires only a few data points; it is unnecessary to measure performance for every possible number of cores that may be used.
The foregoing and other objects, features and advantages will be apparent from the following description of particular embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views.
Resource scaling is a serious challenge facing software developers. Modern CPU designers often increase performance by placing more and more cores onto the same die. However, this does not automatically improve performance, since multiple cores will be useful only if system code is designed to be scalable. While there is an art to developing systems to be maximally scalable, system architects and developers also face the challenge of quantifying the performance enhancement achievable from scaling in practice. Such quantification can provide an objective basis for exploring cost-performance tradeoffs as well as opportunities for fine tuning performance. Thus an approach is described herein for quantifying system scalability based on system performance measurements and linear regression. The results provide quantitative measures for tracking progress in system performance tuning as well as other purposes, such as system performance modeling for example.
As a computerized system, the performance analyzer 14 has computer hardware including one or more processors, memory, and data interconnections such as one or more high-speed data buses (not specifically shown), and in operation the memory stores data and instructions of a performance analysis program which are executed by the processor(s) to cause the hardware to function in a software-defined manner. The performance analysis program may be stored on a non-transitory computer-readable medium such as an optical or magnetic disk, Flash memory or other non-volatile semiconductor memory, etc., from which it is retrieved for execution by the processing circuitry, as generally known in the art.
In the illustrated arrangement, the performance analyzer 14 is separate from the DSSs 12, and in this case may be embodied in a standalone computer (e.g., desktop or server) having communications connections to the DSSs 12. In some embodiments, a performance analyzer may be incorporated into a DSS 12, typically sharing computer hardware and using internal communication mechanisms.
As shown, the processing circuitry 24 includes a multi-core CPU 30, memory 32 and a co-processor 34, such as a data compression engine for example. The CPU 30 executes instructions of a data storage application (not shown), which are stored in and retrieved from memory 32 during execution after being retrieved from separate non-volatile storage such as disk or Flash. As generally known, the data storage application is responsible for myriad functional aspects of operation, including the management of resources and coordination of concurrent activities (i.e., handling multiple client I/O requests and back-end accesses concurrently), management of internal file systems and other specialized data/functions providing a logical view of stored data to the clients, handling error conditions, etc.
The disclosed technique is specifically directed to the use of the multi-core CPU 30 by the data storage application. As generally known, a multi-core CPU includes a number of distinct processing elements, called “cores”, which are connected for shared access to the memory 32 and are treated as allocable resources by resource-management code of the data storage application. A typical modern CPU may have on the order of 20 or more cores, for example. Use of independent cores can increase performance by parallel execution of distinct operations. For a data storage system in particular, parallelization may be based on one or more of the following as examples: (1) separating I/O requests of different clients and/or different devices 28, (2) separating I/O requests by stage of completion, e.g., front-end, cache, back-end, (3) separating operations by coarse function, e.g., dedicating some cores to certain background processes (such as rebuilds or cache management) while using other cores for handling client I/O.
In the remaining description, the data storage application is described in terms of its ability to be “parallelized”, i.e., to have parts of it be executed in parallel. This feature is also referred to as “parallelizability” and represented via a parameter called “parallel fraction”. Although no specific code examples are given, those skilled in the art will understand that parallelizability depends on both the general nature of the data storage application (e.g., general independence of operations) as well as its implementation (e.g., use of shared structures), and thus in general an application can be adjusted by suitable modification of code to reduce serial dependencies. For any given instance of an actual data storage application, specific features of the code contribute to its parallelizability and parallel fraction. The description below refers to parallel fraction as an abstract parameter that can be derived from measurements and analysis. It will be understood that in a real system the parallel fraction as measured arises from, and in a sense represents, the nature of the application as well as its detailed implementation. Some applications are inherently more parallelizable than others, and some implementations of a given application are more parallelizable than others.
Measuring System Scalability
System speedup due to scaling of computational resources can be estimated by analysis as follows. Suppose the amount of time to complete a task on a single core is T, and that a parallelizable part of the task (parallel fraction) is p; i.e., it will take pT/N amount of time to complete that part of the task on N cores. The rest of the task is not parallelizable, and it will take (1−p)T amount of time independent of the number of cores involved. The time to complete a task with N cores is the sum of the two:

T(N) = (1−p)T + pT/N    (1)
The speedup for N cores, S(N), is the ratio of time to complete a task on one core to time to complete a task on N cores:

S(N) = T(1)/T(N) = T / ((1−p)T + pT/N) = 1 / ((1−p) + p/N)    (2)
Equation (2) is known as Amdahl's Law.
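For illustration, Equations (1) and (2) translate directly into code. The following Python sketch (with hypothetical helper names, not part of the disclosed system) computes completion time and speedup for a given parallel fraction p and core count N:

```python
def completion_time(T, p, N):
    """Equation (1): time to complete a task on N cores, given
    single-core time T and parallel fraction p."""
    return (1 - p) * T + p * T / N

def amdahl_speedup(p, N):
    """Equation (2), Amdahl's Law: speedup achieved on N cores."""
    return 1.0 / ((1 - p) + p / N)

# Example: even with 95% of the work parallelizable, 20 cores yield
# only about a 10.3x speedup, not 20x.
print(amdahl_speedup(0.95, 20))
```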
With the above appreciation of the relationship between speedup and the parallel fraction p, the description now turns to a technique for quantifying these values for a real system and using the results for desired purposes, such as performance tuning, code or hardware design, long-term performance analysis and tracking, etc.
The disclosed approach assumes that Amdahl's law applies to the system performance:

X_N = X_1 · S(N)    (3)
where N is the number of CPU cores and X is a relevant performance metric. The present description uses system throughput in Input/Output operations per second (IOPS) as the performance metric X; in general, other metrics may be used. Thus the term X_N in Equation (3) is the system throughput in IOPS (or other metric) using N processing cores, and the term X_1 refers to system throughput (or other metric) using only one core, also called “single-core IOPS”. p is the parallel fraction, which represents the parallelizable portion of the code, and

S(N) = 1 / ((1−p) + p/N)

is Amdahl's speedup for N cores. Equation (3) is used to derive the parallel fraction p and single-core IOPS X_1 given a set of system measurements (N, X_N) for various values of N.
First, Equation (3) is inverted to obtain the following:

1/X_N = (1−p)/X_1 + (p/X_1) · (1/N)    (4)
Equation (4) represents a line of the form y = mx + b, where y = 1/X_N and x = 1/N, with slope

m = p/X_1    (5)

and intercept

b = (1−p)/X_1    (6)
In other words, if the set of values (N, X_N) obeys Amdahl's law, then the set of values (1/N, 1/X_N) falls on a line with a slope of p/X_1 and an intercept of (1−p)/X_1.
Thus the disclosed technique is based on obtaining measurements of X_N for a set of trials using different values of N, and estimating a linear fit of the values (1/N, 1/X_N) that are derived from the measured data. The slope m and intercept b of the estimated linear fit are used in further calculations to derive the parallel fraction and other parameters. Generally, at least two data points are needed to define a line, but three or more points generally provide better accuracy and an indication of “goodness of fit” (R²). The closer R² is to 1, the higher the confidence level in the estimates.
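As a minimal sketch of this fitting step, assuming NumPy is available, the following code derives the slope m, intercept b, and R² from a set of trial measurements (the (N, X_N) pairs shown are placeholders, not the measured data reported below):

```python
import numpy as np

# Placeholder trial measurements: (number of cores N, throughput X_N in IOPS).
trials = [(4, 30000.0), (8, 52000.0), (16, 80000.0), (24, 96000.0)]

x = np.array([1.0 / n for n, _ in trials])    # 1/N values
y = np.array([1.0 / xn for _, xn in trials])  # 1/X_N values

# Least-squares linear fit of Equation (4): y = m*x + b.
m, b = np.polyfit(x, y, 1)

# Goodness of fit (R^2) of the linear model.
y_fit = m * x + b
ss_res = ((y - y_fit) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1.0 - ss_res / ss_tot
```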
Once the slope m and the intercept b are determined, those values are used to find the parallel fraction p and the single-core IOPS X_1, using Equations (5) and (6) for the slope and intercept as noted above.
Adding the two equations eliminates p:

m + b = p/X_1 + (1−p)/X_1 = 1/X_1    (7)
which can be rearranged to express single-core IOPS as:

X_1 = 1/(m + b)    (8)
Substituting the solution for X_1 into Equation (6), an expression for p is obtained:

p = 1 − b·X_1 = m/(m + b)    (9)
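Continuing the sketch above, Equations (8) and (9) are one line each given the fitted m and b:

```python
X1 = 1.0 / (m + b)  # Equation (8): single-core IOPS
p = m / (m + b)     # Equation (9): parallel fraction
```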
Single-core IOPS (X_1) can then be used to estimate cycles per byte (CPB):

CPB = f / (X_1 · B)    (10)

where f is the core clock frequency in cycles per second and B is the I/O size in bytes.
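A corresponding sketch of Equation (10); the single-core IOPS value passed here is a hypothetical placeholder, while the 2.4 GHz clock and 8 KB I/O size are the example values from the measurement setup described below:

```python
def cycles_per_byte(X1, clock_hz, io_size_bytes):
    """Equation (10): CPU cycles consumed per byte transferred,
    at single-core throughput X1 (IOPS)."""
    return clock_hz / (X1 * io_size_bytes)

# Hypothetical single-core IOPS value, for illustration only.
cpb = cycles_per_byte(X1=12000.0, clock_hz=2.4e9, io_size_bytes=8 * 1024)
```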
Data was collected on a data storage system having two storage processor boards (Node A and Node B) sharing a disk infrastructure. Each board has two 20-core CPUs (40 cores total), each running at a 2.4 GHz core clock frequency. The scaling domain is constrained within each processor board. The workload is a series of 8 KB writes with cache misses (CM).
The fraction of time that the CPU cores are not idle (the ratio of time that the CPU is busy over total time) is referred to as CPU utilization. Ideally, in the absence of other performance bottlenecks in the system, the CPU cores would never be idle, and CPU utilization would be close to 100%. Therefore, measured IOPS are normalized to 100% CPU utilization to eliminate the variability of measured IOPS with respect to CPU utilization:

X_normalized = X_measured / U    (11)

where U is the measured CPU utilization expressed as a fraction.
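A minimal sketch of this normalization, assuming utilization is reported as a fraction between 0 and 1:

```python
def normalize_iops(measured_iops, cpu_utilization):
    """Equation (11): scale measured IOPS to the value expected
    at 100% CPU utilization."""
    return measured_iops / cpu_utilization

# Example: 60,000 IOPS measured at 80% utilization normalizes to 75,000 IOPS.
print(normalize_iops(60000.0, 0.80))
```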
The data in the second column, <Node A Core Count>, and the last column, <IOPS at 100% Utilization>, are used to create the set of (1/N, 1/IOPS) values to which the linear fit is applied.
Using the above values of m and b in the earlier formulas (Equations (8), (9), and (10)), the derived values of single-core IOPS X_1, parallel fraction p, and cycles per byte CPB are obtained.
The R² value for the regression is 0.9978, which indicates a high confidence level that the linear approximation matches the measured data, as seen in the accompanying figure.
As a system performance modeling example, the derived values of parallel fraction and single-core cycles per byte can be used to predict system throughput with a 5 GHz, 100-core CPU. Assuming that there are no other bottlenecks in the system and using Equations (3) and (10), the following expression is derived:

X_100 = (f / (CPB · B)) · 1/((1−p) + p/100)    (12)

where f is the 5 GHz clock frequency and B is the 8 KB I/O size.
Substituting the numerical values into Equation (12) yields the predicted system throughput.
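A sketch of this prediction step follows; the p and cpb arguments would come from the regression and Equation (10) above, and the values passed here are placeholders:

```python
def predict_iops(p, cpb, clock_hz, n_cores, io_size_bytes):
    """Equation (12): single-core IOPS implied by inverting Equation (10)
    at the new clock rate, scaled by Amdahl's speedup S(N) of Equation (2)."""
    x1_new = clock_hz / (cpb * io_size_bytes)   # single-core IOPS at new clock
    speedup = 1.0 / ((1 - p) + p / n_cores)     # S(N) for the new core count
    return x1_new * speedup

# Placeholder parallel fraction and CPB; 5 GHz and 100 cores per the example.
predicted = predict_iops(p=0.97, cpb=30.0, clock_hz=5e9,
                         n_cores=100, io_size_bytes=8 * 1024)
```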
At 42, the data storage system is configured to execute the data storage application using a number of CPU cores based on the estimate of parallel fraction and speedup characteristic. This configuring can take a couple of forms, depending on the usage. The parallel fraction and speedup characteristic are features of the data storage application, specifically its parallelizability, and thus in one usage the estimated values can be used to determine whether the application meets performance objectives (e.g., whether it will be able to provide a target IOPS using a given N-core CPU) and to quantify performance improvements arising from modifying the application code. Thus a developer might use the technique in connection with modifying the code to increase parallel fraction, as measured by using the disclosed technique. Once a desired parallel fraction has been reached, the modified code can be deployed to the data storage system(s) with confidence of meeting performance objectives. In another usage, the analysis results can be used to inform decisions about sizing of processing resources (i.e., numbers of cores) when designing a system, or in an existing system to assess how to best utilize available cores (e.g., to achieve a desired performance-versus-efficiency tradeoff, or to balance performance of one application against that of another). These and other uses may employ criteria such as a low-return zone, i.e., a range of core counts in which adding further cores yields only marginal additional performance.
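As one hypothetical illustration of such a sizing decision (the target and threshold values are arbitrary, not part of the disclosed method), the following sketch selects the smallest core count that meets a target IOPS and flags where additional cores enter a low-return zone:

```python
def min_cores_for_target(p, x1, target_iops, max_cores=40):
    """Smallest N whose predicted throughput X1 * S(N) meets the target,
    or None if the target is unreachable within max_cores."""
    for n in range(1, max_cores + 1):
        if x1 / ((1 - p) + p / n) >= target_iops:
            return n
    return None

def marginal_gain(p, n):
    """Fractional throughput gain from adding one more core at N = n."""
    s = lambda k: 1.0 / ((1 - p) + p / k)
    return s(n + 1) / s(n) - 1.0

# Cores beyond the point where the marginal gain drops below, say, 1%
# lie in a low-return zone and might be reassigned to other work.
```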
The method described above provides a fast and accurate way to determine parallel fraction and single-core cycles per byte from system performance measurements. The results provide quantitative measures for tracking progress in system performance tuning as well as parameters for system performance modeling. In one use, the values of parallel fraction and single-core cycles per byte are tracked over time to evaluate progress in performance tuning.
More specifically, the disclosed method avoids certain potential inefficiency and inaccuracy of other techniques. By using reciprocal values (1/N, 1/X_N), Amdahl's law is converted into a linear relationship. Advantages of this technique include: (1) an accurate linear estimation that in general requires only a few data points, making it unnecessary to measure performance for every possible number of cores that may be used; and (2) a quantitative goodness-of-fit measure (R²) that conveys a confidence level in the estimates.
While various embodiments of the invention have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.