This application relates to system management and operation of large-scale systems and networks having heterogeneous components. More particularly, this application relates to a method and apparatus for predicting application performance across machines having hardware configurations with different hardware specifications or settings.
Recent years have witnessed an explosive growth of servers in enterprise data centers and clouds. Those machines usually come from different vendors with a wide range of hardware configurations having different hardware specifications, such as processor speed, processor cache size, and so on. Such heterogeneity introduces extra challenges in system management. For example, the computation capabilities of various hardware configurations must be differentiated in order to evenly distribute workloads across machines. In the capacity planning task, that knowledge is also required to determine the right number and types of servers to be purchased for increasing workloads. The recent resurgence of virtualization technology creates further demand for mapping application performance across heterogeneous hardware, because virtualization allows applications to migrate between different machines. If the source and target machines of a migration have hardware configurations with different hardware specifications or settings, many system management tools that build a performance model on the initial hardware setting may require recalibration.
The above challenges of server heterogeneity call for a technique that can accurately map application performance across machines with different hardware specifications and settings. A number of techniques have been proposed for this purpose, but each is limited in one way or another. These techniques can be divided into two classes. The first class evaluates application performance on a number of different servers in advance, and builds a model to summarize the application performance across those machines. In practice, however, it is difficult to collect enough data from machines with different hardware configurations. Lacking sufficient measurement data, such real (actual) evaluation based techniques include only a limited number of hardware parameters, and rely on simple models such as linear regression to learn their relationships. Such simplification significantly jeopardizes the prediction accuracy of application performance.
In order to address the data insufficiency issue, the second class of techniques relies on software simulation to collect data for performance modeling. There are many simulation tools that can construct a complete microprocessor pipeline in software to approximate the application performance on any specified hardware device. By using those tools, sufficient data can be collected from a wide range of hardware configurations to learn a complete model for predicting application performance. By its very nature, however, software-based simulation yields uncertain and inaccurate data due to specification inaccuracy, implementation imprecision, and other factors in those tools. As a consequence, the quality of the learned model can be affected by those errors.
Accordingly, a new method and apparatus is needed for predicting application performance across machines with different hardware configurations.
A method is disclosed for predicting performance of an application on a machine of a predetermined hardware configuration. The method comprises: simulating, in a computer process, the performance of the application under a plurality of different simulated hardware configurations; building, in a computer process, a predictive model of the performance of the application based on the results of the simulations; obtaining the performance of the application on a plurality of actual machines, each of the machines having a different hardware configuration; and in a computer process, Bayesian reinterpreting the predictive model built from the results of the simulations using the performance of the application on the plurality of actual machines, to obtain a final predictive model of the performance of the application having an accuracy greater than the predictive model built from the results of the simulations.
In some embodiments of the method the building of the predictive model comprises modeling nonlinear dependencies between the simulated performance of the application and the simulated hardware configurations with a generalized linear regression model with L1 penalty.
In some embodiments of the method the modeling of nonlinear dependencies comprises defining a set of basis functions to transform original variables so that their nonlinear relationships can be included in the predictive model.
In some embodiments of the method the modeling of nonlinear dependencies comprises applying the L1 norm penalty on coefficients of the generalized linear regression model to achieve sparseness of the predictive model's representation.
In some embodiments of the method the Bayesian reinterpreting of the predictive model comprises searching for an optimal solution for the linear regression model with L1 penalty.
In some embodiments of the method the Bayesian reinterpreting of the predictive model built from the results of the simulations comprises relearning parameters of the linear regression model using the performance of the application on the plurality of actual machines.
In some embodiments of the method the Bayesian reinterpreting of the predictive model built from the results of the simulations comprises defining a prior distribution which embeds information learned from the simulations to restrict values of the coefficients of the linear regression model.
In some embodiments of the method the Bayesian reinterpreting of the predictive model built from the results of the simulations comprises maximizing posterior probability distribution of model parameters so that the final predictive model comprises contributions from the simulated and actual hardware configurations.
An apparatus is disclosed for predicting performance of an application on a machine of a predetermined hardware configuration. The apparatus comprises a processor executing instructions for simulating the performance of the application under a plurality of different simulated hardware configurations; building a predictive model of the performance of the application based on the results of the simulations; and Bayesian reinterpreting the predictive model built from the results of the simulations using the performance of the application on a plurality of actual machines each having a different hardware configuration, to obtain a final predictive model of the performance of the application having an accuracy greater than the predictive model built from the results of the simulations.
In some embodiments of the apparatus the instructions for building of the predictive model comprises instructions for modeling nonlinear dependencies between the simulated performance of the application and the simulated hardware configurations with a generalized linear regression model with L1 penalty.
In some embodiments of the apparatus the instructions for modeling of nonlinear dependencies comprises instructions for defining a set of basis functions to transform original variables so that their nonlinear relationships can be included in the predictive model.
In some embodiments of the apparatus the instructions for modeling of nonlinear dependencies comprises instructions for applying the L1 norm penalty on coefficients of the linear regression model to achieve sparseness of the predictive model's representation.
In some embodiments of the apparatus the instructions for Bayesian reinterpreting of the predictive model comprises instructions for searching for an optimal solution for the linear regression model with L1 penalty.
In some embodiments of the apparatus the instructions for Bayesian reinterpreting of the predictive model built from the results of the simulations comprises instructions for relearning parameters of the linear regression model using the performance of the application on the plurality of actual machines.
In some embodiments of the apparatus the instructions for Bayesian reinterpreting of the predictive model built from the results of the simulations comprises instructions for defining a prior distribution which embeds information learned from the simulations to restrict values of the coefficients of the linear regression model.
In some embodiments of the apparatus the instructions for Bayesian reinterpreting of the predictive model built from the results of the simulations comprises instructions for maximizing posterior probability distribution of model parameters so that the final predictive model comprises contributions from the simulated and actual hardware configurations.
The predictor x in the performance model represents various hardware specifications including, without limitation, data/instruction translation lookaside buffer (TLB) sizes, data/instruction level 1 (L1) cache sizes, level 2 (L2) cache sizes, L1 cache latency, and L2 cache latency. The hardware specifications can be obtained from the spec sheets of the corresponding machine. The response variable y measures the quality of serving the incoming workloads. The definition of that performance metric varies with the characteristics of the application: while some computation-intensive applications use system throughput to measure the quality of service, some user-interactive applications rely on request response time to describe performance. Instead of focusing on those application-specific metrics, the method of the present disclosure uses machine CPU utilization as the measure of system performance, because CPU utilization has been shown to be highly correlated with high-level performance metrics such as throughput and request response time.
Machine CPU utilization also depends on the intensity of the incoming workloads. Because the present method requires a performance variable whose value is determined only by the specifications of the underlying hardware, the method of the present disclosure removes the workload contribution by decomposing the machine CPU utilization as:

$$\text{CPU utilization} = \frac{\text{number of instructions issued} \times \text{CPI}}{\text{CPU speed} \times \text{observation time}} \qquad (1)$$
In other words, machine CPU utilization is determined by the number of instructions issued by the application, the CPU cycles per instruction (CPI), and the CPU speed. Note that the number of issued instructions is proportional to the intensity of workloads, and the CPU speed is a parameter that can be obtained from the hardware specifications. Therefore, the method of the present disclosure focuses on CPU cycles per instruction (CPI) as the performance variable y. This metric reflects the hardware contribution to application performance, and its value can be measured during system operation by well-known tools including, but not limited to, the OProfile system-wide profiler. Given the CPI measurements on a set of hardware instances, the method of the present disclosure builds a statistical performance model y = ƒ(x) to predict the CPI value (the output of the model) when the application is running on any new hardware platform.
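As a minimal numerical sketch of this decomposition (hypothetical numbers only; the helper names are illustrative, not part of the disclosure):

```python
# Sketch of the CPU utilization decomposition described above.
# All quantities are hypothetical and for illustration only.

def cpu_utilization(instructions, cpi, cpu_hz, interval_s):
    """Utilization = (instructions issued * CPI) / cycles available."""
    return (instructions * cpi) / (cpu_hz * interval_s)

def cpi_from_utilization(utilization, instructions, cpu_hz, interval_s):
    """Invert the decomposition to recover CPI, the performance variable y."""
    return utilization * cpu_hz * interval_s / instructions

# Example: 2e9 instructions retired over a 1 s interval on a 3 GHz CPU
# at 50% utilization implies an average CPI of 0.75.
print(cpi_from_utilization(0.5, 2e9, 3e9, 1.0))  # -> 0.75
```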
The prediction model of the present disclosure can benefit many management tasks in a heterogeneous environment. For example, the prediction model of the present disclosure can be used to determine the right number and types of new machines that need to be purchased during system capacity planning, even when those machines are not yet available. The recent resurgence of virtualization technology has also introduced considerable interest in performance mapping across heterogeneous hardware, because virtualized applications are capable of migrating between different machines. If the original and destination machines of a migration are different, some management tools may require recalibration after the migration, especially those tools that rely on the relationship between the application performance and other system measurements such as the workload intensity. Model recalibration needs to be accomplished in real time so that it can take effect immediately after the migration.
One challenge of learning the model is the lack of measurement data, because there are usually not enough hardware instances available for model training. Given limited data, some simplifications are commonly used in the model construction, which either reduce the number of hardware parameters or use a simple function η(·) to reflect their relationships. For example, one previous method builds a loglinear model based only on L1 and L2 cache sizes for performance prediction. Other prior art methods use software simulation to address the data insufficiency issue. While simulation can generate enough data for constructing the performance model, there are always errors associated with simulation due to the implementation imprecision and specification inaccuracies in those tools. Such errors will affect the prediction model learned from the simulation results.
In the simulation process of block 200, a simulation tool such as, but not limited to, PTLsim, is used to collect data [x, y], where x represents the hardware specifications of the machine of interest and y is the application performance, i.e., the average CPU cycles per instruction (CPI) on that machine. Given those data, a generalized linear regression with L1 penalty is used in block 202 to model the non-linear dependencies between the application performance (response y) and the underlying hardware parameters (input variables x). A plurality of non-linear templates, based on domain knowledge, are generated to transform the original variables, and a set of polynomial basis functions is applied to the new variables. Because the exact form of the nonlinear relationship between variables is not known, all possible basis functions are included in the model, and many of them may not have any relationship with the performance. In order to remove irrelevant components, the method applies the L1 penalty on the regression coefficients, and an algorithm (described further on) is used to identify the optimal solution for that constrained regression problem. The sparse statistical model that results from this process can effectively predict the performance of the application based on simulation results.
Due to the errors in software simulation, the process of block 204 comprises running the application on a limited number of actual hardware instances, and the process of block 206 uses Bayesian learning to enhance the model learned from simulation. The evaluation data from the actual hardware instances is used to relearn the parameters of the regression model obtained from simulation. Because the limited number of actual performance measurements would introduce large variances into the model fitting, the knowledge learned from simulation is used to restrict the values of the regression coefficients. Such a prior constraint is represented as a Gaussian distribution whose mean is given by the values of the corresponding coefficients learned from simulation. By maximizing the posterior probability of the model parameters, a solution (the performance model) is found that takes advantage of both simulation and actual evaluation results in the performance prediction of the model.
Besides the logarithmic relationship, there are also other nonlinearities in the performance model. The majority of those nonlinearities appear to lie in the polynomial representation of variables. In order to include those factors, block 302 applies a polynomial kernel of order 2 to the variables z to obtain a pool of basis functions {φ₁(z), φ₂(z), …, φ_p(z)}. As can be seen, those basis functions contain terms in the variables z of polynomial degree at most 2.
Given the original inputs x with r variables, the vector z doubles the number of variables, i.e., s=2r, and the number of basis functions in the pool becomes p=1+s+s(s+1)/2. Many basis functions may be obtained in the regression even when the number of original variables is small. For example, if the input x contains 10 variables, the number of basis functions already reaches 231. Such a large number of basis functions is due to the lack of knowledge about the exact form of nonlinear relationships in the underlying model. Therefore, all possible forms of nonlinearities are included in the representation
$$y = \beta_1 \phi_1(z) + \beta_2 \phi_2(z) + \cdots + \beta_p \phi_p(z) \qquad (2)$$
In reality, most of the basis functions may not have any statistical relationship with the response y. The irrelevant components must be removed for achieving a sparse representation of the regression model.
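A minimal sketch of the basis-pool construction follows (the nonlinear templates here stack the original variables with their logarithms, which is consistent with s = 2r and the logarithmic relationship mentioned above; that template choice is an assumption of this sketch, not a quotation of the disclosure):

```python
import itertools
import numpy as np

def make_z(x):
    """Apply nonlinear templates to the hardware parameters x.
    Assumption: z stacks x with log(x), doubling the dimension (s = 2r).
    Other domain-knowledge templates could be substituted here."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, np.log(x)])

def basis_pool(z):
    """Degree-2 polynomial basis pool over z: a constant, all linear
    terms, and all products z_i * z_j with i <= j.
    Pool size: p = 1 + s + s(s+1)/2."""
    s = len(z)
    feats = [1.0]                       # constant term
    feats.extend(z)                     # s linear terms
    for i, j in itertools.combinations_with_replacement(range(s), 2):
        feats.append(z[i] * z[j])       # s(s+1)/2 quadratic terms
    return np.array(feats)

x = np.array([2.0, 4.0, 8.0])           # toy input with r = 3 variables
print(len(basis_pool(make_z(x))))       # s = 6, so p = 1 + 6 + 21 = 28
```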
The following discussion describes the construction of the statistical application performance model built in block 202.
In reality, many elements in β should be zero because many basis functions do not have any relationship with y. In order to eliminate the irrelevant components, a regularization term g(β) is applied to the coefficients in addition to minimizing the squared error for the regression equation (2):

$$\hat{\beta} = \arg\min_{\beta}\; \lVert y - \Phi\beta \rVert_2^2 + \lambda\, g(\beta) \qquad (3)$$
where λ ≥ 0 is a parameter that balances the tradeoff between the error and penalization parts of equation (3). Since the goal of regularization is to minimize the number of non-zero elements in β, a natural choice of g(β) would be the L0-norm of β, ∥β∥₀. However, since ∥β∥₀ involves a combinatorial search for the solution, which is hard to solve, g(β) is often chosen to be a relaxed form of the L0-norm. Among the many possible relaxations, the L1-norm is the most effective. It is well known that with the L1-norm constraint, g(β) = ∥β∥₁, the optimal solution β is constrained to lie on the axes of the coefficient space and thus is sparse, whereas alternatives such as the L2-norm do not have that property. Therefore, the L1-norm is used as the penalty function g(β) to enforce the sparseness of the solution β.
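The sparsity effect of the L1 penalty is easy to verify empirically. The following sketch (synthetic data; scikit-learn is assumed available and is used here only to illustrate the L1-versus-L2 contrast, not as the optimization process of the disclosure) compares lasso and ridge fits:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]    # only 5 relevant regressors
y = X @ true_beta + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.05).fit(X, y)   # L1 penalty -> sparse coefficients
ridge = Ridge(alpha=0.05).fit(X, y)   # L2 penalty -> dense coefficients

print((lasso.coef_ == 0).sum())       # most coefficients exactly zero
print((ridge.coef_ == 0).sum())       # typically none exactly zero
```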
It is not straightforward to find the optimal solution of equation (3) because ∥β∥₁ is not differentiable at βᵢ = 0, i = 1, …, p. Although prior art processes exist for solving the optimization, existing methods are either slow to converge or complicated to implement.
Therefore, a process based on the Bayesian interpretation of the optimization objective of equation (3) is used to find the solution. The probability model for equation (3) assumes that the application performance y is corrupted by Gaussian noise:

$$p(y \mid \beta, \sigma^2) = \mathcal{N}(\Phi\beta,\; \sigma^2 I) \qquad (4)$$
where σ² describes the noise level, and each coefficient βᵢ is governed by a Laplacian prior:

$$p(\beta_i \mid \gamma) = \frac{\gamma}{2} \exp(-\gamma\, \lvert \beta_i \rvert) \qquad (5)$$
where γ is a predefined constant in the prior. The optimization of (3) maximizes the posterior distribution
$$p(\beta, \sigma^2 \mid \mathcal{D}, \gamma) \propto p(y \mid \beta, \sigma^2)\, p(\beta \mid \gamma) \qquad (6)$$
Note that because the variance σ² in (4) is also unknown, it is incorporated into the optimization process.
The optimization process of the present disclosure is based on the fact that the Laplacian prior of equation (5) can be rewritten as a hierarchical decomposition of two other distributions: a zero-mean Gaussian prior p(βᵢ | τᵢ) with variance τᵢ, where τᵢ has an exponential hyperprior:

$$p(\tau_i \mid \gamma) = \frac{\gamma^2}{2} \exp\!\left(-\frac{\gamma^2 \tau_i}{2}\right) \qquad (7)$$
As a result, the distribution (6) can be rewritten as
$$p(y \mid \beta, \sigma^2)\, p(\beta \mid \gamma) = p(y \mid \beta, \sigma^2)\, p(\beta \mid \tau)\, p(\tau \mid \gamma) \qquad (8)$$
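For completeness, the decomposition underlying equation (8) is the standard Gaussian scale-mixture identity; the following derivation sketch is consistent with the Laplacian prior (5) and the exponential hyperprior (7) as reconstructed above:

$$\int_0^{\infty} \mathcal{N}(\beta_i \mid 0, \tau_i)\, \frac{\gamma^2}{2} \exp\!\left(-\frac{\gamma^2 \tau_i}{2}\right) d\tau_i = \frac{\gamma}{2} \exp\left(-\gamma\, \lvert \beta_i \rvert\right) = p(\beta_i \mid \gamma)$$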
If the values of the new parameters τ = [τ₁, τ₂, …, τ_p]ᵀ could be observed, the posterior distribution (8) would be simplified, because both p(y | β, σ²) and p(β | τ) on the right side of equation (8) are Gaussian distributions. The log-posterior can then be rewritten as:

$$\log p(\beta, \sigma^2 \mid y, \tau) \propto -\frac{n}{2} \log \sigma^2 - \frac{\lVert y - \Phi\beta \rVert^2}{2\sigma^2} - \frac{1}{2}\, \beta^T \Gamma(\tau)\, \beta \qquad (9)$$
where Γ(τ) = diag(τ₁⁻¹, …, τ_p⁻¹) is the diagonal matrix of the inverse variances of all βᵢ, and n is the number of data samples. By taking derivatives with respect to β and σ² respectively, the solution that maximizes equation (9) is obtained.
In reality, however, because the values of τ (and hence the matrix Γ(τ) in (9)) are not known, equation (9) cannot be maximized directly. Instead, the following expectation maximization (EM) process is used to find the solution. The EM process is an iterative technique, which computes the expectation of the hidden variables τ and uses that expectation as the estimate of τ to find the optimal solution. Each iteration comprises an E-step and an M-step.
The E-step computes the conditional expectation of Γ(τ) given y and the current estimates σ̂²(t) and β̂(t):

$$V(t) = E\big[\, \Gamma(\tau) \mid y, \hat{\beta}(t), \hat{\sigma}^2(t) \,\big]$$
The M-step performs the maximization of equation (9) with respect to σ² and β, except that the matrix Γ(τ) is replaced with its conditional expectation V(t). Accordingly, the following update equations are obtained:

$$\hat{\beta}(t+1) = \big( \Phi^T \Phi + \hat{\sigma}^2(t)\, V(t) \big)^{-1} \Phi^T y$$

$$\hat{\sigma}^2(t+1) = \frac{\lVert y - \Phi\hat{\beta}(t+1) \rVert^2}{n}$$
The EM process is easy to implement, and converges to the maximum of posterior probability of equation (6) quickly.
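A compact sketch of the EM iteration follows (synthetic data). The closed-form E-step, E[τᵢ⁻¹ | βᵢ] = γ/|βᵢ|, is the standard conditional expectation for this Gaussian–exponential hierarchy and is an assumption of this sketch rather than a quotation of the disclosure; the M-step implements the update equations given above:

```python
import numpy as np

def em_l1_regression(Phi, y, gamma=1.0, n_iter=50, eps=1e-10):
    """EM for the L1-penalized regression of equation (3) via the
    Gaussian/exponential hierarchy of equations (7)-(9)."""
    n, p = Phi.shape
    beta = np.linalg.lstsq(Phi, y, rcond=None)[0]     # initial estimate
    sigma2 = float(np.var(y - Phi @ beta))
    for _ in range(n_iter):
        # E-step: conditional expectation V(t) of the inverse variances,
        # E[1/tau_i | beta_i] = gamma / |beta_i| (eps avoids division by 0).
        V = np.diag(gamma / (np.abs(beta) + eps))
        # M-step: maximize (9) with Gamma(tau) replaced by V(t).
        beta = np.linalg.solve(Phi.T @ Phi + sigma2 * V, Phi.T @ y)
        sigma2 = float(np.mean((y - Phi @ beta) ** 2))
    return beta, sigma2

# Toy usage: a sparse ground truth is recovered from noisy observations.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(200, 10))
true_beta = np.array([2.0, 0, 0, -1.5, 0, 0, 0, 0.8, 0, 0])
y = Phi @ true_beta + 0.1 * rng.normal(size=200)
beta, _ = em_l1_regression(Phi, y, gamma=5.0)
print(np.round(beta, 2))   # near-zero weights on the irrelevant columns
```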
Due to the nature of software simulation, the initial data for constructing the model may contain errors. Such errors come from several aspects of the simulation process. For example, since some modules of the hardware processor are not open to the public, simulation tools can only rely on available mechanisms to approximate those components, which causes implementation imprecision in the simulation. There also exist specification inaccuracies in simulation tools, introduced to improve the efficiency of the simulation process; that is, most tools adopt certain simplifications in the simulation specification to reduce the long simulation time. Due to those errors in simulation, the application is also run on a number of real hardware platforms, and the evaluation data [x̃⁽ⁱ⁾, ỹ⁽ⁱ⁾], i = 1, …, m, is collected to enhance the quality of prediction. However, the number of real evaluations m is much smaller than the size of the simulation data. If the generalized regression were learned in the same way as in the simulation, the model could contain large variances. Instead, the knowledge learned from both the simulation and the real evaluation data is combined to improve the prediction model. The regression on the real evaluation data takes the form:
$$\tilde{y} = \theta_1 \tilde{\phi}_1 + \theta_2 \tilde{\phi}_2 + \cdots + \theta_K \tilde{\phi}_K \qquad (15)$$
Compared with equation (2), only the K basis functions whose associated coefficients β in the simulation are non-zero are included in the regression of equation (15).
Given the real evaluation data, the model parameters are estimated by maximizing the likelihood function:

$$P(\tilde{y} \mid \tilde{\Phi}, \theta, \tilde{\sigma}^2) = \mathcal{N}(\tilde{\Phi}\theta,\; \tilde{\sigma}^2 I) \qquad (16)$$
from which the following least squares solution is obtained:
$$\hat{\theta} = \big( \tilde{\Phi}^T \tilde{\Phi} \big)^{-1} \tilde{\Phi}^T \tilde{y} \qquad (17)$$
where [Φ̃, ỹ] represents the real evaluation data, and σ̃² is the measurement noise. Note that the tilde symbol "˜" is used to differentiate these variables from those in the simulation stage.
However, since only limited real evaluation data are available, the least squares solution θ̂ may not be accurate. Therefore, the knowledge learned from simulation is used to guide the estimation of the prediction model θ, thereby improving the quality of estimation. That is, the values of the prediction model θ should be close to the corresponding coefficients in β learned from simulation. Our insight here is that although the coefficients β learned from simulation are not accurate, they can still provide guidance on the possible range of values of the prediction model θ. Therefore, in block 402, a prior constraint is added on the prediction model θ, whose value follows a Gaussian distribution with mean θ̄ set to the corresponding β values learned during model construction, and covariance Σ:

$$P(\theta \mid \tilde{\sigma}^2) = \mathcal{N}(\bar{\theta},\; \tilde{\sigma}^2 \Sigma) \qquad (18)$$
Since the variance σ̃² in equations (16) and (18) is unknown, an inverse-gamma distribution is used to model P(σ̃²):

$$P(\tilde{\sigma}^2) = \frac{b^a}{\Gamma(a)} \big( \tilde{\sigma}^2 \big)^{-(a+1)} \exp\!\left( -\frac{b}{\tilde{\sigma}^2} \right) \qquad (19)$$
where a and b are two parameters that control the shape and scale of the distribution, and Γ(a) is the gamma function of a. In one exemplary embodiment, a = 1 and b = 1 can be used.
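As a quick sketch of this prior (scipy assumed available; scipy's `invgamma` takes the shape as `a` and b as `scale`):

```python
import numpy as np
from scipy.stats import invgamma

# Inverse-gamma prior on the noise variance with a = 1, b = 1,
# as in the exemplary embodiment above.
a, b = 1.0, 1.0
x = np.linspace(0.05, 5.0, 100)
pdf = invgamma.pdf(x, a, scale=b)
print(x[np.argmax(pdf)])   # mode of the prior: b / (a + 1) = 0.5
```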
With those specified priors (the prior knowledge learned from the simulation, embedded in equation (18), as well as the prior distribution of the model parameters, equation (19)), the final solution (prediction model) is obtained in block 404 by combining equations (16), (18), and (19) to express the posterior distribution of the model parameters:
$$P(\theta, \tilde{\sigma}^2 \mid \tilde{y}, \tilde{\Phi}) \propto P(\tilde{y} \mid \tilde{\Phi}, \theta, \tilde{\sigma}^2)\, P(\theta \mid \tilde{\sigma}^2)\, P(\tilde{\sigma}^2) \qquad (20)$$
By integrating out σ̃² in P(θ, σ̃² | ỹ, Φ̃), the marginal distribution of the prediction model θ is obtained as a multivariate t-distribution, whose maximum is found at:

$$\theta^{*} = \big( \tilde{\Phi}^T \tilde{\Phi} + \Sigma^{-1} \big)^{-1} \big( \tilde{\Phi}^T \tilde{\Phi}\, \hat{\theta} + \Sigma^{-1} \bar{\theta} \big) \qquad (21)$$

The final prediction model θ* is thus a weighted average of the prior prediction model θ̄ learned from simulation and the least squares solution θ̂ obtained from the real evaluation data.
The above Bayesian guided learning generates the final coefficients θ* for the performance model of equation (15), which combines the outcomes of the real evaluation and simulation processes.
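A minimal numerical sketch of this weighted combination follows (synthetic data; θ̄ stands for the coefficients carried over from the simulation stage, and Σ is a hypothetical prior covariance):

```python
import numpy as np

def bayesian_refine(Phi_real, y_real, theta_bar, Sigma):
    """Combine the least squares fit on real data (eq. (17)) with the
    simulation-derived prior mean, per the weighted-average solution (21)."""
    A = Phi_real.T @ Phi_real
    theta_ls = np.linalg.solve(A, Phi_real.T @ y_real)   # eq. (17)
    Sigma_inv = np.linalg.inv(Sigma)
    # theta* = (A + Sigma^-1)^-1 (A theta_hat + Sigma^-1 theta_bar)
    return np.linalg.solve(A + Sigma_inv, A @ theta_ls + Sigma_inv @ theta_bar)

rng = np.random.default_rng(2)
K, m = 4, 6                                   # few real measurements (m small)
theta_bar = np.array([1.0, -0.5, 0.3, 0.0])   # coefficients from simulation
Phi = rng.normal(size=(m, K))
y = Phi @ np.array([1.1, -0.4, 0.25, 0.05]) + 0.05 * rng.normal(size=m)
print(np.round(bayesian_refine(Phi, y, theta_bar, 0.1 * np.eye(K)), 3))
```

With a tight prior covariance the solution stays near θ̄; as more real measurements accumulate, the data term Φ̃ᵀΦ̃ dominates and θ* approaches the least squares fit.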
While exemplary drawings and specific embodiments of the present disclosure have been described and illustrated, it is to be understood that the scope of the invention as set forth in the claims is not to be limited to the particular embodiments discussed. Thus, the embodiments shall be regarded as illustrative rather than restrictive, and it should be understood that variations may be made in those embodiments by persons skilled in the art without departing from the scope of the invention as set forth in the claims that follow and their structural and functional equivalents.
This application claims the benefit of U.S. Provisional Application No. 61/359,426, filed Jun. 29, 2010, the entire disclosure of which is incorporated herein by reference.