The instant disclosure relates to virtualized computer systems. More specifically, the instant disclosure relates to monitoring application performance on virtualized computer systems.
On-demand computing infrastructures such as the Unisys Stealth, the Amazon EC2, and the Microsoft Azure platforms built using x86 virtualization technologies allow applications hosted on these infrastructures to acquire and release computing resources based on conditions within the hosted applications. The allocation of computing resources such as processor, memory, network input/output (I/O), and disk I/O to virtualized applications hosted on such platforms is varied in proportion to the workloads experienced by the applications. For example, certain applications may have higher workload during the day as opposed to at night. These applications may receive increased computing resources during the day and fewer at night. The workloads generally exhibit repetitive behavior, and the resource allocations to the applications change as the workload changes.
Commercial applications for monitoring application performance, such as Netuitive and AppDynamics, are available. These conventional applications incorporate statistical and machine learning algorithms for forecasting application misbehavior and for determining root causes of such misbehavior. These tools are designed for non-virtualized environments and clusters, where applications run on a set of homogeneous machines in a dedicated manner.
However, the usefulness of these conventional applications in virtualized data centers is limited due to the long latency associated with data collection. Conventional monitoring applications spend a significant amount of their time at the beginning of their lifecycle learning application behavior and patterns of resource consumption. Only after sufficient data on various metrics have been collected can these tools differentiate normal behavior from abnormal behavior and generate meaningful predictions. For example, Netuitive typically requires two weeks of data before it can forecast abnormal behavior and initiate alarm generation.
In a virtualized scenario, where applications encapsulated within respective virtual machines share a common host and all virtual machines have the capability to migrate during their lifetime onto different machines with different resources, the statistics collected from different physical machines must be re-used appropriately for conclusions to be meaningful and predictions to be accurate. For example, assume that at time t1, a virtual machine is hosted on machine ‘A’ and at time t2, the virtual machine migrates to machine ‘B’. Further, assume that machine ‘A’ and machine ‘B’ belong to two different server classes (with different hardware architectures). If the CPU utilization by an application on machine ‘A’ is 50% at a certain workload, the CPU utilization on machine ‘B’ could be 20% for the application at the same workload. In such scenarios, the existing commercial application performance management tools will fail to generate meaningful predictions. The data collected by Netuitive on machine ‘A’ is irrelevant for predicting application misbehavior on machine ‘B’. Additionally, many of the commercial tools work with only a limited set of variables and, thus, do not scale well to virtualized machines.
According to one embodiment, a method includes measuring current utilization of at least one system resource by an application. The method also includes generating a forecasted utilization for the at least one system resource by the application. The method further includes calculating an error between the current utilization and forecasted utilization. The method also includes determining when the application is misbehaving based, in part, on the error.
According to another embodiment, a computer program product includes a non-transitory computer storage medium having code to measure current utilization of at least one system resource by an application. The medium also includes code to generate a forecasted utilization for the at least one system resource by the application. The medium further includes code to calculate an error between the current utilization and forecasted utilization. The medium also includes code to determine when the application is misbehaving based, in part, on the error.
According to a further embodiment, an apparatus includes a virtualized computer system. The apparatus also includes a monitoring system. The apparatus further includes a database of historical utilization data of the virtualized computer system for at least one application. The apparatus also includes a forecasting system. The apparatus further includes a fault detection system.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims. The novel features which are believed to be characteristic of the invention, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present invention.
For a more complete understanding of the disclosed system and methods, reference is now made to the following descriptions taken in conjunction with the accompanying drawings.
Misbehaving applications may be detected and corrective action taken by monitoring system resource usage in a virtualized computing system and comparing the monitored resource utilization to forecast utilization derived from historical utilization data for an application. When the monitored resource utilization deviates from the forecast utilization, an alarm may be generated to alert a user or a fault diagnosis component to the potential fault and allow corrective procedures to be applied to the application. The corrective action may include, for example, increasing or decreasing resources of the virtualized computing system allocated to the application.
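The comparison of monitored utilization against forecast utilization can be illustrated with a minimal sketch. The function names, the use of a relative error, and the fixed tolerance of 0.25 are hypothetical choices made only for illustration; the disclosure itself describes a multivariate statistic for the actual detection decision.

```python
def relative_error(measured, forecast):
    """Return the deviation of a measured utilization as a fraction of its forecast."""
    if forecast == 0:
        raise ValueError("forecast utilization must be non-zero")
    return (measured - forecast) / forecast


def check_application(measured, forecast, tolerance=0.25):
    """Return 'alarm' when any resource deviates beyond the tolerance, else 'no alarm'.

    `tolerance` is a hypothetical fixed threshold used only in this sketch.
    """
    for resource, value in measured.items():
        if abs(relative_error(value, forecast[resource])) > tolerance:
            return "alarm"
    return "no alarm"


# Illustrative utilization fractions for one application (made-up values).
measured = {"cpu": 0.82, "memory": 0.40}
forecast = {"cpu": 0.50, "memory": 0.42}
status = check_application(measured, forecast)
print(status)
```

A per-resource threshold of this kind ignores correlations between resources, which is one reason a multivariate statistic may be preferred for the detection step.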
The errors calculated by the modules 122 are reported to a fault detection component 124, which determines if an application executing on the virtualized computing system 110 is misbehaving. When an application is misbehaving an alarm may be generated by the fault detection component 124 and transmitted to a fault diagnosis component 126. Detecting misbehavior may allow correction of a misbehaving application before performance of the virtualized computing system 110 is negatively impacted. The fault diagnosis component 126 may determine a cause of the misbehaving application and transmit one or more instructions to a policy-based management system 130 for curing the misbehaving application. When no alarm is generated by the fault detection component 124 a no alarm signal may be transmitted to the policy-based management system 130. The policy-based management system 130 is coupled to a provisioning system 132, which is coupled to the virtualized computing system 110. The provisioning system 132 may perform tasks such as allocating system resources within the virtualized computing system 110 according to policy decisions received from the policy-based management system 130. For example, when the virtualized computing system 110 includes multiple computing systems each with multiple processors, the provisioning system 132 may allocate individual processors or individual computing systems to applications executing on the virtualized computing system 110. The policy-based management system 130 may provide instructions to allocate additional or fewer system resources to a misbehaving application in accordance with instructions received from the fault diagnosis component 126. According to one embodiment, when no applications are misbehaving the provisioning system 132 receives instructions from timer-based policies in the policy-based management system 130.
Referring back to
According to one embodiment, the forecasting component 118 may decompose historical data in the database 114 for at least one computing resource such as memory, processor, network I/O, and disk I/O into individual components. The individual components may include trend (Tt), seasonal (St), cyclical (Ct) and error components (Et). A multiplicative model may be formed for the error to decompose the data as:
$$X_t = (T_t \times S_t \times C_t) \times E_t,$$
where Xt is a data-point at period t, Tt is the trend component at period t, St is the seasonal component at period t, Ct is the cyclical component at period t, and Et is the error component at period t. For the historical data in the database 114 regarding each of the computing resources in the virtualized computing system 110 the following steps may be performed with L as the length of the seasonality. First, calculate the L-period total, L-period moving average, and the L-period centered moving average (CMA). Second, separate the L-period CMA computed in the first step from the original data to isolate the trend and the cyclical components. Third, determine seasonal factors by averaging them for each of the slots that make up the length of the seasonality. Seasonal indexes may be calculated as the average of the CMA percentage of the actual values observed in that slot. Fourth, the seasonal pattern may be removed by multiplicative seasonal adjustment, which is computed by dividing each value of the time series by the seasonal index calculated in the third step. Fifth, the de-seasonalized data of the fourth step may then be analyzed for the trend (represented as $\hat{X}_t$). Sixth, determine the cyclical component by separating the difference of actual and the trend as a fraction of the trend from the results of the fifth step. Seventh, calculate the random error component after separating the trend, cyclical, and seasonal components from the actual data.
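The decomposition steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the trend is assumed to be a least-squares straight line, the cyclical factor is taken as the ratio of the centered moving average to the fitted trend (one common textbook choice), and all function names are hypothetical.

```python
def centered_moving_average(x, L):
    """L-period centered moving average; None where the window does not fit."""
    n, half = len(x), L // 2
    cma = [None] * n
    for t in range(half, n - half):
        if L % 2:  # odd length: a single centered window
            cma[t] = sum(x[t - half:t + half + 1]) / L
        else:      # even length: average two adjacent L-period means
            first = sum(x[t - half:t + half]) / L
            second = sum(x[t - half + 1:t + half + 1]) / L
            cma[t] = (first + second) / 2
    return cma


def seasonal_indexes(x, cma, L):
    """Average the actual/CMA ratios observed in each seasonal slot."""
    sums, counts = [0.0] * L, [0] * L
    for t, (value, avg) in enumerate(zip(x, cma)):
        if avg is not None:
            sums[t % L] += value / avg
            counts[t % L] += 1
    return [s / c for s, c in zip(sums, counts)]


def linear_trend(y):
    """Fitted values of a least-squares line through (t, y[t]) (assumed trend form)."""
    n = len(y)
    t_mean, y_mean = (n - 1) / 2, sum(y) / n
    slope = (sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return [y_mean + slope * (t - t_mean) for t in range(n)]


def decompose(x, L):
    """Return (trend, seasonal, cyclical, error) for the multiplicative model."""
    cma = centered_moving_average(x, L)
    seasonal = seasonal_indexes(x, cma, L)
    deseasonalized = [v / seasonal[t % L] for t, v in enumerate(x)]
    trend = linear_trend(deseasonalized)
    # Assumption: cyclical factor = CMA / trend, defined only where the CMA exists.
    cyclical = [None if c is None else c / tr for c, tr in zip(cma, trend)]
    # Whatever remains after removing trend, seasonal, and cyclical is the error.
    error = [None if c is None else v / (tr * seasonal[t % L] * c)
             for t, (v, tr, c) in enumerate(zip(x, trend, cyclical))]
    return trend, seasonal, cyclical, error


# Synthetic utilization series with a linear trend and a 4-period season.
true_season = [0.8, 1.0, 1.2, 1.0]
series = [(100 + t) * true_season[t % 4] for t in range(24)]
trend, seasonal, cyclical, error = decompose(series, 4)
print([round(s, 3) for s in seasonal])
```

The series must span at least two full seasons so that every slot receives at least one ratio; the first and last L/2 periods have no centered moving average and therefore no cyclical or error value in this sketch.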
To forecast resource utilization for future time periods, a series of computations may be performed opposite to the decomposition approach described above. First, the cyclical component may be forecasted. Then, the trend component may be forecasted. Finally, the seasonal component may be forecasted. Forecasts of the individual components may be aggregated using the multiplicative model to compute the final forecast.
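The recombination of component forecasts can be sketched as follows. The linear trend parameters, the rule of carrying forward the mean of recent cyclical factors, and the function name `forecast_utilization` are assumptions made for illustration only; the text does not specify how each component is forecasted.

```python
def forecast_utilization(intercept, slope, seasonal, recent_cyclical,
                         n_observed, horizon):
    """Forecast `horizon` future periods as trend * seasonal * cyclical."""
    L = len(seasonal)
    # Cyclical forecast (assumption): carry forward the mean of recent factors.
    cyclical = sum(recent_cyclical) / len(recent_cyclical)
    forecasts = []
    for t in range(n_observed, n_observed + horizon):
        trend = intercept + slope * t      # extrapolated linear trend (assumption)
        season = seasonal[t % L]           # seasonal index of the matching slot
        forecasts.append(trend * season * cyclical)
    return forecasts


# Illustrative values: 24 observed periods, 4-period season, forecast 4 ahead.
forecasts = forecast_utilization(
    intercept=100.0, slope=1.0,
    seasonal=[0.8, 1.0, 1.2, 1.0],
    recent_cyclical=[1.0, 1.0],
    n_observed=24, horizon=4,
)
print(forecasts)
```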
The forecasted values generated by the forecasting component 118 may be compared against the measured values by the monitoring system 112 and a difference between the two values calculated as an error by the fault detection system 120. According to one embodiment, the fault detection component 124 embodies a fault detection method based on the Hotelling's multi-variate T2 statistic. The fault detection component 124 may monitor the error component for forecasting abnormal application behavior. Hotelling's multi-variate T2 statistic has been successfully applied in the past to various chemical process industries and manufacturing operations to detect and diagnose faults. T2 may be calculated as:
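$$T^2 = (X - \bar{X})' S^{-1} (X - \bar{X}),$$
where $X = (x_1, x_2, \ldots, x_p)$ denotes the vector of variates (e.g., computational resources), $\bar{X}$ denotes the mean vector, and $S$ denotes the covariance matrix.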
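A minimal sketch of Hotelling's T2 statistic in plain Python follows, using the standard definition: the quadratic form of the deviation from the mean vector with the inverse sample covariance matrix. The helper names and the sample data are hypothetical, and the linear system is solved by Gaussian elimination rather than by forming the inverse explicitly.

```python
def mean_vector(samples):
    """Column-wise mean of a list of equal-length observation vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def covariance_matrix(samples):
    """Sample covariance matrix (divisor n - 1)."""
    n, mean = len(samples), mean_vector(samples)
    p = len(mean)
    return [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in samples) / (n - 1)
             for j in range(p)] for i in range(p)]


def solve(a, b):
    """Solve a @ y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = (m[i][n] - sum(m[i][j] * y[j] for j in range(i + 1, n))) / m[i][i]
    return y


def hotelling_t2(x, mean, cov):
    """T2 = (x - mean)' cov^{-1} (x - mean)."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    return sum(di * yi for di, yi in zip(d, solve(cov, d)))


# Made-up forecast errors for (processor, memory) over six periods.
history = [[0.02, -0.01], [-0.03, 0.02], [0.01, 0.00],
           [0.00, -0.02], [-0.01, 0.01], [0.02, 0.01]]
observation = [0.30, 0.25]
t2 = hotelling_t2(observation, mean_vector(history), covariance_matrix(history))
print(t2)
```

The sketch assumes a non-singular covariance matrix; perfectly collinear variables would need to be removed before the statistic can be computed.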
According to one embodiment, the fault diagnosis component 126 may employ an MYT decomposition method to interpret the signals associated with the T2 value. A vector (X−
(X=
where X(p−1)′=(x1, x2, . . . , xp−1) represents the (p−1) dimensional variable vector, and
where, SX
$$T^2 = T^2_{p-1} + T_{p \cdot 1, 2, \ldots, p-1},$$
where
T
p.1, 2, . . . , p−1=(X(p−1)−
T2≡T(x
T
(x
, x
, . . . , x
)=(X(j)−
The terms of the MYT decomposition may be calculated as:
In the above calculations, p! partitions of the T2 statistic are possible. According to one embodiment, the calculations may be parallelized to operate on a cluster or grid infrastructure or on specialized hardware such as a General Purpose Computation on Graphics Processing Units (GPGPU) machine.
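The growth in the number of partitions can be illustrated by enumerating every ordering of the variables and computing the terms of each partition. The sketch below relies on the standard identity that each conditional term equals the T2 value of the first k variables in the ordering minus the T2 value of the first k-1 variables. The function names, deviation vector, and covariance matrix are hypothetical illustration values.

```python
from itertools import permutations
from math import factorial


def solve(a, b):
    """Solve a @ y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = (m[i][n] - sum(m[i][j] * y[j] for j in range(i + 1, n))) / m[i][i]
    return y


def t2_subset(d, cov, idx):
    """T2 statistic restricted to the variables listed in idx."""
    if not idx:
        return 0.0
    sub_d = [d[i] for i in idx]
    sub_cov = [[cov[i][j] for j in idx] for i in idx]
    return sum(di * yi for di, yi in zip(sub_d, solve(sub_cov, sub_d)))


def myt_terms(d, cov, order):
    """Terms of one partition: successive differences of subset T2 values."""
    return [t2_subset(d, cov, order[:k]) - t2_subset(d, cov, order[:k - 1])
            for k in range(1, len(order) + 1)]


# Deviation from the mean (X minus mean vector) and covariance for p = 3 variables.
deviation = [1.0, -0.5, 2.0]
cov = [[1.0, 0.3, 0.2],
       [0.3, 1.0, 0.1],
       [0.2, 0.1, 1.0]]
partitions = {order: myt_terms(deviation, cov, order)
              for order in permutations(range(len(deviation)))}
total = t2_subset(deviation, cov, [0, 1, 2])
print(len(partitions), factorial(len(deviation)))
```

Because each ordering is evaluated independently of the others, the loop over `permutations` is the natural unit of work to distribute across cluster nodes or GPU threads.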
According to another embodiment, the computational overhead may be reduced through the following iterative process. First, from the correlation matrix of all the variables, all variables with weak correlation may be deleted. Second, for the remaining variables compute Tx
To locate the variables that are responsible for the signal, the individual terms of the MYT decomposition may be examined by comparing each individual term to a threshold value that depends on the term under consideration such as for example in:
Tx
T(x
According to one embodiment, all xj having Tx
UCL(x
where α is the threshold percentile and n is the number of observations in the sample. Similarly, UCLx
In general UCLx
Operation of systems and methods described above with respect to
The corresponding T2 calculations are shown in table-2. UCL values for T12, T22, T1.22 and T2.12 are calculated for α, the threshold percentile value of 0.01. UCL value of T12 is calculated as 7.48 for a sample size of 41 and F value of 7.31, and UCL value of T22 is calculated as 9.45 for a sample size of 15 and F value of 8.86. Similarly, UCL value of T1.22 is calculated as 21.40 for a sample size of 10 and F value of 8.65, and UCL value of T2.12 is calculated as 12.96 for a sample size of 20 and F value of 5.85. In the table 800 of
In one embodiment, the user interface device 910 is referred to broadly and is intended to encompass a suitable processor-based device such as a desktop computer, a laptop computer, a personal digital assistant (PDA) or tablet computer, a smartphone, or another mobile communication device or organizer device having access to the network 908. In a further embodiment, the user interface device 910 may access the Internet or other wide area or local area network to access a web application or web service hosted by the server 902 and provide a user interface for enabling a user to enter or receive information.
The network 908 may facilitate communications of data between the server 902 and the user interface device 910. The network 908 may include any type of communications network including, but not limited to, a direct PC-to-PC connection, a local area network (LAN), a wide area network (WAN), a modem-to-modem connection, the Internet, a combination of the above, or any other communications network now known or later developed within the networking arts which permits two or more computers to communicate, one with another.
In one embodiment, the user interface device 910 accesses the server 902 through an intermediate server (not shown). For example, in a cloud application the user interface device 910 may access an application server. The application server fulfills requests from the user interface device 910 by accessing a database management system (DBMS). In this embodiment, the user interface device 910 may be a computer executing a Java application making requests to a JBOSS server executing on a Linux server, which fulfills the requests by accessing a relational database management system (RDBMS) on a mainframe server.
In one embodiment, the server 902 is configured to store time-stamped system resource utilization information from a monitoring system 112 of
In one embodiment, the server 902 may submit a query to select data from the storage devices 1004, 1006. The server 902 may store consolidated data sets in a consolidated data storage device 1010. In such an embodiment, the server 902 may refer back to the consolidated data storage device 1010 to obtain a set of records. Alternatively, the server 902 may query each of the data storage devices 1004, 1006, and 1008 independently or in a distributed query to obtain the set of data elements. In another alternative embodiment, multiple databases may be stored on a single consolidated data storage device 1010.
In various embodiments, the server 902 may communicate with the data storage devices 1004, 1006, and 1008 over the data-bus 1002. The data-bus 1002 may comprise a SAN, a LAN, or the like. The communication infrastructure may include Ethernet, Fibre-Channel Arbitrated Loop (FC-AL), Fibre-Channel over Ethernet (FCoE), Small Computer System Interface (SCSI), Internet Small Computer System Interface (iSCSI), Serial Advanced Technology Attachment (SATA), Advanced Technology Attachment (ATA), Cloud Attached Storage, and/or other similar data communication schemes associated with data storage and communication. For example, the server 902 may communicate indirectly with the data storage devices 1004, 1006, 1008, and 1010 through a storage server or the storage controller 904.
The server 902 may include modules for interfacing with the data storage devices 1004, 1006, 1008, and 1010, interfacing with a network 908, interfacing with a user through the user interface device 910, and the like. In a further embodiment, the server 902 may host an engine, application plug-in, or application programming interface (API).
The computer system 1100 also may include random access memory (RAM) 1108, which may be SRAM, DRAM, SDRAM, or the like. The computer system 1100 may utilize RAM 1108 to store the various data structures used by a software application such as databases, tables, and/or records. The computer system 1100 may also include read only memory (ROM) 1106 which may be PROM, EPROM, EEPROM, optical storage, or the like. The ROM may store configuration information for booting the computer system 1100. The RAM 1108 and the ROM 1106 hold user and system data.
The computer system 1100 may also include an input/output (I/O) adapter 1110, a communications adapter 1114, a user interface adapter 1116, and a display adapter 1122. The I/O adapter 1110 and/or the user interface adapter 1116 may, in certain embodiments, enable a user to interact with the computer system 1100. In a further embodiment, the display adapter 1122 may display a graphical user interface associated with a software or web-based application.
The I/O adapter 1110 may connect one or more storage devices 1112, such as one or more of a hard drive, a compact disk (CD) drive, a floppy disk drive, and a tape drive, to the computer system 1100. The communications adapter 1114 may be adapted to couple the computer system 1100 to a network, which may be one or more of a LAN, WAN, and/or the Internet. The communications adapter 1114 may be adapted to couple the computer system 1100 to a storage device 1112. The user interface adapter 1116 couples user input devices, such as a keyboard 1120 and a pointing device 1118, to the computer system 1100. The display adapter 1122 may be driven by the CPU 1102 to control the display on the display device 1124.
The applications of the present disclosure are not limited to the architecture of computer system 1100. Rather, the computer system 1100 is provided as an example of one type of computing device that may be adapted to perform the functions of a server 902 and/or the user interface device 910. For example, any suitable processor-based device may be utilized including, without limitation, personal data assistants (PDAs), tablet computers, smartphones, computer game consoles, and multi-processor servers. Moreover, the systems and methods of the present disclosure may be implemented on application specific integrated circuits (ASIC), very large scale integrated (VLSI) circuits, or other circuitry. In fact, persons of ordinary skill in the art may utilize any number of suitable structures capable of executing logical operations according to the described embodiments. A virtualized computing system, such as that illustrated in
Although the present disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the present invention, disclosure, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present disclosure. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.
This application claims priority to U.S. Provisional Application No. 61/476,348 filed on Apr. 18, 2011, to Venkat et al., entitled “Detecting and Diagnosing Application Misbehaviors in ‘On-Demand’ Virtual Computing Infrastructures.”
Number | Date | Country
---|---|---
61476348 | Apr 2011 | US