Cloud computing can be implemented by a data center to stand up public and private clouds. Cloud computing offers self-service, scalability, and elasticity, along with additional advantages of control and customization that were not traditionally possible. Cloud service providers extend service level agreements (SLAs) that define guaranteed levels of application performance. For example, SLAs may specify performance metrics defining response times or computations per time frame. Application performance is then monitored to ensure SLA compliance.
Introduction:
Modern applications include multiple components that operate together to achieve a desired result. In one example, an application may include an application server and a database server. One or more instances of each component can execute in any number of virtual machines. When executed, each component consumes physical resources such as CPU, memory, networking, and storage. Because multiple virtual machines can share access to the same physical resources, proper resource allocation is often needed to ensure desired application performance.
Cloud service providers extend service level agreements (SLAs) that define guaranteed levels of application performance. SLAs may specify performance metrics defining response times or computations per time frame. Manual monitoring can prove difficult and in many cases inefficient or simply ineffective. While a performance metric such as an average response time can be visualized and a breach of a corresponding SLA identified, it can be difficult to quickly determine the bottleneck that is causing the undesired performance. Bottlenecks often occur when a physical resource allocated to a virtual machine is being consumed by an application component at a higher than expected level. It can be difficult if not impossible to manually identify the application component and corresponding physical resource causing a bottleneck, especially as the number of virtual machines increases.
Various embodiments described below have been developed to automatically allocate physical resources to virtual machines executing application components. In one example, performance data and consumption data are acquired from agents executing in the virtual machines. The performance data is indicative of a performance metric over time for the application. The consumption data is indicative of physical resource consumption levels over time by each application component or virtual machine. The performance data is analyzed to identify a performance event. A performance event occurs when a value for a performance metric associated with the application crosses an associated threshold value. For example, where the performance metric corresponds to an application response time, the threshold value may correspond to a particular average response time dictated or determined by an SLA. In one example, crossing a threshold value indicates that an SLA has been or is likely to be breached and that an application component may need to be allocated additional physical resources. In another example, crossing a threshold value indicates that performance levels are well within SLA requirements and that a physical resource is being underutilized and may be allocated away from an application component.
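The threshold-crossing test described above can be sketched as follows. This is a minimal illustration only, assuming the performance metric is an average response time in milliseconds sampled at discrete times; the names SlaThresholds, breach_ms, underuse_ms, and detect_performance_event are hypothetical and do not appear in the embodiments.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SlaThresholds:
    # Hypothetical parameter names; values would be derived from an SLA.
    breach_ms: float    # crossing above suggests an actual or likely breach
    underuse_ms: float  # crossing below suggests underutilization


def detect_performance_event(
    samples: List[Tuple[float, float]],  # (timestamp, average response time)
    thresholds: SlaThresholds,
) -> Optional[Tuple[float, str]]:
    """Return (timestamp, kind) for the first threshold crossing, else None."""
    # Compare each sample with its predecessor so that only a crossing,
    # not merely a value beyond the threshold, counts as an event.
    for (_, prev), (t, cur) in zip(samples, samples[1:]):
        if prev <= thresholds.breach_ms < cur:
            return t, "breach"
        if prev >= thresholds.underuse_ms > cur:
            return t, "underutilization"
    return None
```

For instance, with an upper threshold of 200 ms, samples of 120, 150, and 210 ms would yield a "breach" event at the time of the third sample.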
Upon detecting that a performance metric has crossed a threshold, the consumption data is analyzed to examine the consumption levels of physical resources utilized by the application components. Where the consumption level of one of the physical resources (but not another) deviates from a historical trend at a time generally coinciding with the performance event, it can be presumed that the given application component consuming that physical resource caused the performance event. An instruction is communicated that, when executed, will cause a change in an allocation level of that corresponding physical resource. The instruction, for example, may be communicated to and executed by a cloud controller responsible for managing the virtual machines executing the various application components. Where the performance event indicates an actual or likely SLA breach, the change in resource allocation may be an increase intended to cause the performance metric value to cross back over the threshold. Where the performance event is indicative of an underutilization, the change in allocation may be a decreased allocation allowing the physical resource to be reallocated elsewhere.
In this fashion, physical resources can be automatically allocated and reallocated to help ensure both SLA compliance and efficient resource consumption.
The following description is broken into sections. The first, labeled “Setting,” describes an environment in which various embodiments may be implemented. The second section, labeled “Components,” describes examples of various physical and logical components for implementing various embodiments. The third section, labeled “Operation,” describes steps taken to implement various embodiments.
Setting:
A cloud controller (not shown) is responsible for provisioning physical resources 14 to the various components of an application. In doing so, the controller utilizes physical resources 14 to instantiate virtual machines for executing the application components. The virtual machines share physical resources such as CPU, memory, networking, and storage provided by physical resources 14 with a specified portion of each resource allocated to each virtual machine. Together, two or more virtual machines may be referred to herein as a virtual environment.
Client devices 16 represent generally any computing devices capable of utilizing applications provided within cloud environment 12. Resource allocation system 18, described in detail below, represents a system configured to automatically manage the allocation of resources being consumed by the components of an application executing in cloud environment 12. In general, resource allocation system 18 is configured to respond to a predetermined performance event by identifying a physical resource, consumed by an application component, whose consumption level has spiked or otherwise changed at a time generally corresponding with the performance event. System 18 then communicates an instruction that, when executed by a cloud controller, causes a change in allocation of that resource according to the nature of the performance event. For example, where the performance event is an actual or likely breach of an SLA, the change may be an increased allocation of the resource to its corresponding application component.
Components:
Resource allocation system 18 is shown to be in communication with data repository 30, cloud controller 28, and cloud environment 12. Data repository 30 represents generally any physical memory accessible to system 18 and configured to store performance data and consumption data. While shown as being distinct from cloud environment 12, resource allocation system 18 may be part of cloud environment 12 and implemented by one or more application components 21 executing in one or more virtual machines 20.
Resource allocation system 18 is shown to include data engine 32, analysis engine 34, and resource engine 36. Data engine 32 is configured to maintain performance data and resource consumption data. The performance data is indicative of a performance metric trend for an application. The application includes a plurality of application components 21 executing in one or more virtual machines 20. The consumption data is indicative of consumption level trends for each of a plurality of physical resources 14 being consumed by the plurality of application components. In the example of
Agents 26 may continuously or periodically report performance and consumption measurements, and data engine 32 may collect that information in one or more tables or other data structures within data repository 30. Data engine 32 may also maintain parameters associated with a service level agreement (SLA) for an application. The parameters may specify one or more thresholds corresponding to performance metrics such as transaction performance (response times) or transaction volume. For example, one threshold may specify an average response time that, if exceeded, indicates that the SLA is being breached or is in danger of being breached. Another threshold may specify an average response time that, if not exceeded, indicates that physical resources 14 currently allocated to a given component 21 of the application can be reallocated and used more efficiently to support another application component 21.
Analysis engine 34 is configured to analyze the performance data to determine if a performance metric value for an application has crossed an associated threshold value. Such a crossing may be referred to as a performance event. In response to a positive determination, analysis engine 34 is responsible for analyzing the consumption data to identify a consumption level, of one of the plurality of physical resources being consumed by a given component of the application, that has deviated from a historical trend for that resource. Analysis engine 34 may only consider a deviation that generally coincided in time with the given performance event. In other words, analysis engine 34 may only look for deviations that share a predetermined time frame or window with the performance event and can be presumed to be a cause of the performance event. A historical trend can thus be determined at least in part by maximum and minimum consumption levels occurring during a period before the corresponding performance event.
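One way the deviation analysis just described might be sketched is shown below. This is an illustrative assumption rather than the implementation of analysis engine 34: the historical trend is reduced to the minimum and maximum levels seen during a period before the event, and a resource is flagged when a sample inside the window around the event falls outside that range. The function name, the Series alias, and the window and history parameters are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (timestamp, consumption level) samples.
Series = List[Tuple[float, float]]


def deviating_resources(
    consumption: Dict[str, Series],  # resource name -> samples
    event_time: float,               # time of the performance event
    window: float,                   # half-width of the coincidence window
    history: float,                  # length of the baseline period
) -> List[str]:
    """Return resources whose level left its historical min/max range
    within the time window around the performance event."""
    flagged = []
    start = event_time - window
    for name, series in consumption.items():
        # Baseline: samples from the period just before the window opens.
        baseline = [v for t, v in series if start - history <= t < start]
        # Recent: samples sharing the window with the performance event.
        recent = [v for t, v in series if start <= t <= event_time + window]
        if not baseline or not recent:
            continue  # not enough data to judge this resource
        low, high = min(baseline), max(baseline)
        if any(v < low or v > high for v in recent):
            flagged.append(name)
    return flagged
```

Under these assumptions, a CPU level that jumps from a 40 to 45 range up to 90 near the event would be flagged, while a memory level that stays within its earlier 60 to 62 range would not.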
Resource engine 36 is configured to communicate an instruction that, when executed by cloud controller 28, will cause a change in an allocation level of the physical resource identified by analysis engine 34. The instruction may be in a markup language format such as XML (eXtensible Markup Language). In performing its function, resource engine 36 may examine the current consumption level of the identified physical resource and its recent consumption trend to optimize the change. The optimization may result in an increase or a decrease depending on the situation and can affect fewer than all of the physical resources allocated to the application components. Such is true when analysis of the consumption data reveals that a consumption level of another of the plurality of physical resources being consumed by a component of the application has not deviated from a historical trend for that resource. Thus the instruction, when executed, only affects the allocation level of the resource identified in the analysis of the consumption data.
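As a rough illustration of an instruction in XML form, the sketch below builds a small document naming a single resource and the requested change. The element and attribute names (allocationInstruction, virtualMachine, resource, change, direction, unit) are invented for this example; the description does not define a schema, and a real cloud controller would dictate its own format.

```python
import xml.etree.ElementTree as ET


def build_allocation_instruction(
    vm_id: str, resource: str, direction: str, amount: int, unit: str
) -> str:
    """Serialize a hypothetical allocation-change instruction as XML."""
    if direction not in ("increase", "decrease"):
        raise ValueError("direction must be 'increase' or 'decrease'")
    root = ET.Element("allocationInstruction")
    ET.SubElement(root, "virtualMachine").text = vm_id
    # Only the resource identified by the analysis is named, so other
    # resources allocated to the component are left untouched.
    res = ET.SubElement(root, "resource", {"name": resource})
    change = ET.SubElement(res, "change", {"direction": direction, "unit": unit})
    change.text = str(amount)
    return ET.tostring(root, encoding="unicode")
```

A call such as build_allocation_instruction("vm-1", "cpu", "increase", 2, "vcpu") would produce a document that could be parsed back with ET.fromstring to recover the same fields.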
In an example, the performance event corresponds to an actual or likely breach of an SLA. Here, optimization results in an instruction that, when executed by cloud controller 28, increases the current allocation level of the resource by an amount expected to bring the performance metric value back in line with the SLA so that it is not being breached or on a path to be breached. Execution of that instruction is also expected not to over-allocate and leave the physical resource underutilized. In another example, the performance event is indicative of resource underutilization. Here, optimization results in an instruction that, when executed by cloud controller 28, decreases the current allocation level of the resource by an amount that allows the physical resource to be more efficiently used elsewhere without breaching the SLA.
To summarize, resource allocation system 18, with the aid of agents 26, monitors the performance of an application implemented by one or more virtual machines 20 within cloud environment 12. Upon detecting a performance event, system 18 automatically identifies a change in consumption level of a physical resource supporting the application where that change coincided in time with the performance event. System 18 then automatically communicates an instruction that when executed by cloud controller 28 causes a change in allocation level of the identified physical resource. Depending on the nature of the performance event, the change may be an increase or a decrease.
Resource allocation system 18 may also be configured to predict future performance events and take action in an attempt to prevent them from occurring. Over time, data engine 32 may maintain details concerning performance events and consumption data corresponding in time to those events. These details may be referred to as past performance and consumption data. Analysis engine 34 can then process the past performance data to predict an occurrence of a future performance event. Past performance data may reveal repeated periods such as time of day or a day of the week or a month that a performance event is likely to occur absent a change in a resource allocation level. Thus, a future performance event may be predicted to occur during that same time the following day, week, or month as the case may be.
Analysis engine 34 can then analyze the past consumption data to identify a predicted future variance in a consumption level of a given physical resource predicted to correspond in time with the future performance event. The past consumption data may reveal that the consumption level for a given physical resource deviates from a historical trend at a time corresponding to a past performance event. Resource engine 36 can then communicate an instruction that, when executed, will cause a change in an allocation level of the resource whose consumption level is predicted to deviate. The instruction will be communicated such that it can be executed during or before the predicted future performance event.
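The prediction of a recurring performance event could be approximated as in the sketch below. It assumes, purely for illustration, that recurrence is judged by hour of day and that an hour must appear in at least min_occurrences past events before a prediction is made; the function name and that parameter are hypothetical, and weekly or monthly periods would need a similar but separate treatment.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import List, Optional


def predict_next_event(
    past_events: List[datetime], min_occurrences: int = 3
) -> Optional[datetime]:
    """Predict the next event time from the most common hour of day
    among past performance events, or return None if no hour recurs."""
    if not past_events:
        return None
    hour, count = Counter(e.hour for e in past_events).most_common(1)[0]
    if count < min_occurrences:
        return None  # no sufficiently repeated period in the past data
    last = max(past_events)
    # Next occurrence of that hour strictly after the latest past event.
    candidate = last.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= last:
        candidate += timedelta(days=1)
    return candidate
```

An instruction changing the allocation level could then be scheduled for execution at or before the returned time.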
In the foregoing discussion, engines 32-36 were described as combinations of hardware and programming. Engines 32-36 may be implemented in a number of fashions. Looking at
Memory resource 38 represents generally any number of memory components capable of storing instructions that can be executed by processing resource 40. Memory resource 38 is non-transitory in the sense that it does not encompass a transitory signal but instead is made up of one or more memory components configured to store the relevant instructions. Memory resource 38 may be implemented in a single device or distributed across devices. Likewise, processing resource 40 represents any number of processors capable of executing instructions stored by memory resource 38. Processing resource 40 may be integrated in a single device or distributed across devices. Further, memory resource 38 may be fully or partially integrated in the same device as processing resource 40, or it may be separate but accessible to that device and processing resource 40.
In one example, the program instructions can be part of an installation package that when installed can be executed by processing resource 40 to implement system 18. In this case, memory resource 38 may be a portable medium such as a CD, DVD, or flash drive or a memory maintained by a server from which the installation package can be downloaded and installed. In another example, the program instructions may be part of an application or applications already installed. Here, memory resource 38 can include integrated memory such as a hard drive, solid state drive, or the like.
In
Operation:
Referring to
A determination is made as to whether a performance event has occurred (step 50). A performance event occurs when a value of a performance metric associated with the application crosses an associated threshold value. Analysis engine 34 of
Looking ahead to
Moving back to
Looking ahead, graph 66 of
Referring back to
The method of
Step 52 is then modified such that the past consumption data is analyzed to identify a predicted future variance in a consumption level of a given physical resource predicted to correspond in time with the future performance event. The past consumption data may reveal that the consumption level for a given physical resource deviates from a historical trend at a time corresponding to a past performance event. Finally, step 54 can be modified to communicate an instruction that, when executed, will cause a change in an allocation level of the resource whose consumption level is predicted to deviate. The instruction will be communicated such that it can be executed during or before the predicted future performance event.
Conclusion:
Embodiments can be realized in any memory resource for use by or in connection with a processing resource. A “processing resource” is an instruction execution system such as a computer/processor based system or an ASIC (Application Specific Integrated Circuit) or other system that can fetch or obtain instructions and data from computer-readable media and execute the instructions contained therein. A “memory resource” is any non-transitory storage media that can contain, store, or maintain programs and data for use by or in connection with the instruction execution system. The term “non-transitory” is used only to clarify that the term media, as used herein, does not encompass a signal. Thus, the memory resource can comprise any one of many physical media such as, for example, electronic, magnetic, optical, electromagnetic, or semiconductor media. More specific examples of suitable computer-readable media include, but are not limited to, hard drives, solid state drives, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory, flash drives, and portable compact discs.
Although the flow diagram of
The present invention has been shown and described with reference to the foregoing exemplary embodiments. It is to be understood, however, that other forms, details and embodiments may be made without departing from the spirit and scope of the invention that is defined in the following claims.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IN2013/000065 | 1/31/2013 | WO | 00 |