The present invention relates generally to resource management toward service level attainment and more specifically to application resource management using a three-dimensional surface to indicate the probability of breach of a service level.
Managing the allocation of resources within a computer data centre may be a challenge due to the complexity of components and the variable nature of demand for the scarce resources comprising the data centre. In many cases the resource required most often is the resource that is the least available. In other cases it is not readily apparent which resource should be changed to alleviate a current undesirable situation. In some other cases the addition or removal of a resource may in fact add to the problem being addressed. In most cases decisions to take specific action would be enhanced by having received notification of an impending problem.
Making automated decisions for provisioning resources between multiple applications in operation within a data centre can be especially difficult. The difficulty arises when differing disciplines, such as performance, availability and fault management, must also be considered concurrently with a variety of monitoring systems associated with components of the data centre.
Typically decision making or decision assist schemes are bound to a specific metric, such as server utilization or response time and to a specific discipline such as performance. This narrow focus limits the capabilities of such schemes and their applicability in a large diverse data centre.
It would therefore be highly desirable to have a means for allowing detailed information of resources used by applications to be more effectively used to better manage the resources within a diverse data centre.
Conveniently, software exemplary of an embodiment of the present invention uses the probability of a breach of a service level (SLA) to provide a comparison between a need for resources being used among applications and service level objectives in a data centre.
A three dimensional surface representative of relationships between metrics is used to describe the variance in the probability of breaching a service level when compared to the number of resources allocated to the application and time. Using the described surface allows decision making logic to evaluate trade-offs when determining resource allocations. Discipline specific modules are used to translate collected metrics for the respective disciplines into a probability of breach of a service level surface which is then presented to decision making logic to determine a course of action.
In one embodiment of the present invention there is provided a data processing method for service level management using probability of breach of service level for an application in a computer data centre, the method comprising: obtaining one or more metrics each associated with a respective resource associated with a data centre, one of the metrics being probability of breach of service level; generating an n-dimensional representation of a relationship of the metrics; responsive to the n-dimensional representation determining a best fit solution for configuring the computer data centre using a probability of breach of service level; and communicating the best fit solution to one or more components of the data centre to reconfigure the respective resources toward attaining the service level.
In another embodiment of the present invention there is provided a data processing system for service level management using probability of breach of service level for an application in a computer data centre, the data processing system comprising: a means for obtaining one or more metrics each associated with a respective resource associated with a data centre, one of the metrics being probability of breach of service level; a means for generating an n-dimensional representation of a relationship of the metrics; responsive to the n-dimensional representation a means for determining a best fit solution for configuring the computer data centre using a probability of breach of service level; and a means for communicating the best fit solution to one or more components of the data centre to reconfigure the respective resources toward attaining the service level.
In another embodiment of the present invention there is provided an article of manufacture for directing a data processing system for service level management using probability of breach of service level for an application in a computer data centre, the article of manufacture comprising: a data processing system usable medium embodying one or more instructions executable by the data processing system, the one or more instructions comprising: data processing system executable instructions for obtaining one or more metrics each associated with a respective resource associated with a data centre, one of the metrics being probability of breach of service level; data processing system executable instructions for generating an n-dimensional representation of a relationship of the metrics; responsive to the n-dimensional representation data processing system executable instructions for determining a best fit solution for configuring the computer data centre using a probability of breach of service level; and data processing system executable instructions for communicating the best fit solution to one or more components of the data centre to reconfigure the respective resources toward attaining the service level.
Other aspects and features of the present invention will become apparent to those of ordinary skill in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.
In the figures, which illustrate embodiments of the present invention by example only,
Like reference numerals refer to corresponding components and steps throughout the drawings.
CPU 110 is connected to memory 108 through a dedicated system bus 105, a general system bus 106, or both. Memory 108 can be a random access semiconductor memory for storing components of an embodiment of the present invention. Memory 108 is depicted conceptually as a single monolithic entity, but it is well known that memory 108 can be arranged in a hierarchy of caches and other memory devices.
Operating system 120 provides functions such as device interfaces, memory management, multiple task management, and the like as known in the art. CPU 110 can be suitably programmed to read, load, and execute instructions of operating system 120. Computer system 100 has the necessary subsystems and functional components to implement support for an implementation of the present invention as will be described later. Other programs (not shown) include server software applications in which network adapter 118 interacts with the server software application to enable computer system 100 to function as a network server via network 119.
General system bus 106 supports transfer of data, commands, and other information between various subsystems of computer system 100. While shown in simplified form as a single bus, bus 106 can be structured as multiple buses arranged in hierarchical form. Display adapter 114 supports video display device 115, which is a cathode-ray tube display or a display based upon other suitable display technology that may be used to allow input or output to be viewed. Input/output adapter 112 supports devices suited for input and output, such as keyboard or mouse device 113, and a disk drive unit (not shown). Storage adapter 142 supports one or more data storage devices 144, which could include a magnetic hard disk drive or CD-ROM drive, although other types of data storage devices can be used, including removable media for storing data such as, but not limited to, resource management and configuration data.
Adapter 117 is used for operationally connecting many types of peripheral computing devices to computer system 100 via bus 106, such as printers, bus adapters, and other computers using one or more protocols including Token Ring, LAN connections, as known in the art. Network adapter 118 provides a physical interface to a suitable network 119, such as the Internet. Network adapter 118 includes a modem that can be connected to a telephone line for accessing network 119. Computer system 100 can be connected to another network server via a local area network using an appropriate network protocol and the network server can in turn be connected to the Internet.
Data Centre 210 produces various statistical information or measurement data, such as, but not limited to, utilization of resources and quantities of resources, which is captured and then processed by AppController 220. AppController 220 receives the metrics from the managed components of Data Centre 210 by polling the various components explicitly, by receiving event notifications containing such data, or by other means that make the necessary information available for processing. The acquisition means is not as important as having the actual data; therefore how the data is obtained is not significant to an implementation of an embodiment of the present invention.
AppController 220 combines the metrics for the various disciplines obtained from Data Centre 210 with an internal model of application workload to estimate the service level for differing numbers of resources, such as servers. Differing implementations may be used to suit different types of applications. For example, an adaptive queuing model may be used to model a grid service offering to estimate how the service time may vary according to the number of servers in the grid service. In another example a streaming video application may be modelled using a simple ratio model such as doubling of the number of servers causes streaming throughput to double also. AppController 220 is capable of providing an estimated number of servers required for each cluster of servers for an application based on workload information and the internal model of the application. This estimate is determined based on, for each cluster, estimating the probability of breaching the service level for the application as determined for a given instance in time and specific number of servers.
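By way of illustration only, the simple ratio model mentioned above can be sketched as follows. The function names, the per-server throughput parameter and the clamping range are assumptions introduced for this sketch and do not appear in the description; an adaptive queuing model would replace the linear relationship with its own estimate.

```python
import math


def ratio_model_throughput(servers, per_server_throughput):
    """Simple ratio model: throughput scales linearly with the server count,
    so doubling the servers doubles the modelled throughput."""
    return servers * per_server_throughput


def estimate_servers_required(demand, per_server_throughput,
                              min_servers=1, max_servers=64):
    """Smallest server count whose modelled throughput meets the demand,
    clamped to a permitted range (the range values are hypothetical)."""
    needed = math.ceil(demand / per_server_throughput)
    return max(min_servers, min(max_servers, needed))
```

For example, a demand of 950 units against servers modelled at 100 units each yields an estimate of 10 servers under these assumptions.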
Predictive information (in the context of the applications) may also be used. Typical predictive models may be used, such as analysis of variance (ANOVA) in combination with auto-regression, to predict arrival rates of client requests in an application based on historical information for that application. This form of technique may be effective for predicting regular patterns such as daily or weekly usage patterns, but typically adds complexity to the implementation of AppController 220. Such techniques may only be useful when the patterns of use are fairly regular and predictable.
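As a minimal sketch of the auto-regression portion only, a first-order model may be fitted to historical arrival rates by least squares and used to predict the next value. The choice of a first-order model, the function names and the sample data are assumptions made for illustration; the ANOVA component is not shown.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = intercept + slope * x[t-1]."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # A constant history carries no slope information.
        return mean_y, 0.0
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return mean_y - slope * mean_x, slope


def predict_next(series):
    """Predict the next arrival rate from the most recent observation."""
    intercept, slope = fit_ar1(series)
    return intercept + slope * series[-1]
```

A history that grows by a fixed amount each interval, such as 10, 20, 30, 40, 50 requests per interval, is extrapolated by this sketch to approximately 60.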
Service level objectives themselves may be characterized by example. A performance objective may relate to a maximum response time allowed for an application, where the response duration is specified to be a set value per set unit of time. In another example, CPU utilization may be established at a target rate or range, such as between 50% and 75%. Availability objectives are typically expressed in some coarse form, such as prevention of a single point of failure condition by guaranteeing that a “hot” backup server is always available. In addition, the objectives may vary in accordance with the time of day, such as when core hours are defined for an on-line service to be available at a higher level of availability than outside the defined core hours.
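One possible way to represent such objectives is sketched below. The class name, field names and the specific core hours are hypothetical; the sketch applies the time-of-day variation to a response time objective, which is an assumption, and uses the 50% to 75% utilization range given above as a default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseTimeObjective:
    """Maximum response time that depends on whether the hour is a core hour."""
    core_start_hour: int
    core_end_hour: int
    core_max_seconds: float
    off_hours_max_seconds: float

    def max_response_seconds(self, hour):
        # Core hours are the half-open interval [core_start_hour, core_end_hour).
        if self.core_start_hour <= hour < self.core_end_hour:
            return self.core_max_seconds
        return self.off_hours_max_seconds


def utilization_in_target(utilization, low=0.50, high=0.75):
    """True when CPU utilization lies inside the target range."""
    return low <= utilization <= high
```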
Input from Data Centre Model 230 is provided to AppController 220 to allow AppController 220 to perform the necessary calculations to produce Probability of breach surfaces 260. Data Centre Model 230 may be implemented as a database or other form of repository providing information on the current configuration and state of the infrastructure of Data Centre 210. This information may include the specific resource pool to which each server cluster belongs, the actual number of servers being used by a specific cluster, the permitted range of servers allowed in a cluster, the number of idle servers in the various resource pools and the priority of an application to which a specific cluster belongs.
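The kinds of information listed above could, for example, be held in a structure such as the following. This is a sketch only; the class names, field names and the `headroom` helper are assumptions and are not part of the described Data Centre Model 230.

```python
from dataclasses import dataclass


@dataclass
class ClusterRecord:
    name: str
    resource_pool: str          # pool to which the cluster belongs
    servers_in_use: int         # actual number of servers in the cluster
    min_servers: int            # permitted range of servers
    max_servers: int
    application_priority: int   # priority of the owning application


class DataCentreModel:
    """In-memory stand-in for a repository of configuration and state."""

    def __init__(self, clusters, idle_servers_by_pool):
        self.clusters = {c.name: c for c in clusters}
        self.idle_servers_by_pool = dict(idle_servers_by_pool)

    def headroom(self, cluster_name):
        """Servers the cluster could still gain, limited by both its
        permitted range and the idle servers in its resource pool."""
        c = self.clusters[cluster_name]
        idle = self.idle_servers_by_pool.get(c.resource_pool, 0)
        return max(0, min(c.max_servers - c.servers_in_use, idle))
```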
Probability of breach of the service level is then calculated based on how close an estimated service level is to an objective. Probability of breach surfaces 260 is the graphic result of the computations involving the previously presented metrics, disciplines and application model. A three dimensional representation of the metrics is calculated using known techniques from the inputs just described to produce a three dimensional surface object. The surface represents the data tuple in the form of x, y and z values (shown in
Probability of breach surfaces 260 is then made available to Global Resource Manager 240, which seeks to optimize utilization of resources under its control. Global Resource Manager 240 interrogates Probability of breach surfaces 260, providing input values for resources and time. The output for such a pairing of data values is the probability of breach of service level at that point. Within Global Resource Manager 240 there is an optimizer designed to segregate information into sub-groups according to resource pool, allowing resource pool optimizers to function for a respective resource pool. A resource pool optimizer is designed to find the optimal set of infrastructure changes for the respective resource pool, and therefore the best allocation of resources within the data centre, taking into account the implied cost of a service level breach and the application priority.
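The interrogation of a surface by a pairing of resource and time values may be sketched as follows. The description does not specify how closeness to the objective is converted into a probability, so the logistic mapping, its `sharpness` parameter, and the class and function names are all assumptions made for illustration.

```python
import math


def breach_probability(estimated_response, objective_response, sharpness=4.0):
    """Hypothetical mapping: 0.5 when the estimate equals the objective,
    rising toward 1 as the estimate exceeds the objective."""
    ratio = estimated_response / objective_response
    return 1.0 / (1.0 + math.exp(-sharpness * (ratio - 1.0)))


class BreachSurface:
    """Tabulated surface: x = number of servers, y = time step,
    z = probability of breach of the service level."""

    def __init__(self, server_counts, time_steps, estimate_fn, objective):
        self.server_counts = list(server_counts)
        self.time_steps = list(time_steps)
        self._z = {
            (s, t): breach_probability(estimate_fn(s, t), objective)
            for s in self.server_counts
            for t in self.time_steps
        }

    def probability(self, servers, time_step):
        """Return z for a tabulated (servers, time step) pair."""
        return self._z[(servers, time_step)]
```

A decision maker in this sketch only sees the `probability` query, which keeps discipline-specific details behind the surface.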
In an implementation of an embodiment of the present invention, a decision tree containing nodes comprised of appropriate infrastructure changes may be created and the tree traversed. Traversal is typically governed by best fit analysis of the given nodes. Additionally, a timeout parameter may be used to limit the time allowed to traverse the decision tree. If a timeout has been implemented, the best fit encountered before the timeout expires will be selected. A traversal algorithm may be used to specify the ordering of nodes so that the best candidate nodes are searched first.
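A time-limited traversal that visits the most promising nodes first can be sketched with a priority queue. The function name, the `expand` and `cost` callables, and the use of a lower cost to denote a better fit are assumptions for this sketch rather than details taken from the description.

```python
import heapq
import itertools
import time


def best_fit_search(root, expand, cost, timeout_seconds=1.0):
    """Traverse candidate infrastructure changes lowest-cost first and
    return the best node seen when the tree is exhausted or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    counter = itertools.count()  # tie-breaker so the heap never compares nodes
    frontier = [(cost(root), next(counter), root)]
    best_cost, best_node = cost(root), root
    while frontier and time.monotonic() < deadline:
        node_cost, _, node = heapq.heappop(frontier)
        if node_cost < best_cost:
            best_cost, best_node = node_cost, node
        for child in expand(node):
            heapq.heappush(frontier, (cost(child), next(counter), child))
    return best_node, best_cost
```

In a usage of this sketch, a node might be a tuple of server counts per cluster, `expand` might add one idle server to each cluster in turn, and `cost` might weight each cluster's probability of breach by its application priority.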
The use of the described optimizer could also be avoided when there are a sufficient number of spare servers available. Once a set of infrastructure changes is available it is reviewed to determine if there are any changes to the server clusters that may be pending. The review is also used to ensure there are only as many add server requests as there are available (usually idle) servers. This simplification removes the necessity of scheduling remove and add server requests in advance to take into consideration the amount of time required to move a specific server.
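The part of the review that limits add server requests to the available servers may be sketched as below. The function name and the assumption that requests arrive in priority order are introduced here for illustration only.

```python
def review_add_requests(add_requests, idle_servers):
    """Grant add-server requests, in the order given, without exceeding
    the number of idle servers. Each request is a (cluster, count) pair."""
    granted = []
    remaining = idle_servers
    for cluster, count in add_requests:
        allowed = min(count, remaining)
        if allowed > 0:
            granted.append((cluster, allowed))
            remaining -= allowed
    return granted
```

With four idle servers, requests for three servers for one cluster and two for another would be trimmed by this sketch to three and one respectively.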
In one embodiment, upon completion of the review of the selected infrastructure changes, Global Resource Manager 240 converts the proposed changes into deployment requests, which may be in the form of logical device operations. Deployment requests may be sent to an intermediary such as Deployment Engine 250 for subsequent processing, or directly to the specified devices in Data Centre 210. If dealing with an intermediary such as Deployment Engine 250, logical device operations may be used instead of device specific commands, thereby separating the services of Global Resource Manager 240 from actual knowledge of specific devices contained within Data Centre 210.
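The conversion of proposed changes into logical device operations might look like the following sketch. The operation names `Cluster.AddServer` and `Cluster.RemoveServer`, the `DeploymentRequest` type and the signed-delta input format are hypothetical and are not defined by the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentRequest:
    operation: str  # logical operation name (hypothetical), not a device command
    cluster: str
    count: int


def to_deployment_requests(changes):
    """Convert {cluster: signed server delta} into logical operations.
    Positive deltas add servers, negative deltas remove them, zero is skipped."""
    requests = []
    for cluster, delta in sorted(changes.items()):
        if delta > 0:
            requests.append(DeploymentRequest("Cluster.AddServer", cluster, delta))
        elif delta < 0:
            requests.append(DeploymentRequest("Cluster.RemoveServer", cluster, -delta))
    return requests
```

Because the requests name only clusters and logical operations, an intermediary remains free to translate them into whatever device specific commands apply.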
As seen in
Referring now to
For example using the graph provided one can see that adding servers may not provide much impact until some units of time have passed as indicated by the step or drop in the surface shape. In similar manner one can surmise that adding some number of servers does not help until a threshold has been passed as indicated along the number of resources (server) axis.
In general the graph is a visual representation indicating that by providing an additional resource over time the probability of service level breach is reduced, which is what would be expected. This may not be the case, however, if the resource being added, such as communication links, causes an increase in workload that cannot be handled by a busy downstream component, such as a web server. In this case the added links compound the problem of the busy web server by increasing demand for service. Applications having multiple clusters need to have the impact of the associated cluster changes summarized on the overall application level. In a similar manner, scenarios with multiple applications and their associated changes need to have the impact of those changes summarized across the applications.
Of course, the above described embodiments are intended to be illustrative only and in no way limiting. The described embodiments of carrying out the invention are susceptible to many modifications of form, arrangement of parts, details and order of operation. The invention, rather, is intended to encompass all such modification within its scope, as defined by the claims.