The present application claims priority from Japanese application JP 2004-003600 filed on Jan. 9, 2004, the content of which is hereby incorporated by reference into this application.
The present invention relates to a system for managing a group of computers autonomically and more particularly to simulating means for simulating autonomic management policies.
Current data centers and corporate information systems are expanding dramatically in scale and growing more complicated in function, and they are often confronted with serious problems that increase the operation/management load. Accordingly, it will be indispensable for future IT systems to reduce the load on their system managers. Autonomic management systems have recently been proposed to solve this problem. An autonomic system does so by automatically managing the server farm of a data center or corporate information system according to the system load.
U.S. 2002/0059427 A2 discloses an autonomic management technique employed for a 3-tier data center (3-tier Web system). According to the technique, in the three-tier (Web server tier, application server tier, and database server tier) Web system which supports a plurality of customer corporations, standby servers shared by the customer corporations are provided in addition to the servers used for each customer corporation's operations. A standby server is allocated to a customer corporation according to the customer's load so that the service level of the system is maintained even at the time of abrupt access concentration. To achieve the above object, the system is further provided with a management server that monitors the operation state of each server in the system and allocates/de-allocates servers according to the system load in accordance with an autonomic management policy determined beforehand.
An autonomic management policy is a description of conditions for switching a standby server to an active server (server allocation) or switching an active server to a standby server (server de-allocation). In the above example, the system monitors the utilization rate of each server and compares the rate with a predetermined threshold value to determine allocation/de-allocation of a server. Concretely, if the utilization rate of the servers exceeds the threshold value, the management server determines the situation to be an overload, then allocates the necessary number of servers to the system. If the utilization rate of the servers is under the threshold value, the management server determines the number of servers to be excessive and de-allocates some of the allocated servers from the system. When a server is allocated to the system, the management server changes the setting parameters of the load balancer or the setting of the load balancing program in the preceding tier so that the system load is balanced equally among all the servers in the system, including the newly allocated one. Similarly, if any server is de-allocated from the system, the management server changes the setting of the load balancer or load balancing program in the preceding tier so that the load is again balanced equally among all the remaining servers in the system. In the 3-tier Web system, the above processes must be executed separately in each of the Web server, application server, and database server tiers.
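The threshold-based decision described above can be sketched as follows. This is only a minimal illustration, not the method of the cited publication: the names `ThresholdPolicy`, `decide`, and `equal_weights` are hypothetical, and the use of two separate threshold values (one for allocation, one for de-allocation) as well as the numeric defaults are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ThresholdPolicy:
    # Hypothetical parameters; the text only speaks of "a predetermined threshold value".
    allocate_above: float = 0.80    # mean utilization treated as overload
    deallocate_below: float = 0.30  # mean utilization treated as excess capacity
    min_active: int = 1             # never shrink below this many active servers


def decide(policy, utilizations, standby_count):
    """Return +1 to allocate a standby server, -1 to de-allocate one, 0 otherwise.

    utilizations  -- utilization rate (0.0 to 1.0) of each active server
    standby_count -- number of standby servers still available
    """
    mean_util = sum(utilizations) / len(utilizations)
    if mean_util > policy.allocate_above and standby_count > 0:
        return 1
    if mean_util < policy.deallocate_below and len(utilizations) > policy.min_active:
        return -1
    return 0


def equal_weights(active_count):
    """Load balancer setting that spreads load equally over all active servers."""
    return [1.0 / active_count] * active_count
```

After each allocation or de-allocation, `equal_weights` would be called with the new number of active servers to rebalance the load, mirroring the load balancer update described in the text.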
On the other hand, an autonomic management policy is described in detail in “Server-Allocation Policy for Improving Response to Web Access Peaks” of Systems and Computers in Japan, Vol. 35, No. 5, 2004, pp. 55-66. The autonomic management policy cannot be achieved by simply allocating/de-allocating a server according to a threshold value. The following complicated conditions should be satisfied comprehensively to create such a policy.
The duration for which the threshold condition is satisfied
[Patent document 1] U.S. 2002/0059427 A2
[Non-patent document 1] “Server-Allocation Policy for Improving Response to Web Access Peaks” of Systems and Computers in Japan, Vol. 35, No. 5, 2004, pp. 55-66, Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J85-D-I, No. 9, September 2002, pp. 866-876.
If the above conventional technique is used for autonomic management of a system, the verification of the autonomic management policy is difficult. That has been a conventional problem.
In each data center/corporation information system, system configuration, application program, input request amount (change with time) of system load, and required service level (response time, etc.) differ among systems. Consequently, an autonomic management policy must be created for each system separately.
For example, the threshold value in the above first known example must be set for each system separately. A problem that might arise here is how to confirm the correct operation of the system with the autonomic management policy. Concretely, if the CPU utilization rate that is assumed as a server allocation threshold value is set at 80%, it is required to verify whether or not the threshold value can prevent response delay at the time of access concentration. If the threshold value is too high, the server allocation is delayed, thereby the server is overloaded and the system service level cannot be maintained. On the contrary, if the threshold value is too low, the excessive server allocation causes an increase of the cost which is not acceptable, although the system service level is maintained. This is why the threshold value must be determined properly so as to satisfy the trade-off between the cost and the service level.
In addition, because the server behavior is affected strongly by the transitional behavior of the cache, etc. (elements to be changed with time), such server transitional behavior must be taken into consideration to create a policy. Hereunder, how such a server's transitional behavior will affect an object policy will be described with reference to
The above behavior is caused by the load distribution executed without giving any consideration to the difference of the performance between those servers. Also, to avoid such a problem, the server load must be distributed among servers in accordance with the performance of each server.
As described above, the system response time includes complicated elements such as transitional changes in server performance, and such elements should be taken into consideration when creating the complicated policies used in autonomic management. Manual checks cannot cope with verifying the propriety of an autonomic management policy created for a site; at present, the only way is to verify the policy on the actual system. This is why such policy verification requires significant cost. In addition, because the policy can be verified only after the actual system is completed, the system construction period is often extended. This has been one of the conventional problems.
Under such circumstances, it is an object of the present invention to provide an autonomic management policy simulator that can verify, inexpensively and quickly, the propriety of each policy created for an autonomic management system operated under the control of that policy.
In order to achieve the above object, the autonomic management policy simulator of the present invention inputs information items of autonomic management policy, system configuration for servers allocated to the subject processing, workload change with time, performance of the program to run in the system, transitional characteristic of the performance of the program, and outputs a system behavior (information items of throughput, response time, and resource utilization rate).
Furthermore, in order to simulate the system behavior, including transitional states, in a system whose configuration changes with time due to its autonomic management function, the simulator obtains the system configuration, the load balance setting, and the input load information at a given time, then calculates the resource utilization rate, the application response time, and the system throughput at that time on the basis of the obtained information items while taking the transitional behavior of the system into consideration. The simulator then applies this result to the autonomic management policies and determines which policy should be used. After that, the simulator uses the selected autonomic management policy to determine the system configuration and the load balance setting for the next time interval. The simulator then advances the time and repeats the system behavior simulation for the next time interval. By repeating the above operations, the simulator can simulate the system behavior while the system configuration changes according to the autonomic management policy. The simulator can also simulate the system behavior by giving consideration to the transitional status of the software, and can make autonomic management decisions on the basis of the system behavior so determined.
According to the present invention, no real system is required to simulate whether or not each created policy functions as expected in an autonomic management system under the control of the subject policy, thereby the simulation cost is minimized and the simulation is speeded up. In addition, when such a simulation is carried out in the autonomic management system, the transitional responses of the software are taken into consideration to simulate a system behavior, so that the system behavior is simulated accurately.
Hereunder, the preferred embodiments (simulator) of the present invention will be described in detail with reference to the accompanying drawings.
The present invention is characterized in that the policy simulator 100 calculates the system behavior by taking the workload variation and external input 400, as well as the software transitional characteristic 600, into consideration, then applies an autonomic management policy to the obtained system behavior to advance the simulation.
Hereinafter, the operation of the simulator in this first embodiment will be described in detail with reference to
The simulator in this embodiment can apply not only to a Web system, but also to a storage system as shown in
A standby server is activated if an active server's CPU utilization rate is over 80%.
The load value of the newly added server should be changed in accordance with the expression shown in
A new policy must be created in accordance with the system configuration, the running program, the system workload, and the user requested service level.
The policy simulator 100 simulates each policy as described above to check its propriety. As shown in
(1) Autonomic Management Policy 200
Policy used for autonomic management described in
(2) Overall System Configuration 300
Overall configuration of the system (including standby servers) to be controlled by the subject policy as shown in
(3) Load Condition 400
Time change (estimated value) of the workload of the simulated system (the number of requests received from user clients, etc.). With this value, for example, the behavior of the autonomic management system can be simulated at the time of abrupt concentration of accesses. In addition, one of the important goals of the autonomic management system is to cope with external disturbances such as a server failure, in which case automatic allocation of an alternate server is required. The ability to describe such external disturbances among the load conditions enables simulation of external disturbances such as a server fault. For example, the external disturbance description is made as follows.
(4) Software Performance Information 500
Both response time and resource utilization rate of the software on the simulated system are described in the steady state. For example, the description will be made as follows.
They are basic values for calculating the system performance.
(5) Software Transitional Characteristic 600
This library describes the transitional characteristic of the subject software. One of the methods for describing a transitional behavior of the system is to describe the system performance changes with time after a transitional behavior trigger occurs as shown in
The simulator 100 outputs the following:
(1) System Behavior 700
Data describing how the system behavior changes with time; concretely, the time change of the system response time, the utilization rate of each resource (CPU, network, disk, etc.), the system throughput (the number of processed requests), etc. This data is used to check whether or not the system is operating as expected in accordance with a target service level.
(2) Policy Application Log 800
This log denotes how each policy is applied to the system. The log retains the time, the identifier of the applied policy, and the parameter values used to decide whether the policy is applied. The log also retains how each server is allocated by the autonomic management server. Combined with (1), this log allows each created policy to be debugged, and simulation results can be fed back to optimize the policy if the created policy does not work as expected.
Next, the operation of the simulator will be described in detail with reference to
(1) Recognition of the system operation at the subject time
(2) Applying an autonomic management policy according to the result of (1).
(3) Deciding both system configuration and load balance setting for the next time step according to the result of (2).
The simulator carries out a simulation for next time interval according to the system configuration and the load balance setting decided in (3). The simulation cycle is determined according to the following points in accordance with the accuracy and simulation speed requirements of each simulator.
If the simulation cycle is short, the simulation accuracy is improved while a longer simulation time is required.
If the simulation cycle is long, the simulation is speeded up while the accuracy is lowered.
The simulation must be carried out in a cycle shorter than the transitional system behavior that should be avoided in the system to be simulated (otherwise, the transitional behavior evaluation accuracy is degraded significantly).
Hereinafter, the operation of the simulator in each simulation cycle will be described in detail.
At first, the simulator obtains the system configuration and load balance setting 170 in the current simulation cycle, as well as the system workload and the external input information (step 1001). The system configuration and load balance setting 170 are usually obtained from policy application of previous time interval 160. In the first simulation cycle, the initial active server configuration and the default load balance setting denoted in the system overall configuration 300 are used. The system workload and the external input information are obtained by reading the information for the current simulation cycle from the load condition 400 using the workload calculating function 120.
After that, the simulator calculates the system behavior 140 such as each system resource utilization rate, response time, system throughput, etc. using the information of the system configuration and the workload obtained in step 1001, as well as the software performance information library 500 and the software transitional characteristic library 600 (step 1002). The following is an example of the calculation.
(1) Obtaining the software performance information (response time and resource utilization rate) from the performance information library 500
(2) Obtaining a transitional characteristic value at the current time from the transitional characteristic library 600. For example, in
(3) Usage of devices corresponding to external disturbance such as a fault is inhibited in the system configuration 170. The subject devices cannot be used for calculating the system behavior in (4).
(4) The system behavior is calculated according to the information of usable devices obtained in (3), the load balance setting 170, the performance of each hardware component such as CPU, etc. obtained from the system overall configuration 300, and the performance information obtained in (1). At that time, the above information is modified by the transitional characteristic information obtained in (2). For example,
The value is modified according to above mentioned results.
Using the above value, the system behavior (utilization rate of each resource such as CPU, response time, system throughput) is accumulated. If the utilization of a resource is over 100%, response time is increased to reflect the effect of waiting time.
The calculated system behavior is output as a simulator output 700.
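The calculation in step 1002 can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the calculation of the embodiment: the function names, the modelling of the transitional characteristic as a single multiplicative slowdown factor, and the rule used to stretch response time once utilization exceeds 100% are all hypothetical choices made for the example.

```python
def server_behavior(request_rate, cpu_per_request, base_response_time,
                    transitional_factor=1.0):
    """Estimate one server's behavior for one simulation cycle.

    request_rate        -- requests per second routed to this server
    cpu_per_request     -- steady-state CPU seconds consumed per request
    base_response_time  -- steady-state response time in seconds
    transitional_factor -- assumed slowdown multiplier (>= 1.0 during a transition)
    """
    cost = cpu_per_request * transitional_factor
    demand = request_rate * cost               # may exceed 1.0 (i.e. 100%)
    response_time = base_response_time * transitional_factor
    throughput = request_rate
    if demand > 1.0:
        # Assumed waiting-time penalty: stretch response time by the overload
        # ratio and cap throughput at what the CPU can actually complete.
        response_time *= demand
        throughput = request_rate / demand
    return {"utilization": min(demand, 1.0),
            "response_time": response_time,
            "throughput": throughput}


def system_behavior(total_rate, weights, failed, cpu_per_request,
                    base_response_time, factors):
    """Combine per-server results, skipping devices disabled by a fault."""
    usable = [i for i in range(len(weights)) if i not in failed]
    weight_sum = sum(weights[i] for i in usable)
    servers = {}
    mean_response = 0.0
    for i in usable:
        share = weights[i] / weight_sum        # fraction of load sent to server i
        b = server_behavior(total_rate * share, cpu_per_request,
                            base_response_time, factors[i])
        servers[i] = b
        mean_response += share * b["response_time"]
    return {"servers": servers,
            "response_time": mean_response,
            "throughput": sum(b["throughput"] for b in servers.values())}
```

Here `failed` plays the role of the devices inhibited in (3), `weights` stands in for the load balance setting 170, and `factors` holds the per-server transitional value obtained in (2).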
In the next step, the simulator determines which of the autonomic management policies 200 can be applied according to the system behavior 140 calculated in step 1002 (step 1003). Concretely, in order to make above mentioned decision the system behavior 140 is applied to the autonomic management policy conditions 6001 to 6003 described in
After determining a policy to be applied in step 1003, the simulator applies the policy to the current system configuration and load balance setting, using the next time system configuration and load balance setting determination mechanism 160, to determine the system configuration and load balance setting 170 to be used in the next simulation cycle (step 1004). The system configuration mentioned here means the configuration information of the active servers. The load balance setting means a method for distributing the system load among a plurality of servers. The method may be, for example, a round robin method that distributes load among a plurality of servers according to weight values. Consequently, the simulator can apply an autonomic management policy to the system in accordance with the current system operation status.
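As an illustration of a round robin method driven by weight values, the following minimal sketch uses the "smooth" weighted round robin scheme, which is one common way to realize such a setting; the text does not say which variant is intended, so this choice and the name `weighted_round_robin` are assumptions.

```python
from itertools import islice


def weighted_round_robin(weights):
    """Yield server names forever, in proportion to their positive weights.

    weights -- mapping of server name to weight, e.g. {"web1": 3, "web2": 1}
    """
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    while True:
        # Every server earns credit equal to its weight on each pick ...
        for name, weight in weights.items():
            current[name] += weight
        # ... and the server with the most credit receives the next request.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        yield chosen


# Example: dispatch the next eight requests over two servers.
picks = list(islice(weighted_round_robin({"web1": 3, "web2": 1}), 8))
```

With weights 3 and 1, `web1` receives three of every four requests, and its turns are interleaved with those of `web2` rather than sent in one burst.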
After completing the above process, the simulator advances the simulation clock (step 1005), then repeats the above process, starting again at the operation in step 1001.
The simulator can thus simulate the target policy operation by taking the autonomic management system transitional information into consideration.
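The cycle of steps 1001 to 1005 can be sketched as a simple loop. This is a hedged outline only: `simulate`, its callback parameters, and the `demo_*` helpers are hypothetical names, the policy is reduced to "add one server, remove one server, or do nothing", and the load balance setting is left out for brevity.

```python
def simulate(workload, initial_active, standby, policy, behavior, cycles):
    """Run the simulation loop for a fixed number of cycles.

    workload(t)        -- request rate in cycle t (stands in for load condition 400)
    behavior(rate, n)  -- dict with at least "utilization" for n active servers
    policy(b)          -- +1 to allocate, -1 to de-allocate, 0 to keep as is
    """
    active, spare = initial_active, standby
    behavior_log, policy_log = [], []          # stand-ins for outputs 700 and 800
    for t in range(cycles):                    # step 1005: advance the clock
        rate = workload(t)                     # step 1001: obtain the workload
        b = behavior(rate, active)             # step 1002: calculate behavior
        behavior_log.append((t, active, b))
        delta = policy(b)                      # step 1003: choose a policy
        if delta > 0 and spare > 0:            # step 1004: next configuration
            active, spare = active + 1, spare - 1
            policy_log.append((t, "allocate", b["utilization"]))
        elif delta < 0 and active > 1:
            active, spare = active - 1, spare + 1
            policy_log.append((t, "deallocate", b["utilization"]))
    return behavior_log, policy_log


def demo_behavior(rate, active):
    # Assumed cost of 0.01 CPU seconds per request, load split equally.
    return {"utilization": rate * 0.01 / active}


def demo_policy(b):
    # Assumed thresholds of 80% (allocate) and 30% (de-allocate).
    if b["utilization"] > 0.8:
        return 1
    if b["utilization"] < 0.3:
        return -1
    return 0
```

Calling `simulate(lambda t: [50, 100, 100, 100, 20][t], 1, 2, demo_policy, demo_behavior, 5)` returns a behavior log and a policy application log that can be inspected together, in the spirit of outputs 700 and 800.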
Next, a description will be made of how a policy is optimized by feeding back simulation results. When creating a policy for an autonomic management system, it is usually difficult to complete the policy in a single pass; the policy must be optimized by trial and error. With this simulation tool, the simulation result can be observed and fed back to optimize the policy.
(1) An (initial) policy is inputted with use of the policy editor.
(2) The simulator simulates the autonomic management system behavior.
(3) The simulation result is displayed on the screen 2010.
(4) By observing the operation status 2012, the system behavior is checked for problems (for example, whether or not the maximum response time defined by the SLA is exceeded in any simulation cycle).
(5) If any problem is found, the policy application log 2011 is checked to locate the problem in the policy.
(6) The problem of the policy is corrected using the policy input editor 2013.
(7) A new policy is created by feeding back the simulation result. The new policy is used to simulate the system behavior again. (Here, the system returns to (3) to repeat the operations to complete the optimization.)
Thus, the autonomic management system policy is optimized by feeding back simulation results.
Variation
The present invention is not limited only to the embodiment described above; it may apply to various variations, for example, as follows.
(1) In the first embodiment, an optimized policy is obtained by accumulating the resource utilization rate, etc. However, the simulation can be made more accurate by using a queuing model.
(2) In the first embodiment, there is only one active server system. In other words, the Web system of only one user (one corporation) is executed in the system. However, the simulation system of the present invention can simulate the behaviors of more than two active systems (when standby servers are shared by a plurality of users/works). In that case, all the behaviors may be simulated in parallel while taking the server allocation state of the other systems into consideration.
(3) In the first embodiment, only servers are controlled by the autonomic management system. However, the same simulation method may be applied to storage systems, network systems, etc.
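Variation (1) mentions a queuing model without naming one. As a minimal sketch, assuming the simplest single-server M/M/1 model (an assumption of this example, not something the text specifies), the mean response time follows from the arrival rate and the mean service time as R = S / (1 - ρ), where ρ = λS is the utilization. The function name `mm1_response_time` is hypothetical.

```python
def mm1_response_time(arrival_rate, service_time):
    """Mean response time of an M/M/1 queue.

    arrival_rate -- mean requests per second (lambda)
    service_time -- mean seconds of service per request (S)
    """
    utilization = arrival_rate * service_time   # rho = lambda * S
    if utilization >= 1.0:
        # The queue grows without bound once the server is saturated.
        return float("inf")
    return service_time / (1.0 - utilization)
```

Unlike a simple accumulation of utilization rates, this model makes response time rise sharply as utilization approaches 100%, which is the kind of waiting-time effect a queuing model is meant to capture.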
As described above, the present invention can simulate the behavior of an autonomic management policy and can be used to verify whether or not the system behaves as expected without using the real system. Because it can effectively reduce the management load, the present invention is expected to be applicable to systems with many computer resources under autonomic management, such as data centers.
Number | Date | Country | Kind |
---|---|---|---|
2004-003600 | Jan 2004 | JP | national |