The subject matter disclosed herein relates generally to system monitoring. Specifically, the subject matter disclosed herein relates to systems, methods, and computer program products for online system availability estimation.
There is a growing reliance upon computers to make systems having critical applications more manageable and controllable. However, this reliance has imposed stricter requirements on the dependability of these computers and systems. In critical applications, losses due to system downtime can range from large financial losses to risk to human life. In safety-critical and military applications, the dependability requirements are even higher, as system unavailability would most often result in disastrous consequences. For example, in the case of air traffic control systems, such as Eurocontrol, typical requirements for the enroute subsystem associated with radar data reception, processing, and display specify that these services should not be unavailable for more than three seconds per year. In complex military applications, such as missile tracking systems and surveillance and early warning systems, the unavailability of any component of the system in combat situations may have a disastrous effect.
Another critical application area is infrastructure. In this field, there has been an increase in the interdependence between different critical infrastructures (e.g., communication, power, and the Internet). As a result, downtime in any one of the critical infrastructures can cascade into failures of the other infrastructures as well. In the field of electric power generation and distribution, increasing complexity in the management and control of the electric grid is causing it to transform into an electronically controlled network. Since all other infrastructures depend on power, system unavailability in this case can have a far more damaging impact.
Yet another class of critical applications includes business-critical applications. Examples of business-critical applications include online brokerages, online shops, and credit card authorizations. In these applications, system downtime may translate into financial loss due to lost transactions in the short term and a loss of customer base in the long term.
These concerns make it important to ensure the high availability of systems in critical applications. Availability can be assured by constant evaluation, monitoring, and management of the system. Accordingly, there exists a need for improved systems, methods, and computer program products for system availability estimation. In addition, there is a need for improved systems, methods, and computer program products for taking appropriate control actions to maintain a high level of system availability.
Online availability estimators, methods, and computer program products are disclosed for estimating availability of a system. A method according to one embodiment can include a step for providing an availability model of a system. The method can also include a step for receiving behavior data of the system. In addition, the method can include estimating a plurality of parameters for the availability model based on the behavior data. The method can also include determining individual confidence intervals for each of the parameters. Further, the method can include determining an overall confidence interval for the system based on individual distributions of the estimated parameters. According to one embodiment, all of the estimations are carried out in real-time. In addition, the availability model of the system according to one embodiment can be constructed off line. The method can also suggest appropriate control actions to maximize system availability.
Some of the objects having been stated hereinabove, and which are achieved in whole or in part by the present subject matter, other objects will become evident as the description proceeds when taken in connection with the accompanying drawings as best described hereinbelow.
Exemplary embodiments of the subject matter will now be explained with reference to the accompanying drawings, of which:
Methods, systems, and computer program products are disclosed herein for online availability estimation of a system. According to one embodiment, an availability model of a system is provided. Behavior data of a plurality of sub-systems or components of the system can be received. Based on the received behavior data, a plurality of parameters can be estimated for the availability model. Next, individual confidence intervals can be determined for each of the parameters. Based on the individual distributions of the parameters, an overall confidence interval for the system availability can be determined. Further, according to one embodiment, based on the estimated availability and the parameter values of the model, control actions can be suggested for maximizing availability of the system.
Availability of a system can be defined as the fraction of time the system is providing service to its users. Limiting or steady state availability of a system is computed as the ratio of the mean time to failure (MTTF) of the system to the sum of the mean time to failure and the mean time to repair (MTTR). It is the steady state availability that can be translated into other metrics, such as downtime per year. The above definition of availability provides the point estimate of limiting availability. In critical applications, there should be a reasonable confidence in the estimated value of system availability. Therefore, it is important to also estimate the confidence intervals for availability.
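The definition above can be sketched in a few lines of code. This is an illustrative example only; the function names and the MTTF/MTTR figures are assumptions chosen for the sketch, not values from the disclosure.

```python
# Point estimate of limiting (steady-state) availability:
# A = MTTF / (MTTF + MTTR), and its translation into downtime per year.

def limiting_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)

def downtime_per_year_seconds(availability: float) -> float:
    """Translate a steady-state availability into expected downtime per year."""
    return (1.0 - availability) * 365 * 24 * 3600

# Illustrative figures: MTTF of 10,000 hours, MTTR of 1 hour.
A = limiting_availability(mttf_hours=10_000.0, mttr_hours=1.0)
dt = downtime_per_year_seconds(A)
```

Even this high availability (about "four nines") corresponds to roughly an hour of downtime per year, far above the three-seconds-per-year requirement cited for air traffic control enroute services.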
The methods and systems for estimating online availability of a system will be explained in the context of flow charts and diagrams. It is understood that the flow charts and diagrams can be implemented in hardware, software, or a combination of hardware and software. Thus, the subject matter disclosed herein can include computer program products comprising computer-executable instructions embodied in computer-readable media for performing the steps illustrated in each of the flow charts or implementing the machines illustrated in each of the diagrams. In one embodiment, the hardware and software for estimating online availability of a system is located in a computer connected to sub-systems or components of the system.
System 102 can include a plurality of sub-systems 104A-104D operably connected to availability estimator 100. Sub-systems 104A-104D can be components required for the availability and/or operation of system 102. For example, a missile defense system can consist of several required sub-systems, such as radar, interceptor, early warning systems, and space-based infrared systems, which are controlled by a command and control system. Other exemplary sub-systems include input/output (I/O) devices, hard disks, memory, and CPUs. In addition, sub-systems 104A-104D can be devices for indicating the status of other components of system 102. Sub-systems 104A-104D can be operably connected to and/or dependent on one another or disparate components.
Availability estimator 100 can be in communication with sub-systems 104A-104D for receiving data indicating the behavior of sub-systems 104A-104D and/or system 102 or its components. According to one embodiment, availability estimator 100 can receive the behavior data online, i.e., during operation of system 102. Based on the received behavior data, availability estimator 100 can determine the overall availability of system 102. In addition, availability estimator 100 can issue control commands to sub-systems 104A-104D, system 102, and/or other components of system 102 for maximizing the availability of system 102 and sub-systems 104A-104D.
According to one embodiment, a method for estimating online availability of a system includes providing an availability model of the system. Availability estimator 100 can include and manage a system availability model 106. The purpose of system availability model 106 is to capture the behavior of system 102 with respect to the interaction and dependencies between sub-systems 104A-104D or other components of system 102, and their various modes of failure and repair.
System availability modeling can be implemented with discrete-event simulation or analytic models. Alternatively, a hybrid approach of combining both the simulation and analytic methods can also be implemented.
Analytic modeling includes non-state space modeling and state space modeling. Non-state space-based availability models assume that all sub-systems have statistically independent failures and repairs. Reliability block diagrams (RBD) and fault trees are two non-state space modeling techniques that can be utilized to evaluate system availability.
According to one embodiment, availability model 106 can be based on the reliability block diagram modeling technique. The reliability blocks can be connected in series/parallel or k-out-of-n combinations based on operational dependencies. In this embodiment, availability model 106 can comprise a plurality of reliability blocks arranged in a reliability block diagram configuration. Each block of the reliability block diagram can correspond to one of sub-systems 104A-104D. Additionally, information regarding reliability block diagrams can be found in the publication “A Realistic Reliability and Availability Prediction Methodology for Power Supply Systems”, by G. Kervarrec and D. Marquet, 24th Annual International Telecommunications Energy Conference, INTELEC, pp. 279-286 (October 2002), the contents of which are incorporated herein by reference.
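A reliability block diagram of independent blocks can be evaluated compositionally with the series, parallel, and k-out-of-n combinations named above. The sketch below is illustrative; the function names and the block availabilities are assumptions, not part of the disclosure, and statistical independence of blocks is assumed as stated for non-state space models.

```python
from math import comb

def series(avails):
    """All blocks in a series arrangement must be up."""
    a = 1.0
    for x in avails:
        a *= x
    return a

def parallel(avails):
    """A parallel arrangement is up unless every block is down."""
    u = 1.0
    for x in avails:
        u *= (1.0 - x)
    return 1.0 - u

def k_out_of_n(k, n, a):
    """At least k of n identical independent blocks, each with availability a, must be up."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

# Hypothetical diagram: two redundant 0.99 blocks in parallel, in series
# with a 0.999 block.
system_a = series([parallel([0.99, 0.99]), 0.999])
```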
Referring to
Referring to
According to another embodiment, availability model 106 can be based on the fault tree modeling technique. A fault tree is a graphical representation of the combination of events that can cause a failure of system 102. All of the basic events represented in the fault tree are mutually independent. In order to represent situations where one failure event propagates failures along multiple paths in the fault tree, fault trees can have repeated nodes. Availability estimator 100 can be operable to solve the fault tree. The following method types can be utilized to solve fault trees: (1) factoring/conditioning on the shared nodes; (2) sum of disjoint products (SDPs); and (3) binary decision diagrams (BDDs). Fault trees are contrasted with reliability block diagrams in that reliability block diagrams can evaluate the conditions when system 102 functions, and fault trees can evaluate conditions when a system 102 fails. A more detailed example of a fault tree model is described hereinbelow in the section titled Exemplary Process for Online Availability Estimation. Additionally, information regarding fault trees can be found in the book titled “Probability and Statistics with Reliability, Queuing and Computer Science Applications (2nd Edition)” by Prof. Kishor S. Trivedi, John Wiley and Sons, New York, (2001).
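For a fault tree whose basic events are mutually independent and not repeated, the tree can be evaluated bottom-up without the factoring, SDP, or BDD machinery. The following is a minimal sketch under that assumption; the gate functions and the tree structure (CPU failure OR both mirrored disks failing) are hypothetical illustrations.

```python
# Bottom-up evaluation of a small fault tree with independent,
# non-repeated basic events.  Each function returns the probability of
# the gate's output (failure) event.

def or_gate(unavails):
    """OR gate: the output event occurs if any input event occurs."""
    p = 1.0
    for u in unavails:
        p *= (1.0 - u)
    return 1.0 - p

def and_gate(unavails):
    """AND gate: the output event occurs only if all input events occur."""
    p = 1.0
    for u in unavails:
        p *= u
    return p

# Hypothetical tree: top = OR(cpu fails, AND(disk1 fails, disk2 fails)).
u_cpu, u_disk = 0.001, 0.01
u_top = or_gate([u_cpu, and_gate([u_disk, u_disk])])
system_availability = 1.0 - u_top
```

Note the duality with reliability block diagrams: the AND gate here plays the role of a parallel block arrangement, and the OR gate the role of a series arrangement.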
State space models include Markov chains, stochastic reward nets, semi-Markov processes, and Markov regenerative processes. According to one embodiment, availability model 106 can include a homogeneous continuous time Markov chain (CTMC) for representing system 102.
In homogeneous CTMCs, transitions from one state to another occur after a time that is exponentially distributed. Arcs representing a transition from one state to another are labeled by the time-independent rate corresponding to the exponentially distributed time of the transition. Based on the condition of the system in any state, "up" and "down" states are marked. The limiting availability of the system is the steady state probability of the system being in one of the "up" states. Additionally, information regarding CTMCs can be found in the book titled "Probability and Statistics with Reliability, Queuing and Computer Science Applications (2nd Edition)" by Prof. Kishor S. Trivedi, John Wiley and Sons, New York, (2001), the contents of which are incorporated herein by reference. Large and complex Markov chains can be solved utilizing a suitable software package such as SHARPE, made available by Dr. Kishor S. Trivedi, Durham, N.C., U.S.A., at URL: http://www.ee.duke.edu/˜kst.
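The simplest instance is a two-state CTMC with one "up" and one "down" state. The sketch below derives its limiting availability in closed form; the function name and the rate values are illustrative assumptions, not part of the disclosure.

```python
# Steady-state solution of a two-state homogeneous CTMC (up/down model).
# The arc from "up" to "down" is labeled with failure rate lam, and the
# arc from "down" to "up" with repair rate mu.

def ctmc_two_state_availability(lam: float, mu: float) -> float:
    # Balance equation: pi_up * lam = pi_down * mu, with pi_up + pi_down = 1,
    # gives pi_up = mu / (lam + mu).  Equivalently MTTF / (MTTF + MTTR),
    # since MTTF = 1/lam and MTTR = 1/mu for exponential times.
    return mu / (lam + mu)

# Illustrative rates: MTTF = 1000 hours, MTTR = 1 hour.
A = ctmc_two_state_availability(lam=1.0 / 1000.0, mu=1.0)
```

Larger chains require solving the full set of balance equations numerically, which is the role a package such as SHARPE plays.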
According to one embodiment, availability model 106 can include a Stochastic Petri Net (SPN) for representing system 102. A stochastic reward net (SRN) is an extension of the SPN with notions of reward functions and several marking-dependent features that can simplify the graphical representation of the model. A large variety of reward-based measures can be calculated with the help of SRNs. SRN-based availability models are described in further detail herein. To obtain the steady state availability, the reward function is defined such that a reward rate of 1 is assigned to markings corresponding to the system being in an "up" state and 0 otherwise. Additional information regarding SPNs can be found in the book titled "Probability and Statistics with Reliability, Queuing and Computer Science Applications (2nd Edition)" by Prof. Kishor S. Trivedi, John Wiley and Sons, New York, (2001), the contents of which are incorporated herein by reference.
Estimating online availability of a system also includes monitoring and receiving behavior data for the system. The behavior data can include information regarding the failure times and repair times of system 102 or sub-systems 104A-104D, for each mode of failure and each mode of repair of sub-systems 104A-104D, and various other behavior data with respect to system 102. Availability estimator 100 can include a sub-system interface 108 having multiple ports for communicating with sub-systems 104A-104D. In addition, availability estimator 100 can use a system log 110 that stores the behavior data of the components/sub-systems.
Availability estimator 100 can include a sub-system monitor 112 for monitoring the behavior data of sub-systems 104A-104D. Monitoring of sub-systems 104A-104D can be implemented via any one or combination of the following processes: continuously monitoring data in system log 110, actively probing any of sub-systems 104A-104D or components of system 102 for status, performing health checks, and monitoring heart beat messages from system 102. Sub-systems 104A-104D of system 102 can be connected to system log 110 for sending sub-system log messages to system log 110.
Monitor 112 can inspect the data of log 110 to assess the operational status of sub-systems 104A-104D. Monitor 112 can continuously monitor the logged data from components of sub-systems 104A-104D that report specific error messages. Alternatively, monitor 112 can periodically poll sub-systems 104A-104D for behavior data. The behavior data can also indicate sub-system status, such as network status and system resource levels. In addition, availability estimator 100 can perform test transactions and check their output for correctness and exit status. In addition, the execution time of test transactions can be monitored to determine the status of various other components.
System or sub-system failures can be attributed to hardware and/or software faults. Error log messages due to hardware faults can be broadly classified as: (1) central processing unit (CPU) related errors, caused by cache parity faults, bit flips in registers or caches, bus errors, etc.; (2) memory faults such as ECC errors, which when not corrected can cause the system to give out log messages; (3) disk faults, such as disk failures and bad sectors; and (4) various miscellaneous hardware failures such as fan failures and power supply failures.
For assessing system health, monitor 112 can actively probe system 102. Probing can be implemented by pinging the sub-system or system component under consideration.
As another example of system health monitoring, in industrial robotic systems, error-logging mechanisms can include error codes that particularly point out a sub-system or action that failed. For example, in a robotic system, the system can generate specific error messages for a large class of failures at all locations in the system (e.g., motors, gripper, and force torque sensor on the robot and the storage and processing sub-systems of the controller). The robot can be connected to its controller through either a wired or wireless communication link. Active probing can be implemented to monitor the health of the communication link for detecting system health concerns.
The log messages at logging servers of a critical system that may be remote from the system can be inspected to retrieve behavior data. One example of such a critical system is an air traffic control system which typically maintains elaborate redundancies. These redundancies can range from having more than one command station placed apart geographically to redundant software and hardware in various stand-by schemes at each of these locations. Redundant networks can connect these separate command locations. Elaborate logging of every transaction can be carried out at the log servers. These log messages can be continuously inspected.
Estimating online availability of a system can include estimating system parameters based on system behavior data and determining confidence intervals for each of the parameters. Availability estimator 100 can include a model parameter estimator 114 for estimating system parameters based on system behavior data. In addition, model parameter estimator 114 can determine individual confidence intervals for each of the parameters.
According to one embodiment, model parameter estimator 114 can estimate the parameters of availability model 106 from the collected data by using methods of statistical inference. Parameter estimator 114 can perform goodness of fit tests upon the failure and repair data of each of sub-systems 104A-104D. The goodness of fit tests can include a Kolmogorov-Smirnov test and a probability plot. Next, the model parameters of the closest fitting distribution can be calculated. The point estimate of limiting availability for any of the components or sub-systems 104A-104D can be calculated as the ratio of mean time to failure to the sum of mean time to failure and mean time to repair. Depending on the distribution of time to failure and time to repair, confidence intervals can be computed for the limiting availability of each of sub-systems 104A-104D, as described in further detail below.
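The estimation step can be sketched for the simplest case, an exponentially distributed time to failure; in practice the distribution would be whichever one the goodness of fit tests select. The function names, the synthetic data, and the normal approximation to the estimator's distribution are all assumptions of this sketch.

```python
import math
import random

def exponential_mle_rate(samples):
    """Maximum likelihood estimate of the rate of an exponential sample:
    rate = n / sum(x)."""
    return len(samples) / sum(samples)

def rate_confidence_interval(samples, z=1.96):
    """Approximate 95% CI for the rate, using the Fisher information
    I(rate) = n / rate**2, so the standard error is rate / sqrt(n)."""
    rate = exponential_mle_rate(samples)
    se = rate / math.sqrt(len(samples))
    return rate - z * se, rate + z * se

# Synthetic time-to-failure data with a true rate of 0.01 per hour.
random.seed(42)
ttf = [random.expovariate(0.01) for _ in range(500)]
lo, hi = rate_confidence_interval(ttf)
```

The same construction applied to the repair data yields a repair-rate estimate and interval, from which the component availability estimate MTTF/(MTTF + MTTR) and its interval follow.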
Estimating online availability of a system also includes determining an overall confidence interval for the system availability. This determination can be based on the distributions of the parameters of availability model 106. Availability estimator 100 can include a system availability estimator (point and confidence interval) 116 for determining the system availability and an overall confidence interval for the availability of the system based on the individual confidence intervals for sub-systems 104A-104D. As noted above, the individual confidence intervals can be determined by model parameter estimator 114. The system availability and its confidence interval estimation may both utilize system availability model 106.
The estimators of each of the input parameters in system availability model 106 can be random variables and have their own distributions. The estimators can be determined by utilizing maximum likelihood estimates and a Fisher information matrix. Thus, the point estimates have some associated uncertainty, which can be accounted for in the confidence intervals. The uncertainty expressed in the distributions of the different parameters of system availability model 106 can be propagated through model 106 to obtain the uncertainty, or the confidence interval, of the overall system availability. According to one embodiment, a Monte Carlo approach can be utilized for uncertainty analysis. The Monte Carlo approach is applicable to both state space-based and non-state space-based models. In this embodiment, system availability model 106 can be seen as a function of its input parameters. For example, if Λ={λi, i=1, 2, . . . , n} is the set of input parameters, the overall availability A can be calculated through a Monte Carlo method by repeatedly sampling parameter values from their estimated distributions, evaluating the model at each sampled point, and taking percentiles of the resulting empirical distribution of A as the confidence interval.
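The Monte Carlo propagation can be sketched as follows. The model here (two independent blocks in series) and the parameter distributions (normal approximations to the estimators) are illustrative assumptions; in practice the model would be availability model 106 evaluated at each sampled parameter set.

```python
import random

def model_availability(a1: float, a2: float) -> float:
    """Illustrative availability model: two independent blocks in series."""
    return a1 * a2

def monte_carlo_ci(sample_a1, sample_a2, n=20_000, alpha=0.05):
    """Sample the input parameters, evaluate the model per sample, and
    return the (alpha/2, 1 - alpha/2) percentiles of the results."""
    draws = sorted(model_availability(sample_a1(), sample_a2())
                   for _ in range(n))
    lo = draws[int(n * alpha / 2)]
    hi = draws[int(n * (1 - alpha / 2)) - 1]
    return lo, hi

random.seed(1)
# Assumed estimator distributions for the two block availabilities.
lo, hi = monte_carlo_ci(lambda: random.gauss(0.999, 0.0002),
                        lambda: random.gauss(0.995, 0.0005))
```

Because only model evaluations are required, the same loop applies unchanged whether the model is a reliability block diagram, a fault tree, or a Markov chain.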
According to one embodiment, sub-systems can be controlled by an availability estimator to maximize the availability of the system. For example, availability estimator 100 can include a system controller 118 for controlling sub-systems 104A-104D.
Control action can be adaptively triggered based on online estimation. When the availability of system 102 falls below a certain threshold, alternate system models can be evaluated at the values of the estimated parameters. The system can then be reconfigured to the configuration that has the maximum availability at those estimated parameter values.
According to one embodiment, reconfiguration is applicable to both the hardware and software components. The various replication schemes (i.e., cold, warm, and hot) to ensure fault tolerance in software and hardware will have their own overhead-availability tradeoffs. The configuration for which the system model gives the maximum availability at those parameter values can be selected. The sub-systems can be controlled based on the selection.
According to one embodiment, preventive maintenance can be utilized for increasing system availability when aging of components occurs. The optimal preventive maintenance interval can be obtained in many cases as a function of the parameter values of the availability model. The availability can then be optimized with respect to the preventive maintenance trigger interval. Preventive maintenance may be for hardware or software (in the latter case, it is referred to as software rejuvenation).
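The interval selection can be sketched with a standard age-replacement availability model, assuming Weibull-distributed times to failure with shape β > 1 (aging). All function names, the trapezoidal integration, and the parameter values are assumptions of this sketch, not part of the disclosure.

```python
import math

def weibull_reliability(t, beta, eta):
    """Weibull survival function R(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

def availability_for_interval(T, beta, eta, d_pm, d_fail, steps=2000):
    """Availability under preventive maintenance every T hours:
    expected uptime per cycle over expected cycle length."""
    # Expected uptime per cycle = integral of R(t) from 0 to T (trapezoid rule).
    h = T / steps
    uptime = sum(weibull_reliability(i * h, beta, eta)
                 for i in range(1, steps)) * h
    uptime += 0.5 * h * (1.0 + weibull_reliability(T, beta, eta))
    # Cycle ends in planned maintenance (prob. R(T)) or a failure (prob. F(T)).
    r = weibull_reliability(T, beta, eta)
    downtime = r * d_pm + (1.0 - r) * d_fail
    return uptime / (uptime + downtime)

# Sweep candidate intervals (hours); planned maintenance is much cheaper
# than unplanned repair, so an interior optimum typically exists.
candidates = [100.0 * k for k in range(1, 51)]
best_T = max(candidates,
             key=lambda T: availability_for_interval(T, beta=2.0, eta=1000.0,
                                                     d_pm=2.0, d_fail=50.0))
```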
Monitoring tools 402 can include components for inspecting the monitored system and application log/error messages continuously for components providing specific error messages such as I/O devices, hard disk, memory, and CPU. Monitoring tools 402 can include a continuous log monitor 410 for continuously inspecting log/error messages. An active probe 412 can actively poll various sub-systems to determine status of the sub-system or other components of the monitored system. A health checker 414 can check the overall health of the monitored system. Sensors 416 can detect failures such as fan failures. Watch dog processes 418 can listen to heartbeat messages from subsystems/components.
Referring to
According to one embodiment, model evaluator 406 can utilize the SHARPE software for solving the system availability model online. The SHARPE software can obtain the point estimate of the overall system availability. Confidence intervals for the overall system availability can be calculated online by utilizing a Monte Carlo approach.
Referring to
According to one embodiment, the system monitored by the process of
Referring back again to
Referring to
The failure of system 600 (
Referring now to
TTF[i]=time_component_went_down[i]−time_component_came_up[i−1]
TTR[i]=time_component_came_up[i]−time_component_went_down[i]
The unavailability of each of modules 602, 604, and 606 can be calculated as the ratio of mean time to repair to the sum of mean time to repair and mean time to failure. The unavailability of each of modules 602, 604, and 606 serves as input to fault tree model 700, and the point estimate of overall system availability can be calculated by evaluating fault tree model 700. The time to failure and time to repair data can be fitted to known distributions (e.g., the Weibull distribution, the lognormal distribution, and the exponential distribution), and the parameters of the best fitting distribution can be calculated. Utilizing exact or approximate methods, confidence intervals for these parameters can be determined (step 510).
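The per-module computation can be sketched from an alternating up/down event log. The timestamps below are illustrative; one consistent indexing convention is assumed (the module starts up at time 0, and each repair completion precedes the next failure), matching the TTF/TTR definitions in the text.

```python
# Deriving TTF/TTR samples for one module from its up/down event times
# (hours), then the point unavailability U = MTTR / (MTTR + MTTF).

came_up = [0.0, 105.0, 230.0]    # times the module (re)entered service
went_down = [100.0, 220.0]       # times the module failed

# i-th time to failure: from the i-th service entry to the i-th failure.
ttf = [went_down[i] - came_up[i] for i in range(len(went_down))]
# i-th time to repair: from the i-th failure to the next service entry.
ttr = [came_up[i + 1] - went_down[i] for i in range(len(went_down))]

mttf = sum(ttf) / len(ttf)
mttr = sum(ttr) / len(ttr)
unavailability = mttr / (mttr + mttf)
```

Each module's unavailability computed this way becomes one basic-event probability in fault tree model 700.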
Referring to
It will be understood that various details of the subject matter disclosed herein may be changed without departing from the scope of the subject matter. Furthermore, the foregoing description is for the purpose of illustration only, and not for the purpose of limitation.
This invention was supported by U.S. Army Research Office Federal Grant No. C-DAAD19 01-1-0646. Thus, the Government has certain rights in this invention.