The invention relates to a method, a system and a computer program for operational risk modelling.
Operational-risk management and quantification have recently become more important owing to the new Basel II regulations. Basel II is the common notation for § 644 of the International Convergence of Capital Measurement and Capital Standards. These regulations require capital allocation for operational risk, complementing the existing requirements on market and credit risk. Operational risk is, e.g., the risk of loss resulting from inadequate or failed internal processes, people, systems or from external events.
Several types of methods are known to assist in operational risk management. One known type of method quantifies operational risk by observing losses and their magnitudes. High-level approaches mitigate operational risks by insurance.
It is an object of the invention to provide improved solutions for operational risk modeling.
The present invention is directed to a method, a system and a computer program as defined in independent claims. Further embodiments of the invention are provided in the appended dependent claims.
According to a first aspect of the present invention there is provided a method for modeling the operational risk of an entity, the method comprising the steps of:
Compiling a list with one or more failure events of the entity;
Compiling a list with one or more causes of the failure events;
Compiling a list with one or more impact types of the failure events;
Evaluating interdependencies between the failure events, the causes of the failure events and the impact types of the failure events;
Decomposing the interdependencies, thereby establishing one or more independent impact sub-models.
The method according to this aspect of the invention introduces a way to handle and solve large and complex operational risk quantification problems. This is established by means of decomposing the complex large-scale problem of operational risk into smaller independent impact sub-models. An impact sub-model comprises the modeling of failure events that have the same or a similar impact on the entity. The entity can be in particular a business entity. This decomposition approach maintains the failure and impact dependencies, thus facilitating the aggregation of the results.
Because taking dependencies into account increases the size and complexity of the model, the method helps the user to model failure dependencies and impact dependencies.
The method according to this aspect of the invention has the advantage that it preserves the cause-to-effect relationship that reveals how operational risk can be reduced, managed, and controlled. The method makes it possible to capture the causes of operational failures and their resulting effects in terms of losses. This quantification includes explicit modeling of the linkage between cause and effect.
Such a cause-to-effect operational-risk modeling method makes it possible to manage the operational risks of an entity, e.g. a business entity, beyond simple quantification. If a financial institution wants to change the capital allocation required under Basel II for operational risk, it is advantageous to understand the causes, in particular the root causes, of operational risk and how they lead to loss events. Beyond this overall management of operational risk, cause-to-effect modeling also enables the inclusion of operational risk in business decision processes, such as business process re-engineering, infrastructure re-engineering and infrastructure operation.
The method is dynamic in that it captures failure dependencies and impact dependencies.
According to a preferred embodiment of this aspect of the invention the evaluation is performed by means of setting up an interdependency graph between the failure events, the causes of the failure events and the impact types of the failure events. Then the independent impact sub-models are established by means of decomposing the interdependency graph.
This preferred embodiment is a very structured approach. It makes it possible to find effective solutions even for very complex business models. The disconnected sub-graphs are preferably identified using standard graph-theory methods.
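As a minimal sketch of this decomposition step, the following Python fragment finds the disconnected sub-graphs of an interdependency graph with a plain depth-first search. The node labels and edges are hypothetical and are not taken from the figures; any standard connected-components routine would serve equally well.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components of an undirected graph as sorted node lists."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first search from an unvisited node collects one component.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical interdependency graph: causes, failure events and impact types.
EDGES = [
    ("cause:aging", "failure:hardware"),
    ("cause:aging", "failure:storage"),
    ("failure:hardware", "impact:disruption"),
    ("failure:storage", "impact:disruption"),
    ("cause:intrusion", "failure:theft"),
    ("failure:theft", "impact:asset_loss"),
]

# Each disconnected sub-graph corresponds to one independent impact sub-model.
impact_sub_models = connected_components(EDGES)
```

In this illustrative graph the two impact types share no failure events or causes, so two independent impact sub-models result.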
According to a further preferred embodiment of this aspect of the invention the impact sub-models comprise one or more failure sub-models that correspond to failure events which share the same causes.
By means of decomposing the impact sub-models further into failure sub-models the modeling of the operational risk is further simplified. As the failures which have the same causes can be correlated, the failure sub-models allow the modeling of the correlated failure event arrivals. Preferably each failure sub-model is solved separately.
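The grouping of failure events by shared causes can be sketched as follows. This is only an illustration under assumed inputs: the mapping from failure events to causes is hypothetical, and failure events are placed in the same failure sub-model when their cause sets overlap directly or through a chain of other failure events.

```python
def failure_sub_models(causes_of):
    """Group failure events whose causes overlap, directly or transitively.

    `causes_of` maps each failure event to the set of its causes.
    Returns a list of (causes, failure_events) pairs with disjoint cause sets.
    """
    groups = []
    for failure, causes in causes_of.items():
        merged_causes, merged_failures = set(causes), {failure}
        remaining = []
        for group_causes, group_failures in groups:
            if group_causes & merged_causes:
                # Shared cause: fold the existing group into the new one.
                merged_causes |= group_causes
                merged_failures |= group_failures
            else:
                remaining.append((group_causes, group_failures))
        remaining.append((merged_causes, merged_failures))
        groups = remaining
    return groups


# Hypothetical failure events of one impact sub-model and their causes.
CAUSES_OF = {
    "hardware": {"aging", "bad_maintenance"},
    "storage": {"aging"},
    "power": {"grid_outage"},
    "hvac": {"bad_maintenance"},
}

sub_models = failure_sub_models(CAUSES_OF)
grouped_failures = sorted(sorted(failures) for _, failures in sub_models)
```

Here the hardware, storage and HVAC failures end up in one failure sub-model because they are linked through shared causes, while the power failure forms its own sub-model.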
According to a further preferred embodiment of the invention each of the impact sub-models comprises an impact-calculation sub-model that calculates from failure event arrivals the corresponding financial impacts for the entity.
The failure event arrivals are preferably provided in form of a stochastic process, i.e. in form of a random function of time. The impact calculation sub-models receive the failure event arrivals from the failure sub-models and calculate the resulting financial impact as output.
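One simple way to represent failure event arrivals as a random function of time is a homogeneous Poisson process. The following sketch assumes such a process purely for illustration; the text above does not prescribe a particular stochastic process, and the rate and horizon are hypothetical.

```python
import random


def simulate_arrivals(rate_per_year, horizon_years, rng):
    """Return the arrival times of a homogeneous Poisson process on [0, horizon)."""
    times = []
    # Inter-arrival times of a Poisson process are exponentially distributed.
    t = rng.expovariate(rate_per_year)
    while t < horizon_years:
        times.append(t)
        t += rng.expovariate(rate_per_year)
    return times


# Hypothetical failure sub-model: two failures per year on average, five-year horizon.
rng = random.Random(42)
arrivals = simulate_arrivals(rate_per_year=2.0, horizon_years=5.0, rng=rng)
```

An impact-calculation sub-model would take such a list of arrival times as input and translate it into a financial impact.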
According to a further preferred embodiment of the invention the impact sub-models are solved by means of statistical analysis.
This is in particular applicable for rather simple operational risk modeling tasks.
According to a further preferred embodiment of the invention the impact sub-models are solved by means of simulation.
Such a solution based on simulation is broadly applicable.
According to a further preferred embodiment of the invention the outputs of the impact sub-models are combined to obtain the impact distribution of the entity.
The combination of the outputs of the impact sub-models results in an impact distribution of the whole entity. This allows the evaluation of the overall operational risk of the entity.
According to a further preferred embodiment of the invention the impact distribution of the entity is derived from the impact sub-models by means of convolution.
As due to the decomposition the impact sub-models are independent of each other, the impact distributions of the impact sub-models can be aggregated by convolution. This can generally be done numerically or, in case of standard impact distributions, analytically. The impact distribution is preferably represented in terms of losses.
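The numerical aggregation can be sketched as a discrete convolution. The example assumes that both loss distributions are given as probabilities on a common, equally spaced loss grid (index 0 for no loss, index 1 for one loss unit, and so on); the probabilities are hypothetical.

```python
def convolve(p, q):
    """Distribution of the sum of two independent losses on a common integer grid."""
    total = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            # Independence: P(loss_1 = i and loss_2 = j) = p_i * q_j.
            total[i + j] += p_i * q_j
    return total


# Hypothetical impact distributions of two independent impact sub-models.
impact_1 = [0.9, 0.1]        # loss of 0 or 1 unit
impact_2 = [0.5, 0.3, 0.2]   # loss of 0, 1 or 2 units

total_impact = convolve(impact_1, impact_2)
```

The result is again a probability distribution; further impact sub-models can be added by convolving the running total with each additional distribution in turn.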
According to a second aspect of the present invention there is provided a system comprising means for carrying out the steps of the method according to any one of claims 1 to 10.
According to a third aspect of the present invention there is provided a computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 10 when this computer program is executed on a computer system.
Preferred embodiments of the invention are described in detail below, by way of example only, with reference to the following schematic drawings.
The drawings are provided for illustrative purposes only and do not necessarily represent practical examples of the present invention to scale. The same or similar elements of the drawings are denoted with the same reference signs.
The impact sub-models #1 and #2 are decomposed further in a layer 205. Within each impact sub-model #1 and #2, failure events are categorized by their causes. The failure events that have the same causes are grouped together in failure sub-models. For each failure sub-model the system is modeled in such a way that it generates correct failure event arrivals for each type of failure event. Because failure events with the same cause can be correlated, having them in the same model makes it possible to model the correlated failure event arrivals correctly.
In the exemplary embodiment of
The sub-models in each layer can be dealt with separately. In the following the steps for modelling the operational risk according to this exemplary embodiment of the invention are explained. In a step 210 each failure sub-model #1 to #6 is solved separately. As a result, in step 220 failure event arrivals are provided, preferably in form of a stochastic process, i.e. in form of a random function of time. In step 230 the failure event arrivals of the failure sub-models #1 to #6 are translated into financial impacts by means of the impact calculation sub-models #1 and #2. As a result, in step 240 the financial impact of the impact sub-model #1 is provided as impact distribution #1 and the financial impact of the impact sub-model #2 is provided as impact distribution #2. The impact distributions are preferably provided in form of probability distributions of the financial impact, in particular losses, over a predefined period of time. In other words, the impact calculation sub-models #1 and #2 use the failure event arrivals as input and calculate the resulting financial impact as output. Since all failure events that have the same impact type are in the same impact sub-model, the resulting impact distribution can be correctly calculated. Moreover, the resulting impact distributions from different impact sub-models are independent because they do not share causes. Hence, these impact distributions can be appropriately aggregated in step 250 by means of convolution. As a result, the total impact distribution of the overall system or business entity is provided in step 260. The convolution can be done numerically or analytically.
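Steps 210 to 240 can be sketched for a single impact sub-model as a small Monte Carlo simulation. The sketch makes simplifying assumptions that are not part of the description above: failure events arrive as independent Poisson processes, each event causes a fixed loss, and all rates and costs are hypothetical.

```python
import random
from collections import Counter


def count_arrivals(rate, horizon, rng):
    """Number of Poisson arrivals with the given rate in [0, horizon)."""
    count, t = 0, rng.expovariate(rate)
    while t < horizon:
        count += 1
        t += rng.expovariate(rate)
    return count


def impact_distribution(rates, cost_per_event, horizon, replications, rng):
    """Empirical distribution of the total loss of one impact sub-model."""
    outcomes = Counter()
    for _ in range(replications):
        loss = 0
        for failure, rate in rates.items():
            # Impact calculation: translate failure arrivals into a financial loss.
            loss += cost_per_event[failure] * count_arrivals(rate, horizon, rng)
        outcomes[loss] += 1
    return {loss: n / replications for loss, n in sorted(outcomes.items())}


# Hypothetical failure rates (per year) and losses per event (in thousands).
RATES = {"hardware": 0.5, "power": 0.2}
COSTS = {"hardware": 10, "power": 50}

distribution = impact_distribution(RATES, COSTS, 5.0, 5000, random.Random(7))
mean_loss = sum(loss * p for loss, p in distribution.items())
```

The resulting dictionary maps each observed total loss to its relative frequency and plays the role of one impact distribution that is later aggregated with the others.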
Usually the impact calculation and failure sub-models will be solved by means of simulation. In this case, the majority of the computation time for modeling is spent on solving the impact calculation sub-models and the failure sub-models rather than on the convolution. One of the benefits of the decomposition technique is that it reduces the number of simulation replications. Suppose there are m sub-models and each requires n replications. With this decomposition the number of required replications is n*m, whereas without this decomposition the number of required replications is n^m.
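The difference between the two replication counts grows quickly with the number of sub-models. The following short calculation uses hypothetical values of n and m to illustrate the stated formulas.

```python
def replications_with_decomposition(n, m):
    """Each of the m sub-models is simulated separately with n replications."""
    return n * m


def replications_without_decomposition(n, m):
    """Replication count stated in the text for the undecomposed model."""
    return n ** m


# Hypothetical example: 3 sub-models, 100 replications each.
with_decomposition = replications_with_decomposition(100, 3)
without_decomposition = replications_without_decomposition(100, 3)
```

For these values the decomposed model needs 300 replications, compared with 1,000,000 for the undecomposed model.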
In the following the cause-to-effect operational-risk quantification methodology based on the above described layering and decomposition concept is described in more detail.
Study objectives may be business process (BP) driven, information technology (IT) driven, or loss driven. Typical examples of objectives driven by these different interests are
The identified study objectives drive the level of detail for model development, data collection, and monitoring. Based on the identified study objectives, a list of possible failure events, causes of the failure events and impact types of the failure events is compiled. The failure event taxonomy of
Such an interdependency graph is shown in
The taxonomy as shown in
Base information to create the interdependency graph can be provided in many forms.
Table 1, an event-dependency chart, is one such example that can be used to assist in identifying failure and impact dependencies. Typically, such an event-dependency chart is derived from operations surveys, actual experience, and interviews.
In a following step 320, standard graph-theory methods are used to identify disconnected (independent) sub-graphs in the interdependency graph. For example, the graph in
Within the second sub-graph, there are two sub-graphs in the failure layer, i.e. the (4; 3) sub-graph and the (5, 6, 7; 4) sub-graph, where (x; y) denotes a sub-graph containing causes x and failure events y. These two disconnected sub-graphs represent two separate failure sub-models within the same impact sub-model. As a result, the failure events that share the same causes are grouped together into one failure sub-model. These failure sub-models are shown in the layer 205 of the decomposition map in
Referring back to
For each sub-model, the system is preferably modeled at the coarsest level of detail that still captures the relevant failure behavior, in order to avoid unnecessary work. For example, for the failure event “power-failure” all components that share the same power line can be grouped into one single object in the model because they all will fail in a power outage. On the other hand, for the failure event “hardware-failure” each component should be treated as a separate object in the model because its failure pattern may depend heavily on its individual characteristics such as its state and its age.
Referring again to
In the following, an example for modeling the operational risk of a service provider that offers settlement services to a clearing house is described in detail.
The example illustrates that the described modeling of operational risks may assist in making business decisions that affect the operational risk. Specifically, two questions are examined: a system-architectural question, namely the value of having a redundant system, and an operational question, namely the optimal frequency for server replacement.
All simulations are run for a five-year period using a discrete event simulation system (Arena™). Arena is a software product and trademark of Rockwell Automation Inc. Numerical results are given for the entire period. Other simulation software tools can be used as well.
The IT infrastructure of the service provider, as illustrated in
For this example it is assumed that the practitioner wants to resolve two business problems. The first is an architectural problem, i.e. whether to have a redundant system, and the second is an operational problem, i.e. when to replace aging servers. The level of detail needed in the model must be sufficient to capture the effects of these decisions.
For this example, there are several types of impact. The most important one is a penalty charge from the violation of the service level agreement (SLA). The charge is calculated based on the performance and the breakdown time. The impact function is defined in a form of service credit (or service-level violation penalty).
Each month, the service provider will be charged $500,000 if any one of the following events occurs:
Each month, the service provider will also be charged $100,000 if any one of the following events occurs:
Other impacts besides SLA violation penalties are: maintenance cost; disaster or other recovery cost; loss due to stealing of company assets or confidential information; and potential reputation loss, which includes the future sales loss.
The taxonomy in
All the failure events in the event-dependency chart (the ‘Event’ column in Table 2) are put into the middle section of the interdependency graph (the middle section of
As a result, a list of failure events, causes of the failure events and impact types of the failure events is provided.
In order to evaluate the interdependencies between the failure events, the causes of the failure events and the impact types of the failure events, all three sections of the interdependency graph of
Identifying the disconnected sub-graphs, the interdependency graph of
As a result, the interdependencies have been decomposed and independent impact and failure sub-models have been identified. The decomposition map for this example is shown in
A layer 900 in this decomposition map consists of three impact sub-models (impact sub-model #1, impact sub-model #2 and impact sub-model #3). In this layer 900, all failure-events (failure event types) that affect the same impact type are grouped together. The impact sub-model #1 comprises the failure events hardware failures, storage failures, network failures, heating, ventilating and air conditioning (HVAC) failures, power failures, software failures, failures due to human operation errors and failures due to natural disasters. All these failure events can cause a business disruption, repair/replace costs and/or SLA violation. Therefore, they have been arranged in the same impact sub-model #1. The second impact type is the loss of assets or confidential data, such as the legal costs and asset replacement cost incurred. It is assumed that operating assets cannot be stolen, whereas maintenance assets and spare parts can be. This type of failure event does not entail a business disruption, and hence can be assigned to another impact sub-model #2. The same holds true in the case of the failure event war or terrorist attack, where the loss due to business disruption is protected by a force majeure clause in the SLA contract. The only impact of this event type is the costs of repairs and replacements. Hence this is established as impact sub-model #3.
The impact sub-model # 1 comprises in a further layer 905 five failure sub-models, namely failure sub-model #1, failure sub-model #2, failure sub-model #3, failure sub-model #4 and failure sub-model #5. The impact sub-model #2 comprises in the layer 905 a failure sub-model #6 and the impact sub-model #3 comprises in the layer 905 a failure sub-model #7.
Note that in this example it is assumed that a bad maintenance policy can according to failure sub-model #1 cause hardware, storage, network and HVAC failures and according to failure sub-model #2 cause power failures. Further it is assumed, e.g., that human errors, such as accidentally switching off a server, have a direct effect in terms of business disruption, but no significant effects in terms of the hardware, HVAC and power failure rates. If this assumption shall be relaxed and human errors shall be allowed to affect hardware, HVAC, and power failure rates, then the failure sub-models #1, #2, and #4 must be combined into a single failure sub-model. In both cases, all the failure sub-models can be practically implemented using a simulation approach.
For each impact and failure sub-model, the parameters to monitor are identified based on the causes of failures and the failure and impact dependencies identified in Table 2. The sub-models should contain detail levels such that those parameters can be monitored. The failure and impact parameters and variables in the model are listed in the non-shaded columns of Table 3. The shaded columns are from the event-dependency chart constructed earlier.
The failure and impact variables to monitor are the key to determine the level of details for each sub-model. Each sub-model should be at such a level of detail that these variables can be monitored. In each impact-calculation sub-model, the failures are transformed into impacts. The impact variables to monitor are necessary for impact calculation. For example, we can calculate the level of SLA violation due to a hardware failure if we know the state of the backup system, the failure time and duration, and the repair/replace cost. It is also possible that one type of failure can impact another type of failure. For example, a HVAC failure for an extended period of time can increase the failure arrival rate of some hardware components. Such correlated events can be captured by a simulation model, which is explained in the following.
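How a simulation can capture one failure type influencing another is sketched below with a simple day-by-day model. All probabilities, the repair time and the assumption that hardware failure probability rises while the HVAC system is down are hypothetical and chosen only to show the coupling.

```python
import random


def simulate_hardware_failures(days, rng, hvac_fail_prob, hvac_repair_days,
                               base_hw_prob, stressed_hw_prob):
    """Count hardware failures when HVAC outages raise the daily failure probability."""
    hvac_down_until = -1
    hardware_failures = 0
    for day in range(days):
        hvac_is_up = day > hvac_down_until
        if hvac_is_up and rng.random() < hvac_fail_prob:
            # HVAC fails today and stays down for hvac_repair_days days in total.
            hvac_down_until = day + hvac_repair_days - 1
        hvac_is_down = day <= hvac_down_until
        hw_prob = stressed_hw_prob if hvac_is_down else base_hw_prob
        if rng.random() < hw_prob:
            hardware_failures += 1
    return hardware_failures


def mean_failures(stressed_hw_prob, replications=300, seed=0):
    """Average hardware failure count over a five-year horizon (1825 days)."""
    rng = random.Random(seed)
    total = sum(
        simulate_hardware_failures(1825, rng, 0.01, 3, 0.002, stressed_hw_prob)
        for _ in range(replications)
    )
    return total / replications


coupled = mean_failures(stressed_hw_prob=0.05)     # HVAC outages stress the hardware
uncoupled = mean_failures(stressed_hw_prob=0.002)  # no influence of HVAC on hardware
```

Comparing the two averages shows how much of the hardware failure rate is attributable to the dependency, which is the kind of correlation that motivates keeping both failure events in one failure sub-model.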
There are several different ways to estimate the failure or impact functions and their parameters. The most acceptable way is to perform statistical analysis of the historical data. If no historical data exists, the operational staff who define the dependencies listed in Table 1 should be able to provide information on the functions or their parameters. In the worst case, i.e. if nothing is known about a particular input assumption, a sensitivity analysis of that assumption must be performed.
Because the input assumption modeling technique for each particular sub-model is not the goal of this example, some dummy numbers are assumed for these functions and parameters. For illustration purposes only, we now describe some of our input assumptions for this example.
Referring to Impact sub-model #1 in
The most important features included in Impact sub-model #1 are listed below. First the impact calculation sub-model #1 is described, i.e. how the failure events (the stochastic process of the failure events) are translated into losses by means of the impact calculation sub-model #1 of the impact sub-model #1. Then the failure sub-models #1 to #5 that generate the failure events (the stochastic process of the failure events) are described.
Impact (business disruption, business delay, repair/replacement costs)
Such level of detail can be handled by simulation. Arena™ software was used to model and run the simulation.
In the following the steps for modeling the operational risk of the service provider are explained with reference to
In the following, example outputs of the impact sub-models are shown. In this particular example, 10,000 replications were simulated. The graph in
Table 4 shows the statistical results. The logarithmic graph (inset) of
Table 4 shows a statistical description of the loss distributions for business disruption with and without a redundant system. For ease of comparison, the results for the redundant system are shown before the explicit consideration of redundancy in the text. Clearly the presence of a redundant system has a huge beneficial impact on business disruption/delay.
VaR means value at risk, which is a number indicating the operational risks in terms of losses for the considered time period, which is 5 years in this example. VaR 99% is the value at risk for the confidence level 99% and VaR 95% is the value at risk for the confidence level 95%. The abbreviation s.d. is used for standard deviation.
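The statistics named above can be computed directly from simulated losses. The sketch below uses an empirical quantile as the value at risk, which is one common convention and an assumption here; the loss sample is a hypothetical placeholder, not data from the tables.

```python
import math
import statistics


def value_at_risk(losses, confidence):
    """Smallest simulated loss that is not exceeded in at least `confidence` of outcomes."""
    ordered = sorted(losses)
    # The small tolerance guards against floating-point error in confidence * n.
    rank = math.ceil(confidence * len(ordered) - 1e-9)
    return ordered[max(0, rank - 1)]


# Hypothetical simulated losses over the considered period, one per replication.
losses = list(range(1, 101))

mean_loss = statistics.mean(losses)
sd_loss = statistics.pstdev(losses)
var_95 = value_at_risk(losses, 0.95)
var_99 = value_at_risk(losses, 0.99)
```

With more replications the empirical quantiles become more stable, which matters most for the high confidence levels in the tail of the loss distribution.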
Table 5 shows a statistical description of the loss distributions for theft with and without a redundant system. With a redundant system, there is more to steal in terms of spare parts and maintenance supplies. However the overall differences are small.
Table 6 shows a statistical description of the loss distributions due to war and terrorist attacks with and without a redundant system. The redundant system incurs a slightly higher risk, which is due to the fact that the worst-case scenario, i.e. total loss, is actually worse in the redundant system than in the non-redundant system because there are more assets to lose.
The independent loss distributions from the three impact sub-models can be aggregated into the total loss distribution using a numerical convolution program.
Now the value (in operational risk terms) of having a redundant system and the difference between different server-replacement policies is examined. The former examination is a system-architectural question, and the latter is an operational question.
The total loss distributions in the two cases (with and without a redundant system) are compared in
Table 7 shows the statistical description of the total impact distribution with and without a redundant system.
The operational cost in the redundant system has a significantly lower mean than that in the single system. Regarding the business decision, especially for a large company, it can be derived that if the initial investment cost for having the redundant system is lower than the difference between the mean of the two cases (roughly $700,000 over the five-year period used), the company should go for the redundant system.
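The decision rule described above can be written as a one-line comparison. The mean losses below are hypothetical placeholders chosen only so that their difference matches the roughly $700,000 mentioned in the text; they are not the values of Table 7, and the investment cost is likewise assumed.

```python
def prefer_redundant_system(mean_loss_single, mean_loss_redundant, investment_cost):
    """Return True if the expected loss reduction exceeds the initial investment."""
    expected_saving = mean_loss_single - mean_loss_redundant
    return investment_cost < expected_saving


# Hypothetical five-year mean losses in dollars (placeholders, not Table 7 values).
MEAN_LOSS_SINGLE = 1_000_000
MEAN_LOSS_REDUNDANT = 300_000

decision_cheap = prefer_redundant_system(MEAN_LOSS_SINGLE, MEAN_LOSS_REDUNDANT, 500_000)
decision_expensive = prefer_redundant_system(MEAN_LOSS_SINGLE, MEAN_LOSS_REDUNDANT, 900_000)
```

This rule compares means only; an institution that is sensitive to tail losses might compare the VaR figures of the two cases as well.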
A non-redundant system and server replacement policies varying between 10 and 60 months are considered.
The components of the computer system 1600 include a computer 1620, a keyboard 1610, a mouse 1615, and a video display 1690. The computer 1620 includes a processor 1640, a memory 1650, input/output (I/O) interfaces 1660, 1665, a video interface 1645, and a storage device 1655.
The processor 1640 is a central processing unit (CPU) that executes the operating system and the computer software executing under the operating system. The memory 1650 includes random access memory (RAM) and read-only memory (ROM), and is used under direction of the processor 1640.
The video interface 1645 is connected to video display 1690 and provides video signals for display on the video display 1690. User input to operate the computer 1620 is provided from the keyboard 1610 and mouse 1615. The storage device 1655 can include a disk drive or any other suitable storage medium.
Each of the components of the computer 1620 is connected to an internal bus 1630 that includes data, address, and control buses, to allow components of the computer 1620 to communicate with each other via the bus 1630.
The computer system 1600 can be connected to one or more other similar computers via the input/output (I/O) interface 1665 using a communication channel 1685 to a network, represented as the Internet 1680.
The computer software may be recorded on a portable storage medium, in which case, the computer software program is accessed by the computer system 1600 from the storage device 1655. Alternatively, the computer software can be accessed directly from the Internet 1680 by the computer 1620. In either case, a user can interact with the computer system 1600 using the keyboard 1610 and mouse 1615 to operate the programmed computer software executing on the computer 1620.
Other configurations or types of computer systems can be equally well used to implement the described methods and techniques. The computer system 1600 described above is described only as an example of a particular type of system suitable for implementing the described techniques and methods.
Various alterations and modifications can be made to the techniques and methods described herein, as would be apparent to one skilled in the relevant art.
By means of the presented methods operational risks can be reduced, managed, and controlled.
Any disclosed embodiment may be combined with one or several of the other embodiments shown and/or described. This is also possible for one or more features of the embodiments.
Number | Date | Country | Kind |
---|---|---|---|
05112970.8 | Dec 2005 | EP | regional |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 11338025 | Jan 2006 | US |
Child | 12167947 | | US |