1. Technical Field
The present disclosure generally relates to assessing and managing risks, and more particularly to assessing and managing risks associated with computer service-related changes.
2. Discussion of Related Art
Clients may delegate the management of their Information Technology (IT) services and infrastructure to a service provider that specializes in the client's business. The client expects stability and high availability of its services at all times. However, over time, every client's infrastructure needs certain changes and upgrades in order to function effectively.
The Information Technology Infrastructure Library (ITIL) is a set of concepts and practices for Information Technology Services Management (ITSM), Information Technology (IT) development and IT operations. ITIL gives detailed descriptions of a number of important IT practices and provides comprehensive checklists, tasks and procedures that any IT organization can tailor to its needs. ITIL may be adopted by service providers to ensure efficient, prompt and accurate service management through standard processes.
Within ITIL, Change Management (CM) addresses the implementation of changes, required externally by the client or internally by the service provider, to ensure proper and continued functioning of the client's infrastructure. However, when an implemented change fails, the service provider incurs a significant cost to re-implement the change and to manage the impact of the failure. An effective change management process can help to ensure that risks associated with changes are assessed in a systematic fashion and that high risk factors are mitigated early in the process to avoid change failures.
Depending on the client's infrastructure and requirements, service providers typically use a help desk model or client-specific model. In the help desk model, a help desk agent raises the change, and passes it to a technically competent Change Requester (CR) to complete the change documentation and assess its risk. In the client-specific model, the technically competent CR, through specialized knowledge of the client's account, raises the change and assesses its risk during documentation. Once the change is raised, it can be passed to a Change Manager (CM) for evaluation. The CM reviews the change, assesses its impact, discusses it with a Change Advisory Board if the risk is high, and schedules the changes upon approval.
In a risk categorization approach, the CR reviews the change at documentation time and selects the most applicable risk category from a well-defined list of available categories. The selection may be performed manually by considering all aspects of the change or through a more systematic risk assessment questionnaire. The questionnaire may be adopted from an IT risk assessment model, which calculates the risk of change by applying weights to each answer to yield a risk rating.
Assessment of the risk caused by a potential change relies heavily on one person's opinion, e.g., the CR. However, the CR may not understand the technical complexities of the change. Accordingly, incorrectly assessed change records may be created at the documentation phase, which may lead to incorrect risk categorization. Although the CM checks the integrity of the risk assessment during change evaluation, an incorrectly categorized change can skip the necessary level of scrutiny. For example, a high risk change incorrectly categorized as a low risk change could go through implementation without needing approval and result in an outage for a client. As another example, a low risk change incorrectly categorized as a high risk change could be unnecessarily pending in the approval queue, even though the client urgently needs the change.
Further, evidence suggests that manual risk assessment of changes is associated with increased failure rates when these changes are implemented. Moreover, because manual risk assessment is not a systematic approach, there is no guarantee that the CR will always categorize the same type of change under the same category unless the CR is very experienced with that type of change and takes the time to consider all factors associated with the change.
A more systematic approach is the questionnaire approach, in which the CR answers a static set of risk assessment questions to gather information about the change, and calculates the risk of the change using weights applied to the answers. For example, a higher weight could be associated with a higher risk and a lower weight could be associated with a lower risk. While such risk assessment questions are carefully designed by Subject Matter Experts to determine the probability and the impact of a failure due to a potential change, the static nature of these questions prevents them from being applicable to all kinds of requested changes. Further, risk mitigation is typically an afterthought, planned on demand and difficult to manage.
According to an exemplary embodiment of the invention, a method of determining risk associated with a proposed change includes entering change information related to the proposed change and a category of the change, filtering risk assessment questions based on the entered category of the proposed change, asking at least one of the filtered risk assessment questions to generate first answers, automatically inferring second answers of at least one of the risk assessment questions based on the entered change information and historical information, and determining the risk from weights assigned to each answer.
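By way of non-limiting illustration only, a minimal Python sketch of this flow is given below; the question catalog, categories, answer weights, and the infer_answer() heuristic are assumptions introduced for the example and are not part of the disclosure.

```python
# Illustrative question catalog: each question lists the change categories it applies to
# and whether it is asked of the user or inferred automatically.
RISK_QUESTIONS = [
    {"id": "experts", "categories": {"os_upgrade", "db_upgrade"},
     "text": "How many experts are required?", "ask_user": True},
    {"id": "backout", "categories": {"os_upgrade", "db_upgrade", "network"},
     "text": "Is there a back-out plan?", "ask_user": False},
]

# Risk weight assigned to each possible answer of each question (assumed values).
ANSWER_WEIGHTS = {
    ("experts", "single"): 1, ("experts", "multiple"): 3,
    ("backout", "yes"): 0, ("backout", "no"): 4,
}

def infer_answer(question, change_info, history):
    """Infer an answer from the entered change information and historical data (stubbed)."""
    if question["id"] == "backout":
        return "yes" if change_info.get("backout_plan") else "no"
    return None

def determine_risk(change_info, category, user_answers, history=None):
    # Filter the question catalog by the entered change category.
    filtered = [q for q in RISK_QUESTIONS if category in q["categories"]]
    answers = {}
    for q in filtered:
        if q["ask_user"]:
            answers[q["id"]] = user_answers[q["id"]]                  # first answers
        else:
            answers[q["id"]] = infer_answer(q, change_info, history)  # second answers
    # Determine the risk from the weights assigned to each answer.
    return sum(ANSWER_WEIGHTS[(qid, ans)] for qid, ans in answers.items())

print(determine_risk({"backout_plan": False}, "os_upgrade", {"experts": "multiple"}))  # 7
```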
According to an exemplary embodiment of the invention, a method of mitigating a risk associated with a proposed change includes determining at least one high risk factor that is associated with a determined risk of a proposed change, filtering mitigation risk questions based on the at least one high risk factor, asking the filtered mitigation risk questions to generate mitigation answers, and determining a reduced risk from the mitigation answers and a change context of the proposed change.
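A corresponding non-limiting sketch of the mitigation step, under the assumption that each high risk factor maps to one mitigation question and contributes a known amount to the risk via the change context, might look as follows.

```python
# Mitigation questions keyed by the high risk factor they are meant to remedy (assumed).
MITIGATION_QUESTIONS = {
    "missing_backout_plan": "Can a back-out plan be added to the change record?",
    "high_urgency": "Can the implementation be scheduled less urgently?",
}

def mitigate(high_risk_factors, ask, change_context, initial_risk):
    """ask(question) returns the change requester's answer, e.g. 'yes' or 'no'."""
    reduced_risk = initial_risk
    remaining = []
    for factor in high_risk_factors:
        question = MITIGATION_QUESTIONS.get(factor)
        if question and ask(question) == "yes":
            # Remove the factor's assumed contribution, taken from the change context.
            reduced_risk -= change_context.get(factor, 1)
        else:
            remaining.append(factor)
    return max(reduced_risk, 1), remaining

final_risk, open_factors = mitigate(
    ["missing_backout_plan", "high_urgency"],
    ask=lambda q: "yes",
    change_context={"missing_backout_plan": 2, "high_urgency": 1},
    initial_risk=5)
print(final_risk, open_factors)  # 2 []
```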
According to an exemplary embodiment of the invention, an apparatus for assessing and mitigating risk of a proposed change includes a memory storing a computer program to assess and mitigate risk, risk assessment questions, risk mitigation questions, and historical information associated with changes, and a processor configured to execute the computer program. The computer program is configured to prompt entry of change information related to the proposed change and a category of the change, filter the risk assessment questions based on the entered category, prompt for first answers to at least one of the filtered risk assessment questions, infer second answers to at least one of the filtered risk assessment questions based on the change information and historical information, determine a risk based on the first and second answers, and mitigate the risk based on the risk mitigation questions.
According to an exemplary embodiment of the invention, a method of assessing and mitigating a risk of a proposed change includes entering change information related to the change as well as a category of the change, filtering risk assessment questions based on the entered category of the proposed change, determining an initial risk based on answers to the filtered questions, determining at least one high risk factor that is associated with the initial risk, filtering mitigation risk questions based on the at least one high risk factor, and re-determining the risk from mitigation answers to the filtered mitigation risk questions.
Exemplary embodiments of the disclosure can be understood in more detail from the following descriptions taken in conjunction with the accompanying drawings.
Exemplary embodiments of the disclosure relate to a real time risk assessment and mitigation engine, which can dynamically assess and manage risks based on the context of proposed changes. The context of a proposed change may refer to various aspects, attributes, and information surrounding a change, which may vary significantly from one change to another. For example, these aspects may be technical, environmental, communication, people, client-related, etc. In practice, a large amount of contextual information may be associated with each proposed change. This information can be used to increase the accuracy and reliability of risk assessment. Having a rich dynamically determined context for a proposed change enables risks to be caught early and mitigated.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The embodiment was chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.
A risk weight may be assigned to each possible answer to a risk assessment question. These risk weights may be initialized by a subject matter expert. Additionally or alternatively, the risk weights can be inferred from facts and historical information. For example, if changes that need to be implemented in a highly constrained timeframe historically result in a 30% failure rate and changes that need to be implemented by a multitude of experts result in a 40% failure rate, the changes could be assigned risk weights that are proportional to their corresponding historical failure rates.
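A hedged illustration of this weight initialization, assuming the weights are scaled onto a 1-5 range in proportion to the historical failure rates, is shown below.

```python
def weights_from_failure_rates(failure_rates, max_weight=5):
    """Map each answer's historical failure rate to a weight proportional to that rate."""
    highest = max(failure_rates.values())
    return {answer: round(max_weight * rate / highest, 1)
            for answer, rate in failure_rates.items()}

# 30% failure rate for constrained-timeframe changes, 40% for multi-expert changes.
rates = {"constrained_timeframe": 0.30, "multiple_experts": 0.40}
print(weights_from_failure_rates(rates))
# multiple_experts receives the maximum weight; constrained_timeframe is scaled proportionally.
```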
The risk weights may also be assigned by applying data mining or pattern matching algorithms to outputs generated by previous similar changes. For example, if upgrading an OS results in a log being produced, that log can be searched for patterns that lead a pattern matching algorithm to infer that the upgrade has failed.
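For instance, a simple pattern matching routine over an OS-upgrade log might look like the following; the failure patterns and the log format are assumptions for illustration.

```python
import re

# Assumed failure patterns that might appear in an OS-upgrade log.
FAILURE_PATTERNS = [
    re.compile(r"upgrade (failed|aborted)", re.IGNORECASE),
    re.compile(r"rolled back to previous version", re.IGNORECASE),
]

def upgrade_failed(log_text):
    """Return True if any known failure pattern appears in the upgrade log."""
    return any(pattern.search(log_text) for pattern in FAILURE_PATTERNS)

sample_log = ("2024-01-10 02:13 kernel package installed\n"
              "2024-01-10 02:20 Upgrade FAILED: unresolved dependency")
print(upgrade_failed(sample_log))  # True
```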
Further, the risk assessment questions need not be yes/no questions. For example, if one of the technical factors is “requires experts”, the risk assessment question could provide choices, such as “does the change require a single expert, a few experts, or a multitude of experts”, where a risk weight could be assigned to each answer choice. For example, if a change requires a single expert, it could be weighted lower than a change that requires multiple experts.
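Such a multiple-choice question could be represented, for example, by a data structure that maps each answer choice to its risk weight; the weight values below are assumptions.

```python
# The "requires experts" factor as a multiple-choice question with a weight per choice.
EXPERT_QUESTION = {
    "text": ("Does the change require a single expert, a few experts, "
             "or a multitude of experts?"),
    "choices": {"single expert": 1, "a few experts": 2, "a multitude of experts": 4},
}

def weight_for(question, answer):
    """Look up the risk weight assigned to the selected answer choice."""
    return question["choices"][answer]

print(weight_for(EXPERT_QUESTION, "single expert"))           # 1
print(weight_for(EXPERT_QUESTION, "a multitude of experts"))  # 4
```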
Examples of these inferences include inferring scheduling conflicts, inferring whether the change is a model change, inferring whether a particular change implementer implemented a similar change before, inferring whether a particular change implementer has the right skill set to implement the change, inferring dependencies on other changes, inferring the impact of the change on a shared infrastructure, etc. The scheduling conflicts can be inferred by accessing a change calendar that lists dates for scheduling a series of changes. Whether a change is a model change can be inferred by accessing a model change library. Whether a similar change has been implemented before by the same change implementer can be inferred by accessing a log or a database that archives the individuals that have made each change and the type of change. Whether a change implementer has the requisite skill to implement a change can be inferred by accessing a skill set table or library that lists the skills of change implementers and referring to a mapping between the skills and the proposed change. Whether the proposed change is dependent on other changes or is likely to impact a shared infrastructure could be inferred using a change management tool. However, embodiments of the invention are not limited to the above-described inferences, as various other inferences may be generated.
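Two of these inferences, the scheduling-conflict inference drawn from a change calendar and the skill-set inference drawn from a skill table, might be sketched as follows; the data layouts are assumptions.

```python
from datetime import date

# Assumed data sources: a change calendar, a skill-set table, and a mapping from
# change types to the skill they require.
change_calendar = {date(2024, 6, 1): ["network maintenance"], date(2024, 6, 8): []}
skill_table = {"alice": {"db_upgrade", "os_upgrade"}, "bob": {"network"}}
required_skill = {"os_upgrade": "os_upgrade", "db_upgrade": "db_upgrade"}

def has_scheduling_conflict(proposed_date):
    """Infer a conflict if another change is already scheduled on the proposed date."""
    return bool(change_calendar.get(proposed_date))

def implementer_has_skill(implementer, change_type):
    """Infer whether the implementer's skill set covers the proposed change."""
    return required_skill.get(change_type) in skill_table.get(implementer, set())

print(has_scheduling_conflict(date(2024, 6, 1)))   # True
print(implementer_has_skill("bob", "os_upgrade"))  # False
```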
Inferences 505 can then be drawn from user answers 504 to the resultant risk assessment questions 502, change facts 506 from a ticket database DB 507, failure rates 508 from a failure rates database 509, and account health information 510 from an account health DB 511. The change facts 506 may include, but are not limited to, an exception reason for the change, the urgency of the change, the priority of the change, and an indication of the presence of a back-out plan. The failure rates 508 may be derived from historical information of similar changes. For example, the failure rates DB 509 could store a 10% failure rate for database upgrades and a 20% failure rate for operating system upgrades. The failure rates may also be associated with the accounts that typically raise these changes, and an account's failure rates may be compared against those of all other accounts.
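As a non-limiting illustration, the inferences 505 might be assembled from these sources roughly as follows, with simple dictionaries standing in for the ticket DB 507, failure rates DB 509, and account health DB 511; all field names are assumptions.

```python
# Stand-ins for the ticket DB 507, failure rates DB 509, and account health DB 511.
ticket_db = {"CH-123": {"urgency": "high", "priority": 2, "backout_plan": False,
                        "exception_reason": None, "category": "os_upgrade"}}
failure_rates_db = {"db_upgrade": 0.10, "os_upgrade": 0.20}
account_health_db = {"ACME": {"health_score": 72, "missed_slas": 3}}

def draw_inferences(ticket_id, account, user_answers):
    """Combine user answers with change facts, failure rates, and account health."""
    facts = ticket_db[ticket_id]
    return {
        "user_answers": user_answers,
        "missing_backout_plan": not facts["backout_plan"],
        "high_urgency": facts["urgency"] == "high",
        "historical_failure_rate": failure_rates_db[facts["category"]],
        "account_health": account_health_db[account],
    }

print(draw_inferences("CH-123", "ACME", {"experts": "multiple"}))
```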
The engine generates a change context 512 from the inferences 505. Using the change context 512, the engine determines risk to produce a risk rating 513 and identifies high risk factors 514 that contribute to the risk rating 513. The engine may store the high risk factors 514. Examples of the high risk factors 514 could include “missing back-out plan” and “high urgency”. For example, as part of installing a new version of a program, a back-out plan can save the old version. This way, if the installation fails, the old version can be retrieved and re-installed. However, if such a back-out plan were missing, a failed installation would interrupt users, since no working version of the program would be available. The high risk factors 514 may be used by change requesters to familiarize themselves with the potential risks they should look out for, as well as by the change managers during evaluation meetings as a checklist to discuss high risk changes.
The engine attempts to mitigate the high risk factors 514 to reduce the previously determined risk. The engine can be set to mitigate only risks that are above a predefined threshold value. As discussed above, the risk may be presented on a scale of 1-5. As an example, the engine may be set to mitigate whenever the determined risk is a 4 or a 5. The risk engine may refine a set of mitigation questions based on the high risk factors 514 identified during risk assessment and define any necessary user actions. The mitigation questions may be stored in a mitigation question database. The mitigation questions are designed to seek the required information or action to remedy the issues indicated in the high risk factors 514. For example, for a missing back-out plan, the mitigation action could be to add a back-out plan, or to indicate that a back-out plan is not possible. This way, mitigation is done on the spot in real-time to reduce the discovered risks as much as possible. Depending on the change requester's answers to the mitigation questions and the actions they may take, the final risk rating 516 is determined. At the end of the mitigation routine, the (reduced) risk rating 516 is presented, along with any remaining high risk factors that could not be eliminated. Mitigating the risks identified during a documentation phase in this manner ensures that the change record is complete and has passed through several checks before it is presented to a Change Manager for evaluation. In addition, this process may ensure that changes reaching the evaluation phase are systematically assessed and correctly categorized.
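A sketch of this threshold-gated, on-the-spot mitigation loop is given below; the assumption that each resolved factor reduces the rating by one point is illustrative only.

```python
MITIGATION_THRESHOLD = 4  # mitigate only when the 1-5 risk rating is 4 or 5

mitigation_questions = {
    "missing back-out plan": "Add a back-out plan, or state why one is not possible.",
    "high urgency": "Confirm the business justification for the urgency.",
}

def mitigate_in_real_time(risk_rating, high_risk_factors, answer):
    """answer(prompt) returns True when the requester resolves the factor on the spot."""
    if risk_rating < MITIGATION_THRESHOLD:
        return risk_rating, high_risk_factors  # below threshold: no mitigation attempted
    remaining = []
    for factor in high_risk_factors:
        if answer(mitigation_questions.get(factor, factor)):
            risk_rating = max(risk_rating - 1, 1)  # assumed reduction per resolved factor
        else:
            remaining.append(factor)
    return risk_rating, remaining

print(mitigate_in_real_time(5, ["missing back-out plan"], answer=lambda prompt: True))
# (4, [])
```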
As discussed above, risk is determined based on a dynamic change context. Examples of account or process related issues that may affect the dynamic change context are lead time and change window. For example, a lead time is an account-specific policy that defines how much time an account allocates for change preparation based on a risk rating. The higher the risk, the more lead time is needed. For example, a change window may include account-specific work hours in which maintenance needs to be performed. In an alternate embodiment of the risk engine, the risk engine may propose actions to mitigate risk based on account policies. For example, the engine can be configured for account policies such as “If the risk is 5, mandate a lead time of 28 days and flag, and mitigate any implementation scheduled during work hours”.
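Such account policies could, for example, be encoded as simple rules evaluated against the assessed risk and the proposed schedule; the field names and the work-hours check below are assumptions.

```python
# Policies ordered from most to least restrictive; the first matching rule is applied.
ACCOUNT_POLICIES = [
    {"min_risk": 5, "lead_time_days": 28, "flag": True, "forbid_work_hours": True},
    {"min_risk": 4, "lead_time_days": 14, "flag": False, "forbid_work_hours": False},
]

def proposed_actions(risk_rating, scheduled_in_work_hours):
    """Return the mitigation actions mandated by the first policy the change triggers."""
    for policy in ACCOUNT_POLICIES:
        if risk_rating >= policy["min_risk"]:
            actions = ["mandate a lead time of %d days" % policy["lead_time_days"]]
            if policy["flag"]:
                actions.append("flag the change for review")
            if policy["forbid_work_hours"] and scheduled_in_work_hours:
                actions.append("reschedule implementation outside work hours")
            return actions
    return []

print(proposed_actions(5, scheduled_in_work_hours=True))
```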
Thus, in at least one embodiment of the invention, risk is assessed systematically without relying on any one person's opinion alone, dynamically by taking into account only the relevant pieces of information, and thoroughly by taking an entire context of a change into account, thereby increasing the assessment's accuracy. In addition, risk may be mitigated in real-time, since action items can be identified and offered to be taken to reduce the risk.
In an embodiment of the invention, the risk engine uses a set of one or more change context criteria to determine a set of risk assessment questions that are posed to determine a level of risk of a change at one or more steps of a design process. Some of the risk assessment questions may be determined from the change context criteria only while others of the questions may be determined from the change context criteria and human input.
Examples of the change context criteria include the type/category of the change, the number of users affected by the change, customer sites affected by the change, the end-user impact of the change, system elements affected by the change, service interruption requirements, number of resources required to implement change, resource competence, change window (e.g., how long will change run), change dependencies, change preparation efforts, change lead time (e.g., amount of time needed to prepare the change), change urgency, change priority, back-out plans, change execution environment (e.g., test, pre-test, production), change timeline (e.g., account work hours vs. maintenance hours), change impact on functionality, account change failure rates, company change failure rates, current account health, account health variation, missed SLAs, etc.
A graphical user interface may be provided to a change requester to enter change related information into a data structure associated with the change. The data structure may be referred to as a change ticket and may be stored in the ticket DB 507 described above.
The risk assessment questions are determined based on the category or type of the change, and the engine determines a change context based on answers to the risk assessment questions, facts from the change ticket associated with the change, account health information, previous failure rates, etc. The engine also draws inferences from all of this data. The change context is then used to determine a risk by assigning a weight and a risk rating to each element of the change context, such that each element is either a probability element or an impact element. The combined result of all probabilities and impacts may be looked up against an Impact×Probability Matrix used in risk assessment. Table 1 below is an example of the Matrix.
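For example, the lookup against such a matrix might be sketched as follows; the 5×5 values shown are a generic illustration and not the disclosure's Table 1.

```python
# Rows are impact (1-5), columns are probability (1-5); cell values are risk ratings.
IMPACT_PROBABILITY_MATRIX = [
    [1, 1, 2, 2, 3],
    [1, 2, 2, 3, 3],
    [2, 2, 3, 3, 4],
    [2, 3, 3, 4, 5],
    [3, 3, 4, 5, 5],
]

def risk_rating(impact, probability):
    """Look up the 1-5 risk rating for a combined impact and probability."""
    return IMPACT_PROBABILITY_MATRIX[impact - 1][probability - 1]

print(risk_rating(impact=4, probability=5))  # 5
```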
The risk determination process may yield a risk rating along with high risk factors that contribute to the risk. If the risk rating is high, the engine may use the high risk factors to automatically determine a list of mitigation questions and associated actions to reduce the risk. If the risk is high and a user opts for mitigation, the user can answer the mitigation questions and potentially take actions to reduce the risk. A final risk rating may then be generated and any remaining high risk factors can be presented.
Examples of the risk assessment questions that may be asked of a user include the following: How many users (including the account and their clients) would be impacted in the case of a change failure?; Does this change affect a local, multi-region, or global service?; Would the failure of this change impact a critical service for the customer?; Would the failure of this change result in end-user calls to the Help Desk?; Would the execution of this change require a service interruption either during implementation or back-out?; How many resources are required to implement the change?; Does the resource need any training before the change can be implemented?; Is there enough time allocated in the change window to cover a potential back-out?; Does the change have any dependencies, or is it completely independent of other changes?; What is the preparation effort required for this change?; etc.
Examples of risk assessment questions whose answers may be automatically determined by the engine include the following: Does this change violate the lead time and therefore cause an exception?; What is the urgency of the change?; What is the change priority?; Is there a back-out plan?; Has a similar change been implemented previously by this particular resolver group?; Is this change going to be executed in a test or pre-production test environment, or the actual production environment?; Will the change be executed during a work hour change window, or outside work hours?; Is the change introducing new functionality or hardware?; Is the change modifying existing functionality or hardware?; Is the change introducing a new release of existing software?; What is the overall health score for the current month?; Is this a chronic/prechronic account or neither?; Is there a variation of the overall health score since last month?; What is the total number of missed SLAs?; What is the failure rate for the account for a change ticket with the same classification?; What is the failure rate for Pan IOT (mean of Pan IOT?) for the same classification?; Is the lead time less than the average lead time for a change ticket of the same classification?; Is the determined risk rating less than the average risk rating for a change ticket of a same classification?; Is the planned change duration less than the average change duration for a change ticket of a same classification?; etc.
The computer system referred to generally as system 1000 may include, for example, a central processing unit (CPU) 1001, random access memory (RAM) 1004, a printer interface 1010, a display unit 1011, a local area network (LAN) data transmission controller 1005, a LAN interface 1006, a network controller 1003, an internal bus 1002, and one or more input devices 1009, for example, a keyboard, a mouse, etc. For example, the display unit 1011 may display the above-described risk assessment questions, determined risks, mitigation questions, the user interface for entering the change ticket information, etc. As shown, the system 1000 may be connected to a data storage device, for example, a hard disk 1008, via a link 1007. For example, the hard disk 1008 may store each of the databases described above.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or computer program product. Accordingly, aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present disclosure are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.