Technology is disclosed herein for implementing a major problem review process. In one aspect, incidents are recorded in a common data schema and the data is then used to facilitate an IT organization's major problem review process. Reporting is provided on the data in a format that allows trend information to be readily compiled. The format allows tracking both a primary root cause and an exacerbating cause of an incident or problem. Incidents can be recorded in relation to a group of elements having a common characteristic, which allows outages to be categorized on any number of bases, including, for example, a service-by-service basis. The technology includes facilities for tracking downtime minutes by server, service, and database. Still further, the technology allows for recording and tracking action items related to major problems, and for tracking actions and recommendations in relation to people, process, and technology separately.
At step 110, the IT enterprise is organized into logical categories. In one embodiment, this may include defining any number of categories, groups, or commonalities amongst hardware, applications and services within the organization. The grouping may be performed in any manner. One example of such a grouping is disclosed in U.S. patent application Ser. No. 11/343,980 entitled “Creating and Using Applicable Information Technology Service Maps,” Inventors Carroll W. Moon, Neal R. Myerson and Susan K. Pallini filed Jan. 31, 2006, assigned to the assignee of the instant application and fully incorporated herein by reference. In the service map categorization, common elements among various distributed systems within an organization are determined and used to track changes and releases based on the common elements, rather than, for example, physical systems individually. In the aforementioned application Ser. No. 11/343,980, a service map defines a taxonomy of the constituent components of the information technology infrastructure at a specified level of detail. The technology service map is used to simplify information technology infrastructure management. The service map maps the corresponding information technology infrastructure with a specified level of detail and represents dependencies between services and streams included in the technology service map. Although the service map of application Ser. No. 11/343,980 is one method of organizing an IT infrastructure, other categorical relationships may be utilized.
At step 120, relationships between elements in the taxonomy are defined. Step 120 defines the relationships between the various elements in the taxonomy so that changes to one or more categories are reflected in other categories or in elements residing in subcategories. For example, one might define a common group comprising services, with the messaging service as one member of that group. Another group may be defined by exchange mail servers, and still other groups defined by the particular types of hardware configurations within the enterprise. At step 120, one can define the mail servers as a subcategory of the messaging service, and define which hardware configurations are associated with exchange servers.
In accordance with the technology discussed herein, problems entered for review may be recorded in relation to one or more of the groups within the taxonomy, rather than to individual machines or elements within the taxonomy. Hence, a major problem record entered in accordance with the technology discussed herein may relate the problem to all elements sharing a common characteristic (hardware, application, etc.) with the element which experiences the problem. For example, if a mail server goes down, a major problem review record will include an identifier for the server and one or more groups in the taxonomy (i.e. which applications are on the server, where the server is located, etc.) to which the problem is related, allowing trending data to be derived. Reports may then be provided indicating what percentage of major problems experienced related to email. Similarly, if one were to define a category of a hardware model of a particular server type, problems with that particular hardware model might affect one or more categories of applications or services provided by the hardware model.
In accordance with the foregoing, any incident in the IT enterprise is tracked by first opening a major problem review (MPR) record at step 130. At step 130, the record may include data on the relationship between various groups in the taxonomy. As discussed below, this MPR record is stored in a common schema which can be used to drive the problem review process. The MPR record is the first stage of a review and is generally initiated by an IT administrator. Additional elements in the record may include an indication of whether the root cause is known for the incident. At step 140, when entering the record (or at a later time), a determination is made as to whether the root cause of an incident is known. If so, then a flag in the record is set at step 145 indicating that the problem record is now a known error record, and may be viewed and reported on separately in the view and reporting aspects of the present technology.
Major problem review at steps 150-180 may occur using the technology described herein.
At step 150, the MPR record may be output to a view or report to drive a major problem review process. The major problem review process may include investigation and diagnosis of incidents where there are no known errors or known problems. In this case, the incident must be further investigated and action items for the incident need to be tracked.
As part of the major problem review process, one or more action items may be identified in the MPR record. At step 155, during the review process, a determination is made as to whether any action items currently exist for the incident record. One such action item may be to identify the root cause (step 140a) during the review process. Other action items may be generated where service was restored as quickly as possible, for example by rebooting the system, without determining the root cause. Once a solution is found, the issue is resolved by restoring services to normal operation. Once an action item is complete, if there are no further items at step 160, it may be determined that it is acceptable to close the record at step 170 and the record may be closed at step 180.
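As an illustrative sketch only (not part of the disclosed schema), the close-record determination at steps 160-180 might be expressed as follows, assuming a simple dictionary-based MPR record with hypothetical field names:

```python
# Hypothetical record shape; an actual record would follow the Table 1 schema.
def can_close(record):
    """An MPR record may be closed only when every action item is complete."""
    return all(item["complete"] for item in record["action_items"])

mpr = {"action_items": [{"complete": True}, {"complete": False}]}
print(can_close(mpr))   # False: an action item remains open (step 160)
mpr["action_items"][1]["complete"] = True
print(can_close(mpr))   # True: the record may now be closed (steps 170-180)
```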
Data concerning incidents is entered into the database 450 as defined in Table 1 below. In one embodiment, the database 450 may comprise a Microsoft SharePoint server, but any type of database may be utilized. In accordance with the method of
Once data is entered into the entry interface as discussed above with respect to step 130, a view in the view interface 426, selectable by the administrators, provides a means to view the MPR record, as discussed above with respect to step 150. Various examples of view interfaces are illustrated below. One or more views in the view interface may be reviewed by a committee 470 in accordance with the major problem review process 450. The report interface 428 allows the IT administrators to generate reports and graphs based on the data provided in the major problem record entry interface 424. Examples of information culled from the report interface are listed below.
Each computing system in
Device 400 may also contain communications connection(s) 442 that allow the device to communicate with other devices. Communications connection(s) 442 is an example of communication media. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.
Device 400 may also have input device(s) 444 such as keyboard, mouse, pen, voice input device, touch input device, etc. Output device(s) 446 such as a display, speakers, printer, etc. may also be included. All these devices are well known in the art and need not be discussed at length here.
It should be recognized that one or more of devices 400 may also make up an IT environment, and multiple configurations of devices may exist within the organization. These configurations can be grouped and tracked in the organization, and various organizations may have different configurations. Each configuration and the manner of tracking it is customizable.
Table 1 lists the schema used with the technology described herein for identifying each major problem to be entered in the database 450. Table 1 includes a number of data items which are not shown in interface 502. However it will be understood that interface 502 may display all or a subset of the data items. In one embodiment, a subset of data items is required to complete the entry of a MPR record into system 420.
Table 1 lists each of the elements in the schema, a description of the element, a type of element data which is recorded, and any given options for the data item. Many of the elements in the table are self-explanatory. It should be recognized that the fields listed in Table 1 are exemplary and in various embodiments, not all fields may be used or additional fields may be used.
While many of the fields are self-explanatory, further discussion of other fields follows.
The “unique identifier” field associates the unique identifier with each change request entry. The unique identifier may be auto generated upon entry of an item into the user interface.
The “description” item allows users to enter descriptive text regarding a brief description of the incident or problem.
The “# service downtime minutes”, “# server downtime minutes” and “# database downtime minutes” allow separate tracking of three important but distinct metrics. The tracking of these items separately in the schema allows a report to be generated to illustrate the true effect of a major problem on each of these separate data points. To illustrate the difference between server, service and database downtime, consider a case of a single mailbox server machine running, for example, Microsoft Exchange 2003, and having five databases. If the physical server is down for three hours, this would constitute three hours of server downtime, three hours of email service downtime, and fifteen hours (three hours multiplied by five databases) of database downtime. Consider further that the mailbox server is paired with another mailbox server in a two node, fail over embodiment. If one of the two servers fails for three hours, and five minutes are required for the second server to take over, this would constitute three hours of server downtime, five minutes of fail over downtime (service downtime), and twenty-five minutes of database downtime (five minutes times five databases). Note that other metrics may be utilized. For example, another metric could be ‘user impact’ which is tracked in amounts of user downtime minutes. In this alternative, the value could be calculated as the number of users impacted multiplied by the number of service downtime minutes.
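The arithmetic above can be sketched in a few lines of Python. This is an illustrative sketch only, with hypothetical parameter names; the actual fields are those given in Table 1:

```python
def downtime_minutes(outage_minutes, databases, failover_minutes=None):
    """Return (server, service, database) downtime minutes for one outage.

    With a failover pair, the service is only down while the second node
    takes over; database downtime scales with the number of databases.
    """
    server = outage_minutes
    service = outage_minutes if failover_minutes is None else failover_minutes
    database = service * databases
    return server, service, database

# Standalone mailbox server, 3-hour outage, 5 databases:
print(downtime_minutes(180, 5))                       # (180, 180, 900)
# Two-node failover pair, 5-minute failover:
print(downtime_minutes(180, 5, failover_minutes=5))   # (180, 5, 25)
# Alternative 'user impact' metric: users impacted * service downtime minutes.
print(2000 * downtime_minutes(180, 5, failover_minutes=5)[1])  # 10000
```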
An advantage of the present technology is that each of these elements may be tracked separately and reported to the IT managers. Each metric measures a different effect on the business and end users of the services, as well as how well the IT organization is performing.
The “What Service Took the Availability Hit” field is an example of a field which tracks the event by a group of common elements that a major problem may affect. Hence, “services” are one group which may be defined in accordance with step 110 for a particular IT organization. In other embodiments of the technology, groups may include services, application streams, hardware categories, and a “forest” or “domain” category. The “domain” may include a group of clients and servers under the control of one security database. As indicated in Table 1, each of these elements may be identified by a field in the schema for tracking change and release elements. In various embodiments, one, two or all three of the service/stream/domain groups may be entered to define the relationship of any change and release record. Each of these elements may be defined in accordance with step 110 or in accordance with the teachings of U.S. patent application Ser. No. 11/343,980. The “What Service Took the Availability Hit” field identifies the service (messaging, etc.) which was affected by the incident.
The “forest-domain” and “data center” impacted fields allow further identification of the two additional groups of elements affected. Likewise, the “initiating technical service component” field tracks whether an application stream, hardware stream, or setting stream caused the incident. In various embodiments, the incident may be tracked by service, forest/domain and datacenter together, or any one or more of the data items may be required.
In a further unique aspect of the present technology, both a primary and an exacerbating or secondary root cause are tracked by the technology. Hence, fields are provided to track primary and secondary or “exacerbating” root causes. Additionally, root causes are defined in terms of people, processes and technology. Processes include capacity & performance issues, change & release issues, configuration issues, incident (& monitoring) issues, service level management (SLA) issues, and third party issues. Technology issues can include bugs, capacity, other service dependencies and hardware failures. This separate tracking of both primary and secondary root causes allows the major problem review process to drill down into each root cause to determine further granularity of the root cause issue. Consider a case where a server in a remote location managed by a remote IT administrator goes down and is down for two hours. A primary root cause of the failure may be a bug in the software on the server, but the server could have been rebooted in 15 minutes had the administrator been on site with the server. In this case the secondary cause might be a process related cause in that the administrator was not required to be on site by the service level agreement at that facility. If the administrator was not trained to reboot the server, this would present a people issue, requiring further training of the individual.
In conjunction with the people, process and technology tracking of root and secondary causes, a “people recommendations” field, “process recommendations” field and “technology recommendations” field may be used by the management review process to force problem reviewers to think through whether recommendations should be made in each of the respective root cause areas.
As noted above, in one embodiment, certain fields are required to be entered before a MPR record can be reviewed and/or closed. In one embodiment, the required fields include a Case ID, description, Case Owner, incident begin time, number of users impacted, number of server downtime minutes, number of service downtime minutes, number of database downtime minutes, incident duration, service (or group) impacted, forest/domain impacted, datacenter impacted, initiating technical service component, and a detailed timeline. When the root cause is identified, additional required fields include the primary root cause, the secondary root cause, the percentage of downtime minutes due to the secondary root cause, process recommendations, technology recommendations, action items and MPR record status.
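A minimal sketch of this required-field check follows, using illustrative (hypothetical) field names rather than the exact Table 1 schema names, and a reduced field set for brevity:

```python
# Hypothetical field names standing in for the Table 1 schema.
BASE_REQUIRED = {"case_id", "description", "case_owner", "incident_begin_time",
                 "incident_duration", "detailed_timeline"}
ROOT_CAUSE_REQUIRED = {"primary_root_cause", "secondary_root_cause",
                       "process_recommendations", "technology_recommendations"}

def missing_fields(record):
    """Return the required fields not yet filled in for an MPR record."""
    required = set(BASE_REQUIRED)
    # Once the root cause is identified, additional fields become required.
    if record.get("root_cause_identified"):
        required |= ROOT_CAUSE_REQUIRED
    return sorted(f for f in required if not record.get(f))

record = {"case_id": "MPR-0001", "description": "Mail outage",
          "root_cause_identified": True}
print(missing_fields(record))  # lists case_owner, primary_root_cause, etc.
```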
Different types of views, including calendar and list views, may be provided.
A calendar view such as that shown in
The calendar view “messaging-major outage calendar” 610 is a filtered view listing the major outages by case I.D. on the particular date they occurred, in this example, for the month of July 2006. This is useful for determining whether a number of occurrences happened on a particular day. It will be understood that each of the items in the calendar view shown in
The “Average # users impacted” is the sum of users impacted for the time period divided by the count of MPRs for the period.
The “Average Incident Duration (minutes)” tracks outage duration and is the sum of incident duration for the time period divided by the count of MPRs for the period. The “Mean Time Between Failures (days)” calculates the differences, in days, between the date/time opened of successive records in the time period and averages those differences. The MTBF and the duration are key metrics of IT service availability.
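As an illustrative sketch of the MTBF calculation, assuming record open date/times are available as Python datetimes:

```python
from datetime import datetime

def mtbf_days(open_times):
    """Mean time between failures: the average gap, in days, between
    consecutive MPR open date/times in the period."""
    times = sorted(open_times)
    gaps = [(b - a).total_seconds() / 86400.0 for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

# Three MPRs opened on July 1, 11, and 31: gaps of 10 and 20 days.
opens = [datetime(2006, 7, 1), datetime(2006, 7, 11), datetime(2006, 7, 31)]
print(mtbf_days(opens))  # 15.0
```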
The “% with root cause identified” is a count of records with root cause identified checked for period divided by a count of MPRs in the period. This metric is indicative of the effectiveness of the IT department's problem control process.
The “% with MPR closed as of scorecard publication” is a count of records with MPR closed for period divided by count of MPRs per period. This metric is indicative of problem management effectiveness.
The “% recurring issue” metric is a count of records with recurring issues checked for period divided by count for period. This metric is indicative of the effectiveness of the error control process.
The “service downtime minutes,” “server downtime minutes,” and “DB downtime minutes” are sums of the respective downtime minutes for the period.
In a unique aspect of the technology, service, server and database downtime is reported relative to the root cause and exacerbating root cause of the problem, and the relative percentages of the root and exacerbating causes.
The “service downtime minutes due to people/process” is the total and percentage of service downtime minutes for the period, which is indicative of needed improvements for people or processes. This metric results from calculating, for each case, the service downtime due to the primary root cause (service downtime*(1−% due to exacerbating)) and the downtime due to the exacerbating root cause (service downtime*% due to exacerbating). The sum is the total of those columns where the primary and/or exacerbating cause is attributable to people/process causes. This information is derived using the primary root cause and exacerbating cause drop down data from the records.
The “server downtime minutes due to people/process” and “DB downtime minutes due to people/process” are calculated in a similar manner.
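The attribution arithmetic described above can be sketched as follows; the dictionary keys are hypothetical stand-ins for the schema fields, and the same function serves service, server, or database minutes by varying the minutes key:

```python
PEOPLE_PROCESS = {"people", "process"}

def people_process_downtime(cases, minutes_key="service_downtime"):
    """Sum the downtime minutes attributable to people/process causes,
    splitting each case between its primary and exacerbating causes."""
    total = 0.0
    for c in cases:
        # Primary share: downtime * (1 - % due to exacerbating cause).
        primary_share = c[minutes_key] * (1 - c["pct_exacerbating"])
        # Exacerbating share: downtime * % due to exacerbating cause.
        exacerbating_share = c[minutes_key] * c["pct_exacerbating"]
        if c["primary_root_cause"] in PEOPLE_PROCESS:
            total += primary_share
        if c["exacerbating_root_cause"] in PEOPLE_PROCESS:
            total += exacerbating_share
    return total

# One case: 120 minutes down, 25% due to an exacerbating process cause.
cases = [{"service_downtime": 120, "pct_exacerbating": 0.25,
          "primary_root_cause": "technology", "exacerbating_root_cause": "process"}]
print(people_process_downtime(cases))  # 30.0 (120 * 0.25)
```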
The “Service downtime minutes due to process-other groups” shows the total of those columns where the primary and/or secondary cause is attributable to process-other groups (using primary root cause and exacerbating cause drop down data). This is calculated by determining, for each case, the service downtime due to the primary cause (service downtime*(1−% due to exacerbating)) and the downtime due to the exacerbating cause (service downtime*% due to exacerbating). This metric is indicative of a need for better service level agreements and underpinning contracts.
The “Server downtime minutes due to Process-Other Groups” and “DB downtime minutes due to Process-Other Groups” are calculated in a similar manner.
Similarly, the scorecard provides metrics of “service downtime minutes due to Technology and/or Unknown”, “Server downtime minutes due to Technology and/or Unknown”, and “DB downtime minutes due to Technology and/or Unknown”. These are indicative of the need for technology improvements and problem control improvements.
The “% Primary Root Cause=People/Process” is a metric of the percentage of primary root causes which are due to people or process issues. It is derived by taking the number of cases having a primary root cause of people/process divided by the number of MPRs for the period. The “% Primary and/or Exacerbating Root Cause=People/Process” is a metric of the percentage of primary or exacerbating root causes which are due to people or process issues. It is calculated by taking the number of MPRs with a primary root cause of people/process plus the number with an exacerbating root cause of people/process, divided by the sum of the number of MPRs and the count of MPRs where the secondary cause does not equal ‘n/a’. Both are indicative of needed people/process improvements.
The “% Primary Root Cause=Process-Other Groups” and “% Primary and/or Exacerbating Root Cause=Process-Other Groups” are calculated in a similar manner for the process and “other groups” causes. These reports are indicative of need for better service level agreements and underpinning contracts. Similarly, the “% Primary Root Cause=Technology or Unknown” and “% Primary and/or Exacerbating Root Cause=Technology or Unknown” are calculated in a similar manner for the technology and “unknown” causes and are indicative of needed technology improvements and problem control improvements.
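These percentage metrics might be computed as in the following sketch. Field names are hypothetical, and the denominator for the combined metric follows the description above (each MPR counts once, plus once more when it has a real, non-‘n/a’ exacerbating cause):

```python
PEOPLE_PROCESS = {"people", "process"}

def pct_primary(cases, causes=frozenset(PEOPLE_PROCESS)):
    """Percentage of MPRs whose primary root cause is in the given set."""
    hits = sum(1 for c in cases if c["primary_root_cause"] in causes)
    return 100.0 * hits / len(cases)

def pct_primary_or_exacerbating(cases, causes=frozenset(PEOPLE_PROCESS)):
    """Percentage of primary or exacerbating root causes in the given set."""
    num = sum(1 for c in cases if c["primary_root_cause"] in causes)
    num += sum(1 for c in cases if c["exacerbating_root_cause"] in causes)
    den = len(cases) + sum(1 for c in cases if c["exacerbating_root_cause"] != "n/a")
    return 100.0 * num / den

cases = [
    {"primary_root_cause": "process", "exacerbating_root_cause": "n/a"},
    {"primary_root_cause": "technology", "exacerbating_root_cause": "people"},
]
print(pct_primary(cases))                  # 50.0 (1 of 2 primaries)
print(pct_primary_or_exacerbating(cases))  # ~66.7 (2 hits over 3 cause slots)
```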
In addition to the metrics listed in the table of
An IT department will focus its resources on the largest percentages of cases that the department can actually impact. For example, these may include items like process capacity and performance issues; reducing their frequency increases the mean time between failures. Hence, the technology presented herein allows the best practices defined by ITIL® to be made practical, and automates the practices that ITIL® describes only vaguely. The service, server, and database down time graphs by primary and exacerbating root cause show the distribution of service, server, and database down time minutes in each primary and exacerbating root cause. For each graph, one calculates the service, server, or database down time for each case due to each primary cause and also due to each exacerbating root cause for each case. Then one sums the total of these columns where the primary and/or secondary cause is attributable to each of the service, server, or database causes. These views give a macro view of the primary and secondary root causes and their impacts on the service, server, or database. In contrast to the case count graph in
Each of the aforementioned tables and graphs can be utilized to show trends in IT management by comparing reports for different periods of time. For example, scorecards consisting of all elements of
The technology herein facilitates major problem review by providing IT organizations with a number of tools, including data reporting tools not heretofore known, to manage major problems. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.