The present invention relates generally to computer systems. More particularly, the present invention relates to monitoring, measurement and management of Recovery Time Objectives (RTO) of enterprise IT business continuity or disaster recovery solutions.
In the increasingly competitive times of today, implementing systems and methods for maintaining business continuity is no longer an optional requirement for business enterprises, especially for enterprises that use or are fully or partially dependent on Information Technology (IT). Such enterprises can be broadly termed as IT enterprises. Since the efficient working of most of such IT enterprises depends on their business continuity or disaster recovery management infrastructure, implementing a sound enterprise IT business continuity or disaster recovery solution has almost become a mandatory requirement. Costs incurred during business downtime are usually significant, thereby dictating a need for implementing a business continuity solution. The design and choice of the business continuity or disaster recovery solution is primarily driven by a Recovery Time Objective (RTO) that is acceptable to the IT enterprise.
RTO for an IT enterprise business continuity or disaster recovery solution is a time measure that indicates how soon data and the related application must be available to the enterprise after an outage has occurred. For example, an enterprise may determine, based on a business impact analysis, that it cannot have the production computer down for more than two hours. The RTO for the enterprise would therefore be two hours: if the production computer fails, it must be made available again within two hours.
Enterprise data may be generally classified into four categories. (1) Critical “Tier One” data, where loss of data has an immediate impact on the enterprise's revenue or functioning; (2) Vital “Tier Two” data, where loss of data has a significant impact on the enterprise's revenue or functioning; (3) Essential “Tier Three” data, where loss of data has some impact on the enterprise's revenue or functioning; and (4) Non-Essential “Tier Four” data, where loss of data has minimal impact on the enterprise's revenue or functioning. Therefore, the challenge faced by most enterprises lies in identifying the criticality of their IT enterprise application data and impact of loss of the same. One way to achieve this goal is to recognize an acceptable amount of time the application may remain unavailable. Hence, an RTO measure is used to characterize the maximum amount of time the enterprise IT application may be unavailable.
A conventional business continuity or disaster recovery solution has three main components: an enterprise application that must be available continuously, a data protection scheme that makes a copy of the application data, and the entire supporting infrastructure, which comprises computer servers, storage arrays and local and remote networks. Conventional business continuity or disaster recovery solutions based on an RTO measure may not integrate with all three components. Some of the currently available business continuity or disaster recovery solutions work with a static value of RTO and do not provide for a real time measurement of RTO based on real time inputs obtained from all three components. Hence, there is a need for a business continuity or disaster recovery solution that is based on real time measurement and management of RTO using real time inputs from these components.
Some of the available methods to manage RTO in a business continuity or disaster recovery solution are manual, and usually entail an operator monitoring the proper functioning of each of the three components and taking appropriate corrective actions, if required. The constant manual monitoring and performing of corrective actions maintains business continuity of the enterprise application that requires being available continuously. Such corrective actions have to be customized for every type of enterprise application, data protection scheme and supporting infrastructure components used for the business continuity or disaster recovery solution. Therefore, these actions require that the operator possesses an in-depth technical knowledge of all the components in the business continuity or disaster recovery solution. Such dependence on manual intervention may lead to erroneous operation of the solution and added costs for the business enterprise that implements the solution.
Therefore, there is a need for an automated business continuity or disaster recovery solution in which RTO is continuously managed to a user-desired or configured value.
The present invention provides automated systems and methods for monitoring, measurement and management of Recovery Time Objectives (RTO) of enterprise IT business continuity or disaster recovery solutions.
It is an objective of the present invention to provide systems and methods that monitor the RTO of enterprise IT business continuity or disaster recovery solutions, in real time.
It is another objective of the present invention to provide systems and methods that manage the enterprise IT business continuity or disaster recovery solutions such that the desired RTO value is achieved.
It is yet another objective of the present invention to provide systems and methods for monitoring and managing the RTO of enterprise IT business continuity or disaster recovery solutions that integrate with the various components of the business continuity or disaster recovery solution.
It is still another objective of the present invention to provide systems and methods for managing the RTO of enterprise IT business continuity or disaster recovery solutions that enable a user to input or configure a desired RTO value for the business continuity or disaster recovery solution.
It is still another objective of the present invention to provide systems and methods for managing the RTO of enterprise IT business continuity or disaster recovery solutions that raise alerts and alarms when the RTO deviates from its desired or configured value.
It is yet another objective of the present invention to provide systems and methods for managing the RTO of enterprise IT business continuity or disaster recovery solutions that take corrective actions to maintain the RTO at its desired or configured value.
It is still another objective of the present invention to provide systems and methods for managing the RTO of enterprise IT business continuity or disaster recovery solutions that specify policies which further decide actions to be performed when the RTO value deviates from its desired or configured value.
It is another objective of the present invention to provide systems and methods for managing the RTO of enterprise IT business continuity or disaster recovery solutions that may be executed on heterogeneous computer servers, operating systems, hardware and software environments.
It is yet another objective of the present invention to provide systems and methods for managing the RTO of enterprise IT business continuity or disaster recovery solutions that interface with various data protection techniques used by the business continuity or disaster recovery solution.
It is still another objective of the present invention to provide systems and methods for managing the RTO of enterprise IT business continuity or disaster recovery solutions that may be implemented in software or a combination of software and hardware.
It is another objective of the present invention to provide systems and methods for managing the RTO of enterprise IT business continuity or disaster recovery solutions that may be implemented in distributed or centralized environments.
To meet the above-mentioned and other objectives, the present invention provides a system for management of Recovery Time Objective (RTO) of a business continuity or disaster recovery solution. The system comprises a management server logically coupled with at least a first computer, at least a second computer, and a network coupling the first and the second computers. The first and second computers host at least one continuously available application, at least one data protection scheme for replicating the application data and at least one operating system; the application data being periodically replicated from the first computer to at least the second computer. The system manages RTO by inputting an RTO value for the solution, calculating a real time RTO value for the solution, and making the real time RTO value less than or equal to the input RTO value.
In an embodiment of the present invention, the first and the second computers are coupled to one or more storage units. A plurality of agents of the management server are deployed on at least the first computer, at least the second computer, the network coupling the first and the second computers, and the one or more storage units. The management server periodically polls at least one of its agents integrated with at least the application, the data protection scheme and the operating system running on the first computer; the application, the data protection scheme and the operating system running on the second computer; and the network, for calculating the real time RTO value. In an embodiment of the present invention, the management server periodically polls at least one of its agents integrated with at least one storage unit, for calculating the real time RTO value. The data protection scheme comprises data replication techniques based on one or more of tape backup, disk backup, block level replication, file level replication, point in time replication and archive logs. The system of the present invention is configurable on heterogeneous platforms comprising heterogeneous servers and operating systems.
The present invention also provides a method for management of Recovery Time Objective (RTO) of a business continuity or disaster recovery solution. The method comprises the steps of inputting an RTO value for the solution, calculating a real time RTO value for the solution, and managing the real time RTO value to make it less than or equal to the input RTO value. The method further comprises the step of continuously repeating the steps of calculating a real time RTO value for the solution and managing the real time RTO value to make it less than or equal to the input or configured RTO value.
In an embodiment of the present invention, the step of inputting an RTO value for the solution comprises the steps of prompting a user to input a desired RTO value for the solution, computing time and periodic setting values for the solution, based on the desired RTO value, and configuring the solution, based on the computed time and periodic setting values.
In an embodiment of the present invention, the step of calculating a real time RTO value for the solution comprises the steps of obtaining current state of an application of the solution, obtaining current state of a data protection scheme replicating the application data, obtaining current state of a network supporting the solution, obtaining current state of an operating system supporting the solution and calculating a real time RTO value using at least one of the current obtained values of each of the state of the application, the data protection scheme, the network and the operating system.
In an embodiment of the present invention, the step of managing the real time RTO value to make it less than or equal to the input RTO value comprises the steps of raising an alarm if the computed RTO value is greater than the input RTO value, and performing at least one corrective action based on at least one predefined corrective policy. In another embodiment of the present invention, the step of managing the real time RTO value to make it less than or equal to the input RTO value comprises the steps of raising an alarm if the computed RTO value is greater than the input RTO value, prompting the user to define at least one corrective policy, and performing at least one corrective action based on the user defined corrective policy.
In an embodiment of the present invention, the step of managing the real time RTO value to make it less than or equal to the input or configured RTO value comprises the step of repeating the steps of calculating a real time RTO value for the solution if the computed RTO value is less than or equal to the input RTO value.
In an embodiment of the present invention, the step of computing time and periodic setting values for the solution, based on the desired RTO value, comprises one or more of the steps of computing a periodic interval for performing an operation to ensure data consistency of replicated data on the second computer, computing a periodic interval for performing an operation to apply replicated data to the application running on the second computer, computing a readiness level of the second computer, computing a readiness level of the one or more storage units, and computing a readiness level of the network.
The method for management of Recovery Time Objective (RTO) of a business continuity or disaster recovery solution described in the present invention is operable on heterogeneous platforms comprising heterogeneous servers and operating systems.
The present invention also provides a computer program product comprising a computer usable medium having a computer readable program code embodied therein for management of Recovery Time Objective (RTO) of a business continuity or disaster recovery solution. The computer program product comprises program instruction means for inputting an RTO value for the solution, program instruction means for calculating a real time RTO value for the solution, and program instruction means for managing the real time RTO value to make it less than or equal to the input RTO value. In an embodiment of the present invention, the computer program product further comprises program instruction means for continuously repeating the steps of calculating a real time RTO value for the solution and managing the real time RTO value to make it less than or equal to the input RTO value.
In an embodiment of the present invention, the program instruction means for inputting an RTO value for the solution comprise program instruction means for prompting a user to input a desired RTO value for the solution, program instruction means for computing time and periodic setting values for the solution, based on the desired RTO value, and program instruction means for configuring the solution, based on the computed time and periodic setting values.
In an embodiment of the present invention, the program instruction means for calculating a real time RTO value for the solution comprise program instruction means for obtaining current state of an application of the solution, program instruction means for obtaining current state of a data protection scheme replicating the application data, program instruction means for obtaining current state of a network supporting the solution, program instruction means for obtaining current state of an operating system supporting the solution, and program instruction means for calculating a real time RTO value using at least one of the current obtained values of each of the state of the application, the data protection scheme, the network and the operating system.
In an embodiment of the present invention, the program instruction means for managing the real time RTO value to make it less than or equal to the input RTO value comprise program instruction means for raising an alarm if the computed RTO value is greater than the input RTO value, and program instruction means for performing at least one corrective action based on at least one predefined corrective policy. In another embodiment of the present invention, the program instruction means for managing the real time RTO value to make it less than or equal to the input RTO value comprise program instruction means for raising an alarm if the computed RTO value is greater than the input RTO value, program instruction means for prompting the user to define at least one corrective policy, and program instruction means for performing at least one corrective action based on the user defined corrective policy.
In an embodiment of the present invention, the program instruction means for managing the real time RTO value to make it less than or equal to the input RTO value comprise program instruction means for repeating the steps of calculating a real time RTO value for the solution, if the computed RTO value is less than or equal to the input RTO value.
In an embodiment of the present invention, the program instruction means for computing time and periodic setting values for the solution, based on the desired RTO value, comprise one or more of program instruction means for computing a periodic interval for performing an operation to ensure data consistency of replicated data on the second computer 106, program instruction means for computing a periodic interval for performing an operation to apply replicated data to the application running on the second computer 106, program instruction means for computing a readiness level of the one or more storage units, and program instruction means for computing a readiness level of the network.
The computer program product for management of Recovery Time Objective (RTO) of a business continuity or disaster recovery solution described in the present invention is operable on heterogeneous platforms comprising heterogeneous servers and operating systems.
The present invention is described by way of embodiments illustrated in the accompanying drawings wherein:
The present invention will now be discussed in the context of embodiments as illustrated in the accompanying drawings.
In accordance with an embodiment of the present invention, the first computer 104 and the second computer 106 are at geographically separate locations. The management server 102 is logically connected to the first computer 104, the second computer 106, the network 108, the first storage unit 110 and the second storage unit 112. In an embodiment of the present invention, the logical connection may be an IP network connection.
In various embodiments of the present invention, the first storage unit 110 and the second storage unit 112 are connected to the first computer 104 and the second computer 106 respectively, either through a direct-attached SCSI connection or through IP connectivity, Fibre Channel connectivity or any other connection method. Also, in various embodiments of the present invention, the network 108 may be a local area network (LAN) or a wide area network (WAN).
A plurality of agents of the management server 102 are deployed on the first computer 104, the second computer 106, the network 108, the first storage unit 110 and the second storage unit 112. Agents 122 and 126 are integrated with the applications 114 and 118 respectively. The agents 122 and 126 continuously monitor and maintain the state of the applications 114 and 118 and provide a real time status to the management server 102.
Agents 124 and 128 are integrated with the data protection schemes 116 and 120 respectively and continuously monitor and maintain the state of the data protection schemes. In an embodiment, the agents 124 and 128 monitor and maintain replication logs and queue sizes of the data protection scheme. In various embodiments of the present invention, varied data protection schemes may be used. In an embodiment, a traditional tape backup scheme is used wherein the application 114 data on the first computer 104 is replicated (backed up) onto tape media. This replicated application data is then transported from the tape media to the second computer 106. Then the application data on the tape media is restored onto the application 118 running on the second computer 106 resulting in the recovery of the application 114.
In another embodiment of the present invention, block level replication using a storage array is used as the data protection scheme, wherein the storage volumes on which archive logs are stored on the first computer 104 are replicated to the second computer 106. These volumes are then restored onto the second computer 106 and applied to the application 118, resulting in the recovery of the application 114. In other embodiments, various other data protection schemes, such as file based replication techniques that replicate archive log files, may be used. The system 100 for management of RTO for maintaining business continuity of an Information Technology (IT) enterprise, as described in the present invention, fully supports configuration of any type of data protection scheme being used. The system 100 also supports the monitoring and administration of the data protection scheme being used.
Agents 130 and 132 of the management server 102 are integrated with the network 108, agent 134 is coupled with the first storage unit 110 and agent 136 is coupled with the second storage unit 112, as illustrated in
At step 202, a user is prompted to enter a desired RTO value. In an embodiment of the present invention, the user is prompted to enter a desired RTO value for either the entire solution or an application thereof, via a graphical user interface (GUI).
In other embodiments of the present invention, the user may enter desired RTO value using a command line interface.
In an exemplary embodiment of the present invention, an Oracle database running on the first computer 104 must be available continuously. Consequently, an instance of Oracle database is also maintained, in standby condition, on the second computer 106, which computer is maintained in a hot standby mode. Oracle database is protected and recovered using the archive log technique, which is well known in the art. Archive logs are periodically dumped on the first computer 104. These logs are also periodically replicated to the second computer 106 via a WAN connection. The archive logs are then applied to the Oracle instance running on the second computer 106.
The desired RTO value input by the user is used to determine the configuration and behavior of the rest of the components that make up the solution. In the embodiment of the present invention where the application that must be available continuously is an Oracle database, the RTO value influences the following:
At step 204, time and periodic settings are computed and configured for the solution based on the value of RTO input at step 202. An enterprise IT business continuity or disaster recovery solution typically comprises an application that is required to be available continuously along with its environment, a data protection/replication scheme and the entire infrastructure supporting the solution comprising server, storage & networks. Examples of the time and periodic settings that are computed comprise:
Once the time and periodic settings are computed based on the user input RTO value, the computed settings are configured for the components of the solution, at step 206. In an embodiment of the present invention, the computed settings are configured by the management server 102 by communicating with its agents deployed on the various components of the system 100, to configure the computed values for each of the components.
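Steps 204 and 206 can be pictured with a minimal Python sketch. The text does not state how the settings are derived from the RTO value, so the sketch simply assumes that each periodic interval is a fixed fraction of the desired RTO. The names `PeriodicSettings`, `compute_settings`, `Agent` and `configure_solution`, the setting keys and the fraction values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodicSettings:
    """Hypothetical container for settings derived from the desired RTO."""
    consistency_check_interval_s: float  # how often replicated data is checked
    log_apply_interval_s: float          # how often replicated data is applied


def compute_settings(desired_rto_s: float,
                     check_fraction: float = 0.25,
                     apply_fraction: float = 0.125) -> PeriodicSettings:
    """Step 204 (sketch): derive intervals as fractions of the desired RTO.

    The fractions are illustrative assumptions, not values from the text.
    """
    if desired_rto_s <= 0:
        raise ValueError("desired RTO must be positive")
    return PeriodicSettings(
        consistency_check_interval_s=desired_rto_s * check_fraction,
        log_apply_interval_s=desired_rto_s * apply_fraction,
    )


class Agent:
    """Stand-in for a deployed agent; it only records what it was given."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.config: dict = {}

    def configure(self, key: str, value: float) -> None:
        self.config[key] = value


def configure_solution(agents, settings: PeriodicSettings) -> None:
    """Step 206 (sketch): push the computed settings to every agent."""
    for agent in agents:
        agent.configure("consistency_check_interval_s",
                        settings.consistency_check_interval_s)
        agent.configure("log_apply_interval_s",
                        settings.log_apply_interval_s)


# Example: a desired RTO of two hours (7200 seconds).
settings = compute_settings(7200.0)
agents = [Agent("data_protection_first"), Agent("data_protection_second")]
configure_solution(agents, settings)
```

In a real deployment the derivation would depend on the application and the data protection scheme in use; the fixed fractions here only show where such a rule would plug in.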
At step 208, a current state of an application of the solution, which is required to be available continuously, along with any storage associated with the application, is obtained. In an embodiment of the present invention, a current state of the application 114 and/or the application 118 is obtained by the management server 102 by polling the agents 122 and 126, which are integrated with the applications 114 and 118 respectively. Also, a current state of the first storage unit 110 and the second storage unit 112 is obtained by the management server 102 by polling the agents 134 and 136, which are integrated with the first storage unit 110 and the second storage unit 112 respectively. Examples of the values polled comprise:
At step 210, a current state of a data protection scheme that is coupled with the application of the solution, which is required to be available continuously, is obtained. In an embodiment of the present invention, a current state of the data protection scheme 116 and/or the data protection scheme 120 is obtained by the management server 102 by polling the agents 124 and 128, which are integrated with the data protection schemes 116 and 120 respectively. Examples of the values polled comprise:
At step 212, a current state of a network supporting the application of the solution, which is required to be available continuously, is obtained. In an embodiment of the present invention, a current state of the network 108 is obtained by the management server 102 by polling the agents 130 and 132 which are integrated with the network 108. Examples of the values polled comprise:
At step 216, a real time RTO value is calculated using the values of the state of the application and associated storage, the state of the data protection scheme, the state of the network and the state of the operating system, obtained at steps 208, 210, 212 and 214. In an embodiment of the present invention, the current value of RTO is computed by the management server 102 by using values obtained by periodically polling each of its agents. The computed value of RTO yields the amount of time required to bring up all components of the system 100 to a required state from their current states to enable the second computer 106 to offer all the required services such as are provided by the first computer 104. Examples of values used to calculate the current value of RTO comprise:
In an embodiment of the present invention, in order to compute a current or real time RTO value, an estimated time period for completion of every action that needs to be performed to enable the second computer 106 to boot to a required service level, is calculated in real time. Then, the real time RTO value is computed by summing all the estimated time periods.
In other embodiments other methods and formulae may be used to compute a current RTO value for the solution, based on the values polled by the management server 102.
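The summation embodiment described above can be sketched in a few lines of Python. The action names and the per-action estimates below are hypothetical placeholders; in the system described, such estimates would be derived from the values polled from the agents.

```python
def compute_real_time_rto(estimated_seconds: dict) -> float:
    """Sum per-action time estimates (in seconds) into a real time RTO value."""
    for action, seconds in estimated_seconds.items():
        if seconds < 0:
            raise ValueError(f"negative estimate for {action!r}")
    return float(sum(estimated_seconds.values()))


# Hypothetical estimates for the actions needed to bring the second
# computer to the required service level.
estimates = {
    "transfer_pending_logs": 900.0,
    "apply_pending_logs": 1200.0,
    "start_standby_application": 300.0,
    "redirect_network": 120.0,
}
current_rto_s = compute_real_time_rto(estimates)
```

A plain sum assumes the actions run one after another; if some actions could run in parallel, a different formula would be needed, which is consistent with the statement that other methods may be used.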
In the exemplary embodiment of the present invention, where an Oracle database running on the first computer 104 must be available continuously, the current RTO value is determined by obtaining the following information:
At step 218, the computed RTO value is compared to the RTO value that was input by the user at step 202. If the computed value is less than or equal to the user input RTO value, steps 208 to 218 are repeated. If the computed value is greater than the user input RTO value, an alarm is raised at step 220.
In an embodiment of the present invention, the difference between the computed RTO value and the user input RTO value is presented to the user via a GUI.
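The comparison at step 218 and the alarm at step 220 can be sketched as a small monitoring loop. The functions `check_rto` and `monitor`, the callback signatures and the `max_cycles` bound are assumptions made for illustration; the text describes continuous repetition, which the sketch bounds only so that the example terminates.

```python
def check_rto(computed_rto_s: float, desired_rto_s: float):
    """Step 218 (sketch): return (within_target, deviation_in_seconds)."""
    deviation = computed_rto_s - desired_rto_s
    return deviation <= 0, max(deviation, 0.0)


def monitor(poll_rto, desired_rto_s: float, raise_alarm, max_cycles: int):
    """Repeat the poll-and-compare cycle; raise an alarm on the first violation.

    Returns the zero-based index of the cycle that triggered the alarm,
    or None if every cycle stayed within the desired RTO.
    """
    for cycle in range(max_cycles):
        computed = poll_rto()  # stands in for steps 208 to 216
        within_target, deviation = check_rto(computed, desired_rto_s)
        if not within_target:
            raise_alarm(computed, desired_rto_s, deviation)  # step 220
            return cycle
    return None


# Example with hypothetical readings against a desired RTO of 7200 seconds.
readings = iter([6000.0, 7000.0, 7500.0])
alarms = []
alarm_cycle = monitor(
    poll_rto=lambda: next(readings),
    desired_rto_s=7200.0,
    raise_alarm=lambda computed, desired, dev: alarms.append((computed, desired, dev)),
    max_cycles=3,
)
```

The deviation passed to the alarm callback is the same difference that an embodiment may present to the user via a GUI.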
At step 222, the user is prompted to define a corrective policy, in order to restore the real time computed RTO value to the RTO value initially input by the user. In an embodiment of the present invention, the user may be prompted to define a corrective policy via a GUI. This GUI may be the same as or different from the GUI that presents the difference between the computed and user input RTO values. The GUI may also present the user with a set of corrective policy options and prompt the user either to choose one of those or to define a new corrective policy.
If the user chooses to define a corrective policy at step 224, then at step 226 a corrective action that restores the RTO value is taken based on the user defined corrective policy. Upon completion of step 226, steps 208 to 218 are repeated.
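The choice made at step 224 can be pictured as a simple dispatch: a user defined corrective policy, when supplied, takes precedence, and otherwise a stored predefined policy is looked up by the cause of the deviation. The sketch below is only illustrative; the cause names, the action descriptions and the `select_action` function are hypothetical and are not taken from the text.

```python
from typing import Callable, Dict, Optional

CorrectiveAction = Callable[[], str]

# Hypothetical predefined policies, keyed by cause of RTO deviation.
# Each action returns a description instead of acting on real components.
PREDEFINED_POLICIES: Dict[str, CorrectiveAction] = {
    "replication_backlog": lambda: "placeholder: shorten replication interval",
    "network_degraded": lambda: "placeholder: notify network administrator",
}


def select_action(cause: str,
                  user_policy: Optional[CorrectiveAction] = None) -> CorrectiveAction:
    """Pick the corrective action: user defined first, predefined otherwise."""
    if user_policy is not None:
        return user_policy  # user defined corrective policy (step 226)
    try:
        return PREDEFINED_POLICIES[cause]  # predefined corrective policy
    except KeyError:
        raise LookupError(f"no predefined policy for cause {cause!r}") from None


# Example: no user policy is supplied, so the predefined one is used.
action = select_action("replication_backlog")
result = action()

# Example: a user defined policy overrides the predefined lookup.
custom = select_action("replication_backlog", user_policy=lambda: "placeholder: custom step")
custom_result = custom()
```

After either branch completes, the monitoring cycle would resume with a fresh computation of the real time RTO value.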
If the user chooses not to define a corrective policy at step 224, then at step 228 a corrective action that restores the RTO value is taken based on a predefined corrective policy. In an embodiment of the present invention, a set of predefined corrective policies are stored in the management server 102 and these policies are applied by the management server 102 onto the first computer 104, the second computer 106 or the network 108, based on the states of these components as obtained via the agents deployed on them. A predefined corrective policy is selected for execution based on the cause of deviation of the computed real time RTO value from the user input RTO value. RTO deviation may occur due to various causes. Examples of such causes comprise:
In the exemplary embodiment of the present invention, where an Oracle database running on the first computer 104 must be available continuously, the following corrective actions may be taken when the computed real time RTO value deviates from the user input RTO value:
In various embodiments of the present invention, the system and method herein can operate in varied environments and on heterogeneous platforms such as heterogeneous servers and operating system environments. Examples of servers and central processing unit types that are supported by the present invention comprise Intel Pentium class, SUN Sparc, IBM PowerPC etc. Examples of the various operating systems that are supported are Microsoft Windows 2000, Microsoft Windows 2003, SUN Solaris 8, SUN Solaris 9, IBM AIX 5.3 etc.
While the present invention has been shown and described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from or offending the spirit and scope of the invention as defined by the appended claims.
Number | Date | Country
---|---|---
60615640 | Oct 2004 | US