Claims
- 1. A computer system for fault tolerant computing comprising: a plurality of host computers interconnected on a network; a first copy of an application module running on a first of said host computers; a second copy of the application module operative on a second of said host computers; a manager daemon process running on one of said plurality of host computers, the manager daemon process receiving an indication upon a failure of the first copy of the application module and initiating failure recovery with said second copy of the application module; and means for providing a registration message to said manager daemon process, said registration message specifying said application module and a style of replication to be maintained by said manager daemon process for said application module from among a plurality of different replication styles; wherein said second copy is maintained in an operative state for fail-over protection upon a failure of the first copy of the application module in accordance with the registered replication style.
- 2. The computer system of claim 1 wherein said different replication styles indicate whether or not the second copy of the application module is to run on said second host computer simultaneously while said first copy of the application module runs on said first host computer, and if said second copy is to simultaneously run, whether said second copy can receive and respond to a client request.
- 3. The computer system of claim 2 wherein the different replication styles are cold backup, warm backup and hot backup, wherein in accordance with the cold backup style, said second copy does not run while said first copy of the application module runs; in accordance with the warm backup style, said second copy runs while said first copy of the application module runs but cannot receive and respond to a client request; and in accordance with the hot backup style, said second copy runs while said first copy of the application module runs and can receive and respond to a client request.
- 4. The computer system of claim 1 further comprising: a first failure-detection daemon process running on said first host computer, said first failure-detection daemon process monitoring the ability of said first copy of the application module to continue to run, said first failure-detection daemon process sending to said manager daemon process a message indicating a failure of said first copy upon detecting a failure.
- 5. The computer system of claim 4 further comprising: a checkpoint server connected to the network, said checkpoint server periodically storing the states of said first copy of the application module and said manager daemon process.
- 6. The computer system of claim 5 wherein upon detection of the failure of said first copy of the application module, said second host computer is signaled for the second copy to assume the processing functions of said first copy, said second copy retrieving from said checkpoint server the last stored state of said first copy.
- 7. The computer system of claim 5 further comprising: a second failure-detection daemon process running on the same host computer as the manager daemon process, said second failure-detection process monitoring said first host computer for a failure.
- 8. The computer system of claim 7 wherein upon detection of a failure of said first host computer, said second copy of the application module is signaled to assume the processing functions of said first copy, said second copy retrieving from said checkpoint server the last stored state of said first copy of the application module.
- 9. The computer system of claim 7 further comprising: a backup copy of said second failure-detection daemon process running on another one of said plurality of host computers different than the host computer on which the second failure-detection daemon process is running, said backup copy of said second failure-detection process monitoring said second host computer for a failure.
- 10. The computer system of claim 9 wherein upon detection of a failure of said second host computer, said backup copy of said second failure-detection daemon process assumes the processing functions of said second failure-detection daemon process and initiates running of a copy of said manager daemon process on said same another one of the host computers, said copy of said manager daemon process retrieving from said checkpoint server the stored state of said manager daemon process when it was running on its host computer.
- 11. The computer system of claim 3 wherein the registration message for the application module further specifies a degree of replication that indicates for a hot or warm backup replication style the number of copies of the application module to be maintained running on said plurality of host computers in the network.
- 12. The computer system of claim 6 wherein the registration message for the application module further specifies a fail-over strategy, the fail-over strategy indicating whether said second copy should assume the processing functions of said first copy of the application module each time a failure of said first copy is detected by said first failure-detection process, or whether said second copy should assume the processing functions of said first copy only after the number of failures of said first copy on said first host computer reaches a predetermined threshold.
- 13. A fault-managing computer apparatus on a host computer in a computer system, said apparatus comprising: a manager daemon process for receiving an indication of a failure of a first copy of an application module running on a first host computer in the computer system and for initiating failure recovery with a second copy of the application module on a second host computer; and means for receiving a registration message from the first copy of the application module specifying said application module and a style of replication to be maintained for said application module from among a plurality of different replication styles; wherein the second copy is maintained in an operative state for fail-over protection upon a failure of the first copy of the application module in accordance with the registered replication style.
- 14. The apparatus of claim 13 wherein the different replication styles are cold backup, warm backup and hot backup.
- 15. The apparatus of claim 13 wherein upon receiving an indication of a failure of the first copy of the application module, said manager daemon process signals the second host computer for the second copy to assume the processing functions of the first copy of the application module.
- 16. The apparatus of claim 13 further comprising a failure-detection daemon process for monitoring the first host computer for a failure.
- 17. The apparatus of claim 16 wherein upon said failure-detection daemon process detecting a failure of the first host computer, said manager daemon process signals the second host computer for the second copy to assume the processing functions of the first copy of the application module.
- 18. The apparatus of claim 14 wherein the registration message further specifies a degree of replication that indicates the number of copies of the application module to be maintained running in the computer system for a hot or warm backup replication style.
- 19. A fault-tolerant computing apparatus for use in a computer system, said apparatus comprising: a failure-detection daemon process running on said apparatus, said failure-detection daemon process monitoring the ability of a first copy of an application module to continue to run on said apparatus; and means for sending a registration message to a manager daemon process specifying the application module and a style of replication from among a plurality of different replication styles to be maintained by the manager daemon process for the application module with respect to a second copy of the application module that is operative on another computer apparatus in the computer system; wherein the second copy is maintained in an operative state for fail-over protection upon a failure of the first application module in accordance with the registered replication style.
- 20. The apparatus of claim 19 wherein the different replication styles are cold backup, warm backup and hot backup.
- 21. The apparatus of claim 19 wherein the second copy of the application module in the computer system assumes the processing functions of the first copy of the application module upon detecting a failure of the first copy of the application module.
- 22. The apparatus of claim 19 wherein the registration message further specifies a degree of replication that indicates the number of copies of the application module to be maintained running in the computer system for a hot or warm backup replication style.
- 23. A method for operating a fault-tolerant computer system, said system comprising a plurality of host computers interconnected on a network, a first copy of an application module running on a first of the plurality of the host computers and a second copy of the first application module on a second of the plurality of host computers, said method comprising the steps of: receiving a registration message specifying the application module and a style of replication to be maintained for the application module from among a plurality of different replication styles; and maintaining said second copy in an operative state for fail-over protection upon a failure of the first application module in accordance with the registered replication style.
- 24. The method of claim 23 further comprising the steps of: receiving an indication upon a failure of the first copy of the application module; and initiating failure recovery for the failed first copy with the second copy on the second host computer.
- 25. The method of claim 23 wherein the different replication styles indicate whether or not the second copy is to run simultaneously while the first copy of the application module runs on the first host computer, and if the second copy is to simultaneously run, whether the second copy can receive and respond to a client request.
- 26. The method of claim 23 wherein the different replication styles are cold backup, warm backup and hot backup.
- 27. The method of claim 23 further comprising the steps of: monitoring the first host computer for a failure; and upon detecting a failure of the first host computer, initiating failure recovery for the first copy of the application module with the second copy on the second host computer.
- 28. The method of claim 26 wherein the registration message for the first application module further specifies a degree of replication that indicates the number of copies of the application module to be maintained running on said plurality of host computers for a hot or warm backup replication style.
- 29. The method of claim 24 wherein the registration message for the application module further specifies a fail-over strategy, the fail-over strategy indicating whether the second copy assumes the processing functions of the first copy of the application module each time a failure of the first copy is detected, or whether the second copy assumes the processing functions of the first application module only after the number of failures of the first copy of the application module reaches a predetermined number.
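By way of illustration only, the registration message and fail-over strategy recited in the claims above might be modeled as in the following minimal Python sketch. It is not the claimed implementation. The names `ReplicationStyle`, `Registration`, `Manager`, `failover_threshold` and the return values `"failover"` and `"restart_local"` are hypothetical, and the assumption that a failure below the threshold leads to a restart on the first host computer is made for the example only and is not recited in the claims.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicationStyle(Enum):
    COLD = "cold"  # second copy does not run while the first copy runs
    WARM = "warm"  # second copy runs but cannot receive or respond to client requests
    HOT = "hot"    # second copy runs and can receive and respond to client requests


@dataclass(frozen=True)
class Registration:
    """Hypothetical registration message sent to the manager daemon process."""
    module: str
    style: ReplicationStyle
    degree: int = 2              # copies kept running (meaningful for hot/warm only)
    failover_threshold: int = 1  # failures on the first host before fail-over


class Manager:
    """Hypothetical stand-in for the manager daemon process."""

    def __init__(self):
        self._registrations = {}
        self._failures = {}

    def register(self, reg):
        # Record the registered style and reset the failure count.
        self._registrations[reg.module] = reg
        self._failures[reg.module] = 0

    def backup_runs(self, module):
        # Warm and hot backups run alongside the first copy; cold does not.
        return self._registrations[module].style is not ReplicationStyle.COLD

    def backup_serves_clients(self, module):
        # Only a hot backup can receive and respond to client requests.
        return self._registrations[module].style is ReplicationStyle.HOT

    def on_failure(self, module):
        # Count the failure; fail over once the registered threshold is reached.
        # Below the threshold, this sketch assumes a local restart.
        reg = self._registrations[module]
        self._failures[module] += 1
        if self._failures[module] >= reg.failover_threshold:
            self._failures[module] = 0
            return "failover"
        return "restart_local"
```

With a threshold of 1 the second copy takes over on every detected failure. A larger threshold defers fail-over until the predetermined number of failures of the first copy has been reached.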
CROSS REFERENCE TO RELATED APPLICATIONS
This application describes and claims subject matter that is also described in our co-pending United States patent application filed simultaneously herewith and entitled: “METHOD AND APPARATUS FOR PROVIDING FAILURE DETECTION AND RECOVERY WITH PREDETERMINED DEGREE OF REPLICATION FOR DISTRIBUTED APPLICATIONS IN A NETWORK”, Ser. No. 09/119,140.
US Referenced Citations (17)