Claims
- 1. A computer-implemented method of enhancing fault-tolerance of a distributed computing application, the method comprising:
running a monitoring program on a node in a network in connection with running software of the distributed computing application on the node;
in the monitoring program, recurrently checking continued operation of the distributed computing application's software on the node; and
in the event of failure, initiating by the monitoring program an action to restore the distributed computing application.
- 2. The method of claim 1 wherein the distributed computing application includes an administrative agent for an application service provider.
- 3. The method of claim 1 further comprising:
in the distributed computing application running on the node, recurrently signaling its continued operation; and
in the monitoring program, monitoring for receipt of the distributed computing application's signaling within a monitoring interval to check the distributed computing application's continued operation on the node.
- 4. The method of claim 1 wherein the action to restore the distributed computing application comprises restarting the distributed computing application on the node.
- 5. The method of claim 1 wherein the action to restore the distributed computing application comprises iteratively attempting to restart the distributed computing application on the node at increasingly longer intervals.
- 6. The method of claim 1 wherein the action to restore the distributed computing application comprises, while the distributed computing application remains inoperative, attempting to restart the distributed computing application one or more times in a plurality of restart modes, at least one of the restart modes having a longer interval between restart attempts than in another of the restart modes.
- 7. The method of claim 1 wherein the action to restore the distributed computing application comprises reinstalling the software for the distributed computing application on the node.
- 8. The method of claim 1 wherein the action to restore the distributed computing application comprises reinstalling a latest update version of the software for the distributed computing application on the node.
- 9. The method of claim 1 wherein the action to restore the distributed computing application comprises reinstalling a version of the software for the distributed computing application on the node that was previously known to run without failure on the node.
- 10. The method of claim 1 wherein the action to restore the distributed computing application comprises logging information of the failure.
- 11. The method of claim 1 wherein the action to restore the distributed computing application comprises transmitting information of the failure to an administrative server or data center for the distributed computing application.
- 12. The method of claim 1 wherein the action to restore the distributed computing application comprises sending an alert to a human administrator of the distributed computing application.
- 13. A computer-implemented method of enhancing fault-tolerance of an application provided at nodes of a distributed network via an application service provider model, the method comprising:
periodically during execution of an application service provider agent program on a node, generating an event signaling continued operation of said agent program on the node;
at periodic intervals, checking that the event was generated during a current interval;
if the event was not generated in the interval, restoring the application service provider agent to operation by:
at least once restarting the application service provider agent;
if restarting does not restore the application service provider agent, reinstalling software of the application service provider agent on the node and restarting the application service provider agent;
if reinstalling the application service provider agent does not restore the application service provider agent, transmitting notification of the application service provider agent's failure on the node to a data center for the application service provider.
- 14. A fault-tolerant application service providing system of distributed computing nodes communicating via a data network, comprising:
an application service providing data center;
a computing node interconnected via the data network with the application service providing data center;
on the computing node, an application service providing agent for providing an application on the computing node administered via the application service providing data center;
a monitor program on the computing node for monitoring continued operation of the application service providing agent, and operating upon detecting failure of the application service providing agent to initiate a restorative action to restore the application service providing agent to operation on the node.
- 15. The fault-tolerant application service providing system of claim 14 wherein the monitor program further operates to report failure of the application service providing agent on the node to the application service providing data center.
- 16. The fault-tolerant application service providing system of claim 14 wherein the monitor program further operates to report failure of the application service providing agent on the node to the application service providing data center when the restorative action fails to restore the application service providing agent to operation on the node.
- 17. The fault-tolerant application service providing system of claim 14 wherein the restorative action comprises restarting the application service providing agent on the node.
- 18. The fault-tolerant application service providing system of claim 14 wherein the restorative action comprises initiating restarts of the application service providing agent on the node, initially at shorter restart intervals and later at longer intervals, thereby permitting a temporary low resource availability condition to be alleviated.
- 19. The fault-tolerant application service providing system of claim 14 wherein the restorative action comprises obtaining from the application service providing data center and reinstalling a current version of the application service providing agent on the node.
- 20. The fault-tolerant application service providing system of claim 14 wherein the restorative action comprises reinstalling a version of the application service providing agent on the node that is recorded to have most recently successfully operated on the node.
- 21. The fault-tolerant application service providing system of claim 14 wherein the restorative action comprises logging failure of the application service providing agent on the node.
- 22. The fault-tolerant application service providing system of claim 14 wherein the restorative action comprises uploading information of the failure to the application service providing data center.
- 23. A computer-readable medium for carrying a fault-tolerance enhancing program for a distributed computing application, the program comprising for execution at a computing node on a data network:
means for monitoring continued operation of the distributed computing application at the computing node to detect failure of the distributed computing application to continually operate on the computing node;
means responsive to the failure being detected, for initiating actions to restore the distributed computing application to operation on the computing node; and
means responsive to failure to restore operation of the distributed computing application on the computing node, for transmitting information of the failure to a distributed computing application administering server on the data network.
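For illustration only, the heartbeat check of claim 3, the lengthening restart intervals of claim 5, and the restart, reinstall, report escalation of claim 13 might be sketched as follows. This is a minimal sketch and not the implementation described in the specification. Every name in it (`Heartbeat`, `backoff_intervals`, `restore`, and the `is_running`, `restart`, `reinstall`, and `notify` callbacks) is hypothetical, and the geometric growth of the intervals is an assumption, since the claims require only that the intervals become longer.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Heartbeat:
    """Records when the agent last signaled that it is still running."""
    last_beat: float = float("-inf")

    def signal(self, now: float) -> None:
        # Called recurrently by the agent (hypothetical interface).
        self.last_beat = now

    def seen_within(self, interval: float, now: float) -> bool:
        # Called by the monitor once per monitoring interval.
        return now - self.last_beat <= interval


def backoff_intervals(first: float, factor: float, attempts: int) -> List[float]:
    """Waits between restart attempts; geometric growth is an assumption."""
    return [first * factor ** i for i in range(attempts)]


def restore(
    is_running: Callable[[], bool],
    restart: Callable[[], None],
    reinstall: Callable[[], None],
    notify: Callable[[str], None],
    waits: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Escalate from restarting, to reinstalling, to reporting the failure."""
    # Stage 1: restart one or more times, waiting longer after each failure.
    for wait in waits:
        restart()
        if is_running():
            return "restarted"
        sleep(wait)
    # Stage 2: reinstall the agent software, then restart once more.
    reinstall()
    restart()
    if is_running():
        return "reinstalled"
    # Stage 3: report the failure (for example, to the data center).
    notify("agent could not be restored on this node")
    return "reported"
```

A monitor built on this sketch would call `seen_within` once per monitoring interval and invoke `restore` only when the heartbeat is missing. Passing `sleep` as a parameter keeps the escalation logic testable without real delays.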
PRIORITY CLAIM
[0001] This application claims the benefit of U.S. Provisional Patent Application No. 60/375,176, filed Apr. 23, 2002, which is hereby incorporated herein by reference.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60375176 | Apr 2002 | US |