Claims
- 1. A method of providing a checkpoint/restart facility across a plurality of computer systems, wherein:
the plurality of computer systems comprises:
a first computer system executing a first program, and a second computer system containing a disk system and executing a second program; the first computer system and the second computer system are heterogeneous computer systems; said method comprising:
A) checkpointing a current status of the first program resulting in a first set of checkpoint status information; B) transmitting a first checkpoint request that includes the first set of checkpoint status information from the first program over a first session to the second program; C) checkpointing the second program resulting in a second set of checkpoint status information in response to receiving the first checkpoint request; D) writing the first set of checkpoint status information and the second set of checkpoint status information to a first checkpoint file on the disk system; and E) transmitting a first checkpoint response from the second program over the first session to the first program after the writing in step (D) is complete.
- 2. The method in claim 1 wherein:
the method further comprises:
F) checkpointing the first program resulting in a third set of checkpoint status information; G) transmitting a second checkpoint request that includes the third set of checkpoint status information from the first program over the first session to the second program; H) checkpointing the second program resulting in a fourth set of checkpoint status information in response to receiving the second checkpoint request transmitted in step (G); I) writing the third set of checkpoint status information and the fourth set of checkpoint status information to a second checkpoint file on the disk system; and J) transmitting a second checkpoint response from the second program over the first session to the first program after the writing in step (I) is complete.
- 3. The method in claim 2 which further comprises:
J) transmitting a first rollback request from the first program over the first session to the second program; K) reading the third set of checkpoint status information and the fourth set of checkpoint status information from the second checkpoint file in response to receiving the first rollback request transmitted in step (J); L) rolling back the second program utilizing the fourth set of checkpoint status information read in step (K); M) transmitting a first rollback response from the second program over the first session to the first program that includes the third set of checkpoint status information read in step (K); and N) rolling back the first program utilizing the third set of checkpoint status information in response to receiving the first rollback response in step (M).
- 4. The method in claim 2 wherein:
the first checkpoint file and the second checkpoint file are a same file.
- 5. The method in claim 1 which further comprises:
F) transmitting a first rollback request from the first program over the first session to the second program; G) reading the first set of checkpoint status information and the second set of checkpoint status information from the first checkpoint file in response to receiving the first rollback request transmitted in step (F); H) rolling back the second program utilizing the second set of checkpoint status information read in step (G); I) transmitting a first rollback response from the second program over the first session to the first program that includes the first set of checkpoint status information read in step (G); and J) rolling back the first program utilizing the first set of checkpoint status information in response to receiving the first rollback response in step (I).
- 6. The method in claim 1 which further comprises:
F) transmitting a second checkpoint request that includes the first set of checkpoint status information from the first program over a second session to a third program executing in a third computer system; G) checkpointing the third program resulting in a fourth set of checkpoint status information in response to receiving the second checkpoint request; H) writing the first set of checkpoint status information and the fourth set of checkpoint status information to a second checkpoint file; and I) transmitting a second checkpoint response from the third program over the second session to the first program after the writing in step (H) is complete.
- 7. The method in claim 6 which further comprises:
J) transmitting a first rollback request from the first program over the first session to the second program; K) reading the first set of checkpoint status information and the second set of checkpoint status information from the first checkpoint file in response to receiving the first rollback request transmitted in step (J); L) rolling back the second program utilizing the second set of checkpoint status information read in step (K); M) transmitting a first rollback response from the second program over the first session to the first program that includes the first set of checkpoint status information read in step (K); and N) rolling back the first program utilizing the first set of checkpoint status information in response to receiving the first rollback response transmitted in step (M).
- 8. The method in claim 6 which further comprises:
J) transmitting a first rollback request from the first program over the first session to the second program; K) reading the first set of checkpoint status information and the second set of checkpoint status information from the first checkpoint file in response to receiving the first rollback request transmitted in step (J); L) rolling back the second program utilizing the second set of checkpoint status information read in step (K); M) transmitting a first rollback response from the second program over the first session to the first program that includes the first set of checkpoint status information read in step (K); O) transmitting a second rollback request from the first program over the second session to the third program; P) reading the first set of checkpoint status information and the fourth set of checkpoint status information from the second checkpoint file in response to receiving the second rollback request transmitted in step (O); Q) rolling back the third program utilizing the fourth set of checkpoint status information read in step (P); R) transmitting a second rollback response from the third program over the second session to the first program that includes the first set of checkpoint status information read in step (P); and S) rolling back the first program utilizing the first set of checkpoint status information in response to receiving the first rollback response transmitted in step (M) and the second rollback response transmitted in step (R).
- 9. The method in claim 1 wherein:
there are a plurality of sessions open between the first program and the second program for accessing a corresponding plurality of files by the second program; and the checkpointing in step (C) flushes all of the plurality of files and includes checkpoint information for all of the plurality of files in the second set of checkpoint status information.
- 10. A computer readable Non-Volatile Storage Medium encoded with software for providing a checkpoint/restart facility across a plurality of computer systems, wherein:
the plurality of computer systems comprises:
a first computer system executing a first program, and a second computer system containing a disk system and executing a second program; the first computer system and the second computer system are heterogeneous computer systems; said software comprising:
A) a set of computer instructions for checkpointing a current status of the first program resulting in a first set of checkpoint status information; B) a set of computer instructions for transmitting a first checkpoint request that includes the first set of checkpoint status information from the first program over a first session to the second program; C) a set of computer instructions for checkpointing the second program resulting in a second set of checkpoint status information in response to receiving the first checkpoint request; D) a set of computer instructions for writing the first set of checkpoint status information and the second set of checkpoint status information to a first checkpoint file on the disk system; and E) a set of computer instructions for transmitting a first checkpoint response from the second program over the first session to the first program after the writing in set (D) is complete.
- 11. A data processing system having software stored in a set of Computer Software Storage Media for providing a checkpoint/restart facility across a plurality of computer systems, wherein:
the data processing system comprises the plurality of computer systems; the plurality of computer systems comprises:
a first computer system executing a first program, and a second computer system containing a disk system and executing a second program; the first computer system and the second computer system are heterogeneous computer systems; said software comprising:
A) a set of computer instructions for checkpointing a current status of the first program resulting in a first set of checkpoint status information; B) a set of computer instructions for transmitting a first checkpoint request that includes the first set of checkpoint status information from the first program over a first session to the second program; C) a set of computer instructions for checkpointing the second program resulting in a second set of checkpoint status information in response to receiving the first checkpoint request; D) a set of computer instructions for writing the first set of checkpoint status information and the second set of checkpoint status information to a first checkpoint file on the disk system; and E) a set of computer instructions for transmitting a first checkpoint response from the second program over the first session to the first program after the writing in set (D) is complete.
- 12. The software in claim 11 wherein:
the software further comprises:
F) a set of computer instructions for checkpointing the first program resulting in a third set of checkpoint status information; G) a set of computer instructions for transmitting a second checkpoint request that includes the third set of checkpoint status information from the first program over the first session to the second program; H) a set of computer instructions for checkpointing the second program resulting in a fourth set of checkpoint status information in response to receiving the second checkpoint request transmitted in set (G); I) a set of computer instructions for writing the third set of checkpoint status information and the fourth set of checkpoint status information to a second checkpoint file on the disk system; and J) a set of computer instructions for transmitting a second checkpoint response from the second program over the first session to the first program after the writing in set (I) is complete.
- 13. The software in claim 12 which further comprises:
J) a set of computer instructions for transmitting a first rollback request from the first program over the first session to the second program; K) a set of computer instructions for reading the third set of checkpoint status information and the fourth set of checkpoint status information from the second checkpoint file in response to receiving the first rollback request transmitted in set (J); L) a set of computer instructions for rolling back the second program utilizing the fourth set of checkpoint status information read in set (K); M) a set of computer instructions for transmitting a first rollback response from the second program over the first session to the first program that includes the third set of checkpoint status information read in set (K); and N) a set of computer instructions for rolling back the first program utilizing the third set of checkpoint status information in response to receiving the first rollback response in set (M).
- 14. The software in claim 12 wherein:
the first checkpoint file and the second checkpoint file are a same file.
- 15. The software in claim 11 which further comprises:
F) a set of computer instructions for transmitting a first rollback request from the first program over the first session to the second program; G) a set of computer instructions for reading the first set of checkpoint status information and the second set of checkpoint status information from the first checkpoint file in response to receiving the first rollback request transmitted in set (F); H) a set of computer instructions for rolling back the second program utilizing the second set of checkpoint status information read in set (G); I) a set of computer instructions for transmitting a first rollback response from the second program over the first session to the first program that includes the first set of checkpoint status information read in set (G); and J) a set of computer instructions for rolling back the first program utilizing the first set of checkpoint status information in response to receiving the first rollback response in set (I).
- 16. The software in claim 11 which further comprises:
F) a set of computer instructions for transmitting a second checkpoint request that includes the first set of checkpoint status information from the first program over a second session to a third program executing in a third computer system; G) a set of computer instructions for checkpointing the third program resulting in a fourth set of checkpoint status information in response to receiving the second checkpoint request; H) a set of computer instructions for writing the first set of checkpoint status information and the fourth set of checkpoint status information to a second checkpoint file; and I) a set of computer instructions for transmitting a second checkpoint response from the third program over the second session to the first program after the writing in set (H) is complete.
- 17. The software in claim 16 which further comprises:
J) a set of computer instructions for transmitting a first rollback request from the first program over the first session to the second program; K) a set of computer instructions for reading the first set of checkpoint status information and the second set of checkpoint status information from the first checkpoint file in response to receiving the first rollback request transmitted in set (J); L) a set of computer instructions for rolling back the second program utilizing the second set of checkpoint status information read in set (K); M) a set of computer instructions for transmitting a first rollback response from the second program over the first session to the first program that includes the first set of checkpoint status information read in set (K); and N) a set of computer instructions for rolling back the first program utilizing the first set of checkpoint status information in response to receiving the first rollback response transmitted in set (M).
- 18. The software in claim 16 which further comprises:
J) a set of computer instructions for transmitting a first rollback request from the first program over the first session to the second program; K) a set of computer instructions for reading the first set of checkpoint status information and the second set of checkpoint status information from the first checkpoint file in response to receiving the first rollback request transmitted in set (J); L) a set of computer instructions for rolling back the second program utilizing the second set of checkpoint status information read in set (K); M) a set of computer instructions for transmitting a first rollback response from the second program over the first session to the first program that includes the first set of checkpoint status information read in set (K); O) a set of computer instructions for transmitting a second rollback request from the first program over the second session to the third program; P) a set of computer instructions for reading the first set of checkpoint status information and the fourth set of checkpoint status information from the second checkpoint file in response to receiving the second rollback request transmitted in set (O); Q) a set of computer instructions for rolling back the third program utilizing the fourth set of checkpoint status information read in set (P); R) a set of computer instructions for transmitting a second rollback response from the third program over the second session to the first program that includes the first set of checkpoint status information read in set (P); and S) a set of computer instructions for rolling back the first program utilizing the first set of checkpoint status information in response to receiving the first rollback response transmitted in set (M) and the second rollback response transmitted in set (R).
- 19. The software in claim 11 wherein:
there are a plurality of sessions open between the first program and the second program for accessing a corresponding plurality of files by the second program; and the checkpointing in set (C) flushes all of the plurality of files and includes checkpoint information for all of the plurality of files in the second set of checkpoint status information.
- 20. A data processing system having software stored in a set of Computer Software Storage Media for providing a checkpoint/restart facility across a plurality of computer systems, wherein:
the data processing system comprises the plurality of computer systems; the plurality of computer systems comprises:
a first computer system executing a first program, and a second computer system containing a disk system and executing a second program; the first computer system and the second computer system are heterogeneous computer systems; said software comprising:
A) means for checkpointing a current status of the first program resulting in a first set of checkpoint status information; B) means for transmitting a first checkpoint request that includes the first set of checkpoint status information from the first program over a first session to the second program; C) means for checkpointing the second program resulting in a second set of checkpoint status information in response to receiving the first checkpoint request; D) means for writing the first set of checkpoint status information and the second set of checkpoint status information to a first checkpoint file on the disk system; and E) means for transmitting a first checkpoint response from the second program over the first session to the first program after the writing in set (D) is complete.
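By way of illustration only, the checkpoint steps (A) through (E) of claim 1 and the rollback steps (F) through (J) of claim 5 may be sketched as follows. The sketch forms no part of the claims and is not the claimed implementation. The class names `FirstProgram` and `SecondProgram`, the JSON layout of the checkpoint file, the write-to-temporary-file-then-rename technique, and the use of a direct method call in place of a session between heterogeneous systems are all assumptions made for the example.

```python
import copy
import json
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class SecondProgram:
    """Hypothetical second program; it owns the disk system and the checkpoint file."""
    checkpoint_path: str
    state: dict = field(default_factory=dict)

    def handle_checkpoint_request(self, first_status: dict) -> dict:
        # Step C: checkpoint the second program on receipt of the request.
        second_status = copy.deepcopy(self.state)
        # Step D: write both sets of status information to one checkpoint file.
        # JSON is an assumed neutral encoding for heterogeneous systems.
        record = {"first": first_status, "second": second_status}
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(record, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, self.checkpoint_path)
        # Step E: the response is returned only after the write is complete.
        return {"type": "checkpoint_response", "ok": True}

    def handle_rollback_request(self) -> dict:
        # Claim 5, step G: read both sets back from the checkpoint file.
        with open(self.checkpoint_path, encoding="utf-8") as fh:
            record = json.load(fh)
        # Step H: roll back the second program from its own status information.
        self.state = record["second"]
        # Step I: return the first program's status information in the response.
        return {"type": "rollback_response", "first": record["first"]}


@dataclass
class FirstProgram:
    """Hypothetical first program; `peer` stands in for the first session."""
    peer: SecondProgram
    state: dict = field(default_factory=dict)

    def checkpoint(self) -> bool:
        first_status = copy.deepcopy(self.state)  # Step A
        response = self.peer.handle_checkpoint_request(first_status)  # Step B
        return response["ok"]

    def rollback(self) -> None:
        response = self.peer.handle_rollback_request()  # Claim 5, step F
        self.state = response["first"]  # Claim 5, step J


def demo():
    """Checkpoint, change both programs, then roll both back."""
    with tempfile.TemporaryDirectory() as workdir:
        second = SecondProgram(os.path.join(workdir, "ckpt.json"),
                               {"records_written": 10})
        first = FirstProgram(second, {"records_read": 10})
        first.checkpoint()
        first.state["records_read"] = 25
        second.state["records_written"] = 25
        first.rollback()
        return first.state, second.state
```

In this sketch, calling `demo()` leaves both programs holding the values they had when the checkpoint was taken, because the first program's status information travels with the request, is stored alongside the second program's status information, and is returned in the rollback response.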
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application is related to our copending patent application entitled “METHOD AND DATA PROCESSING SYSTEM PROVIDING FILE I/O ACROSS MULTIPLE HETEROGENEOUS COMPUTER SYSTEMS”, filed of even date herewith and assigned to the assignee hereof.
[0002] This application is related to our copending patent application entitled “METHOD AND DATA PROCESSING SYSTEM PROVIDING REMOTE PROGRAM INITIATION AND CONTROL ACROSS MULTIPLE HETEROGENEOUS COMPUTER SYSTEMS”, filed of even date herewith and assigned to the assignee hereof.
[0003] This application is related to our copending patent application entitled “METHOD AND DATA PROCESSING SYSTEM PROVIDING BULK RECORD MEMORY TRANSFERS ACROSS MULTIPLE HETEROGENEOUS COMPUTER SYSTEMS”, filed of even date herewith and assigned to the assignee hereof.
[0004] This application is related to our copending patent application entitled “METHOD AND DATA PROCESSING SYSTEM PROVIDING DATA CONVERSION ACROSS MULTIPLE HETEROGENEOUS COMPUTER SYSTEMS”, filed of even date herewith and assigned to the assignee hereof.