Claims
- 1. A method for facilitating reliable execution in a computer system by keeping track of write operations to a main memory of the computer system in order to undo the write operations if necessary, comprising:
receiving a write operation directed to the main memory at a memory controller, the write operation including data to be written to the main memory and a write address specifying a location in the main memory into which the data is to be written;
examining a log bit associated with the write address, wherein the log bit indicates whether an existing value from the write address in main memory has been copied to a checkpoint store;
if the log bit is not set,
creating a new entry for the write address in the checkpoint store,
retrieving the existing value from the write address in the main memory, and
storing the existing value to the new entry in the checkpoint store;
storing the data to be written to the write address in the main memory; and
periodically performing a checkpointing operation, wherein performing the checkpointing operation involves clearing all entries from the checkpoint store.
- 2. The method of claim 1, further comprising:
receiving a read operation at the memory controller, the read operation being directed to a read address specifying a location in the main memory to be read from; and retrieving data from the read address in the main memory to satisfy the read operation.
- 3. The method of claim 1, wherein the checkpoint store is organized as a first-in-first-out (FIFO) buffer.
- 4. The method of claim 1, wherein if the new entry is to be added to the checkpoint store and no room exists in the checkpoint store for the new entry, the method further comprises performing a checkpointing operation to clear all entries from the checkpoint store.
- 5. The method of claim 1, wherein performing the checkpointing operation involves:
stopping execution of a central processing unit in the computer system; storing an internal state of the central processing unit to the main memory; clearing all entries from the checkpoint store; and recommencing execution of the central processing unit.
- 6. The method of claim 5, wherein the internal state of the central processing unit includes:
contents of internal registers in the central processing unit; and dirty cache lines associated with the central processing unit.
- 7. The method of claim 1, further comprising delaying I/O operations so that the I/O operations are performed after a subsequent checkpointing operation.
- 8. The method of claim 1, wherein if an error occurs during execution of the computer system, the method further comprises:
restoring a state of the main memory to a preceding checkpoint by replacing values that have been modified with prior values retrieved from the checkpoint store; and restoring an internal state of the central processing unit from the main memory.
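The write, checkpoint, and rollback steps recited in claims 1 through 8 can be illustrated with a minimal software sketch. It is only a toy model under stated assumptions, not the claimed hardware: the class name `LoggingMemoryController`, the one-word-per-address granularity, setting the log bit after an entry is logged, and clearing the log bits at each checkpoint are all assumptions that the claims do not spell out.

```python
class LoggingMemoryController:
    """Toy model: logs the old value on the first write after a checkpoint."""

    def __init__(self, size):
        self.memory = [0] * size          # main memory, one word per address
        self.log_bits = [False] * size    # one log bit per address (assumed granularity)
        self.checkpoint_store = []        # entries of (address, existing_value)

    def write(self, address, data):
        if not self.log_bits[address]:
            # Log bit not set: copy the existing value into a new entry.
            self.checkpoint_store.append((address, self.memory[address]))
            # Assumption: the bit is set so later writes skip the logging step.
            self.log_bits[address] = True
        self.memory[address] = data       # store the new data in either case

    def read(self, address):
        # Reads are satisfied directly from main memory.
        return self.memory[address]

    def checkpoint(self):
        # Clearing all entries makes the current memory state the new checkpoint.
        self.checkpoint_store.clear()
        # Assumption: log bits are reset together with the store.
        self.log_bits = [False] * len(self.memory)

    def rollback(self):
        # Replace modified values with the prior values from the store.
        for address, existing_value in self.checkpoint_store:
            self.memory[address] = existing_value
        self.checkpoint()                 # memory now equals the checkpoint again


mc = LoggingMemoryController(8)
mc.write(3, 10)
mc.checkpoint()      # state with memory[3] == 10 becomes the checkpoint
mc.write(3, 20)      # logs (3, 10)
mc.write(3, 30)      # log bit already set, nothing logged
mc.write(5, 7)       # logs (5, 0)
mc.rollback()        # memory[3] is 10 again, memory[5] is 0 again
```

Because each address is logged at most once between checkpoints, the store holds exactly the values needed to return to the preceding checkpoint, regardless of how many times an address is overwritten.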
- 9. A computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for facilitating reliable execution in a computer system by keeping track of write operations to a main memory of the computer system in order to undo the write operations if necessary, the method comprising:
receiving a write operation directed to the main memory at a memory controller, the write operation including data to be written to the main memory and a write address specifying a location in the main memory into which the data is to be written;
examining a log bit associated with the write address, wherein the log bit indicates whether an existing value from the write address in main memory has been copied to a checkpoint store;
if the log bit is not set,
creating a new entry for the write address in the checkpoint store,
retrieving the existing value from the write address in the main memory, and
storing the existing value to the new entry in the checkpoint store;
storing the data to be written to the write address in the main memory; and
periodically performing a checkpointing operation, wherein performing the checkpointing operation involves clearing all entries from the checkpoint store.
- 10. The computer-readable storage medium of claim 9, wherein the method further comprises:
receiving a read operation at the memory controller, the read operation being directed to a read address specifying a location in the main memory to be read from; and retrieving data from the read address in the main memory to satisfy the read operation.
- 11. The computer-readable storage medium of claim 9, wherein the checkpoint store is organized as a first-in-first-out (FIFO) buffer.
- 12. The computer-readable storage medium of claim 9, wherein if the new entry is to be added to the checkpoint store and no room exists in the checkpoint store for the new entry, the method further comprises performing a checkpointing operation to clear all entries from the checkpoint store.
- 13. The computer-readable storage medium of claim 9, wherein performing the checkpointing operation involves:
stopping execution of a central processing unit in the computer system; storing an internal state of the central processing unit to the main memory; clearing all entries from the checkpoint store; and recommencing execution of the central processing unit.
- 14. The computer-readable storage medium of claim 13, wherein the internal state of the central processing unit includes:
contents of internal registers in the central processing unit; and dirty cache lines associated with the central processing unit.
- 15. The computer-readable storage medium of claim 9, wherein the method further comprises delaying I/O operations so that the I/O operations are performed after a subsequent checkpointing operation.
- 16. The computer-readable storage medium of claim 9, wherein if an error occurs during execution of the computer system, the method further comprises:
restoring a state of the main memory to a preceding checkpoint by replacing values that have been modified with prior values retrieved from the checkpoint store; and restoring an internal state of the central processing unit from the main memory.
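The FIFO organization, the forced checkpoint when the store is full, and the delayed I/O recited in claims 11, 12, and 15 can be sketched together as follows. This is an illustrative model only: the name `BoundedCheckpointController`, the `capacity` parameter, the `request_io` method, and the representation of an I/O operation as a Python callable are hypothetical. Releasing deferred I/O immediately after the store is cleared is likewise an assumption; the claims say only that the I/O is performed after a subsequent checkpointing operation.

```python
from collections import deque


class BoundedCheckpointController:
    """Toy model: fixed-capacity FIFO checkpoint store with deferred I/O."""

    def __init__(self, size, capacity):
        self.memory = [0] * size
        self.log_bits = [False] * size
        self.capacity = capacity          # maximum number of entries (assumed >= 1)
        self.store = deque()              # FIFO of (address, existing_value)
        self.pending_io = []              # I/O actions held back until a checkpoint
        self.checkpoints_taken = 0

    def write(self, address, data):
        if not self.log_bits[address]:
            if len(self.store) >= self.capacity:
                # No room for the new entry: checkpoint first to clear the store.
                self.checkpoint()
            self.store.append((address, self.memory[address]))
            self.log_bits[address] = True   # assumption, as in the claims' log bit
        self.memory[address] = data

    def request_io(self, action):
        # Delay the I/O operation instead of performing it right away.
        self.pending_io.append(action)

    def checkpoint(self):
        self.store.clear()
        self.log_bits = [False] * len(self.memory)
        self.checkpoints_taken += 1
        # Assumption: deferred I/O is released once the checkpoint is complete.
        pending, self.pending_io = self.pending_io, []
        for action in pending:
            action()


sent = []
ctl = BoundedCheckpointController(size=8, capacity=2)
ctl.request_io(lambda: sent.append("packet"))
ctl.write(0, 1)
ctl.write(1, 2)      # store is now full; the I/O is still pending
ctl.write(2, 3)      # forces a checkpoint, which also releases the I/O
```

One plausible reason for deferring I/O in this way is that an externally visible operation cannot be undone by restoring memory, so holding it until the next checkpoint keeps it out of any interval that might be rolled back.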
- 17. An apparatus that facilitates reliable execution in a computer system by keeping track of write operations to a main memory of the computer system in order to undo the write operations if necessary, comprising:
a memory controller coupled to the main memory;
a receiving mechanism that is configured to receive a write operation directed to the main memory at the memory controller, the write operation including data to be written to the main memory and a write address specifying a location in the main memory into which the data is to be written;
a checkpoint store, coupled to the memory controller, which is configured to store prior versions of values that have been modified in main memory;
a lookup mechanism that is configured to look up a log bit associated with the write address, wherein the log bit indicates whether a prior value from the write address in main memory has been copied to the checkpoint store;
a writing mechanism that is configured to store the data to be written to the write address in the main memory;
wherein if the log bit for an entry is not set, the writing mechanism is configured to
create a new entry for the write address in the checkpoint store,
retrieve an existing value from the write address in the main memory, and
store the existing value to the new entry in the checkpoint store; and
a checkpointing mechanism that is configured to periodically perform a checkpointing operation, wherein performing the checkpointing operation involves clearing all entries from the checkpoint store.
- 18. The apparatus of claim 17,
wherein the receiving mechanism is additionally configured to receive a read operation at the memory controller, the read operation being directed to a read address specifying a location in the main memory to be read from; and further comprising a reading mechanism that is configured to retrieve data from the read address in the main memory to satisfy the read operation.
- 19. The apparatus of claim 17, wherein the checkpoint store is organized as a first-in-first-out (FIFO) buffer.
- 20. The apparatus of claim 17, wherein if the new entry is to be added to the checkpoint store and no room exists in the checkpoint store for the new entry, the checkpointing mechanism is configured to perform a checkpointing operation to clear all entries from the checkpoint store.
- 21. The apparatus of claim 17, wherein the checkpointing mechanism is configured to:
stop execution of a central processing unit in the computer system; store an internal state of the central processing unit to the main memory; clear all entries from the checkpoint store; and recommence execution of the central processing unit.
- 22. The apparatus of claim 21, wherein the internal state of the central processing unit includes:
contents of internal registers in the central processing unit; and dirty cache lines associated with the central processing unit.
- 23. The apparatus of claim 17, further comprising an I/O processing mechanism that is configured to delay I/O operations so that the I/O operations are performed after a subsequent checkpointing operation.
- 24. The apparatus of claim 17, further comprising a rollback mechanism that is configured to:
restore a state of the main memory to a preceding checkpoint by replacing values that have been modified with prior values retrieved from the checkpoint store if an error occurs during execution of the computer system; and restore an internal state of the central processing unit from the main memory.
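The checkpointing and rollback mechanisms of claims 21, 22, and 24, which save and restore the internal state of the central processing unit, can be sketched as below. This is a simplified model under explicit assumptions: the `Cpu` and `CheckpointingSystem` classes, the register names, the `REG_SAVE_BASE` save area in main memory, and discarding cached data on rollback are all hypothetical choices made for illustration and are not taken from the claims.

```python
class Cpu:
    """Hypothetical CPU model with registers and a write-back cache."""

    def __init__(self):
        self.registers = {"pc": 0, "acc": 0}
        self.dirty_lines = {}             # address -> value not yet in main memory
        self.running = True


class CheckpointingSystem:
    REG_SAVE_BASE = 0                     # hypothetical reserved area for register state

    def __init__(self, size):
        self.memory = [0] * size
        self.log_bits = [False] * size
        self.store = []                   # (address, existing_value) entries
        self.cpu = Cpu()

    def mem_write(self, address, data):
        # Logged write path: save the old value on the first write to an address.
        if not self.log_bits[address]:
            self.store.append((address, self.memory[address]))
            self.log_bits[address] = True
        self.memory[address] = data

    def checkpoint(self):
        self.cpu.running = False          # stop execution
        # Store the internal state to main memory: dirty cache lines first ...
        for address, value in self.cpu.dirty_lines.items():
            self.mem_write(address, value)
        self.cpu.dirty_lines.clear()
        # ... then the register contents, in a fixed (sorted-name) order.
        for i, name in enumerate(sorted(self.cpu.registers)):
            self.mem_write(self.REG_SAVE_BASE + i, self.cpu.registers[name])
        self.store.clear()                # clear all entries
        self.log_bits = [False] * len(self.memory)
        self.cpu.running = True           # recommence execution

    def rollback(self):
        self.cpu.running = False
        # Restore main memory to the preceding checkpoint.
        for address, existing_value in self.store:
            self.memory[address] = existing_value
        self.store.clear()
        self.log_bits = [False] * len(self.memory)
        # Assumption: cached data newer than the checkpoint is discarded.
        self.cpu.dirty_lines.clear()
        # Restore the register contents from the save area in main memory.
        for i, name in enumerate(sorted(self.cpu.registers)):
            self.cpu.registers[name] = self.memory[self.REG_SAVE_BASE + i]
        self.cpu.running = True


system = CheckpointingSystem(16)
system.cpu.registers.update(pc=100, acc=5)
system.cpu.dirty_lines[8] = 42
system.checkpoint()                       # memory[8] == 42 is part of the checkpoint
system.cpu.registers.update(pc=200, acc=9)
system.mem_write(8, 99)
system.rollback()                         # memory[8] and both registers are restored
```

In this sketch the internal state is written before the store is cleared, following the order of the steps in claim 21, so any entries logged while saving that state are discarded along with the rest and the saved state becomes part of the new checkpoint.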
RELATED APPLICATION
[0001] The subject matter of this application is related to the subject matter in a co-pending non-provisional application by the same inventors as the instant application and filed on the same day as the instant application entitled, “Method and Apparatus for Checkpointing to Facilitate Reliable Execution,” having serial number TO BE ASSIGNED, and filing date TO BE ASSIGNED (Attorney Docket No. SUN-P5330-RSH).