The present application claims priority to German Patent Application No. 10 2004 058 288.2, which was filed in the German Patent Office on Dec. 2, 2004, and the entire contents of which are hereby incorporated by reference.
The exemplary embodiment and/or exemplary method of the present invention relates to a device and a method for correcting errors in a processor having two execution units or two CPUs as well as a corresponding processor.
Due to the fact that semiconductor structures are becoming smaller and smaller, an increase in transient processor errors is expected, which are caused e.g. by cosmic radiation. Even today transient errors are already occurring, which are caused by electromagnetic radiation or induction of interferences into the supply lines of the processors.
According to the related art, errors in a processor are detected by additional monitoring devices or by a redundant processor or by using a dual-core processor.
A dual-core processor or processor system is made up of two execution units, in particular two CPUs (master and checker), which are processing the same program in parallel. The two CPUs (central processing unit) may operate in a clock-synchronized manner, that is, in parallel (in a lockstep mode) or in a manner that is time-delayed by a few clock cycles. Both CPUs receive the same input data and process the same program, although the outputs of the dual core are driven exclusively by the master. In each clock cycle, the outputs of the master are compared to the outputs of the checker and are thus verified. If the output values of the two CPUs do not agree, then this means that at least one of the two CPUs is in a faulty state.
In an exemplary architecture for a dual core processor, a comparator for this purpose compares the outputs (instruction address, data out, control signals) of both cores (all comparisons occurring in parallel):
Instruction address (Without a check of the instruction address, the master could address the wrong instruction without this being noticed, which would then be processed in both processors without being detected.)
Data out
Control signals such as write enable or read enable
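By way of illustration only, the per-cycle comparison of the master and checker outputs listed above may be sketched as follows. The type CoreOutputs and the function outputs_agree are hypothetical names chosen for this sketch and do not appear in the described architecture; a hardware comparator would perform all comparisons in parallel rather than sequentially.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreOutputs:
    """Outputs of one CPU in one clock cycle (hypothetical model)."""
    instruction_address: int
    data_out: int
    write_enable: bool
    read_enable: bool


def outputs_agree(master: CoreOutputs, checker: CoreOutputs) -> bool:
    """Return True if every compared output of master and checker matches.

    A False result means that at least one of the two CPUs is in a
    faulty state; it does not tell which one.
    """
    return (
        master.instruction_address == checker.instruction_address
        and master.data_out == checker.data_out
        and master.write_enable == checker.write_enable
        and master.read_enable == checker.read_enable
    )
```

A mismatch in any single field, for example in the instruction address alone, is sufficient to signal an error.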
The error is signaled to the outside and normally results in a shutdown of the affected control unit. With the expected increase in transient errors, this sequence would result in a more frequent shutdown of control units. Since in the case of transient errors there is no damage to the processor, it would be helpful to make the processor available again to the application as quickly as possible without the system shutting down and a restart having to be performed.
Methods for correcting transient errors while avoiding a complete restart of the processor are rarely found for processors working in a master/checker operation.
The publication by Jiri Gaisler, “Concurrent error-detection and modular fault-tolerance in a 32-bit processing core for embedded space flight applications”, from the Twenty-Fourth International Symposium on Fault-Tolerant Computing, pages 128-130, June 1994, refers to a processor having integrated error detection and recovery mechanisms (e.g. parity checking and automatic instruction repetition), which is capable of working in master/checker operation. The internal error detection mechanisms in the master or in the checker always trigger a recovery operation only locally in one processor. As a result, the two processors lose their synchronicity with respect to each other and it is no longer possible to compare the outputs. The only option for synchronizing the two processors again is to restart both processors during a non-critical phase of the mission.
Furthermore, the document by Yuval Tamir and Marc Tremblay entitled, “High-performance fault-tolerant VLSI systems using micro rollback” in IEEE Transactions on Computers, volume 39, pages 548-554, 1990, refers to a method called “micro rollback”, by which the complete state of an arbitrary VLSI system can be rolled back by a certain number of clock cycles. For this purpose, all registers and the register file as a whole are extended by an additional FIFO buffer. According to this method, new values are not written directly into the register itself, but rather are first stored in the buffer and are transferred to the register only after having been checked. To roll back the entire processor state, the contents of all FIFO buffers are marked as invalid. If it is to be possible to roll back the system by up to k clock cycles, then k buffers are needed for each register.
The processors presented in the related art thus on the one hand have above all the defect that they lose their synchronicity as a result of the recovery operations since recovery is always performed only locally in one processor. The basic idea of the described method (micro rollback) is to extend each component of a system independently to include rollback capability so as to be able to roll back the entire system state in a consistent manner in the case of an error. The architecture-specific interconnection of the individual components (register, register file, . . . ) does not have to be considered for this purpose since indeed the entire system state is always rolled back consistently. The disadvantage of the method is a large hardware overhead, which grows in proportion to the size of the system (e.g. the number of pipeline stages in the processor).
An objective of the exemplary embodiment and/or exemplary method of the present invention is that of correcting particularly transient errors without a system or processor restart while at the same time avoiding an excessively large expenditure, particularly of hardware.
This objective may be achieved by a method and a device for correcting errors in a processor having two execution units and the corresponding processor, registers being provided in which instructions and/or associated information can be stored, the instructions being processed redundantly in both execution units and comparison means such as for example a comparator being included, which are designed in such a way that by comparing the instructions and/or the associated information a deviation and thus an error is detected, a division of the registers of the processor into first registers and second registers being advantageously provided, the first registers being designed in such a way that a specifiable state of the processor and contents of the second registers are derivable from them, means for a rollback being included, which are designed in such a way that at least one instruction and/or the information in the first registers are rolled back and are executed anew and/or restored.
According to the exemplary embodiment and/or exemplary method of the present invention, only a part of the register contents of a processor is needed to be able to derive the entire processor state. The set of all registers of a processor is divided into two subsets:
“Essential registers”: The contents of these first registers are sufficient to be able to build up a consistent processor state.
“Derivable registers”: These second registers may be completely derived from the essential registers.
In this approach it is sufficient to protect only the essential registers against faulty values or to provide them with rollback capability in order to be able to roll the entire processor back to an earlier state in a consistent manner. Consequently, the means for rolling back are suitably assigned only to the first registers and/or are only contained in these, or the means for rolling back are designed in such a way that at least one instruction and/or the information is rolled back only in the first registers.
Thus, the comparison means are suitably also provided in front of the first registers and/or in front of the outputs.
For this purpose, at least one, in particular two buffer components are advantageously assigned to each first register, which also applies to the register files. That is to say, the registers are organized in at least one register file and at least one, in particular two buffer components having each one buffer memory for addresses and one buffer memory for data are assigned to this register file.
An arrangement, structure or apparatus is suitably included to specify and/or indicate a validity of the buffer component or buffer memory e.g. by a valid flag, the validity of the instructions and/or information being specifiable and/or ascertainable via a validity identifier (e.g. valid flag) and this validity identifier being reset either via a reset signal or via a gate signal, in particular of an AND gate.
According to the exemplary embodiment and/or exemplary method of the present invention, both approaches are provided, namely, that the two execution units and thus also the exemplary embodiment and/or exemplary method of the present invention work in parallel without clock cycle offset or with clock cycle offset.
To this end, at least all first registers suitably exist in duplicate and are in each case assigned once to an execution unit.
Advantageously, the rollback is divided into two phases, initially the first registers, that is, in particular the instructions and/or information of the first registers, being rolled back and then the contents of the second registers being derived from them. In the process, the contents of the second registers are suitably derived by a trap/exception mechanism.
In a specific embodiment for a further increase in security in addition to the rollback at least one bit flip, that is, bit dropout, of a first register of an execution unit is corrected in that the bit flip is indicated in both execution units. This has the advantage that it preserves the synchronicity of both execution units with or without clock cycle offset. For this purpose, the bit flip is simultaneously indicated in both execution units if the execution units are working without clock cycle offset, and the bit flip is indicated in an offset manner in both execution units in accordance with a specifiable clock cycle offset if the execution units are working with this clock cycle offset.
In this manner, the mechanism provided by us corrects a transient error within a few clock cycles.
Additional advantages and advantageous refinements are derived from the description and the features which are described herein.
Two embodiments or versions of the recovery mechanism are described herein. In the first version, “basic instruction retry mechanism” (BIRM), the essential registers are protected against having faulty data written to them (the data are checked before being written). Valid contents in the essential registers are sufficient to generate at any time a valid total processor state (the contents of the derivable registers being derivable from the essential registers).
For performance reasons, in the second version, “improved instruction retry mechanism” (IIRM), the essential registers are expanded to include rollback capability and allow for faulty values to be detected only when they have already been written to the essential registers (the error detection in this case working parallel with respect to the writing of the data). In the IIRM, the rollback occurs in two steps: First, all essential registers are rolled back to a valid state. In the second step, the derivable registers are filled with the derived values. The refilling of the derivable registers is accomplished in both versions by the trap/exception mechanism already present in most processors (requirements for the mechanism are described in chapter 4).
The exemplary embodiment and/or exemplary method of the present invention reduces the hardware overhead in comparison to known (micro-)rollback technologies on the basis of the following points:
In contrast to the related art, the recovery operations in the architecture provided by us do not destroy the synchronicity between master and checker.
For this purpose, first a dual-core architecture working in lockstep mode, i.e. in a clock-synchronized manner, is described, which is capable of automatically correcting internal transient errors within a few clock cycles. In order to allow for a precise error diagnosis, internal comparators are additionally integrated into the dual core. A large part of the transient errors may be corrected by repeating instructions in which the error occurred. In the approach described, the trap/exception mechanism already present in conventional processors may be used for repeating instructions, thus producing no additional hardware overhead.
Errors arising from bit flips in the register file can generally not be corrected by the repetition of instructions. Such errors are reliably detected e.g. by parity and are reported to the operating system by a special trap. The error information provided is called precise, which means that the operating system is also told which instruction attempted to read the faulty register value. Thus the operating system is able to initiate an appropriate action for correcting the error. Examples of possible actions are, inter alia, calling a task-specific error handler, repeating the affected task or restarting the entire processor in the event that an error cannot be corrected (e.g. an error in the memory structures of the operating system).
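Purely as an illustrative sketch, the parity-based detection of bit flips in the register file may be modeled as shown below. Even parity over non-negative data words is an assumption of this sketch, and the names ParityRegisterFile, RegisterParityError and inject_bit_flip are hypothetical; the exception merely stands in for the special trap reported to the operating system.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit of a non-negative data word (assumed scheme)."""
    return bin(word).count("1") & 1


class RegisterParityError(Exception):
    """Stands in for the special trap raised on a register parity error."""

    def __init__(self, index: int):
        super().__init__(f"parity error in register {index}")
        self.index = index  # tells the handler which register was read


class ParityRegisterFile:
    """Register file whose data words are each secured by one parity bit."""

    def __init__(self, size: int):
        self._words = [0] * size
        self._parity = [0] * size

    def write(self, index: int, value: int) -> None:
        # The parity bit is generated when the value is written.
        self._words[index] = value
        self._parity[index] = parity_bit(value)

    def read(self, index: int) -> int:
        # The parity is checked on every read access.
        value = self._words[index]
        if parity_bit(value) != self._parity[index]:
            raise RegisterParityError(index)
        return value

    def inject_bit_flip(self, index: int, bit: int) -> None:
        """Test helper: flip one stored data bit without updating parity."""
        self._words[index] ^= 1 << bit
```

A single parity bit detects any odd number of flipped bits in a word; it does not by itself allow the original value to be reconstructed, which is why the error is handed to the operating system.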
The exemplary embodiment and/or exemplary method of the present invention thus provides a method, a device and a processor, which is able to detect transient errors reliably and to correct them within a few clock cycles. The processor is designed as a dual-core processor. It is made up of two CPUs (master and checker), both of which process the same program in parallel. Error detection is achieved by comparing various selected signals of the master and the checker. Transient errors are mainly corrected by instruction repetition. Bit flips in the register file are detected by parity checking and are reported to the operating system. As mentioned, the mechanism for instruction repetition is described in two variants: The first variant called “basic instruction retry mechanism” (BIRM) is designed to minimize hardware overhead, but may in some architectures also influence the performance of the processor negatively. The second variant called “improved instruction retry mechanism” (IIRM) entails less performance loss, but creates a greater hardware overhead instead.
On the one hand, dual-core processors are used for this purpose, which work in a lockstep mode. The term lockstep mode signifies in this context that both CPUs (master and checker) work in a clock-synchronized manner with respect to each other and process the same instruction at the same time. Although the lockstep mode represents an uncomplicated and cost-effective variant for implementing a dual-core processor, it also entails an increased susceptibility of the processor to common mode errors. Common mode errors are defined as errors that occur simultaneously in different subcomponents of a system, have the same effect and were caused by the same failure. Since in a dual-core processor both CPUs are accommodated in a common housing and are supplied by a common voltage source, certain failures (e.g. voltage fluctuations) may simultaneously affect both CPUs. Now if both CPUs are in exactly the same state, which is always the case in lockstep operation, then the probability that the failure affects both CPUs in exactly the same manner cannot be neglected. Such an error (common mode error) would not be detected by a comparator since both the master as well as the checker would provide the same incorrect result.
The exemplary embodiment and/or exemplary method of the present invention thus provides a processor, which is able to detect transient errors reliably and to correct them within a few clock cycles. The processor is designed as a dual-core processor. It is made up of two CPUs (master and checker), both of which process the same program in parallel. Error detection is achieved by comparing various selected signals of the master and the checker. In order to reduce the susceptibility to common mode errors, master and checker work at a clock cycle offset, which means that the checker always runs behind the master by a defined time interval (e.g. 1.5 clock cycles) (the two CPUs therefore being at no time in the same state). This has the consequence that the results of the master can only be checked by the comparator following this defined time lag since it is only then that the corresponding signals of the checker are provided. The results of the master can thus only be checked when the results of the checker are available and must be buffered, i.e. stored temporarily, in the meantime.
These two examples of the architecture having a clock cycle offset and having no clock cycle offset illustrate also the multifarious possible uses of the subject matter of our invention. In the following, both examples will be presented, there being no strict separation made with regard to the subject matter of the exemplary embodiment and/or exemplary method of the present invention and statements and representations presented with respect to it. Thus, according to the exemplary embodiment and/or exemplary method of the present invention, the examples corresponding to all 14 Figures can be combined arbitrarily.
If an error is detected, then in effect the entire dual core is rolled back to a state prior to the occurrence of the error, from which the program execution is resumed without having to perform a restart or a shutdown.
The following description with the figures shows, among other things, how a recovery mechanism may be integrated into a dual-core processor. In this instance, the architecture used serves as an exemplary architecture (the use of the recovery mechanism according to the exemplary embodiment and/or exemplary method of the present invention being not bound e.g. to a three-stage pipeline). The only requirement placed on the processor architecture is that it is a pipeline architecture, which has a mechanism, in particular an exception/trap mechanism that satisfies the requirements. The control signals (e.g. write enable, read enable etc.) that lead to the I/O are in all figures generally designated as control.
Instruction Repetition
In
The error is signaled to the outside and in this case now does not result in a shutdown of the affected control unit. Since in the case of transient errors there is no damage to the processor the processor is now made available again to the application as quickly as possible without the system shutting down and a restart having to be performed.
The recovery mechanism according to the exemplary embodiment and/or exemplary method of the present invention is based on error detection and instruction repetition. If an error is detected in any arbitrary pipeline stage, then the instruction in the last pipeline stage is always repeated. The repetition of an instruction in the last pipeline stage has the consequence that all other instructions in the front pipeline stages (the subsequent instructions) are also repeated, as a result of which the entire pipeline is again filled with new values. In this case, the instruction repetition is carried out by the trap (exception) mechanism already present in most conventional processors.
The trap (exception) mechanism for this purpose must satisfy the following requirements: As soon as a trap is triggered, any instruction present in the pipeline of the processor at this time will be prevented from changing the processor state. External write accesses (e.g. to the data memory, to additional modules such as network interfaces or D/A converters, . . . ) are likewise prevented. In the subsequent clock cycle, the system jumps into a trap routine assigned to the trap. A trap routine may be terminated again by the instruction “return from trap routine”, which results in the execution being resumed again with the instruction that was present in the last pipeline stage at the time the trap was triggered.
Now, in order to repeat an instruction with the aid of the trap mechanism, an “empty” trap routine is called (an empty trap routine is defined as a routine made up exclusively of the instruction “return from trap routine”). Since it is an “empty” trap routine, it is again terminated immediately after being called. The pipeline is emptied and the execution is resumed again precisely with the instruction that was present in the last pipeline stage at the time the trap was triggered. This empty trap routine is called an instruction retry trap. The instruction retry trap can bring about a valid processor state only if certain registers have a valid and consistent content. The set of these registers is called essential registers and includes all registers the contents of which determine the processor state following a trap call. This includes above all the register file, the status register and, depending on the architecture, various control registers such as an exception vector table for example. The most important register of the essential registers is the register that stores the address of the instruction in the last pipeline stage since it is precisely this address to which the system must jump when terminating the trap. In
Any faulty value that is written into the essential registers must be reliably detected as faulty. In the first version of the instruction retry mechanism (BIRM), all values that are written to the essential registers are checked before they are actually taken over into the registers. The values are checked by a comparator which compares the signals of the master with those of the checker in each clock cycle (
The diagram in Table 1 shows the function of the basic instruction retry mechanism (BIRM) with the aid of an example. The diagram shows (under Instructions) in which pipeline stage a particular instruction is found during a particular clock cycle.
Legend:
IF Instruction Fetch
DEC Decode
EX Execute
RTR Return from Trap Routine
Stop by Tr (Trap) If a synchronous trap is activated no new values are written to registers/buffers
It is assumed that a transient error occurs at any stage of the instruction F (cycle 5-7). In clock cycle 7 at the latest, this error is detected by the comparator, instruction F is prevented from writing its results, and the InstructionRetryTrap is triggered. The InstructionRetryTrap is an empty trap and is thus only made up of the “return from trap routine” (RTR) instruction. In cycle 10, the RTR instruction has already reached the execute stage, which results in a renewed fetching of the previously faulty instruction F in clock cycle 11. At the beginning of clock cycle 14, the instruction F was repeated entirely and it wrote its correct results.
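The check-before-write principle of the BIRM may be illustrated by the following sketch, which is not cycle-accurate and does not reproduce Table 1. The names birm_write and run_with_retry, the dictionary used as register file and the bound max_retries are assumptions made for this sketch only.

```python
class InstructionRetryTrap(Exception):
    """Stands in for the trap triggered when master and checker disagree."""


def birm_write(register_file: dict, index: int,
               master_value: int, checker_value: int) -> None:
    """Write to an essential register only after the comparison succeeded."""
    if master_value != checker_value:
        # The faulty value never reaches the register.
        raise InstructionRetryTrap()
    register_file[index] = master_value


def run_with_retry(execute_on_master, execute_on_checker,
                   register_file: dict, index: int,
                   max_retries: int = 3) -> int:
    """Repeat one instruction until both CPUs agree.

    Returns the number of repetitions that were needed. The retry
    bound is an assumption used here to model a permanent error.
    """
    for attempt in range(max_retries + 1):
        try:
            birm_write(register_file, index,
                       execute_on_master(), execute_on_checker())
            return attempt
        except InstructionRetryTrap:
            continue  # empty trap routine: simply execute again
    raise RuntimeError("error persists after retries; possibly permanent")
```

If, for example, the master delivers a corrupted result once and the correct result on repetition, the register receives only the correct value and one repetition is reported.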
The disadvantage of the basic IRM (BIRM) is that the comparator in many architectures will lie in the time-critical path since the new values can only be taken over into a register if they have already been compared. The computation of new data by the ALU, the comparison of the data of the master and the checker and the triggering of the trap mechanism must thus all occur in the same clock cycle (the potentially critical path is shown in
In the second version of the instruction retry mechanism (IIRM), the following strategy was chosen in order to shorten the time-critical path (
The diagram in Table 2 shows the function of the improved instruction retry mechanism (IIRM) with the aid of an example.
Legend:
IF Instruction Fetch
Dec Decode
EX Execute
RTR Return from Trap Routine
Stop by RB (Rollback) During rollback no new values are written to registers/buffers
Stop by Tr (Trap) If a synchronous trap is activated no new values are written to registers/buffers
Iv (invalidated) After rollback the buffer is invalidated
dc (don't care) We don't care how these registers are used while a trap is processed
PSW Program Status Word
PC in Pipe 2 Register that holds the address of the actual instruction in the EX stage
The upper section of Table 2 shows (under Instructions) in which pipeline stage a particular instruction is found during a particular clock cycle. The lower section of the diagram lists the contents of the rollback-capable register (buffer and permanent register) during the individual clock cycles. For the rollback-capable register file there is an indication for every clock cycle what value is contained in the buffer and what value was last entered into the register file itself. A value such as A or B means that it is a result of the instruction A or the instruction B. It is assumed that a transient error occurs at any stage of the instruction F (clock cycle 5-7). In clock cycle 8 at the latest, this error is detected by one of the comparators, the subsequent instruction (G) is prevented from writing its results, and the rollback is triggered. At the start of clock cycle 9, all registers of the EssentialRegisterSet are already rolled back (the buffer having been marked as invalid, which makes the values in the permanent registers into the most current valid values), and the InstructionRetryTrap is triggered. The triggered trap prevents instruction H from writing its results. The InstructionRetryTrap is an empty trap and is thus only made up of the “return from trap routine” (RTR) instruction. In clock cycle 12, the RTR instruction has already reached the execute stage, which results in a renewed fetching of the previously faulty instruction F in clock cycle 13. At the beginning of clock cycle 16, the instruction F was repeated entirely and it wrote its correct results.
External Outputs
The described recovery mechanism must ensure that transient errors within the dual core are prevented from advancing to the external components (cache, data storage unit, additional modules, . . . ). In the case of the BIRM, this condition is implicitly satisfied since the InstructionRetryTrap is already triggered in the same clock cycle if errors become visible in the output lines (lines 7 and 8 in
In contrast to BIRM, in the second version of the recovery mechanism (IIRM), an error is detected only when the faulty data have already been written. To prevent faulty data from entering external components, a buffer may be interconnected between the dual core and the I/O control unit of the system. New data are first written into the buffer and are thus delayed by one clock cycle until the check of the data has been concluded. Correct data are passed on to the I/O control unit. If on the other hand the data are classified as faulty, then the content of the buffer is marked as invalid using the rollback signal. Marking the buffer as invalid may be implemented in any manner desired (e.g. reset of the buffer register, deletion of the write enable bit in the control signals to the I/O control unit, . . . ).
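The output buffer described above may be sketched as follows, under the assumption that data are delayed by exactly one clock cycle and that data arriving in the rollback cycle itself are also discarded. The class name OutputBuffer and the method clock are hypothetical.

```python
class OutputBuffer:
    """One-cycle buffer between the dual core and the I/O control unit."""

    def __init__(self):
        self._pending = None  # data written in the previous cycle, if any

    def clock(self, new_data, rollback: bool):
        """Advance one clock cycle.

        Returns the data passed on to the I/O control unit in this
        cycle, or None if nothing is passed on.
        """
        if rollback:
            # The buffered data were classified as faulty: mark the
            # buffer as invalid and accept no new data in this cycle.
            self._pending = None
            return None
        forwarded = self._pending
        self._pending = new_data
        return forwarded
```

In this model, data written in the cycle before a rollback never reach the I/O control unit, while all other data arrive there one cycle late.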
Permanent Errors
In order to be able to distinguish permanent errors from transient errors, an error counter may be used. Most secure is the use of an independent component which ascertains the error frequency within a certain time interval by monitoring the two trap lines (InstructionRetryTrap and RegErrorTrap) used by the recovery mechanism or the rollback line. If the error frequency per unit of time exceeds a certain threshold value, the error may be regarded as permanent.
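A minimal sketch of such an error counter is given below. The sliding window measured in clock cycles, the strict "greater than" comparison against the threshold and the names ErrorRateMonitor, record_error and is_permanent are assumptions of this sketch.

```python
from collections import deque


class ErrorRateMonitor:
    """Counts recovery events within a sliding window of clock cycles."""

    def __init__(self, window_cycles: int, threshold: int):
        self.window_cycles = window_cycles
        self.threshold = threshold
        self._events = deque()  # cycle numbers of observed errors

    def record_error(self, cycle: int) -> None:
        """Call whenever a trap line or the rollback line is active."""
        self._events.append(cycle)

    def is_permanent(self, now: int) -> bool:
        """True if more than `threshold` errors fall inside the window."""
        # Drop events that are older than the observation window.
        while self._events and self._events[0] <= now - self.window_cycles:
            self._events.popleft()
        return len(self._events) > self.threshold
```

With a window of 100 cycles and a threshold of 2, three errors within 30 cycles are regarded as a permanent error, whereas the same three errors are forgotten once they lie more than 100 cycles in the past.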
Bit Flips in the Register File
Not every transient error, of course, can be corrected by instruction repetition. Errors arising from bit flips in the register file are not corrected even by repeated readout. To be able to correct such errors, an additional mechanism was integrated, which detects register errors as such and reports them to the operating system. For this purpose, all data values in the register file are secured by parity bits (the parity bit being generated by a parity generator connected downstream of the ALU:
The described recovery mechanism is fundamentally based on error detection by comparison of the output signals of master and checker and on error correction by instruction repetition. Master and checker now work for example at a clock cycle offset, the checker always running behind the master by a defined time interval (k clock cycles, where k is a real number). The time interval may be made up of a defined number of full clock cycles or a defined number of half cycles. In order to allow for a comparison, the output signals of the master must be temporarily stored by appropriate delay components until the corresponding output signals of the checker are available.
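The temporary storage of the master outputs may be illustrated by the following sketch, which assumes an offset of k full clock cycles with k of at least 1; half-cycle offsets are not modeled. The names DelayLine and first_mismatch are hypothetical.

```python
from collections import deque


class DelayLine:
    """Delays a signal by k full clock cycles (k >= 1)."""

    def __init__(self, k: int):
        if k < 1:
            raise ValueError("k must be at least 1 in this sketch")
        self._slots = deque([None] * k, maxlen=k)

    def clock(self, value):
        """Store `value` and return the value stored k cycles earlier."""
        delayed = self._slots[0]
        self._slots.append(value)  # maxlen drops the oldest entry
        return delayed


def first_mismatch(master_outputs, checker_outputs, k: int):
    """Return the first cycle in which a mismatch is seen, else None.

    checker_outputs[i] is assumed to become available k cycles after
    master_outputs[i].
    """
    delay = DelayLine(k)
    padded_master = list(master_outputs) + [None] * k
    padded_checker = [None] * k + list(checker_outputs)
    for cycle, (m, c) in enumerate(zip(padded_master, padded_checker)):
        delayed_master = delay.clock(m)
        if cycle >= k and delayed_master != c:
            return cycle
    return None
```

For k = 2, a faulty second output is thus only noticed in cycle 3, that is, two cycles after the master produced it.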
Rollback of the Processor State
The rollback of the processor state occurs at the instruction level and is accomplished by a mechanism called “instruction retry mechanism” (IRM). The goal of the IRM is to roll the entire processor back into a state it was in prior to the occurrence of the error. For this purpose, the mechanism uses mainly the trap (exception) mechanism already present in conventional processors.
The trap (exception) mechanism for this purpose must satisfy the following requirements: As soon as a trap is triggered, any instruction present in the pipeline of the processor at this time will be prevented from changing the processor state.
In the subsequent clock cycle, the system jumps into a trap routine assigned to the trap. A trap routine may be terminated again by the instruction “return from trap routine” (RTR), which results in the execution being resumed again with the instruction that was present in the last pipeline stage at the time the trap was triggered.
Now, in order to repeat an instruction with the aid of the trap mechanism, an “empty” trap routine is called (an empty trap routine is defined as a routine made up exclusively of the instruction “return from trap routine”). Since it is an “empty” trap routine, it is again terminated immediately after being called. The pipeline is emptied and the execution is resumed again precisely with the instruction that was present in the last pipeline stage at the time the trap was triggered. This empty trap routine is called an InstructionRetryTrap.
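The effect of the empty trap routine may be sketched at the instruction level as follows. The sketch ignores pipeline timing entirely; the function name run_with_instruction_retry and the parameter faulty_steps, which marks the execution steps in which an error is detected, are hypothetical.

```python
def run_with_instruction_retry(program, faulty_steps):
    """Execute `program` and return the sequence of executed instructions.

    `program` is a list of instruction names. In each step listed in
    `faulty_steps` an error is detected, the current instruction is
    prevented from taking effect and the empty trap routine runs.
    """
    executed = []
    pc = 0    # address of the instruction in the last pipeline stage
    step = 0
    while pc < len(program):
        if step in faulty_steps:
            # Trap triggered: remember the address of the instruction
            # that was in the last pipeline stage.
            return_address = pc
            # The empty trap routine consists only of RTR ...
            executed.append("RTR")
            # ... which resumes execution at the saved address.
            pc = return_address
        else:
            executed.append(program[pc])
            pc += 1
        step += 1
    return executed
```

For the program A, B, C with an error detected in step 1, the resulting sequence is A, RTR, B, C: instruction B is repeated and no instruction is lost.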
The InstructionRetryTrap can bring about a valid processor state only if certain registers have a valid and consistent content. The set of these registers is called essential registers and includes all registers the content of which must be saved or retained in the event of a trap call. This includes above all the register file, the status register and, depending on the architecture, various control registers such as an exception vector table for example. The most important register of the essential registers is the register that stores the address of the instruction in the last pipeline stage since it is precisely this address to which the system must jump when terminating the trap. In
To be able therefore to ensure a correct functioning of the InstructionRetryTrap, all errors in the essential registers must first be detected and corrected. The error detection is achieved by comparing the write accesses of the master with those of the checker to the essential registers (the comparison being performed by the comparator component). As already mentioned above, a time-offset dual core has an error detection time of k+1 clock cycles. Therefore, following a detected error, the essential registers have to be rolled back k+1 clock cycles in order to regain a valid state.
This is made possible by expanding the essential registers to include rollback capability (see next section). As already mentioned, a valid state in the essential registers is a necessary and sufficient condition for being able to create a complete and valid processor state with the aid of the InstructionRetryTrap (the derivable registers thus do not have to be equipped with rollback capability).
Rollback of the Essential Registers
This section describes how an individual register or an entire register file may be equipped with rollback capability which allows it to roll the register or the register file back by a certain number of clock cycles.
Individual Register
This section shows how an individual register that is written to in every cycle (e.g. a pipe register) may be equipped with rollback capability. A rollback-capable individual register is made up of control logic, a permanent register and one or multiple temporary buffers. The data to be stored first run through the temporary buffers before being taken over into the permanent register. In order to carry out a rollback, all buffer contents are marked as invalid. Buffer contents marked as invalid are never taken over into the permanent register. The number of temporary buffers corresponds to the number of clock cycles by which the register is rolled back in a rollback. When reading out the register, one must take into account that it is always the most current valid value that must be returned. When no rollback has occurred, that is, when the buffers are marked as valid, the most current valid value is always located in the first buffer. Immediately following a rollback, the most current valid value is located in the permanent register.
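The behavior described above can be sketched as a small Python model. This is a minimal behavioral sketch, not the control logic of the described hardware: the class name RollbackRegister, its method names and the representation of each buffer as a (value, valid) pair are assumptions made for illustration, and depth is assumed to be at least 1.

```python
class RollbackRegister:
    """Behavioral model of a register with `depth` temporary buffers."""

    def __init__(self, depth, initial=0):
        self.permanent = initial
        # buffers[0] is the first (newest) buffer; each entry is (value, valid).
        self.buffers = [(initial, False)] * depth

    def clock(self, value):
        """One clock cycle: the register is written in every cycle."""
        old_value, old_valid = self.buffers.pop()
        if old_valid:
            # Only valid buffer contents reach the permanent register.
            self.permanent = old_value
        self.buffers.insert(0, (value, True))

    def rollback(self):
        """Mark all buffer contents as invalid."""
        self.buffers = [(value, False) for value, _ in self.buffers]

    def read(self):
        """Return the most current valid value."""
        for value, valid in self.buffers:
            if valid:
                return value
        return self.permanent


reg = RollbackRegister(depth=2)
for v in (1, 2, 3):
    reg.clock(v)
before = reg.read()   # newest value, taken from the first buffer
reg.rollback()
after = reg.read()    # value from two cycles earlier, taken from the permanent register
```

With two temporary buffers, the value read after the rollback is the one the register held two clock cycles earlier, which matches the rule that the number of buffers equals the rollback distance.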
The entire behavior of the rollback-capable register is controlled by the control unit. The behavior of the control unit is specified by the truth table in
Register File
This section shows how a register file, which in contrast to the previously described individual register is not necessarily written to in every clock cycle, can be equipped with rollback capability. A rollback-capable register file is made up of control logic, the register file itself and one or several temporary buffers, each of which is able to store one data word and one register address. Together with the associated addresses, the data to be stored first run through the temporary buffers before being taken over into the register file. In order to carry out a rollback, all buffer contents are marked as invalid. Buffer contents marked as invalid are never taken over into the register file. The number of temporary buffers corresponds to the number of clock cycles by which the register file is rolled back in a rollback. When reading out the register file, one must take into account that it is always the most current valid value that is read out. This value is located in the first valid buffer that contains the desired address. If no valid temporary buffer contains the desired address or if all temporary buffers are marked as invalid, then the system always reads directly out of the register file.
In the case of the rollback-capable register file, determining the most current valid value in reading accesses requires somewhat more effort than in the case of the previously described rollback-capable individual register and is therefore described in pseudo code:
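As a hedged illustration of this read rule, the following Python sketch models a rollback-capable register file. It is an assumption-laden behavioral model rather than the original pseudo code: the class name RollbackRegisterFile and its methods are hypothetical, empty and invalidated buffer slots are both represented by None for simplicity, and depth is assumed to be at least 1.

```python
class RollbackRegisterFile:
    """Behavioral model of a register file with `depth` temporary buffers."""

    def __init__(self, num_regs, depth):
        self.regs = [0] * num_regs
        # buffers[0] is the first (newest) buffer. Each slot holds
        # (address, data), or None if it is empty or marked invalid.
        self.buffers = [None] * depth

    def clock(self, write=None):
        """One clock cycle; `write` is (address, data) or None for no write."""
        oldest = self.buffers.pop()
        if oldest is not None:
            address, data = oldest
            self.regs[address] = data  # valid entry reaches the register file
        self.buffers.insert(0, write)

    def rollback(self):
        """Mark all buffer contents as invalid."""
        self.buffers = [None] * len(self.buffers)

    def read(self, address):
        """Return the most current valid value for `address`."""
        for entry in self.buffers:  # newest buffer first
            if entry is not None and entry[0] == address:
                return entry[1]
        # No valid buffer holds this address: read the register file itself.
        return self.regs[address]


rf = RollbackRegisterFile(num_regs=4, depth=2)
rf.clock((1, 10))   # cycle 1: write 10 to register 1
rf.clock(None)      # cycle 2: no write
rf.clock((1, 20))   # cycle 3: write 20 to register 1
newest = rf.read(1)
rf.rollback()
rolled_back = rf.read(1)
```

Here the read before the rollback is served from the first buffer, while the read after the rollback falls through to the register file and returns the value from two clock cycles earlier.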
The entire behavior of the rollback-capable register file is controlled by the control unit. The behavior of the control unit is specified by the truth table in
The diagram in Table 3 shows the sequence of the instruction retry mechanism (IRM) with the aid of an example. For this purpose it is assumed that master and checker run at a clock cycle offset of one clock cycle, and that an error occurs during the processing of instruction number 50.
In the first observed clock cycle, the master is in the process of executing instruction 50, while the checker executes instruction 49. Instruction 50 can only be checked two clock cycles later (clock cycle 3), when both CPUs have already executed this instruction. In this clock cycle, an error is detected and the rollback is triggered for the essential registers. In the subsequent clock cycle (clock cycle 4), the essential registers have already been rolled back by two clock cycles (the essential registers being now again in the same state they occupied in clock cycle 1). Since until now only the essential registers have been rolled back and the remaining registers of the processor have retained their old value, the processor is in an inconsistent state. Nevertheless, the condition for the correct triggering of a trap is satisfied (the essential registers having correct and consistent values). In the same clock cycle, the Instruction Retry Trap (IRT) is now triggered. The InstructionRetryTrap is made up of a single instruction, the “Return From Trap Routine (RTR)” instruction. In clock cycle number 5, the RTR instruction is fetched. In clock cycle 7, the RTR instruction has reached the execute stage of the processor (in both CPUs). As a result of executing the RTR instruction, the pipeline of both CPUs is flushed and in both CPUs the instruction address is fetched, which at the time of triggering the InstructionRetryTrap (IRT) was located in the “PC2” register (
Bit Flips in the Register File
A bit flip is defined as a reversal of the logical value of a bit in a register caused by an interference.
Bit flips in the register file generally cannot be corrected by rolling back the processor by a constant number of clock cycles t since they may affect registers that were written to most recently at a time going back further than t clock cycles. To be able to correct such errors, an additional mechanism was integrated, which detects register errors as such and reports them to the operating system. For this purpose, the individual registers are secured by parity bits (
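How a parity bit reveals a single bit flip in a register can be shown with a short Python sketch. The choice of even parity, the function names parity_bit, store and is_consistent, and the example values are assumptions for illustration only; the text does not specify the parity scheme used.

```python
def parity_bit(word):
    """Even parity: 1 if the number of set bits in `word` is odd, else 0."""
    return bin(word).count("1") & 1


def store(word):
    """Return the register content together with its parity bit."""
    return word, parity_bit(word)


def is_consistent(word, stored_parity):
    """Check a register content against the parity bit stored with it."""
    return parity_bit(word) == stored_parity


value, p = store(0b1011)
flipped = value ^ (1 << 2)           # a single bit flip caused by an interference
ok_before = is_consistent(value, p)
ok_after = is_consistent(flipped, p)
```

A single parity bit detects any odd number of flipped bits in the register; an even number of flips within the same register leaves the parity unchanged and is therefore not detected by this check.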
In order to maintain the synchronicity of the two CPUs, the RegErrorTrap (RET) must be triggered in both CPUs at precisely the same instruction. In the case of a dual core working at a clock cycle offset, this means that the RET must also be triggered in an offset manner. In order to describe the offset triggering of the trap, timing diagrams are enclosed, which assume a clock cycle offset of k=1 and which show with reference to an example how the master or the checker reacts to bit flips in the register file. For this purpose it is assumed that in each case the instruction at address 50 reads out a faulty register content.
RET1, RET2, RET3, RET4 etc. refer to the first, second, third, fourth etc. instruction of the RegErrorTrap. What this trap routine does precisely (task repetition, call of an exception handler, . . . ) and how many instructions it comprises is left to the programmers of the operating system.
If a parity error occurs in the master (at instruction 50 in the example described), then the master enters into the RegErrorTrap in the next clock cycle. The “k-delay” component (see block diagram in
If the checker discovers a parity error (at instruction 50 in the example described), then first the described mechanism for instruction repetition (IRM) is triggered (see flow chart in Table 4). At the beginning of clock cycle 6, this has again produced a state in which the master has fetched instruction 49 and the checker has fetched instruction 48 (in a dual core operating at a clock cycle offset of k=1, the IRM always rolls both CPUs back by 2 instructions).
Three clock cycles later (at the beginning of clock cycle number 9), instruction 50 is in the execute stage of the master. From this state, the “IRM-delay” component (see block diagram in
Number | Date | Country | Kind |
---|---|---|---|
10 2004 058 288.2 | Dec 2004 | DE | national |