The present invention relates to a device and a method for maintaining a system function in the event of errors in a processor system having two cores as well as a corresponding processor system.
Redundancies, for example, of microcontrollers (μC), but also of components of a μC, such as, for example, the CPU (central processing unit), for the purpose of error detection are known from the related art. In this context, redundantly calculated data and redundantly generated signals are compared for consistency by a comparator unit.
A microcontroller having redundant CPUs is also called a dual-core microcontroller (dual-core μC). In a dual-core μC, both CPUs are able to operate synchronously, that is, in parallel (in lockstep mode) or in a manner that is time-delayed by a few clock cycles. Both CPUs receive the same input data and process the same program or the same instructions. If an error exists in one of the redundantly implemented cores, which error has an effect on at least one output signal of this core, then this results in a discrepancy of the data to be compared, which discrepancy is detected by the comparator unit. In this context, in addition to “data out” data, output signals may also include the instruction address and the control signals. When a discrepancy is detected in the signals to be compared, the comparator unit generates a status or an error signal with which the comparison result may be signaled externally. However, without additional error-detection mechanisms for the redundantly implemented units, it is neither possible to locate the faulty component, nor is it possible to determine the type of cause of the error.
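The comparison described above can be illustrated with a minimal sketch. The names `CoreOutput` and `compare_outputs` are hypothetical and are not taken from the specification; the sketch only assumes that the comparator unit checks the "data out" data, the instruction address, and the control signals of both cores for equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreOutput:
    """Hypothetical bundle of the signals one core presents to the comparator."""
    data_out: int
    instr_addr: int
    control: int


def compare_outputs(a: CoreOutput, b: CoreOutput) -> bool:
    """Return True (error signal) if any compared signal differs."""
    return a != b


reference = CoreOutput(data_out=0x2A, instr_addr=0x100, control=0b01)
same = CoreOutput(data_out=0x2A, instr_addr=0x100, control=0b01)
flipped = CoreOutput(data_out=0x2B, instr_addr=0x100, control=0b01)

error_when_equal = compare_outputs(reference, same)
error_when_differs = compare_outputs(reference, flipped)
```

The result is symmetric in both arguments, which reflects the limitation stated above: the comparison alone signals that the cores disagree, but not which core is faulty.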
When the redundancies described above are used in safety-related control and regulation systems, then usually a switchover to a “secure state” of the entire system occurs after a discrepancy in the redundantly determined signals is detected, even when the cause of the discrepancy was a transient error having only a brief active duration. In automobile systems, such as, for example, an ESP system, the “secure state” usually means that the system is shut down.
Due to the fact that semiconductor structures are becoming smaller and smaller, an increase in transient processor errors is expected, which are caused, e.g., by cosmic radiation. In order to handle transient errors such that it is possible to refrain from shutting down the system and to tolerate or even “heal” errors in operation, there are already a number of solutions in the related art: Using mostly complicated methods, errors are detected by application-specific, frequently model-based plausibility checks; where necessary, a reset of the computer system is triggered. The computer system re-initializes itself and, after the initialization time and an optional “recovery check” (after, for example, a few hundred ms), is operational once again (so-called “forward recovery”).
For applications that are not real-time-capable (for example, transactions at financial markets), a state is formed in an application-specific way before the transaction; this state is stored and is discarded as invalid only after a successful conclusion of the transaction has been confirmed. When errors occur during the transaction, the system jumps back to the stored starting point (“backward recovery”). In real-time systems, such solutions are very complicated, and the function is usually interrupted for the duration of a reset or a recovery check of the processor system.
With an increasing range of functions of electronic regulating systems in a vehicle, a shutdown of a system, such as ESP with steering intervention, does not constitute a transition to a secure system state in every operating state.
An objective of the present invention is a method for operating a dual-core processor (or a dual-processor system) with the aim of an increased robustness with regard to errors and an increased (partial) availability of the system function when transient and permanent errors occur in the processor system. In an advantageous exemplary embodiment, this may be achieved while maintaining the original execution time for the individual program segments.
In a dual-core computer according to the related art that is operated in the lockstep mode, one CPU operates as master and a second CPU operates as slave. The results of the slave CPU are utilized only for comparison with the results of the master CPU. Only the master CPU may write results to the data/address bus or into CPU registers.
The advantages of the present invention include alternating assignment of the master function to the at least two execution units and thus the alternating use of the core results of a dual-core or multi-core computer that is operated in the lockstep mode. Thus, when certain boundary conditions are taken into account, a restricted operation of the processor system may be maintained even after a discrepancy in the redundantly calculated results has been detected. This is advantageous particularly in real-time applications in which a shutdown of the system due to processor errors is not desired in every operating state.
In an exemplary embodiment, an additional advantage results from the fact that an error in the execution units of the processor system is able to be located, that the faulty execution unit is deactivated, and that the system having the non-faulty execution unit continues to operate until a system state is reached that is not critical for shutdown or a previously specified maximum operating time in this mode is exceeded.
A method for controlling a computer system having at least two execution units and one comparator unit is advantageously described, which system is operated in the lock-step mode and in which the results of the at least two execution units are compared, wherein when or after an error is detected by the comparator unit, an error-detection mechanism is processed on at least one execution unit for this execution unit. A method is advantageously described, wherein when or after an error is detected by the comparator unit, the current instruction sequence on the at least two execution units is terminated and an error-detection mechanism is processed on the at least two execution units. A method is advantageously described, wherein when or after an error is detected by the comparator unit, the current instruction sequence is terminated on exactly one execution unit, on this one execution unit an error-detection mechanism is processed, the comparator unit of the at least two execution units is switched off for the duration of the processing of the error-detection mechanism, and on the at least one other execution unit the normal program sequence is processed further.
A method is advantageously described, wherein after processing of the error-detection mechanism, the normal program sequence is continued if the error-detection mechanisms have not detected any error. A method is advantageously described, wherein when or after an error is located on an execution unit, the faulty execution unit is shut down. A method is advantageously described, wherein the comparator unit is deactivated. A method is advantageously described, wherein when at least one component is deactivated, an error signal is generated, which is provided to the application. A method is advantageously described, wherein after an error occurs, the operation using only one execution unit is restricted temporally and the computer system is shut down at the latest after a previously specified time has passed. A method is advantageously described, wherein the computer system is already shut down by a signal generated by the application before a previously specified time has passed.
A device for controlling a computer system having at least two execution units and one comparator unit is advantageously described, which system is operated in the lock-step mode and in which the results of the at least two execution units are compared, wherein an arrangement provides that when or after an error is detected by the comparator unit, an error-detection mechanism is processed on at least one execution unit for this execution unit. A device is advantageously described, wherein an arrangement is provided to cancel the coupling of the lock step of the at least two execution units and to assign the master function to one execution unit at will. A device is advantageously described, wherein an arrangement stores an error-detection mechanism for the execution units. A device is advantageously described, wherein an arrangement supplies to at least one execution unit instructions and/or the program for the error-detection mechanism when required. A device is advantageously described, wherein an arrangement deactivates the comparison unit.
Other advantages and advantageous embodiments are derived from the features described herein of the specification, including the figures.
In contrast to a known dual-core microcontroller that is operated in the lockstep mode, in a first exemplary embodiment of the present invention, when certain boundary conditions are met, a value is written to a register or a memory or outputted to the data/address bus even when a discrepancy exists between the output signals of the redundant execution units. In this instance, however, the master function is not assigned permanently to one execution unit, but rather may be assigned to different execution units. This assignment may occur according to a statically determined scheme or may be specified dynamically.
In a second exemplary embodiment shown in
In a third exemplary embodiment shown in
Input signal W160, or an identification of the same, may be generated in several ways: as a function of time or of an instruction counter (for example, every 10 clock cycles or every 10 instructions), which may be done by a dedicated hardware component; by the operating system, for example, as a function of the scheduling of the runtime objects (a switchover may occur each time a runtime object is called or during each operating system cycle); as a function of an identification in the program code; by an interrupt or a signal of an interruption request unit; or as a function of the access to a particular memory area in the program memory and/or data memory.
An assignment or a switchover of the master function may be a function of one of the previously mentioned conditions, a function of the comparison result of comparator unit W122, or of a combination of several of these conditions.
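As a minimal sketch of a statically determined scheme, the master function could be rotated among the execution units after a fixed number of instructions, corresponding to the example of "every 10 instructions" given above. The function name `master_for` and its parameters are assumptions made for illustration only.

```python
def master_for(instruction_index: int, period: int = 10, n_units: int = 2) -> int:
    """Hypothetical static scheme: rotate the master every `period` instructions."""
    return (instruction_index // period) % n_units


# Master assignment at selected instruction counts for a two-unit system.
schedule = [master_for(i) for i in (0, 9, 10, 19, 20)]
```

A dynamic assignment would replace the instruction counter with one of the other trigger sources listed above, or combine it with the comparison result.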
When there is a discrepancy among the output signals of the execution units, the comparator unit generates an internal error signal. Instead of a shutdown of the system, a switchover of the master function from one execution unit to the other execution unit may take place as a function of the system status, which is communicated to the comparator unit via signal W160. For each additional discrepancy of the output signals, this process is repeated, that is, the master function is assigned to the respectively other execution unit. It should be noted that the master relays its results, regardless of the result of a comparison, via the respective system interface W140. The comparator unit only detects a difference, but does not prevent the respective master from writing. Additional structures may also be contained in comparator unit W122 that shut down the system once an error counter, which counts the detected discrepancies, exceeds a specifiable number of errors.
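The switchover on each discrepancy, together with the error counter, can be sketched as follows. The class `MasterSwitch` and its attribute names are hypothetical; the sketch assumes that the error counter is never reset and that shutdown occurs as soon as the count exceeds the specified number.

```python
from typing import List, Optional


class MasterSwitch:
    """Hypothetical model of the switchover logic with an error counter."""

    def __init__(self, max_errors: int, n_units: int = 2) -> None:
        self.max_errors = max_errors  # specifiable number of tolerated discrepancies
        self.n_units = n_units
        self.master = 0
        self.error_count = 0
        self.shut_down = False

    def on_compare(self, discrepancy: bool) -> Optional[int]:
        """Process one comparison result; return the master index, or None after shutdown."""
        if self.shut_down:
            return None
        if discrepancy:
            self.error_count += 1
            if self.error_count > self.max_errors:
                self.shut_down = True
                return None
            # Assign the master function to the respectively other unit.
            self.master = (self.master + 1) % self.n_units
        return self.master


switch = MasterSwitch(max_errors=2)
trace: List[Optional[int]] = [switch.on_compare(d) for d in (False, True, True, True)]
```

In this run the master alternates on the first two discrepancies, and the third discrepancy exceeds the limit and shuts the system down.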
This system may also generate, as shown in
Many functions for signal conditioning and for regulating mechatronic systems in motor vehicles have a robust design, that is, short-term disturbances (for example, by EMC irradiation or by the influence of disturbance variables in a control loop) do not have safety-critical effects in such systems and may thus be tolerated. Longer lasting disturbances, however, are not tolerated even by such “robust” systems. For such robust functions, the processor system does not have to be shut down immediately after an error occurs, that is, after a discrepancy has been detected by the comparator unit. When the cause of the error is transient and has a short active duration, the error usually no longer exists when the next call is carried out. When the output signals of the execution units are used in an alternating fashion or when the assignment of the master functions alternates in a processor system having multiple execution units, even a permanent error in one of the execution units does not have a lasting influence on the application, but rather influences it only intermittently. Thus, when an error occurs, it is possible to hold off on shutting down the processor system until an error is detected unequivocally as a permanent error or a system state of the application system is reached that is appropriate for a shutdown.
In an additional exemplary embodiment, when a discrepancy is detected among the output signals of the at least two execution units, the processing of the current instruction sequence (program block, task) is aborted on all execution units. Instead of the aborted instruction sequence, error-detection routines, such as, for example, a BIST (built-in self test) or a software-based self test, are processed in all execution units. An error may be detected and located by comparing the results of the error-detection routines to stored reference values. When an error is detected and located, the faulty execution unit is shut down. The non-faulty unit continues to operate until a system state is reached that is safe for a shutdown. A shutdown of a faulty execution unit may occur in that the comparator unit is deactivated and interruption or release unit W130a or W130b assigned to this execution unit does not allow a connection between this execution unit and the system interface or the address/data bus, or in that no instructions, data and/or clock signals are supplied to this execution unit.
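Locating a fault by comparing self-test results to a stored reference value can be sketched as follows. The self-test shown here is a hypothetical software routine that folds the results of a few additions into a signature; it is not the BIST of any particular processor, and the fault model (one bit stuck at 1) is an assumption chosen for illustration.

```python
from typing import Callable, List, Sequence

Adder = Callable[[int, int], int]  # hypothetical stand-in for one execution unit


def self_test_signature(alu_add: Adder) -> int:
    """Fold the results of fixed test operands into a 32-bit signature."""
    signature = 0
    for a, b in ((1, 2), (0xFF, 1), (0x1234, 0x4321)):
        signature = (signature * 31 + alu_add(a, b)) & 0xFFFFFFFF
    return signature


def good_add(a: int, b: int) -> int:
    return a + b


def stuck_bit_add(a: int, b: int) -> int:
    # Simulated permanent fault: bit 0 of the result is stuck at 1.
    return (a + b) | 0x1


# In a real system the reference would be a constant stored with the routine.
REFERENCE = self_test_signature(good_add)


def locate_faulty(units: Sequence[Adder]) -> List[int]:
    """Return the indices of the units whose signature deviates from the reference."""
    return [i for i, unit in enumerate(units) if self_test_signature(unit) != REFERENCE]


faulty = locate_faulty([good_add, stuck_bit_add])
```

Unlike the lockstep comparison, each unit is checked against an independent reference, so the result identifies which unit to shut down.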
There are different options for deactivating the comparator units. On the one hand, a signal may be carried to the comparator unit, which signal activates or deactivates the comparator logic or comparator function. To this end, an additional logic must be inserted in the comparator, which logic is able to execute an activation or deactivation of the comparator function as a function of such a signal. Another possibility is not to supply any data to be compared to the comparator unit. A third possibility is to ignore at the system level error signal W170 of comparator unit W123 as shown in
If no error is found in the execution units when processing the error-detection mechanisms, the next task is started in lockstep mode. If a discrepancy of the output signals is detected again, the procedure described above is carried out again; however, the number n of repetitions must be limited. The limitation may take place as a function of the error tolerance time of the application. If an error is detected again after n repetitions, the system is shut down immediately.
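The limitation of the repetitions can be sketched as follows. The derivation of n from the error tolerance time divided by a task period, and the names `max_repetitions` and `supervise`, are assumptions made for illustration; the sketch also assumes that the repetition count is not reset between discrepancies and that each self-test run finds no error.

```python
from typing import Sequence, Tuple


def max_repetitions(error_tolerance_ms: int, task_period_ms: int) -> int:
    """Hypothetical rule: allow as many repetitions as fit into the tolerance time."""
    return error_tolerance_ms // task_period_ms


def supervise(discrepancies: Sequence[bool], n: int) -> Tuple[str, int]:
    """Return the final state and the task index at which it was reached."""
    repetitions = 0
    for task_index, discrepancy in enumerate(discrepancies):
        if not discrepancy:
            continue
        if repetitions >= n:
            # A further discrepancy after n repetitions: shut down immediately.
            return ("shutdown", task_index)
        repetitions += 1  # run the error-detection routines again
    return ("running", len(discrepancies))


n = max_repetitions(error_tolerance_ms=50, task_period_ms=20)
outcome = supervise([True, False, True, True], n)
```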
Another exemplary embodiment as shown in
In
In step 510, the same instructions or program segments are processed in at least two execution units.
In step 520, the output signals of these at least two execution units are compared for consistency. If the output signals are identical or within a defined tolerance range, step 510 is restarted, this time with new program segments or instructions and/or data. If a discrepancy of the output signals is detected in step 520, step 530 is executed next.
In step 530, the current program processing is interrupted, and an error-detection routine is executed on all execution units. In the process, the connection of the execution units to the system interface or the data/address bus must be interrupted.
In step 540, the results of the error-detection routines are each compared to a reference value, which is stored together with the program code of the error-detection routines. If a discrepancy occurs in this comparison, the execution unit whose result led to a discrepancy in the comparison is labeled as faulty, and the step 550 is executed next. If no discrepancy occurs, step 510 is restarted, this time with new program segments or instructions and/or data.
In step 550, the execution units that are labeled as faulty and the comparator unit are deactivated. An execution unit may be shut down, for example, by not supplying any instructions, data, and/or clock signals to this execution unit, or by interrupting the connection of this execution unit to the comparator unit and to the system interface or to the data/address bus.
In step 560, the processor system continues to operate with the remaining non-faulty execution units. In a processor system having two execution units, this means a single-core operation. This is temporally restricted in safety-related systems.
In step 570, the processor system is shut down or switched to a defined secure state after a shutdown condition has been reached, for example, after exceeding a time limit for single-core operation.
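Steps 510 through 570 can be summarized in a minimal simulation. The execution units are modeled as plain functions, the self-test is a single call with a fixed operand, and the time limit is counted in processed inputs; all of these, including the names `run_flow_510`, `good`, and `bad`, are simplifying assumptions and not part of the described device.

```python
from typing import Callable, List, Sequence

Unit = Callable[[int], int]  # hypothetical model of an execution unit


def run_flow_510(units: Sequence[Unit], inputs: Sequence[int],
                 self_test: Callable[[Unit], int], reference: int,
                 single_core_limit: int) -> List[str]:
    """Walk steps 510-570 for each input and return a log of events."""
    active = list(range(len(units)))
    single_core_cycles = 0
    log: List[str] = []
    for x in inputs:
        if len(active) < 2:
            # Steps 560/570: time-limited single-core operation, then shutdown.
            if not active or single_core_cycles >= single_core_limit:
                log.append("570 shutdown")
                break
            single_core_cycles += 1
            log.append("560 single-core")
            continue
        outputs = [units[i](x) for i in active]          # step 510
        if len(set(outputs)) == 1:                       # step 520
            log.append("520 match")
            continue
        # Steps 530/540: abort the task and self-test every active unit.
        faulty = [i for i in active if self_test(units[i]) != reference]
        if not faulty:
            log.append("540 no fault located")
            continue
        active = [i for i in active if i not in faulty]  # step 550
        log.append(f"550 deactivate {faulty}")
    return log


def good(x: int) -> int:
    return 2 * x


def bad(x: int) -> int:
    # Simulated permanent fault that only shows for inputs >= 3.
    return 2 * x + (1 if x >= 3 else 0)


log = run_flow_510([good, bad], [1, 2, 3, 4, 5, 6],
                   self_test=lambda unit: unit(7), reference=14,
                   single_core_limit=2)
```

In this run the units agree for the first two inputs, the third input exposes the fault, the self-test locates it in the second unit, and the system runs on one core for two inputs before shutting down.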
In
In step 605, the master function is switched from a first to a second execution unit.
In step 610, the same instructions or program segments are processed in at least two execution units.
In step 620, the output signals of these at least two execution units are compared for consistency. If the output signals are identical or within a defined tolerance range, step 610 is restarted, this time with new program segments or instructions and/or data. If a discrepancy of the output signals is detected in step 620, step 630 is executed next.
In step 630, the processing of the current program sequence is continued on at least one of the execution units, but at least on the execution unit that is connected to the system interface or the data/address bus. An error-detection routine is carried out on at least one other execution unit. For this purpose, the comparator unit must be deactivated.
In step 640, the results of the error-detection routines are each compared to a reference value, which is stored together with the program code of the error-detection routines. If a discrepancy occurs in this comparison, the execution unit whose result led to a discrepancy during the comparison is labeled as faulty, and the step 650 is executed next. If no discrepancy occurs, step 605 is restarted, this time with new program segments or instructions and/or data.
In step 650, the execution units that are labeled as faulty are shut down. This may be carried out, for example, by not supplying any instructions, data, and/or clock signals to this execution unit, or by interrupting the connection of this execution unit to the comparator unit and to the system interface or to the data/address bus.
In step 660, the processor system continues to operate with the remaining non-faulty execution units. In a processor system having two execution units, this means a single-core operation. This is temporally restricted in safety-related systems.
In step 670, the processor system is shut down or switched to a defined secure state after a shutdown condition has been reached, for example, after exceeding a time limit for the single-core operation.
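Steps 605 through 660 can be sketched in the same simplified model for a system with two execution units. Here the master keeps writing its results while only the other unit runs the error-detection routine, and the master function alternates on every pass through step 605. The names `run_flow_605`, `good`, and `bad` are hypothetical, and the time limit of step 670 is omitted for brevity.

```python
from typing import Callable, List, Sequence, Tuple

Unit = Callable[[int], int]  # hypothetical model of an execution unit


def run_flow_605(units: Sequence[Unit], inputs: Sequence[int],
                 self_test: Callable[[Unit], int],
                 reference: int) -> Tuple[List[int], List[str]]:
    """Return the values written by the master and a log of events."""
    master = 0
    active = [0, 1]
    written: List[int] = []
    log: List[str] = []
    for x in inputs:
        if len(active) == 1:
            written.append(units[active[0]](x))       # step 660
            log.append("660 single-core")
            continue
        master = 1 - master                           # step 605
        results = [unit(x) for unit in units]         # step 610
        written.append(results[master])               # master writes regardless
        if results[0] == results[1]:                  # step 620
            log.append("620 match")
            continue
        other = 1 - master                            # step 630: test the non-master
        if self_test(units[other]) != reference:      # step 640
            active = [master]                         # step 650
            log.append(f"650 deactivate unit {other}")
        else:
            log.append(f"640 unit {other} passed")
    return written, log


def good(x: int) -> int:
    return 2 * x


def bad(x: int) -> int:
    # Simulated permanent fault that only shows for inputs >= 3.
    return 2 * x + (1 if x >= 3 else 0)


written, log = run_flow_605([good, bad], [1, 2, 3, 4, 5],
                            self_test=lambda unit: unit(7), reference=14)
```

In this run the faulty unit happens to be master when the fault first appears, so one wrong value (7 instead of 6) is written. After the switchover the faulty unit is no longer master, is tested on the next discrepancy, and is deactivated, so the error affects the output only intermittently.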
Number | Date | Country | Kind |
---|---|---|---|
102005037246.5 | Aug 2005 | DE | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2006/064690 | 7/26/2006 | WO | 00 | 3/5/2009 |