Method for monitoring consistent memory contents in redundant systems

Abstract
In a fault-tolerant system constructed from two control devices that operate in lockstep mode, i.e. both control devices perform the same work at any given point in time, there is a requirement to check whether consistent contents, i.e. identical words, are being read from or written to the main memory at the same point in time, so that any errors can be detected as quickly as possible and prevented from spreading. Known methods achieve this with the aid of dedicated north bridges which provide information by way of a separate interface, or by monitoring other operations, for example I/O transactions on the PCI bus. According to the invention, the checking of the memory contents for consistency is performed with the aid of simple devices (a memory monitoring module and a checking device) and is controlled by the checking device.
Description


CLAIM FOR PRIORITY

[0001] This application claims priority from European patent application EP01120256.1 filed Aug. 23, 2001.



TECHNICAL FIELD OF THE INVENTION

[0002] The invention relates to a fault-tolerant system, and in particular, to a fault-tolerant system including two control devices that operate in lockstep mode.



BACKGROUND OF THE INVENTION

[0003] In a fault-tolerant system constructed from two identical control devices that operate in lockstep mode, i.e. both control devices are performing the same work at any given point in time, there is a requirement to check whether consistent contents, i.e. identical words, are being read from or written to the main memory at the same point in time. This ensures that any errors which may be occurring are detected as quickly as possible and thus prevents any spreading of the error. Known methods for checking for consistent memory contents can be subdivided into direct and indirect methods.


[0004] The direct method is a hardware-based method in which a dedicated north bridge is used. By way of a separate interface, this north bridge makes available information concerning the transactions in which it is involved, i.e. also concerning memory transactions.


[0005] The following problems are encountered with the direct method:


[0006] The development effort for a dedicated north bridge is substantial.


[0007] In the case of a north bridge integrated into the CPU in order to enhance the performance, the use of a dedicated north bridge is not possible.


[0008] In the indirect method, owing to the lack of direct access facilities to the north bridge and its interfaces, other operations are monitored instead of the memory transactions, which cannot be monitored directly; for example, I/O transactions may be monitored on the PCI bus. As a result of indirect monitoring, errors or asynchronous modes of operation are detected considerably later than is possible with direct monitoring of the memory transactions.



SUMMARY OF THE INVENTION

[0009] The present invention discloses, in one embodiment, methods for monitoring consistent memory contents in redundant systems.


[0010] One advantage of the invention is, for example, that a direct and immediate examination of the memory contents for consistency is carried out with the aid of simple devices (e.g., memory monitoring module, checking device) and is controlled by the checking device. A dedicated north bridge is therefore not required for sampling the memory contents. Furthermore, because the method is controlled by the checking device, the checking is carried out without I/O accesses to peripheral modules, for example by way of the PCI bus system.


[0011] In another embodiment, a small number of constantly accessible external signals (error checking code signals from the memory interface) is advantageously sampled at the north bridges by the memory monitoring modules. This permits a substantially simpler design compared with the sampling of data signals and/or address signals from the memory interface, but nonetheless guarantees a high error detection performance. Because only external signals of the north bridges are used, the method can also be used if CPU and north bridge are combined in a single module.


[0012] In another embodiment, since the function of the checking device is restricted to the comparison of two signatures, the control of the memory monitoring module, and where applicable the raising of an alarm condition, the logic to be implemented in the checking device is simple. Nevertheless, as a result of the use of signatures which are based on the ECC information, a very high degree of reliability in the detection of errors is guaranteed which is comparable with the performance of the error detection on the memory interface resulting from the ECC information.







BRIEF DESCRIPTION OF THE DRAWINGS

[0013] The invention will be described in the following with reference to the drawing, in which:


[0014] FIG. 1 shows a first and a second control unit in a fault-tolerant system.







DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0015] FIG. 1 shows a first control unit SE0 and a second control unit SE1 of a fault-tolerant system. Both control units SE0 and SE1 are of identical construction and each includes a processing unit CPU0, CPU1, an interface unit or North Bridge NB0, NB1, and a memory MEM0, MEM1, implemented for example in the form of SDRAM, DDR-SDRAM or QDR-SDRAM. The functionality of the processing units CPU0, CPU1 and of the North Bridges NB0, NB1 can, as shown, be implemented in two separate devices, or combined in a single device (not shown).


[0016] In addition, for each of the two control devices SE0, SE1 the figure shows a checking device C0, C1 according to the invention, each having a memory monitoring module, or snooper S0, S1.


[0017] The checking devices C0, C1 are each by preference a field programmable gate array FPGA or an application specific integrated circuit ASIC. However, it is also possible to implement the function of the checking devices C0, C1 in a program-controlled fashion by using a micro-controller for each.


[0018] The two control devices SE0, SE1 operate in lockstep mode, i.e. both control devices SE0, SE1 and each of the aforementioned devices assigned to the control devices SE0, SE1 are performing the same work at any given point in time. The methods and devices for establishing and monitoring the lockstep operation are not the subject of the present invention and are not described. However, it is assumed in the following that the timing is synchronized for the two control devices SE0, SE1.


[0019] The first snooper S0 of the first control device SE0 observes the accesses of the first North Bridge NB0 of the first control device SE0 to the first memory MEM0 of the first control device SE0. To this end, the first snooper S0 is connected to the control lines and at least to the ECC (error checking code) lines of the first memory interface SI0 of the first control device SE0.


[0020] Similarly, the second snooper S1 of the second control device SE1 is connected to the control lines and at least to the ECC lines of the second memory interface SI1 of the second control device SE1, and observes the accesses of the second North Bridge NB1 of the second control device SE1 to the second memory MEM1 of the second control device SE1.


[0021] Since the two snoopers S0, S1 are acquainted with the memory control protocol and use the control signals which are transferred over the control lines of the respective memory interfaces SI0, SI1 to monitor operational sequences, the snoopers S0, S1 can sample the valid ECC information at the correct point in time at the relevant memory interface SI0, SI1.
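The sampling behaviour described above can be sketched roughly as follows. This is a simplified illustration and not the implementation of the snoopers S0, S1: the names BusCycle and snoop_ecc, the command encoding and the timing rule are assumptions made for the example. It is assumed that the ECC of a write is valid in the cycle of the write command, that the ECC of a read is valid a fixed number of cycles (at least one) after the read command, and that the burst length is 1.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BusCycle:
    """One clock cycle on a memory interface (hypothetical, simplified model)."""
    command: str  # "READ", "WRITE" or "NOP", as decoded from the control lines
    ecc: int      # value currently driven on the 8 ECC lines


def snoop_ecc(cycles: Iterable[BusCycle], read_latency: int = 2) -> List[int]:
    """Return the ECC values sampled in the cycles in which they are valid.

    Assumptions: write ECC is valid in the cycle of the WRITE command,
    read ECC is valid `read_latency` (>= 1) cycles after the READ command,
    and every access transfers a single word.
    """
    samples: List[int] = []
    pending_reads = set()  # cycle indices at which read ECC becomes valid
    for index, cycle in enumerate(cycles):
        if index in pending_reads:
            pending_reads.discard(index)
            samples.append(cycle.ecc)
        if cycle.command == "WRITE":
            samples.append(cycle.ecc)
        elif cycle.command == "READ":
            pending_reads.add(index + read_latency)
    return samples


trace = [
    BusCycle("WRITE", 0x3C),  # write ECC sampled immediately
    BusCycle("READ", 0x00),   # read command, ECC lines not yet valid
    BusCycle("NOP", 0xFF),    # still waiting for the read data
    BusCycle("NOP", 0x5A),    # read ECC valid two cycles after the command
    BusCycle("NOP", 0x11),    # idle cycle, ignored
]
sampled = snoop_ecc(trace, read_latency=2)
```

In this sketch the values on the ECC lines during idle cycles are ignored, which reflects the point that the snooper must follow the control signals in order to know when the ECC lines carry valid information.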


[0022] This ECC information is transferred by the snoopers S0, S1 in its entirety or in part to the relevant checking device C0, C1 in the form of signatures SIG0, SIG1, i.e. the signature SIG0 from snooper S0 is transferred to the checking device C0 and the signature SIG1 from snooper S1 is transferred to the checking device C1. The signatures SIG0, SIG1 are then transferred by the checking devices C0, C1 via the link L to the other respective checking device C0, C1, such that the signatures SIG0, SIG1 of both snoopers S0, S1 are present in both checking devices C0, C1.


[0023] Subsequently, the signatures SIG0, SIG1 received from the assigned snooper S0, S1 of the respective control device SE0 and SE1 are checked by the checking devices C0, C1 for equality with the signature SIG0, SIG1 received from the other checking device C0, C1, i.e. checking device C0 compares the signature SIG0 received from snooper S0 with the signature SIG1 received from checking device C1, and checking device C1 compares signature SIG1 received from snooper S1 with signature SIG0 received from checking device C0.
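The exchange and comparison of the signatures can be illustrated by the following sketch. It is a software model under stated assumptions, not the logic of the checking devices C0, C1 themselves: the class CheckingDevice and its methods are hypothetical names, the link L is modelled as a direct append to the queue of the peer, and the signatures of both sides are assumed to arrive in the same order.

```python
from collections import deque


class CheckingDevice:
    """Hypothetical model of one checking device (C0 or C1)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.local = deque()   # signatures received from the own snooper
        self.remote = deque()  # signatures received from the peer via link L
        self.peer = None
        self.alarm = False

    def connect(self, peer: "CheckingDevice") -> None:
        """Model the link L as a mutual reference between the two devices."""
        self.peer = peer
        peer.peer = self

    def on_snooper_signature(self, signature: int) -> None:
        """Store the own signature and forward a copy to the peer."""
        self.local.append(signature)
        self.peer.remote.append(signature)
        self._compare()
        self.peer._compare()

    def _compare(self) -> None:
        """Compare own and peer signatures pairwise, oldest first."""
        while self.local and self.remote:
            if self.local.popleft() != self.remote.popleft():
                self.alarm = True  # differing memory transactions detected


c0, c1 = CheckingDevice("C0"), CheckingDevice("C1")
c0.connect(c1)

# Identical memory transactions on both sides: no alarm is raised.
c0.on_snooper_signature(0x5A)
c1.on_snooper_signature(0x5A)
alarms_after_match = (c0.alarm, c1.alarm)

# Diverging memory transactions: both devices raise the alarm.
c0.on_snooper_signature(0x11)
c1.on_snooper_signature(0x12)
alarms_after_mismatch = (c0.alarm, c1.alarm)
```

Because each device holds both signatures, either device can detect the mismatch on its own, which is why the sketch performs the comparison on both sides.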


[0024] If an inequality is noted, an alarm condition is raised to the effect that differing memory transactions have taken place. This alarm condition is forwarded, for example, by way of the connection between the checking devices C0, C1 and the associated North Bridges NB0, NB1 to those North Bridges and from there to the processing units CPU0, CPU1. It can take the form of an interrupt with the appropriate priority in conjunction with a corresponding interrupt handling routine. The connection between the checking devices C0, C1 and the associated North Bridges NB0, NB1 is implemented by means of a standard interface, for example a PCI bus or AGP bus.


[0025] Such an alarm condition may be an indication of an asynchronous state affecting the control devices SE0, SE1 or an indication of a processing error in at least one of the control devices SE0, SE1 or an indication of a memory error in at least one of the control devices SE0, SE1. Methods for the isolation and handling of an error leading to the alarm condition in the interrupt handling routine are adequately known and are not the subject of the present invention.


[0026] The ECC information and thus the signatures SIG0, SIG1 formed from the ECC information depend on the data bits read or written such that the ECC information or the signatures SIG0, SIG1 are sufficient in order to be able to differentiate with a high degree of probability whether equal or unequal data has been read or written.
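The dependency of the check bits on the data bits can be illustrated with a small sketch. The specification does not name a particular code, so an 8-bit CRC (polynomial x^8 + x^2 + x + 1) is swapped in here purely as a stand-in for the ECC generated on the memory interface; the function name check_bits and the 64-bit word width are assumptions made for the example.

```python
def check_bits(word: int) -> int:
    """Return 8 check bits for a 64-bit data word.

    Stand-in for the ECC on the memory interface: a CRC-8 with
    polynomial x^8 + x^2 + x + 1 and initial value 0 (an assumption,
    not the code used by any particular North Bridge).
    """
    crc = 0
    for byte in word.to_bytes(8, "big"):
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ 0x07) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


word = 0x0123456789ABCDEF

# Equal data always yields equal check bits.
same = check_bits(word) == check_bits(word)

# Data differing in any single bit yields different check bits.
single_bit_flips_detected = all(
    check_bits(word ^ (1 << bit)) != check_bits(word) for bit in range(64)
)
```

With 8 check bits over 64 data bits, some differing data words necessarily share the same check bits; for a linear code such as this stand-in, roughly 1 in 256 of all possible difference patterns goes unnoticed in a single comparison, which is consistent with the statement that the differentiation holds with a high degree of probability rather than with certainty.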


[0027] One advantage is that it is not necessary to connect the snoopers S0, S1 to the data lines and to evaluate these. The number of data lines in commonly encountered systems is an integer multiple of 64, for example 128 data lines, whereas only 8 ECC lines are present, so that a simpler construction is possible both for the snoopers S0, S1 and for the checking devices C0, C1.


[0028] If the address of the memory access is incorporated in the formation of the ECC information and thus in the signatures SIG0, SIG1, the addresses of the memory accesses are thereby also indirectly monitored.


[0029] The invention is not restricted to the embodiments described above. For example, if checking devices C0, C1 and/or the link L are to be designed with a lower performance level, the control of the snoopers S0, S1 can be implemented such that not every sampled item of ECC information is selected for the checking process and forwarded as signature SIG0, SIG1 to the checking devices C0, C1, but only every n-th sampled item of ECC information, for example every second or every tenth sampled item of ECC information. Whilst this results in a reduced capability of the method to immediately detect and handle deviating ECC information and thus deviating memory contents, the demands relating to the performance level of the checking devices C0, C1 and of the link L are also lessened at the same time. Depending on the particular application, the parameter n can be adapted to suit the requirements, whereby in the case n=1 every sampled item of ECC information is checked as described in the preferred embodiment.
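The selection of every n-th sampled item can be sketched as follows. The function name select_every_nth is hypothetical, and the sketch assumes that both snoopers start counting at the same sampled item, so that they select corresponding items; in the system described this would follow from the lockstep operation.

```python
from typing import Iterable, Iterator


def select_every_nth(ecc_samples: Iterable[int], n: int) -> Iterator[int]:
    """Yield every n-th sampled item of ECC information (n = 1 yields all)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for count, ecc in enumerate(ecc_samples, start=1):
        if count % n == 0:
            yield ecc


samples = [0x10, 0x21, 0x32, 0x43, 0x54, 0x65]
every_item = list(select_every_nth(samples, 1))
every_second = list(select_every_nth(samples, 2))
every_tenth = list(select_every_nth(samples, 10))
```

With n=2 only half of the sampled items are forwarded over the link L, at the cost that a deviation in a skipped item is not detected by that comparison.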


[0030] If the address of the memory access is not incorporated in the formation of the ECC information and thus in the signatures SIG0, SIG1, snoopers S0, S1 can be provided which are additionally connected to all or selected address lines. This means that monitoring of the addresses of the memory accesses can also take place.
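One conceivable way of including the address lines in the signature is sketched below. The specification does not state how the address bits are combined with the ECC information; the simple concatenation shown here, the function name form_signature and the mask selecting the address lines are assumptions made for the example.

```python
def form_signature(ecc: int, address: int, address_mask: int = 0xFFFF) -> int:
    """Combine the sampled ECC value with the selected address lines.

    `address_mask` marks the address lines the snooper is connected to
    (hypothetical default: the 16 low-order lines).
    """
    selected_address_bits = address & address_mask
    # The 8 ECC bits occupy the low byte, the address bits sit above them.
    return (selected_address_bits << 8) | (ecc & 0xFF)


# Same data (same ECC) written to different addresses: signatures differ.
sig_a = form_signature(0x5A, 0x1000)
sig_b = form_signature(0x5A, 0x2000)

# Addresses differing only in a line that is not connected are not told apart.
sig_c = form_signature(0x5A, 0x10000)
sig_d = form_signature(0x5A, 0x00000)
```

The second pair shows the trade-off of connecting only selected address lines: deviations on the unconnected lines remain invisible to the comparison.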


[0031] The method according to the invention can also be used whenever the memory MEM0, MEM1 and/or the North Bridges NB0, NB1 do not supply any ECC information on the memory interface SI0, SI1. Snoopers S0, S1 can then be provided which are connected to the data lines of the memory interface SI0, SI1 and compute a signature SIG0, SIG1 from these signals. Amongst other things, this has the advantage that, compared with memory interfaces SI0, SI1 offering ECC information, merely a different snooper S0, S1 needs to be provided but not a different checking device C0, C1.
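A signature computed from the data lines could, for instance, be formed as in the following sketch. The specification leaves the computation open; the byte-wise XOR folding, the function name data_signature and the width of 128 data lines are assumptions chosen for brevity.

```python
def data_signature(data: int, data_lines: int = 128) -> int:
    """Fold the value on the data lines into an 8-bit signature.

    Every byte of the data word is XORed into the signature, so the
    result has the same width as the ECC-based signatures (assumption:
    `data_lines` is a multiple of 8).
    """
    signature = 0
    for shift in range(0, data_lines, 8):
        signature ^= (data >> shift) & 0xFF
    return signature


word = 0x00112233445566778899AABBCCDDEEFF

# A deviation in any single data line changes the signature.
single_line_deviation_detected = all(
    data_signature(word ^ (1 << line)) != data_signature(word)
    for line in range(128)
)

# Two deviations in the same bit position of different bytes cancel out.
cancelling = data_signature(word ^ 0x0101) == data_signature(word)
```

Since the resulting signature has the same width as one derived from 8 ECC lines, the comparison logic downstream can remain unchanged, which matches the advantage stated above. The cancelling case shows that such a simple folding is weaker than a proper checking code.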


Claims
  • 1. A method for monitoring consistent memory contents in a redundant system, comprising: a first control unit and a second control unit each having a processing unit with an interface unit and a memory, wherein each memory of a respective control unit is monitored by a memory monitoring module, signatures are formed by the memory monitoring modules, which represent information written to each memory or read from each memory, and which are forwarded to a respective monitoring device, the signatures are forwarded by the monitoring devices to the other respective monitoring device via a link between the control units, where at least one of the monitoring devices compares the signature received from the memory monitoring module with the signature received from the other monitoring device, and an alarm condition is raised by the monitoring device carrying out the comparison if the compared signatures are determined to be non-matching.
  • 2. The method according to claim 1, wherein the signatures are formed from an error checking code information formed during each write and/or read access to the memory.
  • 3. The method according to claim 1, wherein a field programmable gate array or an application specific integrated circuit or a micro-controller is provided for checking devices, such that at least one of the checking devices raises the alarm condition, and a connection of the checking devices to the interface unit including the memory interface or to the processing unit with an integrated interface unit is implemented by a bus system.
  • 4. A system for monitoring consistent memory contents in a redundant system, comprising: a first control unit and a second control unit, each having a processing unit with an interface unit and a memory and a memory monitoring module for monitoring the memory, which forwards signatures that represent information written to the memories or read from the memories to a respective checking device, wherein the checking device receives the signatures from the memory monitoring module and by a link, and the checking device compares the received signatures and raises an alarm condition in the event of deviations.
  • 5. A memory monitoring module, comprising: a first device to monitor a memory interface of a memory; and a second device to provide a signature derived from error checking code information formed during write and/or read access to the memory and sampled at the memory interface.
  • 6. The memory monitoring module according to claim 5, wherein the memory monitoring module involves all or selected data lines and/or all or selected address lines and/or all or selected control lines of the memory interface in the formation of the signatures.
  • 7. A checking device of a redundant system, comprising: a first device to receive a first signature which represents a data word written to a first memory of a first control device assigned to the checking device or a data word read from the first memory; a second device to receive a second signature which represents a data word written to a second memory of a second, redundant control device or a data word read from the second memory; and a third device to compare the first and the second signature, having a fourth device to raise an alarm condition in the event of a second signature deviating from the first signature.
  • 8. The checking device according to claim 7, wherein the checking device is a field programmable gate array or an application specific integrated circuit or a micro-controller, and the checking device is connected by a bus system or an interface to an interface unit including a memory interface or to a processing unit with an integrated interface unit.
  • 9. The checking device according to claim 7, wherein the checking device includes a memory monitoring module with a unit to monitor the memory interface of the memory and a unit to provide signatures which represent information written to the memory or read from the memory.
  • 10. The checking device according to claim 8, wherein the checking device includes a memory monitoring module with a unit to monitor the memory interface of the memory and a unit to provide signatures which represent information written to the memory or read from the memory.
Priority Claims (1)
Number Date Country Kind
01120256.1 Aug 2001 EP