Described below are a method and a device for ensuring consistent memory contents in redundantly maintained memory units within a telecommunication or data processing system.
The probability that a hardware defect will occur within one year on a typical processor board consisting of a CPU, chipset, main memory, and peripheral components is in the single-digit percent range. Telecommunication systems and what are termed data centers typically have a multiplicity of such boards. Systems such as switching centers are typically constructed from up to several hundred such processor boards, as a result of which the probability that at least one hardware component will fail during a one-year period becomes very high. Telecommunication systems in particular, but increasingly also data centers, are required to provide a high level of system availability, for example an availability of >99.999%, which corresponds to a non-availability of a few minutes per year. As it generally takes between 10 minutes and a few hours to replace a processor board and restore service after a hardware defect has occurred, suitable precautionary measures covering the eventuality of a hardware defect have to be taken at system level so that the requirement relating to system availability can be met.
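The orders of magnitude above can be checked with a short calculation. The sketch below assumes that boards fail independently of one another and uses purely illustrative figures (a 5% annual failure probability per board and 200 boards, neither of which is taken from the description); the function names are likewise hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year


def prob_any_failure(p_board, n_boards):
    """Probability that at least one of n_boards fails within the year,
    assuming independent failures with probability p_board each."""
    return 1.0 - (1.0 - p_board) ** n_boards


def allowed_downtime_minutes(availability):
    """Yearly non-availability (in minutes) permitted by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Illustrative figures only (assumptions, not values from the description).
print(f"P(at least one failure): {prob_any_failure(0.05, 200):.6f}")
print(f"Downtime budget at 99.999%: {allowed_downtime_minutes(0.99999):.3f} min/year")
```

Under these assumed figures a failure somewhere in the system within a year is practically certain, while the downtime budget is only about five minutes, which is shorter than the time stated above for replacing a single board.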
An availability requirement of this type can basically only be met using redundant system components. One approach is to monitor redundancy through software measures, employing what is termed middleware for this purpose; another is to encapsulate redundancy at hardware level so that it is transparent for the software. The main disadvantage of redundancy monitored by software measures is that only (application) software that has been developed for the particular redundancy scheme can be employed in such a system. That considerably limits the range of (application) software that can be used. Moreover, application software for software redundancy principles is generally expensive and time-consuming to develop and test.
The control software of known switching systems is in particular very extensive (containing up to several million lines of code). The software can therefore only be adapted at very high cost and substantial risk.
Specially developed hardware has therefore hitherto been necessary for switching systems of this type. The hardware supports the required redundancy so that, should individual hardware modules fail, enough information will be available on the redundant units for operation to continue uninterrupted. A mechanism frequently used for this purpose is a copying mechanism that runs automatically in parallel in the background and continuously copies a well-defined part of an active control unit's main memory to a redundant unit's memory (often called a “reflective memory”).
The direct consequence is that the development costs for switching system controls are very high and the innovation cycle of the controls is usually very long. The reflective memory is also becoming increasingly difficult to implement on account of the latest optimizations in microprocessor development such as, for example, fast, often proprietary and protected bus systems and internal caches having a write-back function.
A method is described for ensuring consistent memory contents in redundantly maintained memory units within a telecommunication or data processing system having at least one active control unit and at least one redundant passive control unit, each having at least one memory unit, with the following operations being executed:
The main advantage is that programs having no advance provisioning for redundancy and otherwise providing a failsafe, highly available system only in conjunction with special hardware can run on virtually any commercially available processor platform and still achieve the same high level of availability as complex special solutions.
In the future it will thus also be possible to introduce new generations of control computers at no great expense, which is to say the innovation cycles can become much shorter.
The dynamic disadvantages due to calling of the mirroring routine, also called a trap routine, are mitigated in various ways:
Any changes in the memory of the active unit will be mirrored onto the redundant unit by a trap routine during ongoing operation. It will then be possible in the event of a fault to change over to the redundant unit with no loss of information.
The method is, however, only adequate for mirroring ongoing changes in the memory to the redundant unit. If a unit is replaced during operation owing to, for example, a defect, the replacement unit will not attain the active board's memory status by this means alone. In that case it will also be necessary to convey all static data, which is to say data that is not further changed during operation, to the replaced redundant unit.
An advantageous embodiment is accordingly to be found in a method corresponding to the following:
The copying of the active unit's static memory contents to the redundant unit is not trivial because copying has to take place during the active unit's ongoing operation. Data may therefore be changed precisely while it is being conveyed to the redundant unit by the copying operation. Copying is customarily implemented through loading into a processor register and writing back. The active unit may then change an item of data at an instant after it has been loaded into the processor register but before it could be stored on the redundant unit. The consequently possible inconsistency of data on the active and redundant unit has hitherto been ruled out in known systems by suitable, proprietary hardware circuitry.
However, the need for special hardware precludes the use of standard modules of the kind available on the market. That solution thus necessitates a high level of development expenditure, which can be eliminated by the software implementation proposed in the following.
The main advantage is that systems consisting of standard modules can be synchronized during ongoing operation without any special redundancy support provided by hardware, and can thus be restored to a redundant mode of operation with a possibility of fast changeover also after a unit has been replaced.
In standard operation, once memory consistency has been achieved between the active and redundant unit, the dynamic overhead due to testing of the memory area variables is negligible.
A control unit may be used for implementing the method having a virtual memory unit (VSP) and a physical memory unit (aPS) whose memory contents can, in the event of memory accessing for writing, be mirrored into a memory unit of at least one further redundant passive control unit (rSt), and which
These and other objects and advantages will become more apparent and more readily appreciated from the following description of the exemplary embodiments, taken in conjunction with the accompanying drawings, in which:
Reference will now be made in detail to the preferred embodiments, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.
Virtually the same architectural units are illustrated in
According to the proposed method, redundant units are no longer provided with the current copy of the active unit's memory contents by special hardware.
Instead, functions of the memory management unit, which is present in all relevant processors, are used to check at runtime, each time the active control unit aSt is accessed for storing, whether a copy of the item of data to be written has to be created on the redundant control unit rSt.
That is possible by setting the attributes for each memory page (typically 4 kilobytes) to ‘write protected’ or ‘read only’.
If a process accordingly write-accesses a memory area marked in this way in the memory control unit aMM, the command execution CA will be interrupted and a trap routine TR called.
The trap routine then analyzes the storage command and ensures that the same item of data is written to the same memory address of the redundant memory unit. That can be supported by suitable standard hardware such as, for example, PCI Express.
The local write operation on the active control unit also has to be performed. The most efficient way to do this is by a write cycle to a non-protected memory page in the virtual address space of the virtual memory unit VSP that is mapped to the same physical page in the physical memory unit aPS as the write-protected one. That is made possible by all known memory control units aMM.
When both the local memory and the memory unit rPS of the redundant control unit have been write-accessed, the trap routine returns to standard command execution at the command following the storage command (which was executed in the trap routine), and the normal program flow is resumed.
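The sequence described above (write-protected page, trap, mirrored write, local write through an unprotected alias, return) can be modeled in a few lines. The following is a toy user-space simulation, not kernel or memory-management code: the class `ActiveUnit`, its method names, and the use of byte arrays to stand in for aPS and rPS are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # typical page size mentioned in the description


class ActiveUnit:
    """Toy model of the active control unit with its physical memory aPS."""

    def __init__(self, n_pages, redundant_memory):
        self.aPS = bytearray(n_pages * PAGE_SIZE)
        self.rPS = redundant_memory               # stands in for the redundant unit's memory
        self.write_protected = [True] * n_pages   # per-page attribute held by the aMM
        self.trap_count = 0

    def store(self, addr, value):
        """Store issued by ordinary software through the write-protected mapping."""
        if self.write_protected[addr // PAGE_SIZE]:
            self._trap_routine(addr, value)       # command execution is interrupted
        else:
            self.aPS[addr] = value

    def _trap_routine(self, addr, value):
        """Model of the trap routine TR: mirror the write, then perform it locally."""
        self.trap_count += 1
        self.rPS[addr] = value                    # same item of data, same address
        self._store_via_alias(addr, value)        # local write on the active unit
        # Returning from here models resuming after the trapped storage command.

    def _store_via_alias(self, addr, value):
        """Write through a second, unprotected mapping of the same physical page."""
        self.aPS[addr] = value


redundant = bytearray(2 * PAGE_SIZE)
unit = ActiveUnit(n_pages=2, redundant_memory=redundant)
unit.store(5000, 42)                              # lands in the second page
print(unit.aPS[5000], redundant[5000], unit.trap_count)
```

In this model the alias is simply a second method writing to the same byte array; on real hardware it would be a second virtual page mapped to the same physical page, as the description states.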
To improve the dynamic characteristics, the described trap routine functionality can also be implemented in micro-code form in the processor. In that case, owing to the write protection flag of the memory page in the memory control unit aMM a trap routine would not be triggered but, instead, the corresponding functional sequence that duplicates write-accessing of the redundant control unit would be initiated directly in micro-code form. The dynamic loss due to launching and quitting the trap routine will be reduced thereby.
As a further optimization, a co-processor can be connected to the processor which, after initiation by the memory control unit aMM, will provide for duplicating write-accessing. Both the trap routine and a micro-code routine can as a result be dispensed with.
On the active control unit aSt a software routine, referred to as the copying routine KR, continuously copies the memory to the redundant control unit rSt. That takes place in small, fixed units (the size of a cache line, for instance). Active operation takes place simultaneously on all the active control unit's other processors. This results in the continual calling of trap routines TR that mirror current changes in the memory contents to the redundant control unit rSt.
To avoid inconsistencies with the data mirrored by the trap routines, the base address of the data block being copied (a cache line, for example) is stored in a global variable. The global variable is located in the active control unit's common memory aRS. All the active unit's processors have read access to it; only the copying process writes to it, updating the base address for each data block requiring to be copied.
If the trap routine is then called on any processor when the memory is being write-accessed, the global variable having the base address must be compared in the trap routine with the current write address. If the addresses are different (which is to say that write accessing is directed at an area not being copied at that instant), then the trap routine can run normally and write accessing will be mirrored to the redundant control unit's page. If it is determined during the trap routine that accessing is directed at the data block being mirrored onto the redundant control unit's memory, then the trap routine will keep polling the global variable until the recopying operation has finished and the global variable has been updated to the next data block. The trap routine will then likewise be able to be completed normally.
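The interplay between the copying routine and the trap routine can be illustrated with a deterministic, single-threaded model. In the sketch below a Python generator stands in for the copying routine running on another processor, and "polling" simply lets that generator advance; the names `Pair`, `copying_routine`, `trap_write`, and the block size of four words are hypothetical. A real multiprocessor implementation would additionally have to consider how the update of the global variable is ordered relative to the copy itself, which this model does not address.

```python
BLOCK = 4  # hypothetical block size in words, standing in for a cache line


class Pair:
    """Active memory, redundant memory, and the global variable described above."""

    def __init__(self, size):
        self.active = list(range(size))   # active unit's memory contents
        self.redundant = [None] * size    # freshly replaced redundant unit
        self.copy_base = None             # base address of the block being copied


def copying_routine(pair):
    """Copy block by block; yields between load and store so writes can interleave."""
    for base in range(0, len(pair.active), BLOCK):
        pair.copy_base = base                        # announce the block in flight
        registers = pair.active[base:base + BLOCK]   # load into "registers"
        yield base                                   # a write could occur here
        pair.redundant[base:base + BLOCK] = registers  # store on the redundant unit
    pair.copy_base = None                            # no copy in progress


def trap_write(pair, addr, value, copier):
    """Mirrored write that waits while its target block is being copied."""
    block_base = addr - addr % BLOCK
    polls = 0
    while pair.copy_base == block_base:   # compare write address with the variable
        next(copier, None)                # in this model, polling lets the copier run
        polls += 1
    pair.active[addr] = value
    pair.redundant[addr] = value
    return polls


pair = Pair(8)
copier = copying_routine(pair)
next(copier)                               # block 0 is loaded but not yet stored
print(trap_write(pair, 6, 77, copier))     # other block: proceeds without polling
print(trap_write(pair, 1, 99, copier))     # block in flight: polls until copy is done
for _ in copier:                           # let the copying routine finish
    pass
print(pair.active == pair.redundant)
```

Without the wait in `trap_write`, the write to address 1 would be overwritten on the redundant side by the stale register contents, which is exactly the inconsistency the global variable is meant to prevent.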
Impacts on dynamic performance: Polling during the trap routine is a rare occurrence (necessary only when precisely the data block currently being copied is being written to), and polling does not take long (it does not take long to copy such a small area). Interrogating the global variable does, though, always produce an overhead in the trap routine owing to the necessary address comparison. The additional runtime will, however, be short during standard operation, when no copying operation is in progress. In that case the global variable, not having been changed for a long time, will be safely in the cache (owing to the frequency of the trap) when the trap routine is called, and the comparison will result in only a slight runtime overhead. The overhead will be more critical if the recopying process is running in parallel, with the variable then being continually changed and no current copy being present in the cache. In that case a storage access to the memory must be executed as a trap routine overhead for the purpose of reading in the address stored in the global variable.
This principle places no special requirements on the hardware or software of a redundant system: It is necessary only to integrate the relevant routine for recopying the memory contents with the control via a global variable having the address that is currently to be copied.
A description has been provided with particular reference to preferred embodiments thereof and examples, but it will be understood that variations and modifications can be effected within the spirit and scope of the claims which may include the phrase “at least one of A, B and C” as an alternative expression that means one or more of A, B and C may be used, contrary to the holding in Superguide v. DIRECTV, 358 F3d 870, 69 USPQ2d 1865 (Fed. Cir. 2004).
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP04/08402 | 7/27/2004 | WO | 00 | 1/29/2007 |