This invention relates to the field of integrated circuits. More particularly, this invention relates to the control of error propagation, such as, for example, errors induced by radiated particle strikes, within integrated circuits.
Device scaling trends towards reducing feature size, increasing integration and lowering voltage levels increase the soft error rate (non-permanent errors such as those induced by radiation strikes) within microprocessors and integrated circuits in general by lowering the minimum amount of charge necessary to cause a bit flip and also by increasing the number of susceptible targets for potential particle strikes. These trends have made reliability an increasingly important design constraint in a variety of different integrated circuit markets.
Though strict reliability constraints have typically been applied exclusively in aerospace and high-end server markets, an increase in demand for embedded microprocessors in a variety of emerging areas, such as the automotive and health care industries, has generated a requirement for reliable embedded designs. The standard mechanism for reporting device reliability is the number of failures in time, or the FIT rate, where a rate of one FIT means that the mean time before an error occurs is one billion device hours. As an example of the increasing need for reliability in embedded devices, the case of expanding integration in the automotive industry is typical. Given the very high numbers of automobiles in use, and the multiple instances of embedded microprocessors within those automobiles, with current technology multiple device failures due to soft errors would be occurring at any given time. This is unacceptable.
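The FIT definition given above can be made concrete with a short calculation. The following is a minimal sketch only; the function names and any figures passed to them (per-device FIT rate, number of devices, observation period) are hypothetical and are not taken from this document.

```python
# One FIT corresponds to one failure per 10^9 device hours (definition above).
HOURS_PER_FIT_UNIT = 1e9


def mtbf_hours(fit_rate: float) -> float:
    """Mean time before a failure, in device hours, for a given FIT rate."""
    return HOURS_PER_FIT_UNIT / fit_rate


def expected_failures(fit_rate: float, devices: int, hours: float) -> float:
    """Expected number of failures across a population of devices.

    All arguments are illustrative inputs, not measured values.
    """
    return fit_rate * devices * hours / HOURS_PER_FIT_UNIT
```

Under these assumptions, a population of one million devices each rated at a hypothetical 1000 FIT would be expected to show about one failure per hour of operation, which illustrates how a small per-device rate becomes significant at fleet scale.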
Another significant factor contributing to this problem is that in typical embedded devices compared with high performance design, longer clock cycle times tend to be employed. This longer cycle time in embedded designs typically leads to larger logic depths between sequential state elements. The effects of these large logic depths are two-fold. First, large logic depths increase the relative area of the chip consumed by combinatorial logic, making combinatorial logic much more susceptible to soft errors (e.g. particle strikes). For example, combinatorial logic consumes 58% of the total cell area of the ARM926EJS core designed by ARM Limited of Cambridge, England. Second, larger logic depths typically imply a wider signal fan out, thus increasing the number of potential targets which may latch an incorrect value caused by a single soft error. Soft error rates are also increasing in the sequential logic such as latches and registers and soft errors at these points also propagate through the fanout net.
It is known to provide mechanisms for detecting and correcting soft errors in memory systems such as SRAM. Memory devices typically use small geometries due to the desire to achieve high density. These small geometries are more vulnerable than the larger circuit elements that have previously typically been used in combinatorial and other logic within an integrated circuit, such as a microprocessor. Within memory systems error checking mechanisms, such as ECC codes, parity bits and the like, have been employed in an attempt to address this soft error problem. Whilst these techniques work in the context of high density memory systems storing effectively pure state data, they are not suited to protection against soft errors occurring dynamically within combinatorial logic and the like in more general purpose integrated circuits.
It may be possible to introduce error detection and error correction mechanisms throughout an integrated circuit design to protect essentially all nodes within that design. However, such an approach is impractical since a large increase in gate count would result due to the deployment of error detection and error correction mechanisms for almost every element within the design.
Viewed from one aspect the present invention provides a method of selecting one or more positions within an integrated circuit to place respective error detection circuits, said method comprising the steps of:
analysing said integrated circuit to determine for a plurality of positions within said integrated circuit respective fan-out characteristics for a signal error occurring at those positions; and
selecting positions for said error detection circuits in dependence upon said fan-out characteristics.
The present technique recognises that a more efficient deployment of error detection circuits within an integrated circuit can be made by analysing the fan-out characteristics (possibly including state dependent masking effects) for signal errors occurring at positions within the integrated circuit design. In this way the circuit resources to be dedicated to error detection circuits can be targeted at the points within the integrated circuit as a whole where the most benefit will occur, e.g. where they are most likely to detect errors or where they are able to protect key architectural state, or other reasons. Furthermore, this low level approach of analysing fan-out characteristics is suitable for application in synthesised designs where the actual gate level layout of the circuit and arrangements of the circuit elements is machine generated. The designer can use the fanout analysis to trade off error coverage against circuit area consumed by detectors in a way to get an improved balance and with a knowledge of the type and proportion of errors likely to be detected. Coverage overhead and detection accuracy can be traded off against each other.
The analysis step may be advantageously performed by simulating operation of the integrated circuit whilst injecting one or more signal errors at position(s) to be investigated. Simulating the operation of an integrated circuit is a technique already commonly used in testing and validation of integrated circuit designs and accordingly the infrastructure to perform such simulation is well developed and existing. This infrastructure can be conveniently reused to analyse the fan-out characteristics of injected errors in accordance with the present technique.
One particularly convenient way of achieving this analysis is to run simulations of two instances of a design, one with and one without injected errors and then observe differences in the resulting states, these differences being indicative of propagated signal errors.
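The two-instance comparison described above may be sketched as follows. This is a toy model under stated assumptions, not the simulation infrastructure referred to in this document: the three-register design, its next-state functions and the name `propagated_errors` are all hypothetical.

```python
from typing import Callable, Dict, List, Set

State = Dict[str, int]

# Hypothetical 3-register toy design with 1-bit registers. Each register's
# next value is a function of the current state.
NEXT_STATE: Dict[str, Callable[[State], int]] = {
    "r0": lambda s: s["r0"] ^ 1,        # toggles every cycle
    "r1": lambda s: s["r0"] & s["r1"],  # AND gate: may logically mask an error
    "r2": lambda s: s["r0"] ^ s["r1"],  # XOR gate: propagates a single-input error
}


def step(state: State) -> State:
    """Advance the design by one clock cycle."""
    return {name: fn(state) for name, fn in NEXT_STATE.items()}


def propagated_errors(initial: State, fault_reg: str, cycles: int) -> List[Set[str]]:
    """Run a reference and a faulty instance in lock step.

    Returns, for each cycle after injection, the set of registers whose
    values differ between the two instances.
    """
    reference = dict(initial)
    under_test = dict(initial)
    under_test[fault_reg] ^= 1  # inject a single bit flip in the test instance only
    diffs: List[Set[str]] = []
    for _ in range(cycles):
        reference = step(reference)
        under_test = step(under_test)
        diffs.append({r for r in reference if reference[r] != under_test[r]})
    return diffs
```

In this sketch the per-cycle difference sets play the role of the observed differences in resulting state; an empty set indicates that the injected error has been masked.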
Many general purpose integrated designs can be considered to be formed of register circuits operable to store signal values and interconnected by logic circuits performing one or more data processing operations and processing control operations. The logic circuits interconnecting the registers can include combinatorial logic, which is normally difficult to analyse in respect of its soft error behaviour other than with the present technique. The present technique can also be used to detect errors in other circuit elements, such as registers and latches.
Within such a register circuit and interconnecting logic circuit framework, signal errors can be detected in the signal values stored in the register circuits and the signal errors may be injected at positions corresponding either to register circuits or the logic circuits interconnecting those register circuits.
The selection of positions for the error detecting circuits in dependence upon the detected fan-out characteristics can be performed in a variety of different ways. One technique is to statistically analyse the fan-out characteristics over multiple error injections with varying timings and positions to identify those positions within the integrated circuit at which a signal error is most likely to be detectable. Thus, “sweet spots” for error detection can be identified where placing an error detection circuit produces a good return in terms of the rate of soft error detection and area coverage.
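One possible form of such a statistical analysis is sketched below. It assumes that each injection experiment has been reduced to a pair of the fault site and the set of registers in which an error was subsequently observed; this record format, the function name `rank_registers` and the ordering criteria are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def rank_registers(experiments):
    """Rank registers by how often injected errors are observed in them.

    experiments: iterable of (fault_site, set_of_corrupted_registers).
    Returns a list of (register, detection_fraction, distinct_source_count),
    best candidates first.
    """
    hits = Counter()              # number of experiments reaching each register
    sources = defaultdict(set)    # distinct fault sites reaching each register
    total = 0
    for site, corrupted in experiments:
        total += 1
        for reg in corrupted:
            hits[reg] += 1
            sources[reg].add(site)
    return sorted(
        ((reg, hits[reg] / total, len(sources[reg])) for reg in hits),
        # Highest detection fraction first, then most sources, then by name.
        key=lambda row: (-row[1], -row[2], row[0]),
    )
```

Registers near the head of the returned list correspond to candidate "sweet spots"; other ranking criteria, such as weighting by circuit area, could equally be used.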
The injected errors in such an analysis are advantageously varied in position within the integrated circuit under test and are varied in respect of their relative timing compared to the circuit clocking and other timing characteristics of the device.
Viewed from another aspect the present invention provides an integrated circuit comprising:
a plurality of registers operable to store signal values;
a plurality of logic circuits interconnecting said plurality of registers and operable to perform one or more of data processing operations and processing control operations, wherein
a signal error at at least one error source position within said integrated circuit has an associated fan-out characteristic such that said signal error propagates to a plurality of further points within said integrated circuit, and
an error detection circuit is placed at a selected further point of said plurality of further points, said selected further point being one or more of:
(i) a point to which errors propagate from a plurality of error source positions within said integrated circuit; and
(ii) a point within said plurality of registers at which detection of an error propagating from said error source position is statistically most likely.
The error detection circuits themselves can operate in a variety of different ways, such as, for example, sampling a signal at two times spaced apart and detecting a difference as indicative of an error. Another example is an error detection circuit responsive to a change in a signal as being indicative of an error.
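The first of these detection approaches, sampling a signal at two spaced-apart times, can be modelled behaviourally as below. This is a sketch under assumptions: the signal is represented as a function of time, and the helper `glitch` and all time values are hypothetical.

```python
def glitch(value, start, duration):
    """Waveform that holds `value`, inverted during [start, start + duration).

    A hypothetical model of a transient upset on a 1-bit signal.
    """
    def waveform(t):
        return value ^ 1 if start <= t < start + duration else value
    return waveform


def detect_by_double_sampling(signal, t_main, delay):
    """Sample `signal` at t_main and again at t_main + delay.

    A difference between the two samples is treated as indicating an error.
    """
    first = signal(t_main)
    second = signal(t_main + delay)
    return first != second
```

In this simple model a transient which spans both sampling times is not flagged, so the spacing between samples would need to be chosen with the expected pulse durations in mind.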
Viewed from another aspect the present invention provides an integrated circuit comprising:
a plurality of circuit units operable to perform respective data processing operations;
a plurality of error isolation gates positioned to control signal paths between circuit units and operable in a closed state to block changes in respective signals being passed between circuit units and in an open state to permit changes in respective signals being passed between circuit units; and
an isolation gate controller responsive to a current state of said integrated circuit to control respective ones of said plurality of error isolation gates to be in said closed state or said open state, wherein
said isolation gate controller controls said error isolation gates such that at least one circuit unit is an error isolated circuit unit powered in said current state and not being used in said current state to perform a data processing operation to determine one or more output signals from said error isolated circuit unit, said error isolated circuit unit being surrounded by error isolation gates in said closed state such that a signal error arising within said error isolated circuit unit is blocked from propagating to other circuit units.
The invention recognises that errors occurring within an integrated circuit tend to have a random distribution over the integrated circuit in question. However, not all portions of an integrated circuit are active at any given time. Some portions of an integrated circuit may be powered down at a particular point in time to save energy. Errors within such powered down regions are unlikely to cause a problem. Other areas within an integrated circuit may be powered, but nevertheless inactive at a particular time due to the current data processing operations and/or status of the integrated circuit in question. Despite these regions being unused at that particular point in time, errors occurring within them can propagate out of those unused regions and cause errors or failures in the integrated circuit as a whole. This technique provides error isolation gates positioned to control signal paths between circuit units so as to be closed or open. An isolation gate controller is responsive to the current state of the integrated circuit to control these isolation gates such that circuit elements which are powered, but are not being used in the current state to perform data processing operations to determine one or more output signals from those circuit units are isolated such that a signal error arising within the isolated circuit unit is blocked from being propagated to other circuit units. The isolated circuit unit may spontaneously recover from the error which has occurred or the error may be pro-actively detected and error recovery mechanisms initiated. By the time the isolated circuit unit is again required to take part in the processing operations of the integrated circuit it is possible that it has recovered from its error such that processing may continue unhindered. If the isolated circuit unit has still not recovered, then error recovery mechanisms may be initiated in a controlled way.
The signal errors against which this technique is particularly useful are transitory errors, such as errors induced by particle strikes.
The current state of the integrated circuit used by the isolation gate controller to determine which circuit units are active and which are inactive so as to appropriately isolate the inactive circuit units can include a variety of inputs including one or more program instructions currently being processed and a current processing mode of the integrated circuit.
The error isolation gates can operate in a variety of different ways, but particularly preferred ways include latching an output of an isolated circuit unit so that it will not change irrespective of internal signal changes due to errors within that isolated circuit unit and controlling a selection input of a multiplexer such that an output from an inactive circuit unit will not be selected erroneously given the current state of the integrated circuit.
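The latching form of isolation gate described above can be illustrated with a small behavioural model. The class name `IsolationLatch` and its interface are hypothetical and serve only to show the open and closed behaviour.

```python
class IsolationLatch:
    """Behavioural model of a latching isolation gate (hypothetical interface).

    While open, the gate is transparent and follows the circuit unit's output.
    While closed, it holds the last value seen and ignores further changes.
    """

    def __init__(self, initial=0):
        self.held = initial
        self.closed = False

    def output(self, unit_output):
        if not self.closed:
            self.held = unit_output  # open state: pass the change through
        return self.held             # closed state: the held value is unchanged
```

With the gate closed, a bit flip arising inside the isolated circuit unit changes `unit_output` but not the value seen by downstream circuit units.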
Examples of isolated circuit units to which the present technique would be particularly applicable include instruction decoders not operable in a current mode of operation, a debug circuit not operable in a current mode of operation, and a portion of a data path not operable for any program instructions currently being processed. Other circuit units may also be isolated in accordance with this technique.
Viewed from another aspect the present invention provides a method of reducing error propagation within an integrated circuit, said method comprising the steps of:
performing respective data processing operations with a plurality of circuit units;
controlling signal paths with a plurality of error isolation gates positioned between circuit units and operable in a closed state to block changes in respective signals being passed between circuit units and in an open state to permit changes in respective signals being passed between circuit units; and
in response to a current state of said integrated circuit controlling respective ones of said plurality of error isolation gates to be in said closed state or said open state, wherein
said error isolation gates are controlled such that at least one circuit unit is an error isolated circuit unit powered in said current state and not being used in said current state to perform a data processing operation to determine one or more output signals from said error isolated circuit unit, said error isolated circuit unit being surrounded by error isolation gates in said closed state such that a signal error arising within said error isolated circuit unit is blocked from propagating to other circuit units.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which:
As will be seen from
In this highly simplified example it will be seen that errors injected at both nodes 8 and 16 result in an error being latched in register 12 and accordingly it is efficient to deploy an error detection circuit 20 in association with the register 12 since this is able to detect errors from these multiple sources. Other positions are also possible depending upon the logic functions and nodes concerned. It will be appreciated that the complexity of a real integrated circuit design and the large number of different states it may occupy will result in the simulation of many thousands of injected errors so as to statistically analyse the registers at which resulting errors are most likely to be manifested and which detect errors from multiple sources. Error detectors 20 may also be placed in positions known to represent key architectural state or in positions at which errors can manifest themselves and which are not detected by error detectors at other positions even though those errors are rare since it may be desirable to achieve a particularly thorough error coverage. Another possibility provided by the present technique is to position the detection circuits further upstream in the logic path and not necessarily at the registered edges. This would give higher coverage in the middle; a possible disadvantage is that some of these errors may have been removed due to subsequent masking, but overall detection coverage may improve. Furthermore, objective degrees of confidence in the coverage may be obtained.
In the example discussed further a Verilog model of an ARM926EJS microprocessor was used. This microprocessor is a 32-bit embedded architecture microprocessor and has a five stage pipeline consisting of fetch, decode, execute, memory and write-back stages. The implementation used in this analysis has thirty seven architecturally defined registers (thirty one general purpose registers and six status registers), 4 KB of instruction cache and 4 KB of data cache. The Verilog model was synthesised with scan-chain insertion and design-for-test methodologies in place using a 130 nm process.
The test bench was formed including a pair of the synthesised netlists from the above, namely a reference design and a design under test. Both netlists are annotated with timing information gathered at the synthesis and layout stage by the synthesis and layout tools. The test bench also includes a behavioural memory model which is used to load benchmarks at simulation initialisation.
The soft error injection and analysis framework is composed of a set of Verilog Programming Interface libraries which are invoked at the start of simulation. Upon invocation, the framework probes the design in order to derive the set of all sequential state elements and nets within the unit under test.
Application-based analysis and random-state analysis are both supported. Application-based analysis is carried out by running benchmark code loaded into the behavioural memory model at simulation initialisation. In this case, the framework will, for example, select a random point in time between 2500 and 5000 cycles after the start of simulation to conduct its first fault injection. If the experiment being conducted is intended to include temporal masking analysis then the fault injection time is randomly selected in picoseconds and the fault duration is randomly selected, for example, on the interval (0.25*CLK, CLK). The preceding are only example timings and it would be possible to use other or random timings. Otherwise, the fault injection time will be scheduled at some future rising edge of the clock signal and the fault will be held for the duration of one clock cycle. When random-state analysis is conducted, the framework is used to drive the experiments by setting the machine to a randomly generated state, injecting a fault, observing the effects of the fault in the subsequent cycle, and repeating. The random-state based experiments are meant to derive an application-independent measure of logical masking of errors.
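The example timing selection described above for application-based analysis can be sketched as follows. The function name, its parameters and the assumption that rising clock edges fall at integer multiples of the clock period are illustrative only; the framework itself is described as a set of Verilog Programming Interface libraries, which this sketch does not attempt to reproduce.

```python
import random


def schedule_fault(rng, clk_ps, temporal_masking):
    """Pick a fault start time and duration, both in picoseconds.

    rng: a random.Random instance. clk_ps: clock period in picoseconds (int).
    Assumes, for illustration, rising edges at integer multiples of clk_ps.
    """
    # Example window from the text: 2500 to 5000 cycles after simulation start.
    cycle = rng.randint(2500, 5000)
    if temporal_masking:
        # Arbitrary picosecond offset within the chosen cycle.
        start_ps = cycle * clk_ps + rng.randrange(clk_ps)
        # Example interval from the text; uniform() may return an endpoint.
        duration_ps = rng.uniform(0.25 * clk_ps, clk_ps)
    else:
        # Aligned to a rising edge and held for exactly one clock cycle.
        start_ps = cycle * clk_ps
        duration_ps = clk_ps
    return start_ps, duration_ps
```

Passing in an explicitly seeded `random.Random` instance makes a given sequence of injection experiments repeatable, which may be convenient when comparing runs.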
At fault injection time, depending upon the type of injection experiment being simulated (soft errors in combinatorial logic, soft errors in sequential state, or both), a random design element is selected for fault injection from the unit under test. If the fault is to be injected into a logic element, a random net in the design is selected and the value present on a wire is inverted, simulating an upset at the logic gate which drives the wire. Similarly, when faults in registers are being simulated, a random register is selected and its output is inverted. When a fault is injected into the design, the framework logs the fault site, the time of injection, and a pulse duration.
After a fault has been injected into the system, at each subsequent rising clock edge, every microarchitectural register in the unit under test is compared against its dual in the reference design. Further, all top-level output ports on the design (I/O buses, coprocessor interface, test equipment) and inputs into the caches are checked to ensure that no corrupt values have escaped from the core data path. If, in the first cycle after fault injection, no register, cache or top-level port mismatches occur, then the injected fault did not affect the system, and so a new random time, for example at least 100 cycles in the future, is selected for another fault injection experiment. If any register, cache or port mismatches do occur, then the fault analysis framework logs the relative cycle and site of the error for later analysis. The fault analysis framework then continues to track the progress of errors throughout the system for 100 cycles after fault injection time. If after 100 cycles, no errors are present, and no errors have propagated out to the caches or top-level ports, then the system is clean and the fault was successfully masked, so a new random time for fault injection is selected. If top-level port or cache errors did occur, then simulation halts and error logs are written for post-processing to analyse propagation behaviour and architectural state effects.
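The outcome of a single injection experiment, as tracked above, can be summarised by a small classification routine. This is a sketch under assumptions: the per-cycle trace format and the outcome labels are hypothetical, and the "unresolved" outcome (register errors still present after the window without having reached a cache or port) is an added category which the text does not address.

```python
def classify_fault(trace, window=100):
    """Classify one fault injection experiment.

    trace[i] = (corrupted_registers, escaped) for cycle i + 1 after injection,
    where corrupted_registers is a set of mismatching registers and escaped
    is True if a cache input or top-level port mismatch was seen that cycle.
    """
    first_regs, first_escaped = trace[0]
    if not first_regs and not first_escaped:
        return "no_effect"          # nothing mismatched in the first cycle
    tracked = trace[:window]        # follow the error for `window` cycles
    for _regs, escaped in tracked:
        if escaped:
            return "escaped"        # corruption reached a cache or port
    last_regs, _ = tracked[-1]
    # Assumption: lingering register errors are reported separately.
    return "masked" if not last_regs else "unresolved"
```

The default window of 100 cycles follows the example figure given above; other values could be used.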
At step 38 a random node within the test design is selected at which the error is to be injected. This node may be a register or a piece of combinatorial logic or some other element. At step 40 a random duration for the error is selected. Step 42 then clocks the reference and test designs to the cycle before the selected injection time (most likely in the application code analysis example) and then at step 44 the signal error is injected at time T during the cycle reached with a duration D and at location N. Step 46 then continues the clocking of the reference and test designs such that the error can propagate within the test design. Step 48 reads the states of the reference and test designs and these are compared at step 50 to detect any differences. Detected differences are used to update the data recording the statistical distribution of errors resulting from error insertion as are being collected for analysis. If further statistics are required, then step 54 returns processing to step 30 and the simulation of error injection repeats.
At step 60 each injected error location is examined to determine the fan-out, number and location of sequential elements which store a resulting error. In this way, the fan-out characteristics may be used to identify sequential logic elements (registers) which provide wide error detection coverage for errors injected at a variety of positions. Furthermore, error injection positions which are lacking in error detection coverage may be identified to produce a desired or comprehensive level of error coverage and also error injection points which may influence key architectural state can also be identified.
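One way in which coverage data of this kind might be used to choose detector locations is a greedy selection, sketched below. The greedy heuristic is a substitute chosen for illustration and is not stated in this document; the function name, the coverage map format and the `budget` parameter are likewise hypothetical.

```python
def greedy_detector_sites(coverage, budget):
    """Choose up to `budget` registers covering as many fault sites as possible.

    coverage: {register: set of fault sites whose errors were observed there}.
    Returns (chosen_registers, fault_sites_left_uncovered).
    """
    uncovered = set().union(*coverage.values())
    chosen = []
    while uncovered and len(chosen) < budget:
        # Pick the register covering the most still-uncovered fault sites;
        # sorting the names makes tie-breaking deterministic.
        best = max(sorted(coverage), key=lambda r: len(coverage[r] & uncovered))
        if not coverage[best] & uncovered:
            break
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen, uncovered
```

The second returned value lists injection positions which remain without detection coverage, corresponding to the positions lacking coverage referred to above.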
At step 62, the error propagation data (fan-out data) extracted in steps 58 and 60 is analysed such as by ranking in accordance with predetermined criteria to identify a suitable set of locations at which error detection circuits should be added. These error detection circuits can then be added to the design and the testing would be repeated to check that the coverage is as expected (the insertion of the error detection circuits may itself alter the error propagation behaviour). These repeating and integration steps are illustrated in steps 64 and 66 of
Also illustrated in
Isolation gates 90, 92, 94, 96, 98, 100 are illustrated at various points within the integrated circuit 68. These isolation gates are controlled by an isolation gate controller 102 to selectively be in either a closed state or in an open state. In a closed state they serve to block any change occurring in an output signal of the circuit unit with which they are associated, whereas in an open state they pass such changes. The isolation gate controller 102 is responsive to the current state of the integrated circuit 68, including the current processing mode detected from the status register 8 and the currently executing program instructions detected from the pipeline 78, to generate gate control signals to control the isolation gates 90, 92, 94, 96, 98, 100. As an example, only one of the instruction decoders 80, 82, 84 will be active at any given time and accordingly the output signals from the other decoders will be isolated by their isolation gates 90, 92, 94 under control of the isolation gate controller 102 such that if any soft errors occur within an inactive instruction decoder 80, 82, 84, then these will not produce error signals that propagate out to the rest of the integrated circuit and induce errors in operation of that integrated circuit. As another example, the outputs from the shifter 74 may be subject to control by the isolation gates 98 such that if a particular processing state of the integrated circuit 68 is one which does not involve any shifting operations, then the output from the shifter 74 may be isolated such that it does not have an inappropriate effect on any subsequent processing.
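The decision made by the isolation gate controller can be sketched as a mapping from the current state to a closed or open state for each gate. The mode names, the unit labels and the function signature below are hypothetical; in particular, the correspondence between processing modes and the individual instruction decoders is an assumption made for illustration.

```python
# Hypothetical mapping from processing mode to the one active decoder.
DECODER_FOR_MODE = {
    "mode_a": "decoder_80",
    "mode_b": "decoder_82",
    "mode_c": "decoder_84",
}


def gate_states(mode, uses_shifter, debug_active):
    """Return {unit: True if its isolation gate should be closed}.

    mode: current processing mode (hypothetical names above).
    uses_shifter: whether current instructions involve shifting operations.
    debug_active: whether the debug controller is in use.
    """
    # Every decoder other than the one for the current mode is isolated.
    states = {dec: dec != DECODER_FOR_MODE[mode] for dec in DECODER_FOR_MODE.values()}
    states["shifter_74"] = not uses_shifter
    states["debug_88"] = not debug_active
    return states
```

A unit whose entry is True is powered but unused in the current state, so its outputs are held and a soft error within it cannot propagate.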
The debug control signals from the debug controller 88 are also subject to isolation gate control since debug control signals can have a particularly powerful effect on the operation of the integrated circuit 68 and cause large scale errors if a soft error does occur within the debug controller 88.
As another example of isolation gate control, multiplexers 104 and 106 which are used to select main registers or shadow registers within the register bank 70 (depending upon processing mode) are controlled with select signals which can be subject to isolation gates 108, 110. Thus, if the processing mode is one in which it is known that the shadow registers are not active, then the select inputs to the multiplexers 104, 106 can be subject to isolation such that soft errors will not induce them to inappropriately change the selected register and accordingly produce erroneous processing operation.
Number | Date | Country | Kind |
---|---|---|---|
0519363.6 | Sep 2005 | GB | national |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/GB2005/003800 | 10/3/2005 | WO | 00 | 9/25/2007 |