Embodiments of the present invention relate generally to error detection and/or correction in a semiconductor device.
Single bit upsets or errors from transient faults have emerged as a key challenge in semiconductor design. These faults arise from energetic particles, such as neutrons from cosmic rays and alpha particles from packaging material. These particles generate electron-hole pairs as they pass through a semiconductor device. Transistor source and diffusion nodes can collect these charges. A sufficient amount of accumulated charge may change the state of a logic device such as a static random access memory (SRAM) cell, a latch or a gate, thereby introducing a logical error into the operation of an electronic circuit. Because this type of error does not reflect a permanent failure of the device, it is termed a soft or transient error.
Soft errors become an increasing burden for designers as the number of on-chip transistors continues to grow. The raw error rate per latch or SRAM bit may be projected to remain roughly constant or decrease slightly for the next several technology generations. Thus, unless error protection mechanisms are added or more robust technology (such as fully-depleted silicon-on-insulator) is used, a semiconductor device's soft error rate may grow in proportion to the number of devices added in each succeeding generation. Additionally, aggressive voltage scaling may cause such errors to become significantly worse in future generations of chips.
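By way of illustration only, the proportional growth described above can be sketched with simple arithmetic. The per-bit rate and the bit counts below are hypothetical placeholders chosen for the example, not measured or projected values.

```python
import math


def chip_soft_error_rate(num_bits: int, raw_rate_per_bit: float) -> float:
    """Chip-level soft error rate for unprotected bits.

    Assumes every bit contributes independently and equally, so the
    chip-level rate is simply the bit count times the per-bit rate.
    """
    return num_bits * raw_rate_per_bit


# Hypothetical per-bit rate (arbitrary units), held constant across generations.
RAW_RATE_PER_BIT = 0.001

gen_n = chip_soft_error_rate(1_000_000, RAW_RATE_PER_BIT)
gen_next = chip_soft_error_rate(2_000_000, RAW_RATE_PER_BIT)

# With a constant per-bit rate, doubling the bit count doubles the chip rate.
growth = gen_next / gen_n
```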
Bit errors may be classified based on their impact and the ability to detect and correct them. Some bit errors may be classified as “false errors” because they are not read, do not matter, or can be corrected before they are used. The most insidious form of error is silent data corruption (“SDC”), where an error is not detected and induces the system to generate erroneous outputs. To avoid silent data corruption, designers often employ error detection mechanisms, such as parity. Error correction techniques such as error correcting codes (ECC) may also be employed to detect and correct errors, although such techniques cannot be applied in all situations. Furthermore, such error correction techniques consume semiconductor real estate, power, and processing time.
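As a minimal sketch of the parity mechanism mentioned above, the following assumes a single even-parity bit stored alongside each data word. The function names and the 8-bit example word are illustrative assumptions. The sketch also shows why parity alone only detects an odd number of flipped bits and cannot correct them.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit: 1 when the word has an odd number of set bits."""
    return bin(word).count("1") & 1


def parity_ok(word: int, stored_parity: int) -> bool:
    """True when the word still matches the parity bit stored with it."""
    return parity_bit(word) == stored_parity


original = 0b10110010                 # hypothetical stored word
stored = parity_bit(original)         # parity computed when the word was written

single_upset = original ^ 0b00001000  # one bit flipped by a transient fault
double_upset = original ^ 0b00011000  # two bits flipped

single_detected = not parity_ok(single_upset, stored)  # mismatch is detected
double_detected = not parity_ok(double_upset, stored)  # parity is unchanged
```

In this sketch a double-bit upset passes the parity check, which corresponds to the silent data corruption case described above.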
Scan cells are logic circuits added to a semiconductor device that are used during manufacturing testing and post-silicon debug of the device. The scan cells include flip-flops and contain logic to store data and shift it out through a device's test output pins. The scan cells typically include a data path and a scan path. Typically, a scan cell can be used either to read data out of a device or to transfer data into a device to place the device into a known state. Scan cells are typically daisy-chained together to form one or more shift registers called scan chains. These scan chains are primarily used to examine or set the state of the device during testing and debug operations. Typically, the scan portion of the scan cells is disabled prior to the device leaving the factory.
Accordingly, a need exists to more efficiently detect and correct errors within a semiconductor device.
Referring to
As shown in
It is to be understood that the data and scan flip-flops shown in
Still referring to
The error signal may be used in next stage 90 to squash a data error. Furthermore, the error signal may be coupled to multiplexer 110 to cause the output of second flip-flop 130 to pass through to first flip-flop 120. In such manner, an error detected within circuit 100 may be corrected such that valid data is output from circuit 100. The error signal also may be provided to previous stage 80 to cause that stage to stall while error correction occurs in circuit 100.
Thus in operation, circuit 100 may be used to detect and correct an error, such as a single bit error caused by radiation, occurring in first flip-flop 120. Accordingly, when different values are output from flip-flops 120 and 130, the error signal is generated, in turn causing the faulty data value traveling to the next stage to be squashed, stalling the previous stage(s), and copying the valid data from second flip-flop 130 into first flip-flop 120. When the correct data is in place, the error signal may be removed, and the pipeline may continue to process data with a bubble (i.e., a squashed entry) where the faulty data was used. Accordingly, soft errors may be corrected as soon as they are detected, allowing recovery to occur locally, simplifying recovery and eliminating the need to replay work already completed successfully (e.g., the result of a previous stage).
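The detect, squash, stall, and copy sequence described above can be sketched as a behavioral software model. This is an illustrative assumption rather than a description of the actual circuit: the class and function names are hypothetical, each flip-flop holds a single bit, and the second flip-flop is taken to hold the valid copy, as in the passage above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ProtectedCell:
    data_ff: int = 0    # models first flip-flop 120
    shadow_ff: int = 0  # models second flip-flop 130 (assumed to hold valid data)

    def capture(self, value: int) -> None:
        """Latch the same incoming bit into both flip-flops."""
        self.data_ff = value & 1
        self.shadow_ff = value & 1

    def error(self) -> int:
        """Error signal: 1 when the two flip-flops disagree."""
        return self.data_ff ^ self.shadow_ff

    def recover(self) -> None:
        """Copy the valid value back into the first flip-flop."""
        self.data_ff = self.shadow_ff


def clock(cell: ProtectedCell, incoming: int) -> Tuple[Optional[int], bool]:
    """Advance one cycle; return (value for next stage, stall previous stage).

    A value of None models a squashed entry (a pipeline bubble).
    """
    if cell.error():
        cell.recover()          # fix the faulty flip-flop locally
        return None, True       # squash the output, stall the previous stage
    out = cell.data_ff
    cell.capture(incoming)      # normal operation: accept the next value
    return out, False


cell = ProtectedCell()
cell.capture(1)
cell.data_ff ^= 1               # inject a single-bit upset into flip-flop 120
first = clock(cell, 0)          # error cycle: bubble plus stall
second = clock(cell, 0)         # corrected value is forwarded
```

In this model the upset costs a single bubble cycle, after which the corrected value proceeds to the next stage without replaying earlier work.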
In other embodiments, a hardened flip-flop need not be present in circuit 100. Error detection and correction may still occur by generating the error signal (as described above). This error signal when sent to the previous stage may cause that stage to regenerate and re-send the data, thereby correcting the error.
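A minimal sketch of this variant follows, assuming neither flip-flop can be trusted after a mismatch, so the previous stage is asked to regenerate the value. The `regenerate` callable and the `upsets` sequence used for fault injection are hypothetical names introduced only for this example.

```python
from typing import Callable, Iterable, Tuple


def latch_with_retry(
    regenerate: Callable[[], int], upsets: Iterable[bool]
) -> Tuple[int, int]:
    """Capture a bit into two unhardened flip-flops, retrying on mismatch.

    `regenerate` stands in for the previous stage re-sending its data.
    `upsets` yields one flag per capture attempt; True injects a bit flip.
    Returns (captured value, number of re-sends that were needed).
    """
    retries = 0
    for upset in upsets:
        data_ff = regenerate() & 1
        scan_ff = data_ff
        if upset:
            data_ff ^= 1        # a transient fault corrupts one copy
        if data_ff == scan_ff:  # no error signal: capture is accepted
            return data_ff, retries
        retries += 1            # error signal: ask the previous stage again
    raise RuntimeError("no clean capture within the supplied attempts")


clean = latch_with_retry(lambda: 0, [False])
recovered = latch_with_retry(lambda: 1, [True, False])
```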
In yet other embodiments, soft errors may be detected and used to provide a control signal to indicate a possibly incorrect event. This control signal, which may be referred to as a π bit, may be used to reduce false errors and to trigger error recovery in other manners.
Referring now to
As further shown in
In such manner, scan cells may provide state bits that are closely associated with critical data values throughout a processor or other logic of an integrated circuit (IC). These state bits may form shift registers that allow error data to be extracted quickly. Using scan cells in accordance with an embodiment of the present invention, an error condition may be timely corrected, simplifying recovery and minimizing impact on performance and power consumption. Still further, an error signal may be generated and provided to later logic to inform the later logic (e.g., a later pipeline stage) that a recovery operation may be necessary.
Clocking multiple flip-flops within scan cells during normal operation may increase power consumption. Accordingly, in some embodiments an external control mechanism may be used to disable the error detection and/or correction mechanisms disclosed herein to reduce overall power consumption. As an example, a sensor may indicate that soft errors are unlikely to occur, for instance because the system is being used in a location in which radiation, and therefore soft errors, are unlikely. Accordingly, the sensor may send a signal to disable at least the scan portions of the scan cells from performing error detection and/or correction. In other embodiments, a system setting may be used to indicate that power conservation is more important than error management, and accordingly the system setting may cause the scan cells to not perform error detection/correction.
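The two disable conditions described above can be sketched as a simple enable policy. The function and parameter names are hypothetical, and combining the sensor indication and the system setting with a logical OR is an assumption made for this example.

```python
def scan_checking_enabled(
    soft_errors_unlikely: bool, prefer_power_savings: bool
) -> bool:
    """Decide whether the scan portions keep performing error checking.

    soft_errors_unlikely: hypothetical sensor indication of a low-radiation
        environment.
    prefer_power_savings: hypothetical system setting that ranks power
        conservation above error management.
    Checking is disabled when either condition holds.
    """
    return not (soft_errors_unlikely or prefer_power_savings)


default_mode = scan_checking_enabled(False, False)    # checking stays on
low_radiation = scan_checking_enabled(True, False)    # sensor disables it
power_saver = scan_checking_enabled(False, True)      # setting disables it
```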
Embodiments may be implemented in a computer program. As such, these embodiments may be stored on a storage medium having stored thereon instructions which can be used to program a system to perform the embodiments. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic RAMs (DRAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), flash memories, magnetic or optical cards, or any type of media suitable for storing electronic instructions. Similarly, embodiments may be implemented as software modules executed by a programmable control device, such as a computer processor or a custom designed state machine.
Referring now to
Processor 310 may be coupled over a host bus 315 to a memory controller hub (MCH) 330 in one embodiment, which may be coupled to a system memory 320 via a memory bus 325. In various embodiments, system memory 320 may be synchronous dynamic random access memory (SDRAM), static random access memory (SRAM), double data rate (DDR) memory and the like. Memory hub 330 may also be coupled over an Accelerated Graphics Port (AGP) bus 333 to a video controller 335, which may be coupled to a display 337. AGP bus 333 may conform to the Accelerated Graphics Port Interface Specification, Revision 2.0, published May 4, 1998, by Intel Corporation, Santa Clara, Calif.
Memory hub 330 may also be coupled (via a hub link 338) to an input/output (I/O) controller hub (ICH) 340 that is coupled to an input/output (I/O) expansion bus 342 and a Peripheral Component Interconnect (PCI) bus 344, as defined by the PCI Local Bus Specification, Production Version, Revision 2.1 dated June 1995, or alternatively a bus such as the PCI Express bus, or another third generation I/O interconnect bus.
I/O expansion bus 342 may be coupled to an I/O controller 346 that controls access to one or more I/O devices. As shown in
As shown in
PCI bus 344 may be coupled to various components including, for example, a flash memory 360. As shown in
Further shown in
Although the description makes reference to specific components of the system 300, it is contemplated that numerous modifications and variations of the described and illustrated embodiments may be possible.
For example, other embodiments may be implemented in a multiprocessor system (e.g., a point-to-point bus system such as a common system interface (CSI) system). Referring now to
As shown in
First processor 470 and second processor 480 may be coupled to a chipset 490 via P-P interfaces 452 and 454, respectively. As shown in
In turn, chipset 490 may be coupled to a first bus 416 via an interface 496. In one embodiment, first bus 416 may be a Peripheral Component Interconnect (PCI) bus, as defined by the PCI Local Bus Specification, Production Version, Revision 2.1, dated June 1995 or a bus such as the PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.
As shown in
While described herein as primarily for use in connection with a processor, it is to be understood that in various embodiments error detection and/or correction using scan cells or other such circuitry may be implemented in various chips used in a system. For example, such scan cells may be implemented in a chipset associated with a processor, such as an MCH, an ICH, or other such circuitry. Furthermore, while described herein as being implemented within scan cells, it is to be understood that the scope of the present invention is not so limited, and error detection/correction circuitry may be implemented using latches or flip-flops apart from scan cells.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of the present invention.