This disclosure relates generally to memory systems, and more specifically, to soft error detection in a memory system.
Radiation from charged particles, like alpha particles, can change the logic state of a memory cell in a memory. When a memory line or cache line is read and a soft error is detected, error correction code (ECC) can be used to correct the error and write the corrected data back to the memory. Typically, ECC is used to correct single bit errors. However, due to the path of the charged particle through silicon, soft errors are likely to be found in the vicinity of other soft errors. Therefore, the particle that caused the detected soft error may also have flipped other bits belonging to other memory lines in nearby areas of the memory. In the case of a stacked die configuration, the particle that caused the detected soft error may also have flipped other bits in nearby memories that are stacked over or below the memory. If soft errors are not adequately corrected, they can accumulate to a point at which they are not correctable by ECC. Therefore, a need exists for a soft error detection scheme which prevents the occurrence of errors in stacked die configurations which are not correctable by ECC.
The present invention is illustrated by way of example and is not limited by the accompanying figures, in which like references indicate similar elements. Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale.
Due to the path of the charged particle through silicon, soft errors are likely to be found in the vicinity of other soft errors. In the case of multiple stacked die, the charged particle can cause bit errors at any die level of the stacked die. A bit error on one die level suggests there could be errors on other die levels. Therefore, when a bit error (i.e., a flipped bit) due to a soft error is detected at a physical location at one die level of a stacked die device, such as through the use of ECC, bit cells near that same physical location at other die levels of the stacked die device are searched for additional bit errors.
In the illustrated embodiment, each die includes a memory device. For example, die 18 includes a dynamic random access memory (DRAM) 24 and a corresponding memory controller (MC) 22 coupled to DRAM 24. Die 16 includes a DRAM 34 and a corresponding MC 32 coupled to DRAM 34. Die 14 includes a DRAM 44 and a corresponding MC 42 coupled to DRAM 44. Die 12 includes a static random access memory (SRAM) 54. Die 12 also includes a central processing unit (CPU) 52 that is coupled to SRAM 54. In the illustrated embodiment, CPU 52 is able to communicate with each of MC 22, 32, and 42, and each MC is able to communicate with the other MCs of system 10. DRAM 24 includes an array of bit cells (BCs) including BCs 25-30. DRAM 34 includes an array of BCs including BCs 35-40. DRAM 44 includes an array of BCs including BCs 45-50. SRAM 54 includes an array of BCs including BCs 55 and 56. Each of the BCs of the memory devices in system 10 is susceptible to soft errors caused by a charged particle.
In the example of
MC 32 includes error detection and correction logic 130 and scrub logic 114. Error detection and correction logic 130 provides bit error locations within a read memory line to scrub logic 114. Scrub logic 114 includes location translation logic 119, error log 116, and address range determination logic 118. MC 32 receives write data, which can be received from CPU 52, and provides read data, which can be provided to CPU 52. Address and control information corresponding to the read and write data is provided and received at MC 32 along with that data. A bit error indicator and a bit error location are provided by scrub logic 114, and bit error indicators and corresponding bit error locations are received from elsewhere within system 10. Depending on the embodiment, the bit error indicator and bit error location can be provided to CPU 52 or to other MCs in other die levels, and the bit error indicators and bit error locations can be received from CPU 52 or from other MCs in other die levels. In the illustrated embodiment, it is assumed that the bit error indicator and bit error location are provided to the MCs of other die levels and that the bit error indicators and bit error locations are received from MCs of other die levels.
In operation, a read or write access request can be provided to MC 32. In response thereto, an access address and control information, corresponding to either a write operation or a read operation, is provided by MC 32 to control circuitry 128 of memory device 34. Control circuitry 128 provides the appropriate portion of the access address to row decoder 124 to turn on the selected word lines and provides the appropriate portion of the access address to column decoder 126 to select the appropriate bit lines. The bit cells at the intersection of each selected word line and selected bit line store write data and corresponding ECC bits for a write operation or provide read data and corresponding ECC bits for a read operation.
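The split of an access address into a row portion and a column portion can be sketched as follows. This is only an illustration: the bit widths `ROW_BITS` and `COL_BITS`, the function name `decode_address`, and the choice of placing the row bits above the column bits are assumptions made for the example and are not taken from the described embodiment.

```python
from typing import Tuple

ROW_BITS = 10  # assumed number of row-address bits (hypothetical)
COL_BITS = 6   # assumed number of column-address bits (hypothetical)


def decode_address(addr: int) -> Tuple[int, int]:
    """Split a flat access address into (row, column) indices.

    The upper ROW_BITS select a word line and the lower COL_BITS
    select a group of bit lines (an assumed layout).
    """
    if not 0 <= addr < (1 << (ROW_BITS + COL_BITS)):
        raise ValueError("address out of range")
    row = addr >> COL_BITS                # portion sent to the row decoder
    col = addr & ((1 << COL_BITS) - 1)    # portion sent to the column decoder
    return row, col
```

In such a sketch, the row index would play the role of the portion provided to row decoder 124 and the column index the portion provided to column decoder 126.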
For a write operation, the received write data is provided by MC 32 to error detection and correction logic 130 which determines ECC bits for the write data. The write data and the ECC bits are provided to memory device 34 as Din0-DinN in which a first portion of the N+1 bits is the write data and a remaining portion of the N+1 bits is the ECC value. Din0-DinN is provided as D0-DN to column decoder 126. Column decoder 126 couples the D0-DN lines to the selected bit lines in accordance with the decoded write address. In this manner, the values of D0-DN are stored into the selected bit cells. Note that the first portion of the N+1 bits are provided to the selected bit lines of the data portion of array 122 and the remaining portion of the N+1 bits are provided to the selected bit lines of the ECC portion of array 122.
For a read operation, the selected bit lines are sensed and the read data from the selected bit cells are provided on D0 to DN as Dout0-DoutN to error detection and correction logic 130. Error detection and correction logic 130 uses the ECC bits received as part of Dout from the ECC portion of array 122 and performs error detection and correction on the data bits received as part of Dout from the data portion of array 122. If no error is detected, MC 32 outputs Dout as the read data. If a bit error is detected which can be corrected, error detection and correction logic 130 corrects the data and MC 32 outputs the corrected data as the read data. Error detection and correction logic 130 also provides the location of the bit error to scrub logic 114, as will be discussed in further detail below with respect to
Note that error detection and correction logic 130 may use any type of ECC scheme to detect and, if possible, correct errors. In one embodiment, the ECC scheme is capable of detecting and correcting a single bit error. This ECC scheme is also capable of detecting multiple bit errors, but cannot correct a multiple bit error. When a multiple bit error is detected, the error is not corrected, and an uncorrectable error can be signaled to memory controller 32. Also, note that error detection and correction logic 130 may be located outside of MC 32.
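As one possible illustration of a scheme that corrects single bit errors and detects, but does not correct, double bit errors, the following sketch uses an extended Hamming (8,4) code. The choice of this code, the 4-bit data width, and the names `encode` and `decode` are assumptions for the example; the embodiment above does not name a specific ECC scheme.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the
    classic Hamming(7,4) positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2
    return word


def decode(word):
    """Return (status, nibble, error_position).

    status is "ok", "corrected", or "uncorrectable".
    """
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos          # XOR of the positions holding a 1
    overall = sum(word) % 2          # 0 when the overall parity holds
    if syndrome == 0 and overall == 0:
        status, err = "ok", None
    elif overall == 1:
        # Odd number of flips: treat as a single bit error at `syndrome`
        # (position 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status, err = "corrected", syndrome
    else:
        # Non-zero syndrome with even overall parity: double bit error.
        return "uncorrectable", None, None
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble, err
```

The error position returned for a corrected word corresponds to the kind of bit error location that error detection and correction logic 130 would pass on for logging.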
Method 150 proceeds to block 154 where it is determined if a bit error in the sensed data is detected. For example, error detection and correction logic 130 uses the ECC portion of the sensed data to determine if an error exists in the data portion of the sensed data. If no error is detected, method 150 ends at block 156. If an error is detected, method 150 proceeds to block 158, where, if possible, the bit error in the sensed data is corrected. The corrected sensed data can then be output as the read data by MC 32. If the bit error cannot be corrected, method 150 ends with an uncorrectable error. After the bit error is corrected, method 150 proceeds to block 160 in which the bit error and location is logged. For example, error detection and correction logic 130 provides the bit error location within the read memory line to scrub logic 114 which can log the error, including the error location, in error log 116. Therefore, error log 116 includes storage circuitry configured to store error information for each logged error.
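The detect, correct, and log portion of method 150 (blocks 154 through 160) can be sketched as below. The `ErrorLog` class, the `read_with_logging` function, and the `check` callback are hypothetical names introduced for illustration; `check` stands in for error detection and correction logic 130, and its three-value return convention is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# check(sensed) -> (status, data, bit_position), where status is
# "ok", "corrected", or "uncorrectable" (an assumed convention).
CheckFn = Callable[[int], Tuple[str, Optional[int], Optional[int]]]


@dataclass
class ErrorLog:
    """Hypothetical stand-in for error log 116."""
    entries: List[Tuple[int, int]] = field(default_factory=list)

    def record(self, address: int, bit_position: int) -> None:
        self.entries.append((address, bit_position))


def read_with_logging(address: int, sensed: int,
                      check: CheckFn, log: ErrorLog) -> int:
    """Return the read data for one memory line, logging corrected errors."""
    status, data, bit_position = check(sensed)   # block 154: detect
    if status == "ok":
        return data                              # block 156: no error
    if status == "uncorrectable":
        raise RuntimeError(f"uncorrectable error at address {address:#x}")
    log.record(address, bit_position)            # block 160: log error and location
    return data                                  # block 158: corrected data is output
```

Keeping the ECC check behind a callback reflects the note above that any ECC scheme may be used; the logging step does not depend on which one is chosen.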
Method 150 proceeds to block 162 in which a bit error indication and a corresponding bit error location are sent to the other stacked die in system 10 above and below the current die level. For example, if a bit error is found in memory device 34, a bit error indicator indicating an error and the corresponding bit location provided by location translation logic 119 are sent to both MC 42 of die 14 and MC 22 of die 18. The bit error location provides information as to the physical location of the bit error in the current die. It may take a variety of formats and include different types of information. For example, coordinates relative to a local origin of the die may be determined and sent. Location translation logic 119 may take the address of the memory line read and the detected error bit position within the memory line from error detection and correction logic 130 and translate the error location into local coordinates of the die level. In another example, once the local coordinates are determined with respect to an origin of the die, they can be translated to global coordinates with respect to a global origin for system 10. In this example, these global coordinates would be sent with the bit error indication. This translation may be performed, for example, by location translation logic 119 in scrub logic 114. Location translation logic 119 may use lookup tables to aid in the translation of the bit error address and memory line bit position to local coordinates and from local coordinates to global coordinates.
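The two translation steps described above (memory line address and bit position to local die coordinates, then local to global coordinates) can be sketched arithmetically. The sketch assumes a regular array with one memory line per physical row and a uniform cell pitch; the `DieGeometry` fields and the function names are hypothetical, and an actual implementation might instead use lookup tables as noted above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DieGeometry:
    """Assumed layout parameters for one die level (all hypothetical)."""
    bits_per_line: int   # bit cells per memory line, one line per row
    pitch_x: float       # cell pitch along a row, arbitrary length units
    pitch_y: float       # row-to-row pitch, same units
    origin_x: float      # offset of the die's local origin from the global origin
    origin_y: float


def to_local(geo: DieGeometry, line_address: int,
             bit_position: int) -> Tuple[float, float]:
    """Translate (line address, bit position) to local die coordinates."""
    if not 0 <= bit_position < geo.bits_per_line:
        raise ValueError("bit position outside the memory line")
    return (bit_position * geo.pitch_x, line_address * geo.pitch_y)


def to_global(geo: DieGeometry,
              local: Tuple[float, float]) -> Tuple[float, float]:
    """Translate local die coordinates to global system coordinates."""
    return (local[0] + geo.origin_x, local[1] + geo.origin_y)
```

Sending global rather than local coordinates lets a receiving die with a different size or origin map the location back into its own address space without knowing the sender's geometry.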
Method 150 proceeds to block 164 in which, for each stacked die to which the error indicator is sent, above and below the die with the detected bit error, blocks 166 and 168 are performed. In block 166, an address range is calculated to provide a scrub region so that bit cells within a desired distance from the detected bit error can be scrubbed. The scrub region in each nearby memory device is physically near the physical location of the detected bit error in the current die level. Calculating the scrub region will be described further below. In one example, the scrub region of die 14 and 18 may correspond to an address range that includes BC 38. In this example, as seen in
Each MC in stacked system 10 is coupled to the other MCs in system 10 to receive any sent bit error indication and its corresponding location. Using die 16 of
Once an address location in memory device 34 corresponding to the received bit error indication is determined, address range determination logic 118 can determine the scrub region. That is, address range determination logic 118 can determine which addresses or range of addresses are to be scrubbed for more bit errors. For example, the scrub region can be addresses which are physically adjacent to the determined address location of the bit error. The intent of the scrub region is to cover a region which includes the path of the charged particle which caused the bit error in the nearby memory device. Once the scrub region is determined, MC 32 can perform read accesses for the scrub region in array 122 and error detection and correction logic 130 can determine if any bit error is found, as was described above. Each MC of system 10 can do a similar procedure each time a bit error indication and corresponding bit error location is received from another MC in system 10. Each time a bit error is found, a scrub is triggered in another one or more die levels. This iterative process allows for all or most bit errors to be found in response to a charged particle.
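The scrub region calculation and the iterative triggering of scrubs across die levels can be sketched as a worklist loop. Several simplifications are assumed here and are not part of the description: a scrub region is a fixed radius of line addresses around the error, the same line index is treated as the same physical location on every die level, and the `has_error` callback models a read plus ECC check. The names `scrub_region` and `cascade_scrub` are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, Set, Tuple


def scrub_region(center_line: int, radius: int, num_lines: int) -> range:
    """Line addresses within `radius` of `center_line`, clamped to the array."""
    lo = max(0, center_line - radius)
    hi = min(num_lines - 1, center_line + radius)
    return range(lo, hi + 1)


def cascade_scrub(devices: Dict[int, int], first_level: int, first_line: int,
                  radius: int,
                  has_error: Callable[[int, int], bool]) -> Set[Tuple[int, int]]:
    """Follow a trail of bit errors across die levels.

    devices maps a die level to its number of memory lines.
    has_error(level, line) models reading a line and finding a bit error.
    Returns the set of (level, line) locations where errors were found.
    """
    found = {(first_level, first_line)}
    pending = deque([(first_level, first_line)])
    while pending:
        level, line = pending.popleft()
        # Each detected error triggers a scrub in every other die level.
        for other, num_lines in devices.items():
            if other == level:
                continue
            for addr in scrub_region(line, radius, num_lines):
                if (other, addr) not in found and has_error(other, addr):
                    found.add((other, addr))
                    pending.append((other, addr))
    return found
```

Because only lines inside a scrub region are read, an error far from the particle's path is left for a regular scrub, which matches the intent of avoiding a full memory scrub.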
Die 212 also includes a CPU 254 that is coupled to SRAM 256 and a memory controller (MC) 252 coupled to CPU 254 and SRAM 256. In the illustrated embodiment, CPU 254 is able to communicate with MC 252, and MC 252 is able to communicate with each of DRAM 222, 232, and 242 by way of the TSVs. MC 252 is shared among memories 242, 232, and 222. When a bit error is found by MC 252 in any of memory devices 242, 232, and 222, MC 252 can determine a scrub region based on the physical location of the detected bit error, as was described above with respect to MC 32. MC 252 can then perform a scrub of the scrub region in nearby memory devices, such as those above and below the memory device in which the bit error was found. Therefore, method 150 would also apply to system 200 except that a bit error indication and corresponding bit error location would not have to be sent to other MCs since MC 252 is shared among the memory devices. In this case, MC 252, in response to a detected bit error, may still translate the physical location to a global location of system 200 and then use the global location to determine an address in each nearby memory device around which to define a scrub region. In other alternate embodiments, some of the memory devices within a stacked die system may include their own MC while others share an MC. Therefore, different configurations are possible.
Therefore, by now it can be appreciated how improved soft error detection and correction can be achieved in stacked die systems. By triggering scrubs in other die levels having nearby memory devices in response to a bit error in one die level, bit errors can be detected and corrected before they remain present long enough to result in an uncorrectable bit error, such as double bit error. Also, in response to a bit error in one die level, a scrub region can be determined in the other die levels using the bit error location information that is sent with the bit error indication. In this manner, a scrub of the scrub region can be performed rather than performing a full memory scrub in the nearby memory device. Furthermore, the processes described herein may be used to ensure that a trail of flipped bits caused by a charged particle can be effectively corrected regardless of which bit in the trail is first detected. By performing searches in different die levels in response to an error in one die level, soft errors within a particle's path can be more timely corrected.
The conductors as discussed herein may be illustrated or described in reference to being a single conductor, a plurality of conductors, unidirectional conductors, or bidirectional conductors. However, different embodiments may vary the implementation of the conductors. For example, separate unidirectional conductors may be used rather than bidirectional conductors and vice versa. Also, a plurality of conductors may be replaced with a single conductor that transfers multiple signals serially or in a time multiplexed manner. Likewise, single conductors carrying multiple signals may be separated out into various different conductors carrying subsets of these signals. Therefore, many options exist for transferring signals.
The terms “assert” or “set” and “negate” (or “deassert” or “clear”) are used herein when referring to the rendering of a signal, status bit, or similar apparatus into its logically true or logically false state, respectively. If the logically true state is a logic level one, the logically false state is a logic level zero. And if the logically true state is a logic level zero, the logically false state is a logic level one. Each signal described herein may be designed as positive or negative logic. In the case of a negative logic signal, the signal is active low where the logically true state corresponds to a logic level zero. In the case of a positive logic signal, the signal is active high where the logically true state corresponds to a logic level one. Note that any of the signals described herein can be designed as either negative or positive logic signals. Therefore, in alternate embodiments, those signals described as positive logic signals may be implemented as negative logic signals, and those signals described as negative logic signals may be implemented as positive logic signals.
The symbol “$” preceding a number indicates that the number is represented in its hexadecimal or base sixteen form. The symbol “%” or “0b” preceding a number indicates that the number is represented in its binary or base two form.
Because the apparatus implementing the present invention is, for the most part, composed of electronic components and circuits known to those skilled in the art, circuit details will not be explained to any greater extent than that considered necessary, as illustrated above, for the understanding and appreciation of the underlying concepts of the present invention and in order not to obfuscate or distract from the teachings of the present invention.
Although the invention has been described with respect to specific conductivity types or polarity of potentials, skilled artisans appreciate that conductivity types and polarities of potentials may be reversed.
Moreover, the terms “front,” “back,” “top,” “bottom,” “over,” “under” and the like in the description and in the claims, if any, are used for descriptive purposes and not necessarily for describing permanent relative positions. It is understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in other orientations than those illustrated or otherwise described herein.
Some of the above embodiments, as applicable, may be implemented using a variety of different information processing systems. For example, although
Also for example, in one embodiment, the illustrated elements of system 10 are circuitry located on a single integrated circuit or within a same device. Alternatively, system 10 may include any number of separate integrated circuits or separate devices interconnected with each other. Furthermore, those skilled in the art will recognize that boundaries between the functionality of the above described operations are merely illustrative. The functionality of multiple operations may be combined into a single operation, and/or the functionality of a single operation may be distributed in additional operations. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.
Although the invention is described herein with reference to specific embodiments, various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. For example, memory 20 may be implemented with different memory types and in different ways and still make use of the improved soft error detection techniques described herein. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. Any benefits, advantages, or solutions to problems that are described herein with regard to specific embodiments are not intended to be construed as a critical, required, or essential feature or element of any or all the claims.
The term “coupled,” as used herein, is not intended to be limited to a direct coupling or a mechanical coupling.
Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles.
Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements.
The following are various embodiments of the present invention.
In one embodiment, an integrated circuit (IC) device includes a first memory device; a second memory device stacked with the first memory device; and one or more memory controllers configured to: detect a first error in data stored in the first memory device at a first physical location in the IC device, and, upon detecting the first error, determine whether there is a second error in data stored in the second memory device in a second physical location in the IC device near the first physical location. In one aspect, the one or more memory controllers are further configured to: translate an address of the first physical location to an address at the second physical location. In a further aspect, the first and second memory devices have a same size. In another further aspect, the first memory device has a different size than the second memory device. In yet a further aspect, the memory controller is configured to determine the first physical location and the second physical location based on an origin and coordinates or dimensions of the first and second memory devices. In yet an even further aspect, the memory controller is configured to test multiple physical locations in the IC device near the first physical location to determine whether there is the second error in the data in the second memory device. In another aspect of the above one embodiment, the first memory device is coupled to a first of the one or more memory controllers; the second memory device is coupled to a second of the one or more memory controllers; the first of the one or more memory controllers is configured to communicate the address or the location of the first error to the second of the one or more memory controllers; and the second memory controller determines whether there is the second error based on the address or the location of the first error. In another aspect, the one or more memory controllers are configured to correct the first and second errors.
In yet another aspect, the first and second memory devices are coupled to a single one of the one or more memory controllers.
In another embodiment, a packaged integrated circuit device includes a first memory device including bit cells; a second memory device including bit cells; and one or more memory controllers configured to detect an error in data in one of the bit cells in the first memory device and, based on an address of the one of the bit cells, determine whether there is an error in data of a bit cell in a location of the second memory device near the address of the one of the bit cells in the first memory device. In one aspect, the first and second memory devices are stacked with each other. In another aspect, the device further includes a third memory device including bit cells, wherein the one or more memory controllers are configured to, based on an address of the one of the bit cells, determine whether there is an error in data of a bit cell in a location of the third memory device near the address of the one of the bit cells in the first memory device. In a further aspect, the third memory device is stacked with the first and second memory devices. In yet a further aspect, at least one of the first, second, and third memory devices has a different size than another of the first, second, and third memory devices. In another aspect of the above another embodiment, the first memory device is coupled to a first of the one or more memory controllers, and the second memory device is coupled to a second of the one or more memory controllers.
In yet another embodiment, a method of operating a first memory device and a second memory device includes detecting an error in a bit cell of the first memory device; determining an address of the bit cell with the error; determining a range of addresses of bit cells in the second memory device, wherein the range of addresses corresponds to locations near a location of the address of the bit cell with the error in the first memory device; detecting whether there is another error in one or more of the bit cells in the range of addresses in the second memory device; correcting the error in the bit cell of the first memory device; and, if the other error is detected, correcting the other error in the second memory device. In one aspect, the first and second memory devices are stacked with one another. In another aspect, the method further includes determining the range of addresses based on an origin location of the first and second devices. In yet another aspect, the method further includes using a first memory controller coupled to the first and second memory devices to detect the error and the other error. In yet another aspect, the method further includes using a first memory controller coupled to the first memory device to detect the error; and using a second memory controller coupled to the second memory device to detect the other error.
Number | Date | Country |
---|---|---|
20180052615 A1 | Feb 2018 | US |