multiple error correction and detection schemes.
A four-channel memory module includes four independent twenty (20) data bit memory channels and dual channel memory devices. The channels of the dual channel memory are accessed independently. Thus, the four channels for accessing the memory module each access one channel of a first set and a second set of dual channel memory devices on the module. Error detection and correction codeword configurations and schemes can implement chipkill and single symbol data correct/double symbol data detect (SSDC/DSDD). Single symbol data correct with fewer memory devices may also be implemented. Error detection and correction codeword configurations and schemes may be switched in response to detecting a failed device, signal line, or memory channel.
In
Each dual channel DRAM device 110a-110l includes two nonoverlapping sets of memory arrays that are respectively accessed via two channel interfaces 111aa-111lb that operate independently of each other. In other words, each DRAM device 110a-110l operates the command, address, and data transfer functions of its respective channel interfaces 111aa-111lb independently of the other channel interfaces 111aa-111lb on the same DRAM device 110a-110l. Thus, for example, channel A interface 111aa of DRAM L0 110a accesses a first set of memory arrays in DRAM L0 110a and channel B interface 111ab of DRAM L0 110a accesses a second set of memory arrays in DRAM L0 110a, where the first set of memory arrays and the second set of memory arrays do not have any common memory array (i.e., are nonoverlapping sets).
At least the CA signals of channel A interface 145a are operatively coupled to RCD 135. RCD 135 operatively couples the CA signals of channel A interface 145a to the channel A interfaces 111aa-111fa of the left side DRAM devices 110a-110f. Similarly, at least the CA signals of channel B interface 145b are operatively coupled to RCD 135. RCD 135 operatively couples the CA signals of channel B interface 145b to the channel B interfaces 111ab-111fb of the left side DRAM devices 110a-110f.
At least the CA signals of channel C interface 145c are operatively coupled to RCD 135. RCD 135 operatively couples the CA signals of channel C interface 145c to the channel C interfaces 111ga-111la of the right side DRAM devices 110g-110l. Similarly, at least the CA signals of channel D interface 145d are operatively coupled to RCD 135. RCD 135 operatively couples the CA signals of channel D interface 145d to the channel D interfaces 111gb-111lb of the right side DRAM devices 110g-110l.
The channel A interface 111aa of DRAM device 110a is operatively coupled to communicate N bits of data with the device side channel A interface 132aa of data buffer device 130a. In an embodiment, N=2. The channel B interface 111ab of DRAM device 110a is operatively coupled to communicate N bits of data with the device side channel B interface 132ab of data buffer device 130a. The channel A interface 111ba of DRAM device 110b is operatively coupled to communicate N bits of data with the device side channel A interface 132aa of data buffer device 130a; the channel B interface 111bb of DRAM device 110b is operatively coupled to communicate N bits of data with the device side channel B interface 132ab of data buffer device 130a; the channel A interface 111ca of DRAM device 110c is operatively coupled to communicate N bits of data with the device side channel A interface 132ba of data buffer device 130b; the channel B interface 111cb of DRAM device 110c is operatively coupled to communicate N bits of data with the device side channel B interface 132bb of data buffer device 130b, and so on with a like pattern of connection for all of the DRAM devices 110a-110l and data buffer devices 130a-130f on module 150 (which, for the sake of brevity, will not be detailed herein).
Controller side channel A interface 131aa is operatively coupled to channel A interface 145a. Controller side channel A interface 131aa communicates 2*N bits with channel A interface 145a. The 2*N bits comprise N bits communicated with DRAM device 110a and N bits communicated with DRAM device 110b for a total of 2*N bits. Similarly, controller side channel B interface 131ab is operatively coupled to channel B interface 145b. Likewise, the controller side channel A interfaces 131ba-131ca of data buffer devices 130b-130c are operatively coupled to channel A interface 145a; the controller side channel B interfaces 131bb-131cb of data buffer devices 130b-130c are operatively coupled to channel B interface 145b; the controller side channel C interfaces 131da-131fa of data buffer devices 130d-130f are operatively coupled to channel C interface 145c; and, the controller side channel D interfaces 131db-131fb of data buffer devices 130d-130f are operatively coupled to channel D interface 145d. Accordingly, each memory channel A-D 145a-145d communicates with five (5) data buffer devices (left side data buffers 130a-130c or right side data buffers 130d-130f), each communicating using 2*N data signals, resulting in twenty (20) data (DQ) signals per memory channel A-D 145a-145d when N=2.
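The signal count stated above follows from simple arithmetic. The following is a minimal sketch of that calculation; the function and parameter names are hypothetical and do not appear in the description.

```python
def dq_signals_per_channel(n_bits: int, buffers_per_channel: int = 5) -> int:
    """Return the number of data (DQ) signals on one memory channel.

    Each data buffer device serves two DRAM channel interfaces of N bits
    each, so it presents 2*N data signals to the memory channel.
    """
    bits_per_buffer = 2 * n_bits
    return buffers_per_channel * bits_per_buffer


# With N=2 and five data buffer devices per channel, as in the embodiment.
print(dq_signals_per_channel(2))
```

With N=2 the function returns twenty, matching the twenty data bit memory channels described for module 150.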
It should be understood that each codeword 204 is composed of forty (40) bits organized as ten total 4-bit symbols. The ten total symbols are composed of eight data symbols and two check symbols. Thus, codeword 204 may be generated, checked, and corrected (e.g., by EDC circuitry 125 of controller 120) using a Reed-Solomon (RS) error detection and correction scheme of RS(10,8). Using results from EDC circuitry 125, persistent error detection circuitry 126 may determine whether errors in codewords 204 are persistent. Also, because each symbol S0-S7, C0-C1 is communicated to/from a single DRAM L0-L9, the RS(10,8) scheme provides chipkill capability wherein the failure of an entire DRAM device L0-L9 is a correctible error.
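The chipkill property can be illustrated by grouping the bits of one codeword 204 by DRAM device. The sketch below is illustrative only: it assumes two data bits per device per timeslot and two timeslots per codeword, and the bit ordering inside a symbol and all names are assumptions rather than the described implementation. It shows that a complete failure of one device damages exactly one symbol, which is within the single-symbol correction capability of a code with two check symbols.

```python
N_DEVICES = 10      # DRAM devices L0-L9 on one channel
DQ_PER_DEVICE = 2   # N = 2 data bits per device per timeslot
TIMESLOTS = 2       # timeslots per codeword (assumed from the 4-bit symbol size)


def symbols_from_burst(burst):
    """burst[t][dq] is one bit; return ten 4-bit symbols, one per device."""
    symbols = []
    for dev in range(N_DEVICES):
        value = 0
        for t in range(TIMESLOTS):
            for lane in range(DQ_PER_DEVICE):
                value = (value << 1) | burst[t][dev * DQ_PER_DEVICE + lane]
        symbols.append(value)
    return symbols


def rs_correctable_symbols(n, k):
    """Symbol errors correctable by an RS(n,k) code: floor((n - k) / 2)."""
    return (n - k) // 2


good = [[0] * (N_DEVICES * DQ_PER_DEVICE) for _ in range(TIMESLOTS)]
bad = [row[:] for row in good]
for t in range(TIMESLOTS):
    for lane in range(DQ_PER_DEVICE):
        bad[t][3 * DQ_PER_DEVICE + lane] ^= 1  # device L3 returns inverted bits

pairs = zip(symbols_from_burst(good), symbols_from_burst(bad))
damaged = [i for i, (a, b) in enumerate(pairs) if a != b]
print(damaged, rs_correctable_symbols(10, 8))
```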
When chipkill capability is used across two channels (e.g., channel A 145a and channel B 145b) that communicate with both channels (e.g., 111aa-111fa and 111ab-111fb) of a set of dual channel DRAMs (e.g., DRAM devices L0-L9 110a-110f), the presence of failures in a same DRAM (e.g., DRAM device L3 110d) across the two channels of that DRAM (e.g., 111da and 111db) indicates that this DRAM has failed. Thus, symbol errors on one of the two channels (e.g., channel A 145a) may indicate a need to ‘kill’ a failing/failed DRAM on the other channel (e.g., channel B 145b). In an embodiment, symbol errors on one of the two channels (e.g., channel A 145a) are used to initiate an error checking process (e.g., scrub operation) on the other channel (e.g., channel B 145b) before an error condition (e.g., chip failure) is detected on the other channel. In an embodiment, symbol errors on only one of the two channels (e.g., channel A 145a) may indicate a need to ‘kill’ a failing/failed channel (e.g., channel A 145a) while not altering the operation of the other channel (e.g., channel B 145b). Thus, the non-failing channel (e.g., channel B 145b) may operate using a different error correction and detection scheme than is used by the failing/failed channel (e.g., channel A 145a).
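One way to picture the cross-channel reasoning above is as a comparison of the devices that show symbol errors on each channel. The following sketch is a hypothetical policy function, not the described controller logic; the function name, the returned keys, and the use of device indices are all assumptions made for illustration.

```python
def classify_symbol_errors(errors_a, errors_b):
    """Compare devices showing symbol errors on channel A and on channel B.

    A device erroring on both channels is treated as a failed DRAM. A device
    erroring on only one channel prompts a scrub of the other channel for
    that same device.
    """
    errors_a, errors_b = set(errors_a), set(errors_b)
    return {
        "kill_device": sorted(errors_a & errors_b),
        "scrub_channel_b": sorted(errors_a - errors_b),
        "scrub_channel_a": sorted(errors_b - errors_a),
    }


# Device L3 shows errors on both channels; device L7 only on channel A.
print(classify_symbol_errors({3, 7}, {3}))
```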
It should be understood that each codeword 304 is composed of 160 bits organized as twenty total 8-bit symbols. The twenty total symbols are composed of sixteen data symbols and either three or four check symbols. Thus, codeword 304 may be generated, checked, and corrected (e.g., by EDC circuitry 125 of controller 120) using either a RS(20,16) or RS(20,17) error detection and correction scheme. Using results from EDC circuitry 125, persistent error detection circuitry 126 may determine whether errors in codewords 304 are persistent. The RS(20,16) and RS(20,17) schemes provide single symbol data correct and double symbol data detect (SSDC/DSDD) capability.
It should be understood that each codeword 404 is composed of 144 bits organized as eighteen (18) total 8-bit symbols. The eighteen total symbols are composed of sixteen data symbols and two check symbols. Thus, codeword 404 may be generated, checked, and corrected (e.g., by EDC circuitry 125 of controller 120) using a RS(18,16) error detection and correction scheme. Using results from EDC circuitry 125, persistent error detection circuitry 126 may determine whether errors in codewords 404 are persistent. The RS(18,16) scheme provides single symbol data correct (SSDC) capability.
It should be understood that each codeword 504 is composed of 128 bits organized as sixteen (16) total 8-bit symbols that do not include any check symbols. Thus, codeword 504 may not be generated, checked, and corrected using an error detection and correction scheme.
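The codeword configurations described above can be compared side by side. The sketch below is illustrative: the class and field names are hypothetical, and the capability test relies on the general property that a code with minimum distance d can simultaneously correct t symbol errors and detect e symbol errors (with e at least t) when t + e is less than d, together with the Reed-Solomon minimum distance of n - k + 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodewordConfig:
    name: str
    n: int                # total symbols in the codeword
    k: int                # non-check symbols, as in RS(n,k)
    bits_per_symbol: int

    @property
    def total_bits(self):
        return self.n * self.bits_per_symbol

    @property
    def min_distance(self):
        return self.n - self.k + 1

    def supports(self, correct, detect):
        """True if `correct` errors can be fixed while `detect` are flagged."""
        return detect >= correct and correct + detect < self.min_distance


CONFIGS = [
    CodewordConfig("204 RS(10,8)", 10, 8, 4),
    CodewordConfig("304 RS(20,16)", 20, 16, 8),
    CodewordConfig("304 RS(20,17)", 20, 17, 8),
    CodewordConfig("404 RS(18,16)", 18, 16, 8),
    CodewordConfig("504 no check symbols", 16, 16, 8),
]

for cfg in CONFIGS:
    print(cfg.name, cfg.total_bits, cfg.supports(1, 1), cfg.supports(1, 2))
```

Under these assumptions the two 20-symbol configurations satisfy the SSDC/DSDD condition, the RS(10,8) and RS(18,16) configurations satisfy only the single symbol correct condition, and the 16-symbol configuration satisfies neither.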
Each codeword 604 of burst 602 is composed of twenty (20) symbols divided into two ten symbol groups S00-S90 and S01-S91. Each symbol S00-S90, S01-S91 of codeword 604 is composed of four (4) bits communicated with a single DQ signal of a single DRAM device L0-L9 over four (4) burst 602 timeslots. See, for example, symbols S40 606 and S41 called out in detail in
It should be understood that each codeword 604 is composed of 80 bits organized as twenty total 4-bit symbols. The twenty total symbols are composed of ten symbols assigned to a first encoding group (symbols S00-S90) and ten symbols assigned to a second encoding group (symbols S01-S91). Thus, each encoding group S00-S90 and S01-S91 of codeword 604 may be generated, checked, and corrected (e.g., by EDC circuitry 125 of controller 120) using independent RS(10,8) error detection and correction schemes. Using results from EDC circuitry 125, persistent error detection circuitry 126 may determine whether errors in codewords 604 are persistent. The RS(10,8) scheme provides single symbol data correct capability. Thus, because each of the two bits communicated with a DRAM device L0-L9 is assigned to a different symbol, and the two different symbols are assigned to different encoding groups, the dual RS(10,8) group scheme of codeword 604 provides one DQ or a quarter device correction capability.
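The assignment of DQ signals to symbols and encoding groups in codeword 604 can be sketched as follows. This is an illustration under stated assumptions: the burst is modeled as a nested list indexed by timeslot, device, and DQ lane; the bit order within a symbol and the helper names are hypothetical. The example shows that a failure confined to one DQ signal damages one symbol in one encoding group.

```python
import copy

DEVICES = 10    # DRAM devices L0-L9
LANES = 2       # DQ signals per device on this channel
TIMESLOTS = 4   # each symbol is four bits of one DQ over four timeslots


def symbols_by_group(burst):
    """burst[t][dev][lane] is one bit; return groups[g][dev] as 4-bit symbols.

    The symbol from lane g of device d is named S<d><g> and belongs to
    encoding group g.
    """
    groups = [[0] * DEVICES for _ in range(LANES)]
    for dev in range(DEVICES):
        for lane in range(LANES):
            value = 0
            for t in range(TIMESLOTS):
                value = (value << 1) | burst[t][dev][lane]
            groups[lane][dev] = value
    return groups


good = [[[0] * LANES for _ in range(DEVICES)] for _ in range(TIMESLOTS)]
bad = copy.deepcopy(good)
for t in range(TIMESLOTS):
    bad[t][4][1] ^= 1  # second DQ of device L4 returns inverted bits

g_good, g_bad = symbols_by_group(good), symbols_by_group(bad)
damaged = [f"S{dev}{g}" for g in range(LANES) for dev in range(DEVICES)
           if g_good[g][dev] != g_bad[g][dev]]
print(damaged)
```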
The first codeword is communicated with a first independent channel of a plurality of dual independent channel dynamic random access memory (DRAM) devices disposed on a module (704). For example, controller 120 may communicate, via memory channel A interface 121a, memory channel A 145a of module 150, and data buffer devices 130a-130c a codeword 204 with the memory channel A interfaces 111aa-111fa of DRAM devices L0-L9 110a-110f.
A second codeword having second data symbol fields and second check symbol fields is generated (706). For example, EDC circuitry 125 of controller 120 may generate codeword 304 having data symbol fields S0-S15 and check symbol fields C0-C3. The second codeword is communicated with a second independent channel of a plurality of dual independent channel DRAM devices disposed on the module (708). For example, controller 120 may communicate, via memory channel B interface 121b, memory channel B 145b of module 150, and data buffer devices 130a-130c a codeword 304 with the memory channel B interfaces 111ab-111fb of DRAM devices L0-L9 110a-110f.
Codewords are communicated with a second channel of a module and second channel of the plurality of dual-channel DRAMs on the module using a second error detection and correction scheme (804). For example, controller 120 may communicate, via memory channel B interface 121b, memory channel B 145b of module 150, and data buffer devices 130a-130c a codeword 304 with the memory channel B interfaces 111ab-111fb of DRAM devices L0-L9 110a-110f that is encoded with a RS(20,17) error detection and correction scheme.
Using the first error detection and correction scheme, a failure of a first one of the dual-channel DRAMs is detected (904). For example, EDC circuitry 125 of controller 120 may, using the RS(10,8) EDC scheme, detect a failure of DRAM device L3 110d. Using results from EDC circuitry 125, persistent error detection circuitry 126 may determine that DRAM device L3 110d has a persistent failure. An indicator associated with the failure of the first one of the dual-channel DRAMs is set (906). For example, controller 120 may, in response to detecting the failure of DRAM device L3 110d, set an internal bit or register with an indicator that DRAM device L3 110d has failed. Controller 120 may also transmit an indicator that DRAM device L3 110d has failed to a host and/or host operating system.
The first channel is reset (908). For example, controller 120 may, in response to detecting the failure of DRAM device L3 110d, stop using DRAM L3 110d. Codewords spread across a third number of timeslots and having a fourth number of bits per symbol are communicated with the first channel of the module using a second error detection and correction scheme (910). For example, controller 120 may communicate, via memory channel A interface 121a, memory channel A 145a of module 150, and data buffer devices 130a-130c codewords 404, which have a symbol size of eight bits communicated over eight timeslots and are encoded with a RS(18,16) error detection and correction scheme, with the memory channel A interfaces 111aa-111fa of DRAM devices L0-L2, L4-L9 110a-110f.
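The sequence of setting a failure indicator, retiring the failed device, and switching schemes can be sketched as a small state change. The classes and method names below are hypothetical and are not the described controller; the sketch only checks that the bit counts are consistent with the text, namely that ten devices over two timeslots carry a 40-bit RS(10,8) codeword and nine devices over eight timeslots carry a 144-bit RS(18,16) codeword when each device contributes two data bits per timeslot.

```python
from dataclasses import dataclass, field

DQ_PER_DEVICE = 2  # N = 2 data bits per device per timeslot


@dataclass(frozen=True)
class Scheme:
    n: int
    k: int
    bits_per_symbol: int
    timeslots: int

    def codeword_bits(self):
        return self.n * self.bits_per_symbol


CHIPKILL = Scheme(n=10, k=8, bits_per_symbol=4, timeslots=2)   # codeword 204
DEGRADED = Scheme(n=18, k=16, bits_per_symbol=8, timeslots=8)  # codeword 404


@dataclass
class Channel:
    devices: list = field(default_factory=lambda: list(range(10)))
    scheme: Scheme = CHIPKILL
    failed: set = field(default_factory=set)

    def burst_bits(self):
        """Bits transferred by the in-use devices over one codeword."""
        return len(self.devices) * DQ_PER_DEVICE * self.scheme.timeslots

    def on_persistent_device_failure(self, dev):
        self.failed.add(dev)       # set the failure indicator (906)
        self.devices.remove(dev)   # stop using the failed device (908)
        self.scheme = DEGRADED     # communicate using the second scheme (910)


channel_a = Channel()
before = channel_a.burst_bits()
channel_a.on_persistent_device_failure(3)  # DRAM device L3
after = channel_a.burst_bits()
print(before, after)
```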
Using the first error detection and correction scheme, a failure of a first data signal of one of the dual-channel DRAMs is detected (1004). For example, EDC circuitry 125 of controller 120 may, using the RS(10,8) EDC scheme, detect a failure of a data (DQ) signal of DRAM device L3 110d. Using results from EDC circuitry 125, persistent error detection circuitry 126 may determine that the data (DQ) signal of DRAM device L3 110d has a persistent failure. An indicator associated with the failure of the first data signal of the one of the dual-channel DRAMs is set (1006). For example, controller 120 may, in response to detecting the failure of the DQ signal of DRAM device L3 110d, set an internal bit or register with an indicator that the DQ signal of DRAM device L3 110d has failed. Controller 120 may also transmit an indicator that the DQ signal of DRAM device L3 110d has failed to a host and/or host operating system.
The first channel is reset (1008). For example, in response to detecting the failure of the DQ signal of DRAM device L3 110d, controller 120 may stop using DRAM device L3 110d. Codewords spread across a third number of timeslots and having a fourth number of bits per symbol are communicated with the first channel of the module using a second error detection and correction scheme (1010). For example, in response to detecting the failure of the DQ signal of DRAM device L3 110d, controller 120 may communicate, via memory channel A interface 121a, memory channel A 145a of module 150, and data buffer devices 130a-130c codewords 404, which have a symbol size of eight bits communicated over eight timeslots and are encoded with a RS(18,16) error detection and correction scheme, with the memory channel A interfaces 111aa-111fa of DRAM devices L0-L2, L4-L9 110a-110f.
In the first mode, codewords spread across the first number of timeslots and having the second number of bits per symbol are communicated with a second channel of a module and second channel of the plurality of dual-channel DRAMs on the module using the first error detection and correction scheme (1104). For example, controller 120 may communicate, via memory channel B interface 121b, memory channel B 145b of module 150, and data buffer devices 130a-130c codewords 204, which have a symbol size of four bits communicated over two timeslots and are encoded with the RS(10,8) error detection and correction scheme, with the memory channel B interfaces 111ab-111fb of DRAM devices L0-L9 110a-110f.
Using the first error detection and correction scheme, a failure of the second channel is detected (1106). For example, EDC circuitry 125 of controller 120 may, using the RS(10,8) EDC scheme, detect a failure of circuitry associated with the B channel of DRAM device L3 110d (e.g., memory channel B interface 111db, array accessed using memory channel B interface 111db, etc.). Using results from EDC circuitry 125, persistent error detection circuitry 126 may determine that the circuitry associated with the B channel of DRAM device L3 110d has a persistent failure. An indicator associated with the failure of a first device is set (1108). For example, controller 120 may, in response to detecting the failure of circuitry associated with the B channel of DRAM device L3 110d, set an internal bit or register with an indicator that circuitry associated with the B channel of DRAM device L3 110d has failed. Controller 120 may also transmit an indicator that circuitry associated with the B channel of DRAM device L3 110d has failed to a host and/or host operating system.
The first channel and the second channel are merged and a second mode is entered (1110). For example, controller 120 may enter a mode where the data symbols and check symbols for codewords are spread across both the memory channel A interface 121a and the memory channel B interface 121b. In the second mode, codewords are communicated with the merged first channel and second channel (1112). For example, controller 120 may communicate data with module 150 using an error detection and correction scheme that spreads the data symbols and check symbols for codewords across both the memory channel A interface 121a and the memory channel B interface 121b. For example, when only nine (9) DRAM devices with x4 data signals are working correctly, a RS(18,16) scheme spread over the two channels A and B may be used. One symbol may be 4 bits with 2 bits of each DRAM spread over two bursts. One symbol is corrected, meaning “half” of the DRAM is corrected when the DRAM is configured internally to follow a “bounded fault” scheme.
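The symbol layout of the merged mode can be sketched by enumerating one symbol per working device per channel. This is an illustration only: it assumes that DRAM device L3 is the retired device, that each symbol is formed from the two bits a device communicates on one channel over two bursts, and all names are hypothetical.

```python
WORKING_DEVICES = [d for d in range(10) if d != 3]  # assumption: L3 is retired
CHANNELS = ("A", "B")
BITS_PER_BURST = 2     # bits each DRAM communicates per channel per burst
BURSTS_PER_SYMBOL = 2  # each symbol is spread over two bursts


def merged_symbol_layout(devices):
    """Return one (device, channel) entry per symbol of a merged codeword."""
    return [(dev, ch) for dev in devices for ch in CHANNELS]


layout = merged_symbol_layout(WORKING_DEVICES)
symbol_bits = BITS_PER_BURST * BURSTS_PER_SYMBOL
check_symbols = len(layout) - 16  # RS(18,16) carries sixteen non-check symbols

# Symbols affected when the channel B half of device L5 fails.
affected = [entry for entry in layout if entry == (5, "B")]
print(len(layout), symbol_bits, check_symbols, affected)
```

Because each symbol maps to one channel of one device, a fault bounded to that half of a device touches a single symbol in this layout.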
The methods, systems and devices described above may be implemented in computer systems, or stored by computer systems. The methods described above may also be stored on a non-transitory computer readable medium. Devices, circuits, and systems described herein may be implemented using computer-aided design tools available in the art, and embodied by computer-readable files containing software descriptions of such circuits. This includes, but is not limited to, one or more elements of memory system 100 and its components. These software descriptions may be: behavioral, register transfer, logic component, transistor, and layout geometry-level descriptions. Moreover, the software descriptions may be stored on storage media or communicated by carrier waves.
Data formats in which such descriptions may be implemented include, but are not limited to: formats supporting behavioral languages like C, formats supporting register transfer level (RTL) languages like Verilog and VHDL, formats supporting geometry description languages (such as GDSII, GDSIII, GDSIV, CIF, and MEBES), and other suitable formats and languages. Moreover, data transfers of such files on machine-readable media may be done electronically over the diverse media on the Internet or, for example, via email. Note that physical files may be implemented on machine-readable media such as: 4 mm magnetic tape, 8 mm magnetic tape, 3½ inch floppy media, CDs, DVDs, and so on.
Processors 1202 execute instructions of one or more processes 1212 stored in a memory 1204 to process and/or generate circuit component 1220 responsive to user inputs 1214 and parameters 1216. Processes 1212 may be any suitable electronic design automation (EDA) tool or portion thereof used to design, simulate, analyze, and/or verify electronic circuitry and/or generate photomasks for electronic circuitry. Representation 1220 includes data that describes all or portions of memory system 100, and its components, as shown in the Figures.
Representation 1220 may include one or more of behavioral, register transfer, logic component, transistor, and layout geometry-level descriptions. Moreover, representation 1220 may be stored on storage media or communicated by carrier waves.
Data formats in which representation 1220 may be implemented include, but are not limited to: formats supporting behavioral languages like C, formats supporting register transfer level (RTL) languages like Verilog and VHDL, formats supporting geometry description languages (such as GDSII, GDSIII, GDSIV, CIF, and MEBES), and other suitable formats and languages. Moreover, data transfers of such files on machine-readable media may be done electronically over the diverse media on the Internet or, for example, via email.
User inputs 1214 may comprise input parameters from a keyboard, mouse, voice recognition interface, microphone and speakers, graphical display, touch screen, or other type of user interface device. This user interface may be distributed among multiple interface devices. Parameters 1216 may include specifications and/or characteristics that are input to help define representation 1220. For example, parameters 1216 may include information that defines device types (e.g., NFET, PFET, etc.), topology (e.g., block diagrams, circuit descriptions, schematics, etc.), and/or device descriptions (e.g., device properties, device dimensions, power supply voltages, simulation temperatures, simulation models, etc.).
Memory 1204 includes any suitable type, number, and/or configuration of non-transitory computer-readable storage media that stores processes 1212, user inputs 1214, parameters 1216, and circuit component 1220.
Communications devices 1206 include any suitable type, number, and/or configuration of wired and/or wireless devices that transmit information from processing system 1200 to another processing or storage system (not shown) and/or receive information from another processing or storage system (not shown). For example, communications devices 1206 may transmit circuit component 1220 to another system. Communications devices 1206 may receive processes 1212, user inputs 1214, parameters 1216, and/or circuit component 1220 and cause processes 1212, user inputs 1214, parameters 1216, and/or circuit component 1220 to be stored in memory 1204.
Implementations discussed herein include, but are not limited to, the following examples:
Example 1: A controller, comprising: four memory channel controller interfaces to communicate with four memory channel module interfaces on a memory module comprising a substrate and dual x2 dynamic random access memory (DRAM) devices, the dual x2 DRAM devices each having a respective first memory access interface and a respective second memory access interface that operate independently of each other to access one of two respective sets of memory cores that are nonoverlapping sets; a first memory channel controller interface of the four memory channel controller interfaces to communicate first data symbols and first check symbols, arranged into first codewords, with respective first memory access interfaces of the dual x2 DRAM devices; and a second memory channel controller interface of the four memory channel controller interfaces to communicate second data symbols and second check symbols, arranged into second codewords, with respective second memory access interfaces of the dual x2 DRAM devices.
Example 2: The controller of example 1, comprising: error detection and correction circuitry to process the first codewords to determine whether there are errors in the first codewords.
Example 3: The controller of example 2, wherein the first data symbols and the first check symbols have 4 bits.
Example 4: The controller of example 2, comprising: persistent error detection circuitry to determine whether errors in the first codewords are persistent.
Example 5: The controller of example 4, wherein when the persistent error detection circuitry determines errors in the first codewords are persistent, the controller communicates third data symbols and third check symbols, arranged into third codewords, with the first memory access interfaces and the second memory access interfaces of the dual x2 DRAM devices.
Example 6: The controller of example 5, wherein the third data symbols and third check symbols have more bits than the first data symbols and first check symbols.
Example 7: The controller of example 1, wherein first data symbols and first check symbols are coded according to a first error detection and correction scheme and second data symbols and second check symbols are coded according to a second error detection and correction scheme that is different than the first error detection and correction scheme.
Example 8: A memory controller, comprising: a first memory channel to communicate, with a first independent channel of a plurality of dual independent channel dynamic random access memory (DRAM) devices disposed on a module, first data symbol fields and first check symbol fields, arranged into first codewords; and a second memory channel to communicate, with a second independent channel of the plurality of dual independent channel DRAM devices disposed on the module, second data symbol fields and second check symbol fields, arranged into second codewords.
Example 9: The memory controller of example 8, further comprising: error detection and correction circuitry to, based on values in at least one of the first check symbol fields, correct an error in a first one of the first data symbol fields.
Example 10: The memory controller of example 8, wherein each of the plurality of dual independent channel DRAM devices communicates using a data width of two bits of data with each of the first memory channel and the second memory channel.
Example 11: The memory controller of example 10, wherein each of the first data symbol fields, first check symbol fields, second data symbol fields, and second check symbol fields are four bit wide fields.
Example 12: The memory controller of example 8, wherein contents of the first data symbol fields and first check symbol fields are coded according to a first error detection and correction scheme and contents of the second data symbol fields and second check symbol fields are coded according to a second error detection and correction scheme.
Example 13: The memory controller of example 12, wherein the first error detection and correction scheme and the second error detection and correction scheme have different error detection and correction capabilities.
Example 14: The memory controller of example 8, further comprising: error detection and correction circuitry to, based on values in third data symbol fields and third check symbol fields, arranged into third codewords, correct errors in the third data symbol fields, where the third codewords are communicated using the first memory channel and the second memory channel.
Example 15: The memory controller of example 14, wherein, in a first mode, the first codewords are communicated using the first memory channel and the second codewords are communicated using the second memory channel and, in a second mode, the third codewords are communicated using both the first memory channel and the second memory channel.
Example 16: A method of operating a memory controller, comprising: generating a first codeword having first data symbol fields and first check symbol fields; communicating, with a first independent channel of a plurality of dual independent channel dynamic random access memory (DRAM) devices disposed on a module, the first codeword; generating a second codeword having second data symbol fields and second check symbol fields; and communicating, with a second independent channel of the plurality of dual independent channel DRAM devices disposed on the module, the second codeword.
Example 17: The method of example 16, further comprising: based on a first value of a third codeword received via the first independent channel, correcting an error in the first value.
Example 18: The method of example 16, wherein the first codeword is generated from first values of the first data symbol fields using a first error detection and correction scheme and the second codeword is generated from second values of the second data symbol fields using a second error detection and correction scheme.
Example 19: The method of example 18, further comprising: generating a third codeword having third data symbol fields and third check symbol fields; and communicating, with the first independent channel and the second independent channel of the plurality of dual independent channel DRAM devices disposed on the module, the third codeword.
Example 20: The method of example 18, further comprising: detecting that the first independent channel has a persistent device failure; and based on detecting that the first independent channel has a persistent device failure, placing the memory controller in a mode that generates and communicates a third codeword.
The foregoing description of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and other modifications and variations may be possible in light of the above teachings. The embodiment was chosen and described in order to best explain the principles of the invention and its practical application to thereby enable others skilled in the art to best utilize the invention in various embodiments and various modifications as are suited to the particular use contemplated. It is intended that the appended claims be construed to include other alternative embodiments of the invention except insofar as limited by the prior art.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US22/34338 | 6/21/2022 | WO | |

Number | Date | Country | |
---|---|---|---|
63252237 | Oct 2021 | US | |
63214024 | Jun 2021 | US | |